Decode Issue Unit (DIU)
DecodeIssueUnit L7
The level 7 Decode-Issue Unit (DIU L7) is responsible for decoding instructions, renaming architectural registers to physical registers, evaluating source operand values, and issuing instructions to the appropriate pipe using issue queues to allow for superscalar issue, which is the key improvement over the L6 DIU. To support this, the rename table now has two lookup ports for each issue queue, such that operand ready status is looked up via the issue queues instead of right after decoding. The register file now also has two read ports per issue queue to allow for independent operand value reads. The issue queues support single-cycle latency bypassing if the operand becomes ready via the complete interface, as well as full queue bypassing for control XU’s to keep the single-cycle branch resolution latency.
Instruction Router for Issue Queues: SSInstRouter
The instruction router for issue queues (SSInstRouter) is responsible for directing each decoded instruction to the appropriate issue queue based on which pipes support the instruction’s micro-op and which queues have the most available capacity. This is similar to the previous InstRouter used for single-issue routing, but extended to handle the case where multiple issue queues may support the same instruction.
The router is composed of two submodules. First, one SSInstRouterUnit is
instantiated per pipe, each parameterized with the ISA subset supported by that
pipe. Each unit checks whether the incoming micro-op is compatible with its
pipe’s ISA subset using the in_subset function across all supported RISC-V
operations, producing a per-pipe iq_compat_op signal that is asserted when
the instruction is valid and the pipe supports it.
Second, the IQPicker module consumes the compatibility signals from all
router units along with the iq_avail_slots count from each issue queue. It
selects the compatible queue with the most available slots, breaking ties in
favor of the lowest-indexed pipe. The picker outputs a one-hot grant vector
(iq_val) indicating which queue the instruction should be sent to, as well
as an any_gnt signal indicating that at least one compatible queue was
found.
The top-level SSInstRouter module asserts the xfer handshake signal
only when a compatible queue is selected and that queue’s iq_rdy signal
indicates it can accept the instruction. This ensures backpressure is properly
propagated when all compatible queues are full.
In-Order Issue Queue: IssueQueueInOrder
The in-order issue queue (IssueQueueInOrder) is a circular FIFO that buffers decoded instructions and issues them to the execute stage strictly in program order once both source operands are ready. Each issue queue has its own pair of rename table lookup ports and register file read ports, enabling independent operand resolution per queue.
The queue maintains insert and dequeue pointers (ins_ptr and deq_ptr)
to manage its circular buffer of entries. On insertion, the instruction’s
decoded fields (micro-op, physical register addresses, immediate, PC, sequence
number, etc.) are stored at the insert pointer. The avail_slots output
communicates the remaining capacity to the SSInstRouter for load-balancing
decisions.
On the dequeue side, the queue looks up the source physical registers of the
head-of-queue instruction in the rename table via the rt_lookup_pending
signals. If neither source operand is pending (i.e., both are ready), the queue
asserts the dequeue handshake and drives the operand values read from the
register file onto the execute interface, along with the rest of the
instruction fields. In-order issue is enforced by only ever considering the
instruction at the dequeue pointer for issue.
The queue also supports a same-cycle bypass path: when the queue is empty and
both the insert and dequeue handshakes are active simultaneously, the incoming
instruction can bypass the entry storage entirely and be issued directly to the
execute stage, avoiding the one-cycle latency of writing to and reading from
the queue. This is particularly important for control-flow instructions (branch
and jump), where the bypass path (enabled via the p_bypass parameter) keeps
the branch resolution latency to a single cycle. When p_bypass is set, the
queue operates in a fully stateless mode, acting as a combinational
pass-through that gates the instruction based solely on operand readiness, with
no internal storage.
DecodeIssueUnit L8
The Level 8 Decode-Issue Unit (DIU L8) extends the L7 DIU with superscalar
frontend decode, processing p_num_fe_lanes instructions per cycle in
parallel. As shown in the diagram below, each frontend lane has its own
InstDecoder and ImmGen instance (wrapped in SSDecoder) that decode
the instruction word and generate the immediate value independently. The
SSRenameTableL3 allocates physical registers for all lanes simultaneously,
with inter-lane forwarding of destination register mappings to handle
intra-block dependencies (described further below). Instructions are then routed
from the p_num_fe_lanes input lanes to p_num_pipes output issue queues
via a two-stage crossbar: the SSInstMapper computes the lane-to-pipe mapping
using an iSLIP-based matching engine, and the SSDIURouter muxes instruction
data according to the mapping. Each issue queue can be either the in-order
IssueQueueInOrder used in L7 or a new out-of-order IssueQueueOOO
(described below), with its own register file read ports for independent operand
resolution.
Instruction Window FIFO: DIUFifo
The DIU L8 uses a DIUFifo to buffer fetched instruction blocks
between the fetch and decode-issue stages. The FIFO wraps a FifoBypass and
presents an instruction-window abstraction with per-lane editable head state.
Each lane in the head entry carries a 2-bit inst_status field with three
states: INVALID (2’b00), READY (2’b01), and DISPATCHED (2’b10).
The decode-issue pipeline reads the head entry each cycle and writes back
per-lane status edits: lanes that fail instruction checks are marked
INVALID, and lanes that successfully dispatch are marked DISPATCHED.
These edits are maintained in shadow registers that overlay the FIFO head entry
and reset when the head is popped or the FIFO is cleared. The head entry is
popped (and the window advances) only when every lane in the current entry is
either invalid, dispatching this cycle, or already dispatched.
The FIFO is reset on ctrl_flow_sub.redirect_val (an external redirect from
an XU) and cleared on ctrl_flow_pub_val_comb (an internal redirect from a
JAL/JALR dispatch in the current window).
Instruction Check Stages
Dispatch eligibility is evaluated through four cascaded InstCheck stages
per lane, each gating on the result of the previous stage:
S1 – Validity: Checks that the FIFO entry is valid (
F_curr[i].val), the instruction status is notINVALID, and the decoder recognizes the instruction. This is the first filter and itspassoutput gates all subsequent stages.S2 – Control instruction ordering: Enforces ordering constraints around control flow instructions (branches and jumps). The DIU scans the current window for the oldest undispatched control instruction (JAL/JALR/BRX) using wrap-around-safe sequence number age comparisons via
SSSeqAge. For BRX instructions, younger instructions in the window are blocked from dispatching until the branch’s source operands are ready and it can be dispatched. For JAL/JALR instructions, the oldest control instruction must be able to dispatch (pass S4) before younger instructions proceed. Instructions younger than the oldest control instruction are also blocked when a redirect is pending onctrl_flow_sub.S3 – Physical register allocation: Checks that a free physical register is available for allocation (via
alloc_rdy) for instructions that write a destination register. Lanes are checked in order, with each lane gating on the previous lane’s allocation success to prevent over-allocation.S4 – Structural hazard (crossbar routing): Checks that the instruction was successfully routed to an issue queue by the
SSInstMapper(i.e., thelane_valsignal is asserted). Lanes are again checked in order to preserve dispatch ordering.
The invalidate outputs from all four stages are OR-reduced per lane to
determine whether to mark a lane INVALID in the instruction window.
Control Flow Handling
Because multiple instructions are decoded simultaneously, the DIU L8 must
handle the case where the current window contains control flow instructions
alongside younger instructions that should not be issued. The DIU scans all
lanes each cycle to find the oldest undispatched control instruction
(JAL/JALR/BRX), tracking its lane index, sequence number, and whether it is a
branch (BRX) via oldest_ctrl_inst_found, oldest_ctrl_inst_idx, and
oldest_ctrl_inst_is_brx.
For JAL/JALR instructions, when the oldest control instruction dispatches, the
DIUCtrlFlowPublisher computes the jump target and publishes a redirect on
the ctrl_flow_pub interface via the ControlFlowNotif. For JAL
instructions, the target is PC + imm; for JALR, the target is
(rs1 + imm) & ~1, where rs1 is read from the control pipe’s register
file port. When branch prediction is enabled, the publisher also checks whether
the JAL was already predicted taken (via the predicted_taken flag propagated
from the fetch unit); if so, the redirect is suppressed since the fetch unit is
already fetching from the correct target, and only a bp_update_val is
asserted to confirm the prediction to the branch predictor. For unpredicted JALs
and all JALRs, redirect_val is asserted. The redirect signals are
combinational, relying on the downstream control flow arbiter to break the
combinational loop back to the fetch unit.
The ctrl_flow_pub_val_comb signal is used internally to clear the FIFO
immediately when a jump dispatches.
For BRX instructions, the DIU checks whether the oldest control instruction’s
source operands are ready by performing a rename table lookup through a
dedicated port (port index p_num_pipes). If the sources are not ready,
the S2 check blocks all instructions in the window from dispatching, ensuring
no instruction younger than an unresolved branch is speculatively issued.
When a redirect arrives on the ctrl_flow_sub interface (from an XU), the
FIFO is reset, discarding all instructions in the current window.
Instruction-to-Pipe Mapper: SSInstMapper
The instruction-to-pipe mapper (SSInstMapper) computes a one-to-one mapping
from p_num_input_lanes front-end lanes to p_num_pipes back-end pipes
using a shared ISLIPCore matching engine, which implements a modified version
of the iSLIP algorithm. The mapper replaces the single-instruction
SSInstRouter from L7 and handles the multi-lane case.
First, a compatibility matrix is computed: for each (input lane, output pipe)
pair, iq_compat_op is asserted when the instruction’s micro-op is supported
by the pipe’s ISA subset (via in_subset checks), the instruction is valid,
the queue is ready, and the queue has available slots. For pipes with
p_pipe_bypass set, compatibility additionally requires that the instruction’s
source operands are both ready (since bypass pipes issue combinationally with no
buffering).
The matching is performed by ISLIPCore over p_num_iter iterations
(defaulting to p_num_fe_lanes), each consisting of two phases:
Grant phase (age-based): Each output pipe examines all compatible, unmatched inputs and grants to the oldest one, determined by an
AgePEpriority element using wrap-around-safe age comparison via the oldest committed sequence number fromSSSeqAge.Accept phase (slot-based): Each input lane examines all outputs that granted to it and accepts the one with the most available issue queue slots, determined by a
SlotsPEpriority element. Ties are broken in favor of the lower-indexed pipe.
After each iteration, matched inputs and outputs are removed from
consideration (via input_free and output_free masks), and the next
iteration attempts to match the remaining unmatched ports. The final match
results across all iterations are OR-reduced to produce per-pipe iq_val and
route_idx signals (selecting which lane feeds each pipe) and per-lane
lane_val and lane_to_pipe_map signals (indicating which pipe the lane
was matched to, if any).
Data Router: SSDIURouter
The SSDIURouter is a pure data router that muxes per-lane instruction data
to per-pipe outputs based on the mapping produced by SSInstMapper. It
consumes the route_idx, iq_val, and lane_to_pipe_map signals from
the mapper and produces per-pipe iq_msg and iq_ins_try outputs for the
issue queues. It also computes dispatch_go per lane, which is asserted when
the lane passes all instruction checks and the target issue queue is ready.
Issue Queue Selection and Configuration
Each pipe’s issue queue type is determined by a set of configuration parameters:
``p_pipe_bypass``: When set to 1, all pipes use bypass-mode issue queues (
IssueQueueInOrderwithp_bypass=1); when 0 (the default), only the control-subset pipe (the one whose ISA subset exactly matchesp_ctrl_subset) uses bypass mode.``p_mem_subset``: Defines the memory instruction subset. Pipes whose ISA subset intersects
p_mem_subsetare automatically forced to use in-order issue queues, since there is no load-store queue in the memory execute unit.``p_all_iq_in_order``: When set, all issue queues use in-order issue regardless of pipe type.
Pipes that are neither bypass nor forced in-order use the out-of-order
IssueQueueOOO described below.
Out-of-Order Issue Queue: IssueQueueOOO
The out-of-order issue queue (IssueQueueOOO) is a drop-in alternative to
IssueQueueInOrder that dequeues the oldest ready instruction rather than
blocking on the in-order head. Source operand pending status is provided at
insert time and tracked internally; the queue monitors completion notifications
to clear pending bits, removing the need for rename-table lookups at dequeue.
The queue maintains a flat array of entries with per-entry valid bits and
per-entry source pending bits. On insertion, a free slot is selected via a
priority encoder. On dequeue, the queue computes per-entry readiness (both
sources not pending, with same-cycle completion bypass) and selects the oldest
ready entry using XOR-based wrap-around-safe age comparison against the
oldest_seq_num from SSSeqAge. An entry j is considered older than
entry i when (seq[j] < seq[i]) ^ (seq[j] < oldest) ^ (seq[i] < oldest).
The oldest ready entry is priority-encoded to produce the dequeue index.
Like IssueQueueInOrder, the OOO queue supports a same-cycle bypass path
when the queue is empty, issuing the incoming instruction directly if its
sources are ready. When p_bypass is set, the queue operates in fully
stateless bypass mode, identical to the in-order bypass behavior.
Sequence Age Tracker: SSSeqAge
The SSSeqAge module tracks the oldest committed sequence number by
monitoring the commit interface. This value is shared between the
SSInstMapper (for age-based grant priority) and the IssueQueueOOO
instances (for oldest-ready selection), providing consistent wrap-around-safe
age comparison across the decode-issue pipeline.
Register File Configuration
The p_regfile_per_pipe parameter controls register file organization. When
cleared (the default), a single shared SSRegfileL2 with p_num_pipes
lookup ports is instantiated. When set, one SSRegfileL2 per pipe is
instantiated, each with a single lookup port. Both configurations use
p_num_be_lanes write ports driven by the completion interface.
Rename Table with Allocated Destination Register Forwarding: SSRenameTableL3
The SSRenameTableL3 extends the previous rename table to support
p_num_fe_lanes simultaneous allocations with inter-lane destination
register forwarding, as shown in the first image above. The core data
structures – a 31-entry rename table mapping architectural registers to
physical registers with pending bits, and a free list tracking available
physical registers – are unchanged from the single-lane version.
For multi-lane allocation, each lane has its own PriorityEncoder that
selects the first free physical register from the free list. To prevent
multiple lanes from allocating the same physical register, lane i’s
priority encoder input filters out any physical register already selected by
lanes 0 through i-1, creating a forward dependency chain across lanes.
The key addition is destination register forwarding during source operand
lookup. When lane k looks up the physical register for a source
architectural register, it first reads the current mapping from the rename
table, then checks whether any previous lane m (where m < k) in the
same fetch block is allocating a new physical register for the same
architectural register. If so, lane k uses lane m’s newly-allocated
physical register instead of the stale table entry. This ensures that
within a single fetch block, a later instruction correctly reads from the
destination of an earlier instruction (e.g., add x1, x2, x3 followed by
add x4, x1, x5 in the same block will have the second instruction’s
x1 source point to the physical register just allocated by the first
instruction).