Execute Units (XU)
Execute units sit between the decode-issue unit and the writeback-commit
unit, connected via the D__XIntf (input) and X__WIntf (output)
interfaces. Each execute unit receives decoded instructions with resolved
operands from an issue queue, performs its computation, and drives the
result to the writeback stage. Execute units are specialized by function:
arithmetic/logic, control flow, memory, and multiply/divide/remainder.
Most execute units include a small input bypass FIFO at the D__XIntf
input (parameterized by p_d_intf_fifo_depth) that decouples the issue
queue from the execute datapath. The FIFO pops when the downstream
X__WIntf is ready and pushes when a valid instruction arrives from the
issue queue. Each unit also carries physical register specifiers (preg,
ppreg) through the pipeline for the writeback-commit unit to use during
commit and free-list reclamation.
Arithmetic/Logic Unit: ALU
The ALU is a single-cycle combinational execute unit that handles all
integer arithmetic and logic operations: ADD, SUB, AND,
OR, XOR, SLT, SLTU, SRA, SRL, SLL, LUI,
and AUIPC. The input bypass FIFO buffers instructions from the issue
queue; on each cycle that the FIFO is non-empty and the writeback
interface is ready, the head instruction’s operands are fed into a
combinational case-select that produces the result. AUIPC adds the PC
to the immediate; LUI passes the immediate directly. The write-enable
(wen) is always asserted since all ALU instructions write a
destination register.
Zeppelin instantiates p_num_alus copies of the ALU in parallel; each
gets its own pipe slot in the decode-issue crossbar and its own
D__XIntf/X__WIntf.
Control Flow Execute Unit: ControlFlowUnit
The control-flow execute unit resolves conditional branch instructions
(BEQ, BNE, BLT, BGE, BLTU, BGEU) and publishes
redirect notifications via the ControlFlowNotif interface. JAL and
JALR are handled in the decode-issue unit (by the DIUCtrlFlowPublisher),
so they never reach this unit.
The unit includes a bypass FIFO at the D__XIntf input and full support
for branch prediction. It evaluates the branch condition by comparing
op1 and op2, then uses the predicted_taken field (propagated
from the fetch unit through the issue queue) to detect mispredictions. A
redirect is published only when the actual outcome differs from the
prediction (should_branch != predicted_taken). The unit also asserts
bp_update_val on every conditional branch (regardless of whether it
mispredicted) so that the branch predictor can update its state. The
taken signal carries the actual branch direction, and target
carries PC + imm (the branch destination); the redirect consumer uses
taken to determine whether to redirect to the target or the
fall-through address.
The control-flow pipe is always configured in bypass-mode at the issue queue stage (the issue queue is purely combinational), so a branch resolves on the cycle after decode. This single-cycle latency is what guarantees no speculatively-fetched instructions can get past the DIU before the redirect arrives, simplifying the rest of the pipeline (no XU/WCU squashing required).
Load-Store Unit: LoadStoreUnit
The load-store unit (LoadStoreUnit) is a two-stage pipelined execute
unit that handles all memory operations (LB, LH, LW, LBU,
LHU, SB, SH, SW) via the MemIntf interface.
Stage 1 (Request): The input bypass FIFO holds decoded memory
instructions. When the head instruction is valid, the memory interface is
ready, and the stage-2 FIFO has space, the unit computes the effective
address (op1 + op2), word-aligns it, and issues the memory request.
For sub-word operations, the byte offset within the word is used to shift
the store data and compute the appropriate byte strobe mask (strb).
The byte offset and instruction metadata (sequence number, write address,
micro-op, physical registers) are pushed into a stage-2 FIFO to accompany
the memory response when it returns.
Stage 2 (Response): A stage2_fifo (depth rounded to the next
power of 2 from p_num_in_flight) buffers the metadata for all
in-flight requests. A stage-2 register holds the metadata for the
currently-awaited response. When the memory response arrives, the unit
uses the stored byte offset to extract and shift the correct bytes from
the response data, then applies sign- or zero-extension based on the
micro-op (sign-extend for LB/LH, zero-extend for LBU/LHU, full word for
LW). Store operations drive wen = 0 since they do not write a
destination register.
The stage-2 FIFO supports a bypass path: when the FIFO is empty and a push fires simultaneously, the metadata is routed directly into the stage-2 register on the next cycle, avoiding the extra latency of writing to and reading from the FIFO array. This preserves low latency for the common single-in-flight case while still supporting multiple outstanding requests via the FIFO when traffic backs up.
Because there is no load-store queue in this unit, the memory pipe’s issue queue is forced to the in-order variant so that memory ordering is preserved at issue.
Multiply/Divide/Remainder Units
The RV32M extension operations (MUL, MULH, MULHSU, MULHU,
DIV, DIVU, REM, REMU) are handled by one of three
interchangeable units, all sharing the same D__XIntf/X__WIntf
interface. Which variant is instantiated is selected by the toplevel
p_muldivrem_sel parameter, and p_num_muls copies are instantiated
in parallel.
IterativeMulDivRem (p_muldivrem_sel = 0)
The fully iterative unit uses a multi-cycle state machine with five
states: IDLE, SWAP_SIGN, CALC, RESTORE_SIGN, and
DONE. On each CALC cycle, the IterativeMulDivRemStep submodule
advances the computation by one iteration:
Multiplication: Shift-and-add algorithm.
ashifts left,bshifts right; when the LSB ofbis set,ais accumulated into the result. Completes whenbreaches zero.Division/Remainder: Restoring division. On each step, if
|a| >= bthe divisor is subtracted and the quotient bit is accumulated. Theb_shiftregister tracks the current quotient bit position, shifting right each cycle until zero. For remainder, the result is simply the final value ofa.
Signed operations use a SWAP_SIGN state to negate op2 when it is
negative (for DIV/REM), and a RESTORE_SIGN state to correct the
result sign afterward. The unit handles divide-by-zero (returns
all-ones for DIV/DIVU, dividend for REM/REMU) and signed overflow
(-2^31 / -1) as special cases. The D.rdy signal deasserts while
the computation is in progress, stalling the issue queue.
SingleCycleMulIterativeDivRem (p_muldivrem_sel = 1)
The hybrid unit combines a single-cycle multiplier for
MUL/MULH/MULHSU/MULHU with an iterative divider (IterativeDivRemStep)
for DIV/DIVU/REM/REMU. Multiplications complete in one cycle with full
throughput, while divisions iterate over multiple cycles using the same
restoring-division algorithm as the fully iterative variant. This
provides the best area/timing trade-off: the critical multiplier path is
short, and the slower division operations (which are less frequent in
typical code) amortize over multiple cycles. This is the default
selection for the configurations evaluated in the project report.
SingleCycleMulDivRem (p_muldivrem_sel = 2)
The fully single-cycle unit computes all MUL/DIV/REM operations in one
cycle using a combinational datapath. It shares a single hardware
multiplier between multiplication and remainder computation: for MUL
variants, the multiplier computes op1 * op2 directly with appropriate
sign extension; for REM variants, the multiplier computes
quotient * divisor so that the remainder can be derived as
dividend - product. Division uses a combinational / operator. The
same divide-by-zero and overflow edge cases are handled. This variant has
the highest throughput (one result per cycle) but the longest
combinational path and thus the strictest timing constraint.