Processor Versions

Blimp takes an iterative design approach, enabled through its modularity. Different levels of Blimp’s microarchitectural units can be composed to form different versions of Blimp processors. This enables the design of these components to occur iteratively, adding on functionality as the complexity of the processor progressed.

Currently, 10 versions of the processor are implemented. The table below details the level of each unit that each processor version supports:

Instruction Routing/Arbitration

While specific levels of execute units may be supported at a given processor version, many levels aren’t used in the given implementation, given that they are superseced by later levels with greater support

Proc. Version	FU Level	DIU Level	XU Levels	WCU Level	SU Level
V1	1	1	1	1
V2	2	2	1, 2	2
V3	2	3	1, 2	3
V4	2	3	1, 2, 3	3
V5	3	4	1, 2, 3, 4	3
V6	3	5	1, 2, 3, 4, 5	3	1
V7	3	5	1, 2, 3, 4, 5, 6	3	1
V8	3	5	1, 2, 3, 4, 5, 6, 7	3	1
V9	4	6	1, 2, 3, 4, 5, 6, 7	4	2
V10	4	7	1, 2, 3, 4, 5, 6, 7	4	2

Version 1

The Version 1 processor is the simplest version, and aims to be the baseline needed for assembly execution with modular units. Only three instructions are supported (add, addi, and mul). The processor is single-issue with in-order issue and completion, and assumes that Execute Units are single-cycle to maintain instruction ordering.

Instruction Routing/Arbitration

To prepare for out-of-order implementations, the Level 1 DIU dynamically routes instructions to XUs based on availability, and the Level 1 WCU implements Round-Robin arbitration to select which XU to receive an instruction from. However, this isn’t relevant based on the single-cycle guarantee of XUs, and could be replaced if needed for area.

Version 2

The Version 2 processor adds support for operations with variable latency by modifying the DIU and WCU to reorder committing instructions based on a sequence number, and to resolve WAW hazards through stalling in the DIU. The FU is also modified to attach sequence numbers to instructions when fetched. To demonstrate this, a pipelined multiply unit is used as an XU to have intructions with varying latency.

Sequence Number Generation

The number of sequence numbers is a parameter for the overall processor, and represents how many instructions can be in-flight at once. This can be a fairly small value, as the number of in-flight instructions is also currently limited by processor pipeline depth. However, if the FU gets to a point where isn’t doesn’t have any more sequence numbers, it will stall until a sequence number is released through the associated instruction committing, communicated on the commit interface

A picture of the Version 2 processor composition

Version 3

The Version 3 processor is similar to the Version 2 processor, but introduces register renaming to resolve WAW hazards instead of stalling. This requires modifying the DIU to include a rename table, as well as the WCU to forward the physical register back to the DIU upon completion/commit. Because of the commit interface indicating when a physical register is freed, the DIU must now also listen to the commit interface.

Keeping track of the architectural register

Strictly-speaking, we no longer need to propagate the architectural register specifier down the pipeline, as only physical registers are used after the DIU. However, we still maintain this to support backwards compatibility along the interfaces, as well as to be able to produce an architecturally-correct instruction trace upon commit for verification.

A picture of the Version 3 processor composition

Version 4

The Version 4 processor adds support for memory operations, requiring an additional XU to communicate with a memory interface.

Version 5

The Version 5 processor adds support for unconditional control flow operations (i.e. jumps). This requires modifications to the DIU to identify and act on jump instructions through the (newly-introduced) squash interface, as well as modifications to the FU to redirect the front of the pipeline and squash all in-flight fetches. We also need an additional XU to propagate the jump instructions to commit, as well as to store the return address when necessary (ex. JALR instructions).

Version 6

The Version 6 processor adds support for conditional control flow instructions. These instructions are detected and acted on in a new XU, which now interfaces with a squahs interface. Because control flow changes can now come from either an XU or the DIU, we also need a new Squash Unit (SU) to arbitrate between control flow changes based on sequence number. This also requires a new DIU to detect when it’s being squashed from an XU.

Instruction Routing/Arbitration

Currently, our processor requires that the control flow XU acts on control flow instructions in one cycle. This simplifies the design by not allowing mistakenly-speculated instructions to get past the DIU, but isn’t as modular as possible in design. An ideal implementation (discussed in Further Work) wouldn’t impose this limitation, but would additionally require all other XUs and the WCU to listen to and act on a squash interface.

A picture of the Version 6 processor composition

Version 7

The Version 7 processor adds new XUs to support all TinyRV2 instructions, as specified in Cornell’s ECE 4750. This functionality is sufficient to run basic C code.

Version 8

The Version 8 processor eadds new XUs to support the complete RV32IM ISA. This functionality is sufficient to run complex C/C++ applications, including games.

Version 9

The Version 9 processor supports superscalar backend, by using an m-select arbiter in the WCU to select multiple completed instructions from XUs, as well as a multi-ported ROB to intake multiple instructions from the arbiter as well as commit multiple instructions. Modified versions of FU, DIU, and SU are also required to support multiple completed and committed instructions with a new sequence number generator (SeqNumGenL4), new sequence age tracker (MSeqAge), multi-rename table (MRenameTable), and multi-write-ported register file (MRegisterFile) as the complete and commit interfaces are now parameterized based on the number of backend superscalar lanes (configurable at the toplevel).

Version 10

The Version 10 processor supports superscalar issue within the decode-issue unit itself. As such, the high-level view of the processor without looking into the units is identical to v9. all changes are internal to the DIU itself. Superscalar issue is implemented using in-order issue queues accompanied by a more intelligent instruction router which routes to the issue queue with fewer enqueued instructions if multiple pipes can support the next instruction to enqueue on the issue queues. Each issue queue thus contains instructions which are waiting to execute. Since these queues are in-order, the entire queue is blocked if the next instruction to dequeue is still waiting for its source operands to be ready, which is tracked by looking up the register status in the rename table and reading the source operands when ready from the register file. Bypassing from the complete interface is also supported to allow for 1-cycle issue latency. Any pipes supporting control-flow instrutions use a “bypass” queue to keep the 1-cycle branch-resolution latency constraint.