Pipelining RISC-V
Review

• Controller
  • Tells universal datapath how to execute each instruction

• Instruction timing
  • Set by instruction complexity, architecture, technology
  • Pipelining increases clock frequency, “instructions per second”
    • But does not reduce time to complete instruction

• Performance measures
  • Different measures depending on objective
    • Response time
    • Jobs / second
    • Energy per task
Pipelining Overview (Review)

- Pipelining doesn’t help *latency* of single task, it helps *throughput* of entire workload
- Multiple tasks operating simultaneously using different resources
- Potential speedup = Number pipe stages
- Time to “fill” pipeline and time to “drain” it reduces speedup: 2.3X v. 4X in this example
  - With lots of laundry, approaches 4X
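
The fill-and-drain arithmetic behind the 2.3X figure can be sketched as follows. This is a minimal illustration that assumes every stage takes one equal time unit; the function name `pipeline_speedup` is made up for this example.

```python
def pipeline_speedup(n_tasks: int, n_stages: int) -> float:
    """Speedup of an ideal pipeline over running each task start to finish.

    Assumption: every stage takes exactly one time unit.
    """
    sequential_time = n_tasks * n_stages
    # The first task needs n_stages units to fill the pipeline;
    # every later task then completes one unit after the previous one.
    pipelined_time = n_stages + (n_tasks - 1)
    return sequential_time / pipelined_time


for loads in (4, 40, 4000):
    print(loads, "loads:", round(pipeline_speedup(loads, 4), 2), "x")
```

With 4 loads and 4 stages the ratio is 16/7, which rounds to 2.3; as the number of loads grows, the ratio approaches the stage count.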
Why Nick's Ph.D. Was An Incredibly Stupid Idea...

• My Ph.D. was on a highly pipelined FPGA architecture
  • FPGA -> Field Programmable Gate Array: Basically programmable hardware
  • The design was centered around being able to pipeline multiple independent tasks
    • We will see later how to handle "pipeline hazards" and "forwarding":
      ```
      add s0, s1, s2
      add s3, s0, s4
      ```
    • This is critical to get real performance gains
    • But my dissertation design didn't have this ability

• I also showed how you could use the existing registers in the FPGA to heavily pipeline it automatically
But pipelining is *not* free!

- Not only does pipelining not improve latency...
  - It actually makes it worse!

- Two sources:
  - Unbalanced pipeline stages
  - The setup & clk->q time for the pipeline registers

- A design that pipelines only independent tasks also can't "forward" results between them

- So *independent* task pipelining is only about reducing cost
  - You can always just duplicate logic instead

- *Latency is fundamental*; *independent task throughput* can always be solved by throwing $$$ at the problem

- So I proved my Ph.D. design was *no better* than the conventional FPGA on throughput/$ and far far far far worse on latency!
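
The two latency costs named above (unbalanced stages and pipeline register overhead) can be made concrete with a small calculation. This is a sketch only: the stage delays and the setup and clk->q times below are hypothetical numbers chosen for illustration, and `pipelined_timing` is not a name from the lecture.

```python
def pipelined_timing(stage_delays_ps, t_setup_ps, t_clk_q_ps):
    """Return (cycle time, total latency) in ps for a pipelined block."""
    # The clock period must cover the slowest stage (imbalance cost)
    # plus the register overhead (setup + clk->q).
    t_cycle = max(stage_delays_ps) + t_setup_ps + t_clk_q_ps
    latency = t_cycle * len(stage_delays_ps)
    return t_cycle, latency


# Hypothetical logic block split into three unbalanced stages.
STAGES_PS = [400, 300, 200]
unpipelined_latency_ps = sum(STAGES_PS)
t_cycle_ps, pipelined_latency_ps = pipelined_timing(STAGES_PS, t_setup_ps=20, t_clk_q_ps=30)

print("unpipelined latency:", unpipelined_latency_ps, "ps")
print("pipelined cycle:", t_cycle_ps, "ps, latency:", pipelined_latency_ps, "ps")
```

Under these assumed numbers the pipelined design finishes one task per 450 ps instead of one per 900 ps, but each individual task takes 1350 ps instead of 900 ps.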
# Pipelining with RISC-V

<table>
<thead>
<tr>
<th>Phase</th>
<th>Pictogram</th>
<th>$t_{step}$ Serial</th>
<th>$t_{cycle}$ Pipelined</th>
</tr>
</thead>
<tbody>
<tr>
<td>Instruction Fetch</td>
<td><img src="image" alt="IM" /></td>
<td>200 ps</td>
<td>200 ps</td>
</tr>
<tr>
<td>Register Read</td>
<td><img src="image" alt="Reg" /></td>
<td>100 ps</td>
<td>200 ps</td>
</tr>
<tr>
<td>ALU</td>
<td><img src="image" alt="ALU" /></td>
<td>200 ps</td>
<td>200 ps</td>
</tr>
<tr>
<td>Memory</td>
<td><img src="image" alt="DM" /></td>
<td>200 ps</td>
<td>200 ps</td>
</tr>
<tr>
<td>Register Write</td>
<td><img src="image" alt="Reg" /></td>
<td>100 ps</td>
<td>200 ps</td>
</tr>
</tbody>
</table>

Instruction sequence:
- add t0, t1, t2
- or t3, t4, t5
- sll t6, t0, t3

$\textbf{t}_{\text{instruction}}$:
- 800 ps (serial)
- 1000 ps (pipelined)
Pipelining with RISC-V

```
add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
```

### Single Cycle vs. Pipelining

<table>
<thead>
<tr>
<th></th>
<th>Single Cycle</th>
<th>Pipelining</th>
</tr>
</thead>
<tbody>
<tr>
<td><strong>Timing</strong></td>
<td>$t_{step} = 100 \ldots 200$ ps</td>
<td>$t_{cycle} = 200$ ps</td>
</tr>
<tr>
<td><strong>Register access</strong></td>
<td>Only 100 ps</td>
<td>All cycles same length</td>
</tr>
<tr>
<td><strong>Instruction time</strong></td>
<td>$t_{instruction} = t_{cycle} = 800$ ps</td>
<td>1000 ps</td>
</tr>
<tr>
<td><strong>CPI (Cycles Per Instruction)</strong></td>
<td>$\sim 1$ (ideal)</td>
<td>$\sim 1$ (ideal), $&gt; 1$ (actual)</td>
</tr>
<tr>
<td><strong>Clock rate</strong></td>
<td>$1/800$ ps = 1.25 GHz</td>
<td>$1/200$ ps = 5 GHz</td>
</tr>
<tr>
<td><strong>Relative speed</strong></td>
<td>1 x</td>
<td>4 x</td>
</tr>
</tbody>
</table>
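
The table's numbers follow directly from the per-phase delays. The sketch below recomputes them, assuming (as the table does) that register overhead is ignored; the helper names are illustrative only.

```python
STAGE_DELAYS_PS = {
    "Instruction Fetch": 200,
    "Register Read": 100,
    "ALU": 200,
    "Memory": 200,
    "Register Write": 100,
}


def clock_rate_ghz(period_ps: float) -> float:
    """Convert a clock period in picoseconds to a frequency in GHz."""
    return 1000.0 / period_ps


# Single cycle: one long clock period covers all five phases.
single_cycle_period_ps = sum(STAGE_DELAYS_PS.values())
# Pipelined: every stage is stretched to the slowest phase.
pipelined_period_ps = max(STAGE_DELAYS_PS.values())
pipelined_instruction_ps = pipelined_period_ps * len(STAGE_DELAYS_PS)
relative_speed = single_cycle_period_ps / pipelined_period_ps

print("single cycle:", single_cycle_period_ps, "ps,", clock_rate_ghz(single_cycle_period_ps), "GHz")
print("pipelined:", pipelined_period_ps, "ps,", clock_rate_ghz(pipelined_period_ps), "GHz")
print("pipelined instruction time:", pipelined_instruction_ps, "ps")
print("relative speed:", relative_speed, "x")
```

The 4x figure is a throughput ratio at an ideal CPI of 1; the time for one instruction grows from 800 ps to 1000 ps.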
Sequential vs Simultaneous

What happens sequentially, what happens simultaneously?

add t0, t1, t2
or t3, t4, t5
sll t6, t0, t3
sw t0, 4(t3)
lw t0, 8(t3)
addi t2, t2, 1

$t_{instruction} = 1000 \text{ ps}$

$t_{cycle} = 200 \text{ ps}$
RISC-V Pipeline

Instruction sequence:

- add t0, t1, t2
- or t3, t4, t5
- slt t6, t0, t3
- sw t0, 4(t3)
- lw t0, 8(t3)
- addi t2, t2, 1

$t_{cycle} = 200$ ps
$t_{instruction} = 1000$ ps
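
One way to read the pipeline diagram is as a simple rule: in an ideal pipeline with no hazards, instruction number i occupies stage number (cycle - i). The sketch below applies that rule to the sequence above. The function names are invented for this example, and stalls are deliberately ignored.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

PROGRAM = [
    "add t0, t1, t2",
    "or t3, t4, t5",
    "slt t6, t0, t3",
    "sw t0, 4(t3)",
    "lw t0, 8(t3)",
    "addi t2, t2, 1",
]


def stage_in_cycle(instr_index, cycle):
    """Stage occupied by an instruction in a given cycle (both 0-based), or None."""
    s = cycle - instr_index
    return STAGES[s] if 0 <= s < len(STAGES) else None


def resource_use(program, cycle):
    """What happens simultaneously: map each busy stage to its instruction."""
    busy = {}
    for i, instr in enumerate(program):
        stage = stage_in_cycle(i, cycle)
        if stage is not None:
            busy[stage] = instr
    return busy


def instruction_use(program, instr_index):
    """What happens sequentially: the cycles one instruction spends in each stage."""
    return {stage: instr_index + s for s, stage in enumerate(STAGES)}


print(resource_use(PROGRAM, 4))
print(instruction_use(PROGRAM, 0))
```

Looking across one cycle gives the resource view; following one instruction across cycles gives the instruction view.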

[Figure: pipeline diagram showing instruction use over time and resource use in a particular time slot]
Single Cycle Datapath

1. Instruction Fetch
2. Decode/Register Read
3. Execute
4. Memory
5. Write Back

PC → instruction memory → registers → ALU → Data memory → back to PC
Pipeline registers

- Need registers between stages
- To hold information produced in previous cycle
More Detailed Pipeline
IF for Load, Store, …
ID for Load, Store, ...
EX for Load
MEM for Load
WB for Load – Oops!

Wrong register number!
Corrected Datapath for Load
Recalculate PC+4 in M stage to avoid sending both PC and PC+4 down pipeline.

Must pipeline instruction along with data, so control operates correctly in each stage.
Each stage operates on different instruction

Pipeline registers separate stages, hold data for each instruction in flight
Pipelined Control

- Control signals derived from instruction
- As in single-cycle implementation
- Information is stored in pipeline registers for use by later stages
Administrivia

- Reminder: Project Partays…
  - Friday, March 15th, 5-7pm 405 Soda
  - Thursday, March 21st, 7-9pm 540AB Cory
- 1 on 1s still available tomorrow
  - Sign up ASAP
- Midterm Survey next week:
  - Tell us what is good and what needs improvement!
Hazards Ahead
Pipelining Hazards

- A hazard is a situation that prevents starting the next instruction in the next clock cycle

- **Structural** hazard
  - A required resource is busy (e.g. needed in multiple stages)

- **Data** hazard
  - Data dependency between instructions
  - Need to wait for previous instruction to complete its data read/write

- **Control** hazard
  - Flow of execution depends on previous instruction
Structural Hazard

• **Problem:** Two or more instructions in the pipeline compete for access to a single physical resource

• **Solution 1:** Instructions take it in turns to use resource, some instructions have to stall

• **Solution 2:** Add more hardware to machine

• Can always solve a structural hazard by adding more hardware
Regfile Structural Hazards

- **Each instruction:**
  - can read up to two operands in decode stage
  - can write one value in writeback stage
- **Avoid structural hazard by having separate “ports”**
  - two independent read ports and one independent write port
- **Three accesses per cycle can happen simultaneously**
Structural Hazard: Memory Access

- Instruction and data memory used simultaneously
  - Use two separate memories

Instruction sequence:

- add t0, t1, t2
- or t3, t4, t5
- slt t6, t0, t3
- sw t0, 4(t3)
- lw t0, 8(t3)
Instruction and Data Caches
Structural Hazards – Summary

- Conflict for use of a resource
- In RISC-V pipeline with a single memory
  - Load/store requires data access
  - Without separate memories, instruction fetch would have to stall for that cycle
    - All other operations in pipeline would have to wait
- Pipelined datapaths require separate instruction/data memories
  - Or separate instruction/data caches
- RISC ISAs (including RISC-V) designed to avoid structural hazards
  - e.g. at most one memory access/instruction
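
To see where a single shared memory would cause trouble, the sketch below lists the cycles in which one instruction's MEM stage coincides with another instruction's fetch. It assumes an ideal 5-stage pipeline with no stalls (so the timing of later instructions is not shifted), and the program is a hypothetical sequence that starts with a load so that a conflict occurs inside it.

```python
MEMORY_OPCODES = {"lw", "sw"}


def memory_conflicts(program):
    """List (cycle, mem_instr, fetch_instr) where MEM and IF need memory at once.

    Assumption: instruction i is in IF during cycle i and in MEM during
    cycle i + 3 (cycles counted from 0), with one shared memory port.
    """
    conflicts = []
    for i, instr in enumerate(program):
        if instr.split()[0] in MEMORY_OPCODES:
            cycle = i + 3
            if cycle < len(program):  # another instruction is being fetched then
                conflicts.append((cycle, instr, program[cycle]))
    return conflicts


# Hypothetical program, for illustration only.
PROGRAM = [
    "lw t0, 8(t3)",
    "add t1, t1, t2",
    "or t4, t4, t5",
    "slt t6, t1, t4",
    "sw t0, 4(t3)",
]

for cycle, mem_instr, fetch_instr in memory_conflicts(PROGRAM):
    print(f"cycle {cycle}: '{mem_instr}' in MEM vs fetch of '{fetch_instr}'")
```

With separate instruction and data memories (or caches), both accesses proceed in the same cycle and the list of conflicts is irrelevant.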
Data Hazard: Register Access

- Separate ports, but what if a register is written and read in the same cycle?
- Does `sw` in the example fetch the old or new value?

`add t0, t1, t2`

`or t3, t4, t5`

`slt t6, t0, t3`

`sw t0, 4(t3)`

`lw t0, 8(t3)`
Register Access Policy

- Exploit high speed of register file (100 ps)
  1) WB updates value
  2) ID reads new value
- Indicated in diagram by shading

Instruction sequence:

- add t0, t1, t2
- or t3, t4, t5
- slt t6, t0, t3
- sw t0, 4(t3)
- lw t0, 8(t3)

*Might not always be possible to write then read in same cycle, especially in high-frequency designs. Check assumptions in any question.*
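
A behavioral sketch of the write-then-read policy is shown below. The `RegisterFile` class is a hypothetical model, not a hardware description: it simply orders the WB write before the ID reads within one call that stands for one clock cycle.

```python
class RegisterFile:
    """Toy model of a register file with two read ports and one write port."""

    def __init__(self):
        self.regs = [0] * 32

    def cycle(self, write_rd, write_value, read_rs1, read_rs2):
        """Model one clock cycle: WB updates first, then ID reads."""
        # First part of the cycle: writeback (writes to x0 are ignored).
        if write_rd is not None and write_rd != 0:
            self.regs[write_rd] = write_value
        # Second part of the cycle: decode reads see the value just written.
        return self.regs[read_rs1], self.regs[read_rs2]


rf = RegisterFile()
rf.regs[5] = 5                      # old value of x5 (t0)
print(rf.cycle(write_rd=5, write_value=9, read_rs1=5, read_rs2=0))
```

If a design cannot write and then read within one cycle, the read would return the old value and the reader would need a stall or an extra forwarding path.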
Data Hazard: ALU Result

Value of s0

| 5 | 5 | 5 | 5 | 5/9 | 9 | 9 | 9 | 9 | 9 |

add s0, t0, t1

sub t2, s0, t0

or t6, s0, t3

xor t5, t1, s0

sw s0, 8(t3)

Without some fix, sub and or will calculate wrong result!
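
The claim that `sub` and `or` see the stale value can be checked with cycle arithmetic. The sketch below assumes an ideal 5-stage pipeline with no stalls or forwarding, where instruction i reads registers in cycle i + 1 and writes back in cycle i + 4. The tuple encoding of instructions is a convenience for this example, and later overwrites of the destination are not modeled.

```python
# Each instruction: (name, destination or None, source registers).
PROGRAM = [
    ("add", "s0", ("t0", "t1")),
    ("sub", "t2", ("s0", "t0")),
    ("or", "t6", ("s0", "t3")),
    ("xor", "t5", ("t1", "s0")),
    ("sw", None, ("s0", "t3")),
]


def stale_readers(program, producer_index, write_then_read=True):
    """Names of later instructions that read the producer's old register value."""
    dest = program[producer_index][1]
    wb_cycle = producer_index + 4          # IF, ID, EX, MEM, WB
    stale = []
    for j in range(producer_index + 1, len(program)):
        name, _, sources = program[j]
        id_cycle = j + 1                   # register read happens in ID
        if write_then_read:
            sees_new = id_cycle >= wb_cycle
        else:
            sees_new = id_cycle > wb_cycle
        if dest in sources and not sees_new:
            stale.append(name)
    return stale


print(stale_readers(PROGRAM, 0))
print(stale_readers(PROGRAM, 0, write_then_read=False))
```

With the write-then-read register file, `xor` reads in the same cycle that `add` writes back and gets the new value; without that policy it would also be wrong.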
Solution 1: Stalling

- Problem: Instruction depends on result from previous instruction
  - add s0, t0, t1
  - sub t2, s0, t3

- Bubble:
  - effectively NOP: affected pipeline stages do “nothing”
Stalls and Performance

- Stalls reduce performance
  - But stalls may be required to get correct results
- Compiler can rearrange code or insert NOPs (writes to register x0) to avoid hazards and stalls
  - Requires knowledge of the pipeline structure
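
To put a number on the cost of stalling, the sketch below counts bubbles for an in-order 5-stage pipeline. It assumes no forwarding, a write-then-read register file, and that a dependent instruction waits in ID until its sources have been written back. `count_stalls` and `effective_cpi` are hypothetical helper names.

```python
def count_stalls(program):
    """Total stall cycles. program: list of (rd or None, source registers)."""
    last_writer_wb = {}   # register -> cycle in which its newest value is written
    prev_id = 0           # ID cycle of the previous instruction
    stalls = 0
    for rd, sources in program:
        earliest = prev_id + 1                      # no-hazard ID cycle
        ready = max([last_writer_wb.get(s, 0) for s in sources], default=0)
        id_cycle = max(earliest, ready)             # wait until sources are written
        stalls += id_cycle - earliest
        if rd is not None and rd != "x0":
            last_writer_wb[rd] = id_cycle + 3       # ID -> EX -> MEM -> WB
        prev_id = id_cycle
    return stalls


def effective_cpi(program):
    """Cycles per instruction, ignoring pipeline fill and drain."""
    return (len(program) + count_stalls(program)) / len(program)


back_to_back = [("s0", ("t0", "t1")), ("t2", ("s0", "t3"))]
separated = [
    ("s0", ("t0", "t1")),
    ("t4", ("t5", "t6")),    # independent work moved in between
    ("a0", ("a1", "a2")),
    ("t2", ("s0", "t3")),
]
print(count_stalls(back_to_back), effective_cpi(back_to_back))
print(count_stalls(separated), effective_cpi(separated))
```

The second sequence shows what a compiler buys by rearranging code: the same dependency, but enough independent instructions in between that no bubble is needed.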
Solution 2: Forwarding

Value of t0

add t0, t1, t2
or t3, t0, t5
sub t6, t0, t3
xor t5, t1, t0
sw t0, 8(t3)

Forwarding: grab operand from pipeline stage, rather than register file
Forwarding (aka Bypassing)

- Use result when it is computed
- Don’t wait for it to be stored in a register
- Requires extra connections in the datapath

Program execution order (in instructions)
- add s0, t0, t1
- sub t2, s0, t3

Diagram showing the stages of execution:
- IF
- ID
- EX
- MEM
- WB
Detect Need for Forwarding
(example)

add t0, t1, t2

or t3, t0, t5

sub t6, t0, t3

Compare destination of older instructions in pipeline with sources of new instruction in decode stage. Must ignore writes to x0!
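
The comparison described above can be sketched as a small selection function. This is an illustrative model, not the course's datapath: `InFlight`, `select_operand`, and the labels it returns are made-up names, and only two older instructions are considered.

```python
from typing import NamedTuple, Optional


class InFlight(NamedTuple):
    rd: int           # destination register number
    reg_write: bool   # True if the instruction writes a register


def select_operand(rs: int, newer: Optional[InFlight], older: Optional[InFlight]) -> str:
    """Pick where source register rs should come from.

    newer: the instruction immediately ahead in the pipeline (ALU result).
    older: the instruction two ahead (value in the MEM/WB register).
    """
    if rs == 0:
        return "REGFILE"              # x0 is always zero: ignore writes to it
    if newer is not None and newer.reg_write and newer.rd == rs:
        return "ALU"                  # the most recent producer wins
    if older is not None and older.reg_write and older.rd == rs:
        return "MEM/WB"
    return "REGFILE"


# add t0, t1, t2 ; or t3, t0, t5 ; sub t6, t0, t3   (t0 = x5, t3 = x28)
add_instr = InFlight(rd=5, reg_write=True)
or_instr = InFlight(rd=28, reg_write=True)
print("or  rs1 (t0):", select_operand(5, add_instr, None))
print("sub rs1 (t0):", select_operand(5, or_instr, add_instr))
print("sub rs2 (t3):", select_operand(28, or_instr, add_instr))
```

Checking the newer instruction first matters when both older instructions write the same register: the later write is the architecturally correct value.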
Forwarding Path for RA to ALU
Actual Forwarding Path Location…

- We forward with two muxes just after the RS1x and RS2x pipeline registers
  - That is, on the outputs of the register reads
- Each mux selects either the register value, the output of the ALU M register, or the output of the MEM/WB register
Agenda

• Hazards
  • Structural
  • Data
    • R-type instructions
    • Load
  • Control
• Superscalar processors
Load Data Hazard

Annotations on the pipeline diagram: 1-cycle stall unavoidable; forward; unaffected
Stall Pipeline

- **lw $2, 20($1)**
- **and $4, $2, $5** (becomes nop)
- **or $8, $2, $6**
- **add $9, $4, $2**

**Stall:** repeat the `and` instruction and forward
Data Hazard

- Slot after a load is called a **load delay slot**
- If that instruction uses the result of the load, then the hardware will stall for one cycle
- Equivalent to inserting an explicit **nop** in the slot
  - except the latter uses more code space
- Performance loss

**Idea:**

- Put unrelated instruction into load delay slot
- No performance loss!
Code Scheduling to Avoid Stalls

- Reorder code to avoid use of load result in the next instr!
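
As an illustration of the reordering idea, the sketch below counts load-use stalls for one sequence in two different orders. The instruction sequence is a hypothetical example (it is not taken from the slide), and the model assumes full forwarding, so the only stall is the one cycle after a load whose result is used by the very next instruction.

```python
def load_use_stalls(program):
    """Count 1-cycle stalls. program: list of (opcode, rd or None, sources)."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        # Stall when the instruction right after a lw reads the loaded register.
        if prev[0] == "lw" and prev[1] in curr[2]:
            stalls += 1
    return stalls


def total_cycles(program, n_stages=5):
    """Cycles to run the program, including pipeline fill and stalls."""
    return len(program) + (n_stages - 1) + load_use_stalls(program)


# Hypothetical code: two sums built from three loaded values.
ORIGINAL = [
    ("lw", "t1", ("t0",)),
    ("lw", "t2", ("t0",)),
    ("add", "t3", ("t1", "t2")),   # uses t2 right after its load
    ("sw", None, ("t3", "t0")),
    ("lw", "t4", ("t0",)),
    ("add", "t5", ("t1", "t4")),   # uses t4 right after its load
    ("sw", None, ("t5", "t0")),
]

# Same instructions, with the third load hoisted above the first add.
REORDERED = [ORIGINAL[0], ORIGINAL[1], ORIGINAL[4], ORIGINAL[2],
             ORIGINAL[3], ORIGINAL[5], ORIGINAL[6]]

print("original: ", load_use_stalls(ORIGINAL), "stalls,", total_cycles(ORIGINAL), "cycles")
print("reordered:", load_use_stalls(REORDERED), "stalls,", total_cycles(REORDERED), "cycles")
```

The reordering is only legal because the hoisted load does not depend on the instructions it moves past and they do not depend on it.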

Original Order:

Agenda

- Hazards
  - Structural
  - Data
    - R-type instructions
    - Load
  - **Control**
- Superscalar processors
Control Hazards

beq t0, t1, label
sub t2, s0, t5
or t6, s0, t3
xor t5, t1, s0
sw s0, 8(t3)

- `sub` and `or`: executed regardless of branch outcome!
- PC updated reflecting branch outcome
Observation

- If branch not taken, then instructions fetched sequentially after branch are correct
- If branch or jump taken, then need to flush incorrect instructions from pipeline by converting to NOPs
Kill Instructions after Branch if Taken

beq t0, t1, label
sub t2, s0, t5
or t6, s0, t3
label: xxxxxx

- Taken branch
- Convert to NOP
- Convert to NOP
- PC updated reflecting branch outcome
Reducing Branch Penalties

- Every taken branch in simple pipeline costs 2 dead cycles.
- To improve performance, use “branch prediction” to guess which way branch will go earlier in pipeline.
- Only flush pipeline if branch prediction was incorrect.
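
The effect of the 2-cycle penalty on average CPI can be estimated with a one-line model. The sketch below assumes a base CPI of 1 and no other hazards; the branch fraction, taken rate, and misprediction rate are hypothetical inputs, not measurements.

```python
def cpi_with_branches(branch_fraction, flush_fraction, penalty_cycles=2):
    """Average CPI when a fraction of branches flush the pipeline.

    branch_fraction: share of instructions that are branches.
    flush_fraction: share of branches that cause a flush
                    (taken branches with no prediction, or mispredictions).
    """
    return 1.0 + branch_fraction * flush_fraction * penalty_cycles


# Hypothetical workload: 20% branches.
no_prediction = cpi_with_branches(0.20, flush_fraction=0.50)    # half are taken
with_prediction = cpi_with_branches(0.20, flush_fraction=0.10)  # 10% mispredicted

print("CPI without prediction:", round(no_prediction, 3))
print("CPI with prediction:   ", round(with_prediction, 3))
```

The model makes the design goal visible: prediction does not shrink the penalty itself, it shrinks how often the penalty is paid.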
Branch Prediction

beq t0, t1, label

label: ..... 

.....

Taken branch

Guess next PC!

Check guess correct
Agenda

- Hazards
  - Structural
  - Data
    - R-type instructions
    - Load
  - Control
- Superscalar processors
Increasing Processor Performance

1. Clock rate
   • Limited by technology and power dissipation

2. Pipelining
   • “Overlap” instruction execution
   • Deeper pipeline: 5 => 10 => 15 stages
     • Less work per stage → shorter clock cycle
     • But more potential for hazards (CPI > 1)

3. Multi-issue “superscalar” processor
Superscalar Processor

- Multiple issue “superscalar”
  - Replicate pipeline stages ⇒ multiple pipelines
  - Start multiple instructions per clock cycle
  - CPI < 1, so use Instructions Per Cycle (IPC)
  - E.g., 4GHz 4-way multiple-issue
    - 16 BIPS, peak CPI = 0.25, peak IPC = 4
  - Dependencies reduce this in practice

- “Out-of-Order” execution
  - Reorder instructions dynamically in hardware to reduce impact of hazards

- CS152 discusses these techniques!
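
The peak figures in the 4 GHz example come from simple multiplication, sketched below. `peak_throughput` is an illustrative helper; it reports the upper bound only, since dependencies reduce the achieved rate in practice.

```python
def peak_throughput(clock_ghz, issue_width):
    """Return (peak BIPS, peak CPI, peak IPC) for a multiple-issue processor."""
    ipc = issue_width              # at best, every issue slot is filled each cycle
    cpi = 1 / ipc
    bips = clock_ghz * ipc         # billions of instructions per second
    return bips, cpi, ipc


bips, cpi, ipc = peak_throughput(clock_ghz=4, issue_width=4)
print("peak BIPS:", bips, "peak CPI:", cpi, "peak IPC:", ipc)
```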
Out Of Order Superscalar Processor

In-order issue

Out-of-order execute

In-order commit

P&H p. 340
Benchmark: CPI of Intel Core i7

CPI = 1

CPI of Intel Core i7 920 running SPEC2006 integer benchmarks.
Pipelining and ISA Design

• RISC-V ISA designed for pipelining
  • All instructions are 32 bits wide in the RV32 ISA
    • Easy to fetch and decode in one cycle
      • Extensions add 16-bit and 64-bit instructions, but the first bytes alone tell you which length it is
    • Versus x86: 1- to 15-byte instructions
  • Few and regular instruction formats
    • Decode and read registers in one step
  • Load/store addressing
    • Calculate address in 3rd stage, access memory in 4th stage
  • Alignment of memory operands
    • Memory access takes only one cycle
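
The point about telling the instruction length from its first bytes can be sketched as below. This follows the length-encoding convention described in the RISC-V unprivileged specification as I understand it (low bits of the first 16-bit parcel); treat the encodings longer than 32 bits as an assumption, and note that `instruction_length_bits` is a made-up helper name.

```python
def instruction_length_bits(first_parcel: int) -> int:
    """Instruction length implied by the low bits of its first 16-bit parcel."""
    if first_parcel & 0b11 != 0b11:
        return 16                                  # compressed instruction
    if first_parcel & 0b11100 != 0b11100:
        return 32                                  # standard RV32/RV64 instruction
    if first_parcel & 0b111111 == 0b011111:
        return 48
    if first_parcel & 0b1111111 == 0b0111111:
        return 64
    raise ValueError("longer or reserved encoding")


# 0x00000013 is the usual nop (addi x0, x0, 0); its low parcel is 0x0013.
print(instruction_length_bits(0x0013))
# 0x0001 is the compressed nop.
print(instruction_length_bits(0x0001))
```

Because only a few low bits are examined, the fetch stage can decide how far to advance the PC without decoding the rest of the instruction, in contrast to x86's 1- to 15-byte encodings.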
In Conclusion

- Pipelining increases throughput by overlapping execution of multiple instructions
- All pipeline stages have same duration
  - Choose partition that accommodates this constraint
- Hazards potentially limit performance
  - Maximizing performance requires programmer/compiler assistance
- Superscalar processors use multiple execution units for additional instruction level parallelism
  - Performance benefit highly code dependent