Chapter 4 — The Processor

CprE 381 Computer Organization and Assembly
Level Programming, Fall 2013
Chapter 4
The Processor
Zhao Zhang
Iowa State University
Revised from original slides provided
by MKP
Week 9 Overview



Mini Project B
CPU Pipelining: Pipelined Data Path and
Control
ALU Data Hazards and Forwarding
Chapter 1 — Computer Abstractions and Technology — 2
Mini-Project B Overview
Implement single-cycle processor (SCP). There will be three parts
1. Part 1, SCPv1: Implement the nine-instruction ISA
2. Part 2, SCPv2a: Support all the instructions needed to run bubble sorting
   With coarse-level modeling of datapath elements
3. Part 3, SCPv2b: Detailed modeling of datapath elements
There is a bonus project
Project A Late Submission


Start working on Project B, ASAP
You may submit Mini-Project A late for
three weeks (with 20% late penalty)



Demo those parts that are working
Late penalty only applies to those parts that
are actually late
If you demo Project B successfully, you
don’t have to demo any late part of Project
A
Part 1: SCPv1
Implementing the nine-Instruction MIPS ISA



Memory reference: LW and SW
Arithmetic/logic: ADD, SUB, AND, OR, SLT
Branch: BEQ, J
The textbook provides almost all
implementation details



Datapath and control
The main control unit (9-bit signals w/o Jump)
The ALU control unit
Part 1: SCPv1
Use this diagram as the blueprint
for Part 1
SCPv1: Control Signals

Control signal setting for SCPv1
It is a truth table

Inst  RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp0  Jump
R-    1       0       0         1         0        0         0       1       0       0
lw    0       1       1         1         1        0         0       0       0       0
sw    X       1       X         0         0        1         0       0       0       0
beq   X       0       X         0         0        0         1       0       1       0
j     X       X       X         0         0        0         0       X       X       1
Note: “R-” means R-format
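The truth table can also be written down as a lookup structure, which is handy for checking a control unit against the expected outputs. The Python sketch below is illustrative only: the names `SIGNALS`, `CONTROL_TABLE`, and `control_signals` are assumptions made for this example and are not part of the provided VHDL files. The bit patterns are transcribed from the table, with 'X' as don't-care.

```python
# Signal order follows the columns of the SCPv1 control truth table.
SIGNALS = ("RegDst", "ALUSrc", "MemtoReg", "RegWrite", "MemRead",
           "MemWrite", "Branch", "ALUOp1", "ALUOp0", "Jump")

# One string per instruction class; 'X' marks a don't-care entry.
CONTROL_TABLE = {
    "R-format": "1001000100",
    "lw":       "0111100000",
    "sw":       "X1X0010000",
    "beq":      "X0X0001010",
    "j":        "XXX0000XX1",
}


def control_signals(inst):
    """Return a {signal name: '0' | '1' | 'X'} mapping for one instruction class."""
    row = CONTROL_TABLE[inst]
    return dict(zip(SIGNALS, row))


if __name__ == "__main__":
    for name in CONTROL_TABLE:
        print(name, control_signals(name))
```

A test bench could compare each control output against such a table, skipping the entries marked 'X'.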
SCPv1: ALU Control

Truth table for ALU Control

opcode  ALUOp  Operation         funct   ALU function      ALU control
lw      00     load word         XXXXXX  add               0010
sw      00     store word        XXXXXX  add               0010
beq     01     branch equal      XXXXXX  subtract          0110
R-type  10     add               100000  add               0010
               subtract          100010  subtract          0110
               AND               100100  AND               0000
               OR                100101  OR                0001
               set-on-less-than  101010  set-on-less-than  0111
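The two-level decoding in this table (ALUOp first, then funct for R-type) can be sketched in a few lines. This is a minimal Python model, not the VHDL `alu_ctrl` unit; the function name `alu_control` and the dictionary `R_TYPE_FUNCT` are assumptions made for illustration, while the bit patterns come from the table.

```python
# funct field -> ALU control code, used only when ALUOp = "10" (R-type).
R_TYPE_FUNCT = {
    "100000": "0010",  # add
    "100010": "0110",  # subtract
    "100100": "0000",  # AND
    "100101": "0001",  # OR
    "101010": "0111",  # set-on-less-than
}


def alu_control(alu_op, funct="XXXXXX"):
    """Map (ALUOp, funct) to the 4-bit ALU control code from the truth table."""
    if alu_op == "00":      # lw / sw: add to form the address
        return "0010"
    if alu_op == "01":      # beq: subtract to compare
        return "0110"
    if alu_op == "10":      # R-type: the funct field decides
        return R_TYPE_FUNCT[funct]
    raise ValueError("unsupported ALUOp: " + alu_op)


if __name__ == "__main__":
    print(alu_control("00"))            # lw / sw
    print(alu_control("10", "101010"))  # slt
```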
SCPv1 Fast Prototyping
You are provided with the following files
  mips32.vhd: A VHDL package
  regfile.vhd: For the register file
  register.vhd: For the PC
  alu.vhd: For the ALU
  adder.vhd: For the PC-related adders
  mem.vhd: The memory, for both instruction memory and data memory
SCPv1 Fast Prototyping
Rationale behind Part 1: Focus on the structure/organization of the CPU
  The provided components are modeled at coarse-level
  We know that efficient circuit design exists for those components: Memory, register file, ALU, adder, mux and so on
  Work out the details at a later time
Strongly Structural Modeling

Your CPU composition must be strongly
structural



No behavior modeling can be used.
No process statement.
Limited dataflow modeling (see next)
Additional requirement: Declare all
components in the architecture body of CPU

Only component instantiation, no entity
instantiation
Strongly Structural Modeling

Acceptable forms of dataflow modeling
Signal copying/splitting
opcode <= inst(31 downto 26);

Signal Merging
j_target <= PC(31 downto 28) &
j_offset & "00";

One-level of basic logic gates
taken_branch <= branch AND zero;
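To see what the three dataflow forms compute, here is a small Python model that works on plain integers. It is a sketch under stated assumptions: the function names are hypothetical, `inst` and `pc` are treated as 32-bit unsigned values, and `j_offset` is assumed to be the 26-bit jump field, so that 4 + 26 + 2 bits make up the 32-bit target.

```python
def opcode_of(inst):
    """Signal splitting: models opcode <= inst(31 downto 26)."""
    return (inst >> 26) & 0x3F


def jump_target(pc, j_offset):
    """Signal merging: models PC(31 downto 28) & j_offset & "00"."""
    return (pc & 0xF0000000) | ((j_offset & 0x03FFFFFF) << 2)


def taken_branch(branch, zero):
    """One level of logic: models branch AND zero on single bits."""
    return branch & zero


if __name__ == "__main__":
    print(hex(opcode_of(0x8C080000)))          # opcode field of an lw word
    print(hex(jump_target(0x10000008, 5)))     # upper PC bits kept, offset shifted
    print(taken_branch(1, 1), taken_branch(1, 0))
```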

Cpu.vhd

This is a partial sample
-- Control Unit
CONTROL1: control
port map (opcode, reg_dst, alu_src, mem_to_reg, …);
-- ALU Control unit
ALU_CTRL1: alu_ctrl
port map (alu_op, funct, alu_code);
-- The mux connected to the dst port of regfile
DST_MUX : mux2to1 generic map (M => 5)
port map (rt, rd, reg_dst, dst);
…
Datapath and Control Modeling


For datapath elements and control units,
you may use any modeling style (in Part 1)
The provided components all use behavior
modeling for simplicity
mips32.vhd
package MIPS32 is
-- Half Cycle Time of the clock signal
constant HCT : time := 50 ns;
-- Clock Cycle Time of the clock signal
constant CCT : time := 2 * HCT;
-- MIPS32 logic type
subtype m32_logic is std_logic;
-- MIPS32 logic vector type
subtype m32_vector is std_logic_vector;
Pre-defined constants and types to make coding
simpler and consistent
mips32.vhd
-- Word type, for …
subtype m32_word is m32_vector(31 downto 0);
-- Halfword, byte, and bit fields of varying size
subtype m32_halfword is m32_vector(15 downto 0);
subtype m32_byte is m32_vector(7 downto 0);
subtype m32_1bit is m32_logic;
subtype m32_2bits is m32_vector(1 downto 0);
subtype m32_3bits is m32_vector(2 downto 0);
…
end MIPS32;
Pre-defined types shorten the names
Alu.vhd

Why provide the ALU and the other VHDL
programs?



Your implementation might have bugs
We don’t want to fight the bugs on two fronts
You shall test those modules




Always test any modules that you will use
The provided modules have been tested
Some test-bench programs are provided
Write your own test-bench or extend the
provided test-bench
Alu.vhd
entity ALU is
port (rdata1   : in  m32_word;
      rdata2   : in  m32_word;
      alu_code : in  m32_4bits;
      result   : out m32_word;
      zero     : out m32_1bit);
end entity;
Alu.vhd
architecture behavior of ALU is
signal r : m32_word;
begin
P_ALU : process (alu_code, rdata1, rdata2)
variable code, a, b, sum, diff, slt: integer;
begin
-- Pre-calculate arithmetic results
a := to_integer(signed(rdata1));
b := to_integer(signed(rdata2));
sum := a + b;
diff := a - b;
if (a < b) then
slt := 1;
else
slt := 0;
end if;
-- Select the result, convert to signal if necessary
case (alu_code) is
when "0000" =>
-- AND
r <= rdata1 AND rdata2;
when "0010" =>
-- add
r <= std_logic_vector(to_signed(sum, 32));
…
end case;
end process;
-- Drive the alu result output
result <= r;
-- Drive the zero output
with r select
zero <= '1' when x"00000000",
'0' when others;
end behavior;
Coarse-level modeling
is easy, reliable but may
not be synthesized efficiently
Regfile.vhd
entity regfile is
port(src1   : in  m32_5bits;
     src2   : in  m32_5bits;
     dst    : in  m32_5bits;
     wdata  : in  m32_word;
     rdata1 : out m32_word;
     rdata2 : out m32_word;
     WE     : in  m32_1bit;
     reset  : in  m32_1bit;
     clock  : in  m32_1bit);
end regfile;
Caveat: The clock signal is needed in the single-cycle
implementation
Regfile.vhd
architecture behavior of regfile is
signal reg_array : m32_regval_array;
begin
-- Register reset logic
P_WRITE : process (clock)
variable r : integer;
begin
-- Write/reset logic
if (rising_edge(clock)) then
if (reset = '1') then
for i in 0 to 31 loop
reg_array(i) <= X"00000000";
end loop;
elsif (WE = '1') then
r := to_integer(unsigned(dst));
if not (r = 0) then
reg_array(r) <= wdata;
end if;
end if;
end if;
end process;
Regfile.vhd
P_READ : process (clock, src1, src2)
variable r1, r2 : integer;
begin
-- Read logic
r1 := to_integer(unsigned(src1));
r2 := to_integer(unsigned(src2));
rdata1 <= reg_array(r1);
rdata2 <= reg_array(r2);
end process;
end behavior;
Demonstration
For each of multiple test cases
  Trace the program execution
  Inspect the register and memory contents at the end of execution
Test case consists of
  MIPS binary code, e.g. in imem.txt
  Data memory content, e.g. in dmem.txt
Test Bench
Inside test bench:
CPU1 : cpu
port map (imem_addr, inst, dmem_addr, dmem_read,
dmem_write, dmem_wmask, dmem_rdata,
dmem_wdata, reset, clock);
INST_MEM : mem
generic map (mif_filename => "imem.txt")
port map (imem_addr(9 downto 2), "0000",
clock, x"00000000", '0', inst);
DATA_MEM : mem
generic map (mif_filename => "dmem.txt")
port map (dmem_addr(9 downto 2), dmem_wmask,
clock, dmem_wdata, dmem_write,
dmem_rdata);
Note: Treat memories as external datapath elements
Instruction Memory

imem.txt contents (MIF)
DEPTH=1024;
WIDTH = 32;
-- lw   $t0, 0($zero)
-- lw   $t1, 4($zero)
-- beq  $t0, $t1, +2
-- add  $t0, $t0, $t1
-- sw   $t0, 8($zero)
-- noop
CONTENT
BEGIN
-- Instruction formats
--R ======-----=====-----=====------
--I ======-----=====----------------
--J ======--------------------------
0 : 10001100000010000000000000000000;
1 : 10001100000010010000000000000100;
2 : 00010001000010010000000000000010;
3 : 00000001000010010100000000100000;
4 : 10101100000010000000000000001000;
[5..63] : 00000000;
END;
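One way to check the binary words against the assembly comments is to split each word into its MIPS fields. The Python sketch below does that for the five words listed; `WORDS` and `decode` are hypothetical names used only here, the immediate is returned as an unsigned value without sign extension, and any opcode other than 0, 2 and 3 is assumed to be I-format.

```python
# The five instruction words from imem.txt, in address order.
WORDS = [
    "10001100000010000000000000000000",  # lw  $t0, 0($zero)
    "10001100000010010000000000000100",  # lw  $t1, 4($zero)
    "00010001000010010000000000000010",  # beq $t0, $t1, +2
    "00000001000010010100000000100000",  # add $t0, $t0, $t1
    "10101100000010000000000000001000",  # sw  $t0, 8($zero)
]


def decode(word):
    """Split a 32-character binary string into MIPS instruction fields."""
    if len(word) != 32:
        raise ValueError("expected a 32-bit word")
    op = int(word[0:6], 2)
    if op == 0:  # R-format: op, rs, rt, rd, shamt, funct
        return {"format": "R", "op": op,
                "rs": int(word[6:11], 2), "rt": int(word[11:16], 2),
                "rd": int(word[16:21], 2), "shamt": int(word[21:26], 2),
                "funct": int(word[26:32], 2)}
    if op in (2, 3):  # J-format: op, 26-bit target
        return {"format": "J", "op": op, "target": int(word[6:32], 2)}
    # Everything else is treated as I-format: op, rs, rt, 16-bit immediate
    return {"format": "I", "op": op,
            "rs": int(word[6:11], 2), "rt": int(word[11:16], 2),
            "imm": int(word[16:32], 2)}


if __name__ == "__main__":
    for address, word in enumerate(WORDS):
        print(address, decode(word))
```

Register numbers 8 and 9 correspond to $t0 and $t1 in the standard MIPS register convention.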
Part 2. SCPv2 Prototyping (SCPv2a)



Support all MIPS instructions used by the
bubble sort example
We have studied how to extend the nine-instruction design to support ADDI, SLL,
BNE, and JAL
For each new instruction, think about



Datapath: Any new/revised data elements, any
new signal connections
The main control: Any new control signals, any
extension to the truth table
The ALU control: Any extension to the truth table
Part 3. SCPv2b


SCPv2 Detailed Implementation
Provide detailed modeling for
  Register file
  ALU
  Adder
Use your code from Labs 1-4 and Mini-Project A
  You may revise your code
Your final code should be strongly structural
  Consult your lab TA if you are not sure
Bonus Project Part 1

Green MIPS SCP (SCP-G)
  Extend SCPv2 to support all integer instructions listed on the green sheet
Bonus Project Part 2 is to do pipelined implementation
The lab bonus can overflow in your overall grade
  As said, quiz bonus does not overflow
  Partial credit will be given
  The grading details will be finalized
Pipelined CPU
A natural idea to improve performance
The devil is in the details
  Pipelined data path and control
  Data hazard from ALU instructions
  Data hazard from Load instructions
  Control hazard from branches
  Exception handling in pipelined processor
SCP With Jumps Added
Performance Issues

Longest delay determines clock period



Critical path: load instruction
Instruction memory → register file → ALU → data memory → register file
Now we will improve performance by
pipelining

§4.5 An Overview of Pipelining
Pipelining Analogy
Pipelined laundry: overlapping execution
  Parallelism improves performance
Four loads:
  Speedup = 8/3.5 = 2.3
Non-stop:
  Speedup = 2n/(0.5n + 1.5) ≈ 4 = number of stages
Pipeline Performance
Look at this example
  In single-cycle implementation, the critical path is 800ps (one cycle @ 1.25 GHz)
  The longest component latency is 200ps (one cycle @ 5GHz)
Note: Latency of mux, extender and so on ignored

Instr     Instr fetch  Register read  ALU op  Memory access  Register write  Total time
lw        200ps        100ps          200ps   200ps          100ps           800ps
sw        200ps        100ps          200ps   200ps                          700ps
R-format  200ps        100ps          200ps                  100ps           600ps
beq       200ps        100ps          200ps                                  500ps
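The two clock periods quoted on this slide follow directly from the component latencies. The Python sketch below recomputes them; the names `LATENCY_PS`, `STEPS`, and the helper functions are assumptions made for this illustration, and mux and extender delays are ignored as on the slide.

```python
# Component latencies in picoseconds, keyed by the table's column names.
LATENCY_PS = {
    "Instr fetch": 200,
    "Register read": 100,
    "ALU op": 200,
    "Memory access": 200,
    "Register write": 100,
}

# Which steps each instruction class actually uses.
STEPS = {
    "lw": ("Instr fetch", "Register read", "ALU op", "Memory access", "Register write"),
    "sw": ("Instr fetch", "Register read", "ALU op", "Memory access"),
    "R-format": ("Instr fetch", "Register read", "ALU op", "Register write"),
    "beq": ("Instr fetch", "Register read", "ALU op"),
}


def total_time_ps(instr):
    """Sum of the latencies of the steps one instruction class passes through."""
    return sum(LATENCY_PS[step] for step in STEPS[instr])


def single_cycle_period_ps():
    """Single-cycle clock must fit the slowest instruction (the critical path)."""
    return max(total_time_ps(instr) for instr in STEPS)


def pipelined_period_ps():
    """Pipelined clock must fit the slowest single step."""
    return max(LATENCY_PS.values())


if __name__ == "__main__":
    for instr in STEPS:
        print(instr, total_time_ps(instr), "ps")
    print("single-cycle Tc:", single_cycle_period_ps(), "ps")
    print("pipelined Tc:", pipelined_period_ps(), "ps")
```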
MIPS Pipeline Idea


If we divide the execution into stages,
clock frequency can be much faster
Five stages, one step per stage
1. IF: Instruction fetch from memory
2. ID: Instruction decode & register read
3. EX: Execute operation or calculate address
4. MEM: Access memory operand
5. WB: Write result back to register
MIPS Pipeline Idea
General idea: Split the
datapath into stages, with
critical path delay <= 1 clock cycle
Pipeline Performance
Single-cycle (Tc= 800ps)
Pipelined (Tc= 200ps)
First look at performance gain
Pipeline Speedup

If all stages are balanced
  i.e., all take the same time
  Time between instructions (pipelined)
    = Time between instructions (nonpipelined) / Number of stages
  Ideal speedup = N for N-stage pipeline
If not balanced, speedup is less
  In the example, speedup is up to 4.0
Speedup due to increased throughput
  Latency (time for each instruction) does not decrease, or even increases
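The "up to 4.0" figure can be checked numerically. This Python sketch assumes a hazard-free instruction stream, so that n instructions finish in n + 5 - 1 cycles of 200 ps on the five-stage pipeline; that timing formula and the function names are assumptions for illustration rather than something stated on the slide.

```python
STAGES = 5             # IF, ID, EX, MEM, WB
SINGLE_CYCLE_PS = 800  # clock period of the single-cycle design
PIPELINED_PS = 200     # clock period of the pipelined design


def single_cycle_time_ps(n):
    """Total time for n instructions, one 800 ps cycle each."""
    return n * SINGLE_CYCLE_PS


def pipelined_time_ps(n):
    """Total time for n instructions: fill the pipeline, then one per cycle."""
    return (n + STAGES - 1) * PIPELINED_PS


def speedup(n):
    return single_cycle_time_ps(n) / pipelined_time_ps(n)


if __name__ == "__main__":
    for n in (3, 100, 1_000_000):
        print(n, round(speedup(n), 4))
```

For three instructions this gives 2400 ps versus 1400 ps. For a long stream the ratio approaches 800/200 = 4 rather than 5, because the stages are not balanced.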
Pipelining and ISA Design

MIPS ISA designed for pipelining

All instructions are 32-bits



Easier to fetch and decode in one cycle
c.f. x86: 1- to 17-byte instructions
Few and regular instruction formats

Can decode and read registers in one step
Pipelining and ISA Design

How would you design a pipeline for this
instruction format?
Prefixes (1-4 bytes)
Opcode (1-3 bytes), required
ModR/M (1 byte)
SIB (1 byte)
Addr. Displacement (0, 1, 2, or 4 bytes)
Immediate (0, 1, 2, or 4 bytes)
ModR/M: addressing-form specifier, mixing of register numbers,
addressing modes, additional opcode bits
SIB: Second addressing byte for base-plus-index and scale-plus-index
addressing modes
Source: Intel® 64 and IA-32 Architectures Software Developer’s Manual
Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z
Pipelining and ISA Design

MIPS ISA designed for pipelining

Load/store addressing


Can calculate address in 3rd stage, access memory
in 4th stage
Alignment of memory operands

Memory access takes only one cycle
Pipelining and ISA Design

How would you design a pipeline that
works well for the following instructions?
ADD eax, ebx                 ; add with two registers
SUB ebx, 100                 ; sub with reg and const
ADD eax, [0x1000]            ; add reg and memory
ADD BYTE PTR [0x1000], 100   ; add with mem and const
SUB [esi+4*ebx], eax         ; sub with reg and mem (array)
§4.6 Pipelined Datapath and Control
MIPS Pipelined Datapath
[Figure: pipelined datapath. Right-to-left flow in the MEM and WB stages leads to hazards]
Pipeline registers

Need registers between stages

To hold information produced in previous cycle
Hazards


Situations that prevent starting the next instruction in the next cycle
Structure hazards
  A required resource is busy
Data hazard
  Need to wait for previous instruction to complete its data read/write
Control hazard
  Deciding on control action depends on previous instruction
Hazards
There are ways to handle those hazards.
Let’s ignore them for now
Assume, for now, no data dependence and
control dependence in the program
lw  $10, 20($1)
sub $11, $2, $3
add $12, $3, $4
lw  $13, 24($1)
sub $14, $5, $6
Can you design a pipeline to run the above
instructions correctly?
Hazards
Program with data dependence
sub $2, $1,$3
and $12,$2,$5
or  $13,$6,$2
add $14,$2,$2
sw  $15,100($2)
Program with control dependence
beq $1, $3, +4
addi $2, $2, 1
addi $4, $4, 1
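The data dependences in the first program can be found mechanically by tracking which instruction last wrote each register. The Python sketch below does this for the sub/and/or/add/sw sequence; the tuple encoding `(op, dest, sources)` and the name `raw_dependences` are assumptions made for this example, and control dependences (the beq program) are not modeled.

```python
# (mnemonic, destination register or None, source registers) per instruction.
PROGRAM = [
    ("sub", 2, (1, 3)),       # sub $2, $1, $3
    ("and", 12, (2, 5)),      # and $12, $2, $5
    ("or", 13, (6, 2)),       # or  $13, $6, $2
    ("add", 14, (2, 2)),      # add $14, $2, $2
    ("sw", None, (15, 2)),    # sw  $15, 100($2): reads $15 and $2, writes no register
]


def raw_dependences(program):
    """Return (producer index, consumer index, register) for each read-after-write pair."""
    deps = []
    last_writer = {}  # register number -> index of the most recent writer
    for i, (_op, dest, srcs) in enumerate(program):
        for reg in sorted(set(srcs)):
            # Register 0 ($zero) never carries a dependence.
            if reg != 0 and reg in last_writer:
                deps.append((last_writer[reg], i, reg))
        if dest is not None and dest != 0:
            last_writer[dest] = i
    return deps


if __name__ == "__main__":
    for producer, consumer, reg in raw_dependences(PROGRAM):
        print(f"instruction {consumer} reads ${reg} written by instruction {producer}")
```

Every later instruction in this sequence depends on the sub through register $2.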
Pipeline Operation

Cycle-by-cycle flow of instructions through the pipelined datapath
  “Single-clock-cycle” pipeline diagram
    Shows pipeline usage in a single cycle
    Highlight resources used
  c.f. “multi-clock-cycle” diagram
    Graph of operation over time
We’ll look at “single-clock-cycle” diagrams for load & store
IF for Load, Store, …
ID for Load, Store, …
EX for Load
MEM for Load
WB for Load
Wrong register number
Corrected Datapath for Load
EX for Store
MEM for Store
WB for Store
Multi-Cycle Pipeline Diagram

Form showing resource usage
Multi-Cycle Pipeline Diagram

Traditional form
Single-Cycle Pipeline Diagram

State of pipeline in a given cycle
Pipelined Control (Simplified)
Pipelined Control

Control signals derived from instruction

As in single-cycle implementation
Pipelined Control
Simple Pipeline Summary
The BIG Picture

Pipelining improves performance by increasing instruction throughput
  Executes multiple instructions in parallel
  Each instruction has the same latency
Subject to hazards
  Structure, data, control (will be studied)
Instruction set design affects complexity of pipeline implementation
Hazards

Situations that prevent starting the next instruction safely in the next cycle
  The simple pipeline won’t work correctly
Structure hazards
  A required resource is busy
Data hazard
  Need to wait for previous instruction to complete its data read/write
Control hazard
  Deciding on control action depends on previous instruction
Structure Hazards


Conflict for use of a resource
In MIPS pipeline with a single memory
  Load/store requires data access
  Instruction fetch would have to stall for that cycle
    Would cause a pipeline “bubble”
Hence, pipelined datapaths require separate instruction/data memories
  Or separate instruction/data caches
Data Hazards in ALU Instructions

An instruction depends on completion of data access by a previous instruction
  add $s0, $t0, $t1
  sub $t2, $s0, $t3
Consider this sequence:
  sub $2, $1,$3
  and $12,$2,$5
  or  $13,$6,$2
  add $14,$2,$2
  sw  $15,100($2)
Data Hazards in ALU Instructions

A naïve approach is to insert NOOPs to wait out the dependence
add $s0, $t0, $t1
sub $t2, $s0, $t3
Change to

add $s0, $t0, $t1
noop
noop
sub $t2, $s0, $t3
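The noop-insertion transformation can be sketched as a small pass over the instruction list. This Python example is an illustration under stated assumptions: there is no forwarding, and a consumer must sit at least three instruction slots after its producer, which matches the two noops shown above and presumes the register file is written in the first half of a cycle and read in the second half. The names `NOOP` and `insert_noops` and the tuple encoding are hypothetical.

```python
NOOP = ("noop", None, ())


def insert_noops(program, min_distance=3):
    """Pad a program with noops so each reader is >= min_distance slots after its writer."""
    out = []
    last_write = {}  # register name -> index in `out` of its most recent writer
    for op, dest, srcs in program:
        # Earliest output slot at which every source register is safe to read.
        earliest = max(
            (last_write[reg] + min_distance for reg in srcs if reg in last_write),
            default=0,
        )
        while len(out) < earliest:
            out.append(NOOP)
        out.append((op, dest, srcs))
        if dest is not None and dest != "$zero":
            last_write[dest] = len(out) - 1
    return out


if __name__ == "__main__":
    program = [
        ("add", "$s0", ("$t0", "$t1")),
        ("sub", "$t2", ("$s0", "$t3")),
    ]
    for op, _dest, _srcs in insert_noops(program):
        print(op)
```

With a different register-file timing assumption, `min_distance` would change accordingly.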
Data Hazards in ALU Instructions

Another naïve approach is to stall the 2nd instruction in the dependence
add $s0, $t0, $t1
sub $t2, $s0, $t3
Forwarding (aka Bypassing)

Use result when it is computed


Don’t wait for it to be stored in a register
Requires extra connections in the datapath
Data Hazards in ALU Instructions
Dependencies & Forwarding
Detecting the Need to Forward

Pass register numbers along pipeline
  e.g., ID/EX.RegisterRs = register number for Rs sitting in ID/EX pipeline register
ALU operand register numbers in EX stage are given by
  ID/EX.RegisterRs, ID/EX.RegisterRt
Data hazards when
1a. EX/MEM.RegisterRd = ID/EX.RegisterRs    (Fwd from EX/MEM pipeline reg)
1b. EX/MEM.RegisterRd = ID/EX.RegisterRt    (Fwd from EX/MEM pipeline reg)
2a. MEM/WB.RegisterRd = ID/EX.RegisterRs    (Fwd from MEM/WB pipeline reg)
2b. MEM/WB.RegisterRd = ID/EX.RegisterRt    (Fwd from MEM/WB pipeline reg)
Detecting the Need to Forward

But only if forwarding instruction will write
to a register!


EX/MEM.RegWrite, MEM/WB.RegWrite
And only if Rd for that instruction is not
$zero

EX/MEM.RegisterRd ≠ 0,
MEM/WB.RegisterRd ≠ 0
Forwarding Paths
Forwarding Conditions

EX hazard



if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRs))
ForwardA = 10
if (EX/MEM.RegWrite and (EX/MEM.RegisterRd ≠ 0)
and (EX/MEM.RegisterRd = ID/EX.RegisterRt))
ForwardB = 10
MEM hazard


if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRs))
ForwardA = 01
if (MEM/WB.RegWrite and (MEM/WB.RegisterRd ≠ 0)
and (MEM/WB.RegisterRd = ID/EX.RegisterRt))
ForwardB = 01
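The conditions above translate almost line by line into a small function. The Python sketch below is a behavioral model only, with two assumptions beyond what the slide states: the select value is "00" when no condition holds (operand comes from the register file), and when both the EX and MEM conditions match, the EX/MEM result is preferred because it is the more recent one. The function and parameter names are hypothetical.

```python
def forward_select(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd, id_ex_reg):
    """Return the 2-bit forwarding select for one ALU operand register."""
    select = "00"  # assumed default: no forwarding
    # MEM hazard: forward from the MEM/WB pipeline register.
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == id_ex_reg:
        select = "01"
    # EX hazard: forward from the EX/MEM pipeline register.
    # Checked last so that it wins when both conditions hold (an assumption).
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == id_ex_reg:
        select = "10"
    return select


def forwarding_unit(ex_mem_regwrite, ex_mem_rd, mem_wb_regwrite, mem_wb_rd,
                    id_ex_rs, id_ex_rt):
    """Return (ForwardA, ForwardB) for the instruction currently in EX."""
    forward_a = forward_select(ex_mem_regwrite, ex_mem_rd,
                               mem_wb_regwrite, mem_wb_rd, id_ex_rs)
    forward_b = forward_select(ex_mem_regwrite, ex_mem_rd,
                               mem_wb_regwrite, mem_wb_rd, id_ex_rt)
    return forward_a, forward_b


if __name__ == "__main__":
    # "and $12,$2,$5" in EX while "sub $2,$1,$3" sits in EX/MEM.
    print(forwarding_unit(True, 2, False, 0, id_ex_rs=2, id_ex_rt=5))
    # "or $13,$6,$2" in EX while the sub result sits in MEM/WB.
    print(forwarding_unit(True, 12, True, 2, id_ex_rs=6, id_ex_rt=2))
```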