Document 13548506

advertisement
1
Single-Cycle Processors:
Datapath & Control
Arvind
Computer Science & Artificial Intelligence Lab
M.I.T.
Based on the material prepared by
Arvind and Krste Asanovic
Instruction Set Architecture (ISA)
versus Implementation
6.823 L5- 2
Arvind
• ISA is the hardware/software interface
– Defines set of programmer visible state
– Defines instruction format (bit encoding) and instruction
semantics
– Examples: MIPS, x86, IBM 360, JVM
• Many possible implementations of one ISA
– 360 implementations: model 30 (c. 1964), z900 (c. 2001)
– x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486,
Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon,
Transmeta Crusoe, SoftPC
– MIPS implementations: R2000, R4000, R10000, ...
– JVM: HotSpot, PicoJava, ARM Jazelle, ...
September 26, 2005
6.823 L5- 3
Arvind
Processor Performance
Time
Program
=
Instructions
Program *
Cycles
Instruction
*
Time
Cycle
– Instructions per program depends on source code, compiler
technology, and ISA
– Cycles per instructions (CPI) depends upon the ISA and the
microarchitecture
– Time per cycle depends upon the microarchitecture and the
base technology
this lecture
September 26, 2005
Microarchitecture
CPI
cycle time
Microcoded
>1
short
Single-cycle unpipelined
1
long
Pipelined
1
short
6.823 L5- 4
Arvind
Microarchitecture:
status
lines
Implementation of an ISA
Controller
control
points
Data
path
Structure: How components are connected.
Static
Behavior: How data moves between components
Dynamic
September 26, 2005
Hardware Elements
• Combinational circuits
OpSelect
– Mux, Demux, Decoder, ALU, ...
A0
A1
An-1
...
Mux
O
A
lg(n)
Demux
Sel
lg(n)
...
O0
O1
A
lg(n)
On-
Decoder
Sel
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...
...
O0
O1
A
On-1
B
Result
ALU
Comp?
1
• Synchronous state elements
– Flipflop, Register, Register file, SRAM, DRAM
D
En
Clk
ff
Q
register
Clk
En
D
Q
En
Clk
D1
D2
ff
ff
ff ...
Q0
Q1
Q2
Edge-triggered: Data is sampled at the rising edge
September 26, 2005
...
D0
...
Dn-1
ff
Qn-1
6.823 L5- 6
Arvind
Register Files
Clock WE
ReadSel1
ReadSel2
WriteSel
WriteData
ws clk
rs1
rs2
ws
wd
we
Register
file
2R+1W
ReadData1
ReadData2
rd1
rd2
rs1
wd
32
5
5
register 1
…
…
we
…
register 0
32
rd1
32
rs2
32
32
…
register 31
5
32
rd2
• No timing issues in reading a selected register
• Register files with a large number of ports are difficult
to design
– Intel’s Itanium, GPR File has 128 registers with 8 read ports and
4 write ports!!!
September 26, 2005
6.823 L5- 7
Arvind
A Simple Memory Model
WriteEnable
Clock
Address
WriteData
MAGIC
RAM
ReadData
Reads and writes are always completed in one cycle
• a Read can be done any time (i.e. combinational)
• a Write is performed at the rising clock edge
if it is enabled
⇒ the write address and data
must be stable at the clock edge
Later in the course we will present a more realistic
model of memory
September 26, 2005
6.823 L5- 8
Arvind
Implementing MIPS:
Single-cycle per instruction
datapath & control logic
September 26, 2005
6.823 L5- 9
Arvind
The MIPS ISA
Processor State
32 32-bit GPRs, R0 always contains a 0
32 single precision FPRs, may also be viewed as
16 double precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers
Data types
8-bit byte, 16-bit half word 32-bit word for integers
32-bit word for single precision floating point
64-bit word for double precision floating point
Load/Store style instruction set
data addressing modes- immediate & indexed
branch addressing modes- PC relative & register indirect
Byte addressable memory- big endian mode
All instructions are 32 bits
September 26, 2005
6.823 L5- 10
Arvind
Instruction Execution
Execution of an instruction involves
1.
2.
3.
4.
5.
instruction fetch
decode and register fetch
ALU operation
memory operation (optional)
write back
and the computation of the address of the
next instruction
September 26, 2005
6.823 L5- 11
Arvind
Datapath: Reg-Reg ALU Instructions
RegWrite
0x4
clk
Add
inst<25:21>
inst<20:16>
PC
clk
addr
inst<15:11>
inst
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
ALU
z
GPRs
inst<5:0>
ALU
Control
OpCode
6
0
31
26 25
September 26, 2005
5
rs
5
rt
21 20
5
rd
16 15
5
0
11
RegWrite Timing?
6
func
5
rd ← (rs) func (rt)
0
6.823 L5- 12
Arvind
Datapath: Reg-Imm ALU Instructions
RegWrite
0x4
clk
Add
inst<25:21>
PC
clk
addr
inst<20:16>
inst
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
GPRs
inst<15:0>
OpCode
31
26 25
September 26, 2005
5
rs
5
rt
2120
ALU
Control
ExtSel
16
immediate
16 15
z
Imm
Ext
inst<31:26>
6
opcode
ALU
rt ← (rs) op immediate
0
6.823 L5- 13
Arvind
Conflicts in Merging Datapath
RegWrite
0x4
Add
we
rs1
rs2
rd1
ws
wd rd2
inst<25:21>
PC
clk
addr
inst<20:16>
inst<15:11>
inst
Inst.
Memory
inst<15:0>
September 26, 2005
5
rs
5
rt
rs
rt
z
Imm
Ext
ALU
Control
ExtSel
OpCode
opcode
ALU
GPRs
inst<31:26>
inst<5:0>
6
0
Introduce
muxes
clk
5
rd
5
0
immediate
6
func
rd ← (rs) func (rt)
rt ← (rs) op immediate
6.823 L5- 14
Arvind
Datapath for ALU Instructions
RegWrite
0x4
clk
Add
PC
clk
addr
we
rs1
rs2
rd1
ws
wd rd2
<25:21>
<20:16>
inst
<15:11>
Inst.
Memory
GPRs
<15:0>
6
0
opcode
September 26, 2005
5
rs
5
rt
rs
rt
z
Imm
Ext
<31:26>, <5:0>
OpCode
ALU
RegDst
rt / rd
5
rd
ALU
Control
ExtSel
5
0
immediate
OpSel
6
func
BSrc
Reg / Imm
rd ← (rs) func (rt)
rt ← (rs) op immediate
6.823 L5- 15
Arvind
Datapath for Memory Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
at some level the two memories have
to be the same
Princeton style: the same (von Neumann’s influence)
- A Load or Store instruction requires
accessing the memory more than once
during its execution
September 26, 2005
6.823 L5- 16
Arvind
Load/Store Instructions:Harvard Datapath
RegWrite
0x4
we
rs1
rs2
rd1
ws
wd rd2
“base”
addr
inst
Inst.
Memory
clk
WBSrc
ALU / Mem
clk
Add
PC
MemWrite
clk
ALU
GPRs
z
rdata
Data
Memory
Imm
Ext
disp
we
addr
wdata
ALU
Control
OpCode RegDst
6
opcode
31
26 25
5
rs
ExtSel
5
rt
21 20
OpSel
BSrc
16
displacement
16 15
addressing mode
(rs) + displacement
0
rs is the base register
rt is the destination of a Load or the source for a Store
September 26, 2005
6.823 L5- 17
Arvind
MIPS Control Instructions
Conditional (on GPR) PC-relative branch
6
opcode
5
rs
5
16
offset
BEQZ, BNEZ
Unconditional register-indirect jumps
6
opcode
5
rs
5
16
JR, JALR
Unconditional absolute jumps
6
opcode
26
target
J, JAL
• PC-relative branches add offset×4 to PC+4 to calculate the
target address (offset is in words): ±128 KB range
• Absolute jumps append target×4 to PC<31:28> to calculate
the target address: 256 MB range
• jump-&-link stores PC+4 into the link register (R31)
• All Control Transfers are delayed by 1 instruction
we will worry about the branch delay slot later
September 26, 2005
6.823 L5- 18
Arvind
Conditional Branches (BEQZ, BNEZ)
PCSrc
br
MemWrite
RegWrite
pc+4
0x4
Add
Add
clk
PC
clk
addr
we
rs1
rs2
rd1
ws
wd rd2
inst
Inst.
Memory
clk
we
addr
ALU
GPRs
z
Imm
Ext
wdata
ALU
Control
OpCode RegDst
September 26, 2005
ExtSel
rdata
Data
Memory
OpSel
BSrc
zero?
WBSrc
6.823 L5- 19
Arvind
Register-Indirect Jumps (JR)
PCSrc
br
rind
RegWrite
MemWrite
pc+4
0x4
Add
Add
clk
PC
clk
addr
we
rs1
rs2
rd1
ws
wd rd2
inst
Inst.
Memory
clk
we
addr
ALU
GPRs
z
Imm
Ext
wdata
ALU
Control
OpCode RegDst
September 26, 2005
ExtSel
rdata
Data
Memory
OpSel
BSrc
zero?
WBSrc
6.823 L5- 20
Arvind
Register-Indirect Jump-&-Link (JALR)
PCSrc
br
rind
RegWrite
MemWrite
pc+4
0x4
Add
Add
clk
PC
clk
addr
inst
31
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
clk
we
addr
ALU
GPRs
z
Imm
Ext
wdata
ALU
Control
OpCode RegDst
September 26, 2005
ExtSel
rdata
Data
Memory
OpSel
BSrc
zero?
WBSrc
6.823 L5- 21
Arvind
Absolute Jumps (J, JAL)
PCSrc
br
rind
jabs
pc+4
RegWrite
MemWrite
0x4
Add
Add
clk
PC
clk
addr
inst
31
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
clk
we
addr
ALU
GPRs
z
Imm
Ext
wdata
ALU
Control
OpCode RegDst
September 26, 2005
ExtSel
rdata
Data
Memory
OpSel
BSrc
zero?
WBSrc
6.823 L5- 22
Arvind
Harvard-Style Datapath for MIPS
PCSrc
br
rind
jabs
pc+4
RegWrite
MemWrite
0x4
Add
Add
clk
PC
clk
addr
inst
31
Inst.
Memory
we
rs1
rs2
rd1
ws
wd rd2
clk
we
addr
ALU
GPRs
z
Imm
Ext
wdata
ALU
Control
OpCode RegDst
September 26, 2005
ExtSel
rdata
Data
Memory
OpSel
BSrc
zero?
WBSrc
23
Five-minute break to stretch your legs
Single-Cycle Hardwired Control:
6.823 L5- 24
Arvind
Harvard architecture
We will assume
• clock period is sufficiently long for all of the following steps to be “completed”:
1.
2.
3.
4.
5.
instruction fetch
decode and register fetch
ALU operation
data fetch if required
register write-back setup time
⇒
tC > tIFetch + tRFetch + tALU+ tDMem+ tRWB
• At the rising edge of the following clock, the PC,
the register file and the memory are updated
September 26, 2005
6.823 L5- 25
Arvind
Hardwired Control is pure
Combinational Logic
ExtSel
BSrc
op code
zero?
OpSel
combinational
logic
MemWrite
WBSrc
RegDst
RegWrite
PCSrc
September 26, 2005
6.823 L5- 26
Arvind
ALU Control & Immediate Extension
Inst<5:0> (Func)
Inst<31:26> (Opcode)
ALUop
+
0?
OpSel
( Func, Op, +, 0? )
Decode Map
ExtSel
( sExt16, uExt16,
High16)
September 26, 2005
6.823 L5- 27
Arvind
Hardwired Control Table
Opcode
ExtSel
ALU
BSrc
OpSel
MemW
RegW
WBSrc
RegDst
PCSrc
SW
*
sExt16
uExt16
sExt16
sExt16
Reg
Imm
Imm
Imm
Imm
Func
Op
Op
+
+
no
no
no
no
yes
yes
yes
yes
yes
no
ALU
ALU
ALU
Mem
*
rd
rt
rt
rt
*
pc+4
pc+4
pc+4
pc+4
pc+4
BEQZz=0
sExt16
*
0?
no
no
*
*
br
BEQZz=1
sExt16
*
*
*
*
*
no
no
no
no
no
*
*
*
*
pc+4
jabs
*
*
*
*
0?
*
*
*
*
yes
no
yes
PC
*
PC
R31
*
R31
jabs
rind
rind
ALUi
ALUiu
LW
J
JAL
JR
JALR
BSrc = Reg / Imm
RegDst = rt / rd / R31
September 26, 2005
no
no
WBSrc = ALU / Mem / PC
PCSrc = pc+4 / br / rind / jabs
6.823 L5- 28
Arvind
Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with CPI=1
• Next, add pipeline registers to reduce cycle
time while maintaining CPI=1
September 26, 2005
6.823 L5- 29
Arvind
Pipelined Datapath
0x4
Add
PC
addr
rdata
Inst.
Memory
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
Imm
Ext
ALU
we
addr
rdata
Data
Memory
wdata
write
fetch
decode & Reg-fetch
execute
memory
-back
phase
phase
phase
phase
phase
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
September 26, 2005
6.823 L5- 30
Arvind
An Ideal Pipeline
stage
1
stage
2
stage
3
stage
4
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline
is not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines.
But can an instruction pipeline satisfy the last
condition?
September 26, 2005
How to divide the datapath
into stages
6.823 L5- 31
Arvind
Suppose memory is significantly slower than
other stages. In particular, suppose
tIM
tDM
tALU
tRF
tRW
=
=
=
=
=
10 units
10 units
5 units
1 unit
1 unit
Since the slowest stage determines the clock, it
may be possible to combine some stages without
any loss of performance
September 26, 2005
6.823 L5- 32
Arvind
Alternative Pipelining
0x4
Add
PC
addr
rdata
Inst.
Memory
fetch
phase
IR
we
rs1
rs2
rd1
ws
wd rd2
GPRs
ALU
we
addr
rdata
Data
Memory
Imm
Ext
wdata
decode & Reg-fetch
phase
execute
phase
memory
phase
tCC > max {tIM
, tRF
, t ALU,, ttDM
,, ttRW
}}
RW
IM
RF+tALU
DM+t
RW}
=
= ttDM
DM
DM+ tRW
write
-back
phase
⇒ increase the critical path by 10%
Write-back stage takes much less time than other stages.
Suppose we combined it with the memory phase
September 26, 2005
6.823 L5- 33
Arvind
Maximum Speedup by Pipelining
Assumptions Unpipelined
Pipelined Speedup
1. tIM = tDM = 10, tALU = 5, tRF = tRW= 1
4-stage pipeline tC tC
27
10
2.7
2. tIM =tDM = tALU = tRF = tRW = 5
4-stage pipeline
25
10
2.5
3. tIM =tDM = tALU = tRF = tRW = 5
5-stage pipeline
25
5
5.0
It is possible to achieve higher speedup with more
stages in the pipeline.
September 26, 2005
34
Thank you !
Download