CS 152 Computer Architecture and Engineering Krste Asanovic

CS 152 Computer Architecture and
Engineering
Lecture 3 - From CISC to RISC
Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
Instruction Set Architecture (ISA) versus
Implementation
• ISA is the hardware/software interface
– Defines set of programmer visible state
– Defines instruction format (bit encoding) and instruction
semantics
– Examples: MIPS, x86, IBM 360, JVM
• Many possible implementations of one ISA
– 360 implementations: model 30 (c. 1964), z990 (c. 2004)
– x86 implementations: 8086 (c. 1978), 80186, 286, 386, 486,
Pentium, Pentium Pro, Pentium-4 (c. 2000), AMD Athlon,
Transmeta Crusoe, SoftPC
– MIPS implementations: R2000, R4000, R10000, ...
– JVM: HotSpot, PicoJava, ARM Jazelle, ...
1/29/2009
CS152-Spring’09
2
Styles of ISA
• Accumulator
• Stack
• GPR
• CISC
• RISC
• VLIW
• Vector
• Boundaries are fuzzy, and hybrids are common
– E.g., 8086/87 is hybrid accumulator-GPR-stack ISA
– Many ISAs have added vector extensions
Styles of Implementation
• Microcoded
• Unpipelined single cycle
• Hardwired in-order pipeline
• Out-of-order pipeline with speculative execution and
register renaming
• Software interpreter
• Binary translator
• Just-in-Time compiler
Last Time in Lecture 2
• Stack machines popular to simplify High-Level
Language (HLL) implementation
– Algol-68 & Burroughs B5000, Forth machines, Occam & Transputers,
Java VMs & Java Interpreters
• General-purpose register machines provide greater
efficiency with better compiler technology (or assembly
coding)
– Compilers can explicitly manage fastest level of memory hierarchy
(registers)
• Microcoding was a straightforward methodical way to
implement machines with low gate count
A Bus-based Datapath for MIPS
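[Figure: bus-based MIPS datapath. A single 32-bit bus connects IR, the immediate extender (Imm Ext), the A and B registers feeding the ALU, the register file (32 GPRs + PC ...), and MA/Memory. Register select inputs: 32(PC), 31(Link), rd, rt, rs. Control signals: ldIR, OpSel (2 bits), ldA, ldB, ExtSel (2 bits), enImm, enALU, RegSel (3 bits), RegWrt, enReg, ldMA, MemWrt, enMem. Status signals: Opcode, zero?, busy.]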
Microinstruction: register to register transfer (17 control signals)
MA ← PC means RegSel = PC; enReg = yes; ldMA = yes
B ← Reg[rt] means RegSel = rt; enReg = yes; ldB = yes
MIPS Microcontroller: first attempt
[Figure: microcontroller. The Opcode (6 bits), zero?, and Busy (memory) inputs, together with the µPC (state, s bits), form the address of the µProgram ROM; the ROM data supplies the next state and the 17 control signals. Latching the inputs may cause a one-cycle delay.]
ROM size? = 2^(opcode+status+s) words
Word size? = control+s bits
How big is “s”?
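The two sizing formulas can be checked numerically. The sketch below assumes a 6-bit opcode, 2 status bits (zero? and busy), and the 17 control signals of the bus-based datapath; the state width s = 6 is a made-up value for illustration, not a figure from the lecture.

```python
def control_rom_size(opcode_bits, status_bits, state_bits, control_bits):
    """Return (words, bits_per_word, total_bits) for the first-attempt controller."""
    words = 2 ** (opcode_bits + status_bits + state_bits)  # ROM height
    word_bits = control_bits + state_bits                  # ROM width
    return words, word_bits, words * word_bits


# s = 6 is a hypothetical state width chosen only for this example.
words, word_bits, total_bits = control_rom_size(
    opcode_bits=6, status_bits=2, state_bits=6, control_bits=17
)
print(words, word_bits, total_bits)
```

Adding one more input bit doubles `words`, which is the motivation for shrinking the ROM height before anything else.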
Reducing Control Store Size
Control store has to be fast ⇒ expensive
• Reduce the ROM height (= address bits)
– reduce inputs by extra external logic
each input bit doubles the size of the
control store
– reduce states by grouping opcodes
find common sequences of actions
– condense input status bits
combine all exceptions into one, i.e.,
exception/no-exception
• Reduce the ROM width
– restrict the next-state encoding
Next, Dispatch on opcode, Wait for memory, ...
– encode control signals (vertical microcode)
MIPS Controller V2
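[Figure: MIPS controller V2. The Opcode passes through an input encoding (ext, op-group), which reduces ROM height. The µPC (state) addresses the Control ROM, whose data gives the 17 control signals and a µJumpType. Jump logic combines µJumpType with zero and busy to drive µPCSrc, which selects the next µPC from µPC, µPC+1, absolute, and op-group. This next-state encoding reduces ROM width.]
µJumpType = next | spin | fetch | dispatch | feqz | fnez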
Jump Logic
µPCSrc = Case µJumpTypes
  next      ⇒  µPC+1
  spin      ⇒  if (busy) then µPC else µPC+1
  fetch     ⇒  absolute
  dispatch  ⇒  op-group
  feqz      ⇒  if (zero) then absolute else µPC+1
  fnez      ⇒  if (zero) then µPC+1 else absolute
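The jump logic case table can be written out as a small function. This is a minimal sketch: the function name `next_upc` and the use of strings for the jump types are choices made for this example, and `absolute` and `op_group` stand for whatever target addresses the controller supplies.

```python
def next_upc(jump_type, upc, busy, zero, absolute, op_group):
    """Select the next microprogram counter value (the PCSrc mux on the slide)."""
    if jump_type == "next":
        return upc + 1
    if jump_type == "spin":
        return upc if busy else upc + 1
    if jump_type == "fetch":
        return absolute
    if jump_type == "dispatch":
        return op_group
    if jump_type == "feqz":
        return absolute if zero else upc + 1
    if jump_type == "fnez":
        return upc + 1 if zero else absolute
    raise ValueError(f"unknown jump type: {jump_type}")
```

A `spin` state holds its own address while memory is busy, so a slow memory stretches the instruction without any extra microcode.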
Instruction Fetch & ALU: MIPS-Controller-2
State     Control points           next-state
fetch0    MA ← PC                  next
fetch1    IR ← Memory              spin
fetch2    A ← PC                   next
fetch3    PC ← A + 4               dispatch
...
ALU0      A ← Reg[rs]              next
ALU1      B ← Reg[rt]              next
ALU2      Reg[rd] ← func(A,B)      fetch

ALUi0     A ← Reg[rs]              next
ALUi1     B ← sExt16(Imm)          next
ALUi2     Reg[rd] ← Op(A,B)        fetch
Load & Store: MIPS-Controller-2
State     Control points           next-state
LW0       A ← Reg[rs]              next
LW1       B ← sExt16(Imm)          next
LW2       MA ← A+B                 next
LW3       Reg[rt] ← Memory         spin
LW4                                fetch

SW0       A ← Reg[rs]              next
SW1       B ← sExt16(Imm)          next
SW2       MA ← A+B                 next
SW3       Memory ← Reg[rt]         spin
SW4                                fetch
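To see how many cycles a microcoded load costs, the fetch and LW microcode rows can be walked one state at a time. The sketch below is a simplification: it assumes memory holds busy for a fixed number of extra cycles on every `spin` state, and `busy_cycles` is a hypothetical parameter, not something the lecture specifies.

```python
# Rows are (state, control points, next-state type), copied from the
# instruction fetch and load microcode tables.
FETCH = [
    ("fetch0", "MA <- PC", "next"),
    ("fetch1", "IR <- Memory", "spin"),
    ("fetch2", "A <- PC", "next"),
    ("fetch3", "PC <- A + 4", "dispatch"),
]
LW = [
    ("LW0", "A <- Reg[rs]", "next"),
    ("LW1", "B <- sExt16(Imm)", "next"),
    ("LW2", "MA <- A+B", "next"),
    ("LW3", "Reg[rt] <- Memory", "spin"),
    ("LW4", "", "fetch"),
]


def count_cycles(rows, busy_cycles):
    """One cycle per state, plus busy_cycles extra in every spin state."""
    cycles = 0
    for _state, _control, jump_type in rows:
        cycles += 1
        if jump_type == "spin":
            cycles += busy_cycles  # assumed fixed memory wait
    return cycles


print(count_cycles(FETCH + LW, busy_cycles=0))
print(count_cycles(FETCH + LW, busy_cycles=2))
```

Even with a memory that never stalls, the load takes nine microcycles (four for fetch, five for LW), which is why microcoded machines have a CPI well above 1.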
Branches: MIPS-Controller-2
State     Control points           next-state
BEQZ0     A ← Reg[rs]              next
BEQZ1                              fnez
BEQZ2     A ← PC                   next
BEQZ3     B ← sExt16(Imm<<2)       next
BEQZ4     PC ← A+B                 fetch

BNEZ0     A ← Reg[rs]              next
BNEZ1                              feqz
BNEZ2     A ← PC                   next
BNEZ3     B ← sExt16(Imm<<2)       next
BNEZ4     PC ← A+B                 fetch
Jumps: MIPS-Controller-2
State     Control points           next-state
J0        A ← PC                   next
J1        B ← IR                   next
J2        PC ← JumpTarg(A,B)       fetch

JR0       A ← Reg[rs]              next
JR1       PC ← A                   fetch

JAL0      A ← PC                   next
JAL1      Reg[31] ← A              next
JAL2      B ← IR                   next
JAL3      PC ← JumpTarg(A,B)       fetch

JALR0     A ← PC                   next
JALR1     B ← Reg[rs]              next
JALR2     Reg[31] ← A              next
JALR3     PC ← B                   fetch
Implementing Complex Instructions
[Figure: the bus-based MIPS datapath (IR, Imm Ext, A, B, ALU, 32 GPRs + PC, MA, Memory on a shared 32-bit bus) with its control signals ldIR, OpSel, ldA, ldB, ExtSel, enImm, enALU, RegSel, RegWrt, enReg, ldMA, MemWrt, enMem.]
rd ← M[(rs)] op (rt)              Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)            Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]      Mem-Mem ALU op
Mem-Mem ALU Instructions:
MIPS-Controller-2
Mem-Mem ALU op      M[(rd)] ← M[(rs)] op M[(rt)]
ALUMM0    MA ← Reg[rs]             next
ALUMM1    A ← Memory               spin
ALUMM2    MA ← Reg[rt]             next
ALUMM3    B ← Memory               spin
ALUMM4    MA ← Reg[rd]             next
ALUMM5    Memory ← func(A,B)       spin
ALUMM6                             fetch
Complex instructions usually do not require datapath
modifications in a microprogrammed implementation
-- only extra space for the control program
Implementing these instructions using a hardwired
controller is difficult without datapath modifications
Performance Issues
Microprogrammed control
⇒ multiple cycles per instruction
Cycle time?
tC > max(treg-reg, tALU, tROM)
Suppose 10 * tROM < tRAM
Good performance, relative to a single-cycle
hardwired implementation, can be achieved
even with a CPI of 10
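A quick worked example of this argument follows. It is a sketch under two assumptions that the slide does not state explicitly: the microcoded cycle time is taken to be tROM, and the single-cycle hardwired machine is taken to need at least one main-memory access (tRAM) per cycle. The latencies are hypothetical values chosen so that 10 * tROM < tRAM.

```python
def time_per_instruction(cpi, cycle_time_ns):
    """Average time per instruction = CPI * cycle time."""
    return cpi * cycle_time_ns


# Hypothetical latencies in nanoseconds.
t_rom_ns = 50
t_ram_ns = 750

microcoded = time_per_instruction(cpi=10, cycle_time_ns=t_rom_ns)
single_cycle = time_per_instruction(cpi=1, cycle_time_ns=t_ram_ns)
print(microcoded, single_cycle)
```

With these numbers the microcoded machine spends 500 ns per instruction against 750 ns for the single-cycle design, despite a CPI of 10.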
Horizontal vs Vertical Code
Bits per Instruction
# Instructions
• Horizontal code has wider instructions
– Multiple parallel operations per instruction
– Fewer steps per macroinstruction
– Sparser encoding  more bits
• Vertical code has narrower instructions
– Typically a single datapath operation per instruction
– separate instruction for branches
– More steps to per macroinstruction
– More compact  less bits
• Nanocoding
– Tries to combine best of horizontal and vertical code
1/29/2009
CS152-Spring’09
18
Nanocoding
Exploits recurring control signal patterns in µcode, e.g.,
  ALU0   A ← Reg[rs]
  ...
  ALUi0  A ← Reg[rs]
  ...
[Figure: the µPC (state) addresses the µcode ROM, which outputs the µcode next-state and a nanoaddress; the nanoaddress indexes the nanoinstruction ROM, whose data drives the control signals.]
• MC68000 had 17-bit µcode containing either 10-bit µjump or
9-bit nanoinstruction pointer
– Nanoinstructions were 68 bits wide, decoded to give 196
control signals
Microprogramming in IBM 360
                          M30     M40     M50     M65
Datapath width (bits)     8       16      32      64
µinst width (bits)        50      52      85      87
µcode size (K µinsts)     4       4       2.75    2.75
µstore technology         CCROS   TCROS   BCROS   BCROS
µstore cycle (ns)         750     625     500     200
memory cycle (ns)         1500    2500    2000    750
Rental fee ($K/month)     4       7       15      35
Only the fastest models (75 and 95) were hardwired
Microcode Emulation
• IBM initially miscalculated the importance of software
compatibility with earlier models when introducing the
360 series
• Honeywell stole some IBM 1401 customers by offering
translation software (“Liberator”) for Honeywell H200
series machine
• IBM retaliated with optional additional microcode for 360
series that could emulate IBM 1401 ISA, later extended
for IBM 7000 series
– one popular program on 1401 was a 650 simulator, so some
customers ran many 650 programs on emulated 1401s
– (650 simulated on 1401 emulated on 360)
Microprogramming thrived in the
Seventies
• Significantly faster ROMs than magnetic core
memory or DRAMs were available
• For complex instruction sets (CISC), datapath and
controller were cheaper and simpler
• New instructions, e.g., floating point, could be
supported without datapath modifications
• Fixing bugs in the controller was easier
• ISA compatibility across various models could be
achieved easily and cheaply
Except for the cheapest and fastest machines,
all computers were microprogrammed
Writable Control Store (WCS)
• Implement control store in RAM not ROM
– MOS SRAM memories now became almost as fast as control store
(core memories/DRAMs were 2-10x slower)
– Bug-free microprograms difficult to write
• User-WCS provided as option on several minicomputers
– Allowed users to change microcode for each processor
• User-WCS failed
– Little or no programming tools support
– Difficult to fit software into small space
– Microcode control tailored to original ISA, less useful for others
– Large WCS part of processor state - expensive context switches
– Protection difficult if user can change microcode
– Virtual memory required restartable microcode
Microprogramming: early Eighties
• Evolution bred more complex micro-machines
– Ever more complex CISC ISAs led to need for subroutine and call
stacks in µcode
– Need for fixing bugs in control programs was in conflict with read-only
nature of µROM
– --> WCS (B1700, QMachine, Intel i432, …)
• With the advent of VLSI technology assumptions about
ROM & RAM speed became invalid
• Better compilers made complex instructions less
important
• Use of numerous micro-architectural innovations, e.g.,
pipelining, caches and buffers, made multiple-cycle
execution of reg-reg instructions unattractive
Microprogramming in Modern Usage
• Microprogramming is far from extinct
• Played a crucial role in micros of the Eighties
DEC uVAX, Motorola 68K series, Intel 386 and 486
• Microcode plays an assisting role in most modern
micros (AMD Athlon, Intel Core 2 Duo, IBM
PowerPC)
• Most instructions are executed directly, i.e., with hard-wired
control
• Infrequently-used and/or complicated instructions invoke the
microcode engine
• Patchable microcode common for post-fabrication
bug fixes, e.g. Intel Pentiums load µcode patches
at bootup
From CISC to RISC
• Use fast RAM to build fast instruction cache of
user-visible instructions, not fixed hardware
microroutines
– Can change contents of fast instruction memory to fit what
application needs right now
• Use simple ISA to enable hardwired pipelined
implementation
– Most compiled code only used a few of the available CISC
instructions
– Simpler encoding allowed pipelined implementations
• Further benefit with integration
– In early ‘80s, could finally fit 32-bit datapath + small caches
on a single chip
– No chip crossings in common case allows faster operation
CDC 6600 Seymour Cray, 1964
• A fast pipelined machine with 60-bit words
• Ten functional units
- Floating Point: adder, multiplier, divider
- Integer: adder, multiplier
...
• Hardwired control (no microcoding)
• Dynamic scheduling of instructions using a
scoreboard
• Ten Peripheral Processors for Input/Output
- a fast time-shared 12-bit integer ALU
• Very fast clock, 10MHz
• Novel freon-based technology for cooling
CDC 6600: Datapath
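[Figure: CDC 6600 datapath. Central Memory (128K words, 32 banks, 1 µs cycle) exchanges operands and results with the Operand Regs (8 x 60-bit), which feed the 10 Functional Units. The Address Regs (8 x 18-bit) and Index Regs (8 x 18-bit) supply operand and result addresses. Instructions flow from memory into the Inst. Stack (8 x 60-bit) and then the IR.]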
CDC 6600:
A Load/Store Architecture
• Separate instructions to manipulate three types of reg.
    8 60-bit data registers (X)
    8 18-bit address registers (A)
    8 18-bit index registers (B)
• All arithmetic and logic instructions are reg-to-reg
    6        3   3   3
    opcode   i   j   k          Ri ← (Rj) op (Rk)
• Only Load and Store instructions refer to memory!
    6        3   3   18
    opcode   i   j   disp       Ri ← M[(Rj) + disp]
  Touching address registers 1 to 5 initiates a load,
  6 to 7 initiates a store
  - very useful for vector operations
CDC6600: Vector Addition
        B0 ← - n
loop:   JZE B0, exit
        A0 ← B0 + a0        load X0
        A1 ← B0 + b0        load X1
        X6 ← X0 + X1
        A6 ← B0 + c0        store X6
        B0 ← B0 + 1
        jump loop

Ai = address register
Bi = index register
Xi = data register
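The same loop can be traced in a high-level language. In this sketch the flat list `mem` stands in for central memory, and `a_end`, `b_end`, `c_end` play the role of a0, b0, c0; treating them as the addresses just past the end of each array is an assumption inferred from the negative index that counts up to zero.

```python
def vector_add(mem, a_end, b_end, c_end, n):
    """C[i] = A[i] + B[i], indexed by a negative counter that runs up to zero."""
    b0 = -n                       # B0 <- -n
    while b0 != 0:                # JZE B0, exit
        x0 = mem[a_end + b0]      # A0 <- B0 + a0 ; setting A0 loads X0
        x1 = mem[b_end + b0]      # A1 <- B0 + b0 ; setting A1 loads X1
        x6 = x0 + x1              # X6 <- X0 + X1
        mem[c_end + b0] = x6      # A6 <- B0 + c0 ; setting A6 stores X6
        b0 += 1                   # B0 <- B0 + 1
    return mem


memory = [1, 2, 3, 10, 20, 30, 0, 0, 0]
vector_add(memory, a_end=3, b_end=6, c_end=9, n=3)
print(memory[6:9])
```

Counting the index up from -n lets a single zero test serve as the loop exit, so no separate compare is needed.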
CDC6600 ISA designed to simplify
high-performance implementation
• Use of three-address, register-register ALU
instructions simplifies pipelined implementation
– No implicit dependencies between inputs and outputs
• Decoupling setting of address register (Ar) from
retrieving value from data register (Xr) simplifies
providing multiple outstanding memory accesses
– Software can schedule load of address register before use of value
– Can interleave independent instructions in between
• CDC6600 has multiple parallel but unpipelined
functional units
– E.g., 2 separate multipliers
• Follow-on machine CDC7600 used pipelined
functional units
– Foreshadows later RISC designs
“Iron Law” of Processor Performance
Time / Program = (Instructions / Program) * (Cycles / Instruction) * (Time / Cycle)
– Instructions per program depends on source code, compiler
technology, and ISA
– Cycles per instruction (CPI) depends upon the ISA and the
microarchitecture
– Time per cycle depends upon the microarchitecture and the
base technology

Microarchitecture           CPI    cycle time
Microcoded                  >1     short
Single-cycle unpipelined    1      long
Pipelined                   1      short
(slide annotation: this lecture)
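The Iron Law is a straight product, so the three microarchitecture styles are easy to compare numerically. The instruction count, CPI values, and cycle times in this sketch are hypothetical and only chosen to match the qualitative entries (CPI > 1 or 1, short or long cycle).

```python
def execution_time_ns(instructions, cpi, cycle_time_ns):
    """Time/Program = Instructions/Program * Cycles/Instruction * Time/Cycle."""
    return instructions * cpi * cycle_time_ns


n = 1_000_000  # hypothetical dynamic instruction count
designs = {
    "microcoded": execution_time_ns(n, cpi=5, cycle_time_ns=10),
    "single_cycle": execution_time_ns(n, cpi=1, cycle_time_ns=40),
    "pipelined": execution_time_ns(n, cpi=1, cycle_time_ns=10),
}
for name, t in designs.items():
    print(name, t)
```

The pipelined design wins here because it keeps the CPI of the single-cycle machine while also keeping the short cycle of the microcoded one.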
CS152 Administrivia
• Check web site for new calendar, quiz dates should
not change
– Feb 17 and Mar 17 lecture in 320 Soda
– All other lectures in 306 Soda (here)
• PS1 and Lab 1 available now or tomorrow
– PS 1 / Lab 1 due Tuesday February 10
• Section tomorrow (Friday 1/30) 12-1pm 258 Dwinelle
– Covers lab 1 details
• Quiz 1 on Thursday Feb 12
Hardware Elements
• Combinational circuits
– Mux, Decoder, ALU, ...
[Figure: Mux (inputs A0 ... An-1, lg(n)-bit Sel, output O); Decoder (lg(n)-bit input A, outputs O0 ... On-1); ALU (inputs A and B, OpSelect, outputs Result and Comp?)]
ALU OpSelect:
- Add, Sub, ...
- And, Or, Xor, Not, ...
- GT, LT, EQ, Zero, ...
• Synchronous state elements
– Flipflop, Register, Register file, SRAM, DRAM
[Figure: flip-flop (ff) with inputs D, En, Clk and output Q, with a timing diagram of Clk, En, D, Q]
Edge-triggered: Data is sampled at the rising edge
Register Files
[Figure: a register built from flip-flops, with inputs D0 ... Dn-1, outputs Q0 ... Qn-1, and shared En and Clk]
[Figure: register file, 2R+1W. Inputs: Clock, WE (we), ReadSel1 (rs1), ReadSel2 (rs2), WriteSel (ws), WriteData (wd). Outputs: ReadData1 (rd1), ReadData2 (rd2).]
• Reads are combinational
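A behavioral model makes the 2R+1W interface concrete. This is a minimal sketch: the class and method names are invented for the example, the 32 registers of 32 bits follow the MIPS register file used later in the lecture, and the hardwiring of R0 to zero is deliberately not modeled.

```python
class RegisterFile:
    """2R+1W register file: combinational reads, write on the clock edge."""

    def __init__(self, n_regs=32):
        self.regs = [0] * n_regs

    def read(self, rs1, rs2):
        # Reads are combinational: they return the current contents at once.
        return self.regs[rs1], self.regs[rs2]

    def clock_edge(self, we, ws, wd):
        # The write takes effect only at the rising edge, and only if we is set.
        if we:
            self.regs[ws] = wd & 0xFFFFFFFF  # keep values to 32 bits


rf = RegisterFile()
rf.clock_edge(we=True, ws=5, wd=42)
print(rf.read(5, 0))
```

Separating `read` from `clock_edge` mirrors the hardware: two values can be read and one written in the same cycle, with the write visible only after the edge.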
Register File Implementation
[Figure: register file built from 32 registers (reg 0 ... reg 31) of 32 bits each. The 5-bit ws and we select which register latches the 32-bit wd on clk; the 5-bit rs1 and rs2 select the 32-bit outputs rd1 and rd2.]
• Register files with a large number of ports are difficult to design
– Almost all MIPS instructions have exactly 2 register source operands
– Intel’s Itanium GPR file has 128 registers with 8 read ports and 4 write ports!
A Simple Memory Model
[Figure: MAGIC RAM with inputs Clock, WriteEnable, Address, WriteData and output ReadData]
Reads and writes are always completed in one cycle
• a Read can be done any time (i.e. combinational)
• a Write is performed at the rising clock edge
if it is enabled
⇒ the write address and data
must be stable at the clock edge
Later in the course we will present a more realistic
model of memory
Implementing MIPS:
Single-cycle per instruction
datapath & control logic
(Should be review of CS61C)
The MIPS ISA
Processor State
32 32-bit GPRs, R0 always contains a 0
32 single precision FPRs, may also be viewed as
16 double precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers
Data types
8-bit byte, 16-bit half word
32-bit word for integers
32-bit word for single precision floating point
64-bit word for double precision floating point
Load/Store style instruction set
data addressing modes- immediate & indexed
branch addressing modes- PC relative & register indirect
Byte addressable memory- big endian mode
All instructions are 32 bits
Instruction Execution
Execution of an instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back
and the computation of the address of the
next instruction
Datapath: Reg-Reg ALU Instructions
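[Figure: datapath for reg-reg ALU instructions. The PC, incremented by 0x4 through an adder, addresses the Inst. Memory. inst<25:21> and inst<20:16> drive rs1 and rs2 of the GPRs and inst<15:11> drives ws; rd1 and rd2 feed the ALU, whose result returns to wd. ALU Control decodes inst<5:0> into OpCode; RegWrite drives we.]
  6       5       5       5       5       6
  0       rs      rt      rd      0       func         rd ← (rs) func (rt)
  31-26   25-21   20-16   15-11   10-6    5-0
RegWrite Timing?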
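The field boundaries in this format translate directly into shifts and masks. The sketch below is illustrative only: `decode_rtype` is a name invented for the example, and the test word is assembled from field values rather than taken from the lecture.

```python
def decode_rtype(inst):
    """Split a 32-bit instruction word into the reg-reg ALU fields."""
    return {
        "opcode": (inst >> 26) & 0x3F,  # inst<31:26>
        "rs": (inst >> 21) & 0x1F,      # inst<25:21>
        "rt": (inst >> 16) & 0x1F,      # inst<20:16>
        "rd": (inst >> 11) & 0x1F,      # inst<15:11>
        "func": inst & 0x3F,            # inst<5:0>
    }


# Word built from opcode=0, rs=1, rt=2, rd=3, func=0x20.
word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
print(hex(word), decode_rtype(word))
```

In hardware no decoding logic is needed for these fields: the wires inst<25:21>, inst<20:16>, and inst<15:11> go straight to the register file ports.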
Datapath: Reg-Imm ALU Instructions
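[Figure: datapath for reg-imm ALU instructions. The PC, incremented by 0x4 through an adder, addresses the Inst. Memory. inst<25:21> drives rs1 and inst<20:16> drives ws of the GPRs; rd1 and the extended immediate (inst<15:0> through Imm Ext, controlled by ExtSel) feed the ALU, whose result returns to wd. ALU Control decodes inst<31:26> into OpCode; RegWrite drives we.]
  6        5       5       16
  opcode   rs      rt      immediate         rt ← (rs) op immediate
  31-26    25-21   20-16   15-0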
Conflicts in Merging Datapath
[Figure: the reg-reg and reg-imm datapaths overlaid. Conflicts appear at the ws input of the GPRs (inst<15:11> versus inst<20:16>) and at the second ALU input (rd2 versus the extended immediate); ALU Control now sees both inst<31:26> and inst<5:0>. Annotation: Introduce muxes.]
  6        5     5     5     5     6
  0        rs    rt    rd    0     func          rd ← (rs) func (rt)
  opcode   rs    rt    immediate                 rt ← (rs) op immediate
Datapath for ALU Instructions
[Figure: merged datapath for ALU instructions. A RegDst mux (rt / rd) selects between <20:16> and <15:11> for ws, and a BSrc mux (Reg / Imm) selects between rd2 and the extended immediate <15:0> for the second ALU input. ALU Control takes <31:26> and <5:0> and produces OpCode, ExtSel, and OpSel; RegWrite drives we.]
  6        5     5     5     5     6
  0        rs    rt    rd    0     func          rd ← (rs) func (rt)
  opcode   rs    rt    immediate                 rt ← (rs) op immediate
Datapath for Memory Instructions
Should program and data memory be separate?
Harvard style: separate (Aiken and Mark 1 influence)
- read-only program memory
- read/write data memory
- Note:
Somehow there must be a way to load the
program memory
Princeton style: the same (von Neumann’s influence)
- single read/write memory for program and data
- Note:
A Load or Store instruction requires
accessing the memory more than once
during its execution
Load/Store Instructions: Harvard Datapath
[Figure: Harvard-style datapath for loads and stores. The ALU adds the “base” register (rd1) to the extended displacement (disp) to form the Data Memory addr; rd2 supplies wdata, MemWrite drives the Data Memory we, and a WBSrc mux (ALU / Mem) selects between the ALU result and rdata for wd. Control signals: RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]
  6        5       5       16
  opcode   rs      rt      displacement
  31-26    25-21   20-16   15-0
addressing mode: (rs) + displacement
rs is the base register
rt is the destination of a Load or the source for a Store
MIPS Control Instructions
Conditional (on GPR) PC-relative branch
  6        5     5     16
  opcode   rs          offset            BEQZ, BNEZ
Unconditional register-indirect jumps
  6        5     5     16
  opcode   rs                            JR, JALR
Unconditional absolute jumps
  6        26
  opcode   target                        J, JAL
• PC-relative branches add offset×4 to PC+4 to calculate the
target address (offset is in words): 128 KB range
• Absolute jumps append target×4 to PC<31:28> to calculate
the target address: 256 MB range
• jump-&-link stores PC+4 into the link register (R31)
• All Control Transfers are delayed by 1 instruction
we will worry about the branch delay slot later
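The two target calculations can be sketched as follows. The function names are invented for this example, the branch delay slot is ignored (as the slide defers it), and the jump uses PC<31:28> exactly as the slide states.

```python
def sign_extend16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def branch_target(pc, offset16):
    # PC-relative: (PC + 4) + offset * 4, with the offset counted in words.
    return (pc + 4 + sign_extend16(offset16) * 4) & 0xFFFFFFFF


def jump_target(pc, target26):
    # Absolute: PC<31:28> concatenated with target * 4.
    return (pc & 0xF0000000) | ((target26 & 0x3FFFFFF) << 2)


print(hex(branch_target(0x1000, 3)))
print(hex(jump_target(0x40001000, 0x100)))
```

The 16-bit signed word offset reaches 2^15 words (128 KB) in either direction, and the 26-bit target scaled by 4 covers a 2^28-byte (256 MB) region.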
Conditional Branches (BEQZ, BNEZ)
[Figure: datapath extended for conditional branches. A second adder computes the branch target from pc+4 and the extended immediate; a PCSrc mux selects between pc+4 and br, and the ALU zero? output feeds the control logic. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]
Register-Indirect Jumps (JR)
[Figure: datapath extended for register-indirect jumps. The PCSrc mux now selects among pc+4, br, and rind, where rind comes from the register file read port. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc; status: zero?.]
Register-Indirect Jump-&-Link (JALR)
[Figure: datapath extended for register-indirect jump-&-link. In addition to the rind PCSrc input, register 31 becomes a possible write destination and pc+4 becomes a possible write-back value. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc; status: zero?.]
Absolute Jumps (J, JAL)
[Figure: datapath extended for absolute jumps. The PCSrc mux selects among pc+4, br, rind, and jabs; register 31 remains a possible write destination for JAL. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc; status: zero?.]
Harvard-Style Datapath for MIPS
[Figure: complete Harvard-style datapath for MIPS: PC with the pc+4 and branch adders, PCSrc mux (pc+4 / br / rind / jabs), Inst. Memory, GPRs (with 31 as a write destination option), Imm Ext, ALU with zero? output, Data Memory, and the write-back mux. Control signals: PCSrc, RegWrite, MemWrite, WBSrc, OpCode, RegDst, ExtSel, OpSel, BSrc.]
Hardwired Control is pure
Combinational Logic
[Figure: a block of combinational logic with inputs op code and zero? and outputs ExtSel, BSrc, OpSel, MemWrite, WBSrc, RegDst, RegWrite, PCSrc]
ALU Control & Immediate Extension
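[Figure: ALU control. Inst<5:0> (Func) and Inst<31:26> (Opcode) are candidate ALU operations, alongside the fixed choices + and 0?; OpSel ( Func, Op, +, 0? ) selects which one becomes ALUop. A Decode Map on the opcode produces ExtSel ( sExt16, uExt16, High16 ).]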
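The three ExtSel choices can be modeled as one small function. This is a sketch: sExt16 and uExt16 are standard sign and zero extension, while reading High16 as "place the 16-bit immediate in the upper half of the word" is an assumption based on the name, not something the slide spells out.

```python
def extend(imm16, ext_sel):
    """Expand a 16-bit immediate to 32 bits according to ExtSel."""
    imm16 &= 0xFFFF
    if ext_sel == "sExt16":
        # Replicate bit 15 into the upper 16 bits.
        return (imm16 | 0xFFFF0000) if imm16 & 0x8000 else imm16
    if ext_sel == "uExt16":
        # Upper 16 bits are zero.
        return imm16
    if ext_sel == "High16":
        # Assumed meaning: immediate occupies bits 31..16, low half is zero.
        return imm16 << 16
    raise ValueError(f"unknown ExtSel: {ext_sel}")


print(hex(extend(0x8000, "sExt16")), hex(extend(0x8000, "uExt16")))
```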
Hardwired Control Table
Opcode      ExtSel   BSrc   OpSel   MemW   RegW   WBSrc   RegDst   PCSrc
ALU         *        Reg    Func    no     yes    ALU     rd       pc+4
ALUi        sExt16   Imm    Op      no     yes    ALU     rt       pc+4
ALUiu       uExt16   Imm    Op      no     yes    ALU     rt       pc+4
LW          sExt16   Imm    +       no     yes    Mem     rt       pc+4
SW          sExt16   Imm    +       yes    no     *       *        pc+4
BEQZ z=0    sExt16   *      0?      no     no     *       *        br
BEQZ z=1    sExt16   *      0?      no     no     *       *        pc+4
J           *        *      *       no     no     *       *        jabs
JAL         *        *      *       no     yes    PC      R31      jabs
JR          *        *      *       no     no     *       *        rind
JALR        *        *      *       no     yes    PC      R31      rind

BSrc = Reg / Imm
WBSrc = ALU / Mem / PC
RegDst = rt / rd / R31
PCSrc = pc+4 / br / rind / jabs
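Because hardwired control is pure combinational logic, the control table is just a lookup from (opcode, zero?) to signal values. The sketch below encodes the table rows as data; the names `Ctl`, `TABLE`, and `control` are invented for the example, and "*" marks a don't-care exactly as in the table.

```python
from collections import namedtuple

Ctl = namedtuple("Ctl", "ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc")

TABLE = {
    "ALU":      Ctl("*",      "Reg", "Func", "no",  "yes", "ALU", "rd",  "pc+4"),
    "ALUi":     Ctl("sExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "ALUiu":    Ctl("uExt16", "Imm", "Op",   "no",  "yes", "ALU", "rt",  "pc+4"),
    "LW":       Ctl("sExt16", "Imm", "+",    "no",  "yes", "Mem", "rt",  "pc+4"),
    "SW":       Ctl("sExt16", "Imm", "+",    "yes", "no",  "*",   "*",   "pc+4"),
    "BEQZ z=0": Ctl("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "br"),
    "BEQZ z=1": Ctl("sExt16", "*",   "0?",   "no",  "no",  "*",   "*",   "pc+4"),
    "J":        Ctl("*",      "*",   "*",    "no",  "no",  "*",   "*",   "jabs"),
    "JAL":      Ctl("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "jabs"),
    "JR":       Ctl("*",      "*",   "*",    "no",  "no",  "*",   "*",   "rind"),
    "JALR":     Ctl("*",      "*",   "*",    "no",  "yes", "PC",  "R31", "rind"),
}


def control(opcode, z=None):
    """Look up the control signals; only BEQZ depends on the z input."""
    key = f"{opcode} z={z}" if opcode == "BEQZ" else opcode
    return TABLE[key]


print(control("LW"))
print(control("BEQZ", z=0).PCSrc)
```

A real implementation would minimize this table into gates, using the don't-care entries to simplify the logic.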
Single-Cycle Hardwired Control:
Harvard architecture
We will assume
• clock period is sufficiently long for all of
the following steps to be “completed”:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. data fetch if required
5. register write-back setup time
⇒ tC > tIFetch + tRFetch + tALU + tDMem + tRWB
• At the rising edge of the following clock, the PC,
the register file and the memory are updated
An Ideal Pipeline
[Figure: stage 1 → stage 2 → stage 3 → stage 4]
• All objects go through the same stages
• No sharing of resources between any two stages
• Propagation delay through all pipeline stages is equal
• The scheduling of an object entering the pipeline
is not affected by the objects in other stages
These conditions generally hold for industrial
assembly lines.
But can an instruction pipeline satisfy the last
condition?
Pipelined MIPS
To pipeline MIPS:
• First build MIPS without pipelining with CPI=1
• Next, add pipeline registers to reduce cycle time while
maintaining CPI=1
Pipelined Datapath
[Figure: pipelined datapath (PC and 0x4 adder, Inst. Memory, IR, GPRs, Imm Ext, ALU, Data Memory) divided into five phases: fetch phase, decode & Reg-fetch phase, execute phase, memory phase, write-back phase]
Clock period can be reduced by dividing the execution of an
instruction into multiple cycles
tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably)
However, CPI will increase unless instructions are pipelined
How to divide the datapath
into stages
Suppose memory is significantly slower than
other stages. In particular, suppose
tIM  = 10 units
tDM  = 10 units
tALU = 5 units
tRF  = 1 unit
tRW  = 1 unit
Since the slowest stage determines the clock, it
may be possible to combine some stages without
any loss of performance
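Plugging the stage delays from this slide into the two cycle-time bounds shows what pipelining buys. The sketch ignores pipeline register overhead, which the slides do not quantify.

```python
# Stage delays in the units used on the slide.
stage_delay = {"tIM": 10, "tRF": 1, "tALU": 5, "tDM": 10, "tRW": 1}

# Single-cycle: the clock must cover every step in sequence.
single_cycle_tc = sum(stage_delay.values())

# Five-stage pipeline: the clock only has to cover the slowest stage.
pipelined_tc = max(stage_delay.values())

print(single_cycle_tc, pipelined_tc)
```

The cycle time drops from 27 units to 10, less than the ideal factor of five because the stages are unbalanced.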
Alternative Pipelining
[Figure: the pipelined datapath (PC and 0x4 adder, Inst. Memory, IR, GPRs, Imm Ext, ALU, Data Memory) with its fetch phase, decode & Reg-fetch phase, execute phase, memory phase, and write-back phase, shown with alternative stage boundaries]
tC > max {tIM, tRF, tALU, tDM, tRW} = tDM
Write-back stage takes much less time than other stages.
Suppose we combined it with the memory phase
tC > max {tIM, tRF+tALU, tDM+tRW} = tDM + tRW
⇒ increase the critical path by 10%
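The effect of merging stages can be checked with the delays from the previous slide (tIM = tDM = 10, tALU = 5, tRF = tRW = 1). In this sketch each pipeline is a list of stages and each stage is a list of the delays it contains; the helper name `cycle_time` is invented for the example.

```python
def cycle_time(stages):
    """Clock period bound: the slowest stage, where a stage sums its parts."""
    return max(sum(stage) for stage in stages)


t_im, t_rf, t_alu, t_dm, t_rw = 10, 1, 5, 10, 1

five_stage = [[t_im], [t_rf], [t_alu], [t_dm], [t_rw]]
merged_mem_wb = [[t_im], [t_rf], [t_alu], [t_dm, t_rw]]
three_stage = [[t_im], [t_rf, t_alu], [t_dm, t_rw]]

base = cycle_time(five_stage)
print(base, cycle_time(merged_mem_wb), cycle_time(three_stage))
print((cycle_time(merged_mem_wb) - base) / base)
```

Folding write-back into the memory phase stretches the clock from 10 to 11 units, the 10% increase noted on the slide, while also merging decode with execute costs nothing more because tRF + tALU is still below tDM + tRW.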
Summary
• Microcoding became less attractive as gap between
RAM and ROM speeds reduced
• Complex instruction sets difficult to pipeline, so
difficult to increase performance as gate count grew
• Iron Law explains architecture design space
– Trade instruction/program, cycles/instruction, and time/cycle
• Load-Store RISC ISAs designed for efficient
pipelined implementations
– Very similar to vertical microcode
– Inspired by earlier Cray machines
• MIPS ISA will be used in class and problems,
SPARC in lab (two very similar ISAs)
Acknowledgements
• These slides contain material developed and
copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252