Krste Asanovic
Electrical Engineering and Computer Sciences
University of California at Berkeley
http://www.eecs.berkeley.edu/~krste
http://inst.eecs.berkeley.edu/~cs152
• Computer Science at crossroads from sequential to parallel computing
• Computer Architecture >> ISAs and RTL
– CS152 is about interaction of hardware and software, and design of appropriate abstraction layers
• Comp. Arch. shaped by technology and applications
– History provides lessons for the future
• Cost of software development a large constraint on architecture
– Compatibility a key solution to software cost
• IBM 360 introduced the notion of a “family of machines” running the same ISA but with very different implementations
– Within the same generation of machines
– “Future-proofing” for subsequent generations of machines
An ALGOL Machine, Robert Barton, 1960
• Machine implementation can be completely hidden if the programmer is provided only a high-level language interface.
• Stack machine organization because stacks are convenient for:
1. expression evaluation;
2. subroutine calls, recursion, nested interrupts;
3. accessing variables in block-structured languages.
• B6700, a later model, had many more innovative features
– tagged data
– virtual memory
– multiple processors and memories
[Figure: processor stack, with overflow into main store; stack contents evolve as: a (after push a), b a (after push b), c b a (after push c), b a (after pop).]
A stack machine has a stack as a part of the processor state; typical operations: push, pop, +, *, ...
Instructions like + implicitly specify the top 2 elements of the stack as operands.
(a + b * c) / (a + d * c - e)
Reverse Polish: a b c * + a d c * + e - /
[Figure: the expression tree for the quotient and the evaluation stack after each step; e.g., pushing a, b, c leaves c b a on the stack, the * replaces b and c with b*c, and the following add leaves a + b*c.]
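To make the evaluation concrete, here is a minimal Python sketch (not from the slides; the variable values are made up) of a stack machine evaluating this Reverse Polish string: each name is a push, and each operator implicitly pops the top two elements and pushes the result.

def eval_rpn(tokens, env):
    """Evaluate a Reverse Polish expression on an explicit stack."""
    stack = []
    ops = {'+': lambda x, y: x + y, '-': lambda x, y: x - y,
           '*': lambda x, y: x * y, '/': lambda x, y: x / y}
    for t in tokens:
        if t in ops:
            y = stack.pop()              # top of stack
            x = stack.pop()              # next element down
            stack.append(ops[t](x, y))
        else:
            stack.append(env[t])         # "push t"
    return stack.pop()

env = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}     # hypothetical values
rpn = 'a b c * + a d c * + e - /'.split()
print(eval_rpn(rpn, env))                # (1 + 2*3) / (1 + 4*3 - 5) = 0.875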
• Stack is part of the processor state
⇒ stack must be bounded and small
≈ number of registers, not the size of main memory
• Conceptually the stack is unbounded
⇒ a part of the stack is included in the processor state; the rest is kept in main memory
• Suppose the top 2 elements of the stack are kept in registers and the rest is kept in memory.
Each push operation ⇒ 1 memory reference
Each pop operation ⇒ 1 memory reference
No good!
• Better performance can be obtained if the top N elements are kept in registers and memory references are made only when the register stack overflows or underflows.
Issue: when to load/unload registers?
program (stack size = 2):
push a; push b; push c; *; +; push a; push d; push c; *; +; push e; -; /
a b c * + a d c * + e - /
[Figure: stack occupancy after each instruction, with the implicit memory references: a, b, c, ss(a), sf(a), a, d, ss(a+b*c), c, ss(a), sf(a), sf(a+b*c), e, ss(a+b*c), sf(a+b*c).]
4 stores, 4 fetches (implicit)
a and c are “loaded” twice – not the best use of registers!
a b c * + a d c * + e - /
program (stack size = 4):
push a; push b; push c; *; +; push a; push d; push c; *; +; push e; -; /
[Figure: stack occupancy after each instruction; with a register stack of size 4 the stack never overflows, so there are no implicit memory references.]
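The two tallies above can be reproduced with a small Python sketch (my own model: keep the top N stack elements in registers, spill the bottom register to memory on overflow, and eagerly refill after a pop; the variable values are made up). With a 2-register stack it performs 4 stores and 4 fetches for this program; with a 4-register stack it performs none.

class RegisterStack:
    def __init__(self, n_regs):
        self.regs = []                    # top of the stack, at most n_regs deep
        self.mem = []                     # rest of the stack, in main memory
        self.n_regs = n_regs
        self.stores = self.fetches = 0

    def push(self, value):
        if len(self.regs) == self.n_regs:       # register stack overflow
            self.mem.append(self.regs.pop(0))   # spill bottom register ...
            self.stores += 1                    # ... one memory store
        self.regs.append(value)

    def pop(self):
        value = self.regs.pop()
        if self.mem and len(self.regs) < self.n_regs:   # keep registers full
            self.regs.insert(0, self.mem.pop())         # refill from memory ...
            self.fetches += 1                           # ... one memory fetch
        return value

def run(rpn, env, n_regs):
    s = RegisterStack(n_regs)
    for t in rpn.split():
        if t in '+-*/':
            y, x = s.pop(), s.pop()
            s.push({'+': x + y, '-': x - y, '*': x * y, '/': x / y}[t])
        else:
            s.push(env[t])
    return s.stores, s.fetches

env = {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}        # hypothetical values
print(run('a b c * + a d c * + e - /', env, 2))       # (4, 4): stores, fetches
print(run('a b c * + a d c * + e - /', env, 4))       # (0, 0): no spills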
(a + b * c) / (a + d * c - e)

Load R0 a
Load R1 c
Load R2 b
Mul  R2 R1
Add  R2 R0    (reuse R2)
Load R3 d
Mul  R3 R1
Add  R3 R0    (reuse R3)
Load R0 e     (reuse R0)
Sub  R3 R0
Div  R2 R3

More control over register usage since registers can be named explicitly:
Load Ri m
Load Ri (Rj)
Load Ri (Rj) (Rk)
⇒ eliminates unnecessary Loads and Stores
– fewer registers, but instructions may be longer!
• In addition to push, pop, +, etc., the instruction set must provide the capability to
– refer to any element in the data area
– jump to any instruction in the code area
– move any element in the stack frame to the top
[Figure: stack machine organization with SP, DP and PC registers, the machinery to carry out +, -, etc., a code area (push a; push b; push c; *; +; push e; /), the stack holding a, b, c, and a data area.]
Amdahl, Blaauw and Brooks, 1964
1. The performance advantage of a push-down stack organization is derived from the presence of fast registers, not the way they are used.
2. “Surfacing” of data in the stack which is “profitable” occurs only about 50% of the time, because of constants and common subexpressions.
3. The advantage in instruction density because of implicit addresses is equaled if short addresses to specify registers are allowed.
4. Management of a finite-depth stack causes complexity.
5. The recursive-subroutine advantage can be realized only with the help of an independent stack for addressing.
6. Fitting variable-length fields into a fixed-width word is awkward.
Stack machines (mostly) died out:
1. Stack programs are not smaller if short (register) addresses are permitted.
2. Modern compilers can manage fast register space better than the stack discipline.
⇒ GPRs and caches are better than stacks and displays
Early language-directed architectures often did not take into account the role of compilers!
(B5000, B6700, HP 3000, ICL 2900, Symbolics 3600)
Some would claim that an echo of this mistake is visible in the SPARC architecture register windows – more later…
• Inmos Transputers (1985-2000)
– Designed to support many parallel processes in Occam language
– Fixed-height stack design simplified implementation
– Stack trashed on context swap (fast context switches)
– Inmos T800 was world’s fastest microprocessor in the late ’80s
• Forth machines
– Direct support for Forth execution in small embedded real-time environments
– Several manufacturers (Rockwell, Patriot Scientific)
• Java Virtual Machine
– Designed for software emulation, not direct hardware execution
– Sun PicoJava implementation + others
• Intel x87 floating-point unit
– Severely broken stack model for FP arithmetic
– Deprecated in Pentium-4, replaced with SSE2 FP registers
– To show how to build very small processors with complex ISAs
– To help you understand where CISC machines came from
– Because it is still used in the most common machines (x86, PowerPC, IBM360)
– As a gentle introduction into machine structures
– To help understand how technology drove the move to RISC
• ISA often designed with a particular microarchitectural style in mind, e.g.,
– CISC ⇒ microcoded
– RISC ⇒ hardwired, pipelined
– VLIW ⇒ fixed-latency in-order pipelines
– JVM ⇒ software interpretation
• But an ISA can be implemented with any microarchitectural style
– Core 2 Duo: hardwired pipelined CISC (x86) machine (with some microcode support)
– This lecture: a microcoded RISC (MIPS) machine
– Intel could implement a dynamically scheduled out-of-order VLIW (IA-64) processor
– ARM Jazelle: a hardware JVM processor
– Simics: software-interpreted SPARC RISC machine
Implementation of an ISA
[Figure: a controller drives control points in the datapath and observes status lines coming back from it.]
Structure: how components are connected (static)
Behavior: how data moves between components (dynamic)
Maurice Wilkes, 1954: embed the control logic state table in a memory array.
[Figure: Wilkes’ microprogrammed control: the op and a conditional-code flip-flop form the next-state address, which a decoder uses to drive Matrix A (control lines to the ALU, MUXs and registers) and Matrix B (the next-state address).]
[Figure: a microcoded machine: the µcontroller (ROM) holds fixed microcode instructions and receives busy?, zero? and opcode inputs; it drives the datapath. Main memory (RAM), controlled by enMem and MemWrt and accessed through Addr/Data, holds the user program written in macrocode instructions (e.g., MIPS, x86, etc.).]
• Processor State
32 32-bit GPRs, R0 always contains a 0
16 double-precision / 32 single-precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers
(see H&P Appendix B for a full description)
• Data types
8-bit byte, 16-bit half word
32-bit word for integers
32-bit word for single-precision floating point
64-bit word for double-precision floating point
• Load/Store style instruction set
data addressing modes: immediate & indexed
branch addressing modes: PC-relative & register indirect
byte-addressable memory, big-endian mode
All instructions are 32 bits
ALU:         opcode(6) rs(5) rt(5) rd(5) 0(5) func(6)       rd ← (rs) func (rt)
ALUi:        opcode(6) rs(5) rt(5) immediate(16)            rt ← (rs) op immediate
Mem:         opcode(6) rs(5) rt(5) displacement(16)         rt ↔ M[(rs) + displacement]
BEQZ, BNEZ:  opcode(6) rs(5) -(5) offset(16)
JR, JALR:    opcode(6) rs(5) -(5) -(16)
J, JAL:      opcode(6) offset(26)
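A small Python sketch of the field extraction implied by these formats (the field positions follow the widths above; the example encoding at the end assumes the standard MIPS opcode/func values for addu, which are not given on the slide):

def decode(inst):
    """Split a 32-bit instruction word into the fields used by the formats above."""
    return {
        'opcode':   (inst >> 26) & 0x3F,    # bits 31:26
        'rs':       (inst >> 21) & 0x1F,    # bits 25:21
        'rt':       (inst >> 16) & 0x1F,    # bits 20:16
        'rd':       (inst >> 11) & 0x1F,    # bits 15:11  (ALU format only)
        'func':     inst & 0x3F,            # bits 5:0    (ALU format only)
        'imm':      inst & 0xFFFF,          # bits 15:0   (ALUi/Mem/branch formats)
        'offset26': inst & 0x3FFFFFF,       # bits 25:0   (J/JAL format)
    }

# addu r3, r1, r2: opcode=0, rs=1, rt=2, rd=3, func=0x21 (assumed encoding)
inst = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x21
print(decode(inst))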
[Figure: a bus-based datapath for MIPS. A single 32-bit bus connects the IR (ldIR), the immediate extender (ExtSel, enImm), the A and B registers feeding the ALU (ldA, ldB, OpSel/ALU control, enALU), the 32-GPR register file with the PC (RegSel selects rs/rt/rd/PC; RegWrt, enReg), and the memory-address register MA and memory port (ldMA, MemWrt, enMem, busy). The controller observes the Opcode, zero? and busy signals.]
Microinstruction: register-to-register transfer (17 control signals)
MA ← PC      means RegSel = PC; enReg = yes; ldMA = yes
B ← Reg[rt]  means RegSel = rt; enReg = yes; ldB = yes
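One way to picture a microinstruction is as a bundle of values for all the control points at once. The sketch below (signal names taken from the datapath figure; the exact 17-signal encoding is my assumption) expresses the two register transfers above that way:

# Default (inactive) value for every control point in the datapath figure.
IDLE = dict(ldIR=0, ldA=0, ldB=0, ldMA=0, enReg=0, enImm=0, enALU=0,
            enMem=0, MemWrt=0, RegWrt=0, RegSel=None, OpSel=None, ExtSel=None)

def uinst(**signals):
    """One microinstruction: the idle bundle with this transfer's signals asserted."""
    ui = dict(IDLE)
    ui.update(signals)
    return ui

MA_from_PC = uinst(RegSel='PC', enReg=1, ldMA=1)   # MA <- PC
B_from_rt  = uinst(RegSel='rt', enReg=1, ldB=1)    # B  <- Reg[rt]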
[Figure: memory module on the bus, with addr, din/dout, write(1)/read(0), enable and busy signals.]
Assumption: memory operates independently and is slow compared to register-to-register transfers (multiple CPU clock cycles per access).
Execution of a MIPS instruction involves:
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write-back to register file (optional)
+ the computation of the next instruction address
instr fetch:  MA ← PC
              A ← PC
              IR ← Memory
              PC ← A + 4
              dispatch on OPcode

ALU:   A ← Reg[rs]
       B ← Reg[rt]
       Reg[rd] ← func(A,B)
       do instruction fetch           (“do instruction fetch” can be treated as a macro)

ALUi:  A ← Reg[rs]
       B ← Imm                        (sign extension ...)
       Reg[rt] ← Opcode(A,B)
       do instruction fetch
(cont.)
LW:        A ← Reg[rs]
           B ← Imm
           MA ← A + B
           Reg[rt] ← Memory
           do instruction fetch

J:         A ← PC
           B ← IR
           PC ← JumpTarg(A,B)
           do instruction fetch
           where JumpTarg(A,B) = {A[31:28], B[25:0], 00}

beqz:      A ← Reg[rs]
           if zero?(A) then go to bz-taken
           do instruction fetch

bz-taken:  A ← PC
           B ← Imm << 2
           PC ← A + B
           do instruction fetch
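The JumpTarg concatenation can be checked with a couple of lines of Python (values in the example are made up):

def jump_targ(a, b):
    """{A[31:28], B[25:0], 00}: top 4 bits of A glued to the 26-bit offset, shifted by 2."""
    return (a & 0xF0000000) | ((b & 0x03FFFFFF) << 2)

# e.g. A = PC+4 = 0x40000010, instruction word with offset field 0x000040
assert jump_targ(0x40000010, 0x08000040) == 0x40000100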
first attempt: pure ROM implementation
[Figure: the program ROM is addressed by the opcode (6 bits), the zero? and busy (memory) status bits, and the current state s (the µPC); its outputs are the 17 control signals and the next state. Latching the inputs may cause a one-cycle delay.]
ROM size?  = 2^(opcode + status + s) words
Word size? = control + s bits
How big is “s”?
Worksheet:

State    Op    zero?  busy   Control points            next-state
fetch0   *     *      *      MA ← PC                   fetch1
fetch1   *     *      yes    ....                      fetch1
fetch1   *     *      no     IR ← Memory               fetch2
fetch2   *     *      *      A ← PC                    fetch3
fetch3   *     *      *      PC ← A + 4                ?
fetch3   ALU   *      *      PC ← A + 4                ALU0
...
ALU0     *     *      *      A ← Reg[rs]               ALU1
ALU1     *     *      *      B ← Reg[rt]               ALU2
ALU2     *     *      *      Reg[rd] ← func(A,B)       fetch0
State    Op     zero?  busy   Control points            next-state
fetch0   *      *      *      MA ← PC                   fetch1
fetch1   *      *      yes    ....                      fetch1
fetch1   *      *      no     IR ← Memory               fetch2
fetch2   *      *      *      A ← PC                    fetch3
fetch3   ALU    *      *      PC ← A + 4                ALU0
fetch3   ALUi   *      *      PC ← A + 4                ALUi0
fetch3   LW     *      *      PC ← A + 4                LW0
fetch3   SW     *      *      PC ← A + 4                SW0
fetch3   J      *      *      PC ← A + 4                J0
fetch3   JAL    *      *      PC ← A + 4                JAL0
fetch3   JR     *      *      PC ← A + 4                JR0
fetch3   JALR   *      *      PC ← A + 4                JALR0
fetch3   beqz   *      *      PC ← A + 4                beqz0
...
ALU0     *      *      *      A ← Reg[rs]               ALU1
ALU1     *      *      *      B ← Reg[rt]               ALU2
ALU2     *      *      *      Reg[rd] ← func(A,B)       fetch0
Cont.

State    Op     zero?  busy   Control points            next-state
ALUi0    *      *      *      A ← Reg[rs]               ALUi1
ALUi1    sExt   *      *      B ← sExt16(Imm)           ALUi2
ALUi1    uExt   *      *      B ← uExt16(Imm)           ALUi2
ALUi2    *      *      *      Reg[rd] ← Op(A,B)         fetch0
...
J0       *      *      *      A ← PC                    J1
J1       *      *      *      B ← IR                    J2
J2       *      *      *      PC ← JumpTarg(A,B)        fetch0
...
beqz0    *      *      *      A ← Reg[rs]               beqz1
beqz1    *      yes    *      A ← PC                    beqz2
beqz1    *      no     *      ....                      fetch0
beqz2    *      *      *      B ← sExt16(Imm)           beqz3
beqz3    *      *      *      PC ← A+B                  fetch0

JumpTarg(A,B) = {A[31:28], B[25:0], 00}
[Figure: the control ROM is addressed by the status & opcode bits (w) and the µPC (s); its outputs are the control signals (c) and the next µPC (s).]
size = 2^(w+s) × (c + s)

MIPS:  w = 6 + 2,  c = 17,  s = ?
no. of steps per opcode = 4 to 6 + fetch-sequence
no. of states ≈ (4 steps per op-group) × op-groups + common sequences
             = 4 × 8 + 10 states = 42 states  ⇒  s = 6

Control ROM = 2^(8+6) × 23 bits ≈ 48 Kbytes
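A quick check of this sizing arithmetic in Python:

w, s, c = 6 + 2, 6, 17              # opcode+status input bits, state bits, control signals
words = 2 ** (w + s)                # 16384 addressable microwords
word_bits = c + s                   # 23 bits per word (control signals + next state)
print(words * word_bits // 8 // 1024)   # ~46 KB, i.e. roughly the "48 Kbytes" above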
Control store has to be fast ⇒ expensive
• Reduce the ROM height (= address bits)
– reduce inputs by extra external logic (each input bit doubles the size of the control store)
– reduce states by grouping opcodes (find common sequences of actions)
– condense input status bits (combine all exceptions into one, i.e., exception/no-exception)
• Reduce the ROM width
– restrict the next-state encoding (Next, Dispatch on opcode, Wait for memory, ...)
– encode control signals (vertical microcode)
• Room change/decision:
– Either Soda 306 (here) for all but two lectures, back in 320 for those
– Or 182 Dwinelle for all lectures
– Vote ?
• Class schedule almost up to date (but might need to change quiz depending on vote above)
• Lab 1 coming out on Thursday, together with PS1
• Designed to help you learn the material and practice for the quiz
• Must hand in the day before the section that reviews the problem set, which will be shortly before the quiz
• Solutions handed out in the review section
• Quiz assumes you have worked through the problem set
• Problem sets graded on a 0,1,2 scale
• Can collaborate to understand problem sets, but must turn in your own solution. Some problems are repeated from earlier years – do not copy solutions. Quiz problems will not be repeated…
• Each student must complete the directed portion of the lab by themselves. OK to collaborate to understand how to run labs
– Class news group info on web site.
• Can work in a group of up to 3 students for the open-ended portion of each lab
– OK to be in a different group for each lab – just make sure to label participants’ names clearly on each turned-in lab section
[Figure: controller with explicit next-state encoding. The opcode is mapped by an external encoder to an op-group (“absolute”) address; this input encoding reduces ROM height. The µPC (state) addresses the control ROM, whose data supply the 17 control signals plus a JumpType = next | spin | fetch | dispatch | feqz | fnez; this next-state encoding reduces ROM width. Jump logic selects PCSrc from PC+1, the absolute address, or the op-group address, using the zero and busy status bits.]
PCSrc = Case JumpType
  next:      PC+1
  spin:      if (busy) then PC else PC+1
  fetch:     absolute
  dispatch:  op-group
  feqz:      if (zero) then absolute else PC+1
  fnez:      if (zero) then PC+1 else absolute
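A minimal Python sketch of this µPC-selection logic (the "absolute" and "op_group" target addresses are parameters here; how they are produced is shown in the figure above):

def next_upc(upc, jump_type, absolute, op_group, zero, busy):
    """Select the next microprogram counter according to the JumpType."""
    if jump_type == 'next':
        return upc + 1
    if jump_type == 'spin':
        return upc if busy else upc + 1
    if jump_type == 'fetch':
        return absolute                     # µaddress of the fetch sequence
    if jump_type == 'dispatch':
        return op_group                     # µaddress of this opcode's group
    if jump_type == 'feqz':
        return absolute if zero else upc + 1
    if jump_type == 'fnez':
        return upc + 1 if zero else absolute
    raise ValueError(jump_type)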
MIPS-Controller-2

State    Control points                 next-state
fetch0   MA ← PC                        next
fetch1   IR ← Memory                    spin
fetch2   A ← PC                         next
fetch3   PC ← A + 4                     dispatch
...
ALU0     A ← Reg[rs]                    next
ALU1     B ← Reg[rt]                    next
ALU2     Reg[rd] ← func(A,B)            fetch

ALUi0    A ← Reg[rs]                    next
ALUi1    B ← sExt16(Imm)                next
ALUi2    Reg[rd] ← Op(A,B)              fetch
MIPS-Controller-2

State    Control points                 next-state
LW0      A ← Reg[rs]                    next
LW1      B ← sExt16(Imm)                next
LW2      MA ← A + B                     next
LW3      Reg[rt] ← Memory               spin
LW4                                     fetch

SW0      A ← Reg[rs]                    next
SW1      B ← sExt16(Imm)                next
SW2      MA ← A + B                     next
SW3      Memory ← Reg[rt]               spin
SW4                                     fetch
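As an illustration of the spin jump type, here is a toy Python walk-through of the LW microroutine above (my own model: the 2-cycle memory wait and all register/memory values are made up):

def sext16(x):
    """Sign-extend a 16-bit immediate."""
    return x - 0x10000 if x & 0x8000 else x

def lw(m, memory, busy_cycles=2):
    m['A'] = m['R'][m['rs']]              # LW0: A  <- Reg[rs]          next
    m['B'] = sext16(m['imm'])             # LW1: B  <- sExt16(Imm)      next
    m['MA'] = m['A'] + m['B']             # LW2: MA <- A + B            next
    cycles = 3
    while busy_cycles > 0:                # LW3: spin while memory is busy
        busy_cycles -= 1
        cycles += 1
    m['R'][m['rt']] = memory[m['MA']]     # LW3 completes: Reg[rt] <- Memory
    return cycles + 1                     # LW4: return to the fetch sequence

m = {'R': [0, 0x100, 0], 'rs': 1, 'rt': 2, 'imm': 0xFFFC}   # imm = -4
memory = {0xFC: 42}
print(lw(m, memory), m['R'][2])           # cycles spent in LW, loaded value 42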
MIPS-Controller-2

State    Control points                 next-state
BEQZ0    A ← Reg[rs]                    next
BEQZ1                                   fnez
BEQZ2    A ← PC                         next
BEQZ3    B ← sExt16(Imm<<2)             next
BEQZ4    PC ← A + B                     fetch

BNEZ0    A ← Reg[rs]                    next
BNEZ1                                   feqz
BNEZ2    A ← PC                         next
BNEZ3    B ← sExt16(Imm<<2)             next
BNEZ4    PC ← A + B                     fetch
MIPS-Controller-2

State    Control points                 next-state
J0       A ← PC                         next
J1       B ← IR                         next
J2       PC ← JumpTarg(A,B)             fetch

JR0      A ← Reg[rs]                    next
JR1      PC ← A                         fetch

JAL0     A ← PC                         next
JAL1     Reg[31] ← A                    next
JAL2     B ← IR                         next
JAL3     PC ← JumpTarg(A,B)             fetch

JALR0    A ← PC                         next
JALR1    B ← Reg[rs]                    next
JALR2    Reg[31] ← A                    next
JALR3    PC ← B                         fetch
[Figure: the same bus-based MIPS datapath as before, reused to implement complex memory-operand instructions.]
rd ← M[(rs)] op (rt)              Reg-Memory-src ALU op
M[(rd)] ← (rs) op (rt)            Reg-Memory-dst ALU op
M[(rd)] ← M[(rs)] op M[(rt)]      Mem-Mem ALU op
MIPS-Controller-2

Mem-Mem ALU op:  M[(rd)] ← M[(rs)] op M[(rt)]

State     Control points                 next-state
ALUMM0    MA ← Reg[rs]                   next
ALUMM1    A ← Memory                     spin
ALUMM2    MA ← Reg[rt]                   next
ALUMM3    B ← Memory                     spin
ALUMM4    MA ← Reg[rd]                   next
ALUMM5    Memory ← func(A,B)             spin
ALUMM6                                   fetch

Complex instructions usually do not require datapath modifications in a microprogrammed implementation
-- only extra space for the control program
Implementing these instructions using a hardwired controller is difficult without datapath modifications
Microprogrammed control ⇒ multiple cycles per instruction

Cycle time?   tC > max(t_reg-reg, t_ALU, t_ROM)

Suppose 10 × t_ROM < t_RAM

Good performance, relative to a single-cycle hardwired implementation, can be achieved even with a CPI of 10
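Illustrative arithmetic for this claim, with made-up latencies:

t_rom, t_ram = 40, 500               # ns, hypothetical µROM and main-memory access times
microcoded   = 10 * t_rom            # ~10 short microcycles per instruction
single_cycle = t_ram                 # one long cycle per instruction, set by memory
print(microcoded, single_cycle)      # 400 ns vs 500 ns per instruction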
[Figure: horizontal µcode has a few wide instructions; vertical µcode has many narrow instructions (bits per instruction vs. # instructions).]
• Horizontal µcode has wider µinstructions
– Multiple parallel operations per µinstruction
– Fewer steps per macroinstruction
– Sparser encoding ⇒ more bits
• Vertical µcode has narrower µinstructions
– Typically a single datapath operation per µinstruction
– separate µinstruction for branches
– More steps per macroinstruction
– More compact ⇒ fewer bits
• Nanocoding
– Tries to combine best of horizontal and vertical µcode
Nanocoding exploits recurring control signal patterns in µcode, e.g.,
  ALU0   A ← Reg[rs]
  ...
  ALUi0  A ← Reg[rs]
  ...
[Figure: the µPC (state) addresses the µcode ROM, whose entries hold a nanoinstruction address plus next-state; the nanoaddress selects an entry in the nanoinstruction ROM, which drives the control signals.]
• MC68000 had 17-bit µcode containing either a 10-bit jump or a 9-bit nanoinstruction pointer
– Nanoinstructions were 68 bits wide, decoded to give 196 control signals
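Illustrative arithmetic (the microinstruction and nanoinstruction counts are hypothetical; only the 17-bit and 68-bit widths come from the MC68000 example above) for why nanocoding saves control-store bits:

n_uinsts, n_nanos = 1024, 256        # hypothetical: many µinstructions share few patterns
horizontal = n_uinsts * 68           # every µinstruction stores its full control pattern
nanocoded  = n_uinsts * 17 + n_nanos * 68   # short pointer + shared nanoinstruction ROM
print(horizontal, nanocoded)         # 69632 vs 34816 bits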
                          M30      M40      M50      M65
Datapath width (bits)     8        16       32       64
µinst width (bits)        50       52       85       87
µcode size (K µinsts)     4        4        2.75     2.75
µstore technology         CCROS    TCROS    BCROS    BCROS
µstore cycle (ns)         750      625      500      200
memory cycle (ns)         1500     2500     2000     750
Rental fee ($K/month)     4        7        15       35

Only the fastest models (75 and 95) were hardwired
• IBM initially miscalculated the importance of software compatibility with earlier models when introducing the 360 series
• Honeywell stole some IBM 1401 customers by offering translation software (“Liberator”) for the Honeywell H200 series machine
• IBM retaliated with optional additional microcode for the 360 series that could emulate the IBM 1401 ISA, later extended for the IBM 7000 series
– one popular program on the 1401 was a 650 simulator, so some customers ran many 650 programs on emulated 1401s
– (650 simulated on 1401 emulated on 360)
• Significantly faster ROMs than DRAMs were available
• For complex instruction sets, datapath and controller were cheaper and simpler
• New instructions, e.g., floating point, could be supported without datapath modifications
• Fixing bugs in the controller was easier
• ISA compatibility across various models could be achieved easily and cheaply

Except for the cheapest and fastest machines, all computers were microprogrammed
• Implement control store in RAM not ROM
– MOS SRAM memories now almost as fast as control store (core memories/DRAMs were 2-10x slower)
– Bug-free microprograms difficult to write
• User-WCS provided as option on several minicomputers
– Allowed users to change microcode for each processor
• User-WCS failed
– Little or no programming tools support
– Difficult to fit software into small space
– Microcode control tailored to original ISA, less useful for others
– Large WCS part of processor state - expensive context switches
– Protection difficult if user can change microcode
– Virtual memory required restartable microcode
• Evolution bred more complex micro-machines
– Complex instruction sets led to need for subroutine and call stacks in µcode
– Need for fixing bugs in control programs was in conflict with read-only nature of µROM
⇒ WCS (B1700, QMachine, Intel i432, …)
• With the advent of VLSI technology, assumptions about ROM & RAM speed became invalid ⇒ more complexity
• Better compilers made complex instructions less important
• Use of numerous micro-architectural innovations, e.g., pipelining, caches and buffers, made multiple-cycle execution of reg-reg instructions unattractive
• Looking ahead to RISC next time
– Use chip area to build fast instruction cache of user-visible vertical microinstructions – use software subroutines, not hardware microroutines
– Use simple ISA to enable hardwired pipelined implementation
• Microprogramming is far from extinct
• Played a crucial role in micros of the Eighties (DEC µVAX, Motorola 68K series, Intel 386 and 486)
• Microcode plays an assisting role in most modern micros (AMD Opteron, Intel Core 2 Duo, Intel Atom, IBM PowerPC)
• Most instructions are executed directly, i.e., with hard-wired control
• Infrequently-used and/or complicated instructions invoke the microcode engine
• Patchable microcode common for post-fabrication bug fixes, e.g. Intel processors load µcode patches at bootup
• These slides contain material developed and copyright by:
– Arvind (MIT)
– Krste Asanovic (MIT/UCB)
– Joel Emer (Intel/MIT)
– James Hoe (CMU)
– John Kubiatowicz (UCB)
– David Patterson (UCB)
• MIT material derived from course 6.823
• UCB material derived from course CS252