Microprogramming The DLX ISA

advertisement
Krste Asanovic
February 20, 2001
6.823, L4--1
Microprogramming
Krste Asanovic
Laboratory for Computer Science
M.I.T.
http://www.csg.lcs.mit.edu/6.823
The DLX ISA
Processor State
32 32-bit GPRs, R0 always contains a 0
32 single precision FPRs, may also be viewed as
16 double precision FPRs
FP status register, used for FP compares & exceptions
PC, the program counter
some other special registers
Data types
8-bit byte, 2-byte half word
32-bit word for integers
32-bit word for single precision floating point
64-bit word for double precision floating point
Load/Store style instruction set
data addressing modes- immediate & indexed
branch addressing modes- PC relative & register indirect
Byte addressable memory- big-endian mode
All instructions are 32 bits
(See Chapter 2, H&P for full description)
Page 1
Krste Asanovic
February 20, 2001
6.823, L4--2
Krste Asanovic
February 20, 2001
6.823, L4--3
Microarchitecture
Implementation of an ISA
control
points
Controller
status
lines
Data
path
Structure: How components are connected.
Static
Sequencing: How data moves between components
Dynamic
A Bus-based Datapath for DLX
Opcode
ldIR
busy
zero?
OpSel
ldA
32(PC)
31(Link)
rf3
rf2
rf1
ldB
2 /
/
IR
ExtSel Imm
/
2
Ext
enImm
rf3
rf2
rf1
Krste Asanovic
February 20, 2001
6.823, L4--4
RegSel
ldMA
MA
3
A
ALU
control
32 GPRs
+ PC ...
ALU
addr
addr
B
32-bit Reg
enALU
data
RegWrt
enReg
Memory
MemWrt
data
enMem
Bus 32
Microinstruction: register to register transfer (17 control signals)
means
RegSel = PC; enReg=yes; ldMA= yes
MA ← PC
B ← Reg[rf2] means
RegSel = rf2; enReg=yes; ldB = yes
Page 2
Krste Asanovic
February 20, 2001
6.823, L4--5
Memory Module
addr
busy
RAM
din
we
Write/Read
Enable
dout
bus
We will assume that Memory operates asynchronously
and is slow as compared to Reg-to-Reg transfers
Krste Asanovic
February 20, 2001
6.823, L4--6
Instruction Execution
Execution of a DLX instruction involves
1. instruction fetch
2. decode and register fetch
3. ALU operation
4. memory operation (optional)
5. write back to register file (optional)
and the computation of the address of the
next instruction
Page 3
Krste Asanovic
February 20, 2001
6.823, L4--7
Microcontrol Unit
Maurice Wilkes, 1954
Embed the control logic state table in a memory array
op
conditional
code flip-flop
Next state
µ address
Matrix A
Matrix B
Decoder
Control lines to
ALU, MUXs, Registers
DLX ALU Instructions
6
5
opcode rf1
5
rf2
5
rf3
Krste Asanovic
February 20, 2001
6.823, L4--8
16
function
Register-Register form:
Reg[rf3] ← function(Reg[rf1], Reg[rf2])
6
5
opcode rf1
5
rf2
16
immediate
Register-Immediate form:
Reg[rf2] ← function(Reg[rf1], SignExt(immediate))
Page 4
Microprogram Fragments
instr fetch: MA ← PC
IR ← Memory
A ← PC
PC ← A + 4
dispatch on OPcode
Krste Asanovic
February 20, 2001
6.823, L4--9
can be
treated as
a macro
ALU:
A ← Reg[rf1]
B ← Reg[rf2]
Reg[rf3] ← func(A,B)
do instruction fetch
ALUi:
A ← Reg[rf1]
sign extention ...
B ← Imm
Reg[rf2] ← Opcode(A,B)
do instruction fetch
DLX Load/Store Instructions
6
5
opcode rf1
base
5
rf2
Krste Asanovic
February 20, 2001
6.823, L4--10
16
displacement
Load/store byte, halfword, word to/from GPR:
LB, LBU, SB, LH, LHU, SH, LW, SW
byte and half-word can be sign or zero extended
Load/store single and double FP to/from FPR:
LF, LD, SF, SD
• Byte addressable machine
• Memory access must be data aligned
• A single addressing mode
(base) + displacement
• Big-endian byte ordering
31
0
0
1
2
Page 5
3
Krste Asanovic
February 20, 2001
6.823, L4--11
DLX Control Instructions
Conditional branch on GPR
6
5
opcode rf1
5
16
offset from PC+4
BEQZ, BNEZ
Unconditional register-indirect jumps
6
5
opcode rf1
5
16
JR, JALR
Unconditional PC-relative jumps
6
opcode
26
offset from PC+4
J, JAL
• PC-offset are specified in bytes
• jump-&-link (JAL) stores PC+4 into the link register (R31)
• (Real DLX has delayed branches – ignored this lecture)
Microprogram Fragments (cont.)
LW:
J:
beqz:
bz-taken:
A ← Reg[rf1]
B ← Imm
MA ← A + B
Reg[rf2] ← Memory
do instruction fetch
A ← PC
B ← Imm
PC ← A + B
do instruction fetch
A ← Reg[rf1]
If zero?(A) then go to bz-taken
do instruction fetch
A ← PC
B ← Imm
PC ← A + B
do instruction fetch
Page 6
Krste Asanovic
February 20, 2001
6.823, L4--12
DLX Microcontroller: first attempt
Opcode
zero?
Busy (memory)
6
µPC (state)
latching the inputs
may cause a
one-cycle delay
s
addr
ROM Size ?
How big is “s”?
Krste Asanovic
February 20, 2001
6.823, L4--13
s
µProgram ROM
2(opcode+status+s) words
word = (control+s) bits
data
next state
17 Control Signals
Microprogram in the ROM
Krste Asanovic
February 20, 2001
6.823, L4--14
worksheet
State Op
zero?
busy
Control points
next-state
fetch0
fetch1
fetch1
fetch2
fetch3
*
*
*
*
*
*
*
*
*
*
*
yes
no
*
*
MA ← PC
....
IR ← Memory
A ← PC
PC ← A + 4
fetch1
fetch1
fetch2
fetch3
?
ALU0
ALU1
ALU2
*
*
*
*
*
*
*
*
*
A ← Reg[rf1]
B ← Reg[rf2]
Reg[rf3]← func(A,B)
ALU1
ALU2
fetch0
Page 7
Krste Asanovic
February 20, 2001
6.823, L4--15
Microprogram in the ROM
State Op
fetch0
fetch1
fetch1
fetch2
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
fetch3
...
ALU0
ALU1
ALU2
zero?
busy
Control points
next-state
*
*
*
*
ALU
ALUi
LW
SW
J
JAL
JR
JALR
beqz
*
*
*
*
*
*
*
*
*
*
*
*
*
*
yes
no
*
*
*
*
*
*
*
*
*
*
MA ← PC
....
IR ← Memory
A ← PC
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
PC ← A + 4
fetch1
fetch1
fetch2
fetch3
ALU0
ALUi0
LW0
SW0
J0
JAL0
JR0
JALR0
beqz0
*
*
*
*
*
*
*
*
*
A ← Reg[rf1]
B ← Reg[rf2]
Reg[rf3]← func(A,B)
ALU1
ALU2
fetch0
Microprogram in the ROM Cont.
State Op
ALUi0
ALUi1
ALUi1
ALUi2
...
J0
J1
J2
...
beqz0
beqz1
beqz1
beqz2
beqz3
...
zero?
busy
Control points
Krste Asanovic
February 20, 2001
6.823, L4--16
next-state
*
sExt
uExt
*
*
*
*
*
*
*
*
*
A ← Reg[rf1]
B ← sExt16(Imm)
B ← uExt16(Imm)
Reg[rf3]← Op(A,B)
ALUi1
ALUi2
ALUi2
fetch0
*
*
*
*
*
*
*
*
*
A ← PC
B ← sExt26(Imm)
PC ← A+B
J1
J2
fetch0
*
*
*
*
*
*
yes
no
*
*
*
*
*
*
*
A ← Reg[rf1]
A ← PC
....
B ← sExt16(Imm)
PC ← A+B
beqz1
beqz2
fetch0
beqz3
fetch0
Page 8
Krste Asanovic
February 20, 2001
6.823, L4--17
Size of Control Store
w
/
status &
opcode
µPC
addr
size = 2(w+s) x (c + s)
Control
ROM
data
/ s
next
µPC
Control signals / c
DLX
w = 6+2
c = 17
s=?
no. of steps per opcode = 4 to 6 + fetch-sequence
no. of states ≈ (4 steps per op-group ) x op-groups
+ common sequences
= 4 x 8 + 10 states = 42 states
⇒s=6
Control ROM = 2(8+6) x 23 bits ≈ 48 Kbytes
Krste Asanovic
February 20, 2001
6.823, L4--18
Reducing the Size of Control Store
Control store has to be fast ⇒ expensive
Reduce the ROM height (= address bits)
⇒ reduce inputs by extra external logic
each input bit doubles the size of the
control store
⇒ reduce states by grouping opcodes
find common sequences of actions
⇒ condense input status bits
combine all exceptions into one, i.e.,
exception/no-exception
Reduce the ROM width
⇒ restrict the next-state encoding
Next, Dispatch on opcode, Wait for memory, ...
⇒ encode control signals (vertical microcode)
Page 9
Krste Asanovic
February 20, 2001
6.823, L4--19
DLX Microcontroller-2
Opcode
absolute
ext
op-group
Reduced ROM
height by
encoding inputs
µPC+1
µPC
+1
µPC (state)
µPCSrc
zero
address
jump
logic
Control ROM
busy
Reduce ROM
width by
encoding
next-state
data
17 Control Signals
µJumpType
(next, spin, fetch,
dispatch, feqz, fnez )
Krste Asanovic
February 20, 2001
6.823, L4--20
Jump Logic
µPCSrc = Case µJumpTypes
next
⇒
µPC+1
spin
⇒
if (busy) then µPC else µPC+1
fetch
⇒
absolute
dispatch
⇒
op-group
feqz
⇒
if (zero) then absolute else µPC+1
fnez
⇒
if (zero) then µPC+1 else absolute
Page 10
DLX-Controller-2
State
fetch0
fetch1
fetch2
fetch3
...
ALU0
ALU1
ALU2
...
J0
J1
J2
...
BEQZ0
BEQZ1
BEQZ2
BEQZ3
BEQZ4
worksheet
Control points
MA ← PC
IR ← Memory
A ← PC
PC ← A + 4
Krste Asanovic
February 20, 2001
6.823, L4--21
next-state
A ← Reg[rf1]
B ← Reg[rf2]
Reg[rf3]← func(A,B)
A ← PC
B ← sExt26(Imm)
PC ← A+B
A ← Reg[rf1]
A ← PC
B ← sExt16(Imm)
PC ← A+B
Krste Asanovic
February 20, 2001
6.823, L4--22
Instruction Fetch & ALU: DLX-Controller-2
State
Control points
fetch0
fetch1
fetch2
fetch3
...
ALU0
ALU1
ALU2
MA ← PC
IR ← Memory
A ← PC
PC ← A + 4
next
spin
next
dispatch
A ← Reg[rf1]
B ← Reg[rf2]
Reg[rf3]← func(A,B)
next
next
fetch
ALUi0
ALUi1
ALUi2
A ← Reg[rf1]
B ← sExt16(Imm)
Reg[rf3]← Op(A,B)
next
next
fetch
Page 11
next-state
Load & Store: DLX-Controller-2
State
Control points
LW0
LW1
LW2
LW3
LW4
A ← Reg[rf1]
B ← sExt16(Imm)
MA ← A+B
Reg[rf2] ← Memory
next
next
next
spin
fetch
SW0
SW1
SW2
SW3
SW4
A ← Reg[rf1]
B ← sExt16(Imm)
MA ← A+B
Memory ← Reg[rf2]
next
next
next
spin
fetch
next-state
Branches: DLX-Controller-2
State
Control points
BEQZ0
BEQZ1
BEQZ2
BEQZ3
BEQZ4
A ← Reg[rf1]
BNEZ0
BNEZ1
BNEZ2
BNEZ3
BNEZ4
A ← Reg[rf1]
A ← PC
B ← sExt16(Imm)
PC ← A+B
A ← PC
B ← sExt16(Imm)
PC ← A+B
Page 12
Krste Asanovic
February 20, 2001
6.823, L4--23
next-state
next
fnez
next
next
fetch
next
feqz
next
next
fetch
Krste Asanovic
February 20, 2001
6.823, L4--24
Jumps: DLX-Controller-2
State
Control points
J0
J1
J2
A ← PC
B ← sExt26(Imm)
PC ← A+B
next
next
fetch
JR0
PC ←Reg[rf1]
fetch
JAL0
JAL1
JAL2
JAL3
Reg[31] ← PC
A ← PC
B ← sExt26(Imm)
PC ← A+B
next
next
next
fetch
JALR0
JALR1
Reg[31] ← PC
PC ←Reg[rf1]
next
fetch
next-state
Nanocoding
µPC (state)
Exploits recurring
control signal patterns
in µcode, e.g.,
ALU0 A ← Reg[rf1]
...
ALUi0 A ← Reg[rf1]
...
µaddress
µcode ROM
nanoaddress
nanoinstruction ROM
data
17 Control Signals
• Place common control signal patterns into
nanoinstruction table
• µcode word contains pointer to nanoinstruction
• Higher latency to get control signals out
Page 13
Krste Asanovic
February 20, 2001
6.823, L4--25
Krste Asanovic
February 20, 2001
6.823, L4--26
µcode
next-state
Krste Asanovic
February 20, 2001
6.823, L4--27
Implementing Complex Instructions
Opcode
ldIR
busy
zero?
ALUop ldA
32(PC)
31(Link)
rf3
rf2
rf1
ldB
ldMA
RegSel
IR
ExSel
enImm
Imm
Ext
A
32 GPRs
+ PC ...
/
6
ALU
addr
addr
B
32-bit Reg
enALU
MA
RegWrt
enReg
data
Memory
MemWrt
enMem
data
Bus
rf3 ← M[(rf1)] op (rf2)
M[(rf3)] ← (rf1) op (rf2)
M[(rf3)] ← M[(rf1)] op M[(rf2)]
Reg-Memory-src ALU op
Reg-Memory-dst ALU op
Mem-Mem ALU op
Mem-Mem ALU Instructions:
DLX-Controller-2
Mem-Mem ALU op
ALUMM0
ALUMM1
ALUMM2
ALUMM3
ALUMM4
ALUMM5
ALUMM6
M[(rf3)] ← M[(rf1)] op M[(rf2)]
MA ← Reg[rf1]
A ← Memory
MA ← Reg[rf2]
B ← Memory
MA ← Reg[rf3]
Memory ← func(A,B)
next
spin
next
spin
next
spin
fetch
Complex instructions usually do not require datapath
modifications in a microprogrammed implementation
-- only extra space for the control program
Implementing these instructions using a hardwired
controller is difficult without datapath modifications
Page 14
Krste Asanovic
February 20, 2001
6.823, L4--28
Krste Asanovic
February 20, 2001
6.823, L4--29
Microprogramming in the Seventies
Thrived because
• Significantly faster ROMs than DRAMs were available
• For complex instruction sets, datapath and controller
were cheaper and simpler
• New instructions , e.g., floating point, could be
supported without datapath modifications
• Fixing bugs in the controller was easier
• ISA compatibility across various models could be
achieved cheaply
Writable Control Store (WCS)
Krste Asanovic
February 20, 2001
6.823, L4--30
Implement control store with SRAM not ROM
• MOS SRAM memories now almost as fast as control store (core
memories/DRAMs were 10x slower)
• Bug-free microprograms difficult to write
WCS provided as option on several minicomputers
• Allowed users to change microcode for each process
WCS failed
• Little or no programming tools support
• Hard to fit software into small space
• Microcode control tailored to original ISA, less useful for others
• Large WCS part of processor state - expensive context switches
• Protection difficult if user can change microcode
• Virtual memory required restartable microcode
Page 15
Performance Issues
Krste Asanovic
February 20, 2001
6.823, L4--31
Microprogrammed control
⇒ multiple cycles per instruction
Cycle time ?
tC > max(treg-reg, tALU, tµROM, tRAM)
Given complex control, tALU & tRAM can be broken
into multiple cycles. However, tµROM cannot be
broken down. Hence
tC > max(treg-reg, tµROM)
Suppose 10 * tµROM < tRAM
good performance, relative to the single-cycle
hardwired implementation, can be achieved
even with a CPI of 10
VLSI & Microprogramming
Krste Asanovic
February 20, 2001
6.823, L4--32
By late seventies
• technology assumption about ROM & RAM
speed became invalid
• micromachines became more complicated
• to overcome slower ROM, micromachines
were pipelined
• complex instruction sets led to the need for
subroutine and call stacks in µcode.
• need for fixing bugs in control programs
was in conflict with read-only nature of µROM
⇒ WCS (B1700, QMachine, Intel432, …)
• introduction of caches and buffers, especially for
instructions, made multiple-cycle execution of
reg-reg instructions unattractive
Page 16
Modern Usage
Krste Asanovic
February 20, 2001
6.823, L4--33
Microprogramming is far from extinct
Played a crucial role in micros of the Eighties,
Motorola 68K (first microcoded microprocessor)
Intel 386 and 486
Microcode is present in most modern CISC micros in an
assisting role (e.g. AMD Athlon, Intel Pentium-4)
• Most instructions are executed directly, i.e., with
hard-wired control
• Infrequently-used and/or complicated instructions
invoke the microcode engine
Patchable microcode common for post-fabrication bug
fixes, e.g. Intel Pentiums load µcode patches at bootup
Page 17
Download