CSC 2405
Computer Systems II
Advanced Topics
Instruction Set Architecture
Instruction Set Architecture

Assembly Language View
– Processor state
    – Registers, memory, …
– Instructions
    – addl, movl, leal, …
    – How instructions are encoded as bytes
Layer of Abstraction
– Above: how to program machine
    – Processor executes instructions in a sequence
– Below: what needs to be built
    – Use variety of tricks to make it run fast
    – E.g., execute multiple instructions simultaneously
[Figure: the ISA as the layer between software (Application Program, Compiler, OS) and hardware (CPU Design, Circuit Design, Chip Layout)]
Chapter 4
Instruction Set
Architectures
Basic ISA Classes
The effect of the different ISA classes is easiest to see in the examples here,
all of which implement the code sequence for C = A + B.

Stack     Accumulator   Register (register-memory)   Register (load-store)
Push A    Load A        Load R1, A                   Load R1, A
Push B    Add B         Add R1, B                    Load R2, B
Add       Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                Store C, R3

Registers are the class that won out. The more registers on the CPU, the better.
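
To make the difference between the classes concrete, here is a minimal sketch of two toy interpreters, one for the stack sequence and one for the load-store sequence. The tuple-based instruction encoding and the function names are assumptions made for illustration; they are not part of any real ISA.

```python
# Toy interpreters for two ISA classes, both computing C = A + B.
# The tuple encoding ("op", operands...) is a hypothetical format for illustration.

def run_stack(program, memory):
    """Stack machine: Add takes its operands implicitly from the top of the stack."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(memory[args[0]])
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack instruction: {op}")
    return memory


def run_load_store(program, memory):
    """Load-store machine: only load/store touch memory; add works on registers."""
    regs = {}
    for op, *args in program:
        if op == "load":
            rd, addr = args
            regs[rd] = memory[addr]
        elif op == "add":
            rd, rs1, rs2 = args
            regs[rd] = regs[rs1] + regs[rs2]
        elif op == "store":
            addr, rs = args
            memory[addr] = regs[rs]
        else:
            raise ValueError(f"unknown load-store instruction: {op}")
    return memory


STACK_PROGRAM = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
LOAD_STORE_PROGRAM = [
    ("load", "R1", "A"),
    ("load", "R2", "B"),
    ("add", "R3", "R1", "R2"),
    ("store", "C", "R3"),
]

stack_result = run_stack(STACK_PROGRAM, {"A": 2, "B": 3})
load_store_result = run_load_store(LOAD_STORE_PROGRAM, {"A": 2, "B": 3})
```

Both programs leave the same value in C. The load-store version names every operand explicitly, whereas the stack version relies on operand order.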
80x86 Instruction Frequency
Rank    Instruction      Frequency
1       load             22%
2       branch           20%
3       compare          16%
4       store            12%
5       add              8%
6       and              6%
7       sub              5%
8       register move    4%
9       call             1%
10      return           1%
Total                    96%
Relative Frequency of Control Instructions
Operation      Integer    Floating Pt
Call/Return    19%        8%
Jumps          6%         10%
Branches       75%        82%

Design hardware to handle branches quickly, since these occur most frequently
CISC Instruction Sets
– Complex Instruction Set Computer
– Dominant style through mid-80’s
Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions
Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
    – requires memory read and write
    – Complex address calculation
Condition codes
– Set as side effect of arithmetic and logical instructions
Philosophy
– Add instructions to perform “typical” programming tasks
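
The address calculation and the memory read and write behind addl %eax, 12(%ebx,%ecx,4) can be sketched as follows. This is a minimal model under stated assumptions: memory is a Python dict keyed by address, values wrap to 32 bits, and the helper names are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # assume 32-bit addresses and data


def effective_address(disp, base, index, scale):
    """Address of the operand disp(base,index,scale): disp + base + index * scale."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("scale must be 1, 2, 4, or 8")
    return (disp + base + index * scale) & MASK32


def addl_eax_to_memory(memory, regs):
    """Model of addl %eax, 12(%ebx,%ecx,4): one memory read and one memory write."""
    addr = effective_address(12, regs["ebx"], regs["ecx"], 4)
    old_value = memory.get(addr, 0)                       # memory read
    memory[addr] = (old_value + regs["eax"]) & MASK32     # memory write
    return addr


example_regs = {"eax": 5, "ebx": 0x1000, "ecx": 3}
example_memory = {0x1018: 10}
example_addr = addl_eax_to_memory(example_memory, example_regs)
```

A load-store ISA would need separate load, add, and store instructions for the same effect, plus extra instructions to form the address.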
RISC Instruction Sets
– Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
Fewer, simpler instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries
Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl
No condition codes
– Test instructions return 0/1 in register
Example RISC Instruction Formats
Register-Register (R-type), e.g. ADD R1, R2, R3
    Bit boundaries: 31-26 | 25-21 | 20-16 | 15-11 | 10-6 | 5-0
    Fields (high to low): Op, rs1, rs2, rd, func
    (ALU reg. operations, read/write special registers and moves)
Register-Immediate (I-type), e.g. SUB R1, R2, #3
    Bit boundaries: 31-26 | 25-21 | 20-16 | 15-0
    Fields (high to low): Op, rs1, rd, immediate
    (ALU imm. operations, loads and stores, conditional branch, jump (and link))
Jump / Call (J-type), e.g. JUMP end
    Bit boundaries: 31-26 | 25-0
    Fields (high to low): Op, offset added to PC
    (jump, jump and link, trap and return from exception)
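
As an illustration of fixed-format encoding, the sketch below packs and unpacks the I-type layout (Op in bits 31-26, rs1 in 25-21, rd in 20-16, immediate in 15-0). The opcode value used for SUB and the sign extension of the immediate are assumptions for the example, not values taken from the slides.

```python
SUB_IMM_OPCODE = 0x0A  # hypothetical opcode, chosen only for this example


def encode_i_type(op, rs1, rd, imm):
    """Pack an I-type instruction into a 32-bit word."""
    if not (0 <= op < 64 and 0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("field out of range")
    if not (-32768 <= imm <= 32767):
        raise ValueError("immediate does not fit in 16 bits")
    return (op << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)


def decode_i_type(word):
    """Unpack a 32-bit word into (op, rs1, rd, imm); imm is sign-extended (assumed)."""
    op = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:
        imm -= 0x10000
    return op, rs1, rd, imm


# SUB R1, R2, #3: destination R1, source R2, immediate 3
sub_word = encode_i_type(SUB_IMM_OPCODE, rs1=2, rd=1, imm=3)
```

Because every field sits at a fixed bit position, the decoder can extract the register numbers with shifts and masks alone, without first working out which instruction it is looking at.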
CISC vs. RISC
Original Debate
– Strong opinions!
– CISC proponents: easy for compiler, fewer code bytes
– RISC proponents: better for optimizing compilers, can make run fast with simple chip design
Current Status
– For desktop processors, choice of ISA not a technical issue
    – With enough hardware, can make anything run fast
    – Code compatibility more important
– For embedded processors, RISC makes sense
    – Smaller, cheaper, less power
Logic Design
Overview of Logic Design
Fundamental Hardware Requirements
– Communication
    – How to get values from one place to another
– Computation
– Storage
Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
    – Low or high voltage on wire
– Computation
    – Compute Boolean functions
– Storage
    – Store bits of information
Digital Signals
[Figure: voltage versus time for a signal read as 0, then 1, then 0]
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
    – Either high range (1) or low range (0)
    – With guard range between them
– Not strongly affected by noise or low quality circuit elements
    – Can make circuits simple, small, and fast
Computing with Logic Gates
And: inputs a, b; out = a && b
Or: inputs a, b; out = a || b
Not: input a; out = !a
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
    – With some, small delay
[Figure: voltage versus time for a, b, and a && b, showing the rising delay and falling delay of the output]
Combinational Circuits
[Figure: acyclic network with primary inputs on one side and primary outputs on the other]
Acyclic Network of Logic Gates
– Continuously responds to changes on primary inputs
– Primary outputs become (after some delay) Boolean functions of primary inputs
Bit Equality
[Figure: "Bit equal" circuit with inputs a and b and output eq]
HCL Expression
bool eq = (a&&b)||(!a&&!b)
– Generate 1 if a and b are equal
Hardware Control Language (HCL)
– Very simple hardware description language
    – Boolean operations have syntax similar to C logical operations
– We’ll use it to describe control logic for processors
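
The HCL expression can be checked exhaustively, since there are only four input combinations. The sketch below mirrors it with Python's logical operators; the function name bit_eq is an assumption for illustration.

```python
from itertools import product


def bit_eq(a, b):
    """Python mirror of the HCL expression (a&&b)||(!a&&!b)."""
    return (a and b) or (not a and not b)


# Full truth table: maps each (a, b) pair to the circuit output.
bit_eq_table = {(a, b): bit_eq(a, b) for a, b in product([False, True], repeat=2)}
```

The output is true exactly when the two inputs match, which is the complement of exclusive-or.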
Word Equality
[Figure: 32 "Bit equal" circuits compare a31/b31, a30/b30, …, a1/b1, a0/b0; their outputs eq31, eq30, …, eq1, eq0 are combined into Eq]
Word-Level Representation
[Figure: a single "=" block with inputs A and B and output Eq]
HCL Representation
bool Eq = (A == B)
– 32-bit word size
– HCL representation
    – Equality operation
    – Generates Boolean value
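
A word-level comparison can be built from 32 bit-level comparisons whose outputs are ANDed together. The sketch below assumes that structure and a default width of 32 bits; the function names are hypothetical.

```python
def bit_eq(a, b):
    """Single-bit equality, as in the HCL expression (a&&b)||(!a&&!b)."""
    return (a and b) or (not a and not b)


def word_eq(word_a, word_b, width=32):
    """AND together the bit-equal outputs eq0 .. eq(width-1)."""
    result = True
    for i in range(width):
        a_i = bool((word_a >> i) & 1)
        b_i = bool((word_b >> i) & 1)
        result = result and bit_eq(a_i, b_i)
    return result
```

In HCL the whole structure collapses to the single expression A == B; the loop only spells out what the hardware does in parallel.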
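1-Bit Latch
D Latch
[Figure: D latch with inputs Data (D) and Clock (C), internal signals R and S, and outputs Q+ and Q–]
Latching (C = 1)
[Figure: with D = d, R = !d and S = d, so Q+ = d and Q– = !d]
Storing (C = 0)
[Figure: R = 0 and S = 0, so the outputs hold Q+ = q and Q– = !q whatever the value of D]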
Registers
Structure
[Figure: eight latches, each with inputs D and C and output Q+; data inputs i7 … i0 form word I, outputs o7 … o0 form word O, and all share one Clock]
– Stores word of data
    – Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock
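
Edge-triggered loading can be sketched as a small class that remembers the previous clock level and only captures its input on a 0 to 1 transition. This is a behavioral model only; the class name and the 8-bit default width are assumptions.

```python
class Register:
    """Behavioral model of an edge-triggered register (default 8 bits wide)."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1
        self.output = 0
        self._prev_clock = 0

    def step(self, data_in, clock):
        """Present data_in and the clock level; return the current output."""
        if self._prev_clock == 0 and clock == 1:
            # Rising edge: capture the input word.
            self.output = data_in & self.mask
        self._prev_clock = clock
        return self.output
```

Between rising edges the output stays fixed even if the input changes, which is what lets a register act as a barrier between blocks of combinational logic.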
Random-Access Memory
[Figure: register file with read ports A (address srcA, output valA) and B (address srcB, output valB), write port W (address dstW, input valW), and Clock]
– Stores multiple words of memory
    – Address input specifies which word to read or write
– Register file
    – Holds values of program registers
    – %eax, %esp, etc.
    – Register identifier serves as address
        – ID 8 implies no read or write performed
– Multiple Ports
    – Can read and/or write multiple words in one cycle
        – Each has separate address and data input/output
Basic Logic Gates
NOTE: okay to use just a circle for NOT: 
More than 2 Inputs?
AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.
Can implement with multiple two-input gates
Logical Completeness
Can implement ANY truth table with AND, OR, NOT.

A  B  C  |  D
0  0  0  |  0
0  0  1  |  0
0  1  0  |  1
0  1  1  |  0
1  0  0  |  0
1  0  1  |  1
1  1  0  |  0
1  1  1  |  0

1. AND combinations that yield a "1" in the truth table.
2. OR the results of the AND gates.
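
The two-step recipe is the sum-of-products construction, sketched below for the truth table on this slide. The function and variable names are assumptions; the table values are those shown above.

```python
TRUTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}


def sum_of_products(table):
    """Build a circuit from AND, OR, NOT that reproduces the given truth table."""
    # Step 1: one AND gate per row whose output is 1.
    minterms = [inputs for inputs, out in table.items() if out == 1]

    def circuit(*inputs):
        # Step 2: OR the AND gates together.
        for term in minterms:
            # Each AND gate takes an input directly where the row has 1,
            # and through a NOT where the row has 0.
            if all((x if want else not x) for x, want in zip(inputs, term)):
                return 1
        return 0

    return minterms, circuit


minterms, circuit = sum_of_products(TRUTH_TABLE)
```

For this table the result is D = (!A && B && !C) || (A && !B && C): two three-input AND gates feeding one OR gate.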
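DeMorgan's Law
Converting AND to OR (with some help from NOT)
Consider the following gate:
[Figure: gate diagram]

A B  |  NOT A  |  NOT B  |  (NOT A) AND (NOT B)  |  NOT((NOT A) AND (NOT B))
0 0  |    1    |    1    |           1           |             0
0 1  |    1    |    0    |           0           |             1
1 0  |    0    |    1    |           0           |             1
1 1  |    0    |    0    |           0           |             1

The last column is the same as A OR B.
To convert AND to OR (or vice versa), invert inputs and output.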
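
With only four input combinations, the law can be verified by enumeration. The sketch below rebuilds the columns of the table (NOT A, NOT B, their AND, and the inverted result) and compares the last column with A OR B; the variable names are assumptions.

```python
from itertools import product

demorgan_rows = []
for a, b in product([0, 1], repeat=2):
    not_a = 1 - a
    not_b = 1 - b
    and_of_nots = not_a & not_b      # AND gate with inverted inputs
    inverted_output = 1 - and_of_nots  # invert the output as well
    demorgan_rows.append((a, b, not_a, not_b, and_of_nots, inverted_output))

# The inverted-input, inverted-output AND gate behaves like an OR gate.
matches_or = all(row[5] == (row[0] | row[1]) for row in demorgan_rows)
```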
Decoder
n inputs, 2^n outputs
– exactly one output is 1 for each possible input pattern
[Figure: 2-bit decoder]
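
A decoder's behavior can be sketched in a few lines: read the n input bits as a binary number and raise the output with that index. The bit ordering (most significant bit first) and the function name are assumptions for illustration.

```python
def decoder(inputs):
    """n-to-2^n decoder. inputs is a list of bits, most significant first (assumed)."""
    value = 0
    for bit in inputs:
        value = (value << 1) | bit
    # Exactly one output is 1: the one whose index equals the input pattern.
    return [1 if i == value else 0 for i in range(2 ** len(inputs))]
```

For example, the 2-bit input pattern 1, 0 selects output number 2 of 4.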
Sequential Processors
Sequential HW Structure
State
– Program counter register (PC)
– Condition code register (CC)
– Register File
– Memories
    – Access same memory space
    – Data: for reading/writing program data
    – Instruction: for reading instructions
Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter
[Figure: sequential datapath, bottom to top. PC; Fetch (instruction memory and PC increment, producing icode, ifun, rA, rB, valC, valP); Decode (register file with ports A, B, M, E; signals srcA, srcB, dstA, dstB; outputs valA, valB); Execute (ALU and CC; inputs aluA, aluB; outputs valE, Bch); Memory (data memory; Addr, Data; output valM); Write back (valE, valM); newPC]
Sequential Stages
Fetch
– Read instruction from instruction memory
Decode
– Read program registers
Execute
– Compute value or address
Memory
– Read or write data
Write Back
– Write program registers
PC
– Update program counter
[Figure: the sequential datapath diagram again, with stages Fetch, Decode, Execute, Memory, Write back, and the PC update labeled]
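Instruction Decoding
[Figure: example instruction bytes "5 0", "rA rB", "D" split into icode, ifun, rA, rB, valC; the register byte and the constant word are marked Optional]
Instruction Format
– Instruction byte: icode:ifun
– Optional register byte: rA:rB
– Optional constant word: valC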
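
Splitting an instruction into its fields can be sketched as below. The sketch assumes 4-bit icode, ifun, rA, and rB fields and a 4-byte little-endian constant word, as in the Y86 encoding; whether the optional parts are present is passed in as flags here, whereas real control logic would derive that from icode.

```python
def decode_instruction(instr_bytes, has_reg_byte=True, has_const=True):
    """Split raw bytes into icode, ifun, rA, rB, valC (None for absent fields)."""
    icode = instr_bytes[0] >> 4      # high nibble of the instruction byte
    ifun = instr_bytes[0] & 0xF      # low nibble of the instruction byte
    pos = 1
    rA = rB = valC = None
    if has_reg_byte:
        rA = instr_bytes[pos] >> 4
        rB = instr_bytes[pos] & 0xF
        pos += 1
    if has_const:
        # Assumption: 4-byte constant stored least significant byte first.
        valC = int.from_bytes(instr_bytes[pos:pos + 4], "little")
    return {"icode": icode, "ifun": ifun, "rA": rA, "rB": rB, "valC": valC}


# Bytes 50 23 0C 00 00 00: icode 5, ifun 0, rA 2, rB 3, constant 12
fields = decode_instruction(bytes([0x50, 0x23, 0x0C, 0x00, 0x00, 0x00]))
```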
Sequential Summary
Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic
Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle
Pipelined Processors
What Is Pipelining
Computers execute billions of instructions, so instruction throughput is what matters
IDEA: Divide instruction execution up into several pipeline stages. For example:
    IF  ID  EX  MEM  WB
Simultaneously have different instructions in different pipeline stages
The length of the longest pipeline stage determines the cycle time
Desirable pipeline features (e.g., RISC):
– all instructions same length
– registers located in same place in instruction format
– memory operands only in loads or stores
What Is Pipelining
Laundry Example
Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
What Is Pipelining
[Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another in task order, each taking 30 + 40 + 20 minutes]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
What Is Pipelining
Start work ASAP
[Figure: timeline starting at 6 PM; loads A, B, C, D overlap in task order, with segments of 30, 40, 40, 40, 40, and 20 minutes along the time axis]
Pipelined laundry takes 3.5 hours for 4 loads
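
The 6-hour and 3.5-hour figures can be reproduced with a small timing model. The sketch below assumes each load moves to the next machine as soon as both the load and the machine are free, and that a finished load can wait between machines; the function names are hypothetical.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Each load enters a stage as soon as the load and the machine are both free."""
    machine_free_at = [0] * len(stage_minutes)
    finish = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for i, duration in enumerate(stage_minutes):
            start = max(ready, machine_free_at[i])
            ready = start + duration
            machine_free_at[i] = ready
        finish = ready
    return finish


LAUNDRY_STAGES = [30, 40, 20]  # washer, dryer, "folder"
sequential_total = sequential_minutes(LAUNDRY_STAGES, 4)
pipelined_total = pipelined_minutes(LAUNDRY_STAGES, 4)
```

The pipelined total is 30 + 4 × 40 + 20 = 210 minutes: once the pipeline is full, a load finishes every 40 minutes, the time of the slowest stage.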
What Is Pipelining
Pipelining Lessons
[Figure: the pipelined laundry timeline from 6 PM, loads A, B, C, D in task order, with segments of 30, 40, 40, 40, 40, and 20 minutes]
– Pipelining doesn’t help latency of single task, it helps throughput of entire workload
– Pipeline rate limited by slowest pipeline stage
– Multiple tasks operating simultaneously
– Potential speedup = Number pipe stages
– Unbalanced lengths of pipe stages reduces speedup
– Time to “fill” pipeline and time to “drain” it reduces speedup
Real-World Pipelines: Car Washes
[Figure: three car wash layouts labeled Sequential, Pipelined, and Parallel]
Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed
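Pipeline Diagrams
Unpipelined
[Figure: OP1, OP2, OP3 drawn one after another along the time axis]
– Cannot start new operation until previous one completes
3-Way Pipelined
[Figure: OP1, OP2, OP3 each pass through stages A, B, C, with each operation starting one stage after the previous one]
– Up to 3 operations in process simultaneously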
Data Dependencies
[Figure: combinational logic feeding a register (Reg) driven by Clock; OP1, OP2, OP3 run one after another along the time axis]
System
– Each operation depends on result from preceding one
Data Hazards
[Figure: three-stage pipeline of Comb. logic A, B, C, each followed by a register (Reg) driven by Clock; OP1 to OP4 overlap in time, each passing through stages A, B, C]
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7. Instructions in order: Load, Instr 1, Instr 2, Instr 3, Instr 4. Each passes through Ifetch, Reg, ALU, DMem, Reg, starting one cycle after the previous instruction. With a single memory port, the Load's DMem access in cycle 4 coincides with the Ifetch of Instr 3.]
One Memory Port/Structural Hazards
[Figure: pipeline diagram over clock cycles 1 to 7. Instructions in order: Load, Instr 1, Instr 2, Stall, Instr 3. The Stall row consists of Bubbles in place of the pipeline stages, and Instr 3 begins its Ifetch after the stall.]
How do you “bubble” the pipe?
Data Hazard on R1
[Figure: pipeline diagram over clock cycles with stages IF, ID/RF, EX, MEM, WB (drawn as Ifetch, Reg, ALU, DMem, Reg) for the following instructions in order]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11
Three Generic Data Hazards
Read After Write (RAW)
Instr J tries to read operand before Instr I writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
Caused by a “Dependence” (in compiler nomenclature).
This hazard results from an actual need for communication.
Three Generic Data Hazards
Write After Read (WAR)
Instr J writes operand before Instr I reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
Three Generic Data Hazards
Write After Write (WAW)
Instr J writes operand before Instr I writes it.
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
Called an “output dependence” by compiler writers.
This also results from the reuse of name “r1”.
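
The three cases can be told apart by comparing destination and source registers of two instructions. The sketch below assumes each instruction is written as a (destination, source, source) tuple and that I precedes J; it reports the dependences that could become hazards, since whether one actually occurs depends on the pipeline.

```python
def classify_dependences(instr_i, instr_j):
    """Return the set of potential hazards between I and a later instruction J."""
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    found = set()
    if dest_i in srcs_j:
        found.add("RAW")  # J reads what I writes
    if dest_j in srcs_i:
        found.add("WAR")  # J writes what I reads
    if dest_i == dest_j:
        found.add("WAW")  # both write the same register
    return found


raw_example = classify_dependences(("r1", "r2", "r3"), ("r4", "r1", "r3"))
war_example = classify_dependences(("r4", "r1", "r3"), ("r1", "r2", "r3"))
waw_example = classify_dependences(("r1", "r4", "r3"), ("r1", "r2", "r3"))
```

The three calls use the instruction pairs I and J from the RAW, WAR, and WAW slides respectively.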
Data Forwarding
Naïve Pipeline
– Register isn’t written until completion of write-back stage
– Source operands read from register file in decode stage
    – Needs to be in register file at start of stage
Observation
– Value generated in execute or memory stage
Trick
– Pass value directly from generating instruction to decode stage
– Needs to be available at end of decode stage
Forwarding to Avoid Data Hazard
[Figure: pipeline diagram over clock cycles with stages Ifetch, Reg, ALU, DMem, Reg for the following instructions in order]
    add r1,r2,r3
    sub r4,r1,r3
    and r6,r1,r7
    or  r8,r1,r9
    xor r10,r1,r11