Lecture 4
Processor Architecture
1943: ENIAC
– Presper Eckert and John Mauchly -- first general electronic computer. (or was it John V. Atanasoff in 1939?)
– Hard-wired program -- settings of dials and switches.
1944: Beginnings of EDVAC
– among other improvements, includes program stored in memory
1945: John von Neumann
– wrote a report on the stored program concept, known as the First Draft of a Report on EDVAC
The basic structure proposed in the draft became known as the “von Neumann machine” (or model).
– a memory, containing instructions and data
– a processing unit, for performing arithmetic and logical operations
– a control unit, for interpreting instructions
For more history, see http://www.maxmon.com/history.htm
2^k x m array of stored bits
Address
– unique (k-bit) identifier of location
Contents
– m-bit value stored in location
Basic Operations:
LOAD
– read a value from a memory location
STORE
– write a value to a memory location
[Figure: memory array with 4-bit addresses 0000 through 1111 and 8-bit contents such as 00101101 and 10100010]
How does processing unit get data to/from memory?
MAR: Memory Address Register
MDR: Memory Data Register
To LOAD a location (A):
1. Write the address (A) into the MAR.
2. Send a “read” signal to the memory.
3. Read the data from MDR.
To STORE a value (X) to a location (A):
1. Write the data (X) to the MDR.
2. Write the address (A) into the MAR.
3. Send a “write” signal to the memory.
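The LOAD and STORE steps above can be sketched in Python. This is only an illustrative model: the class name `Memory`, the method names, and the choice of k = 4 and m = 8 are assumptions, not part of the lecture.

```python
class Memory:
    """2^k locations of m bits each, reached only through MAR and MDR."""

    def __init__(self, k, m):
        self.m = m
        self.cells = [0] * (2 ** k)
        self.mar = 0  # Memory Address Register
        self.mdr = 0  # Memory Data Register

    def read_signal(self):
        # The memory copies the addressed location into the MDR.
        self.mdr = self.cells[self.mar]

    def write_signal(self):
        # The memory copies the MDR (truncated to m bits) into the addressed location.
        self.cells[self.mar] = self.mdr & ((1 << self.m) - 1)


def load(mem, address):
    mem.mar = address   # 1. write the address into the MAR
    mem.read_signal()   # 2. send a "read" signal to the memory
    return mem.mdr      # 3. read the data from the MDR


def store(mem, address, value):
    mem.mdr = value     # 1. write the data into the MDR
    mem.mar = address   # 2. write the address into the MAR
    mem.write_signal()  # 3. send a "write" signal to the memory


mem = Memory(k=4, m=8)
store(mem, 0b0100, 0b00101101)
print(format(load(mem, 0b0100), "08b"))
```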
Functional Units
– ALU = Arithmetic and Logic Unit
– could have many functional units, some of them special-purpose (multiply, square root, …)
Registers
– Small, temporary storage
– Operands and results of functional units
Word Size
– number of bits normally processed by ALU in one instruction
– also width of registers
Devices for getting data into and out of computer memory
Each device has its own interface, usually a set of registers like the memory’s MAR and MDR
– keyboard: data register (KBDR) and status register (KBSR)
– monitor: data register (DDR) and status register (DSR)
INPUT: Keyboard, Mouse, Scanner, Disk
OUTPUT: Monitor, Printer, LED, Disk
Some devices provide both input and output
– disk, network
Program that controls access to a device is usually called a driver.
Orchestrates execution of the program
Instruction Register (IR) contains the current instruction.
Program Counter (PC) contains the address of the next instruction to be executed.
Control unit:
– reads an instruction from memory
  the instruction’s address is in the PC
– interprets the instruction, generating signals that tell the other components what to do
  an instruction may take many machine cycles to complete
Fundamental Hardware Requirements
– Communication
  How to get values from one place to another
– Computation
– Storage
Bits are Our Friends
– Everything expressed in terms of values 0 and 1
– Communication
  Low or high voltage on wire
– Computation
  Compute Boolean functions
– Storage
  Store bits of information
[Figure: voltage versus time for a signal carrying the bits 0 1 0]
– Use voltage thresholds to extract discrete values from continuous signal
– Simplest version: 1-bit signal
  Either high range (1) or low range (0)
  With guard range between them
– Not strongly affected by noise or low quality circuit elements
  Can make circuits simple, small, and fast
And: out = a && b
Or: out = a || b
Not: out = !a
– Outputs are Boolean functions of inputs
– Respond continuously to changes in inputs
  With some, small delay
[Figure: voltage versus time for a, b and a && b, showing the rising delay and the falling delay]
Bit equal
– Generate 1 if a and b are equal
HCL Expression
bool eq = (a&&b)||(!a&&!b)
Hardware Control Language (HCL)
– Very simple hardware description language
  Boolean operations have syntax similar to C logical operations
  We’ll use it to describe control logic for processors
[Figure: 32 bit-equal circuits, comparing b31 with a31 down to b0 with a0 and producing eq31 down to eq0, combined into the single output Eq]
Word-Level Representation
[Figure: inputs A and B feeding an "=" block with output Eq]
HCL Representation
bool Eq = (A == B)
– 32-bit word size
– HCL representation
  Equality operation
  Generates Boolean value
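As a rough check of the bit-level and word-level equality circuits, the following Python sketch mirrors the HCL expression and then ANDs 32 bit comparisons together. The function names `bit_eq` and `word_eq` are made up for this illustration.

```python
def bit_eq(a, b):
    # Same structure as the HCL expression: (a && b) || (!a && !b)
    return bool((a and b) or ((not a) and (not b)))


def word_eq(a_word, b_word, width=32):
    # One bit-equal circuit per bit position, all outputs ANDed together.
    return all(
        bit_eq((a_word >> i) & 1, (b_word >> i) & 1)
        for i in range(width)
    )


print(word_eq(0xDEADBEEF, 0xDEADBEEF))
print(word_eq(0xDEADBEEF, 0xDEADBEEE))
```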
D Latch
[Figure: D latch with data input D, clock input C, internal signals R and S, and outputs Q+ and Q–. Latching (C = 1): the outputs follow d. Storing (C = 0): the outputs hold the previous value q.]
Structure
[Figure: eight storage elements with data inputs i7 to i0 and outputs o7 to o0 (Q+), all sharing one Clock; drawn at word level as a single block with input I, output O, and Clock]
– Stores word of data
  Different from program registers seen in assembly code
– Collection of edge-triggered latches
– Loads input on rising edge of clock
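The rising-edge behaviour can be modelled in a few lines of Python. This is a behavioural sketch only; the class name `Register`, the `tick` method, and the 8-bit default width are assumptions for illustration.

```python
class Register:
    """Holds one word; loads its input only on a 0 -> 1 clock transition."""

    def __init__(self, width=8):
        self.mask = (1 << width) - 1
        self.output = 0
        self._last_clock = 0

    def tick(self, clock, data_in):
        # Rising edge: previous clock level was 0 and the new level is 1.
        if self._last_clock == 0 and clock == 1:
            self.output = data_in & self.mask
        self._last_clock = clock
        return self.output


reg = Register()
print(reg.tick(0, 0x5A))  # clock low: output unchanged
print(reg.tick(1, 0x5A))  # rising edge: input is loaded
print(reg.tick(1, 0xFF))  # clock stays high: no new load
```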
[Figure: register file with read ports A (srcA, valA) and B (srcB, valB), write port W (dstW, valW), and Clock]
– Stores multiple words of memory
– Address input specifies which word to read or write
Register file
– Holds values of program registers
  %eax, %esp, etc.
– Register identifier serves as address
  ID 8 implies no read or write performed
Multiple Ports
– Can read and/or write multiple words in one cycle
  Each has separate address and data input/output
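A minimal Python sketch of such a register file follows. The port names come from the figure labels; the class itself, the assumption of eight registers with IDs 0 to 7, and the choice to return 0 for a read with ID 8 are illustrative guesses rather than details from the lecture.

```python
NO_REGISTER = 8  # ID 8 means "no read or write performed"


class RegisterFile:
    def __init__(self):
        self.regs = [0] * 8  # assumed: program registers with IDs 0..7

    def read(self, src_a, src_b):
        # Two read ports, each with its own address input and data output.
        val_a = 0 if src_a == NO_REGISTER else self.regs[src_a]
        val_b = 0 if src_b == NO_REGISTER else self.regs[src_b]
        return val_a, val_b

    def write(self, dst_w, val_w):
        # One write port; ID 8 leaves the state untouched.
        if dst_w != NO_REGISTER:
            self.regs[dst_w] = val_w


rf = RegisterFile()
rf.write(0, 42)
rf.write(NO_REGISTER, 99)
print(rf.read(0, NO_REGISTER))
```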
NOTE: okay to use just a circle for NOT.
AND/OR can take any number of inputs.
– AND = 1 if all inputs are 1.
– OR = 1 if any input is 1.
– Similar for NAND/NOR.
Can implement with multiple two-input gates
Can implement ANY truth table with AND, OR, NOT.
A B C D
0 0 0 0
0 0 1 0
0 1 0 1
0 1 1 0
1 0 0 0
1 0 1 1
1 1 0 0
1 1 1 0
1. AND combinations that yield a "1" in the truth table.
2. OR the results of the AND gates.
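The two-step recipe can be tried out in Python on the table above. The helper name `sum_of_products` and the dictionary representation of the table are assumptions made for this sketch.

```python
# Truth table from the slide: inputs (A, B, C) -> output D
TRUTH_TABLE = {
    (0, 0, 0): 0, (0, 0, 1): 0, (0, 1, 0): 1, (0, 1, 1): 0,
    (1, 0, 0): 0, (1, 0, 1): 1, (1, 1, 0): 0, (1, 1, 1): 0,
}


def sum_of_products(table):
    # Step 1: one AND gate per row whose output is 1.
    rows_with_one = [inputs for inputs, out in table.items() if out == 1]

    def circuit(*values):
        and_outputs = [
            # Use the input directly where the row has 1, and NOT it where the row has 0.
            all((v if bit else not v) for v, bit in zip(values, row))
            for row in rows_with_one
        ]
        # Step 2: OR the results of the AND gates.
        return int(any(and_outputs))

    return circuit


d = sum_of_products(TRUTH_TABLE)
print([d(*inputs) for inputs in TRUTH_TABLE])
```

For this table the circuit reduces to D = (!A && B && !C) || (A && !B && C).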
Implement the following truth table.
A B C
0 0 0
0 1 1
1 0 1
1 1 0
Converting AND to OR (with some help from NOT)
Consider the following gate:
A B   NOT A   NOT B   (NOT A) AND (NOT B)   NOT((NOT A) AND (NOT B))
0 0     1       1              1                        0
0 1     1       0              0                        1
1 0     0       1              0                        1
1 1     0       0              0                        1
The last column is the same as A OR B.
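The equivalence shown by this table (DeMorgan's law) can be checked exhaustively with a short Python loop. The function name is hypothetical.

```python
def and_with_inverted_inputs_and_output(a, b):
    # NOT((NOT a) AND (NOT b)), built only from AND and NOT
    return int(not ((not a) and (not b)))


for a in (0, 1):
    for b in (0, 1):
        print(a, b, and_with_inverted_inputs_and_output(a, b), a | b)
```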
n inputs, 2^n outputs
– exactly one output is 1 for each possible input pattern
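A decoder of this kind can be sketched in Python as follows. The function name `decode` and the convention that the first input bit is the most significant are assumptions for illustration.

```python
def decode(inputs):
    """n input bits (most significant first) -> list of 2^n outputs, exactly one of them 1."""
    n = len(inputs)
    selected = 0
    for bit in inputs:
        selected = (selected << 1) | bit
    return [1 if i == selected else 0 for i in range(2 ** n)]


print(decode([1, 0]))
```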
Methodologies for creating computer programs that perform a desired function.
Problem Solving
– How do we figure out what to tell the computer to do?
– Convert problem statement into algorithm, using stepwise refinement.
Debugging
– How do we figure out why it didn’t work?
– Examining registers and memory, setting breakpoints, etc.
Time spent on the first can reduce time spent on the second!
Also known as systematic decomposition .
Start with problem statement:
“We wish to count the number of occurrences of a character in a file. The character in question is to be input from the keyboard; the result is to be displayed on the monitor.”
Decompose task into a few simpler subtasks .
Decompose each subtask into smaller subtasks , and these into even smaller subtasks , etc....
until you get to the machine instruction level.
Because problem statements are written in English, they are sometimes ambiguous and/or incomplete.
– Where is “file” located? How big is it, or how do I know when I’ve reached the end?
– How should final count be printed? A decimal number?
– If the character is a letter, should I count both upper-case and lower-case occurrences?
How do you resolve these issues?
– Ask the person who wants the problem solved, or
– Make a decision and document it.
There are three basic ways to decompose a task:
– Sequential: Subtask 1, then Subtask 2
– Conditional: test condition; if true, Subtask 1; if false, Subtask 2
– Iterative: test condition; while true, repeat Subtask; if false, exit
Do Subtask 1 to completion, then do Subtask 2 to completion, etc.
Count and print the occurrences of a character in a file
– Get character input from keyboard
– Examine file and count the number of characters that match
– Print number to the screen
If condition is true, do Subtask 1; else, do Subtask 2.
Test character. If match, increment counter.
– file char = input?
  True: Count = Count + 1
  False: skip the increment
Do Subtask over and over, as long as the test condition is true.
Check each element of the file and count the characters that match.
– more chars to check?
  True: Check next char and count if matches.
  False: exit the loop
Learn to convert problem statement into step-by-step description of subtasks.
– Like a puzzle, or a “word problem” from grammar school math.
  What is the starting state of the system?
  What is the desired ending state?
  How do we move from one state to another?
– Recognize English words that correlate to three basic constructs:
  “do A then do B”: sequential
  “if G, then do H”: conditional
  “for each X, do Y”: iterative
  “do Z until W”: iterative
Input a character. Then scan a file, counting occurrences of that character. Finally, display on the monitor the number of occurrences of the character (up to 9).
Initial refinement: Big task into three sequential subtasks.
START
A: Initialize: Put initial values into all locations that will be needed to carry out this task.
- Input a character.
- Set up a pointer to the first location of the file that will be scanned.
- Get the first character from the file.
- Zero the register that holds the count.
B: Scan the file, location by location, incrementing the counter if the character matches.
C: Display the count on the monitor.
STOP
Refining B into iterative construct.
B: Scan the file, location by location, incrementing the counter if the character matches.
– Done? Yes: leave the loop. No: do B1, then test again.
– B1: Test character. If a match, increment counter. Get next character.
Refining B1 into sequential subtasks.
– B2: Test character. If matches, increment counter.
– B3: Get next character.
Conditional (B2) and sequential (B3).
– B2: R1 = R0? Yes: R2 = R2 + 1. No: skip the increment.
– B3: R3 = R3 + 1, then R1 = M[R3]
Write code (C, assembly, Java) for each step
; Look at each char in file.
0001100001111100 ; is R1 = EOT?
0000010xxxxxxxxx ; if so, exit loop
; Check for match with R0.
1001001001111111 ; R1 = -char
0001001001100001
0001001000000001 ; R1 = R0 – char
0000101xxxxxxxxx ; no match, skip incr
0001010010100001 ; R2 = R2 + 1
; Incr file ptr and get next char
0001011011100001 ; R3 = R3 + 1
0110001011000000 ; R1 = M[R3]
Don’t know PCoffset bits until all the code is done
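For comparison with the machine code, here is the same loop as a Python sketch. The variable names mirror the registers in the flowchart (R0 = character to find, R1 = current file character, R2 = count, R3 = file pointer). EOT is taken to be 4, the ASCII end-of-transmission code, which matches the constant subtracted in the first instruction. The function name and the list used as memory are assumptions for illustration.

```python
EOT = 4  # ASCII end-of-transmission marks the end of the file


def count_occurrences(memory, start, char):
    r0 = ord(char)       # character to look for
    r2 = 0               # zero the count
    r3 = start           # pointer to the first location of the file
    r1 = memory[r3]      # get the first character from the file
    while r1 != EOT:     # Done?
        if r1 == r0:     # B2: test character
            r2 = r2 + 1  #     if it matches, increment the counter
        r3 = r3 + 1      # B3: increment the file pointer
        r1 = memory[r3]  #     get the next character
    return r2


file_memory = [ord(c) for c in "banana"] + [EOT]
print(count_occurrences(file_memory, 0, "a"))
```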
Syntax Errors
– You made a typing error that resulted in an illegal operation.
– Not usually an issue with machine language, because almost any bit pattern corresponds to some legal instruction.
– In high-level languages, these are often caught during the translation from language to machine code.
Logic Errors
– Your program is legal, but wrong, so the results don’t match the problem statement.
– Trace the program to see what’s really happening and determine how to get the proper behavior.
Data Errors
– Input data is different from what you expected.
– Test the program with a wide variety of inputs.
The instruction is the fundamental unit of work.
Specifies two things:
– opcode: operation to be performed
– operands: data/locations to be used for operation
An instruction is encoded as a sequence of bits. (Just like data!)
– Often, but not always, instructions have a fixed length, such as 16 or 32 bits.
– Control unit interprets instruction: generates sequence of control signals to carry out operation.
– Operation is either executed completely, or not at all.
A computer’s instructions and their formats are known as its Instruction Set Architecture (ISA).
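To make "opcode plus operands, encoded as bits" concrete, this Python sketch pulls the fields out of the 16-bit word 0001010010100001 used earlier for R2 = R2 + 1. The field layout (4-bit opcode, two 3-bit register numbers, a 1-bit immediate flag, a 5-bit immediate) is an assumption inferred from that example, not a format stated in the lecture.

```python
def decode_fields(word):
    # Assumed layout: [15:12] opcode, [11:9] destination, [8:6] source,
    # [5] immediate flag, [4:0] immediate value
    return {
        "opcode": (word >> 12) & 0xF,
        "dest": (word >> 9) & 0x7,
        "src": (word >> 6) & 0x7,
        "imm_flag": (word >> 5) & 0x1,
        "imm5": word & 0x1F,
    }


print(decode_fields(0b0001010010100001))
```

Under that assumed layout, the word decodes to opcode 1 with destination register 2, source register 2, and immediate 1, which agrees with the comment R2 = R2 + 1.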
Assembly Language View
– Processor state
  Registers, memory, …
– Instructions
  addl, movl, leal, …
  How instructions are encoded as bytes
Layer of Abstraction
– Above: how to program machine
  Processor executes instructions in a sequence
– Below: what needs to be built
  Use variety of tricks to make it run fast
  E.g., execute multiple instructions simultaneously
[Figure: layers of abstraction, top to bottom: Application Program; Compiler, OS; ISA; CPU Design; Circuit Design; Chip Layout]
Instruction set design issues include:
– Where are operands stored?
  registers, memory, stack, accumulator
– How many explicit operands are there?
  0, 1, 2, or 3
– How is the operand location specified?
  register, immediate, indirect, . . .
– What type & size of operands are supported?
  byte, int, float, double, string, vector. . .
– What operations are supported?
  add, sub, mul, move, compare . . .
The results of different address classes are easiest to see with the examples here, all of which implement the sequences for C = A + B.
Stack      Accumulator   Register-Memory   Load-Store
Push A     Load A        Load R1, A        Load R1, A
Push B     Add B         Add R1, B         Load R2, B
Add        Store C       Store C, R1       Add R3, R1, R2
Pop C                                      Store C, R3
Load-Store is the class that won out. The more registers on the CPU, the better.
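Two of the four sequences can be run through toy interpreters in Python to confirm that both compute C = A + B. The interpreters, the tuple encoding of instructions, and the sample values A = 3 and B = 4 are all assumptions made for this sketch.

```python
def run_stack(program, memory):
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(memory[args[0]])
        elif op == "Add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)   # operands are implicit: top two stack entries
        elif op == "Pop":
            memory[args[0]] = stack.pop()
    return memory


def run_load_store(program, memory):
    regs = {}
    for op, *args in program:
        if op == "Load":
            regs[args[0]] = memory[args[1]]
        elif op == "Add":
            regs[args[0]] = regs[args[1]] + regs[args[2]]  # registers only
        elif op == "Store":
            memory[args[0]] = regs[args[1]]
    return memory


stack_program = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
load_store_program = [
    ("Load", "R1", "A"),
    ("Load", "R2", "B"),
    ("Add", "R3", "R1", "R2"),
    ("Store", "C", "R3"),
]

print(run_stack(stack_program, {"A": 3, "B": 4})["C"])
print(run_load_store(load_store_program, {"A": 3, "B": 4})["C"])
```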
Addressing Mode          Example               Action
1. Register              Add R4, R3            R4 <- R4 + R3
2. Immediate             Add R4, #3            R4 <- R4 + 3
3. Displacement          Add R4, 100(R1)       R4 <- R4 + M[100 + R1]
4. Register indirect     Add R4, (R1)          R4 <- R4 + M[R1]
5. Indexed               Add R4, (R1 + R2)     R4 <- R4 + M[R1 + R2]
6. Direct or absolute    Add R4, (1000)        R4 <- R4 + M[1000]
7. Memory Indirect       Add R4, @(R3)         R4 <- R4 + M[M[R3]]
8. Autoincrement         Add R4, (R2)+         R4 <- R4 + M[R2]; R2 <- R2 + d
9. Autodecrement         Add R4, (R2)-         R2 <- R2 - d; R4 <- R4 + M[R2]
10. Scaled               Add R4, 100(R2)[R3]   R4 <- R4 + M[100 + R2 + R3*d]
Modes 1-4 account for 93% of all operands
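A few of these modes are sketched below in Python, showing how the second operand of the Add is obtained in each case. The function `operand_value`, its parameters, and the sample register and memory contents are hypothetical and exist only for this illustration.

```python
def operand_value(mode, regs, mem, reg=None, const=None):
    """Return the value added to R4 for the given addressing mode."""
    if mode == "register":            # Add R4, R3
        return regs[reg]
    if mode == "immediate":           # Add R4, #3
        return const
    if mode == "displacement":        # Add R4, 100(R1)
        return mem[const + regs[reg]]
    if mode == "register_indirect":   # Add R4, (R1)
        return mem[regs[reg]]
    if mode == "direct":              # Add R4, (1000)
        return mem[const]
    if mode == "memory_indirect":     # Add R4, @(R3)
        return mem[mem[regs[reg]]]
    raise ValueError("mode not covered by this sketch: " + mode)


regs = {"R1": 200, "R3": 300}
mem = {5: 11, 200: 9, 300: 5, 1000: 7}

print(operand_value("displacement", regs, mem, reg="R1", const=100))
print(operand_value("memory_indirect", regs, mem, reg="R3"))
```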
Arithmetic and Logic: AND, ADD
Data Transfer: MOVE, LOAD, STORE
Control: BRANCH, JUMP, CALL
System: OS CALL, VM
Floating Point: ADDF, MULF, DIVF
Decimal: ADDD, CONVERT
String: MOVE, COMPARE
Graphics: (DE)COMPRESS
What does the compiler do?
– Translate HLL to machine lang, optimize, check for errors
Optimizations
– Generic high-level: common subexpression, strength reduction, “machine independent”
– Local: within a straightline code fragment (a “block”)
– Global: cross branches, transform loops
– Register allocation: associate registers with operands
– Machine-dependent: tune to the specific architecture (or ISA)
Impact of optimization on performance
– Goal is to improve
– Sometimes makes worse, or not better
How to make the compiler writer’s life easier
– Make frequent case fast, rare case correct
– Make things uniform
– Reduce tradeoffs, have one “best” way of doing each thing
– Allow for constant values
Complex Instruction Set Computer
– Dominant style through mid-80’s
Stack-oriented instruction set
– Use stack to pass arguments, save program counter
– Explicit push and pop instructions
Arithmetic instructions can access memory
– addl %eax, 12(%ebx,%ecx,4)
  requires memory read and write
  Complex address calculation
Condition codes
– Set as side effect of arithmetic and logical instructions
Philosophy
– Add instructions to perform “typical” programming tasks
Reduced Instruction Set Computer
– Internal project at IBM, later popularized by Hennessy (Stanford) and Patterson (Berkeley)
Fewer, simpler instructions
– Might take more to get given task done
– Can execute them with small and fast hardware
Register-oriented instruction set
– Many more (typically 32) registers
– Use for arguments, return pointer, temporaries
Only load and store instructions can access memory
– Similar to Y86 mrmovl and rmmovl
No Condition codes
– Test instructions return 0/1 in register
Register-Register (R-type)
ADD R1, R2, R3
Bits:  31-26  25-21  20-16  15-11  10-6  5-0
Field: Op     rs1    rs2    rd           func
(ALU reg. operations, read/write special registers and moves)
Register-Immediate (I-type)
SUB R1, R2, #3
Bits:  31-26  25-21  20-16  15-0
Field: Op     rs1    rd     immediate
(ALU imm. operations, loads and stores, conditional branch, jump (and link))
Jump / Call (J-type)
JUMP end
Bits:  31-26  25-0
Field: Op     offset added to PC
(jump, jump and link, trap and return from exception)
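The I-type layout (Op in bits 31-26, rs1 in 25-21, rd in 20-16, immediate in 15-0) can be packed and unpacked with shifts and masks, as in this Python sketch. The opcode value 0b001010 is a made-up placeholder, not the real encoding of SUB, and the function names are hypothetical.

```python
def encode_i_type(op, rs1, rd, immediate):
    return (
        ((op & 0x3F) << 26)       # bits 31-26: opcode
        | ((rs1 & 0x1F) << 21)    # bits 25-21: source register
        | ((rd & 0x1F) << 16)     # bits 20-16: destination register
        | (immediate & 0xFFFF)    # bits 15-0: 16-bit immediate
    )


def decode_i_type(word):
    return (
        (word >> 26) & 0x3F,
        (word >> 21) & 0x1F,
        (word >> 16) & 0x1F,
        word & 0xFFFF,
    )


HYPOTHETICAL_SUB_OPCODE = 0b001010  # placeholder value for illustration only

word = encode_i_type(HYPOTHETICAL_SUB_OPCODE, rs1=2, rd=1, immediate=3)  # SUB R1, R2, #3
print(format(word, "032b"))
print(decode_i_type(word))
```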
Original Debate
– Strong opinions!
– CISC proponents---easy for compiler, fewer code bytes
– RISC proponents---better for optimizing compilers, can make run fast with simple chip design
Current Status
– For desktop processors, choice of ISA not a technical issue
  With enough hardware, can make anything run fast
  Code compatibility more important
– For embedded processors, RISC makes sense
  Smaller, cheaper, less power
Fetch instruction from memory
Decode instruction
Evaluate address
Memory load or store
Write back result
Update Program Counter
Sequential HW Structure
State
– Program counter register (PC)
– Condition code register (CC)
– Register File
– Memories
  Access same memory space
  Data: for reading/writing program data
  Instruction: for reading instructions
Instruction Flow
– Read instruction at address specified by PC
– Process through stages
– Update program counter
[Figure: sequential datapath with stages Fetch, Decode, Execute, Memory, Write back and PC update; signals shown: icode, ifun, rA, rB, valC, valP, srcA, srcB, dstA, dstB, valA, valB, aluA, aluB, Bch, valE, valM, Addr, Data, newPC]
Fetch
– Read instruction from instruction memory
Decode
– Read program registers
Execute
– Compute value or address
Memory
– Read or write data
Write Back
– Write program registers
PC
– Update program counter
Phases: Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back
Instruction Fetch (IF):
Send out the PC and fetch the instruction from memory into the instruction register (IR); increment the PC by 4 to address the next sequential instruction.
IR holds the instruction that will be used in the next stage.
NPC holds the value of the next PC.
Passed To Next Stage
IR <- Mem[PC]
NPC <- PC + 4
Instruction Decode / Register Fetch (ID):
Decode the instruction and access the register file to read the registers.
The outputs of the general purpose registers are read into two temporary registers (A & B) for use in later clock cycles.
We extend the sign of the lower 16 bits of the Instruction Register.
Passed To Next Stage
A <- Regs[IR6..IR10];
B <- Regs[IR11..IR15];
Imm <- ((IR16)^16 ## IR16..IR31)
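Sign extension of a 16-bit immediate can be sketched in Python as below. The function name is an assumption; Python integers are unbounded, so the sketch returns the signed value rather than a 32-bit pattern.

```python
def sign_extend_16(value):
    """Interpret the low 16 bits of value as a two's-complement number."""
    value &= 0xFFFF
    if value & 0x8000:          # sign bit set: the number is negative
        return value - 0x10000
    return value


print(sign_extend_16(0x0003))
print(sign_extend_16(0xFFFD))
```

Masking the result with 0xFFFFFFFF gives the 32-bit pattern in which the sign bit has been copied into the upper 16 bits.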
[Figure: example instruction bytes 5 0 | rA rB | D, laid out as icode ifun | rA rB (optional) | valC (optional)]
Instruction Format
– Instruction byte: icode:ifun
– Optional register byte: rA:rB
– Optional constant word: valC
Execute / Address Calculation (EX):
We perform an operation (for an ALU) or an address calculation (if it’s a load or a Branch).
If an ALU, actually do the operation. If an address calculation, figure out how to obtain the address and stash away the location of that address for the next cycle.
Passed To Next Stage
A <- A func. B
cond = 0;
Memory Access (MEM):
If this is an ALU, do nothing.
If a load or store, then access memory.
Passed To Next Stage
A = Mem[prev. B] or
Mem[prev. B] = A
Write Back (WB):
Update the registers from either the ALU or from the data loaded.
Passed To Next Stage
Regs <- A, B;
PC <- NPC
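The five phases can be traced end to end in a short Python sketch. The tuple instruction format (op, rd, rs1, operand), the four-register machine, and the memory addresses are hypothetical, chosen only to keep the example small; they are not the encoding used in the slides.

```python
def step(pc, regs, mem, program):
    # IF: fetch the instruction and compute the next PC
    ir = program[pc // 4]
    npc = pc + 4
    # ID: decode and read the source register
    op, rd, rs1, operand = ir
    a = regs[rs1]
    # EX: ALU operation, or address calculation for a load/store
    if op == "add":
        alu_out = a + regs[operand]   # operand names a second register
    else:
        alu_out = a + operand         # operand is an address offset
    # MEM: only loads and stores touch data memory
    loaded = None
    if op == "load":
        loaded = mem[alu_out]
    elif op == "store":
        mem[alu_out] = regs[rd]
    # WB: update the register file from the ALU or from the loaded data
    if op == "add":
        regs[rd] = alu_out
    elif op == "load":
        regs[rd] = loaded
    return npc


program = [
    ("load", 1, 0, 100),   # R1 <- Mem[R0 + 100]
    ("load", 2, 0, 104),   # R2 <- Mem[R0 + 104]
    ("add", 3, 1, 2),      # R3 <- R1 + R2
    ("store", 3, 0, 108),  # Mem[R0 + 108] <- R3
]
regs = [0, 0, 0, 0]
mem = {100: 3, 104: 4}
pc = 0
while pc < 4 * len(program):
    pc = step(pc, regs, mem, program)
print(mem[108])
```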
Implementation
– Express every instruction as series of simple steps
– Follow same general flow for each instruction type
– Assemble registers, memories, predesigned combinational blocks
– Connect with control logic
Limitations
– Too slow to be practical
– In one cycle, must propagate through instruction memory, register file, ALU, and data memory
– Would need to run clock very slowly
– Hardware units only active for fraction of clock cycle
Computers execute billions of instructions, so instruction throughput is what matters
IDEA: Divide instruction execution up into several pipeline stages. For example
IF ID EX MEM WB
Simultaneously have different instructions in different pipeline stages
The length of the longest pipeline stage determines the cycle time
Desirable pipeline features (e.g., RISC):
– all instructions same length
– registers located in same place in instruction format
– memory operands only in loads or stores
Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
“Folder” takes 20 minutes
[Figure: timeline from 6 PM to midnight; loads A, B, C and D run one after another, each taking 30 + 40 + 20 minutes]
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
[Figure: timeline starting at 6 PM; loads A, B, C and D overlap, with successive segments of 30, 40, 40, 40, 40 and 20 minutes]
Pipelined laundry takes 3.5 hours for 4 loads
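Both laundry totals can be reproduced with a small Python calculation. The function names are made up, and the pipelined model assumes a finished load can wait between machines until the next one is free.

```python
STAGE_MINUTES = [30, 40, 20]  # washer, dryer, folder
LOADS = 4                     # Ann, Brian, Cathy, Dave


def sequential_minutes(stages, loads):
    # Each load finishes all stages before the next load starts.
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    stage_free_at = [0] * len(stages)  # time at which each machine becomes free
    finish = 0
    for _ in range(loads):
        t = 0
        for i, duration in enumerate(stages):
            start = max(t, stage_free_at[i])  # wait for the load and for the machine
            t = start + duration
            stage_free_at[i] = t
        finish = t
    return finish


print(sequential_minutes(STAGE_MINUTES, LOADS) / 60)
print(pipelined_minutes(STAGE_MINUTES, LOADS) / 60)
```

The pipelined total is 30 + 4 x 40 + 20 = 210 minutes: once the pipeline is full, the 40-minute dryer sets the pace.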
Pipelining doesn’t help latency of single task, it helps throughput of entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously
Potential speedup = Number pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” pipeline and time to “drain” it reduces speedup
[Figure: three arrangements compared: Sequential, Parallel, Pipelined]
Idea
– Divide process into independent stages
– Move objects through stages in sequence
– At any given time, multiple objects being processed
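Unpipelined
[Figure: OP1, OP2 and OP3 executing one after another along the time axis]
– Cannot start new operation until previous one completes
3-Way Pipelined
[Figure: OP1, OP2 and OP3 each passing through stages A, B and C, with each operation starting one stage after the previous one]
– Up to 3 operations in process simultaneously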
Structural hazards: Not enough HW to support this combination of instructions (single person to fold and put clothes away)
Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).
[Figure: combinational logic feeding a register (Reg) driven by Clock; OP1, OP2 and OP3 execute one after another over time]
System
– Each operation depends on result from preceding one
[Figure: three combinational logic stages A, B and C, each followed by a register (Reg) driven by Clock; OP1 to OP4 overlap in time, each passing through A, B and C]
– Result does not feed back around in time for next operation
– Pipelining has changed behavior of system
[Figure: pipeline diagram, instruction order versus time in clock cycles 1 to 7; Load, Instr 1, Instr 2, Instr 3 and Instr 4 each pass through Ifetch, Reg, DMem and Reg, overlapping in time]
Read After Write (RAW)
Instr J tries to read operand before Instr I writes it
I: add r1,r2,r3
J: sub r4,r1,r3
Caused by a “Dependence” (in compiler nomenclature).
This hazard results from an actual need for communication.
Write After Read (WAR)
Instr J writes operand before Instr I reads it
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “anti-dependence” by compiler writers.
This results from reuse of the name “r1”.
Write After Write (WAW)
Instr J writes operand before Instr I writes it.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
Called an “output dependence” by compiler writers.
This also results from the reuse of name “r1”.
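The three cases can be told apart by comparing destination and source registers, as in this Python sketch. The tuple form (dest, src1, src2) and the function name are assumptions for illustration. The check finds register dependences between an earlier instruction I and a later instruction J; whether a dependence becomes an actual hazard depends on the pipeline.

```python
def dependences(instr_i, instr_j):
    """instr_i comes before instr_j; each is (dest, src1, src2)."""
    found = set()
    dest_i, *srcs_i = instr_i
    dest_j, *srcs_j = instr_j
    if dest_i in srcs_j:
        found.add("RAW")   # J reads what I writes
    if dest_j in srcs_i:
        found.add("WAR")   # J writes what I reads
    if dest_i == dest_j:
        found.add("WAW")   # J writes what I writes
    return found


print(dependences(("r1", "r2", "r3"), ("r4", "r1", "r3")))  # add then sub
print(dependences(("r4", "r1", "r3"), ("r1", "r2", "r3")))  # sub then add
print(dependences(("r1", "r4", "r3"), ("r1", "r2", "r3")))  # sub then add
```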
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
[Figure: pipeline diagram (Ifetch, Reg, DMem, Reg) for the five instructions above]
What do you do with the 3 instructions in between?
Write sequential timing diagram for:
instr
x = y + z
b = x + y
y = a + b
d = z + b
x = a + y
Cycle:      1  2  3   4  5  6  7  8  9
x = y + z   F  D  EX  M  W
Rewrite using forwarding, compare total time
Rewrite using scheduling, compare total time