Advanced Processor Architectures

Advanced Processor Architectures
Introduction
CISC & RISC
Basic techniques
Superscalar
VLIW
Multicore & Manycore
Introduction
Microprocessor
An implementation of the von Neumann architecture (Turing computation model)
First microprocessor
1971, Intel 4004: ~2,300 transistors, 740 kHz, 4-bit
Long and rapid evolution
Billions of transistors, 3-4 GHz, 64-bit, …
Computation model
Always the same!
Introduction
After a few years, two approaches emerged for the design of new microprocessors
CISC: Complex Instruction-Set Computers
Favors code compactness
Variable-length instructions
Complex operations, complex addressing modes
Complex hardware
RISC: Reduced Instruction-Set Computers
Fixed-length instructions
Simple operations on registers (load/store approach)
Simpler hardware
Introduction
RISC instructions
Load & Store (memory <-> registers)
ALU (registers)
Branches
Easily predictable execution time
Execution can be broken into smaller phases
Pipelining allows parallelizing the execution, so instructions complete faster
Provided that memory is fast
Can be obtained with cache memories
Branches break the sequential execution flow
This problem is mitigated with branch prediction
Introduction
The RISC approach seems promising and scalable
Several architectures, e.g. MIPS: 5-stage pipeline
FETCH -> DECODE -> EXECUTE -> MEMORY ACCESS -> WRITE BACK
Execution example
Cycle     1   2   3   4   5   6   7   8   9  10  11
INSTR 1   F1  D1  E1  M1  W1
INSTR 2       F2  D2  E2  M2  W2
INSTR 3           F3  D3  E3  M3  W3
INSTR 4               F4  D4  E4  M4  W4
INSTR 5                   F5  D5  E5  M5  W5
INSTR 6                       F6  D6  E6  M6  W6
INSTR 7                           F7  D7  E7  M7  W7
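The diagram above follows mechanically from the pipeline structure and can be generated programmatically. Below is a minimal C sketch (stage initials and counts are the MIPS ones from this slide; everything else is illustrative):

#include <stdio.h>

/* Print the timing diagram of an ideal 5-stage pipeline
 * (no hazards, no stalls) for a handful of instructions. */
int main(void) {
    const char stages[] = { 'F', 'D', 'E', 'M', 'W' };
    const int n_stages = 5, n_instr = 7;
    const int n_cycles = n_instr + n_stages - 1;

    printf("Cycle    ");
    for (int c = 1; c <= n_cycles; c++)
        printf("%4d", c);
    printf("\n");

    for (int i = 0; i < n_instr; i++) {
        printf("INSTR %-2d ", i + 1);
        for (int c = 0; c < n_cycles; c++) {
            int s = c - i;   /* stage occupied by instruction i at cycle c+1 */
            if (s >= 0 && s < n_stages)
                printf("  %c%d", stages[s], i + 1);
            else
                printf("    ");
        }
        printf("\n");
    }
    return 0;
}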
Pipelining
With the availability of
Silicon foundry processes
Design tools
pipelining became feasible also for CISC machines
Main problems with pipelining
Definition and balancing of the stages
Data hazards
Control hazards
Pipelining
Maximum operating frequency of a pipeline
Number of stages: N
Total latency: L = L1 + … + LN
Maximum stage latency: Lmax = max(L1, …, LN)
No pipelining: Fmax = 1 / L
Unbalanced pipeline: Fmax = 1 / Lmax
Optimal case: L1 = L2 = … = LN = L / N, so Fmax = 1 / (L/N) = N / L
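A quick numeric check of these formulas (the stage latencies are made-up values in nanoseconds):

#include <stdio.h>

/* Illustrative stage latencies (ns): L = 10 ns total, Lmax = 3 ns. */
int main(void) {
    double lat[] = { 2.0, 2.0, 3.0, 2.0, 1.0 };
    int n = sizeof lat / sizeof lat[0];
    double total = 0.0, lmax = 0.0;
    for (int i = 0; i < n; i++) {
        total += lat[i];
        if (lat[i] > lmax) lmax = lat[i];
    }
    /* f [MHz] = 1000 / t [ns] */
    printf("No pipelining:       Fmax = 1/L    = %5.1f MHz\n", 1000.0 / total);
    printf("Unbalanced pipeline: Fmax = 1/Lmax = %5.1f MHz\n", 1000.0 / lmax);
    printf("Balanced pipeline:   Fmax = N/L    = %5.1f MHz\n", n * 1000.0 / total);
    return 0;
}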
Data Hazards
Consider the pipelined execution of the code

  ADD R0, R1, R2   // R0 <= R1 + R2
  DEC R0           // R0 <= R0 - 1

Cycle   1   2   3   4   5   6
ADD     F1  D1  E1  M1  W1
DEC         F2  D2  E2  M2  W2

ADD reads registers R1 and R2 in D1, calculates the sum in E1, and writes the result in R0 only in W1
DEC reads register R0 in D2, before ADD has written it back: a data hazard
Data Hazards
The code should execute as follows

  ADD R0, R1, R2
  NOP
  NOP
  NOP
  DEC R0

Cycle   1   2   3   4   5   6   7   8   9
ADD     F1  D1  E1  M1  W1
NOP         F   D   E   M   W
NOP             F   D   E   M   W
NOP                 F   D   E   M   W
DEC                     F2  D2  E2  M2  W2

Now DEC reads register R0 in D2 only after ADD has written the result in W1
Data Hazards
The NOP instructions can be
Generated by the compiler
Strongly architecture dependent
Low runtime overhead
Inserted as “stalls” or bubbles by the microprocessor
Simpler compiler
Hardware architecture more complex
Additional logic for data hazards is needed
General approach to these kinds of problems
Code scheduling
At compile time
At run time
Data Hazards - Solutions
Bypass logic
Involves Decode and Execute stages
Requires additional multiplexors
The Decode stage compares
Registers written by execute and memory stages
Registers read by the decode stage
Controls the multiplexors to select most recent data
Cycle   1   2   3   4   5   6
I1      F   D   E   M   W
I2          F   D   E   M   W

The result computed in E by I1 is forwarded directly to the Execute stage of I2: no stall is needed
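In C-like terms, the comparison performed by the Decode stage looks roughly like this (a minimal sketch; register numbering and all names are invented):

/* Bypass (forwarding) selection: compare a source register of the
 * current instruction against the destination registers of the
 * instructions in the Execute and Memory stages, and drive the
 * bypass multiplexors accordingly. */
enum src_sel { FROM_REGFILE, FROM_EXECUTE, FROM_MEMORY };

/* dst_ex / dst_mem: destination registers of the instructions now in
 * Execute and Memory (-1 when the instruction writes no register). */
enum src_sel select_operand(int src, int dst_ex, int dst_mem) {
    if (src < 0)        return FROM_REGFILE;
    if (src == dst_ex)  return FROM_EXECUTE;  /* most recent value wins */
    if (src == dst_mem) return FROM_MEMORY;
    return FROM_REGFILE;
}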
Data Hazards - Solutions
Bypass logic and bubbles
When data is ready only after the Memory access stage (e.g. a load)
Bypass cannot "anticipate" data
A bubble is inserted

Cycle   1   2   3   4   5   6   7
I1      F   D   E   M   W
I2          F   D   *   E   M   W
                (bubble)
Control Hazards
Consider the pipelined execution of the code

  LOOP: ...
        DEC R0      // R0 <= R0 - 1
        BGE LOOP    // if R0 >= 0 jumps to LOOP
        ADD ...

Cycle   1   2   3   4   5   6   7
DEC     F1  D1  E1  M1  W1
BGE         F2  D2  E2  M2  W2
ADD             F3  D3  E3  M3  W3

DEC reads R0 in D1, decrements it and sets the condition flags
BGE resolves in E2: if R0 >= 0 the next instruction is not ADD
But ADD has already been fetched and decoded
Control Hazards
The code should execute as follows

Cycle   1   2   3   4   5   6   7   8   9
DEC     F1  D1  E1  M1  W1
BGE         F2  D2  E2  M2  W2
ADD             F3  D3  (wrong instruction: flushed)
NEXT                    F6  D6  E6  M6  W6

If R0 >= 0 the next instruction is not ADD
The wrongly fetched ADD is flushed, and the correct instruction (NEXT) is fetched from the branch target
Control Hazards - Solutions
Prediction units
Try to predict whether a branch will be taken or not
Fetch the next instruction accordingly
Still require pipeline flush
More rarely if a good prediction policy is implemented
Types of predictions
Always taken
Assumes that the branch will always be taken
Always not taken
Assumes that the branch will never be taken
History based
Predicts the next outcome based on the branch's past behavior (see the sketch below)
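A classic history-based scheme is the 2-bit saturating counter, one per branch. A minimal sketch (table size and indexing are invented for illustration):

#include <stdbool.h>
#include <stdint.h>

/* 2-bit saturating counter per table entry:
 * 0-1 -> predict not taken, 2-3 -> predict taken. */
#define TABLE_SIZE 1024                 /* illustrative size */
static uint8_t counter[TABLE_SIZE];

static unsigned idx(uint32_t pc) { return (pc >> 2) % TABLE_SIZE; }

bool predict_taken(uint32_t pc) {
    return counter[idx(pc)] >= 2;
}

/* Called once the real outcome is known: nudge the counter, so two
 * consecutive mispredictions are needed to flip the prediction. */
void train(uint32_t pc, bool taken) {
    uint8_t *c = &counter[idx(pc)];
    if (taken) { if (*c < 3) (*c)++; }
    else       { if (*c > 0) (*c)--; }
}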
Control Hazards - Solutions
Delay slot
The instruction following a branch is always executed
The branch is taken only after this instruction

  LOOP: ...
        DEC R0
        BGE LOOP
        SLOT       // delay slot: always executed
        NEXT

Cycle   1   2   3   4   5   6   7   8
DEC     F1  D1  E1  M1  W1
BGE         F2  D2  E2  M2  W2
SLOT            F3  D3  E3  M3  W3
NEXT                F4  D4  E4  M4  W4

DEC reads R0 in D1, decrements it and sets the condition flags
While SLOT executes, the branch resolves, so the fetch after the slot can already target the correct instruction: no flush is needed
Control Hazards - Solutions
Improves execution time
Always taken: fails every time the prediction is wrong
At least one clock cycle wasted
Always not taken: fails every time the prediction is wrong
At least one clock cycle wasted
Delay slot: fails if the compiler cannot exploit the slot
At least one clock cycle wasted (NOP)
Decidable at compile time
No additional hardware
Controversial solution
Relies on the compiler's ability to exploit delay slots
Performance improvement
Assuming no hazards, a major factor limiting execution speed is memory I/O
Several solutions at system level
Stanford
Harvard
SuperHarvard
Other approaches at CPU level
Prefetch buffer
Loop optimization
Conditional execution
Stanford Architecture
Based on Von Neumann model
Unified cache for data and instructions
Data and instructions have different access patterns
(Block diagram: the fetch unit (PC, IR) and the memory unit (MAR, MDR) share a single L0 cache, for both instructions and data, in front of main memory.)
Harvard Architecture
Based on Von Neumann model
Split cache for data and instructions
Better hit rates
(Block diagram: the fetch unit (PC, IR) has its own L0 I-cache and the memory unit (MAR, MDR) its own L0 D-cache; both connect to the same main memory.)
SuperHarvard Architecture
Extends the Harvard architecture
Split cache for data and instructions
Integrates a DMA controller for high-speed data transfers
Two buses
(Block diagram: L0 I-cache and L0 D-cache as in the Harvard scheme, plus a DMA controller that moves data between I/O and memory over a second bus.)
Instruction Prefetch Buffer
When the pipeline sustains close to one instruction per cycle or more
Instruction fetch latency may slow down execution
Fetching instructions ahead of time is beneficial
Especially true for superscalar architectures
(Diagram: a prefetch buffer sits between the L0 I-cache and the IR, reading ahead of the PC so the decoder never starves.)
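A toy software model of the read-ahead idea (a small FIFO refilled ahead of the consumer; fetch_word() stands in for the slow I-cache access, and all names and sizes are invented):

#include <stdint.h>

#define DEPTH 4                      /* illustrative buffer depth */

extern uint32_t fetch_word(uint32_t addr);  /* assumed slow access */

static uint32_t fifo[DEPTH];
static int head, count;              /* consumer index, items held */
static uint32_t next_pc;             /* next address to prefetch   */

static void read_ahead(void) {
    while (count < DEPTH) {          /* keep the buffer full       */
        fifo[(head + count) % DEPTH] = fetch_word(next_pc);
        next_pc += 4;
        count++;
    }
}

/* What the decode stage consumes; on a taken branch the buffer
 * would be flushed and next_pc redirected (not shown). */
uint32_t next_instruction(void) {
    read_ahead();
    uint32_t insn = fifo[head];
    head = (head + 1) % DEPTH;
    count--;
    return insn;
}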
Loop Optimization
Most loops are explicitly indexed
This requires allocating a register for the loop index
Incrementing or decrementing the index
Comparing with the loop boundary
Jumping
Back to the beginning, or
Out of the loop
All indexed loops can be rewritten as

  INIT:  MOV R0, #N    // loop counter
  LOOP:  ...           // loop body
         DEC R0
         BGE LOOP
Loop Optimization
All indexed loops can be rewritten as

  INIT:  MOV R0, #N    // loop counter
  LOOP:  ...           // loop body
         DEC R0
         BGE LOOP

The highlighted instructions (MOV, DEC, BGE) can be executed by a dedicated unit
Significant reduction of execution time
Using dedicated hardware they can be parallelized with the loop body
Ad hoc usage of loop registers
Loop optimization logic required
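In source terms, the transformation corresponds to rewriting any counted loop in count-down form, which maps directly onto such a loop unit. A sketch (the compiler normally performs this rewrite itself):

/* Explicitly indexed loop: index register, compare, jump. */
void scale(float *a, int n) {
    for (int i = 0; i < n; i++)
        a[i] *= 2.0f;
}

/* Count-down form of the slide: the decrement-and-branch pair
 * (DEC R0 / BGE LOOP) is exactly what a dedicated loop unit can
 * execute in parallel with the loop body. */
void scale_countdown(float *a, int n) {
    float *p = a + n;
    while (n-- > 0)
        *--p *= 2.0f;
}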
Conditional execution
A similar principle can be applied to
Single instructions
Conditional constructs
Consider the code
if( A + B > 0 )
Z = C * B
...
The direct translation with normal instructions is

  ADD  R0, R1, R2   // temp = A+B
  BLE  L000
  MUL  R3, R2, R4   // Z = C*B
  L000: ...
Conditional execution
With conditional instructions we have

  ADD    R0, R1, R2   // temp = A+B
  MULGT  R3, R2, R4   // Z = C*B, executed only if flags N and Z are reset
  ...
This arrangement has several advantages
Smaller code
Saves one instruction at run-time
Does not generate control hazards
Clearly…
Requires additional hardware
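The same idea is visible at the C level as branchless code: for a pattern like the one above, compilers can emit a predicated instruction (as with MULGT) or a conditional move instead of a branch. A sketch:

/* Branchy version: compiles to a compare plus a conditional jump,
 * which can mispredict and flush the pipeline. */
int update_branchy(int a, int b, int c, int z) {
    if (a + b > 0)
        z = c * b;
    return z;
}

/* Branchless version: both values of the selection are computed,
 * so the compiler can use a conditional move / predicated multiply. */
int update_branchless(int a, int b, int c, int z) {
    return (a + b > 0) ? c * b : z;
}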
Superscalar Architectures
Different instructions
Can use different units of the pipeline
May share no data dependency
They can be executed in parallel!
Having more than one pipeline:
(Diagram: FETCH -> DECODE -> ISSUE dispatches to parallel units EXEC 1 … EXEC N, followed by RETIRE and WRITE BACK.)
Superscalar Architectures
Typical architecture
Different pipelines for different classes of instructions
Integer & logic
Integer multiply & divide
Floating point
Branch
Each pipeline
Is optimized for the execution of specific instructions
Requires a different number of cycles to complete
Instructions must be retired in order
Reorder Buffer
Superscalar Architectures
A specific unit decides how to issue instructions
Can be after the decode stage or integrated with it
This activity requires a sort of instruction scheduling
Can be generalized
Out of order execution
(Diagram: FETCH -> DECODE -> SCHEDULER dispatches to EXEC 1 … EXEC N; results go through the REORDER BUFFER before WRITE BACK.)
Superscalar Architectures
Decoded instructions can be stored in a trace cache
Reduces latency
Needs a specific unit
(Diagram: FETCH -> DECODE -> TRACE CACHE -> SCHEDULER -> EXEC 1 … EXEC N -> REORDER BUFFER -> WRITE BACK.)
Superscalar Architectures
Classes of instructions and their pipelines

Integer:        FETCH -> DISPATCH -> DECODE -> EXECUTE -> WRITE BACK
Load-Store:     FETCH -> DISPATCH -> DECODE -> ADDR GEN -> CACHE -> WRITE BACK
Floating Point: FETCH -> DISPATCH -> DECODE -> EXECUTE 1 -> EXECUTE 2 -> WRITE BACK
Branch:         FETCH -> DISPATCH -> DECODE -> EXECUTE -> PREDICT
Superscalar Architectures
Example: RISC 8000 architecture
Explicitly Parallel Architectures
In superscalar architectures it is the hardware that
Analyzes sequences of instructions
Verifies the presence of data dependencies
Dispatches them to different pipelines
Most of these tasks can be done at compile time
Data dependencies
Extraction of Instruction-Level Parallelism
Dispatch to different pipelines
Explicit scheduling on several executors
The resulting architectures are
EPIC: Explicitly Parallel Instruction Computer
VLIW: Very Long Instruction Word
Explicitly Parallel Architectures
Analyzing the structure of a typical RISC code
Compilers can extract parallelism
Code generators can "pack" several small RISC instructions into one long VLIW instruction

Original RISC code:

  LD   R0, #0xF004
  ADD  R1, R2, R2
  ADD  R3, R4, R4
  MUL  R5, #12, R5
  ST   R5, #0xF008

VLIW code (independent instructions packed into one long word; the ST must wait for the MUL result):

  LD R0 #0xF004 | ADD R1 R2 R2 | ADD R3 R4 R4 | MUL R5 #12 R5
  ST R5 #0xF008
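The packing works because the first four instructions touch disjoint registers; only the store depends on the multiply. In C terms, the source the compiler sees might look like this (variable names invented for illustration):

/* Four independent statements can be packed into one VLIW word;
 * the final store depends on the multiply and goes into the next. */
void kernel(int *in, int *out, int b, int d, int e) {
    int a = *in;        /* LD  R0, #0xF004 */
    int c = b + b;      /* ADD R1, R2, R2  */
    int f = d + d;      /* ADD R3, R4, R4  */
    e = 12 * e;         /* MUL R5, #12, R5 */
    *out = e;           /* ST  R5, #0xF008 (waits for the MUL) */
    (void)a; (void)c; (void)f;   /* silence unused-variable warnings */
}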
Explicitly Parallel Architectures
Simplified architecture
(Block diagram: INSTRUCTION CACHE -> FETCH -> IR -> DECODERS -> EXECUTE UNIT #1 … #N, all sharing a common REGISTER FILE and a DATA CACHE.)
Example: TI 6455 DSP Core
Two datapaths, A and B
Provide the same functionality
Registers
Only some registers can be used on both A and B
Exchange data from A to B and vice-versa
Complex routing structure
Example: TI 6455 DSP Core
.L unit - 32/40-bit arithmetic/compare operations
32-bit logical operations
Leftmost 1 or 0 counting for 32 bits
Normalization count for 32 and 40 bits
Byte shifts
Data packing/unpacking
5-bit constant generation
Dual 16-bit arithmetic operations
Quad 8-bit arithmetic operations
Dual 16-bit minimum/maximum operations
Quad 8-bit minimum/maximum operations
.S unit - 32-bit arithmetic operations
32/40-bit shifts and 32-bit bit-field operations
32-bit logical operations
Branches
Constant generation
Register transfers to/from control register file (.S2 only)
Byte shifts
Data packing/unpacking
Dual 16-bit compare operations
Quad 8-bit compare operations
Dual 16-bit shift operations
Dual 16-bit saturated arithmetic operations
Quad 8-bit saturated arithmetic operations
.M unit - 32x32-bit multiply operations
16x16-bit multiply operations
16x32-bit multiply operations
Quad 8x8-bit multiply operations
Dual 16x16-bit multiply operations
Dual 16x16-bit multiply with add/subtract operations
Quad 8x8-bit multiply with add operation
Bit expansion
Bit interleaving/de-interleaving
Variable shift operations
Rotation
Galois Field Multiply
.D unit - 32-bit add, subtract, linear and circular address calculation
Loads and stores with 5-bit constant offset
Loads and stores with 15-bit constant offset (.D2 only)
Load and store doublewords with 5-bit constant
Load and store nonaligned words and doublewords
5-bit constant generation
32-bit logical operations
Example: TI 6455 DSP Core
Generic instruction format
(Figure: a dedicated bit in each instruction selects whether the following instruction executes sequentially or in parallel with it.)
Multi-Core Architecture
Integration technology allows packing several microprocessor cores on a single chip
Multi-core architectures
Many advantages
High performance
Applications can scale very well
But… many problems
Communication among cores
Memory access
Cache coherence
…
Multi-Core Architecture
Simplified view of a single-core architecture: one core (REGS + ALU) behind a BUS INTERFACE
Simplified view of a multi-core architecture: four cores (Core 1 … Core 4, each with its own REGS + ALU) sharing one BUS INTERFACE
Multi-Core Architecture - Execution
Threads run in parallel
Within each core threads are time-sliced
Multi-Core Architecture - Parallelism
ILP – Instruction Level Parallelism
Parallelism at the machine-instruction level
Instruction re-ordering
Instruction pipelining
Splitting instructions into micro-operations
Branch prediction…
Main source of performance gains in the last 15 years
TLP – Thread Level Parallelism
Parallelism on a coarser scale
Servers serve each client in a separate thread (see the sketch below)
Games run AI, graphics, and physics in three threads
Superscalar processors cannot fully exploit TLP
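A minimal illustration of TLP in C with POSIX threads (the per-client work is a stand-in):

#include <pthread.h>
#include <stdio.h>

/* Each "client" is served by its own thread, so independent
 * requests can run on different cores at the same time. */
static void *serve_client(void *arg) {
    int id = *(int *)arg;
    printf("serving client %d\n", id);   /* stand-in for real work */
    return NULL;
}

int main(void) {
    pthread_t th[4];
    int ids[4] = { 0, 1, 2, 3 };
    for (int i = 0; i < 4; i++)
        pthread_create(&th[i], NULL, serve_client, &ids[i]);
    for (int i = 0; i < 4; i++)
        pthread_join(th[i], NULL);
    return 0;
}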
Multi-Core Architecture - Models
Four possible computation models

                 Single instruction            Multiple instructions
Single data      SISD: traditional cores       MISD
                 (RISC, CISC)
Multiple data    SIMD: vector machines,        MIMD: multi-core,
                 graphics cards                many-core
Multi-Core Architecture - Memory
Shared memory
One common shared memory for all processors
Large
Distributed memory
Each processor has its own local memory
Small
Its content is not replicated anywhere else
Multi-Core Architecture - Caches
Cache memories can be
Private
Closer to the core, so faster access
Reduces contention
Cache coherence is a major problem
Shared
Threads on different cores can share cache data
More cache space available when only a few high-performance threads run on the system
Multi-Core Architecture - Caches
Initially all caches are empty
Core 1 reads variable X at address 0xF0
Cache 1 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: - | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 2 reads the same variable X at address 0xF0
Cache 2 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: X=33210 | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 1 modifies variable X
Cache 1 uses a write-through policy
Cache 2 holds a stale value
Cache 1: X=1228 | Cache 2: X=33210 (stale) | Cache 3: - | Cache 4: -
Main memory: X=1228
Multi-Core Architecture - Caches
Then
Core 2 reads variable X
Core 2 hits but reads the wrong value!
Cache 1: X=1228 | Cache 2: X=33210 (wrong!) | Cache 3: - | Cache 4: -
Main memory: X=1228
Multi-Core Architecture - Caches
Invalidation with snooping
Invalidation
When a core writes data, all copies of this data in other
caches are invalidated
Snooping
Cores continuously monitor (“snoop”) the bus
connecting the cores
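The following toy model reproduces the walkthrough below in software: write-through caches with snooping invalidation (a sketch; real protocols such as MESI track more states, and all names here are invented):

#include <stdio.h>

#define CORES 4
static int memory_x = 33210;   /* main memory copy of X   */
static int cache_x[CORES];     /* per-core private caches */
static int valid[CORES];       /* does cache i hold X?    */

static int read_x(int core) {
    if (!valid[core]) {                 /* miss: load from memory   */
        cache_x[core] = memory_x;
        valid[core] = 1;
    }
    return cache_x[core];               /* hit: use the cached copy */
}

static void write_x(int core, int value) {
    cache_x[core] = value;
    valid[core] = 1;
    memory_x = value;                   /* write-through            */
    for (int i = 0; i < CORES; i++)     /* snooping cores see the   */
        if (i != core) valid[i] = 0;    /* write and invalidate     */
}

int main(void) {
    read_x(0);                          /* core 1: miss, loads 33210 */
    read_x(1);                          /* core 2: miss, loads 33210 */
    write_x(0, 1228);                   /* core 1 writes: core 2's copy invalidated */
    printf("core 2 reads %d\n", read_x(1));   /* miss again: 1228 */
    return 0;
}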
Multi-Core Architecture - Caches
Initially all caches are empty
Core 1 reads variable X at address 0xF0
Cache 1 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: - | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 2 reads the same variable X at address 0xF0
Cache 2 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: X=33210 | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 1 modifies variable X
Cache 1 uses a write-through policy with invalidation
Cache 2 holds a stale value
Cache 1: X=1228 | Cache 2: X=33210 (being invalidated) | Cache 3: - | Cache 4: -
Main memory: X=1228
Multi-Core Architecture - Caches
After invalidation
Cache 2 does not contain the value of X any longer
Cache 1: X=1228 | Cache 2: - (invalidated) | Cache 3: - | Cache 4: -
Main memory: X=1228
Multi-Core Architecture - Caches
Then
Core 2 reads variable X
Core 2 has a miss but now reads the correct value
Core 1
Core 2
Core 3
Core 4
Private cache
X=1228
Private cache
X=1228
Private cache
Private cache
MAIN
MEMORY
X=1228
Multi-Core Architecture - Caches
A different solution maintains all caches up-to-date
Update
When a core writes data, all caches are notified of the new value through a broadcast message
Multi-Core Architecture - Caches
Initially all caches are empty
Core 1 reads variable X at address 0xF0
Cache 1 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: - | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 2 reads the same variable X at address 0xF0
Cache 2 has a miss then loads the correct value
Cache 1: X=33210 | Cache 2: X=33210 | Cache 3: - | Cache 4: -
Main memory: X=33210
Multi-Core Architecture - Caches
Then
Core 1 modifies variable X
Cache 1 uses a write-through policy with update
Cache 2 receives the message and updates
Cache 1: X=1228 | Cache 2: X=1228 (updated) | Cache 3: - | Cache 4: -
Main memory: X=1228
Multi-Core Architecture - Caches
After update
Core 2 reads variable X
Core 2 hits and reads the correct value
Cache 1: X=1228 | Cache 2: X=1228 | Cache 3: - | Cache 4: -
Main memory: X=1228
Many-Core Architecture
Evolution of a multi-core approach: many-cores
Hundreds of simpler cores
Many local memories
Complex communication network
Often a NoC (Network-on-Chip)
One or more “supervisor cores”
Many-Core Architecture
Simplified architecture
(Diagram: a grid of small cores, each with its own local memory, interconnected by a NoC; NoC bridges connect the grid to supervisor cores, global memory (MEM), and I/O.)
Further reading
1. Computer Architecture: A Quantitative Approach. John L. Hennessy, David A. Patterson, Andrea C. Arpaci-Dusseau
2. The SPARC Architecture Manual, Version 8. SPARC International
3. TMS320C6455 Fixed-Point Digital Signal Processor. Texas Instruments data sheet
4. TMS320C64x/C64x+ DSP CPU and Instruction Set. Texas Instruments data sheet
5. Planning Considerations for Multicore Processor Technology. Dell report
6. Many-core Architecture and Programming Challenges. Satnam Singh, Microsoft Research