
Hardware Support for Compiler
Speculation
• Compiler needs to move instructions before
branch, possibly before condition
• Requirements:
– Instructions that can be moved without
disrupting data flow
– Exceptions that can be ignored until outcome is
known
– Ability to speculatively access memory with
potential address conflicts
Exception Support
• Four methods:
– Hardware and OS cooperate to ignore
exceptions for speculative instructions
– Speculative instructions never raise exceptions;
explicit checks must be made
– Poison bits used to mark registers with invalid
results; use causes exception
– Speculative results are buffered until it is certain
that the instruction is no longer speculative
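The poison-bit method can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of real hardware: the `SpecReg` type and the helpers `spec_load`, `spec_add` and `use_nonspeculative` are hypothetical names, and a NULL address stands in for a faulting memory access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a register with a poison bit attached. */
typedef struct {
    long value;
    bool poison;   /* set when a speculative instruction would have faulted */
} SpecReg;

/* Speculative load: a fault (modelled here as a NULL address) is not raised;
   the destination register is marked as poisoned instead. */
SpecReg spec_load(const long *addr) {
    SpecReg r = {0, false};
    if (addr == NULL) {
        r.poison = true;
    } else {
        r.value = *addr;
    }
    return r;
}

/* Speculative add: poison propagates from either source to the result. */
SpecReg spec_add(SpecReg a, SpecReg b) {
    SpecReg r = { a.value + b.value, a.poison || b.poison };
    return r;
}

/* Nonspeculative use: returns false (the point where the exception would be
   raised) if the register is poisoned; otherwise copies the value to *out. */
bool use_nonspeculative(SpecReg r, long *out) {
    assert(out != NULL);
    if (r.poison) {
        return false;
    }
    *out = r.value;
    return true;
}
```

In this model a faulting speculative load costs nothing unless its result (or anything computed from it) is later used by a nonspeculative instruction.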
Exception Handling
• Nonterminating exceptions can be handled
normally (e.g. page fault)
– May cause serious performance loss
Memory Reference Speculation
• Moving loads across stores is only safe if
the addresses do not conflict
• Special instructions check for address
conflicts
4.6. Crosscutting Issues: Hardware
–vs– Software Speculation
• A number of trade-offs and limitations
– Disambiguating memory references is hard for
a compiler
– Hardware branch prediction is usually better
– Precise exceptions easier in hardware
– Hardware does not require “housekeeping”
code
– Compilers can “look” further
– Hardware techniques are more portable
Hardware/Software Speculation
• Major disadvantage of hardware:
complexity!
• Some architectures combine hardware and
software approaches
4.7. Putting It All Together:
IA-64 and Itanium
• IA-64
– RISC-style
• Register-register
• Emphasis on software-based optimisations
• Features:
– 128 × 65-bit integer registers
– 128 × 82-bit FP registers
– 64 predicate registers; 8 branch registers
Registers
• Integer registers
– Use windowing mechanism
• 0–31 always visible
• Remainder arranged in overlapping windows
– Local and out areas (variable size)
– Hardware for over-/underflow
• Int and FP registers support register rotation
– Supports software pipelining
Instruction Format and VLIW
• Compiler schedules parallel instructions;
flags dependences
• Instruction group
– Sequence of (register) independent instructions
– Compiler marks boundaries between groups
(stop)
• Bundle
– 128 bits: 5-bit template + 3 × 41-bit
instructions
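The bundle arithmetic (5 + 3 × 41 = 128 bits) can be checked with a small packing sketch. It assumes the template occupies the lowest five bits with the three slots following in ascending bit order, and it holds the bundle as two 64-bit halves; the `Bundle` and `Decoded` types and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SLOT_BITS 41
#define SLOT_MASK ((UINT64_C(1) << SLOT_BITS) - 1)

/* A 128-bit bundle held as two 64-bit halves (hypothetical representation). */
typedef struct {
    uint64_t lo;   /* bundle bits 0..63   */
    uint64_t hi;   /* bundle bits 64..127 */
} Bundle;

typedef struct {
    unsigned template_field;   /* 5 bits */
    uint64_t slot[3];          /* 41 bits each */
} Decoded;

/* Assumed layout: template in bits 0..4, slot 0 in bits 5..45,
   slot 1 in bits 46..86, slot 2 in bits 87..127. */
Bundle bundle_pack(unsigned tmpl, uint64_t s0, uint64_t s1, uint64_t s2) {
    assert(tmpl < 32 && s0 <= SLOT_MASK && s1 <= SLOT_MASK && s2 <= SLOT_MASK);
    Bundle b;
    /* Slot 1 straddles the halves: its low 18 bits go in lo, the other 23 in hi. */
    b.lo = (uint64_t)tmpl | (s0 << 5) | (s1 << 46);
    b.hi = (s1 >> 18) | (s2 << 23);
    return b;
}

Decoded bundle_unpack(Bundle b) {
    Decoded d;
    d.template_field = (unsigned)(b.lo & 0x1F);
    d.slot[0] = (b.lo >> 5) & SLOT_MASK;
    d.slot[1] = ((b.lo >> 46) | (b.hi << 18)) & SLOT_MASK;
    d.slot[2] = (b.hi >> 23) & SLOT_MASK;
    return d;
}
```

Packing and then unpacking returns the original template and slots, which confirms that the four fields fill the 128 bits exactly with no spare bits.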
Instruction Bundle
• Template specifies stops and execution unit
– I-unit (int + special: multimedia, etc.)
– M-unit (int + memory access)
– F-unit (FP)
– B-unit (branches)
– L+X (extended instructions)
Example
for (int k = 0; k < 1000; k++)
{
    x[k] = x[k] + s;
}
• Unrolled seven times
– Optimised for size:
• 9 bundles; 15% nops
• 21 cycles (3 per calculation)
– Optimised for performance:
• 11 bundles; 30% nops
• 12 cycles (1.7 per calculation)
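At source level, unrolling the loop seven times might look like the sketch below. The function name `add_scalar_unrolled7` and the `double` element type are assumptions, and the cleanup loop is one common way to handle the leftover elements (1000 = 7 × 142 + 6); the slides do not say how the remainder is treated. The per-calculation figures above follow from dividing by the seven calculations per iteration: 21 / 7 = 3 and 12 / 7 ≈ 1.7 cycles.

```c
#include <assert.h>
#include <stddef.h>

/* Adds s to each of the n elements of x, with the loop body unrolled
   seven times (hypothetical helper, for illustration only). */
void add_scalar_unrolled7(double *x, size_t n, double s) {
    assert(x != NULL || n == 0);
    size_t k = 0;

    /* Main loop: seven independent updates per iteration, which gives
       the scheduler seven calculations to pack into bundles. */
    for (; k + 7 <= n; k += 7) {
        x[k]     += s;
        x[k + 1] += s;
        x[k + 2] += s;
        x[k + 3] += s;
        x[k + 4] += s;
        x[k + 5] += s;
        x[k + 6] += s;
    }

    /* Cleanup loop: handles the elements left over when n is not a
       multiple of seven (six elements when n is 1000). */
    for (; k < n; k++) {
        x[k] += s;
    }
}
```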
Instructions
• 41 bits long
– 4-bit opcode (+ template bits)
– 6-bit predicate register specifier
• Predication
– Almost all instructions can be predicated
• Branch is jump with predicate check!
– Complex comparisons set two predicate
registers
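Predication can be modelled in C as follows. This is a sketch of the idea rather than IA-64 code: `PredPair`, `cmp_eq`, `pred_add` and `if_converted` are hypothetical names, and the C conditional expression merely stands in for hardware that suppresses the register write when the guarding predicate is false.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Two predicate registers written by a single compare. */
typedef struct {
    bool p_true;    /* set when the comparison holds */
    bool p_false;   /* its complement */
} PredPair;

/* Models a compare that sets a predicate and its complement together. */
PredPair cmp_eq(long a, long b) {
    PredPair p = { a == b, a != b };
    return p;
}

/* Models a predicated add: the sum is written only when the guard is set;
   otherwise the destination keeps its old value. */
long pred_add(bool guard, long dest, long a, long b) {
    return guard ? a + b : dest;
}

/* If-converted form of:  if (a == b) x = x + 1; else y = y - 1;
   Both arms are issued; the predicates decide which one takes effect. */
void if_converted(long a, long b, long *x, long *y) {
    assert(x != NULL && y != NULL);
    PredPair p = cmp_eq(a, b);
    *x = pred_add(p.p_true,  *x, *x, 1);
    *y = pred_add(p.p_false, *y, *y, -1);
}
```

Because the two arms are guarded by complementary predicates, exactly one of them changes state, and no branch is needed to choose between them.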
Speculation
• Exceptions can be deferred
– Uses poison bits (65-bit registers)
– Nonspeculative and chk instructions raise
exception
• Speculative loads
– Called advanced load (ld.a)
– Stores check addresses
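The advanced-load mechanism can be sketched as a small table of load addresses that stores check against. This is a simplified software model under stated assumptions: the `LoadTable` type, its fixed capacity and all function names are hypothetical, and real hardware tracks entries by register and uses partial address matching.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TABLE_SIZE 8   /* hypothetical capacity */

/* Addresses of outstanding advanced loads; NULL marks a free or
   invalidated entry. */
typedef struct {
    const long *addr[TABLE_SIZE];
} LoadTable;

/* Advanced load: read the value early and record the address. */
long advanced_load(LoadTable *t, int entry, const long *addr) {
    assert(t != NULL && addr != NULL && entry >= 0 && entry < TABLE_SIZE);
    t->addr[entry] = addr;
    return *addr;
}

/* Store: write the value and invalidate every entry with the same address. */
void checked_store(LoadTable *t, long *addr, long value) {
    assert(t != NULL && addr != NULL);
    *addr = value;
    for (int i = 0; i < TABLE_SIZE; i++) {
        if (t->addr[i] == addr) {
            t->addr[i] = NULL;
        }
    }
}

/* True if no intervening store has touched the advanced load's address. */
bool load_still_valid(const LoadTable *t, int entry) {
    assert(t != NULL && entry >= 0 && entry < TABLE_SIZE);
    return t->addr[entry] != NULL;
}

/* Check at the load's original position: keep the early value if the entry
   survived, otherwise redo the load. The entry is released either way. */
long check_load(LoadTable *t, int entry, const long *addr, long early_value) {
    assert(addr != NULL);
    if (load_still_valid(t, entry)) {
        t->addr[entry] = NULL;
        return early_value;
    }
    return *addr;
}
```

The common case (no conflict) pays only for the table lookup; the reload happens only when a store really did hit the same address.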
Itanium
• First implementation of IA-64
• Issues up to six instructions per cycle (two
bundles)
• Nine functional units
– 2 × I, 2 × M, 3 × B, 2 × F
• 10-stage pipeline
• Multilevel dynamic branch predictor
Itanium
• Complex hardware with many features of
dynamically scheduled pipelines!
– Branch prediction
– Register renaming
– Scoreboarding
– Deep pipeline
– etc.
Itanium: Performance
• SPECint not too impressive
– 85% of Alpha 21264 (older, more power-efficient processor!)
• FP better
– Faster, even with slower clock!
– But skewed by one benchmark for Pentium
– Alpha compilers need improvement
4.8. Another View:
ILP in Embedded Processors
• Trimedia (see chapter 2)
– “Classic” VLIW
– Hardware decompression of code
• Crusoe
– Software translation of 80x86 to VLIW
– Low power
Trimedia TM32 Architecture
• VLIW
– Instruction specifies five operations
– Static scheduling
– No hardware hazard detection
– 23 functional units (11 types)
Transmeta Crusoe
• Low power design
• Emulates 80x86
• VLIW
– 64-bit (2 op) and 128-bit (4 op) instructions
– Five types of operations:
• ALU (int, register-register)
• Compute (int ALU, FP, multimedia)
• Memory
• Branch
• Immediate
Crusoe
• Simple, in-order pipeline
– Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
– FP: 10-stage (5 EX stages)
Crusoe
• Software interpretation of 80x86 code:
– Basic blocks cached
– Exception handling complicated
• Crusoe has good support for speculative reordering
• Memory writes buffered and committed only when
safe
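Buffering writes until they are safe to commit can be sketched as a simple store buffer. This is an illustrative model only: the `StoreBuffer` type, its capacity and the function names are hypothetical, and the sketch omits forwarding of buffered values to later loads.

```c
#include <assert.h>
#include <stddef.h>

#define BUF_CAP 16   /* hypothetical capacity */

typedef struct {
    long *addr;
    long  value;
} PendingStore;

typedef struct {
    PendingStore entry[BUF_CAP];
    size_t count;
} StoreBuffer;

/* Record a speculative write without touching memory.
   Returns 0 on success, -1 if the buffer is full. */
int buffered_store(StoreBuffer *sb, long *addr, long value) {
    assert(sb != NULL && addr != NULL);
    if (sb->count == BUF_CAP) {
        return -1;
    }
    sb->entry[sb->count].addr = addr;
    sb->entry[sb->count].value = value;
    sb->count++;
    return 0;
}

/* Commit: the speculation succeeded, so drain the writes in program order. */
void commit_stores(StoreBuffer *sb) {
    assert(sb != NULL);
    for (size_t i = 0; i < sb->count; i++) {
        *sb->entry[i].addr = sb->entry[i].value;
    }
    sb->count = 0;
}

/* Rollback: an exception or misspeculation occurred, so drop the writes;
   memory is left exactly as it was. */
void discard_stores(StoreBuffer *sb) {
    assert(sb != NULL);
    sb->count = 0;
}
```

Since memory is untouched until the commit, recovering from a failed speculation only requires emptying the buffer.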
Crusoe Performance
• Hard to measure accurately
• Power consumption is low (⅓ of Pentium)
4.9. Fallacies and Pitfalls
• Fallacy: There is a simple approach to
multiple-issue (high performance with low
complexity)
– Big gap between peak and sustained
performance for multiple issue processors
• Need dynamic scheduling, speculation support,
branch prediction, sophisticated prefetch, etc.
• Sophisticated compilers are required
4.10. Concluding Comments
• “Hardware” techniques migrating to
“software” and vice versa
• Multiprocessors may be important in future
Chapter 5
Memory Hierarchy Design
Memory Hierarchies
• Not a new idea!
• Takes advantage of the principle of locality
– Temporal
– Spatial
• Small, fast memories close to processor
Memory Hierarchies
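[Figure: memory hierarchy, from top to bottom: registers, cache, memory, I/O devices (virtual memory). Speed and cost increase toward the top; size increases toward the bottom.]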
Introduction
• Usually includes responsibility for memory
protection
• Performance is a major problem
Figure 5.2
Characterising Levels of the
Memory Hierarchy
• Four questions:
– Where can a block be placed? (placement)
– How is a block found? (identification)
– Which block should be replaced on a miss?
(replacement)
– What happens on a write? (write strategy)
Example
• The Alpha 21264 is used as an example
throughout
Caches
• Where is a block placed in a cache?
– Three possible answers, giving three different types:
• Anywhere: fully associative
• Only into one block: direct mapped
• Into a subset of blocks: set associative
Cache Categories
• Set associative
– n-way set associative, where n is number of
blocks in set
– Commonly, n = 2 or n = 4
• Direct-mapped
– “1-way set associative”
• Fully associative
– “m-way set associative” (m is total number of
blocks in cache)
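All three categories can be covered by one mapping function, using the usual rule that a block maps to set (block address) mod (number of sets). The function names `num_sets` and `set_index` are hypothetical, and the sketch assumes the total number of blocks is a multiple of the associativity.

```c
#include <assert.h>
#include <stdint.h>

/* Number of sets in a cache of total_blocks blocks with `ways` blocks per set. */
uint32_t num_sets(uint32_t total_blocks, uint32_t ways) {
    assert(ways > 0 && total_blocks > 0 && total_blocks % ways == 0);
    return total_blocks / ways;
}

/* Set that a memory block maps to: (block address) mod (number of sets).
   ways == 1             gives a direct-mapped cache (one candidate block);
   ways == total_blocks  gives a fully associative cache (a single set). */
uint32_t set_index(uint64_t block_addr, uint32_t total_blocks, uint32_t ways) {
    return (uint32_t)(block_addr % num_sets(total_blocks, ways));
}
```

For an 8-block cache and memory block 12, `set_index(12, 8, 1)` gives block 4 (direct mapped), `set_index(12, 8, 2)` gives set 0 of 4 (two-way set associative), and `set_index(12, 8, 8)` gives set 0, the only set (fully associative).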