Hardware Support for Compiler Speculation
• Compiler needs to move instructions before a branch, possibly before the condition is evaluated
• Requirements:
  – Instructions that can be moved without disrupting data flow
  – Exceptions that can be ignored until the branch outcome is known
  – Ability to speculatively access memory with potential address conflicts

Exception Support
• Four methods:
  – Hardware and OS cooperate to ignore exceptions for speculative instructions
  – Speculative instructions never raise exceptions; explicit checks must be made
  – Poison bits mark registers holding invalid results; using such a register causes an exception
  – Speculative results are buffered until it is certain the instruction is no longer speculative

Exception Handling
• Nonterminating exceptions (e.g. page fault) can be handled normally
  – May cause serious performance loss

Memory Reference Speculation
• Moving loads across stores is only safe if the addresses do not conflict
• Special instructions check for address conflicts

4.6. Crosscutting Issues: Hardware versus Software Speculation
• A number of trade-offs and limitations
  – Disambiguating memory references is hard for a compiler
  – Hardware branch prediction is usually better
  – Precise exceptions are easier in hardware
  – Hardware does not require “housekeeping” code
  – Compilers can “look” further ahead in the code
  – Hardware techniques are more portable

Hardware/Software Speculation
• Major disadvantage of hardware: complexity!
• Some architectures combine hardware and software approaches

4.7. Putting It All Together: IA-64 and Itanium
• IA-64
  – RISC-style
    • Register-register
    • Emphasis on software-based optimisations
• Features:
  – 128 × 65-bit integer registers
  – 128 × 82-bit FP registers
  – 64 predicate registers; 8 branch registers

Registers
• Integer registers
  – Use windowing mechanism
    • 0–31 always visible
    • Remainder arranged in overlapping windows
  – Local and out areas (variable size)
  – Hardware for over-/underflow
• Int and FP registers support register rotation
  – Supports software pipelining

Instruction Format and VLIW
• Compiler schedules parallel instructions; flags dependences
• Instruction group
  – Sequence of (register) independent instructions
  – Compiler marks boundaries between groups (stop)
• Bundle
  – 128 bits: 5-bit template + 3 × 41-bit instructions

Instruction Bundle
• Template specifies stops and execution unit
  – I-unit (int + special: multimedia, etc.)
  – M-unit (int + memory access)
  – F-unit (FP)
  – B-unit (branches)
  – L+X (extended instructions)

Example
    for (int k = 0; k < 1000; k++) {
        x[k] = x[k] + s;
    }
• Unrolled seven times
  – Optimised for size:
    • 9 bundles; 15% nops
    • 21 cycles (3 per calculation)
  – Optimised for performance:
    • 11 bundles; 30% nops
    • 12 cycles (1.7 per calculation)

Instructions
• 41 bits long
  – 4-bit opcode (+ template bits)
  – 6-bit predicate register specifier
• Predication
  – Almost all instructions can be predicated
    • Branch is jump with predicate check!
  – Complex comparisons set two predicate registers

Speculation
• Exceptions can be deferred
  – Uses poison bits (65-bit registers)
  – Nonspeculative and chk instructions raise the exception
• Speculative loads
  – Called advanced load (ld.a)
  – Stores check addresses

Itanium
• First implementation of IA-64
• Issues up to six instructions per cycle (two bundles)
• Nine functional units
  – 2 × I, 2 × M, 3 × B, 2 × F
• 10-stage pipeline
• Multilevel dynamic branch predictor

Itanium
• Complex hardware with many features of dynamically scheduled pipelines!
  – Branch prediction
  – Register renaming
  – Scoreboarding
  – Deep pipeline
  – etc.

Itanium: Performance
• SPECint not too impressive
  – 85% of Alpha 21264 (older, more power-efficient processor!)
• FP better
  – Faster, even with slower clock!
  – But skewed by one benchmark for Pentium
  – Alpha compilers need improvement

4.8. Another View: ILP in Embedded Processors
• Trimedia (see chapter 2)
  – “Classic” VLIW
  – Hardware decompression of code
• Crusoe
  – Software translation of 80x86 to VLIW
  – Low power

Trimedia TM32 Architecture
• VLIW
  – Instruction specifies five operations
  – Static scheduling
  – No hardware hazard detection
  – 23 functional units (11 types)

Transmeta Crusoe
• Low power design
• Emulates 80x86
• VLIW
  – 64-bit (2 op) and 128-bit (4 op) instructions
  – Five types of operations:
    • ALU (int, register-register)
    • Compute (int ALU, FP, multimedia)
    • Memory
    • Branch
    • Immediate

Crusoe
• Simple, in-order pipeline
  – Integer: 6-stage (IF1, IF2, DEC, OP, EX, WB)
  – FP: 10-stage (5 EX stages)

Crusoe
• Software interpretation of 80x86 code:
  – Basic blocks cached
  – Exception handling complicated
• Crusoe has good support for speculative reordering
• Memory writes buffered and committed only when safe

Crusoe Performance
• Hard to measure accurately
• Power consumption is low (⅓ of Pentium)

4.9. Fallacies and Pitfalls
• Fallacy: There is a simple approach to multiple-issue (high performance with low complexity)
  – Big gap between peak and sustained performance for multiple-issue processors
    • Need dynamic scheduling, speculation support, branch prediction, sophisticated prefetch, etc.
    • Sophisticated compilers are required

4.10. Concluding Comments
• “Hardware” techniques migrating to “software” and vice versa
• Multiprocessors may be important in future

Chapter 5
Memory Hierarchy Design

Memory Hierarchies
• Not a new idea!
• Takes advantage of the principle of locality
  – Temporal
  – Spatial
• Small, fast memories close to processor

Memory Hierarchies
[Figure: memory hierarchy with levels Registers, Cache, Memory, I/O Devices (virtual memory); axes labelled Speed, Cost and Size]

Introduction
• Usually includes responsibility for memory protection
• Performance is a major problem

Figure 5.2

Characterising Levels of the Memory Hierarchy
• Four questions:
  – Where can a block be placed? (placement)
  – How is a block found? (identification)
  – Which block should be replaced on a miss? (replacement)
  – What happens on a write? (write strategy)

Example
• The Alpha 21264 is used as an example throughout

Caches
• Where is a block placed in a cache?
  – Three possible answers, giving three different types:

    Placement               Cache type
    Anywhere                Fully associative
    Only into one block     Direct mapped
    Into subset of blocks   Set associative

Cache Categories
• Set associative
  – n-way set associative, where n is number of blocks in set
  – Commonly, n = 2 or n = 4
• Direct-mapped
  – “1-way set associative”
• Fully associative
  – “m-way set associative” (m is total number of blocks in cache)
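
The three cache categories above differ only in how many sets the cache is divided into. The following is a minimal sketch in C, assuming the common modulo mapping (set = block address mod number of sets), which these notes do not state explicitly. The function names num_sets and set_index are hypothetical and chosen only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: number of sets in a cache holding total_blocks
   blocks, organised as ways-way set associative.
   ways == 1            -> direct mapped (one block per set)
   ways == total_blocks -> fully associative (a single set) */
unsigned num_sets(unsigned total_blocks, unsigned ways)
{
    assert(ways > 0 && total_blocks % ways == 0);
    return total_blocks / ways;
}

/* Assumed placement rule (not stated in the notes): a block may be
   placed anywhere within the set (block address) mod (number of sets). */
unsigned set_index(unsigned block_addr, unsigned total_blocks, unsigned ways)
{
    return block_addr % num_sets(total_blocks, ways);
}
```

Under these assumptions, for an 8-block cache and block address 12, set_index(12, 8, 1) gives set 4 (direct mapped), set_index(12, 8, 2) gives set 0 of 4 (two-way set associative), and set_index(12, 8, 8) gives set 0, the only set (fully associative).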