Advanced Topics in Compilation
COP5622
Prof. Robert van Engelen
Fall 2005

Syllabus
• Lectures: Mon & Wed, 2:00PM, 103LOV
• Prerequisites: COP5621
• Instructor: Prof. Robert van Engelen
• Office: 471DSL
• Office hours: Tue 1:00PM
• Email: engelen@cs.fsu.edu
• Web page: http://www.cs.fsu.edu/~engelen/courses/COP5622

Books

… Embedded Computing?

“If you round off the fractions, embedded systems consume 100% of the worldwide production of microprocessors.”
- Jim Turley, Editor, Computer Industry Analyst

There is no sharp line between embedded and general-purpose computing. Modern (embedded) CPUs combine general-purpose and DSP features, mutatis mutandis.

The “Center of Gravity” of Computing

| | Mainframes | Minicomputers | Desktop systems | Smart products |
|---|---|---|---|---|
| Era | 1950s | 1970s | 1980s | 2000s |
| Form factor | Multi-cabinet | Multiple boards | Single board | Single chip |
| Resource type | Corporate | Departmental | Personal | Embedded |
| Users per CPU | 100s-1000s | 10s-100s | 1 user | 100s CPUs/user |
| Typical system cost | $1 million+ | $100,000s+ | $1,000-$10,000s | $10-$100 |
| Worldwide units | 10,000s+ | 100,000+ | 100,000,000s | 100,000,000,000s |
| Major platforms | IBM, CDC, Burroughs, Sperry, GE, Honeywell, Univac, NCR | DEC, IBM, Prime, Wang, HP, Pyramid, Data General, … | Apple, IBM, Compaq, Sun, HP, SGI, Dell, … | ? |
| Operating systems | By manufacturer | By manufacturer, some Unix | DOS, MacOS, Windows, Unix/Linux | ? |

Source: J. Fisher, P. Faraboschi, C. Young

Superscalar, VLIW, and EPIC

| Name | Issue structure | Hazard detection | Scheduling | Distinguishing characteristic | Examples |
|---|---|---|---|---|---|
| Superscalar (static) | Dynamic | Hardware | Static | In-order execution | Sun UltraSparc II/III |
| Superscalar (dynamic) | Dynamic | Hardware | Dynamic | Some out-of-order execution | IBM Power2 |
| Superscalar (speculative) | Dynamic | Hardware | Dynamic with speculation | Out-of-order execution with speculation | Pentium III/4, MIPS R10K, Alpha 21264, HP PA 8500, IBM RS64III |
| VLIW/LIW | Static | Software | Static | No hazards between issue packets | Trimedia, i860 |
| EPIC | Mostly static | Mostly software | Mostly static | Explicit dependences marked by compiler | Itanium, Itanium2 |

Source: J. Hennessy & D. Patterson

Superscalar versus VLIW

Source: J. Fisher, P. Faraboschi, C. Young

HP PA-8000

Source: J. Fisher, P. Faraboschi, C. Young

• Instruction reorder buffer is used to issue operations to the execution units
• Operations are scheduled out-of-order
• Instruction reorder buffer takes prime real estate

Role of the Compiler for Superscalar and VLIW

| | Sequential architectures | Dependence architectures | Independence architectures |
|---|---|---|---|
| Processor style | Superscalar | Dataflow | VLIW |
| Dependence information in the program | Implicit in register names | An exact description of all dependence information | Description of operations that are independent |
| How dependent operations are typically exposed | By the hardware’s control unit | By the compiler (they are embedded in the program) | By the compiler (they are implicit in the program) |
| How independent operations are typically exposed | By the hardware’s control unit | By the hardware’s control unit | By the compiler (they are embedded in the program) |
| Where scheduling is typically performed | In the hardware’s control unit | In the hardware’s control unit | In the compiler |
| Role of the compiler | Rearranges code to make ILP more evident and accessible | Replaces some of the hardware | Replaces virtually all hardware dedicated to ILP exposure and scheduling |
Source: J. Fisher, P. Faraboschi, C. Young

So, What’s Next?
• Some evidence things are about to change:
  – Intel announced radically redesigned x86 core
  – Apple decided to adopt Intel cores
  – Steve Jobs: Performance per Watt must increase
    • PowerPC: 15 computation units per Watt
    • New Intel x86 core: 70 computation units per Watt
  – Out-of-order hardware is power hungry
  – VLIW shown to reduce power consumption
  – New compiler technology (Elbrus) bought by Intel
• Conclusion…

Can VLIW Really Compete with Superscalar?

“A fanatic is one who can’t change his mind and won’t change the subject.”
- [attributed to] Sir Winston S. Churchill, British Prime Minister

“Transmeta and Itanium not living up to promises”

Fallacy: VLIW controls everything in software.
Fallacy: VLIWs require “Heroic Compilers” to do what superscalars do in hardware.

Embedded System Complexity and Cost

“Any intelligent fool can make things bigger, more complex, and more violent. It takes a touch of genius -- and a lot of courage -- to move in the opposite direction.”
- Ernst F. Schumacher, German Economist, 1910-1977

In embedded systems, smaller is often better
• Silicon cost scales with the cube of the area
• Low-power constraints
• Also custom designs, e.g. ASIC (> 100,000 units), FPGA (low volume), ASIP, SoPC

VLIW and ILP
• ILP has significant impact on performance, which is important for high-end systems
  – Superscalar
  – VLIW/EPIC
• For embedded systems, ILP yields performance gains and power savings (lower clock rate), provided that the ILP implementation is low-cost and low-power
  – VLIW
  – DSP and custom

Example (VEX)

Source: J. Fisher, P. Faraboschi, C. Young

Example (Compacted VEX)

Source: J. Fisher, P. Faraboschi, C. Young

ILP Compilers

“A worker may be the hammer’s master, but the hammer still prevails.
A tool knows exactly how it is meant to be handled, while the user of the tool can only have an approximate idea.”
- Milan Kundera, Czech writer, 1929-

An ILP compiler is possibly the largest investment in engineering effort in a VLIW system.

Compiler in the Embedded Toolchain Workflow

Compilation with Profiling

What is Important in an ILP Compiler?
• Parallelism is key to performance, price/performance, power, and cost
• One-design-fits-all compilers (gcc) cannot optimize well over all platforms
• Compiler technology lags behind hardware
• Back-end optimizations are crucial for ILP
  – Responsible for finding and organizing parallelism
  – May need multiple intermediate representations (IRs)

Structure of an ILP Compiler

Embedded-Specific Tradeoffs for Compilers
• Space, time, and energy tradeoffs
• These are conflicting goals for the compiler
• Ideally, a compiler should expose some linear combination of the optimization dimensions to application developers:
  K1{speed} + K2{code_size} + K3{energy_efficiency}

Effect of Compiler Optimizations on Space
• Code size determines the cost of ROM
• Should minimize I-cache and D-cache misses
• Code layout techniques:
  – DAG-based placement
  – Pettis-Hansen
  – Inlining
  – Cache line coloring
  – Temporal-order placement

Code Placement Gains

Source: J. Fisher, P. Faraboschi, C. Young

Fundamentals of Power Dissipation: Switching

Unlike TTL or ECL, CMOS transistors draw significant current only while switching, and power dissipation depends linearly on frequency and quadratically on voltage:

$P_{\mathrm{switch}} \propto A \, V^2 f_s \sum_i C_i$

where
  $f_s$  switching frequency
  $C_i$  load capacitance on net $i$
  $A$    fraction of nets in the circuit that actually switch
  $V$    supply voltage

Fundamentals of Power Dissipation: Leakage

CMOS transistors also drain a small leakage current when open or closed: typically only about 1% of total dissipation in a 0.25 µm process, rising to about 30% in a 90 nm process.
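The dependence of switching power on frequency and voltage described above can be checked with a small numerical sketch. The function name `relative_switching_power`, the symbol `v` for the supply voltage, and every numeric value are illustrative assumptions, and the result is only proportional to power because no constant factor is given in the text.

```python
def relative_switching_power(activity, load_caps, v, f_s):
    """Quantity proportional to CMOS switching power.

    activity  -- A, fraction of nets that actually switch (0..1)
    load_caps -- iterable of C_i, load capacitance on each net
    v         -- supply voltage (assumed symbol, not in the slide legend)
    f_s       -- switching frequency
    """
    return activity * sum(load_caps) * v ** 2 * f_s


# Hypothetical circuit: four nets, 20% of them switching.
caps = [1.0e-15, 2.0e-15, 1.5e-15, 0.5e-15]
base = relative_switching_power(0.2, caps, v=1.2, f_s=1.0e9)

# Halving the frequency halves the power (linear dependence).
half_f = relative_switching_power(0.2, caps, v=1.2, f_s=0.5e9)

# Halving the voltage quarters the power (quadratic dependence).
half_v = relative_switching_power(0.2, caps, v=0.6, f_s=1.0e9)

print(half_f / base)  # close to 0.5
print(half_v / base)  # close to 0.25
```

The asymmetry between the two ratios is why lowering the voltage is the more effective lever for reducing switching power.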
Dynamic voltage scaling (DVS) lowers the operational frequency (and with it the supply voltage), which increases the execution time from $t_1$ to $t_2 = t_1 (f_1 / f_2)$, with an energy saving that is approximately quadratic in the voltage:

$\dfrac{E_2}{E_1} = \dfrac{P_2 \, t_2}{P_1 \, t_1} \approx \left( \dfrac{V_2}{V_1} \right)^2$

where $V_1$ and $V_2$ are the supply voltages at frequencies $f_1$ and $f_2$.

Power-aware Software Techniques
• Reducing switching activity
• Power-aware instruction selection
• Scheduling for minimal dissipation
• Memory access optimizations
• Data remapping

Effect of Compiler Optimizations on Power
• Most of the traditional “scalar” optimizations benefit space, time, and (therefore) power
• Not so for:
  – Loop unrolling
  – Tail duplication
  – Inlining and cloning
  – Speculation and predication
  – Global code motion
• These and other optimizations will be discussed in subsequent lectures
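The dynamic voltage scaling relation from the power dissipation slides can be worked through numerically. This is a minimal sketch that assumes switching power proportional to V² f and ignores leakage; the function name `dvs_scale` and the sample operating points are hypothetical.

```python
def dvs_scale(t1, f1, v1, f2, v2):
    """Return (t2, energy_ratio) when moving from (f1, v1) to (f2, v2).

    Assumes a fixed number of cycles, so run time stretches as
    t2 = t1 * (f1 / f2), and switching power proportional to V^2 * f
    (leakage ignored).
    """
    t2 = t1 * (f1 / f2)
    p_ratio = (v2 ** 2 * f2) / (v1 ** 2 * f1)  # P2 / P1
    energy_ratio = p_ratio * (t2 / t1)          # E2 / E1 = (P2 * t2) / (P1 * t1)
    return t2, energy_ratio


# Hypothetical operating points: halve the clock and lower the supply
# voltage from 1.2 V to 0.9 V.
t2, e_ratio = dvs_scale(t1=1.0, f1=1.0e9, v1=1.2, f2=0.5e9, v2=0.9)
print(t2)       # 2.0: the task takes twice as long
print(e_ratio)  # about 0.5625, i.e. (0.9 / 1.2) ** 2
```

The frequency terms cancel in the energy ratio, so under these assumptions the saving comes entirely from the lower voltage: the run is slower but uses roughly 44% less switching energy.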