The University of Texas at Austin

What Programming Language/Compiler Researchers Should Know about Computer Architecture

Lizy Kurian John
Department of Electrical and Computer Engineering
The University of Texas at Austin

Lizy Kurian John, LCA, UT Austin

Somebody once said: "Computers are dumb actors and compilers/programmers are the master playwrights."

Computer Architecture Basics
- ISAs: RISC vs CISC
- Assembly language coding
- Datapath (ALU) and controller
- Pipelining
- Caches
- Out-of-order execution
(Hennessy and Patterson architecture books)

Basics
- ILP, DLP, TLP
- Massive parallelism
- SIMD/MIMD
- VLIW
- Performance and power metrics
(Hennessy and Patterson architecture books; ASPLOS, ISCA, MICRO, HPCA)

The Bottom Line
- Programming language choice affects performance and power (e.g., Java)
- Compilers affect performance and power

A Java Hardware Interpreter
- A hardware bytecode translator converts Java class-file bytecodes into native machine instructions, which the processor then fetches, decodes, and executes as a native executable.
- Radhakrishnan, Ph.D. 2000 (ISCA 2000, ICS 2001)
- This technique was used by Nazomi Communications and Parthus (Chicory Systems)

[Figure: HardInt performance — execution cycles (millions) on a 4-way machine for db, javac, jess, mpeg, and mtrt under the JDK 1.1.6 interpreter and JIT, the JDK 1.2 interpreter and JIT, and Hard-Int.]
- Hard-Int performs consistently better than the interpreter
- In JIT mode, significant performance boost in 4 of 5 applications
Compiler and Power

[Figure: a data dependence graph (DDG) of six operations A-F, scheduled two ways over four cycles. One schedule issues up to three operations in a single cycle (peak power = 3, energy = 6); a rebalanced schedule never issues more than two (peak power = 2, energy = 6). The compiler's schedule changes peak power without changing total energy.]

Valluri et al., 2001 HPCA workshop

Quantitative Study
The influence of state-of-the-art optimizations on the energy and power of the processor was examined. Optimizations studied:
- Standard -O1 to -O4 of DEC Alpha's cc compiler
- Four individual optimizations: simple basic-block instruction scheduling, loop unrolling, function inlining, and aggressive global scheduling

Standard Optimizations on Power (all values normalized to -O0 = 100)

Benchmark  Opt  Energy  Exec Time  Insts   Avg Power  IPC
compress   O0   100     100        100     100        100
           O1   74.48   81.55      81.52   91.33      99.96
           O2   75.13   81.44      82.04   92.25      100.73
           O3   75.13   81.44      82.04   92.25      100.73
           O4   79.01   82.77      86.11   95.45      104.03
go         O0   100     100        100     100        100
           O1   66.2    64.13      68.94   103.23     107.5
           O2   62.62   61.31      63.01   102.14     102.78
           O3   62.62   61.31      63.01   102.14     102.78
           O4   63.67   62.19      63.75   102.38     102.51
li         O0   100     100        100     100        100
           O1   81.32   83.66      83.18   97.2       99.42
           O2   79.6    75.97      82.97   104.78     109.21
           O3   79.6    75.97      82.97   104.78     109.21
           O4   85.71   77.89      90.96   110.05     116.78

Somebody once said: "Computers are dumb actors and compilers/programmers are the master playwrights."

A large part of a modern out-of-order processor is hardware that could have been eliminated if a good compiler existed.

Let me get more arrogant: a large part of modern out-of-order processors was designed because computer architects thought compiler writers could not do a good job.
Value Prediction
Is a slap on your face (Shen and Lipasti)

Value Locality
- Likelihood that an instruction's computed result, or a similar predictable result, will occur soon
- Observation: a limited set of unique values constitutes the majority of values produced and consumed during execution

Load Value Locality
[Figure: load value locality across benchmarks.]

Causes of value locality
- Data redundancy: many 0s, sparse matrices, white space in files, empty cells in spreadsheets
- Program constants
- Computed branches: the base address for a jump table is a run-time constant
- Virtual function calls: involve code to load a function pointer, which can be constant

Causes of value locality (contd.)
- Memory alias resolution: the compiler conservatively generates code that may contain stores that alias with loads
- Register spill code: stores and subsequent loads
- Convergent algorithms: parts of the algorithm converge before global convergence
- Polling algorithms

2 Extremist Views
- Anything that can be done in hardware should be done in hardware.
- Anything that can be done in software should be done in software.

What do we need? The dumb actor, or the defiant actor, who pays very little attention to the script?

Challenging all compiler writers
The last 15 years were the defiant actor's era. What about the next 15? With TLP, multithreading, and parallelizing compilers, it's time for a lot more dumb acting from the architect's side, and for some good scriptwriting from the compiler writer's side.
The University of Texas at Austin

BACKUP

Compiler Optimizations
- cc: native C compiler on the DEC Alpha 21064 running the OSF1 operating system
- gcc: used to study the effect of individual optimizations

Std Optimization Levels on cc
- -O0: no optimizations performed
- -O1: local optimizations such as CSE, copy propagation, IVE, etc.
- -O2: inline expansion of static procedures and global optimizations such as loop unrolling and instruction scheduling
- -O3: inline expansion of global procedures
- -O4: s/w pipelining, loop vectorization, etc.

Std Optimization Levels on gcc
- -O0: no optimizations performed
- -O1: local optimizations such as CSE, copy propagation, dead-code elimination, etc.
- -O2: aggressive instruction scheduling
- -O3: inlining of procedures
NOTE: Almost the same optimizations appear at each level of cc and gcc. In both compilers, the optimizations that increase ILP are in levels -O2, -O3, and -O4. cc was used wherever possible; gcc was used where specific hooks were required.

Individual Optimizations
Four gcc optimizations, all applied on top of -O1:
- -fschedule-insns: local register allocation followed by basic-block list scheduling
- -fschedule-insns2: postpass scheduling done
- -finline-functions: integrates all simple functions into their callers
- -funroll-loops: performs loop unrolling

Some observations
- Energy consumption reduces when the number of instructions is reduced, i.e., when the total work done is less, energy is less
- Power dissipation is directly proportional to IPC

Observations (contd.)
Function inlining was found to be good for both power and energy. Unrolling was found to be good for energy consumption but bad for power dissipation.

MMX/SIMD
Automatic usage of SIMD ISAs is still difficult, 10+ years after the introduction of MMX.

Standard Optimizations on Power (contd.) (all values normalized to -O0 = 100)

Benchmark  Opt  Energy  Exec Time  Insts   Avg Power  IPC
saxpy      O0   100     100        100     100        100
           O1   97.38   100.24     92.49   97.15      92.27
           O2   97.69   99.38      92.49   98.3       93.07
           O3   97.69   99.38      92.49   98.3       93.07
           O4   98.31   99.27      92.84   99.02      93.51
su2cor     O0   100     100        100     100        100
           O1   42.09   51.04      33.21   82.46      65.06
           O2   40.99   47.52      33.1    86.28      69.67
           O3   40.99   46.37      33.1    87.65      71.38
swim       O0   100     100        100     100        100
           O1   30.1    36.64      20.01   82.15      54.63
           O2   28.93   34.01      19.05   85.06      56.01
           O3   28.93   34.01      19.05   85.06      56.01