Computer Science 246: Computer Architecture
Spring 2015, Harvard University
Instructor: Prof. David Brooks (dbrooks@eecs.harvard.edu)

Performance Metrics, ISA Intro

Lecture Outline
• Performance Metrics
• Averaging
• Amdahl’s Law
• Benchmarks
• The CPU Performance Equation
  – Optimal Pipelining Case Study
• Modern Processor Analysis
• Begin ISAs…

Performance Metrics
• Execution time is often what we target
• Throughput (tasks/sec) vs. latency (sec/task)
• How do we decide the tasks? Benchmarks:
  – What the customer cares about: real applications
  – Representative programs (SPEC, SYSMARK, etc.)
  – Kernels: code fragments from real programs (Linpack)
  – Toy programs: Quicksort, Sieve of Eratosthenes
  – Synthetic programs: just a representative instruction mix (Whetstone, Dhrystone)

Measuring Performance
• Total execution time: (1/n) Σ_{i=1}^{n} Time_i
• This is the arithmetic mean
  – Use it when measuring performance as execution times (e.g., CPI)

Measuring Performance
• Weighted execution time: Σ_{i=1}^{n} Weight_i × Time_i
• What if P1 and P2 are not run equally? Weight each program by how often it runs

Measuring Performance
• Normalized execution time: normalize to a reference machine
• Can only use the geometric mean (the arithmetic mean of ratios varies depending on the reference machine):
  Geometric mean = (Π_{i=1}^{n} ExecutionTimeRatio_i)^{1/n}
• Problem: the result is a ratio, not an execution time

Harmonic Mean: Motivation
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30 + 90)/2 = 60 mph
• WRONG! Average speed = total distance / total time
  = 20 / (10/30 + 10/90) = 45 mph

Harmonic Mean
• Each program has O operations
• n programs execute nO operations in Σ T_i
• The execution rate is then nO / Σ T_i = n / Σ (T_i / O) = n / Σ (1/P_i), where P_i is the execution rate of program i
• Harmonic mean, HM = n / Σ_{i=1}^{n} (1/P_i), should be used when measuring performance as execution rates (e.g., IPC)
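These averaging rules are easy to mix up, so here is a minimal Python sketch of all four (the 30/90 mph numbers come from the motivation slide above; the function names are mine, not from any library):

    from math import prod

    def arithmetic_mean(times):            # use for execution times (e.g., CPI)
        return sum(times) / len(times)

    def weighted_mean(weights, times):     # when programs are not run equally
        return sum(w * t for w, t in zip(weights, times))

    def geometric_mean(ratios):            # use for normalized execution-time ratios
        return prod(ratios) ** (1.0 / len(ratios))

    def harmonic_mean(rates):              # use for execution rates (e.g., IPC)
        return len(rates) / sum(1.0 / r for r in rates)

    # Harmonic-mean motivation: 10 miles at 30 mph, then 10 miles at 90 mph.
    print(arithmetic_mean([30, 90]))   # 60.0 -- the wrong average speed
    print(harmonic_mean([30, 90]))     # 45.0 -- total distance / total time

The arithmetic mean of the two speeds overstates the true average because the slower leg takes three times as long as the fast one.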
Amdahl’s Law (Law of Diminishing Returns)
• Very intuitive: make the common case fast
• Speedup = (execution time for task without enhancement) / (execution time for task using enhancement)
• Execution time_new = Execution time_old × ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)

Amdahl’s Law Corollary
• Speedup_overall = 1 / ((1 − Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
• As Speedup_enhanced → ∞, Speedup_overall → 1 / (1 − Fraction_enhanced)

Amdahl’s Law Example
• Fraction_enhanced = 95%, Speedup_enhanced = 1.1×:
  Speedup_overall = 1 / ((1 − 0.95) + 0.95/1.1) = 1.094
• Fraction_enhanced = 5%, Speedup_enhanced = 10×:
  Speedup_overall = 1 / ((1 − 0.05) + 0.05/10) = 1.047
• Fraction_enhanced = 5%, Speedup_enhanced = ∞:
  Speedup_overall = 1 / (1 − 0.05) = 1.053
• Make the common case fast!
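Amdahl’s Law is a one-liner in code. A small Python sketch (assuming nothing beyond the formulas above) reproduces the three example speedups:

    def amdahl_speedup(fraction_enhanced, speedup_enhanced=float("inf")):
        """Overall speedup when fraction_enhanced of the execution time
        is accelerated by speedup_enhanced (Amdahl's Law)."""
        return 1.0 / ((1.0 - fraction_enhanced)
                      + fraction_enhanced / speedup_enhanced)

    print(amdahl_speedup(0.95, 1.1))   # ~1.094: big fraction, small speedup
    print(amdahl_speedup(0.05, 10))    # ~1.047: small fraction, big speedup
    print(amdahl_speedup(0.05))        # ~1.053: the 1/(1 - f) limit

Note that an infinite speedup on 5% of the program buys barely more than a 10× speedup on the same 5%: the unenhanced fraction dominates.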
MIPS
• MIPS = instruction count / (execution time × 10^6) = clock rate / (CPI × 10^6)
• Problems:
  – ISAs are not equivalent, e.g., RISC vs. CISC: 1 CISC instruction may equal many RISC instructions!
  – Programs use different instruction mixes
  – May be OK when comparing the same benchmark, same ISA, same compiler, same OS

MFLOPS
• Same as MIPS, just for FP ops
• Not useful either:
  – FP-intensive apps are needed
  – Traditionally FP ops were slow and INT could be ignored; BUT now memory ops can be the slowest!
  – “Peak MFLOPS” is a common marketing fallacy: basically it just says #FP-pipes × clock rate

GHz
• Is this a metric? Maybe as good as the others…
• One number, no benchmarks: what could be better?
• Many designs are frequency driven

  Processor         Clock Rate   SPEC FP2000
  IBM POWER3        450 MHz       434
  Intel PIII        1.4 GHz       456
  Intel Pentium 4   2.4 GHz       833
  Itanium-2         1.0 GHz      1356

Benchmark Suites
• SPEC CPU (int and float): desktop, server
• EEMBC (“embassy”), SPECjvm: embedded
• TPC-C, TPC-H, SPECjbb, ECperf: server

SPEC CPU2000: Integer Benchmarks

  164.gzip      C     Compression
  175.vpr       C     FPGA Circuit Placement and Routing
  176.gcc       C     C Programming Language Compiler
  181.mcf       C     Combinatorial Optimization
  186.crafty    C     Game Playing: Chess
  197.parser    C     Word Processing
  252.eon       C++   Computer Visualization
  253.perlbmk   C     PERL Programming Language
  254.gap       C     Group Theory, Interpreter
  255.vortex    C     Object-oriented Database
  256.bzip2     C     Compression
  300.twolf     C     Place and Route Simulator

SPEC CPU2000: Floating Point Benchmarks

  168.wupwise   Fortran 77   Physics / Quantum Chromodynamics
  171.swim      Fortran 77   Shallow Water Modeling
  172.mgrid     Fortran 77   Multi-grid Solver: 3D Potential Field
  173.applu     Fortran 77   Parabolic / Elliptic Partial Differential Equations
  177.mesa      C            3-D Graphics Library
  178.galgel    Fortran 90   Computational Fluid Dynamics
  179.art       C            Image Recognition / Neural Networks
  183.equake    C            Seismic Wave Propagation Simulation
  187.facerec   Fortran 90   Image Processing: Face Recognition
  188.ammp      C            Computational Chemistry
  189.lucas     Fortran 90   Number Theory / Primality Testing
  191.fma3d     Fortran 90   Finite-element Crash Simulation
  200.sixtrack  Fortran 77   High Energy Nuclear Physics Accelerator Design
  301.apsi      Fortran 77   Meteorology: Pollutant Distribution

Server Benchmarks
• TPC-C (Online Transaction Processing, OLTP)
  – Models a simple order-entry application
  – Thousands of concurrent database accesses

  System               # / Processor            tpm    $/tpm
  Fujitsu PrimePower   128, 563 MHz SPARC64     455K   $28.58
  HP SuperDome         64, 875 MHz PA8700       423K   $15.64
  IBM p690             32, 1300 MHz POWER4      403K   $17.80

• TPC-H (ad hoc, decision support)
  – Data warehouse, backend analysis tools for data

CPU Performance Equation
• Execution time = seconds/program
  = (instructions/program) × (cycles/instruction) × (seconds/cycle)
• instructions/program: set by the program, the architecture (ISA), and the compiler
• cycles/instruction: set by the compiler (scheduling) and the organization (uArch); the microarchitects’ domain
• seconds/cycle: set by the technology and the physical design; the circuit designers’ domain

Common Architecture Pitfall
• Assuming instructions/program (path length) is constant
  – Same benchmark, same compiler
  – OK usually, but for some ideas the compiler may change it
• Assuming seconds/cycle (cycle time) is constant
  – “My tweak won’t impact cycle-time”
  – Often a bad assumption; current designs are ~12-20 FO4 inverter delays per cycle
• Under these assumptions, just focus on cycles/instruction (CPI or IPC)
  – Many architecture studies do just this!
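As a sanity check on the units, here is a short Python sketch of the performance equation; the instruction count, CPI, and clock rate are made-up illustrative values, not measurements:

    def execution_time(insn_count, cpi, clock_hz):
        """seconds/program = (insns/program) * (cycles/insn) * (seconds/cycle)"""
        return insn_count * cpi * (1.0 / clock_hz)

    # Hypothetical program: 1e9 instructions, CPI of 1.5, 2 GHz clock.
    t = execution_time(1e9, 1.5, 2e9)
    print(t)                        # 0.75 seconds
    print(1e9 / (t * 1e6))          # ~1333 "MIPS" for this same run

The last line shows why MIPS is a weak metric: the rating says nothing about how much work each of those instructions accomplishes.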
Instruction Set Architecture

“Instruction Set Architecture is the structure of a computer that a machine language programmer (or a compiler) must understand to write a correct (timing independent) program for that machine.”
  – IBM, Introducing the IBM 360 (1964)

• The ISA defines:
  – Operations that the processor can execute
  – Data transfer mechanisms + how to access data
  – Control mechanisms (branch, jump, etc.)
  – The “contract” between programmer/compiler + HW

Classifying ISAs
• By where operands live: on a stack, in an accumulator, or in registers (the next three slides; a runnable sketch of all three styles appears at the end of this section)

Stack
• Architectures with an implicit stack
  – Acts as source(s) and/or destination; TOS is implicit
  – Push and Pop operations have 1 explicit operand
• Example: C = A + B
    Push A   // S[++TOS] = Mem[A]
    Push B   // S[++TOS] = Mem[B]
    Add      // Tem1 = S[TOS--], Tem2 = S[TOS--], S[++TOS] = Tem1 + Tem2
    Pop C    // Mem[C] = S[TOS--]
• x86 FP uses a stack (complicates pipelining)

Accumulator
• Architectures with one implicit register
  – Acts as source and/or destination
  – One other source is explicit
• Example: C = A + B
    Load A    // Acc <= Mem[A]
    Add B     // Acc <= Acc + Mem[B]
    Store C   // Mem[C] <= Acc
• The accumulator is implicit: a bottleneck?
• x86 uses accumulator concepts for integer code

Register
• Most common approach
  – Fast, temporary storage (small)
  – Explicit operands (register IDs)
• Example: C = A + B

    Register-memory:    Load/store:
    Load R1, A          Load R1, A
    Add R3, R1, B       Load R2, B
    Store R3, C         Add R3, R1, R2
                        Store R3, C

• All RISC ISAs are load/store
• IBM 360, Intel x86, Moto 68K are register-memory

Common Addressing Modes

  Base/Displacement   Load R4, 100(R1)
  Register Indirect   Load R4, (R1)
  Indexed             Load R4, (R1+R2)
  Direct              Load R4, (1001)
  Memory Indirect     Load R4, @(R3)
  Autoincrement       Load R4, (R2)+
  Scaled              Load R4, 100(R2)[R3]

What leads to a good/bad ISA?
• Ease of implementation (job of the architect/designer)
  – Does the ISA lend itself to efficient implementations?
• Ease of programming (job of the programmer/compiler)
  – Can the compiler use the ISA effectively?
• Future compatibility
  – ISAs may last 30+ years
  – Special features, address range, etc. need to be thought out

Implementation Concerns
• Simple decoding (fixed length)
• Compactness (variable length)
• Simple instructions (no load/update)
  – Things that get microcoded these days
  – Deterministic latencies are key!
  – Instructions with multiple exceptions are difficult
• More/fewer registers?
  – Slower register files and decoding vs. better compilers
• Condition codes/flags (scheduling!)

Programmability
• 1960s, early 70s: code was mostly hand-coded
• Late 70s, early 80s: most code was compiled, but hand-coded was better
• Mid-80s to present: most code is compiled and almost as good as assembly
• Why?

Programmability: 70s, Early 80s: “Closing the Semantic Gap”
• Make high-level languages match assembly languages
• Efforts for computers to execute HLLs directly
  – e.g., the LISP machine
• Hardware type checking: special type bits let the type be checked efficiently at run time
• Hardware garbage collection
• Fast function calls
• Efficient representation of lists
• Never worked out… “semantic clash”
  – Too many HLLs? C was more popular?

Programmability: 1980s … 2000s: “In the Compiler We Trust”
• Wulf: primitives, not solutions
  – Compilers cannot effectively use complex instructions
  – Synthesize programs from primitives
• Regularity: same behavior in all contexts
  – No odd cases; things should be intuitive
• Orthogonality:
  – Data type independent of addressing mode
  – Addressing mode independent of operation performed

ISA Compatibility

“In computer architecture, no good idea ever goes unpunished.”
  – Marty Hopkins, IBM Fellow

• Never abandon an existing code base
• Extremely difficult to introduce a new ISA
  – Alpha failed, IA-64 is struggling; the best solution may not win
  – x86 is the most popular and the least liked!
• Hard to think ahead, but…
  – An ISA tweak may buy 5-10% today
  – 10 years later it may buy nothing, yet must still be implemented
  – Examples: register windows, delayed branches

CISC vs. RISC
• The debate raged from the early 80s through the 90s
• Now it is fairly irrelevant
• Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
• Research in the late 70s/early 80s led to RISC
  – IBM 801 (John Cocke, mid 70s)
  – Berkeley RISC-1 (Patterson)
  – Stanford MIPS (Hennessy)

VAX
• 32-bit ISA; instructions could be huge (up to 321 bytes); 16 GPRs
• Operated on data types from 8 to 128 bits, plus decimals and strings
• Orthogonal, memory-to-memory, all operand modes supported
• Hundreds of special instructions
• Simple compiler; hand-coding was common
• CPI was over 10!

x86
• Variable-length ISA (1-16 bytes)
• FP operand stack
• 2-operand instructions (extended accumulator)
  – Register-register and register-memory support
• Scaled addressing modes
• Has been extended many times (as AMD did with x86-64)
• Intel initially went to IA-64, then backtracked (EM64T)

RISC vs. CISC Arguments
• RISC
  – Simple implementation: load/store, fixed-format 32-bit instructions, efficient pipelines
  – Lower CPI
  – Compilers do a lot of the hard work
    (MIPS = Microprocessor without Interlocked Pipeline Stages)
• CISC
  – Simple compilers (assists hand-coding; many addressing modes, many instructions)
  – Code density

MIPS/VAX Comparison

After the dust settled
• Turns out it doesn’t matter much
• Can decode CISC instructions into an internal “micro-ISA”
  – This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
  – In 20-stage pipelines and 55M-transistor processors this is minimal
  – Pentium 4 caches these micro-ops
• Actually may have some advantages
  – External ISA for compatibility; the internal ISA can be tweaked each generation (Transmeta, NVidia)
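To make the stack/accumulator/register contrast concrete, here is the toy Python sketch promised on the “Classifying ISAs” slide: three hypothetical mini-interpreters (not any real ISA) that each evaluate C = A + B:

    mem = {"A": 3, "B": 4, "C": 0}

    # Stack machine: operands implicitly on a stack (cf. x86 FP).
    stack = []
    stack.append(mem["A"])                    # Push A
    stack.append(mem["B"])                    # Push B
    stack.append(stack.pop() + stack.pop())   # Add (both operands implicit)
    mem["C"] = stack.pop()                    # Pop C

    # Accumulator machine: one implicit register, one explicit operand.
    acc = mem["A"]                            # Load A
    acc = acc + mem["B"]                      # Add B
    mem["C"] = acc                            # Store C

    # Load/store register machine: explicit register IDs (cf. RISC).
    regs = {}
    regs["R1"] = mem["A"]                     # Load R1, A
    regs["R2"] = mem["B"]                     # Load R2, B
    regs["R3"] = regs["R1"] + regs["R2"]      # Add R3, R1, R2
    mem["C"] = regs["R3"]                     # Store R3, C

    print(mem["C"])                           # 7 in every style

All three styles compute the same result; they differ in how many operands each instruction names explicitly, which is exactly what drives the code-density vs. decode-simplicity trade-offs discussed above.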