Lecture 1: Performance Metrics, ISA Intro

Computer Science 246
Computer Architecture
Spring 2015
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Performance Metrics, ISA Intro
Computer Science 246 David Brooks
Lecture Outline
•  Performance Metrics
•  Averaging
•  Amdahl’s Law
•  Benchmarks
•  The CPU Performance Equation
   –  Optimal Pipelining Case Study
•  Modern Processor Analysis
•  Begin ISAs…
Performance Metrics
•  Execution Time is often what we target
•  Throughput (tasks/sec) vs. latency (sec/task)
•  How do we decide the tasks? Benchmarks
–  What the customer cares about, real applications
–  Representative programs (SPEC, SYSMARK, etc)
–  Kernels: Code fragments from real programs (Linpack)
–  Toy Programs: Quicksort, Sieve of Eratosthenes
–  Synthetic Programs: Just a representative instruction mix (Whetstone, Dhrystone)
Measuring Performance
•  Total Execution Time:

   (1/n) × Σ(i=1..n) Time_i

•  This is the arithmetic mean
   –  This should be used when measuring performance in execution times (CPI)
Measuring Performance
•  Weighted Execution Time:
   Σ(i=1..n) Weight_i × Time_i
•  What if P1 and P2 are not run equally?
Measuring Performance
•  Normalized Execution Time
•  Normalize to reference machine
•  Can only use geometric mean (arithmetic mean
can vary depending on the reference machine)
   (Π(i=1..n) ExecutionTimeRatio_i)^(1/n)
•  Problem: the result is a ratio, not an execution time
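A small numeric sketch of the three averaging schemes above. The execution times are made up; the point is that the geometric mean of normalized ratios is independent of which machine you pick as reference.

```python
# Hypothetical execution times (seconds) for two programs on two machines.
times_a = [2.0, 10.0]
times_b = [4.0, 5.0]

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, weights):
    # Weights reflect how often each program is actually run.
    return sum(w * x for w, x in zip(weights, xs))

def geometric_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

# Normalized execution time: machine B relative to reference machine A.
ratios_b_vs_a = [b / a for a, b in zip(times_a, times_b)]

print(arithmetic_mean(times_a))            # 6.0
print(weighted_mean(times_a, [0.9, 0.1]))  # P1 run 90% of the time -> 2.8
print(geometric_mean(ratios_b_vs_a))       # sqrt(2.0 * 0.5) = 1.0
```

With these numbers the geometric mean of the ratios is exactly 1.0: by that measure the two machines are "equal", even though their raw times differ, which is exactly the "ratio, not execution time" caveat.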
Harmonic Mean: Motivation
•  30 mph for the first 10 miles
•  90 mph for the next 10 miles
•  Average speed? (30+90)/2 = 60mph
•  WRONG! Average speed = total distance / total time
•  20/(10/30+10/90) = 45mph
Harmonic Mean
•  Each program has O operations
•  n programs execute nO operations in Σ Ti time
•  The execution rate is then
   nO / Σ Ti = n / Σ (Ti/O) = n / Σ (1/Pi)
   where Pi = O/Ti is the execution rate of program i
•  Harmonic mean should be used when measuring performance in execution rates (IPC):

   n / Σ(i=1..n) (1/Time_i)
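A quick check of the harmonic mean, first on the driving example from the previous slide and then on made-up per-program IPC values:

```python
def harmonic_mean(xs):
    # n / sum(1/x_i): the right average for rates over equal-size workloads.
    return len(xs) / sum(1.0 / x for x in xs)

# Two equal 10-mile legs at 30 mph and 90 mph.
avg_speed = harmonic_mean([30.0, 90.0])
print(avg_speed)  # 45.0 = total distance / total time, not (30+90)/2

# Same idea for execution rates, e.g. hypothetical IPC on equal-length programs.
avg_ipc = harmonic_mean([0.5, 2.0])
print(avg_ipc)  # 0.8
```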
Amdahl’s Law
(Law of Diminishing Returns)
•  Very Intuitive – Make the common case fast

   Speedup = (Execution time for task without enhancement) / (Execution time for task using enhancement)

   Execution time_new = Execution time_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Amdahl’s Law Corollary
   Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

•  As Speedup_enhanced → ∞:

   Speedup_overall = 1 / (1 - Fraction_enhanced)
Amdahl’s Law Example
•  Fraction_enhanced = 95%, Speedup_enhanced = 1.1x
   Speedup_overall = 1/((1-0.95) + (0.95/1.1)) = 1.094
•  Fraction_enhanced = 5%, Speedup_enhanced = 10x
   Speedup_overall = 1/((1-0.05) + (0.05/10)) = 1.047
•  Fraction_enhanced = 5%, Speedup_enhanced = ∞
   Speedup_overall = 1/(1-0.05) = 1.052
Make the common case fast!
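The three examples can be checked with a few lines (just the formula above, nothing more):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    # Amdahl's Law: 1 / [(1 - F) + F / S]
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(overall_speedup(0.95, 1.1))   # large fraction, tiny speedup
print(overall_speedup(0.05, 10.0))  # small fraction, big speedup
print(1.0 / (1.0 - 0.05))           # limiting case: infinite enhancement
```

Even an infinite speedup on 5% of the work caps the overall gain at about 1.05x, which is why the common case is the one worth making fast.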
MIPS
•  MIPS = instruction count / (execution time × 10^6)
        = clock rate / (CPI × 10^6)
•  Problems
–  ISAs are not equivalent, e.g. RISC vs. CISC
•  1 CISC instruction may equal many RISC!
–  Programs use different instruction mixes
–  May be ok when comparing same benchmarks, same ISA,
same compiler, same OS
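The two forms of the MIPS equation agree whenever execution time = instruction count × CPI / clock rate; the numbers below are hypothetical, chosen to be consistent:

```python
# Hypothetical program: 50M instructions, CPI of 2, 2.5 GHz clock.
instruction_count = 50_000_000
clock_rate = 2.5e9
cpi = 2.0

execution_time = instruction_count * cpi / clock_rate  # 0.04 s

mips = instruction_count / (execution_time * 1e6)  # first form
mips_alt = clock_rate / (cpi * 1e6)                # second form
print(mips, mips_alt)  # 1250.0 1250.0 -- both forms agree
```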
MFLOPS
•  Same as MIPS, just FP ops
•  Not useful either
–  FP-intensive apps needed
–  Traditionally, FP ops were the slowest, so INT ops could be ignored
–  BUT, now memory ops can be the slowest!
•  “Peak MFLOPS” is a common marketing fallacy
–  Basically, it just says #FP-pipes X Clock Rate
GHz
•  Is this a metric? Maybe as good as the others…
•  One number, no benchmarks, what can be better?
•  Many designs are frequency driven
Processor         Clock Rate   SPEC FP2000
IBM POWER3        450 MHz      434
Intel PIII        1.4 GHz      456
Intel Pentium 4   2.4 GHz      833
Itanium-2         1.0 GHz      1356
Benchmark Suites
•  SPEC CPU (int and float) (Desktop, Server)
•  EEMBC (“embassy”), SPECjvm (Embedded)
•  TPC-C, TPC-H, SPECjbb, ECperf (Server)
SPEC CPU2000: Integer Benchmarks

164.gzip      C     Compression
175.vpr       C     FPGA Circuit Placement and Routing
176.gcc       C     C Programming Language Compiler
181.mcf       C     Combinatorial Optimization
186.crafty    C     Game Playing: Chess
197.parser    C     Word Processing
252.eon       C++   Computer Visualization
253.perlbmk   C     PERL Programming Language
254.gap       C     Group Theory, Interpreter
255.vortex    C     Object-oriented Database
256.bzip2     C     Compression
300.twolf     C     Place and Route Simulator
SPEC CPU2000: Floating Point Benchmarks

168.wupwise   Fortran 77   Physics / Quantum Chromodynamics
171.swim      Fortran 77   Shallow Water Modeling
172.mgrid     Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77   Parabolic / Elliptic Partial Differential Equations
177.mesa      C            3-D Graphics Library
178.galgel    Fortran 90   Computational Fluid Dynamics
179.art       C            Image Recognition / Neural Networks
183.equake    C            Seismic Wave Propagation Simulation
187.facerec   Fortran 90   Image Processing: Face Recognition
188.ammp      C            Computational Chemistry
189.lucas     Fortran 90   Number Theory / Primality Testing
191.fma3d     Fortran 90   Finite-element Crash Simulation
200.sixtrack  Fortran 77   High Energy Nuclear Physics Accelerator Design
301.apsi      Fortran 77   Meteorology: Pollutant Distribution
Server Benchmarks
•  TPC-C (Online-Transaction Processing, OLTP)
–  Models a simple order-entry application
–  Thousands of concurrent database accesses
System               # / Processor          tpm    $/tpm
Fujitsu PrimePower   128, 563MHz SPARC64    455K   $28.58
HP SuperDome         64, 875MHz PA8700      423K   $15.64
IBM p690             32, 1300MHz POWER4     403K   $17.80
•  TPC-H (Ad-hoc, decision support)
–  Data warehouse, backend analysis tools for data
CPU Performance Equation
•  Execution Time = seconds/program
   = (instructions/program) × (cycles/instruction) × (seconds/cycle)
•  Who determines each term:
   –  instructions/program: Program, Architecture (ISA), Compiler
   –  cycles/instruction: Organization (uArch), Compiler (Scheduling) Technology – Microarchitects
   –  seconds/cycle: Physical Design – Circuit Designers
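The product form of the equation can be checked numerically; the numbers below are hypothetical, one per factor:

```python
# Execution time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
instructions = 1_000_000_000   # path length: set by program, ISA, compiler
cpi = 1.5                      # cycles per instruction: uarch + scheduling
cycle_time = 1.0 / 3.0e9       # seconds per cycle at 3 GHz: physical design

execution_time = instructions * cpi * cycle_time
print(execution_time)  # 0.5 seconds
```

Halving any one factor halves execution time, which is why the three design communities listed above can each move the bottom line.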
Common Architecture Pitfall
•  Assuming Instructions/Program (path length) is constant
–  Same benchmark, same compiler
–  Ok usually, but for some ideas compiler may change
•  Assuming Seconds/Cycle (cycle time) is constant
–  “My tweak won’t impact cycle-time”
–  Often a bad assumption
–  Current designs are ~12-20FO4 Inverter Delays per cycle
•  Just focus on Cycles/Instruction (CPI or IPC)
•  Many architecture studies do just this!
Instruction Set Architecture
“Instruction Set Architecture is the structure of a computer
that a machine language programmer (or a compiler) must
understand to write a correct (timing independent) program
for that machine.”
IBM, Introducing the IBM 360 (1964)
•  The ISA defines:
–  Operations that the processor can execute
–  Data Transfer mechanisms + how to access data
–  Control Mechanisms (branch, jump, etc)
–  “Contract” between programmer/compiler + HW
Classifying ISAs
Stack
•  Architectures with implicit “stack”
–  Acts as source(s) and/or destination, TOS is implicit
–  Push and Pop operations have 1 explicit operand
•  Example: C = A + B
   Push A   // S[++TOS] = Mem[A]
   Push B   // S[++TOS] = Mem[B]
   Add      // Tem1 = S[TOS--]; Tem2 = S[TOS--]; S[++TOS] = Tem1 + Tem2
   Pop C    // Mem[C] = S[TOS--]
•  x86 FP uses stack (complicates pipelining)
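The four-instruction sequence above can be sketched as a toy interpreter (hypothetical, just to show that sources and destination are implicit on the stack):

```python
mem = {"A": 3, "B": 4}  # made-up memory contents
stack = []              # the implicit operand stack; top of list = TOS

def push(var):
    stack.append(mem[var])      # S[++TOS] = Mem[var]

def add():
    t1 = stack.pop()            # Tem1 = S[TOS--]
    t2 = stack.pop()            # Tem2 = S[TOS--]
    stack.append(t1 + t2)       # S[++TOS] = Tem1 + Tem2

def pop(var):
    mem[var] = stack.pop()      # Mem[var] = S[TOS--]

push("A"); push("B"); add(); pop("C")
print(mem["C"])  # 7
```

Note that Add names no operands at all: both sources and the destination are the top of stack.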
Accumulator
•  Architectures with one implicit register
–  Acts as source and/or destination
–  One other source explicit
•  Example: C = A + B
   Load A    // (Acc)umulator <= A
   Add B     // Acc <= Acc + B
   Store C   // C <= Acc
•  Accumulator implicit, bottleneck?
•  x86 uses accumulator concepts for integer
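The same C = A + B sequence as a minimal sketch, with the single implicit register spelled out (memory contents are made up):

```python
mem = {"A": 3, "B": 4}  # hypothetical memory contents
acc = 0                 # the one implicit register

acc = mem["A"]          # Load A : Acc <= A
acc = acc + mem["B"]    # Add B  : Acc <= Acc + B
mem["C"] = acc          # Store C: C <= Acc
print(mem["C"])  # 7
```

Every operation funnels through `acc`, which is the serialization bottleneck the slide asks about.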
Register
•  Most common approach
–  Fast, temporary storage (small)
–  Explicit operands (register IDs)
•  Example: C = A + B
   Register-memory:
   Load R1, A
   Add R3, R1, B
   Store R3, C
   Load/store:
   Load R1, A
   Load R2, B
   Add R3, R1, R2
   Store R3, C
•  All RISC ISAs are load/store
•  IBM 360, Intel x86, Moto 68K are register-memory
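The load/store sequence above, sketched with an explicit register file (memory contents are made up):

```python
mem = {"A": 3, "B": 4}  # hypothetical memory contents
regs = {}               # explicit, named registers

regs["R1"] = mem["A"]                 # Load R1, A
regs["R2"] = mem["B"]                 # Load R2, B
regs["R3"] = regs["R1"] + regs["R2"]  # Add R3, R1, R2 (ALU touches registers only)
mem["C"] = regs["R3"]                 # Store R3, C
print(mem["C"])  # 7
```

In the load/store style, only Load and Store touch memory; the Add reads and writes registers alone, which is what makes the pipeline timing predictable.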
Common Addressing Modes
Base/Displacement   Load R4, 100(R1)
Register Indirect   Load R4, (R1)
Indexed             Load R4, (R1+R2)
Direct              Load R4, (1001)
Memory Indirect     Load R4, @(R3)
Autoincrement       Load R4, (R2)+
Scaled              Load R4, 100(R2)[R3]
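A sketch of the effective address (EA) each mode computes, using hypothetical register and memory values and assuming a 4-byte word for scaling and autoincrement:

```python
regs = {"R1": 1000, "R2": 2000, "R3": 64}
mem = {64: 4096}                           # Mem[R3] holds a pointer

ea_base_disp    = 100 + regs["R1"]         # 100(R1): displacement + base
ea_reg_indirect = regs["R1"]               # (R1): register holds the address
ea_indexed      = regs["R1"] + regs["R2"]  # (R1+R2): base + index
ea_direct       = 1001                     # (1001): address in the instruction
ea_mem_indirect = mem[regs["R3"]]          # @(R3): memory holds the address
ea_scaled       = 100 + regs["R2"] + regs["R3"] * 4  # 100(R2)[R3]: scaled index
ea_autoinc      = regs["R2"]               # (R2)+: use the address...
regs["R2"] += 4                            # ...then bump R2 by the word size
```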
What leads to a good/bad ISA?
•  Ease of Implementation (Job of Architect/Designer)
–  Does the ISA lend itself to efficient implementations?
•  Ease of Programming (Job of Programmer/Compiler)
–  Can the compiler use the ISA effectively?
•  Future Compatibility
–  ISAs may last 30+yrs
–  Special Features, Address range, etc. need to be thought out
Implementation Concerns
•  Simple Decoding (fixed length)
•  Compactness (variable length)
•  Simple Instructions (no load/update)
–  Things that get microcoded these days
–  Deterministic Latencies are key!
–  Instructions with multiple exceptions are difficult
•  More/Less registers?
–  Slower register files, decoding, better compilers
•  Condition codes/Flags (scheduling!)
Programmability
•  1960s, early 70s
–  Code was mostly hand-coded
•  Late 70s, Early 80s
–  Most code was compiled, but hand-coded was better
•  Mid-80s to Present
–  Most code is compiled and almost as good as assembly
•  Why?
Programmability: 70s, Early 80s
“Closing the Semantic Gap”
•  Bring assembly languages closer to high-level languages
•  Efforts for computers to execute HLL directly
–  e.g. LISP Machine
•  Hardware Type Checking. Special type bits let the type be
checked efficiently at run-time
•  Hardware Garbage Collection
•  Fast Function Calls
•  Efficient Representation of Lists
•  Never worked out…“Semantic Clash”
–  Too many HLLs? C was more popular?
Programmability: 1980s … 2000s
“In the Compiler We Trust”
•  Wulf: Primitives not Solutions
–  Compilers cannot effectively use complex
instructions
–  Synthesize programs from primitives
•  Regularity: same behavior in all contexts
–  No odd cases – things should be intuitive
•  Orthogonality:
–  Data type independent of addressing mode
–  Addressing mode independent of operation performed
ISA Compatibility
“In Computer Architecture, no good idea ever goes unpunished.”
Marty Hopkins, IBM Fellow
•  Never abandon existing code base
•  Extremely difficult to introduce a new ISA
–  Alpha failed, IA64 is struggling, best solution may not win
•  x86 most popular, is the least liked!
•  Hard to think ahead, but…
–  ISA tweak may buy 5-10% today
–  10 years later it may buy nothing, but must be implemented
•  Register windows, delay branches
CISC vs. RISC
•  Debate raged from early 80s through 90s
•  Now it is fairly irrelevant
•  Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
•  Research in the late 70s/early 80s led to RISC
–  IBM 801 -- John Cocke – mid 70s
–  Berkeley RISC-1 (Patterson)
–  Stanford MIPS (Hennessy)
VAX
•  32-bit ISA, instructions could be huge (up to 321
bytes), 16 GPRs
•  Operated on data types from 8 to 128-bits,
decimals, strings
•  Orthogonal, memory-to-memory, all operand
modes supported
•  Hundreds of special instructions
•  Simple compiler, hand-coding was common
•  CPI was over 10!
x86
•  Variable length ISA (1-16 bytes)
•  FP Operand Stack
•  2 operand instructions (extended accumulator)
–  Register-register and register-memory support
•  Scaled addressing modes
•  Has been extended many times (as AMD did with x86-64)
•  Intel, initially went to IA64, now backtracked (EM64T)
RISC vs. CISC Arguments
•  RISC
–  Simple Implementation
•  Load/store, fixed-format 32-bit instructions, efficient pipelines
–  Lower CPI
–  Compilers do a lot of the hard work
•  MIPS = Microprocessor without Interlocked Pipeline Stages
•  CISC
–  Simple Compilers (assists hand-coding, many
addressing modes, many instructions)
–  Code Density
MIPS/VAX Comparison
After the dust settled
•  Turns out it doesn’t matter much
•  Can decode CISC instructions into internal
“micro-ISA”
–  This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
–  In 20-stage pipelines and 55M-transistor processors this is minimal
–  Pentium 4 caches these micro-Ops
•  Actually may have some advantages
–  External ISA for compatibility, internal ISA can be
tweaked each generation (Transmeta, NVidia)
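The cracking idea can be sketched in a few lines. This is not any real decoder; the instruction and micro-op names are invented for illustration:

```python
# Toy sketch: crack a register-memory "CISC" add into load/store micro-ops.
def crack(inst):
    op = inst[0]
    if op == "add_rm":                        # ADD dst <- src_reg + Mem[addr]
        _, dst, src_reg, addr = inst
        return [("load", "tmp0", addr),       # uop 1: fetch the memory operand
                ("add", dst, src_reg, "tmp0")]  # uop 2: simple reg-reg add
    return [inst]                             # simple instructions pass through

uops = crack(("add_rm", "R1", "R1", 0x1000))
print(uops)  # one complex instruction becomes two simple micro-ops
```

The pipeline downstream of the decoder then only ever sees simple, RISC-like operations, which is why the external ISA stops mattering much.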