Lecture 1: Performance Metrics, ISA Intro

Computer Science 246
Computer Architecture
Spring 2015
Harvard University
Instructor: Prof. David Brooks
dbrooks@eecs.harvard.edu
Performance Metrics, ISA Intro
Computer Science 246 David Brooks
Lecture Outline
•  Performance Metrics
•  Averaging
•  Amdahl’s Law
•  Benchmarks
•  The CPU Performance Equation
   –  Optimal Pipelining Case Study
•  Modern Processor Analysis
•  Begin ISAs…
Performance Metrics
•  Execution Time is often what we target
•  Throughput (tasks/sec) vs. latency (sec/task)
•  How do we decide the tasks? Benchmarks
–  What the customer cares about, real applications
–  Representative programs (SPEC, SYSMARK, etc)
–  Kernels: Code fragments from real programs (Linpack)
–  Toy Programs: Quicksort, Sieve of Eratosthenes
–  Synthetic Programs: Just a representative instruction mix (Whetstone, Dhrystone)
Measuring Performance
•  Total Execution Time:

   (1/n) × Σ(i=1..n) Time_i

•  This is the arithmetic mean
   –  This should be used when measuring performance in execution times (CPI)
Measuring Performance
•  Weighted Execution Time:
   Σ(i=1..n) Weight_i × Time_i
•  What if P1 and P2 are not run equally?
Measuring Performance
•  Normalized Execution Time
•  Normalize to reference machine
•  Can only use geometric mean (arithmetic mean
can vary depending on the reference machine)
   (Π(i=1..n) ExecutionTimeRatio_i)^(1/n)
•  Problem: the result is a ratio, not an execution time
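A small numeric sketch of the three averaging schemes above. The execution times are made up; the point is that the geometric mean of normalized ratios is independent of which machine you pick as reference.

```python
# Hypothetical execution times (seconds) for two programs on two machines.
times_a = [2.0, 10.0]
times_b = [4.0, 5.0]

def arithmetic_mean(xs):
    return sum(xs) / len(xs)

def weighted_mean(xs, weights):
    # Weights reflect how often each program is actually run.
    return sum(w * x for w, x in zip(weights, xs))

def geometric_mean(xs):
    prod = 1.0
    for x in xs:
        prod *= x
    return prod ** (1.0 / len(xs))

# Normalized execution time: machine B relative to reference machine A.
ratios_b_vs_a = [b / a for a, b in zip(times_a, times_b)]

print(arithmetic_mean(times_a))            # 6.0
print(weighted_mean(times_a, [0.9, 0.1]))  # P1 run 90% of the time -> 2.8
print(geometric_mean(ratios_b_vs_a))       # sqrt(2.0 * 0.5) = 1.0
```

With these numbers the geometric mean of the ratios is exactly 1.0: by that measure the two machines are "equal", even though their raw times differ, which is exactly the "ratio, not execution time" caveat.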
Harmonic Mean: Motivation
•  30 mph for the first 10 miles
•  90 mph for the next 10 miles
•  Average speed? (30+90)/2 = 60mph
•  WRONG! Average speed = total distance / total time
•  20/(10/30+10/90) = 45mph
Harmonic Mean
•  Each program has O operations
•  n programs execute nO operations in Σ Ti time
•  The execution rate is then
   nO / Σ Ti = n / Σ (Ti/O) = n / Σ (1/Pi)
   where Pi = O/Ti is the execution rate of program i
•  Harmonic mean should be used when measuring performance in execution rates (IPC):

   n / Σ(i=1..n) (1/Time_i)
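A quick check of the harmonic mean, first on the driving example from the previous slide and then on made-up per-program IPC values:

```python
def harmonic_mean(xs):
    # n / sum(1/x_i): the right average for rates over equal-size workloads.
    return len(xs) / sum(1.0 / x for x in xs)

# Two equal 10-mile legs at 30 mph and 90 mph.
avg_speed = harmonic_mean([30.0, 90.0])
print(avg_speed)  # 45.0 = total distance / total time, not (30+90)/2

# Same idea for execution rates, e.g. hypothetical IPC on equal-length programs.
avg_ipc = harmonic_mean([0.5, 2.0])
print(avg_ipc)  # 0.8
```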
Amdahl’s Law
(Law of Diminishing Returns)
•  Very Intuitive – Make the common case fast

   Speedup = (Execution time for task without enhancement) / (Execution time for task using enhancement)

   Execution time_new = Execution time_old × [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]
Amdahl’s Law Corollary
   Speedup_overall = 1 / [(1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced]

•  As Speedup_enhanced → ∞:

   Speedup_overall = 1 / (1 - Fraction_enhanced)
Amdahl’s Law Example
•  Fraction_enhanced = 95%, Speedup_enhanced = 1.1x
   Speedup_overall = 1/((1-0.95) + (0.95/1.1)) = 1.094
•  Fraction_enhanced = 5%, Speedup_enhanced = 10x
   Speedup_overall = 1/((1-0.05) + (0.05/10)) = 1.047
•  Fraction_enhanced = 5%, Speedup_enhanced = ∞
   Speedup_overall = 1/(1-0.05) = 1.052
Make the common case fast!
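The three examples can be checked with a few lines (just the formula above, nothing more):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    # Amdahl's Law: 1 / [(1 - F) + F / S]
    return 1.0 / ((1.0 - fraction_enhanced)
                  + fraction_enhanced / speedup_enhanced)

print(overall_speedup(0.95, 1.1))   # large fraction, tiny speedup
print(overall_speedup(0.05, 10.0))  # small fraction, big speedup
print(1.0 / (1.0 - 0.05))           # limiting case: infinite enhancement
```

Even an infinite speedup on 5% of the work caps the overall gain at about 1.05x, which is why the common case is the one worth making fast.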
MIPS
•  MIPS = instruction count / (execution time × 10^6)
        = clock rate / (CPI × 10^6)
•  Problems
–  ISAs are not equivalent, e.g. RISC vs. CISC
•  1 CISC instruction may equal many RISC!
–  Programs use different instruction mixes
–  May be ok when comparing same benchmarks, same ISA,
same compiler, same OS
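The two forms of the MIPS equation agree whenever execution time = instruction count × CPI / clock rate; the numbers below are hypothetical, chosen to be consistent:

```python
# Hypothetical program: 50M instructions, CPI of 2, 2.5 GHz clock.
instruction_count = 50_000_000
clock_rate = 2.5e9
cpi = 2.0

execution_time = instruction_count * cpi / clock_rate  # 0.04 s

mips = instruction_count / (execution_time * 1e6)  # first form
mips_alt = clock_rate / (cpi * 1e6)                # second form
print(mips, mips_alt)  # 1250.0 1250.0 -- both forms agree
```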
MFLOPS
•  Same as MIPS, just FP ops
•  Not useful either
–  FP-intensive apps needed
–  Traditionally, FP ops were the slowest, so INT ops could be ignored
–  BUT, now memory ops can be the slowest!
•  “Peak MFLOPS” is a common marketing fallacy
–  Basically, it just says #FP-pipes X Clock Rate
GHz
•  Is this a metric? Maybe as good as the others…
•  One number, no benchmarks, what can be better?
•  Many designs are frequency driven
Processor         Clock Rate   SPEC FP2000
IBM POWER3        450 MHz      434
Intel PIII        1.4 GHz      456
Intel Pentium 4   2.4 GHz      833
Itanium-2         1.0 GHz      1356
Benchmark Suites
•  SPEC CPU (int and float) (Desktop, Server)
•  EEMBC (“embassy”), SPECjvm (Embedded)
•  TPC-C, TPC-H, SPECjbb, ECperf (Server)
SPEC CPU2000: Integer Benchmarks

164.gzip      C     Compression
175.vpr       C     FPGA Circuit Placement and Routing
176.gcc       C     C Programming Language Compiler
181.mcf       C     Combinatorial Optimization
186.crafty    C     Game Playing: Chess
197.parser    C     Word Processing
252.eon       C++   Computer Visualization
253.perlbmk   C     PERL Programming Language
254.gap       C     Group Theory, Interpreter
255.vortex    C     Object-oriented Database
256.bzip2     C     Compression
300.twolf     C     Place and Route Simulator
SPEC CPU2000: Floating Point Benchmarks

168.wupwise   Fortran 77   Physics / Quantum Chromodynamics
171.swim      Fortran 77   Shallow Water Modeling
172.mgrid     Fortran 77   Multi-grid Solver: 3D Potential Field
173.applu     Fortran 77   Parabolic / Elliptic Partial Differential Equations
177.mesa      C            3-D Graphics Library
178.galgel    Fortran 90   Computational Fluid Dynamics
179.art       C            Image Recognition / Neural Networks
183.equake    C            Seismic Wave Propagation Simulation
187.facerec   Fortran 90   Image Processing: Face Recognition
188.ammp      C            Computational Chemistry
189.lucas     Fortran 90   Number Theory / Primality Testing
191.fma3d     Fortran 90   Finite-element Crash Simulation
200.sixtrack  Fortran 77   High Energy Nuclear Physics Accelerator Design
301.apsi      Fortran 77   Meteorology: Pollutant Distribution
Server Benchmarks
•  TPC-C (Online-Transaction Processing, OLTP)
–  Models a simple order-entry application
–  Thousands of concurrent database accesses
System               # / Processor          tpm    $/tpm
Fujitsu PrimePower   128, 563MHz SPARC64    455K   $28.58
HP SuperDome         64, 875MHz PA8700      423K   $15.64
IBM p690             32, 1300MHz POWER4     403K   $17.80
•  TPC-H (Ad-hoc, decision support)
–  Data warehouse, backend analysis tools for data
CPU Performance Equation
•  Execution Time = seconds/program
   = (instructions/program) × (cycles/instruction) × (seconds/cycle)
•  Who determines each term:
   –  instructions/program: Program, Architecture (ISA), Compiler
   –  cycles/instruction: Organization (uArch), Compiler (Scheduling) Technology – Microarchitects
   –  seconds/cycle: Physical Design – Circuit Designers
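The product form of the equation can be checked numerically; the numbers below are hypothetical, one per factor:

```python
# Execution time = (instructions/program) x (cycles/instruction) x (seconds/cycle)
instructions = 1_000_000_000   # path length: set by program, ISA, compiler
cpi = 1.5                      # cycles per instruction: uarch + scheduling
cycle_time = 1.0 / 3.0e9       # seconds per cycle at 3 GHz: physical design

execution_time = instructions * cpi * cycle_time
print(execution_time)  # 0.5 seconds
```

Halving any one factor halves execution time, which is why the three design communities listed above can each move the bottom line.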
Common Architecture Pitfall
•  Assuming Instructions/Program (path length) is constant
–  Same benchmark, same compiler
–  Ok usually, but for some ideas compiler may change
•  Assuming Seconds/Cycle (cycle time) is constant
–  “My tweak won’t impact cycle-time”
–  Often a bad assumption
–  Current designs are ~12-20FO4 Inverter Delays per cycle
•  Just focus on Cycles/Instruction (CPI or IPC)
•  Many architecture studies do just this!
Instruction Set Architecture
“Instruction Set Architecture is the structure of a computer
that a machine language programmer (or a compiler) must
understand to write a correct (timing independent) program
for that machine.”
IBM, Introducing the IBM 360 (1964)
•  The ISA defines:
–  Operations that the processor can execute
–  Data Transfer mechanisms + how to access data
–  Control Mechanisms (branch, jump, etc)
–  “Contract” between programmer/compiler + HW
Classifying ISAs
Stack
•  Architectures with implicit “stack”
–  Acts as source(s) and/or destination, TOS is implicit
–  Push and Pop operations have 1 explicit operand
•  Example: C = A + B
   Push A   // S[++TOS] = Mem[A]
   Push B   // S[++TOS] = Mem[B]
   Add      // Tem1 = S[TOS--]; Tem2 = S[TOS--]; S[++TOS] = Tem1 + Tem2
   Pop C    // Mem[C] = S[TOS--]
•  x86 FP uses stack (complicates pipelining)
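The four-instruction sequence above can be sketched as a toy interpreter (hypothetical, just to show that sources and destination are implicit on the stack):

```python
mem = {"A": 3, "B": 4}  # made-up memory contents
stack = []              # the implicit operand stack; top of list = TOS

def push(var):
    stack.append(mem[var])      # S[++TOS] = Mem[var]

def add():
    t1 = stack.pop()            # Tem1 = S[TOS--]
    t2 = stack.pop()            # Tem2 = S[TOS--]
    stack.append(t1 + t2)       # S[++TOS] = Tem1 + Tem2

def pop(var):
    mem[var] = stack.pop()      # Mem[var] = S[TOS--]

push("A"); push("B"); add(); pop("C")
print(mem["C"])  # 7
```

Note that Add names no operands at all: both sources and the destination are the top of stack.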
Accumulator
•  Architectures with one implicit register
–  Acts as source and/or destination
–  One other source explicit
•  Example: C = A + B
   Load A    // (Acc)umulator <= A
   Add B     // Acc <= Acc + B
   Store C   // C <= Acc
•  Accumulator implicit, bottleneck?
•  x86 uses accumulator concepts for integer
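The same C = A + B sequence as a minimal sketch, with the single implicit register spelled out (memory contents are made up):

```python
mem = {"A": 3, "B": 4}  # hypothetical memory contents
acc = 0                 # the one implicit register

acc = mem["A"]          # Load A : Acc <= A
acc = acc + mem["B"]    # Add B  : Acc <= Acc + B
mem["C"] = acc          # Store C: C <= Acc
print(mem["C"])  # 7
```

Every operation funnels through `acc`, which is the serialization bottleneck the slide asks about.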
Register
•  Most common approach
–  Fast, temporary storage (small)
–  Explicit operands (register IDs)
•  Example: C = A + B
   Register-memory:
   Load R1, A
   Add R3, R1, B
   Store R3, C
   Load/store:
   Load R1, A
   Load R2, B
   Add R3, R1, R2
   Store R3, C
•  All RISC ISAs are load/store
•  IBM 360, Intel x86, Moto 68K are register-memory
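The load/store sequence above, sketched with an explicit register file (memory contents are made up):

```python
mem = {"A": 3, "B": 4}  # hypothetical memory contents
regs = {}               # explicit, named registers

regs["R1"] = mem["A"]                 # Load R1, A
regs["R2"] = mem["B"]                 # Load R2, B
regs["R3"] = regs["R1"] + regs["R2"]  # Add R3, R1, R2 (ALU touches registers only)
mem["C"] = regs["R3"]                 # Store R3, C
print(mem["C"])  # 7
```

In the load/store style, only Load and Store touch memory; the Add reads and writes registers alone, which is what makes the pipeline timing predictable.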
Common Addressing Modes
Base/Displacement   Load R4, 100(R1)
Register Indirect   Load R4, (R1)
Indexed             Load R4, (R1+R2)
Direct              Load R4, (1001)
Memory Indirect     Load R4, @(R3)
Autoincrement       Load R4, (R2)+
Scaled              Load R4, 100(R2)[R3]
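A sketch of the effective address (EA) each mode computes, using hypothetical register and memory values and assuming a 4-byte word for scaling and autoincrement:

```python
regs = {"R1": 1000, "R2": 2000, "R3": 64}
mem = {64: 4096}                           # Mem[R3] holds a pointer

ea_base_disp    = 100 + regs["R1"]         # 100(R1): displacement + base
ea_reg_indirect = regs["R1"]               # (R1): register holds the address
ea_indexed      = regs["R1"] + regs["R2"]  # (R1+R2): base + index
ea_direct       = 1001                     # (1001): address in the instruction
ea_mem_indirect = mem[regs["R3"]]          # @(R3): memory holds the address
ea_scaled       = 100 + regs["R2"] + regs["R3"] * 4  # 100(R2)[R3]: scaled index
ea_autoinc      = regs["R2"]               # (R2)+: use the address...
regs["R2"] += 4                            # ...then bump R2 by the word size
```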
What leads to a good/bad ISA?
•  Ease of Implementation (Job of Architect/Designer)
–  Does the ISA lend itself to efficient implementations?
•  Ease of Programming (Job of Programmer/Compiler)
–  Can the compiler use the ISA effectively?
•  Future Compatibility
–  ISAs may last 30+yrs
–  Special Features, Address range, etc. need to be thought out
Implementation Concerns
•  Simple Decoding (fixed length)
•  Compactness (variable length)
•  Simple Instructions (no load/update)
–  Things that get microcoded these days
–  Deterministic Latencies are key!
–  Instructions with multiple exceptions are difficult
•  More/Less registers?
–  Slower register files, decoding, better compilers
•  Condition codes/Flags (scheduling!)
Programmability
•  1960s, early 70s
–  Code was mostly hand-coded
•  Late 70s, Early 80s
–  Most code was compiled, but hand-coded was better
•  Mid-80s to Present
–  Most code is compiled and almost as good as assembly
•  Why?
Programmability: 70s, Early 80s
“Closing the Semantic Gap”
•  Bring assembly languages closer to high-level languages
•  Efforts for computers to execute HLL directly
–  e.g. LISP Machine
•  Hardware Type Checking. Special type bits let the type be
checked efficiently at run-time
•  Hardware Garbage Collection
•  Fast Function Calls
•  Efficient Representation of Lists
•  Never worked out…“Semantic Clash”
–  Too many HLLs? C was more popular?
Programmability: 1980s … 2000s
“In the Compiler We Trust”
•  Wulf: Primitives not Solutions
–  Compilers cannot effectively use complex
instructions
–  Synthesize programs from primitives
•  Regularity: same behavior in all contexts
–  No odd cases – things should be intuitive
•  Orthogonality:
–  Data type independent of addressing mode
–  Addressing mode independent of operation performed
ISA Compatibility
“In Computer Architecture, no good idea ever goes unpunished.”
Marty Hopkins, IBM Fellow
•  Never abandon existing code base
•  Extremely difficult to introduce a new ISA
–  Alpha failed, IA64 is struggling, best solution may not win
•  x86 most popular, is the least liked!
•  Hard to think ahead, but…
–  ISA tweak may buy 5-10% today
–  10 years later it may buy nothing, but must be implemented
•  Register windows, delay branches
CISC vs. RISC
•  Debate raged from early 80s through 90s
•  Now it is fairly irrelevant
•  Despite this, Intel (x86 => Itanium) and DEC/Compaq (VAX => Alpha) have tried to switch
•  Research in the late 70s/early 80s led to RISC
–  IBM 801 -- John Cocke – mid 70s
–  Berkeley RISC-1 (Patterson)
–  Stanford MIPS (Hennessy)
VAX
•  32-bit ISA, instructions could be huge (up to 321
bytes), 16 GPRs
•  Operated on data types from 8 to 128-bits,
decimals, strings
•  Orthogonal, memory-to-memory, all operand
modes supported
•  Hundreds of special instructions
•  Simple compiler, hand-coding was common
•  CPI was over 10!
x86
•  Variable length ISA (1-16 bytes)
•  FP Operand Stack
•  2 operand instructions (extended accumulator)
–  Register-register and register-memory support
•  Scaled addressing modes
•  Has been extended many times (as AMD did with x86-64)
•  Intel, initially went to IA64, now backtracked (EM64T)
RISC vs. CISC Arguments
•  RISC
–  Simple Implementation
•  Load/store, fixed-format 32-bit instructions, efficient pipelines
–  Lower CPI
–  Compilers do a lot of the hard work
•  MIPS = Microprocessor without Interlocked Pipeline Stages
•  CISC
–  Simple Compilers (assists hand-coding, many
addressing modes, many instructions)
–  Code Density
MIPS/VAX Comparison
After the dust settled
•  Turns out it doesn’t matter much
•  Can decode CISC instructions into internal
“micro-ISA”
–  This takes a couple of extra cycles (PLA implementation) and a few hundred thousand transistors
–  In 20-stage pipelines and 55M-transistor processors this is minimal
–  Pentium 4 caches these micro-Ops
•  Actually may have some advantages
–  External ISA for compatibility, internal ISA can be
tweaked each generation (Transmeta, NVidia)
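The cracking idea can be sketched in a few lines. This is not any real decoder; the instruction and micro-op names are invented for illustration:

```python
# Toy sketch: crack a register-memory "CISC" add into load/store micro-ops.
def crack(inst):
    op = inst[0]
    if op == "add_rm":                        # ADD dst <- src_reg + Mem[addr]
        _, dst, src_reg, addr = inst
        return [("load", "tmp0", addr),       # uop 1: fetch the memory operand
                ("add", dst, src_reg, "tmp0")]  # uop 2: simple reg-reg add
    return [inst]                             # simple instructions pass through

uops = crack(("add_rm", "R1", "R1", 0x1000))
print(uops)  # one complex instruction becomes two simple micro-ops
```

The pipeline downstream of the decoder then only ever sees simple, RISC-like operations, which is why the external ISA stops mattering much.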