Slides 1–2: Spatial Computation
A model of general-purpose computation based on Application-Specific Hardware.
Mihai Budiu, CMU SCS.
Thesis committee: Seth Goldstein, Peter Lee, Todd Mowry, Babak Falsafi, Nevin Heintze.
Ph.D. thesis defense, December 8, 2003.

Slide 3: Thesis Statement
Application-Specific Hardware (ASH):
• can be synthesized by adapting software compilation for predicated architectures,
• provides high performance for programs with high ILP, with very low power consumption,
• is a more scalable and efficient computation substrate than monolithic processors.

Slide 4: Outline
• Introduction
• Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions

Slide 5: CPU Problems
• Complexity
• Power
• Global signals
• Limited ILP

Slide 6: Design Complexity
• Design time: CAD productivity favors FPL.
[Chart: the design productivity gap, 1981–2009; logic transistors per chip grow at 58%/year while designer productivity (transistors/staff-month) grows at 21%/year. Source: S. Malik, originally Sematech, from Michael Flynn's FCRC 2003 talk.]

Slide 7: Communication vs. Computation
• Gate delay: 5 ps; wire delay: 20 ps.
• Power consumption on wires is also dominant.

Slide 8: Our Approach: ASH
Application-Specific Hardware.

Slide 9: Resource Binding Time
[Diagram: programs 1 and 2 share a single CPU, vs. programs 1 and 2 each mapped to their own ASH hardware.]

Slide 10: Hardware Interface
• CPU: software / ISA / hardware.
• ASH: software / virtual ISA / gates / hardware.

Slide 11: Application-Specific Hardware
C program → Compiler → Dataflow IR → Reconfigurable/custom hardware.

Slide 12: Contributions
Embedded systems • Asynchronous circuits • Computer architecture • Reconfigurable computing • Compilation • High-level synthesis • Nanotechnology • Dataflow machines theory.

Slide 13: Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions

Slide 14: Computation = Dataflow
Programs → Circuits:
    x = a & 7;
    y = x >> 2;
[Dataflow graph: a and the constant 7 feed an & node producing x; x and the constant 2 feed a >> node producing y.]
• Operations → functional units
• Variables → wires
• No interpretation

Slide 15: Basic Operation
[Diagram: an operator (+) with an output latch and data/ack/valid handshake signals.]

Slide 16: Asynchronous Computation
[Diagram: snapshots 1–8 of data flowing through a network of + operators under the data/ack/valid handshake.]

Slide 17: Distributed Control Logic
[Diagram: a small FSM per operator driving the rdy/ack signals.]
• Short, local wires
• Asynchronous control

Slide 18: Forward Branches
    if (x > 0) y = -x; else y = b*x;
[Dataflow graph: b*x, -x, and x > 0 are computed in parallel; a multiplexer selects y; the critical path is highlighted.]
• Conditionals → speculation

Slide 19: Control Flow → Data Flow
[Diagram: a merge node selects among several data inputs by label; a gateway/split node steers data according to a predicate p (a branch).]

Slide 20: Loops
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i*i;
    return sum;
[Dataflow graph: a loop for i (increment by 1, compare with 100) feeding i*i into a loop for sum, with the final sum steered to ret.]

Slide 21: Predication and Side-Effects
[Diagram: a Load operation with address, predicate, and token inputs; it produces data and passes a token to memory.]
• Sequencing of side-effects
• No speculation

Slide 22: Thesis Statement (restated: the same three ASH claims as on slide 3).

Slide 23: Outline
• Introduction
• CASH: Compiling for ASH
  – An optimization on the SIDE
• Media processing on ASH
• ASH vs. superscalar processors
• Conclusions

Slide 24: Availability Dataflow Analysis
    y = a*b;
    ...
    if (x) {
        ... = a*b;    /* a*b is available: reuse y */
    }

Slide 25: Dataflow Analysis Is Conservative
    if (x) {
        ...
        y = a*b;
    }
    ...
    ... = a*b;    /* is y available here? only on the path where x was true */

Slide 26: Static Instantiation, Dynamic Evaluation
    flag = false;
    if (x) {
        ...
        y = a*b;
        flag = true;
    }
    ...
    ... = flag ? y : a*b;
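To make the SIDE idea on slides 24–26 concrete, here is a minimal compilable C sketch; the function names and the surrounding arithmetic are illustrative assumptions, not taken from the talk.

    /* Before: the first a*b is computed only when x is nonzero, so static
       availability analysis must conservatively recompute it at the use. */
    int f_before(int a, int b, int x) {
        int y = 0;
        if (x)
            y = a * b;          /* a*b available only on this path */
        return a * b + y;       /* recomputed unconditionally */
    }

    /* SIDE: statically instantiate a flag, evaluate availability dynamically.
       The multiply at the use point runs only when it was NOT already done. */
    int f_side(int a, int b, int x) {
        int y = 0;
        int flag = 0;           /* "was a*b computed on this path?" */
        if (x) {
            y = a * b;
            flag = 1;
        }
        int t = flag ? y : a * b;   /* reuse if available, else compute */
        return t + y;
    }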
Slide 27: SIDE Register Promotion Impact
[Two charts: percent reduction in loads (series: % ld promo, % ld PRE) and in stores (series: % st promo, % st PRE) across the Mediabench kernels and the SPEC benchmarks.]

Slide 28: Outline (repeated).

Slide 29: Performance Evaluation
[Diagram: ASH connected through a load-store queue (LSQ) to an 8K L1, a 1/4M L2, and limited-bandwidth memory.]
• Baseline CPU: 4-way out-of-order.
• Assumption: all operations have the same latency.

Slide 30: Media Kernels, vs 4-way OOO
[Chart: ASH speedup ("times faster") over the 4-way OOO baseline for each Mediabench kernel; bars labeled 3 and 5.8 exceed the plotted range.]

Slide 31: Media Kernels, IPC
[Chart: base IPC vs. ASH IPC for each kernel; y-axis up to 25.]

Slide 32: Speed-up / IPC Correlation
[Chart: ASH speed-up and the ASH/baseline IPC ratio ("times bigger") plotted together for each kernel; y-axis up to 12.]

Slide 33: Low-Level Evaluation
• C → CASH core: the dataflow results shown so far (all results are in the thesis).
• CASH core → Verilog back-end → Synopsys and Cadence place & route → ASIC.
• 180nm standard-cell library, 2V, roughly 1999 technology; results in the next two slides.

Slide 34: Area
[Chart: synthesized area per media kernel, 0–12 mm².]
• Reference: a P4 in 180nm is 217 mm².

Slide 35: Power
Power reduction vs. a 4-way OOO superscalar at 600 MHz with clock gating (Wattch model, about 6 W):
    Benchmark    Times smaller than OOO
    adpcm_d      70
    g721_d       41
    g721_e       41
    gsm_d        129
    gsm_e        147
    jpeg_d       94
    mpeg2_d      121
    mpeg2_e      136
    pegwit_d     303
    pegwit_e     303

Slide 36: Thesis Statement (restated).

Slide 37: Outline
• Introduction
• CASH: Compiling for ASH
• Media processing on ASH
  – dataflow pipelining
• ASH vs. superscalar processors
• Conclusions

Slides 38–42: Pipelining
    int sum = 0, i;
    for (i = 0; i < 100; i++)
        sum += i*i;
    return sum;
[Cycle-by-cycle snapshots (cycles 1–5) of the loop's dataflow graph with an 8-stage pipelined multiplier; by cycle 5, iteration i=1 has entered the graph while i=0 is still in flight (pipeline balancing; see the backup slides).]

Slide 43: Outline (repeated).

Slide 44: This Is Obvious!
ASH runs at full dataflow speed, so a CPU cannot do any better (if the compilers are equally good).

Slide 45: SpecInt95, ASH vs 4-way OOO
[Chart: percent slower/faster for 099.go, 124.m88ksim, 129.compress, 130.li, 132.ijpeg, 134.perl, and 147.vortex; y-axis from -50% to +30%.]

Slide 46: Branch Prediction
    for (i = 0; i < N; i++) {
        ...
        if (exception) break;
    }
[Diagram: the ASH critical path vs. the CPU critical path through the i+1 increment, the comparison, and the exception check.]
• Predicted not taken: effectively a noop for the CPU.
• Predicted taken: the result is available before the inputs.
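A concrete version of the slide-46 loop may help; the array body and the overflow test below are assumptions added only to make the fragment compilable. The point is unchanged: the CPU predicts the break not taken and removes the check from its critical path, while the ASH dataflow graph must evaluate the predicate each iteration before steering values into the next one.

    /* Illustrative loop in the spirit of slide 46 (names and body assumed). */
    int accumulate(const int *a, int n) {
        int sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a[i];
            int exception = (sum < 0);   /* hypothetical overflow check */
            if (exception)
                break;   /* CPU: predicted not taken, nearly free        */
                         /* ASH: this predicate gates the next iteration */
        }
        return sum;
    }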
Slide 47: SpecInt95, perfect prediction
[Chart: percent slower/faster with perfect branch prediction next to the baseline prediction, for the same SpecInt95 benchmarks; y-axis from -60% to +60%; no data for one benchmark.]

Slide 48: ASH Problems
• Both branch and join are not free
• Static dataflow (no re-issue of the same instruction)
• Memory is "far"
• Fully static:
  – No branch prediction
  – No dynamic unrolling
  – No register renaming
• Calls/returns not lenient
• ...

Slide 49: Thesis Statement (restated).

Slide 50: Outline
Introduction + CASH: Compiling for ASH + Media processing on ASH + ASH vs. superscalar processors = Conclusions

Slide 51: Strengths
• Low power
• Simple verification?
• Economies of scale
• Highly optimized
• Specialized to the application
• Unlimited ILP
• Simple hardware
• No fixed window
• Branch prediction
• Control speculation
• Full dataflow
• Global signals/decisions

Slide 52: Conclusions
• Compiling "around the ISA" is a fruitful research approach.
• Distributed computation structures require more synchronization overhead.
• Spatial Computation efficiently implements high-ILP computation with very low power.

Slide 53: Backup Slides
• Control logic
• Pipeline balancing
• Lenient execution
• Dynamic critical path
• Memory PRE
• Critical path analysis
• CPU + ASH

Slide 54: Control Logic
[Diagram: the operator's handshake control: rdyin/rdyout and ackin/ackout signals, C-elements and latches, and a data register between datain and dataout.]

Slide 55: Last-Arrival Events
[Diagram: a + operator with data, ack, and valid signals.]
• The event enabling the generation of a result.
• May be an ack.
• Critical path = collection of last-arrival edges.

Slide 56: Dynamic Critical Path
1. Start from the last node.
2. Trace back along last-arrival edges.
3. Some edges may repeat.

Slide 57: Critical Paths
    if (x > 0) y = -x; else y = b*x;
[Dataflow graph with the critical path highlighted.]

Slide 58: Lenient Operations
    if (x > 0) y = -x; else y = b*x;
[Same graph; a lenient operation produces its output as soon as the inputs that have already arrived determine it.]
• Lenient operations solve the problem of unbalanced paths.

Slides 59–61: Pipelining (cycles 6–7)
[Snapshots of the loop's dataflow graph: i's loop feeds the long-latency multiplier pipeline into sum's loop.]
• The predicate ack edge is on the critical path.

Slides 62–63: Pipeline balancing
[Same graph with a decoupling FIFO inserted between i's loop and sum's loop; the critical path is shown with the FIFO in place (cycle 7 snapshot).]

Slide 64: Register Promotion
    Before:  (p1) *p = …          (p2) … = *p
    After:   (p1) *p = …          (p2 ∧ ¬p1) … = *p
• The load is executed only if the store is not.

Slide 65: Register Promotion (2)
    Before:  (p1) *p = …          (p2) … = *p
    After:   (p1) *p = …          (false) … = *p
• When p2 ⇒ p1 the load becomes dead...
• ...i.e., when the store dominates the load in the CFG.

Slide 66: ≈ PRE
    Before:  (p1) … = *p          (p2) … = *p
    After:   (p1 ∨ p2) … = *p
• This corresponds in the CFG to lifting the load to a basic block dominating the original loads.

Slide 67: Store-store (1)
    Before:  (p1) *p = …          (p2) *p = …
    After:   (p1 ∧ ¬p2) *p = …    (p2) *p = …
• When p1 ⇒ p2 the first store becomes dead...
• ...i.e., when the second store post-dominates the first in the CFG.

Slide 68: Store-store (2)
Same rewrite, viewed on the token graph:
• The token edge between the two stores is eliminated, but...
• ...the transitive closure of tokens is preserved.
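As a source-level illustration of the predicated register promotion on slides 64–65, here is a hedged C sketch; the function names and the have/fwd temporaries are illustrative assumptions, and the real transformation operates on predicated dataflow IR rather than on C source.

    /* Before: the load may read what the guarded store just wrote. */
    int before(int *p, int v, int p1, int p2) {
        int y = 0;
        if (p1) *p = v;      /* store under predicate p1 */
        if (p2) y = *p;      /* load under predicate p2  */
        return y;
    }

    /* After promotion: the stored value is kept in a register and forwarded;
       the load executes only when the store did not (predicate p2 ∧ ¬p1). */
    int after(int *p, int v, int p1, int p2) {
        int y = 0;
        int fwd = 0, have = 0;
        if (p1) { *p = v; fwd = v; have = 1; }   /* remember the stored value */
        if (p2) y = have ? fwd : *p;             /* load only if no prior store */
        return y;
    }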
Slide 69: A Code Fragment
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j++)
            if (X[j].r == i)
                break;
        Y[i] = X[j].q;
    }
SpecINT95, 124.m88ksim, init_processor, stylized.

Slide 70: Dynamic Critical Path
    for (j = 0; X[j].r != 0xF; j++)
        if (X[j].r == i)
            break;
[Diagram: the dynamic critical path of the inner loop, annotated with: definition, sizeof(X[j]), load predicate, loop predicate.]

Slide 71: MIPS gcc Code
    LOOP:
    L1: beq   $v0,$a1,EXIT    ; X[j].r == i
    L2: addiu $v1,$v1,20      ; &X[j+1].r
    L3: lw    $v0,0($v1)      ; X[j+1].r
    L4: addiu $a0,$a0,1       ; j++
    L5: bne   $v0,$a3,LOOP    ; X[j+1].r == 0xF
    EXIT:
• L1 → L2 → L3 → L5 → L1: a 4-instruction loop-carried dependence.

Slide 72: If Branch Prediction Correct
(Same MIPS code as slide 71.)
• The superscalar is issue-limited: 2 cycles/iteration sustained.

Slide 73: Critical Path with Prediction
(Same inner loop.)
• Loads are not speculative.

Slide 74: Prediction + Load Speculation
(Same inner loop.)
• ~4 cycles!
• The load is not pipelined (ack edge: self-anti-dependence).

Slide 75: OOO Pipe Snapshot
(Same MIPS code as slide 71.)
[Table: a snapshot of the out-of-order pipeline stages IF, DA, EX, WB, CT with instructions L1–L5 from several iterations in flight; register renaming noted.]

Slide 76: Unrolling?
    for (i = 0; i < 64; i++) {
        for (j = 0; X[j].r != 0xF; j += 2) {
            if (X[j].r == i) break;
            if (X[j+1].r == 0xF) break;   /* when 1 iteration */
            if (X[j+1].r == i) break;
        }
        Y[i] = X[j].q;
    }

Slide 77: Ideal Architecture
[Diagram: the CPU runs low-ILP computation, the OS, and VM; ASH runs high-ILP computation; both are connected to memory.]
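For readers who want to run the slide-69 fragment, here is a self-contained version; the struct layout, array sizes, and initialization are assumptions chosen only to match the 20-byte element stride visible in the compiled MIPS code on slide 71, not the actual m88ksim declarations.

    /* Self-contained sketch of the stylized 124.m88ksim fragment. */
    #define N 64

    struct entry {
        int r;          /* searched field (offset 0, as in 'lw $v0,0($v1)') */
        int q;          /* copied field */
        int pad[3];     /* filler so sizeof(struct entry) == 20 on a 32-bit ABI */
    };

    struct entry X[1024];   /* zero-initialized; a 0xF sentinel is set in main */
    int Y[N];

    void copy_matches(void) {
        int i, j;
        for (i = 0; i < N; i++) {
            /* inner search: the load of X[j+1].r feeds the loop-exit branch,
               forming the 4-instruction loop-carried dependence on the slides */
            for (j = 0; X[j].r != 0xF; j++)
                if (X[j].r == i)
                    break;
            Y[i] = X[j].q;
        }
    }

    int main(void) {
        X[3].r = 0xF;              /* sentinel terminates the inner search */
        X[1].r = 2; X[1].q = 42;   /* one matching entry, purely illustrative */
        copy_matches();
        return Y[2] == 42 ? 0 : 1;
    }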