POD: A Parallel-On-Die Architecture

Dong Hyuk Woo (Georgia Tech), Joshua B. Fryman (Intel Corp.), Allan D. Knies (Intel Berkeley Research), Marsha Eng (Intel Corp.), Hsien-Hsin S. Lee (Georgia Tech)

What’s So New with ‘Many-Core’?
• On Chip!
  – New definition of scalability!
• Performance Scalability
  – Minimize communication
  – Hide communication latency
• Area Scalability
  – How many cores on a die?
• Power Scalability
  – How much power per core?

Hsien-Hsin Sean Lee, POD: A Parallel-On-Die Architecture

Scalable Power Consumption?

Technology (nm)      65     45     32     22     16
# of cores            4      8     16     32     64
Frequency (GHz)     3.0    3.0    3.0    3.0    3.0
Vdd (V)             1.3    1.1    0.9    0.8    0.7
Power / core (W)   37.5   26.8   18.0   14.2   10.9
Total power (W)     150    215    288    454    696

* The above pictures do not represent any current or future Intel products.

Thousand Cores, Power Consumption?

Some Known Facts on Interconnection
• One main energy hog! (interconnection-network-related power)
  – Alpha 21364: 20% (25W / 125W)
  – MIT RAW: 40% of individual tile power
  – UT TRIPS: 25%
• Mesh-based MIMD many-core?
  – Unsustainable energy in its router logic
  – Unpredictable communication pattern
    • Unfriendly to low-power techniques

POD: A Parallel-On-Die Architecture

Design Goals
• Broad-purpose acceleration
  – Scalable vector performance
• Energy and area efficiency
• Low latency inter-PE communication
• Best-in-class scalar performance
• Backward compatibility

Host Processor
[Figure: POD block diagram, host processor]

Processing Elements (PEs)
[Figure: POD block diagram, processing elements and host processor]

Instruction Bus (IBus)
[Figure: POD block diagram, instruction bus and host processor]

ORTree: Status Monitoring
[Figure: POD block diagram, ORTree returning flag status to the host processor]

Data Return Buffer (DRB)
[Figure: POD block diagram, eight DRBs and the host processor]

2D Torus P2P Links
[Figure: POD block diagram, 2D torus point-to-point links; labels 0 7 1 6 2 5 3 4; eight DRBs and the host processor]

RRQ & MBus: DMA
[Figure: POD block diagram, Row Response Queue (RRQ) and DRB per row, Memory Bus, host processor]

On-chip MC / Ring
[Figure: POD block diagram, system memory, memory controllers (MC), RRQ and DRB per row, arbitrator (ARB), LLC, xTLB, host processor]

Wire Delay Aware Design
[Figure: IBus distribution; labels RRQ, D, EFLAGS, MemFence, IBUS]

Simplified PE Pipelining
[Figure: PE pipeline; labels NSEW links, 96-bit VLIW instruction, G / X / M pipes, DEC / RF, Local SRAM, MBUS, IBUS, numeric annotations 80 and 78]

Physical Design Evaluation
[Figure: annotated Intel Conroe Core 2 Duo processor die photo (fetch, decode, rename/alloc, RS, ROB, IL1, DL1, TLBs, INT, FP and SIMD units), with a POD PE overlaid for scale]

Physical Design Evaluation @ 45nm
• PE
  – 128KB dual-ported SRAM, basic GP ALU, simplified SSE ALU
  – ~3.2 mm^2
• RRQ
  – ~3.2 mm^2 (conservative estimation)
• The host processor
  – Core-based microarchitecture
  – ~25.9 mm^2
• 3MB LLC
  – ~20 mm^2
• Total: ~276.5 mm^2

Interconnection Fabric Among PEs
• Different links for different communication
  – IBus / MBus / 2D torus P2P links / ORTree
  – No contention among different communication
• Fully synchronized and orchestrated by the host processor (and RRQ for MBus)
  – No crossbar
  – No contention / arbitration
  – No buffer space for the router
  – Nothing unpredictable!
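The label sequence 0 7 1 6 2 5 3 4 on the 2D torus slide looks like a folded ring placement, a common way to keep torus wraparound links short. The slides do not say this explicitly, so the sketch below is an interpretation and not the authors' stated layout. The helper names `folded_order` and `max_ring_link_length` are hypothetical. It shows that, under this assumption, every ring link (including the wraparound link) spans at most two physical slots, whereas a linear placement has one link spanning the whole row.

```python
def folded_order(n):
    """Return a folded placement of n ring nodes: 0, n-1, 1, n-2, ...

    Assumption: this is how the labels 0 7 1 6 2 5 3 4 are produced.
    """
    order = []
    lo, hi = 0, n - 1
    while lo <= hi:
        order.append(lo)
        if lo != hi:
            order.append(hi)
        lo += 1
        hi -= 1
    return order


def max_ring_link_length(order):
    """Longest physical distance (in slots) between logical ring neighbours.

    order[s] is the logical node placed in physical slot s; logical node i
    is linked to node (i + 1) mod n, so the wraparound link is included.
    """
    slot = {node: s for s, node in enumerate(order)}
    n = len(order)
    return max(abs(slot[i] - slot[(i + 1) % n]) for i in range(n))


if __name__ == "__main__":
    folded = folded_order(8)
    print(folded)                                 # [0, 7, 1, 6, 2, 5, 3, 4]
    print(max_ring_link_length(folded))           # 2
    print(max_ring_link_length(list(range(8))))   # 7 (linear placement)
```

With bounded link length, a neighbor transfer can plausibly fit a fixed timing budget, which is consistent with the "nothing unpredictable" goal stated above.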
Design Space Exploration
• Baseline Design
• 3D POD with many-core
  [Figure: stacked dies labeled Die 0 and Die 1, captioned "You live and work here" and "You get your beer here"]
• 3D many-POD processor
  [Figure: stacked dies labeled Die 0 and Die 1]

Programming Model

Programming Model
• Example code: Reduction, S = \sum_{i=0}^{N-1} a_i

    char* base_addr = (char*) malloc(npe);
    for ( i=0; i < npe; i++ ) {
        *(base_addr+i) = i+1;
    }

    $ASM mov r15 = MY_PE_ID_POINTER
    $ASM ld r15 = local [ r15 + 0 ]
    POD_movl( r14, base_addr );
    $ASM add r15 = r15, r14
    $ASM ld r1 = sys [ r15 + 0 ]
    POD_mfence();

    $ASM add r2 = r2, r1
    for ( i=0; i< (npeX-1); i++ ) {
        $ASM xfer.east r1 = r1
        $ASM add r2 = r2, r1
    }
    $ASM add r1 = r2, 0
    for ( i=0; i< (npeY-1); i++ ) {
        $ASM xfer.north r1 = r1
        $ASM add r2 = r2, r1
    }

Simulation Result

Performance Result
[Figure: speedup (axis ticks 4, 16, 64) of DenseMMM (SP), FFT (SP), IDCT (DP), Option Pricing (SP), Down Sampling (SP) and K-means (SP) on 1x1, 2x2, 4x4 and 8x8 POD; data labels on the plot: 860.3 GFLOPS, 869.3 GFLOPS, 113.6 GFLOPS, 91.4 GFLOPS, 504.6 GFLOPS, 134.8 GFLOPS]

Performance per mm^2
[Figure: performance per mm^2 (axis 0 to 14) for the same benchmarks on 1x1, 2x2, 4x4 and 8x8 POD, compared with an ideal homogeneous many-core with 1, 4, 16 and 64 cores at p = 0.125, 0.250, 0.500]

Performance per Watt
[Figure: performance per Watt (axis 0 to 10), same benchmarks, POD configurations and ideal homogeneous many-core baselines]

Performance per Joule
[Figure: performance per Joule (axis 0 to 600), same benchmarks, POD configurations and ideal homogeneous many-core baselines]

Interconnection Power

          Input buffer   X-bar   Arbiter   Link
RAW           31%         30%      ~0%     39%
TRIPS         35%         33%       1%     31%

Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003

2D Torus P2P Link Active Time
[Figure: active time (axis 0% to 6%) of the N, E, W and S links for DenseMMM, FFT, IDCT, OptionPricing, DownSampling and K-means on 1x1, 2x2, 4x4 and 8x8 POD]

Conclusion
• We propose POD
  – Parallel-On-Die many-core architecture
  – Broad-purpose acceleration
  – High energy and area efficiency
  – Backward compatibility
• A single-chip POD can achieve up to 1.5 TFLOPS of IEEE SP FP operations.
• Interconnection architecture of POD is extremely power- and area-efficient.

Let’s Use Power to Compute, Not Commute!
Georgia Tech ECE MARS Labs
http://arch.ece.gatech.edu

Back-up Slides

PODSIM
• PE pipelining
• Memory subsystem
• Accurate modeling of on-chip and off-chip bandwidth
[Figure labels: Native x86 machine, PODSIM]
• Limitation: host processor overhead not modeled
  – I$ miss, branch misprediction
  – LLC takeover of the MC ring by the host

Why multi-processing again?
• No more faster processors
  – Power wall
  – Increasing wire delay
  – Diminishing returns of deeply pipelined architectures
• No more illusion of a sequential processor
  – Design complexity
  – Diminishing returns of wide-issue superscalar processors

Out-of-order Host Issues
• Speculative execution
  – No recovery mechanism in each PE
  – Broadcast POD instructions in a non-speculative manner
    • As long as host-side code does not depend on the results from POD, this approach does not affect performance.
• Instruction Reordering
  – POD instructions are strongly ordered.
• IBits queue
  – Similar to the store queue

x86 Modification
• 5 new instructions:
  – sendibits, getort, drainort, movl, getdrb
• 3 modified instructions:
  – lfence, sfence, mfence
• 4 possible new co-processor registers, or else memory-mapped addresses:
  – ibits register, ort status, drb value, xTLB support
• IBits queue
• All other modifications external to the x86 core
  – Arbiter for LLC priority can be external to LLC
• Deferred Design Issue: LLC-POD Coherence
  – v1 Study: uncacheable
  – v2 Prototype: LLC snoops on A/D rings

POD ISA
• “Instructions” are 3-wide VLIW bundles
  – One of each G, X, M (32 bits each)
• Integer Pipe (G):
  – and, or, not, xor, shl, shr, add, sub, mul, cmp, cmov, xfer, xferdrb : !div
• SSE Pipe (X):
  – pfp{fma,add,sub,mul,div,max,min}, pi{add,sub,mul,and,or,not,cmp}, shuffle, cvt, xfer : !idiv
• Memory Pipe (M):
  – ld, st, blkrd/wr, mask{set,pop,push}, mov, xfer
• Hybrid (G+X+M):
  – movl

Power Scalability!
[Figure: relative performance per Joule (axis 0 to 14) versus relative chip power budget (axis 0 to 60)]

Some Known Facts on Interconnection
• MIT RAW
  – ~40% of die area
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002

Some Known Facts on Interconnection
• One main energy hog!
  – Alpha 21364: 20% (25W / 125W)
  – MIT RAW: 40% of individual tile power
  – UT TRIPS: 25%
Taylor et al., “The Raw Microprocessor: A Computational Fabric for Software Circuits and General Purpose Programs”, IEEE MICRO Mar/Apr 2002
Wang et al., “Orion: A Power-Performance Simulator for Interconnection Networks”, MICRO 2002
Wang et al., “Power-Driven Design of Router Microarchitectures in On-Chip Networks”, MICRO 2003
Peh, “Chip-Scale Power Scaling”, http://www.princeton.edu/~peh/talks/stanford_nws.pdf

Some Known Facts on Interconnection
• Mesh-based MIMD many-core?
  – Unsustainable energy in its router logic
  – Unpredictable communication pattern
    • Difficult to apply common low-power techniques
Borkar, “Networks for Multi-core Chip - A Controversial View”, 2006 Workshop on On- and Off-Chip Interconnection Networks for Multicore Systems

Programming Model

Programming Model
• Example code: Reduction, S = \sum_{i=0}^{N-1} a_i

    char* base_addr = (char*) malloc(npe);     // malloc memory at the host memory space
    for ( i=0; i < npe; i++ ) {                // initialize a_i
        *(base_addr+i) = i+1;
    }

    $ASM mov r15 = MY_PE_ID_POINTER            // load the address of my PE ID
    $ASM ld r15 = local [ r15 + 0 ]            // load my PE ID
    POD_movl( r14, base_addr );                // broadcast base_addr
    $ASM add r15 = r15, r14                    // calculate the pointer to my a_i
    $ASM ld r1 = sys [ r15 + 0 ]               // load my a_i from the host memory space
    POD_mfence();                              // wait until it is loaded

    $ASM add r2 = r2, r1                       // update S
    for ( i=0; i< (npeX-1); i++ ) {
        $ASM xfer.east r1 = r1                 // transfer my a_i to my east-side neighbor
        $ASM add r2 = r2, r1                   // update S
    }
    $ASM add r1 = r2, 0                        // copy rowsum to r1
    for ( i=0; i< (npeY-1); i++ ) {
        $ASM xfer.north r1 = r1                // transfer my rowsum to my north-side neighbor
        $ASM add r2 = r2, r1                   // update S
    }

Programming Model (2x2 POD)
The following slides step through the code above on a 2x2 POD. Per-PE register values are listed in the order PE2, PE3, PE0, PE1, following the slides; a single value means all four PEs hold it.
• Malloc memory at the host memory space
• Initialize a_i
  – Host memory: base_addr: 16; 16: 0, 17: 0, 18: 0, 19: 0
• Load the address of my PE ID
  – Host memory (unchanged from here on): base_addr: 16; 16: 1, 17: 2, 18: 3, 19: 4
• Load my PE ID
  – r15 = 2
• Broadcast base_addr
  – r15 = (2, 3, 0, 1)
• Calculate the pointer to my a_i
  – r14 = 16; r15 = (2, 3, 0, 1)
• Load my a_i from the host memory space
  – r14 = 16; r15 = (18, 19, 16, 17)
• Wait until it is loaded
  – r14 = 16; r15 = (18, 19, 16, 17)
• Update S
  – r1 = (3, 4, 1, 2); r14 = 16; r15 = (18, 19, 16, 17)
• 1st iteration
  – r1 = (3, 4, 1, 2); r2 = (3, 4, 1, 2); r14 = 16; r15 = (18, 19, 16, 17)
• Transfer my a_i to my east-side neighbor
  – registers as on the previous slide
• Update S
  – registers as on the previous slide
• Copy rowsum to r1
  – r1 = (4, 3, 2, 1); r2 = (7, 7, 3, 3); r14 = 16; r15 = (18, 19, 16, 17)
• 1st iteration
  – r1 = (7, 7, 3, 3); r2 = (7, 7, 3, 3); r14 = 16; r15 = (18, 19, 16, 17)
• Transfer my rowsum to my north-side neighbor
  – registers as on the previous slide
• Update S
  – registers as on the previous slide
• Done!
  – r1 = (3, 3, 7, 7); r2 = 10; r14 = 16; r15 = (18, 19, 16, 17)

Programming Model (2x2 POD): writing the result back

    int result;
    POD_movl( r14, &result );                  // broadcast the pointer to 'result'

    $ASM mov r15 = MY_PE_ID_POINTER            // only PE0 will write back
    $ASM ld r15 = local [ r15 + 0 ]            // load PE_ID
    $ASM sub r15 = r15, 0                      // compare PE_ID
    $ASM pushmask.e                            // enable PE only when equal to zero
    $ASM st sys[ r14 + 0 ] = r2                // write back S
    $ASM popmask                               // enable ALL PEs
    POD_mfence();                              // wait until the result is written back

• Broadcast the pointer to ‘result’
  – r1 = (3, 3, 7, 7); r2 = 10; r14 = 16; r15 = (18, 19, 16, 17)
  – Host memory: base_addr: 16; 16: 1, 17: 2, 18: 3, 19: 4; 128: address of ‘result’
• Only PE0 will write-back!
  – r1 = (3, 3, 7, 7); r2 = 10; r14 = 128; r15 = (18, 19, 16, 17)
• Load PE_ID
  – r1 = (3, 3, 7, 7); r2 = 10; r14 = 128; r15 = 2
• Compare PE_ID
  – r1 = (3, 3, 7, 7); r2 = 10; r14 = 128; r15 = (2, 3, 0, 1)
• Enable PE only when equal to zero
  – registers as on the previous slide
• Write-back
  – registers as on the previous slide
• Enable ALL PEs
  – registers as on the previous slide
• Wait until the result is written-back
  – registers as on the previous slide
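The reduction walked through above can be checked with a small host-side simulation. This is only a sketch, not PODSIM: the function name `torus_allreduce` and the grid representation are hypothetical, r2 is assumed to start at zero, "east" is assumed to mean increasing column index and "north" increasing row index, both with torus wraparound.

```python
def _add(p, q):
    """Element-wise sum of two equally shaped 2D grids."""
    return [[u + v for u, v in zip(p_row, q_row)] for p_row, q_row in zip(p, q)]


def torus_allreduce(a, npe_x, npe_y):
    """Simulate the slide's reduction on an npe_x-by-npe_y torus.

    a[y][x] is the operand a_i loaded by the PE in column x, row y.
    Returns the final r2 of every PE (each should hold the full sum S).
    """
    r1 = [row[:] for row in a]   # ld r1 = sys[r15 + 0]
    r2 = [row[:] for row in a]   # add r2 = r2, r1 (r2 assumed to start at 0)

    for _ in range(npe_x - 1):
        # xfer.east r1 = r1: every PE sends r1 east, so each PE
        # receives the r1 of its west neighbour (with wraparound).
        r1 = [[row[(x - 1) % npe_x] for x in range(npe_x)] for row in r1]
        r2 = _add(r2, r1)        # add r2 = r2, r1

    r1 = [row[:] for row in r2]  # add r1 = r2, 0 (copy row sum to r1)

    for _ in range(npe_y - 1):
        # xfer.north r1 = r1: each PE receives the r1 of its south neighbour.
        r1 = [r1[(y - 1) % npe_y][:] for y in range(npe_y)]
        r2 = _add(r2, r1)        # add r2 = r2, r1

    return r2


if __name__ == "__main__":
    # 2x2 case from the slides: row 0 holds PE0, PE1; row 1 holds PE2, PE3.
    print(torus_allreduce([[1, 2], [3, 4]], 2, 2))  # [[10, 10], [10, 10]]
```

The row phase takes npeX-1 transfers and the column phase npeY-1, so every PE ends with S after (npeX-1) + (npeY-1) neighbor transfers, using only the fixed east and north links.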