MIT OpenCourseWare http://ocw.mit.edu 6.189 Multicore Programming Primer, January (IAP) 2007 Please use the following citation format: Rodric Rabbah, 6.189 Multicore Programming Primer, January (IAP) 2007. (Massachusetts Institute of Technology: MIT OpenCourseWare). http://ocw.mit.edu (accessed MM DD, YYYY). License: Creative Commons Attribution-Noncommercial-Share Alike. Note: Please use the actual date you accessed this material in your citation. For more information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms 6.189 IAP 2007 Lecture 17 The Raw Experience Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 1 6.189 IAP 2007 MIT Raw Chips October 02 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 2 6.189 IAP 2007 MIT Raw Microprocessor ● Tiled microprocessor with point-to-point pipelined scalar operand network ● Each tiles is 4 mm x 4mm • MIPS-style compute processor – – – Single-issue 8-stage pipe 32b FPU 32K D Cache, I Cache ● 4 on-chip mesh networks • • • Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 3 Two for operands One for cache misses, I/O One for message passing 6.189 IAP 2007 MIT Raw Microprocessor ● ● ● ● 16 tiles (16 issue) 180 nm ASIC (IBM SA-27E) ~100 million transistors 1 million gates ● 3-4 years of development ● 1.5 years of testing ● 200K lines of test code ● Core Frequency: • • 425 MHz @ 1.8 V 500 MHz @ 2.2 V ● Frequency competitive with IBM- implemented PowerPCs in same process ● 18W average power Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 4 6.189 IAP 2007 MIT One Cycle in the Life of a Tiled Processor Image by MIT OCW. Image by MIT OCW. ● Application uses as many tiles as needed to exploit its parallelism Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 5 6.189 IAP 2007 MIT Raw Motherboard Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 6 6.189 IAP 2007 MIT Raw in Action Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 7 6.189 IAP 2007 MIT MPEG-2 Encoder Performance 350 x 240 Images 720 x 480 Images 16 16 4 Speedup 30 8 30 4 10 1 10 1 1 4 16 8 4 # of Tiles 8 # of Tiles Square – Linear speedup Diamond – Hand-optimized, slice parallel implementation Circle – Slice parallel implementation Triangle – Baseline macroblock parallel implementation Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 8 6.189 IAP 2007 MIT Frames/s 8 60 Frames/s Speedup 60 MPEG-2 Encoder Performance Encoding Rate (frames/s) # Tiles 352 x 240 640 x 480 720 x 480 1 4.30 1.14 1.00 2 8.48 2.24 1.97 4 16.18 4.45 3.84 8 30.82 8.69 7.52 16 58.65 16.74 14.57 32 103* 30* 64 158* 51.90 * Estimated data rates Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 9 6.189 IAP 2007 MIT Programmable Graphics Pipeline Input Vertex V Vertex Sync Triangle Setup Pixel P Pixel simplified graphics pipeline Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 10 6.189 IAP 2007 MIT Phong Shading ● Per-pixel phong-shaded polyhedron ● 162 vertices, 1 light Output, rendered using Raw simulator Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 11 6.189 IAP 2007 MIT Phong Shading (64-tiles) Fixed pipeline Reconfigurable pipeline ● 33% faster ● 150% better utilization Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 12 6.189 IAP 2007 MIT Shadow Volumes ● 4 textured triangles ● 1 point light ● Rendered in 3 passes Output, rendered using Raw simulator Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 13 6.189 IAP 2007 MIT Shadow Volumes (64-tiles) Fixed pipeline Pass 1 Pass 2 Pass 3 Reconfigurable pipeline Pass 1 Pass 2 Pass 3 ● 40% faster cycles Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 14 6.189 IAP 2007 MIT 1020 Element Microphone Array Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 15 6.189 IAP 2007 MIT Case Study: Beamformer 1,600 1,430 1,400 MFLOPS 1,200 1,000 800 640 600 400 240 200 19 0 C program C program 1 GHz Pentium III 420 MHz single tile Raw Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 16 Unoptimized StreamIt Optimized StreamIt 420 MHz 64 tile Raw 420 MHz 16 tile Raw 6.189 IAP 2007 MIT The Raw Experience ● Insights into the design Raw architecture ● Raw parallelizing compiler ● StreamIt language and Compiler Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 17 6.189 IAP 2007 MIT Scalability Problems in Wide Issue Processors ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU Bypass Net ALU RF ALU Wide Fetch (16 inst) Control ALU PC Unified Load/Store Queue Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 18 6.189 IAP 2007 MIT Area and Frequency Scalability Problems 2 ~N ALU ALU ALU ALU ALU ALU ALU N ALUs ALU ~N3 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 ALU ALU ALU ALU ALU ALU Bypass Net ALU ALU RF 19 6.189 IAP 2007 MIT ALU ALU ALU ALU ALU ALU ALU ALU Operand Routing is Global ALU ALU ALU ALU ALU ALU ALU ALU RF Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 20 Bypass Net + >> 6.189 IAP 2007 MIT ALU ALU ALU ALU ALU ALU ALU ALU Idea: Make Operand Routing Local Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 ALU ALU ALU ALU ALU ALU Bypass Net ALU ALU RF 21 6.189 IAP 2007 MIT ALU ALU ALU ALU ALU ALU ALU ALU Idea: Exploit Locality ALU ALU ALU ALU ALU ALU ALU ALU RF Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 22 Bypass Net 6.189 IAP 2007 MIT ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU Replace Crossbar with Point-To-Point Network Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 23 ALU ALU ALU ALU RF 6.189 IAP 2007 MIT ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU ALU Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 24 ALU >> ALU RF ALU + ALU Replace Crossbar with Point-To-Point Network 6.189 IAP 2007 MIT Operand Transport Latency Crossbar Point-to-Point Network Non-local Placement ~N ~ N½ Locality-driven Placement ~N ~1 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 25 6.189 IAP 2007 MIT Distribute the Register File RF ALU RF ALU ALU RF RF 26 RF ALU ALU RF ALU Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 RF ALU ALU RF RF RF RF ALU ALU RF ALU ALU RF ALU ALU ALU RF RF RF ALU RF 6.189 IAP 2007 MIT More Scalability Problems RF Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 ALU RF ALU RF 27 RF ALU ALU RF ALU Unified Load/Store Queue ALU ALU RF RF ALU E L B A L A C S RF ALU ALU RF RF RF ALU Control RF ALU ALU RF Wide Fetch (16 inst) RF ALU ALU PC RF ALU RF 6.189 IAP 2007 MIT More Scalability Problems PC Wide Fetch (16 inst) Control Unified Load/Store Queue Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 28 6.189 IAP 2007 MIT Distribute Everything I$ D$ ALU RF I$ D$ Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 D$ 29 RF I$ ALU ALU RF PC RF PC D$ PC I$ D$ ALU ALU ALU RF PC I$ RF D$ PC I$ PC RF I$ D$ RF ALU RF PC I$ Unified Load/Store D$ Queue I$ D$ PC I$ D$ I$ PC RF RF PC RF I$ ALU D$ PC PC D$ D$ ALU RF RF ALU I$ I$ ALU Control PC I$ PC RF D$ ALU Wide Fetch (16 inst) PC ALU D$ ALU I$ ALU PC RF ALU PC D$ 6.189 IAP 2007 MIT Tiled Processor I$ I$ D$ Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 D$ 30 PC RF I$ ALU ALU PC RF I$ D$ RF ALU RF PC RF ALU ALU ALU PC I$ D$ PC I$ D$ RF PC I$ RF ALU ALU ALU RF RF PC I$ D$ RF D$ ALU RF PC D$ I$ D$ PC I$ PC RF PC D$ D$ PC I$ RF ALU RF D$ I$ D$ I$ D$ ALU I$ I$ PC RF ALU PC PC ALU D$ ALU I$ RF ALU PC D$ 6.189 IAP 2007 MIT Tiled Processor Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 31 6.189 IAP 2007 MIT Tiled Processor (Taylor PhD 2007) ● Fast inter-tile communication through point-to-point pipelined scalar operand network (SON) ● Easy to scale for the same reasons as multicores Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 32 6.189 IAP 2007 MIT Raw Compute Processor Internals r24 r24 r25 r25 r26 r27 IF r26 E M1 D r27 M2 A TL F P RF Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 U 33 6.189 IAP 2007 MIT Tile-Tile Communication add $25,$1,$2 Route $P->$E Route $W->$P sub $20,$1,$25 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 34 6.189 IAP 2007 MIT Why Communication Is Expensive on Multicores Multiprocessor Node 1 send occupancy Multiprocessor Node 2 Transport Cost receive receive overhead latency send send overhead latency Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 receive occupancy 35 6.189 IAP 2007 MIT Why Communication Is Expensive on Multicores Multiprocessor Node 1 send occupancy send latency Destination node name Sequence number Value Launch sequence Commit Latency Network injection Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 36 6.189 IAP 2007 MIT Why Communication Is Expensive on Multicores Multiprocessor Node 2 receive sequence receive occupancy injection cost receive latency Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 37 6.189 IAP 2007 MIT A Figure of Merit for SONs ● 5-tuple <SO, SL, NHL, RL, RO> • • • • • Send occupancy Send latency Network hop latency Receive latency Receive occupancy ● Tip: Ordering follows timing of message from sender to receiver Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 38 6.189 IAP 2007 MIT The Interesting Region Scalable Multiprocessor ● Where is Cell in this space? <2, 14, 3, 14,4> (on-chip) Raw SON < 0, 0, 1, 2, 0> Superscalar < 0, 0, 0, 0, 0> (scalable) (not scalable) Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 39 6.189 IAP 2007 MIT The Raw Experience ● Insights into the design Raw architecture ● Raw parallelizing compiler Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 40 6.189 IAP 2007 MIT Raw Parallelizing Compiler (Lee PhD 2005) Sequential Program Compiler ● Data distribution ● Instruction distribution ● Coordination • • Communication Control flow Fine-grained Orchestrated Parallel Program Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 41 6.189 IAP 2007 MIT Data Distribution ? load r1 <- addr add r1 <- r1, 1 ? M M M M ? ? Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 42 6.189 IAP 2007 MIT Instruction Distribution seed.0=seed pval1=seed.0*3.0 v1.2=v1 v2.4=v2 pval5=seed.0*6.0 pval0=pval1+2.0 pval2=seed.0*v1.2 pval3=seed.o*v2.4 pval4=pval5+2.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9 v3.10=tmp3.6-v2.7 v2=v2.7 v3=v3.10 basic block Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 43 6.189 IAP 2007 MIT Clustering: Parallelism vs. Communication seed.0=seed seed.0=seed pval1=seed.0*3.0 pval1=seed.0*3.0 v1.2=v1 v1.2=v1 v2.4=v2 v2.4=v2 pval5=seed.0*6.0 pval5=seed.0*6.0 pval0=pval1+2.0 pval2=seed.0*v1.2 pval0=pval1+2.0 pval2=seed.0*v1.2 pval3=seed.o*v2.4 pval3=seed.o*v2.4 pval4=pval5+2.0 pval4=pval5+2.0 tmp1.3=pval2+2.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v1.8=pval7*3.0 v2.7=pval6*5.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp2.5=pval3+2.0 v1=v1.8 v3.10=tmp3.6-v2.7 v0=v0.9 v2=v2.7 v2=v2.7 v3=v3.10 v3=v3.10 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 v3.10=tmp3.6-v2.7 44 6.189 IAP 2007 MIT Adjusting Granularity: Load Balancing seed.0=seed pval1=seed.0*3.0 seed.0=seed pval1=seed.0*3.0 v1.2=v1 v1.2=v1 v2.4=v2 v2.4=v2 pval5=seed.0*6.0 pval0=pval1+2.0 pval5=seed.0*6.0 pval2=seed.0*v1.2 pval0=pval1+2.0 pval3=seed.o*v2.4 pval2=seed.0*v1.2 pval3=seed.o*v2.4 pval4=pval5+2.0 pval4=pval5+2.0 tmp1.3=pval2+2.0 tmp1.3=pval2+2.0 tmp0.1=pval0/2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp2.5=pval3+2.0 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v1.8=pval7*3.0 v2.7=pval6*5.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9 tmp3.6=pval4/3.0 tmp1=tmp1.3 v1=v1.8 v3.10=tmp3.6-v2.7 v0=v0.9 v2=v2.7 v3=v3.10 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 v3.10=tmp3.6-v2.7 v2=v2.7 v3=v3.10 45 6.189 IAP 2007 MIT Placement seed.0=seed seed.0=seed pval1=seed.0*3.0 pval0=pval1+2.0 pval1=seed.0*3.0 v1.2=v1 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp3=tmp3.6 v2.4=v2 tmp0.1=pval0/2.0 pval5=seed.0*6.0 pval0=pval1+2.0 pval5=seed.0*6.0 v3.10=tmp3.6-v2.7 pval2=seed.0*v1.2 pval3=seed.o*v2.4 tmp0=tmp0.1 pval4=pval5+2.0 tmp1.3=pval2+2.0 v3=v3.10 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 tmp3.6=pval4/3.0 tmp1=tmp1.3 tmp0=tmp0.1 tmp2=tmp2.5 pval7=tmp1.3+tmp2.5 v1.2=v1 v2.4=v2 pval2=seed.0*v1.2 pval3=seed.o*v2.4 tmp1.3=pval2+2.0 tmp2.5=pval3+2.0 tmp3=tmp3.6 pval6=tmp1.3-tmp2.5 v1.8=pval7*3.0 v2.7=pval6*5.0 v0.9=tmp0.1-v1.8 v1=v1.8 v0=v0.9 tmp1=tmp1.3 v3.10=tmp3.6-v2.7 pval7=tmp1.3+tmp2.5 tmp2=tmp2.5 pval6=tmp1.3-tmp2.5 v2=v2.7 v3=v3.10 v1.8=pval7*3.0 v0.9=tmp0.1-v1.8 v2.7=pval6*5.0 v2=v2.7 v1=v1.8 v0=v0.9 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 46 6.189 IAP 2007 MIT Communication Coordination Proc Proc Proc receive Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 47 Switch Switch routes Proc send 6.189 IAP 2007 MIT Instruction Scheduling v1.2=v1 v2.4=v2 seed.0=seed send(seed.0) pval1=seed.0*3.0 seed.0=recv(0) pval3=seed.o*v2.4 route(t,E) route(W,t) route(t,E) route(N,t) seed.0=recv() pval2=seed.0*v1.2 seed.0=recv() pval5=seed.0*6.0 pval0=pval1+2.0 tmp0.1=pval0/2.0 tmp2.5=pval3+2.0 pval4=pval5+2.0 tmp3.6=pval4/3.0 tmp1.3=pval2+2.0 tmp2=tmp2.5 send(tmp2.5) route(t,E) route(E,t) send(tmp1.3) tmp1=tmp1.3 tmp1.3=recv() pval6=tmp1.3-tmp2.5 tmp2.5=recv() v2.7=pval6*5.0 tmp0.1=recv() pval7=tmp1.3+tmp2.5 route(t,W) route(W,t) send(tmp0.1) tmp0=tmp0.1 tmp3=tmp3.6 route(t,E) route(W,S) route(W,t) v1.8=pval7*3.0 Send(v2.7) v2=v2.7 route(t,E) route(W,N) v1=v1.8 v0.9=tmp0.1-v1.8 route(S,t) v2.7=recv() v3.10=tmp3.6-v2.7 v0=v0.9 v3=v3.10 ● Inter-tile cycle scheduling schedules communication and can guarantee deadlock freedom Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 48 6.189 IAP 2007 MIT Final Code Representation v2.4=v2 route(t,E) v1.2=v1 route(N,t) seed.0=recv() route(W,t) seed.0=seed route(t,E) seed.0=recv(0) route(t,E) seed.0=recv() route(t,W) pval5=seed.0*6.0 route(W,S) send(seed.0) route(t,E) pval3=seed.o*v2.4 route(E,t) pval2=seed.0*v1.2 route(W,t) pval4=pval5+2.0 route(S,t) pval1=seed.0*3.0 tmp2.5=pval3+2.0 route(t,E) tmp1.3=pval2+2.0 route(W,t) tmp3.6=pval4/3.0 pval0=pval1+2.0 tmp2=tmp2.5 send(tmp1.3) route(W,N) tmp3=tmp3.6 tmp0.1=pval0/2.0 send(tmp2.5) tmp1=tmp1.3 v2.7=recv() send(tmp0.1) tmp1.3=recv() tmp2.5=recv() v3.10=tmp3.6-v2.7 tmp0=tmp0.1 pval6=tmp1.3-tmp2.5 tmp0.1=recv() v3=v3.10 v2.7=pval6*5.0 pval7=tmp1.3+tmp2.5 Send(v2.7) v1.8=pval7*3.0 v2=v2.7 v1=v1.8 v0.9=tmp0.1-v1.8 v0=v0.9 Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 49 6.189 IAP 2007 MIT Control Coordination Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 50 6.189 IAP 2007 MIT Asynchronous Global Branching br x br x br x br x br x Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 51 6.189 IAP 2007 MIT Summary ● Tiled microprocessors incorporate the best elements of superscalars and multiprocessors Superscalar Multicore Tiled Processor with SON PE-PE communication Free Expensive Cheap exploitation of parallelism Implicit Explicit Both Clean semantics Yes No Yes scalable No Yes Yes power efficient No Yes Yes Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 52 6.189 IAP 2007 MIT Raw Project Contributors Anant Agarwal Saman Amarasinghe Jonathan Babb Rajeev Barua Ian Bratt Jonathan Eastep Matt Frank Ben Greenwald Henry Hoffmann Paul Johnson Jason Kim Theo Konstantakopoulos Dr. Rodric Rabbah © Copyrights by IBM Corp. and by other(s) 2007 Walter Lee Jason Miller James Psota Arvind Saraf Vivek Sarkar Nathan Shnidman Volker Strumpen Michael Taylor Elliot Waingold David Wentzlaff 53 6.189 IAP 2007 MIT