Scalar Operand Networks for Tiled Microprocessors
Michael Taylor, Raw Architecture Project, MIT CSAIL (now at UCSD)

Until about three years ago, computer architects used the N-way superscalar (or VLIW) to encapsulate the ideal for a parallel processor: nearly "perfect", but not attainable.

                                Superscalar (or VLIW)
  "PE"->"PE" communication      Free
  Exploitation of parallelism   Implicit (hw scheduler or compiler)
  Clean semantics               Yes
  Scalable                      No
  Power efficient               No

Example of free, implicit operand communication:

    mul $2,$3,$4
    add $6,$5,$2

What's great about superscalar microprocessors? It's the networks!
- Fast, low-latency, tightly-coupled networks (0-1 cycles of latency, no occupancy).
- For lack of a better name, let's call them Scalar Operand Networks (SONs).
- Can we combine the benefits of superscalar communication with multicore scalability?
- Can we build scalable Scalar Operand Networks?
(I agree with Jose: "We need low-latency tightly-coupled ... network interfaces" - Jose Duato, OCIN, Dec 6, 2006)

The industry shift toward multicore is attainable, but hardly ideal. What we'd like is neither superscalar nor multicore, but the best of both:

                                Superscalar   Multicore
  "PE"->"PE" communication      Free          Expensive
  Exploitation of parallelism   Implicit      Explicit
  Clean semantics               Yes           No
  Scalable                      No            Yes
  Power efficient               No            Yes

Superscalars have fast networks and great usability; multicore has great scalability and efficiency.

Why communication is expensive on multicore:
- Sender (multiprocessor node 1): send occupancy, send overhead, send latency.
- Transport cost through the network.
- Receiver (multiprocessor node 2): receive latency, receive overhead, receive occupancy.

Multiprocessor SON operand routing:
- Send occupancy and send latency cover naming the destination node, attaching a sequence number and the value, the launch sequence, commit latency, and network injection.
- Receive occupancy and receive latency cover the receive sequence, demultiplexing, and the attendant branch mispredictions.
- Shared-memory multiprocessors pay similar overheads: store instructions, commit latency, and spin locks (plus the attendant branch mispredicts).

Defining a figure of merit for scalar operand networks: the 5-tuple <SO, SL, NHL, RL, RO>
- SO: Send Occupancy
- SL: Send Latency
- NHL: Network Hop Latency
- RL: Receive Latency
- RO: Receive Occupancy
(Tip: the ordering follows the timing of a message from sender to receiver.)
We can use this metric to quantitatively differentiate SONs from existing multiprocessor networks.

Impact of occupancy (o = so + ro): if o * "surface area" > "volume", it is not worth offloading the work to another processor; the overhead is too high because the parallelism is too fine-grained.

Impact of latency: the lower the latency, the less work is needed to keep the sender busy while it waits for the answer. If there is not enough parallelism to hide the latency, offloading is not worth it; the sender could have done the work itself faster.
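As a rough illustration of how the five components combine (my own sketch, not a definition from the talk), end-to-end operand delivery cost can be modeled as the sum of the tuple entries, with the hop latency scaled by the route length, and the occupancy test above becomes a one-line predicate. All names below are hypothetical.

    /* Hypothetical sketch of the <SO, SL, NHL, RL, RO> cost model.
       The simple additive model and every name here are assumptions
       made for illustration. */
    typedef struct {
        int so;   /* send occupancy: cycles the sender stays busy       */
        int sl;   /* send latency: cycles until network injection       */
        int nhl;  /* network hop latency: cycles per hop                */
        int rl;   /* receive latency: cycles until the value is usable  */
        int ro;   /* receive occupancy: cycles the receiver stays busy  */
    } son_tuple;

    /* End-to-end cycles from producing to consuming instruction,
       assuming the components simply add over a route of `hops` hops. */
    static int operand_cost(son_tuple t, int hops)
    {
        return t.so + t.sl + hops * t.nhl + t.rl + t.ro;
    }

    /* Occupancy rule of thumb: offloading only pays if the occupancy
       charged per operand crossing the boundary ("surface area" times
       o = so + ro) is small next to the offloaded work ("volume"). */
    static int worth_offloading(son_tuple t, int surface_area, int volume)
    {
        int o = t.so + t.ro;
        return o * surface_area < volume;
    }

Under this model, the Raw tuple <0, 0, 1, 2, 0> quoted later in the talk delivers an operand across a 3-hop route in 5 cycles, while the Power4 on-chip tuple <2, 14, 0, 14, 4> costs 34 cycles of overhead regardless of distance.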
The interesting region, plotted in 5-tuple space:
- Superscalar (not scalable): <0, 0, 0, 0, 0>
- Power4 (on-chip): <2, 14, 0, 14, 4>

Tiled microprocessors (or "tiled multicore"), built around a scalable SON:

                                  Superscalar   Multicore   Tiled Multicore
  PE-PE (ALU-ALU) communication   Free          Expensive   Cheap
  Exploitation of parallelism     Implicit      Explicit    Both
  Scalable                        No            Yes         Yes
  Power efficient                 No            Yes         Yes

Transforming from multicore or superscalar to tiled:
- Superscalar + scalability -> Tiled
- CMP/multicore + a scalable SON -> Tiled

The interesting region, revisited:
- Superscalar (not scalable): <0, 0, 0, 0, 0>
- Raw: <0, 0, 1, 2, 0>
- Tiled "Famous Brand 2": <0, 0, 1, 0, 0>
- Power4 (on-chip): <2, 14, 0, 14, 4>

Scalability problems in wide-issue microprocessors:
[Figure: a 16-issue superscalar datapath with wide fetch (16 inst), control, PC, a unified load/store queue, a monolithic register file, and a bypass network connecting all of the ALUs.]

Area and frequency scalability problems:
- For N ALUs, the bypass network grows as ~N^2 and the multiported register file as ~N^3 (example: Itanium 2).
- Without modification, frequency decreases linearly in N or worse.

Operand routing is global: a result produced by one ALU is broadcast across the register file and bypass network to reach its consumer.

Idea: make operand routing local, and exploit locality. Replace the crossbar bypass with a point-to-point, pipelined, routed scalar operand network among the ALUs.

Operand transport scaling - bandwidth and area, for N ALUs and N^(1/2) bisection bandwidth (as in a conventional superscalar):

                                  Local BW    Area
  Un-pipelined crossbar bypass    ~N^(1/2)    ~N^2
  Point-to-point routed mesh      ~N          ~N    (scales as 2-D VLSI)

We can route more operands per unit time if we are able to map communicating instructions nearby.

Operand transport scaling - latency, i.e. the time for an operand to travel between instructions mapped to different ALUs:

                                Un-pipelined crossbar   Routed mesh network
  Non-local placement           ~N                      ~N^(1/2)
  Locality-driven placement     ~N                      ~1

There is a latency bonus if we map communicating instructions nearby so that communication is local.
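To make the placement effect concrete, here is a small sketch (my own illustration, not from the talk) that models routed-mesh transport latency as Manhattan hop distance times the per-hop latency NHL, against an un-pipelined crossbar whose delay grows with the number of ALUs it spans. The coordinate scheme and function names are invented for this example.

    #include <stdlib.h>

    /* Illustrative model only: operand latency between ALUs at grid
       coordinates (x0,y0) and (x1,y1) on a routed mesh is the Manhattan
       hop count times the per-hop latency. */
    static int mesh_latency(int x0, int y0, int x1, int y1, int nhl)
    {
        return (abs(x1 - x0) + abs(y1 - y0)) * nhl;
    }

    /* An un-pipelined crossbar bypass spans all N ALUs, so its delay
       grows roughly with N no matter where instructions are placed. */
    static int crossbar_latency(int n_alus, int delay_per_alu)
    {
        return n_alus * delay_per_alu;
    }

On an 8x8 mesh of 64 ALUs, a locality-driven placement that puts producer and consumer on neighboring nodes pays a single hop, a worst-case placement pays 14 hops (on the order of N^(1/2)), and the crossbar pays for the wire spanning all 64 ALUs either way.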
Distribute the register file:
[Figure: the monolithic register file and bypass network split into small per-ALU register files connected by the routed network.]

More scalability problems: the wide fetch (16 inst), control, PC, and unified load/store queue are still centralized.
[Figure: those structures distributed as well, giving each ALU its own PC, instruction cache, and data cache.]

Distribute the rest: Raw, a fully-tiled microprocessor. Tiles!
[Figure: a grid of identical tiles, each containing a PC, instruction cache, data cache, register file, and ALU.]

Tiled microprocessors:
- fast inter-tile communication through the SON
- easy to scale (for the same reasons as multicore)

Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network

Raw microprocessor:
- Tiled, scalable microprocessor with point-to-point pipelined networks; 16 tiles, 16-issue.
- Each 4 mm x 4 mm tile contains a MIPS-style compute processor: single-issue, 8-stage pipe, 32b FPU, 32K data cache, instruction cache.
- 4 on-chip networks: two for operands, one for cache misses, one for message passing.

Raw microprocessor components:
- Compute processor: fetch unit, instruction cache, functional units (execution core), data cache, and the intra-tile SON (a crossbar).
- Switch processor: its own instruction cache plus a static router; together with the inter-tile network links it forms the inter-tile SON.
- Generalized transport networks: a dynamic router for the trusted core ("MDN") and a dynamic router for the untrusted core ("GDN").

Raw compute processor internals:
[Figure: the 8-stage compute pipeline with network-mapped registers r24-r27 at its boundary; example: fadd r24, r25, r26 can take its operands from and send its result to the networks.]

Tile-tile communication: sending one operand over the static network takes a single instruction on each side.
- Sender compute processor:   add $25,$1,$2   (writing register $25 injects the result into the network)
- Sender switch processor:    Route $P->$E
- Receiver switch processor:  Route $W->$P
- Receiver compute processor: sub $20,$1,$25  (reading register $25 pulls the operand off the network)
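Semantically, the sequence above behaves like a blocking, flow-controlled FIFO between the two tiles: writing the network-mapped register enqueues the operand, the two switch processors forward it along their pre-programmed routes, and reading the register on the consuming tile dequeues it, stalling until the value arrives. The toy C model below is only a sketch of that behavior; the channel type, depth, and function names are invented for illustration and are not Raw's hardware interface.

    #include <assert.h>

    /* Toy stand-in for one static-network channel between neighboring
       tiles. Real hardware register-maps the ports and stalls on
       full/empty; asserts stand in for those stalls here. */
    typedef struct {
        int buf[4];
        int head, tail, count;
    } channel;

    static void net_send(channel *c, int v)   /* writing $25 on the sender   */
    {
        assert(c->count < 4);                 /* hardware: stall when full    */
        c->buf[c->tail] = v;
        c->tail = (c->tail + 1) % 4;
        c->count++;
    }

    static int net_recv(channel *c)           /* reading $25 on the receiver  */
    {
        assert(c->count > 0);                 /* hardware: stall until arrival */
        int v = c->buf[c->head];
        c->head = (c->head + 1) % 4;
        c->count--;
        return v;
    }

    /* Sender tile:   add $25,$1,$2   (its switch runs: Route $P->$E) */
    static void sender_tile(channel *east, int r1, int r2)
    {
        net_send(east, r1 + r2);
    }

    /* Receiver tile: sub $20,$1,$25  (its switch runs: Route $W->$P) */
    static int receiver_tile(channel *west, int r1)
    {
        return r1 - net_recv(west);
    }

The property the sketch tries to capture is that send and receive are ordinary single instructions with no extra occupancy, which is what makes a tuple like <0, 0, 1, 2, 0> achievable.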
Compilation: RawCC assigns instructions to the tiles, maximizing locality. It also generates the static router instructions that transfer operands between tiles. For example, for the source fragment

    tmp3 = (seed*6+2)/3
    v2 = (tmp1 - tmp3)*5
    v1 = (tmp1 + tmp2)*3
    v0 = tmp0 - v1

[Figure: the corresponding three-address dataflow graph (pval0..pval7, tmp0..tmp3, v0..v3) partitioned across several tiles, with the operand routes RawCC generates between them.]

One cycle in the life of a tiled micro:
[Figure: a single 16-tile chip simultaneously running a 4-way automatically parallelized C program, a 2-thread MPI app, httpd, and a direct I/O stream fed through memory into the scalar operand network, with the remaining tiles idle ("Zzz...").]

An application uses only as many tiles as needed to exploit the parallelism intrinsic to that application.
[Figure: tiles 0-15 partitioned among the applications above.]

One streaming application on Raw: very different traffic patterns than RawCC-style parallelization.

Auto-parallelization approach #2: the StreamIt language + compiler.
[Figure: a stream graph of Splitter/Joiner, FIRFilter, Vec Mult, Magnitude, and Detector stages, shown as written, after fusion, and as the end result auto-parallelized by MIT StreamIt onto 8 tiles.]

AsTrO taxonomy: classifying SON diversity
- Assignment (Static/Dynamic): is the assignment of instructions to ALUs predetermined?
- Transport (Static/Dynamic): are operand routes predetermined?
- Ordering (Static/Dynamic): is the execution order of the instructions assigned to a node predetermined?

Microprocessor SON diversity using the AsTrO taxonomy (Assignment / Transport / Ordering):
- Raw: Static / Static / Static
- Scale: Static / Static / Dynamic
- RawDyn: Static / Dynamic / Static
- TRIPS: Static / Dynamic / Dynamic
- ILDP, WaveScalar: Dynamic assignment

Outline
1. Scalar Operand Network and Tiled Microprocessor intro
2. Raw Architecture + SON
3. VLSI implementation of Raw, a scalable microprocessor with a scalar operand network

Raw chips (October 2002):
- Raw: 16 tiles (16-issue), 180 nm ASIC (IBM SA-27E), ~100 million transistors, 1 million gates.
- 3-4 years of development, 1.5 years of testing, 200K lines of test code.
- Core frequency: 425 MHz @ 1.8 V, 500 MHz @ 2.2 V; 18 W average power.
- Frequency competitive with IBM-implemented PowerPCs in the same process.

Raw motherboard: support chipset implemented in an FPGA.

A scalable microprocessor in action [Taylor et al, ISCA '04].

Conclusions
- Scalability problems in general-purpose processors can be addressed by tiling resources across a scalable, low-latency, low-occupancy scalar operand network (SON).
- These SONs can be characterized by the 5-tuple and the AsTrO classification.
- The 180 nm, 16-issue Raw prototype shows that the approach is feasible; 64+-issue is possible in today's VLSI processes.
- Multicore machines could benefit by adding an inter-node SON for cheap communication.