Digital Space
Anant Agarwal, MIT and Tilera Corporation

Arecibo

Stages of Reality
[Roadmap figure: from a single CPU with off-chip memory in 1996–1997, to a roughly 1-billion-transistor chip in 2007, to roughly 100 billion transistors by 2018, drawn as growing arrays of tiles, each tile a processor (p), memory (m), and switch (s), with memory controllers at the edges; milestones marked at 1996, 1997, 2002, 2007, 2014, and 2018]
Virtual reality → Simulator reality → Prototype reality → Product reality

Virtual Reality

The Opportunity
1996… A 20-MIPS CPU in 1987 took only a few thousand gates.

The Opportunity
The billion-transistor chip of 2007.

How to Fritter Away the Opportunity
Caches, control, a 100-ported register file and renaming, more buffers and yet more control… the x1786? This does not scale.
[Diagram of a monolithic wide-issue datapath; "1/10 ns" annotation]

Take Inspiration from ASICs
[Diagram: ALUs and local memories connected by custom-routed wires]
• Lots of ALUs, lots of registers, lots of local memories
  – huge on-chip parallelism, but with a slower clock
• Custom-routed, short wires optimized for specific applications
Fast, low power, and area efficient, but not programmable.

Our Early Raw Proposal
Got parallelism? E.g., a 100-way unrolled loop running on 100 ALUs, 1000 registers, and 100 memory banks.
But how do we build programmable, yet custom, wires?

A Digital Wire
• Pipeline it! Fast clock (10 GHz in 2010; "Uh! What were we smoking!")
• Multiplex it! Improve utilization. A dynamic router!
• Software-orchestrate it! Customize to the application and maximize utilization.
Replace custom wires with routed on-chip networks.

Static Router
[Diagram: the compiler translates the application into switch code loaded into each switch]
A static router!
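To make the static-router idea concrete, here is a minimal software model of a compile-time-scheduled switch. The five-port naming (N, E, S, W, and the local processor port P), the route-instruction encoding, and the step function are assumptions for illustration, not Raw's actual switch ISA; only the example routes ("route P->E, N->S" and "route W->P, S->N") come from the slides.

    /* Illustrative model of a compile-time-scheduled ("static") switch.
     * Assumption: 5 ports (N, E, S, W, and the local Processor port P) and
     * one route instruction per cycle, chosen by the compiler. */
    #include <stdio.h>

    enum { N, E, S, W, P, NPORTS };

    /* One switch instruction: for each output port, which input port feeds it
     * this cycle (-1 = no route). The compiler fills in this schedule. */
    typedef struct { int src[NPORTS]; } route_instr;

    /* One cycle of the switch: copy operands from input latches to output
     * latches exactly as the current route instruction dictates. */
    static void switch_step(const route_instr *ri, const int in[NPORTS],
                            int out[NPORTS]) {
        for (int o = 0; o < NPORTS; o++)
            if (ri->src[o] >= 0)
                out[o] = in[ri->src[o]];
    }

    int main(void) {
        /* A two-instruction schedule, e.g. "route P->E, N->S" then "route W->P, S->N". */
        route_instr prog[2] = {
            { { -1, P, N, -1, -1 } },   /* out[E] <- in[P], out[S] <- in[N] */
            { { S, -1, -1, -1, W } },   /* out[N] <- in[S], out[P] <- in[W] */
        };
        int in[NPORTS] = { 7, 0, 9, 42, 5 }, out[NPORTS] = { 0 };
        for (int pc = 0; pc < 2; pc++)
            switch_step(&prog[pc], in, out);
        printf("out[E]=%d out[P]=%d\n", out[E], out[P]);  /* prints 5 and 42 */
        return 0;
    }

Because the schedule is fixed at compile time, the switch needs no routing decisions at run time; it simply steps through its program in lockstep with the compute pipeline.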
Replace Wires with Routed Networks
[Diagram: multiplexed wires with routing control]

50-Ported Register File → Distributed Registers
A gigantic 50-ported register file, distributed instead across many small register files.

Distributed Registers + Routed Network
A distributed register file connected by a routed network, called NURA [ASPLOS 1998].

16-Way ALU Clump → Distributed ALUs
[Diagram: 16 ALUs sharing a bypass network and register file, versus ALUs distributed across tiles]

Distributed ALUs, Routed Bypass Network
The routed bypass network is the Scalar Operand Network (SON) [TPDS 2005].

Mongo Cache → Distributed Cache
A gigantic 10-ported cache, distributed instead across tiles.

Distributing the Cache

Distributed Shared Cache
Like DSM (distributed shared memory), the cache is distributed; but unlike NUCA, the caches are local to their processors, not far away. [ISCA 1999]

Tiled Multicore Architecture
[Diagram: an array of tiles, each with ALUs, a register file, a router (R), and a cache ($)]

E.g., Operand Routing in a 16-Way Superscalar
[Diagram: 16 ALUs behind a centralized bypass network and register file; an operand flows from a shift (>>) to an add (+)]
Source: [Taylor ISCA 2004]

Operand Routing in a Tiled Architecture
[Diagram: the same >> and + operations mapped to neighboring tiles; the operand travels over the routed network]

Tiled Multicore
• Scales to large numbers of cores
• Modular: design, layout, and verify one tile
• Power efficient [MIT-CSAIL-TR-2008-066]
  – Short wires: lower capacitance C in the dynamic power CV²f
  – Chandrakasan effect: spread the work over more cores running at lower voltage and frequency, again cutting CV²f
  – Dynamic and compiler-scheduled routing
Core + Switch = Tile

A Prototype Tiled Architecture: The Raw Microprocessor
[Billion-transistor IEEE Computer issue, 1997]  www.cag.csail.mit.edu/raw
A Raw tile: IMEM, DMEM, registers, PC, ALU, and FPU, plus a Raw switch with its own SMEM and PC.
The Raw chip: an array of tiles, with packet streams, disk streams, video, and DRAM at the edges.
Scalar operand network (SON): capable of low-latency transport of small (or large) packets. [IEEE TPDS 2005]

Virtual reality → Simulator reality → Prototype reality → Product reality

Scalar Operand Transport in Raw
Goal: flow-controlled, in-order delivery of operands.
Example: one tile executes fmul r24, r3, r4 while its software-controlled crossbar executes "route P->E, N->S"; the neighboring switch executes "route W->P, S->N" and that tile executes fadd r5, r3, r24. (A minimal software model of this example appears below, after the die photo.)

RawCC: Distributed ILP Compilation (DILP)
C source:
    tmp0 = (seed*3+2)/2
    tmp1 = seed*v1 + 2
    tmp2 = seed*v2 + 2
    tmp3 = (seed*6+2)/3
    v2 = (tmp1 - tmp3)*5
    v1 = (tmp1 + tmp2)*3
    v0 = tmp0 - v1
    v3 = tmp3 - v2
[Figure: the renamed dataflow graph (seed.0, pval1, tmp1.3, …) after partitioning, then placement, routing, and scheduling across four tiles; black arrows = operand communication over the SON]
[ASPLOS 1998]

Virtual reality → Simulator reality → Prototype reality → Product reality

A Tiled Processor Architecture Prototype: the Raw Microprocessor
Michael Taylor, Walter Lee, Jason Miller, David Wentzlaff, Ian Bratt, Ben Greenwald, Henry Hoffmann, Paul Johnson, Jason Kim, James Psota, Arvind Saraf, Nathan Shnidman, Volker Strumpen, Matt Frank, Rajeev Barua, Elliot Waingold, Jonathan Babb, Sri Devabhaktuni, Saman Amarasinghe, Anant Agarwal

Raw Die Photo, October 2002
IBM 0.18-micron process, 16 tiles, 425 MHz, 18 W (vpenta benchmark). [ISCA 2004]
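The following sketch models the fmul/fadd operand-transport example from the Scalar Operand Transport slide above. The one-slot link standing in for the east/west network channel, and the tile0/tile1 functions, are illustrative assumptions; they are not Raw's actual register-mapped network interface, only a picture of flow-controlled, in-order operand delivery between two tiles.

    /* Minimal model: tile 0 produces an operand and "sends" it east;
     * tile 1 blocks until it arrives, then consumes it. */
    #include <stdio.h>

    typedef struct { double value; int full; } link;   /* one-word network link */

    static void send_operand(link *l, double v) { l->value = v; l->full = 1; }
    static double recv_operand(link *l) {
        while (!l->full) { /* in hardware: stall until the operand arrives */ }
        l->full = 0;
        return l->value;
    }

    /* Tile 0:  fmul r24, r3, r4   ; switch: route P->E */
    static void tile0(link *east, double r3, double r4) {
        double r24 = r3 * r4;
        send_operand(east, r24);
    }

    /* Tile 1:  switch: route W->P ; fadd r5, r3, r24 */
    static double tile1(link *west, double r3) {
        double r24 = recv_operand(west);
        return r3 + r24;
    }

    int main(void) {
        link l = { 0 };
        tile0(&l, 2.0, 3.0);
        printf("r5 = %f\n", tile1(&l, 1.0));   /* 1.0 + 2.0*3.0 = 7.0 */
        return 0;
    }

The point of the exercise: the operand never lives in a shared register file; it is produced on one tile, carried over the network, and consumed on another, which is exactly the communication RawCC must schedule when it distributes ILP.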
Raw Motherboard
[Photo: the Raw prototype motherboard]

Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system
• Simple processor with single-issue cores
• FPGA logic block in each tile
• Distributed ILP and static network
• Static network for streaming
• Multiple types of computation: ILP, streams, TLP, server
• PC in every tile

Why Build?
• Compiler (Amarasinghe), OS and runtime (ISI), and application (ISI, Lincoln Labs, Durand) folks will not work with you unless you are serious about building hardware
• Need the motivation to build software tools: compilers, runtimes, debugging, visualization; many challenges here
• Run large data sets (simulation takes forever, even with 100 servers!)
• Many hard problems show up, or are only properly understood, after you begin building (how to maintain ordering for distributed ILP, slack for streaming codes); you have to solve the hard problems – no magic!
• The more radical the idea, the more important it is to build
  – The world will only trust end-to-end results, since it is too hard to dive into the details and understand all the assumptions
  – Would you believe this: "Prof. John Bull has demonstrated a simulation prototype of a 64-way issue out-of-order superscalar"?
• The cycle simulator became a cycle-accurate simulator only after the hardware was precisely defined
• Don't bother to commercialize unless you have a working prototype
• Total network power is a few percent for real apps [ISLPED Aug 2003, Kim et al., Energy characterization of a tiled architecture processor with on-chip networks] [MIT-CSAIL-TR-2008-066, Energy scalability of on-chip interconnection networks in multicore architectures]
  – Network power is a few percent in Raw for real apps; it reaches 36% only for a highly contrived synthetic sequence meant to toggle every network wire

Raw Ideas and Decisions: What Worked, What Did Not
• Build a complete prototype system – Yes
• Simple processor, single issue – 1 GHz, 2-way, in-order in 2016
• FPGA logic block in each tile – No
• Distributed ILP – Yes '02, No '06, Yes '14
• Static network for streaming
• Multiple types of computation (ILP, streams, TLP, server) – Yes
• PC in every tile – Yes

Raw Ideas and Decisions: Streaming – Interconnect Support
Forced synchronization in the static network.
[Figure: "route P->E, N->S" and "route W->P, S->N" on the software-controlled crossbars]

Streaming in Tilera's Tile Processor
• Streaming is done over the dynamic interconnect with stream demuxing (AsTrO SDS)
• Automatic demultiplexing of streams into registers
• The number of streams is virtualized
[Figure: a TAG on each packet steers it through the dynamic switches; e.g., add r55, r3, r4 on the sending tile, sub r5, r3, r55 on the receiving tile]
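A minimal sketch of the tag-based stream-demultiplexing idea above: packets from different streams arrive interleaved on the dynamic network, each carrying a tag, and the receive side steers every packet into its own stream's buffer so a consumer reads one stream without seeing the others. The tag values, queue depth, and function names here are assumptions for illustration; this is not the Tile Processor's hardware demux or the iLib API.

    /* Illustrative model of tag-based stream demultiplexing on a dynamic network. */
    #include <stdio.h>

    #define NSTREAMS 4
    #define QDEPTH   8

    typedef struct { int tag; int value; } packet;
    typedef struct { int buf[QDEPTH]; int head, tail; } stream_q;

    static stream_q streams[NSTREAMS];

    /* Receive side: steer each arriving packet into the queue its tag names. */
    static void demux(packet p) {
        stream_q *q = &streams[p.tag];
        q->buf[q->tail++ % QDEPTH] = p.value;
    }

    /* Consumer: read the next word of one stream, ignoring all the others. */
    static int read_stream(int tag) {
        stream_q *q = &streams[tag];
        return q->buf[q->head++ % QDEPTH];
    }

    int main(void) {
        packet arrivals[] = { {0, 10}, {2, 7}, {0, 11}, {1, 5} };  /* interleaved streams */
        for (int i = 0; i < 4; i++)
            demux(arrivals[i]);
        int a = read_stream(0), b = read_stream(0);
        printf("stream 0: %d %d\n", a, b);          /* 10 11, in order */
        printf("stream 2: %d\n", read_stream(2));   /* 7 */
        return 0;
    }

Because the demultiplexing is by tag rather than by a physical channel, the number of streams can be virtualized: many logical streams share one dynamic interconnect.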
Virtual reality → Simulator reality → Prototype reality → Product reality

Why Do We Care? Markets Demanding More Performance
• Wireless networks (base stations, GGSN): demand for high throughput – more channels; fast-moving standards (LTE) and services
• Networking market (switches, routers, security appliances): demand for high performance – 10 Gbps; demand for more services and intelligence
• Digital multimedia market (video conferencing, cable & broadcast): demand for high performance – H.264 HD; demand for more services – VoD, transcoding
… and all of it with power efficiency and programming ease.

Tilera's TILEPro64™ Processor (90 nm)
Multicore performance:
  Number of tiles                              64
  Cache-coherent distributed cache             5 MB
  Operations @ 750 MHz (32-, 16-, 8-bit)       144 / 192 / 384 BOPS
  Bisection bandwidth                          2 Terabits per second
Power efficiency:
  Power per tile (depending on app)            170–300 mW
  Core power for H.264 encode (64 tiles)       12 W
  Clock speed                                  up to 866 MHz
I/O and memory bandwidth:
  I/O bandwidth                                40 Gbps
  Main memory bandwidth                        200 Gbps
Programming:
  ANSI standard C, SMP Linux programming, stream programming
[Tile64, Hot Chips 2007] [Tile64, Microprocessor Report, Nov 2007]

Tile Processor Block Diagram: A Complete System on a Chip
[Block diagram: four DDR2 memory controllers; XAUI 0/1, PCIe 0/1, and GbE 0/1 MAC/PHYs over SerDes; flexible I/O; UART, HPI, JTAG, I2C, SPI. Each tile contains a processor (register file, three pipelines P0–P2), a cache subsystem (L1I, L1D, L2, ITLB, DTLB, 2D DMA), and a switch connecting the MDN, TDN, UDN, IDN, and STN networks.]

Tile Processor NoC
• 5 independent, non-blocking networks (the figure legend lists UDN, STN, IDN, MDN, TDN, VDN)
  – 64 switches per network
  – 1 Terabit/sec per tile
• Each network switch is directly and independently connected to its tile
• One hop per clock on all networks
• [Figure callouts: an I/O write example, a memory write example, and a tile-to-tile access example]
• All of these accesses can be performed simultaneously on the non-blocking networks
[IEEE Micro, Sep 2007]

Multicore Hardwall Implementation, or Protection and Interconnects
[Figure: tiles running OS1/APP1, OS2/APP2, and OS1/APP3; a HARDWALL_ENABLE bit on each switch link gates the data/valid signals so traffic stays within its protection domain]

Product Reality Differences
• Market forces
  – Need a crisper answer to "who cares?"
  – SMP Linux programming with pthreads – fully cache coherent
  – C + API approach to streaming, versus the new StreamIt language used with Raw
  – Special instructions for video and networking
  – Floating point was needed in the research project, but not in the product for the embedded market
• Lessons from Raw
  – E.g., dynamic network for streams
  – Hardware instruction cache
  – Protected interconnects
• More substantial engineering
  – 3-way VLIW CPU, subword arithmetic
  – Engineering for clock speed and power efficiency
  – Completeness: I/O interfaces on chip, a complete system chip; just add DRAM for a system
  – Support for virtual memory and 2D DMA
  – Runs SMP Linux (and can run multiple OSes simultaneously)

Virtual reality → Simulator reality → Prototype reality → Product reality

What Does the Future Look Like?
Corollary of Moore's law: the number of cores will double every 18 months.

              '02   '05   '08   '11   '14
  Research     16    64   256  1024  4096
  Industry      4    16    64   256  1024

1K cores by 2014! Are we ready? (Cores minimally big enough to run a self-respecting OS!)
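As a quick arithmetic check, the table is just the doubling rule applied to its own 2002 starting points (assuming one doubling every 18 months, i.e., 8 doublings between 2002 and 2014):

    N(t) = N_0 \cdot 2^{(t - t_0)/1.5}
    \text{Industry: } 4 \cdot 2^{(2014 - 2002)/1.5} = 4 \cdot 2^{8} = 1024
    \text{Research: } 16 \cdot 2^{8} = 4096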
Vision for the Future
• The 'core' is the logic gate of the 21st century
[Figure: a very large array of tiles, each with a processor (p), memory (m), and switch (s)]
Research Challenges for 1K Cores
• 4–16 cores is not interesting; industry is already there. Universities must focus on "1K cores"; everything will change!
• Can we use 4 cores to get 2X through DILP? Remember, cores will be 1 GHz and simple! What is the interconnect?
• How should we program 1K cores? Can the interconnect help with programming?
• Locality and reliability WILL matter for 1K cores. A spatial view of multicore?
• Can we add architectural support for programming ease? E.g., suppose I told you cores are free; can you discover mechanisms to make programming easier?
• What is the right grain size for a core?
• How must our computational models change in the face of small memories per core?
• How to "feed the beast"? I/O and external memory bandwidth
• Can we assume perfect reliability any longer?

ATAC Architecture
[Figure: an array of tiles connected by an Electrical Mesh Interconnect (EMesh) plus an Optical Broadcast WDM Interconnect]
[Proc. BARC Jan 2007, MIT-CSAIL-TR-2009-018]

(Research Challenges for 1K Cores, revisited)

FOS – Factored Operating System
OS cores collaborate, inspired by the distributed internet-services model.
The key idea: space sharing replaces time sharing.
• Today: the user app and the OS kernel thrash each other in a core's cache; user/OS time sharing is inefficient
• Angstrom: the OS assumes an abstracted space model. OS services are bound to distinct cores, separate from the user cores, and the OS service cores collaborate to achieve the best resource management; user/OS space sharing is efficient
[Figure: user-app cores alongside dedicated OS cores for the file system and I/O services]
[OS Review 2008]

(Research Challenges for 1K Cores, revisited)

The following are trademarks of Tilera Corporation: Tilera, the Tilera Logo, Tile Processor, TILE64, Embedding Multicore, Multicore Development Environment, Gentle Slope Programming, iLib, iMesh and Multicore Hardwall. All other trademarks and/or registered trademarks are the property of their respective owners.