Tile Processors: Many-Core for Embedded and Cloud Computing Richard Schooler VP Software Engineering Tilera Corporation rschooler@tilera.com Exploiting Natural Parallelism High-performance applications have lots of parallelism! – Embedded apps: • Networking: packets, flows • Media: streams, images, functional & data parallelism – Cloud apps: • Many clients: network sessions • Data mining: distributed data & computation Lots of different levels: – SIMD (fine-grain data parallelism) – Thread/process (medium-grain task parallelism) – Distributed system (coarse-grain job parallelism) 2 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Every one is going Manycore, but can the architecture scale? n Cores The computing world is ready for radical change 32 #Cores Sun IBM Cell Larrabee Intel 4 Performance Performance & Performance/W Gap 2 1 2005 2006 3 2010 2014 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. 2020 Time Key “Many-Core” Challenges: The 3 P’s Performance challenge – How to scale from 1 to 1000 cores – the number of cores is the new Megahertz Power efficiency challenge – Performance per watt is the new metric – systems are often constrained by power & cooling Programming challenge – How to provide a converged many core solution in a standard programming environment 4 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. “Problems cannot be solved by the same level of thinking that created them.” Current technologies fail to deliver – – – – Incremental performance increase High power Low level of Integration Increasingly bigger cores We need to have a new thinking to get – – – – 10 x performance 10 x performance per watt Converged computing Standard programming models 5 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Stepping Back: How Did We Get Here? Moore’s Conundrum: More devices =>? More performance Old answers: More complex cores; bigger caches – But power-hungry New answers: More cores – But do conventional approaches scale? Diminishing returns! 6 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. The Old Challenge: CPU-on-a-chip 20 MIPS CPU in 1987 Few thousand gates 7 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. The Opportunity: Billions of Transistors Old CPU: What to do with all those transistors? 8 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Take Inspiration from ASICs mem mem mem mem mem ASICs have high performance and low power • Custom-routed, short wires • Lots of ALUs, registers, memories – huge on-chip parallelism But how to build a programmable chip? 9 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Replace Long Wires with Routed Interconnect Ctrl [IEEE Computer ’97] 10 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. ALU ALU ALU ALU ALU ALU ALU ALU From Centralized Clump of CPUs … 11 ALU ALU ALU ALU ALU ALU ALU Bypass Net ALU RF HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. ALU ALU ALU ALU ALU R ALU … To Distributed ALUs, Routed Bypass Network Scalar Operand Network (SON) [TPDS 2005] 12 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. From a Large Centralized Cache… 13 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. …to a Distributed Shared Cache 14 ALU ALU ALU ALU ALU R ALU $ HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. [ISCA 1999] Distributed Everything + Routed Interconnect Tiled Multicore ALU ALU ALU ALU ALU R ALU $ Each tile is a processor, so programmable 15 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Tiled Multicore Captures ASIC Benefits and is Programmable Scales to large numbers of cores Modular – design and verify 1 tile Power efficient – Short wires plus locality opts – CV2f – Chandrakasan effect, more cores at 2 lower freq and voltage – CV f Processor Core Current Bus Architecture S Core + Switch = Tile 16 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Tilera processor portfolio Demonstrating the scale of many-core Gx64 & Gx100 Up to 8x performance TILEPro64 TILE64 Gx16 & Gx36 2x the performance TILEPro36 2008 2007 17 2009 2010 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. ... TILE-Gx100™: 4x GbE SGMII mPIPE 10 GbE XAUI PCIe 2.0 8-lane 4x GbE SGMII 10 GbE XAUI 4x GbE SGMII PCIe 2.0 4-lane Flexible I/O 10 GbE XAUI 4x GbE SGMII 10 GbE XAUI MiCA Memory Controller (DDR3) 18 Memory Controller (DDR3) 4x GbE SGMII 10 GbE XAUI SerDes SerDes 4x GbE SGMII 80-120 Gbps packet I/O – 8 ports XAUI / 2 XAUI – 2 40Gb Interlaken – 32 ports 1GbE (SGMII) SerDes 10 GbE XAUI SerDes Interlaken 4x GbE SGMII 10 GbE XAUI PCIe 2.0 8-lane Interlaken SerDes SerDes SerDes UART x2, USB x2, JTAG, I2C, SPI 1.2GHz – 1.5GHz 32 MBytes total cache 546 Gbps peak mem BW 200 Tbps iMesh BW 80 Gbps PCIe I/O – 3 StreamIO ports (20Gb) SerDes 10 GbE XAUI MiCA SerDes 4x GbE SGMII Memory Controller (DDR3) SerDes Memory Controller (DDR3) SerDes Complete System-on-a-Chip with 100 64-bit cores Wire-speed packet eng. – 120Mpps MiCA engines: HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. – 40 Gbps crypto – compress & decompress ™ TILE-Gx36 : SerDes 36 Processor Cores 866M, 1.2GHz, 1.5GHz clk 12 MBytes total cache SerDes Scaling to a broad range of applications 40 Gbps total packet I/O 4x GbE SGMII 10 GbE XAUI 4x GbE SGMII SerDes PCIe 2.0 4-lane PCIe 2.0 4-lane Flexible I/O mPIPE PCIe 2.0 8-lane SerDes SerDes UARTx2, USBx2, JTAG, I2C, SPI 10 GbE XAUI 4x GbE SGMII 10 GbE XAUI 4x GbE SGMII Memory Controller (DDR3) 10 GbE XAUI – 4 ports 10GbE (XAUI) – 16 ports 1GbE (SGMII) 48 Gbps PCIe I/O – 2 16Gbps Stream IO ports SerDes Memory Controller (DDR3) SerDes MiCA Wire-speed packet engine – 60Mpps MiCA engine: – 20 Gbps crypto – Compress & decompress 19 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Full-Featured General Converged Cores Processor – – – – Each core is a complete computer 3-way VLIW CPU SIMD instructions: 32, 16, and 8-bit ops Instructions for video (e.g., SAD) and networking – Protection and interrupts Memory – – – – Core Register File Three Execution Pipelines L1 cache and L2 Cache Virtual and physical address space Instruction and data TLBs Cache integrated 2D DMA engine Cache 16K L1-I I-TLB 8K L1-D D-TLB Runs SMP Linux Runs off-the-shelf C/C++ programs Signal processing and general apps 20 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. 2D DMA 64K L2 Terabit Switch Software must complement the hardware Enable re-use of existing code-bases Standards-based development environment – e.g. gcc, C, C++, Java, Linux – Comprehensive command-line & GUI-based tools Support multiple OS models – One OS running SMP – Multiple virtualized OS’s with protection – Bare metal or “zero-overhead” with background OS environment Support range of parallel programming styles – – – – Threaded programming (pThreads, TBB) Run-to-Completion with load-balancing Decomposition & Pipelining Higher-level frameworks (Erlang, OpenMP, Hadoop etc.) 21 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Software Roadmap Standards & open source integration – Compiler: gcc, g++ 4.4+ – Linux: • Kernel: Tile architecture integrated to 2.6.36 • User-space: glibc, broader set of standard packages Extended programming and runtime environments – Java: porting OpenJDK – Virtualization: porting KVM 22 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Tile architecture: The future of many-core computing Multicore is the way forward – But we need the right architecture to utilize it The Tile architecture addresses the challenges – Scales to 100’s of cores – Delivers very low power – Runs your existing code Standards-based software – Familiar tools – Full range of standard programming environments 23 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Thank you! Questions? Research Vision to Commercial Product The future? 1B Transistors in 2007 MiCA 4x GbE SGMII (DDR3) 10 GbE XAUI 4x GbE SGMII Interlaken 4x GbE SGMII 4x GbE SGMII mPIPE PCIe 2.0 4-lane 10 GbE XAUI 10 GbE XAUI 10 GbE XAUI 4x GbE SGMII 10 GbE XAUI 4x GbE SGMII Interlaken SerDes SerDes PCIe 2.0 8-lane MiCA 10 GbE XAUI 4x GbE SGMII 10 GbE XAUI 4x GbE SGMII Memory Controller (DDR3) Memory Controller (DDR3) 10 GbE XAUI 2010 Tile Processor 64 cores Mem 1997 2007 2002 25 Memory Controller Flexible I/O MIT Raw 16 cores CPU (DDR3) PCIe 2.0 8-lane SerDes 2018 1996 A blank slate Memory Controller UART x2, USB x2, JTAG, I2C, SPI SerDes SerDes SerDes SerDes SerDes TILE-Gx100 100 cores 100B transistors SerDes SerDes SerDes The opportunity HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Standard tools and programming model Multicore Development Environment Standards-based tools Standard programming SMP Linux 2.6 ANSI C/C++ Java, PHP Standard application stack Application layer Open source apps Standard C/C++ libs Operating System layer Integrated tools GCC compiler Standard gdb gprof Eclipse IDE 64-way SMP Linux Zero Overhead Linux Bare metal environment Hypervisor layer Innovative tools Multicore debug Multicore profile 26 Virtualizes hardware I/O device drivers HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Applications Applications libraries Operating System kernel drivers Hypervisor Virtualization and high speed I/O drivers Tile Processor Tile Tile Tile Tile … Standard Software Stack Management Protocols IPMI 2.0 Infrastructure Apps Transcoding Language Support Compiler, OS Hypervisor 27 Network Monitoring Perl Commercial Linux Distribution gcc & g++ HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. High single core performance Comparable to Atom & ARM Cortex-A9 cores Single-Core Single thread CoreMark™ Comparison CoreMark Score 3,500 3,000 2,500 2,000 1,500 1,000 500 Tilera TILEPro64 866 MHz Tilera TILE-Gx36 1.25 GHz ARM Cortex-A9 1 GHz Intel Atom N270 1600 MHz - Data for TILEPro, ARM Cortex-A9, Atom N270 is available on the CoreMark website http://coremark.org/home.php - TILE-Gx and single thread Atom results were measured in Tilera labs - Single core, single thread result for ARM is calculated based on chip scores 28 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Significant value across multiple markets Networking • Classification • L4-7 Services • Load Balancing • Monitoring/QoS • Security Multimedia • Video Conferencing • Media Streaming • Transcoding Wireless • Base Station • Media Gateway • Service Nodes • Test Equipment Cloud • Apache • Memcached • Web Applications • LAMP stack High Performance Low Power Standard Programming Over 100 customers Over 40 customers going into production Tier 1 customers in all target markets 29 29 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Targeting markets with highly parallel applications Web Media delivery Government Web Serving In Memory Cache Data Mining Transcoding Video delivery Wireless media Lawful interception Surveillance Other Common Themes Hundreds and Thousands of servers running each application thousands of parallel transactions All need better performance and power efficiency 30 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. The Tile Processor Architecture Mesh interconnect, power-optimized cores 2 Dimensional mesh network Processor Core S Core + Switch = Tile Scales to large numbers of cores Modular: Design-and-verify 1 tile Power efficient: – Short wires & locality optimize CV2f – Chandrakasan effect, more cores at lower freq and voltage – 31 200 Tbps on-chip bandwidth Traditional Bus/Ring Architecture CV2f HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Distributed “everything” Cache, memory management, connectivity Big centralized caches don’t scale – Contention – Long latency – High power Distributed caches have numerous benefits – Lower power (less logic lit-up per access) – Exploit locality (local L1 & L2) – Can exploit various cache placement mechanisms to enhance performance 32 Distributed caches Completely HW Coherent Cache System Global 64-bit address space HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Highest compute density 2U form factor 4 hot pluggable modules 8 Tilera TILEPro processors 512 general purpose cores 1.3 trillion operations /sec 33 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Power efficient and eco-friendly server 10,000 cores in a 8 Kilowatt rack 35-50 watts max per node Server power of 400 watts 90%+ efficient power supplies Shared fans and power supplies 34 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. Coherent distributed cache system Globally Shared Physical Address Space – – – – Each tile has local L1 and L2 caches Aggregate of L2 serves as a globally shared L3 Any cache block can be replicated locally Hardware tracks sharers, invalidates stale copies Dynamic Distributed Cache (DDC™) – DDR2 Controller 0 Memory pages distributed across all cores or homed by allocating core Coherent I/O – – – – PCIe 0 XAUI 0 Completely Coherent System Flexible I/O UART JTAG SPI, I2C PCIe 1 GbE 0 GbE 1 XAUI 1 DDR2 Controller 3 Hardware maintains coherence I/O reads/writes coherent with tile caches Reads/writes delivered by HW to home cache Header/packet delivered directly to tile caches 35 DDR2 Controller 1 HPEC, 15 September 2010 © 2010 Copyright Tilera Corporation. All Rights Reserved. DDR2 Controller 2 SerDes Distributed cache SerDes Full Hardware Cache Coherence Standard shared memory programming model SerDes – – SerDes