A Close Look at the Epiphany-IV 28nm 64-core Coprocessor
Andreas Olofsson
PEGPUM 2013

Adapteva Achieves 3 “World Firsts”
1. First processor company to reach 50 GFLOPS/W
2. First FOSS OpenCL SDK in the mobile market
3. First semiconductor company to successfully crowd-source a project

Copyright © Adapteva. All rights reserved.

The Future is…
• Efficient…
• Heterogeneous…
• Open…
• Task Parallel…

Grand Challenges Ahead…
• Rebuild the computer ecosystem!
• Rewrite billions of lines of code!
• Retrain millions of programmers!
• Rewrite the education curriculum!

Our Vision: True Heterogeneous Computing
[Diagram: SYSTEM-ON-CHIP with four BIG CPU blocks, FPGA, GPU, Analog, and 100’s of small Epiphany RISC CPUs/DSPs]

Architecture Comparison
Technology       FPGA    DSP         GPU        CPU         Manycore
Process          28nm    40nm        28nm       32nm        28nm
Programming      VHDL    OCL/C++/C   CUDA/OCL   OCL/C/C++   OCL/C/C++
Area (mm^2)      590     108         294        216         10
Chip Power (W)   40      22          135        130         2
CPUs             n/a     8           16         4           64
Max GFLOPS       1500    160         3000       115         102
GHz * Cores      n/a     12          16         14.4        51.2
Compile Time     Hours   Minutes     Minutes    Minutes     Minutes
L1 Memory        6MB     512KB       2.5MB      256KB       2MB

• Efficiency is everything
• Peak performance means very little
• No magic bullet!

Epiphany: Massive Task-Parallelism
[Diagram: one tile containing a 1 GHz RISC Core, Local Memory, and a Multicore Framework Router]
• Coprocessor to ARM/Intel CPU
• 25mW per core
• C/C++ programmable

Programming Models

MODEL #1: TASK QUEUE MODEL
• Great for up to 2 GFLOPS
• Supports standard C/C++
• “Cloud on a chip”
[Diagram: X86/ARM/FPGA Host dispatching Task1, Task2, Task3, Task4]

MODEL #2: DATA PARALLEL MODEL
• OpenCL programmable
• Easy integration of C/C++
• OpenMP/MPI roadmap
[Diagram: X86/ARM/FPGA Host running Task1 across an array of MINI CPUs]

CPU Architecture Tradeoffs
Feature                                                           In  Out  Why
Single Precision Floating Point                                   X        Programming efficiency
64 Entry Register File                                            X        GCC
Byte Addressable                                                  X        GCC
64bit load/store                                                  X        Data movement is king
Dual Issue, In Order Scheduling                                   X
4-Bank Local Memory                                               X        Multicore data movement
Interrupts, breakpoints, timers                                   X        Inexpensive
Hardware Cache                                                        X    Too expensive, software managed
Double Precision Floating Point                                       X    Overkill for initial market
VLIW                                                                  X    GCC, complexity
Instr: Not, Mask, Rotate, Add-Carry, Ones                             X    Not important
Data Type ISA Orthogonality: (signed, unsigned) & (8b, 16b, 32b)      X    Expensive. Focus was on floating point

Network-On-Chip Tradeoffs
Feature                          In  Out  Why
64-Bit Transfer/Cycle            X        Maximize efficiency
Send address on every cycle      X        “Wires are free”, simplicity
Round robin arbitration          X        Easy, good enough
Distributed routing              X        Scalable
Robust flow control              X        Must have
Bidirectional Mesh               X        Well known
Address mapped packet switching  X        Ease of use
Single cycle message transfer    X        Short messaging important
3 separate Networks              X        No deadlock, QOS
Extensive QOS features               X    Deterministic traffic
Torus wraparound connections         X    Too expensive in CMOS
Circuit switching                    X    Not general purpose
Large routing buffers                X    Too expensive
Multilayer Mesh                      X    Too expensive (for now)
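The “address mapped packet switching”, “distributed routing”, and “bidirectional mesh” rows above can be illustrated with a small Python sketch. It rests on assumptions the slide does not state: the destination core is encoded in the upper bits of a 32-bit address (here 6 row bits and 6 column bits above a 20-bit local offset), and each router fixes the column before the row. The function names are hypothetical and chosen only for illustration.

```python
# Hypothetical layout (an assumption, not taken from the slides):
# 32-bit address = [6-bit row | 6-bit column | 20-bit local offset].
ROW_BITS, COL_BITS, OFFSET_BITS = 6, 6, 20


def make_address(row, col, offset):
    """Pack mesh coordinates and a local offset into one global address."""
    return (row << (COL_BITS + OFFSET_BITS)) | (col << OFFSET_BITS) | offset


def decode_address(addr):
    """Split a global address back into (row, col, offset)."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    col = (addr >> OFFSET_BITS) & ((1 << COL_BITS) - 1)
    row = (addr >> (COL_BITS + OFFSET_BITS)) & ((1 << ROW_BITS) - 1)
    return row, col, offset


def next_hop(here, dest):
    """Routing decision made locally by one router; no global table is used."""
    (row, col), (dest_row, dest_col) = here, dest
    if col != dest_col:                      # assumed order: fix the column first
        return "EAST" if dest_col > col else "WEST"
    if row != dest_row:                      # then fix the row
        return "SOUTH" if dest_row > row else "NORTH"
    return "LOCAL"                           # arrived: deliver to this core


# Coordinate change for each outgoing direction (row grows to the south).
STEP = {"EAST": (0, 1), "WEST": (0, -1), "SOUTH": (1, 0), "NORTH": (-1, 0)}


def route(src, addr):
    """List the per-router decisions for a write from core `src` to `addr`."""
    dest_row, dest_col, _ = decode_address(addr)
    row, col = src
    hops = []
    while True:
        direction = next_hop((row, col), (dest_row, dest_col))
        hops.append(direction)
        if direction == "LOCAL":
            return hops
        d_row, d_col = STEP[direction]
        row, col = row + d_row, col + d_col


if __name__ == "__main__":
    addr = make_address(2, 3, 0x100)
    print(hex(addr), route((0, 0), addr))
```

Because the destination travels inside the address itself, a remote write looks like an ordinary store to the programmer, which is one plausible reading of the “Ease of use” entry in the table.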
Epiphany Implementation Methodology
Epiphany-IV
• Scalable
• 24 hrs from RTL to GDS
• Reusable tiles
• No long wires
• Pattern density
• Fault tolerant
• Thermal balancing
• Easy clocking
[Diagram: tile array numbered 0,0 / 0,1 / 1,0 / 1,1 … 7,7, surrounded by IO PADS; each tile contains LOGIC, SRAM, GLUE]

Epiphany-IV Area Breakdown
[Chart: IO, Memory, FPU, Register File, Everything Else]
• Memory/RF/FPU >65% of silicon die
• Make every transistor count
• MIMD efficiency “good enough”

Epiphany-IV Specifications
• 64 CPUs
• IEEE Floating Point (SP)
• 800 MHz Max Frequency
• 100 GFLOPS Performance
• 6.4 GB/s IO BW
• 200 GB/s peak NOC BW
• 1.6 TB/sec on chip memory BW
• 25 Billion Messages/sec
• 2MB on chip memory
• 10 mm^2 total silicon area in 28nm
• 2 Watt total chip power
• 324 ball 15x15mm BGA
• Sampling since July, 2012
[Diagram: core tile (1 GHz High Performance RISC CPU, 32KB+ Distributed Local Memory, Multicore Communication Framework Router) with four eLink IO ports]

Epiphany System Examples
• Programmable and flexible
• Easy to develop for
• Efficient and powerful
• Scalable
[Diagram: ARM SOC or SMALL FPGA host with SDIO, FLASH, USB, SDRAM, ETH, etc.]
[Diagram: FPGAs connecting ADCs and DACs through JESD204 and DMA blocks to arrays of E16G301 chips over eLink; labeled bandwidths 2GB/s and 4GB/s]

Parallella Computing Project
• Open (and “free”):
  • Documentation
  • Board design files
  • Drivers
  • Software Tools
• Accessible (NO NDAs!)
• $100 entry point
• ~4000 devs signed up in 4 weeks
[Diagram: ZYNQ7010 with Dual-Core ARM A9 Processor and Programmable Logic; SD CARD, Gb Ethernet, USB OTG, USB 2.0, 1GB LPDDR2, UART, I2C, GPIO; labeled bandwidths 2.4GB/s and 1.4GB/s]
[Board layout labels: IO, RJ45, uSD, ZYNQ (ARM) CPU, HDMI, 1GB SDRAM, GPIO]
[Board layout labels, continued: IO, USB, E64, Epiphany (E16|E64)]

Parallella Architecture
[Block diagram with three regions: Zynq “Hard”, Zynq FPGA, Off-Chip. Labels: SD-CARD, I2C, UART, Ethernet, USB OTG, USB 2.0, Dual Core ARM A9 (800MHz), MIO, AXI BUS, AXI-MASTER, AXI-SLAVE, “Glue-Logic”, eLink, GPIO, Epiphany, Daughter Card, AXI-MASTER HDMI Controller, MEM-CTRL, “O/S” DRAM, “Sandbox” SHARED DRAM]

Experimental Features On The Way
• Network Traffic and Congestion Monitors
• Multicore Hardware Synchronization
• Active Message
• Multicore Breakpoint
• Hardware Loops
• Multicast Network Transactions
Already inside Epiphany-III/IV silicon, but need more testing!

MIMD Manycore IS the Future!
2012: ~10mm, 1024 CPUs, 32KB/core, 0.2W-20W
2022: ~10mm, 16K CPUs, 1MB/core (3D), 0.2W-20W
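
As a closing check on the “Efficiency is everything” point from the Architecture Comparison slide, the sketch below divides that slide's own Max GFLOPS figures by its Chip Power and Area figures. The numbers are copied from the table; the dictionary layout and the function name `efficiency` are illustrative choices, not anything defined in the talk.

```python
# Figures copied from the Architecture Comparison table in this deck.
CHIPS = {
    "FPGA":     {"gflops": 1500, "watts": 40,  "area_mm2": 590},
    "DSP":      {"gflops": 160,  "watts": 22,  "area_mm2": 108},
    "GPU":      {"gflops": 3000, "watts": 135, "area_mm2": 294},
    "CPU":      {"gflops": 115,  "watts": 130, "area_mm2": 216},
    "Manycore": {"gflops": 102,  "watts": 2,   "area_mm2": 10},
}


def efficiency(chips):
    """Return GFLOPS per watt and GFLOPS per mm^2 for every entry."""
    return {
        name: {
            "gflops_per_watt": c["gflops"] / c["watts"],
            "gflops_per_mm2": c["gflops"] / c["area_mm2"],
        }
        for name, c in chips.items()
    }


if __name__ == "__main__":
    table = efficiency(CHIPS)
    # Print the most power-efficient technology first.
    ranked = sorted(table.items(),
                    key=lambda item: item[1]["gflops_per_watt"],
                    reverse=True)
    for name, e in ranked:
        print(f"{name:9s} {e['gflops_per_watt']:6.1f} GFLOPS/W "
              f"{e['gflops_per_mm2']:6.2f} GFLOPS/mm^2")
```

On these figures the manycore column works out to 51 GFLOPS/W, ahead of the FPGA at 37.5, which is consistent with the 50 GFLOPS/W claim on the “World Firsts” slide even though its peak GFLOPS is the lowest in the table.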