Martin Kruliš
by Martin Kruliš (v1.0) 6. 1. 2015

Adapteva Company
  ◦ Small fabless semiconductor company
  ◦ Founded in 2008
  ◦ Main objective is to design massively parallel chips with an emphasis on power efficiency
      - First company to design a chip that is expected to scale beyond 1000 cores
  ◦ Current products
      - Epiphany processor (16-core and 64-core versions)
      - Parallella board
  ◦ Parallella University Program started this year

[Figure: Parallella board. Labeled components: 1 GB SDRAM, 16-core Epiphany coprocessor, μUSB (two ports), 1 Gb Ethernet, μSD, expansion slots, μHDMI, Zynq dual-core ARM Cortex-A9 (with integrated FPGA)]

Coprocessor
  ◦ 32-bit RISC cores with superscalar architecture
  ◦ 32 KB local memory per core (1-cycle latency)
      - Divided into four independent banks
  ◦ IEEE 754 compliant floating-point instruction set
  ◦ Two DMA channels
eMesh (Network-on-Chip)
  ◦ Both on-chip and off-chip communication
  ◦ No specific API, works with memory transactions
eLink (Chip-to-Chip Links)
  ◦ 4 I/O ports for external communication

Coprocessor Cores
  ◦ Simple in-order RISC architecture
      - Most instructions take 1 cycle
      - 8-stage dual-issue pipeline
      - Instruction set optimized for signal processing
  ◦ Separate integer and floating-point ALUs
  ◦ 64 32-bit registers (shared by the IALU and the FPU)
      - Load/store architecture
      - Register file accesses per cycle: 3 reads/1 write for the FPU, 2 reads/1 write for the IALU, and 1 load/store
  ◦ Performance
      - 16-core version: ~2 GFLOPS per core
      - 64-core version: ~1.6 GFLOPS per core

Memory Model
  ◦ Internal memory of each node is mapped into global memory
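
The mapping of per-core memory into the global address space can be illustrated with a little bit arithmetic. The sketch below assumes 32-bit addresses whose upper 12 bits select the core (6 bits of row index, 6 bits of column index, as stated on the eMesh Routing slide), which leaves the lower 20 bits for the offset inside that core. The helper names (`global_address`, `address_row`, `address_col`, `address_offset`) are hypothetical and are not part of the Epiphany SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout of a 32-bit global address:
 *   [31..26] core row | [25..20] core column | [19..0] offset in the core */
#define CORE_COL_BITS 6u
#define LOCAL_BITS    20u   /* 32 address bits - 12 core-id bits */

/* Compose a global address from core coordinates and a local offset. */
static uint32_t global_address(unsigned row, unsigned col, uint32_t offset)
{
    assert(row < 64u && col < 64u);          /* 6 bits each */
    assert(offset < (1u << LOCAL_BITS));     /* must fit into 20 bits */
    uint32_t coreid = ((uint32_t)row << CORE_COL_BITS) | col; /* row * 64 + col */
    return (coreid << LOCAL_BITS) | offset;
}

/* Decompose a global address again. */
static unsigned address_row(uint32_t addr)    { return addr >> 26; }
static unsigned address_col(uint32_t addr)    { return (addr >> LOCAL_BITS) & 0x3Fu; }
static uint32_t address_offset(uint32_t addr) { return addr & ((1u << LOCAL_BITS) - 1u); }
```

For instance, `global_address(32, 8, 0x6000)` yields `0x80806000`: core id `32 * 64 + 8 = 0x808` in the upper 12 bits and the local offset `0x6000` in the lower 20 bits.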

Local Memory
  ◦ Divided into four banks with independent controllers
  ◦ Each clock cycle, each bank may perform:
      - Send a 64-bit word to the program sequencer
      - Transfer a 64-bit word between memory and registers
      - Receive a 64-bit word from the eMesh interface
      - Send a 64-bit word to the eMesh interface (via the local DMA)
  ◦ Memory order model
      - Local reads and writes follow a strong memory model
      - Non-local transactions follow a weak memory model
      - Operations may not propagate in the order in which they were issued

eMesh
  ◦ 2D topology with nearest-neighbor connections
  ◦ 3 orthogonal (independent) meshes
      - cMesh: on-chip write transactions (8 B/cycle)
      - xMesh: off-chip write transactions (1 B/cycle)
      - rMesh: read requests (1 request per 8 cycles)
  ◦ Edge connections may be interfaced with other Epiphany chips
      - Or with other types of buses (off-core memory, I/O ports, …)
  ◦ Significantly favors write operations over read operations
      - Write transactions are 16x faster

[Figure: eMesh]

eMesh Routing
  ◦ The upper 12 bits of an address identify the target core
      - 6 bits: row index, 6 bits: column index
  ◦ Each node uses a simple routing algorithm
  ◦ Nodes use round-robin arbitration to avoid deadlock

DMA
  ◦ Two DMA channels per node
  ◦ 2D addressing awareness, flexible strides
  ◦ Local-external memory and external-external memory transfers
  ◦ Completion signaling by HW interrupt
  ◦ Master and slave modes
      - Slave DMA is controlled by external I/O or by another DMA

Epiphany SDK
  ◦ Separate compilation for host and coprocessor code
      - Epiphany uses e-gcc and e-objcopy
  ◦ The host runtime provides a way to
      - Detect the coprocessor
      - Allocate memory, transfer data
      - Execute precompiled binaries on the coprocessor
OpenCL
  ◦ The coprocessor is perceived as an OpenCL accelerator
  ◦ Each core is a compute unit, on-chip memory is local memory, …

Host Code Example

    #include <unistd.h>   /* usleep */
    #include <e-hal.h>    /* Epiphany host library */

    e_platform_t platform;
    e_epiphany_t dev;
    unsigned i, j, coreid;
    /* emem (an e_mem_t set up with e_alloc), emsg, _BufSize and flag
       are assumed to be declared elsewhere. */

    e_init(NULL);
    e_reset_system();
    e_get_platform_info(&platform);

    /* Open the whole chip as one workgroup, load the binary, start the cores. */
    e_open(&dev, 0, 0, platform.rows, platform.cols);
    e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);

    for (i = 0; i < platform.rows; ++i)
        for (j = 0; j < platform.cols; ++j) {
            /* Core id = 6-bit row index followed by 6-bit column index. */
            coreid = (i + platform.row) * 64 + j + platform.col;
            usleep(100000);
            e_read(&emem, 0, 0, 0x0, emsg, _BufSize);          /* shared external memory */
            e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));   /* local memory of core (i, j) */
            ...
        }

    e_close(&dev);
    e_finalize();

Matrix Multiplication
  ◦ Using the naïve O(N^3) algorithm
  ◦ Square matrices
  ◦ N is divisible by the number of cores
      - Each core computes its corresponding tile of the result matrix
  ◦ Both input matrices and the output matrix fit into the total amount of local memory
      - A smart plan of the computations and the data transfers can be devised

Matrix Multiplication
  ◦ A tiles are rotated vertically in each column
  ◦ B tiles are rotated horizontally in each row
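
The tile-rotation plan can be tried out on the host before any device code is written. The following is a minimal sketch in plain C that simulates a small grid of cores; it is not Epiphany code, and the `memcpy` calls merely stand in for whatever transfer mechanism (DMA or direct remote writes) a real implementation would use. The rotation scheme is the classic Cannon's algorithm. All names and sizes (`P`, `T`, `core_t`, `tile_mac`, `grid_matmul`) are hypothetical. Which axis each matrix travels along depends on how tiles are laid out on the grid: this sketch assumes core (r, c) computes result tile (r, c), so A tiles circulate within a grid row and B tiles within a grid column; a transposed placement swaps the two directions.

```c
#include <assert.h>
#include <string.h>

#define P 2          /* hypothetical grid of P x P cores */
#define T 2          /* tile edge length */
#define N (P * T)    /* matrix edge length */

/* Local memory of one simulated core: one tile of A, B and C. */
typedef struct { float a[T][T], b[T][T], c[T][T]; } core_t;

/* c += a * b on the tiles held by one core (naive triple loop). */
static void tile_mac(core_t *k)
{
    for (int i = 0; i < T; ++i)
        for (int j = 0; j < T; ++j)
            for (int x = 0; x < T; ++x)
                k->c[i][j] += k->a[i][x] * k->b[x][j];
}

/* C = A * B computed by rotating tiles around the simulated grid. */
static void grid_matmul(float A[N][N], float B[N][N], float C[N][N])
{
    static core_t grid[P][P], next[P][P];

    /* Skewed initial distribution: core (r, c) starts with
       A tile (r, r + c) and B tile (r + c, c), indices modulo P. */
    for (int r = 0; r < P; ++r)
        for (int c = 0; c < P; ++c) {
            int k = (r + c) % P;
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j) {
                    grid[r][c].a[i][j] = A[r * T + i][k * T + j];
                    grid[r][c].b[i][j] = B[k * T + i][c * T + j];
                    grid[r][c].c[i][j] = 0.0f;
                }
        }

    for (int step = 0; step < P; ++step) {
        /* Every core multiplies the tiles it currently holds. */
        for (int r = 0; r < P; ++r)
            for (int c = 0; c < P; ++c)
                tile_mac(&grid[r][c]);

        /* Rotate: A arrives from the right neighbor, B from the neighbor below. */
        memcpy(next, grid, sizeof grid);
        for (int r = 0; r < P; ++r)
            for (int c = 0; c < P; ++c) {
                memcpy(next[r][c].a, grid[r][(c + 1) % P].a, sizeof next[r][c].a);
                memcpy(next[r][c].b, grid[(r + 1) % P][c].b, sizeof next[r][c].b);
            }
        memcpy(grid, next, sizeof grid);
    }

    /* Gather the result tiles back into one matrix. */
    for (int r = 0; r < P; ++r)
        for (int c = 0; c < P; ++c)
            for (int i = 0; i < T; ++i)
                for (int j = 0; j < T; ++j)
                    C[r * T + i][c * T + j] = grid[r][c].c[i][j];
}
```

After P steps every core has seen all P pairs of matching A and B tiles, so each core ends up with its complete tile of the result while holding only one tile of each matrix at any time.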