11-parallela

by Martin Kruliš (v1.0), 6. 1. 2015

• Adapteva Company
  ◦ Small fabless semiconductor company
  ◦ Founded in 2008
  ◦ Main objective is to design massively parallel chips with an emphasis on power efficiency
    ▪ First company to design a chip that is expected to scale beyond 1000 cores
  ◦ Current products
    ▪ Epiphany processor (16-core and 64-core versions)
    ▪ Parallella board
  ◦ Parallella University Program started this year

[Figure: Parallella board. Labelled components: 1 GB SDRAM, 16-core Epiphany coprocessor, 2× μUSB, 1 Gb Ethernet, μSD, expansion slots, μHDMI, Zynq dual-core ARM Cortex-A9 (with integrated FPGA)]

• Coprocessor
  ◦ 32-bit RISC cores with superscalar architecture
  ◦ 32 KB local memory per core (1-cycle latency)
    ▪ Divided into four independent banks
  ◦ IEEE 754 compliant floating-point instruction set
  ◦ Two DMA channels
• eMesh (Network-on-Chip)
  ◦ Both on-chip and off-chip communication
  ◦ No specific API, works with memory transactions
• eLink (Chip-to-Chip Links)
  ◦ 4 I/O ports for external communication

• Coprocessor Cores
  ◦ Simple in-order RISC architecture
    ▪ Most instructions take 1 cycle
    ▪ 8-stage dual-issue pipeline
    ▪ Instruction set optimized for signal processing
  ◦ Separate integer and floating-point ALU
  ◦ 64× 32-bit registers (for both IALU and FPU)
    ▪ Load/store architecture
    ▪ Per cycle: 3 reads and 1 write by the FPU, 2 reads and 1 write by the IALU, 1 load/store
  ◦ Performance
    ▪ 16 cores ~ 2 GFLOPS each, 64 cores ~ 1.6 GFLOPS each

• Memory Model
  ◦ Internal memory of each node is mapped into global memory
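
The mapping of each node's internal memory into the global address space can be illustrated with a little address arithmetic. The sketch below assumes a 32-bit address whose upper 12 bits hold the core ID (6 bits row, 6 bits column) and whose lower 20 bits hold the offset inside that core's local address range; the 20-bit offset width and all function names are assumptions made for illustration, not part of any SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: [31:26] row, [25:20] column, [19:0] offset in the core's local range. */
uint32_t global_address(unsigned row, unsigned col, uint32_t offset)
{
    assert(row < 64 && col < 64 && offset < (1u << 20));
    return ((uint32_t)row << 26) | ((uint32_t)col << 20) | offset;
}

/* Inverse helpers: split a global address back into its parts. */
unsigned addr_row(uint32_t addr)    { return addr >> 26; }
unsigned addr_col(uint32_t addr)    { return (addr >> 20) & 0x3Fu; }
uint32_t addr_offset(uint32_t addr) { return addr & 0xFFFFFu; }
```

For example, `global_address(32, 8, 0x6000)` yields `0x80806000`: a core at mesh coordinates (32, 8) would see its own local offset `0x6000` at that global address, and any other core could use the same address to reach it.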
• Local Memory
  ◦ Divided into four banks with independent controllers
  ◦ Each clock cycle, each bank may perform:
    ▪ Send a 64-bit word to the program sequencer
    ▪ Transfer a 64-bit word between memory and registers
    ▪ Receive a 64-bit word from the eMesh interface
    ▪ Send a 64-bit word to the eMesh interface (local DMA)
  ◦ Memory order model
    ▪ Local reads and writes follow a strong memory model
    ▪ Non-local transactions follow a weak memory model
    ▪ Operations may not propagate in the same order in which they were issued

• eMesh
  ◦ 2D topology with nearest-neighbor connections
  ◦ 3 orthogonal (independent) meshes
    ▪ cMesh – on-chip write transactions (8 B/cycle)
    ▪ xMesh – off-chip write transactions (1 B/cycle)
    ▪ rMesh – read requests (1 request per 8 cycles)
  ◦ Edge connections may be interfaced with other Epiphany chips
    ▪ Or with other types of buses (off-core memory, I/O ports, …)
  ◦ Significantly favors write operations over read operations
    ▪ Write transactions are 16× faster

[Figure: eMesh]

• eMesh Routing
  ◦ The upper 12 bits of an address are the address of the core
    ▪ 6 bits – row index, 6 bits – column index
  ◦ Each node uses a simple routing algorithm
  ◦ Nodes use round-robin arbitration to avoid deadlock

• DMA
  ◦ Two DMA channels per node
  ◦ 2D addressing awareness, flexible strides
  ◦ Local-external memory and external-external memory transfers
  ◦ Completion signaling by HW interrupt
  ◦ Master and slave modes
    ▪ Slave DMA is controlled by external I/O or another DMA

• Epiphany SDK
  ◦ Separate compilation for host and coprocessor code
    ▪ Epiphany uses e-gcc and e-objcopy
  ◦ The host runtime provides a way to
    ▪ Detect the coprocessor
    ▪ Allocate memory, transfer data
    ▪ Execute precompiled binaries on the coprocessor
• OpenCL
  ◦ The coprocessor is perceived as an OpenCL accelerator
  ◦ Each core is a compute unit, on-chip memory is local memory, …
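
Returning to the DMA slide, the "2D addressing awareness, flexible strides" point can be modelled in plain C as a strided row-by-row copy, which is the access pattern needed to move one tile of a larger matrix. This is only a software model under simplifying assumptions: the `dma2d_desc` structure and `dma2d_copy` function are hypothetical names, and the real hardware descriptor format is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor of a 2D transfer (counts and strides in elements). */
typedef struct {
    size_t inner_count;  /* elements copied per row */
    size_t outer_count;  /* number of rows */
    size_t src_stride;   /* distance between starts of consecutive source rows */
    size_t dst_stride;   /* distance between starts of consecutive destination rows */
} dma2d_desc;

/* Software model of the transfer: copy outer_count rows of inner_count floats. */
void dma2d_copy(float *dst, const float *src, const dma2d_desc *d)
{
    assert(d->src_stride >= d->inner_count && d->dst_stride >= d->inner_count);
    for (size_t r = 0; r < d->outer_count; ++r)
        memcpy(dst + r * d->dst_stride,
               src + r * d->src_stride,
               d->inner_count * sizeof(float));
}
```

Extracting the lower-right 2×2 tile of a row-major 4×4 matrix `m` into a dense buffer `t` would then use the descriptor `{2, 2, 4, 2}` and the call `dma2d_copy(t, m + 2 * 4 + 2, &d)`.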
• Host Code Example

    #include <unistd.h>   /* usleep() */
    #include <e-hal.h>    /* Epiphany host library */

    /* i, j, coreid, flag, emsg, _BufSize and the e_mem_t buffer emem
       (set up with e_alloc()) are declared elsewhere; the slide omits them. */
    e_platform_t platform;
    e_epiphany_t dev;

    e_init(NULL);                       /* initialize the host runtime */
    e_reset_system();
    e_get_platform_info(&platform);     /* detect the coprocessor */

    /* Open a workgroup covering the whole chip, then load and start the program. */
    e_open(&dev, 0, 0, platform.rows, platform.cols);
    e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);

    for (i = 0; i < platform.rows; ++i)
        for (j = 0; j < platform.cols; ++j) {
            coreid = (i + platform.row) * 64 + j + platform.col;
            usleep(100000);                                   /* give the core time to run */
            e_read(&emem, 0, 0, 0x0, emsg, _BufSize);         /* read from the shared external buffer */
            e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));  /* read from local memory of core (i, j) */
            ...
        }

    e_close(&dev);
    e_finalize();

• Matrix Multiplication
  ◦ Using the naïve O(N³) algorithm
  ◦ Square matrices
  ◦ N is divisible by the number of cores
    ▪ Each core computes its corresponding tile of the result matrix
  ◦ Both input matrices and the output matrix fit into the total amount of local memory
    ▪ A smart plan of the computations and the data transfers can be devised

• Matrix Multiplication
  ◦ A tiles are rotated vertically in each column
  ◦ B tiles are rotated horizontally in each row
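
The tile rotation idea can be tried out on an ordinary CPU by simulating the mesh of cores with an array. The sketch below is a textbook Cannon-style scheme, which is an assumption about what the slides depict: here A tiles circulate along each mesh row and B tiles along each mesh column, and the directions swap if the operand tiles are placed on the mesh in transposed order. The sizes `P` and `T`, the `core_t` type and all function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sizes: P x P cores, T x T tiles, N x N matrices. */
enum { P = 2, T = 2, N = P * T };

typedef struct {
    float a[T][T], b[T][T], c[T][T];  /* tiles held in one core's local memory */
} core_t;

/* c += a * b on the tiles currently held by one core. */
static void tile_multiply_add(core_t *k)
{
    for (int i = 0; i < T; ++i)
        for (int j = 0; j < T; ++j)
            for (int x = 0; x < T; ++x)
                k->c[i][j] += k->a[i][x] * k->b[x][j];
}

/* C = A * B for row-major N x N matrices, simulated on a P x P mesh. */
void cannon_matmul(const float *A, const float *B, float *C)
{
    core_t mesh[P][P], next[P][P];
    memset(mesh, 0, sizeof mesh);

    /* Skewed placement: core (i,j) starts with A tile (i,s) and B tile (s,j), s = (i+j) mod P. */
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            int s = (i + j) % P;
            for (int r = 0; r < T; ++r)
                for (int c = 0; c < T; ++c) {
                    mesh[i][j].a[r][c] = A[(i * T + r) * N + (s * T + c)];
                    mesh[i][j].b[r][c] = B[(s * T + r) * N + (j * T + c)];
                }
        }

    for (int step = 0; step < P; ++step) {
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j)
                tile_multiply_add(&mesh[i][j]);

        /* Each core takes the A tile of its right neighbor and the B tile of the core below. */
        memcpy(next, mesh, sizeof mesh);
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j) {
                memcpy(next[i][j].a, mesh[i][(j + 1) % P].a, sizeof next[i][j].a);
                memcpy(next[i][j].b, mesh[(i + 1) % P][j].b, sizeof next[i][j].b);
            }
        memcpy(mesh, next, sizeof mesh);
    }

    /* Gather the result tiles back into the full matrix. */
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            for (int r = 0; r < T; ++r)
                for (int c = 0; c < T; ++c)
                    C[(i * T + r) * N + (j * T + c)] = mesh[i][j].c[r][c];
}
```

After `P` steps every core has seen all tile pairs it needs, so each output tile never leaves its core; only the input tiles travel, and on the real chip those transfers would be writes to neighboring cores, which the eMesh favors.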
