11-parallela

Martin Kruliš
by Martin Kruliš (v1.0)
6. 1. 2015

Adapteva Company
◦ Small fabless semiconductor company
◦ Founded in 2008
◦ Main objective is to design massively parallel chips with emphasis on power efficiency
  ▪ First company to design a chip architecture expected to scale beyond 1000 cores
◦ Current products
  ▪ Epiphany processor (16-core and 64-core versions)
  ▪ Parallela board
◦ Parallela University Program started this year

Parallela Board (labels from the board diagram)
◦ Zynq dual-core ARM Cortex-A9 (with integrated FPGA)
◦ 16-core Epiphany coprocessor
◦ 1GB SDRAM
◦ 1Gb Ethernet
◦ 2x μUSB, μHDMI, μSD
◦ Expansion slots

Coprocessor
◦ 32-bit RISC cores with superscalar architecture
◦ 32KB local memory per core (1 cycle latency)
  ▪ Divided into four independent banks
◦ IEEE754 compliant floating point instruction set
◦ Two DMA channels

eMesh (Network-on-Chip)
◦ Both on chip and off chip communication
◦ No specific API, works with memory transactions

eLink (Chip-to-Chip Links)
◦ 4 I/O ports for external communication

Coprocessor Cores
◦ Simple in-order RISC architecture
  ▪ Most instructions take 1 cycle
  ▪ 8-stage dual-issue pipeline
  ▪ Instruction set optimized for signal processing
◦ Separate integer and floating point ALU
◦ 64x 32-bit registers (shared by the IALU and the FPU)
  ▪ Load/store architecture
  ▪ Register accesses per cycle: 3 reads/1 write for the FPU, 2 reads/1 write for the IALU, plus 1 load/store
◦ Performance
  ▪ 16-core version ~ 2 GFLOPS per core, 64-core version ~ 1.6 GFLOPS per core
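
The per-core figures above can be related to the clock rate with a little arithmetic. The sketch below is illustrative only: it assumes (this is not stated on the slide) that each core can complete one fused multiply-add per cycle, counted as 2 floating-point operations, and that the 16-core chip is clocked at 1 GHz and the 64-core chip at 800 MHz. The function names are made up for this example.

```c
#include <assert.h>

/* Peak GFLOPS of one core = clock (GHz) x floating-point operations per cycle.
 * Assumption: one fused multiply-add per cycle counts as 2 operations. */
double peak_gflops_per_core(double clock_ghz, unsigned flops_per_cycle)
{
    assert(clock_ghz > 0.0);
    return clock_ghz * flops_per_cycle;
}

/* Aggregate peak of a whole chip with the given number of cores. */
double peak_gflops_chip(unsigned cores, double clock_ghz, unsigned flops_per_cycle)
{
    return cores * peak_gflops_per_core(clock_ghz, flops_per_cycle);
}
```

Under these assumptions, `peak_gflops_chip(16, 1.0, 2)` gives 32 GFLOPS for the 16-core chip, and `peak_gflops_chip(64, 0.8, 2)` gives roughly 102.4 GFLOPS for the 64-core chip.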

Memory Model
◦ Internal memory of each node is mapped into the global address space
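
As a rough illustration of this mapping, the sketch below builds a 32-bit global address from core coordinates and an offset into that core's memory. It uses the layout given on the eMesh Routing slide (upper 12 bits select the core: 6 bits row, 6 bits column), which leaves 20 bits for the offset. The function name `global_address` is hypothetical and not part of the SDK.

```c
#include <assert.h>
#include <stdint.h>

/* Build a 32-bit global address from core coordinates and a local offset.
 * Layout used here: bits 31..26 = row, bits 25..20 = column,
 * bits 19..0 = offset inside the core's address window. */
uint32_t global_address(unsigned row, unsigned col, uint32_t local_offset)
{
    assert(row < 64 && col < 64);          /* 6 bits each */
    assert(local_offset < (1u << 20));     /* 20-bit window per core */
    return ((uint32_t)row << 26) | ((uint32_t)col << 20) | local_offset;
}
```

For example, offset 0x6000 on the core at row 32, column 8 becomes the global address 0x80806000. Any core (or the host) that issues a memory transaction to this address reaches that core's local memory.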

Local Memory
◦ Divided into four banks with independent controllers
◦ In each clock cycle, the banks can together serve:
  ▪ Sending a 64-bit word to the program sequencer
  ▪ Transferring a 64-bit word between memory and registers
  ▪ Receiving a 64-bit word from the eMesh interface
  ▪ Sending a 64-bit word to the eMesh interface by the local DMA
◦ Memory order model
  ▪ Local reads and writes follow a strong memory model
  ▪ Non-local transactions follow a weak memory model
  ▪ Non-local operations may not propagate in the order in which they were issued
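
To make the bank split concrete, here is a small sketch that maps a local-memory offset to its bank. It assumes that the 32KB of local memory is split into four contiguous 8KB banks starting at offset 0; the slide does not spell out the layout, and the function name `bank_of` is made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Size of one bank, assuming 32KB split into four contiguous banks. */
#define BANK_SIZE 0x2000u

/* Map a local-memory offset (0x0000..0x7FFF) to its bank index (0..3). */
unsigned bank_of(uint32_t local_offset)
{
    assert(local_offset < 4 * BANK_SIZE);
    return local_offset / BANK_SIZE;
}
```

Under this assumption, offset 0x6000 falls at the start of bank 3. Placing code and frequently accessed data in different banks is one way to let instruction fetch and load/store traffic proceed in the same cycle.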

eMesh
◦ 2D topology with nearest-neighbor connections
◦ 3 orthogonal (independent) meshes
  ▪ cMesh – on-chip write transactions (8B/cycle)
  ▪ xMesh – off-chip write transactions (1B/cycle)
  ▪ rMesh – read requests (1 request per 8 cycles)
◦ Edge connections may be interfaced with other Epiphany chips
  ▪ Or with other types of buses (off-core memory, IO ports, …)
◦ Strongly favors write operations over reads
  ▪ Write transactions are 16x faster


eMesh Routing
◦ The upper 12 bits of an address identify the destination core
  ▪ 6 bits – row index, 6 bits – column index
◦ Each node uses a simple routing algorithm
◦ Nodes use round-robin arbitration to avoid deadlock
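
The address split above can be sketched in a few lines of C, together with one possible per-node routing decision. The routing rule shown (resolve the column first, then the row) is an assumption about what the "simple routing algorithm" looks like, as is the convention that row indices grow towards the south. All names are hypothetical.

```c
#include <stdint.h>

typedef enum { HOP_LOCAL, HOP_EAST, HOP_WEST, HOP_NORTH, HOP_SOUTH } hop_t;

/* Upper 6 bits of the 12-bit core ID: row index. */
unsigned addr_row(uint32_t addr) { return addr >> 26; }

/* Lower 6 bits of the 12-bit core ID: column index. */
unsigned addr_col(uint32_t addr) { return (addr >> 20) & 0x3Fu; }

/* Decide where a node at (my_row, my_col) forwards a transaction.
 * Assumed rule: fix the column first, then the row; when both match,
 * the transaction is delivered to the local core. */
hop_t next_hop(unsigned my_row, unsigned my_col, uint32_t dst_addr)
{
    unsigned row = addr_row(dst_addr);
    unsigned col = addr_col(dst_addr);

    if (col > my_col) return HOP_EAST;
    if (col < my_col) return HOP_WEST;
    if (row > my_row) return HOP_SOUTH;
    if (row < my_row) return HOP_NORTH;
    return HOP_LOCAL;
}
```

For instance, address 0x80806000 decodes to row 32, column 8. A node at (33, 9) would first forward it west, and the node at (33, 8) would then forward it north. Because every node applies the same fixed rule, no routing tables are needed.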

DMA
◦ Two DMA channels per node
◦ 2D addressing awareness, flexible strides
◦ Local-to-external and external-to-external memory transfers
◦ Completion signaling by HW interrupt
◦ Master and slave modes
  ▪ A slave DMA is controlled by external IO or by another DMA
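
What "2D addressing with flexible strides" means can be illustrated with a plain-C software model of such a transfer. This is not the eSDK DMA API: the descriptor type `copy2d_t` and the function `copy2d` are hypothetical, and the model simplifies matters by measuring the outer stride from the start of the previous row.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor of a 2D strided transfer (all strides in bytes). */
typedef struct {
    size_t    elem_size;         /* bytes per element */
    size_t    inner_count;       /* elements per row */
    size_t    outer_count;       /* number of rows */
    ptrdiff_t src_inner_stride;  /* step between elements in the source */
    ptrdiff_t src_outer_stride;  /* step between row starts in the source */
    ptrdiff_t dst_inner_stride;  /* step between elements in the destination */
    ptrdiff_t dst_outer_stride;  /* step between row starts in the destination */
} copy2d_t;

/* Software model of the transfer: two nested loops over rows and elements. */
void copy2d(void *dst, const void *src, const copy2d_t *d)
{
    const unsigned char *src_row = src;
    unsigned char *dst_row = dst;

    for (size_t o = 0; o < d->outer_count; ++o) {
        const unsigned char *s = src_row;
        unsigned char *t = dst_row;
        for (size_t i = 0; i < d->inner_count; ++i) {
            memcpy(t, s, d->elem_size);
            s += d->src_inner_stride;
            t += d->dst_inner_stride;
        }
        src_row += d->src_outer_stride;
        dst_row += d->dst_outer_stride;
    }
}
```

With such a descriptor, a 2x2 tile can be cut out of a row-major 4x4 `int` matrix in one transfer: the source outer stride is `4 * sizeof(int)` (one full matrix row), while the destination outer stride is `2 * sizeof(int)` (one tile row), so the tile lands in memory densely packed.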

Epiphany SDK
◦ Separate compilation for host and coprocessor code
  ▪ Epiphany code is built with e-gcc and e-objcopy
◦ The host runtime provides a way to
  ▪ Detect the coprocessor
  ▪ Allocate memory, transfer data
  ▪ Execute precompiled binaries on the coprocessor

OpenCL
◦ The coprocessor is perceived as an OpenCL accelerator
◦ Each core is a compute unit, on-chip memory is local memory, …

Host Code Example
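/* Fragment of a host program; needs <e-hal.h> and <unistd.h>. */
/* _BufSize and _BufOffset are constants defined elsewhere (not shown on the slide). */
e_platform_t platform;
e_epiphany_t dev;
e_mem_t emem;                 /* shared external-memory buffer */
char emsg[_BufSize];
unsigned i, j, coreid, flag;

e_init(NULL);
e_reset_system();
e_get_platform_info(&platform);
e_alloc(&emem, _BufOffset, _BufSize);   /* assumed: emem must be allocated before e_read */
e_open(&dev, 0, 0, platform.rows, platform.cols);
/* The last argument (E_TRUE) starts the cores right after loading. */
e_load_group("coproccode.srec", &dev, 0, 0, platform.rows, platform.cols, E_TRUE);
for (i = 0; i < platform.rows; ++i)
    for (j = 0; j < platform.cols; ++j) {
        /* 12-bit core ID: 6-bit row index, 6-bit column index */
        coreid = (i + platform.row) * 64 + j + platform.col;
        usleep(100000);       /* give the core time to produce its output */
        e_read(&emem, 0, 0, 0x0, emsg, _BufSize);         /* message from shared memory */
        e_read(&dev, i, j, 0x6000, &flag, sizeof(flag));  /* flag from core-local memory */
        ...
    }
e_close(&dev);
e_free(&emem);
e_finalize();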

Matrix Multiplication
◦ Using the naïve O(N³) algorithm
◦ Square matrices
◦ N is divisible by the number of cores
  ▪ Each core computes its corresponding tile of the result matrix
◦ Both input matrices and the output matrix fit into the total amount of local memory
  ▪ A smart plan of the computations and the data transfers can be devised

Matrix Multiplication
◦ A tiles are rotated horizontally within each row
◦ B tiles are rotated vertically within each column
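
The tile rotation can be tried out on the host with a small plain-C simulation. The sketch below is a Cannon-style scheme written for illustration, not the code used in the lecture: in it, core (i, j) takes its next A tile from the right-hand neighbour in its row and its next B tile from the neighbour below in its column, so that it computes tile (i, j) of C = A·B. The grid size `P`, the tile size `T` and all names are hypothetical (the Parallela grid would be 4x4), and matrices are stored row-major in flat arrays.

```c
#include <string.h>

#define P 2          /* simulated core grid is P x P */
#define T 2          /* each core holds T x T tiles */
#define N (P * T)    /* full matrix dimension */

/* Local state of one simulated core: one tile of A, B and C. */
typedef struct { float a[T][T], b[T][T], c[T][T]; } core_t;

/* Initial (skewed) distribution: core (i,j) gets A(i,k) and B(k,j), k = (i+j) mod P. */
static void load_tiles(core_t g[P][P], const float *A, const float *B)
{
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            int k = (i + j) % P;
            for (int r = 0; r < T; ++r)
                for (int s = 0; s < T; ++s) {
                    g[i][j].a[r][s] = A[(i * T + r) * N + (k * T + s)];
                    g[i][j].b[r][s] = B[(k * T + r) * N + (j * T + s)];
                    g[i][j].c[r][s] = 0.0f;
                }
        }
}

/* Local work of one core: c += a * b on its current tiles. */
static void tile_mac(core_t *core)
{
    for (int r = 0; r < T; ++r)
        for (int s = 0; s < T; ++s)
            for (int k = 0; k < T; ++k)
                core->c[r][s] += core->a[r][k] * core->b[k][s];
}

/* A tiles move one core to the left in each row, B tiles one core up in each column. */
static void rotate(core_t g[P][P])
{
    core_t old[P][P];
    memcpy(old, g, sizeof old);
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j) {
            memcpy(g[i][j].a, old[i][(j + 1) % P].a, sizeof g[i][j].a);
            memcpy(g[i][j].b, old[(i + 1) % P][j].b, sizeof g[i][j].b);
        }
}

/* Computes C = A * B for N x N row-major matrices by simulating the core grid. */
void tiled_matmul(const float *A, const float *B, float *C)
{
    core_t g[P][P];
    load_tiles(g, A, B);
    for (int step = 0; step < P; ++step) {
        for (int i = 0; i < P; ++i)
            for (int j = 0; j < P; ++j)
                tile_mac(&g[i][j]);
        rotate(g);
    }
    /* Gather the result tiles back into the full matrix. */
    for (int i = 0; i < P; ++i)
        for (int j = 0; j < P; ++j)
            for (int r = 0; r < T; ++r)
                for (int s = 0; s < T; ++s)
                    C[(i * T + r) * N + (j * T + s)] = g[i][j].c[r][s];
}
```

After P steps every core has seen all P pairs of tiles it needs, and each step only exchanges data between nearest neighbours. On the real chip the rotation would be done with writes into the neighbour's local memory, which matches the write-favoring design of the eMesh.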