An OpenCL Framework
for Heterogeneous Multicores with
Local Memory
PACT 2010
Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park,
Honggyu Kim, Thanh Tuan Dao, Yongjin Cho, Sung Jong Seo, Seung Hak Lee,
Seung Mo Cho, Hyo Jung Song, Sang-Bum Suh, and Jong-Deok Choi
School of Computer Science and Engineering, Seoul National University, Seoul 151-744, Korea
Samsung Electronics Co., Nongseo-dong, Giheung-gu, Yongin-si, Gyeonggi-do 446-712, Korea
Presenter: Jen-Jung Cheng
Outline
• Introduction
• Background
– OpenCL platform
• Design and Implementation
– OpenCL runtime
– Work-item coalescing
– Web-based variable expansion
– Preload-poststore buffering
• Evaluation
• Conclusion
Introduction(1/2)
[Figure: the target architecture. A general-purpose processor core (GPC) with an L1 and L2 cache hierarchy and multiple accelerator processor cores (APCs), each with its own local store, are connected through an interconnect bus to main memory.]
Introduction(2/2)
• Two major challenges in the design and implementation of the OpenCL framework:
– Implementing hundreds of virtual PEs with a single accelerator core and making them efficient
– Overcoming the limited size and incoherency of the local store
OpenCL platform(1/2)
The OpenCL platform model
OpenCL platform(2/2)
• OpenCL platform:
a host processor, compute devices,
compute units, and processing elements
• Abstract Index Space:
global ID, work-group ID, and local ID
• Memory Regions:
private, local, constant, and global
• Synchronization:
work-group barrier and command-queue barrier
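Below is a minimal OpenCL C kernel sketch (mine, not from the paper; the kernel name and arguments are illustrative) showing how these abstractions appear in code: the three kinds of IDs, the four memory regions, and a work-group barrier.

/* Each work-group stages its tile of the input in local memory,
 * synchronizes at a work-group barrier, and writes the tile back reversed. */
__kernel void reverse_tiles(__global const float *in,  /* global memory */
                            __global float *out,
                            __constant float *scale,   /* constant memory */
                            __local float *tile)       /* local memory, shared by one work-group */
{
  int gid = get_global_id(0);    /* global ID */
  int lid = get_local_id(0);     /* local ID within the work-group */
  int wg  = get_group_id(0);     /* work-group ID */
  int ls  = get_local_size(0);
  float x = in[gid];             /* x lives in private memory */

  tile[lid] = x;
  barrier(CLK_LOCAL_MEM_FENCE);  /* work-group barrier */

  out[wg * ls + (ls - 1 - lid)] = scale[0] * tile[lid];
}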
OpenCL runtime(1/3)
Mapping platform components to the target architecture
OpenCL runtime(2/3)
[Figure: the command scheduler and the command executor. The OpenCL runtime thread on the GPC contains the command scheduler and the command executor. The scheduler takes commands from the OpenCL host thread's command queues, tracks their execution ordering with an event queue and a DAG of event objects, and moves ready commands to the ready queue; the executor assigns the work-groups of each ready command to the device's compute units (CUs), using a CU status array to find idle CUs.]
OpenCL runtime(3/3)
• The runtime implements a software-managed cache in each APC's local store. It caches the contents of the global and constant memory.
• To guarantee OpenCL memory consistency for memory objects shared between commands, the command executor flushes the software-managed caches whenever it dequeues a command from the ready queue or removes an event object from the DAG after the associated command has completed (see the sketch below).
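A minimal sketch (mine; the types and helper names such as swc_t and swc_flush are hypothetical, and the authors' runtime code is not shown in the slides) of the two consistency points described above:

#include <stddef.h>

typedef struct { int num_dirty; /* tags, line data, ... */ } swc_t;  /* software cache in one APC's local store */
typedef struct { swc_t cache; } cu_t;                                /* one compute unit (APC) */
typedef struct { int id; /* kernel, index space, ... */ } command_t;

static void swc_flush(swc_t *c)
{
  /* write dirty lines back to global memory, then invalidate every line */
  c->num_dirty = 0;
}

static void flush_all_caches(cu_t *cus, size_t n)
{
  for (size_t i = 0; i < n; i++)
    swc_flush(&cus[i].cache);
}

/* Called by the command executor for each command it runs. */
static void run_command(command_t *cmd, cu_t *cus, size_t n)
{
  (void)cmd;
  flush_all_caches(cus, n);  /* point 1: command dequeued from the ready queue */
  /* ... assign the command's work-groups to idle CUs and wait for them ... */
  flush_all_caches(cus, n);  /* point 2: event removed from the DAG after completion */
}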
Work-item coalescing(1/3)
• Executing the work-items of a work-group on a CU by context switching from one work-item to another incurs significant overhead.
• When a kernel and its callee functions do not contain any barrier, any execution ordering between two statements from different work-items in the same work-group satisfies the OpenCL semantics.
• A work-item coalescing loop (WCL) iterates over the index space of a single work-group, so all of its work-items can be executed sequentially by a single accelerator core without context switching.
Work-item coalescing(2/3)
__kernel void vec_add(__global float *a,
                      __global float *b,
                      __global float *c)
{
  int id;
  id = get_global_id(0);
  c[id] = a[id] + b[id];
}

          |  OpenCL C source-to-source translator
          v

int __i, __j, __k;
__kernel void vec_add(__global float *a, __global float *b, __global float *c)
{
  int id;
  for (__k = 0; __k < __local_size[2]; __k++) {
    for (__j = 0; __j < __local_size[1]; __j++) {
      for (__i = 0; __i < __local_size[0]; __i++) {
        id = get_global_id(0);   /* evaluated per iteration; depends on __i, __j, __k after coalescing */
        c[id] = a[id] + b[id];
      }
    }
  }
}
Work-item coalescing(3/3)
• Barrier handling rules applied by the translator, where S' denotes the statement S enclosed in a WCR (i.e., wrapped in its own WCL):

S1; barrier(); S2;
=>  S1'; barrier(); S2';

if (C) { S1; barrier(); S2; }
=>  (t = C)'; if (t) { S1'; barrier(); S2'; }

while (C) { S1; barrier(); S2; }
=>  while (1) { (t = C)'; if (!t) break; S1'; barrier(); S2'; }

• A barrier inside a conditional or loop must be reached by either all work-items of a work-group or none, so the condition evaluates identically for all of them and a single temporary t suffices. A concrete example follows below.
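To make the rules above concrete, here is an illustrative before/after fragment (mine, not the translator's actual output; only dimension 0 of the WCL is shown) for a barrier inside a conditional.

/* Original work-item code: cond evaluates identically for every
 * work-item of the work-group, as OpenCL requires for a barrier in an if. */
if (cond) {
  tile[get_local_id(0)] = in[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);
  out[get_global_id(0)] = tile[get_local_size(0) - 1 - get_local_id(0)];
}

/* After the transformation: the condition and the two statement regions
 * become separate WCRs, each wrapped in its own work-item coalescing loop. */
for (__i = 0; __i < __local_size[0]; __i++)      /* WCL around (t = C)' */
  __t = cond;
if (__t) {
  for (__i = 0; __i < __local_size[0]; __i++)    /* WCL around S1' */
    tile[get_local_id(0)] = in[get_global_id(0)];
  barrier(CLK_LOCAL_MEM_FENCE);                  /* reached once per work-group */
  for (__i = 0; __i < __local_size[0]; __i++)    /* WCL around S2' */
    out[get_global_id(0)] = tile[get_local_size(0) - 1 - get_local_id(0)];
}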
Web-based variable expansion(1/5)
• A kernel code region that needs to be enclosed
with a WCL is called a work-item coalescing
region (WCR).
• A work-item private variable that is defined in
one WCR and used in another needs a separate
location for different work-items.
• A du-chain for a variable connects a definition of
the variable to all uses reached by the definition.
• A web for a variable is all du-chains of the
variable that contain a common use of the
variable.
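An illustrative kernel (mine, not from the slides) where expansion is needed: the barrier splits the kernel body into two WCRs, and the private variable sum is defined in the first and used in the second, so every work-item needs its own copy once the WCRs become separate coalescing loops.

__kernel void expand_example(__global const float *in,
                             __global float *out,
                             float w, float bias)
{
  float sum = in[get_global_id(0)] * w;  /* defined in WCR 1 */
  barrier(CLK_GLOBAL_MEM_FENCE);         /* splits the kernel into two WCRs */
  out[get_global_id(0)] = sum + bias;    /* used in WCR 2, so sum must be expanded */
}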
Web-based variable expansion(2/5)
[Figure: an example control-flow graph of a kernel, used on the following slides. Between Entry and Exit it contains conditionals (t1 = C1, if (t1); t2 = C2, if (t2)), a while (1) loop, a WCR with a barrier(), and several definitions (x = ...) and uses (... = x) of the work-item private variable x.]
Web-based variable expansion(3/5)
[Figure: the same control-flow graph with the du-chains of x identified, each connecting a definition of x to all the uses reached by that definition.]
Identifying du-chains
Web-based variable expansion(4/5)
[Figure: the same control-flow graph with the webs of x identified by merging the du-chains that share a common use.]
Identifying webs
Web-based variable expansion(5/5)
[Figure: the control-flow graph after variable expansion. The web of x that crosses the barrier() inside the WCR is expanded into a three-dimensional array x1[][][], allocated with x1 = malloc() near Entry and released with free(x1) before Exit; the webs of x confined to a single WCR keep using the scalar x.]
After variable expansion
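Following the figure above, this is a sketch of what the expanded code for the small kernel shown earlier could look like (the flat-array layout, the SUM macro, and the exact loop shape are my assumptions): one copy of sum per work-item, allocated before the first WCL and freed after the last one.

/* expansion of the private variable "sum" */
float *__sum = malloc(sizeof(float) *
                      __local_size[2] * __local_size[1] * __local_size[0]);
#define SUM(k, j, i) \
  __sum[((k) * __local_size[1] + (j)) * __local_size[0] + (i)]

for (__k = 0; __k < __local_size[2]; __k++)           /* WCL around WCR 1 */
  for (__j = 0; __j < __local_size[1]; __j++)
    for (__i = 0; __i < __local_size[0]; __i++)
      SUM(__k, __j, __i) = in[get_global_id(0)] * w;  /* definition, one slot per work-item */

barrier(CLK_GLOBAL_MEM_FENCE);

for (__k = 0; __k < __local_size[2]; __k++)           /* WCL around WCR 2 */
  for (__j = 0; __j < __local_size[1]; __j++)
    for (__i = 0; __i < __local_size[0]; __i++)
      out[get_global_id(0)] = SUM(__k, __j, __i) + bias;  /* use */

free(__sum);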
Preload-poststore buffering(1/4)
• Preload-poststore buffering enables gathering
DMA transfers together for array accesses and
minimizes the time spent waiting for them to
complete by overlapping them.
Preload-poststore buffering(2/4)
/* Original WCL body */
for (k = 0; k < ls[2]; k++) {
  for (j = 0; j < ls[1]; j++) {
    for (i = 0; i < ls[0]; i++) {
      if (i < 100) a[j][i] = c[j][b[i]];
      c[j][b[i]] = a[j][3*i+1] + a[j][i+1024];
    }
  }
}

/* After preload-poststore buffering */
for (k = 0; k < ls[2]; k++) {
  for (j = 0; j < ls[1]; j++) {
    PRELOAD(buf_b, &b[0], ls[0]);                 /* unit-stride section of b        */
    PRELOAD(buf_a1, &a[j][0], ls[0]+1024);        /* covers a[j][i] and a[j][i+1024] */
    for (i = 0; i < ls[0]; i++)
      PRELOAD(buf_a2[i], &a[j][3*i+1]);           /* strided section of a[j][]       */
    WAITFOR(buf_b);                               /* b supplies the gather indices   */
    for (i = 0; i < ls[0]; i++)
      PRELOAD(buf_c[i], &c[j][buf_b[i]]);         /* indirect (gather) accesses      */
    for (i = 0; i < ls[0]; i++) {
      if (i < 100) buf_a1[i] = buf_c[i];
      buf_c[i] = buf_a2[i] + buf_a1[i+1024];
    }
    POSTSTORE(buf_a1, &a[j][0], ls[0]+1024);      /* write a[j][] back               */
    for (i = 0; i < ls[0]; i++)
      POSTSTORE(buf_c[i], &c[j][buf_b[i]]);       /* scatter back to c[j][]          */
  }
}
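The slides do not define the PRELOAD, WAITFOR, and POSTSTORE primitives. As a rough sketch, on the Cell SPEs they could map onto the SDK's asynchronous MFC DMA intrinsics as below (the helper names, the byte-size parameter, and the single tag group are my assumptions, and MFC alignment and size restrictions are ignored).

#include <stdint.h>
#include <spu_mfcio.h>   /* SPE-side DMA intrinsics from the Cell SDK */

enum { BUF_TAG = 3 };    /* one MFC tag group for all buffering transfers (assumption) */

/* PRELOAD: start an asynchronous DMA from main memory (effective address ea)
 * into a local-store buffer; computation continues while the transfer runs. */
static void preload(void *ls_buf, uint64_t ea, uint32_t nbytes)
{
  mfc_get(ls_buf, ea, nbytes, BUF_TAG, 0, 0);
}

/* WAITFOR: block until the transfers issued under BUF_TAG have completed,
 * e.g. before buf_b is used to compute the gather addresses for buf_c. */
static void waitfor(void)
{
  mfc_write_tag_mask(1u << BUF_TAG);
  mfc_read_tag_status_all();
}

/* POSTSTORE: start an asynchronous DMA from the local-store buffer back to
 * main memory; the next iteration's preloads can overlap with it. */
static void poststore(void *ls_buf, uint64_t ea, uint32_t nbytes)
{
  mfc_put(ls_buf, ea, nbytes, BUF_TAG, 0, 0);
}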
Preload-poststore buffering(3/4)
• Buffering candidates for a WCL L with index i:
– an affine subscript c*i + d, where c and d are loop invariant to L
– an indirect subscript c*x + d, where x is an array reference and c and d are loop invariant to L
• The array section accessed by an affine candidate is described as [lower bound : upper bound : stride]. For example, the subscript 3*i+1 over array A accesses the section [1 : 3*ls[0]-2 : 3], i.e., A[1], A[4], A[7], A[10], A[13], ..., A[3*ls[0]-2], which is gathered into buf_a2.
Preload-poststore buffering(4/4)
• Condition for using a single buffer: two buffering candidates may share one buffer when they are related by
– a loop-independent flow dependence (read-after-write), or
– a loop-independent output dependence (write-after-write)
Evaluation(1/5)
• Experimental Setup
– an IBM QS22 Cell blade server with two 3.2GHz
PowerXCell 8i processors.
– The Cell BE processor consists of a single Power
Processor Element (PPE) and eight Synergistic
Processor Elements (SPEs).
– Fedora Linux 9
– Each SPE has 256KB of local store
Evaluation(2/5)
Applications used
Evaluation(3/5)
Speedup of the OpenCL applications.
Evaluation(4/5)
Comparison with the IBM OpenCL framework for Cell BE.
Evaluation(5/5)
• Two Intel Xeon X5660 hexa-core processors (CPU)
• An NVIDIA Tesla C1060 (GPU)
The speedup of the OpenCL applications with multicore CPUs and a GPU.
Conclusion
• This paper presents the design and implementation of an OpenCL runtime and an OpenCL C source-to-source translator that target heterogeneous multicore architectures whose accelerator cores have local memory.