OpenCL
China MCP
Agenda
• OpenCL Overview
• Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Overview: Motivation
[Figure: TI DSP application domains, including DVR/NVR and smart cameras, networking, mission-critical systems, medical imaging, video and audio infrastructure, high-performance and cloud computing, portable mobile radio, industrial imaging, home AVR and automotive audio, analytics, wireless testers, industrial control, media processing, radar and communications, and industrial electronics]
OpenCL Overview: Motivation
Many current TI DSP users:
• Are comfortable working with TI platforms
• Have large software teams and use low-level programming models for algorithmic control
• Understand DSP programming
Many customers in new markets, like High-Performance Computing:
• Are often not DSP programmers
• Are not familiar with TI proprietary software, especially in the early stages
• Are comfortable with workstation parallel programming models
It is important that customers in these new markets are comfortable leveraging TI's heterogeneous multicore offerings.
OpenCL Overview: What it is
• A framework for expressing programs in which parallel computation is dispatched to any attached heterogeneous device
• Open, standard, and royalty-free
• Consists of two components:
  1. An API for the host program to create and submit kernels for execution (a host-based generic header and a vendor-supplied library file)
  2. A cross-platform language for expressing kernels (based on C99 C, with some additions, restrictions, and built-in functions)
• Promotes portability of applications from device to device and across generations of a single device roadmap
OpenCL Overview: Where it fits in
[Diagram: Node 0 through Node N connected via MPI communication APIs]
• MPI allows expression of parallelism across nodes in a distributed system
• The MPI standardization effort began in 1992
OpenCL Overview: Where it fits in
[Diagram: each node runs OpenMP threads across its CPUs; nodes communicate via MPI communication APIs]
• OpenMP allows expression of parallelism across homogeneous, shared-memory cores
• OpenMP's first specification was in 1997
OpenCL Overview: Where it fits in
[Diagram: each node adds a GPU programmed via CUDA/OpenCL alongside its OpenMP CPU threads; nodes communicate via MPI communication APIs]
• CUDA / OpenCL can leverage parallelism across heterogeneous computing devices in a system, even with distinct memory spaces
• CUDA's first specification was in 2007
• OpenCL's first specification was in 2008
OpenCL Overview: Where it fits in
[Diagram: each node pairs its OpenMP CPU threads with a DSP programmed via OpenCL; nodes communicate via MPI communication APIs]
• Focus on OpenCL as an open alternative to CUDA
• Focus on OpenCL devices other than GPUs, such as DSPs
OpenCL Overview: Where it fits in
[Diagram: OpenCL alone controlling all the CPUs within each node; nodes communicate via MPI communication APIs]
• OpenCL is expressive enough to allow efficient control over all compute engines in a node
OpenCL Overview: Model
• Host connected to one or more OpenCL devices
  – Commands are submitted from the host to the OpenCL devices
  – The host can also be an OpenCL device
• An OpenCL device is a collection of one or more compute units (cores)
  – The OpenCL device is viewed by the programmer as a single virtual processor
  – The programmer does not need to know how many cores are in the device
  – The OpenCL runtime efficiently divides the total processing effort across the cores
• Example on the 66AK2H12 (queried in the sketch below):
  – An A15 running the OpenCL process acts as the host
  – 8 C66x DSPs are available as a single device (Accelerator type, 8 compute units)
  – 4 A15s are available as a single device (CPU type, 4 compute units)
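A minimal sketch (not TI-specific; standard OpenCL 1.1 C++ bindings) of how a host program could confirm this topology by querying each device's name and compute-unit count:

  #include <CL/cl.hpp>   // OpenCL 1.1 C++ bindings
  #include <iostream>
  #include <vector>

  int main()
  {
      // All accelerator-type devices; on a 66AK2H12 this is expected to be
      // the single 8-compute-unit DSP device described above.
      cl::Context context(CL_DEVICE_TYPE_ACCELERATOR);
      std::vector<cl::Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

      for (size_t i = 0; i < devices.size(); ++i)
          std::cout << devices[i].getInfo<CL_DEVICE_NAME>() << ": "
                    << devices[i].getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
                    << " compute units" << std::endl;
      return 0;
  }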
66AK2H12
KeyStone II Multicore DSP + ARM
[Block diagram: four ARM A15 cores and eight C66x DSP cores sharing Multicore Shared Memory]
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Usage: Platform Layer
• Platform Layer APIs allow an OpenCL application to:
  – Query the platform for OpenCL devices
  – Query OpenCL devices for their configuration and capabilities (see the sketch below)
  – Create OpenCL contexts using one or more devices

  Context context(CL_DEVICE_TYPE_ACCELERATOR);
  vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

• Context:
  – The environment within which work-items execute
  – Includes devices and their memories and command queues
• Kernels dispatched within this context will run on accelerators (DSPs)
• To run kernels on a CPU device instead, change CL_DEVICE_TYPE_ACCELERATOR to CL_DEVICE_TYPE_CPU
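Continuing the snippet above, a sketch of the capability queries this layer supports (these are standard OpenCL 1.1 query keys; the values printed are device-specific):

  // Continues from the context/devices lines above.
  for (size_t i = 0; i < devices.size(); ++i)
      std::cout << devices[i].getInfo<CL_DEVICE_NAME>() << ": "
                << devices[i].getInfo<CL_DEVICE_GLOBAL_MEM_SIZE>() << " bytes global, "
                << devices[i].getInfo<CL_DEVICE_LOCAL_MEM_SIZE>()  << " bytes local"
                << std::endl;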
Usage: Contexts & Command Queues
Typical flow:
• Query the platform for all available devices of the desired type (CPU in the code below)
• Create an OpenCL context containing those devices
• Query the context to enumerate the devices and place them in a vector

C:
  int err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
  if (err != CL_SUCCESS) { … }
  context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
  if (!context) { … }
  commands = clCreateCommandQueue(context, device_id, 0, &err);
  if (!commands) { … }

C++:
  Context context(CL_DEVICE_TYPE_CPU);
  std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
  CommandQueue Q(context, devices[0]);
Usage: Execution Model
• OpenCL C kernel
  – The basic unit of executable code on a device, similar to a C function
  – Can be data-parallel or task-parallel
• OpenCL C program
  – A collection of kernels and other functions
• OpenCL applications queue kernel execution instances
  – The application defines command queues
    • A command queue is tied to a specific device
    • Any or all devices may have command queues
  – The application enqueues kernels to these queues
  – Kernels then run asynchronously to the main application thread
  – Queues can be defined to execute in-order or to allow out-of-order execution (see the sketch below)
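A sketch of both queue flavors using the C++ bindings (CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE is a standard OpenCL 1.1 property, but whether a given device supports it is implementation-dependent):

  // In-order queue (the default): commands complete in submission order.
  CommandQueue inOrderQ(context, devices[0]);

  // Out-of-order queue: the runtime may reorder independent commands,
  // so any required ordering must be imposed explicitly with events.
  CommandQueue outOfOrderQ(context, devices[0],
                           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);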
Usage: Data Kernel Execution
Kernel enqueuing is a combination of:
1. An OpenCL C kernel definition (expressing the algorithm for one work-item)
2. A description of the total number of work-items required for the kernel

  CommandQueue Q(context, devices[0]);
  Kernel kernel(program, "mpy2");
  Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024));

  kernel void mpy2(global int *p)
  {
      int i = get_global_id(0);
      p[i] *= 2;
  }

Work-items for a kernel execution are grouped into workgroups:
– A workgroup is executed by a compute unit (core)
– The size of a workgroup can be specified, or left to the runtime to define
– Different workgroups can execute asynchronously across multiple cores

  Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024), NDRange(128));

• The line above enqueues the kernel with 1024 work-items grouped in workgroups of 128 work-items each
• 1024 / 128 => 8 workgroups, which could execute simultaneously on 8 cores
Usage: Execution Order
Work-Items & Workgroups
• The execution order of work-items in a workgroup is not defined by the spec.
  – Portable OpenCL code must assume they could all execute concurrently.
  – GPU implementations typically execute work-items within a workgroup concurrently.
  – CPU / DSP implementations typically serialize work-items within a workgroup.
  – OpenCL C barrier instructions can be used to ensure that all work-items in a workgroup reach the barrier before any work-item in the workgroup proceeds past it (see the sketch below).
• The execution order of workgroups associated with one kernel execution is not defined by the spec.
  – Portable OpenCL code must assume any order is valid.
  – No mechanism exists in OpenCL to synchronize or order workgroups.
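A minimal sketch of the barrier idiom (the kernel is hypothetical): every work-item writes its slot of local memory, and the barrier guarantees all writes are visible before any work-item reads a neighbor's slot.

  kernel void rotate_left(global int *in, global int *out, local int *tmp)
  {
      int lid = get_local_id(0);

      tmp[lid] = in[get_global_id(0)];

      // No work-item passes this point until all work-items in the
      // workgroup have reached it (and their local writes are visible).
      barrier(CLK_LOCAL_MEM_FENCE);

      out[get_global_id(0)] = tmp[(lid + 1) % get_local_size(0)];
  }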
Usage: Example
OpenCL Host Code:

  Context context(CL_DEVICE_TYPE_ACCELERATOR);
  vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
  Program program(context, devices, source);
  program.build(devices);
  Buffer buf(context, CL_MEM_READ_WRITE, sizeof(input));
  Kernel kernel(program, "mpy2");
  kernel.setArg(0, buf);
  CommandQueue Q(context, devices[0]);
  Q.enqueueWriteBuffer(buf, CL_TRUE, 0, sizeof(input), input);
  Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
  Q.enqueueReadBuffer(buf, CL_TRUE, 0, sizeof(input), input);

OpenCL Kernel:

  kernel void mpy2(global int *p)
  {
      int i = get_global_id(0);
      p[i] *= 2;
  }

• The host code uses the optional OpenCL C++ bindings
  – It creates a buffer and a kernel, sets the arguments, writes the buffer, invokes the kernel, and reads the buffer.
• The kernel is purely algorithmic
  – No dealing with DMAs, cache flushing, communication protocols, etc.
Usage: Compiling & Linking
• When compiling, tell gcc where the headers are:
    gcc -I$TI_OCL_INSTALL/include …
• Link with the TI OpenCL library as:
    gcc <obj files> -L$TI_OCL_INSTALL/lib -lTIOpenCL …
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Memory Model: Overview
• Private Memory
  – Per work-item
  – Typically registers
• Local Memory
  – Shared within a workgroup
  – Local to a compute unit (core)
• Global / Constant Memory
  – Shared across all compute units (cores) in a device
• Host Memory
  – Attached to the host CPU
  – Can be distinct from global memory: read / write buffer model
  – Can be the same as global memory: map / unmap buffer model
[Diagram: within a compute device, each work-item has private memory; work-items in a workgroup share local memory; all workgroups share the device's global/constant memory; host memory is attached to the host]
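These regions map directly onto OpenCL C address-space qualifiers; a hypothetical kernel touching each one:

  kernel void spaces(global int *g, constant int *c, local int *l)
  {
      int x = c[0];               // x: private memory (typically a register)
      l[get_local_id(0)] = x;     // local: shared within this workgroup
      barrier(CLK_LOCAL_MEM_FENCE);
      g[get_global_id(0)] += l[get_local_id(0)];  // global: shared device-wide
  }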
OpenCL Memory: Resources
• Buffers
  – Simple chunks of memory
  – Kernels can access them however they like (arrays, pointers, structs)
  – Kernels can read and write buffers
• Images
  – Opaque 2D or 3D formatted data structures
  – Kernels access them only via read_image() and write_image()
  – Each image can be read or written in a kernel, but not both
  – Only required for GPU devices!
OpenCL Memory: Distinct Host and Global Device Memory

1. char *ary = (char*)malloc(globsz);
2. for (int i = 0; i < globsz; i++) ary[i] = i;
3. Buffer buf(context, CL_MEM_READ_WRITE, globsz);
4. Q.enqueueWriteBuffer(buf, CL_TRUE, 0, globsz, ary);
5. Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
6. Q.enqueueReadBuffer(buf, CL_TRUE, 0, globsz, ary);
7. for (int i = 0; i < globsz; i++) … = ary[i];

Host memory holds 0,1,2,3,… after step 2 and 0,2,4,6,… after step 6; device global memory holds 0,1,2,3,… after step 4 and 0,2,4,6,… after the kernel runs.
OpenCL Memory: Shared Host and Global Device Memory

1. Buffer buf(context, CL_MEM_READ_WRITE, globsz);
2. char *ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, globsz);
3. for (int i = 0; i < globsz; i++) ary[i] = i;
4. Q.enqueueUnmapMemObject(buf, ary);
5. Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(globSz), NDRange(wgSz));
6. ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, globsz);
7. for (int i = 0; i < globsz; i++) … = ary[i];
8. Q.enqueueUnmapMemObject(buf, ary);

The shared host + device global memory holds 0,1,2,3,… after step 3 and 0,2,4,6,… after the kernel runs. Ownership of the buffer alternates: the host owns it while mapped (steps 2-4 and 6-8), and the device owns it while unmapped (step 5).
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Synchronization
• Kernel execution is defined as the execution and completion of all work-items associated with an enqueued kernel command
• Kernel executions can synchronize at their boundaries through OpenCL events at the host API level (see the sketch below)
• Within a workgroup, work-items can synchronize through barriers and fences, expressed as OpenCL C built-in functions
• Workgroups cannot synchronize with other workgroups
• Work-items in different workgroups cannot synchronize
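A minimal sketch of event-based ordering between two kernel executions on an out-of-order queue (the Kernel objects produce and consume are hypothetical; on an in-order queue this ordering is implicit):

  Event producedEv;
  std::vector<Event> deps;

  // Enqueue the producer kernel and capture its completion event.
  Q.enqueueNDRangeKernel(produce, NullRange, NDRange(globSz), NDRange(wgSz),
                         NULL, &producedEv);

  // The consumer kernel may not start until the producer has completed.
  deps.push_back(producedEv);
  Q.enqueueNDRangeKernel(consume, NullRange, NDRange(globSz), NDRange(wgSz),
                         &deps, NULL);

  // Block the host until everything enqueued has finished.
  Q.finish();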
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
OpenCL Operational Flow
Sequence of host-side calls and their runtime / DSP-side effects:
• Context context; → reset the DSPs, download the monitor program, and start the DSPs
• CommandQueue Q; → establish the mailbox to DSP 0, and start a host thread to monitor this queue and mailbox
• Buffer buffer; → allocate space in DSP DDR
• Program program; program.build(); → if the program has already been compiled and is cached, reuse it; else cross-compile it on the host for execution on the DSP, then load the program
• Kernel kernel(program, "kname"); → establish kname as an entry point in the program
• Q.enqueueNDRangeKernel(kernel); → create a dispatch packet for the DSP, send it, break the kernel into workgroups, and send workgroups to all cores; as each core finishes, cache operations run and completion is signaled back to the host
Note: Items are shown occurring at their earliest point, but are often lazily executed at first need.
Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability
TI OpenCL 1.1 Products*

TMS320C6678 PCIe cards (each TMS320C6678 device: 8 C66 DSPs with 1GB DDR3):
• Advantech DSPC8681 with four 8-core DSPs
• Advantech DSPC8682 with eight 8-core DSPs
• Each 8-core DSP is an OpenCL device
• Ubuntu Linux PC as the OpenCL host
• OpenCL in limited-distribution alpha
• GA approx. end of Q1 2014

66AK2H12 (KeyStone II Multicore DSP + ARM):
• OpenCL on a chip
• 4 ARM A15s running Linux as the OpenCL host
• 8-core DSP as an OpenCL device
• 6MB on-chip shared memory
• Up to 10GB attached DDR3
• GA approx. end of Q1 2014

* Product is based on a published Khronos Specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.
BACKUP
KeyStone OpenCL
Usage: Vector Sum Reduction Example

  int acc = 0;
  for (int i = 0; i < N; ++i) acc += buffer[i];
  return acc;

• Sequential in nature
• Not parallel
Usage: Example // Vector Sum Reduction

  kernel void sum_reduce(global float *buffer, global float *result)
  {
      int gid = get_global_id(0); // which work-item am I of all work-items
      int lid = get_local_id(0);  // which work-item am I within my workgroup

      for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
      {
          if (lid < offset) buffer[gid] += buffer[gid + offset];
          barrier(CLK_GLOBAL_MEM_FENCE);
      }
      if (lid == 0) result[get_group_id(0)] = buffer[gid];
  }
Usage: Example // Vector Sum Reduction (Iterative DSP)

  kernel void sum_reduce(global float *buffer, local float *acc, global float *result)
  {
      int  gid      = get_global_id(0); // which work-item am I out of all work-items
      int  lid      = get_local_id(0);  // which work-item am I within my workgroup
      bool first_wi = (lid == 0);
      bool last_wi  = (lid == get_local_size(0) - 1);
      int  wg_index = get_group_id(0);  // which workgroup am I

      if (first_wi) acc[wg_index] = 0;
      acc[wg_index] += buffer[gid];
      if (last_wi) result[wg_index] = acc[wg_index];
  }

• Not valid on a GPU
• Could be valid on a device that serializes work-items within a workgroup, e.g. a DSP
OpenCL Memory: // Vector Sum Reduction (Local Memory)

  kernel void sum_reduce(global float *buffer, local float *scratch, global float *result)
  {
      int lid = get_local_id(0); // which work-item am I within my workgroup

      scratch[lid] = buffer[get_global_id(0)];
      barrier(CLK_LOCAL_MEM_FENCE);

      for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
      {
          if (lid < offset) scratch[lid] += scratch[lid + offset];
          barrier(CLK_LOCAL_MEM_FENCE);
      }
      if (lid == 0) result[get_group_id(0)] = scratch[lid];
  }
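All three reduction kernels above leave one partial sum per workgroup in result; a host-side sketch to finish the job (numWGs and resultBuf are assumed names for the workgroup count and the result buffer):

  // Read back the per-workgroup partial sums (blocking) and accumulate.
  std::vector<float> partial(numWGs);
  Q.enqueueReadBuffer(resultBuf, CL_TRUE, 0, numWGs * sizeof(float), &partial[0]);

  float total = 0.0f;
  for (int i = 0; i < numWGs; ++i) total += partial[i];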