OpenCL
China MCP

Agenda
• OpenCL Overview
• OpenCL Usage
• Memory Model
• Synchronization
• Operational Flow
• Availability

OpenCL Overview: Motivation
(Diagram: target markets and applications, including DVR/NVR and smart cameras, networking, mission-critical systems, medical imaging, video and audio infrastructure, high-performance and cloud computing, portable mobile radio, industrial imaging, home AVR and automotive audio, analytics, wireless testers, industrial control, media processing, radar and communications, and industrial electronics.)

OpenCL Overview: Motivation
Many current TI DSP users:
• Comfortable working with TI platforms
• Large software teams, using low-level programming models for algorithmic control
• Understand DSP programming
Many customers in new markets, such as High-Performance Computing:
• Often not DSP programmers
• Not familiar with TI proprietary software, especially in the early stages
• Comfortable with workstation parallel programming models
It is important that customers in these new markets are comfortable leveraging TI's heterogeneous multicore offerings.

OpenCL Overview: What it is
• Framework for expressing programs in which parallel computation is dispatched to any attached heterogeneous device
• Open, standard and royalty-free
• Consists of two components:
  1. API for the host program to create and submit kernels for execution (host-based generic header and vendor-supplied library file)
  2. Cross-platform language for expressing kernels, based on C99 C with some additions, restrictions and built-in functions (a small kernel sketch appears later in this overview)
• Promotes portability of applications from device to device and across generations of a single device roadmap

OpenCL Overview: Where it fits in
(Diagram: Node 0 through Node N connected by MPI communication APIs.)
• MPI allows expression of parallelism across nodes in a distributed system
• MPI's first specification was in 1992

OpenCL Overview: Where it fits in
(Diagram: each node contains CPUs running OpenMP threads; nodes are connected by MPI communication APIs.)
• OpenMP allows expression of parallelism across homogeneous, shared-memory cores
• OpenMP's first specification was in 1997

OpenCL Overview: Where it fits in
(Diagram: each node adds a GPU driven by CUDA/OpenCL alongside the CPUs running OpenMP threads.)
• CUDA / OpenCL can leverage parallelism across heterogeneous computing devices in a system, even with distinct memory spaces
• CUDA's first specification was in 2007
• OpenCL's first specification was in 2008

OpenCL Overview: Where it fits in
(Diagram: each node contains CPUs running OpenMP threads and a DSP driven by OpenCL.)
• Focus on OpenCL as an open alternative to CUDA
• Focus on OpenCL devices other than GPUs, such as DSPs

OpenCL Overview: Where it fits in
(Diagram: each node contains CPUs, all under OpenCL control, with MPI communication between nodes.)
• OpenCL is expressive enough to allow efficient control over all compute engines in a node.
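Before looking at the OpenCL model, here is a small, hypothetical kernel (not taken from the original slides) illustrating what the kernel language adds to C99: the kernel and global qualifiers, built-in work-item functions such as get_global_id(), and vector types such as float4.

    /* Hypothetical OpenCL C kernel: each work-item scales one float4 element. */
    kernel void scale4(global float4* data, const float factor)
    {
        int i = get_global_id(0);   /* built-in: global index of this work-item */
        data[i] *= factor;          /* float4 arithmetic is a language addition  */
    }

A host program would enqueue this kernel over N work-items exactly as shown in the Usage section that follows.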
OpenCL Overview: Model
• Host connected to one or more OpenCL devices
  – Commands are submitted from the host to the OpenCL devices
  – The host can also be an OpenCL device
• An OpenCL device is a collection of one or more compute units (cores)
  – The OpenCL device is viewed by the programmer as a single virtual processor
  – The programmer does not need to know how many cores are in the device
  – The OpenCL runtime efficiently divides the total processing effort across the cores
• Example on the 66AK2H12:
  – An A15 running the OpenCL process acts as the host
  – 8 C66x DSP cores available as a single device (Accelerator type, 8 compute units)
  – 4 A15 cores available as a single device (CPU type, 4 compute units)
(Diagram: 66AK2H12 KeyStone II Multicore DSP + ARM, with four ARM A15 cores, eight C66x DSP cores and Multicore Shared Memory.)

Agenda • OpenCL Overview • OpenCL Usage • Memory Model • Synchronization • Operational Flow • Availability

OpenCL Usage: Platform Layer
• Platform Layer APIs allow an OpenCL application to:
  – Query the platform for OpenCL devices
  – Query OpenCL devices for their configuration and capabilities
  – Create OpenCL contexts using one or more devices

    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

• Context:
  – Environment within which work-items execute
  – Includes devices, their memories and command queues
• Kernels dispatched within this context will run on accelerators (DSPs)
• To change the program to run kernels on a CPU device instead, change CL_DEVICE_TYPE_ACCELERATOR to CL_DEVICE_TYPE_CPU

Usage: Contexts & Command Queues
Typical flow:
• Query the platform for all available accelerator devices
• Create an OpenCL context containing all those devices
• Query the context to enumerate the devices and place them in a vector

C:
    int err = clGetDeviceIDs(NULL, CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);
    if (err != CL_SUCCESS) { … }
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    if (!context) { … }
    commands = clCreateCommandQueue(context, device_id, 0, &err);
    if (!commands) { … }

C++:
    Context context(CL_DEVICE_TYPE_CPU);
    std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    CommandQueue Q(context, devices[0]);

Usage: Execution Model
• OpenCL C kernel
  – Basic unit of executable code on a device, similar to a C function
  – Can be data-parallel or task-parallel
• OpenCL C program
  – Collection of kernels and other functions
• OpenCL applications queue kernel execution instances
  – The application defines command queues
    • A command queue is tied to a specific device
    • Any/all devices may have command queues
  – The application enqueues kernels to these queues
  – Kernels then run asynchronously to the main application thread
  – Queues can be defined to execute in-order or to allow out-of-order execution
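To make the queueing options above concrete, here is a hedged sketch (hypothetical code using the OpenCL C++ bindings; the context and an already-built kernel are assumed to exist) that creates one default in-order queue and one out-of-order queue, then uses an event to wait for a kernel enqueued on the out-of-order queue.

    // Hypothetical sketch of the command-queue options described above.
    #include <CL/cl.hpp>
    #include <vector>
    using namespace cl;

    void enqueue_examples(Context& context, Kernel& kernel)
    {
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        // Default queue: commands execute in the order they are enqueued.
        CommandQueue inOrderQ(context, devices[0]);

        // Out-of-order queue: the runtime may overlap independent commands,
        // so dependencies must be expressed explicitly (e.g. with events).
        CommandQueue oooQ(context, devices[0],
                          CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);

        // Kernels run asynchronously to the host thread; an event lets the
        // host (or a later command) wait for this kernel execution to finish.
        Event done;
        oooQ.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024), NDRange(128),
                                  NULL, &done);
        done.wait();
    }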
Usage: Data Kernel Execution
Kernel enqueuing is a combination of:
1. An OpenCL C kernel definition (expressing an algorithm for a work-item)
2. A description of the total number of work-items required for the kernel

    CommandQueue Q(context, devices[0]);
    Kernel kernel(program, "mpy2");
    Q.enqueueNDRangeKernel(kernel, NDRange(1024));

    kernel void mpy2(global int *p)
    {
        int i = get_global_id(0);
        p[i] *= 2;
    }

Work-items for a kernel execution are grouped into workgroups:
  – A workgroup is executed by a compute unit (core)
  – The size of a workgroup can be specified, or left to the runtime to define
  – Different workgroups can execute asynchronously across multiple cores

    Q.enqueueNDRangeKernel(kernel, NDRange(1024), NDRange(128));

• The code line above enqueues a kernel with 1024 work-items grouped into workgroups of 128 work-items each
• 1024/128 => 8 workgroups, which could execute simultaneously on 8 cores

Usage: Execution Order (Work-Items & Workgroups)
• The execution order of work-items in a workgroup is not defined by the spec.
  – Portable OpenCL code must assume they could all execute concurrently.
  – GPU implementations typically execute work-items within a workgroup concurrently.
  – CPU / DSP implementations typically serialize work-items within a workgroup.
  – OpenCL C barrier instructions can be used to ensure that all work-items in a workgroup reach the barrier before any work-items in the workgroup proceed past the barrier.
• The execution order of workgroups associated with one kernel execution is not defined by the spec.
  – Portable OpenCL code must assume any order is valid.
  – No mechanism exists in OpenCL to synchronize or order workgroups.

Usage: Example
OpenCL Host Code:
    Context context(CL_DEVICE_TYPE_ACCELERATOR);
    vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
    Program program(context, devices, source);
    program.build(devices);
    Buffer buf(context, CL_MEM_READ_WRITE, sizeof(input));
    Kernel kernel(program, "mpy2");
    kernel.setArg(0, buf);
    CommandQueue Q(context, devices[0]);
    Q.enqueueWriteBuffer (buf, CL_TRUE, 0, sizeof(input), input);
    Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
    Q.enqueueReadBuffer  (buf, CL_TRUE, 0, sizeof(input), input);

OpenCL Kernel:
    kernel void mpy2(global int *p)
    {
        int i = get_global_id(0);
        p[i] *= 2;
    }

• Host code uses the optional OpenCL C++ bindings
  – Creates a buffer and a kernel, sets the arguments, writes the buffer, invokes the kernel and reads the buffer.
• Kernel is purely algorithmic
  – No dealing with DMAs, cache flushing, communication protocols, etc.
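For reference, here is a self-contained sketch of the example above that can be compiled and run end to end. The kernel source is passed as a string; the 1024-element size and 128-work-item workgroups are assumed values, not something mandated by the slides.

    // Self-contained version of the host-code example (assumed sizes: 1024 / 128).
    #include <CL/cl.hpp>
    #include <cstdio>
    #include <cstring>
    #include <utility>
    #include <vector>
    using namespace cl;

    const char* kernStr =
        "kernel void mpy2(global int *p) "
        "{ int i = get_global_id(0); p[i] *= 2; }";

    int main()
    {
        int input[1024];
        for (int i = 0; i < 1024; ++i) input[i] = i;

        Context context(CL_DEVICE_TYPE_ACCELERATOR);
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();

        // Build the program from the kernel source string.
        Program::Sources source(1, std::make_pair(kernStr, strlen(kernStr)));
        Program program(context, source);
        program.build(devices);

        Buffer buf(context, CL_MEM_READ_WRITE, sizeof(input));
        Kernel kernel(program, "mpy2");
        kernel.setArg(0, buf);

        CommandQueue Q(context, devices[0]);
        Q.enqueueWriteBuffer (buf, CL_TRUE, 0, sizeof(input), input);
        Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(1024), NDRange(128));
        Q.enqueueReadBuffer  (buf, CL_TRUE, 0, sizeof(input), input);

        printf("input[3] = %d\n", input[3]);   // expect 6
        return 0;
    }

It compiles and links exactly as described on the next slide.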
Usage: Compiling & Linking
• When compiling, tell gcc where the headers are:
    gcc -I$TI_OCL_INSTALL/include …
• Link with the TI OpenCL library as:
    gcc <obj files> -L$TI_OCL_INSTALL/lib -lTIOpenCL …

Agenda • OpenCL Overview • OpenCL Usage • Memory Model • Synchronization • Operational Flow • Availability

OpenCL Memory Model: Overview
• Private Memory
  – Per work-item
  – Typically registers
• Local Memory
  – Shared within a workgroup
  – Local to a compute unit (core)
• Global/Constant Memory
  – Shared across all compute units (cores) in a device
• Host Memory
  – Attached to the host CPU
  – Can be distinct from global memory (Read / Write buffer model)
  – Can be the same as global memory (Map / Unmap buffer model)
(Diagram: each work-item has private memory; work-items within a workgroup share local memory; workgroups on a compute device share global/constant memory; the host has its own host memory.)

OpenCL Memory: Resources
• Buffers
  – Simple chunks of memory
  – Kernels can access them however they like (arrays, pointers, structs)
  – Kernels can read and write buffers
• Images
  – Opaque 2D or 3D formatted data structures
  – Kernels access them only via read_image() and write_image()
  – Each image can be read or written in a kernel, but not both
  – Only required for GPU devices!

OpenCL Memory: Distinct Host and Global Device Memory
    1. char *ary = (char*)malloc(globsz);
    2. for (int i = 0; i < globsz; i++) ary[i] = i;
    3. Buffer buf (context, CL_MEM_READ_WRITE, globsz);
    4. Q.enqueueWriteBuffer (buf, CL_TRUE, 0, globsz, ary);
    5. Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
    6. Q.enqueueReadBuffer  (buf, CL_TRUE, 0, globsz, ary);
    7. for (int i = 0; i < globsz; i++) … = ary[i];
(Diagram: host memory holds 0,1,2,3,… and, after the kernel runs and the buffer is read back, 0,2,4,6,…; device global memory holds its own copy of the same data.)

OpenCL Memory: Shared Host and Global Device Memory
    1. Buffer buf (context, CL_MEM_READ_WRITE, globsz);
    2. char* ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_WRITE, 0, globsz);
    3. for (int i = 0; i < globsz; i++) ary[i] = i;
    4. Q.enqueueUnmapMemObject(buf, ary);
    5. Q.enqueueNDRangeKernel(kernel, NDRange(globSz), NDRange(wgSz));
    6. ary = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, globsz);
    7. for (int i = 0; i < globsz; i++) … = ary[i];
    8. Q.enqueueUnmapMemObject(buf, ary);
(Diagram: a single shared host + device global memory region holds 0,1,2,3,… and then 0,2,4,6,…; the unmap before the kernel passes ownership from host to device, and the map afterwards passes it back.)
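A related pattern, sketched below purely as an illustration (standard OpenCL, not TI-specific guidance from these slides), is to create the buffer with CL_MEM_USE_HOST_PTR so the runtime may use memory the host already owns. Whether a copy is avoided is implementation-defined, so map/unmap is still used to bracket host access.

    // Hypothetical sketch: wrapping existing host memory in an OpenCL buffer.
    #include <CL/cl.hpp>
    #include <cstdlib>
    #include <vector>
    using namespace cl;

    int main()
    {
        const size_t globsz = 1024;
        char* ary = (char*)malloc(globsz);          // host-owned storage
        for (size_t i = 0; i < globsz; ++i) ary[i] = (char)i;

        Context context(CL_DEVICE_TYPE_ACCELERATOR);
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
        CommandQueue Q(context, devices[0]);

        // CL_MEM_USE_HOST_PTR asks the runtime to use 'ary' as the buffer's
        // storage; whether it copies or aliases the memory is up to the
        // implementation. 'ary' must stay valid for the buffer's lifetime.
        Buffer buf(context, CL_MEM_READ_WRITE | CL_MEM_USE_HOST_PTR, globsz, ary);

        // A kernel would be enqueued here, exactly as in the earlier examples.

        // Map/unmap still brackets host access so the host and device views
        // of the buffer remain coherent.
        char* p = (char*)Q.enqueueMapBuffer(buf, CL_TRUE, CL_MAP_READ, 0, globsz);
        // ... read results through p ...
        Q.enqueueUnmapMemObject(buf, p);
        return 0;
    }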
Agenda • OpenCL Overview • OpenCL Usage • Memory Model • Synchronization • Operational Flow • Availability

OpenCL Synchronization
• Kernel execution is defined to be the execution and completion of all work-items associated with an enqueued kernel command
• Kernel executions can synchronize at their boundaries through OpenCL events at the host API level
• Within a workgroup, work-items can synchronize through barriers and fences, expressed as OpenCL C built-in functions
• Workgroups cannot synchronize with other workgroups
• Work-items in different workgroups cannot synchronize
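Here is a hedged sketch of host-level synchronization through events; the kernels k1 and k2 and the 1024/128 sizes are hypothetical stand-ins. The second enqueue lists the first kernel's completion event as a dependency, so it cannot start before the first finishes, even on an out-of-order queue.

    // Hypothetical sketch: ordering two kernel executions with OpenCL events.
    #include <CL/cl.hpp>
    #include <vector>
    using namespace cl;

    void run_in_order(CommandQueue& Q, Kernel& k1, Kernel& k2)
    {
        Event k1_done;
        Q.enqueueNDRangeKernel(k1, NullRange, NDRange(1024), NDRange(128),
                               NULL, &k1_done);

        // k2 waits on k1's completion event before it is allowed to start.
        std::vector<Event> deps(1, k1_done);
        Event k2_done;
        Q.enqueueNDRangeKernel(k2, NullRange, NDRange(1024), NDRange(128),
                               &deps, &k2_done);

        k2_done.wait();   // block the host until both kernel executions complete
    }

Barriers and fences within a workgroup are shown in the reduction kernels in the backup section.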
Agenda • OpenCL Overview • OpenCL Usage • Memory Model • Synchronization • Operational Flow • Availability

OpenCL Operational Flow
Host API call and the corresponding actions on the host, DSP and DDR:
• Context context;                   → reset the DSPs, download the monitor program, start the DSPs
• CommandQueue Q;                    → establish a mailbox; start a host thread to monitor this queue and the mailbox
• Buffer buffer;                     → allocate space in DDR
• Program program; program.build();  → check whether the program has already been compiled and cached and reuse it if so; otherwise cross-compile the program on the host for execution on the DSP; load the program
• Kernel kernel(program, "kname");   → establish kname as an entry point in the program
• Q.enqueueNDRangeKernel(kernel)     → create a dispatch packet for the DSP and send it; break the kernel into workgroups and send workgroups to all cores; each core runs its workgroups, performs cache operations and signals DONE back to the host
Note: items are shown occurring at their earliest point, but are often lazily executed when first needed.

Agenda • OpenCL Overview • OpenCL Usage • Memory Model • Synchronization • Operational Flow • Availability

TI OpenCL 1.1 Products

Advantech cards with TMS320C6678 DSPs
(Diagram: four or eight TMS320C6678 devices, each with 8 C66x DSP cores and 1 GB of DDR3.)
• Advantech DSPC8681 with four 8-core DSPs
• Advantech DSPC8682 with eight 8-core DSPs
• Each 8-core DSP is an OpenCL device
• Ubuntu Linux PC acts as the OpenCL host
• OpenCL in limited-distribution Alpha
• GA approx. end of Q1 2014

66AK2H12 KeyStone II Multicore DSP + ARM
(Diagram: four ARM A15 cores, eight C66x DSP cores and Multicore Shared Memory on one chip.)
• OpenCL on a chip
• 4 ARM A15s running Linux as the OpenCL host
• 8-core DSP as an OpenCL device
• 6 MB of on-chip shared memory
• Up to 10 GB of attached DDR3
• GA approx. end of Q1 2014

* Product is based on a published Khronos Specification, and is expected to pass the Khronos Conformance Testing Process. Current conformance status can be found at www.khronos.org/conformance.

BACKUP: KeyStone OpenCL

Usage: Vector Sum Reduction Example
    int acc = 0;
    for (int i = 0; i < N; ++i) acc += buffer[i];
    return acc;
• Sequential in nature
• Not parallel

Usage: Example
    // Vector Sum Reduction
    kernel void sum_reduce(global float* buffer, global float* result)
    {
        int gid = get_global_id(0); // which work-item am I of all work-items
        int lid = get_local_id(0);  // which work-item am I within my workgroup

        for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
        {
            if (lid < offset) buffer[gid] += buffer[gid + offset];
            barrier(CLK_GLOBAL_MEM_FENCE);
        }

        if (lid == 0) result[get_group_id(0)] = buffer[gid];
    }

Usage: Example
    // Vector Sum Reduction (Iterative DSP)
    kernel void sum_reduce(global float* buffer, local float* acc, global float* result)
    {
        int  gid      = get_global_id(0);  // which work-item am I out of all work-items
        int  lid      = get_local_id(0);   // which work-item am I within my workgroup
        bool first_wi = (lid == 0);
        bool last_wi  = (lid == get_local_size(0) - 1);
        int  wg_index = get_group_id(0);   // which workgroup am I

        if (first_wi) acc[wg_index] = 0;
        acc[wg_index] += buffer[gid];
        if (last_wi) result[wg_index] = acc[wg_index];
    }
• Not valid on a GPU
• Could be valid on a device that serializes work-items in a workgroup, i.e. a DSP

OpenCL Memory:
    // Vector Sum Reduction
    kernel void sum_reduce(global float* buffer, local float* scratch, global float* result)
    {
        int lid = get_local_id(0); // which work-item am I within my workgroup

        scratch[lid] = buffer[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1)
        {
            if (lid < offset) scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }

        if (lid == 0) result[get_group_id(0)] = scratch[lid];
    }
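Each kernel above leaves one partial sum per workgroup in result, so the host finishes the reduction. Below is a hypothetical host-side driver for the two-argument sum_reduce variant (the local-memory variants would additionally need a local scratch argument set from the host); N = 4096 and the workgroup size of 128 are assumed values, and the kernel source from the slide is repeated as a string for completeness.

    // Hypothetical host-side driver for the two-argument sum_reduce kernel.
    #include <CL/cl.hpp>
    #include <cstdio>
    #include <cstring>
    #include <utility>
    #include <vector>
    using namespace cl;

    const char* kernSrc =
        "kernel void sum_reduce(global float* buffer, global float* result)    \n"
        "{                                                                     \n"
        "  int gid = get_global_id(0);                                         \n"
        "  int lid = get_local_id(0);                                          \n"
        "  for (int offset = get_local_size(0) >> 1; offset > 0; offset >>= 1) \n"
        "  {                                                                   \n"
        "    if (lid < offset) buffer[gid] += buffer[gid + offset];            \n"
        "    barrier(CLK_GLOBAL_MEM_FENCE);                                    \n"
        "  }                                                                   \n"
        "  if (lid == 0) result[get_group_id(0)] = buffer[gid];                \n"
        "}                                                                     \n";

    int main()
    {
        const int N = 4096, WG = 128, NUM_WGS = N / WG;
        std::vector<float> data(N, 1.0f);       // expected total: 4096.0
        std::vector<float> partial(NUM_WGS);

        Context context(CL_DEVICE_TYPE_ACCELERATOR);
        std::vector<Device> devices = context.getInfo<CL_CONTEXT_DEVICES>();
        CommandQueue Q(context, devices[0]);

        Program::Sources source(1, std::make_pair(kernSrc, strlen(kernSrc)));
        Program program(context, source);
        program.build(devices);

        Buffer bufData  (context, CL_MEM_READ_WRITE, N * sizeof(float));
        Buffer bufResult(context, CL_MEM_WRITE_ONLY, NUM_WGS * sizeof(float));
        Kernel kernel(program, "sum_reduce");
        kernel.setArg(0, bufData);
        kernel.setArg(1, bufResult);

        Q.enqueueWriteBuffer(bufData, CL_TRUE, 0, N * sizeof(float), &data[0]);
        Q.enqueueNDRangeKernel(kernel, NullRange, NDRange(N), NDRange(WG));
        Q.enqueueReadBuffer(bufResult, CL_TRUE, 0, NUM_WGS * sizeof(float), &partial[0]);

        // One partial sum per workgroup; finish the reduction on the host.
        float total = 0.0f;
        for (int i = 0; i < NUM_WGS; ++i) total += partial[i];
        printf("sum = %f\n", total);
        return 0;
    }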