

Developing Embedded Hybrid Code Using OpenCL

Open Computing Language (OpenCL) is a specification and a programming framework for managing heterogeneous computing cores such as CPUs and graphics processing units (GPUs) to accelerate computationally intensive algorithms.

by Mark Benson, Director of Software Strategy, Logic PD

In recent years, the source of incremental computational performance has shifted from higher clock speeds to a proliferation of processing cores. This shift, driven primarily by undesirable quantum effects at higher signaling speeds and by practical limits on the rate at which heat can be dissipated, has accelerated the development of new software techniques. These techniques allow us to leverage not only homogeneous multicore CPUs, but also graphics accelerators, digital signal processors (DSPs) and field-programmable gate arrays (FPGAs) as general-purpose computing blocks for algorithms hungry for ever-higher computational performance.

Proposed by Apple and maintained by the Khronos Group, OpenCL was created to provide a portable open programming framework that enables software to take advantage of both multicore CPUs and specialized processing cores, most notably GPUs, for non-graphical processing in a highly parallel way.

OpenCL is similar to OpenGL in that it is a device-agnostic open standard that anyone can adopt and use to create a custom implementation. OpenCL was designed to work with OpenGL in that data can be shared between frameworks—data can be crunched with OpenCL and subsequently displayed using OpenGL.

The OpenCL specification was developed by a working group formed in 2008, chaired by Nvidia, and edited by Apple. Since then, backward-compatible revisions of the OpenCL specification have been released along with a set of conformance tests that can be used to demonstrate compliance.

Conformant implementations of OpenCL for a given processor are available primarily from the silicon vendor (Altera, AMD, ARM, Freescale, Imagination Technologies, Intel, Nvidia, Texas Instruments, Xilinx, etc.). An OpenCL driver from the vendor is required for the OpenCL framework to run on that vendor's hardware.

OpenCL is similar to Nvidia’s CUDA, Brook from Stanford and Microsoft DirectCompute. In relation to these, OpenCL has a reputation of being open, portable, lower-level, closer to the hardware, and in some ways harder to use. Think of OpenCL as a portable hardware abstraction layer that supports parallel programming on heterogeneous cores.

OpenCL also comes with a language that is based on a subset of C99 with some additional features that support two different models of programming for parallelism: task parallelism and data parallelism.

Task parallelism is a model with which embedded engineers are most familiar. Task parallelism is commonly achieved with a multithreading OS, and leveraged so that different threads of execution can operate at the same time. When threads need to access common resources, mutexes, semaphores, or other types of locking mechanisms are used. OpenCL supports this model of programming but it is not its greatest strength.

Data parallelism is used in algorithms that use the same operation across many sets of data. In a data-parallel model, one type of operation, such as a box filter, can be parallelized such that the same micro-algorithm can be run multiple times in parallel, but each instantiation of this algorithm operates on its own subset of the data, hence the data is parallelized. This is the model of programming that OpenCL is best suited to support.

Five compatible and intersecting models of OpenCL will help explain the concepts it embodies. These are framework, platform, execution, memory and programming.

The OpenCL framework consists of a platform layer, a runtime and a compiler. The platform layer allows a host program to query available devices and to create contexts. The runtime allows a host program to manipulate contexts once they have been created. The compiler creates program executables from kernels written in a language that is based on a subset of C99, with some additional features to support parallel programming. For a silicon vendor to claim OpenCL conformance, it needs to provide an OpenCL driver that enables the framework to operate.

The platform is defined by a host that is connected to one or more devices, for example, a GPU. Each device is divided into one or more compute units, i.e., cores. Each compute unit is divided into one or more processing elements.

Execution within an OpenCL program occurs in two places: kernels that execute on devices (most commonly GPUs) and a host program that executes on a host device (most commonly a CPU).

To understand the execution model, it’s best to focus on how kernels execute. When a kernel is scheduled for execution by the host, an index space is defined. An instance (work item) of the kernel executes for each item in this index space.
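As a rough illustration of this execution model, the box filter mentioned earlier can be written as a function that handles exactly one position in a 1-dimensional index space. The sketch below is plain C rather than OpenCL C, so that it can be compiled anywhere; the names `box_filter_work_item` and `run_box_filter` are hypothetical, and the ordinary loop only stands in for the runtime, which on a real device may run the work items concurrently.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel body: one call plays the role of one work item.
 * gid is the work item's position in a 1-D index space of size n. */
static void box_filter_work_item(size_t gid, const float *in, float *out, size_t n)
{
    /* 3-tap box filter; at the edges the nearest valid sample is reused. */
    size_t left  = (gid == 0)     ? 0     : gid - 1;
    size_t right = (gid + 1 >= n) ? n - 1 : gid + 1;
    out[gid] = (in[left] + in[gid] + in[right]) / 3.0f;
}

/* Stand-in for the runtime: visits every index in the index space.
 * The iterations are independent because each work item writes only
 * out[gid], which is what makes the operation data-parallel. */
void run_box_filter(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL && in != out);
    for (size_t gid = 0; gid < n; ++gid) {
        box_filter_work_item(gid, in, out, n);
    }
}
```

Because no work item reads another work item's output, the loop order does not matter, and that independence is what lets a device spread the work across its processing elements.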

In OpenCL, the index space is represented by something called an NDRange. An NDRange is a 1-, 2- or 3-dimensional index space. A graphical representation of an NDRange is shown in Figure 1.

The host defines a context for the kernels to use. A context includes a list of devices, kernels, source code and memory objects. The context is created and maintained by the host.

Additionally, the host creates a data structure using the OpenCL API called a command-queue. The host, via the command-queue, schedules kernels to be executed on devices.

Commands that can be placed in the command-queue include kernel execution commands, memory management commands and synchronization commands. The latter are used for constraining the order of execution of other commands. By placing commands in OpenCL command-queues, the runtime then manages scheduling those commands to completion in parallel on devices within the system.
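The flow of commands through a command-queue can be modeled with a small in-order queue in plain C. This is a sketch under stated assumptions, not the OpenCL API: the `command` and `command_queue` types and the `enqueue`, `finish` and `double_item` functions are hypothetical stand-ins. In a real host program, the corresponding work is done by calls such as `clEnqueueWriteBuffer`, `clEnqueueNDRangeKernel`, `clEnqueueReadBuffer` and `clFinish`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { CMD_COPY, CMD_KERNEL } cmd_kind;
typedef void (*kernel_fn)(size_t gid, int *data);

typedef struct {
    cmd_kind kind;
    int *dst;          /* CMD_COPY: destination; CMD_KERNEL: buffer worked on */
    const int *src;    /* CMD_COPY only */
    size_t count;      /* elements to copy, or size of the 1-D index space */
    kernel_fn kernel;  /* CMD_KERNEL only */
} command;

enum { QUEUE_CAPACITY = 8 };

typedef struct {
    command slots[QUEUE_CAPACITY];
    size_t length;
} command_queue;

/* Records a command without running it. Returns 0 on success, -1 if full. */
int enqueue(command_queue *q, command c)
{
    assert(q != NULL);
    if (q->length == QUEUE_CAPACITY) {
        return -1;
    }
    q->slots[q->length++] = c;
    return 0;
}

/* Runs every queued command in submission order, then empties the queue. */
void finish(command_queue *q)
{
    assert(q != NULL);
    for (size_t i = 0; i < q->length; ++i) {
        const command *c = &q->slots[i];
        switch (c->kind) {
        case CMD_COPY:    /* memory management command */
            memcpy(c->dst, c->src, c->count * sizeof(int));
            break;
        case CMD_KERNEL:  /* kernel execution command: one call per work item */
            for (size_t gid = 0; gid < c->count; ++gid) {
                c->kernel(gid, c->dst);
            }
            break;
        }
    }
    q->length = 0;
}

/* Example kernel body: doubles the element owned by this work item. */
void double_item(size_t gid, int *data)
{
    data[gid] *= 2;
}
```

A typical sequence is copy in, run the kernel, copy out. Nothing executes at enqueue time in this model; the host only describes the work, and the queue decides when it runs.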

Work items executing a kernel have access to the following types of memory:

Global memory—available to all work items in all work groups.

Constant memory—initialized by the host, this memory remains constant through the life of the kernel.

Local memory—memory shared by a work group.

Private memory—memory private to a single work item.
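To make the four regions concrete, the following plain C sketch emulates work groups that each compute a partial sum. It is an illustration only: `GROUP_SIZE`, `scale`, `run_work_group` and `sum_by_work_group` are hypothetical names, and ordinary C variables stand in for the memory regions that OpenCL C would mark with its address-space qualifiers.

```c
#include <assert.h>
#include <stddef.h>

enum { GROUP_SIZE = 4 };  /* work items per work group (assumed for the example) */

/* Stands in for constant memory: set up by the host, read-only in the kernel. */
static const float scale = 0.5f;

/* Emulates one work group. in and partial_sums stand in for global memory. */
static void run_work_group(size_t group_id, const float *in, float *partial_sums)
{
    /* Stands in for local memory: visible only to this work group. */
    float local_scratch[GROUP_SIZE];

    /* Phase 1: every work item stages one scaled sample into local memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        /* Stands in for private memory: owned by a single work item. */
        float value = in[group_id * GROUP_SIZE + lid] * scale;
        local_scratch[lid] = value;
    }

    /* On a real device the work items run concurrently, so a work-group
     * barrier would be needed here before anyone reads local_scratch. */

    /* Phase 2: one work item adds up the staged values and publishes the
     * result to global memory, where other groups and the host can see it. */
    float sum = 0.0f;
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid) {
        sum += local_scratch[lid];
    }
    partial_sums[group_id] = sum;
}

/* Stand-in for the runtime: runs each work group in turn. */
void sum_by_work_group(const float *in, float *partial_sums, size_t num_groups)
{
    assert(in != NULL && partial_sums != NULL);
    for (size_t group_id = 0; group_id < num_groups; ++group_id) {
        run_work_group(group_id, in, partial_sums);
    }
}
```

The usual motivation for staging data this way is that local and private memory are generally closer to the processing elements than global memory, although the actual cost of each region depends on the device.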

As already mentioned, OpenCL supports two main types of programming models: data-parallel, where each processor performs the same task on different pieces of distributed data; and task-parallel, where multiple tasks operate on a common set of data. In any type of parallel programming, synchronization between parallel threads of execution must be considered.

OpenCL offers three main ways to control synchronization between parallel processing activities. First, there are work-group barriers, which make the work items in a work group wait for one another at a given point in the kernel. Second, there are barriers to constrain the order of commands within the command-queue. And finally there are events generated by commands within the command-queue. These events can be responded to in a way that enforces sequential operation.
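The third mechanism, events, can be modeled in plain C as a wait list attached to each command. This is a sketch, not the OpenCL API: the `command` type and the `run_with_events`, `produce` and `consume` functions are hypothetical, and the `complete` flag stands in for the status of the event a command generates. In the real API, the enqueue functions accept an event wait list and can return an event for the enqueued command.

```c
#include <assert.h>
#include <stddef.h>

enum { MAX_WAIT = 4 };

typedef struct command {
    void (*run)(int *data);                     /* work done by the command */
    const struct command *wait_list[MAX_WAIT];  /* commands that must finish first */
    size_t num_wait;
    int complete;                               /* stands in for the command's event */
} command;

/* A command may run only when every command on its wait list is complete. */
static int is_ready(const command *c)
{
    assert(c->num_wait <= MAX_WAIT);
    for (size_t i = 0; i < c->num_wait; ++i) {
        if (!c->wait_list[i]->complete) {
            return 0;
        }
    }
    return 1;
}

/* Sweeps the commands repeatedly, running any that are ready. Returns the
 * number of commands executed; stops when a sweep makes no progress, which
 * also covers a circular wait. */
size_t run_with_events(command *cmds, size_t n, int *data)
{
    size_t executed = 0;
    int progress = 1;
    while (progress) {
        progress = 0;
        for (size_t i = 0; i < n; ++i) {
            if (!cmds[i].complete && is_ready(&cmds[i])) {
                cmds[i].run(data);
                cmds[i].complete = 1;  /* "signal" the event */
                ++executed;
                progress = 1;
            }
        }
    }
    return executed;
}

/* Example commands: consume must observe the value written by produce. */
void produce(int *data) { *data = 21; }
void consume(int *data) { *data *= 2; }
```

Even if `consume` is submitted before `produce`, listing `produce` on its wait list forces the producer to run first, which is the sense in which events enforce sequential operation.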

Tools like OpenCL are well suited to photo/video editing applications, AI systems, modeling frameworks, game physics, Hollywood rendering and augmented reality, to name a few.

However, there is also an embedded profile for OpenCL defined in the specification that consists of a subset of the full OpenCL specification, targeted at embedded mobile devices. Here are some highlights of what the OpenCL Embedded Profile contains:

64-bit integers are optional

Support for 3D images is optional

Relaxation of rounding rules for floating point calculations

Precision of conversions on an embedded device is clarified

Built-in atomic functions are optional
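Because these features are optional, portable host code typically checks for them before relying on them. The sketch below assumes the host has already obtained the device's profile string and its space-separated extension string (in a real program these come from `clGetDeviceInfo` with `CL_DEVICE_PROFILE` and `CL_DEVICE_EXTENSIONS`) and that 64-bit integer support on an embedded-profile device is advertised through the `cles_khr_int64` extension. The helper names `has_extension` and `can_use_int64` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if name appears as a whole token in the space-separated
 * extension list, 0 otherwise. */
int has_extension(const char *ext_list, const char *name)
{
    assert(ext_list != NULL && name != NULL);
    size_t len = strlen(name);
    const char *p = ext_list;
    if (len == 0) {
        return 0;
    }
    while ((p = strstr(p, name)) != NULL) {
        int starts_token = (p == ext_list) || (p[-1] == ' ');
        int ends_token   = (p[len] == '\0') || (p[len] == ' ');
        if (starts_token && ends_token) {
            return 1;
        }
        p += len;  /* partial match only; keep searching after it */
    }
    return 0;
}

/* Returns 1 if kernels for this device may use 64-bit integers. */
int can_use_int64(const char *profile, const char *ext_list)
{
    assert(profile != NULL);
    if (strcmp(profile, "EMBEDDED_PROFILE") != 0) {
        return 1;  /* the full profile requires 64-bit integer support */
    }
    return has_extension(ext_list, "cles_khr_int64");
}
```

A host program could use a check like this to choose between a 64-bit kernel and a 32-bit fallback at startup.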

Looking forward, the OpenCL roadmap contains a number of initiatives intended to make it easier to use and more widely adopted.

High-Level Model (OpenCL-HLM): OpenCL is currently exploring ways to unify device and host execution environments via language constructs so that it is easier to use OpenCL. The hope is that by doing this, OpenCL will become even more widely adopted.

Long-Term Core Roadmap: OpenCL is continuing to look at ways to enhance the memory and execution models to take advantage of emerging hardware capabilities. Also, there are efforts underway to make the task-parallel programming model more robust with better synchronization tools within the OpenCL environment.

WebCL: OpenCL has a vision to bring parallel computation to the web via JavaScript bindings.

Standard Parallel Intermediate Representation (OpenCL-SPIR): OpenCL wants to get out of the business of creating compilers and tools and language bindings. By creating a standardized intermediate representation, bindings to new languages can be created by people outside of the OpenCL core team, enabling broader adoption and allowing the OpenCL intermediate representation to be a target of any compiler in existence, now or in the future.

OpenCL has some hurdles to overcome, many of which are being addressed by current initiatives within the working group. In the next decade of computing, as we continue to see a proliferation of processing cores, both homogeneous CPUs and heterogeneous CPUs/GPUs, we will have an increasing need for sophisticated software frameworks that help us take advantage of all of the hardware computing power available on our systems. As this trend continues, OpenCL is strongly positioned as an open, free and maturing standard with strong industry support and a bright future.

LogicPD, Eden Prairie, MN. (952) 941-8071. [logicpd.com].

Khronos Group [www.khronos.org].
