Family of Accelerated Processors Opens the Doors to New Embedded Possibilities
A new family combines dual- and quad-core x86 processing with high-end graphics
capability that can also be used for numerically intensive applications.
by Tom Williams, Editor-in-Chief
The concept of combining multicore x86 functionality with a powerful parallel
architecture graphics processing engine on the same piece of silicon has come of age and
could significantly transform the areas and applications for embedded processors.
Following on its groundbreaking G-Series of what it calls accelerated processing units
(APUs), Advanced Micro Devices is announcing a new series of APUs, named the
Embedded R-Series, that integrates more x86 cores with advanced features and even
more powerful GPUs. The result is a product line spanning a range of performance, power
and cost points that can address a wide selection of embedded applications that
increasingly rely on high-performance graphics as well as numerically intense computation.
Where an earlier x86-based design that included graphics processing relied on the x86
CPU to interface via a North Bridge connection with a discrete graphics processor, the
APU integrates both on the same chip. The GPU can, of course, be used for demanding
graphics tasks as well as to offload such things as DSP operations if needed in the same
code stream. In the past, such operations would involve the x86 CPU sending calls to a
DSP or discrete GPU to invoke code running on the coprocessor, which would then send
results back to the CPU, with all the latency and overhead that would necessarily be
involved. With both architectures on the same die, the application can be written as one
program using OpenCL, thus vastly reducing both latency and overhead.
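To make the offload model concrete, here is a minimal sketch of how a DSP-style operation breaks down into independent units of work, the structure that lets a parallel engine take it over. It is written in plain Python and only emulates the per-work-item pattern of an OpenCL-style kernel; it does not call the OpenCL API, and the names fir_kernel and run_fir are hypothetical, chosen for illustration.

```python
# Plain-Python emulation of a data-parallel "kernel". This is NOT the OpenCL
# API; fir_kernel and run_fir are hypothetical names used for illustration.

def fir_kernel(gid, samples, taps, out):
    """Work done by one work-item: compute a single FIR output sample."""
    acc = 0.0
    for k, tap in enumerate(taps):
        idx = gid - k
        if idx >= 0:  # samples before the start of the signal count as zero
            acc += tap * samples[idx]
    out[gid] = acc


def run_fir(samples, taps):
    """Host side: launch one work-item per output sample.

    Each work-item writes only its own output slot, so on a parallel engine
    they could all run at once. Here they simply run one after another.
    """
    out = [0.0] * len(samples)
    for gid in range(len(samples)):
        fir_kernel(gid, samples, taps, out)
    return out
```

Because no work-item depends on the result of another, the loop in run_fir is exactly the part that a GPU would spread across its parallel units, while the surrounding program logic stays on the CPU.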
The new AMD R-Series builds on the earlier G-Series but uses a more advanced multiple
pipeline x86 architecture and implements the series as a selection of dual or quad-core
devices with AMD's DirectX 11-capable Radeon 7000 Series graphics engines with up to
384 parallel processing units. This integrated architecture allows a combination of
dedicated and shared resources. Therefore the two or four x86 cores, each of which has
four execution pipelines, a dedicated thread scheduler and a dedicated Level 1 cache, also
share instruction fetch and decode, a Level 2 cache and two 128-bit floating point MACs,
which can be combined into a 256-bit floating point unit (Figure 1).
In addition, the single instruction multiple data (SIMD) parallel processing GPU shares
the memory controller with the x86 cores for fast access to memory as well as to provide
fast communications between GPU and CPU cores. And that is before we even get to the
floating point capabilities of the parallel GPU. Depending on the number of parallel
processing units in a given device's GPU, single-precision floating point performance can
range from 178 to 578 GFLOPS (Figure 2).
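The link between the number of parallel units and floating point throughput can be sketched with simple arithmetic. The example below assumes that each processing unit completes one multiply-add (counted as two floating point operations) per clock cycle, a common way of quoting peak figures but not something stated in this article. The 0.5 GHz clock is a hypothetical value chosen only to keep the numbers easy to follow, so the result is not meant to reproduce the range quoted above.

```python
def peak_gflops(units, clock_ghz, flops_per_unit_per_cycle=2):
    """Theoretical peak = units x FLOPs per unit per cycle x clock in GHz.

    flops_per_unit_per_cycle=2 is an assumption (one multiply-add per cycle).
    """
    return units * flops_per_unit_per_cycle * clock_ghz


# 384 units is the maximum mentioned in the article; 0.5 GHz is hypothetical.
example = peak_gflops(384, 0.5)  # 384 * 2 * 0.5 = 384.0 GFLOPS
```

Under these assumptions, peak throughput scales linearly with both the unit count and the clock rate, which is why devices with fewer parallel units fall toward the lower end of the range.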
In addition, it is possible to combine the graphics performance of an APU with that of an
AMD discrete GPU for even more graphics performance in terms of raw output or the
number of displays that can be driven. For example, combining the performance of the
high-end quad-core R-464L with that of the AMD E6760 GPU would yield
20% more graphics performance. That could be harnessed for raw graphical or numeric
compute power or to drive more displays than the four that can be directly
driven from the device.
Interfaces available directly on the chip include HDMI, DisplayPort and DVI. In
addition to the four independent display ports, a x16 PCI Express port can be configured
as up to four more outputs for either graphics or other I/O. This can be configured as up
to four DVI interfaces to directly drive up to four more displays. By using a discrete GPU
on a Windows 7-based board, it is possible to drive up to a total of ten displays. A
selection of controller hubs will be available to provide additional I/O including SATA,
VGA, USB 2.0, USB 3.0 and PCIe 4x1. There is even a hub available that supports the
legacy PCI bus.
Additional Support Features
The ability to support such high end graphical and compute capabilities on a single chip
suggests a range of applications that can benefit from additional features required by such
things as video conferencing, surveillance and other kinds of distributed applications. For
example, the ability to manage distributed nodes independent of operating system state is
supported by AMD’s DAS 1.0 implementation of the Desktop and Mobile Architecture
for System Hardware (DASH) management scheme. It allows remote operators to go into
systems and reset or power down and restart them and perform remote BIOS updates
among other things.
A dedicated video compression engine offers hardware support for the encoding and
compression needed for distributed applications such as video conferencing and
surveillance where high-definition video must be rapidly transmitted across wired or
wireless networks at constrained bandwidths. There are also performance enhancements
for secure asset management in terms of encryption and decryption for sharing and
rendering protected video content.
A broad range of decode support is also provided to support low-power rendering of
video content. This includes H.264, MPEG-2, VC-1, MPC, DivX, MPEG-2 IDCT+
MotionComp, Dual HD Decode (1080p+1080i) and MVC for Blu-ray Stereo 3D. Thus in
addition to the raw power of control, numeric and graphical computing, there is a range
of interfaces and built-in services that address a wide and growing range of application
needs on a single chip that can be run by a single set of code written in a single language,
namely OpenCL. Therefore one development discipline can be used to exploit the
capabilities built into something like an APU.
Where from Here?
Now why, exactly, this detailed description of an admittedly major new product line? For
one thing, the implications are potentially enormous. AMD is currently leading the
market in this particular arena, but it is sure to attract competition. What happens then is
anyone’s guess, but we can take a shot at predicting. First, there are a good many
identified applications for which such a device would be a distinct advantage. These
include digital signage where the ability to control multiple displays is a must and the
capability for remote management is very desirable. There are increased possibilities in
security and surveillance where there is a need to manage multiple video feeds.
Teleconferencing, high-end casino gaming and advanced medical imaging all come
immediately to mind. But along with such increased capabilities there arise possibilities
we may not have thought of yet and these deserve exploration as well.
It is tempting to compare the emergence of the APU as a new class of device with that
of the applications services platform (ASP), which combines a general-purpose CPU (in
that case an ARM architecture) on the same die with an FPGA fabric. Such devices have
recently been introduced by Xilinx, Microsemi and Altera. ASPs may address a quite
different set of potential applications and are a huge step forward, but they must
overcome the fact that they combine two different development disciplines, that of the
programmer and that of the FPGA developer, which are not often mastered by the same
person.
Admittedly, graphics development is also a more specialized discipline than general
programming, so there will be hurdles. Here, however, the same code base can
access and allocate the resources dynamically, such as by dedicating parts of the parallel
engine to numeric computation and others to graphics rendering. One simple example
that everyone can relate to is the now notorious game, “Angry Birds.” Anyone who has
played the game has no doubt noticed that in addition to the display of flying birds and
the various things they destroy to thwart the evil pigs, there is also physics at work. The
boards, rocks and other objects sway and either fall or don’t fall, smash or don’t smash
according to the force and angle of how they are struck. These are two different types of
The same can be said of such more practical applications as computational fluid
dynamics, seismic computation and display or many other sophisticated problem solving
applications. The ability to combine a CPU architecture that is capable of
general-purpose programming including interrupt-driven and real-time control with a graphics
processor that is also capable of intense numerical computation opens a vast number of
Consider as only one example the combination of machine vision with motion control.
Video data captured by a machine’s “eyes” can be rapidly processed in the GPU’s
parallel units, features or other clues extracted and then used to direct the CPU to move
arms, wheels or other aspects of a vision-directed application all on one main device
under the control of a unified set of code. We leave the reader to imagine further.
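As a starting point for that imagining, the vision-plus-motion pipeline can be sketched in a few lines. This is an illustrative sketch only, in plain Python with no GPU involved: the function names, the brightness threshold and the proportional steering rule are all assumptions made for the example, not anything specified by AMD.

```python
def bright_centroid_x(frame, threshold):
    """Feature extraction stage: mean column index of pixels above threshold.

    Every pixel test is independent of the others, which is what would make
    this stage a candidate for the GPU's parallel units. Returns None when
    no pixel exceeds the threshold.
    """
    total = 0
    count = 0
    for row in frame:
        for x, value in enumerate(row):
            if value > threshold:
                total += x
                count += 1
    return total / count if count else None


def steering_command(centroid_x, frame_width, gain=1.0):
    """Control stage: proportional steering toward the detected feature.

    Negative means steer left, positive means steer right, and 0.0 means the
    feature is centered or nothing was detected. This serial decision logic
    is the kind of work that would stay on the CPU.
    """
    if centroid_x is None or frame_width < 2:
        return 0.0
    center = (frame_width - 1) / 2
    return gain * (centroid_x - center) / center
```

A frame whose bright pixels all sit in the rightmost column yields a command of 1.0 (full right), while a centered target yields 0.0. The point of the sketch is the division of labor: one stage that is naturally parallel and one that is naturally sequential, both expressed in a single program.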
Advanced Micro Devices, Sunnyvale, CA. (408) 749-4000.