Performance of Parallel Processing on Processing Units
Oscar Inzunza-Monreal
Computer Science Department
San Jose State University
San Jose, CA 95192
Parallel processing is becoming more common, and with the
emergence of new processing units we ask how efficient our
current hardware is under this programming paradigm. Most
computers now have multi-core processors and are capable of
running in parallel to some extent. We examine how performance
compares among the processing units available today and how well
parallel programs port across their architectures. We do this by
examining the task scheduling and portability efficiency of these
processing units. The processing units we evaluate are central
processing units, graphical processing units, and accelerated
processing units.
Traditionally, computers had a single-core central processing
unit (CPU) and ran programs sequentially. In recent years we
have largely exhausted the potential of this hardware
architecture and have since moved to multi-core, or many-core,
architectures. Commercial CPUs have been around since the
mid-1900s, a time when parallel processing was not the
conventional programming paradigm.[3] With the advancement of
processor technology, and in an attempt to speed up algorithms,
parallel programming has emerged as a new paradigm. To exploit
it, some companies, such as AMD and NVIDIA, have engineered new
types of processing units designed specifically for parallel
workloads. The processing units currently available are
multi-core CPUs, graphical processing units (GPUs), and
accelerated processing units (APUs).
Multi-core CPUs are components with two or more independent
processing cores, each of which performs basic arithmetic,
logical, and input/output operations and moves data using
registers. A related special-purpose processor is the physics
processing unit (PPU), first released by the company Ageia.
PPUs were initially designed to accelerate particle systems,
computing transformations and collisions for physics and other
scientific simulations.[9] GPUs, also called visual processing
units (VPUs), are designed to rapidly manipulate and alter
memory to accelerate the creation of images in a frame buffer
intended for output to a display. Like the PPU, the GPU takes a
specialized class of computation off the CPU. The GPU
architecture was designed to perform computationally intensive
transform and lighting calculations. It performs these outside
the CPU to reduce processing time, using many small cores that
run in parallel by distributing the computations among
themselves. This hardware feature makes GPUs very popular for
parallel processing and has given rise to general-purpose
graphical processing units (GPGPUs).[5] The company AMD wanted
to take advantage of this architecture and created a hybrid
processing unit it calls an accelerated processing unit (APU),
also known as an advanced processing unit. The APU serves as the
computer's main processing unit and is designed to accelerate
computations outside of the CPU while still sharing memory with
it.[1] These new processing units are on the rise and are
developing a large following, but how well do they compare
against one another?
In this section we look at the similarities and differences
among the currently available processing units. Multi-core CPU
hardware designs vary: a design can place copies of a single core
repeatedly on a die, or it can have several different cores,
where each one is optimized for specific tasks.
Figures 1.1 and 1.2: A dual-core (1.1) and a quad-core (1.2)
multi-core CPU architecture.[7]
GPU hardware uses a similar design; however, its smaller cores
devote more transistors to arithmetic logic units, allowing it
to run more threads of computation on the same size chip. Each
core therefore has a smaller cache than a CPU core, since more
transistors are dedicated to computation and data processing
rather than to data caching or flow control. Because of this
parallel architecture, some GPGPUs are used in place of CPUs for
intensive computations. However, many control and serial
instructions still have to be performed by a CPU.[11]
Figure: A GPU hardware architecture.[4]
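The data-parallel model described above can be illustrated with a
short sketch. It is written in Python purely to show the
programming model: one small kernel function is applied to every
index, much as a GPU applies one kernel across many lightweight
threads. The names saxpy_kernel and launch are our own
illustrative choices, not part of any GPU API, and a thread pool
stands in for the GPU's cores. In CPython these threads do not
actually execute the arithmetic in parallel, so the sketch says
nothing about real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(i, a, x, y, out):
    """Work done by one logical thread: a single element of a*x + y."""
    out[i] = a * x[i] + y[i]


def launch(kernel, n, *args, workers=4):
    """Run kernel(i, *args) once for every index i in range(n).

    The pool of workers is a stand-in (an assumption of this sketch)
    for the many small cores of a GPU.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for all work items and re-raises worker errors.
        list(pool.map(lambda i: kernel(i, *args), range(n)))


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0, 4.0]
    y = [10.0, 20.0, 30.0, 40.0]
    out = [0.0] * len(x)
    launch(saxpy_kernel, len(x), 2.0, x, y, out)
    print(out)
```

Each index writes only its own output slot, so no locking is
needed. That independence between elements is what makes a
computation a good fit for the many-core design described above.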
This brings us to the APU, which places a GPGPU, or a similar
specialized processing unit, and a CPU on the same die in order
to reduce the overhead of data transfers between the two units
and to reduce power consumption.[1] This architecture combines
the control and serial data processing of a CPU with the
parallel data processing and display computations of a GPU. It
aims to improve performance while remaining flexible enough to
run both programs that were not designed for parallel execution
and programs that were.
Taken separately, the CPU and GPU have been studied extensively,
and a large number of research results indicate that the GPU has
a great advantage over the CPU in solving compute-intensive
problems.[6] In most CPU-GPU collaborative environments the two
units nevertheless rely on one another: the CPU acts as the
master controller, distributing tasks and executing some of
them, while the GPU can only help execute tasks. Further
research has shown that lower-intensity calculations should be
done by the CPU and higher-intensity calculations should be
mapped to the GPU.[12] Even with the communication overhead of
data transfers, the collaborative environment outperforms the
CPU-only and GPU-only architectures. Since APUs are so new and
specific to AMD hardware, there is very little research, if any,
on this type of processor.
Figure: Parallel architectures with one CPU and one GPU.[7]
Figure: Four conceptual APU architectures.[2]
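The intensity-based mapping described above can be sketched as
follows. This is a minimal illustration and not the scheduling
algorithm of [12]: the intensity score attached to each task and
the threshold value are hypothetical quantities that we introduce
only for the example.

```python
def schedule(tasks, threshold):
    """Map tasks to "cpu" or "gpu" by a hypothetical intensity score.

    tasks     -- iterable of (name, intensity) pairs; intensity is an
                 assumed measure such as operations per byte of data
    threshold -- assumed cut-off: tasks at or above it go to the GPU

    The CPU plays the master role: it makes every placement decision
    and keeps the low-intensity tasks for itself.
    """
    plan = {"cpu": [], "gpu": []}
    for name, intensity in tasks:
        target = "gpu" if intensity >= threshold else "cpu"
        plan[target].append(name)
    return plan


if __name__ == "__main__":
    tasks = [("parse_input", 0.5), ("matrix_multiply", 50.0),
             ("write_output", 0.1)]
    print(schedule(tasks, threshold=10.0))
```

A real scheduler would also have to weigh the data transfer cost
of each GPU placement, which this sketch leaves out.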
In order to measure the performance of each processing unit we
must first understand its limits so that we can test it
accordingly. Since GPUs use a smaller cache memory, we must make
sure the data fits within the capacity of the GPU memory. When
parallel programming on a classic von Neumann CPU architecture,
we must remember that this design creates a bottleneck on data
transfer that limits throughput.[10] We must also consider the
overhead produced by the bottleneck of data transfers between
the CPU and GPU in CPU-GPU collaborative environments. APUs
share main memory, but they still incur some overhead because
the CPU and GPU interpret data differently during transfer.
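The two limits named above, device memory capacity and transfer
overhead, can be made concrete with a small sketch. It assumes a
simple cost model that is ours, not taken from the cited work:
each copy to the device pays a fixed latency plus its size
divided by the link bandwidth. All parameter names and values
are hypothetical.

```python
def plan_batches(n_items, item_bytes, device_mem_bytes):
    """Return (start, stop) index ranges whose data fits in device memory."""
    if item_bytes <= 0 or device_mem_bytes < item_bytes:
        raise ValueError("a single item must fit in device memory")
    items_per_batch = device_mem_bytes // item_bytes
    return [(start, min(start + items_per_batch, n_items))
            for start in range(0, n_items, items_per_batch)]


def estimate_transfer_seconds(batches, item_bytes,
                              bandwidth_bytes_per_s, latency_s):
    """Estimate one-way copy time under the assumed cost model:
    a fixed latency per batch plus batch size / bandwidth."""
    total = 0.0
    for start, stop in batches:
        total += latency_s + (stop - start) * item_bytes / bandwidth_bytes_per_s
    return total


if __name__ == "__main__":
    batches = plan_batches(n_items=10, item_bytes=4, device_mem_bytes=16)
    print(batches)
    print(estimate_transfer_seconds(batches, 4, 8.0, 1.0))
```

Under this model a shared-memory design such as the APU would
correspond to a much smaller per-batch cost, although, as noted
above, it would not be zero.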
Task Scheduling
In order to compare the performance of these processing units we
must use a computing model that suits them all well. To do this
we must treat the APU as a separate model, because the nature of
its architecture requires a specific algorithm implementation.
The reason is that the scheduling of computing tasks and
communication tasks must be optimized for the different
hardware: the GPU or CPU is used for computing tasks, and the
CPU controls the input and output of the computing tasks.[12]
Programming Paradigms
In this section we look at the impact of programming paradigms
on parallel processing units. The oldest programming paradigm in
use is the imperative programming paradigm (PP). Most computer
architectures are modeled after the von Neumann model, and since
its conception the imperative PP has been used directly and
indirectly. Today we have a number of different paradigms,
including object-oriented and quantum programming, among
others.[6] Many of them owe much to the predecessors after which
they were modeled. With the advancement of different paradigms
we have also seen the birth of high-level programming languages
that try to take advantage of parallel processing, such as
OpenCL and CUDA. High-level languages help prevent critical
errors at the programming level and are easier for programmers
to read and debug. These PPs still inherit elements of the
imperative PP and of other abstraction models. All programming
languages must ultimately be translated into machine-level
instructions for the hardware to understand what computations it
is being told to execute. This detail makes it difficult to
completely eradicate previous PPs, and it slows the development
of new paradigms and languages that focus on parallel processing
without being entirely architecture specific. However, many
languages now incorporate multithreading and other elements of
parallel programming in order to take advantage of the advances
in technology. This leaves it to the programmer to explicitly
tell the processing units how to distribute the data, which
makes debugging more difficult and more prone to human
error.[13]
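The burden that explicit distribution places on the programmer
can be seen in a short sketch. It is our own illustration in
Python, not code from [13]: the function parallel_sum and its
chunking scheme are hypothetical, and every step that the
programmer must get right by hand is marked in the comments.

```python
import threading


def parallel_sum(data, n_threads=4):
    """Sum `data` using explicitly created threads and an explicit split."""
    # One result slot per thread, so threads never write the same location.
    partial = [0] * n_threads
    # The programmer chooses the partition; ceiling division covers the tail.
    chunk = (len(data) + n_threads - 1) // n_threads

    def worker(tid):
        start = tid * chunk
        partial[tid] = sum(data[start:start + chunk])

    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        # Omitting this join would read `partial` before it is complete,
        # a typical error when threads are managed by hand.
        t.join()
    return sum(partial)


if __name__ == "__main__":
    print(parallel_sum(list(range(101))))
```

The arithmetic is a single call to sum; everything else is
bookkeeping for distributing the data, and each of those lines
is a place where an off-by-one or a missing join can go
unnoticed.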
Portability with OmniDB
Portability of parallel processing across CPU, GPU, CPU-GPU, and
APU architectures, ranging from desktop to mobile devices, has
been architecture specific up to now. To address this, OmniDB
proposes a kernel-adapter based design that can be implemented
across the different architectures. While this design solves the
issue of cross-platform implementation, it runs into some unique
problems. For instance, in the CPU-only and GPU-only
architectures we have parallel programming elements (PPEs) that
each contain their own memory, which may or may not overlap with
that of other PPEs. In CPU-GPU environments the CPU and GPU each
have their own memory, whereas in the APU architecture the two
PPEs share main memory. To bridge these architectural
differences and make the paradigm completely cross-platform, the
proposed kernel-adapter based approach first determines the
resources of the given architecture through an architecture-aware
query and an adapter. The adapter allows the kernel to optimize
data distribution according to the available resources. In the
end, defining the boundary between the kernel and the adapters
remains an open problem and a subject for further research.[7]
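The division of labor between kernel and adapter can be sketched
as follows. This is a loose illustration of the idea only and
does not reproduce the OmniDB interfaces: the Adapter class, its
query method, the relative device weights, and the
kernel_partition function are all hypothetical names and
quantities introduced for this example.

```python
class Adapter:
    """Hypothetical adapter describing one target architecture."""

    def __init__(self, name, devices, shared_memory):
        if not devices:
            raise ValueError("an architecture needs at least one device")
        self.name = name
        self.devices = devices              # device name -> assumed relative weight
        self.shared_memory = shared_memory  # True when devices share main memory

    def query(self):
        """Architecture-aware query: report the resources that are available."""
        return {"devices": dict(self.devices),
                "shared_memory": self.shared_memory}


def kernel_partition(n_items, adapter):
    """Architecture-independent step: split work in proportion to the
    device weights reported by the adapter."""
    info = adapter.query()
    devices = list(info["devices"].items())
    total = sum(weight for _, weight in devices)
    shares, assigned = {}, 0
    for name, weight in devices[:-1]:
        shares[name] = n_items * weight // total
        assigned += shares[name]
    shares[devices[-1][0]] = n_items - assigned  # remainder to the last device
    # Separate memories imply an explicit copy; shared memory does not.
    return {"shares": shares, "needs_copy": not info["shared_memory"]}


if __name__ == "__main__":
    discrete = Adapter("cpu-gpu", {"cpu": 1, "gpu": 3}, shared_memory=False)
    apu = Adapter("apu", {"cpu": 1, "gpu": 3}, shared_memory=True)
    print(kernel_partition(100, discrete))
    print(kernel_partition(100, apu))
```

The kernel code is identical for both architectures and only the
adapter differs. Deciding how much logic belongs on each side of
that line is the open boundary problem mentioned above.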
Naturally, two heads are better than one, and this was the
thinking behind the development of multi-core and many-core
systems. The peak performance and efficiency of single-core CPUs
has been reached at what is commonly known as the heat-and-power
wall. A single-core CPU can no longer significantly outperform
another single-core CPU of the same caliber. With the help of a
GPU, a CPU can reduce its energy consumption and heat. However,
a GPU is slower than a CPU at basic arithmetic computations
performed sequentially, which keeps the CPU best suited for
sequential and low-intensity computations. Flow control and
input/output operations are also still handled by the CPU, which
requires a larger cache to do so, making the CPU still a
valuable asset to computer architecture. While a CPU can perform
parallel processing through task parallelism, it requires
threads to be explicitly defined. A GPU, on the other hand,
performs parallel processing through data parallelism, in which
threads are managed and scheduled by the hardware.[5] Overall,
the CPU-GPU collaborative environment is currently the most
efficient for parallel processing, with the APU having the
additional benefit of a much smaller data transfer overhead via
shared main memory.
In conclusion, new processing units are always being developed,
along with programming languages and paradigms to take advantage
of the available technology. We are currently in a time where
our technology is cheap and capable of great performance, but
our paradigms are not designed to tap the true potential of what
we have created. We try to hold on to the sequential programs
that we have grown accustomed to, and we try our best to adapt
them to a more parallel architecture in order to make leaps
instead of steps into the next level of computer science. Many
advancements still await us in the next couple of centuries,
from mobile portability of parallel processing to the untapped
potential of GPUs and APUs. Who knows, we may even find
ourselves developing yet another processing unit capable of
doing everything our current ones can and much more. In the
future we will look at multi-core APU performance.
[1] Accelerated Processing Unit
[2] AMD APU Conceptual Designs
[3] Central Processing Unit
[4] CPU/GPU Comparison
[5] Graphical Processing Unit
[6] Hahn S., Oskin M., and Thompson C. “A framework and
analysis of modern graphics architectures for general-purpose
computing,” in Proceedings of the 35th Annual
International Symposium
[7] Lu, Mian. OmniDB: Towards Portable and Efficient Query
[8] Multi-Core CPU
[9] Physics Processing Units
[10] Rege, Ashu. An Introduction to Modern GPU Architecture.
[11] Von Neumann Architecture
[12] Wang, Lei. Task Scheduling of Parallel Processing in
CPU-GPU Collaborative Environment. International Conference
on Computer Science and Information Technology 2008.
[13] Zuhri Zuhud, Daeng A. From Programming Sequential
Machine to Parallel Smart Mobile Devices: Bringing Back
the Imperative Paradigm to Today's Perspective. 8th
International Conference on Information Technology 2013.