
Need a Many-Core Strategy?
Has your boss asked for a “many-core” strategy yet? Get ready. Many-core is the next step in
microprocessor evolution. It will change the world for most of us.
by Jeff Milrod, BittWare
Most experts agree that the era of Moore's law driving endless performance increases of legacy
processors is ending, even though the transistor count may continue to climb. Therefore, the way
we approach processing must change. It is no longer likely that we will be getting a bigger
hammer next year. Since we are guaranteed that the problems will continue to get bigger, the
inevitable conclusion is that we need to figure out how to use, and coordinate, more small
hammers. In the processing world this means multicore, and eventually many-core.
Although we don’t have a stable lexicon yet, multicore typically refers to chips that contain a small number of powerful legacy cores (Intel’s traditional Xeon, for example).
Alternatively, many-core refers to a chip containing a large number of simpler processing cores.
One many-core pioneer illustrates the idea by showing a swarm of army ants successfully
attacking a server. Swarming a chicken is far more likely, but the contrived illustration reminds
us that evolution doesn’t always lead to bigger animals.
Many-core’s attraction is that it results in a larger percentage of transistors going into
computational units. Thus, for any fixed die size, it boasts higher peak performance. Many-core
does this extra work without burning additional power. After all, it is just using the same number
of transistors more effectively.
This approach can be far more efficient and much more scalable than traditional
methodologies, but only if used appropriately. Trying to code many-core processors as if they
were legacy single threaded engines will lead to extremely unpleasant parallel processing
challenges, and ultimately to failure. New coding approaches are called for. But the problem is
NOT many-core—after all, many-core only exists because it is the solution—the problem is a
lack of knowledge and support for many-core. New strategies are needed to deal with this
reality—new processor architectures along with new approaches to coding them.
To uncover the appropriate strategy for a given many-core implementation, there must first be an
understanding of the problem. Recalling that the lexicon is still undefined and in flux,
applications can map to many-core in three basic ways—data parallel, task parallel and parallel
processing. Which type of parallelism is appropriate will drive the strategy and selection of both
processors and the programming model.
Data Parallel and Task Parallel Processing
Data parallel processing describes applications that run the same, or similar, processes many
times on different data sets. This may be as simple as an application or algorithm that spends more than 90% of its runtime in loops, where (hopefully) each iteration of the loop is independent.
Provided the parallelism is data independent (i.e., there are no “IF” statements), every data element can be processed in the same way, and a many-core SIMD model becomes an attractive strategy. SIMD means that a single instruction is executed on multiple data elements simultaneously. In the “old days,” these were often called vector, or array, processors. To add two vectors of N values, each pair of elements can be added sequentially or, with specialized hardware, all N pairs can be added simultaneously. Many-core SIMD precursors such as Motorola/Freescale’s AltiVec and Intel’s MMX and SSE do just that. These concepts have been applied more generally by vendors including MathStar and ClearSpeed, both of which may have been ahead of their time; neither found commercial success.
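As a concrete illustration, here is a minimal C sketch of the vector add just described, using Intel’s SSE intrinsics as the “specialized hardware”; the function names are ours, and for brevity the SIMD version assumes N is a multiple of 4:

#include <xmmintrin.h>   /* Intel SSE intrinsics */

/* Scalar baseline: one addition per loop iteration. */
void vec_add_scalar(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* SIMD version: a single instruction adds four floats at once.
   Assumes n is a multiple of 4, purely to keep the sketch short. */
void vec_add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load 4 floats */
        __m128 vb = _mm_loadu_ps(&b[i]);
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* 4 adds at once */
    }
}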
The most successful implementation of many-core SIMD for data parallel applications has clearly been the GPU. These processors were, of course, originally designed to parallelize graphics operations.
After Nvidia released CUDA in 2007, however, these GPUs became usable for general purpose
acceleration (GPGPU). For algorithms with highly structured data parallelisms that can be
independently processed with no need for conditional instructions, a GPU can tremendously
outperform a single or multicore CPU (Figure 1). As is typical, the success of GPGPU has
attracted competitors. The primary many-core competition is Intel’s emerging Xeon Phi family
with code names like Knights Ferry and Knights Corner. Intel’s early Phi offerings leveraged
old Pentium cores boosted with a wide SIMD unit. Intel says later Phi chips will actually use
Atom cores.
Data parallel applications with data dependencies (i.e., with “IF” statements), however, are ill-suited to implementation on GPUs, or any other SIMD engine for that matter. While every data
stream can be processed by the same algorithm, each stream needs to be processed
independently. This requirement dictates a MIMD model, meaning a many-core processor that
can independently execute multiple instructions on multiple data. This approach is far more
flexible than SIMD, and supports non-structured data and conditional, or data-dependent
algorithms. Where SIMD can be thought of as a special case, the MIMD model is a more
general-purpose implementation of the many-core concept.
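A toy example makes the distinction concrete. The per-element algorithm below is invented purely for illustration; what matters is the “IF,” which would force the lanes of a SIMD unit to diverge, while an independent MIMD core simply follows whichever branch its own data takes:

/* Hypothetical data-dependent kernel, for illustration only.  The
   branch means each element may take a different path: poison for
   SIMD lanes, routine for an independent MIMD core. */
float process_sample(float x)
{
    if (x > 0.0f)
        return x * 2.0f;    /* one path...                  */
    return -x * 0.5f;       /* ...or the other, per element */
}

/* Each MIMD core can run this loop on its own data stream. */
void process_stream(const float *in, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = process_sample(in[i]);
}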
A good example of a MIMD many-core processor is BittWare’s Anemone. Implementing the
Epiphany many-core architecture from Adapteva, Anemone hosts 16-64 cores that are connected
via a high-throughput mesh network, as shown in Figure 2. Since each core is independent it is
MIMD, and since it uses floating-point cores that are C-programmable, it is easy to code.
For data parallel MIMD applications, the independent cores can operate concurrently on separate
channels or data sets. Many applications can exploit a massive number of concurrent lightweight cores. Packet processing is an obvious example: many cores are a natural match for a network carrying many packets. The same analogy applies to serving web pages or anything else
optimized around channels, capacity, or throughput. Concurrent programming of many-core
architectures can be quite straightforward, with independent implementations of a single
program—provided that program “fits” into a single core. If the program doesn’t fit into a single
core, then it must be broken down into sub-tasks for implementation in a task parallel strategy.
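In its simplest form, that one-program-per-channel pattern looks like the C sketch below, with POSIX threads standing in for physical cores; the channel structure and worker are hypothetical stand-ins, not any vendor’s API:

#include <pthread.h>
#include <stdio.h>

#define NUM_CHANNELS 16            /* stand-in for a 16-core device */

typedef struct {
    int channel_id;                /* plus per-channel buffers, state... */
} channel_t;

/* Every "core" runs an identical copy of this program on its own
   channel; there is no shared state to coordinate. */
static void *worker(void *arg)
{
    channel_t *ch = (channel_t *)arg;
    /* process this channel's packets or data set here */
    printf("channel %d done\n", ch->channel_id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_CHANNELS];
    channel_t channels[NUM_CHANNELS];

    for (int i = 0; i < NUM_CHANNELS; i++) {
        channels[i].channel_id = i;
        pthread_create(&threads[i], NULL, worker, &channels[i]);
    }
    for (int i = 0; i < NUM_CHANNELS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}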
Task parallel processing includes a wide range of applications that contain lots of independent
functions that can be run either concurrently or in a pipeline; at some point the results may be
consolidated, but for the most part the application can be run with task independence. These
types of problems can be thought of as non-data parallel concurrent processing with each core
running a different program, which would indicate using a MIMD strategy for implementation.
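The contrast with the concurrent sketch above is that each “core” now runs a different program. Again threads stand in for cores, and the pipeline stages are entirely hypothetical:

#include <pthread.h>

/* Hypothetical pipeline stages; in a real system each would run on
   its own core, handing results to the next stage. */
static void *stage_acquire(void *arg)    { (void)arg; /* read raw input  */ return NULL; }
static void *stage_transform(void *arg)  { (void)arg; /* crunch the data */ return NULL; }
static void *stage_consolidate(void *arg){ (void)arg; /* merge results   */ return NULL; }

int main(void)
{
    pthread_t stages[3];
    pthread_create(&stages[0], NULL, stage_acquire,     NULL);
    pthread_create(&stages[1], NULL, stage_transform,   NULL);
    pthread_create(&stages[2], NULL, stage_consolidate, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(stages[i], NULL);
    return 0;
}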
Parallel Processing
Many-core strategy becomes less clear for large, performance-limited applications. Speeding up
such applications requires making a single thread of execution run faster using parallel
processing. This requires multiple cores to be used in parallel to divide and conquer a bigger
problem than any one of them could handle alone (think of the army ants). This is where
everyone starts to get nervous, because parallel programming is challenging. But remember—
many-core is not the problem, it’s the solution.
In contrast, a traditional processor runs single threads as fast as possible, reducing programming
effort but increasing the complexity of the underlying hardware at the cost of lower efficiency
and higher wattage. This is a great implementation strategy until the single thread isn’t fast
enough, or when power and thermal constraints limit performance. Then you will need to
implement parallel processing on multiple processors whose architectures are optimized for single-threaded execution. The many-core parallel program may be “more” parallel, but it brings higher efficiency
and lower wattage, and can be architecturally optimized for implementing the parallelisms so
that it is actually easier to program.
While inevitable, parallel processing is still quite challenging. Even in the current crop of
multicore servers, it is rarely implemented. However, with proper architecture of hardware and
software it can be made much less daunting.
Ironically, one of the ways to reduce the challenge of parallel programming is to simplify the
cores and processor memory structures, control and I/O. Legacy processors have added all sorts
of “bells & whistles” to automatically manage these things for the programmer, and for single-threaded execution these abstractions can be of great benefit. When programming many-cores in
parallel, however, these abstractions can cause great difficulties with coordination and
synchronization. It’s nearly impossible to understand and manage interdependencies between
cores when the programmer can’t actually manage the core’s operation.
One example of how reduced abstractions can simplify parallel many-core implementations is
caching. For legacy processing implementations, caches can improve performance by
intelligently predicting and managing memory access and moving data from bulk memory into
local cache. However, Wikipedia defines “many-core” as starting at the core count where “traditional” threaded programming techniques break down, and caching is a good example of why: researchers have demonstrated that cached, shared memory does not scale well.
Another example of architectural optimization is I/O. Since the cores are small and efficient, they
aren’t well suited to multitasking. Serving I/O could be done by dedicating some number of
cores, but that reduces the number of processing cores and can further complicate the parallel
programming. It is better to offload I/O to a separate unit and task the many-core array only
with processing and inter-core communications.
The Epiphany many-core architecture used by Anemone features simple, single-threaded cores
with distributed shared memory. There is a single global address map, but the actual memory
resides within a specific core. It is not a cache but local SRAM that can be globally addressed.
Thus threaded, shared-memory programs still run. However, the chip does not try to perform
data movement speculatively—i.e., there are no caches. Instead, programmers must manually
manage the memory hierarchy with the tools provided for doing so. Virtually any C code will run unmodified at modest speed; the code can then be modified to DMA both working data and key loops into the local memory of the appropriate cores.
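In principle, that manual staging looks like the C sketch below. Here memcpy stands in for the programmer-initiated DMA transfers and local_buf for a core’s on-chip SRAM; the tile size and working loop are hypothetical, since the article does not show the actual SDK calls:

#include <string.h>

#define TILE 1024                   /* stand-in for a core's local SRAM size */
static float local_buf[TILE];       /* on-chip scratch memory: no cache      */

/* Stage a tile in, work on it locally, stage the results back out.
   On Epiphany the two copies would be explicit DMA transfers rather
   than speculative cache fills. */
void process_tile(const float *bulk_in, float *bulk_out, int offset)
{
    memcpy(local_buf, &bulk_in[offset], TILE * sizeof(float));   /* "DMA in"  */
    for (int i = 0; i < TILE; i++)
        local_buf[i] *= 2.0f;       /* hypothetical working loop */
    memcpy(&bulk_out[offset], local_buf, TILE * sizeof(float));  /* "DMA out" */
}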
Bulk memory and I/O are both handled by extending the global memory map off chip via the
link ports. Multiple Anemones can be gluelessly connected to expand the many-core grid, or an
FPGA can be attached for off-loading the service of bulk memory and I/O. As an additional
benefit, the tightly coupled FPGA can now be used to provide additional processing and
implementation of special functions, as shown in Figure 3.
Because Anemone uses an architecture specifically designed for many-core, it enjoys the
benefits in efficiency that this technology promises. Each floating point core delivers 2 floating
point operations per cycle, or 1.5 GFLOPS at 750 MHz. Thus the AN104, with 16 cores,
achieves 24 GFLOPS while only consuming 1W of core power. The next generation Anemone,
in development now, will have up to 64 cores with double-precision and will deliver 96 GFLOPS
in only 2W of core power.
There is clearly no one-size-fits-all strategy to many-core. Whether an application is data
parallel, task parallel, or requires parallel processing will have a strong impact on the selection of
a processor. It is likely that legacy code was not written in a style suited to any of these parallelisms, so job one is to classify the application. Once that is done, processors can be evaluated and selected; then, and only then, can the team start applying the appropriate many-core programming model and eventually rewrite the code into a suitable form. Clearly, none of this is
easy—but remember, many-core is the solution, not the problem.
BittWare, Concord, NH. (603) 226-0404. [www.bittware.com].
Adapteva, Lexington, MA. (781) 328-0513. [www.adapteva.com].