Need a Many-Core Strategy?

Has your boss asked for a “many-core” strategy yet? Get ready. Many-core is the next step in microprocessor evolution, and it will change the world for most of us.

by Jeff Milrod, BittWare

Most experts agree that the era of Moore’s Law driving endless performance increases in legacy processors is ending, even though transistor counts may continue to climb. The way we approach processing must therefore change. We are no longer likely to get a bigger hammer next year. Since the problems are guaranteed to keep getting bigger, the inevitable conclusion is that we need to figure out how to use, and coordinate, many small hammers. In the processing world this means multicore, and eventually many-core.

Although we don’t have a stable lexicon yet, “multicore” typically refers to chips that contain a few multiples of more powerful legacy cores (Intel’s traditional Xeon, for example), while “many-core” refers to a chip containing a large number of simpler processing cores. One many-core pioneer illustrates the idea by showing a swarm of army ants successfully attacking a server. Swarming a chicken is far more likely, but the contrived illustration reminds us that evolution doesn’t always lead to bigger animals.

Many-core’s attraction is that it puts a larger percentage of transistors into computational units, so for any fixed die size it boasts higher peak performance. Many-core does this extra work without burning additional power; after all, it is just using the same number of transistors more effectively. This approach can be far more efficient and much more scalable than traditional methodologies, but only if used appropriately. Trying to code many-core processors as if they were legacy single-threaded engines will lead to extremely unpleasant parallel processing challenges, and ultimately to failure. New coding approaches are called for. But the problem is NOT many-core (after all, many-core only exists because it is the solution); the problem is a lack of knowledge and support for many-core. New strategies are needed to deal with this reality: new processor architectures along with new approaches to coding them.

To uncover the appropriate strategy for a given many-core implementation, there must first be an understanding of the problem. Recalling that the lexicon is still undefined and in flux, applications can map to many-core in three basic ways: data parallel, task parallel and parallel processing. Which type of parallelism is appropriate will drive the strategy and the selection of both processors and the programming model.

Data Parallel and Task Parallel Processing

Data parallel processing describes applications that run the same, or similar, processes many times on different data sets. This may be as simple as an application or algorithm with more than 90% of its runtime spent in loops, where (hopefully) each iteration of the loop is independent. Provided the parallelism is data independent (i.e., there are no “IF” statements), every data element can be processed in the same way, and a many-core SIMD model becomes an attractive strategy. SIMD means single instruction, multiple data: one instruction operates on many data elements simultaneously. In the “old days,” these were often called vector, or array, processors.
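As a concrete illustration (a minimal sketch, not drawn from any vendor’s library), the two C routines below add a pair of float vectors. The first adds one element per loop iteration; the second uses Intel’s SSE intrinsics, one of the SIMD precursors discussed shortly, to add four elements per instruction. For simplicity it assumes n is a multiple of four and that the buffers are 16-byte aligned.

    #include <xmmintrin.h>  /* Intel SSE intrinsics */

    /* Sequential version: one addition per iteration. */
    void add_scalar(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* SIMD version: a single instruction adds four floats at once.
       Assumes n is a multiple of 4 and 16-byte-aligned buffers. */
    void add_simd(const float *a, const float *b, float *c, int n)
    {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(&a[i]);           /* load 4 values   */
            __m128 vb = _mm_load_ps(&b[i]);           /* load 4 values   */
            _mm_store_ps(&c[i], _mm_add_ps(va, vb));  /* add and store 4 */
        }
    }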
To add two vectors of N values, each element can be added in sequence or, given SIMD hardware, several elements can be added simultaneously, as the sketch above shows. Many-core SIMD precursors included Motorola/Freescale’s AltiVec and Intel’s MMX and SSE, which do just that. These concepts have been applied more generally by various vendors, including MathStar and ClearSpeed, both of which may have been ahead of their time and failed to find commercial success.

The most successful implementation of many-core SIMD for data parallel applications has clearly been GPUs. These processors were, of course, designed to parallelize graphics operations. After Nvidia released CUDA in 2007, however, GPUs became usable for general-purpose acceleration (GPGPU). For algorithms with highly structured data parallelism that can be processed independently with no need for conditional instructions, a GPU can tremendously outperform a single or multicore CPU (Figure 1). As is typical, the success of GPGPU has attracted competitors. The primary many-core competition is Intel’s emerging Xeon Phi family, with code names like Knights Ferry and Knights Corner. Intel’s early Phi offerings leveraged old Pentium cores boosted with a wide SIMD unit; Intel says later Phi chips will use Atom cores.

Data parallel applications with data dependencies (i.e., with “IF” statements), however, are ill-suited to implementation on GPUs, or any other SIMD engine for that matter. While every data stream can be processed by the same algorithm, each stream needs to be processed independently. This requirement dictates a MIMD model, meaning a many-core processor that can independently execute multiple instructions on multiple data. This approach is far more flexible than SIMD, supporting non-structured data and conditional, or data-dependent, algorithms. Where SIMD can be thought of as a special case, the MIMD model is a more general-purpose implementation of the many-core concept.

A good example of a MIMD many-core processor is BittWare’s Anemone. Implementing the Epiphany many-core architecture from Adapteva, Anemone hosts 16 to 64 cores connected via a high-throughput mesh network, as shown in Figure 2. Since each core is independent, it is MIMD; and since it uses floating-point cores that are C-programmable, it is easy to code. For data parallel MIMD applications, the independent cores can operate concurrently on separate channels or data sets. Many applications can exploit a massive number of concurrent lightweight cores. Packet processing is an obvious example: lots of cores are a fine match for a network with lots of packets. The same logic applies to serving web pages or anything else optimized around channels, capacity or throughput.

Concurrent programming of many-core architectures can be quite straightforward, with independent implementations of a single program, provided that program “fits” into a single core. If the program doesn’t fit into a single core, it must be broken down into sub-tasks for implementation in a task parallel strategy. Task parallel processing covers a wide range of applications that contain many independent functions that can be run either concurrently or in a pipeline; at some point the results may be consolidated, but for the most part the application runs with task independence.
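A minimal sketch of the task parallel pattern follows, written with POSIX threads rather than any particular many-core toolchain; the two stage functions are hypothetical stand-ins for substantial, independent tasks that would each be mapped to its own core.

    #include <pthread.h>
    #include <stdio.h>

    /* Two independent, unrelated tasks (hypothetical stand-ins). */
    static void *stage_filter(void *arg)
    {
        (void)arg;
        puts("filtering channel data...");
        return NULL;
    }

    static void *stage_compress(void *arg)
    {
        (void)arg;
        puts("compressing results...");
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;

        /* Launch both tasks concurrently: different instructions on
           different data, i.e., MIMD rather than SIMD. */
        pthread_create(&t1, NULL, stage_filter, NULL);
        pthread_create(&t2, NULL, stage_compress, NULL);

        /* Consolidate: wait for both tasks to finish. */
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }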
These types of problems can be thought of as non-data-parallel concurrent processing, with each core running a different program, which again points to a MIMD strategy for implementation.

Parallel Processing

The many-core strategy becomes less clear for large, performance-limited applications. Speeding up such an application requires making a single thread of execution run faster using parallel processing, with multiple cores used in parallel to divide and conquer a bigger problem than any one of them could handle alone (think of the army ants). This is where everyone starts to get nervous, because parallel programming is challenging. But remember: many-core is not the problem, it is the solution.

By contrast, a traditional processor runs single threads as fast as possible, reducing programming effort but increasing the complexity of the underlying hardware at the cost of lower efficiency and higher wattage. This is a great implementation strategy until the single thread isn’t fast enough, or until power and thermal constraints limit performance. At that point you will be forced to implement parallel processing across multiple processors whose architectures were designed for single-threaded execution. The many-core parallel program may be “more” parallel, but it brings higher efficiency and lower wattage, and the hardware can be architecturally optimized for implementing the parallelism so that it is actually easier to program.

While inevitable, parallel processing is still quite challenging; even on the current crop of multicore servers it is rarely implemented. With the proper architecture of hardware and software, however, it can be made much less daunting. Ironically, one way to reduce the challenge of parallel programming is to simplify the cores and the processor’s memory structures, control and I/O. Legacy processors have added all sorts of “bells and whistles” to manage these things automatically for the programmer, and for single-threaded execution these abstractions can be of great benefit. When programming many cores in parallel, however, they can cause great difficulties with coordination and synchronization. It is nearly impossible to understand and manage interdependencies between cores when the programmer can’t actually manage each core’s operation.

One example of how reduced abstraction can simplify a parallel many-core implementation is caching. In legacy implementations, caches improve performance by intelligently predicting and managing memory accesses, moving data from bulk memory into local cache. However, Wikipedia defines “many-core” as starting at the core count where “traditional” threaded programming techniques break down, and caching illustrates why: academics have demonstrated that cached, shared memory does not scale well.

Another example of architectural optimization is I/O. Since the cores are small and efficient, they aren’t well suited to multitasking. I/O could be served by dedicating some number of cores to it, but that reduces the number of processing cores and can further complicate the parallel programming. It is better to outsource the I/O to a separate unit and task the many-core array only with processing and inter-core communications.

The Epiphany many-core architecture used by Anemone features simple, single-threaded cores with distributed shared memory. There is a single global address map, but the actual memory resides within a specific core; it is not a cache but local SRAM that can be globally addressed.
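Since each core’s SRAM sits in the global address map, moving data is just an explicit copy. The sketch below is an assumed illustration only: the static buffers stand in for core-local SRAM, and a plain memcpy stands in for whatever DMA helpers a real toolchain provides. It stages a tile of working data into local memory, computes on it, and writes the results back to bulk memory.

    #include <string.h>

    #define TILE 1024

    /* Hypothetical: these buffers stand in for core-local SRAM. */
    static float local_in[TILE];
    static float local_out[TILE];

    void process_tile(const float *bulk_in, float *bulk_out)
    {
        /* Stage in: bulk (off-chip) memory -> local SRAM. */
        memcpy(local_in, bulk_in, sizeof local_in);

        /* Compute entirely out of fast local memory. */
        for (int i = 0; i < TILE; i++)
            local_out[i] = 2.0f * local_in[i];

        /* Stage out: local SRAM -> bulk memory. */
        memcpy(bulk_out, local_out, sizeof local_out);
    }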
Because the memory is globally addressable, threaded, shared-memory programs still run. The chip does not, however, attempt any speculative data movement; there are no caches. Instead, programmers manually manage the memory hierarchy with the tools provided for doing so, much as in the sketch above. Virtually any C code will run unmodified at modest speed, and the code can then be modified to DMA both working data and key loops into the appropriate cores.

Bulk memory and I/O are both handled by extending the global memory map off chip via the link ports. Multiple Anemones can be gluelessly connected to expand the many-core grid, or an FPGA can be attached to off-load the servicing of bulk memory and I/O. As an additional benefit, the tightly coupled FPGA can then provide additional processing and implement special functions, as shown in Figure 3.

Because Anemone uses an architecture specifically designed for many-core, it enjoys the efficiency benefits this technology promises. Each floating-point core delivers two floating-point operations per cycle, or 1.5 GFLOPS at 750 MHz. Thus the AN104, with 16 cores, achieves 24 GFLOPS while consuming only 1W of core power. The next-generation Anemone, in development now, will have up to 64 cores with double precision and will deliver 96 GFLOPS in only 2W of core power.

There is clearly no one-size-fits-all strategy for many-core. Whether an application is data parallel, task parallel, or requires parallel processing will strongly influence the selection of a processor. Since legacy code is likely not written in a style suitable for any parallelism, job one is to classify the application. Only once that is done can processors be evaluated and selected; then, and only then, can the team start to apply the appropriate many-core programming model and eventually rewrite the code into a suitable form. Clearly, none of this is easy. But remember: many-core is the solution, not the problem.

BittWare, Concord, NH. (603) 226-0404. [www.bittware.com].
Adapteva, Lexington, MA. (781) 328-0513. [www.adapteva.com].