Redpaper

Matthew Drahzal, Kirk E. Jordan, Eric Kronstadt, Lance E. Miller, James C. Sexton, Alexander Zekulin

System Level Accelerators: A Hybrid Computing Solution

This IBM® Redpaper publication presents the System Level Accelerator strategy and taxonomy and provides a roadmap for employing System Level Accelerators today. The paper also discusses examples of System Level Acceleration and includes strategies for forward progress.

© Copyright IBM Corp. 2008. All rights reserved. ibm.com/redbooks

Executive summary

IBM Blue Gene® is the fastest computer architecture in the world, able to simulate nuclear collisions, the activity of the human mind, and the way a virus replicates. The IBM Cell Broadband Engine™ architecture provides supercomputer performance in a set-top gaming machine, and it provides the horsepower for the supercomputer under development for Los Alamos National Labs. Yet current mainstream system and software development paradigms have not embraced the power of these solutions for business and technical computing. These specialized systems and processors, however, point the way into the future, in a direction that cannot be ignored.

A review of compute technology shows that performance improvements have alternated between general purpose single central processing unit (CPU) speed increases and specialization. Historically, when performance improvements in High Performance Computing (HPC) were needed, specialization technologies such as vectorization and parallelization were embraced and then abandoned quickly as processor technology evolved. We depended upon Moore's Law to improve single core performance, yet we always knew the speed of light would eventually limit that growth.¹ Moore's Law has held, but we have been cheating: instead of adding more density to a single core, multiple cores have been placed on a single die.
Thus, the individual performance of each general purpose core has not improved. For single threaded applications, performance has hit a wall. The gating factor is, quite literally, gate leakage at the very fundamentals of each microprocessor.

This paper proposes another paradigm: System Level Accelerators, the deployment of diverse compute resources to solve the problem of a single work stream, all while presenting the experience of a single server to the user. Supported by middleware, ISV applications, and development tooling, System Level Acceleration allows you to take advantage of state-of-the-art specialized processors to accelerate applications without a total code rewrite, and to speed up applications beyond what execution on a multi-core parallel system alone can deliver.

Introduction

This section discusses the current state of HPC and current industry trends.

Current state of HPC

This section discusses the current state of HPC.

The end of classical CPU scaling

IBM foresaw the end of CPU speed scaling many years ago and began to lay the framework for more innovative solutions by introducing the first multi-core POWER™ processors in 2001. Consequently, IBM has the most pervasive use of multi-core solutions in the market today, with experience in multithreaded and multi-core CPUs, servers, and application tooling. Although multi-core technology has extended Moore's Law for IBM, the performance increase to a single application can only, at best, be linear with the number of processors that are applied to the problem.

¹ Moore's Law is the "empirical observation made in 1965 that the number of transistors on an integrated circuit for minimum component cost doubles every 24 months." Source: Wikipedia
Yet, as the industry approaches 4 GHz CPU speeds, the current speed limit for general purpose processors, multi-core is all that is available, with many-core processors being the way of the future. As circuit features shrink toward the size of individual atoms, it is going to get rather difficult to get any smaller.

There are two types of power that figure in the power consumption of a processor:

Active power, the power used to do work
Passive power, the cost of being there

Think of passive power as the overhead or bureaucracy of a processor. Figure 1 shows current trends of power consumption versus gate length (in microns). Notice that passive power is approaching active power consumption. In recent years gate leakage has been an increasing contributor to the passive power consumption in microprocessors. So, as the gate length (the size of the etchings in the lithography of a processor) shrinks, the bureaucracy increases, thus preventing processors from running at much higher speeds. When passive power equals active power, it is difficult to get any work done on a processor.

Figure 1 Two relevant power consumption types in a microprocessor (active versus passive power plotted against gate length, 1994-2005)

The energy crisis

In addition to the difficulty of obtaining greater speeds, we are now dealing with power consumption issues. IBM is beginning to see Requests for Proposals with floating point operations per megawatt as part of the required information. Electricity expense is becoming a large life-cycle issue for high performance computing users. As energy costs increase, the energy consumed by CPUs is becoming more important. The amount of energy required by a CPU increases polynomially (as the cube) with its speed. Higher clock speeds require more cooling, more airflow, and more electricity to drive the cooling process.
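The cubic relationship between clock speed and power can be illustrated with a few lines of arithmetic. This is a simplified model of the scaling rule stated above, not a measurement of any particular processor:

```python
# Simplified model: dynamic power scales roughly as P ~ C * V^2 * f, and
# because supply voltage must rise with frequency, power grows approximately
# as the cube of clock speed.
def relative_power(clock_ratio):
    """Relative power draw for a given clock-speed ratio under the cubic model."""
    return clock_ratio ** 3

# Doubling the clock costs roughly 8x the power under this model;
# running 20% slower saves roughly half the power.
print(relative_power(2.0))   # -> 8.0
print(relative_power(0.8))   # -> 0.512...
```

Under this model, eight cores at half the clock deliver several times the peak operations of one full-speed core for about the same power, which is exactly the trade Blue Gene makes with its many slow, efficient cores.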
If extra heat in a CPU is bad, the problem is exacerbated by the equally wasteful energy required to move that waste heat away. It is no coincidence that the fastest supercomputer in the world, Blue Gene, uses efficient chips running at rather low clock speeds. Blue Gene is, in fact, very green.

Scaling solutions in a cluster configuration makes the energy consumption issue even worse. When we add processors to increase performance, we also add memory, power supplies, disk drives, fans, network adapters, and more. All of this extra equipment uses power even when the CPU is idle. It is a large additional per-unit fixed cost for scaling through clustering. Some of the highest energy consuming compute installations in the world are based upon clustering configurations.

What is System Level Acceleration

System Level Acceleration is a computing approach that allows multiple, disparate computer systems to interact and carry out a complete computational and interactive workflow. Although this approach can be applied in its simplest form to a cluster farm with disparate processors, all connected homogeneously, the real power of the System Level Accelerator approach lies with disparate systems, each with specific functions such as computing, visualization, data storage, and so forth. The reason for this approach is the vast improvement in the efficiency of the overall completion of a workflow. There are several different approaches to System Level Acceleration.

Faster engineering of specialized processors

IBM has proven its ability to create new custom processor designs quickly, based upon specific needs for particular applications. Examples of this ability today are the Cell Broadband Engine (Cell/B.E.™) CPU for the Sony PS3, the processors running the Xbox 360 and the Nintendo Wii, and many other custom engineered solutions for various customers.
As an example of how quickly IBM can design a specialized processor, IBM signed a contract with Microsoft® in August 2003 for a three-core gaming processor. IBM had early access test chips available a year later, the first full-chip tape-out in October 2004, and the final tape-out complete in January 2005. IBM successfully designed a specialized gaming CPU in less time than it takes many companies to complete a minor software revision. For additional information about this design effort, see the article by Dean Takahashi, "Learning from failure: The inside story on how IBM out-foxed Intel with the Xbox 360," Electronic Business, 01 May 2006, at:

http://www.reed-electronics.com/CA6328378.html

Today, IBM can create a processor for a specific purpose. These processors deliver super-linear speedup over general purpose processors, use less electricity, and require less cooling. This capability is new, and IBM is leading the way.

The flood of data streams

The amount of data that flows and must be analyzed continually is also growing. IBM has recently announced efforts in stream computing that are focused on receiving flows of information and making decisions and taking actions based upon the content of the incoming data stream, whether it be financial data, company news, or current events. To learn more about the IBM effort in stream computing, see the article by Darryl K. Taft, "IBM Previews Stream Computing System," eWeek.com, 19 June 2007, at:

http://www.eweek.com/c/a/Infrastructure/IBM-Previews-Stream-Computing-System/

Additionally, IBM is on the cusp of a change in technical sensor data collection. The more data that is collected, the more processing must occur on each data stream. The Astron Low Frequency Array (LOFAR) software telescope project is an example of sensor data arriving and needing to be processed in real time.
To learn more about IBM's involvement with the Astron LOFAR project, go to:

http://domino.watson.ibm.com/comm/pr.nsf/pages/news.20040223_bluegene.html

On the security front, there is increasing use of video cameras, a technology that needs gigaflops of processing to digitize, analyze, and store incoming data streams.

The need for speed

The grand irony is that the failure of classical scaling is occurring at a time when the need for computing performance is growing. Areas such as computational biology, drug discovery, petroleum exploration, and simulation-based engineering sciences are driving computing requirements to a higher level than ever. For more information about simulation-based engineering, see Simulation-Based Engineering Science, from the National Science Foundation, document number sbes0506, May 26, 2006, at:

http://www.nsf.gov/publications/pub_summ.jsp?ods_key=sbes0506

High performance computing Independent Software Vendors (ISVs) are trying to find ways to scale the performance of their applications cost effectively.² However, the marketplace cannot afford to rewrite entire HPC applications, and the need for speed is greater than ever.

Current industry trends

There are multiple changes occurring in the marketplace that make adoption of the System Level Acceleration (SLA) paradigm important. These changes provide the underlying infrastructure to allow SLA growth, as well as changes in development models that allow us to think more abstractly.

Processor diversity

As we look at innovation in computing, there is significant innovation in specialty processors. For example, there is a movement to employ capabilities as diverse as video graphics processors and Field Programmable Gate Array (FPGA) technologies as accelerators. Traditionally, for a repeatable algorithm (for example, signal processing), the industry created Application Specific Integrated Circuits (ASICs) to execute that algorithm.
With the cost of FPGA chips lower than ever, we can code these algorithms and write them to FPGAs much later in the design cycle, giving us flexibility for later change. Some server companies have even made this reconfigurable computing part of their server offerings. In addition, we see diversity even on a single chip itself. The Cell Broadband Engine (Cell/B.E.) processor is single-die heterogeneous computing, containing one Power Processing Element (PPE) and up to eight specialized Synergistic Processing Elements (SPEs) that perform floating point mathematics at lightning speed. Even Blue Gene itself is an example of diversity in design. With its four-processor shared memory design and very high speed interconnects, applications that map well to that architecture run excellently. However, this architecture cannot be all things to all people.

This concept of specialization extends outside the boundaries of high performance computing. Even .Net and Java™ architectures are being optimized by using specialized processors, with products such as the Azul Systems Compute Appliances for Network Attached Processing. These are non general purpose processors in a custom architecture with a single purpose: accelerating virtual machines.

² To learn more about the ISV HPC market, see Earl Joseph, Ph.D., Addison Snell, Christopher G. Willard, Ph.D., Suzy Tichenor, Dolores Shaffer, and Steve Conway, Study of ISVs Serving the High Performance Computing Market: The Need for Better Application Software, July 2005.

Virtualization

Virtualization has become a leading trend in the marketplace today (although it has been part of the IBM portfolio for decades). Server access is virtualized, with Logical Partitions on IBM System p™ permitting access to even portions of a processor; virtual machines run Java and .Net; even virtual hardware is becoming commonplace.
Developers have become accustomed to thinking about virtual abstractions of servers; every time a line of Java code is written, the developer cares very little about where it runs, only about what it does. As we extend this to acceleration, we can begin to implement applications that benefit from acceleration without knowing what form the acceleration takes. In many respects, System Level Acceleration is the ultimate form of virtualization.

Service-oriented architecture

IBM has been making a large investment in enabling service-oriented architecture (SOA). In this architecture paradigm, the concept of definable, implementable, and deployable services becomes the core of the systems development strategy. Business processes are defined as a series of interoperating services, and modern business applications are implemented as such. These composite applications are developed using open standards and middleware such as WebSphere® and DB2® as the underlying fabric. What is important about SOA is that the server that implements the service is virtualized. A user of an SOA service is happily unaware of how that service is implemented, or even of the type of server that implements it (a System z™, a System x™, or perhaps even a Blue Gene server). The service consumer simply wants the service to complete as quickly as possible with the correct answers. Extending this paradigm to HPC is a natural evolution of technology.

The opportunity today

It has been said that technological revolutions happen as a collection of events that all point in the same direction and lead to a similar outcome. If true, we could say that the System Level Acceleration revolution has already begun.

Visualization

In several instances, we are seeing IBM customers and others integrating, or at least connecting, systems that each perform a specific task of a certain workflow.
Perhaps the most common instance of SLA is connected with supercomputing and the subsequent visualization of the supercomputing results. IBM Deep Computing Visualization is an example of this integration. The rendering cluster ingests data from a parallel system that does the computation, and the rendering cluster produces the pixels that can be passed on to another system for actual display.

High performance computing

Even Blue Gene is occasionally being used now as a System Level Accelerator. The LOFAR project at ASTRON in the Netherlands is using a combination of clusters to ingest data from the array sensors, Blue Gene for correlation, and other systems to handle visualization and imaging. The Blue Gene supercomputer is one part of a heterogeneous solution and serves as a correlation accelerator.

XML Acceleration

Returning to commercial network processing, IBM is now employing accelerators in the commercial world. The DataPower® family of XML Accelerators has recently been added to the IBM product portfolio. The DataPower accelerator appliances are purpose-built, easy-to-deploy network devices that simplify, help secure, and accelerate XML and Web services deployments, while extending SOA infrastructures (see Figure 2). The DataPower solutions allow customers to off-load XML and security processing from their existing server infrastructure, thus freeing the CPU and memory needed to handle these tasks. In many cases this XML processing workload consumes a high percentage of a server's capacity, and using a purpose-built accelerator to offload this work is a cost-effective method of optimizing server resources.
Figure 2 DataPower Accelerator (the XA35 XML Accelerator in XML proxy mode, sitting between the Internet and the application or Web servers)

Opportunity tomorrow

IBM is in a unique position to move this vision forward because the market forces are demanding a solution: tighter IT budgets, and human resource, power, and cooling constraints. What started as autonomic computing has evolved into a truly virtual environment, where programs are designed to an open architecture and an enterprise optimizer can decide where best to run portions of a code base, based on constraints imposed by the system administrators. Because of its diverse systems and software heritage, IBM is in a unique position to see this evolution as it occurs and to drive it toward a standard approach throughout the industry.

The System Level Acceleration solution

This section discusses the System Level Acceleration solution.

The goal

From the preceding analysis, we foresee a future in which, for technology, cost, and capability reasons, large scale computer installations in business, research, and government will consist of multiple heterogeneous systems (that is, Hybrid Computing). Each different system will manage a piece of the workload that the installation is delivering. The competitive driver for the adoption and deployment of such systems will continue to be the ease and ability to develop the applications. The end goal, we propose, is to provide a heterogeneous system that contains a number of different compute platforms. This heterogeneous system can provide computational service for many workloads, each consisting of a number of separate components.
The System Level Acceleration approach that we propose here seeks to achieve the following goals:

For the user: The heterogeneous system needs to look and feel like a single integrated computer that can be driven from the user's workstation or mobile computer through a GUI, spreadsheet, or other simple interface. To the user, the effort to manage the workload on the system must be simple and intuitive, while also providing the full cost benefits possible from deploying the correct systems architecture to solve each piece of the workload.

For the applications developer: The System Level Acceleration approach must provide portable interfaces that abstract hardware interfaces from the application, allowing the developer to be somewhat unaware of the nature of the System Level Accelerator. This approach implies a late binding of the accelerator to the application and a level of abstraction that makes System Level Acceleration approachable for a wide range of developers.

For the systems manager: Again, this heterogeneous system must look and feel like a single integrated computer. Tooling that allows single system images and management of the entire solution is an important part of any System Level Acceleration solution.

Approaches

At first glance, System Level Acceleration is not a new idea. Heterogeneous systems installations have been around since the dawn of computers. What has changed significantly is the understanding that workloads require significantly more cycles than can be delivered by a single processor core. Given the end of classical scaling, this problem requires multiprocessor systems for its solution. You then face an optimization issue to determine the correct multiprocessor system for a given application or for a range of work streams.
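The late binding of accelerator to application described above can be sketched as a simple runtime dispatch. This sketch is illustrative only; the registry and backend names are invented for the example and are not part of any IBM product:

```python
# Hypothetical accelerator registry: the application codes against an abstract
# "solve" operation; which backend actually runs it is decided at run time.
ACCELERATORS = {}

def register(name, available):
    """Decorator that records a backend and whether it is present on this system."""
    def wrap(fn):
        ACCELERATORS[name] = (available, fn)
        return fn
    return wrap

@register("fpga", available=False)        # e.g. no FPGA card installed here
def solve_fpga(data):
    raise NotImplementedError

@register("generic-cpu", available=True)  # fallback that is always present
def solve_cpu(data):
    return sorted(data)

def solve(data):
    """Late binding: pick the first available accelerator at call time."""
    for name, (available, fn) in ACCELERATORS.items():
        if available:
            return fn(data)
    raise RuntimeError("no backend available")

print(solve([3, 1, 2]))  # -> [1, 2, 3]
```

The application never names a backend; adding a new accelerator later means registering one more implementation, not rewriting the caller.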
Increasingly, the competitive systems advantage goes to those systems that integrate well with each other in heterogeneous environments, and the competitive applications advantage goes to those applications that can be deployed easily on heterogeneous systems.

From a user perspective, we identify two classes of application or work flow where System Level Acceleration is advantageous:

Loosely coupled acceleration (LCA)
Tightly coupled acceleration (TCA)

Loosely coupled acceleration

Loosely coupled acceleration (LCA) is a technique that is somewhat familiar to traditional HPC users. In LCA, each step in a work stream is a separate application. In general, the output from one application is stored on shared disk, and then, at some future time, those results are used as input to another application running on, perhaps, a different server. The execution of these applications is coordinated by scripts and utilities that schedule and control the job submissions. It is not much of a leap to make an association between LCA and traditional batch processing (see Figure 3). Because LCA is a familiar instance of acceleration, many solutions have already been designed to use this technique. In addition, scheduling tools and applications are readily available to support execution using this model.

Figure 3 Loosely Coupled Acceleration block diagram (schedulers and parallel file systems connect the host system and the accelerator system)

IBM has developed demonstrations of this type of coupling for supercomputing visualization using Blue Gene as the accelerator. The accelerator applications are the computational chemistry modeling codes Amber and NAMD. Both of these codes compute models of molecules and produce text and visual output to disk. The output can then be read from disk and visualized with the industry-established open source visualizer Visual Molecular Dynamics (VMD).
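The LCA pattern just described, in which one application writes results to shared disk and a wrapper later hands them to another application on a different system, can be sketched in a few lines. The host name, paths, and command names here are placeholders, not those used in the actual demonstration:

```python
import shlex

# Placeholder host and path; the real demonstration used Blue Gene running
# Amber or NAMD, with VMD as the visualizer.
COMPUTE_HOST = "bluegene-frontend.example.com"
SHARED_OUTPUT = "/gpfs/results/run001"

def build_steps(host, output_dir):
    """Build the command list for a two-step loosely coupled work stream:
    compute remotely via ssh, then visualize the results locally.
    The shared filesystem is the only coupling between the two steps."""
    compute = ["ssh", host] + shlex.split(f"run_simulation --output {output_dir}")
    visualize = ["vmd", "-e", f"{output_dir}/view.tcl"]
    return [compute, visualize]

# Each step would be handed to subprocess.run(step, check=True) in turn.
for step in build_steps(COMPUTE_HOST, SHARED_OUTPUT):
    print(" ".join(step))
```

Because the coupling is only through files on disk, either step can be rescheduled, rerun, or moved to different hardware without touching the other.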
The demonstration of loosely coupling these two systems was achieved using a bash shell wrapper that employed secure shell (ssh) forwarding to create a seamless interface for the user. With one command, the user could have the Blue Gene application executed remotely and its output visualized with customized parameters on the user's mobile computer. This system was then enhanced by using Deep Computing Visualization (DCV). This approach was tested on a video wall consisting of four off-the-shelf high resolution monitors, each driven by a dedicated IntelliStation. The demonstration was then configured to send the output over Scalable Visual Network (SVN) to this scaled wall remotely. See Figure 4 for the workflow of this demonstration.

Figure 4 IBM Blue Gene as a member of an LCA system (client scripts authenticate the user with ssh forwarding and initiate the run; the front end calls Blue Gene to execute the application and gathers data on disk; results are presented using VMD)

Tightly coupled acceleration

The tightly coupled acceleration (TCA) model is just that: more tightly coupled than LCA. In the TCA architecture, we think less about separate applications each running start to finish and more about a single application that is distributed across a host, which houses the master, and one or more accelerators, which house the worker portions of the application. In this model, an application begins, and then, like a good manager, it uses accelerator-workers, if available, to take sections of the computation and execute them. The manager can then wait for the return of the work package, or it can go on to other tasks while the accelerator-worker completes its section of the computation.
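The manager/worker interaction just described can be sketched with the standard Python concurrent.futures module standing in for an accelerator middleware; the work function and data are invented for the illustration:

```python
from concurrent.futures import ThreadPoolExecutor

def worker_kernel(chunk):
    """The 'accelerator-worker' portion: a compute kernel applied to one
    section of the overall problem (here, a trivial sum of squares)."""
    return sum(x * x for x in chunk)

def manager(data, workers=4):
    """The 'manager' portion: split the computation, hand sections to
    workers, optionally do other work, then gather the results."""
    chunks = [data[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(worker_kernel, c) for c in chunks]
        # The manager is free to do other tasks here while the workers run;
        # f.result() blocks only when a section is not yet finished.
        return sum(f.result() for f in futures)

print(manager(list(range(10))))  # -> 285
```

In a real TCA deployment the pool would be replaced by remote accelerators, but the shape of the code, submit work packages and later collect results, is the same.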
After the accelerator-worker has completed its task for the manager (and returned the results), the worker is available to be tasked again, by the same manager or perhaps even by a different instance of the manager on another server.

We see this tightly coupled acceleration encoded in applications in three forms:

Compiler-based and library acceleration. In this model, the compiler (or an explicit library call) is used to create subroutine calls in an application that invoke the accelerator-workers. This is very similar to the vectorizing compilers made popular in the 1980s, which examined code and made all the requisite calls to send work to the vector processor. This technique has worked very well where the latencies for handing a task to an accelerator-worker are very small.

Message passing through Parallel Virtual Machine (PVM) and heterogeneous MPI. These message-passing techniques require more architectural control over an application and have traditionally been used over homogeneous networks (although PVM had some success in heterogeneous use in the 1990s). This kind of TCA is the most common today, with hundreds of applications having been developed with MPI.

Remote Procedure Call (RPC). In this model, data is transferred and work is instantiated on an accelerator-server through calls to a middleware API. The middleware handles the actual data transmission, process and thread creation, and overall control. An example of this type of coupling was completed by a collaboration of academic researchers and IBM scientists using innovative software by research teams at Rutgers University and the University of Texas at Austin (see Figure 5).
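The RPC form can be sketched with the standard Python XML-RPC modules standing in for the accelerator middleware. The service name and kernel are invented for the illustration; the actual work described here used the DISCOVER framework:

```python
import threading
from xmlrpc.server import SimpleXMLRPCServer
from xmlrpc.client import ServerProxy

# "Accelerator-server" side: a compute kernel exposed through a middleware API.
def accelerate(values):
    return [v * v for v in values]

server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
port = server.server_address[1]
server.register_function(accelerate, "accelerate")
threading.Thread(target=server.serve_forever, daemon=True).start()

# "Host" side: the application instantiates work on the accelerator through
# the middleware; data transmission and control are handled for it.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
result = proxy.accelerate([1, 2, 3])
print(result)  # -> [1, 4, 9]
```

The host code never sees sockets or serialization; as in the TCA model, the middleware carries the data and runs the work where the accelerator lives.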
Figure 5 IPARS/DISCOVER Tightly Coupled Acceleration model using Blue Gene as an SLA (optimization service clients such as SPSA, VFSA, and exhaustive search send guesses; DISCOVER checks each guess against a MySQL database and, if the guess is new, instantiates IPARS with the guess as a parameter; clients interact with the running IPARS instances through DISCOVER)

The Rutgers team developed an application framework called DISCOVER to monitor and steer other applications during runtime. A computational reservoir simulator, the Integrated Parallel Accurate Reservoir Simulator (IPARS) from the Center for Subsurface Modeling at the University of Texas at Austin, has long been a standard test example for high performance computing. IBM developed a System Level Acceleration proof of concept using IPARS as the accelerator code running on Blue Gene. The IPARS codebase was updated to connect directly to the DISCOVER software during runtime, which allowed the user to initiate the execution without knowledge of where the execution was taking place. The results were fast and seamless to the user. In this scenario, Blue Gene runs all of the IPARS instances as well as the optimization service that explores the parameter space of the reservoir model.

Comparison with Grid Computing

System Level Acceleration is not Grid Computing, although Grid Computing and System Level Acceleration share some characteristics. Grid Computing is a way to manage and use multiple systems, but in that paradigm, each system keeps its separate identity. Grid Computing is a system- and hardware-focused approach to the management of multiple disparate resources. Traditionally, a major issue for Grid Computing has been the network interconnect.
Often the network connection is the bottleneck, the slowest link connecting the systems, which might run at 100 megabits per second or less, with high latencies. This bottleneck dictates that parallel tasks minimize communication between them and operate largely independently.

System Level Acceleration, in contrast, is a work stream approach. The idea is that a work stream is broken up into component pieces. Each component piece uses the best possible system for that particular component and communicates the necessary data to the other components. For example, a computational fluid dynamics work stream might consist of a grid generation component, a partitioning component, a parallel solver component, and a visualization component. The grid generation and partitioning components might be best suited to a symmetric multiprocessing (SMP) system with large memory. After the partitioning is complete, the partition pieces might best be solved using a parallel solver. The results of the parallel solution, because of the need for significant memory, may require parallel rendering before finally being displayed at the researcher's desk. Through System Level Acceleration, the entire work stream would be developed with the algorithms best suited to the systems performing each specific component, but it might also be aware of the interaction or data transfer between component parts. Some System Level Acceleration algorithmic designs overlap and hide data transfer to further reduce the elapsed time of the work stream. If done with an appropriate abstraction, when new technology arrives, only the affected part of the work stream need be revised to take advantage of it.

Conclusion

At a hardware level, System Level Acceleration integrates all the diverse systems architectures currently available. It also provides a paradigm that allows development and deployment of new architectures as they are acquired.
It can also simplify the deployment of new architectures in the future, because it allows development to focus on the core attributes of the new architecture and provides a ready-made environment into which that architecture can be dropped when complete. System Level Acceleration simplifies customers' uptake and integration effort for new architectures and provides a framework for users to exploit new application areas more easily and robustly. In addition to decreasing time to solution, this framework requires fewer code changes in the future and allows more systems to play key roles in application development and usage.

One key example of this flexibility is to extend the TCA example presented in this paper, where IPARS was the accelerator running on Blue Gene, with the DISCOVER application as a tightly coupled runtime front end. That example can be extended to include a visualization component that can be tightly or loosely coupled. Visualization of output is an essential component of any system, and the System Level Acceleration paradigm can be employed to use the right hardware (dedicated SMPs and the like) for the job. The results can be seamlessly joined and broadcast in numerous ways through the strategic use of existing IBM applications such as DCV.

IBM has always been a world leader in mainstream and high performance computing. While there are limits to general purpose single core performance gains, the hardware innovation toward processor specialization and integration that expands our computing capabilities has already been led by innovative work from IBM. IBM will continue this role by developing software and middleware and by supporting architectures that allow applications and developers to take full advantage of these innovations, while making high performance computing even easier for developers and companies to use.
The team that wrote this paper This paper was produced by a team of specialists working with the International Technical Support Organization (ITSO), Rochester Center. Matthew Drahzal is a Product Line Strategist for the IBM Deep Computing CTO organization. He has 20 years of experience in developing, designing, managing, and mentoring real-time, commercial, and HPC solutions. He has a BSCS and MBA from Syracuse University and has spoken worldwide on modern software engineering practices. Kirk E. Jordan is the Emerging Solutions Executive in IBM Deep Computing. At IBM, he oversees development of applications for advanced computing architectures, investigates and develops concepts for new areas of growth, especially in the life sciences involving HPC, and provides leadership in high-end computing and simulation in such areas as systems biology and high-end visualization. He has a Ph.D. in Applied Mathematics from the University of Delaware and more than 25 years of experience in high performance and parallel computing. He held computational science positions at Exxon R&E, Argonne National Lab, Thinking Machines, and Kendall Square Research before joining IBM in 1994. At IBM, he has held several positions promoting HPC and high performance visualization, including managing the IBM University Relations Shared University Research (SUR) Program and leading the IBM Healthcare and Life Sciences Strategic Relationships and Institutes of Innovation Programs. Eric Kronstadt is Director of Exploratory Server Systems and the Director of the Deep Computing Institute, with responsibility for advanced operating systems research, HPC architectures including BlueGene, and emerging high performance applications, including computational biology.
Prior to this role, Eric was Director of VLSI Systems, where his responsibilities included development of the PowerPC® architecture, research in microprocessor implementation and micro-architecture, as well as CAD development. He has been with IBM over 20 years, starting as a software developer, moving into management, and being involved in the design and specification of a number of high performance experimental RISC microprocessors, as well as the development of a standard cell design system. Eric graduated from Brown University and received his Ph.D. in mathematics from Harvard University. Lance E. Miller is currently a doctoral candidate at the University of Connecticut, pursuing two degrees: one in computer science and the other in mathematics. Before coming to Connecticut, he performed research as a student in the department of computer science at New Mexico State University and at Physical Science Laboratories, working with the Army Research Laboratory. He is a two-time recipient of the competitive IBM Ph.D. fellowship. In support of these fellowships, he spent time with IBM working on HPC demonstrations for IBM Deep Computing. James C. Sexton is a Research Staff Member at IBM T. J. Watson Research Center, where he works on the IBM Blue Gene project. He received his Ph.D. in Theoretical Physics from Columbia University in New York. He has held research positions at Fermilab in Illinois, at the Institute for Advanced Study in Princeton, and at the Hitachi Central Research Laboratory in Tokyo. Prior to joining IBM in 2005, he was a professor in the School of Mathematics at Trinity College Dublin and Director of the Trinity College Center for High Performance Computing. His interests are in theoretical and computational physics and in applications and systems for HPC. Alexander Zekulin is a Business Development Executive with the IBM System Blue Gene Solutions group. His focus is to enable customers and partners to develop their solutions on the Blue Gene platforms.
Alex received his Ph.D. in Geophysics from the University of Kentucky. Alexander has applied his knowledge at IBM working with petroleum and environmental companies in high performance computing, advanced data management, and non-linear optimization. In his previous role at IBM, he was in the healthcare and life sciences group, focused on solving IBM healthcare customers' IT and business problems. Before that, Alex was responsible for the IBM services relationships with numerous analytical solution providers, including SAS Institute, Intel®, Business Objects, MicroStrategy, and ESRI. These partnerships resulted in numerous joint solution offerings for IBM clients across many industries. Thanks to the following for contributing to this project: LindaMay Patterson, ITSO, IBM Rochester, U.S. Notices This information was developed for products and services offered in the U.S.A. IBM may not offer the products, services, or features discussed in this document in other countries. Consult your local IBM representative for information on the products and services currently available in your area. Any reference to an IBM product, program, or service is not intended to state or imply that only that IBM product, program, or service may be used. Any functionally equivalent product, program, or service that does not infringe any IBM intellectual property right may be used instead. However, it is the user's responsibility to evaluate and verify the operation of any non-IBM product, program, or service. IBM may have patents or pending patent applications covering subject matter described in this document. The furnishing of this document does not give you any license to these patents. You can send license inquiries, in writing, to: IBM Director of Licensing, IBM Corporation, North Castle Drive, Armonk, NY 10504-1785 U.S.A.
The following paragraph does not apply to the United Kingdom or any other country where such provisions are inconsistent with local law: INTERNATIONAL BUSINESS MACHINES CORPORATION PROVIDES THIS PUBLICATION "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF NON-INFRINGEMENT, MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Some states do not allow disclaimer of express or implied warranties in certain transactions, therefore, this statement may not apply to you. This information could include technical inaccuracies or typographical errors. Changes are periodically made to the information herein; these changes will be incorporated in new editions of the publication. IBM may make improvements and/or changes in the product(s) and/or the program(s) described in this publication at any time without notice. Any references in this information to non-IBM Web sites are provided for convenience only and do not in any manner serve as an endorsement of those Web sites. The materials at those Web sites are not part of the materials for this IBM product and use of those Web sites is at your own risk. IBM may use or distribute any of the information you supply in any way it believes appropriate without incurring any obligation to you. Information concerning non-IBM products was obtained from the suppliers of those products, their published announcements or other publicly available sources. IBM has not tested those products and cannot confirm the accuracy of performance, compatibility or any other claims related to non-IBM products. Questions on the capabilities of non-IBM products should be addressed to the suppliers of those products. This information contains examples of data and reports used in daily business operations. To illustrate them as completely as possible, the examples include the names of individuals, companies, brands, and products. 
All of these names are fictitious and any similarity to the names and addresses used by an actual business enterprise is entirely coincidental. COPYRIGHT LICENSE: This information contains sample application programs in source language, which illustrate programming techniques on various operating platforms. You may copy, modify, and distribute these sample programs in any form without payment to IBM, for the purposes of developing, using, marketing or distributing application programs conforming to the application programming interface for the operating platform for which the sample programs are written. These examples have not been thoroughly tested under all conditions. IBM, therefore, cannot guarantee or imply reliability, serviceability, or function of these programs. © Copyright International Business Machines Corporation 2008. All rights reserved. Note to U.S. Government Users Restricted Rights -- Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp. This document REDP-4409-00 was created or updated on April 24, 2008. Send us your comments in one of the following ways: Use the online Contact us review Redbooks form found at: ibm.com/redbooks Send your comments in an e-mail to: redbooks@us.ibm.com Mail your comments to: IBM Corporation, International Technical Support Organization Dept. HYTD Mail Station P099 2455 South Road Poughkeepsie, NY 12601-5400 U.S.A. Trademarks IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. These and other IBM trademarked terms are marked on their first occurrence in this information with the appropriate symbol (® or ™), indicating US registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries.
A current list of IBM trademarks is available on the Web at http://www.ibm.com/legal/copytrade.shtml The following terms are trademarks of the International Business Machines Corporation in the United States, other countries, or both: Redbooks (logo)®, Blue Gene®, DataPower®, DB2®, IBM®, PowerPC®, POWER™, System p™, System x™, System z™, WebSphere®. The following terms are trademarks of other companies: Cell Broadband Engine and Cell/B.E. are trademarks of Sony Computer Entertainment, Inc., in the United States, other countries, or both, and are used under license therefrom. Java, and all Java-based trademarks are trademarks of Sun Microsystems, Inc. in the United States, other countries, or both. Microsoft, Xbox 360, Xbox, and the Windows logo are trademarks of Microsoft Corporation in the United States, other countries, or both. Intel, Intel logo, Intel Inside logo, and Intel Centrino logo are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States, other countries, or both. Other company, product, or service names may be trademarks or service marks of others.