IBM Deep Computing

FOREWORD BY MARK HARRIS (IBM COUNTRY GENERAL MANAGER, SOUTH AFRICA)

After ten years of research and development, IBM delivered to the US government a supercomputer capable of 100 Teraflops. This project, in which hundreds of IBMers have been involved, has had a tremendous effect upon the computer industry and, in many ways, our society as a whole. It has helped usher in new technologies that both spurred new markets and forever changed old ones.

One of the key advances was the development of processor interconnect technology. As processor speed increases alone could not meet the computing challenge, engineers developed this technology to allow thousands of processors to work on the same problem at once. Related software was developed to allow those thousands of processors to write data to a single file. These innovations have allowed fundamental changes in science and business. The automotive industry now uses sophisticated simulations to design and build safer cars – virtually. Oil companies use modelling of the earth's geologic layers to pinpoint more likely sources of oil and reduce unneeded drilling. Weather forecasting has been improved – believe it or not. Boeing has even designed an entire virtual aircraft on a computer and brought it to market without ever building a prototype. Great strides have also been made in genomic research, drug discovery, and cancer research. The Square Kilometre Array (SKA) project holds immense opportunities for astronomers the world over, yet it would not have been possible without the availability of high performance and grid computing.

Of course, with advances like these come new challenges, but at the same time they also bring new opportunities from a research and computational application perspective. South Africa has an opportunity to capitalise on this situation within the context of the Department of Science and Technology's ICT Roadmap and, in particular, its Technology Roadmap for High Performance Computing (HPC). In compiling this white paper, IBM South Africa attempts to illustrate the potential journeys that may be undertaken by means of the Department's roadmap. It is hoped that the Roadmap will draw together business, government, academic and research experts to help define and shape innovative ways in which high performance computers and advanced algorithms may provide valuable information and answers for the numerous intractable problems confronting the South African research community. I certainly hope that this white paper will facilitate such an endeavour.

Mark Harris
Table of Contents

1 Deep Computing Overview
   IBM Deep Computing Systems
   Paradigm shift in computing systems
   Holistic design and system scale optimisation
   Impact on Deep Computing
   Blue Gene – a new approach to supercomputing
   Novel architectures
   Long-Term, High Productivity Computing Systems research
2 Deep Computing Products
   Scale-up and scale-out
   Capability or Capacity
   Server options
   Parallel scaling efficiency
   IBM Power Architecture servers
   IBM Intel and Opteron based servers
   IBM Deep Computing clusters
   IBM and Linux
   Blue Gene: A supercomputer architecture for high performance computing applications
   US ASC Purple procurement
   Integration of heterogeneous clusters
   IBM's Deep Computing Capacity on Demand
   Storage
   IBM Deep Computing Visualization
3 GRID
   GRID for academia and scientific research and development
4 IBM in the Deep Computing Marketplace
   HPCx – the UK National Academic Supercomputer
5 IBM Research Laboratories
   Work with clients
   On Demand Innovation Services
   First-of-a-Kind (FOAK) Projects
   Industry Solutions Laboratories
   Work with universities
   IBM Deep Computing Institute
   Systems Journal and the Journal of Research and Development
6 HPC Network Facilitation – fostering relationships in the HPC community
   Executive Briefings
   Conferences and seminars
   IBM System Scientific Computing User Group, SCICOMP
   SP-XXL
   UK HPC Users Group
7 Collaboration Projects
   University of Swansea, UK – Institute of Life Sciences
   University of Edinburgh, UK – Blue Gene
   IBM Academic Initiative – Scholars Program
   IBM's Shared University Research Program
8 IBM Linux Centres of Competency
9 Technology considerations for a new HPC centre
   Grid enablement of central and remote systems
   Distributed European Infrastructure for Supercomputing Applications
   ScotGrid
10 Skills development, training, and building human capacity
   Programme Management
   Support Services
   With the CHPC of South Africa in mind...
11 Assisting with the Promotion and Exploitation of the CHPC
Trademarks and Acknowledgements
Addendum
1 Deep Computing Overview

Deep Computing (DC) or High Performance Computing (HPC) is a term describing the use of the most powerful computers to tackle the most challenging scientific and business problems. It is most frequently applied today to systems with large numbers of processors clustered together to work in parallel on a single problem. Deep computing delivers powerful solutions to customers' most challenging and complex problems, enabling businesses, academics and researchers to get results faster and realize a competitive advantage. It provides extreme computational power which enables increasingly complex scientific and commercial applications to be run, where trillions of calculations can be performed every second and models of extreme complexity can be analysed to provide insights never before possible.

It is a world of large numbers, and the terms Tera (10^12, or 1 million million), Peta (10^15, or 1 thousand million million) and Exa (10^18, or 1 million million million) are frequently used in expressions such as '3 TFLOPS', which means 3 million million Floating Point Operations (i.e. calculations) per Second, or 2 PetaBytes, which means 2 thousand million million Bytes of storage.

While some of the most ambitious HPC work takes place in government laboratories and academia, much is performed in newer commercial marketplaces such as drug discovery, product design, simulation and animation, financial and weather modelling, and this sector is growing rapidly.

IBM is the worldwide leader in Deep Computing and is able to provide the technology, the research capabilities, the support and the industry expertise for Deep Computing through a comprehensive range of products and services designed to help businesses and organizations reap the full benefits of deep computing. IBM's technical know-how and extensive industry knowledge include technical experts capable of providing industry-specific high performance computing solutions. IBM has expertise in critical areas such as scaling and application modelling and demonstrable leadership in the latest industry trends, including clustering, chip technology and Linux. IBM's world class research capabilities provide strategic business and technical insight. Indeed, IBM's research activities led to the award of 3,248 US patents in 2004, the twelfth year in succession that IBM was awarded more patents than any other company.

In the past, HPC was dominated by vector architecture machines, and indeed IBM provided vector processors on its mainframe computer line, but the advent of powerful scalar processors and the ability to interconnect them in parallel to execute a single job has seen vector systems eclipsed by Massively Parallel Machines (MPPs) or Clusters. There are still some applications for which vector machines provide an excellent, though sometimes expensive, solution, particularly in engineering, but a glance at the Top500 Supercomputers List shows the overwhelming domination of clusters of scalar architecture machines. Clusters are the workhorse of general purpose, multi-disciplinary, scientific and research HPC.

IBM's Deep Computing strategy can be articulated simply – IBM is dedicated to solving larger problems more quickly at lower cost.
To achieve this we will aggressively evolve the POWER-based Deep Computing products; continue to develop advanced systems based on loosely coupled clusters; deliver supercomputing capability with new access models and financial flexibility; and undertake research to overcome obstacles to parallelism and bring revolutionary approaches to supercomputing.

IBM Deep Computing Systems

IBM provides cluster solutions based on IBM POWER architecture, on Intel architecture, and on AMD Opteron architecture, together with support for a broad range of applications across these multiple platforms. We offer full flexibility, with support for heterogeneous platforms, architectures and operating systems. Our hardware systems are supplied with sophisticated software designed for user productivity in demanding HPC environments. IBM strongly supports Open Standards and our entire server product line is enabled to run Linux. IBM cluster solutions provide cost-effective solutions for both capability computing, running the most demanding simulations and models, and capacity computing, where throughput is of crucial importance.

In what can be termed the "high range" segment, the most powerful high performance Power clusters, equipped with high bandwidth, low latency switch interconnections, provide versatile systems able to run the widest range of HPC applications effectively and efficiently in production environments, from weather forecasting to scientific research laboratories and academia.

At a slightly reduced performance, the "mid range" segment utilises mid-range Power servers in clusters which can be configured to run either AIX or Linux, using either a high performance interconnect or a lower performance industry standard interconnect.

IBM Power based supercomputer products use mainstream technologies and programming techniques, leading to cost-effective system designs that can sustain high performance on a wide variety of workloads. This approach benefits from the sustained, incremental industry improvements in hardware and software technologies that drive both performance and reliability. IBM research continues to explore extensions to the POWER architecture; one such extension is examining its capability to handle some applications traditionally addressed with vector processing. All Power systems share full binary compatibility across the range.

In the "density segment", where applications and communications are less demanding, IBM provides clusters comprising blades with Intel or AMD Opteron processors. Whether customers choose a single cluster, or a combination of clusters, IBM provides interoperability between systems, with common compilers and libraries, shared file systems, shared schedulers and grid compatibility. These clusters, based on high volume, low cost processors and industry standard interconnects, are aimed at applications which perform especially well on these architectures and thus enjoy excellent price/performance. IBM blade server clusters provide a particularly space and power efficient way to construct such systems.

The results speak for themselves as IBM leads the world in supercomputing, supplying 58 of the world's 100 most powerful supercomputers according to the November 2004 Top500 Supercomputers List.
IBM machines in the list include the most powerful, the IBM Blue Gene at Lawrence Livermore National Laboratory (since doubled in size, and soon to be doubled yet again); the largest supercomputer in Europe, the Mare Nostrum Linux cluster at the Barcelona University Supercomputer Centre; and HPCx, the UK National Academic Supercomputer Facility. IBM supplied 161 of the 294 Linux clusters on the list.

Deep Computing stands at the forefront of computing technology, and indeed IBM's Deep Computing business model is predicated on developing systems to meet the most aggressive requirements of HPC customers, certain in the knowledge that commercial customers will need these capabilities in the near future.

Paradigm shift in computing systems

The largest challenge facing the entire computing industry today is the paradigm shift about to take place, caused by chip designers having reached the end of classical scaling. Forty years ago, Gordon Moore made his famous observation, dubbed Moore's Law, that the number of transistors on a chip would double every 18-24 months. Since that time, Moore's Law has held true and circuit densities have increased so that it is not uncommon for a single IBM chip the size of a thumbnail to have 200 million transistors. Scaling technology is far more complex than a simple application of Moore's Law, but we can see how performance improvements have arisen as a result:

– The relentless increase in the number of circuits has enabled designers to implement more functions in hardware, and fewer in software, dramatically speeding up systems. A good example of this is to compare a current IBM compute server, which can calculate four floating point operations in a single clock cycle, with older machines, where a single calculation would require many instructions in microcode and would take many clock cycles to perform.

– The reduction in the size of the electronic components on the chip has allowed clock frequencies to rise. One only has to look at 1970s computers, which ran at 1 MHz, and compare them with today's servers running at 2,000 MHz, some 2,000x faster.

Classical scaling is the coordinated reduction, year on year, of a fixed set of device dimensions governing the performance of silicon technology. The next generation of silicon fabrication can implement the identical circuit components smaller – in 2001 Power4 used 180 nm technology (chip features are 180 thousand-millionths of a metre); in 2005 Power5 uses 130 nm technology; and in 2007 Power6 will use 65 nm technology.

A side effect of increased circuit density, increased numbers of circuits and increased frequency is that power densities in chips increase – processor chips typically run at power densities up to 10 Watts/sq cm, about double that of the sole plate of a domestic iron.

Consider the gate oxide, which provides the insulation between the input signal and the output in a CMOS transistor. As implemented in the smallest dimensions today, this gate oxide is only about 6 atoms thick. Fabrication technologies are not perfect, and if we assume a defect just one atom thick on each side of the layer, our insulation is one third thinner than required. The problem facing all chip designers is that single-atom defects can cause local leakage currents to be ten to one hundred times higher than average. Oxides scaled below about 10 Angstroms (one thousand-millionth of a metre) are leaky and likely to be unreliable.
Chip designers are grappling with these problems, where the characteristics of the chip are based on probability, not certainty, and while leakage current has been cited here, many other such "non-statistical behaviours" are now appearing in the technology. Classical scaling is now failing because the power generated by the leakage currents is increasing at a much faster rate than that generated by the "useful" currents and will soon exceed it.

Holistic design and system scale optimisation

HPC, being at the forefront of computing technology, has seen the effects of the paradigm shift first. The key question is "what will drive performance improvements in the future if classical scaling no longer does?" IBM's answer to this challenge is that holistic design and system scale optimisation will be the major drivers of performance improvements in the future. Technological advances at the chip level will still occur, and IBM will continue to research novel transistor architectures and improvements in material properties (IBM has developed strained silicon for 90 nm technology which has conductor mobility 350x higher than only eight years ago, leading to better current flows and reduced power dissipation).

By holistic design, IBM means innovation from atoms to software, requiring the simultaneous optimization of materials, devices, circuits, cores, chips, system architectures, system assets and system software to provide the most effective means of optimising computer performance. It will not be possible to assemble systems from disparate components as is often done today – each and every part of the system will need to be integrated into a coherent whole. Innovation will overtake scaling as the driver of semiconductor technology performance gains. Processor design and metrics have already changed irrevocably as designers grapple with increased power dissipation – the power cliff, to which some designs have already fallen victim. System level solutions, optimized via holistic design, will ultimately dominate progress in information technology, and IBM is perhaps the only vendor with the in-house skills and capabilities to drive this revolution.

IBM's Power Architecture is evolving along these lines, with Scalable Multi-Core chips having 200 million circuits, some of which control power and heat dissipation, and technological advances such as asset virtualisation, fine grained clock gating, dynamically optimized multithreading capability and an open (accessible) architecture for system optimization and compatibility to enhance performance.

Impact on Deep Computing

Many Deep Computing customers' needs for the immediate future will be met by conventional clusters, which will continue to be developed to produce systems of 100s of TFLOPS. IBM is, for example, developing Power6-based processors, and subsequent systems, which will provide clusters of these sizes with acceptable power consumption and floor area. Customers with extreme requirements are unlikely to be able to continue as today, since their computing demands will outstrip their environments, and it is likely that their choice of future HPC systems will be dictated as much by floor area and power consumption as by the compute power they need. If one examines the US Advanced Simulation and Computing (ASC, formerly ASCI) Program, we see that single machines will have increased from 1 TFLOP in 1997 to 360 TFLOPS in 2005, an increase of 360x in only eight years.
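To put that rate of increase in perspective, the short C program below (an illustrative calculation based only on the two figures quoted above, not on the ASC roadmap itself) derives the implied annual growth factor and doubling time.

/* growth.c - illustrative calculation of the growth rate implied by the
 * ASC figures quoted above: 1 TFLOP in 1997 to 360 TFLOPS in 2005.
 * Build: cc growth.c -lm -o growth
 */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double start_tflops = 1.0;    /* 1997 */
    const double end_tflops   = 360.0;  /* 2005 */
    const double years        = 8.0;

    /* annual growth factor g such that g^years = end/start */
    double growth = pow(end_tflops / start_tflops, 1.0 / years);

    /* time taken to double at that rate, in months */
    double doubling_months = 12.0 * log(2.0) / log(growth);

    printf("Annual growth factor : %.2fx\n", growth);               /* ~2.09x */
    printf("Doubling time        : %.1f months\n", doubling_months); /* ~11 months */
    return 0;
}

At roughly 2.1x per year – a doubling about every 11 months – the ASC machines have outpaced even the 18-24 month transistor doubling of Moore's Law discussed earlier.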
Clearly, the next eight years will not see conventional systems increasing in power at this rate. While IBM will continue to develop and supply traditional clusters of 100s of TFLOPS, it is unlikely that these customers could afford the space and power requirements of conventional systems 100x (10 PFLOPS) or 1000x (100 PFLOPS) as large.

Blue Gene – a new approach to supercomputing

As a result of these constraints, IBM began a project five years ago to develop a novel-architecture supercomputer with up to one million processors, but greatly reduced power dissipation and floor area. The Blue Gene supercomputer project is dedicated to building a new family of supercomputers optimized for bandwidth, scalability and the ability to handle large amounts of data while consuming a fraction of the power and floor space required by today's fastest systems, and at a much reduced cost. The first of these systems is the 64-rack, 360 TFLOP Blue Gene/L to be delivered to Lawrence Livermore National Laboratory as part of the ASC Purple procurement. At only one quarter of its final contracted size, the first delivery of Blue Gene/L leads the Top500 Supercomputers List, and offers over 100 times the floor space density, 25 times the performance per kilowatt of power, and nearly 40 times more memory per square metre than the previous Top500 leader, the Earth Simulator. Blue Gene/L dramatically improves scalability and cost/performance for many compute-intensive applications, such as biology and life sciences, earth sciences, materials science, physics, gravity, and plasma physics. All HPC customers, from large to small, will benefit from the Blue Gene project, for they will be able to afford much larger systems than previously, and be able to accommodate them in existing facilities without re-building. Numerous smaller Blue Gene/L systems are now installed in research and academic laboratories worldwide.

Novel architectures

IBM will continue to investigate other novel architectures for powerful computing systems. IBM's collaborative work with Sony and Toshiba to develop the Cell Processor for gaming and High Definition TV has led to a processor with the power of past industrial supercomputers which may prove to be an excellent building block for future HPC systems.

Cell processor

The first-generation Cell processor is a multi-core chip with a 64-bit Power processor and eight "synergistic" processors, capable of massive floating point processing and optimized for compute-intensive workloads and broadband rich media applications. A high-speed memory controller and high-bandwidth bus interface are integrated on-chip. The Cell's breakthrough multi-core architecture and ultra high-speed communications capabilities will deliver vastly improved, real-time response. The Cell processor supports multiple operating systems simultaneously. Applications will range from a next generation of game systems with dramatically enhanced realism, to systems that form the hub for digital media and streaming content in the home, to systems used to develop and distribute digital content, to systems to accelerate visualization and to supercomputing applications. The first generation of Cell processors will be implemented with about 230 million transistors on a chip of about 220 mm². The Cell processor will run at about 4 GHz and will provide a performance of 0.25 TFLOPS (single precision) or 0.026 TFLOPS (double precision) per chip.
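The quoted peak figure can be reproduced with a simple back-of-envelope calculation. The C sketch below assumes that each synergistic processor completes a four-wide single-precision multiply-add (eight floating point operations) every cycle; that assumption is ours, made for illustration, and is not a statement of the Cell specification.

/* peak.c - back-of-envelope check of the single-precision peak figure
 * quoted above for the first-generation Cell processor. The 8 flops per
 * cycle per synergistic processor is an illustrative assumption
 * (a 4-wide single-precision multiply-add each cycle).
 * Build: cc peak.c -o peak
 */
#include <stdio.h>

int main(void)
{
    const double clock_ghz         = 4.0;  /* ~4 GHz, as quoted above */
    const int    synergistic_cores = 8;
    const double flops_per_cycle   = 8.0;  /* assumed: 4-wide SP multiply-add */

    double peak_gflops = clock_ghz * synergistic_cores * flops_per_cycle;

    printf("Peak (single precision): %.0f GFLOPS = %.2f TFLOPS\n",
           peak_gflops, peak_gflops / 1000.0);  /* ~256 GFLOPS, ~0.25 TFLOPS */
    return 0;
}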
[Figure: Cell processor block diagram – 64-bit Power processor, memory controller, eight synergistic processors and flexible I/O]

Long-Term, High Productivity Computing Systems research

IBM has been awarded funding from the Defense Advanced Research Projects Agency (DARPA) for the second phase of DARPA's High Productivity Computing Systems (HPCS) initiative. IBM's proposal, named PERCS (Productive, Easy-to-use, Reliable Computing System), is for ground-breaking research over the next three years in areas that include revolutionary chip technology, new computer architectures, operating systems, compilers and programming environments. The research program will allow IBM and its partners, a consortium of 12 leading universities and the Los Alamos National Laboratory, to pursue a vision of highly adaptable systems that configure their hardware and software components to match application demands. Adaptability enhances the technical efficiency of the system, its ease of use, and its commercial viability by accommodating a large set of commercial and high performance computing workloads. Ultimately, IBM's goal is to produce systems that automatically analyze the workload and dynamically respond to changes in application demands by configuring their components to match application needs.

PERCS is based on an integrated software-hardware co-design that will enable multi-Petaflop sustained performance by 2010. It will leverage IBM's Power architecture and will enable customers to preserve their existing solution and application investments. PERCS also aims at reducing the time-to-solution, from inception to actual result, and to this end PERCS will include innovative middleware, compiler and programming environments that will be supported by hardware features to automate many phases of the program development process. The IBM project will be managed in IBM's Research Laboratory in Austin and will include members from IBM Research, Systems Group, Software Group and the Microelectronics Division. A fundamental goal of the research is commercial viability, and developments arising from the project will be incorporated in IBM Deep Computing systems by the end of the decade.

References
IBM Deep Computing: http://www.ibm.com/servers/deepcomputing/
IBM Power Architecture: http://www.ibm.com/technology/power/
IBM Research: http://www.research.ibm.com/
Blue Gene: http://www.research.ibm.com/bluegene/
Cell Processor: http://www.research.ibm.com/cell/
PERCS: http://domino.research.ibm.com/comm/pr.nsf/pages/news.20030710_darpa.html
Accelerated Strategic Computing Program: http://www.llnl.gov/asci/overview/asci_mission.html
DARPA: http://www.darpa.mil/
Top500 Supercomputers List: http://www.top500.org/

2 Deep Computing Products

High Performance Computing falls into a number of overlapping areas depending on the type of applications to be run and the architecture best matched to that computing.

Scale-up and scale-out

The largest HPC machines, needed to run capability jobs, are scale-up and scale-out machines. They operate in the tens or hundreds of TFLOPS range, with thousands or tens of thousands of processors. Servers are often (but not always) wide SMPs with many processors, giving rise to the term 'scale-up', while many such servers are needed to achieve the performance required, giving rise to the term 'scale-out'.
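A simple model illustrates why the balance between compute power and interconnect performance dominates the behaviour of scale-out systems. The C sketch below combines Amdahl's law with a crude per-processor communication cost; the serial fraction and message cost used are arbitrary example values, not measurements of any IBM system.

/* scaling.c - illustrative scaling model: Amdahl's law plus a crude
 * communication-overhead term. The serial fraction and per-CPU message
 * cost below are arbitrary example values, not measurements.
 * Build: cc scaling.c -o scaling
 */
#include <stdio.h>

int main(void)
{
    const double serial_fraction = 0.01;   /* 1% of the work cannot be parallelised */
    const double t_compute       = 100.0;  /* seconds for the whole job on one CPU */
    const double t_comm_per_cpu  = 0.002;  /* seconds of messaging added per extra CPU */

    printf("%8s %12s %12s\n", "CPUs", "speed-up", "efficiency");
    for (int p = 1; p <= 4096; p *= 4) {
        double t_parallel = t_compute * (serial_fraction
                          + (1.0 - serial_fraction) / p)
                          + t_comm_per_cpu * (p - 1);
        double speedup = t_compute / t_parallel;
        printf("%8d %12.1f %11.0f%%\n", p, speedup, 100.0 * speedup / p);
    }
    return 0;
}

Reducing the communication term – that is, providing a faster interconnect – flattens the efficiency curve and allows a single job to occupy far more of the machine usefully, which is the theme of the paragraphs that follow.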
The wider the machine scales, the more important the performance of the switching interconnect becomes, as a single capability job is capable of utilising all or most of the servers in the system. Jobs can take advantage of using a mixture of OpenMP (in the servers) and message passing (between the servers), or message passing across the entire machine. These systems provide the most general purpose HPC systems available as they can run many different types of applications effectively and efficiently. Typical applications include weather forecasting and scientific research, while engineering applications, which cannot usually scale widely, can make use of the wide SMPs. A typical example of such a general purpose capability machine is HPCx, the UK National Academic Supercomputer, which provides about 11 TFLOPS (peak) with 50 x 32-way servers (1,620 CPUs in total) interconnected by the High Performance Switch.

Blue Gene offers a similar approach to providing systems in the hundreds and thousands of TFLOPS range by interconnecting up to one million individual processors. The novel nature of Blue Gene lies in the techniques used to minimise the resulting power dissipation and floor area, and in the topology and performance of the interconnection fabric needed to maintain a fully balanced system.

At the other end of the scale, Departmental machines are much smaller, and typically comprise either a single SMP or small clusters of either single-CPU or small SMP servers. Departmental clusters most often run Linux and may use Gigabit Ethernet or Myrinet as interconnects.

In between, Enterprise or Divisional machines range from larger Departmental to smaller capability systems, depending on the applications to be run. Jobs do not need to scale as widely as on capability systems, so lower performance industry standard interconnects can sometimes be used, but some applications will not run as effectively on these systems. A number of these systems run Linux, although it is currently unusual to find sites doing so where full 24 x 7 service availability is essential (weather forecasters) or where many different users are running many different types of applications and the system needs strong system management to maintain availability (e.g. science research or national academic systems).

Capability or Capacity

HPC systems can also be characterised as being either Capability or Capacity machines. Capability systems, by definition, are able to run a single job using all, or most of, the system. As the job is parallelised across many processors, intercommunication between the processors is high, and efficient running can only be achieved if the message passing (MPI) efficiency is high. Capacity machines, on the other hand, are used to run many smaller jobs, none of which extend to run across the whole machine, and can therefore be implemented with lower performance interconnects while maintaining high performance.

Server options

IBM's highest performance, most general purpose Deep Computing products for the capability sector utilise IBM's standard range of Power servers (i.e. Unix servers running IBM's version of Unix, AIX, or Linux) assembled into clusters connected with IBM's High Performance Switch (HPS) interconnect. Servers can be configured with up to 64 processors in a single SMP image, offering great flexibility for applications to exploit either OpenMP or message passing constructs, or a mix of both.
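The mixed programming model described above – OpenMP threads inside each SMP server and MPI message passing between servers – can be illustrated with a minimal hybrid program. The sketch below estimates pi by numerical integration; the build and run commands shown in the comments are typical examples and will vary between installations.

/* hybrid_pi.c - minimal sketch of the hybrid model described above:
 * OpenMP threads inside each SMP node, MPI message passing between nodes.
 * Example build (names vary by installation): mpicc -fopenmp hybrid_pi.c
 * Example run: mpirun -np 4 ./a.out   (with OMP_NUM_THREADS set per node)
 */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    const long n = 100000000L;          /* intervals in the midpoint rule */
    int rank, nprocs;
    double local = 0.0, pi = 0.0, h = 1.0 / (double)n;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Each MPI process takes every nprocs-th interval; its OpenMP threads
     * then share that work inside the SMP node. */
    #pragma omp parallel for reduction(+:local)
    for (long i = rank; i < n; i += nprocs) {
        double x = h * ((double)i + 0.5);
        local += 4.0 / (1.0 + x * x);
    }
    local *= h;

    /* Combine the partial sums across the cluster interconnect. */
    MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi estimated with %d processes x %d threads: %.12f\n",
               nprocs, omp_get_max_threads(), pi);

    MPI_Finalize();
    return 0;
}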
A suite of special purpose HPC software provides parallel computing capabilities such as message passing, parallel file systems and job scheduling.

IBM's medium performance range of Deep Computing products utilises IBM's mid-range Power servers or IBM's standard range of Intel and AMD Opteron products, running Linux, assembled into clusters connected by industry standard interconnects such as Myrinet or Gigabit Ethernet.

IBM's density segment Deep Computing products utilise IBM's BladeCenter populated with 2-way Power or 4-way Intel processors.

Parallel scaling efficiency

Similarly, systems can be characterised by their parallel scaling efficiency. On such a comparison Blue Gene appears as a highly parallelised system with an extremely efficient interconnect, as the system is well balanced between processor compute power and interconnect scaling.

IBM Power Architecture servers

IBM's pSeries eServers utilise industry leading Power architecture processors. IBM first introduced the Power architecture in 1990 and it has evolved through Power2, Power3 and Power4 to today's Power5.

pSeries servers deliver reliable, cost-effective solutions for commercial and technical computing applications in the entry, midrange and high-end segments and are used in IBM Deep Computing systems. IBM changed the UNIX computing landscape with the introduction of the current Power5 pSeries p5 servers, an advanced line of UNIX and Linux servers using Power5™ microprocessors to achieve unprecedented computing performance and reduced costs for a wide range of business and scientific applications. The eServer p5 systems are the result of a large-scale, three-year intensive research and development effort, and they extend beyond traditional UNIX servers by implementing mainframe-inspired features to provide higher utilization, massive performance, greater flexibility, and lower IT management costs. IBM p5 servers give clients choices for implementing different solutions, ranging from 2-way to 64-way servers, all leveraging the industry standard Power Architecture™ and all designed to deliver the most potent performance and scalability ever available in the entry, midrange and large scale UNIX platforms. All pSeries servers are binary-compatible, from single workstations to the largest supercomputers.

IBM Intel and Opteron based servers

IBM xSeries servers utilise Intel and Opteron processors, and incorporate mainframe-derived technologies and intelligent management tools for reliability and ease of management.

IBM Deep Computing clusters

Cluster 1600 – IBM pSeries

Deep Computing requires systems which greatly exceed the computing power available from a single processor or server, and while in earlier days it was possible to provide enough power in a single server, this is no longer the case. Deep Computing systems are now invariably constructed as clusters of interconnected servers, where a cluster can be defined as a collection of interconnected whole computers, connected in parallel and used as a single, unified computing resource. IBM first introduced clustered Unix systems in the 1990s, running AIX (IBM's Unix), and they quickly established themselves as the cluster of choice for high performance parallel computing.
Today's Cluster 1600 systems are configured with high-performing pSeries p5 servers and powerful switching interconnects to provide systems capable of efficiently and effectively running the most demanding applications. The IBM eServer p575 and p595 cluster nodes have been designed for "extreme" performance computing applications. Specially designed to satisfy high-performance, high-bandwidth application requirements, the p575 includes eight, and the p595 sixty-four, of the most powerful 1.9 GHz IBM POWER5 microprocessors for the ultimate in high-bandwidth computing. Multiple cluster nodes can be interconnected by a low latency, high bandwidth switching fabric, the High Performance Switch, to build the largest supercomputers capable of true deep computing.

[Photo: HPCx Cluster 1600 at Daresbury Laboratory, UK]

For less demanding applications, communication between nodes can be achieved by using industry-standard interconnects such as 10/100 Mbps or Gigabit Ethernet, while users running Linux, whose applications need higher bandwidth and lower latency than can be provided by Gigabit Ethernet, can opt for the Myricom Myrinet-2000 switch. The Cluster 1600 provides a highly scalable platform for large-scale computational modelling.

Cluster 1300 – IBM Intel and AMD Opteron

While the most demanding applications need the capability of Power servers and high performance switching interconnects, a broad spectrum of applications (including, for example, many applications in the life sciences, particle physics and seismic fields) are well suited to lower performance systems. IBM provides Linux clusters, using Intel and AMD Opteron processors, to serve these customers, allowing affordable supercomputing for departmental HPC users.

Clusters have also proven to be a cost-effective method for managing many HPC workloads. It is common today for clusters to be used by large corporations, universities and government labs to solve problems in life sciences, petroleum exploration, structural design, high-energy physics, finance and securities, and more. Now, with the recent introduction of densely packaged rack-optimized servers and blades, along with advances in software technology that make it easier to manage and exploit resources, it is possible to extend the benefits of clustered computing to individuals and small departments. With the IBM Departmental Supercomputing Solutions, the barriers to deploying clustered servers – high price, complexity, extensive floor space and power requirements – have been overcome. Clients with smaller budgets and staff but with challenging problems to solve can now leverage the same supercomputing technology used by large organizations and prestigious laboratories. IBM Departmental Supercomputing Solutions are offered in a variety of packaged, pre-tested clustered configurations where clients have the flexibility to choose between configurations with 1U servers or blades in reduced-sized racks, and servers with Intel Xeon or AMD Opteron processors, running either Microsoft Windows or, more commonly, Linux operating systems.

[Photo: Mare Nostrum at University of Barcelona Supercomputing Centre, Spain]

IBM and Linux

IBM is strongly committed to Linux, and our Linux support extends across server hardware, software and through into our services groups. IBM is focusing on:

– Creating a pervasive application development and deployment environment built on Linux.
– Producing an industry-leading product line capable of running both AIX (IBM Unix) and Linux, together with the services needed to develop and deploy these applications.

– Creating bundled offerings including hardware, software and services built on Linux to allow out-of-the-box capability in handling workloads best suited for Linux.

– Fully participating in the evolution of Linux through Open Source submissions of IBM-developed technologies and by partnering with the Open Source community to make enhancements to Linux.

IBM is fully committed to the Open Source movement and believes that Linux will emerge as a key platform for computing in the 21st century. IBM will continue to work with the Open Source community, bringing relevant technologies and experience to enhance Linux, to help define the standards and to extend Linux to support enterprise wide systems.

Blue Gene: A supercomputer architecture for high performance computing applications

[Photo: Blue Gene/L in its 360 TFLOP configuration]

IBM Blue Gene is a supercomputing project dedicated to building a new family of supercomputers optimized for bandwidth, scalability and the ability to handle large amounts of data while consuming a fraction of the power and floor space required by today's fastest systems. The first of the Blue Gene family of supercomputers, Blue Gene/L, has been delivered to the U.S. Department of Energy's National Nuclear Security Administration (NNSA) program for Advanced Simulation and Computing (ASC), while others have been delivered to Edinburgh University's Parallel Computing Centre and to ASTRON, a leading astronomy organization in the Netherlands. IBM and its customers are exploring a growing list of high performance computing applications that can be optimized on Blue Gene/L, with projects in the life sciences, hydrodynamics, quantum chemistry, molecular dynamics and climate modelling.

Blue Gene/L's original mission was to enable life science researchers to design effective drugs to combat diseases and to identify potential cures, but its versatility, advanced capabilities, compact size and power efficiency make it attractive for applications in many fields including the environmental sciences, physics, astronomy, space research and aerodynamics.

US ASC Purple procurement

The US Department of Energy's ASC Purple procurement comprises machines from each of the above three architectures on IBM's technology roadmap. The smallest machine is a 9.2 TFLOPS Linux cluster. This is complemented by two Power clusters, the first a 12 TFLOP Power4 / High Performance Switch system, and the second, just delivered, a 100 TFLOP Power5 / High Performance Switch cluster, which are Lawrence Livermore National Laboratory's main workhorse machines. The third system is a 360 TFLOP Blue Gene/L which will be used for extreme computations on selected applications.

[Figure: ASC Purple systems]

[Figure: ASC Purple 100 TFLOP machine architecture]

Integration of heterogeneous clusters

Many customers do as ASC Purple did and procure more than one architecture, and it then becomes critical for those systems to be integrated into a seamless operating unit.
IBM's HPC software suite enables this integration by allowing clusters to be managed from a common management tool, by allowing multiple systems to access shared file systems under the General Parallel File System, by allowing jobs to be scheduled across the entire system and by enabling grid access to the system. Applications further benefit from common compilers and run-time libraries, reducing the porting and support effort required.

IBM's Deep Computing Capacity on Demand

IBM Deep Computing Capacity on Demand centres deliver supercomputing power to customers over the Internet, freeing them from the fixed costs and management responsibility of owning a supercomputer or providing increased capacity to cope with peaks of work. Customers are provided with remote, highly secure Virtual Private Network (VPN) access over the Internet to supercomputing power owned and hosted by IBM, enabling them rapidly and temporarily to flex High Performance Computing capacity up or down in line with business demands, permitting them to respond to predictable or unpredictable peak workloads or project schedule challenges. Clients can implement an optimal combination of in-house fixed capacity/fixed cost and hosted variable capacity/variable cost HPC infrastructure.

Capacity on Demand Centres are equipped with the full range of IBM Deep Computing systems including:

– IBM eServer Cluster 1350 with xSeries Intel 32-bit technology to run Linux™ or Microsoft™ Windows workloads
– IBM eServer Cluster 1350 with eServer AMD Opteron processor 32-bit/64-bit technology to run Linux or Microsoft Windows workloads
– IBM pSeries with IBM POWER 64-bit technology to run Linux or IBM AIX 5L workloads
– IBM Blue Gene with PowerPC technology to run Linux-based workloads

Applications which are well suited to Capacity on Demand include most scientific fields, and especially:

– Petroleum – seismic processing, reservoir simulation
– Automotive and Aerospace Computer-Aided Engineering – crash and clash analysis, computational fluid dynamics, structural analysis, noise/vibration/harshness analysis, design optimization
– Life Sciences – drug discovery, genomics, proteomics, quantum chemistry, structure-based design, molecular dynamics
– Electronics – design verification and simulation, auto test pattern generation, design rule checking, mask generation, optical proximity correction
– Financial Services – Monte Carlo simulations, portfolio and wealth management, risk management, compliance
– Digital Media – rendering, on-line game scalability testing

Storage

High Performance Computing systems require high performance disk storage systems to provide local storage during the computational phase, and large near-line or off-line systems, usually tape, for longer term or permanent storage of user data.

General Parallel File System, GPFS

The IBM General Parallel File System (GPFS) is a high-performance shared-disk file system which provides fast, reliable data access from all nodes in one or more homogeneous or heterogeneous clusters of IBM UNIX servers running either AIX or Linux. GPFS allows parallel applications simultaneous access to a single file or set of files from all nodes within the clusters and allows data to be shared across separate clusters.
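From the application's point of view, this shared access can be exercised through portable MPI-IO calls, with no GPFS-specific programming interface required. The sketch below is a minimal illustration, using an arbitrary file name and block size, in which every MPI process writes its own disjoint block of one shared file; on a parallel file system such as GPFS those blocks are then striped across the underlying storage servers and disks.

/* shared_write.c - minimal sketch of many processes writing disjoint blocks
 * of one shared file through portable MPI-IO; on a parallel file system such
 * as GPFS the blocks are striped across the underlying disks. The file name
 * and block size are arbitrary example values.
 * Example build/run (names vary): mpicc shared_write.c && mpirun -np 8 ./a.out
 */
#include <mpi.h>
#include <stdlib.h>

#define BLOCK_DOUBLES 1048576   /* 8 MB of doubles per process */

int main(int argc, char **argv)
{
    int rank;
    MPI_File fh;
    double *block = malloc(BLOCK_DOUBLES * sizeof(double));

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (long i = 0; i < BLOCK_DOUBLES; i++)   /* fill with rank-specific data */
        block[i] = (double)rank;

    /* All processes open the same file ... */
    MPI_File_open(MPI_COMM_WORLD, "results.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* ... and each writes its own disjoint region of it, collectively. */
    MPI_Offset offset = (MPI_Offset)rank * BLOCK_DOUBLES * sizeof(double);
    MPI_File_write_at_all(fh, offset, block, BLOCK_DOUBLES, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(block);
    MPI_Finalize();
    return 0;
}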
While most UNIX file systems are designed for a single-server environment, where adding more file servers does not improve file access performance, GPFS delivers much higher performance, scalability and failure recovery by accessing multiple file system nodes directly and in parallel. GPFS's high performance I/O is achieved by "striping" data from individual files across multiple disks on multiple storage devices and reading and writing the data in parallel; performance can be increased as required by adding more storage servers and disks. GPFS is used on all large IBM clusters doing parallel computing, which includes many of the world's largest supercomputers, supporting hundreds of Terabytes of storage and over 1,000 disks in a single file system. The shared-disk file system provides every cluster node with concurrent read and write access to a single file, and high reliability and availability are assured through redundant pathing and automatic recovery from node and disk failures. GPFS scalability and performance are designed to meet the needs of data-intensive applications such as engineering design, digital media, data mining, financial analysis, seismic data processing and scientific research.

Tivoli Storage Manager, TSM

Tivoli Storage Manager (TSM) is IBM's backup, archive and Hierarchical Storage Manager (HSM) software solution for desktop, server and HPC systems. It provides support for IBM and many non-IBM storage devices and can manage hundreds of millions of files and TBytes to PBytes of data. TSM can exploit SAN technology in a SAN environment. TSM is used on many large IBM High Performance Computing systems to transfer data between the on-line GPFS and the near-line or off-line tape libraries.

High Performance Storage System, HPSS

HPSS is a very high performance storage system aimed at the very largest HPC users, who require greater performance than traditional storage managers, like Tivoli, can provide. HPSS software manages hundreds of Terabytes to Petabytes of data on disk cache and in robotic tape libraries, providing a highly flexible and fully scalable hierarchical storage management system which keeps recently used data on disk and less recently used data on tape. HPSS uses cluster and SAN technology to aggregate the capacity and performance of many computers, disks, and tape drives into a single virtual file system of exceptional size and versatility. This approach enables HPSS easily to meet otherwise unachievable demands of total storage capacity, file sizes, data rates, and number of objects stored.

[Figure: HPSS Architecture]

HPSS is the result of over a decade of collaboration between IBM and five US Department of Energy laboratories, with significant contributions by universities and other laboratories worldwide. The approach of jointly developing open software with end users having the most demanding requirements and decades of high-end computing and storage system experience has proven a sound recipe for continued innovation, industry leadership and leading edge technology. The founding collaboration partners are IBM, Lawrence Livermore National Laboratory, Los Alamos National Laboratory, the National Energy Research Scientific Computing Center (NERSC) at Lawrence Berkeley National Laboratory, Oak Ridge National Laboratory and Sandia National Laboratories.
HPSS was developed to provide:

– Scalable Capacity – As architects continue to exploit hierarchical storage systems to scale critical data stores beyond a Petabyte (1024 Terabytes) towards an Exabyte (1024 Petabytes), there is an equally critical need to deploy high performance, reliable and scalable HSM systems.

– Scalable I/O Performance – As processing capacity grows from trillions of operations per second towards a quadrillion operations per second, and data ingest rates grow from tens of Terabytes per day to hundreds of Terabytes per day, HPSS provides a scalable data store with the reliability and performance to sustain 24 x 7 operations in demanding high availability environments.

– Incremental Growth – As data stores with 100s of Terabytes and Petabytes of data become increasingly common, HPSS provides a reliable, seamlessly scalable and flexible solution using heterogeneous storage devices, robotic tape libraries and processors connected via LAN, WAN and SAN.

– Reliability – As data stores increase from tens to hundreds of millions of files and collective storage I/O rates grow to hundreds of Terabytes per day, HPSS provides a scalable metadata engine ensuring highly reliable and recoverable transactions down to the I/O block level.

HPSS sites which have accumulated one Petabyte or more of data in a single HPSS file system include:

– Brookhaven National Laboratory, US
– Commissariat à l'Energie Atomique, France
– European Centre for Medium-Range Weather Forecasts, UK
– Lawrence Livermore National Laboratory, US
– Los Alamos National Laboratory, US
– National Energy Research Scientific Computing Center, US
– San Diego Supercomputer Center, US
– Stanford Linear Accelerator Center, US

The data stored in these systems ranges from digital library material to science, engineering and defence application data, and includes data from nanotechnology, genomics, chemistry, biochemistry, radiology, functional magnetic resonance imaging, fusion energy, energy efficiency, astrophysics, nuclear physics, accelerator physics, geospatial data, digital audio, digital video, weather, climate and computational fluid dynamics.

IBM Deep Computing Visualization

High Performance Computing systems generate vast quantities of data, and high performance visualisation systems are essential if the information buried within the increasingly large, sometimes distributed, data sets is to be found. Visualization is the key to unlocking data secrets. Effective visualization systems require massive computing power and intensive graphics rendering capabilities, which has traditionally led to extremely specialized and monolithic infrastructures. IBM's Deep Computing Visualization (DCV) solution offers an alternative approach by using IBM IntelliStation workstations and innovative middleware to leverage the capabilities of the latest generations of commodity graphics adapters to create an extraordinarily flexible and powerful visualization solution. Deep Computing Visualization is a complete solution for scientific researchers who need increased screen resolutions and/or screen sizes while maintaining performance. DCV enables the display of applications on high-resolution monitors, or on large, multi-projector display walls or caves, with no, or minimal, impact on performance. It also makes graphics applications much easier to manage by keeping the application in one central location and securely transferring visual data to remote collaborators anywhere on the network.
The Deep Computing Visualisation Concept

Deep Computing Visualisation is a visualisation infrastructure that combines commodity components with a scalable architecture and remote visualization features. Its goal is to leverage the price/performance advantage of commodity graphics components and commodity interconnect adapters to provide visualisation systems of great power and capability, from workstations to high end systems. At the heart of IBM's DCV infrastructure is middleware that links multiple workstations into a scalable configuration, effectively combining the output of multiple graphics adapters. DCV is based on Linux and OpenGL and addresses two major market demands for scientific visualization:

– For users of traditional high performance proprietary graphics systems, DCV aims to provide equivalent or better levels of performance at a lower cost on many applications.

– For users of personal workstations and PCs, DCV aims to provide a more scalable solution that is still based on commodity technologies.

DCV has two built-in functional modes which previously were available only on high-end proprietary graphics systems. Scalable Visual Networking is a "one-to-few" mode that displays applications, without modification, on multiple projectors and/or monitors, creating immersive or stereo visualization environments. Remote Visual Networking is a "one-to-many" mode that transports rendered frames to multiple remote locations, allowing geographically dispersed users to simultaneously participate in collaborative sessions. Both Scalable Visual Networking and Remote Visual Networking can be used simultaneously or separately without additional setup, allowing both local and remote users to participate in immersive environments.

Deep Computing Visualisation improves the ability to gain insight by enhancing 2D and 3D graphics capabilities, portraying large quantities of data in ways that are easy to understand and analyse and which support better decision making.

References
IBM Power Architecture: http://www.ibm.com/technology/power/
IBM Deep Computing products:
– Overview: http://www.ibm.com/servers/deepcomputing/offer.html
– Departmental Supercomputers: http://www.ibm.com/servers/eserver/clusters/hardware/dss.html
– Clusters: http://www.ibm.com/servers/eserver/clusters/
– pSeries Power servers: http://www.ibm.com/servers/eserver/pseries/
– Intel servers: http://www.ibm.com/servers/eserver/xseries/
– AMD Opteron servers: http://www.ibm.com/servers/eserver/opteron/
– Blades: http://www.ibm.com/servers/eserver/bladecenter/index.html
– Storage systems: http://www.ibm.com/servers/storage/index.html
GPFS: http://www-1.ibm.com/servers/eserver/clusters/software/gpfs.html
TSM: http://www-306.ibm.com/software/tivoli/products/storage-mgr/
HPSS: http://www.hpss-collaboration.org/hpss/index.jsp
Deep Computing Capacity on Demand: http://www.ibm.com/servers/deepcomputing/cod.html
Deep Computing Visualisation: http://www.ibm.com/servers/deepcomputing/visualization/

3 GRID

GRID for academia and scientific research and development

Research requires the analysis of enormous quantities of data. Researchers need to share not only data and processors but also instrumentation and test equipment, becoming virtual organizations (VOs).
Grids allow these organizations to make database and file-based data available across departments and organizations, while securing data access and optimizing storage. Grid technologies enable the creation and management of VOs, giving faster, more seamless research collaboration by:
– Reducing research and development costs and increasing the efficiency of co-development
– Reducing the time to market by executing tasks faster and more accurately
– Improving hit-rates through better simulation of real-world characteristics
– Enabling seamless sharing of raw data

IBM grid and deep computing solutions provide faster and more seamless research collaboration by:
– Helping researchers analyse data through simplified data access and integration
– Enabling a flexible and extensible infrastructure that adapts to changing research requirements
– Helping procure compute and data resources on demand
– Providing on demand storage capacity for data collected from sensors and other instruments
– Facilitating fusion engines that assimilate, aggregate and correlate raw information

Grid solutions start with the IBM Grid Toolbox and the Globus Toolkit and can incorporate IBM eServer xSeries, pSeries and zSeries, IBM TotalStorage, LoadLeveler, DB2 database, DataJoiner, General Parallel File System, NFS, Content Manager, SAN FS, ITIO, ITPM, the Tivoli Suite, CSM, WebSphere Application Server, and software from IBM Business Partners including Platform Computing (LSF and Symphony), DataSynapse GridServer, United Devices and Avaki. IBM Business Consulting Services and ITS can provide the skills and resources to implement full grid solutions. Unlike many vendors, IBM offers clients a full end-to-end solution, from initial design concept, through detailed design, supply of hardware and software, and implementation services to roll out the complete solution.

The key differentiators between IBM and other potential grid suppliers can be summarised as our breakthrough initiatives and thought leadership, vast product and patent portfolio, deep industry experience, heavy research investment (people, facilities, projects), extensive intellectual capital and our partnerships with open standards groups, application providers and customers.

IBM has implemented hundreds of grids worldwide, including for academic customers such as Marist College (US), the University of Florida, the University of Oregon and CINECA (Italy). Other grid implementations include:

AIST (National Institute of Advanced Industrial Science and Technology) – Japan's largest national research organization provides an on-demand computing infrastructure which dynamically adapts to support various research requirements.

Butterfly.net – who made the strategic decision to build an architecture based on the grid computing model, using standard protocols and open-source technologies.

EADS (European Aeronautic Defence and Space Company) – who cut analysis and simulation time while improving the quality of the output.

IN2P3 (Institut National de Physique Nucléaire et de Physique des Particules) – this French research institute improves the performance of research projects and enhances collaboration across the European technical community.

Marist College – this U.S. college needed a more stable, resilient and powerful platform for internal IT operations and computer science student laboratories.

Royal Dutch Shell – who use grid applications for seismic studies in the petroleum industry.
Tiger – the government of Taiwan's technological research grid integrates in-country academic computing resources for life sciences and nanotechnology.

University of Pennsylvania, NDMA – who use a powerful computing grid to bring advanced methods of breast cancer screening and diagnosis to patients across the nation, helping to reduce costs at the same time.

Some of the many reasons which have led clients like these to choose IBM to implement their grids include:

Unrivalled expertise – IBM has implemented grid computing for over one hundred organizations worldwide, and IBM uses grids internally to support hundreds of thousands of employees.

Range of solutions from small to large – clients can start small and grow with one of IBM's 21 industry-focused grid offerings.

Grid computing leadership – analysts and media continue to cite IBM's grid activities as industry leading.

Partnerships with leading grid middleware and application ISVs – including SAS, Dassault, Cadence, Accelrys, Cognos, DataSynapse, Platform Computing, Avaki, United Devices and others.

Full business partner support – IBM's Solution Grid for Business Partners allows business partners to use the IBM grid infrastructure to remotely access IBM tools, servers and storage, so partners can grid-enable and validate their applications for a distributed, on demand environment and bring their applications to market faster.

Breadth of IBM solution – IBM offers a grid computing platform consisting of industry leading middleware from IBM and our business partners, allowing clients to selectively choose the components of their grid solution. Tivoli Provisioning Manager and Orchestrator provide resource provisioning and allocation within a grid solution. e-Workload Manager offers best-of-breed distributed workload management based on years of experience in managing mainframe workloads. WebSphere XD (Business Grid) offers a compelling value proposition based upon integrating J2EE transactional and grid workloads. IBM's products in the data space allow customers to virtualise their enterprise information assets without requiring significant changes to their information architecture: DB2 Information Integrator, high performing file systems (General Parallel File System, SAN File System) and industry leading storage virtualisation via the SAN Volume Controller.

Deep commitment to open standards – grid computing is more powerful when implemented with open standards such as WS-RF, Linux and OGSA. IBM is a key collaborator in the Globus Project, the multi-institutional research and development effort which is developing the Open Grid Services Architecture (OGSA), a set of standards and specifications that integrate Web services with grid computing. IBM sponsors the Global Grid Forum, whose mission is the development of industry standards for grid computing, and IBM led the development of OGSA into Web services standards, the Web Services Resource Framework (WS-RF).

IBM Business Consultants and world-class, worldwide support – to help clients leverage grid technology, BCS has hundreds of consultants with industry expertise trained in applying grid technologies, while IBM Design Centers in the US, EMEA and Japan support advanced client engagements.
References
IBM Grid: http://www-1.ibm.com/grid/index.shtml
IBM Grid Toolbox: http://www-1.ibm.com/grid/solutions/grid_toolbox.shtml
Grid Solution for higher education: http://www.ibm.com/industries/education/doc/content/solution/438704310.html?g_type=rhc
IBM Grid Solutions for aerospace, agricultural chemicals, automotive electronics, financial services, government, higher education, life sciences and petroleum: http://www-1.ibm.com/grid/solutions/index.shtml
Customer implementations of grids:
– Life Sciences: http://www-1.ibm.com/grid/gridlines/January2004/industry/lifesciences.shtml
– Research at Argonne National Laboratory: http://www.ibm.com/grid/gridlines/January2004/feature/laying.shtml
– Government: http://www-1.ibm.com/grid/gridlines/January2004/industry/govt.shtml
– Academia: http://www-1.ibm.com/grid/gridlines/January2004/industry/education.shtml
– Teamwork: http://www-1.ibm.com/grid/gridlines/January2004/feature/teamwork.shtml

4 IBM in the Deep Computing Marketplace

IBM is the world's leading supplier of High Performance Computing systems. The Top500 Supercomputers List tabulates the 500 largest supercomputers worldwide; IBM has 216 entries in the list, more than any other vendor, and over 49% of the aggregate throughput. IBM has the most systems in the Top10 (four), the most in the Top20 (eight) and the most in the Top100 (58). There are 294 Linux clusters on the List and IBM supplied 161 of them, showing our strength in both traditional and Linux clusters and in systems from the very largest to more modest sizes.

The leader of the Top500 List is IBM's Blue Gene/L, supplied to the US Department of Energy at the Lawrence Livermore National Laboratory as part of the ASC Purple contract and rated at over 70 TFLOPS (Linpack Rmax) when the list was published in November 2004. This system was then only one quarter of its final contracted size; it has since been doubled in size to about 140 TFLOPS and will be doubled again this year.

Lawrence Livermore National Laboratory (LLNL) is a US Department of Energy national laboratory operated by the University of California. Initially founded to promote innovation in the design of the US nuclear stockpile through creative science and engineering, LLNL has now become one of the world's premier scientific centres, where cutting-edge science and engineering in the interest of national security is used to break new ground in all areas of national importance, including energy, biomedicine and environmental science. IBM has since installed a further component of ASC Purple, namely a 100 TFLOPS Power5, 12,000 CPU pSeries cluster.

The largest Linux cluster in the list is Mare Nostrum, the largest supercomputer in Europe and number four on the Top500 List. It is a 20 TFLOPS (Linpack Rmax) system with over 2,200 eServer blades, installed at the Barcelona Supercomputing Centre (Ministerio de Ciencia y Tecnología), and it provides high performance computing to academic and scientific researchers in Spain running applications as diverse as life sciences (proteomics, bioinformatics, computational chemistry), weather and climate research and materials sciences.

Other IBM systems in the Top 100 include:
– Weather forecasting – European Centre for Medium-Range Weather Forecasts: ECMWF provides 3 to 10 day forecasts for member states throughout Europe.
– Oceanography – Naval Oceanographic Office (NAVOCEANO), US.
– Scientific research – the US National Energy Research Scientific Computing Center (NERSC), the flagship scientific computing facility for the Office of Science in the U.S. Department of Energy. NERSC provides computational resources and expertise for basic scientific research and is a world leader in accelerating scientific discovery through computation.
– Scientific research and Grid – the US National Center for Supercomputing Applications (NCSA) is a leader in defining the future's high-performance cyberinfrastructure for scientists, engineers and society. The Center is one of the four original partners in the TeraGrid project, a National Science Foundation initiative that now spans nine sites. When completed, the TeraGrid will be the most comprehensive cyberinfrastructure ever deployed for open scientific research, including high-resolution visualization environments and computing software and toolkits connected over the world's fastest network.
– Academic research – San Diego Supercomputer Center, University of California San Diego, US.
– Academic research – HPCx is the UK national academic supercomputer, run by a consortium led by the University of Edinburgh, with the Council for the Central Laboratory of the Research Councils (CCLRC) and IBM, and funded by the Engineering and Physical Sciences Research Council.
– Grid – the Grid Technology Research Centre (GTRC) in Tsukuba, Japan was founded in January 2002 with a mission to lead collaboration between the industrial, academic and government sectors and to serve as a world leading grid technology research and development centre.
– Scientific research – Forschungszentrum Jülich (FZJ), the Jülich Research Centre, is a major German multi-disciplinary research centre conducting research in high energy physics, information technology, energy, life sciences and environmental science.
– Weather forecasting – the US National Centers for Environmental Prediction is the US weather forecasting agency.
– Atmospheric research – the US National Center for Atmospheric Research (NCAR) is operated by the University Corporation for Atmospheric Research (UCAR), a non-profit corporation of 61 North American universities with graduate programs in the atmospheric sciences. NCAR conducts research in the atmospheric sciences, collaborates in large multi-institution research programs and develops and provides facilities to support research programs in UCAR universities and at the Center itself.
– Petroleum – Saudi Aramco is a leading oil exploration and production organisation.
– Seismic processing – Geoscience UK is a leading seismic exploration company, while other seismic installations include ConocoPhillips and PGS in the US.
– Grid – WestGrid provides high performance computing, networking and collaboration tools to universities in western Canada, concentrating on grid enabled projects.
– Academic research – CINECA, the inter-university consortium for north eastern Italy, provides high performance computing to thirteen universities conducting public and private research.
– Academic research – the Institute of Scientific Computing, Nankai University hosts the largest supercomputer in China and provides computing for basic and applied research in one of China's top universities.
– Scientific research – the Korea Institute of Science and Technology, South Korea, a multidisciplinary research facility.

Further IBM systems in the list are installed at commercial banks (UK and Germany), Credit Suisse (US and Switzerland), semiconductor companies (Israel and US), a petroleum company (Brazil), a manufacturing company (US), the University of Southern California (US), the University of Hong Kong, Environment Canada (the Canadian weather forecaster), the University of Alaska (US), electronics companies (US), digital media (UK), Oak Ridge and Sandia National Laboratories (both US), the Max Planck Institute (Germany), military research establishments (US Army, UK atomic weapons), Sony Pictures (US) and Deutscher Wetterdienst (Germany's weather forecaster). Thirty-three of the thirty-five entries in the list from 100 to 134 are IBM Linux clusters.

IBM has installed a number of Blue Gene/L systems since the November list was published. The University of Edinburgh Parallel Computing Centre (UK) installed the first Blue Gene/L in Europe for a joint project with IBM to tackle some of the most challenging puzzles in science, such as understanding the behaviour of large biological molecules and modelling complex fluids. ASTRON, a leading astronomy organization in the Netherlands, has installed a Blue Gene/L for a joint research project with IBM to develop a new type of radio telescope capable of looking back billions of years in time to provide astronomers with unique insight into the creation of the earliest stars and galaxies at the beginning of time. The ASTRON Blue Gene/L, while occupying only six racks, would rank fourth on the current Top500 list.

HPCx – the UK National Academic Supercomputer

HPCx is one of the world's most advanced high performance computing centres and provides facilities for the UK science and engineering research communities, enabling them to solve previously inaccessible problems. HPCx is funded by the Engineering and Physical Sciences Research Council, the Natural Environment Research Council and the Biotechnology and Biological Sciences Research Council, and is operated by the CCLRC Daresbury Laboratory and Edinburgh University. The HPCx Consortium is led by the University of Edinburgh, with the Council for the Central Laboratory of the Research Councils (CCLRC) and IBM providing the computing hardware and software. The University of Edinburgh Parallel Computing Centre, EPCC, has long experience of national HPC services and continues to be a major centre for HPC support and training. The Computational Science and Engineering Department at CCLRC Daresbury Laboratory has for many years worked closely with research groups throughout the UK through many projects, including EPSRC's Collaborative Computational Projects, and has unmatched expertise in porting and optimising large scientific codes.

HPCx at Daresbury Laboratory, UK

The HPCx cluster is based on IBM POWER4 pSeries servers and currently comprises 52 32-way 1.7 GHz p690+ servers, each configured with 32 GB of memory. The system has 36 TBytes of storage accessible under the General Parallel File System. The first phase, delivered in 2002, provided a peak capability of 6.7 TFLOPS; this was upgraded in 2004, with virtually no disruption to the operational service, to over 11 TFLOPS, and a further upgrade to 22 TFLOPS is planned for 2006.
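For reference, the quoted peak figure is consistent with a simple back-of-envelope calculation, assuming (as for other POWER4+ processors) that each 1.7 GHz processor can complete four floating-point operations per clock cycle through its two fused multiply-add units:

\[
R_{\text{peak}} \approx 52 \times 32 \times 1.7\,\text{GHz} \times 4\ \tfrac{\text{flop}}{\text{cycle}} \approx 11.3\ \text{TFLOPS}
\]

which matches the "over 11 TFLOPS" quoted for the 2004 upgrade.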
At the end of March 2005, IBM's HPCx system achieved a remarkable milestone: there had been no break in service attributable to IBM since September 2004, six months of faultless running for one of the largest computing systems in Europe.

HPCx is used to address important and complex problems in a wide range of sciences, from the very small, such as the nature of matter, to the very large, such as simulations of whole systems from cells and organs to global simulations of the Earth. It has enabled new advances to be made in the human genome project, helped engineers to design new and safer structures and aircraft, and assisted in opening up completely new fields of research, such as bio-molecular electronics. HPCx is used for a wide variety of peer-reviewed academic, scientific and engineering research projects including atomic and molecular physics, biochemistry, computational chemistry, computational engineering, environmental sciences, computational materials science and numerical algorithms. Some of the new application areas which HPCx will enable include:

Drug design – tomorrow's drugs will be highly specific and finely targeted using Terascale technology. It is already known how individual molecules interact with proteins, but HPCx will enable more molecules to be screened more quickly, so more potential chemical compounds can be tested for their potential in treating disease.

Flight simulation – at present only the airflow around the wing of an aircraft can be simulated, but HPC can, potentially, enable the analysis of the entire flow around an aircraft. Understanding how turbulent the air is behind an aeroplane on take-off could allow greater use of air space and ease the control of traffic in the air.

Structure of the Earth – the Earth's core has a major impact on our lives; for example, it shapes the magnetic field which acts as a protection from the harmful effects of charged particles from the Sun. HPC techniques can be used to investigate the structure and behaviour of the core in a way that is impossible by direct observation and experiment.

References
Top500 Supercomputers List: http://www.top500.org
Lawrence Livermore National Laboratory: http://www.llnl.gov/
ASC Purple Project: http://www.llnl.gov/asci/platforms/purple/
Mare Nostrum: http://www.bsc.es/
ECMWF: http://www.ecmwf.int
NERSC: http://www.nersc.gov/
NCSA: http://www.ncsa.uiuc.edu/
NCSA TeraGrid Project: http://www.ncsa.uiuc.edu/Projects/TeraGrid/
San Diego Supercomputer Center: http://www.sdsc.edu/
National Centers for Environmental Prediction: http://www.ncep.noaa.gov/
Grid Technology Research Centre, Japan: http://www.gtrc.aist.go.jp/en/
Forschungszentrum Jülich: http://www.fz-juelich.de/portal/home
HPCx: http://www.hpcx.ac.uk/

5 IBM Research Laboratories

IBM Research activities extend over a broad area: technical disciplines include chemistry, computer science, electrical engineering, materials science, mathematical sciences and physics, while cross-disciplinary activities include communications technology, deep computing, display technology, e-commerce, personal systems, semiconductor technology, storage, and server and embedded systems. IBM researchers in eight laboratories around the world work with each other and with clients, universities and other partners on projects varying from optimizing business processes to inquiring into the Big Bang and the origins of the universe.
IBM Research's focus is to continue to be a critical part of IBM's success by balancing projects that have an immediate impact with those that are long-term investments.

Work with clients

IBM and its Research division realize the importance of delivering innovation and competitive advantage to clients. To aid clients in achieving their specific goals, IBM Research has created On Demand Innovation Services, the First-of-a-Kind program and the Industry Solutions Lab.

On Demand Innovation Services

On Demand Innovation Services provide research consultants to partner with consultants from IBM Global Services on client engagements that explore cutting-edge ways to increase clients' flexibility and provide them with unique market advantages in line with IBM's on demand strategy.

First-of-a-Kind (FOAK) Projects

First-of-a-Kind projects are partnerships between IBM and clients aimed at turning promising research into market-ready products. Matching researchers with target companies to explore new and innovative technologies for emerging opportunities gives clients a research team to solve problems that don't have ready solutions, while researchers get immediate client feedback to further enhance their projects.

Industry Solutions Laboratories

The Industry Solutions Labs, located in New York and Switzerland, give IBM clients the chance to discover how leading-edge technologies and innovative solutions can help solve business problems. Each visit is tailored to meet specific needs, focused enough to target an immediate objective or broad enough to cover an array of emerging technologies. The engagement generally lasts for one day and consists of presentations by IBM scientists and industry experts, collaborative discussions on specific business issues and demonstrations of key strategic solutions. Organisations which have undertaken research projects with IBM include Bank SinoPac, El-Al Israel Airlines, US Federal Wildland Fire Management Agencies, Finnair, GILFAM, the Government of Alberta, Steelcase and Swiss Post.

Work with universities

IBM research extends beyond the boundaries of our labs to colleagues in university labs, and researchers regularly publish joint papers with them in areas of mutual interest. In special cases, IBM fosters collaborative relationships through Fellowships, grants and shared research programs. Amongst others, IBM has collaborated with Carnegie Mellon University, Concordia University (Montreal, Canada), Florida State University, Massachusetts Institute of Technology, Stanford University, Technion – Israel Institute of Technology, Tel-Aviv University, the University of Coimbra (Portugal), the University of Edinburgh and Virginia Polytechnic Institute and State University.

IBM Deep Computing Institute

The IBM Deep Computing Institute is an organization within IBM Research that coordinates, promotes and advances Deep Computing activities. The institute has three objectives:
– Develop solutions to previously intractable business and scientific problems by exploiting IBM's strengths in high-end computing, data storage and management, algorithms, modelling and simulation, visualization and graphics.
– Realize the potential of the emerging very large scale computational, data and communications capabilities in solving critical problems of business and science.
– Lead IBM's participation within the scientific community, and in the business world, in this important new domain of computing.
Systems Journal and the Journal of Research and Development

IBM regularly publishes the IBM Systems Journal and the IBM Journal of Research and Development, both of which can be accessed from the web. The Journal of Research and Development, Vol. 45, No. 3/4, 2001 was titled Deep Computing for the Life Sciences, while Vol. 48, No. 2, 2004 was entitled Deep Computing.

References
IBM Research: http://www.research.ibm.com
IBM Research work with clients: http://www.research.ibm.com/resources/work_clients.shtml
IBM work with universities: http://www.research.ibm.com/resources/work_universities.shtml
IBM Journals: http://www.research.ibm.com/journal/

6 HPC Network Facilitation – fostering relationships in the HPC community

IBM considers it very important for its high performance computing customers to meet with IBM developers and with other HPC customers to exchange information and make suggestions for improvements or future requirements. There are now a number of fora in which this information exchange can take place.

Executive Briefings

IBM hosts regular one-on-one briefings for HPC customers, usually in the development laboratories, where customers can meet with system developers to discuss future developments and requirements.

Conferences and seminars

IBM holds regular conferences and seminars on Deep Computing including, for example, the recent Deep Computing Executive Symposium in Zurich in 2004, details of which can be found at https://www-926.ibm.com/events/04deep.nsf

IBM System Scientific Computing User Group, SCICOMP

IBM fully supports the IBM System Scientific Computing User Group, SCICOMP, an international organization of scientific and technical users of IBM systems. The purpose of SCICOMP is to share information on software tools and techniques for developing scientific applications that achieve maximum performance and scalability on IBM systems, and to gather and provide feedback to IBM to influence the evolution of its systems. To further these goals, SCICOMP holds periodic meetings which include technical presentations by both IBM staff and users, focusing on recent results and advanced techniques. Discussions of problems with scientific systems are held, aimed at providing advisory notifications to IBM staff detailing the problems and potential solutions. Mailing lists are maintained to further open discussions and provide the sharing of information, expertise and experience that scientific and technical applications developers need but may not easily find anywhere else. SCICOMP meetings are held once a year, normally in the spring, and alternate meetings are held outside the USA at sites chosen by members at least a year in advance. Additional informal meetings, such as Birds-of-a-Feather sessions at other conferences, may also be scheduled. Information on past and proposed meetings, including presentation materials, can be found on the SCICOMP web site at http://www.spscicomp.org/ SCICOMP 10 was hosted by the Texas Advanced Computing Center in August 2004, SCICOMP 11 will be hosted by Edinburgh University, Scotland in May 2005, and SCICOMP 12 will be hosted by the National Center for Atmospheric Research in Colorado, USA, in early 2006.

SP-XXL

IBM fully supports SP-XXL, a self-organized and self-supporting group of customers with the largest IBM systems. Members and affiliates actively participate in SP-XXL meetings, which are held around the world at approximately six-monthly intervals.
Member institutions are required to have an appropriate non-disclosure agreement in place with IBM, as IBM often discloses and discusses confidential information with SP-XXL. The focus of SP-XXL is on large-scale scientific and technical computing on IBM systems. SP-XXL addresses topics across a wide range of issues important to achieving Terascale scientific and technical computing on scalable parallel machines, including applications, code development tools, communications, networking, parallel I/O, resource management, system administration and training. SP-XXL believes that by working together, customers are able to resolve issues with better solutions, provide better guidance to IBM, improve their own capabilities through sharing knowledge and software, and collaboratively develop capabilities which they would not typically be able to develop as individual institutions. Details on SP-XXL can be found at http://www.spxxl.org/

UK HPC Users Group

IBM fully supports the UK HPC Customer User Group, a self-organized and self-supporting group of UK-based customers with the largest IBM HPC systems. The group meets approximately every six months and IBM is usually invited to attend for discussions.

References
IBM Executive Briefing Centers: http://www-1.ibm.com/servers/eserver/briefingcenter/

7 Collaboration Projects

IBM collaborates extensively with customers worldwide; some examples of collaborations include:

University of Swansea, UK – Institute of Life Sciences

The University of Swansea is setting up an Institute of Life Sciences (ILS) which aims to provide an innovative and imaginative problem-solving environment through a unique collaboration between regional government, a business leader and a world-class university. The Institute is expected to become one of the world's premier scientific and computing facilities and will host a new European Deep Computing Visualisation Centre for Medical Applications. The Institute and IBM have a multi-year collaboration agreement which includes a new IBM supercomputer, one of the fastest computers in the world dedicated to life science research, designed to accelerate ILS programmes. The Deep Computing Visualisation Centre for Medical Applications will research new solutions for healthcare treatment, personalised medicine and disease control. IBM will provide technical expertise and guidance, as well as specialist life science solutions that will enable a joint development programme. The ILS is a key step in delivering the recommendations set out in Knowledge Economy Nexus, the final report of the Welsh Assembly Government's Higher Education and Economic Development Task and Finish Group. IBM's contribution will comprise hardware infrastructure (the 'Blue C' supercomputer and visualisation system), software and implementation services. Importantly, IBM will also provide extensive industry and life sciences knowledge and expertise, in order to accelerate and drive research programmes in collaboration with the University. IBM is one of the leading providers of technology and services to the life sciences sector. It has more than 1,000 employees around the world dedicated to the field, including bioinformaticians, biologists, chemists and computer scientists. The agreement with Swansea University is part of IBM's continued commitment to the healthcare and life sciences sectors and its strategy of partnering with some of the world's leading research organisations.
Life sciences is recognised as one of the most fertile sources of technology transfer, having the potential to create massive economic wealth from developments in the knowledge economy through research, intellectual property licensing, spin-out companies and inward investment.

University of Edinburgh, UK – Blue Gene

The University of Edinburgh has installed a commercial version of IBM's Blue Gene/L supercomputer. The university's system, which was the first Blue Gene system to run in Europe, is a smaller version of the much faster prototype, yet will still be among the top five most powerful systems in the UK. University researchers hope that the computer will eventually provide insight into various key scientific fields, including the development of diseases such as Alzheimer's, cystic fibrosis and CJD. Edinburgh University, in collaboration with IBM and the Council for the Central Laboratory of the Research Councils in the UK, already manages the largest supercomputer service for academic use in Europe, through the High Performance Computing (HPCx) initiative.

IBM Academic Initiative – Scholars Program

The IBM Academic Initiative – Scholars Program offers faculty members and researchers software, hardware and educational materials designed to help them use and implement the latest technology in curriculum and research. The IBM Scholars Program provides accredited and approved academic members with access to a wide range of IBM products for instructional, learning and non-commercial research use. Offerings range from no-charge licenses for IBM software (including WebSphere, DB2, Lotus and cluster software), to academic discounts for IBM eServers, to ready-to-use curriculum. Faculty and researchers at higher education institutions and secondary/high schools can apply for the IBM Scholars Program and gain access to:
– The most comprehensive set of e-business software available
– Discounts on servers
– Access to Linux and zSeries hubs
– Training and educational materials
– Curriculum and courseware
– Certification resources and special offers
– Technical support
– Newsletters and newsgroups
Full details can be found at http://www.developer.ibm.com/university/scholars/

IBM's Shared University Research Program

IBM's highly selective Shared University Research (SUR) program awards computing equipment (servers, storage systems, personal computing products, etc.) to institutions of higher education around the world to facilitate research projects in areas of mutual interest, including the architecture of business and processes, privacy and security, supply chain management, information based medicine, deep computing, grid computing, autonomic computing and storage solutions. The SUR awards also support the advancement of university projects by connecting top researchers in academia with IBM researchers, along with representatives from product development and solution provider communities. IBM supports over 50 SUR awards per year worldwide. In 2004, IBM announced the latest series of Shared University Research awards, bringing the company's contributions to foster collaborative research to more than $70 million over the last three years. With this latest set of awards, IBM sustains one of its most important commitments to universities by enabling collaboration between academia and industry to explore research in areas essential to fuelling innovation. The new SUR awards will support 20 research projects with 27 universities worldwide.
Research projects range from a multiple-university exploration of on demand supply chains to an effort to find biomarkers for organ transplants. The research reflects the nature of innovation in the 21st century – at the intersection of business value and computing infrastructure. Universities receiving these new awards include: Brown University, Cambridge University (UK), Columbia University, Daresbury University (UK), Fudan University (China), North Carolina Agricultural & Technical State University, Politecnico di Milano (Italy), SUNY Albany, University of Arizona, University of British Columbia (Canada), University of California – Berkeley, University of Maryland – Baltimore County and College Park, Uppsala University (Sweden) and Technion – Israel Institute of Technology.

"Universities play a vital role in driving innovation that could have a business or societal impact," said Margaret Ashida, director of corporate university relations at IBM. "The research collaborations enabled by IBM's Shared University Research award program exemplify the deep partnership between academia and industry needed to foster innovation that matters."

Examples of SUR projects already under way include:
– IBM is working with Oxford University to provide better and faster access to more reliable and accurate mammogram images, thereby potentially increasing early cancer detection and the number of lives saved.
– IBM is collaborating with Penn State University, Arizona State University, Michigan State University and University College Dublin to create supply chain research labs to conduct research on advanced supply chain practices that can be used to help businesses respond on demand to changing market conditions.
– Columbia University and IBM researchers worked on a project to develop the core technologies needed for using computers to simulate protein folding, predict protein structure, screen potential drugs and create an accurate computer-aided drug design program.

8 IBM Linux Centres of Competency

IBM has established a number of Linux Centres of Competency worldwide to offer deep, specialized, on-site skills to customers and Business Partners looking to understand how to leverage Linux-enabled solutions to solve real business challenges. The IBM Linux Centres provide in-depth technology briefings, product and solution demonstrations, workshops and events, while also serving as showcases for Independent Software Vendors' applications on Linux and providing facilities for customer proofs of concept. The Centres of Competency are closely affiliated with the IBM Executive Briefing Centres. For more information, visit: http://www-1.ibm.com/servers/eserver/briefingcenter/

The first Linux Centres of Competency were established in Austin, Texas, USA; Bangalore, India; Beijing, China; Boeblingen, Germany; Moscow, Russia and Tokyo, Japan. IBM South Africa has recently launched the IBM South Africa Linux Centre of Competence, a facility sponsored by IBM and Business Partners which will assist customers to integrate Linux solutions in their businesses, give access to experienced technical personnel for Linux education, training and support, and provide information on Linux related solutions, offerings and future directions. The Centre will provide access to IBM OpenPower systems, POWER5-based Linux-only servers which make the Linux value proposition unbeatable.
Customers are rapidly embracing Linux for its flexibility, cost efficiencies and choice – and now, in increasing numbers, for its security. Not only is Linux the fastest growing operating system, it is the open source foundation for IBM's On Demand customer solutions. Linux has moved well beyond file-serving, print-serving and server consolidation and is now deployed in mission-critical and industry-specific applications. All industries are enjoying improved service delivery by implementing Linux-based business solutions. Linux and IBM are changing the IT landscape, challenging the very nature of application development and the economics of application deployment, and ensuring customers get the best, most cost-effective solutions for their business, with appropriate hardware and superior support, training and infrastructure. IBM is fully committed to open standards in general, to Linux in particular and to being the leader in providing Linux solutions that work. All IBM hardware servers are able to run Linux and some Power5 servers run only Linux. IBM has over 300 Linux-compatible software products and IBM continues to support the ongoing development of Linux solutions and applications.

IBM South Africa has targeted seven focus areas for Linux:
– The South Africa Linux Centre of Competence will continue to offer deep, specialized, on-site skills to customers and Business Partners.
– Linux for infrastructure solutions such as e-mail and file-, print- and Web serving.
– Linux on IBM Clusters and Blades for affordable scalability that can easily be deployed and managed.
– Linux for workload consolidation, to allow customers to eliminate proliferating server farms, reduce costs, use resources efficiently and simplify IT management.
– Continue to build industry applications, like branch automation or point-of-sale solutions, which deliver specific business solutions with the flexibility, adaptability and cost-effectiveness of Linux. The rapid adoption of Linux by governments worldwide exemplifies our success in this focus area.
– Support ValueNet, a network of IBM partners creating repeatable solutions, from problem identification to solution implementation and support.
– IBM Global Services will lead our Linux client focus, concentrating on delivering choice in desktop and browser-based Linux solutions.
References
IBM Linux: http://www-1.ibm.com/linux/
Linux Centres of Competency: http://www-1.ibm.com/linux/va_4057.shtml
Linux on Power servers: http://www-1.ibm.com/servers/eserver/linux/power/?P_Site=Linux_ibm
PartnerLens for Business Partners: http://www-1.ibm.com/linux/va_12.shtml
Linux for developers: http://www-1.ibm.com/linux/va_4068.shtml
IBM Executive Briefing Centres: http://www-1.ibm.com/servers/eserver/briefingcenter/

9 Technology considerations for a new HPC centre

The choice of which technologies are most suitable for the CHPC will depend crucially on:
– the type of applications to be run
– the level of scaling required
– whether grand challenge problems are to be run
– whether capability or capacity is more important
– the number of different applications to be run
– the number and distribution of the users
– the level and skill of system management support
– the level and skill of user support
– the required integration of the central facility with remote systems in universities

Traditional clusters such as pSeries systems use servers with processors designed for HPC and computationally intensive tasks, with high memory bandwidths. The servers are interconnected with high performance switches providing high bandwidth and low latency communications across the entire system, maintaining performance as the number of processors is increased. Linux clusters built with commodity components minimise the purchase cost of the system, but the performance of the cluster is influenced to a greater or lesser extent by each component within the system, from hardware to software. Commodity clusters usually use PCI-based interconnects, and bottlenecks here often place the major limitation on cluster performance. pSeries servers coupled with a mature operating system such as AIX today tend to provide greater levels of reliability, availability and serviceability than Linux clusters, and the better management tools integrated in the system make it easier to provide a continuous and flexible service for multiple users and applications. While Linux capability is continually improving, achieving high availability remains a significant challenge with the largest systems.

The key question is: what applications are to be run? Many applications, such as scientific codes, engineering analyses and weather forecasting, require both powerful processors, with adequate memory bandwidth to ensure the processor is fed with data, and high bandwidth, low latency interconnects to provide efficient exchange of the large quantity of data moved during the computation. For these applications, purpose-built computational servers with high memory bandwidth, interconnected with high performance switches, will provide the best price/performance solution, especially when the problem size is large. For other applications, such as life sciences, Monte Carlo analysis and high energy physics codes, Linux clusters can provide excellent performance as fewer demands are placed on the system. There is a range of embarrassingly parallel applications which place little or no demand on interprocessor communication and for which Linux clusters are ideally suited, as illustrated by the sketch below.
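To make the distinction concrete, the following hypothetical sketch (not a CHPC or IBM code) shows what an embarrassingly parallel application looks like in practice: an MPI Monte Carlo estimate of pi in C, in which every process samples independently and the only inter-processor communication is a single reduction at the end. Codes with this profile run well on commodity Linux clusters because the interconnect is barely exercised.

/* Hypothetical example of an "embarrassingly parallel" MPI code: a Monte
 * Carlo estimate of pi. Each rank samples independently; the only
 * inter-processor communication is one MPI_Reduce at the end, so the
 * interconnect is barely exercised. Build with an MPI C compiler,
 * e.g. mpicc, and launch with the site's usual MPI job launcher. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    long local_hits = 0, total_hits = 0;
    const long samples_per_rank = 10000000L;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    srand(12345u + (unsigned)rank);          /* independent stream per rank */
    for (long i = 0; i < samples_per_rank; i++) {
        double x = (double)rand() / RAND_MAX;
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y <= 1.0)
            local_hits++;
    }

    /* The only communication in the whole program. */
    MPI_Reduce(&local_hits, &total_hits, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("pi ~= %.6f using %ld samples on %d processes\n",
               4.0 * (double)total_hits / ((double)samples_per_rank * nprocs),
               samples_per_rank * nprocs, nprocs);

    MPI_Finalize();
    return 0;
}

By contrast, codes that exchange boundary data every time step (weather models, CFD, engineering analyses) communicate constantly, which is why the interconnect and memory bandwidth dominate the platform choice for them.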
While there is little doubt that in future years HPC systems like Blue Gene will be needed to reach very high performance, the most appropriate use for Blue Gene today is likely to be preparing for that change rather than providing a general purpose national HPC production service. Customers who need to run a variety of applications therefore tend to have three alternatives, although the boundaries between them are not rigid:

– A pSeries cluster running AIX and equipped with a high performance switch. This general purpose machine will be able to run the widest range of applications, although it will be "over-configured" for the less demanding codes. Its better management tools and its integral design for reliability, availability and serviceability will make it easier to operate and to provide a heavily used production facility. Examples of national research HPC systems running pSeries clusters include HPCx, the UK National Academic Supercomputer, and NERSC, the flagship scientific computing facility for the Office of Science in the U.S. Department of Energy and a world leader in accelerating scientific discovery through computation.

– A Linux cluster of commodity processors and industry standard interconnect. If the bulk of the applications are suited to a Linux cluster, this system will provide excellent price/performance for those applications, but it is not likely to perform well on the more demanding applications, especially capability or grand challenge problems, and it is likely to be more demanding on the support staff to operate and manage as a production facility. An example of a national research facility running a Linux cluster is Mare Nostrum at the Barcelona Supercomputing Centre, Spain.

– A heterogeneous system of both pSeries and Linux clusters. This option can often be the optimum solution as regards running the applications, for the codes can be run on the most suitable platforms. Both pSeries and Linux clusters run the same IBM software, and the systems can be installed as autonomous, or with common file systems and job scheduling, or fully grid enabled. There are some restrictions on its use: the largest problem which can be run effectively will be limited by the largest single machine, there will be increased support effort as two systems have to be managed and operated, and users may feel it necessary to port their codes to both systems and then support them on both. An example of a national research facility running both pSeries and Linux clusters is the Lawrence Livermore National Laboratory in the USA.

It is only by studying the application characteristics and balancing them against the constraints, particularly the available support skills, that an effective choice can be made. IBM has Deep Computing on Demand sites which can be used for running applications before a decision is made, and has domain experts in all areas of HPC and site operations able to assist in this analysis. IBM can also facilitate study tours, visits and exchanges for the CHPC to suitable HPC sites.

Grid enablement of central and remote systems

Whatever system the CHPC selects for its central HPC facility, significant benefits will flow from grid enabling it, together with other systems in universities and industry, so that data can be accessed and shared and jobs can be submitted to any resource within the grid.
Each site will decide which data is to be made visible and how much compute resource is to be made available, and authorised users will be able to see all the data and submit jobs to be run on the available machines. A user in Johannesburg could use an input dataset on a system in Durban, run the job on a machine in Cape Town and allow the results to be viewed and graphically analysed in Pretoria, with everything taking place transparently.

IBM has worked extensively with customers to implement grid enabled HPC systems across national and international boundaries. Two such examples are the Distributed European Infrastructure for Supercomputing Applications (DEISA) and ScotGrid.

Distributed European Infrastructure for Supercomputing Applications

Phase 1 of the DEISA AIX super-cluster has grid enabled four IBM POWER4 platforms: FZJ Jülich (Germany, 1,312 CPUs, 8.9 TFLOPS); IDRIS-CNRS (France, 1,024 CPUs, 6.7 TFLOPS); RZG Garching (Germany, 896 processors, 4.6 TFLOPS) and CINECA (Italy, 512 processors, 2.6 TFLOPS). This super-cluster is currently operating in pre-production mode, with full production expected in mid 2005, when it is expected that CSC (Finland, 512 CPUs, 2.2 TFLOPS) will be added.

DEISA's fundamental integration concept is transparent access to remote data files via a global distributed file system under IBM's GPFS. Each of the national supercomputers above is a cluster of autonomous computing nodes linked by a high performance network. Data files are not replicated on each computing node but are unique and shared by all. A data file in the global file system is "symmetric" with respect to all computing nodes and can be accessed, with equal performance, from all of them. A user does not need to know (and in practice does not know) on which set of nodes his application is executed. The IBM systems above run IBM's GPFS (General Parallel File System) as a cluster file system. The wide area network functionality of GPFS has enabled the deployment of a distributed global file system for the AIX super-cluster: applications running on one site can access data files previously "exported" from other sites as if they were local files. It does not matter at which site the application is executed, and applications can be moved across sites transparently to the user. GPFS provides the high performance remote access needed for the global file system. Applications running on the pre-production infrastructure operate at the full 1 Gb/s bandwidth of the underlying network when accessing remote data via GPFS. This performance can be increased by increasing the network bandwidth, as is demonstrated by the 30 Gb/s TeraGrid network in the USA, where GPFS achieves 27 Gb/s in remote file accesses.

DEISA's objective is to federate supercomputing environments in a supercomputing grid that will include, in addition to the supercomputing platforms mentioned above, a number of data management facilities and auxiliary servers of all kinds. The DEISA Supercomputing Grid can be seen as a global architecture for a virtual European supercomputing centre.
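Because GPFS presents an ordinary POSIX file system to applications, this "remote data as if local" behaviour requires no special programming: a code simply opens a path under the shared file system, wherever it happens to run. The short C sketch below illustrates the point with standard I/O calls; the mount point and file name are hypothetical placeholders, not real DEISA paths.

/* Minimal sketch: reading a data file from a globally mounted GPFS file
 * system with ordinary POSIX I/O. Nothing in the code depends on which
 * site the job runs at; GPFS resolves local versus remote access.
 * The path below is a hypothetical placeholder. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const char *path = "/deisa/shared/climate/run42/input.dat"; /* hypothetical */
    FILE *fp = fopen(path, "rb");
    if (fp == NULL) {
        perror("fopen");
        return EXIT_FAILURE;
    }

    double buffer[1024];
    size_t nread = fread(buffer, sizeof buffer[0], 1024, fp); /* same call, any site */
    printf("read %zu values from %s\n", nread, path);

    fclose(fp);
    return EXIT_SUCCESS;
}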
ScotGrid

IBM UK has worked with the universities of Glasgow, Edinburgh and Durham to implement ScotGrid, a three-site Tier-2 centre consisting of an IBM 200 CPU Monte Carlo production facility run by the Glasgow Particle Physics Experimental (PPE) group, an IBM 24 TB datastore and associated high-performance server run by the Edinburgh Parallel Computing Centre and the Edinburgh PPE group, and a 100 CPU farm at the Durham University Institute for Particle Physics Phenomenology. ScotGrid was funded for the analysis of data primarily from the ATLAS and LHCb experiments at the Large Hadron Collider and from other experiments, and is now providing IBM solutions for grid computing in particle physics and bioinformatics, as well as grid data management, medical imaging and device modelling simulations.

References
HPCx: http://www.hpcx.ac.uk
Barcelona Supercomputing Centre: http://www.bsc.es/
LLNL: http://www.llnl.gov
ScotGrid: http://www.scotgrid.ac.uk/
DEISA: http://www.deisa.org and http://www.deisa.org/grid/architecture.php

10 Skills development, training, and building human capacity

Critical to the establishment of a new HPC facility, particularly a national one being established for the first time, is a comprehensive programme to enable skilling and skills transfer, as well as an ongoing process for generating a widespread skills base for supporting, using and exploiting HPC technologies. The key aspects of IBM's approach to this are as follows:

Programme Management

This role coordinates and manages the various 'streams' of IBM activity: from implementation support, through technical training, assisting with access to skills for application tuning and optimisation, visualisation, file management, scheduling, grid enablement and so on, and issue management, through to facilitating networking in the HPC world. Part of the Programme Management role is to ensure that skills transfer is an actively managed aspect of the programme. The IBM approach is to appoint a suitable IBM Programme Manager, familiar with other HPC environments, ahead of taking on a new HPC programme for the first time.

Support Services

The range of support services that IBM delivers is wide, as required by HPC. These include:

IBM hardware and software maintenance and technical support. These are the more traditional services associated with IBM's products, and IBM has significant depth of resources in most geographies for this.

Specialist Technical Support. This covers the HPC-specific components of both hardware and software, and includes File System (GPFS) tuning and support, Visualisation Tools (e.g. DCV) support, job scheduling and application tuning assistance. IBM trains local IBM technical engineers to act as first level support in most of the critical support areas.

On-Site Services. Depending on the skills maturity level and other considerations of the HPC site, IBM can place on-site personnel to support the HPC facility, as well as deliver an on-site skills transfer function.

Education and Training. Formal education is run on site (depending on customer numbers and needs) or at IBM sites. An education plan would be kept up to date as part of the overall programme.

Hands-On Internships and Exchanges. One aspect of IBM's HPC networking programme is to help facilitate cross-skilling between organisations, especially where there is value in reciprocity.
Technical and management personnel in a new HPC centre often gain rapid skill upgrades through short-term internships.

HPC Networking. As covered in Section 6, IBM is active in facilitating networking across HPC organisations through conferences, seminars, user groups and one-to-one meetings. In effect, this is a strategically important ecosystem which helps all user communities to keep up with the latest thinking, innovation, best-of-breed approaches and other learning opportunities. IBM is widely recognised as the leader in such networking facilitation across many different customers.

IBM HPC Study Tours. These are tours organised for one organisation or for a number of organisations. Typically they cover a variety of exposures to 'HPC in Action' in IBM's own research facilities, in leading HPC sites and in IBM technical specialist areas, and can include visits to IBM partners such as Intel and AMD.

With the CHPC of South Africa in mind...

A comprehensive Skills Development and Capacity Building Plan would be workshopped and developed with the CHPC and its stakeholders. IBM has the building blocks with which to create a meaningful and effective plan. Many of the building blocks are mentioned above and need to be tailored to the unique needs of the CHPC. There may also be needs specific to the CHPC in South Africa on which relatively greater emphasis should be placed. For example, these could include a tailored education programme aimed at enabling computational scientists to increase their HPC skills, or a skilling programme for the geographically dispersed set of users who, beyond the staff of the CHPC itself, are very much part of the CHPC ecosystem. The key point is that IBM has the breadth and depth of people and skills to be best positioned to develop a comprehensive programme in conjunction with the CHPC. We recommend that this area of the overall CHPC programme receives special emphasis, as building sustainable human capacity for HPC exploitation is a foundation for a highly successful HPC facility.

11 Assisting with the Promotion and Exploitation of the CHPC

Whilst the fundamentals of establishing a sound HPC operation are paramount and indeed the first priority (including operational effectiveness, good governance, technical support and many other factors), once the facility is well established, optimising the usage of the CHPC will become a key aspect. The user audience is likely to include university and academic researchers and users, but also other users from both public and private sector 'customers'. IBM can assist in a number of ways to help promote the CHPC in these communities. These include:
– Market awareness through media communications
– Marketing events
– Target account planning: IBM has wide Client Management team coverage throughout South and Central Africa. Certain industries have very special HPC needs and these can be jointly targeted, particularly through linkage with the IBM client teams.
– Networking contacts, linkages and exploitation: IBM has strength in two relevant areas, namely (a) partnering in various ways and (b) marketing skills and methods.
Once again, we recommend that these initiatives are made a formal part of the overall Programme Management.
Trademarks and Acknowledgements

The following terms are registered trademarks of International Business Machines Corporation in the United States and/or other countries: AIX, AIX/L, AIX/L (logo), Blue Gene, BladeCenter, IBM, IBM (logo), ibm.com, LoadLeveler, POWER, POWER4, POWER4+, POWER5, POWER5+, POWER6, Power Architecture, POWERparallel, PowerPC, PowerPC (logo), pSeries, Scalable POWERparallel Systems, Tivoli, Tivoli (logo). A full list of U.S. trademarks owned by IBM may be found at http://www.ibm.com/legal/copytrade.shtml.

UNIX is a registered trademark in the United States, other countries, or both. Linux is a registered trademark of Linus Torvalds in the United States, other countries, or both. Intel, Itanium and Pentium are registered trademarks, and Intel Xeon and MMX are trademarks, of Intel Corporation in the United States and/or other countries. AMD Opteron is a trademark of Advanced Micro Devices, Inc. Microsoft, Windows, Windows NT and the Windows logo are registered trademarks of Microsoft Corporation in the United States and/or other countries. Other company, product and service names may be trademarks or service marks of others.

Addendum

IBM UK has been registered to the requirements of ISO 9001 and to the DTI-approved TickIT guidelines since August 1991. This registration is now part of the single certificate issued by BVQI for the totality of IBM Sales and Services in Europe, Middle East and Africa (EMEA). The certificate number is 82346.9.2, and the scope covers all activities culminating in the provision of IT solutions, including design, marketing, sales, support services, installation and servicing of computer systems and infrastructure (hardware, software and networks), management of computer centres and outsourcing services, help desk for computer users, and the provision of IT and management consultancy and services.

Additionally, IBM manufacturing and hardware development locations worldwide are registered to the requirements of ISO 14001 and are part of a single certificate issued by BVQI. The certificate number is 43820, and the scope covers development and manufacture of information technology products, including computer systems, software, storage devices, microelectronics technology, networking and related services worldwide.

IBM is a registered trademark of International Business Machines Corporation. All other trademarks are acknowledged.

All the information, representations, statements, opinions and proposals in this document are correct and accurate to the best of our present knowledge, but are not intended (and should not be taken) to be contractually binding unless and until they become the subject of a separate, specific agreement between the parties. This proposal should not be construed as an offer capable of acceptance. The information contained herein has been prepared on the basis that any agreement entered into between the parties as a result of further negotiations will be based on the terms of the IBM Customer Agreement. This proposal is valid for a period of 30 days.

If not otherwise expressly governed by the terms of a written confidentiality agreement executed by the parties, this proposal contains information which is confidential to IBM and is submitted to the CHPC on the basis that it must not be used in any way nor disclosed to any other party, either in whole or in part.
The only exception to this is that the information may be disclosed to employees or professional advisors of the CHPC where such disclosure is on a need-to-know basis and is for the purpose of considering this proposal. Otherwise, disclosure may not take place without the prior written consent of IBM.

These Services do not address the capability of your systems to handle monetary data in the Euro denomination. You acknowledge that it is your responsibility to assess your current systems and take appropriate action to migrate to Euro-ready systems. You may refer to IBM Product Specifications or IBM's Internet site at http://www.ibm.com/euro to determine whether IBM products are Euro ready.