Center for Scalable Application Development Software
Cooperative Agreement No. DE-FC02-07ER25799
September 2008

Principal Investigator: Katherine Yelick
Department of Electrical Engineering and Computer Sciences
University of California #1776
Berkeley, CA 94720
Voice: 510-495-2431
FAX: 510-642-3962
Email: yelick@cs.berkeley.edu

Students supported at Berkeley: 3 (Rajesh Nishtala, Brian Kazian, Ankit Jain)
Staff supported at Berkeley: 1 (Katherine Yelick)

Introduction

In January 2007 the Center for Scalable Application Development Software (CScADS) was established as a partnership between Rice University, Argonne National Laboratory, University of California – Berkeley, University of Tennessee – Knoxville, and University of Wisconsin – Madison. CScADS is pursuing an integrated set of activities that aim to increase the productivity of DOE computational scientists by catalyzing the development of software tools and libraries for leadership computing platforms. The Berkeley team co-organized the automatic tuning workshop and pursued research on automatic tuning as well as application-level performance studies of PGAS languages and compilers.

1 Community Outreach and Vision-Building

To engage the community in the challenges and foster interdisciplinary collaborations, we have established the CScADS Summer Workshops, an annual series of workshops focusing on topics related to scalable software for the DOE’s leadership computing platforms. In July 2008, we held our first series of four workshops in Snowbird, Utah. The general charge for the workshops was the following:

- Identify important open problems and challenges for achieving high performance on leadership computing systems.
- Brainstorm on promising approaches to open problems.
- Identify infrastructure needs to address key challenges.
- Assess available infrastructure.
- Identify opportunities for synergy: consolidating and hardening existing infrastructure, reusing components developed by others, and refactoring and extending existing components to apply them to new challenges.
- Collaborate on the design of sharable components.
- Identify targets of opportunity for further investment of resources, in particular strategic investment targets for the DOE Office of Science.

Katherine Yelick from Berkeley co-organized a workshop with Keith Cooper from Rice University, Jack Dongarra from the University of Tennessee, and Rich Vuduc from Georgia Tech. The topic of the workshop was automatic performance tuning, and it brought together compiler developers, library writers, performance experts, and hardware designers to discuss some of the code generation challenges for the multicore processors that are the building blocks of emerging petascale systems. The goal was to identify some of the challenges presented by current and future hardware and the opportunities afforded by the use of automatic tuning techniques. The attendees included computer science researchers developing autotuning tools (many funded by SciDAC or other DOE Office of Science programs), compiler writers, and computer architects representing a variety of research and production architectures.

2 Research Contributions from Last Year

The Partitioned Global Address Space (PGAS) model, exemplified by the UPC, Co-Array Fortran, and Titanium languages, allows programmers to easily express parallelism on complex shared data structures. These languages allow such structures to be accessed through global pointers and distributed array expressions, as well as through bulk operations based on either high-level array copies or (in UPC) explicit memory copies. PGAS programs that are designed and optimized for clusters do most of their communication using bulk operations, but programs written for shared memory hardware often have a fine-grained style. Fine-grained accesses that occur in loops may be amenable to message vectorization, in which accesses are combined across iterations, but more irregular communication patterns are usually not amenable to such loop-based optimizations, since they either use pointer dereferences or have dynamic access patterns (e.g., table lookup). Instead, compiler algorithms that decrease the number, reduce the volume, and hide the latencies of the messages in irregular applications can be very beneficial.
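As a concrete illustration of these two styles (a sketch for exposition only, not code from the applications or compiler discussed here), the following UPC fragment contrasts a fine-grained, dynamically indexed access pattern with a bulk transfer using upc_memget; the array name, block size, and function names are illustrative.

    #include <upc.h>

    #define B 256                            /* elements owned by each thread (illustrative) */

    /* Blocked shared array: thread t has affinity to table[t*B .. t*B + B - 1]. */
    shared [B] double table[B * THREADS];

    /* Fine-grained style: a dynamic table lookup in which each dereference of a
       shared element may become a separate small remote access. */
    double lookup_fine_grained(const int *idx, int n) {
        double sum = 0.0;
        for (int i = 0; i < n; i++)
            sum += table[idx[i]];            /* potentially one message per iteration */
        return sum;
    }

    /* Bulk style: fetch the whole block with affinity to thread 'owner' in a
       single upc_memget, then compute on the private copy. */
    double sum_remote_block(int owner) {
        double local[B];
        upc_memget(local, &table[owner * B], B * sizeof(double));   /* one large transfer */
        double sum = 0.0;
        for (int i = 0; i < B; i++)
            sum += local[i];
        return sum;
    }

The dynamically indexed loop is exactly the kind of irregular pattern that loop-based message vectorization cannot handle, which is why compiler and runtime techniques for aggregating or overlapping such accesses are valuable.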
2.1 PGAS Languages for Multicore Systems to Petascale Systems

Dual- and quad-core processors are currently the dominant building block for high-end systems, and the number of cores is likely to double with chip density over the next few generations. At the same time, both memory (commodity DRAM) density and off-chip bandwidth may grow at a slower pace, making it desirable to allow sharing of user-level data structures between cores on a chip. PGAS languages take advantage of the shared memory and avoid some of the memory footprint costs associated with partitioned address space (message passing) programming models. The Berkeley UPC compiler currently runs on multicore systems and clusters of multicore nodes, but the group is exploring a number of extensions to the language, compiler, and runtime system to make effective use of multicore nodes.

Under the CScADS project, the group has applied autotuning techniques to the problem of building a highly optimized collective communication library for PGAS languages. Collective communication is critical to the performance of many bulk-synchronous algorithms, whether they are programmed in MPI, UPC, CAF, or one of the languages emerging from the HPCS program. The Berkeley group specifically looked at optimization techniques for the UPC collectives and studied two fairly different multicore architectures, the Intel Clovertown and the Sun Niagara2. They developed highly optimized and scalable collective implementations for shared memory and found that distributed-memory algorithms such as trees are often useful as the core count grows. The choice of tree structure and communication is highly dependent on the machine size, the collective routine, and the data size, so they developed a prototype autotuning framework to automatically select optimized implementations (a simplified sketch of such an empirical search appears at the end of this subsection). Figure 1 shows the effect of both architecture-independent (a fixed radix-2 tree) and architecture-dependent tuning on the Sun architecture for four different collective operations. The detailed results indicate the importance of selecting the radix and even the tree structure (balanced vs. binomial), which is a strong argument for an autotuned implementation.

Figure 1: Autotuning Collective Communication for Niagara2

The Berkeley group is also developing an optimized implementation of the basic GASNet communication layer for petascale systems such as the BlueGene architecture, which has previously been supported only by an MPI-based implementation of GASNet, which is not very efficient, and by an IBM Research prototype that is not available outside IBM. GASNet underlies multiple PGAS language compilers (Berkeley UPC, Intrepid gcc/upc, Rice CAF, Berkeley Titanium, and Cray Chapel).
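To make the collective tuning concrete, the sketch below (a simplified stand-in for the prototype framework, not the Berkeley UPC collectives code) times a radix-k tree broadcast among POSIX threads for several radices and reports the fastest; the thread count, iteration count, and flag-based signaling scheme are illustrative, and a production version would also vary the tree shape and data size and time the collective more carefully.

    /* tune_bcast.c: pick the best broadcast tree radix empirically.
       Compile with: cc -O2 -pthread tune_bcast.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <sys/time.h>

    #define NTHREADS 8
    #define NITERS   10000

    static volatile int flags[NTHREADS];   /* per-thread generation flag */
    static volatile int payload;           /* value being "broadcast" */
    static int radix;                      /* tree radix currently under test */
    static pthread_barrier_t bar;

    /* k-ary tree rooted at thread 0: parent(i) = (i-1)/radix,
       children of i are radix*i+1 .. radix*i+radix. */
    static void tree_broadcast(int me, int gen) {
        if (me != 0)
            while (flags[me] != gen) ;              /* spin until parent signals */
        int val = payload; (void)val;               /* consume the broadcast value */
        __sync_synchronize();                       /* make payload visible before signaling */
        for (int c = radix * me + 1; c <= radix * me + radix && c < NTHREADS; c++)
            flags[c] = gen;                         /* signal each child */
    }

    static void *worker(void *arg) {
        int me = (int)(long)arg;
        for (int gen = 1; gen <= NITERS; gen++) {
            if (me == 0) payload = gen;
            tree_broadcast(me, gen);
            pthread_barrier_wait(&bar);             /* keep iterations aligned */
        }
        return NULL;
    }

    static double now(void) {
        struct timeval tv; gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void) {
        int best_radix = 0; double best = 1e30;
        for (radix = 2; radix <= NTHREADS; radix *= 2) {
            pthread_t th[NTHREADS];
            for (int i = 0; i < NTHREADS; i++) flags[i] = 0;
            pthread_barrier_init(&bar, NULL, NTHREADS);
            double t0 = now();
            for (int i = 0; i < NTHREADS; i++)
                pthread_create(&th[i], NULL, worker, (void *)(long)i);
            for (int i = 0; i < NTHREADS; i++)
                pthread_join(th[i], NULL);
            double per_iter = (now() - t0) / NITERS; /* includes barrier + startup overhead */
            pthread_barrier_destroy(&bar);
            printf("radix %d: %.3f us per broadcast\n", radix, per_iter * 1e6);
            if (per_iter < best) { best = per_iter; best_radix = radix; }
        }
        printf("selected radix: %d\n", best_radix);
        return 0;
    }

The same search structure generalizes to other tree shapes (balanced vs. binomial) and to other collectives; the point is that the best variant is chosen by measurement on the target machine rather than fixed in advance.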
2.2 Autotuned Sparse Libraries for Multicore

The Berkeley team also made progress in delivering self-tuning libraries to the user community in the form of a multicore/SMP extension of the OSKI sparse matrix library called pOSKI. Whereas OSKI tunes for registers, caches, and some SIMD acceleration, pOSKI additionally tunes the number of threads and adds thread-level blocking as well as explicit software prefetch. The ideas build on work by Sam Williams on optimizations for multicore, which was funded in part by the PERI SciDAC project; in this CScADS work the optimization ideas were encoded in the OSKI autotuning framework to make it easier for users to benefit from them. The figure below summarizes the pOSKI results for the AMD Barcelona processor using 12 different matrices from a variety of application domains. Each bar is divided into a set of performance results for the different optimizations, which are applied additively. In addition, the last point (a black diamond) shows previous results obtained by Williams et al. using a standalone autotuning framework and a nearly identical set of optimizations. For most matrices the performance is comparable, but there are two matrices for which a machine-dependent strategy was applied in the code by Williams.

3 Future Plans for FY09

The primary objective for the Berkeley CScADS research projects is to complete, tune, test, and release a UPC compiler for the BlueGene system. This involves a number of steps:

1) Test and tune the put/get operations available today in a pre-release version of GASNet (a sketch of this kind of microbenchmark appears at the end of this section).
2) Complete the implementation of the full GASNet interface to support features such as faster remote invocations (active messages).
3) Perform scalability testing and tuning.
4) Release optimized self-tuning collectives.

Additional work on autotuning for multicore will also continue, with exploration of tuning techniques for accelerators such as GPUs. Finally, Berkeley will continue to have a presence in multiple summer workshops and will co-organize one of the workshop weeks related to autotuning.
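As an indication of the put/get testing planned in step 1, the following is a minimal round-trip latency microbenchmark sketch written against the GASNet-1 blocking extended API (gasnet_put/gasnet_get). It is illustrative only: the segment size, attach parameters, message size, and iteration count are placeholders, and real GASNet test codes check return codes and configure segments more carefully.

    /* putget_latency.c: rough put/get round-trip timing between node 0 and a peer.
       Build with the compiler and flags from the target conduit's makefile fragment,
       which define the GASNet threading mode and include paths. */
    #include <gasnet.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>

    static double now(void) {
        struct timeval tv; gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(int argc, char **argv) {
        gasnet_init(&argc, &argv);
        /* no AM handlers; small illustrative segment */
        gasnet_attach(NULL, 0, 256 * GASNET_PAGESIZE, GASNET_PAGESIZE);

        gasnet_node_t me    = gasnet_mynode();
        gasnet_node_t nodes = gasnet_nodes();
        gasnet_node_t peer  = (me + 1) % nodes;

        /* find the base of each node's attached segment so puts/gets have a target */
        gasnet_seginfo_t *seg = malloc(nodes * sizeof(gasnet_seginfo_t));
        gasnet_getSegmentInfo(seg, nodes);
        void *remote = seg[peer].addr;

        enum { NITERS = 10000, NBYTES = 8 };
        char buf[NBYTES];

        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);

        if (me == 0 && nodes > 1) {
            double t0 = now();
            for (int i = 0; i < NITERS; i++) {
                gasnet_put(peer, remote, buf, NBYTES);   /* blocking put */
                gasnet_get(buf, peer, remote, NBYTES);   /* blocking get */
            }
            double us = (now() - t0) / NITERS * 1e6;
            printf("blocking put+get round trip: %.2f us for %d bytes\n", us, (int)NBYTES);
        }

        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);
        gasnet_exit(0);
        return 0;
    }

Sweeping NBYTES over a range of sizes gives the latency and bandwidth curves that the testing and tuning in steps 1 and 3 would examine on the BlueGene hardware.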
4 Publications and Presentations

4.1 Theses

1. Ankit Jain, “pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures,” Master’s Report, Computer Science Division, University of California at Berkeley, August 2008.
2. Wei-Yu Chen, “Optimizing Partitioned Global Address Space Programs for Cluster Architectures,” Computer Science Division, University of California at Berkeley, December 2007.

4.2 Published Papers

[1] S. W. Williams, D. A. Patterson, L. Oliker, J. Shalf, K. Yelick, “The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures,” Hot Chips: A Symposium on High Performance Chips, Stanford, CA, August 2008. (Abstract.)
[2] Ankit Jain, Shoaib Kamil, Marghoob Mohiyuddin, John Shalf, and John D. Kubiatowicz, “Hybrid Electric/Photonic Networks for Scientific Applications on Tiled CMPs,” Hot Interconnects 2008, August 2008. (Abstract.)
[3] Costin Iancu, Wei Chen, Katherine A. Yelick, “Performance Portable Optimizations for Loops Containing Communication Operations,” International Conference on Supercomputing, Island of Kos, Greece, June 7-12, 2008, pages 266-276.
[4] J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick, “Avoiding Communication in Sparse Matrix Computations,” IEEE International Parallel and Distributed Processing Symposium (IPDPS’08), April 2008.
[5] Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, “Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms,” IEEE International Parallel and Distributed Processing Symposium (IPDPS’08), April 2008. Best Paper Award, Applications Track.
[6] Rajesh Nishtala, George Almasi, Calin Cascaval, “Performance without Pain = Productivity, Data Layouts and Collectives in UPC,” Principles and Practices of Parallel Programming (PPoPP) 2008, Salt Lake City, USA, February 2008.
[7] John Mellor-Crummey, Peter Beckman, Jack Dongarra, Ken Kennedy, Barton Miller, Katherine Yelick, “Software for Leadership-Class Computing,” SciDAC Review, Fall 2007, pages 36-45.
[8] Parry Husbands and Katherine Yelick, “Multithreading and One-Sided Communication in Parallel LU Factorization,” Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[9] Tong Wen, Jimmy Su, Phillip Colella, Katherine Yelick, and Noel Keen, “An Adaptive Mesh Refinement Benchmark for Modern Parallel Programming Languages,” Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[10] Sam Williams, Leonid Oliker, Richard Vuduc, James Demmel, Katherine Yelick, “Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms,” Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[11] James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick, “Avoiding Communication in Computing Krylov Subspaces,” University of California EECS Department Technical Report UCB/EECS-2007-123, October 2007.
[12] Alfredo Buttari, Jack Dongarra, Parry Husbands, Jakub Kurzak, and Katherine Yelick, “Multithreading for Synchronization Tolerance in Matrix Factorization,” Proceedings of the SciDAC 2007 Conference, Boston, Massachusetts, June 24-28, 2007. Published in the Journal of Physics: Conference Series, Volume 78, 2007.
[13] Jimmy Su and Katherine Yelick, “Automatic Performance Debugging in Partitioned Global Address Space Programs,” 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Urbana, Illinois, October 2007. Appeared in Springer Lecture Notes in Computer Science.

4.3 Submitted Papers

1. Sam Williams, Kaushik Datta, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, David Bailey, “PERI - Auto-tuning Memory Intensive Kernels for Multicore,” SciDAC: Scientific Discovery Through Advanced Computing, Seattle, Washington, July 2008. To appear in the Journal of Physics: Conference Series. LBNL # pending.
2. Kaushik Datta, Shoaib Kamil, Sam Williams, Leonid Oliker, John Shalf, Katherine Yelick, “Optimization and Performance Modeling of Stencil Computations on Modern Microprocessors,” SIAM Review, 2008 (in press). LBNL-63192.
3. Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, and Katherine Yelick, “Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures,” to appear at Supercomputing 2008 (SC08), November 2008. LBNL # pending.

5 Presentations

1. K. Yelick, “Parallel Programming Models,” CSE/ParLab Parallel Computing “Bootcamp,” University of California at Berkeley, August 2008.
2. K. Yelick, “Programming Models for Manycore Processors,” Intel/UPCRC Programming Languages Workshop, August 23, 2008.
3. K. Yelick, “Multicore: Fallout from a Hardware Revolution,” Summer Lecture Series, Lawrence Berkeley National Laboratory, July 2008.
4. K. Yelick (for S. Williams), “PERI - Auto-tuning Memory Intensive Kernels for Multicore,” SciDAC: Scientific Discovery Through Advanced Computing, Seattle, Washington, July 2008.
5. K. Yelick, “Programming Models: Opportunities and Challenges for Scalable Applications,” Next Generation Scalable Applications: When MPI Only is Not Enough, June 3-5, 2008.
6. K. Yelick, “Programming Models for Manycore Systems,” Intel Corp., Santa Clara, CA, April 23, 2008. Keynote.
7. K. Yelick, “Multicore Meets Exascale: The Catalyst for a Software Revolution,” 2008 Salishan Conference on High Speed Computing, Salishan, OR, April 21-22, 2008. Keynote.
8. K. Yelick, “Programming Models for Petascale to Exascale,” IPDPS 2008, Miami, FL, April 15-16, 2008. Keynote.
9. R. Nishtala, “Performance without Pain = Productivity, Data Layouts and Collectives in UPC,” Principles and Practices of Parallel Programming (PPoPP) 2008, Salt Lake City, USA, February 2008.
10. K. Yelick, “Multicore Meets Petascale: The Catalyst for a Software Revolution,” North Carolina State University, Raleigh, NC, February 10-12, 2008. Invited talk.
11. K. Yelick, “Programming Models for Petascale,” Princeton University, Princeton, NJ, February 25-26, 2008. Invited talk.
12. K. Yelick, “Productivity and Performance using Partitioned Global Address Space Languages,” Parallel Symbolic Computation (PASCO ’07), London, Canada, July 27-28, 2007. Invited talk.