Creating Software Tools and Libraries for Leadership

Center for Scalable Application Development Software
Cooperative Agreement No. DE-FC02-07ER25799
September 2008
Principal Investigator:
Katherine Yelick
Department of Electrical Engineering and Computer Sciences
University of California #1776
Berkeley, CA 94720
Voice: 510-495-2431
FAX: 510-642-3962
Email: yelick@cs.berkeley.edu
Students supported at Berkeley: 3
Rajesh Nishtala
Brian Kazian
Ankit Jain
Staff supported at Berkeley: 1
Katherine Yelick
Introduction
In January 2007 the Center for Scalable Application Development Software (CScADS) was
established as a partnership between Rice University, Argonne National Laboratory, University
of California – Berkeley, University of Tennessee – Knoxville, and University of Wisconsin –
Madison. CScADS is pursuing an integrated set of activities that aim to increase the productivity
of DOE computational scientists by catalyzing the development of software tools and libraries for
leadership computing platforms. The Berkeley team co-organized the automatic tuning research activities and pursued application-level performance studies of PGAS languages and compilers.
1 Community Outreach and Vision-Building
To engage the community in the challenges and foster interdisciplinary collaborations, we have
established the CScADS Summer Workshops – an annual series of workshops that will focus on
topics related to scalable software for the DOE’s leadership computing platforms. In July 2008,
we held our first series of four workshops in Snowbird, Utah.
The general charge for the workshops was the following:
• Identify important open problems and challenges for achieving high performance on leadership computing systems.
• Brainstorm on promising approaches to open problems.
• Identify infrastructure needs to address key challenges.
• Assess available infrastructure.
• Identify opportunities for synergy: opportunities to consolidate and harden existing infrastructures, to reuse existing components developed by others, and to refactor and extend existing components to apply them to new challenges.
• Collaborate on the design of sharable components.
• Identify targets of opportunity for further investment of resources, in particular strategic investment targets for the DOE Office of Science.
Katherine Yelick from Berkeley co-organized a workshop with Keith Cooper from Rice
University, Jack Dongarra from the University of Tennessee, and Rich Vuduc from Georgia
Tech. The topic of the workshop was Automatic Performance Tuning, and it brought together compiler developers, library writers, performance experts, and hardware designers to discuss some of the code generation challenges for multicore processors, which are the building blocks for emerging petascale systems. The goal was to identify some of the challenges posed by current and future hardware and the opportunities afforded by the use of automatic tuning techniques.
The attendees included computer science researchers developing autotuning tools (many funded
by SciDAC or other DOE Office of Science programs), compiler writers, and computer architects
representing a variety of research and production architectures.
2 Research Contributions from Last Year
The Partitioned Global Address Space (PGAS) model, exemplified by the UPC, Co-Array Fortran, and Titanium languages, allows programmers to easily express parallelism on complex shared data structures. The languages allow such structures to be accessed through global pointers and distributed array expressions, as well as through bulk operations based on either high-level array copies or (in UPC) explicit memory copies. PGAS programs that are designed and optimized for clusters do most of their communication using bulk operations, but programs written for shared memory hardware often have a fine-grained style. Fine-grained accesses that occur in loops may be amenable to message vectorization, where accesses are combined across iterations, but more irregular communication patterns are usually not amenable to such loop-based optimizations, since they either use pointer dereferences or have dynamic access patterns (e.g., table lookup). Instead, compiler algorithms that decrease the number of messages, reduce their volume, and hide their latencies can be very beneficial for irregular applications.
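To make the two communication styles concrete, the following minimal UPC sketch (the array, its blocking, and the sizes are illustrative assumptions, not code from the Berkeley compiler work) contrasts a fine-grained loop that reads remote elements one at a time with a bulk transfer that pulls a peer's whole block using upc_memget and computes on the local copy.

/* Illustrative UPC sketch: fine-grained vs. bulk communication.
 * The array A, its block size, and N are hypothetical. */
#include <upc.h>
#include <stdio.h>

#define N 1024
shared [N] double A[N*THREADS];   /* thread t has affinity to A[t*N .. t*N+N-1] */

int main(void) {
    double local[N];
    double sum = 0.0;

    /* Fine-grained style: each access to a remote element can become a
     * small message; natural on shared memory, costly on a cluster. */
    for (int i = 0; i < N*THREADS; i++)
        sum += A[i];

    /* Bulk style: fetch a neighbor's whole block in one operation, then
     * work on the local copy. */
    int peer = (MYTHREAD + 1) % THREADS;
    upc_memget(local, &A[peer*N], N * sizeof(double));
    for (int i = 0; i < N; i++)
        sum += local[i];

    printf("thread %d: sum = %g\n", MYTHREAD, sum);
    return 0;
}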
2.1 PGAS Languages from Multicore Systems to Petascale Systems
Dual- and quad-core processors are currently the dominant building blocks for high-end systems, and the number of cores is likely to double with chip density over the next few generations. At the same time, both memory (commodity DRAM) density and off-chip bandwidth may grow at a slower pace, making it desirable to allow sharing of user-level data structures between cores on a chip. PGAS languages take advantage of the shared memory and avoid some of the memory footprint costs associated with partitioned address space (message passing) programming models.
The Berkeley UPC compiler currently runs on multicore systems and clusters of multicore nodes, but the group is exploring a number of extensions to the language, compiler, and runtime system to make effective use of multicore nodes. Under the CScADS project, the group has applied autotuning techniques to the problem of building a highly optimized collective communication library for PGAS languages. Collective communication is critical to the performance of many bulk-synchronous algorithms, whether they are programmed in MPI, UPC, CAF, or one of the languages emerging from the HPCS program. The Berkeley group specifically looked at optimization techniques for the UPC collectives and studied two fairly different multicore architectures, the Intel Clovertown and the Sun Niagara2. They developed highly optimized and scalable collective implementations for shared memory and found that distributed memory algorithms such as trees are often useful as the core count grows. The choice of tree structure and communication is highly dependent on the machine size, collective routine, and data size, so they developed a prototype autotuning framework to automatically select optimized implementations.
Figure 1 shows the effect of both architecture-independent (a fixed radix-2 tree) and architecture-dependent tuning on the Sun architecture for four different collective operations. The detailed results indicate the importance of selecting the radix and even the tree structure (balanced vs. binomial), which is a strong argument for an autotuned implementation.
Figure 1: Autotuning Collective Communication for Niagara2
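The selection step itself can be sketched as a simple search: benchmark each candidate tree geometry and keep the fastest. The C fragment below is a hypothetical, self-contained illustration of that idea, not the Berkeley framework; run_collective() is a toy cost model standing in for timing the actual collective on the target machine.

/* Hypothetical autotuning harness: enumerate candidate tree shapes for a
 * collective and keep the fastest.  run_collective() is a toy cost model
 * standing in for a timed trial on the real machine. */
#include <stdio.h>
#include <math.h>
#include <float.h>

typedef struct { int radix; int balanced; } tree_t;

static double run_collective(tree_t t, int nthreads, size_t bytes) {
    /* Toy model: (tree depth) x (per-level latency + bandwidth term). */
    double depth = ceil(log((double)nthreads) / log((double)t.radix));
    double per_level = 1e-6 * t.radix + 1e-9 * (double)bytes;
    return depth * per_level * (t.balanced ? 1.0 : 0.9);
}

int main(void) {
    int nthreads = 64;        /* e.g., hardware threads on the node */
    size_t bytes = 8;         /* payload per thread */
    tree_t best = { 2, 1 };
    double best_time = DBL_MAX;

    for (int radix = 2; radix <= 16; radix *= 2) {
        for (int balanced = 0; balanced <= 1; balanced++) {
            tree_t t = { radix, balanced };
            double sec = run_collective(t, nthreads, bytes);
            if (sec < best_time) { best_time = sec; best = t; }
        }
    }
    printf("selected radix-%d %s tree (model time %.3g s)\n",
           best.radix, best.balanced ? "balanced" : "binomial", best_time);
    return 0;
}

In a real tuner the search would be repeated per collective routine and data size, since, as Figure 1 shows, the best geometry changes with both.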
The Berkeley group is also developing an optimized implementation of the basic GASNet communication layer for petascale systems such as the BlueGene architecture, which has previously been supported only by an MPI implementation of GASNet, which is not very efficient, and by an IBM Research prototype that is not available outside IBM. GASNet underlies multiple PGAS language compilers (Berkeley UPC, Intrepid gcc/upc, Rice CAF, Berkeley Titanium, and Cray Chapel).
2.2 Autotuned Sparse Libraries for Multicore
The Berkeley team also made progress in delivering self-tuning libraries to the user community in the form of a multicore/SMP extension of the OSKI sparse matrix library called pOSKI. Whereas OSKI tunes for registers, caches, and some SIMD units, pOSKI also tunes the number of threads and thread-level blocking, and adds explicit software prefetching. The ideas build on work by Sam Williams on optimizations for multicore, which was funded in part by the PERI SciDAC project; in this CScADS work the optimization ideas were encoded in the OSKI autotuning framework to make it easier for users to benefit from them.
The figure below summarizes the pOSKI results for the AMD Barcelona processor using 12 different matrices from a variety of application domains. Each of the bars is divided into a set of performance results for each of the different optimizations, which are applied additively. In addition, the last point (a black diamond) shows previous results obtained by Williams et al. using a standalone autotuning framework and a nearly identical set of optimizations. For most matrices the performance is comparable, but there are two matrices for which some machine-dependent strategy was applied in the code by Williams.
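As a concrete illustration of one optimization in this search space, the sketch below shows register blocking: an ordinary CSR sparse matrix-vector multiply next to a 2x1 register-blocked (BCSR) variant on a tiny hand-built matrix. The matrix and block size are illustrative assumptions only; the code is not part of pOSKI or OSKI.

/* Minimal sketch of register blocking, one of the optimizations a sparse
 * autotuner searches over.  The 4x4 matrix and 2x1 block size are made up. */
#include <stdio.h>

/* y += A*x with A in compressed sparse row (CSR) format. */
static void spmv_csr(int m, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y) {
    for (int i = 0; i < m; i++) {
        double yi = y[i];
        for (int k = rowptr[i]; k < rowptr[i+1]; k++)
            yi += val[k] * x[colind[k]];
        y[i] = yi;
    }
}

/* Same product with 2x1 register blocking (BCSR): two rows per block, so two
 * partial sums stay in registers and each source-vector element is reused. */
static void spmv_bcsr2x1(int mb, const int *browptr, const int *bcolind,
                         const double *bval, const double *x, double *y) {
    for (int ib = 0; ib < mb; ib++) {
        double y0 = y[2*ib], y1 = y[2*ib + 1];
        for (int k = browptr[ib]; k < browptr[ib+1]; k++) {
            double xj = x[bcolind[k]];
            y0 += bval[2*k]     * xj;
            y1 += bval[2*k + 1] * xj;
        }
        y[2*ib] = y0;  y[2*ib + 1] = y1;
    }
}

int main(void) {
    /* The same 4x4 matrix in both formats (explicit zeros pad the 2x1 blocks). */
    int    rowptr[]  = {0, 2, 3, 6, 7};
    int    colind[]  = {0, 2, 1, 0, 2, 3, 3};
    double val[]     = {1, 2, 3, 4, 5, 6, 7};
    int    browptr[] = {0, 3, 6};
    int    bcolind[] = {0, 1, 2, 0, 2, 3};
    double bval[]    = {1,0, 0,3, 2,0, 4,0, 5,0, 6,7};
    double x[4] = {1, 1, 1, 1}, y1[4] = {0}, y2[4] = {0};

    spmv_csr(4, rowptr, colind, val, x, y1);
    spmv_bcsr2x1(2, browptr, bcolind, bval, x, y2);
    for (int i = 0; i < 4; i++)
        printf("row %d: csr=%g bcsr=%g\n", i, y1[i], y2[i]);
    return 0;
}

Register blocking trades a few explicit zeros for register reuse of the partial sums and the source vector; an autotuner's job is to decide, per matrix and machine, whether and with what block size that trade pays off.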
3 Future Plans for FY09
The primary objective for the Berkeley CScADS research projects is to complete, tune, test, and release a UPC compiler for the BlueGene system. This involves a number of steps:
1) Test and tune the put/get operations that are available today in a pre-release version of GASNet.
2) Complete the implementation of the full GASNet interface to support features such as faster remote invocations (active messages).
3) Perform scalability testing and tuning.
4) Release optimized self-tuning collectives.
Additional work on autotuning for multicore will also continue, with exploration of tuning techniques for accelerators such as GPUs. Finally, Berkeley will continue to have a presence in multiple summer workshops and will co-organize one of those weeks related to autotuning.
4 Publications and Presentations
4.1 Theses
1. Ankit Jain, “pOSKI: An Extensible Autotuning Framework to Perform Optimized SpMVs on Multicore Architectures,” Master's Report, Computer Science Division, University of California at Berkeley, August 2008.
2. Wei-Yu Chen, “Optimizing Partitioned Global Address Space Programs for Cluster Architectures,” Computer Science Division, University of California at Berkeley, December 2007.
4.2 Published Papers
[1] S. W. Williams, D. A. Patterson, L. Oliker, J. Shalf, K. Yelick, “The Roofline Model: A Pedagogical Tool for Auto-tuning Kernels on Multicore Architectures,” Hot Chips: A Symposium on High Performance Chips, Stanford, CA, August 2008. (Abstract)
[2] Ankit Jain, Shoaib Kamil, Marghoob Mohiyuddin, John Shalf, and John D. Kubiatowicz, “Hybrid Electric/Photonic Networks for Scientific Applications on Tiled CMPs,” Hot Interconnects 2008, August 2008. (Abstract)
[3] Costin Iancu, Wei Chen, Katherine A. Yelick, “Performance Portable Optimizations for Loops Containing Communication Operations,” International Conference on Supercomputing, Island of Kos, Greece, June 7-12, 2008, pages 266-276.
[4] J. Demmel, M. Hoemmen, M. Mohiyuddin, K. Yelick, “Avoiding Communication in Sparse Matrix Computations,” IEEE International Parallel and Distributed Processing Symposium (IPDPS’08), April 2008.
[5] Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, “Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms,” IEEE International Parallel and Distributed Processing Symposium (IPDPS’08), April 2008. Best Paper Award, Applications Track.
[6] Rajesh Nishtala, George Almasi, Calin Cascaval, “Performance without Pain = Productivity, Data Layouts and Collectives in UPC,” Principles and Practices of Parallel Programming (PPoPP) 2008, Salt Lake City, USA, February 2008.
[7] John Mellor-Crummey, Peter Beckman, Jack Dongarra, Ken Kennedy, Barton Miller, Katherine Yelick, “Software for Leadership-Class Computing,” SciDAC Review, Fall 2007, pages 36-45.
[8] Parry Husbands and Katherine Yelick, “Multithreading and One-Sided Communication in Parallel LU Factorization,” Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[9] Tong Wen, Jimmy Su, Phillip Colella, Katherine Yelick and Noel Keen, “An Adaptive Mesh Refinement Benchmark for Modern Parallel Programming Languages,” Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[10] Sam Williams, Leonid Oliker, Richard Vuduc, James Demmel, Katherine Yelick,
“Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms.”
Proceedings of Supercomputing (SC07), Reno, NV, November 2007.
[11] James Demmel, Mark Hoemmen, Marghoob Mohiyuddin, and Katherine Yelick, “Avoiding Communication in Computing Krylov Subspaces,” University of California EECS Department Technical Report UCB/EECS-2007-123, October 2007.
[12] Alfredo Buttari, Jack Dongarra, Parry Husbands, Jakub Kurzak and Katherine Yelick, “Multithreading for Synchronization Tolerance in Matrix Factorization,” Proceedings of the SciDAC 2007 Conference, Boston, Massachusetts, June 24-28, 2007. Published in the Journal of Physics: Conference Series, Volume 78, 2007.
[13] Jimmy Su and Katherine Yelick, “Automatic Performance Debugging in Partitioned Global Address Space Programs,” 20th International Workshop on Languages and Compilers for Parallel Computing (LCPC), Urbana, Illinois, October 2007. Appeared in Springer Lecture Notes in Computer Science.
4.3 Submitted Papers
1. Sam Williams, Kaushik Datta, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, David Bailey, “PERI - Auto-tuning Memory Intensive Kernels for Multicore,” SciDAC: Scientific Discovery Through Advanced Computing, Seattle, Washington, July 2008. To appear in Journal of Physics: Conference Series. LBNL # pending.
2. Kaushik Datta, Shoaib Kamil, Sam Williams, Leonid Oliker, John Shalf, Katherine Yelick,
"Optimization and Performance Modeling of Stencil Computations on Modern
Microprocessors", SIAM Review, 2008 (in press). LBNL-63192.
3. Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid
Oliker, David Patterson, John Shalf, and Katherine Yelick, “Stencil Computation
Optimization and Autotuning on State-of-the-Art Multicore Architectures,” to appear at
Supercomputing 2008 (SC08), November 2008. LBNL # Pending.
5 Presentations
1. K. Yelick, “Parallel Programming Models,” CSE/ParLab Parallel Computing “Bootcamp”.
University of California at Berkeley, August 2008.
2. K. Yelick, “Programming Models for Manycore Processors” Intel/UPCRC Programming
Languages Workshop, August 23, 2008.
3. K. Yelick, “Multicore: Fallout from a Hardware Revolution,” Summer Lecture Series,
Lawrence Berkeley National Laboratory, July 2008.
4. K. Yelick (for S. Williams), “PERI - Auto-tuning Memory Intensive Kernels for Multicore,” SciDAC: Scientific Discovery Through Advanced Computing, Seattle, Washington, July 2008.
5. K. Yelick, “Programming Models: Opportunities and Challenges for Scalable Applications,”
Next Generation Scalable Applications: When MPI Only is Not Enough. June 3-5, 2008.
6. K. Yelick, “Programming Models for Manycore Systems,” Intel Corp., Santa Clara, CA,
April 23, 2008. Keynote.
7. K. Yelick, “Multicore Meets Exascale: The Catalyst for a Software Revolution,” 2008 Salishan Conference on High Speed Computing, Salishan, OR, April 21-22, 2008. Keynote.
8. K. Yelick, “Programming Models for Petascale to Exascale,” IPDPS 2008 Advance Program,
Miami, FL, April 15-16, 2008. Keynote.
9. R. Nishtala, “Performance without Pain = Productivity, Data Layouts and Collectives in UPC,” Principles and Practices of Parallel Programming (PPoPP) 2008, Salt Lake City, USA, February 2008.
10. K. Yelick, “Multicore Meets Petascale: The Catalyst for a Software Revolution,” North
Carolina State University, Raleigh, NC, Feb 10-12, 2008. Invited Talk.
11. K. Yelick, “Programming Models for Petascale,” Princeton University, Princeton, NJ,
February 25-26, 2008. Invited Talk.
12. K. Yelick, “Productivity and Performance using Partitioned Global Address Space
Languages,” Parallel Symbolic Computation (PASCO ‘07), London, Canada, July 27-28,
2007. Invited talk.