Creating Software Tools and Libraries for Leadership

Center for Scalable Application Development Software
Cooperative Agreement No. DE-FC02-07ER25799
September 2008
Principal Investigator:
Katherine Yelick
Department of Electrical Engineering and Computer Sciences
University of California #1776
Berkeley, CA 94720
Voice: 510-495-2431
FAX: 510-642-3962
Students supported at Berkeley: 3
Rajesh Nishtala
Brian Kazian
Ankit Jain
Staff supported at Berkeley: 1
Katherine Yelick
In January 2007 the Center for Scalable Application Development Software (CScADS) was
established as a partnership between Rice University, Argonne National Laboratory, University
of California – Berkeley, University of Tennessee – Knoxville, and University of Wisconsin –
Madison. CScADS is pursuing an integrated set of activities that aim to increase the productivity
of DOE computational scientists by catalyzing the development of software tools and libraries for
leadership computing platforms. The Berkeley team co-organized the automatic tuning research,
and pursued application-level performance studies of PGAS languages and compilers.
Community Outreach and Vision-Building
To engage the community in the challenges and foster interdisciplinary collaborations, we have
established the CScADS Summer Workshops – an annual series of workshops that will focus on
topics related to scalable software for the DOE’s leadership computing platforms. In July 2008,
we held our first series of four workshops in Snowbird, Utah.
The general charge for the workshops was the following:
• Identify important open problems and challenges for achieving high performance on
leadership computing systems.
• Brainstorm on promising approaches to open problems.
• Identify infrastructure needs to address key challenges.
• Assess available infrastructure.
• Identify opportunities for synergy: opportunities to consolidate and harden existing
infrastructures, reuse existing components developed by others, as well as opportunities to
refactor and extend existing components to apply them to new challenges.
• Collaborate on design of sharable components.
• Identify targets of opportunity for further investment of resources, in particular strategic
investment targets for the DOE Office of Science.
Katherine Yelick from Berkeley co-organized a workshop with Keith Cooper from Rice
University, Jack Dongarra from the University of Tennessee, and Rich Vuduc from Georgia
Tech. The topic of the workshop was Automatic Performance Tuning, and it brought together
compiler developers, library writers, performance experts, and hardware designers to discuss
some of the code generation challenges for multicore processors that are the building blocks for
emerging petascale systems. The goal was to identify some of the challenges represented by
current and future hardware and the opportunities afforded by the use of automatic tuning.
The attendees included computer science researchers developing autotuning tools (many funded
by SciDAC or other DOE Office of Science programs), compiler writers, and computer architects
representing a variety of research and production architectures.
Research Contributions from Last Year
The Partitioned Global Address Space (PGAS) model, exemplified by the UPC, Co-Array Fortran,
and Titanium languages, allows programmers to easily express parallelism on complex shared data
structures. The languages allow such structures to be accessed through global pointers and
distributed array expressions, as well as bulk operations based on either high-level array copies or
(in UPC) explicit memory copies. PGAS programs that are designed and optimized for clusters
do most of their communication using bulk operations, but programs written for shared memory
hardware often have a fine-grained style. Fine-grained accesses that occur in loops may be
amenable to message vectorization, where accesses are combined across iterations, but more
irregular communication patterns are usually not amenable to such loop-based optimizations
since they either use pointer dereferences or have dynamic access patterns (e.g., table lookup).
Instead, compiler algorithms that decrease the number, reduce the volume, and hide the latencies
of the message traffic for irregular applications can be very beneficial.
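The difference between fine-grained access and message vectorization can be illustrated with a minimal Python sketch. It is not UPC and not the Berkeley compiler's transformation; `RemoteArray`, `get`, and `get_bulk` are hypothetical names that simply count simulated network messages, under the assumption that each call corresponds to one round trip.

```python
class RemoteArray:
    """Toy stand-in for a shared array owned by another node (hypothetical)."""

    def __init__(self, data):
        self._data = list(data)
        self.messages = 0  # number of simulated network round trips

    def get(self, i):
        # Fine-grained access: one message per element.
        self.messages += 1
        return self._data[i]

    def get_bulk(self, start, stop):
        # Bulk access: one message for the whole contiguous range.
        self.messages += 1
        return self._data[start:stop]


def sum_fine_grained(remote, n):
    """Shared-memory style loop: every iteration touches remote memory."""
    total = 0
    for i in range(n):
        total += remote.get(i)
    return total


def sum_vectorized(remote, n):
    """Same loop after message vectorization: the per-iteration accesses
    are combined into one bulk transfer hoisted out of the loop."""
    local = remote.get_bulk(0, n)
    total = 0
    for i in range(n):
        total += local[i]
    return total
```

Both functions compute the same sum, but the vectorized version issues a single message instead of `n`. The transformation relies on the loop's access pattern being known in advance, which is why pointer-chasing and table-lookup codes need the other optimizations mentioned above.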
PGAS Languages for Multicore Systems to Petascale Systems
Dual- and quad-core processors are currently the dominant building blocks for high-end systems,
and the number of cores is likely to double with chip density over the next few generations. At
the same time, both memory (commodity DRAM) density and off-chip bandwidth may grow at a
slower pace, making it desirable to allow sharing of user-level data structures between cores on a
chip. PGAS languages take advantage of the shared memory and avoid some of the memory
footprint costs associated with partitioned address space (message passing) programming models.
The Berkeley UPC compiler currently runs on multicore systems and clusters of multicore nodes,
but the group is exploring a number of extensions to the language, compiler, and runtime system to
make effective use of multicore nodes. Under the CScADS project, the group has applied
autotuning techniques to the problem of building a highly optimized collective communication
library for PGAS languages. Collective communication is critical to the performance of many
bulk-synchronous algorithms, whether they are programmed in MPI, UPC, CAF, or one of the
languages emerging from the HPCS program. The Berkeley group specifically looked at
optimization techniques for the UPC collectives and studied two fairly different multicore
architectures, the Intel Clovertown and Sun Niagara2. They developed highly optimized and
scalable collective implementations for shared memory and found that distributed memory
algorithms such as trees are often useful as the core count grows. The choice of tree structure and
communication is highly dependent on the machine size, collective routine, and data size, so they
developed a prototype autotuning framework to automatically select optimized implementations.
Figure 1 shows the effect of both architecture-independent (a fixed radix-2 tree) and
architecture-dependent tuning on the Sun architecture for four different collective operations. The detailed
results indicate the importance of selecting the radix and even tree structure (balanced vs.
binomial), which is a strong argument for an autotuned implementation.
Figure 1: Autotuning Collective Communication for Niagara2
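As a rough illustration of why the best radix depends on the machine, the following Python sketch selects the radix of a balanced k-ary broadcast tree. It is not the Berkeley autotuning framework: it replaces empirical timing with a toy cost model in which a parent sends to its children one after another. The function names and the `send_cost` and `latency` parameters are assumptions made for illustration.

```python
def kary_children(rank, radix, nranks):
    """Children of `rank` in a balanced k-ary tree rooted at rank 0."""
    first = radix * rank + 1
    return [c for c in range(first, first + radix) if c < nranks]


def broadcast_time(radix, nranks, send_cost=1.0, latency=0.0):
    """Modeled completion time of a tree broadcast (toy cost model).

    Each parent sends to its children sequentially; every send occupies
    the parent for `send_cost` and arrives `latency` later.
    """
    finish = [0.0] * nranks
    # Parents always have a lower rank than their children, so visiting
    # ranks in ascending order sees each parent's finish time first.
    for rank in range(nranks):
        for j, child in enumerate(kary_children(rank, radix, nranks)):
            finish[child] = finish[rank] + (j + 1) * send_cost + latency
    return max(finish)


def tune_radix(nranks, candidates=(2, 4, 8, 16), **cost):
    """Pick the candidate radix with the lowest modeled broadcast time."""
    return min(candidates, key=lambda r: broadcast_time(r, nranks, **cost))
```

Under this model, `tune_radix(5, candidates=(2, 4))` picks radix 2 when latency is zero, while `tune_radix(5, candidates=(2, 4), latency=10.0)` picks radix 4, because a shallower tree pays the latency fewer times. An actual autotuner would time real collective calls on the target machine rather than rely on a model.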
The Berkeley group is also developing an optimized implementation of the basic GASNet
communication layer for petascale systems such as the BlueGene architecture, which has
previously been supported only by an MPI implementation of GASNet, which is not very
efficient, and by an IBM Research prototype, which is not available outside IBM. GASNet
underlies multiple PGAS language compilers (Berkeley UPC, Intrepid gcc/upc, Rice CAF,
Berkeley Titanium, and Cray Chapel).
Autotuned Sparse Libraries for Multicore
The Berkeley team also made progress in delivering self-tuning libraries to the user community in
the form of a multicore/SMP extension of the OSKI sparse matrix library called pOSKI. Whereas
OSKI tunes for registers, caches, and some SIMD accelerators, pOSKI also tunes the number
of threads and adds thread blocking as well as explicit software prefetch. The ideas
build on work by Sam Williams on optimizations for multicore, which was funded in part by the
PERI SciDAC project, and in this CScADS work the optimization ideas were encoded in the
OSKI autotuning framework to make it easier for users to benefit from them.
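The thread-count dimension of such a search can be sketched in Python as follows. This is not the OSKI or pOSKI API; all function names are hypothetical, the matrix is assumed to be in CSR form, and rows are split into contiguous blocks, one per thread. Pure Python threads will not show real speedups, so the sketch only illustrates the structure of an empirical search.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def spmv_rows(indptr, indices, values, x, row_start, row_stop):
    """Multiply rows [row_start, row_stop) of a CSR matrix by vector x."""
    out = []
    for row in range(row_start, row_stop):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += values[k] * x[indices[k]]
        out.append(acc)
    return out


def spmv_threaded(indptr, indices, values, x, nthreads):
    """Sparse matrix-vector product with rows split into nthreads blocks."""
    nrows = len(indptr) - 1
    bounds = [nrows * t // nthreads for t in range(nthreads + 1)]
    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        parts = pool.map(
            lambda t: spmv_rows(indptr, indices, values, x,
                                bounds[t], bounds[t + 1]),
            range(nthreads),
        )
        # pool.map preserves block order, so concatenation yields y.
        return [v for part in parts for v in part]


def tune_thread_count(indptr, indices, values, x,
                      candidates=(1, 2, 4), trials=3):
    """Time each candidate thread count and return the fastest one."""
    best, best_time = None, float("inf")
    for nthreads in candidates:
        start = time.perf_counter()
        for _ in range(trials):
            spmv_threaded(indptr, indices, values, x, nthreads)
        elapsed = time.perf_counter() - start
        if elapsed < best_time:
            best, best_time = nthreads, elapsed
    return best
```

A library in this style would run the search once per matrix and machine, then reuse the selected configuration for subsequent multiplications.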
The figure below summarizes the pOSKI results for the AMD Barcelona processor using 12
different matrices from a variety of application domains. Each of the bars is divided into a set of
performance results for each of the different optimizations, which are applied additively. In
addition, the last point (a black diamond) shows previous results obtained by Williams et al. using
a standalone autotuning framework and a nearly identical set of optimizations. For most
matrices the performance is comparable, but there are two matrices for which some
machine-dependent strategy was applied in the code by Williams.
Future Plans for FY09
The primary objective for the Berkeley CScADS research projects is to complete, tune, test and
release a UPC compiler for the BlueGene system. This involves a number of steps:
1) Test and tune the put/get operations, which are available today in a pre-release version of
2) Complete the implementation of the full GASNet interface to support features such as
faster remote invocations (active messages).
3) Scalability testing and tuning.
4) Release optimized self-tuning collectives.
Additional work on autotuning for multicore will also continue, with exploration of tuning
techniques for accelerators such as GPUs. Finally, Berkeley will continue with a presence in
multiple summer workshops and will co-organize one of those weeks related to autotuning.
