HPC Research @ UNM: X10’ding Graph Analysis
Mehmet F. Su
ECE Dept. - University of New Mexico
Joint work with advisor: David A. Bader
{mfatihsu, dbader} @ ece.unm.edu

Acknowledgment of Support
• National Science Foundation
  • CAREER: High-Performance Algorithms for Scientific Applications (00-93039)
  • ITR: Building the Tree of Life -- A National Resource for Phyloinformatics and Computational Phylogenetics (EF/BIO 03-31654)
  • DEB: Ecosystem Studies: Self-Organization of Semi-Arid Landscapes: Test of Optimality Principles (99-10123)
  • ITR/AP: Reconstructing Complex Evolutionary Histories (01-21377)
  • DEB: Comparative Chloroplast Genomics: Integrating Computational Methods, Molecular Evolution, and Phylogeny (01-20709)
  • ITR/AP(DEB): Computing Optimal Phylogenetic Trees under Genome Rearrangement Metrics (01-13095)
  • DBI: Acquisition of a High Performance Shared-Memory Computer for Computational Science and Engineering (04-20513)
• IBM PERCS / DARPA High Productivity Computing Systems (HPCS)
• PACI: NPACI/SDSC, NCSA/Alliance, PSC
• DOE Sandia National Laboratories

Outline
• About the speaker
• Graph theoretic problems: what and why
• Our research
• IBM PERCS performance evaluation tools: SSCA-2 and X10
• Some tool ideas for better productivity

About the speaker: Mehmet F. Su
• Education
  • BS Physics (Bilkent University, Ankara, Turkey)
  • Physics Dept, Iowa State University, Ames, IA
  • PhD track, ECE Dept, University of New Mexico, Albuquerque, NM
• Past and Present External Collaborations
  • Condensed Matter Physics Group, Ames National Laboratory, Ames, IA
    • HPC apps. in comp. biology, photonics, comp. electromagnetism
  • Photonic Microsystems Technologies Group, Sandia National Laboratories, Albuquerque, NM
    • HPC apps. in photonics and comp. electromagnetism

• The Tree of Life
• Power Distribution in Boylan Heights, Raleigh, NC
• Social Networks
• Air Transportation
• National Highway
• Portland, OR
• Manhattan, NY
• US Power Grid’s Control Area Operators (CAO)
• US Internet Backbone

Characteristics of Graph Problems
• Graphs are of fundamental importance
• Many fast theoretic PRAM algorithms but few fast parallel implementations
• Irregular problems are challenging
  • Sparse data structures
  • Hard to partition data
  • Poor locality hinders cache performance
• Parallel graph and tree algorithms
  • Building blocks for higher-level parallel algorithms
  • Hard to achieve parallel speedup (very fast sequential implementations)

Our Group’s Impact
• Our results demonstrate the first parallel implementations of several combinatorial problems that, for arbitrary sparse instances, run faster than the best sequential implementations:
  • list ranking
  • spanning tree, minimum spanning forest, rooted spanning tree
  • ear decomposition
  • tree contraction and expression evaluation
  • maximum flow using push-relabel
• Our source code is freely available under the GNU General Public License (GPL).

Spanning Tree ([Cong, Bader] Ph.D. 2004, now at IBM TJ Watson)
[Figure: execution time in seconds versus number of processors (2 to 10) on a random graph with 1M vertices and 20M edges, comparing Shiloach-Vishkin, our SMP algorithm, and the sequential implementation.]

High-End SMP Servers
• IBM pSeries 690 “Regatta”: 32-way Power4+ 1.7 GHz, 32 GB RAM
  • STREAM Triad: 58.9 GB/s
• IBM pSeries 575: 2U rackmount, 8-way SMP, up to 256 GB RAM, up to 1024-proc configuration w/ single cluster 1600
  • STREAM Triad (8 p5 1.9 GHz procs): 55.7 GB/s

About SSCA-2
• DARPA High Productivity Computing Systems (HPCS) Program
• Productivity Benchmarks: Scalable Synthetic Compact Application (SSCA)
• SSCA-2 = Graph Analysis (directed multigraph with labeled edges)
• Simulate large-scale graph problems
• Multiple analysis techniques, single data
• Four computational kernels
• Integer and character ops., no floating point
• Emphasizes integer operations, irregular data access, choice of data structure
• Data structure not modified across kernels

SSCA-2 Structure
• Scalable Data Generator produces a random, but structured, set of edges
• Kernel 1: Builds the graph data structure from the set of edges
• Kernel 2: Searches the multigraph for the desired maximum integer weight and the desired string weight (labels)
• Kernel 3: Extracts desired subgraphs, given start vertices and path length
• Kernel 4: Extracts clusters (cliques) to help identify the underlying graph structure

About X10
• New programming language, in development by IBM
• Better productivity, more scalability
• Shorten development/test cycle time
• Object oriented
• New ways to express
  • Parallelism
  • Data access
  • Aggregate operations (scan, reduce etc.)
• Rule out/catch more programming errors, bugs

Implementation of SSCA-2
• Designed and implemented parallel shared memory code (C with POSIX threads) for SSCA-2 [Bader/Madduri]
• Interested in X10 implementation
  • Evaluate productivity with X10 and its development environment (Eclipse)
  • Evaluate SSCA-2 performance on new systems once X10 is fully optimized

Tool Ideas for Better Productivity
• Wizard-like interfaces
• Intuitive visualization for data
  • With zoom/agglomeration, like online street maps
• Help resolve unresolved symbols, allow manual override w/ choices
• Autoconf/Automake counterparts
• Library/package indexing tool
• *NIXes, powerful development environments, cascaded menus shock many programmers
• Determine external dependencies/library symbols automatically for any environment
• Better branch prediction/feedback mechanism
  • Collect data over multiple runs

Tool Ideas (cont’d)
• Better binding, architecture dependent optimizer
  • Detect environment properties at run time
• Integrated tools to help identify performance hot spots and reasons
  • Profile for cache misses, branch prediction issues, check useful tasks performed concurrently, lock contention
• Visualization to indicate high level compiler optimizations on Eclipse editor window
  • Arrows for loop transforms, code relocations, annotations, different colors for propagated constants, evaluated expressions etc.
• Intermediate language/assembly viewer
  • Compiler optimizations, register scheduling, SWP annotated
  • Assembly listings from many compilers give similar info

IBM Collaborators
• PERCS Performance / X10 evaluation: Ram Rajamony, Pat Bohrer, Mootaz Elnozahy, Vivek Sarkar, Kemal Ebcioglu, Vijay Saraswat, Christine Halverson, Catalina M. Danis, Jason Ellis
• Advanced Computing Technologies: David Klepacki, Guojing Cong

Backup Slides

SSCA #2: Graph Analysis Overview
• Application: Graph Theory
  • Stresses memory access; uses integer and character operations (no floating point)
• Scalable Data Generation + 4 Computational Kernels
• Scalable Data Generator creates a set of edges between vertices to form a sparse, directed, weighted multigraph with no self-loops

Scalable Data Generation
• Creates a set of edges between vertices to form a sparse directed multigraph with:
  • Random number of randomly sized cliques
  • Random number of intra-clique directed parallel edges
  • Random number of gradually 'thinning' edges linking the cliques
  • No self-loops
  • Two types of edge weight labels: integer and character string
    • Only integer weights considered in present implementation
  • Randomized vertex numbers
• Vertices should be permuted to remove any locality for Kernel 4

Kernel 1 – Graph Generation
Construct a sparse multigraph from lists of tuples containing vertex identifiers, implied direction, and weights that represent data assigned to the implied edge. The multigraph can be represented in any manner, but it cannot be modified between subsequent kernels accessing the data.
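The construction described above can be sketched in a few lines. This is only an illustration in Python, not the C/POSIX threads code mentioned earlier: the compressed sparse row layout is one possible choice of representation, and the names `build_multigraph` and `out_edges` and the `(src, dst, weight)` tuple layout are assumptions made for this sketch.

```python
def build_multigraph(edges, num_vertices):
    """Build a CSR-style adjacency structure from (src, dst, weight) tuples.

    Parallel edges are kept as separate entries, so the result is a
    multigraph. Tuples are returned so later kernels cannot modify it.
    """
    # Count outgoing edges per vertex.
    counts = [0] * num_vertices
    for src, _dst, _weight in edges:
        counts[src] += 1

    # Prefix sums give the start of each vertex's edge block.
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + counts[v]

    # Scatter each edge into the next free slot of its source vertex.
    targets = [0] * len(edges)
    weights = [0] * len(edges)
    cursor = list(offsets[:-1])
    for src, dst, weight in edges:
        slot = cursor[src]
        targets[slot] = dst
        weights[slot] = weight
        cursor[src] += 1

    return tuple(offsets), tuple(targets), tuple(weights)


def out_edges(graph, v):
    """Return the (target, weight) pairs of all edges leaving vertex v."""
    offsets, targets, weights = graph
    return [(targets[i], weights[i]) for i in range(offsets[v], offsets[v + 1])]


# Hypothetical input: four directed edges, two of them parallel (0 -> 1).
edges = [(0, 1, 5), (0, 1, 7), (1, 2, 3), (2, 0, 9)]
graph = build_multigraph(edges, 3)
```

Returning tuples is a small design choice that mirrors the rule that the structure stays fixed once Kernel 1 has finished.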
There are various representations for sparse directed graphs, including (but not limited to) sparse matrices and (multi-level) linked lists. This kernel will be timed.

Kernel 2 – Classify large sets
Examine all edge weights to determine those vertex pairs with the largest integer weights and those vertex pairs with a specified string weight (label). The output of this kernel will be two vertex pair lists (i.e., sets) that will be saved for use in the following kernel. These two lists will be the start sets SI and SC, for integer start sets and character start sets respectively. The process of generating the two lists/sets will be timed.

Kernel 3 – Extracting Subgraphs
Produce a series of subgraphs consisting of vertices and edges on paths of length k from the vertex pairs in start sets SI and SC. A possible computational kernel for graph extraction is Breadth First Search. The process of extracting the graph will be timed.

Kernel 4 – Clique Extraction
Use a graph clustering algorithm to partition the vertices of the graph into subgraphs no larger than a maximum size so as to minimize the number of edges that need be cut.
• The kernel implementation should not utilize a priori knowledge of the details in the data generator or the statistics collected in the graph generation process
• Heuristic algorithms that determine the clusters in near-linear time, O(V), are permitted
The process of identifying the clusters and their interconnections will be timed.

X10 Design
• Builds over an existing OO language (Java) to shorten learning curve
• Has new constructs for commonly used data access patterns (distributions)
• Commonly used parallel programming environments today…
  • Message passing, no shared memory (MPI)
  • Shared memory, implicit thread control (OpenMP)
  • Shared memory, explicit thread control (Threads)
  • Partitioned global shared mem, explicit thread control (UPC)
  • PG shared, implicit thread control (HPF)
• … can these not be blended?
PG shared = can specify affinity to a thread

X10 Design (cont’d)
• Supports shared memory, allows local memory; shared memory is partitioned (places)
  • Operation can run at a place where data resides… (async)
  • … or data can be sent to a place to get evaluated (future)
• Supports short-hand definitions for array regions & data distribution, extended iterators (foreach variants)
• Generalized barriers (clocks) supporting more flexible operations (can operate/wait on multiple clocks); can freeze a variable until a clock advance (clocked final)
• Supports aggregate parallel operators (scan, reduction) in operator form (not like MPI calls)
• Supports atomic sections (unconditional, conditional); conditional sections lock on a logical condition (run “when” something is true)
• Weak memory consistency model (enables better optimizations)
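The place, future, and reduction ideas listed above can be loosely mimicked in an ordinary threaded program. The sketch below is Python, not X10, and it does not reproduce X10 syntax or semantics: threads in one address space only imitate the structure of partitioned memory. The names `partition`, `local_sum`, and `distributed_sum` are hypothetical, and the block distribution is an assumption made for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(data, num_places):
    """Block-distribute data into num_places chunks (a stand-in for places)."""
    size = (len(data) + num_places - 1) // num_places
    return [data[i * size:(i + 1) * size] for i in range(num_places)]


def local_sum(chunk):
    """Work that runs next to the chunk it owns."""
    return sum(chunk)


def distributed_sum(data, num_places=4):
    """Launch one task per chunk, then reduce the partial results."""
    places = partition(data, num_places)
    with ThreadPoolExecutor(max_workers=num_places) as pool:
        # Each submit returns a future for a value computed where the data is.
        futures = [pool.submit(local_sum, chunk) for chunk in places]
        # result() blocks until the corresponding task has finished.
        partials = [f.result() for f in futures]
    # Final reduction over the per-chunk partial sums.
    return sum(partials)


total = distributed_sum(list(range(1, 101)))
```

In this analogy the explicit future handling and the final `sum` are library calls, which is roughly the style the slide contrasts with X10's operator-form reductions.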