Introduction to Research 2011
Ashok Srinivasan
Florida State University
www.cs.fsu.edu/~asriniva
[Images from ORNL, IBM, and NVIDIA: part of the machine room at ORNL; the Cell processor that powers Roadrunner at LANL; the NVIDIA GPUs that power Tianhe-1A in China]

Outline
- Research
- High Performance Computing: applications and software, multicore processors, massively parallel processors
- Computational nanotechnology
- Simulation-based policy making
- Potential research topics

Research Areas
- High performance computing, applications in the computational sciences, scalable algorithms, mathematical software
- Current topics: computational nanotechnology, HPC on multicore processors, massively parallel applications
- New topics: simulation-based policy analysis
- Old topics: computational finance, parallel random number generation, Monte Carlo linear algebra, computational fluid dynamics, image compression

Importance of Supercomputing
- Fundamental scientific understanding
- Solution of bigger problems, such as automobile crash tests
- Solutions with time constraints, such as climate modeling
- More accurate solutions, such as nano-materials and drug design
- Disaster mitigation
- Study of complex interactions for policy decisions, such as urban planning

Some Applications
- Increasing relevance to industry: in 1993, fewer than 30% of the top 500 supercomputers were commercial; now 57% are
- A variety of application areas
- Commercial: finance and insurance, medicine, aerospace and automobiles, telecom, oil exploration, shoes (Nike), potato chips, toys
- Scientific: weather prediction, earthquake modeling, epidemic modeling, materials, energy, computational biology, astrophysics

Supercomputing Power
- The amount of parallelism is also increasing, with the high end having over 200,000 cores

Geographic Distribution
- North America has over half of the top 500 systems
- However, Europe and East Asia also have a significant share
- China is determined to be a supercomputing superpower; two of its national supercomputing centers have top-five supercomputers
- Japan has the top machine and two of the top five, and is planning a $1.3 billion exascale supercomputer for 2020

Asian Supercomputing Trends

Challenges in Supercomputing
- Hardware can be obtained with enough money, but obtaining good performance on large systems is difficult
- Some DOE applications ran at 1% efficiency on 10,000 cores
- Applications will soon have to deal with a million threads, and with a billion at the exascale
- Don't think of supercomputing as a means of solving current problems faster, but as a means of solving problems we previously thought we could not solve
- Development of software tools to make using these machines easier

Architectural Trends
- Massive parallelism: systems with 10K processors will be commonplace, and the large end already has over 500K processors
- Single-chip multiprocessing: all processors will be multicore
- Heterogeneous multicore processors: the Cell used in the PS3, GPGPUs, the 80-core processor from Intel; processors with hundreds of cores are already commercially available
- Distributed environments, such as the Grid
- But it is hard to get good performance on these systems

Accelerating Applications with GPUs
- Over a hundred cores per GPU
- Hide memory latency with thousands of threads
- Can accelerate a traditional computer to a teraflop
- GPU cluster at FSU
- Quantum Monte Carlo applications
- Algorithms: linear algebra, FFT, compression, etc.

Small Discrete Fourier Transforms (DFTs) on GPUs
- GPUs are effective for large DFTs, but not for small DFTs
- However, they can be effective for a large number of small DFTs, which is useful for AFQMC
- We use the asymptotically slow matrix-multiplication-based DFT for very small sizes (see the sketch after this list)
- We combine it with a mixed-radix decomposition for larger sizes
- We use asynchronous memory transfers to deal with the host-device data transfer overhead
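The following is a minimal sketch of the matrix-multiplication-based DFT idea for very small sizes: one N x N DFT matrix is applied to an entire batch of length-N signals with a single dense product, which is the kind of regular work that suits GPUs. This is an illustration only, not the actual GPU code; the function names and the NumPy setting are assumptions, and the real implementation runs on the device and is combined with mixed-radix steps for larger sizes.

```python
import numpy as np

def dft_matrix(n):
    # DFT matrix F with F[j, k] = exp(-2*pi*i*j*k / n)
    j, k = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    return np.exp(-2j * np.pi * j * k / n)

def batched_small_dft(signals):
    # Apply an n-point DFT to every row of `signals` (shape: batch x n)
    # via one dense matrix product.  For very small n this O(n^2)-per-DFT
    # method is competitive because the whole batch shares one multiply.
    n = signals.shape[1]
    return signals @ dft_matrix(n).T

# Illustrative check against a library FFT for 512 simultaneous 8-point DFTs
x = np.random.rand(512, 8) + 1j * np.random.rand(512, 8)
assert np.allclose(batched_small_dft(x), np.fft.fft(x, axis=1))
```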
Comparison of DFT Performance
Comparison of 512 simultaneous DFTs, without host-device data transfer (times in microseconds per DFT).

2-D DFTs, time in µs/DFT:

N                       4       8       12      16      20      24
Mixed radix             0.043   0.214   0.550   1.14    1.96    3.19
Matrix multiplication   0.038   0.206   0.716   1.95    3.09    6.71
Cooley-Tukey            0.115   0.353   -       1.96    -       -
CUFFT                   18.3    23.5    45.8    35.4    47.6    46.4
FFTW                    2.87    3.41    4.78    6.81    11.2    17.1

3-D DFTs, time in µs/DFT:

N                       4       8       12      16      20      24
Mixed radix             0.621   4.04    12.4    34.8    71.9    138
Matrix multiplication   0.578   3.43    12.7    42.9    77.5    172
Cooley-Tukey            1.06    6.01    -       58.2    -       -
CUFFT                   50.1    84.7    327     836     566     678
FFTW                    3.57    12.2    38.3    92.6    230     513

Petascale Quantum Monte Carlo
- Originally a DOE-funded project involving collaboration between ORNL, UIUC, Cornell, UTK, CWM, and NCSU; now funded by ORAU/ORNL
- Scale Quantum Monte Carlo applications to petascale (one million gigaflops) machines
- Load balancing, fault tolerance, and other optimizations

Load Balancing
- In current implementations, such as QWalk and QMCPack, cores send their excess walkers to cores with fewer walkers
- In the new algorithm, based on the alias method, cores may send more than their excess and may receive walkers even if they originally had an excess (a minimal sketch of such a transfer plan appears at the end of this section)
- Load can be balanced with each core receiving from at most one other core
- The algorithm is also optimal in the maximum number of walkers received by any core
- The total number of walkers sent may be at most twice the optimal

Performance Comparison
- Comparisons with QWalk
- [Plots: mean number of walkers migrated; maximum number of receives]

Process-Node Affinity
- Node allocation is not necessarily ideal for minimizing communication
- Process-node affinity can, therefore, be important
- [Figure: allocated nodes for a 12,000-core run on Jaguar]

Load Balancing with Affinity
- Renumbering the nodes improves both load balancing and AllGather time
- [Figures: basic load balancing; load balancing after renumbering; results on Jaguar]

Potential Research Topics
- High performance computing on multicore processors: algorithms, applications, and libraries on GPUs
- Applications on massively parallel processors: Quantum Monte Carlo applications; load balancing and communication optimizations
- Simulation-based policy decisions: combine scientific computing with models of social interactions to help make policy decisions
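As a rough illustration of the load-balancing property described earlier (each core receives walkers from at most one other core, while a sender may give away more than its excess), the following is a minimal sketch of computing such a transfer plan. It is not the QWalk, QMCPack, or production algorithm; the function name, the assumption that the total walker count divides evenly among the cores, and the pure-Python setting are all illustrative assumptions.

```python
from collections import deque

def balance_plan(walkers):
    # walkers[i] = number of walkers currently on core i.
    # Assumes the total divides evenly by the number of cores (illustrative).
    # Returns a list of (sender, receiver, count) transfers such that every
    # core receives from at most one other core; a sender may give away more
    # than its excess and later receive from one core itself.
    n = len(walkers)
    target = sum(walkers) // n
    load = list(walkers)
    surplus = deque(i for i in range(n) if load[i] > target)
    deficit = deque(i for i in range(n) if load[i] < target)
    plan = []
    while deficit:
        r = deficit.popleft()      # core still below the target
        s = surplus[0]             # current donor core
        need = target - load[r]
        plan.append((s, r, need))  # r receives from exactly one core
        load[r] = target
        load[s] -= need
        if load[s] <= target:      # donor exhausted (possibly overdrawn)
            surplus.popleft()
            if load[s] < target:
                deficit.append(s)  # overdrawn donor will receive once later
    return plan

# Illustrative use: core 0 has a surplus of 4 walkers, core 1 a deficit of 4.
print(balance_plan([10, 2, 6, 6]))   # [(0, 1, 4)]
```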