COMP4300/COMP8300 Parallel Systems
Alistair Rendell and Joseph Antony
Research School of Computer Science, Australian National University

Concept and Rationale
The idea
– Split your program into bits that can be executed simultaneously
Motivation
– Speed, speed, speed… at a cost-effective price
– If we didn't want it to go faster we would not be bothered with the hassles of parallel programming!
– Reduce the time to solution to acceptable levels
  – No point waiting a week for tomorrow's weather forecast
  – Simulations that take months to run are not useful in a design environment

Sample Application Areas
– Fluid flow problems
  – Weather forecasting / climate modelling
  – Aerodynamic modelling of cars, planes, rockets, etc.
– Structural mechanics
  – Bridge, building, car, etc. strength analysis
  – Car crash simulation
– Speech and character recognition, image processing
– Visualisation, virtual reality
– Semiconductor design, simulation of new chips
– Structural biology, molecular-level design of drugs
– Human genome mapping
– Financial market analysis and simulation
– Data mining, machine learning
– Games programming

World Climate Modelling
– Atmosphere divided into 3D regions or cells
– Complex mathematical equations describe conditions in each cell, e.g. pressure, temperature, velocity
  – Conditions change according to neighbouring cells
  – Updates are repeated frequently as time passes
  – Cells are affected by more distant cells the longer the forecast range
– Assume:
  – Cells of 1 x 1 x 1 mile to a height of 10 miles: about 5 x 10^8 cells
  – 200 flops to update each cell per timestep
  – 10-minute timesteps for a total of 10 days
– Roughly 100 days on a 100 Mflop/s machine, but only about 10 minutes on a Tflop/s machine (a back-of-the-envelope check of these numbers appears after the Raijin slide below)

ParallelSystems@ANU: NCI
NCI: National Computational Infrastructure – http://nci.org.au
– History: APAC established in 1998 with a $19.5M grant from the federal government; NCI created in 2007
– Current NCI collaboration agreement (2012–15)
  – Major collaborators: ANU, CSIRO, BoM, GA
  – Universities: Adelaide, Monash, UNSW, UQ, Sydney, Deakin, RMIT
  – University consortia: Intersect (NSW), QCIF (Queensland)
– Co-investment (for recurrent operations): 2007: $0M; 2008: $3.4M; 2009: $6.4M; 2011: $7.5M; 2012: $8.5M; 2013: $11M; 2014: $11M+ (provides for all recurrent operations)

Current Infrastructure: Data Centre
– New data centre: $24M (opened Nov. 2012); machine room 920 sq. m
– Power (after 2014 upgrades): 4.5 MW raw capacity; 1 MW UPS; 2 x 1.1 MVA Cummins generators
– Cooling in two loops:
  – Server loop: 2 x 1.8 MW Carrier chillers; 3 x 0.8 MW "free cooling" heat exchangers; 18 °C; 75 l/sec pump rate
  – Data loop: 3 x 0.5 MW Carrier chillers; 15 °C
– PUE: approx. 1.25

NCI: Raijin – Petascale Supercomputer
Raijin – supercomputer (commissioned June 2013)
– 57,472 cores (Intel Xeon Sandy Bridge, 2.6 GHz) in 3592 compute nodes
– Approx. 160 TBytes of main memory
– Mellanox InfiniBand FDR interconnect (52 km of cable)
– Approx. 10 PBytes of usable fast filesystem (short-term scratch space, apps, home directories)
– Power: 1.5 MW maximum load
– Cooling systems: 100 tonnes of water
– 24th fastest in the world on debut (November 2012); first petaflop system in Australia (November 2014: #52)
– Fastest filesystem in the southern hemisphere
– Custom monitoring and deployment; custom kernel; highly customised PBS Pro scheduler
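Returning to the climate-modelling slide above, the sketch below (not from the course materials) simply multiplies out the stated parameters: 5 x 10^8 cells, 200 flops per cell per timestep, 10-minute timesteps over 10 days, and the two machine rates of 100 Mflop/s and 1 Tflop/s. The slide's quoted "100 days" and "10 minutes" are order-of-magnitude roundings, so the exact figures printed here come out somewhat smaller.

```c
/* Back-of-the-envelope cost estimate for the climate-model example.
 * All constants are the ones stated on the slide; the program just
 * multiplies them out and reports the serial run time on two machines. */
#include <stdio.h>

int main(void)
{
    double cells          = 5.0e8;                 /* 1x1x1 mile cells to 10 miles high */
    double flops_per_cell = 200.0;                 /* work per cell per timestep        */
    double timestep_s     = 10.0 * 60.0;           /* 10-minute timesteps               */
    double duration_s     = 10.0 * 24.0 * 3600.0;  /* 10-day forecast                   */

    double timesteps   = duration_s / timestep_s;
    double total_flops = cells * flops_per_cell * timesteps;

    double mflop_machine = 100.0e6;                /* 100 Mflop/s machine */
    double tflop_machine = 1.0e12;                 /* 1 Tflop/s machine   */

    printf("total work: %.3g flops over %.0f timesteps\n", total_flops, timesteps);
    printf("100 Mflop/s machine: %.1f days\n",    total_flops / mflop_machine / 86400.0);
    printf("1 Tflop/s machine:   %.1f minutes\n", total_flops / tflop_machine / 60.0);
    return 0;
}
```

Either way the conclusion on the slide stands: a serial run at 1990s workstation speeds is hopeless for forecasting, while a Tflop/s-class machine brings the same calculation down to minutes.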
NCI's Integrated High-Performance Environment
[Diagram: Internet-facing NCI data movers and VMware cloud; Raijin login and data-mover nodes (10 GigE, link to the Huxley DC); Raijin HPC compute on a 56 Gb FDR InfiniBand fabric; Massdata tape archive (1.0 PB cache, 20 PB tape); persistent global parallel filesystems /g/data1 (~7 PB) and /g/data2 (~6 PB); Raijin high-speed filesystem /short (7.6 PB) plus /home, /system, /images and /apps.]

ParallelSystems@DCS
– Bunyip: tsg.anu.edu.au/Projects/Bunyip
  – 192-processor PC cluster
  – Winner of the 2000 Gordon Bell prize for best price/performance
– High Performance Computing Group
  – Jabberwocky cluster
  – Saratoga cluster
  – Sunnyvale cluster

The Rise of Parallel Computing
Year   Hardware                          Languages
1950   Early designs                     Fortran I (Backus, 1957)
1960   Integrated circuits               Fortran 66
1970   Large-scale integration           C (1972)
1980   RISC and the PC                   C++ (1983), Python 1.0 (1989)
1990   Shared and distributed parallel   MPI, OpenMP, Java (1995)
2000   Faster, better, hotter            Python 2.0 (2000)
2010   Throughput oriented               CUDA, OpenCL
– Parallelism became an issue for programmers from the late 80s
– People began compiling lists of big parallel systems: November 2014 Top500 (NCI now number 52)

Planning the Future
[Graph: growth in ANU/NCI's computing performance (measured in TFlops) since 1987; architecture and capability determined by research and innovation drivers.]
[Graph: international Top500 supercomputer growth since 1993; red: #1 machine each year; yellow: #500 machine each year; blue: sum of all machines.]
– The Top500 graphs show growth factors of between 8 and 9 times every 3 years

Transitioning Australia to its HPC Future
[Graph: cumulative capability usage at NCI (use, cumulative %, against cores from 1 to 16384) for 2008, 2009, 2010, 2011 and 2013. The goal is to move the knee in the curve towards higher core counts; this needs expert people and, eventually, accelerated hardware.]

We Also Had Increased Node Performance
– Moore's Law: "Transistor density will double approximately every two years."
– Dennard scaling: "As MOSFET features shrink, switching time and power consumption will fall proportionately."
Until the chips became too big…
– Feature size, die area, and the shrinking fraction of the die reachable in one clock cycle: 250 nm, 400 mm^2, 100%; 180 nm, 450 mm^2, 100%; 130 nm, 566 mm^2, 82%; 100 nm, 622 mm^2, 40%; 70 nm, 713 mm^2, 19%; 50 nm, 817 mm^2, 6.5%; 35 nm, 937 mm^2, 1.9%
  (Agarwal, Hrishikesh, Keckler and Burger, "Clock Rate versus IPC", ISCA 2000)
…so multiple cores appeared on chip
– 2004: Sun releases the SPARC IV with dual cores, heralding the start of multicore
…until we hit a bigger problem: the end of Dennard scaling
– Moore's Law: "Transistor density will double approximately every two years."
– Dennard scaling: "As MOSFET features shrink, switching time and power consumption will fall proportionately."
  (Dennard, Gaensslen, Yu, Rideout, Bassous and Leblanc, IEEE JSSC, 1974)
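A brief aside on why the end of Dennard scaling matters (a standard approximation, not taken from the slides): the dynamic switching power of a CMOS chip is roughly

```latex
% alpha = activity factor, C = switched capacitance,
% V = supply voltage, f = clock frequency
P_{\mathrm{dynamic}} \approx \alpha \, C \, V^{2} f
```

While Dennard scaling held, shrinking features reduced C and V together, so f could keep rising at roughly constant power density. Once supply voltage could no longer be lowered (largely because of leakage), raising the clock rate or keeping every extra transistor busy blows the power budget, which is what pushes design towards the energy-minimising, customised philosophy summarised next.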
…ushering in a new philosophy in processor design:

                1960–2010                     2010–?
Transistors     Few transistors               No shortage of transistors
Power           No shortage of power          Limited power
Goal            Maximize transistor utility   Minimize energy
Approach        Generalize                    Customize

…and a fundamentally new set of building blocks for our petascale systems.

Petascale and Beyond: Challenges and Opportunities
– As a whole – sheer number of nodes
  – Tianhe-2 has the equivalent of more than 3M cores
  – Programming language/environment
  – Fault tolerance
– Within a domain – heterogeneity
  – The Tianhe system uses CPUs and GPUs
  – What to use when
  – Co-location of data with the unit processing it
– On the chip – energy minimisation
  – Processors already have frequency and voltage scaling
  – Minimise data size and movement, including use of just enough precision
  – Specialised cores
In RSCS we are working in all of these areas.

Parallelisation
Split the program up and run the parts simultaneously on different processors
– On N computers the time to solution should (ideally!) be 1/N (a minimal OpenMP sketch of this idea follows the Health Warning slide below)
– Parallel programming: the art of writing the parallel code!
– Parallel computer: the hardware on which we run our parallel code!
– COMP4300 will discuss both
Beyond raw compute power, other motivations include
– Enabling more accurate simulations in the same time (finer grids)
– Providing access to huge aggregate memories
– Providing more and/or better input/output capacity

Parallelism in a Single "CPU" Box
– Multiple instruction units: typical processors issue ~4 instructions per cycle
– Instruction pipelining: complicated operations are broken into simple operations that can be overlapped
– Graphics engines: multiple rendering pipes and processing elements render millions of polygons a second
– Interleaved memory: multiple paths to memory that can be used at the same time
– Input/Output: disks are striped, with different blocks of data written to different disks at the same time

Health Warning!
The course is run every other year
– Drop out this year and it won't be repeated until 2017
It's a 4000/8000-level course; it's supposed to:
– Be more challenging than a 3000-level course!
– Be less well structured
– Have a greater expectation on you
– Have more student participation
– Be fun!
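Picking up the "time to solution should ideally be 1/N" point from the Parallelisation slide, here is a minimal OpenMP sketch (not from the course materials; the problem size and the summation itself are arbitrary choices for the demo). It splits a simple loop across threads and reports the elapsed time, so you can compare runs with different values of OMP_NUM_THREADS.

```c
/* Minimal OpenMP illustration of splitting work across N threads.
 * Compile with, e.g.:  gcc -fopenmp -O2 sum.c -o sum
 * Run with:            OMP_NUM_THREADS=4 ./sum
 */
#include <stdio.h>
#include <omp.h>

#define N 100000000L   /* arbitrary problem size for the demo */

int main(void)
{
    double sum = 0.0;
    double t0 = omp_get_wtime();

    /* Loop iterations are divided among the available threads; each
     * thread accumulates a private partial sum, and the reduction
     * clause combines the partial sums at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < N; i++) {
        sum += 1.0 / (double)(i + 1);
    }

    double t1 = omp_get_wtime();
    printf("threads = %d, sum = %f, time = %.3f s\n",
           omp_get_max_threads(), sum, t1 - t0);
    return 0;
}
```

In practice the measured speedup is less than the ideal factor of N because of overheads such as thread startup, scheduling and memory bandwidth limits; that gap between the ideal 1/N time and reality is exactly what this course explores.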
Nathan Robertson, 2002 honours student: "Parallel systems and thread safety at Medicare: 2/16 understood it – the other guy was a $70/hr contractor"

Learning Objectives
– Parallel architecture: basic issues concerning the design and likely performance of parallel systems
– Programming paradigms: distributed and shared memory, things in between, Grid computing
– Specific systems: will make extensive use of research systems in our group and also visit the NCI facilities
– Parallel algorithms: numeric and non-numeric
– The future

Course Content
– Discussion of the schedule: http://cs.anu.edu.au/courses/COMP4300/schedule.html

Commitment and Assessment
The pieces
– 2 lectures per week (~30 core lecture hours)
– 6 labs (not marked, solutions provided)
– 2 assignments (40%)
– 1 mid-semester exam (1 hour, 15%)
– 1 final exam (3 hours, 45%)
The final mark is the sum of the assignment, mid-semester and final exam marks.

Lectures
Two slots
– Mon 10:00–12:00, PSYC G6
– Thu 11:00–12:00, PSYC G6
Exact schedule on the web site.
Partial notes will be posted on the web site – bring a copy to the lecture.
Attendance at lectures and labs is strongly recommended
– Attendance at labs will be recorded

Course Web Site
http://cs.anu.edu.au/courses/comp4300
We will use Wattle only for lecture recordings.

Laboratories
Start in week 3 (March 2nd)
– See the web page for the detailed schedule
4 sessions available
– Mon 15:00–17:00, N113
– Tue 13:00–15:00, N114
– Wed 14:00–16:00, N113
– Fri 12:00–14:00, N113
Who cannot make any of these?
Labs are not assessed, but their content will be examined.

People
Alistair Rendell (convener)
– CSIT Bldg, Rm N226 (and N338)
– Alistair.Rendell@anu.edu.au
– Phone 6125 4386
Joseph Antony (lecturer)
– Senior HPC Data Specialist, NCI
– NCI Bldg 143 (near JCSMR)
– Joseph.Antony@anu.edu.au
– Phone 6125 5988
Gaurav Mitra (tutor)
– PhD student, Computer Systems
– CSIT Bldg, Rm 230
– Gaurav.Mitra@anu.edu.au
– Phone 6125 9658

Course Communication
– Course web page: cs.anu.edu.au/course/comp4300
– Bulletin board (forum, available from Streams): cs.anu.edu.au/streams
– At lectures and in labs
– Email: comp4300@cs.anu.edu.au
– In person
  – Office hours (Alistair): Thu 12:00–13:00 (after the lecture)
  – Email for an appointment if you want another time

Useful Books
– Principles of Parallel Programming, Calvin Lin and Lawrence Snyder, Pearson International Edition, ISBN 978-0-321-54942-6
– Introduction to Parallel Computing, 2nd ed., Grama, Gupta, Karypis and Kumar, Addison-Wesley, ISBN 0201648652 (electronic version accessible online from the ANU library – search for the title)
– Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, Barry Wilkinson and Michael Allen, Prentice Hall, 2nd edition, ISBN 0131405632
– and others on the web page

Questions so far!?