Computing on Mio: Introduction
Timothy H. Kaiser, Ph.D.
tkaiser@mines.edu
Director - CSM High Performance Computing
Director - Golden Energy Computing Organization
http://inside.mines.edu/mio/tutorial/

Purpose
• To enable people to quickly make use of the high performance computing resource Mio.mines.edu
• To give people an overview of high performance computing so that they know what is and is not possible
• To get people comfortable using other high performance computing resources

Topics
• Introduction to High Performance Computing
  • Purpose and methodology
  • Hardware
  • Programming
• Mio overview
  • History
  • General and node descriptions
  • File systems
  • Scheduling

More Topics
• Running on Mio
  • Documentation
  • Batch scripting
  • Interactive use
  • Staging data
  • Enabling graphics sessions
  • Useful commands

Miscellaneous Topics
• Some tricks from MPI and OpenMP
• Libraries
• Nvidia node

High Performance Computing

What is Parallelism?
• Consider your favorite computational application
  • One processor can give me results in N hours
  • Why not use N processors -- and get the results in just one hour?
• The concept is simple: Parallelism = applying multiple processors to a single problem

Parallel computing is computing by committee
• Parallel computing: the use of multiple computers or processors working together on a common task.
• Each processor works on its section of the problem
• Processors are allowed to exchange information with other processors
(Figure: a grid of a problem to be solved; Process 0, 1, 2, and 3 each do the work for their own region of the grid.)

Why do parallel computing?
• Limits of single CPU computing
  • Available memory
  • Performance
• Parallel computing allows us to:
  • Solve problems that don't fit on a single CPU
  • Solve problems that can't be solved in a reasonable time

Why do parallel computing?
• We can run…
  • Larger problems
  • Faster
  • More cases
  • Simulations at finer resolutions
  • Models of physical phenomena that are more realistic

Weather Forecasting
• The atmosphere is modeled by dividing it into three-dimensional regions or cells
• 1 mile x 1 mile x 1 mile (10 cells high)
• About 500 x 10^6 cells
• The calculations for each cell are repeated many times to model the passage of time
• About 200 floating point operations per cell per time step, or 10^11 floating point operations per time step
• A 10 day forecast with 10 minute resolution => about 1.5x10^14 flop
• 100 Mflops would take about 17 days
• 1.7 Tflops would take 2 minutes
• 17 Tflops would take 12 seconds
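The arithmetic on the weather forecasting slide is easy to check. The short C program below is not part of the tutorial materials; it simply multiplies out the constants quoted above (cell count, operations per cell, and number of time steps) and prints the run times at the quoted machine speeds, which come out in the same ballpark as the rounded figures on the slide.

#include <stdio.h>

/* Back-of-the-envelope check of the weather forecasting numbers above.
   All constants are taken from the slide; only the arithmetic is new. */
int main(void)
{
    double cells         = 500.0e6;          /* about 500 x 10^6 cells           */
    double flop_per_cell = 200.0;            /* floating point ops per cell/step */
    double steps         = 10.0*24.0*6.0;    /* 10 days at 10 minute resolution  */
    double total         = cells*flop_per_cell*steps;

    printf("flop per time step: %.2e\n", cells*flop_per_cell);
    printf("total flop        : %.2e\n", total);
    printf("at 100 Mflops     : %.1f days\n",    total/100.0e6/86400.0);
    printf("at 1.7 Tflops     : %.1f minutes\n", total/1.7e12/60.0);
    printf("at 17 Tflops      : %.1f seconds\n", total/17.0e12);
    return 0;
}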
Modeling Motion of Astronomical Bodies (brute force)
• Each body is attracted to every other body by gravitational forces.
• Movement of each body can be predicted by calculating the total force experienced by the body.
• For N bodies, N - 1 forces per body yields N^2 calculations each time step
• A galaxy has about 10^11 stars => 10^9 years for one iteration
• Using an efficient N log N approximate algorithm => about a year
• NOTE: This is closely related to another hot topic: Protein Folding

Types of parallelism: two extremes
• Data parallel
  • Each processor performs the same task on different data
  • Example - grid problems
  • Bag of Tasks or Embarrassingly Parallel is a special case
• Task parallel
  • Each processor performs a different task
  • Example - signal processing such as encoding multitrack data
  • Pipeline is a special case

Simple data parallel program
• Example: integrate a 2-D propagation problem
(Figure: the starting partial differential equation and its finite difference approximation; the x-y grid is divided into strips, one assigned to each of PE #0 through PE #7.)

Typical Task Parallel Application
(Figure: data flows through a chain of tasks: Normalize, FFT, Multiply, Inverse FFT.)
• Signal processing
• Use one processor for each task
• Can use more processors if one is overloaded
• This is a pipeline

Parallel Program Structure
(Figure: after a serial Begin, the program starts a parallel section; workers 1 through N each perform a sequence of work chunks a, b, c, d, communicating and repeating between chunks, then the parallel section ends and the program ends.)

Parallel Problems
(Figure: the same structure with two common problems highlighted: subtasks don't finish together, so some processors wait, and serial sections with no parallel work leave processors idle -- not all processors are being used.)

Theoretical upper limits
• All parallel programs contain:
  • Parallel sections
  • Serial sections
• Serial sections are when work is being duplicated or no useful work is being done (waiting for others)
• Serial sections limit the parallel effectiveness
  • If you have a lot of serial computation then you will not get good speedup
  • No serial work "allows" perfect speedup
• Amdahl's Law states this formally

Amdahl's Law
• Amdahl's Law places a strict limit on the speedup that can be realized by using multiple processors.
• Effect of multiple processors on run time: t_p = (f_p/N + f_s) t_s
• Effect of multiple processors on speedup: S = t_s/t_p = 1/(f_p/N + f_s)
• Where
  • f_s = serial fraction of code
  • f_p = parallel fraction of code
  • N = number of processors
• Perfect speedup: t = t_1/N, or S(N) = N

Illustration of Amdahl's Law
It takes only a small fraction of serial content in a code to degrade the parallel performance.

Amdahl's Law vs. Reality
• Amdahl's Law provides a theoretical upper limit on parallel speedup, assuming that there are no costs for communications.
• In reality, communications will result in a further degradation of performance.
(Figure: speedup versus number of processors, up to 250, for f_p = 0.99; the "Reality" curve falls well below the "Amdahl's Law" curve.)
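To see how quickly a small serial fraction caps the speedup, here is a minimal C sketch of the formula S = 1/(f_p/N + f_s) from the Amdahl's Law slide. The serial fractions and processor counts in it are illustrative values, not measurements from Mio or RA, and no communication cost is included.

#include <stdio.h>

/* Minimal sketch of Amdahl's Law: S = 1 / (fp/N + fs).
   The fs values and processor counts below are only illustrative. */
int main(void)
{
    double fs_values[] = {0.0, 0.01, 0.05, 0.10};      /* serial fractions */
    int    procs[]     = {1, 2, 4, 8, 16, 64, 256};    /* processor counts */
    int    i, j;

    for (i = 0; i < 4; i++) {
        double fs = fs_values[i];
        double fp = 1.0 - fs;
        printf("fs = %4.2f :", fs);
        for (j = 0; j < 7; j++) {
            double S = 1.0/(fp/procs[j] + fs);          /* predicted speedup */
            printf("  N=%-3d S=%6.1f", procs[j], S);
        }
        printf("\n");
    }
    return 0;
}

Even with only 1% serial work (f_s = 0.01) the predicted speedup on 256 processors is about 72, and it can never exceed 100 no matter how many processors are added.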
Hardware

Some Classes of Machines
Distributed Memory
• Processors only have access to their local memory
• Processors "talk" to other processors over a network
(Figure: four processors, each with its own local memory, connected by a network.)

Some Classes of Machines
Uniform Shared Memory (UMA)
• All processors have equal access to memory
• Processors can "talk" via memory
(Figure: eight processors all attached to a single shared memory.)

Some Classes of Machines
Hybrid
• Shared memory nodes connected by a network

Some Classes of Machines
More common today
• Each node has a collection of multicore chips
• Mio has 48 nodes:
  • 32 have 8 cores/node (4 per socket)
  • 16 have 12 cores/node (6 per socket)

Some Classes of Machines
Hybrid Machines
• Add special purpose processors (FPGA, GPU, vector, Cell...) to normal CPUs
• Not a new concept, but regaining traction
• Example: our Tesla Nvidia node, cuda1

Programming
• Message passing
• Threads/shared memory
• Manual

Message Passing
• Most common method for large scale applications
• Usually the same program running on many cores
• Processes communicate by passing messages
• Let's look at a silly example...

Message Passing
• The Message Passing Interface (MPI) library is often used
  • Most parallel machines ship with MPI
  • It can be built from source
• Works on shared and distributed memory machines
• Each process is given a number (0 to N-1)

Fortran MPI Program

program send_receive
  include "mpif.h"
  integer myid,ierr,numprocs,tag,source,destination,count
  integer buffer
  integer status(MPI_STATUS_SIZE)
  call MPI_INIT( ierr )
  call MPI_COMM_RANK( MPI_COMM_WORLD, myid, ierr )
  call MPI_COMM_SIZE( MPI_COMM_WORLD, numprocs, ierr )
  tag=1234
  count=1
  if(myid .eq. 0)then
     buffer=5678
     call MPI_Send(buffer, count, MPI_INTEGER,1,&
                   tag, MPI_COMM_WORLD, ierr)
     write(*,*)"processor ",myid," sent ",buffer
  endif
  if(myid .eq. 1)then
     call MPI_Recv(buffer, count, MPI_INTEGER,0,&
                   tag, MPI_COMM_WORLD, status,ierr)
     write(*,*)"processor ",myid," got ",buffer
  endif
  call MPI_FINALIZE(ierr)
  stop
end

C MPI Program

#include <stdio.h>
#include "mpi.h"
int main(int argc,char *argv[])
{
  int myid, numprocs, tag, source, destination, count, buffer;
  MPI_Status status;
  MPI_Init(&argc,&argv);
  MPI_Comm_size(MPI_COMM_WORLD,&numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD,&myid);
  tag=1234;
  count=1;
  if(myid == 0){
    buffer=5678;
    MPI_Send(&buffer,count,MPI_INT,1,tag,MPI_COMM_WORLD);
    printf("processor %d sent %d\n",myid,buffer);
  }
  if(myid == 1){
    MPI_Recv(&buffer,count,MPI_INT,0,tag,MPI_COMM_WORLD,&status);
    printf("processor %d got %d\n",myid,buffer);
  }
  MPI_Finalize();
}

Threads and Shared Memory
• Threads are sometimes called lightweight processes
• There is some mapping of threads to cores
  • Often a 1-1 mapping
  • A GPU can have many threads for each core
• Threads share memory on a multicore machine
• Threads are cheap to start, sleep, and wake
• Threads can move between cores
• Threads communicate via shared memory

Threads Programming
• Fork-join model
• In theory, threading can be done automatically by compilers
• Pthreads is a threading package for manually writing threaded programs
• Semiautomatic: directive driven threading
  • Directives tell the compiler where it can multithread
  • OpenMP is the most common directive set
• It is possible to combine MPI and OpenMP in a single program

Fortran OpenMP

!$OMP PARALLEL DO
do i=1,maxthreads
   seed(i)=i+i**2
   errcode(i)=vslnewstream(stream(i), brng, seed(i) )
!  write(*,*)i,errcode(i)
enddo
...
!$OMP PARALLEL DO PRIVATE(twod)
do i=1,nrays
   twod=>tarf(:,:,i)
   j=omp_get_thread_num()+1
   errcode(j)=vdrnggaussian( method, stream(j), msize*msize, twod, a,sigma)
!  write(*,*)"sub array ",i,j,errcode(j),twod(1,1)
enddo
...
!$OMP PARALLEL DO PRIVATE(twod)
do i=1,nrays
   twod=>tarf(:,:,i)
!  write(*,*)twod
   call my_clock(cnt1(i))
   CALL DGESV( N, NRHS, twod, LDA, IPIVs(:,i), Bs(:,i), LDB, INFOs(i) )
   call my_clock(cnt2(i))
   write(*,'(i5,i5,3(f12.3))')i,infos(i),cnt2(i),cnt1(i),real(cnt2(i)-cnt1(i),b8)
enddo
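The Fortran example above drives MKL random number streams and LAPACK's DGESV from inside OpenMP loops. As a bare-bones illustration of the directive idea itself, a sketch like the following is enough; it is not one of the tutorial codes, just a hypothetical loop in which the single pragma is the only change from serial code. It can be built with icc -openmp (as in the tutorial makefile) or gcc -fopenmp.

#include <stdio.h>
#include <omp.h>

/* Hypothetical minimal example: the directive tells the compiler that the
   loop iterations are independent and may run on different threads. */
int main(void)
{
    int i, n = 8;
    double a[8];

    #pragma omp parallel for
    for (i = 0; i < n; i++) {
        a[i] = 2.0*i;      /* each iteration touches only its own element */
        printf("thread %d did iteration %d\n", omp_get_thread_num(), i);
    }

    printf("a[%d] = %f\n", n-1, a[n-1]);
    return 0;
}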
C OpenMP

#pragma omp parallel sections
{
  #pragma omp section
  {
    system_clock(&t1_start);
    over(m1,n);
    system_clock(&t1_end);
    t1_start=t1_start-t0_start;
    t1_end=t1_end-t0_start;
  }
  #pragma omp section
  {
    system_clock(&t2_start);
    over(m2,n);
    system_clock(&t2_end);
    t2_start=t2_start-t0_start;
    t2_end=t2_end-t0_start;
  ...
}

A different thread and core will be assigned to run each section.

"Manual" Parallel Programming
• Log on to a machine with N cores
• Launch N separate tasks, putting each in the background
• Usually the N tasks are independent
  • Each task launched could actually be a pipeline of processes or multithreaded
• Example: Madagascar

Golden Energy Computing Organization
GECO - a quick overview

GECO
• Golden Energy Computing Organization
• A center intended to be a "national hub for computational inquiries aimed at the discovery of new ways to meet the energy needs of our society."
• GECO's main computational resource is RA

GECO HPC Hardware/Software: "Ra"
• Architecture
  • Dell with Intel quad-core, dual-socket systems
  • 2144 processing cores in 268 nodes
    • 256 nodes with 512 Clovertown E5355 (2.67 GHz) processors (quad core, dual socket); 184 nodes with 16 Gbytes and 72 with 32 Gbytes
    • 12 nodes with 48 Xeon 7140M (3.4 GHz) processors (quad socket, dual core); 32 Gbytes each
• Memory
  • 5,632 Gbytes of RAM (5.6 terabytes)
  • 16/32 gigabytes of RAM per node
  • 300 terabytes of disk
  • 300 terabytes of tape backup
• Performance
  • 18 Tflop sustained performance
  • 23 Tflop peak
  • Like every human on the planet doing 2,500 calculations per second

RA Problems
• Only for energy science
  • Allocation is by proposal
  • Little student access
• People wanted quick access, and RA is heavily loaded

Mio.Mines.Edu
MIO - "It's All Mine"

Mio.mines.edu
• A new concept in HPC for CSM
  • The school puts up the money for infrastructure
  • Researchers purchase individual nodes
    • They own the nodes
    • They can use others' nodes when those nodes are not in use
• Started with the head node and 4 compute nodes (32 cores)
• Current (10/28/10): 48 nodes, 472 cores, 5.53 Tflop

Original Mio Compute Nodes
• Penguin
  • 2 x dual Intel Xeon X5570 quad core, 2.93 GHz, 8 MB, max RAM speed 1333 MHz
  • Up to 2 x 48 GB DDR3-800 REG, ECC (24 x 4 GB)
  • 2 x 160 GB SATA, 7200 RPM
  • 2 x Intel Xeon dual socket motherboards with integrated Infiniband DDR/CX4 connections
• Half the size of RA nodes
  • More efficient in power and computation
  • More memory
  • Faster clock
• Have since gone to 12-core nodes and 1 Tbyte of disk

Mio @ Oct. 28, 2010
Total installed: 472 cores, 5.53 Tflop

Node Summary

Node       Cores   RAM (GB)   Disk free in /scratch (GB)
0          8       48         128
1          8       48         128
2          8       48         128
3          8       48         128
4          8       192        128
5          8       48         128
6 to 31    8       24         128
32 to 47   12      24         849

Mio: Installing a 200 Tbyte File System
• Panasas system
  • Used by BP
  • Donated to CSM
  • Panasas gave us a one-shot "deal" on software
• Available in parallel to all nodes
• It was working, but we need to rebuild it after a switch upgrade

File Systems on Mio
• Four partitions
  • $HOME - should be kept very small, holding only startup scripts and other simple scripts
  • $DATA - should contain programs you have built for use in your research, small data sets, and run scripts
  • $SCRATCH - the main area for running applications; output from parallel runs should be written to this directory (a short code sketch follows below)
  • $SETS - for large, mainly read-only data sets that will be used over a number of months

What's in a Name?
• The name Mio is a play on words.
• It is the Spanish translation of the word "mine" as in "belongs to me," not the hole in the ground.
• The phrase "The computer is mine." can be translated as "El ordenador es mío."
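As a small illustration of the intent behind these partitions, the hedged C sketch below writes its output under $SCRATCH by reading the environment variable of that name, rather than writing into $HOME. The file name myrun.out is made up for the example, and the sketch assumes SCRATCH is set in your login environment, as the use of $DATA in the "Getting the Examples" transcript later suggests.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: send run output to the scratch file system.
   Assumes the SCRATCH environment variable is defined, as on Mio. */
int main(void)
{
    const char *scratch = getenv("SCRATCH");
    char path[4096];
    FILE *out;

    if (scratch == NULL) {
        fprintf(stderr, "SCRATCH is not set\n");
        return 1;
    }
    snprintf(path, sizeof(path), "%s/myrun.out", scratch); /* made-up name */

    out = fopen(path, "w");
    if (out == NULL) {
        perror("fopen");
        return 1;
    }
    fprintf(out, "large run output belongs on $SCRATCH, not $HOME\n");
    fclose(out);
    printf("wrote %s\n", path);
    return 0;
}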
Mio.mines.edu
(Figure: Mio system diagram showing the compute nodes, Tux, the Panasas file system, the network switch, and the head node.)

Mio Links
• http://mio.mines.edu/ganglia/ - a "picture of usage"
• http://inside.mines.edu/mio - documentation
• http://mio.mines.edu/inuse/ - nodes in use
• http://mio.mines.edu/jobs/ - jobs running

Getting on to Mio
• Address: mio.mines.edu
• Only available on campus, via VPN, or through another on-campus machine (RA)
• Must use an ssh client
  • Unix and OS X: ssh is built in
    • OS X - /HD/Applications/Utilities/Terminal
  • Windows: use PuTTY (installed on Mines machines)
    • All Applications\putty

ssh Usage
• We strongly suggest that you set up and use ssh keys: a nontrivial passphrase plus a connection agent, so that you only need to type the passphrase once every 4 hours
  • Good security
  • Easy access
• See http://geco.mines.edu/ssh/

ssh Usage
• For today
  • Passwords on Mio.mines.edu are the same as your Mines MultiPass password; that is, the same as on the standard Unix boxes on campus such as imagine or illuminate.
  • If you don't have a MultiPass password or need to reset it, see http://newuser.mines.edu/password

First Login
You may see:

...
Creating directory '/u/is/az/joeminer'.
Creating directory '/u/is/az/joeminer/.mozilla'.
Creating directory '/u/is/az/joeminer/.mozilla/extensions'.
Creating directory '/u/is/az/joeminer/.mozilla/plugins'.
Creating directory '/u/is/az/joeminer/.kde'.
Creating directory '/u/is/az/joeminer/.kde/Autostart'.

It doesn't appear that you have set up your ssh key.
This process will make the files:
    /u/is/az/joeminer/.ssh/id_rsa.pub
    /u/is/az/joeminer/.ssh/id_rsa
    /u/is/az/joeminer/.ssh/authorized_keys
Generating public/private rsa key pair.
Enter file in which to save the key (/u/is/az/joeminer/.ssh/id_rsa):
Created directory '/u/is/az/joeminer/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /u/is/az/joeminer/.ssh/id_rsa.
Your public key has been saved in /u/is/az/joeminer/.ssh/id_rsa.pub.
The key fingerprint is:
55:bf:89:4d:03:4d:da:21:84:a9:e8:b2:b9:3a:bd:fe joeminer@mio.mines.edu
[joeminer@mio ~]$

Hit return until you see the final prompt.

Mio Environment
• When you ssh to Mio you are on the head node
  • It is a shared resource
  • DO NOT run parallel, compute intensive, or memory intensive jobs on the head node
• You have access to basic Unix commands
  • Editors, file commands, gcc...
• You need to set up your environment to gain access to the HPC compilers

Mio HPC Environment
• You want access to parallel and high performance compilers
  • Intel, AMD, mpicc, mpif90
• Commands to run and monitor parallel programs
• See http://inside.mines.edu/mio/page3.html
  • Gives details
• A shortcut follows

Shortcut Setup
http://inside.mines.edu/mio/page3.html
Add the following to your .bashrc file and log out/in:

if [ -f /usr/local/bin/setup/setup ]; then source /usr/local/bin/setup/setup intel ; fi

This gives you:
• Compilers
  • Intel 11.x
  • AMD
  • Portland Group
  • Python 2.6.5 and 3.1.2
  • NAG
• The Openmpi parallel MPI environment
  • MPI compilers
  • MPI run commands

Should give you:

[tkaiser@mio ~]$ which mpicc
/opt/lib/openmpi/1.4.2/intel/11.1/bin/mpicc
[tkaiser@mio ~]$
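Before downloading the tutorial examples, you may want a quick check that the MPI environment set up above is working. The minimal program below is a hypothetical test, not part of the cwp.tar examples; compile it with mpicc. Remember that parallel runs belong on the compute nodes through the batch system, not on the head node.

#include <stdio.h>
#include "mpi.h"

/* Hypothetical hello-world test of the MPI environment (not in cwp.tar). */
int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("hello from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}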
Getting the Examples

[tkaiser@mio tests]$ cd $DATA
[tkaiser@mio tests]$ mkdir cwp
[tkaiser@mio tests]$ cd cwp
[tkaiser@mio cwp]$ wget http://inside.mines.edu/mio/tutorial/cwp.tar
...
2011-01-11 13:31:06 (48.6 MB/s) - `cwp.tar' saved [20480/20480]

[tkaiser@mio cwp]$ tar -xf *tar

A Digression - Makefiles
• Makefiles are instructions for building a program
• Normal names are makefile or Makefile
• In a directory that contains a makefile, type the command: make
• They can be very complicated
• They can have "sub commands"
• See: http://www.eng.hawaii.edu/Tutor/Make/

Our Makefile

all: c_ex01 f_ex01 pointer invertc

c_ex01: c_ex01.c
	mpicc -o c_ex01 c_ex01.c

f_ex01: f_ex01.f90
	mpif90 -o f_ex01 f_ex01.f90

pointer: pointer.f90
	ifort -openmp pointer.f90 -mkl -o pointer

invertc: invertc.c
	icc -openmp invertc.c -o invertc

clean:
	rm -rf c_ex01 f_ex01 *mod err* out* mynodes* pointer invertc

tar:
	tar -cf introduction.tar makefile batch1 c_ex01.c f_ex01.f90 pointer.f90 invertc.c batch2

The End