Basic Concepts in Parallelization
Ruud van der Pas, Senior Staff Engineer, Oracle Solaris Studio, Oracle, Menlo Park, CA, USA
IWOMP 2010, CCS, University of Tsukuba, Tsukuba, Japan, June 14-16, 2010

Outline
❑ Introduction
❑ Parallel Architectures
❑ Parallel Programming Models
❑ Data Races
❑ Summary

Introduction

Why Parallelization?
❑ Parallelization is another optimization technique
❑ The goal is to reduce the execution time
❑ To this end, multiple processors, or cores, are used
[Figure: the work done by 1 core is, after parallelization, spread over 4 cores]
❑ Using 4 cores, the execution time is 1/4 of the single-core time

What Is Parallelization?
❑ "Something" is parallel if there is a certain level of independence in the order of operations
❑ In other words, it doesn't matter in what order those operations are performed
[Figure: levels of granularity — a sequence of machine instructions, a collection of program statements, an algorithm, the problem you're trying to solve]

What is a Thread?
❑ Loosely said, a thread consists of a series of instructions with its own program counter ("PC") and state
❑ A parallel program executes threads in parallel
❑ These threads are then scheduled onto processors
[Figure: Threads 0-3, each with its own PC, scheduled onto processors]

Parallel overhead
❑ The total CPU time often exceeds the serial CPU time:
● The newly introduced parallel portions in your program need to be executed
● Threads need time sending data to each other and synchronizing ("communication")
✔ Often the key contributor, spoiling all the fun
❑ Typically, things also get worse when increasing the number of threads
❑ Efficient parallelization is about minimizing the communication overhead

Communication
[Figure: wallclock time of serial execution compared with parallel execution with and without communication]
❑ Parallel - without communication
● Embarrassingly parallel: 4x faster
● Wallclock time is 1/4 of the serial wallclock time
❑ Parallel - with communication
● Additional communication: less than 4x faster
● Communication consumes additional resources
● Wallclock time is more than 1/4 of the serial wallclock time
● Total CPU time increases

Load balancing
[Figure: perfect load balancing versus load imbalance, where threads 1-3 sit idle while another thread is still working]
❑ Perfect load balancing
● All threads finish in the same amount of time
● No thread is idle
❑ Load imbalance
● Different threads need a different amount of time to finish their task
● Total wallclock time increases
● The program does not scale well
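A common way to see the load-balancing issue in practice is a loop whose iterations do unequal amounts of work. The sketch below is not part of the original slides: it uses the OpenMP notation that is only introduced later in this tutorial, and the array size and workload are arbitrary. With a static, equal-sized distribution the threads handling the cheap early rows sit idle while the last thread is still busy; a dynamic schedule hands out iterations on demand and evens out the load.

   #include <stdio.h>
   #include <omp.h>

   #define N 2000

   /* Illustration only: iteration i touches i+1 elements, so the work grows
    * with i.  Compare schedule(static) and schedule(dynamic) to see the
    * effect of load (im)balance on the elapsed time.                       */
   int main(void)
   {
       static double a[N][N];
       double t = omp_get_wtime();

       #pragma omp parallel for schedule(dynamic)
       for (int i = 0; i < N; i++)
           for (int j = 0; j <= i; j++)
               a[i][j] = 0.5 * (i + j);

       printf("elapsed time: %.3f seconds\n", omp_get_wtime() - t);
       return 0;
   }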
About scalability
❑ Define the speed-up S(P) as S(P) := T(1)/T(P)
❑ The efficiency E(P) is defined as E(P) := S(P)/P
❑ In the ideal case, S(P) = P and E(P) = P/P = 1 = 100%
❑ Unless the application is embarrassingly parallel, S(P) eventually starts to deviate from the ideal curve
❑ Past this point Popt, the application sees less and less benefit from adding processors
❑ Note that both metrics give no information on the actual run-time
● As such, they can be dangerous to use
❑ In some cases, S(P) exceeds P
● This is called "superlinear" behaviour
● Don't count on this to happen though
[Figure: ideal speed-up S(P) = P versus a typical curve that flattens beyond Popt]

Amdahl's Law
❑ Assume our program has a parallel fraction "f"
❑ This implies the execution time T(1) = f*T(1) + (1-f)*T(1)
❑ On P processors: T(P) = (f/P)*T(1) + (1-f)*T(1)
❑ Amdahl's law: S(P) = T(1)/T(P) = 1 / (f/P + 1 - f)
❑ Comments:
▶ This "law" describes the effect the non-parallelizable part of a program has on scalability
▶ Note that the additional overhead caused by parallelization, and speed-up because of cache effects, are not taken into account

Amdahl's Law
[Plot: speed-up on 1-16 processors for f = 1.00, 0.99, 0.75, 0.50, 0.25 and 0.00]
❑ It is easy to scale on a small number of processors
❑ Scalable performance, however, requires a high degree of parallelization, i.e. f is very close to 1
❑ This implies that you need to parallelize that part of the code where the majority of the time is spent

Amdahl's Law in practice
❑ We can estimate the parallel fraction "f"
❑ Recall: T(P) = (f/P)*T(1) + (1-f)*T(1)
❑ It is trivial to solve this equation for "f":
   f = (1 - T(P)/T(1)) / (1 - 1/P)
❑ Example: T(1) = 100 and T(4) = 37 => S(4) = T(1)/T(4) = 2.70
   f = (1 - 37/100)/(1 - 1/4) = 0.63/0.75 = 0.84 = 84%
❑ Estimated performance on 8 processors is then:
   T(8) = (0.84/8)*100 + (1 - 0.84)*100 = 26.5
   S(8) = T(1)/T(8) = 3.78

Threads Are Getting Cheap .....
[Chart: elapsed time and speed-up for f = 84% on 1-16 cores]
❑ Using 8 cores: 100/26.5 = 3.8x faster

Numerical results
❑ Consider: A = B + C + D + E
❑ Serial processing:
   A = B + C
   A = A + D
   A = A + E
❑ Parallel processing:
   Thread 0: T1 = B + C        Thread 1: T2 = D + E
   Thread 0: T1 = T1 + T2      (this is the final A)
☞ The roundoff behaviour is different and so the numerical results may be different too
☞ This is natural for parallel programs, but it may be hard to differentiate it from an ordinary bug ....

Cache Coherence

Typical cache based system
[Figure: CPU — L1 cache — L2 cache — Memory]

Cache line modifications
[Figure: a cache line from a page in main memory is copied into two caches; modifying one copy makes the caches inconsistent]
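The inconsistency sketched above can be provoked with only a few lines of code. The example below is not from the original slides (and uses OpenMP, which is only introduced later in this tutorial): two threads repeatedly update neighbouring array elements that normally sit in the same cache line, so every store forces the coherence mechanism described on the next slide to invalidate the copy held by the other CPU's cache, even though the threads never touch the same variable.

   #include <stdio.h>
   #include <omp.h>

   /* Illustration only: counter[0] and counter[1] share one cache line, so
    * the two threads keep invalidating each other's copy of that line.     */
   long counter[2];

   int main(void)
   {
       #pragma omp parallel num_threads(2)
       {
           int id = omp_get_thread_num();
           for (long i = 0; i < 100000000; i++)
               counter[id]++;
       }
       printf("counters: %ld %ld\n", counter[0], counter[1]);
       return 0;
   }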
Caches in a parallel computer
❑ A cache line starts in memory
❑ Over time, multiple copies of this line may exist
[Figure: CPUs 0-3 with their caches and the memory; stale copies of the line are invalidated]
❑ Cache Coherence ("cc"):
✔ Tracks changes in the copies
✔ Makes sure the correct cache line is used
✔ Different implementations possible
✔ Needs hardware support to make it efficient

Parallel Architectures

Uniform Memory Access (UMA)
[Figure: several CPUs, each with a cache, connected to shared memory and I/O through one interconnect]
❑ Also called "SMP" (Symmetric Multi Processor)
❑ Memory access time is uniform for all CPUs
❑ The CPU can be multicore
❑ The interconnect is "cc":
● Bus
● Crossbar
❑ No fragmentation: memory and I/O are shared resources
❑ Pro
✔ Easy to use and to administer
✔ Efficient use of resources
❑ Con
✔ Said to be expensive
✔ Said to be non-scalable

NUMA
[Figure: several nodes, each with CPU, cache, memory and I/O, connected by an interconnect]
❑ Also called "Distributed Memory" or NORMA (No Remote Memory Access)
❑ Memory access time is non-uniform
❑ Hence the name "NUMA"
❑ The interconnect is not "cc":
● Ethernet, Infiniband, etc, ......
❑ Runs 'N' copies of the OS
❑ Memory and I/O are distributed resources
❑ Pro
✔ Said to be cheap
✔ Said to be scalable
❑ Con
✔ Difficult to use and administer
✔ Inefficient use of resources

The Hybrid Architecture
[Figure: SMP/multicore nodes coupled by a second-level interconnect]
❑ The second-level interconnect is not cache coherent
● Ethernet, Infiniband, etc, ....
❑ Hybrid architecture with all the pros and cons:
● UMA within one SMP/multicore node
● NUMA across nodes

cc-NUMA
[Figure: SMP nodes coupled by a cache coherent second-level interconnect]
❑ Two-level interconnect:
● UMA/SMP within one system
● NUMA between the systems
❑ Both interconnects support cache coherence, i.e. the system is fully cache coherent
❑ Has all the advantages ('look and feel') of an SMP
❑ Downside is the non-uniform memory access time

Parallel Programming Models

How To Program A Parallel Computer?
❑ There are numerous parallel programming models
❑ The ones most well-known are:
● Distributed Memory (usable on a cluster and/or a single SMP)
✔ Sockets (standardized, low level)
✔ PVM - Parallel Virtual Machine (obsolete)
✔ MPI - Message Passing Interface (de-facto standard)
● Shared Memory (usable on an SMP only)
✔ POSIX Threads (standardized, low level)
✔ OpenMP (de-facto standard)
✔ Automatic parallelization (the compiler does it for you)
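Of the models listed above, POSIX Threads is the only one not shown elsewhere in this tutorial. As a minimal sketch (not part of the original slides), this is roughly what the low-level interface looks like: each thread is created and joined explicitly and is handed its own rank, much like the MPI and OpenMP examples that follow.

   #include <pthread.h>
   #include <stdio.h>

   #define NTHREADS 4

   /* Each thread runs this function; the argument carries its rank. */
   void *work(void *arg)
   {
       int rank = *(int *)arg;
       printf("hello from thread %d\n", rank);
       return NULL;
   }

   int main(void)
   {
       pthread_t tid[NTHREADS];
       int       rank[NTHREADS];

       for (int i = 0; i < NTHREADS; i++) {
           rank[i] = i;
           pthread_create(&tid[i], NULL, work, &rank[i]);   /* start the threads  */
       }
       for (int i = 0; i < NTHREADS; i++)
           pthread_join(tid[i], NULL);                      /* wait for completion */
       return 0;
   }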
Parallel Programming Models: Distributed Memory - MPI

What is MPI?
❑ MPI stands for the "Message Passing Interface"
❑ MPI is a very extensive de-facto parallel programming API for distributed memory systems (i.e. a cluster)
● An MPI program can, however, also be executed on a shared memory system
❑ First specification: 1994
❑ The current MPI-2 specification was released in 1997
● Major enhancements:
✔ Remote memory management
✔ Parallel I/O
✔ Dynamic process management

More about MPI
❑ MPI has its own data types (e.g. MPI_INT)
● User-defined data types are supported as well
❑ MPI supports C, C++ and Fortran
● Include file <mpi.h> in C/C++ and "mpif.h" in Fortran
❑ An MPI environment typically consists of at least:
● A library implementing the API
● A compiler and linker that support the library
● A run-time environment to launch an MPI program
❑ Various implementations are available
● HPC ClusterTools, MPICH, MVAPICH, LAM, Voltaire MPI, Scali MPI, HP-MPI, .....

The MPI Programming Model
[Figure: a cluster of systems, each with its own memory, running MPI processes 0-5]

The MPI Execution Model
[Figure: all processes start; MPI_Init; communication between the processes; MPI_Finalize; all processes finish]

The MPI Memory Model
[Figure: processes, each with private memory, connected by a transfer mechanism]
✔ All threads/processes have access to their own, private, memory only
✔ Data transfer and most synchronization have to be programmed explicitly
✔ All data is private
✔ Data is shared explicitly by exchanging buffers

Example - Send "N" Integers
   #include <mpi.h>                                /* include file               */

   you = 0; him = 1;
   MPI_Init(&argc, &argv);                         /* initialize MPI environment */
   MPI_Comm_rank(MPI_COMM_WORLD, &me);             /* get process ID             */

   if ( me == 0 ) {                                /* process 0 sends            */
      error_code = MPI_Send(&data_buffer, N, MPI_INT, him,
                            1957, MPI_COMM_WORLD);
   } else if ( me == 1 ) {                         /* process 1 receives         */
      error_code = MPI_Recv(&data_buffer, N, MPI_INT, you,
                            1957, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
   }
   MPI_Finalize();                                 /* leave the MPI environment  */

Run-time behavior
❑ Process 0 (me = 0): MPI_Send, N integers, destination = him = 1, label = 1957
❑ Process 1 (me = 1): MPI_Recv, N integers, sender = you = 0, label = 1957
❑ Yes! Connection established

The Pros and Cons of MPI
❑ Advantages of MPI:
● Flexibility - can use any cluster of any size
● Straightforward - just plug in the MPI calls
● Widely available - several implementations out there
● Widely used - very popular programming model
❑ Disadvantages of MPI:
● Redesign of the application - could be a lot of work
● Easy to make mistakes - many details to handle
● Hard to debug - need to dig into the underlying system
● More resources - typically, more memory is needed
● Special care - input/output
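The send/receive fragment shown a few slides back omits the declarations and the surrounding main program. The version below fills those in so it can actually be compiled and run with two processes; the buffer size and its contents are arbitrary choices, not part of the original slide. With a typical MPI installation it would be built with mpicc and launched with something like mpirun -np 2 ./a.out (the exact commands depend on the implementation).

   #include <stdio.h>
   #include <mpi.h>

   #define N 8                                    /* arbitrary buffer size          */

   int main(int argc, char *argv[])
   {
       int data_buffer[N], me, i;
       int you = 0, him = 1;                      /* the two ranks, as on the slide */

       MPI_Init(&argc, &argv);                    /* initialize MPI environment     */
       MPI_Comm_rank(MPI_COMM_WORLD, &me);        /* get process ID                 */

       if (me == 0) {                             /* process 0 sends                */
           for (i = 0; i < N; i++) data_buffer[i] = i;
           MPI_Send(data_buffer, N, MPI_INT, him, 1957, MPI_COMM_WORLD);
       } else if (me == 1) {                      /* process 1 receives             */
           MPI_Recv(data_buffer, N, MPI_INT, you, 1957, MPI_COMM_WORLD,
                    MPI_STATUS_IGNORE);
           printf("process 1 received %d integers, last one is %d\n",
                  N, data_buffer[N-1]);
       }

       MPI_Finalize();                            /* leave the MPI environment      */
       return 0;
   }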
Parallel Programming Models: Shared Memory - Automatic Parallelization

Automatic Parallelization (-xautopar)
❑ The compiler performs the parallelization (loop based)
❑ Different iterations of the loop are executed in parallel
❑ The same binary is used for any number of threads

   for (i=0; i<1000; i++)
      a[i] = b[i] + c[i];

❑ With OMP_NUM_THREADS=4:
   Thread 0: iterations 0-249, Thread 1: 250-499, Thread 2: 500-749, Thread 3: 750-999

Parallel Programming Models: Shared Memory - OpenMP

Shared Memory
[Figure: processors 0 through P sharing one memory]
http://www.openmp.org

Shared Memory Model
[Figure: threads, each with private data, attached to one shared memory]
✔ All threads have access to the same, globally shared, memory
✔ Data can be shared or private
✔ Shared data is accessible by all threads
✔ Private data can only be accessed by the thread that owns it
✔ Data transfer is transparent to the programmer
✔ Synchronization takes place, but it is mostly implicit

About data
In a shared memory parallel program, variables have a "label" attached to them:
☞ Labelled "Private" ⇨ visible to one thread only
✔ A change made to local data is not seen by the others
✔ Example - local variables in a function that is executed in parallel
☞ Labelled "Shared" ⇨ visible to all threads
✔ A change made to global data is seen by all others
✔ Example - global data

Example - Matrix times vector
   #pragma omp parallel for default(none) \
           private(i,j,sum) shared(m,n,a,b,c)
   for (i=0; i<m; i++)
   {
      sum = 0.0;
      for (j=0; j<n; j++)
         sum += b[i][j]*c[j];
      a[i] = sum;
   }

[Figure: with two threads, TID 0 handles rows i = 0,1,2,3,4 and TID 1 handles rows i = 5,6,7,8,9; each computes its own sum and stores a[i] = sum]

A Black and White comparison
   MPI                                     | OpenMP
   De-facto standard                       | De-facto standard
   Endorsed by all key players             | Endorsed by all key players
   Runs on any number of (cheap) systems   | Limited to one (SMP) system
   "Grid Ready"                            | Not (yet?) "Grid Ready"
   High and steep learning curve           | Easier to get started (but, ...)
   You're on your own                      | Assistance from the compiler
   All or nothing model                    | Mix and match model
   No data scoping (shared, private, ..)   | Requires data scoping
   More widely used (but ....)             | Increasingly popular (CMT !)
   Sequential version is not preserved     | Preserves the sequential code
   Requires a library only                 | Needs a compiler
   Requires a run-time environment         | No special environment
   Easier to understand performance        | Performance issues are implicit

The Hybrid Parallel Programming Model

The Hybrid Programming Model
[Figure: two shared memory systems, each with processors 0 through P and their own shared memory, coupled by distributed memory message passing]
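A minimal sketch of what the hybrid model can look like in code (not from the original slides; the array size and the work itself are made up): MPI distributes blocks of an array over the processes, and within each process OpenMP threads share that block. Only the master thread of each process makes MPI calls here, so a plain MPI_Init is sufficient.

   #include <stdio.h>
   #include <mpi.h>
   #include <omp.h>

   #define N 1000000                        /* arbitrary problem size */

   static double a[N];

   int main(int argc, char *argv[])
   {
       double local = 0.0, total = 0.0;
       int rank, nprocs, i;

       MPI_Init(&argc, &argv);
       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
       MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

       /* MPI level: each process owns one block of the array */
       int chunk = N / nprocs;
       int lo    = rank * chunk;
       int hi    = (rank == nprocs - 1) ? N : lo + chunk;

       /* OpenMP level: the threads within the node share that block */
       #pragma omp parallel for reduction(+:local)
       for (i = lo; i < hi; i++) {
           a[i]   = 1.0;
           local += a[i];
       }

       /* Combine the per-process partial sums on process 0 */
       MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
       if (rank == 0) printf("total = %.1f (expected %d)\n", total, N);

       MPI_Finalize();
       return 0;
   }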
Data Races

About Parallelism
❑ Parallelism means independence: there is no fixed ordering of the operations
❑ "Something" that does not obey this rule is not parallel (at that level ...)

Shared Memory Programming
[Figure: threads, each with a private copy of X; variable Y lives in the shared memory]
❑ Threads communicate via shared memory

What is a Data Race?
❑ Two different threads in a multi-threaded shared memory program
❑ Access the same (= shared) memory location
● Asynchronously, and
● Without holding any common exclusive locks, and
● At least one of the accesses is a write/store

Example of a data race
   #pragma omp parallel shared(n)
      {n = omp_get_thread_num();}
[Figure: both threads write the shared variable n at the same time]

Another example
   #pragma omp parallel shared(x)
      {x = x + 1;}
[Figure: both threads read and write the shared variable x at the same time]

About Data Races
❑ Loosely described, a data race means that the update of a shared variable is not well protected
❑ A data race tends to show up in a nasty way:
● Numerical results are (somewhat) different from run to run
● Especially with floating-point data, this is difficult to distinguish from a numerical side-effect
● Changing the number of threads can cause the problem to seemingly (dis)appear
✔ It may also depend on the load on the system
● It may only show up when using many threads

A parallel loop
   for (i=0; i<8; i++)
      a[i] = a[i] + b[i];
❑ Every iteration in this loop is independent of the other iterations
❑ Thread 1: a[0]=a[0]+b[0], a[1]=a[1]+b[1], a[2]=a[2]+b[2], a[3]=a[3]+b[3]
❑ Thread 2: a[4]=a[4]+b[4], a[5]=a[5]+b[5], a[6]=a[6]+b[6], a[7]=a[7]+b[7]

Not a parallel loop
   for (i=0; i<8; i++)
      a[i] = a[i+1] + b[i];
❑ The result is not deterministic when run in parallel!
❑ Thread 1: a[0]=a[1]+b[0], a[1]=a[2]+b[1], a[2]=a[3]+b[2], a[3]=a[4]+b[3]
❑ Thread 2: a[4]=a[5]+b[4], a[5]=a[6]+b[5], a[6]=a[7]+b[6], a[7]=a[8]+b[7]

About the experiment
❑ We manually parallelized the previous loop
● The compiler detects the data dependence and does not parallelize the loop
❑ Vectors a and b are of type integer
❑ We use the checksum of a as a measure for correctness:
● checksum += a[i] for i = 0, 1, 2, ...., n-2
❑ The correct, sequential, checksum result is computed as a reference
❑ We ran the program using 1, 2, 4, 32 and 48 threads
● Each of these experiments was repeated 4 times
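The source code of this test program is not included in the slides. A speculative reconstruction along the lines described above might look like the sketch below: the loop with the dependence is forced to run in parallel anyway, and the checksum over a[0..n-2] is compared against the value obtained from a sequential run. The array length and initial values here are made up, so the checksums will differ from the 1953 shown on the next slide.

   #include <stdio.h>
   #include <omp.h>

   #define N 64                              /* made-up problem size            */

   int main(void)
   {
       int a[N + 1], b[N], i, checksum = 0;

       for (i = 0; i <= N; i++) a[i] = i;    /* made-up initial values          */
       for (i = 0; i <  N; i++) b[i] = 1;

       /* Manually parallelized, deliberately ignoring that iteration i reads
        * a[i+1], which another thread may already have overwritten            */
       #pragma omp parallel for
       for (i = 0; i < N; i++)
           a[i] = a[i + 1] + b[i];

       for (i = 0; i <= N - 2; i++)          /* checksum += a[i] for i = 0..n-2 */
           checksum += a[i];

       printf("threads: %d checksum %d\n", omp_get_max_threads(), checksum);
       return 0;
   }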
Numerical results
   threads:  1   checksum 1953, 1953, 1953, 1953   correct value 1953
   threads:  2   checksum 1953, 1953, 1953, 1953   correct value 1953
   threads:  4   checksum 1905, 1905, 1953, 1937   correct value 1953
   threads: 32   checksum 1525, 1473, 1489, 1513   correct value 1953
   threads: 48   checksum  936, 1007,  887,  822   correct value 1953
Data Race In Action!

Summary

Parallelism Is Everywhere
Multiple levels of parallelism:
   Granularity         | Technology   | Programming Model
   Instruction Level   | Superscalar  | Compiler
   Chip Level          | Multicore    | Compiler, OpenMP, MPI
   System Level        | SMP/cc-NUMA  | Compiler, OpenMP, MPI
   Grid Level          | Cluster      | MPI

Threads Are Getting Cheap