openMP

- Open Multi-Processing
- Concept of shared memory
- Multithreading on one machine (e.g. one node/CPU)
- Not capable of transferring data between nodes
- No need to 'hardcopy' - we can use pointers
- Should have way less overhead than MPI

openMP 1

- Systematically very similar to MPI
- We can program essentially the same way
- Easiest case: only use preprocessor directives: #pragma omp ...
- Extending existing code for openMP is very easy
- Replace expensive loops by omp-loops
- Eigen: simply set the compiler flag (matrix-matrix products, M x M)

Example

    int n = 100000;
    vector<float> x(n);
    vector<float> y(n);
    vector<float> r(n);

Single thread:

    for (int i = 0; i < n; i++) r[i] = x[i] * y[i];

Multi thread:

    #pragma omp parallel for
    for (int i = 0; i < n; i++) r[i] = x[i] * y[i];

Hybrid MPI and OpenMP

This is what one node at an HPC center looks like: 2 CPUs with 8 cores each; shared global memory; cache for each CPU.

- On HPC clusters we have many nodes with > 12 cores each
- We could just use MPI to make use of all nodes
- Combining MPI with shared-memory openMP should give best performance
- Start as many MPI jobs as there are CPUs available
- On each MPI job (CPU) start as many openMP threads as the CPU has cores
- Most efficient combination (Lars)

Hybrid MPI and OpenMP

X = 600000 x 10000; compute X'X; 2 nodes x 2 CPUs x 8 cores = 32 cores

    MPI   MP   sec.
      1    1   5451
      1    2   2773
      1    4   1410
      1    8    728
      1   16    389
      2    2   1389
      2    4    705
      2    8    370
      2   16    198
      4    8    196
      8   14    455
     32    1    368

How to compile MPI + MP:

    mpicxx -O3 -fopenmp cross.cpp -o cross.o

How to run an MPI + MP process:

    export OMP_NUM_THREADS=8
    mpirun -ppn 2 -n 4 ./cross.o

Data and memory

- We parallelize not only computations but also storage
- An MPI/MP program can split the whole workflow into pieces
- Find an appropriate way of storing, reading and writing data
- e.g. HDF5, NetCDF; capable of multithreaded access
- 40x faster than ASCII (single thread)
- Large memory demands can be overcome by working in small chunks
- Lars: Big-memory nodes are a dead end; make use of MPI

Everyday Multi-Threading

- Not every problem or approach is worth programming for
- We often try something out or play around with data
- R is the software of choice - single thread!
- Use packages: e.g. foreach
- Extremely inefficient (memory)
- Solution: Extend R with C/C++/Fortran functions that are openMP'd
- Make a toolbox of frequently used functions
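
A minimal sketch of what one such toolbox function could look like, assuming it is called through R's .C() interface (all arguments are pointers, the return type is void) and that the matrix arrives in R's column-major layout. The name omp_crossprod and its argument list are illustrative and do not come from the slides; the function computes X'X, the same operation used in the benchmark.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toolbox function (name and signature are assumptions):
// out = X'X for an n x p matrix X stored column-major, as R stores it.
// Pointer-only arguments and a void return fit R's .C() convention.
extern "C" void omp_crossprod(const double* X, const int* n, const int* p,
                              double* out) {
    const int rows = *n;
    const int cols = *p;
    assert(rows >= 0 && cols >= 0);

    // Column pairs (j, k) are independent of each other, so the outer loop
    // can be shared among threads. Every output entry is written by exactly
    // one iteration of j (the smaller of its two indices), so there is no race.
    // Dynamic scheduling balances the shrinking inner loop (k starts at j).
    #pragma omp parallel for schedule(dynamic)
    for (int j = 0; j < cols; j++) {
        const double* xj = X + static_cast<std::size_t>(j) * rows;
        for (int k = j; k < cols; k++) {
            const double* xk = X + static_cast<std::size_t>(k) * rows;
            double sum = 0.0;
            for (int i = 0; i < rows; i++) sum += xj[i] * xk[i];
            // X'X is symmetric: fill both triangles of the p x p result
            out[static_cast<std::size_t>(j) * cols + k] = sum;
            out[static_cast<std::size_t>(k) * cols + j] = sum;
        }
    }
}
```

Without -fopenmp the pragma is ignored and the function runs single-threaded with the same result. On the R side, one plausible route is to build a shared library with R CMD SHLIB (with the OpenMP flags set in Makevars), load it with dyn.load(), and call it as .C("omp_crossprod", as.double(X), as.integer(nrow(X)), as.integer(ncol(X)), out = double(ncol(X)^2)).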
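
The remark that large memory demands can be overcome by small chunks applies directly to the X'X benchmark. As a sketch (the slides do not say how cross.cpp actually partitions the data), split X by rows into m blocks; the cross product is then the sum of the per-block cross products:

```latex
X = \begin{pmatrix} X_1 \\ X_2 \\ \vdots \\ X_m \end{pmatrix}
\quad\Longrightarrow\quad
X'X = \sum_{c=1}^{m} X_c' X_c
```

Each term is only p x p (10000 x 10000 in the benchmark) no matter how many rows X has, so every MPI job needs to hold just its own block X_c plus one p x p accumulator, and the partial results are summed at the end.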