Introduction to Parallel Processing
Dr. Guy Tel-Zur
Lecture 10

Agenda
• Administration
• Final presentations
• Demos
• Theory
• Next week plan
• Home assignment #4 (last)

Final Projects
• Next Sunday: Groups 1-16 will present
• Next Monday: Groups 17+ will present
• 10-minute presentation per group
• All group members should present
• Send your presentation to gtelzur@gmail.com by midnight of the previous day
Attendance is mandatory.

Final Presentations
• The division into groups is fixed.
• A group that does not present will lose 5 points from its grade.
• Rehearse and make sure you keep to the time limit.
• The presentation should include: the project name, its goal, the challenge the problem poses for parallel computation, and approaches to the solution.
• Presentations will not be accepted during class! Be sure to send them to the lecturer in advance.

The Course Roadmap
[Diagram labels: Introduction, HPC, HTC, GPU Computing (New!), Condor, Message Passing, MPI, Shared Memory, Cilk++, OpenMP, Grid Computing, Cloud Computing; two boxes are marked "Today".]

Advanced Parallel Computing and Distributed Computing course
• A new course at the department: Distributed Computing: Advanced Parallel Processing course + Grid Computing + Cloud Computing. Course Number: 361-1-4691
• If you are interested in this course please send me an email

Today
• Algorithms
  – Numerical Algorithms ("slides11.ppt")
• Introduction to Grid Computing
• Some demos
• Home assignment #4

Futuristic Asymmetric Multi-Core Chip
[Figure labels: SACC, Sequential Accelerator]

Theory
• Numerical Algorithms
  – Slides from: University of North Carolina at Charlotte, Department of Computer Science, ITCS 4145/5145 Parallel Programming, Spring 2009, Dr. Barry Wilkinson
  – Matrix multiplication, solving a system of linear equations, iterative methods
  – URL is here (link in the original slides)

Demos
• Hybrid Parallel Programming
  – MPI + OpenMP
• Cloud Computing
  – Setting up an HPC cluster
  – Setting up a Condor machine (a separate presentation)
• StarHPC
• Cilk++
• GPU Computing (a separate presentation)
• Eclipse PTP
• Kepler workflow

Hybrid MPI + OpenMP Demo
• Machine file: hobbit1, hobbit2, hobbit3, hobbit4
• Each hobbit has 8 cores
• [Diagram labels: MPI, OpenMP]
• Compile: mpicc -o mpi_out mpi_test.c -fopenmp
• An idea for a final project!
• cd ~/mpi; program name: hybridpi.c
• MPI is not installed yet on the hobbits; in the meantime use: vdwarf5, vdwarf6, vdwarf7, vdwarf8
• Monitor the threads: top -u tel-zur -H -d 0.05
  – H: show threads, d: delay between refreshes, u: user

Hybrid MPI+OpenMP continued

Hybrid Pi (MPI+OpenMP)

    #include <stdio.h>
    #include <mpi.h>
    #include <omp.h>

    #define NBIN 100000
    #define MAX_THREADS 8

    int main(int argc, char **argv)
    {
        int nbin, myid, nproc, nthreads = 1, tid;
        double step, sum[MAX_THREADS] = {0.0}, pi = 0.0, pig;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);

        nbin = NBIN / nproc;           /* bins handled by each MPI rank */
        step = 1.0 / (nbin * nproc);   /* width of one bin */

        /* num_threads caps the team so that sum[tid] stays in bounds */
        #pragma omp parallel private(tid) num_threads(MAX_THREADS)
        {
            int i;
            double x;
            int nt = omp_get_num_threads();

            tid = omp_get_thread_num();
            if (tid == 0)
                nthreads = nt;         /* a single writer avoids a race */

            /* the threads interleave over this rank's bins */
            for (i = nbin*myid + tid; i < nbin*(myid+1); i += nt) {
                x = (i + 0.5) * step;
                sum[tid] += 4.0 / (1.0 + x*x);
            }
            printf("rank tid sum = %d %d %e\n", myid, tid, sum[tid]);
        }

        /* combine the per-thread sums, then the per-rank results */
        for (tid = 0; tid < nthreads; tid++)
            pi += sum[tid] * step;

        MPI_Allreduce(&pi, &pig, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        if (myid == 0)
            printf("PI = %f\n", pig);

        MPI_Finalize();
        return 0;
    }

Cilk++
Simple, powerful expression of task parallelism:
• cilk_for – Parallelize for loops
• cilk_spawn – Specify the start of parallel execution
• cilk_sync – Specify the end of parallel execution
http://software.intel.com/en-us/articles/intel-cilk-plus/ (17/8/2011)

Fibonacci
Try: http://www.wolframalpha.com/input/?i=fibonacci+number

Fibonacci Numbers: serial version

    // 1, 1, 2, 3, 5, 8, 13, 21, 34, ...
    // Serial version
    // Credit: http://myxman.org/dp/node/182
    long fib_serial(long n)
    {
        if (n < 2) return n;
        return fib_serial(n-1) + fib_serial(n-2);
    }

Cilk++ Fibonacci

    #include <cilk.h>
    #include <stdio.h>

    long fib_parallel(long n)
    {
        long x, y;
        if (n < 2) return n;
        x = cilk_spawn fib_parallel(n-1);   // child may run in parallel
        y = fib_parallel(n-2);              // parent continues meanwhile
        cilk_sync;                          // wait for the spawned child
        return (x+y);
    }

    int cilk_main()
    {
        int N = 50;
        long result;
        result = fib_parallel(N);
        printf("fib of %d is %ld\n", N, result);   // %ld: result is a long
        return 0;
    }

cilk_spawn
Add parallelism using cilk_spawn

We are now ready to introduce parallelism into our qsort program. The cilk_spawn keyword indicates that a function (the child) may be executed in parallel with the code that follows the cilk_spawn statement (the parent). Note that the keyword allows but does not require parallel operation. The Cilk++ scheduler dynamically determines what actually gets executed in parallel when multiple processors are available.

The cilk_sync statement indicates that the function may not continue until all cilk_spawn requests in the same function have completed. cilk_sync does not affect parallel strands spawned in other functions.

Cilkview Fn(30)
[Figure not reproduced]

Strands and Knots
A Cilk++ program fragment:

    ...
    do_stuff_1();           // execute strand 1
    cilk_spawn func_3();    // spawn strand 3 at knot A
    do_stuff_2();           // execute strand 2
    cilk_sync;              // sync at knot B
    do_stuff_4();           // execute strand 4
    ...

[Figure: DAG with two spawns (labeled A and B) and one sync (labeled C)]

A more complex Cilk++ program (DAG):
[Figure not reproduced]

Let's add labels to the strands to indicate the number of milliseconds it takes to execute each strand.
[Figure not reproduced]

Under ideal circumstances (e.g., with no scheduling overhead) and with an unlimited number of processors available, this program should run for 68 milliseconds.

Work and Span

Work
The total amount of processor time required to complete the program is the sum of all the numbers. We call this the work.
In this DAG, the work is 181 milliseconds for the 25 strands shown, so on a single processor the program should run for 181 milliseconds.

Span
Another useful concept is the span, sometimes called the critical path length. The span is the most expensive path that goes from the beginning to the end of the program. In this DAG, the span is 68 milliseconds, as shown below:
[Figure not reproduced]

cilk_for
• Divide-and-conquer strategy
• Shown here: 8 threads and 8 iterations
[Figure not reproduced]

Here is the DAG for a serial loop that spawns each iteration. In this case, the work is not well balanced, because each child does the work of only one iteration before incurring the scheduling overhead inherent in entering a sync.
[Figure not reproduced]

Race conditions
Check the "qsort-race" program with cilkscreen:
[Figure not reproduced]

StarHPC on the Cloud
Will be ready for PP201X?

Eclipse PTP
Parallel Tools Platform
http://www.eclipse.org/ptp/
Will be ready for PP201X?

Recursion in OpenMP

    long fib_parallel(long n)
    {
        long x, y;
        if (n < 2) return n;
        #pragma omp task default(none) shared(x,n)
        {
            x = fib_parallel(n-1);
        }
        y = fib_parallel(n-2);
        #pragma omp taskwait
        return (x+y);
    }

Called from a parallel region in which a single thread starts the recursion:

    #pragma omp parallel
    #pragma omp single
    {
        r = fib_parallel(n);
    }

The task pragma can be useful for parallelizing irregular algorithms, such as recursive algorithms, for which other OpenMP workshare constructs are inadequate.

Use the taskwait pragma to wait for the completion of the child tasks generated by the current task.
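The two OpenMP fragments above can be combined into a single compilable unit. What follows is a minimal sketch under stated assumptions, not the code from the slides: the serial cutoff (CUTOFF, with fib_serial as the fallback) and the wrapper name fib_openmp are hypothetical additions for illustration. A cutoff of this kind is a common way to keep task-creation overhead from dominating small subproblems. No OpenMP runtime functions are called, so a compiler without OpenMP support simply ignores the pragmas and the code runs serially.

```c
/* Hypothetical tuning knob (not in the slides): below this n, recurse serially. */
#define CUTOFF 20

/* Plain serial recursion, used for small subproblems. */
static long fib_serial(long n)
{
    if (n < 2) return n;
    return fib_serial(n - 1) + fib_serial(n - 2);
}

/* Task-based recursion, following the structure of the slide's fib_parallel. */
static long fib_task(long n)
{
    long x, y;
    if (n < 2) return n;
    if (n < CUTOFF) return fib_serial(n);   /* assumption: avoid tiny tasks */

    /* The child task computes x; x must be shared so the parent sees the result. */
    #pragma omp task default(none) shared(x, n)
    x = fib_task(n - 1);

    /* The parent computes y itself while the child task may run elsewhere. */
    y = fib_task(n - 2);

    /* Wait for the child task generated above before reading x. */
    #pragma omp taskwait
    return x + y;
}

/* Hypothetical wrapper: one thread starts the recursion, the team runs the tasks. */
long fib_openmp(long n)
{
    long r = 0;
    #pragma omp parallel
    #pragma omp single
    r = fib_task(n);
    return r;
}
```

Calling fib_openmp(30) from a main function returns 832040. With gcc, the -fopenmp flag used in the hybrid demo enables the pragmas.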
Reference: http://myxman.org/dp/node/182

Intel® Parallel Studio
• Use Parallel Composer to create and compile a parallel application
• Use Parallel Inspector to improve reliability by finding memory and threading errors
• Use Parallel Amplifier to improve parallel performance by tuning threaded code

Intel® Parallel Studio
Parallel Studio adds new features to Visual Studio.

Intel's Parallel Amplifier – Execution Bottlenecks
[Figure not reproduced]

Intel's Parallel Inspector – Threading Errors
Error – Data Race
[Figure not reproduced]

Intel Parallel Studio – Composer
The installation of this part failed for me, probably because I had not installed Intel's C++ compiler first. Sorry, I can't give a demo here.