Opportunities and Challenges Distributed Computing Driving Force • Increase in Computation Need • Tera and petabytes of data • Millions of CPU Hours of Computation • Moore’s Law hits ceiling • Enormous CAPEX and OPEX- Millions of USD • Large Server Space • Huge Energy Bill Market Size • Bloomberg/IDC Predictions • HPC consulting, solution demand grows by 9% • $15-20B in a couple of years Service Sector opportunity ~ $2B GPGPU as Computing Platform • General Purpose computing on Graphics Processing Units – Using GPUs for computation intensive, non-graphical applications • Why GPU Computing? – GPUs are faster, programmable, easily available and cheap – Change in Computing Paradigm - Traditional super-scalar architectures have their limits for intensive workloads - Parallel computing becoming a common-place Cannot be automatically leveraged Desktop “Super”computing 800 600 400 200 80 656.1 80.1 0 CPU Server 60 CPU-GPU Server 20 Gflop 800 60 40 11 0 CPU Server 600 CPU-GPU Server 200 656 CPU Server 400 146 0 Gflop/$K Gflop/Kwatt CPU Server: 2x Intel Xeon X5550, 2.66 GHz, 48GB Memory, $7K, 0.55KW CPU-GPU Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48GB Memory, $11K, 1KW http://www.vpac.org/files/GPU-Slides/01.tesla_introduction.pdf 2 Racks of CPU+GPU 7x less space 15 Racks of CPUs $740K 5x less cost $3.8M $117K 4.5x power saving every year $525K CPU-GPU Server PROGRAMMING FOR PARALLELISM IS NOT EASY 7/27/2016 CCDP 2011, Mysore Park Workshop 5 Why is Parallel Programming Difficult? Parallel Programming needs entirely different way of thinking.. For example- Calculating Value of ∏ Sequential Approach Start with 1 Generate large no. of -1 random points (x,y) within (1, +1) Add -1/3 Add +1/5 Add (-1)n/(2n+1) ∏= 4x result Parallel Approach +1 +1 -1 Which point falls within circle? Count number of points within circle ∏= 4x (number within circle)/(total number of points) D. Patterson, “The Trouble with Multi-Core”, IEEE Spectrum 2010 HPC – Crossing the Chasm New Infrastructure * More and more raw compute power (GPU/ many-core/Cloud) Business/ Scientific Computation New design challenges •• • • •• Software engineering support Architecture-aware Design Assistance design • GPU memory hierarchy, thread model Programming Assistance • Elastic infrastructure Verification, validation Data-driven computation Transformation, refactoring(functional programming paradigm) Building parallel algorithm is 5 to 10 times harder Existing applications are not meant for parallel infrastructure * Ever increasing demand Focus Area Desktop High Performance Computing Memory optimized CFD Solver using GPGPU Parallel Programming Workbench Memory access optimization toolkit Cloud Computing Dealing with Scale- Hadoop based applications Design Assistant - Engineer Applications for Cloud * Infrastructure Cloud Management * Better Hadoop scheduler Why is Parallel Workbench important? • Faster to build • Faster to re-factor • Help to hide architectural complexity • Better portability • Better code – usage of hardware resource Challenges/Research Questions • How do I refractor my application to exploit multiple cores on the CPU and GPU? • How do I simplify the design and implementation of parallel applications? • How do I optimally use the computing power? – Optimal usage of Thread – Optimal usage of Memory – Optimal usage of Clusters Source Code Parallelization Assistant Existing Code Parallelized Code • C/C++ AST + CDFG • C -> CIL • LLVM Analyze Parallelization Opportunity 7/27/2016 Generative Programming Framework Expert Input •Domain specific Info, •Additional Parameter CCDP 2011, Mysore Park Workshop 10 Data Parallelism 7/27/2016 Statistical Approach Loop Identification [+]Proper selection of samples can cover different possible code coverage rather than dynamic analysis. [-] Who can give a good sample? Formal Analysis Loop computation weight [+] Simpler to execute (might be few equations) [-] No guarantee of perfect result [-] Complex Iteration Dependency for (int i=0; i<max_count; i++) { array [i] = array[i-1] *10; } Transformation CCDP 2011, Mysore Park Workshop Code Coverage [+] Gives more accurate estimate of different paths present in a loop. [-] How to get “good” test cases? [-] More time consuming, results known only after execution. Allocation of data parallel part to a processing element – e.g. to a GPU thread 11 Approach to Loop Analysis for (i=0;i<n;i++){ for (j=0; j<i; j++){ S; } } Makes loop conditions as affine constraints (i.e. linear + constants) to form a polytope An integral polytope has an associated Ehrhart polynomial which encodes the relationship between the volume of a polytope and the number of integer points the polytope contains. All the polyhedra points denote the variable values wherein the loop conditions are satisfied. Use a polytope solver to approximate the total number of iterations. Barvinok, A. I. (2006). Computing the Ehrhart quasi-polynomial of a rational simplex. Math. Comp. 75, 1449–1466 Volume Computation • Volume computation is performed by Barvinok, an opensource polytope library. • Given a polytope represented by a set of affine inequalities we can determine the volume of the polytope by subdividing it into simplexes • Simplexes are a generalization of the triangle to N-dimensions whose volume can be easily computed using linear algebra. • The final result is obtained by summing together the number of points inside all the simplexes. 7/27/2016 CCDP 2011, Mysore Park Workshop 13 Example Code – dcraw.c 7/27/2016 CCDP 2011, Mysore Park Workshop 14 Barvinok Equation 7/27/2016 CCDP 2011, Mysore Park Workshop 15 Future Work • Enabling developer to supply domain specific knowledge – Devising usable parameters – Use of source code annotations • Program slicing to enable quicker analysis • Loop iteration dependency analysis • Generation of GPU specific code for the identified part