Opportunities and Challenges

Distributed Computing Driving Force
• Increase in computation need
  – Terabytes and petabytes of data
  – Millions of CPU hours of computation
  – Moore's Law hits a ceiling
• Enormous CAPEX and OPEX: millions of USD
  – Large server space
  – Huge energy bill

Market Size
• Bloomberg/IDC predictions
  – Demand for HPC consulting and solutions grows by 9%
  – $15-20B in a couple of years
• Service-sector opportunity: ~$2B

GPGPU as Computing Platform
• General-Purpose computing on Graphics Processing Units
  – Using GPUs for computation-intensive, non-graphical applications
• Why GPU computing?
  – GPUs are faster, programmable, easily available and cheap
  – Change in computing paradigm
    - Traditional superscalar architectures have their limits for intensive workloads
    - Parallel computing is becoming commonplace
    - Parallelism cannot be leveraged automatically

Desktop "Super"computing (figure: CPU server vs. CPU-GPU server)
                  CPU Server   CPU-GPU Server
  Gflop              80.1          656.1
  Gflop/$K           11            60
  Gflop/kW           146           656
CPU Server: 2x Intel Xeon X5550, 2.66 GHz, 48 GB memory, $7K, 0.55 kW
CPU-GPU Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48 GB memory, $11K, 1 kW
Source: http://www.vpac.org/files/GPU-Slides/01.tesla_introduction.pdf

CPU+GPU cluster vs. CPU-only cluster (figure)
• Space: 2 racks of CPU+GPU vs. 15 racks of CPUs (7x less space)
• Cost: $740K vs. $3.8M (5x less cost)
• Energy: $117K vs. $525K per year (4.5x power saving every year)

PROGRAMMING FOR PARALLELISM IS NOT EASY
7/27/2016 CCDP 2011, Mysore Park Workshop

Why is Parallel Programming Difficult?
Parallel programming needs an entirely different way of thinking. For example, consider calculating the value of π:
• Sequential approach (series summation)
  – Start with 1
  – Add −1/3
  – Add +1/5
  – ... and in general add (−1)^n/(2n+1)
  – π = 4 × result
• Parallel approach (random sampling)
  – Generate a large number of random points (x, y) with both coordinates in (−1, +1)
  – Check which points fall within the unit circle
  – Count the number of points within the circle
  – π = 4 × (number within circle) / (total number of points)
Each random point can be tested independently of all the others, so the sampling work splits naturally across processors; the series, written as a running sum, is naturally a sequential loop.
Source: D. Patterson, “The Trouble with Multi-Core”, IEEE Spectrum, 2010
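The two strategies for π can be sketched in C as follows. This is an illustrative sketch, not code from the talk: the function names, the xorshift random-number generator and the fixed seeds are assumptions chosen so that results are reproducible. The Monte Carlo estimate is split into chunks that share no state, which is what makes it easy to distribute; here the chunks simply run one after another.

```c
#include <stdint.h>

/* Sequential approach: partial sum of the series
   1 - 1/3 + 1/5 - ... + (-1)^n/(2n+1), multiplied by 4. */
double pi_leibniz(long terms) {
    double sum = 0.0;
    for (long n = 0; n < terms; n++) {
        double term = 1.0 / (2.0 * n + 1.0);
        sum += (n % 2 == 0) ? term : -term;
    }
    return 4.0 * sum;
}

/* Marsaglia xorshift32 PRNG; each chunk owns its own state,
   unlike rand(), which shares hidden global state. */
static uint32_t xorshift32(uint32_t *state) {
    uint32_t x = *state;
    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* One independent chunk: count random points in [-1,1] x [-1,1]
   that fall inside the unit circle. */
long count_inside(long samples, uint32_t seed) {
    uint32_t state = seed ? seed : 1u; /* xorshift state must be non-zero */
    long inside = 0;
    for (long k = 0; k < samples; k++) {
        double x = 2.0 * xorshift32(&state) / 4294967295.0 - 1.0;
        double y = 2.0 * xorshift32(&state) / 4294967295.0 - 1.0;
        if (x * x + y * y <= 1.0)
            inside++;
    }
    return inside;
}

/* Parallel-friendly approach: chunks share nothing, so they could run on
   separate threads or GPU blocks; their counts are simply added up. */
double pi_monte_carlo(int chunks, long samples_per_chunk) {
    long inside = 0;
    for (int c = 0; c < chunks; c++)
        inside += count_inside(samples_per_chunk, 12345u + (uint32_t)c);
    return 4.0 * (double)inside / ((double)chunks * (double)samples_per_chunk);
}
```

Calling pi_leibniz(1000000) should match π to about five decimal places, while pi_monte_carlo(8, 1000000) typically lands within a few thousandths of π; the sampling error shrinks only with the square root of the number of points, which is the price paid for the easy parallelism.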
HPC – Crossing the Chasm
• New infrastructure: more and more raw compute power (GPU / many-core / Cloud)
• Business / scientific computation: ever-increasing demand
• New design challenges
  – Building a parallel algorithm is 5 to 10 times harder
  – Existing applications are not meant for parallel infrastructure
• Software engineering support
  – Architecture-aware design (GPU memory hierarchy, thread model; elastic infrastructure)
  – Design assistance
  – Programming assistance
  – Verification, validation
  – Data-driven computation
  – Transformation, refactoring (functional programming paradigm)

Focus Area
• Desktop High Performance Computing
  – Memory-optimized CFD solver using GPGPU
  – Parallel Programming Workbench
  – Memory access optimization toolkit
• Cloud Computing
  – Dealing with scale: Hadoop-based applications
  – Design assistant: engineering applications for the Cloud
  – Infrastructure: Cloud management
  – Better Hadoop scheduler

Why is the Parallel Workbench important?
• Faster to build
• Faster to refactor
• Helps hide architectural complexity
• Better portability
• Better code: better use of hardware resources

Challenges / Research Questions
• How do I refactor my application to exploit multiple cores on the CPU and GPU?
• How do I simplify the design and implementation of parallel applications?
• How do I make optimal use of the computing power?
  – Optimal usage of threads
  – Optimal usage of memory
  – Optimal usage of clusters

Source Code Parallelization Assistant
Existing code → Analyze parallelization opportunity → Generative programming framework → Parallelized code
• Analysis: C/C++ AST + CDFG; C → CIL; LLVM
• Expert input: domain-specific information, additional parameters

Data Parallelism
Loop identification
• Statistical approach
  [+] A proper selection of samples can cover the different possible code paths without resorting to dynamic analysis
  [-] Who can provide a good sample?
Loop computation weight
• Formal analysis
  [+] Simpler to execute (may reduce to a few equations)
  [-] No guarantee of a perfect result
  [-] Complex iteration dependencies, for example:

    /* Loop-carried dependency: iteration i needs the result of iteration i-1.
       The loop starts at 1 so that array[i - 1] never reads array[-1]. */
    for (int i = 1; i < max_count; i++) {
        array[i] = array[i - 1] * 10;
    }

• Code coverage
  [+] Gives a more accurate estimate of the different paths present in a loop
  [-] How to get "good" test cases?
  [-] More time-consuming; results are known only after execution
Transformation
• Allocation of the data-parallel part to a processing element, e.g. to a GPU thread

Approach to Loop Analysis

    for (i = 0; i < n; i++) {
        for (j = 0; j < i; j++) {
            S;
        }
    }

• Express the loop conditions as affine constraints (i.e. linear terms plus constants), here 0 ≤ i < n and 0 ≤ j < i; together they define a polytope.
• Every integer point of the polytope is a combination of loop-variable values for which the loop conditions are satisfied, i.e. one execution of S.
• An integral polytope has an associated Ehrhart polynomial, which gives the number of integer points in the dilated polytope tP as a polynomial in t; its leading coefficient is the volume of P.
• Use a polytope solver to compute (or approximate) the total number of iterations; for this loop nest the count is n(n−1)/2.
Reference: Barvinok, A. I. (2006). Computing the Ehrhart quasi-polynomial of a rational simplex. Math. Comp. 75, 1449–1466.

Volume Computation
• Volume computation is performed by Barvinok, an open-source polytope library.
• Given a polytope represented by a set of affine inequalities, we can determine its volume by subdividing it into simplexes.
• Simplexes generalize the triangle to N dimensions; their volume can be computed easily using linear algebra.
• The final result is obtained by summing the numbers of points inside all the simplexes.
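To make the counting step concrete, here is a small C sketch for the triangular loop nest above. It does not call Barvinok or reproduce its API; it enumerates the integer points of the polytope 0 ≤ j < i < n directly and compares the result with the closed form n(n−1)/2, the kind of parametric answer a polytope counter is expected to return. The names count_iterations and closed_form are hypothetical helpers.

```c
/* Hypothetical helper: enumerate the integer points (i, j) with
   0 <= i < n and 0 <= j < i, i.e. count how many times S executes. */
long count_iterations(long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < i; j++)
            count++;
    return count;
}

/* Hypothetical helper: closed-form point count of the same polytope,
   n(n-1)/2 for n > 0 and 0 when the outer loop does not run. */
long closed_form(long n) {
    return n > 0 ? n * (n - 1) / 2 : 0;
}
```

Enumeration costs time proportional to the iteration count itself, whereas the closed form is evaluated in constant time for any n; that gap is the reason for solving the counting problem symbolically on the polytope instead of running the loop.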
Example Code – dcraw.c
(slide content not preserved)

Barvinok Equation
(slide content not preserved)

Future Work
• Enabling developers to supply domain-specific knowledge
  – Devising usable parameters
  – Use of source code annotations
• Program slicing to enable quicker analysis
• Loop iteration dependency analysis
• Generation of GPU-specific code for the identified parts
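Loop iteration dependency analysis could start from a classical, conservative check such as the GCD test. This is a standard textbook technique swapped in here for illustration; the slides do not say which analysis the authors use. For a single loop index i with a write to a[a1*i + b1] and a read of a[a2*i + b2], the two accesses can refer to the same element only if gcd(a1, a2) divides b2 − b1. The function may_depend is a hypothetical helper.

```c
#include <stdlib.h>

/* Greatest common divisor of two non-negative integers (Euclid). */
static long gcd(long a, long b) {
    while (b != 0) {
        long t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper implementing the GCD dependence test for one loop
   index i: write to a[a1*i + b1], read of a[a2*i + b2].
   Returns 0 if the accesses can never touch the same element (no dependence),
   1 if a dependence MAY exist. Conservative: loop bounds are ignored. */
int may_depend(long a1, long b1, long a2, long b2) {
    long g = gcd(labs(a1), labs(a2));
    if (g == 0)                 /* both subscripts are constants */
        return b1 == b2;
    return (b2 - b1) % g == 0;
}
```

For the loop array[i] = array[i - 1] * 10, may_depend(1, 0, 1, -1) returns 1, and a dependence does exist: each iteration reads the previous one. For a loop such as a[2*i] = a[2*i + 1] * 10, may_depend(2, 0, 2, 1) returns 0 because writes touch only even indices and reads only odd ones, so the iterations are independent and could, for example, be allocated to separate GPU threads.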
