Opportunities and Challenges

Distributed Computing
Driving Force
• Increase in Computation Need
• Tera and petabytes of data
• Millions of CPU Hours of Computation
• Moore’s Law hits ceiling
• Enormous CAPEX and OPEX- Millions of USD
• Large Server Space
• Huge Energy Bill
Market Size
• Bloomberg/IDC Predictions
• HPC consulting, solution demand grows by 9%
• $15-20B in a couple of years
Service Sector opportunity ~ $2B
GPGPU as Computing Platform
• General Purpose computing on Graphics Processing Units
– Using GPUs for computation intensive, non-graphical applications
• Why GPU Computing?
– GPUs are faster, programmable, easily available and cheap
– Change in Computing Paradigm
- Traditional super-scalar architectures have their limits for intensive workloads
- Parallel computing is becoming commonplace
- Parallelism cannot be leveraged automatically
Desktop “Super”computing
• Performance (Gflop): CPU Server 80.1, CPU-GPU Server 656.1
• Performance per cost (Gflop/$K): CPU Server 11, CPU-GPU Server 60
• Performance per power (Gflop/Kwatt): CPU Server 146, CPU-GPU Server 656
CPU Server: 2x Intel Xeon X5550, 2.66 GHz, 48GB Memory, $7K, 0.55KW
CPU-GPU Server: 2x Tesla C2050 + 2x Intel Xeon X5550, 48GB Memory, $11K, 1KW
http://www.vpac.org/files/GPU-Slides/01.tesla_introduction.pdf
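The per-cost and per-power figures on this slide can be re-derived from the server specifications. The sketch below is only a check of that arithmetic: the `Server` struct and the function names are hypothetical, and the numbers are the ones quoted on the slide (80.1 and 656.1 Gflop, $7K and $11K, 0.55 KW and 1 KW).

```c
#include <assert.h>

/* One server configuration, using the figures quoted on the slide. */
typedef struct {
    double gflop;      /* performance in Gflop                 */
    double cost_kusd;  /* purchase cost in thousands of USD    */
    double power_kw;   /* power draw in kilowatts              */
} Server;

static const Server CPU_SERVER     = { 80.1,  7.0, 0.55};
static const Server CPU_GPU_SERVER = {656.1, 11.0, 1.0};

/* Performance per thousand dollars. */
double gflop_per_kusd(Server s) {
    assert(s.cost_kusd > 0.0);
    return s.gflop / s.cost_kusd;
}

/* Performance per kilowatt. */
double gflop_per_kw(Server s) {
    assert(s.power_kw > 0.0);
    return s.gflop / s.power_kw;
}
```

Calling `gflop_per_kusd(CPU_SERVER)` gives about 11.4 and `gflop_per_kusd(CPU_GPU_SERVER)` about 59.6, which the chart rounds to 11 and 60; the per-kilowatt values come out near 146 and 656.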
• Space: 2 Racks of CPU+GPU vs. 15 Racks of CPUs (7x less space)
• Cost: $740K vs. $3.8M (5x less cost)
• Power: $117K vs. $525K (4.5x power saving every year)
PROGRAMMING FOR PARALLELISM IS NOT EASY
7/27/2016
CCDP 2011, Mysore Park Workshop
Why is Parallel Programming Difficult?
Parallel programming needs an entirely different way of thinking.
For example: calculating the value of π
Sequential Approach
• Start with 1
• Add -1/3
• Add +1/5
• Add (-1)^n/(2n+1)
• π = 4 x result
Parallel Approach
• Generate a large number of random points (x, y) within (-1, +1)
• Which points fall within the circle?
• Count the number of points within the circle
• π = 4 x (number within circle)/(total number of points)
D. Patterson, “The Trouble with Multi-Core”, IEEE Spectrum 2010
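The two approaches to π above can be sketched in C as follows. This is a minimal illustration, not the workshop's code: the function names, the per-worker seeds, and the small linear congruential generator are all assumptions made for the example, and the generator is not of production quality. In the series version every term is added to one running sum. In the sampling version each worker counts hits on its own points, so the workers need no communication until the final sum.

```c
#include <assert.h>
#include <stdint.h>

/* Sequential approach: 1 - 1/3 + 1/5 - ... + (-1)^n/(2n+1), then times 4.
   Every step updates the same running sum. */
double pi_series(long terms) {
    assert(terms > 0);
    double sum = 0.0;
    double sign = 1.0;
    for (long n = 0; n < terms; n++) {
        sum += sign / (2.0 * (double)n + 1.0);
        sign = -sign;
    }
    return 4.0 * sum;
}

/* Illustrative generator (an assumption, not a recommendation):
   returns a coordinate in [-1, +1) and advances the caller's state. */
static double next_coord(uint32_t *state) {
    *state = *state * 1664525u + 1013904223u;
    return (double)*state / 4294967296.0 * 2.0 - 1.0;
}

/* Work of one worker: count random points that fall inside the unit circle.
   It touches only its own state, so workers are independent of each other. */
long count_hits(long points, uint32_t seed) {
    uint32_t state = seed;
    long hits = 0;
    for (long k = 0; k < points; k++) {
        double x = next_coord(&state);
        double y = next_coord(&state);
        if (x * x + y * y <= 1.0)
            hits++;
    }
    return hits;
}

/* Parallel approach, simulated here with a plain loop over workers:
   pi = 4 x (number within circle) / (total number of points). */
double pi_monte_carlo(long points_per_worker, int workers) {
    assert(points_per_worker > 0 && workers > 0);
    long hits = 0;
    for (int w = 0; w < workers; w++)
        hits += count_hits(points_per_worker, 0x9E3779B9u * (uint32_t)(w + 1));
    return 4.0 * (double)hits / ((double)points_per_worker * (double)workers);
}
```

The loop over `w` in `pi_monte_carlo` is the part a real implementation would hand to threads or GPU blocks; only the final addition of the hit counts needs coordination.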
HPC – Crossing the Chasm
New Infrastructure
• More and more raw compute power (GPU/many-core/Cloud)
Business/Scientific Computation
• Ever increasing demand
New design challenges
• Building a parallel algorithm is 5 to 10 times harder
• Existing applications are not meant for parallel infrastructure
Software engineering support
• Architecture-aware design
– GPU memory hierarchy, thread model
– Elastic infrastructure
• Design assistance
• Programming assistance
• Verification, validation
• Data-driven computation
• Transformation, refactoring (functional programming paradigm)
Focus Area
Desktop High Performance Computing
• Memory optimized CFD Solver using GPGPU
• Parallel Programming Workbench
• Memory access optimization toolkit
Cloud Computing
• Dealing with Scale: Hadoop based applications
• Design Assistant: Engineer Applications for Cloud
• Cloud Management
– Infrastructure
– Better Hadoop scheduler
Why is Parallel Workbench important?
• Faster to build
• Faster to re-factor
• Helps to hide architectural complexity
• Better portability
• Better code
– Better usage of hardware resources
Challenges/Research Questions
• How do I refactor my application to exploit multiple cores on the CPU and GPU?
• How do I simplify the design and implementation of parallel applications?
• How do I optimally use the computing power?
– Optimal usage of Thread
– Optimal usage of Memory
– Optimal usage of Clusters
Source Code Parallelization Assistant
Existing Code
• C/C++
AST + CDFG
• C -> CIL
• LLVM
Analyze Parallelization Opportunity
Expert Input
• Domain specific Info
• Additional Parameter
Generative Programming Framework
Parallelized Code
Data Parallelism
Loop Identification
Loop computation weight
Statistical Approach
[+] Proper selection of samples can cover the different possible code paths, rather than relying on dynamic analysis.
[-] Who can give a good sample?
Formal Analysis
[+] Simpler to execute (might be a few equations)
[-] No guarantee of perfect result
[-] Complex
Iteration Dependency
for (int i = 1; i < max_count; i++) {
    /* iteration i reads the value written by iteration i-1 */
    array[i] = array[i - 1] * 10;
}
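The loop above carries a dependency from one iteration to the next, so its iterations cannot simply be handed to separate threads. The sketch below shows one possible transformation, chosen here purely for illustration and not taken from the slides: since every element is the first element times a power of 10, each iteration can compute its value from `i` alone. The function names are hypothetical, and the loops start at `i = 1` so that `array[i - 1]` is always a valid element.

```c
#include <assert.h>
#include <stddef.h>

/* Dependent form: iteration i needs the result of iteration i-1. */
void scale_sequential(long *array, size_t max_count) {
    for (size_t i = 1; i < max_count; i++)
        array[i] = array[i - 1] * 10;
}

/* 10^exp by repeated multiplication (small exponents only; no overflow check). */
static long pow10_long(size_t exp) {
    long result = 1;
    for (size_t k = 0; k < exp; k++)
        result *= 10;
    return result;
}

/* Transformed form: iteration i reads only array[0] (saved up front) and i,
   so the iterations are independent and could run in any order. */
void scale_independent(long *array, size_t max_count) {
    assert(max_count == 0 || array != NULL);
    long first = (max_count > 0) ? array[0] : 0;
    for (size_t i = 1; i < max_count; i++)
        array[i] = first * pow10_long(i);
}
```

The independent version does more arithmetic per element, which is a typical trade-off: redundant computation is accepted in exchange for removing the ordering constraint.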
Transformation
Code Coverage
[+] Gives a more accurate estimate of the different paths present in a loop.
[-] How to get “good” test cases?
[-] More time consuming; results are known only after execution.
Allocation of data parallel part to a processing element – e.g. to a GPU thread
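Allocating the data parallel part to processing elements amounts to deciding which iterations each element executes. The sketch below assumes a simple block partition on a CPU; `Range`, `block_for_worker`, and `square_all` are hypothetical names introduced for the example. On a GPU the usual choice is finer grained, typically one array element per thread, but the idea of mapping a worker index to a slice of the iteration space is the same.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open index range [begin, end) owned by one worker. */
typedef struct {
    size_t begin;
    size_t end;
} Range;

/* Split n iterations into `workers` contiguous blocks whose sizes
   differ by at most one; the first n % workers blocks get the extra item. */
Range block_for_worker(size_t n, size_t workers, size_t w) {
    assert(workers > 0 && w < workers);
    size_t base = n / workers;
    size_t extra = n % workers;
    Range r;
    r.begin = w * base + (w < extra ? w : extra);
    r.end = r.begin + base + (w < extra ? 1 : 0);
    return r;
}

/* Data parallel loop: each worker squares only the elements in its block.
   The outer loop stands in for the workers running concurrently. */
void square_all(const double *in, double *out, size_t n, size_t workers) {
    for (size_t w = 0; w < workers; w++) {
        Range r = block_for_worker(n, workers, w);
        for (size_t i = r.begin; i < r.end; i++)
            out[i] = in[i] * in[i];
    }
}
```

With 10 iterations and 4 workers the blocks are [0, 3), [3, 6), [6, 8) and [8, 10), so every index is covered exactly once.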
Approach to Loop Analysis
for (i = 0; i < n; i++) {
    for (j = 0; j < i; j++) {
        S;
    }
}
• Express the loop conditions as affine constraints (i.e. linear terms plus constants) to form a polytope.
• All the points of the polytope denote the variable values for which the loop conditions are satisfied.
• An integral polytope has an associated Ehrhart polynomial, which encodes the relationship between the volume of a polytope and the number of integer points the polytope contains.
• Use a polytope solver to approximate the total number of iterations.
Barvinok, A. I. (2006). Computing the Ehrhart quasi-polynomial of a rational simplex. Math. Comp. 75, 1449–1466
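For the loop nest shown on this slide the constraints are 0 <= j < i < n, a triangle in the (i, j) plane. The number of integer points in that triangle, and hence the number of times S executes, is n(n-1)/2. The sketch below checks this hand-derived closed form against a brute-force count. It does not call the Barvinok library, and the function names are hypothetical.

```c
#include <assert.h>

/* Brute force: run the loop nest and count how often S would execute. */
long count_iterations(long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        for (long j = 0; j < i; j++)
            count++;
    return count;
}

/* Closed form for the number of integer points with 0 <= j < i < n.
   It is a polynomial in the loop bound n, which is the kind of result
   a polytope point-counting tool is expected to return. */
long count_closed_form(long n) {
    assert(n >= 0);
    return n * (n - 1) / 2;
}
```

The closed form costs a few arithmetic operations regardless of n, which is why a symbolic count is attractive for estimating loop weight without executing the program.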
Volume Computation
• Volume computation is performed by Barvinok, an open-source polytope library.
• Given a polytope represented by a set of affine inequalities, we can determine the volume of the polytope by subdividing it into simplexes.
• Simplexes are a generalization of the triangle to N dimensions, whose volume can be easily computed using linear algebra.
• The final result is obtained by summing together the number of points inside all the simplexes.
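The subdivision idea can be illustrated in two dimensions, where the simplexes are triangles and the volume is an area. The sketch below is an assumption-laden toy, not the Barvinok library's algorithm: it handles only convex polygons given as an ordered vertex list, splits them into a fan of triangles, and sums the triangle areas obtained from a 2x2 determinant. `Point2` and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double x;
    double y;
} Point2;

/* Area of a triangle (a 2-simplex): |det[b - a, c - a]| / 2. */
double triangle_area(Point2 a, Point2 b, Point2 c) {
    double det = (b.x - a.x) * (c.y - a.y) - (b.y - a.y) * (c.x - a.x);
    return (det < 0.0 ? -det : det) / 2.0;
}

/* Area of a convex polygon whose vertices are listed in order:
   split it into triangles that all share vertex 0, then add up the pieces. */
double convex_polygon_area(const Point2 *v, size_t n) {
    assert(v != NULL && n >= 3);
    double total = 0.0;
    for (size_t k = 1; k + 1 < n; k++)
        total += triangle_area(v[0], v[k], v[k + 1]);
    return total;
}
```

For the unit square the fan consists of two triangles of area 0.5 each, giving a total of 1. In N dimensions the same determinant formula applies with a division by N! instead of 2.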
Example Code – dcraw.c
Barvinok Equation
Future Work
• Enabling developer to supply domain specific knowledge
– Devising usable parameters
– Use of source code annotations
• Program slicing to enable quicker analysis
• Loop iteration dependency analysis
• Generation of GPU specific code for the identified part