Programming Paradigms: What should we do about data-intensive programming?

Geoffrey Fox, Shantenu Jha, Dan Katz and Omer Rana
DIR: Some Take-home Messages
• Emphasis on Data Management, Integration, Complexity, etc.; interesting
and effective applications, but limited stress-testing of CI & PP!
– Szalay talked about Graywulf (storage and IO)
– Unclear how far current approaches scale?
• May not capture long-term trends (“Living in an Exponential World”)
– Value in Null-Results:
• What makes a DIA a DIA? Towards a suite of DIA benchmarks?
• Programming Approaches and Abstractions:
– An attempt at defining where & how PP might be effective
• Quantitative: four classes based upon ε = α/β [bytes/flops]
• Qualitative: Many stages in the data life-cycle: (i) Analysis phase [McCallum,
Fox] (ii) Integrating with simulation [Watson, Haines] (iii) Data Exploration
[Saltz] (iv) Data-Integration [Many]
– And Tools for the above, which enable effective sharing, lower TTS
• [Somehow] Focus on Biological Sciences:
– Time scale over which it has transitioned from “small” to “big” science
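The quantitative classification above rests on the ratio ε = α/β in bytes per flop. The slide does not give the boundaries of the four classes, so the minimal sketch below only computes and ranks ε. It assumes α is the number of bytes an application moves and β is the number of floating-point operations it performs; the workload names and figures are hypothetical.

```python
def bytes_per_flop(bytes_moved: float, flops: float) -> float:
    """Return epsilon = alpha / beta, in bytes per floating-point operation."""
    if flops <= 0:
        raise ValueError("flops must be positive")
    return bytes_moved / flops


# Hypothetical workloads: (alpha = bytes moved, beta = floating-point operations).
workloads = {
    "dense_matrix_multiply": (2.4e9, 2.0e12),
    "histogram_scan": (1.0e12, 1.0e11),
}

epsilon = {name: bytes_per_flop(a, b) for name, (a, b) in workloads.items()}

# Larger epsilon means more data movement per unit of computation,
# i.e. a more data-intensive workload under this measure.
ranked = sorted(epsilon, key=epsilon.get, reverse=True)
```

Ranking by ε rather than assigning class labels keeps the sketch independent of thresholds that the slide leaves unstated.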
Data-Intensive PP: What is needed?
• “Is there a classification of data-intensive
applications that would help me to understand
(simply) if Approach X were/were-not useful?”
• If so, it would involve:
– (i) Classification of DI application characteristics
– (ii) Identification of “fundamental vectors”
– (iii) Leads to an understanding of the Programming
[Deployment & Exec] Paradigms based upon (i) and (ii)
• Understanding (i) and (ii) will influence the ability
to create tools that support the abstractions
Dynamic DI Applications
• Geoffrey presented a simple and effective analysis based
upon synchronisation and the constraints imposed [see next
two slides for completeness]
– From DPA, we know synchronisation/coordination is one
important vector.
– For DDIA: what are the other characteristics that influence and
constrain programming approaches?
– Similar questions to slide (ii) but now with dynamic data-intensive applications
• Some extension for dynamic DI applications.
– What are the different classes of Dynamic DIA?
• Real-time data assimilation [Watson, Austin, Kell]
– Including sensors [Plale], SaaS etc., [Bishop/Grapel]
• Dynamic resource utilization & Application-level changes
– Bandwidth fluctuations, Network virtualization
– Data-Compute placement issues
• Feedback, e.g. sensor selection [Watson, Birkin, Haines]
Application Classes
(Parallel software/hardware in terms of 5 “Application architecture” Structures)
1. Synchronous: Lockstep operation as in SIMD architectures.
2. Loosely Synchronous: Iterative compute-communication stages with independent compute (map) operations for each CPU. Heart of most MPI jobs.
3. Asynchronous: Computer chess; combinatorial search, often supported by dynamic threads.
4. Pleasingly Parallel: Each component independent; in 1988, Fox estimated at 20% of total number of applications. (Grids)
5. Metaproblems: Coarse-grain (asynchronous) combinations of classes 1)-4). The preserve of workflow. (Grids)
6. MapReduce++: Describes file(database) to file(database) operations and has three subcategories. (Clouds)
– 1) Pleasingly Parallel Map Only
– 2) Map followed by reductions
– 3) Iterative “Map followed by reductions”: extension of current technologies that supports much linear algebra and datamining
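The three MapReduce++ subcategories can be illustrated with a minimal single-process sketch. It is plain Python rather than the API of any MapReduce framework; all function names are hypothetical, and the histogram and one-dimensional k-means data are made up for illustration.

```python
from collections import defaultdict
from functools import reduce


def map_only(records, fn):
    """Subcategory 1: apply fn to each record independently; no reduction."""
    return [fn(r) for r in records]


def map_reduce(records, mapper, reducer):
    """Subcategory 2: map to (key, value) pairs, group by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reduce(reducer, values) for key, values in groups.items()}


def iterative_map_reduce(records, state, make_mapper, reducer, update, max_iter=50):
    """Subcategory 3: repeat map + reduce, feeding each result into the next round."""
    for _ in range(max_iter):
        reduced = map_reduce(records, make_mapper(state), reducer)
        new_state = update(state, reduced)
        if new_state == state:  # converged: another round would change nothing
            break
        state = new_state
    return state


# 1) Map only: a stand-in for per-document conversion.
converted = map_only(["a.pdf", "b.pdf"], lambda name: name.replace(".pdf", ".html"))

# 2) Map followed by reduction: histogram with bins of width 10.
histogram = map_reduce([3, 14, 15, 27, 9], lambda v: [(v // 10 * 10, 1)],
                       lambda a, b: a + b)


# 3) Iterative map + reduce: one-dimensional k-means.
def nearest_center_mapper(centers):
    def mapper(x):
        idx = min(range(len(centers)), key=lambda i: abs(x - centers[i]))
        return [(idx, (x, 1))]  # partial (sum, count) for the chosen cluster
    return mapper


def add_partial_sums(a, b):
    return (a[0] + b[0], a[1] + b[1])


def recompute_centers(centers, reduced):
    # Clusters that received no points keep their previous center.
    return tuple(reduced[i][0] / reduced[i][1] if i in reduced else centers[i]
                 for i in range(len(centers)))


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
centers = iterative_map_reduce(points, (0.0, 5.0), nearest_center_mapper,
                               add_partial_sums, recompute_centers)
```

The iterative case reuses `map_reduce` unchanged, which reflects the description of subcategory 3 as an extension of the map-then-reduce pattern rather than a separate model.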
Applications & Different Interconnection Patterns
[Figure: four interconnection patterns. Diagram labels: Input, map, reduce, iterations, Pij, Output.]
• Map Only
– CAP3 Analysis; Document conversion (PDF -> HTML); Brute force searches in cryptography; Parametric sweeps
– CAP3 Gene Assembly; PolarGrid Matlab data analysis
• Classic MapReduce
– High Energy Physics (HEP) Histograms; SWG gene alignment; Distributed search; Distributed sorting; Information retrieval
– Information Retrieval; HEP Data Analysis; Calculation of Pairwise Distances for ALU Sequences
• Iterative Reductions (Twister, http://www.iterativemapreduce.org/)
– Expectation maximization algorithms; Clustering; Linear Algebra
– Kmeans; Deterministic Annealing Clustering; Multidimensional Scaling MDS
– cf. Szalay comment on need for multi-resolution algorithms with dynamic stopping
• Loosely Synchronous
– Many MPI scientific applications utilizing wide variety of communication constructs including local interactions
– Solving Differential Equations; particle dynamics with short range forces
Domain of MapReduce and Iterative Extensions: Map Only, Classic MapReduce, Iterative Reductions
Domain of MPI: Loosely Synchronous
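To contrast the loosely synchronous pattern with the MapReduce patterns, the sketch below alternates a communication stage (exchange of edge values between neighbouring partitions) with an independent compute stage on each partition. It is a single-process illustration using a Jacobi averaging sweep on a one-dimensional grid, not MPI code; the function names and the choice of Jacobi averaging are assumptions made for the example.

```python
def split(u, parts):
    """Split the grid into `parts` contiguous chunks (requires parts <= len(u))."""
    n = len(u)
    bounds = [n * p // parts for p in range(parts + 1)]
    return [u[bounds[p]:bounds[p + 1]] for p in range(parts)]


def exchange_halos(chunks):
    """Communication stage: each chunk receives its neighbours' edge values."""
    halos = []
    for p in range(len(chunks)):
        left = chunks[p - 1][-1] if p > 0 else None
        right = chunks[p + 1][0] if p < len(chunks) - 1 else None
        halos.append((left, right))
    return halos


def local_update(chunk, left, right):
    """Compute stage: Jacobi average using only local data plus halo values."""
    padded = [left] + chunk + [right]
    out = []
    for i in range(1, len(padded) - 1):
        lo, hi = padded[i - 1], padded[i + 1]
        if lo is None or hi is None:  # global boundary point: held fixed
            out.append(padded[i])
        else:
            out.append(0.5 * (lo + hi))
    return out


def loosely_synchronous(u, parts, steps):
    """Alternate communication and independent compute stages for `steps` sweeps."""
    chunks = split(u, parts)
    for _ in range(steps):
        halos = exchange_halos(chunks)
        chunks = [local_update(c, l, r) for c, (l, r) in zip(chunks, halos)]
    return [x for c in chunks for x in c]


def serial_reference(u, steps):
    """Same sweep on the undivided grid, for comparison."""
    for _ in range(steps):
        u = [u[i] if i in (0, len(u) - 1) else 0.5 * (u[i - 1] + u[i + 1])
             for i in range(len(u))]
    return u


grid = [1.0] + [0.0] * 8 + [1.0]
partitioned = loosely_synchronous(grid, parts=3, steps=20)
```

Each partition needs only one value from each neighbour per sweep, which is the kind of local interaction the slide associates with MPI applications.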