Programming Paradigms: What should we do about data-intensive programming?
Geoffrey Fox, Shantenu Jha, Dan Katz and Omer Rana

DIR: Some Take-home Messages
• Emphasis on Data Management, Integration, Complexity etc. Interesting and effective applications, but there was limited stress-testing of CI & PP!
  – Szalay talked about Graywulf (storage and IO)
  – Unclear how far current approaches scale
    • They do not necessarily capture long-time trends (“Living in an Exponential World”)
  – Value in Null-Results:
    • What makes a DIA a DIA? Towards a suite of DIA benchmarks?
• Programming Approaches and Abstractions:
  – An attempt at defining where & how PP might be effective
    • Quantitative: four classes based upon ε = α/β [bytes/flops]
    • Qualitative: Many stages in the data life-cycle:
      (i) Analysis phase [McCallum, Fox]
      (ii) Integrating with simulation [Watson, Haines]
      (iii) Data Exploration [Saltz]
      (iv) Data-Integration [Many]
  – And Tools for the above, which enable effective sharing, lower TTS
• [Somehow] Focus on Biological Sciences:
  – Time scale over which it has transitioned from “small” to “big” science

Data-Intensive PP: What is needed?
• “Is there a classification of data-intensive applications that would help me to understand (simply) if Approach X were/were-not useful?”
• If so, it would involve:
  – (i) Classification of DI application characteristics
  – (ii) Identification of “fundamental vectors”
  – (iii) Leads to an understanding of the Programming [Deployment & Exec] Paradigms based upon (i) and (ii)
• Understanding (i) and (ii) will influence the ability to create tools that support the abstractions

Dynamic DI Applications
• Geoffrey presented a simple and effective analysis based upon synchronisation and the constraints imposed [see next two slides for completeness]
  – From DPA, we know synchronisation/coordination is one important vector.
  – For DDIA: what are the other characteristics that influence and constrain programming approaches?
  – Similar questions to slide (ii) but now with dynamic data-intensive applications
• Some extension for dynamic DI applications.
  – What are the different classes of Dynamic DIA?
    • Real-time data assimilation [Watson, Austin, Kell]
      – Including sensors [Plale], SaaS etc. [Bishop/Grapel]
    • Dynamic resource utilization & Application-level changes
      – Bandwidth fluctuations, Network virtualization
      – Data-Compute placement issues
    • Feedback e.g. Sensor selection [Watson, Birkin, Haines]

Application Classes
(Parallel software/hardware in terms of 5 “Application architecture” Structures)
1 Synchronous – Lockstep Operation as in SIMD architectures
2 Loosely Synchronous – Iterative Compute-Communication stages with independent compute (map) operations for each CPU. Heart of most MPI jobs
3 Asynchronous – Compute Chess; Combinatorial Search often supported by dynamic threads
4 Pleasingly Parallel – Each component independent; in 1988, Fox estimated at 20% of total number of applications – Grids
5 Metaproblems – Coarse grain (asynchronous) combinations of classes 1)–4). The preserve of workflow – Grids
6 MapReduce++ – Describes file(database) to file(database) operations, which have three subcategories:
  1) Pleasingly Parallel Map Only
  2) Map followed by reductions
  3) Iterative “Map followed by reductions” – Extension of Current Technologies that supports much linear algebra and datamining
  – Clouds

Applications & Different Interconnection Patterns
[Figure: data flow for each pattern. Map Only: Input, map, Output. Classic MapReduce: Input, map, reduce. Iterative Reductions (Twister, http://www.iterativemapreduce.org/): Input, map, reduce, with iterations. Loosely Synchronous: Pij.]

Map Only
• CAP3 Analysis
• Document conversion (PDF -> HTML)
• Brute force searches in cryptography
• Parametric sweeps
  – CAP3 Gene Assembly
  – PolarGrid Matlab data analysis

Classic MapReduce
• High Energy Physics (HEP) Histograms
• SWG gene alignment
• Distributed search
• Distributed sorting
• Information retrieval
  – Information Retrieval
  – HEP Data Analysis
  – Calculation of Pairwise Distances for ALU Sequences

Iterative Reductions (Twister)
• Expectation maximization algorithms
• Clustering
• Linear Algebra
  – Kmeans
  – Deterministic Annealing Clustering
  – Multidimensional Scaling MDS
  – cf. Szalay comment on need for multi-resolution algorithms with dynamic stopping

Loosely Synchronous
• Many MPI scientific applications utilizing wide variety of communication constructs including local interactions
  – Solving Differential Equations
  – Particle dynamics with short range forces

Domain of MapReduce and Iterative Extensions: Map Only, Classic MapReduce, Iterative Reductions
Domain of MPI: Loosely Synchronous
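
The three MapReduce++ subcategories listed above (Map Only, Map followed by reductions, and Iterative “Map followed by reductions”) can be illustrated with a minimal single-process Python sketch. This is not Twister or any particular framework's API: the helper names `map_only`, `map_reduce` and `kmeans_1d`, the sample inputs, and the choice of 1-D Kmeans as the iterative example are all assumptions made for illustration, and a real runtime would distribute the map and reduce calls across nodes.

```python
from collections import defaultdict
from functools import reduce


def map_only(inputs, map_fn):
    """Subcategory 1: every input is processed independently; no reduce step."""
    return [map_fn(x) for x in inputs]


def map_reduce(inputs, map_fn, reduce_fn):
    """Subcategory 2: map emits (key, value) pairs, values are combined per key."""
    groups = defaultdict(list)
    for x in inputs:
        for key, value in map_fn(x):
            groups[key].append(value)
    return {key: reduce(reduce_fn, values) for key, values in groups.items()}


def kmeans_1d(points, centroids, tol=1e-9, max_iters=100):
    """Subcategory 3: repeat a map-reduce step until the centroids stop moving."""
    for _ in range(max_iters):
        def assign(x, cs=centroids):
            # Map: emit (index of nearest centroid, (partial sum, count)).
            nearest = min(range(len(cs)), key=lambda i: abs(x - cs[i]))
            return [(nearest, (x, 1))]

        # Reduce: add up partial sums and counts for each centroid.
        sums = map_reduce(points, assign,
                          lambda a, b: (a[0] + b[0], a[1] + b[1]))
        # A centroid that attracted no points keeps its previous position.
        updated = [sums[i][0] / sums[i][1] if i in sums else c
                   for i, c in enumerate(centroids)]
        if max(abs(u - c) for u, c in zip(updated, centroids)) < tol:
            return updated
        centroids = updated
    return centroids


if __name__ == "__main__":
    # Map Only: document conversion reduced to a hypothetical rename.
    print(map_only(["a.pdf", "b.pdf"], lambda n: n.replace(".pdf", ".html")))
    # Classic MapReduce: word counts, histogram style.
    print(map_reduce(["a b a", "b c"],
                     lambda line: [(w, 1) for w in line.split()],
                     lambda a, b: a + b))
    # Iterative reductions: 1-D Kmeans with two starting centroids.
    print(kmeans_1d([1.0, 2.0, 9.0, 10.0], [0.0, 5.0]))
```

The only structural difference between the second and third functions is the outer loop with a convergence test, which is the sense in which the iterative form is an extension of the classic pattern rather than a separate model.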