3DAPAS/ECMLS Panel: Dynamic Distributed Data Intensive Analysis Environments for Life Sciences
June 8, 2011, San Jose
Geoffrey Fox, Shantenu Jha, Dan Katz, Judy Qiu, Jon Weissman

Discussion Topics?
• Programming methods: languages vs. frameworks (advantages and disadvantages of each)
• Moving compute to data: is the data localization model imposed by clouds scalable and/or sustainable?
• Does Life Sciences want clouds or supercomputers?
• Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?
• How important are data security and privacy?

Programming methods: languages vs. frameworks for data-intensive/Life Science areas
• SaaS offers “BLAST etc.” on demand
• Role of PGAS and data-parallel compilers (like Chapel), i.e., mainstream HPC high-level approaches
• Nodes vs. cores vs. GPUs – do hybrid programming models have special features?
• MapReduce vs. MPI (a k-mer counting sketch follows these slides)
• Distributed environments like SAGA
• Data-parallel analysis languages like Pig Latin and Sawzall
• Role of databases like SciDB and SQL-based analysis – see DryadLINQ
• Is R (cloud R, parallel R) critical? What about Excel, Matlab, …?

Moving compute to data: Is the data localization model imposed by Clouds scalable and/or sustainable?
• This relates to the privacy and programming-model questions
• Is data stored in central resources?
• Does data have co-located compute resources (cloud)?
• If co-located, are data and compute on the same cluster?
  – How is data spread out over the disks on nodes?
• Or is data in a storage system supporting a wide-area file system shared by the nodes of a cloud?
• Or is data in a database (SciDB, SkyServer)?
• Or is data in an object store like OpenStack?
• What kind of middleware exists, or needs to be developed, to enable effective compute-data movement? Or is it just a runtime scheduling problem? (a locality-aware scheduling sketch follows these slides)
• What are the performance issues, and how do we get data there for dynamic data such as that produced by sequencers?

Data model for Life Sciences: is it dynamic? What is its size? What is the access pattern? Is it shared or individual?
• Is it a few large centers?
• Is it a distributed set of repositories containing, say, all data from a particular lab?
  – Or both of the above?
  – How to manage and present a stream of new data?
• The world created ~1000 exabytes of data this year – how much will Life Sciences create?
• Relative importance of large shared data centers versus instrument- or computer-generated, individually owned data?
• Is data replication important?
• Storage model – files, objects, databases?
• How often are the different types of data read (presumably written once!)?
  – Which data is most important? Raw, or processed to some level?
• What is the metadata challenge?

Does Life Sciences want Clouds or Supercomputers?
• Clouds are cost-effective and elastic for varying need
• Supercomputers support low-latency (MPI) parallel applications
• Clouds are the main commercial offering; supercomputers are the main academic large-scale computing solution
  – Also Open Science Grid, EGI, …
• Cost (time) of transporting data from sequencers and repositories to analysis engines (clouds) – a transfer-time estimate follows these slides
  – Will NLR or Internet2 link to clouds? They do to TeraGrid
• What can the LS data-intensive community learn from the HEP community? E.g., will the HEP approach of community-wide “workload management systems” and VOs work?
• What is the role of campus clusters/resources in genomic data sharing?
• No history of large cloud budgets in federal grants
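To ground the MapReduce vs. MPI question on the programming-methods slide, here is a minimal MapReduce-style k-mer count in Python. It is an illustrative sketch only: the reads, the k-mer length, and the use of multiprocessing as a stand-in "cluster" are all assumptions for the demo, not anything the panel proposed.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

K = 3  # illustrative k-mer length

def map_kmers(read):
    """Map step: count k-mers in one read; no inter-task communication."""
    return Counter(read[i:i + K] for i in range(len(read) - K + 1))

def reduce_counts(a, b):
    """Reduce step: merge two partial k-mer histograms."""
    a.update(b)
    return a

if __name__ == "__main__":
    # Toy stand-in for sequencer reads; a real run would stream from files.
    reads = ["ACGTACGT", "TTGACGTA", "ACGTTTGA"]
    with Pool() as pool:
        partials = pool.map(map_kmers, reads)  # data-parallel map phase
    totals = reduce(reduce_counts, partials, Counter())
    print(totals.most_common(5))
```

The map tasks share nothing, which is what lets MapReduce reschedule failed tasks and scale on commodity clouds; an MPI formulation would instead merge partial counts with collective communication and rewards the low-latency interconnects of supercomputers.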
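The "runtime scheduling problem" question on the compute-to-data slide can be made concrete with a toy locality-aware scheduler: prefer a node that already holds the input block, fall back to a remote node with a copy otherwise. The replica map imitates what an HDFS-style file system reports; every node name, block id, and mapping below is invented for illustration.

```python
# Toy compute-to-data scheduler. The replica map imitates what an
# HDFS-style file system reports; all names here are invented.

block_replicas = {                 # block id -> nodes holding a replica
    "blk-1": {"node-a", "node-b"},
    "blk-2": {"node-b", "node-c"},
    "blk-3": {"node-a", "node-c"},
}

def schedule(tasks, replicas, free):
    """Greedily assign tasks (named by input block) to nodes, data-local first."""
    placements = {}
    for block in tasks:
        local = replicas.get(block, set()) & free
        candidates = local or free
        if not candidates:
            placements[block] = (None, "queued")   # no free node: wait
            continue
        node = sorted(candidates)[0]               # deterministic pick for the demo
        free.remove(node)
        placements[block] = (node, "local" if local else "remote (copy input)")
    return placements

print(schedule(["blk-1", "blk-2", "blk-3"], block_replicas,
               {"node-a", "node-b", "node-c"}))
```

Real schedulers refine this greedy pass with rack awareness and short scheduling delays to raise the data-local fraction; the open question on the slide is whether that model survives wide-area, multi-repository Life Science data.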
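The "cost (time) of transporting data" bullet reduces, to first order, to volume divided by effective bandwidth. The run size, link speeds, and 50% efficiency derating below are assumed purely for illustration, not measured.

```python
# First-order transfer-time estimate: time = volume / effective bandwidth.
# Run size, link speeds, and the 50% derating are illustrative assumptions.

TB = 1e12  # bytes

def transfer_hours(volume_bytes, link_bps, efficiency=0.5):
    """Hours to move volume_bytes over a link of link_bps, derated by
    `efficiency` for protocol overhead and contention."""
    return volume_bytes * 8 / (link_bps * efficiency) / 3600

run = 1 * TB  # assumed output of one sequencing run
for name, bps in [("100 Mbps", 1e8), ("1 Gbps", 1e9), ("10 Gbps", 1e10)]:
    print(f"{name}: {transfer_hours(run, bps):6.1f} h per {run / TB:.0f} TB run")
```

At 100 Mbps, a single assumed 1 TB run takes roughly two days to move, which is why shipping disks or co-locating analysis with the sequencer keeps coming up in this discussion.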
How important is data security and privacy?
• Human genome processing cannot use the most cost-effective solutions, which will be shared resources such as public clouds
  – Commercial and military applications have similar constraints
• What other research applications have such concerns?
  – Analysis of copyrighted material such as digital books
• Partly a technical issue; partly a policy issue of establishing a trusted approach
  – Companies accept off-site paper storage?
• See recent hacking attacks, such as those on the Sony network and Gmail
• How important is fault tolerance/autonomic computing?