3DAPAS/ECMLS panel Dynamic Distributed Data Intensive Analysis Environments for Life Sciences:

advertisement
3DAPAS/ECMLS panel
Dynamic Distributed Data Intensive
Analysis Environments for Life Sciences:
June 8 2011 San Jose
Geoffrey Fox, Shantenu Jha, Dan
Katz, Judy Qiu, Jon Weissman
Discussion Topics?
• Programming methods: languages vs. frameworks
(advantages and disadvantages of each)
• Moving compute to data: Is the data localization model
imposed by Clouds scalable and/or sustainable?
• Does Life Sciences want clouds or supercomputers?
• Data model for Life Sciences; is it dynamic?, What is
size? What is Access pattern? Is it Shared or Individual?
• How important is data security and privacy?
Programming methods: languages vs. frameworks
for data intensive/Life Science areas
• SaaS offers “Blast etc.” on demand
• Role of PGAS, Data parallel compilers (like Chapel) i.e.
main stream HPC high level approaches
• Nodes v. Cores v. GPU’s – do hybrid programming
models have special features
• MapReduce v. MPI
• Distributed environments like SAGA
• Data parallel analysis languages like Pig Latin, Sawzall
• Role of databases like SciDB and SQL based analysis
– See DryadLINQ
• Is R (cloud R, parallel R) critical
– What about Excel, Matlab …
•
•
•
•
Moving compute to data: Is the data localization model
imposed by Clouds scalable and/or sustainable?
This related to privacy and programming model questions
Is data stored in central resources
Does data have co-located compute resources (cloud)
If co-located, are data and compute on same cluster
– How is data spread out over disks on nodes?
• Or is data in a storage system supporting wide area file
system shared by nodes of cloud?
• Or is data in a database (SciDB SkyServer)?
• Or is data in an object store like OpenStack?
• What kind of middleware exists, or needs to be developed to
enable effective compute-data movement? Or it just a runtime scheduling problem?
• What are performance issues and how do we get data there
for dynamic data as that produced by sequencers.
Data model for Life Sciences; is it dynamic?, What is size?
What is Access pattern? Is it Shared or Individual?
• Is it a few large centers?
• Is it a distributed set of repositories containing say all data from a
particular lab?
– Or both of the above?
– How to manage and present stream of new data
• The world created ~1000 exabytes of data this year – how much
will Life Sciences create?
• Relative importance of large shared data centers versus
instrumental or computer generated individually owned data?
• Is Data replication important?
• Storage model – files, objects, databases?
• How often is the different types of data read (presumably written
once!)
– Which data is most important? Raw or processed to some level?
• What is metadata challenge?
Does Life Sciences want Clouds or Supercomputers?
• Clouds are cost effective and elastic for varying need
• Supercomputers support low latency (MPI) parallel
applications
• Clouds main commercial offering; supercomputers main
academic large scale computing solution
– Also Open Science Grid, EGI ….
• Cost(time) of transporting data from sequencers and
repositories to analysis engines ( clouds)
– Will NLR or Internet2 link to clouds; they do to TeraGrid
• What can LS data-intensive community learn from the HEP
community? e.g., Will the HEP approach of community-wide
"workload management systems" and VOs work?
• What is the role of Campus Clusters/resources in genomic
data-sharing?
• No history of large cloud budgets in federal grants
How important is data security and privacy?
• Human Genome processing cannot use most cost
effective solutions which will be shared resources
such as public clouds
– Commercial, military applications
• What other research applications have such
concerns
– Analysis of copyrighted material such as digital books
• Partly technical; partly policy issue of establishing a
trusted approach
– Companies accept off site paper storage?
• See recent hacking attacks such as Sony network,
gmail
• How important is fault tolerance/autonomic
computing?
Download