The iPlant Collaborative
Cyberinfrastructure
Matt Vaughn
Cold Spring Harbor Laboratory
April 2010
What is iPlant?
• Simply put, the mission of the iPlant Collaborative is to build
Cyberinfrastructure to support the solution of the grand
challenges of plant biology.
• A “unique” aspect: the grand challenges were not defined in advance, but are identified through ongoing engagement with the community.
• Not a center, but a virtual organization forming grand challenge
teams and relying on the national CI.
• Long term focus on sustainable food supply, climate change,
biofuels, pharmaceuticals, etc.
• Hundreds of participants from around the world; Working group
members at > 50 US academic institutions, USDA, DOE, etc.
What is Cyberinfrastructure?
(Originally about TeraGrid)
It was six men of Indostan,
To learning much inclined,
Who went to see the elephant,
(Though all of them were blind),
That each by observation
Might satisfy his mind.
[Slide graphic, www.teragrid.org: the blind men around the TeraGrid elephant. "It's a Grid!" "It's a Network!" "They are HPC Centers!" "It's a Common Software Environment!" "It's Apps and Support!" "It's Storage!" "And More!: viz, facilities, data collections" …]
The iPlant CI
• Engagement with the CI Community to leverage best
practice and new research
• Unprecedented engagement with the user community to
drive requirements
• An exemplar virtual organization for modern
computational science
• A Foundation of Computational and Storage
Capability
• A single CI for all plant scientists, with customized
discovery environments to meet grand challenges
• Open source principles, commercial quality
development process
A Foundation of Computational and
Storage Capability
• iPlant is positioned to take advantage of *tremendous* amounts of NSF and institutional compute and storage resources:
  – Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
    • ~700 teraflops, more computing power than existed in all the Top 500 computers in the world four years ago
  – Storage: Corral, Ranch (UT), Ocotillo (ASU)
    • Well over 10 petabytes of storage can be made available for the project, on scalable systems capable of growing much more
  – Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
    • Among the world's largest visualization systems
  – Virtualized/Cloud Services: iPlant (UA) and ASU virtual environments, vendor clouds
    • iPlant is positioned to use cloud technologies to deliver persistent gateways and services to users
In short, the physical cyberinfrastructure employed via iPlant, built on large-scale NSF investments, has capabilities second to none anywhere on the planet.
A Single Cyberinfrastructure, Many Discovery Environments
• iPlant is constructing one constantly evolving software
environment
– A single architecture and “core”
– An ever-growing collection of many integrated tools and datasets
(many will be externally-sourced).
– Transparently leveraging an evolving national physical
infrastructure
• Customized for particular problems/use cases through the creation of individual "Discovery Environments" (DEs):
  – Have an interface customized to the particular problem domain
  – Integrate a specific collection of tools
  – Utilize the common core
  – Several DEs may exist to address a single grand challenge
  – Think of these like 'applications' (see the sketch below)
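To make the architecture concrete, here is a minimal, hypothetical sketch (the class and method names are invented for illustration, not iPlant's actual code) of a DE as a thin domain layer over one shared core:

```python
# Hypothetical sketch: a Discovery Environment (DE) as a thin,
# domain-specific layer over a single shared core. All names invented.

class CommonCore(object):
    """The shared core every DE relies on: auth, data access, jobs."""
    def submit_job(self, tool, inputs):
        print("submitting %s to the national CI with %r" % (tool, inputs))

class DiscoveryEnvironment(object):
    def __init__(self, name, tools, core):
        self.name = name          # interface customized to one problem domain
        self.tools = set(tools)   # a specific collection of integrated tools
        self.core = core          # every DE utilizes the common core

    def run(self, tool, inputs):
        if tool not in self.tools:
            raise ValueError("%s is not integrated into %s" % (tool, self.name))
        self.core.submit_job(tool, inputs)

core = CommonCore()
iptol = DiscoveryEnvironment("IPTOL tree building", ["RAxML", "NINJA"], core)
iptol.run("RAxML", {"alignment": "green_plants.phy"})
```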
Open Source Philosophy,
Commercial Quality Process
• iPlant is open in every sense of the word:
  – Open access to source
  – Open API to build a community of contributors
  – Open standards adopted wherever possible
  – Open access to data (where users so choose)
• iPlant code design, implementation, and quality control will be based on best industrial practice
Commercial Quality Process
• Agile development methodology has been adopted
• Complete product lifecycle in place:
  – Product Definition, Requirements Elicitation, Solution Design, Software Development, Acceptance Testing
• Code is only built after a rigorous requirements process:
  – Needs Analysis
  – User Personas
  – Problem Statement
  – User Stories
• The Grand Challenge Engagement Team plays the role of "Product Champion" and "Customer Advocate" in this scheme
Scope: What iPlant won’t do
• iPlant is not a funding agency
– A large grant shouldn’t become a bunch of small
grants
• iPlant does not fund data collection
• iPlant will (generally) not continue funding for
<favorite tool x> whose funding is ending.
• iPlant will not seek to replace all online data
repositories
• iPlant will not *impose* standards on the community.
Scope: What iPlant *will* do
• Provide storage, computation, hosting, and lots of
programmer effort to support grand challenge efforts.
• Work with the community to support and develop
standards
• Provide forums to discuss the role and design of CI in
plant science
• Help organize the community to collect data
• Provide appropriate funding for time spent helping us
design and test the CI
What is the iPlant CI?
• Two grand challenges defined to date:
– iPlant Tree of Life (IPTOL):
Build a single tree showing the evolutionary relationships of all green
plant species on Earth
– iPlant Genotype-to-Phenotype (IPG2P):
Construct a methodology whereby an investigator, given the genomic and environmental information about a given individual plant, can predict its characteristics.
Taken together, these challenges are the key to unlocking many "holy grails" of plant biology, such as the creation of drought-resistant or pest-resistant crops, or breaking reliance on fossil-fuel-based fertilizer
What is the iPlant CI?
• IPTOL CI:
– Five areas: Data assembly and integration, visualization, scalable
algorithms for large trees, trait evolution, tree reconciliation
• IPG2P CI:
– Five areas: Data Integration, Visualization, Modeling, Statistical Inference, Next-Gen Sequencing Tools
In both, a combination of applying compute resources, developing or enhancing new tools, and creating web-based "discovery environments" to integrate tools and facilitate collaboration.
Genotype-to-Phenotype (G2P)
Problem Statement
• Given: a particular
  – species of plant (e.g., corn, rice)
  – genetic description of an individual (genotype)
  – growth environment
  – trait of interest (flowering time, yield, or any of hundreds of others)
• Predict: the quantitative result (phenotype)
  – Top-priority problem in plant biology (NRC)
• Reverse problem: what genotype will yield the desired result in a given environment?
[Diagram: iPG2P concept. Data-integration (DI) services for metabolic, whole-plant, environment, expression, and sequence data feed Modeling and Statistical Inference plus Visualization components; super-users/developers extend the system, and users move from inferred hypotheses to experiments.]
iPG2P Working Groups
• Ultra High Throughput Sequencing
  – Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data
• Statistical Inference
  – Developing a platform using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools
  – Developing a framework to support tools for the construction, simulation, and analysis of computational models of plant function at various scales of resolution and fidelity
• Visual Analytics
  – Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, and in silico analyses and simulations
• Data Integration
  – Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities
UHTS Discovery Environment
Scalable services:
• Data sources: NCBI SRA, user local, iPlant store
• Metadata: MIAME, MINSEQE, SRA (handled by a metadata manager)
• Data wrangling: quality control, preprocessing, transformation
• Alignments: BWA, TopHat+Bowtie, producing SAM alignments (SAMTools)
• Outputs: expression levels (RPKM) via Cufflinks; variants (VCF 3.3)
User story: Arthur, an ecological genomics postdoc, is
looking for gene regulators by eQTL mapping expression
data in a panel of recombinant inbred lines he has
constructed and genotyped.
Coming Q2 2010
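As a rough sketch of the pipeline's shape (illustrative only, not iPlant's actual code; file names are placeholders, and the samtools calls use the 0.1.x syntax current in 2010):

```python
# Illustrative UHTS flow: spliced alignment, sort/index, quantify.
import subprocess

def run(cmd):
    print("+ " + " ".join(cmd))
    subprocess.check_call(cmd)

reads = "ril_42.fastq"            # placeholder read set
index = "athaliana_tair9"         # placeholder prebuilt Bowtie index

# 1. Spliced alignment against the reference (TopHat drives Bowtie).
run(["tophat", "-o", "tophat_out", index, reads])

# 2. Sort and index the alignments (samtools 0.1.x syntax).
run(["samtools", "sort", "tophat_out/accepted_hits.bam", "hits.sorted"])
run(["samtools", "index", "hits.sorted.bam"])

# 3. Estimate transcript expression; Cufflinks reports RPKM-style values.
run(["cufflinks", "-o", "cufflinks_out", "hits.sorted.bam"])

def rpkm(reads_on_transcript, transcript_bp, total_mapped):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return reads_on_transcript * 1e9 / (transcript_bp * float(total_mapped))
```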
Statistical Inference
• Network Inference
• QTL Mapping:
  – Regression (fixed, random effects; see the sketch below)
  – Maximum likelihood
  – Bayesian methods
  – Decision trees
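To ground the simplest of these approaches, a minimal single-marker fixed-effects regression scan (simulated data and illustrative shapes, assuming numpy and scipy; a sketch of the idea, not a production implementation):

```python
# Single-marker fixed-effects QTL scan on simulated data.
import numpy as np
from scipy import stats

n_indiv, n_markers = 200, 1000
genotypes = np.random.randint(0, 3, size=(n_indiv, n_markers))  # 0/1/2 allele counts
trait = np.random.randn(n_indiv)                                # one phenotype

pvals = np.empty(n_markers)
for j in range(n_markers):
    # Regress the trait on each marker independently.
    result = stats.linregress(genotypes[:, j], trait)
    pvals[j] = result[3]          # p-value of the slope

print("strongest association: marker %d, p = %.3g" % (pvals.argmin(), pvals.min()))
```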
Computational Challenges
The core scan: 6.5 million markers (two Arabidopsis-sized genomes at 5% diversity) tested against 38,963 expression phenotypes (the number of transcripts in Arabidopsis measured by UHTS) across all genotyped individuals.
• Single-SNP test: a few minutes
• 100-replicate bootstrap: a few hours
• Only gets larger for epistasis tests, forward model selection, fms+bootstrapping
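A back-of-the-envelope estimate (this arithmetic is mine, not from the original slide) shows why brute force does not scale:

```python
# Rough scale estimate for an exhaustive single-marker scan.
markers = 6.5e6          # two Arabidopsis-sized genomes @ 5% diversity
phenotypes = 3.9e4       # transcripts measured by UHTS

tests = markers * phenotypes
print("marker x phenotype tests: %.2e" % tests)        # ~2.5e11
# Even at an optimistic 1e6 tests per second, that is ~2.5e5 seconds
# (roughly three days) before any bootstrapping or epistasis testing.
print("seconds at 1e6 tests/s: %.2e" % (tests / 1e6))
```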
Statistical Genetics DE
• Data: user local, iPlant store
• Data wrangling: projection, imputation, conversion, transformation
• Configuration: user-specified, driver code
• Reconfigurable GLM kernel (scalable service): C/MPI/ScaLAPACK, GPU, and hybrid CPU implementations of the GLM computation kernel, returning significant results (see the sketch below)
Command-line environment and API expected Q3 2010
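A minimal sketch of what a reconfigurable GLM-style kernel means in practice (pure numpy on simulated data; the production kernel targets C/MPI/ScaLAPACK and GPUs): vectorize the per-marker regression so every marker's test statistic comes from one pass of matrix algebra.

```python
# Vectorized per-marker t-statistics via matrix algebra (illustrative).
import numpy as np

def marker_tstats(G, y):
    """G: individuals x markers genotype matrix; y: one trait vector."""
    n = G.shape[0]
    Gc = G - G.mean(axis=0)                      # center each marker
    yc = y - y.mean()
    sxx = (Gc ** 2).sum(axis=0)
    beta = Gc.T.dot(yc) / sxx                    # all per-marker slopes at once
    rss = ((yc[:, None] - Gc * beta) ** 2).sum(axis=0)
    se = np.sqrt(rss / (n - 2) / sxx)
    return beta / se

G = np.random.randint(0, 3, size=(200, 5000)).astype(float)
y = np.random.randn(200)
print("max |t| =", np.abs(marker_tstats(G, y)).max())
```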
Modeling Tools
• Integrated suite of tools for:
– model construction & simulation
– parameter estimation, sensitivity analysis
– verification
• Draw on existing SBML tools (see the sketch below)
• Protocol converters for network models
• Facilitate MIRIAM usage for code/model
verification
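For instance, reading and inspecting a model with the libsbml Python bindings might look like this (a sketch, assuming python-libsbml is installed; "model.xml" is a placeholder path):

```python
# Load an SBML model and report its basic contents.
import libsbml

doc = libsbml.readSBML("model.xml")
if doc.getNumErrors() > 0:
    doc.printErrors()                       # surface validation problems
else:
    model = doc.getModel()
    print("species: %d, reactions: %d" %
          (model.getNumSpecies(), model.getNumReactions()))
    for species in model.getListOfSpecies():
        print(species.getId(), species.getInitialConcentration())
```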
Data Integration Principles
G2P Biology is data-driven science. Integration is
key: information curators already exist and do
extremely good work.
• No monolithic iPlant database(s)
• Provide virtual databases via services
• Provenance preservation
• Foster and actively support standards adoption
• Match orphan data sets with interested
researchers & educators
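A hypothetical illustration of the "virtual database via services" idea (the endpoints and fields below are invented for the example): query each curated source in place and merge results, tagging provenance rather than copying everything into one monolithic store.

```python
# Federate curated sources behind one query; preserve provenance.
import json
from urllib.request import urlopen

SOURCES = {                                   # invented endpoints
    "curated_genomes": "https://example.org/genomes/query?gene=",
    "expression_atlas": "https://example.org/atlas/query?gene=",
}

def virtual_query(gene_id):
    merged = []
    for name, base_url in SOURCES.items():
        for record in json.load(urlopen(base_url + gene_id)):
            record["provenance"] = name       # where this record came from
            merged.append(record)
    return merged
```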
[Diagram: existing genetic and genomic data plus existing expression, metabolomic, network, and physical-phenotype data flow through a Data Integration Layer; generation of new genomic data (re-sequencing, de novo sequencing) supplies genotype, and generation of new phenotype data (RNA-seq, high-throughput phenotyping, image analysis) supplies phenotype, together enabling powerful statistical inference.]
High-throughput Image Analysis
• Physical infrastructure: cameras, scanners, etc.
• HTIP service layer: web GUI and RESTful API (httpd) for workflow control; data-intake and consumer processes over an RDBMS with a defined database schema
• Inputs: serial images, multichannel images, volumetric data, movies
• Algorithm plugins: Python, MATLAB, C/C++
• Scalable services: semantic storage and retrieval of images and metadata; storage of derived results from analysis procedures
Requirements elicitation ongoing
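Since requirements are still being elicited, here is only a hypothetical sketch of what an algorithm-plugin contract could look like (all names invented, not the actual HTIP API):

```python
# Hypothetical plugin contract for the image-analysis service layer.

class ImagePlugin(object):
    """Plugins declare which input types they accept and return
    derived results for storage alongside their provenance."""
    accepts = ()                  # e.g. ("serial", "multichannel")

    def analyze(self, image, metadata):
        raise NotImplementedError

class LeafAreaPlugin(ImagePlugin):
    accepts = ("serial",)

    def analyze(self, image, metadata):
        area = sum(1 for px in image if px > 0)   # toy threshold count
        return {"leaf_area_px": area, "camera": metadata.get("camera")}

plugin = LeafAreaPlugin()
print(plugin.analyze([0, 5, 9, 0, 3], {"camera": "scanner-01"}))
```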
Plant Biology CI Empowerment Strategy
[Diagram: grand challenge (GC) solutions link evolutionary biology, plant ecology, phenotyping, and plant genomics. Component capabilities include tree reconciliation, big trees, trait evolution, taxonomic intelligence, the green plant Tree of Life, tree decoration, and visualization on the evolutionary-biology/ecology side; statistical inference, modeling, flowering phenology, stress & adaptation, C3/C4 evolution, image analysis, data integration, and next-gen sequencing on the phenotyping/genomics side.]
Technology and the iPC CI
[Layered stack, top to bottom:]
• User
• iPlant Discovery Environments: grand challenge workflows and iPlant interfaces; third-party tools, iPlant-built tools, community-contributed tools and data!
• iPlant Middleware: job submission, workflow management, service/data APIs (iRODS, grid technologies, Condor, RESTful services)
• Physical Infrastructure: compute, storage, persistent virtual machines (TeraGrid, Open Science Grid, UA/ASU/TACC)
Build a CI that's robust, leverages national infrastructure, and can grow through community contribution!
Technical questions? Contact Nirav Merchant – nirav@email.arizona.edu
iPlant : Connecting Users, Ideas & Resources
Core CI Foundation:
Data layer
Registry and Integration layer
Compute and Analysis layer
Interaction & Collaboration layer
iPlant: Using proven technologies
• Data layer: providing access to raw and ingested data sets, including high-throughput data transfers
  – iRODS
  – GridFTP, Aspera
  – DSpace (DuraSpace), Open Archives Initiative
  – Content Distribution Networks (CDN)
  – High-performance storage @ TACC (Lustre)
  – MySQL and Postgres database clusters
  – Connection to DataONE and other DataNet initiatives
  – Cloud-style storage (similar to Amazon S3 and Walrus)
iPlant: Using proven technologies
• Registry and Integration layer: connecting services, data, and metadata elements using semantic understanding
iPlant: Using proven technologies
• Compute and Analysis layer: connecting tasks with scalable platforms and algorithms
  – Virtualization (Xen clusters)
  – High-performance computing at TACC and TeraGrid
  – Grid (Condor, BOINC, Gearman)
  – Cloud (Eucalyptus, Nimbus, Hadoop)
  – Reconfigurable hardware (GPU, FPGA)
  – Checkpoint & restart (DMTCP)
  – Scaling and parallelizing code (MPI)
iPlant: Using proven technologies
• Interaction and Collaboration layer: providing end-user access to unified services and data, from API to large-scale visualization
  – Google Web Toolkit (GWT-driven front end)
  – Messaging bus (Java Mule, XMPP/Jabber)
  – RESTful web services (web API access)
  – Single sign-on/identity management (Shibboleth, OAuth)
  – Integration with desktop applications (via web services)
  – Sharing data (DOI, persistent URL, CDN, social networks)
  – Large-scale visualization (Large Tree, ParaView, ENVISION)
An Example Discovery Environment
[Slide shows a screenshot of the Discovery Environment interface.]
First DE
• Support for one use case: independent
contrasts. But also…
– Seamless remote execution of compute tasks on TeraGrid
resources
– Incorporation of existing informatics tools behind iPlant interface
– Parsing of multiple data formats into Common Semantic Model
– Seamless integration of online data resources
– Role based access and basic provenance support
• Next version will support:
– Ultra High Throughput Sequencing pipeline, Variant Detection,
Transcript Quantification
– Public RESTful API
Example Service API
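The original slide showed a screenshot. As a stand-in, here is a hypothetical sketch of calling a RESTful iPlant-style service (the endpoint and JSON fields are invented, not the actual API):

```python
# Hypothetical RESTful call: submit-and-poll style job service.
import json
from urllib.request import urlopen

base = "https://api.example.org/iplant/v1"    # invented endpoint
job = json.load(urlopen(base + "/jobs?tool=raxml&alignment=my.phy"))
print(job["job_id"], job["status"])           # e.g. poll until "COMPLETED"
```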
Acknowledgments

University of Arizona
Rich Jorgensen
Greg Andrews
Kobus Barnard
Rick Blevins
Sue Brown
Vicki Bryan
Vicki Chandler
John Hartman
Travis Huxman
Tina Lee
Nirav Merchant
Martha Narro
Sudha Ram
Steve Rounsley
Suzanne Westbrook
Ramin Yadegari
Cold Spring Harbor Laboratory, NY
Lincoln Stein
Matt Vaughn
Doreen Ware
Dave Micklos
Sheldon McKay
Jerry Lu
Liya Wang
Texas Advanced Computing Center
Dan Stanzione
Michael Gonzales
Chris Jordan
Greg Abram
Weijia Xu
University of North Carolina-Wilmington
Ann Stapleton
Funded by NSF
Collaborating Institutions
• CSHL: iPlant CI
• UCD
• EMEC: External Evaluator
• VA Tech: iPG2P
• TACC: iPlant CI
• Brown: iPToL
• UNCW: iPlant CI
• UFL: iPToL
• Field Museum of Natural History
• UGA: iPToL
• UPenn: iPToL
• MoBot: APWeb2
• UTK: iPToL
• BIEN
• Yale: iPToL
• UCSB: Taxonomic Intelligence, iPG2P, Image Platform
• UWISC: Image Platform
• Boyce Thompson Inst.: iPG2P
• KSU: iPG2P
Soft Collaborators
• 1kP Consortium
• ARS at USDA
• BRIT: Botanical Research Institute of Texas
• CGIAR and Generation Challenge Program
• CIPRES: Cyberinfrastructure for Phylogenetic Research
• The Croquet Consortium
• NIMBioS: National Institute for Mathematical and Biological Synthesis
• Pittsburgh Supercomputing Center
• pPOD: processing PhyloData
• Syngenta Foundation
• NanoHub & HubZero
• ELIXIR
• Fluxnet
• Howard Hughes Medical Institute
• Knowledgebase
• NPN: National Phenology Network
• PEaCE Lab: Pacific Ecoinformatics and Computational Ecology Lab
• MORPH: Research Coordination Network (RCN)
• NCEAS: National Center for Ecological Analysis and Synthesis
• NEON: National Ecological Observation Network
• NESCent: National Evolutionary Synthesis Center
Unprecedented Engagement with the
Plant Science User Community
• A unique engagement process
– The Grand Challenge process has resulted in the most intensive
user input of any large scale CI project to date.
• iPlant will construct a single CI for plant science, driven by grand challenges and specific user needs
• Grand Challenge Engagement Teams will continue this
very close cooperation with the community
– Work closely with the GC proposal team and the broader
community
– Build use cases to drive development
An Exemplar Virtual Organization for Modern
Computational Science
• iPlant aims to be the Gold Standard against which other
science-focused CI projects will be measured.
• One Cyberinfrastructure Team, many skills and roles
– iPC CI Creation is done by a diverse group:
• Faculty, postdocs, staff, and students
• Bioinformatics, Biology, Computing and Information
Researchers, Software Engineers, Database Specialists, etc.
• Arizona, Cold Spring Harbor, Texas, etc.
– Many different tasks:
• Engagement/Requirements, Tech Eval, Prototyping, Software
Design (DE and Core), Data Integration, Systems, many more.
• A single Cyberinfrastructure Team, where roles may
change rapidly to match skill sets
Timelines/Milestones
• Growth in staffing & capability; from a few in March 2009,
now 47 involved in CI across all sites.
• Architecture definition in August-Sept 2009; enough to get
started, still evolving.
• Software environment, tools, practices laid down about the
same time.
• Real SW development commenced in September 2009.
• Serious prototyping and tool support in response to Engagement Team (ET) needs began ramping up in November.
Technology Eval Activities
• Largest investment in semantic web activities
– Key for addressing the massive data
integration challenges
• Exploring alternate implementations of QTL
mapping algorithms
• Experimental Reproducibility
• Policy and Technology for Provenance
Management
• Evaluation of HubZero, Workflow engines,
numerous other tools
IPTOL CI – A High Level Overview
• Goal: Build very large trees, perhaps all green
plant species
• Needs:
– Most of the data isn’t collected. A lot of what is
collected isn’t organized.
– Lots of analysis tools exist (probably plenty of them), but they don't work together and use many different data formats.
– The tree builder tools take too long to run.
– The visualization tools don’t scale to the tree sizes
needed.
IPTOL CI – High Level
• Addressing these needs through CI
– MyPlant – the social networking site for phylogenetic
data collection (organized by clade)
– Provide a common repository for data without an
NCBI home (e.g. 1kP)
– Discovery Environment: Build a common interface,
data format, and API to unite tools.
– Enhance tree builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing (see the sketch below)
– Build a remote visualization tool capable of running
where we can guarantee RAM resources
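For example, a standard RAxML invocation (real flags, placeholder file names) of the sort such a service would wrap, parallelize, and checkpoint:

```python
# Drive a RAxML tree search from Python (illustrative wrapper).
import subprocess

subprocess.check_call([
    "raxmlHPC",
    "-s", "green_plants.phy",   # input alignment (placeholder)
    "-n", "iptol_run1",         # run name used for output files
    "-m", "GTRGAMMA",           # nucleotide substitution model
    "-p", "12345",              # random seed for the starting tree
])
```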