The iPlant Collaborative Cyberinfrastructure
Matt Vaughn
Cold Spring Harbor Laboratory
April 2010

What is iPlant?
• Simply put, the mission of the iPlant Collaborative is to build cyberinfrastructure to support the solution of the grand challenges of plant biology.
• A "unique" aspect: the grand challenges were not defined in advance, but are identified through an ongoing engagement with the community.
• Not a center, but a virtual organization forming grand challenge teams and relying on the national CI.
• Long-term focus on sustainable food supply, climate change, biofuels, pharmaceuticals, etc.
• Hundreds of participants from around the world; working group members at >50 US academic institutions, USDA, DOE, etc.

What is Cyberinfrastructure? (Originally about TeraGrid)
It was six men of Indostan,
To learning much inclined,
Who went to see the elephant,
(Though all of them were blind),
That each by observation
Might satisfy his mind.
[Diagram, WWW.TERAGRID.ORG: the blind men each see something different: "It's a Grid!", "It's a Network!", "They are HPC Centers!", "It's a Common Software Environment!", "It's Apps and Support!", "It's Storage!", and more: visualization, facilities, data collections]

The iPlant CI
• Engagement with the CI community to leverage best practice and new research
• Unprecedented engagement with the user community to drive requirements
• An exemplar virtual organization for modern computational science
• A foundation of computational and storage capability
• A single CI for all plant scientists, with customized discovery environments to meet grand challenges
• Open source principles, commercial-quality development process

A Foundation of Computational and Storage Capability
• iPlant is positioned to take advantage of *tremendous* amounts of NSF and institutional compute and storage resources:
  – Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
    • ~700 teraflops, more computing power than existed in all the Top 500 computers in the world 4 years ago
  – Storage: Corral, Ranch (UT), Ocotillo (ASU)
    • Well over 10 petabytes of storage can be made available for the project, on scalable systems capable of growing much more
  – Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
    • Among the world's largest visualization systems
  – Virtualized/Cloud Services: iPlant (UA) and ASU virtual environments, vendor clouds
    • iPlant is positioned to use cloud technologies to deliver persistent gateways and services to users
• In short, the physical aspects of cyberinfrastructure employed via iPlant, utilizing large-scale NSF investments, have capabilities second to none anywhere on the planet.

A Single Cyberinfrastructure, Many Discovery Environments
• iPlant is constructing one constantly evolving software environment:
  – A single architecture and "core"
  – An ever-growing collection of integrated tools and datasets (many externally sourced)
  – Transparently leveraging an evolving national physical infrastructure
• Customized for particular problems/use cases through the creation of individual "Discovery Environments" (DEs):
  – Have an interface customized to the particular problem domain
  – Integrate a specific collection of tools
  – Utilize the common core
  – Several DEs may exist to address a single grand challenge
  – Think of these like "applications"

Open Source Philosophy, Commercial Quality Process
• iPlant is open in every sense of the word:
  – Open access to source
  – Open API to build a community of contributors
  – Open standards adopted wherever possible
  – Open access to data (where users so choose)
• iPlant code design, implementation, and quality control will be based on industry best practice

Commercial Quality Process
• An Agile development methodology has been adopted
• Complete product lifecycle in place:
  – Product Definition, Requirements Elicitation, Solution Design, Software Development, Acceptance Testing
• Code is only built after a rigorous requirements process:
  – Needs Analysis
  – User Persona
  – Problem Statement
  – User Stories
• The Grand Challenge Engagement Team plays the role of "Product Champion" and "Customer Advocate" in this scheme

Scope: What iPlant won't do
• iPlant is not a funding agency
  – A large grant shouldn't become a bunch of small grants
• iPlant does not fund data collection
• iPlant will (generally) not continue funding for <favorite tool x> whose funding is ending
• iPlant will not seek to replace all online data repositories
• iPlant will not *impose* standards on the community

Scope: What iPlant *will* do
• Provide storage, computation, hosting, and lots of programmer effort to support grand challenge efforts
• Work with the community to support and develop standards
• Provide forums to discuss the role and design of CI in plant science
• Help organize the community to collect data
• Provide appropriate funding for time spent helping us design and test the CI

What is the iPlant CI?
• Two grand challenges defined to date:
  – iPlant Tree of Life (iPToL): Build a single tree showing the evolutionary relationships of all green plant species on Earth
  – iPlant Genotype-to-Phenotype (iPG2P): Construct a methodology whereby an investigator, given the genomic and environmental information about a given individual plant, can predict its characteristics
• Taken together, these challenges are the key to unlocking many "holy grails" of plant biology, such as the creation of drought-resistant or pest-resistant crops, or breaking reliance on fossil-fuel-based fertilizer

What is the iPlant CI?
• iPToL CI: five areas: data assembly and integration, visualization, scalable algorithms for large trees, trait evolution, tree reconciliation
• iPG2P CI: five areas: data integration, visualization, modeling, statistical inference, next-gen sequencing tools
• In both, a combination of applying compute resources, developing or enhancing new tools, and creating web-based "discovery environments" to integrate tools and facilitate collaboration

Genotype-to-Phenotype (G2P) Problem Statement
• Given a particular:
  – species of plant (e.g. corn, rice)
  – genetic description of an individual (genotype)
  – growth environment
  – trait of interest (flowering time, yield, or any of hundreds of others)
• Predict: the quantitative result (phenotype)
• Top-priority problem in plant biology (NRC)
• Reverse problem: What genotype will yield the desired result in a given environment? (A toy sketch of the forward problem follows.)
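To make the forward problem concrete, here is a minimal sketch (a toy stand-in, not iPlant's actual methodology) that treats the phenotype as an additive function of marker genotypes plus an environment covariate and fits it by least squares. All data sizes and values are made-up placeholders.

# Toy sketch of the forward G2P problem as an additive linear model:
# phenotype ~ genotype effects + environment effect. All data below are
# random placeholders, not an iPlant dataset or API.
import numpy as np

rng = np.random.default_rng(0)
n_lines, n_markers = 200, 50
G = rng.integers(0, 2, size=(n_lines, n_markers)).astype(float)  # biallelic marker calls
E = rng.normal(size=(n_lines, 1))                                # one environment covariate
true_beta = rng.normal(size=n_markers) * (rng.random(n_markers) < 0.1)
y = G @ true_beta + 0.5 * E[:, 0] + rng.normal(scale=0.3, size=n_lines)

# Fit intercept + genotype + environment effects jointly by least squares
X = np.hstack([np.ones((n_lines, 1)), G, E])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

# Predict the trait for a new genotype in a new environment
g_new = rng.integers(0, 2, size=n_markers).astype(float)
e_new = 0.8
y_hat = coef[0] + g_new @ coef[1:1 + n_markers] + e_new * coef[-1]
print(f"predicted phenotype: {y_hat:.3f}")

The reverse problem then amounts to searching genotype space for the value of g_new that optimizes the predicted trait in a fixed environment.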
[Diagram: the iPG2P workflow. Sequence, expression, environment, metabolic, and whole-plant data flow through Discovery Environments for data integration, modeling and statistical inference, and visualization; users and super-users/developers move between inferred hypotheses and new experiments]

iPG2P Working Groups
• Ultra High Throughput Sequencing
  – Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data
• Statistical Inference
  – Developing a platform using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools
  – Developing a framework to support tools for the construction, simulation and analysis of computational models of plant function at various scales of resolution and fidelity
• Visual Analytics
  – Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations
• Data Integration
  – Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities

UHTS Discovery Environment
• Built as scalable services:
  – Data: NCBI SRA, user-local, iPlant store
  – Metadata: MIAME, MINSEQE, SRA (handled by a metadata manager)
  – Data wrangling: quality control, preprocessing, transformation
  – Alignments: BWA, TopHat + Bowtie, producing SAM alignments (SAMtools)
  – Expression levels (RPKM) via Cufflinks
  – Variants (VCF 3.3)
• User story: Arthur, an ecological genomics postdoc, is looking for gene regulators by eQTL mapping expression data in a panel of recombinant inbred lines he has constructed and genotyped.
• Coming Q2 2010

Statistical Inference
• Network Inference
• QTL Mapping
  – Regression (fixed, random effects)
  – Maximum likelihood
  – Bayesian methods
  – Decision trees

Computational Challenges
• 6.5 million markers: two Arabidopsis-sized genomes @ 5% diversity
• 38,963 expression phenotypes: the number of transcripts in Arabidopsis measured by UHTS
[Diagram: an individuals × 6.5e6 marker matrix crossed with an individuals × 3.9e4 expression-phenotype matrix]
• Single-SNP test: a few minutes
• 100-replicate bootstrap: a few hours
• Only gets larger for epistasis tests, forward model selection, fms + bootstrapping

Statistical Genetics DE
• Data: user-local, iPlant store
• Data wrangling: projection, imputation, conversion, transformation
• Configuration: user-specified, driver code
• Reconfigurable GLM computation kernel, run as a scalable service (C/MPI/ScaLAPACK, GPU, hybrid CPU), returning significant results (a toy version is sketched below)
• Command-line environment and API expected Q3 2010
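To give a feel for the single-SNP test above, here is a minimal sketch (a NumPy toy, not the C/MPI/ScaLAPACK kernel itself) that regresses one expression phenotype on each marker in turn and ranks markers by F statistic. The data are random placeholders at a deliberately reduced size.

# Toy single-SNP scan: regress one expression phenotype on each marker
# and keep the 1-df F statistic. A NumPy stand-in for the reconfigurable
# GLM kernel, with made-up data at reduced scale.
import numpy as np

rng = np.random.default_rng(1)
n_indiv, n_markers = 100, 10_000          # full problem is ~6.5e6 markers
markers = rng.integers(0, 3, size=(n_indiv, n_markers)).astype(float)
pheno = rng.normal(size=n_indiv)          # one of ~3.9e4 expression traits

# Center once; each per-SNP simple regression then has a closed form.
x = markers - markers.mean(axis=0)
y = pheno - pheno.mean()

sxx = (x ** 2).sum(axis=0)
beta = (x * y[:, None]).sum(axis=0) / sxx               # per-SNP slope
resid_ss = ((y[:, None] - x * beta) ** 2).sum(axis=0)   # per-SNP residual SS
f_stat = beta ** 2 * sxx / (resid_ss / (n_indiv - 2))   # 1-df F statistic

top = int(np.argmax(f_stat))
print(f"best marker index: {top}, F = {f_stat[top]:.2f}")

At the full 6.5e6 × 3.9e4 scale, the same closed-form scan becomes a distributed matrix computation, which is why the production kernel targets MPI/ScaLAPACK and GPUs.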
Modeling Tools
• Integrated suite of tools for:
  – model construction & simulation
  – parameter estimation, sensitivity analysis
  – verification
• Draw on existing SBML tools
• Protocol converters for network models
• Facilitate MIRIAM usage for code/model verification

Data Integration Principles
• G2P biology is data-driven science. Integration is key: information curators already exist and do extremely good work.
• No monolithic iPlant database(s)
• Provide virtual databases via services
• Provenance preservation
• Foster and actively support standards adoption
• Match orphan data sets with interested researchers & educators

[Diagram: existing genetic/genomic data and existing expression, metabolomic, network, and physical phenotype data pass through a Data Integration Layer linking genotype and phenotype for powerful statistical inference; new genomic data come from re-sequencing and de novo sequencing, and new phenotype data from RNA-seq, high-throughput phenotyping, and image analysis]

High-throughput Image Analysis
• [Diagram: cameras, scanners, and other physical infrastructure feed data-intake processes behind an HTIP service layer (web GUI, workflow control, RESTful API, httpd), with an RDBMS schema and consumer processes downstream]
• Inputs: serial images, multichannel images, volumetric data, movies
• Algorithm plugins: Python, MATLAB, C/C++
• Scalable services:
  – Semantic storage and retrieval of images and metadata
  – Storage of derived results from analysis procedures
• Requirements elicitation ongoing

Plant Biology CI Empowerment Strategy
[Diagram: grand-challenge (GC) solutions (tree reconciliation, big trees, trait evolution, taxonomic intelligence, Green Plant ToL, tree decoration, visualization, statistical inference, modeling, flowering phenology, stress & adaptation, C3/C4 evolution, image analysis, data integration, next-gen sequencing) bridging the plant biology domains of evolutionary biology, plant ecology, phenotyping, and plant genomics]

Technology and the iPC CI
• User: grand challenge workflows, iPlant interfaces; third-party tools, iPlant-built tools, community-contributed tools and data
• iPlant Discovery Environments: job submission, workflow management, service/data APIs
• iPlant Middleware: iRODS, grid technologies, Condor, RESTful services
• Physical Infrastructure: compute, storage, persistent virtual machines on TeraGrid, Open Science Grid, and UA/ASU/TACC
• Build a CI that's robust, leverages national infrastructure, and can grow through community contribution!

Technical questions? Contact Nirav Merchant – nirav@email.arizona.edu

iPlant: Connecting Users, Ideas & Resources
• Core CI foundation:
  – Data layer
  – Registry and Integration layer
  – Compute and Analysis layer
  – Interaction & Collaboration layer

iPlant: Using proven technologies
• Data layer: providing access to raw and ingested data sets, including high-throughput data transfers (a scripted transfer sketch follows this list)
  – iRODS
  – GridFTP, Aspera
  – DSpace (DuraSpace), Open Archives Initiative
  – Content distribution networks (CDN)
  – High-performance storage @ TACC (Lustre)
  – MySQL and Postgres database clusters
  – Connections to DataONE and other DataNet initiatives
  – Cloud-style storage (similar to Amazon S3 and Walrus)
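As a flavor of how high-throughput transfers in this layer are typically scripted, here is a minimal sketch wrapping the standard GridFTP client (globus-url-copy) from Python. The endpoint host and file paths are hypothetical placeholders, not actual iPlant resources.

# Minimal sketch of a scripted high-throughput GridFTP transfer using the
# standard globus-url-copy client. Hosts and paths are hypothetical
# placeholders, not actual iPlant endpoints.
import subprocess

def gridftp_fetch(src_url: str, dest_path: str, parallel_streams: int = 4) -> None:
    """Copy a remote file to local disk using parallel data streams."""
    subprocess.run(
        ["globus-url-copy", "-p", str(parallel_streams),
         src_url, f"file://{dest_path}"],
        check=True,
    )

if __name__ == "__main__":
    gridftp_fetch(
        "gsiftp://gridftp.example.org/data/reads/sample01.fastq",  # hypothetical host
        "/scratch/sample01.fastq",
    )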
iPlant: Using proven technologies
• Registry and Integration layer: connecting services, data, and metadata elements using semantic understanding

iPlant: Using proven technologies
• Compute and Analysis layer: connecting tasks with scalable platforms and algorithms
  – Virtualization (Xen clusters)
  – High-performance computing at TACC and TeraGrid
  – Grid (Condor, BOINC, Gearman)
  – Cloud (Eucalyptus, Nimbus, Hadoop)
  – Reconfigurable hardware (GPU, FPGA)
  – Checkpoint & restart (DMTCP)
  – Scaling and parallelizing code (MPI)

iPlant: Using proven technologies
• Interaction and Collaboration layer: providing end-user access to unified services and data, from API to large-scale visualization
  – Google Web Toolkit (GWT-driven front end)
  – Messaging bus (Java Mule, XMPP/Jabber)
  – RESTful web services (web API access)
  – Single sign-on/identity management (Shibboleth, OAuth)
  – Integration with desktop applications (via web services)
  – Sharing data (DOI, persistent URL, CDN, social networks)
  – Large-scale visualization (Large Tree, ParaView, ENVISION)

An Example Discovery Environment
[Slide shows a screenshot of a Discovery Environment]

First DE
• Support for one use case: independent contrasts. But also…
  – Seamless remote execution of compute tasks on TeraGrid resources
  – Incorporation of existing informatics tools behind the iPlant interface
  – Parsing of multiple data formats into a Common Semantic Model
  – Seamless integration of online data resources
  – Role-based access and basic provenance support
• Next version will support:
  – Ultra High Throughput Sequencing pipeline, variant detection, transcript quantification
  – Public RESTful API

Example Service API
[Slide shows a screenshot of an example service API call; a sketch of such a call follows]
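As an illustration of what calling such a RESTful service could look like from Python, here is a minimal sketch. The host, route, and parameters are hypothetical placeholders, not the published iPlant API.

# Minimal sketch of a RESTful service call of the kind the DE exposes.
# The base URL, route, tool name, and parameters are hypothetical
# placeholders, not the actual iPlant service API.
import json
import urllib.request

BASE = "https://services.example.org/iplant/v1"  # hypothetical endpoint

def submit_job(tool: str, inputs: dict) -> dict:
    """POST a JSON job description and return the service's JSON reply."""
    req = urllib.request.Request(
        f"{BASE}/jobs",
        data=json.dumps({"tool": tool, "inputs": inputs}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# e.g. submit_job("independent-contrasts",
#                 {"tree": "mytree.nwk", "traits": "traits.csv"})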
Acknowledgments
• University of Arizona: Rich Jorgensen, Greg Andrews, Kobus Barnard, Rick Blevins, Sue Brown, Vicki Bryan, Vicki Chandler, John Hartman, Travis Huxman, Tina Lee, Nirav Merchant, Martha Narro, Sudha Ram, Steve Rounsley, Suzanne Westbrook, Ramin Yadegari
• Cold Spring Harbor Laboratory, NY: Lincoln Stein, Matt Vaughn, Doreen Ware, Dave Micklos, Sheldon McKay, Jerry Lu, Liya Wang
• Texas Advanced Computing Center: Dan Stanzione, Michael Gonzales, Chris Jordan, Greg Abram, Weijia Xu
• University of North Carolina-Wilmington: Ann Stapleton
• Funded by NSF

Collaborating Institutions
• CSHL – iPlant CI
• UCD
• EMEC – external evaluator
• VA Tech – iPG2P
• TACC – iPlant CI
• Brown – iPToL
• UNCW – iPlant CI
• UFL – iPToL
• Field Museum of Natural History
• UGA – iPToL
• UPenn – iPToL
• MoBot – APWeb2
• UTK – iPToL
• BIEN
• Yale – iPToL
• UCSB – Taxonomic Intelligence, iPG2P image platform
• UWISC – image platform
• Boyce Thompson Inst. – iPG2P
• KSU – iPG2P

Soft Collaborators
• 1kP Consortium
• ARS at USDA
• BRIT: Botanical Research Institute of Texas
• CGIAR and Generation Challenge Program
• CIPRES: Cyberinfrastructure for Phylogenetic Research
• The Croquet Consortium
• NIMBioS: National Institute for Mathematical and Biological Synthesis
• Pittsburgh Supercomputing Center
• pPOD: processing PhyloData
• Syngenta Foundation
• NanoHub & HubZero
• ELIXIR
• Fluxnet
• Howard Hughes Medical Institute Knowledgebase
• NPN: National Phenology Network
• PEaCE Lab: Pacific Ecoinformatics and Computational Ecology Lab
• MORPH: Research Coordination Network (RCN)
• NCEAS: National Center for Ecological Analysis and Synthesis
• NEON: National Ecological Observation Network
• NESCent: National Evolutionary Synthesis Center

Unprecedented Engagement with the Plant Science User Community
• A unique engagement process
  – The Grand Challenge process has resulted in the most intensive user input of any large-scale CI project to date
• iPlant will construct a single CI for plant science, driven by grand challenges and specific user needs
• Grand Challenge Engagement Teams will continue this very close cooperation with the community
  – Work closely with the GC proposal team and the broader community
  – Build use cases to drive development

An Exemplar Virtual Organization for Modern Computational Science
• iPlant aims to be the gold standard against which other science-focused CI projects will be measured
• One cyberinfrastructure team, many skills and roles
  – iPC CI creation is done by a diverse group:
    • Faculty, postdocs, staff, and students
    • Bioinformatics, biology, computing and information researchers, software engineers, database specialists, etc.
    • Arizona, Cold Spring Harbor, Texas, etc.
  – Many different tasks:
    • Engagement/requirements, tech eval, prototyping, software design (DE and core), data integration, systems, and many more
• A single cyberinfrastructure team, where roles may change rapidly to match skill sets

Timelines/Milestones
• Growth in staffing & capability: from a few in March 2009 to 47 now involved in CI across all sites
• Architecture definition in August–September 2009; enough to get started, still evolving
• Software environment, tools, and practices laid down about the same time
• Real software development commenced in September 2009
• Serious prototyping and tool support in response to ET needs began ramping up in November

Technology Eval Activities
• Largest investment in semantic web activities
  – Key for addressing the massive data integration challenges
• Exploring alternate implementations of QTL mapping algorithms
• Experimental reproducibility
• Policy and technology for provenance management
• Evaluation of HubZero, workflow engines, and numerous other tools

iPToL CI – A High-Level Overview
• Goal: build very large trees, perhaps covering all green plant species
• Needs:
  – Most of the data isn't collected. A lot of what is collected isn't organized.
  – Lots of analysis tools exist (probably plenty of them), but they don't work together and use many different data formats.
  – The tree-builder tools take too long to run.
  – The visualization tools don't scale to the tree sizes needed.

iPToL CI – High Level
• Addressing these needs through CI:
  – MyPlant: the social networking site for phylogenetic data collection (organized by clade)
  – Provide a common repository for data without an NCBI home (e.g. 1kP)
  – Discovery Environment: build a common interface, data format, and API to unite tools
  – Enhance tree-builder tools (RAxML, NINJA, SATé) with parallelization and checkpointing (see the sketch below)
  – Build a remote visualization tool capable of running where we can guarantee RAM resources
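To illustrate what driving one of these tree builders looks like, here is a minimal sketch launching a multithreaded RAxML search from Python. The alignment file, run name, and thread count are placeholders, and the checkpoint-restart wiring iPlant adds is not shown.

# Minimal sketch of launching a multithreaded RAxML maximum-likelihood
# search from Python. Alignment file, run name, and thread count are
# placeholders; iPlant's parallelization/checkpointing layers are not shown.
import subprocess

def run_raxml(alignment: str, run_name: str, threads: int = 8) -> None:
    """Run a GTR+GAMMA ML search with the Pthreads build of RAxML."""
    subprocess.run(
        [
            "raxmlHPC-PTHREADS",
            "-T", str(threads),   # worker threads
            "-m", "GTRGAMMA",     # nucleotide substitution model
            "-p", "12345",        # random seed for the parsimony start tree
            "-s", alignment,      # input alignment (PHYLIP/FASTA)
            "-n", run_name,       # suffix for the RAxML_* output files
        ],
        check=True,
    )

if __name__ == "__main__":
    run_raxml("green_plants.phy", "iptol_test")  # placeholder input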