The iPlant Collaborative
IBP Annual Meeting – June 1st 2011

Steve Goff
iPlant Collaborative, BIO5 Institute
School of Plant Science
University of Arizona
www.iplantcollaborative.org
sgoff@iplantcollaborative.org

What is iPlant?
• iPlant’s mission is to build the CI to support plant biology’s Grand Challenge solutions
• Phase I – Community Input
• Phase II – Building the CI Foundation
• Next Phase – Enabling Plant Science Discovery
Now need to integrate workflows and test theories
Will support tool integration and synthesis activities

NSF Cyberinfrastructure Vision
• High Performance Computing
• Data and Data Analysis
• Virtual Organizations
• Learning and Workforce
Ref: “Cyberinfrastructure Vision for 21st Century Discovery”, NSF Cyberinfrastructure Council, March 2007.

CI for Plant Science: Observations
• Investment in data creation is high
• Sources of data are disparate
• Investment in existing tools is significant
• Tools shouldn’t be discarded
• Tools shouldn’t be reproduced, but lack:
  – Interoperability w/other tools
  – Data standards
  – Scalability
  – Consistency of interface access & use
  – Experimental reproducibility

iPlant is a process and a platform (or set of platforms, depending on your point of view).

Computational & Storage Capability
– Compute: Ranger, Lonestar, Stampede (UT/TeraGrid); Saguaro, Sonora (ASU); Marin, Ice (UA)
  • ~700 Teraflops
– Storage: Corral, Ranch (UT), Ocotillo (ASU)
  • > 10 Petabytes of storage available for the project
– Visualization: Spur, Stallion (UT), Matinee (ASU), UA-Cave
  • Among the world’s largest visualization systems
– Virtualized/Cloud Services: iPlant, TeraGrid, vendor clouds
  • Cloud tech to deliver persistent gateways and user services
Thanks to large-scale NSF investments, iPlant has excellent CI access

[Diagram labels: Bench Biologists; Computational Biologists; Semantic Web Layer; Discovery Environment; Data Store; Atmosphere; iPlant Cyberinfrastructure; APIs; Data; APIs; Algorithms]

Overview of Components
• iPlant Discovery Environment – Core Software
• iRODS Integration – Core Services
• Atmosphere Cloud – Core Services
• Semantic Web Tech – SSWAP Team
• iPlant Tool/Workflow API – Core Software & Engagement Teams

[Architecture diagram]
Discovery Environment; Semantic Web; Event; 3rd Party Science Gateways; DNA Subway; I/O; User Scripts & Applications
Public APIs: Data; Apps; Job; Profile; Auth
Low-Level Services: Condor; PBS; SGE; LSF; LL; iRODS; LDAP; Shibboleth; Globus/Unicore; GPIR; MySQL; Eucalyptus; Action Folders; MyProxy
XSEDE; iPlant
Hardware Resources: High Perf Computing; Databases; Storage; Cloud Systems

iRODS
Integrated Rule-Oriented Data System
www.irods.org
• Why iRODS?
– Large data storage in simple format
– Sharing of large data among iPlant CI Resources
– Sharing of large data with colleagues and collaborators
– Processing large data with TACC resources
• General information on iRODS: www.irods.org
• Access iPlant’s iRODS: irodsweb.iplantcollaborative.org
• Documentation: https://pods.iplantcollaborative.org/wiki/display/systems/iRODS

Atmosphere
iPlant’s Cloud Computing Resources
http://atmosphere.iplantcollaborative.org
• Tutorial: https://pods.iplantcollaborative.org/wiki/display/atmosphere/Demo+with+picture+walkthrough
• Why Atmosphere?
– Use a virtual machine (VM) with preinstalled software
– Create a VM to install complex software
– Create and share an image of a VM (VMI)
– Mount data from iPlant iRODS for use by your VM

Semantic Web
http://www.iplantcollaborative.org/communities/developers/semanticweb
• Why Semantic Web Technology?
– Provides a means for web-services to communicate and be aware of one another
[Diagram]
iPlant Service – Semantic Web – Remote Consumer
iPlant Consumer – Semantic Web – Remote Service
User-Created Service in Atmosphere – Semantic Web – iPlant’s Discovery Environment

iPG2P: From Genotype to Phenotype
• Visual Analytics – R. Grene and G. Abram: Information Visualization Tools capable of displaying diverse types of data from laboratory, field, in silico analyses and simulations
• Data Integration – D. Ware and C. Jordan: Methods for describing and unifying data sets into systems that support iPG2P activities
• Statistical Inference – D. Kliebenstein and E. Buckler: Platform for using advanced computational approaches to statistically link genotype to phenotype
• Modeling Tools – J. White, C. Myers, S. Welch: Framework for the construction, simulation and analysis of computational models of plant
• Ultra High Throughput Sequencing – T.
Brutnell and M. Vaughn: HPC resources and applications to process large-volume sequence data

Ultra High-Throughput Sequencing
[Pipeline diagram: Genome Services; Scalable computing]
Data: NCBI SRA; Desktop; Amazon S3; FTP; HTTP
Data Wrangling: Quality Control; Preprocessing; Rescaling; Barcoding
Alignments: BWA; TopHat
Community Use Cases: Expression studies; Forward genetic screens; Association studies
Cufflinks; SAMTools
Expression Levels (RPKM); Genome Variants (VCF 3.3); SAM Alignments

High Throughput Image Analysis
Scope: Enable image-based plant sciences research by incorporating image processing algorithms, grid computing, and databasing into an analysis pipeline
Objectives
1. Integrate Phytomorph and BISQUE as PhytoBisque
2. Broaden access to algorithms that benefit the community
3. Automate workflows so that plant biologists need not be computer scientists
[Diagram labels: APIs; Storage; Authentication; Compute cluster]
E. Spalding @ U of Wisconsin, B.S. Manjunath and K. Kvilekval @ UCSB

PhytoBisque: Example Use Case
Given a flatbed scanner image of Arabidopsis seeds, measure the length, width, and area and produce a population estimate for each trait
Seed trait QTL can be mapped when applied to mapped populations like Ler x CVI

A Strategy for Association Studies
Iterative analyses
• iPlant workflow management simplifies automation
• Compare methods!
Basic QTL/GWAS analysis
• R/qtl, QTL Cartographer, et al.
• Community can integrate these into the CI
Exploratory methods
• Hand-built R, Python, SAS, C codes
• Easy integration into iPlant CI via API
• Adopt common data model
Scalability
Challenges: High-density markers, large populations, combinatorial analyses
• iPlant-authored parallel GLM (etc) implementations
• Common data model
• Utilize workflow framework

Statistical Inference: Scalable GLM
[Diagram: Genotype (40 million markers in maize NAM) X Phenotype (6 traits of interest); ANOVA; X 1000 replicate analyses; Epistasis testing]
• Simplest case*: a few minutes using GLM on desktop TASSEL
• 1000-replicate bootstrap: 75-150 hours / trait
• Runtimes only get larger (days to years) for more complex analyses
* One trait x 40 million markers with no bootstrapping or epistasis testing

GPU-based QTL Mapping
• Aspects of the problem are highly parallel
• Re-architect data flow and mapping algorithms for GPU architecture
• Interface for C and GPU implementations will be identical
Ali Akoglu and Dave Lowenthal, UArizona
Alignment-based protein searches sped up 6-10x

iPlant Tree of Life (iPToL)
• Large phylogenetic inference: Building a tree of life for up to 500,000 green plants
• Tree Visualization: Scalable visualization for small to large trees
• Data Assembly and Integration: Acquiring, organizing and processing the data
• Taxonomic Intelligence: Sorting out different names for the same species
• Tree Reconciliation: Resolving discordant gene and species trees
• Trait Evolution: Using the tree to understand how traits evolved

Phyloviewer: visualization of large phylogenetic trees

My-Plant
• Social networking for plant biologists
• Organized by clade
• Used to organize the data collection for the “big tree”

Taxonomic Name Resolution Service

Integration of New Tools w/o
Programming
[Figure labels: “This part is done!!!”; “This part is coming soon!”]

Related Activities
• Integrated Breeding Platform
• Social networking portal for plant breeders
• R analysis packages
• Breeders fieldbook
• 1KP (1,000 plant transcriptomes)
• DOE’s Knowledgebase (KBase)
• Seed projects
• Elixir
• CoGe

Future Workshop Activities
• Small tool/workflow integration meetings
• 2-3 days each, 10-20 local participants
• 4-5 meetings starting in June 2011
• Addressing specific biological questions
• With appropriate test data and available software
• Building on iPlant’s cyberinfrastructure
• Complementary tools and additional data access
• Preference for broad use, high impact tools & workflows
• Can be kept private until published
• Positive results will stimulate additional support

iPlant’s Building Blocks
[Figure labels: Metadata; Data; Tools; Workflows; Viz]
Executive Team: Steve Goff, Dan Stanzione
Faculty Advisors: Greg Andrews, Kobus Barnard, Susan Brown, Vicki Chandler, John Hartman, Nirav Merchant, Sudha Ram, Ann Stapleton, Lincoln Stein, Doreen Ware, Sue Wessler, Ramin Yadegari
Staff: Greg Abram, Victoria Bryan, Rion Dooley, Andy Edmonds, Juan Antonio Raygoza Garay, Karla Gendler, Damian Gessler, Cornel Ghiban, Michael Gonzales, Hariolf Häfele, Matthew Helmke, Natalie Henriques, Uwe Hilgert, Nicole Hopkins, Lisa Howells, Kathleen Kennedy, Mohammed Khalfan, Seung-jin Kim, Adam Kubach, Sangeeta Kuchimanchi, Tina Lee, Andrew Lenards, Sonya Lowry, Jerry Lu, Eric Lyons, Naim Matasci, Sheldon McKay, Dave Micklos, Andy Muir, Martha Narro, Christos Noutos, Dennis Roberts, Bernice Rogowitz, Jerry Schneider, Bruce Schumaker, Edwin Skidmore, Sriram Srinivasan, Mary Margaret Sprinkle, Matthew Vaughn, Liya Wang, Sharon Wei, Jason Williams, Frank Willmore, John Wregglesworth, Weijia Xu
Students: Storme Briscoe, Steven Gregory, Monica Lent, Bansri Poduval, Pavithra Ravi, Shannon Wermes, Jill Yarmchuk
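
Backup note: bootstrap runtime arithmetic
The runtimes quoted on the "Statistical Inference: Scalable GLM" slide (1000 replicates, 75-150 hours per trait, 6 traits) can be checked with a back-of-envelope sketch. The function names below are hypothetical, and the parallel estimate assumes replicates are fully independent and split evenly across workers with no scheduling or I/O overhead, which is an idealization and not a measured iPlant result.

```python
# Figures taken from the "Statistical Inference: Scalable GLM" slide.
REPLICATES = 1000                    # bootstrap replicates per trait
TRAITS = 6                           # traits of interest
HOURS_PER_TRAIT = (75.0, 150.0)      # low and high serial estimates


def minutes_per_replicate(hours_per_trait, replicates=REPLICATES):
    """Average serial cost of one bootstrap replicate, in minutes."""
    return hours_per_trait * 60.0 / replicates


def serial_hours_all_traits(hours_per_trait, traits=TRAITS):
    """Serial hours to bootstrap every trait, one after another."""
    return hours_per_trait * traits


def ideal_wall_clock_hours(total_hours, workers):
    """ASSUMPTION: perfect scaling, no overhead (hypothetical best case)."""
    return total_hours / workers


low, high = HOURS_PER_TRAIT
per_replicate = (minutes_per_replicate(low), minutes_per_replicate(high))
all_traits = (serial_hours_all_traits(low), serial_hours_all_traits(high))
best_case = ideal_wall_clock_hours(all_traits[0], workers=1000)

print(f"Per replicate: {per_replicate[0]:.1f}-{per_replicate[1]:.1f} min")
print(f"All {TRAITS} traits, serial: {all_traits[0]:.0f}-{all_traits[1]:.0f} h")
print(f"Idealized 1000-worker run (low estimate): {best_case:.2f} h")
```

The per-replicate cost works out to a few minutes, in line with the "simplest case" bullet; the serial total for all six traits is what motivates the parallel GLM and GPU work described on those slides.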