The iPlant Collaborative
Community Cyberinfrastructure for Life Science
Jason Williams
Cold Spring Harbor Laboratory, iPlant
www.iPlantCollaborative.org

The iPlant Collaborative
Vision
Enable life science researchers and educators to use and extend cyberinfrastructure to understand and ultimately predict the complexity of biological systems

The iPlant Collaborative
Vision
How can we prepare for science we can’t anticipate?

The iPlant Collaborative
What is cyberinfrastructure?
iPlant makes computation, data storage, cloud services, and software tools easily available to informaticians and researchers, leveraging existing CI investments.
"Cyberinfrastructure consists of computing systems, data storage systems, instruments and data repositories, visualization environments, and people, linked together by software and networks to improve research productivity and enable breakthroughs not otherwise possible." -- Craig Stewart

Biological Cyberinfrastructure
The Problem of Big Data in Biology

The iPlant Collaborative
Where iPlant is today and where we are going
• Initial funding in 2008
• Almost 2 years of community input gathering – software development starts in 2009
• Major CI components appear late 2010
• Finished 5th year
• Recommended for second 5 year term
• > 9000 users
• > 20K analysis jobs in 2012 (> 10K HPC jobs)
• 500 terabytes of user data
Image from: http://adammclane.com/2011/12/06/bottlenecks/

The iPlant Collaborative
Where iPlant is today and where we are going
• iPlant renewed by NSF
• September 2013 begins next 5 year period
• Scientific Advisory Board
• Focus on Genotype-Phenotype science
• NSF recommended expansion of scope beyond plants

The iPlant Collaborative
What we have to offer you
• Data Management & Storage Resources
• Access to High Performance Computing Resources
• Tool Integration System
• Application Programming Interfaces (APIs)
• Cloud Computing Resources
• Genotype To Phenotype Science Enablement Portfolio
• Tree of Life Science Enablement
Portfolio
• Image Analysis Platform
• Support for Molecular Breeding Platform (IBP)
• Support for AgMIP

How iPlant CI Enables Discovery
Solution: Discovery Environment
An extensible platform for science
• High-powered computing
• Data sharing/collaboration
• Easy to use interface
• Virtually limitless apps
• Analysis history (provenance)

How iPlant CI Enables Discovery
Solution: Atmosphere
On-demand computing resource built on a cloud infrastructure
• Virtual Machine pre-configured with:
– Software
– Memory requirements
– Processing power
• Integrated with iPlant authentication, storage, and HPC capabilities
• Build custom images/appliances and share with community
• Cross-platform desktop access to GUI applications in the cloud (using VNC)

How iPlant CI Enables Discovery
Solution: iPlant Data Store
All data within the same platform: speed and accessibility
• Access your data from multiple iPlant services
• Automatic, redundant data backup between University of Arizona and University of Texas (NSF data management plan)
• Multiple ways to share data with collaborators
• Multi-threaded high speed transfers
• Default 100GB allocation. >1TB allocations available with justification

Source               Time (s)
CD                   320
External Drive       36*
USB2.0 Flash         30
iPlant Data Store    18*
My Computer          15
Berkeley Server      150

Highlighted Objectives and Deliverables
Community identified priorities
• Increased interoperability with other data providers – e.g.
BioMarts, CoGe, MaizeGDB
• Data discovery through interaction with trait repositories (trait/plant ontologies)
• Workflows for variant discovery – SNP detection pipelines
• Scalable Genome Assembly Workflows – expanded capabilities with MAKER, InterProScan
• iPlant Data Commons – Resources for storage, data conversion, and metadata

The iPlant Collaborative
Your colleagues

Leadership Team:
Steve Goff – UA
Dan Stanzione – TACC
Matthew Vaughn – TACC
Nirav Merchant – UA
Doreen Ware – CSHL
Michael Schatz – CSHL
David Micklos – CSHL
Ann Stapleton – UNC Wilmington
Ron Vetter – UNC Wilmington

Faculty Advisors & Collaborators:
Ali Akoglu, Kobus Barnard, Timothy Clausner, Brian Enquist, Damian Gessler, Ruth Grene, John Hartman, Matthew Hudson, David Lowenthal, B.S. Manjunath, David Neale, Brian O’Meara, Sudha Ram, David Salt, Mark Schildhauer, Doug Soltis, Pam Soltis, Edgar Spalding, Alexis Stamatakis, Steve Welch

Postdocs:
Barbara Banbury, Christos Noutsos, Solon Pissis, Brad Ruhfel

Students:
Peter Bailey, Jeremy Beaulieu, Devi Bhattacharya, Storme Briscoe, YaDi Chen, David Choi, Barbara Dobrin, John Donoghue, Yekatarina Khartianova, Chris La Rose, Amgad Madkour, Aniruddha Marathe, Andre Mercer, Kurt Michaels, Zack Pierce, Andrew Predoehl, Sathee Ravindranath, Kyle Simek, Gregory Striemer, Jason Vandeventer, Nicholas Woodward, Kuan Yang

Staff:
Greg Abram, Sonali Aditya, Ritu Arora, Roger Barthelson, Rob Bovill, Brad Boyle, Gordon Burleigh, John Cazes, Mike Conway, Victor Cordero, Rion Dooley, Aaron Dubrow, Andy Edmonds, Dmitry Fedorov, Melyssa Fratkin, Michael Gatto, Utkarsh Gaur, Cornel Ghiban, Steve Gregory, Matthew Hanlon, Natalie Henriques, Uwe Hilgert, Nicole Hopkins, EunSook Jeong, Logan Johnson, Chris Jordan, Kathleen Kennedy, Mohammed Khalfan, David Knapp, Lars Koesterke, Sangeeta Kuchimanchi, Kristian Kvilekval, Sue Lauter, Tina Lee, Andrew Lenards, Monica Lent, Zhenyuan Lu, Eric Lyons, Aaron Marcuse-Kubitz, Naim Matasci, Sheldon McKay, Robert McLay, Nathan Miller, Steve Mock, Martha Narro, Shannon Oliver, Benoit Parmentier, Jmatt Peterson, Dennis Roberts,
Paul Sarando, Jerry Schneider, Bruce Schumaker, Edwin Skidmore, Brandon Smith, Mary Margaret Sprinkle, Sriram Srinivasan, Josh Stein, Lisa Stillwell, Jonathan Strootman, Peter Van Buren, Hans Vasquez-Gross, Rebeka Villarreal, Ramona Walls, Liya Wang, Anton Westveld, Jason Williams, John Wregglesworth, Weijia Xu

Overview of the iPlant Discovery Environment
Scalable platform for powerful computing, data, and application resources

Overview of the iPlant Discovery Environment
The evolution of cyberinfrastructure, from the bench biologist’s point of view
Image from: http://www.wired.com/wired/archive/17.01/ff_mac_viewer.html

Overview of the iPlant Discovery Environment
Through the Discovery Environment you have:
• High-powered computing
• iPlant data store
• Easy to use interface
• Virtually limitless apps
• Analysis history (provenance)

Overview of the iPlant Discovery Environment
Key DE Features in the 1.6-1.8 releases
• Enhanced ability to share data with colleagues/collaborators
• Visual workflow creation
• WYSIWYG Tool Integration

Discovery Environment
What’s Next for the DE?
• Support for batch submission
• Extensive support for metadata, ontologies and tagging
• Search data based on metadata and ontologies
• Integrate with external data sources like BioMart

Overview of Atmosphere

Overview of Atmosphere
Cloud Computing on Demand
• On-demand computing resource built on a cloud infrastructure
• Virtual Machine pre-configured with:
– Software
– Memory requirements
– Processing power

Overview of Atmosphere
Cloud Computing on Demand
• Fully integrated into iPlant authentication, storage, and HPC capabilities
• Enables users to build custom images/appliances and share with community
• Cross-platform desktop access to GUI applications in the cloud (using VNC)
• Provides easy web-based access to resources

Overview of Atmosphere
Cloud Computing on Demand
• API-compatible implementation of Amazon EC2/S3 interfaces
• Virtualize the execution environment for applications and services
• Up to 16 core / 32 GB instances
• Access to Cloud Storage + EBS

Overview of Atmosphere
Multiple Ways to Access
• VNC client
• Command line tools (e.g. SSH)

Atmosphere
What’s Next?
• Increased capacity to meet demand
• Transition to OpenStack
• Pausing and resizing images
• Private Federated Clouds – a little further ahead

Where to go for help
iPlant user forums: ask.iplantcollaborative.org
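
Appendix: comparing the Data Store transfer times
The transfer times listed on the iPlant Data Store slide are easier to compare as ratios. The sketch below is an illustration only. The numbers are copied from that slide; the asterisks on two entries are dropped because their footnote is not part of the deck; the slide does not say what file was transferred; and the names `TRANSFER_TIMES_S` and `relative_to` are hypothetical, not part of any iPlant tool.

```python
from typing import Dict

# Transfer times in seconds, copied from the iPlant Data Store slide.
# "External Drive" and "iPlant Data Store" carry an asterisk on the slide;
# the footnote is missing from the deck, so the marks are dropped here.
TRANSFER_TIMES_S: Dict[str, float] = {
    "CD": 320,
    "Berkeley Server": 150,
    "External Drive": 36,
    "USB2.0 Flash": 30,
    "iPlant Data Store": 18,
    "My Computer": 15,
}


def relative_to(baseline: str, times: Dict[str, float]) -> Dict[str, float]:
    """Return each source's time as a multiple of the baseline's time."""
    base = times[baseline]
    return {source: seconds / base for source, seconds in times.items()}


if __name__ == "__main__":
    ratios = relative_to("iPlant Data Store", TRANSFER_TIMES_S)
    # Print slowest first so the contrast with the Data Store is easy to read.
    for source, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
        print(f"{source:<18} {TRANSFER_TIMES_S[source]:>4.0f} s  {ratio:5.2f}x")
```

On these figures the external drive takes twice as long as the Data Store (36 s against 18 s), and only the local disk ("My Computer", 15 s) is faster.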