The iPlant Collaborative
Cyberinfrastructure
aka
Development of Public Cyberinfrastructure to Support Plant Science
Nirav Merchant
University of Arizona
What is iPlant?
• iPlant’s mission is to build the CI to support plant
biology’s Grand Challenge solutions
• Grand Challenges were not defined in advance, but
identified through engagement with the community
• A virtual organization with Grand Challenge teams
relying on national cyberinfrastructure
• Long-term focus on sustainable food supply,
climate change, biofuels, ecological stability, etc.
• Hundreds of participants globally… Working group
members at >50 US institutions, USDA, DOE, etc.
Brief History
• Formally approved by the National Science Board – 12/2007
• Funding by NSF – February 1st, 2008
• iPlant Kickoff Conference at CSHL – April 2008 (~200 participants)
• Grand Challenge Workshops – Sept–Dec 2008
• CI Workshop – Jan 2009
• Grand Challenge White Paper Review – March 2009
• Project Recommendations – March 2009
• Project Kickoffs – May 2009 & August 2009
• First Release of Discovery Environments – April 2010
The paradigm shift
• Classic paradigm: you produce data, analyze, and interpret (end to end)
• Conventional paradigm: consortia/centers produce data and you consume it
• New paradigm: consortia/centers have produced data and are creating “cyberinfrastructure” to tackle the “grand challenges”
GC Projects Recommended by the
iPlant Board of Directors March 2009
Initial Projects:
• Plant Tree of Life – iPToL – May ’09
  + Taxonomic Intelligence
  + APWeb2
  + Social Networking Website
• Genotype to Phenotype – iPG2P – Aug ’09
  + Image Analysis Platform
iPlant Tree of Life Working Groups
• Trait Evolution, Brian O’Meara
  – Post-tree analysis and mapping of ancestral traits
• Tree Reconciliation, Todd Vision
  – Large-scale reconciliation of gene trees, co-evolving parasites, etc., with species trees
• Big Trees, Alexandros Stamatakis
  – HPC phylogenetic inference with 500K taxa
• Tree Visualization, Michael Sanderson, Karen Cranston
  – Cross-cutting group for the visualization needs of all
• Data Integration, Val Tannen, Bill Piel
  – Cross-cutting group for the data integration needs of all
• Data Assembly, Doug Soltis, Pam Soltis, Michael Donoghue
  – Community and network building, data assembly
iPlant Genotype to Phenotype Working Groups
• NextGen Sequencing
  – Establishing an informatics pipeline that will allow the plant community to process NextGen sequence data
• Statistical Inference
  – Developing a platform using advanced computational approaches to statistically link genotype to phenotype
• Visual Analytics
  – Generating, adapting, and integrating visualization tools capable of displaying diverse types of data from laboratory, field, and in silico analyses and simulations
• Modeling Tools
  – Developing a framework to support tools for the construction, simulation, and analysis of computational models of plant function at various scales of resolution and fidelity
• Data Integration
  – Investigating and applying methods for describing and unifying data sets into virtual systems that support iPG2P activities
What is Cyberinfrastructure?
(Originally about TeraGrid)
WWW.TERAGRID.ORG
It was six men of Indostan,
To learning much inclined,
Who went to see the elephant,
(Though all of them were blind),
That each by observation
Might satisfy his mind.
[Figure: the six blind men each describe TeraGrid differently:]
“It’s a Grid!” – “It’s a Network!” – “They are HPC Centers!” – “It’s a Common Software Environment!” – “It’s Apps and Support!” – “It’s Storage!” – “And more: viz, facilities, data collections…”
The iPlant Cyberinfrastructure
[Architecture stack, top to bottom:]
• User: Grand Challenge Workflows, iPlant Interfaces; Third-Party Tools, iPlant-built Tools, Community-Contributed Tools and Data!
• iPlant Discovery Environments: Job Submission, Workflow Management, Service/Data APIs
• iPlant Middleware: iRODS, Grid Technologies, Condor, RESTful Services
• Physical Infrastructure: Compute, Storage, Persistent Virtual Machines (TeraGrid, Open Science Grid, UA/ASU/TACC)
Build a CI that’s robust, leverages national infrastructure, and can grow through community contribution!
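As a rough illustration of how a client might drive this stack, here is a minimal Python sketch that submits a job through a hypothetical RESTful middleware endpoint. The URL, payload fields, and token are invented for illustration and are not the actual iPlant API:

```python
import requests

# Hypothetical middleware endpoint -- invented for illustration.
API = "https://middleware.example.org/jobs"

# Describe a job the way a Discovery Environment might: a tool to run,
# input data held in iRODS, and a target execution system.
job = {
    "tool": "raxml",
    "inputs": ["/iplant/home/alice/alignment.phy"],
    "system": "teragrid",
}

# Submit and report the server-assigned job id.
resp = requests.post(API, json=job,
                     headers={"Authorization": "Bearer <token>"})
resp.raise_for_status()
print("Submitted job:", resp.json().get("id"))
```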
Open Source Philosophy,
Commercial Quality Process
• iPlant is open in every sense of the word:
  – Open access to source
  – Open API to build a community of contributors
  – Open standards adopted wherever possible
  – Open access to data (where users so choose)
• iPlant code design, implementation, and quality control will be based on best industrial practice
Portfolio of Activities
• Maintaining a balance of “past, present, future”
strategies
– “Past”: make services, systems, and support
available to existing bioinformatics projects, either to
enhance them or simply make critical tools more
widely available.
– “Present”: build the best bioinformatics software tools
that today’s technologies can provide.
– “Future”: track emerging technologies and, where
appropriate, stimulate research into the creation and
use of those technologies.
Portfolio of Activities
• In a nutshell:
– 12 Working groups in the two grand challenges, each of
which is defining requirements for DE development.
Each group not only holds discussions that lead to final projects, but also
spawns prototyping efforts, technology evaluation projects, tool support projects, etc.
– Services group: provide cycles, storage, hosting, etc. to
users.
– A comprehensive technology evaluation program to find,
borrow, or build relevant technologies, headlined by the
semantic web effort.
– A number of ancillary projects related to the grand challenges,
e.g., APWeb2 and high-throughput image analysis
– The Core development/integration effort.
Systems and Services
• Provide access to large-scale systems for problems like these
• Provide the storage infrastructure for biological data (again, in support of existing projects)
• Provide cloud-style VM infrastructure for service hosting.
iPlant: Connecting Users, Ideas and Resources
The core foundation comprises four layers:
• Data layer
• Registry and Integration layer
• Compute and Analysis layer
• Interaction and Collaboration layer
iPlant: Using proven technologies
• Data layer: providing access to raw and ingested data sets, including high-throughput data transfers
  • iRODS (see the sketch below)
  • GridFTP, Aspera
  • DSpace (DuraSpace), Open Archives Initiative
  • Content Distribution Networks (CDN)
  • High-performance storage @ TACC (Lustre)
  • MySQL and Postgres database clusters
  • Connection to established data sources (NCBI, TAIR, Gramene)
  • Connection to DataONE, DataNet initiatives
  • Cloud-style storage (similar to Amazon S3 and Walrus)
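For a flavor of scripting against the data layer, here is a minimal sketch using the python-irodsclient package (an assumption for illustration, not necessarily what iPlant deploys); host, zone, credentials, and paths are placeholders:

```python
from irods.session import iRODSSession

# Placeholder connection details -- point these at a real iRODS zone.
with iRODSSession(host="data.example.org", port=1247,
                  user="alice", password="secret", zone="iplant") as session:
    # Upload a local file into the user's iRODS collection.
    session.data_objects.put("reads.fastq", "/iplant/home/alice/reads.fastq")
    # Pull it back down later (e.g., onto a compute node).
    session.data_objects.get("/iplant/home/alice/reads.fastq", "reads_copy.fastq")
```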
iPlant: Using proven technologies
• Registry and Integration layer: connecting services, data, and metadata elements with semantic understanding
  • Metadata catalog management
  • Provenance tracking (W7 model)
  • Integrated registry and service discovery servers
  • Data Client and Data Provider Ontology Development Kit
  • Semantic architecture (OWL-based SSWAP; metadata sketch below)
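To make the semantic pieces concrete, the sketch below uses rdflib (again an assumption, not a confirmed part of the iPlant stack) to attach a few metadata triples to a dataset; the namespace and terms are invented:

```python
from rdflib import Graph, Literal, Namespace, URIRef

# Hypothetical namespace for metadata terms -- illustration only.
IP = Namespace("http://example.org/iplant/terms#")

g = Graph()
dataset = URIRef("http://example.org/iplant/data/experiment-42")

# Describe the dataset with simple subject-predicate-object triples.
g.add((dataset, IP.organism, Literal("Zea mays")))
g.add((dataset, IP.assayType, Literal("RNA-seq")))
g.add((dataset, IP.generatedBy, URIRef("http://example.org/iplant/users/alice")))

# Serialize for a registry or triple store to ingest.
print(g.serialize(format="turtle"))
```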
iPlant: Using proven technologies
• Compute and Analysis layer: connecting tasks with scalable platforms and algorithms
  • Virtualization (Xen clusters)
  • High-performance computing at TACC and TeraGrid
  • Grid (Condor, BOINC, Gearman)
  • Cloud (Eucalyptus, Nimbus)
  • Reconfigurable hardware (GPGPU, FPGA)
  • Checkpoint & restart (DMTCP)
  • Scaling and parallelizing code (MPI; sketch below)
  • Workflow engines (DAGMan, Pegasus, Kepler)
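As a taste of the MPI item above, a minimal sketch with mpi4py (one common Python binding; production codes would often use C/Fortran MPI directly):

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()   # this process's id
size = comm.Get_size()   # total number of processes

# Each rank sums its own slice of a toy problem...
local_sum = sum(range(rank, 1000, size))

# ...and the partial results are combined on rank 0.
total = comm.reduce(local_sum, op=MPI.SUM, root=0)
if rank == 0:
    print("sum(0..999) =", total)
```

Run with, e.g., `mpiexec -n 4 python sum_demo.py`.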
iPlant: Using proven technologies
• Interaction and Collaboration layer: providing end-user access to unified services and data, from API to large-scale visualization
  • Google Web Toolkit (GWT-driven front end)
  • Messaging bus (Java Mule, RabbitMQ, XMPP/Jabber; sketch below)
  • RESTful web services (web API access)
  • Single sign-on/identity management (Shibboleth, OAuth?)
  • Transparent HPC integration (TeraGrid science gateway and TACC resources)
  • Integration with desktop applications (via web services)
  • Collaboration platforms (OpenMeetings, WebEx, wiki, Mailman)
  • Shared analysis (shared workflows, desktop view)
  • Sharing data (DOI, persistent URL, CDN, social networks)
  • Large-scale visualization (Large Tree, ParaView, SAGE)
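To illustrate the messaging-bus idea, here is a minimal sketch publishing an event to RabbitMQ with the pika library; the queue name and message are invented, and iPlant’s actual bus topology is not shown here:

```python
import pika

# Connect to a RabbitMQ broker (placeholder host).
connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# Declare a queue for analysis-finished events -- name is hypothetical.
channel.queue_declare(queue="analysis.completed")

# Publish a small notification that other services could consume.
channel.basic_publish(exchange="",
                      routing_key="analysis.completed",
                      body='{"job": "42", "status": "done"}')
connection.close()
```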
Storage Services
• We have also begun offering storage to a number of projects connected to the grand challenges in some way, as well as for iPlant-internal use.
  – iRODS interface
  – Corral at TACC, a local storage array at UA
• Data is arriving now for the 1KP project and the Gates C3/C4 project.
Cloud Services
• iPlant is now offering “cloud”-style hosting services.
• Dynamically launch virtual servers hosted by iPlant.
• Still at the prototype stage (see the sketch below).
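Since Eucalyptus (listed earlier) exposes an EC2-compatible API, a client-side sketch might look like the following; it uses boto3 (a later AWS SDK, shown only as an assumption) pointed at a placeholder endpoint and image id:

```python
import boto3

# EC2-compatible endpoint (e.g., a Eucalyptus front end) -- placeholder URL.
ec2 = boto3.client("ec2",
                   endpoint_url="https://cloud.example.org:8773/services/Eucalyptus",
                   aws_access_key_id="<access-key>",
                   aws_secret_access_key="<secret-key>",
                   region_name="iplant")

# Launch one small virtual server from a placeholder image id.
resp = ec2.run_instances(ImageId="emi-12345678", MinCount=1, MaxCount=1,
                         InstanceType="m1.small")
print("Launched:", resp["Instances"][0]["InstanceId"])
```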
Arrival of “As a Service” models
Cyberinfrastructure is “Research as a Service”:
• SaaS: Software as a Service (e.g., clustering/assembly as a service)
• PaaS: Platform as a Service; IaaS plus core software capabilities on which you build SaaS (e.g., Hadoop/MapReduce is a platform; see the toy example below)
• IaaS: Infrastructure as a Service (get computer time with a credit card and a Web interface, like EC2)
http://salsahpc.indiana.edu
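To ground the PaaS example, here is a toy word count in plain Python that mimics the map/shuffle/reduce pattern Hadoop provides (no Hadoop required; purely illustrative):

```python
from collections import defaultdict
from itertools import chain

docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

# Map: emit (word, 1) pairs for every word in every document.
mapped = chain.from_iterable(((w, 1) for w in d.split()) for d in docs)

# Shuffle: group counts by word.
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce: sum the counts for each word.
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)  # e.g., {'the': 3, 'quick': 2, ...}
```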
What do working groups want?
• Wiki
• Shared storage
• WebEx
• CMS
• Google Apps
• Machines for prototyping/development
• Change-management software (git/svn)
• Access to a compute grid/cluster
What iPlant wants
• Ability to integrate single sign-on (SSO) with all services we offer (API, cloud, grid, iRODS, etc.)
• Leverage credentials from users’ home institutions
• Lower the barrier to access while remaining secure
• Emphasis on ease of access to “research as a service” (see the token sketch below)
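For flavor, a minimal sketch of an OAuth 2.0 client-credentials token request with the requests library; the token URL, client id, and service endpoint are placeholders, not real iPlant endpoints:

```python
import requests

# Placeholder OAuth 2.0 token endpoint -- invented for illustration.
TOKEN_URL = "https://auth.example.org/oauth2/token"

# Exchange client credentials for a bearer token...
resp = requests.post(TOKEN_URL,
                     data={"grant_type": "client_credentials"},
                     auth=("my-client-id", "my-client-secret"))
resp.raise_for_status()
token = resp.json()["access_token"]

# ...then present the same token to any iPlant-style service.
svc = requests.get("https://api.example.org/v1/jobs",
                   headers={"Authorization": f"Bearer {token}"})
print(svc.status_code)
```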
Phases of a project
• Enthusiasm
• Disillusionment
• Panic
• Search for the guilty
• Punishment of the innocent
• Praise and honor for the non-participants
– Karla Jennings