Applied CyberInfrastructure Concepts ISTA 420/520

ISTA 420/520 Fall 2014
Will Computers Crash Genomics? Science Vol 331 Feb 2011
Nirav Merchant (nirav@email.arizona.edu)
Bio Computing & iPlant Collaborative
Eric Lyons (ericlyons@email.arizona.edu)
Plant Sciences & iPlant Collaborative
University of Arizona
http://goo.gl/p4j3m
or https://sites.google.com/site/appliedciconcepts/
Topic Coverage
• Lifecycle Issues (example from MIT)
• Why DM (Data Management)
• iRODS Introduction
• Scaling the Infrastructure for Data Management (Chapter 3 from FiMDA): group homework
Reality of data
“We are drowning in data, but starving for information”
- Attribution unknown
Data Life Cycle
http://www.data-archive.ac.uk/create-manage/life-cycle
iRODS Background and Evolution
• Integrated Rule-Oriented Data System (iRODS) http://www.irods.org
• Originated at SDSC, developed by the DICE (Data Intensive Cyber Environments) group
• Based on decade-long SRB development experience for managing distributed data
• Community-driven
• Most of the group migrated to UNC Chapel Hill in 2008-2009
– The group is bi-coastal: DICE-UNC, DICE-UCSD
• First release of iRODS in 2009
• iRODS picked up where SRB left off
iRODS Background and Evolution
• Modular, extensible, customizable
• Open source (BSD license)
• Supported at UNC with complementary activities by DICE and RENCI, a research unit of UNC Chapel Hill
• https://github.com/irods/irods
iRODS
I. Data grid middleware
II. Data management infrastructure
III. A framework for procedural implementation of data management policy (policy-driven data management)
iRODS is all of these.
iRODS Unified Virtual Collection
iRODS View of Distributed Data
[Figure: a user client sees a single collection that spans three stores: My Data (disk, filesystem, site-specific storage, ...); My Data (tape, database, filesystem, ...); Partner’s Data (remote disk, tape, filesystem, site-specific storage, ...)]
• iRODS installs over heterogeneous data resources
• Users can share & manage distributed data as a single collection
iRODS as a Data Grid
• Sharing data across:
– geographic and institutional boundaries
– heterogeneous resources (hardware/software)
• Virtual (logical) collections of distributed data
• Global name spaces
– data: files and collections
– users: single sign on
– storage: virtual resources
• Metadata catalogue (iCAT) manages mappings between logical and physical
name spaces
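The mapping between logical and physical name spaces can be illustrated with a small sketch. This is not the iCAT schema or any iRODS API; ToyCatalog, Replica, the zone name demoZone, the resource names, and all paths are hypothetical and exist only to show one logical name resolving to several physical locations.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Replica:
    resource: str       # virtual resource name (storage name space)
    physical_path: str  # where the bytes live on that resource


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a metadata catalogue: logical path -> replicas."""
    entries: dict = field(default_factory=dict)

    def register(self, logical_path, resource, physical_path):
        # One logical name may point at several physical copies.
        self.entries.setdefault(logical_path, []).append(
            Replica(resource, physical_path)
        )

    def resolve(self, logical_path):
        # Logical -> physical lookup; empty list if the name is unknown.
        return list(self.entries.get(logical_path, []))

    def list_collection(self, collection):
        # The user-facing view: logical names only, no storage details.
        prefix = collection.rstrip("/") + "/"
        return sorted(p for p in self.entries if p.startswith(prefix))


catalog = ToyCatalog()
catalog.register("/demoZone/home/alice/reads.fastq", "diskResc", "/vault/a1/reads.fastq")
catalog.register("/demoZone/home/alice/reads.fastq", "tapeResc", "/archive/0007/reads.fastq")
catalog.register("/demoZone/home/alice/notes.txt", "partnerResc", "/mnt/remote/notes.txt")

listing = catalog.list_collection("/demoZone/home/alice")
replicas = catalog.resolve("/demoZone/home/alice/reads.fastq")
```

The listing contains only logical paths, so the user sees one collection even though the replicas sit on three different hypothetical resources.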
A RENCI Data Grid
A complete data grid (zone) has one metadata catalogue (iCAT)
[Figure: six iRODS servers and one Metadata Catalog (iCAT); site labels: NCSU, Duke, iPlant, UNC-A, UNC-CH, RENCI (Europa Center)]
• Client asks for data – request goes to an iRODS server
• Server contacts the iCAT-enabled server
• Information (location, access rights, etc.) is
retrieved from the iCAT
• Server containing data is signaled to send data to
authorized client
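The four steps above can be sketched as a toy in-memory model. It is an illustration under stated assumptions, not the iRODS wire protocol or server code: ToyCatalogServer, ToyServer, the server names, and the sample path are all hypothetical.

```python
class ToyCatalogServer:
    """Hypothetical iCAT-enabled server: knows where data lives and who may read it."""

    def __init__(self):
        self.location = {}  # logical path -> name of the server holding the data
        self.readers = {}   # logical path -> set of users allowed to read it

    def register(self, logical_path, server_name, readers):
        self.location[logical_path] = server_name
        self.readers[logical_path] = set(readers)

    def lookup(self, user, logical_path):
        # Step 3: retrieve location and access rights from the catalogue.
        if logical_path not in self.location:
            raise FileNotFoundError(logical_path)
        if user not in self.readers[logical_path]:
            raise PermissionError(f"{user} may not read {logical_path}")
        return self.location[logical_path]


class ToyServer:
    """Hypothetical data server; any server in the grid can accept a request."""

    def __init__(self, name, catalog, grid):
        self.name = name
        self.catalog = catalog
        self.grid = grid    # shared dict: server name -> ToyServer
        self.files = {}     # logical path -> bytes stored on this server
        grid[name] = self

    def handle_request(self, user, logical_path):
        # Step 1: the client's request arrives at this server.
        # Steps 2 and 3: ask the catalogue-enabled server where the data is.
        holder = self.catalog.lookup(user, logical_path)
        # Step 4: the server holding the data sends it to the authorized client.
        return self.grid[holder].send(logical_path)

    def send(self, logical_path):
        return self.files[logical_path]


grid = {}
icat = ToyCatalogServer()
server_a = ToyServer("serverA", icat, grid)
server_b = ToyServer("serverB", icat, grid)
server_b.files["/demoZone/home/alice/reads.fastq"] = b"ACGT"
icat.register("/demoZone/home/alice/reads.fastq", "serverB", readers={"alice"})

# alice contacts serverA, but the bytes come from serverB.
data = server_a.handle_request("alice", "/demoZone/home/alice/reads.fastq")
```

In this sketch the client never needs to know which server holds the file; the catalogue lookup decides that, and an unauthorized user is rejected before any data moves.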
TUCASI Infrastructure Project (TIP)
Federated Data Grids
Independent data grids (zones), each with its own iCAT, can be federated
18 September 2012
Federation of Data Grids
• NASA
– Disparate data collections: Satellite data, model data, remote sensing data
– Manage the collections separately (technically and administratively) with separate
data grids
– Federate the data grids to give users an overall view onto NASA data
• Collaboration between consortia
– DataNet Federation Consortium: 6 science domain partners, federating their data
grids to share data, users
– Users authenticate to home data grid, access federated data grids
• For geographically distributed replication, evolution in data life cycle
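A minimal sketch of the federation idea, assuming that a logical path starts with the zone name and that a user is identified to other zones as user#zone. ToyZone, the zone names satZone and modelZone, and the sample path are hypothetical; real federation involves configuration and authentication steps this toy model leaves out.

```python
class ToyZone:
    """Hypothetical data grid (zone) with its own catalogue and its own users."""

    def __init__(self, name):
        self.name = name
        self.users = set()      # accounts that authenticate in this zone
        self.remote_zones = {}  # federated zones, by zone name
        self.objects = {}       # logical path -> (data, allowed "user#zone" ids)

    def federate(self, other):
        # Federation is mutual in this sketch; each zone stays independent.
        self.remote_zones[other.name] = other
        other.remote_zones[self.name] = self

    def put(self, logical_path, data, allowed):
        self.objects[logical_path] = (data, set(allowed))

    def read(self, qualified_user, logical_path):
        # Each zone enforces its own access rights from its own catalogue.
        data, allowed = self.objects[logical_path]
        if qualified_user not in allowed:
            raise PermissionError(f"{qualified_user} may not read {logical_path}")
        return data

    def get(self, user, logical_path):
        # Users authenticate to their home zone only.
        if user not in self.users:
            raise PermissionError(f"{user} is not a user of {self.name}")
        qualified = f"{user}#{self.name}"
        # The first path component names the zone that owns the data.
        zone_name = logical_path.split("/")[1]
        target = self if zone_name == self.name else self.remote_zones[zone_name]
        return target.read(qualified, logical_path)


sat = ToyZone("satZone")
model = ToyZone("modelZone")
sat.federate(model)
sat.users.add("alice")
model.put("/modelZone/home/public/run42.nc", b"model output", allowed={"alice#satZone"})

# alice logs in at satZone and reads data owned by modelZone.
result = sat.get("alice", "/modelZone/home/public/run42.nc")
```

The two zones keep separate catalogues and administration; federation only adds a route from one to the other and lets the remote zone grant access to a user from elsewhere.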
iPlant Data Store
Free Your Data
Different Users,
Different Access Needs:
One Data Store
Download