NPACI Neuroscience


Storage Resource Broker

Federating Archives in the DELAMAN Network

Reagan W. Moore

San Diego Supercomputer Center moore@sdsc.edu

http://www.npaci.edu/DICE/SRB

Distributed Data Management

Using Data Grids

• Build a shared collection

• Authenticate users independently of the storage systems

• Control access independently of the storage systems

• Organize the file name space independently of the storage systems

• Manage context (metadata) independently of content (files)

• Maintain consistency between context and operations on content

Storage Resource Broker

• Generic distributed data management technology

• Data grids - sharing

• Digital libraries - publication

• Persistent archives - preservation

• Federated server architecture / thin client

• 250,000 lines of “C” code

• Supports all major compute and storage platforms

• All requirements listed on following Scenario slides are supported

Scenario 1 - Data Migration

• Provide URIDs (logical file names) that are independent of storage system

• Provide metadata for each file

• Support browse and discovery on collection hierarchy

• Support access interfaces to the data

• Support registration of existing files into a shared collection

• Single sign-on environment

• GSI / challenge response / tickets

Managing Distributed Data

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Naming conventions provided by storage systems

Data Grids Provide a Level of Indirection for Each Naming Convention

Data Access Methods (C library, Unix, Web Browser)

Data Collection

Storage Repository

• Storage location

• User name

• File name

• File context (creation date,…)

• Access constraints

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space (URID)

• Logical context (metadata)

• Control/consistency constraints

Data is organized as a shared collection
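
The level of indirection described above can be illustrated with a minimal sketch. It is not SRB code: the `Catalog` and `Replica` classes, the resource names, and the paths are all hypothetical, chosen only to show that a logical file name can stay fixed while the physical location changes.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    resource: str       # logical resource name (hypothetical, e.g. "disk-cache")
    physical_path: str  # name the storage system itself uses


@dataclass
class Catalog:
    """Toy catalog mapping logical file names to physical replicas."""
    entries: dict = field(default_factory=dict)  # logical name -> list of Replica

    def register(self, logical_name: str, resource: str, physical_path: str) -> None:
        self.entries.setdefault(logical_name, []).append(Replica(resource, physical_path))

    def resolve(self, logical_name: str) -> list:
        # Return a new list so callers cannot add or remove replicas by accident.
        return list(self.entries.get(logical_name, []))

    def migrate(self, logical_name: str, old_resource: str,
                new_resource: str, new_path: str) -> None:
        """Repoint one replica at a new storage system; the logical name stays fixed."""
        for replica in self.entries.get(logical_name, []):
            if replica.resource == old_resource:
                replica.resource = new_resource
                replica.physical_path = new_path
                return
        raise KeyError(f"{logical_name} has no replica on {old_resource}")


catalog = Catalog()
catalog.register("/delaman/corpus/session01.wav", "disk-cache", "/vol3/a91f.wav")
catalog.migrate("/delaman/corpus/session01.wav", "disk-cache", "tape-archive", "/hpss/x/0042")
print(catalog.resolve("/delaman/corpus/session01.wav"))
```

Users and applications only ever hold the logical name, so a migration between storage systems needs no change on their side.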

Provide Context for Data

• Properties of files

• Provenance - source

• Descriptive attributes

• State information resulting from operations on files

• Organize properties as metadata in a collection hierarchy

• Define operations on file properties

• Manage state information - location, replicas, containers, checksums

• Separate context management from content management

• Maintain consistency of context as operations are done on content

• Support context management

• Schema extension, automated SQL generation, bulk metadata load

• Metadata extraction through a remote procedure parsing the file
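
One way to picture "maintain consistency of context as operations are done on content" is a toy grid in which every operation on a file also updates its state information. The sketch below is an assumption-laden illustration, not the SRB design: the `Grid` class, its dictionaries, and the use of SHA-256 are all hypothetical.

```python
import hashlib


class Grid:
    """Toy data grid: 'storage' holds content, 'context' holds state information."""

    def __init__(self):
        self.storage = {}   # (resource, logical name) -> bytes
        self.context = {}   # logical name -> checksum, size, and replica list

    def put(self, name: str, resource: str, data: bytes) -> None:
        self.storage[(resource, name)] = data
        # Record state information in the same operation that wrote the content.
        self.context[name] = {
            "checksum": hashlib.sha256(data).hexdigest(),
            "size": len(data),
            "replicas": [resource],
        }

    def replicate(self, name: str, target: str) -> None:
        entry = self.context[name]
        source = entry["replicas"][0]
        self.storage[(target, name)] = self.storage[(source, name)]
        # The replica list is updated together with the copy, never separately.
        if target not in entry["replicas"]:
            entry["replicas"].append(target)


grid = Grid()
grid.put("/coll/notes.txt", "disk", b"field notes")
grid.replicate("/coll/notes.txt", "tape")
print(grid.context["/coll/notes.txt"]["replicas"])
```

Because callers cannot write content without going through `put` or `replicate`, the context cannot drift out of step with the files in this sketch.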

Federated Server Architecture

[Figure: peer-to-peer brokering of a read. An application presents a logical name or an attribute condition to an SRB server. The MCAT catalog supplies (1) logical-to-physical mapping, (2) identification of replicas, and (3) access and audit control. SRB servers spawn SRB agents, which provide parallel data access to replicas R1 and R2. Arrows in the figure are numbered 1 through 6.]

Storage Resource Broker - Data Grid

[Figure: layered architecture, top to bottom]

• Application

• C, C++, Java libraries; Linux I/O; Unix shell; Java, NT browser; Kepler actors; DLL / Python, Perl; HTTP, DSpace, OpenDAP; OAI, WSDL, WSRF

• Federation Management

• Consistency & Metadata Management / Authorization, Authentication, Audit

• Logical Name Space; Latency Management; Data Transport; Metadata Transport

• Catalog Abstraction: Databases (DB2, Oracle, Sybase, Postgres, mySQL, Informix)

• Storage Repository Virtualization: Archives (Tape, Sam-QFS, DMF, HPSS, ADSM, UniTree, ADS); ORB; File Systems (Unix, NT, Mac OSX); Databases (DB2, Oracle, Sybase, SQLserver, Postgres, mySQL, Informix)

Scenario 2 - Data Exchange

• Support access controls on the URIDs

• Java administration GUI to support owner control of access controls

• Can delegate permission to set access controls

• Access controls apply on all replicas independent of storage system

• Support latency management for moving files across wide area networks

• Parallel I/O, replication, staging, aggregation of data / metadata / I/O commands

• Support integrity validation

• Manage checksums for each file
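
Integrity validation with per-file checksums can be sketched as follows. The slides do not say which checksum algorithm is used, so SHA-256 is an assumption here, and the helper names `file_checksum` and `verify` are hypothetical.

```python
import hashlib
import os
import tempfile


def file_checksum(path: str, chunk_size: int = 1 << 20) -> str:
    """SHA-256 of a file, read in chunks so large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, recorded: str) -> bool:
    """Compare the file's current checksum with the one recorded at ingest."""
    return file_checksum(path) == recorded


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "sample.dat")
    with open(path, "wb") as handle:
        handle.write(b"field recording 001")
    recorded = file_checksum(path)     # value a catalog would store at ingest
    print(verify(path, recorded))      # file unchanged since ingest
    with open(path, "ab") as handle:
        handle.write(b"!")             # simulate corruption after a transfer
    print(verify(path, recorded))
```

Storing the checksum in the catalog rather than next to the file means a replica on any storage system can be validated against the same recorded value.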

Latency Management - Bulk Operations

• Bulk register

• Create a logical name for a file

• Bulk load

• Create a copy of the file on a data grid storage repository

• Bulk unload

• Provide containers to hold small files and pointers to each file location

• Bulk delete

• Mark as deleted in metadata catalog

• After specified interval, delete file

• Bulk metadata load

• Support parsing of metadata from a remote file at remote storage

• Requests for bulk operations for access control setting,
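
The container idea in the list above (many small files held in one object, with a pointer to each file's location) can be sketched in a few lines. This is an illustrative in-memory layout, not the SRB container format; `pack`, `extract`, and the (offset, length) index are assumptions.

```python
import io


def pack(files: dict) -> tuple:
    """Concatenate small files into one container; return (container bytes, index).

    The index maps each name to (offset, length), the pointer to its location.
    """
    buffer = io.BytesIO()
    index = {}
    for name, data in files.items():
        index[name] = (buffer.tell(), len(data))
        buffer.write(data)
    return buffer.getvalue(), index


def extract(container: bytes, index: dict, name: str) -> bytes:
    """Read one file back out of the container using its recorded pointer."""
    offset, length = index[name]
    return container[offset:offset + length]


container, index = pack({"a.txt": b"alpha", "b.txt": b"beta"})
print(index)
print(extract(container, index, "b.txt"))
```

Moving one container across a wide area network costs one transfer setup instead of one per small file, which is the latency saving the bulk operations aim for.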

Scenario 3 - Community Access

• Within the shared collection, the digital entities are owned and managed by the data grid

• Files, URLs, SQL commands, database binary large objects can be registered into the shared collection

• Access controls for

• Files / metadata / storage systems

• Access controls are defined for multiple roles

• Schema extension, create new metadata

• Modify metadata

• Add annotations

• Turn on audit trails

• Write data

• Read data

Scenario 4 - Explorative Studies

• Uniform access mechanisms to data across all storage systems

• Support for queries on databases

• Support for formatting results (XML, HTML)

• Support audit trails, encryption

• Support user-defined collection hierarchy

• Soft links (build a logical collection of pointers to data within the data grid)

• Support for multiple types of discovery

• By URID (Logical File Name)

• By query on metadata (may be unique to a single file)

• By GUID (handle system)
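
Discovery by logical name and by metadata query can be contrasted with a small sketch. The catalog contents and the `find` helper are hypothetical; a real catalog would issue SQL against a database rather than scan a dictionary.

```python
def find(catalog: dict, **conditions) -> list:
    """Return logical names whose metadata satisfies every attribute=value condition."""
    return sorted(
        name for name, attrs in catalog.items()
        if all(attrs.get(key) == value for key, value in conditions.items())
    )


# Hypothetical catalog: logical file name -> descriptive metadata.
catalog = {
    "/archive/rec001.wav": {"format": "wav", "speaker": "A"},
    "/archive/rec002.wav": {"format": "wav", "speaker": "B"},
    "/archive/rec001.xml": {"format": "xml", "speaker": "A"},
}

print(catalog["/archive/rec001.wav"])              # discovery by logical file name
print(find(catalog, format="wav"))                 # discovery by metadata query
print(find(catalog, format="wav", speaker="B"))    # conditions may single out one file
```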

Scenario 5 - Education

• SRB is used to build digital libraries

• Assemble class material

• Manage student reports

• Display material through web browsers

• Federation of digital libraries

• Controlled sharing across independent data grids or digital libraries

• Support for cross-registration of logical name spaces

• Authentication done by “home” data grid

• Access controls managed by both data grids
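
The division of labor in a federation (the home data grid authenticates, while access controls are checked where the data lives) can be sketched as below. The `Zone` class and its methods are hypothetical, secrets are kept in plain text only to keep the example short, and the sketch simplifies "managed by both data grids" to a check against the access control list of the grid that holds the data.

```python
class Zone:
    """Toy data grid ('zone') with its own users and access control lists."""

    def __init__(self, name: str):
        self.name = name
        self.passwords = {}   # local user -> secret (plain text only in this sketch)
        self.acl = {}         # logical name -> set of "user@zone" allowed to read
        self.trusted = {}     # zone name -> Zone registered for federation

    def add_user(self, user: str, secret: str) -> None:
        self.passwords[user] = secret

    def authenticate(self, user: str, secret: str) -> bool:
        return user in self.passwords and self.passwords[user] == secret

    def federate(self, other: "Zone") -> None:
        self.trusted[other.name] = other

    def can_read(self, user: str, home_zone: str, secret: str, logical_name: str) -> bool:
        # 1. Authentication is delegated to the user's home zone.
        home = self if home_zone == self.name else self.trusted.get(home_zone)
        if home is None or not home.authenticate(user, secret):
            return False
        # 2. Authorization is decided by the zone that holds the data.
        return f"{user}@{home_zone}" in self.acl.get(logical_name, set())


grid_a, grid_b = Zone("A"), Zone("B")
grid_a.add_user("ann", "s3cret")
grid_b.federate(grid_a)
grid_b.acl["/B/lectures/week1.pdf"] = {"ann@A"}
print(grid_b.can_read("ann", "A", "s3cret", "/B/lectures/week1.pdf"))
```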

Federation

Data Access Methods (Web Browser, DSpace, OAI-PMH)

Data Collection A

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Data Collection B

Data Grid

• Logical resource name space

• Logical user name space

• Logical file name space

• Logical context (metadata)

• Control/consistency constraints

Access controls and consistency constraints on cross registration of digital entities

Scenario 6 - Updating Resources

• Maintain system level metadata

• Owner of registered file

• Creation time, modification time, size, audit trails

• Replica locations

• Support for synchronization of replicas

• Can modify a replica, subsequent reads are to the modified copy

• Can synchronize copies to the modified version

• Support for physical file containers

• Aggregate small files before storage
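
The replica behavior listed above (modify one replica, read the modified copy, then synchronize the rest) can be sketched with a small class. `ReplicatedFile` and its fields are hypothetical and keep everything in memory; the sketch only shows the bookkeeping, not how SRB implements it.

```python
class ReplicatedFile:
    """Toy replica set: a write goes to one replica, which becomes the current copy."""

    def __init__(self, data: bytes, resources: list):
        self.copies = {resource: data for resource in resources}
        self.current = resources[0]   # replica that reads are served from
        self.stale = set()            # replicas that lag behind 'current'

    def write(self, resource: str, data: bytes) -> None:
        self.copies[resource] = data
        self.current = resource
        self.stale = set(self.copies) - {resource}

    def read(self) -> bytes:
        # Reads always go to the most recently modified replica.
        return self.copies[self.current]

    def synchronize(self) -> None:
        # Bring every stale replica up to the modified version.
        for resource in self.stale:
            self.copies[resource] = self.copies[self.current]
        self.stale.clear()


item = ReplicatedFile(b"version 1", ["disk", "tape"])
item.write("tape", b"version 2")
print(item.read(), sorted(item.stale))
item.synchronize()
print(item.copies["disk"])
```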

Scenario 7 - Web-based Editions

• Support for digital library interfaces on top of the data grid

• Transana - technology to manipulate, edit, and manage classroom video (University of Wisconsin)

• DSpace - digital library system to manage ingestion of material into a collection

• OAI-PMH - Open Archives Initiative protocol for metadata harvesting

• OpenDAP - Data Access Protocol that supports both semantic and structural manipulation of registered files

• Windows browser, Web browser, Java, WSDL interfaces

• Collaborating on development of portlet interface


Scenario 8 - Unconnected Editions

Ability to download data from shared collection to local resource

• Support for PCs, workstations, supercomputers

Generalization of anonymous FTP

• Can issue a ticket permitting

• Limited number of read accesses valid for specified time interval

• Can set public access to a sub-collection

• Can restrict access by user name/domain/zone
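
A ticket that permits a limited number of reads within a time interval can be sketched as follows. The `TicketOffice` class is hypothetical and is not the SRB ticket mechanism; the optional `now` argument exists only so the behavior can be exercised without waiting for real time to pass.

```python
import secrets
import time


class TicketOffice:
    """Toy ticket store: each ticket allows a few reads of one file until it expires."""

    def __init__(self):
        self.tickets = {}  # ticket id -> [logical name, reads left, expiry time]

    def issue(self, logical_name: str, max_reads: int,
              lifetime_seconds: float, now: float = None) -> str:
        now = time.time() if now is None else now
        ticket_id = secrets.token_hex(16)   # unguessable identifier handed to the reader
        self.tickets[ticket_id] = [logical_name, max_reads, now + lifetime_seconds]
        return ticket_id

    def redeem(self, ticket_id: str, logical_name: str, now: float = None) -> bool:
        now = time.time() if now is None else now
        ticket = self.tickets.get(ticket_id)
        if ticket is None or ticket[0] != logical_name:
            return False
        if now > ticket[2] or ticket[1] <= 0:
            return False
        ticket[1] -= 1                       # count this read against the limit
        return True


office = TicketOffice()
ticket = office.issue("/public/report.pdf", max_reads=2, lifetime_seconds=3600)
print(office.redeem(ticket, "/public/report.pdf"))
```

The holder of the ticket needs no account on the data grid, which is what makes this a generalization of anonymous FTP.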

Local Archives

Maintain files in local file system

• Register existence of the files into the data grid

• Issue synchronization command to replicate into the archive

Maintain a data grid on the local system

• Entire environment can be installed on a Mac in 15 minutes (Perl install script)

• Use data grid federation to synchronize name spaces, files, metadata from local data grid to archives data grid

Scenario 9 - Collaborative Commentary

• Comments can be added by owner

• Annotations can be added by authorized persons

• Annotations marked by person name, date

• Can restrict annotation right by group

• Can choose to create explicit metadata attributes to manage comments

• Can store multiple comments per object

• Can search across metadata

• Or can use digital library interfaces to manage comments

Sites Using the SRB

Academia Sinica, Taiwan

ASCC, Computing Centre, Taiwan

Australian National University

Bedford Oceanography, Canada

Bioinformatics Institute, Singapore

CSIRO, Australia

Data Storage Institute, Singapore

EGEE, French National Center

GeoForschungsZentrum, Germany

James Cook University, Australia

KEK High Energy Physics, Japan

Max Planck Institute, Netherlands

Parallab, Norway

South Australian Advanced Computing

UIB (Parallab), Norway

University of Amsterdam

University of Cambridge, Astronomy

University of Cambridge, e-Science

University of Edinburgh

University of Genoa, Italy

University of Hong Kong

University of Manchester

University of Oslo

University of Southampton

York Univ (UK)

CiteSeer, Penn State

City Univ. of New York

Geospatial Environment, UCSD

Drexel University

EOSDIS Distributed Active, NASA Goddard

Georgia Tech

Kentucky State Libraries & Archives

Library of Congress

Los Alamos National Lab

NASA Ames

NASA Goddard Space Flight Center

NCSA Grid Computing

NIH (NCI Center for Bioinformatics)

Penn State University

Pittsburgh Supercomputing Center

Purdue University, Indiana

Stanford University

TACC, University of Texas

Texas A & M

UC Santa Cruz

UCLA

UCSD Neuroscience

University of Maryland

University of Michigan, CAC department

University of New Mexico

University of Washington

University of Wisconsin

USC

Yale University

Storage Resource Broker Collections at SDSC (11/2/2004)

Data Grid

• NSF/ITR - National Virtual Observatory: 53,858 GBs of data stored, 9,536,698 files

• NSF - National Partnership for Advanced Computational Infrastructure: 24,738 GBs, 5,754,890 files

• Hayden Planetarium - Evolution of the Solar System visualizations: 7,201 GBs, 113,600 files

• NSF/NPACI - Joint Center for Structural Genomics: 5,228 GBs, 652,031 files

• NSF/NPACI - Biology and Environmental collections: 8,851 GBs, 33,340 files

• NSF - TeraGrid, ENZO Cosmology simulations: 121,550 GBs, 1,096,947 files

• NIH - Biomedical Informatics Research Network: 6,002 GBs, 4,107,508 files

Digital Library

• NLM - Digital Embryo image collection: 720 GBs, 45,365 files

• NSF/NPACI - Long Term Ecological Reserve: 253 GBs, 8,436 files

• NSF/NPACI - Grid Portal: 2,211 GBs, 51,227 files

• NIH - Alliance for Cell Signaling microarray data: 856 GBs, 62,291 files

• NSF - National Science Digital Library SIO Explorer collection: 2,080 GBs, 808,901 files

• NSF/NPACI - Transana education research video collection: 92 GBs, 2,387 files

• NSF/ITR - Southern California Earthquake Center: 91,040 GBs, 1,791,494 files

Persistent Archive

• UCSD Libraries archive: 128 GBs, 204,828 files

• NARA - Research Prototype Persistent Archive: 166 GBs, 316,813 files

• NSF - National Science Digital Library persistent archive: 3,571 GBs, 26,908,350 files

TOTAL: 328 TB, 51 million files

Number of Users (values as listed; row alignment lost in the source layout): 29, 58, 122, 4,900, 80, 380, 178, 50, 67, 3,247, 214, 23, 36, 407, 21, 27, 26, 62

Generic Infrastructure

• SDSC developed the Storage Resource Broker (SRB) to support access to distributed data

• Effort started in 1996 as a DARPA-funded project

• Now supports over 30 national/international projects

• Development team of 12 staff is led by

• Michael Wan, data management systems

• Arcot Rajasekar , information management systems

SDSC SRB Team

(left to right)


Arun Jagatheesan

• George Kremenek

• Sheau-Yen Chen

• Arcot Rajasekar (SRB development lead )

• Reagan Moore (SRB PI)

• Michael Wan (SRB architect)

Roman Olschanowsky (BIRN)

• Bing Zhu

• Charlie Cowart

• Lucas Gilbert

Tim Warnock

Wayne Schroeder (SRB product)

• Adam Birnbaum (SRB production)

• Antoine De Torcy

Vicky Rowley (BIRN)

• Marcio Faerman (SCEC)

• Students & emeritus

Erik Vandekieft

Reena Mathew

Xi (Cynthia) Sheng

Allen Ding

Grace Lin

Qiao Xin

Daniel Moore

Ethan Chen

Jon Weinburg

• Supported by over 20 projects (NSF, DOE, NASA, NARA, NIH, LOC, NHPRC)

Data Grid Capabilities

• Data manipulation

• Containers

• Parallel I/O

• Firewall interactions

• Resource interactions

• Fault tolerance

• Load leveling

• Replication

• HIPAA security requirements

• Authentication of all users

• Access controls on data and metadata

• Audit trails

• Data encryption

• Centralized control

• Application interfaces

• C library, Shell commands, Java, Perl, Python, WSDL, workflow

Data Management System Features

• Data grid for managing distributed data

• Latency management for bulk analyses of collections

• Infrastructure independent name spaces for describing data, resources, users, and state information

• Digital library for managing data context

• Curation services for managing collections

• Descriptive metadata for discovery

• Persistent archive to manage technology evolution

• Interoperability mechanisms between heterogeneous storage systems and user access mechanisms

BIRN - Biomedical Informatics Research Network Data Grid

Integrating Cyber Infrastructure to Link:

• Advanced Imaging Instruments

• Data Intensive Computing

• Multi-Scale Brain Databases

[Figure labels: Wash U., Duke, Cal Tech, UCLA, Harvard, NIH/NCRR Centers for Imaging and Computing, NPACI/SDSC, Cal-(IT)2, "Deep Web", "Surface Web", Wireless "Pad" Web Interface.]

Digital Library

• Collection hierarchy for organizing data

• User-defined metadata

• Collection level metadata

• Metadata manipulation

• Schema extension

• Bulk metadata processing

• Queries on metadata

• Access controls on metadata

• Views on collections

• Digital library APIs

• DSpace, Fedora, OAI-PMH, web browsers

• METS metadata XML schema

Southern California Earthquake Center

• Store seismic data

• Managing over 90 TBs, over 1.7 million files

• Store community models for seismic velocity

• Data distributed between USC, SDSC

• SCEC community digital library

• Storage Resource Broker data grid technology

• NMI portal interface

• Digital library services to display seismograms

• Visualizations of seismic waves at the surface

• Visualization of seismic wave propagation through the volume

[Figure labels: Select Scenario (Fault Model, Source Model); Select Receiver (Lat/Lon); SCEC Community Library; Output Time History Seismograms.]

National Virtual Observatory

Provide access to large star catalogs and large image sky surveys

• 2MASS

• SDSS

• DPOSS

• USNO-B

• Macho

[Figure: Virtual Observatory Architecture. Labels, roughly top to bottom: Discover, Compute, Publish, Collaborate; Portals, User Interfaces, Tools (VOPlot, Topcat, DIS, SkyQuery, Aladin, Mirage, conVOT, OASIS); Registry Layer (HTTP Services, SOAP Services, Grid Services; Semantics (UCD); ADS, OAI, Digital Library, other registries; XML, DC, METS); Data Services; Compute Services; Existing Data Centers; Virtual Data; Workflow (pipelines); My Space storage services; Authentication & Authorization; Grid Middleware (SRB, Globus, OGSA, SOAP, GridFTP); Databases, Persistency, Replication; Disks, Tapes, CPUs, Fiber.]

National Science Digital Library

Preserve educational material that has been registered into a central repository at Cornell through URLs

• Crawl web and retrieve material, 10 levels of indirection

• Convert internal URLs into data grid handles

• Aggregate files into containers for storage

• Preserve using SRB data grid technology

• Currently housing over 26 million files

Web Interface to

Persistent Archive

National Archives and Records Administration - Research Prototype Persistent Archive

Demonstrate preservation environment

• Authenticity

• Integrity

• Management of technology evolution

• Mitigation of risk of data loss

• Replication of data

• Federation of catalogs

• Management of preservation metadata

• Scalability

• EAP collection

• 350,000 files

• 1.2 TBs in size

Federation of Three Independent Data Grids

• NARA (MCAT): Principal copy stored at NARA with complete metadata catalog

• U Md (MCAT): Replicated copy at U Md for improved access, load balancing and disaster recovery

• SDSC (MCAT): Deep Archive at SDSC, no user access, but complete copy

For More Information

Reagan W. Moore

San Diego Supercomputer Center moore@sdsc.edu

http://www.npaci.edu/DICE

http://www.npaci.edu/DICE/SRB

http://www.npaci.edu/dice/srb/mySRB/mySRB.html
