Persistent Management of Distributed Data

advertisement
Persistent Management of
Distributed Data
Reagan W. Moore
General Atomics, Inc.
San Diego Supercomputer Center
moore@sdsc.edu
http://www.npaci.edu/DICE/
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Topics
• Data management systems
– Data collections, digital libraries
• Distributed data management
– Data grids
• Persistent data management
– Persistent archives
• Common infrastructure for data management
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Data Collections
• Define the context for describing a
collection of digital entities
–
–
–
–
Context specified by metadata attributes
Provenance, origin of the digital entities
Administrative, location of the digital entities
Technical, purpose of the digital entities
• Support organization of attributes as
hierarchy of sub-collections
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Digital Libraries
• Provide services on the data collection
–
–
–
–
–
Ingestion, loading of attribute values
Extensibility, definition of new attributes
Discovery, queries on attributes
Browsing, hierarchical listing
Presentation, formatting specified data models
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Data Grids
• Manage data in a distributed environment
–
–
–
–
Logical name space, provide global identifier
Data access, storage system abstraction
Replication, disaster back up
Uniform access, common API across file
systems, archives, and databases
– Single sign-on, authenticate across
administration domains
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Persistent Archives
• Manage technology evolution
– Storage system abstraction, support data
migration across storage systems
– Information repository abstraction, support
catalog migration to new databases
– Logical name space, support global persistent
identifier
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Storage Resource Broker
• Integration of collection-based management
of digital entities, with
– Remote data access through storage system
abstraction
– Catalog access through information repository
abstraction
– Automation through collection-owned data
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Capabilities
•
•
•
•
•
Support legacy systems
Integrate archives with file systems
Share distributed data
Maintain persistent collection
Control data access
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Digital Entities
• Digital entities are “images of reality”,
made of
– Data, the bits (zeros and ones) put on a storage
system
– Information, the attributes used to assign
semantic meaning to the data
– Knowledge, the structural relationships
described by a data model
• Every digital entity requires information
and knowledge to correctly interpret and
display
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Digital Entities
• Files
– Text documents, images, spread sheets, binary
files
• URLs
• Database query commands
• Databases
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Digital Entities
• Register digital entities into a catalog
• Assign metadata to describe each digital
entity
• Separate management of the associated data
bits from management of the metadata
• Support manipulation of each digital entity
data type
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Technology Management
Old Application
New Operating System
Wrap Storage System
Wrap Display System
Old Storage System
Old Display System
Migrate Encoding Format
Digital Object
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Preservation of Data
• Migration
– Preserve the data bits
– Preserve the digital entity name
– Preserve the information and knowledge
content for presentation by new applications
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Migration Advantages
• By migrating the digital entity encoding
format to new standards, more sophisticated
technologies can be applied to express the
information and knowledge content inherent
in collections of digital entities.
• Requires the ability to associate data model
with digital entity
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Uniform API
• Provide common access semantics
• Map from the interface preferred by your
application to the interfaces required by
legacy storage systems
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
SDSC Storage Resource Broker & Meta-data Catalog
Common APIs
Application
C, C++, Linux
Libraries
I/O
Unix
Shell
Java, NT
Browsers
DLL /
Python
GridFTP
Web
WSDL
Consistency Management / Authorization-Authentication
Logical Name
Space
Latency
Management
Catalog Abstraction
Databases
DB2, Oracle, Sybase
Data
Transport
Metadata
Transport
Storage Abstraction
Archives
File Systems Databases
HPSS, ADSM, HRM
UniTree, DMF
Unix, NT,
Mac OSX
National Partnership for Advanced Computational Infrastructure
DB2, Oracle,
Postgres
Access
APIs
Prime
Server
Servers
San Diego Supercomputer Center
Discovery Transparencies
• Naming transparency - find a data set without
knowing its name
– Map from attributes to a global file name
• Location transparency - access a data set without
knowing where it is
– Map from global file name to local file name
• Access transparency - access a data set without
knowing the type of storage system
– Federated client-server architecture
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
SDSC Storage Resource Broker & Meta-data Catalog
Transparencies
Application
C, C++, Linux
Libraries
I/O
Unix
Shell
Java, NT
Browsers
DLL /
Python
GridFTP
Web
WSDL
Consistency Management / Authorization-Authentication
Logical Name
Space
Latency
Management
Catalog Abstraction
Databases
DB2, Oracle, Sybase
Data
Transport
Metadata
Transport
Storage Abstraction
Archives
File Systems Databases
HPSS, ADSM, HRM
UniTree, DMF
Unix, NT,
Mac OSX
National Partnership for Advanced Computational Infrastructure
DB2, Oracle,
Postgres
Access
APIs
Prime
Server
Servers
San Diego Supercomputer Center
Persistent Collection
• Maintain authenticity
– Authenticate all accesses
– Assign roles for access control lists (curation,
write, annotate, read)
– Manage audit trails of all operations
• Collection-owned data
– All accesses through the data management
system
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
SDSC Storage Resource Broker & Meta-data Catalog
Persistency
Application
C, C++, Linux
Libraries
I/O
Unix
Shell
Java, NT
Browsers
DLL /
Python
GridFTP
Web
WSDL
Consistency Management / Authorization-Authentication
Logical Name
Space
Latency
Management
Catalog Abstraction
Databases
DB2, Oracle, Sybase
Data
Transport
Metadata
Transport
Storage Abstraction
Archives
File Systems Databases
HPSS, ADSM, HRM
UniTree, DMF
Unix, NT,
Mac OSX
National Partnership for Advanced Computational Infrastructure
DB2, Oracle,
Postgres
Access
APIs
Prime
Server
Servers
San Diego Supercomputer Center
Preservation
(Similar requirements to a data grid)
• Name transparency
– Find a file by attributes (map from attributes to global
name)
• Location transparency
– Access a file by a global identifier (map from global to
local file name)
• Access transparency
– Use same API to access data in archive or file cache
• Authenticity
– Disaster recovery, replicate data across storage systems
– Audit and process management
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
SDSC Storage Resource Broker & Meta-data Catalog
Preservation
Application
C, C++, Linux
Libraries
I/O
Unix
Shell
Java, NT
Browsers
DLL /
Python
GridFTP
Web
WSDL
Consistency Management / Authorization-Authentication
Logical Name
Space
Latency
Management
Catalog Abstraction
Databases
DB2, Oracle, Sybase
Data
Transport
Metadata
Transport
Storage Abstraction
Archives
File Systems Databases
HPSS, ADSM, HRM
UniTree, DMF
Unix, NT,
Mac OSX
National Partnership for Advanced Computational Infrastructure
DB2, Oracle,
Postgres
Access
APIs
Prime
Server
Servers
San Diego Supercomputer Center
Convergence of Technologies
• Data grids as basis for distributed data management
– Federation of distributed resources
– Creation of logical name space to automate discovery
• Distributed data collections
– Discovery based on attributes
– Distributed data storage systems
• Digital libraries
– Development of services for manipulating, viewing data
• Persistent archives
– Management of technology evolution
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Data Naming Ontologies
Concept space
Discipline concepts
Data grid
Global Identifier
Collection
Discipline attributes
Archive / file systems
Local file name
Data model
Attributes that describe
data structure
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Knowledge Creation Roadmap
• Knowledge syntax (consensus)
– RDF, XMI, Topic Map
• Knowledge management (recursive operations)
– Oracle parallel database
• Knowledge manipulation (spatial/procedural rules)
– Generation of inference rules and mapping to data models
• Knowledge generation (scalable inference engine)
– Application of inference rules in inference engine
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Knowledge Based Data Grid Roadmap
Knowledge
Repository for
Rules
Access
Services
Rules - KQL
Knowledge
Relationships
Between
Concepts
Management
XTM DTD
Ingest
Services
Knowledge or
Topic-Based
Query / Browse
Attributes
Semantics
Information
Repository
SDLIP
Information
XML DTD
(Model-based Access)
Attribute- based
Query
Fields
Containers
Folders
Storage
(Replicas,
Persistent IDs)
National Partnership for Advanced Computational Infrastructure
Grids
Data
MCAT/HDF
(Data Handling System - SRB)
Feature-based
Query
San Diego Supercomputer Center
Further Information
http://www.npaci.edu/DICE
National Partnership for Advanced Computational Infrastructure
San Diego Supercomputer Center
Related documents
Download