Efficient access and query, data integration
Group 4
Group coordinators: Alok Choudhary, Rob Ross
Parallel and Random I/O
• I/O Stacks
• High-level I/O libraries (PnetCDF, HDF5, SILO)
• I/O middleware (MPI-IO)
• Parallel file systems (Lustre, GPFS, PVFS)
• Other shared file systems (CXFS, GFS, Panasas, qfs)
[Figure: compute nodes, network switch, and I/O servers; I/O software stack of Parallel netCDF, MPI-IO, and Parallel File System]
• Random I/O
• Solutions may exist
• Query metadata for optimizing seemingly random accesses
• SDM/DBMS/indexing/analysis/query
• Related to metadata issues
• Research and development
• Scale! Not just an engineering problem.
• DB-like, query operations (more later)
• Recognizing and/or passing on access pattern information, then acting on it
• Execution of app. code at the I/O server (active disk)
• (user) Metadata as file system constructs
• Hardening and packaging
• Performance/scalability are “ok”
• Will these scale to next-generation systems (e.g. BG/L, Red Storm)?
• Large FC configurations
• Fault tolerance
• System support
• Deployment and maintenance
• Low BW, serial applications in good shape
• High BW, embarrassingly parallel, task farming
Parallel and Random I/O
• Gaps with Priorities
• Scaling of parallel I/O stack
• Both scaling of # of clients, and
• Scaling of size of the file system (# of files/objects)
• APIs for passing more information to the system
• (partly available in MPI-IO and in some PFSs, but not adequate; support is also needed in the high-level I/O libraries)
• Management of large scale storage
• Fault tolerance
• Autonomic (self-managing, etc.) storage
• Connecting PFSs to hierarchical storage systems efficiently
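To make the "passing access pattern information, then acting on it" gap concrete, here is a minimal Python sketch. It is not taken from MPI-IO or any file system; the names `detect_stride`, `coalesce`, and `max_gap` are hypothetical. It assumes the I/O layer can see a list of (offset, length) requests, recognizes a regular stride, and merges nearby requests into fewer, larger reads in the spirit of data sieving.

```python
from typing import List, Optional, Tuple

Request = Tuple[int, int]  # (byte offset, length); illustrative type only


def detect_stride(requests: List[Request]) -> Optional[int]:
    """Return the constant stride between request offsets, or None if irregular."""
    if len(requests) < 3:
        return None
    offsets = [off for off, _ in requests]
    strides = {b - a for a, b in zip(offsets, offsets[1:])}
    lengths = {length for _, length in requests}
    if len(strides) == 1 and len(lengths) == 1:
        stride = strides.pop()
        return stride if stride > 0 else None
    return None


def coalesce(requests: List[Request], max_gap: int) -> List[Request]:
    """Merge requests separated by at most max_gap bytes into larger reads."""
    merged: List[Request] = []
    for off, length in sorted(requests):
        if merged:
            last_off, last_len = merged[-1]
            last_end = last_off + last_len
            if off - last_end <= max_gap:
                # Extend the previous read to cover this request (and the gap).
                new_end = max(last_end, off + length)
                merged[-1] = (last_off, new_end - last_off)
                continue
        merged.append((off, length))
    return merged


# Eight 512-byte reads, one every 4096 bytes: a pattern that looks random
# to a file system but is regular once the offsets are known together.
requests = [(i * 4096, 512) for i in range(8)]
stride = detect_stride(requests)
plan = coalesce(requests, max_gap=4096)
```

In this toy case the eight small reads collapse into one contiguous read. Whether reading the gaps is worthwhile depends on the storage system, which is why `max_gap` is left as a tunable assumption.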
Large-Scale, Feature-Based Queries
• Lots of dimensions
• Existing indexing techniques aren’t particularly good for this
• Not worth building an index at all in some instances
• Research and development
• Parallel update problem with existing representations
• When to use a linear scan or streaming instead of an index
• Hardware-assisted searching (e.g. Netezza, NexQL, Seisint)
• Hardening and packaging
• Bitmap indexing (in some use)
• Deployment and maintenance
• Relational DBs
• Object DBs
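As an illustration of the bitmap indexing mentioned above, the following Python sketch builds a simple binned bitmap index per dimension and answers a two-dimensional range query by OR-ing bins within a dimension and AND-ing across dimensions. It is a teaching sketch only: the function names, the bin edges, and the sample values are all made up for the example, and production bitmap indexes add compression and handle queries that do not align with bin boundaries.

```python
from typing import List, Sequence


def build_bitmap_index(values: Sequence[float], edges: Sequence[float]) -> List[int]:
    """One bitmap (a Python int) per bin; bit r is set if row r falls in the bin.

    Bin k covers edges[k] <= v < edges[k + 1]. Values outside all bins
    are simply not indexed in this sketch.
    """
    bitmaps = [0] * (len(edges) - 1)
    for row, v in enumerate(values):
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                bitmaps[k] |= 1 << row
                break
    return bitmaps


def bins_union(bitmaps: List[int], first_bin: int, last_bin: int) -> int:
    """OR together the bitmaps of bins first_bin..last_bin (inclusive)."""
    result = 0
    for k in range(first_bin, last_bin + 1):
        result |= bitmaps[k]
    return result


def rows_of(bitmap: int) -> List[int]:
    """List the row numbers whose bits are set."""
    return [r for r in range(bitmap.bit_length()) if (bitmap >> r) & 1]


# Hypothetical two-dimensional data set (one value per row and dimension).
temperature = [12.0, 35.5, 28.1, 41.0, 19.9, 33.3]
pressure = [1.2, 0.8, 2.5, 2.9, 0.3, 2.1]

temp_index = build_bitmap_index(temperature, [0, 10, 20, 30, 40, 50])
pres_index = build_bitmap_index(pressure, [0.0, 1.0, 2.0, 3.0])

# Query: 30 <= temperature < 50 AND 2 <= pressure < 3.
hits = bins_union(temp_index, 3, 4) & bins_union(pres_index, 2, 2)

# The same query as a linear scan, for comparison.
linear = [r for r in range(len(temperature))
          if 30 <= temperature[r] < 50 and 2 <= pressure[r] < 3]
```

Both paths return the same rows. The open question raised on the slide is at what dimensionality and selectivity the index stops paying for itself and the scan wins.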
Large-Scale, Feature-Based Queries
• Gaps with Priorities
• Scalability of techniques, such as indexing, as a solution to this problem
• Support for runtime feature extraction
• Concurrent update (addition) to indices
• Only for some groups
Query Processing over Files
• DB-like operations on files
• Structured data files such as HDF5, PnetCDF, SILO
• Alternative APIs, file format independent
• Java database objects, ODMG
• Research and development
• What should the API look like?
• Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
• Hardening and packaging
• Ad-hoc Query package (LLNL work)
• Range queries over SILO mesh data
• ROOT (HEP community)
• Operates on files in internal file format
• Deployment and maintenance
• Nothing
Query Processing over Files
• Gaps with Priorities
• Determining the API for this query processing
• What capabilities are needed from this API?
• Implementing this API for common file formats
• Appropriate underlying optimizations may impact the entire I/O stack (e.g., query optimizations, cache management)
• Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.
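One possible shape for a file-format-independent query API is sketched below in Python. Everything here is hypothetical: `BlockSource`, `InMemorySource`, and `range_query` are illustrative names, not an existing interface for HDF5, PnetCDF, or SILO. The sketch assumes a backend can expose a variable as blocks with per-block (min, max) metadata, so a range query can skip blocks that cannot contain matches, which is one way a query optimization could reach down into the I/O stack.

```python
from typing import List, Protocol, Tuple


class BlockSource(Protocol):
    """Hypothetical format-independent view of one variable stored in blocks."""

    def num_blocks(self) -> int: ...
    def block_start(self, i: int) -> int: ...            # index of first element
    def block_range(self, i: int) -> Tuple[float, float]: ...  # (min, max) metadata
    def read_block(self, i: int) -> List[float]: ...


class InMemorySource:
    """Stand-in backend; a real one would wrap a structured file format."""

    def __init__(self, data: List[float], block_size: int) -> None:
        self._block_size = block_size
        self._blocks = [data[i:i + block_size]
                        for i in range(0, len(data), block_size)]
        self._ranges = [(min(b), max(b)) for b in self._blocks]
        self.blocks_read = 0  # counts simulated I/O operations

    def num_blocks(self) -> int:
        return len(self._blocks)

    def block_start(self, i: int) -> int:
        return i * self._block_size

    def block_range(self, i: int) -> Tuple[float, float]:
        return self._ranges[i]

    def read_block(self, i: int) -> List[float]:
        self.blocks_read += 1
        return self._blocks[i]


def range_query(source: BlockSource, lo: float, hi: float) -> List[Tuple[int, float]]:
    """Return (index, value) pairs with lo <= value <= hi, skipping blocks via metadata."""
    hits: List[Tuple[int, float]] = []
    for i in range(source.num_blocks()):
        bmin, bmax = source.block_range(i)
        if bmax < lo or bmin > hi:
            continue  # metadata proves this block has no matches
        start = source.block_start(i)
        for j, v in enumerate(source.read_block(i)):
            if lo <= v <= hi:
                hits.append((start + j, v))
    return hits


data = [1.0, 2.0, 3.0, 4.0, 50.0, 60.0, 55.0, 52.0, 5.0, 6.0, 51.0, 7.0]
source = InMemorySource(data, block_size=4)
result = range_query(source, 50.0, 56.0)
```

Here the first block is never read because its stored maximum is below the query range. The same `range_query` would work unchanged over any backend that implements the four methods, which is the point of keeping the API independent of the file format.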
Data Integration
• Digital libraries, federations and warehousing
• Research and development
• Tools for aiding in creation of warehouses, ontology creation
• Fine-grained access control
• Security in federated/distributed environments (e.g., pharma)
• Applies even to the queries, not just the data itself
• Hardening and packaging
• Digital libraries (SRB)
• Many one-off instances of domain-specific integrations
• Deployment and maintenance
• DiscoveryLink (IBM), other commercial packages: frameworks for doing data integration with their DB offerings
• Linking similar (relational) DBs together isn’t too difficult
Data Integration
• Gaps with Priorities
• Converging on a language for describing metadata for communities
• Tools to support wrapping and integrating complex data
• From arbitrary sources (free text, mesh data, etc.), including files
• For this domain (community exists looking at bio domain)
• Provenance
• Security
• Cross-domain access and authentication
• Encryption of both queries and data
• Authentication of data sources
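The wrapping and provenance gaps can be illustrated with a small Python sketch. It assumes two hypothetical sources (`lab_a`, `lab_b`) that describe the same kind of measurement with different field names and units; the common vocabulary, the field maps, and the sample records are all invented for the example. Each wrapped record carries a `_source` tag as a very minimal form of provenance.

```python
from typing import Callable, Dict, List, Optional

Record = Dict[str, object]
Converter = Callable[[object], object]


def wrap(record: Record, source_name: str, field_map: Dict[str, str],
         converters: Optional[Dict[str, Converter]] = None) -> Record:
    """Translate one source record into the common vocabulary.

    field_map maps source field names to common field names; converters
    (keyed by common field name) normalize units or types.
    """
    out: Record = {"_source": source_name}  # minimal provenance tag
    for src_field, common_field in field_map.items():
        if src_field in record:
            value = record[src_field]
            if converters and common_field in converters:
                value = converters[common_field](value)
            out[common_field] = value
    return out


# Two hypothetical sources with different schemas and units.
lab_a = [{"sample_id": "A-1", "temp_c": 20.0}]
lab_b = [{"id": "B-7", "temp_f": 68.0}, {"id": "B-8", "temp_f": 50.0}]

integrated: List[Record] = []
for rec in lab_a:
    integrated.append(wrap(rec, "lab_a",
                           {"sample_id": "sample", "temp_c": "temperature_c"}))
for rec in lab_b:
    integrated.append(wrap(rec, "lab_b",
                           {"id": "sample", "temp_f": "temperature_c"},
                           {"temperature_c": lambda f: (f - 32) * 5 / 9}))

# A query over the integrated view, answered without knowing either schema.
warm = [r["sample"] for r in integrated if r["temperature_c"] >= 20.0]
```

Hand-written field maps like these are exactly the one-off, domain-specific integrations the slides describe. A shared metadata language would let such mappings be declared once per community rather than coded per source pair.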
The End