Efficient access and query, data integration
Group 4
Group coordinators: Alok Choudhary, Rob Ross

Parallel and Random I/O
• I/O Stacks
  • High-level I/O libraries (PnetCDF, HDF5, SILO)
  • I/O middleware (MPI-IO)
  • Parallel file systems (Lustre, GPFS, PVFS)
  • Other shared file systems (CXFS, GFS, Panasas, qfs)
  • Large FC configurations
• Random I/O
  • Solutions may exist
  • SDM/DBMS/indexing/analysis/query
  • Query metadata for optimizing seemingly random accesses
• Research and development
  • Scale! Not just an engineering problem.
  • DB-like, query operations (more later)
  • Recognizing and/or passing on access pattern information, then acting on it
  • Execution of app. code at the I/O server (active disk)
  • (User) metadata as file system constructs
    • Related to metadata issues
• Hardening and packaging
  • Performance/scalability are “ok”
  • Will these scale to next-generation systems (e.g. BG/L, Red Storm)?
  • Fault tolerance
  • System support
• Deployment and maintenance
  • Low BW, serial applications in good shape
  • High BW, embarrassingly parallel, task farming
[Figure: I/O stack diagram. Compute nodes (Parallel netCDF, MPI-IO, Parallel File System) connected through a switch network to I/O servers.]

Parallel and Random I/O
• Gaps with Priority
  • Scaling of parallel I/O stack
    • Both scaling of # of clients, and
    • Scaling of size of the file system (# of files/objects)
  • APIs for passing more information to the system
    • (already there in MPI-IO to some extent, and in some PFSs, but not adequate; support is also needed at the high-level I/O library)
  • Management of large scale storage
    • Fault tolerance
    • Autonomic (self-managing, etc.) storage
  • Connecting PFSs to hierarchical storage systems efficiently

Large-scale feature-based Queries
• Lots of dimensions
  • Existing indexing techniques aren’t particularly good for this
  • Not worth building an index at all in some instances
• Research and development
  • Parallel update problem with existing representations
  • When to linear scan, streaming
  • Hardware-assisted searching (e.g. Netezza, NexQL, Seisint)
• Hardening and packaging
  • Bitmapped indexing, in some use
• Deployment and maintenance
  • Relational DBs
  • Object DBs

Large-Scale, Feature-Based Queries
• Gaps with Priorities
  • Scalability of techniques, such as indexing, as a solution to this problem
  • Support for runtime feature extraction
  • Concurrent update (addition) to indices
    • only for some groups

Query processing over files
• DB-like operations on files
  • Structured data files such as HDF5, PnetCDF, SILO
• Alternative APIs, file format independent
  • Java database objects, ODMG
• Research and development
  • What should the API look like?
  • Protocols for accessing databases in distributed environments with arbitrary backends (e.g., GGF DAIS group)
• Hardening and packaging
  • Ad-hoc Query package (LLNL work)
    • Range queries over SILO mesh data
  • ROOT (HEP community)
    • Operates on files in internal file format
• Deployment and maintenance
  • nothing

Query Processing over Files
• Gaps with Priorities
  • Determining the API for this query processing
    • What capabilities are needed from this API?
  • Implementing this API for common file formats
    • Appropriate underlying optimizations may impact all of I/O stack (e.g. query optimizations, cache management, etc.)
  • Extensible, parallel runtime for aiding in the use of this API, constructing queries, etc.

Data Integration
• Digital libraries, federations and warehousing
• Research and development
  • Tools for aiding in creation of warehouses, ontology creation
  • Fine-grained access control
  • Security in federated/dist. environment (pharma etc.)
    • Applies even to the queries, not just the data itself
• Hardening and packaging
  • Digital libraries (SRB)
  • Many one-off instances of domain-specific integrations
• Deployment and maintenance
  • DiscoveryLink (IBM), other commercial packages: framework for doing data integration with their DB offerings
  • Linking similar (R)DBs together isn’t too difficult

Data Integration
• Gaps with Priorities
  • Converging on a language for describing metadata for communities
  • Tools to support wrapping and integrating complex data
    • From arbitrary sources (free text, mesh data, etc.), including files
    • For this domain (community exists looking at bio domain)
  • Provenance
  • Security
    • Cross-domain access and authentication
    • Encryption of both queries and data
    • Authentication of data sources

The End