NoDB: Querying Raw Data
--Mrutyunjay

Overview
▪ Introduction
▪ Motivation
▪ NoDB Philosophy: PostgreSQL
▪ Results
▪ Opportunities

“NoDB in Action: Adaptive Query Processing on Raw Data”
Ioannis Alagiannis, Renata Borovica, Miguel Branco, Stratos Idreos and Anastasia Ailamaki.
In the Proceedings of the VLDB Endowment (PVLDB) (Demo), 2012.

Introduction

Motivation
▪ DBMSs are rarely used for emerging applications such as scientific analysis and social networks.
▪ This is due to the prohibitive initialization cost and complexity (loading the data, configuring the physical design, etc.) and the increased "data-to-query" time.
▪ For example, a scientist needs to quickly examine a few terabytes of new data in search of certain properties. Even though only a few attributes might be relevant for the task, the entire data set must first be loaded into the database. For large amounts of data, this means a few hours of delay.
▪ NoDB Philosophy: To make database systems more accessible to the user by eliminating the major bottlenecks of current state-of-the-art technology that increase data-to-query time.

Querying Raw Data
▪ Straightforward Approaches:
  -- Run the loading process whenever the relevant query arrives. Store the data in a temporary table and discard the table after the query.
  -- Integrate raw file access with query execution: the scan operator reads the raw file from disk in chunks.
▪ Limitations:
  ▪ Not viable for extensive and repeated query processing.
  ▪ Does not use important database system functionality such as indexing.

NoDB Philosophy: PostgreSQL
▪ On-the-fly parsing
▪ Indexing
▪ Caching
▪ Updates

On-the-fly parsing
▪ Parsing and Tokenizing Raw Data:
  ▪ Load raw data file -> Identify rows (tuples) and attributes.
  ▪ Transform into proper binary values depending on the attribute type.
▪ Selective Tokenizing:
  ▪ Abort tokenizing once the required attributes are found.
  ▪ If a query needs the 4th and 8th attributes, then tokenize only up to the 8th attribute.
  ▪ No I/O benefit, but reduces CPU processing cost.
▪ Selective Parsing:
  ▪ As above: only the required attributes are transformed into binary values.
▪ Selective Tuple Formation:
  ▪ Tuples contain only the attributes required for a given query.

Indexing (Adaptive Positional Map)
▪ Adaptive Positional Map:
  ▪ Metadata information on the structured flat file, used to navigate and retrieve raw data faster.
  ▪ Reduces parsing and tokenizing costs.
  ▪ The metadata refers to the positions of attributes in the raw file.
  ▪ The positional map is created on-the-fly during query processing and populated during the tokenizing phase.
  ▪ The positional map stores positions for every tuple in the table, based on the attributes the queries access. Attribute lengths are variable, i.e. the same attribute appears at a different position in different tuples.

Indexing (Adaptive Positional Map)
▪ The positional map is implemented as a collection of chunks partitioned vertically and horizontally.
▪ A higher-level data structure is maintained that records the order of attributes in the map with respect to their order in the file.

Caching and Updates*
▪ The cache holds previously accessed data (previously accessed attributes).
▪ Populated on-the-fly during query processing (the binary data is cached immediately).
▪ Follows the format of the positional map. The cache has a fixed size and uses an LRU policy to drop and populate entries.
▪ A change in the position of an attribute in the data file might call for significant reorganization. Nevertheless, being an auxiliary data structure, the positional map can be dropped and recreated when needed again.

Results
▪ Raw data file: 11 GB
▪ 7.5 × 10^6 tuples, 150 attributes each

Opportunities
▪ Flexible Storage: NoDB systems do not require a priori loading, which implies they also do not require a priori decisions on how data is physically organized during loading.
▪ Updates: Immediate access to updated raw data files provides a major opportunity to decrease the data-to-query time even further and to enable tight interaction between the user and the database system.
▪ Information Integration: Query multiple different data sources and formats, supporting different file formats.
▪ File System Interface: Unlike traditional database systems, data in NoDB systems is always stored in file systems, such as NTFS or ext4. This provides NoDB the opportunity to intercept file system calls and gradually create auxiliary data structures that speed up future NoDB queries.

Questions
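
Backup: illustrative sketch
The selective tokenizing, selective parsing and positional map ideas from the earlier slides can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the PostgreSQL-based implementation from the paper: the names `tokenize_upto`, `read_field`, `PositionalMap` and `scan` are hypothetical, rows are assumed to be well-formed comma-separated lines held in memory, and the map is a plain dictionary rather than vertically and horizontally partitioned chunks.

```python
from typing import Callable, Dict, List, Optional, Tuple


def tokenize_upto(line: str, max_attr: int, delim: str = ",") -> List[int]:
    """Selective tokenizing: find the start offsets of attributes 0..max_attr
    and stop scanning the line as soon as the last needed one is located."""
    offsets = [0]
    pos = 0
    while len(offsets) <= max_attr:
        nxt = line.find(delim, pos)
        if nxt == -1:  # line has fewer attributes than requested
            break
        offsets.append(nxt + 1)
        pos = nxt + 1
    return offsets


def read_field(line: str, start: int, delim: str = ",") -> str:
    """Return the raw text of the attribute that begins at offset `start`."""
    end = line.find(delim, start)
    return line[start:] if end == -1 else line[start:end]


class PositionalMap:
    """Hypothetical, simplified positional map: (row, attribute) -> offset."""

    def __init__(self) -> None:
        self.offsets: Dict[Tuple[int, int], int] = {}

    def lookup(self, row: int, attr: int) -> Optional[int]:
        return self.offsets.get((row, attr))

    def record(self, row: int, attr: int, offset: int) -> None:
        self.offsets[(row, attr)] = offset


def scan(lines: List[str],
         wanted: Dict[int, Callable[[str], object]],
         pmap: PositionalMap,
         delim: str = ",") -> Tuple[List[tuple], int]:
    """Return tuples holding only the wanted attributes, plus the number of
    rows that had to be tokenized because the map could not answer them.
    `wanted` maps an attribute index to the converter for its type."""
    results = []
    tokenized_rows = 0
    for row, line in enumerate(lines):
        starts = {a: pmap.lookup(row, a) for a in wanted}
        if any(s is None for s in starts.values()):
            # Map miss: tokenize only up to the last needed attribute and
            # populate the map during this tokenizing pass.
            offsets = tokenize_upto(line, max(wanted), delim)
            tokenized_rows += 1
            for a in wanted:
                starts[a] = offsets[a]  # assumes the row has this attribute
                pmap.record(row, a, offsets[a])
        # Selective parsing and tuple formation: convert only wanted fields.
        results.append(tuple(conv(read_field(line, starts[a], delim))
                             for a, conv in wanted.items()))
    return results, tokenized_rows


lines = ["1,aa,2.5,x,10", "2,b,3.75,yy,20"]
pmap = PositionalMap()
first, tokenized_first = scan(lines, {0: int, 2: float}, pmap)
second, tokenized_second = scan(lines, {0: int, 2: float}, pmap)
```

In this sketch the first query tokenizes every row and fills the map, while the repeated query jumps straight to the recorded offsets and tokenizes nothing. The dictionary never evicts entries; a fixed-size structure with an LRU policy, as the slides describe for the cache, would be needed to bound memory.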