Storage Engine for Semantic Web Assertion Storage engine for semantic web has requirements similar to those for ecommerce aplications. Draw upon results and lessons from – R. Agrawal, A. Somani, Y. Xu: Storage and Retrieval of E-Commerce Data. VLDB-2001. Typical E-Commerce Data Characteristics An Experimental E-marketplace for Computer components Nearly 2 Million components More than 2000 leaf-level categories Large number of Attributes (5000) Constantly evolving schema Sparsely populated data (about 50-100 attributes/component) Alternative Physical Representations Horizontal – One N-ary relation Binary – N 2-ary relations Vertical – One 3-ary relation Conventional horizontal representation (n-ary relation) Name Monitor Height Recharge Output playback Smooth scan Progressive Scan PAN DVD-L75 7 inch - Built-in Digital - - - KLH DVD221 - 3.75 - S-Video - - No SONY S-7000 - - - - - - - SONY S-560D - - - - Cinema Sound Yes - … … … … … … … … DB Catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns) Storage overhead of NULL values Nulls increase the index size and they sort high in DB2 B+ tree index Hard to load/update Schema evolution is expensive Querying is straightforward Binary Representation (N 2-ary relations) Monitor Height Output Name Val Name Val PAN DVD-L75 7 inch KLH DVD221 3.75 Dense representation Manageability is hard because of large number of tables Schema evolution expensive Name Val PAN DVD-L75 Digital KLH DVD221 S-Video Decomposition Storage Model [Copeland et al SIGMOD 85], [Khoshafian et al ICDE 87] Monet: Binary Attribute Tables [Boncz et al VLDB Journal 99] Attribute Approach for storing XML Data [Florescu et al INRIA Tech Report 99] Vertical representation (One 3-ary relation) Oid (object identifier) Key (attribute name) Val (attribute value) Oid Key Val 0 ‘Name’ ‘PAN DVDL75’ 0 ‘Monitor’ ‘7 inch’ 0 ‘Recharge’ ‘Built-in’ 0 ‘Output’ ‘Digital’ 1 ‘Name’ ‘KLH DVD221’ 1 ‘Height’ ‘3.75’ 1 ‘Output’ ‘S-Video’ 1 ‘Progressiv e Scan’ ‘No’ 2 ‘Name’ ‘SONY S-7000’ … … … Objects can have large number of attributes Handles sparseness well Schema evolution is easy Implementation of SchemaSQL [LSS 99] Edge Approach for storing XML Data [FK 99] Querying over Vertical Representation is Complex Simple query on a Horizontal scheme SELECT MONITOR FROM H WHERE OUTPUT=‘Digital’ Becomes quite complex: SELECT v1.Val FROM vtable v1, vtable v2 WHERE v1.Key = ‘Monitor’ AND v2.Key = ‘Output’ AND v2.Val = ‘Digital’ AND v1.Oid = v2.Oid Writing applications becomes much harder. What can we do ? Solution Provide horizontal view of the vertical table Translation layer automatically maps operations on H to operations on V Horizontal view (H) Attr1 … Attr2 Query Mapping Layer Vertical table (V) Oid Key Val Attrk … Transformation Algebra Defined an algebra for transforming expressions over horizontal views into expressions over the vertical representation. Two key operators: – v2h () – h2v () Sample Algebraic Transforms v2h () Operation – Convert from vertical to horizontal k(V) = [Oid(V)] [i=1,k Oid,Val(Key=‘Ai’(V))] h2V () Operation – Convert from horizontal to vertical k(H) = [i=1,k Oid,’Ai’Ai(Ai ‘’(V))] [i=1,k Oid,’Ai’Ai(i=1,k Ai=‘’(V)) Similar operations such as Unfold/Fold and Gather/Scatter exist in SchemaSQL [LSS 99] and [STA 98] respectively Complete transforms in VLDB-2001 Paper From the Algebra to SQL Equivalent SQL transforms for algebraic transforms – Select, Project – Joins (self, two verticals, a horizontal and a vertical) – Cartesian Product – Union, Intersection, Set difference – Aggregation Extend DDL to provide the Horizontal View CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable USING COLUMNS (Attr1, Attr2, … Attrk, …) Alternative Implementation Strategies VerticalSQL – Uses only SQL-92 level capabilities VerticalUDF – Exploits User Defined Functions and Table Functions to provide a direct implementation Binary (hand-coded queries) – 2-ary representation with one relation per attribute (using only SQL-92 transforms) Data Organization Matters: Clustering by Key significantly outperforms by Oid Execution time (seconds) density = 10%, 1000 cols x 20K rows 25 20 VerticalSQL_oid 15 VerticalSQL_key 10 5 0 0.1% 1% Join selectivity Join 5% VerticalSQL comparable to Binary and outperforms Horizontal density = 10% Execution time (seconds) 60 50 40 HorizontalSQL 30 VerticalSQL 20 Binary 10 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) Projection of 10 columns VerticalUDF is the best approach density = 10% Execution time (seconds) 30 20 VerticalSQL Binary 10 VerticalUDF 0 200x100K 400x50K 800x25K 1000x20K Table (#cols x #rows) Projection of 10 columns Summary Horizontal Vertical (w/ Mapping) Binary (w/ Mapping) Manageability + + - Flexibility - + - Querying + + + Performance - + + Remarks Lessons of this study directly apply to building storage engine for semantics webs Performance of vertical representation can be further improved by: – Enhanced table functions – First class treatment of table functions – Native support for v2h and h2v operations – Partial indices