Part 2 - Semantic Web Workshop 2002

Storage Engine for Semantic Web
Assertion

A storage engine for the Semantic Web has
requirements similar to those of e-commerce applications.

Draw upon results and lessons from
– R. Agrawal, A. Somani, Y. Xu: Storage and
Retrieval of E-Commerce Data. VLDB-2001.
Typical E-Commerce Data
Characteristics
An Experimental E-marketplace for
Computer components

Nearly 2 million components
More than 2000 leaf-level categories
Large number of attributes (5000)
Constantly evolving schema
Sparsely populated data (about 50-100 attributes/component)
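
To make the sparseness concrete, the figures above can be turned into a rough cell count. This is only back-of-the-envelope arithmetic on the numbers quoted on the slide (rounded to 2 million components and 5000 attributes), not a measurement from the study.

```python
# Rough cell counts for the catalog described above (slide figures, rounded).
components = 2_000_000                      # "nearly 2 million components"
attributes = 5_000                          # distinct attributes in the schema
populated_low, populated_high = 50, 100     # attributes actually set per component

# One N-ary relation stores a cell for every (component, attribute) pair.
horizontal_cells = components * attributes

# One 3-ary relation stores a row only for attributes that are present.
vertical_rows_low = components * populated_low
vertical_rows_high = components * populated_high

density_low = populated_low / attributes
density_high = populated_high / attributes

print(f"horizontal cells: {horizontal_cells:,}")
print(f"vertical rows:    {vertical_rows_low:,} to {vertical_rows_high:,}")
print(f"density:          {density_low:.0%} to {density_high:.0%}")
```

Under these assumptions roughly 98-99% of the cells in a single wide relation would be NULL.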

Alternative Physical Representations

Horizontal
– One N-ary relation

Binary
– N 2-ary relations

Vertical
– One 3-ary relation
Conventional horizontal representation
(n-ary relation)
Name          Monitor  Height  Recharge  Output   Playback      Smooth scan  Progressive Scan
PAN DVD-L75   7 inch   -       Built-in  Digital  -             -            -
KLH DVD221    -        3.75    -         S-Video  -             -            No
SONY S-7000   -        -       -         -        -             -            -
SONY S-560D   -        -       -         -        Cinema Sound  Yes          -
…             …        …       …         …        …             …            …

DB catalogs do not support thousands of columns (DB2/Oracle limit: 1012 columns)
Storage overhead of NULL values
– Nulls increase the index size and they sort high in the DB2 B+ tree index
Hard to load/update
Schema evolution is expensive
Querying is straightforward
Binary Representation
(N 2-ary relations)
Monitor
Name          Val
PAN DVD-L75   7 inch

Height
Name          Val
KLH DVD221    3.75

Output
Name          Val
PAN DVD-L75   Digital
KLH DVD221    S-Video

Dense representation
Manageability is hard because of the large number of tables
Schema evolution is expensive
Decomposition Storage Model
[Copeland et al SIGMOD 85],
[Khoshafian et al ICDE 87]
Monet: Binary Attribute Tables
[Boncz et al VLDB Journal 99]
Attribute Approach for storing
XML Data [Florescu et al INRIA
Tech Report 99]
Vertical representation
(One 3-ary relation)
Oid (object identifier) Key (attribute name) Val (attribute value)
Oid  Key                 Val
0    'Name'              'PAN DVD-L75'
0    'Monitor'           '7 inch'
0    'Recharge'          'Built-in'
0    'Output'            'Digital'
1    'Name'              'KLH DVD221'
1    'Height'            '3.75'
1    'Output'            'S-Video'
1    'Progressive Scan'  'No'
2    'Name'              'SONY S-7000'
…    …                   …





Objects can have a large number of attributes
Handles sparseness well
Schema evolution is easy
Implementation of SchemaSQL [LSS 99]
Edge Approach for storing XML Data [FK 99]
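
A minimal sketch of why schema evolution is easy in the vertical representation, using Python's built-in sqlite3 module as a stand-in database (the study itself does not use SQLite). The table name vtable follows the later slides; the attribute 'Weight' and its value are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vtable (Oid INTEGER, Key TEXT, Val TEXT)")
conn.executemany(
    "INSERT INTO vtable VALUES (?, ?, ?)",
    [
        (0, "Name", "PAN DVD-L75"),
        (0, "Monitor", "7 inch"),
        (0, "Recharge", "Built-in"),
        (0, "Output", "Digital"),
        (1, "Name", "KLH DVD221"),
        (1, "Height", "3.75"),
        (1, "Output", "S-Video"),
        (1, "Progressive Scan", "No"),
    ],
)

# A brand-new attribute needs no ALTER TABLE: it is just another row.
# 'Weight' / '2 kg' are hypothetical values, not data from the slides.
conn.execute("INSERT INTO vtable VALUES (?, ?, ?)", (1, "Weight", "2 kg"))

# The physical schema is unchanged; only the set of Key values grew.
columns = [row[1] for row in conn.execute("PRAGMA table_info(vtable)")]
keys = [row[0] for row in conn.execute("SELECT DISTINCT Key FROM vtable ORDER BY Key")]
print(columns)
print(keys)
```

Absent attributes simply have no row, which is also why sparse data costs nothing here.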
Querying over Vertical
Representation is Complex

A simple query on the horizontal scheme
    SELECT Monitor FROM H WHERE Output = 'Digital'
becomes quite complex on the vertical scheme:
    SELECT v1.Val
    FROM vtable v1, vtable v2
    WHERE v1.Key = 'Monitor'
      AND v2.Key = 'Output'
      AND v2.Val = 'Digital'
      AND v1.Oid = v2.Oid
Writing applications becomes much harder. What can we do?
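
The two queries can be checked against each other on a tiny dataset. The sketch below loads the same two products into a horizontal table H and a vertical table vtable in an in-memory SQLite database (SQLite is an assumption made for runnability, not the system used in the study) and runs both forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE H (Name TEXT, Monitor TEXT, Output TEXT);
    INSERT INTO H VALUES ('PAN DVD-L75', '7 inch', 'Digital');
    INSERT INTO H VALUES ('KLH DVD221', NULL, 'S-Video');

    CREATE TABLE vtable (Oid INTEGER, Key TEXT, Val TEXT);
    INSERT INTO vtable VALUES (0, 'Name', 'PAN DVD-L75');
    INSERT INTO vtable VALUES (0, 'Monitor', '7 inch');
    INSERT INTO vtable VALUES (0, 'Output', 'Digital');
    INSERT INTO vtable VALUES (1, 'Name', 'KLH DVD221');
    INSERT INTO vtable VALUES (1, 'Output', 'S-Video');
""")

# Horizontal form: one predicate, one projected column.
horizontal = conn.execute(
    "SELECT Monitor FROM H WHERE Output = 'Digital'"
).fetchall()

# Vertical form: one self-join per attribute mentioned in the query.
vertical = conn.execute("""
    SELECT v1.Val
    FROM vtable v1, vtable v2
    WHERE v1.Key = 'Monitor'
      AND v2.Key = 'Output'
      AND v2.Val = 'Digital'
      AND v1.Oid = v2.Oid
""").fetchall()

print(horizontal, vertical)
```

Every additional attribute in the SELECT list or WHERE clause adds another alias of vtable to the join.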
Solution

Provide a horizontal view of the vertical table
Translation layer automatically maps operations on H to operations on V

[Diagram: Horizontal view (H) with columns Attr1, Attr2, …, Attrk, … -> Query Mapping Layer -> Vertical table (V) with columns Oid, Key, Val]
Transformation Algebra

Defined an algebra for transforming
expressions over horizontal views into
expressions over the vertical representation.
Two key operators:
– v2h ()
– h2v ()
Sample Algebraic Transforms

v2h () Operation – Convert from vertical to horizontal
k(V) = [Oid(V)]  [i=1,k Oid,Val(Key=‘Ai’(V))]

h2V () Operation – Convert from horizontal to vertical
k(H) = [i=1,k Oid,’Ai’Ai(Ai  ‘’(V))] 
[i=1,k Oid,’Ai’Ai(i=1,k Ai=‘’(V))

Similar operations such as Unfold/Fold and Gather/Scatter
exist in SchemaSQL [LSS 99] and [STA 98] respectively

Complete transforms in VLDB-2001 Paper
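
The two operators can be sketched over plain Python collections. This is an illustrative reading of the definitions above, not the implementation from the paper: the function names, the dict-of-dicts layout for H, and the use of None for a null are all assumptions. It also assumes at most one value per (Oid, Key) pair.

```python
def v2h(vertical, attrs):
    """Pivot (oid, key, val) triples into one row per oid over attrs."""
    # Start from every oid in V so objects with none of attrs still get a row.
    rows = {oid: {a: None for a in attrs} for oid, _, _ in vertical}
    for oid, key, val in vertical:
        if key in rows[oid]:
            rows[oid][key] = val
    return rows


def h2v(horizontal, attrs):
    """Unpivot rows into (oid, key, val) triples."""
    triples = set()
    for oid, row in horizontal.items():
        present = [(oid, a, row[a]) for a in attrs if row[a] is not None]
        if present:
            # First term: one triple per non-null attribute.
            triples.update(present)
        else:
            # Second term: keep objects whose attributes are all null.
            triples.update((oid, a, None) for a in attrs)
    return triples


V = {
    (0, "Name", "PAN DVD-L75"), (0, "Monitor", "7 inch"), (0, "Output", "Digital"),
    (1, "Name", "KLH DVD221"), (1, "Output", "S-Video"),
}
ATTRS = ["Name", "Monitor", "Output"]

H = v2h(V, ATTRS)
round_trip = h2v(H, ATTRS)
print(H[1])
print(round_trip == V)
```

On this data the round trip returns the original triples, since the missing Monitor value for object 1 becomes a null in H and is dropped again by h2v.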
From the Algebra to SQL

Equivalent SQL transforms for algebraic transforms
– Select, Project
– Joins (self, two verticals, a horizontal and a vertical)
– Cartesian Product
– Union, Intersection, Set difference
– Aggregation

Extend DDL to provide the Horizontal View
CREATE HORIZONTAL VIEW hview ON VERTICAL TABLE vtable
USING COLUMNS (Attr1, Attr2, … Attrk, …)
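
CREATE HORIZONTAL VIEW is a proposed DDL extension, so it will not run on a stock database. As a rough approximation, the same effect can be had with an ordinary view built from left outer joins. The helper below is hypothetical (its name and signature are not from the paper), targets SQLite for runnability, interpolates names without escaping, and assumes at most one value per (Oid, Key) pair.

```python
import sqlite3


def horizontal_view_sql(view, vtable, columns):
    """Build a plain CREATE VIEW that pivots vtable into the given columns."""
    select_list = ", ".join(f't{i}.Val AS "{c}"' for i, c in enumerate(columns))
    # One left outer join per requested column keeps sparse objects in the view.
    joins = " ".join(
        f"LEFT OUTER JOIN {vtable} t{i} ON t{i}.Oid = o.Oid AND t{i}.Key = '{c}'"
        for i, c in enumerate(columns)
    )
    return (
        f"CREATE VIEW {view} AS SELECT o.Oid AS Oid, {select_list} "
        f"FROM (SELECT DISTINCT Oid FROM {vtable}) o {joins}"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vtable (Oid INTEGER, Key TEXT, Val TEXT)")
conn.executemany(
    "INSERT INTO vtable VALUES (?, ?, ?)",
    [
        (0, "Name", "PAN DVD-L75"),
        (0, "Monitor", "7 inch"),
        (0, "Output", "Digital"),
        (1, "Name", "KLH DVD221"),
        (1, "Output", "S-Video"),
    ],
)
conn.execute(horizontal_view_sql("hview", "vtable", ["Name", "Monitor", "Output"]))

# The application can now use the simple horizontal query again.
result = conn.execute("SELECT Monitor FROM hview WHERE Output = 'Digital'").fetchall()
all_rows = conn.execute("SELECT * FROM hview ORDER BY Oid").fetchall()
print(result)
print(all_rows)
```

The view hides the joins from the application, but the database still executes them, which is the cost the implementation strategies on the following slides try to reduce.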
Alternative Implementation
Strategies

VerticalSQL
– Uses only SQL-92 level capabilities
VerticalUDF
– Exploits User Defined Functions and Table Functions to provide a direct implementation
Binary (hand-coded queries)
– 2-ary representation with one relation per attribute (using only SQL-92 transforms)
Data Organization Matters: Clustering
by Key significantly outperforms by Oid
[Figure: execution time (seconds) vs. join selectivity (0.1%, 1%, 5%) for a join, comparing VerticalSQL_oid and VerticalSQL_key; density = 10%, 1000 cols x 20K rows]
VerticalSQL comparable to Binary
and outperforms Horizontal
[Figure: execution time (seconds) vs. table size (#cols x #rows: 200x100K, 400x50K, 800x25K, 1000x20K) for a projection of 10 columns, comparing HorizontalSQL, VerticalSQL and Binary; density = 10%]
VerticalUDF is the best approach
[Figure: execution time (seconds) vs. table size (#cols x #rows: 200x100K, 400x50K, 800x25K, 1000x20K) for a projection of 10 columns, comparing VerticalSQL, Binary and VerticalUDF; density = 10%]
Summary
                Horizontal   Vertical (w/ Mapping)   Binary (w/ Mapping)
Manageability   +            +                       -
Flexibility     -            +                       -
Querying        +            +                       +
Performance     -            +                       +
Remarks

Lessons of this study apply directly to building
a storage engine for the Semantic Web
Performance of the vertical representation can be
further improved by:
– Enhanced table functions
– First class treatment of table functions
– Native support for v2h and h2v operations
– Partial indices
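
As an illustration of the last item, a partial index covers only the rows that satisfy a predicate, for example the rows of a single frequently queried attribute. The sketch below uses SQLite's partial-index syntax as a stand-in; the index name is made up, and whether this matches the form of partial index the authors had in mind is an assumption.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vtable (Oid INTEGER, Key TEXT, Val TEXT)")
conn.executemany(
    "INSERT INTO vtable VALUES (?, ?, ?)",
    [
        (0, "Name", "PAN DVD-L75"),
        (0, "Output", "Digital"),
        (1, "Name", "KLH DVD221"),
        (1, "Output", "S-Video"),
    ],
)

# Index only the 'Output' rows instead of every (Key, Val) pair in the table.
conn.execute(
    "CREATE INDEX idx_output_val ON vtable (Val, Oid) WHERE Key = 'Output'"
)

oids = [
    row[0]
    for row in conn.execute(
        "SELECT Oid FROM vtable WHERE Key = 'Output' AND Val = 'Digital'"
    )
]
index_sql = conn.execute(
    "SELECT sql FROM sqlite_master WHERE name = 'idx_output_val'"
).fetchone()[0]
print(oids)
print(index_sql)
```

The index stays small because rows for all other attributes are never entered into it.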