SCHEMALESS APPROACH
OF MAPPING XML DOCUMENTS
INTO RELATIONAL DATABASE
Ibrahim Dweib, Ayman Awadi,
Seif Elduola Fath Elrhman, Joan Lu
CIT 2008
Sydney, Australia 8-11 July 2008
1
Why schema-less
Many applications deal with highly flexible
XML documents from different sources, which
makes it difficult to define their structure by a
fixed schema or a DTD. Schema-less approaches
are therefore needed to deal with such XML
documents.
2
The method aims to overcome the
challenges of fixed shredding:

No loss of information while shredding.

Easier and much faster reconstruction of the
original XML documents.

Maintaining the XML document structure.

Preserving the ordering of XML data.
3
Theory guidance
The main mathematical concepts used in this
method are:

Definition 1:
An XML tree is composed of many sub-trees at different levels; it can be
defined as follows:
i = 1, 2, …, n represent the levels of the XML tree; 0 represents the root,
where E_i is a finite set of elements in level i,
A_i is a finite set of attributes in level i,
X_i is a finite set of texts in level i,
r_{i-1} is the root of the sub-tree of level i.
4
Theory guidance (Cont'd)
Definition 2:
A dynamic fragment (shred) df(i) is defined as the attributes and
texts (leaf children) of sub-tree i of the XML tree, plus its root
r_{i-1}, as follows:
df(i) = (A_i, X_i, r_{i-1}),
where:
A_i is a finite set of attributes in level i,
X_i is a finite set of texts in level i,
r_{i-1} is the root of the sub-tree of level i.
5
Design framework

A master table for documents, called the "documents" table, keeps
information about the documents themselves:
documents(doc_id, doc_structure, …)
Additional fields may be added to hold other information about the
document itself, such as dates, statistics, and types.

The doc_id is a unique id generated for each document.

The doc_structure is a large text field containing a coded string that
describes the document structure. Any change to the document structure must
be reflected in this field, such as adding a new tag or property, deleting an
existing tag or property, or relocating a tag or property to a different
location in the same document.
6
Design framework (Cont'd)

A second table stores the actual contents of all
documents. Documents are shredded into pieces of
data called tokens: each document element,
tag, or property is considered a token. The tokens
table has at minimum this structure:
tokens(doc_id, token_id, token_name, token_value)
The token_id is the generated primary key for each token.
The doc_id is the foreign key linking the tokens table to the documents table.
token_name is the tag name or the property name as found in the original XML
document.
token_value is the text value of the XML tag or property.
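The two tables described so far could be declared roughly as follows. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for the relational target; the column types and the single-element sample row are illustrative assumptions, not part of the original design.

```python
import sqlite3

# Assumption: an in-memory SQLite database stands in for the relational target.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE documents (
    doc_id        INTEGER PRIMARY KEY,   -- unique id per document
    doc_structure TEXT NOT NULL          -- coded string describing the structure
);
CREATE TABLE tokens (
    doc_id      INTEGER NOT NULL REFERENCES documents(doc_id),
    token_id    INTEGER PRIMARY KEY,     -- generated id for each token
    token_name  TEXT NOT NULL,           -- tag or property name
    token_value TEXT                     -- text value, NULL when there is none
);
""")

# Illustrative rows: a document made of a single element with no text.
conn.execute(
    "INSERT INTO documents (doc_id, doc_structure) VALUES (?, ?)", (10, "T99")
)
conn.execute(
    "INSERT INTO tokens (doc_id, token_id, token_name, token_value) "
    "VALUES (?, ?, ?, ?)",
    (10, 99, "books", None),
)
conn.commit()
```

Further columns (dates, statistics, types) would be added to the documents table as needed.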

7
Design framework (Cont'd)
"doc_structure" field construction rules:

The doc_structure field is where the document structure is maintained.

It consists of a long series of related keys.

Each key starts with a letter:

'T' for an element (child) and 'A' for an attribute.

These letters are necessary to delimit keys in the sequence.
The letter is followed by a number, the token_id that
the key refers to.

Example: T120 is a key referring to the token in the tokens table whose
token_id = 120.

8
Design framework (Cont'd)
"doc_structure" field construction rules:
If a token has properties, the key representing
it in the doc_structure is followed by a set of keys
defining these properties.
Example: T120A12A17A2 is a valid key string for
token 120, which has three properties defined by
tokens 12, 17, and 2.
These properties appear in the original document in this
order.
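The key format above can be split mechanically, because every key is one letter followed by digits. The sketch below is illustrative only; the helper name split_keys is an assumption and does not come from the original method.

```python
import re

# A key is 'T' (element) or 'A' (attribute) followed by the token_id digits.
KEY = re.compile(r"([TA])(\d+)")

def split_keys(key_string):
    """Return (kind, token_id) pairs in the order they appear."""
    return [(kind, int(number)) for kind, number in KEY.findall(key_string)]

keys = split_keys("T120A12A17A2")
print(keys)  # [('T', 120), ('A', 12), ('A', 17), ('A', 2)]
```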

9
Design framework (Cont'd)
"doc_structure" field construction rules:

If a token has child tags, these children are
represented as a key string surrounded by
angle brackets.
Example: T120<T12T7<T2T1>T77> is a valid string that
reads as follows: token 120 has three sub-tags in this order: token
12, then token 7, then token 77; token 7 itself has
two sub-tags, 2 and 1, in the given order.
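One way to read such a string back into a tree is a single pass with a stack, pushing on '<' and popping on '>'. The sketch below is a hedged illustration, not the authors' implementation; the function name parse_structure and the dictionary layout of each node are assumptions.

```python
import re

# Pieces of a doc_structure string: a key (T/A + digits) or a bracket.
PART = re.compile(r"[TA]\d+|[<>]")

def parse_structure(doc_structure):
    """Parse a doc_structure string into nested element nodes.

    Each node is {"id": token_id, "attrs": [token_id, ...], "children": [...]}.
    """
    top = {"id": None, "attrs": [], "children": []}
    stack = [top]   # stack[-1] is the node that receives new elements
    last = None     # most recently created element node
    for part in PART.findall(doc_structure):
        if part == "<":
            stack.append(last)          # following keys are children of `last`
        elif part == ">":
            stack.pop()                 # back to the parent level
        elif part[0] == "T":
            last = {"id": int(part[1:]), "attrs": [], "children": []}
            stack[-1]["children"].append(last)
        else:                           # 'A': attribute of the latest element
            last["attrs"].append(int(part[1:]))
    return top["children"]

tree = parse_structure("T120<T12T7<T2T1>T77>")
print([child["id"] for child in tree[0]["children"]])  # [12, 7, 77]
```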

10
Theory implementation on simple case
study
<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>
Figure 1: XML document
11
Theory implementation on simple case
study
Figure 2: A tree representation of the XML document in Figure 1 (nodes are labelled with the numbers 99 to 111)
12
Theory implementation on simple case
study
doc_id   doc_structure
10       T99<T100A101A102<T103A104A105T106>T107A108<T109T110T111>>
Figure 5: Documents table
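The doc_structure value above can be reproduced from the document in Figure 1 with a depth-first walk that numbers each element, then its attributes, then its children. The following is a minimal sketch under stated assumptions: Python's xml.etree.ElementTree replaces the DOM parser used in the experiment, the function name shred is hypothetical, token ids start at 99 to match the case study, and text that follows a child element (mixed content) is ignored.

```python
import xml.etree.ElementTree as ET
from itertools import count

XML_TEXT = """<books>
  <book id="11210" category="fiction">
    <author id="a1" sex="m">M. John</author>
    <name>Computer Science 101</name>
  </book>
  <book id="11211">
    <author>A. Mark</author>
    <name>Applied Math 101</name>
    <subject>Math</subject>
  </book>
</books>"""

def shred(xml_text, doc_id, first_token_id=1):
    """Return (doc_structure, token_rows) for one XML document."""
    ids = count(first_token_id)
    tokens = []  # rows of (doc_id, token_id, token_name, token_value)

    def visit(elem):
        token_id = next(ids)
        text = (elem.text or "").strip() or None   # None plays the role of Null
        tokens.append((doc_id, token_id, elem.tag, text))
        key = f"T{token_id}"
        for name, value in elem.attrib.items():    # attributes in document order
            attr_id = next(ids)
            tokens.append((doc_id, attr_id, name, value))
            key += f"A{attr_id}"
        children = "".join(visit(child) for child in elem)
        return key + (f"<{children}>" if children else "")

    return visit(ET.fromstring(xml_text)), tokens

structure, tokens = shred(XML_TEXT, doc_id=10, first_token_id=99)
print(structure)
```

Because each element is numbered before its attributes and children, the token ids follow document order, which is what preserves the ordering of the XML data.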
13
Theory implementation on simple case
study
doc_id   token_id   token_name   token_value
10       99         books        Null
10       100        book         Null
10       101        id           11210
10       102        category     fiction
10       103        author       M. John
10       104        id           a1
10       105        sex          m
10       106        name         Computer Science 101
10       107        book         Null
10       108        id           11211
10       109        author       A. Mark
10       110        name         Applied Math 101
10       111        subject      Math
Figure 6: Tokens table
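Reconstruction goes the other way: the doc_structure string drives the walk and the tokens table supplies names and values. The sketch below is an illustration under assumptions, not the authors' code; the function name rebuild is hypothetical, the tokens are passed as an in-memory dictionary rather than read from a database, and only the first book of the case study is rebuilt to keep the example short.

```python
import re
import xml.etree.ElementTree as ET

PART = re.compile(r"[TA]\d+|[<>]")

def rebuild(doc_structure, tokens):
    """Rebuild an element tree; tokens maps token_id -> (name, value)."""
    stack = [None]   # stack[-1] is the current parent element (None at top)
    root = None
    last = None      # most recently created element
    for part in PART.findall(doc_structure):
        if part == "<":
            stack.append(last)
        elif part == ">":
            stack.pop()
        else:
            name, value = tokens[int(part[1:])]
            if part[0] == "A":
                last.set(name, value)            # attribute of latest element
            else:
                parent = stack[-1]
                if parent is None:
                    last = ET.Element(name)
                    root = last
                else:
                    last = ET.SubElement(parent, name)
                last.text = value
    return root

# Rows 99 to 106 of the tokens table (first book only).
tokens = {
    99: ("books", None), 100: ("book", None),
    101: ("id", "11210"), 102: ("category", "fiction"),
    103: ("author", "M. John"), 104: ("id", "a1"), 105: ("sex", "m"),
    106: ("name", "Computer Science 101"),
}
root = rebuild("T99<T100A101A102<T103A104A105T106>>", tokens)
print(ET.tostring(root, encoding="unicode"))
```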
14
EXPERIMENTAL ENVIRONMENT
An Intel Core 2 Duo computer with a 2 GHz CPU, 1 GB RAM, and 256 MB shared
cache.

OS: Windows Vista Home edition.

Visual Basic 6 is used as the development environment, with Microsoft Access 2003
as the target relational database.

Five XML documents of different sizes are used in the experiment.

The data is taken from the XML data repository available on the web site
of the School of Computer Science and Engineering, University of Washington.

The performance metrics are the time spent mapping XML documents to the
relational database and the time spent reconstructing these documents from the
relational database.

The experiment is repeated five times and the mean of those times is
reported to obtain realistic and accurate results.

15
EXPERIMENTAL RESULTS
Document size                4 KB          28 KB        64 KB       602 KB      1 MB
Mapping time (secs)          0.01988238    0.14977736   0.3551445   3.574335    5.85278136
Reconstructing time (secs)   0.018990234   0.44980958   1.926836    18.305544   32.06255104
Table 1: The time spent for mapping XML documents to
RDBMS, and the time for reconstructing them
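To compare the two rows of Table 1 directly, the ratio of reconstructing time to mapping time can be computed per document size. This is only a small arithmetic aid using the figures reported above; the variable names are illustrative.

```python
# Values copied from Table 1 (seconds).
sizes = ["4 KB", "28 KB", "64 KB", "602 KB", "1 MB"]
mapping = [0.01988238, 0.14977736, 0.3551445, 3.574335, 5.85278136]
reconstructing = [0.018990234, 0.44980958, 1.926836, 18.305544, 32.06255104]

# Ratio > 1 means reconstruction took longer than mapping.
ratios = [r / m for m, r in zip(mapping, reconstructing)]
for size, ratio in zip(sizes, ratios):
    print(f"{size:>6}: reconstructing / mapping = {ratio:.2f}")
```

On these figures the two times are about equal for the 4 KB document, and reconstruction takes roughly 3 to 5.5 times as long as mapping for the larger ones.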
16
EXPERIMENTAL RESULTS
Chart: The time spent for mapping XML documents to RDBMS and the time spent for reconstructing them (y-axis: time spent, 0 to 35 secs; x-axis: document size, 4 KB to 1 MB; series: mapping time and reconstructing time)
17
Conclusion (1)
By using this method:

Document structure is maintained easily and at
low cost,

Rebuilding the original document is straightforward,

First-level semantic search is also
achievable, either on a single document or on all
documents.
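One reading of first-level search is a lookup by tag or property name and value, which maps to a single query on the tokens table. The sketch below assumes that reading and uses SQLite through Python's sqlite3 module as a stand-in database; the function name find_tokens is hypothetical, and the sample rows are taken from the tokens table of the case study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tokens (doc_id INTEGER, token_id INTEGER PRIMARY KEY, "
    "token_name TEXT, token_value TEXT)"
)
conn.executemany(
    "INSERT INTO tokens VALUES (?, ?, ?, ?)",
    [
        (10, 103, "author", "M. John"),
        (10, 109, "author", "A. Mark"),
        (10, 110, "name", "Applied Math 101"),
    ],
)

def find_tokens(conn, name, value, doc_id=None):
    """Find tokens by name and value, in one document or in all documents."""
    sql = ("SELECT doc_id, token_id FROM tokens "
           "WHERE token_name = ? AND token_value = ?")
    params = [name, value]
    if doc_id is not None:          # restrict the search to a single document
        sql += " AND doc_id = ?"
        params.append(doc_id)
    return conn.execute(sql, params).fetchall()

hits = find_tokens(conn, "author", "A. Mark")
print(hits)  # [(10, 109)]
```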
18
Conclusion (2)
Method limitations:

Complex semantic search is not easily achievable in
this structure.

Document size is limited by memory size, since we use
DOM-based parsing.
19
Future Work
Improving the method to support complex semantic
search and to differentiate between XML data types (i.e., strings,
dates, integers), so that less-than and greater-than
queries can be applied.

Testing the method intensively and comparing its
performance with other methods in the literature.

Using SAX parsing of XML documents to remove the
document size limitation.
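As a rough idea of what SAX-based shredding might look like, the sketch below builds the same key string from parser events without holding a document tree. It is speculative and not part of the presented method: the class name ShredHandler is hypothetical, Python's xml.sax is assumed as the parser, and rows are collected in a list here, whereas a real version would write them to the database as they complete.

```python
import xml.sax

class ShredHandler(xml.sax.ContentHandler):
    """Build doc_structure pieces and token rows from SAX events."""

    def __init__(self, doc_id, first_token_id=1):
        super().__init__()
        self.doc_id = doc_id
        self.next_id = first_token_id
        self.structure = []   # pieces of the doc_structure string
        self.tokens = []      # rows [doc_id, token_id, token_name, token_value]
        self.open = []        # per open element: [row, text_parts, has_children]

    def _new_row(self, name, value):
        row = [self.doc_id, self.next_id, name, value]
        self.next_id += 1
        self.tokens.append(row)
        return row

    def startElement(self, name, attrs):
        if self.open and not self.open[-1][2]:
            self.open[-1][2] = True          # first child of the parent:
            self.structure.append("<")       # open its bracket
        row = self._new_row(name, None)
        self.structure.append(f"T{row[1]}")
        for attr_name, attr_value in attrs.items():
            attr_row = self._new_row(attr_name, attr_value)
            self.structure.append(f"A{attr_row[1]}")
        self.open.append([row, [], False])

    def characters(self, content):
        if self.open:
            self.open[-1][1].append(content)

    def endElement(self, name):
        row, parts, has_children = self.open.pop()
        row[3] = "".join(parts).strip() or None   # element text, if any
        if has_children:
            self.structure.append(">")

XML_BYTES = (
    b'<books><book id="11210" category="fiction">'
    b'<author id="a1" sex="m">M. John</author>'
    b'<name>Computer Science 101</name></book></books>'
)
handler = ShredHandler(doc_id=10, first_token_id=99)
xml.sax.parseString(XML_BYTES, handler)
doc_structure = "".join(handler.structure)
print(doc_structure)
```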

20
Thank You for Your Time
21