Semi-structured data Information Science Patrick Lambrix

advertisement
Patrick Lambrix
Department of Computer and Information Science
Linköpings universitet
Semi-structured
data
1
2
Data is not just text, but is not as wellstructured as data in databases
Occurs often in web databanks
Occurs often in integration of databanks
Semi-structured data
3
irregular structure
implicit structure
partial structure
a posteriori ’data guide’
versus a priori schema
large data guides
Semi-structured data properties
4
It should be possible to ignore the data
guide upon querying
Data guide changes fast
object can change type/class
difference between data guide and data
is blurred
Semi-structured data properties
5
network of nodes
object model (oid)
query: path search in the network
Semi-structured data - model
6
Graph
Nodes: objects
oid
atomic or complex
- atoms: integer, string, gif, html, …
- value of a complex object is a set of
object references (label, oid)
Edges have labels
OEM is used by a number of systems (ex.
Lorel)
OEM (Object Exchange Model)
44
street
Chef Chu
13
7
El Camino Real
gourmet
17
Palo Alto
15
city
14
category name address
19
16
92310
zipcode
Vietnamese
66
18
35
nearby
Mountain
View
23
Menlo Park
25
54
cheap
55
92310
price
zipcode
address address
restaurant
cafe
Restaurant Guide
Guide
name
Saigon
category
nearby
nearby
restaurant
12
OEM example
79
category
fast food
price
Sandra
80
name
77
8
2. Find the names and streets of all restaurants in
Palo Alto
select R.name, A.street
from RestaurantGuide.restaurant{R}.address A
where A.city = “Palo Alto”
1. Find all places to eat Vietnamese food
select P
from RestaurantGuide.% P
where P.category grep “ietnamese”
Lorel query language
9
Wildcards and variables
? - 0 or 1 path
- object variables
+ - 1 or more paths
select P from Guide.% P
* - 0 or more paths
select A from #.address{A}
# - any path
- path variables
% - 0 or more chars
select Guide.#@P.name
3. Find all restaurants to eat with zipcode 92310
select RestaurantGuide.restaurant
where
RestaurantGuide.restaurant(.address)?.zipcode = 92310
Lorel query language
10
A structural summary over a data source
that is used as a dynamic schema
Is used in query formulation and
optimization
Is often created a posteriori
Properties:
concise
accurate
convenient
Data Guides
11
Label path: sequence of labels
L1.L2. … .Ln
Data path: alternating sequence of labels
and oid:s
L1.o1.L2.o2. … .Ln.on
Data path d is an instance of label path l if
the sequences of labels are identical in l
and d.
Data Guides - definitions
12
A data guide for object s is an object d
such that every label path of s has exact
one data path instance in d,
d and each
label path in d is a label path of s.
Data Guides - definitions
13
Minimal data guides
the smallest data guides
A data source can have several data
guides
Data Guides
14
C
6
D
9
C
5
D
8
(a)
3
B
2
A
1
Data model
B
10
D
7
C
4
A
(c)
21
D
20
C
19
18
B
minimal Data Guide
Data Guides - example
15
be hard to maintain
Example: child node for 10 with label E
May
Concise
Minimal Data Guides
16
Intuitively:
”label
label paths that reach the same set of
objects in the data model = label paths
that reach the same objects in the data
guide”
Strong Data Guides
17
An object o can be reached from s via l if
there is a data path of s that is an
instance of l and that has o as last oid
(L1.o1.L2.o2. … Ln.o)
The target set for label path l in object s
is the set of objects that can be
reached from s via l. Notation: T(s,l)
L(s,l): set of label paths of s that have the
same target set in s as l.
Strong Data Guides - definitions
18
There is a 1-1-mapping between target
sets in the data model and nodes in a
strong data guide.
Definition:
d is a strong data guide for s if
for all label paths l of s
it holds that L(s,l) = L(d,l)
Strong Data Guides - definitions
19
C
6
D
9
C
5
D
8
(a)
3
B
2
A
1
Data model
B
10
D
7
C
4
(b)
17
16
15
C
13
D
B
D
14
C
12
A
11
strong Data Guide
A
(c)
21
D
20
C
19
18
B
minimal Data Guide
Data Guides - example
20
Implementation:
- Traverse data model depth-first.
- Each time you find a new target set for
label path l, create a new object in the
data guide.
If the target set is already represented
in the data guide, do not create a new
object, but link to the existing object.
Strong Data Guides - algorithm
21
Easier to maintain
Used as path index for query
optimization
ti i ti
Strong Data Guides - use
22
Semi-structured data
exercises
i
23
r_id
r1
1
r2
r3
name
H l t
Hamlet
Normandie
McDonald's
r_id
r1
r2
r3
street
Storgatan
St.Larsgatan
Kungsgatan
Restaurants&Cities
c_id
c1
c1
c2
Cities
c_id
c1
c2
name
Linkoping
Norkoping
Represent the relations below using the OEM data
model.
Restaurants
Exercise 1
24
find all the restaurants that are located in Linkoping
find the address (city and street) of the “Hamlet”
restaurant
list the restaurants by city (equivalent of GROUP BY)
Using the data model from the previous question,
formulate the following queries using Lorel:
Exercise 2
16
street
25
18
17
92310
zipcode
contact
city
7
Palo Alto
Chef Chu
6
El Camino Real
gourmet
5
2
9
Saigon
Guide
RestaurantGuide
3
11
phone
27
11-12-13
26
71-72-73
20
reservation
Menlo Park
10
12
cafe
21
4
15
22
28
phone
24
31-32-33
58435
23
34-35-36
29
phone
25
city zipcode reservation manager
14
address contact
Linkoping
street
Sandra
13
category name
nearby
nearby
Rydsvagen
fast food
address contact
restaurant
phone
19
manager
8
name
nearby
restaurant
1
Draw the strong Data Guide for
the restaurant guide data model below.
category
g y name address
Exercise 3
Download