Patrick Lambrix Department of Computer and Information Science Linköpings universitet Semi-structured data 1 2 Data is not just text, but is not as wellstructured as data in databases Occurs often in web databanks Occurs often in integration of databanks Semi-structured data 3 irregular structure implicit structure partial structure a posteriori ’data guide’ versus a priori schema large data guides Semi-structured data properties 4 It should be possible to ignore the data guide upon querying Data guide changes fast object can change type/class difference between data guide and data is blurred Semi-structured data properties 5 network of nodes object model (oid) query: path search in the network Semi-structured data - model 6 Graph Nodes: objects oid atomic or complex - atoms: integer, string, gif, html, … - value of a complex object is a set of object references (label, oid) Edges have labels OEM is used by a number of systems (ex. Lorel) OEM (Object Exchange Model) 44 street Chef Chu 13 7 El Camino Real gourmet 17 Palo Alto 15 city 14 category name address 19 16 92310 zipcode Vietnamese 66 18 35 nearby Mountain View 23 Menlo Park 25 54 cheap 55 92310 price zipcode address address restaurant cafe Restaurant Guide Guide name Saigon category nearby nearby restaurant 12 OEM example 79 category fast food price Sandra 80 name 77 8 2. Find the names and streets of all restaurants in Palo Alto select R.name, A.street from RestaurantGuide.restaurant{R}.address A where A.city = “Palo Alto” 1. Find all places to eat Vietnamese food select P from RestaurantGuide.% P where P.category grep “ietnamese” Lorel query language 9 Wildcards and variables ? - 0 or 1 path - object variables + - 1 or more paths select P from Guide.% P * - 0 or more paths select A from #.address{A} # - any path - path variables % - 0 or more chars select Guide.#@P.name 3. Find all restaurants to eat with zipcode 92310 select RestaurantGuide.restaurant where RestaurantGuide.restaurant(.address)?.zipcode = 92310 Lorel query language 10 A structural summary over a data source that is used as a dynamic schema Is used in query formulation and optimization Is often created a posteriori Properties: concise accurate convenient Data Guides 11 Label path: sequence of labels L1.L2. … .Ln Data path: alternating sequence of labels and oid:s L1.o1.L2.o2. … .Ln.on Data path d is an instance of label path l if the sequences of labels are identical in l and d. Data Guides - definitions 12 A data guide for object s is an object d such that every label path of s has exact one data path instance in d, d and each label path in d is a label path of s. Data Guides - definitions 13 Minimal data guides the smallest data guides A data source can have several data guides Data Guides 14 C 6 D 9 C 5 D 8 (a) 3 B 2 A 1 Data model B 10 D 7 C 4 A (c) 21 D 20 C 19 18 B minimal Data Guide Data Guides - example 15 be hard to maintain Example: child node for 10 with label E May Concise Minimal Data Guides 16 Intuitively: ”label label paths that reach the same set of objects in the data model = label paths that reach the same objects in the data guide” Strong Data Guides 17 An object o can be reached from s via l if there is a data path of s that is an instance of l and that has o as last oid (L1.o1.L2.o2. … Ln.o) The target set for label path l in object s is the set of objects that can be reached from s via l. Notation: T(s,l) L(s,l): set of label paths of s that have the same target set in s as l. Strong Data Guides - definitions 18 There is a 1-1-mapping between target sets in the data model and nodes in a strong data guide. Definition: d is a strong data guide for s if for all label paths l of s it holds that L(s,l) = L(d,l) Strong Data Guides - definitions 19 C 6 D 9 C 5 D 8 (a) 3 B 2 A 1 Data model B 10 D 7 C 4 (b) 17 16 15 C 13 D B D 14 C 12 A 11 strong Data Guide A (c) 21 D 20 C 19 18 B minimal Data Guide Data Guides - example 20 Implementation: - Traverse data model depth-first. - Each time you find a new target set for label path l, create a new object in the data guide. If the target set is already represented in the data guide, do not create a new object, but link to the existing object. Strong Data Guides - algorithm 21 Easier to maintain Used as path index for query optimization ti i ti Strong Data Guides - use 22 Semi-structured data exercises i 23 r_id r1 1 r2 r3 name H l t Hamlet Normandie McDonald's r_id r1 r2 r3 street Storgatan St.Larsgatan Kungsgatan Restaurants&Cities c_id c1 c1 c2 Cities c_id c1 c2 name Linkoping Norkoping Represent the relations below using the OEM data model. Restaurants Exercise 1 24 find all the restaurants that are located in Linkoping find the address (city and street) of the “Hamlet” restaurant list the restaurants by city (equivalent of GROUP BY) Using the data model from the previous question, formulate the following queries using Lorel: Exercise 2 16 street 25 18 17 92310 zipcode contact city 7 Palo Alto Chef Chu 6 El Camino Real gourmet 5 2 9 Saigon Guide RestaurantGuide 3 11 phone 27 11-12-13 26 71-72-73 20 reservation Menlo Park 10 12 cafe 21 4 15 22 28 phone 24 31-32-33 58435 23 34-35-36 29 phone 25 city zipcode reservation manager 14 address contact Linkoping street Sandra 13 category name nearby nearby Rydsvagen fast food address contact restaurant phone 19 manager 8 name nearby restaurant 1 Draw the strong Data Guide for the restaurant guide data model below. category g y name address Exercise 3