Semi-structured data

• Data is not just text, but is not as well structured as data in databases
• Occurs often in web databanks
• Occurs often in integration of databanks

Semi-structured data - properties

• irregular structure
• implicit structure
• partial structure
• a posteriori 'data guide' versus a priori schema
• large data guides
• It should be possible to ignore the data guide upon querying
• Data guide changes fast
• object can change type/class
• difference between data guide and data is blurred

Semi-structured data - model

• network of nodes
• object model (oid)
• query: path search in the network

OEM (Object Exchange Model)

• Graph
• Nodes: objects
  - oid
  - atomic or complex
  - atoms: integer, string, gif, html, …
  - value of a complex object is a set of object references (label, oid)
• Edges have labels
• OEM is used by a number of systems (ex. Lorel)

OEM example

[Figure: OEM graph "Restaurant Guide" with root Guide. Edge labels: restaurant, cafe, nearby, category, name, address, street, city, zipcode, price. Atomic values: gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Vietnamese, Saigon, Mountain View, Menlo Park, cheap, fast food, Sandra.]

Lorel query language

1. Find all places to eat Vietnamese food
   select P
   from RestaurantGuide.% P
   where P.category grep "ietnamese"

2. Find the names and streets of all restaurants in Palo Alto
   select R.name, A.street
   from RestaurantGuide.restaurant{R}.address A
   where A.city = "Palo Alto"

3. Find all restaurants with zipcode 92310
   select RestaurantGuide.restaurant
   where RestaurantGuide.restaurant(.address)?.zipcode = 92310

Wildcards and variables

  ? - 0 or 1 path
  + - 1 or more paths
  * - 0 or more paths
  # - any path
  % - 0 or more chars

- object variables
    select P from Guide.% P
    select A from #.address{A}
- path variables
    select Guide.#@P.name

Data Guides

• A structural summary over a databank that is used as a dynamic schema
• Is used in query formulation and optimization
• Is often created a posteriori
• Properties:
  – concise
  – accurate
  – convenient

Data Guides - definitions

• Label path: sequence of labels L1.L2. … .Ln
• Data path: alternating sequence of labels and oid:s L1.o1.L2.o2. … .Ln.on
• Data path d is an instance of label path l if the sequences of labels are identical in l and d.
• A data guide for object s is an object d such that every label path of s has exactly one data path instance in d, and each label path in d is a label path of s.

Data Guides - example

[Figure: (a) data model with objects 1-10; (c) minimal Data Guide with objects 18-21. Edge labels: A, B, C, D.]

Data Guides

• A databank can have several data guides
• Minimal data guides: the smallest data guides

Minimal Data Guides

• Concise
• May be hard to maintain
Example: child node for 10 with label E

Strong Data Guides

Intuitively: "label paths that reach the same set of objects in the data model = label paths that reach the same objects in the data guide"

Strong Data Guides - definitions

An object o can be reached from s via l if there is a data path of s that is an instance of l and that has o as last oid (L1.o1.L2.o2. … .Ln.o).

The target set for label path l in object s is the set of objects that can be reached from s via l. Notation: T(s,l)

L(s,l): set of label paths of s that have the same target set in s as l.

Definition: d is a strong data guide for s if for all label paths l of s it holds that L(s,l) = L(d,l)

There is a 1-1-mapping between target sets in the data model and nodes in a strong data guide.
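The definitions above can be made concrete with a small sketch. It assumes an OEM graph stored as a Python dict from oid to (label, oid) references; the graph, the function names (step, target_set, strong_data_guide) and the guide node numbering are illustrative assumptions, not the figure or the notation from the slides. The sketch computes T(s,l) and then creates one guide node per distinct target set, which is one possible way to obtain the 1-1-mapping.

```python
from collections import defaultdict

# Hypothetical OEM graph (NOT the figure from the slides):
# oid -> list of (label, child oid); atomic objects have no entry.
OEM = {
    1: [("restaurant", 2), ("restaurant", 3), ("cafe", 4)],
    2: [("name", 5), ("address", 6)],
    3: [("name", 7), ("address", 8)],
    4: [("name", 9), ("nearby", 2), ("nearby", 3)],
    6: [("city", 10)],
    8: [("city", 11)],
}


def step(graph, oids, label):
    """Objects reachable from any oid in `oids` over one edge named `label`."""
    return frozenset(
        child for oid in oids for (lab, child) in graph.get(oid, []) if lab == label
    )


def target_set(graph, s, label_path):
    """T(s, l): objects reachable from s via label path l (a list of labels)."""
    current = frozenset([s])
    for label in label_path:
        current = step(graph, current, label)
    return current


def strong_data_guide(graph, s):
    """Build a guide with exactly one node per distinct target set."""
    root = frozenset([s])
    node_of = {root: 0}        # target set -> guide node id
    guide = defaultdict(dict)  # guide node id -> {label: guide node id}
    stack = [root]             # explicit stack gives a depth-first traversal
    while stack:
        targets = stack.pop()
        labels = sorted({lab for oid in targets for (lab, _) in graph.get(oid, [])})
        for label in labels:
            nxt = step(graph, targets, label)
            if nxt not in node_of:
                # New target set: create a new guide object.
                node_of[nxt] = len(node_of)
                stack.append(nxt)
            # Known target set: only a link to the existing guide object is added.
            guide[node_of[targets]][label] = node_of[nxt]
    return dict(guide), node_of


guide, node_of = strong_data_guide(OEM, 1)
```

In this made-up graph the label paths restaurant and cafe.nearby have the same target set {2, 3}, so both lead to the same guide node. Because the number of distinct target sets is finite, the loop also terminates on cyclic graphs.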
Data Guides - example

[Figure: (a) data model with objects 1-10; (b) strong Data Guide with objects 11-17; (c) minimal Data Guide with objects 18-21. Edge labels: A, B, C, D.]

Strong Data Guides - algorithm

Implementation:
- Traverse data model depth-first.
- Each time you find a new target set for label path l, create a new object in the data guide. If the target set is already represented in the data guide, do not create a new object, but link to the existing object.

Strong Data Guides - use

– Easier to maintain
– Used as path index for query optimization

Semi-structured data exercises

Exercise 1

• Represent the relations below using the OEM data model.

Restaurants
  r_id   name
  r1     Hamlet
  r2     Normandie
  r3     McDonald's

Cities
  c_id   name
  c1     Linkoping
  c2     Norkoping

Restaurants&Cities
  r_id   c_id   street
  r1     c1     Storgatan
  r2     c1     St.Larsgatan
  r3     c2     Kungsgatan

Exercise 2

• Using the data model from the previous question, formulate the following queries using Lorel:
  – find all the restaurants that are located in Linkoping
  – list the restaurants by city (equivalent of GROUP BY)
  – find the address (city and street) of the "Hamlet" restaurant

Exercise 3

• Draw the strong Data Guide for the restaurant guide data model below.

[Figure: OEM graph RestaurantGuide (Guide) with objects 1-29. Edge labels: restaurant, cafe, nearby, category, name, address, street, city, zipcode, contact, manager, reservation, phone. Atomic values: gourmet, Chef Chu, El Camino Real, Palo Alto, 92310, Saigon, Menlo Park, fast food, Sandra, Rydsvagen, Linkoping, 58435, 71-72-73, 11-12-13, 31-32-33, 34-35-36.]

Exercise 4

• Write 4 simple queries in Lorel that illustrate the use of coercion in the following types of comparison:
  – string type against integer type
  – value against atomic object
  – value against complex object
  – value against set of objects
  Explain how coercion works in each case.