Capturing Semantics in XML Documents
Tok Wang Ling
Department of Computer Science, National University of Singapore
April 9, 2006, KDXD 2006, Singapore

Roadmap
1. XML documents and current XML schema languages
2. ORA-SS (Object-Relationship-Attribute model for Semi-Structured data) [4]
3. The applications of ORA-SS
4. Discovering Semantics in XML documents
5. Conclusion

[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business Media, Inc. 2005

1. XML – Brief introduction
• XML (eXtensible Markup Language) is
– Released by W3C
– An application of SGML
– A promising standard for publishing, integrating and exchanging data on the web
• XML schema
– DTD (Document Type Definition) [3]
– XSD (XML Schema Definition), W3C recommended standard [6, 7, 8]

[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

1. XML – A motivating example
• Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where
– The document has a root element psj;
– Under psj, there is a sequence of part elements;
– Under part, there is a sequence of supplier elements;
– Under supplier, there is a sequence of project elements.
Example 1. psj.xml

<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
  <part>
    <pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>60</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>650</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S002</sno> <sname>Beta</sname> <city>Atlanta</city> <city>New York</city> <price>5.5</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>70</qty>
      </project>
      <project>
        <jno>J003</jno> <jname>Firework launcher</jname> <budget>250000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
  …
  <part>
    <pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
    <supplier>
      <sno>S001</sno> <sname>Alfa</sname> <city>Atlanta</city> <price>4.6</price>
      <project>
        <jno>J002</jno> <jname>Diving helm</jname> <budget>18000</budget> <qty>60</qty>
      </project>
    </supplier>
    <supplier>
      <sno>S003</sno> <sname>Beta</sname> <city>New York</city> <price>5</price>
      <project>
        <jno>J001</jno> <jname>Rocket boots</jname> <budget>20000</budget> <qty>20</qty>
      </project>
      <project>
        <jno>J004</jno> <jname>Blue fireworks</jname> <budget>20000</budget> <qty>50</qty>
      </project>
    </supplier>
  </part>
</psj>

1.
XML – the DTD of “psj.xml”

(a) “psj.dtd”, the DTD of “psj.xml”

<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>

(b) psj.dtd in Data Guide

▼♦ psj
  ▼♦ part
    ♦ pno
    ♦ pname
    ♦ color
    ▼♦ supplier
      ♦ sno
      ♦ sname
      ♦ city
      ♦ price
      ▼♦ project
        ♦ jno
        ♦ jname
        ♦ budget
        ♦ qty

1. XML – what the DTD says
• DTD is a simple definition of an XML document, where users can define
– Element/Attribute types
– Occurrence constraints (e.g. ?, +, *)
– Containment among different element types (the structure)
• DTD cannot express
– Numeric occurrence constraints (e.g. 2 to 8)
– Uniqueness/Key constraints on a combination of attributes/elements (in a DTD, ID can only be declared on a single attribute)
– Relationship types among elements and their degrees
– The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type. Simple elements are element types that contain only PCDATA and have no attributes.

1. XML – XSD
“psj.xsd” is the XSD schema of the motivating example data. Two parts of it are highlighted:
– the XSD definition of element occurrence constraints (maxOccurs);
– the XSD definition of the key constraint, which requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.
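To make the meaning of this key constraint concrete, here is a minimal Python sketch of the check it implies: every element matched by the selector must have the field, and the field values must not repeat. The function name `check_key` and the abbreviated `SAMPLE` document are assumptions made for illustration; they are not part of the slides or of any XSD validator.

```python
import xml.etree.ElementTree as ET

# Hypothetical, abbreviated excerpt shaped like psj.xml (suppliers omitted).
SAMPLE = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color></part>
  <part><pno>P002</pno><pname>Nut</pname><color>Copper</color></part>
</psj>"""


def check_key(root, selector, field):
    """Check a key over `field` within the elements matched by `selector`.

    Returns (ok, problems): the field must be present and non-empty in
    every selected element, and its values must be pairwise distinct.
    """
    seen = set()
    problems = []
    for elem in root.findall(selector):
        value = elem.findtext(field)
        if value is None or value.strip() == "":
            problems.append("missing key field")
        elif value in seen:
            problems.append(f"duplicate key value: {value}")
        else:
            seen.add(value)
    return (not problems, problems)


root = ET.fromstring(SAMPLE)
ok, problems = check_key(root, "part", "pno")
print(ok, problems)
```

The selector is evaluated relative to the element that carries the key, which is why the uniqueness scope is the set of part elements under psj.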
<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>

1. XML – what XSD can tell
• XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors, which
– has extensible XML syntax,
– supports more data types (user-defined types and 44 built-in types),
– is able to represent uniqueness/key constraints for both attribute types and element types,
– and has many other improvements in comparison with DTD.

1. XML – XSD still has flaws
XSD is not sufficient for expressing the relational semantics in XML data, for example:
1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD, and it is quite different from the key constraint in relational databases.
– E.g. In the previous XSD, the values of the key attribute, pno of part, should be unique within the set of part elements in the whole document.
– Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs) respectively.

1. XML – XSD still has flaws (cont.)
- The key element must contain the following (in order):
a) One and only one selector element, which contains an XPath expression that specifies the set of elements across which the values specified by the field must be unique.
b) One or more field elements, each of which contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element.
- The key constraint is similar to the unique constraint, except that the column on which a unique constraint is defined can have null values.

1. XML – XSD still has flaws (cont.)
2. XSD does not support relationship types and other relational semantic constraints.
– E.g. The ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
– E.g. Price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.

Reconsider the semantics in Example 1.
• The XML data in Example 1 (psj.xml) is a typical data-centric XML document, derived from structured data that is usually stored in relational or object-relational databases.
• The semantics of the data in Example 1 can be described by the following ER diagram.

The ER diagram of the data in Example 1.
[Figure: ER diagram. Entity types part (pno, pname, color), supplier (sno, sname, city) and project (jno, jname, budget); binary relationship type PS (n:n) between part and supplier with attribute price; ternary relationship type PSJ (n:n:n) among part, supplier and project with attribute qty.]

One of the object-relational database representations of psj.xml

There are 5 tables in the relational schema:
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)

part
pno    pname   color
P001   Nut     Silver
P002   Nut     Copper

supplier
sno    sname   city+
S001   Alfa    Atlanta
S002   Beta    {Atlanta, New York}
S003   Gama    New York

project
jno    jname               budget
J001   Rocket boots        20000
J002   Diving helm         18000
J003   Firework launcher   250000
J004   Blue fireworks      20000

PS
pno    sno    price
P001   S001   5
P001   S002   5.5
P002   S001   4.6
P002   S003   5

PSJ
pno    sno    jno    qty
P001   S001   J001   60
P001   S001   J003   650
P001   S002   J002   70
P001   S002   J003   50
P002   S001   J002   60
P002   S003   J001   20
P002   S003   J004   50

2. ORA-SS in a nutshell
• ORA-SS is a semantically rich data model for semistructured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.
• In comparison with a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.

2.
ORA-SS in a nutshell
• A complete ORA-SS model has 4 diagrams
– Schema diagram
• Represents the structure and constraints (business rules) on XML documents
– Instance diagram
• Visually represents the graphical structure of XML data
– Functional dependency diagram
• Represents FDs in relationship types
– Inheritance diagram
• Represents the specialization/generalization relationships among different object classes in ORA-SS

2. ORA-SS data models
• Object class
– attributes of object class
– ordering on object class
• Relationship Type
– degree of relationship type
– participating object classes in relationship type
– attributes of relationship type
– disjunctive relationship type
– recursive relationship type
– ID dependent relationship type

2. ORA-SS data models (Cont.)
• Attribute
– attributes of object class or relationship type
– key attribute (OID)
– foreign key / referential constraint (IDREF/IDREFS)
– composite attribute
– disjunctive attribute
– attribute with unknown structure
– ordering on attributes
– fixed or default value of attribute
– derived attribute

The ORA-SS schema diagram of Example 1.
[Figure: ORA-SS schema diagram. part (pno, pname, color) is above supplier (sno, sname, city+), connected by the edge labelled "PS, 2, +, +", with price attached to PS; supplier is above project (jno, jname, budget), connected by the edge labelled "PSJ, 3, +, +", with qty attached to PSJ.]
• Part, supplier and project are modeled as object classes.
• Pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• PS is a binary relationship type between part and supplier.
• PSJ is a ternary relationship type defined among part, supplier and project.
• Price is an attribute of the relationship type PS, and qty is an attribute of PSJ.

ORA-SS – Features
• ORA-SS can represent the following semantics
– Object ID attributes play the role of key constraints in object-relational databases, i.e.
the object ID attributes functionally determine (or multi-determine) the object attributes of the same object class.
– Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.
– The distinction between relationship attributes and object attributes.

3. ORA-SS applications
• Due to the rich semantics in ORA-SS, the model can be widely used in
– Normal form XML schema
– Relational/object-relational storage of XML data
– XML view creation and validation [1]
– XML schema/data integration
– XML data query, especially with graphical user interfaces [5]
– XML query optimization
– etc.

[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.

3. ORA-SS applications
Store ORA-SS in object-relational databases
• Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update;
• In a pure relational DBMS, joins take much time.
• ORA-SS reflects the nested structure of semistructured data
• Nested relations need fewer joins

3. ORA-SS applications
Store ORA-SS in object-relational databases (Cont.)
Given an ORA-SS schema diagram
• Each object class is stored as an object relation with its object ID and its object attributes. (e.g.
part, supplier, project)
• Each relationship type is stored as a relationship relation with the object IDs of the participating object classes and its relationship attributes. (e.g. PS and PSJ)
• Multi-valued attributes and composite attributes are stored as nested relations. (e.g. city)
[Figure: the ORA-SS schema diagram of Example 1, with part, supplier and project connected by PS (2, +, +) and PSJ (3, +, +).]

3. ORA-SS applications
Store ORA-SS in object-relational databases (Cont.)
Storage schema for ORA-SS/XML databases of the data in Example 1.
[Figure: the ORA-SS schema diagram of Example 1, shown next to the storage schema below.]
Storage schema
Object relations
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
Relationship relations
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)
Constraint: PSJ[pno, sno] ⊆ PS[pno, sno]

3. ORA-SS applications
Store ORA-SS in object-relational databases (Cont.)
An example to show the advantage of using an object-relational database instead of a relational database.
[Figure: ORA-SS schema diagram. employee with attributes eno, ename, hobby (*), quantification (*, composed of year, degree, Univ.) and job_history (*, composed of year, job_title, company).]
Storage schema in ORDB
Employee (eno, ename, (hobby)*, quantification(year, degree, Univ)*, job_history(year, job_title, company)*)
Storage schema in traditional RDB
Employee (eno, ename)
E_hobby (eno, hobby)
E_quantification (eno, year, degree, Univ.)
E_job_history (eno, year, job_title, company)

3.
ORA-SS applications
Define and validate XML views
• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join
For example, consider the following swap operation, which exchanges the positions of supplier and part in the hierarchy:
[Figure: three panels. The ORA-SS schema diagram of Example 1; "Valid view"; "Invalid view". In both views supplier is placed above part.]
Because price is a relationship attribute, it cannot be moved up together with the supplier elements; doing so would be semantically meaningless in the result view.

3. ORA-SS applications
Define and validate XML views (cont.)
As another example, consider the following projection operation, which drops supplier from the structure:
[Figure: three panels. The ORA-SS schema diagram of Example 1; "Invalid view" (part above project, still carrying price and qty); "Valid view" (part above project, carrying Avg_price and Total_qty).]
Dropping supplier makes price and qty multi-valued attributes, so we should apply aggregation functions to get a meaningful view.

3. ORA-SS applications
Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS.
Query 1: To select and display the projects that do not have any suppliers located in Atlanta.
[Figure: screenshot of the user interface of our graphical query language.]
• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.

3. ORA-SS applications
XML query optimization
• The semantic information represented in ORA-SS is also helpful in optimizing XML queries.
Consider the following simple query example:
(Query 2.) To display the budget of project “J001”.

3. ORA-SS applications
XML query optimization
• Traditional processing has to scan the whole XML document, checking every project with jno=“J001” and finding all corresponding budget values.
• However, in ORA-SS, jno is the object ID and we have the functional dependency jno → budget, so the optimized processing only needs to find the first project instance with jno=“J001” and return the corresponding budget value.

4. Discover semantics in XML documents
• Problem definition
– Input: a well-formed XML document, possibly with a DTD or XSD schema
– Output: the semantics that are required by an ORA-SS schema
• It is a process of enriching an XML schema into an ORA-SS schema by using mining techniques.

4. Discover semantics in XML documents
• Related issues in mining semantics
– Object classes
• Identify object classes
• Identify object IDs
• Identify object attributes and their cardinalities
• Identify IDREF(s) attributes
– Relationship types
• Find relationship types with their degrees and participating object classes
• Find attributes and their cardinalities of relationship types

4. Discover semantics in XML documents
Overview of the whole process.
[Figure: flowchart of the process. Main flow: Start, Identify object classes, Pick out multi-valued attributes, Identify object IDs, Identify multi-valued and composite object attributes, Identify relationship types with relationship attributes, Identify relationship types without relationship attributes, End. Input and output flows carry the intermediate results: object classes, composite attributes, multi-valued attributes, object IDs, single-valued object attributes, single-valued relationship attributes, multi-valued and composite object attributes, multi-valued and composite relationship attributes, and relationship types.]

4. Discover semantics in XML documents
• Assumption
– To simplify the discussion, we do not consider the order of attributes and elements.
• User verification
– The findings of each step of the process should be verified by the user.
– The verified findings of previous steps are used in later steps.

4. Discover semantics in XML documents
Find object classes
• Identify object classes from element types:
– Scan the XML document or, if available, the DTD/XSD of the XML document to select all internal nodes in the document tree.
– An internal node is a node that has child nodes, such as XML attribute types and/or subelement types.
– An internal node may not be an object class, but an object class must correspond to an internal node. Therefore, internal nodes are the candidates for object classes.

4. Discover semantics in XML documents
Find object classes (cont.)
• Detecting composite attributes among the candidate object classes
– Although composite attributes are also internal nodes, there are some special patterns that indicate they are not object classes.
[Figure: an XML element birthday with child XML elements or XML attributes month, day and year, whose values are "3", "20" and "2005".]
The first pattern is that all subelement types or attributes
1) are single-valued;
2) always occur in the same order;
3) have no functional dependency among them, i.e. no FD can be found within the component attributes of a composite attribute.

4. Discover semantics in XML documents
Find object classes (cont.)
[Figure: an XML element student with studNo and a child element hobbies, which contains repeated hobby elements with the values "swimming", "reading" and "basket ball".]
The second pattern is that all subelement types or attributes
1) are of the same type (repeated);
2) have a set of values that is often determined by other element/attribute values (e.g. studNo determines the values of the hobby elements under the “hobbies” element).

4. Discover semantics in XML documents
Find object classes (cont.)
The DTD of Example 1 and its Data Guide are those of psj.dtd shown in Section 1.
From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (they can be found intuitively in the Data Guide). Then, the list { psj, part, supplier, project } contains the candidate object classes.
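The internal-node scan can be sketched in a few lines of Python when only the document, and no DTD/XSD, is available. This is a minimal illustration under stated assumptions: the helper name `internal_nodes` and the one-part `SAMPLE` excerpt are hypothetical, and composite attributes are not filtered out here.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-part excerpt shaped like psj.xml.
SAMPLE = """<psj><part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
<supplier><sno>S001</sno><sname>Alfa</sname><city>Atlanta</city><price>5</price>
<project><jno>J001</jno><jname>Rocket boots</jname><budget>20000</budget><qty>60</qty></project>
</supplier></part></psj>"""


def internal_nodes(root):
    """Return, in document order, the element types that have child
    elements or XML attributes (the candidate object classes)."""
    names = []
    for elem in root.iter():
        has_children = len(elem) > 0
        has_attributes = bool(elem.attrib)
        if (has_children or has_attributes) and elem.tag not in names:
            names.append(elem.tag)
    return names


candidates = internal_nodes(ET.fromstring(SAMPLE))
print(candidates)  # ['psj', 'part', 'supplier', 'project']
```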
Because a well-formed XML document usually has a document root that is not concerned with the data, we can drop the root node psj from the list and get the final result { part, supplier, project }.

4. Discover semantics in XML documents
Identify multi-valued attributes
• After object classes and composite attributes are identified, we pick out all multi-valued attributes for later use.
– Multi-valued attributes can be detected by checking the occurrence constraints in the DTD/XSD, or by counting directly in the document.
– A multi-valued attribute can belong either to an object class (e.g. city of supplier) or to a relationship type. To determine which, we need to find the object IDs first.
– Leaving out multi-valued attributes makes the search for object IDs easier.

4. Discover semantics in XML documents
Find object IDs
• For each identified object class (after user verification)
– If it is located at the first level below the document root, and the DTD/XSD specifies an ID attribute or key constraint, then the corresponding attribute/element should be the object ID.
– Otherwise
• A temporary table is built, which contains all XML attributes and single-valued simple subelement types of the object class.
• Full functional dependencies are searched for in the temporary table.
– If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID;
– else,
» find an attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,
» and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types, to be determined later.
• The result should be verified by the user.

4. Discover semantics in XML documents
Find object IDs (cont.)
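The functional-dependency search in the procedure above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it only considers single-attribute candidates (so every dependency found is trivially full), it resolves ties by column order, and the names `determines` and `suggest_object_id` are hypothetical. The rows are the (sno, sname, price) values of the supplier elements in Example 1.

```python
def determines(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` in `rows`."""
    seen = {}
    for row in rows:
        # setdefault records the first rhs value seen for each lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True


def suggest_object_id(rows):
    """Return (suggested object ID, columns it does not determine)."""
    columns = list(rows[0])
    best, best_determined = None, []
    for k in columns:
        determined = [c for c in columns if c != k and determines(rows, k, c)]
        if best is None or len(determined) > len(best_determined):
            best, best_determined = k, determined
    leftover = [c for c in columns if c != best and c not in best_determined]
    return best, leftover


# supplier_temp, one row per supplier element of psj.xml
supplier_temp = [
    {"sno": "S001", "sname": "Alfa", "price": "5"},
    {"sno": "S002", "sname": "Beta", "price": "5.5"},
    {"sno": "S001", "sname": "Alfa", "price": "4.6"},
    {"sno": "S003", "sname": "Beta", "price": "5"},
]

object_id, relationship_candidates = suggest_object_id(supplier_temp)
print(object_id, relationship_candidates)  # sno ['price']
```

On small samples an accidental dependency can hold, or two columns can tie, which is one reason the suggestion still has to be confirmed by the user.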
Starting from the DTD of Example 1 (psj.dtd, shown in Section 1):
Candidate object classes list
{part, supplier, project}
Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
Notice that, at this stage, all simple subelement types and attributes are treated the same. Multi-valued attributes such as city are not included in the temporary tables.

4. Discover semantics in XML documents
Find object IDs (cont.)
Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
1. In part_temp, we find that pno → pname, color; thus, pno is the object ID of part.
2. In supplier_temp, we only have sno → sname; thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.
3. In project_temp, we only have jno → jname, budget; thus, jno is the object ID of project, and qty is picked out as a relationship attribute.

4. Discover semantics in XML documents
Find object IDs
• After the process of identifying object IDs, we have found:
– The object ID of each object class,
– Single-valued object attributes and their corresponding object classes,
– Single-valued relationship attributes, without yet knowing which relationship type they belong to.

4.
Discover semantics in XML documents
Multi-valued attributes of object classes
• Recall that, before searching for object IDs, all multi-valued attributes were identified. Given a multi-valued attribute under an object class, we check,
– for each object ID value of the object class, whether there is a unique set of values of the attribute.
• If so, it is a multi-valued attribute of the object class; otherwise, it is classified as a multi-valued attribute of some relationship type that is not yet known.

4. Discover semantics in XML documents
Multi-valued attributes of object classes
• For example, city is a multi-valued attribute under supplier
– We check sno and city: since each sno value is always associated with the same set of city values, city is a multi-valued attribute of supplier.

The temporary table of sno and city
sno    city+
S001   Atlanta
S002   {Atlanta, New York}
S001   Atlanta
S003   New York

4. Discover semantics in XML documents
Find cardinality of object class attributes
• For multi-valued object attributes, we should know their cardinality
– If the DTD/XSD specifies it, reuse it.
– Without a schema, count the minimum and maximum occurrences of the multi-valued attributes.
– Notice that both single-valued and multi-valued attributes can be null (e.g. ? and *). Thus, the result should be verified by the user.

4. Discover semantics in XML documents
Find IDREF/IDREFS
• Identify IDREFs
– If the DTD/XSD specifies IDREF/IDREFS or keyref constraints, reuse them.
– Without a schema, we compare the object attribute values with the object ID values of other object classes:
• If all values of a single-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREF;
• If all values of a multi-valued attribute of objects of the same class appear as object ID values of some particular object class, then it is an IDREFS.
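The value-containment test for IDREF/IDREFS can be sketched as below. The function name `classify_reference` and the sample values are hypothetical (Example 1 contains no reference attributes), and the result is only a candidate that the user still has to verify.

```python
def classify_reference(attr_values, object_ids):
    """Classify an attribute as 'IDREF', 'IDREFS' or None.

    attr_values: one list of values per object of the class
                 (a single-valued attribute has one value per list).
    object_ids:  set of object ID values of the candidate target class.
    """
    all_values = [v for values in attr_values for v in values]
    if not all_values or any(v not in object_ids for v in all_values):
        return None
    single_valued = all(len(values) <= 1 for values in attr_values)
    return "IDREF" if single_valued else "IDREFS"


# Hypothetical data: object IDs of project, and two attributes of another class.
project_ids = {"J001", "J002", "J003", "J004"}
leads = [["J001"], ["J003"]]                   # one value per object
works_on = ["J001 J002".split(), ["J004"]]     # blank-separated XML attribute values

print(classify_reference(leads, project_ids))      # IDREF
print(classify_reference(works_on, project_ids))   # IDREFS
print(classify_reference([["J009"]], project_ids)) # None
```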
(Note that, if it is an XML attribute, the multiple values of an IDREFS are separated by blank characters.)

4. Discover semantics in XML documents
Find relationship types
• Identify relationship types (basic idea)
– The search for relationship types is based on the object IDs and the relationship attributes (single-valued or multi-valued).
– Along a path from the root to a leaf node in the document tree, we may pass through several object classes. The object IDs of these object classes can form a temporary table. We build such a temporary table for each single-valued relationship attribute, and find the relationship types.

4. Discover semantics in XML documents
Find relationship types (cont.)
• For each single-valued relationship attribute, there is a path from the root to the attribute; put the object IDs of the object classes along the path into the temporary table, together with the relationship attribute.
– Find the FDs that determine the single-valued relationship attribute in the temporary table.
• For multi-valued relationship attributes, we should find a combination of object IDs of different object classes such that each unique combination of object ID values corresponds to a unique set of attribute values.

4. Discover semantics in XML documents
Find relationship types (cont.)
• From the data in Example 1, we can build a temporary table for price along the path “part/supplier/price” as follows

pno    sno    price
P001   S001   5
P001   S002   5.5
P002   S001   4.6
P002   S003   5

[Figure: the document structure of Example 1, with part (pno, pname, color) above supplier (sno, sname, city+, price) above project (jno, jname, budget, qty).]
We find that {pno, sno} → price; thus, there is a binary relationship type between part and supplier, and price is an attribute of this binary relationship type.

4. Discover semantics in XML documents
Find relationship types (cont.)
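The FD search over object IDs along a path can be sketched as below, using the temporary table for price from Example 1. The helper name `minimal_determinants` is hypothetical, and the brute-force enumeration of ID subsets is just one simple way to find the smallest determining combinations; it is not claimed to be the method used by the authors.

```python
from itertools import combinations


def holds(rows, lhs, rhs):
    """True if the columns in `lhs` together functionally determine `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True


def minimal_determinants(rows, id_columns, attribute):
    """Return the smallest sets of object ID columns that determine `attribute`."""
    for size in range(1, len(id_columns) + 1):
        found = [s for s in combinations(id_columns, size) if holds(rows, s, attribute)]
        if found:
            return found
    return []


# Temporary table for the path part/supplier/price in Example 1.
price_rows = [
    {"pno": "P001", "sno": "S001", "price": "5"},
    {"pno": "P001", "sno": "S002", "price": "5.5"},
    {"pno": "P002", "sno": "S001", "price": "4.6"},
    {"pno": "P002", "sno": "S003", "price": "5"},
]

result = minimal_determinants(price_rows, ["pno", "sno"], "price")
print(result)  # [('pno', 'sno')]
```

The number of object IDs in a minimal determinant gives the degree of the suggested relationship type, here 2. The same function applies unchanged to longer paths with more object IDs.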
• Similarly, we can build a temporary table for qty along the path “part/supplier/project/qty” as follows

pno    sno    jno    qty
P001   S001   J001   60
P001   S001   J003   650
P001   S002   J002   70
P001   S002   J003   50
P002   S001   J002   60
P002   S003   J001   20
P002   S003   J004   50

[Figure: the document structure of Example 1, with part (pno, pname, color) above supplier (sno, sname, city+, price) above project (jno, jname, budget, qty).]
We find that {pno, sno, jno} → qty; thus, there is a ternary relationship type among part, supplier and project, and qty is an attribute of this ternary relationship type.

4. Discover semantics in XML documents
Find relationship types (cont.)
• Relationship types can exist without having relationship attributes.
• To find such relationship types, we need to build a temporary table for the different object classes with their object IDs, based on the existing paths in the document tree.
• Search the temporary table and find MVDs (see the following example).

4. Discover semantics in XML documents
Find relationship types (cont.)
• Suppose we have another document about project, staff and paper. After we have found their object ID attributes, i.e. J_no, St_no and Pa_no respectively, we can create a temporary table as follows.
[Figure: document structure with project (J_no, …) above staff (St_no, …) above paper (Pa_no, …).]
We have already identified the
- Hierarchical structure;
- Object classes and their object IDs;
- Attributes of object classes;
- But no attribute is likely to belong to a relationship type.

4. Discover semantics in XML documents
Find relationship types (cont.)
We build a temporary table which consists of J_no, St_no and Pa_no

J_no   St_no   Pa_no
J001   S001    P001
J001   S002    P003
J002   S001    P001
J002   S003    P001
…      …       …

[Figure: two candidate schemas. CASE 1: project, staff and paper connected by two binary relationship types (degree 2). CASE 2: project, staff and paper connected by one ternary relationship type (degree 3).]
CASE 1. If we find that each St_no value is associated with a unique set of Pa_no values, i.e.
St_no multi-determines Pa_no, then there are two binary relationship types: one between project and staff, and the other between staff and paper.
CASE 2. If there is no FD or MVD in the table, then there is a ternary relationship type among project, staff and paper.

4. Discover semantics in XML documents
Find participation constraints
• The participation constraints of each relationship type can be obtained by counting the unique object ID values in the corresponding temporary table.

4. Discover semantics in XML documents
User verification
• All outputs, including intermediate results, should be verified by users.
• With input and verification from users, a semi-automatic mining process can be applied to discover the semantics in XML documents that are important in designing XML databases, storing XML data, validating XML views, and processing and optimizing XML queries.
• All the discovered semantics can be represented in ORA-SS, but some of them cannot be represented in DTD/XSD.

5. Conclusion
1) We demonstrate a data-centric XML document and show the limitations of current XML schema standards in representing relational semantics and constraints.
2) We introduce ORA-SS, a semantically rich data model that can intuitively express the semantics in XML data.
3) We discuss a naïve method of mining semantics from XML data/schema to generate an ORA-SS schema. More efficient methods should be investigated further.
4) The semantics in ORA-SS are crucial in designing XML databases, writing and interpreting XML queries, validating XML views, etc.
5) The method proposed in this presentation to discover semantics only provides candidate answers. In other words, not all the results are necessarily true, because the contents of the data may change. Therefore, user feedback is indispensable in the process of enriching an XML schema into an ORA-SS schema.

References:
[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002.
[2]. C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company (1981).
[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004. http://www.w3.org/TR/2004/REC-xml-20040204/
[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business media, Inc. 2005.
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004. http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/

Q&A

The End