Capturing Semantics in XML Documents Tok Wang Ling National University of Singapore

Capturing Semantics in
XML Documents
Tok Wang Ling
Department of Computer Science
National University of Singapore
April 9, 2006
KDXD 2006, Singapore
Roadmap
1. XML documents and current XML schema
languages
2. ORA-SS (Object-Relationship-Attribute
model for Semi-Structured data) [4]
3. The applications of ORA-SS
4. Discovering Semantics in XML documents
5. Conclusion
[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business media, Inc. 2005
1. XML – Brief introduction
• XML (eXtensible Markup Language) is
– Released by W3C
– An application of SGML
– A promising standard for data publishing, integration and
exchange on the web
• XML schema
– DTD (Document Type Definition) [3]
– XSD (XML Schema Definition), W3C recommended standard
[6, 7, 8]
[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004.
http://www.w3.org/TR/2004/REC-xml-20040204/
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
1. XML – A motivating example
• Suppose we have an XML document “psj.xml” about different parts, suppliers and projects, where
– The document has a root element psj;
– Under psj, there is a sequence of part elements;
– Under part, there is a sequence of supplier elements;
– Under supplier, there is a sequence of project elements.
Example 1. psj.xml
<?xml version="1.0" encoding="UTF-8"?>
<psj xmlns:xsi="…" xsi:noNamespaceSchemaLocation="…">
<part>
<pno>P001</pno> <pname>Nut</pname> <color>Silver</color>
<supplier>
<sno>S001</sno> <sname>Alfa</sname>
<city>Atlanta</city> <price>5</price>
<project>
<jno>J001</jno> <jname>Rocket boots</jname>
<budget>20000</budget> <qty>60</qty>
</project>
<project>
<jno>J003</jno> <jname>Firework launcher</jname>
<budget>250000</budget> <qty>650</qty>
</project>
</supplier>
<supplier>
<sno>S002</sno> <sname>Beta</sname>
<city>Atlanta</city> <city>New York</city> <price>5.5</price>
<project>
<jno>J002</jno> <jname>Diving helm</jname>
<budget>18000</budget> <qty>70</qty>
</project>
<project>
<jno>J003</jno> <jname>Firework launcher</jname>
<budget>250000</budget> <qty>50</qty>
</project>
</supplier>
</part>
…
…
<part>
<pno>P002</pno> <pname>Nut</pname> <color>Copper</color>
<supplier>
<sno>S001</sno> <sname>Alfa</sname>
<city>Atlanta</city> <price>4.6</price>
<project>
<jno>J002</jno> <jname>Diving helm</jname>
<budget>18000</budget> <qty>60</qty>
</project>
</supplier>
<supplier>
<sno>S003</sno> <sname>Beta</sname>
<city>New York</city> <price>5</price>
<project>
<jno>J001</jno> <jname>Rocket boots</jname>
<budget>20000</budget> <qty>20</qty>
</project>
<project>
<jno>J004</jno> <jname>Blue fireworks</jname>
<budget>20000</budget> <qty>50</qty>
</project>
</supplier>
</part>
</psj>
1. XML – the DTD of “psj.xml”
(a) “psj.dtd”, the DTD of “psj.xml”
<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>
(b) psj.dtd as a Data Guide
▼♦ psj
    ▼♦ part
        ♦ pno
        ♦ pname
        ♦ color
        ▼♦ supplier
            ♦ sno
            ♦ sname
            ♦ city
            ♦ price
            ▼♦ project
                ♦ jno
                ♦ jname
                ♦ budget
                ♦ qty
1. XML – what the DTD says
• A DTD is a simple definition of an XML document, in which users can define
– Element/attribute types
– Occurrence constraints (e.g. ?, +, *)
– Containment among different element types (the structure)
• A DTD cannot express
– Numeric occurrence constraints (e.g. 2 to 8)
– Uniqueness/key constraints on a combination of attributes/elements (in a DTD, an ID can only be declared on one attribute at a time)
– Relationship types among elements and their degrees
– The difference between an attribute (or simple element) of an element type and an attribute (or simple element) of a relationship type
Note: simple elements are element types that contain only PCDATA and have no attributes.
1. XML – XSD
“psj.xsd”, the XSD schema of the motivating example data.
• maxOccurs="unbounded" is the XSD definition of an element occurrence constraint.
• The xs:key element is the XSD definition of a key constraint. It requires that every part element has a non-nil pno element and that the values of all pno elements in the document are unique.
<xs:schema xmlns:xs="…">
  <xs:element name="psj">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="part" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="pno" type="xs:string"/>
              <xs:element name="pname" type="xs:string"/>
              <xs:element name="color" type="xs:string"/>
              <xs:element name="supplier" maxOccurs="unbounded">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="sno" type="xs:string"/>
                    <xs:element name="sname" type="xs:string"/>
                    <xs:element name="city" type="xs:string" maxOccurs="unbounded"/>
                    <xs:element name="price" type="xs:string"/>
                    <xs:element name="project" maxOccurs="unbounded">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="jno" type="xs:string"/>
                          <xs:element name="jname" type="xs:string"/>
                          <xs:element name="budget" type="xs:string"/>
                          <xs:element name="qty" type="xs:string"/>
                        </xs:sequence>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                </xs:complexType>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
    <xs:key name="PK">
      <xs:selector xpath="part"/>
      <xs:field xpath="pno"/>
    </xs:key>
  </xs:element>
</xs:schema>
1. XML – what XSD can tell
• XSD is the standard for XML schema definition, recommended by W3C and supported by most vendors, which
– has an extensible XML syntax,
– supports more data types (user-defined types and 44 built-in types),
– is able to represent uniqueness/key constraints for both attribute types and element types,
– and has many other improvements in comparison with DTD.
1. XML – XSD still has flaws
XSD is not sufficient for expressing the relational semantics in XML data. For example:
1. A key constraint is specified by a key element. The key constraint in XSD is an extension of ID in DTD, and it is quite different from the key constraint in relational databases.
– E.g. in the previous XSD, the values of the key, pno of part, must be unique within the set of part elements in the whole document.
– Therefore, when an element type is located at a lower level, such as supplier and project, XSD cannot declare sno and jno as their key attributes (OIDs), because the same supplier or project may occur under several parent elements.
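The point about lower-level keys can be checked on a reduced copy of Example 1. The sketch below is only an illustration: the helper name is_document_wide_key and the inline sample are assumptions, and the function mimics just the uniqueness part of an xs:key check rather than a full XSD validator.

```python
import xml.etree.ElementTree as ET

# Reduced copy of Example 1 (assumption: only the key-like elements are kept).
SAMPLE = """
<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S002</sno></supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S001</sno></supplier>
    <supplier><sno>S003</sno></supplier>
  </part>
</psj>
"""

def is_document_wide_key(root, selector, field):
    """Return True if every selected element has exactly one `field` child
    and all field values are distinct across the whole document."""
    values = []
    for elem in root.findall(selector):
        fields = elem.findall(field)
        if len(fields) != 1:
            return False
        values.append(fields[0].text)
    return len(values) == len(set(values))

root = ET.fromstring(SAMPLE)
pno_is_key = is_document_wide_key(root, "part", "pno")           # part is at the top level
sno_is_key = is_document_wide_key(root, "part/supplier", "sno")  # S001 occurs under both parts
```

In this sample pno passes the check, while sno fails it because supplier S001 is repeated under P001 and P002, even though sno identifies a supplier.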
1. XML – XSD still has flaws (cont.)
– The key element must contain the following (in order):
a) One and only one selector element, which contains an XPath expression that specifies the set of elements across which the values specified by the fields must be unique.
b) One or more field elements, each of which contains an XPath expression that specifies the values that must be unique for the set of elements specified by the selector element.
– The key constraint is similar to the unique constraint, except that the fields on which a unique constraint is defined can have null values.
1. XML – XSD still has flaws (cont.)
2. XSD does not support relationship types and other relational semantic constraints.
– E.g. the ternary relationship type psj among part, supplier and project in the original data is lost in the XSD.
3. XSD cannot distinguish attributes (or simple elements) of relationship types from attributes (or simple elements) of element types.
– E.g. price is an attribute of the binary relationship type ps between part and supplier. However, it looks the same as sname, an attribute (simple element) of the element supplier.
Reconsider the semantics in Example 1.
• The XML data in Example 1. (psj.xml) is a
typical data-centric XML document that is
derived from structured data contents usually
stored in relational or object-relational databases.
• The semantics of the data in Example 1. can be
described in the ER diagram as follows.
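The ER diagram of the data in Example 1.
[Figure: ER diagram. Entity types: part (pno, pname, color), supplier (sno, sname, city) and project (jno, jname, budget). Relationship types: PS between part and supplier, with attribute price; PSJ, involving project, with attribute qty. All participation cardinalities are labelled n.]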
One of the object-relational database representations of psj.xml
There are 5 tables in the relational schema:
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)

part
pno   pname  color
P001  Nut    Silver
P002  Nut    Copper

supplier
sno   sname  city+
S001  Alfa   Atlanta
S002  Beta   {Atlanta, New York}
S003  Gama   New York

project
jno   jname              budget
J001  Rocket boots       20000
J002  Diving helm        18000
J003  Firework launcher  250000
J004  Blue fireworks     20000

PS
pno   sno   price
P001  S001  5
P001  S002  5.5
P002  S001  4.6
P002  S003  5

PSJ
pno   sno   jno   qty
P001  S001  J001  60
P001  S001  J003  650
P001  S002  J002  70
P001  S002  J003  50
P002  S001  J002  60
P002  S003  J001  20
P002  S003  J004  50
2. ORA-SS in a nutshell
• ORA-SS is a semantically rich data model for semi-structured data.
• It can easily represent the relational semantics and constraints in XML data.
• The ORA-SS model is also a bridge that connects the tree structure of XML and the semantics in relational and object-relational databases.
• In comparison with a traditional ER diagram, an ORA-SS schema diagram also represents the hierarchical structure of XML data.
2. ORA-SS in a nutshell (cont.)
• A complete ORA-SS model has 4 diagrams
– Schema diagram
• Represents the structure of and constraints (business rules) on XML documents
– Instance diagram
• Visually represents the graphical structure of XML data
– Functional dependency diagram
• Represents FDs in relationship types
– Inheritance diagram
• Represents the specialization/generalization relationships among different object classes in ORA-SS
2. ORA-SS data models
• Object class
– attributes of object class
– ordering on object class
• Relationship type
– degree of relationship type
– participating object classes in relationship type
– attributes of relationship type
– disjunctive relationship type
– recursive relationship type
– ID dependent relationship type
2. ORA-SS data models (cont.)
• Attribute
– attributes of object class or relationship type
– key attribute (OID)
– foreign key / referential constraint (IDREF/IDREFS)
– composite attribute
– disjunctive attribute
– attribute with unknown structure
– ordering on attributes
– fixed or default value of attribute
– derived attribute
The ORA-SS schema diagram of Example 1.
[Figure: ORA-SS schema diagram. part (pno, pname, color) is the parent of supplier (sno, sname, city+), which is the parent of project (jno, jname, budget). The edge from part to supplier is labelled "PS, 2, +, +" and the edge from supplier to project is labelled "PSJ, 3, +, +". price is attached under supplier with the label PS, and qty is attached under project with the label PSJ.]
• Part, supplier and project are modeled as object classes.
• PS is a binary relationship type between part and supplier.
• PSJ is a ternary relationship type defined among part, supplier and project.
• pno, sno and jno are declared as the object IDs of part, supplier and project respectively.
• price is an attribute of the relationship type PS, and qty is an attribute of PSJ.
ORA-SS – Features
• ORA-SS can represent the following semantics
– Object ID attributes play the role of key constraints in object-relational databases, i.e. the object ID attributes functionally determine (or multi-valued determine) the object attributes of the same object class.
– Various relationship types, including ID dependent relationship types, with their degrees and participating object classes.
– The distinction between relationship attributes and object attributes.
3. ORA-SS applications
• Due to the rich semantics in ORA-SS, the model can be widely used in
– Normal form XML schema
– Relational/object-relational storage of XML data
– XML view creation and validation [1]
– XML schema/data integration
– XML data query, especially with graphical user interfaces [5]
– XML query optimization
– etc.
[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
3. ORA-SS applications
Store ORA-SS in object-relational databases
• Existing storage approaches store XML in flat files (NF relations), which are long and difficult to query and update;
• In a pure relational DBMS, joins take a lot of time.
• ORA-SS reflects the nested structure of semi-structured data.
• Nested relations need fewer joins.
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
Given an ORA-SS schema diagram:
• Each object class is stored as an object relation with its object ID and its object attributes (e.g. part, supplier, project).
• Each relationship type is stored as a relationship relation with the object IDs of the participating object classes and its relationship attributes (e.g. PS and PSJ).
• Multi-valued attributes and composite attributes are stored as nested relations (e.g. city).
[Figure: the ORA-SS schema diagram of Example 1.]
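The three storage rules above can be sketched in Python on a one-part excerpt of Example 1. This is a minimal sketch under stated assumptions: the function name shred and the use of in-memory dictionaries in place of database relations are illustrative choices, and a frozenset stands in for the nested relation that holds city.

```python
import xml.etree.ElementTree as ET

# One-part excerpt of Example 1 (values taken from psj.xml).
SAMPLE = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S002</sno><sname>Beta</sname>
      <city>Atlanta</city><city>New York</city><price>5.5</price>
      <project><jno>J002</jno><jname>Diving helm</jname>
        <budget>18000</budget><qty>70</qty></project>
    </supplier>
  </part>
</psj>"""

def shred(root):
    """Split the nested document into object and relationship relations."""
    part, supplier, project = {}, {}, {}   # object relations, keyed by object ID
    ps, psj = {}, {}                       # relationship relations
    for p in root.findall("part"):
        pno = p.findtext("pno")
        part[pno] = (p.findtext("pname"), p.findtext("color"))
        for s in p.findall("supplier"):
            sno = s.findtext("sno")
            # The multi-valued attribute city is kept as a nested set.
            cities = frozenset(c.text for c in s.findall("city"))
            supplier[sno] = (s.findtext("sname"), cities)
            ps[(pno, sno)] = s.findtext("price")          # attribute of PS
            for j in s.findall("project"):
                jno = j.findtext("jno")
                project[jno] = (j.findtext("jname"), j.findtext("budget"))
                psj[(pno, sno, jno)] = j.findtext("qty")  # attribute of PSJ
    return part, supplier, project, ps, psj

part, supplier, project, ps, psj = shred(ET.fromstring(SAMPLE))
```

Keying each object relation by its object ID means that a supplier or project repeated under several parents is stored only once, while price and qty stay with the relationship relations PS and PSJ.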
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
Storage schema for ORA-SS/XML databases of the data in Example 1.
[Figure: the ORA-SS schema diagram of Example 1, shown next to the storage schema.]
Storage schema
Object relations
part (pno, pname, color)
supplier (sno, sname, (city)+)
project (jno, jname, budget)
Relationship relations
PS (pno, sno, price)
PSJ (pno, sno, jno, qty)
Constraint:
PSJ[pno, sno] ⊆ PS[pno, sno]
3. ORA-SS applications
Store ORA-SS in object-relational databases (cont.)
An example that shows the advantage of using an object-relational database instead of a relational database.
[Figure: ORA-SS schema diagram. Object class employee with attributes eno and ename, multi-valued attribute hobby (*), and multi-valued composite attributes quantification (year, degree, Univ.) (*) and job_history (year, job_title, company) (*).]
Storage schema in ORDB
Employee (eno, ename, (hobby)*,
quantification(year, degree, Univ)*,
job_history(year, job_title, company)*)
Storage schema in traditional RDB
Employee (eno, ename)
E_hobby (eno, hobby)
E_quantification (eno, year, degree, Univ.)
E_job_history (eno, year, job_title, company)
3. ORA-SS applications
Define and validate XML views
• Valid XML views in ORA-SS
• View definition operators: select, project/drop, swap, join
For example, consider the following swap operation, which exchanges the positions of supplier and part in the hierarchy:
[Figure: the ORA-SS schema diagram of Example 1, followed by a valid view and an invalid view in which supplier is placed above part and project stays at the bottom with qty.]
Because price is a relationship attribute, it cannot be moved up with the supplier elements; that would be semantically meaningless in the result view.
3. ORA-SS applications
Define and validate XML views (cont.)
As another example, consider the following projection operation, which drops supplier from the structure:
[Figure: the ORA-SS schema diagram of Example 1, followed by an invalid view (part above project, keeping price and qty) and a valid view (part above project, with Avg_price and Total_qty).]
Dropping supplier makes price and qty multi-valued attributes, so aggregation functions should be applied to obtain a meaningful view.
3. ORA-SS applications
Graphical XML query based on ORA-SS
A graphical XML query language is designed on the basis of ORA-SS.
Query 1: Select and display the projects that do not have any suppliers located in Atlanta.
[Figure: screenshot of the user interface of our graphical query language.]
• The schema panel loads the ORA-SS schema diagram.
• A graphical query can be posed either by dragging components from the diagram in the schema panel or by using the construction buttons at the top of the window.
• Complex query logic such as quantification, negation and IF-THEN constructions can be specified in the Condition Logic Window.
3. ORA-SS applications
XML query optimization
• The semantic information represented in ORA-SS is also helpful in optimizing XML queries.
Consider the following simple query:
(Query 2.) Display the budget of project “J001”.
3. ORA-SS applications
XML query optimization (cont.)
• Traditional processing has to scan the whole XML document, checking every project for jno=“J001” and finding all corresponding budget values.
• However, in ORA-SS, jno is the object ID of project, so we have the functional dependency
jno → budget
The optimized processing therefore only needs to find the first project instance with jno=“J001” and return the corresponding budget value.
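The difference between the two strategies can be sketched as follows. The function names and the reduced sample are assumptions made for illustration, and the count of visited project elements is only a rough stand-in for processing cost.

```python
import xml.etree.ElementTree as ET

# Reduced excerpt of Example 1: project J001 occurs under two suppliers.
SAMPLE = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J003</jno><budget>250000</budget></project>
    </supplier>
  </part>
  <part><pno>P002</pno>
    <supplier><sno>S003</sno>
      <project><jno>J001</jno><budget>20000</budget></project>
      <project><jno>J004</jno><budget>20000</budget></project>
    </supplier>
  </part>
</psj>"""

def budgets_full_scan(root, jno):
    """Baseline: visit every project and collect all matching budgets."""
    visited, found = 0, []
    for proj in root.iter("project"):
        visited += 1
        if proj.findtext("jno") == jno:
            found.append(proj.findtext("budget"))
    return found, visited

def budget_first_match(root, jno):
    """Uses jno -> budget: the first matching project already gives the answer."""
    visited = 0
    for proj in root.iter("project"):
        visited += 1
        if proj.findtext("jno") == jno:
            return proj.findtext("budget"), visited
    return None, visited

root = ET.fromstring(SAMPLE)
all_budgets, full_visits = budgets_full_scan(root, "J001")
first_budget, early_visits = budget_first_match(root, "J001")
```

The early exit is only safe because the schema states that jno determines budget. Without that knowledge the duplicates found by the full scan could not be assumed to agree.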
4. Discover semantics in XML documents
• Problem definition
– Input: a well-formed XML document, possibly with a DTD or XSD schema
– Output: the semantics that are necessary for an ORA-SS schema
• It is a process of enriching an XML schema into an ORA-SS schema by using mining techniques.
4. Discover semantics in XML documents
• Related issues in mining semantics
– Object classes
• Identify object classes
• Identify object IDs
• Identify object attributes and their cardinalities
• Identify IDREF(S) attributes
– Relationship types
• Find relationship types with their degrees and participating object classes
• Find attributes of relationship types and their cardinalities
4. Discover semantics in XML documents
The whole vision of the process.
[Figure: flowchart of the discovery process. Main flow: Start → Identify object classes → Pick out multi-valued attributes → Identify object ID → Identify multi-valued and composite object attributes → Identify relationship types with relationship attributes → Identify relationship types without relationship attributes → End. Intermediate inputs and outputs shown in the chart: composite attributes, multi-valued attributes, object classes, object ID, single-valued object attributes, single-valued relationship attributes, multi-valued object attributes, composite object attributes, multi-valued relationship attributes, composite relationship attributes and relationship types. The legend distinguishes the main flow of the process, the input flow and the output flow.]
4. Discover semantics in XML documents
• Assumption
– To simplify the discussion, we do not consider the order of attributes and elements.
• User verification
– The findings of each step of the process should be verified by the user.
– The verified findings of previous steps are used in later steps.
4. Discover semantics in XML documents
Find object classes
• Identify object classes from element types:
– Scan the XML document or, if possible, the
DTD/XSD of the XML document to select all internal
nodes in the document tree.
– An internal node means the node must have some
child nodes such as XML attribute types and/or
subelement types.
– An internal node may not be an object class, but an
object class must correspond to an internal node.
Therefore, internal nodes are candidates of object
classes.
4. Discover semantics in XML documents
Find object classes (cont.)
• Detecting composite attributes among the candidate object classes
– Although composite attributes are also internal nodes, there are some special patterns that indicate they are not object classes.
The first pattern is that all subelement types or attributes
1) are single-valued,
2) always occur in the same order,
3) have no functional dependency among them, i.e. no functional dependency can be found within the component attributes of a composite attribute.
[Figure: XML element birthday with child XML elements or XML attributes month, day and year, whose values are "3", "20" and "2005".]
4. Discover semantics in XML documents
Find object classes (cont.)
The second pattern is that all subelement types or attributes are:
1) Of the same type (repeated)
2) Such that the set of subelement/attribute values is often determined by other element/attribute values (e.g. studNo determines the values of the hobby elements under the “hobbies” element).
[Figure: XML element student with studNo and hobbies; hobbies contains three hobby XML elements with the values "swimming", "reading" and "basket ball".]
4. Discover semantics in XML documents
Find object classes (cont.)
The DTD of Example 1.
<?xml version="1.0" encoding="UTF-8"?>
<!--DTD generated by XXX-->
<!ELEMENT psj (part+)>
<!ELEMENT part (pno, pname, color, supplier+)>
<!ELEMENT pno (#PCDATA)>
<!ELEMENT pname (#PCDATA)>
<!ELEMENT color (#PCDATA)>
<!ELEMENT supplier (sno, sname, city+, price, project+)>
<!ELEMENT sno (#PCDATA)>
<!ELEMENT sname (#PCDATA)>
<!ELEMENT city (#PCDATA)>
<!ELEMENT price (#PCDATA)>
<!ELEMENT project (jno, jname, budget, qty)>
<!ELEMENT jno (#PCDATA)>
<!ELEMENT jname (#PCDATA)>
<!ELEMENT budget (#PCDATA)>
<!ELEMENT qty (#PCDATA)>
Dataguide
▼♦ psj
    ▼♦ part
        ♦ pno
        ♦ pname
        ♦ color
        ▼♦ supplier
            ♦ sno
            ♦ sname
            ♦ city
            ♦ price
            ▼♦ project
                ♦ jno
                ♦ jname
                ♦ budget
                ♦ qty
From the DTD of Example 1, the element types psj, part, supplier and project are internal nodes (this can be seen directly in the Dataguide). The list { psj, part, supplier, project } therefore contains the candidate object classes. Because a well-formed XML document usually has a document root that is not concerned with the data, we can drop the root node psj from the list and get the final result { part, supplier, project }.
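The same candidate list can be obtained directly from the document when no DTD is available. The sketch below is an illustration only: the function name candidate_object_classes and the reduced sample are assumptions, and dropping the root tag follows the simplification described above.

```python
import xml.etree.ElementTree as ET

# Reduced excerpt of Example 1.
SAMPLE = """<psj>
  <part><pno>P001</pno><pname>Nut</pname><color>Silver</color>
    <supplier><sno>S001</sno><sname>Alfa</sname>
      <city>Atlanta</city><price>5</price>
      <project><jno>J001</jno><jname>Rocket boots</jname>
        <budget>20000</budget><qty>60</qty></project>
    </supplier>
  </part>
</psj>"""

def candidate_object_classes(root):
    """Return the tags of internal nodes, excluding the document root."""
    internal = set()
    for elem in root.iter():
        # An internal node has subelements and/or XML attributes.
        if len(elem) > 0 or elem.attrib:
            internal.add(elem.tag)
    internal.discard(root.tag)  # the root is not concerned with the data
    return internal

candidates = candidate_object_classes(ET.fromstring(SAMPLE))
```

For this excerpt the result is the set of part, supplier and project. Composite attributes would also show up here, which is why the patterns on the previous slides are still needed.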
4. Discover semantics in XML documents
Identify multi-valued attributes
• After Object classes and composite attributes are
identified, we pick out all multi-valued attributes for later
use.
– Multi-valued attributes can be detected by checking the
occurrence constraints in DTD/XSD, or counting directly in the
document.
– Multi-valued attributes can be either of an object class (e.g. city
of supplier) or a relationship type. To determine the affiliation
of multi-valued attributes, we need to find object ID first.
– Without considering multi-valued attributes, the search of
object ID would be easier.
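When no DTD/XSD is available, counting directly in the document can be sketched as below. The function name multi_valued_attributes is an assumption, and the sketch treats only simple subelements (no children, no XML attributes) as attribute candidates.

```python
from collections import Counter
import xml.etree.ElementTree as ET

# Two supplier instances from Example 1, reduced to the relevant elements.
SAMPLE = """<psj>
  <part><pno>P001</pno>
    <supplier><sno>S001</sno><city>Atlanta</city><price>5</price></supplier>
    <supplier><sno>S002</sno><city>Atlanta</city><city>New York</city>
      <price>5.5</price></supplier>
  </part>
</psj>"""

def multi_valued_attributes(root):
    """Return {(parent tag, child tag)} for simple subelements that occur
    more than once under at least one parent instance."""
    result = set()
    for parent in root.iter():
        counts = Counter(
            child.tag for child in parent
            if len(child) == 0 and not child.attrib  # simple elements only
        )
        for tag, n in counts.items():
            if n > 1:
                result.add((parent.tag, tag))
    return result

multi_valued = multi_valued_attributes(ET.fromstring(SAMPLE))
```

Here only city under supplier is reported. The repeated supplier elements under part are ignored because they are internal nodes, not simple elements.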
4. Discover semantics in XML documents
Find object IDs
• For each identified object class (after user verification)
– If it is located at the first level below the document root, and the DTD/XSD has specified an ID attribute or a key constraint, then the corresponding attribute/element should be the object ID.
– Otherwise
• A temporary table is built, which contains all XML attributes and single-valued simple subelement types of the object class.
• Full functional dependencies are searched for in the temporary table.
– If all attributes/elements are fully functionally dependent on an attribute/element k, then k is most likely the object ID;
– Else,
» find an attribute/element k’ that functionally determines the largest number of attributes/elements; k’ is suggested as the object ID,
» and the attributes/elements that are not determined by k’ are classified as single-valued attributes of some relationship types to be determined later.
• The result should be verified by the user.
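The "Otherwise" branch can be sketched as a small search over a temporary table. This is a simplified illustration, not the full procedure: the function names are assumptions, only single-column candidates are tried, and ties are broken by column order. The rows are the four supplier instances of psj.xml.

```python
def determines(rows, lhs, rhs):
    """True if column `lhs` functionally determines column `rhs` in `rows`."""
    seen = {}
    for row in rows:
        # setdefault stores the first rhs value seen for each lhs value.
        if seen.setdefault(row[lhs], row[rhs]) != row[rhs]:
            return False
    return True

def suggest_object_id(rows, columns):
    """Suggest the column that determines the most other columns.
    Returns (object ID, determined columns, leftover columns)."""
    best, best_dependents = None, []
    for k in columns:
        dependents = [c for c in columns if c != k and determines(rows, k, c)]
        if best is None or len(dependents) > len(best_dependents):
            best, best_dependents = k, dependents
    leftovers = [c for c in columns if c != best and c not in best_dependents]
    return best, best_dependents, leftovers

# supplier_temp built from the supplier elements of psj.xml.
supplier_temp = [
    {"sno": "S001", "sname": "Alfa", "price": "5"},
    {"sno": "S002", "sname": "Beta", "price": "5.5"},
    {"sno": "S001", "sname": "Alfa", "price": "4.6"},
    {"sno": "S003", "sname": "Beta", "price": "5"},
]
suggestion = suggest_object_id(supplier_temp, ["sno", "sname", "price"])
```

For this table sno is suggested as the object ID, sname is determined by it, and price is left over as a candidate relationship attribute. A suggestion of this kind still has to be verified by the user.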
4. Discover semantics in XML documents
Find object IDs (cont.)
Candidate object classes list
{part, supplier, project}
Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
Notice that, at this stage, all simple subelement types and attributes are treated the same.
Multi-valued attributes such as city are not included in the temporary tables.
4. Discover semantics in XML documents
Find object IDs (cont.)
Three temporary tables
part_temp (pno, pname, color)
supplier_temp (sno, sname, price)
project_temp (jno, jname, budget, qty)
1. In part_temp, we find that
pno → pname, color
thus, pno is the object ID of part.
2. In supplier_temp, we only have
sno → sname
thus, sno is the object ID of supplier, and price is picked out as a relationship attribute.
3. In project_temp, we only have
jno → jname, budget
thus, jno is the object ID of project, and qty is picked out as a relationship attribute.
4. Discover semantics in XML documents
Find object IDs
• In the stage after the process of identifying object
IDs, we find out:
– Object IDs of each object class,
– Single-valued object attributes and their
corresponding object classes,
– Single-valued relationship attributes without knowing
what relationship type they belong to.
4. Discover semantics in XML documents
Multi-valued attributes of object classes
• Recall that all multi-valued attributes are identified before the search for object IDs. Given a multi-valued attribute under an object class, we check,
– for each object ID value of the object class, whether there is a unique set of values of the attribute
• If so, it is a multi-valued attribute of the object class;
• Else, it is classified as a multi-valued attribute of some relationship type that is not yet known.
4. Discover semantics in XML documents
Multi-valued attributes of object classes (cont.)
• For example, city is a multi-valued attribute under supplier
– We check sno and city: since each sno value is always associated with the same set of city values, city is a multi-valued attribute of supplier.
The temporary table of sno and city
sno   city+
S001  Atlanta
S002  {Atlanta, New York}
S001  Atlanta
S003  New York
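The check on the sno/city table can be sketched as follows. The function name is_object_attribute is an assumption, and the second input is a hypothetical counterexample that does not occur in Example 1.

```python
def is_object_attribute(pairs):
    """pairs: (object ID value, set of attribute values), one per element instance.
    True if every occurrence of an object ID carries the same value set."""
    seen = {}
    for oid, values in pairs:
        values = frozenset(values)
        if seen.setdefault(oid, values) != values:
            return False
    return True

# The temporary table of sno and city from Example 1.
sno_city = [
    ("S001", {"Atlanta"}),
    ("S002", {"Atlanta", "New York"}),
    ("S001", {"Atlanta"}),
    ("S003", {"New York"}),
]
# Hypothetical data: the same supplier with different city sets.
conflicting = [("S001", {"Atlanta"}), ("S001", {"Boston"})]

city_belongs_to_supplier = is_object_attribute(sno_city)
conflict_result = is_object_attribute(conflicting)
```

In the hypothetical case the attribute would be classified as a multi-valued attribute of some relationship type instead.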
4. Discover semantics in XML documents
Find cardinality of object class attributes
• For multi-valued object attributes, we should
know their cardinality
– If the DTD/XSD has specified, reuse it
– Without schema, count the minimum and maximum
occurrences of the multi-valued attributes.
– Notice that, both single-valued and multi-valued
attributes can be null (e.g. ? and *). Thus, the result
should be verified by the user.
4. Discover semantics in XML documents
Find IDREF/IDREFS
• Identify IDREFs
– If the DTD/XSD has specified IDREF/IDREFS or Keyref
constraints, reuse them.
– Without the schema, we compare the object attribute values
with the values of other object IDs,
• If all values of a single-valued attribute of objects of the same class
appear as object ID values of some particular object class, then it is an
IDREF;
• If all values of a multi-valued attribute of objects of the same class
appear as object ID values of some particular object class, then it is an
IDREFS.
(Note that, if it is an XML attribute, multiple values of IDREFS are
separated by a blank character.)
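The value comparison used when no schema is available can be sketched as a containment test. The function name classify_reference and the sample attribute values are hypothetical (Example 1 contains no reference attributes); only the supplier object IDs are taken from the example.

```python
def classify_reference(attr_values, object_ids):
    """attr_values: one list of values per object instance of the same class.
    object_ids: object ID values of a candidate target object class.
    Returns "IDREF", "IDREFS" or None."""
    flat = [v for values in attr_values for v in values]
    # Every value must appear as an object ID of the target class.
    if not flat or not all(v in object_ids for v in flat):
        return None
    if all(len(values) <= 1 for values in attr_values):
        return "IDREF"    # single-valued attribute
    return "IDREFS"       # multi-valued attribute

supplier_ids = {"S001", "S002", "S003"}  # object IDs of supplier in Example 1

single = classify_reference([["S001"], ["S003"]], supplier_ids)
multi = classify_reference([["S001", "S002"], ["S003"]], supplier_ids)
unrelated = classify_reference([["S009"]], supplier_ids)
```

Value containment can hold by coincidence, especially for small documents, so a detected IDREF or IDREFS is a suggestion to be verified by the user.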
4. Discover semantics in XML documents
Find relationship types
• Identify relationship types (basic idea)
– The search for relationship types is based on the object IDs and the relationship attributes (single-valued or multi-valued).
– Along a path from the root to a leaf node in the document tree, we may pass through several object classes. The object IDs of these object classes can form a temporary table. We build such a temporary table for each single-valued relationship attribute and use it to find relationship types.
4. Discover semantics in XML documents
Find relationship types (cont.)
• For each single-valued relationship attribute, there is a path from the root to the attribute. The object IDs of the object classes along the path are put into a temporary table together with the relationship attribute.
– Find the FDs that determine the single-valued relationship attribute in the temporary table.
• For multi-valued relationship attributes, we should find a combination of object IDs of different object classes such that each unique combination of object ID values corresponds to a unique set of attribute values.
4. Discover semantics in XML documents
Find relationship types (cont.)
• From the data in Example 1, we can build a temporary table for price along the path “part/supplier/price” as follows
pno   sno   price
P001  S001  5
P001  S002  5.5
P002  S001  4.6
P002  S003  5
[Figure: the schema tree of Example 1: part (pno, pname, color), supplier (sno, sname, city+, price), project (jno, jname, budget, qty).]
We can find that {pno, sno} → price; thus, there is a binary relationship type between part and supplier, and price is an attribute of this binary relationship type.
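The search for the object IDs that determine price can be sketched as below. The function names are assumptions, and trying the smallest combinations first is one possible search order rather than a prescribed one.

```python
from itertools import combinations

def fd_holds(rows, lhs, rhs):
    """True if the columns in `lhs` together functionally determine `rhs`."""
    seen = {}
    for row in rows:
        key = tuple(row[c] for c in lhs)
        if seen.setdefault(key, row[rhs]) != row[rhs]:
            return False
    return True

def minimal_determinants(rows, object_ids, attribute):
    """Smallest combinations of object IDs that determine `attribute`."""
    for size in range(1, len(object_ids) + 1):
        found = [combo for combo in combinations(object_ids, size)
                 if fd_holds(rows, combo, attribute)]
        if found:
            return found
    return []

# Temporary table for price along the path part/supplier/price.
price_table = [
    {"pno": "P001", "sno": "S001", "price": "5"},
    {"pno": "P001", "sno": "S002", "price": "5.5"},
    {"pno": "P002", "sno": "S001", "price": "4.6"},
    {"pno": "P002", "sno": "S003", "price": "5"},
]
price_determinants = minimal_determinants(price_table, ["pno", "sno"], "price")
```

Neither pno nor sno alone determines price in this table, so the pair is returned and price is assigned to a binary relationship type. A small sample can make a smaller combination look sufficient, which is one reason the result should be verified by the user.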
4. Discover semantics in XML documents
Find relationship types (cont.)
• Similarly, we can build a temporary table for qty along the path “part/supplier/project/qty” as follows
pno   sno   jno   qty
P001  S001  J001  60
P001  S001  J003  650
P001  S002  J002  70
P001  S002  J003  50
P002  S001  J002  60
P002  S003  J001  20
P002  S003  J004  50
[Figure: the schema tree of Example 1: part (pno, pname, color), supplier (sno, sname, city+, price), project (jno, jname, budget, qty).]
We can find that {pno, sno, jno} → qty; thus, there is a ternary relationship type among part, supplier and project, and qty is an attribute of this ternary relationship type.
4. Discover semantics in XML documents
Find relationship types (cont.)
• Relationship types can exist without having
relationship attributes.
• To find such relationship types, we need
to build a temporary table of the object IDs of the
different object classes, based on the existing
paths in the document tree.
• Search the temporary table to find MVDs (see
the following example).
4. Discover semantics in XML documents
Find relationship types (cont.)
• Suppose we have another document of project, staff,
and paper. After we have found their object ID attributes,
i.e. J_no, St_no, and Pa_no respectively, we can
create a temporary table as follows.
[Figure: document tree of project (J_no, ...), staff (St_no, ...) and paper (Pa_no, ...)]
We have already identified the
- Hierarchical structure;
- Object classes and their object IDs;
- Attributes of object classes;
- But no attribute is likely to belong to any
relationship type.
4. Discover semantics in XML documents
Find relationship types (cont.)
We build a temporary table which consists of J_no, St_no, and Pa_no
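J_no   St_no   Pa_no
J001   S001    P001
J001   S002    P003
J002   S001    P001
J002   S003    P001
…      …       …
[Figure: CASE 1. project and staff joined by a binary relationship type (2); staff and paper joined by a binary relationship type (2)]
[Figure: CASE 2. project, staff and paper joined by a ternary relationship type (3)]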
CASE 1. If we find that each St_no value is associated with a unique set of Pa_no
values regardless of the J_no value, i.e. St_no multi-determines Pa_no,
then there are two binary relationship types: one between project and staff,
and the other between staff and paper.
CASE 2. If there is no FD or MVD in the table,
then there is a ternary relationship type among project, staff and paper.
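A minimal Python sketch of the MVD test used to tell the two cases apart is given below. It uses only the four rows visible in the table (the elided rows are not reconstructed), and the function name mvd_holds is hypothetical.

```python
from collections import defaultdict
from itertools import product

COLS = ("J_no", "St_no", "Pa_no")

# Only the four rows shown in the temporary table.
ROWS = [
    ("J001", "S001", "P001"),
    ("J001", "S002", "P003"),
    ("J002", "S001", "P001"),
    ("J002", "S003", "P001"),
]


def mvd_holds(rows, x, y):
    """Check the MVD x ->> y, where Z is every column other than x and y.

    The MVD holds iff, for each x value, the observed (y, z) pairs are the
    full cross product of the y values and z values seen with that x value.
    """
    xi, yi = COLS.index(x), COLS.index(y)
    zi = [i for i in range(len(COLS)) if i not in (xi, yi)]
    groups = defaultdict(set)
    for row in rows:
        groups[row[xi]].add((row[yi], tuple(row[i] for i in zi)))
    for pairs in groups.values():
        ys = {pair[0] for pair in pairs}
        zs = {pair[1] for pair in pairs}
        if pairs != set(product(ys, zs)):
            return False
    return True


print(mvd_holds(ROWS, "St_no", "Pa_no"))  # -> True
print(mvd_holds(ROWS, "J_no", "St_no"))   # -> False
```

On these four rows St_no multi-determines Pa_no, which corresponds to CASE 1. Adding a hypothetical row (J001, S001, P002) without the matching (J002, S001, P002) would break the MVD.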
4. Discover semantics in XML documents
Find participation constraints
• The participation constraints of each relationship
type can be obtained by counting the unique
object ID values in the corresponding temporary
table.
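One possible reading of the counting step is sketched below in Python for the binary relationship type between part and supplier. The (pno, sno) pairs are taken from the price table earlier in this section; the function name participation and the (min, max) output form are assumptions. Objects that never participate do not appear in the temporary table, so a minimum of 0 cannot be observed this way.

```python
from collections import defaultdict

# (pno, sno) pairs from the temporary table for part/supplier/price.
PAIRS = [
    ("P001", "S001"),
    ("P001", "S002"),
    ("P002", "S001"),
    ("P002", "S003"),
]


def participation(pairs, position):
    """(min, max) number of distinct partners per object.

    position 0 counts suppliers per part; position 1 counts parts per supplier.
    Only objects that occur in the table are counted.
    """
    partners = defaultdict(set)
    for pair in pairs:
        partners[pair[position]].add(pair[1 - position])
    counts = [len(values) for values in partners.values()]
    return min(counts), max(counts)


print("part:", participation(PAIRS, 0))      # -> part: (2, 2)
print("supplier:", participation(PAIRS, 1))  # -> supplier: (1, 2)
```

The observed ranges are only candidates: they describe the current data and should be confirmed by a user before being recorded in the schema.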
4. Discover semantics in XML documents
User verification
• All outputs, including intermediate results,
should be verified by users.
• With input from users and their verification, a semi-automatic mining process can be applied to
discover the semantics in XML documents that are
important in designing XML databases, storing XML
data, validating XML views, and
processing/optimizing XML queries.
• All the discovered semantics can be represented in
ORA-SS, but some of them cannot be represented
in DTD/XSD.
5. Conclusion
1) We demonstrate a data-centric XML document and
show the limitations of current XML schema standards
in representing relational semantics and constraints.
2) We introduce ORA-SS, a semantically rich data model
that can intuitively express the semantics in XML data.
3) We discuss a naïve method of mining semantics from
XML data/schema to generate an ORA-SS schema. More
efficient methods should be investigated further.
5. Conclusion (cont.)
4) The semantics in ORA-SS are crucial in designing
XML databases, writing and interpreting XML queries,
validating XML views, etc.
5) The method proposed in this presentation to
discover semantics only provides candidate answers. In
other words, not all the results are necessarily true,
because they are inferred from the current data, whose contents may change.
Therefore, user feedback is indispensable in the
process of enriching an XML schema into an ORA-SS schema.
References:
[1]. Y. B. Chen, T. W. Ling, M. L. Lee. Designing Valid XML Views. ER2002, Tampere, Finland. Oct 7-11, 2002
[2]. C. J. Date. An Introduction to Database Systems. 3rd edition, Addison-Wesley Publishing Company (1981).
[3]. Extensible Markup Language (XML) 1.0 (3rd Edition). W3C Recommendation 04 February 2004.
http://www.w3.org/TR/2004/REC-xml-20040204/
[4]. T. W. Ling, M. L. Lee, G. Dobbie. Semistructured Database Design. Springer Science+Business media, Inc. 2005
[5]. W. Ni, T. W. Ling. GLASS: A Graphical Query Language for Semi-Structured Data. DASFAA 2003.
[6]. XML Schema Part 0: Primer Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-0-20041028/
[7]. XML Schema Part 1: Structures Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/
[8]. XML Schema Part 2: Datatypes Second Edition. W3C Recommendation 28 October 2004.
http://www.w3.org/TR/2004/REC-xmlschema-2-20041028/
Q&A
The End