Tools for Memory: Semantic Content (XML)

Mahesh Chaudhari
School of Computing and Informatics
Department of Computer Science and Engineering
Arizona State University
Outline
• World Wide Web (WWW) and HTML
• Migration towards XML
• What is XML?
• XML supporting technologies
• Languages based on XML specifications
• Conclusion
World Wide Web (WWW) and HTML
• A giant network of computers.
• Part of day-to-day activities.
• E-mail, chat, video, news.
• Most important use: browsing or "surfing" the Net.
• HTML (HyperText Markup Language) is the common language of the Web.
Why HTML?
• Easy to understand, learn and use.
• Quick and attractive way to present information.
• Fixed set of instructions in the form of elements (tags) and attributes, e.g. <HTML>, <HEAD>, <BODY>.
• Standard for sharing information over the Internet.
• Understood by all Internet browsers.
  • Text-based browsers, e.g. Lynx.
  • Graphical browsers, e.g. IE, Firefox, Netscape Navigator.
Structure of HTML Document
<HTML> <!-- Root of the document -->
<HEAD> <!-- Header: cover page of the document -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1252">
<TITLE>course</TITLE>
</HEAD>
<BODY bgcolor="#9F9F9F"> <!-- Contents: main content of the document -->
<Center>course</Center>
<TABLE> <!-- Draws a table with course information in rows and columns -->
<TR>
<TD>123</TD>
<TD>databases</TD>
<TD>S. Urban</TD>
<TD>3</TD>
</TR>
</TABLE>
</BODY>
</HTML>
Tree Structure for HTML
Why not HTML?
• Fixed set of elements vs. user-defined elements:
HTML: <TD>S. Urban</TD>
XML: <instructor>S. Urban</instructor>
• The same holds for attributes.
• Poorly suited to exchanging information between different applications, organizations, etc.
• Cannot attach more meaning (semantics) to the data.
[Figure: a virtual organization. A School (school records), a University (university records), a Company (employment records), a Student/Person (personal records) and a Hospital (health records) are connected through the Internet/network.]
• Sharing information
• Understanding what is being sent/received
Can XML be THE Solution?
eXtensible Markup Language (XML)
• Similar to HTML (consists of elements, attributes and data).
• Allows user-defined elements and attributes (an <instructor> tag is allowed).
• Gives more meaning to the data (adds semantics to the data).
• Extensively used for data exchange.
• Understood by most Internet browsers.
• Stricter, more powerful and richer than HTML.
Structure of XML Document
<?xml version="1.0" encoding="UTF-8"?>
<!-- Root -->
<dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="course.xsd">
<!-- Main content of the document -->
<course>
<crsid>123</crsid>
<cname>databases</cname>
<inst>S. Urban</inst>
<length>3</length>
</course>
<course>
<crsid>124</crsid>
<cname>software engineering</cname>
<inst>J. Urban</inst>
<length>3</length>
</course>
</dataroot>
User-defined elements give more meaning to the data.
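Because the element names carry the meaning, a program can read the course records by name rather than by table position. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the variable `COURSE_XML` and the helper `read_courses` are names invented for this illustration, and the schema attributes on the root are left out for brevity.

```python
import xml.etree.ElementTree as ET

# The course document from the slide, without the schema attributes.
COURSE_XML = """<?xml version="1.0"?>
<dataroot>
  <course>
    <crsid>123</crsid>
    <cname>databases</cname>
    <inst>S. Urban</inst>
    <length>3</length>
  </course>
  <course>
    <crsid>124</crsid>
    <cname>software engineering</cname>
    <inst>J. Urban</inst>
    <length>3</length>
  </course>
</dataroot>"""


def read_courses(xml_text):
    """Return one dict per <course>, keyed by the child element names."""
    root = ET.fromstring(xml_text)
    return [{child.tag: child.text for child in course}
            for course in root.findall("course")]


courses = read_courses(COURSE_XML)
for course in courses:
    print(course["crsid"], course["cname"], course["inst"])
```

The same lookup against the HTML table would have to rely on the position of each `<TD>` cell, since `<TD>` says nothing about what the cell contains.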
Is XML Strict?
HTML (allowed in HTML: the marked end tags may be omitted)
<HTML>
<HEAD>
<TITLE>course</TITLE>
</HEAD>
<BODY bgcolor="#9F9F9F">
<Center>course</Center>
<TABLE border="1">
<TR>
<TD>123
</TD> <!-- optional in HTML -->
<TD>databases</TD>
<TD>S. Urban</TD>
<TD>3</TD>
</TR> <!-- optional in HTML -->
</TABLE>
</BODY>
</HTML>
XML (not allowed in XML: <crsid> has no end tag)
<?xml version="1.0" encoding="UTF-8"?>
<dataroot>
<course>
<crsid>123
<cname>databases</cname>
<inst>S. Urban</inst>
<length>3</length>
</course>
</dataroot>
XML (allowed in XML)
<?xml version="1.0" encoding="UTF-8"?>
<dataroot>
<course>
<crsid>123</crsid>
<cname>databases</cname>
<inst>S. Urban</inst>
<length>3</length>
</course>
</dataroot>
For every start tag, XML must always have a matching end tag!
Is XML more strict than that?
HTML (allowed in HTML)
<HTML>
<HEAD>
<TITLE>course</TITLE>
</HEAD>
<BODY bgcolor="#9F9F9F">
<b><i>This text is bold and italics.</b> but this text is only italics.</i>
</BODY>
</HTML>
XML (not allowed in XML)
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<b><i>This text is bold and italics.</b> but this text is only italics.</i>
</dataroot>
XML (allowed in XML)
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<b><i>This text is bold and italics. but this text is only italics. </i></b>
</dataroot>
All the elements in an XML document must be properly nested!
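Both strictness rules (every start tag needs an end tag, and elements must nest properly) can be checked by handing the text to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the shortened sample strings are assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False on a parse error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# <crsid> is never closed, so </course> does not match the open element.
missing_end_tag = "<course><crsid>123<cname>databases</cname></course>"

# </b> arrives while <i> is still open: improper nesting.
bad_nesting = "<dataroot><b><i>bold and italics.</b> only italics.</i></dataroot>"

# Same content with the elements closed in reverse order of opening.
good_nesting = "<dataroot><b><i>bold and italics. only italics.</i></b></dataroot>"

for sample in (missing_end_tag, bad_nesting, good_nesting):
    print(is_well_formed(sample), sample)
```

An HTML browser would typically render all three samples; an XML parser rejects the first two outright.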
Key Features of XML
• User-defined tags/elements are possible.
• A document has exactly one root element.
• A document must be well-formed:
  • Every start tag must have an end tag.
  • Tags must be properly nested.
  • Tag names are case sensitive and may not contain white space.
  • Tag names must start with a letter or underscore, and may contain letters, digits, periods ( . ), underscores ( _ ) or hyphens ( - ).
  • Tag names cannot begin with the letters "xml", which are reserved.
• Tags should have semantic meaning.
• Start tags may have attributes.
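The naming rules listed above can be written down as a small check. This is a simplified sketch in Python that covers only the ASCII rules stated on the slide; the full XML specification also allows many non-ASCII letters and the colon used for namespaces. The function name `is_valid_tag_name` is invented for this illustration.

```python
import re

# First character: letter or underscore; then letters, digits, '.', '_' or '-'.
_NAME_PATTERN = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def is_valid_tag_name(name):
    """Apply the slide's (ASCII-only) naming rules to a candidate tag name."""
    if _NAME_PATTERN.fullmatch(name) is None:
        return False
    # Names beginning with "xml" (in any letter case) are reserved.
    return not name.lower().startswith("xml")


for candidate in ("instructor", "_id", "crs-id.v2", "1course",
                  "course name", "xmlData"):
    print(candidate, is_valid_tag_name(candidate))
```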
Elements
• Elements
  Always consist of a start tag, data (optional) and an end tag.
  E.g. <crsid>123</crsid>
  An empty element may be written <hr></hr> or <hr/>.
• Attributes
  Provide metadata or additional information about the element; a given attribute name may occur only once in an element's start tag.
  E.g. <course ID="123"></course>
Special Attributes in XML
• ID and IDREF
  ID: a value that is unique in the whole document.
  IDREF: a reference to one of the unique ID values in the document.
  e.g.
<instructor ID="1">S. Urban</instructor>
<instructor ID="2">P. Dasgupta</instructor>
…
<course>
<inst IDREF="1" />
…
</course>
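An IDREF is resolved by looking up the element that carries the matching ID value. The sketch below does this by hand with Python's standard `xml.etree.ElementTree`, which does not validate against a DTD, so `ID` and `IDREF` are treated here as ordinary attribute names. The wrapping `<dataroot>` element, the `<cname>` child and the helper `instructor_for` are assumptions added for this illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative document; the root and <cname> are assumptions for the example.
DOC = """<dataroot>
  <instructor ID="1">S. Urban</instructor>
  <instructor ID="2">P. Dasgupta</instructor>
  <course>
    <cname>databases</cname>
    <inst IDREF="1" />
  </course>
</dataroot>"""

root = ET.fromstring(DOC)

# Index every element that carries an ID attribute by its ID value.
by_id = {el.get("ID"): el for el in root.iter() if el.get("ID") is not None}


def instructor_for(course):
    """Follow the course's <inst IDREF="..."/> to the referenced element."""
    ref = course.find("inst").get("IDREF")
    return by_id[ref].text


course = root.find("course")
print(course.find("cname").text, "is taught by", instructor_for(course))
```

Storing the instructor once and referring to it by ID avoids repeating the name in every course.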
Data-centric XML
• Regular, well-defined structure.
• Ordering of tags is immaterial.
• Used for machine reading.
• E.g. course information or instructor information.
Data-centric XML
E.g. Course Information
<?xml version="1.0" encoding="UTF-8"?>
<dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation="course.xsd">
<course>
<crsid>123</crsid>
<cname>databases</cname>
<inst>S. Urban</inst>
<length>3</length>
</course>
<course>
<crsid>124</crsid>
<cname>software engineering</cname>
<inst>J. Urban</inst>
<length>3</length>
</course>
</dataroot>
Document-centric XML
• Less regular structure.
• Ordering of tags is important.
• Mostly used for human consumption.
• E.g. product descriptions, book information, library catalogs.
Document-centric XML
E.g. Product Description
<Product>
<Intro>
The <ProductName>Turkey Wrench</ProductName> from <Developer>Full
Fabrication Labs, Inc.</Developer> is <Summary>like a monkey wrench,
but not as big.</Summary>
</Intro>
<Description>
<Para>The turkey wrench, which comes in <i>both right- and left-handed versions (skyhook optional)</i>, is made of the <b>finest
stainless steel</b>. The Readi-grip rubberized handle quickly adapts
to your hands, even in the greasiest situations. Adjustment is
possible through a variety of custom dials.</Para>
<Para>You can:</Para>
<List>
<Item><Link URL="Order.html">Order your own turkey wrench</Link></Item>
<Item><Link URL="Wrenches.htm">Read more about wrenches</Link></Item>
<Item><Link URL="Catalog.zip">Download the catalog</Link></Item>
</List>
<Para>The turkey wrench costs <b>just $19.99</b> and, if you
order now, comes with a <b>hand-crafted shrimp hammer</b> as a
bonus gift.</Para>
</Description>
</Product>
XML a Jigsaw Puzzle
[Figure: XML shown as a jigsaw puzzle together with its supporting technologies: DTD, XSD, XPath, XQuery, DOM, SAX, XSLT and CSS.]
• Supporting technologies
  • Give meaning to the elements.
  • Data types for every element.
  • Traversing and querying mechanism.
  • Support in other programming languages.
  • Presentation like HTML.
Meaning and Structure to XML
• Document Type Definition (DTD)
  • Describes the structure of the XML document.
  • Defines the legal parent-child relationships.
  • Uses a custom, non-XML syntax to describe the schema.
  • Does not support data types or namespaces.
• XML Schema Definition (XSD)
  • Uses an XML-like syntax to describe the schema.
  • Supports different data types.
  • Supports namespaces.
Traversing and Querying
• XPath
  • Navigates through the XML document.
  • Finds a particular element or attribute.
  • Building block for XQuery and XSLT.
• XQuery
  • Finds and retrieves elements and attributes from an XML document.
  • Query language similar in spirit to SQL.
  • Supported by relational databases such as Oracle and SQL Server.
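A path expression can be tried out directly on the course document. This sketch uses Python's standard `xml.etree.ElementTree`, which implements only a limited subset of XPath and no XQuery, so it illustrates navigation and a simple predicate rather than the full languages. The compact `XML_TEXT` string is an abbreviated copy of the course example.

```python
import xml.etree.ElementTree as ET

XML_TEXT = (
    "<dataroot>"
    "<course><crsid>123</crsid><cname>databases</cname>"
    "<inst>S. Urban</inst></course>"
    "<course><crsid>124</crsid><cname>software engineering</cname>"
    "<inst>J. Urban</inst></course>"
    "</dataroot>"
)

root = ET.fromstring(XML_TEXT)

# Navigate: every <cname> that is a child of a <course> under the root.
course_names = [el.text for el in root.findall("./course/cname")]

# Find: the instructor of the course whose <crsid> child has the text 124.
instructor = root.find("./course[crsid='124']/inst").text

print(course_names)
print(instructor)
```

The predicate `[crsid='124']` plays roughly the role that a `WHERE` clause plays in SQL.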
Other Programming Languages Support
• Document Object Model (DOM)
  • Standard way of accessing and manipulating XML elements.
  • Loads the entire XML document into memory (RAM).
  • Bi-directional traversal of the XML tree.
  • Slow, with high memory consumption, for large XML documents.
• Simple API for XML (SAX)
  • Event-driven parser for accessing XML elements.
  • Reads the XML document sequentially, element by element.
  • Unidirectional traversal of the XML tree (top to bottom).
  • Fast, with low memory consumption, for large XML documents.
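The difference between the two models shows up clearly when both are used to extract the same data. The sketch below uses Python's standard `xml.dom.minidom` and `xml.sax` modules on an abbreviated course document; the class name `CourseNameHandler` and the variable names are invented for this illustration.

```python
import xml.sax
from xml.dom import minidom

XML_TEXT = (
    "<dataroot>"
    "<course><crsid>123</crsid><cname>databases</cname></course>"
    "<course><crsid>124</crsid><cname>software engineering</cname></course>"
    "</dataroot>"
)

# DOM: the whole tree is built in memory, so we can move down and back up.
dom = minidom.parseString(XML_TEXT)
first_cname = dom.getElementsByTagName("cname")[0]
dom_name = first_cname.firstChild.data        # down to the text node
parent_tag = first_cname.parentNode.tagName   # back up to the parent


# SAX: the parser calls us once per event; nothing is kept unless we keep it.
class CourseNameHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.names = []
        self._inside_cname = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "cname":
            self._inside_cname = True
            self._buffer = []

    def characters(self, content):
        if self._inside_cname:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "cname":
            self.names.append("".join(self._buffer))
            self._inside_cname = False


handler = CourseNameHandler()
xml.sax.parseString(XML_TEXT.encode("utf-8"), handler)

print(dom_name, parent_tag)
print(handler.names)
```

The DOM version can ask for `parentNode` because the tree exists in memory; the SAX handler only sees events in document order and must remember any state it needs itself.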
Presenting XML Data
• Cascading Style Sheets (CSS)
  • Set of instructions to present data in a readable format.
  • Non-XML syntax to make data look pretty.
  • Used in conjunction with HTML.
• EXtensible Stylesheet Language Transformations (XSLT)
  • Transforms an XML document into HTML, another XML document or a text file.
  • Uses XPath extensively.
  • XML-like syntax.
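To give a feel for what an XML-to-HTML transformation produces, the sketch below builds an HTML table from course data. Note that this is a hand-written transformation in Python using `xml.etree.ElementTree`, not XSLT: the Python standard library has no XSLT processor, so the example only mimics the kind of output a stylesheet would describe declaratively. The function name `courses_to_html_table` is invented for this illustration.

```python
import xml.etree.ElementTree as ET


def courses_to_html_table(xml_text):
    """Turn each <course> into a table row and each child into a cell."""
    root = ET.fromstring(xml_text)
    table = ET.Element("table")
    for course in root.findall("course"):
        row = ET.SubElement(table, "tr")
        for field in course:
            ET.SubElement(row, "td").text = field.text
    return ET.tostring(table, encoding="unicode")


html = courses_to_html_table(
    "<dataroot><course><crsid>123</crsid>"
    "<cname>databases</cname></course></dataroot>"
)
print(html)
```

In XSLT the same mapping would be expressed as templates that match `course` elements through XPath expressions.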
Other Markup Languages
• MathML: markup language for mathematics.
• SVG: Scalable Vector Graphics.
• MusicXML: an XML-based music notation file format.
• VoiceXML: format for specifying interactive voice dialogues between a human and a computer.
• Linguistics: use of XML in studying different languages and their grammar.
http://en.wikipedia.org/wiki/List_of_XML_markup_languages
Useful Links
• XML High School Transcript Standard: http://xml.coverpages.org/PESC-HS-Transcript2006.html
• XML Health Care Record Standard: http://enterprise.astm.org/REDLINE_PAGES/E2369.htm
• XML Tutorial from W3Schools: http://www.w3schools.com/xml/default.asp
• XSL Tutorial from W3Schools: http://www.w3schools.com/xsl/xsl_languages.asp
• XPath Tutorial from W3Schools: http://www.w3schools.com/xpath/default.asp
• XQuery Tutorial from W3Schools: http://www.w3schools.com/xquery/default.asp
• DTD Tutorial from W3Schools: http://www.w3schools.com/dtd/default.asp
• XML Schema (XSD) Tutorial from W3Schools: http://www.w3schools.com/schema/default.asp
• XML Editors (a list of editors, not exhaustive; many more exist out there): http://www.xml.com/pub/rg/XML_Editors
• The World Wide Web Consortium (W3C): http://www.w3.org/
• World of Warcraft and XML: http://www.wowwiki.com/XML_User_Interface
• iTunes and XML: http://docs.info.apple.com/article.html?artnum=93732
Revisit: Virtual Organization
[Figure: the virtual organization again, with records now marked as XML. The School (school records), University (university records), Company (employment records), Student/Person (personal records) and Hospital (health records) are connected through the Internet/network.]
• Sharing information
• Understanding what is being sent/received
Summary
• HTML and WWW
  • Limitations of HTML
• Introduction to XML
• XML
  • Structure
  • Key features
  • Data-centric vs. document-centric
• Supporting technologies
Document Type Definition (DTD)
The following slides are derived from the slides of
Dr. Suzanne Dietrich.
She is an assistant professor at the West campus,
Department of Mathematical Sciences & Applied
Computing.
Document Type Definition (DTD)
• Describes the structure of the XML document.
• Defines the legal parent-child relationships.
• Can be defined as an internal section of the XML document, before the root element:
<!DOCTYPE root-element [element-declarations]>
• Can be attached to the XML document as an external reference:
<!DOCTYPE root-element SYSTEM "filename.dtd">
Structure of DTD Document
• <!ELEMENT elementName contentSpecification>
• contentSpecification defines the content of the element:
  • ANY: no restrictions on the element’s content; of limited use.
  • EMPTY: cannot store any content (but may still have attributes).
  • #PCDATA: contains parsed character data (no elements). Reserved characters are written as entity references: &lt; (<), &gt; (>), &quot; ("), &apos; ('), &amp; (&).
    <!ELEMENT inst (#PCDATA)>
  • Nested elements, using parentheses.
  • Mixed elements, which can contain parsed character data and nested elements.
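Because parsed character data is scanned for markup, reserved characters in the text have to be written as entity references. The sketch below shows the escaping with Python's standard `xml.sax.saxutils` helpers; the sample strings are invented for this illustration.

```python
from xml.sax.saxutils import escape, unescape

raw = "Urban & Dietrich: length < 4"

# escape() replaces &, < and > by default.
escaped = escape(raw)

# Quotes are only escaped when asked for, e.g. for attribute values.
quoted = escape('say "hi"', {'"': "&quot;", "'": "&apos;"})

print(escaped)
print(quoted)
print(unescape(escaped) == raw)
```

A parser performs the reverse mapping when it reads the document, so the application sees the original characters.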
DTD: Nested Elements
• (element1, element2, element3) indicates a sequence of elements, i.e., the children must appear in this order.
<!ELEMENT sequencedElements (element1, element2, element3)>
<!ELEMENT course (crsid, cname, inst, length)>
• (elementA | elementB | elementC) indicates a choice of elements: one of the alternatives appears.
<!ELEMENT choiceOfElements (elementA | elementB | elementC)>
<!ELEMENT customer (name | company)>
DTD: Elements Cardinality
• element+: element occurs one or more times
• element*: element occurs zero or more times
• element?: element is optional (zero or one time)
• element: element occurs exactly once
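The sequence and cardinality operators behave much like a regular expression over the names of an element's children. The sketch below makes that analogy concrete for flat sequences only; it is a hand-rolled check, not a DTD validator (Python's standard library does not validate against DTDs), and it does not handle choice groups or nested parentheses. The helper names and the content model `(crsid, cname, inst+, length?)` are assumptions for this illustration.

```python
import re
import xml.etree.ElementTree as ET


def sequence_model_to_regex(model):
    """Translate a flat DTD sequence such as '(a, b+, c?)' into a regex."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        suffix = item[-1] if item[-1] in "+*?" else ""
        name = item.rstrip("+*?")
        # Each child name is followed by one space in the string we match.
        parts.append("(?:%s )%s" % (re.escape(name), suffix))
    return re.compile("".join(parts))


def children_match(element, model):
    """Check the order and count of element's children against the model."""
    child_names = "".join(child.tag + " " for child in element)
    return sequence_model_to_regex(model).fullmatch(child_names) is not None


course = ET.fromstring(
    "<course><crsid>123</crsid><cname>databases</cname>"
    "<inst>S. Urban</inst><length>3</length></course>"
)
team_taught = ET.fromstring(
    "<course><crsid>125</crsid><cname>xml</cname>"
    "<inst>S. Urban</inst><inst>J. Urban</inst></course>"
)

print(children_match(course, "(crsid, cname, inst, length)"))
print(children_match(team_taught, "(crsid, cname, inst, length)"))
print(children_match(team_taught, "(crsid, cname, inst+, length?)"))
```

The second check fails because the strict model requires exactly one `inst` and one `length`; relaxing it with `+` and `?` accepts the team-taught course.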
DTD: Mixed Elements
<!ELEMENT elementName (#PCDATA | child1 | child2 | …)*>
• Elements with mixed content allow both parsed character data and child elements.
• Any number of occurrences of character data or child elements is allowed, in any order.
• Not very useful for a document with a defined structure.
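Mixed content is what document-centric XML, such as the product description earlier, relies on: text and child elements interleave freely. The sketch below shows how Python's standard `xml.etree.ElementTree` exposes such content through the `text` and `tail` attributes; the sample sentence is a shortened variant of the turkey wrench example.

```python
import xml.etree.ElementTree as ET

para = ET.fromstring(
    "<Para>The turkey wrench costs <b>just $19.99</b> today.</Para>"
)
bold = para.find("b")

print(repr(para.text))   # character data before the first child
print(repr(bold.text))   # character data inside <b>
print(repr(bold.tail))   # character data after </b>, still inside <Para>

# itertext() walks text and tails in document order.
full_text = "".join(para.itertext())
print(full_text)
```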
Limitations of DTD
• No support for newer features of XML, most importantly namespaces.
• Lack of expressivity: certain formal aspects of an XML document cannot be captured in a DTD.
• Custom non-XML syntax to describe the schema.