Tools for Memory: Semantic Content (XML)
Mahesh Chaudhari
School of Computing and Informatics
Department of Computer Science and Engineering
Arizona State University

Outline
• World Wide Web (WWW) and HTML
• Migration towards XML
• What is XML?
• XML supporting technologies
• Languages based on XML specifications
• Conclusion

World Wide Web (WWW) and HTML
• Giant network of computers.
• Part of day-to-day activities: email, chat, video, news.
• Most important: browsing or surfing the Net.
• HTML (HyperText Markup Language) is the common language for the Internet.

Why HTML?
• Easy to understand, learn and use.
• Quick and fancy way of presentation.
• Fixed set of instructions in the form of elements (tags) and attributes, e.g. <HTML>, <HEAD>, <BODY>.
• Standard for sharing information over the Internet.
• Understood by all Internet browsers:
  - Text-based browsers, e.g. Lynx, HyperTerminal.
  - Graphical browsers, e.g. IE, Firefox, Netscape Navigator.

Structure of HTML Document

    <HTML>                                <!-- root of the document -->
      <HEAD>                              <!-- header: cover page of the document -->
        <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=Windows-1252">
        <TITLE>course</TITLE>
      </HEAD>
      <BODY bgcolor="#9F9F9F">            <!-- contents: main content of the document -->
        <Center>course</Center>
        <TABLE>                           <!-- draws a table with course information in rows and columns -->
          <TR>
            <TD>123</TD>
            <TD>databases</TD>
            <TD>S. Urban</TD>
            <TD>3</TD>
          </TR>
        </TABLE>
      </BODY>
    </HTML>

Tree Structure for HTML
[Figure: tree view of the HTML document above]

Why not HTML?
• Fixed set of elements vs. user-defined elements:
    HTML: <TD>S. Urban</TD>
    XML:  <instructor>S. Urban</instructor>
• The same applies to attributes.
• Cannot exchange information between different applications, organizations, etc.
• Cannot give more meaning to the data (semantics of the data).

Virtual Organization
[Diagram: participants connected through the Internet/Network, each holding its own records]
• School: School Records
• University: University Records
• Company: Employment Records
• Student/Person: Personal Records
• Hospital: Health Records
Goals:
• Sharing information
• Understanding what is being sent/received

Can XML be THE Solution?

eXtensible Markup Language (XML)
• Similar to HTML (consists of elements, attributes and data).
• Allows user-defined elements and attributes (an <instructor> tag is allowed).
• Gives more meaning to the data (adds semantics to the data).
• Extensively used for data exchange.
• Understood by most Internet browsers.
• More strict, powerful and rich than HTML.

Structure of XML Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!-- root element -->
    <dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="course.xsd">
      <!-- main content of the document -->
      <course>
        <crsid>123</crsid>
        <cname>databases</cname>
        <inst>S. Urban</inst>
        <length>3</length>
      </course>
      <course>
        <crsid>124</crsid>
        <cname>software engineering</cname>
        <inst>J. Urban</inst>
        <length>3</length>
      </course>
    </dataroot>

User-defined elements give more meaning to the data.

Is XML Strict?

Allowed in HTML (the end tags marked with ? may be omitted):

    <HTML>
      <HEAD>
        <TITLE>course</TITLE>
      </HEAD>
      <BODY bgcolor="#9F9F9F">
        <Center>course</Center>
        <TABLE border="1">
          <TR>
            <TD>123 </TD>?
            <TD>databases</TD>
            <TD>S. Urban</TD>
            <TD>3</TD>
          </TR>?
        </TABLE>
      </BODY>
    </HTML>

Not allowed in XML (<crsid> is never closed):

    <?xml version="1.0" encoding="UTF-8"?>
    <dataroot>
      <course>
        <crsid>123
        <cname>databases</cname>
        <inst>S. Urban</inst>
        <length>3</length>
      </course>
    </dataroot>

Allowed in XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <dataroot>
      <course>
        <crsid>123</crsid>
        <cname>databases</cname>
        <inst>S. Urban</inst>
        <length>3</length>
      </course>
    </dataroot>

Every start tag in XML must have a matching end tag!

Is XML more strict than that?

Allowed in HTML (the <b> and <i> elements overlap):

    <HTML>
      <HEAD>
        <TITLE>course</TITLE>
      </HEAD>
      <BODY bgcolor="#9F9F9F">
        <b><i>This text is bold and italics.</b> but this text is only italics.</i>
      </BODY>
    </HTML>

Not allowed in XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <b><i>This text is bold and italics.</b> but this text is only italics.</i>
    </dataroot>

Allowed in XML:

    <?xml version="1.0" encoding="UTF-8"?>
    <dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
      <b><i>This text is bold and italics. but this text is only italics. </i></b>
    </dataroot>

All elements in an XML document must be properly nested!

Key Features of XML
• User-defined tags/elements are possible.
• A document has only one root element.
• A document must be well-formed:
  - Every start tag must have an end tag.
  - Tags must be properly nested.
• Tags in XML are case sensitive and may not contain white space.
• Tag names must start with a letter or underscore, and may contain letters, digits, period ( . ), underscore ( _ ) or hyphen ( - ).
• Tag names cannot begin with the letters "xml", which are reserved.
• Tags should have semantic meaning.
• Start tags may have attributes.

Elements

Elements
• Always consist of a start tag, data (optional), and an end tag.
• E.g. <crsid>123</crsid>
• An empty element can be written as <hr></hr> or <hr/>.

Attributes
• Provide metadata or additional information for the element; each attribute name may occur only once inside an element.
• E.g. <course ID="123"></course>

Special Attributes in XML: ID and IDREF
• ID: a value that is unique in the whole document.
• IDREF: a reference to one of the unique ID values in the document.
• E.g.

    <instructor ID="1">S. Urban</instructor>
    <instructor ID="2">P. Dasgupta</instructor>
    …
    <course>
      <inst IDREF="1" />
      …
    </course>

• Note: when a document is validated against a DTD, values of type ID must be valid XML names, which cannot start with a digit, so a validated document would use values such as i1 and i2.

Data-centric XML
• Regular, defined structure.
• Ordering of tags is immaterial.
• Used for machine reading.
• E.g. course information or instructor information.

Data-centric XML: e.g.
Course Information

    <?xml version="1.0" encoding="UTF-8"?>
    <dataroot xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:noNamespaceSchemaLocation="course.xsd">
      <course>
        <crsid>123</crsid>
        <cname>databases</cname>
        <inst>S. Urban</inst>
        <length>3</length>
      </course>
      <course>
        <crsid>124</crsid>
        <cname>software engineering</cname>
        <inst>J. Urban</inst>
        <length>3</length>
      </course>
    </dataroot>

Document-centric XML
• Less regular structure.
• Ordering of tags is important.
• Mostly used for human consumption.
• E.g. product descriptions, book information, library catalogs.

Document-centric XML: e.g. Product Description

    <Product>
      <Intro>
        The <ProductName>Turkey Wrench</ProductName> from
        <Developer>Full Fabrication Labs, Inc.</Developer> is
        <Summary>like a monkey wrench, but not as big.</Summary>
      </Intro>
      <Description>
        <Para>The turkey wrench, which comes in <i>both right- and
        left-handed versions (skyhook optional)</i>, is made of the
        <b>finest stainless steel</b>. The Readi-grip rubberized handle
        quickly adapts to your hands, even in the greasiest situations.
        Adjustment is possible through a variety of custom dials.</Para>
        <Para>You can:</Para>
        <List>
          <Item><Link URL="Order.html">Order your own turkey wrench</Link></Item>
          <Item><Link URL="Wrenches.htm">Read more about wrenches</Link></Item>
          <Item><Link URL="Catalog.zip">Download the catalog</Link></Item>
        </List>
        <Para>The turkey wrench costs <b>just $19.99</b> and, if you order
        now, comes with a <b>hand-crafted shrimp hammer</b> as a bonus
        gift.</Para>
      </Description>
    </Product>

XML, a Jigsaw Puzzle: Supporting Technologies
[Diagram: XML at the center, surrounded by SAX, XPath, XQuery, XSLT, DOM, DTD, XSD and CSS]
Together the supporting technologies provide:
• Meaning for the elements.
• Data types for every element.
• Traversing and querying mechanisms.
• Support in other programming languages.
• Presentation, like HTML.

Meaning and Structure to XML

Document Type Definition (DTD)
• Describes the structure of the XML document.
• Defines the legal parent-child relationships.
• Uses a custom non-XML syntax to describe the schema.
• Does not support data types or namespaces.

XML Schema Definition (XSD)
• Uses an XML-like syntax to describe the schema.
• Supports different data types.
• Supports namespaces.

Traversing and Querying

XPath
• Navigates through the XML document.
• Finds a particular element or attribute.
• Building block for XQuery and XSLT.

XQuery
• Finds and retrieves elements and attributes from an XML document.
• Query language similar to SQL.
• Supported by relational databases such as Oracle and SQL Server.

Other Programming Languages Support

Document Object Model (DOM)
• Standard way of accessing and manipulating XML elements.
• Loads the entire XML document into memory (RAM).
• Bi-directional traversal of the XML tree.
• Slow, with high memory consumption, for large XML documents.

Simple API for XML (SAX)
• Event-driven parser for accessing XML elements.
• Reads the XML document from the file on an element-by-element basis.
• Unidirectional traversal of the XML tree (top to bottom).
• Fast, with low memory consumption, for large XML documents.

Presenting XML Data

Cascading Style Sheets (CSS)
• Set of instructions to present data in a readable format.
• Non-XML syntax to make data look pretty.
• Used in conjunction with HTML.

eXtensible Stylesheet Language Transformations (XSLT)
• Transforms an XML document into HTML, another XML document or a text file.
• Uses XPath extensively.
• XML-like syntax.

Other Markup Languages
• MathML: markup language for mathematics.
• SVG: Scalable Vector Graphics.
• MusicXML: an XML-based music notation file format.
• VoiceXML: format for specifying interactive voice dialogues between a human and a computer.
• Linguists: use XML in studying different languages and their grammar.
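
The DOM, SAX and XPath styles of access described in the slides above can be tried out with Python's standard library. What follows is only a minimal sketch and is not part of the original slides: the COURSES string reuses the course data from the earlier examples, and the names names_with_dom, CourseNameHandler, names_with_sax and instructor_of are hypothetical helpers chosen for illustration. Note that ElementTree supports only a limited subset of XPath.

```python
import xml.dom.minidom
import xml.sax
import xml.etree.ElementTree as ET

# Course data taken from the data-centric example in the slides.
COURSES = """<dataroot>
  <course><crsid>123</crsid><cname>databases</cname><inst>S. Urban</inst><length>3</length></course>
  <course><crsid>124</crsid><cname>software engineering</cname><inst>J. Urban</inst><length>3</length></course>
</dataroot>"""


def names_with_dom(xml_text):
    """DOM: load the whole tree into memory, then navigate it freely."""
    doc = xml.dom.minidom.parseString(xml_text)
    return [node.firstChild.data for node in doc.getElementsByTagName("cname")]


class CourseNameHandler(xml.sax.ContentHandler):
    """SAX: react to start/end/character events while the parser streams by."""

    def __init__(self):
        super().__init__()
        self.names = []
        self._in_cname = False
        self._buffer = []

    def startElement(self, name, attrs):
        if name == "cname":
            self._in_cname = True
            self._buffer = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self._in_cname:
            self._buffer.append(content)

    def endElement(self, name):
        if name == "cname":
            self.names.append("".join(self._buffer))
            self._in_cname = False


def names_with_sax(xml_text):
    handler = CourseNameHandler()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.names


def instructor_of(xml_text, crsid):
    """XPath-style lookup: the <inst> of the <course> whose <crsid> matches."""
    root = ET.fromstring(xml_text)
    inst = root.find(f"./course[crsid='{crsid}']/inst")
    return inst.text if inst is not None else None


if __name__ == "__main__":
    print(names_with_dom(COURSES))
    print(names_with_sax(COURSES))
    print(instructor_of(COURSES, "124"))
```

Both parsers return the same course names; the difference is that the DOM version holds the whole document in memory, while the SAX handler only keeps the text of the element it is currently reading.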
More XML-based markup languages: http://en.wikipedia.org/wiki/List_of_XML_markup_languages

Useful Links
• http://xml.coverpages.org/PESC-HS-Transcript2006.html
  XML High School Transcript Standard
• http://enterprise.astm.org/REDLINE_PAGES/E2369.htm
  XML Health Care Record Standard
• http://www.w3schools.com/xml/default.asp
  XML Tutorial from W3Schools
• http://www.w3schools.com/xsl/xsl_languages.asp
  XSL Tutorial from W3Schools
• http://www.w3schools.com/xpath/default.asp
  XPath Tutorial from W3Schools
• http://www.w3schools.com/xquery/default.asp
  XQuery Tutorial from W3Schools
• http://www.w3schools.com/dtd/default.asp
  DTD Tutorial from W3Schools
• http://www.w3schools.com/schema/default.asp
  XML Schema (XSD) Tutorial from W3Schools
• http://www.xml.com/pub/rg/XML_Editors
  XML Editors (a list of editors; not exhaustive, many more exist out there)
• http://www.w3.org/
  The World Wide Web Consortium (W3C)
• http://www.wowwiki.com/XML_User_Interface
  World of Warcraft and XML
• http://docs.info.apple.com/article.html?artnum=93732
  iTunes and XML

Virtual Organization Revisited
[Diagram: the same participants connected through the Internet/Network, with some records now exchanged as XML]
• School: School Records (XML)
• University: University Records (XML)
• Company: Employment Records
• Student/Person: Personal Records
• Hospital: Health Records (XML)
Goals:
• Sharing information
• Understanding what is being sent/received

Summary
• HTML and WWW
• Limitations of HTML
• Introduction to XML
• XML structure
• Key features
• Data-centric vs. document-centric
• Supporting technologies

Document Type Definition (DTD)
The following slides are derived from the slides of Dr. Suzanne Dietrich. She is an assistant professor at the West campus, Department of Mathematical Sciences & Applied Computing.

Document Type Definition (DTD)
• Describes the structure of the XML document.
• Defines the legal parent-child relationships.
• Can be defined as an internal section of the XML document, placed before the root element:
    <!DOCTYPE root-element [element-declarations]>
• Can be attached to the XML document as an external reference:
    <!DOCTYPE root-element SYSTEM "filename.dtd">

Structure of DTD Document

    <!ELEMENT elementName contentSpecification>

contentSpecification defines the content of the element:
• ANY: no restrictions on the element's content; of limited use.
• EMPTY: cannot store any content (assume attributes).
• #PCDATA: contains parsed character data (no elements). Special characters are written as entity references: &lt; (<), &gt; (>), &quot; ("), &apos; ('), &amp; (&).
    <!ELEMENT inst (#PCDATA)>
• Nested elements, using parentheses.
• Mixed elements, which can contain parsed character data and nested elements.

DTD: Nested Elements
• (element1, element2, element3) indicates a sequence of elements, i.e., ordered:
    <!ELEMENT sequencedElements (element1, element2, element3)>
    <!ELEMENT course (crsid, cname, inst, length)>
• (elementA | elementB | elementC) indicates a choice of elements:
    <!ELEMENT choiceOfElements (elementA | elementB | elementC)>
    <!ELEMENT customer (name | company)>

DTD: Elements Cardinality
• element+: element occurs one or more times
• element*: element occurs zero or more times
• element?: optional (0 or 1)
• element: exactly once

DTD: Mixed Elements

    <!ELEMENT elementName (#PCDATA | child1 | child2 | …)* >

• Elements with mixed content allow both parsed character data and child elements.
• Any number of occurrences of parsed character data or child elements is allowed.
• Not very useful for a document with a defined structure.

Limitations of DTD
• No support for newer features of XML, most importantly namespaces.
• Lack of expressivity: certain formal aspects of an XML document cannot be captured in a DTD.
• Custom non-XML syntax to describe the schema.
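
To make the DTD rules above concrete, here is a minimal sketch that attaches an internal DTD to the course document from the earlier slides. Several things are assumptions rather than content from the slides: the declaration for dataroot (course*), the (#PCDATA) declarations for crsid, cname and length, and the helper name courses_follow_sequence. Python's standard library parses a DOCTYPE but does not validate against it, so the function below is not a DTD validator; it only hand-codes the single sequence rule (crsid, cname, inst, length) to show what that declaration demands.

```python
import xml.etree.ElementTree as ET

# Course document with an internal DTD (declared before the root element).
# The dataroot declaration is an assumption; the slides only give <course>.
DOC = """<?xml version="1.0"?>
<!DOCTYPE dataroot [
  <!ELEMENT dataroot (course*)>
  <!ELEMENT course (crsid, cname, inst, length)>
  <!ELEMENT crsid (#PCDATA)>
  <!ELEMENT cname (#PCDATA)>
  <!ELEMENT inst (#PCDATA)>
  <!ELEMENT length (#PCDATA)>
]>
<dataroot>
  <course>
    <crsid>123</crsid>
    <cname>databases</cname>
    <inst>S. Urban</inst>
    <length>3</length>
  </course>
</dataroot>
"""

# The order required by <!ELEMENT course (crsid, cname, inst, length)>.
COURSE_SEQUENCE = ["crsid", "cname", "inst", "length"]


def courses_follow_sequence(xml_text):
    """Hand-coded check of one DTD rule; NOT a general DTD validator.

    Returns True when every <course> has exactly the children named in
    COURSE_SEQUENCE, each once and in that order.
    """
    root = ET.fromstring(xml_text)
    return all(
        [child.tag for child in course] == COURSE_SEQUENCE
        for course in root.iter("course")
    )


if __name__ == "__main__":
    print(courses_follow_sequence(DOC))
    # Dropping a required child breaks the sequence rule.
    print(courses_follow_sequence(DOC.replace("<crsid>123</crsid>", "")))
```

A validating parser would apply every declaration in the DTD, including the cardinality operators (+, *, ?) and choices, instead of one hard-coded sequence.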