XML Basics for Digital Humanists
Alabama Digital Humanities Center
September 19 & 23, 2011
Instructor: Shawn Averkamp, Metadata Librarian
smaverkamp@ua.edu

What is XML?
eXtensible Markup Language

Language
• XML is a language for structuring data. (Other ways of structuring data: a database, an Excel spreadsheet, etc.)
• Not a data model, but a way of encoding a data model or knowledge domain so that it is machine-processable.
• XML is composed of syntax rules (just like any other language).

Markup
• XML uses “markup” to structure data.
• XML uses labels within angle brackets (as in HTML) to “tag” text.

Ingredients
3 avocados
1/4 cup onions
1/4 teaspoon garlic salt
12 corn tortillas
1 bunch fresh cilantro leaves
jalapeno pepper sauce

<ingredients>
  <ingredient qty="3">avocados</ingredient>
  <ingredient qty="1/4" unit="cup">onions, diced</ingredient>
  <ingredient qty="1/4" unit="t">garlic salt</ingredient>
  <ingredient qty="12">corn tortillas</ingredient>
  <ingredient qty="1">fresh cilantro leaves</ingredient>
  <ingredient>jalapeno pepper sauce</ingredient>
</ingredients>

(Slide callouts: <ingredient> is an element; qty and unit are attributes.)
Elements = things we care about
Attributes = properties of those things

eXtensible
• You can extend your data model with other XML data models (“schemas”).

The etd schema (the etd:-prefixed elements) “extends” the mods schema:

<mods>
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  <name type="personal">
    <namePart>Red Ghost</namePart>
    <role>
      <roleTerm>Author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart>Dot Chomper</namePart>
    <role>
      <roleTerm>Advisor</roleTerm>
    </role>
  </name>
  <abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic flip flops for space applications…</abstract>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>

Where is XML?
XML drives applications and information you use every day:
• RSS feeds (Really Simple Syndication) for blogs, podcasts, and more
• iTunes stores your music library metadata and usage data in XML
• Google uses XML to display geographic data in Google Maps and Earth (more info: http://code.google.com/apis/kml/documentation/kml_tut.html)

What’s XML good for?
• Sharing/exchanging data online
• Storing data
• Controlling data display
• Syndication

The XML Family
XML: The document language
XPath: Language for navigating XML documents
XSD: Schema language
XSLT (Extensible Stylesheet Language Transformations): Language for transforming XML into other formats (HTML, text, other XML documents)
XQuery: Language for querying XML (similar to SQL database querying)
XForms: Language for creating web input forms

XML in the Humanities
• TEI
  – Shakespeare Quartos Archive: http://www.quartos.org/
  – Lewis & Clark Journals: http://lewisandclarkjournals.unl.edu/
• Syriac Reference Portal: http://www.syriac.ua.edu/

Getting Started
• Open Oxygen
• Open the movies.xml example (in the sample.xpr sidebar on the left) or paste the code below into a new document

<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>

Well-formedness
XML documents must be “well-formed” to be machine-readable.
• XML documents must have a root element
• XML elements must have a closing tag
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted

Exercise 1
Copy and paste the following code into a new XML document in Oxygen. Correct all errors necessary to make this a well-formed XML document.
<movie id=1>
  <title>The Green Mile<title>
  <year>1999</year>
</movie>
<movie id="2">
  <title>Taxi Driver</title>
  <year>1976</year>
</movie>
<movie id="3">
  <title>The Matrix: Revolutions</title>
  <Year>2004</year>
</movie>
<movie id="4">
  <title>Shrek II</title>
  <year>2004</movie>
</year>

<!-- Comments -->
Enclose comments within double-hyphen/angle bracket notation:

<!-- a brief comment -->

<!-- This is a very long block of comments…
     … … … more comments…
     … … comments…
     (still more comments here…) -->

5 special symbols
To use the following characters in a text value, replace them with these entities. (& and < must always be escaped; the others are required only in certain contexts, such as quotes inside attribute values.)

Character   Entity
&           &amp;
<           &lt;
>           &gt;
"           &quot;
'           &apos;

Exercise 2
In your movies.xml document, add another movie to the collection. Add a comment somewhere in the document (or “comment out” a block of elements). When you’ve finished, check for well-formedness (blue check icon).

XML Schemas
Schemas describe the syntax rules for encoding a data model in XML:
– Allowable elements, attributes, and values
– Element types: simple or complex
  • Simple: contains a value
  • Complex: contains other elements
– Constraints on elements, attributes, and values
  • Repeatability (how many instances of each element are allowed)
  • Obligation (is the element or attribute mandatory?)
– Datatypes of values (integer, string, date, etc.)

<movies xmlns="http://example.com/schema.xsd">
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>

XML Schemas
• Schemas are themselves XML files but with a .xsd file extension.
• In our XML document, we reference the schema by using a “namespace”.

Namespaces
The namespace is the unique identifier for the schema.
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  … …
</mods>

Namespace prefixes
When two or more schemas are used in an XML document, we use “prefixes” to distinguish between the elements of each.

<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/">
  … …
  <dateIssued>2011</dateIssued>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>

Valid XML
To be “valid”, an XML document must:
• Be well-formed
• Include the schema declaration in the root element (e.g., <mods xmlns="http://www.loc.gov/mods/v3">)
• Conform to the rules of the schema

Exercise 3
Copy and paste the code on the next slide into a new XML document in Oxygen. Add a <name> element to the document, then validate (red check icon). If it validates, introduce an error into your document to see what error messages Oxygen gives you.
<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:xs="http://www.w3.org/2001/XMLSchema"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/"
      xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"
      version="3.4">
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
  </titleInfo>
  <name type="personal">
    <namePart>Red Ghost</namePart>
    <role>
      <roleTerm>Author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart>Dot Chomper</namePart>
    <role>
      <roleTerm>Advisor</roleTerm>
    </role>
  </name>
  <abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic flip flops for space applications…</abstract>
  <originInfo>
    <dateIssued>2011</dateIssued>
  </originInfo>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>

Using and creating schemas
• Always start with the data model!
• Decide what entities and properties are important to you and your project before choosing or creating a schema.

Things to consider
• Are there existing schemas that meet your needs?
• Are there commonly used schemas within your field?
• If you find a schema that almost meets your needs, can you extend it to cover the entire scope of what you want to model?
• Who (or what software applications) will you be sharing the data with?
• What kind of functionality do you want to support? Indexing? Flexible display? Visualizations?

Tailor schemas to meet your needs
• You can make schema rules stricter (but not more lax)
• Extend schemas with other schemas (your primary schema must allow extensions)
• If you expect use of your XML data to be very limited, you can change the schema.
  (Not recommended if you plan to share your data widely or beyond your own software applications.)

Documentation
• Data dictionaries, markup guidelines, and best practices are important, especially if you have assistants entering your data.
• Examples of documentation:
  – MODS guidelines: http://www.loc.gov/standards/mods/userguide/generalapp.html
  – UVa Library TEI guidelines: http://www.lib.virginia.edu/digital/reports/teiPractices/dlpsPractices_postkb.html

Exercise 4
Work together to create a data model for a dictionary (or a knowledge domain of your choosing).
• What should the root element be?
• What are the elements that will be contained within the root?
• What are the attributes* (properties) of each of your elements?
Create an instance of your data model in XML. What adjustments or enhancements would you need to make for your schema to be extensible?

*How do you know when something should be an attribute or an element? There is often no wrong answer to this. Use your best judgment: if you think you will not need to further refine a property (for instance, in our recipe example we would not need to refine quantity or unit any further), an attribute is probably the best choice.

Resources
• Books, tutorials, and other resources: http://www.lib.ua.edu/digitalhumanities/xmlresources
• http://www.xml.com/
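
Supplement: what a schema file might look like
The XML Schemas slides list what a schema describes (allowable elements, simple vs. complex types, repeatability, obligation, datatypes) without showing one. The following is a minimal sketch of an .xsd for the movies.xml example, not part of the original workshop materials. The type name movieType, the choice of xs:gYear and xs:positiveInteger, and the reuse of the placeholder namespace http://example.com/schema.xsd are all assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema for the movies.xml example. The target namespace
     reuses the placeholder URI from the slides (an assumption). -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns="http://example.com/schema.xsd"
           targetNamespace="http://example.com/schema.xsd"
           elementFormDefault="qualified">

  <!-- Root element: a complex element that contains other elements -->
  <xs:element name="movies">
    <xs:complexType>
      <xs:sequence>
        <!-- Repeatability: at least one <movie> (default minOccurs is 1),
             with no upper limit -->
        <xs:element name="movie" type="movieType" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- movieType is an illustrative name, not taken from the slides -->
  <xs:complexType name="movieType">
    <xs:sequence>
      <!-- Simple elements: each contains a value with a datatype -->
      <xs:element name="title" type="xs:string"/>
      <xs:element name="year" type="xs:gYear"/>
    </xs:sequence>
    <!-- Obligation: the id attribute is mandatory -->
    <xs:attribute name="id" type="xs:positiveInteger" use="required"/>
  </xs:complexType>

</xs:schema>
```

With a schema along these lines, a <movie> missing its id attribute, or a <year> containing text such as "nineteen ninety-nine", would be well-formed but not valid.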