XML - The University of Alabama Libraries

XML Basics for Digital Humanists
Alabama Digital Humanities Center
September 19 & 23, 2011
Instructor:
Shawn Averkamp, Metadata Librarian
smaverkamp@ua.edu
What is XML?
eXtensible
Markup
Language
Language
• XML is a language for structuring data. (Other
ways of structuring data include databases and Excel
spreadsheets.)
• Not a data model, but a way of encoding a
data model or knowledge domain so that it is
machine-processable.
• XML is composed of syntax rules (just like any
other language).
Markup
• XML uses “markup” to structure data.
• XML uses labels within angle brackets (like in
HTML) to “tag” text.
Ingredients
3 avocados
1/4 cup onions
1/4 teaspoon garlic salt
12 corn tortillas
1 bunch fresh cilantro leaves
jalapeno pepper sauce
<ingredients>
  <ingredient qty="3">avocados</ingredient>
  <ingredient qty="1/4" unit="cup">onions, diced</ingredient>
  <ingredient qty="1/4" unit="t">garlic salt</ingredient>
  <ingredient qty="12">corn tortillas</ingredient>
  <ingredient qty="1" unit="bunch">fresh cilantro leaves</ingredient>
  <ingredient>jalapeno pepper sauce</ingredient>
</ingredients>
In this example, <ingredient> is an element; qty and unit are attributes.
Elements = things we care about
Attributes = properties of those things
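To see why this structure is "machine-processable", here is a minimal sketch that reads a shortened copy of the recipe markup with Python's standard library. The function name list_ingredients and the RECIPE string are illustrative assumptions, not part of the workshop materials.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the recipe markup (illustrative data only).
RECIPE = """
<ingredients>
  <ingredient qty="3">avocados</ingredient>
  <ingredient qty="1/4" unit="cup">onions, diced</ingredient>
  <ingredient>jalapeno pepper sauce</ingredient>
</ingredients>
"""

def list_ingredients(xml_text):
    """Return (name, qty, unit) for each <ingredient> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for item in root.findall("ingredient"):
        # Element content is the "thing"; attributes are its properties.
        # A missing attribute comes back as None.
        rows.append((item.text, item.get("qty"), item.get("unit")))
    return rows

for name, qty, unit in list_ingredients(RECIPE):
    print(name, qty, unit)
```

The element text holds the ingredient itself, while qty and unit are looked up as attributes, which mirrors the element/attribute distinction above.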
eXtensible
• You can extend your data model with other
XML data models (“schemas”).
The etd schema (the elements prefixed with etd:) “extends” the mods schema
<mods>
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space
    applications</title>
  </titleInfo>
  <name type="personal">
    <namePart>Red Ghost</namePart>
    <role>
      <roleTerm>Author</roleTerm>
    </role>
  </name>
  <name type="personal">
    <namePart>Dot Chomper</namePart>
    <role>
      <roleTerm>Advisor</roleTerm>
    </role>
  </name>
  <abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic
  flip flops for space applications…</abstract>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>
Where is XML?
XML drives applications and information you use
every day:
• RSS feeds (Really Simple Syndication) for blogs,
podcasts, and more
• iTunes stores your music library metadata and
usage data in XML
• Google uses XML (in the form of KML) to display
geographic data in Google Maps and Earth (more info:
http://code.google.com/apis/kml/documentation/kml_tut.html )
What’s XML good for?
• Sharing/exchanging data online
• Storing data
• Controlling data display
• Syndication
The XML Family
• XML: the document language
• XPath: language for navigating XML documents
• XSD (XML Schema Definition): schema language
• XSLT (Extensible Stylesheet Language Transformations):
language for transforming XML into other
formats (HTML, text, other XML documents)
• XQuery: language for querying XML (similar to SQL
database querying)
• XForms: language for creating web input forms
XML in the Humanities
• TEI
– Shakespeare Quartos Archive:
http://www.quartos.org/
– Lewis & Clark Journals:
http://lewisandclarkjournals.unl.edu/
• Syriac Reference Portal:
http://www.syriac.ua.edu/
Getting Started
• Open Oxygen
• Open movies.xml example (in left sample.xpr sidebar) or
paste code below into a new document
<?xml version="1.0" encoding="UTF-8"?>
<movies>
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>
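Outside Oxygen, a document like this can also be navigated with path expressions. The sketch below uses Python's standard library, which supports only a limited subset of XPath; the function name titles_for_year and the shortened MOVIES string are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the movies document (XML declaration omitted).
MOVIES = """
<movies>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>
"""

def titles_for_year(xml_text, year):
    """Return the titles of all movies whose <year> equals `year`."""
    root = ET.fromstring(xml_text)
    # The predicate [year='...'] keeps only <movie> elements that have
    # a <year> child with exactly that text.
    path = f"./movie[year='{year}']/title"
    return [title.text for title in root.findall(path)]

print(titles_for_year(MOVIES, 2004))
```

Attributes can be used in predicates in the same way, for example "./movie[@id='2']/title".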
Well-formedness
XML documents must be “well-formed” to be
machine-readable.
• XML documents must have exactly one root element
• XML elements must have a closing tag (or be self-closing, like <year/>)
• XML tags are case sensitive
• XML elements must be properly nested
• XML attribute values must be quoted
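The rules above are what any XML parser enforces. As a rough sketch (not a substitute for Oxygen's well-formedness check), a parser from Python's standard library can be used to test small snippets; is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts xml_text, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# One small snippet per rule; only the first is well-formed.
examples = {
    "well-formed": '<movie id="1"><year>1999</year></movie>',
    "two root elements": "<movie/><movie/>",
    "missing closing tag": "<movie><year>1999</movie>",
    "case mismatch": "<movie><Year>2004</year></movie>",
    "improper nesting": "<movie><year>2004</movie></year>",
    "unquoted attribute": "<movie id=1></movie>",
}

for label, snippet in examples.items():
    print(label, is_well_formed(snippet))
```

A parser stops at the first problem it finds, so a document with several errors usually has to be fixed and re-checked one error at a time.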
Exercise 1
Copy and paste the following code into a new XML
document in Oxygen. Correct all errors necessary
to make this a well-formed XML document.
<movie id=1>
<title>The Green Mile<title>
<year>1999</year>
</movie>
<movie id="2">
<title>Taxi Driver</title>
<year>1976</year>
</movie>
<movie id="3">
<title>The Matrix: Revolutions</title>
<Year>2004</year>
</movie>
<movie id="4">
<title>Shrek II</title>
<year>2004</movie>
</year>
<!-- Comments -->
Enclose comments within double-hyphen/angle
bracket notation:
<!-- a brief comment -->
<!-- This is a very long block of comments…
… … … more comments… … … comments…
(still more comments here…)
-->
5 special symbols
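To use the following characters in a text value,
replace them with these entities (& and < must
always be replaced; the others only where they would
be ambiguous, such as quotes inside attribute values):
Character   Entity
&           &amp;
<           &lt;
>           &gt;
"           &quot;
'           &apos;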
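When XML is produced by a script rather than typed by hand, the replacement is usually done by a library function. A minimal sketch using Python's standard library follows; escape_text and the QUOTES mapping are illustrative names, and the sample strings are made up.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# escape() handles &, < and > by default; the two quote characters
# are passed in as extra replacements.
QUOTES = {'"': "&quot;", "'": "&apos;"}

def escape_text(value):
    """Replace the five special characters with their entities."""
    return escape(value, QUOTES)

original = 'Fish & Chips <hot> "today\'s special"'
escaped = escape_text(original)
print(escaped)

# A parser turns the entities back into the original characters.
parsed = ET.fromstring(f"<title>{escaped}</title>")
print(parsed.text == original)
```

The ampersand is replaced first, so the ampersands that belong to the newly inserted entities are not escaped a second time.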
Exercise 2
In your movies.xml document, add another
movie to the collection. Add a comment
somewhere in the document (or “comment out”
a block of elements). When you’ve finished,
check for well-formedness (blue check icon).
XML Schemas
Schemas describe the syntax rules for encoding a
data model in XML:
– Allowable elements, attributes, and values
– Element types -- simple or complex
• Simple – contains a value
• Complex – contains other elements
– Constraints of elements, attributes, and values
• Repeatability (how many instances of each element allowed)
• Obligation (is the element or attribute mandatory?)
– Datatypes of values (integer, string, date, etc.)
<movies xmlns="http://example.com/schema.xsd">
  <movie id="1">
    <title>The Green Mile</title>
    <year>1999</year>
  </movie>
  <movie id="2">
    <title>Taxi Driver</title>
    <year>1976</year>
  </movie>
  <movie id="3">
    <title>The Matrix: Revolutions</title>
    <year>2004</year>
  </movie>
  <movie id="4">
    <title>Shrek II</title>
    <year>2004</year>
  </movie>
</movies>
XML Schemas
• Schemas are themselves XML files but with a
.xsd file extension.
• In our XML document, we reference the
schema by using a “namespace”
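Because a schema is itself an XML file, it can be read with the same tools as any other XML document. The sketch below embeds a small, hypothetical schema for the movies example (it is not the file behind http://example.com/schema.xsd, and it declares no target namespace) and lists the elements it declares. It only reads the schema; Python's standard library does not validate documents against an XSD.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Hypothetical schema for the movies example (illustration only).
MOVIES_XSD = """
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="movies">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="movie" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="title" type="xs:string"/>
              <xs:element name="year" type="xs:gYear"/>
            </xs:sequence>
            <xs:attribute name="id" type="xs:integer" use="required"/>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
"""

def declared_elements(xsd_text):
    """Return (name, type) for every xs:element in document order."""
    root = ET.fromstring(xsd_text)
    return [(el.get("name"), el.get("type"))
            for el in root.iter(f"{{{XS}}}element")]

for name, type_name in declared_elements(MOVIES_XSD):
    print(name, type_name)
```

Here movies and movie are complex elements (they contain other elements, so they have no type attribute), while title and year are simple elements with a datatype. maxOccurs expresses repeatability and use="required" expresses obligation.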
Namespaces
The namespace is the unique identifier (written as a
URI) for the schema. It is declared with the xmlns attribute.
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>Pac-man shaped magnetic tunnel junctions
    for magnetic flip flops for space applications</title>
  </titleInfo>
  …
  …
</mods>
Namespace prefixes
When two or more schemas are used in an XML
document, we use “prefixes” to distinguish between
the elements of each.
<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/">
  …
  …
  <dateIssued>2011</dateIssued>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>
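Software that reads such a record has to name the namespace of every element it looks for. A minimal sketch with Python's standard library is shown below; the RECORD string is a shortened stand-in for the example above, and the prefixes in the NS mapping are local choices that only need to point at the same namespace URIs as the document.

```python
import xml.etree.ElementTree as ET

# Prefix-to-namespace mapping used in the path expressions below.
NS = {
    "mods": "http://www.loc.gov/mods/v3",
    "etd": "http://www.ndltd.org/standards/metadata/etdms/1.0/",
}

# Shortened stand-in for the MODS record (illustration only).
RECORD = """
<mods xmlns="http://www.loc.gov/mods/v3"
      xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/">
  <originInfo>
    <dateIssued>2011</dateIssued>
  </originInfo>
  <extension>
    <etd:degree>Ph.D.</etd:degree>
    <etd:discipline>Electrical and Computer Engineering</etd:discipline>
  </extension>
</mods>
"""

def read_degree(xml_text):
    """Return (date issued, degree) from a MODS record with etd data."""
    root = ET.fromstring(xml_text)
    # Unprefixed elements belong to the default (mods) namespace,
    # so they still need a prefix in the path expression.
    date = root.find("mods:originInfo/mods:dateIssued", NS)
    degree = root.find("mods:extension/etd:degree", NS)
    return date.text, degree.text

print(read_degree(RECORD))
```

Internally the parser stores each element name together with its namespace, so the degree element is known as {http://www.ndltd.org/standards/metadata/etdms/1.0/}degree regardless of which prefix the document used.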
Valid XML
To be “valid” an XML document must:
• Be well-formed
• Include the schema declaration in the root
element (e.g., <mods xmlns="http://www.loc.gov/mods/v3">)
• Conform to the rules of the schema
Exercise 3
Copy and paste the code on the next slide into a
new XML document in Oxygen. Add a <name>
element to the document, then validate (red
check icon). If it validates, then introduce an
error into your document to see what error
messages Oxygen gives you.
<mods xmlns="http://www.loc.gov/mods/v3" xmlns:xs="http://www.w3.org/2001/XMLSchema"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:etd="http://www.ndltd.org/standards/metadata/etdms/1.0/"
xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-4.xsd"
version="3.4">
<titleInfo>
<title>Pac-man shaped magnetic tunnel junctions for magnetic flip flops for space applications</title>
</titleInfo>
<name type="personal">
<namePart>Red Ghost</namePart>
<role>
<roleTerm>Author</roleTerm>
</role>
</name>
<name type="personal">
<namePart>Dot Chomper</namePart>
<role>
<roleTerm>Advisor</roleTerm>
</role>
</name>
<abstract>Pac-man shaped magnetic tunnel junctions are proposed for CMOS-based magnetic flip flops for space applications…</abstract>
<originInfo>
<dateIssued>2011</dateIssued>
</originInfo>
<extension>
<etd:degree>Ph.D.</etd:degree>
<etd:discipline>Electrical and Computer Engineering</etd:discipline>
</extension>
</mods>
Using and creating schemas
• Always start with the data model!
• Decide what entities and properties are
important to you and your project before
choosing or creating a schema.
Things to consider
• Are there existing schemas that meet your needs?
• Are there commonly used schemas within your field?
• If you find a schema that almost meets your needs, can
you extend it to cover the entire scope of what you
want to model?
• Who (or what software applications) will you be
sharing the data with?
• What kind of functionality do you want to support?
Indexing? Flexible display? Visualizations?
Tailor schemas to meet your needs
• You can make schema rules stricter for your own
project, but not looser: your documents must still
validate against the original schema
• Extend schemas with other schemas (Your
primary schema must allow extensions)
• If you expect use of your XML data to be very
limited, you can change the schema. (Not
recommended if you plan to share your data
widely or beyond your own software
applications)
Documentation
• Data dictionaries, markup guidelines, and best
practices are important, especially if you have
assistants entering your data.
• Examples of documentation:
– MODS guidelines:
http://www.loc.gov/standards/mods/userguide/generalapp.html
– UVa Library TEI guidelines:
http://www.lib.virginia.edu/digital/reports/teiPractices/dlpsPractices_postkb.html
Exercise 4
Work together to create a data model for a dictionary (or
a knowledge domain of your choosing). What should the
root element be? What are the elements that will be
contained within the root? What are the attributes*
(properties) of each of your elements?
Create an instance of your data model in XML. What
adjustments or enhancements would you need to make
for your schema to be extensible?
*How do you know when something should be an attribute or an element? There is often no wrong answer to
this. Use your best judgment—if you think you will not need to further refine a property (for instance, in our
recipe example we would not need to refine quantity or unit any further), an attribute is probably the best
choice.
Resources
• Books, tutorials, and other resources:
http://www.lib.ua.edu/digitalhumanities/xmlresources
• http://www.xml.com/