Markup and Metadata
How to Build a Digital Library
Ian H. Witten and David Bainbridge

Digital Library Elements
- Basic elements of organization
  - Markup: controls structure and appearance
  - Metadata: expedites access

Structural Markup
- Identifies and maintains the document structure:
  - Section divisions
  - Headings
  - Subsection structure
  - Lists
  - Quotations
  - Tabular information
- Structural markup items become metadata

Presentation Markup
- Specifies how the document will appear typographically:
  - Page size
  - Headers and footers
  - Font
  - Line spacing
  - Section headers
  - Figures

Kinds of Metadata
- Resource discovery
  - Metadata to assist in finding documents through searching and browsing
  - The value of a digital library depends on how easily information can be located
- Structural markup
  - Assists navigation
- Policy
  - Defines rights, restrictions, and rules that govern who can do what with digital resources
- Administration and preservation
  - Information necessary to preserve the integrity and functionality of a digital resource over the long term

Explicit versus Extracted Metadata
- Explicit metadata
  - Requires careful analysis of a document
  - Takes 1-2 hours to create a traditional library catalog entry (or 5 minutes, depending on the number of fields!)
- Extracted metadata
  - "Text mining"
  - Automatically obtained from the contents of a document
  - Cheaper, but less reliable

HTML
- Hypertext Markup Language
- Document format of the World Wide Web
- Original vision: separate document structure from presentation
- Inconsistent formatting and metadata conventions in HTML may discourage automatic processing of document collections

Basic HTML
- Angle brackets enclose tag names: <title>My Story</title>
- Tag names are not case sensitive

HTML Tags
- <p>: paragraph
- <tr>: table row
- <td>: table cell
- <li>: list item
- <img>: image
- <i>: italics
- <ul>: unordered (bulleted) list
- <a> ... </a>: link anchor
- Special characters

HTML
- Header
  - Gives global information: title, encoding scheme, metadata
- Body
- Opening tags
  - Attributes
- Special markers
- ASCII / UTF-8 Unicode
- Local link anchors
  - Navigation within a single document
- Forms
  - Collect data from the user

Frames
- An HTML document can be tiled into smaller, independent segments (each an HTML page)
- A frameset (a set of frames) can be displayed simultaneously; useful for navigation bars

HTML in Digital Libraries
- Many source documents are presented in HTML form
- Metadata can be specified explicitly using <meta> tags
- Extracting text
  - The plain-text browser "lynx" extracts text from HTML documents

XML
- Extensible Markup Language
- Flexible way to characterize document structure and metadata
- Well suited to digital libraries
- Widespread use

XML Document Type Definition
- DTD = Document Type Definition
- Tag syntax: <!...>
- Keywords are in block capitals
- Square brackets [...] indicate that the DTD appears in-line
- Otherwise, the DTD can be in an external file
  - Referred to by a URL
  - Desirable
- New elements
  - Keyword ELEMENT
  - Tag name
  - Description of what the element may contain
- A leaf
  - An element that is plain text, with no markup
  - Declared as #PCDATA (parsed character data)
- Special characters
  - Encoded as in HTML (&lt;, &amp;, etc.)
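As a rough illustration of the DTD points above, the following sketch embeds a DTD in-line (inside square brackets), declares leaf elements as #PCDATA, and encodes a special character as &amp;. The element names `book`, `title`, and `author`, and the helper `read_book`, are hypothetical and chosen only for this example. Note that Python's standard `xml.etree.ElementTree` parser checks that the document is well formed and decodes the special characters, but it does not validate the content against the DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with an in-line DTD (the part inside [...]).
# "author+" means one or more author elements; title and author are
# leaves declared as #PCDATA (parsed character data).
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE book [
  <!ELEMENT book (title, author+)>
  <!ELEMENT title (#PCDATA)>
  <!ELEMENT author (#PCDATA)>
]>
<book>
  <title>Markup &amp; Metadata</title>
  <author>Ian H. Witten</author>
  <author>David Bainbridge</author>
</book>
"""


def read_book(xml_text):
    """Parse the document and return (title, list of authors).

    The parser only checks well-formedness; the DTD is read but
    not used for validation.
    """
    root = ET.fromstring(xml_text)
    title = root.findtext("title")  # &amp; is decoded to "&"
    authors = [author.text for author in root.findall("author")]
    return title, authors


if __name__ == "__main__":
    title, authors = read_book(DOCUMENT)
    print(title)
    print(authors)
```

Checking the document against the DTD itself would need a validating parser, which the Python standard library does not provide.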
XML Regular Expressions
- The content of an element is described by a regular expression:
  - Comma indicates an ordered sequence
  - Vertical bar indicates a choice of one element from the alternatives
  - Asterisk indicates zero or more
  - Plus indicates one or more
  - Question mark indicates zero or one

XML Attributes
- Attributes
  - Give a set of possible values
  - No nesting
- Keyword ATTLIST
  - Element to which it applies
  - Attribute name
  - Attribute type
  - Appearance restrictions (optional)

XML Entities
- Predefined entities: &lt; &amp; &gt; &apos; &quot;
- New entities can be added in the DTD
- Syntax: ENTITY name "value"
- Example: <!ENTITY howto "How to Build a Digital Library">

XML Parameter Entity
- Useful when several elements share the same attributes
- Parameter entity
  - Special type of entity
  - Marked with a percent symbol

Well Formed and Valid XML
- Well formed
  - A document that conforms to XML syntax; it need not supply a DTD (Document Type Definition)
- Valid
  - A document that conforms to XML syntax and does supply a DTD
  - The content follows the syntactic constraints defined in the DTD

Parsing XML
- Parsing indicates whether the document conforms to the general rules of XML (or to the specific DTD, when applicable)
- Parsing produces a parse tree
  - Begins with a root node
  - The root node has descendants
  - Descendants reflect text content and nested tags
- Programming interface
  - Lets the user traverse the tree and retrieve the data
  - "API": Application Programming Interface

XML DOM
- Document Object Model
- Application Programming Interface (API)
  - Cross-platform
  - Cross-language
- Allows programs to be written that access and modify the document's:
  - Content
  - Structure
  - Style

XML and Digital Libraries
- XML is powerful
  - XML allows file formats within a digital library to be shared
  - Structure explanations are put in a DTD (Document Type Definition)
  - XML provides syntax for expressing structural information (metadata)
- XML goes further by combining with other standards:
  - Support document restructuring, querying, information extraction, and formatting
  - Can have display capabilities similar to HTML

Style Sheets
- Control the presentation of marked-up documents
- Two kinds of style sheets:
  - Cascading Style Sheets (CSS)
    - Work with HTML and XML
  - Extensible Stylesheet Language (XSL)
    - Works with XML
    - Powerful
    - Allows document structure to be altered dynamically

Bibliographic Metadata
- Two standards for representing document metadata:
  - Machine-Readable Cataloging (MARC)
    - Used by professional catalogers for use in libraries
  - The Dublin Core
    - Minimal standard used by people who are not trained in library cataloging
- Two metadata formats used by document authors in scientific and technical fields:
  - BibTeX
  - Refer

MARC
- Machine-Readable Cataloging
- Internally stored as a collection of tagged fields
- Format covers:
  - Bibliographic records
  - Authority records: standardized forms that are part of the librarian's controlled vocabulary
- Governed by AACR2R (Anglo-American Cataloging Rules)
  - Detailed set of rules and guidelines
  - Two parts
    - Part 1: Description of documents
    - Part 2: Description of works

Dublin Core
- Set of metadata elements
- Simple: designed for non-specialists
- Intended for electronic materials that will not receive a full MARC catalog entry
- Named after Dublin, Ohio
  - The first meeting was held there in 1995
- Approved by ANSI (American National Standards Institute) in 2001

Dublin Core (continued)
- Fifteen metadata elements form the core element set
  - May be refined through qualifiers
  - May be augmented by additional elements for local purposes
- Resource
  - "Anything that has identity"
  - Similar to "entity" (objectives of a bibliographic system)
- Does not impose any kind of vocabulary control or authority files
  - Two people might generate very different descriptions of the same resource

Dublin Core Metadata Standard
- Title
- Creator
- Subject
- Description
- Publisher
- Contributor
- Date
- Type
- Format
- Identifier
- Source
- Language
- Relation
- Coverage
- Rights

BibTeX
- Manages bibliographic data and references within documents
- TeX
  - Generalized document-processing system
  - Scientific, mathematical, and technical purposes
- LaTeX
  - Customized version of TeX
  - Freely available
- BibTeX
  - Subsystem of LaTeX

Refer
- Similar to BibTeX
- Designed by computer scientists for use by scientific and technical researchers
- Basis of EndNote
  - Bibliographic tool that augments Microsoft Word

Metadata for Images and Multimedia
- Metadata is not confined to text
- Most image files include data about resolution
- PNG can store text strings
- Image metadata is usually kept separate from the image file

Metadata for Images and Multimedia (continued)
- Two metadata formats:
  - TIFF (Tagged Image File Format)
    - Associates metadata with image files
    - Widespread use for over a decade
    - How images are stored in digital libraries
      - Normal images
      - Document images
  - MPEG-7 (Multimedia Content Description Interface)
    - Scheme to define and store metadata associated with any multimedia information
    - General, extensible, and still being standardized

Extracting Metadata
- Text mining
  - Plain text documents
    - Require text comprehension skills
  - Computer techniques for text analysis
    - Automatic extraction of information from text
    - Good results in constrained domains
- XML and other structured markup languages
  - Make key aspects of documents available to computers and people
  - Encoded information can easily be extracted by parsing the document structure
  - Few documents contain explicitly encoded metadata

General Techniques
- Extracting document metadata
  - Title, author, publisher, date, etc.
- Generic entities
  - Email addresses, URLs, dates, times, money
- Bibliography entries
  - Citation analysis

Key Phrase Metadata
- Key-phrase metadata can successfully be obtained automatically from documents
- Two different approaches:
  - Key-phrase assignment
  - Key-phrase extraction

Generating Phrase Hierarchies
- Key phrases consist of a few well-chosen words that characterize the document
- It is useful to extract a structure that contains ALL the phrases in the documents
- A hierarchical structure of phrases can support browsing around a digital library collection
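The generic-entity extraction listed under General Techniques (email addresses, URLs, dates) can be sketched with a few regular expressions. This is a minimal illustration only: the patterns, the function name `extract_entities`, and the restriction of dates to four-digit years are assumptions made for this example, and a real extractor would need considerably more robust rules.

```python
import re

# Deliberately simple, illustrative patterns (assumptions, not a standard):
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
URL_RE = re.compile(r"https?://[^\s<>\"')]+")
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")  # four-digit years only


def extract_entities(text):
    """Return the email addresses, URLs, and years found in plain text."""
    return {
        "emails": EMAIL_RE.findall(text),
        "urls": URL_RE.findall(text),
        "years": YEAR_RE.findall(text),
    }


if __name__ == "__main__":
    sample = (
        "Contact jane@example.org or see http://example.org/dl "
        "for the 1995 workshop report."
    )
    for kind, values in extract_entities(sample).items():
        print(kind, values)
```

Pattern matching of this kind is cheap to run over a whole collection, which is why extracted metadata is cheaper than explicit cataloging, but it is also less reliable: it will miss unusual formats and occasionally match text that is not an entity at all.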