Chapter 5

Markup and
Metadata
How to Build a Digital Library
Ian H. Witten and David Bainbridge
Digital Library Elements

Basic Elements of Organization
 Markup
  Controls structure and appearance
 Metadata
  Expedites access
Structural Markup

Identify and maintain the document structure:
Section divisions
 Headings
 Subsection structure
 Lists
 Quotations
 Tabular information


Structural markup items become metadata
Presentation Markup

Specify how the document will appear
typographically by formatting the document:
Page size
 Headers and footers
 Font
 Line spacing
 Section headers
 Figures

Kinds of Metadata

Assist navigation
 Structural markup
Resource discovery
 Metadata to assist in finding documents through searching and browsing
 Value of digital libraries depends on how easily information can be located
Policy
 Define rights, restrictions, and rules that govern who can do what with digital resources
Administration and Preservation
 Information necessary to preserve the integrity and functionality of a digital resource over the long term
Explicit versus Extracted Metadata

Explicit Metadata
Requires careful analysis of a document
 Takes 1-2 hours to create a traditional library catalog
entry (or 5 minutes, depending on number of fields!)


Extracted Metadata
“Text Mining”
 Automatically obtained from the contents of a
document
 Cheaper, but less reliable

HTML




Hypertext Markup Language
Document format of the World Wide Web
Original vision: separate document structure
from presentation
Inconsistent ways of formatting and metadata in
HTML may discourage automatic processing of
document collections
Basic HTML



Angle brackets enclose words
<title>My Story</title>
Tag names are not case sensitive
HTML Tags








<p> Paragraph
<tr> Table Row
<td> Table Cell
<li> List item
<img> Images
<i> Italics
<ul> Unordered List, Bulleted List
<a> .. </a> Link Anchor
HTML

Opening Tags
 Special Markers
 Attributes
Header
 Gives global information
 Title, encoding scheme, metadata
Body
 ASCII / UTF-8 Unicode
Local link anchors
 Navigation within a single document
Forms
 Collect data from user
Frames


HTML document can be tiled into smaller, independent segments (each an HTML page)
Frameset – a set of frames – can be displayed simultaneously (useful for navigation bars)
HTML in Digital Libraries



Many source documents are presented in HTML
form
Explicit specification of metadata using <meta>
tags
Extract text

Plain text browser “lynx” extracts text from HTML
documents
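
The two tasks on this slide, reading explicit <meta> metadata and extracting plain text, can be sketched with Python's standard html.parser module. This is a minimal illustration, not how lynx or any particular digital library system does it; the class name, the sample page, and the "DC.Creator" meta name are assumptions made for the example.

```python
from html.parser import HTMLParser


class MetaAndTextExtractor(HTMLParser):
    """Collect <meta name=... content=...> pairs, the <title>, and visible text."""

    def __init__(self):
        super().__init__()
        self.metadata = {}
        self.title = ""
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # greater than zero while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # The parser lowercases tag and attribute names, so <TITLE> arrives as "title".
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") and attrs.get("content") is not None:
            self.metadata[attrs["name"].lower()] = attrs["content"]
        elif tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def extract(html):
    """Return (metadata dict, title, plain text) for one HTML document."""
    parser = MetaAndTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata, parser.title.strip(), " ".join(parser.text_parts)


# Hypothetical sample page, invented for this sketch.
SAMPLE = """<html><head>
<TITLE>My Story</TITLE>
<meta name="DC.Creator" content="A. Author">
<meta name="keywords" content="markup, metadata">
</head><body><p>Once upon a <i>time</i>.</p></body></html>"""

metadata, title, text = extract(SAMPLE)
print(metadata)
print(title)
print(text)
```

Because inline tags split the text into pieces, the joined text may contain extra spaces around punctuation; a real extractor would need more careful whitespace handling.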
XML




Extensible Markup Language
Flexible way to characterize document structure
and metadata
Well suited to digital libraries
Widespread use
XML Document Type Definition





DTD = Document Type Definition
Tag syntax <!...>
Keywords in block capitals
Square brackets [...] indicate that the DTD appears in-line
Otherwise, the DTD can be in an external file
 Referred to by a URL
 Desirable
New elements
 Keyword ELEMENT
 Tag name
 Description of what the element may contain
A leaf
 An element that is plain text, with no markup
 Declared as #PCDATA (parsed character data)
Special Characters

Encoded as in HTML (&lt;, &amp;, etc.)
XML Regular Expressions

Regular expression
Comma indicates an ordered sequence
 Vertical bar indicates a choice of one element from
sequence
 Asterisk indicates zero or more
 Plus indicates one or more
 Question mark indicates zero or one
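
The operators listed above map closely onto ordinary regular expressions. As a rough sketch, a simple content model can be translated into a Python regex and matched against the sequence of child tag names. The helper names and the example content model are assumptions for illustration; #PCDATA and mixed content are not handled.

```python
import re


def content_model_to_regex(model):
    """Translate a simple DTD content model into a Python regex string.

    Each element name becomes a group that matches the name followed by one
    space. Commas (ordered sequence) are dropped, while | ? * + and the
    parentheses carry over unchanged.
    """
    model = re.sub(r"\s+", "", model)
    pattern = re.sub(
        r"[A-Za-z_][\w.-]*",
        lambda m: "(?:" + re.escape(m.group()) + " )",
        model,
    )
    return pattern.replace(",", "")


def children_match(model, child_tags):
    """Check whether a list of child tag names satisfies the content model."""
    sequence = "".join(tag + " " for tag in child_tags)
    return re.fullmatch(content_model_to_regex(model), sequence) is not None


# Hypothetical content model, as it might appear inside <!ELEMENT article ...>.
MODEL = "(title, author+, (abstract | summary)?, section*)"

ok = children_match(MODEL, ["title", "author", "author", "section"])
missing_author = children_match(MODEL, ["title", "section"])
wrong_order = children_match(MODEL, ["author", "title"])
print(ok, missing_author, wrong_order)
```

The trailing space after each name keeps a name such as author from matching the start of a longer name such as authors.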

XML Attributes

Attributes
Give set of possible values
 No nesting
 Keyword ATTLIST
 Element to which it applies
 Attribute name
 Attribute type
 Appearance restrictions (optional)

XML Entities

Entities:
 Predefined: &lt;, &amp;, &gt;, &apos;, &quot;
 New entities can be added in the DTD
 Use syntax
  ENTITY
  Name
  "value"

Example: <!ENTITY howto "How to Build a Digital Library">
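
To see entity declarations in action, the short sketch below parses a document whose in-line DTD declares the howto entity from the example and also uses some predefined entities. It relies on Python's standard xml.etree.ElementTree, which expands entities declared in the internal DTD subset; the note element and its text are invented for the illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# The square brackets hold an in-line DTD that declares one new entity.
DOCUMENT = """<?xml version="1.0"?>
<!DOCTYPE note [
  <!ELEMENT note (#PCDATA)>
  <!ENTITY howto "How to Build a Digital Library">
]>
<note>Reading &howto; &amp; taking notes &lt;today&gt;</note>"""

note = ET.fromstring(DOCUMENT)
expanded = note.text
print(expanded)

# Going the other way: escape special characters before writing text into XML.
escaped = escape("Markup & Metadata <draft>")
print(escaped)
```

Parsing replaces each entity reference with its value, so application code sees ordinary characters; escaping is only needed when writing text back out as markup.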
XML Parameter Entity


Several elements share the same attributes
Parameter Entity
Special type of entity
 Percent symbol

Well Formed and Valid XML

Well Formed


A document that conforms to XML syntax; it need not
supply a DTD (Document Type Definition)
Valid
A document that conforms to XML syntax and does
supply a DTD
 The content follows the syntactic constraints defined
in the DTD
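
Checking well-formedness needs only a parser. The sketch below uses Python's standard xml.etree.ElementTree, which reports syntax errors but does not validate against a DTD; checking validity would require a validating parser, which is outside this example. The function name and the sample strings are assumptions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False on any syntax error."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = is_well_formed("<doc><title>My Story</title></doc>")
unclosed = is_well_formed("<doc><title>My Story</doc>")        # <title> never closed
two_roots = is_well_formed("<doc>one</doc><doc>two</doc>")     # more than one root element
print(good, unclosed, two_roots)
```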

Parsing XML


Parsing indicates whether the document conforms to
the general rules of XML (or the specific DTD, when
applicable)
Parsing produces a parse tree




Begins with a root node
Root node has descendants
Descendants reflect text content and nested tags
Programming Interface


Lets user traverse the tree and retrieve the data
“API”: Application Programming Interface
XML DOM





Document Object Model
Application Programming Interface (API)
Cross-platform
Cross-language
Allows programs to be written that access and
modify the document’s:
Content
 Structure
 Style
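
As a small illustration of the DOM idea, the sketch below uses xml.dom.minidom from the Python standard library to walk the parse tree from its root, then change content and structure. The library/book document is invented for the example, and style is not covered here.

```python
from xml.dom.minidom import parseString

dom = parseString("<library><book id='b1'><title>My Story</title></book></library>")
root = dom.documentElement  # root node of the parse tree


def outline(node, depth=0):
    """Return an indented outline of element names and non-empty text nodes."""
    lines = []
    if node.nodeType == node.ELEMENT_NODE:
        lines.append("  " * depth + node.tagName)
        for child in node.childNodes:
            lines.extend(outline(child, depth + 1))
    elif node.nodeType == node.TEXT_NODE and node.data.strip():
        lines.append("  " * depth + repr(node.data.strip()))
    return lines


# Retrieve data: the text of every <title> element.
titles = [t.firstChild.data for t in dom.getElementsByTagName("title")]

# Modify content: change the text inside the first <title>.
dom.getElementsByTagName("title")[0].firstChild.data = "My Revised Story"

# Modify structure: add an <author> child and a new attribute to <book>.
book = dom.getElementsByTagName("book")[0]
author = dom.createElement("author")
author.appendChild(dom.createTextNode("A. Author"))
book.appendChild(author)
book.setAttribute("lang", "en")

result = root.toxml()
print("\n".join(outline(root)))
print(result)
```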

XML and Digital Libraries





XML is powerful
XML allows file formats within a digital library to be shared
Structure explanations are put in a DTD (Document Type
Definition)
XML provides syntax for expressing structural information –
metadata
XML goes further by combining with other standards:


Support document restructuring, querying, information extraction and
formatting
Can have display capabilities similar to HTML
Style Sheets


Control the presentation of marked-up
documents
Two Kinds of Style Sheets:

Cascading Style Sheets


Work with HTML and XML
Extensible Stylesheet Language – XSL
Works with XML
 Powerful
 Allows document structure to be altered dynamically

Bibliographic Metadata

Two Standards for Representing Document Metadata:

Machine-Readable Cataloging (MARC)
 Used by professional catalogers for use in libraries
The Dublin Core
 Minimal standard used by people who are not trained in library cataloging

Two metadata formats used by document authors in
scientific and technical fields:
 BibTeX
 Refer
MARC



Machine-Readable Cataloging
Internally stored as collection of tagged fields
Format covers:



Bibliographic records
Authority records – standardized forms that are part of the
librarian’s controlled vocabulary
Governed by AACR2R



Anglo-American Cataloging Rules
Detailed set of rules and guidelines
Two Parts


Part 1: Description of Documents
Part 2: Description of Works
Dublin Core




Set of metadata elements
Simple - designed for non-specialists
Intended for electronic materials that will not
receive a full MARC catalog entry
Named after Dublin, Ohio


The first meeting was held there in 1995
Approved by ANSI (American National
Standards Institute) in 2001
Dublin Core

Fifteen metadata elements form the core element set
 May be refined through qualifiers
 May be augmented by additional elements for local purposes
Resource
 “Anything that has identity”
 Similar to “entity” (objectives of bibliographic system)
Does not impose any kind of vocabulary control or authority files
 Two people might generate very different descriptions of the same resource
Dublin Core Metadata Standard








Title
Creator
Subject
Description
Publisher
Contributor
Date
Type







Format
Identifier
Source
Language
Relation
Coverage
Rights
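
One common way to attach these fifteen elements to a web page is through HTML <meta> tags. The sketch below generates such tags from a simple record; the "DC." name prefix follows a widely seen convention but is an assumption here, as are the function name and the record layout. The sample values come from the title slide.

```python
from html import escape

# The fifteen elements of the core set, in the order listed above.
DC_ELEMENTS = (
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
)


def dublin_core_meta_tags(record):
    """Render a record as a list of <meta> tags, one per element value.

    A value may be a single string or a list of strings, since an element
    such as creator can be repeated.
    """
    unknown = set(record) - set(DC_ELEMENTS)
    if unknown:
        raise ValueError("not Dublin Core elements: %s" % sorted(unknown))
    tags = []
    for element in DC_ELEMENTS:
        values = record.get(element, [])
        if isinstance(values, str):
            values = [values]
        for value in values:
            tags.append('<meta name="DC.%s" content="%s">'
                        % (element, escape(value, quote=True)))
    return tags


record = {
    "title": "How to Build a Digital Library",
    "creator": ["Ian H. Witten", "David Bainbridge"],
}
tags = dublin_core_meta_tags(record)
print("\n".join(tags))
```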
BibTeX

Manages bibliographic data and references within
documents

TeX
 Generalized document-processing system
 Scientific, Mathematical and Technical Purposes
LaTeX
 Customized Version of TeX
 Freely available
BibTeX
 Subsystem of LaTeX
Refer



Similar to BibTeX
Designed by computer scientists for use by
scientific and technical researchers
Basis of EndNote

Bibliographic tool which augments Microsoft Word
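
To make the similarity between the two formats concrete, the sketch below renders one bibliographic record both as a BibTeX @book entry and as a Refer record (lines beginning with %A, %T, %I, %D). The record itself, the citation key, and the function names are hypothetical, and only a handful of fields are covered.

```python
def to_bibtex(key, record):
    """Render a book record as a BibTeX entry."""
    fields = [
        ("author", " and ".join(record["authors"])),  # BibTeX joins names with "and"
        ("title", record["title"]),
        ("publisher", record["publisher"]),
        ("year", record["year"]),
    ]
    body = ",\n".join("  %s = {%s}" % (name, value) for name, value in fields)
    return "@book{%s,\n%s\n}" % (key, body)


def to_refer(record):
    """Render the same record in Refer format, one tagged field per line."""
    lines = ["%A " + author for author in record["authors"]]  # one %A line per author
    lines.append("%T " + record["title"])
    lines.append("%I " + record["publisher"])
    lines.append("%D " + record["year"])
    return "\n".join(lines)


# Hypothetical record, invented for this sketch.
record = {
    "authors": ["A. Author", "B. Writer"],
    "title": "An Example Book",
    "publisher": "Example Press",
    "year": "2001",
}

bibtex_entry = to_bibtex("example01", record)
refer_entry = to_refer(record)
print(bibtex_entry)
print()
print(refer_entry)
```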
Metadata for Images and
Multimedia




Metadata is not confined to text
Most image files include data about resolution
PNG can store text strings
Image metadata is usually kept separate from the
image file
Metadata for Images and
Multimedia

Two Metadata Formats:

TIFF




Tagged Image File Format
Associates metadata with image files
Widespread use for over a decade
How images are stored in digital libraries



Normal images
Document images
MPEG-7



Multimedia Content Description Interface
Scheme to define and store metadata associated with any multimedia
information
General, extensible, and still being standardized
Extracting Metadata

Text Mining


Plain text documents


Require text comprehension skills
Computer techniques for text analysis


Automatic extraction of information from text
Good results in constrained domains
XML and other Structured Markup Languages



Make key aspects of documents available to computers and
people
Encoded information can easily be extracted by parsing the
document structure
Few documents contain explicitly encoded metadata
General Techniques

Extracting Document Metadata
 Title, Author, Publisher, Date, etc.
Generic Entities
 Email, URLs, Dates, Time, Money
Bibliography Entries
 Citation analysis
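
Generic entities are the easiest kind of metadata to extract, because many of them follow recognizable surface patterns. The sketch below uses deliberately simple regular expressions; the patterns, the sample sentence, and the function name are assumptions, and real text would need far more robust rules (for example, other date formats).

```python
import re

# Simplified patterns for illustration only.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "url": re.compile(r"https?://[^\s<>\"')]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),              # ISO style dates only
    "time": re.compile(r"\b(?:[01]?\d|2[0-3]):[0-5]\d\b"),      # 24-hour clock
    "money": re.compile(r"[$£€]\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def extract_entities(text):
    """Return a dict mapping each entity type to the matches found in text."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}


# Hypothetical sample text.
TEXT = ("Contact librarian@example.org by 2003-06-30 at 14:30; "
        "see http://www.example.org/dl for details. Fee: $1,250.00.")

entities = extract_entities(TEXT)
for name, matches in entities.items():
    print(name, matches)
```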
Key Phrase Metadata


Key-phrase metadata can successfully be
obtained automatically from documents
Two Different Approaches:
Key-Phrase Assignment
 Key-Phrase Extraction
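
Key-phrase extraction picks phrases out of the document text itself, whereas assignment chooses them from a controlled vocabulary. The sketch below is a deliberately naive extractor that ranks repeated word sequences by frequency. It is not the method used by any particular system; the stopword list, thresholds, and function names are assumptions, and practical extractors typically learn which candidates are good from training documents.

```python
import re
from collections import Counter

STOPWORDS = frozenset(
    "a an and are as by can for from in is it of on or that the to with".split()
)


def candidate_phrases(text, max_words=3):
    """Yield every run of 1..max_words words that neither starts nor ends
    with a stopword. Runs may cross sentence boundaries in this sketch."""
    words = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            gram = words[i:i + n]
            if gram[0] in STOPWORDS or gram[-1] in STOPWORDS:
                continue
            yield " ".join(gram)


def top_phrases(text, k=5, min_count=2):
    """Return up to k phrases seen at least min_count times, most frequent
    first; ties favour longer phrases, then alphabetical order."""
    counts = Counter(candidate_phrases(text))
    ranked = sorted(
        ((phrase, count) for phrase, count in counts.items() if count >= min_count),
        key=lambda pc: (-pc[1], -len(pc[0].split()), pc[0]),
    )
    return [phrase for phrase, _ in ranked[:k]]


# Hypothetical sample text.
TEXT = ("Digital library metadata helps readers. Metadata in a digital library "
        "can be extracted. A digital library needs metadata.")

phrases = top_phrases(TEXT)
print(phrases)
```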

Generating Phrase Hierarchies



Key phrases consist of a few well-chosen words
that characterize the document
It is useful to extract a structure that contains
ALL the phrases in the documents
Hierarchical structure of phrases can support
browsing around a digital library collection