CIS 702 Communication/Information Technologies (CIT)
Teaching Session #9
Chapter 6
Documents: Language & Properties
Philip Robbins – March 7, 2013
Dr. Luz Quiroga, Ph.D.
Communication & Information Sciences Ph.D. Program
University of Hawai'i at Mānoa
1
Documents: Language & Properties
Chapter Contents
• Metadata
• Document Formats
• Markup Languages
• Text Properties
• Document Preprocessing
• Organizing Documents
• Text Compression
2
Introduction
Document
• Denotes a single unit of information
• Structure and a Syntax
• Semantics, specified by the author
• Presentation style
3
Introduction
Document Syntax
• Expresses structure, presentation style, semantics
• Implicit in its content
• Expressed in a simple declarative language
• Expressed in a programming language
Text
• Can be written in natural language (Hard to process)
5
Introduction
Document Style
• How a document is visualized or printed
• Can be embedded in the document, e.g. RTF files
• Can be complemented by macros
6
Introduction
Queries
• Short pieces of text
• Differ from normal text
• Semantics often ambiguous due to polysemy
• User intent behind a query is not easy to infer
7
Metadata
Metadata
• Data about data
• Information on the organization of the data, various data
domains, and their relationship
• Metadata is associated with most documents
8
Metadata
Descriptive Metadata
• External to the meaning of the document; pertains more to
how it was created.
• Author of the text
• Date of publication
• Source of the publication
• Document length
9
Metadata
Semantic Metadata
• Characterizes the subject matter within the document contents
• Associated with a large number of documents
• Availability is increasing
10
Metadata
Metadata Format
• Machine Readable Cataloging Record (MARC)
• Format used for most library records
• Includes fields for distinct attributes of a bibliographic entry
such as: title, author, publication venue.
11
Metadata
Metadata in Web Documents
• Increase in web data has led to adding metadata information
to web pages.
• Cataloging and content rating
• Intellectual property rights and digital signatures
• Electronic Commerce
12
Metadata
Resource Description Framework (RDF)
• New standard for Web metadata
• Allows describing Web resources to facilitate automated
processing.
• Does not assume any particular application or semantic
domain.
• Consists of a description of nodes and attached attribute/value
pairs.
13
Text
Text
• Computers represent characters in binary, which is done
through coding schemes:
• EBCDIC (8 bits)
• ASCII (7 bits)
• Unicode (originally 16 bits; now encoded as UTF-8, UTF-16, or UTF-32)
• IR systems should be able to retrieve information from many
text formats (doc, pdf, html, txt)
• IR systems have filters to handle most documents (might not
be possible with proprietary formats)
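As a rough illustration of the coding schemes above, the sketch below encodes a short string under several of them using Python's built-in codecs. The sample strings are arbitrary, and `cp037` is just one EBCDIC code page chosen for the example.

```python
# Sketch: the same text becomes different byte sequences under different codes.
plain = "Manoa"   # only characters that exist in ASCII
accented = "Mānoa"  # contains U+0101, which ASCII cannot represent

ascii_bytes = plain.encode("ascii")        # 7-bit code: every byte is below 128
ebcdic_bytes = plain.encode("cp037")       # cp037: one 8-bit EBCDIC code page
utf16_bytes = accented.encode("utf-16-be")  # 16-bit code units, big-endian, no BOM
utf8_bytes = accented.encode("utf-8")      # variable width, 1 to 4 bytes per character

print(ascii_bytes.hex(), ebcdic_bytes.hex())
print(len(utf16_bytes), len(utf8_bytes))
```

An IR system has to know (or guess) which scheme a file uses before it can turn the bytes back into characters and index them.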
14
Text
Text Formats
• For document exchange: Rich Text Format (RTF)
• For printing and displaying: Portable Document Format (PDF)
• For printing and displaying: Postscript (PS)
15
Text
Interchange Formats
• For encoding email: Multipurpose Internet Mail Extensions
(MIME)
• For compressing text: ZIP
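A small sketch of lossless text compression follows. It uses Python's `zlib` module, which implements DEFLATE, the method typically used inside ZIP archives; it does not produce an actual `.zip` file, and the sample text is invented for the example.

```python
import zlib

# Highly repetitive text compresses well because of its redundancy.
text = ("information retrieval " * 50).encode("utf-8")

compressed = zlib.compress(text, 9)    # 9 = highest compression level
restored = zlib.decompress(compressed)  # lossless: we get the exact bytes back

print(len(text), "bytes ->", len(compressed), "bytes")
```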
16
Multimedia
Multimedia
• For applications that handle different types of data:
• Text
• Sounds
• Images
• Video
• Different types of formats are necessary for storing each medium
17
Images
Image Formats
• Simplest image formats are direct representations of a bitmapped display: XBM, BMP, PCX
• These formats have lots of redundancy and can be compressed
efficiently: GIF
18
Images
Lossy Compression
• To improve compression ratios.
• Uncompressing a compressed image does not yield exactly the
original image.
• Joint Photographic Experts Group (JPEG)
• Eliminates parts of the image that have less impact on the
human eye.
• Parametric format – loss can be tuned.
19
Images
Interchange Formats for Images
• Tagged Image File Format (TIFF)
• Provides for metadata, compression, and varying number of
colors.
• De facto standard for images on the Web:
• Portable Network Graphics (PNG)
20
Audio
Audio Formats
• Audio is digitized
• MIDI is the standard format to interchange music between
electronic instruments and computers.
• AU, WAVE
21
Movies
Movie Formats
• Works by coding changes in consecutive frames
• Takes advantage of temporal image redundancy
• Includes audio signal associated with the video
• Audio: MP3, Video: MP4
• AVI, FLI, Quicktime
22
Graphics
Format for 3-D Graphics
• Computer Graphics Metafile (CGM)
• Virtual Reality Modeling Language (VRML)
• VRML is the universal interchange format for 3-D graphics
and multimedia.
23
Markup
Markup Languages
• Defined as extra syntax used to describe formatting actions,
structure information, text semantics, attributes
• XML:
eXtensible Markup Language
• HTML:
Hyper Text Markup Language
• SGML:
Standard Generalized Markup Language
24
Markup
Standard Generalized Markup Language (SGML)
• ISO 8879
• Meta-language for tagging text
• Provides rules for defining a markup language based on tags
• Includes a description of the document structure: “document
type definition”
• SGML document defined by: document type definition with
the text itself marked with tags describing the structure
25
Markup
SGML Document Type Definition
• Describes the pieces that a document is composed of
• Defines how those pieces relate to each other
• Part of the definition can be specified by an SGML
Document Type Declaration (DTD)
• Other parts (e.g. semantics of elements & attributes) cannot
be expressed formally in SGML
26
Markup
SGML
• Tags are denoted by angle brackets < >
• Used to identify the beginning and ending of an element
• Ending tags include a slash before the tag name
• Attributes are specified inside the beginning tag
29
Markup
SGML
• Document description does not specify how a document is
printed
• Output specifications are added to SGML documents:
• DSSSL: Document Style Semantics and Specification Language
• FOSI: Formatting Output Specification Instance
• These standards define mechanisms for associating style
information with SGML document instances
• They allow specifying, for example, that data identified by a tag
should be typeset in a particular font
30
Markup
HyperText Markup Language (HTML)
• Instance of SGML
• Created in 1992
• Latest Version is 4.0 (HTML5 under development)
• Includes support for style sheets, frames, tables, forms, etc.
• Backwards compatible
• Most documents on the Web are stored and transmitted in
HTML
• HTML tags follow all SGML conventions and include
formatting directives.
31
Markup
HyperText Markup Language (HTML)
• Can have media embedded within, such as images or audio
• Has fields for metadata
• Adding programs (e.g. JavaScript) inside a webpage makes it
dynamic (hence dynamic HTML).
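To make the "fields for metadata" point concrete, here is a minimal sketch that pulls `<meta name=... content=...>` pairs out of a page with Python's standard `html.parser`. The class name `MetaCollector` and the sample page are made up for the illustration.

```python
from html.parser import HTMLParser

class MetaCollector(HTMLParser):
    """Collect name/content pairs from <meta> tags (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names before calling this.
        if tag == "meta":
            fields = dict(attrs)
            if "name" in fields and "content" in fields:
                self.meta[fields["name"]] = fields["content"]

# Invented sample page, used only to exercise the collector.
page = (
    '<html><head>'
    '<meta name="author" content="Jane Doe">'
    '<meta name="keywords" content="IR, metadata">'
    '</head><body><p>Hello</p></body></html>'
)

collector = MetaCollector()
collector.feed(page)
print(collector.meta)
```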
32
Markup
Cascading Style Sheets (CSS)
• Because HTML does not fix a presentation style, CSS was
introduced.
• 1997
• Way for authors to improve the aesthetics of HTML pages
• Information about presentation is separate from document
content
• Support for CSS in current browsers is still modest
35
Markup
eXtensible Markup Language (XML)
• Is a simplified subset of SGML
• Not a markup language (like HTML) but a meta-language
(like SGML)
• Allows human-readable semantic markup, which is also
machine-readable
• Does not have the restrictions of HTML
• Allows any user to define new tags
• More rigid rules on syntax:
• Ending tags cannot be omitted
• Distinguishes upper and lower case
• Attribute values must be in quotes
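The stricter syntax rules can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the helper name `is_well_formed` and the sample fragments are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

samples = {
    "ok":             '<doc><p lang="en">hi</p></doc>',
    "missing end":    '<doc><p>hi</doc>',             # ending tag omitted
    "case mismatch":  '<doc><P>hi</p></doc>',         # <P> and </p> are different names
    "unquoted value": '<doc><p lang=en>hi</p></doc>',  # attribute value not in quotes
}

for label, fragment in samples.items():
    print(label, is_well_formed(fragment))
```

Only the first fragment parses; each of the others breaks one of the three rules listed above.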
36
Markup
eXtensible Style Sheet Language (XSL)
• The XML counterpart of Cascading Style Sheets (CSS)
• Syntax based on XML
• Designed to transform and style highly-structured, data-rich
documents written in XML
• e.g. with XSL it would be possible to automatically extract a
table of contents from a document
37
Markup
Hypermedia/Time-based Structuring Language (HyTime)
• SGML architecture that specifies the generic hypermedia
structure of documents
• Includes complex locating of document objects
• Includes relationships (hyperlinks) between document
objects
• Includes numeric, measured associations between document
objects
• Does not specify graphical interfaces, user navigation or user
interaction.
38
Theory
Information Theory
• It is difficult to formally capture how much information there
is in a given text
• However, distribution of symbols is related to it
• A text where one symbol appears almost all the time does not
convey much information
• Information Theory defines a special concept, entropy, to
capture information content
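As a sketch of the idea, the function below computes the standard Shannon entropy, E = -Σ p_i log2 p_i, where p_i is the relative frequency of symbol i in the text. Estimating p_i from the text itself is an assumption of this example; other text models are possible.

```python
import math
from collections import Counter

def entropy(text):
    """Shannon entropy in bits per symbol, using observed symbol frequencies."""
    if not text:
        return 0.0
    counts = Counter(text)
    n = len(text)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(entropy("aaaaaaab"))  # one symbol dominates: low entropy
print(entropy("abcdabcd"))  # four equally likely symbols: 2 bits per symbol
```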
39
Theory
Entropy
40
Theory
Modeling Natural Language
• We can divide the symbols of a text into two disjoint subsets:
• Symbols that separate words;
• Symbols that belong to words;
• Symbols are not uniformly distributed in a text
• e.g. in English the vowels are usually more frequent than
most consonants.
42
Theory
Modeling Natural Language
• A simple model to generate text is the binomial model
• However, the probability of a symbol depends on the previous symbols
• e.g. in English, f cannot appear after the letter c
• A finite-context or Markovian model can be used to reflect
this dependency.
• Second issue: how the different words are distributed
inside each document.
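A first-order Markovian model of this kind can be sketched by counting adjacent symbol pairs and normalizing. The function name `bigram_model` and the sample sentence are invented for the illustration; a real model would be estimated from a large corpus.

```python
from collections import Counter, defaultdict

def bigram_model(text):
    """Estimate P(next symbol | previous symbol) from adjacent pairs in text."""
    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(text, text[1:]):
        pair_counts[prev][nxt] += 1
    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {sym: count / total for sym, count in counter.items()}
    return model

model = bigram_model("the cat chased the child")
print(model["c"])  # distribution of the symbols that follow "c" in the sample
```

In this sample, "c" is followed only by "a" or "h", so any other symbol after "c" gets probability zero under the model.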
43
Theory
Zipf’s Law
44
Theory
Modeling Natural Language
• Words arranged in decreasing order of their frequencies
• Distribution of words is very skewed
• Words that are too frequent (“stopwords”) can be
disregarded.
• Stopword is a word which does not carry meaning in natural
language
• e.g. stopwords in English: a, the, by, and
• Therefore, roughly half of the word occurrences in a text do not
need to be considered
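The skew can be seen even on a tiny text. The sketch below ranks words by decreasing frequency and measures what share of all word occurrences are stopwords. The sample sentence is invented, and the stopword set is only the four examples listed above, not a realistic list.

```python
import re
from collections import Counter

STOPWORDS = {"a", "the", "by", "and"}  # the four examples above; real lists are longer

def ranked_frequencies(text):
    """Words with their counts, in decreasing order of frequency."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(words).most_common()

def stopword_share(text):
    """Fraction of word occurrences that are stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in STOPWORDS for w in words) / len(words)

sample = "the cat and the dog sat by the door and a mat"
ranked = ranked_frequencies(sample)
share = stopword_share(sample)
print(ranked[:3], round(share, 2))
```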
47
Theory
Modeling Natural Language
• Third Issue: Distribution of words in the documents of a
collection.
• Simple Model: Consider that each word appears the same
number of times in every document (Not True)
• Better Model: Use a binomial distribution
48
Theory
Heaps’ Law
• Fourth Issue: Number of distinct words in a document
(document vocabulary)
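• To predict the growth of vocabulary size in natural language
text: V = K·n^β, where n is the text size in words and K and β
(0 < β < 1) are constants that depend on the text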
49
Theory
Modeling Natural Language
• Vocabulary size grows sub-linearly with text size
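Heaps' law is usually written V = K·n^β with 0 < β < 1. The sketch below tracks how the vocabulary grows as tokens are read and evaluates the formula for given constants. The default values of `k` and `beta` are placeholders for the example, not measured constants.

```python
def vocabulary_growth(words):
    """Number of distinct words seen after each token of the input."""
    seen = set()
    sizes = []
    for word in words:
        seen.add(word)
        sizes.append(len(seen))
    return sizes

def heaps_estimate(n, k=10.0, beta=0.5):
    """Predicted vocabulary size for a text of n words (k, beta are assumed)."""
    return k * n ** beta

print(vocabulary_growth("a b a c b d".split()))
print(heaps_estimate(10_000))
```

Fitting `k` and `beta` to the observed growth curve of a real collection is what makes the prediction useful; the defaults here only show the sub-linear shape.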
50
Theory
Modeling Natural Language
• The set of different words of a language is bounded by a
constant.
• However, the limit is so high that it is common to assume the
size of the vocabulary is O(n^β) instead of O(1)
• Many argue that the number keeps growing anyway because
of typing and spelling errors.
• As the total text size grows, the predictions of the model
become more accurate.
51
Theory
Text Similarity
• Similarity is measured by a distance function
• Hamming distance: For strings of the same length, distance
between them is the number of positions with different
characters (distance is 0 if equal).
• A distance function should be symmetric and satisfy the
triangle inequality: d(a, c) ≤ d(a, b) + d(b, c)
52
Theory
Text Similarity
• Levenshtein “edit” distance: the minimal number of character
insertions, deletions, and substitutions needed to make two
strings equal.
• Edit distance between color and colour is 1
• Edit distance between survey and surgery is 2
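Both distances are short to implement. The sketch below is one common dynamic-programming formulation of edit distance that keeps only two rows of the table; it is an illustration, not the only way to compute it.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimal number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep a match
            ))
        prev = curr
    return prev[-1]

print(levenshtein("color", "colour"), levenshtein("survey", "surgery"))
```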
53
Theory
Text Similarity
• Longest Common Subsequence (LCS):
• Remove all non-common characters of two (or more) strings
• The remaining sequence of characters is the LCS of both strings
• The LCS of survey and surgery is surey.
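A minimal sketch of the LCS for two strings, using the usual dynamic-programming table followed by a backtracking pass to recover the subsequence itself. When several subsequences share the maximal length, this version returns one of them.

```python
def lcs(a, b):
    """Return one longest common subsequence of strings a and b."""
    m, n = len(a), len(b)
    # table[i][j] = LCS length of a[:i] and b[:j]
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])
    # Walk back from the bottom-right corner to collect the characters.
    out = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))

print(lcs("survey", "surgery"))
```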
54
Theory
Text Similarity
• Similarity can be extended to documents
• Compute the longest common sequence of lines between two files
• ‘diff’ command in Unix
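Python's standard `difflib` offers a similar line-level comparison. Note that its `SequenceMatcher` uses its own matching heuristic rather than a strict LCS, so this is only an approximation of what `diff` does; the two "files" are invented lists of lines.

```python
import difflib

# Two hypothetical versions of a file, already split into lines.
old = ["alpha", "beta", "gamma"]
new = ["alpha", "gamma", "delta"]

# Lines that the matcher considers common to both versions, in order.
matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
common = [
    old[start + k]
    for start, _, size in matcher.get_matching_blocks()
    for k in range(size)
]

# A unified diff, similar in spirit to the output of `diff -u`.
diff_lines = list(difflib.unified_diff(old, new, lineterm=""))

print(common)
print("\n".join(diff_lines))
```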
55
Theory
Resemblance Measure
56
Model
Document Preprocessing Operations
• Lexical analysis of the text
• Elimination of stopwords
• Stemming of the remaining words
• Selection of index terms or keywords
• Construction of term categorization structures (thesaurus)
58
Model
Logical View of a Document
59
Document Preprocessing
Lexical Analysis
• Process of converting stream of chars into stream of words
• Major Objective: Identify words in the text
• Word Separators:
- Space: most common separator
- Numbers: inherently vague, context required
- Hyphens: break up hyphenated words
- Punctuation marks
- Case of letters: A vs. a
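A minimal tokenizer along these lines might look as follows. The specific choices (lowercase everything, split hyphenated words, drop digits and punctuation) are assumptions made for the sketch; a real system would decide each case from context.

```python
import re

def tokenize(text):
    """Turn a stream of characters into a list of lowercase word tokens."""
    text = text.lower()                 # case folding: "A" and "a" become one term
    text = text.replace("-", " ")       # break up hyphenated words
    return re.findall(r"[a-z]+", text)  # keep letter runs; digits and punctuation drop out

print(tokenize("State-of-the-art IR systems, circa 2013!"))
```

Dropping every number is the bluntest possible policy; dates or product codes may be exactly what a user is searching for.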
60
Document Preprocessing
Elimination of Stopwords
• Words that appear too frequently
• Usually, not good discriminators
• Filtered out as potential index terms
• Reduces size of index by 40% or more
• At expense of reducing recall: not able to retrieve documents
that contain “to be or not to be”
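The recall problem is easy to reproduce. In the sketch below, the stopword list is a small hand-picked set for the example, not a standard list.

```python
# Hypothetical stopword list, kept short for the illustration.
STOPWORDS = frozenset({"a", "and", "be", "by", "not", "or", "the", "to"})

def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]

print(remove_stopwords("the quick brown fox".split()))
print(remove_stopwords("to be or not to be".split()))  # nothing survives
```

Because every word of the phrase is a stopword, no index term is left to match it against.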
61
Document Preprocessing
Stemming
• Stem: portion of word left after removal of prefixes/suffixes
• User specifies a query word, but only a variant of it is present in
a relevant document
• This is partially solved by the adoption of stems
• Stemming reduces size of the index
• Controversial
• Many search engines do not adopt any stemming
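To show the effect, here is a deliberately naive suffix stripper. It is not the Porter algorithm or any other published stemmer; the suffix list and the minimum stem length of three are arbitrary choices for the sketch.

```python
SUFFIXES = ("ing", "ed", "es", "s")  # hypothetical list, checked in this order

def naive_stem(word):
    """Strip the first matching suffix, if at least three letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

variants = ["connect", "connected", "connecting", "connects"]
print({w: naive_stem(w) for w in variants})
```

All four variants map to the same index entry, which is how stemming both shrinks the index and lets a query match variant forms. Real stemmers need many more rules to avoid conflating unrelated words.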
62
Document Preprocessing
Keyword Selection
• Full text representation: all words in the text are used as index
terms (or keywords).
• Alternative to full text representation:
– Not all words in text used as index terms
– Use just nouns as index terms
– Group nouns that appear nearby in text into a single indexing
component (a concept)
63
Document Preprocessing
Thesaurus
• Used as reference to a treasury of words.
• Precompiled list of important words in a knowledge domain
• For each word in this list, a set of related words derived from
a synonymy relationship
64
Document Preprocessing
Thesaurus
• Query formulation process (for IR):
– User forms a query
– Query terms might be erroneous and improper
– Solution: reformulate the original query
– Usually, this implies expanding original query with related
terms
– Thus, it is natural to use a thesaurus for finding related terms
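Query expansion with a thesaurus can be sketched as a dictionary lookup. The thesaurus entries below are invented for the example; a real one would be compiled for a specific knowledge domain.

```python
# Hypothetical thesaurus: each word maps to related terms (synonyms).
THESAURUS = {
    "car": ["automobile", "vehicle"],
    "film": ["movie"],
}

def expand_query(terms):
    """Add related terms after each query term, without duplicates."""
    expanded = []
    for term in terms:
        for candidate in [term, *THESAURUS.get(term, [])]:
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded

print(expand_query(["car", "rental"]))
```

Terms with no thesaurus entry, such as "rental" here, pass through unchanged.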
66
Taxonomies
67
Folksonomies
Folksonomy
• Collaborative flat vocabulary
• Terms are selected by a population of users
• Each term is called a tag
68
Folksonomies
69
References
• Baeza-Yates & Ribeiro-Neto, Modern Information Retrieval, 2nd Edition. Chapter 6,
Documents: Languages & Properties. Retrieved from
http://grupoweb.upf.es/WRG/mir2ed/pdf/slides_chap06.pdf
70
Questions?
probbins@hawaii.edu
www2.hawaii.edu/~probbins
71