IST 511 Information Management: Information
and Technology
Text Processing and Information Retrieval
Dr. C. Lee Giles
David Reese Professor, College of Information Sciences
and Technology
Professor of Computer Science and Engineering
Professor of Supply Chain and Information Systems
The Pennsylvania State University, University Park, PA,
USA
giles@ist.psu.edu
http://clgiles.ist.psu.edu
Special thanks to C. Manning, P. Raghavan, H. Schutze, M. Hearst
Last Time
What are networks
– Definitions
– Theories
– Social networks
Why do we care
Impact on information science
Today
What is text
– Definitions
– Why important
– What is text processing
What is information retrieval?
– Definition of a document
– Text and documents as vectors
– Relevance (performance) – recall/precision
– Inverted index
– Similarity/ranking
– ACM SIGIR (list serve)
Impact on information science
– Foundations of SEARCH ENGINES
Tomorrow
Topics used in IST
Machine learning
Linked information and search
Encryption
Probabilistic reasoning
Digital libraries
Others?
Theories in Information Sciences
We will enumerate some of these theories in this course.
Issues:
– Unified theory?
– Domain of applicability
– Conflicts
Theories here are
– Mostly algorithmic
– Some quite qualitative (qualitative vs algorithmic?)
Quality of theories
– Occam’s razor
– Subsumption of other theories (any subsumptions so far?)
Theories of text and information retrieval
– Theories are mostly representational
What is information retrieval (IR)?
Automatic document retrieval, usually based on a
query
Involves:
• Documents
• Document parsing
• Vector model (theory!)
• Inverted index
• Query processing
• Similarity/Ranking
• Evaluation
History of Information Retrieval
Ancient libraries to the web
Always a part of information sciences
Early ideas of IR/search
Vannevar Bush - Memex - 1945
"A memex is a device in which an individual stores all his books,
records, and communications, and which is mechanized so that it
may be consulted with exceeding speed and flexibility. It is an
enlarged intimate supplement to his memory.”
Bush seems to understand that computers won’t just store information as
a product; they will transform the process people use to produce and use
information.
Some IR History
– Roots in the scientific “Information Explosion” following
WWII
– Interest in computer-based IR from mid 1950’s
• Key word indexing – H.P. Luhn at IBM (1958)
• Probabilistic models at Rand (Maron & Kuhns) (1960)
• Boolean system development at Lockheed (’60s)
• Medlars (’60s)
• Vector Space Model (Salton at Cornell, 1965)
• Statistical weighting methods and theoretical advances (’70s)
• Lexis Nexis (’70s)
• Refinements and advances in application (’80s)
• User interfaces, large-scale testing and application (’90s)
– Then came the web and search engines and everything
changed
– More History
IR and Search Engines
Search engines are an IR application (instantiation).
Search engines have become the most popular IR
tools.
Why?
A Typical Web Search Engine
(Diagram: Web → Crawler → Indexer → Index → Query Engine → Interface → Users)
A Typical Information Retrieval System
(Diagram: Document Collection → Indexer → Index → Query Engine → Interface → Users)
A Typical Information Retrieval System
(Diagram: Document Collection → Tokenization → Indexer → Index → Query Engine → Interface → Users)
What is Text?
Text is so common that we often ignore its importance
What is text?
– Strings of characters (alphabets, ideograms, ascii, unicode, etc.)
• Words
• Punctuation: . , : ; - ( ) _
• Numbers: 1 2 3 3.1415 10^10
• Formulas: f = ma, H2O
• Tables
• Figures
– Anything that is not an image, etc.
– Why is text important?
• Text is language capture
– an instantiation of language, culture, science, etc.
Corpora?
A collection of texts, especially if complete and self-contained; e.g., the corpus of
Anglo-Saxon verse
In linguistics and lexicography, a body of texts, utterances or other specimens
considered more or less representative of a language and usually stored as
an electronic database (The Oxford Companion to the English Language
1992)
A collection of naturally occurring language text chosen to characterize a state or
variety of a language (John Sinclair Corpus Concordance Collocation
OUP 1991)
Types:
– Written vs Spoken
– General vs Specialized
• e.g. ESP, Learner corpora
– Monolingual vs Multilingual
• e.g. Parallel, Comparable
– Synchronic vs Diachronic
• At a particular point in time or over time
– Annotated vs Unannotated
Written corpora

                      Brown                      LOB
Time of compilation   1960s                      1970s
Compiled at           Brown University (US)      Lancaster, Oslo, Bergen
Language variety      Written American English   Written British English
Size                  1 million words (500 texts of 2,000 words each)
Design                Balanced corpora; 15 genres of text, incl. press reportage,
                      editorials, reviews, religion, government documents,
                      reports, biographies, scientific writing, fiction
Focus on documents
• Fundamental unit to be indexed, queried and retrieved.
• Deciding what is an individual document can vary depending on
problem
• Documents are units usually consisting of text
• Text is turned into tokens or terms
• Terms or tokens are words or roots of words, semantic
units or phrases which are the atoms of indexing.
• Repositories (databases) are collections of documents.
• A query is a request for documents on a topic
Definitions
Collections consist of Documents
Document
– The basic unit which we will automatically index
• usually a body of text which is a sequence of terms
– has to be digital
Tokens or terms
– Basic units of a document, usually consisting of text
• semantic word or phrase, numbers, dates, etc
Collections or repositories
– particular collections of documents
– sometimes called a database
Query
– request for documents on a topic
Collection vs documents vs terms
(Diagram: a collection contains documents; each document contains terms or tokens)
What is a Document?
A document is a digital object with an operational definition
– Much of what you deal with daily!
– Indexable
– Can be queried and retrieved.
Many types of documents
– Text or part of text
– Image
– Audio
– Video
– Data
– Email
– Etc.
Text Processing & Tokenization
Standard Steps:
• Recognize document structure
• titles, sections, paragraphs, etc.
• Tokenization
• Break into tokens – type of markup
• Tokens are delimited text; ex:
– Hello, how are you.
– _hello_,_how_are_you_._
• usually space- and punctuation-delimited
• special issues with Asian languages
– Stemming/morphological analysis
– Store in inverted index
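As a concrete illustration of the tokenization step above, here is a minimal Python sketch (not from the original slides) that splits on whitespace and punctuation and lowercases; real tokenizers also handle markup, Unicode, and language-specific rules:

```python
import re

def tokenize(text):
    # Split on any run of non-word characters and lowercase.
    # Deliberately simplified: hyphens, apostrophes, numbers, and
    # non-Latin scripts all need more careful treatment (see below).
    return [tok for tok in re.split(r"\W+", text.lower()) if tok]

print(tokenize("Hello, how are you."))   # ['hello', 'how', 'are', 'you']
```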
Basic indexing pipeline

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2, 4
                          roman      → 1, 2
                          countryman → 13, 16
Parsing a document
What format is it in?
– pdf/word/excel/html?
What language is it in?
What character set is in use?
Each of these is a classification problem
which can be solved using heuristics or ML
methods.
But there are complications …
Format/language stripping
Documents being indexed can include docs from many different
languages
– A single index may have to contain terms of several languages.
Sometimes a document or its components can contain multiple
languages/formats
– French email with a Portuguese pdf attachment.
What is a unit document?
– An email?
– With attachments?
– An email with a zip containing documents?
Tokenization
Parsing (chopping up) the document into basic units that are
candidates for later indexing
What to use and what not
Issues with
– Punctuation
– Numbers
– Special characters
– Equations
– Formulas
– Languages
– Normalization (often by stemming)
Tokenization
Input: “Friends, Romans and Countrymen”
Output: Tokens
– Friends
– Romans
– Countrymen
Each such token is now a candidate for an index entry, after
further processing
But what are valid tokens to emit?
Tokenization
Issues in tokenization:
– Finland’s capital → Finland? Finlands? Finland’s?
– Hewlett-Packard → Hewlett and Packard as two tokens?
• State-of-the-art: break up hyphenated sequence.
• co-education?
• the hold-him-back-and-drag-him-away-maneuver?
– San Francisco: one token or two? How do you decide it is one
token?
Numbers
3/12/91
Mar. 12, 1991
55 B.C.
B-52
My PGP key is 324a3df234cb23e
100.2.86.144
– Generally, don’t index as text.
– Will often index “meta-data” separately
• Creation date, format, etc.
Tokenization: Language issues
L'ensemble → one token or two?
– L? L’? Le?
– Want ensemble to match with un ensemble
German noun compounds are not segmented
– Lebensversicherungsgesellschaftsangestellter
– ‘life insurance company employee’
Tokenization: language issues
Chinese and Japanese have no spaces between words:
– Not always guaranteed a unique tokenization
Further complicated in Japanese, with multiple alphabets
intermingled
– Dates/amounts in multiple formats
フォーチュン500社は情報不足のため時間あた$500K(約6,000万円)
(The example mixes Katakana, Hiragana, Kanji, and “Romaji”.)
End-user can express query entirely in hiragana!
Tokenization: language issues
Arabic (or Hebrew) is basically written right to left, but with certain items
like numbers written left to right
Words are separated, but letter forms within a word form complex
ligatures
استقلت الجزائر في سنة 1962 بعد 132 عاما من الاحتلال الفرنسي.
(Reading starts at the right; the numbers 1962 and 132 are read left to right.)
‘Algeria achieved its independence in 1962 after 132 years of French
occupation.’
With Unicode, the surface presentation is complex, but the stored form is
straightforward
Normalization
Need to “normalize” terms in indexed text as well as query terms into the
same form
– We want to match U.S.A. and USA
We most commonly implicitly define equivalence classes of terms
– e.g., by deleting periods in a term
Alternative is to do limited expansion:
– Enter: window
Search: window, windows
– Enter: windows
Search: Windows, windows
– Enter: Windows
Search: Windows
Potentially more powerful, but less efficient
Linguistic Modules
Stemming and Morphological Analysis
Goal: “normalize” similar words
Morphology (“form” of words)
– Inflectional Morphology
• E.g., inflect verb endings and noun number
• Never change grammatical class
– dog, dogs
– Derivational Morphology
• Derive one word from another,
• Often change grammatical class
– build, building; health, healthy
Lemmatization
Reduce inflectional/variant forms to base form
E.g.,
– am, are, is → be
– car, cars, car's, cars' → car
– the boy's cars are different colors → the boy car be different
color
Lemmatization implies doing “proper” reduction to dictionary
headword form
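A quick sketch of the stemming vs. lemmatization distinction, assuming NLTK and its WordNet data are installed (the outputs shown are what these tools typically produce):

```python
from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()          # rule-based suffix stripping
lemmatizer = WordNetLemmatizer()   # dictionary-based headword lookup

print(stemmer.stem("building"))          # 'build'  (suffix stripped)
print(lemmatizer.lemmatize("are", "v"))  # 'be'     (inflected verb -> headword)
print(lemmatizer.lemmatize("cars"))      # 'car'
```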
Documents / bag of words
A textual document is a digital object
consisting of a sequence of words and
other symbols, e.g., punctuation.
Take out the sequence and one has a
“bag of words”
The individual words and other symbols
are known as tokens or terms.
A textual document can be:
• Free text, also known as
unstructured text, which is a
continuous sequence of tokens.
• Fielded text, also known as
structured text, in which the text is
broken into sections that are
distinguished by tags or other markup.
Word Frequency
Observation: Some words are more common than others.
Statistics: Most large collections of text documents have
similar statistical characteristics. These statistics:
• influence the effectiveness and efficiency of the data structures
used to index documents
• underlie many retrieval models
Treat each document as a bag of words
Word Frequency
Example
The following example is taken from:
Jamie Callan, Characteristics of Text, 1997
Sample of 19 million words
The next slide shows the 50 commonest words in rank order (r),
with their frequency (f).
word        f        word        f       word        f
the    1,130,021     from      96,900    or        54,958
of       547,311     he        94,585    about     53,713
to       516,635     million   93,515    market    52,110
a        464,736     year      90,104    they      51,359
in       390,819     its       86,774    this      50,933
and      387,703     be        85,588    would     50,828
that     204,351     was       83,398    you       49,281
for      199,340     company   83,070    which     48,273
is       152,483     an        76,974    bank      47,940
said     148,302     has       74,405    stock     47,401
it       134,323     are       74,097    trade     47,310
on       121,173     have      73,132    his       47,116
by       118,863     but       71,887    more      46,244
as       109,135     will      71,494    who       42,142
at       101,779     say       66,807    one       41,635
mr       101,679     new       64,456    their     40,910
with     101,210     share     63,925
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
  f is the frequency that w appears
  r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
(Figure: Zipf distribution of frequency f against rank r, shown on linear and log scales; a word w has rank r and frequency f.)
Zipf's Law
If the words, w, in a collection are ranked, r, by their frequency, f,
they roughly fit the relation:
r*f=c
or
r = c * 1/f
Different collections have different constants c.
In English text, c tends to be about n / 10, where n is the number of
word occurrences in the collection.
For a weird but wonderful discussion of this and many other
examples of naturally occurring rank frequency distributions, see:
Zipf, G. K., Human Behaviour and the Principle of Least Effort.
Addison-Wesley, 1949
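A small Python sketch (an illustration, not part of the original slides) that checks Zipf's law on any token list: rank words by frequency and see whether r*f stays roughly constant, near n/10 for English:

```python
from collections import Counter

def zipf_check(tokens, top=10):
    """Print rank r, frequency f, and r*f for the most common words.
    Under Zipf's law the r*f products are roughly constant (~ n/10
    for English text, where n is the total number of tokens)."""
    n = len(tokens)
    for r, (word, f) in enumerate(Counter(tokens).most_common(top), start=1):
        print(f"r={r:2d}  {word:12s} f={f:8d}  r*f={r * f:9d}  n/10={n // 10}")

# Usage (with any plain-text file): zipf_check(open("corpus.txt").read().lower().split())
```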
Methods that Build on Zipf's Law
The Important Points:
–a few elements occur very frequently
–a medium number of elements have medium frequency
–many elements occur very infrequently
Stop lists: Ignore the most frequent words (upper cut-off). Used
by almost all systems.
Significant words: Ignore the most frequent and least frequent
words (upper and lower cut-off). Rarely used.
Term weighting: Give differing weights to terms based on their
frequency, with the most frequent words weighted less. Used by
almost all ranking methods.
Definitions
Corpus: A collection of documents that are indexed and searched
together.
Word list: The set of all terms that are used in the index for a
given corpus (also known as a vocabulary file).
With full text searching, the word list is all the terms in the corpus,
with stop words removed. Related terms may be combined by
stemming.
Controlled vocabulary: A method of indexing where the word list
is fixed. Terms from it are selected to describe each document.
Keywords: A name for the terms in the word list, particularly with
controlled vocabulary.
Information Retrieval
Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually using computational methods).
Basic Assumptions
– Collection: Fixed set of documents
– Goal: Retrieve documents with information that is relevant to the
user’s information need and helps the user complete a task
IR vs. databases:
Structured vs unstructured data
Structured data tends to refer to information in “tables”
Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000
Typically allows numerical range and exact match
(for text) queries, e.g.,
Salary < 60000 AND Manager = Smith.
Unstructured data
Typically refers to free text
Allows
– Keyword queries including operators
– More sophisticated “concept” queries e.g.,
• find all web pages dealing with drug abuse
Classic model for searching text documents
Database and IR: How are they
different
Structured Data vs. unstructured data
Relevance managed by the user vs. Relevance managed by the
system (at least partially)
Relevance measure: binary vs. scaled
Unstructured (text) vs. structured
(database) data in 2009
Terms & tokens
Terms are what happens to text after tokenization
and linguistic processing.
– Examples
• Computer -> computer
• knowledge -> knowledg
• The -> the
• Removal of stop words?
TERM VECTOR SPACE
Term vector space
n-dimensional space, where n is the number of different
terms/tokens used to index a set of documents.
Vector
Document i, d_i, is represented by a vector. Its magnitude in
dimension j is w_ij, where:
  w_ij > 0   if term j occurs in document i
  w_ij = 0   otherwise
w_ij is the weight of term j in document i.
A Document Represented in a
3-Dimensional Term Vector Space
(Figure: document d1 plotted in the space spanned by terms t1, t2, t3, with components t11, t12, t13.)
Basic Method: Incidence Matrix
(Binary Weighting)

document   text                            terms
d1         ant ant bee                     ant bee
d2         dog bee dog hog dog ant dog     ant bee dog hog
d3         cat gnu dog eel fox             cat dog eel fox gnu

       ant  bee  cat  dog  eel  fox  gnu  hog
d1      1    1    0    0    0    0    0    0
d2      1    1    0    1    0    0    0    1
d3      0    0    1    1    1    1    1    0

3 vectors in an 8-dimensional term vector space
Weights: t_ij = 1 if document i contains term j and zero otherwise
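The same example, sketched in Python (an illustration added here, not from the slides): build the binary incidence matrix for d1–d3 over the 8-term vocabulary:

```python
docs = {
    "d1": "ant ant bee",
    "d2": "dog bee dog hog dog ant dog",
    "d3": "cat gnu dog eel fox",
}

# Vocabulary = all distinct terms, in sorted order (8 dimensions here).
vocab = sorted({t for text in docs.values() for t in text.split()})

# Binary weighting: t_ij = 1 if document i contains term j, else 0.
incidence = {d: [1 if t in set(text.split()) else 0 for t in vocab]
             for d, text in docs.items()}

print(vocab)             # ['ant', 'bee', 'cat', 'dog', 'eel', 'fox', 'gnu', 'hog']
print(incidence["d2"])   # [1, 1, 0, 1, 0, 0, 0, 1]
```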
Basic Vector Space Methods:
Similarity between 2 documents
The similarity between two documents is a function of the angle
between their vectors in the term vector space.
(Figure: documents d1 and d2 as vectors in the space of terms t1, t2, t3, separated by angle θ.)
Sec. 1.1
Example
Which plays of Shakespeare contain the words Brutus AND
Caesar but NOT Calpurnia?
One could SEARCH all of Shakespeare’s plays for Brutus and Caesar,
then strip out lines containing Calpurnia?
Why is that not the answer?
– Slow (for large corpora)
– NOT Calpurnia is non-trivial
– Ranked retrieval (best documents to return)
Sec. 1.1
Term-document incidence matrix
(vector model)

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony         1         1        0         0        0         1
Brutus         1         1        0         1        0         0
Caesar         1         1        0         1        1         1
Calpurnia      0         1        0         0        0         0
Cleopatra      1         0        0         0        0         0
mercy          1         0        1         1        1         1
worser         1         0        1         1        1         0

1 if play contains word, 0 otherwise
Query: Brutus AND Caesar BUT NOT Calpurnia
Sec. 1.1
Incidence vectors
So we have a 0/1 vector for each term.
To answer the query: take the vectors for Brutus, Caesar and
Calpurnia (complemented) → bitwise AND.
110100 AND 110111 AND 101111 = 100100.
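The same Boolean query, sketched with Python integers as 0/1 bit vectors (one bit per play, as in the incidence matrix above):

```python
# Bit vectors over the six plays, most significant bit = Antony and Cleopatra.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia: complement Calpurnia within 6 bits.
result = brutus & caesar & (~calpurnia & 0b111111)
print(f"{result:06b}")   # 100100 -> Antony and Cleopatra, Hamlet
```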
A Typical Information Retrieval System
(Diagram: Document Collection → Indexer → Index → Query Engine → Interface → Users)
How good are the retrieved documents
Use Relevance as a measure
How relevant is the document retrieved
– for the user’s information need.
Subjective, but assume it’s measurable
Measurable to some extent
– How often do people agree a document is relevant to a
query
• More often than expected
How well does it answer the question?
– Complete answer? Partial?
– Background Information?
– Hints for further exploration?
How do we measure relevance?
Measures:
– Binary measure
• 1 relevant
• 0 not relevant
– N-ary measure
• 3 very relevant
• 2 relevant
• 1 barely relevant
• 0 not relevant
– Negative values?
N=? consistency vs. expressiveness tradeoff
Sec. 1.1
Precision and recall
 Defined by relevance of the documents
 Precision: Fraction of retrieved docs that are relevant to
user’s information need
 Recall: Fraction of relevant docs in collection that are
retrieved
Precision and Recall

                 Relevant    Not Relevant
Retrieved        Hit         False Alarm
Not Retrieved    Miss        Correct Rejection
Precision and Recall (Cont’d)

                 Relevant    Not Relevant
Retrieved           a             b
Not Retrieved       c             d

P = a / (a + b)  =  Hits / (Hits + False Alarms)
R = a / (a + c)  =  Hits / (Hits + Misses)
Precision and Recall (Cont’d)

                 Relevant    Not Relevant    Total
Retrieved            4             6           10
Not Retrieved        1            89           90
Total                5            95          100

P = 4 / (4 + 6) = 0.4
R = 4 / (4 + 1) = 0.8
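The same computation as a small Python sketch (illustrative, not from the slides), treating retrieved and relevant as sets of document IDs:

```python
def precision_recall(retrieved, relevant):
    """Precision = hits / |retrieved|, Recall = hits / |relevant|."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

# Matches the table above: 10 retrieved, 5 relevant overall, 4 hits.
retrieved = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10}
relevant  = {1, 2, 3, 4, 11}
print(precision_recall(retrieved, relevant))   # (0.4, 0.8)
```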
Relevance
• An information system should give user relevant
information
• However, determining exactly what document is
relevant can get complex
– It depends on the content
• Is it on topic?
– It depends on the task
• Is it trustworthy?
– It depends on the user
• What have they already looked at?
• What languages do they understand?
• Is relevance all-or-none? Or, can there be partial
relevance?
• Also good for evaluation of IR system
How do we Measure Relevance?
• In order to evaluate the performance of information
retrieval systems, we need to measure relevance.
• Usually the best we can do is to have an expert make
ratings of the pertinence.
• User as expert or evaluator
Precision vs. Recall

Precision = | RelRetrieved | / | Retrieved |
Recall    = | RelRetrieved | / | Rel in Collection |

(Venn diagram: within the set of all docs, the retrieved set and the relevant set overlap in the relevant retrieved documents.)
Retrieved vs. Relevant Documents
(Figure) Very high precision, very low recall
Retrieved vs. Relevant Documents
(Figure) High recall, but low precision
Retrieved vs. Relevant Documents
(Figure) Very low precision, very low recall (0 for both)
Retrieved vs. Relevant Documents
(Figure) High precision, high recall (at last!)
Sec. 1.1
How big can the Term-Document
Matrix be?
Consider N = 1 million documents, each with about 1000
words.
Avg 6 bytes/word including spaces/punctuation
– 6GB of data in the documents.
Say there are M = 500K distinct terms among these. The
Term-Doc Matrix will be 500K x 1M , i.e. half-a-trillion
0’s and 1’s.
But it has no more than one billion 1’s.
– matrix is extremely sparse.
A Typical Information Retrieval System
(Diagram: Document Collection → Indexer → Index → Query Engine → Interface → Users)
Sec. 1.2
Inverted index
For each term t, we must store a list of all documents that
contain t.
– Identify each by a docID, a document serial number
We need variable-size postings lists
In memory, can use linked lists or variable length arrays
Dictionary     Postings (sorted by docID)
Brutus    →    1  2  4  11  31  45  173  174
Caesar    →    1  2  4  5   6   16  57   132
Calpurnia →    2  31 54 101

Each docID in a postings list is a posting.
Sec. 1.2
Inverted index construction

Documents to be indexed:  Friends, Romans, countrymen.
        ↓ Tokenizer
Token stream:             Friends  Romans  Countrymen
        ↓ Linguistic modules
Modified tokens:          friend  roman  countryman
        ↓ Indexer
Inverted index:           friend     → 2, 4
                          roman      → 1, 2
                          countryman → 13, 16
Sec. 1.2
Indexer steps: Token sequence
Sequence of (Modified token, Document ID) pairs.
Doc 1
I did enact Julius
Caesar I was killed
i' the Capitol;
Brutus killed me.
Doc 2
So let it be with
Caesar. The noble
Brutus hath told you
Caesar was ambitious
Sec. 1.2
Indexer steps: Sort
Sort by terms
– And then docID
Core indexing step
Sec. 1.2
Indexer steps: Dictionary & Postings
Multiple term entries in a
single document are
merged.
Split into Dictionary and
Postings
Doc. frequency
information is added.
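A minimal Python sketch of these indexer steps (tokenize, collect (term, docID) pairs, and merge into dictionary plus sorted postings); the tokenizer here is deliberately crude:

```python
import re
from collections import defaultdict

def build_inverted_index(docs):
    """docs: {docID: text}. Returns {term: sorted postings list of docIDs}."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in re.findall(r"\w+", text.lower()):
            index[term].add(doc_id)
    # Postings are kept sorted by docID, as the slides require.
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index({
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
})
print(index["brutus"])   # [1, 2]
print(index["capitol"])  # [1]
```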
Sec. 1.3
Query processing: AND
Consider processing the query:
Brutus AND Caesar
– Locate Brutus in the Dictionary;
• Retrieve its postings.
– Locate Caesar in the Dictionary;
• Retrieve its postings.
– “Merge” the two postings:
Brutus:  2 → 4 → 8 → 16 → 32 → 64 → 128
Caesar:  1 → 2 → 3 → 5 → 8 → 13 → 21 → 34
Result:  2 → 8
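The postings merge can be sketched in Python as a two-pointer walk over the sorted lists (linear in their combined length):

```python
def intersect(p1, p2):
    """AND of two postings lists sorted by docID."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i]); i += 1; j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```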
Sec. 1.3
Boolean queries: Exact match
The Boolean retrieval model is being able to ask a query that is a
Boolean expression:
– Boolean Queries are queries using AND, OR and NOT to join query terms
• Views each document as a set of words
• Is precise: document matches condition or not.
Good for expert users with a precise understanding of their needs and the collection.
Not good for the majority of users.
 Most users incapable of writing Boolean queries (or they are, but they
think it’s too much work).
 Most users don’t want to wade through 1000s of results.
Problem with Boolean search:
feast or famine
Ch. 6
Boolean queries often result in either too few (=0) or too
many (1000s) results.
Query 1: “standard user dlink 650” → 200,000 hits
Query 2: “standard user dlink 650 no card found”: 0 hits
It takes a lot of skill to come up with a query that produces a
manageable number of hits.
– AND gives too few; OR gives too many
Ranked retrieval models
Rather than a set of documents satisfying a query expression,
in ranked retrieval models, the system returns an ordering
over the (top) documents in the collection with respect to a
query
Free text queries: Rather than a query language of operators
and expressions, the user’s query is just one or more words
in a human language
Sec. 6.2
Binary term-document incidence
matrix

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony         1         1        0         0        0         1
Brutus         1         1        0         1        0         0
Caesar         1         1        0         1        1         1
Calpurnia      0         1        0         0        0         0
Cleopatra      1         0        0         0        0         0
mercy          1         0        1         1        1         1
worser         1         0        1         1        1         0

Each document is represented by a binary vector ∈ {0,1}^|V|
Sec. 6.2
Term-document count matrices
Consider the number of occurrences of a term in a document:
– Each document is a count vector in ℕ^|V|: a column below
– This is called the bag of words model.

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony        157        73       0         0        0         0
Brutus          4       157       0         1        0         0
Caesar        232       227       0         2        1         1
Calpurnia       0        10       0         0        0         0
Cleopatra      57         0       0         0        0         0
mercy           2         0       3         5        5         1
worser          2         0       1         1        1         0
Term frequency tf
The term frequency tft,d of term t in document d is defined as
the number of times that t occurs in d.
We want to use tf when computing query-document match
scores. But how?
Raw term frequency is not what we want:
– A document with 10 occurrences of the term is more relevant than
a document with 1 occurrence of the term.
– But not 10 times more relevant.
Relevance does not increase proportionally with term
frequency.
Sec. 6.2
Log-frequency weighting
The log frequency weight of term t in d is

    w_t,d = 1 + log10(tf_t,d)   if tf_t,d > 0
          = 0                   otherwise

0 → 0, 1 → 1, 2 → 1.3, 10 → 2, 1000 → 4, etc.
Score for a document-query pair: sum over terms t in both q
and d:

    score(q,d) = Σ_{t ∈ q ∩ d} (1 + log tf_t,d)

The score is 0 if none of the query terms is present in the
document.
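A one-function Python sketch of the log-frequency weight, reproducing the values quoted above:

```python
import math

def log_tf(tf):
    """w_{t,d} = 1 + log10(tf) if tf > 0, else 0."""
    return 1 + math.log10(tf) if tf > 0 else 0.0

for tf in (0, 1, 2, 10, 1000):
    print(tf, round(log_tf(tf), 2))   # 0->0.0, 1->1.0, 2->1.3, 10->2.0, 1000->4.0
```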
Sec. 6.2.1
Document term frequency
Rare terms are more informative than frequent terms
A document containing the term arachnocentric is very likely to be
relevant to the query arachnocentric
→ We want a high weight for rare terms like arachnocentric.
Sec. 6.2.1
Document frequency, continued
Frequent terms are less informative than rare terms
– Consider a query term that is frequent in the collection (e.g., high,
increase, line)
– A document containing such a term is more likely to be relevant
than a document that doesn’t
– But it’s not a sure indicator of relevance.
→ For frequent terms, we want high positive weights for
words like high, increase, and line
– But lower weights than for rare terms.
We will use document frequency (df) to capture this.
Sec. 6.2.1
idf weight
df_t is the document frequency of t: the number of documents
that contain t
– df_t is an inverse measure of the informativeness of t
– df_t ≤ N
We define the idf (inverse document frequency) of t by

    idf_t = log10(N / df_t)

– We use log(N/df_t) instead of N/df_t to “dampen” the effect of idf.
Sec. 6.2.1
idf example, suppose N = 1 million

term             df_t     idf_t
calpurnia            1      6
animal             100      4
sunday           1,000      3
fly             10,000      2
under          100,000      1
the          1,000,000      0

idf_t = log10(N / df_t)
There is one idf value for each term t in a collection.
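A short Python sketch that reproduces the idf table above from the formula:

```python
import math

def idf(df_t, N):
    """idf_t = log10(N / df_t)."""
    return math.log10(N / df_t)

N = 1_000_000
for term, df_t in [("calpurnia", 1), ("animal", 100), ("sunday", 1_000),
                   ("fly", 10_000), ("under", 100_000), ("the", 1_000_000)]:
    print(f"{term:10s} df={df_t:9,d}  idf={idf(df_t, N):.0f}")
```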
Sec. 6.2.1
Collection vs. Document frequency
The collection frequency of t is the number of occurrences of
t in the collection, counting multiple occurrences.
Example:

Word         Collection frequency    Document frequency
insurance          10,440                  3,997
try                10,422                  8,760

Which word is a better search term (and should get a higher
weight)?
Sec. 6.2.2
tf-idf weighting
The tf-idf weight of a term is the product of its tf weight and its idf
weight.

    w_t,d = (1 + log10 tf_t,d) × log10(N / df_t)

The weight increases with
1. the number of occurrences within a document
2. the rarity of the term in the collection
Sec. 6.2.2
Final ranking of documents for a
query

    Score(q,d) = Σ_{t ∈ q ∩ d} tf-idf_t,d
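A toy Python sketch of this scoring scheme (illustrative only; the tiny collection and query below are made up for the example):

```python
import math
from collections import Counter

def tf_idf(tf, df, N):
    """w_{t,d} = (1 + log10 tf) * log10(N / df); 0 if the term is absent."""
    return (1 + math.log10(tf)) * math.log10(N / df) if tf > 0 and df > 0 else 0.0

# Tiny collection: each document is a bag of words (term -> count).
docs = [Counter("to be or not to be".split()),
        Counter("to sleep perchance to dream".split()),
        Counter("be all that you can be".split())]
N = len(docs)
df = Counter(t for d in docs for t in d)   # document frequency of each term

def score(query, doc):
    """Score(q,d) = sum of tf-idf_{t,d} over terms t in both q and d."""
    return sum(tf_idf(doc[t], df[t], N) for t in query.split() if t in doc)

for i, d in enumerate(docs):
    print(i, round(score("to be", d), 3))
```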
Sec. 6.3
Binary → count → weight matrix

            Antony &   Julius   The       Hamlet   Othello   Macbeth
            Cleopatra  Caesar   Tempest
Antony        5.25       3.18     0         0        0        0.35
Brutus        1.21       6.1      0         1        0        0
Caesar        8.59       2.54     0         1.51     0.25     0
Calpurnia     0          1.54     0         0        0        0
Cleopatra     2.85       0        0         0        0        0
mercy         1.51       0        1.9       0.12     5.25     0.88
worser        1.37       0        0.11      4.15     0.25     1.95

Each document is now represented by a real-valued vector of tf-idf weights ∈ R^|V|
Sec. 6.3
Documents as vectors
So we have a |V|-dimensional vector space
Terms are axes of the space
Documents are points or vectors in this space
Very high-dimensional: tens of millions of dimensions when
you apply this to a web search engine
These are very sparse vectors - most entries are zero.
A Typical Information Retrieval System
(Diagram: Document Collection → Indexer → Index → Query Engine → Interface → Users)
Sec. 6.3
Queries as vectors
Key idea 1: Do the same for queries: represent them as
vectors in the space
Key idea 2: Rank documents according to their proximity to
the query in this space
Sec. 6.3
How to measure similarity?
[Method 1]
The Euclidean distance
between q and d2
[Method 2]
Rank documents
according to angle with
query
Sec. 6.3
From angles to cosines
But how – and why – should we be computing cosines?
Sec. 6.3
From angles to cosines
The following two notions are equivalent.
– Rank documents in decreasing order of the angle between query
and document
– Rank documents in increasing order of cosine(query,document)
Cosine is a monotonically decreasing function on the interval
[0°, 180°]
Sec. 6.3
cosine(query, document)

    cos(q,d) = (q · d) / (|q| |d|)
             = Σ_{i=1..|V|} q_i d_i / ( sqrt(Σ_{i=1..|V|} q_i²) × sqrt(Σ_{i=1..|V|} d_i²) )

The numerator is the dot product; dividing by the vector lengths amounts to comparing unit vectors.
q_i is the tf-idf weight of term i in the query
d_i is the tf-idf weight of term i in the document
cos(q,d) is the cosine similarity of q and d … or,
equivalently, the cosine of the angle between q and d.
Summary – vector space ranking
Represent the query as a weighted tf-idf vector
Represent each document as a weighted tf-idf vector
Compute the cosine similarity score for the query vector and each
document vector
Rank documents with respect to the query by score
Return the top K (e.g., K = 10) to the user
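A compact Python sketch of this ranking loop over sparse tf-idf vectors stored as dictionaries (an illustration, with made-up weights):

```python
import math

def cosine(q, d):
    """Cosine similarity of two sparse vectors given as {term: weight} dicts."""
    dot = sum(w * d.get(t, 0.0) for t, w in q.items())
    nq = math.sqrt(sum(w * w for w in q.values()))
    nd = math.sqrt(sum(w * w for w in d.values()))
    return dot / (nq * nd) if nq and nd else 0.0

def rank(query_vec, doc_vecs, k=10):
    """Return the top-k (score, docID) pairs by cosine similarity."""
    scored = [(cosine(query_vec, v), doc_id) for doc_id, v in doc_vecs.items()]
    return sorted(scored, reverse=True)[:k]

docs = {"d1": {"brutus": 6.1, "caesar": 2.54},
        "d2": {"caesar": 1.51, "mercy": 0.12}}
print(rank({"brutus": 1.0, "caesar": 1.0}, docs, k=2))
```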
Methods for Selecting Weights
Empirical
Test a large number of possible weighting schemes
with actual data. (Based on work of Salton, et al.)
What is often used today.
Model based
Develop a mathematical model of word distribution
and derive weighting scheme theoretically.
(Probabilistic model of information retrieval.)
Discussion of Similarity
The similarity measure described here is widely used and works
well on a wide range of documents, but has no theoretical
basis.
1. There are several possible measures other than the angle
between vectors
2. There is a choice of possible definitions of tf and idf
3. With fielded searching, there are various ways to adjust the
weight given to each field.
Importance of automated indexing
• Crucial to modern information retrieval and web search
• Vector model of documents and queries
– Inverted index
• Scalability issues crucial for large systems
Similarity Ranking Methods
Methods that look for matches (e.g., Boolean) assume that a
document is either relevant to a query or not relevant.
Similarity ranking methods: measure the degree of similarity
between a query and a document.
(Figure: a query and a set of documents linked by a “similar” relation.)
Similar: How similar is a document to a request?
Similarity Ranking Methods
(Figure: the query is matched against the index database built from the documents; a mechanism for determining the similarity of the query to each document produces a set of documents ranked by how similar they are to the query.)
Organization of Files for Full Text
Searching
(Figure: the word list holds each term — ant, bee, cat, dog, elk, fox, gnu, hog — with a pointer to its postings; the postings are inverted lists that point into the document store.)
Indexing Subsystem
Documents → text → assign document IDs → break into tokens → tokens
→ stop list* → non-stoplist tokens → stemming* → stemmed terms
→ term weighting* → terms with weights → Index database
(Document numbers and *field numbers are carried along; * indicates an optional operation.)
Search Subsystem
Query → parse query → query tokens → stop list* → non-stoplist tokens
→ stemming* → stemmed terms → Boolean operations* → retrieved document set
→ ranking* → ranked document set (the relevant documents returned to the user)
(* indicates an optional operation; the search subsystem looks terms up in the same index database built by the indexing subsystem.)
What we covered
• Document parsing – text
• Zipf’s law
• Documents as bag of words
• Tokens and terms
• Inverted index file structure
• Relevance and evaluation
• Precision/recall
• Documents and queries as vectors
• Similarity measures for ranking
Issues
What are the theoretical foundations for IR?
Where do documents come from?
How will this relate to search engines?