Information Retrieval: Indexing

Acknowledgements:
Dr Mounia Lalmas (QMW)
Dr Joemon Jose (Glasgow)
Roadmap
• What is a document?
• Representing the content of documents
  – Luhn's analysis
  – Generation of document representatives
  – Weighting
• Inverted files
Indexing Language
Language used to describe documents and queries.
• Index terms – a selected subset of words
  – derived from the text or arrived at independently
• Keyword searching
  – statistical analysis of documents based on word occurrence frequency
  – automated, efficient, and potentially inaccurate
• Searching using controlled vocabularies
  – more accurate results, but time-consuming if documents are manually indexed
Luhn's analysis

Luhn proposed two frequency cut-offs: words above the upper cut-off are too common to discriminate content, and words below the lower cut-off are too rare to matter.
Resolving power of significant words:
– the ability of words to discriminate document content
– peaks at the rank-order position halfway between the two cut-offs
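A minimal sketch of Luhn's idea in Python; the whitespace tokenisation and the two threshold values are illustrative assumptions, since the appropriate cut-offs depend on the collection:

```python
from collections import Counter

def significant_words(texts, lower_cutoff=2, upper_cutoff=0.05):
    """Keep only words between Luhn's two frequency cut-offs: drop words
    that are too rare (< lower_cutoff occurrences) or too common
    (> upper_cutoff as a fraction of all tokens)."""
    counts = Counter(w for text in texts for w in text.lower().split())
    total = sum(counts.values())
    return {w for w, f in counts.items()
            if f >= lower_cutoff and f / total <= upper_cutoff}
```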
Generating document representatives
• Input text: full text, abstract, or title
• Document representative: a list of (weighted) class names, each name representing a class of concepts (words) occurring in the input text
• A document is indexed by a class name if one of its significant words occurs as a member of that class
• Phases:
  – identify words – lexical analysis (tokenising)
  – removal of high-frequency words
  – suffix stripping (stemming)
  – detecting equivalent stems
  – thesauri
  – others (noun phrases, noun groups, logical formulae, structure)
  – index structure creation
Process View

[Pipeline: Document -> Lexical Analysis -> Stopword Removal -> Stemming -> Indexing features]
Lexical Analysis

The process of converting a stream of characters (the text of the documents) into a stream of words (the candidate words to be adopted as index terms).
– requires decisions on treating digits, hyphens, punctuation marks, and the case of letters
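A minimal lexical analyser in Python, assuming one common set of choices (lower-case everything, split on hyphens and punctuation, keep digit runs); the slides leave these policy decisions open:

```python
import re

def tokenise(text):
    """Lexical analysis: convert a stream of characters into a stream of
    candidate index terms: lower-case the text, treat hyphens and
    punctuation as separators, keep runs of letters and digits."""
    return re.findall(r"[a-z0-9]+", text.lower())

tokenise("State-of-the-art B2B retrieval!")
# -> ['state', 'of', 'the', 'art', 'b2b', 'retrieval']
```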
Stopword Removal

• Removal of high-frequency words using a list of stop words (implements Luhn's upper cut-off)
• Filters out words with very low discrimination value for retrieval purposes
• Examples: "been", "a", "about", "otherwise"
• The input text is compared against the stop list
• Reduction in the text to be indexed: between 30 and 50 per cent
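Filtering against the stop list is then a simple set lookup; the tiny list below merely extends the slide's examples (real stop lists run to hundreds of words):

```python
STOP_WORDS = {"been", "a", "about", "otherwise", "the", "of", "and", "is"}

def remove_stopwords(tokens):
    """Implement Luhn's upper cut-off with a fixed stop list: filter out
    high-frequency words with low discrimination value."""
    return [t for t in tokens if t not in STOP_WORDS]
```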
Conflation

Conflation reduces word variants to a single form.
– similar words generally have similar meanings
– retrieval effectiveness increases if the query is expanded with terms similar in meaning to those it originally contained
A stemming algorithm is a conflation procedure.
– it reduces all words with the same root to a single root form
Different forms – stemming

Stemming
– matching the query term "forests" to "forest" and "forested"
– "choke", "choking", "choked"
Suffix removal
– removal of suffixes, e.g. "worker"
– Porter algorithm: remove the longest suffix
– errors occur, e.g. "equal" -> "eq"; handled by heuristic rules
– more effective than matching ordinary word forms
Detecting equivalent stems
– example: ABSORB- and ABSORPT-
Stemmers remove affixes
– prefixes too? e.g. "megavolt"
Plural stemmer

Plurals in English (a sketch of the rules follows below):
– if the word ends in "ies" but not "eies" or "aies": "ies" -> "y"
– if the word ends in "es" but not "aes", "ees", or "oes": "es" -> "e"
– if the word ends in "s" but not "us" or "ss": "s" -> ""
– the first applicable rule is the one used
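The rules translate directly into code; this sketch applies the first matching rule, exactly as stated above:

```python
def plural_stem(word):
    """Plural stemmer: apply the first applicable rule only."""
    if word.endswith("ies") and not word.endswith(("eies", "aies")):
        return word[:-3] + "y"   # "ponies" -> "pony"
    if word.endswith("es") and not word.endswith(("aes", "ees", "oes")):
        return word[:-1]         # "es" -> "e": "forces" -> "force"
    if word.endswith("s") and not word.endswith(("us", "ss")):
        return word[:-1]         # "forests" -> "forest"
    return word                  # "glass", "corpus" are left unchanged
```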
Processing

"The destruction of the amazon rain forests"
• Case normalisation
• Stop word removal (from a fixed list)
  – "destruction amazon rain forests"
• Suffix removal (stemming)
  – "destruct amazon rain forest"
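Chaining the helpers sketched earlier reproduces most of this pipeline; note that the plural stemmer only handles plurals, whereas the full suffix removal on the slide also reduces "destruction" to "destruct":

```python
query = "The destruction of the amazon rain forests"
tokens = remove_stopwords(tokenise(query))   # case normalisation + stop list
# -> ['destruction', 'amazon', 'rain', 'forests']
stems = [plural_stem(t) for t in tokens]
# -> ['destruction', 'amazon', 'rain', 'forest']
```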
Thesauri

A collection of terms along with some structure or relationships between them (plus scope notes, etc.). A thesaurus can:
1. provide a standard vocabulary for indexing and searching
2. assist the user in locating terms for proper query formulation
3. provide a classification hierarchy for broadening and narrowing the current query according to user need

Relationships between terms:
– Equivalence: synonyms, preferred terms
– Hierarchical: broader/narrower terms (BT/NT)
– Association: related terms across the hierarchy (RT)
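One possible encoding of these relationships, with invented entries for illustration; USE marks an equivalence pointing at the preferred term:

```python
THESAURUS = {
    "rainforest": {"BT": ["forest"], "NT": ["tropical rainforest"],
                   "RT": ["deforestation"]},
    "jungle": {"USE": "rainforest"},   # equivalence: preferred term
}

def broaden(term):
    """Follow a USE reference to the preferred term, then return its
    broader terms (BT) for query broadening."""
    entry = THESAURUS.get(term, {})
    if "USE" in entry:
        entry = THESAURUS.get(entry["USE"], {})
    return entry.get("BT", [])

broaden("jungle")   # -> ['forest']
```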
Thesauri Examples: WordNet
[Screenshot: WordNet as a faceted classification]
Thesauri Examples: AAT
Art and Architecture Thesaurus
Hierarchical Classifications

• Alphanumeric coding schemes
• Subject classifications
• A taxonomy that represents a classification or kind-of hierarchy
• Examples: Dewey Decimal, AAT, SHIC, ICONCLASS

ICONCLASS example:
    41A32    Door
    41A322   Closing the Door      (an action associated with a door)
    41A323   Monumental Door       (a kind of door)
    41A324   Metalwork of a Door
    41A3241  Door-Knocker          (something attached to a door)
    41A325   Threshold
    41A327   Door-keeper, houseguard
Terminology / Controlled vocabulary

• The descriptors from a thesaurus form a controlled vocabulary
• Normalises indexing concepts
• Identifies indexing concepts with clear semantics
• Retrieval is based on concepts rather than terms
• Good for specific domains (e.g., medical)
• Problematic for general domains (large, new, dynamic)
No One Classification
Generating document representatives – Outcome

• Class: words with the same stem
• Class name: the stem
• Document representative: a list of class names (index terms or keywords)
• The same process is applied to the query
Precision and Recall

Precision
– the ratio of the number of relevant documents retrieved to the total number of documents retrieved
– the proportion of hits that are relevant
Recall
– the ratio of the number of relevant documents retrieved to the total number of relevant documents
– the proportion of relevant documents that are hits
Precision and Recall

[Diagram: relevant documents vs. retrieved documents within the document space, illustrating the four combinations: low precision/low recall, high precision/low recall, low precision/high recall, high precision/high recall]
Precision and Recall

Let R be the set of relevant documents and A the retrieved answer set within the information space, with R ∩ A their overlap:

    Recall = |R ∩ A| / |R|
    Precision = |R ∩ A| / |A|

• The user isn't usually given the answer set A at once
• The documents in A are sorted by degree of relevance (ranking), which the user examines; recall and precision vary as the user proceeds through the answer set A
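Both measures are straightforward to compute from the sets, and evaluating them at every rank shows how they vary as the user walks down the ranking (assuming non-empty retrieved and relevant sets):

```python
def precision_recall(retrieved, relevant):
    """Set-based precision |R ∩ A| / |A| and recall |R ∩ A| / |R|."""
    hits = len(set(retrieved) & set(relevant))
    return hits / len(set(retrieved)), hits / len(set(relevant))

def pr_along_ranking(ranking, relevant):
    """Precision and recall after the user has examined the top k
    documents of the ranked answer set, for every k."""
    return [precision_recall(ranking[:k], relevant)
            for k in range(1, len(ranking) + 1)]
```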
Precision and Recall Trade-Off

Increasing the number of documents retrieved:
• likely retrieves more of the relevant documents, and thus increases recall
• but typically also retrieves more inappropriate documents, and thus decreases precision

[Plot: precision (up to 100%) falling as recall (up to 100%) increases]
Index term weighting

Effectiveness of an indexing language:
• Exhaustivity
  – the number of different topics indexed
  – high exhaustivity: high recall and low precision
• Specificity
  – the ability of the indexing language to describe topics precisely
  – high specificity: high precision and low recall
Index term weighting

• Exhaustivity
  – related to the number of index terms assigned to a given document
• Specificity
  – the number of documents to which a term is assigned in a collection
  – related to the distribution of index terms in the collection
• Index term weighting
  – index term frequency: occurrence frequency of a term in a document
  – document frequency: number of documents in which a term occurs
IR as Clustering

• A query is a vague specification of a set of objects, A
• IR is reduced to the problem of determining which documents are in set A and which are not
• Intra-cluster similarity:
  – what features best describe the objects in A?
• Inter-cluster dissimilarity:
  – what features best distinguish the objects in A from the remaining objects in the collection C?

[Diagram: the retrieved documents A as a cluster of points inside the document collection C]
Index term weighting

    Weight(t,d) = tf(t,d) × idf(t)

Notation:
– N: number of documents in the collection
– n(t): number of documents in which term t occurs
– idf(t): inverse document frequency of t
– occ(t,d): occurrences of term t in document d
– tmax: the term in document d with the highest occurrence count
– tf(t,d): term frequency of t in document d
Index term weighting

Intra-cluster similarity
– the raw frequency of a term t inside a document d
– a measure of how well the term describes the document contents
– normalised frequency of term t in document d:

    tf(t,d) = occ(t,d) / occ(tmax, d)

Inter-cluster dissimilarity
– inverse document frequency: the inverse of the frequency of term t among the documents in the collection
– terms which appear in many documents are not useful for distinguishing a relevant document from a non-relevant one
Inverse document frequency

    idf(t) = log( N / n(t) )

    Weight(t,d) = tf(t,d) × idf(t)
Term weighting schemes

Best known (term frequency × inverse document frequency):

    weight(t,d) = ( occ(t,d) / occ(tmax, d) ) × log( N / n(t) )

Variation for query term weights:

    weight(t,q) = ( 0.5 + 0.5 × occ(t,q) / occ(tmax, q) ) × log( N / n(t) )
Example

Term occurrence counts in a document ("and" would be removed as a stopword, leaving "people", with 25 occurrences, as tmax). The collection holds N = 100 documents; "machine" occurs in 50 of them, "luddite" and "poverty" in 2 each:

    nuclear 7, poverty 5, luddites 3, people 25, computer 9,
    unemployment 1, machines 19, and 49

    Weight(machine) = 19/25 × log(100/50) = 0.76 × 0.30103 ≈ 0.229
    Weight(luddite) = 3/25 × log(100/2)  = 0.12 × 1.69897 ≈ 0.204
    Weight(poverty) = 5/25 × log(100/2)  = 0.20 × 1.69897 ≈ 0.340
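A sketch of the weighting function that reproduces the example's numbers; the slides don't state the base of the logarithm, but the worked figures above imply base 10:

```python
import math

def weight(occ, occ_max, N, n_t):
    """tf(t,d) × idf(t) = (occ(t,d) / occ(tmax,d)) × log10(N / n(t))."""
    return (occ / occ_max) * math.log10(N / n_t)

weight(19, 25, 100, 50)   # machine -> 0.2288
weight(3, 25, 100, 2)     # luddite -> 0.2039
weight(5, 25, 100, 2)     # poverty -> 0.3398
```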
Inverted Files

A word-oriented mechanism for indexing text collections in order to speed up searching.
Searching:
– vocabulary search (query terms)
– retrieval of occurrences
– manipulation of occurrences
Original Document view

          cosmonaut  astronaut  moon  car  truck
    D1        1          0       1     1     1
    D2        0          1       1     0     0
    D3        0          0       0     1     1
Inverted view

                 D1  D2  D3
    cosmonaut     1   0   0
    astronaut     0   1   0
    moon          1   1   0
    car           1   0   1
    truck         1   0   1
Inverted index

    cosmonaut -> D1
    astronaut -> D2
    moon      -> D1, D2
    car       -> D1, D3
    truck     -> D1, D3
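Building the inverted view is a single pass over the documents; this sketch reproduces the example above from per-document term lists:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the sorted list of documents containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

docs = {"D1": ["cosmonaut", "moon", "car", "truck"],
        "D2": ["astronaut", "moon"],
        "D3": ["car", "truck"]}
build_inverted_index(docs)
# {'cosmonaut': ['D1'], 'moon': ['D1', 'D2'], 'car': ['D1', 'D3'],
#  'truck': ['D1', 'D3'], 'astronaut': ['D2']}
```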
Inverted File

• The speed of retrieval is maximised by considering only those terms that have been specified in the query
• This speed is achieved only at the cost of very substantial storage and processing overheads
Components of an inverted file

• Header information
• Vocabulary entries: term, field type, frequency, pointer into the postings file
• Postings file entries: document number, frequency

[Diagram: an inverted file with a letter-indexed vocabulary (A, AI, AL, B, BA, BR, ..., TH, TI) whose entries point to postings across Docs 1–8]
Producing an Inverted file

Each document is first reduced to a binary term-incidence vector over the vocabulary; reading off, for each term, the documents whose vectors contain a 1 yields the postings:

    Term     Postings
    aid      4, 8
    all      2, 4, 6
    back     1, 3, 7
    brown    1, 3, 5, 7
    come     2, 4, 6, 8
    dog      3, 5
    fox      3, 5, 7
    good     2, 4, 6, 8
    jump     3
    lazy     1, 3, 5, 7
    men      2, 4, 8
    now      2, 6, 8
    over     1, 3, 5, 7, 8
    party    6, 8
    quick    1, 3
    their    1, 5, 7
    time     2, 4, 6
Searching Algorithm

For each document D: Score(D) = 0
For each query term:
– search the vocabulary list
– pull out the postings list
– for each document J in the list: Score(J) = Score(J) + 1
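The same coordination-level algorithm in Python, using the inverted index built earlier; each posting of each query term adds one to the document's score:

```python
from collections import defaultdict

def search(query_terms, index):
    """Score(J) += 1 for every query term whose postings contain J;
    return the documents best-first."""
    score = defaultdict(int)
    for term in query_terms:              # vocabulary search
        for doc in index.get(term, []):   # pull out the postings list
            score[doc] += 1
    return sorted(score.items(), key=lambda kv: kv[1], reverse=True)
```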
What Goes in a Postings File?

• Boolean retrieval
  – just the document number
• Ranked retrieval
  – document number and term weight (TF×IDF, ...)
• Proximity operators
  – word offsets for each occurrence of the term
  – example: Doc 3 (t17, t36), Doc 13 (t3, t45)
How Big Is the Postings File?

• Very compact for Boolean retrieval
  – about 10% of the size of the documents, if an aggressive stopword list is used
• Not much larger for ranked retrieval
  – perhaps 20%
• Enormous for proximity operators
  – sometimes larger than the documents themselves
  – but access is fast: you know where to look
[Diagram: documents and the query each pass through tokenising, stopword removal, and stemming to produce indexing features and query features; matching the query features against the inverted index (Term 1 -> di, dj, dk, ...) yields documents ranked by score, s1 > s2 > s3 > ...]
Similarity Matching

The process in which we compute the relevance of a document for a query.
A similarity measure comprises:
– a term weighting scheme, which allocates numerical values to each of the index terms in a query or document, reflecting their relative importance
– a similarity coefficient, which uses the term weights to compute the overall degree of similarity between a query and a document
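A minimal similarity coefficient under these definitions: the inner product of the two weight vectors over their shared terms (cosine similarity would additionally normalise by the vector lengths); the weights shown are invented:

```python
def inner_product(query_weights, doc_weights):
    """Sum the products of query and document weights over shared terms."""
    shared = query_weights.keys() & doc_weights.keys()
    return sum(query_weights[t] * doc_weights[t] for t in shared)

inner_product({"forest": 0.8, "amazon": 0.5},
              {"forest": 0.34, "rain": 0.2})   # -> 0.272
```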