next word index

advertisement
|1
Zoekmachines
Gertjan van Noord
2014
Lecture 2: vocabulary, posting lists
Agenda for today
•
•
•
•
Questions Chapter 1
Chapter 2: Term vocabulary & posting lists
Chapter 2: Posting lists with positions
Homework/lab assignment
Chapter 2 Overview
Preprocessing of documents
• choose the unit of indexing (granularity)
• tokenization (removing punctuation, splitting in
words)
• stop list?
• normalization: case folding, stemming versus
lemmatizing, ...
• extensions to postings lists
Tokens, types and terms
token
type
each separate word in the text
same words belong to one type
(index) term finally included in the index
index term is an equivalence class
of tokens and/or types
26-01-12
Tokens, types and terms
The Lord of the Rings
•
•
•
•
•
•
Number of tokens?
5
Number of types?
4
Number of terms?
4? 2? 1?
26-01-12
Equivalence classes
•
•
•
•
•
•
Casefolding
Diacritics
Stemming/lemmatisation
Decompounding
Synonym lists
Variant spellings
26-01-12
Equivalence classes
•
•
•
•
Implicit: mapping rules
Relational: query expansion
Relational: double indexing
Mapping should be done:
– Indexing
– Querying
Words and word forms
• Inflection (D: verbuiging/vervoeging)
- changing a word to express person, case, aspect, ...
- for determiners, nouns, pronouns, adjectives:
declination (D: verbuiging)
- for verbs: conjugation (D: vervoeging)
• Derivation (D: afleiding)
- formation of a new word from another word (e.g. by
adding an affix (prefix or suffix) or changing the
grammatical category)
Inflection examples
Determiners
E: the D: de, het G: der, des, dem, den, die, das
Adjectives
E: young D: jonge, jonge G: junger, junge, junges, jungen
Nouns
E: man, men D: man, mannen G: mann, mannes,
Verbs
E write / writes / wrote / written
D schrijf/ schrijft /schrijven / schreef/ schreven / geschreven
G schreibe/ schreibst / schreibt / schreiben / schrieben /
geschrieben
Derivation examples
to browse -> a browser
red
-> to redden, reddish
Google
-> to google
arm(s) -> to arm, to disarm ->
disarmament, disarming
Stemming and lemmatizing
verb forms
derivations
stem
lemma
inform, informs, informed, informing
information, informative, informal??
inform
inform, information, informative, informal
verb forms
derivations
stem
lemma
sing: sings, sang, sung, singing
singer, singers, song, songs
sing, sang, sung, song,
sing, singer, song
Discussion
Why is stemming used when lemmatizing is much more
precise?
Lemmatizing is a more complex process
it needs
- a vocabulary (problem: new words)
- morphologic analysis (knowledge of inflection rules)
- syntactic analysis, parsing (noun or verb?)
26-01-12
Compound splitting
Marketingjargon -> marketing AND jargon
•
•
•
•
•
Increased retrieval
Decreased precision
Must be applied to both query and index!
But what to do with the query marketing jargon ?
And with spreekwoord appel boom ?
Chapter 2 Overview
Preprocessing of documents
• choose the unit of indexing (granularity)
• tokenization (removing punctuation, splitting in
words)
• stop list?
• normalization: case folding, stemming versus
lemmatizing, ...
• extensions to postings lists
Efficient merging of postings
For X AND Y, we have to intersect 2 lists
Most documents will contain only one of the two terms
Recall basic intersection algorithm
Skip pointers
Skip pointers
• Makes intersection of 2 lists more efficient
• think of millions of list items
• How many skip pointers and where?
• Trade-off:
• More pointers, often useful but small skips.
• Less pointers …
• Heuristic: distance √n, evenly distributed
Skip pointers: useful?
Yes, certainly in the past
With very fast CPUs less important
Especially in a rather static index
If a list keeps changing less effective
Extensions of the simple term index
To support phrase queries
• “information retrieval”
• “retrieval of information”
Different approaches
• biword indexes
• phrase indexes
• positional indexes
• combinations
Biword and phrase indexes
• Holding terms together in the index
• Simple biword index:
• retrieval of, of information
• Sophisticated: POS tagger selects nouns
• N x* N retrieval of this information
• Phrase index: includes variable lengths of word
sequences
• terms of 1 and 2 words both included
Positional index
Add in the postings lists for each doc the list of positions of
the term
for phrase queries
for proximity search
Example
[information, 4] : [1:<4,22, 35>, 2:<5,17, 30>, …]
[retrieval, 2] : [1:<5,20>,
2:<18,31>]
Combination schemes
Often queried combinations: phrase index
names of persons and organization
esp. combinations of common terms (!)
find out from query log
For other phrases a positional index
Williams e.a.: next word index added
H.E. Williams, J.Zobel, and D.Bahle (2004) Fast
Phrase Querying With Combined Indexes (ACM
Dig Library):
Phrase querying with a combination of three approaches
(next word index, phrase index and inverted file)
... is more than 60% faster on average than using an inverted
index alone
... requires structures that total only 20% of the size of the
collection.
We conclude that our approaches make stopping
unnecessary and allow fast query evaluation for all phrase
queries.
Doc ID
No of matching
docs
No of
occurrences
in doc
A nextword index (Williams e.a.)
position
docfreq,(<doc,freq,[pos, pos,..]>,<doc, freq, [..]
Download