Intelligent Text Processing
lecture 3
Word distribution laws.
Word-based indexing
Szymon Grabowski
sgrabow@kis.p.lodz.pl
http://szgrabowski.kis.p.lodz.pl/IPT08/
Łódź, 2008
Zipf’s law (Zipf, 1935, 1949)
[ http://en.wikipedia.org/wiki/Zipf's_law,
http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]
word-rank × word-frequency ≈ constant
That is, the few most frequent words cover a relatively large
part of the text, while the majority of words in a given
text’s vocabulary occur only once or twice.
More formally, the frequency of any word is
(approximately) inversely proportional to its rank
in the frequency table. Zipf’s law is empirical!
Example from the Brown Corpus (slightly over 10^6 words):
“the” is the most freq. word with ~7% (69971) of all word occs.
The next word, “of”, has ~3.5% occs (36411),
followed by “and” with less than 3% occs (28852).
Only 135 items are needed to account for half the Brown Corpus.
2
Does Wikipedia confirm Zipf’s law?
[ http://en.wikipedia.org/wiki/Zipf's_law ]
Word freq in
Wikipedia, Nov 2006,
log-log plot.
x is a word rank,
y is the total # of
occs.
Zipf's law roughly
corresponds to the
green (1/x) line.
3
Let’s check it in Python...
Dickens collection:
http://sun.aei.polsl.pl/~sdeor/stat.php?s=Corpus_dickens&u=corpus/dickens.bz2
distinct words: 28347
5 top freq words:
[('the', 80103), ('and', 63971), ('to', 47413), ('of', 46825),
('a', 37182)]
4
Lots of words with only a few occurrences
(Dickens example continued):
there are 9423 words with freq 1
there are 3735 words with freq 2
there are 2309 words with freq 3
there are 1518 words with freq 4
5
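A minimal sketch of the kind of script that could produce the numbers on the two slides above. The file name dickens.txt and the tokenization rule (lowercasing, splitting on non-letter characters) are assumptions here, so the exact counts may differ slightly.

import re
from collections import defaultdict

# read the Dickens collection (file name assumed; decompress dickens.bz2 first)
text = open("dickens.txt").read().lower()

# crude tokenization: keep maximal runs of letters (an assumption)
words = re.findall("[a-z]+", text)

freq = defaultdict(int)
for w in words:
    freq[w] += 1

print("distinct words: %d" % len(freq))

# the 5 most frequent words
top5 = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)[:5]
print("5 top freq words: %s" % top5)

# how many words occur exactly 1, 2, 3, 4 times?
for f in range(1, 5):
    cnt = sum(1 for v in freq.values() if v == f)
    print("there are %d words with freq %d" % (cnt, f))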
Brown corpus statistics
[ http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]
6
Heaps’ law (Heaps, 1978)
[ http://en.wikipedia.org/wiki/Heaps'_law ]
Another empirical law. It tells how vocabulary size V
grows with growing text size n (expressed in words):
V(n) = K · n^β,
where K is typically around 10..100 (for English)
and β between 0.4 and 0.6.
Roughly speaking, the vocabulary
(# of distinct words) grows proportionally to the
square root of the text length.
7
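A quick numeric illustration of Heaps’ law; the constants K = 30 and β = 0.5 are merely assumed example values from the ranges quoted above.

# Heaps' law: V(n) = K * n**beta
K, beta = 30.0, 0.5        # assumed example constants (K in 10..100, beta in 0.4..0.6)

for n in (10**4, 10**6, 10**8):
    v = int(K * n ** beta)
    print("text of %d words -> about %d distinct words predicted" % (n, v))
# the text grows here by a factor of 10000,
# while the predicted vocabulary grows only by a factor of 100 (the square root)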
Musings on Heaps’ law
The number of ‘words’ grows without limit...
How is this possible?
Because new documents tend to bring new words:
e.g. (personal, geographical etc.) names. But also typos!
On the other hand, the dictionary size grows
significantly slower than the text itself.
I.e. it doesn’t pay much to represent the dictionary
succinctly (with compression) – dictionary
compression ideas which slow down the access
should be avoided.
8
Inverted indexes
Almost all real-world text indexes are inverted indexes
(in one form or another).
An inverted index (Salton and McGill, 1983) stores words
and their occurrence lists in the text database/collection.
The occurrence lists may store exact word positions
(inverted list) or just block or document numbers (inverted file).
Storing exact word positions (inverted list) enables
faster search and facilitates some kinds of queries
(compared to the inverted file),
but requires (much) more storage.
9
Inverted file, example
[ http://en.wikipedia.org/wiki/Inverted_index ]
We have three texts (documents):
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We build the vocabulary and the
associated document index lists:
"a": {2}, "banana": {2}, "is": {0, 1, 2}, "it": {0, 1, 2}, "what": {0, 1}
Let’s search for what, is and it.
That is, we want to obtain the references to all the
documents which contain those three words (at least
once each, and in arbitrary positions!).
The answer:
{0, 1} ∩ {0, 1, 2} ∩ {0, 1, 2} = {0, 1}, i.e. documents T0 and T1.
10
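A sketch of building the inverted file for T0, T1, T2 above and answering the query what, is, it by intersecting the document lists (Python sets are used here for brevity; a real index keeps sorted lists).

docs = ["it is what it is",      # T0
        "what is it",            # T1
        "it is a banana"]        # T2

# vocabulary -> set of document numbers containing the word
index = {}
for doc_id, text in enumerate(docs):
    for word in text.split():
        index.setdefault(word, set()).add(doc_id)

print(index)   # e.g. the entry for 'what' contains documents 0 and 1

# conjunctive (AND) query: documents containing all three words
query = ["what", "is", "it"]
result = index[query[0]]
for word in query[1:]:
    result = result & index[word]
print(sorted(result))            # [0, 1]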
How to build an inverted index
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Step 1 – collect the documents to be indexed (obvious).
Step 2 – split the text into tokens (roughly: words).
Step 3 – normalize the tokens to reduce the number of distinct ones
(e.g. lowercase capitalized words, reduce plural nouns to singular forms).
Step 4 – build the dictionary structure
and occurrence lists.
11
Tokenization
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
A token is an instance of a sequence of characters
in some particular document that are grouped
together as a useful semantic unit for processing.
How to tokenize? Not as easy as it first seems...
Consider a sample sentence and the possible tokenizations
of some of its excerpts.
12
Tokenization, cont’d
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Should we accept tokens like
C++, C#, .NET, DC-9, M*A*S*H ?
And if M*A*S*H is ‘legal’ (single token),
then why not 2*3*5 ?
Is a dot (.) a punctuation mark (sentence terminator)?
Usually yes (of course), but think about
emails, URLs and IPs...
13
Tokenization, cont’d
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Is a space always a reasonable token delimiter?
Is it better to perceive New York (or San Francisco,
Tomaszów Mazowiecki...) as one token or two?
Similarly with foreign phrases (cafe latte, en face).
Splitting on spaces may bring bad retrieval results:
search for York University will mainly fetch documents
related to New York University.
14
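A small sketch of why the tokenizer choice matters; both tokenizers below are just illustrative regular expressions, not recommendations, and the sample sentence (with its e-mail address) is made up.

import re

s = "M*A*S*H aired in New York; email me at john.doe@example.com"

# tokenizer 1: letters and digits only -- splits M*A*S*H and the address apart
print(re.findall(r"[A-Za-z0-9]+", s))

# tokenizer 2: also allow '*', '.', '@' inside tokens -- keeps M*A*S*H and the
# address whole, but would also accept things like 2*3*5 as a single token
print(re.findall(r"[A-Za-z0-9][A-Za-z0-9*.@]*", s))

# note: neither tokenizer can keep 'New York' together -- splitting on spaces
# is exactly the problem discussed on the slide above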
Stop words
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Stop words are very frequent words which carry
little information (e.g. pronouns, articles, prepositions).
For example: the, of, and, a, in, to, is, it, that...
Consider an inverted list (i.e. exact positions for all
word occurrences are kept in the index).
If we use a stop list (i.e. discard stop words
during indexing), the index will get smaller, say by
20–30% (note this is related to Zipf’s law).
Danger of using stop words: some meaningful
queries may consist of stop words exclusively:
The Who, to be or not to be...
The modern trend is not to use a stop list at all.
15
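A sketch of applying a stop list during indexing; the stop list below is a tiny assumed sample. It also shows the danger mentioned above: queries consisting exclusively of stop words are wiped out entirely.

stop_words = set(["the", "a", "an", "of", "and", "to", "be", "or", "not",
                  "is", "it", "in", "on", "who"])   # tiny assumed stop list

def index_terms(text):
    # keep only the non-stop-word tokens (lowercased)
    return [w for w in text.lower().split() if w not in stop_words]

print(index_terms("The quick brown fox jumps over the lazy dog"))
# ['quick', 'brown', 'fox', 'jumps', 'over', 'lazy', 'dog']

print(index_terms("to be or not to be"))   # [] -- the whole query disappears!
print(index_terms("The Who"))              # [] -- same problem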
Glimpse – a classic inverted file index
[ Manber & Wu, 1993 ]
Assumption: the text to index is a single unstructured file
(perhaps a concatenation of documents).
It is divided into 256 blocks of approximately
the same size.
Each entry in the index is a word and the list of blocks
in which it occurs, in ascending order. Each block
number takes 1 byte (why?).
16
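A simplified sketch of the block-addressing idea (not the actual Glimpse code): the word sequence is cut into at most 256 blocks of roughly equal size, and each vocabulary entry stores the ascending list of block numbers it occurs in, so each number fits in one byte. The input file name and the sample word are assumptions.

def build_block_index(words, num_blocks=256):
    # split the word sequence into at most num_blocks chunks of roughly equal size
    block_size = max(1, (len(words) + num_blocks - 1) // num_blocks)
    index = {}
    for pos, word in enumerate(words):
        block_no = min(pos // block_size, num_blocks - 1)   # fits in one byte
        lst = index.setdefault(word, [])
        if not lst or lst[-1] != block_no:   # keep the list ascending, no duplicates
            lst.append(block_no)
    return index

words = open("dickens.txt").read().lower().split()   # file name assumed, as before
index = build_block_index(words)
print(index.get("pickwick", []))   # ascending list of blocks containing 'pickwick'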
Block-addressing (e.g. Glimpse) index, general
scheme [ fig. 3 from Navarro et al., “Adding compression...”, IR, 2000 ]
17
How to search in Glimpse?
Two basic queries:
• keyword search – the user specifies one or more words
and requires a list of all documents (or blocks)
in which those words occur in any positions,
• phrase search – the user specifies two or more words
and requires a list of all documents (or blocks)
in which those words occur as a phrase, i.e. one-by-one.
Why is phrase search important?
Imagine keyword search for +new +Scotland
and phrase search for “new Scotland”.
What’s the difference?
18
How to search in Glimpse, cont’d
(fig. from http://nlp.stanford.edu/IR-book/pdf/01bool.pdf )
The key operation is to intersect several block lists.
Imagine the query has 4 words,
and their corresponding lists have lengths
10, 120, 5, 30. How do we perform the intersection?
It’s best to start with the two shortest lists, i.e. those of length 5 and 10
in our example. The intersection output will have length at most 5
(but usually fewer; it may even be empty, and then we can simply stop!).
Then we intersect the obtained list with the list of length 30,
and finally with the longest list.
No matter the intersection order, the same result
but (vastly) different speeds!
19
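A sketch of the intersection step: sort the (ascending) lists by length and intersect from the shortest upwards, stopping early if the intermediate result becomes empty. The small lists below are just a stand-in for the 10/120/5/30 example above.

def intersect_two(a, b):
    # classic merge-style intersection of two ascending lists
    i, j, out = 0, 0, []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out

def intersect_all(lists):
    lists = sorted(lists, key=len)       # shortest lists first
    result = lists[0]
    for lst in lists[1:]:
        result = intersect_two(result, lst)
        if not result:                   # empty intermediate result: stop early
            break
    return result

lists = [[1, 4, 7, 9], [0, 1, 2, 4, 5, 7, 8, 9, 11], [4, 7, 9], [2, 4, 6, 7, 9, 10]]
print(intersect_all(lists))   # [4, 7, 9]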
How to search in Glimpse, cont’d
We have obtained the intersection of all lists;
what then?
It depends on the query: for a keyword query,
we’re done (we can now retrieve the found blocks / docs).
For the phrase query, we have yet to scan the
resulting blocks / documents and check
if and where the phrase occurs in them.
To this end, we can use any fast exact string matching
algorithm, e.g. Boyer–Moore–Horspool.
Conclusion: the smaller the resulting list,
the faster the phrase query is handled.
20
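A sketch of the last stage of a phrase query: scanning a candidate block with the Boyer–Moore–Horspool algorithm (a standard textbook formulation written from scratch; the sample block text is made up).

def horspool(text, pattern):
    # return the starting positions of all occurrences of pattern in text
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # bad-character table: how far to shift when the window ends with character c
    shift = {}
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i
    occ, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:
            j -= 1
        if j < 0:
            occ.append(pos)
        pos += shift.get(text[pos + m - 1], m)   # default shift: whole pattern length
    return occ

block = "a new scotland yard report on new scotland"
print(horspool(block, "new scotland"))   # [2, 30]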
Approximate matching with Glimpse
Imagine we want to find occurrences of a given phrase
but with up to k (Levenshtein) errors. How to do it?
Assume for example k = 2 and the phrase grey cat.
The phrase has two words, so there are the following
error per word possibilities:
0 + 0, 0 + 1, 0 + 2, 1 + 0, 1 + 1, 2 + 0.
21
Approximate matching with Glimpse, cont’d
E.g. 0 + 2 means here that the first word (grey)
must be exactly matched, but the second with 2 errors
(e.g. rats).
So, the approximate query grey cat translates to many
exact queries (many of them rather silly...): grey cat,
gray cat, grey rat, great cat, grey colt, gore cat, grey date...
All those candidates are obtained by traversing
the vocabulary structure (e.g., a trie).
Another option is on-line approximate matching over
the vocabulary represented as plain text
(a concatenation of words) – see the fig. on the next slide.
22
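A sketch of turning the approximate query into exact candidates: a standard dynamic-programming Levenshtein distance, plus, for every split e1 + e2 ≤ k, the vocabulary words within e1 errors of grey and e2 errors of cat. The tiny vocabulary is an assumption.

def levenshtein(a, b):
    # standard dynamic-programming edit distance (insertions, deletions, substitutions)
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i]
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur.append(min(prev[j] + 1,          # delete a[i-1]
                           cur[j - 1] + 1,       # insert b[j-1]
                           prev[j - 1] + cost))  # substitute (or match)
        prev = cur
    return prev[-1]

vocab = ["grey", "gray", "great", "gore", "cat", "rat", "rats", "colt", "date", "dog"]
phrase, k = ["grey", "cat"], 2

# enumerate all ways of splitting up to k errors between the two phrase words
for e1 in range(k + 1):
    for e2 in range(k + 1 - e1):
        c1 = [w for w in vocab if levenshtein(phrase[0], w) <= e1]
        c2 = [w for w in vocab if levenshtein(phrase[1], w) <= e2]
        print("%d + %d errors: %s x %s" % (e1, e2, c1, c2))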
Approx matching with Glimpse,
query example with a single word x
(fig. from Baeza-Yates & Navarro, Fast Approximate String Matching in a Dictionary,
SPIRE’98)
23
Block boundaries problem
If the Glimpse blocks never cross document
boundaries (i.e., are natural), we don’t have this problem...
But if the block boundaries are artificial,
then we may be unlucky and have one of our
keywords at the very end of some block B_j
and the next keyword at the beginning of the next block B_(j+1).
How not to miss an occurrence?
There is a simple solution:
blocks may overlap a little. E.g. the last 30 words of
each block are repeated at the beginning of the next
block. Assuming the phrase length / keyword set
has no more than 30 words, we are safe.
But we may then need to scan more blocks than necessary (why?).
24
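A sketch of the overlap trick: the last overlap words of each block are repeated at the beginning of the next one, so a phrase of at most overlap words can never straddle a boundary without appearing whole in some block. The toy parameters below are assumptions.

def make_blocks(words, block_size, overlap=30):
    # the last `overlap` words of each block are repeated at the start of the next one
    blocks, start = [], 0
    while start < len(words):
        end = start + block_size
        blocks.append(words[max(0, start - overlap):end])
        start = end
    return blocks

words = ["w%d" % i for i in range(100)]          # a dummy text of 100 'words'
for b in make_blocks(words, block_size=40, overlap=5):
    print("%s ... %s (%d words)" % (b[0], b[-1], len(b)))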
Glimpse issues and limitations
The authors claim their index takes only 2–4 % of the
original text size.
But it works well only for text collections up to about 200 MB;
beyond that it starts to degenerate, i.e., the block lists
tend to be long and many queries are handled not much
faster than with online search (without any index).
How can we help it?
The overall idea is fine, so we must take care of the details.
One major idea is to apply compression
to the index...
25
(Glimpse-like) index compression
The purpose of data compression in inverted indexes
is not only to save space (storage).
It is also to make queries faster!
(One big reason is less I/O, e.g. one disk access where
without compression we’d need two accesses.
Another reason is more cache-friendly memory access.)
26
Compression opportunities (Navarro et al., 2000)
• Text (partitioned into blocks) may be compressed
on word level (faster text search in the last stage).
• Long lists may be represented as their complements,
i.e. the numbers of the blocks in which
a given word does NOT occur.
• Lists store increasing numbers, so the gaps
(differences) between them may be encoded, e.g.
2, 15, 23, 24, 100 → 2, 13, 8, 1, 76 (smaller numbers; see the sketch after this slide).
• The resulting gaps may be statistically encoded
(e.g. with some Elias code; next slides...).
27
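The gap (delta) transform from the bullets above, sketched on the numbers from this slide: an ascending list is replaced by differences and restored by a running sum.

def to_gaps(lst):
    # first element as-is, then differences between consecutive elements
    return [lst[0]] + [lst[i] - lst[i - 1] for i in range(1, len(lst))]

def from_gaps(gaps):
    out = []
    for g in gaps:
        out.append(g if not out else out[-1] + g)
    return out

blocks = [2, 15, 23, 24, 100]
print(to_gaps(blocks))              # [2, 13, 8, 1, 76]
print(from_gaps(to_gaps(blocks)))   # [2, 15, 23, 24, 100]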
Compression paradigm: modeling + coding
Modeling: the way we look at input data.
The data can be perceived as individual (context-free) 1-byte
characters, or as pairs of bytes, or triples, etc.
We can look for matching sequences in the past buffer
(bounded or unbounded, sliding or not),
the minimum match length can be set to some value, etc. etc.
We can apply lossy or lossless transforms (DCT in JPEG,
Burrows–Wheeler transform), etc.
Modeling is difficult.
Sometimes more art than science. Often data-specific.
Coding: what we do with the data
transformed in the modeling phase.
28
Intro to coding theory
A code is uniquely decodable if any concatenation
of its codewords can be parsed into codewords in only one way.
A prefix code: no codeword is a proper prefix
of any other codeword. Also called an
instantaneous code.
Trivially, any prefix code is uniquely decodable.
(But not vice versa!)
29
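A small sketch checking the prefix property of a codebook (no codeword is a proper prefix of another); the two codebooks below are assumed toy examples.

def is_prefix_code(codewords):
    # no codeword may be a proper prefix of any other codeword
    for a in codewords:
        for b in codewords:
            if a != b and b.startswith(a):
                return False
    return True

print(is_prefix_code(["0", "10", "110", "111"]))   # True  (prefix-free)
print(is_prefix_code(["0", "01", "11"]))           # False ('0' is a prefix of '01')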
Average codeword length
Let an alphabet have s symbols,
with probabilities p0, p1, ..., ps–1.
Let’s have a (uniquely decodable) code
C = [c0, c1, ..., cs–1].
The avg codeword length for a given
probability distribution is defined as
L(C, p0, p1, ..., ps–1) = Σ_{i=0..s–1} p_i · |c_i|,
where |c_i| denotes the length (in bits) of codeword c_i.
So this is a weighted average length over individual
codewords. More frequent symbols have a stronger influence.
30
Entropy
Entropy is the average information in a symbol.
Or: the lower bound on the average number (may be
fractional) of bits needed to encode an input symbol.
Higher entropy = less compressible data.
“The entropy of the data”, by itself, is a vague notion.
We always measure the entropy according to a given
model (e.g. context-free, aka order-0, statistics).
Shannon’s entropy formula
(S – the “source” emitting “messages” / symbols):
H(S) = –Σ_{i=0..s–1} p_i · log2(p_i).
31
Redundancy
Simply speaking, redundancy is the excess
in the representation of data.
Redundant data means: compressible data.
A redundant code is a non-optimal (or: far from optimal)
code.
A code redundancy (for a given prob. distribution):
R(C, p0, p1, ..., ps–1) =
L(C, p0, p1, ..., ps–1) – H(p0, p1, ..., ps–1) ≥ 0.
The redundancy of a code is the avg excess
(over the entropy) per symbol.
Can’t be below 0, of course.
32
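A sketch computing the three quantities defined above (average codeword length L, order-0 entropy H, redundancy R = L – H) for an assumed toy distribution and an assumed prefix code.

import math

probs = [0.5, 0.25, 0.15, 0.10]          # assumed symbol probabilities p0..p3
code  = ["0", "10", "110", "111"]        # an assumed prefix code c0..c3

L = sum(p * len(c) for p, c in zip(probs, code))       # average codeword length
H = -sum(p * math.log(p, 2) for p in probs if p > 0)   # order-0 entropy (bits/symbol)
R = L - H                                              # redundancy, always >= 0

print("L = %.3f bits/symbol" % L)
print("H = %.3f bits/symbol" % H)
print("R = %.3f bits/symbol" % R)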
Basic codes.
Unary code
The unary code of a positive integer n is n–1 ones followed by a terminating
zero (e.g. 4 → 1110), so the codeword length grows linearly with n.
Extremely simple, though.
Application: a very skew distribution
(expected for a given problem).
33
Basic codes.
Elias gamma code
To encode a positive integer x, write z = ⌊log2(x)⌋ in unary (z ones and
a terminating zero), followed by the z least significant bits of x in binary;
the codeword takes 2·z + 1 bits (e.g. 9 → 1110 001).
Still simple, and usually much better
than the unary code.
34
Elias gamma code in Glimpse (example)
an example occurrence list:
[2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]
list of deltas (differences):
[2, 2, 1, 1, 3, 2, 29, 2, 1, 51, 2, 24, 13, 1, 17, 52]
list of deltas minus 1 (a zero gap was previously impossible, so each gap
can be decremented; the 1st element, an absolute value, is kept as is):
[2, 1, 0, 0, 2, 1, 28, 1, 0, 50, 1, 23, 12, 0, 16, 51]
no compression (one item – one byte): 16 * 8 = 128 bits
with Elias coding: 78 bits
101 100 0 0 101 100 111101101 100 0 11111010011
100 111101000 1110101 0 111100001 11111010100
35
Python code for the previous example
occ_list = [2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]

import math

# gaps minus 1; the first element (an absolute value) is kept as is
delta_list = [occ_list[0]]
for i in range(1, len(occ_list)):
    delta_list += [occ_list[i] - occ_list[i-1] - 1]

print occ_list
print delta_list

total_len = 0
total_seq = ""
for x in delta_list:
    # Elias gamma code of x+1 (the shift avoids encoding zero)
    z = int(math.log(x+1, 2))
    code1 = "1" * z + "0"                                   # unary part: z ones and a zero
    code2 = bin(x+1 - 2**z)[2:].zfill(z) if z >= 1 else ""  # z-bit binary part
    total_seq += code1 + code2 + " "
    total_len += len(code1 + code2)

# Python v2.6 needed, as the function bin() is used
print "no compression:", len(occ_list)*8, "bits"
print "with Elias gamma coding:", total_len, "bits"
print total_seq
36
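To double-check the encoding, here is a decoder sketch for the same convention (unary part written as ones terminated by a zero, every value shifted by +1 before encoding). It is meant to be appended to the script above, since it reuses total_seq and occ_list.

def gamma_decode(bits):
    # decode a concatenation of Elias gamma codewords, same convention as above
    values, i = [], 0
    while i < len(bits):
        z = 0
        while bits[i] == "1":        # unary part: count the leading ones
            z += 1
            i += 1
        i += 1                       # skip the terminating '0'
        v = 1
        for _ in range(z):           # read the z-bit binary part
            v = v * 2 + int(bits[i])
            i += 1
        values.append(v - 1)         # undo the "+1" shift applied before encoding
    return values

deltas = gamma_decode(total_seq.replace(" ", ""))
decoded = [deltas[0]]
for d in deltas[1:]:
    decoded.append(decoded[-1] + d + 1)   # undo the "minus 1" applied to the gaps
print(decoded == occ_list)                # should print True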
Huffman coding (1952) – basic facts
Elias codes assume that we know (roughly)
the symbol distribution before encoding.
What if we guessed it badly...?
If we know the symbol distribution, we may construct an
optimal code, more precisely,
optimal among the uniquely decodable codes having
a codebook.
It is called Huffman coding.
Example (figure on the original slide): a set of symbol frequencies
and the final Huffman tree built from them.
37
Huffman coding (1952) – basic facts, cont’d
Its redundancy is always less than
1 bit / symbol (but may be arbitrarily close).
In most practical applications (data not very skew),
the Huffman code’s avg length is only 1–3% worse
than the entropy.
E.g. the order-0 entropy of book1
(English 19th century novel, plain text)
from the Calgary corpus: 4.567 bpc.
Huffman avg length: 4.569 bpc.
38
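A compact sketch of Huffman code construction with a heap (a standard textbook formulation, not the figure from the slide); the symbol frequencies below are an assumed toy example.

import heapq

def huffman_code(freqs):
    # freqs: dict symbol -> frequency; returns dict symbol -> codeword (a bit string)
    # each heap item: (subtree frequency, tie-breaker, {symbol: codeword so far})
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # the two least frequent subtrees...
        f2, _, c2 = heapq.heappop(heap)
        merged = {}
        for s in c1:
            merged[s] = "0" + c1[s]              # ...are merged: left branch gets '0',
        for s in c2:
            merged[s] = "1" + c2[s]              # right branch gets '1'
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # assumed toy frequencies
code = huffman_code(freqs)
for s in sorted(code):
    print("%s  freq %2d  codeword %s" % (s, freqs[s], code[s]))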
Why not use Huffman coding alone
for encoding the occurrence lists?
39