Intelligent Text Processing, lecture 3
Word distribution laws. Word-based indexing
Szymon Grabowski
sgrabow@kis.p.lodz.pl
http://szgrabowski.kis.p.lodz.pl/IPT08/
Łódź, 2008

Zipf's law (Zipf, 1935, 1949)
[ http://en.wikipedia.org/wiki/Zipf's_law, http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]
word-rank × word-freq ≈ constant
That is, a few most frequent words cover a relatively large part of the text, while the majority of words in the given text's vocabulary occur only once or twice. More formally, the frequency of any word is (approximately) inversely proportional to its rank in the frequency table. Zipf's law is empirical!
Example from the Brown Corpus (slightly over 10^6 words): "the" is the most frequent word with ~7% (69971) of all word occurrences. The next word, "of", has ~3.5% of the occurrences (36411), followed by "and" with less than 3% (28852). Only 135 items are needed to account for half the Brown Corpus.

Does Wikipedia confirm Zipf's law?
[ http://en.wikipedia.org/wiki/Zipf's_law ]
Word frequency in Wikipedia, Nov 2006, log-log plot: x is a word's rank, y is its total number of occurrences. Zipf's law roughly corresponds to the green (1/x) line.

Let's check it in Python...
Dickens collection: http://sun.aei.polsl.pl/~sdeor/stat.php?s=Corpus_dickens&u=corpus/dickens.bz2
distinct words: 28347
5 top freq words: [('the', 80103), ('and', 63971), ('to', 47413), ('of', 46825), ('a', 37182)]

Lots of words with only a few occurrences (Dickens example continued):
there are 9423 words with freq 1
there are 3735 words with freq 2
there are 2309 words with freq 3
there are 1518 words with freq 4

Brown corpus statistics
[ http://ciir.cs.umass.edu/cmpsci646/Slides/ir08%20compression.pdf ]

Heaps' law (Heaps, 1978)
[ http://en.wikipedia.org/wiki/Heaps'_law ]
Another empirical law. It tells how the vocabulary size V grows with growing text size n (expressed in words): V(n) = K · n^β, where K is typically around 10..100 (for English) and β is between 0.4 and 0.6. Roughly speaking, the vocabulary (number of distinct words) grows proportionally to the square root of the text length.

Musings on Heaps' law
The number of 'words' grows without limit... How is that possible? Because new documents tend to bring new words: e.g. (human, geographical etc.) names. But also typos! On the other hand, the dictionary grows significantly more slowly than the text itself. That is, it doesn't pay much to represent the dictionary succinctly (with compression) – dictionary compression ideas which slow down access should be avoided.

Inverted indexes
Almost all real-world text indexes are inverted indexes (in one form or another). An inverted index (Salton and McGill, 1983) stores words and their occurrence lists in the text database/collection. The occurrence lists may store exact word positions (inverted list) or just a block or a document (inverted file). Storing exact word positions (inverted list) enables faster search and facilitates some kinds of queries (compared to the inverted file), but requires (much) more storage.

Inverted file, example
[ http://en.wikipedia.org/wiki/Inverted_index ]
We have three texts (documents):
T0 = "it is what it is"
T1 = "what is it"
T2 = "it is a banana"
We build the vocabulary and the associated document index lists:
"a"      -> {2}
"banana" -> {2}
"is"     -> {0, 1, 2}
"it"     -> {0, 1, 2}
"what"   -> {0, 1}
Let's search for what, is and it. That is, we want to obtain the references to all the documents which contain those three words (at least once each, and in arbitrary positions!).
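A minimal Python sketch of the word-frequency check behind the "Let's check it in Python" slides above. The file name dickens.txt and the crude letters-only tokenization are my assumptions, so the exact numbers may differ slightly from those quoted on the slides.

# Word-frequency statistics for a plain-text corpus (a sketch; the file
# name "dickens.txt" and the crude tokenization are assumptions).
import re
from collections import Counter

with open("dickens.txt") as f:
    text = f.read().lower()

words = re.findall(r"[a-z]+", text)     # tokenization: maximal runs of letters
freq = Counter(words)

print("distinct words: %d" % len(freq))
print("5 top freq words: %s" % freq.most_common(5))

# how many words occur exactly 1, 2, 3, 4 times? (cf. Zipf's law)
freq_of_freq = Counter(freq.values())
for k in range(1, 5):
    print("there are %d words with freq %d" % (freq_of_freq[k], k))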
The answer: "what" -> {0, 1}, "is" -> {0, 1, 2}, "it" -> {0, 1, 2}; the intersection is {0, 1}, i.e. documents T0 and T1 contain all three words.

How to build an inverted index
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Step 1 – obvious (collect the documents to be indexed).
Step 2 – split the text into tokens (roughly: words).
Step 3 – normalize the tokens to reduce the number of unique ones (e.g. fold capitalized words to lowercase, map plural nouns to singular forms).
Step 4 – build the dictionary structure and the occurrence lists.

Tokenization
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
A token is an instance of a sequence of characters in some particular document that are grouped together as a useful semantic unit for processing.
How to tokenize? Not as easy as it first seems... See the following sentence and possible tokenizations of some excerpts from it.

Tokenization, cont'd
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Should we accept tokens like C++, C#, .NET, DC-9, M*A*S*H? And if M*A*S*H is 'legal' (a single token), then why not 2*3*5?
Is a dot (.) a punctuation mark (sentence terminator)? Usually yes (of course), but think about e-mail addresses, URLs and IPs...

Tokenization, cont'd
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Is a space always a reasonable token delimiter? Is it better to perceive New York (or San Francisco, Tomaszów Mazowiecki...) as one token or two? Similarly with foreign phrases (cafe latte, en face).
Splitting on spaces may bring bad retrieval results: a search for York University will mainly fetch documents related to New York University.

Stop words
[ http://nlp.stanford.edu/IR-book/pdf/02voc.pdf ]
Stop words are very frequent words which carry little information (e.g. pronouns, articles). For example: the, a, an, of, to, and, in, is, it...
Consider an inverted list (i.e. exact positions of all word occurrences are kept in the index). If we use a stop list (i.e. discard stop words during indexing), the index will get smaller, say by 20–30% (note this has to do with Zipf's law).
A danger of using a stop list: some meaningful queries may consist exclusively of stop words: The Who, to be or not to be...
The modern trend is not to use a stop list at all.

Glimpse – a classic inverted file index
[ Manber & Wu, 1993 ]
Assumption: the text to index is a single unstructured file (perhaps a concatenation of documents). It is divided into 256 blocks of approximately the same size.
Each entry in the index is a word and the list of blocks in which it occurs, in ascending order. Each block number takes 1 byte (why?).

Block-addressing (e.g. Glimpse) index, general scheme
[ fig. 3 from Navarro et al., "Adding compression...", IR, 2000 ]

How to search in Glimpse?
Two basic queries:
• keyword search – the user specifies one or more words and asks for a list of all documents (or blocks) in which those words occur, in any positions,
• phrase search – the user specifies two or more words and asks for a list of all documents (or blocks) in which those words occur as a phrase, i.e. one right after another.
Why is phrase search important? Imagine a keyword search for +new +Scotland and a phrase search for "new Scotland". What's the difference?

How to search in Glimpse, cont'd
(fig. from http://nlp.stanford.edu/IR-book/pdf/01bool.pdf )
The key operation is to intersect several block lists. Imagine the query has 4 words, and their corresponding lists have lengths 10, 120, 5 and 30. How do you perform the intersection?
It's best to start with the two shortest lists, i.e. those of lengths 5 and 10 in our example. The intersection output will have length at most 5 (but usually less, possibly even 0, in which case we simply stop!). Then we intersect the obtained list with the list of length 30, and finally with the longest list. No matter the intersection order, the result is the same, but the speeds may be (vastly) different!
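A small Python sketch (not Glimpse's actual code) illustrating two of the ideas above: building a document-level inverted file for the three example texts T0–T2, and answering a keyword (AND) query by intersecting the posting lists, shortest list first.

# A toy document-level inverted file and an AND query over it (a sketch).
docs = ["it is what it is", "what is it", "it is a banana"]   # T0, T1, T2

# Build: word -> increasing list of document ids containing it.
index = {}
for doc_id, doc in enumerate(docs):
    for word in doc.split():
        postings = index.setdefault(word, [])
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)

def keyword_query(index, words):
    """Return the ids of the documents containing ALL the given words."""
    lists = sorted((index.get(w, []) for w in words), key=len)
    result = set(lists[0])              # start with the shortest list
    for lst in lists[1:]:
        result &= set(lst)
        if not result:                  # empty intermediate result: stop early
            break
    return sorted(result)

print(keyword_query(index, ["what", "is", "it"]))   # -> [0, 1]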
How to search in Glimpse, cont'd
We have obtained the intersection of all the lists; what then? It depends on the query: for the keyword query, we're done (we can now retrieve the found blocks / docs). For the phrase query, we still have to scan the resulting blocks / documents and check if and where the phrase occurs in them. To this end, we can use any fast exact string matching algorithm, e.g. Boyer–Moore–Horspool.
Conclusion: the smaller the resulting list, the faster the phrase query is handled.

Approximate matching with Glimpse
Imagine we want to find occurrences of a given phrase but with up to k (Levenshtein) errors. How to do it?
Assume for example k = 2 and the phrase grey cat. The phrase has two words, so there are the following per-word error possibilities: 0 + 0, 0 + 1, 0 + 2, 1 + 0, 1 + 1, 2 + 0.

Approximate matching with Glimpse, cont'd
E.g. 0 + 2 means here that the first word (grey) must be matched exactly, and the second with up to 2 errors (e.g. rats). So the approximate query grey cat translates to many exact queries (many of them rather silly...): grey cat, gray cat, grey rat, great cat, grey colt, gore cat, grey date...
All those possibilities are obtained by traversing the vocabulary structure (e.g. a trie). Another option is on-line approximate matching over the vocabulary represented as plain text (a concatenation of words) – see the figure on the next slide.

Approx matching with Glimpse, query example with a single word x
(fig. from Baeza-Yates & Navarro, Fast Approximate String Matching in a Dictionary, SPIRE'98)

Block boundaries problem
If the Glimpse blocks never cross document boundaries (i.e., they are natural), we don't have this problem... But if the block boundaries are artificial, then we may be unlucky and have one of our keywords at the very end of a block Bj and the next keyword at the beginning of Bj+1. How not to miss an occurrence?
There is a simple solution: blocks may overlap a little. E.g. the last 30 words of each block are repeated at the beginning of the next block. Assuming the phrase / keyword set has no more than 30 words, we are safe. But we may then need to scan more blocks than necessary (why?).

Glimpse issues and limitations
The authors claim their index takes only 2–4% of the original text size. But it works well only for text collections up to about 200 MB; beyond that it starts to degenerate, i.e., the block lists tend to be long and many queries are handled not much faster than with online search (without any index).
How can we help it? The overall idea is fine, so we must take care of the details. One major idea is to apply compression to the index...

(Glimpse-like) index compression
The purpose of data compression in inverted indexes is not only to save space (storage). It is also to make queries faster! (One big reason is less I/O, e.g. one disk access where without compression we'd need two. Another reason is more cache-friendly memory access.)

Compression opportunities (Navarro et al., 2000)
• Text (partitioned into blocks) may be compressed on the word level (faster text search in the last stage).
• Long lists may be represented as their complements, i.e. the numbers of the blocks in which a given word does NOT occur.
• Lists store increasing numbers, so the gaps (differences) between them may be encoded, e.g. 2, 15, 23, 24, 100 → 2, 13, 8, 1, 76 (smaller numbers); see the sketch below.
• The resulting gaps may be statistically encoded (e.g. with some Elias code; next slides...).
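To make the complement and gap bullets concrete, a small Python sketch of gap (delta) encoding/decoding and of the complement trick for long lists; the function names and the "more than half of the blocks" threshold are illustrative choices, not taken from Glimpse.

# Gap encoding of an increasing block list, plus the complement trick
# (a sketch; names and the threshold are illustrative only).
NUM_BLOCKS = 256                        # Glimpse uses 256 blocks

def to_gaps(blocks):
    """[2, 15, 23, 24, 100] -> [2, 13, 8, 1, 76]"""
    return [blocks[0]] + [b - a for a, b in zip(blocks, blocks[1:])]

def from_gaps(gaps):
    blocks, total = [], 0
    for g in gaps:
        total += g
        blocks.append(total)
    return blocks

def maybe_complement(blocks):
    """Store the complement instead if the word occurs in most blocks."""
    if len(blocks) > NUM_BLOCKS // 2:
        return ("absent", sorted(set(range(NUM_BLOCKS)) - set(blocks)))
    return ("present", blocks)

lst = [2, 15, 23, 24, 100]
print(to_gaps(lst))                     # [2, 13, 8, 1, 76]
print(from_gaps(to_gaps(lst)) == lst)   # True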
Compression paradigm: modeling + coding
Modeling: the way we look at the input data. The data can be perceived as individual (context-free) 1-byte characters, or as pairs of bytes, or triples, etc. We can look for matching sequences in the past buffer (bounded or unbounded, sliding or not), the minimum match length can be set to some value, etc. We can apply lossy or lossless transforms (DCT in JPEG, the Burrows–Wheeler transform), etc.
Modeling is difficult. Sometimes more art than science. Often data-specific.
Coding: what we do with the data transformed in the modeling phase, i.e. how we actually turn it into bits.

Intro to coding theory
A uniquely decodable code: one whose every concatenation of codewords can be uniquely parsed.
A prefix code: no codeword is a proper prefix of any other codeword. Also called an instantaneous code.
Trivially, any prefix code is uniquely decodable. (But not vice versa!)

Average codeword length
Let an alphabet have s symbols, with probabilities p0, p1, ..., ps–1. Let's have a (uniquely decodable) code C = [c0, c1, ..., cs–1]. The average codeword length for a given probability distribution is defined as
L(C, p0, p1, ..., ps–1) = Σi pi · |ci|,
where |ci| is the length of codeword ci in bits. So this is a weighted average length over the individual codewords. More frequent symbols have a stronger influence.

Entropy
Entropy is the average information in a symbol. Or: the lower bound on the average number (may be fractional) of bits needed to encode an input symbol. Higher entropy = less compressible data.
What "the entropy of data" is remains a vague issue. We always measure the entropy according to a given model (e.g. context-free, aka order-0, statistics).
Shannon's entropy formula (S – the "source" emitting "messages" / symbols):
H(S) = – Σi pi · log2 pi.

Redundancy
Simply speaking, redundancy is the excess in the representation of data. Redundant data means: compressible data. A redundant code is a non-optimal (or: far from optimal) code.
A code's redundancy (for a given probability distribution):
R(C, p0, p1, ..., ps–1) = L(C, p0, p1, ..., ps–1) – H(p0, p1, ..., ps–1) ≥ 0.
The redundancy of a code is the average excess (over the entropy) per symbol. It can't be below 0, of course.

Basic codes. Unary code
Extremely simple: the unary code of a positive integer n is n–1 '1' bits followed by a terminating '0' (or symmetrically, n–1 zeros and a '1'), so it takes n bits, e.g. 1 → 0, 2 → 10, 3 → 110, 4 → 1110.
Application: a very skew distribution (expected for a given problem).

Basic codes. Elias gamma code
Still simple, and usually much better than the unary code: to encode a positive integer v, write z = floor(log2 v) in unary (z '1' bits and a '0'), followed by the z least significant bits of v in binary; the total length is 2·floor(log2 v) + 1 bits. E.g. 1 → 0, 2 → 100, 3 → 101, 4 → 11000, 29 → 111101101.
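Before the worked Glimpse example below, a small numeric Python sketch of the average codeword length, entropy and redundancy definitions above; the probability distribution and the prefix code used here are made up for illustration.

# Average codeword length, order-0 entropy and redundancy (a sketch;
# the distribution and the prefix code are made-up examples).
import math

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}        # hypothetical p_i
code  = {"a": "0", "b": "10", "c": "110", "d": "111"}        # a prefix code

avg_len = sum(probs[s] * len(code[s]) for s in probs)        # L(C, p0..ps-1)
entropy = -sum(p * math.log(p, 2) for p in probs.values())   # H(p0..ps-1)
redundancy = avg_len - entropy                               # R = L - H >= 0

print("avg codeword length: %.3f bits/symbol" % avg_len)     # 1.750
print("entropy:             %.3f bits/symbol" % entropy)     # 1.750
print("redundancy:          %.3f bits/symbol" % redundancy)  # 0.000 here, since
                                                             # all p_i are powers of 1/2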
Elias gamma code in Glimpse (example)
an example occurrence list: [2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]
list of deltas (differences): [2, 2, 1, 1, 3, 2, 29, 2, 1, 51, 2, 24, 13, 1, 17, 52]
list of deltas minus 1 (as zero was previously impossible, except perhaps at the 1st position): [2, 1, 0, 0, 2, 1, 28, 1, 0, 50, 1, 23, 12, 0, 16, 51]
no compression (one item – one byte): 16 * 8 = 128 bits
with Elias gamma coding: 78 bits
101 100 0 0 101 100 111101101 100 0 11111010011 100 111101000 1110101 0 111100001 11111010100

Python code for the previous example

import math

occ_list = [2, 4, 5, 6, 9, 11, 40, 42, 43, 94, 96, 120, 133, 134, 151, 203]

delta_list = [occ_list[0]]
for i in range(1, len(occ_list)):
    delta_list += [occ_list[i] - occ_list[i-1] - 1]
print occ_list
print delta_list

total_len = 0
total_seq = ""
for x in delta_list:
    z = int(math.log(x+1, 2))                              # z = floor(log2(x+1))
    code1 = "1" * z + "0"                                  # unary part
    code2 = bin(x+1-2**z)[2:].zfill(z) if z >= 1 else ""   # binary part
    total_seq += code1 + code2 + " "
    total_len += len(code1 + code2)

# Python 2.6+ is needed, as the function bin() is used.
print "no compression:", len(occ_list)*8, "bits"
print "with Elias gamma coding:", total_len, "bits"
print total_seq

Huffman coding (1952) – basic facts
Elias codes assume that we know (roughly) the symbol distribution before encoding. What if we guessed it badly...?
If we know the symbol distribution, we may construct an optimal code; more precisely, optimal among the uniquely decodable codes having a codebook. It is called Huffman coding.
Example (on the slide): symbol frequencies and the final Huffman tree.

Huffman coding (1952) – basic facts, cont'd
Its redundancy is always less than 1 bit / symbol (but may be arbitrarily close to it). In most practical applications (data not very skew) the Huffman code's average length is only 1–3% worse than the entropy. E.g. the order-0 entropy of book1 (an English 19th-century novel, plain text) from the Calgary corpus is 4.567 bpc, while the Huffman average length is 4.569 bpc.

Why not use only Huffman coding for encoding occurrence lists?
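To accompany the Huffman slides above, a minimal Python sketch that computes Huffman codeword lengths for a frequency table and compares the average length with the order-0 entropy. The frequencies below are invented for illustration; they are not the example from the slide.

# Huffman codeword lengths and average length vs. entropy (a sketch;
# the frequency table below is invented for illustration).
import heapq, math

def huffman_lengths(freqs):
    """Return {symbol: codeword length in bits} of a Huffman code for freqs."""
    if len(freqs) == 1:
        return {s: 1 for s in freqs}
    # heap items: (subtree weight, tie-breaker, symbols in the subtree)
    heap = [(w, i, [s]) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(freqs, 0)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, syms1 = heapq.heappop(heap)
        w2, _, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:    # merging two subtrees adds 1 bit to each leaf below
            lengths[s] += 1
        heapq.heappush(heap, (w1 + w2, tie, syms1 + syms2))
        tie += 1
    return lengths

freqs = {"a": 45, "b": 13, "c": 12, "d": 16, "e": 9, "f": 5}   # hypothetical counts
lengths = huffman_lengths(freqs)
total = float(sum(freqs.values()))
avg_len = sum(freqs[s] / total * lengths[s] for s in freqs)
entropy = -sum(f / total * math.log(f / total, 2) for f in freqs.values())
print("codeword lengths: %s" % lengths)
print("avg length %.3f vs entropy %.3f bits/symbol" % (avg_len, entropy))

For these counts the average length lands within about 1% of the entropy, consistent with the 1–3% remark above.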