Text-retrieval Systems
NDBI010 Lecture Slides
KSI MFF UK
http://www.ms.mff.cuni.cz/~kopecky/teaching/ndbi010/
Version 10.05.12.13.30.en

Literature (textbooks)
• Introduction to Information Retrieval
– Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze: Cambridge University Press, 2008, http://informationretrieval.org/
• Dokumentografické informační systémy (Documentographic Information Systems)
– Pokorný J., Snášel V., Kopecký M.: Nakladatelství Karolinum, UK Praha, 2005
– Pokorný J., Snášel V., Húsek D.: Nakladatelství Karolinum, UK Praha, 1998
• Textové informační systémy (Text Information Systems)
– Melichar B.: Vydavatelství ČVUT, Praha, 1997

Further links (books)
• Computer Algorithms: String Pattern Matching Strategies
– Jun-ichi Aoe: IEEE Computer Society Press, 1994
• Concept Decomposition for Large Sparse Text Data Using Clustering
– Inderjit S. Dhillon, Dharmendra S. Modha: IBM Almaden Research Center, 1999

Further links (articles)
• The IGrid Index: Reversing the Dimensionality Curse for Similarity Indexing in High Dimensional Space
– Charu C. Aggarwal, Philip S. Yu: IBM T. J. Watson Research Center
• The Pyramid Technique: Towards Breaking the Curse of Dimensionality
– S. Berchtold, C. Böhm, H.-P. Kriegel: ACM SIGMOD Conference Proceedings, 1998
• Affinity Rank: A New Scheme for Efficient Web Search
– Yi Liu, Benyu Zhang, Zheng Chen, Michael R. Lyu, Wei-Ying Ma, 2004
• Improving Web Search Results Using Affinity Graph
– Benyu Zhang, Hua Li, Yi Liu, Lei Ji, Wensi Xi, Weiguo Fan, Zheng Chen, Wei-Ying Ma
• Efficient Computation of PageRank
– T. H. Haveliwala: Technical report, Stanford University, 1999

Further links (older)
• Introduction to Modern Information Retrieval
– Salton G., McGill M. J.: McGraw-Hill, 1981
• Výběr informací v textových bázích dat (Information Retrieval in Text Databases)
– Pokorný J.: OVC ČVUT Praha, 1989

Lecture No. 1
Introduction. Overview of the problem; measuring informativeness.

Retrieval system origin
• The 1950s: the gradual automation of the procedures used in libraries.
• Today a separate branch of information systems (IS):
– Factual IS: process information having a defined internal structure (usually in the form of tables).
– Bibliographic IS: process information in the form of text written in natural language, without a strict internal structure.

Interaction with TRS
1. Query formulation
2. Comparison
3. Hit-list obtaining
4. Query tuning/reformulation
5. Document request
6. Document obtaining

TRS Structure
I) Document disclosure system (steps 1-4)
• Returns secondary information: author, title, ...
II) Document delivery system (steps 5-6)
• Need not be supported by the software at all.

Query Evaluation
• Direct comparison of the query against every document is time-consuming.
• A document model is used for the comparison instead. Indexation is a lossy process, usually based on the presence of words in documents; it produces structured data suitable for efficient comparison.
• The query is processed to obtain the needed form; the processed query is then compared against the index.

Text preprocessing
• Searching is more effective using a created (structured) model of the documents, but it can use only information stored in the model, not in the documents themselves.
• The goal is to create a model preserving as much information from the original documents as possible.
• Problem: a lot of ambiguity in text.
• Many unresolved tasks concerning document understanding still exist.

Text understanding
• Writer:
– Text = a sequence of words in natural language.
– Each word stands for some idea/imagination of the writer.
– Ideas represent real subjects, activities, etc.
• Reader: follows (not necessarily exactly the same) mappings from left to right.

• Synonymy of words
– Several words can have the same meaning for the writer:
• car = automobile
• sick = ill
• Homonymy of words
– One word can have more than one meaning:
• fluke: fish, anchor, …
• crown: currency, treetop, jewel, …
• class: year of studies, category in set theory, …

• Word meanings need not be exactly the same.
– Hierarchical overlapping: animal > horse > stallion
– Associativity among meanings: calculator ~ computer ~ processor

• The mapping between subjects, ideas and words can depend on the individual persons, both readers and writers.
– Two people can assign partly or completely different meanings to a given term.
– Two people can imagine different things for the same word (mother, room, ...).
• As a result, two different readers can obtain different information from the same text, both from each other and in comparison with the author's intention.

• Homonymy and ambiguity grow with the transition from words/terms to sentences and bigger parts of the text.
• Example of an English sentence with several grammatically correct meanings (here a human reader probably eliminates the nonsense meaning):
– See Podivné fungování gramatiky (Strange Workings of Grammar), http://www.scienceworld.cz/sw.nsf/lingvistika
– In the sentence "Time flies like an arrow" either "flies" or "like" can be chosen as the predicate, which produces two significantly different meanings.

Text preprocessing
• Inclusion of linguistic analysis in the text processing can partially solve the problem.
– Disambiguation
• Selection of the correct meaning of a term in the sentence:
– according to grammar (verb versus noun, etc.),
– according to context (more complicated; can distinguish between two verbs, two nouns, etc.).
– Lemmatization
• For each term/word in the text, after its proper meaning is found, it assigns:
– the type of word, plural vs. singular, present tense vs. preterite, etc.,
– the base form (singular for nouns, infinitive for verbs, …),
– information obtained by sentence analysis (subject, predicate, object, ...).

• Other tasks that can be more or less solved:
– Identification of collocations (World War Two, ...)
– Assigning nouns to the pronouns used in the text (very complex and hard to solve, sometimes even for a human reader)

Precision and Recall
• As a result of the ambiguities there exists no optimal text retrieval system.
• After the answer to a query is obtained, the following values can be evaluated:
– Number of returned documents in the list: Nv
• The system supposes them to be relevant (useful) according to their match with the query.
– Number of returned relevant documents: Nvr
• The questioner finds them to be really relevant, as they fulfill his/her information needs.
– Number of all relevant documents in the system: Nr
• Very hard to guess for large and unknown collections.

• Two TRSs can (and do) return two different results for the same query, partly or completely unique. How to compare the quality of those systems?
[Figure: two overlapping result sets returned by TRS1 and TRS2 within the database, each covering a different part of the relevant documents.]
• Two questioners can consider different documents relevant for the same, equally formulated query. How to meet both subjective expectations of the questioners?

• The quality of the result set is usually evaluated using the numbers Nv, Nr, Nvr:
– Precision: P = Nvr / Nv, the probability that a returned document is relevant to the user.
– Recall: R = Nvr / Nr, the probability that a relevant document is returned to the user.

• Both coefficients depend on the feeling of the questioner.
• The same document can fulfill the information needs of one questioner and at the same time fail to meet them for another one.
– Each user determines different values of Nr and Nvr.
– Both measures P and R depend on them.

• In the optimal case P = R = 1: the response contains all relevant documents and only them.
• Usually the answer in the first iteration is neither precise nor complete.

Query tuning
• Iterative modification of the query targeted at increasing the quality of the response.
• Theoretically it is possible to reach the optimum P = R = 1 sooner or later …
• … but due to (not only) the ambiguities both measures depend inversely on each other, i.e. P·R ≈ const < 1:
– in order to increase P, the absolute number of relevant documents in the response is decreased;
– in order to increase R, the number of irrelevant documents grows rapidly.
• The probability of reaching quality above this limit is low.
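Given these definitions, both measures are a one-line computation each. The following minimal sketch prints them for a single answer list; all the counts are made up for illustration.

program PrecisionRecall;
{ Precision and recall of one answer list, from the counts above. }
var
  Nv, Nvr, Nr: Integer;  { returned, returned relevant, all relevant }
  P, R: Real;
begin
  Nv := 20; Nvr := 5; Nr := 50;    { hypothetical counts }
  P := Nvr / Nv;                   { P = Nvr / Nv }
  R := Nvr / Nr;                   { R = Nvr / Nr }
  WriteLn('Precision = ', P:0:2);  { prints 0.25 }
  WriteLn('Recall    = ', R:0:2);  { prints 0.10 }
end.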
Prediction Criterion
• At the time of query formulation the questioner has to guess the terms (words) the author used to express a given idea.
• Problems are caused e.g. by:
– synonyms (the author could have used a synonym the user did not remember),
– overlapping meanings of terms,
– colorful poetical hyperboles,
– …

• The problem can be partly suppressed by the inclusion of a thesaurus, containing:
– hierarchies of terms and their meanings,
– sets of synonyms,
– definitions of associations between terms.
• The questioner can use it during query formulation.
• The system can use it during query evaluation.

• The user often tends to tune his/her query in a conservative way:
– fixing the terms used in the first iteration ("they must be the best because I remembered them immediately") and varying only the additional terms at the end of the query.
• It is useful to support the user in (semi)automatically eliminating wrong terms and replacing them with useful ones that describe the really relevant documents.

Maximal Criterion
• The questioner is usually not able or willing to go through an exhaustive number of hits in the response to find the relevant ones.
– Usually at most 20-50 documents, according to their length.
• It is therefore necessary not only to sort out the documents not matching the query, but to order the answer list by supposed relevancy in descending order, with the supposedly best documents at the beginning.

• Due to the maximal criterion the user usually tries to increase the precision of the answer:
– a small number of resulting documents, containing as high a ratio of relevant documents as possible.
• Some problematic domains require both high precision and high recall:
– lawyers, especially in territories having case law based on precedents (they need to find as many similar cases as possible).

Exact pattern matching
Why to search for patterns in text:
• To index documents or queries
– to involve only a given set of terms (lemmas),
– to omit a given set of meaningless terms (lemmas) such as conjunctions, numerals, pronouns, …
• To highlight given terms in the documents presented to users.
• …

Algorithm classification by preprocessing

                 | pattern: NO | pattern: YES
  text: NO       |     I.      |     II.
  text: YES      |    III.     |     IV.

• I: the brute-force algorithm.
• II: the other algorithms below (suitable for TRS, where the text is not known in advance). Class II is further divided according to:
– the number of simultaneously matched patterns: 1, N, ∞;
– the direction of comparison: left to right, or right to left.

Class II algorithms

  direction \ #patterns |  1  |  N  |  ∞
  left to right         | KMP | AC  | KA
  right to left         | BM  | CW  | 2WJFA

Exact Pattern Matching: Searching for One Pattern Within Text

Brute-force algorithm
• Let m denote the length of the text t and n the length of the pattern p.
• If the i-th position in the text doesn't match the j-th position in the pattern:
– shift the pattern one position to the right and restart the comparison at the first (leftmost) position of the pattern.
• Average time complexity o(m·n), e.g. when searching for a^(n-1)b in a^(m-1)b.
• For natural-language texts and patterns about m·const operations, i.e. o(m); const is a small number (< 10), dependent on the language.
[Figure: the pattern abccbabcbbb aligned under a text; after a mismatch it is shifted one position to the right and compared again from its beginning.]
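A minimal runnable sketch of the class-I (brute-force) search follows; the text string below is re-assembled by hand from the slide's illustration, so treat it as an assumption.

program BruteForce;
{ Brute-force search: on a mismatch, shift the pattern one position
  to the right and restart the comparison at its first character. }
var
  t, p: string;
  i, j, m, n: Integer;
begin
  t := 'abccbabcabbcaabccbabcbbbabcc';  { hypothetical text }
  p := 'abccbabcbbb';                   { pattern from the slides }
  m := Length(t); n := Length(p);
  for i := 1 to m - n + 1 do
  begin
    j := 1;
    while (j <= n) and (t[i + j - 1] = p[j]) do
      Inc(j);
    if j > n then
      WriteLn('Pattern found at position ', i);  { prints 14 here }
  end;
end.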
Lecture No. 2

Knuth-Morris-Pratt Algorithm
• Left-to-right searching for one pattern.
• In comparison with the brute-force algorithm, KMP eliminates the repeated comparison of already successfully compared characters of the text.
• The pattern is shifted as little as possible, aligning a proper prefix of the already examined part of the pattern under the equal fragment of the text.
– E.g. after matching abccbabc of the pattern abccbabcbbb and a mismatch at the 9th position, the brute-force algorithm shifts by one; KMP shifts by five at once, aligning the prefix abc under the last three matched text characters, and continues the comparison with p[4].

KMP Algorithm
• In front of the mismatch position a proper prefix of the already examined part of the pattern remains.
• It has to be equal to a suffix of the already examined part of the pattern.
• The longest such prefix determines the smallest possible shift.

• If the j-th position of pattern p doesn't match the i-th position of text t, and the longest proper prefix of the already examined part of the pattern that equals a suffix of that examined part has length k, then:
– after the shift, k characters remain before the mismatch position,
– the comparison restarts from the (k+1)-st position of the pattern.
• The restart positions are pre-computed and stored in an auxiliary array A; in this case A[j] = k+1.

begin {KMP}
  m := length(t); n := length(p);
  i := 1; j := 1;
  while (i <= m) and (j <= n) do
  begin
    while (j > 0) and (p[j] <> t[i]) do
      j := A[j];
    inc(i); inc(j);
  end; {while}
  if (j > n) then {pattern found at position i-j+1}
  else {not found}
end; {KMP}

Obtaining the array A for the KMP search
• A[1] = 0.
• If all the values are known for positions 1 .. j-1, it is easy to compute the correct value for the j-th position.
– A[j-1] contains the correction for the (j-1)-st position, i.e. A[j-1]-1 characters at the beginning of the pattern equal the same number of characters before the (j-1)-st position.
• If the (j-1)-st position of the pattern matches the A[j-1]-th position, the prefix can be prolonged, so A[j] = A[j-1]+1.

  position: 1 2 3 4 5 6 7 8 9 10 11
  pattern:  a b c c b a b c b b  b
  A so far: 0 1 1 1 1 1 2 ?
Since p[7] = 'b' = p[A[7]] = p[2], we get A[8] = A[7]+1 = 3, and analogously A[9] = 4.

• If the (j-1)-st and the A[j-1]-th positions of the pattern don't match, the correction A[j-1]+1 would cause a mismatch at the previous position in the text.
• The correction for such a mismatch is already known (the values A[1] .. A[j-1] are already computed).
• It is necessary to follow the corrections starting at the (j-1)-st position until the (j-1)-st position of the pattern matches the found position, or the correction reaches 0 (falls out of the pattern).

  position: 1 2 3 4 5 6 7 8 9 10 11
  pattern:  a b c c b a b c b b  b
  A so far: 0 1 1 1 1 1 2 3 4 ?
Computing A[10]: p[9] = 'b' differs from p[A[9]] = p[4] = 'c', so the shift A[9]+1 cannot be used. Following the known corrections, first the error at the 4th position is corrected (p[9] is compared with p[A[4]] = p[1] = 'a', again a mismatch), then the error at the 1st position; the correction reaches 0, so A[10] = 0+1 = 1.

  position: 1 2 3 4 5 6 7 8 9 10 11
  pattern:  a b c c b a b c b b  b
  A:        0 1 1 1 1 1 2 3 4 1  1

Obtaining the array A: algorithm

begin
  A[1] := 0;
  n := length(p);
  j := 2;
  while (j <= n) do
  begin
    k := j-1; l := k;
    repeat l := A[l]; until (l = 0) or (p[l] = p[k]);
    A[j] := l+1;
    inc(j);
  end;
end;

KMP algorithm: complexity
• The time complexity of KMP is o(m+n).
• Already successfully compared positions of the text are never checked again.
• After each shift of the pattern the given mismatch position can be checked again, but there are at most o(m) shifts of the pattern.
• Similarly, the time complexity of the preprocessing is o(n).

KMP Optimization
• It is possible to further optimize the auxiliary array A.
• If the character p[j] equals p[A[j]], the shift would align under the mismatch position the same character as the one that just caused the mismatch.
• This optimization can be computed in advance into another auxiliary array A', where:
– A'[j] =def A'[A[j]] if p[j] = p[A[j]],
– A'[j] =def A[j] otherwise.
• The array A' is then used instead of A during the search phase.

  p:  a b c c b a b c b b b
  A:  0 1 1 1 1 1 2 3 4 1 1
  A': 0 1 1 1 1 0 1 1 4 1 1
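The two fragments above (the construction of A and the search loop) can be assembled into one runnable program; the following sketch only adds a hypothetical driver text and output statements.

program KMPSearch;
{ Failure-array construction followed by the KMP left-to-right search. }
var
  t, p: string;
  A: array[1..255] of Integer;
  i, j, k, l, m, n: Integer;
begin
  t := 'abccbabcabbcaabccbabcbbbabcc';  { hypothetical text }
  p := 'abccbabcbbb';
  m := Length(t); n := Length(p);
  { build the auxiliary array A, as on the previous slide }
  A[1] := 0; j := 2;
  while j <= n do
  begin
    k := j - 1; l := k;
    repeat l := A[l] until (l = 0) or (p[l] = p[k]);
    A[j] := l + 1;
    Inc(j);
  end;
  { KMP search }
  i := 1; j := 1;
  while (i <= m) and (j <= n) do
  begin
    while (j > 0) and (p[j] <> t[i]) do
      j := A[j];
    Inc(i); Inc(j);
  end;
  if j > n then
    WriteLn('Pattern found at position ', i - j + 1)  { prints 14 here }
  else
    WriteLn('Pattern not found');
end.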
Boyer-Moore Algorithm
• Right-to-left search for one pattern using pattern preprocessing:
– the pattern is shifted left to right,
– the characters of the pattern are compared from right to left.

• Let n denote the length of the pattern and i the position of the end of the pattern in the text. If a mismatch occurs at position n-j of the pattern against position i-j of the text, where t[i-j] = x, j = 0..n-1 and x ∈ X (the alphabet), then:
– the pattern is moved SHIFT[n-j, x] characters to the right,
– the comparison restarts at the end of the pattern, i.e. with j = 0.

• There exist several different definitions of SHIFT[n-j, x].
• Variant 1: the auxiliary array SHIFT[0..n-1, X] is defined for each position in the pattern and each character of the alphabet X as follows:
– the smallest possible shift aligning the character x in the text at the mismatch position with the same character in the pattern;
– if there is no such character x in the pattern to the left of the mismatch position, shift the pattern to start immediately after the mismatch position.

Boyer-Moore Algorithm (1)
• Average time complexity is o(m·n), e.g. when searching for ba^(n-1) in a^(m-n)ba^(n-1).
• For huge alphabets and patterns with a small number of distinct characters (especially for words searched in natural-language texts) the average time complexity is o(m/n), i.e. the longer the pattern, the more efficient the search.

• Example: searching for ANANAS in the text TROPICKÝM OVOCEM JE I ANANAS. (Czech: "a tropical fruit is also the pineapple").
[Figure: successive right-to-left alignments of ANANAS under the text; each mismatch shifts the pattern to the right according to the SHIFT array, until the occurrence at the end of the sentence is found.]

• Representation of the SHIFT array for the pattern ANANAS:
[Figure: an automaton-like picture; full arrows depict a successful comparison of one character, the other arrows stand for the shift aligning the mismatched text character with the same pattern character, and missing arrows mean a shift past the mismatch position.]

• Another representation, saving space: x ∈ {'A','N','S','X'}, where 'X' stands for any character not appearing in the pattern.
– Values beginning with "+" represent the length of a shift (with a restart at j = 0).
– Values without "+" represent the new value of j after one successful comparison; reaching j = 6 means the whole pattern was successfully found.

  x \ j | 0  | 1  | 2  | 3  | 4  | 5
  'A'   | +1 | 2  | +1 | 4  | +1 | 6
  'N'   | +2 | +1 | 3  | +1 | 5  | +1
  'S'   | 1  | +5 | +4 | +3 | +2 | +1
  'X'   | +6 | +5 | +4 | +3 | +2 | +1
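For illustration, a simplified executable variant of the right-to-left search follows. It keeps only the classic bad-character rule (a shift indexed by the mismatched text character alone, not by the mismatch position), which is weaker than the position-dependent SHIFT table above but shows the mechanics; the example text is the slide's sentence with diacritics dropped.

program BMSearch;
{ Simplified Boyer-Moore: right-to-left comparison, bad-character rule. }
var
  t, p: string;
  last: array[Char] of Integer;   { rightmost position of each char in p }
  c: Char;
  i, j, m, n, shift: Integer;
begin
  t := 'TROPICKYM OVOCEM JE I ANANAS.';
  p := 'ANANAS';
  m := Length(t); n := Length(p);
  for c := Low(Char) to High(Char) do last[c] := 0;
  for j := 1 to n do last[p[j]] := j;
  i := n;                          { i = end of the pattern window in t }
  while i <= m do
  begin
    j := 0;                        { j = number of characters matched from the right }
    while (j < n) and (p[n - j] = t[i - j]) do
      Inc(j);
    if j = n then
    begin
      WriteLn('Pattern found ending at position ', i);
      i := i + 1;
    end
    else
    begin
      { align the mismatched text char with its rightmost occurrence in p }
      shift := (n - j) - last[t[i - j]];
      if shift < 1 then shift := 1;
      i := i + shift;
    end;
  end;
end.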
Benchmark on Artificial Text

  Text              ('a'^rnd(200) 'b ')^1000
  Size              100 KB
  #patterns         1 000
  #unique patterns  200
  #compar. Br.-f.   24 128 586
  #compar. KMP         885 747  (3.7%)

Benchmark on English Text
Note: each unique pattern is searched at its original position.

                    | Words English        | Bi-words English
  Size              | 130 KB               | 130 KB
  #patterns         | 18 075               | 9 038
  #unique patterns  | 1 570                | 4 395
  #compar. Br.-f.   | 256 799 832          | 433 721 058
  #compar. KMP      | 255 942 030 (99.7%)  | 430 220 025 (99.2%)
  #compar. BM       |  50 114 658 (19.5%)  |  52 046 084 (12.0%)

Review of Algorithms

                        | no preprocessing | pattern preproc., L→R | pattern preproc., R→L
  time_max              | o(m·n)           | o(m+n)                | o(m·n)
  time_avg              | o(m·n)           | o(m+n)                | o(m·n)
  time_avg (nat. lang.) | o(m)             | o(m+n)                | o(m/n)

Exact pattern matching: searching for a finite set of patterns

Aho-Corasick Algorithm
• Left-to-right searching for more patterns simultaneously.
• An extension of the KMP algorithm:
– preprocessing of the patterns,
– linear reading of the text.
• Average time complexity o(m + Σ n_i), where m denotes the length of the text and n_i the length of the i-th pattern.

A-C Algorithm
• Text T; set of patterns P = {P1, P2, …, Pk}.
• Search engine S = (Q, X, q0, g, f, F):
– Q: a finite set of states
– X: an alphabet
– q0 ∈ Q: the initial state
– g: Q × X → Q: the forward (goto) function
– f: Q → Q: the backward (fail) function
– F ⊆ Q: the set of final states

• The states in Q correspond to all prefixes of all patterns; q0 represents the empty prefix.
• g(q, x) = qx iff qx ∈ Q; else g(q0, x) = q0; else g(q, x) is undefined.
• f(q) for q <> q0 equals the longest proper suffix of q present in Q; |f(q)| < |q|.
• The final states correspond to all complete patterns, i.e. F = P.

• The search is based on the total (fully defined) transition function δ: Q × X → Q:
– δ(q, x) = g(q, x), iff g(q, x) is defined,
– δ(q, x) = δ(f(q), x) otherwise.
• The definition is correct, because |f(q)| (the distance of f(q) from the initial state) is less than |q|, and g(q0, ·) is completely defined.

• f is constructed in order of increasing |q|, i.e. according to the distance of the state from the beginning.
• It is not necessary to define f(q0).
• If |q| = 1, the longest proper suffix is empty, i.e. f(q) = q0.
• f(qx) = f(g(q, x)) = δ(f(q), x):
– to determine the fail value for a state qx, accessible from state q using character x, start in q, follow the fail function to f(q) and then go forward using the character x.

• Example: P = {"he", "her", "she"}; states "", "h", "he", "her", "s", "sh", "she".
[Figure: the goto function g; the chains "" -h→ "h" -e→ "he" -r→ "her" and "" -s→ "s" -h→ "sh" -e→ "she"; all other characters lead from "" back to "".]
[Figure: the fail function f; e.g. f("sh") = "h" and f("she") = "he", the remaining states fail to "".]

• Detection of all occurrences of patterns, even of patterns hidden inside other ones:
– either collect, for each state, all patterns detected in it by going through all states accessible from it via the fail function, i.e. the final states in {f^i(q), i ≥ 0},
– or, after each transition to a state q, go through all states linked together by the fail function and report all final states.

A-C Algorithm: search

function delta(q: states; x: alphabet): states;
begin {delta}
  while g[q,x] = fail do q := f[q];
  delta := g[q,x];
end; {delta}

begin {A-C}
  q := 0;
  for i := 1 to length(t) do
  begin
    q := delta(q, t[i]);
    report(q); {report all found patterns ending by t[i]}
  end; {for}
end; {A-C}

KMP vs. A-C for one pattern
• They are equal algorithms in different formulations:
– j (the compared position) ~ q_{j-1} (the number of compared positions)
– A[1] = 0 ~ g(q0, *) = q0
– A[j] = k ~ f(q_{j-1}) = q_{k-1}

  p: a b c c b a b c b b b
  A: 0 1 1 1 1 1 2 3 4 1 1
Commentz-Walter Algorithm
• Right-to-left search for more patterns simultaneously.
• A combination of the B-M and A-C algorithms.
• Average time complexity (for natural languages) o(m / min(n_i)), where m denotes the length of the text and n_i the length of the i-th pattern.

C-W Algorithm
• Text T; set of patterns P = {P1, P2, …, Pk}.
• Search engine S = (Q, X, q0, g, f, F):
– Q: a finite set of states; X: an alphabet; q0 ∈ Q: the initial state
– g: Q × X → Q: the forward function; f: Q → Q: the backward (fail) function; F ⊆ Q: the set of final states

• The states in Q represent all suffixes of all patterns; q0 represents the empty suffix.
• g(q, x) = xq, iff xq ∈ Q (the matched suffix grows to the left).
• f(q) for q <> q0 equals the longest proper prefix of q present in Q; |f(q)| < |q|.
• The final states correspond to all complete patterns, i.e. F = P.

• Example for P = {"he", "her", "she"}:
[Figure: the forward function over the suffix states "" -e→ "e" -h→ "he" -s→ "she" and "" -r→ "r" -e→ "er" -h→ "her".]
[Figure: the backward function; the arrows going to q0 are not shown.]

• LMIN = min(n_i): the length of the shortest pattern.
• h(q) = |q|: the distance of state q from the initial state.
• char(x): the minimal distance (depth) of a state reachable via character x.
• pred(q): the predecessor of state q, i.e. the state representing the suffix shorter by one character.
• If g(q, x) is not defined, the patterns (the search engine) are shifted by shift(q, x) positions to the right and the search restarts in state q0 again:
– shift(q, x) = min( max( shift1(q, x), shift2(q) ), shift3(q) )

• shift1(q, x) = char(x) − h(q) − 1, if > 0
– aligns the "collision" character of the text with its occurrence in some pattern.
[Figure: a set of Czech patterns aligned under the text "…mykolog…"; the collision character 'y' yields char('y') − h('kolo') − 1 = 8 − 4 − 1 = 3, i.e. a shift of +3.]
• shift2(q) = min({LMIN} ∪ {h(q') − h(q) | f(q') = q})
– aligns the checked part of the text; the states whose fail function goes to q must be taken into account.
[Figure: the same pattern set; here the resulting shift is +1.]
• shift3(q0) = LMIN; shift3(q) = min({shift3(pred(q))} ∪ {h(q') − h(q) | ∃k: f^k(q') = q ∧ q' ∈ F})
– aligns some (any) suffix of the checked text; the collision character need not be matched again to find a match.
[Figure: the same pattern set; here the resulting shift is +2.]

Lecture No. 3
Exact Pattern Matching: Searching for a (Regular) Infinite Set of Patterns in Text

Regular expressions and languages
• A regular expression R over an alphabet X. Atomic expressions and their values h(R):
– $\emptyset$: the empty language
– $\varepsilon$: the language $\{\varepsilon\}$ containing the empty word only
– $a$, $a \in X$: the language $\{a\}$
• Operations:
– concatenation $U.V$: $h(U.V) = \{uv \mid u \in h(U) \wedge v \in h(V)\}$
– union $U+V$: $h(U+V) = h(U) \cup h(V)$
– $V^k = V.V\ldots V$ (k times), $V^* = V^0+V^1+V^2+\ldots$, $V^+ = V^1+V^2+V^3+\ldots$

Regular expression features
1) $U+(V+W) = (U+V)+W$
2) $U.(V.W) = (U.V).W$
3) $U+V = V+U$
4) $(U+V).W = (U.W)+(V.W)$
5) $U.(V+W) = (U.V)+(U.W)$
6) $U+U = U$
7) $\varepsilon.U = U$
8) $\emptyset.U = \emptyset$
9) $U+\emptyset = U$
10) $U^* = \varepsilon + U^*.U = (\varepsilon+U)^*$

(Deterministic) Finite Automaton
• K = (Q, X, q0, δ, F):
– Q: a finite set of states; X: an alphabet; q0 ∈ Q: the initial state
– δ: Q × X → Q: a totally defined transition function
– F ⊆ Q: a set of final states
• Configuration of the FA: (q, w) ∈ Q × X*.
• Transition of the FA: the relation ⊢ on configurations, (q, aw) ⊢ (q', w) ⟺ δ(q, a) = q'.
• The automaton accepts a word w iff (q0, w) ⊢* (q, ε) for some q ∈ F.

Non-deterministic Finite Automaton
• a) default definition K = (Q, X, q0, δ, F); b) extended definition K = (Q, X, S, δ, F), where S ⊆ Q is (alternatively) a set of initial states.
• δ: Q × X → P(Q) is the transition function; F ⊆ Q is the set of final states.

• NFA for P = {"he", "her", "she"}:
– with S = {1, 4, 8}, F = {3, 7, 11}: states 1, 4 and 8 loop on every character, and 1 -h→ 2 -e→ 3; 4 -h→ 5 -e→ 6 -r→ 7; 8 -s→ 9 -h→ 10 -e→ 11.
– with S = {1}, F = {3, 4, 7}: state 1 loops on every character, and 1 -h→ 2 -e→ 3 -r→ 4; 1 -s→ 5 -h→ 6 -e→ 7.

NFA→DFA Conversion
• From K = (Q, X, S, δ, F) build K' = (Q', X, q'0, δ', F'):
– Q' = P(Q); q'0 = S
– δ'(q', x) = ∪ { δ(q, x) | q ∈ q' }
– F' = { q' ∈ Q' | q' ∩ F ≠ ∅ }
• In practice the table is built only for the reachable states.

• Conversion with the set of initial states allowed, S = {1, 4, 8} (the loops keep 1, 4, 8 in every DFA state); transitions are given by target labels, x stands for any other character:

  lbl | state          | e | h | r | s | x
  1   | {1,4,8}        | 1 | 2 | 1 | 3 | 1
  2   | {1,2,4,5,8}    | 4 | 2 | 1 | 3 | 1
  3   | {1,4,8,9}      | 1 | 5 | 1 | 3 | 1
  4   | {1,3,4,6,8}    | 1 | 2 | 6 | 3 | 1
  5   | {1,2,4,5,8,10} | 7 | 2 | 1 | 3 | 1
  6   | {1,4,7,8}      | 1 | 2 | 1 | 3 | 1
  7   | {1,3,4,6,8,11} | 1 | 2 | 6 | 3 | 1

Final states: labels 4, 6, 7.

• Conversion with only one initial state allowed, S = {1} (the loop keeps state 1 in every DFA state):

  lbl | state   | e | h | r | s | x
  1   | {1}     | 1 | 2 | 1 | 3 | 1
  2   | {1,2}   | 4 | 2 | 1 | 3 | 1
  3   | {1,5}   | 1 | 5 | 1 | 3 | 1
  4   | {1,3}   | 1 | 2 | 6 | 3 | 1
  5   | {1,2,6} | 7 | 2 | 1 | 3 | 1
  6   | {1,4}   | 1 | 2 | 1 | 3 | 1
  7   | {1,3,7} | 1 | 2 | 6 | 3 | 1

Final states: labels 4, 6, 7.
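The subset construction can be sketched directly with Pascal sets. The program below converts the one-initial-state NFA for P = {"he", "her", "she"} and should discover the seven DFA states of the table above; the transition encoding inside Step is hand-written from the NFA description, so treat it as an assumption.

program SubsetConstruction;
{ NFA->DFA conversion, built only for the reachable DFA states. }
type
  TStateSet = set of 1..7;
const
  Sigma = 'ehrsx';                 { 'x' = any other character }
var
  dfa: array[1..20] of TStateSet;  { discovered DFA states }
  cnt, i, k, q: Integer;
  c: Char;
  next: TStateSet;

  { one NFA step delta(q, c), with the all-character loop on state 1 }
  function Step(q: Integer; c: Char): TStateSet;
  var r: TStateSet;
  begin
    r := [];
    if q = 1 then r := [1];        { the * loop on the initial state }
    case q of
      1: if c = 'h' then r := r + [2]
         else if c = 's' then r := r + [5];
      2: if c = 'e' then r := [3];
      3: if c = 'r' then r := [4];
      5: if c = 'h' then r := [6];
      6: if c = 'e' then r := [7];
    end;
    Step := r;
  end;

  function Find(s: TStateSet): Integer;  { index of s among dfa[], 0 if new }
  var j: Integer;
  begin
    Find := 0;
    for j := 1 to cnt do
      if dfa[j] = s then Find := j;
  end;

begin
  cnt := 1; dfa[1] := [1];         { q'0 = {1} }
  i := 1;
  while i <= cnt do
  begin
    for k := 1 to Length(Sigma) do
    begin
      c := Sigma[k];
      next := [];
      for q := 1 to 7 do
        if q in dfa[i] then next := next + Step(q, c);
      if Find(next) = 0 then
      begin
        Inc(cnt); dfa[cnt] := next;
      end;
      WriteLn('state ', i, ' --', c, '--> state ', Find(next));
    end;
    Inc(i);
  end;
  WriteLn('Total DFA states: ', cnt);   { 7, as in the table above }
end.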
Derivation of regular expressions
• The derivative of a regular expression V by a character (or word) x describes what can follow x in the words of h(V):
$$h\!\left(\frac{dV}{dx}\right) = \{\,v \mid xv \in h(V)\,\}$$
• E.g. if h(V) = {shell, stop, plot}, then h(dV/ds) = {hell, top} and h(dV/dp) = {lot}.

• Rules for computing the derivatives, for $a, b \in X$, $b \neq a$:
$$\frac{d\emptyset}{da} = \emptyset, \quad \frac{d\varepsilon}{da} = \emptyset, \quad \frac{da}{da} = \varepsilon, \quad \frac{db}{da} = \emptyset$$
$$\frac{d(U+V)}{da} = \frac{dU}{da} + \frac{dV}{da}$$
$$\frac{d(U.V)}{da} = \frac{dU}{da}.V \ \text{ if } \varepsilon \notin h(U), \qquad \frac{d(U.V)}{da} = \frac{dU}{da}.V + \frac{dV}{da} \ \text{ if } \varepsilon \in h(U)$$
$$\frac{dV^*}{da} = \frac{dV}{da}.V^*$$
$$\frac{dV}{dx} = \frac{d}{da_n}\!\left(\frac{d}{da_{n-1}}\!\left(\cdots\frac{dV}{da_1}\right)\right) \ \text{ for } x = a_1 \ldots a_{n-1} a_n$$

Construction of a DFA using derivations of an RE
• The derivation of regular expressions allows building a DFA for any regular expression directly and algorithmically.
• Let V be a given regular expression in alphabet X. Each state of the DFA defines the set of words that move the DFA from this state to a final state, so every state can be associated with the regular expression defining this set of words:
– q0 = V
– δ(q, x) = dq/dx
– F = { q ∈ Q | ε ∈ h(q) }

• Example: V = (0+1)*.01 in alphabet X = {0, 1}; q0 = (0+1)*.01.
$$\frac{d\,(0+1)^*.01}{d0} = \frac{d(0+1)^*}{d0}.01 + \frac{d\,01}{d0} = (0+1)^*.01 + 1$$
$$\frac{d\,(0+1)^*.01}{d1} = \frac{d(0+1)^*}{d1}.01 + \frac{d\,01}{d1} = (0+1)^*.01 + \emptyset = (0+1)^*.01$$

  lbl | state         | 0 | 1
  A   | (0+1)*.01     | B | A
  B   | (0+1)*.01 + 1 | B | C
  C   | (0+1)*.01 + ε | B | A

• F = {C}, the only state whose expression contains ε.

Document Models
• Different variants of models:
– taking the (non)existence of terms in documents into account or not,
– taking the frequencies of terms in documents into account or not,
– taking the positions of terms in documents into account or not,
– …

Document Models in TRSs: the Boolean Model
• Mid-20th century.
• Adoption of the procedures used in librarianship and their gradual implementation.

Boolean Model of TRS
• Database (collection) D containing n documents: D = {d1, d2, … dn}.
• Documents are described using m terms: T = {t1, t2, … tm}; a term tj is a descriptor, usually a word or a collocation.
• Each document is represented as the subset of the available terms
– contained in the document,
– describing the content of the document best: di ⊆ T.

• Assigning a set of terms to a document can be achieved by different approaches:
– Subdivision according to the author of the assignment:
• Manual: done by a human indexer who understands the content of the document. Non-consistent: two indexers need not produce the same set of terms, and one indexer might later produce a different set than before.
• Automatic: done algorithmically. Consistent, but without text understanding.
– Subdivision according to the freedom in selecting descriptors:
• Controlled: the set of terms is defined in advance and the indexer cannot change it; he/she can only select the terms describing the given document as well as possible.
• Non-controlled: the set of terms can be extended whenever a new document is inserted into the collection.

Indexation
• Thesaurus: an internally structured set of terms:
– synonyms with a defined preferred term,
– hierarchies of semantically narrower/broader terms,
– similar terms, ...
• Stop-list: a set of non-significant terms that are meaningless for indexation:
– pronouns, interjections, …

• Common words are not suitable for document identification.
• Neither are too specific words: a lot of different terms appear in a very small number of documents.
• Their elimination decreases the size of the index significantly, and its quality only slightly.
[Figure: suitable terms are those whose document frequency lies roughly between 0.1 and 0.9.]

Boolean Model of TRS
• A query is represented by a Boolean expression:
– ta AND tb: the document has to contain/be described by both terms,
– ta OR tb: the document has to contain/be described by at least one of the terms,
– NOT t: the document must not contain/be described by the given term.

• Query examples:
– 'searching' AND 'information'
– 'encoding' OR 'decoding'
– 'processing' AND ('document' OR 'text')
– 'computer' AND NOT 'personal'

Boolean Model of TRS: Extensions
• Collocations in queries:
– 'searching for information'
– 'data encoding' OR 'data decoding'
– 'text processing'
– 'computer' AND NOT 'personal computer'
• Using factual meta-data (attribute values):
– 'database' AND (author = 'Salton')
– 'text processing' AND (year_of_publishing >= 1990)
• Wildcards in terms:
– 'datab*' AND 'system*' stands for the terms 'database', 'databases', 'system', 'systems', etc.
– 'portabl?' AND 'computer*' stands for the terms 'portable', 'computer', 'computers', 'computerized', etc.

Boolean Index Structure
• Inverted file: it holds a list of identified documents for each term (instead of a set of terms for each document):
  t1 = d1,1, d1,2, ..., d1,k1
  t2 = d2,1, d2,2, ..., d2,k2
  ...
  tm = dm,1, dm,2, ..., dm,km

• One-by-one processing of the inserted documents produces a sequence of couples <doc_id, term_id> sorted by the first component, i.e. by doc_id.
• Next, the sequence is reordered lexicographically by <term_id, doc_id> and duplicates are removed.
• The result can be further optimized by adding a directory pointing to the sections corresponding to the individual terms, and removing the term_ids from the sequence.

Lemmatization and Disambiguation of the Czech Language (ÚFAL)
• Odpovědným zástupcem nemůže být každý. ("Not everyone can be the responsible representative.")
• Zákon by měl zajistit individualizaci odpovědnosti a zajištění odbornosti. ("The law should ensure the individualization of responsibility and the assurance of expertise.") …
• The markup records, for each word: the paragraph number, the sentence number, the word form in the document, the lemma including its meaning, and the type of word (morphological tag: adverb, …):
  <p n=1> <s id="docID:001-p1s1">
  <f cap>Odpovědným <MDl>odpovědný_^(kdo_za_něco_odpovídá) <MDt>AAIS7----1A----
  <f>zástupcem <MDl>zástupce <MDt>NNMS7-----A----
  <f>nemůže <MDl>moci_^(mít_možnost_[něco_dělat]) <MDt>VB-S---3P-NA---
  <f>být <MDl>být <MDt>Vf--------A----
  <f>každý <MDl>každý <MDt>AAIS1----1A----
  <p n=2> …
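On such an inverted file, a conjunctive query reduces to intersecting sorted doc-id lists. A minimal sketch for 't1 AND t2' follows; the postings lists are hypothetical.

program PostingsAnd;
{ Evaluate 't1 AND t2' by merging two doc-id lists sorted increasingly. }
const
  a: array[1..5] of Integer = (2, 4, 7, 11, 15);  { postings of t1 }
  b: array[1..4] of Integer = (4, 9, 11, 20);     { postings of t2 }
var
  i, j: Integer;
begin
  i := 1; j := 1;
  while (i <= 5) and (j <= 4) do
    if a[i] = b[j] then
    begin
      WriteLn('doc ', a[i]);   { the document contains both terms }
      Inc(i); Inc(j);
    end
    else if a[i] < b[j] then
      Inc(i)
    else
      Inc(j);
end.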
• Stop-list – Set of non-significant terms that are meaningless for indexation • Pronouns, interjections, … NDBI010 - DIS - MFF UK Indexation • Common words are not suitable for document identification • Too specific words as well. Lot of different terms appears in very small number of docs • Its elimination decreases significantly size of the index, and slightly its quality vhodné termy Suitable terms 0 0 0,1 NDBI010 - DIS - MFF UK 0,5 0,9 1 Boolean Model of TRS • Query is represented by Boolean expression – ta AND tb – ta OR tb – NOT t document has to contain/to be described by both terms document has to contain/to be described by at least one term document has not contain/to be described by given term NDBI010 - DIS - MFF UK Boolean Model of TRS • Query examples: – ‘searching’ AND ‘information’ – ‘encoding’ OR ‘decoding’ – ‘processing’ AND (‘document’ OR ‘text’) – ‘computer’ AND NOT ‘personal’ NDBI010 - DIS - MFF UK Boolean Model of TRS – Extensions • Collocations in queries – ‘searching for information’ – ‘data encoding’ OR ‘data decoding’ – ‘text processing’ – ‘computer’ AND NOT ‘personal computer’ NDBI010 - DIS - MFF UK Boolean Model of TRS – Extensions • Using of factual meta-data (attribute values) – ‘database’ AND (author = ‘Salton’) – ‘text processing’ AND (year_of_publishing >= 1990) NDBI010 - DIS - MFF UK Boolean Model of TRS – Extensions • Wildcards in terms – ‘datab*’ AND ‘system*’ • stands for terms ‘database’, ‘databases’, ‘system’, ‘systems’, etc. – ‘portabl?’ AND ‘computer*’ • stands for terms ‘portable’, ‘computer’, ‘computers’, ‘computerized’ etc. NDBI010 - DIS - MFF UK Boolean Index Structure • Inverted file – It holds a list of identified documents for each term (instead of a set of terms for each document) • t1 = d1,1, • t2 = d2,1, • tm = dm,1, d1,2, d2,2, dm,2, ..., ..., ..., d1,k1 d2,k2 dm,km NDBI010 - DIS - MFF UK Boolean Index Structure • One-by-one processing of inserted documents produces a sequence of couples <doc_id,term_id> sorted by first component, i.e. by doc_id • Next the sequence is reordered lexicographically by term_id, doc_id and duplicates are removed • The result can be further optimized by adding directory pointing to sections, corresponding to individual terms, and removing term_id’s from the sequence NDBI010 - DIS - MFF UK Lemmatization and Disambiguation of Czech Language (ÚFAL) • Odpovědným zástupcem nemůže být každý. • Zákon by měl zajistit individualizaci odpovědnosti a zajištění odbornosti. … Paragraph Nr. Sentence Nr. Word in document Lemma including meaning Type of word (Adverb), … • <p n=1> <s id="docID:001-p1s1"> <f cap>Odpovědným <MDl>odpovědný_^(kdo_za_něc o_odpovídá) <MDt>AAIS7----1A---<f>zástupcem<MDl>zástupce< MDt>NNMS7-----A---<f>nemůže<MDl>moci_^(mít_ možnost_[něco_dělat])<MDt>VB -S---3P-NA--<f>být<MDl>být<MDt>Vf-------A---<f>každý<MDl>každý<MDt>A AIS1----1A---• <p n=2> … NDBI010 - DIS - MFF UK Proximity Constraints • t1 (m,n) t2 – most general form – term t2 can appear at most m words after t1, or term t1 can appear at most n words after t2. 
• t1 sentence t2 – terms have to appear in the same sentence • t1 paragraph t2 – terms have to appear in the same paragraph NDBI010 - DIS - MFF UK Proximity Constraints – Evaluation • Using the same index structure – Operators replaced by conjunctions – Query evaluation to find candidates – Check for co-occurrences in primary texts • Small index • Longer time needed for evaluation • Necessity of storing primary documents • Extension of index by positions of term occurrences in documents – Large index NDBI010 - DIS - MFF UK Extended Index Structure • During indexation is built a sequence of 5-tuples <dok_id,term_id,para_nr,sent_nr,word_nr> ordered by dok_id, para_nr,sent_nr,word_nr • Sequence is reordered by <term_id,dok_id,para_nr,sent_nr,word_nr> • No duplicities are removed NDBI010 - DIS - MFF UK Thesaurus Utilization • • • • • • BT(x) - Broader Term to term x NT(x) - Narrower Terms PT(x) - Preferred Term SYN(x) - SYNonyms to term x RT(x) - Related Terms TT(x) - Top Term NDBI010 - DIS - MFF UK Disadvantages of Boolean Model • Salton: – – – – Query formulation is more an art than science. Hits can not be rate by its quality. All terms in the query are taken as equally important. Output size can not be controlled. System frequently produces empty or very large answers. – Some results doesn’t correspond to intuitive understanding. • Documents in answer to disjunctive query can contain only one of mentioned term as well as all of them. • Documents eliminated from answer to conjunctive query can miss one of mentioned term as well as all of them. NDBI010 - DIS - MFF UK Partial Answer Ordering Q = (t1 OR t2) AND (t2 OR t3) AND t4 – conversion to equivalent DNF Q’ = OR OR OR OR (t1 AND t2 AND t3 AND t4) (t1 AND t2 AND NOT t3 AND t4) (t1 AND NOT t2 AND t3 AND t4) (NOT t1 AND t2 AND t3 AND t4) (NOT t1 AND t2 AND NOT t3 AND t4) NDBI010 - DIS - MFF UK Partial Answer Ordering • Each elementary conjunction (further EC) contain all terms used in original query and is rated by number of terms used in positive way (without NOT) • All EC’s differs each from another in at least one term (one contains tj, second contains NOT tj) Every document correspond to at most one EC Document is then rated by number, assigned to given EC. NDBI010 - DIS - MFF UK Partial Answer Ordering • There exist 2k EC’s in case of query using k terms • There exist only k different ratings • More EC’s can have the same rating • (ta OR tb) = = (ta AND tb) … rating 2 OR (ta AND NOT tb) … rating 1 OR (NOT ta AND tb) … rating 1 NDBI010 - DIS - MFF UK Lecture No. 4 Vector Space Model of TRS • 70th of 20. 
century – cca 20 years younger than Boolean model of TRS • Tries to minimize and/or eliminate disadvantages of Boolean model NDBI010 - DIS - MFF UK Vector Space Model of TRS • Database D containing n documents – D={d1, d2, … dn} • Documents are described by m terms – T ={t1, t2, … tm} – term tj = word or collocation • Document representation using m-dimensional vector of term weights – d i wi,1, wi,2,...,wi,m NDBI010 - DIS - MFF UK Vector Space Model of TRS • Document model – d i wi,1, wi,2,..., wi,m 0,1m – wi,j … level of importance of j-th term to identify/describe i-th document • Query – q q1, q2,..., qm 0,1m – qj … level of importance of j-th term for the user NDBI010 - DIS - MFF UK Vector Space Model Index d 1 w1,1 d 2 w2,1 D d n wn,1 w2,2 wn,2 w1,2 NDBI010 - DIS - MFF UK w1,m w2,m nm 0,1 wn,m Vector Space Model of TRS • Similarity between vectors representing document and query is in general defined by Similarity function 1 Sim q ,d i R 0 0 1 NDBI010 - DIS - MFF UK Similarity Functions • Sim q ,d i m q jwi, j q d i j 1 qd i cos • Factor q j wi, j is proportional both to level of importance in document and for the user • Orthogonal vectors have zero similarity – Base vectors in the vector space (individual terms) are orthogonal each to other and so have zero similarity NDBI010 - DIS - MFF UK Vector Space Model of TRS • Sim q ,d i q d i cos • Not only the angle, but also sizes of vectors influence the similarity • Longer vectors, that tends to be assigned to longer texts have an advantage on shorter ones • Its desirable to normalize all vectors to have unitary size NDBI010 - DIS - MFF UK Vector normalization • Vector length influence elimination NDBI010 - DIS - MFF UK Vector normalization • In time of indexation – No overhead in time of searching – Sometimes it is necessary to re-compute all vectors – in case that vectors reflects also aspects dependent on complete collection • In time of search – Part of similarity function definition – Slows down the response of the system NDBI010 - DIS - MFF UK Output Size Control • Documents in the output list are ordered by descending similarity to the given query – Most similar documents at the beginning of the list – The list size can be easily restricted with respect to maximal criterion • The maximal number of documents in the list can be restricted • Only documents reaching threshold similarity can be shown in the result NDBI010 - DIS - MFF UK Negation in Vector Space Model • Sim q ,d i m q j wi, j j 1 • It is possible to extend query space • q q1, q2,..., qm 1,1m Then the contribution q j wi, j of j-th dimension can be negative • Documents that contain j-th term are suppressed in comparison with others NDBI010 - DIS - MFF UK Scalar product Sim q ,d i m q j wi, j j 1 NDBI010 - DIS - MFF UK Cosine Measure (Salton) m Sim q ,d i q jwi, j j 1 m 2 m q j wi, j j 1 j 1 NDBI010 - DIS - MFF UK 2 Jaccard Measure m Sim q ,d i q jwi, j m j 1 m m j 1 j 1 j 1 q j wi, j q jwi, j NDBI010 - DIS - MFF UK Dice Measure m Sim q ,d i 2 q jwi, j j 1 m m j 1 j 1 q j wi, j NDBI010 - DIS - MFF UK Overlap Measure m Sim q ,d i q jwi, j j 1 m 2 min q j , wi, j j 1 2 NDBI010 - DIS - MFF UK Asymmetric Measure m Sim q ,d i min q j , wi, j j 1 m wi, j j 1 NDBI010 - DIS - MFF UK Pseudo-Cosine Measure m Sim q ,d i q jwi, j j 1 m m q j wi, j j 1 j 1 NDBI010 - DIS - MFF UK Vector Space Model Indexation • Based on number of occurrences of given term in given document – The more given word occurs in given document, the more important for its identification • Term Frequency TFi,j = 
#term_occurs / #all_occurs NDBI010 - DIS - MFF UK Vector Space Model Indexation • Without stop-list the result contains almost only meaningless words at the beginning 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 term the of a to and is for be if in use are it should class NDBI010 - DIS - MFF UK # 239 96 84 78 70 65 60 53 52 49 49 44 44 38 33 TF 0,0582 0,0234 0,0205 0,0190 0,0171 0,0158 0,0146 0,0129 0,0127 0,0119 0,0119 0,0107 0,0107 0,0093 0,0080 Vector Space Model Indexation • Term frequencies are very small even for most frequent terms • Normalized term frequency NTFi, j 0 iff TF i, j else 1 1 TF i , j NTF i, j 2 2 max TF i,k k 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 term use class owl c line example comments file bi functions files code int data public NDBI010 - DIS - MFF UK # 49 33 31 26 26 25 23 23 22 20 18 17 17 16 15 TF 0,0119 0,0080 0,0076 0,0063 0,0063 0,0061 0,0056 0,0056 0,0054 0,0049 0,0044 0,0041 0,0041 0,0039 0,0037 NTF 1,0000 0,8367 0,8163 0,7653 0,7653 0,7551 0,7347 0,7347 0,7245 0,7041 0,6837 0,6735 0,6735 0,6633 0,6531 Vector Space Model Indexation Histogram of (norm alized) term frequency 1,0000 0,9000 0,8000 Frequency 0,7000 0,6000 TF 0,5000 0,4000 0,3000 0,2000 Differentiation of important terms from non-important ones 0,1000 0,0000 Term s ordered by increasing frequency NDBI010 - DIS - MFF UK NTF Vector Space Model Indexation def w • i, j TF i , j • IDF (Inverted Document Frequency) reflects importance of given term in the index for complete collection def • w i, j def • w i, j NTF i, j NTF IDF i, j j 2,5000 2,0000 1,5000 0,5000 Entropy of probability that the term occurs in randomly chosen document NDBI010 - DIS - MFF UK 1,00 0,97 0,94 0,91 0,88 0,85 0,82 0,79 0,76 0,73 0,70 0,67 0,64 0,61 0,58 0,55 0,52 0,49 0,46 0,43 0,40 0,37 0,34 0,31 0,28 0,25 0,22 0,19 0,16 0,13 0,10 0,0000 0,07 j 0,04 IDF 1,0000 0,01 # docs containingterm log # all docs def Vector Space Model Indexation • (Optional) document vector normalization to unite size def v NTF v w v i, j def i, j IDF j i, j i, j 2 i ,k k NDBI010 - DIS - MFF UK Querying in Vector Space Model • Equal representation of documents and queries brings many advantages over Boolean Model • Query can be defined – Directly by its hand-made definition – By reference to known indexed document q d i – By reference to non-indexed document – indexer creates ad-hoc vector from its primary text – By text fragment (using copy-paste etc.) 
– By combination of some above mentioned ways NDBI010 - DIS - MFF UK Feedback • Query building/tuning based on user feedback to previous answers – Adding terms identifying relevant documents – Elimination of terms unimportant for relevant document identification and important for irrelevant ones • Prediction criterion improvement NDBI010 - DIS - MFF UK Feedback • Answer to previous query q is classified by the user, who can mark relevant and/or irrelevant documents NDBI010 - DIS - MFF UK Positive Feedback • Relevant document “attract” the query towards them NDBI010 - DIS - MFF UK Negative Feedback • Irrelevant documents push query away from them – Less effective than positive feedback – Less used NDBI010 - DIS - MFF UK Feedback • The query iteratively migrates towards the center of relevant documents NDBI010 - DIS - MFF UK Feedback k ' • General form q 0 q jd i j 1 j • One of used special form Centroid (centre of gravity) k di j ' j 1 q q 1 k NDBI010 - DIS - MFF UK (1-) / Feedback k ' • General form q 0 q jd i j 1 • Other used (weighted) form j (1-) / k v j* d i j ' j 1 q q 1 k v j j 1 NDBI010 - DIS - MFF UK Weighted centroid (centre of gravity) Term Equivalence in VS Model • Individual terms (dimensions of the space) are supposedly, but not really, mutually independent T t , t , t , t , t , t 1 1 1 1 , 0 , 0 , , , d 2 2 2 2 1 3 q 0, 4 , 4 , 0, 0, 0 Simq , d 0 1 2 3 4 5 6 , where t1 t 2 , t 3 t 4 Problem with prediction – inappropriately chosen synonyms NDBI010 - DIS - MFF UK Term Equivalence in VS Model • Equivalency matrix E d q 1 1 1 1 , 0, 0, , , 2 2 2 2 1 3 0, , , 0, 0, 0 4 4 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 q E 4 1 Sim q E , d 8 2 1 1 3 3 , , , , 0, 0 4 4 4 4 NDBI010 - DIS - MFF UK Term Similarity in VS Model 1.0 0.8 0 0 0.2 0 • Generalised equivalence 0.8 1.0 0 0 0 0 • Similarity matrix 0 0 1.0 1.0 0 0 S 0 0 1.0 1.0 0 0 • 38 19 0.2 0 0 0 1.0 0 Sim q S , d 0 0 0 0 0 1.0 80 40 • All computation used in VS model can be evaluated also on transposed index. Here mutual similarity of term can be evaluated (vector dimension n, not m) Sim t j1 , t j 2 + Really similar terms co-occurs often together – Common terms co-occurs often together as well NDBI010 - DIS - MFF UK Term Hierarchies in VS Model • Similarly to Boolean Model Publication Print Papers NDBI010 - DIS - MFF UK Book Magazine Term Hierarchies in VS Model • Similarly to Boolean Model • Edges can have assigned weights 0.3 • User weights Papers then can be easily propagated 0.8 Publication 0.4 0.32 Print 0.6 Book 0.7 0.096 NDBI010 - DIS - MFF UK Magazine 0.224 0.48 Citations and VS Model • Scientific publications cites their sources • Assumption: – Cited documents are semantically similar – Citing documents are semantically similar NDBI010 - DIS - MFF UK Citations and VS Model • Direct reference between documents “A” a “B” – Document “A” cites document “B” – Denoted AB • Indirect reference between “A” a “B” – Ex. 
C1, …Ck so, that AC1…CkB • Link between documents “A” a “B” – AB or BA NDBI010 - DIS - MFF UK Citations and VS Model • A and B are bibliographically paired, if and only if they cite the same source C AC BC • A and B are co-cited, if and only if they are both cited in some document C CA CB NDBI010 - DIS - MFF UK Citations and VS Model • Acyclic oriented citation graph • Flowchart matrix of the citation graph • C=[cij]{0,1}<nxn> cij=1, iff ij cij=0 else NDBI010 - DIS - MFF UK Citations and VS Model • BP matrix of bibliographic pairing • bpij = number of documents cited in both documents i and j. – Follows bpii = number of documents cited in i • bpij c i c j n cik c jk k 1 NDBI010 - DIS - MFF UK Citations and VS Model • CP matrix of co-citation pairing • cpij = number of documents citing both i and j – Follows cpii = number of documents citing i • kpij c i c j n ckickj k 1 NDBI010 - DIS - MFF UK Citations and VS Model • DL matrix of document links • dlij = 1 (cij = 1 cji = 1) • It is possible to modify resulting similarities between documents and given query using some of matrices BP, CP, DL • Modification of index matrix D – D’= BP.D, resp. D’=CP.D , resp. D’=DL.D – D’=BP.CP.DL.D NDBI010 - DIS - MFF UK Using mutual document similarities in VS Model • DS matrix of mutual document similarities • dsij = Sim d i,d j • The same idea as in case of BP, CP, DL • Modification of index matrix D – D’=DS.D NDBI010 - DIS - MFF UK Lecture No. 5 Term Discrimination Values • Discrimination value defines the importance of the term in the vector space to distinguish individual documents stored in the collection • By removal of the term from index, i.e. by reduction of index dimensionality it can happen: – Overall distance between documents decreases (average similarity of document pairs increases) – Overall distance between documents increases (average similarity of document pairs decreases) • In this case the presence of the dimension in the space is not needed (is contra-productive) NDBI010 - DIS - MFF UK 45,0 35,3 45,0 NDBI010 - DIS - MFF UK 0,0 Term Discrimination Values • Computation based on average document similarity Simd , d Q n n i i , j 1 2 j • More efficient variant using “central document” (centroid) d c n Simd , c Q n n i 1 n i 1 NDBI010 - DIS - MFF UK i i Term Discrimination Values • The same value is computed for the space reduced by k-th dimension x (k ) x , x , , x , x d i Q c n 1 n (k ) i 1 2 k 1 k 1 , , xm (k ) (k ) Sim d , c i i 1 n (k ) NDBI010 - DIS - MFF UK n (k ) Term Discrimination Values • Discrimination value is defined as a difference of both average values DV (k ) k Q Q • Can be used instead of IDFk 0 Important term discriminating documents DVk defines the measure of importance 0 Unimportant term NDBI010 - DIS - MFF UK Sem přetáhněte stránková pole. Term Discrimination Values Celkem (value DV of terms depending on number of documents where the term is present) Průměr z DVk 0,00001 90/7777 180/7777 1200/7777 -0,00004 -0,00009 Results for collection of 7777 articles published in papers „Lidové noviny“ in 1994, described by 13495 lemmas -0,00014 Positive DVk in case of 12324 lemmas having 478849 occurrences. Negative DVk in case of 1170 lemmas having 466992 occurrences. -0,00019 -0,00024 Number of documents, where the term is present. 
Document clustering
Kohonen maps, the C3M algorithm, the K-means algorithm

Document Clustering
• The response time of a VS-based TRS is directly proportional to the number of documents in the collection that must be compared with the query.
• Clustering allows skipping a major part of the index during the search and comparing only the closest documents.
• Without clusters it is necessary to compare all documents, even if a minimal needed similarity is defined.

• Each cluster represents an m-dimensional sphere, defined by its center and radius.
• If not, it is possible to approximate it this way during the computations.
• Having clusters, the query evaluation need not compare the documents in clusters outside the area of the user's interest.

Cluster types
• Clusters having the same volume:
+ easy to create,
– some clusters can be almost empty, while others contain a huge number of documents.
• Clusters having (approximately) the same number of documents:
– hard to create,
+ more effective in the case of non-uniformly distributed documents.
• Non-disjunctive clusters: one document can belong to more than one cluster; sometimes a weighted membership in fuzzy clusters.
• Disjunctive clusters: a document belongs to exactly one cluster.
• It is not possible to cover the space completely and disjointly using spheres; it is possible to use convex polyhedra, where each document belongs to the closest center.
• Such clusters can then be approximated by a non-disjoint set of spheres, each defined by the center and its most distant belonging document.

Query Evaluation With Clusters (I)
• Given a query q and a minimal required similarity s.
– Note: the similarity is computed by the scalar product and the vectors are normalized.
• The index is split into k clusters (c1, r1), …, (ck, rk); the radii are angular.
• The query radius is r = arccos(s), i.e. s = cos(r).
• The emptiness of the intersection of a cluster with the query area is determined from the value arccos(Sim(q, ci)) − r − ri:
– if this value is ≤ 0, the documents in the cluster are compared;
– if this value is > 0, the cluster's documents cannot be in the result.

Query Evaluation With Clusters (II)
• Given a query q and a maximal number of required documents x.
• Again, the index is split into k clusters (c1, r1), …, (ck, rk); no radius of the query is available.
• The clusters are sorted in ascending order by the distance of their center from the query, i.e. by arccos(Sim(q, ci)),
• or better by the distance of the cluster boundary from the query, i.e. by arccos(Sim(q, ci)) − ri.
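A sketch of the variant-I pruning test; the similarities and radii below are hypothetical, and FPC's Math unit is assumed for ArcCos.

program ClusterPruning;
{ Skip every cluster whose cap cannot intersect the query cap:
  arccos(Sim(q,c_i)) - r - r_i > 0  =>  prune. }
uses Math;
const
  k = 3;
  simQC: array[1..k] of Real = (0.95, 0.60, 0.10);  { Sim(q, c_i) }
  ri:    array[1..k] of Real = (0.20, 0.15, 0.30);  { angular cluster radii }
  s = 0.80;                                         { minimal required similarity }
var
  i: Integer;
  r: Real;
begin
  r := ArcCos(s);                                   { query radius }
  for i := 1 to k do
    if ArcCos(simQC[i]) - r - ri[i] <= 0 then
      WriteLn('cluster ', i, ': must be searched')
    else
      WriteLn('cluster ', i, ': skipped');
end.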
Query Evaluation With Clusters (II), continued
• The closest cluster is evaluated first.
• While there are not enough documents in the result, the next closest cluster is evaluated.
• Once there are enough documents, the x-th best document defines the working radius, and the next cluster is evaluated only if it intersects the sphere given by the query and the x-th best hit.
• If some documents are replaced by better ones from the new cluster, the working radius is reduced.

Multi-level Clustering
• If there is still a lot of clusters, it is possible to cluster them further to obtain second-level clusters, etc.

Clustering Methods: Kohonen self-organizing maps
• Used for the classification of multi-dimensional input patterns.
• An unsupervised artificial neural network; a self-organizing structure.
• Conforms to the density of the patterns (documents) in a given area.
• Tends to create clusters having approximately the same number of members.

Kohonen self-organizing maps
• A regular k-dimensional net of m-dimensional points (centers), usually with k << m.
• Each center has assigned its position in the m-dimensional space and up to 2k predefined neighbors (two in each of the k dimensions; centers at the boundary have fewer neighbors).
– E.g. 2- and 1-dimensional maps in a 2-dimensional space.

• At the start, the centers have random positions.
• When inserting a document d:
– the closest center cx is found and moved closer to the document: $\vec c_x := \vec c_x + \alpha(\vec d - \vec c_x)$,
– its defined neighbors in the map are moved as well: $\vec c := \vec c + \beta(\vec d - \vec c)$.
• The parameters 0 ≤ β ≤ α ≤ 1 denote the measure of the system's flexibility.
• It is advisable to decrease these coefficients in time towards zero: later the map center positions represent more useful information that should not be suddenly forgotten.
[Figure: the map before and after the adaptation to a new document.]

• The clusters are defined by the map centers.
• Each cluster contains the documents that are closer to the given center than to any other one.
• Proximate points in the map are proximate in the original space (but not vice versa).

• It is also possible to use the maps to cluster terms/lemmas:
– the index matrix is transposed,
– the n-dimensional space is mapped instead of the original m-dimensional one.
• Example: a map having 15×15 centers; 7,777 documents (Lidové noviny, 1994); 13,495 lemmas; 100,000 learning iterations using a randomly chosen lemma vector. Created lemma clusters (translated from Czech):
– C2,4: stock-exchange, stocking, coupon, stock, investor, volume, investment, fund, value, business
– C2,5: wave, privatization, national
– C3,6: literary, writer, literature, publisher, origination, reader, history, text, book, write
– C3,13: Havel, Václav, president
– C4,13: Klaus, prime, minister
– C6,14: stage, comedy, film, script, audience, festival, shot, story, film, role
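A sketch of a single adaptation step for a 1-dimensional map in a 2-dimensional space; all positions and learning rates are hypothetical.

program KohonenUpdate;
{ Move the closest center (and, with a smaller rate, its map
  neighbors) towards the inserted document. }
const
  nc = 4;                      { centers in a 1-D map }
  alpha = 0.3; beta = 0.1;     { winner and neighbor learning rates }
var
  cx, cy: array[1..nc] of Real;
  dx, dy, best, dist: Real;
  i, win: Integer;
begin
  Randomize;
  for i := 1 to nc do          { random initial positions }
  begin
    cx[i] := Random; cy[i] := Random;
  end;
  dx := 0.7; dy := 0.2;        { the inserted document }
  { find the closest center }
  win := 1; best := Sqr(cx[1] - dx) + Sqr(cy[1] - dy);
  for i := 2 to nc do
  begin
    dist := Sqr(cx[i] - dx) + Sqr(cy[i] - dy);
    if dist < best then
    begin
      best := dist; win := i;
    end;
  end;
  { move the winner and its map neighbors towards the document }
  for i := 1 to nc do
    if i = win then
    begin
      cx[i] := cx[i] + alpha * (dx - cx[i]);
      cy[i] := cy[i] + alpha * (dy - cy[i]);
    end
    else if Abs(i - win) = 1 then
    begin
      cx[i] := cx[i] + beta * (dx - cx[i]);
      cy[i] := cy[i] + beta * (dy - cy[i]);
    end;
  for i := 1 to nc do
    WriteLn('c', i, ' = (', cx[i]:0:3, ', ', cy[i]:0:3, ')');
end.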
Kohonen maps
• 2D map of lemmas (the unfolded 2D map from the previous example)
[Figure: the 15×15 grid of Czech lemmas; thematic regions are labeled ekonomika (economy), literatura (literature), sport and politika (politics).]

Kohonen self-organizing maps — Obtained Cluster Sizes
• The sizes of the clusters are here not so equal…
[Table: the 15×15 grid of obtained cluster sizes; the counts range from a single lemma in sparse cells to thousands in the dense regions, with a grand total of 13495 lemmas.]

Kohonen self-organizing maps — Cluster C15,1
• C15,1 (football player names and terminology):
37, 41, 45, 46, 53, 54, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, Adamec, Babka, Baček, Balcárek, Baník, Baránek, Barát, Barbarič, Barbořík, Barcuch, Bečka, Bejbl, Beránek, Berger, Bielik, Bílek, Blažek, boční, Bohuněk, Borovec, Brabec, brána, Branca, brankový, Breda, brejk, Brückner, břevno, Březík, Budka, centr, Cieslar, Culek, Cupák, Čaloun, čára, Časka, čermák, Červenka, Čihák, čížek, Diepold, Dobš, dohrávat, Dostál, Dostálek, drnovický, Drulák, Duhan, Džubara, faul, fauloval, Frýdek, Fujdiar, Gabriel, Galásek, gólman, gólový, Gunda, Guzik, Harazim, Hašek, Havlíček, Heřman, hlavička, Hodúl, Hoftych, Hogen, Holec, Holeňák, Hollý, holomek, Holota, holub, Horňák, Horváth, hostující, hradecký, Hrbek, Hromádko, Hrotek, Hruška, Hřebík, hřídel, Hýravý, Hyský, chebský, chovanec, inkasovat, jablonecký, Janáček, Jančula, Janeček, Jánoš, Janota, Janoušek, Jarabinský, Jarolím, Jihočech, Jindra, Jindráček, jinoch, Jirka, Jirousek, Jukl, Kafka, Kamas, Kerbr, Kirschbaum, Klejch, Klhůfek, Klimeš, klokan, Klusáček, Knoflíček, Kobylka, Kocman, kočí, Koller, kolouch, koncovka, kop, kopačka, Kopřiva, Kordule, kostelník, Kotrba, Kotůlek, Kouba, Koubek, kovář, kozel, Kožlej, Kr(krypton), krejčí, Krejčík, Krištofík, Krondl, Křivánek, Kubánek, kuchař, Kukleta, Lasota, Lerch, Lexa, Lička, Litoš, Lokvenc, Ložek,
Macháček, Machala, Maier, Majoroš, Maléř, Marek, Maroš, Mašek, Mašlej, Maurer, mela, míč, Mičega, Mičinec, Michálek, Mika, mina, mířit, Mojžíš, Mucha, nápor, nastřelit, Navrátil, Nečas, Nedvěd, Nesvačil, Nešický, Neumann, Novák, Novotný, Obajdin, olomoucký, Onderka, Ondráček, Ondrůšek, Palinek, Pařízek, Pavelka, Pavlík, Pěnička, Petrouš, Petržela, Petřík, pilný, plzeňský, Poborský, pokutový, poločas, poslat, Poštulka, Povišer, prázdný, Pražan, proměněný, protiútok, Průcha, předehrávka, přesný, převaha, Přibyl, přidat, přihrávka, ptáček, Puček, půle, Purkart, rada, Radolský, Rehák, roh, rohový, Rusnák, Řepka, samec, Sedláček, Schindler, Siegl, Sigma, síť, Skála, skórovat, slabý, Slezák, Slončík, Sokol, sólo, srazit, standardní, Stejskal, střídající, střílet, Studeník, Suchopárek, Svědík, Svoboda, šatna, Šebesta, šedivý, šestnáctka, šilhavý, Šimurka, šindelář, Šlachta, Šmarda, Šmejkal, Špak, Švach, Tejml, tesařík, Tibenský, tlak, Tobiáš, trefit, Trval, Tuma, tyč, Tymich, Uličný, Ulich, Ulrich, uniknout, Urban, Urbánek, útočný, úvod, Vacek, Vaďura, Vágner, Vácha, Valachovič, valnoha, Váňa, Vaněček, Vaniak, vápno, Vávra, vejprava, veselý, Vidumský, Víger, Viktoria, vlček, volej, Vonášek, Vosyka, Votava, vrabec, vyloučený, vyložený, Výravský, vyrazit, vyrovnání, Vyskočil, vystrašit, Wagner, Weber, Wohlgemuth, zachránit, zákostelský, zákrok, Západočech, zlikvidovat, zlínský, Zúbek, žižkovský, ŽK (žlutá karta = yellow card)

Lecture No. 6
C3M Clustering
Cover Coefficient-based Clustering Methodology

C3M Clustering
• First, the inverse values of the sum totals are computed for each row and each column of the index matrix

    D = ( w_{1,1}  w_{1,2}  …  w_{1,m} )
        ( w_{2,1}  w_{2,2}  …  w_{2,m} )
        (    ⋮        ⋮            ⋮   )
        ( w_{n,1}  w_{n,2}  …  w_{n,m} )

C3M Clustering
• Inverse row total: α_i = 1 / Σ_{k=1..m} w_{i,k}
• Inverse column total: β_j = 1 / Σ_{k=1..n} w_{k,j}
• Assuming that:
  – Each document is indexed by at least one term
  – Each term describes at least one document

C3M Clustering
• The product w_{i,j} · α_i expresses the occurrence of the j-th term in the i-th document
  – If one points randomly inside the i-th document, what is the probability that he/she finds the j-th term?
• The product w_{i,j} · β_j expresses the occurrence of the i-th document for the j-th term
  – If one points randomly to some occurrence of the j-th term in the collection, what is the probability that he/she finds it in the i-th document?

C3M Clustering
• Second, the matrix C of cover coefficients is computed:
c_{i,j} = α_i · Σ_{k=1..m} w_{i,k} · β_k · w_{j,k}
• If I pick a random term occurrence from the i-th document and then try to select a random occurrence of the same term in the collection, what is the probability that I pick an occurrence from the j-th document?
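Under the definitions above, the whole cover-coefficient matrix is one line of linear algebra. A minimal numpy sketch (the matrix name D and the function name are illustrative):

    import numpy as np

    def cover_coefficients(D):
        """Cover-coefficient matrix of C3M (a minimal sketch).

        D is the n x m document-term matrix; every document must contain at
        least one term and every term must occur in at least one document,
        so the totals below are nonzero.
        """
        alpha = 1.0 / D.sum(axis=1)    # inverse row totals
        beta = 1.0 / D.sum(axis=0)     # inverse column totals
        # c[i,j] = alpha_i * sum_k w[i,k] * beta_k * w[j,k]
        return (alpha[:, None] * D) @ (D * beta).T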
C3M Clustering
• If the i-th document contains an exclusive set of terms, then
  – c_{i,i} = 1
  – c_{i,j} = 0 for i ≠ j

C3M Clustering — properties
1) Σ_{j=1..n} c_{i,j} = 1, since
Σ_j c_{i,j} = Σ_j α_i Σ_k w_{i,k} β_k w_{j,k} = α_i Σ_k w_{i,k} β_k Σ_j w_{j,k} = α_i Σ_k w_{i,k} · 1 = 1
(β_k · Σ_j w_{j,k} = 1 and α_i · Σ_k w_{i,k} = 1 by definition)

C3M Clustering — properties
2) c_{i,j} ≥ 0 — obvious
3) c_{i,j} ≤ 1 — follows from 1) and 2)
4) c_{i,j} ≠ 0 ⟺ c_{j,i} ≠ 0 ⟺ Σ_k β_k w_{i,k} w_{j,k} ≠ 0
5) c_{i,j} = 0 ⟺ c_{j,i} = 0 — follows from 2) and 4)
6) c_{i,i} = c_{i,j} and c_{j,j} = c_{j,i} whenever d_i = d_j

C3M Clustering
• The cover coefficient c_{i,j} says how much the terms occurring in one document cover the terms in other documents
• If a given document covers the other documents poorly (lots of exclusive terms), the value c_{i,i} is close to 1
• If a given document covers the other documents well (lots of very common terms), the value c_{i,i} is close to 0

C3M Clustering
• Decoupling coefficient: δ_i = c_{i,i}
• Coupling coefficient: ψ_i = 1 − c_{i,i}
• Number of needed clusters: n_c = Σ_{i=1..n} δ_i
• "Power" of a given document to become a centre of a cluster:
p_i = δ_i · ψ_i · Σ_{j=1..m} w_{i,j}

C3M Clustering
• Alternatively, a normalized computation of p_i can be used, based on the term cover coefficients
c'_{i,j} = β_i · Σ_{k=1..n} α_k · w_{k,i} · w_{k,j}, with δ'_i = c'_{i,i} and ψ'_i = 1 − c'_{i,i}

C3M Clustering
• The first n_c documents having the biggest p_i values become the cluster centers, with exceptions:
  – Too dissimilar documents are put into a special "trash" cluster, which is compared with every query; n_c can be decreased accordingly
  – Only one representative is taken from a group of mutually similar documents (with similar values of c_{i,i}, c_{i,j}, c_{j,j} and c_{j,i}); the others (with similar p_i) are skipped
• All other documents are assigned to the closest centre

C3M Clustering — example
• Index matrix D with the inverse row totals α_i and inverse column totals β_j:

           x       y       z      α_i
    d01  0.9800  0.1950  0.0397  0.8233
    d02  0.9400  0.3410  0.0109  0.7740
    d03  0.9600  0.2800  0.0000  0.8065
    d04  0.9700  0.2430  0.0071  0.8196
    d05  0.9500  0.3120  0.0125  0.7846
    d06  0.1900  0.9800  0.0592  0.8136
    d07  0.3200  0.9400  0.1183  0.7255
    d08  0.2200  0.9600  0.1732  0.7390
    d09  0.2300  0.9700  0.0787  0.7820
    d10  0.0100  0.0200  0.9997  0.9711
    β_j  0.1733  0.1908  0.6669

C3M Clustering
• Cover coefficients c_{i,j}:

          1       2       3       4       5       6       7       8       9       10
    1   0.1439  0.1421  0.1428  0.1432  0.1427  0.0579  0.0761  0.0639  0.0636  0.0238
    2   0.1336  0.1358  0.1352  0.1346  0.1356  0.0736  0.0884  0.0771  0.0783  0.0079
    3   0.1399  0.1408  0.1409  0.1406  0.1409  0.0677  0.0834  0.0709  0.0727  0.0022
    4   0.1426  0.1425  0.1429  0.1429  0.1428  0.0636  0.0803  0.0675  0.0689  0.0060
    5   0.1360  0.1374  0.1371  0.1367  0.1374  0.0707  0.0860  0.0744  0.0755  0.0088
    6   0.0572  0.0774  0.0683  0.0632  0.0733  0.1561  0.1554  0.1575  0.1563  0.0354
    7   0.0671  0.0828  0.0751  0.0711  0.0795  0.1386  0.1420  0.1437  0.1400  0.0602
    8   0.0574  0.0736  0.0650  0.0608  0.0701  0.1431  0.1464  0.1509  0.1445  0.0883
    9   0.0604  0.0791  0.0705  0.0657  0.0753  0.1502  0.1509  0.1529  0.1508  0.0443
    10  0.0281  0.0099  0.0027  0.0072  0.0108  0.0423  0.0806  0.1161  0.0550  0.6474

C3M Clustering
• Coefficients δ, ψ and p:

          1       2       3       4       5       6       7       8       9       10
    δ   0.1439  0.1358  0.1409  0.1429  0.1374  0.1561  0.1420  0.1509  0.1508  0.6474
    ψ   0.8561  0.8642  0.8591  0.8571  0.8626  0.8439  0.8580  0.8491  0.8492  0.3526
    p   0.1496  0.1516  0.1501  0.1494  0.1510  0.1619  0.1679  0.1734  0.1638  0.2351

• n_c = Σ δ_i ≈ 2
C3M Clustering
• After the 8th document is chosen as a centre (p10 = 0.2351 and p8 = 0.1734 are the two biggest powers), the 6th, 7th and 9th documents should not be taken as centers — their rows in the cover-coefficient matrix above, and their p values, show they are mutually similar to the 8th one

C2ICM Clustering
• Cover Coefficient-based Incremental Clustering Methodology
  – INSERT: the document is assigned to the closest cluster, resp. to the trash cluster
  – DELETE: if the centre of a cluster is deleted, the cluster is marked as invalid
  – REORGANIZE:
    • The centers of the clusters are chosen from scratch
    • Current clusters whose centre was not chosen are marked as invalid
    • Documents from invalid clusters are reassigned to the new clusters
  – The fact that some documents from valid clusters should be reassigned as well, because some new centre can be closer, is not taken into account

Spherical K-means Clustering
(Dhillon, I. S.; Modha, D. S.)
• The vector index D ⊆ [0,1]^{n×m} is split into k disjoint sets of documents π_1, …, π_k
• For each set π_j (with n_j documents) are defined
  – the centroid m_j = (1/n_j) · Σ_{d_i ∈ π_j} d_i
  – the centroid having unit size c_j = m_j / ‖m_j‖

Spherical K-means Clustering
• The value Σ_{d_i ∈ π_j} d_i·c_j can be considered a cluster quality measure (the higher the value, the better)
• The value S(Π_k) = Σ_{j=1..k} Σ_{d_i ∈ π_j} d_i·c_j represents the quality of the whole classification/clustering
• Note: unlike Euclidean K-means, which minimizes Σ_{j=1..k} Σ_{d_i ∈ π_j} ‖d_i − c_j‖²

Spherical K-means Clustering
• The goal is to find the classification having the maximal value of the assessing function
• Obviously S(Π_k) ≤ n, because d_i·c_j ≤ 1 (the vectors are normalized)
• In general an NP-complete problem
  – An iterative algorithm that converges to a (local) maximum is used

Spherical K-means Algorithm
• Initialization (0th iteration)
  – Documents are assigned randomly to k clusters π_j^(0) (here k = 3)
  – The positions of the centroids (gravity centers) c_j^(0) are computed

Spherical K-means Algorithm
• Iteration step t → t+1
  – Documents are assigned to the closest centroid from the previous iteration:
    π_j^(t+1) = { d_i | ∀x: d_i·c_j^(t) ≥ d_i·c_x^(t) }
  – New centroid positions c_j^(t+1) are computed

Spherical K-means Algorithm
• The cycle iterates until the growth of the assessing function falls below a predefined threshold ε, i.e. while S(Π_k^(t+1)) − S(Π_k^(t)) ≥ ε (a sketch of the whole loop follows below)
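A minimal Python sketch of the whole loop, assuming the rows of D are normalized document vectors (function and variable names are illustrative):

    import numpy as np

    def spherical_kmeans(D, k, eps=1e-6, seed=0):
        """Spherical K-means (Dhillon & Modha) — a minimal sketch."""
        rng = np.random.default_rng(seed)
        assign = rng.integers(0, k, D.shape[0])   # random initial clusters
        quality = -np.inf
        while True:
            C = np.zeros((k, D.shape[1]))         # normalized centroids
            for j in range(k):
                members = D[assign == j]
                if len(members):
                    m = members.sum(axis=0)
                    C[j] = m / np.linalg.norm(m)
            sims = D @ C.T                        # cosine similarities
            assign = sims.argmax(axis=1)          # closest centroid
            new_quality = sims.max(axis=1).sum()  # assessing function S
            if new_quality - quality < eps:       # growth below threshold
                return assign, C
            quality = new_quality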
Spherical K-means Algorithm
• Result for k = 5
[Figure not reproduced.]

Spherical K-means Algorithm
• The assessing function is nondecreasing: S(Π_k^(t+1)) ≥ S(Π_k^(t))
• Cauchy–Schwarz inequality: for every unit vector x (‖x‖ = 1)
Σ_{d_i ∈ π} d_i·x ≤ ‖Σ_{d_i ∈ π} d_i‖ = Σ_{d_i ∈ π} d_i·c,
i.e. the (normalized) centroid of a set has the highest average similarity to all items in the set
• We want to show that S(Π_k^(t)) ≤ S(Π_k^(t+1)) holds, i.e. that the iterations converge (a nondecreasing function limited from above)

Proof: inequality sequence
S(Π_k^(t)) = Σ_{j=1..k} Σ_{d_i ∈ π_j^(t)} d_i·c_j^(t)
= Σ_{j=1..k} Σ_{l=1..k} Σ_{d_i ∈ π_j^(t) ∩ π_l^(t+1)} d_i·c_j^(t)
(the similarities are summed first over the intersections of the clusters from the previous and the current iteration; they are still taken with respect to the original centroids, so the sum doesn't change)
≤ Σ_{j=1..k} Σ_{l=1..k} Σ_{d_i ∈ π_j^(t) ∩ π_l^(t+1)} d_i·c_l^(t)
(we take the similarities to the closest centroid instead of the original one; by the assignment rule of iteration t+1 the new center is not farther than the old one, so the similarity doesn't decrease)
= Σ_{l=1..k} Σ_{d_i ∈ π_l^(t+1)} d_i·c_l^(t)
≤ Σ_{l=1..k} Σ_{d_i ∈ π_l^(t+1)} d_i·c_l^(t+1)
(Cauchy–Schwarz inequality: the sum of the similarities of a group of vectors with their centroid is not smaller than the sum of their similarities with any other unit vector)
= S(Π_k^(t+1))

Spherical K-means Algorithm — Example
• Combination of documents from three collections
  – MEDLINE: 1033 abstracts, medical magazines
  – CISI: 1460 abstracts, information retrieval
  – CRANFIELD: 1400 abstracts, aviation
• Confusion matrix for the three computed clusters:

                cluster 1  cluster 2  cluster 3
    MEDLINE        1004        11         18
    CISI              5      1440         15
    CRANFIELD         4        16       1380

Spherical K-means Algorithm — Example
• Distribution of the similarities of documents from the same cluster
[Figure not reproduced.]

Spherical K-means Algorithm — Example
• Distribution of the similarities of documents from different clusters
• The clusters are mutually (almost) orthogonal

Spherical K-means Algorithm — Cluster Labeling
• The mutual orthogonality of the clusters obtained by the spherical K-means algorithm shows that the terms most important for the characterization of a given centroid are (almost) unimportant for the characterization of the other centroids
• Individual centroids can be considered prototypes of the content inside the cluster — a concept
• The important terms can "label" the content of the given cluster

Spherical K-means Algorithm — Cluster Labeling
• For each classification of documents into k clusters we can define k clusters of terms W_1, …, W_k
• The i-th term cluster contains the terms having the highest weight in the i-th document cluster centroid
• Terms can be ordered primarily by the term cluster number, secondarily by their weight in the i-th centroid
• Each cluster is labeled by the most important terms in the cluster (i.e. by the terms having the highest weights)
Hierarchical Clustering
• Either repeated flat clustering, or incremental building of clusters until a stop condition is met — typically reaching the wanted number of clusters
• Agglomerative methods
  – Gradual joining of the most similar documents and/or smaller clusters
• Divisive methods
  – Gradual splitting of the largest clusters

Hierarchical Clustering
• Different definitions of cluster similarity produce different results (a sketch follows after the example below)
  – Single linkage clustering
    • Similarity of two clusters = similarity of the closest couple of documents
  – Complete linkage clustering
    • Similarity of two clusters = similarity of the farthest couple of documents
  – Average group linkage
    • Similarity of two clusters = average similarity of all couples

• Example term-document weights:

                  d1     d2     d3     d4     d5     d6     d7     d8     d9     d10
    arithmetic   0.0    0.0    0.541  0.0    0.55   0.0    0.0    0.0    0.0    0.0
    basketball   0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.563  0.0    0.0
    C            0.0    0.0    0.541  0.0    0.55   0.0    0.0    0.0    0.0    0.0
    error        0.0    0.0    0.0    0.556  0.0    0.0    0.0    0.0    0.517  0.0
    cycle        0.0    0.0    0.0    0.556  0.55   0.55   0.0    0.0    0.0    0.0
    inheritance  0.0    0.0    0.541  0.556  0.55   0.0    0.0    0.0    0.0    0.0
    hardware     0.563  0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
    player       0.0    0.583  0.0    0.0    0.0    0.0    0.0    0.531  0.517  0.556
    java         0.0    0.0    0.541  0.556  0.0    0.0    0.0    0.0    0.0    0.0
    language     0.0    0.0    0.541  0.556  0.55   0.0    0.0    0.0    0.0    0.0
    trash        0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.0    0.0
    ball         0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.333  0.0
    pivot        0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.531  0.0    0.0
    platform     0.0    0.0    0.0    0.556  0.55   0.0    0.0    0.0    0.0    0.0
    computer     0.563  0.0    0.541  0.556  0.55   0.55   0.0    0.0    0.0    0.0
    procedure    0.0    0.583  0.0    0.0    0.55   0.55   0.0    0.0    0.0    0.0
    speed        0.563  0.0    0.0    0.0    0.0    0.55   0.0    0.0    0.0    0.556
    server       0.563  0.0    0.0    0.0    0.0    0.55   0.0    0.0    0.0    0.0
    software     0.563  0.0    0.541  0.0    0.0    0.0    0.0    0.0    0.0    0.0
    sport        0.0    0.0    0.0    0.0    0.0    0.0    0.556  0.563  0.55   0.556
    net          0.563  0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.517  0.0
    try          0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.531  0.517  0.0
    performance  0.0    0.0    0.0    0.556  0.0    0.55   0.0    0.0    0.0    0.0

Hierarchical Clustering — dendrogram
• Result obtained by agglomerative hierarchical clustering using average group linkage
[Figure: dendrogram not reproduced.]

Hierarchical Clustering
• The obtained hierarchy is binary (in general k-ary)
• More natural and more suitable is a hierarchy with open arity, which better reflects the similarities between the clusters
• The optimal number of descendants of the root cluster is found, then the process is recursively applied to them
  – The quality of the clustering is measured for different numbers of descendants
    • The cut is done at the time of the highest growth of the error, or
    • at the time of the highest ratio of the error differences, or
    • at the time of the highest second derivative

Hierarchical Clustering
• Cutting at the time of the highest error growth produces the tree:
  d1,…,d10
    – d1,…,d6
      • leaves d1, d2, d3, d4, d5, d6
    – d7,…,d10
      • subcluster {d7, d8} with leaves d7, d8
      • leaves d9, d10
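A minimal Python sketch of the agglomerative variant with the three linkage definitions above (the names and the stop condition "merge until a target number of clusters remains" are illustrative assumptions):

    import numpy as np

    def agglomerative(docs, target, linkage="average"):
        """Agglomerative clustering over cosine similarities (a sketch).
        docs: normalized vectors; merges until `target` clusters remain."""
        clusters = [[i] for i in range(len(docs))]
        sims = docs @ docs.T
        def cluster_sim(a, b):
            pair = [sims[i, j] for i in a for j in b]
            if linkage == "single":   return max(pair)  # closest couple
            if linkage == "complete": return min(pair)  # farthest couple
            return sum(pair) / len(pair)                # average group linkage
        while len(clusters) > target:
            # find and merge the most similar pair of clusters
            i, j = max(((i, j) for i in range(len(clusters))
                               for j in range(i + 1, len(clusters))),
                       key=lambda ij: cluster_sim(clusters[ij[0]], clusters[ij[1]]))
            clusters[i] += clusters.pop(j)
        return clusters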
General Cluster Labeling
• Simplifies user navigation through the clusters
• Mark clusters by:
  – a term set
  – a collocation set
• Terms used as labels should be
  – Descriptive (describe the content of the cluster well)
  – Discriminative (discriminate the content from the other clusters)

General Cluster Labeling
• Modified Information Gain of a term t in a cluster X:
IG_m(t, X) = P(t,X)·log( P(t,X) / (P(t)·P(X)) ) + P(t̄,X̄)·log( P(t̄,X̄) / (P(t̄)·P(X̄)) )
where
  – P(t) = # docs containing the term t / # all docs in the collection
  – P(t̄) = # docs NOT containing the term t / # all docs in the collection
  – P(X) = # docs inside the cluster X / # all docs in the collection
  – P(X̄) = # docs OUTSIDE the cluster X / # all docs in the collection
  – P(t,X) = # docs in the cluster X containing t / # all docs in the collection, i.e. P(t,X) = P(t)·P(X|t)

General Cluster Labeling
• Select the terms having the highest IG_m
• Join clusters on the same level having the same labeling

TRS models, based on the VS Model
Inductive Model
Semantic Net

Inductive TRS
• A modification of the VS model
• Similar to a two-layered neural network
  – The bottom (input) layer contains m nodes representing the terms t1, …, tm
  – The upper (output) layer contains n nodes representing the documents d1, …, dn
  – The terms tj are interconnected with the documents di by oriented edges, rated by the weights w_{i,j}

Inductive TRS
• Up to this point equal to the VS model, using a different terminology

Inductive TRS
• Plus reverted edges, rated by weights x_{i,j}
• Usually x_{i,j} = w_{i,j}; they can differ in general

Inductive TRS
• The query q defines the initial values of the input nodes
• Initialization: t_j = q_j
• Forward step: d_i = Σ_{j=1..m} w_{i,j}·t_j
• Backward step: t_j = Σ_{i=1..n} x_{i,j}·d_i

Inductive TRS
• The weights in the query initialize the bottom layer of the net — the term nodes

Inductive TRS
• The forward step computes the similarities of the documents with the given query

Inductive TRS
• The backward step activates further terms that are not mentioned in the original query but are important for the documents similar to the query

Inductive TRS
• The next forward step activates more documents, …

Inductive TRS
• The global sum of the values (the energy) in a layer grows with the iterations
• Forward step:
  – A column sum in the index matrix D is greater than 1 if there are enough documents in the collection
  ⟹ each value in the bottom layer contributes by more than its own value to the top layer energy

Inductive TRS
• The global sum of the values (the energy) in a layer grows with the iterations
• Backward step:
  – A row sum in the index matrix D is always greater than 1 if the document vectors are normalized:
    Σ_{j=1..m} w_{i,j}² = 1 ∧ w_{i,j} ≤ 1 ⟹ Σ_{j=1..m} w_{i,j} ≥ 1
  ⟹ each value in the top layer contributes by more than its own value to the bottom layer energy

Inductive TRS
• Solved by so-called lateral inhibition
• The documents are interconnected with each other by horizontal edges
• Each of them is weighted by a value l_{i,j}, determining how much the j-th document inhibits the value of the i-th document
• Lateral inhibition is executed before the backward step:
d_i ← d_i − Σ_{j≠i} l_{i,j}·d_j

Inductive TRS
• Either n² independent coefficients (space consuming),
• or one coefficient for each document,
• or (usually) one coefficient for all

Inductive TRS
• Forward step: d_i = Σ_{j=1..m} w_{i,j}·t_j = w_i·t = Sim(w_i, t)
• Backward step (no lateral inhibition, x = w): t^{(k+1)} = Σ_{i=1..n} Sim(d_i, t^{(k)})·d_i
• Corresponds to (automatic) feedback with the coefficients γ_i = Sim(d_i, t)
(a sketch of the iteration follows below)
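A minimal Python sketch of the forward/backward iteration with a single lateral-inhibition coefficient, as the slides suggest (the number of iterations and the coefficient value are illustrative):

    import numpy as np

    def inductive_search(W, q, iterations=2, inhibition=0.01):
        """Forward/backward iteration of the inductive model (a sketch).

        W is the n x m index matrix (weights w[i,j]); the reverse weights
        x[i,j] are taken equal to w[i,j].
        """
        t = q.copy()                   # initialization: t_j = q_j
        for _ in range(iterations):
            d = W @ t                  # forward: d_i = sum_j w[i,j] * t_j
            # lateral inhibition: d_i <- d_i - l * sum_{j != i} d_j
            d = d - inhibition * (d.sum() - d)
            t = W.T @ d                # backward: t_j = sum_i x[i,j] * d_i
        return W @ t                   # final document activations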
Semantic Net and Spreading
• Semantic net
  – A thesaurus generalization
    • by associations between documents
    • by associations between documents and terms
  – A general oriented graph with weighted edges
    • Nodes correspond to terms and documents
    • Weighted oriented edges correspond to associations

Semantic Net and Spreading
• Term↔Term associations
  – Synonyms
  – Broader–narrower terms
  – Related terms
  – ...
• Term↔Document associations
  – Importance of terms to identify documents
  – ...
• Document↔Document associations
  – Citations
  – ...

Semantic Net
[Figure: an example semantic net over term nodes (hardware, "is component of" computer, home computer ≡ personal computer via a synonym edge, information system, data, informatics, bibliographic informatics, information retrieval, linked by ISA and association edges) and document nodes D1–D4 (linked by term-to-document associations, citations and similar-document edges).]

Spreading
• Query q = (q1, q2, …, qm) ∈ [0,1]^m
• Initialization: t_j = q_j / Σ_{j=1..m} q_j
• The increment of the node value u_j caused by the node u_i during an iteration:
Δu_j = u_i · w_{i,j} / Σ_k w_{i,k}
• Overall: u_j ← u_j + Σ_i u_i · w_{i,j} / Σ_k w_{i,k}

Models Based on the Boolean Model
Fuzzy Model
MMM Model
Paice Model
P-norm Model

Boolean Model Extensions
• In opposition to the classical Boolean Model they
  – allow weighted queries
    • Information(0.7) AND System(0.3)
    • Breeding(0.9) AND (Dogs(0.6) OR Cats(0.4))
  – allow the internal use of a vector space index
• This allows ordering the output according to the presumed relevancy

Fuzzy Logic
• Document d_i = (w_{i,1}, w_{i,2}, …, w_{i,m})
• Query similarity:
  – Sim(d_i, t_a(q_a) AND t_b(q_b)) = min(q_a·w_{i,a}, q_b·w_{i,b})
  – Sim(d_i, t_a(q_a) OR t_b(q_b)) = max(q_a·w_{i,a}, q_b·w_{i,b})
  – Sim(d_i, NOT t_a(q_a)) = 1 − q_a·w_{i,a}

Fuzzy Logic
• Documents with the same similarity to an (unweighted) conjunction lie on the blue lines; documents with the same similarity to an (unweighted) disjunction lie on the green lines
[Figure not reproduced.]

Fuzzy Logic
• Example: d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c: min(1/2, 1/4, 1/8) = 1/8, min(1/2, 1/6, 1/8) = 1/8
a OR b OR c: max(1/2, 1/4, 1/8) = 1/2, max(1/2, 1/6, 1/8) = 1/2
NOT b: 1 − 1/4 = 3/4, 1 − 1/6 = 5/6

MMM (Min-Max Model)
• A linear combination of the minimal and maximal values
  – Sim(d_i, t_a(q_a) AND t_b(q_b)) = k_MinAnd·min(q_a w_{i,a}, q_b w_{i,b}) + k_MaxAnd·max(q_a w_{i,a}, q_b w_{i,b})
  – Sim(d_i, t_a(q_a) OR t_b(q_b)) = k_MinOr·min(q_a w_{i,a}, q_b w_{i,b}) + k_MaxOr·max(q_a w_{i,a}, q_b w_{i,b})
• k_MinAnd > k_MaxAnd, k_MinOr < k_MaxOr
• Usually 2 coefficients: k_MinAnd + k_MaxAnd = k_MinOr + k_MaxOr = 1,
or 1 coefficient: k = k_MinAnd = 1 − k_MaxAnd = k_MaxOr = 1 − k_MinOr

MMM (Min-Max Model)
• Iso-similarity lines for an (unweighted) conjunction (blue) and disjunction (green), here for k = 0.75
[Figure not reproduced.]

MMM (Min-Max Model)
• Ex.: k = 3/4, d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c: (3/4)·(1/8) + (1/4)·(1/2) = 7/32 (for both documents)
a OR b OR c: (1/4)·(1/8) + (3/4)·(1/2) = 13/32 (for both documents)

Paice Model
• All values are taken into account; their importance decreases geometrically:
Sim = Σ_k r^k · q_{j_k}·w_{i,j_k}, where 0 < r ≤ 1
• In the case of a conjunction the values q_{j_k}·w_{i,j_k} are ordered in ascending order
• In the case of a disjunction in descending order
(a sketch of the fuzzy, MMM and Paice evaluation follows below)
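A minimal Python sketch of the three evaluation rules above, applied to the precomputed products q_j·w_{i,j} (the Paice variant uses the weights r^k, matching the worked example that follows):

    def fuzzy_and(vals):
        return min(vals)

    def fuzzy_or(vals):
        return max(vals)

    def mmm_and(vals, k=0.75):          # k*min + (1-k)*max
        return k * min(vals) + (1 - k) * max(vals)

    def mmm_or(vals, k=0.75):           # (1-k)*min + k*max
        return (1 - k) * min(vals) + k * max(vals)

    def paice(vals, r=0.5, conj=True):
        """Geometrically decreasing weights r^k over the sorted values:
        ascending order for conjunctions, descending for disjunctions."""
        ordered = sorted(vals, reverse=not conj)
        return sum(r ** (k + 1) * v for k, v in enumerate(ordered))

    # d1 = (1/2, 1/4, 1/8): fuzzy_and -> 1/8, fuzzy_or -> 1/2,
    # mmm_and -> 7/32, mmm_or -> 13/32, paice (conj) -> 3/16 = 36/192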
Paice Model
• Ex.: r = 1/2, d1 = (1/2, 1/4, 1/8), d2 = (1/2, 1/6, 1/8)
a AND b AND c:
d1: (1/2)·(1/8) + (1/4)·(1/4) + (1/8)·(1/2) = 36/192
d2: (1/2)·(1/8) + (1/4)·(1/6) + (1/8)·(1/2) = 32/192
a OR b OR c:
d1: (1/2)·(1/2) + (1/4)·(1/4) + (1/8)·(1/8) = 63/192
d2: (1/2)·(1/2) + (1/4)·(1/6) + (1/8)·(1/8) = 59/192

Extended Boolean Logic (P-norm Model)
• The similarity is derived from the distance of the document (measured by the p-norm) from the zero (false) document dF = <0, 0, …, 0> for disjunctions, resp. from the unitary (true) document dT = <1, 1, …, 1> for conjunctions

Extended Boolean Logic (P-norm Model)
• Non-weighted query variant, 2 terms:
Sim(a OR b) = ( (w_{i,a}^p + w_{i,b}^p) / 2 )^{1/p}   … the normalized distance ‖dF − d_i‖_p
Sim(a AND b) = 1 − ( ((1 − w_{i,a})^p + (1 − w_{i,b})^p) / 2 )^{1/p}   … 1 − the normalized distance ‖dT − d_i‖_p

Extended Boolean Logic (P-norm Model)
• Non-weighted query, k terms:
Sim(OR) = ( Σ_{j=1..k} w_{i,j}^p / k )^{1/p}
Sim(AND) = 1 − ( Σ_{j=1..k} (1 − w_{i,j})^p / k )^{1/p}

Extended Boolean Logic (P-norm Model)
• Weighted queries, k terms:
Sim(OR) = ( Σ_{j=1..k} q_j^p·w_{i,j}^p / Σ_{j=1..k} q_j^p )^{1/p}
Sim(AND) = 1 − ( Σ_{j=1..k} q_j^p·(1 − w_{i,j})^p / Σ_{j=1..k} q_j^p )^{1/p}

Extended Boolean Logic (P-norm Model)
• If p → ∞, the model turns toward the classical Boolean model
• If p = 1, disjunctions correspond to the vector space model
• If p = 2, the reported results are better than in the case of the vector space model

Extended Boolean Logic
• Documents with the same similarity to an (unweighted) conjunction lie on the blue arcs; documents with the same similarity to an (unweighted) disjunction lie on the green arcs (here p = 2)
[Figure not reproduced; a sketch of the weighted P-norm formulas follows below.]
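A minimal Python sketch of the weighted P-norm formulas above (the function names are illustrative):

    def pnorm_or(weights, query_weights, p=2.0):
        """P-norm disjunction: normalized p-distance from the false point."""
        num = sum(q**p * w**p for q, w in zip(query_weights, weights))
        den = sum(q**p for q in query_weights)
        return (num / den) ** (1.0 / p)

    def pnorm_and(weights, query_weights, p=2.0):
        """P-norm conjunction: 1 - normalized p-distance from the true point."""
        num = sum(q**p * (1 - w)**p for q, w in zip(query_weights, weights))
        den = sum(q**p for q in query_weights)
        return 1.0 - (num / den) ** (1.0 / p)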
Handling Index-term Dependencies (Concepts)
Concept Net for Boolean Model

Concept Net for Boolean Model
• Requires a Boolean TRS with an available thesaurus
• Instead of individual terms (which can be mutually dependent) it works with so-called concepts, which are mutually semantically independent

Concept Net for Boolean Model
• Synonyms
  – All equivalent terms (a set of synonyms) form one semantic concept (theme)
    • I.e.: "Home computer" ≡ "Personal computer"
  – Documents using any of those terms are supposed to tell something about the same concept
  – In the above case there exist only 2¹ = 2 different document classes instead of four in the classic Boolean model

Concept Net for Boolean Model
• Related terms (in some semantic association)
  – A couple of related terms defines three semantically independent concepts
    • I.e. "Information system" ↔ "Informatics"
  – Documents can independently tell
    • about "Information system" (but not about "Informatics"),
    • about "Informatics" (but not about "Information system"),
    • about the theme defined by the intersection of their semantics
  – There exist 2³ = 8 different document classes instead of four in the classic Boolean model

Concept Net for Boolean Model
• Broader term – narrower term
  – Such a couple of terms defines two semantically independent concepts
    • I.e. "Computer" > "Personal computer"
  – Documents can independently tell
    • about "Computer" (but not about "Personal computer"), let's say about a mainframe,
    • about "Personal computer" (and so about "Computer" as well)
  – There exist 2² = 4 different document classes, not equivalent to those in the classic Boolean model

Concept Net
• Thesaurus: synonyms / associations / ISA hierarchies
[Figure: a thesaurus over the terms hardware, computer, home computer, personal computer, information system, data, informatics, bibliographic informatics and information retrieval, with synonym (home computer ≡ personal computer), association and ISA edges.]

Concept Net
• The corresponding net of atomic concepts covers informatics, bibliographical informatics, hardware, computer, home/personal computer, information system, data and information retrieval, plus
  – X … the common sense of "bibliographical informatics" and "information retrieval"
  – a complementary concept that represents "anything else"

Concept Net
• There are 9 terms, which gives 2⁹ = 512 different document classes
• In fact there are 12 different, semantically independent atomic concepts (the last one represents any other topic, different from those described by the terms in the thesaurus), i.e. 2¹² = 4096 different document classes (2¹¹ = 2048 without the twelfth)
• Each atomic concept is represented by a conjunction of all terms, in positive or negative notion according to the position of the concept inside/outside the corresponding set
  – For example, X represents: "informatics" and "bibliographical informatics" and "information retrieval" and not "hardware" and not "computer" and not "home computer" and not "information system" and not "data" and not "anything else"

Concept Net Construction
• One concept is assigned to each set of synonyms
[Matrix: a binary term × concept incidence table; the synonyms "home computer" and "personal computer" share one concept column.]

Concept Net Construction
• Concepts corresponding to the couples of related terms are added
[Matrix: additional concept columns carry 1s for both terms of each related couple.]

Concept Net Construction
• Going through the thesaurus in the bottom-to-top direction, the ones from the narrower-term rows are copied to the broader-term rows
[Matrix: e.g. the "computer" row inherits the 1s of "home/personal computer".]

Concept Net Construction
• Last, the complementary concept is (optionally) added
[Matrix: the complementary concept column contains 0 for every term.]

Concept Net
• Query
  – Non-weighted disjunction of terms:
    'informatics' OR 'information retrieval' OR 'data'
  – Weighted disjunction of terms:
    ('informatics'; 0.5) OR ('information retrieval'; 1.0) OR ('data'; 0.4)

Concept Net
• An unweighted disjunction is translated to the disjunction of the corresponding term columns — to a column vector of concepts; documents can be translated as well
• 'informatics' OR 'information retrieval' OR 'data' → (1,1,0,0,0,1,0,1,1,1,1,0)
Concept Net
• Both vectors, corresponding to the query and to the document, are compared using the dot-product similarity computation
• Query "informatics" OR "information retrieval" OR "data" → (1,1,0,0,0,1,0,1,1,1,1,0)
• Document "Information system" → (0,1,0,0,1,0,0,1,1,1,1,0)
• Similarity = 5, while the term-based similarity is nought

Concept Net
• A weighted term disjunction q = (q1, q2, …, qm) can be translated to a column vector of concept weights using fuzzy logic
• The weight of the concept r is max_{j=1..m} ( q_j · t_{r,j} / Σ_s t_{s,j} ), where t_{r,j} ∈ {0,1} says whether the j-th term covers the r-th concept
  – i.e. a term's weight is spread uniformly over its concepts, and the contributions to a concept are combined by fuzzy OR (maximum)

Concept Net
• Query ('informatics'; 0.5) OR ('information retrieval'; 1.0) OR ('data'; 0.4):
  – "informatics" covers 4 concepts → 1/4 each
  – "information retrieval" covers 2 concepts → 1/2 each
  – "data" covers 2 concepts → 1/2 each
• The resulting concept vector:
(0.5·1/4, 0.4·1/2, 0, 0, 0, 0.5·1/4, 0, 1.0·1/2, 1.0·1/2, 0.4·1/2, 0.5·1/4, 0)

Handling Index-term Dependencies (Concepts)
Singular Value Decomposition — SVD for the Vector Space Model

Latent Semantic Indexing — LSI
• Similarly to the concept net in the Boolean model, LSI tries to find mutually independent concepts — themes — that can be used for indexation instead of the terms, which can be (and usually are) mutually dependent
• It doesn't use a thesaurus
• It derives the so-called latent semantic dependencies directly from the vector space index

Latent Semantic Indexing — LSI
• Example: 3 documents in a 3D space (documents in columns):

    D = ( √2/2  √3/3  0 )
        ( √2/2  √3/3  0 )
        (  0    √3/3  1 )

• Matrix rank = 2 < dimension

Singular Value Decomposition
• Each matrix A having m×n values and rank r (e.g. the matrix A = Dᵀ, i.e. rows are terms) can be decomposed into the product Dᵀ = U·S·Vᵀ, where
  – U ∈ R^{m×r} has r orthonormal columns
    • They form a base in the m-dimensional term space
    • Its dimension corresponds to the rank of the original matrix
  – S ∈ R^{r×r} is a diagonal regular matrix
  – V ∈ R^{n×r} has r orthonormal columns
    • They form a base in the n-dimensional document space
    • Its dimension corresponds to the rank of the original matrix

Singular Value Decomposition
• Left singular vectors u1, u2, …, ur
  – Eigenvectors of the matrix A·Aᵀ = Dᵀ·D
• Singular values σ1 ≥ σ2 ≥ … ≥ σr > 0
  – Square roots of the absolute values of the eigenvalues of the matrix A·Aᵀ, resp. Aᵀ·A
Singular Value Decomposition
• Right singular vectors v1, v2, …, vr
  – Eigenvectors of the matrix Aᵀ·A = D·Dᵀ
• Schematically Dᵀ = U·S·Vᵀ, with the columns u1, …, ur in U, the singular values σ1, …, σr on the diagonal of S, and the rows v1ᵀ, …, vrᵀ in Vᵀ

Singular Value Decomposition
• Geometrical meaning
  – The matrix projects the unitary m-dimensional sphere onto an r-dimensional ellipsoid with the axis directions stored in the columns of the matrix U
  – The half-axis lengths correspond to the values σ1, σ2, …, σr
  – The right singular vectors are projected onto vectors parallel to the space axes

Latent Semantic Indexing (LSI)
• LSI takes the mutual dependencies of terms into account using the SVD of the index matrix
  + Co-occurring (equivalent) terms are projected onto a common dimension
  + Allows further reduction of the matrix dimensionality
  + The space required by the index can be smaller
  + Documents containing similar terms can be considered similar, even if they contain distinct terms

Latent Semantic Indexing (LSI)
• Represents the documents in a space with dimensionality equivalent to the rank of the original index matrix
• The dimensions correspond to the left singular vectors of the SVD decomposition

Latent Semantic Indexing (LSI)
• It is possible to approximate the index matrix D by a matrix with a defined lower rank k < r
  – To achieve the rank k < r, the matrix U·S·Vᵀ is approximated by Uk·Sk·Vkᵀ, where
    • Uk corresponds to the first k columns of the matrix U
    • Sk corresponds to the upper left k×k corner of the matrix S
    • Vk corresponds to the first k columns of the matrix V

Latent Semantic Indexing (LSI)
• When k is decreased by one, the ellipsoid is flattened along its shortest axis
• Uk·Sk·Vkᵀ is the best approximation of U·S·Vᵀ that has rank k, according to the Frobenius norm of the difference of both matrices:
‖U·S·Vᵀ − Uk·Sk·Vkᵀ‖_F ≤ ‖U·S·Vᵀ − X‖_F for every matrix X with rank k,
where the Frobenius norm is ‖M‖_F = √( Σ_{i=1..m} Σ_{j=1..n} x_{i,j}² )

Latent Semantic Indexing (LSI)
• Example — 6 documents containing 5 different terms:

                 d1  d2  d3  d4  d5  d6
    Cosmonaut     1   0   1   0   0   0
    Astronaut     0   1   0   0   0   0
    Moon          1   1   0   0   0   0
    Vehicle       1   0   0   1   1   0
    Car           0   0   0   1   0   1

• The decomposition Dᵀ = U·S·Vᵀ:

    U:            u1       u2       u3       u4       u5
    Cosmonaut  -0.4403   0.2962  -0.5695  -0.5774   0.2464
    Astronaut  -0.1293   0.3315   0.5870   0.0000   0.7272
    Moon       -0.4755   0.5111   0.3677   0.0000  -0.6144
    Vehicle    -0.7030  -0.3506  -0.1549   0.5774   0.1598
    Car        -0.2627  -0.6467   0.4146  -0.5774  -0.0866

    S = diag(2.1625, 1.5944, 1.2753, 1.0000, 0.3939)

    V:         v1       v2       v3       v4       v5
    D1      -0.7486   0.2865  -0.2797   0.0000  -0.5285
    D2      -0.2797   0.5285   0.7486   0.0000   0.2865
    D3      -0.2036   0.1858  -0.4466  -0.5774   0.6255
    D4      -0.4466  -0.6255   0.2036   0.0000   0.1858
    D5      -0.3251  -0.2199  -0.1215   0.5774   0.4056
    D6      -0.1215  -0.4056   0.3251  -0.5774  -0.2199

Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 5 = r reproduces the original index (documents in rows):

    d1:  1.000  -0.000   1.000   1.000   0.000
    d2:  0.000   1.000   1.000  -0.000   0.000
    d3:  1.000  -0.000  -0.000  -0.000   0.000
    d4: -0.000  -0.000   0.000   1.000   1.000
    d5: -0.000  -0.000  -0.000   1.000  -0.000
    d6:  0.000  -0.000   0.000  -0.000   1.000

Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 4:

    d1:  1.051   0.151   0.872   1.033  -0.018
    d2: -0.028   0.918   1.069  -0.018   0.010
    d3:  0.939  -0.179   0.151  -0.040   0.021
    d4: -0.018  -0.053   0.045   0.988   1.006
    d5: -0.039  -0.116   0.098   0.975   0.014
    d6:  0.021   0.063  -0.053   0.014   0.993

Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 3:

    d1:  1.051   0.151   0.872   1.033  -0.018
    d2: -0.028   0.918   1.069  -0.018   0.010
    d3:  0.606  -0.179   0.151   0.294  -0.312
    d4: -0.018  -0.053   0.045   0.988   1.006
    d5:  0.294  -0.116   0.098   0.641   0.347
    d6: -0.312   0.063  -0.053   0.347   0.659

Latent Semantic Indexing (LSI)
• Uk·Sk·Vkᵀ for k = 2:

    d1:  0.848   0.361   1.003   0.978   0.130
    d2:  0.516   0.358   0.718   0.130  -0.386
    d3:  0.282   0.155   0.361   0.206  -0.076
    d4:  0.130  -0.206  -0.050   1.029   0.899
    d5:  0.206  -0.025   0.155   0.617   0.411
    d6: -0.076  -0.180  -0.206   0.411   0.487
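The whole example can be reproduced with numpy's SVD. A minimal sketch, using the reconstructed index above (the similarity values reported on the following slides may additionally be normalized, which this sketch does not do):

    import numpy as np

    # Rows = documents d1..d6; columns = terms
    # (cosmonaut, astronaut, moon, vehicle, car).
    D = np.array([
        [1, 0, 1, 1, 0],   # d1
        [0, 1, 1, 0, 0],   # d2
        [1, 0, 0, 0, 0],   # d3
        [0, 0, 0, 1, 1],   # d4
        [0, 0, 0, 1, 0],   # d5
        [0, 0, 0, 0, 1],   # d6
    ], dtype=float)

    U, s, Vt = np.linalg.svd(D.T, full_matrices=False)   # D^T = U S V^T
    k = 2
    Uk, Sk = U[:, :k], np.diag(s[:k])

    def to_lsi(v):
        """Project a term-space vector (document or query) to k dimensions
        by multiplying with Uk * Sk^-1, as derived on the next slides."""
        return v @ Uk @ np.linalg.inv(Sk)

    q = np.array([0, 0, 1, 1, 0], dtype=float)   # query [moon, vehicle]
    sims = to_lsi(D) @ to_lsi(q)                 # dot products in LSI space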
DIS based on LSI
• Instead of the matrix Dᵀ we can use its SVD decomposition U·S·Vᵀ, resp. its approximation Uk·Sk·Vkᵀ
• Similarities of document couples:
D·Dᵀ = (Vk·Sk·Ukᵀ)·(Uk·Sk·Vkᵀ)
     = (Vk·Sk)·(Sk·Vkᵀ)    {Uk is orthonormal, i.e. Ukᵀ·Uk = I}
     = Vk·Sk²·Vkᵀ          {Sk is diagonal, i.e. Sk = Skᵀ}
     = (Vk·Sk)·(Vk·Sk)ᵀ

DIS based on LSI
• Similarities of term couples:
Dᵀ·D = (Uk·Sk·Vkᵀ)·(Vk·Sk·Ukᵀ)
     = (Uk·Sk)·(Sk·Ukᵀ)    {Vk is orthonormal, i.e. Vkᵀ·Vk = I}
     = Uk·Sk²·Ukᵀ          {Sk is diagonal, i.e. Skᵀ = Sk}
     = (Uk·Sk)·(Uk·Sk)ᵀ

DIS based on LSI
• From Dᵀ = Uk·Sk·Vkᵀ {multiplying by Ukᵀ from the left} follows Ukᵀ·Dᵀ = Sk·Vkᵀ {Ukᵀ·Uk = I},
and further {multiplying by Sk⁻¹ from the left} follows Sk⁻¹·Ukᵀ·Dᵀ = Vkᵀ {Sk⁻¹·Sk = I}
• By transposition we get: Vk = D·Uk·Sk⁻¹
• I.e.:
  – We obtain the new k-dimensional vector from the original one by multiplying the vector by the matrix Uk·Sk⁻¹
  – The query can be transformed the same way, by multiplying the query vector by the matrix Uk·Sk⁻¹

DIS based on LSI
• Similarity between a document and the query:
Sim_LSI(q, Di) = Sim(q·Uk·Sk⁻¹, Di·Uk·Sk⁻¹)
• Disadvantages of the method
  – It is a static method
    • The decomposition is computed from one given specific set of vectors (documents)
  – Further documents can be added using the Uk·Sk⁻¹ transformation, but this approach doesn't reflect the latent semantic features of their terms

Latent Semantic Indexing (LSI)
• Evaluation of the query [moon, vehicle], i.e. <0, 0, 1, 1, 0>
• Without LSI we obtain the similarities <2.000; 1.000; 0.000; 1.000; 1.000; 0.000> (see the index matrix above)

Latent Semantic Indexing (LSI)
• Using LSI, both the documents and the query can be projected into a 2-dimensional space using the matrix U2·S2⁻¹ (k = 2); the reduced space keeps the two most important latent semantic concepts
• After the transformation the query is evaluated as usual
• We obtain the similarities <0.983; 0.621; 0.849; 0.424; 0.713; 0.108>
  (compare with <2.000; 1.000; 0.000; 1.000; 1.000; 0.000> — e.g. d3 now scores 0.849, although it shares no term with the query)

Latent Semantic Indexing (LSI)
• Existing benchmark results:
  – On small collections having ~10³ documents, up to a 30% increase of precision in comparison with the VS model
  – On collections having ~10⁴ documents, still better than the VS model
  – On collections having ~10⁵ documents, the results fall behind the VS model

Signatures

Signatures
• Suitable for conjunctive queries over a Boolean IS
• Excludes a large amount of irrelevant documents
• Requires low time and space complexity

Signatures
• Signature = a k-bit string
  – k is a predefined constant
• A document signature si is assigned to each document di
• A query signature s is assigned to each conjunctive query q
Signature Query Evaluation
• The document signature si matches the query signature s if and only if si ≥ s bit by bit, i.e. iff s AND NOT si = 0 (binary)
• If the document signature doesn't match, the given document cannot contain all the queried terms
• If the document signature matches, the document can, but need not, contain all the required terms

Signature Query Evaluation
• Effectively computable evaluation using native machine-code instructions of the CPU
• Non-exact: an irrelevant document can still match in the signature comparison (a false hit)
• It is necessary to follow with a further — more exact — comparison

Signature Assigning
• Word signature
  – A hash function h : X* → 0..k−1
  – The signature sig(w) of a word w has a 1-bit set at position h(w); all other positions contain 0-bits
  – More suitable than a hash function X* → 0..2^k−1 that directly generates signatures with many (k/2 on average) 1-bits, matching a large amount of queries
(a sketch of word/block signatures and matching follows below)

Signature Assigning
• Document signature
  – Layering (bitwise disjunction) of the signatures assigned to the individual words
  – A collection of documents with a fixed internal structure (author, title, abstract, body, publisher) can use a signature built by concatenation of the (layered) signatures corresponding to the individual document sections; each of the concatenated signatures can have its own predefined length

Concatenated Signatures
• Concatenated signatures allow independent querying over document section(s)
  – Books written by Alan Poe:
    • sig("Alan") = 00100000, sig("Poe") = 00001000
  – Query q = 00101000 | 00000000 | 0000000000000000
    • sig("Saul") = 00010000, sig("Bellow") = 10000000
  – Document 10010000 | xxxxxxxx | xxxxxxxxxxxxxxxx doesn't match the query signature
    • sig("Toni") = 00001000, sig("Morrison") = 00100000
  – Document 00101000 | xxxxxxxx | xxxxxxxxxxxxxxxx matches the query signature, but doesn't match the query (a false hit)

Layered Signatures
• By layering word signatures, the document/query signature gains more and more 1-bits, and so the ability to discriminate irrelevant documents vanishes — a document signature with all bits set to one matches any query
• Solution:
  – More signatures are assigned to the document, each of them for a different block of text
  – With suitable sectioning, for example on chapter or paragraph boundaries, the information loss is not problematic; an unrelated co-occurrence of words is not very important

Layered Signatures
• Blocks can be created by two methods
  – FSB (fixed size block): each block contains approximately the same number of words
  – FWB (fixed weight block): each block produces a signature with approximately the same number of 1-bits, optimally k/2
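A minimal Python sketch of word/block signatures and the matching test (Python's built-in hash stands in for h; it is salted per process, so a real system would use a stable hash function):

    def word_signature(word, k=64):
        """Word signature: a single 1-bit at position h(word)."""
        return 1 << (hash(word) % k)

    def block_signature(words, k=64):
        """Block/document signature: bitwise OR of the word signatures."""
        sig = 0
        for w in words:
            sig |= word_signature(w, k)
        return sig

    def matches(query_sig, doc_sig):
        """True iff every 1-bit of the query signature is set in the
        document signature. False safely excludes the document; True may
        still be a false hit and must be re-checked exactly."""
        return query_sig & ~doc_sig == 0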
Monotonous Signatures
• A signature is monotonous if for each two words, resp. their fragments u, v, it holds that sig(u.v) ≥ sig(u),
i.e. the signature of a word extended on the right side is not less than the signature of the original word
• Monotonous signatures allow querying of radixes using right-side wildcards "*"
• For example q = "datab*" AND "system*"

Monotonous Signatures
• Monotonous signatures can be created using the following approaches (all of them use, in general, more than one 1-bit in the signature)
  – Layering of the signatures corresponding to all word prefixes
    • sig("system") = sig'("s") + sig'("sy") + ... + sig'("system"), where sig'(w) is an arbitrary signature
  – Layering of the signatures corresponding to all n-grams within the given word (an n-gram = a sequence of n adjacent characters)
    • sig("system") = sig'("sys") + sig'("yst") + ... + sig'("tem") for trigrams

Monotonous Signatures
• Signatures built upon n-grams
  – Allow wildcards at the beginning or inside of words
  – Allow uniform utilization of all k positions in the signature
    • There exists a fixed number of n-grams (~26ⁿ in English)
    • It is possible to estimate the probabilities of the individual n-grams from the language model and assign bits to them uniformly

Signature Storing
• Inverted file
• Non-inverted file
• Signature tree
  – By ordering, signatures with the same prefix of length k1 < k appear in one continuous block
  – The prefix can be stored only once
  – It is possible to use more levels with k1, k2, …

Distributed TRS

Distributed TRS
• Data as well as functionality is spread among more computers
  – Transparency
    • The user should not be aware of the distribution
  – Scalability
    • Performance can be boosted using more computers
  – Robustness
    • A failure of one computer doesn't affect the functionality of the other computers
    • The overall functionality of the system can remain the same (redundancy) or be only slightly decreased (only a small part of the data is unavailable)

Distributed TRS
• Data stored in a TRS
  – Primary data (the original documents)
  – Secondary data (author, title, year of publication, ...)
  – Index
• Computer nodes can be distinguished according to the services they provide and the data they store and maintain
• More processes can run on one computer node

Distributed TRS
• Processes of a TRS (involved in query answering)
  – Clients (C)
    • User interface
  – Document server (D)
    • Document delivery system containing the primary data
    • For example an independent WEB server
  – Index server (S)
    • Document disclosure system containing the index and the secondary data
  – Integration node (I)
    • A specific process ensuring the coordination of the cooperating nodes and processes

Distributed TRS
• Integration node
  – Takes queries from the client processes (users)
  – Defines the strategy of the query evaluation according to its knowledge of the system topology
  – Distributes partial queries to the individual index servers
  – Collects the final result from the answers to the partial queries

Distributed TRS
• DIS vs. DDIS
[Figure: topologies built from clients (C), integration nodes (I), index servers (S) and document servers (D).]

Distributed TRS
• The availability of more Clients and Integration nodes increases the throughput of the system and its robustness
• It brings the necessity to replicate the metadata about the system topology on all integration nodes

Distributed Boolean TRS
• The index matrix is split into more (usually distinct; redundancy is achieved by mirroring) parts
• The description here is based on relational algebra notation and semantics
  – Relation R(A1, A2, …, An)
  – Boolean condition q
  – Projection R[Ai1, Ai2, …, Aik]
  – Selection R(q)
  – Natural join of relations R*S

Distributed Boolean TRS
• The index is presented as an (m+1)-ary relation D(d, t1, t2, …, tm), where tj ∈ {0,1}, d ∈ N (document identification)
• A document index instance is the matrix

    d | t1       t2       …  tm
    1 | w_{1,1}  w_{1,2}  …  w_{1,m}
    2 | w_{2,1}  w_{2,2}  …  w_{2,m}
    ⋮ |    ⋮        ⋮           ⋮
    n | w_{n,1}  w_{n,2}  …  w_{n,m}     where w_{i,j} ∈ {0,1}

Distributed Boolean TRS
• The answer to a query q is a list of identifiers of the matching documents, i.e. the relation D(q)[d]

Horizontal Fragmentation
• Splitting of the index into k fragments D1, D2, …, Dk based on a k-tuple of queries q1, q2, …, qk, where Dx = D(qx)
• q1 ∨ q2 ∨ … ∨ qk = true, i.e. D1 ∪ D2 ∪ … ∪ Dk = D
• qx ∧ qy = false if x ≠ y, i.e. Dx ∩ Dy = ∅

Horizontal Fragmentation
• D(q)[d] = (D1 ∪ … ∪ Dk)(q)[d] = (D(q1) ∪ … ∪ D(qk))(q)[d] = D(q1 ∧ q)[d] ∪ … ∪ D(qk ∧ q)[d]
• If qx ∧ q = false, then Dx(q)[d] = ∅ and the x-th index server need not take part in the query evaluation

Horizontal Fragmentation
• How to choose the queries qi?
  – To obtain fragments of approximately the same size
  – To obtain fragments where typical queries can be evaluated on as small a number of nodes as possible
• It is necessary to have statistics about the queries evaluated in the past

Vertical Fragmentation
• Splitting of the index into k fragments D1, D2, …, Dk based on a k-tuple of sets {d} ⊆ T1, T2, …, Tk ⊆ {d, t1, t2, …, tm}, where Dx = D[Tx]
• T1 ∪ T2 ∪ … ∪ Tk = {d, t1, t2, …, tm}, i.e. D1 * D2 * … * Dk = D
• Tx ∩ Ty = {d} if x ≠ y
Vertical Fragmentation
• D(q)[d] = (D1 * D2 * … * Dk)(q)[d]
• Let Tq be the set of terms used in the query, and let T̄ = {d} ∪ Tq; then
D(q)[d] = (D[T1 ∩ T̄] * … * D[Tk ∩ T̄])(q)[d]
• Smaller relations are joined
• Fragments where Tx ∩ T̄ = {d} can be omitted

Vertical Fragmentation
• Based on the following rules
D(q1 ∧ q2)[d] = D(q1)[d] ∩ D(q2)[d]
D(q1 ∨ q2)[d] = D(q1)[d] ∪ D(q2)[d]
queries can be rewritten to intersections and unions of partial results that can be evaluated on the available index servers

Vertical Fragmentation
• How to choose the sets Ti?
  – To obtain fragments of approximately the same size
    • Sets of the same size (and with similar probabilities of their terms in the text)
  – To obtain fragments where queries can be evaluated on as small a number of fragments as possible
    • Terms co-occurring in typical queries are held together; it is necessary to store the history of the queries

Combined Fragmentation
• Regular (grid):
D11=D(q1)[T1]  D12=D(q1)[T2]  D13=D(q1)[T3]
D21=D(q2)[T1]  D22=D(q2)[T2]  D23=D(q2)[T3]
D31=D(q3)[T1]  D32=D(q3)[T2]  D33=D(q3)[T3]
• Irregular:
D1=D(q1∨q2)[T1]  D2=D(q2)[T2∪T3]  D3=D(q3)

Example
• An irregular combined fragmentation
D1=D(q1), D21=D(q2)[T1], D22=D(q2)[T2], D3=D(q3)
• Where
  – T1={d, t1, t2, t3}, T2={d, t4, t5, t6}
  – q1 = (t1 ∧ t4), q2 = (t1 ∧ ¬t4) ∨ (¬t1 ∧ t4), q3 = (¬t1 ∧ ¬t4)

Example
• D = D1 ∪ (D21 * D22) ∪ D3, query q = t1 ∧ (t2 ∨ t4 ∨ t5)
• D(q)[d] = (D1 ∪ (D21*D22) ∪ D3)(q)[d]
= D1(q)[d] ∪ (D21*D22)(q)[d] ∪ D3(q)[d]
= D1(q)[d] ∪ (D21*D22)(q)[d]   … because q ∧ q3 = false
• (D21*D22)(q)[d]
= (D21*D22)(t1 ∧ (t2 ∨ t4 ∨ t5))[d]
= (D21*D22)(t1)[d] ∩ (D21*D22)(t2 ∨ t4 ∨ t5)[d]

Example
• (D21*D22)(t1)[d] = D21(t1)[d]
• (D21*D22)(t2 ∨ t4 ∨ t5)[d]
= (D21*D22)(t2)[d] ∪ (D21*D22)(t4 ∨ t5)[d]
= D21(t2)[d] ∪ D22(t4 ∨ t5)[d]

Example
• The partial queries D1(q)[d], D21(t1)[d], D21(t2)[d] and D22(t4 ∨ t5)[d] are routed to the index servers S1, S21 and S22 and combined by the integration node; the server S3 is not contacted

Distributed Vector Space TRS
• Uses clustering
  – Analogous to the horizontal fragmentation in a Boolean TRS
• The integration servers need information about the cluster topology, i.e. about the centers and radii of the clusters

Integrated TRS

Integrated TRS
• Integration of more independent TRSs into one meta-system
• Problems
  – Different methods of indexing
    • One document can have more different representations
  – Different sets of terms
  – Different similarity computations

Optimal searching
• One of the methods of TRS integration; in addition it
  – decreases the problems with the prediction criterion
  – is suitable for systems containing multiple modules for
    • document indexation (more independent algorithms) — high space complexity
    • query indexation algorithms
    • more similarity computations
  – The system combines the results together and (based on the user interaction) chooses the optimal combination of the available methods

Optimal searching
• Optimal query answering method
  – k different methods
  – the i-th method returned ri relevant documents in a set of ni returned documents
• How to determine the best of the available TRSs?
  – By the ri²/ni criterion
  – It is not necessary to find out the overall number of all relevant documents in the collections

Optimal searching
• Let us suppose knowledge of all relevant documents and of their count r
• Let X = (x1, x2, …, xn), where xj = 1 if and only if the j-th document is relevant for the user, else xj = 0
• Let Yi = (y_{i,1}, y_{i,2}, …, y_{i,n}), where y_{i,j} = 1 if and only if the i-th system returned the j-th document, else y_{i,j} = 0
Optimal searching
• Optimally Yi = X
• The quality of the i-th system can be measured as
Sim(Yi, X) = cos(Yi, X) = (Yi · X) / (‖Yi‖·‖X‖), where
Yi · X = Σ_j y_{i,j}·x_j = ri   (the number of relevant documents returned by the i-th system)
‖Yi‖ = √ni, ‖X‖ = √r
⟹ cos(Yi, X) = ri / √(ni·r) = √( (ri/ni)·(ri/r) ) = √(Pi·Ri)

Optimal searching
• The quality measure corresponds to √(Pi·Ri)
• Due to Pi·Ri = (ri/ni)·(ri/r) = ri²/(ni·r), where r is constant, and since the square root is ascending on <0;1>, ordering by ri²/ni is sufficient to order those expressions

Optimal searching
• Query evaluation algorithm
  – The query is evaluated by all available methods and the results are merged together into one ordered list (see below)
  – The user marks the relevant documents in the answer
  – The individual methods are rated by the ri²/ni criterion
  – The best method is preferred in future answers

Optimal searching
• Merging of the output document lists
  – Different methods can return documents rated by numbers from different intervals
  – It is necessary to normalize all values from the local interval <l1,l2> of the i-th method to a global interval <g1,g2>,
    linearly: y = (x − l1)·(g2 − g1)/(l2 − l1) + g1
  – <g1,g2> is usually <0,1>, thus y = (x − l1)/(l2 − l1)

Optimal searching
• Merging of the output document lists
  – If a given document is found and rated by more than one method, the overall rate of the document has to be computed
  – The individual document rates are considered estimations of the probability of document relevancy given by the method(s)
  – If one document is returned multiple times, the probability of its relevancy grows
  – If si ∈ <0;1>, then s = 1 − Π(1 − si),
    computed over all methods that returned the given document

Optimal searching
• If there exists some method (or methods) that should be preferred over the other methods, the rates provided by a method are normalized to individual intervals <g1, g1 + γi·(g2 − g1)>, where γi ∈ <0,1> denotes the credibility (quality) of the method
• For example:
  – γi = 1 for the best method, γi < 1 for all others
  – γi = (ri²/ni) / (rmax²/nmax)
(a sketch of the merging follows below)
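A minimal Python sketch of the merging described above (the input representation — one {document: rate} dict per method — is an assumption):

    def merge_results(result_lists, credibilities):
        """Merge the rated hit-lists of several methods (a sketch).

        result_lists: per method, a dict {doc_id: raw_rate};
        credibilities: per method, the weight gamma_i in <0,1>,
        e.g. (r_i**2/n_i) / (r_max**2/n_max).
        """
        combined = {}
        for ratings, gamma in zip(result_lists, credibilities):
            lo, hi = min(ratings.values()), max(ratings.values())
            for doc, x in ratings.items():
                # normalize to <0, gamma> (degenerate interval -> gamma)
                s_i = gamma * (x - lo) / (hi - lo) if hi > lo else gamma
                # rates are treated as independent relevance-probability
                # estimates: s = 1 - prod(1 - s_i)
                combined[doc] = 1 - (1 - combined.get(doc, 0.0)) * (1 - s_i)
        return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)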
HTML Searching
• The Web can be considered a special case of a TRS
  – Unknown and huge number of stored documents
    • Surface web — anonymously accessible documents
    • Hidden (deep) web
      – Dynamic web pages
      – Unlinked pages
      – Documents accessible only after authorization
      – The volume of the deep web is (estimated) a hundred times larger
      – The quality of the deep web is a thousand times higher

HTML Searching
• The Web can be considered a special case of a TRS
  – Redundancy
    • Estimations say the web redundancy is approx. 30%
  – Volatility
    • ¼ of the pages changes every day
    • The estimated half-life of pages is approx. 10 days
    • I.e. the information in the index ages considerably

HTML Searching
• The Web can be considered a special case of a TRS
  – Number of documents in Google
    • 4,285,199,774 documents (July 2004)
    • 8,058,044,651 documents (May 2005)
    • At least 25,270,000,000 documents, more probably over 35,070,000,000 documents (April 2006)
  – The query "the" used in Google returns
    • 3,200,000,000 hits (May 2005)
    • 24,210,000,000 hits (April 2006)
  – The query "-the" used in Google returns
    • 14,800,000,000 hits (April 2006)

HTML Searching
• Two methods of information search
  – Query engines
    • www.google.com, www.yahoo.com, www.altavista.com, morfeo.centrum.cz, …
  – Catalog browsing
    • seznam.cz, centrum.cz, …
    • Usually manually managed pages

Querying the Web
• Typically different implementations of some extended Boolean search engine
  – Binary logical operators
  – Support for explicit or implicit proximity operators
  – Usually without query weighting
• Different additional techniques
  – The location of the terms in the document is important
    • Titles and headings are more important than the plain text, …
  – Mutual references between pages are taken into account

Querying the Web
• Catalogs
  – Thematically organized lists of references
  – Navigation through hierarchies of search terms
  – Suitable in situations where the user exactly knows what he/she searches for, as well as when he/she cannot express the query using keywords

Web Macrostructure
• Taken from: Graph structure in the web, Andrei Broder, Ravi Kumar et al., http://www9.org/w9cdrom/160/160.html
[Figure not reproduced.]

Hypertext References Utilization
• The Web can be considered an oriented graph G(N,E)
  – N is the set of nodes (pages)
  – E is the set of edges, where (p,q) ∈ E means that the page q is referred from the page p
• Output page degree o(p)
  – The number of references in the page p
• Input page degree i(p)
  – The number of pages referencing the page p

Hypertext References Utilization
• Edges inside one domain are denoted as inner edges
• Edges crossing domain boundaries are denoted as traversal edges
• Example with two domains dom1 = {p11, p12}, dom2 = {p21, p22}:
i(p11) = 0, o(p11) = 2
i(p12) = 1, o(p12) = 0
i(p21) = 2, o(p21) = 0
i(p22) = 1, o(p22) = 2

Web Search Engine Structure
• Robot (crawler, spider)
  – Uses an internal URL store and visits pages with a given frequency and in a given order
  – Stores the data into a list of harvested HTML pages

Web Search Engine Structure
• Indexer
  – Indexes the harvested HTML pages
  – Generates the index data
    • Textual
    • Structural
  – Adds newly found references to the URL store

Web Search Engine Structure
• Query processing
  – Uses the index data to evaluate document similarities according to the queries
  – If the query is formulated using a URL of an unknown page, the page can be loaded and indexed ad hoc at the time of the query evaluation

Page Harvesting
• Usually a breadth-first search traversal from a starting set of pages stored in the URL store
  – Not all stored URLs have to be starting pages
• Priorities can be defined according to
  – The page theme
    • For example based on the page similarity to some predefined vector and/or query
  – The page popularity
  – The page location
    • According to the domain
  – …
Indexation
• The indexer decides which pages harvested by the robots will really be indexed
– Tendency to omit duplicates
• The obtained index data is stored (in case of the extended Boolean model) using inverted files, usually including the positions of term occurrences in pages
• In addition, other metadata needed for query evaluation is stored
– Graph of references
– Page sizes
– …

Document to Query Comparison
• Documents are rated from more points of view, with respect to
– The given query
• Similarity of the document content and the query
– The given document itself
• Page popularity, derived for example from the number of (traversal) references
– The given user
• The newest part of the rating, which tries to create and maintain a user profile obtained from previous interaction with him/her
• Prefers the user's subjective feeling about document quality

Web Page Popularity
• Is derived from the analysis of the reference graph and from the similarity between source and target pages
– PageRank
– HITS algorithm

PageRank
• Supposes that a reference to a foreign page represents a recommendation of the target page given by the author of the source page
• Problems
– The algorithm can be confused by the generation of a large amount of pages referring to the target page
– The popularity of the source page and/or the content of the source page can be taken into account

PageRank
• The rating – rank – r(q) of page q depends on the ranks of the referring pages and on the number of those pages
– Simple PageRank: r(q) = Σ_{(p,q)∈E} (1/o(p)) * r(p)
• Multiple references are counted only once
– Matrix notation: r = Xr, where x_{q,p} = 1/o(p) if (p,q)∈E, else 0

PageRank
• Iterative PageRank evaluation
– Problems
• A group of pages can link to each other, but not outside of the group (rank sink)
– Rating accumulation
– No contribution to other pages
• Example (page p feeds the sink {p1, p2}); ranks over the iterations:
p:  0.1  0.1  0.1  0.1  0.1  0.1
p1: 0.0  0.1  0.1  0.2  0.2  0.3
p2: 0.0  0.0  0.1  0.1  0.2  0.2

PageRank
• The rank sink problem can, according to the authors (Lawrence Page, Sergey Brin), be reduced using the Random Surfer Model
– PageRank: r(q) = (1-d) + d * Σ_{(p,q)∈E} (1/o(p)) * r(p), d ∈ ⟨0,1⟩
• d represents the damping factor, usually d = 0.85
– Matrix notation: r = (1-d)e + dXr, where e is a vector containing ones

PageRank
• Random Surfer Model
r(q) = (1-d) + d * Σ_{(p,q)∈E} (1/o(p)) * r(p), d ∈ ⟨0,1⟩
– The user browses the web randomly
– The probability of visiting a given page is defined by its PageRank
– The user clicks some hyperlink in the page with probability d
• The selection of any of the o(p) links is random with uniform distribution
– With probability (1-d) the user doesn't follow any link on the page and writes a new URL into the address field, or chooses one of the favorites, …

PageRank
• Another – more exact – variant of PageRank (Lawrence Page, Sergey Brin)
– PageRank: r(q) = (1-d)/|V| + d * Σ_{(p,q)∈E} (1/o(p)) * r(p)
– If o(p) = 0, i.e. the page refers to nothing, it is considered as referring to all pages of the web, i.e.
• o(p) = |V|,
• (p,q)∈E for every q
– The result better suits the probabilities of visiting web pages
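A minimal sketch of the iterative evaluation of the damped formula r(q) = (1-d) + d * Σ r(p)/o(p). Ranks are updated in place (new values are used as soon as computed), which is what the iteration table in the example below produces; the edge set in the usage line is read off the example's equations.

def pagerank(pages, links, d=0.5, iterations=12):
    """Iterative PageRank, r(q) = (1-d) + d * sum_{(p,q) in E} r(p)/o(p).
    pages: ordered list of page ids; links: dict source -> list of targets."""
    r = {p: 1.0 for p in pages}                    # start from rank 1.0
    out = {p: links.get(p, []) for p in pages}
    for _ in range(iterations):
        for q in pages:                            # in-place updates
            r[q] = (1 - d) + d * sum(r[p] / len(out[p])
                                     for p in pages if q in out[p])
    return r

# The graph of the example below: x -> y, x -> z, y -> z, z -> x.
ranks = pagerank(['x', 'y', 'z'], {'x': ['y', 'z'], 'y': ['z'], 'z': ['x']})
print(ranks)   # converges to r(x)=14/13, r(y)=10/13, r(z)=15/13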
PageRank Example
• Pages x, y, z with edges x→y, x→z, y→z, z→x; d = 0.5:
– r(x) = 0.5 + 0.5*r(z)
– r(y) = 0.5 + 0.5*r(x)/2
– r(z) = 0.5 + 0.5*(r(x)/2 + r(y))
• Exact solution of the equations:
– r(x) = 14/13 = 1.07692308
– r(y) = 10/13 = 0.76923077
– r(z) = 15/13 = 1.15384615
• Iterative computation:
step  r(x)        r(y)        r(z)
0     1.0         1.0         1.0
1     1.0         0.75        1.125
2     1.0625      0.765625    1.1484375
3     1.07421875  0.76855469  1.15283203
4     1.07641602  0.76910400  1.15365601
…     …           …           …
10    1.07692305  0.76923076  1.15384615
11    1.07692307  0.76923077  1.15384615
12    1.07692308  0.76923078  1.15384615

Kleinberg's HITS Algorithm
• Hypertext-Induced Topic Search
– Rates the documents returned by the given query
– Supposes that the set contains similar documents retrieved using the user query
• Documents are often mutually linked by references

Kleinberg's HITS Algorithm
• Two classes of pages are distinguished
– Authorities
• Pages having a high input degree
• i.e. referenced by many pages included in the query answer
– Hubs
• Pages having a high output degree
• i.e. referencing many pages included in the query answer

HITS
• Algorithm
– The input set of pages for HITS is chosen
• Small enough collection
• Containing documents similar to the given query q
• Containing a large number of authorities
– Rating of the selected pages

HITS
• Selection of the page set S_q according to query q
– In(p) … set of pages referring to page p
– Out(p) … set of pages referenced by page p
– d … chosen small integer number
1. R_q := the first 200 pages from the answer to query q
2. S_q := R_q;
for each p in R_q do begin
  S_q := S_q ∪ Out(p);
  if i(p) ≤ d then S_q := S_q ∪ In(p)
  else S_q := S_q ∪ S; {S ⊆ In(p), |S| = d, S chosen randomly}
end;
3. Remove inner links from the graph induced by S_q

HITS
• Page rating
– a_k(p) … authority rating of page p in the k-th iteration
– h_k(p) … hub rating of page p in the k-th iteration
1. for each p in S_q do begin a_0(p) := 1; h_0(p) := 1; end;
2. for k := 1 to n do
for each p do begin
  a_k(p) := Σ_{(q,p)∈E} h_{k-1}(q);
  h_k(p) := Σ_{(p,q)∈E} a_{k-1}(q);
  normalize the ratings so that Σ_{p∈S_q} (h_k(p))² = Σ_{p∈S_q} (a_k(p))² = 1
end;
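A minimal sketch of the rating loop above, assuming the page set S_q and its reference graph (inner links already removed) are given as plain dictionaries.

import math

def hits(pages, links, iterations=20):
    """Hub/authority rating: new scores are computed from the previous
    iteration and then normalized to unit Euclidean length, as on the slide.
    pages: list of page ids; links: dict page -> list of referenced pages."""
    a = {p: 1.0 for p in pages}
    h = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # a_k(p) = sum of h_{k-1}(q) over (q,p) in E
        new_a = {p: sum(h[q] for q in pages if p in links.get(q, []))
                 for p in pages}
        # h_k(p) = sum of a_{k-1}(q) over (p,q) in E
        new_h = {p: sum(a[q] for q in links.get(p, [])) for p in pages}
        na = math.sqrt(sum(v * v for v in new_a.values())) or 1.0
        nh = math.sqrt(sum(v * v for v in new_h.values())) or 1.0
        a = {p: v / na for p, v in new_a.items()}
        h = {p: v / nh for p, v in new_h.items()}
    return a, h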
PageRank Computation Speed-Up
• One possibility is to approximate the PageRank computation as r(q) ≈ r_s(q) * r_p(q)
– r_s(q) … rank of the site (domain)
– r_p(q) … rank of the page within the site (domain)
• The number of domains is much smaller than the number of pages
– The eigenvector of the matrix is much simpler to compute
• The number of pages within one site is also much smaller
– The computation for different sites can easily be done in parallel

Extending of PageRank
• PageRank itself is independent of the page content
– An empty page with many referring pages will obtain a high PageRank
– Can be easily fooled
• A page has different importance depending on the theme the user is searching for
• The PageRank computation can be changed to reflect particular themes

Extending of PageRank
• The original proposition (Haveliwala) computes independent PageRank values for the top terms taken from the ODP (Open Directory Project) thesaurus
– Dependent on the language
– Pages written in different languages should be rated individually

Themes Based Extending of PageRank
• The basic equation of the PageRank computation
r(q) = d * Σ_{(p,q)∈E} r(p)/o(p) + (1-d)/n
is modified so that during the random walk a new page is chosen with probability (1-d) only if the page theme matches the searched theme
• For a given theme t the set of equations is the following
– If the page q matches theme t:
r_t(q) = d * Σ_{(p,q)∈E} r(p)/o(p) + (1-d)/n_t
– If the page q doesn't match theme t:
r_t(q) = d * Σ_{(p,q)∈E} r(p)/o(p) + 0

Themes Based Extending of PageRank
• For the query q, coefficients c(q,t) of matching the query to the individual themes are computed
• The PageRank of the page p is then evaluated as a linear combination of the PageRank values for all themes
– r(p,q) = Σ_t r_t(p) * c(q,t)

Personalized PageRank
• Also modifies the random walk algorithm
• It remembers a set of favorite pages for known users
• During the random traversal, when the address field of the browser is used, favorite pages are preferred
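A small sketch of the linear combination r(p,q) = Σ_t r_t(p) * c(q,t) above. The two input dictionaries and all names are illustrative assumptions; the per-theme rank vectors would be precomputed offline.

def topic_sensitive_rank(page, query_topic_weights, topic_ranks):
    """Combine per-theme PageRank vectors for one page.
    topic_ranks: dict theme -> dict page -> r_t(page);
    query_topic_weights: dict theme -> c(q,t) for the current query."""
    return sum(c * topic_ranks[t].get(page, 0.0)
               for t, c in query_topic_weights.items())

# Hypothetical usage: a query that is 70% about 'sports', 30% about 'health'.
topic_ranks = {'sports': {'p1': 0.4, 'p2': 0.1},
               'health': {'p1': 0.05, 'p2': 0.3}}
print(topic_sensitive_rank('p1', {'sports': 0.7, 'health': 0.3}, topic_ranks))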
Further TRS Quality Metrics
• To evaluate TRS quality, other metrics besides the standard P (precision) and R (recall) can be used
• Additional metrics try to take the maximal criterion into account
• Having large collections, the quality of the system should be measured on the beginning part of the answer

Further TRS Quality Metrics
• The simplest of those metrics is the precision within the first k returned documents, denoted P_k
– A system with values P_10 = 0.9 and P = 0.3 can be considered better than another system where P_10 = P = 0.6

Diversity, Information Richness
• If the query is ambiguous, more independent groups of documents can match the query
• For example: "Binding"
– Foot binding
– Ski binding, a device for connecting a foot to a ski
– Snowboard binding, a device for connecting a foot to a snowboard
– Book binding, the protective cover of a book
– Binding (computer science), a tie to certain names in programming languages
– Binding (molecular), a chemical interaction between molecules
– Neural binding, synchronous activity of neurons

Diversity, Information Richness
• In case of an ambiguous query it is desirable to have documents representing all available topics as well as possible within the first page of the answer, or at the beginning of the answer
• Which topic the user requested can be told from the further interaction with him/her, so the system can provide a restricted set of documents in the next iteration

Diversity, Information Richness
• To express the number of individual topics appearing in the answer, new metrics have to be defined, different from both precision and recall
– Diversity … number of groups (clusters) of mutually similar documents in the answer
• Grows with the number of topics present in the answer or at the beginning of the answer
– Information Richness … quality of the documents with regard to their respective topics
• Grows with the quality of the documents chosen for the individual topics

Diversity, Information Richness
• The computation is similar to the PageRank evaluation, with exceptions
– It is computed for the answer (similarly to the HITS algorithm)
– It doesn't use the graph based on mutual references, but on mutual similarities

Diversity, Information Richness
• Let a collection of documents D = {d_i, i ∈ 1, …, n} be given
• Diversity (of the set) Div(D)
– Number of topics covered by the documents within the set D
• Information Richness (of the document) IR_D(d_i) ∈ ⟨0;1⟩
– How well the document d_i represents its own topic

Diversity, Information Richness
• If Div(D) = k, each document within the set is assigned to one of k topics
– The number of documents assigned to topic l is denoted n_l
– The i-th document assigned to topic l is denoted d_i^l

Diversity, Information Richness
• The similarities Sim(d_i, d_j) of document couples are computed for all d_i, d_j ∈ D, where Sim(d_i, d_j) = (d_i · d_j)/(|d_i| * |d_j|)
• The rated graph G = (D,E) is built
– Graph nodes correspond to the documents in D
• The rating of edge e_ij = (d_i, d_j) is defined by the similarity, i.e. h(e_ij) = Sim(d_i, d_j)
• To spare space and time, edges corresponding to dissimilar documents are not used, i.e. the edge e_ij ∈ E if and only if Sim(d_i, d_j) ≥ S_t, where S_t is a chosen threshold
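A minimal sketch of the graph construction above: cosine similarity over sparse vectors, with edges kept only above the threshold S_t (the value 0.3 is illustrative).

import math

def cosine(u, v):
    """Cosine similarity Sim(di, dj) = (di . dj) / (|di| * |dj|)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def similarity_graph(docs, threshold=0.3):
    """Build the rated graph: edge (i, j) with weight Sim(di, dj) exists
    only if the similarity reaches the chosen threshold S_t.
    docs: list of sparse term->weight dicts."""
    edges = {}
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            s = cosine(docs[i], docs[j])
            if s >= threshold:
                edges[(i, j)] = s
    return edges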
Diversity, Information Richness
• The adjacency matrix M of the graph G is built
• The Information Richness measure is derived from two aspects
– The more similar documents (neighbors) a graph node has, the higher its IR is
– The more similar the neighbor document is, the higher the IR is

Diversity, Information Richness
• The eigenvector of the matrix c*M′ᵀ + (1-c)*U is computed, where
– M′ is the matrix M with rows normalized to unit size (Manhattan metric, the sum of all values in a row is equal to one)
– U contains values 1/n
– c = 0.85 (similar to the PageRank case)
• The vector contains the required values of IR_D(d_i)

Diversity, Information Richness
• The average value of Information Richness in the collection can be defined as
IR_D = (1/Div(D)) * Σ_{l=1}^{Div(D)} (1/n_l) * Σ_{i=1}^{n_l} IR_D(d_i^l)
• The values of IR allow choosing the best representatives, but can choose several very similar documents that have similar IR yet represent the same topic

Diversity, Information Richness
• Greedy algorithm
A := ∅; B := D;
sort B in descending order by IR_D value;
while B <> ∅ do begin
  move the best available document d_i from B to A;
  decrease the values of the remaining documents by M_ij * IR_D(d_i);
  re-order the remaining documents in B
end;
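A minimal sketch of both steps above, assuming the thresholded similarity matrix is given: power iteration for the eigenvector of c*M′ᵀ + (1-c)*U, followed by the greedy re-ranking (here the raw similarity matrix stands in for M).

def information_richness(sim, c=0.85, iterations=50):
    """Power iteration for the eigenvector of c*M'^T + (1-c)*U.
    sim: symmetric n x n similarity matrix, diagonal assumed zero,
    entries below the threshold already set to 0."""
    n = len(sim)
    row_sums = [sum(row) or 1.0 for row in sim]    # M' row normalization
    ir = [1.0 / n] * n
    for _ in range(iterations):
        new_ir = [(1 - c) / n
                  + c * sum(sim[j][i] / row_sums[j] * ir[j] for j in range(n))
                  for i in range(n)]
        total = sum(new_ir)
        ir = [x / total for x in new_ir]           # keep the vector normalized
    return ir

def greedy_diverse(sim, ir):
    """Greedy selection from the slide: repeatedly pick the document with
    the highest remaining score and penalize its similar neighbors."""
    score = list(ir)
    order, left = [], set(range(len(ir)))
    while left:
        best = max(left, key=lambda i: score[i])
        left.remove(best)
        order.append(best)
        for j in left:                   # decrease by M_ij * IR(d_best)
            score[j] -= sim[best][j] * ir[best]
    return order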
Artificial Neural Networks in TRS's, COSIMIR

Neural Networks in TRS's
• Neural networks have been increasingly used in TRS's since the nineties of the 20th century
– Usually targeting some of the following aspects
• Clustering (SOM)
• Reduction of index dimensionality
• Concept finding
• Document relevancy estimation

Neural Networks in TRS's
• Advantages of using neural networks
– Generalization of relevancy rules based on learning from specific examples
– Finding abstract dependencies, even if the user is not able to formulate them exactly
– Robustness, fault tolerance
– Pattern classification (documents and/or terms) using self-organizing maps

Neural Networks in TRS's
• Disadvantages of using neural networks
– Unsure results
• If the sample data is not chosen correctly, or its amount is too small/large, the derived generalizations need not reflect reality
– Difficult interpretation of the learned rules

Neural Networks in TRS's
• Neuron (perceptron)
– Basic unit of the neural network
– Models a biological neuron
• Inaccurately; it reflects only our idea of its functionality
• Perceptron structure
– n inputs, corresponding to the dendrites of a biological neuron
– 1 output, corresponding to the axon

Neural Networks in TRS's
• Perceptron structure
– n inputs denoted x_i, each of them with an assigned weight w_i
– 1 output y
– Threshold t
• If the neuron excitation is sufficient, i.e. exceeds the threshold, the neuron triggers its output

Neural Networks in TRS's
• Perceptron functionality
1. Enumeration of its inner activation (the weighted sum): a = Σ_i (w_i * x_i) - t
2. Calculation of the output value using the transition function g: y = g(a)
– Usually g(a) = 1/(1 + e^(-λa)), the so-called sigmoid function with steepness λ > 0
– Sometimes other transition functions
• Linear g(a) = a
• Signum g(a) = sgn(a)
• …

Neural Networks in TRS's
• Geometrical interpretation
– The equation Σ_i (w_i * x_i) - t = 0 determines a separating hyperplane in the n-dimensional input space
– The perceptron separates the input patterns belonging to the individual halfspaces
Σ_i (w_i * x_i) - t > 0
Σ_i (w_i * x_i) - t < 0

Neural Networks in TRS's
• Nets with more neurons
– Are built by connecting neuron outputs to the inputs of other neurons
• Recurrent networks allow cyclic connections
– Usually unsupervised networks
– Allow pattern classification into groups
• Acyclic layered neural networks connect the outputs of one group (layer) of neurons to the inputs of all neurons in the following layer
– Usually supervised, trained by backpropagation, based on couples [pattern, required response]

Layered Neural Network Learning
• Backpropagation algorithm
– Based on a set of vector couples [pattern, required response]
– Minimizes the error E of the network over the learning set
– E = Σ_j (y_j - o_j)², where y_j is the output of the j-th neuron in the output layer and o_j is the response required by the supervisor

Layered Neural Network Learning
• Backpropagation algorithm
– Numerical iterative calculation
• E is a function of the network weights, differentiable everywhere with respect to all weights (in case of the sigmoid transition function; need not hold in general)
• The derivative dE/dw is calculated for each weight
• The weights are changed slightly in the direction of decreasing error

Layered Neural Network Learning
• Backpropagation algorithm
– Calculates the weight changes layer by layer from top to bottom
• Output – v-th – layer
– δ_i^v = λ y_i (1 - y_i) (o_i - y_i)
– w_ij ← w_ij + η δ_j^v y_i
• Any other – k-th – layer
– δ_i^k = λ y_i (1 - y_i) Σ_j (δ_j^{k+1} w_ij)
– w_ij ← w_ij + η δ_j^k y_i

Document clustering
• Kohonen self-organizing maps
– Projection of a high-dimensional space into a low-dimensional (often two-dimensional) space
– Partial topology preservation
– Clustering

COSIMIR
• COSIMIR Model – Thomas Mandl, 1999: COgnitive SIMilarity learning in Information Retrieval
• Cognitive calculation of document-to-query similarity, based on a layered artificial neural network and the backpropagation algorithm

COSIMIR
• The input layer has 2m neurons, where m is the number of terms
– The document half carries the weights w_i1, w_i2, …, w_im; the query half carries q_1, q_2, …, q_m
• The hidden layer contains k neurons ("symbolic concepts")
• The output layer has 1 neuron representing the similarity
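A minimal COSIMIR-style sketch, not Mandl's original implementation: a 2m-input layered network with one hidden layer of k "concept" neurons and a single similarity output, trained by plain backpropagation on [pattern, required response] pairs. Thresholds/biases are omitted and λ = 1 for brevity; all names are illustrative.

import math, random

def sigmoid(a, lam=1.0):
    """Transition function g(a) = 1 / (1 + e^(-lambda * a))."""
    return 1.0 / (1.0 + math.exp(-lam * a))

def train_cosimir_like(pairs, m, k=4, eta=0.5, epochs=2000):
    """pairs: list of (doc_vector, query_vector, target_similarity),
    where both vectors have length m and the target lies in (0, 1)."""
    random.seed(0)
    w1 = [[random.uniform(-0.5, 0.5) for _ in range(2 * m)] for _ in range(k)]
    w2 = [random.uniform(-0.5, 0.5) for _ in range(k)]
    for _ in range(epochs):
        for doc, query, target in pairs:
            x = doc + query                          # 2m inputs
            hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)))
                      for row in w1]
            y = sigmoid(sum(w * h for w, h in zip(w2, hidden)))
            # Output-layer delta y(1-y)(o-y), then propagate one layer down.
            d_out = y * (1 - y) * (target - y)
            d_hid = [h * (1 - h) * d_out * w2[j] for j, h in enumerate(hidden)]
            for j, h in enumerate(hidden):
                w2[j] += eta * d_out * h
                for i, xi in enumerate(x):
                    w1[j][i] += eta * d_hid[j] * xi
    return w1, w2

# Toy usage: the same document paired with a similar and a dissimilar query.
pairs = [([1, 0, 1], [1, 0, 0], 0.8), ([1, 0, 1], [0, 1, 0], 0.1)]
w1, w2 = train_cosimir_like(pairs, m=3)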
Dimensionality Curse, Pyramid Technique, IGrid Indexes

What is the Dimensionality Curse?
• Most methods invented for nearest-neighbor search for a given point in m-dimensional space, such as R-trees, M-trees and others, work well in low-dimensional spaces, but quickly lose their effectiveness with increasing m
• With m ~ 500 and more it is more efficient to go through the whole space (cluster) sequentially
• TRS spaces consist of m ~ 10,000 and (much) more dimensions

Pyramid Technique
• Reduces the problem of searching a neighborhood of a given predefined size from m-dimensional to 1-dimensional, where B-trees and similar structures can easily be used
• Searching through an m-dimensional block in the neighborhood of point x is converted to scanning certain sections of a line

Pyramid Technique
• The m-dimensional cube ⟨0;1⟩^m is split into 2m pyramids
– Each pyramid has as its base one of the 2m (m-1)-dimensional cube walls
– The top is in the cube center
– m = 2: 4 pyramids; m = 3: 6 pyramids

Pyramid Technique
• Each m-dimensional point within the cube is projected onto the point 0.5*PyramidNr + DistanceFromBase
– For example, for m = 2 a point in pyramid 1 at distance v1 from the base maps to 0.5 + v1, and a point in pyramid 3 at distance v2 maps to 1.5 + v2; the line is scanned over the range 0.0, 0.5, 1.0, 1.5, 2.0

Pyramid Technique
• During the search within the neighborhood of a given point, only the required parts (layers) of the pyramids parallel with the bases are searched; there exist at most 2m of them

Pyramid Technique
• Advantages
– Easy search for all points within a block in the neighborhood of a given center with a given size
– Answers TRS queries with a predefined maximal distance (minimal similarity) of documents with respect to the given query
• Disadvantages
– Not optimal for searching the k closest (most similar) points to a given point x
• It is not easy to estimate the needed size of the block containing at least k points
– Not optimal if some dimensions of the block are unbounded
• The resulting block has a large intersection with many pyramids

IGrid Index
• Solves the problem of increasing dimensionality by a different definition of distance (similarity) in the m-dimensional space
– Less "intuitive" definition
– Inadequate for low-dimensional spaces
– Increasing effectiveness with increasing space dimensionality
• The higher the space dimensionality, the smaller percentage of the space must be searched to find the closest point(s)

IGrid Index
• The idea of the method is to split the m-dimensional space containing n points into discrete m-dimensional sub-intervals
– The i-th dimension is split into k_m sections in such a way that each section defined by an interval ⟨l_i;u_i⟩ contains approx. n/k_m points, i.e. all sections contain approx. the same number of points
• If the distribution of points is uneven in a given dimension, densely occupied areas are split into smaller sections
• Typically k_m = m

IGrid Index
• The similarity is defined as follows
• If X = [x_1, x_2, …, x_m] and Y = [y_1, y_2, …, y_m], then
– If x_i and y_i belong to the same section ⟨l_i;u_i⟩, the points are considered similar in the i-th dimension and the similarity increase is 1 - [(|x_i - y_i|)/(u_i - l_i)]^p; else the increase is equal to 0
• The similarity is in fact defined only over the dimensions where both points are similar enough
Sim(X,Y) = (Σ_{i similar} (1 - [(|x_i - y_i|)/(u_i - l_i)]^p))^{1/p}
Sim(X,Y) ∈ ⟨0; m^{1/p}⟩

IGrid Index
• Why 1 - [(|x_i - y_i|)/(u_i - l_i)]^p?
– Derived from the L_p norm
– The closer the distance in the given dimension is to the size of the section, the closer the fraction is to 1, and thus the closer the similarity increase is to 0
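A minimal sketch of the similarity definition above, assuming the equi-depth section boundaries have already been built for each dimension.

import bisect

def igrid_similarity(x, y, sections, p=2):
    """IGrid similarity. sections[i] is the sorted list of section
    boundaries for dimension i. Dimensions where x and y fall into
    different sections contribute 0; matching dimensions contribute
    1 - (|xi - yi| / (ui - li))^p."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        b = sections[i]
        sx = bisect.bisect_right(b, xi) - 1   # index of xi's section
        sy = bisect.bisect_right(b, yi) - 1
        if sx == sy and 0 <= sx < len(b) - 1:
            li, ui = b[sx], b[sx + 1]
            total += 1.0 - (abs(xi - yi) / (ui - li)) ** p
    return total ** (1.0 / p)

# Two 3-dimensional points on a grid splitting <0;1> into two sections:
bounds = [[0.0, 0.5, 1.0]] * 3
print(igrid_similarity([0.1, 0.6, 0.9], [0.2, 0.8, 0.4], bounds))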
IGrid Index
• Illustration in two dimensions:
– Zero similarity … the points are not similar in any dimension
– Maximal reachable similarity 1^{1/p} = 1 … the points are similar in one dimension only
– Maximal reachable similarity 2^{1/p} … the points are similar in both dimensions
• The probability of the similarity of two points in the i-th dimension is 1/k_m
• Two points are on average similar in m/k_m dimensions

IGrid Index
• Index structure
– For each of the m dimensions and each of the k_m sections there exists a list (with length n/k_m) of items having the value of the given dimension within the given section
– Each item contains the value of the corresponding dimension and a reference to the original complete vector
• Index size: m*k_m*(n/k_m) = m*n
– Each vector is referenced from m lists, one for each dimension

IGrid Index
• It is shown that the number of sections k_m in each dimension should be linearly dependent on the number of dimensions m, for example k_m = m
• Reduction of the discretization impact
– Each dimension is split into l*k_m sections (l is an odd number, for example l = 3)
– So the index contains l times more, l times shorter lists
– In each dimension l lists are searched: the corresponding one plus (l-1)/2 adjacent lists in each direction

IGrid Index
• Direct search in the vector index
– Index volume m*n
– During query evaluation it is necessary to read m*n values
• Search using the IGrid index
– Index volume (m*k_m)*(n/k_m) = m*n
– During query evaluation it is necessary to read only m*(n/k_m) = n values, if k_m = m
• The volume of read data doesn't depend on the space dimensionality m, i.e. the effectiveness increases

Approximate Search
• Error (typo) detection in text
• Typo correction
• Words (with max. length n) in the alphabet X correspond to points in the space (X∪{⊥})^n, where ⊥ pads short words to uniform length
• Not each point in the space corresponds to some word

Approximate Search
• Metrics in the space (X∪{⊥})^n
– Hamming metric H(u,v)
• Minimal number of REPLACE (of one character) operations needed to convert one word to another
• Omission, respectively addition, of a character usually produces a large distance between the words
• H('success','syccess') = 1
• H('success','sucess') = 3
– Levenshtein metric L(u,v)
• Minimal number of REPLACE, INSERT, DELETE (of one character) operations needed to convert one word to another
• L('success','syccess') = 1
• L('success','sucess') = 1

Hamming Metric
• Detection using a non-deterministic finite automaton
(Figure: an NFA for the pattern "success"; each row of states detects the word with 0, 1, 2, … errors, and REPLACE transitions lead one row down.)

Hamming Metric
• Q = {q_{i,j} | 0 ≤ i ≤ k, i ≤ j ≤ n} is the finite set of states
• X is the given alphabet
• S = {q_{0,0}} ⊆ Q is the set of initial states
• F = {q_{i,n}} ⊆ Q is the set of final states, where state q_{i,n} detects the word w with i errors
a) q_{i,j} ∈ δ(q_{i,j-1}, x_j) represents the acceptance of character x_j without error
b) q_{i+1,j} ∈ δ(q_{i,j-1}, x), for x ∈ X\{x_j}, represents the acceptance of character x_j with an error (REPLACE operation)
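A minimal sketch of both metrics defined above: Hamming distance over padded words, and the Levenshtein distance computed by the standard dynamic-programming recurrence (equivalent in result to simulating the error-counting automata).

def hamming(u, v, pad='#'):
    """Hamming distance over words padded to equal length:
    the number of positions where the characters differ."""
    n = max(len(u), len(v))
    u, v = u.ljust(n, pad), v.ljust(n, pad)
    return sum(a != b for a, b in zip(u, v))

def levenshtein(u, v):
    """Levenshtein distance: the minimal number of REPLACE, INSERT and
    DELETE operations converting u to v."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, 1):
        cur = [i]
        for j, b in enumerate(v, 1):
            cur.append(min(prev[j] + 1,              # DELETE a
                           cur[j - 1] + 1,           # INSERT b
                           prev[j - 1] + (a != b)))  # REPLACE (or match)
        prev = cur
    return prev[-1]

print(hamming('success', 'sucess'))      # 3
print(levenshtein('success', 'sucess'))  # 1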
Levenshtein Metric
• Detection using a non-deterministic finite automaton
(Figure: an NFA for the pattern "success" with rows for 0, 1, 2 errors; besides the REPLACE transitions it also contains transitions for INSERT and DELETE.)

Levenshtein Metric
• Q = {q_{i,j} | 0 ≤ i ≤ k, 0 ≤ j ≤ n} is the finite set of states
• X is the given alphabet
• S = {q_{0,0}} ⊆ Q is the set of initial states
• F = {q_{i,n}} ⊆ Q is the set of final states, where state q_{i,n} detects the word w with i errors
a) q_{i,j} ∈ δ(q_{i,j-1}, x_j) represents the acceptance of character x_j without error
b) q_{i+1,j} ∈ δ(q_{i,j-1}, x), for x ∈ X\{x_j} … REPLACE
c) q_{i+1,j-1} ∈ δ(q_{i,j-1}, x), for x ∈ X\{x_j} … INSERT
d) q_{i+1,j+1} ∈ δ(q_{i,j-1}, x_{j+1}) … DELETE

Vocabulary Construction
• Frequency dictionary
– List of words ordered by the number of occurrences in descending order
• At the end
– Rarely used words
– Typos
• At the beginning
– Often used words
– Stop words
(Diagram: documents → terms → frequency dictionary; stop words and typos are filtered out at the two ends.)

Non-interactive Spell Checking
• Simultaneous comparison of two alphabetically ordered dictionaries
– Alphabetical list of terms in the document
– Alphabetical list of correct terms
• Requires one pass through both dictionaries

Interactive Spell Checking
• Each term has to be checked immediately against the dictionary
• Saves memory and time using a hierarchical dictionary
• Uses the so-called empirical Zipf's law: the order of a term in the frequency dictionary multiplied by its frequency is approximately constant

Empirical Zipf's Law
• First 10 words of an English frequency dictionary (containing approx. 1,000,000 words):
order  word  frequency  order*frequency  CTF
1      the   0.069971   0.069971         0.069971
2      of    0.036411   0.072822         0.106382
3      and   0.028852   0.086556         0.135234
4      to    0.026149   0.104596         0.161383
5      a     0.023237   0.116185         0.184620
6      in    0.021341   0.128046         0.205961
7      that  0.010595   0.074165         0.216556
8      is    0.010099   0.080792         0.226655
9      was   0.009816   0.088344         0.236471
10     he    0.009543   0.095430         0.246014

Empirical Zipf's Law
• Cumulative term frequency of the first k terms: CTF_k = Σ_{i=1}^k frequency_i
• The first 10 words cover approx. 25% of the word occurrences in the text

Empirical Zipf's Law
(Graph: cumulative term frequency – % of word occurrences in the text vs. % of different words in the text; the curve grows steeply at first, with roughly 20% of the different words covering about 70% of all word occurrences.)

Hierarchical Dictionaries
• Complete dictionary in external memory
– 10,000 and more different words
• Dictionary of words found in the document
– 2,000 different words
• Dictionary of the most frequent words in memory
– 200 different words
– 50% of the word occurrences in the document

Compression
• Compression in TRS's
– Term lists (terms, stop-list)
– Index
– Primary documents

Compression of Term Lists
• POM – Prefix Omitting Method
– Each term is represented as a couple
• Length of the prefix common with the previous term in the list
• The rest of the term (postfix)
– Example:
a                0:a
abeceda          1:beceda
absence          2:sence
absolutní        3:olutní
absolvent        5:vent
abstinent        3:tinent
abstraktní       4:raktní
aby              2:y
ačkoli           1:čkoli
administrace     1:dministrace
administrativní  10:tivní
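A minimal sketch of the Prefix Omitting Method above, over an alphabetically sorted term list.

def pom_encode(sorted_terms):
    """POM: store each term as (length of the prefix shared with the
    previous term, remaining postfix)."""
    encoded, prev = [], ''
    for term in sorted_terms:
        k = 0
        while k < min(len(prev), len(term)) and prev[k] == term[k]:
            k += 1
        encoded.append((k, term[k:]))
        prev = term
    return encoded

def pom_decode(encoded):
    """Inverse of pom_encode: rebuild each term from the previous one."""
    terms, prev = [], ''
    for k, postfix in encoded:
        prev = prev[:k] + postfix
        terms.append(prev)
    return terms

words = ['a', 'abeceda', 'absence', 'absolutní', 'absolvent']
print(pom_encode(words))   # [(0, 'a'), (1, 'beceda'), (2, 'sence'), ...]
assert pom_decode(pom_encode(words)) == words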
Index Vector Representation
• Individual vectors are sparse (approx. 90% of zeroes)
• Storage of all weights d ≈ [w_1, w_2, w_3, …, w_m] including zeroes is ineffective
• More effective is the storage of couples only
– Index of a non-zero element
– Value of the element
d ≈ [j_1:w_{j_1}, j_2:w_{j_2}, j_3:w_{j_3}, …, j_k:w_{j_k}]

Encoding and Compression
• Code K = (A,C,f), where
– A = {a_1, a_2, …, a_n} is the source alphabet, |A| = n
– C = {c_1, c_2, …, c_m} is the target (code) alphabet, |C| = m, usually C = {0,1}
– f: A→C⁺ is an injective mapping of the characters of alphabet A onto words in alphabet C
• Extension for encoding words from A*:
f(a_{i1} a_{i2} … a_{ik}) = f(a_{i1}) f(a_{i2}) … f(a_{ik})

Encoding and Compression
• Code K is uniquely decodable if and only if for each string Y ∈ C⁺ there exists at most one string X ∈ A⁺ such that f(X) = Y
• Code K = ({0,1,2,3},{0,1},f), where f(0)=0, f(1)=1, f(2)=10, f(3)=11, is not uniquely decodable:
f(21) = 101 = f(101)

Encoding and Compression
• Code K is a prefix code if and only if no code word f(a_i) is a prefix of another code word f(a_j)
• Code K = ({0,1,2,3},{0,1},f), where f(0)=0, f(1)=10, f(2)=110, f(3)=111, is a prefix code
• Code K is a block code (of length k) if and only if all code words have the length k

Encoding and Compression
• Each block code is also a prefix code
• Each prefix code is uniquely decodable
• A code K is left-to-right (character by character) decodable if it is possible to determine the end of the code word f(a_i) and the corresponding character a_i just after the last bit of the code word is read

Encoding and Compression
• Code K = ({a,b,c,d},{0,1},f), where f(a)=0, f(b)=01, f(c)=011, f(d)=111, is not left-to-right decodable, but is still uniquely decodable
• Example: f(X) = 011111111… (the first code word cannot be determined until the whole run of ones is read)

Entropy and Redundancy
• Let A = {a_1, a_2, …, a_n} be the source alphabet
• Let the occurrence probability of character a_i in the text be equal to p_i
• P(A) = (p_1, p_2, …, p_n) is denoted the probability distribution of A

Entropy and Redundancy
• The entropy (measure of the amount of information) of character a_i in the text is equal to E(a_i) = -log(p_i) bits
• Average entropy of one character:
AE(A) = -Σ_{i=1}^n p_i log p_i
• Entropy of a text:
E(a_{i1} a_{i2} … a_{ik}) = -Σ_{j=1}^k log p_{ij}

Entropy and Redundancy
• Let the code K(A,C,f) assign code words with lengths |f(a_i)| = d_i to the characters a_i ∈ A
• The length of the encoded message is equal to
|f(a_{i1} a_{i2} … a_{ik})| = l(a_{i1} a_{i2} … a_{ik}) = Σ_{j=1}^k d_{ij}
• It holds that l(a_{i1} a_{i2} … a_{ik}) ≥ E(a_{i1} a_{i2} … a_{ik})
• Redundancy: R = l(a_{i1} a_{i2} … a_{ik}) - E(a_{i1} a_{i2} … a_{ik})
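A minimal sketch of the quantities just defined: text entropy, encoded length and redundancy, for a given character code (assumed prefix-free) with probabilities estimated from the text itself.

import math
from collections import Counter

def entropy_and_redundancy(text, code):
    """Text entropy E = -sum_j log2 p_ij, encoded length l = sum of the
    code word lengths, and redundancy R = l - E.
    code: dict character -> binary code word."""
    freq = Counter(text)
    n = len(text)
    prob = {ch: c / n for ch, c in freq.items()}
    entropy = -sum(math.log2(prob[ch]) for ch in text)
    length = sum(len(code[ch]) for ch in text)
    return entropy, length, length - entropy

# Illustrative prefix code for a 4-character alphabet:
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}
print(entropy_and_redundancy('aababcd', code))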
Number Encoding
• Binary encoding for a (potentially) infinite set of numbers
• Left-to-right decodable
• As effective as possible

Fibonacci Encoding
• Instead of powers of 2 it uses Fibonacci numbers for the individual orders
• So instead of n = Σ b_i 2^i, where b_i ∈ {0,1}, it uses the notation n = Σ b_i F_i, where b_i ∈ {0,1}, F_0 = 1, F_1 = 2, F_{k+1} = F_{k-1} + F_k
• The highest orders are on the right side
• Problem: ambiguity
17₁₀ = 1+3+5+8 = F_0+F_2+F_3+F_4 = 10111_Fib
     = 1+3+13  = F_0+F_2+F_5     = 101001_Fib

Fibonacci Encoding
• There exists exactly one notation that doesn't use two consecutive members of the Fibonacci sequence
• If two consecutive members are used, the highest such occurrence can be replaced by their sum (by the following member): 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, …
100₁₀ = 111101101_Fib = 111100011_Fib = 1111000001_Fib = 1100100001_Fib = 0010100001_Fib

Fibonacci Encoding
• In the normalized Fibonacci encoding
– There are no two consecutive one-bits
– The last (rightmost) used position is a one-bit
• An extra one-bit is appended at the end of the notation, which allows determining the end of the code word
100₁₀ = 00101000011_F
• The Fibonacci sequence grows exponentially, so the notation has logarithmic length

Elias Codes
• A set of number encodings with different features

Elias Codes
• Alpha code α (unary): α(n) = 0^{n-1}·1
+ Decodable
– Long codes
α: 1→1, 2→01, 3→001, 4→0001, 5→00001, 6→000001, 7→0000001, 8→00000001, 9→000000001
|α(2³⁰-1)| = 2³⁰-1

Elias Codes
• Beta code β (binary), the standard binary notation:
β(1) = 1, β(2n) = β(n)·0, β(2n+1) = β(n)·1
+ Short code words
– Not decodable (the end of a code word cannot be determined)
β: 1→1, 2→10, 3→11, 4→100, 5→101, 6→110, 7→111, 8→1000, 9→1001
|β(2³⁰-1)| = 30

Elias Codes
• Modified beta code β′: the beta code without the leading one-bit
β′(1) = ε, β′(2n) = β′(n)·0, β′(2n+1) = β′(n)·1
+ Short code words
– Not decodable
β′: 1→ε, 2→0, 3→1, 4→00, 5→01, 6→10, 7→11, 8→000, 9→001
|β′(2³⁰-1)| = 30-1 = 29

Elias Codes
• Theta code θ: θ(n) = β(n)·#
+ Short code words
+ Decodable
– Ternary encoding (needs the extra terminator symbol #)
θ: 1→1#, 2→10#, 3→11#, 4→100#, 5→101#, 6→110#, 7→111#, 8→1000#, 9→1001#
|θ(2³⁰-1)| = 30+1 = 31

Elias Codes
• Gamma code γ
– Combination of two codes: the modified beta code encodes the number, the alpha code ensures decodability
– Each bit of β′(n) is preceded by a zero bit and a final one-bit terminates the code word, e.g. γ(7) = 01011
+ Short codes
+ Decodable
γ: 1→1, 2→001, 3→011, 4→00001, 5→00011, 6→01001, 7→01011, 8→0000001, 9→0000011
|γ(2³⁰-1)| = 30+29 = 59

Elias Codes
• Modified gamma code γ′: γ′(n) = α(|β(n)|)·β′(n)
• More human readable
• Non-regular
γ′: 1→1, 2→010, 3→011, 4→00100, 5→00101, 6→00110, 7→00111, 8→0001000, 9→0001001
|γ′(2³⁰-1)| = 30+29 = 59

Elias Codes
• Delta code δ
– Uses the more efficient gamma code for encoding the length of the binary code: δ(n) = γ(|β(n)|)·β′(n)
δ: 1→1, 2→0010, 3→0011, 4→01100, 5→01101, 6→01110, 7→01111, 8→00001000, 9→00001001
|δ(2³⁰-1)| = 40
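A minimal sketch of the gamma code variant described above (the interleaved form where every β′ bit is preceded by 0 and a final 1 ends the word), together with its left-to-right decoder.

def beta_prime(n):
    """Modified beta code: binary notation of n without the leading 1."""
    return bin(n)[3:]           # bin(7) == '0b111' -> '11'

def gamma(n):
    """Gamma code as on the slides: gamma(7) = 01011, gamma(1) = 1."""
    return ''.join('0' + b for b in beta_prime(n)) + '1'

def gamma_decode(bits, pos=0):
    """Left-to-right decoding: a 0 announces one more beta' bit, a 1 ends
    the code word. Returns (value, position after the code word)."""
    value = 1                   # implicit leading 1 of the beta code
    while bits[pos] == '0':
        value = value * 2 + int(bits[pos + 1])
        pos += 2
    return value, pos + 1

for n in [1, 2, 3, 7, 9]:
    print(n, gamma(n))          # 1 1, 2 001, 3 011, 7 01011, 9 0000011
value, _ = gamma_decode(gamma(9))
assert value == 9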
Elias Codes
• Omega code ω
– Used for encoding very long numbers
– Recursive construction: start with a single 0; while n > 1, prepend β(n) and replace n by |β(n)| - 1
ω: 1→0, 2→100, 3→110, 4→101000, 5→101010, 6→101100, 7→101110, 8→1110000, 9→1110010
|ω(2³⁰-1)| = 2+3+5+30+1 = 41

Vector Index Structure
• The weights can be stored using integers instead of floats
– A precision of a few positions is usually sufficient
• Smaller numbers are stored in fewer bits, so it is possible to store the differences of the indexes instead of their values
d ≈ [j_1:w_{j_1}, j_2-j_1:w_{j_2}, …, j_k-j_{k-1}:w_{j_k}]

Text Compression
• Huffman encoding
• A prefix code for alphabet A with the minimal reachable redundancy

Text Compression
• Huffman code construction
– The alphabet A = {a_1, a_2, …, a_n} with distribution P(A) = {p_1, p_2, …, p_n}; suppose p_1 ≤ p_2 ≤ … ≤ p_n
– If n = 2, then f(a_1) = 0, f(a_2) = 1
– Else the modified (reduced) alphabet is built:
A′ = {a_1a_2, a_3, …, a_n}, P(A′) = {p_1+p_2, p_3, …, p_n}
and the modified code f′ is constructed recursively; then
f(a_1) = f′(a_1a_2)·0, f(a_2) = f′(a_1a_2)·1, f(a_i) = f′(a_i) otherwise

Huffman Encoding Example
• A = {u, v, w, x, y, z} with distribution P(A) = (1/32, 2/32, 3/32, 4/32, 5/32, 17/32)
– f(u) = 0000
– f(v) = 0001
– f(w) = 001
– f(x) = 010
– f(y) = 011
– f(z) = 1
(Tree: the root 32/32 splits into {u,v,w,x,y} 15/32 and z 17/32; {u,v,w,x,y} splits into {u,v,w} 6/32 and {x,y} 9/32; {u,v,w} splits into {u,v} 3/32 and w 3/32; {u,v} splits into u 1/32 and v 2/32; {x,y} splits into x 4/32 and y 5/32.)

Text Compression
• Data model
– Both compression and decompression are controlled by a set of data that parameterizes the given method
• For Huffman encoding, the probability distribution
– The equality of both models (on the compression and the decompression side) must be ensured
(Diagram: input text → compression (model) → compressed data → decompression (model) → output text)

Text Compression
• Static compression
– A static model for all documents in the collection
• Can be computed and stored only once
• Compression is not optimal
• Semi-adaptive compression
– Each document has its own model
• The model must be stored together with the compressed data
• Dynamic (adaptive) compression
– Both algorithms form the model dynamically according to the already processed data

Adaptive Huffman Compression
• FGK (Faller, Gallager and Knuth) algorithm
• It uses the so-called sibling property
– The nodes in the tree can be ordered so that
• Siblings are consecutive in the ordering
• The weights (probabilities, frequencies) don't decrease along the ordering

Adaptive Huffman Compression
• There exists an encoding tree for each character of the text
• The following character is encoded/decoded according to the existing tree
• The tree is then modified to reflect the increased frequency of the last encoded/decoded character
• Both algorithms (encoder and decoder) have to start from the same tree

Adaptive Huffman Compression
• Huffman tree modification
Node := Processed_Node;
while Node <> Root do begin
  Swap the Node, including its subtree, with the last node with the same frequency;
  Increase the Node frequency by one; {the ordering is not corrupted}
  Node := Predecessor(Node)
end;
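A minimal sketch of the (static) Huffman construction above, using a priority queue instead of explicit recursion: the two least probable groups are repeatedly merged, and the codes are read off the merges.

import heapq
from itertools import count

def huffman(distribution):
    """Huffman code construction. Members of the first merged group get
    a 0 prepended to their code, members of the second a 1.
    distribution: dict symbol -> probability (or frequency)."""
    tick = count()                       # tie-breaker for equal weights
    heap = [(p, next(tick), [s]) for s, p in distribution.items()]
    heapq.heapify(heap)
    code = {s: '' for s in distribution}
    while len(heap) > 1:
        p1, _, g1 = heapq.heappop(heap)  # the two rarest groups
        p2, _, g2 = heapq.heappop(heap)
        for s in g1:
            code[s] = '0' + code[s]
        for s in g2:
            code[s] = '1' + code[s]
        heapq.heappush(heap, (p1 + p2, next(tick), g1 + g2))
    return code

# The example alphabet from the slides; the resulting code lengths match
# (u, v: 4 bits; w, x, y: 3 bits; z: 1 bit), though the exact 0/1
# labeling may differ depending on tie-breaking.
print(huffman({'u': 1, 'v': 2, 'w': 3, 'x': 4, 'y': 5, 'z': 17}))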
Adaptive Huffman Compression
• Example: in the given tree the character z is encoded as 1011
• The node z is the only one with frequency 3, so its frequency can be increased to 4 directly
• Its parent with frequency 5 is not the last node with that frequency in the ordering
– It must be swapped with the last one; then its frequency can be increased
• The next ancestor with frequency 11 is not the last node with that frequency either
– It must be swapped with the last one; then its frequency can be increased
• The root with frequency 32 need not be swapped; its frequency is simply increased
• According to the modified tree, the next character z would be encoded as 001 instead of 1011
(Figures: the encoding tree before and after the update; the dumped node frequencies of the original slides are omitted here.)

Adaptive Huffman Compression
• The starting tree can be the one that supposes all frequencies equal to one
(Figure: a balanced tree over {a, b, c, d}, each leaf with frequency 1.)

Adaptive Huffman Compression
• It is more effective to start with a one-node tree that represents all still unknown characters, with frequency equal to one
• A new, still unknown character is encoded using this special node, and the representation of the character itself is then stored in the compressed data as well
• The node is split into a new node representing the unknown characters and a node representing the last character – both with frequencies equal to one
• Example, input abacbda:
– The first character a is encoded: the empty string is sent, followed by the definition of a; the tree now contains the unknown-characters node ? (frequency 1) and the leaf a (frequency 1)
– The next character b is encoded: the code 0 (the code of the ? node) is sent, followed by the definition of b; the output so far is a0b
Adaptive Huffman Compression
– The third character a is encoded: the string 0 (the current code of a) is sent; the output so far is a0b0, and the frequency of a increases to 2
(Figure: the tree after the update; the unknown-characters node keeps frequency 1.)

HuffWord Algorithm
• Encoding of the text word by word using the Huffman algorithm
• Semi-adaptive method
• More effective than character-by-character encoding
– More different symbols with much more different frequencies
• Simple representation of the tree
• More time-consuming compression

HuffWord Algorithm
• Uses the canonical form of the code words
• Each code word is in the form p·c
– p is a prefix consisting of all zeroes
– c is the code number
• Codes are ordered by length; for the same length by their values
• The prefix length of the longer code words is equal to the complete length of the shorter code words

HuffWord Algorithm Example
• Words A..H with code lengths 4, 4, 5, 5, 2, 2, 4, 2
• Codes with the length 2 are 01, 10, 11
• Code words with the length 4 have the prefix 00 of length 2
– Codes with the length 4 are 00.01, 00.10, 00.11
• Codes with the length 5 have the prefix 0000 of length 4
– Codes with the length 5 are 0000.0 and 0000.1

HuffWord Algorithm Example
• Assigning code words to words:
word    A     B     C      D      E   F   G     H
length  4     4     5      5      2   2   4     2
code    0001  0010  00000  00001  01  10  0011  11
• Codes in proper order:
word    E   F   H   A     B     G     C      D
length  2   2   2   4     4     4     5      5
code    01  10  11  0001  0010  0011  00000  00001

HuffWord Algorithm
• Finding the lengths of the codes
• Given words w_i, their frequencies n_i and probabilities p_i
• for each i:
b := round(-log(p_i));
if b = 0 then b := b + 1;
x[b] := x[b] + n_i
• The index b with the highest accumulated value x[b] determines the length of the shortest code words
• The index belonging to the second highest value defines the length increase of the second shortest code words
• …

HuffWord Algorithm Example
• Code word lengths for words A..E:
word  n_i  p_i   -log(p_i)  round(-log(p_i))
A     3    0.03  5.06       5
B     7    0.07  3.84       4
C     20   0.20  2.32       2
D     30   0.30  1.74       2
E     40   0.40  1.32       1
• Array x: x[1] = 40, x[2] = 50, x[3] = 0, x[4] = 7, x[5] = 3
• The highest accumulated value x[2] = 50 gives the length 2 of the shortest code words; the following values assign increasing lengths to the remaining groups
• Resulting codes:
– E = 01
– D = 10
– C = 11
– B = 000
– A = 001

HuffWord Algorithm Example
• Decompression
• For each length b in bits the following information should be available
– The value of the lowest valid code word having the given length (first[b])
– The index of the first valid code word of that length in the table of code words (base[b])
• For E=01, D=10, C=11, B=000, A=001:
– first[0] = +∞, first[1] = +∞, first[2] = 1, first[3] = 0
– base[2] = 1, base[3] = 4

HuffWord Algorithm Example
• For E=01, D=10, C=11, B=000, A=001
• first[0] = +∞, first[1] = +∞, first[2] = 1, first[3] = 0
• base[2] = 1, base[3] = 4
• Input: 10|11|01|01|000|10|001
• Decoding of one code word:
c := 0; d := 0;
while c < first[d] do begin
  c := 2*c + next_bit();
  d := d + 1
end;
word_index := base[d] + c - first[d]

Markov Automata
• The probability distribution of characters can be very different according to the context (previous characters)
– p("o") = 0.058 … probability of the occurrence of the character "o" in the text
– p("o"|"e") = 0.004 … probability of the occurrence of "o" supposing that the previous character was "e"
– p("o"|"c") = 0.237

Markov Automata
• It is possible to build a finite automaton whose states correspond to strings of a given length
– Q = Xⁿ, where n is the order of the automaton
– δ(x_1x_2…x_n, x) = x_2…x_n x
• For each state an individual compression model can be constructed, corresponding to the conditional probabilities
• Better probability estimation results in better – more effective – compression
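A minimal sketch of the context-model idea above: for every state of an order-n automaton (the previous n characters), estimate the conditional distribution of the next character from already seen text.

from collections import Counter, defaultdict

def build_context_model(text, order=2):
    """Order-n Markov model: collect p(x | x1...xn) for every observed
    context. The transition x1x2...xn + x -> x2...xn x is implicit in how
    the context window slides over the text."""
    counts = defaultdict(Counter)
    for i in range(order, len(text)):
        state = text[i - order:i]      # current automaton state
        counts[state][text[i]] += 1
    return {state: {ch: c / sum(ctr.values()) for ch, c in ctr.items()}
            for state, ctr in counts.items()}

model = build_context_model('abracadabra abracadabra', order=1)
print(model.get('a'))   # conditional distribution of the character after 'a'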