CS5286 Search Engine Technology and Algorithms/Xiaotie Deng

Lecture 4: Indexing Files
  Inverted File
  Lexical Analysis
  Stop lists

Indexing
  Arrangement of data (data structure) to permit fast searching
  Which list is easier to search?
    List 1: sow fox pig eel yak hen ant cat dog hog
    List 2: ant cat dog eel fox hen hog pig sow yak

Creating inverted files
  Original Documents -> Word Extraction -> Word IDs, Document IDs -> Inverted Files
  Inverted Files:
    W1: d1,d2,d3
    W2: d2,d4,d7,d9
    Wn: di,…dn

Creating Inverted file
  Map the file names to file IDs
  Consider the following Original Documents
    D1  The Department of Computer Science was established in 1984.
    D2  The Department launched its first BSc(Hons) in Computer Studies in 1987.
    D3  followed by the MSc in Computer Science which was started in 1991.
    D4  The Department also produced its first PhD graduate in 1994.
    D5  Our staff have contributed intellectually and professionally to the advancements in these fields.

Creating Inverted file: stop words
  Stop words in D1 to D5 are marked in red on the slide.
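As a minimal sketch of the idea so far, the following Python builds an inverted file from D1 to D5 and looks a word up by binary search over the sorted word list. The stop list, the regular expression used for word extraction, and the function names are assumptions made for this example and are not taken from the lecture; stemming is left out.

```python
import re
from bisect import bisect_left

# Documents D1-D5 from the slide, keyed by document ID.
docs = {
    "d1": "The Department of Computer Science was established in 1984.",
    "d2": "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "d3": "followed by the MSc in Computer Science which was started in 1991.",
    "d4": "The Department also produced its first PhD graduate in 1994.",
    "d5": "Our staff have contributed intellectually and professionally "
          "to the advancements in these fields.",
}

# Assumed stop list for this example only (not the lecture's list).
STOP_WORDS = {"the", "of", "was", "in", "its", "first", "by", "which",
              "also", "our", "have", "and", "to", "these"}


def extract_words(text):
    """Lowercase, keep runs of letters (this drops numbers), remove stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOP_WORDS]


def build_inverted_file(documents):
    """Return (sorted word list, parallel list of document-ID lists)."""
    postings = {}
    for doc_id, text in documents.items():
        for word in extract_words(text):
            ids = postings.setdefault(word, [])
            if doc_id not in ids:          # record each document once per word
                ids.append(doc_id)
    words = sorted(postings)
    return words, [postings[w] for w in words]


def lookup(words, doc_lists, word):
    """Binary search on the sorted word list; empty list if the word is absent."""
    i = bisect_left(words, word)
    if i < len(words) and words[i] == word:
        return doc_lists[i]
    return []


words, doc_lists = build_inverted_file(docs)
print(lookup(words, doc_lists, "computer"))    # ['d1', 'd2', 'd3']
print(lookup(words, doc_lists, "department"))  # ['d1', 'd2', 'd4']
```

Keeping the word list sorted is what makes the binary search possible, which is the point of the "which list is easier to search?" comparison.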
Creating Inverted file
  After stemming, make lowercase (optional), delete numbers (optional)
    D1  depart comput scienc establish
    D2  depart launch bsc hons comput studi
    D3  follow msc comput scienc start
    D4  depart produc phd graduat
    D5  staff contribut intellectu profession advanc field

Creating Inverted file (unsorted)
  Words       Documents     Words       Documents
  depart      d1,d2,d4      produc      d4
  comput      d1,d2,d3      phd         d4
  scienc      d1,d3         graduat     d4
  establish   d1            staff       d5
  launch      d2            contribut   d5
  bsc         d2            intellectu  d5
  hons        d2            profession  d5
  studi       d2            advanc      d5
  follow      d3            field       d5
  msc         d3
  start       d3

Creating Inverted file (sorted)
  Words       Documents     Words       Documents
  advanc      d5            msc         d3
  bsc         d2            phd         d4
  comput      d1,d2,d3      produc      d4
  contribut   d5            profession  d5
  depart      d1,d2,d4      scienc      d1,d3
  establish   d1            staff       d5
  field       d5            start       d3
  follow      d3            studi       d2
  graduat     d4
  hons        d2
  intellectu  d5
  launch      d2

Searching on Inverted File
  Binary search
    Used at small scale
  Create a thesaurus and combine techniques such as:
    Hashing
    B+tree
    Pointer to the address in the indexed file

Lexical Analysis for indexing
  Word extraction
    Spaces as English word boundaries
    Chinese word segmentation
  Stop word elimination
    "a", "an", "the", "about", "etc", "every", "you", etc.
  Word stemming

Lexical Analysis
  Lexical analysis is the process of converting an input stream of characters into a stream of words or tokens
  Lexical analysis is the first stage of:
    Automatic indexing
    Query processing

Lexical Analysis for Automatic Indexing
  What counts as a word or token in the indexing scheme? (an easy problem?)
    Digits: "Year 2000", "Y2K"
    Hyphens: "F-16", "MS-DOS"
    Other punctuation: "COMMAND.COM", "max_size" (often in C code)
    Case: IBM or ibm

Lexical Analysis for Automatic Indexing (cont.)
  No technical difficulty in solving any of these problems
  Must think about them carefully
  Tradeoff between recall and precision
    Breaking up hyphenated terms increases recall but decreases precision
    Preserving case distinctions enhances precision but decreases recall

Lexical Analysis for Query Processing
  Depends on the design of the lexical analyzer for automatic indexing
  Distinguish operators (Boolean operators, weighting function operators, etc.)
  Process certain characters:
    Control characters: "" for phrase search, {} for priority
    Disallowed punctuation characters (error)

STOPLISTS
  Many of the most frequently occurring words in English ("the", "of", etc.) are worthless as index terms
  Eliminating such words
    Speeds processing
    Saves huge amounts of space in indexes
    Does not damage retrieval effectiveness
  Stoplists are used to eliminate such words. E.g.,
    http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/stop_words
    http://bll.epnet.com/help/ehost/Stop_Words.htm
    http://www.syger.com/jsc/docs/stopwords/english.htm

STOPLISTS
  The choice of words in a stop list may vary from person to person.
  The general idea is to find words that occur so often that they are not good terms for information retrieval.
  How can the vector space model be used to find a list of stop words?
  How can stop words be found in Chinese?
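One simple way to act on the general idea of finding words that occur often is to count, for each word, the number of documents it appears in (its document frequency) and treat the most widespread words as stop word candidates. The Python below is an illustrative sketch only: the function name and the 0.8 threshold are hypothetical, the sample documents are D1 to D5 from earlier in the lecture, and it is not offered as the answer to the vector space question.

```python
import re
from collections import Counter


def stop_word_candidates(documents, min_fraction=0.8):
    """Return words that occur in at least min_fraction of the documents."""
    doc_freq = Counter()
    for text in documents:
        # A set counts each word once per document (document frequency).
        doc_freq.update(set(re.findall(r"[a-z]+", text.lower())))
    n = len(documents)
    return sorted(w for w, df in doc_freq.items() if df / n >= min_fraction)


sample_docs = [
    "The Department of Computer Science was established in 1984.",
    "The Department launched its first BSc(Hons) in Computer Studies in 1987.",
    "followed by the MSc in Computer Science which was started in 1991.",
    "The Department also produced its first PhD graduate in 1994.",
    "Our staff have contributed intellectually and professionally "
    "to the advancements in these fields.",
]

print(stop_word_candidates(sample_docs))  # ['in', 'the']
```

On a collection this small, lowering the threshold quickly pulls in topical words such as "department" and "computer", which is one reason the choice of stop words varies from person to person.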