Lecture 2
Vocabulary size and term distribution: tokenization, text normalization and stemming

Overview
Getting started:
– tokenization, stemming, compounds
– end of sentence detection
Collection vocabulary:
– terms, tokens, types
– vocabulary size
– term distribution
– stop words
Vector representation of text and term weighting

Tokenization
Friends, Romans, Countrymen, lend me your ears;
Friends | Romans | Countrymen | lend | me | your | ears
Token: an instance of a sequence of characters that are grouped together as a useful semantic unit for processing
Type: the class of all tokens containing the same character sequence
Term: a type that is included in the system dictionary (normalized)

The cat slept peacefully in the living room. It’s a very old cat.
Mr. O’Neill thinks that the boys’ stories about Chile’s capital aren’t amusing.
How should we handle special cases involving apostrophes, hyphens etc.? (see the tokenizer sketch below)
– C++, C#, URLs, emails, phone numbers, dates
– San Francisco, Los Angeles

Issues of tokenization are language specific
– Requires the language to be known
– Language identification based on classifiers that use short character subsequences as features is highly effective, since most languages have distinctive signature patterns
– Very important for information retrieval

Splitting tokens on spaces can cause bad retrieval results
– A search for York University also returns pages containing New York University
– German compound nouns: retrieval systems for German greatly benefit from a compound-splitter module that checks whether a word can be subdivided into words that appear in the vocabulary (see the compound-splitting sketch below)
– East Asian languages (Chinese, Japanese, Korean, Thai): text is written without any spaces between words

Stop words
Very common words that have no discriminatory power
Building a stop word list: sort terms by collection frequency and take the most frequent (see the stop-list sketch below)
– The result is collection dependent: in a collection about insurance practices, “insurance” would be a stop word
Why do we need stop lists?
– Smaller indices for information retrieval
– Better approximation of importance for summarization etc.
– But their use is problematic in phrasal searches
Trend in IR systems over time:
– large stop lists (200–300 terms)
– very small stop lists (7–12 terms)
– no stop list whatsoever
The 30 most common words account for 30% of the tokens in written text, but
– good compression techniques keep the index cost of including them small
– term weighting leads to very common words having little impact on document representation

Normalization
Token normalization: canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A. vs USA
– Anti-discriminatory vs antidiscriminatory
– Car vs automobile?
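
A minimal sketch of how a tokenizer might handle some of the apostrophe cases above, assuming a simple regular-expression approach; the pattern and its behaviour are illustrative only, not the rules of any particular retrieval system (hyphens, URLs, and items like C++ would need extra rules).

    import re

    # Words may contain internal apostrophes (O'Neill, aren't); numbers may
    # contain internal commas or periods. Everything else is dropped.
    TOKEN_RE = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)*|\d+(?:[.,]\d+)*")

    def tokenize(text):
        """Return the list of tokens found in text."""
        return TOKEN_RE.findall(text)

    print(tokenize("Mr. O'Neill thinks that the boys' stories "
                   "about Chile's capital aren't amusing."))
    # ['Mr', "O'Neill", 'thinks', 'that', 'the', 'boys', 'stories',
    #  'about', "Chile's", 'capital', "aren't", 'amusing']

Note that the trailing apostrophe in boys' is dropped while the internal ones in O'Neill and aren't are kept, which is one defensible choice among many.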
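A sketch of the compound-splitting check mentioned above for German: recursively test whether a word can be segmented into vocabulary words. The vocabulary entries and the min_part threshold are made-up illustrations; a real splitter would also score competing segmentations.

    def can_split(word, vocabulary, min_part=3):
        """True if word is in the vocabulary or can be segmented into
        vocabulary words of at least min_part characters each."""
        if word in vocabulary:
            return True
        for i in range(min_part, len(word) - min_part + 1):
            if word[:i] in vocabulary and can_split(word[i:], vocabulary, min_part):
                return True
        return False

    # Toy vocabulary of compound parts (illustrative only)
    vocab = {"lebens", "versicherungs", "gesellschafts", "angestellter"}
    print(can_split("lebensversicherungsgesellschaftsangestellter", vocab))  # True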
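A sketch of building a stop word list as described above: count collection frequencies and keep the most frequent terms. The toy documents and the cut-off n are arbitrary choices for illustration.

    from collections import Counter

    def build_stop_list(documents, n=30):
        """Return the n terms with the highest collection frequency."""
        cf = Counter(token for doc in documents for token in doc)
        return [term for term, _ in cf.most_common(n)]

    docs = [["the", "cat", "slept", "in", "the", "room"],
            ["the", "old", "cat", "is", "in", "the", "house"]]
    print(build_stop_list(docs, n=3))  # ['the', 'cat', 'in']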
Normalization sensitive to query
Query term → terms that should match:
– Windows → Windows
– windows → Windows, windows, window
– Window → window, windows

Capitalization / case folding
Good for:
– allowing instances of Automobile at the beginning of a sentence to match a query for automobile
– helping a search engine when most users type ferrari even though they are interested in a Ferrari car
Bad for:
– distinguishing proper names from common nouns: General Motors, Associated Press, Black
Heuristic solution: lowercase only words at the beginning of a sentence; handle the rest with true casing via machine learning (see the case-folding sketch below)
In IR, lowercasing everything is usually most practical because of the way users type their queries

Other languages
60% of webpages are in English, yet
– less than one third of Internet users speak English
– less than 10% of the world’s population primarily speaks English
Only about one third of blog posts are in English

Stemming and lemmatization
Organize, organizes, organizing
Democracy, democratic, democratization
Am, are, is → be
Car, cars, car’s, cars’ → car
Stemming
– crude heuristic process that chops off the ends of words
– Democratic → democrat
Lemmatization
– uses a vocabulary and morphological analysis and returns the base form of a word (its lemma)
– Democratic → democracy
– Sang → sing

Porter stemmer
Most common algorithm for stemming English: 5 phases of word reduction (see the stemming sketch below)
– SSES → SS: caresses → caress
– IES → I: ponies → poni
– SS → SS: caress → caress
– S → (dropped): cats → cat
– EMENT → (dropped): replacement → replac, but cement → cement

Vocabulary size
Dictionaries contain 600,000+ words, but they do not include names of people, locations, products etc.

Heaps’ law: estimating the number of terms
M = k · T^b
– M: vocabulary size (number of terms/types)
– T: number of tokens
– typically 30 < k < 100 and b ≈ 0.5
– linear relation between vocabulary size and number of tokens in log-log space

Zipf’s law: modeling the distribution of terms
The collection frequency of the i-th most common term is proportional to 1/i:
cf_i ∝ 1/i
If the most frequent term occurs cf_1 times, the second most frequent term has half as many occurrences, the third most frequent term a third as many, etc.
Equivalently, cf_i = c · i^k with k = −1, so log cf_i = log c + k · log i (a straight line in log-log space)

Problems with normalization
– A change in the stop word list can dramatically alter term weightings
– A document may contain an outlier term
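
A sketch of the heuristic case-folding rule mentioned earlier (lowercase only sentence-initial words), assuming for illustration that sentences end in '.', '!' or '?'; real systems would use a trained true-casing model instead.

    def heuristic_case_fold(text):
        """Lowercase only the first word of each sentence, leaving
        mid-sentence capitalised words (likely proper names) intact."""
        folded, sentence_start = [], True
        for tok in text.split():
            folded.append(tok.lower() if sentence_start else tok)
            sentence_start = tok.endswith((".", "!", "?"))
        return " ".join(folded)

    print(heuristic_case_fold("Automobile sales fell. General Motors disagreed."))
    # 'automobile sales fell. general Motors disagreed.'
    # Note: the heuristic still folds "General" because it starts a sentence,
    # which is exactly why machine-learned true casing can do better.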
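A toy version of the suffix rules listed for the Porter stemmer; the real algorithm has five phases and measure-based conditions (which is how replacement → replac but cement → cement), so this only illustrates the SSES/IES/SS/S rewrites.

    # Rules are tried in order; the first matching suffix wins.
    RULES = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]

    def stem_step1a(word):
        for suffix, replacement in RULES:
            if word.endswith(suffix):
                return word[: len(word) - len(suffix)] + replacement
        return word

    for w in ["caresses", "ponies", "caress", "cats"]:
        print(w, "->", stem_step1a(w))
    # caresses -> caress, ponies -> poni, caress -> caress, cats -> cat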
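A worked example of Heaps' law, M = k · T^b. The constants k = 44 and b = 0.49 are values reported for the Reuters-RCV1 collection in the IR literature and are used here only as an assumed illustration.

    def heaps(T, k=44, b=0.49):
        """Predicted vocabulary size M for a collection of T tokens."""
        return k * T ** b

    print(round(heaps(1_000_000)))  # roughly 38,000 distinct terms for a million tokens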
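A small sketch for checking Zipf's law on a token stream: compare observed collection frequencies against the prediction cf_i = cf_1 / i. The function and its output format are illustrative assumptions, not part of any standard toolkit.

    from collections import Counter

    def zipf_check(tokens, top=10):
        """Print observed vs Zipf-predicted frequencies for the top terms."""
        cf = Counter(tokens).most_common(top)
        cf1 = cf[0][1]
        for rank, (term, freq) in enumerate(cf, start=1):
            print(f"{rank:2d} {term:12s} observed={freq:8d} predicted={cf1 / rank:10.1f}")

On a reasonably large corpus the two columns track each other roughly, i.e. the points fall near a straight line of slope -1 in log-log space.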