
Text Classification
Eric Doi
Harvey Mudd College
November 20th, 2008
Kinds of Classification
 Language
 Hello. My name is Eric.
 Hola. Mi nombre es Eric.
 こんにちは。私の名前はエリックである。
 你好。 我叫Eric。
Kinds of Classification
 Type
 “approaches based on n-grams obtain
generalization by concatenating”*
 To: eric_k_doi@hmc.edu
Subject: McCain and Obama use it too
You have received this message because you opted in
to receive Sund Design special offers via email.
Login to your member account to edit your email
subscription. Click here to unsubscribe.
 ACAAGATGCCATTGTCCCCCGGCCTCCTG
*(Bengio)
Difficulties
 Dictionary? Generalization?
 Over 500,000 words in the English language
(over one million counting scientific terms)
 Typos/OCR errors
 Loan words
 We practice ballet at the café.
 Nous pratiquons le ballet au café.
Approaches
 Unique letter combinations
Language   String
English    “ery”
French     “eux”
Gaelic     “mh”
Italian    “cchi”
Dunning, Statistical Identification of Language
Approaches
 “Unique” letter combinations
Language   String
English    “ery”
French     “milieux”
Gaelic     “farmhand”
Italian    “zucchini”
 Requires hand-coding; what about other
languages (6000+)?
Dunning, Statistical Identification of Language
Approaches
 Try to minimize:
 Hand-coded knowledge
 Training data
 Input data (isolating phrases?)
 Dunning, “Statistical Identification of
Language.” 1994.
 Bengio, “A Neural Probabilistic Language
Model.” 2003.
Statistical Approach: N-Grams
 N-grams are sequences of n elements
Professor Keller is not a goth.
 Word-level bigrams: (Professor, Keller), (Keller, is),
(is, not), (not, a), (a, goth)
 Char-level trigrams: (P, r, o), (r, o, f), (o, f, e),
(f, e, s), …
Statistical Approach: N-Grams
 Mined from 1,024,908,267,229 words of Web text
(Google’s Web 1T n-gram corpus)
 Sample 4-grams
serve as the infrastructure 500
serve as the initial 5331
serve as the injector 56
Statistical Approach: N-Grams
 Counts inform a notion of probability
 Normalize frequencies into probability estimates
 P(serve as the initial) > P(serve as the injector)
 Classification
P(English | serve as the initial) >
P(Spanish | serve as the initial)
P(Spam | serve as the injector) <
P(!Spam | serve as the injector)
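A sketch of both steps in Python; the counts are the sample 4-grams from the previous slide, while the names and the naive Bayes-style scoring are illustrative:

counts = {
    ("serve", "as", "the", "infrastructure"): 500,
    ("serve", "as", "the", "initial"): 5331,
    ("serve", "as", "the", "injector"): 56,
}
total = sum(counts.values())

def p(ngram):
    # Relative frequency within this (tiny) table.
    return counts.get(ngram, 0) / total

assert p(("serve", "as", "the", "initial")) > p(("serve", "as", "the", "injector"))

# Classification, naive Bayes style: score each class by the probability
# its own n-gram model assigns to the input, weighted by a class prior.
def score(text_ngrams, class_model, prior):
    s = prior
    for g in text_ngrams:
        s *= class_model.get(g, 0)  # zero counts break this -- see smoothing
    return s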
Statistical Approach: N-Grams
 But what about P(serve as the ink)?
 = 0?
 P(serve as the ink) = P(vxvw aooa *%^$) = 0?
 How about P(sevre as the initial)?
Statistical Approach: N-Grams
 How do we smooth out sparse data?
 Additive smoothing
 Interpolation
 Good-Turing estimate
 Backoff
 Witten-Bell smoothing
 Absolute discounting
 Kneser-Ney smoothing
MacCartney
Statistical Approach: N-Grams
 Additive smoothing: add a small count to every
possible n-gram
 Interpolation: also consider smaller n-grams,
e.g. (serve as the), (serve); see the sketch below
 Backoff: fall back to smaller n-grams only when
the full n-gram is unseen
MacCartney
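A minimal sketch of additive smoothing and interpolation, assuming toy unigram/bigram count tables; the names and the delta/lambda values are illustrative, not from MacCartney:

from collections import Counter

def additive_prob(ngram, counts, num_possible, delta=1.0):
    # Add-delta smoothing: pretend every possible n-gram was seen delta
    # times, so unseen ones like (serve, as, the, ink) get p > 0.
    total = sum(counts.values()) + delta * num_possible
    return (counts.get(ngram, 0) + delta) / total

def interpolated_prob(word, history, unigrams, bigrams, lam=0.7):
    # Interpolation: mix the sparse bigram estimate with the safer
    # unigram estimate instead of trusting either alone.
    p_uni = unigrams.get(word, 0) / sum(unigrams.values())
    hist_total = sum(c for (h, _), c in bigrams.items() if h == history)
    p_bi = bigrams.get((history, word), 0) / hist_total if hist_total else 0.0
    return lam * p_bi + (1 - lam) * p_uni

unigrams = Counter({"serve": 9, "as": 12, "the": 40, "initial": 3})
bigrams = Counter({("serve", "as"): 7, ("as", "the"): 8, ("the", "initial"): 2})
print(interpolated_prob("initial", "the", unigrams, bigrams))

Backoff differs only in using the bigram estimate when it is nonzero and falling back to the unigram estimate otherwise.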
Statistical Approach: Results
 Dunning: compared parallel translated
texts in English and Spanish
 20-character input, 50K training text: 92% accuracy
 500-character input, 50K training text: 99.9%
 Also modified to compare DNA sequences of
humans, E. coli, and yeast
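A sketch of the kind of classifier behind these numbers: one smoothed character n-gram model per language, with the input assigned to the highest-scoring model. The training strings below are stand-ins for real corpora, and delta is an arbitrary choice:

import math
from collections import Counter

def char_ngrams(text, n=3):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def log_prob(text, model, n=3, delta=0.5):
    # Add-delta-smoothed log-probability of the text under one model.
    total = sum(model.values())
    vocab = len(model) + 1
    return sum(math.log((model[g] + delta) / (total + delta * vocab))
               for g in char_ngrams(text, n))

models = {
    "English": Counter(char_ngrams("we practice ballet at the café")),
    "French": Counter(char_ngrams("nous pratiquons le ballet au café")),
}
guess = max(models, key=lambda lang: log_prob("le ballet", models[lang]))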
Neural Network Approach
Bengio et al., “A Neural Probabilistic
Language Model.” 2003:
 N-grams do handle sparse data well
 However, there are problems:
 Narrow consideration of context (~1–2 words)
 Does not consider semantic/grammatical
similarity:
“A cat is walking in the bedroom”
“A dog was running in a room”
Neural Network Approach
 The general idea:
1. Associate with each word in the vocabulary
(e.g. size 17,000) a feature vector (30–100
features)
2. Express the joint probability function of word
sequences in terms of feature vectors
3. Learn simultaneously the word feature vectors
and the parameters of the probability function
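A numpy sketch of these three steps with the sizes from the slide; the real model (Bengio et al. 2003) adds a nonlinear hidden layer, and the training step is only indicated in a comment:

import numpy as np

V, d, context = 17_000, 30, 3  # vocab size, features per word, history length

rng = np.random.default_rng(0)
C = rng.normal(size=(V, d))             # step 1: a feature vector per word
W = rng.normal(size=(context * d, V))   # step 2: probability-function parameters

def next_word_probs(history_ids):
    # P(w_t | previous `context` words) from concatenated feature vectors.
    x = C[history_ids].reshape(-1)      # look up and concatenate vectors
    logits = x @ W
    e = np.exp(logits - logits.max())   # softmax over the whole vocabulary
    return e / e.sum()

probs = next_word_probs([11, 42, 7])    # arbitrary word ids
# step 3 (not shown): gradient descent on the log-likelihood updates both
# C (the feature vectors) and W (the probability function) together.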
References
 Dunning, T. “Statistical Identification of Language.” 1994.
 Bengio, Y., et al. “A Neural Probabilistic Language Model.” Journal of Machine Learning Research, 2003.
 MacCartney, B. “NLP Lunch Tutorial: Smoothing.” 2005.