n-gram models

advertisement
Ngram Models
Bahareh Sarrafzadeh
Winter 2010
Agenda
• Ngrams
– Language Modeling
– Evaluation of LMs
• Markov Models
– Stochastic Process
– Markov Chain
• Text Classification
– Ngram-based Approach
NGram
What is an N-Gram?
• A subsequence of n items from a given
sequence
• Items:
– Phonemes
– Syllables
– Letters
– Words
• Number of Items:
– Unigram, Bigram, Trigram, ...
N-Gram - Examples
• 3-Grams
–
–
–
–
–
ceramics collectables collectibles (55)
ceramics collectables fine (130)
ceramics collected by (52)
ceramics collectible pottery (50)
ceramics collectibles cooking (45)
• 4-Grams
–
–
–
–
–
–
serve as the incoming (92)
serve as the incubator (99)
serve as the independent (794)
serve as the index (223)
serve as the indication (72)
serve as the indicator (120)
N-Gram Model
• A Probabilistic Model for Predicting the next Item
in such a sequence.
• Why do we want to Predict Words?
–
–
–
–
–
–
–
Chatbots
Speech recognition
Handwriting recognition/OCR
Spelling correction
Author attribution
Plagiarism detection
...
N-Gram Model
• Models Sequences, esp. NL, using the
Statistical Properties of N-Grams
• Idea: Shannon
– given a sequence of letters (e.g. "for ex"), what is
the likelihood of the next letter?
– From training data, derive a probability
distribution for the next letter given a history of
size n.
N-Gram Model
• Predicts xi based on xi – 1, xi – 2 , ..., xi – n:
P( xi | xi  1, xi  2,..., xi  n)
• NGram Independence Assumption:
– word is affected only by its “prior local context” (last
few words)
– Advantages:
• Massively simplifies the problem of learning the language
model
• because of the open nature of language, it is common to
group words unknown to the language model together
Language Models
• A statistical language model assigns a
probability to a sequence of m words by
means of a probability distribution
• Applications in NLP:
–
–
–
–
–
speech recognition,
machine translation,
part-of-speech tagging,
parsing,
information retrieval.
• The goal of Statistical Language Modeling is to
build a statistical language model that can
estimate the distribution of natural language
as accurate as possible.
A bad language model
A bad language model
A bad language model
A bad language model
What happened?
• A Language model is a probability distribution
over word sequences
– P(“And nothing but the truth”)  0.001
– P(“And nuts sing on the roof”)  0
How language models work?
• Hard to compute P(“And nothing but the
truth”)
• Step 1: Decompose probability
P(“And nothing but he
t truth) 
P(“And”)  P(“but|and nothing”)  P(“the|and nothingbut”) 
P(“truth|and nothingbut the”)
Language Models - Simplification
• Estimating the probability of sequences can
become difficult in corpora
– Arbitrary long phrases or sentences
– Data sparseness
– Overfitting
• Solution: Models are often approximated
using smoothed N-gram models.
Ngram Modeling of a Language
• In an n-gram model, the probability of observing
the sentence w1,...,wm is approximated as:
Prediction
History
m
m
i 1
i 1
P( w1,...,wm)   P( wi | w1,...,wi  1)   P( wi | wi  ( n  1),...,wi  1)
• The conditional probability can be calculated from
n-gram frequency counts:
count( wi  ( n  1),...,wi  1, wi )
P( wi | wi  ( n  1),...,wi  1) 
count( wi  ( n  1),...,wi  1)
Example
• Assume each word depends only on the
previous two words (Trigram Assumption)
P(“the| whole truth and nothing but”) 
P(“the|nothing but”)
P(“truth| whole truth and nothing but he”)
t

P(“truth|but the”)
Smoothing
• It is useful to assign small probabilities to
unseen n-mers.
• For example, for 3-grams we add 2 “dummy“
words (such as ‘.’) to the beginning of each
sentence, we have:
P(w1w2...wn)  P(w1 | ..)P(w2 | w1.)...P(wn | wn  1wn  2)
Graphical Representation
...
1-gram
...
...
2-gram
...
Previous (n-1)-gram
n-gram
Use of Log Probabilities
• Multiplying a large number of probabilities
gives a very small result (close to zero)
• So in order to avoid floating-point underflow,
we should use logarithms of the probabilities
in the model.
Evaluation
• Extrinsic
– The language model is embedded in a wider
application:
• Slow
• Specific to the application
• Intrinsic
– The language model is evaluated directly using
some measure, such as Perplexity
Perplexity
• Perplexity is a measure of the size of the set of
words from which the next word is chosen
given that we observe the history of spoken
words.
• The perplexity of a LM depends on the
domain of discourse.
Perplexity: Intuition
• Ask a speech recognizer to recognize digits “0,
1, 2, 3, 4, 5, 6, 7, 8, 9” – easy – perplexity 10
• Ask a speech recognizer to recognize names at
Microsoft – hard – 30,000 – perplexity 30,000
• Perplexity is weighted equivalent branching
factor.
Perplexity: Is lower better?
• Remarkable fact: the true model for data has
the lowest possible perplexity
• Lower the perplexity, the closer we are to the
true model.
Markov
Model
Markov Property – Markov Process
• “the future is independent of the past given
the present.”
• A stochastic process has the Markov property
if the conditional probability distribution of
future states of the process depend only upon
the present state.
• A process with this property is called Markov
process.
Markov Chain
• We have a set of states, S = {s1, s2, ... , sr}.
• The process starts in one of these states and
moves successively from one state to another.
• Each move is called a step.
• If the chain is currently in state si, then it moves
to state sj at the next step with a probability
denoted by pij .
• This probability does not depend upon which
states the chain was in before the current
state
Order m – Markov Chain
• A Markov chain of order m (or a Markov chain
with memory m) where m is finite, is a process
in which the future state depends on the past
m states.
Text Generation using
Markov Chains
• Markov processes can also be used to
generate superficially "real-looking" text given
a sample document
• These processes are also used by spammers to
inject real-looking hidden paragraphs into
emails to get these messages past spam filters.
• Shannon considers a series of Markov chain
REPRESENTING AND
approximations
to English
SPEEDILY
IS AN prose.
GOOD
COME CAN
• For example,APT
he OR
presents
first a simulation
DIFFERENT
NATURAL
where the words
are chosen
independently
HERE HE THE
A IN
but with appropriate
frequencies.
CAME THE TO OF TO
EXPERT GRAY COME TO
FURNISHES THE LINE
MESSAGE HAD BE
THESE.
•
THE HEAD AND IN
He then notesFRONTAL
the increased
similarity
to
ATTACK ON
AN
ENGLISH
ordinary English
textWRITER
when the words are
THAT THEchain,
CHARACTER
chosen as a Markov
in which case he
OF THIS POINT IS
obtains
THEREFORE
ANOTHER METHOD FOR
THE LETTERS THAT THE
TIME OF
WHO EVER TOLD THE
PROBLEM FOR AN
UNEXPECTED.
Garkov!
Text Classification
using
NGram
Text Classification
• A fundamental kind of document processing
• A content based assignment of one or more
predefined categories to free texts.
• Approaches:
– Supervised
– Unsupervised
– Semisupervised
Main Tasks
1. Feature Construction / Selection
– Extracting Representative Features
•
•
•
•
Words- Frequency
Context of Words – Set of Words
Spare Phrases – Neighbour Words
Word Ngrams - Frequency
2. Learning Phase
– Binary Classifiers
– M-ary Classifiers
Learning Algorithms
•
•
•
•
•
Decision Trees
Naive Bayes
KNN
Neural Networks
Support Vector Machines
Ngram based Text Classification
• Features:
– N-grams
• Values:
– N-grams Frequencies
• Similarity measure
– Of various types
Classifier’s Characteristics
• The categorization must work reliably in spite
of textual errors.
• The categorization must be efficient,
consuming as little storage and processing
time as possible.
• The categorization must be able to recognize
when a given document does not match any
category, or when it falls between two
categories.
Overall Approach
• Start with a set of pre-existing text categories (such as
subject domains)
• Generate a set of N-gram frequency profiles to
represent each of the categories.
• When a new document arrives for classification, the
system first computes its N-gram frequency profile.
• It then compares this profile against the profiles for
each of the categories using an easily calculated
distance measure.
• The system classifies the document as belonging to the
category having the smallest distance.
N-gram Frequency Statistics
• Each word occurs in human languages with a
different frequency.
• One of the most common ways of expressing
this idea: Zipf’s Law
Zipf’s Law
• The nth most common word in a human
language text occurs with a frequency
inversely proportional to n:
k
f 
r
• there is always a set of words which
dominates most of the other words of the
language in terms of frequency of use.
Zipf’s Law
• The most frequent word will occur
approximately twice as often as the second
most frequent word, which occurs twice as
often as the fourth most frequent word ...
• This is true for:
– Languages,
– Subject – specific words
Zipf’s Law: Example
• For example, in the Brown Corpus "the" is the
most frequently occurring word, and by itself
accounts for nearly 7% of all word occurrences,
• The second-place word "of" accounts for
slightly over 3.5% of words,
• Followed by "and" (about 2%)
• Only 135 vocabulary items are needed to
account for half the Brown Corpus.
Zipf’s Law Applies to Lots of Things
•
•
•
•
•
frequency of accesses to web pages
sizes of settlements
income distribution amongst individuals
size of earthquakes
words in the English language
word frequency in Wikipedia
Zipf’s Law: Classification
• Zipf’s Law implies that classifying documents
with N-gram frequency statistics will not be
very sensitive to cutting off the distributions
at a particular rank.
• It also implies that if we are comparing
documents from the same category they
should have similar N-gram frequency
distributions.
Document Representation
• Documents were represented, by their N-gram
frequency profiles:
– The list of N-grams ordered by the number of
occurrences in the given document.
– It simply describes the Zipfian distribution of Ngrams in the document.
Generating N-Gram Frequency
Profiles
• Split the text into separate tokens
• Scan down each token, generating all possible
N-grams
• Hash into a table to find the counter for the Ngram, and increment it.
• When done, output all N-grams and their
counts.
• Sort those counts into reverse order by the
number of occurrences.
Comparing and Ranking N-Gram
Frequency Profiles
• Take two N-gram profiles
• Calculate a simple rank-order statistic :
– E.g. “out-of-place” measure
Language Classification
• Most writing systems support more than one
language.
• Given a text that uses a particular writing
system, it is necessary to determine the
language in which it is written before further
processing is possible.
Lexicon-based Approach
• Keep a lexicon for each possible language
• Look up every word in the sample text to see
in which lexicon it falls
• The lexicon that contains the most words from
the sample indicates which language was used
• Is it a good approach?
Challenges
• Building or Obtaining a Representative Lexicon
is not easy!
• For the highly inflected languages,
– A much larger lexicon
– Some language-specific morphological processing
required
• Spelling errors (e.g. as the result of an OCR
process), will disrupt the lexicon lookup
process
Ngram-based Approach
• Basic idea: Identify N-grams whose occurrence
in a document gives strong evidence for /
against identification of a text as belonging to
a particular language
• N-gram frequency profile technique can be
used to classify document according to their
language
Requirements
• No lexicon
• No Morphological Processing rules
• A good number of sample texts (10K to 20K
bytes)
• Calculating the N-gram frequency profiles
Advantages
• Modest Computational and Storage
requirements
• Very effective
• Simple
• No Semantic or Content analysis required
(apart from the N-gram frequency profile)
Subject Classification
• The same text categorization approach
• Extended to a multi-language database
• Overall:
–
–
–
–
A training set is obtained
N-gram frequencies are calculated for each class
N-gram frequencies are calculated for a new article
An overall distance measure between profiles is
computed
– The article is assigned to the category which
minimizes this distance
N-grams: Summary
•
•
•
•
Very simple but effective
Resistant to Textual Errors
No Text Preprocessing
Language Independent
References
• P. Brown, et al, “Class-Based n-gram Models of Natural
Language”, Association for Computational Linguistics, 1992
• V. Keseljy, N. Cercone et al, “N-gram-based author profiles
for authorship attribution”, 2003
• W. B. Cavnar, J. M. Trenkle, “N-gram-based text
categorization”, Proceedings of SDAIR-94, 3rd Annual
Symposium on Document Analysis and Information
Retrieval, 1994
• P. Náther, “N-gram based Text Categorization”, Diploma
thesis, 2005
• J. Henke, “Statistical Inference: n-gram Models over Sparse
Data”, TDM Seminar
• J. Goodman, “The State of The Art in Language Modeling ”,
Microsoft Research, Speech Technology Group
• http://homepages.inf.ed.ac.uk/lzhang10/slm.html
Thank You!
Download