
Oznur Tastan
10601 Machine Learning
Recitation 3
Sep 16 2009
Outline
• A text classification example
– Multinomial distribution
– Dirichlet distribution
• Model selection
– Miro will continue with that topic
Text classification example
Text classification
• We haven't covered classification yet.
• For the sake of this example, I'll briefly go over what it is.
Classification task:
Given an input x, predict its label y from some fixed set of labels {y1, ..., yk}.
Text classification: spam filtering
Input: document D
Output: the predicted class y from {y1,...,yk }
Spam filtering:
Classify email as ‘Spam’, ‘Other’.
P(Y = spam | X)
Text classification
Input: document D
Output: the predicted class y from {y1,...,yk }
Text classification examples:
Classify email as ‘Spam’, ‘Other’.
What other text classification applications can you think of?
Text classification
Input: document x
Output: the predicted class y, where y is from {y1,...,yk}
Text classification examples:
Classify email as
‘Spam’, ‘Other’.
Classify web pages as
‘Student’, ‘Faculty’, ‘Other’
Classify news stories into topics
‘Sports’, ‘Politics’..
Classify business names by
industry.
Classify movie reviews as
‘Favorable’, ‘Unfavorable’, ‘Neutral’
… and many more.
Text Classification: Examples
Classify news articles into one of 93 categories.
An example category ‘wheat’
ARGENTINE 1986/87 GRAIN/OILSEED REGISTRATIONS
BUENOS AIRES, Feb 26
Argentine grain board figures show crop registrations of grains, oilseeds and their
products to February 11, in thousands of tonnes, showing those for future
shipments month, 1986/87 total and 1985/86 total to February 12, 1986, in
brackets:
Bread wheat prev 1,655.8, Feb 872.0, March 164.6, total 2,692.4 (4,161.0).
Maize Mar 48.0, total 48.0 (nil).
Sorghum nil (nil)
Oilseed export registrations were:
Sunflowerseed total 15.0 (7.9)
Soybean May 20.0, total 20.0 (nil)
The board also detailed export registrations for subproducts, as follows....
Representing text for classification
(same ARGENTINE grain/oilseed article as above)
How would you represent the document?
Representing text: a list of words
argentine, 1986, 1987, grain, oilseed,
registrations, buenos, aires, feb, 26,
argentine, grain, board, figures, show, crop,
registrations, of, grains, oilseeds, and, their,
products, to, february, 11, in, …
Common refinements: stopword removal, stemming, collapsing multiple occurrences of a word into one, ...
‘Bag of words’ representation of text
(same ARGENTINE grain/oilseed article as above)
Bag of words representation:
Represent text as a vector of word frequencies.

word         frequency
grain(s)     3
oilseed(s)   2
total        3
wheat        1
maize        1
soybean      1
tonnes       1
...          ...
Bag of words representation
For a collection of documents:
Frequency(i, j) = the number of occurrences of word j in document i
Bag of words
What simplifying assumption are we making?
We assume word order is not important.
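To make this concrete, here is a minimal sketch of building a bag-of-words frequency vector in Python; the tokenizer and stopword list are simplistic placeholders for illustration, not the preprocessing used in the lecture:

```python
from collections import Counter

# Hypothetical stopword list, just for illustration.
STOPWORDS = {"of", "in", "and", "their", "to", "the", "for", "as"}

def bag_of_words(text):
    """Return a word -> frequency map: lowercase, strip punctuation, drop stopwords."""
    tokens = (w.strip(".,:;()") for w in text.lower().split())
    return Counter(w for w in tokens if w and w not in STOPWORDS)

doc = "Argentine grain board figures show crop registrations of grains, oilseeds ..."
print(bag_of_words(doc))
# Counter({'argentine': 1, 'grain': 1, 'board': 1, ..., 'grains': 1, 'oilseeds': 1})
```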
‘Bag of words’ representation of text
(same ARGENTINE grain/oilseed article and word-frequency table as above)

Pr(D | Y = y) = ?

Pr(W1 = n1, ..., Wk = nk | Y = y)
Multinomial distribution
• The multinomial distribution is a generalization of the binomial distribution.
• The binomial distribution counts successes of an event (for example, heads in coin tosses).
  The parameters:
  – N (number of trials)
  – θ (the probability of success of the event)
• The multinomial counts the number of occurrences of each of a set of events (for example, how many times each side of a die comes up in a set of rolls).
  The parameters:
  – N (number of trials)
  – θ1, ..., θk (the probability of each category)
Multinomial Distribution
A box contains balls of k possible colors.
You select N balls at random and put them into your bag.
Let θi be the probability of picking a ball of color i, for each color i = 1, ..., k.
Let Wi be the random variable denoting the number of balls of color i selected; it can take values in {0, 1, ..., N}.
Multinomial Distribution
W1, W2, ..., Wk are random variables.

P(W1 = n1, ..., Wk = nk | N, θ1, ..., θk) = [N! / (n1! n2! ... nk!)] θ1^n1 θ2^n2 ... θk^nk

where Σi ni = N and Σi θi = 1.

N! counts the possible orderings of the N balls; dividing by n1! n2! ... nk! makes the selections order invariant. Note that the individual draws are independent.

The binomial distribution is the multinomial distribution with k = 2 and θ1, θ2 = θ, 1 − θ.
‘Bag of words’ representation of text
(same ARGENTINE grain/oilseed article and word-frequency table as above)
‘Bag of words’ representation of text
(word-frequency table as above)
The document can be represented with a multinomial distribution:
Words = colored balls; there are k possible types of them.
Document = contains N words; word i occurs ni times.
The multinomial distribution over words is going to be different for different document classes.
In the document class ‘wheat’, ‘grain’ is more likely, whereas in a document about a hard drive shipment the parameter for ‘wheat’ is going to be smaller.
Multinomial distribution and bag of words
Represent document D as a list of words w1, w2, ...
For each category y, build a probabilistic model Pr(D | Y = y):
Pr(D = {argentine, grain, ...} | Y = wheat) = ...
Pr(D = {stocks, rose, in, heavy, ...} | Y = nonWheat) = ...
To classify, find the y which was most likely to have generated D.
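A minimal sketch of this classify-by-likelihood idea, folding in a class prior via Bayes' rule; the vocabulary, the ‘other’ catch-all bucket, and all parameter values are made-up illustrations, not estimates from real data:

```python
import math

# Hypothetical per-class multinomial word probabilities P(word | class);
# 'other' is a catch-all bucket for unlisted words.
theta = {
    "wheat":    {"grain": 0.30, "wheat": 0.30, "stocks": 0.05, "rose": 0.05, "other": 0.30},
    "nonWheat": {"grain": 0.05, "wheat": 0.05, "stocks": 0.30, "rose": 0.30, "other": 0.30},
}
prior = {"wheat": 0.5, "nonWheat": 0.5}  # assumed uniform class prior

def log_posterior(word_counts, y):
    """log P(Y=y) + log P(D | Y=y), dropping the count-only multinomial coefficient."""
    return math.log(prior[y]) + sum(
        n * math.log(theta[y].get(w, theta[y]["other"])) for w, n in word_counts.items())

def classify(word_counts):
    return max(prior, key=lambda y: log_posterior(word_counts, y))

print(classify({"grain": 3, "wheat": 1}))  # -> 'wheat'
print(classify({"stocks": 2, "rose": 1}))  # -> 'nonWheat'
```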
Conjugate distribution
• If the posterior is in the same family of distributions as the prior, the prior is called a conjugate prior for the likelihood.
• The Dirichlet distribution is the conjugate prior for the multinomial, just as the beta distribution is the conjugate prior for the binomial.
Dirichlet distribution
The Dirichlet distribution generalizes the beta distribution, just like the multinomial distribution generalizes the binomial distribution:

Dir(θ1, ..., θk | α1, ..., αk) = [Γ(α1 + ... + αk) / (Γ(α1) ... Γ(αk))] θ1^(α1−1) ... θk^(αk−1)

where Γ is the gamma function. The Dirichlet parameter αi can be thought of as a prior count of the ith class.
Dirichlet Distribution
Let's say the prior for θ1, ..., θk is Dir(α1, ..., αk).
From observations we have the counts n1, ..., nk.
The posterior distribution for θ1, ..., θk given the data is then Dir(α1 + n1, ..., αk + nk).
So the prior works like pseudo-counts.
Pseudo-counts and the prior
• Let's say you estimated the probabilities from a collection of documents without using a prior.
• Every word unobserved in your document collection would be assigned zero probability of occurring in that document class. So whenever a document containing that word comes in, the probability of that document being in that class will be zero, which is probably wrong when you have only limited data.
• Using priors is a way of smoothing the probability distributions, leaving some probability mass for the events unobserved in your data.
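A minimal sketch of the pseudo-count effect with a symmetric Dir(α, ..., α) prior; the vocabulary and counts are made up for illustration:

```python
vocab = ["grain", "wheat", "maize", "soybean"]               # made-up vocabulary
counts = {"grain": 3, "wheat": 1, "maize": 0, "soybean": 0}  # observed counts in one class
alpha = 1.0                                                  # pseudo-count per word

def smoothed_prob(word):
    """Posterior-mean estimate: (n_w + alpha) / (N + alpha * |V|)."""
    total = sum(counts.values()) + alpha * len(vocab)
    return (counts[word] + alpha) / total

for w in vocab:
    print(w, smoothed_prob(w))
# 'maize' and 'soybean' get small but nonzero probability instead of zero.
```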
Generative model
(Plate diagram: α → θ; for each of the D documents a class C is drawn, and each of the N words W in the document is drawn from the multinomial of class C.)
W : word
C : document class generating the word
N : number of words in the document
D : collection of documents
θ : matrix of parameters for the class-specific multinomials
α : Dirichlet(α) prior for θ
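A minimal sketch of sampling from this generative story with numpy; the vocabulary and the symmetric α = 1 prior are made-up illustrations:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["grain", "wheat", "stocks", "rose"]  # made-up vocabulary
alpha = np.ones(len(vocab))                   # symmetric Dirichlet prior on theta

def generate_document(n_words):
    theta = rng.dirichlet(alpha)              # draw a word distribution from the prior
    counts = rng.multinomial(n_words, theta)  # draw a bag of N words from it
    return dict(zip(vocab, counts))

print(generate_document(10))  # e.g. {'grain': 4, 'wheat': 1, 'stocks': 3, 'rose': 2}
```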
Model Selection
Polynomial Curve Fitting
(Figure: blue points = observed data; green curve = true distribution)
Sum-of-Squares Error Function
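For reference, the standard sum-of-squares error this slide illustrates (in Bishop's notation, with polynomial prediction y(xn, w) and targets tn) is:

E(w) = (1/2) Σ_{n=1..N} [ y(xn, w) − tn ]²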
0th Order Polynomial
(Figure: blue = observed data; red = predicted curve; green = true distribution)
1st Order Polynomial
(Figure: blue = observed data; red = predicted curve; green = true distribution)
3rd Order Polynomial
(Figure: blue = observed data; red = predicted curve; green = true distribution)
9th Order Polynomial
(Figure: blue = observed data; red = predicted curve; green = true distribution)
Which of the predicted curves is better?
(Figure: blue = observed data; red = predicted curve; green = true distribution)
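A minimal sketch reproducing this experiment with numpy; the sine-plus-noise data generator follows Bishop's classic example, and the noise level is an assumption:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)  # noisy observations
x_true = np.linspace(0, 1, 100)
t_true = np.sin(2 * np.pi * x_true)                             # the green (true) curve

for degree in (0, 1, 3, 9):
    coeffs = np.polyfit(x, t, degree)                 # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - t) ** 2)
    true_mse = np.mean((np.polyval(coeffs, x_true) - t_true) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.3f}, error vs true curve {true_mse:.3f}")
# The 9th-order fit drives training error to ~0 but fits the noise, not the curve.
```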
What do we really want?
Why not choose the method with the best fit to the data?
If we asked you the homework questions on the midterm, would we have a good estimate of how well you learned the concepts?
How well are you going to predict future data drawn from the same distribution?
Example
General strategy
You try to simulate the real-world scenario.
The test data is your future data: put it away and don't look at it.
The validation set is like your test set; you use it to select your model.
The whole aim is to estimate the model's true error from the sample data you have.
!!! For the rest of the slides, assume we have already put the test data away.
Consider it the validation data when it says test set.
Test set method
• Randomly split off some portion of your data and leave it aside as the test set.
• The remaining data is the training data.
• Learn a model from the training set. This is the model you learned.
• How good is the prediction? Estimate your future performance with the test data.
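A minimal sketch of the test-set method with numpy, continuing the polynomial example; the 70/30 split ratio and degree-3 model are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

idx = rng.permutation(x.size)           # shuffle the record indices
test_idx, train_idx = idx[:9], idx[9:]  # hold ~30% out as the test set

coeffs = np.polyfit(x[train_idx], t[train_idx], 3)  # learn on training data only
test_mse = np.mean((np.polyval(coeffs, x[test_idx]) - t[test_idx]) ** 2)
print(f"estimated future performance (test MSE): {test_mse:.3f}")
```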
Train/test set split
It is simple.
What is the downside?
More data is better: with more data you can learn better.
(Figure: blue = observed data; red = predicted curve; green = true distribution. Compare the predicted curves.)
Train/test set split
It is simple.
What is the downside?
1. You waste some portion of your data.
2. If you don't have much data, you may be lucky or unlucky with your test data.
How does this translate to statistics?
Your estimator of performance has high variance.
Cross Validation
Recycle the data!
LOOCV (Leave-one-out Cross Validation)
Let's say we have N data points, and let k be the index for data points, k = 1..N.
Let (xk, yk) be the kth record.
Temporarily remove (xk, yk) from the dataset; it is your single test data point.
Train on the remaining N−1 data points.
Test your error on (xk, yk).
Do this for each k = 1..N and report the mean error.
LOOCV (Leave-one-out Cross Validation)
There are N data points. Do this N times; notice the test data changes each time.
(Figure: the resulting mean squared error, MSE = 3.33)
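A minimal LOOCV sketch, continuing the polynomial example; the degree-3 model is an arbitrary choice:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 15)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

errors = []
for k in range(x.size):                       # leave record k out
    mask = np.arange(x.size) != k
    coeffs = np.polyfit(x[mask], t[mask], 3)  # train on the remaining N-1 points
    errors.append((np.polyval(coeffs, x[k]) - t[k]) ** 2)
print(f"LOOCV mean error: {np.mean(errors):.3f}")
```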
K-fold cross validation
(Figure: the data is split into k folds; in each run one fold is the test set and you train on the remaining k − 1 splits.)
In 3-fold cross validation, there are 3 runs.
In 5-fold cross validation, there are 5 runs.
In 10-fold cross validation, there are 10 runs.
The error is averaged over all runs.
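A minimal k-fold sketch in the same style; k = 5 is chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.size)

k = 5
folds = np.array_split(rng.permutation(x.size), k)  # k disjoint index sets
errors = []
for test_idx in folds:                              # one run per fold
    train_mask = np.ones(x.size, dtype=bool)
    train_mask[test_idx] = False                    # train on the other k-1 splits
    coeffs = np.polyfit(x[train_mask], t[train_mask], 3)
    errors.append(np.mean((np.polyval(coeffs, x[test_idx]) - t[test_idx]) ** 2))
print(f"{k}-fold CV error: {np.mean(errors):.3f}")  # averaged over all runs
```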
Model Selection
In-sample error estimates:
Akaike Information Criterion (AIC)
Bayesian Information Criterion (BIC)
Minimum Description Length Principle (MDL)
Structural Risk Minimization (SRM)
Extra-sample error estimates:
Cross-Validation (the most used method)
Bootstrap
References
• http://videolectures.net/mlas06_cohen_tc/
• http://www.autonlab.org/tutorials/overfit.html