Empirical Research Methods in
Computer Science
Lecture 7
November 30, 2005
Noah Smith
Using Data

[Diagram: Data → Model → Action. The Data-to-Model step is labeled "estimation; regression; learning; training"; the Model-to-Action step is labeled "classification; decision". This pipeline goes by many names: pattern classification, machine learning, statistical inference, ...]
Probabilistic Models

Let X and Y be random variables (continuous, discrete, structured, ...).

Goal: predict Y from X.

A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?
Using a Model

We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.

The model maps x to P(spam | x) and P(mail | x).

ŷ = spam if P(spam | x) ≥ P(mail | x); mail otherwise.
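In code, the decision rule is a one-line comparison. A minimal sketch, assuming the model's two posteriors have already been computed (the function and argument names are illustrative, not from the lecture):

```python
def classify(p_spam_given_x: float, p_mail_given_x: float) -> str:
    # Predict the label whose posterior probability is larger;
    # ties go to spam, matching the rule above.
    return "spam" if p_spam_given_x >= p_mail_given_x else "mail"

print(classify(0.8, 0.2))  # -> spam
```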
Bayes’ Rule
P(y | x) = P(x | y) · P(y) / P(x)

- P(x | y) is the likelihood: one distribution over complex observations per y.
- P(y) is the prior.
- P(y | x) is what we said the model must define.
- P(x) normalizes into a distribution: P(x) = Σy' P(y') · P(x | y')
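A numeric sketch of the rule for the spam/mail example; the prior and likelihood values below are invented for illustration:

```python
# Bayes' rule: P(y | x) = P(x | y) * P(y) / P(x); numbers are made up.
prior = {"spam": 0.455, "mail": 0.545}
likelihood = {"spam": 0.02, "mail": 0.0001}   # P(x | y) for one message x

# P(x) = sum over y' of P(y') * P(x | y')
p_x = sum(prior[y] * likelihood[y] for y in prior)

posterior = {y: prior[y] * likelihood[y] / p_x for y in prior}
print(posterior)   # the two posteriors sum to 1
```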
Naive Bayes Models

Suppose X = (X1, X2, X3, ..., Xm).

Let P(x | y) = Πi=1..m P(xi | y).
Naive Bayes: Graphical Model
[Graphical model: a single parent node Y with arrows to children X1, X2, X3, ..., Xm.]
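A minimal sketch of the factored likelihood with binary features; the conditional probabilities are invented for illustration:

```python
import math

# P(X_i = 1 | y) for three binary features; numbers are made up.
cond = {
    "spam": [0.8, 0.1, 0.7],
    "mail": [0.1, 0.6, 0.2],
}

def log_p_x_given_y(x, y):
    # log P(x | y) = sum_i log P(x_i | y); logs avoid underflow when m is large.
    return sum(math.log(cond[y][i] if xi == 1 else 1.0 - cond[y][i])
               for i, xi in enumerate(x))

x = (1, 0, 1)
print(log_p_x_given_y(x, "spam"), log_p_x_given_y(x, "mail"))
```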
Part II
Where do the model parameters come from?
Using Data
[Diagram repeated: Data → Model → Action, with "estimation; regression; learning; training" on the Data-to-Model step.]
Warning

This is a HUGE topic. We will barely scratch the surface.
Forms of Models

Recall that a model defines P(x | y) and P(y).

These can have a simple multinomial form, like P(mail) = 0.545, P(spam) = 0.455.

Or they can take on some other form, like a binomial, Gaussian, etc.
Example: Gaussian

Suppose y is {male, female}, and
one observed variable is H, height.

P(H | male) ~ N(μm, σm2)
P(H | female) ~ N(μf, σf2)

How to estimate μm, σm2, μf, σf2?

Maximum Likelihood

Pick the model that makes the data as likely as possible:

max P(data | model)
Maximum Likelihood (Gaussian)
Estimating the parameters μm, σm², μf, σf² can be seen as:

- fitting the data
- estimating an underlying statistic (point estimate)

μ̂m = ( Σi=1..n 1[yi = male] · hi ) / (# males)

σ̂m² = ( Σi=1..n 1[yi = male] · (hi − μ̂m)² ) / (# males − 1)
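A sketch of these estimators on made-up (height, label) pairs; note that the sample variance divides by (# males − 1), matching the formula above:

```python
import statistics

# Made-up (height, label) training pairs, for illustration only.
data = [(6.0, "male"), (5.8, "male"), (6.1, "male"),
        (5.4, "female"), (5.5, "female"), (5.3, "female")]

male_heights = [h for h, y in data if y == "male"]

mu_m = statistics.mean(male_heights)        # sum of male heights / # males
var_m = statistics.variance(male_heights)   # divides by (# males - 1)
print(mu_m, var_m)
```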
Using the model
[Plot: the fitted class-conditional densities p(H | male) and p(H | female), as functions of H.]
Using the model
[Plot: the posterior probabilities P(male | H) and P(female | H), as functions of H.]
Example: Regression

Suppose y is actual runtime, and x is input length.

Regression tries to predict some continuous variables from others.
Regression

Linear regression: assume a linear relationship and fit a line.

We can turn this into a model!
Linear Model

Given x, predict y.

y = β1·x + β0 + ε, where ε ~ N(0, σ²)

- β1·x + β0 is the true regression line.
- ε is the random deviation.
Principle of Least Squares

Minimize the sum of squared vertical deviations (each point's vertical distance from the line).

Unique, closed-form solution!
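A sketch of that closed-form solution; the data values are invented for illustration:

```python
# Closed-form least-squares fit of y = b1*x + b0; data are made up.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# b1 = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2)
b1 = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
      / sum((x - mean_x) ** 2 for x in xs))
b0 = mean_y - b1 * mean_x
print(b1, b0)  # about 1.96 and 0.14
```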
Other kinds of regression

- transform one or both variables (e.g., take a log)
- polynomial regression (least squares still reduces to a linear system; see the sketch below)
- multivariate regression
- logistic regression
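A sketch of that reduction, using numpy (a tool choice of mine, not named in the lecture): fitting a degree-2 polynomial is just ordinary least squares on the design matrix with columns 1, x, x².

```python
import numpy as np

# Made-up data roughly following y = x^2 + 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.0, 5.2, 10.1, 16.9])

X = np.vander(x, 3, increasing=True)      # columns: 1, x, x^2
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)                               # [b0, b1, b2], close to [1, 0, 1]
```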
Example: text categorization

Bag-of-words model:

- x is a histogram of counts for all words
- y is a topic

P(x | y) = Πw puni(w | y)^count(w; x)
MLE for Multinomials

"Count and Normalize":

p̂uni(w | y) = count(w; training) / count(*; training)
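A minimal sketch of count-and-normalize for one class y; the toy documents are invented for illustration:

```python
from collections import Counter

# All training documents labeled with class y (made up).
docs_for_y = ["buy cheap pills", "cheap pills now", "buy now"]

counts = Counter(w for doc in docs_for_y for w in doc.split())
total = sum(counts.values())               # count(*; training)

p_uni = {w: c / total for w, c in counts.items()}
print(p_uni)                               # probabilities sum to 1
```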
The Truth about MLE

You will never see all the words.

For many models, MLE isn't safe.

To understand why, consider a typical evaluation scenario.
Evaluation

- Train your model on some data.
- How good is the model?
- Test on different data that the system never saw before (a split sketch follows below).
- Why?
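A sketch of that discipline: hold out data the model never sees during training. The 80/20 split ratio here is a common convention I've assumed, not something the lecture prescribes.

```python
import random

examples = list(range(100))    # stand-ins for labeled examples
random.seed(0)
random.shuffle(examples)

cut = int(0.8 * len(examples))
train, test = examples[:cut], examples[cut:]
# Fit parameters on `train` only; measure accuracy on `test` only.
```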
Tradeoff
[Diagram: a tradeoff spectrum. At one extreme, the model overfits the training data and doesn't generalize; at the other, it has low variance but low accuracy.]
Text categorization again

Suppose 'v1@gra' never appeared in any document in training, ever.

P(x | y) = Πw puni(w | y)^count(w; x)

What is the above probability for a new document containing 'v1@gra' at test time? Zero: the MLE sets puni('v1@gra' | y) = 0, so the whole product vanishes.
Solutions

- Regularization: prefer less extreme parameters.
- Smoothing: "flatten out" the distribution (sketch below).
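One simple smoothing scheme is add-one (Laplace) smoothing; this is my choice of example, as the slide doesn't prescribe a particular method. The training words and vocabulary are invented:

```python
from collections import Counter

training_words = "buy cheap pills cheap".split()
vocab = {"buy", "cheap", "pills", "now", "v1@gra"}

counts = Counter(training_words)
total = sum(counts.values())

# Every word in the vocabulary, seen or unseen, gets nonzero probability.
p_smoothed = {w: (counts[w] + 1) / (total + len(vocab)) for w in vocab}
print(p_smoothed["v1@gra"])  # > 0, unlike the MLE
```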
Bayesian Estimation

Construct a prior over model parameters, then train to maximize P(data | model) × P(model).
One More Point

Building models is not the only way to be empirical.

- Neural networks, SVMs, instance-based learning

MLE and smoothed/Bayesian estimation are not the only ways to estimate.

- Minimize error, for example ("discriminative" estimation)
Assignment 3

- Spam detection
- We provide a few thousand examples
- Perform EDA and pick features
- Estimate probabilities
- Build a Naive Bayes classifier