Empirical Research Methods in
Computer Science
Lecture 6
November 16, 2005
Noah Smith
Getting Empirical about Software

Example: given a file, is it text or binary?
the file command
if the file matches /the/, then text; else binary
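A minimal sketch of this heuristic in Python (not from the lecture; sampling 4096 bytes and the ASCII decode are my own choices):

import re
import sys

def looks_like_text(path):
    # Crude heuristic in the spirit of the slide: if a sample of the file
    # decodes cleanly and contains "the", call it text; otherwise binary.
    with open(path, "rb") as f:
        data = f.read(4096)            # a small sample is enough for a guess
    try:
        sample = data.decode("ascii")
    except UnicodeDecodeError:
        return False                   # undecodable bytes -> binary
    return re.search(r"the", sample) is not None

if __name__ == "__main__":
    print("text" if looks_like_text(sys.argv[1]) else "binary")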
Getting Empirical about Software

Example: early spam filtering
regular expressions: /viagra/
email address
originating IP address
Other reasons

Spam in 2006 ≠ Spam in 2005

Code re-use


Two programs may work in
essentially the same way, but for
entirely different applications.
Empirical techniques work!
Using Data
Data → (estimation; regression; learning; training) → Model → (classification; decision) → Action
pattern classification, machine learning, statistical inference, ...
Probabilistic Models

Let X and Y be random variables.
(continuous, discrete, structured, ...)

Goal: predict Y from X.

A model defines P(Y = y | X = x).
1. Where do models come from?
2. If we have a model, how do we use it?
Using a Model

We want to classify a message, x, as spam or mail: y ∈ {spam, mail}.

x → Model → P(spam | x), P(mail | x)

ŷ = spam if P(spam | x) ≥ P(mail | x), mail otherwise
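As code, the decision rule is a single comparison of the two model outputs; a minimal sketch, assuming the model hands us P(spam | x):

def classify(p_spam_given_x):
    # Pick the more probable label; P(mail | x) is the complement of P(spam | x).
    return "spam" if p_spam_given_x >= 1.0 - p_spam_given_x else "mail"

# classify(0.99) -> "spam"; classify(0.45) -> "mail"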
Bayes Minimum-Error Decision Criterion

Decide yi if P(yi | x) > P(yj | x) for all j ≠ i.
(Pick the most likely y, given x.)
Example


X = [/viagra/], Y ∈ {spam, mail}
From data, estimate:
P(spam | X > 0) = 0.99
P(mail | X > 0) = 0.01
P(spam | X = 0) = 0.45
P(mail | X = 0) = 0.55
BDC: if X > 0 then spam, else mail.
Probability of error?
P(spam | X > 0) = 0.99
P(mail | X > 0) = 0.01
P(spam | X = 0) = 0.45
P(mail | X = 0) = 0.55
What is the probability of error, given
X > 0?
Given X = 0?
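For reference, the answers fall straight out of the estimates above (no new numbers here):

# Given X > 0 we decide spam, so we are wrong exactly when the message is mail:
p_error_given_match = 0.01       # = P(mail | X > 0)
# Given X = 0 we decide mail, so we are wrong exactly when the message is spam:
p_error_given_no_match = 0.45    # = P(spam | X = 0)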
Improving our view of the data

Why just use [m/viagra/]?
Cialis, diet, stock, bank, ...
Why just use {<50%, >50%}?

X could be a histogram of words!


Tradeoff:
simple features → limited descriptive power
complex features → data sparseness
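A word histogram is cheap to compute; a minimal sketch using Python's collections.Counter, with a deliberately naive tokenizer:

from collections import Counter
import re

def word_histogram(message):
    # Map a message to a bag-of-words count of its lowercased tokens.
    words = re.findall(r"[a-z']+", message.lower())
    return Counter(words)

# word_histogram("Buy viagra now, buy now!") -> Counter({'buy': 2, 'now': 2, 'viagra': 1})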
Problem


Need to estimate
P(spam | x)
for each x!
There are lots of word histograms!



length ∈ {1, 2, 3, ...}
|Vocabulary| = huge!
number of documents: Σ_{length ≥ 1} |V|^length
“Data Sparseness”



You will never see every x.
So you can’t estimate distributions
that condition on each x.
Not just in text: anything dealing
with continuous variables or just
darn big sets.
Other simple examples


Classify fish into
{salmon, sea bass}
by X = (Weight, Length)
Classify people into
{undergrad, grad, professor}
by X = (Age, Hair-length, Gender)
Magic Trick



Often, P(y | x) is hard, but P(x | y)
and P(y) are easier to get, and
more natural.
P(y): prior (how much mail is spam?)
P(x | y): likelihood


P(x | spam) models what spam looks like
P(x | mail) models what mail looks like
Bayes’ Rule
P(y | x) = P(x | y) · P(y) / P(x)

prior: P(y)
likelihood: P(x | y), one distribution over complex observations per y
P(y | x): what we said the model must define
P(x) normalizes into a distribution: P(x) = Σ_y' P(y') · P(x | y')
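A minimal sketch of the same rule as code, assuming we can evaluate the prior P(y) and the likelihood P(x | y) for every label:

def posterior(x, labels, prior, likelihood):
    # P(y | x) for every y, computed from P(y) and P(x | y) via Bayes' rule.
    # prior[y] = P(y); likelihood(x, y) = P(x | y).
    joint = {y: prior[y] * likelihood(x, y) for y in labels}
    p_x = sum(joint.values())            # P(x) = sum over y' of P(y') * P(x | y')
    return {y: joint[y] / p_x for y in labels}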
Example

P(spam) = 0.455, P(mail) = 0.545
X                                   P(x | spam)   P(x | mail)
known sender, >50% dict. words          .00           .70
known sender, <50% dict. words          .01           .06
unknown sender, >50% dict. words        .19           .24
unknown sender, <50% dict. words        .80           .00
Resulting Classifier

X                 P(x | spam)   P(x | mail)   P(spam, x)    P(mail, x)    decision
                                              (times .455)  (times .545)
known, >50%           .00           .70           .00           .38         mail
known, <50%           .01           .06           .005          .03         mail
unknown, >50%         .24           .24           .11           .13         mail
unknown, <50%         .75           .00           .34           .00         spam
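The table follows mechanically from the prior and the likelihood columns; a minimal sketch that reproduces it (the numbers are copied from the table, nothing new):

prior = {"spam": 0.455, "mail": 0.545}

likelihood = {                                  # P(x | y) for the four feature combinations
    ("known sender",   ">50% dict."): {"spam": 0.00, "mail": 0.70},
    ("known sender",   "<50% dict."): {"spam": 0.01, "mail": 0.06},
    ("unknown sender", ">50% dict."): {"spam": 0.24, "mail": 0.24},
    ("unknown sender", "<50% dict."): {"spam": 0.75, "mail": 0.00},
}

for x, p_x_given_y in likelihood.items():
    joint = {y: prior[y] * p_x_given_y[y] for y in prior}    # P(y, x) = P(y) * P(x | y)
    decision = max(joint, key=joint.get)                     # Bayes minimum-error decision
    print(x, {y: round(p, 3) for y, p in joint.items()}, decision)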
Possible improvement


P(spam) = 0.455, P(mail) = 0.545
Let X = (S, N, D).
S ∈ {known sender, unknown sender},
N = length in words,
D = # dictionary words
P(s, n, d | y) = P(s | y) × P(n | y) × P(d | n, y)
Modeling N and D
p(n | y) = κ(y)^n · (1 − κ(y))
geometric, with parameter κ(y)

p(d | n, y) = (n choose d) · δ(y)^d · (1 − δ(y))^(n − d)
binomial, with parameter δ(y)
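A minimal sketch of the two distributions in Python; math.comb supplies the binomial coefficient, and the parameter names mirror the slide:

from math import comb

def p_length(n, kappa):
    # Geometric: p(n | y) = kappa(y)^n * (1 - kappa(y)), for n = 0, 1, 2, ...
    return kappa ** n * (1.0 - kappa)

def p_dict_words(d, n, delta):
    # Binomial: p(d | n, y) = C(n, d) * delta(y)^d * (1 - delta(y))^(n - d)
    return comb(n, d) * delta ** d * (1.0 - delta) ** (n - d)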
Resulting Classifier

X = (S, N, D)   P(x | spam)   P(x | mail)   P(spam, x)    P(mail, x)    decision
                                            (times .455)  (times .545)
known, 1, 0
known, 1, 1
known, 2, 0
...
Old model vs. New model

                                  Old model   New model
How many different x?                 4           ∞
Degrees of freedom in P(y):           2           2
Degrees of freedom in P(x | y):       6           4

Which is better?
Old model vs. New model

The first model had a Boolean
variable:
“Are > 50% of the words in the dictionary?”

The second model made an
independence assumption about S and
(D, N).
Graphical Models

Old model: Y → (S, rnd(D/N)); the prior predicts Y, and Y predicts X via P(x | y).
New model: Y → S via P(s | y); Y → N via the geometric; Y and N → D via the binomial.
Generative Story

First, pick y: spam or mail?
Use prior, P(Y).

Given that it’s spam, decide whether the
sender is known.
Use P(S | spam).

Given that it’s spam, pick the length.
Use geometric.

Given spam and n, decide how many of
the words are from the dictionary.
Use binomial.
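The generative story translates directly into a sampler; a minimal sketch in Python, where the parameter values passed in are hypothetical placeholders rather than lecture numbers:

import random

def sample_message(p_spam, p_known, kappa, delta):
    # Follow the generative story one step at a time.
    y = "spam" if random.random() < p_spam else "mail"           # pick y from the prior
    s = "known" if random.random() < p_known[y] else "unknown"   # sender, from P(S | y)
    n = 0                                                        # length, from the geometric
    while random.random() < kappa[y]:
        n += 1
    d = sum(random.random() < delta[y] for _ in range(n))        # dictionary words, binomial(n, delta(y))
    return y, s, n, d

# Hypothetical parameters, for illustration only:
# sample_message(0.455, {"spam": 0.1, "mail": 0.8},
#                {"spam": 0.9, "mail": 0.95}, {"spam": 0.3, "mail": 0.7})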
Naive Bayes Models

Suppose X = (X1, X2, X3, ..., Xm).

Let P(x | y) = ∏_{i = 1..m} P(xi | y).
Naive Bayes: Graphical Model

Y is the parent of every feature: Y → X1, X2, X3, ..., Xm.
Noisy Channel Models



Y is produced by a source.
Y is corrupted as it goes through a
channel; it turns into X.
Example: speech recognition
P(y) is the source model
P(x | y) is the channel model
source → Y → channel → X
Loss Functions

Some errors are more costly than
others.





cost(spam | spam) = $0
cost(mail | mail) = $0
cost(mail | spam) = $1
cost(spam | mail) = $100
What to do?
Risk

Conditional risk: R(y | x) = Σ_{y'} cost(y | y') · P(y' | x)


Minimize expected loss by picking the y
to minimize R.
Minimizing error is a special case where
cost(y | y) = $0 and cost(y | y’) = $1.
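A minimal sketch of the conditional risk and the risk-minimizing decision, reusing the cost matrix from the loss-function slide:

cost = {                       # cost(decision | true label), from the previous slide
    ("spam", "spam"): 0.0, ("mail", "mail"): 0.0,
    ("mail", "spam"): 1.0, ("spam", "mail"): 100.0,
}

def conditional_risk(decision, posterior):
    # R(decision | x) = sum over y' of cost(decision | y') * P(y' | x)
    return sum(cost[(decision, y_true)] * p for y_true, p in posterior.items())

def min_risk_decision(posterior):
    return min(("spam", "mail"), key=lambda d: conditional_risk(d, posterior))

# With P(spam | x) = .46 and P(mail | x) = .54 (the "unknown, >50%" row below):
# conditional_risk("spam", {"spam": 0.46, "mail": 0.54}) -> 54.0
# min_risk_decision({"spam": 0.46, "mail": 0.54}) -> "mail"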
Risk

X                P(x | spam)   P(x | mail)   P(spam | x)   P(mail | x)   R(spam | x)   R(mail | x)
known, >50%          .00           .70           .00          1.00          $100           $0
known, <50%          .01           .06           .02           .98           $98          $.02
unknown, >50%        .24           .24           .46           .54           $54          $.46
unknown, <50%        .75           .00          1.00           .00            $0           $1
Determinism and Randomness

If we build a classifier from a
model, and use a Bayes decision
rule to make decisions, is the
algorithm randomized, or is it
deterministic?