Probabilistic Classifiers
Irina Rish
IBM T.J. Watson Research Center
rish@us.ibm.com, http://www.research.ibm.com/people/r/rish/
February 20, 2002
Probabilistic Classification
Bayesian Decision Theory
Bayes decision rule (revisited):
Bayes risk, 0/1 loss, optimal classifier, discriminability
Probabilistic classifiers and their decision surfaces:
Continuous features (Gaussian distribution)
Discrete (binary) features (+ class-conditional independence)
Parameter estimation
“Classical” statistics: maximum-likelihood (ML)
Bayesian statistics: maximum a posteriori (MAP)
Commonly used classifier: naïve Bayes
VERY simple: class-conditional feature independence
VERY efficient (empirically); why and when? – still an open question
Bayesian decision theory
Make a decision that minimizes the overall expected cost (loss)
Advantage: theoretically guaranteed optimal decisions
Drawback: probability distributions are assumed to be known (in practice,
estimating those distributions from data can be a hard problem)
Classification problem as an example of a decision problem
Given observed properties (features) of an object, find its class. Examples:
Sorting fish by type (sea bass or salmon) given observed features such
as lightness and length
Video character recognition
Face recognition
Document classification using word counts
Guessing a user's intentions (potential buyer or not) from their web transactions
Intrusion detection
Notation and definitions
• C - a state of nature (class): a random variable with distribution P(C)
  - Ω = {ω_1, ..., ω_n} is the set of possible states of nature (class labels)
• X = (X_1, ..., X_d) - feature vector in feature space S
  - Continuous X_i: S = ℝ^d and X has probability density p(X | C)
  - Discrete X_i: S = D^d, D = {1, ..., k}, and X has probability P(X | C)
• α(x): S → A - decision rule
  - A = {α_1, ..., α_a} is a set of actions (decisions)
  - Example: A = {ω_1, ..., ω_n} and α(x): S → Ω is a classifier
• λ_ij = λ(α_i | ω_j) - loss function (cost of decision α_i given state ω_j)
  - Example: 0/1 loss (λ_ij = 1 if i ≠ j and λ_ij = 0 if i = j)
• R(α_i | x) = Σ_{j=1}^n λ(α_i | ω_j) P(ω_j | x) - conditional risk of action α_i given x
• R = ∫ R(α(x) | x) p(x) dx - risk (total expected loss)
Gaussian (normal) density N(µ, σ²)

p(x) = (1 / (√(2π) σ)) exp( -(1/2) ((x - µ)/σ)² )

µ = E[x] = ∫_{-∞}^{+∞} x p(x) dx - expected value (mean)
σ² = E[(x - µ)²] - variance, σ - standard deviation
r = |x - µ| / σ - Mahalanobis distance from x to µ

Interesting property (homework problem :) N(µ, σ²) has
maximum entropy H(p(x)) among all p(x) with a given mean and variance!
H(p(x)) = -∫ p(x) ln p(x) dx (in nats)
When learning from data, max-entropy distributions are most reasonable
since they impose 'no additional structure' besides what is given as constraints
Bayes rule for binary classification
• Given only the priors P(C = ω_i) = P(ω_i),
  choose g*(x) = ω_1 if P(ω_1) > P(ω_2), g*(x) = ω_2 otherwise
• Given evidence x, update P(ω_i) using Bayes formula:
  P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),
  where p(x) = Σ_{i=1}^n p(x | ω_i) P(ω_i) - evidence probability,
  p(x | ω_i) - likelihood, P(ω_i) - prior
Bayes decision rule:
  choose g*(x) = ω_1 if P(ω_1 | x) > P(ω_2 | x),
  g*(x) = ω_2 otherwise
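The sketch below illustrates this rule in Python. It assumes (hypothetical) univariate Gaussian class-conditional densities and made-up priors; these numbers are not from the slides, only the posterior computation and the decision follow the formula above.

```python
# Minimal sketch of the binary Bayes rule above, assuming hypothetical
# Gaussian class-conditional densities and known priors.
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def bayes_classify(x, priors, likelihoods):
    """Return the class with the largest posterior P(w_i | x).

    priors: dict class -> P(w_i); likelihoods: dict class -> callable p(x | w_i)
    """
    evidence = sum(likelihoods[c](x) * priors[c] for c in priors)  # p(x)
    posteriors = {c: likelihoods[c](x) * priors[c] / evidence for c in priors}
    return max(posteriors, key=posteriors.get), posteriors

# Example with made-up parameters: w1 ~ N(0, 1), w2 ~ N(2, 1), P(w1) = 2/3, P(w2) = 1/3
priors = {"w1": 2 / 3, "w2": 1 / 3}
likelihoods = {"w1": lambda x: gaussian_pdf(x, 0.0, 1.0),
               "w2": lambda x: gaussian_pdf(x, 2.0, 1.0)}
print(bayes_classify(1.2, priors, likelihoods))
```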
Example: one-dimensional case
Since p(x) does not depend on ω_i in P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x), we get

Bayes decision rule:
  g*(x) = ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2)

Priors: P(ω_1) = 2/3, P(ω_2) = 1/3
(Figure: class-conditional densities p(x | ω_i) and posteriors P(ω_i | x).)

Decision regions:
  R_1* = {x | P(ω_1 | x) > P(ω_2 | x)}
  R_2* = {x | P(ω_1 | x) ≤ P(ω_2 | x)}
Optimality of Bayes rule: idea
Bayes rule:  g*(x) = ω_1 if x < x*
Other rule:  g(x) = ω_1 if x < x_g
(Figure: thresholds x* and x_g on the x-axis.)
P_g(error | x) = P(g(x) ≠ ω | x) = P(g(x) = ω_1, ω_2 | x) + P(g(x) = ω_2, ω_1 | x)
P_g*(error | x) ≤ P_g(error | x) for any g(x): S → Ω
Proof - see next page…
Optimality of Bayes rule: proof
P_g*(error | x) = P(g*(x) ≠ ω | x) = P(g*(x) = ω_1, ω_2 | x) + P(g*(x) = ω_2, ω_1 | x) =
  = P(g*(x) = ω_1 | ω_2, x) P(ω_2 | x) + P(g*(x) = ω_2 | ω_1, x) P(ω_1 | x)
    (call the two conditional factors p_1 and p_2)
1) if P(ω_1 | x) > P(ω_2 | x), then g*(x) = ω_1 and p_1 = 1, p_2 = 0 ⇒ P_g*(error | x) = P(ω_2 | x)
2) if P(ω_1 | x) ≤ P(ω_2 | x), then g*(x) = ω_2 and p_1 = 0, p_2 = 1 ⇒ P_g*(error | x) = P(ω_1 | x)
Thus, P_g*(error | x) = min{P(ω_1 | x), P(ω_2 | x)}, and, for any g(x): S → Ω,
P_g*(error) = ∫_{-∞}^{+∞} P_g*(error | x) p(x) dx ≤ P_g(error)
General Bayesian Decision Theory
Given:
• a set of available actions (decisions) A = {α_1, ..., α_a}
• a loss function (cost of decision α_i given ω_j): λ_ij = λ(α_i | ω_j)
Find: a decision rule α(x): S → A minimizing the total expected loss (risk):
  R = ∫ R(α(x) | x) p(x) dx,
where R(α_i | x) = Σ_{j=1}^n λ(α_i | ω_j) P(ω_j | x) is the conditional risk
of action α_i given x, and P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x) (Bayes formula)
• Bayes decision rule: always minimize the conditional risk:
  α*(x) = arg min_{α_i} R(α_i | x) = arg min_{α_i} Σ_{j=1}^n λ(α_i | ω_j) P(ω_j | x)
• The Bayes decision rule yields minimum overall risk (called the Bayes risk):
  R* = min_{α(x)} ∫ R(α(x) | x) p(x) dx
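A minimal sketch of this rule follows, assuming the posteriors P(ω_j | x) are already available; the loss matrices and posterior values are illustrative, not from the slides.

```python
# Minimal sketch of the general Bayes decision rule: choose the action that
# minimizes the conditional risk R(a_i | x) = sum_j loss[i][j] * P(w_j | x).
def bayes_action(posteriors, loss):
    """posteriors: list of P(w_j | x); loss[i][j] = cost of action a_i given w_j."""
    risks = [sum(loss[i][j] * posteriors[j] for j in range(len(posteriors)))
             for i in range(len(loss))]
    best = min(range(len(risks)), key=lambda i: risks[i])
    return best, risks

# 0/1 loss reduces to picking the most probable class:
zero_one = [[0, 1], [1, 0]]
print(bayes_action([0.7, 0.3], zero_one))    # -> action 0

# Asymmetric costs can flip the decision even though P(w_1 | x) > P(w_2 | x):
asymmetric = [[0, 10], [1, 0]]               # mistaking w_2 for w_1 is 10x worse
print(bayes_action([0.7, 0.3], asymmetric))  # -> action 1
```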
Zero-one loss classification
Let α(x) = g(x) (action = classification), i.e. α_i = ω_i, i = 1, ..., n
Zero-one loss: λ_ij = λ(α_i | ω_j) = 1 if i ≠ j, 0 if i = j
Then conditional risk = classification error:
  R(α_i | x) = Σ_{j=1}^n λ(α_i | ω_j) P(ω_j | x) = Σ_{j ≠ i} P(ω_j | x) = 1 - P(ω_i | x) = P(error | x)
Bayes rule:
  g*(x) = ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i
Bayes rule achieves the minimum error rate R* = min_{α(x)} ∫ R(α(x) | x) p(x) dx
Different errors, different costs
Example from signal detection theory:
  ω_1 - no signal (background noise), ω_2 - signal present
• Hit: P(x > x* | x ∈ ω_2) (cost λ_22)
• Miss: P(x < x* | x ∈ ω_2) (cost λ_12)
• False alarm: P(x > x* | x ∈ ω_1) (cost λ_21)
• Correct rejection: P(x < x* | x ∈ ω_1) (cost λ_11)
Discriminability:
  d' = |µ_2 - µ_1| / σ
Receiver operating characteristic (ROC) curve: hit rate P(x > x* | x ∈ ω_2) vs. false-alarm rate P(x > x* | x ∈ ω_1)
Error costs determine the optimal decision (threshold x*)
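A small worked sketch of d' and one ROC operating point is given below, assuming equal-variance Gaussian classes with made-up means and an arbitrary threshold x*; the numbers are purely illustrative.

```python
# Sketch computing the discriminability d' and the ROC operating point for a
# threshold x*, assuming equal-variance Gaussian classes (parameters made up).
import math

def norm_sf(x, mu, sigma):
    """P(X > x) for X ~ N(mu, sigma^2)."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2)))

mu1, mu2, sigma = 0.0, 2.0, 1.0
d_prime = abs(mu2 - mu1) / sigma
x_star = 1.0
hit_rate = norm_sf(x_star, mu2, sigma)       # P(x > x* | w2)
false_alarm = norm_sf(x_star, mu1, sigma)    # P(x > x* | w1)
print(d_prime, hit_rate, false_alarm)
```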
Cost-based classification
Let α_i = ω_i, i = 1, 2, and λ_ij = λ(α_i | ω_j):
  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)
Bayes rule:
  g*(x) = ω_1 if R(α_1 | x) < R(α_2 | x), i.e.
  (λ_21 - λ_11) P(ω_1 | x) > (λ_12 - λ_22) P(ω_2 | x)
Assuming λ_21 > λ_11 and λ_12 > λ_22 (errors cost more than correct decisions):
  P(ω_1 | x) / P(ω_2 | x) > (λ_12 - λ_22) / (λ_21 - λ_11), or
Bayes rule:
  g*(x) = ω_1 if p(x | ω_1) / p(x | ω_2) > (λ_12 - λ_22) P(ω_2) / ((λ_21 - λ_11) P(ω_1)) = θ_λ

Example: one-dimensional case
  p(x | ω_1) / p(x | ω_2) > (λ_12 - λ_22) P(ω_2) / ((λ_21 - λ_11) P(ω_1)) = θ_λ
If misclassifying ω_2 as ω_1 becomes more expensive than the opposite error, i.e. λ_12 > λ_21, then the threshold increases
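The cost-adjusted likelihood-ratio test above can be sketched in a few lines; the likelihood values and loss entries below are illustrative placeholders, not values from the slides.

```python
# Minimal sketch of the cost-based likelihood-ratio rule above.
def likelihood_ratio_rule(px_w1, px_w2, p_w1, p_w2, l11, l12, l21, l22):
    """Choose w1 iff p(x|w1)/p(x|w2) exceeds the cost-adjusted threshold theta."""
    theta = (l12 - l22) * p_w2 / ((l21 - l11) * p_w1)
    return ("w1" if px_w1 / px_w2 > theta else "w2"), theta

# With 0/1 losses the threshold is just P(w2)/P(w1):
print(likelihood_ratio_rule(0.4, 0.2, 0.5, 0.5, l11=0, l12=1, l21=1, l22=0))
# Making "w2 misclassified as w1" costlier (l12 = 5) raises the threshold:
print(likelihood_ratio_rule(0.4, 0.2, 0.5, 0.5, l11=0, l12=5, l21=1, l22=0))
```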
Discriminant functions and classification
• Multi-category classification:
  - discriminant functions: f_i(x), i = 1, ..., n
    (e.g., f_i*(x) = -R(α_i | x) for the Bayes classifier)
  - classifier: g(x) = ω_i if f_i(x) > f_j(x) for all j ≠ i
• Two-category classification:
  - single discriminant function: f(x) ≡ f_1(x) - f_2(x)
  - classifier: g(x) = ω_1 if f(x) > 0, g(x) = ω_2 otherwise
(Figure: discriminant functions f_1(x), f_2(x), ..., f_n(x) feeding the decision α_i = g(x), with losses λ(α_i | ω_j).)
Bayes Discriminant Functions
Bayes discriminant functions for zero-one loss classification:
  f_i*(x) = -R(α_i | x) = P(ω_i | x) - 1
Every f_i(x) can be replaced by h(f_i(x)), where h(·) is monotonically increasing!
Multi-category:
  f_i*(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)
  f_i*(x) = p(x | ω_i) P(ω_i)
  f_i*(x) = log p(x | ω_i) + log P(ω_i)
Two-category:
  f*(x) ≡ f_1*(x) - f_2*(x)
  f*(x) = P(ω_1 | x) - P(ω_2 | x)
  f*(x) = log [p(x | ω_1) / p(x | ω_2)] + log [P(ω_1) / P(ω_2)]
Decision regions and surfaces:
(Figure: decision regions R_1, R_2 separated by decision surfaces.)
Discriminant functions: examples
Features → discriminant functions:
• Discrete features (binary, conditionally independent), P(x_i, x_j | C) = P(x_i | C) P(x_j | C):
  linear (hyperplanes)
• Continuous features, multivariate Gaussian:
  p(x | ω_i) = 1 / ((2π)^{d/2} |Σ_i|^{1/2}) exp( -(1/2) (x - µ_i)^T Σ_i^{-1} (x - µ_i) )
  - Same covariance matrix, Σ_i = Σ for all classes ω_i: linear (hyperplanes)
  - General case, arbitrary Σ_i: hyperquadrics (hyperellipsoids, hyperparaboloids, hyperhyperboloids)
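As a sketch of the Gaussian case, the log-discriminant f_i(x) = log p(x | ω_i) + log P(ω_i) can be evaluated directly; the means, covariances, and priors below are illustrative placeholders (NumPy is assumed to be available).

```python
# Sketch of the Gaussian discriminant f_i(x) = log p(x|w_i) + log P(w_i) for
# arbitrary covariances (hyperquadric case); with a shared covariance the
# quadratic terms cancel and the decision boundaries become linear.
import numpy as np

def gaussian_discriminant(x, mu, cov, prior):
    """log N(x; mu, cov) + log prior, up to the constant -(d/2) * log(2*pi)."""
    diff = x - mu
    return (-0.5 * diff @ np.linalg.inv(cov) @ diff
            - 0.5 * np.log(np.linalg.det(cov))
            + np.log(prior))

# Illustrative parameters (not from the slides):
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
covs = [np.eye(2), np.eye(2)]          # same covariance -> linear boundary
priors = [0.5, 0.5]

x = np.array([1.2, 0.8])
scores = [gaussian_discriminant(x, m, c, p) for m, c, p in zip(mus, covs, priors)]
print("predicted class:", int(np.argmax(scores)))
```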
Conditionally independent binary features → linear classifier ('naïve Bayes' - see later)
  x = (x_1, ..., x_n), x_i ∈ {0, 1} - features, c ∈ {ω_1, ω_2} - class
Let p_i = P(x_i = 1 | ω_1), q_i = P(x_i = 1 | ω_2). Then
  P(x | ω_1) = ∏_{i=1}^n p_i^{x_i} (1 - p_i)^{1 - x_i},   P(x | ω_2) = ∏_{i=1}^n q_i^{x_i} (1 - q_i)^{1 - x_i}
Discriminant function f(x) (decision rule: if f(x) > 0, choose ω_1):
  f(x) = log [P(ω_1 | x) / P(ω_2 | x)] = log [P(x | ω_1) P(ω_1) / (P(x | ω_2) P(ω_2))]
       = log [ ∏_{i=1}^n (p_i / q_i)^{x_i} ((1 - p_i) / (1 - q_i))^{1 - x_i} · (P(ω_1) / P(ω_2)) ]
       = Σ_{i=1}^n [ x_i log (p_i / q_i) + (1 - x_i) log ((1 - p_i) / (1 - q_i)) ] + log (P(ω_1) / P(ω_2))

  f(x) = Σ_{i=1}^n w_i x_i + w_0,
  where w_i = log [ p_i (1 - q_i) / (q_i (1 - p_i)) ],
        w_0 = Σ_{i=1}^n log [(1 - p_i) / (1 - q_i)] + log [P(ω_1) / P(ω_2)]
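The linear form derived above is easy to sketch in code; the p_i, q_i, and priors below are made-up placeholders, only the weight formulas follow the derivation.

```python
# Sketch of the linear discriminant for conditionally independent binary
# features: f(x) = sum_i w_i * x_i + w_0.
import math

def nb_binary_weights(p, q, prior1, prior2):
    """Weights of the naive-Bayes linear discriminant for binary features."""
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) \
         + math.log(prior1 / prior2)
    return w, w0

def nb_binary_classify(x, w, w0):
    f = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return ("w1" if f > 0 else "w2"), f

p = [0.8, 0.6, 0.3]      # p_i = P(x_i = 1 | w1), illustrative
q = [0.2, 0.5, 0.4]      # q_i = P(x_i = 1 | w2), illustrative
w, w0 = nb_binary_weights(p, q, prior1=0.5, prior2=0.5)
print(nb_binary_classify([1, 1, 0], w, w0))
```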
Case 1: independent Gaussian features with the same variance for all classes, Σ_i = σ²I. Note: linear separating surfaces!
Case 2: generalization to dependent features having the same covariance for all classes, Σ_i = Σ
Case 3: unequal covariance matrices
  - One dimension: multiply-connected decision regions
  - Many dimensions: hyperquadric surfaces
Summary
Bayesian decision theory:
In theory: tells you how to make optimal (minimum-risk) decisions
In practice: where do you get those probabilities from?
Expert knowledge + learning from data (see next; also, Chapter 3)
Note: some typos in Chapter 2
  - Page 25, second line after equation 12: must be R(α_i(x) | x), not R(α_i(x))
  - Page 27, third line before section 2.3.1: switch ω_1 and ω_2, and reverse the inequality in λ_21 > λ_12
  - Pages 50 and 51, Fig. 2.20 and Fig. 2.21: replace the x-axis label by P(x > x* | x ∈ ω_1); same replacement in problem 9, page 75
  - Section 2.11 (Bayesian belief networks): contains several mistakes; ignore for now. Bayesian networks will be covered later.
Parameter Estimation
• In general, given training data D = {y_1, ..., y_N}, where y_j = (x_j, ω_j),
  we wish to find (estimate) P(C = ω_i) and P(x | ω_i) - e.g., a density estimation problem
• Usually, estimating P(x | ω_i) is hard, especially in high-dimensional feature spaces
• Solution: simplifying assumptions (e.g., a parametric form of P(x | ω_i), feature independence, etc.)
• We first consider a fixed parametric distribution (e.g., Gaussian, multinomial, etc.)
• Then learning = parameter estimation from data
• Example: assume p(x | ω_i) is Gaussian N(µ_i, σ_i); estimate µ_i, σ_i
• Two major approaches: classical statistical (ML) and Bayesian (MAP)
• Philosophical difference: is a parameter a 'physical' constant or a random variable?
Maximum likelihood (ML) and
maximum a posteriori (MAP) estimates
• Assume independent and identically distributed (i.i.d.) samples
  D = {y_1, ..., y_N}, where y_j = (x_j, ω_j)
• Assume a parametric distribution p(x | ω_i, Θ), where Θ is a parameter vector
  Example: Θ = (µ_i, σ_i) for Gaussian p(x | ω_i, Θ)
• We also assume that the Θ for different classes are independent
  (so they can be estimated separately, in the same way)
• Then P(D | Θ) = ∏_{j=1}^N p(x_j | Θ)
• Maximum-likelihood estimate (Θ is an unknown constant):
  Θ̂_ML = arg max_Θ l(Θ) = arg max_Θ P(D | Θ)
• Maximum a posteriori estimate (Θ is an unknown random variable with prior P(Θ)):
  Θ̂_MAP = arg max_Θ P(Θ | D) = arg max_Θ P(D | Θ) P(Θ)
• Note that ML = MAP with a uniform prior
ML estimate: Gaussian distribution
• known σ, estimate µ:
  µ̂ = (1/N) Σ_{j=1}^N x_j
• both µ and σ unknown:
  µ̂ = (1/N) Σ_{j=1}^N x_j,   σ̂² = (1/N) Σ_{j=1}^N (x_j - µ̂)²
• Note that σ̂² is biased, i.e.
  E[ (1/N) Σ_{j=1}^N (x_j - µ̂)² ] = ((N-1)/N) σ² ≠ σ²
• An unbiased estimate would be σ̂² = (1/(N-1)) Σ_{j=1}^N (x_j - µ̂)²
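A minimal sketch of these estimates, contrasting the biased (1/N) and unbiased (1/(N-1)) variance formulas on illustrative data:

```python
# Sketch of the ML estimates above, on made-up sample data.
def ml_gaussian(xs):
    n = len(xs)
    mu = sum(xs) / n
    var_ml = sum((x - mu) ** 2 for x in xs) / n            # biased ML estimate
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)
    return mu, var_ml, var_unbiased

print(ml_gaussian([2.1, 1.9, 2.4, 1.6, 2.0]))
```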
Bayesian (MAP) estimate with
increasing sample size
Parameter estimation: discrete features
(Model: class node C with a discrete feature X; multinomial P(x | C) with parameters θ_ki = P(x = k | C = ω_i).)
ML estimate: Θ̂ = arg max_Θ log P(D | Θ)
  ML(θ_ki) = N_{x=k, c=i} / Σ_k N_{x=k, c=i}   (counts)
MAP estimate: Θ̂ = arg max_Θ log [P(D | Θ) P(Θ)]
  Conjugate prior - Dirichlet: Dir(θ_{pa_X} | α_{1, pa_X}, ..., α_{m, pa_X})
  MAP(θ_{x, pa_X}) = (N_{x, pa_X} + α_{x, pa_X}) / (Σ_x N_{x, pa_X} + Σ_x α_{x, pa_X})
  Equivalent sample size (prior knowledge)
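The sketch below contrasts the ML and MAP (Dirichlet-smoothed) multinomial estimates; the counts and pseudo-counts are illustrative placeholders.

```python
# Sketch of the ML and MAP (Dirichlet-smoothed) estimates of P(x = k | C = i).
def ml_multinomial(counts):
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def map_multinomial(counts, alphas):
    """Dirichlet prior: alphas[k] acts as an 'equivalent sample size' for value k."""
    total = sum(counts.values()) + sum(alphas.values())
    return {k: (counts[k] + alphas[k]) / total for k in counts}

counts = {"a": 7, "b": 3, "c": 0}      # N_{x=k, c=i} for one class (made up)
alphas = {"a": 1, "b": 1, "c": 1}      # uniform pseudo-counts (Laplace smoothing)
print(ml_multinomial(counts))          # value 'c' gets probability 0
print(map_multinomial(counts, alphas)) # value 'c' gets a small nonzero probability
```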
An example: naïve Bayes classifier
Simplifying ("naïve") assumption: feature independence given the class:
  P(x_1, ..., x_n | C) = ∏_{j=1}^n P(x_j | C)
(Figure: class node C with prior P(C) and feature nodes x_1, x_2, ..., x_n with conditional distributions P(x_1 | C), P(x_2 | C), ..., P(x_n | C).)
1. Bayes(-optimal) classifier:
   given an (unlabeled) instance x = (x_1, ..., x_n), choose the most likely class:
   BO(x) = arg max_i P(C = i | x)
2. Naïve Bayes classifier:
   By Bayes rule, P(C = i | x) = P(x | C = i) P(C = i) / P(x), and by the independence assumption
   NB(x) = arg max_i P(C = i) ∏_{j=1}^n P(x_j | C = i)
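The following is a minimal naïve Bayes sketch for discrete features, estimating P(C) and P(x_j | C) from Laplace-smoothed counts; the toy data set is made up for illustration.

```python
# Minimal naive Bayes sketch: NB(x) = argmax_i P(C=i) * prod_j P(x_j | C=i).
from collections import Counter, defaultdict

def train_nb(data, labels, alpha=1.0):
    class_counts = Counter(labels)
    n = len(labels)
    priors = {c: class_counts[c] / n for c in class_counts}
    # counts[(j, c)][v] = number of training examples of class c with x_j = v
    counts = defaultdict(Counter)
    values = defaultdict(set)
    for x, c in zip(data, labels):
        for j, v in enumerate(x):
            counts[(j, c)][v] += 1
            values[j].add(v)

    def cond_prob(j, v, c):  # Laplace-smoothed estimate of P(x_j = v | C = c)
        return (counts[(j, c)][v] + alpha) / (class_counts[c] + alpha * len(values[j]))

    def predict(x):
        def score(c):
            s = priors[c]
            for j, v in enumerate(x):
                s *= cond_prob(j, v, c)
            return s
        return max(priors, key=score)

    return predict

predict = train_nb(data=[(1, 0), (1, 1), (0, 1), (0, 0)],
                   labels=["w1", "w1", "w2", "w2"])
print(predict((1, 1)))
```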
State-of-the-art
Optimality results
Linear decision surface for binary features (Minsky 61, Duda&Hart 73)
(polynomial for general nominal features - Duda&Hart 1973, Peot 96)
Optimality for OR and AND concepts (Domingos&Pazzani 97)
No XOR-containing concepts on nominal features (Zhang&Ling 01)
Algorithmic improvements
Boosted NB (Elkan 97) is equivalent to a multilayer perceptron
Augmented NB (TAN, Bayes nets - e.g., Friedman et al 97)
Other improvements: combining with decision trees (Kohavi), with error-correcting
output coding (ECOC) (Ghani, ICML 2000), etc.
Still, open problems remain:
NB error estimates/bounds based on domain properties
Why does naïve Bayes often work well
(despite the independence assumption)?
Wrong P(C|x) estimates do not imply wrong classification!
Domingos&Pazzani, 97, J. Friedman 97, etc.
"Statistical diagnosis based on conditional independence does not require it", J. Hilden 84
Bayes-optimal classifier: class_opt = arg max_i P(class_i | f), using the true P(class | f)
Naïve Bayes classifier: class_NB = arg max_i P̂(class_i | f), using the NB estimate P̂(class | f)
Major questions remain:
  Which P(c, x) are 'good' for NB?
  What domain properties "predict" NB accuracy?
General question:
  characterizing distributions P(X, C) and their approximations Q(X, C)
  that can be 'far' from P(X, C), but yield low classification error
(Figure: true vs. approximate curves P(x | C) P(C) for Class 1 and Class 2, with the optimal decision boundary.)
Note: one measure of 'distance' between distributions is relative entropy,
or KL-divergence (see hw problem 11, chap. 3):
  D(P || Q) = ∫ P(z) log [P(z) / Q(z)] dz
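A quick sketch of the discrete version of this divergence, with made-up posteriors, illustrates the point of this slide: an estimate far from the truth in KL can still produce the same argmax decision.

```python
# Sketch of the discrete KL-divergence D(P || Q) = sum_z P(z) * log(P(z)/Q(z)).
import math

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p_true = [0.9, 0.1]     # true P(class | f) (illustrative)
q_nb   = [0.6, 0.4]     # a poor estimate with the same argmax
print(kl_divergence(p_true, q_nb))   # > 0, yet both pick class 0
```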
Case study: using Naïve Bayes for
the Transaction Recognition Problem
(Figure: a client workstation connects over a session (connection) to a server (Web, DB, Lotus Notes); end-user transactions (EUTs) such as OpenDB, Search, and SendMail each generate a sequence of remote procedure calls (RPCs).)
Two problems: segmentation and labeling
(Figure: an unsegmented stream of RPCs is segmented into groups, and each group is labeled with a transaction type, e.g. Tx1, Tx2, Tx3, Tx2.)
Representing transactions as feature vectors
Example: a transaction of type i produces the RPC sequence R2, R5, R3, R2, R1, R2, R5.
• RPC occurrences: f = (1, 1, 1, 0, 1, 0, ...)
  Bernoulli: P(R_j = 1 | T_i) = p_ij
• RPC counts: f = (1, 3, 1, 0, 2, 0, ...)
  Multinomial: P(n_i1, ..., n_iM | T_i) = (n! / ∏_{j=1}^M n_ij!) ∏_{j=1}^M p_ij^{n_ij}
  Geometric: P(n_ij | T_i) = p_ij^{n_ij} (1 - p_ij)
  Shifted geometric: P(n_ij | T_i) = p_ij^{n_ij - s_ij} (1 - p_ij)
• Best fit to the data (χ²): shifted geometric
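The per-RPC count models listed above can be sketched as follows; the parameter values are illustrative, not the fitted values from the case study.

```python
# Sketch of the geometric and shifted-geometric count models from the slide.
def geometric_pmf(n, p):
    """P(n) = p^n * (1 - p), n = 0, 1, 2, ..."""
    return (p ** n) * (1 - p)

def shifted_geometric_pmf(n, p, shift):
    """P(n) = p^(n - shift) * (1 - p) for n >= shift, else 0."""
    return (p ** (n - shift)) * (1 - p) if n >= shift else 0.0

print(geometric_pmf(3, p=0.4))
print(shifted_geometric_pmf(3, p=0.4, shift=2))
```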
Empirical results
(Figure: classification accuracy vs. training set size.)
• NB + Bernoulli, multinomial, or geometric: 79 ± 3%
• NB + shifted geometric: 87 ± 2%
• Baseline classifier (always selects the most frequent transaction): ≈ 10 ± 1%
Significant improvement over the baseline classifier (75%)
NB is simple, efficient, and comparable to state-of-the-art classifiers:
SVM - 85-87%, Decision Tree - 90-92%
Best-fit distribution (shifted geometric) - not necessarily the best classifier! (?)
Next lecture on Bayesian topics
April 17, 2002 - lecture on recent ‘hot stuff’:
Bayesian networks, HMMs, EM algorithm
Short Preview:
From Naïve Bayes to Bayesian Networks
• Naïve Bayes model: independent features given the class
  (Figure: a class node with conditional distributions P(f_1 | class), P(f_2 | class), ..., P(f_n | class) for features f_1, f_2, ..., f_n.)
• Bayesian network (BN) model: can represent any joint probability distribution
  Example (nodes Smoking, lung Cancer, Bronchitis, X-ray, Dyspnoea):
  P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
CPD (conditional probability distribution) for Dyspnoea, P(D | C, B):
  C  B   D=0  D=1
  0  0   0.1  0.9
  0  1   0.7  0.3
  1  0   0.8  0.2
  1  1   0.9  0.1
Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
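A query like this can be answered by enumeration over the factorization above. The sketch below does so; note that only the Dyspnoea CPD is given on the slide, so the P(S), P(C|S), and P(B|S) numbers in the code are hypothetical placeholders, and X is unobserved so it sums out.

```python
# Sketch of answering the query by enumeration over
# P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B).
P_S = {0: 0.7, 1: 0.3}                                        # hypothetical P(S)
P_C_given_S = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.9, 1: 0.1}}    # hypothetical P(C|S)
P_B_given_S = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.4, 1: 0.6}}      # hypothetical P(B|S)
P_D_given_CB = {(0, 0): {0: 0.1, 1: 0.9}, (0, 1): {0: 0.7, 1: 0.3},  # CPD from the slide
                (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.9, 1: 0.1}}

def query_cancer(s_obs, d_obs):
    """P(C | S=s_obs, D=d_obs); X is unobserved and sums out to 1."""
    joint = {}
    for c in (0, 1):
        joint[c] = sum(P_S[s_obs] * P_C_given_S[s_obs][c] *
                       P_B_given_S[s_obs][b] * P_D_given_CB[(c, b)][d_obs]
                       for b in (0, 1))
    z = joint[0] + joint[1]
    return {c: joint[c] / z for c in joint}

print(query_cancer(s_obs=0, d_obs=1))  # P(lung cancer | smoking=no, dyspnoea=yes)
```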