Probabilistic Classifiers
Irina Rish
IBM T.J. Watson Research Center
rish@us.ibm.com, http://www.research.ibm.com/people/r/rish/
February 20, 2002

Probabilistic Classification (outline)
• Bayesian decision theory
  - Bayes decision rule (revisited): Bayes risk, 0/1 loss, optimal classifier, discriminability
  - Probabilistic classifiers and their decision surfaces:
    - continuous features (Gaussian distribution)
    - discrete (binary) features (+ class-conditional independence)
• Parameter estimation
  - "classical" statistics: maximum likelihood (ML)
  - Bayesian statistics: maximum a posteriori (MAP)
• Commonly used classifier: naïve Bayes
  - VERY simple: class-conditional feature independence
  - VERY efficient (empirically); why and when? – still an open question

Bayesian decision theory
• Make a decision that minimizes the overall expected cost (loss).
• Advantage: theoretically guaranteed optimal decisions.
• Drawback: the probability distributions are assumed to be known (in practice, estimating those distributions from data can be a hard problem).

Classification problem as an example of a decision problem
• Given observed properties (features) of an object, find its class.
• Examples:
  - sorting fish by type (sea bass or salmon) given observed features such as lightness and length
  - video character recognition
  - face recognition
  - document classification using word counts
  - guessing a user's intentions (potential buyer or not) from his web transactions
  - intrusion detection

Notation and definitions
• C – a state of nature (class): a random variable with distribution P(C); Ω = {ω_1, ..., ω_n} is the set of possible states of nature (class labels).
• X = (X_1, ..., X_d) – feature vector in feature space S
  - continuous X_i: S = ℝ^d and X has probability density p(X | C)
  - discrete X_i: S = D^d, D = {1, ..., k}, and X has probability P(X | C)
• α(x): S → A – decision rule, where A = {α_1, ..., α_a} is a set of actions (decisions)
  - Example: A = {ω_1, ..., ω_n} and α(x): S → Ω is a classifier.
• λ_ij = λ(α_i | ω_j) – loss function (cost of decision α_i given state ω_j)
  - Example: 0/1 loss (λ_ij = 1 if i ≠ j and λ_ij = 0 if i = j)
• R(α_i | x) = ∑_{j=1}^{n} λ(α_i | ω_j) P(ω_j | x) – conditional risk of action α_i given x
• R = ∫ R(α(x) | x) p(x) dx – risk (total expected loss)

Gaussian (normal) density N(µ, σ²)
• p(x) = (1 / (√(2π) σ)) exp( −(1/2) ((x − µ)/σ)² )
• µ = E[x] = ∫_{−∞}^{∞} x p(x) dx – expected value (mean)
• σ² = E[(x − µ)²] – variance, σ – standard deviation
• r = |x − µ| / σ – Mahalanobis distance from x to µ
• Interesting property (homework problem :): N(µ, σ) has maximum entropy H(p(x)) among all p(x) with a given mean and variance, where H(p(x)) = −∫ p(x) ln p(x) dx (in nats).
• When learning from data, max-entropy distributions are the most reasonable choice, since they impose no additional structure beyond what is given as constraints.
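The quantities on this slide are easy to check numerically. Below is a minimal Python sketch (standard library only); the function and variable names are mine, and the closed-form Gaussian entropy 0.5·ln(2πeσ²) used in the last function is a standard result rather than something stated on the slide.

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(mu, sigma^2), as defined above."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def mahalanobis_1d(x, mu, sigma):
    """One-dimensional Mahalanobis distance r = |x - mu| / sigma."""
    return abs(x - mu) / sigma

def gaussian_entropy(sigma):
    """Differential entropy of N(mu, sigma^2) in nats: 0.5 * ln(2*pi*e*sigma^2).
    This is the maximum entropy achievable for the given variance."""
    return 0.5 * math.log(2 * math.pi * math.e * sigma ** 2)

if __name__ == "__main__":
    print(gaussian_pdf(1.0, 0.0, 1.0))    # ~0.2420
    print(mahalanobis_1d(3.0, 1.0, 2.0))  # 1.0
    print(gaussian_entropy(1.0))          # ~1.4189 nats
```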
Bayes rule for binary classification
• Given only the priors P(C = ω_i) = P(ω_i): choose g*(x) = ω_1 if P(ω_1) > P(ω_2), g*(x) = ω_2 otherwise.
• Given evidence x, update P(ω_i) using Bayes formula:
  P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x),
  where p(x) = ∑_{j=1}^{n} p(x | ω_j) P(ω_j) – evidence probability, p(x | ω_i) – likelihood, P(ω_i) – prior.
• Bayes decision rule: choose g*(x) = ω_1 if P(ω_1 | x) > P(ω_2 | x), g*(x) = ω_2 otherwise.

Example: one-dimensional case
• Since p(x) does not depend on ω_i in P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x), we get
  g*(x) = ω_1 if p(x | ω_1) P(ω_1) > p(x | ω_2) P(ω_2).
• Priors: P(ω_1) = 2/3, P(ω_2) = 1/3.
• Decision regions: R_1* = {x : P(ω_1 | x) > P(ω_2 | x)}, R_2* = {x : P(ω_1 | x) ≤ P(ω_2 | x)}.
  (Figure: class-conditional densities p(x | ω_i) and posteriors P(ω_i | x) over x.)

Optimality of Bayes rule: idea
• Bayes rule: g*(x) = ω_1 if x < x*. Some other rule: g(x) = ω_1 if x < x̄ (a different threshold).
• P_g(error | x) = P(g(x) ≠ ω | x) = P(g(x) = ω_1, ω = ω_2 | x) + P(g(x) = ω_2, ω = ω_1 | x)
• Claim: P_{g*}(error | x) ≤ P_g(error | x) for any g(x): S → Ω.
• Proof – see next page…

Optimality of Bayes rule: proof
P_{g*}(error | x) = P(g*(x) ≠ ω | x) = P(g*(x) = ω_1, ω = ω_2 | x) + P(g*(x) = ω_2, ω = ω_1 | x)
                  = P(g*(x) = ω_1 | ω_2, x) P(ω_2 | x) + P(g*(x) = ω_2 | ω_1, x) P(ω_1 | x)
                  = p_1 P(ω_2 | x) + p_2 P(ω_1 | x)
1) If P(ω_1 | x) > P(ω_2 | x), then g*(x) = ω_1, so p_1 = 1, p_2 = 0 ⇒ P_{g*}(error | x) = P(ω_2 | x).
2) If P(ω_1 | x) ≤ P(ω_2 | x), then g*(x) = ω_2, so p_1 = 0, p_2 = 1 ⇒ P_{g*}(error | x) = P(ω_1 | x).
Thus P_{g*}(error | x) = min{ P(ω_1 | x), P(ω_2 | x) }, and therefore, for any g(x): S → Ω,
P_{g*}(error) = ∫_{−∞}^{+∞} P_{g*}(error | x) p(x) dx ≤ P_g(error).

General Bayesian decision theory
Given:
• a set of available actions (decisions) A = {α_1, ..., α_a}
• a loss function λ_ij = λ(α_i | ω_j) (cost of decision α_i given ω_j)
Find: a decision rule α(x): S → A minimizing the total expected loss (risk)
  R = ∫ R(α(x) | x) p(x) dx,
where R(α_i | x) = ∑_{j=1}^{n} λ(α_i | ω_j) P(ω_j | x) is the conditional risk of action α_i given x,
and P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x) (Bayes formula).
• Bayes decision rule: always minimize the conditional risk:
  α(x) = argmin_{α_i} R(α_i | x) = argmin_{α_i} ∑_{j=1}^{n} λ(α_i | ω_j) P(ω_j | x)
• The Bayes decision rule yields the minimum overall risk (called the Bayes risk):
  R* = min_{α(·)} ∫ R(α(x) | x) p(x) dx

Zero-one loss classification
• Let α(x) = g(x) (action = classification), i.e., α_i = ω_i, i = 1, ..., n.
• Zero-one loss: λ_ij = λ(α_i | ω_j) = 1 if i ≠ j, 0 if i = j.
• Then the conditional risk equals the classification error:
  R(α_i | x) = ∑_{j=1}^{n} λ(α_i | ω_j) P(ω_j | x) = ∑_{j ≠ i} P(ω_j | x) = 1 − P(ω_i | x) = P(error | x)
• Bayes rule: g*(x) = ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i.
• The Bayes rule achieves the minimum error rate R* = min_{α(·)} ∫ R(α(x) | x) p(x) dx.

Different errors, different costs
• Example from signal detection theory: ω_1 – no signal (background noise), ω_2 – signal present.
  - Hit: P(x > x* | x ∈ ω_2) (cost λ_22)
  - Miss: P(x < x* | x ∈ ω_2) (cost λ_12)
  - False alarm: P(x > x* | x ∈ ω_1) (cost λ_21)
  - Correct rejection: P(x < x* | x ∈ ω_1) (cost λ_11)
• Discriminability: d' = |µ_2 − µ_1| / σ
• Receiver operating characteristic (ROC) curve: hit rate vs. false-alarm rate P(x > x* | x ∈ ω_1). The error costs determine the optimal decision threshold x*.
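To make the minimum-risk rule concrete, here is a small Python sketch (the posterior and loss values are illustrative, not taken from the slides). It computes the conditional risks R(α_i | x) = ∑_j λ_ij P(ω_j | x) and picks the minimum-risk action, showing how an asymmetric loss can flip the decision relative to 0/1 loss.

```python
import numpy as np

def bayes_decision(posteriors, loss):
    """General Bayes decision rule: pick the action with minimum conditional risk
    R(alpha_i | x) = sum_j loss[i, j] * P(omega_j | x)."""
    posteriors = np.asarray(posteriors, dtype=float)
    loss = np.asarray(loss, dtype=float)
    risks = loss @ posteriors          # conditional risk of each action
    return int(np.argmin(risks)), risks

if __name__ == "__main__":
    # Illustrative numbers (not from the slides): two classes, 0/1 loss vs. asymmetric loss.
    post = [0.85, 0.15]                # P(omega_1 | x), P(omega_2 | x)
    zero_one = [[0, 1],
                [1, 0]]
    asymmetric = [[0, 10],             # missing omega_2 (e.g., a signal) costs 10
                  [1, 0]]              # a false alarm costs only 1
    print(bayes_decision(post, zero_one))    # action 0: decide omega_1
    print(bayes_decision(post, asymmetric))  # action 1: the costly miss flips the decision to omega_2
```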
Cost-based classification
Let α_i = ω_i, i = 1, 2, and λ_ij = λ(α_i | ω_j). The conditional risks are
  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)
Bayes rule: g*(x) = ω_1 if R(α_1 | x) < R(α_2 | x), i.e., if
  (λ_21 − λ_11) P(ω_1 | x) > (λ_12 − λ_22) P(ω_2 | x).
Assuming λ_21 > λ_11 and λ_12 > λ_22 (errors cost more than correct decisions):
  P(ω_1 | x) / P(ω_2 | x) > (λ_12 − λ_22) / (λ_21 − λ_11),
or, equivalently,
Bayes rule: g*(x) = ω_1 if
  p(x | ω_1) / p(x | ω_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [P(ω_2) / P(ω_1)] = θ_λ.

Example: one-dimensional case
  p(x | ω_1) / p(x | ω_2) > [(λ_12 − λ_22) / (λ_21 − λ_11)] · [P(ω_2) / P(ω_1)] = θ_λ
If misclassifying ω_2 as ω_1 becomes more expensive than the opposite error, i.e., λ_12 > λ_21, then the threshold θ_λ increases.

Discriminant functions and classification
• Multi-category classification:
  - discriminant functions f_i(x), i = 1, ..., n (e.g., f_i*(x) = −R(α_i | x) for the Bayes classifier)
  - classifier: g(x) = ω_i if f_i(x) > f_j(x) for all j ≠ i
• Two-category classification:
  - a single discriminant function f(x) ≡ f_1(x) − f_2(x)
  - classifier: g(x) = ω_1 if f(x) > 0
  (Figure: a network view – input x feeds discriminants f_1(x), ..., f_n(x), combined with the costs λ(α_i | ω_j) to produce the decision α_i = g(x).)

Bayes discriminant functions
• Bayes discriminant functions for zero-one loss classification: f_i*(x) = −R(α_i | x) = P(ω_i | x) − 1.
• Every f_i(x) can be replaced by h(f_i(x)), where h(·) is monotonically increasing!
• Multi-category (equivalent choices):
  f_i*(x) = P(ω_i | x) = p(x | ω_i) P(ω_i) / p(x)
  f_i*(x) = p(x | ω_i) P(ω_i)
  f_i*(x) = log p(x | ω_i) + log P(ω_i)
• Two-category (equivalent choices), with f*(x) ≡ f_1*(x) − f_2*(x):
  f*(x) = P(ω_1 | x) − P(ω_2 | x)
  f*(x) = log [p(x | ω_1) / p(x | ω_2)] + log [P(ω_1) / P(ω_2)]
• The discriminant functions partition the feature space into decision regions separated by decision surfaces.

Discriminant functions: examples
• Continuous features – multivariate Gaussian:
  p(x | ω_i) = (1 / ((2π)^{d/2} |Σ_i|^{1/2})) exp( −(1/2) (x − µ_i)^T Σ_i^{−1} (x − µ_i) )
  - same covariance matrix for all classes (Σ_i = Σ): linear discriminants (hyperplanes)
  - general case (arbitrary Σ_i): hyperquadrics (hyperellipsoids, hyperparaboloids, hyperhyperboloids)
• Discrete features – binary, conditionally independent: P(x_i, x_j | C) = P(x_i | C) P(x_j | C)
  - linear discriminants (hyperplanes); a linear classifier ('naïve Bayes' – see later)

Conditionally independent binary features → linear classifier ('naïve Bayes')
• x = (x_1, ..., x_n), x_i ∈ {0, 1} – features, c ∈ {ω_1, ω_2} – class.
• Let p_i = P(x_i = 1 | ω_1) and q_i = P(x_i = 1 | ω_2). Then
  P(x | ω_1) = ∏_{i=1}^{n} p_i^{x_i} (1 − p_i)^{1 − x_i},  P(x | ω_2) = ∏_{i=1}^{n} q_i^{x_i} (1 − q_i)^{1 − x_i}.
• Discriminant function f(x) (decision rule: if f(x) > 0, choose ω_1):
  f(x) = log [P(ω_1 | x) / P(ω_2 | x)] = log [P(x | ω_1) P(ω_1) / (P(x | ω_2) P(ω_2))]
       = ∑_{i=1}^{n} [ x_i log(p_i / q_i) + (1 − x_i) log((1 − p_i) / (1 − q_i)) ] + log [P(ω_1) / P(ω_2)]
• Hence f(x) = ∑_{i=1}^{n} w_i x_i + w_0, where
  w_i = log [ p_i (1 − q_i) / (q_i (1 − p_i)) ],
  w_0 = ∑_{i=1}^{n} log [(1 − p_i) / (1 − q_i)] + log [P(ω_1) / P(ω_2)].
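A minimal Python sketch of this linear discriminant; the parameter values (p_i, q_i, priors) are made up for illustration.

```python
import math

def linear_nb_discriminant(p, q, prior1, prior2):
    """Weights of the linear discriminant f(x) = sum_i w_i x_i + w_0 derived above:
      w_i = log( p_i (1 - q_i) / (q_i (1 - p_i)) )
      w_0 = sum_i log( (1 - p_i) / (1 - q_i) ) + log( P(omega_1) / P(omega_2) )"""
    w = [math.log(pi * (1 - qi) / (qi * (1 - pi))) for pi, qi in zip(p, q)]
    w0 = sum(math.log((1 - pi) / (1 - qi)) for pi, qi in zip(p, q)) + math.log(prior1 / prior2)
    return w, w0

def classify(x, w, w0):
    """Decision rule from the slide: choose omega_1 (return 1) if f(x) > 0, else omega_2 (return 2)."""
    f = sum(wi * xi for wi, xi in zip(w, x)) + w0
    return 1 if f > 0 else 2

if __name__ == "__main__":
    # Illustrative parameters (not from the slides).
    p = [0.8, 0.6, 0.1]        # p_i = P(x_i = 1 | omega_1)
    q = [0.2, 0.5, 0.4]        # q_i = P(x_i = 1 | omega_2)
    w, w0 = linear_nb_discriminant(p, q, prior1=0.5, prior2=0.5)
    print(classify([1, 1, 0], w, w0))   # 1, i.e. omega_1
    print(classify([0, 0, 1], w, w0))   # 2, i.e. omega_2
```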
Case 1: independent Gaussian features with the same variance for all classes, Σ_i = σ²I.
  (Figure.) Note: linear separating surfaces!
Case 2: generalization to dependent features with the same covariance matrix for all classes, Σ_i = Σ.
  (Figure.)
Case 3: unequal covariance matrices.
  - One dimension: multiply-connected decision regions. (Figure.)
  - Many dimensions: hyperquadric decision surfaces. (Figure.)

Summary
• Bayesian decision theory:
  - In theory: tells you how to make optimal (minimum-risk) decisions.
  - In practice: where do you get those probabilities from? Expert knowledge + learning from data (see next; also, Chapter 3).

Note: some typos in Chapter 2
• Page 25, second line after equation 12: must be R(α_i(x) | x), not R(α_i(x)).
• Page 27, third line before Section 2.3.1: switch ω_1 and ω_2, and reverse the inequality in λ_21 > λ_12.
• Pages 50 and 51, Fig. 2.20 and Fig. 2.21: replace the x-axis label by P(x > x* | x ∈ ω_1). Same replacement in problem 9, page 75.
• Section 2.11 (Bayesian belief networks): contains several mistakes; ignore it for now. Bayesian networks will be covered later.

Parameter estimation
• In general, given training data D = {y_1, ..., y_N}, where y_j = (x_j, ω_j), we wish to find (estimate) P(C = ω_i) and P(x | ω_i) – i.e., a density estimation problem.
• Usually, estimating P(x | ω_i) is hard, especially in high-dimensional feature spaces.
• Solution: simplifying assumptions (e.g., a parametric form of P(x | ω_i), or feature independence, etc.).
• We first consider the fixed parametric distribution approach (e.g., Gaussian, multinomial, etc.).
• Then learning = parameter estimation from data.
• Example: assume p(x | ω_i) is Gaussian N(µ_i, σ_i); estimate µ_i, σ_i.
• Two major approaches: classical statistical (ML) and Bayesian (MAP).
• Philosophical difference: is a parameter a 'physical' constant or a random variable?

Maximum-likelihood (ML) and maximum a posteriori (MAP) estimates
• Assume independent and identically distributed (i.i.d.) samples D = {y_1, ..., y_N}, where y_j = (x_j, ω_j).
• Assume a parametric distribution p(x | ω_i, θ), where θ is a parameter vector. Example: θ = (µ_i, σ_i) for a Gaussian p(x | ω_i, θ).
• We also assume that the θ's for different classes are independent (so each can be estimated separately, in the same way).
• Then P(D | θ) = ∏_{j=1}^{N} p(x_j | θ).
• Maximum-likelihood estimate (θ is an unknown constant): θ̂ = argmax_θ l(θ) = argmax_θ P(D | θ).
• Maximum a posteriori estimate (θ is an unknown random variable with prior P(θ)): θ̂ = argmax_θ P(θ | D) = argmax_θ P(D | θ) P(θ).
• Note that ML = MAP with a uniform prior.

ML estimate: Gaussian distribution
• Known σ, estimate µ: µ̂ = (1/N) ∑_{j=1}^{N} x_j
• Both µ and σ unknown: µ̂ = (1/N) ∑_{j=1}^{N} x_j,  σ̂² = (1/N) ∑_{j=1}^{N} (x_j − µ̂)²
• Note that σ̂² is biased, i.e., E[ (1/N) ∑_{j=1}^{N} (x_j − µ̂)² ] = ((N − 1)/N) σ² ≠ σ².
• An unbiased estimate would be σ̂² = (1/(N − 1)) ∑_{j=1}^{N} (x_j − µ̂)².

Bayesian (MAP) estimate with increasing sample size
  (Figure.)

Parameter estimation: discrete features
• Model: class node C with a discrete feature X; multinomial P(x | C) with θ_ki = P(x = k | C = ω_i).
• ML estimate: Θ̂ = argmax_Θ log P(D | Θ), giving ML(θ_ki) = N_{x=k, c=i} / ∑_k N_{x=k, c=i} (counts).
• MAP estimate: argmax_Θ log [P(D | Θ) P(Θ)].
• Conjugate prior – Dirichlet: Dir(θ_{pa_X} | α_{1, pa_X}, ..., α_{m, pa_X}), giving
  MAP(θ_{x, pa_X}) = (N_{x, pa_X} + α_{x, pa_X}) / (∑_x N_{x, pa_X} + ∑_x α_{x, pa_X}).
• Equivalent sample size (prior knowledge).
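A small Python sketch contrasting the ML counts-based estimate with the Dirichlet-smoothed estimate from the slide above (the data and hyperparameters are illustrative only).

```python
from collections import Counter

def ml_multinomial(samples, k):
    """Maximum-likelihood estimate of a multinomial: theta_k = N_k / sum_k N_k."""
    counts = Counter(samples)
    n = len(samples)
    return [counts.get(v, 0) / n for v in range(k)]

def map_multinomial(samples, k, alpha):
    """Dirichlet-smoothed estimate, following the formula on the slide:
    theta_k = (N_k + alpha_k) / (sum_k N_k + sum_k alpha_k).
    The alphas act as an 'equivalent sample size', i.e. pseudo-counts of prior knowledge."""
    counts = Counter(samples)
    total = len(samples) + sum(alpha)
    return [(counts.get(v, 0) + alpha[v]) / total for v in range(k)]

if __name__ == "__main__":
    # Illustrative data (not from the slides): values of one discrete feature for one class.
    data = [0, 0, 1, 0, 2, 0, 1]
    print(ml_multinomial(data, k=3))                    # [4/7, 2/7, 1/7]
    print(map_multinomial(data, k=3, alpha=[1, 1, 1]))  # Laplace-smoothed: [0.5, 0.3, 0.2]
```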
An example: naïve Bayes classifier
• Simplifying ("naïve") assumption: feature independence given the class C:
  P(x_1, ..., x_n | C) = ∏_{j=1}^{n} P(x_j | C)
  (Figure: class node C with prior P(C) and feature nodes x_1, ..., x_n with P(x_j | C).)
1. Bayes(-optimal) classifier: given an (unlabeled) instance x = (x_1, ..., x_n), choose the most likely class:
   BO(x) = argmax_i P(C = i | x)
2. Naïve Bayes classifier: by Bayes rule, P(C = i | x) = P(x | C = i) P(C = i) / P(x), and by the independence assumption,
   NB(x) = argmax_i P(C = i) ∏_{j=1}^{n} P(x_j | C = i)

State-of-the-art
• Optimality results:
  - linear decision surface for binary features (Minsky 61, Duda & Hart 73); polynomial for general nominal features (Duda & Hart 1973, Peot 96)
  - optimality for OR and AND concepts (Domingos & Pazzani 97)
  - no XOR-containing concepts on nominal features (Zhang & Ling 01)
• Algorithmic improvements:
  - boosted NB (Elkan 97) is equivalent to a multilayer perceptron
  - augmented NB (TAN, Bayes nets – e.g., Friedman et al. 97)
  - other improvements: combining with decision trees (Kohavi), with error-correcting output coding (ECOC) (Ghani, ICML 2000), etc.
• Still, open problems remain: NB error estimates/bounds based on domain properties.

Why does naïve Bayes often work well (despite the independence assumption)?
• Wrong P(C | x) estimates do not imply wrong classification!
  (Domingos & Pazzani 97, J. Friedman 97, etc.; "Statistical diagnosis based on conditional independence does not require it", J. Hilden 84)
• Bayes-optimal: class_opt = argmax_i P(class_i | f), using the true P(class | f).
• Naïve Bayes: class_NB = argmax_i P̂(class_i | f), using the NB estimate P̂(class | f).
• Major questions remain: Which P(C, x) are 'good' for NB? What domain properties "predict" NB accuracy?

General question: characterize distributions P(X, C) and their approximations Q(X, C) that can be 'far' from P(X, C) but still yield a low classification error.
  (Figure: true vs. approximate P(x | C) P(C) for Class 1 and Class 2, with the optimal decision boundary.)
• Note: one measure of 'distance' between distributions is relative entropy, or KL-divergence (see homework problem 11, Chapter 3):
  D(P || Q) = ∫ P(z) log [P(z) / Q(z)] dz

Case study: using naïve Bayes for transaction recognition
• End-user transactions (EUTs), e.g., OpenDB, Search, SendMail, are composed of Remote Procedure Calls (RPCs) observed within a session (connection) between a client workstation and a server (Web, DB, Lotus Notes).
• Two problems: segmentation and labeling (turn an unsegmented RPC stream into segmented RPCs with labeled transactions).

Representing transactions as feature vectors
• A transaction of type i is a sequence of RPCs, e.g., R2 R5 R3 R2 R1 R2 R5.
• RPC occurrences, f = (1, 1, 1, 0, 1, 0, ...):
  - Bernoulli: P(R_j = 1 | T_i) = p_ij
• RPC counts, f = (1, 3, 1, 0, 2, 0, ...):
  - Multinomial: P(n_i1, ..., n_iM | T_i) = (n! / ∏_{j=1}^{M} n_ij!) ∏_{j=1}^{M} p_ij^{n_ij}
  - Geometric: P(n_ij | T_i) = p_ij^{n_ij} (1 − p_ij)
  - Shifted geometric: P(n_ij | T_i) = p_ij^{n_ij − s_ij} (1 − p_ij)
• Best fit to the data (by a χ² test): shifted geometric.

Empirical results
• NB + Bernoulli, multinomial, or geometric: accuracy 79 ± 3%; NB + shifted geometric: 87 ± 2%.
• Significant improvement (≈ 10 ± 1%) over the baseline classifier (75%), which always selects the most frequent transaction.
• NB is simple, efficient, and comparable to state-of-the-art classifiers: SVM – 85–87%, decision tree – 90–92%.
• The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)
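The classifier itself is only a few lines of code. Below is a minimal Bernoulli naïve Bayes sketch for binary features, with a tiny made-up data set; it estimates class priors and per-feature Bernoulli parameters with Dirichlet (Laplace) smoothing and classifies by NB(x) = argmax_i P(C = i) ∏_j P(x_j | C = i), working in log space to avoid underflow.

```python
import math

class BernoulliNaiveBayes:
    """Minimal naive Bayes classifier for binary features:
    NB(x) = argmax_c P(C = c) * prod_j P(x_j | C = c),
    with Dirichlet (Laplace) smoothing of the Bernoulli parameters."""

    def fit(self, X, y, alpha=1.0):
        n_features = len(X[0])
        self.log_prior = {}
        self.log_p1 = {}   # log P(x_j = 1 | c)
        self.log_p0 = {}   # log P(x_j = 0 | c)
        for c in sorted(set(y)):
            Xc = [x for x, label in zip(X, y) if label == c]
            self.log_prior[c] = math.log(len(Xc) / len(X))
            self.log_p1[c], self.log_p0[c] = [], []
            for j in range(n_features):
                p = (sum(x[j] for x in Xc) + alpha) / (len(Xc) + 2 * alpha)
                self.log_p1[c].append(math.log(p))
                self.log_p0[c].append(math.log(1 - p))
        return self

    def predict(self, x):
        best, best_score = None, -math.inf
        for c in self.log_prior:
            score = self.log_prior[c] + sum(
                self.log_p1[c][j] if xj else self.log_p0[c][j] for j, xj in enumerate(x))
            if score > best_score:
                best, best_score = c, score
        return best

if __name__ == "__main__":
    # Tiny illustrative data set (not from the slides).
    X = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 0, 1]]
    y = ["spam", "spam", "ham", "ham"]
    nb = BernoulliNaiveBayes().fit(X, y)
    print(nb.predict([1, 1, 0]))   # "spam"
    print(nb.predict([0, 0, 1]))   # "ham"
```

More elaborate variants, such as the multinomial or shifted-geometric feature models used in the case study above, only change how P(x_j | C) is modeled and estimated; the argmax decision rule stays the same.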
Next lecture on Bayesian topics
• April 17, 2002 – lecture on recent 'hot stuff': Bayesian networks, HMMs, the EM algorithm.

Short preview: from naïve Bayes to Bayesian networks
• Naïve Bayes model: independent features given the class – a class node with conditional distributions P(f_1 | class), ..., P(f_n | class) for features f_1, ..., f_n.
• Bayesian network (BN) model: can represent any joint probability distribution. Example:
  P(S, C, B, X, D) = P(S) P(C | S) P(B | S) P(X | C, S) P(D | C, B)
  with nodes Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), and Dyspnoea (D).
• Example conditional probability distribution (CPD) for P(D | C, B):

  C  B | D=0  D=1
  0  0 | 0.1  0.9
  0  1 | 0.7  0.3
  1  0 | 0.8  0.2
  1  1 | 0.9  0.1

• Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
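As a preview of the kind of inference such a query requires, here is a brute-force enumeration sketch in Python. Only the network structure and the P(D | C, B) table come from the slide; every other probability (P(S), P(C | S), P(B | S)) is a made-up placeholder, so the printed numbers are illustrative only.

```python
# All probabilities below except P(D | C, B) are hypothetical placeholders;
# the slide specifies only the structure and the P(D | C, B) table.
P_S = {0: 0.7, 1: 0.3}                      # P(Smoking)              -- assumed
P_C_given_S = {0: {0: 0.99, 1: 0.01},       # P(Cancer | Smoking)     -- assumed
               1: {0: 0.90, 1: 0.10}}
P_B_given_S = {0: {0: 0.8, 1: 0.2},         # P(Bronchitis | Smoking) -- assumed
               1: {0: 0.4, 1: 0.6}}
P_D_given_CB = {(0, 0): {0: 0.1, 1: 0.9},   # P(Dyspnoea | Cancer, Bronchitis) -- from the slide's CPD
                (0, 1): {0: 0.7, 1: 0.3},
                (1, 0): {0: 0.8, 1: 0.2},
                (1, 1): {0: 0.9, 1: 0.1}}

def query_cancer(s_obs, d_obs):
    """P(Cancer | Smoking = s_obs, Dyspnoea = d_obs) by brute-force enumeration over
    the hidden variable B (X sums out of the factorization and can be dropped)."""
    unnorm = {}
    for c in (0, 1):
        total = 0.0
        for b in (0, 1):
            total += (P_S[s_obs] * P_C_given_S[s_obs][c] *
                      P_B_given_S[s_obs][b] * P_D_given_CB[(c, b)][d_obs])
        unnorm[c] = total
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

if __name__ == "__main__":
    # P(lung cancer | smoking = no, dyspnoea = yes), under the assumed CPTs above.
    print(query_cancer(s_obs=0, d_obs=1))
```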