Predictive Modeling & The Bayes Classifier
Rosa Cowan
April 29, 2008

Goal of Predictive Modeling
• To identify the class membership of a variable (entity, event, or phenomenon) through known values of other variables (characteristics, features, attributes).
• This means finding a function f such that y = f(x; θ), where
  x = {x_1, x_2, ..., x_p} is the set of known feature values,
  θ is a set of estimated parameters for the model,
  y ∈ {c_1, c_2, ..., c_m} for the discrete case, and
  y is a real number for the continuous case.

Example Applications of Predictive Models
• Forecasting the peak bloom period for Washington's cherry blossoms
• Numerous applications in Natural Language Processing, including semantic parsing, named entity extraction, coreference resolution, and machine translation
• Medical diagnosis (MYCIN – identification of bacterial infections)
• Sensor threat identification
• Predicting stock market behavior
• Image processing
• Predicting consumer purchasing behaviors
• Predicting successful movie and record productions

Predictive Modeling Ingredients
1. A model structure
2. A score function
3. An optimization strategy for finding the best model
4. Data or expert knowledge for training and testing

2 Types of Predictive Models
• Classifiers or Supervised Classification* – for the case when C is categorical
• Regression – for the case when C is real-valued
*The remainder of this presentation focuses on Classifiers.

Classifier Variants & Example Types
1. Discriminative: work by defining decision boundaries or decision surfaces
   – Nearest Neighbor Methods; K-means
   – Linear & Quadratic Discriminant Methods
   – Perceptrons
   – Support Vector Machines
   – Tree Models (C4.5)
2. Probabilistic Models: work by identifying the most likely class for a given observation by modeling the underlying distributions of the features across classes*
   – Bayes Modeling
   – Naïve Bayes Classifiers
*The remainder of the presentation focuses on Probabilistic Models, with particular attention paid to the Naïve Bayes Classifier.

General Bayes Modeling
• Uses Bayes' rule:
  P(A | B) = P(A ∧ B) / P(B);  P(A ∧ B) = P(A | B) P(B);
  therefore P(B | A) = P(A | B) P(B) / P(A).
• Applied to classification:
  P(c_k | x_1, x_2, ..., x_p) = P(x_1, x_2, ..., x_p | c_k) P(c_k) / P(x_1, x_2, ..., x_p)  for 1 ≤ k ≤ m,
  where P(c_k | x_1, ..., x_p) is referred to as the posterior probability, P(c_k) as the prior probability, P(x_1, ..., x_p | c_k) as the likelihood, and P(x_1, ..., x_p) as the evidence.
• For general conditional probability classification modeling, we're interested in the posterior probabilities P(c_k | x_1, ..., x_p).

Bayes Example
• Let's say we're interested in predicting if a particular student will pass CMSC498K.
• We have data on past student performance. For each student we know:
  – If the student's GPA > 3.0 (G)
  – If the student had a strong math background (M)
  – If the student is a hard worker (H)
  – If the student passed or failed the course
• A new student comes along with values G = g, M = m, and H = h and wants to know if they will likely pass or fail the course.
  f(g, m, h) = P(g, m, h, pass) / P(g, m, h, fail)
  If f(g, m, h) ≥ 1, the classifier predicts pass; otherwise fail.

General Bayes Example (Cont.)
  G   M   H   P(g, m, h | pass)   P(g, m, h | fail)
  0   0   0         0.01                0.28
  0   0   1         0.03                0.15
  0   1   0         0.05                0.20
  0   1   1         0.08                0.14
  1   0   0         0.10                0.07
  1   0   1         0.28                0.05
  1   1   0         0.15                0.08
  1   1   1         0.30                0.03
• Assume P(pass) = 0.5 and P(fail) = 0.5. Let x = {0, 1, 0} for (G, M, H):
  f(x) = P(pass) P(x | pass) / (P(fail) P(x | fail)) = (0.5 × 0.05) / (0.5 × 0.20) = 0.25,
  so the classifier predicts fail.
• Joint probability distributions grow exponentially with the number of features! For binary-valued features, we need O(2^p) joint probability entries for each class.
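To make the table lookup concrete, below is a minimal Python sketch of this full-joint classifier (our illustration, not from the original slides; the dictionary layout and the name f are assumptions). It reproduces the computation for x = (0, 1, 0):

```python
# Full-joint Bayes classifier for the pass/fail example.
# Joint likelihood tables P(g, m, h | class) copied from the slide,
# keyed by the tuple (G, M, H).
P_PASS, P_FAIL = 0.5, 0.5  # priors assumed on the slide

LIK_PASS = {
    (0, 0, 0): 0.01, (0, 0, 1): 0.03, (0, 1, 0): 0.05, (0, 1, 1): 0.08,
    (1, 0, 0): 0.10, (1, 0, 1): 0.28, (1, 1, 0): 0.15, (1, 1, 1): 0.30,
}
LIK_FAIL = {
    (0, 0, 0): 0.28, (0, 0, 1): 0.15, (0, 1, 0): 0.20, (0, 1, 1): 0.14,
    (1, 0, 0): 0.07, (1, 0, 1): 0.05, (1, 1, 0): 0.08, (1, 1, 1): 0.03,
}

def f(x):
    """f(x) = P(pass) P(x | pass) / (P(fail) P(x | fail))."""
    return (P_PASS * LIK_PASS[x]) / (P_FAIL * LIK_FAIL[x])

x = (0, 1, 0)                            # the slide's example student
print(f(x))                              # ≈ 0.25
print("pass" if f(x) >= 1 else "fail")   # fail
```

Note the storage cost: each dictionary needs one entry per feature combination, which is exactly the O(2^p) blow-up the slide warns about.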
Augmented Naïve Bayes Net (Directed Acyclic Graph)
• G and H are conditionally independent of M given pass.
• Structure (from the conditional probability tables): the class node pass points to G, H, and M, and H is also a parent of G, with P(pass) = 0.5 and:
  P(G | H ∧ pass) = 0.9      P(G | ¬H ∧ pass) = 0.85
  P(G | H ∧ ¬pass) = 0.10    P(G | ¬H ∧ ¬pass) = 0.05
  P(H | pass) = 0.5          P(H | ¬pass) = 0.2
  P(M | pass) = 0.6          P(M | ¬pass) = 0.10
• Example query:
  P(M, ¬G, ¬H, pass) = P(M | ¬G ∧ ¬H ∧ pass) × P(¬G ∧ ¬H ∧ pass)
                     = P(M | pass) × P(¬G ∧ ¬H ∧ pass)                          (conditional independence)
                     = P(M | pass) × P(¬G | ¬H ∧ pass) × P(¬H ∧ pass)
                     = P(M | pass) × P(¬G | ¬H ∧ pass) × P(¬H | pass) × P(pass)
                     = 0.6 × 0.15 × 0.5 × 0.5 = 0.0225
  (where P(¬G | ¬H ∧ pass) = 1 − 0.85 = 0.15)

Naïve Bayes
• Strong assumption of conditional independence among all feature variables.
• Feature variables depend only on the class variable, with P(pass) = 0.5 and:
  P(G | pass) = 0.8   P(G | ¬pass) = 0.1
  P(H | pass) = 0.7   P(H | ¬pass) = 0.4
  P(M | pass) = 0.6   P(M | ¬pass) = 0.7
• The same query now factors completely:
  P(pass, ¬G, M, ¬H) = P(M | ¬G ∧ ¬H ∧ pass) × P(¬G ∧ ¬H ∧ pass)
                     = P(M | pass) × P(¬G | pass) × P(¬H | pass) × P(pass)
                     = P(pass) × ∏_{i=1}^{p} P(x_i | pass)
                     = 0.5 × 0.6 × 0.2 × 0.3 = 0.018
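For contrast, here is the same kind of sketch for the naïve Bayes version (again our illustration; names are assumptions). Instead of 2^p joint entries per class, it stores only one conditional per feature per class:

```python
# Naive Bayes for the pass/fail example: features are conditionally
# independent given the class, so the joint factors into per-feature terms.
P_PASS, P_FAIL = 0.5, 0.5

# P(feature = 1 | class), copied from the slide
COND_PASS = {"G": 0.8, "M": 0.6, "H": 0.7}
COND_FAIL = {"G": 0.1, "M": 0.7, "H": 0.4}

def joint(x, cond, prior):
    """P(class) * prod_i P(x_i | class)."""
    p = prior
    for name, value in x.items():
        p *= cond[name] if value else 1.0 - cond[name]
    return p

x = {"G": 0, "M": 1, "H": 0}               # same query: (¬G, M, ¬H)
print(joint(x, COND_PASS, P_PASS))          # ≈ 0.018, as on the slide
print("pass" if joint(x, COND_PASS, P_PASS) >= joint(x, COND_FAIL, P_FAIL)
      else "fail")
```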
Characteristics of Naïve Bayes
• Only requires the estimation of the prior probabilities P(c_k) and p conditional probabilities for each class to be able to answer the full set of queries across classes and features.
• Empirical evidence shows that Naïve Bayes classifiers work remarkably well. The use of a full Bayes (belief) network provides only limited improvements in classification performance.

Why do Naïve Bayes Classifiers work so well?
• Performance is measured using a 0-1 loss function, which counts the number of incorrect classifications rather than measuring how accurately the classifier estimates the posterior probabilities.
• An additional explanation by Harry Zhang claims that the distribution of dependencies among features over the classes affects the accuracy of Naïve Bayes.

Zhang's Explanation
• Define local dependencies – a measure of the dependency between a node and its parents: the ratio of the conditional probability of the node given its parents over the probability of the node without its parents.
• For a node X in the augmented naïve Bayes graph, the local dependence derivatives are defined as:
  dd_pass(X | pa(X)) = P(X | pa(X), pass) / P(X | pass)
  dd_fail(X | pa(X)) = P(X | pa(X), fail) / P(X | fail)
  where pa(X) denotes the parents of X.
• Define the local dependence derivative ratio for node X as:
  ddr(X) = dd_pass(X | pa(X)) / dd_fail(X | pa(X))

Zhang's Theorem #1
• Given an augmented naïve Bayes graph and its corresponding naïve Bayes graph on features X_1, X_2, ..., X_p, and assuming that f_b and f_nb are the Bayes and Naïve Bayes classifiers respectively, then:
  f_b(x_1, x_2, ..., x_p) = f_nb(x_1, x_2, ..., x_p) × ∏_{i=1}^{p} ddr(x_i)
  ∏_{i=1}^{p} ddr(x_i) is called the dependence distribution factor, DF(x).

Zhang's Theorem #2
• Given x = {x_1, x_2, ..., x_p}, an f_b classifier is equivalent to an f_nb classifier under 0-1 loss (i.e., both result in the same classification) iff: when f_b(x) ≥ 1, DF(x) ≤ f_b(x); or when f_b(x) < 1, DF(x) > f_b(x).

Analysis
• Determine when f_nb results in the same classification as f_b. From Theorem #1, f_b(x) / f_nb(x) = DF(x).
• Clearly f_nb = f_b when DF(x) = 1. There are 3 cases for DF(x) = 1:
  1. All the features are independent.
  2. Local dependencies for each node distribute evenly in both classes.
  3. Local dependencies supporting classification in one class are canceled by others supporting the opposite class.
• If f_b(x) ≥ 1, then DF(x) ≤ f_b(x) ensures f_nb(x) ≥ 1.
• If f_b(x) < 1, then DF(x) > f_b(x) ensures f_nb(x) < 1.

The End Except For
• Questions
• List of Sources

List of Sources
• Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining, Chapter 10. Massachusetts: The MIT Press.
• Zhang, H. (2004). The Optimality of Naïve Bayes. Retrieved April 17, 2008, from http://www.cs.unb.ca/profs/hzhang/publications/FLAIRS04ZhangH.pdf
• Moore, A. (2001). Bayes Nets for Representing and Reasoning About Uncertainty. Retrieved April 22, 2008, from http://www.coral-lab.org/~oates/classes/2006/Machine%20Learning/web/bayesnet.pdf
• Naïve Bayes classifier. Retrieved April 10, 2008, from http://en.wikipedia.org/wiki/Naive_Bayes_classifier
• Ruane, Michael (March 30, 2008). Cherry Blossom Forecast Gets a Digital Aid. Retrieved April 10, 2008, from http://www.boston.com/news/nation/washington/articles/2008/03/30/cherry_blossom_forecast_gets_a_digital_aid/
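As a closing numerical illustration of Zhang's quantities (our addition, not part of the original deck), the sketch below computes the local dependence derivatives and ddr(g) for the augmented pass/fail network from the earlier slide. The marginals P(G | class) are obtained by summing out H, a derivation step the slides leave implicit:

```python
# Local dependence derivative ratio ddr(g) for the augmented network
# pass -> {G, H, M} with the extra edge H -> G. CPT values are from the
# augmented naive Bayes slide; the marginalization step is our own.
P_G = {("h", "pass"): 0.9, ("not_h", "pass"): 0.85,
       ("h", "fail"): 0.10, ("not_h", "fail"): 0.05}   # P(G=1 | H, class)
P_H = {"pass": 0.5, "fail": 0.2}                       # P(H=1 | class)

def ddr_g(h):
    """ddr(g) for an observation with G = 1 and H = h ('h' or 'not_h')."""
    dd = {}
    for c in ("pass", "fail"):
        # P(G=1 | class), with the parent H summed out
        marginal = P_G[("h", c)] * P_H[c] + P_G[("not_h", c)] * (1 - P_H[c])
        dd[c] = P_G[(h, c)] / marginal     # dd_class(g | pa(g))
    return dd["pass"] / dd["fail"]

# G is the only node with a non-class parent, so DF(x) = ddr(g) here.
# By Theorem #2, f_nb disagrees with f_b only when f_b(x) lies between
# DF(x) and 1.
print(ddr_g("h"))       # ≈ 0.62
print(ddr_g("not_h"))   # ≈ 1.17
```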