Machine Learning & Data Mining
Part 1: The Basics

Jaime Carbonell (with contributions from Tom Mitchell, Sebastian Thrun and Yiming Yang)
Carnegie Mellon University
jgc@cs.cmu.edu
December, 2008
© 2008, Jaime G. Carbonell

Some Definitions (KBS vs ML)
- Knowledge-Based Systems
  - Rules, procedures, semantic nets, Horn clauses
  - Inference: matching, inheritance, resolution
  - Acquisition: manually, from human experts
- Machine Learning
  - Data: tables, relations, attribute lists, ...
  - Inference: rules, trees, decision functions, ...
  - Acquisition: automated, from data
- Data Mining
  - Machine learning applied to large real problems
  - May be augmented with KBS

Ingredients for Machine Learning
- "Historical" data (e.g. DB tables)
  - E.g. products (features, marketing, support, ...)
  - E.g. competition (products, pricing, customers)
  - E.g. customers (demographics, purchases, ...)
- Objective function (to be predicted or optimized)
  - E.g. maximize revenue per customer
  - E.g. minimize manufacturing defects
- Scalable machine learning method(s)
  - E.g. decision-tree induction, logistic regression
  - E.g. "active" learning, clustering

Sample ML/DM Applications I
- Credit Scoring
  - Training: past applicant profiles, how much credit was given, payback or default
  - Input: applicant profile (income, debts, ...)
  - Objective: credit score + maximum amount
- Fraud Detection (e.g. credit-card transactions)
  - Training: past known legitimate & fraudulent transactions
  - Input: proposed transaction (loc, cust, $$, ...)
  - Objective: approve/block decision

Sample ML/DM Applications II
- Demographic Segmentation
  - Training: past customer profiles (age, gender, education, income, ...) + product preferences
  - Input: new product description (features)
  - Objective: predict market-segment affinity
- Marketing/Advertisement Effectiveness
  - Training: past advertisement campaigns, demographic targets, product categories
  - Input: proposed advertisement campaign
  - Objective: projected effectiveness (sales increase modulated by marketing cost)

Sample ML/DM Applications III
- Product (or Part) Reliability
  - Training: past products/parts + specs at manufacturing + customer usage + maintenance records
  - Input: new part + expected usage
  - Objective: mean time to failure (replacement)
- Manufacturing Tolerances
  - Training: past product/part manufacturing process, tolerances, inspections, ...
  - Input: new part + expected usage
  - Objective: optimal manufacturing precision (minimize costs of failure + manufacture)

Sample ML/DM Applications IV
- Mechanical Diagnosis
  - Training: past symptoms observed at (or prior to) breakdown + underlying cause
  - Input: current symptoms
  - Objective: predict cause of failure
- Mechanical Repair
  - Training: cause of failure + product usage + repair (or preventive maintenance) effectiveness
  - Input: new failure cause + product usage
  - Objective: recommended repair (or preventive maintenance operation)

Sample ML/DM Applications V
- Billeting (job assignments)
  - Training: employee profiles, position profiles, employee performance in assigned position
  - Input: new employee or new position profile
  - Objective: predict performance in position
- Text Mining & Routing (e.g. customer centers)
  - Training: electronic problem reports, customer requests + who should handle them
  - Input: new incoming texts
  - Objective: assign category + route or reply
Preparing Historical Data
- Extract a DB table with all the needed information
  - Select, join, project, aggregate, ...
- Filter out rows with significant missing data
- Determine the predictor attributes (columns)
  - Ask a domain expert for relevant attributes, or
  - Start with all attributes and automatically subselect the most predictive ones (feature selection)
- Determine the to-be-predicted attribute (column)
  - Objective of the DM (number, decision, ...)

Sample DB Table
               [predictor attributes]                            [objective]
  Acct.  Income   Job   Tot Delinq  Max Delinq  Owns   Num Credit  Good
  numb.  in K/yr  Now?  accts       cycles      home?  years       cust.?
  --------------------------------------------------------------------------
  1001    85      Y     1           1           N       2          Y
  1002    60      Y     3           2           Y       5          N
  1003     ?      N     0           0           N       2          N
  1004    95      Y     1           2           N       9          Y
  1005   100      Y     1           6           Y       3          Y
  1006    29      Y     2           1           Y       1          N
  1007    88      Y     6           4           Y       8          N
  1008    80      Y     0           0           Y       0          Y
  1009    31      Y     1           1           N       1          Y
  1011     ?      Y     ?           0           ?       7          Y
  1012    75      ?     2           4           N       2          N
  1013    20      N     1           1           N       3          N
  1014    65      Y     1           3           Y       1          Y
  1015    65      N     1           2           N       8          Y
  1016    20      N     0           0           N       0          N
  1017    75      Y     1           3           N       2          N
  1018    40      N     0           0           Y      10          Y

Supervised Learning on a DB Table
- Given: DB table
  - With identified predictor attributes x_1, x_2, ..., x_n
  - And objective attribute y
- Find: a prediction function
  $F_k : (x_1, \ldots, x_n) \to y, \quad F_k \in \{F_1, F_2, \ldots, F_m\}$
- Subject to: error minimization on data table M
  $f_{best} = \arg\min_{f_k \in \{f_1, \ldots, f_m\}} \sum_{i \in Rows(M)} (y_i - f_k(x_i))^2$
  - Least-squares error, or L1-norm, or L-infinity norm, ...

Popular Predictor Functions
- Linear discriminators (next slides)
- k-Nearest-Neighbors (lecture #2)
- Decision trees (lecture #5)
- Linear & logistic regression (lecture #4)
- Probabilistic methods (lecture #3)
- Neural networks
  - 2-layer ~ logistic regression
  - Multi-layer: difficult to scale up
- Classification rule induction (in a few slides)

Linear Discriminator Functions
[Figures: a two-class problem, y = {+, -}, plotted in the (x1, x2) plane; a linear
discriminator y = sum_{i=0..n} a_i x_i separates the two classes, and a "new" point
is classified by the side of the line on which it falls.]

Issues with Linear Discriminators
- What is the "best" placement of the discriminator?
  - Maximize the margin
  - In general -> Support Vector Machines
- What if there are k classes (k > 2)?
  - Must learn k different discriminators
  - Each discriminates class k_i vs all other classes k_{j != i}
- What if the classes are not linearly separable?
  - Minimal-error (L1 or L2) placement (regression)
  - Give up on linear discriminators (-> other f_k's)

Maximizing the Margin
[Figure: the two-class problem with the maximum-margin separator and the margin
between the classes marked.]

Nearly-Separable Classes
[Figures: two-class problems where no line separates the classes perfectly; a few
points fall on the wrong side of any candidate discriminator.]

Minimizing Training Error
- Optimal placing of the maximum-margin separator
  - Quadratic programming (Support Vector Machines)
  - Slack variables to accommodate training errors
- Minimizing error metrics
  - Number of errors:   $L_0(f, X, y) = \frac{1}{n} \sum_{i=1}^{n} I(f(x_i), y_i)$
  - Magnitude of error: $L_1(f, X, y) = \sum_{i=1}^{n} |f(x_i) - y_i| \, I(f(x_i), y_i)$
  - Squared error:      $L_2(f, X, y) = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 \, I(f(x_i), y_i)$
  - Chebyshev norm:     $L_\infty(f, X, y) = \max_{i=1..n} |f(x_i) - y_i| \, I(f(x_i), y_i)$
  (where $I(f(x_i), y_i) = 1$ if $f(x_i) \neq y_i$, and 0 otherwise)
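As a concrete illustration of these error metrics (not part of the original slides), here is a minimal numpy sketch; the function name and the use of numpy are my own choices.

```python
import numpy as np

def training_errors(y_true, y_pred):
    """Compute the four error metrics above over a training (or test) set."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = np.abs(y_pred - y_true)            # |f(x_i) - y_i|
    n = len(y_true)
    return {
        "L0":   float(np.sum(diff > 0) / n),   # fraction of errors
        "L1":   float(np.sum(diff)),           # sum of error magnitudes
        "L2":   float(np.sum(diff ** 2) / n),  # mean squared error
        "Linf": float(np.max(diff)),           # Chebyshev (worst-case) error
    }

# Example: a prediction function that gets two of five training rows wrong
print(training_errors([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))
```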
Symbolic Rule Induction
- General idea
  - Labeled instances are DB tuples
  - Rules are generalized tuples
  - Generalization occurs at terms in tuples
  - Generalize on new E+ not correctly predicted
  - Specialize on new E- not correctly predicted
  - Ignore correctly predicted E+ or E- (error-driven learning)

Symbolic Rule Induction (2)
- Example term generalizations
  - Constant => disjunction
    - e.g. if a small portion of the value set has been seen
  - Constant => least-common-generalizer class
    - e.g. if a large portion of the value set has been seen
  - Number (or ordinal) => range
    - e.g. if densely, sequentially sampled

Symbolic Rule Induction Example (1)
  Age  Gender  Temp  b-cult  c-cult  loc  Skin    disease
  -------------------------------------------------------
  65   M       101   +       .23     USA  normal  strep
  25   M       102   +       .00     CAN  normal  strep
  65   M       102   -       .78     BRA  rash    dengue
  36   F        99   -       .19     USA  normal  *none*
  11   F       103   +       .23     USA  flush   strep
  88   F        98   +       .21     CAN  normal  *none*
  39   F       100   +       .10     BRA  normal  strep
  12   M       101   +       .00     BRA  normal  strep
  15   F       101   +       .66     BRA  flush   dengue
  20   F        98   +       .00     USA  rash    *none*
  81   M        98   -       .99     BRA  rash    ec-12
  87   F       100   -       .89     USA  rash    ec-12
  12   F       102   +       ??      CAN  normal  strep
  14   F       101   +       .33     USA  normal  ?
  67   M       102   +       .77     BRA  rash    ?

Symbolic Rule Induction Example (2)
- Candidate rules:
  IF   age = [12,65], gender = *any*, temp = [100,103], b-cult = +,
       c-cult = [.00,.23], loc = *any*, skin = (normal, flush)
  THEN strep

  IF   age = (15,65), gender = *any*, temp = [101,102], b-cult = *any*,
       c-cult = [.66,.78], loc = BRA, skin = rash
  THEN dengue
- Disclaimer: these are not real medical records or rules

Types of Data Mining
- "Supervised" methods (this DM course)
  - Training data has both predictor attributes & objective (to-be-predicted) attributes
  - Predict discrete classes -> classification
  - Predict continuous values -> regression
  - Duality: classification <-> regression
- "Unsupervised" methods
  - Training data without objective attributes
  - Goal: find novel & interesting patterns
  - Cutting-edge research, fewer success stories
- Semi-supervised methods: market-basket analysis, ...

Machine Learning Application Process in a Nutshell
- Choose a problem where
  - Prediction is valuable and non-trivial
  - Sufficient historical data is available
  - The objective is measurable (incl. in past data)
- Prepare the data
  - Tabular form, clean, divide into training & test sets
- Select a machine learning algorithm
  - Human-readable decision function -> rules, trees, ...
  - Robust with noisy data -> kNN, logistic regression, ...

Machine Learning Application Process in a Nutshell (2)
- Train the ML algorithm on the training data set
  - Each ML method has a different training process
  - Training uses both predictor & objective attributes
- Run the trained ML algorithm on the test data set
  - Testing uses only predictor attributes & outputs predictions of the objective attributes
  - Compare predictions vs actual objective attributes (see lecture 2 for evaluation metrics)
- If accuracy >= threshold, done. Else, try a different ML algorithm, different
  parameter settings, more training data, ... (a sketch of this loop follows below)
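To make the train/test loop concrete, here is a minimal sketch using scikit-learn (an assumption on my part; the slides do not name any particular library). The tiny hand-typed table and the 0.9 accuracy threshold are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Predictor attributes (income, job-now, delinq-accts, delinq-cycles, owns-home,
# credit-years) with Y/N already converted to 1/0, plus the objective "good customer?"
X = np.array([[85, 1, 1, 1, 0, 2],
              [60, 1, 3, 2, 1, 5],
              [95, 1, 1, 2, 0, 9],
              [29, 1, 2, 1, 1, 1],
              [88, 1, 6, 4, 1, 8],
              [80, 1, 0, 0, 1, 0]])
y = np.array([1, 0, 1, 0, 0, 1])

# Divide into training & test sets; train on one, predict on the other.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier(max_depth=3).fit(X_tr, y_tr)   # training uses X and y
acc = accuracy_score(y_te, model.predict(X_te))               # compare predictions vs actuals
print(f"test accuracy = {acc:.2f}")
if acc < 0.9:
    print("below threshold: try another algorithm, other parameters, or more data")
```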
Sample DB Table (same)
- The same credit table ([predictor attributes] + [objective]) shown earlier,
  repeated on the slide for reference.

Feature Vector Representation
- Predictor-attribute rows in DB tables can be represented as vectors.
- For instance, the 2nd & 4th rows of predictor attributes in our DB table are:
  R2 = [60  Y  3  2  Y  5]
  R4 = [95  Y  1  2  N  9]
- Converting to numbers (Y = 1, N = 0), we get:
  R2 = [60  1  3  2  1  5]
  R4 = [95  1  1  2  0  9]

Vector Similarity
- Suppose we have a new credit applicant
  R-new = [65  1  1  2  0  10]
- To which of R2 or R4 is she closer?
  R2 = [60  1  3  2  1  5]
  R4 = [95  1  1  2  0  9]
- What should we use as a SIMILARITY METRIC?
- Should we first NORMALIZE the vectors?
  - If not, the largest component will dominate

Normalizing Vector Attributes
- Linear normalization (often sufficient)
  - Find the max & min values for each attribute
  - Normalize each attribute by
    $A_{norm} = \frac{A_{actual} - A_{min}}{A_{max} - A_{min}}$
  - Apply to all vectors (historical + new) by normalizing each attribute, e.g.:
    $A_{R2,1} = \frac{60 - 20}{100 - 20} = 0.5$

Normalizing Full Vectors
- Normalizing the new applicant vector:
  R-new = [65  1  1  2  0  10]  ->  [.56  1  .17  .33  0  1]
- And normalizing the two past customer vectors:
  R2 = [60  1  3  2  1  5]  ->  [.50  1  .50  .33  1  .50]
  R4 = [95  1  1  2  0  9]  ->  [.94  1  .17  .33  0  .90]
- What if some attributes are known to be more important, say salary (A1) and
  delinquencies (A3)?
  - Weight accordingly, e.g. x2 for each
  - E.g. R-new-weighted: [1.12  1  .34  .33  0  1]

Similarity Functions (inverse distance)
- Now that we have weighted, normalized vectors, how do we tell exactly their degree
  of similarity?
- Inverse sum of differences (L1):
  $sim_{inv\text{-}diff}(\vec{a}, \vec{b}) = \frac{1}{\sum_{i=1}^{n} |a_i - b_i|}$
- Inverse Euclidean distance (L2):
  $sim_{Euclid}(\vec{a}, \vec{b}) = \frac{1}{\sqrt{\sum_{i=1}^{n} (a_i - b_i)^2}}$

Similarity Functions (direct)
- Dot-product similarity:
  $sim_{dot}(\vec{a}, \vec{b}) = \vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i$
- Cosine similarity (dot product of unit vectors):
  $sim_{cos}(\vec{a}, \vec{b}) = \frac{\vec{a} \cdot \vec{b}}{\|\vec{a}\| \, \|\vec{b}\|}
   = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}$

Alternative: Similarity Matrix for Non-Numeric Attributes
          tiny  little  small  medium  large  huge
  tiny    1.0
  little  0.8   1.0
  small   0.7   0.9     1.0
  medium  0.5   0.7     0.7    1.0
  large   0.2   0.3     0.3    0.5    1.0
  huge    0.0   0.1     0.2    0.3    0.8    1.0
- Diagonal must be 1.0
- Monotonicity property must hold
- Triangle inequality must hold
- Transitive property must hold
- Additivity/compositionality need not hold

k-Nearest Neighbors Method
- No explicit "training" phase
- When a new case arrives (a vector of predictor attributes):
  - Find the nearest k neighbors (maximum similarity) among previous cases
    (row vectors in the DB table)
  - The k neighbors vote for the objective attribute:
    - Unweighted majority vote, or
    - Similarity-weighted vote
- Works for both discrete and continuous objective attributes

Similarity-Weighted Voting in kNN
- If the objective attribute is discrete:
  $Value_{obj}(y) = \arg\max_{C_i \in Range(Value_{obj})}
     \sum_{[x_j \in kNN(y)] \wedge [value_{obj}(x_j) = C_i]} sim(x_j, y)$
- If the objective attribute is continuous:
  $Value_{obj}(y) = \frac{\sum_{x_j \in kNN(y)} value_{obj}(x_j) \, sim(x_j, y)}
                         {\sum_{x_j \in kNN(y)} sim(x_j, y)}$
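A minimal kNN sketch tying these pieces together: min-max normalization, cosine similarity, and a similarity-weighted vote. The function names are mine; the tiny example reuses the normalized credit vectors from the slides above, with the "good customer?" labels taken from the table (1002 = N, 1004 = Y).

```python
import numpy as np

def min_max_normalize(X):
    """Column-wise (A - A_min) / (A_max - A_min); constant columns map to 0."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    return (X - lo) / span

def cosine_sim(a, b):
    return float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def knn_predict(X_train, y_train, x_new, k=3):
    """Similarity-weighted vote of the k most similar training rows (discrete objective)."""
    sims = np.array([cosine_sim(x_new, row) for row in X_train])
    top = np.argsort(-sims)[:k]
    votes = {}
    for i in top:
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + sims[i]
    return max(votes, key=votes.get)

# Normalized vectors from the slides
X_train = np.array([[.50, 1, .50, .33, 1, .50],   # R2 -> good cust.? = N
                    [.94, 1, .17, .33, 0, .90]])  # R4 -> good cust.? = Y
y_train = ["N", "Y"]
r_new = np.array([.56, 1, .17, .33, 0, 1.0])
print(knn_predict(X_train, y_train, r_new, k=1))  # label of the most similar neighbor
```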
Applying kNN to Real Problems (1)
- How does one choose the vector representation?
  - Easy: vector = predictor attributes
  - What if attributes are not numerical?
    - Convert them (e.g. High = 2, Med = 1, Low = 0), or
    - Use a similarity function over nominal values
      (e.g. equality, or edit distance on strings)
- How does one choose a distance function?
  - Hard: no magic recipe; try the simpler ones first
  - This implies a need for systematic testing (discussed in coming slides)

Applying kNN to Real Problems (2)
- How does one determine whether the data should be normalized?
  - Normalization is usually a good idea
  - One can try kNN both ways to make sure
- How does one determine "k" in kNN?
  - k is often determined empirically
  - A good start is $k \approx \log_2(size(DB))$

Evaluating Machine Learning
- Accuracy = correct predictions / total predictions
  - Simplest & most popular metric
  - But misleading for very-rare-event prediction
- Precision, recall & F1
  - Borrowed from information retrieval
  - Applicable to very-rare-event prediction
- Correlation (between predicted & actual values) for continuous objective attributes
  - R^2, kappa coefficient, ...

Sample Confusion Matrix
                              Predicted diagnoses
                     Shorted    Loose      Burnt     Not
  True diagnoses     Power Sup  Connect's  Resistor  plugged in
  --------------------------------------------------------------
  Shorted Power Sup      50         0         10          0
  Loose Connect's         1       120          0         11
  Burnt Resistor         11         0         60          0
  Not plugged in          0         8          5        110

Measuring Accuracy
- Accuracy = correct/total; Error = incorrect/total
- Hence: accuracy = 1 - error
  $A = \frac{Trace(C)}{Full(C)} = \frac{\sum_{i=1}^{n} c_{i,i}}{\sum_{i=1}^{n} \sum_{j=1}^{n} c_{i,j}}$
- For the diagnosis example: A = 340/386 = 0.88, E = 1 - A = 0.12

What About Rare Events?
                              Predicted diagnoses
                     Shorted    Loose      Burnt     Not
  True diagnoses     Power Sup  Connect's  Resistor  plugged in
  --------------------------------------------------------------
  Shorted Power Sup       0         0         10          0
  Loose Connect's         1       120          0         11
  Burnt Resistor         11         0         60          0
  Not plugged in          0         8          5        160

Rare Event Evaluation
- Accuracy for this example = 0.88
- ...but NO correct predictions for "shorted power supply", 1 of the 4 diagnoses
- Alternative: per-diagnosis (per-class) accuracy:
  $A(class_i) = \frac{c_{i,i}}{\sum_{j} c_{j,i} + \sum_{j \neq i} c_{i,j}}$
- A("shorted PS") = 0/22 = 0
- A("not plugged in") = 160/184 = 0.87

ROC Curves (ROC = Receiver Operating Characteristic)
[Figures: ROC curves plotting the trade-off between the two rates below as the
decision threshold varies.]
- Sensitivity = TP / (TP + FN)
- Specificity = TN / (TN + FP)

If Plenty of Data, Evaluate with a Holdout Set
[Figure: the data is split into a training part and a held-out part; train on the
first, evaluate on the second, and measure the error.]
- Often also used for parameter optimization

Finite Cross-Validation Set
- True error (true risk), with D = all data:
  $e_D = \int_{\langle x, y \rangle \in D} |y - f(x, \theta)| \; p(x, y) \, dx \, dy$
- Test error (empirical risk), with S = test data and m = # of test samples:
  $\hat{e}_S = \frac{1}{m} \sum_{\langle x, y \rangle \in S} |y - f(x, \theta)|$
  (here $\theta$ denotes the learned parameters of the hypothesis f)

Confidence Intervals
- If S contains m examples, drawn independently, with m >= 30
- Then, with approximately 95% probability, the true error e_D lies in the interval
  $\hat{e}_S \pm 1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$

Example
- A hypothesis misclassifies 12 out of 40 examples in the cross-validation set S.
- Q: What will the "true" error be on future examples?
- A: With 95% confidence, the true error will be in the interval [0.16, 0.44]:
  $m = 40, \quad \hat{e}_S = \frac{12}{40} = 0.3, \quad
   1.96 \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}} \approx 0.14$
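A small sketch of this confidence-interval computation (the helper function is my own, not from the slides), reproducing the 12-out-of-40 worked example:

```python
import math

def error_confidence_interval(num_errors, m, z=1.96):
    """Approximate confidence interval for the true error; the approximation assumes m >= 30."""
    e_hat = num_errors / m
    half_width = z * math.sqrt(e_hat * (1.0 - e_hat) / m)
    return e_hat, (e_hat - half_width, e_hat + half_width)

e_hat, (lo, hi) = error_confidence_interval(12, 40)   # 12 errors on 40 test cases
print(f"test error {e_hat:.2f}, 95% interval [{lo:.2f}, {hi:.2f}]")   # ~[0.16, 0.44]
```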
Confidence Intervals (general case)
- If S contains m examples, drawn independently, with m >= 30
- Then, with approximately N% probability, the true error e_D lies in the interval
  $\hat{e}_S \pm z_N \sqrt{\frac{\hat{e}_S (1 - \hat{e}_S)}{m}}$

  N%:   50%   68%   80%   90%   95%   98%   99%
  z_N:  0.67  1.00  1.28  1.64  1.96  2.33  2.58

Finite Cross-Validation Set (cont'd)
- True error:  $e_D = \int_{\langle x, y \rangle \in D} |y - f(x, \theta)| \; p(x, y) \, dx \, dy$
- Test error:  $\hat{e}_S = \frac{1}{m} \sum_{\langle x, y \rangle \in S} |y - f(x, \theta)|$
- The number of test errors, $k = \sum_{\langle x, y \rangle \in S} |y - f(x, \theta)|$,
  is binomially distributed:
  $P(k) = \frac{m!}{k! \, (m - k)!} \, (e_D)^k \, (1 - e_D)^{m - k}$

k-fold Cross Validation
[Figure: the data is split k ways; in each round the model is trained on k-1 parts
(yellow) and evaluated on the remaining part (pink), giving error_1, ..., error_k.]
- error = (1/k) * sum_i error_i

Cross Validation Procedure
- Purpose: evaluate DM accuracy using only the training data
- Experiment: try different similarity functions, etc.
- Process:
  - Divide the training data into k equal pieces (each piece is called a "fold")
  - Train the classifier using all but the k-th fold
  - Test for accuracy on the k-th fold
  - Repeat with the (k-1)-th fold held out for testing, then the (k-2)-th fold,
    until the classifier has been tested on all folds
  - Report the average accuracy across folds
  (a code sketch of this procedure, and of the paired comparison below, follows)

The Jackknife
[Figure: a data-resampling diagram; no further content recoverable from this slide.]

Comparing Different Hypotheses: Paired t Test
- True difference between two hypotheses:  $d = e_D(h_1) - e_D(h_2)$
- For each partition (fold) k, using the test errors on that partition:
  $\hat{d}_k = \hat{e}_{S,k}(h_1) - \hat{e}_{S,k}(h_2)$
- Average:  $\bar{d} = \frac{1}{k} \sum_{i=1}^{k} \hat{d}_i$
- N% confidence interval (k - 1 degrees of freedom, N = confidence level):
  $\bar{d} \pm t_{N, k-1} \sqrt{\frac{1}{k (k - 1)} \sum_{i=1}^{k} (\hat{d}_i - \bar{d})^2}$
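A sketch of k-fold cross validation and the paired comparison, under my own assumptions: `model_factory` is any callable returning an object with scikit-learn-style `fit`/`predict` methods, and the hard-coded 2.262 is the two-sided 95% t value for k - 1 = 9 degrees of freedom (k = 10 folds).

```python
import numpy as np

def k_fold_errors(model_factory, X, y, k=10, seed=0):
    """Per-fold test error rates: train on k-1 folds, test on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = model_factory()                       # fresh model for each fold
        model.fit(X[train], y[train])
        errors.append(float(np.mean(model.predict(X[test]) != y[test])))
    return np.array(errors)                           # average accuracy = 1 - errors.mean()

def paired_t_interval(err_1, err_2, t_crit=2.262):
    """Confidence interval for the true error difference d = e(h1) - e(h2),
    paired over the same k folds; t_crit plays the role of t_{N, k-1}."""
    d = np.asarray(err_1) - np.asarray(err_2)
    k = len(d)
    d_bar = d.mean()
    half = t_crit * np.sqrt(np.sum((d - d_bar) ** 2) / (k * (k - 1)))
    return d_bar - half, d_bar + half
```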
Version Spaces (Mitchell, 1980)
[Figure: a generalization lattice with "Anything" at the top and specific instances
at the bottom; the G boundary and the S boundary delimit the region containing the
"target" concept. With branching factor b and depth N, the lattice has up to b^N nodes.]

Original & Seeded Version Spaces
- Version spaces (Mitchell, 1980)
  - Symbolic multivariate learning
  - S & G sets define the lattice boundaries
  - Exponential worst case: O(b^N)
- Seeded Version Spaces (Carbonell, 2002)
  - Generality-level hypothesis seed
  - S & G subsets -> effective (reduced) lattice
  - Polynomial worst case: O(b^(k/2)), k = 3, 4

Seeded Version Spaces (Carbonell, 2002)
[Figure: a version-space lattice over transfer rules X_n <-> Y_m for the pair
"The big book" <-> "el libro grande"; the G boundary holds the general pattern
Det Adj N <-> Det N Adj with agreement constraints (Y2 num) = (Y3 num),
(Y2 gen) = (Y3 gen), (X3 num) = (Y2 num); the S boundary holds the specific
instance, and the "target" concept lies in between.]

Seeded Version Spaces (cont'd)
[Figure: the same lattice with a seed (best guess) hypothesis and a radius k around
it bounding the effective S and G boundaries that need to be searched.]

Naïve Bayes Classification
- Some notation:
  - Training instance index i = 1, 2, ..., I
  - Term index j = 1, 2, ..., J
  - Category index k = 1, 2, ..., K
  - Training data $D^{(k)} = ((x_i, y_i^{(k)}))$
  - Instance feature vector $x_i = (1, n_{i1}, n_{i2}, \ldots, n_{iJ})$
  - Output labels $y_i = (y_i^{(1)}, y_i^{(2)}, \ldots, y_i^{(K)})$, with $y_i^{(k)} \in \{0, 1\}$

Bayes Classifier
- Assigning the most probable category to x:
  $\hat{c} = \arg\max_k P(c_k \mid x)
           = \arg\max_k \frac{P(c_k) \, P(x \mid c_k)}{P(x)}$   (Bayes rule)
  $\quad\; = \arg\max_k P(c_k) \, P(x \mid c_k)
           = \arg\max_k \left[ \log P(c_k) + \log P(x \mid c_k) \right]$
- $\hat{P}(c_k) = \frac{\#\text{ of training instances in } c_k}{I}$   (MLE)
- $\hat{P}(x_i \mid c_k) = \hat{P}(n_{i1}, \ldots, n_{iJ} \mid c_k) = \;?$   (multinomial distribution)

Maximum Likelihood Estimate (MLE)
- n: # of objects in a random sample from a population
- m: # of instances of a category among the n-object sample
- p: true probability of any object belonging to the category
- The likelihood of observing the data given model p is defined as:
  $L(D_n \mid p) = P(D_n \mid p) = P(Y_1, \ldots, Y_n \mid p), \quad Y_i \in \{0, 1\}, \; Y_i \sim Ber(p)$
  $= \prod_{i=1}^{n} P(Y_i \mid p) = p^m (1 - p)^{n - m}$   (assuming i.i.d. samples)
  $f(p) = \log\left[ p^m (1 - p)^{n - m} \right] = m \log p + (n - m) \log(1 - p)$
- Setting the derivative of f(p) to zero yields:
  $0 = \frac{d}{dp} f(p) = \frac{m}{p} - \frac{n - m}{1 - p}, \quad
   (1 - p)\, m = (n - m)\, p, \quad p = \frac{m}{n}$

Binomial Distribution
- Consider a coin toss as a Bernoulli process, $X \sim Ber(p)$:
  $P(Head) = p, \quad P(Tail) = 1 - p = q$
- What is the probability of seeing 2 heads out of 5 tosses?
  $P(\#\text{heads} = 2 \mid n = 5) = \binom{5}{2} p^2 q^3 = \frac{5!}{2! \, 3!} p^2 q^3$
- Observing k heads in n tosses follows a binomial distribution:
  $Y = \sum_{i=1}^{n} X_i, \quad Y \sim Bin(n, p), \quad
   P(Y = k) = \binom{n}{k} p^k (1 - p)^{n - k}$

Multinomial Distribution
- Consider tossing a 6-faced die n times with probabilities p_1, p_2, ..., p_6,
  which sum to 1. Taking the count of each face as a random variable, we have a
  multinomial process:
  $(X_1, X_2, \ldots, X_6) \sim Mul(n, p_1, p_2, \ldots, p_6), \quad
   0 \le X_j \le n, \quad \sum_{j=1}^{6} X_j = n$
  $P(X_1 = n_1, \ldots, X_6 = n_6) =
   \binom{n}{n_1 \, n_2 \cdots n_6} \prod_{k=1}^{6} p_k^{n_k}$

Multinomial NB
- The conditional probability is
  $P(x \mid c) = P(n_x \mid c) \, \frac{n_x!}{n_{x1}! \, n_{x2}! \cdots n_{xJ}!}
   \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}$   ($t_j$ is a term)
- We can remove the factors that do not depend on c from the objective function:
  $P(x \mid c) \propto \prod_{j=1}^{J} P(t_j \mid c)^{n_{xj}}, \quad
   \log P(x \mid c) \propto \sum_{j} n_{xj} \log P(t_j \mid c)$

Smoothing Methods
- Laplace smoothing (common):
  $\tilde{P}(t \mid c) = \frac{1 + n_{t|c}}{|V| + \sum_{t' \in V} n_{t'|c}}$
- Two-state Hidden Markov Model (BBN, or Jelinek-Mercer interpolation):
  $\tilde{P}(t \mid c) = \lambda P(t \mid c) + (1 - \lambda) P(t)$
- Hierarchical smoothing (McCallum, ICML'98):
  $\tilde{P}(t \mid c) = \lambda_1 P(t \mid c) + \lambda_2 P(t \mid c^{(2)}) + \cdots + \lambda_h P(t \mid c^{(h)})$
  - The lambdas (summing to 1) are the mixture weights, obtained by running an EM
    algorithm on a validation set.

Basic Assumptions
- Term independence:
  $P(x_i \mid c_k) = P(t_1 \mid c_k)^{n_{i1}} \, P(t_2 \mid c_k)^{n_{i2}} \cdots$
- Exactly one objective attribute y per instance:
  $\sum_k P(c_k) = 1$
- Continuity of instances in the same class (one mode per class):
  $\arg\max_k P(d \mid c_k) = \arg\max_k P(n_d \mid c_k)
   \frac{n_d!}{n_{d,1}! \, n_{d,2}! \cdots n_{d,|V|}!}
   \prod_{t \in V} P(t \mid c_k)^{n_{d,t}}$

NB and Cross Entropy
- Entropy
  - Measures uncertainty: lower entropy means easier predictions
  - Minimum coding length if the distribution p is known:
    $H(p) = -\sum_k p_k \log p_k, \quad p = (p_1, \ldots, p_K), \quad \sum_k p_k = 1$
- Cross entropy
  - Measures the coding length (in # of bits) based on distribution q when the true
    distribution is p:
    $H(p \| q) = -\sum_k p_k \log q_k
               = -\sum_k p_k \log p_k + \sum_k p_k \log \frac{p_k}{q_k}
               = H(p) + D(p \| q)$

NB and Cross Entropy (cont'd)
- Kullback-Leibler (KL) divergence:
  $D(p \| q) = \sum_k p_k \log \frac{p_k}{q_k}$
  - Also called "relative entropy"
  - Measures the difference between two distributions
  - Zero-valued if p = q
  - Not symmetric: p and q are not interchangeable

NB & Cross Entropy (cont'd)
$k^* = \arg\max_k \left[ \log P(c_k) + \sum_{t_j \in x_i} n_{ij} \log P(t_j \mid c_k) \right]$
$\;\;\; = \arg\max_k \left[ \frac{\log P(c_k)}{n_i} + \sum_{t_j \in x_i} \frac{n_{ij}}{n_i} \log P(t_j \mid c_k) \right]$
$\;\;\; = \arg\max_k \left[ \frac{\log P(c_k)}{n_i} + \sum_{t_j} \hat{P}(t_j \mid x_i) \log P(t_j \mid c_k) \right]$
$\;\;\; = \arg\max_k \left[ \frac{\log P(c_k)}{n_i} - H(p_{x_i} \| q_{c_k}) \right]
       = \arg\min_k \left[ -\frac{\log P(c_k)}{n_i} + H(p_{x_i} \| q_{c_k}) \right]$
- A Minimum Description Length (MDL) view of the classifier
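To make the multinomial NB recipe concrete, here is a minimal sketch (class and variable names are my own, and the tiny corpus is made up) that estimates MLE class priors, Laplace-smoothed term probabilities, and classifies by the argmax of log P(c) + sum_j n_j log P(t_j | c):

```python
import math
from collections import Counter, defaultdict

class MultinomialNaiveBayes:
    """Multinomial NB with Laplace (add-one) smoothing over a term vocabulary V."""

    def fit(self, docs, labels):
        # docs: list of token lists; labels: list of category names
        self.classes = sorted(set(labels))
        self.vocab = sorted({t for d in docs for t in d})
        self.log_prior = {}                   # log P(c), MLE
        self.log_cond = defaultdict(dict)     # log P~(t | c), Laplace-smoothed
        for c in self.classes:
            class_docs = [d for d, y in zip(docs, labels) if y == c]
            self.log_prior[c] = math.log(len(class_docs) / len(docs))
            counts = Counter(t for d in class_docs for t in d)
            total = sum(counts.values())
            for t in self.vocab:
                self.log_cond[c][t] = math.log((1 + counts[t]) / (len(self.vocab) + total))
        return self

    def predict(self, doc):
        # argmax_c [ log P(c) + sum_j n_j * log P(t_j | c) ]; terms outside V are skipped
        def score(c):
            return self.log_prior[c] + sum(self.log_cond[c][t] for t in doc if t in self.log_cond[c])
        return max(self.classes, key=score)

# Tiny illustrative corpus (hypothetical customer-center texts)
docs = [["printer", "jam", "error"], ["refund", "invoice"],
        ["printer", "driver"], ["invoice", "overdue"]]
labels = ["support", "billing", "support", "billing"]
model = MultinomialNaiveBayes().fit(docs, labels)
print(model.predict(["printer", "error"]))   # -> "support"
```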
Concluding Remarks on NB
- Pros
  - Explicit probabilistic reasoning
  - Relatively effective; fast online response (as an eager learner)
- Cons
  - The scoring function (a sum of log term probabilities) can be too sensitive to
    measurement errors on rare features
  - The one-class-per-instance assumption imposes both theoretical and practical
    limitations
  - Empirically weak when dealing with rare categories and large feature sets

Statistical Decision Theory
- Random input X in R^J; random output Y in {1, 2, ..., K}; prediction f(X) in {1, 2, ..., K}
- Loss function (0-1 loss for classification):
  L(y(x), f(x)) = 0 iff f(x) = y(x); L(y(x), f(x)) = 1 otherwise
- Expected prediction error (EPE):
  $EPE = E_X \sum_{k=1}^{K} L(Y^{(k)}, f(X)) \, P(Y^{(k)} \mid X)$
- Minimizing the EPE pointwise:
  $\hat{f}(x) = \arg\min_{f(x) \in \{1, \ldots, K\}} \sum_{k=1}^{K} L(k, f(x)) \, P(Y = k \mid X = x)$
  $\quad\quad\;\, = \arg\min_{k \in \{1, \ldots, K\}} \left[ 1 - P(Y = k \mid X = x) \right]
               = \arg\max_{k \in \{1, \ldots, K\}} P(Y = k \mid X = x)$

Selection of ML Algorithm (I)
  Method          Training Data Req.  Random Noise Tolerance  Scalability (atts + data)
  --------------------------------------------------------------------------------------
  Rule Induction  Sparse              None                    Good
  Decision Trees  Sparse-Dense        Some                    Excellent
  Naïve Bayes     Medium-Dense        Some-Good               Medium
  Regression      Medium-Dense        Some-Good               Good
  kNN             Sparse-Dense        Some-Good               Good-Excellent
  SVM             Medium-Dense        Some-Good               Good-Excellent
  Neural Nets     Dense               Poor-Medium             Good

Selection of ML Algorithm (II)
  Method          Quality of Prediction  Explanatory Power  Popularity of Usage
  ------------------------------------------------------------------------------
  Rule Induction  Good, brittle          Very clear         Med, declining
  Decision Trees  Good / category        Very clear         High, stable
  Naïve Bayes     Medium / cat           Partial            Med, declining
  Regression      Good / both            Partial-Poor       High, stable
  kNN             Good / both            Partial-Good       Med, increasing
  SVM             Very good / cat        Poor               Med, increasing
  Neural Nets     Good / cat             Poor               High, declining