CMSC 471 Spring 2014
Class #16, Thursday, March 27, 2014
Machine Learning II
Professor Marie desJardins, mariedj@cs.umbc.edu

Computing Information Gain

Gain(A, S) = I(S) − I(A, S), where I(A, S) = Σ v ∈ Values(A) (|Sv| / |S|) I(Sv)

The twelve restaurant examples (WillWait = Y or N), split two ways:
  Patrons:  Empty: {N, N}   Some: {Y, Y, Y, Y}   Full: {Y, Y, N, N, N, N}
  Type:     French: {Y, N}   Italian: {Y, N}   Thai: {Y, Y, N, N}   Burger: {Y, Y, N, N}

I(T) =
I(Pat, T) =
I(Type, T) =
Gain(Pat, T) =
Gain(Type, T) =

Computing Information Gain
• I(T) = −(.5 log .5 + .5 log .5) = .5 + .5 = 1
• I(Pat, T) = 1/6 (0) + 1/3 (0) + 1/2 (−(2/3 log 2/3 + 1/3 log 1/3)) = 1/2 (2/3 · .6 + 1/3 · 1.6) ≈ .47
• I(Type, T) = 1/6 (1) + 1/6 (1) + 1/3 (1) + 1/3 (1) = 1
Gain(Pat, T) = 1 − .47 = .53
Gain(Type, T) = 1 − 1 = 0

Using Gain Ratios
• The information gain criterion favors attributes that have a large number of values
  – If we have an attribute D that has a distinct value for each record, then I(D, T) is 0, thus Gain(D, T) is maximal
• To compensate for this, Quinlan suggests using the following ratio instead of Gain:
  GainRatio(D, T) = Gain(D, T) / SplitInfo(D, T)
• SplitInfo(D, T) is the information due to the split of T on the basis of the value of the categorical attribute D:
  SplitInfo(D, T) = I(|T1|/|T|, |T2|/|T|, ..., |Tm|/|T|)
  where {T1, T2, ..., Tm} is the partition of T induced by the value of D

Computing Gain Ratio
Given: I(T) = 1, I(Pat, T) = .47, I(Type, T) = 1, Gain(Pat, T) = .53, Gain(Type, T) = 0
SplitInfo(Pat, T) =
SplitInfo(Type, T) =
GainRatio(Pat, T) = Gain(Pat, T) / SplitInfo(Pat, T) = .53 / ______ =
GainRatio(Type, T) = Gain(Type, T) / SplitInfo(Type, T) = 0 / ____ = 0 !!

Computing Gain Ratio
SplitInfo(Pat, T) = −(1/6 log 1/6 + 1/3 log 1/3 + 1/2 log 1/2) = 1/6 · 2.6 + 1/3 · 1.6 + 1/2 · 1 ≈ 1.47
SplitInfo(Type, T) = −(1/6 log 1/6 + 1/6 log 1/6 + 1/3 log 1/3 + 1/3 log 1/3) = 1/6 · 2.6 + 1/6 · 2.6 + 1/3 · 1.6 + 1/3 · 1.6 ≈ 1.93
GainRatio(Pat, T) = Gain(Pat, T) / SplitInfo(Pat, T) = .53 / 1.47 ≈ .36
GainRatio(Type, T) = Gain(Type, T) / SplitInfo(Type, T) = 0 / 1.93 = 0
(A short Python sketch that reproduces these numbers appears after the Naive Bayes example below.)

Bayesian Learning
Chapter 20.1–20.2
Some material adapted from lecture notes by Lise Getoor and Ron Parr

Naïve Bayes
• Use Bayesian modeling
• Make the simplest possible independence assumption:
  – Each attribute is independent of the values of the other attributes, given the class variable
  – In our restaurant domain: Cuisine is independent of Patrons, given a decision to stay (or not)

Bayesian Formulation
• p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / p(F1, ..., Fn) = α p(C) p(F1, ..., Fn | C)
• Assume that each feature Fi is conditionally independent of the other features given the class C. Then:
  p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
• We can estimate each of these conditional probabilities from the observed counts in the training data:
  p(Fi | C) = N(Fi ∧ C) / N(C)
  – One subtlety of using the algorithm in practice: when your estimated probabilities are zero, ugly things happen
  – The fix: add one to every count (a.k.a. "Laplacian smoothing"; they have a different name for everything!)

Naive Bayes: Example
• p(Wait | Cuisine, Patrons, Rainy?)
  = α p(Wait) p(Cuisine ∧ Patrons ∧ Rainy? | Wait)
  = α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)
  (the last step is the naive Bayes assumption: is it reasonable?)
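To make the information gain and gain ratio computations above concrete, here is a minimal Python sketch (not part of the original slides). The twelve (Patrons, Type, WillWait) tuples are assumed to follow the standard restaurant example's distribution, the helper names (entropy, remainder, gain_ratio) are mine, and the exact outputs differ slightly from the slides' rounded figures.

```python
import math
from collections import Counter

def entropy(labels):
    """I(S): entropy, in bits, of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def remainder(attr, labels):
    """I(A, S): expected entropy of the label after splitting on attribute A."""
    n = len(labels)
    return sum(
        attr.count(v) / n * entropy([l for a, l in zip(attr, labels) if a == v])
        for v in set(attr)
    )

def gain(attr, labels):
    return entropy(labels) - remainder(attr, labels)

def split_info(attr):
    # SplitInfo is the entropy of the partition induced by the attribute's values
    return entropy(attr)

def gain_ratio(attr, labels):
    si = split_info(attr)
    return gain(attr, labels) / si if si else 0.0

# Twelve examples assumed to match the restaurant distribution used in the slides:
# Patrons in {Empty, Some, Full}, Type in {French, Italian, Thai, Burger}, label Y/N
patrons = ["Some", "Full", "Some", "Full", "Full", "Some",
           "Empty", "Some", "Full", "Full", "Empty", "Full"]
types = ["French", "Thai", "Burger", "Thai", "French", "Italian",
         "Burger", "Thai", "Burger", "Italian", "Thai", "Burger"]
wait = ["Y", "N", "Y", "Y", "N", "Y", "N", "Y", "N", "N", "N", "Y"]

print(round(gain(patrons, wait), 3))        # 0.541 (slides round to .53)
print(round(gain(types, wait), 3))          # 0.0
print(round(split_info(patrons), 3))        # 1.459 (slides: 1.47)
print(round(gain_ratio(patrons, wait), 3))  # 0.371 (slides: .36)
```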
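The classification rule and Laplacian smoothing just described can also be sketched directly. This is a minimal illustration assuming categorical features; the train_nb/predict_nb helpers and the toy restaurant-style records are hypothetical, not code from the course.

```python
import math
from collections import Counter, defaultdict

def train_nb(X, y):
    """X: list of dicts {feature: value}; y: list of class labels.
    Collects the counts N(C) and N(Fi ∧ C) needed for the naive Bayes estimates."""
    class_counts = Counter(y)                                # N(C)
    feat_counts = defaultdict(lambda: defaultdict(Counter))  # feat_counts[c][f][v] = N(F=v ∧ C=c)
    feat_values = defaultdict(set)                           # observed values of each feature
    for xi, c in zip(X, y):
        for f, v in xi.items():
            feat_counts[c][f][v] += 1
            feat_values[f].add(v)
    return class_counts, feat_counts, feat_values

def predict_nb(x, class_counts, feat_counts, feat_values):
    """Return argmax_c p(c) * prod_i p(f_i | c), with add-one (Laplacian) smoothing.
    Computed in log space to avoid underflow."""
    n = sum(class_counts.values())
    best, best_lp = None, -math.inf
    for c, nc in class_counts.items():
        lp = math.log(nc / n)
        for f, v in x.items():
            num = feat_counts[c][f][v] + 1          # add one to every count
            den = nc + len(feat_values[f])
            lp += math.log(num / den)
        if lp > best_lp:
            best, best_lp = c, lp
    return best

# Toy data with made-up feature values, just to exercise the code
X = [{"Patrons": "Some", "Rainy": "No"},  {"Patrons": "Full", "Rainy": "Yes"},
     {"Patrons": "Some", "Rainy": "Yes"}, {"Patrons": "Empty", "Rainy": "No"}]
y = ["Wait", "Leave", "Wait", "Leave"]
model = train_nb(X, y)
print(predict_nb({"Patrons": "Some", "Rainy": "No"}, *model))   # -> Wait
```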
Naive Bayes: Analysis
• Naive Bayes is amazingly easy to implement (once you understand the bit of math behind it)
• Remarkably, naive Bayes can outperform many much more complex algorithms; it's a baseline that should pretty much always be used for comparison
• Naive Bayes can't capture interdependencies between variables (obviously); for that, we need Bayes nets!

Learning Bayesian Networks

Bayesian Learning: Bayes' Rule
• Given some model space (set of hypotheses hi) and evidence (data D):
  – P(hi | D) = α P(D | hi) P(hi)
• We assume that observations are independent of each other, given a model (hypothesis), so:
  – P(hi | D) = α Πj P(dj | hi) P(hi)
• To predict the value of some unknown quantity X (e.g., the class label for a future observation):
  – P(X | D) = Σi P(X | D, hi) P(hi | D) = Σi P(X | hi) P(hi | D)
    (these two sums are equal by our independence assumption)

Bayesian Learning
• We can apply Bayesian learning in three basic ways:
  – BMA (Bayesian Model Averaging): Don't just choose one hypothesis; instead, make predictions based on the weighted average of all hypotheses (or some set of best hypotheses)
  – MAP (Maximum A Posteriori) hypothesis: Choose the hypothesis with the highest a posteriori probability, given the data
  – MLE (Maximum Likelihood Estimate): Assume that all hypotheses are equally likely a priori; then the best hypothesis is just the one that maximizes the likelihood (i.e., the probability of the data given the hypothesis)
• MDL (Minimum Description Length) principle: Use some encoding to model the complexity of the hypothesis and the fit of the data to the hypothesis, then minimize the overall description length of hi + D

Learning Bayesian Networks
• Given training set D = {x[1], ..., x[M]}
• Find B that best matches D
  – model selection
  – parameter estimation
(Figure: the data D, records x[1], ..., x[M] over the variables E, B, A, C, is fed to an Inducer, which outputs the network B over E, B, A, C.)

Parameter Estimation
• Assume known structure
• Goal: estimate BN parameters θ
  – entries in local probability models, P(X | Parents(X))
• A parameterization θ is good if it is likely to generate the observed data (i.i.d. samples):
  L(θ : D) = P(D | θ) = Πm P(x[m] | θ)
• Maximum Likelihood Estimation (MLE) principle: Choose θ* so as to maximize L

Parameter Estimation II
• The likelihood decomposes according to the structure of the network → we get a separate estimation task for each parameter
• The MLE (maximum likelihood estimate) solution: for each value x of a node X and each instantiation u of Parents(X),
  θ*x|u = N(x, u) / N(u)
  where the counts N(x, u) and N(u) are the sufficient statistics
  – Just need to collect the counts for every combination of parents and children observed in the data
  – MLE is equivalent to an assumption of a uniform prior over parameter values

Sufficient Statistics: Example
• Why are the counts sufficient?
(Figure: the network Moon-phase → Light-level; Earthquake → Alarm ← Burglary)
θ*x|u = N(x, u) / N(u), e.g. θ*A|E,B = N(A, E, B) / N(E, B)
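As a concrete illustration of θ*x|u = N(x, u) / N(u), here is a small Python sketch that tallies the sufficient statistics for one CPT from fully observed data. The mle_cpt helper and the alarm-network records are assumptions made up for this example, not part of the lecture.

```python
from collections import Counter

def mle_cpt(data, child, parents):
    """Maximum likelihood estimate of P(child | parents): theta*_{x|u} = N(x, u) / N(u).
    data: list of dicts mapping variable name -> observed value (fully observed cases)."""
    joint = Counter()    # N(x, u)
    parent = Counter()   # N(u)
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(row[child], u)] += 1
        parent[u] += 1
    return {(x, u): n / parent[u] for (x, u), n in joint.items()}

# Hypothetical alarm-network data: each row is one fully observed case
data = [
    {"Earthquake": 0, "Burglary": 0, "Alarm": 0},
    {"Earthquake": 0, "Burglary": 1, "Alarm": 1},
    {"Earthquake": 1, "Burglary": 0, "Alarm": 1},
    {"Earthquake": 0, "Burglary": 0, "Alarm": 0},
    {"Earthquake": 0, "Burglary": 1, "Alarm": 0},
]
cpt = mle_cpt(data, "Alarm", ["Earthquake", "Burglary"])
# theta*_{Alarm=1 | E=0, B=1} = N(A=1, E=0, B=1) / N(E=0, B=1) = 1/2
print(cpt[(1, (0, 1))])   # 0.5
```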
Model Selection
• Goal: Select the best network structure, given the data
• Input:
  – Training data
  – Scoring function
• Output:
  – A network that maximizes the score

Structure Selection: Scoring
• Bayesian: prior over parameters and structure
  – get balance between model complexity and fit to data as a byproduct
• Score(G : D) = log P(G | D) ∝ log [P(D | G) P(G)], where P(D | G) is the marginal likelihood and P(G) is the prior
• The marginal likelihood just comes from our parameter estimates
• The prior on structure can be any measure we want; typically a function of the network complexity
• Same key property, decomposability:
  Score(structure) = Σi Score(family of Xi)

Heuristic Search
(Figure: candidate network structures over the variables B, E, A, C.)

Exploiting Decomposability
(Figure: the same candidate structures over B, E, A, C.)
• To recompute scores, only need to re-score families that changed in the last move

Variations on a Theme
• Known structure, fully observable: only need to do parameter estimation
• Unknown structure, fully observable: do heuristic search through structure space, then parameter estimation
• Known structure, missing values: use expectation maximization (EM) to estimate parameters
• Known structure, hidden variables: apply adaptive probabilistic network (APN) techniques
• Unknown structure, hidden variables: too hard to solve!

Handling Missing Data
• Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
• Should we throw that data away??
• Idea: Guess the missing values based on the other data
(Figure: the network Moon-phase → Light-level; Earthquake → Alarm ← Burglary)

EM (Expectation Maximization)
• Guess probabilities for nodes with missing values (e.g., based on other observations)
• Compute the probability distribution over the missing values, given our guess
• Update the probabilities based on the guessed values
• Repeat until convergence

EM Example
• Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
• We estimate the CPTs based on the rest of the data
• We then estimate P(Burglary) for November 27 from those CPTs
• Now we recompute the CPTs as if that estimated value had been observed
• Repeat until convergence!
(Figure: Earthquake → Alarm ← Burglary)
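A deliberately simplified sketch of the EM loop above: it estimates only the prior θ = P(Burglary = 1) when Burglary is never observed, treating the Alarm CPT as fixed and known. The function name, CPT numbers, and records are hypothetical; a full EM for the network would also re-estimate the CPTs in the M-step.

```python
def em_burglary_prior(records, p_alarm, theta=0.5, iters=50):
    """Estimate theta = P(Burglary=1) by EM when Burglary is never observed.
    records: list of (alarm, earthquake) observations, each 0 or 1.
    p_alarm[(b, e)]: known CPT P(Alarm=1 | Burglary=b, Earthquake=e), assumed given."""
    for _ in range(iters):
        # E-step: posterior P(B=1 | a, e) for each record under the current theta
        posts = []
        for a, e in records:
            like1 = p_alarm[(1, e)] if a else 1 - p_alarm[(1, e)]
            like0 = p_alarm[(0, e)] if a else 1 - p_alarm[(0, e)]
            posts.append(theta * like1 / (theta * like1 + (1 - theta) * like0))
        # M-step: re-estimate theta from the expected counts
        theta = sum(posts) / len(posts)
    return theta

# Hypothetical CPT and data: alarms are much more likely when there is a burglary
p_alarm = {(1, 1): 0.95, (1, 0): 0.90, (0, 1): 0.30, (0, 0): 0.05}
records = [(1, 0), (0, 0), (0, 0), (1, 1), (0, 1), (0, 0)]
print(em_burglary_prior(records, p_alarm))  # theta settles at a value that explains the alarm rate
```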