CMSC 471
Spring 2014
Class #16
Thursday, March 27, 2014
Machine Learning II
Professor Marie desJardins, mariedj@cs.umbc.edu
Computing Information Gain
Gain(A,S) = I(S) - I(A,S) = I(S) - Σv∈Values(A) (|Sv| / |S|) I(Sv)
[Figure: the restaurant training set T split two ways: by Type (French, Italian, Thai, Burger) and by Patrons (Some, Full, Empty), with the positive (Y) and negative (N) examples under each value.]
Exercise: compute I(T), I(Pat, T), I(Type, T), Gain(Pat, T), and Gain(Type, T). (Worked out on the next slide.)
Computing Information Gain
• I(T) = - (.5 log .5 + .5 log .5) = .5 + .5 = 1
• I(Pat, T) = 1/6 (0) + 1/3 (0) + 1/2 (- (2/3 log 2/3 + 1/3 log 1/3)) = 1/2 (2/3*.6 + 1/3*1.6) = .47
• I(Type, T) = 1/6 (1) + 1/6 (1) + 1/3 (1) + 1/3 (1) = 1
[Figure: the same Type and Patrons splits of the restaurant data as on the previous slide.]
Gain(Pat, T) = 1 - .47 = .53
Gain(Type, T) = 1 - 1 = 0
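The arithmetic above is easy to check with a short script. The following is a minimal sketch, not code from the lecture: the (positive, negative) counts are read off the restaurant figure, and log means log base 2 as in the slides.

from math import log2

def entropy(pos, neg):
    """I of a set containing pos positive and neg negative examples."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # treat 0 log 0 as 0
            result -= p * log2(p)
    return result

def gain(splits):
    """Gain(A, S) = I(S) - sum over v of |Sv|/|S| * I(Sv).
    splits maps each value of attribute A to its (pos, neg) counts."""
    total = sum(p + n for p, n in splits.values())
    pos = sum(p for p, _ in splits.values())
    remainder = sum((p + n) / total * entropy(p, n) for p, n in splits.values())
    return entropy(pos, total - pos) - remainder

# (pos, neg) counts for each attribute value in the 12-example restaurant data
patrons = {"Empty": (0, 2), "Some": (4, 0), "Full": (2, 4)}
cuisine = {"French": (1, 1), "Italian": (1, 1), "Thai": (2, 2), "Burger": (2, 2)}

print(gain(patrons))    # about .54; the slides round the logs and get .53
print(gain(cuisine))    # 0.0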
Using Gain Ratios
• The information gain criterion favors attributes that have a large
number of values
– If we have an attribute D that has a distinct value for each
record, then I(D,T) is 0, thus Gain(D,T) is maximal
• To compensate for this, Quinlan suggests using the following
ratio instead of Gain:
GainRatio(D,T) = Gain(D,T) / SplitInfo(D,T)
• SplitInfo(D,T) is the information due to the split of T on the
basis of value of categorical attribute D
SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)
where {T1, T2, .. Tm} is the partition of T induced by value of D
Computing Gain Ratio
SplitInfo(D,T) = I(|T1|/|T|, |T2|/|T|, .., |Tm|/|T|)
I(T) = 1
I(Pat, T) = .47
I(Type, T) = 1
Gain(Pat, T) = .53
Gain(Type, T) = 0
[Figure: the same Type and Patrons splits of the restaurant data.]
SplitInfo (Pat, T) =
SplitInfo (Type, T) =
GainRatio (Pat, T) = Gain (Pat, T) / SplitInfo(Pat, T) = .53 / ______ =
GainRatio (Type, T) = Gain (Type, T) / SplitInfo (Type, T) = 0 / ____ = 0 !!
Computing Gain Ratio
• I(T) = 1
• I(Pat, T) = .47
• I(Type, T) = 1
Gain(Pat, T) = .53
Gain(Type, T) = 0
[Figure: the same Type and Patrons splits of the restaurant data.]
SplitInfo(Pat, T) = - (1/6 log 1/6 + 1/3 log 1/3 + 1/2 log 1/2) = 1/6*2.6 + 1/3*1.6 + 1/2*1 = 1.47
SplitInfo(Type, T) = - (1/6 log 1/6 + 1/6 log 1/6 + 1/3 log 1/3 + 1/3 log 1/3) = 1/6*2.6 + 1/6*2.6 + 1/3*1.6 + 1/3*1.6 = 1.93
GainRatio(Pat, T) = Gain(Pat, T) / SplitInfo(Pat, T) = .53 / 1.47 = .36
GainRatio(Type, T) = Gain(Type, T) / SplitInfo(Type, T) = 0 / 1.93 = 0
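As a hedged sketch rather than lecture code, SplitInfo and GainRatio can be added to the script from the information-gain slide (it reuses the gain function and the patrons and cuisine counts defined there):

from math import log2

def split_info(splits):
    """SplitInfo(A, T): the information in the partition of T induced by attribute A."""
    total = sum(p + n for p, n in splits.values())
    return -sum((p + n) / total * log2((p + n) / total) for p, n in splits.values())

def gain_ratio(splits):
    return gain(splits) / split_info(splits)

print(gain_ratio(patrons))   # about .37; the slides' rounded values give .36
print(gain_ratio(cuisine))   # 0.0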
Bayesian Learning
Chapter 20.1-20.2
Some material adapted
from lecture notes by
Lise Getoor and Ron Parr
Naïve Bayes
Naïve Bayes
• Use Bayesian modeling
• Make the simplest possible independence assumption:
– Each attribute is independent of the values of the other attributes,
given the class variable
– In our restaurant domain: Cuisine is independent of Patrons, given a
decision to stay (or not)
Bayesian Formulation
• p(C | F1, ..., Fn) = p(C) p(F1, ..., Fn | C) / P(F1, ..., Fn)
= α p(C) p(F1, ..., Fn | C)
• Assume that each feature Fi is conditionally independent of
the other features given the class C. Then:
p(C | F1, ..., Fn) = α p(C) Πi p(Fi | C)
• We can estimate each of these conditional probabilities
from the observed counts in the training data:
p(Fi | C) = N(Fi ∧ C) / N(C)
– One subtlety of using the algorithm in practice: When your
estimated probabilities are zero, ugly things happen
– The fix: Add one to every count (aka “Laplacian smoothing”—they
have a different name for everything!)
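As a concrete illustration, here is a minimal naive Bayes sketch in Python (not code from the lecture; the class name and the dictionary-based data format are assumptions). It estimates p(C) and each p(Fi | C) from counts and applies the add-one smoothing described above.

from collections import Counter, defaultdict

class NaiveBayes:
    def fit(self, examples, labels):
        """examples: list of dicts mapping feature name -> value; labels: class labels."""
        self.n = len(labels)
        self.class_counts = Counter(labels)
        self.counts = defaultdict(Counter)   # counts[(feature, class)][value] = N(Fi ∧ C)
        self.values = defaultdict(set)       # distinct values seen for each feature
        for x, c in zip(examples, labels):
            for f, v in x.items():
                self.counts[(f, c)][v] += 1
                self.values[f].add(v)
        return self

    def posterior(self, x):
        """p(C | F1, ..., Fn) = alpha p(C) Πi p(Fi | C), with add-one smoothing."""
        scores = {}
        for c, nc in self.class_counts.items():
            score = nc / self.n                              # p(C)
            for f, v in x.items():
                num = self.counts[(f, c)][v] + 1             # add one to every count
                den = nc + len(self.values[f])
                score *= num / den                           # p(Fi = v | C)
            scores[c] = score
        z = sum(scores.values())                             # alpha = 1/z
        return {c: s / z for c, s in scores.items()}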
Naive Bayes: Example
• p(Wait | Cuisine, Patrons, Rainy?)
= α p(Wait) p(Cuisine ∧ Patrons ∧ Rainy? | Wait)
= α p(Wait) p(Cuisine | Wait) p(Patrons | Wait) p(Rainy? | Wait)
naive Bayes assumption: is it reasonable?
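Using the sketch above on a few invented restaurant-style records (the records below are made up for illustration; they are not the lecture's training set):

data = [
    {"Cuisine": "Thai",   "Patrons": "Full", "Rainy": True},
    {"Cuisine": "Burger", "Patrons": "Some", "Rainy": False},
    {"Cuisine": "Thai",   "Patrons": "Some", "Rainy": False},
    {"Cuisine": "French", "Patrons": "Full", "Rainy": True},
]
labels = ["No", "Yes", "Yes", "No"]              # Wait?
model = NaiveBayes().fit(data, labels)
print(model.posterior({"Cuisine": "Thai", "Patrons": "Some", "Rainy": True}))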
Naive Bayes: Analysis
• Naive Bayes is amazingly easy to implement (once you
understand the bit of math behind it)
• Remarkably, naive Bayes can outperform many much more
complex algorithms—it’s a baseline that should pretty much
always be used for comparison
• Naive Bayes can’t capture interdependencies between
variables (obviously)—for that, we need Bayes nets!
Learning Bayesian Networks
Bayesian Learning: Bayes’ Rule
• Given some model space (set of hypotheses hi) and
evidence (data D):
– P(hi|D) = α P(D|hi) P(hi)
• We assume that observations are independent of each other,
given a model (hypothesis), so:
– P(hi|D) = α Πj P(dj|hi) P(hi)
• To predict the value of some unknown quantity, X
(e.g., the class label for a future observation):
– P(X|D) = Σi P(X|D, hi) P(hi|D) = Σi P(X|hi) P(hi|D)
(P(X|D, hi) and P(X|hi) are equal by our independence assumption.)
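A small numerical sketch of these formulas (the hypothesis space, priors, and data below are invented for illustration): each hypothesis assigns a probability to a single binary observation, the posterior follows the product rule above, and the prediction averages over hypotheses.

from math import prod

hypotheses = {"h1": 0.1, "h2": 0.5, "h3": 0.9}   # P(dj = 1 | hi) for each hypothesis
prior = {"h1": 0.3, "h2": 0.4, "h3": 0.3}        # P(hi)
data = [1, 1, 0, 1]                              # i.i.d. observations dj

def likelihood(h, d):
    return hypotheses[h] if d == 1 else 1 - hypotheses[h]

# P(hi | D) = alpha * P(hi) * prod_j P(dj | hi)
unnorm = {h: prior[h] * prod(likelihood(h, d) for d in data) for h in hypotheses}
alpha = 1 / sum(unnorm.values())
posterior = {h: alpha * w for h, w in unnorm.items()}

# Prediction: P(X = 1 | D) = sum_i P(X = 1 | hi) P(hi | D)
print(posterior, sum(hypotheses[h] * posterior[h] for h in hypotheses))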
Bayesian Learning
• We can apply Bayesian learning in three basic ways:
– BMA (Bayesian Model Averaging): Don’t just choose one
hypothesis; instead, make predictions based on the weighted average
of all hypotheses (or some set of best hypotheses)
– MAP (Maximum A Posteriori) hypothesis: Choose the hypothesis
with the highest a posteriori probability, given the data
– MLE (Maximum Likelihood Estimate): Assume that all
hypotheses are equally likely a priori; then the best hypothesis is just
the one that maximizes the likelihood (i.e., the probability of the data
given the hypothesis)
• MDL (Minimum Description Length) principle: Use
some encoding to model the complexity of the hypothesis,
and the fit of the data to the hypothesis, then minimize the
overall description of hi + D
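Continuing the sketch after the Bayes' rule slide, the three Bayesian strategies use that posterior differently; a hedged illustration with the same invented hypotheses and data:

# BMA: predict with the posterior-weighted average, as computed above.
# MAP: choose the single hypothesis with the highest posterior probability.
h_map = max(posterior, key=posterior.get)
# MLE: assume a uniform prior, i.e., choose the hypothesis that maximizes the likelihood alone.
h_mle = max(hypotheses, key=lambda h: prod(likelihood(h, d) for d in data))
print(h_map, h_mle)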
Learning Bayesian Networks
• Given training set D = { x[1], ..., x[M] }
• Find B that best matches D
– model selection
– parameter estimation
[Figure: the data set D, a matrix with one row per example (entries E[m], B[m], A[m], C[m] for m = 1..M), is fed to an Inducer, which outputs a Bayesian network over the variables B, E, A, and C.]
Parameter Estimation
• Assume known structure
• Goal: estimate BN parameters θ
– entries in local probability models, P(X | Parents(X))
• A parameterization θ is good if it is likely to generate the observed data:
L(θ : D) = P(D | θ) = Πm P(x[m] | θ)    (i.i.d. samples)
• Maximum Likelihood Estimation (MLE) Principle:
Choose θ* so as to maximize L
Parameter Estimation II
• The likelihood decomposes according to the structure of
the network
→ we get a separate estimation task for each parameter
• The MLE (maximum likelihood estimate) solution:
– for each value x of a node X
– and each instantiation u of Parents(X)
θ*x|u = N(x, u) / N(u)      (the counts N(x, u) and N(u) are the sufficient statistics)
– Just need to collect the counts for every combination of parents
and children observed in the data
– MLE is equivalent to an assumption of a uniform prior over
parameter values
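A hedged sketch of this estimator (the row-of-dicts data format is an assumption, not the lecture's): collect the counts N(x, u) and N(u) from the data and divide.

from collections import Counter

def mle_cpt(data, child, parents):
    """Estimate P(child | parents): returns a dict mapping (parent values, child value)
    to N(x, u) / N(u), where the counts are taken over the rows of data."""
    joint, marginal = Counter(), Counter()
    for row in data:                       # each row is a dict: variable name -> value
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1        # N(x, u)
        marginal[u] += 1                   # N(u)
    return {(u, x): n / marginal[u] for (u, x), n in joint.items()}

# e.g., theta*_{A | E, B} = N(A, E, B) / N(E, B):
rows = [{"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 1, "A": 1}, {"E": 0, "B": 0, "A": 0}]
print(mle_cpt(rows, child="A", parents=["E", "B"]))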
Sufficient Statistics: Example
• Why are the counts sufficient?
[Figure: a network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm, with Earthquake and Burglary as the parents of Alarm.]
θ*x|u = N(x, u) / N(u)
θ*A|E,B = N(A, E, B) / N(E, B)
Model Selection
Goal: Select the best network structure, given the data
Input:
– Training data
– Scoring function
Output:
– A network that maximizes the score
Structure Selection: Scoring
• Bayesian: prior over parameters and structure
– get balance between model complexity and fit to data as a byproduct
• Score(G:D) = log P(G|D) ≅ log [P(D|G) P(G)]
(equal up to an additive constant; P(D|G) is the marginal likelihood, P(G) the prior)
• Marginal likelihood just comes from our parameter estimates
• Prior on structure can be any measure we want; typically a
function of the network complexity
Same key property: Decomposability
Score(structure) = Σi Score(family of Xi)
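A hedged sketch of a decomposable score (the BIC-style penalty below is one common choice, not necessarily the one used in the lecture): each family is scored from its own counts, and the structure score is just the sum, so a local change to one family only requires re-scoring that family.

from collections import Counter
from math import log

def family_score(data, child, parents):
    """Log-likelihood of child given its parents, minus a complexity penalty."""
    joint, marginal, child_vals = Counter(), Counter(), set()
    for row in data:
        u = tuple(row[p] for p in parents)
        joint[(u, row[child])] += 1
        marginal[u] += 1
        child_vals.add(row[child])
    loglik = sum(n * log(n / marginal[u]) for (u, _), n in joint.items())
    n_params = len(marginal) * (len(child_vals) - 1)     # independent CPT entries
    return loglik - 0.5 * log(len(data)) * n_params

def score(structure, data):
    """structure: dict mapping each variable to its list of parents."""
    return sum(family_score(data, x, parents) for x, parents in structure.items())

For a network like the B, E, A, C examples in the search figures, structure might be {"B": [], "E": [], "A": ["B", "E"], "C": ["A"]}.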
Heuristic Search
[Figure: several candidate network structures over the variables B, E, A, and C.]
Exploiting Decomposability
[Figure: candidate network structures over B, E, A, and C.]
To recompute scores, we only need to re-score the families that changed in the last move.
Variations on a Theme
• Known structure, fully observable: only need to do
parameter estimation
• Unknown structure, fully observable: do heuristic search
through structure space, then parameter estimation
• Known structure, missing values: use expectation
maximization (EM) to estimate parameters
• Known structure, hidden variables: apply adaptive
probabilistic network (APN) techniques
• Unknown structure, hidden variables: too hard to solve!
Handling Missing Data
• Suppose that in some cases, we observe earthquake, alarm, light-level, and moon-phase, but not burglary
• Should we throw that data away??
• Idea: Guess the missing values based on the other data
[Figure: the network over Moon-phase, Light-level, Earthquake, Burglary, and Alarm.]
EM (Expectation Maximization)
• Guess probabilities for nodes with missing values (e.g.,
based on other observations)
• Compute the probability distribution over the missing
values, given our guess
• Update the probabilities based on the guessed values
• Repeat until convergence
EM Example
• Suppose we have observed Earthquake and Alarm but not Burglary for an observation on November 27
• We estimate the CPTs based on the rest of the data
• We then estimate P(Burglary) for November 27 from those CPTs
• Now we recompute the CPTs as if that estimated value had been observed
• Repeat until convergence!
[Figure: the Earthquake, Burglary, Alarm network.]
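A hedged sketch of this loop in code (the records and the simplified model are invented: only Earthquake, Burglary, and Alarm, with Burglary missing in some records; Moon-phase and Light-level are ignored). The E-step fills in a distribution over the missing Burglary value from the current parameters, and the M-step re-estimates the parameters from the resulting fractional counts.

def em(records, iters=20):
    """records: list of dicts with 'E', 'A' in {0, 1} and 'B' in {0, 1} or None (missing)."""
    p_b, p_e = 0.5, 0.5                                   # initial guesses for P(B=1), P(E=1)
    p_a = {(b, e): 0.5 for b in (0, 1) for e in (0, 1)}   # P(A=1 | B=b, E=e)

    def lik_a(a, b, e):
        return p_a[(b, e)] if a == 1 else 1 - p_a[(b, e)]

    for _ in range(iters):
        # E-step: weight each possible value of B for every record
        weighted = []
        for r in records:
            if r["B"] is not None:
                w = {b: 1.0 if b == r["B"] else 0.0 for b in (0, 1)}
            else:                          # P(B=b | A, E) proportional to P(B=b) P(A | B=b, E)
                w = {b: (p_b if b else 1 - p_b) * lik_a(r["A"], b, r["E"]) for b in (0, 1)}
                z = w[0] + w[1]
                w = {b: w[b] / z for b in (0, 1)}
            weighted.append((r, w))

        # M-step: re-estimate the parameters from expected counts
        n = len(records)
        p_b = sum(w[1] for _, w in weighted) / n
        p_e = sum(r["E"] for r, _ in weighted) / n
        for b in (0, 1):
            for e in (0, 1):
                den = sum(w[b] for r, w in weighted if r["E"] == e)
                num = sum(w[b] * r["A"] for r, w in weighted if r["E"] == e)
                p_a[(b, e)] = num / den if den > 0 else 0.5
    return p_b, p_e, p_a

records = [{"E": 0, "A": 0, "B": 0}, {"E": 1, "A": 1, "B": 1},
           {"E": 0, "A": 1, "B": 1}, {"E": 1, "A": 1, "B": None}]   # last B unobserved
print(em(records))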