Learning
R&N: ch 19, ch 20
based on material from Ray Mooney, Daphne Koller, Kevin Murphy, Marie desJardins
Types of Learning
Supervised Learning – classification, prediction
Unsupervised Learning – clustering, segmentation, pattern discovery
Reinforcement Learning – learning MDPs, online learning
Supervised Learning
Someone gives you a bunch of examples, telling you what each one is.
Eventually, you figure out the mapping from the properties (features) of the examples to their type (classification).
Supervised Learning
Logic-based approaches: learn a function f(X) → {true, false}
Statistical/Probabilistic approaches: learn a probability distribution P(Y|X)
Outline
Logic-based Approaches
  Decision Trees (last time)
  Version Spaces (today)
  PAC learning (we won't cover)
Instance-based approaches
Statistical Approaches
Predicate-Learning Methods
Version space: an explicit representation of the hypothesis space H. Need to provide H with some "structure".
Putting Things Together
[Diagram: an object set, a goal predicate, and observable predicates yield an example set X, which is split into a training set and a test set; a learning procedure L, constrained by the hypothesis space H and its bias, induces a hypothesis h from the training set; h is then evaluated on the test set (yes/no).]
Version Spaces
The "version space" is the set of all hypotheses that are consistent with the training instances processed so far.
An algorithm:
  V := H                ;; the version space V starts as ALL hypotheses in H
  For each example e:
    Eliminate any member of V that disagrees with e
    If V is empty, FAIL
  Return V as the set of consistent hypotheses
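A minimal sketch of this filtering algorithm in Python, assuming the hypothesis space is small enough to enumerate explicitly; hypotheses are modeled as boolean predicates over examples, and all names are illustrative rather than taken from the slides.

```python
def version_space(H, examples):
    """Return every hypothesis in H consistent with the labeled examples."""
    V = list(H)                     # the version space V starts as ALL hypotheses in H
    for x, label in examples:
        # Eliminate any member of V that disagrees with this example
        V = [h for h in V if h(x) == label]
        if not V:
            raise ValueError("version space collapsed: no consistent hypothesis")
    return V

# Tiny usage example: hypotheses are boolean predicates over integers
H = [lambda x: x % 2 == 0,          # "x is even"
     lambda x: x > 3,               # "x is greater than 3"
     lambda x: x % 2 == 1]          # "x is odd"
examples = [(4, True), (6, True), (3, False)]
print(len(version_space(H, examples)))   # -> 2 ("even" and "greater than 3" survive)
```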
Version Spaces: The Problem
PROBLEM: V is huge!!
Suppose you have N attributes, each with k possible values.
Suppose you allow a hypothesis to be any disjunction of instances.
There are k^N possible instances, so |H| = 2^(k^N).
If N=5 and k=2, |H| = 2^32!!
Version Spaces: The Tricks
First Trick: Don't allow arbitrary disjunctions.
Organize the feature values into a hierarchy of allowed disjunctions, e.g.:
  any-color
    dark: black, blue
    pale: white, yellow
Now there are only 7 "abstract values" instead of 16 disjunctive combinations (e.g., "black or white" isn't allowed).
Second Trick: Define a partial ordering on H ("general to specific") and only keep track of the upper bound and lower bound of the version space.
RESULT: An incremental, possibly efficient algorithm!
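A small sketch of the first trick, assuming we encode the slide's color hierarchy as a parent-to-children map; the helper `covers` is illustrative (not from the slides) and tests whether an abstract value includes a concrete one.

```python
# Allowed disjunctions are only the "abstract values" defined by the hierarchy,
# instead of arbitrary disjunctions of the four base colors.
HIERARCHY = {
    "any-color": ["dark", "pale"],
    "dark": ["black", "blue"],
    "pale": ["white", "yellow"],
}

def covers(abstract, concrete):
    """True if `concrete` falls under `abstract` in the hierarchy."""
    if abstract == concrete:
        return True
    return any(covers(child, concrete) for child in HIERARCHY.get(abstract, []))

# The 7 allowed "abstract values": every node of the hierarchy
values = set(HIERARCHY) | {leaf for kids in HIERARCHY.values() for leaf in kids}
print(sorted(values))             # 7 values instead of 2**4 = 16 disjunctions
print(covers("dark", "blue"))     # True
print(covers("pale", "black"))    # False ("black or white" is not expressible)
```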
Rewarded Card Example
(r=1) ∨ … ∨ (r=10) ∨ (r=J) ∨ (r=Q) ∨ (r=K) ⇔ ANY-RANK(r)
(r=1) ∨ … ∨ (r=10) ⇔ NUM(r)
(r=J) ∨ (r=Q) ∨ (r=K) ⇔ FACE(r)
(s=♣) ∨ (s=♦) ∨ (s=♥) ∨ (s=♠) ⇔ ANY-SUIT(s)
(s=♣) ∨ (s=♠) ⇔ BLACK(s)
(s=♦) ∨ (s=♥) ⇔ RED(s)
A hypothesis is any sentence of the form:
  R(r) ∧ S(s) ⇒ IN-CLASS([r,s])
where:
• R(r) is ANY-RANK(r), NUM(r), FACE(r), or (r=j) for a specific rank j
• S(s) is ANY-SUIT(s), BLACK(s), RED(s), or (s=k) for a specific suit k
Simplified Representation
For simplicity, we represent a concept by rs, with:
• r ∈ {a, n, f, 1, …, 10, j, q, k}
• s ∈ {a, b, r, ♣, ♦, ♥, ♠}
For example:
• n♣ represents: NUM(r) ∧ (s=♣) ⇒ IN-CLASS([r,s])
• aa represents: ANY-RANK(r) ∧ ANY-SUIT(s) ⇒ IN-CLASS([r,s])
Extension of a Hypothesis
The extension of a hypothesis h is the set of objects that satisfy h.
Examples:
• The extension of f♣ is {J♣, Q♣, K♣}
• The extension of aa is the set of all cards
More General/Specific Relation
Let h1 and h2 be two hypotheses in H.
h1 is more general than h2 iff the extension of h1 is a proper superset of the extension of h2.
Examples:
• aa is more general than f♣
• f♣ is more general than q♣
• fr and nr are not comparable
The inverse of the "more general" relation is the "more specific" relation.
The "more general" relation defines a partial ordering on the hypotheses in H.
Example: Subset of Partial Order
[Diagram: a fragment of the partial order, from aa (most general) at the top, through na, ab, 4a, nb, a♣, 4b, n♣, down to 4♣ (most specific).]
G-Boundary / S-Boundary of V
A hypothesis in V is most general iff no hypothesis in V is more general.
G-boundary G of V: the set of most general hypotheses in V.
A hypothesis in V is most specific iff no hypothesis in V is more specific.
S-boundary S of V: the set of most specific hypotheses in V.
Example: G-/S-Boundaries of V
Now suppose that 4♣ is given as a positive example.
We replace every hypothesis in S whose extension does not contain 4♣ by its generalization set.
[Diagram: the lattice aa, na, 4a, ab, a♣, nb, n♣, 4b, with G = {aa} and S initially containing every single-card hypothesis (1♣, …, 4♣, …).]
Example: G-/S-Boundaries of V
Here, both G and S have size 1. This is not the case in general!
[Diagram: the lattice with G = {aa} and S = {4♣} after the positive example 4♣.]
Example: G-/S-Boundaries of V
The generalization set of a hypothesis h is the set of hypotheses that are immediately more general than h.
Let 7♣ be the next (positive) example.
[Diagram: the generalization set of 4♣ (n♣ and 4b) is highlighted on the lattice, with G = {aa} and S = {4♣}.]
Example: G-/S-Boundaries of V
Let 7♣ be the next (positive) example.
[Diagram: the lattice after processing 7♣; S is generalized from {4♣} to {n♣}, and G remains {aa}.]
Example: G-/S-Boundaries of V
Let 5 (a red card) be the next (negative) example.
[Diagram: the specialization set of aa is highlighted; aa must be minimally specialized so that it no longer covers the negative example.]
Example: G-/S-Boundaries of V
G and S, and all hypotheses in between, form exactly the version space.
[Diagram: the version space now contains ab, a♣, nb, n♣, with G = {ab} and S = {n♣}.]
Example: G-/S-Boundaries of V
At this stage, do 8, 6, and J (of particular suits) satisfy CONCEPT?
The answer is Yes for a card covered by every hypothesis still in the version space, No for a card covered by none of them, and Maybe for a card covered by only some.
[Diagram: the version space ab, a♣, nb, n♣, with the three query cards labeled Yes, No, and Maybe.]
Example: G-/S-Boundaries of V
Let 2♠ be the next (positive) example.
[Diagram: the version space ab, a♣, nb, n♣ before the update.]
Example: G-/S-Boundaries of V
Let J (a black card) be the next (negative) example.
[Diagram: after the positive example 2♠, the version space is reduced to ab and nb.]
Example: G-/S-Boundaries of V
+ : 4♣, 7♣, 2♠    − : 5 (red), J (black)
The version space reduces to the single hypothesis nb:
NUM(r) ∧ BLACK(s) ⇒ IN-CLASS([r,s])
Example: G-/S-Boundaries of V
Let us return to the version space … and let 8♣ be the next (negative) example.
The only most specific hypothesis (n♣) disagrees with this example, so no hypothesis in H agrees with all examples.
[Diagram: the version space ab, a♣, nb, n♣.]
Example: G-/S-Boundaries of V
Let us return to the version space … and let J (a red card) be the next (positive) example.
The only most general hypothesis (ab) disagrees with this example, so no hypothesis in H agrees with all examples.
[Diagram: the version space ab, a♣, nb, n♣.]
Version Space Update
1. x ← new example
2. If x is positive then (G,S) ← POSITIVE-UPDATE(G,S,x)
3. Else (G,S) ← NEGATIVE-UPDATE(G,S,x)
4. If G or S is empty then return failure
POSITIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in G that do not agree with x
2. Minimally generalize all hypotheses in S until they are consistent with x
   (using the generalization sets of the hypotheses)
3. Remove from S every hypothesis that is neither more specific than nor equal to a hypothesis in G
   (this step was not needed in the card example)
4. Remove from S every hypothesis that is more general than another hypothesis in S
5. Return (G,S)
NEGATIVE-UPDATE(G,S,x)
1. Eliminate all hypotheses in S that do not agree with x
2. Minimally specialize all hypotheses in G until they are consistent with x
3. Remove from G every hypothesis that is neither more general than nor equal to a hypothesis in S
4. Remove from G every hypothesis that is more specific than another hypothesis in G
5. Return (G,S)
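For concreteness, here is a sketch that recovers the G- and S-boundaries for the card example. Rather than implementing the incremental POSITIVE-/NEGATIVE-UPDATE procedures above, it brute-forces the small hypothesis space (16 rank abstractions × 7 suit abstractions = 112 hypotheses), keeps the hypotheses consistent with all examples, and reads the boundaries off the "more general" relation; on this domain it yields the same boundaries. The exact suits of the negative examples (a red 5 and a black jack) are assumptions, and all names are illustrative.

```python
from itertools import product

RANKS = list(range(1, 11)) + ["j", "q", "k"]
SUITS = ["club", "diamond", "heart", "spade"]
CARDS = [(r, s) for r in RANKS for s in SUITS]

def rank_ok(abstr, r):
    if abstr == "a": return True                    # ANY-RANK
    if abstr == "n": return r in range(1, 11)       # NUM
    if abstr == "f": return r in ("j", "q", "k")    # FACE
    return r == abstr                               # a specific rank

def suit_ok(abstr, s):
    if abstr == "a": return True                            # ANY-SUIT
    if abstr == "b": return s in ("club", "spade")          # BLACK
    if abstr == "r": return s in ("diamond", "heart")       # RED
    return s == abstr                                       # a specific suit

H = list(product(["a", "n", "f"] + RANKS, ["a", "b", "r"] + SUITS))

def extension(h):
    R, S = h
    return frozenset(c for c in CARDS if rank_ok(R, c[0]) and suit_ok(S, c[1]))

def more_general(h1, h2):
    return extension(h1) > extension(h2)             # proper superset of extensions

def boundaries(examples):
    V = [h for h in H
         if all((card in extension(h)) == label for card, label in examples)]
    G = [h for h in V if not any(more_general(g, h) for g in V)]   # most general
    S = [h for h in V if not any(more_general(h, s2) for s2 in V)] # most specific
    return G, S

# The slide's example: positives 4 and 7 of clubs and 2 of spades;
# negatives a red 5 and a black jack (specific suits chosen for illustration).
examples = [((4, "club"), True), ((7, "club"), True), ((2, "spade"), True),
            ((5, "heart"), False), (("j", "spade"), False)]
print(boundaries(examples))    # both boundaries collapse to ('n', 'b'), i.e. NUM ∧ BLACK
```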
Example-Selection Strategy (aka Active Learning)
Suppose that at each step the learning procedure can select the object (card) of the next example.
Let it pick the object such that, whether the example turns out positive or negative, it will eliminate one-half of the remaining hypotheses.
Then a single hypothesis will be isolated in O(log |H|) steps.
Example
[Diagram: the partial-order lattice (aa, na, 4a, ab, a♣, nb, n♣) with the current version space highlighted, and three candidate query cards: a 9 and two jacks of different suits.]
Example-Selection Strategy
Suppose that at each step the learning procedure can select the object (card) of the next example.
Let it pick the object such that, whether the example turns out positive or negative, it will eliminate one-half of the remaining hypotheses.
Then a single hypothesis will be isolated in O(log |H|) steps.
But picking the object that eliminates half the version space may be expensive.
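A sketch of this selection heuristic, assuming the current version space V is available as a set of predicate hypotheses: score each candidate object by how far its yes/no split of V is from half-and-half, and query the most balanced one. The helper names are illustrative.

```python
def best_query(V, candidates):
    """Pick the object whose yes/no split of the version space V is most balanced."""
    def imbalance(x):
        positives = sum(1 for h in V if h(x))   # hypotheses that would predict "yes"
        return abs(2 * positives - len(V))      # 0 means a perfect half-and-half split
    return min(candidates, key=imbalance)

# Usage with the toy predicate hypotheses from the earlier sketch
V = [lambda x: x % 2 == 0, lambda x: x > 3]
print(best_query(V, [2, 5, 12]))   # -> 2: it splits the two hypotheses 1/1 (so does 5);
                                   #    12 would be predicted "yes" by both
```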
Noise
If some examples are misclassified, the version space may collapse.
Possible solution: maintain several G- and S-boundaries, e.g., consistent with all examples, with all examples but one, etc.
Version Spaces
Useful for understanding logical (consistency-based) approaches to learning.
Practical if there is a strong hypothesis bias (concept hierarchies, conjunctive hypotheses).
Don't handle noise well.
Statistical Approaches
Instance-based Learning (20.4)
Statistical Learning (20.1)
Neural Networks (20.5) on Dec. 2
Nearest Neighbor Methods
To classify a new input vector x, examine the k closest training data points to x and assign the object to the most frequently occurring class among them.
[Figure: a query point x shown with its single nearest neighbor (k=1) and its 6 nearest neighbors (k=6).]
Issues
Distance measure
  Most common: Euclidean distance
  Better distance measures: normalize each variable by its standard deviation
Choosing k
  Increasing k reduces variance but increases bias
For high-dimensional spaces, the problem is that the nearest neighbor may not be very close at all!
Memory-based technique: must make a pass through the data for each classification, which can be prohibitive for large data sets. Indexing the data can help, for example with k-d trees.
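A minimal k-nearest-neighbor sketch along the lines above: per-variable standard-deviation normalization, Euclidean distance, and a majority vote among the k closest points, using a brute-force pass over the training data (an index such as a k-d tree would replace the full scan for large data sets). Names are illustrative.

```python
import math
from collections import Counter

def knn_classify(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    dims = len(query)
    # Normalize each variable by its standard deviation (guard against zero)
    stds = []
    for d in range(dims):
        col = [x[d] for x in train_x]
        mean = sum(col) / len(col)
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / len(col))
        stds.append(std if std > 0 else 1.0)

    def dist(a, b):   # Euclidean distance in the normalized space
        return math.sqrt(sum(((a[d] - b[d]) / stds[d]) ** 2 for d in range(dims)))

    # Brute-force pass over all training points for each classification
    ranked = sorted(zip(train_x, train_y), key=lambda xy: dist(xy[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Tiny usage example with two classes in 2-D
X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
y = ["A", "A", "A", "B", "B", "B"]
print(knn_classify(X, y, (1.1, 0.9), k=3))   # -> "A"
print(knn_classify(X, y, (5.1, 5.0), k=3))   # -> "B"
```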
Bayesian Learning
Example: Candy Bags
Candy comes in two flavors: cherry and lime.
Candy is wrapped, so you can't tell which flavor it is until it is opened.
There are 5 kinds of bags of candy:
  h1 = 100% cherry
  h2 = 75% cherry, 25% lime
  h3 = 50% cherry, 50% lime
  h4 = 25% cherry, 75% lime
  h5 = 100% lime
Given a new bag of candy, predict H (which kind of bag it is).
Observations: d1, d2, d3, …
Bayesian Learning
Calculate the probability of each hypothesis given the data, and make predictions weighted by these probabilities (i.e., use all the hypotheses, not just the single best one):
  P(hi | d) = P(d | hi) P(hi) / P(d) ∝ P(d | hi) P(hi)
Now, if we want to predict some unknown quantity X:
  P(X | d) = Σi P(X | hi) P(hi | d)
Bayesian Learning cont.
Calculating P(hi | d):
  P(hi | d) ∝ P(d | hi) P(hi)   (likelihood × prior)
Assume the observations are i.i.d. (independent and identically distributed):
  P(d | hi) = Πj P(dj | hi)
Example:
Hypothesis prior over h1, …, h5 is {0.1, 0.2, 0.4, 0.2, 0.1}
Data: a sequence of observed candies d1, d2, …
Q1: After seeing d1, what is P(hi | d1)?
Q2: After seeing d1, what is P(d2 = lime | d1)?
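A short sketch of Q1 and Q2 using the formulas above, assuming the first observed candy d1 is lime; P(lime | hi) for h1, …, h5 follows directly from the five bag definitions (0%, 25%, 50%, 75%, 100% lime).

```python
# Posterior update and Bayesian prediction for the candy-bag example,
# assuming the observation d1 is a lime candy.
prior = [0.1, 0.2, 0.4, 0.2, 0.1]          # P(hi)
p_lime = [0.0, 0.25, 0.5, 0.75, 1.0]       # P(d = lime | hi), from the bag definitions

# Q1: posterior P(hi | d1 = lime) ∝ P(d1 = lime | hi) P(hi)
unnorm = [l * p for l, p in zip(p_lime, prior)]
posterior = [u / sum(unnorm) for u in unnorm]
print(posterior)                            # ≈ [0.0, 0.1, 0.4, 0.3, 0.2]

# Q2: predictive P(d2 = lime | d1 = lime) = Σi P(d2 = lime | hi) P(hi | d1)
print(sum(l * p for l, p in zip(p_lime, posterior)))   # ≈ 0.65
```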