Announcements
CS Ice Cream Social
• 9/5, 3:30-4:30, ECCR 265
• Includes poster session and student group presentations
Concept Learning

Examples
• Word meanings
• Edible foods
• Abstract structures (e.g., irony)
[Figure: pictured items labeled "glorch" and "not glorch"]
Supervised Approach To Concept Learning
• Both positive and negative examples provided
• Typical models (both in ML and Cog Sci) circa 2000 required both positive and negative examples
Contrast With Human Learning Abilities
• Learning from positive examples only
• Learning from a small number of examples
  – E.g., word meanings
  – E.g., learning appropriate social behavior
  – E.g., instruction on some skill

What would it mean to learn from a small number of positive examples?
[Figure: a handful of positive examples (+) in a two-dimensional feature space]
Tenenbaum (1999)
• Two-dimensional continuous feature space
• Concepts defined by axis-parallel rectangles
• E.g., feature dimensions: cholesterol level, insulin level
• E.g., concept: healthy
Learning Problem
• Given a set of n examples, X = {x1, x2, x3, …, xn}, which are instances of the concept…
• Will some unknown example y also be an instance of the concept?
• Problem of generalization
[Figure: three positive examples (+) with candidate generalizations labeled 1, 2, 3]
Hypothesis (Model) Space
• H: all rectangles on the plane, parameterized by (l1, l2, s1, s2)
• h: one particular hypothesis
• Note: |H| = ∞
• Consider all hypotheses in parallel
  – In contrast to the non-Bayesian approach of maintaining only the best hypothesis at any point in time
Prediction Via Model Averaging
Will some unknown input y be in the concept, given examples X = {x1, x2, x3, …, xn}?
Q: y is a positive example of the concept (T, F)
• P(Q | X) = ∫ P(Q, h | X) dh                      [marginalization over hypotheses h]
• P(Q, h | X) = P(Q | h, X) P(h | X)               [chain rule]
• P(Q | h, X) = P(Q | h) = 1 if y is in h, else 0  [conditional independence and deterministic concepts]
• P(h | X) ∝ P(X | h) P(h)                         [Bayes rule: likelihood × prior]
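A minimal sketch of this model averaging (not Tenenbaum's code; all names are illustrative): enumerate a finite set of rectangle hypotheses, weight each by its unnormalized posterior P(X | h) P(h), and average the indicator P(Q | h) over those weights. The likelihood and prior are passed in so different choices (next slide) can be plugged in.

```python
# Sketch: prediction via model averaging over a finite set of rectangle
# hypotheses h = (x_min, x_max, y_min, y_max).

def contains(h, point):
    """P(Q | h): True if the point lies inside rectangle h."""
    x0, x1, y0, y1 = h
    x, y = point
    return x0 <= x <= x1 and y0 <= y <= y1

def prob_in_concept(y, X, hypotheses, likelihood, prior):
    """P(Q | X): probability that y is in the concept, averaged over hypotheses."""
    # Bayes rule (unnormalized): P(h | X) ∝ P(X | h) P(h)
    weights = [likelihood(h, X) * prior(h) for h in hypotheses]
    Z = sum(weights)
    if Z == 0:
        return 0.0  # no hypothesis is consistent with the examples
    # Marginalization: sum over h of P(Q | h) P(h | X)
    return sum(w for h, w in zip(hypotheses, weights) if contains(h, y)) / Z

# Tiny usage example: flat prior, likelihood that only checks consistency.
def consistent(h, X):
    return 1.0 if all(contains(h, x) for x in X) else 0.0

X = [(1.0, 1.0), (2.0, 2.0)]
candidates = [(0.5, 2.5, 0.5, 2.5), (0.0, 10.0, 0.0, 10.0)]
print(prob_in_concept((3.0, 3.0), X, candidates, consistent, lambda h: 1.0))  # 0.5
```

The consistency-only likelihood used in the usage example corresponds to the "weak Bayes" variant mentioned later; the size principle on the next slide replaces it.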
Priors and Likelihood Functions
• Priors, p(h)
  – Location invariant
  – Uninformative prior (depends only on the area of the rectangle)
  – Expected size prior
• Likelihood function, p(X | h)
  – X = set of n examples
  – Size principle
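A hedged sketch of these two ingredients for rectangle hypotheses: the size-principle likelihood follows the definition above, while the exact parametric form of the expected-size prior here (exponential decay in the side lengths with scale sigma) is an assumption chosen only for illustration. Either function can be plugged into the model-averaging sketch above.

```python
import math

def size_principle_likelihood(h, X):
    """Size principle: each example is drawn uniformly from inside h, so
    P(X | h) = 1 / |h|^n if every example falls inside h, and 0 otherwise."""
    x0, x1, y0, y1 = h
    if any(not (x0 <= x <= x1 and y0 <= y <= y1) for x, y in X):
        return 0.0
    area = (x1 - x0) * (y1 - y0)
    return area ** (-len(X))

def expected_size_prior(h, sigma=2.0):
    """Illustrative expected-size prior: exponential decay in the side lengths,
    with expected scale sigma. (Assumed form, for illustration only.)"""
    x0, x1, y0, y1 = h
    return math.exp(-((x1 - x0) + (y1 - y0)) / sigma)

# The size principle favors the tightest consistent rectangle, and more
# strongly as the number of examples n grows:
X = [(1.0, 1.0), (2.0, 2.0), (1.5, 1.2)]
tight = (1.0, 2.0, 1.0, 2.0)   # area 1
loose = (0.0, 4.0, 0.0, 4.0)   # area 16
print(size_principle_likelihood(tight, X) / size_principle_likelihood(loose, X))  # 16^3 = 4096
```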
Generalization Gradients
• MIN: smallest hypothesis consistent with the data
• Weak Bayes: instead of using the size principle, assumes examples are produced by a process independent of the true class
• Dark line = 50% generalization probability
[Figure: generalization gradients]
Experimental Design
• Subjects shown n dots on screen that are “randomly chosen examples from some rectangle of healthy levels”
  – n drawn from {2, 3, 4, 6, 10, 50}
• Dots varied in horizontal and vertical range
  – r drawn from {.25, .5, 1, 2, 4, 8} units in a 24-unit window
• Task
  – Draw the ‘true’ rectangle around the dots
Experimental Results
Number Game
• Experimenter picks an integer arithmetic concept C
  – E.g., prime number
  – E.g., number between 10 and 20
  – E.g., multiple of 5
• Experimenter presents positive examples drawn at random from C, say, in range [1, 100]
• Participant asked whether some new test case belongs in C
Empirical Predictive Distributions
[Figure: generalization probability (0–1) over numbers 4–100 for example sets {16}, {60}, {16, 8, 2, 64}, {16, 23, 19, 20}, and {60, 80, 10, 30}]
Hypothesis Space
• Even numbers
• Odd numbers
• Squares
• Multiples of n
• Ends in n
• Powers of n
• All numbers
• Intervals [n, m] for n > 0, m < 101
• Powers of 2, plus 37
• Powers of 2, except for 32
Observation = 16
• Likelihood function
  – Size principle
• Prior
  – Intuition
[Figure: prior, likelihood, and posterior over the hypothesis space (even, odd, squares, multiples of 3–10, ends in 1–9, powers of 2–10, all numbers, powers of 2 + {37}, powers of 2 − {32}) for data = 16]
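A sketch of this calculation for the number game: the hypothesis extensions follow the Hypothesis Space slide, the likelihood uses the size principle, and the prior weights are illustrative stand-ins for the "intuition" prior rather than Tenenbaum's actual values.

```python
# Number game: posterior over the hypothesis space given data = [16].
# Each hypothesis is represented by its extension over 1..100; the likelihood
# uses the size principle P(X | h) = 1/|h|^n; the prior weights are illustrative.

N = range(1, 101)

hypotheses = {
    "even":    {x for x in N if x % 2 == 0},
    "odd":     {x for x in N if x % 2 == 1},
    "squares": {x * x for x in range(1, 11)},
    "all":     set(N),
}
hypotheses.update({f"mult of {k}": {x for x in N if x % k == 0} for k in range(3, 11)})
hypotheses.update({f"ends in {k}": {x for x in N if x % 10 == k} for k in range(1, 10)})
hypotheses.update({f"powers of {k}": {k ** i for i in range(1, 8) if k ** i <= 100}
                   for k in range(2, 11)})
hypotheses["powers of 2 + {37}"] = hypotheses["powers of 2"] | {37}
hypotheses["powers of 2 - {32}"] = hypotheses["powers of 2"] - {32}

# Illustrative "intuition" prior: natural concepts get weight 1, oddball
# exception concepts get much less.
prior = {name: 1.0 for name in hypotheses}
prior["powers of 2 + {37}"] = 0.01
prior["powers of 2 - {32}"] = 0.01

def posterior(data):
    scores = {}
    for name, ext in hypotheses.items():
        if all(x in ext for x in data):            # hypothesis consistent with data?
            scores[name] = prior[name] * (1.0 / len(ext)) ** len(data)  # prior * size principle
        else:
            scores[name] = 0.0
    Z = sum(scores.values())                       # normalize (Bayes rule)
    return {name: s / Z for name, s in scores.items()}

post = posterior([16])
for name, p in sorted(post.items(), key=lambda kv: -kv[1])[:5]:
    print(f"{name:20s} {p:.3f}")
# With only 16 observed, the posterior is spread over several consistent
# hypotheses (powers of 4, powers of 2, squares, ends in 6, ...); with
# data = [16, 8, 2, 64] (next slide) it concentrates on "powers of 2".
```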
Observation = 16 8 2 64
• Likelihood function
  – Size principle
• Prior
  – Intuition
[Figure: prior, likelihood, and posterior over the same hypothesis space for data = 16 8 2 64]
Posterior Distribution After Observing 16
[Figure: examples = 16]
Model Vs. Human Data
[Figure: MODEL vs. HUMAN DATA generalization curves over numbers 4–100 for example sets including {60}, {16, 8, 2, 64}, {16, 23, 19, 20}, {60, 80, 10, 30}, {60, 52, 57, 55}, {81, 25, 4, 36}, and {81, 98, 86, 93}]
Summary of Tenenbaum (1999)
• Method
  – Pick prior distribution (includes hypothesis space)
  – Pick likelihood function (size principle)
  – Leads to predictions for generalization as a function of r (range) and n (number of examples)
• Claims people generalize optimally given assumptions about priors and likelihood
• Bayesian approach provides the best description of how people generalize on the rectangle task
• Explains how people can learn from a small number of examples, and only positive examples
Important Ideas in Bayesian Models
• Generative models
  – Likelihood function
• Consideration of multiple models in parallel
  – Potentially infinite model space
• Inference
  – Prediction via model averaging
  – Role of priors diminishes with amount of evidence
• Learning
  – Trade-off between model simplicity and fit to data (Bayesian Occam’s razor)
Ockham's Razor
[Images: William of Ockham, medieval philosopher and monk; a razor, the (metaphorical) tool for cutting]
If two hypotheses are equally consistent with the data, prefer the simpler one.
Simplicity
A simpler hypothesis:
• can accommodate fewer observations
• is smoother
• has fewer parameters
• restricts predictions more (“sharper” predictions)
Examples
• 1st vs. 4th order polynomial
• small rectangle vs. large rectangle in the Tenenbaum model
[Figure: predictions of hypotheses H0 and H1]
Motivating Ockham's Razor
• Via PRIORS
  – Aesthetic considerations: a theory with mathematical beauty is more likely to be right (or believed) than an ugly one, given that both fit the same data
  – Past empirical success of the principle
• Via LIKELIHOODS
  – Coherent inference, as embodied by Bayesian reasoning, automatically incorporates Ockham's razor
  – Two theories H1 and H2
Ockham's Razor with Priors
• Jeffreys (1939) probability text: more complex hypotheses should have lower priors
• Requires a numerical rule for assessing complexity
  – e.g., number of free parameters
  – e.g., Vapnik-Chervonenkis (VC) dimension
Subjective vs. Objective Priors
• Subjective or informative prior
  – specific, definite information about a random variable
• Objective or uninformative prior
  – vague, general information
• Philosophical arguments for certain priors as uninformative
  – Maximum entropy / least commitment
    e.g., interval [a, b]: uniform
    e.g., interval [0, ∞) with mean 1/λ: exponential distribution
    e.g., mean μ and std deviation σ: Gaussian
  – Independence of measurement scale
    e.g., Jeffreys prior 1/(θ(1−θ)) for θ in [0, 1] expresses the same belief whether we talk about θ or log θ
Ockham’s Razor Via Likelihoods
• Coin flipping example
  – H1: coin has two heads
  – H2: coin has a head and a tail
• Consider 5 flips producing HHHHH
  – H1 could produce only this sequence
  – H2 could produce HHHHH, but also HHHHT, HHHTH, …, TTTTT
  – P(HHHHH | H1) = 1, P(HHHHH | H2) = 1/32
• H2 pays the price of a lower likelihood via the fact that it can accommodate a greater range of observations
• H1 is more readily rejected by observations
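The arithmetic of this example as a tiny sketch (variable names are mine):

```python
# Coin example: H1 = two-headed coin, H2 = ordinary coin (one head, one tail).
flips = "HHHHH"

p_data_given_H1 = 1.0 if all(f == "H" for f in flips) else 0.0  # H1 produces only heads
p_data_given_H2 = 0.5 ** len(flips)                             # each of the 2^5 sequences equally likely

print(p_data_given_H1, p_data_given_H2)    # 1.0  0.03125  (= 1/32)
print(p_data_given_H1 / p_data_given_H2)   # likelihood ratio of 32 in favor of H1
# A single tail would drop P(data | H1) to 0, so H1 is more readily rejected.
```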
Simple and Complex Hypotheses
[Figure: hypotheses H1 and H2 differing in complexity]
Bayes Factor
• Ratio of the marginal likelihoods of two hypotheses: P(D | H1) / P(D | H2)
• A.k.a. likelihood ratio
• BIC is an approximation to the Bayes factor
Hypothesis Classes Varying In Complexity
• E.g., 1st, 2nd, and 3rd order polynomials
• Hypothesis class is parameterized by w
• Marginal likelihood integrates over the parameters: P(D | H) = ∫ P(D | w, H) p(w | H) dw
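A sketch of how the marginal likelihood penalizes a flexible, parameterized hypothesis class. It uses a one-parameter coin model rather than the slide's polynomial example, and the uniform prior over w is an assumption made for illustration.

```python
# D: eight flips with 4 heads and 4 tails.
n_heads, n_tails = 4, 4

# Simple hypothesis: fair coin, no free parameters.
p_D_simple = 0.5 ** (n_heads + n_tails)   # = 1/256 ≈ 0.0039

# Flexible hypothesis class: coin with unknown bias w, uniform prior p(w | H).
# Marginal likelihood P(D | H) = ∫ P(D | w, H) p(w | H) dw, approximated on a grid.
grid = [i / 1000 for i in range(1, 1000)]
p_D_flexible = sum(w ** n_heads * (1 - w) ** n_tails for w in grid) / len(grid)  # ≈ 0.0016

print(p_D_simple / p_D_flexible)   # Bayes factor ≈ 2.5 in favor of the simpler hypothesis:
# the flexible class spreads its predictions over many data sets, so it assigns
# less probability to this particular, unremarkable one (Bayesian Occam's razor).
```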
Rissanen (1976): Minimum Description Length
• Prefer models that can communicate the data in the smallest number of bits.
• The preferred hypothesis H for explaining data D minimizes:
  (1) the length of the description of the hypothesis
  (2) the length of the description of the data with the help of the chosen theory
• L: description length
MDL & Bayes
• L: some measure of length (complexity)
• MDL: prefer the hypothesis that minimizes L(H) + L(D | H)
• Bayes rule implies the MDL principle:
  P(H | D) = P(D | H) P(H) / P(D)
  −log P(H | D) = −log P(D | H) − log P(H) + log P(D)
               = L(D | H) + L(H) + const
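A small sketch reading the same quantities as code lengths in bits; the prior values here are made up purely for illustration.

```python
import math

def bits(p):
    """Description length in bits: L = -log2(p)."""
    return -math.log2(p)

# Reusing the coin example (data = HHHHH) with made-up priors over the hypotheses.
P_H = {"two-headed": 0.1, "fair": 0.9}
P_D_given_H = {"two-headed": 1.0, "fair": 1.0 / 32}

for h in P_H:
    total = bits(P_H[h]) + bits(P_D_given_H[h])    # L(H) + L(D | H)
    print(f"{h:10s}  L(H) = {bits(P_H[h]):.2f}  L(D|H) = {bits(P_D_given_H[h]):.2f}  total = {total:.2f} bits")
# two-headed: 3.32 + 0.00 = 3.32 bits;  fair: 0.15 + 5.00 = 5.15 bits.
# Minimizing L(H) + L(D|H) selects the same hypothesis as maximizing P(H | D),
# since -log P(H | D) differs from it only by the constant log P(D).
```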
Relativity Example
• Explain the deviation in Mercury's orbit at perihelion with respect to the prevailing theory
  – E: Einstein's theory
  – F: fudged Newtonian theory
• α = true deviation, a = observed deviation
Relativity Example (Continued)
• Subjective Ockham's razor
  – Result depends on one's belief about P(α | F)
• Objective Ockham's razor
  – For the Mercury example, the RHS is 15.04
• Applies to the generic situation
