Outline
• Bayesian concept learning: Discussion
• Probabilistic models for unsupervised and
semi-supervised category learning
Discussion points
• Relation to “Bayesian classification”?
• Relation to debate between rules / logic /
symbols and similarity / connections /
statistics?
• Where do the hypothesis space and prior
probability distribution come from?
Discussion points
• Relation to “Bayesian classification”?
– Causal attribution versus referential inference.
– Which is more suited to natural concept learning?
• Relation to debate between rules / logic /
symbols and similarity / connections /
statistics?
• Where do the hypothesis space and prior
probability distribution come from?
Hierarchical priors
θ ~ Beta(F_H, F_T)

[Graphical model: shared hyperparameters F_H, F_T at the top; each of Coin 1, Coin 2, ..., Coin 200 draws its own bias θ ~ Beta(F_H, F_T), which generates that coin's flips d1, d2, d3, d4.]

• Latent structure captures what is common to all coins, and also their individual variability
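A minimal simulation sketch of this hierarchical model (Python/NumPy; the particular values of F_H and F_T, and the coin/flip counts, are illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared hyperparameters F_H, F_T (illustrative values).
F_H, F_T = 10.0, 10.0
n_coins, n_flips = 200, 4

# Each coin draws its own bias theta from the shared Beta(F_H, F_T) prior ...
thetas = rng.beta(F_H, F_T, size=n_coins)

# ... and generates its own flips d1..d4 (1 = heads, 0 = tails).
flips = (rng.random((n_coins, n_flips)) < thetas[:, None]).astype(int)

# The shared prior captures what is common to all coins (biases near 0.5),
# while each theta captures an individual coin's variability.
print(thetas[:3])
print(flips[:3])
```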
Hierarchical priors
P(h)

[Graphical model: a shared prior P(h) at the top; each of Concept 1, Concept 2, ..., Concept 200 draws its own hypothesis h from P(h), which generates that concept's examples x1, x2, x3, x4.]

• Latent structure captures what is common to all concepts, and also their individual variability
• Is this all we need?
[Graphical model as above, with higher-level domain knowledge (social knowledge, number knowledge, math/magnitude?) sitting above the prior P(h), which in turn generates the hypothesis h and examples x1, x2, x3, x4 for Concept 1, Concept 2, ..., Concept 200.]

• Hypothesis space is not just an arbitrary collection of hypotheses, but a principled system.
• Far more structured than our experience with specific number concepts.
Outline
• Bayesian concept learning: Discussion
• Probabilistic models for unsupervised and
semi-supervised category learning
Simple model of concept learning
“This is a blicket.”
“Can you show me the
other blickets?”
Simple model of concept learning
Other blickets.
The objects of planet Gazoob
Image removed due to copyright considerations.
Simple model of concept learning
“This is a blicket.”
“Can you show me the
other blickets?”
Learning from just one positive example is possible if:
– Assume concepts refer to clusters in the world.
– Observe enough unlabeled data to identify clear clusters.
Complications
Complications
• Outliers
“This is a blicket.”
Complications
• Overlapping clusters
“This is a blicket.”
Complications
• How many clusters?
“This is a blicket.”
Complications
• Clusters that are not simple blobs
“This is a blicket.”
Complications
• Concept labels inconsistent with clusters
“This is a blicket.”
“This is a gazzer.”
Simple model of concept learning
• Can infer a concept from just one positive
example if:
– Assume concepts refer to clusters in the world.
– Observe lots of unlabeled data, in order to identify
clusters.
• How do we identify the clusters?
– With no labeled data (“unsupervised learning”)
– With sparsely labeled data (“semi-supervised learning”)
Unsupervised clustering with probabilistic models
• Assume a simple parametric probabilistic model for clusters, e.g., Gaussian.
p(x | c_j) = p(x_1 | c_j) × p(x_2 | c_j)

p(x_i | c_j) ∝ exp( −(x_i − μ_ij)² / (2 σ_ij²) )

[Figure: two Gaussian clusters c_1 and c_2 in the (x_1, x_2) plane; μ_ij and σ_ij are the mean and standard deviation of cluster c_j along dimension x_i (e.g., μ_11, σ_11, μ_21, σ_21 for cluster c_1).]
Unsupervised clustering with probabilistic models
• Assume a simple parametric probabilistic
model for clusters, e.g., Gaussian.
• Two ways to characterize the jth cluster:
– Parameters: μ_ij, σ_ij
– Assignments: z_j^(k) = 1 if the kth point belongs to cluster j, else 0.
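As a sketch (assuming diagonal Gaussian clusters in two dimensions; the particular parameter values and function name are mine), the per-cluster likelihood factorizes across dimensions:

```python
import numpy as np

def cluster_likelihood(x, mu_j, sigma_j):
    """p(x | c_j) for a diagonal Gaussian cluster: product over dimensions i of
    (1 / (sqrt(2*pi) * sigma_ij)) * exp(-(x_i - mu_ij)^2 / (2 * sigma_ij^2))."""
    x, mu_j, sigma_j = map(np.asarray, (x, mu_j, sigma_j))
    dens = np.exp(-(x - mu_j) ** 2 / (2 * sigma_j ** 2)) / (np.sqrt(2 * np.pi) * sigma_j)
    return dens.prod()

# Illustrative parameters for two clusters c1, c2 (assumed values).
mu = np.array([[0.0, 0.0], [3.0, 3.0]])      # mu[j, i]
sigma = np.array([[1.0, 1.0], [0.5, 0.5]])   # sigma[j, i]

x = np.array([0.2, -0.1])
print([cluster_likelihood(x, mu[j], sigma[j]) for j in range(2)])
```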
Unsupervised clustering with probabilistic models
• Chicken-and-egg problem:
– Given assignments, we could solve for maximum likelihood parameters:

  μ_ij = Σ_k z_j^(k) x_i^(k) / Σ_k z_j^(k)

  σ_ij² = Σ_k z_j^(k) (x_i^(k) − μ_ij)² / Σ_k z_j^(k)
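A sketch of this parameter-update step (NumPy; the assumed shapes are X: [n_points, n_dims] and z: [n_points, n_clusters] with 0/1 entries, and the variable names are mine):

```python
import numpy as np

def ml_parameters(X, z, eps=1e-12):
    """Maximum-likelihood cluster means and variances given hard assignments z_j^(k)."""
    counts = z.sum(axis=0) + eps                      # sum_k z_j^(k), per cluster j
    mu = (z.T @ X) / counts[:, None]                  # mu_ij = sum_k z_j^(k) x_i^(k) / sum_k z_j^(k)
    sq_dev = (X[:, None, :] - mu[None, :, :]) ** 2    # (x_i^(k) - mu_ij)^2
    var = (z[:, :, None] * sq_dev).sum(axis=0) / counts[:, None]
    return mu, var
```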
Unsupervised clustering with probabilistic models
• Chicken-and-egg problem:
– Given parameters, we could solve for assignments z_j^(k):

  z_j^(k) = 1 if j = argmax_j′ p(c_j′ | x^(k)), else 0

  p(c_j | x^(k)) ∝ p(x^(k) | c_j) p(c_j) = ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)

– Solve for the "base rate" parameters:

  p(c_j) ∝ Σ_k z_j^(k)
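A sketch of the corresponding assignment step (same assumed array shapes as above; working in log space is my choice for numerical convenience, not part of the slides):

```python
import numpy as np

def map_assignments(X, mu, var, prior):
    """Hard assignments: z_j^(k) = 1 for j = argmax_j' p(c_j' | x^(k)), else 0."""
    # log p(x^(k) | c_j) for diagonal Gaussians, summed over dimensions i
    log_lik = -0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]
                      + np.log(2 * np.pi * var[None, :, :])).sum(axis=2)
    log_post = log_lik + np.log(prior)[None, :]       # + log p(c_j), up to normalization
    z = np.zeros_like(log_post)
    z[np.arange(X.shape[0]), log_post.argmax(axis=1)] = 1.0
    return z

def base_rates(z):
    """p(c_j) proportional to sum_k z_j^(k)."""
    return z.sum(axis=0) / z.sum()
```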
Alternating optimization algorithm
0. Guess initial parameter values.
1. Given parameter estimates, solve for maximum a posteriori assignments z_j^(k):

   p(c_j | x^(k)) ∝ ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)

   z_j^(k) = 1 if j = argmax_j′ p(c_j′ | x^(k)), else 0

2. Given assignments z_j^(k), solve for maximum likelihood parameter estimates:

   μ_ij = Σ_k z_j^(k) x_i^(k) / Σ_k z_j^(k)

   σ_ij² = Σ_k z_j^(k) (x_i^(k) − μ_ij)² / Σ_k z_j^(k)

   p(c_j) ∝ Σ_k z_j^(k)

3. Go to step 1.
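Putting the two steps together, a sketch of the full alternating loop, reusing the ml_parameters, map_assignments, and base_rates sketches above (the convergence check and iteration cap are my choices):

```python
def alternating_optimization(X, mu0, var0, prior0, n_iter=50):
    """Step 0: initial parameters; then alternate Step 1 (assignments) and Step 2 (parameters)."""
    mu, var, prior = mu0, var0, prior0
    z_prev = None
    for _ in range(n_iter):
        z = map_assignments(X, mu, var, prior)     # Step 1: MAP assignments
        mu, var = ml_parameters(X, z)              # Step 2: ML parameters
        prior = base_rates(z)
        if z_prev is not None and (z == z_prev).all():
            break                                  # assignments stopped changing
        z_prev = z
    return mu, var, prior, z
```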
Alternating optimization algorithm
[Figure: data points x; z gives the assignments to clusters; μ, σ, p(c_j) are the cluster parameters. For simplicity, assume σ and p(c_j) are fixed.]
Alternating optimization algorithm
Step 0: initial parameter values
Alternating optimization algorithm
Step 1: update assignments
Alternating optimization algorithm
Step 2: update parameters
Alternating optimization algorithm
Step 1: update assignments
Alternating optimization algorithm
Step 2: update parameters
Alternating optimization algorithm
0. Guess initial parameter values.
1. Given parameter estimates, solve for maximum a posteriori assignments z_j^(k):

   p(c_j | x^(k)) ∝ ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)

   z_j^(k) = 1 if j = argmax_j′ p(c_j′ | x^(k)), else 0

   (Why hard assignments?)

2. Given assignments z_j^(k), solve for maximum likelihood parameter estimates:

   μ_ij = Σ_k z_j^(k) x_i^(k) / Σ_k z_j^(k)

   σ_ij² = Σ_k z_j^(k) (x_i^(k) − μ_ij)² / Σ_k z_j^(k)

   p(c_j) ∝ Σ_k z_j^(k)

3. Go to step 1.
EM algorithm
0. Guess initial parameter values θ = {μ, σ, p(c_j)}.
1. "Expectation" step: Given parameter estimates, compute expected values of the assignments z_j^(k):

   h_j^(k) = p(c_j | x^(k); θ) ∝ ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)

2. "Maximization" step: Given expected assignments, solve for maximum likelihood parameter estimates:

   μ_ij = Σ_k h_j^(k) x_i^(k) / Σ_k h_j^(k)

   σ_ij² = Σ_k h_j^(k) (x_i^(k) − μ_ij)² / Σ_k h_j^(k)

   p(c_j) ∝ Σ_k h_j^(k)
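A sketch of this soft variant, replacing the hard z_j^(k) with responsibilities h_j^(k) = p(c_j | x^(k); θ) (array shapes as before; the max-subtraction for numerical stability is my addition):

```python
import numpy as np

def e_step(X, mu, var, prior):
    """Responsibilities h_j^(k) = p(c_j | x^(k); theta)."""
    log_lik = -0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]
                      + np.log(2 * np.pi * var[None, :, :])).sum(axis=2)
    log_post = log_lik + np.log(prior)[None, :]
    log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
    h = np.exp(log_post)
    return h / h.sum(axis=1, keepdims=True)

def m_step(X, h, eps=1e-12):
    """Weighted ML estimates: the same formulas as before with h_j^(k) in place of z_j^(k)."""
    counts = h.sum(axis=0) + eps
    mu = (h.T @ X) / counts[:, None]
    var = (h[:, :, None] * (X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=0) / counts[:, None]
    return mu, var, counts / counts.sum()
```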
What EM is really about
• Define a single probabilistic model for the whole data set (a "mixture model"):

  p(X | θ) = ∏_k p(x^(k) | θ)
           = ∏_k Σ_j p(x^(k) | c_j; θ) p(c_j; θ)
           = ∏_k Σ_j ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)
What EM is really about
• Define a single probabilistic model for the whole data set (a "mixture model"):

  p(X | θ) = ∏_k p(x^(k) | θ)
           = ∏_k Σ_j p(x^(k) | c_j; θ) p(c_j; θ)
           = ∏_k Σ_j ∏_i [ 1 / (√(2π) σ_ij) ] exp( −(x_i^(k) − μ_ij)² / (2 σ_ij²) ) · p(c_j)

• How do we maximize this with respect to θ?
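For reference, a sketch of the objective itself, the log of the mixture likelihood p(X | θ) (using SciPy's log-sum-exp for stability; that choice, and the function name, are mine):

```python
import numpy as np
from scipy.special import logsumexp

def mixture_log_likelihood(X, mu, var, prior):
    """log p(X | theta) = sum_k log sum_j p(x^(k) | c_j; theta) p(c_j; theta)."""
    log_lik = -0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]
                      + np.log(2 * np.pi * var[None, :, :])).sum(axis=2)
    return logsumexp(log_lik + np.log(prior)[None, :], axis=1).sum()
```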
What EM is really about
• Maximization would be simpler if we introduced new labeling variables Z = {z_j^(k)}:

  p(X, Z | θ) = ∏_k ∏_j [ p(x^(k) | c_j; θ) p(c_j; θ) ]^( z_j^(k) )

  log p(X, Z | θ) = Σ_k Σ_j z_j^(k) [ Σ_i log p(x_i^(k) | c_j; θ) + log p(c_j; θ) ]
                  = Σ_k Σ_j z_j^(k) [ −Σ_i (x_i^(k) − μ_ij)² / (2 σ_ij²) + log p(c_j) ]  (dropping the Gaussian normalization terms)

• Problem: we don't know Z = {z_j^(k)}!
What EM is really about
• Maximize the expected value of the "complete data" log likelihood, log p(X, Z | θ):

  – E-step: Compute the expectation

    Q(θ | θ^(t)) = Σ_Z p(Z | X, θ^(t)) log p(X, Z | θ)

  – M-step: Maximize

    θ^(t+1) = argmax_θ Q(θ | θ^(t))
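A sketch of Q(θ | θ^(t)) for this mixture model, where the sum over Z reduces to the per-point responsibilities h computed in the E-step with the old parameters θ^(t) (function name is mine):

```python
import numpy as np

def expected_complete_log_likelihood(X, h, mu, var, prior):
    """Q(theta | theta^(t)) = sum_k sum_j h_j^(k) [ log p(x^(k) | c_j; theta) + log p(c_j) ],
    where h was computed in the E-step under the old parameters theta^(t)."""
    log_lik = -0.5 * (((X[:, None, :] - mu[None, :, :]) ** 2) / var[None, :, :]
                      + np.log(2 * np.pi * var[None, :, :])).sum(axis=2)
    return (h * (log_lik + np.log(prior)[None, :])).sum()
```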