Outline
• Bayesian concept learning: Discussion
• Probabilistic models for unsupervised and semi-supervised category learning

Discussion points
• Relation to "Bayesian classification"?
• Relation to the debate between rules / logic / symbols and similarity / connections / statistics?
• Where do the hypothesis space and prior probability distribution come from?

Discussion points
• Relation to "Bayesian classification"?
  – Causal attribution versus referential inference.
  – Which is more suited to natural concept learning?
• Relation to the debate between rules / logic / symbols and similarity / connections / statistics?
• Where do the hypothesis space and prior probability distribution come from?

Hierarchical priors
[Figure: graphical model in which a shared prior $\theta \sim \mathrm{Beta}(F_H, F_T)$, with hyperparameters $F_H, F_T$, sits above Coin 1, Coin 2, ..., Coin 200; each coin has its own bias $\theta$ generating flips $d_1, d_2, d_3, d_4$.]
• Latent structure captures what is common to all coins, and also their individual variability.

Hierarchical priors
[Figure: the same hierarchical structure for concepts: a shared prior $P(h)$ sits above Concept 1, Concept 2, ..., Concept 200; each concept has its own hypothesis $h$ generating examples $x_1, x_2, x_3, x_4$.]
• Latent structure captures what is common to all concepts, and also their individual variability.
• Is this all we need? (What about social knowledge, number knowledge, math/magnitude?)

Hierarchical priors
[Figure: same graphical model over concepts as above.]
• The hypothesis space is not just an arbitrary collection of hypotheses, but a principled system.
• Far more structured than our experience with specific number concepts.

Outline
• Bayesian concept learning: Discussion
• Probabilistic models for unsupervised and semi-supervised category learning

Simple model of concept learning
"This is a blicket." "Can you show me the other blickets?"

Simple model of concept learning
• Other blickets.
[Image, "The objects of planet Gazoob," removed due to copyright considerations.]

Simple model of concept learning
"This is a blicket." "Can you show me the other blickets?"
• Learning from just one positive example is possible if:
  – Concepts are assumed to refer to clusters in the world.
  – Enough unlabeled data is observed to identify clear clusters.

Complications (each illustrated with a single labeled example, "This is a blicket."):
• Outliers
• Overlapping clusters
• How many clusters?
• Clusters that are not simple blobs
• Concept labels inconsistent with clusters ("This is a blicket." "This is a gazzer.")

Simple model of concept learning
• Can infer a concept from just one positive example if:
  – Assume concepts refer to clusters in the world.
  – Observe lots of unlabeled data, in order to identify clusters.
• How do we identify the clusters?
  – With no labeled data ("unsupervised learning")
  – With sparsely labeled data ("semi-supervised learning")

Unsupervised clustering with probabilistic models
• Assume a simple parametric probabilistic model for clusters, e.g., Gaussian:
  $p(\mathbf{x} \mid c_j) = p(x_1 \mid c_j) \times p(x_2 \mid c_j)$, with $p(x_i \mid c_j) \propto e^{-(x_i - \mu_{ij})^2 / (2\sigma_{ij}^2)}$
[Figure: two Gaussian clusters $c_1$ and $c_2$ in the $(x_1, x_2)$ plane, annotated with parameters $\mu_{11}, \sigma_{11}$ and $\mu_{21}, \sigma_{21}$.]

Unsupervised clustering with probabilistic models
• Assume a simple parametric probabilistic model for clusters, e.g., Gaussian.
• Two ways to characterize the jth cluster:
  – Parameters: $\mu_{ij}, \sigma_{ij}$
  – Assignments: $z_j^{(k)} = 1$ if the kth point belongs to cluster j, else 0.
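To make the cluster model concrete, here is a minimal sketch (my own illustration, not code from the slides) of the diagonal-Gaussian class-conditional density $p(\mathbf{x} \mid c_j) = \prod_i p(x_i \mid c_j)$ and the posterior $p(c_j \mid \mathbf{x}) \propto p(\mathbf{x} \mid c_j)\, p(c_j)$; the function names and the toy parameter values are assumptions for the example.

import numpy as np

def cluster_likelihood(x, mu_j, sigma_j):
    """p(x | c_j) for a diagonal Gaussian cluster: a product of
    independent 1-D Gaussians over the feature dimensions i."""
    return np.prod(
        np.exp(-(x - mu_j) ** 2 / (2.0 * sigma_j ** 2))
        / (np.sqrt(2.0 * np.pi) * sigma_j)
    )

def cluster_posterior(x, mus, sigmas, priors):
    """p(c_j | x) proportional to p(x | c_j) p(c_j), normalized over clusters j."""
    unnorm = np.array([
        cluster_likelihood(x, mu_j, sig_j) * pj
        for mu_j, sig_j, pj in zip(mus, sigmas, priors)
    ])
    return unnorm / unnorm.sum()

# Two clusters in a 2-D feature space (toy parameter values).
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
sigmas = [np.array([1.0, 1.0]), np.array([1.0, 1.0])]
priors = [0.5, 0.5]
print(cluster_posterior(np.array([2.5, 2.0]), mus, sigmas, priors))

Because the features are treated as independent within a cluster, each cluster is fully summarized by the parameters $\mu_{ij}, \sigma_{ij}$ and the base rate $p(c_j)$, which is what the chicken-and-egg problem below turns on.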
Unsupervised clustering with probabilistic models
• Chicken-and-egg problem:
  – Given assignments, we could solve for maximum likelihood parameters:
    $\mu_{ij} = \frac{\sum_k z_j^{(k)} x_i^{(k)}}{\sum_k z_j^{(k)}}$,  $\sigma_{ij}^2 = \frac{\sum_k z_j^{(k)} \bigl(x_i^{(k)} - \mu_{ij}\bigr)^2}{\sum_k z_j^{(k)}}$

Unsupervised clustering with probabilistic models
• Chicken-and-egg problem:
  – Given parameters, we could solve for assignments $z_j^{(k)}$:
    $z_j^{(k)} = 1$ if $j = \arg\max_{j'} p(c_{j'} \mid \mathbf{x}^{(k)})$, else 0, where
    $p(c_j \mid \mathbf{x}^{(k)}) \propto p(\mathbf{x}^{(k)} \mid c_j)\, p(c_j) = \Bigl[\prod_i \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-(x_i^{(k)} - \mu_{ij})^2/(2\sigma_{ij}^2)}\Bigr] p(c_j)$,
    and solve for the "base rate" parameters: $p(c_j) \propto \sum_k z_j^{(k)}$.

Alternating optimization algorithm
0. Guess initial parameter values.
1. Given parameter estimates, solve for maximum a posteriori assignments $z_j^{(k)}$:
   $p(c_j \mid \mathbf{x}^{(k)}) \propto \Bigl[\prod_i \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-(x_i^{(k)} - \mu_{ij})^2/(2\sigma_{ij}^2)}\Bigr] p(c_j)$,
   $z_j^{(k)} = 1$ if $j = \arg\max_{j'} p(c_{j'} \mid \mathbf{x}^{(k)})$, else 0.
2. Given assignments $z_j^{(k)}$, solve for maximum likelihood parameter estimates:
   $\mu_{ij} = \frac{\sum_k z_j^{(k)} x_i^{(k)}}{\sum_k z_j^{(k)}}$,  $\sigma_{ij}^2 = \frac{\sum_k z_j^{(k)} (x_i^{(k)} - \mu_{ij})^2}{\sum_k z_j^{(k)}}$,  $p(c_j) \propto \sum_k z_j^{(k)}$.
3. Go to step 1.

Alternating optimization algorithm
• $x$: data points; $z$: assignments to clusters; $\mu$, $\sigma$, $p(c_j)$: cluster parameters.
  [For simplicity, assume $\sigma$ and $p(c_j)$ are fixed.]
[Figure sequence: Step 0: initial parameter values → Step 1: update assignments → Step 2: update parameters → Step 1: update assignments → Step 2: update parameters.]

Alternating optimization algorithm
• Same steps 0–3 as above, but: why hard assignments?

EM algorithm
0. Guess initial parameter values $\theta = \{\mu, \sigma, p(c_j)\}$.
1. "Expectation" step: given parameter estimates, compute expected values of the assignments $z_j^{(k)}$:
   $h_j^{(k)} = p(c_j \mid \mathbf{x}^{(k)}; \theta) \propto \Bigl[\prod_i \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-(x_i^{(k)} - \mu_{ij})^2/(2\sigma_{ij}^2)}\Bigr] p(c_j)$
2. "Maximization" step: given expected assignments, solve for maximum likelihood parameter estimates:
   $\mu_{ij} = \frac{\sum_k h_j^{(k)} x_i^{(k)}}{\sum_k h_j^{(k)}}$,  $\sigma_{ij}^2 = \frac{\sum_k h_j^{(k)} (x_i^{(k)} - \mu_{ij})^2}{\sum_k h_j^{(k)}}$,  $p(c_j) \propto \sum_k h_j^{(k)}$.

What EM is really about
• Define a single probabilistic model for the whole data set, a "mixture model":
  $p(X \mid \theta) = \prod_k p(\mathbf{x}^{(k)} \mid \theta) = \prod_k \sum_j p(\mathbf{x}^{(k)} \mid c_j; \theta)\, p(c_j; \theta) = \prod_k \sum_j \Bigl[\prod_i \frac{1}{\sqrt{2\pi}\,\sigma_{ij}} e^{-(x_i^{(k)} - \mu_{ij})^2/(2\sigma_{ij}^2)}\Bigr] p(c_j)$
• How do we maximize this with respect to $\theta$?

What EM is really about
• Maximization would be simpler if we introduced new labeling variables $Z = \{z_j^{(k)}\}$:
  $p(X, Z \mid \theta) = \prod_k \prod_j \bigl[p(\mathbf{x}^{(k)} \mid c_j; \theta)\, p(c_j; \theta)\bigr]^{z_j^{(k)}}$
  $\log p(X, Z \mid \theta) = \sum_k \sum_j z_j^{(k)} \log\bigl[p(\mathbf{x}^{(k)} \mid c_j; \theta)\, p(c_j; \theta)\bigr] = \sum_k \sum_j z_j^{(k)} \Bigl[-\sum_i \frac{(x_i^{(k)} - \mu_{ij})^2}{2\sigma_{ij}^2} + \log p(c_j)\Bigr] + \text{const.}$
• Problem: we don't know $Z = \{z_j^{(k)}\}$!

What EM is really about
• EM maximizes the expected value of the "complete data" log likelihood, $\log p(X, Z \mid \theta)$:
  – E-step: compute the expectation
    $Q(\theta \mid \theta^{(t)}) = \sum_Z p(Z \mid X, \theta^{(t)})\, \log p(X, Z \mid \theta)$
  – M-step: maximize
    $\theta^{(t+1)} = \arg\max_\theta Q(\theta \mid \theta^{(t)})$
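As a concrete companion to the E-step/M-step equations above, here is a minimal, self-contained sketch of EM for a mixture of diagonal-covariance Gaussians. This is a hypothetical illustration, not course code: the function name, the toy data, and the numerical-stability details (log-space responsibilities, variance floor) are my own additions.

import numpy as np

def em_gaussian_mixture(X, n_clusters, n_iters=50, seed=0):
    """Minimal EM for a mixture of diagonal-covariance Gaussians.
    X: (n_points, n_dims) array. Returns means, standard deviations,
    mixing weights p(c_j), and the final responsibilities h_j^(k)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Step 0: initial parameter values (means at random data points).
    mu = X[rng.choice(n, n_clusters, replace=False)]            # (J, d)
    sigma = np.ones((n_clusters, d)) * (X.std(axis=0) + 1e-3)   # (J, d)
    pc = np.full(n_clusters, 1.0 / n_clusters)                  # (J,)

    for _ in range(n_iters):
        # E-step: responsibilities h_j^(k) = p(c_j | x^(k); theta),
        # computed in log space for numerical stability.
        log_lik = (-0.5 * (((X[:, None, :] - mu[None]) / sigma[None]) ** 2).sum(-1)
                   - np.log(sigma[None]).sum(-1) - 0.5 * d * np.log(2 * np.pi))
        log_post = log_lik + np.log(pc)[None]
        log_post -= log_post.max(axis=1, keepdims=True)
        h = np.exp(log_post)
        h /= h.sum(axis=1, keepdims=True)                       # (n, J)

        # M-step: responsibility-weighted maximum likelihood updates.
        Nj = h.sum(axis=0)                                      # (J,)
        mu = (h.T @ X) / Nj[:, None]
        var = (h.T @ (X ** 2)) / Nj[:, None] - mu ** 2
        sigma = np.sqrt(np.maximum(var, 1e-6))                  # variance floor
        pc = Nj / n
    return mu, sigma, pc, h

# Toy example: two well-separated 2-D clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
mu, sigma, pc, h = em_gaussian_mixture(X, n_clusters=2)
print(mu)   # means should land near (0, 0) and (5, 5), up to cluster ordering

Replacing the soft responsibilities h with one-hot arg-max assignments recovers the hard alternating-optimization algorithm from the slides above; keeping them soft is what makes this EM on the mixture-model likelihood $p(X \mid \theta)$.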