Dirichlet Processes and Mixture Models
An interactive tutorial: Part 1
Pfunk Meeting 1, Fall ’11
*some content adapted from El-Arini 2008 & Teh 2007
J. Scholz (RIM@GT) Friday, September 2, 2011 08/26/2011

Purpose
• Introduction to the Dirichlet Process with MINIMAL prerequisites
• Set up for next week’s hands-on exposure to training mixture models using EM and DP priors
Demo Code: http://www.cc.gatech.edu/~jscholz6/resources/code/DP_Tutorial/

Topics
• Discrete N-D probability distributions (categorical, multinomial, Dirichlet)
• Dirichlet Process Definition
• Dirichlet Process Metaphors
  • Polya Urn
  • Chinese Restaurant Process
  • Stick-breaking process

Motivation
• We are given a bunch of data points and told they were generated by a mixture of Gaussians
• Unfortunately, no one said how many Gaussians produced the data
• Could it be this? [figure]
• Or perhaps this? [figure]

What to do?
• We can guess the number of components, run Expectation Maximization (EM) for Gaussian Mixture Models, look at the results, and then try again...
• We can run hierarchical agglomerative clustering, and cut the tree at a visually appealing level...
• Really, we want to cluster the data in a statistically principled manner, without resorting to hacks...
>> for a preview, run demo 5.2

Real examples
• Brain Imaging: Model an unknown number of spatial activation patterns in fMRI images [Kim and Smyth, NIPS 2006]
• Topic Modeling: Model an unknown number of topics across several corpora of documents [Teh et al. 2006]
• Filtering and planning in unknown state spaces (iHMM [Beal et al. 2003], iPOMDP [Doshi et al. 2009])

Preliminaries: N-D Distributions
• Categorical
  • Definition: $X \sim \mathrm{Cat}(p) \Rightarrow P(X = x_i) = p_i$
  • I.e., it’s a distribution over which one of a set of k possible events occurs
  • Semantics: a draw from a categorical RV is a single event
  try it! >> Prelim 1.1
• Multinomial
  • Definition: $X \sim \mathrm{Multi}(p) \Rightarrow P(X = x) = \frac{n!}{x_1! \cdots x_k!} \, p_1^{x_1} \cdots p_k^{x_k}$, where each $x_i$ is a non-negative integer and $\sum_i x_i = n$
  • I.e., X is a distribution over the number of occurrences of each of k possible events, over n total trials
  • Semantics: a draw from a multinomial RV is a vector of event counts
  • think: goes from event probs to event counts
  try it! >> Prelim 1.2
• Dirichlet
  • Definition: $\Pi \sim \mathrm{Dir}(\alpha) \Rightarrow P(\Pi = \pi) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_{k=1}^{K} \pi_k^{\alpha_k - 1}$
  • I.e., Π is a distribution over the event probabilities in a categorical/multinomial RV. Hence it can be thought of as a distribution on distributions
  • Semantics: a draw from a Dirichlet RV is a vector of event probabilities
  • think: goes from event (pseudo-)counts α to event probs
  try it! >> Prelim 1.3

When is the Dirichlet useful?
• Often appears in a Bayesian setting when we need a prior on multinomial params (it’s conjugate to the multinomial)
• Same as the beta distribution for the binomial, except N-D
• E.g.: say we want to figure out the heads probability p of a trick coin, and we only observe 3 heads
  • The ML estimate of p is 1, but that’s a bit strong, no?
  • Solution? Place a Beta prior on p, and use Bayes’ rule*
* for more, see: http://en.wikipedia.org/wiki/Checking_whether_a_coin_is_fair

Visualizing the Dirichlet Distribution
[Figure: examples of Dirichlet distributions. Source: Yee Whye Teh (Gatsby), DP and HDP Tutorial, Mar 1, 2007 / CUED]
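The three preliminary distributions and the trick-coin example can be sketched in a few lines. This is a minimal illustration using only Python's standard library, not the tutorial's demo code; the function names (`sample_categorical`, `sample_multinomial`, `sample_dirichlet`, `beta_posterior_mean`) are hypothetical, and the uniform Beta(1, 1) prior on the coin is an assumed choice.

```python
import random

def sample_categorical(p, rng):
    """Draw a single event index from Cat(p)."""
    return rng.choices(range(len(p)), weights=p, k=1)[0]

def sample_multinomial(n, p, rng):
    """Draw a vector of event counts over n trials (event probs -> event counts)."""
    counts = [0] * len(p)
    for _ in range(n):
        counts[sample_categorical(p, rng)] += 1
    return counts

def sample_dirichlet(alpha, rng):
    """Draw a vector of event probabilities from Dir(alpha) by
    normalizing independent Gamma(alpha_k, 1) draws."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def beta_posterior_mean(heads, tails, a=1.0, b=1.0):
    """Posterior mean of the heads probability under an assumed Beta(a, b) prior."""
    return (a + heads) / (a + b + heads + tails)

rng = random.Random(0)
pi = sample_dirichlet([2.0, 2.0, 2.0], rng)   # a vector of event probabilities
counts = sample_multinomial(10, pi, rng)      # a vector of event counts
event = sample_categorical(pi, rng)           # a single event
print(pi, counts, event)

# Three heads, no tails: the ML estimate is 3/3 = 1, while the posterior
# mean under Beta(1, 1) is (1 + 3) / (1 + 1 + 3) = 0.8.
print(beta_posterior_mean(heads=3, tails=0))
```

The Beta prior acts like one imaginary head and one imaginary tail observed in advance, which is why the estimate is pulled away from 1.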
Dirichlet: from Distribution to Process
• “A Dirichlet Process (DP) is a distribution over probability measures”
• “A DP is a distribution over probability measures such that marginals on finite partitions are Dirichlet distributed”
• “A probability measure is a function from subsets of a space X to [0, 1] satisfying certain properties”
This is key. We’ll get back to it
• If you’re thinking “WTF??”, hang on!

Defining a DP
• A Dirichlet Process (DP) is a distribution over probability measures.
• A DP has two parameters:
  • Base distribution H, which is like the mean of the DP
  • Strength parameter α, which is like an inverse-variance of the DP
• We write $G \sim \mathrm{DP}(\alpha, H)$ if for any partition $(A_1, \dots, A_n)$ of X:
  $(G(A_1), \dots, G(A_n)) \sim \mathrm{Dirichlet}(\alpha H(A_1), \dots, \alpha H(A_n))$
[Figure: a space X partitioned into regions A1, ..., A6]
try it! >> Prelim 2.1

A closer look: what is the form of H?
• Can be any distribution defined over our event space (e.g. Gaussian)
• continuous or discrete: both legal
• only condition is that it has to assign a probability mass H(A) to any partition cell A we give it
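The defining property (Dirichlet marginals on any finite partition) can be made concrete with a small sketch. It assumes a standard normal base distribution H, a partition of the real line into four intervals, and α = 5; none of these choices come from the slides, and the function names are hypothetical.

```python
import math
import random

def normal_cdf(x):
    """CDF of the standard normal, used here as the assumed base distribution H."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def base_measure_of_cells(edges):
    """H(A_i) for the partition of the real line cut at the given edges."""
    cuts = [-math.inf] + sorted(edges) + [math.inf]
    return [normal_cdf(b) - normal_cdf(a) for a, b in zip(cuts[:-1], cuts[1:])]

def sample_partition_marginal(alpha, h_cells, rng):
    """Draw (G(A_1), ..., G(A_n)) ~ Dirichlet(alpha*H(A_1), ..., alpha*H(A_n))
    by normalizing independent Gamma draws."""
    g = [rng.gammavariate(alpha * h, 1.0) for h in h_cells]
    total = sum(g)
    return [x / total for x in g]

rng = random.Random(0)
h_cells = base_measure_of_cells([-1.0, 0.0, 1.0])   # H(A_1), ..., H(A_4)
draws = [sample_partition_marginal(5.0, h_cells, rng) for _ in range(2000)]
means = [sum(d[i] for d in draws) / len(draws) for i in range(len(h_cells))]
print([round(h, 3) for h in h_cells])
print([round(m, 3) for m in means])   # averages land close to H(A_i)
```

Averaged over many draws, G(A_i) sits near H(A_i), which is the sense in which H is like the mean of the DP. Raising α shrinks the spread of the draws around that mean, matching the inverse-variance reading of the strength parameter.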
So where is the Process in a DP?
• 3 Metaphors:
  • Polya urn: involves drawing and replacing balls from an urn
  • Chinese Restaurant: involves customers sitting at tables in proportion to their popularity
  • Stick-breaking: involves breaking off pieces of a stick of unit length

The Polya-urn Scheme
• The Polya urn scheme produces a sequence θ1, θ2, . . . with the following conditionals:
  eq 1: $\theta_n \mid \theta_{1:n-1} \sim \frac{\sum_{i=1}^{n-1} \delta_{\theta_i} + \alpha H}{n - 1 + \alpha}$
  (the sum of point masses counts the number of $\theta_i$-colored balls)
• Imagine picking balls of different colors from an urn:
  • Start with no balls in the urn.
  • With probability ∝ α, draw $\theta_n \sim H$, and add a ball of that color into the urn.
  • With probability ∝ n − 1, pick a ball at random from the urn, record $\theta_n$ to be its color, return the ball into the urn and place a second ball of the same color into the urn.

Polya sampling in practice
• Equation 1 is a mixture of the form p f(Ω) + (1 − p) g(Ω), with p = (n − 1)/(n − 1 + α), f the empirical distribution of the balls already in the urn, and g = H
• Implies that proportion p of the density is associated with f, so we can split the task in half:
  • first flip a Bern(p) coin. If heads, draw from f; if tails, draw from g
  • for the Polya urn, this gives us either a sample from the existing balls (f), or a new color (g)*
*if g is a continuous density on Ω, then the probability of sampling an existing cluster from g is zero. (why?)

Analyzing Polya Urn
• A draw G ∼ DP(α, H) is a random probability measure (one infinitely long “run” of our process)
• Treating G as a distribution, consider i.i.d. draws from G: $\theta_i \mid G \sim G$ (one component drawn from G)
• Marginalizing out G, marginally each $\theta_i \sim H$, while the conditional distributions are:
  $\theta_n \mid \theta_{1:n-1} \sim \frac{\sum_{i=1}^{n-1} \delta_{\theta_i} + \alpha H}{n - 1 + \alpha}$
• This is precisely what we did in the Polya urn scheme*
* This is why people say that the DP is the “De Finetti distribution” underlying the urn process. It’s what makes the θi exchangeable. (Since the θi are i.i.d. ∼ G given G, their joint distribution is invariant to permutations)

The Chinese Restaurant Process
• Generating from the CRP:
  • First customer sits at the first table.
  • Customer n sits at:
    • Table k with probability $\frac{n_k}{\alpha + n - 1}$, where $n_k$ is the number of customers at table k.
    • A new table K + 1 with probability $\frac{\alpha}{\alpha + n - 1}$.
• Customers ⇔ integers, tables ⇔ clusters.
• The CRP exhibits the clustering property of the DP.
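The urn conditionals and the CRP seating rule can both be simulated directly. The sketch below is an illustration under assumptions, not the tutorial's demo code: the function names `polya_urn` and `crp` are hypothetical, and the base distribution H is taken to be a standard normal purely for the usage example.

```python
import random
from collections import Counter

def polya_urn(n, alpha, base_draw, rng):
    """Generate theta_1..theta_n from the urn conditionals."""
    thetas = []
    for i in range(n):
        # i balls are already in the urn (i = n - 1 for the n-th draw):
        # new color with probability alpha / (i + alpha), else copy a ball.
        if rng.random() < alpha / (i + alpha):
            thetas.append(base_draw(rng))       # new color drawn from H
        else:
            thetas.append(rng.choice(thetas))   # color of a uniformly chosen ball
    return thetas

def crp(n, alpha, rng):
    """Seat n customers and return the table index chosen by each."""
    tables = []     # tables[k] = number of customers at table k
    seating = []
    for _ in range(n):
        # Existing table k has weight n_k; a new table has weight alpha.
        weights = tables + [alpha]
        k = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        if k == len(tables):
            tables.append(1)
        else:
            tables[k] += 1
        seating.append(k)
    return seating

rng = random.Random(1)
thetas = polya_urn(50, 2.0, lambda r: r.gauss(0.0, 1.0), rng)
seating = crp(50, 2.0, rng)
print(len(set(thetas)), "distinct colors among", len(thetas), "balls")
print(Counter(seating).most_common(3))
```

Sorting the urn's balls by color gives the same kind of partition that `crp` produces, with table sizes playing the role of color counts.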
The Chinese Restaurant Process
[Figure: numbered customers seated at tables]
• Exhibits a rich-get-richer effect
• Closely related to the Polya urn process:
  • The CRP is the induced distribution over partitions from an urn process
  • Just take all the balls and sort them by color
  • This defines a partition of 1, . . . , n into K clusters, such that if i is in cluster k, then $\theta_i = \theta_k^*$

The Stick-Breaking Construction
• But what do draws G ∼ DP(α, H) look like?
• G is discrete with probability one, so:
  $G = \sum_{k=1}^{\infty} \pi_k \delta_{\theta_k^*}$
  ($\pi_k$: mixing proportion; $\delta_{\theta_k^*}$: point mass)
• The stick-breaking construction shows that G ∼ DP(α, H) if:
  $\pi_k = \beta_k \prod_{l=1}^{k-1} (1 - \beta_l)$, $\beta_k \sim \mathrm{Beta}(1, \alpha)$, $\theta_k^* \sim H$
[Figure: a unit-length stick broken into pieces π(1), ..., π(6)]
• We write π ∼ GEM(α) if π = (π1, π2, . . .) is distributed as above.

The Stick-Breaking Construction: why does this make sense?
• Draws from Beta(1, α) give a distribution over the interval (0, 1), which we can think of as where to break the stick
• The product scales the k-th sample according to how much has been broken off already
• In the limit we get another infinite partitioning of our interval [0, 1], and therefore a (discrete) probability measure

End Part 1
Questions?
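As a companion to the stick-breaking construction, here is a minimal truncated sampler. A true draw G has infinitely many atoms, so stopping once the unbroken remainder of the stick falls below `tol` is an approximation introduced for this sketch; the function name `stick_breaking`, the tolerance, and the standard normal base distribution in the usage lines are all assumptions.

```python
import random

def stick_breaking(alpha, base_draw, rng, tol=1e-6):
    """Truncated draw of the weights and atoms of G ~ DP(alpha, H)."""
    weights, atoms = [], []
    remaining = 1.0                        # length of stick not yet broken off
    while remaining > tol:
        beta = rng.betavariate(1.0, alpha)
        weights.append(beta * remaining)   # pi_k = beta_k * prod_{l<k} (1 - beta_l)
        atoms.append(base_draw(rng))       # theta*_k ~ H
        remaining *= 1.0 - beta
    return weights, atoms, remaining

rng = random.Random(2)
weights, atoms, leftover = stick_breaking(2.0, lambda r: r.gauss(0.0, 1.0), rng)
print(len(weights), "atoms kept; unassigned mass:", leftover)
print("largest weights:", sorted(weights, reverse=True)[:3])
```

Smaller α tends to break off large pieces early, so a few atoms carry most of the mass; larger α spreads the mass over many atoms.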