Indian Buffet Process
Xiaolong Wang and Daniel Khashabi
UIUC, 2013

Contents
• Mixture Model
▫ Finite mixture
▫ Infinite mixture
• Matrix Feature Model
▫ Finite features
▫ Infinite features (Indian Buffet Process)

Bayesian Nonparametrics
• Models with an unbounded number of elements
▫ Dirichlet Process for infinite mixture models
▫ With various applications: hierarchies, topics and syntactic classes, objects appearing in one image
▫ Cons: these models are limited to cases that can be modeled with a DP, i.e. each observation is generated by only one latent component

Bayesian Nonparametrics contd.
• In practice there may be more complicated interactions between latent variables and observations
• Solution
▫ Look for more flexible nonparametric models
▫ Such interactions can be captured via a binary matrix
▫ Infinite features means an infinite number of columns

Finite Mixture Model
• Set of observations: \{x_i\}_{i=1}^{N}
• Constant number of clusters, K
• Cluster assignment for x_i is c_i \in \{1, \dots, K\}
• Cluster assignment vector: c = [c_1, c_2, \dots, c_N]^T
• The probability of each sample under the model:
  p(x_i \mid \theta) = \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)
• The likelihood of the samples:
  p(X \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)
• The prior on the component probabilities (symmetric Dirichlet):
  \theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)

Finite Mixture Model
• Since we want the mixture model to be valid for any general component p(x_j \mid c_j = i), we take only the cluster assignments as the goal of learning this mixture model!
• Cluster assignments: c = [c_1, c_2, \dots, c_N]^T
• The model can be summarized as:
  \theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K), \qquad c_i \mid \theta \sim \mathrm{Discrete}(\theta)
• To have a valid model, all of the distributions must be valid:
  p(\theta \mid c) = \frac{p(c \mid \theta)\, p(\theta)}{p(c)}
  (posterior = likelihood × prior / marginal likelihood (evidence))

Finite Mixture Model contd.
• Marginalizing out the mixing weights:
  p(c) = \int p(c \mid \theta)\, p(\theta)\, d\theta, \qquad p(c \mid \theta) = \prod_{k=1}^{K} \theta_k^{m_k}, \qquad p(\theta) = \frac{\prod_{k=1}^{K} \theta_k^{\alpha/K - 1}}{D(\alpha/K, \dots, \alpha/K)}
  where m_k = \sum_{i=1}^{N} \delta(c_i = k)
• This Dirichlet–multinomial integral gives
  p(c) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}

Infinite mixture model
• Infinite clusters: it is like saying that we have K \to \infty
• Since we always have a finite number of samples in reality, only a finite number of clusters is used; so we define two sets of clusters:
▫ K_+: number of classes for which m_k > 0
▫ K_0: number of classes for which m_k = 0
• Assume a reordering such that m_k > 0 for k \le K_+ and m_k = 0 for k > K_+, with K = K_+ + K_0
• This yields an infinite-dimensional multinomial cluster distribution

Infinite mixture model
• Now we return to the previous slides and let K \to \infty in the formulas. Starting from
  p(c) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad m_k = \sum_{i=1}^{N} \delta(c_i = k),
  and using \Gamma(x + 1) = x\,\Gamma(x),
  \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} = \frac{\alpha}{K} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right),
  so
  p(c) = \left(\frac{\alpha}{K}\right)^{K_+} \left[\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
• If we set K \to \infty, the marginal likelihood p(c) \to 0. Instead we can model this problem by defining probabilities on partitions of samples, instead of class labels for each sample.

Infinite mixture model contd.
• Define a partition of objects; we want to partition N objects into K_+ classes
• Equivalence class of object partitions: [c] = \{c' : c' \equiv c\} (assignments identical up to relabeling of the clusters)
  p([c]) = \sum_{c \in [c]} p(c) = \frac{K!}{K_0!} \left(\frac{\alpha}{K}\right)^{K_+} \left[\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
  \lim_{K \to \infty} p([c]) = \alpha^{K_+} \left[\prod_{k=1}^{K_+} (m_k - 1)!\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
• This is a valid probability distribution for an infinite mixture model
• It is exchangeable with respect to the cluster assignments! (A sketch that evaluates this partition probability appears below.)
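As a quick check on the limiting formula above, here is a minimal Python sketch (not part of the original slides; the function name and example labels are made up for illustration) that evaluates log p([c]) and confirms that permuting the observations leaves the partition probability unchanged.

```python
# Minimal sketch (illustration only): evaluate the limiting partition probability
#   p([c]) = alpha^{K+} * Gamma(alpha)/Gamma(N+alpha) * prod_k (m_k - 1)!
# in log space, and check exchangeability by permuting the observations.
import math
from collections import Counter
from itertools import permutations

def log_p_partition(assignments, alpha):
    """Log-probability of the partition induced by a list of cluster labels."""
    n = len(assignments)
    counts = Counter(assignments).values()        # the m_k of the occupied clusters
    log_p = len(counts) * math.log(alpha)         # alpha^{K+}
    log_p += math.lgamma(alpha) - math.lgamma(n + alpha)
    log_p += sum(math.lgamma(m) for m in counts)  # log (m_k - 1)! = lgamma(m_k)
    return log_p

c = [1, 1, 2, 3, 2, 1]                            # hypothetical cluster labels
print(log_p_partition(c, alpha=1.5))
# Reordering the observations never changes the partition probability.
assert all(math.isclose(log_p_partition(list(p), 1.5), log_p_partition(c, 1.5))
           for p in permutations(c))
```

This invariance under reordering is exactly what licenses the sequential (Chinese restaurant) construction on the next slides.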
• Exchangeability matters for two reasons:
▫ It is important for Gibbs sampling (and the Chinese restaurant process)
▫ De Finetti's theorem explains why exchangeable observations are conditionally independent given some underlying probability distribution

Chinese Restaurant Process (CRP)
• Customers (samples) enter a restaurant one at a time and choose tables (clusters):
  p(c_i = k \mid c_1, \dots, c_{i-1}) =
  \begin{cases} \dfrac{m_k}{i - 1 + \alpha} & k \le K_+ \ (\text{occupied table}) \\[4pt] \dfrac{\alpha}{i - 1 + \alpha} & k = K_+ + 1 \ (\text{new table}) \end{cases}
• Worked example with parameter \alpha = 1 (the original slides animate ten customers one by one; a simulation sketch appears at the end of this part):
▫ Customer 1: m_1 = 0, so p(c_1 = 1) = 1
▫ Customer 2: m_1 = 1, so p(c_2 = 1 \mid c_1) = 1/2 and p(c_2 = 2 \mid c_1) = 1/2
▫ Customer 3 (after c_1 = 1, c_2 = 2): m_1 = m_2 = 1, so p(c_3 = 1 \mid c_{1:2}) = p(c_3 = 2 \mid c_{1:2}) = p(c_3 = 3 \mid c_{1:2}) = 1/3
▫ … and so on: each new customer i joins table k with probability m_k / (i - 1 + \alpha) or starts a new table with probability \alpha / (i - 1 + \alpha)

CRP: Gibbs sampling
• The Gibbs sampler requires the full conditional
  p(c_i = k \mid c_{-i}, X) \propto p(X \mid c)\, p(c_i = k \mid c_{-i})
• Finite Mixture Model:
  p(c_i = k \mid c_{-i}) = \frac{m_{-i,k} + \alpha/K}{N - 1 + \alpha}
• Infinite Mixture Model:
  p(c_i = k \mid c_{-i}) =
  \begin{cases} \dfrac{m_{-i,k}}{N - 1 + \alpha} & m_{-i,k} > 0 \\[4pt] \dfrac{\alpha}{N - 1 + \alpha} & k = K_+ + 1 \ (\text{new class}) \\[4pt] 0 & \text{otherwise} \end{cases}

Beyond the limit of a single label
• In Latent Class Models:
▫ Each object (word) has only one latent label (topic)
▫ Finite number of latent labels: LDA
▫ Infinite number of latent labels: DPM
• In Latent Feature (latent structure) Models:
▫ Each object (graph) has multiple latent features (entities)
▫ Finite number of latent features: Finite Feature Model (FFM)
▫ Infinite number of latent features: Indian Buffet Process (IBP)
• Rows are data points, columns are latent features
• Movie preference example:
▫ Rows are movies: e.g. Rise of the Planet of the Apes
▫ Columns are latent features: made in the U.S., is science fiction, has apes in it, …

Latent Feature Model
• F: latent feature matrix; Z: binary matrix; V: value matrix
• F = Z \otimes V (elementwise product), with p(F) = p(Z)\, p(V)
• If V is continuous, F is continuous; if V is discrete, F is discrete
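Before moving on to the finite feature model, here is a rough simulation of the CRP seating rule shown above (an illustrative sketch, not code from the slides; the function name and seed are arbitrary).

```python
# Rough CRP simulation: customer i joins table k with probability
# m_k / (i - 1 + alpha) and opens a new table with probability alpha / (i - 1 + alpha).
import random

def crp_sample(n_customers, alpha=1.0, seed=0):
    rng = random.Random(seed)
    counts = []          # m_k: number of customers already seated at table k
    assignments = []     # c_i for each customer
    for i in range(1, n_customers + 1):
        weights = counts + [alpha]    # unnormalized; the factor 1/(i - 1 + alpha) cancels
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):      # customer opens a new table
            counts.append(0)
        counts[table] += 1
        assignments.append(table + 1)
    return assignments, counts

labels, table_sizes = crp_sample(10, alpha=1.0)
print(labels, table_sizes)            # one possible draw of the ten-customer example
```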
Finite Feature Model
• Generating Z, an N × K binary matrix (a generative sketch of this construction appears after the derivation below):
▫ For each column k, draw \pi_k from a Beta distribution:
  \pi_k \mid \alpha \sim \mathrm{Beta}(\alpha/K,\, 1), \quad \text{equivalently} \quad (\pi_k, 1 - \pi_k) \sim \mathrm{Dirichlet}(\alpha/K,\, 1)
▫ For each object i, flip a coin:
  z_{ik} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k)
• Distribution of Z given \pi:
  p(Z \mid \pi) = \prod_{k=1}^{K} \prod_{i=1}^{N} p(z_{ik} \mid \pi_k) = \prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k}
• Integrating out \pi:
  p(Z \mid \alpha) = \prod_{k=1}^{K} \int p(\pi_k \mid \alpha)\, p(z_{\cdot k} \mid \pi_k)\, d\pi_k = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}
• Z is sparse: the expected number of non-zero entries is N\alpha/(1 + \alpha/K) \le N\alpha, even as K \to \infty

Indian Buffet Process
1st Representation: K → ∞
• Difficulty: as K \to \infty, P(Z) \to 0 for any particular matrix
• Solution: define equivalence classes of random binary feature matrices
• Left-ordered form of a binary matrix, lof(Z):
▫ Compute the history h of each feature (column) k: the column read as a binary number, with the first row as the most significant bit (in the slides' figure, for example, h_4 = 2^3 + 2^7 + 2^9)
▫ Order the columns by decreasing history
• Z_1 and Z_2 are lof-equivalent iff lof(Z_1) = lof(Z_2); the equivalence class is written [Z]
• Describing [Z]:
  K_h = \#\{k : \text{column } k \text{ has history } h\}, \qquad K_+ = \sum_{h > 0} K_h, \qquad K = K_+ + K_0
• Cardinality of [Z]:
  \binom{K}{K_0\ K_1 \cdots K_{2^N - 1}} = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}
  (a small sketch that computes histories and lof(Z) appears at the end of this derivation)

Indian Buffet Process
1st Representation: K → ∞
• Given:
  P(Z \mid \alpha) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}, \qquad \mathrm{card}([Z]) = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}
• Derive the limit as K \to \infty (K_+ < \infty almost surely):
  P([Z] \mid \alpha) = P(Z \mid \alpha) \cdot \mathrm{card}([Z])
• Split the K_0 empty columns (m_k = 0) from the K_+ non-empty ones and use \Gamma(x+1) = x\,\Gamma(x):
  P([Z] \mid \alpha) = \frac{K!}{K_0!\, K^{K_+} \prod_{h=1}^{2^N-1} K_h!}\; \alpha^{K_+} \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \prod_{k=1}^{K_+} \frac{(N - m_k)!\, \prod_{j=1}^{m_k - 1}(j + \frac{\alpha}{K})}{N!}
• Take K \to \infty term by term:
  \frac{K!}{K_0!\, K^{K_+}} \to 1, \qquad \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \to e^{-\alpha H_N}, \qquad \prod_{j=1}^{m_k - 1}\left(j + \frac{\alpha}{K}\right) \to (m_k - 1)!
  where H_N = \sum_{j=1}^{N} \frac{1}{j} is the harmonic sum
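The derivation above hinges on column histories and the left-ordered form. The following small sketch (illustrative only; it assumes the convention that row 1 is the most significant bit, and the toy matrix is made up) computes histories and lof(Z).

```python
# Sketch: column "histories" and the left-ordered form lof(Z), assuming row 1
# is the most significant bit of each column's history.
import numpy as np

def histories(Z):
    N = Z.shape[0]
    powers = 2 ** np.arange(N - 1, -1, -1)   # row 1 -> 2^(N-1), row N -> 2^0
    return Z.T @ powers                      # one integer history per column

def lof(Z):
    """Left-ordered form: columns sorted by decreasing history."""
    h = histories(Z)
    return Z[:, np.argsort(-h, kind="stable")]

Z = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])
print(histories(Z))   # first column has history 2^2 + 2^0 = 5
print(lof(Z))
```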
Indian Buffet Process
1st Representation: K → ∞
• The limit is therefore
  P([Z] \mid \alpha) = e^{-\alpha H_N}\, \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!}\, \prod_{k=1}^{K_+} \frac{(m_k - 1)!\,(N - m_k)!}{N!}
• Note:
▫ P([Z] | α) is well defined because K_+ < ∞ almost surely
▫ P([Z] | α) depends on Z only through the history counts K_h, the number of features (columns) with history h
▫ Permuting the rows (data points) does not change the K_h (exchangeability)

Indian Buffet Process
2nd Representation: Customers & Dishes
• An Indian restaurant with infinitely many dishes (columns)
• The first customer tastes the first K_1^{(1)} dishes, where K_1^{(1)} \sim \mathrm{Poisson}(\alpha)
• The i-th customer:
▫ tastes each previously sampled dish k with probability m_k / i, where m_k is the number of previous customers who took dish k
▫ then tastes K_1^{(i)} new dishes, where K_1^{(i)} \sim \mathrm{Poisson}(\alpha/i)
• Poisson distribution: p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!

Indian Buffet Process
2nd Representation: Customers & Dishes
• Probability of the new-dish choices:
  P_{\text{new dishes}} = \prod_{i=1}^{N} \frac{(\alpha/i)^{K_1^{(i)}} e^{-\alpha/i}}{K_1^{(i)}!} = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}!}\, e^{-\alpha H_N} \prod_{i=1}^{N}\left(\frac{1}{i}\right)^{K_1^{(i)}}
• Probability of the old-dish choices, combined with the leftover \prod_i (1/i)^{K_1^{(i)}} factor:
  \prod_{i=1}^{N}\left(\frac{1}{i}\right)^{K_1^{(i)}} P_{\text{old dishes}} = \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
  (in the slides' worked example with N = 10, a dish taken by 4 customers contributes 3!\,6!/10! and a dish taken by 2 customers contributes 1!\,8!/10!)
• Putting these together:
  P_{\text{IBP}}(Z \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
• Note:
▫ Permuting the K_1^{(1)} dishes of the first customer, the next K_1^{(2)} dishes, …, the last K_1^{(N)} dishes does not change P(Z)
▫ All such permuted matrices belong to the same lof equivalence class [Z]
▫ The number of restaurant-process matrices mapping to [Z] is \prod_{i=1}^{N} K_1^{(i)}! \big/ \prod_{h=1}^{2^N-1} K_h!, so
  P_{\text{IBP}}([Z] \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N-1} K_h!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
▫ The 2nd representation is therefore equivalent to the 1st representation

Indian Buffet Process
3rd Representation: Distribution over Collections of Histories
• Directly generate the left-ordered form (lof) matrix Z
• For each history h:
▫ m_h: number of non-zero elements (ones) in h
▫ Generate K_h columns with history h:
  K_h \sim \mathrm{Poisson}\!\left(\alpha\, \frac{(m_h - 1)!\,(N - m_h)!}{N!}\right)
• This defines a distribution over collections of histories
• Note:
▫ Permuting the digits of h does not change m_h (nor P(K))
▫ Permuting rows means customers are exchangeable

Indian Buffet Process
• Effective dimension of the model, K_+:
▫ Follows a Poisson distribution: K_+ \sim \mathrm{Poisson}(\alpha H_N)
▫ Derived from the 2nd representation by summing the Poisson components
• Number of features possessed by each object:
▫ Follows a Poisson(α) distribution
▫ Derived from the 2nd representation: the first customer chooses Poisson(α) dishes, and the customers are exchangeable and thus can be permuted
• Z is sparse:
▫ The expected number of non-zero entries per row is α, so the expected number of non-zero entries in Z is Nα
▫ Derived from the 2nd (or 1st) representation
• (See the simulation sketch below.)
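To make these properties concrete, here is a minimal sketch of the customers-and-dishes construction (illustration only; the function name and parameters are arbitrary). Row sums should behave like Poisson(α) draws and the number of sampled dishes K_+ like a Poisson(α·H_N) draw.

```python
# Minimal sketch of the IBP "customers and dishes" construction:
# customer i takes existing dish k with probability m_k / i and then
# samples Poisson(alpha / i) brand-new dishes.
import numpy as np

def ibp_sample(n_customers, alpha=2.0, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                         # m_k for each dish sampled so far
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in dish_counts]   # revisit old dishes
        new = rng.poisson(alpha / i)                             # brand-new dishes
        row += [1] * new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * new
        rows.append(row)
    K_plus = len(dish_counts)
    Z = np.zeros((n_customers, K_plus), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp_sample(10, alpha=2.0)
print(Z.shape, Z.sum(axis=1))   # K+ ~ Poisson(alpha * H_N); row sums ~ Poisson(alpha)
```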
IBP: Gibbs sampling
• We need the full conditional
  p(z_{ik} = 1 \mid Z_{-(ik)}, X) \propto p(X \mid Z)\, p(z_{ik} = 1 \mid Z_{-(ik)})
▫ Z_{-(ik)} denotes the entries of Z other than z_{ik}
▫ p(X | Z) depends on the model chosen for the observed data
• By exchangeability, treat row i as if it were generated by the last customer:
▫ By the IBP, p(z_{ik} = 1 \mid z_{-i,k}) = \frac{m_{-i,k}}{N}
▫ If we sample z_{ik} = 0 and the count m_k drops to 0: delete that column (feature)
▫ At the end, draw the number of new dishes (columns) for row i from Poisson(α/N), taking p(X | Z) into account
  This is approximated by truncation: compute probabilities for a range of numbers of new dishes up to an upper bound

More talk on applications
• Applications
▫ As a prior distribution in models with an infinite number of features
▫ Modeling protein interactions
▫ Models of bipartite graphs in which one side has an unbounded number of elements
▫ Binary matrix factorization for modeling dyadic data
▫ Extracting features from similarity judgments
▫ Latent features in link prediction
▫ Independent components analysis and sparse factor analysis
• More on inference
▫ Stick-breaking representation (Yee Whye Teh et al., 2007)
▫ Variational inference (Finale Doshi-Velez et al., 2009)
▫ Accelerated inference (Finale Doshi-Velez et al., 2009)

References
• Griffiths, T. L., & Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.
• Slides: Tom Griffiths, "The Indian buffet process".