Indian Buffet Process
Xiaolong Wang and Daniel Khashabi
UIUC, 2013

Contents
• Mixture Model
▫ Finite mixture
▫ Infinite mixture
• Matrix Feature Model
▫ Finite features
▫ Infinite features (Indian Buffet Process)

Bayesian Nonparametrics
• Models with an unbounded number of elements
▫ Dirichlet Process for infinite mixture models
▫ With various applications: hierarchies, topics and syntactic classes, objects appearing in one image
▫ Cons: these models are limited to cases that can be modeled with a DP, i.e. each observation is generated by only one latent component

Bayesian Nonparametrics contd.
• In practice there may be more complicated interactions between latent variables and observations
• Solution
▫ Look for more flexible nonparametric models
▫ Such interactions can be captured via a binary matrix
▫ Infinite features means an infinite number of columns

Finite Mixture Model
• Set of observations: \{x_i\}_{i=1}^{N}
• Constant number of clusters, K
• Cluster assignment for x_i is c_i \in \{1, \dots, K\}
• Cluster assignment vector: c = [c_1, c_2, \dots, c_N]^T
• The probability of each sample under the model:
  p(x_i \mid \theta) = \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)
• The likelihood of the samples:
  p(X \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)
• The prior on the component probabilities (symmetric Dirichlet):
  \theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K)

Finite Mixture Model
• Since we want the mixture model to be valid for any general component p(x_j \mid c_j = i), we take only the cluster assignments as the goal of learning this mixture model!
• Cluster assignments: c = [c_1, c_2, \dots, c_N]^T
• The model can be summarized as:
  \theta \mid \alpha \sim \mathrm{Dirichlet}(\alpha/K, \dots, \alpha/K), \qquad c_i \mid \theta \sim \mathrm{Discrete}(\theta)
• To have a valid model, all of the distributions must be valid:
  p(\theta \mid c) = \frac{p(c \mid \theta)\, p(\theta)}{p(c)}
  (posterior = likelihood × prior / marginal likelihood (evidence))

Finite Mixture Model contd.
• Marginalizing out the mixing weights:
  p(c) = \int p(c \mid \theta)\, p(\theta)\, d\theta, \qquad p(c \mid \theta) = \prod_{k=1}^{K} \theta_k^{m_k}, \qquad p(\theta) = \frac{\prod_{k=1}^{K} \theta_k^{\alpha/K - 1}}{D(\alpha/K, \dots, \alpha/K)}
  where m_k = \sum_{i=1}^{N} \delta(c_i = k)
• This Dirichlet–multinomial integral gives
  p(c) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}

Infinite mixture model
• Infinite clusters: it is like saying that we have K \to \infty
• Since we always have a finite number of samples in reality, only a finite number of clusters is used; so we define two sets of clusters:
▫ K_+: number of classes for which m_k > 0
▫ K_0: number of classes for which m_k = 0
• Assume a reordering such that m_k > 0 for k \le K_+ and m_k = 0 for k > K_+, with K = K_+ + K_0
• This yields an infinite-dimensional multinomial cluster distribution

Infinite mixture model
• Now we return to the previous slides and let K \to \infty in the formulas. Starting from
  p(c) = \prod_{k=1}^{K} \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad m_k = \sum_{i=1}^{N} \delta(c_i = k),
  and using \Gamma(x + 1) = x\,\Gamma(x),
  \frac{\Gamma(m_k + \alpha/K)}{\Gamma(\alpha/K)} = \frac{\alpha}{K} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right),
  so
  p(c) = \left(\frac{\alpha}{K}\right)^{K_+} \left[\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
• If we set K \to \infty, the marginal likelihood p(c) \to 0. Instead we can model this problem by defining probabilities on partitions of samples, instead of class labels for each sample.

Infinite mixture model contd.
• Define a partition of objects; we want to partition N objects into K_+ classes
• Equivalence class of object partitions: [c] = \{c' : c' \equiv c\} (assignments identical up to relabeling of the clusters)
  p([c]) = \sum_{c \in [c]} p(c) = \frac{K!}{K_0!} \left(\frac{\alpha}{K}\right)^{K_+} \left[\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1} \left(j + \frac{\alpha}{K}\right)\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
  \lim_{K \to \infty} p([c]) = \alpha^{K_+} \left[\prod_{k=1}^{K_+} (m_k - 1)!\right] \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}
• This is a valid probability distribution for an infinite mixture model
• It is exchangeable with respect to the cluster assignments! (A sketch that evaluates this partition probability appears below.)
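As a quick check on the limiting formula above, here is a minimal Python sketch (not part of the original slides; the function name and example labels are made up for illustration) that evaluates log p([c]) and confirms that permuting the observations leaves the partition probability unchanged.

```python
# Minimal sketch (illustration only): evaluate the limiting partition probability
#   p([c]) = alpha^{K+} * Gamma(alpha)/Gamma(N+alpha) * prod_k (m_k - 1)!
# in log space, and check exchangeability by permuting the observations.
import math
from collections import Counter
from itertools import permutations

def log_p_partition(assignments, alpha):
    """Log-probability of the partition induced by a list of cluster labels."""
    n = len(assignments)
    counts = Counter(assignments).values()        # the m_k of the occupied clusters
    log_p = len(counts) * math.log(alpha)         # alpha^{K+}
    log_p += math.lgamma(alpha) - math.lgamma(n + alpha)
    log_p += sum(math.lgamma(m) for m in counts)  # log (m_k - 1)! = lgamma(m_k)
    return log_p

c = [1, 1, 2, 3, 2, 1]                            # hypothetical cluster labels
print(log_p_partition(c, alpha=1.5))
# Reordering the observations never changes the partition probability.
assert all(math.isclose(log_p_partition(list(p), 1.5), log_p_partition(c, 1.5))
           for p in permutations(c))
```

This invariance under reordering is exactly what licenses the sequential (Chinese restaurant) construction on the next slides.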
• Exchangeability matters for two reasons:
▫ It is important for Gibbs sampling (and the Chinese restaurant process)
▫ De Finetti's theorem explains why exchangeable observations are conditionally independent given some underlying probability distribution

Chinese Restaurant Process (CRP)
• Customers (samples) enter a restaurant one at a time and choose tables (clusters):
  p(c_i = k \mid c_1, \dots, c_{i-1}) =
  \begin{cases} \dfrac{m_k}{i - 1 + \alpha} & k \le K_+ \ (\text{occupied table}) \\[4pt] \dfrac{\alpha}{i - 1 + \alpha} & k = K_+ + 1 \ (\text{new table}) \end{cases}
• Worked example with parameter \alpha = 1 (the original slides animate ten customers one by one; a simulation sketch appears at the end of this part):
▫ Customer 1: m_1 = 0, so p(c_1 = 1) = 1
▫ Customer 2: m_1 = 1, so p(c_2 = 1 \mid c_1) = 1/2 and p(c_2 = 2 \mid c_1) = 1/2
▫ Customer 3 (after c_1 = 1, c_2 = 2): m_1 = m_2 = 1, so p(c_3 = 1 \mid c_{1:2}) = p(c_3 = 2 \mid c_{1:2}) = p(c_3 = 3 \mid c_{1:2}) = 1/3
▫ … and so on: each new customer i joins table k with probability m_k / (i - 1 + \alpha) or starts a new table with probability \alpha / (i - 1 + \alpha)

CRP: Gibbs sampling
• The Gibbs sampler requires the full conditional
  p(c_i = k \mid c_{-i}, X) \propto p(X \mid c)\, p(c_i = k \mid c_{-i})
• Finite Mixture Model:
  p(c_i = k \mid c_{-i}) = \frac{m_{-i,k} + \alpha/K}{N - 1 + \alpha}
• Infinite Mixture Model:
  p(c_i = k \mid c_{-i}) =
  \begin{cases} \dfrac{m_{-i,k}}{N - 1 + \alpha} & m_{-i,k} > 0 \\[4pt] \dfrac{\alpha}{N - 1 + \alpha} & k = K_+ + 1 \ (\text{new class}) \\[4pt] 0 & \text{otherwise} \end{cases}

Beyond the limit of a single label
• In Latent Class Models:
▫ Each object (word) has only one latent label (topic)
▫ Finite number of latent labels: LDA
▫ Infinite number of latent labels: DPM
• In Latent Feature (latent structure) Models:
▫ Each object (graph) has multiple latent features (entities)
▫ Finite number of latent features: Finite Feature Model (FFM)
▫ Infinite number of latent features: Indian Buffet Process (IBP)
• Rows are data points, columns are latent features
• Movie preference example:
▫ Rows are movies: e.g. Rise of the Planet of the Apes
▫ Columns are latent features: made in the U.S., is science fiction, has apes in it, …

Latent Feature Model
• F: latent feature matrix; Z: binary matrix; V: value matrix
• F = Z \otimes V (elementwise product), with p(F) = p(Z)\, p(V)
• If V is continuous, F is continuous; if V is discrete, F is discrete
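Before moving on to the finite feature model, here is a rough simulation of the CRP seating rule shown above (an illustrative sketch, not code from the slides; the function name and seed are arbitrary).

```python
# Rough CRP simulation: customer i joins table k with probability
# m_k / (i - 1 + alpha) and opens a new table with probability alpha / (i - 1 + alpha).
import random

def crp_sample(n_customers, alpha=1.0, seed=0):
    rng = random.Random(seed)
    counts = []          # m_k: number of customers already seated at table k
    assignments = []     # c_i for each customer
    for i in range(1, n_customers + 1):
        weights = counts + [alpha]    # unnormalized; the factor 1/(i - 1 + alpha) cancels
        table = rng.choices(range(len(weights)), weights=weights)[0]
        if table == len(counts):      # customer opens a new table
            counts.append(0)
        counts[table] += 1
        assignments.append(table + 1)
    return assignments, counts

labels, table_sizes = crp_sample(10, alpha=1.0)
print(labels, table_sizes)            # one possible draw of the ten-customer example
```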
Finite Feature Model
• Generating Z, an N × K binary matrix (a generative sketch of this construction appears after the derivation below):
▫ For each column k, draw \pi_k from a Beta distribution:
  \pi_k \mid \alpha \sim \mathrm{Beta}(\alpha/K,\, 1), \quad \text{equivalently} \quad (\pi_k, 1 - \pi_k) \sim \mathrm{Dirichlet}(\alpha/K,\, 1)
▫ For each object i, flip a coin:
  z_{ik} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k)
• Distribution of Z given \pi:
  p(Z \mid \pi) = \prod_{k=1}^{K} \prod_{i=1}^{N} p(z_{ik} \mid \pi_k) = \prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k}
• Integrating out \pi:
  p(Z \mid \alpha) = \prod_{k=1}^{K} \int p(\pi_k \mid \alpha)\, p(z_{\cdot k} \mid \pi_k)\, d\pi_k = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}
• Z is sparse: the expected number of non-zero entries is N\alpha/(1 + \alpha/K) \le N\alpha, even as K \to \infty

Indian Buffet Process
1st Representation: K → ∞
• Difficulty: as K \to \infty, P(Z) \to 0 for any particular matrix
• Solution: define equivalence classes of random binary feature matrices
• Left-ordered form of a binary matrix, lof(Z):
▫ Compute the history h of each feature (column) k: the column read as a binary number, with the first row as the most significant bit (in the slides' figure, for example, h_4 = 2^3 + 2^7 + 2^9)
▫ Order the columns by decreasing history
• Z_1 and Z_2 are lof-equivalent iff lof(Z_1) = lof(Z_2); the equivalence class is written [Z]
• Describing [Z]:
  K_h = \#\{k : \text{column } k \text{ has history } h\}, \qquad K_+ = \sum_{h > 0} K_h, \qquad K = K_+ + K_0
• Cardinality of [Z]:
  \binom{K}{K_0\ K_1 \cdots K_{2^N - 1}} = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}
  (a small sketch that computes histories and lof(Z) appears at the end of this derivation)

Indian Buffet Process
1st Representation: K → ∞
• Given:
  P(Z \mid \alpha) = \prod_{k=1}^{K} \frac{\frac{\alpha}{K}\,\Gamma(m_k + \frac{\alpha}{K})\,\Gamma(N - m_k + 1)}{\Gamma(N + 1 + \frac{\alpha}{K})}, \qquad \mathrm{card}([Z]) = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}
• Derive the limit as K \to \infty (K_+ < \infty almost surely):
  P([Z] \mid \alpha) = P(Z \mid \alpha) \cdot \mathrm{card}([Z])
• Split the K_0 empty columns (m_k = 0) from the K_+ non-empty ones and use \Gamma(x+1) = x\,\Gamma(x):
  P([Z] \mid \alpha) = \frac{K!}{K_0!\, K^{K_+} \prod_{h=1}^{2^N-1} K_h!}\; \alpha^{K_+} \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \prod_{k=1}^{K_+} \frac{(N - m_k)!\, \prod_{j=1}^{m_k - 1}(j + \frac{\alpha}{K})}{N!}
• Take K \to \infty term by term:
  \frac{K!}{K_0!\, K^{K_+}} \to 1, \qquad \left(\frac{N!}{\prod_{j=1}^{N}(j + \frac{\alpha}{K})}\right)^{K} \to e^{-\alpha H_N}, \qquad \prod_{j=1}^{m_k - 1}\left(j + \frac{\alpha}{K}\right) \to (m_k - 1)!
  where H_N = \sum_{j=1}^{N} \frac{1}{j} is the harmonic sum
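The derivation above hinges on column histories and the left-ordered form. The following small sketch (illustrative only; it assumes the convention that row 1 is the most significant bit, and the toy matrix is made up) computes histories and lof(Z).

```python
# Sketch: column "histories" and the left-ordered form lof(Z), assuming row 1
# is the most significant bit of each column's history.
import numpy as np

def histories(Z):
    N = Z.shape[0]
    powers = 2 ** np.arange(N - 1, -1, -1)   # row 1 -> 2^(N-1), row N -> 2^0
    return Z.T @ powers                      # one integer history per column

def lof(Z):
    """Left-ordered form: columns sorted by decreasing history."""
    h = histories(Z)
    return Z[:, np.argsort(-h, kind="stable")]

Z = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])
print(histories(Z))   # first column has history 2^2 + 2^0 = 5
print(lof(Z))
```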
Indian Buffet Process
1st Representation: K → ∞
• The limit is therefore
  P([Z] \mid \alpha) = e^{-\alpha H_N}\, \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!}\, \prod_{k=1}^{K_+} \frac{(m_k - 1)!\,(N - m_k)!}{N!}
• Note:
▫ P([Z] | α) is well defined because K_+ < ∞ almost surely
▫ P([Z] | α) depends on Z only through the history counts K_h, the number of features (columns) with history h
▫ Permuting the rows (data points) does not change the K_h (exchangeability)

Indian Buffet Process
2nd Representation: Customers & Dishes
• An Indian restaurant with infinitely many dishes (columns)
• The first customer tastes the first K_1^{(1)} dishes, where K_1^{(1)} \sim \mathrm{Poisson}(\alpha)
• The i-th customer:
▫ tastes each previously sampled dish k with probability m_k / i, where m_k is the number of previous customers who took dish k
▫ then tastes K_1^{(i)} new dishes, where K_1^{(i)} \sim \mathrm{Poisson}(\alpha/i)
• Poisson distribution: p(k \mid \lambda) = \lambda^k e^{-\lambda}/k!

Indian Buffet Process
2nd Representation: Customers & Dishes
• Probability of the new-dish choices:
  P_{\text{new dishes}} = \prod_{i=1}^{N} \frac{(\alpha/i)^{K_1^{(i)}} e^{-\alpha/i}}{K_1^{(i)}!} = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}!}\, e^{-\alpha H_N} \prod_{i=1}^{N}\left(\frac{1}{i}\right)^{K_1^{(i)}}
• Probability of the old-dish choices, combined with the leftover \prod_i (1/i)^{K_1^{(i)}} factor:
  \prod_{i=1}^{N}\left(\frac{1}{i}\right)^{K_1^{(i)}} P_{\text{old dishes}} = \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
  (in the slides' worked example with N = 10, a dish taken by 4 customers contributes 3!\,6!/10! and a dish taken by 2 customers contributes 1!\,8!/10!)
• Putting these together:
  P_{\text{IBP}}(Z \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{i=1}^{N} K_1^{(i)}!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
• Note:
▫ Permuting the K_1^{(1)} dishes of the first customer, the next K_1^{(2)} dishes, …, the last K_1^{(N)} dishes does not change P(Z)
▫ All such permuted matrices belong to the same lof equivalence class [Z]
▫ The number of restaurant-process matrices mapping to [Z] is \prod_{i=1}^{N} K_1^{(i)}! \big/ \prod_{h=1}^{2^N-1} K_h!, so
  P_{\text{IBP}}([Z] \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N-1} K_h!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}
▫ The 2nd representation is therefore equivalent to the 1st representation

Indian Buffet Process
3rd Representation: Distribution over Collections of Histories
• Directly generate the left-ordered form (lof) matrix Z
• For each history h:
▫ m_h: number of non-zero elements (ones) in h
▫ Generate K_h columns with history h:
  K_h \sim \mathrm{Poisson}\!\left(\alpha\, \frac{(m_h - 1)!\,(N - m_h)!}{N!}\right)
• This defines a distribution over collections of histories
• Note:
▫ Permuting the digits of h does not change m_h (nor P(K))
▫ Permuting rows means customers are exchangeable

Indian Buffet Process
• Effective dimension of the model, K_+:
▫ Follows a Poisson distribution: K_+ \sim \mathrm{Poisson}(\alpha H_N)
▫ Derived from the 2nd representation by summing the Poisson components
• Number of features possessed by each object:
▫ Follows a Poisson(α) distribution
▫ Derived from the 2nd representation: the first customer chooses Poisson(α) dishes, and the customers are exchangeable and thus can be permuted
• Z is sparse:
▫ The expected number of non-zero entries per row is α, so the expected number of non-zero entries in Z is Nα
▫ Derived from the 2nd (or 1st) representation
• (See the simulation sketch below.)
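To make these properties concrete, here is a minimal sketch of the customers-and-dishes construction (illustration only; the function name and parameters are arbitrary). Row sums should behave like Poisson(α) draws and the number of sampled dishes K_+ like a Poisson(α·H_N) draw.

```python
# Minimal sketch of the IBP "customers and dishes" construction:
# customer i takes existing dish k with probability m_k / i and then
# samples Poisson(alpha / i) brand-new dishes.
import numpy as np

def ibp_sample(n_customers, alpha=2.0, seed=0):
    rng = np.random.default_rng(seed)
    dish_counts = []                         # m_k for each dish sampled so far
    rows = []
    for i in range(1, n_customers + 1):
        row = [int(rng.random() < m / i) for m in dish_counts]   # revisit old dishes
        new = rng.poisson(alpha / i)                             # brand-new dishes
        row += [1] * new
        dish_counts = [m + z for m, z in zip(dish_counts, row)] + [1] * new
        rows.append(row)
    K_plus = len(dish_counts)
    Z = np.zeros((n_customers, K_plus), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z

Z = ibp_sample(10, alpha=2.0)
print(Z.shape, Z.sum(axis=1))   # K+ ~ Poisson(alpha * H_N); row sums ~ Poisson(alpha)
```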
IBP: Gibbs sampling
• We need the full conditional
  p(z_{ik} = 1 \mid Z_{-(ik)}, X) \propto p(X \mid Z)\, p(z_{ik} = 1 \mid Z_{-(ik)})
▫ Z_{-(ik)} denotes the entries of Z other than z_{ik}
▫ p(X | Z) depends on the model chosen for the observed data
• By exchangeability, treat row i as if it were generated by the last customer:
▫ By the IBP, p(z_{ik} = 1 \mid z_{-i,k}) = \frac{m_{-i,k}}{N}
▫ If we sample z_{ik} = 0 and the count m_k drops to 0: delete that column (feature)
▫ At the end, draw the number of new dishes (columns) for row i from Poisson(α/N), taking p(X | Z) into account
  This is approximated by truncation: compute probabilities for a range of numbers of new dishes up to an upper bound

More talk on applications
• Applications
▫ As a prior distribution in models with an infinite number of features
▫ Modeling protein interactions
▫ Models of bipartite graphs in which one side has an unbounded number of elements
▫ Binary matrix factorization for modeling dyadic data
▫ Extracting features from similarity judgments
▫ Latent features in link prediction
▫ Independent components analysis and sparse factor analysis
• More on inference
▫ Stick-breaking representation (Yee Whye Teh et al., 2007)
▫ Variational inference (Finale Doshi-Velez et al., 2009)
▫ Accelerated inference (Finale Doshi-Velez et al., 2009)

References
• Griffiths, T. L., & Ghahramani, Z. (2011). The Indian buffet process: An introduction and review. Journal of Machine Learning Research, 12, 1185–1224.
• Slides: Tom Griffiths, "The Indian buffet process".