Indian Buffet Process
Xiaolong Wang and Daniel Khashabi
UIUC, 2013
Contents
• Mixture Model
▫ Finite mixture
▫ Infinite mixture
• Matrix Feature Model
▫ Finite features
▫ Infinite features (Indian Buffet Process)
Bayesian Nonparametrics
• Models with an unbounded number of elements:
▫ Dirichlet Process for infinite mixture models
▫ With various applications
 Hierarchies
 Topics and syntactic classes
 Objects appearing in one image
▫ Cons
 The models are limited to settings that can be modeled with a DP,
 i.e. each set of observations is generated by only one latent component
Bayesian Nonparametrics contd.
• In practice there may be more complicated interactions
between latent variables and observations
• Solution
▫ Look for more flexible nonparametric models
▫ Such interactions can be captured via a binary feature matrix
▫ Infinitely many features means infinitely many columns
Finite Mixture Model
• Set of observations: $\{x_i\}_{i=1}^{N}$
• Fixed number of clusters: $K$
• Cluster assignment for $x_i$: $c_i \in \{1, \dots, K\}$
• Cluster assignment vector: $\mathbf{c} = [c_1, c_2, \dots, c_N]^T$
• The probability of each sample under the model:
$$p(x_i \mid \theta) = \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)$$
• The likelihood of the samples:
$$p(X \mid \theta) = \prod_{i=1}^{N} \sum_{k=1}^{K} p(x_i \mid c_i = k)\, p(c_i = k)$$
• The prior on the component probabilities (symmetric Dirichlet distribution):
$$\theta \mid \alpha \sim \mathrm{Dirichlet}\!\left(\tfrac{\alpha}{K}, \dots, \tfrac{\alpha}{K}\right)$$
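A minimal sketch of this generative model in Python/NumPy may help make the notation concrete. The function name sample_finite_mixture and the Gaussian components with unit variance are illustrative assumptions, not part of the slides.

    import numpy as np

    def sample_finite_mixture(N, K, alpha, rng=None):
        """Sketch of the finite mixture model:
        theta ~ Dirichlet(alpha/K, ..., alpha/K)
        c_i | theta ~ Discrete(theta)
        x_i | c_i ~ Normal(mu_{c_i}, 1)   # Gaussian components are an illustrative choice
        """
        rng = np.random.default_rng() if rng is None else rng
        theta = rng.dirichlet(np.full(K, alpha / K))   # mixing proportions
        mu = rng.normal(0.0, 5.0, size=K)              # hypothetical component means
        c = rng.choice(K, size=N, p=theta)             # cluster assignments
        x = rng.normal(mu[c], 1.0)                     # observations
        return x, c, theta

    x, c, theta = sample_finite_mixture(N=100, K=3, alpha=1.0)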
Finite Mixture Model
• Since we want the mixture model to be valid for any choice of component
distribution $p(x_i \mid c_i = k)$, we take the cluster assignments to be the goal of
learning this mixture model.
• Cluster assignments: $\mathbf{c} = [c_1, c_2, \dots, c_N]^T$
• The model can be summarized as:
$$\theta \mid \alpha \sim \mathrm{Dirichlet}\!\left(\tfrac{\alpha}{K}, \dots, \tfrac{\alpha}{K}\right), \qquad c_i \mid \theta \sim \mathrm{Discrete}(\theta)$$
• To have a valid model, all of the distributions must be valid:
$$\underbrace{p(\theta \mid \mathbf{c})}_{\text{posterior}} = \frac{\overbrace{p(\mathbf{c} \mid \theta)}^{\text{likelihood}} \; \overbrace{p(\theta)}^{\text{prior}}}{\underbrace{p(\mathbf{c})}_{\text{marginal likelihood (evidence)}}}$$
Finite Mixture Model contd.
$$p(\mathbf{c}) = \int p(\mathbf{c} \mid \theta)\, p(\theta)\, d\theta$$
$$p(\mathbf{c} \mid \theta) = \prod_{k=1}^{K} \theta_k^{m_k}, \qquad p(\theta) = \frac{1}{D\!\left(\tfrac{\alpha}{K}, \dots, \tfrac{\alpha}{K}\right)} \prod_{k=1}^{K} \theta_k^{\alpha/K - 1}$$
$$\Rightarrow \quad p(\mathbf{c}) = \int p(\mathbf{c} \mid \theta)\, p(\theta)\, d\theta = \prod_{k=1}^{K} \frac{\Gamma\!\left(m_k + \tfrac{\alpha}{K}\right)}{\Gamma\!\left(\tfrac{\alpha}{K}\right)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad \text{s.t.} \quad m_k = \sum_{i=1}^{N} \delta(c_i = k)$$
Infinite mixture model
• Infinite clusters
▫ This is like saying that $K \to \infty$
▫ An infinite-dimensional multinomial cluster distribution
• Since in reality we always have finitely many samples, only a finite number of
clusters is actually used; so we define two sets of clusters:
▫ $K_+$: the number of classes for which $m_k > 0$
▫ $K_0$: the number of classes for which $m_k = 0$, so $K = K_+ + K_0$
• Assume a reordering such that $m_k > 0$ for $k \le K_+$ and $m_k = 0$ for $k > K_+$
Infinite mixture model
• Now we return to the previous slides and let $K \to \infty$ in the formulas.
$$p(\mathbf{c}) = \prod_{k=1}^{K} \frac{\Gamma\!\left(m_k + \tfrac{\alpha}{K}\right)}{\Gamma\!\left(\tfrac{\alpha}{K}\right)} \cdot \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}, \qquad \text{s.t.} \quad m_k = \sum_{i=1}^{N} \delta(c_i = k)$$
• Using $\Gamma(x+1) = x\,\Gamma(x)$:
$$\frac{\Gamma\!\left(m_k + \tfrac{\alpha}{K}\right)}{\Gamma\!\left(\tfrac{\alpha}{K}\right)} = \left(m_k - 1 + \tfrac{\alpha}{K}\right)\left(m_k - 2 + \tfrac{\alpha}{K}\right)\cdots\tfrac{\alpha}{K} = \frac{\alpha}{K}\prod_{j=1}^{m_k - 1}\left(j + \tfrac{\alpha}{K}\right)$$
$$\Rightarrow \quad p(\mathbf{c}) = \left(\frac{\alpha}{K}\right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1}\left(j + \tfrac{\alpha}{K}\right)\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$$
If we set $K \to \infty$, the marginal likelihood of any particular labeling goes to $p(\mathbf{c}) \to 0$.
Instead we can model this problem by defining probabilities on partitions of the
samples, rather than on class labels for each sample.
Infinite mixture model contd.
• Define a partition of the objects: we want to partition $N$ objects into $K_+$ classes
• Equivalence class of object partitions: $[\mathbf{c}] = \{\mathbf{c}' \mid \mathbf{c}' \sim \mathbf{c}\}$
$$p([\mathbf{c}]) = \sum_{\mathbf{c} \in [\mathbf{c}]} p(\mathbf{c}) = \frac{K!}{K_0!} \left(\frac{\alpha}{K}\right)^{K_+} \left(\prod_{k=1}^{K_+} \prod_{j=1}^{m_k - 1}\left(j + \tfrac{\alpha}{K}\right)\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$$
$$\Rightarrow \quad \lim_{K \to \infty} p([\mathbf{c}]) = \alpha^{K_+} \left(\prod_{k=1}^{K_+} (m_k - 1)!\right) \frac{\Gamma(\alpha)}{\Gamma(N + \alpha)}$$
• This is a valid probability distribution for an infinite mixture model
• It is exchangeable with respect to the cluster assignments
▫ Important for Gibbs sampling (and the Chinese restaurant process)
▫ De Finetti’s theorem explains why exchangeable observations are
conditionally independent given some probability distribution
Chinese Restaurant Process (CRP)
$$p(c_i = k \mid c_1, \dots, c_{i-1}) =
\begin{cases}
\dfrac{m_k}{i - 1 + \alpha} & k \le K_+ \quad \text{(existing cluster)}\\[2mm]
\dfrac{\alpha}{i - 1 + \alpha} & k = K_+ + 1 \quad \text{(new cluster)}
\end{cases}$$
Worked example with parameter $\alpha = 1$ (each customer chooses an existing table in proportion to its occupancy $m_k$, or a new table in proportion to $\alpha$):
• Customer 1: $m_1 = 0$; $p(c_1 = 1) = 1/1$
• Customer 2: $m_1 = 1,\ m_2 = 0$; $p(c_2 = 1 \mid c_1) = 1/2$, $p(c_2 = 2 \mid c_1) = 1/2$
• Customer 3: $m_1 = 1,\ m_2 = 1,\ m_3 = 0$; probabilities $1/3,\ 1/3,\ 1/3$
• Customer 4: $m_1 = 2,\ m_2 = 1,\ m_3 = 0$; probabilities $2/4,\ 1/4,\ 1/4$
• Customer 5: $m_1 = 2,\ m_2 = 2,\ m_3 = 0$; probabilities $2/5,\ 2/5,\ 1/5$
• Customer 6: $m_1 = 2,\ m_2 = 2,\ m_3 = 1,\ m_4 = 0$; probabilities $2/6,\ 2/6,\ 1/6,\ 1/6$
• Customer 7: $m_1 = 2,\ m_2 = 3,\ m_3 = 1,\ m_4 = 0$; probabilities $2/7,\ 3/7,\ 1/7,\ 1/7$
• Customer 8: $m_1 = 2,\ m_2 = 3,\ m_3 = 2,\ m_4 = 0$; probabilities $2/8,\ 3/8,\ 2/8,\ 1/8$
• Customer 9: $m_1 = 3,\ m_2 = 3,\ m_3 = 2,\ m_4 = 0$; probabilities $3/9,\ 3/9,\ 2/9,\ 1/9$
• Customer 10: $m_1 = 3,\ m_2 = 3,\ m_3 = 2,\ m_4 = 1,\ m_5 = 0$; probabilities $3/10,\ 3/10,\ 2/10,\ 1/10,\ 1/10$
CRP: Gibbs sampling
• The Gibbs sampler requires the full conditional:
$$p(c_i = k \mid \mathbf{c}_{-i}, X) \propto p(X \mid \mathbf{c})\, p(c_i = k \mid \mathbf{c}_{-i})$$
• Finite mixture model:
$$p(c_i = k \mid \mathbf{c}_{-i}) = \frac{m_{-i,k} + \tfrac{\alpha}{K}}{N - 1 + \alpha}$$
• Infinite mixture model:
$$p(c_i = k \mid \mathbf{c}_{-i}) =
\begin{cases}
\dfrac{m_{-i,k}}{N - 1 + \alpha} & m_{-i,k} > 0\\[2mm]
\dfrac{\alpha}{N - 1 + \alpha} & k = K_{-i} + 1\\[2mm]
0 & \text{otherwise}
\end{cases}$$
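As a sketch, the prior part of this conditional can be computed as below (Python/NumPy; the helper name crp_gibbs_prior is hypothetical, and the model-specific likelihood p(x_i | ...) still has to be multiplied in before sampling).

    import numpy as np

    def crp_gibbs_prior(c, i, alpha):
        """Prior part of the collapsed-Gibbs conditional for the infinite mixture:
        p(c_i = k | c_{-i}) = m_{-i,k} / (N - 1 + alpha) for occupied k,
                              alpha   / (N - 1 + alpha) for a brand-new cluster."""
        c = np.asarray(c)
        N = len(c)
        labels, m = np.unique(np.delete(c, i), return_counts=True)   # counts with item i removed
        probs = np.append(m, alpha) / (N - 1 + alpha)
        return labels, probs   # last entry corresponds to opening a new cluster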
Beyond the limit of single label
• In Latent Class Models:
▫ Each object (word) has only one latent label (topic)
▫ Finite number of latent labels: LDA
▫ Infinite number of latent labels: DPM
• In Latent Feature (latent structure) Models:
▫ Each object (graph) has multiple latent features (entities)
▫ Finite number of latent features: Finite Feature Model (FFM)
▫ Infinite number of latent features: Indian Buffet Process (IBP)
• Rows are data points
• Columns are latent features
• Movie preference example:
▫ Rows are movies: Rise of the Planet of the Apes
▫ Columns are latent features:
 Made in U.S.
 Is science fiction
 Has apes in it …
Latent Feature Model
(Figure: the latent feature matrix F decomposes into a binary matrix Z combined with a value matrix V; F may be continuous or discrete.)
• F : latent feature matrix
• Z : binary matrix
• V : value matrix
• F = Z ⊗ V (elementwise product), with p(F) = p(Z) p(V)
Finite Feature Model
• Generating Z, an (N × K) binary matrix:
▫ For each column k, draw $\pi_k$ from a Beta distribution
▫ For each object i, flip a coin for $z_{ik}$
$$\pi_k \mid \alpha \sim \mathrm{Beta}\!\left(\tfrac{\alpha}{K}, 1\right), \qquad z_{ik} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k), \qquad (\pi_k, 1 - \pi_k) \sim \mathrm{Dir}\!\left(\tfrac{\alpha}{K}, 1\right)$$
• Distribution of Z given π:
$$p(Z \mid \boldsymbol{\pi}) = \prod_{k=1}^{K} \prod_{i=1}^{N} p(z_{ik} \mid \pi_k) = \prod_{k=1}^{K} \pi_k^{m_k} (1 - \pi_k)^{N - m_k}$$
• Marginalizing out π:
$$p(Z \mid \alpha) = \prod_{k=1}^{K} \int p(\pi_k \mid \alpha)\, p(\mathbf{z}_{\cdot k} \mid \pi_k)\, d\pi_k = \prod_{k=1}^{K} \frac{\tfrac{\alpha}{K}\, \Gamma\!\left(m_k + \tfrac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}$$
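The following Python/SciPy sketch samples Z from this finite Beta–Bernoulli model and evaluates the marginal p(Z | α) above in log space; the function names are hypothetical.

    import numpy as np
    from scipy.special import gammaln

    def sample_Z_finite(N, K, alpha, rng=None):
        """Finite feature model: pi_k ~ Beta(alpha/K, 1), z_ik ~ Bernoulli(pi_k)."""
        rng = np.random.default_rng() if rng is None else rng
        pi = rng.beta(alpha / K, 1.0, size=K)
        return (rng.random((N, K)) < pi).astype(int)

    def log_p_Z_finite(Z, alpha):
        """log p(Z | alpha) = sum_k log[ (alpha/K) Gamma(m_k + alpha/K) Gamma(N - m_k + 1) / Gamma(N + 1 + alpha/K) ]."""
        N, K = Z.shape
        m = Z.sum(axis=0)
        a = alpha / K
        return np.sum(np.log(a) + gammaln(m + a) + gammaln(N - m + 1) - gammaln(N + 1 + a))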
Finite Feature Model
• Generating Z, an (N × K) binary matrix:
▫ For each column k, draw $\pi_k \mid \alpha \sim \mathrm{Beta}\!\left(\tfrac{\alpha}{K}, 1\right)$
▫ For each object i, flip a coin: $z_{ik} \mid \pi_k \sim \mathrm{Bernoulli}(\pi_k)$
• Z is sparse:
▫ The expected number of non-zero entries in Z stays bounded (by $N\alpha$) even as $K \to \infty$
Indian Buffet Process
1st Representation: K → ∞
• Difficulty:
▫ P(Z) → 0 as K → ∞
▫ Solution: define equivalence classes of random binary feature matrices
• Left-ordered form of binary matrices, lof(Z):
▫ Compute the history h of each feature (column) k, reading the column as a binary number
▫ Order the features by h, decreasingly
(Figure: an example binary matrix in left-ordered form; e.g. the history of column 4 is $h_4 = 2^3 + 2^7 + 2^9$.)
• Z1 and Z2 are lof-equivalent iff lof(Z1) = lof(Z2); the equivalence class is [Z]
• Describing [Z]:
$$K_h = \#\{k : h_k = h\}, \qquad K_+ = \sum_{h > 0} K_h, \qquad K = K_+ + K_0$$
• Cardinality of [Z]:
$$\binom{K}{K_0 \cdots K_{2^N - 1}} = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}$$
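A small Python/NumPy sketch of the lof operation is given below. The bit-ordering convention (first row as the most significant bit) is an assumption for illustration; only consistency of the ordering matters.

    import numpy as np

    def lof(Z):
        """Left-ordered form: read each column as a binary number (its history h)
        and reorder the columns by decreasing h."""
        N, K = Z.shape
        weights = 2 ** np.arange(N - 1, -1, -1)   # first row taken as the most significant bit
        h = weights @ Z                           # history of each column
        order = np.argsort(-h, kind="stable")
        return Z[:, order], h[order]

    Z = np.array([[0, 1, 0],
                  [1, 1, 0],
                  [1, 0, 1]])
    Z_lof, histories = lof(Z)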
Indian Buffet Process
1st Representation: K → ∞
Given:
$$\Pr(Z \mid \alpha) = \prod_{k=1}^{K} \frac{\tfrac{\alpha}{K}\, \Gamma\!\left(m_k + \tfrac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}, \qquad \mathrm{card}([Z]) = \binom{K}{K_0 \cdots K_{2^N - 1}} = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!}$$
Derive the limit $K \to \infty$ (note that $K_+ < \infty$ almost surely):
$$\Pr([Z] \mid \alpha) = \Pr(Z \mid \alpha) \cdot \mathrm{card}([Z]) = \frac{K!}{\prod_{h=0}^{2^N - 1} K_h!} \prod_{k=1}^{K} \frac{\tfrac{\alpha}{K}\, \Gamma\!\left(m_k + \tfrac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}$$
Separating the $K_0$ empty columns ($m_k = 0$) from the $K_+$ non-empty ones:
$$= \frac{K!}{K_0! \prod_{h=1}^{2^N - 1} K_h!} \left(\frac{\tfrac{\alpha}{K}\, \Gamma\!\left(\tfrac{\alpha}{K}\right) \Gamma(N + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}\right)^{K_0} \prod_{k=1}^{K_+} \frac{\tfrac{\alpha}{K}\, \Gamma\!\left(m_k + \tfrac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}$$
Raising the empty-column factor to the power $K$ and compensating inside the product over the non-empty columns:
$$= \frac{K!}{K_0! \prod_{h=1}^{2^N - 1} K_h!} \left(\frac{\tfrac{\alpha}{K}\, \Gamma\!\left(\tfrac{\alpha}{K}\right) \Gamma(N + 1)}{\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right)}\right)^{K} \prod_{k=1}^{K_+} \frac{\Gamma\!\left(m_k + \tfrac{\alpha}{K}\right) \Gamma(N - m_k + 1)}{\Gamma\!\left(\tfrac{\alpha}{K}\right) \Gamma(N + 1)}$$
Simplifying the Gamma ratios, using $\tfrac{\alpha}{K}\Gamma\!\left(\tfrac{\alpha}{K}\right) = \Gamma\!\left(1 + \tfrac{\alpha}{K}\right)$ and $\Gamma\!\left(N + 1 + \tfrac{\alpha}{K}\right) = \Gamma\!\left(1 + \tfrac{\alpha}{K}\right)\prod_{j=1}^{N}\left(j + \tfrac{\alpha}{K}\right)$:
$$= \frac{K!}{K_0! \prod_{h=1}^{2^N - 1} K_h!} \prod_{j=1}^{N} \left(1 + \frac{\alpha / j}{K}\right)^{-K} \prod_{k=1}^{K_+} \frac{\alpha}{K} \cdot \frac{(N - m_k)!}{N!} \prod_{j=1}^{m_k - 1}\left(j + \tfrac{\alpha}{K}\right)$$
Taking $K \to \infty$:
$$\frac{K!}{K_0!\, K^{K_+}} \to 1, \qquad \prod_{j=1}^{N} \left(1 + \frac{\alpha / j}{K}\right)^{-K} \to e^{-\alpha H_N} \;\;\text{with}\;\; H_N = \sum_{j=1}^{N} \frac{1}{j} \;\text{(harmonic sum)}, \qquad \prod_{j=1}^{m_k - 1}\left(j + \tfrac{\alpha}{K}\right) \to (m_k - 1)!$$
$$\Rightarrow \quad \Pr([Z] \mid \alpha) = \frac{\alpha^{K_+}}{\prod_{h=1}^{2^N - 1} K_h!}\, e^{-\alpha H_N} \prod_{k=1}^{K_+} \frac{(m_k - 1)!\,(N - m_k)!}{N!}$$
Note:
• $\Pr([Z] \mid \alpha)$ is well defined because $K_+ < \infty$ almost surely
• $\Pr([Z] \mid \alpha)$ depends on the $K_h$: the number of features (columns) with history h
• Permuting the rows (data points) does not change the $K_h$ (exchangeability)
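The limiting probability can be evaluated directly from a binary matrix. Below is a sketch (Python with NumPy/SciPy; the function name log_p_lof_ibp is hypothetical, and the bit-based history only works for modest N).

    import numpy as np
    from scipy.special import gammaln
    from collections import Counter

    def log_p_lof_ibp(Z, alpha):
        """log Pr([Z] | alpha) = K_+ log(alpha) - alpha*H_N - sum_h log(K_h!)
                                 + sum_k [ log((m_k-1)!) + log((N-m_k)!) - log(N!) ]."""
        N, K = Z.shape
        m = Z.sum(axis=0)
        m = m[m > 0]                                  # the K_+ non-empty columns
        K_plus = len(m)
        H_N = np.sum(1.0 / np.arange(1, N + 1))       # harmonic number
        weights = 2 ** np.arange(N - 1, -1, -1)       # column histories (modest N only)
        h = weights @ Z
        Kh = Counter(h[h > 0])                        # K_h for each non-zero history
        return (K_plus * np.log(alpha) - alpha * H_N
                - sum(gammaln(c + 1) for c in Kh.values())
                + np.sum(gammaln(m) + gammaln(N - m + 1) - gammaln(N + 1)))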
Indian Buffet Process
2nd Representation: Customers & Dishes
• An Indian restaurant with infinitely many dishes (columns); customers (rows) enter one at a time
• The first customer tastes the first $K_1^{(1)}$ dishes, where $K_1^{(1)} \sim \mathrm{Poisson}(\alpha)$
• The i-th customer:
▫ Tastes each previously sampled dish k with probability $m_k / i$
▫ $m_k$: the number of previous customers who took dish k
▫ Then tastes the following $K_1^{(i)}$ new dishes, where $K_1^{(i)} \sim \mathrm{Poisson}\!\left(\tfrac{\alpha}{i}\right)$
• Poisson distribution: $p(k \mid \lambda) = \dfrac{\lambda^k e^{-\lambda}}{k!}$
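A minimal sketch of this customers-and-dishes process in Python/NumPy follows; the function name ibp_sample is hypothetical.

    import numpy as np

    def ibp_sample(N, alpha, rng=None):
        """Draw a binary matrix Z from the IBP (customers-and-dishes form):
        customer i takes existing dish k with prob m_k / i, then tries Poisson(alpha / i) new dishes."""
        rng = np.random.default_rng() if rng is None else rng
        dishes = []                                   # one list of 0/1 entries per dish (column)
        m = []                                        # counts m_k
        for i in range(1, N + 1):
            for k in range(len(dishes)):              # previously introduced dishes
                take = rng.random() < m[k] / i
                dishes[k].append(int(take))
                m[k] += int(take)
            for _ in range(rng.poisson(alpha / i)):   # new dishes for this customer
                dishes.append([0] * (i - 1) + [1])
                m.append(1)
        return np.array(dishes).T if dishes else np.zeros((N, 0), int)

    # Empirically, the number of columns K_+ of the sampled Z is Poisson(alpha * H_N) distributed,
    # matching the property stated later in the slides.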
Indian Buffet Process
2nd Representation: Customers & Dishes
• Probability of the new-dish choices:
$$P_{\text{new\_dish}} = \prod_{i=1}^{N} \frac{\left(\tfrac{\alpha}{i}\right)^{K_1^{(i)}} e^{-\alpha / i}}{K_1^{(i)}!} = \frac{\alpha^{K_+}\, e^{-\alpha H_N}}{\prod_{i=1}^{N} K_1^{(i)}!} \prod_{i=1}^{N} \left(\frac{1}{i}\right)^{K_1^{(i)}}$$
• Probability of the old-dish choices, combined with the position factor above:
$$\prod_{i=1}^{N} \left(\frac{1}{i}\right)^{K_1^{(i)}} \cdot P_{\text{old\_dishes}} = \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}$$
(Figure: a worked example for two dish columns over ten customers; the per-customer choice probabilities are 1/1, 1/2, 1/3, 2/4, 2/5, 3/6, 4/7, 3/8, 5/9, 6/10 with product 3!·6!/10!, and 1/1, 2/2, 3/3, 1/4, 4/5, 5/6, 6/7, 1/8, 7/9, 8/10 with product 8!·1!/10!.)
Indian Buffet Process
2nd Representation: Customers & Dishes
From the previous slide we have:
$$P_{\mathrm{IBP}}(Z \mid \alpha) = \frac{\alpha^{K_+}\, e^{-\alpha H_N}}{\prod_{i=1}^{N} K_1^{(i)}!} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}$$
Note:
• Permuting the first customer's $K_1^{(1)}$ dishes, the next $K_1^{(2)}$, ..., and the last customer's $K_1^{(N)}$ dishes (columns) does not change P(Z)
• The permuted matrices all belong to the same lof equivalence class [Z]
• Number of such permutations: $\prod_{i=1}^{N} K_1^{(i)}! \Big/ \prod_{h=1}^{2^N - 1} K_h!$
$$\Rightarrow \quad P_{\mathrm{IBP}}([Z] \mid \alpha) = \frac{\alpha^{K_+}\, e^{-\alpha H_N}}{\prod_{h=1}^{2^N - 1} K_h!} \prod_{k=1}^{K_+} \frac{(N - m_k)!\,(m_k - 1)!}{N!}$$
so the 2nd representation is equivalent to the 1st representation.
Indian Buffet Process
3rd Representation: Collections of Histories
• Directly generate the left-ordered-form (lof) matrix Z as a distribution over collections of histories
• For each history h:
▫ $m_h$: the number of non-zero elements in h
▫ Generate $K_h$ columns with history h:
$$K_h \sim \mathrm{Poisson}\!\left(\alpha\, \frac{(m_h - 1)!\,(N - m_h)!}{N!}\right)$$
• This defines the distribution over collections of histories
• Note:
▫ Permuting the digits of h does not change $m_h$ (nor P(K))
▫ Permuting rows means customers are exchangeable
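A sketch of this history-based generator in Python/NumPy is shown below; the function name ibp_sample_histories is hypothetical, and enumerating all 2^N − 1 non-zero histories is only feasible for small N.

    import numpy as np
    from math import factorial

    def ibp_sample_histories(N, alpha, rng=None):
        """For every non-zero history h (binary column pattern), draw
        K_h ~ Poisson(alpha * (m_h - 1)! (N - m_h)! / N!) copies of that column."""
        rng = np.random.default_rng() if rng is None else rng
        columns = []
        for h in range(1, 2 ** N):                    # all non-zero histories (small N only!)
            col = [(h >> (N - 1 - i)) & 1 for i in range(N)]
            m_h = sum(col)
            rate = alpha * factorial(m_h - 1) * factorial(N - m_h) / factorial(N)
            columns.extend([col] * rng.poisson(rate))
        return np.array(columns).T if columns else np.zeros((N, 0), int)

    # The Poisson rates sum to alpha * H_N over all histories, so the number of generated
    # columns again follows Poisson(alpha * H_N).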
Indian Buffet Process
• Effective dimension of the model, $K_+$:
▫ Follows a Poisson distribution: $K_+ \sim \mathrm{Poisson}(\alpha H_N)$
▫ Derived from the 2nd representation by summing the Poisson components
• Number of features possessed by each object:
▫ Follows a Poisson distribution: $\mathrm{Poisson}(\alpha)$
▫ Derived from the 2nd representation:
 The first customer chooses $\mathrm{Poisson}(\alpha)$ dishes
 The customers are exchangeable and thus can be permuted
• Z is sparse:
▫ The expected number of non-zero entries in each row is $\alpha$, so the expected number of non-zero entries in Z is $N\alpha$
▫ Derived from the 2nd (or 1st) representation
IBP: Gibbs sampling
• We need the full conditional:
$$p(z_{ik} = 1 \mid \mathbf{Z}_{-(ik)}, X) \propto p(X \mid \mathbf{Z})\, p(z_{ik} = 1 \mid \mathbf{Z}_{-(ik)})$$
▫ $\mathbf{Z}_{-(ik)}$ denotes the entries of Z other than $z_{ik}$
▫ $p(X \mid \mathbf{Z})$ depends on the model chosen for the observed data
• By exchangeability, consider generating row i as if it were the last customer. By the IBP:
$$p(z_{ik} = 1 \mid \mathbf{z}_{-i,k}) = \frac{m_{-i,k}}{N}$$
▫ If $z_{ik}$ is sampled as 0 and the count drops to $m_k = 0$: delete that column (feature)
▫ At the end, draw new dishes from $\mathrm{Poisson}\!\left(\tfrac{\alpha}{N}\right)$, taking $p(X \mid \mathbf{Z})$ into account
 Approximated by truncation: compute probabilities for a range of numbers of new dishes up to an
upper bound
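As a sketch, the prior part of one such update can be computed as follows (Python/NumPy; the helper name ibp_gibbs_step_prior is hypothetical, and the model-specific likelihood p(X | Z) still has to be folded in before sampling each entry).

    import numpy as np

    def ibp_gibbs_step_prior(Z, i, alpha):
        """Prior part of a collapsed-Gibbs update for row i, treated as the last customer:
        p(z_ik = 1 | z_{-i,k}) = m_{-i,k} / N for existing columns; new dishes ~ Poisson(alpha / N)."""
        N, K = Z.shape
        m_minus = Z.sum(axis=0) - Z[i]        # column counts excluding row i
        prior_on = m_minus / N                # Bernoulli prior for each existing column
        new_dish_rate = alpha / N             # Poisson rate for brand-new dishes in row i
        return prior_on, new_dish_rate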
More talk on applications
• Applications
▫ As prior distribution in models with infinite number of features.
▫ Modeling Protein Interactions
▫ Models of bipartite graphs in which one side has an unbounded number
of elements
▫ Binary Matrix Factorization for Modeling Dyadic Data
▫ Extracting Features from Similarity Judgments
▫ Latent Features in Link Prediction
▫ Independent Components Analysis and Sparse Factor Analysis
• More on inference
▫ Stick-breaking representation (Yee Whye Teh et al., 2007)
▫ Variational inference (Finale Doshi-Velez et al., 2009)
▫ Accelerated Inference (Finale Doshi-Velez et al., 2009)
References
• Griffiths, T. L., & Ghahramani, Z. (2011). The Indian buffet process: An
introduction and review. Journal of Machine Learning Research, 12, 1185-1224.
• Slides: Tom Griffiths, “The Indian buffet process”.