Learning Inhomogeneous Gibbs Models
Ce Liu
celiu@microsoft.com

How to Describe the Virtual World

Histogram
Histogram: marginal distribution of image variances
• Non-Gaussian distributed

Texture Synthesis (Heeger et al., 95)
• Image decomposition by steerable filters
• Histogram matching

FRAME (Zhu et al., 97)
• Homogeneous Markov random field (MRF)
• Minimax entropy principle to learn a homogeneous Gibbs distribution
• Gibbs sampling and feature selection

Our Problem
To learn the distribution of structural signals
Challenges
• How to learn non-Gaussian distributions in high dimensions from few observations?
• How to capture the sophisticated properties of the distribution?
• How to optimize parameters with global convergence?

Inhomogeneous Gibbs Models (IGM)
A framework to learn arbitrary high-dimensional distributions
• 1D histograms on linear features to describe the high-dimensional distribution
• Maximum Entropy Principle – Gibbs distribution
• Minimum Entropy Principle – feature pursuit
• Markov chain Monte Carlo for parameter optimization
• Kullback-Leibler Feature (KLF)

1D Observation: Histograms
Feature $f(x): \mathbb{R}^d \to \mathbb{R}$
• Linear feature: $f(x) = \mathbf{f}^T x$
• Kernel distance: $f(x) = \lVert \mathbf{f} - x \rVert$
Marginal distribution: $h_{\mathbf{f}}(z) = \int \delta(z - \mathbf{f}^T x)\, f(x)\, dx$
Histogram: $H_{\mathbf{f}} = \frac{1}{N} \sum_{i=1}^{N} \delta(\mathbf{f}^T x_i)$, where $\delta(\mathbf{f}^T x_i) = (0, \dots, 0, 1, 0, \dots, 0)$ indicates the bin containing $\mathbf{f}^T x_i$

Intuition
[Figure: a 2D distribution $f(x)$ projected onto two features $\mathbf{f}_1$ and $\mathbf{f}_2$, giving marginal histograms $H_{\mathbf{f}_1}$ and $H_{\mathbf{f}_2}$.]

Learning Descriptive Models
[Figure: matching the observed histograms $H^{obs}_{\mathbf{f}_1}, H^{obs}_{\mathbf{f}_2}$ with the synthesized histograms $H^{syn}_{\mathbf{f}_1}, H^{syn}_{\mathbf{f}_2}$ drives the model $p(x)$ toward the underlying distribution $f(x)$.]

Learning Descriptive Models
• Sufficient features can make the learnt model converge to the underlying distribution
• Linear features and histograms are robust compared with other high-order statistics
Descriptive models: $\Omega_{\mathbf{f}} = \{\, p(x) \mid h^{f}_{\mathbf{f}_i}(z) = h^{p}_{\mathbf{f}_i}(z),\ i = 1, \dots, m \,\}$

Maximum Entropy Principle
Maximum entropy model
• To generalize the statistical properties of the observed data
• To make the learnt model carry no more information than what is available
Mathematical formulation:
$p^*(x) = \arg\max_p \operatorname{entropy}(p(x)) = \arg\max_p \Big\{ -\int p(x) \log p(x)\, dx \Big\}$
subject to: $H^{p}_{\mathbf{f}_i} = H^{f}_{\mathbf{f}_i},\ i = 1, \dots, m$

Intuition of Maximum Entropy Principle
[Figure: among the set $\Omega_{\mathbf{f}} = \{\, p(x) \mid H^{f}_{\mathbf{f}_1}(z) = H^{p}_{\mathbf{f}_1}(z) \,\}$ of distributions that reproduce the observed histogram on $\mathbf{f}_1$, the maximum entropy solution $p^*(x)$ is selected.]

Inhomogeneous Gibbs Distribution
Solution form of the maximum entropy model:
$p(x; \Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i(z), \delta(\mathbf{f}_i^T x) \big\rangle \Big\}$
• Gibbs potential: $\lambda_i(z)$
• Parameters: $\Lambda = \{\lambda_i\}$

Estimating Potential Function
Distribution form: $p(x; \Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \langle \lambda_i, \delta(\mathbf{f}_i^T x) \rangle \Big\}$
Normalization: $Z(\Lambda) = \int \exp\Big\{ -\sum_{i=1}^{m} \langle \lambda_i, \delta(\mathbf{f}_i^T x) \rangle \Big\}\, dx$
Maximum likelihood estimation (MLE): let $L(\Lambda) = \sum_{i=1}^{n} \log p(x_i; \Lambda)$, then $\Lambda^* = \arg\max_\Lambda L(\Lambda)$
1st-order derivative:
$\frac{\partial L(\Lambda)}{\partial \lambda_i} = -\frac{1}{Z} \frac{\partial Z}{\partial \lambda_i} - H^{obs}_{\mathbf{f}_i} = E_{p(x;\Lambda)}\big[\delta(\mathbf{f}_i^T x)\big] - H^{obs}_{\mathbf{f}_i}$

Parameter Learning
Monte Carlo integration: $E_{p(x;\Lambda)}\big[\delta(\mathbf{f}_i^T x)\big] \approx H^{syn}_{\mathbf{f}_i}$, so $\frac{\partial L(\Lambda)}{\partial \lambda_i} \approx H^{syn}_{\mathbf{f}_i} - H^{obs}_{\mathbf{f}_i}$
Algorithm:
  Input: $\{\mathbf{f}_i\}$, $\{H^{obs}_{\mathbf{f}_i}(z)\}$
  Initialize: $\{\lambda_i\}$, step size $s$
  Loop
    Sampling: $\{x_i\} \sim p(x; \Lambda)$
    Compute synthesized histograms: $H^{syn}_{\mathbf{f}_i}$, $i = 1, \dots, m$
    Update parameters: $\lambda_i \leftarrow \lambda_i + s\,\big(H^{syn}_{\mathbf{f}_i} - H^{obs}_{\mathbf{f}_i}\big)$, $i = 1, \dots, m$
    Histogram divergence: $D = \sum_{i=1}^{m} KL\big(H^{syn}_{\mathbf{f}_i}, H^{obs}_{\mathbf{f}_i}\big)$
    Reduce $s$
  Until $D$ falls below a small threshold
  Output: $\Lambda$, $\{x_i\}$
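To make the loop above concrete, here is a minimal Python sketch of the histogram-matching parameter update, assuming a small toy setting. The function names (`learn_igm`, `sample_model`, `energy`), the shared bin edges, the random-walk Metropolis sampler (standing in for the Gibbs sampler used in the talk), and all hyper-parameters are illustrative assumptions rather than the authors' implementation.

```python
# Sketch of the IGM parameter-learning loop: Monte Carlo gradient ascent on
# the bin-wise potentials lambda_i.  The sampler and hyper-parameters are
# illustrative stand-ins, not the implementation described in the talk.
import numpy as np

def histogram(proj, edges):
    """Normalized histogram of 1-D feature projections over fixed bin edges."""
    h, _ = np.histogram(proj, bins=edges)
    return h / max(h.sum(), 1)

def energy(x, features, lambdas, edges):
    """Gibbs energy  sum_i <lambda_i, delta(f_i^T x)>  for one sample x."""
    e = 0.0
    for f, lam in zip(features, lambdas):
        b = np.clip(np.digitize(f @ x, edges) - 1, 0, len(lam) - 1)
        e += lam[b]
    return e

def sample_model(features, lambdas, edges, n, dim, n_steps=200, sigma=0.5, rng=None):
    """Random-walk Metropolis draws from p(x) proportional to exp(-energy(x))."""
    if rng is None:
        rng = np.random.default_rng(0)
    xs = rng.normal(size=(n, dim))
    for _ in range(n_steps):
        for j in range(n):
            prop = xs[j] + sigma * rng.normal(size=dim)
            d_e = energy(prop, features, lambdas, edges) - energy(xs[j], features, lambdas, edges)
            if np.log(rng.uniform()) < -d_e:          # Metropolis acceptance
                xs[j] = prop
    return xs

def learn_igm(x_obs, features, n_bins=15, step=1.0, tol=0.05, max_iter=50):
    """Update potentials until synthesized histograms match observed ones."""
    projections = [x_obs @ f for f in features]
    lo = min(p.min() for p in projections)
    hi = max(p.max() for p in projections)
    edges = np.linspace(lo - 1.0, hi + 1.0, n_bins + 1)
    h_obs = [histogram(p, edges) for p in projections]
    lambdas = [np.zeros(n_bins) for _ in features]
    for _ in range(max_iter):
        x_syn = sample_model(features, lambdas, edges, len(x_obs), x_obs.shape[1])
        h_syn = [histogram(x_syn @ f, edges) for f in features]
        # Gradient ascent on the log-likelihood: dL/dlambda_i = H_syn - H_obs.
        for lam, hs, ho in zip(lambdas, h_syn, h_obs):
            lam += step * (hs - ho)
        # Total histogram divergence D = sum_i KL(H_syn, H_obs), as on the slide.
        div = sum(np.sum(hs * np.log((hs + 1e-8) / (ho + 1e-8)))
                  for hs, ho in zip(h_syn, h_obs))
        step *= 0.95                                  # reduce the step size
        if div < tol:
            break
    return lambdas, x_syn

# Example usage (2-D toy data, two axis-aligned features; values are made up):
#   rng = np.random.default_rng(1)
#   x_obs = rng.normal(size=(500, 2)) @ np.diag([1.0, 0.2])
#   feats = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
#   lambdas, x_syn = learn_igm(x_obs, feats)
```

The only IGM-specific pieces are the energy $\sum_i \langle \lambda_i, \delta(\mathbf{f}_i^T x) \rangle$ and the update $\lambda_i \leftarrow \lambda_i + s\,(H^{syn}_{\mathbf{f}_i} - H^{obs}_{\mathbf{f}_i})$; the sampler can be swapped for the Gibbs sampling or importance-sampling acceleration described next.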
Gibbs Sampling
$x_1^{(t+1)} \sim \pi\big(x_1 \mid x_2^{(t)}, x_3^{(t)}, \dots, x_K^{(t)}\big)$
$x_2^{(t+1)} \sim \pi\big(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \dots, x_K^{(t)}\big)$
$\;\vdots$
$x_K^{(t+1)} \sim \pi\big(x_K \mid x_1^{(t+1)}, \dots, x_{K-1}^{(t+1)}\big)$

Minimum Entropy Principle
Minimum entropy principle
• To make the learnt distribution close to the observed one
$KL\big(f, p(x;\Lambda^*)\big) = \int f(x) \log \frac{f(x)}{p(x;\Lambda^*)}\, dx = E_f[\log f(x)] - E_f[\log p(x;\Lambda^*)] = \operatorname{entropy}\big(p(x;\Lambda^*)\big) - \operatorname{entropy}\big(f(x)\big)$
(the last equality holds because the matched histogram constraints give $E_f[\log p(x;\Lambda^*)] = E_{p(x;\Lambda^*)}[\log p(x;\Lambda^*)]$)
Feature selection: $\{\mathbf{f}_{(i)}\}^* = \arg\min \operatorname{entropy}\big(p(x;\Lambda^*)\big)$

Feature Pursuit
A greedy procedure to learn the feature set: $\{\mathbf{f}_i\}_{i=1}^{K+1} = \{\mathbf{f}_i\}_{i=1}^{K} \cup \{\mathbf{f}_+\}$
Current model: $p(x;\Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \langle \lambda_i, \delta(\mathbf{f}_i^T x) \rangle \Big\}$
Augmented model: $p_+(x;\Lambda_+) = \frac{1}{Z(\Lambda_+)} \exp\Big\{ -\sum_{i=1}^{m} \langle \lambda_i, \delta(\mathbf{f}_i^T x) \rangle - \langle \lambda_+, \delta(\mathbf{f}_+^T x) \rangle \Big\}$
Reference model: $p_{\mathrm{ref}} = \arg\max_{p_+(x;\Lambda_+)} \big[ KL\big(f(x), p(x;\Lambda)\big) - KL\big(f(x), p_+(x;\Lambda_+)\big) \big]$
Approximate information gain: $d(\mathbf{f}_+) = KL\big(f(x), p(x;\Lambda)\big) - KL\big(f(x), p_{\mathrm{ref}}(x;\Lambda_+)\big)$
Proposition: the approximate information gain for a new feature is $d(\mathbf{f}_+) \approx KL\big(H^{obs}_{\mathbf{f}_+}, H^{p}_{\mathbf{f}_+}\big)$, and the optimal energy function for this feature is $\lambda_+ = \log \frac{H^{p}_{\mathbf{f}_+}}{H^{obs}_{\mathbf{f}_+}}$

Kullback-Leibler Feature
$\mathbf{f}_{KL} = \arg\max_{\mathbf{f}} KL\big(H^{obs}_{\mathbf{f}}, H^{syn}_{\mathbf{f}}\big) = \arg\max_{\mathbf{f}} \sum_z H^{obs}_{\mathbf{f}}(z) \log \frac{H^{obs}_{\mathbf{f}}(z)}{H^{syn}_{\mathbf{f}}(z)}$
Pursue the feature by
• Hybrid Monte Carlo
• Sequential 1D optimization
• Feature selection

Acceleration by Importance Sampling
Gibbs sampling is too slow…
Importance sampling by the reference model:
$p(x; \Lambda^{\mathrm{ref}}) = \frac{1}{Z(\Lambda^{\mathrm{ref}})} \exp\Big\{ -\sum_{i=1}^{m} \langle \lambda_i^{\mathrm{ref}}, \delta(\mathbf{f}_i^T x) \rangle \Big\}$
Draw $x_j^{\mathrm{ref}} \sim p(x; \Lambda^{\mathrm{ref}})$ and weight each sample by
$w_j = \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i - \lambda_i^{\mathrm{ref}}, \delta(\mathbf{f}_i^T x_j^{\mathrm{ref}}) \big\rangle \Big\}$

Flowchart of IGM
[Flowchart: observed samples → observed histograms → IGM parameter learning by MCMC → synthesized samples and histograms; if the KL divergence is below threshold, output the model, otherwise pursue a new KL feature and repeat.]

Toy Problems (1)
[Figures: feature pursuit, synthesized samples, Gibbs potentials, observed and synthesized histograms for a mixture of two Gaussians and a circle.]

Toy Problems (2)
[Figures: the same quantities for the Swiss Roll data set.]

Applied to High Dimensions
In high-dimensional space
• Too many features are needed to constrain every dimension
• MCMC sampling is extremely slow
Solution: dimension reduction by PCA
Application: learning a face prior model
• 83 landmarks defined to represent a face (166-d)
• 524 samples

Face Prior Learning (1)
[Figures: observed face examples; synthesized face samples without any features.]

Face Prior Learning (2)
[Figures: faces synthesized with 10 features; with 20 features.]

Face Prior Learning (3)
[Figures: faces synthesized with 30 features; with 50 features.]

Observed Histograms, Synthesized Histograms, Gibbs Potential Functions
[Figures: observed histograms, synthesized histograms, and learnt Gibbs potential functions for the face model.]

Learning Caricature Exaggeration
Synthesis Results
[Figures: caricature exaggeration learnt with IGM and the resulting syntheses.]

Learning 2D Gibbs Process
[Figures: observed pattern, its triangulation, and a random pattern; for each of four features, the observed histogram, synthesized histogram, and synthesized pattern (1)–(4).]

CSAIL
Thank you!
celiu@csail.mit.edu
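As a companion to the earlier sketch, the importance-sampling reweighting and the Kullback-Leibler feature score described above can be written in a few lines. The helper names, the shared binning scheme, and the self-normalization of the weights are assumptions made for illustration, not the talk's implementation.

```python
# Sketch of (1) importance-sampling reweighting of samples drawn from a
# reference model p(x; Lambda_ref) so that they approximate p(x; Lambda), and
# (2) scoring a candidate feature by KL(H_obs, H_syn), the criterion used to
# pick the Kullback-Leibler feature.  Names and binning are illustrative.
import numpy as np

def importance_weights(x_ref, features, lambdas, lambdas_ref, edges):
    """w_j proportional to exp{-sum_i <lambda_i - lambda_i_ref, delta(f_i^T x_j)>}, normalized."""
    log_w = np.zeros(len(x_ref))
    for f, lam, lam_ref in zip(features, lambdas, lambdas_ref):
        bins = np.clip(np.digitize(x_ref @ f, edges) - 1, 0, len(lam) - 1)
        log_w -= (lam - lam_ref)[bins]
    log_w -= log_w.max()                     # stabilize before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

def weighted_histogram(proj, weights, edges):
    """Histogram of projections in which each sample carries an importance weight."""
    h, _ = np.histogram(proj, bins=edges, weights=weights)
    return h / max(h.sum(), 1e-12)

def kl_feature_score(f, x_obs, x_syn, weights, edges, eps=1e-8):
    """KL(H_obs_f, H_syn_f): how badly the current model misses feature f."""
    h_obs, _ = np.histogram(x_obs @ f, bins=edges)
    h_obs = h_obs / max(h_obs.sum(), 1)
    h_syn = weighted_histogram(x_syn @ f, weights, edges)
    return float(np.sum(h_obs * np.log((h_obs + eps) / (h_syn + eps))))

# The next feature to add is the candidate with the largest score:
#   f_KL = argmax over candidate f of kl_feature_score(f, x_obs, x_syn, w, edges)
```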