Learning Inhomogeneous Gibbs Models
Ce Liu
celiu@microsoft.com
How to Describe the Visual World
Histogram
• Histogram: marginal distribution of image variances
• Non-Gaussian distributed
Texture Synthesis (Heeger et al., 1995)
• Image decomposition by steerable filters
• Histogram matching
FRAME (Zhu et al., 1997)
• Homogeneous Markov random field (MRF)
• Minimax entropy principle to learn a homogeneous Gibbs distribution
• Gibbs sampling and feature selection
Our Problem

• To learn the distribution of structural signals
Challenges
• How to learn non-Gaussian distributions in high dimensions from a small number of observations?
• How to capture the sophisticated properties of the distribution?
• How to optimize parameters with global convergence?
Inhomogeneous Gibbs Models (IGM)
A framework to learn arbitrary high-dimensional distributions
• 1D histograms on linear features to describe high-dimensional distributions
• Maximum Entropy Principle – Gibbs distribution
• Minimum Entropy Principle – Feature Pursuit
• Markov chain Monte Carlo for parameter optimization
• Kullback-Leibler Feature (KLF)
1D Observation: Histograms

• Feature $f(x): \mathbb{R}^d \to \mathbb{R}$
  • Linear feature: $f(x) = f^T x$
  • Kernel distance: $f(x) = \|f - x\|$
• Marginal distribution
  $h_f(z) = \int \delta(z - f^T x)\, f(x)\, dx$
• Histogram
  $H_f = \frac{1}{N} \sum_{i=1}^{N} \delta(f^T x_i)$, where $\delta(f^T x_i) = (0, \ldots, 0, 1, 0, \ldots, 0)$ is the indicator of the bin containing $f^T x_i$
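To make this concrete, here is a minimal sketch (my own, not from the talk) of computing the histogram $H_f$ of a linear feature; the function name feature_histogram, the bin count, and the bin range are illustrative choices.

import numpy as np

def feature_histogram(X, f, n_bins=32, z_range=(-3.0, 3.0)):
    """X: (N, d) samples; f: (d,) linear feature. Returns the normalized histogram H_f."""
    z = X @ f                                                  # projections f^T x_i
    edges = np.linspace(z_range[0], z_range[1], n_bins + 1)
    bins = np.clip(np.digitize(z, edges) - 1, 0, n_bins - 1)   # bin index of each projection
    H = np.zeros(n_bins)
    np.add.at(H, bins, 1.0)                                    # accumulate one-hot indicators delta(f^T x_i)
    return H / len(z)                                          # (1/N) sum_i delta(f^T x_i)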
Intuition
[Figure: a 2D density f(x) projected onto linear features f1 and f2, yielding the 1D histograms H_{f1} and H_{f2}]
Learning Descriptive Models
[Figure: the histograms of the observed data, $H_{f_1}^{obs}$ and $H_{f_2}^{obs}$, are matched by the synthesized histograms $H_{f_1}^{syn}$ and $H_{f_2}^{syn}$, driving the learnt model p(x) toward f(x)]
Learning Descriptive Models

• With sufficient features, the learnt model p(x) can converge to the underlying distribution f(x)
• Linear features and histograms are robust compared with higher-order statistics
• Descriptive models
  $\Omega_f = \{\, p(x) \mid h_{f_i}^{f}(z) = h_{f_i}^{p}(z),\ i = 1, \ldots, m \,\}$
Maximum Entropy Principle

• Maximum Entropy Model
  • To generalize the statistical properties of the observed examples
  • To make the learnt model carry no more information than what is available
• Mathematical formulation
  $p^*(x) = \arg\max_p \; \mathrm{entropy}(p(x)) = \arg\max_p \Big\{ -\int p(x) \log p(x)\, dx \Big\}$
  subject to: $H_{f_i}^{p} = H_{f_i}^{f},\ i = 1, \ldots, m$
Intuition of Maximum Entropy Principle
[Figure: the maximum entropy solution $p^*(x)$ within $\Omega_f = \{\, p(x) \mid H_{f_1}^{f}(z) = H_{f_1}^{p}(z) \,\}$, the set of models whose histogram on $f_1$ matches that of f(x)]
Inhomogeneous Gibbs Distribution

• Solution form of the maximum entropy model
  $p(x;\Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i,\ \delta(f_i^T x) \big\rangle \Big\}$
• Parameter: $\Lambda = \{\lambda_i\}$
• Gibbs potential: $\lambda_i(z)$, entering the energy through $\langle \lambda_i(z),\ \delta(f_i^T x) \rangle$
Estimating Potential Function


• Distribution form
  $p(x;\Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i,\ \delta(f_i^T x) \big\rangle \Big\}$
• Normalization
  $Z(\Lambda) = \int \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i,\ \delta(f_i^T x) \big\rangle \Big\}\, dx$
• Maximum Likelihood Estimation (MLE)
  Let $L(\Lambda) = \sum_{i=1}^{n} \log p(x_i;\Lambda)$, $\quad \Lambda^* = \arg\max_\Lambda L(\Lambda)$
• First-order derivative
  $\frac{\partial L(\Lambda)}{\partial \lambda_i} = -\frac{1}{Z}\frac{\partial Z}{\partial \lambda_i} - H_{f_i}^{obs} = E_{p(x;\Lambda)}\big[\delta(f_i^T x)\big] - H_{f_i}^{obs}$
Parameter Learning

• Monte Carlo integration
  $E_{p(x;\Lambda)}\big[\delta(f_i^T x)\big] \approx H_{f_i}^{syn}
   \quad\Longrightarrow\quad
   \frac{\partial L(\Lambda)}{\partial \lambda_i} \approx H_{f_i}^{syn} - H_{f_i}^{obs}$
• Algorithm
  Input: $\{f_i\}$, $\{H_{f_i}^{obs}(z)\}$
  Initialize: $\Lambda = \{\lambda_i\}$, step size $s$
  Loop
      Sampling: $\{x_i\} \sim p(x;\Lambda)$
      Compute syn histograms: $H_{f_i}^{syn}$, $i = 1:m$
      Update parameters: $\lambda_i \mathrel{+}= s\,\big(H_{f_i}^{syn} - H_{f_i}^{obs}\big)$, $i = 1:m$
      Histogram divergences: $D = \sum_{i=1}^{m} KL\big(H_{f_i}^{syn}, H_{f_i}^{obs}\big)$
      Reduce $s$
  Until: $D < \epsilon$
  Output: $\Lambda$, $\{x_i\}$
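A minimal Python sketch of this loop (my illustration, not the authors' code): sample_fn stands for any MCMC sampler of $p(x;\Lambda)$, such as the Gibbs sampler on the next slide; feature_histogram is the helper from the earlier sketch; the step-size schedule, bin count, and tolerance are arbitrary choices.

import numpy as np

def kl(h_p, h_q, eps=1e-8):
    """Discrete KL divergence between two normalized histograms."""
    return float(np.sum(h_p * np.log((h_p + eps) / (h_q + eps))))

def learn_potentials(features, H_obs, sample_fn, n_bins=32, s=1.0, tol=1e-2, n_samples=1000):
    m = len(features)
    lambdas = [np.zeros(n_bins) for _ in range(m)]            # Gibbs potentials, one per feature
    while True:
        X = sample_fn(lambdas, n_samples)                     # {x_i} ~ p(x; Lambda) via MCMC
        H_syn = [feature_histogram(X, f, n_bins) for f in features]
        for i in range(m):                                    # ascend the log-likelihood gradient
            lambdas[i] += s * (H_syn[i] - H_obs[i])
        D = sum(kl(H_syn[i], H_obs[i]) for i in range(m))     # total histogram divergence
        s *= 0.97                                             # reduce the step size
        if D < tol:                                           # stop once D < epsilon
            return lambdas, X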
Gibbs Sampling
x1(t 1) ~  ( x1 | x2(t ) , x3(t ) ,, xK(t ) )
x2(t 1) ~ π ( x2 | x1(t 1) , x3(t ) , , xK(t ) )
xK( t 1) ~  ( xK | x1( t 1) ,, xK( t-11) )
y
x
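A minimal Gibbs-sweep sketch (an illustration under my assumptions, not the talk's implementation): energy(x) is assumed to return $\sum_i \langle \lambda_i, \delta(f_i^T x)\rangle$, and each conditional $\pi(x_k \mid x_{-k})$ is approximated on a discrete grid of candidate values.

import numpy as np

def gibbs_sweep(x, energy, grid, rng=None):
    """One sweep: resample x[k] ~ pi(x_k | x_-k) for k = 1..K with the other coordinates fixed."""
    if rng is None:
        rng = np.random.default_rng()
    x = x.copy()
    for k in range(len(x)):
        cand = np.tile(x, (len(grid), 1))
        cand[:, k] = grid                                 # candidate values for coordinate k
        logp = -np.array([energy(c) for c in cand])       # log p(x) up to the normalizing constant
        p = np.exp(logp - logp.max())
        p /= p.sum()
        x[k] = rng.choice(grid, p=p)                      # draw from the discretized conditional
    return x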
Minimum Entropy Principle

• Minimum entropy principle
  • To make the learnt distribution close to the observed one
  $KL\big(f, p(x;\Lambda^*)\big) = \int f(x) \log \frac{f(x)}{p(x;\Lambda^*)}\, dx
    = E_f[\log f(x)] - E_f\big[\log p(x;\Lambda^*)\big]
    = \mathrm{entropy}\big(p(x;\Lambda^*)\big) - \mathrm{entropy}\big(f(x)\big)$
  (the last step uses $E_f[\log p(x;\Lambda^*)] = E_{p^*}[\log p(x;\Lambda^*)]$, which holds because $p(x;\Lambda^*)$ matches the histograms of $f$)
• Feature selection: $\Phi = \{f^{(i)}\}$
  $\Phi^* = \arg\min_{\Phi} \mathrm{entropy}\big(p(x;\Lambda^*)\big)$
Feature Pursuit

• A greedy procedure to learn the feature set $\Phi = \{f_i\}_{i=1}^{K}$: $\quad \Phi^+ = \{\Phi, f\}$
  $p(x;\Lambda) = \frac{1}{Z(\Lambda)} \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i,\ \delta(f_i^T x) \big\rangle \Big\}$
  $p(x;\Lambda^+) = \frac{1}{Z(\Lambda^+)} \exp\Big\{ -\sum_{i=1}^{m} \big\langle \lambda_i,\ \delta(f_i^T x) \big\rangle - \big\langle \lambda,\ \delta(f^T x) \big\rangle \Big\}$
• Reference model
  $p_{ref} = \arg\max_{p(x;\Lambda^+)} KL\big(f(x),\ p(x;\Lambda^+)\big)$
• Approximate information gain
  $d(f) = KL\big(f(x),\ p(x;\Lambda)\big) - KL\big(f(x),\ p_{ref}(x;\Lambda^+)\big)$
Proposition
The approximate information gain for a new feature is
  $d(f) = KL\big(H_f^{obs},\ H_f^{p}\big)$
and the optimal energy function for this feature is
  $\lambda_f = \log \frac{H_f^{p}}{H_f^{obs}}$
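A small numeric illustration of the Proposition (my own example values): given an observed histogram H_obs and the current model histogram H_p for a candidate feature, the gain is their KL divergence and the closed-form potential is log(H_p / H_obs).

import numpy as np

H_obs = np.array([0.10, 0.40, 0.40, 0.10])              # observed marginal of the candidate feature
H_p   = np.array([0.25, 0.25, 0.25, 0.25])              # current (near-uniform) model marginal

gain     = float(np.sum(H_obs * np.log(H_obs / H_p)))   # d(f) = KL(H_obs, H_p) ~= 0.193
lambda_f = np.log(H_p / H_obs)                          # optimal Gibbs potential for this feature
# Multiplying the model by exp(-lambda_f(z)) = H_obs(z) / H_p(z) corrects its marginal on f.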
Kullback-Leibler Feature

• Kullback-Leibler Feature
  $f_{KL} = \arg\max_{f} KL\big(H_f^{obs},\ H_f^{syn}\big)
          = \arg\max_{f} \sum_{z} H_f^{obs}(z) \log \frac{H_f^{obs}(z)}{H_f^{syn}(z)}$
• Pursue the feature by
  • Hybrid Monte Carlo
  • Sequential 1D optimization
  • Feature selection
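A minimal sketch of KL-feature selection over a finite candidate pool (my simplification: the hybrid Monte Carlo and sequential 1D optimization steps listed above are not shown). It reuses feature_histogram and kl from the earlier sketches; X_obs and X_syn are observed and synthesized samples, and candidates is a list of trial feature vectors.

import numpy as np

def select_kl_feature(X_obs, X_syn, candidates, n_bins=32):
    """Return the candidate feature maximizing KL(H_f^obs, H_f^syn)."""
    best_f, best_kl = None, -np.inf
    for f in candidates:
        f = f / np.linalg.norm(f)                        # compare all features at unit norm
        d = kl(feature_histogram(X_obs, f, n_bins),      # KL(H_f^obs, H_f^syn)
               feature_histogram(X_syn, f, n_bins))
        if d > best_kl:
            best_f, best_kl = f, d
    return best_f, best_kl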
Acceleration by Importance Sampling


• Gibbs sampling is too slow…
• Importance sampling by the reference model
  $p(x;\Lambda^{ref}) = \frac{1}{Z(\Lambda^{ref})} \exp\Big\{ -\sum_{i=1}^{m+1} \big\langle \lambda_i^{ref},\ \delta(f_i^T x) \big\rangle \Big\}$
  $x_j^{ref} \sim p(x;\Lambda^{ref})$
  $w_j = \exp\Big\{ -\sum_{i=1}^{m+1} \big\langle \lambda_i - \lambda_i^{ref},\ \delta(f_i^T x_j^{ref}) \big\rangle \Big\}$
Flowchart of IGM
[Flowchart: observed samples → observed histograms → IGM parameter learning (MCMC) → synthesized samples; if the histogram KL divergence is below ε, output the model, otherwise pursue a new KL feature and repeat]
Toy Problems (1)
[Figures: feature pursuit, synthesized samples, Gibbs potentials, observed histograms, and synthesized histograms for a mixture of two Gaussians and for a circle]
Toy Problems (2)
[Figure: results on the Swiss Roll]
Applied to High Dimensions

• In high-dimensional space
  • Too many features are needed to constrain every dimension
  • MCMC sampling is extremely slow
• Solution: dimension reduction by PCA (a sketch follows below)
• Application: learning a face prior model
  • 83 landmarks defined to represent a face (166-d)
  • 524 samples
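A minimal PCA reduction sketch (my illustration; the number of retained components is an arbitrary choice): the 166-d landmark vectors are projected onto the leading principal components, learning runs in the coefficient space, and decode maps synthesized coefficients back to landmarks.

import numpy as np

def pca_reduce(X, n_components=30):
    """X: (n_samples, d) landmark vectors. Returns PCA coefficients and a decoder back to landmarks."""
    mean = X.mean(axis=0)
    Xc = X - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components]                    # leading principal directions, shape (n_components, d)
    coeffs = Xc @ W.T                        # low-dimensional representation used for learning
    decode = lambda c: c @ W + mean          # map coefficients back to the 166-d landmark space
    return coeffs, decode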
Face Prior Learning (1)
Observed face examples
Synthesized face samples without any features
Face Prior Learning (2)
Synthesized with 10 features
Synthesized with 20 features
Face Prior Learning (3)
Synthesized with 30 features
Synthesized with 50 features
Observed Histograms
Synthesized Histograms
Gibbs Potential Functions
Learning Caricature Exaggeration
Synthesis Results
Learning 2D Gibbs Process
Observed Pattern
Triangulation
Random Pattern
Synthesized Histogram (1)
Obs Histogram (1)
Syn Histogram (1)
Syn Pattern (1)
Synthesized Histogram (2)
Obs Histogram (2)
Syn Histogram (2)
Syn Pattern (2)
Synthesized Histogram (3)
Obs Histogram (3)
Syn Histogram (3)
Syn Pattern (3)
Obs Histogram (4)
Syn Histogram (4)
Syn Pattern (4)
CSAIL
Thank you!
celiu@csail.mit.edu