Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model:
Sparse Coding, Independent Component Analysis, and
Minimum Entropy Segmentation
Jason Palmer
Department of Electrical and Computer Engineering
University of California San Diego
1. Unsupervised learning of structure in continuous sensor data
   1. Data must be analyzed into component parts – reduced to a set of states of the world which are active or not active in various combinations
2. Probabilistic modeling – states
   1. Linear model
      1. Basis sets
      2. Hierarchical linear processes
      3. Also kernel non-linear models
   2. Probability model
      1. Distributions of input variables – types of densities
      2. Conditionally independent inputs – Markov connection of states
3. Thesis topics
   1. Types of distributions and representations that lead to efficient and monotonic algorithms using non-Gaussian densities
   2. Calculating probabilities in the linear process model
• General form of the model:
  x(t₁, …, t_d) = Σ_{τ₁,…,τ_d} A(τ₁, …, τ_d) s(t₁ − τ₁, …, t_d − τ_d)
• Sparse coding / dictionary learning: x(t) = A s(t), A overcomplete
• ICA: x(t) = A s(t), A invertible, s(t) = A⁻¹ x(t)
• Blind deconvolution: x(t) = Σ_τ A(τ) s(t − τ), s(t) = Σ_τ W(τ) x(t − τ)
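To make the three special cases concrete, here is a minimal sketch (my own illustration with hypothetical dimensions, not from the thesis) that generates data from each form:

```python
# Illustrative sketch of the three model forms above (hypothetical dimensions).
import numpy as np

rng = np.random.default_rng(0)
T, n_src = 500, 3
s = rng.laplace(size=(n_src, T))                 # sparse (super-Gaussian) sources

# (1) Overcomplete instantaneous model: 2 sensors, 3 sources
A_over = rng.standard_normal((2, n_src))
x_over = A_over @ s

# (2) Invertible instantaneous model: sources recovered as s(t) = A^{-1} x(t)
A_sq = rng.standard_normal((n_src, n_src))
x_sq = A_sq @ s
s_rec = np.linalg.solve(A_sq, x_sq)

# (3) Convolutive model: x(t) = sum_tau A(tau) s(t - tau)
n_obs, n_taps = 3, 5
A_conv = rng.standard_normal((n_taps, n_obs, n_src))
x_conv = np.zeros((n_obs, T))
for tau in range(n_taps):
    x_conv[:, tau:] += A_conv[tau] @ s[:, :T - tau]

print(np.allclose(s_rec, s))                     # True: the invertible case recovers s
```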
[Block diagram: IID source processes pass through linear filters H₁(z), …, H_M(z) to give the observed process; Source 1 and Source 2 reach a sensor array through channel filters]
• Binocular color image: right and left RGB values (R_rgb, L_rgb) represented as a 2D field of 6D vectors
• Binocular video can be represented as a 3D field of 6D vectors
• Use a block basis or a mixture of filters
[Diagram: WORLD generates x(t); the REPRESENTATION produces the reconstruction x̂(t)]
• Plot of speech signal: woman speaking the word “sixteen” (phoneme annotation: s-s-s…i-i-k-s-s…t-t-e-e-n)
• Clearly speech is non-stationary, but it seems to be locally stationary
[Diagram: mixture of linear processes – filter matrix a₁₁(t) … a₁m(t), …, a_r1(t) … a_rm(t); the source process s(t) is filtered by A₁(z), …, A_M(z) to produce x₁(t), …, x_M(t), giving the observed signal x(t)]
I. Types of Probability Densities
   A. Sub- and Super-Gaussianity
   B. Representation in terms of Gaussians
      1. Convex variational representation – Strong Super-Gaussians
      2. Gaussian Scale Mixtures
         a. Multivariate Gaussian Scale Mixtures (ISA / IVA)
      3. Relationship between representations
II. Sparse Coding and Dictionary Learning
   A. Optimization with given (overcomplete) basis
      1. MAP – Generalized FOCUSS
         a. Global Convergence of Iteratively Re-weighted Least Squares (IRLS)
         b. Convergence rates
      2. Variational Bayes (VB) – Sparse Bayesian Learning
   B. Dictionary Learning
      1. Lagrangian Newton Algorithm
      2. Comparison in Monte Carlo experiment
III. Independent Component Analysis
   A. Convexity of the ICA optimization problem – stability of ICA
      1. Fisher Information and Cramér-Rao lower bound on variance
   B. Super-Gaussian Mixture Source Model
      1. Comparison between Gaussian and Super-Gaussian updates
IV. Linear Process Mixture Model
   A. Probability of signal segments
   B. Mixture model segmentation
• Super-Gaussian = more peaked than Gaussian, heavier tails
• Sub-Gaussian = flatter, more uniform, shorter tails than Gaussian
• Generalized Gaussian exp(−|x|^p): Laplacian (p = 1.0), Gaussian (p = 2.0), sub-Gaussian (p = 10.0)
[Plot: super-Gaussian, Gaussian, and sub-Gaussian density shapes]
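A quick numerical aside (my own, not from the thesis): the excess kurtosis of the generalized Gaussian family exp(−|x|^p) makes the sub/super-Gaussian distinction above concrete; it is positive for p < 2 and negative for p > 2.

```python
# Excess kurtosis of the generalized Gaussian density p(x) ∝ exp(-|x|^p),
# using the standard moment formulas for this family.
from math import gamma

def gg_excess_kurtosis(p):
    return gamma(5.0 / p) * gamma(1.0 / p) / gamma(3.0 / p) ** 2 - 3.0

for p, label in [(1.0, "Laplacian, super-Gaussian"),
                 (2.0, "Gaussian"),
                 (10.0, "sub-Gaussian")]:
    print(f"p = {p:4.1f}   excess kurtosis = {gg_excess_kurtosis(p):+6.3f}   ({label})")
# Positive for p < 2 (peaked, heavy tails), zero at p = 2, negative for p > 2.
```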
• Component density determines shape along the direction of the basis vector
• Super-Gaussian = concentrated near zero, some large values
• Sub-Gaussian = uniform around zero, no large values
[Plot: Sub- AND Super-Gaussian example]
• Super-Gaussians represent sparse random variables
• Most often zero, occasionally large magnitudes
• Sparse random variables model variables with on/off, active/inactive states
Convex / concave functions are pointwise supremum / infimum of linear functions
Convex:
Concave: f f ( x )
( x )
sup
inf
x x
f f
*
* (
(
)
)
•
Convex function f (x) may be concave in x 2 , i.e. f ( x ) = g ( x 2 ), and g is concave on (0,
).
• Example: |x| 3/2 convex |x| 3/4 concave
• Example: |x| 4 convex |x| 2 still convex convex x 2 x 4 concave x 3/2 x 2 x 3/2 concave in x 2 x 4 convex in x 2
• If f(x) is concave in x², and p(x) = exp(−f(x)):
  f(x) = inf_{ξ ≥ 0} ½ ξ x² − g*(ξ/2)
  p(x) = sup_{ξ ≥ 0} √(ξ/2π) exp(−½ ξ x²) φ(ξ)
  We say p(x) is Strongly Super-Gaussian
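As a sanity check of the representation above (my own numerical sketch): for the Laplacian, f(x) = |x| = g(x²) with g(y) = √y, and −g*(ξ/2) = 1/(2ξ), so the infimum of the quadratic bounds should recover |x| exactly, attained at ξ = 1/|x|.

```python
# Numerical check of f(x) = inf_{xi >= 0} (1/2) xi x^2 + 1/(2 xi) for f(x) = |x|.
import numpy as np

xis = np.linspace(1e-3, 50.0, 200_000)           # dense grid over the scale parameter
for x in np.linspace(-3.0, 3.0, 13):
    bound = 0.5 * xis * x**2 + 0.5 / xis         # Gaussian-shaped upper bounds on |x|
    print(f"x = {x:+.2f}   inf over xi = {bound.min():.4f}   |x| = {abs(x):.4f}")
# The infimum matches |x| up to grid resolution, so exp(-|x|) is a pointwise
# supremum of scaled Gaussians, i.e. strongly super-Gaussian.
```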
• If f(x) is convex in x², and p(x) = exp(−f(x)):
  f(x) = sup_{ξ ≥ 0} ½ ξ x² − g*(ξ/2)
• Gaussian Scale Mixtures (GSMs) are sums of Gaussian densities with different variances, but all zero mean:
  p(x) = ∫₀^∞ √(ξ/2π) exp(−½ ξ x²) p(ξ) dξ
• A random variable with a GSM density can be represented as the product of a standard Normal random variable Z and an arbitrary non-negative random variable W:
  X = Z W^{−1/2}
[Plot: individual Gaussians and the resulting Gaussian Scale Mixture]
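A small sketch (a standard example, not specific to the thesis) of the product representation: taking W ~ Gamma(ν/2, rate ν/2) in X = Z W^{−1/2} gives a Student-t GSM with ν degrees of freedom, which can be checked through its excess kurtosis.

```python
# Sample X = Z W^{-1/2} with Z standard Normal and W Gamma-distributed; the result
# is a Student-t random variable (a classical Gaussian Scale Mixture).
import numpy as np

rng = np.random.default_rng(1)
nu, n = 8.0, 1_000_000
z = rng.standard_normal(n)
w = rng.gamma(shape=nu / 2.0, scale=2.0 / nu, size=n)   # Gamma with rate nu/2
x = z / np.sqrt(w)                                      # X = Z * W^{-1/2}

emp_kurt = np.mean(x**4) / np.mean(x**2) ** 2 - 3.0
print(f"empirical excess kurtosis: {emp_kurt:.2f}")
print(f"Student-t value 6/(nu-4):  {6.0 / (nu - 4.0):.2f}")
```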
• Multivariate densities can be modeled by the product of a non-negative scalar and a Gaussian random vector:
  (x₁, …, x_d) = W^{−1/2} (z₁, …, z_d)
• Contribution: general formula for multivariate GSMs:
  p(x₁, …, x_d) = π^{−(d−1)/2} (−D)^{(d−1)/2} p(x) |_{x = ‖x‖₂}
  where D denotes differentiation with respect to x²
• Criterion for p(x) = exp(−f(x)) = exp(−g(x²)) to have a convex variational representation:
  g″(x) = −D² log p(√x) ≤ 0, i.e. g is concave
• The criterion for the GSM representation is given by the Bernstein-Widder Theorem on complete monotonicity (CM):
  p(x) = ∫₀^∞ e^{−ξx} dμ(ξ)  ⟺  p ≥ 0, −p′ ≥ 0, p″ ≥ 0, …
• For the Gaussian representation, we need p(√x) = exp(−g(x)) to be CM
• CM relationship (Bochner): exp(−g(x)) is CM ⟺ g′ is CM (with g ≥ 0)
• Since g′ CM implies g″ ≤ 0, the GSM representation implies the convex variational representation
• Bayesian Linear Model x = As + v: basis A, sources s, noise v
  ŝ = arg max_s p(s|x) = arg min_s −log p(s|x)
• Can always be put in the form: min f(s) subject to As = x, with A overcomplete
• For Strongly Super-Gaussian priors, p(s) = exp(−f(s)):
  f(z) ≤ f(s) + (f′(s)/(2s)) (z² − s²)
• Sources are independent, so the cost function is f(s) = Σ_i f(s_i) and Λ(s) is diagonal with Λ(s)_ii = f′(s_i)/s_i:
  f(z) ≤ f(s) + ½ zᵀ Λ(s) z − ½ sᵀ Λ(s) s
• Solve: s_new = arg min_z ½ zᵀ Λ(s_old) z subject to Az = x
• s_old is feasible (As_old = x), so at z = s_new the right-hand side is ≤ 0, hence f(s_new) ≤ f(s_old) and the cost decreases
• For a Gaussian Scale Mixture prior p(s), we have s = ξ^{−1/2} z, and s is conditionally Gaussian given ξ → EM algorithm
• The complete log likelihood is quadratic since s is conditionally Gaussian:
  −½ (x − As)ᵀ Σ⁻¹ (x − As) − ½ Σ_i ξ_i s_i²
• This is linear in ξ. For EM we need the expected value of ξ given x. But ξ → s → x is a Markov chain:
  E(ξ_i | x) = E(ξ_i | s_i) = f′(s_i)/s_i
• The GSM EM algorithm is thus the same as the Strong Super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS)
• The FOCUSS algorithm is a particular MAP algorithm for sparse regression, with f(s) = |s|^p or f(s) = log |s|. It was derived by Gorodnitsky and Rao (1997), and Rao and Kreutz-Delgado (1998)
• With an arbitrary Strongly Super-Gaussian source prior, Generalized FOCUSS:
  s_new = Λ⁻¹(s_old) Aᵀ (A Λ⁻¹(s_old) Aᵀ)⁻¹ x
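A minimal sketch of the Generalized FOCUSS / IRLS iteration above (my own illustration, assuming f(s) = Σ_i |s_i|^p so that Λ(s)_ii = p|s_i|^{p−2}, with a small constant added for numerical stability):

```python
# IRLS / Generalized FOCUSS for min sum_i |s_i|^p subject to A s = x.
import numpy as np

def generalized_focuss(A, x, p=1.0, n_iter=50, eps=1e-8):
    s = A.T @ np.linalg.solve(A @ A.T, x)          # minimum-norm feasible start
    for _ in range(n_iter):
        pi = np.abs(s) ** (2.0 - p) / p + eps      # diagonal of Lambda(s)^{-1}
        APA = (A * pi) @ A.T                       # A Lambda^{-1} A^T
        s = pi * (A.T @ np.linalg.solve(APA, x))   # Lambda^{-1} A^T (A Lambda^{-1} A^T)^{-1} x
    return s

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 10))
s_true = np.zeros(10); s_true[[1, 7]] = [2.0, -1.5]    # sparse ground truth
x = A @ s_true

s_hat = generalized_focuss(A, x, p=1.0)
print(np.round(s_hat, 3))         # a sparse feasible solution
print(np.allclose(A @ s_hat, x))  # the constraint A s = x holds at every iteration
```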
• Convergence is proved using Zangwill’s Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of the iterates, and (3) closure (continuity) of the algorithm mapping
• We prove a general theorem on boundedness of IRLS iterations with a diagonal weight matrix: the least squares solution always lies in the bounded part of the orthant-constraint intersection
[Figure: least squares solution; bounded vs. unbounded orthant-constraint intersections]
• We also derive the convergence rate of Generalized FOCUSS for convex f(s). The convergence rate for concave f(s) was proved by Gorodnitsky and Rao; we give an alternative proof
• General form of Sparse Bayesian Learning / Type II ML:
  – Find the Normal density (mean μ_{s|x} and covariance Σ_{s|x}) that minimizes an upper bound on the KL divergence from the true posterior density:
    D( N(s; μ_{s|x}, Σ_{s|x}) ‖ p(s|x) )
  – OR: MAP estimate of the hyperparameters ξ in the GSM (instead of s)
  – OR: Variational Bayes algorithm, which finds the separable posterior q(s|x) q(ξ|x) that minimizes the KL divergence from the true posterior p(s, ξ | x)
• The bound is derived using a modified Jensen’s inequality:
  E[f(s)] = E[g(s²)] ≤ g(E[s²])
• Then minimize the bound by coordinate descent as before. This is again IRLS, with the same functional form, but now the diagonal weights are
  λ_i = f′(ξ_i)/ξ_i,  where ξ_i² = E(s_i² | x, ξ) = μ_i² + [Σ_{s|x}]_ii
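A sketch of the resulting iteration (my own illustration, assuming a Laplacian prior f(s) = |s|, so f′(ξ)/ξ = 1/ξ, and a fixed noise variance σ²):

```python
# Variational / SBL-style IRLS with posterior covariance, for a Laplacian prior:
#   Sigma = (A^T A / sigma2 + Lambda)^{-1},  mu = Sigma A^T x / sigma2,
#   xi_i = sqrt(mu_i^2 + Sigma_ii),          lambda_i = 1 / xi_i.
import numpy as np

def vb_sparse_regression(A, x, sigma2=1e-3, n_iter=50):
    lam = np.ones(A.shape[1])                      # diagonal variational weights
    for _ in range(n_iter):
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(lam))
        mu = Sigma @ A.T @ x / sigma2
        xi = np.sqrt(mu**2 + np.diag(Sigma))       # E(s_i^2 | x, xi)^{1/2}
        lam = 1.0 / xi                             # f'(xi_i) / xi_i for f(s) = |s|
    return mu, Sigma

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 20))
s_true = np.zeros(20); s_true[[2, 9, 15]] = [1.5, -2.0, 1.0]
x = A @ s_true + 0.01 * rng.standard_normal(8)

mu, Sigma = vb_sparse_regression(A, x)
print(np.round(mu, 2))   # the variational weights drive most coefficients toward zero
```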
An example of sparse regression with an overcomplete basis. The line is the one-dimensional solution space (translated null space). Below, the posterior density p(s|x) in the null space is plotted for Generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2.
• Problem: Given data x₁, …, x_N, find an (overcomplete) basis A for which As = x and the sources are sparse
• Three algorithms:
  (1) Lewicki-Sejnowski ICA
  (2) Kreutz-Delgado FOCUSS-based
  (3) Girolami VB-based algorithm
• We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado’s algorithm
These algorithms have the general form:
  A ← (1 + η) A − η Σ_{k=1}^N δ_k s_kᵀ B
(1) Lewicki-Sejnowski: δ_k = A f′(s_k), B = I
(2) Kreutz-Delgado: δ_k = e_k, B = (Σ_k s_k s_kᵀ)⁻¹
(3) Girolami VB: δ_k = x_k, B = (Σ_k ⟨s_k s_kᵀ⟩)⁻¹
(4) Lagrangian Newton: δ_k = (A Λ_k⁻¹ Aᵀ)⁻¹ x_k, B = diag(γ)⁻¹
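A sketch of one generic update of the (reconstructed) form above, instantiated with the Lewicki-Sejnowski choice δ_k = A f′(s_k), B = I; the 1/N averaging over samples and the Laplacian score f′(s) = sign(s) are my assumptions for illustration. The coefficients s_k would come from a sparse-coding step such as the IRLS sketch earlier.

```python
# One generic dictionary update A <- (1 + eta) A - eta * (1/N) sum_k delta_k s_k^T B,
# with the Lewicki-Sejnowski choice delta_k = A f'(s_k), B = I, i.e. the natural-
# gradient-style rule A <- A + eta * A (I - (1/N) sum_k f'(s_k) s_k^T).
import numpy as np

def dictionary_update(A, S, eta=0.05, fprime=np.sign):
    # A: (m, n) current dictionary; S: (n, N) inferred sparse coefficients.
    N = S.shape[1]
    delta_sum = A @ (fprime(S) @ S.T) / N          # (1/N) sum_k A f'(s_k) s_k^T
    return (1.0 + eta) * A - eta * delta_sum       # B = I here

rng = np.random.default_rng(4)
A = rng.standard_normal((8, 16))                   # placeholder dictionary
S = rng.laplace(size=(16, 200))                    # placeholder sparse coefficients
print(dictionary_update(A, S).shape)
```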
• Experiment: generate random A matrices, sparse sources s, and data x = As, N = 100m
• Cases: A 2×3 with sparsity 1, A 4×8 with sparsity 2, A 10×20 with sparsity 1-5
• Test algorithms:
  – Girolami, p = 1.0, Jeffreys prior
  – Lagrangian Newton, p = 1.0, p = 1.1
  – Kreutz-Delgado, (non-)normalized
  – Lewicki-Sejnowski, p = 1.1, Logistic
• Goal: find synchronous “events” in multiple interesting components
• Learn basis for segments, length 100, across 5 channels
• Events are rare, so the prior density is sparse
EEG scalp maps:
• Experimental task: the subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise clicks the right button
• Each column is a basis vector: segment of length 100 x 5 channels
• Only the second channel active in this subspace – related to incorrect response by subject – subject hears buzzer when wrong response given
• Dictionary learning with time series: must learn phase shifts
• In this subspace, channels 1 and 3 are active
• Channel 3 crest slightly precedes channel 1 crest
• This subspace is associated with correct response
• In this subspace, channels 1 and 2 have phase shifted 20 Hz bursts
• Not obviously associated with any recorded event
• ICA model: x = As, with A invertible, s = Wx
• Maximum Likelihood estimate of W = A⁻¹:
  p(x₁, …, x_N; W) = Π_{k=1}^N |det W| p_s(W x_k)
• For independent sources: p_s(W x_k) = Π_{i=1}^n p_{s_i}(w_iᵀ x_k)
• The source densities are unknown and must be adapted – Quasi-ML (Pham 92)
• Since ML minimizes KL divergence over a parametric family, ML with the ICA model is equivalent to minimizing Mutual Information
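For concreteness, here is a minimal sketch of ML ICA (my own illustration, not the thesis implementation): it whitens the data and uses a fixed super-Gaussian score φ(y) = tanh(y) with the natural / relative gradient, rather than the adapted source densities of Quasi-ML.

```python
# Natural / relative gradient ascent on the ICA log likelihood with a fixed
# super-Gaussian score phi(y) = tanh(y): W <- W + eta (I - E[phi(y) y^T]) W.
import numpy as np

def ml_ica(X, n_iter=2000, eta=0.05):
    n, N = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(Xc @ Xc.T / N)
    V = E @ np.diag(1.0 / np.sqrt(d)) @ E.T        # whitening matrix
    Z = V @ Xc
    W = np.eye(n)
    for _ in range(n_iter):
        Y = W @ Z                                  # current source estimates
        grad = np.eye(n) - (np.tanh(Y) @ Y.T) / N  # I - E[phi(y) y^T]
        W = W + eta * grad @ W
    return W @ V                                   # total unmixing matrix

rng = np.random.default_rng(5)
S = rng.laplace(size=(3, 5000))                    # sparse, super-Gaussian sources
A = rng.standard_normal((3, 3))
W = ml_ica(A @ S)
C = W @ A                                          # global system, ~ scaled permutation
print(np.round(C / np.abs(C).max(axis=1, keepdims=True), 2))
```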
• If the sources are Gaussian, A cannot be identified – only the covariance can be
• If the sources are non-Gaussian, A can be identified (Cheng, Rosenblatt)
• Remarkably, the expected value of the Hessian of the ICA ML cost function
  −N log |det W| − Σ_{k=1}^N log p(W x_k)
  can be calculated
• Work with the “global system” C = WA, whose optimum is always the identity, C* = I
• Using independence of the sources at the optimum, we can block diagonalize the Hessian linear operator H(B) = D in the global space into 2×2 blocks coupling (b_ij, b_ji) to (d_ij, d_ji)
• Main condition for positive definiteness and convexity of the ML problem at the optimum: each 2×2 block, a quadratic form in (b_ij, b_ji), is positive definite, which requires
  κ_i σ_j² κ_j σ_i² − 1 > 0  for all i ≠ j,  where κ_i = E[f_i″(s_i)],  σ_i² = E[s_i²]
• The expected Hessian is the Fisher Information matrix
• Its inverse is the Cramér-Rao lower bound on unbiased estimator variance
• Plot shows the bound for an off-diagonal element with a Generalized Gaussian prior
• The Hessian also allows a Newton method
• For EM stability, E[f_i″(s_i)] is replaced by E[f_i′(s_i)/s_i], and for GSM priors
  E[f_i′(s_i)/s_i] · E[s_i²] ≥ 1
• The variational formulation also allows derivation of a generalization of the Gaussian mixture model to strongly super-Gaussian mixtures:
  p(s) = Σ_{j=1}^m α_j β_j^{1/2} p_j( β_j^{1/2} (s − μ_j) )
• The update rules are similar to the Gaussian mixture model, but include the variational parameters ξ_jk:
  μ_j = Σ_{k=1}^N z_jk ξ_jk w_iᵀ x_k / Σ_{k=1}^N z_jk ξ_jk
  β_j⁻¹ = Σ_{k=1}^N z_jk ξ_jk (w_iᵀ x_k − μ_j)² / Σ_{k=1}^N z_jk
• Goal: find an efficient basis for representing image patches. Data vectors are 12 x 12 blocks.
• Using the learned models, we classify each image block as coming from Model 1 or Model 2
• Lower left shows the raw probability for Model 1; lower right shows the binary segmentation
• Blue captures the high frequency ground
• Again we classify each image block as coming from Model 1 or Model 2
• Lower left shows the raw probability for Model 1; lower right shows the binary segmentation
• Blue captures the high frequency tree bark
• Low frequency components are not sparse, and may be multimodal
• Edge filters in Model 1 are not as sparse as the higher frequency components of Model 2
• Densities are very sparse
• Higher frequency components occur less often in the data
• Convergence is less smooth
[Figure: image bases and EEG bases, labeled by estimated shape parameter from 1.2 (more sparse: edge filters, etc.) to 2.0 (less sparse: biological signals)]
• A theorem of Gray shows that, given a finite autoregressive process, the optimal rate transform is the inverse of the mixing filter:
  Z(t) → H(z) → X(t) → H⁻¹(z) → Z(t)
• For difference distortion measures: R_Z(D) = R_X(D)
• The proof seems to extend to general linear systems, and potentially to mixture models
• To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate
• Multichannel convolution is a linear operation
• The convolution matrix is block Toeplitz, e.g.
  T_N = [ A_0   A_{-1}  A_{-2}  A_{-3}
          A_1   A_0     A_{-1}  A_{-2}
          A_2   A_1     A_0     A_{-1}
          A_3   A_2     A_1     A_0   ]
• To calculate the likelihood, we need its determinant
• Extension of the Szegő limit theorem:
  lim_{N→∞} (1/N) log det T_N = (1/2π) ∫_{−π}^{π} log |det W(ω)| dω,   W(ω) = Σ_k W_k e^{−iωk}
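A numerical sketch of the block Szegő limit (my own example, not from the thesis): build the block Toeplitz covariance T_N of a short matrix MA filter and compare (1/N) log det T_N with the spectral integral (1/2π) ∫ log det S(ω) dω, where S(ω) = A(ω) A(ω)^H.

```python
# Block Szego limit: (1/N) log det T_N -> (1/2 pi) \int log det S(omega) d omega.
import numpy as np

rng = np.random.default_rng(7)
m, taps, N = 2, 3, 300                             # channels, filter length, block size

# Short matrix MA filter A_0, A_1, A_2 (kept well conditioned)
A = [np.eye(m) + 0.2 * rng.standard_normal((m, m))] + \
    [0.3 / k * rng.standard_normal((m, m)) for k in range(1, taps)]

# Block autocovariances of x(t) = sum_k A_k e(t-k): R_l = sum_k A_{k+l} A_k^T
R = {}
for l in range(taps):
    R[l] = sum(A[k + l] @ A[k].T for k in range(taps - l))
    R[-l] = R[l].T

# Block Toeplitz covariance T_N with [T_N]_{jk} = R_{j-k}
T = np.zeros((N * m, N * m))
for j in range(N):
    for k in range(N):
        if abs(j - k) < taps:
            T[j*m:(j+1)*m, k*m:(k+1)*m] = R[j - k]
_, logdet = np.linalg.slogdet(T)

# Spectral side: average of log det S(omega) over a frequency grid
omegas = np.linspace(-np.pi, np.pi, 8192, endpoint=False)
rhs = 0.0
for w in omegas:
    Aw = sum(A[k] * np.exp(-1j * w * k) for k in range(taps))
    rhs += np.log(np.abs(np.linalg.det(Aw @ Aw.conj().T)))
rhs /= omegas.size

print(f"(1/N) log det T_N = {logdet / N:.4f}")
print(f"spectral integral = {rhs:.4f}")    # the two approach each other as N grows
```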
• Can be extended to multi-dimensional fields, e.g. image convolution
• Linear Process Mixture Model run on several sources – 2 models
• Coherent activity is identified and segmented blindly
• Spectral density resolution is greatly enhanced by eliminating noise
• Spectra before segmentation / rejection (left) and after (right)
• Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum
[Figure: all-series spectrum vs. segmented spectrum, for the Source A channel and Source B channel]
• Fully implement hierarchical linear process model
• Implement Hidden Markov Model to learn relationships among various model states
• Test new multivariate dependent density models
• Implement multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates
• Implement Linear Process Mixture Model in VLSI circuits
• “A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian
Priors,” Proceedings of the 36th Asilomar Conference on Signals and Systems,
2002.
• “A General Framework for Component Estimation,” Proceedings of the 4th
International Symposium on Independent Component Analysis, 2003.
• “Variational EM Algorithms for Non-Gaussian Latent Variable Models,”
Advances in Neural Information Processing Systems, 2005.
• “Super-Gaussian Mixture Source Model for ICA,” Proceedings of the 6th
International Symposium on Independent Component Analysis, 2006.
• “Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution,” submitted to ICASSP, 2007.
• Convergence proof for the Generalized FOCUSS algorithm
• Proposal of the notion of Strong Super-Gaussianity and clarification of its relationship to Gaussian Scale Mixtures and kurtosis
• Extension of Gaussian Scale Mixtures to general multivariate dependency models using derivatives of the univariate density
• Derivation of the Super-Gaussian Mixture Model, which generalizes the monotonic Gaussian mixture algorithm with the same complexity
• Derivation of the Lagrangian Newton algorithm for overcomplete dictionary learning – best performance in Monte Carlo simulations
• Analysis of the convexity of EM-based ICA
• Proposal of the Linear Process Mixture Model, and derivation of segment probability, enabling probabilistic modeling of nonstationary time series