
Variational and Scale Mixture Density Representations for Estimation in the Bayesian Linear Model:

Sparse Coding, Independent Component Analysis, and

Minimum Entropy Segmentation

Jason Palmer

Department of Electrical and Computer Engineering

University of California San Diego

Introduction

1. Unsupervised learning of structure in continuous sensor data
   1. Data must be analyzed into component parts – reduced to a set of states of the world which are active or not active in various combinations
2. Probabilistic modeling – states
   1. Linear model
      1. Basis sets
      2. Hierarchical linear processes
      3. Also kernel non-linear methods
   2. Probability model
      1. Distributions of input variables – types of densities
      2. Conditionally independent inputs – Markov connection of states
3. Thesis topics
   1. Types of distributions and representations that lead to efficient and monotonic algorithms using non-Gaussian densities
   2. Calculating probabilities in the linear process model

Linear Process Model

• General form of the model (a small sketch follows below):
  x(t_1, …, t_d) = Σ_{τ_1, …, τ_d} A(τ_1, …, τ_d) s(t_1 − τ_1, …, t_d − τ_d)

• Sparse coding: x(t) = A s(t), A overcomplete

• ICA: x(t) = A s(t), A invertible, s(t) = A^{-1} x(t)

• Blind deconvolution: x(t) = A(z) s(t), s(t) = W(z) x(t)
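To make the convolutive form concrete, here is a minimal numpy sketch (my own illustration, not from the slides) of the one-dimensional time-index case x(t) = Σ_τ A(τ) s(t − τ); the filter length, dimensions, and Laplacian sources are arbitrary choices.

```python
import numpy as np

def convolutive_mix(A, s):
    """Apply x(t) = sum_tau A[tau] @ s[t - tau] to a multichannel source.

    A : array of shape (L, m, n)  -- L matrix filter taps
    s : array of shape (T, n)     -- n-channel source signal
    Returns x of shape (T, m), taking s(t) = 0 for t < 0.
    """
    L, m, n = A.shape
    T = s.shape[0]
    x = np.zeros((T, m))
    for t in range(T):
        for tau in range(min(L, t + 1)):
            x[t] += A[tau] @ s[t - tau]
    return x

# toy example: 3 taps, 2 observation channels, 2 sources
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2, 2))
s = rng.laplace(size=(200, 2))        # sparse-ish (super-Gaussian) sources
x = convolutive_mix(A, s)
```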

Voice and Microphone

[Figure: an IID source process passes through linear filters H_T(z), H_R(z), H_M(z) to give the observed process]

Sensor Arrays

[Figure: convolutive model – Source 1 and Source 2 reach a sensor array, with a different impulse response from each source to each sensor; the same setting arises with EEG sources]

Binocular Color Image

• Channels: right eye (r, g, b) and left eye (r, g, b)
• Represented as a 2D field of 6D vectors
• Binocular video can be represented as a 3D field of 6D vectors
• Use a block basis or a mixture of filters

Biological Systems

[Figure: the WORLD generates x(t); the biological system forms an internal REPRESENTATION and a reconstruction x̂(t)]

Linear Process Mixture Model

• Plot of a speech signal: a woman speaking the word "sixteen" ("s-s-s-i-i-k-s-s-t-t-e-e-n-n")
• Clearly speech is non-stationary, but it appears to be locally stationary

[Figure: source model – filters driving an IID innovation s(t); observation segment model x(t) = A(z) s(t); generative mixture model – convolutive models A_1(z), …, A_M(z) generate candidate segments x_1(t), …, x_M(t), one of which produces each observed segment x(t)]

Outline

I. Types of probability densities
   A. Sub- and super-Gaussianity
   B. Representation in terms of Gaussians
      1. Convex variational representation – strong super-Gaussians
      2. Gaussian scale mixtures
      3. Multivariate Gaussian scale mixtures (ISA / IVA)
      4. Relationship between representations
II. Sparse coding and dictionary learning
   A. Optimization with a given (overcomplete) basis
      1. MAP – Generalized FOCUSS
         a. Global convergence of Iteratively Reweighted Least Squares (IRLS)
         b. Convergence rates
      2. Variational Bayes (VB) – Sparse Bayesian Learning
   B. Dictionary learning
      1. Lagrangian Newton algorithm
      2. Comparison in a Monte Carlo experiment
III. Independent Component Analysis
   A. Convexity of the ICA optimization problem – stability of ICA
      1. Fisher Information and the Cramér-Rao lower bound on variance
   B. Super-Gaussian mixture source model
      1. Comparison between Gaussian and super-Gaussian updates
IV. Linear Process Mixture Model
   A. Probability of signal segments
   B. Mixture model segmentation

Sub- and Super-Gaussianity

• Super-Gaussian = more peaked than Gaussian, with heavier tails
• Sub-Gaussian = flatter, more uniform, with shorter tails than Gaussian
• Generalized Gaussian ∝ exp(−|x|^p): Laplacian (p = 1.0), Gaussian (p = 2.0), sub-Gaussian (p = 10.0) (a small sketch follows below)

[Figure: super-Gaussian, Gaussian, and sub-Gaussian densities and example scatter plots]

• The component density determines the shape of the data distribution along the direction of the basis vector
• Super-Gaussian = concentrated near zero, with some large values
• Sub-Gaussian = uniform around zero, with no large values
• Densities can also be sub- AND super-Gaussian
• Super-Gaussians represent sparse random variables: most often zero, occasionally large in magnitude
• Sparse random variables model quantities with on / off, active / inactive states
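As a quick illustration of the shape parameters mentioned above, here is a small sketch (my own; the normalization p / (2Γ(1/p)) for exp(−|x|^p) is assumed rather than taken from the slide):

```python
import numpy as np
from scipy.special import gamma

def gen_gauss_pdf(x, p):
    """Generalized Gaussian density proportional to exp(-|x|^p),
    normalized so that it integrates to one."""
    return p / (2.0 * gamma(1.0 / p)) * np.exp(-np.abs(x) ** p)

x = np.linspace(-4, 4, 401)
for p in (1.0, 2.0, 10.0):       # Laplacian, Gaussian, strongly sub-Gaussian
    px = gen_gauss_pdf(x, p)
    # compare peak values and tail decay for each shape parameter
    print(f"p={p}: peak {px.max():.3f}, tail p(4) = {px[-1]:.2e}")
```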

Convex Variational Representation

• Convex / concave functions are the pointwise supremum / infimum of linear functions:
  Convex:  f(x) = sup_ξ [ ξ x − f*(ξ) ]
  Concave: f(x) = inf_ξ [ ξ x − f*(ξ) ]
• A convex function f(x) may be concave in x², i.e. f(x) = g(x²) with g concave on (0, ∞)
• Example: |x|^{3/2} is convex, but |x|^{3/4} is concave, so |x|^{3/2} is concave in x²
• Example: |x|^4 is convex, and |x|^2 is still convex, so |x|^4 is convex in x²
• If f(x) is concave in x², and p(x) = exp(−f(x)):
  f(x) = inf_{ξ ≥ 0} [ ½ ξ x² − g*(ξ/2) ]
  p(x) = sup_{ξ ≥ 0} (ξ / 2π)^{1/2} exp(−½ ξ x²) φ(ξ)
  We say that p(x) is Strongly Super-Gaussian
• If f(x) is convex in x², and p(x) = exp(−f(x)):
  f(x) = sup_{ξ ≥ 0} [ ½ ξ x² − g*(ξ/2) ]

Scale Mixture Representation

• Gaussian Scale Mixtures (GSMs) are mixtures of zero-mean Gaussian densities with different variances:
  p(x) = ∫₀^∞ (ξ / 2π)^{1/2} exp(−½ ξ x²) p(ξ) dξ
• A random variable with a GSM density can be represented as the product of a standard Normal random variable Z and an arbitrary non-negative random variable W:
  X = Z W^{−1/2}

[Figure: Gaussians of several variances, and the Gaussian Scale Mixture they form]
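A hedged sketch of the product representation X = Z W^{−1/2}; the specific mixing choice below (variance V = W^{−1} drawn from an exponential density, giving a Laplacian-like marginal) is an illustrative assumption, not prescribed by the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# X = Z * W**(-1/2): Z standard normal, W a non-negative mixing variable.
# Illustrative choice (assumption): W = 1/V with V ~ Exponential(1),
# so Var(X | W) = V and the marginal of X is Laplacian-like.
Z = rng.standard_normal(n)
V = rng.exponential(scale=1.0, size=n)
X = Z * np.sqrt(V)                    # equivalently Z * W**(-0.5) with W = 1/V

# positive excess kurtosis indicates a super-Gaussian (heavy-tailed) marginal
kurt = np.mean(X**4) / np.mean(X**2)**2 - 3.0
print(f"excess kurtosis ~ {kurt:.2f}")
```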

• Multivariate densities can be modeled as the product of a non-negative scalar and a Gaussian random vector:
  (x_1, …, x_d)^T = W^{−1/2} (z_1, …, z_d)^T
• Contribution: a general formula for multivariate GSMs,
  p(x_1, …, x_d) = π^{−(d−1)/2} (−D)^{(d−1)/2} p(x), evaluated at x² = ‖x‖², where D denotes the derivative with respect to x²

Relationship between Representations

• Criterion for p(x) = exp(−f(x)) = exp(−g(x²)) to have the convex variational representation:
  g″(x²) ≤ 0, i.e. g is concave in x²
• Criterion for the GSM representation, given by the Bernstein–Widder theorem on complete monotonicity (CM):
  p(x) = ∫₀^∞ e^{−ξx} dμ(ξ)  ⟺  p ≥ 0, −p′ ≥ 0, p″ ≥ 0, …
• For the Gaussian scale mixture representation, we need p(√x) = exp(−g(x)) to be CM (an informal numerical check follows below)
• CM relationship (Bochner): exp(−g(x)) is CM  ⟺  g′ is CM (with g ≥ 0)
• The GSM representation implies the convex variational representation.
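As an informal illustration of the complete-monotonicity criterion (my own check, not a proof and not from the slides), one can verify numerically that the first few derivatives of p(√x) ∝ exp(−√x), the Laplacian case, alternate in sign:

```python
import sympy as sp

x = sp.symbols('x', positive=True)
h = sp.exp(-sp.sqrt(x))        # p(sqrt(x)) for the Laplacian, up to a constant

# Complete monotonicity requires (-1)^n d^n h / dx^n >= 0 for all n.
# Check the first few derivatives at a sample point (illustrative only).
for n in range(5):
    d = sp.diff(h, x, n)
    val = d.subs(x, 1.0)
    print(n, sp.N((-1) ** n * val) >= 0)
```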

Sparse Regression –Variational MAP

• Bayesian Linear Model x = As + v: basis A, sources s, noise v
  ŝ = arg max_s p(s | x) = arg min_s −log p(s | x)
• Can always be put in the form: min f(s) subject to As = x, with A overcomplete
• For Strongly Super-Gaussian priors p(s) = exp(−f(s)):
  f(z) ≤ f(s) + ( f′(s) / 2s ) (z² − s²)
• The sources are independent, so the cost function is f(s) = Σ_i f(s_i); with Λ(s) diagonal, Λ_ii(s) = f′(s_i)/s_i:
  f(z) ≤ f(s) + ½ z^T Λ(s) z − ½ s^T Λ(s) s
• Solve (a sketch follows below):
  s_new = arg min_z ½ z^T Λ(s_old) z  subject to  Az = x
• s_old satisfies A s_old = x, so it is feasible; the minimization makes the right-hand side non-positive, hence f(s_new) − f(s_old) ≤ 0: monotonic descent
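A minimal sketch of the resulting IRLS iteration for the noiseless problem min Σ_i f(s_i) subject to As = x, taking f(s) = |s| so that Λ_ii = f′(s_i)/s_i = 1/|s_i|; the ε guard, pseudoinverse initialization, and iteration count are implementation choices, not from the slides.

```python
import numpy as np

def irls_map(A, x, n_iter=50, eps=1e-8):
    """IRLS for min sum_i |s_i|  subject to  A s = x (overcomplete A).

    Each step solves the weighted least-squares problem
        min  0.5 * s^T Lambda s   s.t.  A s = x,
    whose closed form is s = Pi A^T (A Pi A^T)^{-1} x with Pi = Lambda^{-1}.
    """
    s = np.linalg.pinv(A) @ x           # feasible starting point
    for _ in range(n_iter):
        Pi = np.diag(np.abs(s) + eps)   # Lambda^{-1} for f(s) = |s|
        s = Pi @ A.T @ np.linalg.solve(A @ Pi @ A.T, x)
    return s

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 8))
s_true = np.zeros(8); s_true[[1, 5]] = [2.0, -1.5]   # sparse source
x = A @ s_true
print(np.round(irls_map(A, x), 3))      # should concentrate on entries 1 and 5
```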

Sparse Regression – MAP – GSM

• For a Gaussian Scale Mixture prior p(s), we have s = ξ^{−1/2} z, and s is conditionally Gaussian given ξ → EM algorithm
• The complete-data log likelihood is quadratic in s, since s is conditionally Gaussian:
  −½ (x − As)^T Σ^{−1} (x − As) − ½ Σ_i ξ_i s_i²
• It is linear in ξ. For EM we need the expected value of ξ given x. But ξ → s → x is a Markov chain:
  E(ξ_i | x) = E(ξ_i | s_i) = f′(s_i) / s_i
• The GSM EM algorithm is thus the same as the Strong Super-Gaussian algorithm – both are Iteratively Reweighted Least Squares (IRLS)

Generalized FOCUSS

• The FOCUSS algorithm is a particular MAP algorithm for sparse regression, with f(s) = |s|^p or f(s) = log |s|. It was derived by Gorodnitsky and Rao (1997), and Rao and Kreutz-Delgado (1998)
• With an arbitrary Strongly Super-Gaussian source prior, Generalized FOCUSS:
  s_new = Π(s_old) A^T ( A Π(s_old) A^T + λI )^{−1} x,  with Π(s) = Λ(s)^{−1}
• Convergence is proved using Zangwill's Global Convergence Theorem, which requires: (1) a descent function, (2) boundedness of the iterates, and (3) closure (continuity) of the algorithm mapping
• We prove a general theorem on the boundedness of IRLS iterations with a diagonal weight matrix: the least squares solution always lies in the bounded part of the orthant–constraint intersection

[Figure: least squares solution, with bounded and unbounded orthant–constraint intersections]

• We also derive the convergence rate of Generalized FOCUSS for convex f(s). The convergence rate for concave f(s) was proved by Gorodnitsky and Rao; we give an alternative proof

Variational Bayes

• General form of Sparse Bayesian Learning / Type II ML:
  – Find the Normal density (mean and covariance) that minimizes an upper bound on the KL divergence from the true posterior density:
    (μ_{s|x}, Σ_{s|x}) = arg min_{μ, Σ} D̃( N(s; μ, Σ) ‖ p(s | x) )
  – OR: MAP estimation of the hyperparameters ξ in the GSM (instead of s)
  – OR: a Variational Bayes algorithm, which finds the separable posterior q(s|x) q(ξ|x) minimizing the KL divergence from the true posterior p(s, ξ | x)
• The bound is derived using a modified Jensen's inequality:
  E f(s) = E g(s²) ≤ g(E s²)
• Then minimize the bound by coordinate descent as before. This is again IRLS, with the same functional form but with diagonal weights (a sketch follows below):
  ξ_i = f′(γ_i) / γ_i,  γ_i² = E(s_i² | x, ξ) = μ_i² + [Σ_{s|x}]_ii
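A hedged sketch of the variational / IRLS form above, assuming a Laplacian prior (so f′(γ)/γ = 1/γ) and a Gaussian likelihood with noise variance σ²; the structure follows the slide, but the specific prior, noise model, and parameter values are illustrative.

```python
import numpy as np

def vb_sparse_regression(A, x, sigma2=1e-2, n_iter=50, eps=1e-10):
    """Coordinate-descent / IRLS sketch of the variational bound minimization.

    Gaussian posterior approximation N(mu, Sigma) for p(s | x):
        Sigma = (A^T A / sigma2 + diag(xi))^{-1},  mu = Sigma A^T x / sigma2,
    with weights xi_i = f'(gamma_i)/gamma_i and
        gamma_i^2 = E[s_i^2 | x] = mu_i^2 + Sigma_ii.
    Here f(s) = |s| (Laplacian prior), so f'(gamma)/gamma = 1/gamma.
    """
    n = A.shape[1]
    xi = np.ones(n)
    for _ in range(n_iter):
        Sigma = np.linalg.inv(A.T @ A / sigma2 + np.diag(xi))
        mu = Sigma @ A.T @ x / sigma2
        gamma = np.sqrt(mu**2 + np.diag(Sigma)) + eps
        xi = 1.0 / gamma                       # f'(gamma)/gamma for f(s) = |s|
    return mu, Sigma

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 8))
s_true = np.zeros(8); s_true[[0, 6]] = [1.5, -2.0]
x = A @ s_true + 0.05 * rng.standard_normal(4)
mu, _ = vb_sparse_regression(A, x)
print(np.round(mu, 3))     # posterior mean should concentrate on entries 0 and 6
```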

Sparse Regression Example

• An example of sparse regression with an overcomplete basis
• The line is the one-dimensional solution space (a translated null space)
• Below, the posterior density p(s|x) in the null space is plotted for Generalized Gaussian priors with p = 1.0, p = 0.5, and p = 0.2

Dictionary Learning

• Problem: given data x_1, …, x_N, find an (overcomplete) basis A for which As = x and the sources are sparse
• Three existing algorithms:
  (1) Lewicki-Sejnowski ICA-based
  (2) Kreutz-Delgado FOCUSS-based
  (3) Girolami VB-based
• We derived a Lagrangian Newton algorithm similar to Kreutz-Delgado's algorithm
• These algorithms have the general form (a sketch of one instantiation follows after the list below):
  A ← (1 − η) A + η ( Σ_{k=1}^N δ_k s_k^T ) B

  (1) Lewicki-Sejnowski: δ_k = −A f′(s_k), B = I
  (2) Kreutz-Delgado: δ_k = e_k, B = ( Σ_k s_k s_k^T + Λ_k )^{−1}
  (3) Girolami VB: δ_k = x_k, B = ( Σ_k ⟨s_k s_k^T⟩ + Σ_k )^{−1}
  (4) Lagrangian Newton: δ_k = ( A Λ_k A^T )^{−1} x_k, B = diag(·)^{−1}
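A sketch of one instantiation of the generic dictionary step; the choice δ_k = x_k with a damped least-squares B (closest in spirit to the Girolami-style choice above, without the VB covariance term) and the column renormalization are my own assumptions, not a reproduction of any of the four algorithms listed.

```python
import numpy as np

def dictionary_update(A, S, X, eta=0.5, reg=1e-8):
    """One generic dictionary step of the form
        A <- (1 - eta) * A + eta * (sum_k delta_k s_k^T) B,
    with the illustrative (assumed) choices delta_k = x_k and
    B = (sum_k s_k s_k^T + reg*I)^{-1}, i.e. a damped least-squares fit of
    the dictionary to the current codes. Columns are then renormalized.

    A : (m, n) current dictionary, S : (n, N) current codes, X : (m, N) data.
    """
    B = np.linalg.inv(S @ S.T + reg * np.eye(S.shape[0]))
    A_new = (1.0 - eta) * A + eta * (X @ S.T) @ B
    return A_new / np.linalg.norm(A_new, axis=0, keepdims=True)

rng = np.random.default_rng(6)
A0 = rng.standard_normal((4, 8)); A0 /= np.linalg.norm(A0, axis=0)
S0 = rng.laplace(size=(8, 100)) * (rng.random((8, 100)) < 0.2)   # sparse codes
X0 = A0 @ S0
print(dictionary_update(A0, S0, X0).shape)                       # (4, 8)
```

In an alternating scheme, the codes S would be recomputed after each dictionary step, for example with the IRLS sketch given earlier.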

Dictionary Learning Monte Carlo

• Experiment: generate random A matrices, sparse sources s, and data x = As, N = 100 m
• Test cases: A 2 × 3 with sparsity 1; A 4 × 8 with sparsity 2; A 10 × 20 with sparsity 1-5
• Algorithms tested:
  – Girolami, p = 1.0, Jeffreys prior
  – Lagrangian Newton, p = 1.0, p = 1.1
  – Kreutz-Delgado, normalized and non-normalized
  – Lewicki-Sejnowski, p = 1.1, Logistic

Sparse Coding of EEG

• Goal: find synchronous “events” in multiple interesting components

• Learn a basis for segments of length 100 across 5 channels

• Events are rare, so the prior density is sparse

EEG scalp maps:

EEG Segment Basis: Subspace 1

Experimental task: the subject sees a sequence of letters and clicks the left mouse button if the letter is the same as the one two letters back, otherwise clicks the right button

• Each column is a basis vector: segment of length 100 x 5 channels

• Only the second channel is active in this subspace – related to an incorrect response by the subject, who hears a buzzer when a wrong response is given

• Dictionary learning with time series: must learn phase shifts

EEG Segment Basis: Subspace 2

• In this subspace, channels 1 and 3 are active

• Channel 3 crest slightly precedes channel 1 crest

• This subspace is associated with correct response

EEG Segment Basis: Subspace 3

• In this subspace, channels 1 and 2 have phase shifted 20 Hz bursts

• Not obviously associated with any recorded event

ICA

• ICA model: x = As, with A invertible; s = Wx
• Maximum Likelihood estimate of W = A^{−1} (a gradient-ascent sketch follows below):
  p(x_1, …, x_N; W) = Π_{k=1}^N |det W| p_s(W x_k)
• For independent sources:
  p_s(W x_k) = Π_{i=1}^n p_{s_i}(w_i^T x_k)
• Source densities are unknown and must be adapted – Quasi-ML (Pham 92)
• Since ML minimizes the KL divergence over a parametric family, ML with the ICA model is equivalent to minimizing Mutual Information
• If the sources are Gaussian, A cannot be identified – only the covariance
• If the sources are non-Gaussian, A can be identified (Cheng, Rosenblatt)
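A minimal sketch of maximum-likelihood ICA by natural (relative) gradient ascent; the fixed tanh score is an illustrative super-Gaussian choice, not the adaptive source-density approach of the thesis.

```python
import numpy as np

def natural_gradient_ica(X, n_iter=200, eta=0.1):
    """Maximum-likelihood ICA by natural (relative) gradient ascent.

    X : (n, N) observations, assumed zero-mean.
    Uses the score phi(s) = tanh(s), an illustrative super-Gaussian choice.
    Update: W <- W + eta * (I - E[phi(s) s^T]) W.
    """
    n, N = X.shape
    W = np.eye(n)
    for _ in range(n_iter):
        S = W @ X
        G = np.eye(n) - (np.tanh(S) @ S.T) / N
        W = W + eta * G @ W
    return W

# toy demo: mix two Laplacian sources and unmix
rng = np.random.default_rng(3)
S_true = rng.laplace(size=(2, 5000))
A = rng.standard_normal((2, 2))
X = A @ S_true
W = natural_gradient_ica(X - X.mean(axis=1, keepdims=True))
print(np.round(W @ A, 2))   # should be approximately a scaled permutation
```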

ICA Hessian

• Remarkably, the expected value of the Hessian of the ICA ML cost function
  L(W) = N log |det W| + Σ_{k=1}^N log p_s(W x_k)
  can be calculated.
• Work with the "global system" C = WA, whose optimum is always the identity, C* = I
• Using independence of the sources at the optimum, the Hessian linear operator H(B) = D in the global space can be block-diagonalized into 2 × 2 blocks:
  (d_ij, d_ji)^T = [ κ_i σ_j²  1 ; 1  κ_j σ_i² ] (b_ij, b_ji)^T
• Main condition for positive definiteness, and hence convexity of the ML problem at the optimum (a numerical check is sketched below):
  κ_i σ_j² κ_j σ_i² > 1 for i ≠ j,  where κ_i ≡ E f_i″(s_i) and σ_i² ≡ E s_i²
• The expected Hessian is the Fisher Information matrix
• Its inverse is the Cramér-Rao lower bound on the variance of unbiased estimators
• The plot shows the bound for an off-diagonal element with a Generalized Gaussian prior

• Hessian also allows Newton method

• For EM (scale-mixture) stability, the same analysis applies with E f_i″(s_i) replaced by E[ f_i′(s_i)/s_i ], giving the condition E[ f′(s)/s ] · E s² ≥ 1
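A hedged Monte Carlo check of the pairwise positive-definiteness condition; the log-cosh model density (score tanh) and Laplacian sources are illustrative assumptions, not the slide's setup.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200_000

# Monte Carlo check of the condition kappa_i sigma_j^2 kappa_j sigma_i^2 > 1,
# with kappa_i = E[f_i''(s_i)] and sigma_i^2 = E[s_i^2].
# Model density (assumption): f(s) = log cosh(s), so f''(s) = 1 - tanh(s)^2.
s1 = rng.laplace(size=N)
s2 = rng.laplace(size=N)

def kappa(s):
    """E[f''(s)] for f(s) = log cosh(s)."""
    return np.mean(1.0 - np.tanh(s) ** 2)

k1, k2 = kappa(s1), kappa(s2)
v1, v2 = np.mean(s1**2), np.mean(s2**2)
print("condition value:", k1 * v2 * k2 * v1, "(should exceed 1 for stability)")
```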

Super-Gaussian Mixture Model

• The variational formulation also allows derivation of a generalization of the Gaussian mixture model to strongly super-Gaussian mixtures:
  p(s) = Σ_{j=1}^m α_j β_j^{1/2} p_j( β_j^{1/2} (s − μ_j) )
• The update rules are similar to those of the Gaussian mixture model, but include the variational parameters ξ_jk (a one-dimensional sketch follows below):
  μ_j = [ Σ_{k=1}^N z_jk ξ_jk w_i^T x_k ] / [ Σ_{k=1}^N z_jk ξ_jk ]
  β_j^{−1} = [ Σ_{k=1}^N z_jk ξ_jk ( w_i^T x_k − μ_j )² ] / [ Σ_{k=1}^N z_jk ]
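A sketch of these mixture updates in one dimension, using Laplacian components so that f(y) = |y| and ξ = f′(y)/y = 1/|y|; the initialization and component form are my own choices, and this is a toy illustration rather than the full multivariate algorithm.

```python
import numpy as np

def laplace_mixture_em(s, m=2, n_iter=100, eps=1e-8):
    """EM-style fit of a 1-D mixture of Laplacian components with
    variational weights xi_jk = 1/|y_jk| (super-Gaussian mixture sketch).

    Component density: p_j(s) = 0.5 * sqrt(b_j) * exp(-sqrt(b_j) * |s - mu_j|).
    """
    N = s.shape[0]
    alpha = np.full(m, 1.0 / m)
    mu = np.quantile(s, np.linspace(0.25, 0.75, m))
    b = np.full(m, 1.0 / np.var(s))
    for _ in range(n_iter):
        # responsibilities z_jk
        logp = (0.5 * np.log(b)[:, None] - np.log(2.0)
                - np.sqrt(b)[:, None] * np.abs(s[None, :] - mu[:, None])
                + np.log(alpha)[:, None])
        logp -= logp.max(axis=0, keepdims=True)
        z = np.exp(logp); z /= z.sum(axis=0, keepdims=True)
        # variational weights xi_jk = 1/|y|, with y = sqrt(b) * (s - mu)
        y = np.sqrt(b)[:, None] * (s[None, :] - mu[:, None])
        xi = 1.0 / (np.abs(y) + eps)
        # weighted parameter updates (slide-style averages)
        w = z * xi
        mu = (w * s[None, :]).sum(axis=1) / w.sum(axis=1)
        b = 1.0 / ((w * (s[None, :] - mu[:, None]) ** 2).sum(axis=1) / z.sum(axis=1))
        alpha = z.sum(axis=1) / N
    return alpha, mu, b

rng = np.random.default_rng(5)
s = np.concatenate([rng.laplace(loc=-2, size=1500), rng.laplace(loc=3, size=500)])
print(laplace_mixture_em(s))   # weights, means, and inverse-scale parameters
```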

Source Mixture Model Examples

ICA Mixture Model – Images

• Goal: find an efficient basis for representing image patches. Data vectors are 12 x 12 blocks.

[Figure: covariance square root (sphering) and the learned basis]

ICA: Single Basis

ICA Mixture Model: Model 1

ICA Mixture Model: Model 2

Image Segmentation 1

Using the learned models, we classify each image block as from Model 1 or

Model 2

Lower left shows raw probability for

Model 1

Lower right shows binary segmentation

Blue captures high frequency ground

Image Segmentation 2

Again we classify each image block as from Model 1 or Model 2

Lower left shows raw probability for

Model 1

Lower right shows binary segmentation

Blue captures high frequency tree bark

Image model 1 basis densities

• Low frequency components are not sparse, and may be multimodal

• Edge filters in Model 1 are not as sparse as the higher frequency components of Model 2

Image Model 2 densities

• Densities are very sparse

• Higher frequency components occur less often in the data

• Convergence is less smooth

Gen.Gauss. shape parameter histograms

[Figure: histograms of Generalized Gaussian shape parameters, roughly 1.2 to 2.0 – image bases are more sparse (edge filters, etc.); EEG bases are less sparse (biological signals)]

Rate Distortion Theory

• A theorem of Gray shows that, given a finite autoregressive process, the optimal rate transform is the inverse of the mixing filter:
  Z(t) → H(z) → X(t) → H^{−1}(z) → Z(t)
• For difference distortion measures: R_Z(D) = R_X(D)

• Proof seems to extend to general linear systems, and potentially mixture models

• To the extent that Linear Process Mixture Models can model arbitrary piecewise linear random processes, linear mixture deconvolution is a general coding scheme with optimal rate

Time Series Segment Likelihood

• Multichannel convolution is a linear operation
• The convolution matrix is block Toeplitz
• To calculate the likelihood, we need its determinant
• Extension of the Szegő limit theorem (a numerical check is sketched below):
  lim_{N→∞} (1/N) log det T_N = (1/2π) ∫_{−π}^{π} log |det W(ω)| dω,  with W(ω) = Σ_k W_k e^{−iωk}

[Figure: block Toeplitz matrix with matrix taps …, A₂, A₁, A₀, A₋₁, A₋₂, … along the block diagonals]

• Can be extended to multi-dimensional fields, e.g. image convolution
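A small numerical check (my own, under the assumption of a Hermitian positive-definite matrix symbol with lags −1, 0, 1) comparing (1/N) log det of the block Toeplitz matrix with the Szegő integral:

```python
import numpy as np

def toeplitz_logdet_rate(taps, N):
    """(1/N) log det of the block Toeplitz matrix with blocks T[i, j] = W_{i-j}
    (zero outside the given lags); taps covers lags -K..K, taps[K] is lag 0."""
    m = taps[0].shape[0]
    K = (len(taps) - 1) // 2
    T = np.zeros((N * m, N * m))
    for i in range(N):
        for lag in range(-K, K + 1):
            j = i - lag
            if 0 <= j < N:
                T[i*m:(i+1)*m, j*m:(j+1)*m] = taps[K + lag]
    return np.linalg.slogdet(T)[1] / N

def szego_integral(taps, n_freq=2048):
    """(1/2pi) * integral of log |det W(omega)|, W(omega) = sum_k W_k e^{-i omega k}."""
    K = (len(taps) - 1) // 2
    vals = []
    for w in 2 * np.pi * np.arange(n_freq) / n_freq:
        Wf = sum(taps[K + k] * np.exp(-1j * w * k) for k in range(-K, K + 1))
        vals.append(np.log(np.abs(np.linalg.det(Wf))))
    return float(np.mean(vals))

# illustrative Hermitian positive-definite symbol: lags -1, 0, 1
B = 0.3 * np.array([[1.0, 0.0], [0.2, 1.0]])
C = np.array([[2.0, 0.5], [0.5, 2.0]])
taps = [B.T, C, B]                       # W_{-1} = B^T, W_0 = C, W_1 = B
print(toeplitz_logdet_rate(taps, 300), szego_integral(taps))
# the two numbers should approximately agree for large N
```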

Segmented EEG source time series

• Linear Process Mixture Model run on several sources – 2 models

• Coherent activity is identified and segmented blindly

• Spectral density resolution greatly enhanced by eliminating noise

Spectral Density Enhancement

• Spectra before segmentation / rejection (left) and after (right).

• Spectral peaks invisible in the all-series spectrum become visible in the segmented spectrum

[Figure: all-series spectrum (left) vs. segmented spectrum (right) for the Source A and Source B channels]

Future Work

• Fully implement hierarchical linear process model

• Implement Hidden Markov Model to learn relationships among various model states

• Test new multivariate dependent density models

• Implement multivariate convolutive model, e.g. on images to learn wavelets, and test video coding rates

• Implement Linear Process Mixture Model in VLSI circuits

Publications

• “A Globally Convergent Algorithm for MAP Estimation with Non-Gaussian Priors,” Proceedings of the 36th Asilomar Conference on Signals and Systems, 2002.
• “A General Framework for Component Estimation,” Proceedings of the 4th International Symposium on Independent Component Analysis, 2003.
• “Variational EM Algorithms for Non-Gaussian Latent Variable Models,” Advances in Neural Information Processing Systems, 2005.
• “Super-Gaussian Mixture Source Model for ICA,” Proceedings of the 6th International Symposium on Independent Component Analysis, 2006.
• “Linear Process Mixture Model for Piecewise Stationary Multichannel Blind Deconvolution,” submitted to ICASSP 2007.

Summary of Contributions

• Convergence proof for the Generalized FOCUSS algorithm
• Proposal of the notion of Strong Super-Gaussianity, and clarification of its relationship to Gaussian Scale Mixtures and kurtosis
• Extension of Gaussian Scale Mixtures to general multivariate dependency models using derivatives of the univariate density
• Derivation of the Super-Gaussian Mixture Model, which generalizes the monotonic Gaussian mixture algorithm with the same complexity
• Derivation of a Lagrangian Newton algorithm for overcomplete dictionary learning – best performance in the Monte Carlo simulations
• Analysis of the convexity of EM-based ICA
• Proposal of the Linear Process Mixture Model, and derivation of the segment probability, enabling probabilistic modeling of nonstationary time series
