PCA, Projection Pursuit

advertisement
PCA and Hebb Rule
0368-4149-01
Prof. Nathan Intrator
Tuesday 16:00-19:00
Office hours: Wed 4-5
nin@tau.ac.il
cs.tau.ac.il/~nin
Outline


Goals for neural learning - Unsupervised
Goals for statistical/computational learning





PCA
ICA
Exploratory Projection Pursuit
Search for non-Gaussian distributions
Practical implementations
2
Statistical Approach to Unsupervised
Learning





Understanding the nature of data variability
Modeling the data (sometimes very flexible model)
Understanding the nature of the noise
Applying prior knowledge
Extracting features based on:



Prior knowledge
Class prediction
Unsupervised learning
3
Neuronal Goal
We look for axes which minimise projection errors and maximise
the variance after projection
n-dimensional
m-dimensional
vectors
vectors
Ex:
m<n
transform from 2 to 1 dimension
4
Algorithm (cont’d)

Preserve as much of the variance as possible
more information (variance)
rotate
less
information
project
5
Linear transformations – example
2D vectors X in a unit circle with mean (1,1); Y = A*X, A = 2x2 matrix
 Y 1   2 1 X 1 
 
 
 Y 2  1 1  X 2 
The shape is elongated, rotated and the mean is shifted.
6
Invariant distances
Euclidean distance is not invariant to general linear transformations
Y  AX
(1)
(2)
Y -Y
2

 Y -Y

(1)
 X -X
(1)
(2)
(2)

Y  AX
 Y
T
T
T
(1)
-Y

(2)
(1)

A A X -X
(2)
This is invariant only for orthonormal matrices ATA = I
that make rigid rotations, without stretching or shrinking distances.
Idea: standardize the data in some way to create invariant distances.
7

Data standardization
For each vector component X(j)T=(X1(j), ... Xd(j)), j=1 .. n
calculate mean and std: n – number of vectors, d – their dimension
1 n ( j)
1 n ( j)
Xi   X i ; X   X
n j 1
n j 1
X(1)
X(2)
X( n )
(1)
1
(1)
2
(2)
1
(2)
2
( n)
1
( n)
2
X1
X
X2
X
X
X
X
X
Vector of mean
feature values.
Averages over rows.
Xd
X
(1)
d
X
(2)
d
X
( n)
d
8
Standard deviation
Calculate standard deviation:
1 n
X i   X i( j )
n j 1
2
i

n
1
( j)

X
 Xi

i
n  1 j 1
Vector of mean feature values.

2
Variance = square of standard
deviation (std), sum of all
deviations from the mean value.
Transform X => Z, standardized data vectors
Zi( j )   X i( j )  X i   i
9
Std data
Std data: zero mean and unit variance.
1 n ( j) 1 n
Zi   Z i    X i( j )  X i   i  0
n j 1
n j 1

2
Z ,i

1 n
( j)

Z
 Zi

i
n  1 j 1

2

1 n
( j)

X
 Xi

i
n  1 j 1

2
 i2  1
Standardize data after making data transformation.
Effect: data is invariant to scaling only (diagonal transformation).
Distances are invariant, data distribution is the same??
How to make data invariant to any linear transformations?
10
Terminology (Covariance)

How two dimensions vary from the mean with respect to
each other
n
cov( X , Y ) 
(X
i 1
i
 X )(Yi  Y )

(n  1)
cov(X,Y) > 0: Dimensions increase together
cov(X,Y) < 0: One increases, one decreases

cov(X,Y) = 0: Dimensions are independent

11
Terminology (Covariance Matrix)

•
Contains covariance values between all possible dimensions:
C nxn  (cij | cij  cov( Dimi , Dim j ))
Example for three dimensions (x,y,z) (Always symetric):
 cov( x, x) cov( x, y ) cov( x, z ) 


C   cov( y, x) cov( y, y ) cov( y, z ) 
 cov( z, x) cov( z, y ) cov( z, z ) 


cov(x,x)  variance of component x
12
Properties of the Cov matrix

Can be used for creating a distance that is not
sensitive to linear transformation

Can be used to find directions which maximize
the variance

Determines a Gaussian distribution uniquely (up
to a shift)
13
Data standardization example
For our example Y=AX, assuming X means=1 and variances = 1
 Y1   2 1  X 1 
 
 X 
Y
1
1
 2 
 2 
Transformation
Vector of mean
feature values.
1
 3  2 1 1
X  Y 
 1
1
2
1
1

  
 
1
5
σ X    σ Y     Diag  AA T 
1
 2
1
Y Y
 2
2

1
 X X
 2

T

A T A X   X
1
Variance
check it!
2

How to make this
invariant?
14
Covariance matrix
Variance (spread around mean value) + correlation between features.
1 n
(k )
(k )
CX is d x d
Cij 
X

X
X
 X j ; i, j  1 d

i
j
i
n  1 k 1



T
1 n
1
(k )
(k )
T
CX 
X

X
X

X

XX




n  1 k 1
n 1
where X is d x n dimensional matrix of vectors shifted to their means.
Covariance matrix is symmetric Cij = Cji and positive definite.
Diagonal elements are variances (square of std), i2 = Cii
Pearson correlation coefficient
rij  Cij  i j  [1, 1]
Spherical distribution of data has Cij=I (unit matrix).
Elongated ellipsoids: large off-diagonal elements, strong correlations
between features.
15
Mahalanobis distance
Linear combinations of features leads to rotations and scaling of data.
Y  AX; Y  AX; CY  AC X A T
X C  XTCX1X
2
Mahalanobis distance:
is invariant to linear transformations:
1
Y Y
 2
2
CY


  X   X  
1
 Y Y
 2
1
2
1
 2
 X X

X

T
CY1 Y    Y 
T
1
2
T 1

A  A  CX1A 1  A X   X 


T
1
2

2
CX
16

Principal components
How to avoid correlated features?
Correlations  covariance matrix is non-diagonal !
Solution: diagonalize it, then use transformation that makes it
diagonal to de-correlate features. Z are the eigen vectors of Cx
Y  Z X; CX Z  i Z ; CX Z  ZΛ
T
(i )
(i )
CY  ZTCX Z  ZT ZΛ  Λ
In matrix form,
X, Y are dxn,
Z, CX, CY are dxd
C – symmetric, positive definite matrix XTCX > 0 for ||X||>0;
its eigenvectors are orthonormal:
Z(i )T  Z( j )
its eigenvalues are all non-negative
  ij
Z – matrix of orthonormal eigenvectors (because Z is real+symmetric),
transforms X into Y, with diagonal CY, i.e. decorrelated.
17
Matrix form
Eigenproblem for C matrix in matrix form:
 C11 C12

 C21 C22


 Cd 1 C d 2
 Z11 Z12

 Z 21 Z 22


 Zd1 Zd 2
C1d  Z11 Z12

C2 d  Z 21 Z 22


Cdd  Z d 1 Z d 2
Z1d  1 0

Z 2 d  0 2


Z dd  0 0
C X Z  ZΛ
Z1d 

Z2d 



Z dd 
0

0


d 
18
Principal components
PCA: old idea, C. Pearson (1901), H. Hotelling 1933
Y  Z X;
T
CY  Z C X Z  Λ
T
Z – principal components, of vectors X
transformed using eigenvectors of CX
Covariance matrix of transformed vectors
is diagonal => ellipsoidal distribution of
data.
Result: PC are linear combinations of all features, providing new
uncorrelated features, with diagonal covariance matrix = eigenvalues.
Small li  small variance  data change little in direction Yi
PCA minimizes C matrix reconstruction errors:
Zi vectors for large li are sufficient to get:
ZΛZT
because vectors for small eigenvalues will have very
small contribution to the covariance matrix.
 CX
19
Two components for visualization
Diagonalization methods: see Numerical Recipes, www.nr.com
New coordinate system:
axis ordered according to variance
= size of the eigenvalue.
First k dimensions account for
k
Vk 

i 1
i
d

i 1
i
fraction of all variance (please note that li are variances);
frequently 80-90% is sufficient for rough description.
20
Solving for Eigenvalues & Eigenvectors

Vectors x having same direction as Ax are called eigenvectors
of A (A is an n by n matrix).

In the equation Ax=x,  is called an eigenvalue of A.

Ax=x  (A-I)x=0

How to calculate x and :



Calculate det(A-I), yields a polynomial (degree n)
Determine roots to det(A-I)=0, roots are eigenvalues 
Solve (A- I) x=0 for each  to obtain eigenvectors x
21
PCA properties
PC Analysis (PCA) may be achieved by:

transformation making covariance matrix diagonal

projecting the data on a line for which the sums of squares of distances
from original points to projections is minimal.

orthogonal transformation to new variables that have stationary variances
True covariance matrices are usually not known, estimated from data.
This works well on single-cluster data; more complex structure may require
local PCA, separately for each cluster.
PC is useful for: finding new, more informative, uncorrelated features;
reducing dimensionality: reject low variance features,
reconstructing covariance matrices from low-dim data.
22
PCA Wisconsin example
Wisconsin Breast Cancer data:

Collected at the University of Wisconsin Hospitals, USA.

699 cases, 458 (65.5%) benign (red), 241 malignant (green).

9 features: quantized 1, 2 .. 10, cell properties, ex:
Clump Thickness, Uniformity of Cell Size, Shape, Marginal
Adhesion, Single Epithelial Cell Size, Bare Nuclei,
Bland Chromatin, Normal
Nucleoli, Mitoses.
2D scatterograms do not show
any structure no matter which
subspaces are taken!
23
Example cont.
PC gives useful information
already in 2D.
Taking first PCA component of
the standardized data:
If (Y1>0.41) then benign
else malignant
18 errors/699 cases = 97.4%
Transformed vectors are not
standardized, std’s are below.
Eigenvalues converge
slowly, but classes are
separated well.
24
PCA disadvantages
Useful for dimensionality reduction but:

Largest variance determines which components are used, but does not
guarantee interesting viewpoint for clustering data.

The meaning of features is lost when linear combinations are formed.
Analysis of coefficients in Z1 and other important eigenvectors may show
which original features are given much weight.
PCA may be also done in an efficient way by performing singular value
decomposition of the standardized data matrix.
PCA is also called Karhuen-Loève transformation.
Many variants of PCA are described in A. Webb, Statistical pattern
recognition, J. Wiley 2002, good review in Duda and Hart (1973).
25
Exercise (part 1, Updated Mar 10)

How would you calculate efficiently the PCA of
data where the dimensionality d is much larger
than the number of vector observations n?

Download the Wisconsin Data from the UC
Irvine repository, extract PCAs from the data,
test scatter plots of original data and after
projecting onto the principal components, plot
Eigen values
26
Unsupervised learning
The Hebb rule – Neurons that fire together wire •
together.
PCA•
RF development with PCA•
Classical Conditioning
Classical Conditioning and Hebb’s rule
Ear
A
Nose
B
Tongue
“When an axon in cell A is near enough to excite cell B and
repeatedly and persistently takes part in firing it, some
growth process or metabolic change takes place in one
or both cells such that A’s efficacy in firing B is increased”
D. O. Hebb (1949)
The generalized Hebb rule:
dw i
 ηx iy
dt
where xi are the inputs
and y the output is assumed linear:
y   w jx j
j
Results in 2D
Example of Hebb in 2D
2
=/3
m
w
1
x2
0
-1
-2
-2
-1
x1
0
1
2
(Note: here inputs have a mean of zero)
In the simplest case, the change in synaptic weight w
is:
wi  xi y
where x are input vectors and y is the neural response.
Assume for simplicity a linear neuron:
So we get:
y
w
j
j
xj


wi     xi x j w j 
j

 to the distribution
Now take an average with respect
of inputs, get:


E[wi ]     E[ xi x j ]w j     Qij w j
j
 j

If a small change Δw occurs over a short time Δt
then:
(in matrix notation)
dw
w

 Qw
t
dt
If <x>=0 , Q is the covariance function.
What is then the solution of this simple first order
linear ODE ?
(Show on board)
Mathematics of the generalized Hebb rule
The change in synaptic weight w is:
wi   ( xi  x0 )( y  y0 )
where x are input vectors and y is the neural
response:
y   wj x j
j
Assume for simplicity a linear neuron:
So we get:


wi     xi x j w j  y0 xi  x0  x j w j  y0 x0 
j
 j

Taking an average of the distribution of inputs


E[wi ]     E[ xi x j ]w j  y0 E[ xi ]  x0  E[ x j ]w j  y0 x0 
j
 j

And using E[ xi x j ]  Qij and
E[ xi ]  
We obtain


E[wi ]     Qij w j  y0   x0  w j  y0 x0 
j
 j

In matrix form


E[w ]   Q  x0 J   y0 (   x0 )  Q  k2 J w  e k1
Where J is a matrix of ones, e is a vector in direction
(1,1,1 … 1), k1  y0 (   x0 ) and k2  x0 
or
^
E[w ]  Q w  e k1
Where
'
Q'  Q  k 2J
^
The equation therefore has the form
dw
dt
 [Q w  k1eˆ]
'
If k1 is not zero, this has a fixed point, however it is
usually not stable.
If k1=0 then have:
dw
dt
 Q w
'
The Hebb rule is unstable – how can it be
stabilized while preserving its properties?
The stabilized Hebb (Oja) rule.
wi' (t  1)  ( wi' (t )  wi ) / ( w  Δw)2
Where:
Normalize
Appoximate to first order in η:
wi' (t  1)  wi' (t )  wi  wi (t ) w j w j (t )
Now insert wi  xi y
Get:
j
wi' (t  1)  wi' (t )  xi y  wi' (t )x j yw'j (t )
j
 wi' (t )  xi y  wi' (t )y  x j w'j
}
j
y
Therefore
w  w (t  1)  w (t )   ( xi y  w (t ) y )
'
i
'
i
'
i
'
i
The Oja rule therefore has the form:
dwi
2
  ( xi y  wi y )
dy
2


dwi
2
  ( xi y  wi y )    xi  xk wk  wi  x j xk w j wk 
dt
j ,k
 k

dwi
dt

    wk xk xi
 k
 wi 
j ,k

xk x j wk w j 

In matrix form:
dw
  Qw  (w TQw)w 
dt
Average
Using this rule the weight vector converges to
the eigen-vector of Q with the highest eigen-value. It is
often called a principal component or PCA rule.
The exact dynamics of the Oja rule have been solved by •
Wyatt and Elfaldel 1995
Variants of networks that extract several principal •
components have been proposed (e.g: Sanger 1989)
2 skewed distributions
PCA transformation for 2D data:
First component will be chosen along the
largest variance line, both clusters will
strongly overlap, no interesting structure
will be visible.
In fact projection to orthogonal axis to the
first PCA component has much more
discriminating power.
Discriminant coordinates should be used
to reveal class structure.
42
Projection Pursuit
43
Projection Pursuit (PP)
PCA and FDA are linear, PP may be linear or non-linear.
Find interesting “criterion of fit”, or “figure of merit” function,
that allows for low-dim (usually 2D or 3D) projection.
Y( j )T  Y1( j ) , Y2( j )   f  X( j ) ; W  ;
I (Y; W)  I  f  X; W  
General transformation with
parameters W.
Index of “interestingness”
Interesting indices may use a priori knowledge about the problem:
1.
2.
3.
mean nearest neighbor distance – increase clustering of Y(j)
maximize mutual information between classes and features
find projection that have non-Gaussian distributions.
The last index does not use a priori knowledge; it leads to the Independent
Component Analysis (ICA).
44
ICA features are not only uncorrelated, but also independent.
Kurtosis
ICA is a special version of PP, recently very popular.
Gaussian distributions of variable Y are characterized by 2 parameters:
mean value:
Y  E{Y }
variance:
 Y2  E{Y  E (Y )}2
These are the first 2 moments of distribution; all higher are 0 for G(Y).
One simple measure of non-Gaussianity of projections is the
4-th moment (cumulant) of the distribution, called kurtosis, measures
“skewedness” of the distribution. For E{Y}=0 kurtosis is:
 4 Y   E Y
4
  3 E Y 
2
2
Super-Gaussian distribution: long tail, peak at zero,
k4(y)>0, like binary image data.
sub-Gaussian distribution is more flat and has k4(y)<0,
like speech signal data.
45
Correlation and independence
Variables are statistically independent if their joint probability distribution is
a product of probabilities for all variables:
p  X1, X 2
n
X n    pi  X i 
i 1
Features Yi, Yj are uncorrelated if covariance is diagonal, or:
E YY
i j   E Yi  E Y j 
Uncorrelated features are orthogonal.
Statistically independent features Yi, Yj for any functions give:




E f1 Yi  f 2 Y j   E  f1 Yi  E f 2 Y j 
This is much stronger condition than correlation; in particular the functions
may be powers of variables; any non-Gaussian distribution after PCA
46
transformation will still have correlated features.
PP/ICA example
Example: PCA and PP based on maximal kurtosis: note
nice separation of the blue class.
47
Some remarks




Many formulations of PP and ICA methods exist.
PP is used for data visualization and dimensionality reduction.
Nonlinear projections are frequently considered, but solutions are more
numerically intensive.
PCA may also be viewed as PP, max (for standardized data):

W  arg max E  W X 
(1)
W 1
T
2

Index I(Y;W) is based here on
maximum variance.
Other components are found in the space orthogonal to W1T X
W(k )
2


k 1





 arg max E  W T   I   W (i ) W T( i )  X   
W 1
i 1
   


Same index is used, with projection on space orthogonal to k-1 PCs.
48
How do we find multiple
Projections


Statistical approach is complicated:

Perform a transformation on the data to eliminate
structure in the already found direction

Then perform PP again
Neural Comp approach: Lateral Inhibition
49
How do we find multiple
Projections – Visual Approach
High Dimensional Data
Dimension Reduction
Feature Extraction
Visualisation
Classification
Analysis
50
Projection Pursuit
An automated procedure that seeks interesting low
dimensional projections of a high dimensional cloud by
numerically maximizing an objective function or projection
index.
Huber, 1985
51
Projection Pursuit is needed






Curse of dimensionality
Less Robustness
worse mean squared error
greater computational cost
slower convergence to limiting distributions
…
Required number of labelled samples increases
with dimensionality.
52
What is an interesting projection
In general: the projection that reveals more
information about the structure.
In pattern recognition:
a projection that maximises class
separability in a low dimensional
subspace.
53
Projection Pursuit
Dimensional Reduction
Find lower-dimensional projections of a high-dimensional point cloud to
facilitate classification.
Exploratory Projection Pursuit
Reduce the dimension of the problem to facilitate visualization.
54
Projection Pursuit
How many dimensions to use
 for visualization
 for classification/analysis
Which Projection Index to use
 measure of variation (Principal Components)
 departure from normality (negative entropy)
 class separability(distance, Bhattacharyya,
Mahalanobis, ...)
 …
55
Projection Pursuit
Which optimization method to choose
We are trying to find the global optimum among local ones


hill climbing methods (simulated annealing)
regular optimization routines with random starting points.
56
ICA demos


ICA has many applications in signal and image analysis.
Finding independent signal sources allows for separation of signals from
different sources, removal of noise or artifacts.
Observations X are a linear mixture W of unknown sources Y
X  WT Y
Both W and Y are unknown! This is a blind separation problem.
How can they be found?
If Y are Independent Components and W linear mixing the problem is
similar to FDA or PCA, only the criterion function is different.
Play with ICALab PCA/ICA Matlab software for signal/image analysis:
http://www.bsp.brain.riken.go.jp/page7.html
57
ICA demo: images & audio
Example from Cichocki’s lab,
http://www.bsp.brain.riken.go.jp/page7.html
X space for images:
take intensity of all pixels  one vector per image, or
take smaller patches (ex: 64x64), increasing # vectors
 5 images: originals, mixed, convergence of ICA iterations
X space for signals:
sample the signal for some time Dt
 10 songs: mixed samples and separated samples
58
Self-organization
PCA, FDA, ICA, PP are all inspired by statistics, although some neuralinspired methods have been proposed to find interesting solutions,
especially for their non-linear versions.

Brains learn to discover the structure of signals: visual, tactile, olfactory,
auditory (speech and sounds).

This is a good example of unsupervised learning: spontaneous development
of feature detectors, compressing internal information that is needed to
model environmental states (inputs).

Some simple stimuli lead to complex behavioral patterns in animals;
brains use specialized microcircuits to derive vital information from signals
– for example, amygdala nuclei in rats sensitive to ultrasound signals
signifying “cat around”.
59
Models of self-organizaiton
SOM or SOFM (Self-Organized Feature Mapping) – self-organizing
feature map, one of the simplest models.
How can such maps develop spontaneously?
Local neural connections: neurons interact strongly with those nearby, but
weakly with those that are far (in addition inhibiting some intermediate
neurons).
History:
von der Malsburg and Willshaw (1976), competitive learning, Hebb
mechanisms, „Mexican hat” interactions, models of visual systems.
Amari (1980) – models of continuous neural tissue.
Kohonen (1981) - simplification, no inhibition; leaving two essential
factors: competition and cooperation.
60
Download