CSCI 4390/6390 Database Mining
Project 2
Instructor: Wei Liu
Due on Dec. 5th
Problem 1. Dimensionality Reduction
Use Fisher Linear Discriminant Analysis (LDA) to perform supervised dimensionality reduction, and then
implement face recognition on the LDA-reduced data with a 1-Nearest-Neighbor (1NN) classifier.
Above all, see the reference paper http://www.ee.cuhk.edu.hk/~xgwang/papers/wangTiccv03_unified.pdf
to learn how to compute the within-class scatter matrix Sw and the between-class scatter matrix Sb.
Suppose that the original data dimensionality is d, the total number of training samples is N, and the number
of classes is C. Then we know

rank(Sw) ≤ min(d, N − C),    rank(Sb) ≤ min(d, C − 1).    (1)
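For concreteness, the scatter matrices can be assembled from the per-class and global means. The following is a minimal NumPy sketch using the standard definitions; the function name scatter_matrices and the arguments X (one sample per row) and y (class labels) are illustrative choices, not names from the assignment.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices.

    X: (N, d) array, one training sample per row; y: length-N class labels.
    Sw sums (x - mu_c)(x - mu_c)^T over the samples x of every class c,
    and Sb sums n_c (mu_c - mu)(mu_c - mu)^T over the classes.
    """
    d = X.shape[1]
    mu = X.mean(axis=0)                      # global mean
    Sw = np.zeros((d, d))
    Sb = np.zeros((d, d))
    for c in np.unique(y):
        Xc = X[y == c]                       # samples of class c
        mu_c = Xc.mean(axis=0)
        diff = Xc - mu_c
        Sw += diff.T @ diff                  # within-class contribution
        m = (mu_c - mu).reshape(-1, 1)
        Sb += Xc.shape[0] * (m @ m.T)        # between-class contribution
    return Sw, Sb
```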
If the data is high-dimensional with d > N, then LDA will encounter the singularity issue of the matrix
Sw. To overcome this computational issue, a commonly used LDA solver applies PCA as a preprocessing step
to reduce the data from high dimensions to lower dimensions, so that the within-class scatter matrix of the
reduced data becomes invertible.
We give a practical LDA algorithm (similar to the unified subspace algorithm in the reference paper) as follows:
1. Run PCA to reduce the N input data samples from dimensionality d to d1, where C − 1 ≤ d1 ≤ N − C. We obtain the PCA projection matrix U ∈ R^{d×d1} that consists of d1 principal components.
2. Compute the U-reduced within-class scatter matrix S′w = Uᵀ Sw U ∈ R^{d1×d1} and the U-reduced between-class scatter matrix S′b = Uᵀ Sb U ∈ R^{d1×d1}.
3. Do the eigen-decomposition over S′w, obtaining S′w = V1 Σ1 V1ᵀ (V1 contains all eigenvectors of S′w in its columns, and Σ1 includes all positive eigenvalues on its diagonal).
4. Further reduce the between-class scatter matrix to S″b = Σ1^{−1/2} V1ᵀ S′b V1 Σ1^{−1/2} ∈ R^{d1×d1}.
5. Do the eigen-decomposition over S″b, obtaining S″b = V2 Σ2 V2ᵀ (V2 contains all eigenvectors of S″b in its columns, and Σ2 includes all positive and zero eigenvalues on its diagonal). Extract the truncated matrix V̄2 ∈ R^{d1×(C−1)}, whose columns are the C − 1 eigenvectors corresponding to the C − 1 positive eigenvalues.
6. Compute the overall LDA projection matrix as W = U V1 Σ1^{−1/2} V̄2 ∈ R^{d×(C−1)}.
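Read literally, the six steps above translate almost line-for-line into NumPy. The sketch below is one possible rendering under the assumption that eigenvalues are filtered with a small relative tolerance; it reuses the scatter_matrices helper sketched earlier and is not a definitive implementation.

```python
import numpy as np

def lda_projection(X, y, d1):
    """Practical LDA projection W following steps 1-6 above.

    X: (N, d) training samples; y: class labels; d1: PCA dimension with
    C - 1 <= d1 <= N - C.  Returns W of shape (d, C - 1).
    """
    C = len(np.unique(y))
    # Step 1: PCA of the centered data; U holds the top d1 principal components.
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    U = Vt[:d1].T                                    # (d, d1)
    # Step 2: U-reduced scatter matrices (scatter_matrices is sketched earlier).
    Sw, Sb = scatter_matrices(X, y)
    Sw1, Sb1 = U.T @ Sw @ U, U.T @ Sb @ U
    # Step 3: eigen-decomposition of S'w, keeping only positive eigenvalues.
    w1, V1 = np.linalg.eigh(Sw1)
    keep = w1 > w1.max() * 1e-10
    w1, V1 = w1[keep], V1[:, keep]
    # Step 4: S''b = Sigma1^{-1/2} V1^T S'b V1 Sigma1^{-1/2}.
    M = V1 * (w1 ** -0.5)                            # V1 Sigma1^{-1/2}
    Sb2 = M.T @ Sb1 @ M
    # Step 5: the C - 1 eigenvectors of S''b with the largest (positive) eigenvalues.
    w2, V2 = np.linalg.eigh(Sb2)
    V2bar = V2[:, np.argsort(w2)[::-1][:C - 1]]
    # Step 6: overall LDA projection matrix, shape (d, C - 1).
    return U @ M @ V2bar
```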
You are given a fully labeled Yale face dataset of 165 data samples with 4,096 dimensions from 15 individuals.
For each individual, there are 11 samples, of which 6 are used for training and the remaining 5 for testing.
You are required to report the mean recognition rate over 20 trials, in each of which you need to 1) uniformly
randomly sample 6 samples per individual to construct a training dataset, 2) run LDA on the training dataset to
obtain the projection matrix W, 3) project (i.e., reduce the dimensionality) all training and test samples onto W,
and 4) run the 1NN classifier over the W reduced data to compute the recognition rate (i.e., correctly classified
sample proportion) on the test samples. Finally, the mean recognition rate is the mean of 20 recognition rates from
20 training/test split trials.
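One trial of this protocol might be coded as below. This is a sketch, assuming the lda_projection helper from the sketch above and that X holds the 165 samples row-wise with y the corresponding labels; the function names one_nn and one_trial are made up for illustration.

```python
import numpy as np

def one_nn(train_X, train_y, test_X):
    """Label each test sample with the label of its nearest training sample."""
    d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=-1)
    return train_y[d2.argmin(axis=1)]

def one_trial(X, y, d1, n_train=6, rng=None):
    """One random 6/5 split per individual: fit LDA on the training data, test with 1NN."""
    rng = np.random.default_rng() if rng is None else rng
    train_idx = np.hstack([rng.choice(np.where(y == c)[0], n_train, replace=False)
                           for c in np.unique(y)])
    mask = np.zeros(len(y), dtype=bool)
    mask[train_idx] = True
    W = lda_projection(X[mask], y[mask], d1)     # LDA on the training split only
    Ztr, Zte = X[mask] @ W, X[~mask] @ W         # project both splits onto W
    pred = one_nn(Ztr, y[mask], Zte)
    return (pred == y[~mask]).mean()             # recognition rate on the test split

# mean recognition rate over 20 random training/test splits for a chosen d1:
# rates = [one_trial(X, y, d1=30) for _ in range(20)]   # d1 = 30 is arbitrary
# print(np.mean(rates))
```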
There is one parameter: the PCA dimension d1. You are encouraged to try its possible values in the range
[C − 1, N − C] and find the value at which the highest mean recognition rate is reached. Plotting a figure of
mean recognition rate vs. d1 would be ideal.
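A simple way to carry out this search, assuming the one_trial sketch above, is to loop d1 over the range and record the mean rate for each value; the loop below is only a skeleton.

```python
import numpy as np

C, N = 15, 90                                 # 15 individuals, 6 training samples each
d1_values = np.arange(C - 1, N - C + 1)       # candidate PCA dimensions in [C-1, N-C]
# mean_rates = [np.mean([one_trial(X, y, d1=d1) for _ in range(20)])
#               for d1 in d1_values]
# best_d1 = d1_values[np.argmax(mean_rates)]  # then plot mean_rates vs. d1_values
```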
All data information is wrapped into a ‘face dimreduce.mat’ file in Matlab. Separate data files, ‘fea.txt’ (all data samples, each row represents a sample) and ‘gnd.txt’ (labels, 1 to 15, in the order of the data
samples), are also provided.
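If you work outside Matlab, the text files can be loaded directly; the variable names X and y below are just local choices.

```python
import numpy as np

X = np.loadtxt('fea.txt')               # 165 samples x 4,096 dimensions, one per row
y = np.loadtxt('gnd.txt').astype(int)   # labels 1 to 15, one per sample
# (scipy.io.loadmat can read the .mat bundle instead; inspect its keys for the
#  names of the stored variables.)
```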
Problem 2. Clustering
Use Spectral Clustering to cluster object images.
See the reference paper http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf and implement the spectral clustering algorithm described on page 2 of that paper.
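A compact rendering of that algorithm (affinity matrix with zero diagonal, symmetrically normalized affinity D^{−1/2} A D^{−1/2}, top-k eigenvectors, row normalization, then k-means on the rows) might look like the sketch below; the use of scikit-learn's KMeans for the final step is a convenience choice, not a requirement.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma):
    """Spectral clustering in the style of the referenced algorithm (a sketch)."""
    # 1. Gaussian RBF affinity matrix with zero diagonal.
    d2 = cdist(X, X, 'sqeuclidean')
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # 2. Symmetric normalization L = D^{-1/2} A D^{-1/2}.
    dinv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * dinv_sqrt[:, None] * dinv_sqrt[None, :]
    # 3. Stack the k largest eigenvectors of L as columns.
    w, V = np.linalg.eigh(L)
    V = V[:, np.argsort(w)[::-1][:k]]
    # 4. Renormalize each row to unit length.
    Y = V / np.linalg.norm(V, axis=1, keepdims=True)
    # 5. Cluster the rows of Y with k-means; row i's cluster is sample i's cluster.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(Y)
```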
You are given an object image dataset of 1,440 data samples with 1,024 dimensions from 20 object categories.
In each object category, there are 72 samples. You are required to cluster these samples into 20 clusters and report
the Normalized Mutual Information (NMI) measure.
Let {S1, S2, · · · , SC} be the true clusters of a dataset {x1, · · · , xN} of N samples, where each cluster Si has |Si| = ni
samples (i = 1, · · · , C). Let {S′1, S′2, · · · , S′C} be your clustering result on this dataset, where each cluster S′i has
|S′i| = n′i samples (i = 1, · · · , C). Then the NMI measure of your clustering result is computed as

NMI({S1, S2, · · · , SC}, {S′1, S′2, · · · , S′C}) = ( Σi Σj nij log( N nij / (ni n′j) ) ) / sqrt( ( Σi ni log(ni / N) ) ( Σj n′j log(n′j / N) ) ),    (2)

where the sums run over i, j = 1, · · · , C and nij = |Si ∩ S′j|. The larger the NMI measure, the better the achieved clustering quality.
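As a sanity check on Eq. (2), a direct (unoptimized) NMI computation might look like the sketch below; nmi is a hypothetical helper taking the ground-truth labels and your predicted cluster assignments as integer arrays.

```python
import numpy as np

def nmi(true_labels, pred_labels):
    """NMI of two partitions, computed directly from Eq. (2)."""
    true_labels, pred_labels = np.asarray(true_labels), np.asarray(pred_labels)
    N = len(true_labels)
    classes, clusters = np.unique(true_labels), np.unique(pred_labels)
    # contingency counts nij = |Si ∩ S'j|
    n_ij = np.array([[np.sum((true_labels == c) & (pred_labels == k))
                      for k in clusters] for c in classes], dtype=float)
    n_i = n_ij.sum(axis=1)    # true cluster sizes
    n_j = n_ij.sum(axis=0)    # predicted cluster sizes
    num = 0.0
    for i in range(len(classes)):
        for j in range(len(clusters)):
            if n_ij[i, j] > 0:
                num += n_ij[i, j] * np.log(N * n_ij[i, j] / (n_i[i] * n_j[j]))
    den = np.sqrt(np.sum(n_i * np.log(n_i / N)) * np.sum(n_j * np.log(n_j / N)))
    return num / den
```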
There is one parameter: the Gaussian RBF kernel width σ, which is used in computing the affinity matrix A
with its (i, j)-th entry being Aij = exp(−‖xi − xj‖² / (2σ²)). You are encouraged to try σ's possible values in the
range [min_{1≤i≠j≤N} ‖xi − xj‖, max_{1≤i,j≤N} ‖xi − xj‖] and find the value at which the largest NMI is reached.
Plotting a figure of NMI vs. σ would be ideal.
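One simple way to run this search, assuming the spectral_clustering and nmi sketches above and that X and y hold the loaded samples and labels, is to sweep σ over the observed pairwise-distance range; the number of candidate values below is arbitrary.

```python
import numpy as np
from scipy.spatial.distance import pdist

dists = pdist(X)                                     # pairwise Euclidean distances (i ≠ j)
sigmas = np.linspace(dists.min(), dists.max(), 20)   # 20 candidate widths, an arbitrary grid
# nmis = [nmi(y, spectral_clustering(X, k=20, sigma=s)) for s in sigmas]
# best_sigma = sigmas[np.argmax(nmis)]               # then plot nmis vs. sigmas
```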
All data information is wrapped into an ‘object cluster.mat’ file in Matlab. Separate data files, ‘fea.txt’ (all data samples, each row represents a sample) and ‘gnd.txt’ (labels, 1 to 20, in the order of the data samples),
are also provided.