CSCI 4390/6390 Database Mining
Project 2
Instructor: Wei Liu
Due on Dec. 5th

Problem 1. Dimensionality Reduction

Use Fisher Linear Discriminant Analysis (LDA) to perform supervised dimensionality reduction, and then implement face recognition with the LDA-reduced data and a 1-Nearest Neighbor (1NN) classifier. First of all, see the reference paper http://www.ee.cuhk.edu.hk/~xgwang/papers/wangTiccv03_unified.pdf to learn how to compute the within-class scatter matrix S_w and the between-class scatter matrix S_b. Suppose that the original data dimensionality is d, the total number of training samples is N, and the number of classes is C. Then we know

    rank(S_w) ≤ min(d, N − C),   rank(S_b) ≤ min(d, C − 1).    (1)

If the data is in high dimensions such that d > N, then LDA will encounter the singularity issue of the matrix S_w. To overcome this computational issue, a commonly used LDA solver applies PCA, as a preprocessing step, to reduce the data from high dimensions to lower dimensions, so that the reduced within-class scatter matrix S_w in the low-dimensional space becomes invertible. We give a practical LDA algorithm (similar to the unified subspace algorithm in the reference paper) as follows:

1. Run PCA to reduce the N input data samples from dimensionality d to d_1, where C − 1 ≤ d_1 ≤ N − C. We obtain the PCA projection matrix U ∈ R^{d×d_1} that consists of d_1 principal components.
2. Compute the U-reduced within-class scatter matrix S'_w = U^T S_w U ∈ R^{d_1×d_1} and between-class scatter matrix S'_b = U^T S_b U ∈ R^{d_1×d_1}.
3. Do the eigen-decomposition over S'_w, achieving S'_w = V_1 Σ_1 V_1^T (V_1 contains all eigenvectors of S'_w in its columns, and Σ_1 includes all positive eigenvalues on its diagonal).
4. Further reduce the between-class scatter matrix to S''_b = Σ_1^{−1/2} V_1^T S'_b V_1 Σ_1^{−1/2} ∈ R^{d_1×d_1}.
5. Do the eigen-decomposition over S''_b, achieving S''_b = V_2 Σ_2 V_2^T (V_2 contains all eigenvectors of S''_b in its columns, and Σ_2 includes all positive and zero eigenvalues on its diagonal). Extract the truncated matrix V̄_2 ∈ R^{d_1×(C−1)}, whose columns are the C − 1 eigenvectors corresponding to the C − 1 positive eigenvalues.
6. Compute the overall LDA projection matrix as W = U V_1 Σ_1^{−1/2} V̄_2 ∈ R^{d×(C−1)}.

You are given a fully labeled Yale face dataset of 165 data samples with 4,096 dimensions from 15 individuals. For each individual, there are 11 samples, of which 6 are for training and the remaining 5 are for testing. You are required to report the mean recognition rate over 20 trials. In each trial you need to 1) uniformly at random sample 6 samples per individual to construct a training dataset, 2) run LDA on the training dataset to obtain the projection matrix W, 3) project (i.e., reduce the dimensionality of) all training and test samples onto W, and 4) run the 1NN classifier over the W-reduced data to compute the recognition rate (i.e., the proportion of correctly classified samples) on the test samples. Finally, the mean recognition rate is the mean of the 20 recognition rates from the 20 training/test splitting trials.

There is one parameter: the PCA dimension d_1. You are encouraged to try its possible values in the range [C − 1, N − C] and find the value at which the highest mean recognition rate is reached. Plotting a figure of mean recognition rate vs. d_1 would be ideal. All data information is wrapped into a 'face dimreduce.mat' file in Matlab. Separate data files, including 'fea.txt' (all data samples, each row representing a sample) and 'gnd.txt' (labels, 1 to 15, in the order of the data samples), are also provided.
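For concreteness, here is a minimal NumPy sketch of the six-step LDA procedure above together with the 20-trial 1NN evaluation protocol. It assumes the whitespace-delimited 'fea.txt' and 'gnd.txt' files; the function names and the example PCA dimension d_1 = 40 are illustrative choices only, and any d_1 in [C − 1, N − C] can be substituted.

```python
# A minimal sketch, assuming NumPy and the whitespace-delimited 'fea.txt'/'gnd.txt' files;
# function names and the example value of d1 are illustrative, not prescribed by the handout.
import numpy as np

def scatter_matrices(X, y):
    """Within-class (Sw) and between-class (Sb) scatter matrices; rows of X are samples."""
    mean_all = X.mean(axis=0)
    dim = X.shape[1]
    Sw, Sb = np.zeros((dim, dim)), np.zeros((dim, dim))
    for c in np.unique(y):
        Xc = X[y == c]
        diff = Xc - Xc.mean(axis=0)
        Sw += diff.T @ diff
        mdiff = (Xc.mean(axis=0) - mean_all).reshape(-1, 1)
        Sb += Xc.shape[0] * (mdiff @ mdiff.T)
    return Sw, Sb

def lda_projection(X, y, d1):
    """Overall projection matrix W (d x (C-1)) following steps 1-6."""
    C = np.unique(y).size
    # Step 1: PCA via SVD of the centered data; U holds the top-d1 principal components.
    _, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
    U = Vt[:d1].T                                          # d x d1
    # Step 2: U-reduced scatter matrices (projecting first, then computing scatter, equals U^T S U).
    Sw1, Sb1 = scatter_matrices(X @ U, y)
    # Step 3: eigen-decomposition of S'_w; keep the positive eigenvalues and their eigenvectors.
    w1, V1 = np.linalg.eigh(Sw1)
    keep = w1 > 1e-10
    w1, V1 = w1[keep], V1[:, keep]
    # Step 4: whitened between-class scatter S''_b = Sigma1^{-1/2} V1^T S'_b V1 Sigma1^{-1/2}.
    Sb2 = np.diag(w1 ** -0.5) @ V1.T @ Sb1 @ V1 @ np.diag(w1 ** -0.5)
    # Step 5: eigen-decomposition of S''_b; keep the C-1 eigenvectors with the largest eigenvalues.
    w2, V2 = np.linalg.eigh(Sb2)
    V2bar = V2[:, np.argsort(w2)[::-1][:C - 1]]
    # Step 6: overall LDA projection matrix W = U V1 Sigma1^{-1/2} V2bar.
    return U @ V1 @ np.diag(w1 ** -0.5) @ V2bar            # d x (C-1)

def one_nn_rate(Ztr, ytr, Zte, yte):
    """Recognition rate of the 1NN classifier in the W-reduced space."""
    dist = ((Zte[:, None, :] - Ztr[None, :, :]) ** 2).sum(-1)
    return (ytr[dist.argmin(axis=1)] == yte).mean()

if __name__ == "__main__":
    X = np.loadtxt("fea.txt")                              # 165 x 4096
    y = np.loadtxt("gnd.txt").astype(int)                  # labels 1..15
    d1 = 40                                                # example PCA dimension in [C-1, N-C]
    rng = np.random.default_rng(0)
    rates = []
    for _ in range(20):                                    # 20 random training/test splits
        tr = np.zeros(len(y), dtype=bool)
        for c in np.unique(y):                             # 6 training samples per individual
            tr[rng.choice(np.flatnonzero(y == c), size=6, replace=False)] = True
        W = lda_projection(X[tr], y[tr], d1)
        rates.append(one_nn_rate(X[tr] @ W, y[tr], X[~tr] @ W, y[~tr]))
    print("mean recognition rate:", np.mean(rates))
```

Sweeping d_1 over [C − 1, N − C] and recording the mean recognition rate at each value yields the data for the requested plot.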
Problem 2. Clustering

Use Spectral Clustering to cluster object images. See the reference paper http://ai.stanford.edu/~ang/papers/nips01-spectral.pdf and implement the spectral clustering algorithm (on Page 2) described in this paper. You are given an object image dataset of 1,440 data samples with 1,024 dimensions from 20 object categories. In each object category, there are 72 samples. You are required to cluster these samples into 20 clusters and report the Normalized Mutual Information (NMI) measure.

Let {S_1, S_2, ..., S_C} be the true clusters of a dataset {x_k}_{k=1}^N of N samples, where each cluster S_i has |S_i| = n_i samples (i = 1, ..., C). Let {S'_1, S'_2, ..., S'_C} be your clustering result on this dataset, where each cluster S'_i has |S'_i| = n'_i samples (i = 1, ..., C). Then the NMI measure of your clustering result is computed as

    NMI({S_1, S_2, ..., S_C}, {S'_1, S'_2, ..., S'_C}) = [ Σ_{i=1}^{C} Σ_{j=1}^{C} n_{ij} log( N·n_{ij} / (n_i·n'_j) ) ] / sqrt( [ Σ_{i=1}^{C} n_i log(n_i/N) ] · [ Σ_{j=1}^{C} n'_j log(n'_j/N) ] ),    (2)

where n_{ij} = |S_i ∩ S'_j|. The larger the NMI measure, the better the achieved clustering quality.

There is one parameter: the Gaussian RBF kernel width σ, which is used in computing the affinity matrix A whose (i, j)-th entry is A_{ij} = exp(−‖x_i − x_j‖² / (2σ²)). You are encouraged to try σ's possible values in the range [min_{1≤i≠j≤N} ‖x_i − x_j‖, max_{1≤i,j≤N} ‖x_i − x_j‖] and find the value at which the largest NMI is reached. Plotting a figure of NMI vs. σ would be ideal. All data information is wrapped into an 'object cluster.mat' file in Matlab. Separate data files, including 'fea.txt' (all data samples, each row representing a sample) and 'gnd.txt' (labels, 1 to 20, in the order of the data samples), are also provided.
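For concreteness, here is a minimal NumPy sketch of the spectral clustering algorithm on Page 2 of the reference paper and the NMI measure of Eq. (2), assuming the 'fea.txt' and 'gnd.txt' files. The k-means step uses scikit-learn's KMeans (k-means++ initialization) in place of the paper's own initialization scheme, and the median pairwise distance is only an example choice of σ inside the suggested search range.

```python
# A minimal sketch, assuming NumPy, scikit-learn, and the 'fea.txt'/'gnd.txt' files; KMeans with
# k-means++ initialization replaces the paper's own k-means initialization, and the example sigma
# (median pairwise distance) is just one value inside the suggested search range.
import numpy as np
from sklearn.cluster import KMeans

def pairwise_sq_dists(X):
    """Squared Euclidean distances between all rows of X, without forming a 3-D tensor."""
    sq_norms = (X ** 2).sum(axis=1)
    sq = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    return np.maximum(sq, 0.0)                       # clip tiny negatives from round-off

def spectral_clustering(sq_dists, k, sigma, seed=0):
    """Spectral clustering (algorithm on Page 2 of the reference paper) into k clusters."""
    # Affinity matrix with zero diagonal: A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Normalized affinity L = D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix of A.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Stack the k largest eigenvectors of L as columns, then renormalize each row to unit length.
    w, V = np.linalg.eigh(L)
    Y = V[:, np.argsort(w)[::-1][:k]]
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    # K-means on the rows of Y; sample i gets the cluster of row i.
    return KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(Y)

def nmi(true_labels, pred_labels):
    """NMI measure of Eq. (2), with n_ij the number of samples shared by S_i and S'_j."""
    n = len(true_labels)
    t_ids, p_ids = np.unique(true_labels), np.unique(pred_labels)
    n_ij = np.array([[np.sum((true_labels == t) & (pred_labels == p)) for p in p_ids]
                     for t in t_ids], dtype=float)
    n_i, n_j = n_ij.sum(axis=1), n_ij.sum(axis=0)
    nz = n_ij > 0
    num = (n_ij[nz] * np.log(n * n_ij[nz] / np.outer(n_i, n_j)[nz])).sum()
    den = np.sqrt((n_i * np.log(n_i / n)).sum() * (n_j * np.log(n_j / n)).sum())
    return num / den

if __name__ == "__main__":
    X = np.loadtxt("fea.txt")                        # 1440 x 1024
    y = np.loadtxt("gnd.txt").astype(int)            # labels 1..20
    sq = pairwise_sq_dists(X)
    sigma = np.median(np.sqrt(sq[sq > 0]))           # example sigma; tune over the suggested range
    labels = spectral_clustering(sq, k=20, sigma=sigma)
    print("NMI:", nmi(y, labels))
```

Sweeping σ over the suggested interval and recording the NMI at each value yields the data for the requested plot.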