G54DMT – Data Mining Techniques and Applications
http://www.cs.nott.ac.uk/~jqb/G54DMT
Dr. Jaume Bacardit
jqb@cs.nott.ac.uk

Topic 2: Data Preprocessing
Lecture 2: Dimensionality Reduction and Imbalanced Classification

Outline of the lecture
• Dimensionality Reduction
  – Definition and taxonomy
  – Linear Methods
  – Non-Linear Methods
• Imbalanced Classification
  – Definition and taxonomy
  – Over-sampling methods
  – Under-sampling methods
• Resources

Dimensionality reduction
• Dimensionality reduction methods take an original dataset and convert every instance from the original R^d space to an R^d' space, where d' < d
• For each instance x in the dataset X:
  – y = f(x), where x = {x1, x2, …, xd} and y = {y1, y2, …, yd'}
• The definition of f is computed from X (the training set), and it is what distinguishes the different reduction methods
• In general we find two main classes of dimensionality reduction methods: linear and non-linear

Principal Component Analysis
• Classic linear dimensionality reduction method (Pearson, 1901)
• Given a set of original variables X1 … Xp (the attributes of the problem)
• PCA finds a set of vectors Z1 … Zp, the principal components, defined as linear combinations of the original variables and uncorrelated with each other
• The PCs are also sorted such that Var(Z1) ≥ Var(Z2) ≥ … ≥ Var(Zp)

Principal Components
(Figure from http://en.wikipedia.org/wiki/Principal_component_analysis)

Applying the PCs to transform data
• Using all PCs: x' = Z·x, where Z is the n×n matrix whose rows are the principal components, x = (x1, x2, …, xn) is an original instance and x' = (x'1, x'2, …, x'n) is the transformed instance, with x'i = zi1·x1 + zi2·x2 + … + zin·xn
• Using only 2 PCs: keep only the first two rows of Z, a 2×n matrix, so each instance x is mapped to just (x'1, x'2)

How many components do we use?
• Using all components is useful if
  – The problem is small
  – We are interested in using an axis-parallel knowledge representation (rules, decision trees, etc.)
• But often we are interested in using just a subset of the PCs
  – PCs are ranked by their variance
  – We can select the top N
  – Or we can select the number of PCs that account for e.g. 95% of the cumulative variance

So what happens to the data when we transform it?
• The data is rotated, so the PCs become the axes of the new domain

How PCA is computed
• Normalise the data so all dimensions have mean 0 and variance 1
• Using Singular Value Decomposition (to be described in the missing values lecture)
• Using the covariance method (see the sketch below)
  – Compute the covariance matrix of the data: cjk = [ Σi=1..n (xij − x̄j)(xik − x̄k) ] / (n − 1)
  – Compute the eigenvectors (the PCs) and eigenvalues (their variances) of the covariance matrix
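To make the covariance method above concrete, here is a minimal sketch in base R; the iris dataset and the 95% cumulative-variance threshold are illustrative assumptions, not part of the slides.

# Minimal sketch of the covariance method in base R (illustrative only)
X <- scale(as.matrix(iris[, 1:4]))       # normalise: mean 0 and variance 1 per dimension
C <- cov(X)                              # covariance matrix of the normalised data
e <- eigen(C)                            # eigenvectors = PCs, eigenvalues = their variances
Z <- e$vectors                           # columns already sorted by decreasing variance
scores <- X %*% Z                        # data transformed using all PCs
cumvar <- cumsum(e$values) / sum(e$values)
k <- which(cumvar >= 0.95)[1]            # smallest number of PCs reaching 95% cumulative variance
reduced <- scores[, 1:k, drop = FALSE]   # reduced dataset

Up to signs, Z matches the rotation matrix returned by prcomp(X), which computes the same decomposition via SVD.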
Implementations of PCA in WEKA
• Simple implementation in the graphical interface, which cannot be used to transform more than one file with the same set of PCs (e.g. a training and a test set)
• Command-line version (here -R 0.5 retains the PCs accounting for 50% of the cumulative variance):
  – java weka.filters.supervised.attribute.AttributeSelection -E "weka.attributeSelection.PrincipalComponents -R 0.5" -b -i <input training> -o <output training> -r <input test> -s <output test> -c last

Implementations of PCA in R
> pca <- prcomp(data, scale=T)
> pca
Standard deviations:
[1] 1.3613699 0.3829777
Rotation:
          PC1        PC2
V1  0.7071068 -0.7071068
V2  0.7071068  0.7071068
> plot(pca)
> data_filtered <- predict(pca, data)[,1]   # select only the first PC

Independent Component Analysis
• PCA tries to identify the components that characterise the data
• ICA assumes that the data is not a single entity but a linear combination of statistically independent sources, and tries to identify those sources
• How is independence measured?
  – Minimisation of mutual information
  – Maximisation of non-Gaussianity
• FastICA is a very popular implementation (available in R)

Multidimensional Scaling (MDS)
• Family of dimensionality reduction methods originating from, and used mainly in, the information visualisation field
• It contains both linear and non-linear variants (some of which are equivalent to PCA)
• All variants start by computing an N×N distance matrix D that contains all pair-wise distances between the instances in the training set
• The method then finds a mapping from the original space into an M-dimensional space (e.g. M = 2 or 3) so that the distances between instances in the new space are as close as possible to D
• Available in R as well (cmdscale, isoMDS)

Self-Organising Maps (SOM)
• Truly non-linear dimensionality reduction method
• It is actually a type of unsupervised artificial neural network
• Imagine it as a mesh adapting to a complex surface
(Illustration: http://en.wikipedia.org/wiki/File:Somtraining.svg)

SOM algorithm (from Wikipedia)
1. Randomise the map's nodes' weight vectors (or initialise them using e.g. the two main PCs)
2. Grab an input vector
3. Traverse each node in the map:
   1. Use the Euclidean distance to measure the similarity between the input vector and the node's weight vector
   2. Track the node that produces the smallest distance (this node is the best matching unit, BMU)
4. Update the nodes in the neighbourhood of the BMU by pulling them closer to the input vector:
   Wv(t + 1) = Wv(t) + Θ(t)·α(t)·(D(t) − Wv(t)), where D(t) is the input vector, α(t) the learning rate and Θ(t) the neighbourhood function
5. Increase t and repeat from step 2 while t < λ

Imbalanced Classification
• Tackling classification problems where the class distribution is extremely uneven
• These kinds of problems are very difficult for standard data mining methods
(Figure: two example datasets, one with 50% of blue dots and one with only 10% of blue dots)

Effect of class imbalance
• Performance of XCS (an evolutionary learning system) on the Multiplexer synthetic dataset with different degrees of class imbalance
• IR = ratio between the majority and the minority class

Three approaches to Imbalanced Classification
• Cost-sensitive classification
  – Adapting the machine learning methods to penalise misclassifications of the minority class more heavily (later in the module)
• Over-sampling methods
  – Generate more examples from the minority class
• Under-sampling methods
  – Remove some of the examples from the majority class

Synthetic Minority Over-sampling Technique (SMOTE)
• (Chawla et al., 02)
• Generates synthetic instances from the minority class to balance the dataset
• Instances are generated using real examples from the minority class as seeds
• For each real example, its k nearest neighbours within the minority class are identified
• Synthetic instances are generated at a random point between the seed and the neighbour (Orriols-Puig, 08); see the sketch below
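To illustrate the idea behind SMOTE, here is a minimal sketch in base R. It is not the reference implementation; the function name, the 'minority' matrix and the default k = 5 are illustrative assumptions.

# Minimal SMOTE-style sketch in base R (illustrative only).
# Assumes 'minority' is a numeric matrix holding only minority-class rows, with more than k rows.
smote_sketch <- function(minority, N, k = 5) {
  d <- as.matrix(dist(minority))             # pair-wise distances inside the minority class
  synthetic <- matrix(NA, N, ncol(minority))
  for (s in 1:N) {
    i   <- sample(nrow(minority), 1)         # pick a random seed example
    nn  <- order(d[i, ])[2:(k + 1)]          # its k nearest neighbours (position 1 is itself)
    j   <- sample(nn, 1)                     # pick one neighbour at random
    gap <- runif(1)                          # random point on the seed-neighbour segment
    synthetic[s, ] <- minority[i, ] + gap * (minority[j, ] - minority[i, ])
  }
  synthetic
}

# Example (hypothetical data): generate 30 synthetic instances from 20 seed examples
# new_points <- smote_sketch(as.matrix(iris[1:20, 1:2]), N = 30)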
Under-sampling based on Tomek Links
• (Batista et al., 04)
• A Tomek link is a pair of examples <Ei, Ej> of different classes for which there is no other example Ek in the dataset that is closer to either of them than they are to each other
• The collection of Tomek links in the dataset defines the class frontiers
• This under-sampling method removes all examples from the majority class that are not part of Tomek links (a minimal detection sketch is given at the end of these slides)

Resources
• Comprehensive list of nonlinear dimensionality reduction methods
• Good lecture slides about PCA and SVD
• Survey on class imbalance
• Class imbalance methods in KEEL

Questions?
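As a supplement to the Tomek-link slide above, here is a minimal sketch in base R of how Tomek links could be detected. The function name and arguments are illustrative assumptions; it is not the procedure from Batista et al., and it uses the equivalent characterisation of a Tomek link as a pair of mutual nearest neighbours with different class labels.

# Minimal sketch of Tomek-link detection in base R (illustrative only).
# X is a numeric matrix of instances, y the vector of class labels.
tomek_links <- function(X, y) {
  d <- as.matrix(dist(X))
  diag(d) <- Inf                              # ignore self-distances
  nn <- apply(d, 1, which.min)                # nearest neighbour of every instance
  i <- which(nn[nn] == seq_along(nn) &        # mutual nearest neighbours...
             y != y[nn])                      # ...with different class labels
  cbind(i, nn[i])                             # each link <Ei, Ej> appears twice, once per end
}

How the detected pairs are then used (keeping or removing the majority-class member of each link) depends on the under-sampling variant being applied.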