Data Preprocessing
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

Outline
- Data Cleansing
- Data Transformation
- Data Description
- Feature Selection
- Feature Extraction

Where are data from?

Why Data Preprocessing?
- Real data are notoriously dirty! This is the biggest challenge in many data mining projects.
- Incomplete: Occupancy = " "
- Noisy: Salary = "-100"
- Inconsistent: Age = "42" vs. Birthday = "01/09/1985"
- Redundant: too many samples or too many features for the analysis at hand.
- Others: data types, imbalanced datasets.

Missing Data
- Data are not always available: one or more attributes of a sample may have empty values.
- Many data mining algorithms cannot handle missing values directly, which may cause significant trouble.
- Possible reasons: equipment malfunction; data not provided; not applicable (N/A).
- Different types: missing completely at random; missing at random; not missing at random.

How to handle missing data?
- Ignore: remove samples/attributes with missing values. The easiest and most straightforward way; works well when the missing rate is low.
- Fill in the missing values manually: recollect the data or use domain knowledge; tedious or infeasible.
- Fill in the missing values automatically: a global constant, the mean or median, or the most probable values. More art than science.

Missing Data: An Example
(figure: the relationship between X, Y, P(Y|X) and the missingness indicator M_Y)

Noisy Data
(figures: a clean signal and the same signal with random noise added)

Outliers
(figure: points scattered around the line y = x + 1, with a few outliers far away from it)

Outliers
(figure: outliers in a two-dimensional dataset)

Anomaly Detection

Local Outlier Factor
(figure: the k-distance of a point O for k = 3, with neighbours p1-p4 and a distant point o)
- Reachability distance: $\mathrm{distance}_k(A, B) = \max\{\mathrm{distance}_k(B),\ d(A, B)\}$, where $\mathrm{distance}_k(B)$ is the distance from B to its k-th nearest neighbour.
- Local reachability density: $\mathrm{lrd}(A) = 1 \Big/ \frac{\sum_{B \in N_k(A)} \mathrm{distance}_k(A, B)}{|N_k(A)|}$
- Local outlier factor: $\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\mathrm{lrd}(B)}{\mathrm{lrd}(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}(B)}{|N_k(A)|\,\mathrm{lrd}(A)}$

Local Outlier Factor
(figure: LOF values computed on a sample dataset)

Duplicate Data
(figures: examples of duplicate records)

Duplicate Data
- Create keys
- Sort
- Merge

Data Transformation
- Now we have an error-free dataset, but it still needs to be standardized.
- Type conversion
- Normalization
- Sampling

Attribute Types
- Continuous (real values): Temperature, Height, Weight ...
- Discrete (integer values): Number of people ...
- Ordinal (rankings): {Average, Good, Best}, {Low, Medium, High} ...
- Nominal (symbols): {Teacher, Worker, Salesman}, {Red, Green, Blue} ...
- String (text): "Tsinghua University", "No. 123, Pingan Avenue" ...

Type Conversion
(figures: the same samples plotted against Size and Color under two different encodings of the nominal attribute)

Type Conversion
- A nominal attribute with four values can be converted into four binary indicator attributes (1-of-N coding):
  0 0 0 1
  0 0 1 0
  0 1 0 0
  1 0 0 0

Normalization
- The height of a person can be recorded as 1.7 or 170 or 1700, depending on the unit.
- Min-max normalization: $v' = \frac{v - \min}{\max - \min}(\mathrm{new\_max} - \mathrm{new\_min}) + \mathrm{new\_min}$
- Let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
- Z-score normalization ($\mu$: mean, $\sigma$: standard deviation): $v' = \frac{v - \mu}{\sigma}$
- Let $\mu$ = 54,000 and $\sigma$ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

Sampling
- A database or data warehouse may store terabytes of data, while processing resources (CPU, memory, I/O) are limited.
- Sampling is applied to reduce the time complexity.
- In statistics, sampling is often applied because obtaining the entire set of data is too expensive.
- Aggregation: a change of scale (Cities -> States, Days -> Months) that gives more stable data with less variability.
- Sampling can also be used to adjust the class distributions of an imbalanced dataset.

Imbalanced Datasets
(figures: an imbalanced two-class problem and the behaviour of two classifiers, A and B)

Over-Sampling
- SMOTE (Synthetic Minority Over-sampling Technique): create synthetic minority-class samples by interpolating between a minority sample and its nearest minority-class neighbours.
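To make the idea concrete, here is a minimal MATLAB-style sketch of SMOTE-like interpolation. It is illustrative only, not the reference implementation; the toy data and the names Xmin, k, n_new are made up for this sketch.

% Toy minority-class samples (hypothetical data, just to make the sketch runnable).
Xmin  = [1.0 2.0; 1.5 1.8; 2.0 2.2; 2.4 1.9; 3.0 2.5];
k     = 3;       % number of nearest neighbours to consider
n_new = 10;      % number of synthetic samples to generate
[n, d] = size(Xmin);
Xsyn = zeros(n_new, d);
for i = 1:n_new
    idx = randi(n);                               % pick a random minority sample
    x   = Xmin(idx, :);
    dist2 = sum((Xmin - repmat(x, n, 1)).^2, 2);  % squared distances to all minority samples
    dist2(idx) = Inf;                             % exclude the sample itself
    [~, order] = sort(dist2);
    nb = Xmin(order(randi(k)), :);                % one of its k nearest minority neighbours
    Xsyn(i, :) = x + rand * (nb - x);             % interpolate between sample and neighbour
end

Every synthetic point lies on the segment between a real minority sample and one of its k nearest minority neighbours, so the new points stay inside the region already occupied by the minority class.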
Data Description
- Mean: the arithmetic mean.
- Median.
- Mode: the most frequently occurring value in a list; can be used with non-numerical data.
- Variance: how spread out (diverse) the values are.

Data Description
- Pearson's product-moment correlation coefficient:
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
- If $r_{A,B} > 0$, A and B are positively correlated.
- If $r_{A,B} = 0$, there is no linear correlation between A and B.
- If $r_{A,B} < 0$, A and B are negatively correlated.

Data Description
- Pearson's chi-square ($\chi^2$) test:
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
- Contingency table (expected counts in parentheses):

                             Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)     200 (360)        450
  Not like science fiction    50 (210)    1000 (840)       1050
  Sum (col.)                  300         1200             1500

$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$

Data Visualization (1D)
(figure: a pie chart of group proportions and the corresponding bar chart)

Data Visualization (2D)
(figure: scatter plot of Acceleration versus Displacement)

Surf Plots (3D)
(figure: a 3-D surface plot)

Box Plots (High Dimensional)
(figure: box plots of the values in each dimension)

Parallel Coordinates (High Dimensional)

GUESS

CiteSpace

Gephi

Feature Selection
- Example attributes of a record: ID, Weight, Age, Gender, Height, Education, Income, Occupation, Address, Address.
- Which attributes are noisy, irrelevant, or duplicated? How many attributes can the analysis afford (complexity)?

Class Distributions

Class Distributions
(figure: the probability of being male or female in the whole population, among non-smokers (60%), and among smokers (40%))

Entropy
$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$
$H(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1.0$
$H(S \mid X = a) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 = 0.7219$
$H(S \mid X = b) = -0.05 \log_2 0.05 - 0.95 \log_2 0.95 = 0.2864$
$H(S \mid X) = 0.6\, H(S \mid X = a) + 0.4\, H(S \mid X = b) = 0.5477$
$\mathrm{Gain}(S, X) = H(S) - H(S \mid X) = 0.4523$

Feature Subset Search
- Exhaustive search: the number of subsets grows combinatorially.
$C_{10}^{3} = \frac{10!}{(10-3)!\,3!} = 120 \qquad C_{20}^{5} = \frac{20!}{(20-5)!\,5!} = 15504$
- Branch and bound: if the criterion is monotonic, $S_1 \supseteq S_2 \supseteq S_3 \Rightarrow J(S_1) \ge J(S_2) \ge J(S_3)$, so whole branches of the search tree can be pruned.

Feature Subset Search
- Top K individual features: $J(X_k) = \{J(x_1), \ldots, J(x_k)\}$ with $J(x_1) \ge J(x_2) \ge \cdots \ge J(x_k)$; keep the k features with the highest individual scores.
- Sequential Forward Selection: $J(X_k + x_1) \ge J(X_k + x_2) \ge \cdots \ge J(X_k + x_{D-k}),\ x_i \notin X_k$; at each step, add the candidate feature that gives the highest criterion value.
- Sequential Backward Selection: $J(X_k - x_1) \ge J(X_k - x_2) \ge \cdots \ge J(X_k - x_k),\ x_i \in X_k$; at each step, remove the feature whose removal hurts the criterion least.
- Optimization algorithms: Simulated Annealing, Tabu Search, Genetic Algorithms.

Feature Extraction

Principal Component Analysis

2D Example
(figures: a two-dimensional dataset and its principal component directions)

Some Math
- Goal: the covariance matrix S(Y) of the transformed data has nonzero diagonal entries and all off-diagonal elements equal to zero.

A Different View
- Approximate each sample by a point on a line with direction $e$, $x_k \approx \alpha_k e$ with $\|e\| = 1$, and minimize the total squared error:
$J(e) = \sum_{k=1}^{n} \|\alpha_k e - x_k\|^2$
- Minimizing over $\alpha_k$ gives $\alpha_k = e^t x_k$, so
$J(e) = \sum_{k=1}^{n} \alpha_k^2 \|e\|^2 - 2\sum_{k=1}^{n} \alpha_k e^t x_k + \sum_{k=1}^{n} \|x_k\|^2 = -\sum_{k=1}^{n} e^t x_k x_k^t e + \sum_{k=1}^{n} \|x_k\|^2$
- Minimizing $J(e)$ is therefore equivalent to
$\max_e\ e^t S e \quad \text{s.t. } \|e\| = 1, \qquad S = \sum_{k=1}^{n} x_k x_k^t$

Lagrange Multipliers
$\max_{x,y} f(x, y) = 3xy \quad \text{s.t. } 2x + y = 8$
$F(x, y, \lambda) = 3xy - \lambda(2x + y - 8)$
$F_x = 3y - 2\lambda = 0, \quad F_y = 3x - \lambda = 0, \quad F_\lambda = -(2x + y - 8) = 0$
$\Rightarrow \lambda = 6,\ x = 2,\ y = 4, \qquad f(2, 4) = 3 \times 2 \times 4 = 24$

More Math ...
$u = e^t S e - \lambda(e^t e - 1)$
$\frac{\partial u}{\partial e} = 2Se - 2\lambda e = 0 \;\Rightarrow\; Se = \lambda e \quad (e: \text{eigenvector},\ \lambda: \text{eigenvalue})$
$e^t S e = \lambda e^t e = \lambda$
- PCA: project the original data onto the eigenvectors of S with the largest eigenvalues.
- Example:
$\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}\begin{pmatrix}1\\ 1\end{pmatrix} = 3\begin{pmatrix}1\\ 1\end{pmatrix}, \qquad \begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}\begin{pmatrix}1\\ -1\end{pmatrix} = 1\begin{pmatrix}1\\ -1\end{pmatrix}$

PCA Examples
s = [1 0.8; 0.8 1];                 % covariance of the simulated data
x = mvnrnd(zeros(10000, 2), s);     % 10,000 correlated 2-D samples
plot(x(:,1), x(:,2), '.'); axis equal; grid on;
[V, D] = eig(s);                    % V = [-0.7071 0.7071; 0.7071 0.7071]
                                    % D = diag([0.2, 1.8])
newx = x * V(:,2);                  % project onto the eigenvector with the largest eigenvalue
hist(newx, 50);

Fish
(figures: the fish example)

The Issue of PCA
- Now, let's consider class information ...

Linear Discriminant Analysis
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible.
- Given a set of d-dimensional vectors x1, x2, ..., xn, of which N1 belong to class $\omega_1$ and N2 to class $\omega_2$.
- Of all the possible projection lines $y = w^T x$, find the one that maximizes the separability of the classes.
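As a small illustration of "finding the projection that maximizes separability", the sketch below (illustrative code, not from the slides) projects the two small classes used in the LDA worked example further on and compares a variance-based separation score for two candidate directions; it is a stand-in for the Fisher criterion defined next, not the exact formula.

C1 = [1 2; 2 3; 3 3; 4 5; 5 5];                      % toy class 1
C2 = [1 0; 2 1; 3 1; 3 2; 5 3; 6 5];                 % toy class 2
% Separation of the 1-D projections y = w' * x for a candidate direction w.
score = @(w) (mean(C1*w) - mean(C2*w))^2 / (var(C1*w) + var(C2*w));
w1 = [1;  1] / sqrt(2);                              % candidate direction 1
w2 = [1; -1] / sqrt(2);                              % candidate direction 2
[score(w1), score(w2)]                               % compare the two directions

For these points the second direction scores much higher, which is consistent with the LDA direction derived in the worked example below.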
Measure of Separability

Fisher Criterion
$J = \frac{|\mu_1 - \mu_2|^2}{s_1^2 + s_2^2}$
where $\mu_i$ and $s_i^2$ are the mean and scatter of class $i$ after projection.

Some Math ...
- Express the criterion in terms of the between-class scatter $S_B$ and the within-class scatter $S_W$ of the original samples.

More Math...
- $J(w) = \frac{w^T S_B w}{w^T S_W w}$ is a generalized Rayleigh quotient.

More Math...
- Maximizing $J$ leads to $S_W^{-1} S_B\, w = J\, w$: an eigenvector problem!
- Since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\, R$, where $R = (\mu_1 - \mu_2)^T w$ is a scalar,
$J w = S_W^{-1} S_B w = R\, S_W^{-1}(\mu_1 - \mu_2) \;\Rightarrow\; w = \frac{R}{J}\, S_W^{-1}(\mu_1 - \mu_2)$
- So the optimal direction is $w \propto S_W^{-1}(\mu_1 - \mu_2)$.

LDA Example
- Dataset:
$C_1 = \{(1,2), (2,3), (3,3), (4,5), (5,5)\}$
$C_2 = \{(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)\}$
- Covariance of [C1; C2]:
$Z = \begin{pmatrix}2.7636 & 2.2545\\ 2.2545 & 3.0182\end{pmatrix}$
- Eigenvectors and eigenvalues of Z:
$V = \begin{pmatrix}-0.7268 & 0.6869\\ 0.6869 & 0.7268\end{pmatrix}, \qquad D = \begin{pmatrix}0.6328 & 0\\ 0 & 5.1490\end{pmatrix}$
- The direction of the PCA projection: $(0.6869,\ 0.7268)^T$

LDA Example
- The mean of each class:
$\mu_1 = \mathrm{mean}(C_1) = (3.0,\ 3.6)^T, \qquad \mu_2 = \mathrm{mean}(C_2) = (3.3,\ 2.0)^T$
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix}0.11 & -0.53\\ -0.53 & 2.56\end{pmatrix}$
- The scatter of each class:
$S_1 = 4 \times \mathrm{cov}(C_1) = \begin{pmatrix}10 & 8\\ 8 & 7.2\end{pmatrix}, \qquad S_2 = 5 \times \mathrm{cov}(C_2) = \begin{pmatrix}17.3 & 16\\ 16 & 16\end{pmatrix}$
$S_W = S_1 + S_2 = \begin{pmatrix}27.3 & 24\\ 24 & 23.2\end{pmatrix}$

LDA Example
$S_W^{-1} S_B = \begin{pmatrix}0.26 & -1.27\\ -0.30 & 1.42\end{pmatrix}$
- Eigenvectors and eigenvalues of $S_W^{-1} S_B$:
$V = \begin{pmatrix}-0.98 & 0.67\\ -0.20 & -0.75\end{pmatrix}, \qquad D = \begin{pmatrix}0 & 0\\ 0 & 1.69\end{pmatrix}$
- The direction of the LDA projection: $(0.6656,\ -0.7463)^T$
- Alternatively, $S_W^{-1}(\mu_1 - \mu_2) = (-0.7936,\ 0.8899)^T$; after normalization: $(-0.6656,\ 0.7463)^T$ (the same line).

LDA Example
(figure: the two classes with the PCA and LDA projection directions)

C-Class LDA

C-Class LDA
- For C-class LDA with C = 2, $S_B$ is defined as:
$S_B = N_1(\mu_1 - \mu)(\mu_1 - \mu)^T + N_2(\mu_2 - \mu)(\mu_2 - \mu)^T$
- With $\mu = \frac{N_1\mu_1 + N_2\mu_2}{N}$ we have $\mu_1 - \mu = \frac{N_2}{N}(\mu_1 - \mu_2)$ and $\mu_2 - \mu = \frac{N_1}{N}(\mu_2 - \mu_1)$, so
$S_B = \frac{N_1 N_2^2 + N_2 N_1^2}{N^2}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \frac{N_1 N_2}{N}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
- This matches the two-class definition of $S_B$ up to a constant factor.

C-Class LDA
N1=1000; N2=500; N3=2000;
X1=mvnrnd([0,0], [1,0;0,1], N1);
X2=mvnrnd([3,6], [1,-0.8;-0.8,1], N2);
X3=mvnrnd([10,6], [1,0.8;0.8,1], N3);
S1=(N1-1)*cov(X1); S2=(N2-1)*cov(X2); S3=(N3-1)*cov(X3);
Sw=S1+S2+S3;                                % within-class scatter
M1=mean(X1); M2=mean(X2); M3=mean(X3);
Mu=(N1*M1+N2*M2+N3*M3)/(N1+N2+N3);          % overall mean
Sb=N1*(M1-Mu)'*(M1-Mu)+N2*(M2-Mu)'*(M2-Mu)+N3*(M3-Mu)'*(M3-Mu);   % between-class scatter
J=inv(Sw)*Sb;
[V,D]=eig(J);
(figures: scatter plot of the three classes and histograms of the data projected onto the two LDA directions)

Limitations of LDA
- LDA produces at most C-1 projections: $S_B$ is a matrix with rank C-1 or less.
- $S_W$ may be singular.
- LDA does not work well when ...
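A quick numerical check of the first point, as an illustrative sketch with made-up class sizes and means: because the weighted deviations of the class means from the overall mean sum to zero, $S_B$ is a sum of C rank-one terms with at most C-1 independent directions.

N  = [100 200 300];                  % hypothetical class sizes (C = 3)
M  = [0 0 0; 1 2 0; 3 1 2];          % hypothetical class means, one row per class (d = 3)
Mu = (N * M) / sum(N);               % overall mean
Sb = zeros(3);
for c = 1:3
    Sb = Sb + N(c) * (M(c,:) - Mu)' * (M(c,:) - Mu);
end
rank(Sb)                             % returns 2 = C - 1 < d: at most C-1 useful projections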
Reading Materials
- M. A. Hernandez and S. J. Stolfo, "Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, pp. 9-37, 1998.
- A. Donders, G. van der Heijden, T. Stijnen, and K. Moons, "Review: A gentle introduction to imputation of missing values," Journal of Clinical Epidemiology, vol. 59, pp. 1087-1091, 2006.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
- N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intelligent Data Analysis, vol. 6, pp. 429-449, 2002.
- D. Keim, "Information Visualization and Visual Data Mining," IEEE Transactions on Visualization and Computer Graphics, vol. 7(1), pp. 100-107, 2002.
- PCA tutorials:
  http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
  http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Lagrange multipliers:
  http://diglib.stanford.edu:8091/~klein/lagrange-multipliers.pdf

Review
- How to handle missing values?
- How to detect outliers?
- What is the major issue with imbalanced datasets?
- Why is feature selection important?
- How to search for a good subset of features?
- What does PCA stand for? What is the major issue of PCA?
- What does LDA stand for? What is the major issue of LDA?

Next Week's Class Talk
- Volunteers are required for next week's class talk.
- Topic: Data Visualization Tools
- Length: 20 minutes plus question time
- Suggested points of interest: software (CiteSpace, GUESS, Gephi), functionality, demos, features, case studies.