Data Preprocessing
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn

Outline
- Data Cleansing
- Data Transformation
- Data Description
- Feature Selection
- Feature Extraction

Where are data from?

Why Data Preprocessing?
- Real data are notoriously dirty! This is the biggest challenge in many data mining projects.
- Incomplete: Occupancy = " "
- Noisy: Salary = "-100"
- Inconsistent: Age = "42" vs. Birthday = "01/09/1985"
- Redundant: too many samples or too many features for the analysis at hand.
- Others: data types, imbalanced datasets.

Missing Data
- Data are not always available: one or more attributes of a sample may have empty values.
- Many data mining algorithms cannot handle missing values directly, which may cause significant trouble.
- Possible reasons: equipment malfunction; data not provided; not applicable (N/A).
- Different types: missing completely at random; missing at random; not missing at random.

How to handle missing data?
- Ignore: remove samples/attributes with missing values. The easiest and most straightforward way; works well when the missing rate is low.
- Fill in the missing values manually: recollect the data or use domain knowledge; tedious or infeasible.
- Fill in the missing values automatically: a global constant, the mean or median, or the most probable values. More art than science.

Missing Data: An Example
(figure: the relationship between X, Y, P(Y|X) and the missingness indicator M_Y)

Noisy Data
(figures: a clean signal and the same signal with random noise added)

Outliers
(figure: points scattered around the line y = x + 1, with a few outliers far away from it)

Outliers
(figure: outliers in a two-dimensional dataset)

Anomaly Detection

Local Outlier Factor
(figure: the k-distance of a point O for k = 3, with neighbours p1-p4 and a distant point o)
- Reachability distance: $\mathrm{distance}_k(A, B) = \max\{\mathrm{distance}_k(B),\ d(A, B)\}$, where $\mathrm{distance}_k(B)$ is the distance from B to its k-th nearest neighbour.
- Local reachability density: $\mathrm{lrd}(A) = 1 \Big/ \frac{\sum_{B \in N_k(A)} \mathrm{distance}_k(A, B)}{|N_k(A)|}$
- Local outlier factor: $\mathrm{LOF}_k(A) = \frac{\sum_{B \in N_k(A)} \frac{\mathrm{lrd}(B)}{\mathrm{lrd}(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} \mathrm{lrd}(B)}{|N_k(A)|\,\mathrm{lrd}(A)}$

Local Outlier Factor
(figure: LOF values computed on a sample dataset)

Duplicate Data
(figures: examples of duplicate records)

Duplicate Data
- Create keys
- Sort
- Merge

Data Transformation
- Now we have an error-free dataset, but it still needs to be standardized.
- Type conversion
- Normalization
- Sampling

Attribute Types
- Continuous (real values): Temperature, Height, Weight ...
- Discrete (integer values): Number of people ...
- Ordinal (rankings): {Average, Good, Best}, {Low, Medium, High} ...
- Nominal (symbols): {Teacher, Worker, Salesman}, {Red, Green, Blue} ...
- String (text): "Tsinghua University", "No. 123, Pingan Avenue" ...

Type Conversion
(figures: the same samples plotted against Size and Color under two different encodings of the nominal attribute)

Type Conversion
- A nominal attribute with four values can be converted into four binary indicator attributes (1-of-N coding):
  0 0 0 1
  0 0 1 0
  0 1 0 0
  1 0 0 0

Normalization
- The height of a person can be recorded as 1.7 or 170 or 1700, depending on the unit.
- Min-max normalization: $v' = \frac{v - \min}{\max - \min}(\mathrm{new\_max} - \mathrm{new\_min}) + \mathrm{new\_min}$
- Let income range from $12,000 to $98,000 and be normalized to [0.0, 1.0]. Then $73,600 is mapped to
  $\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}(1.0 - 0) + 0 = 0.716$
- Z-score normalization ($\mu$: mean, $\sigma$: standard deviation): $v' = \frac{v - \mu}{\sigma}$
- Let $\mu$ = 54,000 and $\sigma$ = 16,000. Then $\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$

Sampling
- A database or data warehouse may store terabytes of data, while processing resources (CPU, memory, I/O) are limited.
- Sampling is applied to reduce the time complexity.
- In statistics, sampling is often applied because obtaining the entire set of data is too expensive.
- Aggregation: a change of scale (Cities -> States, Days -> Months) that gives more stable data with less variability.
- Sampling can also be used to adjust the class distributions of an imbalanced dataset.

Imbalanced Datasets
(figures: an imbalanced two-class problem and the behaviour of two classifiers, A and B)

Over-Sampling
- SMOTE (Synthetic Minority Over-sampling Technique): create synthetic minority-class samples by interpolating between a minority sample and its nearest minority-class neighbours.
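To make the idea concrete, here is a minimal MATLAB-style sketch of SMOTE-like interpolation. It is illustrative only, not the reference implementation; the toy data and the names Xmin, k, n_new are made up for this sketch.

% Toy minority-class samples (hypothetical data, just to make the sketch runnable).
Xmin  = [1.0 2.0; 1.5 1.8; 2.0 2.2; 2.4 1.9; 3.0 2.5];
k     = 3;       % number of nearest neighbours to consider
n_new = 10;      % number of synthetic samples to generate
[n, d] = size(Xmin);
Xsyn = zeros(n_new, d);
for i = 1:n_new
    idx = randi(n);                               % pick a random minority sample
    x   = Xmin(idx, :);
    dist2 = sum((Xmin - repmat(x, n, 1)).^2, 2);  % squared distances to all minority samples
    dist2(idx) = Inf;                             % exclude the sample itself
    [~, order] = sort(dist2);
    nb = Xmin(order(randi(k)), :);                % one of its k nearest minority neighbours
    Xsyn(i, :) = x + rand * (nb - x);             % interpolate between sample and neighbour
end

Every synthetic point lies on the segment between a real minority sample and one of its k nearest minority neighbours, so the new points stay inside the region already occupied by the minority class.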
Data Description
- Mean: the arithmetic mean.
- Median.
- Mode: the most frequently occurring value in a list; can be used with non-numerical data.
- Variance: how spread out (diverse) the values are.

Data Description
- Pearson's product-moment correlation coefficient:
$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$
- If $r_{A,B} > 0$, A and B are positively correlated.
- If $r_{A,B} = 0$, there is no linear correlation between A and B.
- If $r_{A,B} < 0$, A and B are negatively correlated.

Data Description
- Pearson's chi-square ($\chi^2$) test:
$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$
- Contingency table (expected counts in parentheses):

                             Play chess   Not play chess   Sum (row)
  Like science fiction       250 (90)     200 (360)        450
  Not like science fiction    50 (210)    1000 (840)       1050
  Sum (col.)                  300         1200             1500

$\chi^2 = \frac{(250 - 90)^2}{90} + \frac{(50 - 210)^2}{210} + \frac{(200 - 360)^2}{360} + \frac{(1000 - 840)^2}{840} = 507.93$

Data Visualization (1D)
(figure: a pie chart of group proportions and the corresponding bar chart)

Data Visualization (2D)
(figure: scatter plot of Acceleration versus Displacement)

Surf Plots (3D)
(figure: a 3-D surface plot)

Box Plots (High Dimensional)
(figure: box plots of the values in each dimension)

Parallel Coordinates (High Dimensional)

GUESS

CiteSpace

Gephi

Feature Selection
- Example attributes of a record: ID, Weight, Age, Gender, Height, Education, Income, Occupation, Address, Address.
- Which attributes are noisy, irrelevant, or duplicated? How many attributes can the analysis afford (complexity)?

Class Distributions

Class Distributions
(figure: the probability of being male or female in the whole population, among non-smokers (60%), and among smokers (40%))

Entropy
$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$
$H(S) = -0.5 \log_2 0.5 - 0.5 \log_2 0.5 = 1.0$
$H(S \mid X = a) = -0.8 \log_2 0.8 - 0.2 \log_2 0.2 = 0.7219$
$H(S \mid X = b) = -0.05 \log_2 0.05 - 0.95 \log_2 0.95 = 0.2864$
$H(S \mid X) = 0.6\, H(S \mid X = a) + 0.4\, H(S \mid X = b) = 0.5477$
$\mathrm{Gain}(S, X) = H(S) - H(S \mid X) = 0.4523$

Feature Subset Search
- Exhaustive search: the number of subsets grows combinatorially.
$C_{10}^{3} = \frac{10!}{(10-3)!\,3!} = 120 \qquad C_{20}^{5} = \frac{20!}{(20-5)!\,5!} = 15504$
- Branch and bound: if the criterion is monotonic, $S_1 \supseteq S_2 \supseteq S_3 \Rightarrow J(S_1) \ge J(S_2) \ge J(S_3)$, so whole branches of the search tree can be pruned.

Feature Subset Search
- Top K individual features: $J(X_k) = \{J(x_1), \ldots, J(x_k)\}$ with $J(x_1) \ge J(x_2) \ge \cdots \ge J(x_k)$; keep the k features with the highest individual scores.
- Sequential Forward Selection: $J(X_k + x_1) \ge J(X_k + x_2) \ge \cdots \ge J(X_k + x_{D-k}),\ x_i \notin X_k$; at each step, add the candidate feature that gives the highest criterion value.
- Sequential Backward Selection: $J(X_k - x_1) \ge J(X_k - x_2) \ge \cdots \ge J(X_k - x_k),\ x_i \in X_k$; at each step, remove the feature whose removal hurts the criterion least.
- Optimization algorithms: Simulated Annealing, Tabu Search, Genetic Algorithms.

Feature Extraction

Principal Component Analysis

2D Example
(figures: a two-dimensional dataset and its principal component directions)

Some Math
- Goal: the covariance matrix S(Y) of the transformed data has nonzero diagonal entries and all off-diagonal elements equal to zero.

A Different View
- Approximate each sample by a point on a line with direction $e$, $x_k \approx \alpha_k e$ with $\|e\| = 1$, and minimize the total squared error:
$J(e) = \sum_{k=1}^{n} \|\alpha_k e - x_k\|^2$
- Minimizing over $\alpha_k$ gives $\alpha_k = e^t x_k$, so
$J(e) = \sum_{k=1}^{n} \alpha_k^2 \|e\|^2 - 2\sum_{k=1}^{n} \alpha_k e^t x_k + \sum_{k=1}^{n} \|x_k\|^2 = -\sum_{k=1}^{n} e^t x_k x_k^t e + \sum_{k=1}^{n} \|x_k\|^2$
- Minimizing $J(e)$ is therefore equivalent to
$\max_e\ e^t S e \quad \text{s.t. } \|e\| = 1, \qquad S = \sum_{k=1}^{n} x_k x_k^t$

Lagrange Multipliers
$\max_{x,y} f(x, y) = 3xy \quad \text{s.t. } 2x + y = 8$
$F(x, y, \lambda) = 3xy - \lambda(2x + y - 8)$
$F_x = 3y - 2\lambda = 0, \quad F_y = 3x - \lambda = 0, \quad F_\lambda = -(2x + y - 8) = 0$
$\Rightarrow \lambda = 6,\ x = 2,\ y = 4, \qquad f(2, 4) = 3 \times 2 \times 4 = 24$

More Math ...
$u = e^t S e - \lambda(e^t e - 1)$
$\frac{\partial u}{\partial e} = 2Se - 2\lambda e = 0 \;\Rightarrow\; Se = \lambda e \quad (e: \text{eigenvector},\ \lambda: \text{eigenvalue})$
$e^t S e = \lambda e^t e = \lambda$
- PCA: project the original data onto the eigenvectors of S with the largest eigenvalues.
- Example:
$\begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}\begin{pmatrix}1\\ 1\end{pmatrix} = 3\begin{pmatrix}1\\ 1\end{pmatrix}, \qquad \begin{pmatrix}2 & 1\\ 1 & 2\end{pmatrix}\begin{pmatrix}1\\ -1\end{pmatrix} = 1\begin{pmatrix}1\\ -1\end{pmatrix}$

PCA Examples
s = [1 0.8; 0.8 1];                 % covariance of the simulated data
x = mvnrnd(zeros(10000, 2), s);     % 10,000 correlated 2-D samples
plot(x(:,1), x(:,2), '.'); axis equal; grid on;
[V, D] = eig(s);                    % V = [-0.7071 0.7071; 0.7071 0.7071]
                                    % D = diag([0.2, 1.8])
newx = x * V(:,2);                  % project onto the eigenvector with the largest eigenvalue
hist(newx, 50);

Fish
(figures: the fish example)

The Issue of PCA
- Now, let's consider class information ...

Linear Discriminant Analysis
- The objective of LDA is to perform dimensionality reduction while preserving as much of the class discriminatory information as possible.
- Given a set of d-dimensional vectors x1, x2, ..., xn, of which N1 belong to class $\omega_1$ and N2 to class $\omega_2$.
- Of all the possible projection lines $y = w^T x$, find the one that maximizes the separability of the classes.
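As a small illustration of "finding the projection that maximizes separability", the sketch below (illustrative code, not from the slides) projects the two small classes used in the LDA worked example further on and compares a variance-based separation score for two candidate directions; it is a stand-in for the Fisher criterion defined next, not the exact formula.

C1 = [1 2; 2 3; 3 3; 4 5; 5 5];                      % toy class 1
C2 = [1 0; 2 1; 3 1; 3 2; 5 3; 6 5];                 % toy class 2
% Separation of the 1-D projections y = w' * x for a candidate direction w.
score = @(w) (mean(C1*w) - mean(C2*w))^2 / (var(C1*w) + var(C2*w));
w1 = [1;  1] / sqrt(2);                              % candidate direction 1
w2 = [1; -1] / sqrt(2);                              % candidate direction 2
[score(w1), score(w2)]                               % compare the two directions

For these points the second direction scores much higher, which is consistent with the LDA direction derived in the worked example below.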
Measure of Separability

Fisher Criterion
$J = \frac{|\mu_1 - \mu_2|^2}{s_1^2 + s_2^2}$
where $\mu_i$ and $s_i^2$ are the mean and scatter of class $i$ after projection.

Some Math ...
- Express the criterion in terms of the between-class scatter $S_B$ and the within-class scatter $S_W$ of the original samples.

More Math...
- $J(w) = \frac{w^T S_B w}{w^T S_W w}$ is a generalized Rayleigh quotient.

More Math...
- Maximizing $J$ leads to $S_W^{-1} S_B\, w = J\, w$: an eigenvector problem!
- Since $S_B w = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T w = (\mu_1 - \mu_2)\, R$, where $R = (\mu_1 - \mu_2)^T w$ is a scalar,
$J w = S_W^{-1} S_B w = R\, S_W^{-1}(\mu_1 - \mu_2) \;\Rightarrow\; w = \frac{R}{J}\, S_W^{-1}(\mu_1 - \mu_2)$
- So the optimal direction is $w \propto S_W^{-1}(\mu_1 - \mu_2)$.

LDA Example
- Dataset:
$C_1 = \{(1,2), (2,3), (3,3), (4,5), (5,5)\}$
$C_2 = \{(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)\}$
- Covariance of [C1; C2]:
$Z = \begin{pmatrix}2.7636 & 2.2545\\ 2.2545 & 3.0182\end{pmatrix}$
- Eigenvectors and eigenvalues of Z:
$V = \begin{pmatrix}-0.7268 & 0.6869\\ 0.6869 & 0.7268\end{pmatrix}, \qquad D = \begin{pmatrix}0.6328 & 0\\ 0 & 5.1490\end{pmatrix}$
- The direction of the PCA projection: $(0.6869,\ 0.7268)^T$

LDA Example
- The mean of each class:
$\mu_1 = \mathrm{mean}(C_1) = (3.0,\ 3.6)^T, \qquad \mu_2 = \mathrm{mean}(C_2) = (3.3,\ 2.0)^T$
$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{pmatrix}0.11 & -0.53\\ -0.53 & 2.56\end{pmatrix}$
- The scatter of each class:
$S_1 = 4 \times \mathrm{cov}(C_1) = \begin{pmatrix}10 & 8\\ 8 & 7.2\end{pmatrix}, \qquad S_2 = 5 \times \mathrm{cov}(C_2) = \begin{pmatrix}17.3 & 16\\ 16 & 16\end{pmatrix}$
$S_W = S_1 + S_2 = \begin{pmatrix}27.3 & 24\\ 24 & 23.2\end{pmatrix}$

LDA Example
$S_W^{-1} S_B = \begin{pmatrix}0.26 & -1.27\\ -0.30 & 1.42\end{pmatrix}$
- Eigenvectors and eigenvalues of $S_W^{-1} S_B$:
$V = \begin{pmatrix}-0.98 & 0.67\\ -0.20 & -0.75\end{pmatrix}, \qquad D = \begin{pmatrix}0 & 0\\ 0 & 1.69\end{pmatrix}$
- The direction of the LDA projection: $(0.6656,\ -0.7463)^T$
- Alternatively, $S_W^{-1}(\mu_1 - \mu_2) = (-0.7936,\ 0.8899)^T$; after normalization: $(-0.6656,\ 0.7463)^T$ (the same line).

LDA Example
(figure: the two classes with the PCA and LDA projection directions)

C-Class LDA

C-Class LDA
- For C-class LDA with C = 2, $S_B$ is defined as:
$S_B = N_1(\mu_1 - \mu)(\mu_1 - \mu)^T + N_2(\mu_2 - \mu)(\mu_2 - \mu)^T$
- With $\mu = \frac{N_1\mu_1 + N_2\mu_2}{N}$ we have $\mu_1 - \mu = \frac{N_2}{N}(\mu_1 - \mu_2)$ and $\mu_2 - \mu = \frac{N_1}{N}(\mu_2 - \mu_1)$, so
$S_B = \frac{N_1 N_2^2 + N_2 N_1^2}{N^2}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \frac{N_1 N_2}{N}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$
- This matches the two-class definition of $S_B$ up to a constant factor.

C-Class LDA
N1=1000; N2=500; N3=2000;
X1=mvnrnd([0,0], [1,0;0,1], N1);
X2=mvnrnd([3,6], [1,-0.8;-0.8,1], N2);
X3=mvnrnd([10,6], [1,0.8;0.8,1], N3);
S1=(N1-1)*cov(X1); S2=(N2-1)*cov(X2); S3=(N3-1)*cov(X3);
Sw=S1+S2+S3;                                % within-class scatter
M1=mean(X1); M2=mean(X2); M3=mean(X3);
Mu=(N1*M1+N2*M2+N3*M3)/(N1+N2+N3);          % overall mean
Sb=N1*(M1-Mu)'*(M1-Mu)+N2*(M2-Mu)'*(M2-Mu)+N3*(M3-Mu)'*(M3-Mu);   % between-class scatter
J=inv(Sw)*Sb;
[V,D]=eig(J);
(figures: scatter plot of the three classes and histograms of the data projected onto the two LDA directions)

Limitations of LDA
- LDA produces at most C-1 projections: $S_B$ is a matrix with rank C-1 or less.
- $S_W$ may be singular.
- LDA does not work well when ...
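A quick numerical check of the first point, as an illustrative sketch with made-up class sizes and means: because the weighted deviations of the class means from the overall mean sum to zero, $S_B$ is a sum of C rank-one terms with at most C-1 independent directions.

N  = [100 200 300];                  % hypothetical class sizes (C = 3)
M  = [0 0 0; 1 2 0; 3 1 2];          % hypothetical class means, one row per class (d = 3)
Mu = (N * M) / sum(N);               % overall mean
Sb = zeros(3);
for c = 1:3
    Sb = Sb + N(c) * (M(c,:) - Mu)' * (M(c,:) - Mu);
end
rank(Sb)                             % returns 2 = C - 1 < d: at most C-1 useful projections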
Reading Materials
- M. A. Hernandez and S. J. Stolfo, "Real-world Data is Dirty: Data Cleansing and The Merge/Purge Problem," Data Mining and Knowledge Discovery, vol. 2, pp. 9-37, 1998.
- A. Donders, G. van der Heijden, T. Stijnen, and K. Moons, "Review: A gentle introduction to imputation of missing values," Journal of Clinical Epidemiology, vol. 59, pp. 1087-1091, 2006.
- N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, "SMOTE: Synthetic minority over-sampling technique," Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
- N. Japkowicz and S. Stephen, "The class imbalance problem: A systematic study," Intelligent Data Analysis, vol. 6, pp. 429-449, 2002.
- D. Keim, "Information Visualization and Visual Data Mining," IEEE Transactions on Visualization and Computer Graphics, vol. 7(1), pp. 100-107, 2002.
- PCA tutorials:
  http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
  http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
- Lagrange multipliers:
  http://diglib.stanford.edu:8091/~klein/lagrange-multipliers.pdf

Review
- How to handle missing values?
- How to detect outliers?
- What is the major issue with imbalanced datasets?
- Why is feature selection important?
- How to search for a good subset of features?
- What does PCA stand for? What is the major issue of PCA?
- What does LDA stand for? What is the major issue of LDA?

Next Week's Class Talk
- Volunteers are required for next week's class talk.
- Topic: Data Visualization Tools
- Length: 20 minutes plus question time
- Suggested points of interest: software (CiteSpace, GUESS, Gephi), functionality, demos, features, case studies.