Data Preprocessing
Lecturer: Dr. Bo Yuan
E-mail: yuanb@sz.tsinghua.edu.cn
Outline
 Data Cleansing
 Data Transformation
 Data Description
 Feature Selection
 Feature Extraction
2
Where are data from?
3
Why Data Preprocessing?
 Real data are notoriously dirty!
 The biggest challenge in many data mining projects
 Incomplete
 Occupancy = “ ”
 Noisy
 Salary = “-100”
 Inconsistent
 Age = “42” vs. Birthday = “01/09/1985”
 Redundant
 Too much data or too many features for effective analysis
 Others
 Data types
 Imbalanced datasets
4
Missing Data
 Data are not always available.
 One or more attributes of a sample may have empty values.
 Many data mining algorithms cannot handle missing values directly.
 May cause significant problems.
 Possible Reasons
 Equipment malfunction
 Data not provided
 Not Applicable (N/A)
 Different Types
 Missing completely at random
 Missing at random
 Not missing at random
5
How to handle missing data?
 Ignore
 Remove samples/attributes with missing values
 The easiest and most straightforward way
 Works well when the missing rate is low
 Fill in the missing values manually
 Recollect the data
 Domain knowledge
 Tedious/Infeasible
 Fill in the missing values automatically (see the sketch after this slide)
 A global constant
 The mean or median
 Most probable values
 More art than science
6
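A minimal MATLAB sketch of automatic fill-in (not from the original slides), assuming missing entries are encoded as NaN in a numeric vector:

x = [1.2 NaN 0.7 NaN 1.0];
x_const  = x;  x_const(isnan(x))  = 0;                       % fill with a global constant
x_mean   = x;  x_mean(isnan(x))   = mean(x(~isnan(x)));      % fill with the mean
x_median = x;  x_median(isnan(x)) = median(x(~isnan(x)));    % fill with the median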
Missing Data: An Example
(Figure: variables X and Y, the conditional distribution P(Y|X), and the missingness indicator M_Y.)
7
Noisy Data
(Figure: a smooth curve and the same curve corrupted by random noise.)
8
Outliers
(Figure: data points scattered around the line y = x + 1, with several outliers far from the line.)
9
Outliers
10
Anomaly Detection
11
Local Outlier Factor
(Figure: a point o and nearby points p1, p2, p3, p4, illustrating the k-distance of o for k = 3.)
 k-distance: $\mathrm{distance}_k(O)$ is the distance from O to its k-th nearest neighbour.
 Reachability distance:
$$\mathrm{distance}_k(A, B) = \max\{\mathrm{distance}_k(B),\ d(A, B)\}$$
 Local reachability density:
$$lrd(A) = 1 \Big/ \left( \frac{\sum_{B \in N_k(A)} \mathrm{distance}_k(A, B)}{|N_k(A)|} \right)$$
 Local outlier factor:
$$LOF_k(A) = \frac{\sum_{B \in N_k(A)} \frac{lrd(B)}{lrd(A)}}{|N_k(A)|} = \frac{\sum_{B \in N_k(A)} lrd(B)}{|N_k(A)| \cdot lrd(A)}$$
12
Local Outlier Factor
13
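A minimal MATLAB sketch of LOF computed directly from the formulas above (lof_scores is an assumed helper, not from the slides; pdist/squareform require the Statistics Toolbox):

function lof = lof_scores(X, k)
% LOF_SCORES  Local outlier factor of each row of X, using k neighbours.
n = size(X, 1);
D = squareform(pdist(X));            % pairwise Euclidean distances
[sortedD, idx] = sort(D, 2);         % neighbours of each point, nearest first
kdist = sortedD(:, k+1);             % k-distance (column 1 is the point itself)
N = idx(:, 2:k+1);                   % indices of the k nearest neighbours
lrd = zeros(n, 1);                   % local reachability density
for i = 1:n
    reach = max(reshape(kdist(N(i,:)), 1, []), D(i, N(i,:)));
    lrd(i) = k / sum(reach);
end
lof = zeros(n, 1);
for i = 1:n
    lof(i) = sum(lrd(N(i,:))) / (k * lrd(i));   % values well above 1 suggest outliers
end
end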
Duplicate Data
14
Duplicate Data
15
Duplicate Data
Create Keys → Sort → Merge
16
Data Transformation
 Suppose we now have an error-free dataset.
 It still needs to be standardized.
 Type Conversion
 Normalization
 Sampling
17
Attribute Types
 Continuous
 Real values: Temperature, Height, Weight …
 Discrete
 Integer values: Number of people …
 Ordinal
 Rankings: {Average, Good, Best}, {Low, Medium, High} …
 Nominal
 Symbols: {Teacher, Worker, Salesman}, {Red, Green, Blue} …
 String
 Text: “Tsinghua University”, “No. 123, Pingan Avenue” …
18
Type Conversion
(Figure: two classes of samples, × and ○, plotted against Size and Color.)
19
Type Conversion
(Figure: the same two classes plotted against Size and Color with a different ordering of the attribute values.)
20
Type Conversion
(Figure: a nominal attribute with four values encoded as 0/1 indicator (one-hot) vectors.)
21
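A minimal MATLAB sketch (assumed example, not from the slides) of converting a nominal attribute into 0/1 indicator vectors:

color = {'Red'; 'Green'; 'Blue'; 'Green'};
[values, ~, idx] = unique(color);               % map each value to a category index
onehot = full(sparse(1:numel(idx), idx, 1))     % one 0/1 column per category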
Normalization
 The height of someone can be recorded as 1.7 (m), 170 (cm) or 1700 (mm) …
 Min-max normalization:
$$v' = \frac{v - \min}{\max - \min}\,(new\_max - new\_min) + new\_min$$
 Let income range $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
$$\frac{73{,}600 - 12{,}000}{98{,}000 - 12{,}000}\,(1.0 - 0) + 0 = 0.716$$
 Z-score normalization (μ: mean, σ: standard deviation):
$$v' = \frac{v - \mu}{\sigma}$$
 Let μ = 54,000, σ = 16,000. Then
$$\frac{73{,}600 - 54{,}000}{16{,}000} = 1.225$$
22
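A minimal MATLAB sketch (assumed example) reproducing the two normalizations above:

income = [12000 47000 73600 98000];
minmax = (income - 12000) / (98000 - 12000) * (1.0 - 0) + 0;   % e.g. 73,600 -> 0.716
z = (73600 - 54000) / 16000                                     % = 1.225, with mu = 54,000, sigma = 16,000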
Sampling
 A database/data warehouse may store terabytes of data.
 Processing limits: CPU, Memory, I/O …
 Sampling is applied to reduce the time complexity.
 In statistics, sampling is applied often because obtaining the
entire set of data is expensive.
 Aggregation
 Change of scale:
• Cities → States; Days → Months
 Aggregated data are more stable and show less variability.
 Sampling can also be used to adjust the class distribution.
 Imbalanced dataset
23
Imbalanced Datasets
(Figure: an imbalanced dataset with class proportions 85%, 10% and 5%, and the behaviour of two classifiers, A and B.)
24
Imbalanced Datasets
25
Over-Sampling
SMOTE
26
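A minimal SMOTE-style sketch in MATLAB (assumed toy example, not the reference implementation): each synthetic sample is an interpolation between a minority-class point and one of its k nearest minority-class neighbours.

k = 5;
Xmin = mvnrnd([0 0], eye(2), 50);            % minority-class samples (toy data)
[~, idx] = sort(squareform(pdist(Xmin)), 2); % neighbours of each point, nearest first
synthetic = zeros(size(Xmin));
for i = 1:size(Xmin, 1)
    nn = idx(i, randi(k) + 1);               % pick one of the k nearest neighbours at random
    synthetic(i, :) = Xmin(i, :) + rand * (Xmin(nn, :) - Xmin(i, :));
end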
Data Description
 Mean
 Arithmetic mean
 Median
 Mode
 The most frequently occurring value in a list
 Can be used with non-numerical data.
 Variance
 Degree of diversity
27
Data Description
 Pearson’s product moment correlation coefficient
$$r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\bar{B}}{(n-1)\,\sigma_A \sigma_B}$$
 If rA, B > 0, A and B are positively correlated.
 If rA, B = 0, no linear correlation between A and B.
 If rA, B < 0, A and B are negatively correlated.
28
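A minimal MATLAB sketch (assumed example) of computing r_{A,B} with corrcoef:

A = randn(100, 1);
B = 0.8 * A + 0.2 * randn(100, 1);    % B is positively correlated with A
R = corrcoef(A, B);                   % R(1,2) is r_{A,B}, close to +1 here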
Data Description
 Pearson's chi-square (χ2) test
$$\chi^2 = \sum \frac{(\mathrm{Observed} - \mathrm{Expected})^2}{\mathrm{Expected}}$$

                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

(Expected counts under independence are shown in parentheses.)

$$\chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93$$
29
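A minimal MATLAB sketch (assumed example) recomputing the statistic for the table above:

observed = [250 200; 50 1000];
expected = sum(observed, 2) * sum(observed, 1) / sum(observed(:));   % row totals x column totals / grand total
chi2 = sum((observed(:) - expected(:)).^2 ./ expected(:))            % about 507.93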
Data Visualization (1D)
(Figure: a one-dimensional group variable shown as a pie chart (33%, 28%, 22%, 11%, 6%) and as a bar chart of group counts.)
30
Data Visualization (2D)
(Figure: a 2D scatter plot of Acceleration versus Displacement.)
31
Surf Plots (3D)
(Figure: a 3D surface plot over a 2D grid.)
32
Box Plots (High Dimensional)
(Figure: box plots of the values along each of four dimensions.)
33
Parallel Coordinates (High Dimensional)
34
GUESS Visualization
35
CiteSpace
36
Gephi
37
Feature Selection
 Candidate attributes: ID, Weight, Age, Gender, Height, Education, Income, Occupation, Address, Address
 Noisy? Irrelevant? Duplicate? Complexity?
38
Class Distributions
39
Class Distributions
(Figure: the probability of Female vs. Male in the total population and within the Non-Smoker (60%) and Smoker (40%) groups.)
40
Entropy
$$H(X) = -\sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$
$$H(S) = -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1.0$$
$$H(S\,|\,X = a) = -0.8\log_2 0.8 - 0.2\log_2 0.2 = 0.7219$$
$$H(S\,|\,X = b) = -0.05\log_2 0.05 - 0.95\log_2 0.95 = 0.2864$$
$$H(S\,|\,X) = 0.6\,H(S\,|\,X=a) + 0.4\,H(S\,|\,X=b) = 0.5477$$
$$Gain(S, X) = H(S) - H(S\,|\,X) = 0.4523$$
41
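A minimal MATLAB sketch (assumed example) reproducing the information gain above:

H = @(p) -sum(p .* log2(p));          % entropy of a discrete distribution
HS  = H([0.5 0.5]);                   % H(S) = 1.0
HSa = H([0.8 0.2]);                   % H(S|X=a) = 0.7219
HSb = H([0.05 0.95]);                 % H(S|X=b) = 0.2864
gain = HS - (0.6 * HSa + 0.4 * HSb)   % Gain(S,X) = 0.4523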
Feature Subset Search
 Exhaustive
 All possible combinations
$$C_{10}^{3} = \frac{10!}{(10-3)!\,3!} = 120 \qquad C_{20}^{5} = \frac{20!}{(20-5)!\,5!} = 15504$$
 Branch and Bound (assumes a monotonic criterion J)
$$S_1 \supseteq S_2 \supseteq S_3 \;\Rightarrow\; J(S_1) \ge J(S_2) \ge J(S_3)$$
42
Feature Subset Search
 Top K Individual Features
$$J(X_k) = \{J(x_1), \ldots, J(x_k)\}, \quad J(x_1) \ge J(x_2) \ge \ldots \ge J(x_k)$$
 Sequential Forward Selection (a greedy sketch follows this slide)
$$J(X_k + x_1) \ge J(X_k + x_2) \ge \ldots \ge J(X_k + x_{D-k}), \quad x_i \notin X_k$$
 Sequential Backward Selection
$$J(X_k - x_1) \ge J(X_k - x_2) \ge \ldots \ge J(X_k - x_k), \quad x_i \in X_k$$
 Optimization Algorithms
 Simulated Annealing
 Tabu Search
 Genetic Algorithms
43
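A greedy sequential forward selection sketch in MATLAB (assumed example; J is a hypothetical criterion handle, e.g. cross-validated accuracy of a classifier trained on the candidate subset):

function subset = sfs(X, y, k, J)
% SFS  Greedily grow a subset of k features, adding the best feature each step.
D = size(X, 2);
subset = [];
for step = 1:k
    best = -inf; bestf = 0;
    for f = setdiff(1:D, subset)
        score = J(X(:, [subset f]), y);      % evaluate the candidate subset
        if score > best
            best = score; bestf = f;
        end
    end
    subset = [subset bestf];                 % keep the winner
end
end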
Feature Extraction
44
Principal Component Analysis
45
2D Example
46
2D Example
47
Some Math
Goal: find a transformation Y of the data such that its covariance matrix S(Y) is diagonal (nonzero entries on the diagonal, zeros everywhere else).
48
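A minimal MATLAB sketch (assumed example, not from the slides) illustrating this goal: projecting correlated data onto the eigenvectors of its covariance matrix makes the sample covariance diagonal.

X = mvnrnd([0 0], [1 0.8; 0.8 1], 5000);   % correlated 2D data
[V, ~] = eig(cov(X));                       % eigenvectors of the sample covariance
Y = X * V;                                  % project onto the eigenvector basis
cov(Y)                                      % off-diagonal entries are essentially zero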
A Different View
Project each sample onto a unit vector e and measure the total squared reconstruction error:
$$\alpha_k = e^t x_k, \qquad x_k' = \alpha_k e, \qquad \|e\| = 1$$
$$J(e) = \sum_{k=1}^{n} \|x_k' - x_k\|^2 = \sum_{k=1}^{n} \|\alpha_k e - x_k\|^2$$
$$= \sum_{k=1}^{n} \alpha_k^2 \|e\|^2 - 2\sum_{k=1}^{n} \alpha_k e^t x_k + \sum_{k=1}^{n} \|x_k\|^2$$
$$= -\sum_{k=1}^{n} \alpha_k^2 + \sum_{k=1}^{n} \|x_k\|^2 = -\sum_{k=1}^{n} e^t x_k x_k^t e + \sum_{k=1}^{n} \|x_k\|^2$$
Minimizing J(e) is therefore equivalent to
$$\max_{e} \; e^t S e \quad \mathrm{s.t.}\; \|e\| = 1, \qquad S = \sum_{k=1}^{n} x_k x_k^t$$
49
Lagrange Multipliers
$$\max_{x,y} f(x, y) = 3xy \quad \mathrm{s.t.}\; 2x + y = 8$$
$$F(x, y, \lambda) = 3xy - \lambda(2x + y - 8)$$
$$F_x = 3y - 2\lambda = 0, \qquad F_y = 3x - \lambda = 0, \qquad F_\lambda = -(2x + y - 8) = 0$$
$$\Rightarrow\; y = \frac{2\lambda}{3}, \quad x = \frac{\lambda}{3}, \quad \lambda = 6, \quad x = 2, \quad y = 4$$
$$f(2, 4) = 3 \times 2 \times 4 = 24$$
50
More Math ...
$$u = e^t S e - \lambda(e^t e - 1)$$
$$\frac{\partial u}{\partial e} = 2Se - 2\lambda e = 0 \;\Rightarrow\; Se = \lambda e \quad \text{(e: eigenvector, } \lambda \text{: eigenvalue)}$$
$$e^t S e = \lambda\, e^t e = \lambda$$
PCA: project the original data onto the eigenvectors of S with the largest eigenvalues.
$$\begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix} = 3\begin{bmatrix} 1 \\ 1 \end{bmatrix}, \qquad \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}\begin{bmatrix} 1 \\ -1 \end{bmatrix} = 1\begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
51
PCA Examples
s = [1 0.8; 0.8 1];
x = mvnrnd(zeros(10000,2), s);
plot(x(:,1), x(:,2), '.');
axis equal;
grid on;
[V, D] = eig(s);
% D = [0.2 0; 0 1.8]; the columns of V are the corresponding unit eigenvectors,
% proportional to [1; -1] and [1; 1] (about [0.7071; -0.7071] and [0.7071; 0.7071]).
newx = x * V(:,2);   % project onto the principal eigenvector (eigenvalue 1.8)
hist(newx, 50);
52
Fish
53
Fish
54
The Issue of PCA
Now, let’s consider class information ...
55
Linear Discriminant Analysis
 The objective of LDA is to perform dimension reduction while preserving
as much of the class discriminatory information as possible.
 Given a set of d-dimensional vectors x1, x2, ..., xn, of which N1 belong to ω1 and N2 belong to ω2, find, among all possible projection lines y = wTx, the one that maximizes the separability of the two classes.
56
Measure of Separability
57
Fisher Criterion
$$J = \frac{|\mu_1 - \mu_2|^2}{S_1^2 + S_2^2}$$
58
Some Math …
between-class scatter
within-class scatter
59
More Math...
generalized Rayleigh quotient
60
More Math...
SW1SB w  Jw
eigenvector problem!
S B w  1   2 1   2  w  1   2 R
T
R  1   2  w
T
scalar
Jw  S w1 S B w  S w1 1   2 R
w
R 1
S w 1   2 
J
61
LDA Example
 Dataset
 C1 = {(1,2), (2,3), (3,3), (4,5), (5,5)}
 C2 = {(1,0), (2,1), (3,1), (3,2), (5,3), (6,5)}
 Covariance of [c1; c2]:
$$Z = \begin{bmatrix} 2.7636 & 2.2545 \\ 2.2545 & 3.0182 \end{bmatrix}$$
 Eigenvectors and eigenvalues of Z:
$$V = \begin{bmatrix} -0.7268 & 0.6869 \\ 0.6869 & 0.7268 \end{bmatrix}, \qquad D = \begin{bmatrix} 0.6328 & 0 \\ 0 & 5.1490 \end{bmatrix}$$
 The direction of the PCA projection: (0.6869, 0.7268)^T
62
LDA Example
 The mean of each class
$$\mu_1 = \mathrm{mean}(c_1) = (3.0, 3.6)^T, \qquad \mu_2 = \mathrm{mean}(c_2) = (3.3, 2.0)^T$$
$$S_B = (\mu_1 - \mu_2)(\mu_1 - \mu_2)^T = \begin{bmatrix} 0.11 & -0.53 \\ -0.53 & 2.56 \end{bmatrix}$$
 The scatter of each class
$$S_1 = 4 \times \mathrm{cov}(c_1) = \begin{bmatrix} 10 & 8 \\ 8 & 7.2 \end{bmatrix}, \qquad S_2 = 5 \times \mathrm{cov}(c_2) = \begin{bmatrix} 17.3 & 16 \\ 16 & 16 \end{bmatrix}$$
$$S_w = S_1 + S_2 = \begin{bmatrix} 27.3 & 24 \\ 24 & 23.2 \end{bmatrix}$$
63
LDA Example
$$S_w^{-1} S_B = \begin{bmatrix} 0.26 & -1.27 \\ -0.30 & 1.42 \end{bmatrix}$$
 Eigenvectors and eigenvalues of $S_w^{-1} S_B$:
$$V = \begin{bmatrix} -0.98 & 0.67 \\ -0.20 & -0.75 \end{bmatrix}, \qquad D = \begin{bmatrix} 0 & 0 \\ 0 & 1.69 \end{bmatrix}$$
 The direction of the LDA projection: (0.6656, -0.7463)^T
 Alternatively
$$S_w^{-1}(\mu_1 - \mu_2) = (-0.7936, 0.8899)^T$$
 After normalization: (-0.6656, 0.7463)^T
64
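A minimal MATLAB sketch (assumed example, not from the slides) recomputing the LDA direction for the toy dataset above:

c1 = [1 2; 2 3; 3 3; 4 5; 5 5];
c2 = [1 0; 2 1; 3 1; 3 2; 5 3; 6 5];
mu1 = mean(c1)';  mu2 = mean(c2)';
Sw  = 4*cov(c1) + 5*cov(c2);          % within-class scatter
w   = Sw \ (mu1 - mu2);               % optimal projection direction
w   = w / norm(w)                     % about (-0.6656, 0.7463)'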
LDA Example
(Figure: the toy dataset with the PCA and LDA projection directions drawn through it.)
65
C-Class LDA
66
C-Class LDA
 For C-class LDA with C=2, SB is defined as:
$$S_B = N_1(\mu_1 - \mu)(\mu_1 - \mu)^T + N_2(\mu_2 - \mu)(\mu_2 - \mu)^T, \qquad \mu = \frac{N_1\mu_1 + N_2\mu_2}{N}$$
$$= N_1\left(\mu_1 - \frac{N_1\mu_1 + N_2\mu_2}{N}\right)\left(\mu_1 - \frac{N_1\mu_1 + N_2\mu_2}{N}\right)^T + N_2\left(\mu_2 - \frac{N_1\mu_1 + N_2\mu_2}{N}\right)\left(\mu_2 - \frac{N_1\mu_1 + N_2\mu_2}{N}\right)^T$$
$$= \frac{N_1 N_2^2}{N^2}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T + \frac{N_1^2 N_2}{N^2}(\mu_2 - \mu_1)(\mu_2 - \mu_1)^T$$
$$= \frac{N_1 N_2}{N}(\mu_1 - \mu_2)(\mu_1 - \mu_2)^T$$
67
C-Class LDA
N1=1000; N2=500; N3=2000;
X1=mvnrnd([0,0], [1,0;0,1], N1);
X2=mvnrnd([3,6], [1,-0.8;-0.8,1], N2);
X3=mvnrnd([10,6], [1,0.8;0.8,1], N3);
S1=(N1-1)*cov(X1); S2=(N2-1)*cov(X2); S3=(N3-1)*cov(X3);
Sw=S1+S2+S3;
M1=mean(X1); M2=mean(X2); M3=mean(X3);
Mu=(N1*M1+N2*M2+N3*M3)/(N1+N2+N3);
Sb=N1*(M1-Mu)'*(M1-Mu)+N2*(M2-Mu)'*(M2-Mu)+N3*(M3-Mu)'*(M3-Mu);
J=inv(Sw)*Sb;
[V,D]=eig(J);
(Figures: a scatter plot of the three classes and histograms of the data projected onto the two leading LDA directions.)
68
Limitations of LDA
 LDA produces at most C-1 projections
 SB is a matrix with rank C-1 or less.
 SW may be singular.
 LDA does not work well when ...
69
Reading Materials
 M. A. Hernandez and S. J. Stolfo, “Real-world Data is Dirty: Data Cleansing and The
Merge/Purge Problem,” Data Mining and Knowledge Discovery, vol. 2, pp. 9–37, 1998.
 A. Donders, G. van der Heijden, T. Stijnen, and K. Moons, “Review: a gentle introduction to
imputation of missing values,” Journal of Clinical Epidemiology, vol. 59, pp. 1087-1091, 2006.
 N. V. Chawla, K. W. Bowyer, L. O. Hall and W. P. Kegelmeyer, “SMOTE: synthetic minority
over-sampling technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
 N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intelligent
Data Analysis vol. 6, pp. 429–449, 2002.
 D. Keim, “Information Visualization and Visual Data Mining,” IEEE Transactions on
Visualization and Computer Graphics, vol. 7(1), pp. 100-107, 2002.
 PCA Tutorials
 http://www.cs.princeton.edu/picasso/mats/PCA-Tutorial-Intuition_jp.pdf
 http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf
 Lagrange Multipliers
 http://diglib.stanford.edu:8091/~klein/lagrange-multipliers.pdf
70
Review
 How to handle missing values?
 How to detect outliers?
 What is the major issue with imbalanced datasets?
 Why is feature selection important?
 How to search for a good subset of features?
 What does PCA stand for?
 What is the major issue of PCA?
 What does LDA stand for?
 What is the major issue of LDA?
71
Next Week’s Class Talk
 Volunteers are required for next week’s class talk.
 Topic: Data Visualization Tools
 Length: 20 minutes plus question time
 Suggested Points of Interest
 Software
• CiteSpace
• GUESS
• Gephi
 Functionality
 Demos
 Features
 Case Studies
72