Data Preparation
Data Mining: Concepts and Techniques
— Chapter 3 —

©Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
www.cs.sfu.ca, www.cs.uiuc.edu/~hanj

Thanks: Ian Witten and Eibe Frank, authors of the Weka book
Why Data Preprocessing?

Data in the real world is dirty:
- incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - e.g., occupation=""
- noisy: containing errors or outliers
  - e.g., Salary="-10"
- inconsistent: containing discrepancies in codes or names
  - e.g., Age="42" but Birthday="03/07/1997"
  - e.g., rating was "1, 2, 3", now rating is "A, B, C"
  - e.g., discrepancies between duplicate records

Incomplete data comes from:
- "n/a" data values when collected
- different considerations between the time the data was collected and the time it is analyzed
- human/hardware/software problems

Noisy data comes from the process of data:
- collection
- entry
- transmission

Inconsistent data comes from:
- different data sources
- functional dependency violations
Major Tasks in Data Preprocessing

- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, of particular importance for numerical data
How to Handle Missing Data?

1. Obtain it: may incur costs
2. Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
3. Fill in the missing value manually: tedious + infeasible?
4. Fill it in automatically with
   - a global constant: e.g., "unknown", a new class?!
   - the attribute mean
   - the attribute mean for all samples belonging to the same class: smarter
   - the most probable value: inference-based, such as a Bayesian formula or a decision tree
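A minimal Python sketch of strategies 2 and 4 (using pandas; the toy table and its values are hypothetical):

import pandas as pd

# Hypothetical toy table with one missing humidity value.
df = pd.DataFrame({
    "outlook":  ["sunny", "rainy", "sunny", "overcast"],
    "humidity": [85.0, None, 70.0, 86.0],
    "play":     ["no", "yes", "yes", "yes"],
})

# Strategy 2: ignore tuples whose class label is missing.
df = df.dropna(subset=["play"])

# Strategy 4: fill automatically with a global constant ...
const_fill = df["humidity"].fillna(-1.0)

# ... with the attribute mean ...
mean_fill = df["humidity"].fillna(df["humidity"].mean())

# ... or with the attribute mean of samples in the same class (smarter).
class_mean_fill = df.groupby("play")["humidity"].transform(
    lambda s: s.fillna(s.mean())
)
print(class_mean_fill.tolist())   # [85.0, 78.0, 70.0, 86.0]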
How to Handle Noisy Data?

- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, bin medians, bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions
Example: Weather Data

- Nominal weather data
- Numerical weather data
- Numerical class values: CPU data
Java Package for Data Mining

[Figure: two screenshots, panels (a) and (b), of the Java data mining package]
Metadata

- Information about the data that encodes background knowledge
- Can be used to restrict the search space
- Examples:
  - Dimensional considerations (i.e., expressions must be dimensionally correct)
  - Circular orderings (e.g., degrees on a compass)
  - Partial orderings (e.g., generalization/specialization relations)
The ARFF format

%
% ARFF file for weather data with some numeric features
%
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute humidity numeric
@attribute windy {true, false}
@attribute play? {yes, no}

@data
sunny, 85, 85, false, no
sunny, 80, 90, true, no
overcast, 83, 86, false, yes
...
Weather.arff

@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature real
@attribute humidity real
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
@data
sunny,85,85,FALSE,no
sunny,80,90,TRUE,no
overcast,83,86,FALSE,yes
rainy,70,96,FALSE,yes
rainy,68,80,FALSE,yes
rainy,65,70,TRUE,no
overcast,64,65,TRUE,yes
sunny,72,95,FALSE,no
sunny,69,70,FALSE,yes
rainy,75,80,FALSE,yes
sunny,75,70,TRUE,yes
overcast,72,90,TRUE,yes
overcast,81,75,FALSE,yes
rainy,71,91,TRUE,no
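For reference, a sketch of reading such a file in Python with scipy's ARFF loader (assuming the listing above is saved as weather.arff):

from scipy.io import arff
import pandas as pd

# Parse the header and data section of the ARFF file shown above.
data, meta = arff.loadarff("weather.arff")
df = pd.DataFrame(data)

# Nominal values come back as bytes; decode them to strings.
for col in df.select_dtypes(include=object):
    df[col] = df[col].str.decode("utf-8")

print(meta)        # attribute names and types from the @attribute lines
print(df.head())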
Weather.nominal.arff

@relation weather.symbolic

@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
@data
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no
Numerical Class values
@relation 'cpu'
@attribute vendor { adviser, amdahl, apollo, basf, bti, burroughs, c.r.d, cdc, cambex, dec, dg, formation, four-phase, gould, hp, harris,
honeywell, ibm, ipl, magnuson, microdata, nas, ncr, nixdorf, perkin-elmer, prime, siemens, sperry, sratus, wang}
@attribute MYCT real
@attribute MMIN real
@attribute MMAX real
@attribute CACH real
@attribute CHMIN real
@attribute CHMAX real
@attribute class real
@data
adviser,125,256,6000,256,16,128,199
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,32000,32,8,32,253
amdahl,29,8000,16000,32,8,16,132
amdahl,26,8000,32000,64,8,32,290
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,32000,64,16,32,381
amdahl,23,16000,64000,64,16,32,749
amdahl,23,32000,64000,128,32,64,1238
apollo,400,1000,3000,0,1,2,23
apollo,400,512,3500,4,1,6,24
basf,60,2000,8000,65,1,8,70
basf,50,4000,16000,65,1,8,117
Attribute types used in practice

- Most schemes accommodate just two levels of measurement: nominal and ordinal
- Nominal attributes are also called "categorical", "enumerated", or "discrete"
  - But: "enumerated" and "discrete" imply order
  - Special case: dichotomy ("boolean" attribute)
- Ordinal attributes are also called "numeric" or "continuous"
  - But: "continuous" implies mathematical continuity
Nominal quantities

- Values are distinct symbols
  - Values themselves serve only as labels or names
  - "Nominal" comes from the Latin word for name
- Example: attribute "outlook" from the weather data
  - Values: "sunny", "overcast", and "rainy"
- No relation is implied among nominal values (no ordering or distance measure)
- Only equality tests can be performed
Ordinal quantities

- Impose an order on values
- But: no distance between values is defined
- Example: attribute "temperature" in the weather data
  - Values: "hot" > "mild" > "cool"
- Note: addition and subtraction don't make sense
- Example rule: temperature < hot → play = yes
- The distinction between nominal and ordinal is not always clear (e.g., attribute "outlook")
1. Data Transformation

- Numerical → Nominal
- Two kinds:
  - Supervised
    - Example: weather data (temperature attribute) and the Play attribute (class)
  - Unsupervised
    - Example: divide into three ranges (small, mid, large)
      4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
Discretization

- Three types of attributes:
  - Nominal — values from an unordered set
  - Ordinal — values from an ordered set
  - Continuous — real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis
Discretization and concept hierarchy generation for numeric data

- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning
Simple Discretization Methods: Binning

- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A)/N
  - The most straightforward method
  - But outliers may dominate the presentation: skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

(A sketch of both partitionings follows.)
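A minimal numpy sketch of both partitionings, applied to the twelve price values used in the smoothing example on the next slide (N = 3):

import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
N = 3

# Equal-width: W = (B - A)/N, giving edges A, A+W, A+2W, B.
A, B = prices.min(), prices.max()
width_edges = np.linspace(A, B, N + 1)    # [ 4. 14. 24. 34.]

# Equal-depth: edges at evenly spaced quantiles, so each bin
# holds roughly the same number of samples.
depth_edges = np.quantile(prices, np.linspace(0, 1, N + 1))

print(width_edges)
print(depth_edges)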
Binning Methods for Data Smoothing

* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into 3 (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
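The same smoothing steps as a short Python sketch, reproducing the bins above:

import numpy as np

prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
bins = [prices[i:i + 4] for i in range(0, len(prices), 4)]  # 3 equi-depth bins

# Smoothing by bin means: each value becomes its bin's (rounded) mean.
by_means = [[round(np.mean(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the nearer of
# its bin's minimum and maximum.
by_bounds = [[min((b[0], b[-1]), key=lambda e: abs(v - e)) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]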
Histograms

- A popular data reduction technique
- Divide data into buckets and store the average (sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

[Figure: histogram with value ranges from 10,000 to 100,000 on the x-axis and counts from 0 to 40 on the y-axis]
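A sketch of the reduction idea in numpy (the values are synthetic, chosen only to match the range in the figure):

import numpy as np

# Hypothetical values in the 10,000-100,000 range from the figure.
rng = np.random.default_rng(0)
values = rng.normal(55_000, 15_000, size=1_000).clip(10_000, 99_999)

# Reduce 1,000 raw values to 9 buckets: keep only a count and an
# average per bucket instead of the individual values.
edges = np.arange(10_000, 110_000, 10_000)
counts, _ = np.histogram(values, bins=edges)
averages = [values[(values >= lo) & (values < hi)].mean() if c else None
            for lo, hi, c in zip(edges[:-1], edges[1:], counts)]

print(counts)
print(averages)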
Entropy-Based Discretization

- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is

    E(S,T) = (|S1| / |S|) * Ent(S1) + (|S2| / |S|) * Ent(S2)

- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
- The process is applied recursively to the partitions obtained, until some stopping criterion is met, e.g.,

    Ent(S) - E(T,S) < δ

- Experiments show that it may reduce data size and improve classification accuracy
How to Calculate Ent(S)?

- Given two classes, Yes and No, in a set S:
  - Let p1 be the proportion of Yes
  - Let p2 be the proportion of No
  - p1 + p2 = 100%
- Entropy is:

    Ent(S) = -p1 * log(p1) - p2 * log(p2)

- When p1 = 1 and p2 = 0, Ent(S) = 0
- When p1 = 50% and p2 = 50%, Ent(S) is at its maximum!
- See the TA's tutorial notes for an example.
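A minimal sketch of Ent(S) and the boundary search in Python, assuming log base 2 and using the temperature/play columns of the weather data from the earlier slides:

import math

def ent(labels):
    # Ent(S) = -sum of p_i * log2(p_i) over the class proportions.
    n = len(labels)
    return -sum(labels.count(c) / n * math.log2(labels.count(c) / n)
                for c in set(labels))

def best_boundary(values, labels):
    # Minimize E(S,T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)
    # over all candidate boundaries T between adjacent values.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_t, best_e = None, float("inf")
    for i in range(1, n):
        s1 = [c for _, c in pairs[:i]]
        s2 = [c for _, c in pairs[i:]]
        e = len(s1) / n * ent(s1) + len(s2) / n * ent(s2)
        if e < best_e:
            best_t, best_e = (pairs[i - 1][0] + pairs[i][0]) / 2, e
    return best_t, best_e

# Temperature values and play labels from Weather.arff above.
temps  = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["yes", "no", "yes", "yes", "yes", "no", "no",
          "yes", "yes", "yes", "no", "yes", "yes", "no"]
print(best_boundary(temps, labels))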
Transformation: Normalization

- min-max normalization:

    v' = (v - minA) / (maxA - minA) * (new_maxA - new_minA) + new_minA

- z-score normalization:

    v' = (v - meanA) / stand_devA

- normalization by decimal scaling:

    v' = v / 10^j

  where j is the smallest integer such that max(|v'|) < 1
Transforming Ordinal to Boolean

- A simple transformation allows an ordinal attribute with n values to be coded using n-1 boolean attributes
- Example: attribute "temperature"

  Original data      Transformed data
  Temperature        Temperature > cold   Temperature > medium
  Cold               False                False
  Medium             True                 False
  Hot                True                 True

- Why? It avoids introducing a distance concept between different values, e.g., colors: "Red" vs. "Blue" vs. "Green"
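A sketch of this coding in pandas, reproducing the table above:

import pandas as pd

# Ordinal attribute with the order cold < medium < hot.
temp = pd.Series(["cold", "medium", "hot", "medium"])
order = ["cold", "medium", "hot"]
rank = temp.map({v: i for i, v in enumerate(order)})

# n = 3 values -> n - 1 = 2 boolean attributes, as in the table above.
booleans = pd.DataFrame({
    f"temperature > {v}": rank > i for i, v in enumerate(order[:-1])
})
print(booleans)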
2. Feature Selection

- Feature selection (i.e., attribute subset selection):
  - Select a minimum set of features such that the probability distribution of the different classes given the values of those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of attributes in the discovered patterns, which makes them easier to understand
- Heuristic methods (due to the exponential number of choices):
  - step-wise forward selection
  - step-wise backward elimination
  - combining forward selection and backward elimination
  - decision-tree induction
Feature Selection (Example)

Which subset of features would you select?

  ID | Outlook | Temperature | Humidity | Windy | Play
   1 |     100 |          40 |       90 |     0 |   -1
   2 |     100 |          40 |       90 |     1 |   -1
   3 |      50 |          40 |       90 |     0 |    1
   4 |      10 |          30 |       90 |     0 |    1
   5 |      10 |          15 |       70 |     0 |    1
   6 |      10 |          15 |       70 |     1 |   -1
   7 |      50 |          15 |       70 |     1 |    1
   8 |     100 |          30 |       90 |     0 |   -1
   9 |     100 |          15 |       70 |     0 |    1
  10 |      10 |          30 |       70 |     0 |    1
  11 |     100 |          30 |       70 |     1 |    1
  12 |      50 |          30 |       90 |     1 |    1
  13 |      50 |          40 |       70 |     0 |    1
  14 |      10 |          30 |       90 |     1 |   -1
Some Essential Feature Selection Techniques

- Decision Tree Induction
- Chi-squared Tests
- Principal Component Analysis
- Linear Regression Analysis
Example of Decision Tree Induction

Initial attribute set: {A1, A2, A3, A4, A5, A6}

[Figure: decision tree splitting on A4 at the root, then on A1 and A6, with leaves labeled Class 1 and Class 2]

Reduced attribute set: {A1, A4, A6}
Chi-Squared Test (http://helios.bto.ed.ac.uk/bto/statistics/tress9.html)

- Suppose that the ratio of male to female students in the Science Faculty is exactly 1:1, but in the Honours class over the past ten years there have been 80 females and 40 males.
- Question: Is this a significant departure from the (1:1) expectation?

                        Female |  Male | Total
  Observed numbers (O)      80 |    40 |   120
  Expected numbers (E)      60 |    60 |   120
  O - E                     20 |   -20 |     0
  (O - E)^2                400 |   400 |
  (O - E)^2 / E           6.67 |  6.67 | 13.34 = X^2
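The same calculation as a scipy sketch:

from scipy.stats import chisquare

observed = [80, 40]   # females, males observed in the Honours class
expected = [60, 60]   # the 1:1 expectation for 120 students

stat, p = chisquare(observed, f_exp=expected)
print(stat)   # 13.33... = sum of (O - E)^2 / E, as in the table
print(p)      # well below 0.05: a significant departure from 1:1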
Principal Component Analysis

[Figure: data points in the (X1, X2) plane with principal component axes Y1 and Y2]
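A minimal sketch of PCA via the covariance matrix, on synthetic 2-D data standing in for the (X1, X2) points in the figure:

import numpy as np

# Hypothetical correlated 2-D data in the (X1, X2) plane.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
X = np.column_stack([x1, 0.8 * x1 + 0.3 * rng.normal(size=200)])

# The principal axes Y1, Y2 are the eigenvectors of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
Y = (X - X.mean(axis=0)) @ eigvecs[:, ::-1]  # project so Y1 comes first

print(eigvals[::-1])   # variance along Y1 dominates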
Regression

[Figure: linear regression of Height (y-axis) on Age (x-axis); the fitted line y = x + 1 maps the observed point at X1 to the smoothed value Y1']
5. Data Reduction with Sampling

- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling (see the sketch below):
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (data is read a page at a time)
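A minimal pandas sketch of stratified sampling, assuming a hypothetical skewed class distribution:

import pandas as pd

# Hypothetical skewed data: 90% of tuples in class A, 10% in class B.
df = pd.DataFrame({"x": range(1000),
                   "cls": ["A"] * 900 + ["B"] * 100})

# Stratified 10% sample: each class keeps its overall percentage,
# which a simple random sample would only approximate.
sample = df.groupby("cls", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0))

print(sample["cls"].value_counts())   # A: 90, B: 10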
Sampling

[Figure: random sample drawn from the raw data]

Sampling

[Figure: cluster/stratified sample drawn from the raw data]
Summary

- Data preparation is a big issue for both warehousing and mining
- Data preparation includes:
  - Data cleaning and data integration
  - Data reduction and feature selection
  - Discretization
- Many methods have been developed, but this is still an active area of research