Attribute Selection

Exploratory Data Mining and
Data Preparation
Fall 2003
Data Mining
The Data Mining Process
[Figure: process cycle centered on the data, with stages Business understanding, Data evaluation, Data preparation, Modeling, Evaluation, Deployment]
Exploratory Data Mining
Preliminary process
  Data summaries
    Attribute means
    Attribute variation
    Attribute relationships
  Visualization
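The data summaries listed above can be sketched in a few lines. This is a minimal illustration, not how Weka computes them; the `rows` data and the attribute names `age` and `income` are made up for the example.

```python
import statistics

# Hypothetical toy data: two numeric attributes, one value missing (None).
rows = [
    {"age": 25, "income": 30.0},
    {"age": 35, "income": 50.0},
    {"age": 45, "income": None},
    {"age": 55, "income": 90.0},
]

def summarize(rows, name):
    """Return mean, standard deviation, and fraction missing for one attribute."""
    values = [r[name] for r in rows if r[name] is not None]
    return {
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "missing": 1 - len(values) / len(rows),
    }

def pearson(rows, a, b):
    """Pearson correlation over the rows where both attributes are present."""
    pairs = [(r[a], r[b]) for r in rows if r[a] is not None and r[b] is not None]
    xs, ys = zip(*pairs)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in pairs)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

age_summary = summarize(rows, "age")          # attribute mean and variation
income_summary = summarize(rows, "income")    # also reports the missing fraction
age_income_corr = pearson(rows, "age", "income")  # attribute relationship
```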
Summary Statistics
Possible Problems:
• Many missing values (16%)
• No examples of one value
Select an attribute
Appears to be a good predictor of the class
Visualization
Exploratory DM Process
For each attribute:
  Look at data summaries
    Identify potential problems and decide if an action needs to be taken (may require collecting more data)
  Visualize the distribution
    Identify potential problems (e.g., one dominant attribute value, even distribution, etc.)
  Evaluate usefulness of attributes
Weka Filters
Weka has many filters that are helpful in preprocessing the data
  Attribute filters
    Add, remove, or transform attributes
  Instance filters
    Add, remove, or transform instances
Process
  Choose from drop-down menu
  Edit parameters (if any)
  Apply
Data Preprocessing
Data cleaning
  Missing values, noisy or inconsistent data
Data integration/transformation
Data reduction
  Dimensionality reduction, data compression, numerosity reduction
Discretization
Data Cleaning
Missing values
  Weka reports % of missing values
  Can use filter called ReplaceMissingValues
Noisy data
  Due to uncertainty or errors
  Weka reports unique values
  Useful filters include
    RemoveMisclassified
    MergeTwoValues
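Weka's ReplaceMissingValues filter fills gaps with the attribute mean (numeric) or mode (nominal). The idea can be sketched as below; this is an illustrative stand-in written for these notes, not Weka's code, and `replace_missing` is a hypothetical helper name.

```python
from collections import Counter
from statistics import mean

def replace_missing(column):
    """Fill None entries with the mean (numeric) or the mode (nominal)."""
    present = [v for v in column if v is not None]
    if all(isinstance(v, (int, float)) for v in present):
        fill = mean(present)
    else:
        # most_common(1) returns [(value, count)] for the most frequent value
        fill = Counter(present).most_common(1)[0][0]
    return [fill if v is None else v for v in column]

numeric = replace_missing([1.0, None, 3.0])
nominal = replace_missing(["a", "b", None, "a"])
```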
Data Transformation
Why transform data?
  Combine attributes. For example, the ratio of two attributes might be more useful than keeping them separate
  Normalizing data. Having attributes on the same approximate scale helps many data mining algorithms (hence better models)
  Simplifying data. For example, working with discrete data is often more intuitive and helps the algorithms (hence better models)
Weka Filters
The data transformation filters in Weka include:
  Add
  AddExpression
  MakeIndicator
  NumericTransform
  Normalize
  Standardize
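The two scaling filters named above correspond to two common rescalings: mapping values to [0, 1], and shifting to zero mean with unit variance. The sketch below shows both under stated assumptions: the function names are made up for this example, and the use of the population standard deviation is a choice made here, not a statement about Weka's implementation.

```python
from statistics import mean, pstdev

def normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant attribute: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Shift and scale values to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)  # population std dev (assumption)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

scaled = normalize([2, 4, 6])
z_scores = standardize([2, 4, 6])
```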
Discretization
Discretization reduces the number of values for a continuous attribute
Why?
  Some methods can only use nominal data
    E.g., in Weka ID3 and Apriori algorithms
  Helpful if data needs to be sorted frequently (e.g., when constructing a decision tree)
Unsupervised Discretization
Unsupervised - does not account for classes
Equal-interval binning
Equal-frequency binning
[Figure: the sorted attribute values below with their class labels, divided into bins by each method]
Value  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Class  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
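Both unsupervised methods can be tried on the 14 values from the slide. This is a minimal sketch; the choice of four bins and the function names are assumptions made for illustration, and the bin boundaries in the original figure may differ.

```python
values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]

def equal_interval_bins(values, k):
    """Assign each value to one of k bins of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

def equal_frequency_bins(values, k):
    """Assign values to k bins holding roughly the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        # Caveat: tied values (such as the two 72s) can end up in different
        # bins here; a careful version would place cuts between distinct values.
        bins[i] = rank * k // len(values)
    return bins

interval_bins = equal_interval_bins(values, 4)
frequency_bins = equal_frequency_bins(values, 4)
interval_sizes = [interval_bins.count(b) for b in range(4)]
frequency_sizes = [frequency_bins.count(b) for b in range(4)]
```

Equal-interval bins can hold very different numbers of instances, while equal-frequency bins can have very different widths.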
Supervised Discretization
Take classification into account
Use “entropy” to measure information gain
Goal: Discretize into 'pure' intervals
Usually no way to get completely pure intervals:
[Figure: the sorted values below with candidate split points labeled A to F; one split separates 1 yes from 8 yes & 5 no, another separates 9 yes & 4 no from 1 no]
Value  64   65   68   69   70   71   72   72   75   75   80   81   83   85
Class  Yes  No   Yes  Yes  Yes  No   No   Yes  Yes  Yes  No   Yes  Yes  No
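The entropy-based criterion can be checked on the slide's data. The sketch below scores a single cut point by information gain; it is an illustration only (a full supervised discretizer would apply this recursively with a stopping rule), and the function names are chosen for this example.

```python
from math import log2

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return -sum((k / n) * log2(k / n) for k in counts.values())

def information_gain(values, labels, cut):
    """Entropy reduction from splitting into value < cut and value >= cut."""
    left = [c for v, c in zip(values, labels) if v < cut]
    right = [c for v, c in zip(values, labels) if v >= cut]
    n = len(labels)
    remainder = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - remainder

def best_cut(values, labels):
    """Try the midpoint between each pair of adjacent distinct values."""
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
    return max(candidates, key=lambda cut: information_gain(values, labels, cut))

base = entropy(labels)                                   # 9 yes, 5 no
gain_first = information_gain(values, labels, 64.5)      # 1 yes | 8 yes & 5 no
gain_last = information_gain(values, labels, 84.0)       # 9 yes & 4 no | 1 no
```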
Error-Based Discretization
Count number of misclassifications
  Majority class determines prediction
  Count instances that are different
Must restrict the number of intervals in advance.
Complexity
  Brute-force: exponential time
  Dynamic programming: linear time
Downside: cannot generate adjacent intervals with same label
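The error count for a given set of cut points can be sketched as follows. The data is the 14-value example used for the earlier discretization slides; `interval_errors` is a hypothetical helper that only scores a discretization and does not search for the best one.

```python
from collections import Counter

values = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
labels = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No",
          "Yes", "Yes", "Yes", "No", "Yes", "Yes", "No"]

def interval_errors(values, labels, cuts):
    """Count instances that differ from the majority class of their interval."""
    bounds = [float("-inf")] + sorted(cuts) + [float("inf")]
    errors = 0
    for lo, hi in zip(bounds, bounds[1:]):
        inside = [c for v, c in zip(values, labels) if lo <= v < hi]
        if inside:
            majority_count = Counter(inside).most_common(1)[0][1]
            errors += len(inside) - majority_count
    return errors

errors_no_cut = interval_errors(values, labels, [])      # one interval, predict Yes
errors_one_cut = interval_errors(values, labels, [84])   # split off the last value
```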
Weka Filter
Attribute Selection
Before inducing a model we almost always do input engineering
The most useful part of this is attribute selection (also called feature selection)
  Select relevant attributes
  Remove redundant and/or irrelevant attributes
Why?
Reasons for Attribute Selection
Simpler model
  More transparent
  Easier to interpret
Faster model induction
  What about overall time?
Structural knowledge
  Knowing which attributes are important may be inherently important to the application
What about the accuracy?
Attribute Selection Methods
What is evaluated?

Evaluation method     Attributes   Subsets of attributes
Independent           Filters      Filters
Learning algorithm                 Wrappers
Filters
Results in either
  Ranked list of attributes
    Typical when each attribute is evaluated individually
    Must select how many to keep
  A selected subset of attributes
    Forward selection
    Best first
    Random search such as genetic algorithm
Filter Evaluation Examples
Information Gain
Gain ratio
Relief
Correlation
  High correlation with class attribute
  Low correlation with other attributes
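A filter that evaluates attributes individually produces a ranked list. The sketch below ranks nominal attributes by information gain; the four-row dataset and the attribute names `a`, `b`, `c` are invented for illustration, and how many attributes to keep (`top_k`) is left to the user, as noted on the Filters slide.

```python
from math import log2
from collections import Counter, defaultdict

# Hypothetical toy data: each row is (attribute values, class label).
data = [
    ({"a": "x", "b": "p", "c": "u"}, "yes"),
    ({"a": "x", "b": "q", "c": "u"}, "yes"),
    ({"a": "y", "b": "p", "c": "u"}, "no"),
    ({"a": "y", "b": "q", "c": "v"}, "no"),
]

def entropy(labels):
    """Entropy in bits of a list of class labels."""
    n = len(labels)
    return -sum((k / n) * log2(k / n) for k in Counter(labels).values())

def info_gain(data, attribute):
    """Class entropy minus the weighted entropy after splitting on attribute."""
    groups = defaultdict(list)
    for row, label in data:
        groups[row[attribute]].append(label)
    labels = [label for _, label in data]
    remainder = sum(len(g) / len(data) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def rank_attributes(data):
    """Return attribute names sorted from most to least informative."""
    attributes = data[0][0].keys()
    return sorted(attributes, key=lambda a: info_gain(data, a), reverse=True)

ranking = rank_attributes(data)
top_k = ranking[:2]  # the number to keep is a user decision
```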
Wrappers
“Wrap around” the learning algorithm
Must therefore always evaluate subsets
Return the best subset of attributes
Apply for each learning algorithm
Use same search methods as before
[Flowchart: Select a subset of attributes -> Induce learning algorithm on this subset -> Evaluate the resulting model (e.g., accuracy) -> Stop? If no, select another subset; if yes, finish]
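The wrapper loop can be sketched with forward selection as the search method. Everything specific here is an assumption made for illustration: the learning algorithm is 1-nearest-neighbour scored by leave-one-out accuracy, the stopping rule is "no improvement", and the six-instance dataset is made up so that attribute 0 is informative and attribute 1 is misleading.

```python
def loo_accuracy(X, y, subset):
    """Leave-one-out accuracy of 1-nearest-neighbour using only subset columns."""
    correct = 0
    for i in range(len(X)):
        best_j, best_d = None, float("inf")
        for j in range(len(X)):
            if j == i:
                continue
            d = sum((X[i][a] - X[j][a]) ** 2 for a in subset)
            if d < best_d:
                best_j, best_d = j, d
        correct += y[best_j] == y[i]
    return correct / len(X)

def forward_selection(X, y, evaluate):
    """Greedy wrapper: add the attribute that most improves the score."""
    remaining = set(range(len(X[0])))
    selected, best_score = [], 0.0
    while remaining:
        # Select a subset, induce and evaluate a model for each candidate.
        score, attr = max((evaluate(X, y, selected + [a]), a)
                          for a in sorted(remaining))
        if score <= best_score:
            break  # Stop? Yes: no candidate improves the model
        selected.append(attr)
        remaining.remove(attr)
        best_score = score
    return selected, best_score

# Hypothetical data: column 0 separates the classes, column 1 does not.
X = [[0.0, 5.0], [0.1, 1.0], [0.2, 9.0], [1.0, 5.1], [1.1, 1.1], [1.2, 9.1]]
y = [0, 0, 0, 1, 1, 1]
chosen, score = forward_selection(X, y, loo_accuracy)
```

Because every candidate subset requires inducing and evaluating a model, the wrapper is much more expensive than a filter.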
How does it help?
Naïve Bayes
Instance-based learning
Decision tree induction
Scalability
Data mining uses mostly well-developed techniques (AI, statistics, optimization)
Key difference: very large databases
How to deal with scalability problems?
Scalability: the capability of handling increased load in a way that does not affect the performance adversely
Massive Datasets
Very large data sets (millions+ of instances, hundreds+ of attributes)
Scalability in space and time
  Data set cannot be kept in memory
    E.g., processing one instance at a time
  Learning time very long
    How does the time depend on the input?
    Number of attributes, number of instances
Two Approaches
Increased computational power
  Only works if algorithms can be sped up
  Must have the computing availability
Adapt algorithms
  Automatically scale down the problem so that it is always approximately the same difficulty
Computational Complexity
We want to design algorithms with good computational complexity
[Figure: time versus number of instances (or number of attributes) for exponential, polynomial, linear, and logarithmic growth]
Example: Big-Oh Notation
Define
  n = number of instances
  m = number of attributes
Going once through all the instances has complexity O(n)
Examples
  Polynomial complexity: O(mn^2)
  Linear complexity: O(m+n)
  Exponential complexity: O(2^n)
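To get a feel for how differently these classes grow, the expressions can simply be evaluated for a few input sizes. This is an illustrative calculation only: the sizes and the value of m are arbitrary, and Big-Oh notation ignores the constant factors a real algorithm would have.

```python
def operation_counts(n, m):
    """Evaluate the example complexity expressions for n instances, m attributes."""
    return {
        "O(n)": n,
        "O(m+n)": m + n,
        "O(mn^2)": m * n ** 2,
        "O(2^n)": 2 ** n,
    }

# Doubling n doubles the linear terms, quadruples mn^2, and squares 2^n.
for n in (10, 20, 40):
    print(n, operation_counts(n, m=5))
```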
Classification
Roughly, a problem is NP-complete if it is among the hardest problems in NP; no polynomial-time algorithm is known for any of them
Finding the optimal decision tree is an example of an NP-complete problem
However, ID3 and C4.5 are polynomial time algorithms
  Heuristic algorithms to construct solutions to a difficult problem
  “Efficient” from a computational complexity standpoint but still have a scalability problem
Decision Tree Algorithms
Traditional decision tree algorithms assume training set kept in memory
Swapping in and out of main and cache memory expensive
Solution:
  Partition data into subsets
  Build a classifier on each subset
  Combine classifiers
  Not as accurate as a single classifier
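The partition-and-combine idea can be sketched as below. To keep the example short, the base learner is a nearest-centroid classifier on a single numeric attribute rather than a decision tree, and the classifiers are combined by majority vote; both choices, the function names, and the data are assumptions made for illustration.

```python
from collections import Counter

def train_centroids(chunk):
    """Hypothetical base learner: store the mean attribute value of each class."""
    sums, counts = {}, Counter()
    for x, label in chunk:
        sums[label] = sums.get(label, 0.0) + x
        counts[label] += 1
    return {label: sums[label] / counts[label] for label in sums}

def predict(model, x):
    """Predict the class whose centroid is closest to x."""
    return min(model, key=lambda label: abs(model[label] - x))

def train_partitioned(data, n_parts):
    """Build one classifier per partition; each sees only its own chunk."""
    chunks = [data[i::n_parts] for i in range(n_parts)]
    return [train_centroids(chunk) for chunk in chunks if chunk]

def vote(models, x):
    """Combine the classifiers by majority vote."""
    return Counter(predict(m, x) for m in models).most_common(1)[0][0]

data = [(1.0, "a"), (1.2, "a"), (0.8, "a"), (5.0, "b"), (5.2, "b"), (4.8, "b")]
models = train_partitioned(data, n_parts=3)
low_prediction = vote(models, 1.1)
high_prediction = vote(models, 4.9)
```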
Other Classification Examples
Instance-Based Learning
  Goes through instances one at a time
  Compares with new instance
  Polynomial complexity O(mn)
  Response time may be slow, however
Naïve Bayes
  Polynomial complexity
  Stores a very large model
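The O(mn) cost of instance-based learning is visible in a direct implementation: one pass over the n stored instances, each comparison touching m attributes. This is a minimal 1-nearest-neighbour sketch with made-up training data, not Weka's implementation.

```python
def nearest_neighbor(train, query):
    """Classify query with 1-nearest-neighbour; the cost is O(mn) per query."""
    best_label, best_dist = None, float("inf")
    for attributes, label in train:  # n stored instances
        # m attribute comparisons per instance (squared Euclidean distance)
        dist = sum((a - q) ** 2 for a, q in zip(attributes, query))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

# Hypothetical training set: two numeric attributes, two classes.
train = [([1.0, 1.0], "low"), ([1.5, 0.5], "low"), ([8.0, 9.0], "high")]
prediction = nearest_neighbor(train, [7.0, 8.0])
```

There is no training step, so all of the work happens at prediction time, which is why response time may be slow.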
Data Reduction
Another way is to reduce the size of the data before applying a learning algorithm (preprocessing)
Some strategies
  Dimensionality reduction
  Data compression
  Numerosity reduction
Dimensionality Reduction
Remove irrelevant, weakly relevant, and redundant attributes
Attribute selection
  Many methods available
  E.g., forward selection, backwards elimination, genetic algorithm search
Often much smaller problem
Often little degradation in predictive performance or even better performance
Data Compression
Also aim for dimensionality reduction
Transform the data into a smaller space
Principal Component Analysis
  Normalize data
  Compute c orthonormal vectors, or principal components, that provide a basis for normalized data
  Sort according to decreasing significance
  Eliminate the weaker components
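The PCA steps can be sketched without any libraries for the simplest case. This sketch normalizes the data, then finds only the first (most significant) principal component by power iteration, which is a technique chosen here for brevity; a full implementation would compute all c components with an eigendecomposition. The data and function names are made up for illustration.

```python
from statistics import mean, pstdev

def normalize_columns(rows):
    """Standardize each attribute to zero mean and unit variance."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) or 1.0 for c in cols]
    return [[(v - mu) / sd for v, mu, sd in zip(row, mus, sds)] for row in rows]

def covariance_matrix(rows):
    """Covariance of already-centered rows."""
    n, m = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / n for j in range(m)]
            for i in range(m)]

def first_component(cov, iterations=200):
    """Power iteration: converges to the direction of largest variance."""
    # Caveat: a start vector orthogonal to that direction would not converge.
    v = [1.0 / (i + 1) for i in range(len(cov))]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(len(v))) for i in range(len(v))]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def project(rows, component):
    """Replace each instance by its coordinate along the component."""
    return [sum(a * b for a, b in zip(row, component)) for row in rows]

# Hypothetical data: two strongly correlated attributes.
rows = [[1, 2.0], [2, 4.1], [3, 5.9], [4, 8.2], [5, 9.8]]
normalized = normalize_columns(rows)
pc1 = first_component(covariance_matrix(normalized))
reduced = project(normalized, pc1)  # two attributes compressed into one
```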
PCA: Example
Numerosity Reduction
Replace data with an alternative, smaller data representation
  Histogram
1,1,5,5,5,5,5,8,8,10,10,10,10,12,14,14,14,15,15,15,
15,15,15,18,18,18,18,18,18,18,18,20,20,20,20,20,
20,20,21,21,21,21,25,25,25,25,25,28,28,30,30,30
Buckets: 1-10, 11-20, 21-30
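The histogram for the slide's data can be computed directly: the 52 values are replaced by three bucket counts. A minimal sketch; `bucket_counts` is a helper name made up for this example and assumes positive integer values with buckets starting at 1.

```python
data = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
        15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20,
        20, 20, 21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

def bucket_counts(values, width):
    """Count values per equal-width bucket (1..width, width+1..2*width, ...)."""
    counts = {}
    for v in values:
        lo = (v - 1) // width * width + 1
        bucket = (lo, lo + width - 1)
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))

histogram = bucket_counts(data, 10)
```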
Other Numerosity Reduction
Clustering
  Data objects (instances) that are in the same cluster can be treated as the same instance
  Must use a scalable clustering algorithm
Sampling
  Randomly select a subset of the instances to be used
Sampling Techniques
Different samples
  Sample without replacement
  Sample with replacement
  Cluster sample
  Stratified sample
Complexity of sampling is actually sublinear: O(s), where s is the sample size and s << n
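Three of these sampling schemes can be sketched with Python's standard `random` module. The function names, the default seed, and the 80/20 toy dataset are assumptions made for illustration; cluster sampling is omitted because it needs a prior grouping of the data.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, s, seed=0):
    """Draw s distinct instances."""
    return random.Random(seed).sample(data, s)

def sample_with_replacement(data, s, seed=0):
    """Draw s instances; the same instance may be drawn more than once."""
    return random.Random(seed).choices(data, k=s)

def stratified_sample(data, key, fraction, seed=0):
    """Sample the same fraction from each stratum (e.g., each class)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical data: 80 instances of class "a" and 20 of class "b".
data = [(i, "a") for i in range(80)] + [(i, "b") for i in range(80, 100)]
simple = sample_without_replacement(data, 10, seed=42)
bootstrap = sample_with_replacement(data, 10, seed=42)
stratified = stratified_sample(data, key=lambda item: item[1], fraction=0.1)
```

A fixed seed makes each sample reproducible, and the stratified sample preserves the 80/20 class proportions exactly.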
Weka Filters
PrincipalComponents is under the Attribute Selection tab
Already talked about filters to discretize the data
The Resample filter randomly samples a given percentage of the data
  If you specify the same seed, you’ll get the same sample again