AMCS/CS 340 : Data Mining
Feature Selection
Xiangliang Zhang
King Abdullah University of Science and Technology
Outline
• Introduction
• Unsupervised Feature Selection
 - Clustering
 - Matrix Factorization
• Supervised Feature Selection
 - Individual Feature Ranking (Single Variable Classifier)
 - Feature subset selection
o Filters
o Wrappers
• Summary
Problems due to poor variable selection
• If the input dimension is too large, the curse of dimensionality may arise;
• A poor model may be built when additional unrelated inputs are included or relevant inputs are missing;
• Complex models that contain too many inputs are more difficult to understand
Applications
OCR (optical character recognition)
HWR (handwriting recognition)
Benefits of feature selection
• Facilitating data visualization
• Data understanding
• Reducing the measurement and storage
requirements
• Reducing training and utilization times
• Defying the curse of dimensionality to improve
prediction performance
Feature Selection/Extraction
Thousands to millions of low-level features: select/extract the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X of N samples × m features {Fj}, reduced to N × d selected/extracted features {fi}, with d << m; target vector Y]
• Using label Y → supervised
• Without label Y → unsupervised
Feature Selection vs Extraction
Selection:
• choose the best subset of size d from the m features
  {fi} is a subset of {Fj}, i = 1,…,d, j = 1,…,m
Extraction:
• extract d new features by linear or non-linear combination of all
the m features
- Linear/Non-linear feature extraction: {fi} = f({Fj})
• New features may not have a physical interpretation/meaning
[Figure: data matrix X of N samples × m features {Fj}, mapped to N × d extracted features {fi}; target vector Y]
Outline
• Introduction
• Unsupervised Feature Selection
 - Clustering
 - Matrix Factorization
• Supervised Feature Selection
 - Individual Feature Ranking (Single Variable Classifier)
 - Feature subset selection
o Filters
o Wrappers
• Summary
Feature Selection by Clustering
• Group features into clusters
• Replace the (many) similar variables in one cluster by a (single) cluster centroid
• E.g., K-means, hierarchical clustering (see the sketch below)
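A minimal Python sketch of this idea, assuming a numeric data matrix X (the synthetic data and variable names below are illustrative, not from the slides): cluster the feature columns with k-means and replace each cluster of similar features by its centroid.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))          # N=500 samples, m=100 features

n_feature_clusters = 10                  # d << m
km = KMeans(n_clusters=n_feature_clusters, n_init=10, random_state=0)
labels = km.fit_predict(X.T)             # cluster the m features, not the N samples

# Replace each group of similar features by its cluster centroid
X_reduced = np.column_stack(
    [X[:, labels == c].mean(axis=1) for c in range(n_feature_clusters)]
)
print(X_reduced.shape)                   # (500, 10)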
Example of student project
Abdullah Khamis,
AMCS/CS340 2010 Fall,
“Statistical Learning Based
System for Text Classification”
Other unsupervised FS methods
• Matrix Factorization
  o PCA (Principal Component Analysis):
    use the PCs with the largest eigenvalues as "features" (see the sketch after this list)
  o SVD (Singular Value Decomposition):
    use the singular vectors with the largest singular values as "features"
  o NMF (Non-negative Matrix Factorization)
• Nonlinear Dimensionality Reduction
  o Isomap
  o LLE (Locally Linear Embedding)
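A minimal sketch of the PCA option above using scikit-learn, on synthetic data; the choice of d = 10 components is arbitrary and only for illustration.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 100))      # N=500 samples, m=100 original features

pca = PCA(n_components=10)           # keep the d=10 components with largest eigenvalues
X_new = pca.fit_transform(X)         # N x d matrix of constructed features

print(X_new.shape)                           # (500, 10)
print(pca.explained_variance_ratio_.sum())   # variance retained by the 10 PCs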
Outline
• Introduction
• Unsupervised Feature Selection
 - Clustering
 - Matrix Factorization
• Supervised Feature Selection
 - Individual Feature Ranking (Single Variable Classifier)
 - Feature subset selection
o Filters
o Wrappers
• Summary
Feature Ranking
• Build better, faster, and easier to understand
learning machines
• Discover the most relevant features w.r.t. the target label, e.g., find genes that discriminate between healthy and diseased patients
[Figure: N × m data matrix X reduced to the top-d ranked feature columns]
- Eliminate useless features (distracters).
- Rank useful features.
- Eliminate redundant features.
Example of detecting attacks in real HTTP logs
[Figure: sample log lines: a common request, a JS XSS attack, a remote file inclusion attack, a DoS attack]
Represent each HTTP request by a vector
• in 95 dimensions, corresponding to the 95 ASCII codes between 33 and 127
• of the character distribution, computed as the frequency of each ASCII code in the path of the HTTP request (see the sketch below)
Classify the HTTP vectors in the 95-dim space vs. in a reduced-dimension space?
Which dimensions to choose? Which is better?
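A minimal sketch of the 95-dimensional character-distribution representation described above; the request path in the example is made up for illustration.

import numpy as np

def char_distribution(path: str) -> np.ndarray:
    """Return a 95-dim vector of frequencies of ASCII codes 33..127."""
    counts = np.zeros(95)
    for ch in path:
        code = ord(ch)
        if 33 <= code <= 127:
            counts[code - 33] += 1
    total = counts.sum()
    return counts / total if total > 0 else counts

vec = char_distribution("/index.php?id=1<script>alert(1)</script>")
print(vec.shape, vec.sum())   # (95,) 1.0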
Individual Feature Ranking (1): by AUC
1. Rank the features by the AUC of each single-variable classifier
 - AUC close to 1: most related
 - AUC close to 0.5: most unrelated
[Figure: ROC curve (true positive rate vs. false positive rate) with the AUC as the area under it, and the class-conditional distributions of a single feature xi for labels -1 and +1]
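A minimal sketch of AUC-based ranking on synthetic data: each feature's raw value is used as the score of a single-variable classifier, and features are ranked by how far the resulting AUC is from 0.5.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# AUC < 0.5 just means the feature is anti-correlated with the class,
# so score each feature by max(auc, 1 - auc).
aucs = np.array([roc_auc_score(y, X[:, j]) for j in range(X.shape[1])])
scores = np.maximum(aucs, 1 - aucs)

ranking = np.argsort(-scores)        # most related features first
print(ranking[:10])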
Individual Feature Ranking (2): by Mutual Information
2. Rank the features by mutual information I(i)
 - The higher I(i), the more related attribute xi is to class y
Mutual information between each variable and the target:
I(i) = \sum_{x_i} \sum_{y} P(X = x_i, Y = y) \log \frac{P(X = x_i, Y = y)}{P(X = x_i)\,P(Y = y)}
where the probabilities are estimated from frequency counts:
• P(Y = y): frequency count of class y
• P(X = xi): frequency count of attribute value xi
• P(X = xi, Y = y): joint frequency count of attribute value xi and class y
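A minimal sketch of mutual-information ranking using scikit-learn's mutual_info_classif estimator (which handles continuous features); for discrete attributes one could instead compute I(i) directly from the frequency counts above. Synthetic data for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

mi = mutual_info_classif(X, y, random_state=0)   # estimate I(i) for each feature
ranking = np.argsort(-mi)                        # highest mutual information first
print(ranking[:10])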
Individual Feature Ranking (3): with continuous target
3. Rank features by the Pearson correlation coefficient
R(i) = \frac{\mathrm{cov}(X_i, Y)}{\sqrt{\mathrm{var}(X_i)\,\mathrm{var}(Y)}}
• detects linear dependencies between a variable and the target
• rank features by |R(i)| or R^2(i) (as in linear regression)
 - close to 1: related
 - close to 0: unrelated
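A minimal sketch of Pearson-correlation ranking for a continuous target, on synthetic regression data.

import numpy as np
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=30, n_informative=5,
                       random_state=0)

# R(i) for every feature: correlate each column of X with the target y
R = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

ranking = np.argsort(-np.abs(R))      # strongest linear dependence first
print(ranking[:10], R[ranking[:5]])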
Individual Feature Ranking (4): by T-test
• Null hypothesis H0: μ+ = μ- (xi and Y are independent)
• Relevance index → test statistic
• T statistic: if H0 is true,
t = \frac{\mu_+ - \mu_-}{\sigma_{within} \sqrt{\frac{1}{n_+} + \frac{1}{n_-}}} \sim \mathrm{Student}(n_+ + n_- - 2 \text{ d.f.}),
where n_+ and n_- are the numbers of samples with label + and -, and
\sigma_{within}^2 = \frac{(n_+ - 1)\,\sigma_+^2 + (n_- - 1)\,\sigma_-^2}{n_+ + n_- - 2}
[Figure: class-conditional distributions of feature xi for labels -1 and +1, with means μ-, μ+ and standard deviations σ-, σ+]
4. Rank by p-value (the false positive rate)
 - The lower the p-value, the more related xi is to class y
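A minimal sketch of T-test ranking: a pooled two-sample t-test per feature with scipy, ranking features by p-value (lower means more related). Synthetic data for illustration.

import numpy as np
from scipy import stats
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

# Pooled-variance t-test of the two class-conditional means for each feature
pvals = np.array([
    stats.ttest_ind(X[y == 1, j], X[y == 0, j], equal_var=True).pvalue
    for j in range(X.shape[1])
])
ranking = np.argsort(pvals)           # smallest p-value first
print(ranking[:10])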
Individual Feature Ranking (5): by Fisher Score
• Fisher discrimination
• Two-class case:
F = between-class variance / pooled within-class variance
F = \frac{(\mu_+ - \mu_-)^2}{\frac{n_+ \sigma_+^2 + n_- \sigma_-^2}{n_+ + n_-}}
[Figure: class-conditional distributions of feature xi for labels -1 and +1, with means μ-, μ+ and standard deviations σ-, σ+]
5. Rank by the F value
 - The higher F, the more related xi is to class y
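A minimal sketch of the two-class Fisher score as reconstructed above (between-class separation over pooled within-class variance), computed per feature on synthetic data.

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

def fisher_scores(X, y):
    Xp, Xn = X[y == 1], X[y == 0]
    n_pos, n_neg = len(Xp), len(Xn)
    num = (Xp.mean(axis=0) - Xn.mean(axis=0)) ** 2          # (mu+ - mu-)^2
    pooled = (n_pos * Xp.var(axis=0) + n_neg * Xn.var(axis=0)) / (n_pos + n_neg)
    return num / pooled

F = fisher_scores(X, y)
print(np.argsort(-F)[:10])            # highest Fisher score first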
Rank features in HTTP logs
FS                              SVM results (AUC / Accuracy)   Features
All 95 features                 0.8797 / 97.92%
1 AUC-ranking (D=30)            0.9212 / 97.60%                #.128:;?BLOQ[\]_aefhiklmoptuw|
2 MI-ranking (D=30)             0.8849 / 97.96%                "#./1268;ALPQRS[\]_`aehkltwyz|
3 R-ranking (D=30)              0.9208 / 97.67%                "#,.2:;?LQS[\]_`aehiklmoptuwz|
4 T-test ranking (D=30)         0.9208 / 97.67%                "#,.2:;?LQS[\]_`aehiklmoptuwz|
5 Fisher score ranking (D=30)   0.9208 / 97.67%                "#,.2:;?LQS[\]_`aehiklmoptuwz|
6 PCA (D=30, unsupervised)      0.8623 / 97.74%                Constructed features
Demo: http://www.lri.fr/~xlzhang/KAUST/CS340_slides/FS_rank_demo.zip
[Figure: ROC curves (true positive rate vs. false positive rate) of the SVM using all features and the top-30 features ranked by AUC, MI, correlation coefficient, t-test, and Fisher score]
Issues of individual feature ranking
• Relevance vs. usefulness:
 - Relevance does not imply usefulness.
 - Usefulness does not imply relevance.
• Leads to the selection of a redundant subset:
  the k best individual features are not the best k features
• A variable that is useless by itself can be useful together with others
Useless features become useful
• Separation is gained by using two variables instead of one, or by adding variables
• Ranking variables individually and independently of each other cannot determine which combination of variables would give the best performance.
Outline
• Introduction
• Unsupervised Feature Selection
 - Clustering
 - Matrix Factorization
• Supervised Feature Selection
 - Individual Feature Ranking (Single Variable Classifier)
 - Feature subset selection
o Filters
o Wrappers
• Summary
Multivariate Feature Selection is complex
Kohavi-John, 1997
M features → 2^M possible feature subsets!
Objectives of feature selection
Questions before subset feature selection
1. How to search the space of all possible variable
subsets?
2. Do we use the prediction performance to guide the search?
• No → Filter
• Yes → Wrapper, which raises two further questions:
   1) how to assess the prediction performance of a learning machine to guide the search and halt it
   2) which predictor to use
      (popular predictors include decision trees, Naive Bayes, least-squares linear predictors, and SVM)
Filter: Feature subset selection
[Diagram: All features → Filter → Feature subset → Predictor]
The feature subset is chosen by an evaluation criterion, which scores each subset of input variables, e.g.,
 - correlation-based feature selection (CFS):
   prefer subsets that contain features highly correlated with the class and uncorrelated with each other (see the sketch below)
M(\{f_i, i = 1 \ldots k\}) = \frac{k\,\bar{R}_{cf}}{\sqrt{k + k(k-1)\,\bar{R}_{ff}}}
where \bar{R}_{cf} is the mean feature-class correlation (how predictive of the class the set of features is) and \bar{R}_{ff} is the average feature-feature intercorrelation (how much redundancy there is among the feature subset)
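A minimal sketch of the CFS merit above for one candidate subset, using absolute Pearson correlations as \bar{R}_{cf} and \bar{R}_{ff}; a full CFS would combine this score with one of the search strategies on the next slide. The helper name and synthetic data are illustrative.

import numpy as np
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

def cfs_merit(X, y, subset):
    k = len(subset)
    # mean feature-class correlation
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    # average feature-feature intercorrelation over all pairs in the subset
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

print(cfs_merit(X, y, [0, 3, 7]))     # merit of one candidate subset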
Filter: Feature subset selection (2)
[Diagram: All features → Filter → Feature subset → Predictor]
How to search over all possible feature subsets and evaluate M(\{f_i, i = 1 \ldots k\}) for k = 1, …, M?
 - exhaustive enumeration
 - forward selection
 - backward elimination
 - best first, forward/backward with a stopping criterion
The filter method is a pre-processing step, which is independent of the learning algorithm.
Forward Selection
Sequential forward selection (SFS): features are sequentially added to an initially empty candidate set until the addition of further features does not decrease the criterion (see the sketch below).
[Diagram: starting with no features selected, the best of the remaining candidates (n, then n-1, then n-2, …, down to 1) is added at each step]
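A minimal sketch of greedy forward selection; here the criterion is cross-validated accuracy of a classifier (a wrapper-style criterion), but a filter criterion such as the CFS merit could be plugged in instead. The helper forward_select and the synthetic data are illustrative, not from the slides.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           random_state=0)

def forward_select(X, y, model, d):
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < d:
        # score each candidate feature when added to the current subset
        scores = {j: cross_val_score(model, X[:, selected + [j]], y, cv=5).mean()
                  for j in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

print(forward_select(X, y, LogisticRegression(max_iter=1000), d=5))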
Backward Elimination
Sequential backward selection (SBS): features are sequentially removed from the full candidate set until the removal of further features increases the criterion.
[Diagram: starting with all n features selected, one feature is removed at each step: n → n-1 → n-2 → … → 1]
Wrapper: Feature selection methods
[Diagram: All features → Wrapper generates multiple feature subsets → Predictor]
 - The learning model is used as part of the evaluation function and also to induce the final learning model
 - Subsets of features are scored according to their predictive power
 - The parameters of the model are optimized by measuring some cost function
 - Danger of over-fitting with an intensive search!
RFE SVM
Recursive Feature Elimination (RFE) SVM. Guyon-Weston, 2000. US patent 7,117,188
[Flowchart: train an SVM on all features → eliminate the least useful feature(s) → retrain; if performance degrades, stop; otherwise continue]
1: repeat
2:   Find w and b by training a linear SVM.
3:   Remove the feature with the smallest value |wi|.
4: until a desired number of features remain.
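A minimal sketch of RFE with a linear SVM via scikit-learn's RFE, which retrains the estimator and removes the feature(s) with the smallest |wi| at each round; it stops at a requested number of features rather than at a performance-degradation test. Synthetic data for illustration.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=500, n_features=95, n_informative=10,
                           random_state=0)

svc = LinearSVC(C=1.0, dual=False, max_iter=10000)
rfe = RFE(estimator=svc, n_features_to_select=30, step=1)  # drop 1 feature per round
rfe.fit(X, y)

print(np.where(rfe.support_)[0])      # indices of the 30 retained features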
Selecting feature subsets in HTTP logs

FS                          SVM results (AUC / Accuracy)   Features
All 95 features             0.8797 / 97.92%
1 AUC-ranking (D=30)        0.9212 / 97.60%                #.128:;?BLOQ[\]_aefhiklmoptuw|
2 R-ranking (D=30)          0.9208 / 97.67%                "#,.2:;?LQS[\]_`aehiklmoptuwz|
3 SFS Gram-Schmidt (D=30)   0.8914 / 97.85%                #&,./25:;=?DFLQ[\_`ghklmptwxz|
4 RFE SVM (D=30)            0.9174 / 97.85%                "#',-.0238:;<?D[\]_`ghklmoqwz|
5 …
6 …
Comparison of Filter and Wrapper
• Main goal: rank subsets of useful features
• Search strategies: explore the space of all possible feature combinations
• Two criteria: predictive power (maximize) and subset size (minimize)
• Predictive power assessment:
 - Filter methods: criteria not involving any learning machine, e.g., a relevance index based on correlation coefficients or test statistics
 - Wrapper methods: the performance of a learning machine trained using a given feature subset
• Wrappers are potentially very time-consuming since they typically need to evaluate a cross-validation scheme at every iteration.
• The filter method is much faster, but it does not incorporate learning.
Forward Selection w. Trees
• Tree classifiers, like CART (Breiman, 1984) or C4.5 (Quinlan, 1993):
  at each step, choose the feature that "reduces entropy" most; work towards "node purity".
[Diagram: a tree grown on all the data, choosing f1 at the root split and f2 at the next split]
• Feature subset selection by Random Forest (see the sketch below)
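A minimal sketch of feature subset selection with a Random Forest: rank features by the forest's impurity-based importances and keep the top d. Synthetic data; the cutoff d = 10 is arbitrary.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=30, n_informative=5,
                           random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
importances = forest.feature_importances_

d = 10
top_d = np.argsort(-importances)[:d]   # indices of the d most important features
print(top_d)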
Outline
• Introduction
• Unsupervised Feature Selection
 - Clustering
 - Matrix Factorization
• Supervised Feature Selection
 - Individual Feature Ranking (Single Variable Classifier)
 - Feature subset selection
o Filters
o Wrappers
• Summary
Conclusion
Feature selection focuses on uncovering subsets of variables X1, X2, … predictive of the target Y.
 - Univariate feature selection
   How to rank the features?
 - Multivariate (subset) feature selection: Filter, Wrapper, Embedded
   How to search the space of feature subsets?
   How to evaluate the subsets of features?
 - Feature extraction
   How to construct new features in linear/non-linear ways?
In practice
• No method is universally better:
- wide variety of types of variables, data
distributions, learning machines, and objectives.
• Match the method complexity to the ratio M/N (number of features to number of training examples):
 - univariate feature selection may work better than multivariate feature selection;
 - non-linear classifiers are not always better.
• Feature selection is not always necessary to achieve
good performance.
NIPS 2003 and WCCI 2006 challenges: http://clopinet.com/challenges
Feature selection toolbox
• Matlab: sequentialfs (sequential feature selection, shown in demo)
 - Forward: good
 - Backward: be careful about the definition of the criterion
• Feature Selection Toolbox 3: freely available, open-source software in C++
• Weka
Reference
• Isabelle Guyon and André Elisseeff. "An Introduction to Variable and Feature Selection." JMLR, 2003.
• Isabelle Guyon et al., Eds. Feature Extraction: Foundations and Applications. Springer, 2006. http://clopinet.com/fextract-book
• Pabitra Mitra, C. A. Murthy, and Sankar K. Pal. "Unsupervised Feature Selection Using Feature Similarity." IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 2002.
• Prof. Marc Van Hulle, Katholieke Universiteit Leuven, http://134.58.34.50/~marc/DM_course/slides_selection.pdf