Feature selection methods
Isabelle Guyon
isabelle@clopinet.com
IPAM summer school on Mathematics in Brain Imaging. July 2008
Feature Selection
• Thousands to millions of low-level features: select the most relevant ones to build better, faster, and easier-to-understand learning machines.
[Figure: data matrix X with m examples (rows) and n features (columns), reduced to n' selected features]
Applications
[Figure: application domains plotted by number of training examples (10 to 10^6) versus number of variables/features (10 to 10^5): customer knowledge, system diagnosis, quality control, market analysis, text categorization, OCR/HWR, bioinformatics, machine vision]
Nomenclature
• Univariate method: considers one
variable (feature) at a time.
• Multivariate method: considers subsets
of variables (features) together.
• Filter method: ranks features or feature
subsets independently of the predictor
(classifier).
• Wrapper method: uses a classifier to
assess features or feature subsets.
Univariate Filter Methods
Univariate feature ranking
[Figure: class-conditional densities P(Xi|Y=1) and P(Xi|Y=-1) of feature xi, with class means μ-, μ+ and standard deviations σ-, σ+]
• Normally distributed classes, equal variance σ² unknown; estimated from the data as σ²_within.
• Null hypothesis H0: μ+ = μ-
• T statistic: if H0 is true,
  t = (μ+ - μ-) / (σ_within √(1/m+ + 1/m-))  ~  Student(m+ + m- - 2 d.f.),
  where m+ and m- are the numbers of examples in each class.
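As a concrete illustration (an addition to the slides), the per-feature t-test ranking can be computed with scipy; a minimal sketch, assuming a data matrix X (examples in rows) and labels y in {+1, -1}:

  import numpy as np
  from scipy.stats import ttest_ind

  def t_test_ranking(X, y):
      # Two-sample Student t-test per feature, equal variances assumed (as above).
      X_pos, X_neg = X[y == 1], X[y == -1]
      t, pval = ttest_ind(X_pos, X_neg, axis=0, equal_var=True)
      order = np.argsort(pval)            # smallest p-value = most significant feature
      return order, pval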
Statistical tests (chap. 2)
• Null hypothesis H0: X and Y are independent.
• Relevance index → test statistic.
• P-value → false positive rate FPR = nfp / nirr
[Figure: null distribution of the relevance index r; the p-value is the tail area beyond the observed value r0]
• Multiple testing problem: use the Bonferroni correction pval → n pval
• False discovery rate: FDR = nfp / nsc ≤ FPR × n / nsc
• Probe method: FPR ≈ nsp / np
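A minimal sketch of the corrections above, applied to a vector of per-feature p-values (e.g. from the t-test ranking sketched earlier); the Benjamini-Hochberg step-up rule is used here as a standard FDR-controlling procedure and is an assumption of this sketch, not necessarily the procedure of the slides:

  import numpy as np

  def bonferroni_select(pvals, alpha=0.05):
      # Keep features whose Bonferroni-corrected p-value (n * pval) stays below alpha.
      n = len(pvals)
      return np.flatnonzero(np.minimum(np.asarray(pvals) * n, 1.0) < alpha)

  def benjamini_hochberg_select(pvals, alpha=0.05):
      # Largest set of features whose estimated FDR stays below alpha (step-up rule).
      pvals = np.asarray(pvals)
      n = len(pvals)
      order = np.argsort(pvals)
      thresholds = alpha * np.arange(1, n + 1) / n
      passed = np.flatnonzero(pvals[order] <= thresholds)
      if passed.size == 0:
          return np.array([], dtype=int)
      return np.sort(order[: passed[-1] + 1])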
Univariate Dependence
• Independence: P(X, Y) = P(X) P(Y)
• Measure of dependence (mutual information):
  MI(X, Y) = ∫ P(X,Y) log [ P(X,Y) / (P(X)P(Y)) ] dX dY
           = KL( P(X,Y) || P(X)P(Y) )
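As an illustration (an addition to the slides), scikit-learn provides a nearest-neighbor estimator of MI(Xi, Y) for each feature; the particular estimator is an implementation choice, not part of the original material:

  import numpy as np
  from sklearn.feature_selection import mutual_info_classif

  def mi_ranking(X, y, random_state=0):
      mi = mutual_info_classif(X, y, random_state=random_state)  # one MI estimate per feature
      return np.argsort(-mi), mi                                  # most dependent features first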
Other criteria (chap. 3)
The choice of a feature ranking method depends on the nature of:
• the variables and the target (binary, categorical, continuous);
• the problem (dependencies between variables, linear/nonlinear relationships between variables and target);
• the available data (number of examples, number of variables, noise in the data);
• the available tabulated statistics.
Multivariate Methods
Univariate selection may fail: features that are individually irrelevant can become relevant when taken jointly, and individually relevant features can be redundant.
Guyon-Elisseeff, JMLR 2004; Springer 2006
Filters, Wrappers, and Embedded methods
• Filter: all features → filter → feature subset → predictor.
• Wrapper: all features → multiple feature subsets, each assessed with the predictor → best subset → predictor.
• Embedded method: all features → the feature subset is selected during the training of the predictor itself.
Relief
Relief = <Dmiss / Dhit>, averaged over all examples, where Dhit is the distance of an example to its nearest neighbor of the same class (nearest hit) and Dmiss the distance to its nearest neighbor of the opposite class (nearest miss).
Kira and Rendell, 1992
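A minimal sketch of a Relief-style score following the formula above, with Euclidean distances for finding neighbors and per-feature distance ratios as the scores; the exact normalization is an assumption of this sketch:

  import numpy as np

  def relief_scores(X, y):
      # Score per feature: sum of distances to nearest misses over sum of distances to nearest hits.
      n, d = X.shape
      dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise Euclidean distances
      np.fill_diagonal(dist, np.inf)
      num, den = np.zeros(d), np.zeros(d)
      for i in range(n):
          same, other = (y == y[i]), (y != y[i])
          same[i] = False
          hit = np.argmin(np.where(same, dist[i], np.inf))           # nearest hit (same class)
          miss = np.argmin(np.where(other, dist[i], np.inf))         # nearest miss (other class)
          num += np.abs(X[i] - X[miss])
          den += np.abs(X[i] - X[hit])
      return num / (den + 1e-12)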
Wrappers for feature selection
Kohavi-John, 1997
N features, 2^N possible feature subsets!
Search Strategies (chap. 4)
• Exhaustive search.
• Simulated annealing, genetic algorithms.
• Beam search: keep the k best paths at each step.
• Greedy search: forward selection or backward elimination.
• PTA(l,r): plus l, take away r – at each step, run SFS l times, then SBS r times.
• Floating search (SFFS and SBFS): one step of SFS (resp. SBS), then SBS (resp. SFS) as long as we find better subsets than those of the same size obtained so far. At any time, if a better subset of the same size was already found, switch abruptly.
Forward Selection (wrapper)
[Diagram: starting from the empty set, the search considers n candidate features, then n-1, n-2, …, down to 1, adding the best feature at each step]
Also referred to as SFS: Sequential Forward Selection
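A minimal sketch of SFS used as a wrapper: at each step, every remaining feature is tentatively added and the one giving the best cross-validated score of the predictor is kept. The choice of predictor (logistic regression) and of the score are assumptions of this sketch:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import cross_val_score

  def forward_selection(X, y, n_select, cv=5):
      selected, remaining = [], list(range(X.shape[1]))
      while len(selected) < n_select:
          scores = [cross_val_score(LogisticRegression(max_iter=1000),
                                    X[:, selected + [j]], y, cv=cv).mean()
                    for j in remaining]
          best = remaining[int(np.argmax(scores))]   # feature whose addition helps most
          selected.append(best)
          remaining.remove(best)
      return selected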
Forward Selection (embedded)
[Diagram: the same n → n-1 → … → 1 forward search, but following a single path]
Guided search: we do not consider alternative paths.
Typical ex.: Gram-Schmidt orthog. and tree classifiers.
Backward Elimination (wrapper)
Also referred to as SBS: Sequential Backward Selection
[Diagram: starting from all n features, features are removed one at a time: n → n-1 → n-2 → … → 1]
Backward Elimination (embedded)
Guided search: we do not consider alternative paths.
Typical ex.: “recursive feature elimination” RFE-SVM.
[Diagram: the same n → n-1 → … → 1 backward search, following a single path]
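As an illustration of RFE-SVM (using scikit-learn's generic RFE wrapper around a linear SVM; the hyperparameters are assumptions of this sketch): the estimator is retrained repeatedly and the feature with the smallest weight magnitude is eliminated at each step.

  import numpy as np
  from sklearn.feature_selection import RFE
  from sklearn.svm import LinearSVC

  def rfe_svm(X, y, n_select):
      selector = RFE(LinearSVC(max_iter=10000), n_features_to_select=n_select, step=1)
      selector.fit(X, y)
      return np.flatnonzero(selector.support_)       # indices of the retained features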
Scaling Factors
Idea: Transform a discrete space into a continuous space.
s=[s1, s2, s3, s4]
• Discrete indicators of feature presence: si ∈ {0, 1}
• Continuous scaling factors: si ∈ ℝ
Now we can do gradient descent!
Formalism (chap. 5)
• Many learning algorithms are cast as the minimization of a regularized functional:
  min over the parameters α of   Σi L(f(xi; α), yi)  +  λ Ω(α)
  (empirical error term + regularization / capacity-control term)
Justification of RFE and many other embedded methods.
Next few slides: André Elisseeff
Embedded method
• Embedded methods are a good inspiration to design new feature selection techniques for your own algorithms:
  – Find a functional that represents your prior knowledge about what a good model is.
  – Add the s weights (scaling factors) into the functional and make sure it is either differentiable or that you can perform a sensitivity analysis efficiently.
  – Optimize alternately with respect to the model parameters α and the scaling factors s (a minimal sketch follows below).
  – Use early stopping (validation set) or your own stopping criterion to stop and select the subset of features.
• Embedded methods are therefore not too far from wrapper techniques and can be extended to multiclass, regression, etc.
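A minimal sketch of this recipe, assuming regularized logistic regression as the base learner (not a method from the slides): the inputs are multiplied by non-negative scaling factors s, gradient steps on the weights w and on s alternate, and the features with the largest s are kept.

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  def embedded_scaling_selection(X, y, n_keep, lam=0.1, lr=0.1, n_iter=200):
      # y in {0, 1}; returns the indices of the selected features and the scaling factors.
      n, d = X.shape
      w, b, s = np.zeros(d), 0.0, np.ones(d)
      for _ in range(n_iter):
          Z = X * s                            # rescaled inputs
          err = sigmoid(Z @ w + b) - y         # gradient of the logistic loss w.r.t. the scores
          # step on the model parameters (s fixed)
          w -= lr * (Z.T @ err / n + lam * w)
          b -= lr * err.mean()
          # step on the scaling factors (w fixed)
          s -= lr * ((X * w).T @ err / n + lam * s)
          s = np.clip(s, 0.0, None)            # keep scaling factors non-negative
      return np.sort(np.argsort(-s)[:n_keep]), s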
The l1 SVM
• A version of the SVM where the regularizer Ω(w) = ||w||² is replaced by the l1 norm Ω(w) = Σi |wi|.
• Can be considered an embedded feature selection method:
  – Some weights will be driven to zero (tends to remove redundant features).
  – Difference from the regular SVM, where redundant features are included.
Bi et al., 2003; Zhu et al., 2003
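A minimal sketch, using scikit-learn's LinearSVC as a stand-in for the l1 SVM of the papers above: the l1 penalty drives some weights exactly to zero, and the features with non-zero weights are the selected ones.

  import numpy as np
  from sklearn.svm import LinearSVC

  def l1_svm_selection(X, y, C=0.1):
      clf = LinearSVC(penalty="l1", dual=False, C=C, max_iter=10000).fit(X, y)
      w = clf.coef_.ravel()
      return np.flatnonzero(np.abs(w) > 1e-8), w     # features with non-zero weight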
The l0 SVM
• Replace the regularizer ||w||² by the l0 "norm" ||w||0 = number of non-zero weights wi.
• Further replace ||w||0 by Σi log(ε + |wi|).
• Boils down to the following kind of multiplicative update algorithm:
Weston et al, 2003
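A minimal sketch of a multiplicative-update scheme in the spirit of the l0 approximation above: retrain a linear SVM on rescaled inputs and multiply the scaling vector by |w| at each iteration, so that the weights of uninformative features shrink toward zero. The loss, the value of C, and the stopping rule are assumptions of this sketch, not necessarily the authors' exact algorithm.

  import numpy as np
  from sklearn.svm import LinearSVC

  def l0_svm_multiplicative_update(X, y, n_iter=20, C=1.0, tol=1e-6):
      z = np.ones(X.shape[1])                        # multiplicative scaling factors
      for _ in range(n_iter):
          clf = LinearSVC(C=C, max_iter=10000).fit(X * z, y)
          z = z * np.abs(clf.coef_.ravel())          # multiplicative update
          if z.max() > 0:
              z = z / z.max()                        # rescale for numerical stability
      return np.flatnonzero(z > tol), z              # surviving features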
Causality
What can go wrong?
[Figure: scatter plot of feature X2 versus feature X1]
Guyon-Aliferis-Elisseeff, 2007
What can go wrong?
[Figure: further scatter plots involving features X1 and X2]
What can go wrong?
[Figure: alternative causal graphs relating X1, X2, and the target Y]
Guyon-Aliferis-Elisseeff, 2007
http://clopinet.com/causality
Wrapping up
Bilevel optimization
• N variables/features, M samples.
• Split the data into 3 sets: training (m1 examples), validation (m2), and test (m3).
1) For each feature subset, train the predictor on the training data.
2) Select the feature subset that performs best on the validation data.
   – Repeat and average if you want to reduce variance (cross-validation).
3) Test on the test data.
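A minimal sketch of this bilevel scheme, assuming a scikit-learn classifier and a list of candidate feature subsets produced by any of the search strategies above:

  import numpy as np
  from sklearn.linear_model import LogisticRegression
  from sklearn.model_selection import train_test_split

  def select_by_validation(X, y, candidate_subsets, random_state=0):
      # 1) split into training / validation / test sets
      X_tr, X_tmp, y_tr, y_tmp = train_test_split(X, y, test_size=0.4, random_state=random_state)
      X_va, X_te, y_va, y_te = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=random_state)
      best_subset, best_score = None, -np.inf
      for subset in candidate_subsets:
          # 2) train on the training set, keep the subset scoring best on the validation set
          clf = LogisticRegression(max_iter=1000).fit(X_tr[:, subset], y_tr)
          score = clf.score(X_va[:, subset], y_va)
          if score > best_score:
              best_subset, best_score = list(subset), score
      # 3) report the final estimate on the test set
      clf = LogisticRegression(max_iter=1000).fit(X_tr[:, best_subset], y_tr)
      return best_subset, clf.score(X_te[:, best_subset], y_te)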
Complexity of Feature Selection
With high probability:
  Generalization_error ≤ Validation_error + ε(C/m2)
Method                     | Number of subsets tried | Complexity C
Exhaustive search wrapper  | 2^N                     | N
Nested subsets             | N(N+1)/2                | n or log N
Feature ranking            | N                       | log N
m2: number of validation examples,
N: total number of features,
n: feature subset size.
Try to keep C of the order of m2.
CLOP
http://clopinet.com/CLOP/
• CLOP=Challenge Learning Object Package.
• Based on the Matlab® Spider package developed at the
Max Planck Institute.
• Two basic abstractions:
– Data object
– Model object
• Typical script:
– D = data(X, Y);         % Data constructor
– M = kridge;             % Model constructor
– [R, Mt] = train(M, D);  % Train model => Mt
– Dt = data(Xt, Yt);      % Test data constructor
– Rt = test(Mt, Dt);      % Test the model
NIPS 2003 FS challenge
http://clopinet.com/isabelle/Projects/ETH/Feature_Selection_w_CLOP.html
Conclusion
• Feature selection focuses on
uncovering subsets of variables X1, X2,
… predictive of the target Y.
• Multivariate feature selection is in
principle more powerful than univariate
feature selection, but not always in
practice.
• Taking a closer look at the type of dependencies, in terms of causal relationships, may help refine the notion of variable relevance.
Acknowledgements
and references
1) Feature Extraction, Foundations and Applications. I. Guyon et al., Eds. Springer, 2006. http://clopinet.com/fextract-book
2) Causal feature selection. I. Guyon, C. Aliferis, A. Elisseeff. To appear in "Computational Methods of Feature Selection", Huan Liu and Hiroshi Motoda, Eds., Chapman and Hall/CRC Press, 2007. http://clopinet.com/causality