Minimum Redundancy and
Maximum Relevance Feature
Selection
Hang Xiao
Background
• Feature
– A feature is an individual measurable heuristic
property of a phenomenon being observed
– In character recognition: horizontal and vertical
profiles, number of internal holes, stroke
detection
– In speech recognition: noise ratios, length of
sounds, relative power, filter matches
– In microarray analysis: gene expression levels
Background
• Relevance between features
– Correlation
– F-statistic
– Mutual information
$I(X,Y) = \sum_{x,y} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$
p(x,y): joint probability distribution of X and Y
p(x), p(y): marginal probability distributions of X and Y
Independent: p(x,y) = p(x)p(y) ⇒ I(X,Y) = 0
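As an illustration of the mutual information measure above, here is a minimal sketch for discrete variables that estimates the probabilities from sample counts. The function name `mutual_information` and the toy sequences are assumptions made for this example, not part of the original slides.

```python
from collections import Counter
from math import log2

def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # counts of (x, y) pairs
    px = Counter(xs)               # marginal counts of x
    py = Counter(ys)               # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * log2(c * n / (px[x] * py[y]))
    return mi

# Toy data (hypothetical): a variable shares all information with itself,
# and none with an independent variable.
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(mutual_information(a, a))  # 1.0
print(mutual_information(a, b))  # 0.0
```

With so few samples the estimate is only meaningful for toy data; on real expression data the variables would first be discretized.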
Feature Selection Problem
• Maximal relevance
– Select the features with the highest relevance to the
target class h, based on mutual information, F-test, etc., without
considering relationships among features
• Minimal redundancy
– Features selected by relevance alone tend to be correlated with each other
– Such features cover only narrow regions of the feature space
mRMR: Discrete Variables
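• Maximize relevance: max V_I, where
$V_I = \frac{1}{|S|} \sum_{i \in S} I(h, i)$
• Minimize redundancy: min W_I, where
$W_I = \frac{1}{|S|^2} \sum_{i, j \in S} I(i, j)$
S is the set of selected features
I(h,i) is the mutual information between the target class h and feature i
I(i,j) is the mutual information between features i and j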
mRMR: Continuous Variables
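• Maximum relevance: max V_F, where $V_F = \frac{1}{|S|} \sum_{i \in S} F(i, h)$ and F(i,h) is the F-statistic of feature i with respect to the class h
• Minimum redundancy: min W_c, where $W_c = \frac{1}{|S|^2} \sum_{i, j \in S} |cor(i, j)|$ and cor(i,j) is the correlation between features i and j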
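The two continuous-variable measures can be sketched in plain Python as follows. This assumes the F-statistic is the usual one-way ANOVA statistic and the correlation is Pearson's; the function names and the toy expression values are hypothetical.

```python
from math import sqrt

def f_statistic(values, labels):
    """One-way ANOVA F-statistic of a continuous feature across classes.

    Assumes at least two classes and non-zero within-class variance.
    """
    groups = {}
    for v, k in zip(values, labels):
        groups.setdefault(k, []).append(v)
    n, n_classes = len(values), len(groups)
    grand_mean = sum(values) / n
    # Between-class and within-class sums of squares
    between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                  for g in groups.values())
    within = sum((v - sum(g) / len(g)) ** 2
                 for g in groups.values() for v in g)
    return (between / (n_classes - 1)) / (within / (n - n_classes))

def pearson(xs, ys):
    """Pearson correlation between two continuous features."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data (hypothetical): one gene measured in two classes of three samples.
labels = [0, 0, 0, 1, 1, 1]
gene = [1.0, 2.0, 3.0, 2.0, 3.0, 4.0]
print(f_statistic(gene, labels))                 # 1.5
print(pearson(gene, [2 * g + 1 for g in gene]))  # close to 1.0
```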
Combine Relevance and Redundancy
• Additive combination: max (V − W), where V is the relevance and W is the redundancy
• Multiplicative combination: max (V / W)
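A minimal sketch of the two combinations for discrete features, assuming relevance is the mean mutual information with the class and redundancy is the mean pairwise mutual information within the set (self-pairs included). All names and the toy data are hypothetical.

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def relevance(features, h):
    # V: mean mutual information between each selected feature and the class
    return sum(mi(f, h) for f in features) / len(features)

def redundancy(features):
    # W: mean mutual information over all ordered pairs (i, j), self-pairs included
    m = len(features)
    return sum(mi(fi, fj) for fi in features for fj in features) / m ** 2

# Toy data (hypothetical): the class h is determined jointly by a and b.
h = [0, 1, 2, 3]
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
a_copy = list(a)   # perfectly redundant with a

for name, subset in [("a + a_copy", [a, a_copy]), ("a + b", [a, b])]:
    v, w = relevance(subset, h), redundancy(subset)
    print(name, "additive:", v - w, "multiplicative:", v / w)
```

Both subsets have the same relevance here, but the pair without the duplicate scores higher under either combination.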
Most Related Methods
• Most widely used feature selection methods: pick the top-ranking features without considering
relationships among features.
• Yu & Liu, 2003/2004: based on information gain; an essentially
similar approach
• Wrappers: not a filter approach; the classifier is involved in selection,
and thus the features do not generalize well
• PCA and ICA: features are orthogonal or
independent, but are not in the original feature space
Class Prediction Methods
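• Naive Bayes (NB) classifier
$p(h_k \mid g_1, \dots, g_m) \propto p(h_k) \prod_{i=1}^{m} p(g_i \mid h_k)$
{g1, g2, …, gm}: gene expression levels
p(gi|hk): conditional probability table (or density)
• Support Vector Machine (SVM)
– Finds an optimal separating hyperplane in the feature vector
space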
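The Naive Bayes rule can be sketched for discretized expression levels as below. Laplace smoothing (`alpha`) is an assumption added here to avoid zero probabilities; the function names and toy data are hypothetical.

```python
from collections import Counter
from math import log

def train_nb(X, y, alpha=1.0):
    """X: list of samples (lists of discrete feature values); y: class labels."""
    class_counts = Counter(y)
    n_features = len(X[0])
    # counts[k][i][v] = number of class-k samples whose feature i equals v
    counts = {k: [Counter() for _ in range(n_features)] for k in class_counts}
    values = [set() for _ in range(n_features)]
    for row, k in zip(X, y):
        for i, v in enumerate(row):
            counts[k][i][v] += 1
            values[i].add(v)
    return class_counts, counts, values, alpha

def predict_nb(model, row):
    """Return the class h_k maximizing log p(h_k) + sum_i log p(g_i | h_k)."""
    class_counts, counts, values, alpha = model
    n = sum(class_counts.values())
    best, best_score = None, float("-inf")
    for k, nk in class_counts.items():
        score = log(nk / n)
        for i, v in enumerate(row):
            # Smoothed estimate of p(g_i = v | h_k)
            score += log((counts[k][i][v] + alpha) / (nk + alpha * len(values[i])))
        if score > best_score:
            best, best_score = k, score
    return best

# Toy data (hypothetical): the first feature determines the class.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
y = [0, 0, 1, 1]
model = train_nb(X, y)
print(predict_nb(model, [1, 0]))  # 1
```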
Class Prediction Methods
• Linear Discriminant Analysis (LDA)
– Finds a linear combination of features that separates the classes
– Closely related to ANOVA and regression analysis
• Logistic Regression (LR)
– a linear combination of the feature variables
– transformed into probabilities by a logistic
function
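The logistic regression prediction step can be sketched as follows. The weights and bias are made-up values for illustration; in practice they would be fitted to training data.

```python
from math import exp

def logistic(z):
    """Logistic function: maps any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-z))

def predict_proba(weights, bias, x):
    """Linear combination of the feature values, transformed into a probability."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return logistic(z)

# Hypothetical weights: here the linear combination is exactly 0.
print(predict_proba([1.5, -0.5], 0.0, [1.0, 3.0]))  # 0.5
```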
Microarray Gene
Expression Data Sets
for Cancer
Classification
LOOCV: Leave-One-Out Cross Validation
Baseline feature : based solely on maximum relevance
The role of redundancy reduction
(a) Relevance V_I, and
(b) Redundancy for mRMR features on the discretized NCI dataset.
(c) The respective LOOCV errors obtained using the Naive Bayes
classifier
Do mRMR Features Generalize Well on
Unseen Data?
Child Leukemia data (7 classes, 215 training samples, 112 testing samples)
testing errors. M is the number of features used in classification
What is the Relationship of mRMR Features and
Various Data Discretization Schemes?
LOOCV testing results classifier(#error) for binarized NCI and Lymphoma data using SVM
classifier.
Comparison with other work
Theoretical basis of mRMR
• Maximum Dependency Criterion
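– Statistical association
– Definition: mutual information I(Sm,h)
• Mutual Information
– For two variables x and y: $I(x, y) = \iint p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)} \, dx \, dy$
– For the multivariate variable Sm and the target h: $I(S_m, h) = \iint p(S_m,h) \log \frac{p(S_m,h)}{p(S_m)\,p(h)} \, dS_m \, dh$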
High-Dimensional Mutual Information
• For multivariate variable Sm and the target h
• Estimating the high-dimensional I(Sm,h) is difficult
– Finding the inverse of a large covariance matrix is an ill-posed problem
– Insufficient number of samples
– Combinatorial time complexity O(C(|Ω|,|S|))
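To get a feel for the combinatorial term C(|Ω|,|S|), the number of candidate subsets can be counted directly. The pool and subset sizes below are arbitrary illustrations, not figures from the slides.

```python
from math import comb

def n_candidate_subsets(pool_size, subset_size):
    """Number of distinct feature subsets of a given size: C(|Ω|, |S|)."""
    return comb(pool_size, subset_size)

print(n_candidate_subsets(20, 3))     # 1140
print(n_candidate_subsets(1000, 10))  # far too many to enumerate exhaustively
```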
Factorize the Mutual Information
• Mutual information for multivariate variable Sm
and the target h
Define:
It can be proved:
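For reference, one way to write the definition and the identity is given below. It assumes that J denotes the multivariate generalization of mutual information (the divergence between a joint density and the product of its marginals) and that J(S_m, h) means J(x_1, …, x_m, h).

```latex
J(x_1, \dots, x_m) = \int \!\cdots\! \int p(x_1, \dots, x_m)
    \log \frac{p(x_1, \dots, x_m)}{p(x_1) \cdots p(x_m)} \, dx_1 \cdots dx_m

I(S_m, h) = J(S_m, h) - J(S_m)
```

The identity follows by expanding both J terms into entropies: the sum of the marginal entropies of x_1, …, x_m cancels, leaving H(S_m) + H(h) − H(S_m, h) = I(S_m, h).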
Factorize I(Sm,h)
• Relevance of S={x1,x2, …} and h, or RL(S,h)
• Redundancy among variables {x1,x2,...}, or RD(S)
• For incremental search, max I(S,h) is “equivalent” to
max [RL(S,h) – RD(S)], the so-called min-Redundancy-Max-Relevance (mRMR) criterion
Advantages of mRMR
• Both relevance and redundancy estimation are low-dimensional problems (i.e. involving only 2 variables).
This is much easier than directly estimating
multivariate density or mutual information in the high-dimensional space!
• Fast speed
• More reliable estimation
• mRMR is an optimal first-order approximation of I(.)
maximization
• Relevance-only ranking only maximizes J(.)!
Search Algorithm of mRMR
• Greedy search algorithm
– In the pool Ω, find the variable x1 that has the
largest I(x1,h). Exclude x1 from Ω
– Search x2 so that it maximizes I(x2,h) - ∑I(.,x2)/|Ω|
– Iterate this process until an expected number of
variables have been obtained, or other constraints
are satisfied
• Complexity O(|S|*|Ω|)
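The greedy search above can be sketched for discrete features as follows. As an assumption, the redundancy term is averaged over the features already selected, which is one common reading of the ∑I(.,x2) term. The function names and toy data are hypothetical.

```python
from collections import Counter
from math import log2

def mi(xs, ys):
    """Empirical mutual information (bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(c / n * log2(c * n / (px[x] * py[y]))
               for (x, y), c in joint.items())

def mrmr_select(features, h, k):
    """features: dict name -> discrete column; h: class labels; returns k names."""
    pool = dict(features)                                  # the pool Ω
    relevance = {name: mi(col, h) for name, col in pool.items()}
    first = max(pool, key=relevance.get)                   # largest I(x1, h)
    selected = [first]
    del pool[first]                                        # exclude x1 from Ω
    while pool and len(selected) < k:
        def score(name):
            # Relevance minus mean redundancy with the already-selected features
            red = sum(mi(features[s], pool[name]) for s in selected) / len(selected)
            return relevance[name] - red
        best = max(pool, key=score)
        selected.append(best)
        del pool[best]
    return selected

# Toy data (hypothetical): the class is determined jointly by a and b.
h = [0, 1, 2, 3]
features = {
    "a": [0, 0, 1, 1],
    "a_copy": [0, 0, 1, 1],   # duplicate of a: equally relevant, fully redundant
    "b": [0, 1, 0, 1],
}
print(mrmr_select(features, h, 2))  # ['a', 'b']
```

All three features are equally relevant, so a relevance-only ranking could pick `a` and `a_copy`; the redundancy penalty makes the second pick `b` instead. Ties are broken by dictionary order in this sketch.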
Comparing Max-Dep and mRMR:
Complexity of Feature Selection
Comparing Max-Dep and mRMR: Accuracy
of Feature Selected in Classification
• Leave-One-Out cross validation of feature
classification accuracies of mRMR and MaxDep
Use Wrappers to Refine Features
• mRMR is a filter approach
– Fast
– Features might be redundant
– Independent of the classifier
• Wrappers seek to minimize the number of errors directly
– Slow
– Features are less robust
– Dependent on the classifier
– Better prediction accuracy
• Use mRMR first to generate a short feature pool and use
wrappers to get a least redundant feature set with better
accuracy
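A forward (incremental) wrapper over a short candidate pool can be sketched as below. A 1-nearest-neighbour classifier with Hamming distance stands in for the classifier purely to keep the example self-contained; it is not one of the classifiers used in the slides. The stopping rule (stop when the LOOCV error no longer drops) is also an assumption.

```python
def loocv_errors(columns, y):
    """LOOCV error count of a 1-nearest-neighbour classifier (Hamming distance)."""
    n = len(y)
    errors = 0
    for t in range(n):                 # hold out sample t
        best_d, pred = None, None
        for j in range(n):
            if j == t:
                continue
            d = sum(col[t] != col[j] for col in columns)
            if best_d is None or d < best_d:
                best_d, pred = d, y[j]
        errors += pred != y[t]
    return errors

def forward_wrapper(pool, y):
    """pool: dict name -> column, e.g. a short candidate list produced by mRMR."""
    selected, best_err = [], len(y) + 1
    remaining = list(pool)
    while remaining:
        # Try adding each remaining feature and measure the LOOCV error
        errs = {name: loocv_errors([pool[s] for s in selected] + [pool[name]], y)
                for name in remaining}
        name = min(errs, key=errs.get)
        if errs[name] >= best_err:
            break                      # no candidate lowers the error
        selected.append(name)
        remaining.remove(name)
        best_err = errs[name]
    return selected, best_err

# Toy data (hypothetical): four classes, two samples each.
y = [0, 0, 1, 1, 2, 2, 3, 3]
pool = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
}
print(forward_wrapper(pool, y))  # (['a', 'b'], 0)
```

A backward (decremental) wrapper would start from the full pool and remove features one at a time under the same error criterion.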
Use Wrappers to Refine Features
[Figure: forward wrappers (incremental selection) and backward wrappers (decremental selection) on the NCI data]
Conclusions
• The Max-Dependency feature selection can be
efficiently implemented as the mRMR algorithm
• mRMR significantly outperforms the widely used max-relevance selection method: mRMR features
cover a broader feature space with fewer features
• mRMR is very efficient and useful for gene
selection and many other applications. The
programs are ready!