Searching For a Few Good
Features
Pathology Informatics 2010
Bülent Yener
Rensselaer Polytechnic Institute
Department of Computer Science
The Hard Problem: Bad or just Ugly?
Unlike distinguishing healthy tissue, one of the main challenges is discriminating damaged (diseased but not cancerous) tissue from cancerous tissue.
We need a few good features!
Brain Tissue - Diffused
Good: healthy
Bad: glioma
Ugly: inflammation
Gland based tissue: Prostate
Good (Healthy)
Ugly (PIN)
Bad (cancerous)
Gland based tissue: Breast
Ugly (in Situ)
Good
Bad (invasive)
Bone Tissue Images
Healthy (good)
Osteosarcoma (bad)
Fracture (ugly)
Two related problems
• Feature Extraction
– Identify and compute attributes that will characterize the information
encoded in the histology images
– Need to quantify!
• Feature Selection
– Identify an optimal subset.
Feature Selection
• Select a subset of the original features
– reduces the number of features (dimensionality reduction)
– removes irrelevant or redundant data (noise reduction)
• speeding up a data mining algorithm
• improving prediction accuracy
• It is a hard optimization problem!
• Optimal feature selection requires an exhaustive search over all possible subsets of features of the chosen cardinality.
– Too expensive
• In practice: ad hoc heuristics
Greedy Algorithms
• A local optimum is searched (a minimal sketch follows this list):
– evaluate a candidate subset of features
– modify the subset and evaluate it
– if the new subset is an improvement over the old, take it as the current subset
– else:
• if the algorithm is deterministic, reject the modification (e.g., hill climbing)
• else accept it with some probability (e.g., simulated annealing)
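A minimal sketch of this loop in Python, under stated assumptions: `evaluate(subset)` is a hypothetical scoring function (e.g., cross-validated accuracy of a classifier restricted to that subset), and the subset is modified by flipping one randomly chosen feature in or out. With `temperature=None` the loop is deterministic hill climbing; otherwise worse subsets are occasionally accepted, as in simulated annealing.

```python
import math
import random

def greedy_select(features, evaluate, n_iter=1000, temperature=None):
    """Greedy local search over feature subsets.

    evaluate(subset) -> score is a hypothetical scoring function
    (e.g., cross-validated accuracy on that subset). With
    temperature=None this is deterministic hill climbing; otherwise
    worse subsets are accepted with probability exp(delta / temperature).
    """
    current = set(random.sample(features, k=max(1, len(features) // 2)))
    current_score = evaluate(current)
    for _ in range(n_iter):
        # Modify the subset: flip one randomly chosen feature in or out.
        candidate = set(current)
        candidate.symmetric_difference_update({random.choice(features)})
        if not candidate:
            continue
        score = evaluate(candidate)
        if score > current_score:
            current, current_score = candidate, score   # improvement: take it
        elif temperature and random.random() < math.exp((score - current_score) / temperature):
            current, current_score = candidate, score   # annealing: accept a worse move
        # deterministic hill climbing silently rejects worse candidates
    return current, current_score
```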
Methods (partial list)
• Exhaustive search: evaluate all possible subsets of m out of the d features – guaranteed optimum but too expensive.
• Branch and Bound Search: enumerate a fraction of the subsets – can find the optimum, but the worst case is exponential.
• Best individual features: evaluate each of the d features in isolation – no guarantee of optimality.
• Sequential Forward Selection (SFS): start with the best feature and add one at a time – no backtracking (see the sketch below).
• Sequential Backward Selection (SBS): start with all d features and eliminate one at a time – more expensive than SFS, and no backtracking either.
• Variants of SFS and SBS: start with the k best features and then delete r of them, etc.
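Sequential forward selection is the simplest of these to write down; a sketch under the same assumption of a hypothetical `evaluate(subset)` scoring function (SBS is the mirror image, starting from all d features and removing one at a time).

```python
def sequential_forward_selection(features, evaluate, k):
    """Greedily grow a subset of size k; no backtracking."""
    selected, remaining = [], list(features)
    while remaining and len(selected) < k:
        # Add the single feature that most improves the score of the current subset.
        best_f = max(remaining, key=lambda f: evaluate(selected + [f]))
        selected.append(best_f)
        remaining.remove(best_f)
    return selected
```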
Types of Algorithms
• Supervised, unsupervised, and semi-supervised (embedded) feature selection algorithms
– e.g., principal component analysis (PCA) is an unsupervised feature extraction method: it finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data.
• But these features may not be useful for discriminating between data in different classes (see the small illustration below).
• Wrappers (wrap the selection process around the learning algorithm) vs. Filters (examine intrinsic properties of the data)
• Feature selection algorithms with filter and embedded
models may return either a subset of selected features or the
weights (measuring feature relevance) of all features.
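A small illustration of the PCA caveat above (scikit-learn; the synthetic data here is purely hypothetical): the principal components are fit from the variance of X alone and the labels y are never consulted, so the top component can be a noisy, non-discriminative direction.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Two classes that differ only along the low-variance second axis.
X = np.vstack([rng.normal([0, -0.5], [5.0, 0.3], size=(100, 2)),
               rng.normal([0, +0.5], [5.0, 0.3], size=(100, 2))])
y = np.repeat([0, 1], 100)

pca = PCA(n_components=1).fit(X)   # fit uses X only, never y
print(pca.components_)             # ~[1, 0]: the noisy, non-discriminative axis
```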
Relevance and redundancy
• A feature is statistically relevant if its removal from the feature set reduces prediction power.
• A feature may be redundant if other relevant features provide similar prediction power.
Filter Model
All d features → subset selection → m < d features → induction algorithm
(an induction algorithm derives concept descriptions from examples, i.e., a learning algorithm)
• Filtering is independent of the induction algorithm
• It is a preprocessing step
• Example: the Relief method
Relief Method
• It assigns relevance to features based on their ability to disambiguate similar samples
– Similarity is defined by proximity in feature space.
– Relevant features accumulate high positive weights, while irrelevant features retain near-zero weights.
– For each target sample:
• find the nearest sample of the same category in feature space: the "hit" sample;
• find the nearest sample of the other category: the "miss" sample.
– The relevance of feature f is then updated as w_f ← w_f + diff(f, target, miss)² − diff(f, target, hit)², i.e., by how much f differs between the target and its miss versus its hit (a code sketch follows the source note).
Source: K. Kira and L.A. Rendell
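A compact sketch of the original Relief update in Python, assuming a numeric feature matrix X scaled to [0, 1] and binary labels y: for each sampled target, the weight of a feature is increased by its squared difference to the nearest miss and decreased by its squared difference to the nearest hit.

```python
import numpy as np

def relief(X, y, n_samples=None, seed=0):
    """Original Relief for two classes; X assumed scaled to [0, 1]."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    idx = rng.choice(n, size=n_samples or n, replace=False)
    for i in idx:
        dist = np.linalg.norm(X - X[i], axis=1)
        dist[i] = np.inf                                # exclude the target itself
        same, other = (y == y[i]), (y != y[i])
        hit = np.argmin(np.where(same, dist, np.inf))   # nearest sample of the same class
        miss = np.argmin(np.where(other, dist, np.inf)) # nearest sample of the other class
        w += (X[i] - X[miss]) ** 2 - (X[i] - X[hit]) ** 2
    return w / len(idx)
```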
Other Filter Algorithms
• Laplacian Score: focuses on the local structure of the data space; computes a score reflecting each feature's locality-preserving power.
• SPEC: similar, but uses the normalized Laplacian matrix.
• Fisher Score: assigns the highest score to the feature on which the data points of different classes are farthest from each other.
• Chi-square Score: tests whether the class label is independent of a particular feature.
• Minimum-Redundancy-Maximum-Relevance (mRMR): selects features that are mutually far away from each other while still having "high" correlation to the classification variable (an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable).
• Kruskal-Wallis: a non-parametric method, based on ranks, for comparing the population medians among groups.
• Information Gain: measures the dependence between the feature and the class label.
Source: Zhao et al http://featureselection.asu.edu
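Several of these filter scores are available off the shelf; a scikit-learn sketch (assuming a non-negative feature matrix X, as the chi-square test requires, and class labels y) that ranks features by chi-square, the ANOVA F statistic (a Fisher-style criterion), and mutual information (an information-gain estimate).

```python
import numpy as np
from sklearn.feature_selection import chi2, f_classif, mutual_info_classif

def rank_features(X, y):
    """Return feature indices ranked (best first) by three common filter scores."""
    chi2_score, _ = chi2(X, y)             # requires non-negative X
    f_score, _ = f_classif(X, y)           # ANOVA F: between-class vs. within-class spread
    mi_score = mutual_info_classif(X, y)   # mutual information between feature and label
    return {name: np.argsort(score)[::-1]
            for name, score in [("chi2", chi2_score),
                                ("anova_f", f_score),
                                ("mutual_info", mi_score)]}
```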
Wrapper Model
(the search for a feature subset is wrapped around the learning algorithm itself: each candidate subset is evaluated by training the learner and measuring its accuracy)
Source: Zhao et al http://featureselection.asu.edu
BLogReg: Gavin C. Cawley and Nicola L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348-2355, 2006.
CFS: Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.
Chi-Square: H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In J.F. Vassilopoulos, editor, Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence, November 5-8, 1995, pages 388-391, Herndon, Virginia, 1995. IEEE Computer Society.
FCBF: H. Liu and L. Yu. Feature selection for high-dimensional data: A fast correlation-based filter solution. In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 856-863, Washington, D.C., 2003.
Fisher Score: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001.
Information Gain: T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
Kruskal-Wallis: L. J. Wei. Asymptotic conservativeness and efficiency of the Kruskal-Wallis test for k dependent samples. Journal of the American Statistical Association, 76(376):1006-1009, December 1981.
mRMR: H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
Relief: K. Kira and L.A. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249-256. Morgan Kaufmann, 1992.
SBMLR: Gavin C. Cawley, Nicola L. C. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. In NIPS, pages 209-216, 2006.
Spectrum: Huan Liu and Zheng Zhao. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.
Source: Zhao et al http://featureselection.asu.edu
Feature Space over Histology Images is
Large
• Texture based
• Intensity based
• Graph theoretical
– Voronoi graphs
– Cell-graphs
Voronoi Graphs and their Features
• Minimum Spanning tree and its properties
Cell-Graphs
Represent the tissue as a graph:
– A node of the graph represents a cell or cell cluster
– An edge of the graph represents a relation between a pair of nodes
(e.g., spatial, ECM)– generalization of Voronoi graphs
(a) Healthy
(b) Damaged
(c) Cancerous
What do we gain from Cell-graphs ?
• Mathematical representation
– We can apply operations to them using
• (multi) Linear Algebra
• Algorithms
• We can quantify the structural
properties with mathematically
well defined graph metrics.
• Subgraph mining
– Descriptor subgraphs
– Subgraph search in a large graph
– Subgraph Kernels
Adjacency matrix:
A(u, v) = 1 if u and v are adjacent, 0 otherwise.

Normalized Laplacian:
L(u, v) = 1                    if u = v and d_v ≠ 0,
L(u, v) = −1 / √(d_u d_v)      if u and v are adjacent,
L(u, v) = 0                    otherwise.
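A numpy sketch of these two matrices for an unweighted cell-graph given as an edge list (the function name and edge-list input are illustrative); the spectral features used later (spectral radius, second eigenvalue, trace) are read off the resulting spectra.

```python
import numpy as np

def adjacency_and_normalized_laplacian(n_nodes, edges):
    """Build A and the normalized Laplacian L for an undirected, unweighted graph."""
    A = np.zeros((n_nodes, n_nodes))
    for u, v in edges:
        A[u, v] = A[v, u] = 1
    d = A.sum(axis=1)                                    # node degrees
    with np.errstate(divide="ignore"):
        inv_sqrt_d = np.where(d > 0, 1.0 / np.sqrt(d), 0.0)
    # Diagonal is 1 for non-isolated nodes; off-diagonal is -1/sqrt(du*dv) for adjacent pairs.
    L = np.diag((d > 0).astype(float)) - inv_sqrt_d[:, None] * A * inv_sqrt_d[None, :]
    return A, L

# Example: a triangle plus one isolated node.
A, L = adjacency_and_normalized_laplacian(4, [(0, 1), (1, 2), (2, 0)])
spectral_radius = np.abs(np.linalg.eigvalsh(A)).max()    # largest |eigenvalue| of A
```

eigvalsh is appropriate here because both A and L are symmetric.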
Cell-graph Features
• Local: cell-level
– Graph theoretical: e.g. Degree, clustering coeff.
– Morphological: e.g., shape
• Global: tissue-level
– Graph theoretical
– Spectral
Rich Set of Features for Description and Classification
# of Nodes
Number of cells.
# of Edges
Number of links between cells.
Average Degree
Average number of “neighboring” cells computed over all the nodes in a cell-graph.
Giant Connected Ratio
Number of nodes in the largest connected component divided by the total number of nodes in the graph.
Clustering Coefficient Ci
Ci = 2·Ei / (k·(k − 1)),
where k is the number of neighbors of node i and Ei is the number of existing links between i and its neighbors. We exclude nodes with degree 1 (Dorogovtsev and Mendes, 2002).
% of Isolated Points (Pnts)
Percentage of nodes that have no edges incident to them
% of end Pnts
Percentage of nodes that have exactly one edge incident to them
# of Central Pnts
A node i is a central point of a graph if its eccentricity equals the minimum eccentricity (i.e., the graph radius). The set of all central points is called the graph center; the cardinality of this set defines this metric.
Eccentricity and Closeness
Given the shortest path lengths between a node i and all of the nodes reachable from it, the eccentricity and the closeness of node i are defined as the maximum and the average of these shortest path lengths, respectively.
Spectral Radius
Maximum absolute value over the graph spectrum (the set of graph eigenvalues).
Second Eigenvalue
Second largest eigenvalue in the graph spectrum.
Eigen Exponent
The slope of the sorted eigenvalues as a function of their rank, on a log-log scale.
Trace
Sum of the eigenvalues.
Triangles
Clique of 3 nodes.
Cliques
A (sub)graph such that every pair of nodes are connected with a distinct edge.
Subgraph Density
A bound on the clustering coefficient of a subgraph (e.g., at least 0.9).
Bipartite Cliques
A complete bipartite graph: all possible edges are present
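Most of the graph-theoretical entries in this table map onto standard networkx calls; a sketch, assuming the cell-graph is available as an undirected networkx.Graph G (eccentricity is computed on the giant connected component, since it is undefined for disconnected graphs).

```python
import numpy as np
import networkx as nx

def global_cell_graph_features(G):
    """Tissue-level features of a cell-graph G (undirected networkx.Graph)."""
    n = G.number_of_nodes()
    degrees = np.array([d for _, d in G.degree()])
    giant = max(nx.connected_components(G), key=len)
    ecc = nx.eccentricity(G.subgraph(giant))   # eccentricity within the giant component
    radius = min(ecc.values())
    return {
        "num_nodes": n,
        "num_edges": G.number_of_edges(),
        "average_degree": degrees.mean(),
        "giant_connected_ratio": len(giant) / n,
        # Average clustering coefficient, excluding degree-1 nodes as in the table above.
        "clustering_coefficient": nx.average_clustering(
            G, nodes=[v for v in G if G.degree(v) > 1]),
        "pct_isolated_points": 100 * np.mean(degrees == 0),
        "pct_end_points": 100 * np.mean(degrees == 1),
        "num_central_points": sum(1 for e in ecc.values() if e == radius),
    }
```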
Cell-graph Feature Selection
• Pairwise correlation of features
Goal: to find a set of features which are pairwise
independent.
• Discriminative power
Goal: to find a smaller subset of features that is as expressive as the full feature set.
Pairwise Correlation Graph
• The correlation between the graph features themselves can be represented as a correlation graph.
• The correlation graph is obtained by the procedure below (sketched in code afterwards):
– Calculate the n×n correlation matrix for n features and obtain the correlation coefficients (n = 20 in this case).
– Create a node for each feature, laid out in a circle.
– Set a threshold for correlation and establish an edge between two feature nodes if |correlation coefficient| ≥ threshold (threshold = 0.9 in this case).
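A sketch of this procedure, assuming a samples × features matrix F whose columns are the n = 20 graph features and a matching list of feature names.

```python
import numpy as np
import networkx as nx

def correlation_graph(F, feature_names, threshold=0.9):
    """Build the feature-correlation graph: an edge when |corr| >= threshold."""
    C = np.corrcoef(F, rowvar=False)   # n x n correlation matrix over feature columns
    G = nx.Graph()
    G.add_nodes_from(feature_names)
    n = len(feature_names)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(C[i, j]) >= threshold:
                G.add_edge(feature_names[i], feature_names[j])
    return G
```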
Correlation Graphs for Healthy Tissue
[correlation graphs: breast, bone, brain]
Correlation Graphs for Cancerous Tissue
[correlation graphs: breast, bone, brain]
Observations on Correlation Graphs
• The correlation graphs differ greatly depending on tissue type and (dys)functional status.
• The complexity of the correlation graph (number of edges) depends on the tissue type and tissue status.
– Some features in some cases show cluster structures (e.g., node number, edge number, and average degree in healthy breast tissue),
– but a cluster structure may not appear in all cases (e.g., brain cancer).
• The features are highly correlated.
Interpretation
• The strong correlation means a high dependency between the features, which produces a complex joint probability density function. Any probabilistic/statistical modeling attempt should be aware of this complexity.
• An uncorrelated feature does not necessarily mean a
distinguishing feature. It might not be a discriminative feature
for classification.
• The high correlation may indicate that a smaller subset of
features might be enough to discriminate the classes – but
not always
Feature Selection: good, bad, and ugly
[feature distribution plots: Breast – Average Degree; Brain – Average Degree]
Feature Selection – cont.
[feature distribution plots: Breast – End Point Percentage; Brain – End Point Percentage]
Feature Selection
Need a few good features!
Two-phase approach:
– Find the best classifier (MLP)
– Determine the features
Feature Selection
• The data is not linearly separable. Also, the features, as expected, show different distributions in each tissue type.
• 10-fold cross-validation results (accuracy percentages) for breast tissue are obtained using
– Adaboost (30 C4.5 trees),
– k-NN (k = 5),
– MLP (1 hidden layer, 12 hidden units, back propagation)
with all 20 existing cell-graph features, to see which classifier is more successful in classifying the data.
• These classifiers are used because they are good at separating non-linearly distributed data and come from different families of classification algorithms (a sketch of the comparison follows).
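A sketch of this comparison with scikit-learn stand-ins (AdaBoost over decision stumps rather than C4.5 trees, which scikit-learn does not provide), assuming the cell-graph feature matrix X and tissue labels y are already computed.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

# X: samples x 20 cell-graph features, y: tissue class labels (assumed precomputed)
classifiers = {
    "AdaBoost (30 trees)": AdaBoostClassifier(n_estimators=30),
    "k-NN (k=5)": KNeighborsClassifier(n_neighbors=5),
    "MLP (12 hidden units)": MLPClassifier(hidden_layer_sizes=(12,), max_iter=2000),
}
for name, clf in classifiers.items():
    scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
    print(f"{name}: {100 * scores.mean():.2f} ± {100 * scores.std():.2f}")
```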
Feature Selection – next step
• The classification problem is reduced to 2-class problems (healthy vs. cancerous, healthy vs. damaged, damaged vs. cancerous).
• The number of edges and the number of nodes are excluded. This exclusion also decreases the runtime of the selection.
Details
• An exhaustive search over the remaining 18 features is done using MLP. Since MLP gave the highest accuracy rate with all features, it is intuitively expected to show higher accuracy than the other classifiers during subset selection.
• The procedure is described below (sketched in code afterwards):
– Start with an empty selected feature subset with 0% accuracy (as in sequential forward selection).
– Repeat the procedure below for all possible feature subsets (2^18):
• Train the classifier and validate its accuracy with 10-fold cross-validation.
• If the average 10-fold CV accuracy of the current subset is higher than that of the selected feature subset, assign the current subset as the selected feature subset.
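The subset search itself is a loop over candidate subsets, each scored by 10-fold cross-validation of the MLP; a sketch (enumerating all subsets as described is exponential in the number of features, so the hypothetical max_size parameter can cap the search in practice).

```python
from itertools import combinations
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def best_subset(X, y, feature_names, max_size=None):
    """Score every candidate feature subset with 10-fold CV; keep the best."""
    best, best_acc = (), 0.0
    d = len(feature_names)
    for k in range(1, (max_size or d) + 1):
        for subset in combinations(range(d), k):   # all subsets of size k
            clf = MLPClassifier(hidden_layer_sizes=(12,), max_iter=2000)
            acc = cross_val_score(clf, X[:, list(subset)], y, cv=10).mean()
            if acc > best_acc:
                best, best_acc = subset, acc
    return [feature_names[i] for i in best], best_acc
```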
MLP + Exhaustive Search Results on Breast
Cancer
• The results for the breast data are given below (no normalization).
Healthy vs. Cancer: clustering coefficients; max and min eccentricity; percentage of isolated points; percentage of end points; percentage of central points. Accuracy: 84.71 ± 2.7
Healthy vs. Damaged: average degree; excluding clustering coeff.; max eccentricity 90%; effective hop diameter; percentage of isolated points. Accuracy: 80.52 ± 4.5
Damaged vs. Cancer: 15 features out of 18. Accuracy: 70.59 ± 3.64
Cell-graph Feature Selection with Relief Method
1. Average degree
2. Average Clustering coefficient
3. Average eccentricity
4. Maximum eccentricity
5. Minimum eccentricity
6. Average effective eccentricity
7. Maximum effective eccentricity
8. Minimum effective eccentricity
9. Average path length (closeness)
10. Giant connected ratio
11. Percentage of isolated points
12. Percentage of end points
13. Number of central points
14. Percentage of central points
15. Number of nodes
16. Number of edges
17. Spectral radius
18. Second largest eigenvalue
19. Trace
20. Energy
21. Number of eigenvalues
Relief-based Cell-graph Feature Selection Results
[table: selected features for different normalizations]
Modeling Branching Morphogenesis
Problem Definition
• Treated with ROCK (Rho-associated coiled-coil kinase), which regulates branching morphogenesis
• Untreated
• Can we quantify the organizing principles and distinguish
between different states of branching process?
Even a Richer Set of Features
1. Average_degree
2. C
3. C2
4. D
5. Average_eccentricity
6. Maximum_eccentricity_(diameter)
7. Minimum_eccentricity_(radius)
8. Average_eccentricity_90
9. Maximum_eccentricity_90
10. Minimum_eccentricity_90
11. Average_path_length_(closeness)
12. Giant_connected_ratio
13. Number_of_Connected_Components
14. Percentage_of_isolated_points
15. Percentage_of_end_points
16. Number_of_central_points
17. Percentage_of_central_points
18. Number_of_nodes
19. Number_of_edges
20. elongation_
21. area
22. orientation
23. eccentricity
24. perimeter
25. circularity_
26. solidity
27. largest_eigen_adjacency_
28. second_largest_adjacency
29. trace_adjacency_
30. energy_adjacency
31. #of_zeros_normalized_laplacian
32. slope_0-1_normalized_laplacian
33. #of_ones_normalized_laplacian
34. slope_1-2_normalized_laplacian
35. #of_twos_normalized_laplacian
36. trace_laplacian
37. energy_laplacian
38. degree_cluster_1
39. degree_cluster_2
40. degree_cluster_3
41. clustering_coefficient_C_cluster_1
42. clustering_coefficient_C_cluster_2
43. clustering_coefficient_C_cluster_3
44. clustering_coefficient_D_cluster_1
45. clustering_coefficient_D_cluster_2
46. clustering_coefficient_D_cluster_3
47. eccentricity_cluster_1
48. eccentricity_cluster_2
49. eccentricity_cluster_3
50. effective_eccentricity_cluster_1_
51. effective_eccentricity_cluster_2
52. effective_eccentricity_cluster_3
53. closeness_cluster_1
54. closeness_cluster_2
55. closeness_cluster_3
Classifier Comparison
• Since MLP has a higher overall accuracy, it is used in later studies in
feature selection.
               Adaboost        k-NN            MLP
Overall (%)    67.3 ± 3.27     68.13 ± 1.29    73.24 ± 1.94
Inflamed (%)   57.96 ± 9.53    65.93 ± 2.65    54.07 ± 6.28
Healthy (%)    73.82 ± 5.23    75 ± 1.70       78.38 ± 2.78
Cancerous (%)  67.82 ± 7.5     65.21 ± 1.78    78.99 ± 2.63
Epithelial vs Mesenchymal comparison in
treated tissue samples
Feature Selection Algorithm (selected features)                        Best CV rate
SVM, no feature selection                                              100
Fscore selection (7, 10, 26, 44, 45)                                   100
CfsSubsetEval (7, 10, 14, 15, 16, 21, 25, 26, 43, 44, 45)              100
ConsistencySubsetEval (10, 14)                                         95.24
ReliefFAttributeEval (26, 7)                                           100
SymmetricalUncertAttributeEval (14, 44)                                100
SVD Based (12, 20, 22, 23, 26, 41, 42, 44, 49, 52, 54, 55)             95.238
Epithelial vs Mesenchymal comparison in
untreated tissue samples
Feature Selection Algorithm (selected features)                        Best CV rate
SVM, no feature selection                                              97.619
Fscore selection (7, 26, 35)                                           97.619
CfsSubsetEval (6, 9, 14, 15, 25, 26, 43)                               95.2381
ConsistencySubsetEval (14, 21, 25)                                     97.619
ReliefFAttributeEval (7, 26, 35)                                       97.619
SymmetricalUncertAttributeEval (6, 7, 9, 15, 25, 26, 44)               97.619
SVD Based (2, 4, 12, 20, 22, 26, 27, 28, 41, 49, 52, 55)               88.0952
Treated mesenchymal vs untreated
mesenchymal comparison
Feature Selection Algorithm (selected features)                        Best CV rate
SVM, no feature selection                                              80.95
Fscore selection (3, 4, 21, 24, 26, 27, 39, 45)                        80.95
CfsSubsetEval (24)                                                     76.1905
ConsistencySubsetEval (24)                                             76.1905
ReliefFAttributeEval (21, 24, 3, 39, 45, 26, 2, 27, 4, 35, 33, 28, 42) 90.4762
SymmetricalUncertAttributeEval (24, 18, 20, 19, 16, 15, 17, 25, 27, 26, 21)  76.1905
SVD Based (12, 20, 23, 24, 26, 41, 44, 45, 49, 52, 55)                 69.0476
Treated epithelial vs untreated epithelial
comparison
Feature Selection Algorithm (selected features)                        Best CV rate
SVM, no feature selection                                              83.33
Fscore selection (3)                                                   88.09
CfsSubsetEval (3, 44, 45)                                              88.09
ConsistencySubsetEval (3, 44)                                          85.71
ReliefFAttributeEval (3, 44, 45, 46, 49)                               85.71
SymmetricalUncertAttributeEval (3, 44, 45)                             88.09
SVD Based (1, 2, 12, 20, 41, 45, 46, 49, 52, 53, 55)                   76.1905
Concluding Remarks
• Feature extraction and selection are strongly coupled for accuracy; there is always room for new features.
• Feature selection performance depends on the induction algorithm (i.e., the learning algorithm).
• Quantifiable features are not always interpretable; mapping the features to biology or pathology is the crucial link!
Thank you!