Searching For a Few Good Features
Pathology Informatics 2010
Bülent Yener
Rensselaer Polytechnic Institute, Department of Computer Science

The Hard Problem: Bad or Just Ugly?
• Unlike healthy tissue, damaged (diseased but not cancerous) tissue is hard to discriminate from cancerous tissue; this is one of the main challenges.
• We need a few good features!

Brain Tissue (Diffused)
[Images] Good: healthy; Bad: glioma; Ugly: inflammation

Gland-Based Tissue: Prostate
[Images] Good: healthy; Ugly: PIN; Bad: cancerous

Gland-Based Tissue: Breast
[Images] Good; Ugly: in situ; Bad: invasive

Bone Tissue Images
[Images] Good: healthy; Bad: osteosarcoma; Ugly: fracture

Two Related Problems
• Feature Extraction
  – Identify and compute attributes that characterize the information encoded in the histology images
  – Need to quantify!
• Feature Selection
  – Identify an optimal subset

Feature Selection
• Select a subset of the original features
  – reduces the number of features (dimensionality reduction)
  – removes irrelevant or redundant data (noise reduction)
    • speeding up a data mining algorithm
    • improving prediction accuracy
• It is a hard optimization problem!
• Optimal feature selection requires an exhaustive search of all possible subsets of features of the chosen cardinality.
  – Too expensive
• In practice: ad hoc heuristics

Greedy Algorithms
• A local optimum is searched:
  – evaluate a candidate subset of features
  – modify the subset and evaluate it
  – if the new subset is an improvement over the old, take it as current
  – else
    • if the algorithm is deterministic, reject the modification (e.g., hill climbing)
    • else accept it with some probability (e.g., simulated annealing)

Methods (partial list)
• Exhaustive search: evaluate all $\binom{d}{m}$ possible subsets of size m drawn from d features.
• Branch and Bound Search: enumerate only a fraction of the subsets; can find the optimum, but the worst case is exponential.
• Best features (isolated): evaluate all m features in isolation; no guarantee of finding the optimum.
• Sequential Forward Selection (SFS): start with the best feature and add one at a time; no backtracking.
• Sequential Backward Selection (SBS): start with all d features and eliminate one at a time; more expensive than SFS, and no backtracking either.
• Variants of SFS and SBS: e.g., start with the k best features, then delete r of them, and so on.

Types of Algorithms
• Supervised, unsupervised, and semi-supervised (embedded) feature selection algorithms
  – e.g., principal component analysis (PCA) is an unsupervised feature extraction method: it finds a set of mutually orthogonal basis functions that capture the directions of maximum variance in the data.
    • But these features may not be useful for discriminating between data in different classes.
• Wrappers (wrap the selection process around the learning algorithm) and Filters (examine intrinsic properties of the data)
• Feature selection algorithms with filter and embedded models may return either a subset of selected features or the weights (measuring feature relevance) of all features.

Relevance and Redundancy
• A feature is statistically relevant if its removal from a feature set reduces the prediction power.
• A feature may be redundant because other relevant features provide similar prediction power.

Filter Model
[Diagram] All d features → Subset selection → m < d features → Induction algorithm
(Induction algorithm: an algorithm inducing concept descriptions from examples, i.e., a learning algorithm)
• Filtering is independent of the induction algorithm
• It is a preprocessing step
• Example: Relief method

Relief Method
• It assigns relevance to features based on their ability to disambiguate similar samples
  – Similarity is defined by proximity in feature space.
  – Relevant features accumulate high positive weights, while irrelevant features retain near-zero weights.
  – For each target sample,
    • find the nearest sample in feature space of the same category, the "hit" sample.
    • find the nearest sample of the other category, the "miss" sample.
  – The relevance of feature f near the target sample is measured as: [equation not preserved in the slide text]
Source: K. Kira and L.A. Rendell

Other Filter Algorithms
• Laplacian Score: focuses on the local structure of the data space; scores each feature by its locality-preserving power.
• SPEC: similar, but uses the normalized Laplacian matrix.
• Fisher Score: assigns the highest score to the feature on which the data points of different classes are far from each other.
• Chi-square Score: tests whether the class label is independent of a particular feature.
• Minimum-Redundancy-Maximum-Relevance (mRMR): selects features that are mutually far away from each other while still having "high" correlation with the classification variable (an approximation to maximizing the dependency between the joint distribution of the selected features and the classification variable).
• Kruskal-Wallis: a non-parametric, rank-based method for comparing the population medians among groups.
• Information Gain: a measure of the dependence between the feature and the class label.
Source: Zhao et al., http://featureselection.asu.edu

Wrapper Model
Source: Zhao et al., http://featureselection.asu.edu

References
BLogReg: Gavin C. Cawley and Nicola L. C. Talbot. Gene selection in cancer classification using sparse logistic regression with Bayesian regularization. Bioinformatics, 22(19):2348-2355, 2006.
CFS: Mark A. Hall and Lloyd A. Smith. Feature selection for machine learning: Comparing a correlation-based filter approach to the wrapper, 1999.
Chi-Square: H. Liu and R. Setiono. Chi2: Feature selection and discretization of numeric attributes. In J.F. Vassilopoulos, editor, Proceedings of the Seventh IEEE International Conference on Tools with Artificial Intelligence, November 5-8, 1995, pages 388-391, Herndon, Virginia, 1995. IEEE Computer Society.
FCBF: H. Liu and L. Yu. Feature selection for high-dimensional data: A fast correlation-based filter solution.
In Proceedings of the Twentieth International Conference on Machine Learning (ICML-03), pages 856-863, Washington, D.C., 2003.
Fisher Score: R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification. John Wiley & Sons, New York, 2nd edition, 2001.
Information Gain: T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, 1991.
Kruskal-Wallis: L. J. Wei. Asymptotic conservativeness and efficiency of Kruskal-Wallis test for k dependent samples. Journal of the American Statistical Association, 76(376):1006-1009, December 1981.
mRMR: H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226-1238, 2005.
Relief: K. Kira and L.A. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Proceedings of the Ninth International Conference on Machine Learning (ICML-92), pages 249-256. Morgan Kaufmann, 1992.
SBMLR: Gavin C. Cawley, Nicola L. C. Talbot, and Mark Girolami. Sparse multinomial logistic regression via Bayesian L1 regularisation. In NIPS, pages 209-216, 2006.
Spectrum: Huan Liu and Zheng Zhao. Spectral feature selection for supervised and unsupervised learning. In Proceedings of the 24th International Conference on Machine Learning, 2007.
Source: Zhao et al., http://featureselection.asu.edu

Feature Space over Histology Images is Large
• Texture based
• Intensity based
• Graph theoretical
  – Voronoi graphs
  – Cell-graphs

Voronoi Graphs and Their Features
• Minimum spanning tree and its properties

Cell-Graphs
Represent the tissue as a graph:
  – A node of the graph represents a cell or cell cluster
  – An edge of the graph represents a relation between a pair of nodes (e.g., spatial, ECM)
  – A generalization of Voronoi graphs
[Images] (a) Healthy (b) Damaged (c) Cancerous

What Do We Gain from Cell-graphs?
• Mathematical representation
  – We can operate on them using
    • (multi)linear algebra
    • algorithms
• We can quantify the structural properties with mathematically well-defined graph metrics.
• Subgraph mining
  – Descriptor subgraphs
  – Subgraph search in a large graph
  – Subgraph kernels

Adjacency matrix:
$$A(u,v) = \begin{cases} 1 & \text{if } u \text{ and } v \text{ are adjacent,} \\ 0 & \text{otherwise.} \end{cases}$$

Normalized Laplacian (with $d_v$ the degree of node $v$):
$$L(u,v) = \begin{cases} 1 & \text{if } u = v \text{ and } d_v \neq 0, \\ -\dfrac{1}{\sqrt{d_u d_v}} & \text{if } u \text{ and } v \text{ are adjacent,} \\ 0 & \text{otherwise.} \end{cases}$$

Cell-graph Features
• Local: cell-level
  – Graph theoretical: e.g., degree, clustering coefficient
  – Morphological: e.g., shape
• Global: tissue-level
  – Graph theoretical
  – Spectral

Rich Set of Features for Description and Classification
• # of Nodes: Number of cells.
• # of Edges: Number of links between cells.
• Average Degree: Average number of "neighboring" cells, computed over all the nodes in a cell-graph.
• Giant Connected Ratio: (number of nodes in the largest connected component) / (total number of nodes in the graph).
• Clustering Coefficient $C_i$: $C_i = 2E_i / (k(k-1))$, where $k$ is the number of neighbors of node $i$ and $E_i$ is the number of existing links among the neighbors of $i$. We exclude the nodes with degree 1 (Dorogovtsev and Mendes, 2002).
• % of Isolated Points: Percentage of nodes that have no edges incident to them.
• % of End Points: Percentage of nodes that have exactly one edge incident to them.
• # of Central Points: A node i is a central point of a graph if its eccentricity equals the minimum eccentricity (i.e., the graph radius). The set of all central points is called the graph center; this metric is the cardinality of that set.
• Eccentricity, Closeness: Given the shortest path lengths between a node i and all of the nodes reachable from it, the eccentricity and the closeness of node i are defined as the maximum and the average of these shortest path lengths, respectively.
• Spectral Radius: Maximum absolute value of the eigenvalues in the spectrum of a graph (the set of graph eigenvalues).
• 2nd Eigenvalue: Second largest eigenvalue in the graph spectrum.
• Eigen Exponent: The slope of the sorted eigenvalues as a function of their order, in log-log scale.
• Trace: Sum of the eigenvalues.
• Triangles: Cliques of 3 nodes.
• Cliques: A (sub)graph in which every pair of nodes is connected by a distinct edge.
• Subgraph Density: A bound on the clustering coefficient of a subgraph (e.g., at least 0.9).
• Bipartite Cliques: A complete bipartite graph; all possible edges are present.

Cell-graph Feature Selection
• Pairwise correlation of features
  Goal: to find a set of features that are pairwise independent.
• Discriminative power
  Goal: to find a smaller subset of features that is as expressive as the full feature set.

Pairwise Correlation Graph
• The correlation between the graph features themselves can be represented as a correlation graph.
• The correlation graph is obtained by the procedure below.
  – Calculate the n×n correlation matrix for n features and obtain the correlation coefficients (n = 20 in this case).
  – Create a node for each feature, with the nodes arranged in a circle.
  – Set a threshold for correlation and establish an edge between two feature nodes if |correlation coefficient| ≥ threshold (threshold = 0.9 in this case).

Correlation Graphs for Healthy Tissue
[Figure panels] Breast, Bone, Brain

Correlation Graphs for Cancerous Tissue
[Figure panels] Breast, Bone, Brain

Observations on Correlation Graphs
• The correlation graphs differ greatly depending on tissue type and (dys)functional status.
• The complexity of the correlation graph (number of edges) depends on the tissue type and tissue status.
  – Some features in some cases show cluster structures (e.g., number of nodes, number of edges, and average degree in healthy breast tissue),
  – but a cluster structure may not appear in all cases (e.g., cancerous brain tissue).
• The features are highly correlated.

Interpretation
• The strong correlation means a high dependency between the features, which causes a complex joint probability density function.
  Any probabilistic or statistical modeling attempt should be aware of this complexity.
• An uncorrelated feature is not necessarily a distinguishing feature. It might not be discriminative for classification.
• The high correlation may indicate that a smaller subset of features might be enough to discriminate the classes, but not always.

Feature Selection: Good, Bad, and Ugly
[Figure panels] Breast – Average Degree; Brain – Average Degree

Feature Selection (cont.)
[Figure panels] Breast – End Point Percentage; Brain – End Point Percentage

Feature Selection
Need a few good features!
Two-phase approach:
  – Find the best classifier (MLP)
  – Determine the features

Feature Selection
• The data is not linearly separable. Also, the features, as expected, show different distributions in each tissue type.
• 10-fold cross-validation results (accuracy percentages) for breast tissue are obtained with all 20 existing features, to see which classifier is most successful at classifying the data using cell-graph features:
  – Adaboost (30 C4.5 trees),
  – k-nn (k = 5),
  – MLP (1 hidden layer, 12 hidden units, back propagation).
• These classifiers are used because they are good at separating nonlinearly distributed data and they come from different families of classification algorithms.

Feature Selection – Next Step
• The classification problem is reduced to 2-class problems (healthy vs. cancerous, healthy vs. damaged, damaged vs. cancerous).
• Number of edges and number of nodes are excluded. This exclusion also decreases the runtime of the selection.

Details
• An exhaustive search over 18 features is done using MLP. Since MLP gave the highest accuracy rate with all features, it is intuitively expected to show higher accuracy than the other classifiers during subset selection.
• The procedure is described below.
  – Start with an empty selected feature subset with 0 accuracy percentage (seq. forward selection alg.).
  – Repeat the procedure below for all possible feature subsets ($2^{18}$).
    • Train the classifier and validate its accuracy with 10-fold cross-validation.
    • If the average 10-fold CV accuracy of the current subset is higher than that of the selected feature subset, assign the current subset as the selected feature subset.

MLP + Exhaustive Search Results on Breast Cancer
• The results for the breast data are given below (no normalization).
  – Healthy vs. Cancer: accuracy 84.71 ± 2.7 %
    Features selected: Clustering Coefficients; Max and Min Eccentricity; Perc. of Isolated Points; Perc. of End Points; Perc. of Central Points
  – Healthy vs. Damaged: accuracy 80.52 ± 4.5 %
    Features selected: Average Degree; Excluding Clustering Coeff.; Max Eccentricity; 90% Effective Hop Diameter; Perc. of Isolated Points
  – Damaged vs. Cancer: accuracy 70.59 ± 3.64 %
    Features selected: 15 features out of 18

Cell-graph Feature Selection with Relief Method
1. Average degree
2. Average clustering coefficient
3. Average eccentricity
4. Maximum eccentricity
5. Minimum eccentricity
6. Average effective eccentricity
7. Maximum effective eccentricity
8. Minimum effective eccentricity
9. Average path length (closeness)
10. Giant connected ratio
11. Percentage of isolated points
12. Percentage of end points
13. Number of central points
14. Percentage of central points
15. Number of nodes
16. Number of edges
17. Spectral radius
18. Second largest eigenvalue
19. Trace
20. Energy
21. Number of eigenvalues

Relief-Based Cell-graph Feature Selection Result
[Figure] Selected Features for Different Normalization

Modeling Branching Morphogenesis
Problem Definition
• Treated with ROCK (Rho-associated coiled-coil kinase), which regulates branching morphogenesis
• Untreated
• Can we quantify the organizing principles and distinguish between different states of the branching process?
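
One way to make this question concrete is to rank features by how well they separate two states (e.g., treated vs. untreated). The sketch below is a minimal two-class Relief in plain Python that follows the hit/miss idea from the Relief Method slide. It is an illustration only, not the implementation behind the results in these slides. The function name `relief_weights`, the min-max scaling, the Manhattan distance, and the update rule (feature difference to the miss minus feature difference to the hit, averaged over targets) are assumptions chosen for clarity; the update shown is one common variant.

```python
import random


def relief_weights(X, y, n_iter=None, seed=0):
    """Return one Relief-style relevance weight per feature (hypothetical helper).

    X: list of samples, each a list of numeric feature values.
    y: list of class labels (two classes, at least two samples per class).
    n_iter: number of randomly drawn target samples; None uses every sample once.
    """
    n, d = len(X), len(X[0])

    # Min-max scale each feature so differences are comparable across features.
    lo = [min(row[f] for row in X) for f in range(d)]
    hi = [max(row[f] for row in X) for f in range(d)]
    span = [(hi[f] - lo[f]) or 1.0 for f in range(d)]
    Xs = [[(row[f] - lo[f]) / span[f] for f in range(d)] for row in X]

    def dist(a, b):
        # Manhattan distance in the scaled feature space (an assumed choice).
        return sum(abs(p - q) for p, q in zip(a, b))

    rng = random.Random(seed)
    if n_iter is None:
        targets = list(range(n))
    else:
        targets = [rng.randrange(n) for _ in range(n_iter)]

    w = [0.0] * d
    for i in targets:
        others = [j for j in range(n) if j != i]
        # Nearest sample of the same class ("hit") and of the other class ("miss").
        hit = min((j for j in others if y[j] == y[i]),
                  key=lambda j: dist(Xs[i], Xs[j]))
        miss = min((j for j in others if y[j] != y[i]),
                   key=lambda j: dist(Xs[i], Xs[j]))
        for f in range(d):
            # Reward features that differ on the miss, penalize those that differ on the hit.
            w[f] += abs(Xs[i][f] - Xs[miss][f]) - abs(Xs[i][f] - Xs[hit][f])

    return [v / len(targets) for v in w]
```

Called as `relief_weights(X, y)` on a feature matrix and its class labels, the function returns a list of weights; sorting the feature indices by descending weight gives a ranking. Features whose values differ mostly between classes end up with large positive weights, while features that vary as much within a class as between classes stay near zero or go negative.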
Even a Richer Set of Features
1. Average_degree
2. C
3. C2
4. D
5. Average_eccentricity
6. Maximum_eccentricity_(diameter)
7. Minimum_eccentricity_(radius)
8. Average_eccentricity_90
9. Maximum_eccentricity_90
10. Minimum_eccentricity_90
11. Average_path_length_(closeness)
12. Giant_connected_ratio
13. Number_of_Connected_Components
14. Percentage_of_isolated_points
15. Percentage_of_end_points
16. Number_of_central_points
17. Percentage_of_central_points
18. Number_of_nodes
19. Number_of_edges
20. elongation_
21. area
22. orientation
23. eccentricity
24. perimeter
25. circularity_
26. solidity
27. largest_eigen_adjacency_
28. second_largest_adjacency
29. trace_adjacency_
30. energy_adjacency
31. #of_zeros_normalized_laplacian
32. slope_0-1_normalized_laplacian
33. #of_ones_normalized_laplacian
34. slope_1-2_normalized_laplacian
35. #of_twos_normalized_laplacian
36. trace_laplacian
37. energy_laplacian
38. degree_cluster_1
39. degree_cluster_2
40. degree_cluster_3
41. clustering_coefficient_C_cluster_1
42. clustering_coefficient_C_cluster_2
43. clustering_coefficient_C_cluster_3
44. clustering_coefficient_D_cluster_1
45. clustering_coefficient_D_cluster_2
46. clustering_coefficient_D_cluster_3
47. eccentricity_cluster_1
48. eccentricity_cluster_2
49. eccentricity_cluster_3
50. effective_eccentricity_cluster_1_
51. effective_eccentricity_cluster_2
52. effective_eccentricity_cluster_3
53. closeness_cluster_1
54. closeness_cluster_2
55. closeness_cluster_3

Classifier Comparison
• Since MLP has a higher overall accuracy, it is used in later studies in feature selection.
                 Adaboost        k-nn            MLP
Overall (%)      67.3 ± 3.27     68.13 ± 1.29    73.24 ± 1.94
Inflamed (%)     57.96 ± 9.53    65.93 ± 2.65    54.07 ± 6.28
Healthy (%)      73.82 ± 5.23    75 ± 1.70       78.38 ± 2.78
Cancerous (%)    67.82 ± 7.5     65.21 ± 1.78    78.99 ± 2.63

Epithelial vs Mesenchymal comparison in treated tissue samples
Feature Selection Algorithm | Selected features | Best CV rate (SVM)
No Feature Selection | (all) | 100
Fscore selection | 7,10,26,44,45 | 100
CfsSubsetEval | 7,10,14,15,16,21,25,26,43,44,45 | 100
ConsistencySubsetEval | 10,14 | 95.24
ReliefFAttributeEval | 26,7 | 100
SymmetricalUncertAttributeEval | 14,44 | 100
SVD Based | 12,20,22,23,26,41,42,44,49,52,54,55 | 95.238

Epithelial vs Mesenchymal comparison in untreated tissue samples
Feature Selection Algorithm | Selected features | Best CV rate (SVM)
No Feature Selection | (all) | 97.619
Fscore selection | 7,26,35 | 97.619
CfsSubsetEval | 6,9,14,15,25,26,43 | 95.2381
ConsistencySubsetEval | 14,21,25 | 97.619
ReliefFAttributeEval | 7,26,35 | 97.619
SymmetricalUncertAttributeEval | 6,7,9,15,25,26,44 | 97.619
SVD Based | 2,4,12,20,22,26,27,28,41,49,52,55 | 88.0952

Treated mesenchymal vs untreated mesenchymal comparison
Feature Selection Algorithm | Selected features | Best CV rate (SVM)
No Feature Selection | (all) | 80.95
Fscore selection | 3,4,21,24,26,27,39,45 | 80.95
CfsSubsetEval | 24 | 76.1905
ConsistencySubsetEval | 24 | 76.1905
ReliefFAttributeEval | 21,24,3,39,45,26,2,27,4,35,33,28,42 | 90.4762
SymmetricalUncertAttributeEval | 24,18,20,19,16,15,17,25,27,26,21 | 76.1905
SVD Based | 12,20,23,24,26,41,44,45,49,52,55 | 69.0476

Treated epithelial vs untreated epithelial comparison
Feature Selection Algorithm | Selected features | Best CV rate (SVM)
No Feature Selection | (all) | 83.33
Fscore selection | 3 | 88.09
CfsSubsetEval | 3,44,45 | 88.09
ConsistencySubsetEval | 3,44 | 85.71
ReliefFAttributeEval | 3,44,45,46,49 | 85.71
SymmetricalUncertAttributeEval | 3,44,45 | 88.09
SVD Based | 1,2,12,20,41,45,46,49,52,53,55 | 76.1905

Concluding Remarks
• Feature extraction and selection are strongly coupled for accuracy; there is always room for new features.
• Feature selection performance depends on the induction algorithm
  (i.e., the learning algorithm).
• Quantifiable features are not always interpretable; mapping the features to biology or pathology is a crucial link!

Thank you!