Toward A Unifying Taxonomy of Feature Selection

PI: Huan Liu
Department of Computer Science & Engineering
Arizona State University

Contact Information
Huan Liu
PO Box 875406
Arizona State University
Tempe, AZ 85287-5406
Phone: (480) 727-7349
Fax: (480) 965-2751
Email: hliu@asu.edu

WWW PAGE
http://www.public.asu.edu/~huanliu/

List of Supported Student
Lei Yu

Project Award Information
Award Number: IIS-0127815
Duration: 09/15/2001 -- 08/31/2002
Period of Performance of the entire project: 09/15/2001 -- 08/31/2002
Title: SGER: Toward a Unifying Taxonomy for Feature Selection

Keywords
Feature Selection, Dimensionality Reduction, Data Mining

Project Summary
The objective of this Small Grant for Exploratory Research (SGER) is to establish a unifying taxonomy of feature selection. Feature selection chooses a subset of features from those available. It can be viewed as an optimization problem of exponential time complexity along several dimensions. Many feature selection algorithms have been developed, and systems have been deployed in real-world applications. However, there is a distinct gap between what theory suggests and what practice reveals, and the proliferation of feature selection algorithms has not produced a general methodology that facilitates new research and development on feature selection. This project is a first step toward addressing these two issues. The task is accomplished in two steps: (1) defining a common platform on which representative algorithms can be considered on an equal footing; and (2) building a unifying taxonomy to discover how the algorithms complement each other and what is missing. Representative data and algorithms are collected during the project for investigation. The results of this project will be a contemporary survey, a unifying taxonomy of feature selection algorithms, and some potential solutions to the automatic selection problem, that is, automatically choosing the most suitable feature selection algorithm under given conditions.
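The exponential cost mentioned in the summary can be made concrete with a minimal sketch. With n features there are 2^n - 1 non-empty candidate subsets, so an exhaustive search must evaluate all of them. The function names, the feature names f1 to f4, and the toy criterion below are hypothetical and serve only as illustration; they are not taken from the project.

```python
from itertools import combinations


def all_subsets(features):
    """Yield every non-empty subset of `features` (2**n - 1 subsets in total)."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)


def exhaustive_search(features, score):
    """Return the subset with the highest score by evaluating every subset."""
    return max(all_subsets(features), key=score)


# Hypothetical toy criterion: reward relevant features, penalize irrelevant ones.
RELEVANT = {"f1", "f3"}


def toy_score(subset):
    chosen = set(subset)
    return len(chosen & RELEVANT) - 0.1 * len(chosen - RELEVANT)


features = ["f1", "f2", "f3", "f4"]
subset_count = sum(1 for _ in all_subsets(features))
best = exhaustive_search(features, toy_score)
print(subset_count, best)
```

Four features already give 15 candidate subsets, and each added feature roughly doubles that number, which is why practical algorithms rely on complete-but-pruned, heuristic, or randomized search instead of plain enumeration.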

Publications and Products
Feature Selection Survey: draft prepared; planned for submission to a journal when the project enters its fourth quarter.
Book chapter on Feature Extraction, Selection and Construction (draft) submitted to the LEA Handbook of Data Mining.
Invited paper on Feature Selection, Extraction, and Construction at PAKDD2002 Workshop 3 on Towards the Foundation of Data Mining.
Feature Selection with Selective Sampling, submitted to ICML2002, Experimental Results.
Active Sampling for Dimensionality Reduction, submitted to SIGKDD2002, Experimental Results.
A project web site has been created to hold related information and updates.

Project Impact
One straight-A Ph.D. student is supported by this project. The student has been trained to conduct research building on the previous work and has made significant progress toward his Ph.D. thesis. The work produced in the project is being used in the graduate seminar course on Data Mining offered in Spring 2002 to emphasize the critical role of data preprocessing in data mining. Exploration with industry has led to a joint effort on embedded data extraction with Motorola. A full proposal to explore solutions to the automatic selection problem was submitted to NSF IDM, based on the latest developments in feature selection and its applications sponsored by this NSF grant.

Goals, Objectives, and Targeted Activities
The ultimate goal of this project is to explore ways to take advantage of the rapid developments in feature selection without delay. The objectives of this project are: (1) to organize the existing feature selection methods systematically and sensibly, so that a user can search for the most suitable method for an application and a researcher can identify new problems; and (2) to attempt the automatic selection problem with a framework so that a recommendation can be offered when needed.
In this way, feature selection can be made transparent in data mining, in a limited sense, so that a data miner can focus on data mining applications. Toward these objectives, we will complete the survey, collect representative data sets, apply or re-implement representative feature selection algorithms, and use the generalized framework to experiment with potential solutions to the automatic selection problem, furthering the work based on the survey as outlined in the full proposal. In the meantime, we will use this project as an opportunity to promote and strengthen our activities in education, student training, community services, and industrial collaboration. The targeted activities are:
(1) to improve the current data mining course and design a web-based training course for a broader audience (managers, engineers, planners, etc.) interested in data mining;
(2) to work with researchers in other disciplines (chemical engineering, bioinformatics, information security) to disseminate and apply state-of-the-art data mining techniques;
(3) to propose a data mining curriculum for senior students and help them prepare to face the challenges of this information explosion age by offering an introductory data mining course;
(4) to work with industrial partners in applying the techniques developed to real-world problems (one project on embedded data extraction has been planned and will be pursued).

Project References
The latest developments of the project are frequently updated at the project web site: http://www.public.asu.edu/~huanliu/feature_selection.html.
"Feature Selection for Knowledge Discovery and Data Mining", H. Liu and H. Motoda, July 1998, Kluwer Academic Publishers.
"Feature Extraction, Construction and Selection: A Data Mining Perspective", H. Liu and H. Motoda, July 1998 (2nd Printing, 2001), Kluwer Academic Publishers.

Area Background
Feature selection is a process that chooses a subset of features from the original features so that the feature space is optimally reduced according to a certain criterion. It is a data pre-processing technique often used to deal with large data sets in data mining. Its role is to (1) reduce the dimensionality of the feature space, (2) speed up a mining/learning algorithm, (3) improve the comprehensibility of the mining results, and (4) increase the performance of a mining algorithm (e.g., predictive accuracy). Feature selection can be considered in the context of search. Various search algorithms are applied: exhaustive (breadth-first), complete (branch-and-bound), heuristic, and randomized. Feature selection finds a wide spectrum of applications. It can be applied in both supervised and unsupervised learning, as well as in text mining. Feature selection is a multi-disciplinary area with contributions from statistical pattern recognition, data mining, and machine learning.

Area References
"Selection of Relevant Features and Examples in Machine Learning", A.L. Blum and P. Langley, Artificial Intelligence, 97:245-271, 1997.
"Feature Selection Methods for Classification", M. Dash and H. Liu, Intelligent Data Analysis: An International Journal, 1(3), 1997.
"Wrappers for Feature Subset Selection", R. Kohavi and G.H. John, Artificial Intelligence, 97(1-2):273-324, 1997.
"Feature Selection: Evaluation, application, and small sample performance", A. Jain and D. Zongker, IEEE PAMI, 19(2):153-158, 1997.
"Data Preparation for Data Mining", D. Pyle, Morgan Kaufmann, 1999.
"Feature Subset Selection and Order Identification for Unsupervised Learning", J.G. Dy and C.E. Brodley, Proceedings of the 17th Intl Conference on Machine Learning, pages 247–254, 2000.
"Introduction to Statistical Pattern Recognition", 2nd edition, K. Fukunaga, Morgan Kaufmann, 1990.
"Pattern Classification", 2nd edition, R.O. Duda, P.E. Hart, D.G. Stork, Wiley, 2001.
"Instance Selection and Construction for Data Mining", H. Liu, and H. Motoda, Kluwer Academic Publishers, 2001. Potential Related Projects Some potential projects or collaborations in data pre-processing for data mining are: 1. Image Mining with Feature Extraction – study how to extract and selection relevant and effective features for object recognition using data mining techniques. 2. Embedded Data Extraction of Data Streams – study how to extract features and selection instances from data streams in an embedded environment. 3. Focusing techniques – study other related selection techniques: database selection, instance selection, plan selection, etc. Back to the IDM '02 homepage