Toward A Unifying Taxonomy of Feature
Selection
PI: Huan Liu
Department of Computer Science & Engineering
Arizona State University
Contact Information
Huan Liu
PO Box 875406
Arizona State University
Tempe, AZ 85287-5406
Phone: (480) 727-7349
Fax: (480) 965-2751
Email: hliu@asu.edu
WWW PAGE
http://www.public.asu.edu/~huanliu/
List of Supported Students
Lei Yu
Project Award Information

Award Number: IIS-0127815

Duration: 09/15/2001 -- 08/31/2002. Period of performance of the entire project:
09/15/2001 -- 08/31/2002.

Title: SGER: Toward a Unifying Taxonomy for Feature Selection
Keywords
Feature Selection, Dimensionality Reduction, Data Mining
Project Summary
The objective of this Small Grant for Exploratory Research (SGER) is to establish a unifying
taxonomy of feature selection. Feature selection chooses a subset of features from among the
available ones. It can be viewed as an optimization problem of exponential time complexity along
several dimensions. Many feature selection algorithms have been developed and systems
deployed in real-world applications. However, there is a distinct gap between what theory
suggests and what practice reveals, and the proliferation of feature selection algorithms has not
brought about a general methodology that facilitates new research and development on feature
selection. This project is a first step toward addressing these two issues. The task is accomplished
in two steps: (1) defining a common platform on which representative algorithms can be considered
on an equal footing; and (2) building a unifying taxonomy to discover how they complement each other and
what is missing. Representative data and algorithms are collected during the project for
investigation. The results of this project will be a contemporary survey, a unifying taxonomy of
feature selection algorithms, and some potential solutions to the automatic selection problem, that is,
automatically choosing the most suitable feature selection algorithm under given conditions.
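To make the exponential search space concrete, the following minimal Python sketch enumerates every non-empty subset of N features and picks the best one under a toy criterion. The names `all_subsets`, `exhaustive_select`, and `toy_score`, as well as the feature labels, are hypothetical illustrations and are not part of the project's software.

```python
from itertools import combinations

def all_subsets(features):
    """Yield every non-empty subset of `features` as a tuple."""
    for k in range(1, len(features) + 1):
        yield from combinations(features, k)

def exhaustive_select(features, score):
    """Return the subset that maximizes `score` by trying all 2^N - 1 subsets."""
    return max(all_subsets(features), key=score)

def toy_score(subset):
    # Hypothetical criterion: reward the "relevant" features a and c,
    # and penalize subset size so that smaller subsets are preferred.
    relevant = {"a", "c"}
    return len(relevant & set(subset)) - 0.1 * len(subset)

features = ["a", "b", "c", "d"]
n_candidates = sum(1 for _ in all_subsets(features))  # grows as 2^N - 1
best = exhaustive_select(features, toy_score)
print(n_candidates, best)
```

With 4 features there are only 15 candidate subsets, but the count doubles with every added feature, which is why exhaustive evaluation quickly becomes impractical and other search strategies are of interest.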
Publications and Products
Feature Selection Survey: draft prepared, to be submitted to a journal once the project enters its fourth quarter.
Book chapter on Feature Extraction, Selection and Construction (draft) submitted to the LEA Handbook of Data Mining.
Invited paper on Feature Selection, Extraction, and Construction at PAKDD2002 Workshop 3 on Towards the Foundation of Data Mining.
Feature Selection with Selective Sampling, submitted to ICML2002; Experimental Results.
Active Sampling for Dimensionality Reduction, submitted to SIGKDD2002; Experimental Results.
A project web site has been created to provide related information and updates.
Project Impact
One straight-A Ph.D. student is supported by this project. The student has been trained to conduct
research building on previous work and has made significant progress toward his Ph.D. thesis.
The work produced in the project is being used in the graduate seminar course on Data Mining
offered in Spring 2002 to emphasize the critical role of data preprocessing in data mining.
Exploration with industry has led to a joint effort on embedded data extraction with Motorola. A
full proposal to explore solutions to the automatic selection problem was submitted to NSF
IDM, based on the latest developments in feature selection and its applications sponsored by this
NSF grant.
Goals, Objectives, and Targeted Activities
The ultimate goal of this project is to explore ways to take advantage of the rapid developments
in feature selection without delay. The objectives of this project are: (1) to organize the existing
feature selection methods systematically and sensibly, so that a user can search for the most
suitable method for an application and a researcher can identify new problems; and (2) to
attempt the automatic selection problem with a framework so that a recommendation can be
offered when needed. In this way, feature selection can be made transparent in data mining, in a
limited sense, allowing a data miner to focus on data mining applications. Toward these objectives,
we will complete the survey, collect representative data sets, apply or re-implement representative
feature selection algorithms, and use the generalized framework to experiment with potential solutions
to the automatic selection problem, furthering the work based on the survey as outlined in the full
proposal. In the meantime, we will use this project as an opportunity to promote and strengthen
our activities in education, student training, community services, and industrial collaboration. The
targeted activities are: (1) to improve the current data mining course and design a web-based
training course for a broader audience (managers, engineers, planners, etc.) interested in
data mining; (2) to work with researchers in other disciplines (chemical engineering, bioinformatics,
information security) to disseminate and apply state-of-the-art data mining techniques;
(3) to propose a data mining curriculum for senior students and help them prepare to face the
challenges of this information explosion age by offering an introductory data mining course; and (4)
to work with industrial partners in applying the techniques developed to real-world problems
(one project on embedded data extraction has been planned and will be pursued).
Project References



The latest developments of the project are frequently updated at the project web site:
http://www.public.asu.edu/~huanliu/feature_selection.html.
"Feature Selection for Knowledge Discovery and Data Mining", H. Liu and H. Motoda, July 1998,
Kluwer Academic Publishers.
"Feature Extraction, Construction and Selection: A Data Mining Perspective", H. Liu and H. Motoda,
July 1998 (2nd Printing, 2001), Kluwer Academic Publishers.
Area Background
Feature selection is a process that chooses a subset of features from the original features so that
the feature space is optimally reduced according to a certain criterion. It is a data pre-processing
technique often used to deal with large data sets in data mining. Its role is to (1) reduce the
dimensionality of the feature space, (2) speed up a mining/learning algorithm, (3) improve the
comprehensibility of the mining results, and (4) increase the performance of a mining algorithm (e.g.,
predictive accuracy). Feature selection can be considered in the context of search. Various search
strategies are applied: exhaustive (breadth-first), complete (branch-and-bound), heuristic, and
randomized. Feature selection finds a wide spectrum of applications: it can be applied in both
supervised and unsupervised learning, as well as in text mining. It is a multi-disciplinary area
with contributions from statistical pattern recognition, data mining, and machine learning.
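As an illustration of two of the search strategies named above, the following minimal Python sketch contrasts a heuristic search (greedy sequential forward selection) with a randomized search over feature subsets. The function names, the toy evaluation criterion, and the feature labels are assumptions made for this example only; they do not come from the project or from any particular published algorithm.

```python
import random

def forward_selection(features, score):
    """Heuristic search: greedily add the feature that most improves `score`."""
    selected = []
    best_score = score(selected)
    remaining = list(features)
    while remaining:
        # Pick the remaining feature whose addition gives the highest score.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        candidate_score = score(selected + [candidate])
        if candidate_score <= best_score:
            break  # stop when no single addition improves the criterion
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected

def random_search(features, score, n_trials=200, seed=0):
    """Randomized search: sample random subsets and keep the best one seen."""
    rng = random.Random(seed)
    best, best_score = [], score([])
    for _ in range(n_trials):
        subset = [f for f in features if rng.random() < 0.5]
        s = score(subset)
        if s > best_score:
            best, best_score = subset, s
    return best

def toy_score(subset):
    # Hypothetical criterion: reward the "relevant" features a and c,
    # and penalize subset size so that smaller subsets are preferred.
    relevant = {"a", "c"}
    return len(relevant & set(subset)) - 0.1 * len(subset)

features = ["a", "b", "c", "d"]
greedy = forward_selection(features, toy_score)
randomized = random_search(features, toy_score)
print(greedy, randomized)
```

The greedy search evaluates on the order of N^2 subsets rather than 2^N, at the cost of possibly missing the optimum when features are only useful in combination; the randomized search trades a fixed evaluation budget for no guarantee of optimality.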
Area References









"Selection of Relevant Features and Examples in Machine Learning", A.L. Blum and P. Langley,
Artificial Intelligence, 97:245-271, 1997.
"Feature Selection Methods for Classification", M. Dash and H. Liu, Intelligent Data Analysis: An
International Journal, 1(3), 1997.
"Wrappers for Feature Subset Selection", R. Kohavi and G.H. John, Artificial Intelligence, 97(1-2):273-324, 1997.
"Feature Selection: Evaluation, application, and small sample performance", A. Jain and D. Zongker,
IEEE PAMI, 19(2):153-158, 1997.
"Data Preparation for Data Mining", D. Pyle, Morgan Kaufmann, 1999.
"Feature Subset Selection and Order Identification for Unsupervised Learning", J.G. Dy and C.E.
Brodley, Proceedings of the 17th Intl Conference on Machine Learning, pages 247–254, 2000.
"Introduction to Statistical Pattern Recognition", 2nd edition, K. Fukunaga, Morgan Kaufmann, 1990.
"Pattern Classification", 2nd edition, R.O. Duda, P.E. Hart, D.G. Stork, Wiley, 2001.
"Instance Selection and Construction for Data Mining", H. Liu, and H. Motoda, Kluwer Academic
Publishers, 2001.
Potential Related Projects
Some potential projects or collaborations in data pre-processing for data mining are:
1. Image Mining with Feature Extraction – study how to extract and select relevant and
effective features for object recognition using data mining techniques.
2. Embedded Data Extraction of Data Streams – study how to extract features and select
instances from data streams in an embedded environment.
3. Focusing techniques – study other related selection techniques: database selection,
instance selection, plan selection, etc.