From: AAAI Technical Report WS-94-01. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.
Feature Selection for Case-Based Classification of Cloud Types: An Empirical Comparison

David W. Aha
Navy AI Center
Naval Research Laboratory
Washington, DC 20375
aha@aic.nrl.navy.mil

Richard L. Bankert
Marine Meteorology Division
Naval Research Laboratory
Monterey, CA 93943
bankert@nrlmry.navy.mil
Abstract
Accurate weather prediction is crucial for many activities, including Naval operations. Researchers within the meteorological division of the Naval Research Laboratory have developed and fielded several expert systems for problems such as fog and turbulence forecasting and tropical storm movement. They are currently developing an automated system for satellite image interpretation, part of which involves cloud classification. Their cloud classification database contains 204 high-level features but only a few thousand instances. The predictive accuracy of classifiers can be improved on this task by employing a feature selection algorithm. We explain why non-parametric case-based classifiers are excellent choices for use in feature selection algorithms. We then describe a set of such algorithms that use case-based classifiers, empirically compare them, and introduce a novel extension of backward sequential selection that allows it to scale to this task. Several of the approaches we tested located feature subsets that attain significantly higher accuracies than those found in previously published research, and some did so with fewer features.

Motivation

Accurate interpretation of maritime satellite imagery data is an important component of Navy weather forecasting, particularly in remote areas for which there are limited conventional observations. Decisions regarding tactical Naval operations depend on their accuracy. Unfortunately, frequent personnel rotations and lack of training time drastically reduce the level of shipboard image interpretation expertise. Therefore, the Naval Research Laboratory (NRL) in Monterey, CA is building an expert system named SIAMES (Satellite Image Analysis Meteorological Expert System) that aids in this task (Peak & Tag, 1992). Inputs to SIAMES, including cloud classification, are not yet fully automated; the user must manually input details on cloud types. Bankert (1994) addressed this problem. Given a database containing 204 continuous features describing ten classes of clouds, he used a standard feature selection technique and then applied a probabilistic neural network (PNN) (Specht, 1990; Welch et al., 1992) to classify cloud types.¹ Current plans include using this network in SIAMES.

¹PNN uses a standard three-layer network of nodes and a Bayesian classification strategy to select the class with the highest value of the a posteriori class probability density function.

This paper extends the work reported by Bankert (1994). We argue for the use of case-based classifiers in domains with large numbers of features. We show they can locate smaller feature subsets leading to significantly higher accuracies on this task. We begin by describing the cloud classification task and Bankert's study. Next, we explain the benefits of using case-based classifiers in feature selection, and argue why such approaches should outperform the approach previously used. We then describe a set of feature selection algorithms that we tested on the cloud classification task. Finally, we report our results and discuss them in the context of related work.

This paper has three primary contributions. First, we show further evidence that feature selection algorithms should use the classifier itself to evaluate feature subsets. Second, we show that a case-based algorithm is a particularly good classifier in this context. Finally, we introduce a method that allows the backward sequential selection algorithm to scale to large numbers of features.
Cloud Classification Task
Bankert (1994) describes the cloud classification task.
The original data were supplied and expertly labeled
by Dr. C. Wash and Dr. F. Williams of the Naval
Postgraduate School. They were in the form of four
advanced very high resolution radiometer (AVHRR)
local area coverage (LAC) scenes with size 512 × 512
pixels. An additional 91 images were labeled independently by four NRLMonterey experts under the direction of Dr. C. Crosiar. The scenes were not confined
to a specific location nor time of year. Sample areas
of 16x16 pixels that were covered by at least 75%of
1PNNuses a standard three-layer networkof nodes and
a Bayesianclassification strategy to select the class withthe
highest value of the a posteriori class probability density
function.
a particular cloud type were expertly labeled for each
image. A total of 1633 cases were extracted using this
procedure, where each case was labeled according to
one of ten cloud classes.
The features used to describe each case consist of
spectral, textural, and physical measures for each sample area. Examples of each group respectively include
such features as maximum pixel value, gray level entropy of the image, gray level distribution of adjacent
pixels in the horizontal dimension, and cloud-top temperature. Latitude was included as a final feature. All
features are defined over a continuous range of values.
The space of feature subsets (i.e., the power set of all features) for this task has size 2^204, or approximately 2.6 × 10^61. Bankert (1994) used a standard feature selection approach from the pattern recognition literature. This reduced the number of features used from 204 to 15. He then applied PNN (Specht, 1990; Welch et al., 1992), which
has a 10-fold cross validation (CV) accuracy of 75.3%
on the initial feature set. Its accuracy improves to
78.9% when using the selected 15 features.
Feature Selection
The objective of feature selection is to reduce the number of features used to characterize a dataset so as to
improve an algorithm’s performance on a given task.
The task studied in this paper is classification.
Our
objectives are to maximize classification accuracy and
minimize the number of features used to define the
cases. Feature selection algorithms input a set of features and yield a subset of them.
Several knowledge-intensive CBR algorithms have
been used to perform feature selection (e.g., Ashley
& Rissland, 1988). However, domain specific knowledge is not yet available for the cloud classification
task. This prevents us from using explanation-based
approaches for indexing and retrieving appropriate features (e.g., Cain, Pazzani, & Silverstein, 1991; Ram,
1993). Furthermore, the same set of features is used
to describe each case in this case base, their values
have been pre-computed, and no further inferencing is
required to access these values. Therefore, we do not
address the cost of evaluating features (Owens, 1993).
Thus, this study was restricted to using knowledge-poor feature selection approaches (e.g., Devijver & Kittler, 1982), although we plan to work with knowledge-based techniques for feature extraction in the future.
Feature selection algorithms have three components:
1. Search algorithm: This searches the space of feature subsets, which has size 2^d, where d is the number of features. Points in this space are called states.
2. Evaluation function: This inputs a state and outputs a numeric evaluation. The search algorithm's goal is to maximize this function.
3. Classifier: This is the target algorithm that uses the final subset of features (i.e., those found by the search algorithm to have the highest evaluation).
[Figure 1: Common Control Strategies for Feature Selection Approaches. Two diagrams contrast the filter control strategy, in which the search over feature subsets is guided by a separate evaluation function and the classifier is applied only to the selected features, with the wrapper control strategy, in which the classifier itself serves as the search algorithm's evaluation function.]
Typically, these components interact in one of the two ways described in Figure 1. Either the search and evaluation algorithms locate the set of features before the classifier is consulted, or the classifier is itself the evaluation function. We adopt the terms filter and wrapper control strategies, respectively, from John, Kohavi, and Pfleger (1994), who argued that the wrapper strategy is superior because it avoids the problem of using an evaluation function whose bias differs from the classifier's. Our empirical results strongly support this claim.
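To make the control-flow difference concrete, the following minimal Python sketch is our own illustration (not code from the paper); search, index_score, and cv_score are assumed callables rather than any particular library's API.

    # Sketch (ours) of the two control strategies.
    #   search(evaluate) -> best feature subset found using `evaluate`
    #   index_score(subset) -> separability-index value for `subset`
    #   cv_score(subset)    -> cross-validated accuracy of the classifier on `subset`

    def filter_selection(search, index_score):
        # Filter: the classifier never participates in the search; it is
        # applied only afterwards to the features the index selected.
        return search(index_score)

    def wrapper_selection(search, cv_score):
        # Wrapper: the classifier's own cross-validated accuracy drives the search.
        return search(cv_score)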
Since many feature subsets are evaluated during the feature selection search, classifiers that require manual tuning of their parameters (e.g., PNN) cannot be used as the evaluation function. Since Bankert (1994) chose PNN as his classifier, he used a different function to evaluate states. We hypothesize that better results can be obtained by using a wrapper model. Non-parametric case-based classifiers are excellent choices for feature selection classifiers because they often obtain good accuracies and can also be used as evaluation functions. Thus, we used IB1 (Aha, Kibler, & Albert, 1991) in our studies for these two purposes; it is an implementation of the nearest neighbor classifier.
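As a concrete illustration of such an evaluation function, the following sketch is our own (it is not the authors' IB1 implementation and omits IB1 details such as attribute normalization): a nearest neighbor classifier restricted to a candidate feature subset is scored by its k-fold cross-validation accuracy.

    import numpy as np

    def nn_predict(train_X, train_y, query):
        # 1-nearest-neighbor prediction (the core of an IB1-style classifier):
        # return the class of the closest stored case under Euclidean distance.
        dists = np.sum((train_X - query) ** 2, axis=1)
        return train_y[int(np.argmin(dists))]

    def cv_accuracy(X, y, subset, folds=10, seed=0):
        # Evaluate a feature subset by the cross-validated accuracy of the
        # nearest neighbor classifier that uses only those features.
        # X: (num_cases, num_features) array; y: array of class labels.
        X = X[:, list(subset)]
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(y))
        splits = np.array_split(order, folds)
        correct = 0
        for i in range(folds):
            test_idx = splits[i]
            train_idx = np.concatenate([splits[j] for j in range(folds) if j != i])
            for t in test_idx:
                if nn_predict(X[train_idx], y[train_idx], X[t]) == y[t]:
                    correct += 1
        return correct / len(y)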
Doak (1992) identifies three categories of search algorithms: exponential, sequential, and randomized. Exponential algorithms have exponential complexity in the number of features. Sequential search algorithms have polynomial complexity; they add or subtract features and use a hill-climbing search strategy. Randomized algorithms include genetic and simulated annealing search methods. Sequential search algorithms seem most appropriate for our task because exponential search algorithms (e.g., FOCUS (Almuallim & Dietterich, 1991)) are prohibitively expensive and randomized algorithms, unless biased as in (Skalak, 1994), tend to yield larger subsets of features than do sequential strategies (Doak, 1992). Sequential algorithms performed comparatively well in Doak's study. Thus, we chose to use them in our study. We have not yet investigated strategies that combine algorithms from different categories (e.g., Doak, 1992; Caruana & Freitag, 1994).
The most common sequential search algorithms for feature selection are variants of forward sequential selection (FSS) and backward sequential selection (BSS). FSS begins with zero attributes, evaluates all feature subsets with exactly one feature, and selects the one with the best performance. It then adds to this subset the feature that yields the best performance for subsets of the next larger size. This cycle repeats until no improvement is obtained by extending the current subset. BSS instead begins with all features and repeatedly removes the feature whose removal yields the greatest increase in performance. Doak (1992) reported that variants of BSS outperformed variants of FSS, probably because BSS evaluates the contribution of a feature when all others are available, whereas FSS suffers when features do not contribute to performance individually but do so when considered jointly with other features (i.e., when they interact).
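The following sketch (ours) shows FSS as a greedy loop; evaluate can be any subset scorer, such as the cross-validation evaluator sketched above, and BSS is the mirror image that starts from the full feature set and removes the least useful feature at each step.

    def forward_sequential_selection(num_features, evaluate):
        # Greedily grow a feature subset, adding at each step the feature
        # that most improves the evaluation, until no addition helps.
        selected, best_score = [], float("-inf")
        remaining = set(range(num_features))
        while remaining:
            scored = [(evaluate(selected + [f]), f) for f in remaining]
            score, f = max(scored)
            if score <= best_score:
                break                      # no feature improves the subset
            selected.append(f)
            remaining.remove(f)
            best_score = score
        return selected, best_score

    # Example pairing with the evaluator sketched above:
    #   subset, acc = forward_sequential_selection(X.shape[1],
    #                     lambda s: cv_accuracy(X, y, s))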
BSS's computational expense prevents it from being easily applied to our task. It would require approximately 21,000 subset evaluations before it could find a subset of the size found using Bankert's approach; reducing from 204 features to 15 means evaluating roughly 204 + 203 + ... + 16, or about 21,000, candidate subsets. Each evaluation would involve an application of IB1, whose cost is O(F × I²), where I is the number of instances and F is the number of features. Therefore, we modified BSS (see below). Because of this modification, it is not obvious whether it would perform as well as FSS in our study.
A Framework of Algorithms
In our experiments, we focused on the four categories of feature selection algorithms shown in Table 1. The rest of this section describes the algorithms we applied.

Table 1: Four Types of Feature Selection Algorithms

  Control Strategy    Search Algorithm
  Filter              FSS
  Filter              BSS
  Wrapper             FSS
  Wrapper             BSS
Filter Control with FSS
Bankert's (1994) approach is a member of the first of these four categories. He used FSS with the Bhattacharyya class separability index as the evaluation function; it measures the degree to which classes are separated and internally cohere. Many such indices have been proposed and evaluated (e.g., Devijver & Kittler, 1982). They are usually less computationally expensive than the classifier and are nonparametric, which allows them to be used without requiring manual parameter tuning. However, if their bias does not match the bias of the classifier, then they can lead to suboptimal performance.
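As an illustration of the kind of evaluation function a filter strategy uses (our sketch; the paper does not give the exact multi-class form of the index Bankert used), the Bhattacharyya distance between two classes modeled as Gaussian distributions rewards feature subsets on which the classes are well separated.

    import numpy as np

    def bhattacharyya_distance(X1, X2):
        # Bhattacharyya distance between two classes under a Gaussian model:
        #   D = 1/8 (m1-m2)^T S^-1 (m1-m2) + 1/2 ln( det(S) / sqrt(det(S1) det(S2)) )
        # with S = (S1 + S2) / 2. Larger D means better class separation.
        # For more than two classes, pairwise distances can be averaged.
        m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
        S1 = np.cov(X1, rowvar=False)
        S2 = np.cov(X2, rowvar=False)
        S = (S1 + S2) / 2.0
        diff = m1 - m2
        term1 = 0.125 * diff @ np.linalg.solve(S, diff)
        _, logdet_S = np.linalg.slogdet(S)
        _, logdet_S1 = np.linalg.slogdet(S1)
        _, logdet_S2 = np.linalg.slogdet(S2)
        term2 = 0.5 * (logdet_S - 0.5 * (logdet_S1 + logdet_S2))
        return term1 + term2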
Filter Control with BSS
The second category combines a filter control strategy with BSS search. Since it is difficult to apply BSS to a task with 204 features due to its computational complexity, we introduce BEAM, a novel extension of BSS that allows it to work with larger numbers of features.
BEAM is somewhat similar to Doak's (1992) random generation plus sequential selection (RGSS) algorithm. They both begin with a feature subset found through random selection of features, and they both use a form of sequential feature selection after locating this initial subset. After choosing the initial set of features, RGSS uses FSS and then BSS in the hope of overcoming the drawbacks of both approaches. These sequential search algorithms suffer from beginning with extreme subsets of features (i.e., none and all features, respectively). Thus, it makes sense to begin with some other subset of features, to extend it with FSS, and then use BSS since it has proven itself to perform well on many tasks. Doak (1992) reports good performance with RGSS on a heart disease database with 75 features.
BEAM differs from RGSS in three ways. First, instead of beginning with a random selection of features, BEAM randomly samples the feature space for a fixed number of iterations and begins with the best-performing feature subset found during those iterations. BEAM's performance in informal experiments without this more careful selection of an initial feature set was much lower.³

³The average 10-fold CV accuracy of fifty random selections of 25 features was 66.7%, while the same average for BEAM's initially selected feature sets was 74.6%.
Second, BEAM is biased in the initial number of features it selects. For the cloud classification task, BEAM initially sampled only in the space of 25 features among the 204 available. This is required because the space of features is large, and we know from Bankert's (1994) study that a small number of features (i.e., 15) significantly improves predictive accuracy on this task.
Third, BEAM employs a beam search (hence its name) during BSS. It maintains a fixed-size queue of the best-performing states it has visited, along with, for each state, a list of the features that have already been removed by BSS for evaluation. The states on the queue are ordered by decreasing accuracy, and the queue is updated each time a state is selected and evaluated. A parameter to BEAM determines whether a state evaluation consists of evaluating all subsets with one fewer feature (full evaluation) or only a single feature subset (greedy evaluation). When using greedy evaluation, a random element is used to select both which state on the queue to examine next and which feature to remove from it. The probability that a state is chosen is a decreasing function of its location in the queue. A state in location i of a queue of size n is selected with probability

    P(i) = (n - i + 1) / Σ_{j=1}^{n} j

where the denominator is simply n(n+1)/2. Let F be the subset of features in the selected state and L the features of F that have already been removed for evaluation. The next feature selected from this state is chosen randomly, according to a uniform distribution, from F - L.
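A minimal sketch (ours, not the paper's code) of this selection step: with the queue ordered best-first, position i receives weight n - i + 1, and the feature to remove is then drawn uniformly from those not yet tried for that state.

    import random

    def select_state(queue, rng=random):
        # queue[0] is the best state; 1-based position i gets weight n - i + 1,
        # so P(i) = (n - i + 1) / sum_{j=1..n} j, decreasing with i.
        n = len(queue)
        weights = [n - i for i in range(n)]          # n, n-1, ..., 1
        return rng.choices(queue, weights=weights, k=1)[0]

    def select_feature(F, L, rng=random):
        # Choose uniformly among the features of the state that have not
        # yet been removed for evaluation.
        return rng.choice(sorted(F - L))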
BEAM's algorithm is summarized in Figure 2. The subfunctions have been previously explained except for update_queue, which during greedy evaluation adds f to L, removes the selected state from (and reduces the number of states in) the queue when F = L, and properly inserts the state with features F - {f} into the queue if its evaluation (i.e., 10-fold CV accuracy) is among the best n whose features have not yet been exhaustively removed by BSS for evaluation.
For this category, we selected as BEAM's evaluation function a separability index that is similar to the one used by Bankert and that performed best in comparison with many other indices (Skalak, 1994). A full evaluation function was used with a queue size of 1.²
Wrapper Control with FSS
For this category, we combined FSS search with IB1 as both the evaluation function and the classifier. We also tested a second variant of this combination in which we used BEAM, but substituted FSS for BSS.
²Since a larger queue could contain feature subsets of different sizes, and since most separation indices are sensitive to feature subset size, we decided to not experiment with the greedy evaluation function for this category.
Figure 2: BEAM: A Biased Variant of BSS for Feature Selection with Large Numbers of Features

Inputs:
  r: number of randomly-selected feature subsets to evaluate
  d: number of features in each randomly-selected subset
  n: size of the queue
  e: type of evaluation (full or greedy)
  I: number of select/evaluate iterations
Key:
  F: a subset of features
  L: features in F such that, for all f in L, F - {f} has been evaluated

BEAM(r, d, n, e, I)
  1. F = best_random_state(r, d)
  2. queue = initialize_queue(F)
  3. best_evaluation = 0.0
  4. For I iterations:
     A. {F, L} = select_state(queue)
     B. f = randomly_select_feature(F - L)
     C. evaluation = evaluate(F - {f}, e)
     D. queue = update_queue(F, f, evaluation, n)
     E. if (evaluation > best_evaluation)
        i.  best_evaluation = evaluation
        ii. best_state = F - {f}
  5. Output: best_evaluation and best_state
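To complement the pseudocode, the following is a Python rendering of BEAM's greedy variant. It is our reconstruction under the description above, not the authors' implementation: evaluate(subset) is any subset scorer (e.g., the cross-validation evaluator sketched earlier), the default parameter values are illustrative rather than the paper's settings, and details such as tie-breaking and duplicate handling are our own choices.

    import random

    def beam_greedy(num_features, evaluate, r=25, d=25, n=25, iters=1000, seed=0):
        # Greedy-evaluation BEAM sketch: queue entries are (accuracy, F, L),
        # where F is a feature subset and L holds the features of F already
        # removed (tried) by backward selection.
        rng = random.Random(seed)

        # Initial state: the best of r randomly drawn d-feature subsets.
        candidates = [frozenset(rng.sample(range(num_features), d)) for _ in range(r)]
        F0 = max(candidates, key=evaluate)
        queue = [(evaluate(F0), F0, set())]
        best_acc, best_state = queue[0][0], F0

        for _ in range(iters):
            # Select a state with probability decreasing in its queue position.
            weights = [len(queue) - i for i in range(len(queue))]
            acc, F, L = rng.choices(queue, weights=weights, k=1)[0]

            untried = sorted(F - L)
            if not untried:                        # state exhausted: retire it
                queue = [e for e in queue if e[1] != F]
                if not queue:
                    break
                continue

            # Remove one untried feature and evaluate the resulting subset.
            f = rng.choice(untried)
            L.add(f)                               # mark f as tried for state F
            child = F - {f}
            child_acc = evaluate(child)

            # update_queue: retire F once every feature has been tried, insert
            # the child, and keep only the n best states.
            if L == set(F):
                queue = [e for e in queue if e[1] != F]
            if len(child) > 0:
                queue.append((child_acc, child, set()))
            queue = sorted(queue, key=lambda e: e[0], reverse=True)[:n]
            if not queue:
                break

            if child_acc > best_acc:
                best_acc, best_state = child_acc, child

        return best_state, best_acc

Paired with an evaluator such as evaluate = lambda s: cv_accuracy(X, y, s), this searches the subset space in the spirit of the figure, though the original update_queue behavior may differ in its details.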
When using the greedy evaluation function, we initialized BEAM with a randomly chosen subset of three features (i.e., the best among 25 subsets tested for that size), and used a queue size of 25. A queue size of one was used when using full evaluation, which began with the empty subset of features. In both cases, feature subsets were constrained to have size no more than 25.
Wrapper Control with BSS
The fourth category combines a wrapper control strategy with BSS search. We again used the BEAM algorithm, but in this case we used IB1 (Aha, Kibler, & Albert, 1991) as the evaluation function. Again, we tested BEAM using both the full and single (greedy) evaluation functions, with queue sizes of one and 25, respectively.
Other Algorithms
We included two other case-based feature selection algorithms in our study. IB4 (Aha, 1992a) was chosen as an example of a classifier that hill-climbs in the space of real-valued feature weights. Several such algorithms exist (e.g., Wettschereck & Dietterich, 1994). Although these algorithms have performed well on some datasets, they tend to work best when features are either highly relevant or irrelevant. We hypothesized that IB4 would not work well on the cloud classification task, which contains many partially relevant features.
We also included Cardie's (1993) approach. She used C4.5 (Quinlan, 1993) to select which features to use in a case-based classifier and reported favorable results. This method has not previously been used on tasks involving more than a few dozen features, and some reports suggest that C4.5 does not always perform well when the features are all continuous (e.g., Aha, 1992b). Thus, we suspected that it may not perform well on the cloud classification task.
Empirical Comparisons
We examined two hypotheses concerning the sequential search framework described in the previous section:
1. The wrapper control strategy is superior to the filter control strategy for this task.
2. Variants of BSS tend to outperform variants of FSS for this task.
The first hypothesis is based on evidence that using the classifier itself as the evaluation function yields better performance on some tasks (e.g., Doak, 1992; Vafaie & De Jong, 1993; John, Kohavi, & Pfleger, 1994). Similar evidence exists for the second hypothesis (Doak, 1992). However, these hypotheses have not been previously investigated for tasks of this magnitude. We also have secondary hypotheses concerning the other two algorithms in our study:
3. Searching in the 2^d space of feature subsets is superior to searching in the R^d space of real-valued feature weights, at least when initially determining which features are needed to attain good performance on the cloud classification task.
4. C4.5 will not perform particularly well as a filter algorithm because this dataset contains only numeric-valued features.
Hypothesis three is based on the observation that R^d is much larger than 2^d, which is itself large for our task. Thus, a hill-climbing algorithm such as IB4 will have difficulty locating a set of weights that yields good performance on this task. Finally, although C4.5 has performed well as a feature selection algorithm in previous studies, it has not been used for this purpose with numeric-valued data.
Baseline Results
When using all 204 attributes, IB1's 10-fold CV accuracy is 72.6% while PNN's accuracy is 75.3%. The most frequent concept in the database occurs for 15.4% of the cases. Both IB1 and PNN significantly increase accuracy over this baseline, although PNN fares better here because IB1 does not perform well for high-dimensional tasks involving many partially relevant attributes (Aha, 1992a).
Table 2: Best and Average (10 runs for the nondeterministic algorithms) 10-fold CV Percent Accuracies (Acc) and Feature Set Sizes (Sz) for Feature Selection Algorithms on the Cloud Classification Task, where "Bs" and "Bf" Indicate Using BEAM with the Single and Full Evaluation Strategies, Respectively

  Control            Search      Best           Average
  Strategy           Alg.        Acc    Sz      Acc    Sz
  Random Selection               75.9   83      69.9   103.4
  Filter             FSS         78.6   15
  Filter             BSS         75.4   24      69.8   21.3
  Wrapper            FSS         88.0   10
  Wrapper            FSS Bs      87.2   25      83.1   24.9
  Wrapper            BSS Bf      82.4   13      79.7   17.0
  Wrapper            BSS Bs      85.1   5       81.0   9.1
  IB4                            73.3   204
  C4.5                           76.5   79.4
Comparison Results
Table 2 displays the best and average results when using the eight feature selection algorithms described in
the previous section. Additionally, the first line refers
to the results when randomly selecting subsets of features. The algorithms using the wrapper control strategy all used IB1 as their evaluation function, while the
filter strategy algorithms used a separability index to
evaluate states. IB1 was used as the classifier for all
the algorithms.
The second line in Table 2 refers to using the subset found by Bankert (1994) and using IB1 rather than PNN as the classifier. This method is at a disadvantage because the bias of the separation index may not match the bias of the classifier. Thus, the set of features selected by the index measure may not yield comparatively high accuracies, as is true in this case.
The third line refers to using BSS search, a separation index, and the IB1 classifier.
This approach
also performed comparatively poorly. Our other results suggest this is due to using a separation index as
the evaluation function.
Line four of Table 2 replaces Bankert’s use of a separability index with IB1 as the evaluation function.
This single change yields the highest accuracy in our
study. Thus, this is strong evidence that using the
classifier also as the evaluation function yields better
performing algorithms for feature selection.
Line five reports a similar approach, but in this case
each run was initialized with the best-performing subset of three randomly chosen features.
BEAM was
used (with FSS rather than BSS) with a queue size
of 25. The sizes of feature subsets were constrained
to not exceed 25. This algorithm found a subset with
high accuracy but required a relatively larger number
of features.
The sixth and seventh lines in Table 2 list BEAM's results when using BSS as the search algorithm (r = 325, d = 25, and I = 1000) and IB1 as the classifier.
These settings have not been optimized. Line six refers
to using a queue size of one and the full evaluation
strategy. Line seven refers to using a queue size of
25 and the greedy evaluation strategy. This latter approach performed well; it located a subset containing
only five features that attained an accuracy of 85.1%.
The eighth line in Table 2 refers to using IB4 (Aha,
1992a). Like most weight-tuning case-based algorithms, IB4 does not perform well unless each feature
is either highly relevant or irrelevant. Its poor performance here is not surprising.
The final line refers to using C4.5 (Quinlan, 1993)
to select features for IB1. Using C4.5's default parameter settings, it yields pruned trees that have a large number of features and do not deliver high predictive accuracies. We have not investigated whether alternative parameter settings would improve its performance.
Summary
The highest accuracy (i.e., 88.0%) was obtained when using the FSS search algorithm combined with IB1 as the evaluation function. A Student's one-tailed t-test confirmed that the accuracies obtained using the feature subset found by this approach, as tested on the ten folds, are significantly higher than the best accuracies obtained by using the other approaches, at the 95% confidence level or higher. In particular, using IB1 rather than a separability index as the evaluation function significantly improved performance when using either FSS (97.5% confidence level) or BSS (99.5%). Thus, this provides evidence for our hypothesis that using the classifier as the evaluation function improves performance.
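For readers who wish to reproduce this kind of comparison, the per-fold accuracies of two approaches can be compared with a paired one-tailed t-test, as in the following sketch (ours; it assumes SciPy 1.6 or later for the alternative keyword, and the fold accuracies shown are placeholders, not the paper's data).

    from scipy import stats

    # Per-fold accuracies for two approaches (placeholder values only);
    # the folds are paired, so a related-samples t-test is appropriate.
    acc_wrapper_fss = [0.88, 0.86, 0.90, 0.87, 0.89, 0.88, 0.85, 0.90, 0.88, 0.87]
    acc_filter_fss  = [0.79, 0.77, 0.80, 0.78, 0.79, 0.78, 0.76, 0.80, 0.79, 0.78]

    t, p = stats.ttest_rel(acc_wrapper_fss, acc_filter_fss, alternative="greater")
    print(f"t = {t:.2f}, one-tailed p = {p:.4f}")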
The combination of BSS with IB1 as the evaluation function, as implemented in BEAM with the single (greedy) evaluation option, also performed well. However, FSS with IB1 located a five-feature subset with even higher accuracy (i.e., 85.9%). Thus, we found no evidence that BSS outperformed FSS, but this may be due to the differences between BSS and its modification in BEAM.
As expected, the final two algorithms performed
comparatively poorly in terms of accuracy (i.e., they
were significantly lower than the accuracies recorded by
the wrapper algorithms at the 95% confidence
level or higher). Furthermore, neither algorithm found
small-sized feature sets in this study.
Related Work

Feature selection has its roots in pattern recognition and statistics and is addressed in several textbooks (e.g., Devijver & Kittler, 1982). Mucciardi and Gose's (1971) empirical study fostered much interest in heuristic approaches for feature selection that use filter control strategies. Cover and van Campenhout (1977) later proved that, under some assumptions, such heuristic algorithms can perform arbitrarily poorly. Yet exhaustive (exponential) search algorithms are prohibitively expensive for non-trivial tasks. Their seminal paper greatly influenced the pattern recognition community, and arguably reduced interest in heuristic approaches using filter strategies.
In contrast, several feature selection algorithms have been recently described in the AI literature. Many weight-tuning algorithms have been proposed (e.g., Aha, 1992a; Wettschereck & Dietterich, 1994), although it is not feasible to explore the space of real-valued weights in R^d when d is large. Also, these algorithms work best when the features are either highly relevant or irrelevant, which does not appear to be true for the cloud classification task. However, it is possible that the notion of locally warping the instance space (e.g., Aha & Goldstone, 1992) can be used for feature selection (i.e., the relevance of features varies across a given instance space).
We suspect that FOCUS (Almuallim & Dietterich, 1991) and RELIEF (Kira & Rendell, 1992) would not perform well on this task, since the former uses exhaustive search and the latter assumes that features are, again, either highly relevant or irrelevant. Vafaie and De Jong (1993) instead used a genetic algorithm to select features for a rule induction program. Like John, Kohavi, and Pfleger (1994), they found that filtering is inferior to the wrapper control strategy, although the database in their study again involved only a comparatively small number of features.
The results from Doak's (1992) survey suggest using feature selection algorithms similar to BEAM. Doak combined random with sequential searching, suggested using a beam search to guide BSS, and suggested that genetic search approaches can be improved by biasing them towards smaller-sized feature subsets. Skalak (1994) recently demonstrated that this approach is effective. BEAM extends some of these ideas by adding a strategy for selecting a good initial set of features, implementing a beam search, and introducing random elements into the state selection process; it may benefit from using a more intelligent method for selecting its initial feature subset.
Conclusions and Future Research

This paper focuses on improving predictive accuracy for a specific task: cloud classification. Properties specific to this task require the use of feature selection approaches to improve case-based classification accuracy. We explained the motivation for using sequential selection algorithms for this task, tested several variants, and introduced a novel modification of BSS named BEAM for problems with large numbers of features. Our results strongly indicate that, when using these algorithms, the evaluation function should be the same as the classifier. This leads to significantly higher accuracies than those previously published for this task (Bankert, 1994).

We have not yet evaluated BEAM on other problems with large numbers of features to better analyze its benefits. While we anticipate that some of its properties will prove generally useful, we do not yet have evidence for such claims. Also, several of BEAM's design decisions have not been justified, and we plan to examine these decisions systematically in future research. Another goal is to integrate the capabilities of genetic search into BEAM, such as by using it whenever backward sequential selection depletes the queue of good feature subsets. Similarly, C4.5 can be utilized by selecting feature subsets from the power set of features it finds to be relevant for classification. Many other combinations might prove useful. A promising direction for future research involves using domain-specific knowledge concerning feature relevance. Methods used in case-based reasoning should prove highly valuable once such domain expertise becomes available.
Acknowledgements
Thanks
to Paul Tag, John
Grefenstette,
David Skalak, Diana Gordon, and Ashwin Ram for their feedback on this research.
References
Aha, D. W. (1992a). Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36, 267-287.
Aha, D. W. (1992b). Generalizing from case studies: A case study. In Proceedings of the Ninth International
Conference on Machine Learning (pp. 1-10). Aberdeen,
Scotland: Morgan Kaufmann.
Aha, D. W., & Goldstone, R. L. (1992). Concept learning
and flexible weighting. In Proceedings of the Fourteenth
Annual Conference of the Cognitive Science Society (pp.
534-539). Bloomington, IN: Lawrence Erlbaum.
Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instancebased learning algorithms. Machine Learning, 6, 37-66.
Almuallim, H., & Dietterich, T. G. (1991). Learning with
many irrelevant features. In Proceedings of the Ninth
National Conference on Artificial Intelligence (pp. 547-552). Menlo Park, CA: AAAI Press.
Ashley, K. D., & Rissland, E. L. (1988). Waiting on weighting: A symbolic least commitment approach. In
Proceedings of the Seventh National Conference on Artificial Intelligence (pp. 239-244). St. Paul, MN:Morgan
Kaufmann.
Bankert, R. L. (1994). Cloud classification
of AVHRR
imagery in maritime regions using a probabilistic neural
network. To appear in Journal of Applied Meteorology.
Cain, T., Pazzani, M. J., & Silverstein, G. (1991). Using
domain knowledge to influence similarity judgement. In
Proceedings of the Case-Based Reasoning Workshop (pp.
191-202). Washington, DC: Morgan Kaufmann.
Caruana, R., & Freitag, D. (1994). Greedy attribute selection. To appear in Proceedings of the Eleventh International Machine Learning Conference. New Brunswick, NJ: Morgan Kaufmann.
Cardie, C. (1993). Using decision trees to improve case-based learning. In Proceedings of the Tenth International
Conference on Machine Learning (pp. 25-32). Amherst,
MA: Morgan Kaufmann.
Cover, T. M., & van Campenhout, J. M. (1977). On the
possible orderings in the measurementselection problem.
IEEE Transactions on Systems, Man, and Cybernetics,
7, 657-661.
Devijver, P. A., & Kittler, J. (1982). Pattern recognition: A statistical approach. Englewood Cliffs, NJ: Prentice-Hall.
Doak, J. (1992). An evaluation of feature selection methods and their application to computer security (Technical
Report CSE-92-18). Davis, CA: University of California,
Department of Computer Science.
John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. To appear in Proceedings of the Eleventh International Machine Learning Conference. New Brunswick, NJ: Morgan Kaufmann.
Kira, K., & Rendell, L. A. (1992). A practical approach
to feature selection. In Proceedings of the Ninth International Conference on Machine Learning (pp. 249-256).
Aberdeen, Scotland: Morgan Kaufmann.
Mucciardi, A. N., & Gose, E. E. (1971). A comparison of seven techniques for choosing subsets of pattern recognition properties. IEEE Transactions on Computers, 20,
1023-1031.
Owens, C. (1993). Integrating
feature extraction and
memory search. Machine Learning, 10, 311-340.
Peak, J. E., & Tag, P. M. (1992). Towards automated interpretation of satellite imagery for Navy shipboard applications. Bulletin of the American Meteorological Society, 73, 995-1008.
Quinlan, J. R. (1993). C4.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann.
Ram, A. (1993). Indexing, elaboration, and refinement:
Incremental learning of explanatory cases. Machine
Learning, 10, 201-248.
Skalak, D. (1994). Prototype and feature selection by sampling and random mutation hill climbing algorithms. To appear in Proceedings of the Eleventh International Machine Learning Conference. New Brunswick, NJ:
Morgan Kaufmann.
Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3, 109-118.
Vafaie, H., & De Jong, K. (1993). Robust feature selection
algorithms. In Proceedings of the Fifth Conference on
Tools for Artificial Intelligence (pp. 356-363). Boston,
MA: IEEE Computer Society Press.
Welch, R. M., Sengupta, S. K., Goroch, A. K., Rabindra, P., Rangaraj, N., & Navar, M. S. (1992). Polar
cloud and surface classification
using AVHRR
imagery:
An intercomparison of methods. Journal of Applied Meteorology, 31, 405-420.
Wettschereck, D., & Dietterich, T. G. (1994). An experimental comparison of the nearest neighbor and nearest
hyperrectangle algorithms. To appear in Machine Learning.