From: AAAI Technical Report WS-94-01. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved.

Feature Selection for Case-Based Classification of Cloud Types: An Empirical Comparison

David W. Aha
Navy AI Center
Naval Research Laboratory
Washington, DC 20375
aha@aic.nrl.navy.mil

Richard L. Bankert
Marine Meteorology Division
Naval Research Laboratory
Monterey, CA 93943
bankert@nrlmry.navy.mil

Abstract

Accurate weather prediction is crucial for many activities, including Naval operations. Researchers within the meteorological division of the Naval Research Laboratory have developed and fielded several expert systems for problems such as fog and turbulence forecasting and tropical storm movement. They are currently developing an automated system for satellite image interpretation, part of which involves cloud classification. Their cloud classification database contains 204 high-level features but only a few thousand instances. The predictive accuracy of classifiers on this task can be improved by employing a feature selection algorithm. We explain why non-parametric case-based classifiers are excellent choices for use in feature selection algorithms. We then describe a set of such algorithms that use case-based classifiers, empirically compare them, and introduce a novel extension of backward sequential selection that allows it to scale to this task. Several of the approaches we tested located feature subsets that attain significantly higher accuracies than those found in previously published research, and some did so with fewer features.

Motivation

Accurate interpretation of maritime satellite imagery data is an important component of Navy weather forecasting, particularly in remote areas for which there are limited conventional observations. Decisions regarding tactical Naval operations depend on the accuracy of these interpretations. Unfortunately, frequent personnel rotations and lack of training time drastically reduce the level of shipboard image interpretation expertise. Therefore, the Naval Research Laboratory (NRL) in Monterey, CA is building an expert system named SIAMES (Satellite Image Analysis Meteorological Expert System) that aids in this task (Peak & Tag, 1992).
Inputs to SIAMES, including cloud classification, are not yet fully automated; the user must manually input details on cloud types. Bankert (1994) addressed this problem. Given a database containing 204 continuous features describing ten classes of clouds, he used a standard feature selection technique and then applied a probabilistic neural network (PNN) (Specht, 1990; Welch et al., 1992) to classify cloud types. (PNN uses a standard three-layer network of nodes and a Bayesian classification strategy to select the class with the highest value of the a posteriori class probability density function.) Current plans include using this network in SIAMES.

This paper extends the work reported by Bankert (1994). We argue for the use of case-based classifiers in domains with large numbers of features. We show that they can locate smaller feature subsets leading to significantly higher accuracies on this task. We begin by describing the cloud classification task and Bankert's study. Next, we explain the benefits of using case-based classifiers in feature selection, and argue why such approaches should outperform the approach previously used. We then describe a set of feature selection algorithms that we tested on the cloud classification task. Finally, we report our results and discuss them in the context of related work.

This paper has three primary contributions. First, we provide further evidence that feature selection algorithms should use the classifier itself to evaluate feature subsets. Second, we show that a case-based algorithm is a particularly good classifier in this context. Finally, we introduce a method that allows the backward sequential selection algorithm to scale to large numbers of features.

Cloud Classification Task

Bankert (1994) describes the cloud classification task. The original data were supplied and expertly labeled by Dr. C. Wash and Dr. F. Williams of the Naval Postgraduate School. They were in the form of four advanced very high resolution radiometer (AVHRR) local area coverage (LAC) scenes of size 512 × 512 pixels. An additional 91 images were labeled independently by four NRL Monterey experts under the direction of Dr. C. Crosiar. The scenes were not confined to a specific location or time of year. Sample areas of 16 × 16 pixels that were covered by at least 75% of a particular cloud type were expertly labeled for each image. A total of 1633 cases were extracted using this procedure, where each case was labeled according to one of ten cloud classes.

The features used to describe each case consist of spectral, textural, and physical measures for each sample area. Examples of these groups include such features as maximum pixel value, gray level entropy of the image, gray level distribution of adjacent pixels in the horizontal dimension, and cloud-top temperature. Latitude was included as a final feature. All features are defined over a continuous range of values. The space of feature subsets (i.e., the power set of all features) for this task has size 2^204 ≈ 2.6 × 10^61.

Bankert (1994) used a standard feature selection approach from the pattern recognition literature. This reduced the number of features used from 204 to 15. He then applied PNN (Specht, 1990; Welch et al., 1992), which has a 10-fold cross-validation (CV) accuracy of 75.3% on the initial feature set. Its accuracy improves to 78.9% when using the selected 15 features.

Feature Selection

The objective of feature selection is to reduce the number of features used to characterize a dataset so as to improve an algorithm's performance on a given task. The task studied in this paper is classification. Our objectives are to maximize classification accuracy and minimize the number of features used to define the cases. Feature selection algorithms input a set of features and yield a subset of them.

Several knowledge-intensive CBR algorithms have been used to perform feature selection (e.g., Ashley & Rissland, 1988). However, domain-specific knowledge is not yet available for the cloud classification task. This prevents us from using explanation-based approaches for indexing and retrieving appropriate features (e.g., Cain, Pazzani, & Silverstein, 1991; Ram, 1993). Furthermore, the same set of features is used to describe each case in this case base, their values have been pre-computed, and no further inferencing is required to access these values. Therefore, we do not address the cost of evaluating features (Owens, 1993). Thus, this study was restricted to knowledge-poor feature selection approaches (e.g., Devijver & Kittler, 1982), although we plan to work with knowledge-based techniques for feature extraction in the future.
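Throughout the experiments reported below, a candidate feature subset is judged by the 10-fold cross-validated accuracy of a classifier restricted to those features. The paper's IB1 implementation is not reproduced here; the following is a minimal NumPy sketch under the assumption that the case base is stored as an array X of shape (number of cases, 204) of continuous feature values with an integer class vector y, and with a plain unweighted 1-nearest-neighbor rule standing in for IB1. The names nn_predict and evaluate_subset are ours, not the paper's, and any feature normalization is omitted for brevity.

    import numpy as np

    def nn_predict(train_X, train_y, test_X):
        """Predict each test case's class as that of its nearest training case
        (Euclidean distance): a plain 1-nearest-neighbor rule."""
        # Squared Euclidean distance from every test case to every training case.
        d2 = ((test_X[:, None, :] - train_X[None, :, :]) ** 2).sum(axis=2)
        return train_y[d2.argmin(axis=1)]

    def evaluate_subset(X, y, subset, n_folds=10, seed=0):
        """10-fold cross-validated accuracy of 1-NN restricted to the given features."""
        rng = np.random.RandomState(seed)
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        correct = 0
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            preds = nn_predict(X[np.ix_(train, subset)], y[train],
                               X[np.ix_(test, subset)])
            correct += int((preds == y[test]).sum())
        return correct / len(y)

The subset argument is any non-empty sequence of column indices, so the same function can score subsets produced by each of the search procedures discussed below.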
Feature selection algorithms have three components:

1. Search algorithm. This searches the space of feature subsets, which has size 2^d, where d is the number of features. Points in this space are called states.
2. Evaluation function. This inputs a state and outputs a numeric evaluation. The search algorithm's goal is to maximize this function.
3. Classifier. This is the target algorithm that uses the final subset of features (i.e., those found by the search algorithm to have the highest evaluation).

[Figure 1: Common control strategies for feature selection approaches (filter and wrapper).]

Typically, these components interact in one of the two ways depicted in Figure 1. Either the search and evaluation algorithms locate the set of features before the classifier is consulted, or the classifier is itself the evaluation function. We adopt the terms filter and wrapper control strategies, respectively, from John, Kohavi, and Pfleger (1994), who argued that the wrapper strategy is superior because it avoids the problem of using an evaluation function whose bias differs from the classifier's. Our empirical results strongly support this claim.

Since many feature subsets are evaluated during the feature selection search, classifiers that require manual tuning of their parameters (e.g., PNN) cannot be used as the evaluation function. Since Bankert (1994) chose PNN as his classifier, he used a different function to evaluate states. We hypothesize that better results can be obtained by using a wrapper model. Non-parametric case-based classifiers are excellent choices for feature selection because they often obtain good accuracies and can also serve as evaluation functions. Thus, we used IB1 (Aha, Kibler, & Albert, 1991), an implementation of the nearest neighbor classifier, for both purposes in our studies.

Doak (1992) identifies three categories of search algorithms: exponential, sequential, and randomized. Exponential algorithms have exponential complexity in the number of features. Sequential search algorithms have polynomial complexity; they add or subtract features and use a hill-climbing search strategy. Randomized algorithms include genetic and simulated annealing search methods. Sequential search algorithms seem most appropriate for our task because exponential search algorithms (e.g., FOCUS (Almuallim & Dietterich, 1991)) are prohibitively expensive, and randomized algorithms, unless biased as in Skalak (1994), tend to yield larger subsets of features than do sequential strategies (Doak, 1992). Sequential algorithms also performed comparatively well in Doak's study. Thus, we chose to use them here. We have not yet investigated strategies that combine algorithms from different categories (e.g., Doak, 1992; Caruana & Freitag, 1994).

The most common sequential search algorithms for feature selection are variants of forward sequential selection (FSS) and backward sequential selection (BSS). FSS begins with zero features, evaluates all feature subsets with exactly one feature, and selects the one with the best performance. It then adds to this subset the feature that yields the best performance for subsets of the next larger size. This cycle repeats until no improvement is obtained by extending the current subset. BSS instead begins with all features and repeatedly removes the feature whose removal yields the greatest increase in performance.
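For concreteness, here is a sketch of the two generic sequential searches just described, parameterized by an evaluation function with the same signature as the evaluate_subset sketch above. This illustrates the loops, not the authors' code; the stopping rule shown (halt when no candidate change improves the current score) is one common choice.

    def forward_selection(X, y, evaluate):
        """FSS: start with no features and greedily add the feature whose addition
        most improves the evaluation; stop when no addition improves it."""
        selected, best = [], -1.0
        remaining = set(range(X.shape[1]))
        while remaining:
            score, f = max((evaluate(X, y, selected + [g]), g) for g in remaining)
            if score <= best:
                break
            best = score
            selected.append(f)
            remaining.remove(f)
        return selected, best

    def backward_elimination(X, y, evaluate):
        """BSS: start with all features and repeatedly drop the feature whose
        removal most increases the evaluation; stop when no removal helps."""
        selected = list(range(X.shape[1]))
        best = evaluate(X, y, selected)
        while len(selected) > 1:
            score, f = max((evaluate(X, y, [g for g in selected if g != h]), h)
                           for h in selected)
            if score <= best:
                break
            best = score
            selected.remove(f)
        return selected, best

With evaluate = evaluate_subset these loops realize a wrapper configuration; swapping in a cheaper separability index realizes a filter configuration.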
Doak (1992) reported that variants of BSS outperformed variants of FSS, probably because BSS evaluates the contribution of a feature when all other features are available, whereas FSS suffers when features do not contribute to performance individually but do so when considered jointly with other features (i.e., when they interact). However, BSS's computational expense prevents it from being easily applied to our task. It would require approximately 21,000 applications of the evaluation function before it could find a subset of the size found using Bankert's approach, and each application would involve running IB1, whose cost is O(F × I²), where I is the number of instances and F is the number of features. Therefore, we modified BSS (see below). Because of this modification, it is not obvious whether it would perform as well as FSS in our study.

A Framework of Algorithms

In our experiments, we focused on the four categories of feature selection algorithms shown in Table 1. The rest of this section describes the algorithms we applied.

Table 1: Four types of feature selection algorithms.

    Control Strategy    Search Algorithm
    Filter              FSS
    Filter              BSS
    Wrapper             FSS
    Wrapper             BSS

Filter Control with FSS

Bankert's (1994) approach is a member of the first of these four categories. He used FSS with the Bhattacharyya class separability index as the evaluation function; it measures the degree to which classes are separated and internally cohere. Many such indices have been proposed and evaluated (e.g., Devijver & Kittler, 1982). They are usually less computationally expensive than the classifier and are non-parametric, which allows them to be used without manual parameter tuning. However, if their bias does not match the bias of the classifier, then they can lead to suboptimal performance.
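The Bhattacharyya-based index Bankert used is not reproduced here. As a rough stand-in for the kind of separability measure a filter strategy employs, the sketch below scores a feature subset by the ratio of between-class to within-class scatter (a Fisher-style criterion); this is an illustrative assumption, not the index from either Bankert (1994) or Skalak (1994).

    import numpy as np

    def scatter_separability(X, y, subset):
        """Filter-style evaluation: ratio of between-class to within-class scatter
        (trace form), computed only on the chosen features. Larger values mean the
        classes are better separated and more internally compact."""
        Z = X[:, list(subset)]
        grand_mean = Z.mean(axis=0)
        between, within = 0.0, 0.0
        for c in np.unique(y):
            Zc = Z[y == c]
            class_mean = Zc.mean(axis=0)
            between += len(Zc) * ((class_mean - grand_mean) ** 2).sum()
            within += ((Zc - class_mean) ** 2).sum()
        return between / within

Because it shares the (X, y, subset) signature with evaluate_subset, it can be dropped into the forward_selection loop above to obtain a filter configuration.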
Filter Control with BSS

The second category combines a filter control strategy with BSS search. Since it is difficult to apply BSS to a task with 204 features due to its computational complexity, we introduce BEAM, a novel extension of BSS that allows it to work with larger numbers of features. BEAM is somewhat similar to Doak's (1992) random generation plus sequential selection (RGSS) algorithm. Both begin with a feature subset found through random selection of features, and both use a form of sequential feature selection after locating this initial subset. After choosing the initial set of features, RGSS uses FSS and then BSS in the hope of overcoming the drawbacks of both approaches. These sequential search algorithms suffer from beginning with extreme subsets of features (i.e., none and all features, respectively). Thus, it makes sense to begin with some other subset of features, to extend it with FSS, and then to use BSS, since BSS has proven itself to perform well on many tasks. Doak (1992) reports good performance with RGSS on a heart disease database with 75 features.

BEAM differs from RGSS in three ways. First, instead of beginning with a random selection of features, BEAM randomly samples the feature space for a fixed number of iterations and begins with the best-performing feature subset found during those iterations. BEAM's performance in informal experiments without this more careful selection of an initial feature set was much lower. (The average 10-fold CV accuracy of fifty random selections of 25 features was 66.7%, while the same average for BEAM's initially selected feature sets was 74.6%.) Second, BEAM is biased in the initial number of features it selects. For the cloud classification task, BEAM initially sampled only in the space of 25 features among the 204 available. This is required because the space of features is large, and we know from Bankert's (1994) study that a small number of features (i.e., 15) significantly improves predictive accuracy on this task. Third, BEAM employs a beam search (hence its name) during BSS. It maintains a fixed-size queue of the best-performing states it has visited, along with a list of the features in each that have already been removed by BSS for evaluation. The states on the queue are ordered by decreasing accuracy, and the queue is updated each time a state is selected and evaluated.

A parameter to BEAM determines whether a state evaluation consists of evaluating all subsets with one fewer feature (full evaluation) or only a single feature subset (greedy evaluation). When using greedy evaluation, a random element is used to select both which state on the queue to examine next and which feature to remove from it. The probability that a state is chosen is a decreasing function of its location in the queue: a state in location i of a queue of size n is selected with probability

    P(i) = (n - i + 1) / Σ_{j=1}^{n} j.

Let L be the subset of features in the selected state F that have already been removed for evaluation. The next feature selected from this state is chosen randomly, according to a uniform distribution, from F − L.

BEAM's algorithm is summarized in Figure 2. The subfunctions have been explained above except for update_queue, which during greedy evaluation adds f to L, removes the selected state from (and reduces the number of states in) the queue when F = L, and properly inserts the state with features F − {f} into the queue if its evaluation (i.e., 10-fold CV accuracy) is among the best n whose features have not yet been exhaustively removed by BSS for evaluation.

Figure 2: BEAM, a biased variant of BSS for feature selection with large numbers of features.

    Inputs:
      r: number of randomly selected feature subsets to evaluate
      d: number of features in each randomly selected subset
      n: size of the queue
      e: type of evaluation (full or greedy)
      I: number of select/evaluate iterations
    Key:
      F: a subset of features
      L: the features f in F such that F − {f} has already been evaluated

    BEAM(r, d, n, e, I)
      1. F = best_random_state(r, d)
      2. queue = initialize_queue(F)
      3. best_evaluation = 0.0
      4. For I iterations:
         A. {F, L} = select_state(queue)
         B. f = randomly_select_feature(F − L)
         C. evaluation = evaluate(F − {f}, e)
         D. queue = update_queue(F, f, evaluation, n)
         E. If evaluation > best_evaluation:
            i.  best_evaluation = evaluation
            ii. best_state = F − {f}
      5. Output: best_evaluation and best_state

For this category, we selected as BEAM's evaluation function a separability index that is similar to the one used by Bankert and that performed best in a comparison with many other indices (Skalak, 1994). A full evaluation function was used with a queue size of 1. (Since a larger queue could contain feature subsets of different sizes, and since most separation indices are sensitive to feature subset size, we decided not to experiment with the greedy evaluation function for this category.)
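The following is a hedged Python rendering of the greedy-evaluation variant of Figure 2, written against the evaluate_subset sketch given earlier; the evaluate argument can equally be a separability index (filter) or the cross-validated nearest-neighbor accuracy (wrapper). Queue bookkeeping and tie-breaking details that the paper leaves open are our guesses, and the state-selection weights apply the P(i) formula with the current queue length.

    import random

    def beam_greedy(X, y, evaluate, r=325, d=25, n=25, iterations=1000, seed=0):
        """Sketch of BEAM's greedy-evaluation variant: start from the best of r
        random d-feature subsets, then repeatedly pick a queued state (biased
        toward the better-ranked ones), drop one not-yet-tried feature from it,
        and score the resulting subset."""
        rng = random.Random(seed)
        all_features = list(range(X.shape[1]))

        # best_random_state: evaluate r random d-feature subsets, keep the best.
        start = max((rng.sample(all_features, d) for _ in range(r)),
                    key=lambda s: evaluate(X, y, s))

        # Queue entries: (score, features, features already tried); best first.
        queue = [(evaluate(X, y, start), start, set())]
        best_score, best_state = queue[0][0], list(start)

        for _ in range(iterations):
            if not queue:
                break
            # Select the i-th ranked state with weight n - i + 1, i.e., the
            # paper's P(i) applied to the current queue length.
            m = len(queue)
            idx = rng.choices(range(m), weights=[m - i for i in range(m)])[0]
            _, feats, tried = queue[idx]

            f = rng.choice([g for g in feats if g not in tried])
            tried.add(f)
            if len(tried) == len(feats):       # every feature tried: retire state
                queue.pop(idx)

            child = [g for g in feats if g != f]
            child_score = evaluate(X, y, child)
            if child_score > best_score:
                best_score, best_state = child_score, child

            queue.append((child_score, child, set()))
            queue.sort(key=lambda entry: -entry[0])
            del queue[n:]                       # keep only the n best states
        return best_state, best_score

With the default arguments this mirrors the r = 325, d = 25, n = 25, I = 1000 settings reported later, but the sketch is an interpretation of the pseudocode, not the authors' implementation.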
Wrapper Control with FSS

For this category, we combined FSS search with IB1 as both the evaluation function and the classifier. We also tested a second variant of this combination in which we used BEAM, but substituted FSS for BSS. When using the greedy evaluation function, we initialized BEAM with a randomly chosen subset of three features (i.e., the best among 25 subsets tested for that size) and used a queue size of 25. A queue size of one was used with full evaluation, which began with the empty subset of features. In both cases, feature subsets were constrained to have size no more than 25.

Wrapper Control with BSS

The fourth category combines a wrapper control strategy with BSS search. We again used the BEAM algorithm, but in this case with IB1 (Aha, Kibler, & Albert, 1991) as the evaluation function. Again, we tested BEAM using both the full and single-state (greedy) evaluation functions, with queue sizes of one and 25, respectively.

Other Algorithms

We included two other case-based feature selection algorithms in our study. IB4 (Aha, 1992a) was chosen as an example of a classifier that hill-climbs in the space of real-valued feature weights. Several such algorithms exist (e.g., Wettschereck & Dietterich, 1994). Although these algorithms have performed well on some datasets, they tend to work best when features are either highly relevant or irrelevant. We hypothesized that IB4 would not work well on the cloud classification task, which contains many partially relevant features. We also included Cardie's (1993) approach. She used C4.5 (Quinlan, 1993) to select which features to use in a case-based classifier and reported favorable results. This method has not previously been used on tasks involving more than a few dozen features, and some reports suggest that C4.5 does not always perform well when the features are all continuous (e.g., Aha, 1992b). Thus, we suspected that it might not perform well on the cloud classification task.
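Cardie's method relies on C4.5, which is not reproduced here. Purely as an illustrative stand-in (an assumption, not the paper's setup), the sketch below uses scikit-learn's CART-style DecisionTreeClassifier and takes the features the fitted tree actually splits on as the subset handed to the nearest-neighbor classifier.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def tree_selected_features(X, y, random_state=0):
        """Fit a decision tree and return the indices of the features it actually
        splits on; those features are then handed to the case-based classifier."""
        tree = DecisionTreeClassifier(random_state=random_state).fit(X, y)
        split_features = tree.tree_.feature      # internal nodes; leaves are -2
        return np.unique(split_features[split_features >= 0])

The resulting subset can be scored with evaluate_subset like any other; since CART only approximates C4.5, this does not reproduce the exact configuration evaluated below.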
Empirical Comparisons

We examined two hypotheses concerning the sequential search framework described in the previous section:

1. The wrapper control strategy is superior to the filter control strategy for this task.
2. Variants of BSS tend to outperform variants of FSS for this task.

The first hypothesis is based on evidence that using the classifier itself as the evaluation function yields better performance on some tasks (e.g., Doak, 1992; Vafaie & De Jong, 1993; John, Kohavi, & Pfleger, 1994). Similar evidence exists for the second hypothesis (Doak, 1992). However, these hypotheses have not been previously investigated for tasks of this magnitude. We also have secondary hypotheses concerning the other two algorithms in our study:

3. Searching in the 2^d space is superior to searching in the ℜ^d space, at least when initially determining which features are needed to attain good performance on the cloud classification task.
4. C4.5 will not perform particularly well as a filter algorithm because this dataset contains only numeric-valued features.

Hypothesis three is based on the observation that ℜ^d is much larger than 2^d, which is itself large for our task. Thus, a hill-climbing algorithm such as IB4 will have difficulty locating a set of weights that yields good performance on this task. Finally, although C4.5 has performed well as a feature selection algorithm in previous studies, it has not been used for this purpose with numeric-valued data.

Baseline Results

When using all 204 attributes, IB1's 10-fold CV accuracy is 72.6%, while PNN's accuracy is 75.3%. The most frequent class in the database accounts for 15.4% of the cases. Both IB1 and PNN significantly increase accuracy over this baseline, although PNN fares better here because IB1 does not perform well on high-dimensional tasks involving many partially relevant attributes (Aha, 1992a).

Comparison Results

Table 2 displays the best and average results when using the eight feature selection algorithms described in the previous section. Additionally, the first line refers to the results obtained when randomly selecting subsets of features. The algorithms using the wrapper control strategy all used IB1 as their evaluation function, while the filter-strategy algorithms used a separability index to evaluate states. IB1 was used as the classifier for all of the algorithms.

Table 2: Best and average (over 10 runs for the nondeterministic algorithms) 10-fold CV percent accuracies (Acc) and feature set sizes (Sz) for feature selection algorithms on the cloud classification task, where "Bs" and "Bf" indicate using BEAM with the single (greedy) and full evaluation strategies, respectively.

    Control     Search                  Best           Average
    Strategy    Algorithm            Acc    Sz       Acc     Sz
    --          Random Selection    75.9    83      69.9   103.4
    Filter      FSS                 78.6    15        -       -
    Filter      BSS                 75.4    24      69.8    21.3
    Wrapper     FSS                 88.0    10        -       -
    Wrapper     FSS (Bs)            87.2    25      83.1    24.9
    Wrapper     BSS (Bf)            82.4    13      79.7    17.0
    Wrapper     BSS (Bs)            85.1     5      81.0     9.1
    --          IB4                 73.3   204        -       -
    --          C4.5                76.5   79.4       -       -

The second line in Table 2 refers to using the subset found by Bankert (1994) with IB1 rather than PNN as the classifier. This method is at a disadvantage because the bias of the index separation measure may not match the bias of the classifier. Thus, the set of features selected by the index measure may not yield comparatively high accuracies, as is true in this case. The third line refers to using BSS search, a separation index, and the IB1 classifier. This approach also performed comparatively poorly. Our other results suggest that this is due to using a separation index as the evaluation function.

Line four of Table 2 replaces Bankert's use of a separability index with IB1 as the evaluation function. This single change yields the highest accuracy in our study, providing strong evidence that also using the classifier as the evaluation function yields better-performing feature selection algorithms. Line five reports a similar approach, but in this case each run was initialized with the best-performing subset of three randomly chosen features. BEAM was used (with FSS rather than BSS) with a queue size of 25, and the sizes of feature subsets were constrained to not exceed 25. This algorithm found a subset with high accuracy but required a relatively larger number of features.

The sixth and seventh lines in Table 2 list BEAM's results when using BSS as the search algorithm (r = 325, d = 25, and I = 1000) and IB1 as the classifier. These settings have not been optimized. Line six refers to using a queue size of one and the full evaluation strategy. Line seven refers to using a queue size of 25 and the greedy evaluation strategy. This latter approach performed well; it located a subset containing only five features that attained an accuracy of 85.1%.

The eighth line in Table 2 refers to using IB4 (Aha, 1992a). Like most weight-tuning case-based algorithms, IB4 does not perform well unless each feature is either highly relevant or irrelevant, so its poor performance here is not surprising. The final line refers to using C4.5 (Quinlan, 1993) to select features for IB1. Using C4.5's default parameter settings, it yields pruned trees that use a large number of features and that do not deliver high predictive accuracies. We have not investigated whether alternative parameter settings would improve its performance.
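To make the best-and-average reporting concrete, here is a sketch of the random-selection baseline in the first line of Table 2: draw several random feature subsets, score each with the wrapper evaluation, and report the best and mean results. The way subset sizes are drawn here is an assumption for illustration; the paper does not specify it.

    import random
    import numpy as np

    def random_selection_baseline(X, y, evaluate, runs=10, seed=0):
        """Score `runs` randomly drawn feature subsets and report the best and
        average accuracies together with the corresponding subset sizes."""
        rng = random.Random(seed)
        results = []
        for _ in range(runs):
            size = rng.randint(1, X.shape[1])          # subset size drawn at random
            subset = rng.sample(range(X.shape[1]), size)
            results.append((evaluate(X, y, subset), size))
        accs, sizes = zip(*results)
        best_acc, best_size = max(results)
        return {"best_acc": best_acc, "best_size": best_size,
                "avg_acc": float(np.mean(accs)), "avg_size": float(np.mean(sizes))}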
Summary

The highest accuracy (i.e., 88.0%) was obtained when using the FSS search algorithm combined with IB1 as the evaluation function. A one-tailed Student's t-test confirmed that the accuracies obtained using the feature subset found by this approach, as tested on the ten folds, are significantly higher than the best accuracies obtained by the other approaches, at the 95% confidence level or higher. In particular, using IB1 rather than a separability index as the evaluation function significantly improved performance when using either FSS (97.5% confidence level) or BSS (99.5%). Thus, this provides evidence for our hypothesis that using the classifier as the evaluation function improves performance. The combination of BSS with IB1 as the evaluation function, as implemented in BEAM with the single (greedy) evaluation option, also performed well. However, FSS with IB1 located a five-feature subset with even higher accuracy (i.e., 85.9%). Thus, we found no evidence that BSS outperformed FSS, but this may be due to the differences between BSS and its modification in BEAM. As expected, the final two algorithms performed comparatively poorly in terms of accuracy (i.e., their accuracies were significantly lower than those recorded by the wrapper algorithms, at the 95% confidence level or higher). Furthermore, neither algorithm found small feature sets in this study.
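The significance claims above compare the per-fold accuracies of two feature subsets with a one-tailed paired t-test. Below is a sketch using SciPy and the nn_predict helper given earlier; the fold construction must match the one used to evaluate the subsets, and halving the two-sided p-value is the usual one-tailed adjustment. These details are assumptions, not the authors' test script.

    import numpy as np
    from scipy import stats

    def fold_accuracies(X, y, subset, n_folds=10, seed=0):
        """Per-fold 1-NN accuracies for one feature subset (folds must match those
        used by evaluate_subset, hence the shared seed and splitting scheme)."""
        rng = np.random.RandomState(seed)
        folds = np.array_split(rng.permutation(len(y)), n_folds)
        accs = []
        for i in range(n_folds):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
            preds = nn_predict(X[np.ix_(train, subset)], y[train],
                               X[np.ix_(test, subset)])
            accs.append(float((preds == y[test]).mean()))
        return np.array(accs)

    def one_tailed_paired_ttest(acc_a, acc_b):
        """Test whether subset A's fold accuracies are significantly higher than B's."""
        t, p_two_sided = stats.ttest_rel(acc_a, acc_b)
        p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
        return t, p_one_sided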
Related Work

Feature selection has its roots in pattern recognition and statistics and is addressed in several textbooks (e.g., Devijver & Kittler, 1982). Mucciardi and Gose's (1971) empirical study fostered much interest in heuristic approaches for feature selection that use filter control strategies. Cover and van Campenhout (1977) later proved that, under some assumptions, such heuristic algorithms can perform arbitrarily poorly, yet exhaustive (exponential) search algorithms are prohibitively expensive for non-trivial tasks. Their seminal paper greatly influenced the pattern recognition community and arguably reduced interest in heuristic approaches using filter strategies.

In contrast, several feature selection algorithms have recently been described in the AI literature. Many weight-tuning algorithms have been proposed (e.g., Aha, 1992a; Wettschereck & Dietterich, 1994), although it is not feasible to explore the space of real-valued weights in ℜ^d when d is large. Also, these algorithms work best when the features are either highly relevant or irrelevant, which does not appear to be true for the cloud classification task. However, it is possible that the notion of locally warping the instance space (e.g., Aha & Goldstone, 1992) can be used for feature selection (i.e., when the relevance of features varies across the instance space). We suspect that FOCUS (Almuallim & Dietterich, 1991) and RELIEF (Kira & Rendell, 1992) would not perform well on this task, since the former uses exhaustive search and the latter assumes that features are, again, either highly relevant or irrelevant.

Vafaie and De Jong (1993) instead used a genetic algorithm to select features for a rule induction program. Like John, Kohavi, and Pfleger (1994), they found filtering inferior to the wrapper control strategy, although the database in their study again involved only a comparatively small number of features. The results from Doak's (1992) survey suggest using feature selection algorithms similar to BEAM. Doak combined random with sequential searching, suggested using a beam search to guide BSS, and suggested that genetic search approaches can be improved by biasing them toward smaller feature subsets. Skalak (1994) recently demonstrated that this approach is effective. BEAM extends some of these ideas by adding a strategy for selecting a good initial set of features, implementing a beam search, and introducing random elements into the state selection process; it may benefit from using a more intelligent method for selecting its initial feature subset.

Conclusions and Future Research

This paper focuses on improving predictive accuracy for a specific task: cloud classification. Properties specific to this task require the use of feature selection approaches to improve case-based classification accuracy. We explained the motivation for using sequential selection algorithms for this task, tested several variants, and introduced a novel modification of BSS, named BEAM, for problems with large numbers of features. Our results strongly indicate that, when using these algorithms, the evaluation function should be the same as the classifier. This leads to significantly higher accuracies than those previously published for this task (Bankert, 1994).

We have not yet evaluated BEAM on other problems with large numbers of features to better analyze its benefits. While we anticipate that some of its properties will prove generally useful, we do not yet have evidence for such claims. Also, several of BEAM's design decisions have not been justified, and we plan to examine these decisions systematically in future research. Another goal is to integrate the capabilities of genetic search into BEAM, such as by using it whenever backward sequential selection depletes the queue of good feature subsets. Similarly, C4.5 can be utilized by selecting feature subsets from the power set of the features it finds to be relevant for classification. Many other combinations might prove useful. A promising direction for future research involves using domain-specific knowledge concerning feature relevance; methods used in case-based reasoning should prove highly valuable once such domain expertise becomes available.

Acknowledgements

Thanks to Paul Tag, John Grefenstette, David Skalak, Diana Gordon, and Ashwin Ram for their feedback on this research.

References

Aha, D. W. (1992a). Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. International Journal of Man-Machine Studies, 36, 267-287.

Aha, D. W. (1992b). Generalizing from case studies: A case study. In Proceedings of the Ninth International Conference on Machine Learning (pp. 1-10). Aberdeen, Scotland: Morgan Kaufmann.

Aha, D. W., & Goldstone, R. L. (1992). Concept learning and flexible weighting. In Proceedings of the Fourteenth Annual Conference of the Cognitive Science Society (pp. 534-539). Bloomington, IN: Lawrence Erlbaum.

Aha, D. W., Kibler, D., & Albert, M. K. (1991). Instance-based learning algorithms. Machine Learning, 6, 37-66.

Almuallim, H., & Dietterich, T. G. (1991). Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence (pp. 547-552). Menlo Park, CA: AAAI Press.

Ashley, K. D., & Rissland, E. L. (1988). Waiting on weighting: A symbolic least commitment approach. In Proceedings of the Seventh National Conference on Artificial Intelligence (pp. 239-244). St. Paul, MN: Morgan Kaufmann.

Bankert, R. L. (1994). Cloud classification of AVHRR imagery in maritime regions using a probabilistic neural network. To appear in Journal of Applied Meteorology.
J., & Silverstein, G. (1991). Using domain knowledge to influence similarity judgement. In Proceedings of the Case-Based Reasoning Workshop (pp. 191-202). Washington, DC: Morgan Kaufmann. Caruana, R., & Freitag, D. (1994). Greedy attribute selection. To appear in Proceedings of the Eleventh Inter- 112 national Machine Learning Conference. New Brunswick, N J: Morgan Kaufmann. Cardie, C. (1993). Using decision trees to improve casebased learning. In Proceedings of the Tenth International Conference on Machine Learning (pp. 25-32). Amherst, MA: Morgan Kaufmann. Cover, T. M., & van Campenhout, J. M. (1977). On the possible orderings in the measurementselection problem. IEEE Transactions on Systems, Man, and Cybernetics, 7, 657-661. Devijver, P. A., &Kittler, J. (1982). Pattern recognition: A statistical approach. EnglewoodCliffs, N J: PrenticeHall. Doak, J. (1992). An evaluation of feature selection methods and their application to computer security (Technical Report CSE-92-18). Davis, CA: University of California, Department of Computer Science. John, G., Kohavi, R., & Pfleger, K. (1994). Irrelevant features and the subset selection problem. To appear in Proceedings of the Eleventh International Machine Learning Conference. New Brunswick, N J: Morgan Kaufmann. Kira, K., & Rendell, L. A. (1992). A practical approach to feature selection. In Proceedings of the Ninth International Conference on Machine Learning (pp. 249-256). Aberdeen, Scotland: Morgan Kaufmann. Mueciardi, A. N., & Gose, E. E. (1971). A comparison seven techniques for choosing subsets of pattern recognition properties. IEEE Transaction on Computers, 20, 1023-1031. Owens, C. (1993). Integrating feature extraction and memorysearch. Machine Learning, 10, 311-340. Peak, J. E., & Tag, P. M. (1992). Towards automated interpretation of satellite imagery for navy shipboard applications. Bulletin of the American Meterological Society, 73, 995-1008. Quinlan, J. R. (1993). C~.5: Programs for machine learning. San Mateo, CA: Morgan Kaufmann. Ram, A. (1993). Indexing, elaboration, and refinement: Incremental learning of explanatory cases. Machine Learning, 10, 201-248. SkaJak, D. (1994). Prototype and feature selection sampling and random mutation hill climbing algorithms. To appear in Proceedings of the Eleventh International Machine Learning Conference. New Brunswick, N J: Morgan Kaufmann. Specht, D. F. (1990). Probabilistic neural networks. Neural Networks, 3, 109-118. Vafaie, H., & De Jong, K. (1993). Robust feature selection algorithms. In Proceedings of the Fifth Conference on Tools for Artificial Intelligence (pp. 356-363). Boston, MA: IEEE Computer Society Press. Welch, R. M., Sengupta, S. K., Goroch, A. K., Rabindra, P., Rangaraj, N., & Navar, M. S. (1992). Polar cloud and surface classification using AVHRR imagery: An intercomparison of methods. Journal of Applied Meteorology, 31, 405-420. Wettschereck, D., & Dietterich, T. G. (1994). An experimental comparison of the nearest neighbor and nearest hyperrectangle algorithms. To appear in Machine Learning.