Decision-Rule Solutions for Data Mining with Missing Values

Sholom M. Weiss and Nitin Indurkhya
IBM T.J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY 10598, USA
sholom@us.ibm.com, nitin@data-miner.com
IBM Research Report RC-21783

Abstract. A method is presented to induce decision rules from data with missing values, where (a) the format of the rules is no different than rules for data without missing values and (b) no special features are specified to prepare the original data or to apply the induced rules. This method generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. Disjuncts in rules are naturally overlapping. When combined with voted solutions, the inherent redundancy is enhanced. We provide experimental evidence that this transparent approach to classification can yield strong results for data mining with missing values.

Keywords: decision rule induction, boosting

1 Introduction

Data warehousing has increased the opportunities for data mining. Unlike the datasets that have often been used in scientific experimentation, transactional databases often contain many missing values. Data with missing values complicate both the learning process and the application of a solution to new data. Depending on the learning method, special data preparation techniques may be necessary, which increases the amount of data preprocessing. The most common preprocessing techniques involve filling in the missing values. For instance, in [Pyle, 1999], several general approaches are described to replace the missing values prior to mining:

- Estimate values using simple measures derived from means and standard deviations
- Estimate values by regression
- Augment each feature with a special value or flag that can be used in the solution as a condition for prediction

While potentially useful, each of these techniques has obvious drawbacks. Estimating the missing value by a simple measure like a class mean is often circular reasoning that is a direct substitute for the class label. Moreover, missing values for new cases remain a problem. Estimating by regression is just as complex a task as the given classification problem. Using the occurrence of a missing value to reach a positive or negative conclusion may not be sensible in many contexts, and it clearly increases the complexity of the solution.

With the commercial application of data mining methods, increased attention is given to decision trees and rules. These techniques may perform well and have the potential to give insight into the interpretation of data mining results, for example in marketing efforts. Decision tree methods have a long history of special techniques for processing missing values [Breiman et al., 1984], [Quinlan, 1989]. They process training data without any transformations, but use surrogates for tree nodes when values are missing. When a true-or-false test can potentially encounter a missing value, a number of alternative tests are also specified that hopefully track the results of the original test. Thus, the original data remain stable, but special methods and representations are needed to process missing data. Decision rules are closely related to decision trees.
The terminal nodes of a tree can be grouped into Disjunctive Normal Form (DNF) rules, only one of which is satisfied for a new case. Decision rules are also DNF rules, but allow rules to overlap, which potentially allows for more compact and interesting rule sets. Decision tree induction methods are more efficient than those for decision rule induction; some methods for decision rule induction actually start with an induced decision tree. Procedures for pruning and optimization are relatively complex [Weiss and Indurkhya, 1993], [Cohen, 1995]. Single decision trees are often dramatically outperformed by voting methods for multiple decision trees. Such methods produce exaggeratedly complex solutions, but they may be the best obtainable with any classifier. In [Cohen and Singer, 1999], boosting techniques [Schapire, 1999] are used by a system called SLIPPER to generate a weighted set of rules that are shown to generally outperform standard rule induction techniques. While these rules can maintain clarity of explanation, they do not match the predictive performance of the strongest learning methods, such as boosted trees. Of particular interest to our work is [Friedman et al., 1998], where very small trees are boosted to high predictive performance by truncated tree induction (TTI). Small trees can be decomposed into a collection of interpretable rules. Some of the boosted collections of tiny trees, even tree stumps, have actually performed best on benchmark applications.

In this paper, we discuss methods for learning and applying decision rules for classification from data with many missing values. The rules generated are Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. Disjuncts in rules are naturally overlapping. When combined with voted solutions, the inherent redundancy is enhanced. The method can induce decision rules from data with missing values where (a) the format of the rules is no different than rules for data without missing values and (b) no special features are specified to prepare the original data or to apply the induced rules. We provide experimental evidence that this transparent approach to classification can yield strong results for data mining with missing values.

2 Methods and Procedures

The classical approach to rule induction is a two-step process. The first step is to find a single covering solution for all training examples. The covering rule set is found directly by inducing conjunctive rules, or indirectly by inducing a decision tree. The direct solution usually involves inducing one rule at a time, removing the cases covered by the rule, and then repeating the process. The second step is to prune the covering rule set or tree into smaller structures and pick the best one, either by a statistical test or by applying the rule sets to independent test cases.

A pure DNF rule for classification is evaluated as satisfied or not. If satisfied, the rule implies a specific class. The conditions or components of a rule can be tested by applying \le or > operators to variables, coding categorical values separately as 1 for true and 0 for false. We can measure the size of a DNF rule with two measurements: (a) the length of a conjunctive term and (b) the number of terms (disjuncts). For example,

    (c_1 \wedge c_2 \wedge c_3) \vee (c_1 \wedge c_3 \wedge c_4) \Rightarrow Class

is a DNF rule for conditions c_i, with a maximum length of three and two terms (disjuncts).
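To make the rule format and its voted application concrete, the following is a minimal sketch in C, the language of the authors' own implementation, although this code is not theirs. The array-based representation and every name in it (Condition, Term, Rule, classify) are illustrative assumptions; it also anticipates the convention, described later in this section, that a term touching a missing value is simply not satisfied.

    #include <math.h>

    /* A condition tests one feature with <= or >; categorical values are
     * assumed to be pre-coded as 0/1 features.  NAN marks a missing value. */
    typedef struct { int feature; int is_greater; double threshold; } Condition;
    typedef struct { int n_conds; Condition conds[5]; } Term;           /* conjunct */
    typedef struct { int n_terms; Term terms[16]; int class_id; } Rule; /* DNF rule */

    /* A term is satisfied only if every condition holds; a missing value
     * makes the condition, and hence the term, unsatisfied. */
    static int term_satisfied(const Term *t, const double *x)
    {
        for (int c = 0; c < t->n_conds; c++) {
            const Condition *cd = &t->conds[c];
            double v = x[cd->feature];
            if (isnan(v)) return 0;           /* missing => not satisfied */
            if (cd->is_greater ? !(v > cd->threshold) : !(v <= cd->threshold))
                return 0;
        }
        return 1;
    }

    /* A DNF rule is satisfied if any one of its disjuncts is satisfied. */
    static int rule_satisfied(const Rule *r, const double *x)
    {
        for (int t = 0; t < r->n_terms; t++)
            if (term_satisfied(&r->terms[t], x)) return 1;
        return 0;
    }

    /* Unweighted voting: each class has an equal number of rules, and the
     * predicted class is the one with the most satisfied rules. */
    int classify(const Rule *rules, int n_rules, int n_classes, const double *x)
    {
        int votes[64] = {0};                  /* assumes n_classes <= 64 */
        for (int i = 0; i < n_rules; i++)
            if (rule_satisfied(&rules[i], x)) votes[rules[i].class_id]++;
        int best = 0;
        for (int c = 1; c < n_classes; c++)
            if (votes[c] > votes[best]) best = c;
        return best;
    }

Note how the disjunction in rule_satisfied is what lets a rule survive a missing value: one unsatisfied term does not prevent another disjunct from firing.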
Complexity of rule sets can be controlled by providing an upper bound on these two measurements. Table 1 describes the standard analysis of results for binary classification. For evaluation purposes, a rule is applied to each case. Classification error is measured as in Equation 1, where for case i, FP(i) is 1 for a false positive, FN(i) is 1 for a false negative, and 0 otherwise.

Table 1. Analysis of Error for Binary Classification

                 Rule-true              Rule-false
    Class-true   True positives (TP)    False negatives (FN)
    Class-false  False positives (FP)   True negatives (TN)

    Error = FP + FN, \quad FP = \sum_i FP(i), \quad FN = \sum_i FN(i)    (1)

For almost all applications, more than one rule is needed to achieve good predictive performance. In our lightweight approach, a solution consists of a set of an equal number of unweighted rules for each class. A new example is classified by picking the class having the most votes, i.e., the class with the most satisfied rules. We are very democratic: each class has an equal number of rules and votes, and each rule is approximately the same size. The principal remaining task is to describe a method for inducing rules from data.

So far we have given a brief description of binary classification. This form of binary classification is at the heart of the rule induction algorithm, so let's continue to consider binary classification. The most trivial method for rule induction is to grow a conjunctive term of a rule by the greedy addition of a single condition that minimizes error. To ensure that a term is always added (when error is nonzero), we can define a slightly modified measure, err1, in Equation 2. Error is computed over candidate conditions where TP is greater than zero. If no added condition adds a true positive, the cost of a false negative error is doubled and the minimum cost solution is found; the cost of a false positive remains at 1. The minimum err1 is readily computed during sequential search using the bound of the current best err1 value.

    Err1 = FP + k \cdot FN, \quad k \in \{1, 2, 4, \ldots\} \text{ and } TP > 0    (2)

    frq(i) = 1 + e(i)^3    (3)

    FP = \sum_i FP(i) \cdot frq(i), \quad FN = \sum_i FN(i) \cdot frq(i)    (4)

The lightweight method is adaptive, and follows the well-known principle embodied in boosting: give greater representation to erroneously classified cases. The technique for weighting cases during training is greatly simplified from the usual boosting methods. Analogous to [Breiman, 1996], no weights are used in the induced solution. Weighting of cases during sampling follows a simple scheme. Let e(i) be the cumulative number of errors for case i over all rules; it is computed by applying all prior induced rules and summing the errors for the case. The weighting given to a case during induction is an integer value, representing the relative frequency of that case in the new sample. Equation 3 is the frequency that is used. It has good empirical support, having produced the best reported results on an important text-mining benchmark [Weiss et al., 1999], and was first described in [Weiss and Indurkhya, 1998]. Thus, if 10 rules have been generated, 4 of them erroneous on case i, then case i is treated as if it appeared in the sample 65 times. Based on prior experience, alternative functions to Equation 3 may also perform well. Unlike the results of [Bauer and Kohavi, 1999] for the alternative of [Breiman, 1996], Equation 3 performs well with or without random resampling, and the LRI algorithm uses no random resampling. The computation of FP and FN during training is modified slightly to follow Equation 4. Err1 is computed by simple integer addition.
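A minimal sketch of Equations 2-4 in C follows; the array layout, function names, and the split of work between caller and callee are illustrative assumptions rather than the paper's implementation. With e(i) = 4, the weight is 1 + 4^3 = 65, matching the example above.

    /* frq(i) = 1 + e(i)^3 (Equation 3): a case on which e rules have
     * erred is counted as if it appeared 1 + e^3 times in the sample. */
    long frq(long e) { return 1 + e * e * e; }

    /* err1 (Equations 2 and 4): frequency-weighted false positives plus
     * k times frequency-weighted false negatives.  The caller doubles k
     * (1, 2, 4, ...) whenever no candidate condition yields a true
     * positive.  fp[i]/fn[i] flag whether case i would be a false
     * positive/negative under the candidate condition. */
    long err1(const int *fp, const int *fn, const long *e, int n, long k)
    {
        long FP = 0, FN = 0;
        for (int i = 0; i < n; i++) {
            FP += fp[i] * frq(e[i]);
            FN += fn[i] * frq(e[i]);
        }
        return FP + k * FN;    /* simple integer addition, as in the text */
    }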
In practice, we use only 33 different values of e(i), from 0 to 32. Whenever the number of cumulative errors exceeds 32, all cumulative errors are normalized by an integer division by 2. The training algorithm for inducing a DNF rule R is given in Figure 1. The algorithm is repeated sequentially for the desired number of rules. Rules are always induced for binary classification: class versus not-class. An m-class classification problem is handled by mapping it to m binary classification problems, one for each class. Each of the binary classification problems can be computed independently and in parallel. As we shall see in Section 3, the equality of voting and rule size makes the predictive performance of rules induced from multiple binary classification problems quite comparable.

1. Grow conjunctive term T until the maximum length (or until FN = 0) by greedily adding conditions that minimize err1.
2. Record T as the next disjunct for rule R. If less than the maximum number of disjuncts (and FN > 0), remove cases covered by T, and continue with step 1.
3. Evaluate the induced rule R on all training cases i and update e(i), the cumulative number of errors for case i.

Fig. 1. Lightweight Rule Induction Algorithm

A pure DNF rule induction system has strong capabilities for handling missing values. Disjunction can produce overlap and redundancy. If we apply a rule to a case and a term is not satisfied because one of its conditions has a missing value, the rule may still be satisfied by one of the other disjuncts of the rule. These rules have no special conditions referring to missing values; they look no different than rules induced from data with no missing values. How is this accomplished? For the application of rules, a term is considered not satisfied when a missing value is encountered in a case. During training, the following slight modifications are made to the induction procedures:

- When looping to find the best attribute condition, skip cases with missing values.
- Normalize error to a base relative to the frequency of all cases.

    Norm_k = \frac{\sum_{n,\,\mathrm{all}} frq(n)}{\sum_{i,\,\mathrm{w/o\ missing\ values}} frq(i)}    (5)

    FP_k = Norm_k \cdot \sum_i FP(i) \cdot frq(i)    (6)

    FN_k = Norm_k \cdot \sum_i FN(i) \cdot frq(i)    (7)

Each feature may have a variable number of missing values. The normalization factor for feature k is computed as in Equation 5: the total frequency of all cases, including those with missing values, divided by the total frequency of cases without missing values. False positives and false negatives are computed as in Equations 6 and 7, a straightforward normalization of Equation 4 (sketched in code below).

To select the solution with the best predictive performance, decisions must be made about the two key measures of rule size: (a) conjunctive term length and (b) the number of disjuncts. For data mining large samples, the best solution can be found by using an independent test set for estimating true error. If only a single complexity measure is varied during training, such as the number of disjuncts, then the estimated error rates for comparing the different solutions using only one independent test set are nearly unbiased [Breiman et al., 1984].
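To tie together step 1 of Figure 1 and Equations 5-7, here is a minimal sketch in C of how the per-feature missing-value adjustment might be folded into the err1 score of a candidate condition. The function name, argument layout, and the use of NAN to mark missing values are illustrative assumptions.

    #include <math.h>

    /* err1 with the per-feature adjustment of Equations 5-7.  Cases whose
     * value for the candidate feature k is missing (NAN in column xk) are
     * skipped, and the weighted error of the remaining cases is scaled by
     * Norm_k so that features with many missing values compete fairly
     * with fully observed ones. */
    double normalized_err1(const double *xk, /* feature k; NAN = missing   */
                           const int *fp,    /* 1 if case i a false pos.   */
                           const int *fn,    /* 1 if case i a false neg.   */
                           const long *e,    /* cumulative errors, case i  */
                           int n, long k_cost)
    {
        double total = 0.0, present = 0.0, FP = 0.0, FN = 0.0;
        for (int i = 0; i < n; i++) {
            double w = 1.0 + (double)e[i] * e[i] * e[i];  /* frq(i), Eq. 3 */
            total += w;
            if (isnan(xk[i])) continue;    /* skip cases missing feature k */
            present += w;
            FP += fp[i] * w;
            FN += fn[i] * w;
        }
        double norm_k = (present > 0.0) ? total / present : 0.0;  /* Eq. 5 */
        return norm_k * (FP + (double)k_cost * FN);   /* Eqs. 6-7 in err1  */
    }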
3 Results

Before we present experimental results for lightweight rule induction, let's consider our real-world experience in an important data mining application: the detection of patterns in survey data. IBM, like many companies, surveys the marketplace trying to gauge customer attitudes. In the case of IBM, thousands of IT professionals are surveyed about their buying intentions and their view of IBM products and the products of competitors. Survey data may be collected every quarter or perhaps as frequently as every week. For some recent period, such as the most recent quarter, the survey data are grouped into a sample. The sample can be mined, and purchasing patterns that are detected can potentially be of great value for marketing.

The actual survey questions number in the many hundreds. Not all questions are asked of every respondent; records contain many missing values. What might be interesting questions? In practice, it is relatively easy to specify critical classification problems, for example "can we distinguish those people who intend to increase purchases of IBM equipment versus those that do not?" With hundreds of features and relatively difficult goals for discrimination, solutions of high complexity are likely when standard methods are used to find a minimum-error solution. Such solutions would not be acceptable to the marketers who make recommendations and take actions. In our case, the lightweight approach has an effective means of bounding the complexity of solutions. We can trade off somewhat stronger predictive performance for clarity of interpretation. While it may seem severe, rules induced from one survey were restricted in size to no more than two conditions with no disjunction, and three rules for each of two classes. These simplified rules perform somewhat weaker than more complex and larger rule sets, but in this application, interpretability far outweighs raw predictive performance. Moreover, although the survey data are riddled with missing values, the solutions, posed in the form of decision rules, extract the essential patterns without ever mentioning missing values.

To evaluate the performance of lightweight rule induction formally, datasets from the UCI repository [Blake et al., 1999] were processed. Table 2 summarizes the characteristics of these data. The number of features counts numerical features and categorical variables decomposed into binary features. Because the objective is data mining, we selected datasets having relatively large numbers of training cases and designated test sets. These datasets have no missing values, allowing us to set a baseline performance. Missing values were simulated by using a random number generator to delete an expected percentage of values from every feature (sketched in code below).

Table 2. Data Characteristics

    Name       Train   Test   Features  Classes
    coding      5000  15000        60        2
    digit       7291   2007       256       10
    letter     16000   4000        16       26
    move        1483   1546        76        2
    satellite   4435   2000        36        6
    wave        5000   5000        40        3

LRI has several design parameters that affect results: (a) the number of rules per class, (b) the maximum length of a rule, and (c) the maximum number of disjuncts. For all of our experiments, we set the length of rules to 5 conditions. For most applications, increasing the number of rules increases predictive performance until a plateau is reached. The critical parameter is the number of disjuncts. We varied the number of disjuncts in each rule over 1, 2, 4, 8, and 16, where 1 is a rule with a single conjunctive term.
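As noted above, missing values were simulated by randomly deleting an expected percentage of values from every feature. A minimal sketch of such a deletion step in C, assuming values are stored as doubles and NAN marks a missing entry; the function name is illustrative:

    #include <stdlib.h>
    #include <math.h>

    /* Delete each value independently with probability pct (e.g. 0.25
     * for an expected 25% missing), marking deleted entries with NAN. */
    void delete_values(double *data, long n_values, double pct)
    {
        for (long i = 0; i < n_values; i++)
            if ((double)rand() / RAND_MAX < pct)
                data[i] = NAN;
    }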
Table 3 summarizes the results for varying percentages of missing values in the training cases, with the original test cases. Solutions of different complexities were found over the training cases. After training was completed, error was measured on the large test set. The error listed is for the solution with the minimum test error. Also included are the results for a binary decision tree trained on data having no missing values. After pruning at various levels of complexity, the minimum test error tree was selected. Table 4 lists the results for the same training cases and the same induced rules, but with test cases that also have the same percentage of missing values. Figure 2 plots the change in performance of the rules on the digit data as the percentage of missing values increases in the training set only, or in both the training and test sets. Because the test error was used for finding the minimum error results for both the tree and the rules, the results are somewhat optimistic. Still, for data mining applications, this procedure is quite reasonable [Breiman et al., 1984]. The standard error for any of these applications is 1% or less.

[Fig. 2. Performance on Digit Data with Missing Values: error rate versus percentage of missing values (0 to 75), comparing the tree baseline ("Tree"), rules with missing values in training only ("Rule - train"), and rules with missing values in both training and test ("Rule - train/test").]

4 Discussion

Lightweight Rule Induction has a very simple representation: pure DNF rules for each class. It is egalitarian: each class has the same number of rules of approximately the same size. Scoring is trivial to understand: the class with the most satisfied rules wins.

Table 3. Error for Varying Percentages of Missing Values in Training Set

    pct missing   move   wave  satellite  coding  letter  digit
    0 (tree)     0.255  0.231      0.146   0.337   0.134  0.154
    0            0.195  0.142      0.092   0.246   0.039  0.059
    5            0.227  0.141      0.095   0.255   0.055  0.060
    10           0.247  0.143      0.101   0.270   0.068  0.065
    20           0.269  0.141      0.112   0.273   0.127  0.076
    25           0.286  0.147      0.122   0.278   0.153  0.082
    50           0.398  0.183      0.169   0.314   0.246  0.139
    75           0.476  0.246      0.210   0.404   0.444  0.225

Table 4. Error for Varying Percentages of Missing Values in Training and Test Set

    pct missing    move    wave  satellite  coding  letter   digit
    0            0.1947  0.1422     0.0910  0.2461  0.0395  0.0593
    5            0.2516  0.1498     0.0945  0.2605  0.0742  0.0673
    10           0.2898  0.1530     0.0980  0.2692  0.1117  0.0698
    20           0.3396  0.1762     0.1220  0.2892  0.2010  0.0837
    25           0.3364  0.1920     0.1345  0.2963  0.2525  0.0927
    50           0.3946  0.2996     0.2355  0.3629  0.5583  0.2123
    75           0.4877  0.5818     0.5425  0.4556  0.8562  0.6178

The method is about as simple as any rule induction method can be. The algorithm is rudimentary, and our C code implementation is less than 300 lines. It produces designer rules, where the size of the rules is specified by the application designer. The central question in Section 3 is: how well does LRI do on practical applications? For best predictive performance, a number of parameters must be selected prior to running. We have concentrated on data mining applications where it can be expected that sufficient test cases are available for easy estimation. Thus, we have included results that describe the minimum test error. With big data, it is easy to obtain more than one test sample, and for estimating a single variable, a large single test set is adequate in practice [Breiman et al., 1984]. For purposes of experimentation, we fixed almost all parameters except for the maximum number of disjuncts and the number of rules. The number of disjuncts is clearly on the critical path to higher performance. As already shown for boosting and all forms of adaptive resampling, most of the gains in performance are achieved with the initial smaller set of classifiers.

The results on these applications demonstrate the strength of this method when applied to missing values. Our simulations show much stronger performance than the tree method even when large numbers of data fields are deleted. The effects of randomly generated missing values are likely to be much more
drastic than those encountered in real-world data. Here, data fields are uniformly destroyed, whereas in real-world data the missing values are likely to be distributed among a subset of fields, hopefully the weaker features.

That these rules can perform relatively well on data with missing values should not be surprising. DNF rules have ample opportunity for overlap and redundancy. When combined with voted solutions, the inherent overlap is enhanced. Of greatest significance is that the format of these rules is no different than rules for data without missing values. Most practitioners of data mining dread the extra complexity of missing values in data. It may require much extra analysis and adjustment to data and methods. Lightweight rule induction offers a new approach that may reduce these tedious tasks while still providing a high-performance and interpretable solution.

References

[Bauer and Kohavi, 1999] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1):105-139, 1999.
[Blake et al., 1999] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1999. www.ics.uci.edu/~mlearn/MLRepository.html.
[Breiman et al., 1984] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[Breiman, 1996] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[Cohen and Singer, 1999] W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In Proceedings of the Annual Conference of the American Association for Artificial Intelligence, pages 335-342, 1999.
[Cohen, 1995] W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, 1995.
[Friedman et al., 1998] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Stanford University Statistics Department, 1998. www-stat.stanford.edu/~tibs.
[Pyle, 1999] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, 1999.
[Quinlan, 1989] J. Quinlan. Unknown attribute values in induction. In International Workshop on Machine Learning, pages 164-168, Ithaca, NY, 1989.
[Schapire, 1999] R. Schapire. A brief introduction to boosting. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1401-1405, 1999.
[Weiss and Indurkhya, 1993] S. Weiss and N. Indurkhya. Optimized rule induction. IEEE EXPERT, 8(6):61-69, December 1993.
[Weiss and Indurkhya, 1998] S. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1998. DMSK software: www.data-miner.com.
[Weiss et al., 1999] S. Weiss, C. Apte, F. Damerau, et al. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):63-69, 1999.