Decision-Rule Solutions for Data Mining with Missing Values

IBM Research Report RC-21783
Sholom M. Weiss and Nitin Indurkhya
IBM T.J. Watson Research Center,
P.O. Box 218, Yorktown Heights, NY 10598, USA
sholom@us.ibm.com, nitin@data-miner.com
Abstract. A method is presented to induce decision rules from data with missing values where (a) the format of the rules is no different from rules for data without missing values and (b) no special features are specified to prepare the original data or to apply the induced rules. This method generates compact Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. Disjuncts in rules naturally overlap. When combined with voted solutions, the inherent redundancy is enhanced. We provide experimental evidence that this transparent approach to classification can yield strong results for data mining with missing values.
Keywords: decision rule induction, boosting

1 Introduction
Data warehousing has increased the opportunities for data mining. Unlike the datasets that have often been used in scientific experimentation, transactional databases often contain many missing values. Data with missing values complicate both the learning process and the application of a solution to new data. Depending on the learning method, special data preparation techniques may be necessary. This increases the amount of data preprocessing.
The most common preprocessing techniques involve filling in the missing values. For instance, in [Pyle, 1999], several general approaches are described to replace the missing values prior to mining:

- Estimate values using simple measures derived from means and standard deviations
- Estimate values by regression
- Augment each feature with a special value or flag that can be used in the solution as a condition for prediction
While potentially useful, each of these techniques has obvious drawbacks. Estimating a missing value by a simple measure like a class mean often amounts to circular reasoning: the estimate is a direct substitute for the class label. Moreover, missing values for new cases remain a problem. Estimating by regression is just as complex a task as the given classification problem. Using the occurrence of a missing value to reach a positive or negative conclusion may not be sensible in many contexts and clearly increases the complexity of the solution.
With the commercial application of data mining methods, increased attention is given to decision trees and rules. These techniques may perform well and have the potential to give insight into the interpretation of data mining results, for example in marketing efforts.
Decision tree methods have a long history of special techniques for processing missing values [Breiman et al., 1984], [Quinlan, 1989]. They process training data without any transformations, but have surrogates for tree nodes when values are missing. When a true-or-false test can potentially encounter a missing value, a number of alternative tests are also specified that hopefully track the results of the original test. Thus, the original data remain stable, but special methods and representations are needed to process missing data.
Decision rules are closely related to decision trees. The terminal nodes of a tree can be grouped into Disjunctive Normal Form (DNF) rules, only one of which is satisfied for a new case. Decision rules are also DNF rules, but allow rules to overlap, which potentially allows for more compact and interesting rule sets.
Decision tree induction methods are more efficient than those for decision rule induction; some methods for decision rule induction actually start with an induced decision tree. Procedures for pruning and optimization are relatively complex [Weiss and Indurkhya, 1993], [Cohen, 1995]. Single decision trees are often dramatically outperformed by voting methods for multiple decision trees. Such methods produce exaggeratedly complex solutions, but they may be the best obtainable with any classifier. In [Cohen and Singer, 1999], boosting techniques [Schapire, 1999] are used by a system called SLIPPER to generate a weighted set of rules that are shown to generally outperform standard rule induction techniques. While these rules can maintain clarity of explanation, they do not match the predictive performance of the strongest learning methods, such as boosted trees. Of particular interest to our work is [Friedman et al., 1998], where very small trees are boosted to high predictive performance by truncated tree induction (TTI). Small trees can be decomposed into a collection of interpretable rules. Some of the boosted collections of tiny trees, even tree stumps, have actually performed best on benchmark applications.
In this paper, we discuss methods for learning and application of decision rules for classification from data with many missing values. The rules generated are Disjunctive Normal Form (DNF) rules. Each class has an equal number of unweighted rules. A new example is classified by applying all rules and assigning the example to the class with the most satisfied rules. Disjuncts in rules naturally overlap. When combined with voted solutions, the inherent redundancy is enhanced. The method can induce decision rules from data with missing values where (a) the format of the rules is no different from rules for data without missing values and (b) no special features are specified to prepare the original data or to apply the induced rules. We provide experimental evidence that this transparent approach to classification can yield strong results for data mining with missing values.
2 Methods and Procedures
The classical approach to rule induction is a two-step process. The first step is to find a single covering solution for all training examples. The covering rule set is found directly by inducing conjunctive rules or indirectly by inducing a decision tree. The direct solution usually involves inducing one rule at a time, removing the cases covered by the rule, and then repeating the process. The second step is to prune the covering rule set or tree into smaller structures and pick the best one, either by a statistical test or by applying the rule sets to independent test cases.
A pure DNF rule for classification is evaluated as satisfied or not. If satisfied, the rule implies a specific class. The conditions or components of a rule can be tested by applying ≤ or > operators to variables, with categorical values coded separately as 1 for true and 0 for false.
We can measure the size of a DNF rule with two measurements: (a) the length of a conjunctive term and (b) the number of terms (disjuncts). For example,

{c1 and c2 and c3} or {c1 and c3 and c4} => Class

is a DNF rule for conditions ci with a maximum length of three and two terms (disjuncts). The complexity of rule sets can be controlled by providing an upper bound on these two measurements.
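To make the representation concrete, the following minimal Python sketch (ours, not the authors' C implementation) evaluates such a rule; the feature indices and thresholds are hypothetical:

import operator

OPS = {"<=": operator.le, ">": operator.gt}

def term_satisfied(term, case):
    # A conjunctive term holds iff every (feature, op, threshold) condition holds.
    return all(OPS[op](case[f], thr) for (f, op, thr) in term)

def rule_satisfied(rule, case):
    # A DNF rule holds iff at least one of its disjuncts holds.
    return any(term_satisfied(term, case) for term in rule)

# {c1 and c2 and c3} or {c1 and c3 and c4} => Class, with hypothetical
# conditions on features 0..3: maximum term length three, two disjuncts.
rule = [[(0, ">", 0.5), (1, ">", 0.5), (2, ">", 0.5)],
        [(0, ">", 0.5), (2, ">", 0.5), (3, ">", 0.5)]]
print(rule_satisfied(rule, [1.0, 0.0, 1.0, 1.0]))  # True, via the second disjunct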
Table 1 describes the standard analysis of results for binary classification. For evaluation purposes, a rule is applied to each case. Classification error is measured as in Equation 1. For case i, FP(i) is 1 for a false positive, FN(i) is 1 for a false negative, and 0 otherwise.
Table 1. Analysis of Error for Binary Classification

            Rule-true             Rule-false
Class-true  True positives (TP)   False negatives (FN)
Class-false False positives (FP)  True negatives (TN)

Error = FP + FN;  FP = Σ_i FP(i),  FN = Σ_i FN(i)    (1)
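A minimal sketch of these counts, assuming a rule's 0/1 predictions have already been materialized as a list parallel to the labels:

def binary_error(preds, labels):
    # Equation 1: Error = FP + FN, counted over all cases.
    fp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 1)
    return fp + fn, fp, fn

print(binary_error([1, 1, 0, 0], [1, 0, 1, 0]))  # (2, 1, 1)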
For almost all applications, more than one rule is needed to achieve good predictive performance. In our lightweight approach, a solution consists of a set of an equal number of unweighted rules for each class. A new example is classified by picking the class having the most votes, that is, the class with the most satisfied rules. We are very democratic; each class has an equal number of rules and votes, and each rule is approximately the same size.
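A sketch of this unweighted voting, reusing the hypothetical rule_satisfied helper from the earlier sketch:

def classify(rule_sets, case):
    # rule_sets maps each class label to its equal-sized list of DNF rules;
    # the winner is the class with the most satisfied rules.
    votes = {c: sum(rule_satisfied(r, case) for r in rules)
             for c, rules in rule_sets.items()}
    return max(votes, key=votes.get)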
The principal remaining task is to describe a method for inducing rules from data. So far we have given a brief description of binary classification. Yet, this form of binary classification is at the heart of the rule induction algorithm.
Let's continue to consider binary classification. The most trivial method for rule induction is to grow a conjunctive term of a rule by the greedy addition of a single condition that minimizes error. To ensure that a term is always added (when error is nonzero), we can define a slightly modified measure, Err1, in Equation 2. Error is computed over candidate conditions where TP is greater than zero. If no added condition adds a true positive, the cost of a false negative error is doubled and the minimum-cost solution is found. The cost of a false positive remains at 1. The minimum Err1 is readily computed during sequential search using the bound of the current best Err1 value.
Err1 = FP + k·FN, where k = 1, 2, 4, ... and TP > 0    (2)

frq(i) = 1 + e(i)^3    (3)

FP = Σ_i FP(i)·frq(i);  FN = Σ_i FN(i)·frq(i)    (4)
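A sketch of Equations 2 and 4 together, assuming predictions, labels, and integer case frequencies frq are parallel lists; the surrounding search that doubles k when no candidate with TP > 0 qualifies is omitted here:

def weighted_err1(preds, labels, frq, k=1):
    # Err1 = FP + k*FN (Equation 2), with FP and FN weighted by the case
    # frequencies frq(i) as in Equation 4. Integer arithmetic throughout.
    fp = sum(f for p, y, f in zip(preds, labels, frq) if p == 1 and y == 0)
    fn = sum(f for p, y, f in zip(preds, labels, frq) if p == 0 and y == 1)
    return fp + k * fn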
The lightweight method is adaptive, and follows the well-known principle embodied in boosting: give greater representation to erroneously classified cases. The technique for weighting cases during training is greatly simplified from the usual boosting methods. Analogous to [Breiman, 1996], no weights are used in the induced solution. Weighting of cases during sampling follows a simple method: let e(i) be the cumulative number of errors for case i over all rules. It is computed by applying all prior induced rules and summing the errors for a case. The weighting given to a case during induction is an integer value, representing a relative frequency of that case in the new sample. Equation 3 is the frequency that is used. It has good empirical support, having had the best reported results on an important text-mining benchmark [Weiss et al., 1999], and was first described in [Weiss and Indurkhya, 1998]. Thus, if 10 rules have been generated, 4 of them erroneous on case i, then case i is treated as if it appeared in the sample 1 + 4^3 = 65 times. Based on prior experience, alternative functions to Equation 3 may also perform well. Unlike the results of [Bauer and Kohavi, 1999] for the alternative of [Breiman, 1996], Equation 3 performs well with or without random resampling, and the LRI algorithm uses no random resampling. The computation of FP and FN during training is modified slightly to follow Equation 4. Err1 is computed by simple integer addition. In practice, we use only 33 different values of e(i), ranging from 0 to 32. Whenever the number of cumulative errors exceeds 32, all cumulative errors are normalized by an integer division of 2.
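A sketch of this bookkeeping, where e is the list of cumulative error counts over all rules induced so far:

def update_weights(e):
    # Keep e(i) in the range 0..32: halve all counts (integer division)
    # whenever any count exceeds 32, then apply Equation 3.
    if max(e) > 32:
        e = [c // 2 for c in e]
    return e, [1 + c ** 3 for c in e]

e, frq = update_weights([0, 4, 1])
print(frq)  # [1, 65, 2]: a case misclassified by 4 rules is sampled 65 times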
The training algorithm for inducing a DNF rule R is given in Figure 1. The algorithm is repeated sequentially for the desired number of rules. Rules are always induced for binary classification, class versus not-class. An m-class classification problem is handled by mapping it to m binary classification problems, one for each class. Each of the binary classification problems can be computed independently and in parallel. As we shall see in Section 3, the equality of voting and rule size makes the predictive performance of rules induced from multiple binary classification problems quite comparable.
1. Grow conjunctive term T until the maximum length (or until FN = 0) by greedily adding conditions that minimize Err1.
2. Record T as the next disjunct for rule R. If fewer than the maximum number of disjuncts (and FN > 0), remove cases covered by T, and continue with step 1.
3. Evaluate the induced rule R on all training cases i and update e(i), the cumulative number of errors for case i.

Fig. 1. Lightweight Rule Induction Algorithm
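The following self-contained Python sketch implements steps 1 and 2 of Figure 1 for binary 0/1 features and labels. It is a simplification, not the authors' implementation: Err1 is used with k fixed at 1 rather than doubled, and step 3 (updating e(i)) is left to the caller:

def grow_rule(X, y, frq, max_len=5, max_disjuncts=2):
    # Induce one DNF rule for the positive class; a condition is a
    # (feature, value) pair tested for equality on 0/1 features.
    def holds(term, x):
        return all(x[f] == v for f, v in term)

    def score(term, idx):
        fp = fn = tp = 0
        for i in idx:
            if holds(term, X[i]):
                if y[i]:
                    tp += frq[i]
                else:
                    fp += frq[i]
            elif y[i]:
                fn += frq[i]
        return fp + fn, tp, fn

    rule, idx = [], list(range(len(X)))
    for _ in range(max_disjuncts):
        term = []
        while len(term) < max_len:                       # step 1: grow term T
            cands = [term + [(f, v)]
                     for f in range(len(X[0])) for v in (0, 1)
                     if (f, v) not in term and (f, 1 - v) not in term]
            scored = [(score(t, idx), t) for t in cands]
            scored = [(s, t) for s, t in scored if s[1] > 0]   # require TP > 0
            if not scored:
                break
            (err, _, fn), term = min(scored, key=lambda st: st[0][0])
            if fn == 0:                                  # all positives covered
                break
        if not term:
            break
        rule.append(term)                                # step 2: record disjunct
        idx = [i for i in idx if not holds(term, X[i])]  # remove covered cases
        if not any(y[i] for i in idx):                   # FN == 0: stop
            break
    return rule

X = [[1, 0], [1, 1], [0, 1], [0, 0]]
y = [1, 1, 0, 0]
print(grow_rule(X, y, frq=[1, 1, 1, 1], max_len=2))  # [[(0, 1)]]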
A pure DNF rule induction system has strong capabilities for handling missing values. Disjunction can produce overlap and redundancy. If we apply a rule to a case, and a term is not satisfied because one of its conditions has a missing value, the rule may still be satisfied by one of the other disjuncts of the rule. These rules have no special conditions referring to missing values; they look no different from rules induced from data with no missing values. How is this accomplished? For the application of rules, a term is considered not satisfied when a missing value is encountered in a case. During training, the following slight modifications are made to the induction procedures:
- When looping to find the best attribute condition, skip cases with missing values.
- Normalize error to a base relative to the frequency of all cases.

Norm_k = Σ_{all n} frq(n) / Σ_{i w/o missing vals} frq(i)    (5)

FP_k = Norm_k · Σ_i FP(i)·frq(i)    (6)

FN_k = Norm_k · Σ_i FN(i)·frq(i)    (7)
Each feature may have a variable number of missing values. The normalization factor for feature k is computed as in Equation 5: the total frequency of all n cases, including those with missing values, divided by the total frequency of the cases without missing values. False positives and false negatives are computed as in Equations 6 and 7, a straightforward normalization of Equation 4.
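A sketch of both modifications, assuming missing values are encoded as None and that missing is a 0/1 mask marking the cases that lack a value for the feature k under consideration:

def holds_with_missing(term, x):
    # At application time a conjunctive term is simply not satisfied
    # when any value it tests is missing (None); no special rule format.
    return all(x[f] is not None and x[f] == v for f, v in term)

def normalized_counts(preds, labels, frq, missing):
    # Equations 5-7: skip cases missing feature k, then scale the
    # frequency-weighted FP and FN back up by Norm_k.
    present = [i for i in range(len(labels)) if not missing[i]]
    norm = sum(frq) / sum(frq[i] for i in present)            # Equation 5
    fp = norm * sum(frq[i] for i in present
                    if preds[i] == 1 and labels[i] == 0)      # Equation 6
    fn = norm * sum(frq[i] for i in present
                    if preds[i] == 0 and labels[i] == 1)      # Equation 7
    return fp, fn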
To select the solution with the best predictive performance, decisions must be made about the two key measures of rule size: (a) conjunctive term length and (b) the number of disjuncts. For data mining of large samples, the best solution can be found by using an independent test set for estimating true error. If only a single complexity measure is varied during training, such as the number of disjuncts, then the estimated error rates for comparing the different solutions using only one independent test set are nearly unbiased [Breiman et al., 1984].
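For instance, the selection loop can be as simple as the sketch below; train_lri and test_error are hypothetical wrappers around the procedures sketched earlier, not functions from the paper:

# Vary only the number of disjuncts and keep the solution whose error
# on the independent, large test set is lowest.
errors = {d: test_error(train_lri(train_data, n_disjuncts=d), test_data)
          for d in (1, 2, 4, 8, 16)}
best_d = min(errors, key=errors.get)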
3 Results
Before we present experimental results for lightweight rule induction, let's consider our real-world experience in an important data mining application: the
detection of patterns in survey data. IBM, like many companies, surveys the
marketplace trying to gauge customer attitudes. In the case of IBM, thousands
of IT professionals are surveyed about their buying intentions and their view of
IBM products and the products of competitors. Survey data may be collected
every quarter or perhaps as frequently as every week. For some recent period,
such as the most recent quarter, the survey data are grouped into a sample. The
sample can be mined, and purchasing patterns that are detected can potentially
be of great value for marketing. The actual survey questions number in the many
hundreds. Not all questions are asked of every respondent; records contain many
missing values.
What might be interesting questions? In practice, it's relatively easy to specify critical classification problems, for example, "can we distinguish those people who intend to increase purchases of IBM equipment from those who do not?" With hundreds of features and relatively difficult goals for discrimination, solutions of high complexity are likely when standard methods are used to find a minimum-error solution. Such solutions would not be acceptable to the marketers who make recommendations and take actions. In our case, the lightweight approach has an effective means of bounding the complexity of solutions. We can trade off complexity and somewhat stronger predictive performance for clarity of interpretation. While it may seem severe, rules induced from one survey were restricted in size to no more than two terms with no disjunction and three rules for each of two classes. These simplified rules perform somewhat worse than more complex and larger rule sets, but in this application, interpretability far outweighs raw predictive performance. Moreover, although the survey data are riddled with missing values, the solutions, posed in the form of decision rules, extract the essential patterns without ever mentioning missing values.
To evaluate formally the performance of lightweight rule induction, datasets from the UCI repository [Blake et al., 1999] were processed. Table 2 summarizes the characteristics of these data. The number of features counts numerical features and categorical variables decomposed into binary features. Because the objective is data mining, we selected datasets having relatively large numbers of training cases and designated test sets. These datasets have no missing values, allowing us to set a baseline performance. Missing values were simulated by using a random number generator to delete an expected percentage of values from every feature.
Table 2. Data Characteristics

Name      Train  Test  Features Classes
coding     5000 15000    60        2
digit      7291  2007   256       10
letter    16000  4000    16       26
move       1483  1546    76        2
satellite  4435  2000    36        6
wave       5000  5000    40        3
LRI has several design parameters that affect results: (a) the number of rules per class, (b) the maximum length of a rule, and (c) the maximum number of disjuncts. For all of our experiments, we set the length of rules to 5 conditions. For most applications, increasing the number of rules increases predictive performance until a plateau is reached. The critical parameter is the number of disjuncts. We varied the number of disjuncts in each rule over 1, 2, 4, 8, and 16, where 1 is a rule with a single conjunctive term.
Table 3 summarizes the results for varying percentages of missing values in the training cases and the original test cases. Solutions were found of different complexities over the training cases. After training was completed, error was measured on the large test set. The error listed is for the solution with the minimum test error. Also included are the results for a binary decision tree trained on data having no missing values. After pruning at various levels of complexity, the minimum-test-error tree was selected. Table 4 lists the results for the same training cases and the same induced rules, but the test cases also have the same percentage of missing values. Figure 2 plots the change in performance of the rules as the percentage of missing values increases in the training set only or in both the training and test sets. Because the test error was used for finding the minimum-error results for both the tree and the rules, the results are somewhat optimistic. Still, for data mining applications, this procedure is quite reasonable [Breiman et al., 1984]. The standard error for any of these applications is 1% or less.
4 Discussion
Lightweight Rule Induction has a very simple representation: pure DNF rules for each class. It is egalitarian: each class has the same number of rules, each of approximately the same size. Scoring is trivial to understand: the class with the most satisfied rules wins.
Table 3. Error for Varying Percentages of Missing Values in Training Set

pct missing  move  wave  satellite coding letter digit
0 (tree)    0.255 0.231   0.146    0.337  0.134  0.154
0           0.195 0.142   0.092    0.246  0.039  0.059
5           0.227 0.141   0.095    0.255  0.055  0.060
10          0.247 0.143   0.101    0.270  0.068  0.065
20          0.269 0.141   0.112    0.273  0.127  0.076
25          0.286 0.147   0.122    0.278  0.153  0.082
50          0.398 0.183   0.169    0.314  0.246  0.139
75          0.476 0.246   0.210    0.404  0.444  0.225
Table 4. Error for Varying Percentages of Missing Values in Training and Test Set

pct missing  move   wave  satellite coding letter digit
0           0.1947 0.1422  0.0910   0.2461 0.0395 0.0593
5           0.2516 0.1498  0.0945   0.2605 0.0742 0.0673
10          0.2898 0.1530  0.0980   0.2692 0.1117 0.0698
20          0.3396 0.1762  0.1220   0.2892 0.2010 0.0837
25          0.3364 0.1920  0.1345   0.2963 0.2525 0.0927
50          0.3946 0.2996  0.2355   0.3629 0.5583 0.2123
75          0.4877 0.5818  0.5425   0.4556 0.8562 0.6178
The method is about as simple as any rule induction method can be. The algorithm is rudimentary, and our C code implementation is less than 300 lines. It produces designer rules, where the sizes of the rules are specified by the application designer.
The central question in Section 3 is: how well does LRI do on practical applications? For best predictive performance, a number of parameters must be selected prior to running. We have concentrated on data mining applications where it can be expected that sufficient test cases are available for easy estimation. Thus, we have included results that describe the minimum test error. With big data, it's easy to obtain more than one test sample, and for estimating a single variable, a large single test set is adequate in practice [Breiman et al., 1984]. For purposes of experimentation, we fixed almost all parameters, except for the maximum number of disjuncts and the number of rules. The number of disjuncts is clearly on the critical path to higher performance. As already shown for boosting and all forms of adaptive resampling, most of the gains in performance are achieved with the initial smaller set of classifiers.
The results on these applications demonstrate the strength of this method when applied to missing values. Our simulations show much stronger performance than the tree method even when large numbers of data fields are deleted. The effects of randomly generated missing values are likely to be much more drastic than those encountered for real-world data. Here, data fields are uniformly destroyed, whereas in real-world data the missing values are likely to be distributed among a subset of fields, hopefully the weaker features.

Fig. 2. Performance on Digit Data with Missing Values (error rate versus percentage of missing values, for the tree baseline, for rules with missing values in the training set only, and for rules with missing values in both training and test sets)
That these rules can perform relatively well on data with missing values should not be surprising. DNF rules have ample opportunity for overlap and redundancy. When combined with voted solutions, the inherent overlap is enhanced. Of greatest significance is that the format of these rules is no different from that of rules for data without missing values.
Most practitioners of data mining dread the extra complexity of missing values in data. It may require much extra analysis and many adjustments to data and methods. Lightweight rule induction offers a new approach that may reduce these tedious tasks while still providing a high-performance and interpretable solution.
References

[Bauer and Kohavi, 1999] E. Bauer and R. Kohavi. An empirical comparison of voting classification algorithms: Bagging, boosting and variants. Machine Learning, 36(1):105-139, 1999.
[Blake et al., 1999] C. Blake, E. Keogh, and C. Merz. UCI repository of machine learning databases. Technical report, University of California, Irvine, 1999. www.ics.uci.edu/~mlearn/MLRepository.html.
[Breiman et al., 1984] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Monterey, CA, 1984.
[Breiman, 1996] L. Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[Cohen and Singer, 1999] W. Cohen and Y. Singer. A simple, fast, and effective rule learner. In Proceedings of the Annual Conference of the American Association for Artificial Intelligence, pages 335-342, 1999.
[Cohen, 1995] W. Cohen. Fast effective rule induction. In Proceedings of the Twelfth International Conference on Machine Learning, pages 115-123, 1995.
[Friedman et al., 1998] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Stanford University Statistics Department, 1998. www-stat.stanford.edu/~tibs.
[Pyle, 1999] D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, San Francisco, 1999.
[Quinlan, 1989] J. Quinlan. Unknown attribute values in induction. In International Workshop on Machine Learning, pages 164-168, Ithaca, NY, 1989.
[Schapire, 1999] R. Schapire. A brief introduction to boosting. In Proceedings of the International Joint Conference on Artificial Intelligence, pages 1401-1405, 1999.
[Weiss and Indurkhya, 1993] S. Weiss and N. Indurkhya. Optimized rule induction. IEEE Expert, 8(6):61-69, December 1993.
[Weiss and Indurkhya, 1998] S. Weiss and N. Indurkhya. Predictive Data Mining: A Practical Guide. Morgan Kaufmann, 1998. DMSK Software: www.data-miner.com.
[Weiss et al., 1999] S. Weiss, C. Apte, F. Damerau, et al. Maximizing text-mining performance. IEEE Intelligent Systems, 14(4):63-69, 1999.