Thresholding for Making Classifiers Cost-sensitive

Victor S. Sheng, Charles X. Ling
Department of Computer Science, The University of Western Ontario, London, Ontario N6A 5B7, Canada
{ssheng, cling}@csd.uwo.ca

Abstract

In this paper we propose a very simple, yet general and effective, method to make any cost-insensitive classifier (that can produce probability estimates) cost-sensitive. The method, called Thresholding, selects a proper threshold from the training instances according to the misclassification costs. Like other cost-sensitive meta-learning methods, Thresholding can convert any existing (and future) cost-insensitive learning algorithms and techniques into cost-sensitive ones. However, compared with the existing cost-sensitive meta-learning methods and with the direct use of the theoretical threshold, Thresholding almost always produces the lowest misclassification cost. Experiments also show that Thresholding is the least sensitive to the misclassification cost ratio. Thus, it is recommended when the difference between misclassification costs is large.

Introduction

Classification is a primary task of inductive learning in machine learning. Many effective inductive learning techniques have been developed, such as naïve Bayes, decision trees, and neural networks. However, most classification algorithms ignore the differences between misclassification errors; they implicitly assume that all misclassification errors cost equally. In many real-world applications, this assumption is not true. For example, in medical diagnosis, missing a cancer diagnosis (a false negative) is much more serious than the opposite error (a false positive); the patient could lose his or her life because of the delayed treatment. In many real-world applications, the differences between misclassification errors can be quite large.

Cost-sensitive learning (Turney, 1995, 2000; Elkan, 2001; Zadrozny and Elkan, 2001; Lizotte et al., 2003; Ting, 1998) has received much attention in recent years to deal with this issue. Many approaches for dealing with different misclassification costs have been proposed, and they can be categorized into two groups. One is to design cost-sensitive learning algorithms directly (Turney, 1995; Drummond and Holte, 2000). The other is to design a wrapper that converts existing cost-insensitive base learning algorithms into cost-sensitive ones; the wrapper approach is also called cost-sensitive meta-learning. Section 2 provides a more detailed review of cost-sensitive meta-learning approaches, such as relabeling (Domingos, 1999; Witten and Frank, 2005), sampling (Zadrozny et al., 2003), and weighting (Ting, 1998). Cost-sensitive meta-learning methods are useful because they allow us to reuse existing base learning algorithms and their related improvements.

Thresholding is another cost-sensitive meta-learning method, and it is applicable to any classifier that can produce probability estimates on training and test examples. Almost all classifiers (such as decision trees, naïve Bayes, and neural networks) can produce such estimates. Thresholding is very simple: it selects the probability that minimizes the total misclassification cost on the training instances as the threshold for predicting test instances. However, we will show that Thresholding is highly effective. It outperforms previous cost-sensitive meta-learning methods, and even the theoretical threshold, on almost all datasets. It is also the least sensitive method when the difference between misclassification costs is high.

In the next section, we give an overview of previous work on cost-sensitive meta-learning, particularly MetaCost (Domingos, 1999) and Weighting (Ting, 1998). Section 3 describes Thresholding, which can convert any cost-insensitive classifier into a cost-sensitive one. The empirical evaluation is presented in Section 4, which is followed by conclusions in the last section.
Review of Previous Work

Cost-sensitive meta-learning converts existing cost-insensitive base learning algorithms into cost-sensitive ones without modifying them. Thus, it can be regarded as a middleware component that pre-processes the training data, or post-processes the output, of cost-insensitive learning algorithms. Cost-sensitive meta-learning techniques can be classified into two main categories, sampling and non-sampling, according to whether the distribution of the training data is modified to reflect the misclassification costs. Costing (Zadrozny et al., 2003) belongs to the sampling category. This paper focuses on the non-sampling cost-sensitive meta-learning approaches, which can be further classified into three subcategories: relabeling, weighting, and threshold adjusting, described below.

The first subcategory relabels the classes of instances by applying the minimum expected cost criterion (Michie, Spiegelhalter, and Taylor, 1994). Relabeling can be further divided into two branches: relabeling the training instances and relabeling the test instances. MetaCost (Domingos, 1999) belongs to the former, and CostSensitiveClassifier (CSC) (Witten and Frank, 2005) belongs to the latter.

Weighting (Ting, 1998) assigns a weight to each instance according to its class and the misclassification costs, so that the learning algorithm favors the class with the high weight/cost.

The third subcategory is threshold adjusting. Thresholding belongs to this category: it searches for the best probability to use as a threshold for future predictions. We provide a detailed description of it in Section 3. In Section 4, we compare it with the other non-sampling methods, relabeling and weighting.

In (Elkan, 2001), the theoretical threshold for making an optimal decision on classifying instances into the positive class is obtained as

    T = C(1,0) / (C(1,0) + C(0,1))    (2)

where C(i,j) is the misclassification cost of classifying an instance belonging to class j into class i; that is, C(1,0) is the cost of a false positive and C(0,1) is the cost of a false negative. In this paper, we assume that there is no cost for a true positive or a true negative, i.e., C(0,0) = C(1,1) = 0. (Elkan, 2001) further discusses how to use this formula to rebalance the training instances (e.g., via sampling) to turn cost-insensitive classifiers into cost-sensitive ones. In a later section, we will show that Thresholding, which searches for the best threshold, surprisingly outperforms the direct use of the theoretical threshold defined in (2).
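As a concrete illustration (our own example, not from the paper), the sketch below computes the theoretical threshold of (2) from a pair of misclassification costs; the function name and the example costs are assumptions made here for clarity.

```python
def theoretical_threshold(fp_cost, fn_cost):
    """Optimal probability threshold (Elkan, 2001), assuming correct
    classifications cost nothing: predict positive when the estimated
    probability of the positive class is at least this value."""
    return fp_cost / (fp_cost + fn_cost)

# Example: if a false negative is four times as costly as a false positive,
# the optimal threshold drops well below the usual 0.5.
print(theoretical_threshold(fp_cost=1.0, fn_cost=4.0))  # 0.2
```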
Thresholding

As discussed in the Introduction, almost all classification methods can produce probability estimates on instances (both training and test instances). Thresholding simply finds the best probability from the training instances to use as the threshold, and uses it to predict the class labels of test instances: a test example whose predicted probability is greater than or equal to this threshold is predicted as positive; otherwise it is predicted as negative. Thus, for a given threshold, the total misclassification cost for a set of examples can be calculated, and this cost (MC) is a function of the threshold (T); that is, MC = f(T). The curve of this function can be obtained by computing the misclassification cost at each possible threshold. In practice, we only need to compute the misclassification cost at each distinct probability estimate on the training examples. With this curve, Thresholding can simply choose the threshold that minimizes the total misclassification cost, with the following two refinements for tie breaking and overfitting avoidance.

There are in general three types of curves for the function MC = f(T), as shown in Figure 1. Figure 1(a) shows a total-cost curve with a single global minimum. This is the ideal case. In practice, however, there may be local minima in the curve MC = f(T), as shown in Figures 1(b) and 1(c). Figure 1(b) shows a case with multiple local minima, one of which is smaller than all the others. In both cases ((a) and (b)) it is straightforward for Thresholding to select the threshold with the minimal total cost. Figure 1(c) shows a case with two or more local minima of the same value. We have designed a heuristic to resolve the tie: we select the local minimum whose surrounding hills are less steep on average; in other words, we select the local minimum whose "valley" has the wider span. The rationale behind this tie-breaking heuristic is that we prefer a local minimum that is less sensitive to small changes in the chosen threshold. For the case shown in Figure 1(c), the span of the right "valley" is greater than that of the left, so T2 is chosen as the best threshold.

Figure 1. Typical curves for the total misclassification cost as a function of the threshold T: (a) a single global minimum; (b) multiple local minima with a unique global minimum; (c) two local minima, T1 and T2, with the same total cost.

The other refinement is overfitting avoidance. Overfitting can occur if the threshold is obtained directly from the training instances: the best threshold obtained directly from the training instances may not generalize well to the test instances. To reduce overfitting, Thresholding searches for the best probability to use as the threshold on validation sets. More specifically, an m-fold cross-validation is applied, and the base learning algorithm predicts probability estimates on the validation sets. After this, the probability estimate of each training example has been obtained (from the fold in which it was in the validation set). Thresholding then simply picks the threshold that yields the minimum total misclassification cost (with the tie-breaking heuristic described earlier), and uses it for the test instances. Note that the test instances are not used in searching for the best threshold.
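The following is a minimal Python sketch of the procedure just described; it is our own illustration, not the authors' WEKA implementation. It assumes a scikit-learn-style classifier, NumPy arrays with binary labels coded 0/1, out-of-fold probability estimates as the candidate thresholds, and a simple span-based approximation of the paper's "wider valley" tie-breaking rule. The names `find_best_threshold` and `total_cost` are ours.

```python
import numpy as np
from sklearn.model_selection import cross_val_predict

def total_cost(y_true, prob_pos, threshold, fp_cost, fn_cost):
    """Total misclassification cost MC = f(T) for one candidate threshold."""
    pred_pos = prob_pos >= threshold
    fp = np.sum(pred_pos & (y_true == 0))
    fn = np.sum(~pred_pos & (y_true == 1))
    return fp * fp_cost + fn * fn_cost

def find_best_threshold(clf, X, y, fp_cost, fn_cost, folds=10):
    """Thresholding: pick the probability that minimizes the total cost
    on out-of-fold (validation) probability estimates of the training data."""
    # Out-of-fold probability estimates, as in the paper's m-fold cross-validation.
    prob_pos = cross_val_predict(clf, X, y, cv=folds, method="predict_proba")[:, 1]
    candidates = np.unique(prob_pos)  # sorted distinct probability estimates
    costs = np.array([total_cost(y, prob_pos, t, fp_cost, fn_cost)
                      for t in candidates])
    best = np.flatnonzero(costs == costs.min())
    if len(best) == 1:
        return candidates[best[0]]
    # Tie breaking (rough stand-in for the paper's "wider valley" heuristic):
    # prefer the tied threshold whose neighboring candidates are furthest away,
    # i.e. the one least sensitive to small changes of the threshold.
    padded = np.concatenate(([0.0], candidates, [1.0]))
    spans = padded[2:] - padded[:-2]  # width of the gap around each candidate
    return candidates[best[np.argmax(spans[best])]]

# Usage (illustrative costs, names such as X_train are placeholders):
# clf = sklearn.tree.DecisionTreeClassifier()
# t = find_best_threshold(clf, X_train, y_train, fp_cost=1.0, fn_cost=5.0)
# y_pred = (clf.fit(X_train, y_train).predict_proba(X_test)[:, 1] >= t).astype(int)
```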
Empirical Evaluation

To compare Thresholding with the existing methods, we choose 11 real-world datasets and one artificial dataset (Monks-Problems-3), listed in Table 1, from the UCI Machine Learning Repository (Blake and Merz, 1998). These datasets are chosen because they have two classes, have at least some discrete attributes, and have a good number of instances. In all experiments, we use 10-fold cross-validation in Thresholding.

Table 1. Twelve datasets used in the experiments, where Monks-P3 represents the dataset Monks-Problems-3.

Dataset        No. of attributes  No. of instances  Class dist. (N/P)  Cost ratio (FP/FN)
Breast-cancer  10                 286               201/85             85/201
Breast-w       10                 699               458/241            241/458
Car            7                  1728              1210/518           518/1210
Credit-g       21                 1000              700/300            300/700
Diabetes       9                  768               500/268            268/500
Hepatitis      20                 155               32/123             123/32
Kr-vs-kp       37                 3196              1669/1527          1527/1669
Monks-P3       7                  554               266/288            288/266
Sick           30                 3772              3541/231           231/3541
Spect          23                 267               55/212             212/55
Spectf         45                 349               95/254             254/95
Tic-tac-toe    10                 958               332/626            626/332

Comparing with Other Meta-Learning Methods

We choose C4.5 (Quinlan, 1993) as the base learning algorithm. We first conduct experiments to compare the performance of Thresholding with the existing meta-learning cost-sensitive methods: MetaCost, CSC, and Weighting in CostSensitiveClassifier. Many researchers (Bauer and Kohavi, 1999; Domingos, 1999; Buhlmann and Yu, 2003; Zadrozny et al., 2003) have shown that Bagging (Breiman, 1996) can reliably improve base classifiers. As bagging is already applied in MetaCost, we also apply bagging (with different numbers of bagging iterations) to Thresholding and CostSensitiveClassifier. We implement Thresholding in the popular machine learning toolbox WEKA (Witten and Frank, 2005). As MetaCost and CostSensitiveClassifier are already implemented in WEKA, we use these implementations directly in our experiments.

As misclassification costs are not available for the datasets in the UCI Machine Learning Repository, we reasonably assign the cost of misclassifying an instance of one class to be roughly the number of instances of the opposite class (the last column of Table 1). This way, the rare class is more expensive to predict incorrectly, which is normally the case in real-world applications. This setting can also reduce the potential effect of the class distribution, because the performance of MetaCost, CSC, and Weighting in CostSensitiveClassifier may be affected by base learners that implicitly make decisions based on the threshold 0.5. Later we will also set misclassification costs that are independent of the number of examples.

The experimental results, shown in Figure 3, are presented in terms of the average total cost over 10 runs of ten-fold cross-validation, applied to all the methods. This is the external cross-validation for Thresholding. Note that Thresholding also has an internal cross-validation (the m-fold cross-validation described in Section 3), which is used only to search for the proper threshold from the training set. Figure 2 shows the experiment process for Thresholding.

1. Apply 10-fold cross-validation; that is, sample 90% of the data for training, and use the rest (non-overlapping) for testing.
   a. Apply 10-fold cross-validation on the training data to find the proper threshold:
      i. Apply the base learner on the internal training set.
      ii. Predict probability estimates on the validation set.
   b. Find the best threshold based on the predicted probabilities.
   c. Classify the examples in the test set with the best threshold found in step 1(b).
2. Obtain the average total cost.

Figure 2. The experiment process of Thresholding.
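A hedged sketch of this external/internal evaluation loop, reusing the hypothetical `find_best_threshold` and `total_cost` helpers from the earlier sketch (our names, not the paper's WEKA code); it assumes NumPy arrays and a scikit-learn decision tree as a stand-in for C4.5.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

def evaluate_thresholding(X, y, fp_cost, fn_cost, outer_folds=10):
    """External 10-fold CV around Thresholding (Figure 2): the threshold is
    searched only on each outer training split, never on the test split."""
    fold_costs = []
    for train_idx, test_idx in StratifiedKFold(n_splits=outer_folds).split(X, y):
        clf = DecisionTreeClassifier()  # stand-in for C4.5
        # Internal cross-validation on the outer training split (steps 1a-1b).
        t = find_best_threshold(clf, X[train_idx], y[train_idx], fp_cost, fn_cost)
        clf.fit(X[train_idx], y[train_idx])
        prob_pos = clf.predict_proba(X[test_idx])[:, 1]
        # Step 1c: classify the outer test split with the learned threshold.
        fold_costs.append(total_cost(y[test_idx], prob_pos, t, fp_cost, fn_cost))
    return np.sum(fold_costs)  # total cost over the ten outer folds
```

The paper reports this quantity averaged over 10 such runs of the outer cross-validation.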
Figure 3. Comparing Thresholding with other meta-learning approaches on the twelve datasets; each panel plots the total cost against the number of bagging iterations (1, 10, 100, and 1000) for MetaCost, CSC, Weighting, and Thresholding. The lower the total cost, the better.

In Figure 3, the vertical axis represents the total misclassification cost, and the horizontal axis represents the number of bagging iterations. We summarize the experimental results in Table 2.

Table 2. Summary of the experimental results. An entry w/t/l means that the approach in the corresponding row wins in w datasets, ties in t datasets, and loses in l datasets, compared with the approach in the corresponding column.

               MetaCost   CSC      Weighting
CSC            7/1/4
Weighting      9/0/3      10/1/1
Thresholding   9/1/2      9/1/2    6/1/5

Note: as each curve has four points, we say that curve A wins over curve B if at least three of A's points are lower than the corresponding points of B; A ties with B if two of A's points are lower and the other two are higher; otherwise, A loses to B.

We can draw the following interesting conclusions from the results shown in Figure 3 and Table 2. First of all, MetaCost almost always performs worse than the other meta-learning algorithms. MetaCost may overfit the data, as it uses the same learning algorithm to build the final model as the one used to relabel the training examples. Bagging does improve its performance on all datasets tested, particularly within the first 10 iterations, but the improvements are not as significant as when bagging is applied to the other algorithms, particularly after 10 iterations. Second, CSC performs better than MetaCost in seven of the twelve datasets; in the other datasets, it is similar or worse. Third, overall, Weighting performs much better than MetaCost and CSC. Weighting performs worse than MetaCost in only three datasets (Car, Kr-vs-kp, and Monks-Problems-3); in the others, it outperforms MetaCost significantly. Compared with CSC, Weighting performs better in ten of the twelve datasets; in the remaining datasets, it is the same (Breast-w) or worse (Kr-vs-kp). Fourth, Thresholding outperforms MetaCost and CSC in nine of the twelve datasets each, and outperforms Weighting in six datasets; in the others, it is similar or worse. As with MetaCost, bagging does improve the performance of Thresholding, but not significantly. Without bagging (i.e., with the number of iterations equal to 1), Thresholding performs the best in nine of the twelve datasets. In all, we can conclude that Thresholding is the best, followed by Weighting, followed by CSC.

Both MetaCost and CSC belong to the relabeling category: the former relabels the training instances and the latter relabels the test instances. This leads us to believe that the relabeling approach is less satisfactory, and that Thresholding and Weighting are better meta-learning approaches. The experimental results in this section show that Thresholding is the best.
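A small hedged sketch of the win/tie/loss rule from the note under Table 2 (the same rule is reused for Tables 3 and 4); the function name and return labels are our own.

```python
def compare_curves(a, b):
    """Compare two total-cost curves sampled at the same four x-values.
    Returns 'win' if at least three of a's points are below b's,
    'tie' if exactly two are below and two above, otherwise 'loss'."""
    lower = sum(x < y for x, y in zip(a, b))
    higher = sum(x > y for x, y in zip(a, b))
    if lower >= 3:
        return "win"
    if lower == 2 and higher == 2:
        return "tie"
    return "loss"

# Made-up costs at bagging iterations 1, 10, 100, 1000:
print(compare_curves([9000, 8700, 8600, 8550], [9400, 9100, 8500, 8400]))  # 'tie'
```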
Sensitivity to Cost Ratios

In the last section, we compared the performance of the meta-learning methods under specific misclassification costs. In this section, we evaluate the sensitivity of these meta-learning methods under cost ratios of 2:1, 5:1, 10:1, and 20:1 between the false positive and the false negative. These cost ratios are independent of the numbers of positive and negative instances. Bagging (with 10 iterations) is still applied in all methods.

The results, shown in Figure 4, are presented in terms of the average total cost (in units; we set the false negative misclassification cost as one unit) over ten-fold cross-validation. The vertical axis represents the total cost, and the horizontal axis represents the cost ratio. We summarize the results in Table 3.

Figure 4. Total cost under different cost ratios; each panel plots the total cost of MetaCost, CSC, Weighting, and Thresholding against the cost ratios 2:1, 5:1, 10:1, and 20:1 for one of the twelve datasets.

Table 3. Summary of the experimental results (Figure 4). The definition of an entry w/t/l is the same as in Table 2.

               MetaCost   CSC     Weighting
CSC            2/0/10
Weighting      5/3/4      7/2/3
Thresholding   6/3/3      9/2/1   7/2/3

From the results in Figure 4 and Table 3, we can draw the following conclusions. First, the relative ranking of the meta-learning methods remains the same: Thresholding is the best, followed by Weighting. However, MetaCost is now much better than CSC, which shows that post-relabeling (CSC) becomes worse as the cost ratio increases. Thresholding outperforms all the other methods at most cost ratios in seven of the twelve datasets tested. Second, overall, the total misclassification cost increases with the cost ratio. This is expected, as the weighted sum of false positives and false negatives increases when the cost ratio increases. Another interesting observation is that each method has a different sensitivity to the increase of the cost ratio. The sensitivity is reflected by how quickly the total misclassification cost grows as the cost ratio increases; the more slowly it grows, the better. We can see from Figure 4 that the increase in the total cost of CSC is almost always the greatest, followed by MetaCost. MetaCost is similar to Weighting in only three datasets (Breast-w, Credit-g, and Spect), and Weighting outperforms MetaCost in five of the remaining datasets. Thresholding is again the best (i.e., has the slowest increase) in six of the twelve datasets tested. It performs better than MetaCost in six datasets, better than CSC in nine datasets, and better than Weighting in seven datasets. Except for two datasets (Credit-g and Diabetes), Thresholding is one of the best methods on the rest of the datasets. In all, we can conclude that CSC is the most sensitive to the increase of the cost ratio, followed by MetaCost, and then by Weighting. Thresholding is the most resistant (the best) to increases in the cost ratio. Thus, when the cost ratio is large, it is recommended over the other methods.
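For clarity, a minimal sketch of how a total cost "in units" can be computed under this setup; the counts in the example are made up, and fixing the false negative cost at one unit while scaling the other error by the ratio is our reading of the paper's description.

```python
def total_cost_in_units(n_false_pos, n_false_neg, fp_cost, fn_cost=1.0):
    """Total misclassification cost when one error type is fixed at one unit
    and the other is scaled by the cost ratio."""
    return fp_cost * n_false_pos + fn_cost * n_false_neg

# Made-up counts: 12 false positives and 30 false negatives at a 10:1 ratio
# (the costlier error set to 10 units, the other to one unit).
print(total_cost_in_units(12, 30, fp_cost=10.0))  # 150.0
```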
Theoretical Threshold

Thresholding searches for the best threshold from the training instances; however, searching for the best threshold via cross-validation is time consuming. How does it compare with the theoretical threshold reviewed earlier? In this section, we compare Thresholding with the direct use of the theoretical threshold defined in (2). We conduct the same experiments as in the last subsection. The results are presented in Figure 5.
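A hedged sketch of this comparison, reusing the hypothetical helpers introduced earlier (`find_best_threshold`, `total_cost`, and `theoretical_threshold` are our names, not the authors' WEKA code); it evaluates both thresholds on the same held-out split and assumes NumPy arrays with 0/1 labels.

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def compare_thresholds(X, y, fp_cost, fn_cost):
    """Total cost of the searched threshold versus the theoretical threshold
    of equation (2) on a single held-out split (illustration only)."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, stratify=y)
    clf = DecisionTreeClassifier()  # stand-in for C4.5
    t_search = find_best_threshold(clf, X_tr, y_tr, fp_cost, fn_cost)
    t_theory = theoretical_threshold(fp_cost, fn_cost)
    prob_pos = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    return (total_cost(y_te, prob_pos, t_search, fp_cost, fn_cost),
            total_cost(y_te, prob_pos, t_theory, fp_cost, fn_cost))
```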
We can see that Thresholding clearly outperforms the theoretical threshold in nine of the twelve datasets; the two are exactly the same on the dataset Monks-Problems-3. In addition, Thresholding has better resistance (insensitivity) to large cost ratios, particularly on the datasets Breast-w, Car, Credit-g, Hepatitis, Kr-vs-kp, Spectf, Spect, and Tic-tac-toe. We can thus conclude that it is worth spending the time to search for the best threshold in Thresholding.

We summarize the comparisons between Figure 4 and Figure 5 in Table 4. We can see that the direct use of the theoretical threshold performs much better than CSC, although it ties with MetaCost and is worse than Weighting.

Table 4. Comparing the theoretical threshold with the other approaches. The definition of an entry w/t/l is the same as in Table 2.

              MetaCost   CSC     Weighting
Theoretical   6/0/6      9/2/1   5/0/7

Figure 5. Comparing Thresholding with the theoretical threshold; each panel plots the total cost of the two methods against the cost ratios 2:1, 5:1, 10:1, and 20:1 for one of the twelve datasets.

Conclusions and Future Work

Thresholding is a general method for making any cost-insensitive learning algorithm cost-sensitive. It is a simple yet direct approach, as it learns the best threshold from the training instances; the chosen threshold therefore reflects not only the different misclassification costs but also the data distribution. We were surprised by its good performance: our repeated experiments show that Thresholding outperforms the other existing cost-sensitive meta-learning methods, such as MetaCost, CSC, and Weighting, as well as the direct use of the theoretical threshold. Thresholding also has the best resistance (insensitivity) to large misclassification cost ratios. Thus, it is especially recommended when the difference between misclassification costs is large. In our future work, we plan to apply Thresholding to datasets with multiple classes.

Acknowledgements

We thank the anonymous reviewers for their useful comments and suggestions. The authors thank NSERC for the support of their research.

References

Bauer, E., and Kohavi, R. 1999. An empirical comparison of voting classification algorithms: bagging, boosting and variants. Machine Learning, 36(1/2):105-139.

Blake, C.L., and Merz, C.J. 1998. UCI Repository of Machine Learning Databases. Irvine, CA: University of California, Department of Information and Computer Science.

Breiman, L., Friedman, J.H., Olshen, R.A., and Stone, C.J. 1984. Classification and Regression Trees. Wadsworth, Belmont, CA.

Breiman, L. 1996. Bagging predictors. Machine Learning, 24:123-140.

Buhlmann, P., and Yu, B. 2003. Analyzing bagging. Annals of Statistics.

Chawla, N.V., Japkowicz, N., and Kolcz, A. eds. 2004. Special Issue on Learning from Imbalanced Datasets. SIGKDD Explorations, 6(1). ACM Press.

Domingos, P. 1999. MetaCost: A general method for making classifiers cost-sensitive. In Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, 155-164. ACM Press.

Drummond, C., and Holte, R. 2000. Exploiting the cost (in)sensitivity of decision tree splitting criteria. In Proceedings of the 17th International Conference on Machine Learning, 239-246.

Drummond, C., and Holte, R.C. 2003. C4.5, class imbalance, and cost sensitivity: Why under-sampling beats over-sampling. In Workshop on Learning from Imbalanced Datasets II.

Elkan, C. 2001. The foundations of cost-sensitive learning. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence, 973-978. Seattle, Washington: Morgan Kaufmann.
Lizotte, D., Madani, O., and Greiner, R. 2003. Budgeted learning of naïve-Bayes classifiers. In Proceedings of the Nineteenth Conference on Uncertainty in Artificial Intelligence. Acapulco, Mexico: Morgan Kaufmann.

Michie, D., Spiegelhalter, D.J., and Taylor, C.C. 1994. Machine Learning, Neural and Statistical Classification. Ellis Horwood Limited.

Quinlan, J.R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.

Ting, K.M. 1998. Inducing cost-sensitive trees via instance weighting. In Proceedings of the Second European Symposium on Principles of Data Mining and Knowledge Discovery, 23-26. Springer-Verlag.

Turney, P.D. 1995. Cost-sensitive classification: Empirical evaluation of a hybrid genetic decision tree induction algorithm. Journal of Artificial Intelligence Research 2:369-409.

Turney, P.D. 2000. Types of cost in inductive concept learning. In Proceedings of the Workshop on Cost-Sensitive Learning at the Seventeenth International Conference on Machine Learning, Stanford University, California.

Weiss, G., and Provost, F. 2003. Learning when training data are costly: The effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19:315-354.

Witten, I.H., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.

Zadrozny, B., Langford, J., and Abe, N. 2003. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the Third IEEE International Conference on Data Mining.

Zadrozny, B., and Elkan, C. 2001. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of the Seventh International Conference on Knowledge Discovery and Data Mining, 204-213.