Cost-Sensitive Bayesian Network Algorithm
Eman Nashnush
E.Nashnush1@edu.salford.ac.uk
University of Salford, Manchester, UK
Sponsor in Libya (Tripoli University)

Introduction:
Machine learning algorithms are becoming an increasingly important area of research and application in the fields of Artificial Intelligence and data mining. One of the most important algorithms is the Bayesian network, which has been widely used in real-world applications such as medical diagnosis, image recognition, fraud detection, and inference problems. In all of these applications, an evaluation measure such as accuracy is not enough, because each decision involves costs. For example, in a fraud detection application that predicts new cases, several costs are incurred when the classifier predicts a fraudulent case as a non-fraudulent case. In addition, fraud databases have an unbalanced class distribution, which is known to affect learning algorithms adversely. Therefore, this project develops a new algorithm that aims to minimize the costs associated with prediction, misclassification, imbalanced data, time, and tests.

In this work, we attempt to create a new cost-sensitive Bayesian network learning algorithm by adapting the Bayesian network algorithm, which focuses on accuracy only. There are several ways of adapting the algorithm to make it cost-sensitive: changing the distribution of the data; changing the construction process, or even adopting alternative measures in the algorithm that take account of cost; and using a Genetic Algorithm to learn the structure of the BN. This work will apply these different approaches: amending distributions, amending the formula, and using Genetic Algorithms. Finally, an empirical evaluation of the developed algorithms will be carried out on artificial data sets (e.g., diabetes data, lung cancer data, and bank data).

Cost-insensitive vs.
cost-sensitive (research problem)
A cost-insensitive classifier focuses on accuracy only (class label output). A cost-sensitive classifier attempts to minimize the expected cost.

[Figure: classification workflow. Training data, e.g. ($43.45,retail,10040, .. nonfraud) and ($246,70,weapon,94583,.,fraud), are passed to a learner (1. Decision trees, 2. Rules, 3. Naive Bayes), which produces a classifier for Transaction {fraud,nonfraud}. The classifier assigns class labels (nonfraud, fraud) to testing data, e.g. ($99.99,pharmacy,10027,...,?) and ($1.00,gas,00234,...,?).]

Hypotheses/The problem
In real-world problems such as fraud detection, medical diagnosis, or any other decision problem, one class label in the data set (e.g., the fraud class) is often much rarer and more expensive than the other class, because the cost of failing to recognize instances that belong to the rare class is high. However, most machine learning methods do not take cost into account. These cost-insensitive algorithms give poor results, because ignoring cost can produce a very weak model. In practice, misclassification (classification error) is a very common problem in real-world data mining when the class labels are imbalanced.

Methodology
The problems mentioned above arise when classifying unbalanced data sets. Therefore, three methods have been proposed to tackle those problems and minimize the expected misclassification cost:
1. Amend the data distribution to reflect cost.
2. Amend the formula by modifying the statistical measures to include cost.
3. Utilize a Genetic Algorithm to evolve a 'fittest' Bayesian network.

Results
Up to now, two new methods for cost-sensitive Bayesian network algorithms have been developed and explored: one that uses a black-box (sampling) approach, and another that uses a transparent-box approach (modifying the statistical measures) that amends the selection measure to take account of costs.

I have also investigated experimentally how changing the distribution of the data affects the performance and cost of a Bayesian classifier. I tested my approach, called “Cost-Sensitive Bayesian Network using Sampling”, on 24 data sets from the UCI repository. I compare the proposed algorithm with existing methods, and also compare its performance with that of the original algorithm. The figure below shows the results of the cost-sensitive Bayes network algorithm (via changing the distributions) and the original Bayes network algorithm.

[Figure: results of the cost-sensitive Bayes network algorithm via changing the distributions versus the original Bayes network algorithm, one panel per data set. Panel labels recoverable from the extraction: iono, ionosphere, labor, breast, hypo, pima, sonar, horse, horse-colic, hepati, mushroom, bupa liver disorder, breast cancer, heart, diabetes, tic-tac, german, spambase, crx, gymexamg, weather. Axis values are not recoverable.]

Conclusion:
The effect of our algorithms is evaluated and compared with other algorithms, such as MetaCost+J4.8, the standard decision tree (J48), and standard Bayesian networks.
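
To make the black-box idea more concrete, here is a minimal Python sketch of two ingredients named in this poster: choosing the class label with the lowest expected misclassification cost, and amending the data distribution to reflect cost by resampling. It is an illustration under stated assumptions, not the implementation evaluated here: the cost matrix values and the function names `min_expected_cost_label` and `resample_by_cost` are hypothetical, and simple cost-weighted sampling with replacement stands in for whatever sampling scheme the actual method uses.

```python
import random
from collections import Counter

# Hypothetical cost matrix, COST[actual][predicted]. The values are
# illustrative assumptions and are not taken from the poster.
COST = {
    "fraud": {"fraud": 0.0, "nonfraud": 10.0},
    "nonfraud": {"fraud": 1.0, "nonfraud": 0.0},
}


def min_expected_cost_label(probs, cost=COST):
    """Return the label with the lowest expected misclassification cost.

    probs maps each class label to P(label | instance), e.g. the
    posterior produced by a Bayesian network classifier.
    """
    def expected_cost(predicted):
        return sum(p * cost[actual][predicted] for actual, p in probs.items())

    return min(cost, key=expected_cost)


def resample_by_cost(rows, labels, cost=COST, seed=0):
    """Resample the training set with replacement so that each instance
    is drawn with probability proportional to the cost of misclassifying
    its class."""
    rng = random.Random(seed)
    # Weight of an instance = total cost of mislabelling its true class
    # (the row sum; the diagonal entry is zero).
    weights = [sum(cost[y].values()) for y in labels]
    idx = rng.choices(range(len(rows)), weights=weights, k=len(rows))
    return [rows[i] for i in idx], [labels[i] for i in idx]


if __name__ == "__main__":
    # With these costs, a 30% fraud probability already favours "fraud".
    print(min_expected_cost_label({"fraud": 0.3, "nonfraud": 0.7}))

    labels = ["fraud"] * 10 + ["nonfraud"] * 90
    rows = list(range(100))
    _, new_labels = resample_by_cost(rows, labels)
    print(Counter(labels), Counter(new_labels))
```

After resampling, the expensive rare class makes up a larger share of the training data, so an unmodified accuracy-driven learner trained on it is pushed towards lower-cost decisions. This is why the approach can be treated as a black box around the original algorithm.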