International Journal of Engineering Trends and Technology (IJETT) – Volume 34 Number 1 - April 2016

Data Discrimination Prevention using Genetic Algorithm Method in Data Mining

Kalaivani. S#1, Jeniffer. D#2, Abirami. K#3, Bhargavi. S#4
#1 Asst. Prof., Department of CSE, Manakula Vinayagar Institute of Technology, Pondicherry University, India.
#2, 3, 4 B.Tech - CSE, Manakula Vinayagar Institute of Technology, Pondicherry University, India.

Abstract — Nowadays, data mining techniques are used for extracting specific kinds of knowledge, and this knowledge can be misused for discriminatory purposes. This is most commonly seen in the banking sector, where private information of an individual such as religion, race, nationality, gender, income, age and so on can be misused. Such sensitive information is used for decision-making purposes, e.g., deciding whether to grant a loan or offer employment. Direct discrimination is based on sensitive attributes. Indirect discrimination is based on non-sensitive attributes that are closely related to sensitive ones. For this reason, anti-discrimination techniques are used for discrimination discovery and prevention. In this paper, we introduce classification rule techniques combined with association rule hiding for cleaning the training data sets and eliminating the possibility of discrimination. We propose new techniques for tackling direct and indirect discrimination prevention in data mining, individually or both at the same time. First, we find the frequent classification rules in the training data set. Then we apply a heuristic association rule hiding approach. Finally, the original data is transformed in such a way that direct and/or indirect discriminatory biases are removed, while maintaining data quality and reducing information loss.

Keywords — Anti-discrimination, Data mining, Direct discrimination, Indirect discrimination, Association rules, Privacy.

I. INTRODUCTION

Discrimination is treating or considering a person or thing differently, either in favour of or against them, based on the group, kind, or type to which that person or thing belongs rather than on individual calibre. Data discrimination deals with selecting information about an individual that may be of interest to the service provider and unfairly treating them on the basis of their membership in certain groups, namely race, caste, creed, nationality, gender, etc. Such discriminatory acts are punishable by law in many democratic countries.

With the advent of big data and data mining, large amounts of data are being collected automatically, and decision-making processes are carried out easily with the help of computers. Discrimination poses a harmful threat in such automated decision making, as it denies opportunities to specific groups in settings like staff selection, health insurance, home loans, education, etc. This threat can be kept at bay using today's technology, which can aid discrimination prevention. Automated data collection and decision making is done using association/classification rule mining techniques. However, the classification rules depend on the training data set: if the training data are biased, the resulting rules will also have discriminatory characteristics.

There are two types of discrimination: direct and indirect. Direct discrimination involves biasing based on discriminatory attributes in the data set. Indirect discrimination involves the use of non-discriminatory attributes that are correlated with discriminatory information. This type of discrimination can also be accomplished using some background knowledge. For example, if a manager at a bank wants to grant loans only to people who belong to his region (e.g. Tamil Nadu), and he has access only to information about each individual's language and pin code (e.g. Tamil, 605003), then with the help of any external source or existing knowledge he can easily learn that people who speak Tamil are from Tamil Nadu. This kind of effect is known as redlining, and such indirectly discriminating rules are known as redlining rules. In this paper, we also deal with the prevention of discrimination based on redlining rules.
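To make the redlining effect concrete, the following minimal sketch (with purely illustrative attribute names and values, and with items encoded as "Attribute=value" strings) shows how a rule that uses only apparently neutral items turns into a discriminatory one once background knowledge ties those items to a sensitive attribute:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy sketch of the redlining effect: a rule built only from
// apparently neutral items becomes discriminatory once background
// knowledge links those items to a sensitive attribute. All names
// and values here are illustrative.
public class RedliningExample {
    public static void main(String[] args) {
        // PND rule learned from the data: the premise uses only
        // non-discriminatory items.
        Set<String> pndPremise = new HashSet<String>(
                Arrays.asList("Language=Tamil", "Pin=605003"));
        String decision = "Loan=No";

        // Background knowledge: the same items identify a region.
        Map<Set<String>, String> background = new HashMap<Set<String>, String>();
        background.put(new HashSet<String>(
                Arrays.asList("Language=Tamil", "Pin=605003")),
                "Region=TamilNadu");

        // Combining the rule with the background knowledge exposes
        // the hidden (redlining) PD rule.
        String sensitive = background.get(pndPremise);
        if (sensitive != null) {
            System.out.println("Redlining rule: {" + sensitive + "} -> " + decision);
        }
    }
}
```

Running the example prints the redlining rule {Region=TamilNadu} → Loan=No, even though the mined rule itself never mentions the region.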
II. RELATED WORK

The discovery of discriminatory decisions was first proposed by Pedreschi et al. [1], [3]. Their approach is based on mining classification rules (the inductive part) and reasoning on them (the deductive part), using quantitative measures of discrimination that formalize legal definitions of discrimination. In existing approaches, each rule is considered individually when measuring discrimination, without considering other rules or the relations between them. In the proposed system, we also take into account the relations between classification rules for discrimination discovery, based on the occurrence of discriminatory item sets.

Discrimination prevention consists of inducing patterns that do not lead to discriminatory decisions even though the original training data set is biased. The following three approaches are commonly followed:

1. Preprocessing: The source data is transformed in such a way that the discriminatory factors present in the original data are removed. This way, fair decision rules can be mined from the transformed data using any of the known mining algorithms. Data transformation and hierarchy-based generalization have so far been adapted from the privacy preservation literature [4], [5]. Preprocessing is useful for applications in which data mining is performed by external third parties other than the data holder.

2. In-processing: The mining algorithm itself is updated, as we do not want discriminatory decision rules in the resulting models. In-processing discrimination prevention techniques must therefore rely on special-purpose data mining algorithms. An alternative approach to cleaning the discrimination from the original data set [2] embeds the non-discrimination constraint into a decision tree learner by changing its splitting criterion and pruning strategy through a novel leaf relabeling approach.

3. Post-processing: Instead of transforming the original data set or changing the data mining algorithm, the resulting data mining models are modified. In a confidence-altering approach [6], the classification rules inferred by the CPAR algorithm are altered; a limitation of post-processing is that the data set itself cannot be published, only the modified models.
III. DISCRIMINATION ANALYSIS

A. Basic Definitions

A data set consists of records and their attributes. Let DB be the original data set. An item is an attribute together with its value, e.g., Sex=female. An item set X is a group of one or more items, e.g., {Zip=12345, City=NYC}.

A classification rule is an expression X → C, where X is an item set containing no class item and C is a class item (a yes/no decision), e.g., {Race=black, City=NYC} → Loan=No. X is known as the premise of the rule. The support of an item set, supp(X), is the fraction of records that contain the item set X. A rule X → C is said to be supported by a record if both X and C appear in the record. The confidence of a classification rule, conf(X → C), measures how often the class item C appears in records that contain X. Hence, if supp(X) > 0, then

conf(X → C) = supp(X, C) / supp(X).

A frequent classification rule is one whose support and confidence are above specified minimum thresholds. Here support measures the statistical significance of a rule, while confidence measures its strength. Let FR be the set of frequent classification rules extracted from DB.

B. PD and PND Rules

Assuming the discriminatory items in DB are predetermined (e.g. Sex=female, Race=black), a frequent classification rule belongs to one of the following two classes, depending on its discriminatory and non-discriminatory items:

i. A classification rule X → C is potentially discriminatory (PD) when X = A, B, where A is a non-empty discriminatory item set and B is a non-discriminatory item set.

ii. A classification rule X → C is potentially non-discriminatory (PND) when X = D, B, where D and B are non-discriminatory item sets.

The word "potentially" indicates that a PD rule could probably lead to discriminatory decisions, so some measures are required to evaluate its discrimination potential (direct discrimination). Likewise, a PND rule may lead to discriminatory decisions if put together with some background knowledge (indirect discrimination).

C. Discrimination Discovery and Measurement

Direct and indirect discrimination discovery [7] involves identifying α-discriminatory rules and redlining rules, respectively. First, using the predetermined discriminatory items in DB, we obtain frequent classification rules that are either PD or PND. Second, direct discrimination is measured by finding the α-discriminatory rules among the PD rules, using a discriminatory threshold and a direct discrimination measure (Fig. 3.1). Third, indirect discrimination is measured by finding the redlining rules among the PND rules [8] together with some background knowledge, using a discriminatory threshold and an indirect discrimination measure (Fig. 3.2).

The degree of discrimination of a PD rule is measured using the extended lift (elift) introduced by Pedreschi et al. [1], [3]:

elift(A, B → C) = conf(A, B → C) / conf(B → C).

Using the elift value and a threshold, we can determine whether a rule is discriminatory or not. Let α ∈ R be a fixed threshold, and let A be a non-empty discriminatory item set and B a non-discriminatory item set. A PD classification rule c: A, B → C is α-protective with respect to elift if elift(c) < α. Otherwise, c is α-discriminatory.

Fig. 3.1: Modeling the process of direct discrimination control (PD rules A, B → C from a data set with discriminatory item sets are checked for direct discrimination).

Fig. 3.2: Modeling the process of indirect discrimination control (PND rules D, B → C are checked for indirect discrimination through an inference model, using background rules that relate the non-discriminatory items D to the discriminatory items A).

The notion of α-protection [9] is introduced in order to measure the "disproportionate burden" that a rule imposes, as a PD rule does not always provide evidence of discriminatory actions. It gives a measure of the discriminatory power of a PD classification rule. The idea is to express such a measure as the relative gain in confidence of the rule due to the presence of the discriminatory item set. The α parameter is the key for adjusting the level of protection against discrimination. PD classification rules are produced from a data set containing discriminatory item sets. In this paper, we propose a genetic algorithm method to improve rule extraction from the history of the data set for association rule hiding.
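As a minimal illustration of the measures defined in this section, the following sketch computes supp, conf, and elift and applies the α-discrimination test, assuming each record and item set is encoded as a set of "Attribute=value" strings (the helper names are illustrative):

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of the measures in Section III, assuming each
// record (and each item set) is a set of "Attribute=value" strings
// and the class item C is encoded the same way.
public class EliftCheck {

    // supp(X): fraction of records that contain every item of X.
    static double supp(List<Set<String>> db, Set<String> x) {
        int count = 0;
        for (Set<String> record : db) {
            if (record.containsAll(x)) count++;
        }
        return (double) count / db.size();
    }

    // conf(X -> C) = supp(X, C) / supp(X), defined for supp(X) > 0.
    static double conf(List<Set<String>> db, Set<String> x, String c) {
        Set<String> xc = new HashSet<String>(x);
        xc.add(c);
        double sx = supp(db, x);
        return sx > 0 ? supp(db, xc) / sx : 0.0;
    }

    // elift(A, B -> C) = conf(A, B -> C) / conf(B -> C).
    static double elift(List<Set<String>> db, Set<String> a,
                        Set<String> b, String c) {
        Set<String> ab = new HashSet<String>(a);
        ab.addAll(b);
        double base = conf(db, b, c);
        return base > 0 ? conf(db, ab, c) / base : Double.POSITIVE_INFINITY;
    }

    // A PD rule A, B -> C is alpha-discriminatory iff elift >= alpha;
    // otherwise it is alpha-protective.
    static boolean isAlphaDiscriminatory(List<Set<String>> db, Set<String> a,
                                         Set<String> b, String c, double alpha) {
        return elift(db, a, b, c) >= alpha;
    }
}
```

For instance, for the PD rule {Race=black}, {City=NYC} → Loan=No, the call isAlphaDiscriminatory(db, A, B, "Loan=No", α) reports whether the rule's elift reaches the chosen threshold α.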
IV. PROPOSED APPROACH

A. System Architecture

In the proposed work, we prevent discrimination using a post-processing approach. We propose a genetic algorithm for optimized discrimination discovery, and the results of the genetic algorithm are given as input to the CPAR (Classification based on Predictive Association Rules) algorithm for discrimination prevention, as shown in Fig. 4.1. CPAR combines the advantages of both associative classification and traditional rule-based classification (FOIL and PRM). We now describe the algorithmic strategy of the proposed genetic algorithm and of CPAR, which are used for discrimination discovery and prevention in the post-processing approach.

Fig. 4.1: System Architecture (original database → post-processing with the genetic algorithm and the CPAR algorithm → data sets without discriminatory item sets → mined data).

B. Algorithm Strategies

In a genetic algorithm [10], a population of candidate solutions (called individuals) to an optimization problem is evolved towards better solutions. Each candidate solution has a set of attributes which can be mutated and changed. The evolution usually begins with a population of randomly generated individuals (here, candidate rules), and the population at the end of each iteration is called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is the value of the objective function in the optimization problem being solved. The fitter individuals are stochastically selected from the current population, and each individual's genome is modified (recombined and possibly randomly mutated) to form a new generation, which is then used in the next iteration of the algorithm. The algorithm terminates when a maximum number of generations has been produced or a desired fitness level has been reached for the population.

A typical genetic algorithm has two requirements: (1) a genetic representation of the solution domain, and (2) a fitness function to evaluate the solution domain. The fitness function, which is defined over the genetic representation, measures the quality of the represented solution and is generally problem dependent. In some problems it is highly difficult or even impossible to define a fitness expression; in such cases, a simulation may be used to determine the fitness value of a phenotype, and sometimes even interactive genetic algorithms are used. Here, the fitness function is based on a rank: it is calculated as the ratio of the number of matched records from the history to the rule size.
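The following sketch illustrates such a search over candidate rules using the rank-based fitness just described; the representation and the tournament selection, crossover, and mutation operators shown are common textbook choices, assumed here rather than taken from our exact implementation:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

// Illustrative sketch of the genetic search over candidate rules.
// A candidate rule is a set of "Attribute=value" items; the
// operators below are common choices, not the paper's exact ones.
public class RuleGA {
    static final Random RND = new Random();

    // Rank-based fitness: matched history records divided by rule size.
    static double fitness(Set<String> rule, List<Set<String>> history) {
        if (rule.isEmpty()) return 0.0;
        int matched = 0;
        for (Set<String> record : history) {
            if (record.containsAll(rule)) matched++;
        }
        return (double) matched / rule.size();
    }

    // Binary tournament selection: the fitter of two random rules.
    static Set<String> select(List<Set<String>> pop, List<Set<String>> history) {
        Set<String> a = pop.get(RND.nextInt(pop.size()));
        Set<String> b = pop.get(RND.nextInt(pop.size()));
        return fitness(a, history) >= fitness(b, history) ? a : b;
    }

    // Uniform crossover: each parent item is inherited with probability 1/2.
    static Set<String> crossover(Set<String> p1, Set<String> p2) {
        Set<String> child = new HashSet<String>();
        for (String item : p1) if (RND.nextBoolean()) child.add(item);
        for (String item : p2) if (RND.nextBoolean()) child.add(item);
        return child.isEmpty() ? new HashSet<String>(p1) : child;
    }

    // Mutation: occasionally swap one item for a random item from the pool.
    static void mutate(Set<String> rule, List<String> itemPool) {
        if (!rule.isEmpty() && RND.nextDouble() < 0.1) {
            rule.remove(rule.iterator().next());
            rule.add(itemPool.get(RND.nextInt(itemPool.size())));
        }
    }

    // Evolve a non-empty initial population for a fixed number of
    // generations and return the best candidate rule seen.
    static Set<String> evolve(List<Set<String>> population,
                              List<Set<String>> history,
                              List<String> itemPool, int generations) {
        Set<String> best = population.get(0);
        for (int g = 0; g < generations; g++) {
            List<Set<String>> next = new ArrayList<Set<String>>();
            for (int i = 0; i < population.size(); i++) {
                Set<String> child = crossover(select(population, history),
                                              select(population, history));
                mutate(child, itemPool);
                next.add(child);
                if (fitness(child, history) > fitness(best, history)) best = child;
            }
            population = next;
        }
        return best;
    }
}
```

Because the fitness divides the number of matched history records by the rule size, the search favours short rules with broad support in the history, which matches the goal of extracting the best matched rules from the data set's history.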
Finally, the output of the genetic algorithm is given as input to the CPAR algorithm, which prevents both direct and indirect discrimination. CPAR follows a greedy approach to generate rules directly from the training data, instead of generating a large number of candidate rules as in associative classification. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers in order to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction. The following steps are required when implementing the CPAR algorithm (a sketch of the prediction step follows the list):

1. Generation of rules;
2. Estimation of the accuracy of the rules;
3. Classification and result analysis.
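The sketch below illustrates steps 2 and 3 under the assumption that each generated rule stores its Laplace expected accuracy; the greedy, FOIL-gain-based rule generation of step 1 is omitted, and the Rule class and its fields are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of CPAR's rule evaluation and prediction steps. Rule
// generation (the greedy, FOIL-gain-based part) is omitted; the
// Rule fields and helper names below are illustrative.
public class CparPredict {

    static class Rule {
        Set<String> body;   // premise, e.g. {"City=NYC", "Income=low"}
        String label;       // class item, e.g. "Loan=No"
        double accuracy;    // Laplace expected accuracy of the rule

        Rule(Set<String> body, String label, double accuracy) {
            this.body = body;
            this.label = label;
            this.accuracy = accuracy;
        }
    }

    // Laplace expected accuracy of a rule covering n records,
    // nc of which carry the rule's class, over k classes.
    static double laplace(int nc, int n, int k) {
        return (nc + 1.0) / (n + k);
    }

    // Prediction: among the rules whose body the example satisfies,
    // average the expected accuracy of the best k rules of each
    // class and pick the class with the highest average.
    static String predict(List<Rule> rules, Set<String> example, int k) {
        Map<String, List<Double>> byClass = new HashMap<String, List<Double>>();
        for (Rule r : rules) {
            if (example.containsAll(r.body)) {
                List<Double> accs = byClass.get(r.label);
                if (accs == null) {
                    accs = new ArrayList<Double>();
                    byClass.put(r.label, accs);
                }
                accs.add(r.accuracy);
            }
        }
        String bestLabel = null;
        double bestAvg = -1.0;
        for (Map.Entry<String, List<Double>> e : byClass.entrySet()) {
            List<Double> accs = e.getValue();
            Collections.sort(accs, Collections.<Double>reverseOrder());
            int n = Math.min(k, accs.size());
            double sum = 0.0;
            for (int i = 0; i < n; i++) sum += accs.get(i);
            double avg = sum / n;
            if (avg > bestAvg) {
                bestAvg = avg;
                bestLabel = e.getKey();
            }
        }
        return bestLabel;
    }
}
```

Evaluating rules by expected accuracy and predicting with only the best k rules per class is what allows CPAR to keep many rules without overfitting.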
The experimental results show that the proposed algorithm is quite successful in both of its goals: discovering discrimination and removing it.

V. EXPERIMENTAL RESULTS

The algorithm was implemented in JDK 1.7 using JavaServer Pages. The tests were performed on a 2.4 GHz Pentium IV machine equipped with 512 MB of RAM and running Windows 7 Professional. The genetic algorithm yields the best matched rules from the history of the data sets, and the crossover and mutation steps produce higher quality rules. As the number of iterations increases, the match value also increases, i.e., we obtain a high match value, as shown in Fig. 5.1. These rules are given as input to the discrimination prevention algorithm, which can prevent both direct and indirect discrimination.

Fig. 5.1: Performance graph (fitness, 0-100, versus generation, 0-1200).

By using the CPAR algorithm for discrimination prevention, strong rules are generated without any discriminatory items. This overcomes the disadvantages of the preprocessing method: the post-processing method maintains the quality of the data set and keeps it free of discriminatory rules.

VI. CONCLUSION

Discrimination is an important issue in the processing of loans in the banking sector. We have successfully combined two well-known algorithms to prevent both direct and indirect discrimination in data sets. The genetic algorithm selects the best suited item sets from the data set for easy classification, and the CPAR algorithm offers high accuracy and efficiency, representing a new methodology for efficient, high-quality classification. By applying the proposed method, we achieve discrimination prevention in post-processing. As a future enhancement, we plan to explore measures of discrimination different from the ones considered in this paper, along with privacy preservation in data mining.

ACKNOWLEDGEMENT

We thank our colleagues for the support and help that they provided from time to time. We also thank our institution for its support in every aspect.

REFERENCES

[1] A. Vetro, H. Sun, P. DaGraca, and T. Poon, "Minimum drift architectures for three-layer scalable DTV decoding," IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug. 1998.
[2] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed., Plenum Press: New York, 1995, pp. 613-651.
[3] A. Vetro, H. Sun, P. DaGraca, and T. Poon, "Minimum drift architectures for three-layer scalable DTV decoding," IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug. 1998.
[4] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed., Plenum Press: New York, 1995, pp. 613-651.
[5] A. Vetro, H. Sun, P. DaGraca, and T. Poon, "Minimum drift architectures for three-layer scalable DTV decoding," IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug. 1998.
[6] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed., Plenum Press: New York, 1995, pp. 613-651.
[7] A. Vetro, H. Sun, P. DaGraca, and T. Poon, "Minimum drift architectures for three-layer scalable DTV decoding," IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug. 1998.
[8] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed., Plenum Press: New York, 1995, pp. 613-651.
[9] A. Vetro, H. Sun, P. DaGraca, and T. Poon, "Minimum drift architectures for three-layer scalable DTV decoding," IEEE Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug. 1998.
[10] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed., Plenum Press: New York, 1995, pp. 613-651.