International Journal of Futuristic Machine Intelligence & Application (IJFMIA) (e-Journal), ISSN: 2395-308x, Vol 1 Issue 3, www.ijfmia.com

Discrimination Prevention in Data Mining for Intrusion and Crime Detection

PUSHKAR ASWALE, BHAGYASHREE BORADE, SIDDHARTH BHOJWANI, NIRAJ GOJUMGUNDE
BE-A, Dept. of Computer Engineering, MIT Academy of Engineering, Alandi (D), Pune

Abstract— Data mining involves the extraction of implicit, previously unknown, and potentially useful knowledge from large databases. An important issue in data mining is discrimination: the act of unfairly treating people on the basis that they belong to a specific group. For instance, individuals may be discriminated against because of their ideology, gender, or age. Discrimination has been studied in economics and the social sciences for over half a century. Several decision-making tasks lend themselves to discrimination, e.g., loan granting, insurance premium computation, and staff selection, so the decision-making tool must be discrimination free. Discrimination is of two types, direct and indirect (systematic). Direct discrimination consists of rules (laws) or procedures (events) that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Indirect discrimination consists of rules or procedures that do not explicitly mention discriminatory attributes but could still generate discriminatory decisions. The literature gives evidence of unfair treatment in racial profiling and redlining, mortgage discrimination, personnel selection discrimination, and wage discrimination.

Keywords— Data mining, Discrimination Prevention, Direct and Indirect Discrimination.

I. LITERATURE SURVEY

1) CLASSIFICATION WITH NO DISCRIMINATION BY PREFERENTIAL SAMPLING

Instead of relabeling sensitive data, we can resample it. This work gives a new solution to the classification-with-no-discrimination (CND) problem by introducing a sampling scheme that makes the data set discrimination free instead of relabeling it. The algorithm used in the paper is a classification algorithm: the goal of classification is to accurately predict the target class for each case in the data. It predicts categorical labels, builds a model from the training set and the values of a classifying attribute, and uses that model to classify new data. The techniques used in the paper are preprocessing, preferential sampling, oversampling, and uniform sampling. Raw data often contain a lot of tangential, redundant, or noisy records, so knowledge discovery during the training stage becomes complicated, and data preparation and filtering steps can take a considerable amount of processing time. Data preprocessing includes cleaning, normalization, transformation, and feature extraction/selection. In preferential sampling, the process that determines the data locations and the process being modeled are stochastically dependent. Oversampling means sampling at a rate significantly higher than twice the bandwidth or peak frequency of the signal being sampled; it helps avoid aliasing, improves resolution, and reduces noise. The sampling condition is fs >= 2B, where fs is the sampling frequency and B is the bandwidth or maximum frequency of the signal; the Nyquist rate is then 2B. Uniform sampling is defined as sampling in which each data object has the same probability of being selected. The disadvantage noted in this paper is that discrimination is removed only within the ethical and legal region considered. A minimal sketch of sampling-based discrimination removal is given below.
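The following is a minimal, illustrative sketch of sampling-based discrimination removal, not the surveyed paper's exact algorithm: each (group, class) cell is resampled to the size it would have if the sensitive attribute and the class label were independent. The column names "gender" and "label" are assumptions for illustration, and selection within a cell is uniformly random, whereas true preferential sampling ranks objects with a learned classifier and duplicates or removes borderline ones first.

```python
# A sketch of discrimination-free resampling: over/under-sample each
# (sensitive value, class) cell to its expected size under independence.
# Toy column names and data are assumptions, not the paper's data set.
import pandas as pd

def preferential_resample(df: pd.DataFrame, sensitive: str, label: str,
                          seed: int = 0) -> pd.DataFrame:
    n = len(df)
    parts = []
    for (g, c), cell in df.groupby([sensitive, label]):
        # Expected cell size under independence: P(g) * P(c) * n
        expected = int(round(len(df[df[sensitive] == g]) *
                             len(df[df[label] == c]) / n))
        # Over-sample (with replacement) or under-sample the cell
        parts.append(cell.sample(n=expected,
                                 replace=expected > len(cell),
                                 random_state=seed))
    return pd.concat(parts).reset_index(drop=True)

# Example usage on a toy data set
data = pd.DataFrame({
    "gender": ["f", "f", "f", "m", "m", "m", "m", "m"],
    "label":  [0,   0,   1,   1,   1,   1,   0,   1],
})
balanced = preferential_resample(data, "gender", "label")
print(balanced.groupby(["gender", "label"]).size())
```

After resampling, the class distribution within each group matches the overall class distribution, so a classifier trained on the balanced set no longer learns the group-class correlation present in the original data.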
2) THREE NAIVE BAYES APPROACHES FOR DISCRIMINATION-FREE CLASSIFICATION

In this method, naive Bayes is modified for discrimination-free classification. Discrimination laws do not allow decision rules to use attributes such as gender or religion, so a classifier must not base its decisions on them. The approaches used in this paper are the naive Bayes model, a latent variable model, and a modified naive Bayes. The naive Bayes model is a simple probabilistic classifier based on applying Bayes' theorem with strong statistical independence assumptions. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised learning setting. A latent variable model is a statistical model that relates a set of observed variables to a set of latent variables; the responses on the indicator or manifest variables are taken to be the result of an individual's position on the latent variables. The modified naive Bayes approach modifies the probability distribution P(S|C) of the sensitive attribute values given the class values.

3) FAST ALGORITHM FOR MINING ASSOCIATION RULES

The fast algorithm is an efficient algorithm used to avoid discrimination in data mining. The paper uses the Apriori, AprioriTid, AIS, and AprioriHybrid algorithms. In the Apriori algorithm, the large (frequent) itemsets found in the previous pass are used to generate the new candidate itemsets, and pruning applies the property that every subset of a frequent itemset must itself be frequent. In AprioriTid, the database is not used for determining support after the first pass. The AIS algorithm involves two concepts: extending an itemset, and determining what should be in the candidate itemset. The AprioriHybrid algorithm uses Apriori in the early passes and later switches to AprioriTid. The disadvantage noted in this paper is that an extra cost is incurred when switching from Apriori to AprioriTid.

II. PROPOSED SYSTEM

Our proposed data transformation methods, rule protection and rule generalization, are based on measures for both direct and indirect discrimination and can deal with several discriminatory items. We demonstrate an integrated approach to direct and indirect discrimination prevention, with finalized algorithms and all possible data transformation methods based on rule protection and/or rule generalization that can prevent indirect discrimination.

FIGURE 1

As the figure above shows, we also propose new utility measures to evaluate the different discrimination prevention methods in terms of information quality and of direct and indirect discrimination removal. Direct and indirect discrimination discovery consists of identifying discriminatory rules and redlining rules; the transformation methods above are then applied to these categories to remove both direct and indirect discrimination. Since discovery starts from frequent classification rules, a minimal Apriori-style sketch of the underlying frequent itemset mining follows.
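Below is a minimal, self-contained Apriori-style sketch of the frequent itemset pass that frequent rule extraction builds on. It is an illustration of the surveyed algorithm, not the authors' implementation; the toy "loan decision" transactions and the min_support value are assumptions.

```python
# A minimal Apriori sketch: repeatedly count candidate itemsets, keep
# those meeting min_support, and generate next-size candidates pruned
# with the Apriori property (every subset of a frequent itemset is
# itself frequent). Toy data below is assumed for illustration.
from itertools import combinations

def apriori(transactions, min_support):
    """Return every itemset whose support meets min_support."""
    n = len(transactions)
    # Pass 1 candidates: all single items
    current = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    while current:
        # Count each candidate's support in one scan of the data
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        survivors = {c: s / n for c, s in counts.items()
                     if s / n >= min_support}
        frequent.update(survivors)
        if not survivors:
            break
        # Candidate generation for the next pass, with subset pruning
        k = len(next(iter(survivors))) + 1
        items = sorted({i for c in survivors for i in c})
        current = {frozenset(c) for c in combinations(items, k)
                   if all(frozenset(s) in survivors
                          for s in combinations(c, k - 1))}
    return frequent

# Toy "loan decision" transactions (assumed for illustration)
records = [frozenset(t) for t in
           [{"foreign", "deny"}, {"foreign", "deny"},
            {"local", "grant"}, {"foreign", "grant"}]]
for itemset, support in sorted(apriori(records, 0.5).items(), key=str):
    print(set(itemset), round(support, 2))
```

Classification rules of the form X -> C are then read off the frequent itemsets, with confidence computed as supp(X and C) / supp(X).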
Finally, discrimination-free data models can be produced from the transformed data set without seriously damaging data quality. The discrimination prevention techniques are evaluated in terms of information quality and of discrimination removal, for both direct and indirect discrimination. The proposed techniques are quite successful in both goals: removing discrimination and preserving data quality.

III. MODULES DESCRIPTION

1) AUTOMATED DATA COLLECTION

Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data. The problems outlined above can be eliminated when production data is collected automatically: when data is captured as events happen, it can be assumed to be timely, accurate, and unbiased. Until recently, automatically collecting production data was a costly and unreliable proposition. There are, however, negative social perceptions about data mining, among them potential privacy invasion and potential discrimination; the latter consists of unfairly treating people on the basis of their belonging to a specific group. Automated data collection and data mining techniques such as classification rule mining have paved the way to making automated decisions, like loan granting/denial and insurance premium computation. If the training data sets are biased with regard to discriminatory (sensitive) attributes like gender, race, or religion, discriminatory decisions may ensue. For this reason, antidiscrimination techniques, including discrimination discovery and prevention, have been introduced in data mining. Services in the information society allow for the automatic and routine collection of large amounts of data, and those data are often used to train association/classification rules in view of making automated decisions such as loan granting/denial, insurance premium computation, and personnel selection. At first sight, automating decisions may give a sense of fairness: classification rules do not guide themselves by personal preferences.

2) MEASURE THE DISCRIMINATION OF DIFFERENT TYPES

The automated data collection database is constructed so that it consists of discrimination rules. The discrimination to be measured is of two types: 1. direct discrimination and 2. indirect discrimination. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on nonsensitive attributes that are strongly correlated with biased sensitive ones.

3) DIRECT DISCRIMINATION MEASURE

Direct discrimination consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Pedreschi et al. translated the qualitative statements in existing laws, regulations, and legal cases into quantitative formal counterparts over classification rules, and they introduced a family of measures of the degree of discrimination of a PD (potentially discriminatory) rule. One of these measures is the extended lift (elift). The purpose of direct discrimination discovery is to identify α-discriminatory rules: biased rules that are directly inferred from discriminatory items (e.g., "foreign worker"). We call these rules direct α-discriminatory rules. In addition to elift, two other measures, slift and olift, have been proposed. A minimal sketch of the elift computation is given below.
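To make the measure concrete, the sketch below computes elift for a PD rule (A, B -> C), where A is the discriminatory itemset and B the context, as conf(A, B -> C) / conf(B -> C). The toy records and the threshold alpha = 1.2 are assumptions for illustration, not the paper's data or legally mandated threshold.

```python
# A minimal sketch of the extended lift (elift) measure for direct
# discrimination discovery: elift(A,B -> C) = conf(A,B -> C) / conf(B -> C).
# Toy records and the alpha threshold are assumed for illustration.

def conf(records, premise, conclusion):
    """Confidence of rule premise -> conclusion over itemset records."""
    covered = [r for r in records if premise <= r]
    if not covered:
        return 0.0
    return sum(1 for r in covered if conclusion <= r) / len(covered)

def elift(records, a, b, c):
    """Extended lift of the PD rule (A, B -> C); A is discriminatory."""
    base = conf(records, b, c)
    return conf(records, a | b, c) / base if base else float("inf")

records = [frozenset(r) for r in
           [{"foreign", "city", "deny"}, {"foreign", "city", "deny"},
            {"local", "city", "deny"}, {"local", "city", "grant"},
            {"local", "city", "grant"}]]
a, b, c = frozenset({"foreign"}), frozenset({"city"}), frozenset({"deny"})
alpha = 1.2  # assumed discrimination threshold
e = elift(records, a, b, c)
print(f"elift = {e:.2f}; alpha-discriminatory: {e >= alpha}")
```

Here elift = 1.0 / 0.6 ≈ 1.67: denial is 1.67 times more likely for foreign applicants than in the context overall, so the rule is flagged as α-discriminatory at alpha = 1.2.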
4) INDIRECT DISCRIMINATION MEASURE

Indirect discrimination consists of rules or procedures that, while not explicitly mentioning discriminatory attributes, could intentionally or unintentionally generate discriminatory decisions. Redlining by financial institutions (refusing to grant mortgages or insurance in urban areas they consider deteriorating) is an archetypal example of indirect discrimination, although certainly not the only one. With a slight abuse of language, for the sake of compactness, in this paper indirect discrimination will also be referred to as redlining, and rules causing indirect discrimination will be called redlining rules. Indirect discrimination can happen because of the availability of some background knowledge (rules), for example, that a certain zip code corresponds to a deteriorating area or an area with a mostly black population. The background knowledge might be accessible from publicly available data (e.g., census data) or might be obtained from the original data set itself, owing to the existence of nondiscriminatory attributes that are highly correlated with the sensitive ones. The purpose of indirect discrimination discovery is to identify redlining rules: biased rules that are indirectly inferred from nondiscriminatory items.

IV. MATHEMATICAL MODEL

Let S be a database requested by the user and processed by the discrimination prevention module (DPM), which returns a discrimination-free database, such that S = {I, F, O}.

I represents the set of inputs, I = {I1, I2, I3, I4, I5}:
I1 = Adult database.
I2 = FR, the database of frequent classification rules extracted from DB.
I3 = MR, the database of direct discriminatory rules.
I4 = Alpha, a threshold value.
I5 = DIs, a collection of sensitive attributes.

F is the set of functions, F = {F1, F2, F3, F4, F5, F6, F7, F8, F9}:
F1 = Login to the system.
F2 = Request for database.
F3 = Admin extracts database using DPM.
F4 = Calculate confidence and support.
F5 = Calculate elift.
F6 = Identify α-discriminatory rules.
F7 = Transform database.
F8 = Apply DDP algorithm.
F9 = Get discrimination-free database.

O is the set of outputs, O = {O1}:
O1 = Discrimination-free database without affecting the original database.

Functions:
F1: Login to the system. If x = login, then F(x) = login successful when U consists of characters in [A-Za-z], P consists of characters in [A-Za-z0-9], and length(P) >= 6; else login fails. Here P is the password and U the username.
F2: Request for database. X: given database. F(X) = generate database with attributes.
F3: Extract database by administrator using DPM. X: database. F(X) = take database and pass it to the DPM.
F4: Calculate confidence and support. X: frequent classification rule. F(X) = calculate confidence and support of the selected classification rule.
F5: Elift calculation. X: confidence and support. F(X) = calculate elift.
F6: Identify α-discriminatory rules. X: classification rules. F(X) = extract α-discriminatory rules.
F7: Transform database. X: database. F(X) = transformation of database.
F8: Apply DDP algorithm. X: transformed database. F(X) = generate discrimination-free database.
F9: Display final output. X: database. F(X) = display the discrimination-free database.

A simplified sketch of the transformation steps F6-F8 is given below.
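The following is a highly simplified, assumed sketch of the transformation steps F6-F8: while a flagged PD rule (A, B -> C) remains α-discriminatory under elift, the class item of one covered record is relabeled. This illustrates rule protection in spirit only; the paper's exact DDP transformation (rule protection and rule generalization) may differ, and the toy data, item names, and alpha value are assumptions.

```python
# A simplified sketch of DDP-style rule protection: relabel covered
# records until the flagged rule is no longer alpha-discriminatory.
# Illustration only; not necessarily the authors' exact algorithm.

def conf(records, premise, conclusion):
    covered = [r for r in records if premise <= r]
    return (sum(1 for r in covered if conclusion <= r) / len(covered)
            if covered else 0.0)

def elift(records, a, b, c):
    base = conf(records, b, c)
    return conf(records, a | b, c) / base if base else float("inf")

def protect(records, a, b, c, fair_c, alpha):
    """Relabel records covered by (A,B -> C) until the rule stops
    being alpha-discriminatory (or no covered records remain)."""
    records = [set(r) for r in records]
    while elift(records, a, b, c) >= alpha:
        covered = [r for r in records if (a | b | c) <= r]
        if not covered:
            break
        covered[0].difference_update(c)  # drop the biased class item
        covered[0].update(fair_c)        # add the fair class item
    return records

data = [set(r) for r in
        [{"foreign", "city", "deny"}, {"foreign", "city", "deny"},
         {"local", "city", "deny"}, {"local", "city", "grant"}]]
a, b = {"foreign"}, {"city"}
deny, grant = {"deny"}, {"grant"}
clean = protect(data, a, b, deny, grant, alpha=1.2)
print(clean)
```

On this toy data, one relabeled record suffices to bring elift from about 1.33 down to 1.0, below the assumed alpha, while the rest of the database is untouched, which reflects output O1's goal of minimal damage to the original data.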
Design Details and UML Diagrams

FIGURE 2

FIGURE 3

The UML diagrams show the following. In the use case diagram (Figure 2), there are two actors, User and Dataset, and a number of use cases. The "pre-process input data" use case represents pre-processing of the input data. "Frequent classification rule" is another use case, representing the classification rules extracted from the input data. The "discrimination discovery" use case represents the discriminatory classification rules obtained using elift, slift, olift, etc. The "data transformation" use case performs the transformation on the given input data. Finally, the results obtained using elift, slift, and olift are compared to evaluate performance.

DATA MODEL DESCRIPTION

This section describes the information domain for the software.

A. Data Description
It gives the processes, data stores, data flows, and source and destination entities.

B. Data Objects and Relationships
Data objects, their major attributes, and the relationships among data objects are described using an ERD-like form.

C. Level 0 Data Flow Diagram
The level 0 data flow diagram for the discrimination prevention system is shown in Figure 4; it includes the processes responsible for processing the input data and performing the data transformation on the given data.

FIGURE 4

CONCLUSION & FUTURE WORK

Our contribution concentrates on producing training data which are free, or nearly free, from discrimination while preserving their usefulness for detecting real intrusion or crime. In the future, we want to run our method on real datasets, improve our methods, and also consider background knowledge (indirect discrimination).

REFERENCES

[1] X. Yin and J. Han, "CPAR: Classification Based on Predictive Association Rules," Proc. Third SIAM Int'l Conf. Data Mining, 2003.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB), pp. 487-499, 1994.
[4] D. Pedreschi, S. Ruggieri, and F. Turini, "Discrimination-Aware Data Mining," Proc. 14th ACM Int'l Conf. Knowledge Discovery and Data Mining (KDD 2008), pp. 560-568, 2008.
[5] S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, "Discrimination Prevention in Data Mining for Intrusion and Crime Detection," Proc. IEEE Symp. Computational Intelligence in Cyber Security (CICS '11), pp. 47-54, 2011.
[6] S. Hajian, A. Monreale, D. Pedreschi, J. Domingo-Ferrer, and F. Giannotti, "Injecting Discrimination and Privacy Awareness into Pattern Discovery," Proc. IEEE 12th Int'l Conf. Data Mining Workshops, pp. 360-369, 2012.
[7] S. Ruggieri, D. Pedreschi, and F. Turini, "DCUBE: Discrimination Discovery in Databases," Proc. ACM Int'l Conf. Management of Data (SIGMOD '10), 2010.
[8] F. Kamiran, T. Calders, and M. Pechenizkiy, "Discrimination Aware Decision Tree Learning," Proc. IEEE Int'l Conf. Data Mining (ICDM '10), pp. 869-874, 2010.