International Journal of Futuristic Machine Intelligence & Application (IJFMIA) (e-Journal)
ISSN: 2395-308X, Vol. 1, Issue 3, www.ijfmia.com
Discrimination Prevention in Data Mining for Intrusion and
Crime Detection
PUSHKAR ASWALE, BHAGYASHREE BORADE, SIDDHARTH BHOJWANI, NIRAJ GOJUMGUNDE
BE-A
DEPT. OF COMPUTER ENGINEERING, MIT ACADEMY OF ENGINEERING, ALANDI (D), PUNE
Abstract— Data mining involves the extraction of implicit, previously unknown, and potentially useful knowledge from large databases. An important issue in data mining is discrimination. Discrimination can be viewed as the act of unfairly treating people because they belong to a specific group; for instance, individuals may be discriminated against because of their ideology, gender, or age. In economics and the social sciences, discrimination has been studied for over half a century. Several decision-making tasks lend themselves to discrimination, e.g., loan granting, insurance premium computation, and staff selection, so the decision-making utility must be discrimination free. Discrimination is of two types, direct or indirect (systematic). Direct discrimination consists of rules (laws) or procedures (events) that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Indirect discrimination consists of rules or procedures that do not explicitly mention discriminatory attributes but could still generate discriminatory decisions. The literature has given evidence of unfair treatment in racial profiling and redlining, mortgage discrimination, personnel selection discrimination, and wage discrimination.
Keywords— Data mining, Discrimination Prevention, Direct
and Indirect Discrimination.
I. LITERATURE SURVEY
1) CLASSIFICATION WITH NO DISCRIMINATION BY PREFERENTIAL SAMPLING
This work removes discrimination by resampling the data instead of relabeling it: it gives a new solution to the classification-with-no-discrimination (CND) problem by introducing a sampling scheme that makes the data set discrimination free. The algorithm used in this paper is a classification algorithm. The goal of classification is to accurately predict the target class for each case in the data: it predicts categorical labels, builds a model from the training set and the values of a classifying attribute, and uses that model to classify new data. The techniques used in this paper are preprocessing, preferential sampling, oversampling, and uniform sampling. Preprocessing is needed because raw data often contain many irrelevant, redundant, or noisy records that complicate knowledge discovery during the training stage; data preparation and filtering can consume a considerable amount of processing time and include cleaning, normalization, transformation, and feature extraction/selection. In preferential sampling, the process that determines which data points are sampled and the process being modeled are stochastically dependent. In oversampling, the signal is sampled at a rate significantly higher than twice the bandwidth or highest frequency of the signal being sampled; oversampling helps avoid aliasing, improves resolution, and reduces noise. The governing relation is f_s >= 2B, where f_s is the sampling frequency and B is the bandwidth or maximum frequency of the signal; the Nyquist rate is then 2B. Uniform sampling is defined so that each data object has the same probability of being selected. The disadvantage reported in this paper is that discrimination is removed only within the ethical and legal domain considered.
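Since the survey describes preferential sampling only at a high level, a minimal sketch may help. The Python snippet below resamples positive examples per sensitive group so that every group's positive-label rate matches the overall rate; the column names ("gender", "hired"), the toy data, and the resampling rule are illustrative assumptions, not the surveyed paper's exact CND procedure.

# Illustrative sketch of discrimination-free resampling; NOT the exact
# CND algorithm from the surveyed paper. Assumed fields: a sensitive
# attribute ("gender") and a binary class label ("hired").
import random

def preferential_sample(records, group_key, label_key, seed=0):
    """Oversample/undersample positives per group so each group's
    positive rate matches the overall positive rate."""
    rng = random.Random(seed)
    overall_pos = sum(r[label_key] for r in records) / len(records)
    by_group = {}
    for r in records:
        by_group.setdefault(r[group_key], []).append(r)
    balanced = []
    for rows in by_group.values():
        pos = [r for r in rows if r[label_key] == 1]
        neg = [r for r in rows if r[label_key] == 0]
        # Target number of positives giving this group the overall rate,
        # keeping its negatives fixed: t / (t + |neg|) = overall_pos.
        # (Assumes 0 < overall_pos < 1.)
        target = round(overall_pos * len(neg) / (1 - overall_pos))
        balanced += [rng.choice(pos) for _ in range(target)] if pos else []
        balanced += neg
    return balanced

data = [{"gender": "f", "hired": 0}, {"gender": "f", "hired": 0},
        {"gender": "f", "hired": 1}, {"gender": "m", "hired": 0},
        {"gender": "m", "hired": 1}, {"gender": "m", "hired": 1}]
balanced = preferential_sample(data, "gender", "hired")
# Both groups now show the overall 50% positive rate.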
2) THREE NAÏVE BAYES APPROACHES FOR DISCRIMINATION-FREE CLASSIFICATION
In this method, the naive Bayes classifier is modified for discrimination-free classification. Discrimination laws do not allow decision rules to use sensitive attributes such as gender or religion, so a classifier must not base its decisions on these attributes. The approaches used in this paper are the naive Bayes model, the latent variable model, and the modified naive Bayes model. The naive Bayes model is a simple probabilistic classifier based on applying Bayes' theorem with strong
statistical independence assumptions. Depending on the precise nature of the probability model, naive Bayes classifiers can be trained very efficiently in a supervised setting. A latent variable model is a statistical model that relates a set of observed variables to a set of latent variables; the responses on the indicators or manifest variables are the result of an individual's position on the latent variables. The modified naive Bayes approach alters the probability distribution P(S | C) of the sensitive attribute values given the class values.
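As a concrete illustration of that last idea, the Python sketch below fits count-based naive Bayes tables and then flattens P(S | C) for the sensitive attribute so it no longer influences the posterior. The toy data, the uniform replacement, and the add-one smoothing are our assumptions; the published method adjusts P(S | C) differently.

# Sketch only: count-based naive Bayes whose sensitive-attribute table
# P(S | C) is overwritten with a uniform distribution.
from collections import Counter, defaultdict

def fit_nb(rows, attrs, label):
    prior = Counter(r[label] for r in rows)     # class counts
    cond = defaultdict(Counter)                 # (attr, class) -> value counts
    for r in rows:
        for a in attrs:
            cond[(a, r[label])][r[a]] += 1
    return prior, cond

def posterior(prior, cond, attrs, x):
    scores = {}
    for c, pc in prior.items():
        p = pc / sum(prior.values())
        for a in attrs:
            counts = cond[(a, c)]
            # Add-one smoothing over an assumed binary attribute domain.
            p *= (counts[x[a]] + 1) / (sum(counts.values()) + 2)
        scores[c] = p
    return scores

rows = [{"gender": "m", "degree": 1, "hired": 1},
        {"gender": "m", "degree": 0, "hired": 1},
        {"gender": "f", "degree": 1, "hired": 0},
        {"gender": "f", "degree": 0, "hired": 0}]
prior, cond = fit_nb(rows, ["gender", "degree"], "hired")
for c in prior:                                      # modification step:
    cond[("gender", c)] = Counter({"m": 1, "f": 1})  # flatten P(gender | class)
print(posterior(prior, cond, ["gender", "degree"], {"gender": "f", "degree": 1}))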
3) FAST ALGORITHMS FOR MINING ASSOCIATION RULES
This work presents efficient algorithms that can be used to support discrimination analysis in data mining. The algorithms used in this paper are Apriori, AprioriTid, AIS, and AprioriHybrid. The Apriori algorithm generates the candidate itemsets for a pass from the large (frequent) itemsets found in the previous pass; pruning uses the knowledge that every subset of a frequent itemset must itself be frequent. In AprioriTid, the distinguishing feature is that the database is not used for determining support after the first pass. The AIS algorithm involves two concepts: extending an itemset and determining what should be in the candidate itemset. The AprioriHybrid algorithm uses Apriori in the early passes and later switches to AprioriTid. The disadvantage reported in this paper is that an extra cost is incurred when switching from Apriori to AprioriTid.
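To make the level-wise idea concrete, here is a compact Apriori sketch in Python: candidates of size k are joined from frequent (k-1)-itemsets and pruned using the fact that every subset of a frequent itemset must be frequent. The transactions and the minimum support value are invented for illustration.

# Minimal Apriori-style frequent-itemset mining (illustrative data).
from itertools import combinations

def apriori(transactions, min_support):
    transactions = [frozenset(t) for t in transactions]
    def support(s):
        return sum(1 for t in transactions if s <= t)
    items = {i for t in transactions for i in t}
    # L1: frequent 1-itemsets.
    level = {frozenset([i]) for i in items
             if support(frozenset([i])) >= min_support}
    frequent = {}
    k = 2
    while level:
        frequent.update((s, support(s)) for s in level)
        # Join: unite pairs of frequent (k-1)-itemsets into k-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        level = {c for c in candidates if support(c) >= min_support}
        k += 1
    return frequent

txns = [{"loan", "urban", "denied"}, {"loan", "rural"},
        {"loan", "urban", "denied"}, {"urban", "denied"}]
print(apriori(txns, min_support=2))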
II. PROPOSED SYSTEM
Our proposed data transformation methods, rule protection and rule generalization, are based on measures for both direct and indirect discrimination and can deal with several discriminatory items. We demonstrate an integrated approach that addresses direct and indirect discrimination prevention with a unified algorithm covering all the data transformation variants, based on rule protection and/or rule generalization, that can prevent indirect discrimination.
FIGURE.1
The figure above summarizes this integrated approach. We propose new utility measures to evaluate the different discrimination prevention methods in terms of data quality and of the degree of direct and indirect discrimination removed. Direct and indirect discrimination discovery consists of identifying α-discriminatory rules and redlining rules; the transformation methods above are then applied to these rule categories to remove direct and indirect discrimination. Finally, discrimination-free data models can be produced from the transformed data set without seriously damaging data quality. Evaluated in terms of data quality and discrimination removal for both direct and indirect discrimination, the proposed techniques are quite successful at both goals: removing discrimination and preserving data quality.
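To ground the rule-protection idea, the following Python sketch takes a direct α-discriminatory rule (A, B -> C), measures elift(A, B -> C) = conf(A, B -> C) / conf(B -> C), and relabels covered records until the rule drops below the threshold α. The record layout, the "granted" relabeling policy, and the stopping rule are simplifying assumptions of ours, not the exact published transformation.

# Simplified rule-protection sketch; the published data transformation
# methods differ in detail. Assumed schema: attribute fields plus a
# "class" field; "granted" is the assumed benign label.

def conf(records, premise, outcome):
    covered = [r for r in records
               if all(r[k] == v for k, v in premise.items())]
    return (sum(r["class"] == outcome for r in covered) / len(covered)
            if covered else 0.0)

def elift(records, pd_part, context, outcome):
    base = conf(records, context, outcome)                  # conf(B -> C)
    return (conf(records, {**context, **pd_part}, outcome) / base
            if base else 0.0)

def rule_protect(records, pd_part, context, outcome, alpha):
    """Relabel covered records one at a time until elift < alpha."""
    full = {**context, **pd_part}
    for r in records:
        if elift(records, pd_part, context, outcome) < alpha:
            break
        if r["class"] == outcome and all(r[k] == v for k, v in full.items()):
            r["class"] = "granted"
    return records

data = ([{"foreign": 1, "city": "x", "class": "denied"} for _ in range(8)] +
        [{"foreign": 1, "city": "x", "class": "granted"} for _ in range(2)] +
        [{"foreign": 0, "city": "x", "class": "denied"} for _ in range(4)] +
        [{"foreign": 0, "city": "x", "class": "granted"} for _ in range(6)])
rule_protect(data, pd_part={"foreign": 1}, context={"city": "x"},
             outcome="denied", alpha=1.2)
# The rule "foreign, city=x -> denied" is no longer alpha-discriminatory.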
III. MODULES DESCRIPTION
1) AUTOMATED DATA COLLECTION
Data mining is an increasingly important technology for extracting useful knowledge hidden in large collections of data. The problems outlined above can be eliminated when production data is collected automatically: when production data is collected automatically as it happens, you can be assured that it is timely, accurate, and unbiased. Until recently, automatically collecting production data was a costly and unreliable proposition. There are, however, negative social perceptions about data mining, among them potential privacy invasion and potential discrimination. The latter consists of unfairly treating people on the basis of their belonging to a specific group. Automated data collection and data mining techniques such as classification rule mining have paved the way to making automated decisions, like loan granting/denial, insurance premium computation, etc. If the training data sets are biased with regard to discriminatory (sensitive) attributes like gender, race, or religion, discriminatory decisions may ensue. For this reason, antidiscrimination techniques, including discrimination discovery and prevention, have been introduced in data mining. Services in the information society allow for the automatic and routine collection of large amounts of data. Those data are often used to train association/classification rules in view of making automated decisions, like loan granting/denial, insurance premium computation, personnel selection, etc. At first sight, automating decisions may give a sense of fairness: classification rules do not guide themselves by personal preferences.
2) MEASURE THE DIFFERENT TYPES OF DISCRIMINATION
This module constructs the automated data collection database consisting of discrimination rules. For measurement purposes, discrimination is of two types:
1. Direct discrimination
2. Indirect discrimination
There are negative social perceptions about data mining, among them potential privacy invasion and potential discrimination. Discrimination can be either direct or indirect. Direct discrimination occurs when decisions are made based on sensitive attributes. Indirect discrimination occurs when decisions are made based on nonsensitive attributes which are strongly correlated with biased sensitive ones.
3) DIRECT DISCRIMINATION MEASURE
Direct discrimination consists of rules or procedures that explicitly mention minority or disadvantaged groups based on sensitive discriminatory attributes related to group membership. Prior work translated the qualitative statements in existing laws, regulations, and legal cases into quantitative formal counterparts over classification rules, and introduced a family of measures of the degree of discrimination of a PD (potentially discriminatory) rule. One of these measures is the extended lift (elift). The purpose of direct discrimination discovery is to identify α-discriminatory rules. In fact, α-discriminatory rules indicate biased rules that are directly inferred from discriminatory items (e.g., foreign worker); we call these rules direct α-discriminatory rules. In addition to elift, two other measures, slift and olift, were proposed.
INDIRECT DISCRIMINATION MEASURE
Indirect discrimination consists of rules or procedures that, while not explicitly mentioning discriminatory attributes, intentionally or unintentionally could generate discriminatory decisions. Redlining by financial institutions (refusing to grant mortgages or insurance in urban areas they consider as deteriorating) is an archetypal example of indirect discrimination, although certainly not the only one. With a slight abuse of language for the sake of compactness, in this paper indirect discrimination will also be referred to as redlining, and rules causing indirect discrimination will be called redlining rules. Indirect discrimination can happen because of the availability of some background knowledge (rules), for example, that a certain zip code corresponds to a deteriorating area or an area with a mostly black population. The background knowledge might be accessible from publicly available data (e.g., census data) or might be obtained from the original data set itself because of the existence of nondiscriminatory attributes that are highly correlated with the sensitive ones in the original data set. The purpose of indirect discrimination discovery is to identify redlining rules. In fact, redlining rules indicate biased rules that are indirectly inferred from nondiscriminatory items.
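The direct measures named above (elift, slift, olift) can be computed from the four counts of a rule's contingency table. The sketch below uses the standard definitions over conf(A, B -> C) and conf(not-A, B -> C); the variable names and the example counts are ours.

# elift/slift/olift from a 2x2 contingency table for rule (A, B -> C):
# a = |A,B,C|, n1 = |A,B|, c = |not-A,B,C|, n2 = |not-A,B|.
# Assumes 0 < p1, p2 < 1 so the odds ratio is defined.
def measures(a, n1, c, n2):
    p1 = a / n1                                  # conf(A, B -> C)
    p2 = c / n2                                  # conf(not-A, B -> C)
    p = (a + c) / (n1 + n2)                      # conf(B -> C)
    elift = p1 / p                               # extended lift
    slift = p1 / p2                              # selection lift
    olift = (p1 / (1 - p1)) / (p2 / (1 - p2))    # odds lift
    return elift, slift, olift

# Example: 8 of 10 foreign-worker applicants denied vs. 4 of 10 others.
print(measures(8, 10, 4, 10))   # elift ~ 1.33, slift = 2.0, olift = 6.0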
IV. MATHEMATICAL MODEL
Let S be the system that takes a database requested by the user through the discrimination prevention module and gives a discrimination-free database, such that
S = {I, F, O} where
I represents the set of inputs:
I = {I1, I2, I3, I4, I5}
I1 = the Adult database.
I2 = FR, the set of frequent classification rules extracted from DB.
I3 = MR, the set of direct discriminatory rules.
I4 = α (alpha), a threshold value.
I5 = DIs, a collection of sensitive (discriminatory) attributes.
And F is the set of functions:
F = {F1, F2, F3, F4, F5, F6, F7, F8, F9}
F1 = Login to the system.
F2 = Request for database.
F3 = Admin extracts database using DPM.
F4 = Calculate confidence and support.
F5 = Calculate elift.
F6 = Identify discriminatory rules.
F7 = Transform database.
F8 = Apply DDP algorithm.
F9 = Get discrimination-free database.
And O is the set of outputs:
O = {O1}
O1 = Discrimination-free database without affecting the original database.
Functions
F1: Login to the system.
If U ∈ [A-Za-z] and P ∈ [A-Za-z0-9] and Length(P) >= 6,
then F(x) = login successful;
else login fails,
where P is the password and U is the username.
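F1 translates almost directly into code. The Python sketch below checks the stated constraints (alphabetic username, alphanumeric password of length at least 6); the function name and the return strings are our own.

# F1 as code: the membership tests from the mathematical model.
import re

def f1_login(username: str, password: str) -> str:
    if (re.fullmatch(r"[A-Za-z]+", username)
            and re.fullmatch(r"[A-Za-z0-9]+", password)
            and len(password) >= 6):
        return "login successful"
    return "login fails"

print(f1_login("admin", "abc123"))   # -> login successful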
F2 = Request for database.
X: the given database.
F(X) = Generate the database with its attributes.
F3 = Extract database by administrator using DPM.
X: the database.
F(X) = Take the database and pass it to the DPM.
F4 = Calculate confidence and support.
X: a frequent classification rule.
F(X) = Calculate the confidence and support of the selected classification rule.
F5 = Elift calculation.
X: confidence and support.
F(X) = Calculate elift.
F6 = Identify α-discriminatory rules.
X: candidate classification rules.
F(X) = Extract the α-discriminatory rules.
F7 = Transform database.
X: the database.
F(X) = Transformation of the database.
F8 = Apply DDP algorithm.
X: the transformed database.
F(X) = Generate the discrimination-free database.
F9 = Display final output.
X: the database.
F(X) = Display the discrimination-free database.
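Read end to end, F4-F6 amount to: compute support and confidence for a frequent rule, derive elift, and compare it against α. Below is a self-contained toy run; the data set, the rule encoding (premise dict plus outcome label), and the α value are invented for illustration.

# Toy pass over F4-F6; data, rule encoding, and alpha are invented.
def covers(record, premise):
    return all(record.get(k) == v for k, v in premise.items())

def supp_conf(db, premise, outcome):                      # F4
    covered = [r for r in db if covers(r, premise)]
    hits = sum(r["class"] == outcome for r in covered)
    return hits / len(db), (hits / len(covered) if covered else 0.0)

db = ([{"foreign": 1, "job": 0, "class": "deny"}] * 6 +
      [{"foreign": 1, "job": 0, "class": "grant"}] * 2 +
      [{"foreign": 0, "job": 0, "class": "deny"}] * 3 +
      [{"foreign": 0, "job": 0, "class": "grant"}] * 5)
premise, outcome, alpha = {"foreign": 1, "job": 0}, "deny", 1.2

support, confidence = supp_conf(db, premise, outcome)     # F4
_, base = supp_conf(db, {"job": 0}, outcome)
elift = confidence / base                                 # F5
print(support, confidence, elift, elift >= alpha)         # F6: alpha-discriminatory?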
Design Details and UML diagrams:
FIGURE.2
FIGURE.3
The UML diagrams show the following. In the use case diagram (Figure 2) there are two actors, i.e., User and Dataset. There are a number of use cases: the Pre-process Input Data case represents pre-processing of the input data; Frequent Classification Rule is another case, used to represent the classification rules extracted from the input data; the Discrimination Discovery use case represents the discriminatory classification rules that are obtained using elift, slift, glift, etc.; and the Data Transformation use case performs transformation on the given input data. Finally, the results obtained using elift, glift, and slift are compared to evaluate performance.
DATA MODEL DESCRIPTION
This section describes the information domain for the software.
A. Data Description
It gives the processes, data stores, data flows, and the source and destination entities.
B. Data Objects and Relationships
Data objects, their major attributes, and the relationships among data objects are described using an ERD-like form.
C. Level 0 Data Flow Diagram:
The level 0 data flow diagram for the discrimination prevention system is shown in Figure 3; it includes the processes responsible for processing the input data and performing data transformation on the given data.
FIGURE.4
CONCLUSION & FUTURE WORK
Our contribution concentrates on producing training data which are free or nearly free from discrimination while preserving their usefulness for detecting real intrusion or crime. In the future, we want to run our method on real data sets, improve our methods, and also consider background knowledge (indirect discrimination).
REFERENCES
[1] X. Yin and J. Han, "CPAR: Classification Based on Predictive Association Rules," Proc. 3rd SIAM Int'l Conf. Data Mining (SDM '03), 2003.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Addison-Wesley, 2006.
[3] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proc. 20th Int'l Conf. Very Large Data Bases (VLDB '94), pp. 487-499, 1994.
[4] D. Pedreschi, S. Ruggieri, and F. Turini, "Discrimination-Aware Data Mining," Proc. 14th ACM Int'l Conf. Knowledge Discovery and Data Mining (KDD '08), pp. 560-568, 2008.
[5] S. Hajian, J. Domingo-Ferrer, and A. Martínez-Ballesté, "Discrimination Prevention in Data Mining for Intrusion and Crime Detection," Proc. IEEE Symp. Computational Intelligence in Cyber Security (CICS '11), pp. 47-54, 2011.
[6] S. Hajian, A. Monreale, D. Pedreschi, J. Domingo-Ferrer, and F. Giannotti, "Injecting Discrimination and Privacy Awareness into Pattern Discovery," Proc. IEEE 12th Int'l Conf. Data Mining Workshops, pp. 360-369, 2012.
[7] S. Ruggieri, D. Pedreschi, and F. Turini, "DCUBE: Discrimination Discovery in Databases," Proc. ACM Int'l Conf. Management of Data (SIGMOD '10), 2010.
[8] F. Kamiran, T. Calders, and M. Pechenizkiy, "Discrimination Aware Decision Tree Learning," Proc. IEEE Int'l Conf. Data Mining (ICDM '10), pp. 869-874, 2010.