Data Discrimination Prevention using Genetic Algorithm Method in Data Mining . Kalaivani. S

International Journal of Engineering Trends and Technology (IJETT) – Volume 34 Number 1- April 2016
Data Discrimination Prevention using Genetic Algorithm Method in Data Mining
Kalaivani. S#1, Jeniffer. D#2, Abirami. K#3, Bhargavi. S#4
#1 Asst. Prof., Department of CSE, Manakula Vinayagar Institute of Technology, Pondicherry University, India.
#2, 3, 4 B.Tech - CSE, Manakula Vinayagar Institute of Technology, Pondicherry University, India.
Abstract — Now-a-days data mining techniques are
used to extract specific kinds of knowledge for discriminatory purposes. This is most commonly seen in the banking sector, where private information about an individual such as religion, race, nationality, gender, income, and age can be misused. Such sensitive information is used in decision making, for example in deciding whether to grant a loan or offer employment. Direct discrimination is based on sensitive attributes. Indirect discrimination is based on non-sensitive attributes that are closely related to sensitive ones. For this reason, anti-discrimination techniques are used for discrimination discovery and prevention. In this paper, we combine classification rule techniques with association rule hiding to clean the training data sets and to eliminate the possibility of discrimination. We propose new techniques for preventing direct and indirect discrimination in data mining, individually or both at the same time. First, we find the frequent classification rules for the training data set. Then we apply the heuristic approach of association rule hiding. Finally, the original data is transformed in such a way that direct and/or indirect discriminatory biases are removed, while maintaining data quality and reducing information loss.
Keywords — Anti-discrimination, Data mining,
Direct discrimination, Indirect discrimination,
Association rules, Privacy.
I. INTRODUCTION
Discrimination is treating or considering a person or thing differently, either in favor of or against them, based on the group, kind, or type to which that person or thing belongs rather than on individual merit. Data discrimination deals with selecting information about an individual that may be of interest to the service provider, and treating that individual unfairly on the basis of membership in certain groups, defined by race, caste, creed, nationality, gender, etc. Such discriminatory acts are punishable by law in many democratic countries.
With the advent of big data and data mining, large amounts of data are collected automatically and decision-making processes are carried out easily with the help of computers. Discrimination poses a serious threat in this setting: it leads to unfair treatment by denying opportunities to specific groups in areas such as staff selection, health insurance, home loans, and education. This threat can be kept at bay using technology that aids discrimination prevention. Automated data collection and decision making are done using association/classification rule mining techniques. However, the classification rules depend on the training data set. If the training data is biased, the resulting rules will also have discriminatory characteristics.
ISSN: 2231-5381
There are two types of discrimination: direct and indirect. Direct discrimination involves bias based on discriminatory attributes in the data set. Indirect discrimination involves the use of non-discriminatory attributes that are correlated with discriminatory information. This type of discrimination typically relies on some background knowledge. For example, suppose a bank manager wants to grant loans only to people from his own region (e.g., Tamil Nadu) and has access only to each applicant's language and pin code (e.g., Tamil, 605003). With the help of an external source or existing knowledge, he can easily infer that people who speak Tamil are likely to be from Tamil Nadu. This effect is known as redlining, and such indirectly discriminating rules are known as redlining rules. In this paper, we also deal with the prevention of discrimination based on redlining rules.
II. RELATED WORK
The discovery of discriminatory decisions was first proposed by Pedreschi et al. [1], [3]. The approach was based on mining classification rules (the inductive part) and reasoning on them (the deductive part), using quantitative measures of discrimination that formalize legal definitions of discrimination. Existing approaches consider each rule individually when measuring discrimination, without considering other rules or the relations between them. In the proposed system, we also take into account the relations between classification rules for discrimination discovery, based on the occurrence of discriminatory item sets. Discrimination prevention consists of inducing patterns that do not lead to discriminatory decisions even when the original training data set is biased. The following three approaches are commonly followed:
http://www.ijettjournal.org
1. Preprocessing: The source data is transformed in such a way that the discriminatory biases present in the original data are removed. Fair decision rules can then be mined from the transformed data using any of the known mining algorithms. Data transformation and hierarchy-based generalization have been adapted so far from the privacy preservation literature [4], [5]. Preprocessing is necessary for applications where data mining is performed by external third parties rather than by the data holder.
2. In-processing: The mining algorithm itself is changed so that the resulting models do not contain discriminatory decision rules. In-processing discrimination prevention techniques must therefore rely on special-purpose data mining algorithms. For example, an alternative approach to cleaning the discrimination from the original data set is proposed in [2], whereby a non-discrimination constraint is embedded into a decision tree learner by changing its splitting criterion and pruning strategy through a novel leaf relabeling approach.
3. Post-processing: Instead of transforming the original data set or changing the data mining algorithm, the resulting data mining models are modified. In [6], a confidence-altering approach is applied to classification rules inferred by the CPAR algorithm. Post-processing does not give the authority to publish the data set itself.
III. DISCRIMINATION ANALYSIS
A. Basic Definitions
A data set consists of records and their attributes. Let DB be an original data set.
An item is an attribute together with its value, e.g., Sex = Female.
An item set X is a collection of one or more items, e.g., {Zip = 12345, City = NYC}.
A classification rule is an expression X → C, where X is an item set containing no class item and C is a class item (a yes/no decision), e.g., {Race = black, City = NYC} → Loan = No. X is known as the premise of the rule.
The support of an item set, supp(X), is the fraction of records that contain the item set X. A rule X → C is said to be supported by a record if both X and C appear in the record.
The confidence of a classification rule, conf(X → C), describes how often the class item C appears in the records that contain X. Hence, if supp(X) > 0 then
$$\mathrm{conf}(X \rightarrow C) = \frac{\mathrm{supp}(X, C)}{\mathrm{supp}(X)}$$
A frequent classification rule is a classification rule whose support and confidence are higher than the specified lower bounds. Here support is a measure of statistical significance, and confidence is a measure of the strength of the rule. The frequent classification rules considered in this paper are those extracted from DB.
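As an illustration of the definitions above, the following is a minimal Python sketch of support and confidence over a small in-memory data set. The record layout (a list of attribute-to-value dictionaries), the function names `count`, `supp`, and `conf`, and the four toy records are assumptions made for this example; they do not come from the paper's implementation.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]      # (attribute, value), e.g. ("Sex", "Female")
Record = Dict[str, str]     # one row of DB: attribute -> value


def count(db: List[Record], itemset: FrozenSet[Item]) -> int:
    """Number of records in db that contain every item of the item set."""
    return sum(all(rec.get(a) == v for a, v in itemset) for rec in db)


def supp(db: List[Record], itemset: FrozenSet[Item]) -> float:
    """supp(X): fraction of records in db that contain X."""
    return count(db, itemset) / len(db) if db else 0.0


def conf(db: List[Record], premise: FrozenSet[Item], class_item: Item) -> float:
    """conf(X -> C) = supp(X, C) / supp(X), defined only when supp(X) > 0.

    The ratio of supports equals the ratio of raw counts, so counts are used.
    """
    covered = count(db, premise)
    if covered == 0:
        raise ValueError("conf is undefined when supp(X) = 0")
    return count(db, premise | {class_item}) / covered


# Hypothetical toy data set (not taken from the paper).
DB: List[Record] = [
    {"Sex": "Female", "City": "NYC", "Loan": "No"},
    {"Sex": "Female", "City": "NYC", "Loan": "Yes"},
    {"Sex": "Male", "City": "NYC", "Loan": "Yes"},
    {"Sex": "Female", "City": "LA", "Loan": "No"},
]
X = frozenset({("Sex", "Female"), ("City", "NYC")})

print(supp(DB, X))                   # 2 of 4 records contain X
print(conf(DB, X, ("Loan", "No")))   # 1 of the 2 records containing X has Loan = No
```

In this toy data set, X = {Sex = Female, City = NYC} appears in two of the four records, and one of those two records has Loan = No.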
B. PD and PND Rules
Assume that the discriminatory items in DB are predetermined (e.g., Sex = Female, Race = black). A frequent classification rule then belongs to one of the following two classes, depending on the discriminatory and non-discriminatory items it contains:
i. A classification rule X → C is potentially discriminatory (PD) when X = A, B, where A is a non-empty discriminatory item set and B is a non-discriminatory item set.
ii. A classification rule X → C is potentially non-discriminatory (PND) when X = D, B is a non-discriminatory item set.
The phrase “potentially” indicates that a PD rule
could probably result in discriminatory decisions, so
some actions are required to evaluate discrimination
potential (direct discrimination). Also, a PND rule
may result in discriminatory decisions if put together
with some background information (indirect
discrimination).
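The PD/PND distinction depends only on whether the premise of a rule contains at least one predetermined discriminatory item. A minimal sketch follows, assuming rules are represented by their premise as a set of (attribute, value) pairs; the names `DISCRIMINATORY_ITEMS`, `split_premise`, and `rule_class` are hypothetical and introduced only for this example.

```python
from typing import FrozenSet, Tuple

Item = Tuple[str, str]  # (attribute, value)

# Predetermined discriminatory items, using the examples given in the text.
DISCRIMINATORY_ITEMS: FrozenSet[Item] = frozenset(
    {("Sex", "Female"), ("Race", "black")}
)


def split_premise(
    premise: FrozenSet[Item],
    discriminatory: FrozenSet[Item] = DISCRIMINATORY_ITEMS,
) -> Tuple[FrozenSet[Item], FrozenSet[Item]]:
    """Split a premise X into (A, B): discriminatory and non-discriminatory parts."""
    a = premise & discriminatory
    return a, premise - a


def rule_class(
    premise: FrozenSet[Item],
    discriminatory: FrozenSet[Item] = DISCRIMINATORY_ITEMS,
) -> str:
    """Return "PD" if the premise contains a discriminatory item, else "PND"."""
    a, _ = split_premise(premise, discriminatory)
    return "PD" if a else "PND"


print(rule_class(frozenset({("Race", "black"), ("City", "NYC")})))   # PD
print(rule_class(frozenset({("Zip", "12345"), ("City", "NYC")})))    # PND
```

A rule labelled PND here can still take part in indirect discrimination; detecting that requires background knowledge, which this sketch does not model.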
C. Discrimination Discovery and Measurement
Direct and indirect discrimination discovery [7] involves identifying α-discriminatory rules and redlining rules, respectively. First, using the predetermined discriminatory items in DB, we obtain frequent classification rules, each of which is either a PD or a PND rule. Second, direct discrimination is measured by finding the α-discriminatory rules among the PD rules using a discriminatory threshold value and a direct discrimination measure (Fig. 3.1). Third, indirect discrimination is measured by finding the redlining rules among the PND rules [8], combined with some background knowledge, using a discriminatory threshold value and an indirect discrimination measure (Fig. 3.2).
The degree of discrimination of a PD rule is measured using the extended lift (elift) introduced by Pedreschi et al. [1], [3]:
$$\mathrm{elift}(A, B \rightarrow C) = \frac{\mathrm{conf}(A, B \rightarrow C)}{\mathrm{conf}(B \rightarrow C)}$$
Using the elift value and a threshold value, we can determine whether a rule is discriminatory. Let α ∈ R be a fixed threshold, let A be a non-empty discriminatory item set, and let B be a non-discriminatory item set. A PD classification rule c: A, B → C is α-protective with respect to elift if elift(c) < α. Otherwise, c is α-discriminatory.
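To make the α-protection test concrete, the sketch below computes elift as the ratio conf(A, B → C) / conf(B → C) and compares it with a threshold α. The helper names, the record layout, and the eight toy records are assumptions for illustration only; confidence is computed from raw counts, which is equivalent to the ratio of supports.

```python
from typing import Dict, FrozenSet, List, Tuple

Item = Tuple[str, str]      # (attribute, value)
Record = Dict[str, str]     # attribute -> value


def count(db: List[Record], itemset: FrozenSet[Item]) -> int:
    """Number of records containing every item of the item set."""
    return sum(all(rec.get(a) == v for a, v in itemset) for rec in db)


def conf(db: List[Record], premise: FrozenSet[Item], class_item: Item) -> float:
    """conf(X -> C); raises ValueError when no record contains X."""
    covered = count(db, premise)
    if covered == 0:
        raise ValueError("conf is undefined when supp(X) = 0")
    return count(db, premise | {class_item}) / covered


def elift(db: List[Record], a: FrozenSet[Item], b: FrozenSet[Item],
          class_item: Item) -> float:
    """elift(A, B -> C) = conf(A, B -> C) / conf(B -> C).

    Raises ZeroDivisionError if conf(B -> C) is 0.
    """
    return conf(db, a | b, class_item) / conf(db, b, class_item)


def is_alpha_protective(db: List[Record], a: FrozenSet[Item], b: FrozenSet[Item],
                        class_item: Item, alpha: float) -> bool:
    """A PD rule is alpha-protective when elift < alpha, else alpha-discriminatory."""
    return elift(db, a, b, class_item) < alpha


# Hypothetical toy data: 8 NYC applicants (not taken from the paper).
DB: List[Record] = (
    [{"Sex": "Female", "City": "NYC", "Loan": "No"}] * 3
    + [{"Sex": "Female", "City": "NYC", "Loan": "Yes"}] * 1
    + [{"Sex": "Male", "City": "NYC", "Loan": "No"}] * 1
    + [{"Sex": "Male", "City": "NYC", "Loan": "Yes"}] * 3
)
A = frozenset({("Sex", "Female")})   # discriminatory item set
B = frozenset({("City", "NYC")})     # non-discriminatory item set
C = ("Loan", "No")                   # class item

print(elift(DB, A, B, C))                           # 0.75 / 0.5
print(is_alpha_protective(DB, A, B, C, alpha=1.2))  # False: alpha-discriminatory
```

Here conf(A, B → C) = 3/4 and conf(B → C) = 4/8, so adding the discriminatory item raises the confidence by a factor of 1.5. With α = 1.2 the rule is α-discriminatory, while with α = 2.0 it would be α-protective.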
The idea of α-protection [9] is introduced in order to measure the "disproportionate burdens" that a rule imposes, since a PD rule does not always provide evidence of discriminatory actions. It gives a measure of the discriminatory power of a PD classification rule. The idea is to define such a measure as the relative gain in confidence of the rule due to the presence of the discriminatory item sets. The α parameter is the key for adjusting the level of protection against discrimination. PD classification rules are produced from a data set containing discriminatory item sets. In this paper, we propose a genetic algorithm to improve the rule extraction from the history of the data set for association rule hiding.
[Fig. 3.1: Modeling the process of direct discrimination control. Labels: data set with discriminatory item sets; PD rules A, B → C; check direct discrimination; discriminatory PD rules A, B → C.]
[Fig. 3.2: Modeling the process of indirect discrimination control. Labels: PND rules D, B → C; background knowledge; background rules A, B → D and D, B → A; check indirect discrimination through an inference model; discriminatory PD rules A, B → C.]
IV. PROPOSED APPROACH
A. System Architecture
In the proposed work, we prevent discrimination using a post-processing approach. We propose a genetic algorithm for optimized discrimination discovery. The results of the genetic algorithm are given as input to the CPAR (Classification based on Predictive Association Rules) algorithm for discrimination prevention, as shown in Fig. 4.1. CPAR combines the advantages of both associative classification and traditional rule-based classification, building on FOIL and PRM. We now describe the algorithmic strategy of the proposed genetic algorithm and CPAR, which are used for discrimination discovery and prevention in the post-processing approach.
[Fig. 4.1: System Architecture. Labels: original database; post-processing (genetic algorithm, CPAR algorithm); data sets without discriminatory item sets; mined data.]
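The genetic algorithm stage of this architecture can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation (which the paper reports was written in Java): a candidate rule is modelled as a set of (attribute, value) items; fitness follows the paper's description, the number of matched records in the history divided by the rule size; selection keeps the fitter half of the population and always carries the best rule forward, which is a simplification of the selection step described in the paper. All function names, parameter defaults, and the toy history are hypothetical.

```python
import random
from typing import Dict, FrozenSet, List, Optional, Tuple

Item = Tuple[str, str]      # (attribute, value)
Rule = FrozenSet[Item]      # a candidate rule, modelled as an item set
Record = Dict[str, str]


def fitness(rule: Rule, history: List[Record]) -> float:
    """Matched records in the history divided by the rule size."""
    if not rule:
        return 0.0
    matched = sum(all(rec.get(a) == v for a, v in rule) for rec in history)
    return matched / len(rule)


def random_rule(items: List[Item], rng: random.Random) -> Rule:
    """Random initial individual with 1 to 3 items."""
    return frozenset(rng.sample(items, rng.randint(1, min(3, len(items)))))


def crossover(a: Rule, b: Rule, rng: random.Random) -> Rule:
    """Uniform crossover: keep each item of either parent with probability 0.5."""
    child = frozenset(i for i in sorted(a | b) if rng.random() < 0.5)
    return child or a   # never return an empty rule


def mutate(rule: Rule, items: List[Item], rng: random.Random,
           rate: float = 0.2) -> Rule:
    """With probability `rate`, add or remove one randomly chosen item."""
    if rng.random() >= rate:
        return rule
    return frozenset(rule ^ {rng.choice(items)}) or rule


def evolve(history: List[Record], generations: int = 50, pop_size: int = 20,
           target: Optional[float] = None, seed: int = 0) -> Tuple[Rule, float]:
    """Run the GA until `generations` is reached or fitness >= `target`."""
    rng = random.Random(seed)
    items = sorted({(a, v) for rec in history for a, v in rec.items()})
    population = [random_rule(items, rng) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda r: fitness(r, history), reverse=True)
        if target is not None and fitness(ranked[0], history) >= target:
            break
        parents = ranked[: pop_size // 2]   # the fitter half become parents
        children = [ranked[0]]              # elitism: keep the best rule
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, rng), items, rng))
        population = children
    best = max(population, key=lambda r: fitness(r, history))
    return best, fitness(best, history)


# Hypothetical history of loan decisions (not taken from the paper).
HISTORY: List[Record] = [
    {"Sex": "Female", "City": "NYC", "Loan": "No"},
    {"Sex": "Female", "City": "NYC", "Loan": "No"},
    {"Sex": "Male", "City": "NYC", "Loan": "Yes"},
    {"Sex": "Male", "City": "LA", "Loan": "Yes"},
]
best_rule, best_score = evolve(HISTORY, seed=1)
print(sorted(best_rule), best_score)
```

Because this fitness divides by the rule size, it favours short rules that match many records; a practical system would likely combine it with support and confidence thresholds before passing the rules on to CPAR.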
B. Algorithm Strategies
In a genetic algorithm [10], a population of candidate solutions (called individuals) to an optimization problem is evolved toward better solutions. Each candidate solution has a set of attributes that can be mutated and altered. The evolution usually begins with a population of randomly generated individuals (here, candidate rules). The population in each iteration is called a generation. In each generation, the fitness of every individual in the population is evaluated; the fitness is the value of the objective function of the optimization problem being solved. The fitter individuals are stochastically selected from the current population, and each selected individual's representation is modified (recombined and possibly randomly mutated) to form a new generation, which is then used in the next iteration of the algorithm. The algorithm terminates when a maximum number of generations has been produced or a desired fitness level has been reached for the population.
A typical genetic algorithm has the following requirements:
1. A genetic representation of the solution domain,
2. A fitness function to evaluate the solution domain.
The fitness function is defined over the genetic representation and measures the quality of the represented solution. It is generally problem dependent. In some problems it is difficult or impossible to define a fitness expression; in such cases, a simulation may be used to determine the fitness value of a phenotype, or interactive genetic algorithms may be used. Here the fitness function is based on a rank, calculated as the ratio of the number of matched records from the history to the rule size.
Finally, the output of the genetic algorithm is given as input to the CPAR algorithm, which prevents both direct and indirect discrimination. CPAR follows a greedy approach to generate rules directly from the training data, instead of generating a large number of candidate rules as in associative classification. Moreover, CPAR generates and tests more rules than traditional rule-based classifiers to avoid missing important rules. To avoid overfitting, CPAR uses expected accuracy to evaluate each rule and uses the best k rules in prediction. The following steps are required when implementing the CPAR algorithm:
1. Generation of rules
2. Estimation of the accuracy of the rules
3. Classification and result analysis
The experimental results show that the proposed algorithm is successful in both goals: discovering discrimination and removing discrimination.
V. EXPERIMENTAL RESULTS
The algorithm was implemented in JDK 1.7 using JavaServer Pages. The tests were performed on a 2.4 GHz Pentium IV machine, equipped with 512 MB of RAM and running Windows 7 Professional. The genetic algorithm gives the best matched rules from the history of the data sets. The crossover and mutation steps produce higher quality rules. As the number of iterations increases, the match value also increases, as shown in Fig. 5.1. These rules are given as input to the discrimination prevention algorithm and can prevent both direct and indirect discrimination.
[Fig. 5.1: Performance graph. x-axis: Generation (0 to 1200); y-axis: Fitness (0 to 100).]
By using the CPAR algorithm for discrimination prevention, strong rules are generated without any discriminatory items. This overcomes the disadvantages of the preprocessing method. The post-processing method maintains the quality of the data set and keeps the data set free of discriminating rules.
VI. CONCLUSION
Discrimination is an important issue in the processing of loans in the banking sector. We have combined two well-known algorithms to prevent both direct and indirect discrimination in data sets. As future work, we will explore measures of discrimination different from the ones considered in this paper, along with privacy preservation in data mining. The genetic algorithm prioritizes the most suitable item sets from the data set for easy classification. The CPAR algorithm offers high accuracy and efficiency, which can be credited to its distinguishing features, and represents a new approach towards efficient, high-quality classification. By implementing the proposed method, we achieve discrimination prevention through post-processing.
ACKNOWLEDGEMENT
We thank our colleagues for the support and help they provided from time to time. We also thank our institution for its support in every aspect.
REFERENCES
[1] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift
architectures for three-layer scalable DTV decoding,” IEEE
Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug.
1998.
[2] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed.,
Plenum Press: New York, 1995, pp. 613-651.
[3] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift
architectures for three-layer scalable DTV decoding,” IEEE
Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug.
1998.
[4] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed.,
Plenum Press: New York, 1995, pp. 613-651.
[5] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift
architectures for three-layer scalable DTV decoding,” IEEE
Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug.
1998.
[6] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed.,
Plenum Press: New York, 1995, pp. 613-651.
[7] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift
architectures for three-layer scalable DTV decoding,” IEEE
Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug.
1998.
[8] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed.,
Plenum Press: New York, 1995, pp. 613-651.
[9] A. Vetro, H. Sun, P. DaGraca, and T. Poon, “Minimum drift
architectures for three-layer scalable DTV decoding,” IEEE
Trans. Consumer Electron., vol. 44, no. 3, pp. 527-536, Aug.
1998.
[10] A. N. Netravali and B. G. Haskell, Digital Pictures, 2nd ed.,
Plenum Press: New York, 1995, pp. 613-651.