International Journal on Advanced Computer Theory and Engineering (IJACTE)
________________________________________________________________________________________________
Bayesian Classification (Gender) Using Simulation Technique
1Samar Ballav Bhoi and 2Munesh Chandra Adhikary
1Directorate of Distance & Continuing Education (DDCE), F.M. University, Balasore
2P G Dept. of Applied Physics & Ballistics (APAB), F.M. University, Balasore
E-mail: samarbhoi@yahoo.co.in, mcadhikary@gmail.com
Abstract: Bayesian classification is a statistical classification technique, prominent in emerging areas of information technology, which predicts the probability that a given sample is a member of a particular class. It is based on Bayes' theorem, and it shows good accuracy and speed when applied to large databases. Bayesian classification is both a supervised learning method and a statistical method for classification. It assumes an underlying probabilistic model, which allows us to capture uncertainty about the model in a principled way by determining the probabilities of the outcomes, and it can solve both diagnostic and predictive problems. The technique is named after Thomas Bayes (1702-1761), who proposed Bayes' theorem. Bayesian classification provides practical learning algorithms in which prior knowledge and observed data can be combined, and it offers a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypotheses and is robust to noise in input data. A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions; a more descriptive term for the underlying probability model would be "independent feature model". An overview of statistical classifiers is given in the article on pattern recognition. A Bayesian network, or belief network, represents the dependencies between variables to provide a succinct specification of a joint probability distribution. The network is a directed graph in which nodes are random variables; directed links connect pairs of nodes, signifying which nodes have a direct effect on which other nodes; each node has a conditional probability table representing the quantifiable impact each parent node has on the child node's value; and the graph contains no directed cycles. The links, representing direct conditional dependency between nodes, and the probabilities coupled with each link, are typically established by subject matter experts (SMEs). Uncertainties can be applied to each node to make runs stochastic; a deterministic run is executed when a child node's values are derived exclusively from the inputs of the node's parent(s). A Bayesian network can reason from effects to causes (diagnostic inference), from causes to effects (causal inference), between causes of a common effect (intercausal inference), or by combining two or more of the above (mixed inference). One of the obstacles in producing a Bayesian network comes from the inability of SMEs to ascertain all the nodes and directed links essential for an implementation in a particular domain. Finally, determining the probability weights for each link is often considered the most complex phase of creating and modifying a Bayesian network. Data mining is an important area of AI in which we can classify the gender of data items using a simulation process/technique. The objective of this paper is to present some of the fundamental techniques used in data mining. The paper gives a brief overview of data mining as well as the application of data mining techniques to the real world using a simulation technique. The topic is the classification of gender, using Bayesian classification, on new data sets of random numbers representing height, weight and foot size. Bayesian reasoning is applied to decision making and to inferential statistics that deal with probabilistic inference; it uses the knowledge of prior events to predict future events.
Keywords: Bayesian Network, Attributes, posterior
probability, prior probability, data mining, extraction,
simulation, classification, SMEs, KDD, data items, random
number.
I. INTRODUCTION
From an academic viewpoint, a data warehouse, as defined by W. H. Inmon, is a subject-oriented, integrated, time-variant, non-volatile collection of operational data that supports management decision-making. It is a tool that manages data after and outside of the operational system. Data warehousing technology has evolved in business applications to support the process of strategic decision making, and it may be considered a key component of an organization's IT strategy and architecture. For example, an electricity billing company, by analyzing the data of a data warehouse, can predict fraud and reduce the cost of such investigations; in fact, this technology has such great potential that any company possessing proper analysis tools can benefit from it. Thus a data warehouse supports Business Intelligence. Presently, data warehousing and data mining are used in industries such as banking, airlines, hospitals, and investment and insurance, often through a simulation technique (1). Data mining emphasizes three techniques:
a) Classification
b) Clustering
c) Association Rules

ISSN (Print): 2319-2526, Volume-3, Issue-4, 2014

Data Mining vs. Knowledge Discovery in Databases (KDD)
Knowledge Discovery in Databases (KDD) is the process of finding useful information, knowledge and patterns in data, while data mining is the process of using algorithms to automatically extract the desired information and patterns, which are derived by the KDD process. The different steps of KDD are as follows:
(a) Extraction: obtains data from various data sources.
(b) Preprocessing: cleanses the data that has been extracted in the previous step.
(c) Transformation: converts the data into a common format by applying some technique.
(d) Data Mining: automatically extracts information/patterns/knowledge.
(e) Interpretation/Evaluation: presents the results obtained through data mining to the users in an easily understandable and meaningful format.
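The five steps above can be sketched as a minimal pipeline; the function names, the toy records and the height-threshold "pattern" are illustrative assumptions, not part of the paper:

```python
# A minimal sketch of the KDD process: one function per step
# (E, P, T, DM, I). The toy data and the mined "pattern" are
# illustrative only.

def extract():                        # (a) Extraction
    return [" M,200 ", "F,160", None, "M,210"]

def preprocess(raw):                  # (b) Preprocessing: cleanse
    return [r.strip() for r in raw if r]

def transform(clean):                 # (c) Transformation: common format
    records = []
    for r in clean:
        gender, height = r.split(",")
        records.append({"gender": gender, "height": int(height)})
    return records

def data_mine(records):               # (d) Data Mining: extract patterns
    # toy pattern: label each record Tall/Short by a height threshold
    return [(r["gender"], "Tall" if r["height"] >= 185 else "Short")
            for r in records]

def interpret(patterns):              # (e) Interpretation/Evaluation
    return ["%s -> %s" % p for p in patterns]

print(interpret(data_mine(transform(preprocess(extract())))))
```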
Fig. 1: The KDD process (E: Extraction, P: Preprocess, T: Transform, DM: Data Mining, I: Interpretation, KP: Knowledge Pattern; the flow runs from the initial/target data through preprocessed and transformed data to the mined model).
Classification: The classification task maps data into predefined groups or classes. Given a database/dataset D = {t1, t2, ..., tn} and a set of classes C = {C1, ..., Cm}, the classification problem is to define a mapping f: D -> C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes specified in the set C. A few simple examples of classification:
• Teachers classify students' marks into a set of grades such as A, B, C, D, or F.
• The heights of a set of persons are classified into the classes tall, medium or short.
Classification Approach: The basic approaches to classification are:
(a) Create a specific model by evaluating the training data; this is basically old data that has already been classified using the knowledge of domain experts.
(b) Apply the model so developed to the new data.
Some of the most common techniques used for classification include decision trees, neural networks, etc. Most of these techniques are based on finding distances or use statistical methods.
Classification Using Distance (K-Nearest Neighbors, KNN): This approach places items in the class to which they are "closest" to their neighbors. It must determine the distance between an item and a class. Classes are represented by a centroid (central value) and the individual points. One of the algorithms used is K-Nearest Neighbors (2). Some basic points about this algorithm:
1. The training set includes classes along with other attributes (please refer to the training data given in Table-1 below).
2. The value of K defines the number of near items (items that have the smallest distance on the attributes of concern) to be used from the given set of training data (recall that training data is already classified data). This is explained in point (2) of the following example.
3. A new item is placed in the class in which the largest number of close items is placed (please refer to point (3) in the following example).
4. The value of K should be <= sqrt(number_of_training_items).   -(1)
However, to limit the size of the sample data in our example, we have not followed this formula.
Example: Consider the following data, which tells us the
person’s class depending upon gender and height.
Table-1:
Name      Gender  Height (cm)  Class
Sumitra   F       160          Short
Ananda    M       200          Tall
Ranita    F       190          Medium
Radhika   F       188          Medium
Dally     F       170          Short
Arun      M       185          Medium
Shellina  F       160          Short
Arabinda  M       170          Short
Sachin    M       220          Tall
Manoj     M       210          Tall
Sulekha   F       180          Medium
Anil      M       195          Medium
Karisma   F       190          Medium
Sabita    F       180          Medium
Sipra     F       175          Medium
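The KNN procedure described above can be sketched on the Table-1 data; the helper name knn_classify and the restriction to a height-only Manhattan distance are simplifications made for this sketch:

```python
# KNN sketch on the Table-1 data, using Manhattan distance on the
# height attribute only, as in the example below.
from collections import Counter

training = [
    ("Sumitra", "F", 160, "Short"),  ("Ananda", "M", 200, "Tall"),
    ("Ranita", "F", 190, "Medium"),  ("Radhika", "F", 188, "Medium"),
    ("Dally", "F", 170, "Short"),    ("Arun", "M", 185, "Medium"),
    ("Shellina", "F", 160, "Short"), ("Arabinda", "M", 170, "Short"),
    ("Sachin", "M", 220, "Tall"),    ("Manoj", "M", 210, "Tall"),
    ("Sulekha", "F", 180, "Medium"), ("Anil", "M", 195, "Medium"),
    ("Karisma", "F", 190, "Medium"), ("Sabita", "F", 180, "Medium"),
    ("Sipra", "F", 175, "Medium"),
]

def knn_classify(height, k=5):
    # sort the training tuples by |height difference| and take a
    # majority vote among the k nearest
    nearest = sorted(training, key=lambda t: abs(t[2] - height))[:k]
    votes = Counter(t[3] for t in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify(210))  # the tuple <Manoj, M, 210> -> Tall
```

For height 210 and K = 5, the five nearest heights are 210, 200, 220, 195 and 190, giving three Tall votes against two Medium, hence Tall.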
Questions:
(1) Let us classify the tuple <Ananda, M, 200> from the training data.
(2) Let us classify the tuple <Manoj, M, 210> from the training data.
(3) Let us take only the height attribute for the distance calculation and suppose K = 5; then the following are the five tuples nearest to the data to be classified (using Manhattan distance as the measure on the height attribute).
Table-2:
Name     Gender   Height   Class
Sumita   F        160cm    ? (Short)
Jully    F        170cm    ? (Short)
Shelly   F        160cm    ? (Short)
Avinash  M        170cm    ? (Short)

Answer: 1) The tuple <Chandan, M, 160> can be classified into the short class.
2) The tuple <Manoj, M, 210> can be classified into the tall class.

Classify the data using the simulation technique: Consider the following data, in which the position attribute acts as the class. We wish to determine which posterior is greater, male or female. For the classification as male the posterior is given by

Posterior(male) = p(m) p(h|m) p(w|m) p(fs|m) / evidence   -(2)

For the classification as female the posterior is given by

Posterior(female) = p(f) p(h|f) p(w|f) p(fs|f) / evidence   -(3)

The evidence (also termed the normalizing constant) may be calculated (14) as:

evidence = p(m) p(h|m) p(w|m) p(fs|m) + p(f) p(h|f) p(w|f) p(fs|f)   -(4)

However, given the sample, the evidence is a constant and thus scales both posteriors equally. It therefore does not affect classification and can be ignored. We now determine the probability distribution for the sex of the sample. The example training set is given in Table-3.
sex     height (feet)   weight (lbs)   foot size (inches)
Male    6.00 (6'0")     181            12
Male    5.92 (5'11")    192            11
Male    5.58 (5'7")     172            12
Male    5.92 (5'11")    166            10
Female  5.00 (5'0")     102            6
Female  5.50 (5'6")     151            8
Female  5.42 (5'5")     129            7
Female  5.75 (5'9")     153            9
Male    5.91 (5'11")    168            10
Female  5.44 (5'5")     128            8

For a continuous attribute such as height, the class-conditional likelihood is modelled with a normal density:

p(h|m) = (1 / sqrt(2*pi*sigma^2)) * exp(-(h - mu)^2 / (2*sigma^2))   -(5)

where mu and sigma^2 are the parameters of the normal distribution, which have been previously determined from the training set. Note that a value greater than 1 is acceptable here; it is a probability density rather than a probability, because height is a continuous variable.
The classifier created from the training set under a Gaussian distribution assumption (11) would be (the variances are population variances computed from the training data): Table-4

sex     mean (height)  variance (height)  mean (weight)  variance (weight)  mean (foot size)  variance (foot size)
male    5.866          2.1504e-02         175.8          0.9216e+02         11.00             0.8e+00
female  5.422          5.8416e-02         132.6          3.4504e+02         7.60              0.104e+01
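The Table-4 parameters can be checked directly against the Table-3 rows. A small verification sketch (the tabulated values match the population variance, i.e. dividing by n rather than n-1, so ddof = 0 is used here):

```python
# Reproduce the Table-4 parameters from the Table-3 training set.
data = {
    "male":   {"height": [6.00, 5.92, 5.58, 5.92, 5.91],
               "weight": [181, 192, 172, 166, 168],
               "foot":   [12, 11, 12, 10, 10]},
    "female": {"height": [5.00, 5.50, 5.42, 5.75, 5.44],
               "weight": [102, 151, 129, 153, 128],
               "foot":   [6, 8, 7, 9, 8]},
}

def mean(xs):
    return sum(xs) / len(xs)

def var(xs):
    # population variance (ddof = 0), matching the tabulated values
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

for sex, feats in data.items():
    for name, xs in feats.items():
        print(sex, name, round(mean(xs), 4), round(var(xs), 6))
```

Running this reproduces, for example, mean(height|male) = 5.866 and variance(height|male) = 0.021504.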
Let us say we have equiprobable classes, so P(male) = P(female) = 0.5. This prior distribution might be based on our knowledge of frequencies in the larger population, or on the frequency in the training set.

Testing: Below is a sample to be classified as male or female. Table-5:

sex      height (feet)  weight (lbs)  foot size (inches)
sample   6              130           8

The posterior numerators of equations (2) and (3) are evaluated with the Gaussian densities of equation (5) and the parameters of Table-4.   -(6)

Since the posterior numerator is greater in the female case, we predict that the sample is female.
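The whole test computation, equations (2)-(6), can be sketched as a Gaussian naive Bayes evaluation; the dictionary layout and function names are this sketch's choices, while the parameters are those of Table-4:

```python
import math

# Gaussian naive Bayes posterior numerators for the Table-5 sample
# (height = 6 ft, weight = 130 lbs, foot size = 8 in), using the
# Table-4 parameters and equal priors P(male) = P(female) = 0.5.

def gaussian(x, mu, var):
    """Normal probability density, equation (5); may exceed 1."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

params = {
    "male":   {"height": (5.866, 2.1504e-02),
               "weight": (175.8, 0.9216e+02),
               "foot":   (11.00, 0.8)},
    "female": {"height": (5.422, 5.8416e-02),
               "weight": (132.6, 3.4504e+02),
               "foot":   (7.60, 1.04)},
}
sample = {"height": 6.0, "weight": 130.0, "foot": 8.0}
prior = {"male": 0.5, "female": 0.5}

numerator = {}
for sex in params:
    p = prior[sex]
    for feat, (mu, var) in params[sex].items():
        p *= gaussian(sample[feat], mu, var)   # equations (2)/(3), numerator only
    numerator[sex] = p

print(numerator)
print("prediction:", max(numerator, key=numerator.get))  # female
```

Because the evidence term (4) is common to both classes, comparing the numerators is sufficient; the female numerator dominates, matching the prediction above.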
II. METHOD (SIMULATION):
Simulation (3) is the process of designing a model of a real system and conducting experiments with this model for the purpose of understanding the behaviour of the operation of the system. The mathematical steps for the Monte Carlo Method (MCM) are:
1. Set up a probability distribution for the variables to be analyzed.
2. Construct a cumulative probability distribution for each random variable.
3. Generate random numbers within the range from 00 to 99 or more.
4. Conduct the simulation experiment (4) by means of random sampling.
5. Repeat step 4 until the required number of simulation runs has been generated.
6. Design and implement a course of action and maintain control.

Characteristics of the naive Bayes classifier:
• Classification, estimation, prediction
• Used for large data sets
• Very easy to construct
• Does not use complicated iterative parameter estimations
• Often does surprisingly well
• May not be the best possible classifier
• Robust and fast; it can usually be relied on in many applications
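The Monte Carlo steps above can be sketched as follows; the interval labels and the choice of the female probability column from Table-6 are assumptions made for this illustration:

```python
import random

# Monte Carlo sampling sketch: map random numbers onto height/weight
# classes through a cumulative probability distribution (steps 1-5).

# Step 1: (interval, probability) pairs, female column of Table-6
distribution = [
    ("5.0-5.5 ft / 100-125 lbs", 0.30),
    ("5.5-6.0 ft / 125-150 lbs", 0.20),
    ("6.0-6.5 ft / 150-175 lbs", 0.25),
    ("6.5-7.5 ft / 175-200 lbs", 0.25),
]

# Step 2: cumulative distribution, as in Table-7
cumulative = []
total = 0.0
for label, p in distribution:
    total += p
    cumulative.append((total, label))

def classify(rn, scale=10):
    """Map a random number in [0, scale) to its interval (Table-7 ranges)."""
    u = rn / scale
    for cum, label in cumulative:
        if u < cum:
            return label
    return cumulative[-1][1]

# Steps 3-5: generate random numbers and run the experiment
random.seed(0)
for _ in range(5):
    rn = random.randint(0, 9)
    print(rn, "->", classify(rn))
```

With ten possible random numbers, the ranges induced by the cumulative distribution match Table-7 (00-02, 03-04, 05-07, 08-09).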
Using the above data (from Table-3), we classify the gender with different position attributes.

Probability distribution for sex over height, weight and foot sizes: Table-6

Height (feet)   Weight (lbs)   Probability (F)   Probability (M)
5.0-5.5         100-125        0.30              0.26
5.5-6.0         125-150        0.20              0.24
6.0-6.5         150-175        0.25              0.20
6.5-7.5         175-200        0.25              0.30

Using the random numbers 6, 5.3, 6.6, 5.2 and 5.9 to classify the gender, generate the range of random numbers using the cumulative probability. Table-7

Height (feet)   Weight (lbs)   Probability   Cumulative prob.   Range of RN
5.0-5.5         100-125        0.30          0.30               00-02
5.5-6.0         125-150        0.20          0.50               03-04
6.0-6.5         150-175        0.25          0.75               05-07
6.5-7.5         175-200        0.25          1.00               08-09

Calculation of the gender class using the random numbers: Table-8

Random No   Sex      Height   Foot size
6.0         Male     181      12
5.3         Female   129      7
6.6         Male     192      11
5.2         Female   129      7
5.9         Male     168      10

III. RESULTS DISCUSSION OF THE ABOVE METHOD:
After analyzing the test, the method classifies the given data set into the proper gender classes of the items. The method is therefore also applicable to other attributes of human beings for classifying data. I obtained results about gender using the simulation technique that are approximately in agreement with previous methods. We can apply the technique of simulation (7) to data mining problems. Some of the applications of data mining are as follows.

Applications (9):
1. Gene regulatory networks
2. Protein structure
3. Diagnosis of illness
4. Document classification
5. Image processing
6. Data fusion
7. Decision support systems
8. Gathering data for deep space exploration
9. Artificial intelligence
10. Prediction of weather
11. On a more familiar basis, Bayesian networks are used by the friendly Microsoft Office assistant to elicit better search results.
12. Another use of Bayesian networks arises in the credit industry, where an individual may be assigned a credit score based on age, salary, credit history, etc. This is fed to a Bayesian network, which allows credit card companies to decide whether the person's credit score merits a favorable application.

IV. CONCLUSION:
Using the simulation technique in the Bayesian classification (9) of data, we can approximate the results of all the data mining methods, since the classification of data items is based entirely on the concept of probability.

The advantages of Bayesian networks (18):
• Visually represent all the relationships between the variables
• Easy to recognize the dependence and independence between nodes
• Can handle incomplete data
• Handle scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.)
• Help to model noisy systems
• Can be used for any system model, from all known parameters to no known parameters
The limitations of Bayesian networks:
• All branches must be calculated in order to calculate the probability of any one branch.
• The quality of the results of the network depends on the quality of the prior beliefs or model.
• Calculation can be NP-hard.

Calculations and probabilities using Bayes' rule and marginalization can become complex and are often characterized by subtle wording, and care must be taken to calculate them properly.
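The cost of marginalization noted above can be seen even in a two-node network A -> B, where P(B) must already sum over every branch (value) of A; the numbers below are arbitrary illustrative probabilities:

```python
# Marginalization in a tiny two-node network A -> B:
# P(B=True) is obtained by summing over all branches of A.

P_A = {True: 0.3, False: 0.7}            # prior over A
P_B_given_A = {True: 0.9, False: 0.2}    # P(B=True | A)

# P(B=True) = sum over a of P(A=a) * P(B=True | A=a)
p_b = sum(P_A[a] * P_B_given_A[a] for a in P_A)
print(p_b)  # 0.3*0.9 + 0.7*0.2 = 0.41
```

In larger networks this sum runs over exponentially many branch combinations, which is the source of the NP-hardness mentioned above.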
REFERENCES

[1] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[2] A. K. Pujari, Data Mining Techniques, 2004.
[3] Gordern, Simulation Techniques.
[4] G. Petrone and G. Cammarata, Modelling and Simulation, InTech.
[5] R. McHaney, Understanding Computer Simulation, BookBoon.
[6] C. P. Robert and G. Casella, Monte Carlo Statistical Methods, 2nd ed., Springer, 2004.
[7] C. P. Robert and G. Casella, Introducing Monte Carlo Statistical Methods.
[8] H. C. Öttinger, Tools and Examples for Developing Simulation Algorithms, Swiss Federal Institute of Technology Zürich, Switzerland, Springer.
[9] A. W. Moore, Naïve Bayes Classifiers, School of Computer Science, Carnegie Mellon University, www.cs.cmu.edu/~awm, awm@cs.cmu.edu.
[10] W. DuMouchel, "Bayesian Measurement of Associations in Adverse Drug Reaction Databases", Shannon Laboratory, AT&T Labs-Research, dumouchel@research.att.com, DIMACS Tutorial on Statistical Surveillance Methods, Rutgers University, June 20, 2003.
[11] CS/CNS/EE 155: Probabilistic Graphical Models, Problem Set 2, handed out 21 Oct 2009, due 4 Nov 2009.
[12] J. Cheng, D. Bell and W. Liu, "Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory", University of Alberta and University of Ulster.
[13] http://www.bayesia.com/en/products/bayesialab/tutorial.php
[14] B. Vidakovic, ISyE8843A, Handout 17: Bayesian Networks.
[15] Bayesian Networks, Chapter 14, Sections 1-2.
[16] Naive Bayes Classification Algorithm, Lab4-NaiveBayes.pdf.
[17] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S. Yu, Z.-H. Zhou, M. Steinbach, D. J. Hand and D. Steinberg, "Top 10 Algorithms in Data Mining", 2007.