International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________________________________ Bayesian Classification (Gender) Using Simulation Technique 1 Samar Ballav Bhoi and 2Munesh Chandra Adhikary 1 Directorate of Distance & Continuing Education (DDCE), F.M. University, Balasore, 2 P G Dept. of Applied Physics & Ballistics (APAB), F.M. University, Balasore E-mail: samarbhoi@yahoo.co.in, mcadhikary@gmail.com Abstract: - A Bayesian classification is a statistical classification in emerging trends of information technology in development, which predicts the probability that a given sample is a member of a particular class. It is based on the Bayes theorem. The Bayesian classification shows better accuracy and speed when applied to large databases. The Bayesian Classification represents a supervised learning method as well as a statistical method for classification. Assumes an underlying probabilistic model and it allows us to capture uncertainty about the model in a principled way by determining probabilities of the outcomes. It can solve diagnostic and predictive problems. This Classification is named after Thomas Bayes (1702-1761), who proposed the Bayes Theorem. Bayesian classification provides practical learning algorithms and prior knowledge and observed data can be combined. Bayesian Classification provides a useful perspective for understanding and evaluating many learning algorithms. It calculates explicit probabilities for hypothesis and it is robust to noise in input data. A naive Bayes classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naive) independence assumptions. A more descriptive term for the underlying probability model would be "independent feature model". An overview of statistical classifiers is given in the article on Pattern recognition. A Bayesian-network or belief network represents the dependencies between variables to provide a succinct design of a joint probability distribution. The network is a directed graph where nodes are sets of random variables; directed links connect node pairs signifying which nodes have a direct effect on other nodes; each node has a conditional probability table representing the quantifiable impact each parent node has on the child node’s value; and the graph has no directed sequences determining the specific path to be taken or result. The links, representing direct conditional dependency between nodes, and the probability coupled with each links are typically established by subject matter experts (SMEs). Uncertainties can be applied to each node to help make runs stochastic. A deterministic run is executed when a child node’s values are derived exclusively from the inputs of the node’s parent(s). A Bayesiannetwork can reason from effects to causes (diagnostic inference), from causes to effects (causal inference), between causes of a common effect (inter causal inference), or by combining two or more of the above (mixed inference). One of the obstacles with producing a Bayesiannetwork comes from the inability of SMEs to ascertain all the nodes and directed links essential for an implementation in a particular domain. Finally, determining the probability weights for each link is often considered the most complex phase of creating and modifying a Bayesian-network. Data mining is most important in area of A.I. in which we can draw the classification gender of data item using simulation process/technique. Objective of the paper aims at giving you some of the fundamental techniques used in data mining. This paper emphasizes on a brief overview of data mining as well as the application of data mining techniques to the real world using simulation technique. This topic is based on classification of genders using Bayesian Classification which is purely new data sets of random numbers represents height, weight and foot sizes. Bayesian reasoning is applied to decision making and inferential statistics that deals with probability inference. It is used the knowledge of prior events to predict future events. Keywords: Bayesian Network, Attributes, posterior probability, prior probability, data mining, extraction, simulation, classification, SMEs, KDD, data items, random number. I. INTRODUCTION In academic view and it is defined by W.H Inman is subject-oriented, integrated, time-variant, nonvolatile, and a collection of operational data that supports decision–making of the management. It is a tool that manages of data after and outside of operational system. Data warehousing technology has evolved in business applications for the process of strategic decision making. It may be considered as the key components of IT strategy and architecture of an organization. Example – an electric billing company, by analyzing data of a data warehouse can predict frauds and can reduce the cost of such determinations, in fact this technology has such great potential that any company processing proper analysis tools can benefit from it. Thus a data warehouse supports Business Intelligence. Presently some uses of Data warehousing and Data Mining are in industries like- Banking, Airline, Hospital and Investment & Insurance, use of data warehousing and data mining through simulation technique(1). Data Mining emphasizes on three techniques: a) Classification b) Clustering ________________________________________________________________________________________________ ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014 1 International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________________________________ c) Association Rules Data Mining Vs. Knowledge Discovery in Databases (KDD) Knowledge Discovery in Databases (KDD) is the process of finding useful Information, knowledge and patterns in data while data mining is the process of using of algorithms to automatically extract desired information and patterns, which are derived by the Knowledge Discovery in Databases process. Let us define KDD: Knowledge Discovery in Databases (KDD) Process .The different steps of KDD are as follows: (a) Extraction: Obtains data from various data sources. (b) Preprocessing: It includes cleansing the data which has already been extracted by the above step. (c) Transformation: The data is converted in to a common format, by applying some technique. (d) Data Mining: Automatically information/patterns/knowledge. (e) Interpretation/Evaluation: Presents the results obtained through data mining to the users, in easily understandable and meaningful format. Fig-1 E P extracts T DM Preprocessed Transformed data data I Model Fig : 1 (KDD Process) (Where E: Extraction, PPreprocess T- Transform DM-Data Mining IInterpreter KP- knowledge Pattern ) Classification: The classification task maps data into predefined groups or classes. Given a database/dataset D={t1,t2,…,tn} and a set of classes C={C1,…,Cm}, the classification Problem is to define a mapping f:D->C where each ti is assigned to one class, that is, it divides database/dataset D into classes specified in the Set C. A few very simple examples to elucidate classification could be: • Teachers classify students’ marks data into a set of grades as A, B, C, D, or F. • Classification of the height of a set of persons into the classes tall, medium or short Classification Approach: The basic approaches to classification are: (a) Some of the most common techniques used for classification may include the use of Decision Trees, Neural Networks etc. Most of these techniques are based on finding the distances or uses statistical methods. Classification Using Distance (K-Nearest Neighbors algorithm (KNN), this approach, places items in the class to which they are “closest” to their neighbor. It must determine distance between an item and a class. Classes are represented by centroid (Central value) and the individual points. One of the algorithms that are used is K-Nearest Neighbors (2). Some of the basic points to be noted about this algorithm are: 1. The training set includes classes along with other attributes. (Please refer to the training data given in the Table-1 given below). 2. The value of the K defines the number of near items (items that have less distance to the attributes of concern) that should be used from the given set of training data (just to remind you again, training data is already classified data). This is explained in point (2) of the following example. 3. A new item is placed in the class in which the most number of close items are placed. (Please refer to point (3) in the following example). 4. The the KP Initial Target Data data (b) Now applying the model developed to the new data. value of K should be <= number _ of _ training _ items ,-(1) However, in our example for limiting the size of the sample data; we have not followed this formula. Example: Consider the following data, which tells us the person’s class depending upon gender and height. Table-1: Name Gender Height (cm) Class Sumitra F 160cm Short Ananda M 200cm Tall Ranita F 190cm Medium Radhika F 188cm Medium Dally F 170cm Short Arun M 185cm Medium Shellina F 160cm Short Arabinda M 170cm Short Sachin M 220cm Tall Manoj M 210cm Tall Sulekha F 180cm Medium Anil M 195cm Medium Karisma F 190cm Medium Sabita F 180cm Medium Sipra F 175cm Medium Questions: To create specific models by, evaluating training (1) let us to classify the tuple <Ananda, M, 200> from data, this is basically the old data that has already training data. been classified by using the domain of the experts’ knowledge. ________________________________________________________________________________________________ ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014 2 International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________________________________ (2) let us to classify the tuple <Manoj, M, 210> from training data. (3) Let us take only height attribute for distance calculation and suppose K=5 then the following are the near five tuples to the data that is to be classified (using Manhattan distance as a measure on the height attribute). Table-2: Name Sumita Jully Shelly Avinash Gender F F F M Height 160cm 170cm 160cm 170cm I wish to determine which posterior is greater, male or female. For the classification as male the posterior is given by Posterior (male) p(m) p(h / m) p( w / m) p( fs / m) evidence - (2) For the classification as female the posterior is given by Posterior ( female) Class ? (Short) ? (Short) ?( Short) ? (Short) p( f ) p(h / f ) p( w / f ) p( fs / f ) evidence - (3) The evidence (also termed normalizing constant) may be calculated (14): Answer: 1) The tuple <Chandan, M, 160> can classify to short class. 2) The tuple <Manoj, M, 210> can classify to tall class. Classify the data using the simulation technique: Consider the following data in which position attribute acts as class: evidence p(m) p(h / m) p( w / m) p( fs / m) p( f ) p(h / f ) p( w / f ) p( fs / f )) (4) However, given the sample the evidence is a constant and thus scales both posteriors equally. It therefore does not affect classification and can be ignored. We now determine the probability distribution for the sex of the sample. Example training set below. Table-3 sex Male Male Male Male Female Female Female Female Male Female height (feet) weight (lbs) 6 (6”) 5.92 (5'11") 5.58 (5'7") 5.92 (5'11") 5(5”) 5.5 (5'6") 5.42 (5'5") 5.75 (5'9") 5.91 (5'11") 5.44 (5'5") 181 192 172 166 102 151 129 153 168 128 --- (5) foot size(inch es) 12 11 12 10 6 8 7 9 10 8 , Where and are the parameters of normal distribution which have been previously determined from the training set. Note that a value greater than 1 is OK here – it is a probability density rather than a probability, because height is a continuous variable. The classifier created from the training set using a Gaussian distribution assumption (11) would be (given variances are sample variances): Table-4 sex mean (height ) male 5.866 fem ale 5.422 varianc e (height ) 2.1504 e-02 5.8416 e-02 mean (weight ) 175.8 132.6 varianc e (weight ) 0.9216 e+02 3.4504 e+02 mean (foot size) variance (foot size) 11.00 0.8e+00 7.6 0.104e01 Let's say we have equiprobable classes so P (male) = P (female) = 0.5. This prior distribution might be based on our knowledge of frequencies in the larger population, or on frequency in the training set. Testing: Below is a sample to be classified as a male or female. Table-5 sex height (feet) weight (lbs) sample 6 130 foot size(inches) 8 - (6) Since posterior numerator is greater in the female case, we predict the sample is female. II: METHOD (SIMULATION): Simulation(3) is the process of designing a model of a real system and conducting experiments with this model for the purpose of understanding the behavior for the operation of the system. Mathematical steps for MCM: (Monte Carlo Method) ________________________________________________________________________________________________ ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014 3 International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________________________________ 1. Setting up probability distribution for the variables to be analyzed. Classification, Estimation, Prediction Used for large data set 2. Construct cumulative probability distribution for each random numbers variables. Very easy to construct 3. Generate the random numbers within the range from 00 to 99 or more. 4. Conduct the simulation experiment (4) by means of random sampling. Not using estimations complicated iterative Often does surprisingly well May not be the best possible classifier parameter 5. Repeat the step 4 until the required number of simulation runs has been generated. Robust, fast, it can usually be relied on to many applications. 6. Design and implement a course of action and maintain control. Applications (9) Using above data (mentioned in table-3) to classify the gender with different position at attributes. 1. Gene regulatory networks 2. Protein structure Probability Distribution for Sex, height, weight and foot sizes: table-6 3. Diagnosis of illness Height (feet) 5. Image processing Weight Probability F M 5.0”-5.5” 100-125 0.30 0.26 5.5-6.0” 125-150 0.20 0.24 6.0-6.5” 150-175 0.25 0.20 6.5-7.5” 175-200 0.25 0.30 Using the random numbers 6, 5.3, 6.6, 5.2, 5.9 to classify the gender. Generate the range of random numbers using cumulative probability. Table-7 Height Weight Probabi Cumulati Range (feet) lity ve prob. of RN 5.0”-5.5” 100-125 0.30 0.30 00-02 5.5-6.0” 125-150 0.20 0.50 03-04 6.0-6.5” 150-175 0.25 0.75 05-07 6.5-7.5” 175-200 0.25 1.00 08-09 Calculation class of gender using random numbers: table-8 Random No 6.0 5.3 6.6 5.2 5.9 Sex Male Female Male Female Male Height 181 129 192 129 168 Foot size 12 7 11 7 10 4. Document classification 6. Data fusion 7. Decision support systems 8. Gathering data for deep space exploration 9. Artificial Intelligence 10. Prediction of weather 11. On a more familiar basis, Bayesian networks are used by the friendly Microsoft office assistant to elicit better search results. 12. Another use of Bayesian networks arises in the credit industry where an individual may be assigned a credit score based on age, salary, credit history, etc. This is fed to a Bayesian network which allows credit card companies to decide whether the person's credit score merits a favorable application. IV. CONCLUSION: Using the simulation technique in Bayesian Classification (9) of data we can solve the all data mining methods as approximately and near about the results. Such as classification of data items is totally based on probability concept. The advantages of Bayesian Networks (18): III. RESULTS DISCUSSION OF ABOVE METHOD: Visually represent all the relationships between the variables After analyzing the test the method classify the given data set into different attributes of gender of items in proper classification. So this method can also applicable for different other attributes of human beings to classify the data. I obtained the results about gender using simulation technique which is approximately correct about previous methods. We can apply the technique of simulation (7) in data mining problem: Some of the applications of data mining are as follows: Easy to recognize the independence between nodes. Can handle incomplete data Scenarios where it is not practical to measure all variables (costs, not enough sensors, etc.) Help to model noisy systems. dependence and Can be used for any system model - from all known parameters to no known parameters. ________________________________________________________________________________________________ ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014 4 International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________________________________ The limitations of Bayesian Networks: All branches must be calculated in order to calculate the probability of any one branch. The quality of the results of the network depends on the quality of the prior beliefs or model. Calculation can be NP-hard Calculations and probabilities using Baye's rule and marginalization can become complex and are often characterized by subtle wording, and care must be taken to calculate them properly. REFERENCES Bayes Classifiers www.cs.cmu.edu/~awm awm@cs.cmu.edu 412-268-7599 [10] William DuMouchel Shannon Laboratory, AT&T Labs –Research ,Bayesian Measurement of Associations in Adverse Drug Reaction Databases dumouchel@research.att.com DIMACS Tutorial on Statistical Surveillance Methods Rutgers University June 20, 2003 [11] CS/CNS/EE 155: Probabilistic Graphical Models Problem Set 2 Handed out: 21 Oct 2009 Due: 4 Nov 2009 [12] Learning Bayesian Networks from Data: An Efficient Approach Based on Information Theory Jie Cheng Dept. of Computing Science University of Alberta Alberta, T6G 2H1 Email: jcheng@cs.ualberta.ca David Bell, Weiru Liu Faculty of Informatics, University of Ulster, UK BT37 0QB Email: {w.liu, da.bell}@ulst.ac.uk [1] J Han, M Kamber, 2001 Data Mining Concepts and Techniques, Morgan Kaufmann Publishers. [2] A K Pujari, 2004 Data Mining, [3] Gordern, Simulation techniques [4] Giuseppe Petrone, Giuliano Cammarata InTech.’ Modelling and Simulation’ – [13] http://www.bayesia.com/en/products/ bayesialab/tutorial.php [5] Roger McHaney - BookBoon Understanding Computer Simulation, [14] ISyE8843A, Brani Vidakovic Handout 17 1 Bayesian Networks [6] Christian P. Robert and George Casella, Springer 2004, Monte Carlo Statistical Methods" (second edition). [15] Bayesian networks Chapter 14 Section 1 – 2 [16] Naive-Bayes Classification Algorithm Lab4NaiveBayes.pdf [17] XindongWu · Vipin Kumar · J. Ross Quinlan · Joydeep Ghosh · Qiang Yang · Hiroshi Motoda · Geoffrey J. McLachlan · Angus Ng · Bing Liu · Philip S. Yu · Zhi-Hua Zhou · Michael Steinbach David J. Hand · Dan Steinberg Received: 9 July 2007 `Top 10 algorithms in data mining [7] Christian P. Robert, George Casella. "Introducing Monte Carlo statistical Methods" [8] Tools and Examples for Developing Simulation Algorithms-Hans Christian Öttinger, Swiss Federal Institute of Technology Zürich, Switzerland Springer. [9] Andrew W. Moore Professor School of Computer Science Carnegie Mellon University, Naïve ________________________________________________________________________________________________ ISSN (Print): 2319-2526, Volume -3, Issue -4, 2014 5