Data Mining Classification through Simulation Technique

1 Samar Ballav Bhoi, 2 Munesh Chandra Adhikary
1 D.D.C.E, F.M. University, Balasore
2 P.G. Dept. of Applied Physics & Ballistics, Fakir Mohan University, Balasore, Odisha
E-Mail: samarbhoi@yahoo.co.in, mcadhikary@gmail.com

Special Issue of International Journal on Advanced Computer Theory and Engineering (IJACTE), ISSN (Print) : 2319 – 2526, Volume-2, Issue-1, 2013

Abstract – The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increased use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data. Effectively utilizing these massive volumes of data is becoming a major problem for all enterprises. Data storage became easier as large amounts of computing power became available at low cost; that is, the falling cost of processing power and storage made data cheap. New machine learning methods for knowledge representation, based on logic programming etc., were also introduced in addition to traditional statistical analysis of data. The new methods tend to be computationally intensive, hence the demand for more processing power. It was recognized that information is at the heart of business operations and that decision-makers could use the stored data to gain valuable insight into the business. Database management systems gave access to the stored data, but this was only a small part of what could be gained from it. Traditional on-line transaction processing systems (OLTPs) are good at putting data into databases quickly, safely and efficiently, but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored. This is where data mining has obvious benefits for any enterprise.

Keywords: data mining, extraction, classification, data items, random number, simulation.

I. INTRODUCTION

From an academic point of view, a data warehouse, as defined by W. H. Inmon, is a subject-oriented, integrated, time-variant, nonvolatile collection of operational data that supports the decision-making of management. It is a tool that manages data after and outside of the operational system. Data warehousing technology has evolved in business applications for the process of strategic decision making. It may be considered a key component of the IT strategy and architecture of an organization. Information Technology has a major influence on organizational performance and competitive standing, with increasing processing power and availability of sophisticated analytical tools and techniques. The data warehouse is a kingpin of business intelligence. Data warehouses provide storage, functionality and responsiveness to queries that are far superior to the capabilities of today's transaction-oriented databases. In many applications users need only read access to data, but they need to access larger volumes of data very rapidly. A data warehouse can be considered a "corporate memory". Data and information are extracted from heterogeneous sources as they are generated. This paper aims to give some idea of the basic features, components, OLAP, architecture, multidimensional data modeling, DSS, MOLAP and ROLAP, data warehouses and views, open issues for data warehouses, etc. Data warehouses and data mining techniques are becoming indispensable parts of business intelligence programs.

i) Data mining is the process of automatic extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from the data in large databases.
ii) Data mining is one of the steps in the process of Knowledge Discovery in Databases (KDD).
iii) Data mining is applied in every field, whether it is games, marketing, bioscience, loan approval, fraud detection, etc.

Data warehousing and data mining are important in the area of A.I., where data items can be classified using a simulation process or technique. The objective of this paper is to present some of the fundamental techniques used in data mining. It gives a brief overview of data mining and of the application of data mining techniques to real-world problems using the simulation technique.

Example: an electric billing company, by analyzing the data in a data warehouse, can predict frauds and reduce the cost of such determinations. In fact, this technology has such great potential that any company possessing proper analysis tools can benefit from it. Thus a data warehouse supports business intelligence. Presently, data warehousing and data mining are used in industries such as banking, airlines, hospitals, and investment and insurance, including through the simulation technique (1).

Data mining emphasizes three techniques:
a. Classification
b. Clustering
c. Association Rules

Data Mining Vs. Knowledge Discovery in Databases (KDD)

Knowledge Discovery in Databases (KDD) is the process of finding useful information, knowledge and patterns in data, while data mining is the step of that process that uses algorithms to extract the desired information and patterns automatically. The different steps of the KDD process are as follows:

Extraction: Obtains data from various data sources.
Preprocessing: Cleanses the data that has been extracted in the previous step.
Transformation: Converts the data into a common format by applying some technique.
Data Mining: Automatically extracts the information/patterns/knowledge.
Interpretation/Evaluation: Presents the results obtained through data mining to the users in an easily understandable and meaningful format.

Fig. 1 (KDD Process): Initial data -> E -> Target data -> P -> Preprocessed data -> T -> Transformed data -> DM -> Model -> I -> KP
(where E: Extraction, P: Preprocessing, T: Transformation, DM: Data Mining, I: Interpretation, KP: Knowledge Pattern)

Classification: The classification task maps data into predefined groups or classes. Given a database/dataset D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D -> C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes specified in the set C. A few very simple examples to elucidate classification:
• Teachers classify students' marks into a set of grades A, B, C, D, or F.
• The heights of a set of persons are classified into the classes tall, medium or short.

Classification Approach: The basic approaches to classification are:
• To create specific models by evaluating training data, which is basically old data that has already been classified using the domain experts' knowledge.
• To apply the model developed to the new data.

Some of the most common techniques used for classification include Decision Trees, Neural Networks, etc. Most of these techniques are based on finding distances or on statistical methods.

Classification Using Distance (K-Nearest Neighbours algorithm)

This approach places an item in the class to which it is "closest". It must determine the distance between an item and a class. Classes are represented by a centroid (central value) and the individual points. One of the algorithms used is K-Nearest Neighbours (2). Some of the basic points to note about this algorithm are:

1. The training set includes classes along with other attributes. (Please refer to the training data given in Table-1 below.)
2. The value of K defines the number of near items (items at the smallest distance with respect to the attributes of concern) that should be used from the given set of training data (training data is already classified data). This is explained in point (2) of the following example.
3. A new item is placed in the class in which the greatest number of close items are placed. (Please refer to point (3) in the following example.)
4. The value of K should be <= sqrt(number_of_training_items). However, in our example, to limit the size of the sample data, we have not followed this formula.

Example: Consider the following data, which tells us a person's class depending upon gender and height.
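Before the example is worked by hand, the distance-based procedure of points 1 to 4 can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `knn_classify` is hypothetical, the training rows are copied from Table-1, only the height attribute is used, and the class is chosen by a simple majority vote among the K nearest rows.

```python
from collections import Counter

# (name, gender, height in cm, class) rows copied from Table-1.
TRAINING = [
    ("Sumita", "F", 160, "Short"), ("Chandan", "M", 200, "Tall"),
    ("Ranjita", "F", 190, "Medium"), ("Radha", "F", 188, "Medium"),
    ("Jully", "F", 170, "Short"), ("Arun", "M", 185, "Medium"),
    ("Shelly", "F", 160, "Short"), ("Avinash", "M", 170, "Short"),
    ("Sachin", "M", 220, "Tall"), ("Manoj", "M", 210, "Tall"),
    ("Sangeeta", "F", 180, "Medium"), ("Anirban", "M", 195, "Medium"),
    ("Krishna", "F", 190, "Medium"), ("Kabita", "F", 180, "Medium"),
    ("Pooja", "F", 175, "Medium"),
]


def knn_classify(height, training=TRAINING, k=5):
    """Return the majority class among the k training rows nearest in height."""
    # With a single attribute, the Manhattan distance is the absolute difference.
    nearest = sorted(training, key=lambda row: abs(row[2] - height))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0]


print(knn_classify(160))  # height used in question (1)
print(knn_classify(210))  # height used in question (2)
```

With K=5, a height of 160 cm falls in the Short class and a height of 210 cm in the Tall class. A tie between classes would be resolved arbitrarily in this sketch, which is one reason an odd K is often preferred.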
Table-1:
Name       Gender   Height (cm)   Class
Sumita     F        160           Short
Chandan    M        200           Tall
Ranjita    F        190           Medium
Radha      F        188           Medium
Jully      F        170           Short
Arun       M        185           Medium
Shelly     F        160           Short
Avinash    M        170           Short
Sachin     M        220           Tall
Manoj      M        210           Tall
Sangeeta   F        180           Medium
Anirban    M        195           Medium
Krishna    F        190           Medium
Kabita     F        180           Medium
Pooja      F        175           Medium

Questions:
(1) Classify the tuple <Chandan, M, 160> using the training data.
(2) Classify the tuple <Manoj, M, 210> using the training data.
(3) Let us take only the height attribute for distance calculation and suppose K=5. Then the following are the five nearest tuples to the data that is to be classified (using Manhattan distance as a measure on the height attribute).

Table-2:
Name       Gender   Height (cm)   Class
Sumita     F        160           ? (Short)
Jully      F        170           ? (Short)
Shelly     F        160           ? (Short)
Avinash    M        170           ? (Short)
Pooja      F        175           ? (Medium)

Answer:
1) The tuple <Chandan, M, 160> is classified into the Short class, since four of its five nearest tuples in Table-2 are Short.
2) The tuple <Manoj, M, 210> is classified into the Tall class.

Classify the data using the simulation technique: Consider the following data, in which the Position attribute acts as the class.

Table-3:
Department       Age     Salary         Position
Personnel        31-40   Medium Range   Boss
Personnel        21-30   Low Range      Assistant
Personnel        31-40   Low Range      Assistant
MIS              21-30   Medium Range   Assistant
MIS              31-40   Low Range      Boss
MIS              21-30   Medium Range   Assistant
MIS              41-50   Low Range      Boss
Administration   31-40   Medium Range   Boss
Administration   31-40   Medium Range   Assistant
Security         41-50   Medium Range   Boss
Security         21-30   Low Range      Assistant

So, we have to again calculate the splitting attribute for this age range (31-40). The tuples that belong to this range are as follows:

Table-4:
Department       Salary         Position
Personnel        Medium Range   Boss
Personnel        Low Range      Assistant
MIS              High Range     Boss
Administration   Medium Range   Boss
Administration   Medium Range   Assistant

II. METHOD (SIMULATION):

Simulation (3) is the process of designing a model of a real system and conducting experiments with this model for the purpose of understanding the behavior of the system's operation.

Mathematical steps for the Monte Carlo Method (MCM):
1. Set up a probability distribution for the variables to be analyzed.
2. Construct the cumulative probability distribution for each random variable.
3. Generate random numbers within the range 00 to 99 or more.
4. Conduct the simulation experiment (4) by means of random sampling.
5. Repeat step 4 until the required number of simulation runs has been generated.
6. Design and implement a course of action and maintain control.

III. RESULTS DISCUSSION OF ABOVE METHOD:

Using the above data (Table-3), we classify department and salary into the different positions in the company.

Table-5:
Department       Salary         Position
Personnel        Medium Range   Boss
Personnel        Low Range      Assistant
MIS              High Range     Boss
Administration   Medium Range   Boss
Administration   Medium Range   Assistant

Choosing the probability of department, age and salary of persons:

Table-6:
Department       Age   Salary         Probability
Personnel        34    Medium Range   .29
Personnel        35    Low Range      .11
MIS              38    High Range     .10
Administration   37    Medium Range   .30
Administration   34    Medium Range   .20

The random numbers 56, 25, 19, 24, 67 are used to classify the information (5) as above. First, generate the random number ranges:

Table-7:
Department       Age   Probability   Cumulative Probability   Random Number Range
Personnel        34    .29           .29                      00-28
Personnel        35    .11           .40                      29-39
MIS              38    .10           .50                      40-49
Administration   37    .30           .80                      50-79
Administration   34    .20           1.00                     80-99

Calculation of class using the given random numbers (6):

Table-8:
Age   Random No   Department       Salary         Position
34    56          Administration   Medium Range   Boss/Assistant
35    25          Personnel        Low Range      Assistant
38    19          Personnel        Low Range      Assistant
37    24          Personnel        Medium Range   Boss
34    67          Administration   Medium Range   Boss/Assistant

After applying the method to classify the personnel data on department, salary and position, I obtained results for position using the simulation technique that approximately agree with those of the previous methods. We can apply the technique of simulation (7) to data mining problems. Some of the applications of data mining are as follows:

Marketing and sales data analysis: A company can use the customer transactions in its database to segment customers into various types. Such companies may launch products for specific customer bases.
Investment analysis: Customers can identify the areas where they can get good returns by applying data mining.
Loan approval: Companies can generate rules (8) depending upon the dataset they have. On that basis they may decide to whom a loan is to be approved.
Fraud detection: By finding correlations between known frauds, new frauds can be detected by applying data mining.
Network management: By analyzing the patterns generated by data mining for networks and their faults, the faults can be minimized and future needs can be predicted.
Risk analysis: Given a set of customers and an assessment of their risk worthiness, descriptions for various classes can be developed. These descriptions are then used to classify a new customer into one of the risk categories.
Brand loyalty: Given a customer and the product he/she uses, predict whether the customer will change products.
Housing loan prepayment prediction: Rule discovery techniques can be used to accurately predict the aggregate number of loan prepayments in a given quarter as a function of prevailing interest rates, borrower characteristics and account data.

IV. CONCLUSION

Using the simulation technique, data mining problems can be solved approximately, with results close to those of the standard methods. For example, the classification of data items shown here is based entirely on the concept of probability.

V. REFERENCES

[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, 2001.
[2] A. K. Pujari, Data Mining, 2004.
[3] Gordern, Simulation Techniques.
[4] Giuseppe Petrone, Giuliano Cammarata, Modelling and Simulation, InTech.
[5] Roger McHaney, Understanding Computer Simulation, BookBoon.
[6] Christian P. Robert, George Casella, Monte Carlo Statistical Methods (second edition), Springer, 2004.
[7] Christian P. Robert, George Casella, Introducing Monte Carlo Statistical Methods.
[8] Hans Christian Öttinger, Tools and Examples for Developing Simulation Algorithms, Swiss Federal Institute of Technology Zürich, Switzerland, Springer.
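The Monte Carlo steps of Section II, applied to the probabilities of Table-6, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' program: the names `ROWS`, `upper_bounds`, `lookup` and `simulate` are hypothetical, random numbers are assumed to be two-digit integers from 00 to 99, and each probability is assumed to be a multiple of 0.01 so that the ranges have integer end points.

```python
import random
from itertools import accumulate

# (department, age, salary, probability) rows copied from Table-6.
ROWS = [
    ("Personnel", 34, "Medium Range", 0.29),
    ("Personnel", 35, "Low Range", 0.11),
    ("MIS", 38, "High Range", 0.10),
    ("Administration", 37, "Medium Range", 0.30),
    ("Administration", 34, "Medium Range", 0.20),
]


def upper_bounds(rows):
    """Step 2: cumulative probabilities, scaled to exclusive bounds on 00-99."""
    return [round(c * 100) for c in accumulate(p for *_, p in rows)]


def lookup(number, rows=ROWS):
    """Step 4: return the row whose random-number range contains `number`."""
    if not 0 <= number <= 99:
        raise ValueError("random number must lie in 00-99")
    for row, bound in zip(rows, upper_bounds(rows)):
        if number < bound:
            return row
    raise ValueError("probabilities do not sum to 1")


def simulate(runs, seed=0):
    """Steps 3 to 5: draw `runs` random numbers and look each one up."""
    rng = random.Random(seed)  # fixed seed so that a run can be repeated
    return [lookup(rng.randrange(100)) for _ in range(runs)]


# The random numbers used in the paper.
for n in (56, 25, 19, 24, 67):
    print(n, lookup(n)[0])
```

The bounds computed by `upper_bounds` are 29, 40, 50, 80 and 100, which correspond to the ranges 00-28, 29-39, 40-49, 50-79 and 80-99 of Table-7. For the random numbers 56, 25, 19, 24 and 67 the lookup selects the departments Administration, Personnel, Personnel, Personnel and Administration.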