Data Mining Classification through Simulation Technique
¹Samar Ballav Bhoi, ²Munesh Chandra Adhikary
¹D.D.C.E, F.M. University, Balasore,
²P.G. Dept. of Applied Physics & Ballistics, Fakir Mohan University, Balasore, Odisha
E-Mail: samarbhoi@yahoo.co.in, mcadhikary@gmail.com
Special Issue of International Journal on Advanced Computer Theory and Engineering (IJACTE)
ISSN (Print) : 2319 – 2526, Volume-2, Issue-1, 2013

Abstract – The past two decades have seen a dramatic increase in the amount of information or data being stored in electronic format. This accumulation of data has taken place at an explosive rate. It has been estimated that the amount of information in the world doubles every 20 months, and the size and number of databases are increasing even faster. The increased use of electronic data gathering devices such as point-of-sale or remote sensing devices has contributed to this explosion of available data. Utilizing these massive volumes of data effectively is becoming a major problem for all enterprises.

Data storage became easier as large amounts of computing power became available at low cost; that is, the falling cost of processing power and storage made data cheap. New machine learning methods for knowledge representation, based on logic programming etc., were also introduced in addition to traditional statistical analysis of data. The new methods tend to be computationally intensive, hence the demand for more processing power. It was recognized that information is at the heart of business operations and that decision-makers could make use of the stored data to gain valuable insight into the business. Database management systems gave access to the stored data, but this was only a small part of what could be gained from the data. Traditional on-line transaction processing systems (OLTPs) are good at putting data into databases quickly, safely and efficiently but are not good at delivering meaningful analysis in return. Analyzing data can provide further knowledge about a business by going beyond the data explicitly stored. This is where data mining has obvious benefits for any enterprise.

Keywords: data mining, extraction, simulation, classification, data items, random number.

I. INTRODUCTION

Information Technology has a major influence on organizational performance and competitive standing, with increasing processing power and availability of sophisticated analytical tools and techniques. The data warehouse is a kingpin of business intelligence. Data warehouses provide storage, functionality and responsiveness to queries that are far superior to the capabilities of today's transaction-oriented databases. In many applications users need only read access to data, but they need to access large volumes of data very rapidly. A data warehouse can be considered a "corporate memory". Data and information are extracted from heterogeneous sources as they are generated. This paper gives an outline of the basic features, components, OLAP, architecture, multidimensional data modeling, DSS, MOLAP and ROLAP, data warehouses and views, and open issues for data warehouses. Data warehouses and data mining techniques are becoming indispensable parts of business intelligence programs. Data mining is (i) the process of automatic extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from the data in large databases; (ii) one of the steps in the process of Knowledge Discovery in Databases (KDD); and (iii) applied in every field, whether games, marketing, bioscience, loan approval or fraud detection. Data warehousing and data mining are most important in the area of A.I., in which we can classify data items using a simulation process/technique. The objective of this paper is to present some of the fundamental techniques used in data mining. It gives a brief overview of data mining as well as the application of data mining techniques to the real world using the simulation technique.

From an academic viewpoint, a data warehouse, as defined by W. H. Inmon, is a subject-oriented, integrated, time-variant, nonvolatile collection of operational data that supports the decision-making of management. It is a tool that manages data after and outside of the operational system. Data warehousing technology has evolved in business applications for the process of strategic decision making. It may be considered a key component of the IT strategy and architecture of an organization. For example, an electric billing company, by analyzing the data in a data warehouse, can predict frauds and reduce the cost of such determinations. In fact, this technology has such great potential that any company possessing proper analysis tools can benefit from it. Thus a data warehouse supports business intelligence. Presently, data warehousing and data mining are used in industries such as banking, airlines, hospitals, and investment and insurance, including through the simulation technique (1). Data mining emphasizes three techniques:
a. Classification
b. Clustering
c. Association Rules

Data Mining Vs. Knowledge Discovery in Databases (KDD)

Knowledge Discovery in Databases (KDD) is the process of finding useful information, knowledge and patterns in data, while data mining is the process of using algorithms to automatically extract the desired information and patterns within the KDD process. The different steps of the KDD process are as follows:

Extraction: Obtains data from various data sources.
Preprocessing: Cleanses the data that has already been extracted in the step above.
Transformation: Converts the data into a common format by applying some technique.
Data Mining: Automatically extracts the information/patterns/knowledge.
Interpretation/Evaluation: Presents the results obtained through data mining to the users in an easily understandable and meaningful format.

Fig. 1 (KDD Process): Initial data -> E -> Target data -> P -> Preprocessed data -> T -> Transformed data -> DM -> Model -> I -> KP
(where E: Extraction, P: Preprocessing, T: Transformation, DM: Data Mining, I: Interpretation, KP: Knowledge/Pattern)

Classification: The classification task maps data into predefined groups or classes. Given a database/dataset D = {t1, t2, …, tn} and a set of classes C = {C1, …, Cm}, the classification problem is to define a mapping f: D -> C where each ti is assigned to one class; that is, it divides the database/dataset D into the classes specified in the set C. A few very simple examples of classification are:

• Teachers classify students' marks into a set of grades A, B, C, D, or F.
• Classification of the heights of a set of persons into the classes tall, medium or short.

Classification Approach: The basic approaches to classification are:

• To create specific models by evaluating training data, which is basically old data that has already been classified using the knowledge of domain experts.
• To apply the model developed to the new data.

Some of the most common techniques used for classification include Decision Trees, Neural Networks etc. Most of these techniques are based on finding distances or use statistical methods.

Classification Using Distance (K-Nearest Neighbours algorithm)

This approach places items in the class to which they are "closest". It must determine the distance between an item and a class. Classes are represented by a centroid (central value) and the individual points. One of the algorithms used is K-Nearest Neighbours (2). Some of the basic points to be noted about this algorithm are:

1. The training set includes classes along with other attributes. (Please refer to the training data given in Table-1 below.)
2. The value of K defines the number of near items (items that have the least distance to the attributes of concern) that should be used from the given set of training data (training data is already classified data). This is explained in point (2) of the following example.
3. A new item is placed in the class in which the largest number of close items are placed. (Please refer to point (3) in the following example.)
4. The value of K should satisfy $K \le \sqrt{\text{number\_of\_training\_items}}$. However, in our example, to limit the size of the sample data, we have not followed this formula.
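The four points above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes the heights and classes of Table-1 as training data, uses only the height attribute with absolute difference as the distance, and takes K=5 as in the example. The names `TRAINING` and `knn_classify` are hypothetical.

```python
from collections import Counter

# Training data taken from Table-1: (name, gender, height in cm, class).
TRAINING = [
    ("Sumita", "F", 160, "Short"), ("Chandan", "M", 200, "Tall"),
    ("Ranjita", "F", 190, "Medium"), ("Radha", "F", 188, "Medium"),
    ("Jully", "F", 170, "Short"), ("Arun", "M", 185, "Medium"),
    ("Shelly", "F", 160, "Short"), ("Avinash", "M", 170, "Short"),
    ("Sachin", "M", 220, "Tall"), ("Manoj", "M", 210, "Tall"),
    ("Sangeeta", "F", 180, "Medium"), ("Anirban", "M", 195, "Medium"),
    ("Krishna", "F", 190, "Medium"), ("Kabita", "F", 180, "Medium"),
    ("Pooja", "F", 175, "Medium"),
]


def knn_classify(height, k=5, training=TRAINING):
    """Return (majority class, k nearest rows) for a given height.

    Distance is the absolute height difference, which is the Manhattan
    distance on a single attribute. Ties keep the order of the table.
    """
    nearest = sorted(training, key=lambda row: abs(row[2] - height))[:k]
    votes = Counter(row[3] for row in nearest)
    return votes.most_common(1)[0][0], nearest


print(knn_classify(160)[0])  # Short
print(knn_classify(210)[0])  # Tall
```

For a height of 160 the five nearest rows are Sumita, Shelly, Jully, Avinash and Pooja, so four votes go to Short and one to Medium.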
Example: Consider the following data, which gives a person's class depending upon gender and height.

Table-1:
Name       Gender   Height (cm)   Class
Sumita     F        160           Short
Chandan    M        200           Tall
Ranjita    F        190           Medium
Radha      F        188           Medium
Jully      F        170           Short
Arun       M        185           Medium
Shelly     F        160           Short
Avinash    M        170           Short
Sachin     M        220           Tall
Manoj      M        210           Tall
Sangeeta   F        180           Medium
Anirban    M        195           Medium
Krishna    F        190           Medium
Kabita     F        180           Medium
Pooja      F        175           Medium

Questions: (1) Let us classify the tuple <Chandan, M, 160> from the training data.
(2) Let us classify the tuple <Manoj, M, 210> from the training data.
(3) Let us take only the height attribute for the distance calculation and suppose K=5; then the following are the nearest five tuples to the data that is to be classified (using Manhattan distance as a measure on the height attribute).

Table-2:
Name       Gender   Height (cm)   Class
Sumita     F        160           ? (Short)
Jully      F        170           ? (Short)
Shelly     F        160           ? (Short)
Avinash    M        170           ? (Short)
Pooja      F        175           ? (Medium)

Answer: 1) The tuple <Chandan, M, 160> is classified into the Short class.
2) The tuple <Manoj, M, 210> is classified into the Tall class.

Classify the data using the simulation technique: Consider the following data, in which the Position attribute acts as the class.

Table-3:
Department       Age     Salary         Position
Personnel        31-40   Medium Range   Boss
Personnel        21-30   Low Range      Assistant
Personnel        31-40   Low Range      Assistant
MIS              21-30   Medium Range   Assistant
MIS              31-40   Low Range      Boss
MIS              21-30   Medium Range   Assistant
MIS              41-50   Low Range      Boss
Administration   31-40   Medium Range   Boss
Administration   31-40   Medium Range   Assistant
Security         41-50   Medium Range   Boss
Security         21-30   Low Range      Assistant

So, we have to calculate the splitting attribute again for this age range (31-40). The tuples that belong to this range are as follows:

Table-4:
Department       Salary         Position
Personnel        Medium Range   Boss
Personnel        Low Range      Assistant
MIS              High Range     Boss
Administration   Medium Range   Boss
Administration   Medium Range   Assistant

II. METHOD (SIMULATION):
Simulation (3) is the process of designing a model of a real system and conducting experiments with this model for the purpose of understanding the behavior and operation of the system. Mathematical steps for the Monte Carlo Method (MCM):
1. Set up probability distributions for the variables to be analyzed.
2. Construct a cumulative probability distribution for each random variable.
3. Generate random numbers within the range from 00 to 99 or more.
4. Conduct the simulation experiment (4) by means of random sampling.
5. Repeat step 4 until the required number of simulation runs has been generated.
6. Design and implement a course of action and maintain control.
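Steps 1 to 5 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code. It takes the probabilities the paper assigns in Table-6 and builds cumulative probabilities and two-digit random-number ranges from them. It then looks up each random number in those ranges. The names `TABLE_6`, `random_number_ranges`, `draw_random_numbers` and `simulate` are hypothetical.

```python
import random
from itertools import accumulate

# Rows of Table-6: (department, age, salary, probability).
TABLE_6 = [
    ("Personnel", 34, "Medium Range", 0.29),
    ("Personnel", 35, "Low Range", 0.11),
    ("MIS", 38, "High Range", 0.10),
    ("Administration", 37, "Medium Range", 0.30),
    ("Administration", 34, "Medium Range", 0.20),
]


def random_number_ranges(rows):
    """Steps 1-2: cumulative probabilities as inclusive ranges within 00-99."""
    # Work in whole percentage points to avoid floating-point drift.
    percents = [round(row[3] * 100) for row in rows]
    uppers = list(accumulate(percents))
    lowers = [0] + uppers[:-1]
    return [(lo, hi - 1) for lo, hi in zip(lowers, uppers)]


def draw_random_numbers(n, seed=None):
    """Step 3: generate n random numbers in the range 00 to 99."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


def simulate(random_numbers, rows=TABLE_6):
    """Steps 4-5: map each random number to the row whose range contains it."""
    ranges = random_number_ranges(rows)
    drawn = []
    for number in random_numbers:
        for (lo, hi), row in zip(ranges, rows):
            if lo <= number <= hi:
                drawn.append(row)
                break
    return drawn


print(random_number_ranges(TABLE_6))
print([row[0] for row in simulate([56, 25, 19, 24, 67])])
```

With the random numbers 56, 25, 19, 24 and 67 used in the text, the lookup returns the departments Administration, Personnel, Personnel, Personnel and Administration. Passing the output of `draw_random_numbers` to `simulate` instead repeats the experiment with fresh samples.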
III. RESULTS DISCUSSION OF ABOVE METHOD:
Using the above data (given in Table-3), we classify the department and salary data by position in the company.

Table-5:
Department       Salary         Position
Personnel        Medium Range   Boss
Personnel        Low Range      Assistant
MIS              High Range     Boss
Administration   Medium Range   Boss
Administration   Medium Range   Assistant

Choosing the probabilities for the department, age and salary of persons:

Table-6:
Department       Age   Salary         Probability
Personnel        34    Medium Range   .29
Personnel        35    Low Range      .11
MIS              38    High Range     .10
Administration   37    Medium Range   .30
Administration   34    Medium Range   .20

Using the random numbers 56, 25, 19, 24, 67 to classify the information (5) as above, generate the random number ranges.

Table-7:
Department       Age   Probability   Cumulative Probability   Random number Range
Personnel        34    .29           .29                      00-28
Personnel        35    .11           .40                      29-39
MIS              38    .10           .50                      40-49
Administration   37    .30           .80                      50-79
Administration   34    .20           1.00                     80-99

Calculation of class using the given random numbers (6).

Table-8:
Age   Random No   Department       Salary         Position
34    56          Administration   Medium Range   Boss/Assistant
35    25          Personnel        Low Range      Assistant
38    19          Personnel        Low Range      Assistant
37    24          Personnel        Medium Range   Boss
34    67          Administration   Medium Range   Boss/Assistant

After analyzing the method to classify the personnel data/information about department, salary and position, I obtained results about position using the simulation technique that approximately agree with those of the previous methods. We can apply the technique of simulation (7) to data mining problems. Some of the applications of data mining are as follows:

Marketing and sales data analysis: A company can use the customer transactions in its database to segment customers into various types. Such companies may launch products for specific customer bases.
Investment analysis: Customers can look at the areas where they can get good returns by applying data mining.
Loan approval: Companies can generate rules (8) depending upon the dataset they have. On that basis they may decide to whom a loan is to be approved.
Fraud detection: By finding the correlation between known frauds, new frauds can be detected by applying data mining.
Network management: By analyzing the patterns generated by data mining for networks and their faults, the faults can be minimized and future needs can be predicted.
Risk analysis: Given a set of customers and an assessment of their risk worthiness, descriptions for various classes can be developed. These descriptions are then used to classify a new customer into one of the risk categories.
Brand loyalty: Given a customer and the product he/she uses, predict whether the customer will change products.
Housing loan prepayment prediction: Rule discovery techniques can be used to accurately predict the aggregate number of loan prepayments in a given quarter as a function of prevailing interest rates, borrower characteristics and account data.

IV. CONCLUSION
Using the simulation technique, data mining problems can be solved approximately, with results close to those of the other methods. For example, the classification of data items shown here is based entirely on the concept of probability.

V. REFERENCES
[1] Data Mining Concepts and Techniques, J Han, M Kamber, Morgan Kaufmann Publishers, 2001.
[2] Data Mining, A K Pujari, 2004.
[3] Simulation techniques, Gordern.
[4] Modelling and Simulation, Giuseppe Petrone, Giuliano Cammarata, InTech.
[5] Understanding Computer Simulation, Roger McHaney, BookBoon.
[6] Monte Carlo Statistical Methods (second edition), Christian P. Robert, George Casella, Springer, 2004.
[7] Introducing Monte Carlo Statistical Methods, Christian P. Robert, George Casella.
[8] Tools and Examples for Developing Simulation Algorithms, Hans Christian Öttinger, Swiss Federal Institute of Technology Zürich, Switzerland, Springer.
