JOURNAL OF INFORMATION, KNOWLEDGE AND RESEARCH IN COMPUTER
SCIENCE AND APPLICATIONS
A STUDY OF DATA MINING APPLICATION.
1 MS. ZARANA C. PADIYA, 2 MS. SEEMA ZOPE, 3 MS. YESHA DAVE
1 Asst. Professor, MCA, SLDCCA, Bharuch, Gujarat.
2 Asst. Professor, MCA, SLDCCA, Bharuch, Gujarat.
3 Asst. Professor, MCA, RKCET, Rajkot, Gujarat.
Zarnapadia86@yahoo.in, spshekhawat@rediffmail.com, yesha112ysd@gmail.com
ABSTRACT: Data Mining was a totally new concept for us, and it took quite a long time to understand this
technology. We went through some data and a few datasets and found relationships among the attributes, but
it was a difficult task to establish relations for a dataset that was not known to us, and also to find the hidden
information in that dataset, since finding hidden and useful information in a given dataset is the major goal of data
mining technology. The major question was how to initiate the process, and this led us to study the entire technology in depth.
This study was very helpful in understanding the system, and it became the ladder by which we could climb
step by step towards analysis. The resultant system of this research is named the Web Based Data Mining
Application (WDMA). Before discussing WDMA as a whole, we would like to list the findings of our research.
Keywords: Data, Dataset, WDMA, Information, Mining.
I: INTRODUCTION
Data Mining refers generally to the exploration,
analysis and presentation of historical data, or can
refer to the specific act of retrieving information,
usually from a Data Warehouse.
In big organizations, datasets can be real assets.
Commercial datasets, for example in the retail sector,
are growing at unpredictable rates. Such datasets
contain a lot of information, which can often be
accessed only with the help of suitably designed
computer-based search and analysis applications. The
scientific approach to such search and analysis is
referred to as Data Mining.
Data Mining is the extraction of hidden information
from large datasets, and it is a powerful new
technology with great potential to help companies
focus on the most important information in their Data
Warehouses. This is enough to start with; we now
focus on the functionalities of Data Mining in more detail.
II: WHAT KIND OF DATA CAN BE MINED?
In principle, data mining is not specific to one type of
media or data; it should be applicable to
any kind of information repository. However,
algorithms and approaches may differ when applied to
different types of data. Data mining is being put into
use and studied for databases, including relational
databases, object-relational databases and
object-oriented databases; data warehouses;
transactional databases; unstructured and
semi-structured repositories such as the World Wide
Web; advanced databases such as spatial databases,
multimedia databases, time-series databases and
textual databases; and even flat files. Here are some
examples in more detail:
II.I FLAT FILES: Flat files are actually the most
common data source for data mining algorithms,
especially at the research level. Flat files are simple
data files in text or binary format with a structure
known by the data mining algorithm to be applied.
The data in these files can be transactions, time-series
data, scientific measurements, etc.
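A minimal sketch of this idea, assuming a hypothetical comma-separated rental file whose structure (transaction id, date, a ';'-separated item list) is known in advance by the mining algorithm; the file contents are invented for illustration:

```python
import csv
import io

# Hypothetical flat file of rental transactions, one line per
# transaction: id, date, and a ';'-separated item list.
raw = """T1,2013-01-05,video:Alien;game:Chess
T2,2013-01-06,video:Matrix
T3,2013-01-07,video:Alien;video:Matrix"""

# The mining algorithm must know the structure in advance: here we
# parse each row into an (id, date, items) tuple.
transactions = []
for row in csv.reader(io.StringIO(raw)):
    tid, date, items = row
    transactions.append((tid, date, items.split(";")))

for tid, date, items in transactions:
    print(tid, date, len(items))
```

The parsed tuples are then the input records for whatever mining step follows.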
II.II RELATIONAL DATABASES: Briefly, a relational
database consists of a set of tables containing either
values of entity attributes, or values of attributes from
entity relationships. Tables have columns and rows,
where columns represent attributes and rows represent
tuples. A tuple in a relational table corresponds to
either an object or a relationship between objects and
is identified by a set of attribute values representing a
unique key. In the figure below we present the
relations Customer, Items and Borrow, representing
business activity in a fictitious video store, Our Video
Store. These relations are just a subset of what could
be a database for the video store and are given as an
example.
Figure-1
ISSN: 0975 – 6728 | NOV 12 TO OCT 13 | VOLUME – 02, ISSUE – 02 | Page 106
The most commonly used query language for
relational databases is SQL, which allows retrieval and
manipulation of the data stored in the tables, as well
as the calculation of aggregate functions such as
average, sum, min, max and count. For instance, an
SQL query to select the videos grouped by category
would be:
SELECT category, COUNT(*) FROM Items
WHERE type = 'video' GROUP BY category;
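As a runnable sketch of this query, using Python's built-in sqlite3 module and an assumed Items schema (the paper does not specify one, so the columns and rows below are invented):

```python
import sqlite3

# Build a tiny in-memory Items table; schema assumed for illustration.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Items (item_id INTEGER, type TEXT, category TEXT)")
con.executemany("INSERT INTO Items VALUES (?, ?, ?)", [
    (1, "video", "action"),
    (2, "video", "action"),
    (3, "video", "drama"),
    (4, "game", "puzzle"),
])

# Count the videos grouped by category, as in the query above.
rows = con.execute(
    "SELECT category, COUNT(*) FROM Items "
    "WHERE type = 'video' GROUP BY category"
).fetchall()
print(rows)
```

The game row is excluded by the WHERE clause, so only the two video categories are counted.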
Data mining algorithms using relational databases can
be more versatile than data mining algorithms
specifically written for flat files, since they can take
advantage of the structure inherent in relational
databases. While data mining can benefit from SQL
for data selection, transformation and consolidation, it
goes beyond what SQL can provide, for tasks such as
prediction, comparison and deviation detection.
II.III TRANSACTION DATABASES: A transaction
database is a set of records representing transactions,
each with a time stamp, an identifier and a set of items.
Associated with the transaction files could also be
descriptive data for the items. For example, in the case
of the video store, the rentals table shown in
the figure below represents the transaction database.
Each record is a rental contract with a customer
identifier, a date, and the list of items rented (i.e.
video tapes, games, VCRs, etc.). Since relational
databases do not allow nested tables (i.e. a set as an
attribute value), transactions are usually stored in flat
files or in two normalized transaction tables,
one for the transactions and one for the transaction
items. One typical data mining analysis on such data
is the so-called market basket analysis, or mining of
association rules, in which associations between items
occurring together or in sequence are studied.
Figure-2: Transaction Database.
II.IV MULTIMEDIA DATABASES: Multimedia
databases include video, image, audio and text
media. They can be stored on extended object-relational
or object-oriented databases, or simply on a
file system. Multimedia data are characterized by high
dimensionality, which makes data mining even more
challenging. Data mining from multimedia
repositories may require computer vision, computer
graphics, image interpretation and natural language
processing methodologies.
II.V SPATIAL DATABASES: Spatial databases are
databases that, in addition to usual data, store
geographical information such as maps and global or
regional positioning. Such spatial databases present
new challenges to data mining algorithms.
Figure: Spatial Database.
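The market basket analysis mentioned for transaction databases (II.III) can be sketched by counting items that co-occur across baskets; the rental baskets and the minimum support threshold below are invented for illustration:

```python
from itertools import combinations
from collections import Counter

# Hypothetical rental transactions (baskets) from the video store.
rentals = [
    {"Alien", "Matrix", "Chess"},
    {"Alien", "Matrix"},
    {"Matrix", "Chess"},
    {"Alien", "Matrix"},
]

# Count how often each unordered pair of items is rented together.
pair_counts = Counter()
for basket in rentals:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Pairs meeting a minimum support of 3 baskets are kept as associations.
frequent = {p: c for p, c in pair_counts.items() if c >= 3}
print(frequent)
```

Real association-rule miners (e.g. Apriori) generalise this counting to item sets of any size and add a confidence measure.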
II.VI TIME-SERIES DATABASES:
Time-series databases contain time-related data such as
stock market data or logged activities. These
databases usually have a continuous flow of new data
coming in, which sometimes creates the need for
challenging real-time analysis. Data mining in such
databases commonly includes the study of trends and
correlations between the evolutions of different variables,
as well as the prediction of trends and movements of
the variables in time. The figure below shows some
examples of time-series data.
Figure: Time series Database
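The study of trends and correlations between variables described above can be sketched as follows; the two daily series and the smoothing window are invented for illustration:

```python
from statistics import mean, pstdev

# Two hypothetical daily series: a stock price and its trading volume.
price  = [10.0, 10.5, 11.2, 10.8, 11.5, 12.1, 12.0]
volume = [100, 120, 150, 130, 160, 180, 175]

# Pearson correlation between the evolutions of the two variables.
mp, mv = mean(price), mean(volume)
cov = mean((p - mp) * (v - mv) for p, v in zip(price, volume))
r = cov / (pstdev(price) * pstdev(volume))

# A 3-day moving average smooths the series to expose the trend.
trend = [mean(price[i:i + 3]) for i in range(len(price) - 2)]
print(round(r, 3), [round(t, 2) for t in trend])
```

A high positive r here indicates the two variables evolve together, the kind of relationship such mining looks for.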
III: FUNCTIONS OF DATA MINING:
III. I KNOWLEDGE DISCOVERY:
What are Data Mining and Knowledge Discovery?
With the enormous amount of data stored in files,
databases, and other repositories, it is increasingly
important, if not necessary, to develop powerful
means for analysis and perhaps interpretation of such
data and for the extraction of interesting knowledge
that could help in decision-making.
Data Mining, also popularly known as Knowledge
Discovery in Databases (KDD), refers to the
nontrivial extraction of implicit, previously unknown
and potentially useful information from data in
databases. While data mining and knowledge
discovery in databases (or KDD) are frequently
treated as synonyms, data mining is actually part of
the knowledge discovery process. The following
figure shows data mining as a step in an iterative
knowledge discovery process.
Data mining software can explore large sets of data
without a predetermined hypothesis to find interesting
patterns, and then present this information to the user.
As for terminology, the broad process of knowledge
discovery is known as Knowledge Discovery in
Databases (KDD). The term refers to the overall
process of finding knowledge in data, and emphasizes
the "high level" application of particular Data Mining
methods. It is of interest to researchers in pattern
recognition, databases, statistics, artificial intelligence,
knowledge acquisition for expert systems, and data
visualization. The unifying goal of the KDD process
is to extract knowledge from data in the context of
large datasets. The process comprises a few steps
leading from raw data collections to some form of
new knowledge; the iterative process consists of the
following steps:
III.I.I Data cleaning: also known as data cleansing,
this is a phase in which noisy and irrelevant data
are removed from the collection.
III.I.II Data integration: at this stage, multiple data
sources, often heterogeneous, may be combined in a
common source.
III.I.III Data selection: at this step, the data
relevant to the analysis is decided on and retrieved
from the data collection.
III.I.IV Data transformation: also known as data
consolidation, it is a phase in which the selected data
is transformed into forms appropriate for the mining
procedure.
III.I.V Data mining: this is the crucial step in which
clever techniques are applied to extract potentially
useful patterns.
III.I.VI Pattern evaluation: in this step, strictly
interesting patterns representing knowledge are
identified based on given measures.
III.I.VII Knowledge representation: is the final
phase in which the discovered knowledge is visually
represented to the user.
KDD is an iterative process. Once the discovered
knowledge is presented to the user, the evaluation
measures can be enhanced, the mining can be further
refined, new data can be selected or further
transformed, or new data sources can be integrated,
in order to get different, more appropriate results.
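The steps above can be sketched end-to-end on toy records; the field names, values and age bands below are invented for illustration, not taken from the paper:

```python
from collections import Counter

# Raw collection with one noisy record (missing age).
raw = [
    {"age": 25, "income": 30000, "bought": "yes"},
    {"age": None, "income": 52000, "bought": "no"},
    {"age": 40, "income": 61000, "bought": "yes"},
    {"age": 33, "income": 45000, "bought": "yes"},
]

# 1. Data cleaning: drop records with missing values.
clean = [r for r in raw if None not in r.values()]

# 2-3. Integration/selection: keep only the fields relevant to mining.
selected = [(r["age"], r["bought"]) for r in clean]

# 4. Transformation: discretise age into coarse bands.
banded = [("young" if a < 35 else "older", b) for a, b in selected]

# 5. Mining: count the frequency of each pattern.
patterns = Counter(banded)

# 6-7. Evaluation/representation: report the best-supported pattern.
best, support = patterns.most_common(1)[0]
print(best, support)
```

In a real KDD loop the result would be shown to the user, and the selection, transformation or mining steps refined accordingly.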
III.II PREDICTIVE MODELING:
This function uses discovered patterns to predict
future behavior. For example, information gathered
from credit card transactions can be used to identify
customer priorities or instances of fraud.
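As an illustration of such prediction, a minimal sketch that flags a card transaction deviating far from a customer's historical spending pattern; the amounts and the 3-sigma threshold are invented assumptions:

```python
from statistics import mean, pstdev

# Hypothetical past card transactions for one customer.
history = [25.0, 40.0, 32.0, 28.0, 35.0]

def looks_fraudulent(amount, past, k=3.0):
    """Flag amounts more than k standard deviations from the mean."""
    mu, sigma = mean(past), pstdev(past)
    return abs(amount - mu) > k * sigma

print(looks_fraudulent(30.0, history))   # typical spend
print(looks_fraudulent(900.0, history))  # far outside the pattern
```

Production fraud models are far richer, but the principle is the same: patterns learned from past data score new behavior.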
III.III FORENSIC ANALYSIS:
This is the process of applying the extracted patterns
to find anomalous and unusual patterns in the data.
For example, a retail analyst could explore the reasons
why a particular population group makes certain
types of purchases in a particular store.
IV. DATA MINING TECHNOLOGIES:
Data mining software uses a variety of different
approaches to sift and sort data, identify patterns and
process information. Methods adopted include:
Figure: Data mining as the core of the knowledge discovery process.
Decision-tree approach
Regression approach
Rule discovery approach
Neural network approach
Genetic programming
Fuzzy logic
Nearest Neighbor approach
These methods can be combined in different ways to
sift and sort complex data. These methods no doubt
provide the means to mine the data, but there are many
statistical activities forming the base for these
techniques. Since working with these kinds of
technologies is not an easy task, the hurdles are
simplified by statistical methods and various
statistical tests. Commercial software packages often
use a combination of two or more of these methods.
IV.I DECISION-TREE APPROACH
Decision-tree systems partition data sets into smaller
subsets, based on simple conditions, from a single
starting point. In the example below, the decision tree
is used to make decisions about expenses, from the
starting point of 'Grade'.
A disadvantage of this approach is that there will
always be some information loss, because a decision
tree selects one specific attribute for partitioning at
each stage, with a single starting point. The decision
tree can present one set of outcomes, but not more
than one set, as there is a single starting point.
Therefore decision trees are suited to data sets where
there is one natural attribute to start with.
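Choosing which single attribute to partition on can be scored by entropy reduction (information gain, the criterion the ID3 algorithm described below uses); a minimal sketch on invented expense records, where 'grade' separates the classes perfectly and 'city' not at all:

```python
from math import log2
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(records, attribute, label="class"):
    """Entropy reduction from partitioning records on one attribute."""
    base = entropy([r[label] for r in records])
    remainder = 0.0
    for value in {r[attribute] for r in records}:
        subset = [r[label] for r in records if r[attribute] == value]
        remainder += len(subset) / len(records) * entropy(subset)
    return base - remainder

# Toy records: 'grade' splits the classes cleanly, 'city' does not.
data = [
    {"grade": "A", "city": "X", "class": "approve"},
    {"grade": "A", "city": "Y", "class": "approve"},
    {"grade": "B", "city": "X", "class": "reject"},
    {"grade": "B", "city": "Y", "class": "reject"},
]
print(information_gain(data, "grade"), information_gain(data, "city"))
```

A tree builder would therefore split on 'grade' first, which is exactly the single-attribute commitment (and potential information loss) discussed above.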
While studying decision trees we came across some
algorithms specially designed to generate decision
trees: CHAID, C&RT and ID3. They are discussed
below.
CHAID
The acronym CHAID stands for Chi-squared
Automatic Interaction Detector. It is one of the oldest
tree classification methods, originally proposed by
Kass (1980; according to Ripley, 1996, the CHAID
algorithm is a descendant of THAID, developed by
Morgan and Messenger, 1973). CHAID will "build"
non-binary trees (i.e., trees where more than two
branches can attach to a single root or node), based
on a relatively simple algorithm that is particularly
well suited to the analysis of larger datasets. Also,
because the CHAID algorithm will often effectively
yield many multi-way frequency tables, it has been
particularly popular in marketing research, in the
context of market segmentation studies.
C&RT
C&RT builds classification and regression trees for
predicting continuous dependent variables
(regression) and categorical dependent variables
(classification). The classic C&RT algorithm was
popularized by Breiman et al. (Breiman, Friedman,
Olshen, & Stone, 1984; see also Ripley, 1996). This
algorithm is used basically for solving regression and
classification problems.
ID3
The ID3 classification algorithm builds a decision
tree from a fixed set of examples. The resulting tree
is used to classify future samples. Each example has
several attributes and belongs to a class (such as yes
or no). The leaf nodes of the decision tree contain the
class name, whereas a non-leaf node is a decision
node. The decision node is an attribute test, with each
branch (to another decision tree) being a possible
value of the attribute.
ID3 uses information gain to help it decide which
attribute goes into a decision node. The advantage of
learning a decision tree is that a program, rather than
a knowledge engineer, elicits knowledge from an
expert.
IV.II NEURAL NETWORKING:
Neural networking classifies large sets of data and
assigns weights or scores to the data. This information
is then retained by the software and adjusted as it
undergoes further iterations.
A neural net consists of a number of interconnected
elements (called neurons) which learn by modifying
the connections between them. Each neuron has a set
of weights that determine how it evaluates the
combined strength of the input signals. Once the
neural network has calculated the relative effect each
of these characteristics has on the data, it can apply
the knowledge it has learned to a new set of data.
Neural networks can "learn" from examples.
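A minimal sketch of such learning from examples, using a single neuron and the classic perceptron weight-update rule; the AND dataset and the learning rate are invented for illustration:

```python
# A single artificial neuron with the perceptron learning rule.
def fire(weights, bias, inputs):
    """The neuron combines the input signals by their weights."""
    s = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if s > 0 else 0

# Learn the AND function from labelled examples.
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b, rate = [0.0, 0.0], 0.0, 0.1

for _ in range(20):  # repeated iterations adjust the weights
    for x, target in samples:
        error = target - fire(w, b, x)
        w = [wi + rate * error * xi for wi, xi in zip(w, x)]
        b += rate * error

print([fire(w, b, x) for x, _ in samples])  # -> [0, 0, 0, 1]
```

Each wrong output nudges the connection weights toward the target, which is the "modifying the connections" behavior described above.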
Figure: A simple neural network.
Figure: A neural network with feedback and competition.
However, the disadvantage of neural networks is that
the input has to be numeric, which may lead to
complications when dealing with non-scalar fields
such as Country or Product, where numeric labels
have to be given to fields of equal value. A neural
network in the process of iteration may come to
assign relationships or values based on these arbitrary
numbers, which would corrupt the output.
IV.III FUZZY LOGIC
Rules incorporate probability, so "good" might mean a
70% success rate or a 90% success rate. This is called
an inexact rule. A "fuzzy" rule can vary in terms of
the numeric values in the body of the rule.
For example, the confidence might vary according to
the value of one of the variables (e.g. as the age
increases). Fuzzy logic assesses data in terms of
possibility and uncertainty.
IF income is low AND person is young
THEN credit limit is low
This rule is fuzzy because of the imprecise definitions
of "income", "young" and "credit limit". The credit
limit will change as the age or income changes.
IV.IV NEAREST NEIGHBOUR APPROACH
The nearest neighbour method matches patterns
between different sets of data. This approach is based
on data retention. When a new record is presented for
prediction, the "distance" between it and similar
records in the data set is found, and the most similar
neighbours are identified.
For example, a bank may compare a new customer
with all existing bank customers, by examining age,
income etc., and so set an appropriate credit rating.
V. CONCLUSION
With this research we would like to conclude that the
study of data mining and its techniques has made us
understand the vast and varying world of data
mining. By learning the scope and applications of
data mining we got the idea for our work, i.e. the
Web Based Data Mining Application.
Although it has many complexities still to unfold, it
has its own importance and advantages, so we would
like to go ahead with it.
REFERENCES:
Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann.
Jean-Marc Adamo, Data Mining for Association
Rules and Sequential Patterns, Springer.