Data Mining Techniques

TABLE OF CONTENTS
DATA MINING TECHNIQUES
INTRODUCTION
CLUSTERING ANALYSIS AND AUTO-CLUSTERING
BOUNDARY MODELING & CENTROID MODELING
LINK ANALYSIS
VISUALIZATION
CONVENTIONAL AND SPATIAL VISUALIZATION
RULE INDUCTION
WORKS CITED
DATA MINING TECHNIQUES
INTRODUCTION
Data mining can be divided into two sub-groups: discovery and exploitation.
Discovery means that meaningful patterns are uncovered in the data and properly
distinguished, while exploitation means that these meaningful patterns are then
used to create useful applications for data modeling purposes.
The process of discovering knowledge in data is referred to as
"knowledge discovery" (KD); the term applies when the data being worked on
comes from a conventional data store rather than a data warehouse or an online
transaction system. KD can be performed in two ways: either manually or through
an automated process.
The process of KD finds connections between data and facts; it can
in fact be described as a relationship between a prior fact and a subsequent fact.
The prior fact is called the antecedent and the subsequent fact is known as the
consequent. There should be no assumption of an underlying cause-and-effect
connection between the antecedent and the consequent.
Throughout the process of data mining, it should always be kept in mind that
data is not facts; in order to have knowledge and facts, one needs data.
Knowledge is the main connection between antecedent facts and consequent facts,
and in the domain of data mining this knowledge takes several forms. Knowledge
can be in the form of if-then-else rules, known as a knowledge-based expert
system (KBES), or it can be in the form of a complex mathematical transform.
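To make the idea concrete, here is a minimal Python sketch of knowledge expressed as if-then-else rules in the style of a KBES. The fields, thresholds, and class labels are invented for illustration and are not taken from the paper.

# Hypothetical if-then-else rules linking antecedent facts to a consequent.
def classify_customer(record: dict) -> str:
    """Apply hand-written rules (a tiny KBES-style rule base)."""
    # Antecedent: frequent purchases and recent activity -> consequent: "loyal"
    if record["purchases_per_month"] >= 4 and record["days_since_last_visit"] <= 30:
        return "loyal"
    # Antecedent: no recent activity -> consequent: "churn_risk"
    elif record["days_since_last_visit"] > 180:
        return "churn_risk"
    else:
        return "occasional"

print(classify_customer({"purchases_per_month": 5, "days_since_last_visit": 12}))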
This section of the paper will discuss the four main techniques used in
data mining: auto-clustering, link analysis, visualization, and rule induction.
These four techniques are used to create knowledge and information based
on data from the database.
CLUSTERING ANALYSIS AND AUTO-CLUSTERING
Patterns are very important in data mining; in order to look
for patterns in data and create the relationships that produce
knowledge, we need to better understand what a pattern is. A "pattern" is a
regularity in data, but it should always be kept in mind that data can be very
complex and provide no useful information (Data Mining Explained, Chapter 8, Pg. 141).
The process of data mining looks for patterns that give information
about the behavior underlying the data; this may be our own behavior, that of
customers or suppliers, or that of the industry after a significant event occurs.
When dealing with numeric data, the most useful regularity that can be found is
the "cluster." A cluster is a coherent aggregation of data points: a
group of data with similar behaviors and characteristics.
Clusters can take the shape of normally distributed aggregations of
non-overlapping point sets, or they can take the shape of long, thin strands of
data points or ragged blobs of various shapes and sizes. Data sets usually include
a mixture of clusters of various sorts, which frequently overlap in very
complex ways. The fact that clusters sometimes overlap is what makes it
difficult in data mining to determine where one cluster ends and another begins.
The ultimate goal of clustering is to find groups of data that are similar to
each other and separate them from groups of data that do not share those
characteristics or features. Business knowledge is very important to the
clustering process so that the data can be processed and interpreted properly.
Groups of clusters can also be used to classify new data; common approaches
to clustering include boundary modeling and centroid modeling.
BOUNDARY MODELING & CENTROID MODELING
Boundary modeling is when clusters are distinguished by specifying the
boundaries between them while centroid modeling clusters are distinguished by
specifying their locations and extents. Boundary model is also known as bottom
– up approach to analyzing, this is due to the fact that data clusters are
determined by looking at the location of individual points rather than the whole
group of data as a whole. Centroid modeling is sometimes called top down
modeling because locations and extend of data are determined by the
distribution of the point as a group.
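As an illustration of centroid modeling, the sketch below implements a bare-bones k-means loop in Python. The paper does not name a specific algorithm, so k-means is used here only as one common centroid-based example; the synthetic data and parameter values are assumptions made for illustration.

import numpy as np

def kmeans(points: np.ndarray, k: int, iters: int = 20, seed: int = 0):
    rng = np.random.default_rng(seed)
    # Start with k randomly chosen points as the initial centroid locations.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid (cluster membership).
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Re-estimate each centroid as the mean of the points assigned to it.
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return centroids, labels

# Synthetic data: three roughly Gaussian clusters in a 2-D feature space.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc, 0.5, size=(50, 2))
                  for loc in ([0, 0], [4, 4], [0, 5])])
centers, membership = kmeans(data, k=3)
print(centers)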
LINK ANALYSIS
Link analysis is a graphical approach to analyzing data in order to establish
connections and relationships and so determine knowledge and facts based on the data.
Link analysis helps to explore data and identify relationships among
values in databases. There are various types of link analysis, but the most
common are stratification, association discovery, and sequence discovery.
Stratification is the process of retrieving records from the data store using
hypothesized links as the retrieval keys, while association discovery finds rules
about items that appear together in an event such as a purchase transaction.
The final commonly used form of link analysis is sequence discovery, which is very
similar to association discovery except that the association holds over time.
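The following Python sketch illustrates association discovery on a handful of made-up purchase transactions: it counts item pairs that appear together and reports their support and confidence. The items and thresholds are invented for illustration only.

from itertools import combinations
from collections import Counter

# Each transaction is the set of items bought together in one event.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
]

item_counts = Counter(item for t in transactions for item in t)
pair_counts = Counter(pair for t in transactions
                      for pair in combinations(sorted(t), 2))

n = len(transactions)
for (a, b), together in pair_counts.items():
    support = together / n                  # how often the pair occurs at all
    confidence = together / item_counts[a]  # P(b in basket | a in basket)
    if support >= 0.4 and confidence >= 0.6:
        print(f"{a} -> {b}: support={support:.2f}, confidence={confidence:.2f}")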
VISUALIZATION
Data visualization's role in data mining is to allow the data
analyst to gain intuition about the data being observed. Data visualization
applications assist the analyst in selecting display formats, viewer
perspectives, and data representation schemas that help to foster a deep,
intuitive understanding of the data to be mined.
There are two types of data visualization techniques, "conventional
visualization" and "spatial visualization." Both techniques have been around for
a while, but conventional visualization has been around much longer, so it has
been more thoroughly researched and better developed over the years.
The most commonly used visualization techniques include scatter plots,
pie charts, tables, and other graphic representations that depict the relationships
between the data being observed. These applications also include features such as
drill-down menus, hyperlinks to related data sources, multiple simultaneous data
views, the ability to select, count, and analyze objects in displays, and dynamic,
active views of the database, just to name a few.
The most effective pattern recognition equipment known to date is called
"grayware"; researchers have found that "grayware allows manual
knowledge discovery to be facilitated when domain data are presented in a form
that allows the human mind to interact with them at an intuitive level" (Data
Mining Explained, Chapter 8, Pg. 150).
CONVENTIONAL AND SPATIAL VISUALIZATION
Conventional visualization means the use of graphs, charts, and other
graphic representations to depict information about a population rather than the
individual population data itself. Conventional visualization is a visual depiction
of certain types of metadata, and it involves the computation of descriptive
statistics such as counts, frequencies, ranges, and proportions.
Spatial visualization, on the other hand, uses plots that depict actual
members of the population in their feature spaces. With spatial visualization
there is no computation of descriptive statistics; the depiction preserves the
geometric relationships among the data, which are the "spatial" characteristics of
the population being sampled and visualized.
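The Python sketch below contrasts the two styles on a synthetic population: the conventional view reports descriptive statistics and an aggregate histogram, while the spatial view scatters the actual members in their feature space. The features (age and income) and their values are assumptions chosen only for illustration.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
income = rng.normal(50_000, 12_000, size=500)
age = rng.normal(40, 10, size=500)

# Conventional: descriptive statistics about the population, not raw points.
print(f"count={income.size}, "
      f"income range=({income.min():.0f}, {income.max():.0f}), "
      f"mean age={age.mean():.1f}")
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(9, 4))
ax1.hist(income, bins=20)   # aggregate view of the population
ax1.set_title("Conventional: income distribution")

# Spatial: each point is one actual member; geometric relationships preserved.
ax2.scatter(age, income, s=8)
ax2.set_title("Spatial: members in (age, income) feature space")
ax2.set_xlabel("age")
ax2.set_ylabel("income")
plt.tight_layout()
plt.show()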
RULE INDUCTION
Rule induction is the process of creating rules directly from the data
without human interaction. It is a process within data mining that
"derives a set of rules to classify cases" (Introduction to Data Mining and
Knowledge Discovery, Page 17). The purpose of rule induction is to create a set of
independent rules, which do not necessarily form a decision tree.
The rules that are generated do not necessarily have to cover all
possible scenarios, and the results can sometimes conflict with the predictions.
When the results do not match the predictions, the approach is to assign a
confidence to each rule and use the rule in which the analyst is more confident,
keeping in mind knowledge of the business and of the data being analyzed. In cases
where more than two rules conflict, it is common practice for the analyst to weight
the votes by the confidence in each rule.
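A minimal Python sketch of this confidence-weighted conflict resolution follows; the rules, record fields, and confidence values are hypothetical and chosen only for illustration.

# Each rule: (test on the record, predicted class, assumed confidence value)
rules = [
    (lambda r: r["balance"] > 10_000, "low_risk", 0.80),
    (lambda r: r["late_payments"] >= 2, "high_risk", 0.65),
    (lambda r: r["years_as_customer"] > 5, "low_risk", 0.55),
]

def predict(record: dict) -> str:
    votes = {}
    for test, outcome, confidence in rules:
        if test(record):  # the rule fires for this record
            votes[outcome] = votes.get(outcome, 0.0) + confidence
    # The class with the largest total confidence wins the conflict.
    return max(votes, key=votes.get) if votes else "unknown"

print(predict({"balance": 15_000, "late_payments": 3, "years_as_customer": 6}))
# "low_risk" wins here: 0.80 + 0.55 outweighs the conflicting 0.65 vote.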
Rule induction algorithms determine the decision surfaces of a specified
type that correctly classify as much of the given data as possible. Most
algorithms use decision surfaces that are combinations of points, lines, planes,
and hyper-planes. The most common rule-induction algorithms are based on a
"splitting" technique; the splits create decision surfaces that are expressed as
predicates in rules.
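The Python sketch below illustrates the splitting idea in its simplest form: it searches for the single threshold on one numeric feature that classifies the most training cases correctly and states the result as an if-then rule. The cases and labels are made up, and real rule inducers use richer splitting criteria than this.

# Training cases: (feature value, class label); values are invented.
cases = [(21, "reject"), (25, "reject"), (31, "accept"), (38, "accept"),
         (29, "reject"), (45, "accept"), (33, "accept"), (27, "reject")]

def best_split(cases):
    values = sorted({v for v, _ in cases})
    best = (0, None)  # (number correctly classified, threshold)
    for threshold in values:
        # Candidate decision surface: "value > threshold" -> accept, else reject.
        correct = sum((v > threshold) == (label == "accept") for v, label in cases)
        best = max(best, (correct, threshold))
    return best

correct, threshold = best_split(cases)
print(f"IF value > {threshold} THEN accept   ({correct}/{len(cases)} cases correct)")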
Data mining techniques are very important to the data mining process as a
whole. The technique to use can vary and should depend on the type of
data to be analyzed. The analyst should also have a good understanding of the
business and of what the data is to be used for. This is important in determining
which data constitute information and how to go about creating rules that would
generate useful information beneficial to the business.
WORKS CITED
1. Two Crows Corporation. Introduction to Data Mining and Knowledge
Discovery. Third Edition, 1999.
2. Delmater and Hancock. Data Mining Explained. Digital Press, Boston, 2001.