Topic 7 Data Mining

ICT619 Intelligent Systems
Topic 7: Data Mining
Introduction
Business Applications of data mining
Data Mining Activities
Data Mining Techniques
How to Apply Data Mining
Data Mining Development Methodology
References






Berry, M., & Linoff, G. (2000). Mastering Data Mining. Wiley Computer Publishing, New York.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., & Zanasi, A. (1998). Discovering Data Mining: From Concept to Implementation. Prentice Hall, Englewood Cliffs, NJ.
Dhar, V., & Stein, R. (1997). "Deriving Rules from Data", in Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, pp. 167-189, 251-258.
Ganti, V., Gehrke, J., & Ramakrishnan, R. (1999). Mining Very Large Databases. IEEE Computer, Vol. 32, No. 8, August 1999, pp. 38-45.
Hirji, K. (2001). Exploring Data Mining Implementation. Communications of the ACM, Vol. 44, No. 7, July 2001, pp. 87-93.
Web site on Data Mining and Web Mining: http://www.kdnuggets.com/software/suites.html
Introduction
Why data mining?
The advent of information technology in the last four decades has resulted in an abundance of data
generated or captured electronically. Sources of this proliferation of data include data generated by point-of-sale (POS) devices such as bar code scanners, customer call-detail databases, and web log files in e-commerce.
While the generation, storage and transmission of data are becoming more and more efficient, organizations are ending up with huge amounts of mostly day-to-day transaction data, stored away in database files for possible future use. A more organised and useful approach to storing data has given rise to organizational data warehouses, which are central repositories of cleaned and transformed data. As an example of the magnitude of business data, the US Library of Congress holds 17 million books, equivalent to about 17 terabytes of data, which is also reported to be the volume of data in the database of the UPS shipping company.
Data is collected mostly to improve the efficiency of the underlying operations, not for analysis and prediction. It has become obvious that businesses can gain competitive advantage if useful information on market and customer behaviour can be extracted by "mining" the data. Such information
may indicate important underlying trends, associations or patterns in market behaviour, which can help
obtain answers to questions like – “Based on past buying behaviour, which customers should be
targeted for direct marketing?”
According to Hirji (2001),
“A practical and applied definition of data mining is the analysis and non-trivial extraction of data from
databases for the purpose of discovering new and valuable information, in the form of patterns and rules,
from relationships between data elements.”
The term data mining stands for the process of, rather than a product for, exploring and analysing data
for discovering new and useful information. This concept is not new, and data analysis using statistical
methods such as regression is already a well-established practice. But statistical methods suffer from scalability problems: they work well with relatively small data sets and a manageable number of variables, but not with millions of records and thousands of variables. The advent of intelligent
techniques such as artificial neural networks and decision trees has made it possible to perform data
mining involving large volumes of data more effectively and efficiently.
Current interest in data mining, in particular for gaining competitive advantage in business, is growing.
Its application in areas such as direct target marketing campaigns, fraud detection, and development of
models to aid in financial predictions is likely to intensify in the coming years.
Data mining projects involve both the utilization of established algorithms from machine learning,
statistics, database systems and information visualization, and the development of new methods and
algorithms targeted at large data mining problems. Nowadays data mining efforts have gone beyond crunching databases of credit card usage or stored transaction records, focusing also on data collected in health care, art, design, medicine, biology and other areas of human endeavour.
Online Analytical Processing (OLAP), Data Mining and Knowledge Discovery
Unlike data warehouses, which aim to present a single centralised view of the data in an organization,
OLAP databases (known as data marts) offer improved speed and responsiveness by limiting
themselves to a single view of the data specific to the department that owns the database. OLAP
databases are organised using a multidimensional structure (called a cube) using dimensions of the
business such as time, product type and geography. This enables the analyst to segregate and analyse
data along the different aspects of business activity. Sometimes a data warehouse can consist of a
collection of data marts.
Both data mining and OLAP are tools for decision support. While OLAP attempts to find what is
happening and how it is happening, data mining aims to find out the underlying characteristics of data
in order to be able to predict what is likely to happen. The difference between OLAP and data mining is best understood in terms of the questions they allow the user to ask, for example: "How has the sale of product X varied from quarter to quarter during the past year over all the geographic regions in which it is sold?" (OLAP) versus "Which customers are more likely to buy product X?" (data mining).
Data mining belongs to the wider field of knowledge discovery, which concerns the automated
extraction of useful information from a diverse range of sources including textual, multimedia and web
data. Data mining is knowledge discovery applied predominantly in commercial databases.
The main objectives of this topic are to:
- Understand the role of data mining in business
- Distinguish between different data mining techniques
- Understand how to go about making use of data mining
Business Applications of Data Mining
The data mining segment is one of the fastest growing segments in the business intelligence market.
Companies are increasingly investigating the potential of data mining technology to deliver
competitive advantage. It is being regarded as an integral and necessary component of an
organization’s portfolio of analytical techniques. The use of data mining techniques as an intelligent systems tool in a number of areas of business is outlined below.
Data mining for marketing
Many of the most successful applications of data mining are in marketing. In this area, data mining is
used for both reducing cost and increasing revenue. Databases contain collections of data on
prospective targets of a marketing campaign. These data relate to information on customer behaviour
obtained from operational systems such as point-of-sale systems or from commercial databases
available for a fee. Data mining can be used to reduce marketing costs by eliminating calls and letters
to people who are unlikely to respond to an offer.
Data mining for customer relationship management
Good customer relationship management involves anticipating customers’ needs and responding to
them proactively. This can be made possible in a large enterprise only through the application of data
mining techniques to the records in its customer database.
Data mining in R&D
Data mining can lower costs during the research and development phase of the product life cycle. One
example is the pharmaceutical industry, which is characterised by the generation of large amounts of
test data. These data have to be analysed using sophisticated prediction techniques to determine which
chemicals are likely to produce useful drugs. An entire discipline of bioinformatics has grown up to
mine and interpret the data being generated by high throughput screening and other biological sources.
Data Mining Activities
The different objectives of performing data mining may be categorised into two broad groups –
directed data mining and undirected data mining.
In directed data mining, we know what we are looking for. We aim to find the value of a pre-identified
target variable in terms of a collection of input variables, eg, classifying insurance claims. Undirected
data mining takes a bottom-up approach. It finds patterns in data and leaves it to the user to find the
significance of these patterns. For example, an analysis of customer profiles may result in identifying groups of
customers with similar buying patterns.
The different types of data mining tasks are listed below:
 Classification
 Estimation
 Prediction
 Finding affinity grouping or association rules
 Clustering
 Description and visualisation
Classification, estimation and prediction are examples of directed data mining, while the remaining three tasks belong to the undirected data mining group.
Classification
Classification assigns a given object to a predefined category (class) based on the object’s attributes
(features). In the business data mining context, the objects to be classified are generally represented by
database records. Examples of classification tasks include:
o Assigning keywords to articles
o Classifying credit applicants as low, medium and high risk
o Assigning customers to predefined customer segments
Estimation
While classification produces discrete outcomes – yes or no; low, medium, high etc – estimation deals with continuously varying outcomes, such as income, probability of a customer leaving (known in data mining circles as churn) or average number of children in a family. Outcomes of such estimation can also be used for classification by ranking the values and applying a threshold to categorise them, as sketched below.
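To make the threshold idea concrete, here is a minimal sketch in Python; the customer identifiers, the scores and the 0.7 cut-off are invented for illustration.

# Turning estimated values into classes: rank the scores and apply a
# threshold. The records and the 0.7 cut-off are made up for illustration.
estimated_churn = {"cust_01": 0.92, "cust_02": 0.35, "cust_03": 0.78}

THRESHOLD = 0.7  # chosen by the analyst

# Rank customers from highest to lowest estimated probability
ranked = sorted(estimated_churn.items(), key=lambda kv: kv[1], reverse=True)

for customer, score in ranked:
    label = "high risk" if score >= THRESHOLD else "low risk"
    print(f"{customer}: {score:.2f} -> {label}")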
Prediction
A prediction can be a classification or estimation task performed to predict some future behaviour.
Examples include:
o Predicting which customers will churn in the next six months
o Predicting the size of a balance that will be transferred
Finding affinity grouping or association rules
The task of affinity grouping is to find out which things go together; the prototypical example is finding which items go together in a supermarket shopping trolley. Supermarkets make use of such information for arranging items on shelves or in catalogues, and for identifying cross-selling opportunities (eg, people who buy product X also buy products Y and Z).
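A minimal sketch of how such an affinity rule can be quantified, using the standard support and confidence measures; the baskets are invented for illustration.

# Computing support and confidence for the rule "X -> Y" from a list of
# shopping baskets. The baskets are invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]

def rule_stats(antecedent, consequent, baskets):
    n = len(baskets)
    both = sum(1 for b in baskets if antecedent <= b and consequent <= b)
    ante = sum(1 for b in baskets if antecedent <= b)
    support = both / n                       # fraction of all baskets with X and Y
    confidence = both / ante if ante else 0  # fraction of X-baskets that also have Y
    return support, confidence

s, c = rule_stats({"bread"}, {"butter"}, baskets)
print(f"bread -> butter: support={s:.2f}, confidence={c:.2f}")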
Clustering
Clustering segments a group of diverse records into subgroups or clusters containing similar records.
There are no predefined classes in clustering; records are grouped based on similarities in their
attributes – eg, people with similar buying habits. Different clusters of music CD purchases may
indicate different age or cultural sub-groups. It is left to the data miner to interpret such clusters and
decide what use can be made of them.
Description and visualisation
Data mining can be used simply to describe a database to help increase our understanding of the
people, products or processes that produced the data in the first place. A good description can also
provide an explanation of their behaviour or at least indicate where to look for an explanation. Data
visualisation is a powerful form of descriptive data mining. Although visualisation may not be
meaningful in all cases, it can be very effective in explaining things by exploiting our ability to make
use of visual cues.
Data Mining Techniques
The field of data mining spans a number of disciplines including statistics, databases and machine
learning. In this topic, we’ll aim to find out when to apply data mining techniques, how to interpret
their results, and how to evaluate their performance. To be able to do so, we need a basic understanding
of the inner workings of these techniques. The three major approaches for data mining are: decision
trees, automatic cluster detection and artificial neural networks (supervised and unsupervised).
Decision Trees
A decision tree may be regarded as the visual representation of a reasoning process, and decision trees are particularly suitable for solving classification problems. A decision tree consists of nodes, branches and
leaves. In it, each internal node represents an attribute, called a splitting attribute, and each leaf node is
labelled with a class label. The class label is decided by the class of the records that ended up in that
leaf during training. A leaf node may also contain a value derived from the average of the values of
such records. Each edge originating from an internal node is labelled with a splitting predicate that
involves only the node’s splitting attribute. The splitting predicate has the property that any record will
take a unique path from the root to exactly one leaf node.
[Figure 1: Sample decision tree for a catalogue mailing (Ganti et al. 1999). The root node tests Salary: <= 50K leads to the leaf Group D, while > 50K leads to an Age node; there, > 40 leads to the leaf Group C, while <= 40 leads to an Employment node, where Academia and Industry lead to Group B and Self leads to Group A.]
To help us understand how the decision tree works, we can view each record with N attributes as a
point in an N-dimensional record space. Each branch in the decision tree is a test on a single variable that splits the space into two or more regions. After the split at the root node of the tree, each of these regions will have a mix of records with different values for most, if not all, attributes except the one tested at the root node. With each successive test and split, the resulting regions get more and more segregated, with increasing homogeneity among the records in each region. Ultimately, the
leaf nodes will contain the purest batch of records. For example, in the decision tree of Figure 1, any
self-employed person aged less than 41 and earning a salary of more than $50,000 will be classified as
belonging to group A.
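The path-tracing just described can also be written out as code. The sketch below is a transcription of the Figure 1 tree as reconstructed above:

# The Figure 1 decision tree as nested tests. Each record follows exactly
# one path from the root (Salary) to a leaf (a group label).
def classify(salary, age, employment):
    if salary <= 50_000:
        return "Group D"
    # salary > 50K
    if age > 40:
        return "Group C"
    # age <= 40
    if employment == "Self":
        return "Group A"
    return "Group B"  # Academia or Industry

# The example from the text: self-employed, aged under 41, salary > $50K
print(classify(salary=60_000, age=35, employment="Self"))  # -> Group A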
It is possible to build a decision tree that correctly classifies every single record, assuming no two
records have the same set of attributes (input variables) but belong to two different classes (target
variables). This is not desirable though, as it gives rise to the problem known as overfitting. Such a tree
describes the training data very well but is unlikely to generalise to new data sets. As an example, a
large decision tree may be built to identify every inhabitant of a small town – each leaf node will be
labelled with a name (assuming no two persons with the same name live in this town). But this tree
won’t be able to do anything useful like determining whether a given person belongs to the group of
overweight male teenage students (there being no representative leaf node). To prevent overfitting, a test data set is used to prune the decision tree once it has been built using the training data set.
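As an illustration of this idea, here is a toy sketch of test-set ("reduced-error") pruning. The tree representation, the data and the collapse rule are all invented for illustration, not taken from the text.

# Toy reduced-error pruning. A tree is either a class label (a leaf) or a
# dict with a splitting attribute, a cut value, two subtrees, and the
# majority class of the training records that reached the node.
def predict(tree, record):
    while isinstance(tree, dict):
        tree = tree["left"] if record[tree["field"]] <= tree["cut"] else tree["right"]
    return tree

def accuracy(tree, test):
    return sum(predict(tree, r) == r["class"] for r in test) / len(test)

def prune(node, root, test):
    if not isinstance(node, dict):
        return node
    node["left"] = prune(node["left"], root, test)
    node["right"] = prune(node["right"], root, test)
    before = accuracy(root, test)
    saved = (node["left"], node["right"])
    node["left"] = node["right"] = node["majority"]  # collapse to a leaf
    if accuracy(root, test) >= before:
        return node["majority"]       # pruning did not hurt: keep the leaf
    node["left"], node["right"] = saved               # otherwise restore
    return node

tree = {"field": "age", "cut": 40, "majority": "B",
        "left": {"field": "income", "cut": 50, "majority": "A",
                 "left": "A", "right": "B"},
        "right": "B"}
test = [{"age": 30, "income": 40, "class": "A"},
        {"age": 30, "income": 60, "class": "A"},
        {"age": 50, "income": 45, "class": "B"}]
print(prune(tree, tree, test))   # the income split is pruned away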
There are a number of different types of decision trees depending mainly upon the number of splits
allowed at each level, how these splits are chosen when the tree is built and how the tree is pruned to
prevent overfitting. More broadly, decision trees can be grouped as either classification trees (leaves represent classes) or regression trees (leaves represent a numeric value). There are various algorithms for
building decision trees – the most notable among them are CHAID, C4.5/C5.0 and CART. Typical data
mining software tools these days allow the user to choose among several splitting criteria and pruning
strategies, and to control parameters such as maximum tree depth to allow approximation of any of
these algorithms.
How decision trees are built
Decision trees are built through a process known as recursive partitioning. Recursive partitioning is an
iterative process of splitting the data up into partitions (regions of record space). Initially all the records
are in a training set – the preclassified records that are used to determine the structure of the decision
tree. An algorithm splits up the data, using every possible binary split on every field of the records. The
algorithm chooses the split that partitions the data into two parts that are purer than the original data
(the training set). The splitting process is then applied to each of the new parts and so on until no more
useful splits can be found.
The most important task in building a decision tree is to decide which of the attributes (independent
fields in a record) gives the best split. The best split is defined as one that creates partitions where a
single class predominates. The measure used to evaluate a potential splitter is the reduction in diversity
(or increase in purity). There are several methods of calculating the index of diversity for a set of
records. For a two-class problem, one measure, called the Gini index in data mining circles, is given by the formula

2p1(1 – p1), equivalently 2p1p2,

where p1 is the proportion of class one and p2 = 1 – p1 is the proportion of class two. Two other diversity indices are:

min(p1, p2)

–(p1 log p1 + p2 log p2), known as entropy.
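Written as code, the three diversity measures for a two-class node look as follows; a small sketch, with p1 the proportion of class one and p2 = 1 – p1.

import math

# Diversity measures for a two-class node, as functions of p1, the
# proportion of class one at the node (p2 = 1 - p1). Lower diversity
# after a split means a purer, and hence better, split.
def gini(p1):
    return 2.0 * p1 * (1.0 - p1)

def min_index(p1):
    return min(p1, 1.0 - p1)

def entropy(p1):
    if p1 in (0.0, 1.0):   # a pure node has zero diversity
        return 0.0
    p2 = 1.0 - p1
    return -(p1 * math.log2(p1) + p2 * math.log2(p2))

for p in (0.5, 0.9, 1.0):
    print(f"p1={p}: gini={gini(p):.3f}, min={min_index(p):.3f}, entropy={entropy(p):.3f}")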
To choose the best splitter at a node, the decision tree algorithm considers each input field in turn.
Then, every possible split is tried. The diversity measure is calculated for the two new partitions, and
the best split is the one with the largest reduction in diversity. The field that yields the best split is chosen as the splitter for that node. If a field takes on only one value, it is eliminated from
consideration since there is no way it can be used to create a split. When no split can be found that
significantly decreases the diversity of a given node, then this node is a leaf node. Eventually only leaf
nodes remain and the full decision tree has been grown.
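A sketch of this greedy search, assuming binary splits on numeric fields and the Gini index as the diversity measure; the records and field names are invented for illustration.

# Greedy choice of a splitter: for each field, try every binary split and
# keep the one with the largest reduction in diversity (Gini here).
def gini(p1):
    return 2.0 * p1 * (1.0 - p1)

def diversity(rows, target):
    if not rows:
        return 0.0
    p1 = sum(1 for r in rows if r[target] == 1) / len(rows)
    return gini(p1)

def best_split(records, fields, target):
    parent = diversity(records, target)
    best = None  # (reduction, field, cut)
    for field in fields:
        values = sorted({r[field] for r in records})
        if len(values) < 2:
            continue  # a single-valued field cannot create a split
        for cut in values[:-1]:
            left = [r for r in records if r[field] <= cut]
            right = [r for r in records if r[field] > cut]
            # diversity of the two new partitions, weighted by their sizes
            child = (len(left) * diversity(left, target)
                     + len(right) * diversity(right, target)) / len(records)
            if best is None or parent - child > best[0]:
                best = (parent - child, field, cut)
    return best

records = [{"age": 25, "churn": 1}, {"age": 40, "churn": 1},
           {"age": 55, "churn": 0}, {"age": 62, "churn": 0}]
print(best_split(records, ["age"], "churn"))  # -> (0.5, 'age', 40)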
As mentioned earlier, the full decision tree needs to be pruned to improve its performance. Pruning is
done by removing leaves and branches (edges leading to leaves) that fail to generalise. There are a
number of pruning methods, one of which uses the actual performance of the tree on a separate set of
preclassified data, called the test set. A tree is pruned back to the subtree that minimises error on the
test set.
Application of decision trees
Decision trees are useful when the data mining task is classification of records or prediction of
outcomes. They are used when the goal is to assign each record to one of a few broad categories.
Decision tree methods are also chosen for their ability to generate understandable rules, which can be
explained and translated into SQL or a natural language. For any classified record, the rule for its
classification can be generated by simply tracing the path from the root to the leaf where the record ended
up. Most decision tree tools provide for this capability.
Like artificial neural networks, decision trees represent a class of machine learning algorithms, since they are capable of generating rules from training data. One major difference between these two
paradigms however is that unlike a neural net, whose rules (or input-output mappings) are implicit in
its weights, rules in a decision tree are explicit.
Automatic Cluster Detection
Cluster detection aims to discover structure in a complex data set as a whole in order to carve it up into
simpler groups. Examples of clustering are finding products that should be grouped together in a catalogue, or identifying groups of customers with similar tastes in music. There are many methods for
finding clusters in data, a prominent one among which is described below.
K-means clustering
The K-means clustering algorithm is available in a wide variety of commercial data mining tools. It
divides the data set into a predetermined number, k, of clusters. These clusters are centred at random
points in the record space. Records are assigned to the clusters through an iterative process that moves
the cluster means (also called cluster centroids) around until each one is actually at the centre of some
cluster of records.
[Figure 2: Initial cluster seeds (from Berry & Linoff 2000).]
In the first step, k data points are selected to be the seeds more or less arbitrarily. Each of these seeds is
an embryonic cluster with only one element. In the example shown in Figure 2, k is 3.
[Figure 3: Initial clusters and intercluster boundaries (from Berry & Linoff 2000).]
In the second step, each record is assigned to the cluster whose seed is nearest to it, forming the three initial clusters and intercluster boundaries shown in Figure 3. The centroids are then recalculated and records reassigned, giving the new clusters and boundaries shown in Figure 4; note that the boxed record, which was assigned to cluster 2 (seed 2) initially, now becomes part of cluster 1.
[Figure 4: New clusters, their centroids marked by crosses, and intercluster boundaries (from Berry & Linoff 2000).]
The centroid of a cluster of records is calculated by taking the average of each field for all the records
in that cluster. For measuring distances between a record and a cluster’s centroid, the Euclidean
distance (see footnote 1 below) is most commonly used by data mining software.
In the k-means method, the original choice of the value of k determines the number of clusters that will
be found. Unless advance knowledge of the likely number of clusters is available, experimentation
with different values of k is needed. Best results are obtained when k matches the underlying structure
of the data.
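The whole loop can be sketched in a few lines of Python; the points and the choice of k below are invented for illustration.

import math
import random

# Minimal k-means: pick k seeds, assign each record to the nearest
# centroid, recompute the centroids, and repeat until assignments settle.
def kmeans(records, k, iterations=100):
    centroids = random.sample(records, k)           # step 1: arbitrary seeds
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for rec in records:                         # step 2: nearest centroid
            distances = [math.dist(rec, c) for c in centroids]
            clusters[distances.index(min(distances))].append(rec)
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:              # assignments have settled
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.8), (9.0, 1.0)]
centroids, clusters = kmeans(points, k=3)
print(centroids)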
Interpreting clusters
A strength of automatic cluster detection is that it is an undirected data mining technique – we can look for something useful without knowing what we are looking for. But this also means we may not recognise something useful when we find it!
The most frequently used approaches to understanding clusters are:
 Building a decision tree with the cluster labels as target variables, and using it to derive rules explaining how to assign new records to the correct cluster (see the sketch after this list).
 Using visualisation to see how the clusters are affected by changes in input variables.
 Examining the differences in the distributions of variables from cluster to cluster, one variable at a time.
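As a sketch of the first approach, assuming a library such as scikit-learn is available (the library, the data and the cluster labels are illustrative, not from the text):

# Explaining clusters by fitting a decision tree with the cluster label
# as the target, then printing the resulting rules.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 30_000], [30, 32_000], [45, 80_000], [50, 85_000]]  # age, income
cluster_labels = [0, 0, 1, 1]   # e.g. the output of a k-means run

tree = DecisionTreeClassifier(max_depth=2).fit(X, cluster_labels)
print(export_text(tree, feature_names=["age", "income"]))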
Application of clusters
Cluster detection is used when it is suspected that there are natural groupings, which may represent groups of customers or products that have a lot in common with each other. These may turn out to be commonly occurring customer segments for which customised marketing approaches are justified.
Clustering is also useful when there are many competing patterns in the data making it hard to spot any
single pattern. Creating clusters of similar data records reduces the complexity within clusters so that
other data mining techniques are more likely to succeed.
1 The Euclidean distance between two points P(x1, x2, …, xn) and Q(y1, y2, …, yn) in n-dimensional space is √((x1 − y1)² + (x2 − y2)² + … + (xn − yn)²).
Artificial Neural Networks
Going back to the main data mining activities mentioned earlier, classification, estimation and
prediction are the three types of tasks performed in directed data mining, where a (dependent) variable
is described in terms of the values of a group of other (independent) variables. In practice,
classification, where we look for a categorical (yes/no, low, medium, high etc) answer, can be an
estimation problem, where some threshold is applied to estimated output values to come up with
categories. Prediction, on the other hand, can be viewed as an estimation or classification task with the
difference that the estimated value or class can only be verified at some future point.
As we found in topic 3, the main generic application of artificial neural networks is pattern recognition
or classification, which assigns an input object to a class based on the values of its attributes. The best artificial neural network model for performing classification is the backpropagation network (or multilayer perceptron), which learns to classify data through supervised training.
The artificial neural network model particularly suited for the data mining task of clustering is the
Kohonen net or the self-organising map (SOM). With SOMs, the learning algorithms are unsupervised.
Groups of similar input pattern vectors (data records) that are near each other in the N-dimensional
record space are mapped to neurons that are also close to one another in the output layer – thus forming
representative clusters for similar data records. The SOM can thus serve as a clustering tool as well as
visualisation tool for high-dimensional data. See topic 4 ANN for more details on both the
backpropagation and SOM neural networks.
SOMs have been claimed to be often more effective than classical clustering algorithms such as k-means. While clustering techniques often have very restrictive assumptions or limited ability to find complex shapes, SOMs are much better able to isolate clusters in high-dimensional space, particularly clusters that are not tight little balls of data but rather twisting, curving regions in N-dimensional space. SOMs take data in N-space and produce graphs in 2-dimensional space that reveal key associations among the original input vectors in N-space.
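A single training step of a SOM can be sketched as follows, assuming numpy, a 2-D grid of weight vectors and a Gaussian neighbourhood; these implementation choices, and the grid and learning-rate values, are illustrative.

import numpy as np

# One SOM training step: find the best-matching unit (BMU) for an input
# vector, then pull the BMU and its grid neighbours towards the input.
rng = np.random.default_rng(0)
grid_h, grid_w, n_features = 10, 10, 5
weights = rng.random((grid_h, grid_w, n_features))   # the 2-D output map

def train_step(x, weights, lr=0.1, radius=2.0):
    # distance from every unit's weight vector to the input
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Gaussian neighbourhood around the BMU on the 2-D grid
    rows, cols = np.indices((grid_h, grid_w))
    grid_dist2 = (rows - bmu[0]) ** 2 + (cols - bmu[1]) ** 2
    influence = np.exp(-grid_dist2 / (2 * radius ** 2))
    # move weights towards the input, scaled by neighbourhood influence
    weights += lr * influence[..., None] * (x - weights)
    return bmu

x = rng.random(n_features)       # one input record (already normalised)
print(train_step(x, weights))    # grid coordinates of the winning neuron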
Artificial neural networks can produce very good results, but they require extensive data preparation
involving normalisation and conversion of categorical values to numeric values. The main drawback of
ANNs is that they are difficult to understand because they represent complex non-linear models that,
unlike decision trees, do not produce rules readily.
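The kind of preparation meant here can be sketched in a few lines; the field names, values and the min-max scaling choice are hypothetical.

# Typical ANN data preparation: scale numeric fields to [0, 1] and
# one-hot encode categorical values. The fields are invented.
def min_max(values):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in values]

incomes = [30_000, 55_000, 120_000]
scaled_incomes = min_max(incomes)            # -> [0.0, 0.27..., 1.0]

regions = ["north", "south", "west"]
categories = sorted(set(regions))
one_hot = [[1.0 if r == c else 0.0 for c in categories] for r in regions]
print(scaled_incomes, one_hot)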
Application of neural nets
Neural networks are a good choice for most classification and prediction tasks when the results are
more important than understanding how the model works.
Neural nets do not work well when there are many hundreds or thousands of input features. Large numbers of features can make it more difficult for the network to find patterns and can result in long training phases.
How to Apply Data Mining
There are essentially four ways of utilising data mining expertise in business.
1. By purchasing readymade scores (such as on credit worthiness for a loan applicant) from
outside vendors.
2. By purchasing software that embodies data mining expertise designed for a particular
application such as credit approval, fraud detection or churn prevention.
3. By hiring outside consultants to perform data mining for special projects.
4. By developing its own data mining skills within the business organisation.
The advantage of purchasing scores is that it is quick and easy; but because the intelligence is limited to single score values, its usefulness is also limited.
Purchasing Software
Data mining expertise can be embodied in software in one of two ways. The software may be an actual
model, in the form of a set of rules for decision support, or a fully-trained neural network applied to a
particular domain. Alternatively, it may embody knowledge of the process of building models
appropriate to a particular domain in the form of a model-creation wizard or template.
Purchasing models developed somewhere else can work well but only to the extent that the products,
customers, and market conditions match those that were used to develop the model. Applications
developed to meet the needs of a particular industry are often called vertical applications because they
attempt to cover every layer from data handling at the bottom to report generation at the top. A general-purpose data mining tool, on the other hand, is horizontal, since it provides broad applicability to many problems. One example of a data mining tool that is a vertical application is Falcon, the embedded neural network model from the California-based company HNC, used for predicting credit card fraud. In 1998, Falcon monitored over 250 million card accounts worldwide.
Model-building software provides tools for automating the process of creating candidate models and selecting the ones that perform best, once the proper input and output variables have been identified. Such software enables novice model builders to be taken through the process of creating models based on their own data. Although this gives the flexibility of reflecting local conditions, it leaves a number of significant tasks for the user (these are part of the overall data mining development methodology described below):
 Choosing a suitable business problem to be addressed by data mining.
 Identifying and collecting data that is likely to contain the information needed to answer the
business question.
 Preprocessing the data so that the data mining tool can make use of it.
 Transforming the database so that the input variables needed by the model are available.
 Designing a plan of action based on the model and implementing it in the marketplace.
 Measuring the results of the actions and feeding them back into the database where they will be
available for future mining.
As one example of model building software, Model 1 from Unica Technologies contains four modules,
each of which addresses a particular direct marketing challenge. These are called Response Modeller,
Cross Seller, Customer Valuator, and Customer Segmenter.
Hiring Outside Experts
This approach is recommended if an organisation is in the early stages of integrating data mining in its
business, and especially if the data mining activity is to be a one-off process, eg, fixing a
manufacturing problem. If instead, it is to be an ongoing process, eg, data mining for customer
relationship management, it is more worthwhile to consider developing the necessary skill among an
organisation’s own staff through in-house development.
Outside expertise for data mining is likely to be available in three places:
 From a data mining software vendor – if the data mining software has already been selected, the company providing the software is the first place to look for help.
 Data mining centres – these are usually collaborations between universities and private companies.
 Consulting companies – ideally, the consulting company chosen should have experience specifically in the area of interest to the organisation seeking help.
Developing In-house Expertise
Any business serious about converting corporate data to business intelligence should consider making
data mining one of its core competencies. This applies particularly to companies which have many
products and customers.
Data Mining Development Methodology
According to Hirji (2001), no quantitative or qualitative study of how to actually perform data mining has been undertaken. Cabena et al. (1998) have proposed a five-stage model of implementing and using data mining. The stages in the model are:
1. Business objective determination
2. Data preparation
3. Data mining
4. Results analysis
5. Knowledge assimilation
Business objective determination is concerned with clearly identifying the business problem to be
mined. Data preparation involves data selection, preprocessing and transformation. Data mining is
concerned with algorithm selection and execution. Results analysis is concerned with the question of
whether anything new or interesting has been found. Knowledge assimilation aims to formulate ways
of exploiting the new information extracted.
Case Study
A case study involving a large fast food outlet and described in (Hirji 2001) brought out some
deficiencies of the methodology described above. A new set of stages for data mining development and
use has been proposed as given below:
1. Business objective determination
2. Data preparation
3. Data audit
4. Interactive data mining and results analysis
5. Back end data mining
6. Results synthesis and presentation
The case study used IBM’s Intelligent Miner for Data on AIX as the data mining tool and took 20
actual days of effort across the 6 stages above. Back end data mining involves data enrichment and
additional data mining algorithm execution by the data mining specialist. 45% of the total data mining
project effort was taken up by stages 4, 5 and 6, and 30% was required by the data preparation stage.
The 30% of total time required by the data preparation stage, as compared with the 70% predicted in
the earlier model of Cabena et al. may be explained by the use of an existing data warehouse in the
organization. For a data mining project without this advantage, more project resources would be
required to perform tasks such as selecting, cleaning, transforming, coding, and loading the data.
Important aspects of the Interactive data mining and results analysis stage were linking data mining
results with business strategy and using application software such as spreadsheets to perform sensitivity
analysis of results obtained. The objective would be to demonstrate how data mining results support
business strategy. For example, patterns of fast food product combinations would be identified as a
basis for developing strategies for recombining some existing product offerings.
Further quantitative studies are required involving other industries to further validate the methodology
outlined above. It is expected that a set of best practices for data mining implementation projects will
emerge through further refinement of this methodology.