Study Guide

Hierarchical Clustering
Cluster Analysis
Cluster Analysis, also called data segmentation, has a variety of goals. All relate to grouping or
segmenting a collection of objects (also called observations, individuals, cases, or data rows) into subsets
or "clusters", such that those within each cluster are more closely related to one another than objects
assigned to different clusters. Central to all of the goals of cluster analysis is the notion of degree of
similarity (or dissimilarity) between the individual objects being clustered. There are two major methods
of clustering -- hierarchical clustering and k-means clustering.
In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a
series of partitions takes place, which may run from a single cluster containing all objects to n clusters
each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which
proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n
objects successively into finer groupings. Agglomerative techniques are more commonly used, and this is
the method implemented in XLMiner. Hierarchical clustering may be represented by a two-dimensional
diagram known as a dendrogram, which illustrates the fusions or divisions made at each successive stage
of the analysis. An example of such a dendrogram is given below:
[Dendrogram figure]
Agglomerative methods
(we used this method in lab)
An agglomerative hierarchical clustering procedure produces a series of partitions of the data, Pn, Pn-1,
..., P1. The first, Pn, consists of n single-object 'clusters'; the last, P1, consists of a single group
containing all n cases.
At each particular stage the method joins together the two clusters which are closest together (most
similar). (At the first stage, of course, this amounts to joining together the two objects that are closest
together, since at the initial stage each cluster has one object.)
Differences between methods arise because of the different ways of defining distance (or similarity)
between clusters.
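The same agglomerative procedure can be sketched in Python. This is only an illustrative stand-in for the XLMiner workflow: the five made-up points, the choice of 'complete' linkage, and the cut at three clusters are assumptions, not part of the lab.

import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Each row is one object (observation) with two measured variables.
X = np.array([[1.0, 1.0],
              [1.2, 0.8],
              [5.0, 5.2],
              [5.1, 4.9],
              [9.0, 0.5]])

# linkage() starts from n single-object clusters and repeatedly fuses the
# two closest clusters; 'complete' is one of several ways of defining the
# distance between clusters.
Z = linkage(X, method="complete", metric="euclidean")

# Cutting the tree gives one of the partitions Pn, ..., P1 -- here, 3 clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels)

# dendrogram() draws the two-dimensional diagram described above.
dendrogram(Z)
plt.show()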
Standard Data Partition
Data Partitioning
Most data mining projects use large volumes of data. Before building a model, typically you partition the
data using a partition utility. Partitioning yields mutually exclusive datasets: a training dataset, a
validation dataset and a test dataset.
Training Set
The training dataset is used to train or build a model. For example, in a linear regression, the training
dataset is used to fit the linear regression model, i.e. to compute the regression coefficients. In a neural
network model, the training dataset is used to obtain the network weights.
Validation Set
Once a model is built on training data, you need to find out the accuracy of the model on unseen data.
For this, the model should be used on a dataset that was not used in the training process -- a dataset
where you know the actual value of the target variable. The discrepancy between the actual value and
the predicted value of the target variable is the error in prediction. Some form of average error (MSE or
average % error) measures the overall accuracy of the model.
If you were to use the training data itself to compute the accuracy of the model fit, you would get an
overly optimistic estimate of the accuracy of the model. This is because the training or model fitting
process ensures that the accuracy of the model for the training data is as high as possible -- the model is
specifically suited to the training data. To get a more realistic estimate of how the model would perform
with unseen data, you need to set aside a part of the original data and not use it in the training process.
This dataset is known as the validation dataset. After fitting the model on the training dataset, you
should test its performance on the validation dataset.
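As a concrete sketch, the partition and the validation check might look as follows in Python with scikit-learn. The synthetic data, the 60/40 split, and the use of mean squared error are illustrative assumptions.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic data: 200 rows, 3 predictors, and a noisy linear target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.3, size=200)

# Partition into mutually exclusive training and validation datasets.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.4, random_state=0)

# Fit (train) the model on the training data only.
model = LinearRegression().fit(X_train, y_train)

# The training error is overly optimistic; the validation error is the
# more realistic estimate of accuracy on unseen data.
print("training MSE:  ", mean_squared_error(y_train, model.predict(X_train)))
print("validation MSE:", mean_squared_error(y_valid, model.predict(X_valid)))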
Test Set (we did not use a test set in our lab)
The validation dataset is often used to fine-tune models. For example, you might try out neural network
models with various architectures and test the accuracy of each on the validation dataset to choose
among the competing architectures. In such a case, when a model is finally chosen, its accuracy with the
validation dataset is still an optimistic estimate of how it would perform with unseen data. This is because
the final model has come out as the winner among the competing models based on the fact that its
accuracy with the validation dataset is highest. Thus, you need to set aside yet another portion of data
which is used neither in training nor in validation. This set is known as the test dataset. The accuracy of
the model on the test data gives a realistic estimate of the performance of the model on completely
unseen data.
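A sketch of the full three-way partition is shown below: the candidate models compete on the validation set, and only the winner is scored, once, on the test set. The two candidate models and the split sizes are assumptions made purely for illustration.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error

# Synthetic data for illustration.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Carve off a test set first, then split the rest into training and
# validation sets, giving three mutually exclusive datasets.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=1)

# Competing models are compared by their validation error.
candidates = {"linear": LinearRegression(), "ridge": Ridge(alpha=1.0)}
validation_mse = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    validation_mse[name] = mean_squared_error(y_valid, model.predict(X_valid))

best_name = min(validation_mse, key=validation_mse.get)

# The winner's validation error is optimistic; its test error is the
# realistic estimate of performance on completely unseen data.
print("chosen model:", best_name)
print("test MSE:    ",
      mean_squared_error(y_test, candidates[best_name].predict(X_test)))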
Discriminant Analysis
Introduction
Discriminant analysis is a technique for classifying a set of observations into predefined classes. The
purpose is to determine the class of an observation based on a set of variables known as predictors or
input variables. The model is built based on a set of observations for which the classes are known. This
set of observations is sometimes referred to as the training set. Based on the training set, the technique
constructs a set of linear functions of the predictors, known as discriminant functions, such that
L = b1x1 + b2x2 + … + bnxn + c, where the b's are the discriminant coefficients, the x's are the input
variables or predictors, and c is a constant.
These discriminant functions are used to predict the class of a new observation whose class is unknown.
For a k-class problem, k discriminant functions are constructed. Given a new observation, all k
discriminant functions are evaluated and the observation is assigned to class i if the ith discriminant
function has the highest value.
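A minimal sketch of this classification step, using scikit-learn's linear discriminant analysis as a stand-in for XLMiner. The toy training set (two predictors, three classes) and the new observation are assumptions for illustration only.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Training set: observations whose classes are known.
X_train = np.array([[1.0, 2.0], [1.5, 1.8], [0.8, 2.4],
                    [5.0, 5.0], [5.5, 4.6], [4.8, 5.3],
                    [9.0, 1.0], [8.5, 1.4], [9.2, 0.7]])
y_train = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])

# Build the discriminant functions from the training set.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# For a k-class problem, k linear functions of the predictors are evaluated
# and the new observation is assigned to the class with the highest score.
x_new = np.array([[5.1, 5.1]])
print("discriminant scores:", lda.decision_function(x_new))
print("predicted class:    ", lda.predict(x_new))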
Association Rules
Introduction
Association rule mining finds interesting associations and/or correlation relationships among large sets of
data items. Association rules show attribute value conditions that occur frequently together in a given
dataset. A typical and widely-used example of association rule mining is Market Basket Analysis.
For example, data are collected using bar-code scanners in supermarkets. Such ‘market basket’
databases consist of a large number of transaction records. Each record lists all items bought by a
customer on a single purchase transaction. Managers would be interested to know if certain groups of
items are consistently purchased together. They could use this data for adjusting store layouts (placing
items optimally with respect to each other), for cross-selling, for promotions, for catalog design and to
identify customer segments based on buying patterns.
Association rules provide information of this type in the form of "if-then" statements. These rules are
computed from the data and, unlike the if-then rules of logic, association rules are probabilistic in nature.
In addition to the antecedent (the "if" part) and the consequent (the "then" part), an association rule has
two numbers that express the degree of uncertainty about the rule. In association analysis the
antecedent and consequent are sets of items (called itemsets) that are disjoint (do not have any items in
common).
The first number is called the support for the rule. The support is simply the number of transactions that
include all items in the antecedent and consequent parts of the rule. (The support is sometimes
expressed as a percentage of the total number of records in the database.)
The other number is known as the confidence of the rule. Confidence is the ratio of the number of
transactions that include all items in the consequent as well as the antecedent (namely, the support) to
the number of transactions that include all items in the antecedent.
For example, if a supermarket database has 100,000 point-of-sale transactions, out of which 2,000
include both items A and B and 800 of these include item C, the association rule "If A and B are
purchased then C is purchased on the same trip" has a support of 800 transactions (alternatively 0.8% =
800/100,000) and a confidence of 40% (=800/2,000). One way to think of support is that it is the
probability that a randomly selected transaction from the database will contain all items in the antecedent
and the consequent, whereas the confidence is the conditional probability that a randomly selected
transaction will include all the items in the consequent given that the transaction includes all the items in
the antecedent.
Lift is one more parameter of interest in association analysis. Lift is the ratio of Confidence to Expected
Confidence. Expected Confidence, in this example, is the confidence we would see if buying A and B did
not enhance the probability of buying C; it is the number of transactions that include the consequent
divided by the total number of transactions. Suppose the total number of transactions that include C is
5,000. The Expected Confidence is then 5,000/100,000 = 5%. For our supermarket example, Lift =
Confidence/Expected Confidence = 40%/5% = 8. Hence Lift tells us how much the "if" (antecedent) part
of the rule increases the probability of the "then" (consequent) part.
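The arithmetic of the supermarket example can be reproduced in a few lines of Python; the counts below are exactly the ones quoted above for the rule "if A and B then C".

# Counts from the worked example above.
n_transactions = 100_000   # total point-of-sale transactions
n_antecedent = 2_000       # transactions containing both A and B
n_rule = 800               # transactions containing A, B and C
n_consequent = 5_000       # transactions containing C

support = n_rule / n_transactions                     # 0.008 -> 0.8%
confidence = n_rule / n_antecedent                    # 0.40  -> 40%
expected_confidence = n_consequent / n_transactions   # 0.05  -> 5%
lift = confidence / expected_confidence               # 8.0

print(f"support             = {support:.1%}")
print(f"confidence          = {confidence:.1%}")
print(f"expected confidence = {expected_confidence:.1%}")
print(f"lift                = {lift:.1f}")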