
Week 2 Practical

Practical 3: Introduction to Pattern Discovery
3.1 Introduction
3.2 Cluster Analysis
    Demonstration: Segmenting Census Data
    Demonstration: Exploring and Filtering Analysis Data
    Demonstration: Setting Cluster Tool Options
    Demonstration: Creating Clusters with the Cluster Tool
    Demonstration: Specifying the Segment Count
    Demonstration: Exploring Segments
    Demonstration: Profiling Segments
    Exercises
3.3 Market Basket Analysis
    Demonstration: Market Basket Analysis
    Demonstration: Sequence Analysis
    Exercises
3.4 Chapter Summary
3.5 Solutions
    Solutions to Exercises
    Solutions to Student Activities (Polls/Quizzes)
3.1 Introduction
Copyright © 2014, SAS Institute Inc. All rights reserved.
There are a multitude of definitions for the field of data mining and knowledge discovery. Most
center on the concept of pattern discovery. For example, David Hand, Professor of Statistics at
Imperial College, London and a noted data mining authority, defines the field as “…the discovery
of interesting, unexpected, or valuable structures in large data sets.” (Hand 2005) This is made
possible by the ever-increasing data stores brought about by the era’s information technology.
Although Hand’s pronouncement is grandly promising, experience shows it to be overly
optimistic. Herb Edelstein, President of Two Crows Corporation and an internationally
recognized expert in data mining, data warehousing, and CRM, counters with the following
(Beck (Editor) 1997):
“If you’ve got terabytes of data, and you’re relying on data mining to find interesting things
in there for you, you’ve lost before you’ve even begun. You really need people who
understand what it is they are looking for – and what they can do with it once they find it.”
Many people think data mining (in particular, pattern discovery) means magically finding hidden
nuggets of information without having to formulate the problem and without regard to the
structure or content of the data. This is an unfortunate misconception.
In his defense, David Hand is well aware of the limitations of pattern discovery and provides
guidance on how these analyses can fail (Hand 2005). These failings often fall into one of six
categories:
- Poor data quality assumes many guises: inaccuracies (measurement or recording errors); missing, incomplete, or outdated values; and inconsistencies (changes of definition). Patterns found in false data are fantasies.
- Opportunity transforms the possible to the perceived. Hand refers to this as the problem of multiplicity, or the law of truly large numbers. Examples of this abound. Hand notes that the odds of a person winning the lottery in the United States are extremely small, and the odds of that person winning it twice are fantastically so. However, the odds of someone in the United States winning it twice (in a given year) are actually better than even. As another example, you can search the digits of pi for "prophetic" strings such as your birthday or significant dates in history and usually find them, given enough digits (www.angio.net/pi/piquery).
- Intervention, that is, taking action on the process that generates a set of data, can destroy or distort detected patterns. For example, fraud detection techniques lead to preventative measures, but the fraudulent behavior often evolves in response to this intervention.
- Separability of the interesting from the mundane is not always possible, given the information found in a data set. Despite the many safeguards in place, it is estimated that credit card companies lose $0.18 to $0.24 per $100 in online transactions (Rubinkam 2006).
- Obviousness in discovered patterns reduces the perceived utility of an analysis. Among the patterns discovered through automatic detection algorithms, you find that there is an almost equal number of married men and married women, that ovarian cancer occurs primarily in women, and that check fraud occurs most often for customers with checking accounts.
- Non-stationarity occurs when the process that generates a data set changes of its own accord. In such circumstances, patterns detected from historic data can simply cease. As Eric Hoffer states, "In times of change, learners inherit the Earth, while the learned find themselves beautifully equipped to deal with a world that no longer exists."
Despite the potential pitfalls, there are many successful applications of pattern discovery:
- Data reduction is the most ubiquitous application, that is, exploiting patterns in data to create a more compact representation of the original. Though vastly broader in scope, data reduction includes analytic methods such as cluster analysis.
- Novelty detection methods seek unique or previously unobserved data patterns. The methods find application in business, science, and engineering. Business applications include fraud detection, warranty claims analysis, and general business process monitoring.
- Profiling is a by-product of reduction methods such as cluster analysis. The idea is to create rules that isolate clusters or segments, often based on demographic or behavioral measurements. A marketing analyst might develop profiles of a customer database to describe the consumers of a company's products.
- Market basket analysis, or association rule discovery, is used to analyze streams of transaction data (for example, market baskets) for combinations of items that occur (or do not occur) more (or less) commonly than expected. Retailers can use this as a way to identify interesting combinations of purchases or as predictors of customer segments.
- Sequence analysis extends market basket analysis by adding a time dimension. In this way, transaction data is examined for sequences of items that occur (or do not occur) more (or less) commonly than expected. A webmaster might use sequence analysis to identify patterns or problems of navigation through a website.
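The support and confidence measures behind association rule discovery are simple to compute. The following Python toy (invented baskets and items, not SAS Enterprise Miner's Association node) scores every one-item rule in one direction:

```python
from itertools import combinations

# Toy transaction stream; items and baskets are invented for illustration.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(lhs, rhs, baskets):
    """P(rhs in basket | lhs in basket) for the rule lhs -> rhs."""
    return support(set(lhs) | set(rhs), baskets) / support(lhs, baskets)

# Score every one-item -> one-item rule (one direction only) and
# list the strongest by confidence.
items = sorted(set().union(*baskets))
rules = [(a, b, support([a, b], baskets), confidence([a], [b], baskets))
         for a, b in combinations(items, 2)]
for a, b, sup, conf in sorted(rules, key=lambda r: -r[3])[:3]:
    print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}")
```

A rule such as "diapers implies beer" is interesting only when its confidence meaningfully exceeds the baseline support of the right-hand item.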
The first three pattern discovery applications are primarily served (in no particular order) by
three tools in SAS Enterprise Miner: Cluster, SOM/Kohonen, and Segment Profile. The next
section features a demonstration of the Cluster and Segment Profile tools.
Market basket analysis and sequence analysis are performed by the Association tool. The Path
Analysis tool can also be used to analyze sequence data. (An optional demonstration of the
Association tool is presented at the end of this chapter.)
3.2 Cluster Analysis
Unsupervised classification (also known as clustering or segmenting) attempts to group training data set cases based on similarities in input variables. It is a data reduction method because an entire training data set can be represented by a small number of clusters. The groupings are known as clusters or segments, and they can be applied to other data sets to classify new cases. It is distinguished from supervised classification (also known as predictive modeling), which is discussed in later chapters.
The purpose of clustering is often description. For example, segmenting existing customers into
groups and associating a distinct profile with each group might help future marketing strategies.
However, there is no guarantee that the resulting clusters are meaningful or useful.
Unsupervised classification is also useful as a step in predictive modeling. For example, customers can be clustered into homogeneous groups based on sales of different items. Then a model can be built to predict the cluster membership based on more easily obtained input variables.
k-means Clustering Algorithm
1. Select inputs.
2. Select k cluster centers.
3. Assign cases to the closest center.
4. Update cluster centers.
5. Reassign cases.
6. Repeat steps 4 and 5 until convergence.
One of the most commonly used methods for clustering is the k-means algorithm. It is a
straightforward algorithm that scales well to large data sets, and is, therefore, the primary tool for
clustering in SAS Enterprise Miner.
Although often overlooked as an important part of a clustering process, the first step in using the
k-means algorithm is to choose a set of inputs. In general, you should seek inputs that have the
following attributes:
- are meaningful to the analysis objective
- are relatively independent
- are limited in number
- have a measurement level of Interval
- have low kurtosis and skewness statistics (at least in the training data)
Choosing meaningful inputs is clearly important for interpretation and explanation of the
generated clusters. Independence and limited input count make the resulting clusters more stable.
(Small perturbations of training data usually do not result in large changes to the generated
clusters.) An interval measurement level is recommended for k-means to produce non-trivial
clusters. Low kurtosis and skewness statistics on the inputs avoid creating single-case outlier
clusters.
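The skewness and kurtosis screening described above can be checked before clustering. Here is a NumPy sketch with synthetic stand-ins for two inputs (the variable names and thresholds are illustrative assumptions, not SAS defaults):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for two candidate inputs (names are hypothetical):
# one roughly symmetric, one with a long right tail.
inputs = {
    "MeanHHSz": rng.normal(2.5, 0.5, 10_000),     # near-normal
    "RegPop":   rng.lognormal(8.0, 1.5, 10_000),  # heavily right-skewed
}

def skewness(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 3))

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return float(np.mean(z ** 4) - 3.0)

# Screen inputs whose shape invites single-case outlier clusters.
for name, x in inputs.items():
    note = "ok" if abs(skewness(x)) < 1 and excess_kurtosis(x) < 3 else "consider transforming"
    print(f"{name}: skewness={skewness(x):.2f}, "
          f"excess kurtosis={excess_kurtosis(x):.2f} -> {note}")
```

A log or rank transformation of the flagged input is one common remedy before running k-means.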
The next step in the k-means algorithm is to choose a value for k, the number of cluster centers.
SAS Enterprise Miner features an automatic way to do this, assuming that the data has k distinct
concentrations of cases. If this is not the case, you should choose k to be consistent with your
analytic objectives.
With k selected, the k-means algorithm chooses cases to represent the initial cluster centers (also
named seeds).
The Euclidean distance from each case in the training data to each cluster center is calculated.
Cases are assigned to the closest cluster center.
Note: Because the distance metric is Euclidean, it is important for the inputs to have compatible measurement scales. Unexpected results can occur if one input's measurement scale differs greatly from the others.
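The scale problem is easy to demonstrate: with one input measured in dollars and another in persons, the dollar input dominates the raw Euclidean distance. A small NumPy illustration with invented region values:

```python
import numpy as np

# Two hypothetical regions with nearly identical income but very different
# household sizes, and a third with the same household size as the first
# but much lower income. (All values invented for illustration.)
#              MedHHInc  MeanHHSz
a = np.array([52_000.0, 1.8])
b = np.array([52_400.0, 4.1])
c = np.array([38_000.0, 1.8])

def euclid(u, v):
    return float(np.sqrt(((u - v) ** 2).sum()))

# On the raw scale, income's magnitude swamps household size:
# a looks far closer to b than to c, even though a and b differ sharply
# in household size.
print(euclid(a, b), euclid(a, c))

# Standardize each input (z-score across the three cases) and re-measure:
# household size now contributes on equal footing, and the ordering flips.
X = np.vstack([a, b, c])
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclid(Z[0], Z[1]), euclid(Z[0], Z[2]))
```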
The cluster centers are updated to equal the average of the cases assigned to the cluster
in the previous step.
Cases are reassigned to the closest cluster center.
The update and reassign steps are repeated until the process converges.
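The assign/update/reassign cycle can be sketched as a minimal k-means loop. This is illustrative NumPy code, not the SAS implementation; the synthetic data and seed choice are assumptions:

```python
import numpy as np

def kmeans(X, k, seed=0, max_iter=100):
    """Minimal k-means: seed centers from cases, then assign/update/repeat."""
    rng = np.random.default_rng(seed)
    # Step 2: choose k training cases as the initial cluster centers (seeds).
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Steps 3/5: assign each case to the closest center (Euclidean).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Step 4: move each center to the mean of its assigned cases.
        new = np.vstack([X[labels == j].mean(axis=0) if np.any(labels == j)
                         else centers[j] for j in range(k)])
        # Step 6: stop when the update no longer moves the centers.
        if np.allclose(new, centers):
            break
        centers = new
    return centers, labels

# Synthetic training data with two obvious concentrations of cases.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (100, 2)),
               rng.normal(5.0, 0.3, (100, 2))])
centers, labels = kmeans(X, k=2)
```

With two well-separated concentrations, the loop converges with one center per concentration regardless of where the initial seeds land.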
On convergence, final cluster assignments are made. Each case is assigned to a unique segment.
The segment definitions can be stored and applied to new cases outside of the training data.
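Applying stored segment definitions to new cases amounts to reusing the training standardization and assigning each case to its nearest center. A sketch with hypothetical centers and scaling constants (all numbers invented):

```python
import numpy as np

# Stored segment definitions: final cluster centers from training, kept on
# the standardized scale, plus the training means/standard deviations.
# (All numbers are hypothetical, for a made-up three-segment solution
# over two inputs, MedHHInc and MeanHHSz.)
centers = np.array([[-1.0, -0.8],    # segment 1
                    [ 0.2,  1.1],    # segment 2
                    [ 1.3, -0.4]])   # segment 3
train_mean = np.array([45_000.0, 2.5])
train_std  = np.array([12_000.0, 0.6])

def score(new_cases):
    """Assign each new case to the closest stored center (1-based segment)."""
    z = (new_cases - train_mean) / train_std   # reuse the training scaling
    d = np.linalg.norm(z[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1) + 1

new = np.array([[30_000.0, 2.0],
                [48_000.0, 3.2],
                [62_000.0, 2.3]])
print(score(new))   # one segment number per new case
```

Note that the training means and standard deviations must be stored along with the centers; standardizing new data by its own statistics would shift the segment boundaries.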
Although they are often used synonymously, a segmentation analysis is distinct from a traditional
cluster analysis. A cluster analysis is geared toward identifying distinct concentrations of cases in
a data set. When no distinct concentrations exist, the best you can do is a segmentation analysis –
that is, algorithmically partitioning the input space into contiguous groups of cases.
Demographic Segmentation Demonstration

Analysis goal: Group geographic regions into segments based on income, household size, and population density.

Analysis plan:
- Select and transform segmentation inputs.
- Select the number of segments to create.
- Create segments with the Cluster tool.
- Interpret the segments.
The following demonstration illustrates the use of clustering tools. The goal of the analysis is to
group people in the United States into distinct subsets based on urbanization, household size, and
income factors. These factors are common to commercial lifestyle and life-stage segmentation
products.
(For examples, see http://www.claritas.com/ or http://www.spectramarketing.com.)
Segmenting Census Data
This demonstration introduces SAS Enterprise Miner tools and techniques for cluster and
segmentation analysis. There are five parts:
- define the diagram and data source
- explore and filter the training data
- integrate the Cluster tool into the process flow and select the number of segments to create
- run a segmentation analysis
- use the Segment Profile tool to interpret the analysis results
Diagram Definition
Use the following steps to define the diagram for the segmentation analysis.
1. Right-click Diagrams in the Project Panel and select Create Diagram. The Create New
Diagram window appears and requests a diagram name.
2. Enter Segmentation Analysis in the Diagram Name field and click OK. SAS Enterprise
Miner creates an analysis workspace window named Segmentation Analysis.
You use the Segmentation Analysis window to create process flow diagrams.
Data Source Definition
Follow these steps to create the segmentation analysis data source.
1. Right-click Data Sources in the Project Panel and select Create Data Source. The Data
Source Wizard appears.
The Data Source Wizard guides you through a process to create a SAS Enterprise Miner data
source.
2. Click Next to use a SAS table as the source for the metadata. (This is the usual choice.)
In this step, select the SAS table that you want to make available to SAS Enterprise Miner.
You can either enter the library name and SAS table name as libname.table-name or select
the SAS table from a list.
3. Click Browse to choose a SAS table from the libraries that are visible to the SAS Foundation
Server.
The Select a SAS Table dialog box appears. One of the libraries is named AAEM. This is the
library name that was defined in Chapter 2. Select Aaem and the Census2000 SAS table.
The Census2000 data is a postal code-level summary of the entire 2000 United States
Census.
It features seven variables:

ID         postal code of the region
LOCX       region longitude
LOCY       region latitude
MEANHHSZ   average household size in the region
MEDHHINC   median household income in the region
REGDENS    region population density percentile (1=lowest density, 100=highest density)
REGPOP     number of people in the region
The data is suitable for the creation of life-stage, lifestyle segments using SAS Enterprise
Miner’s pattern discovery tools.
4. Click OK.
The Select a SAS Table dialog box closes and the selected table is entered in the Table field.
5. Click Next in two consecutive windows.
The Data Source Wizard starts the metadata definition process. SAS Enterprise Miner assigns
initial values to the metadata based on characteristics of the selected SAS table. The Basic
setting assigns initial values to the metadata based on variable attributes such as the variable
name, data type, and assigned SAS format. The Advanced setting also includes information
about the distribution of the variable to assign the initial metadata values.
6. Click Next to use the Basic setting.
The Data Source Wizard enables you to specify the role and level for each variable in the
selected SAS table. A default role is assigned based on the name of a variable. For example,
the variable ID was given the role ID based on its name. When a variable does not have a
name that corresponds to one of the possible variable roles, using the Basic setting, it is given
the default role of Input. An input variable is used for various types of analysis to describe a
characteristic, measurement, or attribute of a record, or case, in a SAS table.
The metadata settings are correct for the upcoming analysis.
7. Click Next until you reach Step 7. This is the second-to-last step in the Data Source Wizard.
The Data Source Wizard enables you to set a role for the selected SAS table and provide
descriptive comments about the data source definition. For the impending analysis, a table
role of Raw is acceptable.
8. Click Next to proceed to the final step. Click Finish to complete the data source definition.
The CENSUS2000 table is added to the Data Sources entry in the Project Panel.
Exploring and Filtering Analysis Data
A worthwhile next step in the process of defining a data source is to explore and validate its
contents. By assaying the prepared data, you substantially reduce the chances of erroneous
results in your analysis, and you can gain insights graphically into associations between
variables.
Data Source Exploration
1. Right-click the CENSUS2000 data source and select Edit Variables from the pop-up menu.
The Variables - CENSUS2000 dialog box appears.
2. Examine histograms for the available variables.
3. Select all listed inputs by dragging the pointer across all of the input names or by holding
down the Ctrl key and typing A.
4. Click Explore. The Explore window appears. This time it displays histograms for all of the
variables in the CENSUS2000 data source.
5. Maximize the MeanHHSz histogram by double-clicking its title bar. The histogram fills the
Explore window.
As before, increasing the number of histogram bins from the default of 10 increases your
understanding of the data.
6. Right-click in the histogram window and select Graph Properties from the pop-up menu.
The Properties - Histogram dialog box appears.
You can use the Properties - Histogram dialog box to change the appearance of the
corresponding histogram. Enter 100 in the Number of X Bins field and click OK.
The histogram is updated to show 100 bins.
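The effect of raising the bin count can be reproduced outside the GUI. In this NumPy sketch, synthetic household-size values (clipped away from zero, with a planted block of zeros standing in for the suspect records) show how a finer bin grid isolates the spike at zero:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical MeanHHSz-like values: household sizes around 2-3 people,
# clipped below at 0.5, plus a planted block of suspicious zeros.
hhsz = np.concatenate([np.clip(rng.normal(2.6, 0.5, 9_000), 0.5, None),
                       np.zeros(500)])

# Coarse view, similar to the default: 10 bins over the full range.
counts10, _ = np.histogram(hhsz, bins=10)
# Finer view: 100 narrow bins pin the spike to a sliver at zero.
counts100, edges100 = np.histogram(hhsz, bins=100)

spike = counts100[0]   # leftmost bin starts at the minimum (zero)
print(f"leftmost of 100 bins holds {spike} cases; "
      f"bin width {edges100[1] - edges100[0]:.3f}")
```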
There is a curious spike in the histogram at (or near) zero. A zero household size does not
make sense in the context of census data.
7. Select the bar near zero in the histogram.
8. Restore the size of the window by double-clicking the title bar of the MeanHHSz window.
The window returns to its original size.
The zero average household size seems to be evenly distributed across the longitude, latitude,
and density percentile variables. It seems concentrated on low incomes and populations, and
also makes up the majority of the missing observations in the distribution of Region Density.
It is worthwhile to look at the individual records of the explore sample.
9. Maximize the CENSUS2000 data table.
10. Scroll in the data table until you see the first selected row.
Records 45 and 46 (among others) have the zero Average Household Size characteristic.
Other fields in these records also have unusual values.
11. Select the Average Household Size column heading twice to sort the table by ascending values in this field. Cases of interest are collected at the top of the data table.
Most of the cases with zero Average Household Size have zero or missing values on the remaining non-geographic attributes. There are some exceptions, but it could be argued that such cases are not of interest for analyzing household demographics. The next part of this demonstration shows how to remove such cases from the subsequent analyses.
12. Close the Explore and Variables windows.
Case Filtering
The SAS Enterprise Miner Filter tool enables you to remove unwanted records from an analysis.
Use these steps to build a diagram to read a data source and to filter records.
1. Drag the CENSUS2000 data source to the Segmentation Analysis workspace window.
2. Click the Sample tab to access the Sample tool group.
3. Drag the Filter tool (fourth from the left) from the tool palette into the Segmentation Analysis workspace window and connect it to the CENSUS2000 data source.
4. Select the Filter node and examine the Properties panel.
Based on the values in the Properties panel, by default the node filters cases in rare levels of any class input variable and cases exceeding three standard deviations from the mean on any interval input variable.
Because the CENSUS2000 data source contains only interval inputs, only the Interval
Variables criterion is considered.
5. Change the Default Filtering Method property to User-Specified Limits.
6. Select the ellipsis button for the Interval Variables property. The Interactive Interval Filter window appears.
You are warned at the top of the dialog box that the Train or raw data set does not exist. This
indicates that you are restricted from the interactive filtering elements of the node, which are
available after a node is run. Nevertheless, you can enter filtering information.
7. Enter 0.1 as the Filter Lower Limit value for the input variable MeanHHSz. Press Enter to
make sure that the change is saved.
8. Click OK to close the Interactive Interval Filter dialog box. You are returned to the SAS
Enterprise Miner interface window.
All cases with an average household size less than 0.1 are filtered from subsequent analysis
steps.
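The same filtering rule is easy to express on a data table. A pandas sketch with a few hypothetical CENSUS2000-like rows (the IDs and values here are invented, not the actual census records):

```python
import pandas as pd

# A few hypothetical CENSUS2000-like rows (IDs and values invented).
census = pd.DataFrame({
    "ID":       ["00601", "00602", "00603", "00606", "00610"],
    "MeanHHSz": [0.00, 2.96, 0.00, 2.71, 3.04],
    "MedHHInc": [0, 32_226, 0, 14_079, 29_464],
})

# Mirror the Filter node's rule: keep cases at or above the 0.1 lower limit.
kept = census[census["MeanHHSz"] >= 0.1].reset_index(drop=True)
removed = len(census) - len(kept)
print(f"removed {removed} cases; {len(kept)} remain")
```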
9. Run the Filter node and view the results. The Results window appears.
10. Go to line 38 in the Output window.
The Filter node removed 1081 cases with a household size of zero.
11. Close the Results window. The CENSUS2000 data is ready for segmentation.
Setting Cluster Tool Options
The Cluster tool performs k-means cluster analysis, a widely used method for cluster and segmentation analysis. This demonstration shows you how to use the tool to segment the cases in the CENSUS2000 data set.
1. Click the Explore tab.
2. Locate and drag a Cluster tool into the diagram workspace.
3. Connect the Filter node to the Cluster node.
To create meaningful segments, you need to set the Cluster node to do the following:
- ignore irrelevant inputs
- standardize the inputs to have a similar range
4. Select the Variables property for the Cluster node. The Variables window appears.
5. Select Use → No for LocX, LocY, and RegPop.
The Cluster node creates segments using the inputs MedHHInc, MeanHHSz, and RegDens.
Segments are created based on the (Euclidean) distance between each case in the space of
selected inputs. If you want to use all the inputs to create clusters, these inputs should have
similar measurement scales. Calculating distances using standardized distance measurements
(subtracting the mean and dividing by the standard deviation of the input values) is one way
to ensure this. You can standardize the input measurements using the Transform Variables
node. However, it is easier to use the built-in property in the Cluster node.
6. Select the inputs MedHHInc, MeanHHSz, and RegDens. Select Explore. The Explore window
appears.
The inputs that are selected for use in the cluster are on three entirely different measurement scales. They need to be standardized if you want a meaningful clustering.
7. Close the Explore window.
8. Click OK to close the Variables window.
9. Notice the default setting for Internal Standardization: Internal Standardization → Standardization. No change is required, because standardization is performed on the input variables and distances between points are calculated based on the standardized measurements.
Note: Another way to standardize an input is to subtract the input's minimum value and divide by the input's range. This is called range standardization. Range standardization rescales the distribution of each input to the unit interval, [0,1].
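The two standardization choices compare as follows, on a set of illustrative income values (invented for this sketch):

```python
import numpy as np

# Illustrative income values on a dollar scale.
x = np.array([12_000.0, 28_000.0, 45_000.0, 61_000.0, 154_000.0])

# Z-score standardization: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

# Range standardization: subtract the minimum, divide by the range -> [0, 1].
r = (x - x.min()) / (x.max() - x.min())

print(z.round(2))
print(r.round(2))
```

Z-scores center the input at 0 with unit spread, while range standardization pins the extremes to 0 and 1; a single outlier therefore compresses the rest of a range-standardized input more than a z-scored one.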
The Cluster node is ready to run.
Creating Clusters with the Cluster Tool
By default, the Cluster tool attempts to automatically determine the number of clusters in the
data. A three-step process is used.
Step 1 A large number of cluster seeds are chosen (50 by default) and placed in the input space.
Cases in the training data are assigned to the closest seed, and an initial clustering of the
data is completed. The means of the input variables in each of these preliminary clusters
are substituted for the original training data cases in the second step of the process.
Step 2 A hierarchical clustering algorithm (Ward’s method) is used to sequentially consolidate
the clusters that were formed in the first step. At each step of the consolidation, a statistic
named the cubic clustering criterion (CCC) (Sarle 1983) is calculated. Then, the smallest
number of clusters that meets both of the following criteria is selected:
- The number of clusters must be greater than or equal to the number that is specified as the minimum value in the Selection Criterion properties.
- The number of clusters must have cubic clustering criterion statistic values that are greater than the CCC threshold that is specified in the Selection Criterion properties.
Step 3 The number of clusters determined by the second step provides the value for k in a k-means clustering of the original training data cases.
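The two-stage idea (over-cluster with many seeds, then consolidate with Ward's method) can be sketched as follows. One substitution to note: computing the CCC is involved, so this illustration picks the cluster count from the jump in Ward merge cost instead; it is a heuristic stand-in, not the Cluster node's actual criterion.

```python
import numpy as np

rng = np.random.default_rng(3)
# Synthetic training data with three concentrations of cases.
X = np.vstack([rng.normal(c, 0.4, (200, 2))
               for c in ((0.0, 0.0), (6.0, 0.0), (3.0, 5.0))])

# Stage 1: over-cluster with many seeds (a simplified stand-in for the
# 50-seed preliminary pass): one nearest-seed assignment, then take the
# mean and size of each preliminary cluster.
seeds = X[rng.choice(len(X), size=50, replace=False)]
lab = np.linalg.norm(X[:, None] - seeds[None], axis=2).argmin(axis=1)
means = [X[lab == j].mean(axis=0) for j in range(50) if np.any(lab == j)]
sizes = [int((lab == j).sum()) for j in range(50) if np.any(lab == j)]

# Stage 2: consolidate the preliminary clusters with Ward's method,
# recording the merge cost at each step.
costs = []
while len(means) > 1:
    best = None
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            # Ward merge cost: n_i*n_j/(n_i+n_j) * squared center distance.
            c = (sizes[i] * sizes[j] / (sizes[i] + sizes[j])
                 * ((means[i] - means[j]) ** 2).sum())
            if best is None or c < best[0]:
                best = (c, i, j)
    c, i, j = best
    costs.append(c)
    merged = (sizes[i] * means[i] + sizes[j] * means[j]) / (sizes[i] + sizes[j])
    means[i], sizes[i] = merged, sizes[i] + sizes[j]
    del means[j], sizes[j]

# The first large jump in merge cost marks the point where genuinely
# distinct clusters start being forced together; the count just before
# that merge is the suggested k.
jumps = np.diff(costs)
k_hat = len(costs) - int(np.argmax(jumps))
print(f"suggested cluster count: {k_hat}")
```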
1. Run the Cluster node and select Results. The Results – Node: Cluster Diagram window
appears.
The Results - Cluster window contains four embedded windows.
- The Segment Plot window attempts to show the distribution of each input variable by cluster.
- The Mean Statistics window lists various descriptive statistics by cluster.
- The Segment Size window shows a pie chart describing the size of each cluster formed.
- The Output window shows the output of various SAS procedures run by the Cluster node.
Apparently, the Cluster node found four clusters in the CENSUS2000 data. Because the
number of clusters is based on the cubic clustering criterion, it might be interesting to
examine the values of this statistic for various cluster counts.
2. Select View → Summary Statistics → CCC Plot. The CCC Plot window appears.
In theory, the number of clusters in a data set is revealed by the peak of the CCC versus
Number of Clusters plot. However, when no distinct concentrations of data exist, the utility
of the CCC statistic is somewhat suspect. SAS Enterprise Miner attempts to establish
reasonable defaults for its analysis tools. The appropriateness of these defaults, however,
strongly depends on the analysis objective and the nature of the data.
There is a local maximum on the CCC plot at four clusters. This is why SAS Enterprise Miner returned a four-cluster solution.
Specifying the Segment Count
You might want to increase the number of clusters created by the Cluster node. You can do this
by changing the CCC cutoff property or by specifying the desired number of clusters.
1. In the Properties panel for the Cluster node, select Specification Method → User Specify.
The User Specify setting creates a number of segments indicated by the Maximum Number
of Clusters property (in this case, 10).
2. Run the Cluster node and select Results. The Results - Node: Cluster Diagram window
appears, and shows a total of 10 generated segments.
As seen in the Mean Statistics window, segment frequency counts vary from 10 cases to
more than 9,000 cases.
Exploring Segments
Although the Results window shows a variety of data summarizing the analysis, it is difficult to
understand the composition of the generated clusters. If the number of cluster inputs is small, the
Graph Wizard can aid in interpreting the cluster analysis.
1. Close the Results - Cluster window.
2. Select Exported Data from the Properties panel for the Cluster node. The Exported Data - Cluster window appears.
This window shows the data sets that are generated and exported by the Cluster node.
3. Select the Train data set and select Explore. The Explore window appears.
You can use the Graph Wizard to generate plots of the CENSUS2000 data.
4. Select Actions  Plot. The Select a Chart Type window appears. Select Scatter to create a
scatter plot.
5. Click Next. The Graph Wizard proceeds to the next step, which is Select Chart Roles.
6. Enable multiple role assignments by selecting the check box at the bottom of the table. Assign the roles X and Y to LocX and LocY, respectively. Select Color as the role for _SEGMENT_, and select Tip as the role for MeanHHSz, MedHHInc, and RegDens.
7. Click Finish.
The Explore window displays a plot of the CENSUS2000 data.
8. Display the tooltips by placing the pointer over a place in the plot.
Each square in the plot represents a unique postal code. The squares are color-coded by
cluster segment.
Right-click in the plot to modify graph properties such as marker size and colors.
To further aid interpretability, add a distribution plot of the segment number.
1. Select Actions  Plot. The Select a Chart Type window appears.
2. Select Bar.
3. Click Next.
4. Select Role  Category for the variable _SEGMENT_.
5. Click Finish.
A histogram of _SEGMENT_ is displayed.
By itself, this plot is of limited use. However, when the plot is combined with the plot using
latitude and longitude, you can easily interpret the generated segments in a two-dimensional
plot of the USA.
6. After you delete the other tables from the Explore window, select Window  Tile to show both graphs. Drag the pointer over a segment of the plot of the USA and notice that the cluster segments in this area are highlighted in the histogram.
7. Move the box to the right and notice that the selected proportions in the histogram change as
you move the box.
8. Experiment with creating other types of graphs with this data in the Explore window. For
example, consider a three-dimensional scatter plot. Use MeanHHSz, MedHHInc, and
RegDens on each axis with the variable _SEGMENT_ assigned the role Color.
9. Close the Explore, Exported Data, and Results windows.
Profiling Segments
You can gain a great deal of insight by creating plots as in the previous demonstration. Unfortunately, if more than three variables are used to generate the segments, the interpretation of such plots becomes difficult.
Fortunately, there is another useful tool in SAS Enterprise Miner for interpreting the composition
of clusters: the Segment Profile tool. This tool enables you to compare the distribution of a
variable in an individual segment to the distribution of the variable overall. As a bonus, the
variables are sorted by how well they characterize the segment.
1. Drag a Segment Profile tool from the Assess tool palette into the diagram workspace.
2. Connect the Cluster node to the Segment Profile node.
To best describe the segments, you should choose a reasonable subset of the available input
variables.
3. Select the Variables property for the Segment Profile node.
4. Select Use  No for ID, LocX, LocY, and RegPop.
5. Click OK to close the Variables dialog box.
6. Run the Segment Profile node and select Results. The Results - Node: Segment Profile
Diagram window appears.
7. Maximize the Profile window.
Features of each segment become apparent. For example, segment 4, when compared to the
overall distributions, has a lower Region Density Percentile, slightly higher Average
Household Size, and more central Median Household Income.
8. Maximize the Variable Worth: _SEGMENT_ window.
The window shows the relative worth of each variable in characterizing each segment. For
example, segment 4 is largely characterized by the RegDens input, but the other two inputs
also play a role.
Again, similar analyses can be used to describe the other segments. The advantage of the
Segment Profile window (compared to direct viewing of the segmentation) is that the
descriptions can be more than three-dimensional.
Exercises
1. Conducting Cluster Analysis
The DUNGAREE data set gives the number of pairs of four different types of dungarees that
were sold at stores over a specific time period. Each row represents an individual store. There
are six columns in the data set. One column is the store identification number, and the
remaining columns contain the number of pairs of each type of jeans that were sold.
Name       Model Role   Measurement Level   Description
STOREID    ID           Nominal             Identification number of the store
FASHION    Input        Interval            Number of pairs of fashion jeans sold at the store
LEISURE    Input        Interval            Number of pairs of leisure jeans sold at the store
STRETCH    Input        Interval            Number of pairs of stretch jeans sold at the store
ORIGINAL   Input        Interval            Number of pairs of original jeans sold at the store
SALESTOT   Rejected     Interval            Total number of pairs of jeans sold (the sum of FASHION, LEISURE, STRETCH, and ORIGINAL)
a. Create a new diagram in your project. Name the diagram Jeans.
b. Define the data set DUNGAREE as a data source.
c. Determine whether the model roles and measurement levels assigned to the variables are
appropriate.
Examine the distribution of the variables.
 Are there any unusual data values?
 Are there missing values that should be replaced?
d. Assign the variable STOREID the model role ID and the variable SALESTOT the
model role Rejected. Make sure that the remaining variables have the Input model role
and the Interval measurement level. Why should the variable SALESTOT be rejected?
e. Add an Input Data Source node to the diagram workspace and select the DUNGAREE
data table as the data source.
f. Add a Cluster node to the diagram workspace and connect it to the Input Data node.
g. Select the Cluster node. Leave the default setting as Internal Standardization 
Standardization. What would happen if inputs were not standardized?
h. Run the diagram from the Cluster node and examine the results.
Does the number of clusters created seem reasonable?
i. Specify a maximum of six clusters and rerun the Cluster node. How does the number and
quality of clusters compare to that obtained in part h?
j. Use the Segment Profile node to summarize the nature of the clusters.
3.3 Market Basket Analysis (Self-Study)
Copyright © 2014, SAS Institute Inc. All rights reserved.
Market Basket Analysis
Transactions:
A B C
A C D
B C D
A D E
B C E

Rule        Support   Confidence
A ⇒ D       2/5       2/3
C ⇒ A       2/5       2/4
A ⇒ C       2/5       2/3
B & C ⇒ D   1/5       1/3
Market basket analysis (also known as association rule discovery or affinity analysis) is a
popular data mining method. In the simplest situation, the data consists of two variables: a
transaction and an item.
For each transaction, there is a list of items. Typically, a transaction is a single customer
purchase, and the items are the things that were bought. An association rule is a statement of the
form (item set A) ⇒ (item set B).
The aim of the analysis is to determine the strength of all the association rules among a set of
items.
The strength of the association is measured by the support and confidence of the rule. The support for the rule A ⇒ B is the probability that the two item sets occur together. The support of the rule A ⇒ B is estimated by the following:

    support(A ⇒ B) = (transactions that contain every item in A and B) / (all transactions)

Notice that support is symmetric. That is, the support of the rule A ⇒ B is the same as the support of the rule B ⇒ A.
The confidence of an association rule A ⇒ B is the conditional probability of a transaction containing item set B given that it contains item set A. The confidence is estimated by the following:

    confidence(A ⇒ B) = (transactions that contain every item in A and B) / (transactions that contain the items in A)
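As a quick check of these definitions, the five example transactions shown earlier can be scored in a few lines of Python. The helper functions are illustrative, not part of SAS Enterprise Miner:

```python
# The five transactions from the market basket example above.
transactions = [
    {"A", "B", "C"},
    {"A", "C", "D"},
    {"B", "C", "D"},
    {"A", "D", "E"},
    {"B", "C", "E"},
]

def support(lhs, rhs):
    """P(transaction contains every item in lhs and rhs)."""
    both = lhs | rhs
    return sum(both <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    """P(rhs | lhs) = count(lhs and rhs) / count(lhs)."""
    both = lhs | rhs
    return sum(both <= t for t in transactions) / sum(lhs <= t for t in transactions)

print(support({"A"}, {"C"}))     # 0.4  (2/5, matching the table)
print(confidence({"A"}, {"C"}))  # 2/3, matching the table
```

Note that `both <= t` is Python's subset test, so each rule is counted exactly as the definitions above describe.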
Implication?

                         Checking Account
                         No        Yes       Total
Savings Account   No     500       3,500     4,000
                  Yes    1,000     5,000     6,000
                  Total                      10,000

Support(SVG ⇒ CK) = 50%
Confidence(SVG ⇒ CK) = 83%
Expected Confidence(SVG ⇒ CK) = 85%
Lift(SVG ⇒ CK) = 0.83/0.85 < 1
The interpretation of the implication (⇒) in association rules is precarious. High confidence and support do not imply cause and effect. The rule is not necessarily interesting. The two items might not even be correlated. The term confidence is not related to its statistical usage; there is no repeated-sampling interpretation.
Consider the association rule (savings account) ⇒ (checking account). This rule has 50% support (5,000/10,000) and 83% confidence (5,000/6,000). Based on these two measures, this might be considered a strong rule. On the contrary, those without a savings account are even more likely to have a checking account (87.5%). In fact, savings and checking are negatively correlated.

If the two accounts were independent, then knowing that a person has a savings account would not help in knowing whether that person has a checking account. The expected confidence if the two accounts were independent is 85% (8,500/10,000). This is higher than the confidence of SVG ⇒ CK.

In general, the expected confidence of the rule A ⇒ B is estimated by the following:

    expected confidence(A ⇒ B) = (transactions that contain the items in B) / (all transactions)

The lift of the rule A ⇒ B is the confidence of the rule divided by the expected confidence, assuming that the item sets are independent. Mathematically, the lift of rule A ⇒ B is calculated by the following:

    lift(A ⇒ B) = confidence(A ⇒ B) / expected confidence(A ⇒ B)
The lift can be interpreted as a general measure of association between the two item sets. Values greater than 1 indicate positive correlation, values equal to 1 indicate zero correlation, and values less than 1 indicate negative correlation. Notice that lift is symmetric. That is, the lift of the rule A ⇒ B is the same as the lift of the rule B ⇒ A.
Barbie Doll ⇒ Candy
1. Put them closer together in the store.
2. Put them far apart in the store.
3. Package candy bars with the dolls.
4. Package Barbie + candy + poorly selling item.
5. Raise the price on one, and lower it on the other.
6. Offer Barbie accessories for proofs of purchase.
7. Do not advertise candy and Barbie together.
8. Offer candies in the shape of a Barbie doll.
Forbes (Palmeri 1997) reported that a major retailer determined that customers who buy Barbie dolls have a 60% likelihood of buying one of three types of candy bars. The confidence of the rule Barbie ⇒ candy is 60%. The retailer was unsure what to do with this nugget. The online newsletter Knowledge Discovery Nuggets invited suggestions (Piatetsky-Shapiro 1998).
Data Capacity
In data mining, the data is not generated to meet the objectives of the analysis. It must be
determined whether the data, as it exists, has the capacity to meet the objectives. For example,
quantifying affinities among related items would be pointless if very few transactions involved
multiple items. Therefore, it is important to do some initial examination of the data before
attempting to do association analysis.
Association Tool Demonstration
Analysis goal:
Explore associations between retail banking
services used by customers.
Analysis plan:
 Create an association data source.
 Run an association analysis.
 Interpret the association rules.
 Run a sequence analysis.
 Interpret the sequence rules.
A bank’s Marketing Department is interested in examining associations between various retail
banking services used by customers. Marketing would like to determine both typical and atypical
service combinations as well as the order in which the services were first used.
These requirements suggest both a market basket analysis and a sequence analysis.
Market Basket Analysis
The BANK data set contains service information for nearly 8,000 customers. There are three
variables in the data set, as shown in the table below.
Name      Model Role   Measurement Level   Description
ACCOUNT   ID           Nominal             Account Number
SERVICE   Target       Nominal             Type of Service
VISIT     Sequence     Ordinal*            Order of Product Purchase

(*Although the appropriate measurement level for a sequence variable is usually ordinal, SAS Enterprise Miner does not accept ordinal as the measurement level for a sequence variable. Later in the demonstration, a measurement level of interval is assigned to the sequence variable.)
The BANK data set has more than 32,000 rows. Each row of the data set represents a customer-service combination. Therefore, a single customer can have multiple rows in the data set, and each row represents one of the products he or she owns. The median number of products per customer is three.
The 13 products are represented in the data set with the following abbreviations:
ATM      automated teller machine debit card
AUTO     automobile installment loan
CCRD     credit card
CD       certificate of deposit
CKCRD    check or debit card
CKING    checking account
HMEQLC   home equity line of credit
IRA      individual retirement account
MMDA     money market deposit account
MTG      mortgage
PLOAN    personal or consumer installment loan
SVG      savings account
TRUST    personal trust account
Your first task is to create a new analysis diagram and data source for the BANK data set.
1. Create a new diagram named Associations Analysis to contain this analysis.
2. Select Create Data Source from the Data Sources project property.
3. Select the BANK table from the AAEM library.
4. In Step 5, assign roles to the table variables as shown below.
An association analysis requires exactly one target variable and at least one ID variable. Both
should have a nominal measurement level. However, a level of Interval for the ID variable is
sufficient. A sequence analysis also requires a sequence variable. It usually has an ordinal
measurement scale. However, in SAS Enterprise Miner, the sequence variable must be
assigned the level Interval.
5. Click Next until you reach Step 8. For an association analysis, the data source should have a
role of Transaction.
Select Role  Transaction.
6. Click Finish to close the Data Source Wizard.
7. Drag a BANK data source into the diagram workspace.
8. Click the Explore tab and drag an Association tool into the diagram workspace.
9. Connect the BANK node to the Association node.
10. Select the Association node and examine its Properties panel.
11. The Export Rule by ID property determines whether the Rule-by-ID data is exported from
the node and whether the Rule Description table is available for display in the Results
window. Set the value for Export Rule by ID to Yes.
Other options in the Properties panel include the following:
 Minimum Confidence Level specifies the minimum confidence level to generate a rule.
The default level is 10%.
 Support Type specifies whether the analysis should use the support count or support
percentage property. The default setting is Percent.
 Support Count specifies a minimum level of support to claim that items are associated
(that is, they occur together in the database).
 Support Percentage specifies a minimum level of support to claim that items are
associated (that is, they occur together in the database). The default frequency is 5%. The
support percentage figure that you specify refers to the proportion of the largest single item
frequency, and not the end support.
 Maximum Items determines the maximum size of the item set to be considered. For
example, the default of four items indicates that a maximum of four items can be included
in a single association rule.
Note: If you are interested in associations that involve fairly rare products, you should consider reducing the support count or percentage when you run the Association node. If you obtain too many rules to be practical or useful, you should consider raising the minimum support count or percentage as one possible solution.
Because you first want to perform a market basket analysis, you do not need the
sequence variable.
12. Access the Variables - Assoc dialog box for the Association node.
13. Select Use  No for the VISIT variable.
14. Click OK to close the Variables - Assoc dialog box.
15. Run the diagram from the Association node and view the results.
The Results - Node: Association Diagram window appears. The Statistics Plot, Statistics Line
Plot, Rule Matrix, and Output windows are visible.
16. Maximize the Statistics Line Plot window.
The statistics line plot graphs the lift, expected confidence, confidence, and support for each
of the rules by rule index number.
Consider the rule A ⇒ B. Recall the following:
 Support of A ⇒ B is the probability that a customer has both A and B.
 Confidence of A ⇒ B is the probability that a customer has B given that the customer has A.
 Expected Confidence of A ⇒ B is the probability that a customer has B.
 Lift of A ⇒ B is a measure of the strength of the association. If Lift=2 for the rule A ⇒ B, then a customer having A is twice as likely to have B as a customer chosen at random. Lift is the confidence divided by the expected confidence.
Notice that the rules are ordered in descending order of lift.
17. To view the descriptions of the rules, select View  Rules  Rule description.
The highest lift rule is (checking and credit card) ⇒ check card. This is not surprising
given that many check cards include credit card logos. Notice the symmetry in rules 1 and 2.
This is not accidental because, as noted earlier, lift is symmetric.
18. (Optional) Examine the rule matrix.
The rule matrix plots the rules based on the items on the left side of the rule and the items on
the right side of the rule. The points are colored, based on the confidence of the rules. For
example, the rules with the highest confidence are in the column in the picture above. Using
the interactive feature of the graph, you discover that these rules all have checking on the
right side of the rule.
Another way to explore the rules found in the analysis is by plotting the Rules table.
19. Select View  Rules  Rules Table. The Rules Table window appears.
20. Select the Plot Wizard icon.
21. Choose a Matrix graph for the type of chart, and select Next.
22. Select the matrix variables Lift, Conf, and Support as shown below. Select Next.
23. Select the Group role for _RHAND and the Tip role for LIFT and RULE to add these
details to the tooltip action.
24. Click Finish to generate the plot.
The legend shows the right hand of the rule. When you click a service or group of services in
the legend, the points in the matrix graphs are highlighted. This plot enables you to explore
the relationships among the various metrics in association analysis.
When you position the pointer over a selected point in the plot, the tooltips show the details
of the point, including the full rule.
25. Right-click in the graph and select Data Options. Click the Where tab. Specify Expected
Confidence(%) as the column name and Greater than as the operator. Click the ellipsis next
to Value. Set the slider to include values greater than 40, or enter 40 for the value.
26. Select OK  Apply.
27. Select OK. The selected subset of cases represents three different sets of services in the legend for the right hand of the rules.
28. You can modify the look of the graph. Right-click the graph and select Graph Properties.
Change the plot matrix diagonal type to Histogram.
29. Label the legend by selecting Legend and selecting the check box in the Title area next to
Show (Right Hand of Rule). Select OK.
30. Click the SVG (Savings Account) category in the legend and notice that the histograms show
the distribution of the selected rules in the diagonal.
31. Close the Results window.
Sequence Analysis
In addition to the products owned by its customers, the bank is interested in examining the order
in which the products are purchased. The sequence variable in the data set enables you to
conduct a sequence analysis.
1. Add an Association node to the diagram workspace and connect it to the BANK node.
2. Rename the new node Sequence Analysis.
3. Set Export Rule by ID to Yes.
4. Examine the Sequence panel in the Properties panel.
The options in the Sequence panel enable you to specify the following properties:
 Chain Count is the maximum number of items that can be included in a sequence. The
default value is 3 and the maximum value is 10.
 Consolidate Time enables you to specify whether consecutive visits to a location or
consecutive purchases over a given interval can be consolidated into a single visit for
analysis purposes. For example, two products purchased less than a day apart might be
considered to be a single transaction.
 Maximum Transaction Duration enables you to specify the maximum length of time for
a series of transactions to be considered a sequence. For example, you might want to
specify that the purchase of two products more than three months apart does not constitute
a sequence.
 Support Type specifies whether the sequence analysis should use the Support Count or
Support Percentage property. The default setting is Percent.
 Support Count specifies the minimum frequency required to include a sequence in the
sequence analysis when the Sequence Support Type property is set to Count. If a sequence
has a count less than the specified value, that sequence is excluded from the output.
 Support Percentage specifies the minimum level of support to include the sequence in the
analysis when the Support Type property is set to Percent. If a sequence has a frequency
that is less than the specified percentage of the total number of transactions, then that
sequence is excluded from the output. The default percentage is 2%. Permissible values are
real numbers between 0 and 100.
5. Run the diagram from the Sequence Analysis node and view the results.
6. Maximize the Statistics Line Plot window.
The statistics line plot graphs the confidence and support for each of the rules by rule index
number.
The percent support is the transaction count divided by the total number of customers, which
would be the maximum transaction count. The percent confidence is the transaction count
divided by the transaction count for the left side of the sequence.
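These two ratios can be sketched directly from per-customer ordered purchase lists. The data and helper function below are illustrative, not the Association node's implementation:

```python
# Toy data: customer -> services in order of first use.
purchases = {
    101: ["CKING", "SVG", "ATM"],
    102: ["CKING", "ATM"],
    103: ["SVG", "CKING"],
    104: ["CKING", "SVG"],
}

def seq_count(first, then):
    """Customers who acquired `first` strictly before `then`."""
    def holds(seq):
        return first in seq and then in seq and seq.index(first) < seq.index(then)
    return sum(holds(seq) for seq in purchases.values())

n_customers = len(purchases)
lhs_count = sum("CKING" in seq for seq in purchases.values())

# Percent support: transaction count over the total number of customers.
support = seq_count("CKING", "SVG") / n_customers
# Percent confidence: transaction count over the count for the left side.
confidence = seq_count("CKING", "SVG") / lhs_count
print(support, confidence)
```

Note that customer 103 owns both services but is not counted for the sequence CKING ⇒ SVG, because the order of acquisition is reversed; this is exactly why confidence can change once sequence is considered.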
7. Select View  Rules  Rule Description to view the descriptions of the rules.
The confidence for many of the rules changes after the order of service acquisition is
considered.
Exercises
2. Conducting an Association Analysis
A store is interested in determining the associations between items purchased from the Health
and Beauty Aids Department and the Stationery Department. The store chose to conduct a
market basket analysis of specific items purchased from these two departments. The
TRANSACTIONS data set contains information about more than 400,000 transactions made
over the past three months. The following products are represented in the data set:
1. bar soap
2. bows
3. candy bars
4. deodorant
5. greeting cards
6. magazines
7. markers
8. pain relievers
9. pencils
10. pens
11. perfume
12. photo processing
13. prescription medications
14. shampoo
15. toothbrushes
16. toothpaste
17. wrapping paper
There are four variables in the data set:

Name          Model Role   Measurement Level   Description
STORE         Rejected     Nominal             Identification number of the store
TRANSACTION   ID           Nominal             Transaction identification number
PRODUCT       Target       Nominal             Product purchased
QUANTITY      Rejected     Interval            Quantity of this product purchased
a. Create a new diagram. Name the diagram Transactions.
b. Create a new data source. Use the data set AAEM.TRANSACTIONS.
c. Assign the model role Rejected to the variables STORE and QUANTITY. These
variables are not used in this analysis. Assign the ID model role to the variable
TRANSACTION and the Target model role to the variable PRODUCT.
d. Add the node for the TRANSACTIONS data set and an Association node to the
diagram.
e. Change the setting for Export Rule by ID to Yes.
f. Do not change the remaining default settings for the Association node and run the
analysis.
g. Examine the results of the association analysis.
What is the highest lift value for the resulting rules?
Which rule has this value?
3.4 Chapter Summary
Pattern discovery seems to embody the promise of data mining, but there are many ways for an
analysis to fail. SAS Enterprise Miner provides tools to help with data reduction, novelty
detection, profiling, market basket analysis, and sequence analysis.
Cluster and segmentation analyses are similar in intent but differ in execution. In cluster analysis,
the goal is to identify distinct groupings of cases across a set of inputs. In segmentation analysis,
the goal is to partition cases from a single cluster into contiguous groups.
SAS Enterprise Miner offers several tools for exploring the results of a cluster and segmentation
analysis. For low dimension data, you can use capabilities provided by the Graph Wizard and the
Explore window. For higher dimensional data, you can choose the Segment Profile tool to
understand the generated partitions.
Market basket and sequence analyses are handled by the Association tool. This tool transforms
transaction data sets into rules. The value of the generated rules is gauged by confidence,
support, and lift. The Association tool features a variety of plots and tables to help you explore
the analysis results.
Pattern Discovery Tools: Review
Generate cluster models using automatic
settings and segmentation models with
user-defined settings.
Compare within-segment distributions of
selected inputs to overall distributions. This
helps you understand segment definition.
Conduct market basket and sequence analysis
on transactions data. A data source must have
one target, one ID, and (if desired) one
sequence variable in the data source.
3.5 Solutions
Solutions to Exercises
1. Conducting Cluster Analysis
a. Create a new diagram in your project. Name the diagram Jeans.
1) To open a diagram, select File  New  Diagram.
2) Enter the name of the new diagram, Jeans, and click OK.
b. Define the data set DUNGAREE as a data source.
1) Select File  New  Data Source.
2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is
selected as the source and click Next.
3) To choose the desired data table, click Browse.
4) Double-click the AAEM library to see the data tables in the library.
5) Select the DUNGAREE data set, and then click OK.
6) Click Next.
7) Click Next.
8) Select Advanced to use the Advanced Advisor, and then click Next.
c. Determine whether the model roles and measurement levels assigned to the variables are
appropriate.
Examine the distribution of the variables.
1) Hold down the Ctrl key and select the variables of interest.
2) Click Explore.
There do not appear to be any unusual or missing data values. Close the Explore
window.
d. The variable STOREID should have the ID model role and the variable SALESTOT
should have the Rejected model role.
The variable SALESTOT should be rejected because it is the sum of the other input
variables in the data set. Therefore, it should not be considered as an independent
input value.
Click Next several times and then click Finish to complete the data source creation.
e. To add an Input Data Source node to the diagram workspace and select the
DUNGAREE data table as the data source, drag the DUNGAREE data source onto the
diagram workspace.
f. Add a Cluster node to the diagram workspace. The workspace should appear as shown.
g. Select the Cluster node.
In the property panel, note the default: Internal Standardization  Standardization.
If you do not standardize, the clustering occurs strictly on the inputs with the largest
range (Original and Leisure).
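The effect of standardization can be sketched numerically. The synthetic data below is illustrative (the variable names echo the DUNGAREE inputs but are not the real table); it shows that without scaling, the input with the largest spread dominates the Euclidean distances that k-means uses:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
fashion = rng.normal(50, 5, size=200)       # small spread
original = rng.normal(3000, 800, size=200)  # large spread: dominates raw distances
X = np.column_stack([fashion, original])

# Unstandardized: clusters are driven almost entirely by the ORIGINAL column.
raw_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Standardized: each input contributes on the same scale (unit variance).
X_std = StandardScaler().fit_transform(X)
std_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)

print(X.std(axis=0).round(1))      # very different spreads before scaling
print(X_std.std(axis=0).round(1))  # [1. 1.] after scaling
```

This mirrors the Cluster node's Internal Standardization property: with standardization, no single jeans style dominates the segmentation merely because it sells in larger volumes.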
h. Run the diagram from the Cluster node and examine the results.
1) Run the Cluster node and view the results.
2) To view the results, right-click the Cluster node and select Results.
The Cluster node’s Automatic number of cluster specification method seems to
generate an excessive number of clusters.
i. Specify a maximum of six clusters and rerun the Cluster node.
1) Select Specification Method  User Specify.
2) Select Maximum Number of Clusters  6.
3) Run the Cluster node and view the results.
Apparently all but one of the segments is well populated. There are more details
about the segment composition in the next step.
j. Connect a Segment Profile node to the Cluster node.
Run the Segment Profile node and view the results.
Segment 1 contains stores selling a higher-than-average number of original jeans.
Segment 2 contains stores selling a higher-than-average number of stretch jeans.
Segment 3 contains stores selling small numbers of all jeans styles.
Segment 4 contains stores selling a higher-than-average number of leisure jeans.
Segment 5 contains stores selling a higher-than-average number of fashion jeans.
Segment 6 contains stores selling a higher-than-average number of original jeans, but
lower-than-average number of stretch and fashion.
2. Conducting an Association Analysis
a. Create the Transactions diagram.
1) To open a new diagram in the project, select File  New  Diagram.
2) Name the new diagram Transactions and click OK.
b. Create a new data source. Use the data set AAEM.TRANSACTIONS.
1) Right-click Data Sources in the project tree and select Create Data Source.
2) In the Data Source Wizard - Metadata Source window, make sure that SAS Table is
selected as the source and click Next.
3) Click Browse to choose a data set.
4) Double-click the AAEM library and select the TRANSACTIONS data set.
5) Click OK.
6) Click Next.
7) Examine the data table properties, and then click Next.
8) Select Advanced to use the Advanced Advisor, and then click Next.
c. Assign appropriate model roles to the variables.
1) Change the measurement level of STORE to Nominal. Hold down the Ctrl key and
select the rows for the variables STORE and QUANTITY. In the Role column of
one of these rows, select Rejected.
2) Select the TRANSACTION row and select ID as the role.
3) Select the PRODUCT row and select Target as the role.
4) Click Next.
5) To skip decision processing, click Next. To skip creation of a sample, click Next.
6) Change the role to Transaction.
7) Select Next  Finish.
d. Add the node for the TRANSACTIONS data set and an Association node to the
diagram.
The workspace should appear as shown.
e. Change the setting for Export Rule by ID to Yes.
f. Run the Association node and view the results.
g. Examine the results of the association analysis.
Examine the Statistics Line plot.
Rule 1 has the highest lift value, 3.60.
Looking at the output reveals that Rule 1 is the rule Perfume ⇒ Toothbrush.
Solutions to Student Activities (Polls/Quizzes)
8.01 Multiple Choice Poll – Correct Answer
For a k-means clustering analysis, which of the following
statements is true about input variables?
a. Input variables should be limited in number and be
relatively independent.
b. Input variables should be of interval measurement
level.
c. Input variables should have distributions that are
somewhat symmetric.
d. Input variables should be meaningful to analysis
objectives.
e. All of the above