Explaining the State Transitions Viewer for Sequence Clustering

advertisement
Explaining the State Transitions Viewer
for Sequence Clustering
by Allan Mitchell, SQL Server MVP and principal consultant at Konesans Limited in the UK. You can
find Allan’s website at www.SQLIS.com.
The community-editable, Web version of this document is available on TechNet Wiki here:
http://social.technet.microsoft.com/wiki/contents/articles/971.aspx.
Introduction
In SQL Server 2005 and SQL Server 2008, Microsoft has added some fantastic visualizations around
data-mining algorithms. These visual aids allow us to see exactly what a particular algorithm is
predicting or describing—making a difficult subject easier to understand.
Problem
In this article, I look at the State Transitions viewer for the Microsoft Sequence Clustering algorithm
and explain exactly what you are seeing and why. Although it is not necessary, this will be easier to
understand if you have a statistics background or previous experience with data mining. You can find
an introduction to the Microsoft Sequence Clustering Algorithm here:
http://msdn.microsoft.com/en-us/library/ms175462.aspx.
The image below is an example screen from the State Transitions viewer.
Fig 1
The screenshot is taken from the Sequence Clustering model that is deployed as part of the sample
Analysis Services Project for SQL Server 2008. You can download the Analysis Services Project here:
http://msftasprodsamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=18652.
Explanation
The screenshot presents a very simple, clean view of a transition. When you look at your models,
you will undoubtedly see a lot more nodes, lines, and numbers. However, take heart that everything
we learn here is equally applicable in larger sequences. I just wanted a clean, uncomplicated view of
a transition.
The Viewer in General
This slider on the left side allows us to gradually filter in or out weaker or stronger links between
items in a sequence. The links are determined by the transition probability. If the slider is at the top
then, no matter how slight the probability, all the links will be displayed. If the slider is at the
bottom, only the strongest links will be displayed.
Across the top, there are many tabs and two dropdown combo-boxes. Once you click each of these
tabs, you see another viewer for our Sequence Clustering model. The dropdown on the left allows us
to choose our mining model, and the dropdown on the right allows us to choose the viewer. This is
important; you might find that you use the viewer dropdown quite frequently.
Of the two available viewers, this article concentrates on the Microsoft Sequence Cluster Viewer.
The Microsoft Sequence Cluster Viewer is the more graphical of the two, and I find it easier to
understand. The second viewer, the Microsoft Generic Content Tree Viewer, is not graphical, but it
contains more information, which is extremely useful when you want to dig deeper into the
algorithm.
The Viewer in Detail
Sequence Start and End
Let’s dive straight in and try to work out what the model viewer is telling us. In this article, I have
chosen to look at Cluster 13 for this particular sequence clustering model. One of the first things I
noticed and often get asked about is the “triangles with balls.” I know this is not a very technical
name, but in the absence of anything else, I am using the term here.
Here is the ball on the flat edge of the triangle:
Fig 2
This describes that ML Mountain Tire is the first state in the sequence.
Here is the ball on the point of the triangle:
Fig 3
This describes that the ML Mountain Tire state is the last in the sequence and nothing comes after it.
Description of the numbers in the viewer
Looking at the original screenshot, we see that for this particular sequence we would start with the
ML Mountain Tire state 61 percent of the time and the Mountain Tire Tube state 39 percent of the
time. Where ML Mountain Tire is the first item in the sequence, we can expect it to be followed by
the Mountain Tire Tube state in 32 percent of cases and in 68 percent of cases where it is the end of
the sequence. When Mountain Tire Tube is the starting sequence, we can expect it to never be
followed by anything—we know this because of the value 1.00 (see Fig 1) and the end-of-sequence
ball on the point of a triangle. Each sequence has the unique color to mark its states (including start
and end states), link, and probabilities.
Explanation of the source of the numbers
Now let’s query the metadata around the model using Data Mining eXpressions (DMX). Although a
detailed look at DMX is out of scope for this article, I will explain my queries as we go along.
For this cluster, the first thing I want to know is the probabilities of each state being first in the
sequence. Earlier we saw in the viewer that the probabilities are 61 and 39 percent respectively.
The query below shows how we can retrieve the same information when using a DMX (Data Mining
Expression) query.
Here is the query:
SELECT FLATTENED NODE_UNIQUE_NAME,
(SELECT ATTRIBUTE_VALUE AS [Product 1],
[Support] AS [Sequence Support],
[Probability] AS [Sequence Probability]
FROM NODE_DISTRIBUTION WHERE [Support] > 0 ) as t
FROM [Sequence Clustering].CONTENT
WHERE NODE_TYPE = 13
AND [PARENT_UNIQUE_NAME] = 13
Query 1
The outer part of Query 1 selects from the [Sequence Clustering] model and asks for a NODE_TYPE
of 13. This NODE_TYPE is the type that holds the first states of possible sequences. Remember that
a cluster can have multiple possible starting points for sequences/runs of states. Query 1 also asks
for where PARENT_UNIQUE_NAME is 13. This means we want to look at Cluster 13.
What might be slightly confusing are the FLATTENED keyword and the nested table in Query 1. The
following code reads from a nested table that is returned as part of the [Model].CONTENT request.
(SELECT ATTRIBUTE_VALUE AS [Product 1],
[Support] AS [Sequence Support],
[Probability] AS [Sequence Probability]
FROM NODE_DISTRIBUTION WHERE [Support] > 0 ) as t
Query 2
Here I am asking for the Value of the first sequence state, the probability of the state, and the
amount of cases that support that state where there is at least some support.
Here are the results:
NODE_UNIQUE_NAME
t.Product 1
t.Sequence Support
t.Sequence Probability
884722 ML Mountain Tire
168.4872
0.613636
884722 Mountain Tire Tube
106.0845
0.386364
Table 1
We can see that the numbers returned correlate nicely with what the viewer shows us.
Now let’s move a little further and see what the probabilities are of the next states in the sequence
when ML Mountain Tire is the first state. For this I am going to use a slight variation of Query 1:
SELECTFLATTENED NODE_UNIQUE_NAME,
(SELECT ATTRIBUTE_VALUE AS [Product 1],
[Support] AS [Sequence Support],
[Probability] AS [Sequence Probability]
FROM NODE_DISTRIBUTION) as t
FROM [Sequence Clustering].CONTENT
WHERE NODE_TYPE = 13
AND [PARENT_UNIQUE_NAME] = 13
Query 3
The only difference from Query 3 to Query 1 is that I have not restricted the nested table to only
show items that have [Support] > 0.
Here are the results of Query 3:
Table 2
Query 3 shows us what the first states are for Cluster 13. As we can see, there are only two
possibilities. We are concentrating on ML Mountain Tire here. To find out what states follow ML
Mountain Tire, we need to count down the rows in Table 2 until we reach ML Mountain Tire.
Counting should start at 0; Row 0 is always the Missing state. If we count, we should get to 14. We
now need to go look at the transition states.
The following query will tell us where to find the node that holds the second states for sequence
state 14. We also restrict the PARENT_UNIQUE_NAME to the NODE_UNIQUE_NAME we retrieved
by looking at first states of sequences in Query 1.
SELECT NODE_UNIQUE_NAME
FROM [Sequence Clustering].CONTENT
WHERE NODE_DESCRIPTION = 'Transition row for sequence state 14'
AND [PARENT_UNIQUE_NAME] = '884722'
Query 4
We now need to take the result of Query 4, 884737, and use it to get the second state items.
SELECTFLATTENED
(SELECT ATTRIBUTE_VALUE AS Product2,
[Support] AS [P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION) AS t
FROM [Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '884737'
Query 5
As you can see, Query 5 is pretty much the same query as before, but NODE_UNIQUE_NAME has
changed to the node returned in the previous query.
Here are the results of Query 5:
Table 3
The probability column shows us that there is a 31 percent probability that Mountain Tire Tube will
follow ML Mountain Tire and a 68 percent chance that nothing will follow. This correlates nicely
with what we see in the viewer.
Perform the same query for the Mountain Tire Tube state. If we go back to the results in Table 2, we
see that Mountain Tire Tube is the 17th transition State Row. To find the second states, this changes
our query to the following:
SELECT NODE_UNIQUE_NAME
FROM [Sequence Clustering].CONTENT
WHERE NODE_DESCRIPTION = 'Transition row for sequence state 17'
AND [PARENT_UNIQUE_NAME] = '884722'
Query 6
As before, we take the results of Query 6, 884740, and use it to find the second states:
SELECTFLATTENED
(SELECT ATTRIBUTE_VALUE AS Product2,
[Support] AS [P2 Support],
[Probability] AS [P2 Probability]
FROM NODE_DISTRIBUTION) AS t
FROM [Sequence Clustering].CONTENT
WHERE NODE_UNIQUE_NAME = '884740'
Query 7
The results below correlate again very nicely with what we see in the viewer. Here 0.99999999 is
identical to probability 1 (100%). It is what statisticians call “almost surely.” We will never be able
to run this sequence enough times to say that is is always true (surely), but we are very sure it will
happen.
Table 4
Whenever Mountain Tire Tube is the State in a sequence for this cluster, we always end on that
state.
Conclusion
Microsoft is making great strides into providing a means for more and more people to use Data
Mining without feeling that it is too complex. The viewers provided with SQL Server 2005 and 2008
are excellent ways to visualize what the model has learned about your data. This article has shown
how to interpret one of those viewers. Thank you for reading, and please send us your feedback to
let us know how useful this was or if you have any other thoughts.
About the author. Allan Mitchell is a SQL Server MVP based in the UK. He specializes in the
Microsoft SQL Server BI stack with a passion for Data Mining and SQL Server Integration Services.
Download