Explaining the State Transitions Viewer for Sequence Clustering

by Allan Mitchell, SQL Server MVP and principal consultant at Konesans Limited in the UK. You can find Allan’s website at www.SQLIS.com.

The community-editable, Web version of this document is available on TechNet Wiki here: http://social.technet.microsoft.com/wiki/contents/articles/971.aspx.

Introduction

In SQL Server 2005 and SQL Server 2008, Microsoft has added some fantastic visualizations around data-mining algorithms. These visual aids allow us to see exactly what a particular algorithm is predicting or describing, which makes a difficult subject easier to understand.

Problem

In this article, I look at the State Transitions viewer for the Microsoft Sequence Clustering algorithm and explain exactly what you are seeing and why. Although it is not necessary, this will be easier to understand if you have a statistics background or previous experience with data mining. You can find an introduction to the Microsoft Sequence Clustering Algorithm here: http://msdn.microsoft.com/en-us/library/ms175462.aspx.

The image below is an example screen from the State Transitions viewer.

Fig 1

The screenshot is taken from the Sequence Clustering model that is deployed as part of the sample Analysis Services Project for SQL Server 2008. You can download the Analysis Services Project here: http://msftasprodsamples.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=18652.

Explanation

The screenshot presents a very simple, clean view of a transition. When you look at your models, you will undoubtedly see a lot more nodes, lines, and numbers. However, take heart that everything we learn here is equally applicable in larger sequences. I just wanted a clean, uncomplicated view of a transition.

The Viewer in General

The slider on the left side allows us to gradually filter in or out weaker or stronger links between items in a sequence. The links are determined by the transition probability.
If the slider is at the top then, no matter how slight the probability, all the links will be displayed. If the slider is at the bottom, only the strongest links will be displayed.

Across the top, there are several tabs and two dropdown combo boxes. Each tab opens another viewer for our Sequence Clustering model. The dropdown on the left allows us to choose our mining model, and the dropdown on the right allows us to choose the viewer. This is important; you might find that you use the viewer dropdown quite frequently.

Of the two available viewers, this article concentrates on the Microsoft Sequence Cluster Viewer. The Microsoft Sequence Cluster Viewer is the more graphical of the two, and I find it easier to understand. The second viewer, the Microsoft Generic Content Tree Viewer, is not graphical, but it contains more information, which is extremely useful when you want to dig deeper into the algorithm.

The Viewer in Detail

Sequence Start and End

Let’s dive straight in and try to work out what the model viewer is telling us. In this article, I have chosen to look at Cluster 13 for this particular sequence clustering model.

One of the first things I noticed, and often get asked about, is the “triangles with balls.” I know this is not a very technical name, but in the absence of anything else, I am using the term here.

Here is the ball on the flat edge of the triangle:

Fig 2

This indicates that ML Mountain Tire is the first state in the sequence.

Here is the ball on the point of the triangle:

Fig 3

This indicates that the ML Mountain Tire state is the last in the sequence and nothing comes after it.

Description of the numbers in the viewer

Looking at the original screenshot, we see that for this particular cluster a sequence starts with the ML Mountain Tire state 61 percent of the time and with the Mountain Tire Tube state 39 percent of the time.
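To make the 61/39 split concrete, here is a minimal Python sketch (not part of the sample project) that treats the two start probabilities read from the viewer as a distribution and simulates a large number of sequence starts. The function name, the sample size, and the seed are illustrative assumptions of mine; only the two probabilities come from the viewer.

```python
import random

# Start-state probabilities as read from the viewer for Cluster 13.
START_PROBABILITIES = {
    "ML Mountain Tire": 0.61,
    "Mountain Tire Tube": 0.39,
}


def simulate_first_states(probabilities, n, seed=0):
    """Draw n first states and return the observed share of each state.

    Hypothetical helper for illustration only; it is not an Analysis
    Services API.
    """
    rng = random.Random(seed)  # local generator so the run is repeatable
    states = list(probabilities)
    weights = [probabilities[state] for state in states]
    draws = rng.choices(states, weights=weights, k=n)
    return {state: draws.count(state) / n for state in states}


shares = simulate_first_states(START_PROBABILITIES, n=10_000)
for state, share in shares.items():
    print(f"{state}: {share:.1%}")
```

With enough draws, the observed shares should settle close to the 61 and 39 percent that the viewer reports, which is all that "61 percent of the time" is claiming.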
Where ML Mountain Tire is the first item in the sequence, we can expect it to be followed by the Mountain Tire Tube state in 32 percent of cases; in the other 68 percent of cases it is the end of the sequence. When Mountain Tire Tube is the starting state, we can expect it never to be followed by anything. We know this because of the value 1.00 (see Fig 1) and the end-of-sequence ball on the point of a triangle. Each sequence has a unique color that marks its states (including start and end states), links, and probabilities.

Explanation of the source of the numbers

Now let’s query the metadata around the model using Data Mining Extensions (DMX). Although a detailed look at DMX is out of scope for this article, I will explain my queries as we go along.

For this cluster, the first thing I want to know is the probability of each state being first in the sequence. Earlier we saw in the viewer that the probabilities are 61 and 39 percent respectively. The query below shows how we can retrieve the same information with a DMX query. Here is the query:

    SELECT FLATTENED NODE_UNIQUE_NAME,
        (SELECT ATTRIBUTE_VALUE AS [Product 1],
                [Support] AS [Sequence Support],
                [Probability] AS [Sequence Probability]
         FROM NODE_DISTRIBUTION
         WHERE [Support] > 0) AS t
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_TYPE = 13
      AND [PARENT_UNIQUE_NAME] = 13

Query 1

The outer part of Query 1 selects from the [Sequence Clustering] model and asks for a NODE_TYPE of 13. This NODE_TYPE is the type that holds the first states of possible sequences. Remember that a cluster can have multiple possible starting points for sequences/runs of states. Query 1 also restricts PARENT_UNIQUE_NAME to 13. This means we want to look at Cluster 13.

What might be slightly confusing are the FLATTENED keyword and the nested table in Query 1. The following code reads from a nested table that is returned as part of the [Model].CONTENT request.
    (SELECT ATTRIBUTE_VALUE AS [Product 1],
            [Support] AS [Sequence Support],
            [Probability] AS [Sequence Probability]
     FROM NODE_DISTRIBUTION
     WHERE [Support] > 0) AS t

Query 2

Here I am asking for the value of the first sequence state, the number of cases that support that state, and the probability of the state, keeping only rows that have at least some support. Here are the results:

    NODE_UNIQUE_NAME   t.Product 1          t.Sequence Support   t.Sequence Probability
    884722             ML Mountain Tire     168.4872             0.613636
    884722             Mountain Tire Tube   106.0845             0.386364

Table 1

We can see that the numbers returned match what the viewer shows us.

Now let’s move a little further and see what the probabilities are of the next states in the sequence when ML Mountain Tire is the first state. For this I am going to use a slight variation of Query 1:

    SELECT FLATTENED NODE_UNIQUE_NAME,
        (SELECT ATTRIBUTE_VALUE AS [Product 1],
                [Support] AS [Sequence Support],
                [Probability] AS [Sequence Probability]
         FROM NODE_DISTRIBUTION) AS t
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_TYPE = 13
      AND [PARENT_UNIQUE_NAME] = 13

Query 3

The only difference between Query 3 and Query 1 is that I have not restricted the nested table to items that have [Support] > 0. Here are the results of Query 3:

Table 2

Query 3 shows us what the first states are for Cluster 13. As we can see, there are only two possibilities. We are concentrating on ML Mountain Tire here.

To find out what states follow ML Mountain Tire, we need to count down the rows in Table 2 until we reach ML Mountain Tire. Counting should start at 0; Row 0 is always the Missing state. If we count, we should get to 14.

We now need to look at the transition states. The following query will tell us where to find the node that holds the second states for sequence state 14. We also restrict the PARENT_UNIQUE_NAME to the NODE_UNIQUE_NAME we retrieved by looking at the first states of sequences in Query 1.
    SELECT NODE_UNIQUE_NAME
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_DESCRIPTION = 'Transition row for sequence state 14'
      AND [PARENT_UNIQUE_NAME] = '884722'

Query 4

We now need to take the result of Query 4, 884737, and use it to get the second state items.

    SELECT FLATTENED
        (SELECT ATTRIBUTE_VALUE AS Product2,
                [Support] AS [P2 Support],
                [Probability] AS [P2 Probability]
         FROM NODE_DISTRIBUTION) AS t
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_UNIQUE_NAME = '884737'

Query 5

As you can see, Query 5 is pretty much the same query as before, but NODE_UNIQUE_NAME has changed to the node returned in the previous query. Here are the results of Query 5:

Table 3

The probability column shows us that there is a 31 percent probability that Mountain Tire Tube will follow ML Mountain Tire and a 68 percent chance that nothing will follow. This agrees closely with what we see in the viewer.

Now perform the same steps for the Mountain Tire Tube state. If we go back to the results in Table 2, we see that Mountain Tire Tube is at row 17, again counting from 0. To find the second states, this changes our query to the following:

    SELECT NODE_UNIQUE_NAME
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_DESCRIPTION = 'Transition row for sequence state 17'
      AND [PARENT_UNIQUE_NAME] = '884722'

Query 6

As before, we take the result of Query 6, 884740, and use it to find the second states:

    SELECT FLATTENED
        (SELECT ATTRIBUTE_VALUE AS Product2,
                [Support] AS [P2 Support],
                [Probability] AS [P2 Probability]
         FROM NODE_DISTRIBUTION) AS t
    FROM [Sequence Clustering].CONTENT
    WHERE NODE_UNIQUE_NAME = '884740'

Query 7

The results below again agree very nicely with what we see in the viewer. Here 0.99999999 is, for all practical purposes, a probability of 1 (100%). It is what statisticians call “almost surely.” We will never be able to run this sequence enough times to say that it is always true (surely), but we are very sure it will happen.
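The point about 0.99999999 can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the start and transition probabilities are the rounded figures read from the viewer (0.61, 0.39, 0.32, 0.68) plus the 0.99999999 quoted above, and the `END` marker and the function name are hypothetical names of my own, not anything Analysis Services exposes. The sketch multiplies a start probability by each transition probability to get the probability of a whole sequence, then checks that the three possible sequences in Cluster 13 account for essentially everything.

```python
import math

END = "<end of sequence>"  # hypothetical marker for "nothing follows"

# Rounded start probabilities as shown in the viewer.
START = {"ML Mountain Tire": 0.61, "Mountain Tire Tube": 0.39}

# Transition probabilities: viewer values, plus the 0.99999999 quoted in the text.
TRANSITIONS = {
    "ML Mountain Tire": {"Mountain Tire Tube": 0.32, END: 0.68},
    "Mountain Tire Tube": {END: 0.99999999},
}


def sequence_probability(states):
    """Start probability multiplied by the probability of each transition."""
    probability = START.get(states[0], 0.0)
    for current, following in zip(states, states[1:]):
        probability *= TRANSITIONS.get(current, {}).get(following, 0.0)
    return probability


sequences = [
    ["ML Mountain Tire", END],
    ["ML Mountain Tire", "Mountain Tire Tube", END],
    ["Mountain Tire Tube", END],
]

for sequence in sequences:
    print(" -> ".join(sequence), round(sequence_probability(sequence), 4))

total = sum(sequence_probability(sequence) for sequence in sequences)
# The total falls a hair short of 1 because of the 0.99999999 term, so compare
# with a tolerance instead of testing for exact equality.
print(math.isclose(total, 1.0, abs_tol=1e-6))
```

Comparing with a tolerance rather than `==` is the programmatic counterpart of reading 0.99999999 as "almost surely."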
Table 4

Whenever Mountain Tire Tube is a state in a sequence for this cluster, the sequence always ends on that state.

Conclusion

Microsoft is making great strides in providing a means for more and more people to use Data Mining without feeling that it is too complex. The viewers provided with SQL Server 2005 and 2008 are excellent ways to visualize what the model has learned about your data. This article has shown how to interpret one of those viewers.

Thank you for reading, and please send us your feedback to let us know how useful this was or if you have any other thoughts.

About the author

Allan Mitchell is a SQL Server MVP based in the UK. He specializes in the Microsoft SQL Server BI stack with a passion for Data Mining and SQL Server Integration Services.