Back-Testing Data Mining Results Using Monte Carlo

SQL Server Technical Article
Back-Testing Data Mining Results Using Monte Carlo Simulation
Writer: Hilmar Buchta
Technical Reviewer: Dana Cristofor
Published: January 2011
Applies to: SQL Server 2008, SQL Server 2008 R2
Summary: This article describes statistical test methods for validating the correctness of a data
mining model. To do this, the predicted events are compared with the actual events in a back-testing process, and a threshold for the number of prediction errors must be defined.
Copyright
The information contained in this document represents the current view of Microsoft Corporation on
the issues discussed as of the date of publication. Because Microsoft must respond to changing market
conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft
cannot guarantee the accuracy of any information presented after the date of publication.
This white paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS,
IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.
Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights
under copyright, no part of this document may be reproduced, stored in, or introduced into a retrieval
system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or
otherwise), or for any purpose, without the express written permission of Microsoft Corporation.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this document. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this document does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail
addresses, logos, people, places, and events depicted herein are fictitious, and no association with any
real company, organization, product, domain name, e-mail address, logo, person, place, or event is
intended or should be inferred.
© 2011 Microsoft Corporation. All rights reserved.
Contents
Introduction
Back testing
Practical example used in this article
How many prediction errors can be expected?
Monte Carlo simulation
Probability density function
Implementing the Monte Carlo simulation in Integration Services
Statistical hypothesis testing
Summary and discussion
Conclusion
Resources
Introduction
Data mining is frequently used to derive new knowledge from existing data. This new knowledge
may then influence future business decisions. Usually these business decisions contain some
kind of risk. For example, in a manufacturing company the result of a data mining process could
be that sales for a certain product are very likely to increase during the next period. The logically
consistent decision in this case would be to increase the production for this product in order to
fulfill the expected sales growth. If the prediction from the mining process turns out to be true,
everything is perfect. If not, inventory is full but there is no market for the product.
Whenever mining results are used as a basis for business decisions, we want to fully
understand the results, and we also want to be able to prove that our mining models are working
correctly.
This article describes a back testing process that can be used to validate that the mining model
is working within the expected tolerance. The back testing process described here makes use of
Microsoft SQL Server 2008 R2, Microsoft Visual Studio, and data generated using the Monte
Carlo method. For general information about these Microsoft products and the Monte Carlo
method, see the links provided at the end of the paper.
Back testing
In the design phase, we can use techniques like cross validation or mining accuracy charts to
understand the statistical relevance of the model as shown in Figure 1.
[Figure 1 is a timeline: during design time of the mining model, tools such as lift charts and cross validation are used; during usage of the model, back testing is used.]
Figure 1. Tools for validating data mining results
In contrast, back testing compares the predictions made by the mining process in a past
period with the reality of today. For example, a publishing company wants to predict which
customers are likely to cancel their subscriptions. To do this, they use data mining. Three
months later, they use a back-testing approach to test the prediction against the reality.
The result of the back testing is one of the following:
•	The data mining model works as expected (number of false predictions within tolerance).
•	The data mining model does not work as expected (number of false predictions not within tolerance).
In the latter case, usually a review of the mining model is the appropriate action, to see whether
there are missing input variables or other major changes that need to be reflected by the mining
model.
Practical example used in this article
In order to illustrate the back-testing process, this paper uses data from a fictitious
telecommunications company. Each customer has a 12-month contract that can be cancelled by
the customer at the end of the period. If the contract is not cancelled, it is automatically
continued for another 12 months. The mining model in this example is trained to predict the
likelihood of a contract cancellation within the next three months.
In order to follow the example, I have provided a link, listed at the end of this paper, from which
you can download the sample data and other resources. The download (zip-archive) contains
the following:
•	A Microsoft SQL Server 2008 R2 backup of the sample database (DataMining) with randomly generated test data (30,000 cases)
•	A Microsoft Visual Studio solution for the Monte Carlo simulation (SQL Server Integration Services project) and visualization of the results (SQL Server Reporting Services project)
Let’s assume that we did a churn prediction three months ago. For this article we are not interested
in the actual data mining model but only in the prediction output. We stored the results in a
database table named Mining_Result, as shown in Table 1.
Table 1. Sample data (output from mining model)
CaseKey | Churn | ChurnProbability
1  | true  | 0.874
2  | false | 0.071
3  | false | 0.017
4  | true  | 0.502
5  | false | 0.113
6  | false | 0.160
7  | false | 0.069
8  | false | 0.018
9  | false | 0.026
10 | false | 0.187
…  | …     | …
The column Churn contains the predictions from the mining algorithm. A Churn value of true
means that for this particular case, the customer is likely to cancel the contract. The
ChurnProbability column contains the output of the mining function
PredictProbability(Churn,true). The probability is essentially the ratio of cancellations among all
cases that the mining algorithm considers similar, expressed as a number between 0 (meaning 0
percent) and 1 (meaning 100 percent). For example, in the first row of Table 1 the probability is
0.874, which means that 87.4 percent of all similar cases have cancelled their contract. The
actual calculation of the probability is a little more complex, but in general this explanation gives
a good idea of how the probability is computed.
Table 1 is the only input used for this kind of back-testing analysis. This means the process is
independent of the chosen mining algorithm (as long as the algorithm provides a prediction
probability).
Because the prediction was performed three months ago, the company can now compare the
actual results with the prediction. How many prediction errors occurred? Is the model working
correctly?
How many prediction errors can be expected?
If the mining model works correctly, a churn probability of 0.874 for a single contract means that
87.4 percent of similar cases cancelled their contract. Consequently, 100 percent minus 87.4
percent, or 12.6 percent, of similar cases did not cancel their contract. So, whenever the mining
algorithm predicted a case as churn=true, the error probability is 1 - ChurnProbability. For the
cases that are predicted as churn=false (that is, where ChurnProbability is less than 50 percent),
the error probability is equal to the ChurnProbability. This means we can calculate the expected
number of prediction failures using the following Transact-SQL query.
SELECT SUM(CASE churn
             WHEN 1 THEN 1 - churnprobability
             ELSE churnprobability
           END) AS errorcount
FROM   [dbo].[Mining_Result]
For this example, the query results in 2,126 expected prediction errors (from a total of 30,000
cases within the sample data set). It is possible to distinguish false positive from false negative
prediction errors, but this article focuses on the total prediction errors only.
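If you did want that split, the expected counts of both error types can be derived with the same logic. The following query is an illustrative sketch (not part of the sample solution); here a false positive means the model predicted churn=true but the customer did not cancel, and a false negative is the reverse.

SELECT SUM(CASE WHEN churn = 1 THEN 1 - churnprobability ELSE 0 END) AS expected_false_positives,
       SUM(CASE WHEN churn = 0 THEN churnprobability ELSE 0 END) AS expected_false_negatives
FROM   [dbo].[Mining_Result]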
This expected error count should match the prediction quality measured on the test data set,
and indeed, the 2,126 errors calculated above correspond with the output of the classification
matrix when the matrix is built from the test data set. The classification matrix for the test data
set is the default in Business Intelligence Development Studio. It can also be queried by calling
the SystemGetClassificationMatrix Data Mining Extensions (DMX) stored function and setting
the second parameter to a value of 2 (1 means training data, 2 means test data, and 3 means
both), as shown in the following example.
CALL SystemGetClassificationMatrix(Churn_Rate_DT, 2, 'ContractCancelled')
The number of 2,126 expected prediction errors is just a theoretical number, because the
mining model is just an approximation of the real customers’ behavior. In reality it is very likely
that there would be a different number of errors. The key to the back-testing approach is to
understand the distribution of prediction errors. A distribution can be visualized as a probability
density function. Figure 2 shows this function based on the example data set.
Figure 2. Probability density function for the number of prediction errors
Before we go into more detail about how to produce this chart, let’s spend a moment on the
result. First, we see a peak somewhere between 2,102 and 2,152, which matches the previous
query result of 2,126 prediction errors. The area shaded green corresponds to 100 percent.
However, the probability for the peak value is less than 1 percent, so the probability of seeing
exactly 2,126 prediction errors (if the model is correct) is less than 1 percent.
Monte Carlo simulation
For known distribution functions, the probability density function can be plotted easily. For
example, many of these functions can be directly used in Microsoft Office Excel, like the normal
distribution, the gamma distribution, or the chi-square distribution. However, if the distribution is
not known, determining the probability density function is more complicated. Monte Carlo
simulations are frequently used to compute the probability density function.
Basically, a Monte Carlo simulation for our churn prediction sample works like this:
1. Create an empty histogram table that maps the number of cancellations to the count of scenarios that ended with this number of cancellations.
2. Do n loops (scenarios). For each loop:
   a. Initially set the number of cancelled contracts x for this scenario to zero.
   b. Look at each case (inner loop):
      i. Compute a random number r.
      ii. Based on this random number, decide whether the case is counted as a prediction error by comparing r with the ChurnProbability.
      iii. If an error is detected, increment x.
   c. At the end of the scenario, increment the number of occurrences of x cancellations in the histogram table.
For example, the first scenario might end in 2,105 prediction errors. This means the histogram
table looks like this:
2105 → 1
The second scenario ended in 2,294 cancellations. Now, the histogram table looks like this:
2105 → 1
2294 → 1
If the third scenario ends in 2,105 cancellations again, the histogram table looks like this:
2105 → 2
2294 → 1
Usually, many scenarios must be computed in order to get a relatively smooth graph. Here are
some examples.
[Charts: simulated distribution of prediction errors for 50, 500, 5,000, 30,000, 100,000, and 300,000 scenarios]
As you can see, the chart gets smoother the more scenarios are used for the simulation.
However, the more scenarios are used, the longer the calculation takes. In this example, the
300,000 scenarios require 9 billion calculations in total (because the inner loop consists of
30,000 cases). The chart from Figure 2 is a smoothed version of a simulation with 30,000
scenarios.
Using this approach makes it possible to calculate an approximation to the probability density
function without actually knowing the type of distribution.
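Before walking through the Integration Services implementation used in this article, here is a minimal set-based Transact-SQL sketch of the same idea. It is intended only for quick experiments with a small number of scenarios; it assumes the Mining_Result table shown earlier, uses the common RAND(CHECKSUM(NEWID())) idiom for a per-row random number, and is considerably slower than the script component described later.

;WITH Scenarios AS
(
    -- outer loop: 1,000 scenario numbers
    SELECT TOP (1000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS ScenarioId
    FROM sys.all_objects a CROSS JOIN sys.all_objects b
),
Draws AS
(
    -- inner loop: one uniform random number per scenario and case
    SELECT s.ScenarioId,
           r.ChurnProbability,
           RAND(CHECKSUM(NEWID())) AS rnd
    FROM Scenarios s CROSS JOIN dbo.Mining_Result r
),
Errors AS
(
    -- count the prediction errors per scenario
    SELECT ScenarioId,
           SUM(CASE WHEN (ChurnProbability >= 0.5 AND rnd > ChurnProbability)
                      OR (ChurnProbability <  0.5 AND rnd < ChurnProbability)
                    THEN 1 ELSE 0 END) AS NumErrors
    FROM Draws
    GROUP BY ScenarioId
)
SELECT NumErrors AS NumCases, COUNT(*) AS [Count]  -- the histogram
FROM Errors
GROUP BY NumErrors
ORDER BY NumErrors;

For large numbers of scenarios, the Integration Services implementation described in the following sections is the more practical choice.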
Monte Carlo simulations are used in many ways, for example:
•	Financial risk management (investment banking): In contrast with stress testing, which uses well-defined historic events, Monte Carlo simulation can be used to simulate unpredictable effects on the markets. Based on the probability density function, measures like the value at risk (VaR) can be computed.
•	Customer order forecast (CRM): Each offer is associated with a likelihood (probability). How many orders can we expect? For example, how likely is it that we see fewer than 500 orders?
•	Project risk management: Tom DeMarco and Timothy Lister use Monte Carlo simulation to get a good understanding of the effects (time/cost) of project risks in their book Waltzing With Bears (see reference [2] at the end of this paper).
Probability density function
The switch from the histogram to the probability density function as shown in Figure 2 is rather
simple. You just have to replace the number of cases on the y-axis with the probability, which is
the number of cases divided by the number of scenarios that were used for the Monte Carlo
simulation.
For the purpose of statistical tests, a specific outcome of exactly n cancellations is not really
interesting. But based on the probability density function, it is easy to calculate the probability for
any range of results. For example, if you are interested in how likely it is to see between
2,100 and 2,200 prediction errors, you only need to calculate the area below the graph in this
range. For continuous distributions this would be the integral, but because this example is a
discrete distribution, you can simply sum up the values.
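Because the simulation results end up in a histogram table (dbo.Mining_Histogram, created in the next section), such a range probability can be computed with a simple sum. The following query is a sketch that assumes the table is already populated:

SELECT SUM(Probability) AS ProbabilityOfRange   -- probability of 2,100 to 2,200 prediction errors
FROM   dbo.Mining_Histogram
WHERE  NumCases BETWEEN 2100 AND 2200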
In many cases, you may be interested in the probability for seeing up to n prediction errors
(instead of exactly n prediction errors). This probability is calculated using the cumulated
probability function (sum in the range from 0 to n) as shown in Figure 3.
Figure 3. Cumulated probability density function of the data from Figure 2
For example, it is very unlikely (almost 0 percent) to see fewer than 2,000 cancellations, but it is
very likely (almost 100 percent) to see fewer than 2,251 cancellations.
Before you design the statistical test, the Monte Carlo simulation must be in place. The next
section explains how to implement the Monte Carlo simulation using SQL Server Integration
Services.
Implementing the Monte Carlo simulation in Integration Services
This section shows how to implement the Monte Carlo simulation as an Integration Services
script component.
Step 1: Create the output table
The output table is used to store the results of the Monte Carlo simulation. Strictly speaking,
only two columns are necessary: the column NumCases represents the number of prediction
errors, and the column Count represents the number of scenarios for which the Monte Carlo
simulation ended in this particular number of prediction errors.
The other columns are used for additional computations and will be explained later in this
article.
CREATE TABLE [dbo].[Mining_Histogram](
    [NumCases] [int] NOT NULL,
    [Count] [int] NULL,
    [Probability] [float] NULL,
    [TotalProbability] [float] NULL,
    [BetaCount] [int] NULL,
    [BetaProbability] [float] NULL,
    [BetaTotalProbability] [float] NULL
)
Step 2: Create the Integration Services package
In Business Intelligence Development Studio, create a new package with a data source (the
database with the mining results as shown in Table 1 and the output table created in the first
step). Add an Execute SQL task and a Data Flow task to the package as shown in Figure 4.
Figure 4. Control flow for the Monte Carlo simulation
The Execute SQL task is used to erase all data from the histogram table created in step 1, so
that you start with an empty histogram.
truncate table Mining_Histogram
Step 3: Add package scope variables
For easier configuration, two package-scoped variables are used:
MonteCarloNumCases sets the number of scenarios (the outer loop), and BetaValue is used
for the error model that is discussed later.
Step 4: Implement the data flow
Add the following data flow components to the data flow task.
Figure 5. Data flow for the Monte Carlo simulation
The data source simply selects all rows from the Mining_Result table.
SELECT CaseKey, Churn, ChurnProbability
FROM   dbo.Mining_Result
Step 5: Define the output columns for the Monte Carlo test
For the script component, add the user variables, User::BetaValue and
User::MonteCarloNumCases, as read-only variables. Then add the following columns as
output columns.
Figure 6. Output columns for the script component
Column | Data type | Description
NumCases | DT_I4 | Number of cases
Count_Beta | DT_I4 | Number of scenarios of the error model that ended with this number of cases
Count | DT_I4 | Number of scenarios that ended with this number of cases
Probability | DT_R8 | Probability of this number of cases (Count divided by total number of Monte Carlo scenarios)
BetaProbability | DT_R8 | Probability of this number of cases for the error model
CumulatedProbability | DT_R8 | Cumulated probability
CumulatedBetaProbability | DT_R8 | Cumulated probability of the error model
Step 6: Write the script for the Monte Carlo simulation
Enter the following lines as the script for the script component written in C#. The code contains
some basic logging and also computes the additional fields for the output table.
using System;
using System.Data;
using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
using Microsoft.SqlServer.Dts.Runtime.Wrapper;
using System.Collections;
using System.Windows.Forms;

[Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
public class ScriptMain : UserComponent
{
    ArrayList cases;
    int numloops = 0;
    double betavalue;
    String componentname = "Monte Carlo Simulator";
    Hashtable ht = null;        // histogram for the original model
    Hashtable ht_beta = null;   // histogram for the error (beta) model

    public override void PreExecute()
    {
        base.PreExecute();
        // Read the package variables and prepare the histogram tables
        cases = new ArrayList();
        numloops = Variables.MonteCarloNumCases;
        betavalue = Variables.BetaValue;
        ht = new Hashtable(numloops / 50);
        ht_beta = new Hashtable(numloops / 50);
    }

    public override void PostExecute()
    {
        base.PostExecute();
    }

    public override void Input0_ProcessInputRow(Input0Buffer Row)
    {
        // Collect the churn probability of each case in memory
        if (!Row.ChurnProbability_IsNull) cases.Add(Row.ChurnProbability);
    }

    public override void Input0_ProcessInput(Input0Buffer Buffer)
    {
        bool dummy = true;

        while (Buffer.NextRow())
        {
            Input0_ProcessInputRow(Buffer);
        }

        ComponentMetaData.FireInformation(0, componentname,
            "Cases loaded: " + cases.Count.ToString(), "", 0, ref dummy);

        if (!Buffer.EndOfRowset()) return;

        Random random = new Random(12345);
        int counter;
        int counter_beta;
        int maxkey = 0;
        int minkey = Int32.MaxValue;

        // Outer loop: one pass per simulated scenario
        for (int lp = 1; lp <= numloops; lp++)
        {
            if (lp % 500 == 0)
                ComponentMetaData.FireInformation(0, componentname,
                    "Simulating case: " + lp.ToString(), "", 0, ref dummy);

            counter = 0;
            counter_beta = 0;

            // Inner loop: one random draw per case; count prediction errors
            // for the original model and for the error (beta) model
            foreach (double prob in cases)
            {
                double rnd = random.NextDouble();
                if ((prob >= 0.5 && rnd > prob) || (prob < 0.5 && rnd < prob))
                    counter++;
                if ((prob >= 0.5 && rnd > prob - betavalue) || (prob < 0.5 && rnd < prob + betavalue))
                    counter_beta++;
            }

            // Update the histograms with the result of this scenario
            if (ht.ContainsKey(counter))
                ht[counter] = (int)ht[counter] + 1;
            else
                ht.Add(counter, 1);

            if (ht_beta.ContainsKey(counter_beta))
                ht_beta[counter_beta] = (int)ht_beta[counter_beta] + 1;
            else
                ht_beta.Add(counter_beta, 1);

            if (counter > maxkey) maxkey = counter;
            if (counter < minkey) minkey = counter;
            if (counter_beta > maxkey) maxkey = counter_beta;
            if (counter_beta < minkey) minkey = counter_beta;
        }

        cases.Clear();

        ComponentMetaData.FireInformation(0, componentname,
            "Simulation done. Simulated cases: " + numloops.ToString(), "", 0, ref dummy);

        // Write result to output
        double case_prob;
        double case_betaprob;
        double totalprob = 0;
        double totalbetaprob = 0;

        for (int i = minkey - 1; i <= maxkey; i++)
        {
            Output0Buffer.AddRow();
            Output0Buffer.NumCases = i;

            if (!ht.ContainsKey(i))
            {
                Output0Buffer.Count = 0;
                case_prob = 0;
            }
            else
            {
                Output0Buffer.Count = (int)ht[i];
                case_prob = ((int)(ht[i])) / ((double)numloops);
            }

            if (!ht_beta.ContainsKey(i))
            {
                Output0Buffer.CountBeta = 0;
                case_betaprob = 0;
            }
            else
            {
                Output0Buffer.CountBeta = (int)ht_beta[i];
                case_betaprob = ((int)(ht_beta[i])) / ((double)numloops);
            }

            // Probability = count divided by the number of scenarios;
            // the cumulated columns are simple running totals
            Output0Buffer.Probability = case_prob;
            Output0Buffer.BetaProbability = case_betaprob;

            totalprob += case_prob;
            totalbetaprob += case_betaprob;

            Output0Buffer.CumulatedProbability = totalprob;
            Output0Buffer.CumulatedBetaProbability = totalbetaprob;
        }
    }

    public override void CreateNewOutputRows()
    {
    }
}
Please note that we initialized the random number generator with a fixed seed here in order to
make the output reproducible.
…
Random random = new Random(12345);
…
Otherwise, each run of the package would produce slightly different results.
Step 7: Map the script output to the output table
Finally, the script output needs to be mapped to the output table as shown in Figure 7.
Figure 7. Mapping of the script output to the output table
Step 8: Test the package
Run the package. If everything is correct, the output table dbo.Mining_Histogram should be
populated with 483 rows (for the sample data from earlier). The first and last rows are shown in
Figure 8.
Figure 8. First and last rows of the Monte Carlo histogram table
The column Count stores the number of scenarios that ended with the corresponding number of
prediction errors. For example, two scenarios ended with 1,980 prediction errors, and no
scenario ended with 1,978 errors. The column TotalProbability contains a simple running total;
therefore, TotalProbability is zero for the first line and (almost) 1 for the last line. The columns
with “beta” in the column name belong to the error model and are described in the next section.
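As a plausibility check, the Probability and TotalProbability columns written by the script can be recomputed directly from the Count column. The following query is a sketch based on the table definition from step 1:

SELECT h.NumCases,
       h.[Count],
       CAST(h.[Count] AS float) / t.Total AS RecomputedProbability,
       -- running total of the probability up to this number of errors
       (SELECT CAST(SUM(h2.[Count]) AS float) / t.Total
        FROM   dbo.Mining_Histogram h2
        WHERE  h2.NumCases <= h.NumCases) AS RecomputedRunningTotal
FROM   dbo.Mining_Histogram h
       CROSS JOIN (SELECT SUM([Count]) AS Total FROM dbo.Mining_Histogram) t
ORDER  BY h.NumCases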
Statistical hypothesis testing
A statistical hypothesis test defines a hypothesis and a test condition. As long as the hypothesis
passes the test condition, the hypothesis is considered to be correct.
In this example, the hypothesis is “The data mining model is working correctly.” The test
condition is “The number of prediction errors is less than n.” The variable n is the test parameter
that has to be defined.
As long as our back-testing ends with fewer than n prediction errors, the test is considered as
passed. In the context of statistics, the test result is then considered “negative.” However, if
there are more than n errors, the hypothesis fails the test condition and has to be considered
wrong. In the context of statistics, the test result is considered “positive” in this case.
When doing this kind of test, there are basically two types of mistake we could make:
•	According to the test, the model is wrong although it is correct.
•	According to the test, the model is correct although it is wrong.
Table 2 shows the possible combinations between test result and reality as a matrix.
Table 2. Type I and type II errors
Test result | Model is correct (reality) | Model is incorrect (reality)
The test is negative, meaning that the model passes the test. | Correct result (probability = 1 - alpha; this value represents the specificity of the test). | Type 2 error / beta error.
The test is positive, meaning that the model does not pass the test. | Type 1 error / alpha error. | Correct result (probability = 1 - beta; this value represents the power or sensitivity of the test).
If n is very high, the risk for a type 1 error is low and the risk for a type 2 error is high. On the
other hand, if n is low, the risk for a type 1 error is high and the risk for a type 2 error is low. The
art of test design is in choosing an acceptable risk probability for both type 1 and type 2 errors.
To do so, you need to calculate both errors. This is where the Monte Carlo results are useful.
In order to continue we first need to specify what is really meant when we say, “The model is
incorrect.” Basically, we need to define a tolerance. In order to do so, we use an error model.
This means that the statement “the model is incorrect” is transformed to “the alternative error
model is correct.”
The source code used earlier already includes the calculation for this error model. In this
example, we assume that the error model is off by 0.5 percent (the value of the package
variable BetaValue). Within the source code, the following two lines make the decision for the
original and the alternative model.
...
if ((prob >= 0.5 && rnd > prob) || (prob < 0.5 && rnd < prob)) counter++;
if ((prob >= 0.5 && rnd > prob - betavalue) || (prob < 0.5 && rnd < prob + betavalue)) counter_beta++;
...
Figure 9 shows the probability density functions for the two models.
Figure 9. Probability density function for hypothesis and alternative model
The error model results in more prediction errors. To finalize the test configuration, we need to
look at the cumulated probability functions.
Function | Description
Cumulated value of 1 minus the probability for the original model | Probability for more than n errors, although the model is correct.
Cumulated value of the probability for the alternative error model | Probability for less than n errors, although the alternative model is correct (that is, the tolerance is exactly 0.5 percent). This is equal to the probability that exactly n errors will occur while the tolerance is 0.5 percent or higher.
Figure 10 shows both functions.
Figure 10. Cumulated probability density function for hypothesis and alternative model
Here are some sample values for the two error types, based on the table shown in Figure 8
(columns NumCases, TotalProbability, and BetaTotalProbability).
n (test parameter) | Probability for type 1 error | Probability for type 2 error
2,100 | 72.9% | 0.0%
2,120 | 55.3% | 0.0%
2,140 | 36.5% | 0.1%
2,160 | 20.7% | 0.3%
2,180 | 9.8%  | 1.3%
2,200 | 3.8%  | 3.9%
2,220 | 1.2%  | 9.9%
2,240 | 0.3%  | 20.2%
2,260 | 0.1%  | 35.9%
2,280 | 0.0%  | 54.1%
2,300 | 0.0%  | 71.3%
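Because these values come straight from the histogram table, both error probabilities can be looked up for any candidate test parameter. The following query is a sketch using the column names from step 1:

DECLARE @n int = 2200;

SELECT NumCases AS n,
       Round(100 * (1 - TotalProbability), 1) AS Type1ErrorPercent, -- alpha
       Round(100 * BetaTotalProbability, 1)   AS Type2ErrorPercent  -- beta
FROM   dbo.Mining_Histogram
WHERE  NumCases = @n;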
In this example, there are low values for type 1 and type 2 errors at n=2,200. In many real-world
scenarios it will not be possible to minimize the type 1 and type 2 error probabilities at the same
time. This means that you need to make a decision whether to optimize for a low type 1 or type
2 error probability.
If, for example, the objective for the test is to keep the type 1 error below 1 percent, then we
could compute the corresponding type 2 error probability using this query.
SELECT TOP 1 numcases AS n,
       Round(100 * ( betatotalprobability ), 1) AS beta
FROM   mining_histogram
WHERE  Round(100 * ( 1 - totalprobability ), 1) < 1
ORDER  BY betatotalprobability

For this example, the resulting type 2 error probability will be about 11.5 percent with n = 2,224.
On the other hand, we could also use a similar query to keep the type 2 error probability below 1
percent.
SELECT TOP 1 numcases AS n,
       Round(100 * ( 1 - totalprobability ), 1) AS Alpha
FROM   mining_histogram
WHERE  Round(100 * ( betatotalprobability ), 1) < 1
ORDER  BY betatotalprobability DESC
This would result in a type 1 error probability of about 12 percent and a value of 2,175 for n.
Summary and discussion
Back-testing can be an invaluable tool for ensuring that your organization is making the best use
of predictive data mining models. However, if you are not familiar with statistical tests, this
concept might be new. Therefore, this paper described in detail how to perform back-testing in a
hypothetical scenario, using SQL Server and Visual Studio. Here is a summary of the test and
its results:
•	As a prerequisite for this scenario, we need a data mining model to work against, along with its results. The test compares data from a data mining model for predicting churn probability (finding a churn score for each case).
•	The results of the model are used for marketing purposes, so we want to be sure that the model is working correctly.
•	The requirements for back-testing are to design a test such that:
	o If the model passes the test, the company is 99 percent sure that the model was really correct.
•	A query was used to determine the test condition “Number of prediction errors < 2,175.”
•	The test showed that there is about 12 percent probability that the model fails to meet the requirements although it is actually working correctly.
For the error model we chose a simple shift of 0.5 percent here. Of course, we could also have
chosen more sophisticated error models or even a random guess model.
You can also use different Monte Carlo simulations for other aspects of the data mining process.
Some examples are listed below, with a brief discussion of each:
Understanding the distribution of cancellations
You can use the Monte Carlo simulation to count cancellations instead of prediction errors. For
example, you may want to determine the answer to this question: What is the worst number of
cancellations to happen (with 95 percent confidence)? In statistical terms, this means you would
compute the upper 5 percent percentile (that is, the 95 percent point) of the cumulated
probability density function.
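Assuming you rerun the simulation so that the histogram counts cancellations rather than prediction errors (keeping the same table layout), the worst-case number of cancellations at 95 percent confidence could then be read with a query like this sketch:

-- smallest number of cancellations whose cumulated probability reaches 95 percent
SELECT TOP 1 NumCases AS WorstCaseCancellations
FROM   dbo.Mining_Histogram
WHERE  TotalProbability >= 0.95
ORDER  BY NumCases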
Understanding the quality of the classification matrix
You can also use this simulation to learn more about the quality of your classification matrix. As
suggested earlier, the number of prediction errors can be split into type 1 and type 2 errors. If
you want to determine the quality of the criteria used to classify error type, you can use the
training data as input, rather than the mining result data. This method enables you to compare
the classification based on the test data with the error probability based on the training data.
The output could be as shown in Figure 11. Here, the distribution (the shaded area) is the
density taken from the test data error probability. The thin line is the cumulated density (the
scales on the y-axis and the x-axis are different) and the blue line is the actual number of errors
based on the test data set. The higher the point at which the blue line intersects the cumulated
density, the better it is. For example, in Figure 11, the classification does not really match the
expectations. Usually in this case, cross validation would reveal a large standard deviation
between the test partitions.
Figure 11. Example of an alternative classification matrix
Conclusion
Back-testing is a good practice for validating data mining models. In back-testing, you compare
past predictions with actual results in order to prove that the mining model is still working as
designed. For example, market conditions may have changed dramatically and the mining
model may not reflect these changes; back-testing shows whether the mining model needs to
be adjusted.
The statistical hypothesis test provides the framework for defining a test condition. In order to
understand the error probabilities of the test, the distribution of the number of prediction errors is
important. This paper demonstrates how to use a Monte Carlo simulation to compute this
distribution. You can use this type of simulation for other tasks within the data mining process as
well.
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
http://ms-olap.blogspot.com: Hilmar Buchta’s blog about Analysis Services, MDX, and Data
Mining
Did this paper help you? Please give us your feedback. Tell us on a scale of 1 (poor) to 5
(excellent), how would you rate this paper and why have you given it this rating? For example:
•	Are you rating it high due to having good examples, excellent screen shots, clear writing, or another reason?
•	Are you rating it low due to poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
Send feedback.
Resources
[1] Monte Carlo method – Wikipedia (http://en.wikipedia.org/wiki/Monte_Carlo_method)
[2] Tom DeMarco and Timothy Lister, “Waltzing With Bears: Managing Risk on Software Projects”, Dorset House Publishing Company, and the Riskology website: http://www.systemsguild.com/riskology/
[3] Probability density function – Wikipedia (http://en.wikipedia.org/wiki/Probability_density_function)
[4] Sample data and solution for this article (SQL Server 2008 R2 database backup and sample solution) (http://cid61f98448a5e17d57.office.live.com/self.aspx/%c3%96ffentlich/Validating%20Using%20Monte%20Carlo%20Sim.zip)
About the author: Hilmar Buchta is a Business Intelligence consultant, project manager, and
architect at ORAYLIS GmbH (http://www.oraylis.de).