SQL Server Technical Article

Back-Testing Data Mining Results Using Monte Carlo Simulation

Writer: Hilmar Buchta
Technical Reviewer: Dana Cristofor
Published: January 2011
Applies to: SQL Server 2008, SQL Server 2008 R2

Summary: This article describes statistical test methods for validating the correctness of a data mining model. To do this, the predicted events are compared with the real events in a back-testing process, and a threshold for the number of prediction errors is defined.

Copyright

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This white paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in, or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.
Unless otherwise noted, the example companies, organizations, products, domain names, e-mail addresses, logos, people, places, and events depicted herein are fictitious, and no association with any real company, organization, product, domain name, e-mail address, logo, person, place, or event is intended or should be inferred.

© 2011 Microsoft Corporation. All rights reserved.

Contents

Introduction
Back testing
Practical example used in this article
How many prediction errors can be expected?
Monte Carlo simulation
Probability density function
Implementing the Monte Carlo simulation in Integration Services
Statistical hypothesis testing
Summary and discussion
Conclusion
Resources

Introduction

Data mining is frequently used to derive new knowledge from existing data. This new knowledge may then influence future business decisions. Usually these business decisions involve some kind of risk. For example, in a manufacturing company the result of a data mining process could be that sales for a certain product are very likely to increase during the next period. The logical decision in this case would be to increase the production of this product in order to meet the expected sales growth. If the prediction from the mining process turns out to be true, everything is perfect. If not, the inventory is full but there is no market for the product.

Whenever mining results are used as a basis for business decisions, we want to be sure that we fully understand the results, and we want to be able to prove that our mining models are working correctly.

This article describes a back testing process that can be used to validate that the mining model is working within the expected tolerance. The back testing process described here makes use of Microsoft SQL Server 2008 R2, Microsoft Visual Studio, and data generated using the Monte Carlo method. For general information about these Microsoft products and the Monte Carlo method, see the links provided at the end of the paper.

Back testing

In the design phase, we can use techniques like cross validation or mining accuracy charts to understand the statistical relevance of the model, as shown in Figure 1.

TIME | TOOLS
Design time of the mining model | Lift chart, cross validation, etc.
Usage of the model | Back testing

Figure 1. Tools for validating data mining results

In contrast, back testing is used to compare the predictions that the mining process made in a past period with today's reality.
For example, a publishing company wants to predict which customers are likely to cancel their subscriptions. To do this, they use data mining. Three months later, they use a back testing approach to test the prediction against the reality. The result of the back testing is one of the following:

- The data mining model works as expected (number of false predictions within tolerance).
- The data mining model does not work as expected (number of false predictions not within tolerance).

In the latter case, a review of the mining model is usually the appropriate action, to see whether there are missing input variables or other major changes that need to be reflected by the mining model.

Practical example used in this article

In order to illustrate the back-testing process, this paper uses data from a fictitious telecommunications company. Each customer has a 12-month contract that can be cancelled by the customer at the end of the period. If the contract is not cancelled, it is automatically continued for another 12 months. The mining model in this example is trained to predict the likelihood of a contract cancellation within the next three months.

In order to follow the example, I have provided a link, listed at the end of this paper, from which you can download the sample data and other resources. The download (zip archive) contains the following:

- A Microsoft SQL Server 2008 R2 backup of the sample database (DataMining) with randomly generated test data (30,000 cases)
- A Microsoft Visual Studio solution for the Monte Carlo simulation (SQL Server Integration Services project) and visualization of the results (SQL Server Reporting Services project)

Let’s assume that we did a churn prediction three months ago. For this article we are not interested in the actual data mining model but only in the prediction output. We stored the results in a database table named Mining_Result, as shown in Table 1.

Table 1. Sample data (output from mining model)

CaseKey | Churn | ChurnProbability
1 | true | 0.874
2 | false | 0.071
3 | false | 0.017
4 | true | 0.502
5 | false | 0.113
6 | false | 0.160
7 | false | 0.069
8 | false | 0.018
9 | false | 0.026
10 | false | 0.187
… | … | …

The column Churn contains the predictions from the mining algorithm. A Churn value of true means that for this particular case, the customer is likely to cancel the contract. The ChurnProbability column contains the output from the mining function PredictProbability(Churn,true). The probability is essentially the cancellation ratio among all cases that the mining algorithm considers similar, represented as a number between 0 (meaning 0 percent) and 1 (meaning 100 percent). For example, in the first row of Table 1, the probability is 0.874. This means that 87.4 percent of all similar cases have cancelled their contract. The actual calculation of the probability is a little more complex, but this explanation provides a good general understanding of how the probability is calculated.

Table 1 is the only input used for this kind of back-testing analysis. This means the process is independent of the chosen mining algorithm (as long as the algorithm provides a prediction probability).

Because the prediction was performed three months ago, the company can now compare the actual results with the prediction. How many prediction errors occurred? Is the model working correctly?

How many prediction errors can be expected?

If the mining model works correctly, a churn probability of 0.874 for a single contract means that 87.4 percent of similar cases cancelled their contract. Consequently, this means that 100 percent minus 87.4 percent, or 12.6 percent, of similar cases did not cancel their contract. So, whenever the mining algorithm predicted a case as churn=true, the error probability is 1 - ChurnProbability.
For the cases that are predicted as churn=false (that is, where ChurnProbability is less than 50 percent), the error probability is equal to the ChurnProbability. This means that we can calculate the number of expected prediction errors using the following Transact-SQL query.

    SELECT SUM(CASE churn
                   WHEN 1 THEN 1 - churnprobability
                   ELSE churnprobability
               END) AS errorcount
    FROM [dbo].[Mining_Result]

For this example, the query results in 2,126 expected prediction errors (from a total of 30,000 cases within the sample data set). It is possible to distinguish false positive from false negative prediction errors, but this article focuses on the total prediction errors only.

We would expect this number to match the prediction quality measured on the test data set, and indeed, the number that this query returns corresponds with the output of the classification matrix when you build the classification matrix using the test data set. The classification matrix for the test data set is the default in Business Intelligence Development Studio. It can also be queried by calling the SystemGetClassificationMatrix Data Mining Extensions (DMX) stored function and setting the second parameter to a value of 2 (1 means training data, 2 means test data, 3 means both), as shown in the following example.

    CALL SystemGetClassificationMatrix(Churn_Rate_DT, 2, 'ContractCancelled')

The number of 2,126 expected prediction errors is just a theoretical number, because the mining model is just an approximation of the real customers’ behavior. In reality it is very likely that there would be a different number of errors. The key to the back-testing approach is to understand the distribution of prediction errors. A distribution can be visualized as a probability density function. Figure 2 shows this function based on the example data set.

Figure 2. Probability density function for the number of prediction errors

Before we go into more detail about how to produce this chart, let’s spend a moment with the result. First, we see a peak somewhere between 2,102 and 2,152. This matches the previous query result of 2,126 prediction errors. The area shaded green corresponds to 100 percent. However, the probability for the peak value is less than 1 percent. So the probability for seeing exactly 2,126 prediction errors (if the model is correct) is less than 1 percent.

Monte Carlo simulation

For known distribution functions, the probability density function can be plotted easily. For example, many of these functions can be used directly in Microsoft Office Excel, like the normal distribution, the gamma distribution, or the chi-square distribution. However, if the distribution is not known, determining the probability density function is more complicated. Monte Carlo simulations are frequently used to compute the probability density function.

Basically, a Monte Carlo simulation for our churn prediction sample works like this:

1. Create an empty histogram table that maps the number of prediction errors to the count of scenarios that ended with this number of prediction errors.
2. Do n loops (scenarios). For each loop:
   a. Initially set the number of prediction errors x for this scenario to zero.
   b. Look at each case (inner loop):
      i. Compute a random number r.
      ii. Based on this random number, decide whether the case is counted as a prediction error by comparing r with the ChurnProbability.
      iii. If an error is detected, increment x.
   c. At the end of each scenario, increment the number of occurrences of x prediction errors in the histogram table.

For example, the first scenario might end in 2,105 prediction errors. This means the histogram table looks like this:

    2105    1

The second scenario ended in 2,294 prediction errors.
Now, the histogram table looks like this:

    2105    1
    2294    1

If the third scenario ends in 2,105 prediction errors again, the histogram table looks like this:

    2105    2
    2294    1

Usually, many scenarios must be computed in order to get a relatively smooth graph. Here are some examples.

[Charts: simulated histograms for 50, 500, 5,000, 30,000, 100,000, and 300,000 scenarios]

As you can see, the chart gets smoother as more scenarios are used for the simulation. However, the more scenarios are used, the longer the calculation takes. In this example, the 300,000 scenarios require 9 billion calculations in total (because the inner loop consists of 30,000 cases here). The chart from Figure 2 is a smoothed version of a simulation with 30,000 scenarios.

Using this approach makes it possible to calculate an approximation to the probability density function without actually knowing the type of distribution.

Monte Carlo simulations are used in many ways, for example:

- Financial risk management (investment banking)
  In contrast with stress testing, which uses well-defined historic events, Monte Carlo simulation can be used to simulate unpredictable effects on the markets. Based on the probability density function, measures like the value at risk (VaR) can be computed.
- Customer order forecast (CRM)
  Each offer is associated with a likelihood (probability). How many orders can we expect? For example, how likely is it that we see fewer than 500 orders?
- Project risk management
  Tom DeMarco and Timothy Lister use Monte Carlo simulation to get a good understanding of the effects (time/cost) of project risks in their book Waltzing With Bears (see reference [2] at the end of this paper).

Probability density function

The switch from the histogram to the probability density function as shown in Figure 2 is rather simple.
You just have to replace the number of scenarios on the y-axis with the probability, which is the number of scenarios that ended with a given result divided by the total number of scenarios that were used for the Monte Carlo simulation.

For the purpose of statistical tests, a specific outcome of exactly n prediction errors is not really interesting. But based on the probability density function it is easy to calculate the probability for any range of results. For example, if you are interested in how likely it is to see between 2,100 and 2,200 prediction errors, you only need to calculate the area below the graph in this range. For continuous distributions this would be the integral, but because this example is a discrete distribution, you can simply sum up the values.

In many cases, you may be interested in the probability of seeing up to n prediction errors (instead of exactly n prediction errors). This probability is calculated using the cumulated probability function (sum in the range from 0 to n) as shown in Figure 3.

Figure 3. Cumulated probability density function of the data from Figure 2

For example, it is very unlikely (almost 0 percent) to see fewer than 2,000 prediction errors, but it is very likely (almost 100 percent) to see fewer than 2,251 prediction errors.

Before you design the statistical test, the Monte Carlo simulation must be in place. The next section explains how to implement the Monte Carlo simulation using SQL Server Integration Services.

Implementing the Monte Carlo simulation in Integration Services

This section shows how to implement the Monte Carlo simulation as an Integration Services script component.

Step 1: Create the output table

The output table is used to store the results of the Monte Carlo simulation. Strictly speaking, only two columns are necessary: The column NumCases represents the number of prediction errors, and the column Count represents the number of scenarios for which the Monte Carlo simulation ended in this particular number of prediction errors.
The other columns are used for additional computations and will be explained later in this article.

    CREATE TABLE [dbo].[Mining_Histogram](
        [NumCases] [int] NOT NULL,
        [Count] [int] NULL,
        [Probability] [float] NULL,
        [TotalProbability] [float] NULL,
        [BetaCount] [int] NULL,
        [BetaProbability] [float] NULL,
        [BetaTotalProbability] [float] NULL
    )

Step 2: Create the Integration Services package

In Business Intelligence Development Studio, create a new package with a data source (the database with the mining results as shown in Table 1 and the output table created in the first step). Add an Execute SQL task and a Data Flow task to the package as shown in Figure 4.

Figure 4. Control flow for the Monte Carlo simulation

The Execute SQL task is used to erase all data from the histogram table created in step 1, so that you start with an empty histogram.

    truncate table Mining_Histogram

Step 3: Add package scope variables

For easier configuration, two variables are used. MonteCarloNumCases is used to set the number of scenarios (outer loop). BetaValue is used for the error model that is discussed later.

Step 4: Implement the data flow

Add the following data flow components to the data flow task.

Figure 5. Data flow for the Monte Carlo simulation

The data source simply selects all rows from the Mining_Result table.

    SELECT CaseKey, Churn, ChurnProbability
    FROM dbo.Mining_Result

Step 5: Define the output columns for the Monte Carlo test

For the script component, add the user variables User::BetaValue and User::MonteCarloNumCases as read-only variables. Then add the following columns as output columns.

Figure 6. Output columns for the script component

Column | Data type | Description
NumCases | DT_I4 | Number of cases
Count_Beta | DT_I4 | Number of scenarios of the error model that ended with this number of cases
Count | DT_I4 | Number of scenarios that ended with this number of cases
Probability | DT_R8 | Probability of this number of cases (Count divided by total number of Monte Carlo scenarios)
BetaProbability | DT_R8 | Probability of this number of cases for the error model
CumulatedProbability | DT_R8 | Cumulated probability
CumulatedBetaProbability | DT_R8 | Cumulated probability of the error model

Step 6: Write the script for the Monte Carlo simulation

Enter the following lines as the script for the script component, written in C#. The code contains some basic logging and also computes the additional fields for the output table.

    using System;
    using System.Data;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;
    using Microsoft.SqlServer.Dts.Runtime.Wrapper;
    using System.Collections;
    using System.Windows.Forms;

    [Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute]
    public class ScriptMain : UserComponent
    {
        ArrayList cases;            // churn probabilities of all input rows
        int numloops = 0;           // number of Monte Carlo scenarios
        double betavalue;           // shift used by the error model
        String componentname = "Monte Carlo Simulator";
        Hashtable ht = null;        // histogram for the original model
        Hashtable ht_beta = null;   // histogram for the error model

        public override void PreExecute()
        {
            base.PreExecute();
            cases = new ArrayList();
            numloops = Variables.MonteCarloNumCases;
            betavalue = Variables.BetaValue;
            ht = new Hashtable(numloops / 50);
            ht_beta = new Hashtable(numloops / 50);
        }

        public override void PostExecute()
        {
            base.PostExecute();
        }

        public override void Input0_ProcessInputRow(Input0Buffer Row)
        {
            if (!Row.ChurnProbability_IsNull)
                cases.Add(Row.ChurnProbability);
        }

        public override void Input0_ProcessInput(Input0Buffer Buffer)
        {
            bool dummy = true;

            // Collect the probabilities of all rows in this buffer.
            while (Buffer.NextRow())
            {
                Input0_ProcessInputRow(Buffer);
            }
            ComponentMetaData.FireInformation(0, componentname,
                "Cases loaded: " + cases.Count.ToString(), "", 0, ref dummy);

            // Start the simulation only after the last buffer has arrived.
            if (!Buffer.EndOfRowset())
                return;

            Random random = new Random(12345);
            int counter;
            int counter_beta;
            int maxkey = 0;
            int minkey = Int32.MaxValue;

            // Outer loop: one iteration per scenario.
            for (int lp = 1; lp <= numloops; lp++)
            {
                if (lp % 500 == 0)
                    ComponentMetaData.FireInformation(0, componentname,
                        "Simulating case: " + lp.ToString(), "", 0, ref dummy);

                counter = 0;
                counter_beta = 0;

                // Inner loop: decide for each case whether it is a prediction error.
                foreach (double prob in cases)
                {
                    double rnd = random.NextDouble();
                    if ((prob >= 0.5 && rnd > prob) || (prob < 0.5 && rnd < prob))
                        counter++;
                    if ((prob >= 0.5 && rnd > prob - betavalue) || (prob < 0.5 && rnd < prob + betavalue))
                        counter_beta++;
                }

                // Update both histograms with the result of this scenario.
                if (ht.ContainsKey(counter))
                {
                    ht[counter] = (int)ht[counter] + 1;
                }
                else
                    ht.Add(counter, 1);

                if (ht_beta.ContainsKey(counter_beta))
                {
                    ht_beta[counter_beta] = (int)ht_beta[counter_beta] + 1;
                }
                else
                    ht_beta.Add(counter_beta, 1);

                if (counter > maxkey) maxkey = counter;
                if (counter < minkey) minkey = counter;
                if (counter_beta > maxkey) maxkey = counter_beta;
                if (counter_beta < minkey) minkey = counter_beta;
            }
            cases.Clear();

            ComponentMetaData.FireInformation(0, componentname,
                "Simulation done. Simulated cases: " + numloops.ToString(), "", 0, ref dummy);

            // Write result to output
            double case_prob;
            double case_betaprob;
            double totalprob = 0;
            double totalbetaprob = 0;

            // Start one below the smallest key so that the first row has a total of zero.
            for (int i = minkey - 1; i <= maxkey; i++)
            {
                Output0Buffer.AddRow();
                Output0Buffer.NumCases = i;

                if (!ht.ContainsKey(i))
                {
                    Output0Buffer.Count = 0;
                    case_prob = 0;
                }
                else
                {
                    Output0Buffer.Count = (int)ht[i];
                    case_prob = ((int)(ht[i])) / ((double)numloops);
                }

                if (!ht_beta.ContainsKey(i))
                {
                    Output0Buffer.CountBeta = 0;
                    case_betaprob = 0;
                }
                else
                {
                    Output0Buffer.CountBeta = (int)ht_beta[i];
                    case_betaprob = ((int)(ht_beta[i])) / ((double)numloops);
                }

                Output0Buffer.Probability = case_prob;
                Output0Buffer.BetaProbability = case_betaprob;
                totalprob += case_prob;
                totalbetaprob += case_betaprob;
                Output0Buffer.CumulatedProbability = totalprob;
                Output0Buffer.CumulatedBetaProbability = totalbetaprob;
            }
        }

        public override void CreateNewOutputRows()
        {
        }
    }

Please note that we initialized the random number generator with a fixed seed here in order to make our output reproducible.

    …
    Random random = new Random(12345);
    …

Otherwise, each run of the package would produce slightly different results.

Step 7: Map the script output to the output table

Finally, the script output needs to be mapped to the output table as shown in Figure 7.

Figure 7. Mapping of the script output to the output table

Step 8: Test the package

Run the package. If everything is correct, the output table dbo.Mining_Histogram should be populated with 483 rows (for the sample data from earlier). The first and last rows are shown in Figure 8.

Figure 8. First and last rows of the Monte Carlo histogram table

The column Count stores the number of occurrences of the corresponding scenarios. For example, two scenarios ended with 1,980 prediction errors, and no scenario ended with 1,978 errors. The column TotalProbability contains a simple running total. Therefore, TotalProbability is zero for the first line and 1 (almost) for the last line.
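The simulation in Step 6 and the derived columns can also be cross-checked outside of Integration Services. The following is a minimal sketch in Python, not part of the original download. It assumes a small made-up list of churn probabilities instead of the 30,000-case sample, and the names simulate_error_histogram and to_distribution are illustrative only. It applies the same error rule as the script component and then derives the probability and the running total for each number of prediction errors.

```python
import random
from collections import Counter


def simulate_error_histogram(probabilities, num_scenarios, beta=0.0, seed=12345):
    """Map 'number of prediction errors' to the number of scenarios
    that ended with that many errors.

    beta shifts the error probability of every case, mirroring the
    error-model branch of the script component (beta=0 is the original model).
    """
    rng = random.Random(seed)  # fixed seed for reproducible output
    histogram = Counter()
    for _ in range(num_scenarios):
        errors = 0
        for prob in probabilities:
            rnd = rng.random()
            # Predicted churn=true (prob >= 0.5): error if the case does not churn.
            # Predicted churn=false (prob < 0.5): error if the case does churn.
            if (prob >= 0.5 and rnd > prob - beta) or (prob < 0.5 and rnd < prob + beta):
                errors += 1
        histogram[errors] += 1
    return histogram


def to_distribution(histogram, num_scenarios):
    """Turn scenario counts into rows of (errors, probability, running total)."""
    rows = []
    total = 0.0
    for errors in range(min(histogram), max(histogram) + 1):
        probability = histogram.get(errors, 0) / num_scenarios
        total += probability
        rows.append((errors, probability, total))
    return rows


# Hypothetical mini data set: ten cases instead of the 30,000 in the sample database.
sample = [0.874, 0.071, 0.017, 0.502, 0.113, 0.160, 0.069, 0.018, 0.026, 0.187]
scenarios = 2000
hist = simulate_error_histogram(sample, scenarios)
dist = to_distribution(hist, scenarios)
for errors, probability, total in dist:
    print(errors, round(probability, 4), round(total, 4))
```

Because Python's random number generator differs from the .NET Random class, the counts will not match the package output even with the same seed; only the procedure is the same. Unlike the script component, this sketch does not emit an extra leading row with a running total of zero.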
The columns with “beta” in the column name belong to the error model and are described in the next section.

Statistical hypothesis testing

A statistical hypothesis test defines a hypothesis and a test condition. As long as the hypothesis passes the test condition, the hypothesis is considered to be correct.

In this example, the hypothesis is “The data mining model is working correctly.” The test condition is “The number of prediction errors is less than n.” The variable n is the test parameter that has to be defined. As long as our back-testing ends with fewer than n prediction errors, the test is considered as passed. In the context of statistics, the test result is then considered “negative.” However, if there are more than n errors, the hypothesis fails the test condition and has to be considered wrong. In the context of statistics, the test result is considered “positive” in this case.

When doing this kind of test, there are basically two types of mistake we could make:

- According to the test, the model is wrong although it is correct.
- According to the test, the model is correct although it is wrong.

Table 2 shows the possible combinations of test result and reality as a matrix.

Table 2. Type I and type II errors

Test result | Model is correct (reality) | Model is incorrect (reality)
The test is negative, meaning that the model passes the test. | Correct result (probability = 1 - alpha; this value represents the specificity of the test). | Type 2 error / beta error.
The test is positive, meaning that the model does not pass the test. | Type 1 error / alpha error. | Correct result (probability = 1 - beta; this value represents the power or sensitivity of the test).

If n is very high, the risk for a type 1 error is low and the risk for a type 2 error is high. On the other hand, if n is low, the risk for a type 1 error is high and the risk for a type 2 error is low. The art of test design is in choosing an acceptable risk probability for both type 1 and type 2 errors.
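In symbols, and as a sketch only: let X be the number of prediction errors observed in back-testing, and write F_0 and F_1 for the cumulated probability of X when the model is correct and when it is incorrect, respectively. What "incorrect" means has to be fixed by the test design, so F_1 is an assumption here. The sketch also treats exactly n errors as a pass, which is one possible reading of the test condition. Under these assumptions the two risks are:

```latex
\alpha(n) = P_0(X > n) = 1 - F_0(n),
\qquad
\beta(n) = P_1(X \le n) = F_1(n)
```

Because F_0 and F_1 never decrease as n grows, alpha(n) can only fall and beta(n) can only rise, which is the trade-off between the two error types described above.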
To do so, you need to calculate both errors. This is where the Monte Carlo results are useful.

In order to continue, we first need to specify what is really meant when we say, “The model is incorrect.” Basically, we need to define a tolerance. In order to do so, we use an error model. This means that the statement “the model is incorrect” is transformed to “the alternative error model is correct.” The source code used earlier already includes the calculation for this error model. In this example, we assume that the error model is off by 0.5 percent (the value of the package variable BetaValue). Within the source code, the following two statements make the decisions for the original and the alternative model.

    ...
    if ((prob >= 0.5 && rnd > prob) || (prob < 0.5 && rnd < prob))
        counter++;
    if ((prob >= 0.5 && rnd > prob - betavalue) || (prob < 0.5 && rnd < prob + betavalue))
        counter_beta++;
    ...

Figure 9 shows the probability density functions for the two models.

Figure 9. Probability density function for hypothesis and alternative model

The error model results in more prediction errors. To finalize the test configuration, we need to look at the cumulated probability functions.

Function | Description
1 minus the cumulated probability for the original model. | Probability for more than n errors, although the model is correct.
Cumulated probability for the alternative error model. | Probability for less than n errors, although the alternative model is correct (that is, the tolerance is exactly 0.5 percent). This is equal to the probability that exactly n errors will occur while the tolerance is 0.5 percent or higher.

Figure 10 shows both functions.

Figure 10. Cumulated probability density function for hypothesis and alternative model

Here are some sample values for the two types of error, using the table shown in Figure 8 and the columns NumCases, TotalProbability, and BetaTotalProbability.
n (test parameter) | Probability for type 1 error | Probability for type 2 error
2,100 | 72.9% | 0.0%
2,120 | 55.3% | 0.0%
2,140 | 36.5% | 0.1%
2,160 | 20.7% | 0.3%
2,180 | 9.8% | 1.3%
2,200 | 3.8% | 3.9%
2,220 | 1.2% | 9.9%
2,240 | 0.3% | 20.2%
2,260 | 0.1% | 35.9%
2,280 | 0.0% | 54.1%
2,300 | 0.0% | 71.3%

In this example, there are low values for both type 1 and type 2 errors at n=2,200. In many real-world scenarios it will not be possible to minimize the type 1 and type 2 error probabilities at the same time. This means that you need to decide whether to optimize for a low type 1 or a low type 2 error probability.

If, for example, the objective for the test is to keep the type 1 error below 1 percent, then we could compute the corresponding type 2 error probability using this query.

    SELECT TOP 1 numcases AS n,
           Round(100 * ( betatotalprobability ), 1) AS beta
    FROM mining_histogram
    WHERE Round(100 * ( 1 - totalprobability ), 1) < 1
    ORDER BY betatotalprobability

For this example, the resulting type 2 error probability will be about 11.5 percent with n=2,224.

On the other hand, we could also use a similar query to keep the type 2 error probability below 1 percent.

    SELECT TOP 1 numcases AS n,
           Round(100 * ( 1 - totalprobability ), 1) AS alpha
    FROM mining_histogram
    WHERE Round(100 * ( betatotalprobability ), 1) < 1
    ORDER BY betatotalprobability DESC

This would result in a type 1 error probability of about 12 percent and a value of 2,175 for n.

Summary and discussion

Back-testing can be an invaluable tool for ensuring that your organization is making the best use of predictive data mining models. However, if you are not familiar with statistical tests, this concept might be new. Therefore, this paper described in detail how to perform back-testing in a hypothetical scenario, using SQL Server and Visual Studio.

Here is a summary of the test and its results:

- As a prerequisite for this scenario, we need a data mining model to work against, along with its results.
- The test compares data from a data mining model for predicting churn probability (finding a churn score for each case).
- The results of the model are used for marketing purposes, so we want to be sure that the model is working correctly.
- The requirements for back-testing are to design a test such that:
    - If the model passes the test, the company is 99 percent sure that the model was really correct.
- A query was used to determine the test condition “Number of prediction errors < 2,175.”
- The test showed that there is about 12 percent probability that the model fails to meet the requirements although it is actually working correctly.
- For the error model we chose a simple shift here of 0.5 percent. Of course, we could also have chosen more sophisticated error models or even a random guess model.

You can also use different Monte Carlo simulations for other aspects of the data mining process. Some examples are listed, with a brief discussion of each:

- Understanding the distribution of cancellations
  You can use the Monte Carlo simulation to count cancellations instead of prediction errors. For example, you may want to determine the answer to this question: What is the worst number of cancellations to happen (with 95 percent confidence)? In statistical terms, this means you would compute the 95th percentile of the cumulated probability density function, that is, the number of cancellations that is exceeded in only 5 percent of the scenarios.
- Understanding the quality of the classification matrix
  You can also use this simulation to learn more about the quality of your classification matrix. As suggested earlier, the number of prediction errors can be split into type 1 and type 2 errors. If you want to determine the quality of the criteria used to classify error type, you can use the training data as input, rather than the mining result data. This method enables you to compare the classification based on the test data with the error probability based on the training data. The output could be as shown in Figure 11.
Here, the distribution (the shaded area) is the density taken from the test data error probability. The thin line is the cumulated density (the scales on the y-axis and the x-axis are different), and the blue line is the actual number of errors based on the test data set. The higher the point at which the blue line intersects the cumulated density, the better it is. For example, in Figure 11, the classification does not really match the expectations. Usually in this case, cross validation would reveal a large standard deviation between the test partitions.

Figure 11. Example of an alternative classification matrix

Conclusion

Back-testing is a good practice for validating data mining models. In back-testing, you compare past predictions with actual results in order to prove that the mining model is still working as designed. For example, market conditions may have changed dramatically and the mining model may not reflect these changes; back-testing shows whether the mining model needs to be adjusted.

The statistical hypothesis test provides the basis for defining a test condition. In order to understand the error probabilities for the test, the distribution of the number of prediction errors is important. This paper demonstrates how to use a Monte Carlo simulation to compute this distribution. You can use this type of simulation for other tasks within the data mining process as well.

For more information:

http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
http://ms-olap.blogspot.com: Hilmar Buchta’s blog about Analysis Services, MDX, and Data Mining

Resources

[1] Monte Carlo method, Wikipedia (http://en.wikipedia.org/wiki/Monte_Carlo_method)
[2] Tom DeMarco and Timothy Lister, “Waltzing With Bears: Managing Risk on Software Projects”, Dorset House Publishing Company; Riskology website: http://www.systemsguild.com/riskology/
[3] Probability density function, Wikipedia (http://en.wikipedia.org/wiki/Probability_density_function)
[4] Sample data and solution for this article (SQL Server 2008 R2 database backup and sample solution) (http://cid61f98448a5e17d57.office.live.com/self.aspx/%c3%96ffentlich/Validating%20Using%20Monte%20Carlo%20Sim.zip)

About the author: Hilmar Buchta is a Business Intelligence consultant, project manager, and architect at ORAYLIS GmbH (http://www.oraylis.de).