Automation of Data Mining Using Integration Services

SQL Server Technical Article
Writer: Jeannine Takaki
Technical Reviewer: Raman Iyer
Published: July 2011
Applies to: SQL Server 2008 R2, SQL Server 2008, SQL Server 2005
Summary: This article is a walkthrough that illustrates how to build multiple related data mining models
by using the tools that are provided with Microsoft SQL Server Integration Services. In this
walkthrough, you will learn how to automatically build and process multiple data mining models
based on a single mining structure, how to create predictions from all related models, and how
to save the results to a relational database for further analysis. Finally, you will view and compare
the predictions, historical trends, and model statistics in SQL Server Reporting Services reports.
Copyright
This document is provided “as-is”. Information and views expressed in this document, including URL and
other Internet Web site references, may change without notice. You bear the risk of using it.
Some examples depicted herein are provided for illustration only and are fictitious. No real association
or connection is intended or should be inferred.
This document does not provide you with any legal rights to any intellectual property in any Microsoft
product. You may copy and use this document for your internal, reference purposes.
© 2011 Microsoft. All rights reserved.
Contents

Introduction
Automating the Creation of Data Mining Models
Solution Walkthrough
    Scope
    Overall Process
Phase 1 - Preparation
    Create the Forecasting mining structure and default time series mining model
    Extract and edit the XMLA statement
    Prepare the replacement parameters
Phase 2 - Model Creation
    Create package and variables (CreateParameterizedModels.dtsx)
    Configure the Execute SQL task (Get Model Parameters)
    Configure the Foreach Loop container (Foreach Model Definition)
Phase 3 - Process the Models
    Create package and variables (ProcessEmptyModels.dtsx)
    Add a Data Mining Query task (Execute DMX Query)
    Create Execute SQL task (List Unprocessed Models)
    Create a Foreach Loop container (Foreach Model in Variable)
    Add an Analysis Services Processing task to the Foreach Loop (Process Current Model)
    Add a Data Mining Query task after the Foreach Loop (Update Processing Status)
Phase 4 - Create Predictions for All Models
    Create package and variables (PredictionsAllModels.dtsx)
    Create Execute SQL task (Get Processed Models)
    Create Execute SQL task (Get Series Names)
    Create Foreach Loop container (Predict Foreach Model)
    Create variables for the Foreach Loop container
    Create the Data Mining Query task (Predict Amt)
    Create the second Data Mining Query task (Predict Qty)
    Create Data Flow tasks to Archive the Results
    Create Data Flow tasks (Archive Results Qty, Archive Results Amt)
    Run, Debug, and Audit Packages
Phase 5 - Analyze and Report
    Using the Data Mining Viewer
    Using Reporting Services for Data Mining Results
    Interpreting the Results
Discussion
    The Case for Ensemble Models
    Closing the Loop: Interpreting and Getting Feedback on Models
Conclusion
Resources
Acknowledgements
Code for Script Task
Introduction
This article is a walkthrough that illustrates how to use the data mining tools that are provided
with Microsoft SQL Server Integration Services. If you are an experienced data miner, you
probably already use the tools provided in Business Intelligence Development Studio or the
Data Mining Client Add-in for Microsoft Excel for building or browsing mining models. However,
Integration Services can help you automate many of these processes.
This solution also introduces the concept of ensemble models for data mining, which are sets of
multiple related models. For most data mining projects, you need to create several models,
analyze the differences, and compare outputs before you can select a best model to use
operationally. Integration Services provides a framework within which you can easily generate
and manage ensemble models.
In this series, you will learn how to:
• Configure the Integration Services components that are provided for data mining.
• Automatically build and update mining models by using Integration Services.
• Store mining model parameters and prediction results in the database engine.
• Integrate reporting requirements in the model design workflow.
Note that these are just a few of the ways that you can use Integration Services to incorporate
data mining into analytic and data handling workflows. Hopefully these examples will help you
get more mileage out of existing installations of Integration Services and SQL Server Analysis
Services.
Automating the Creation of Data Mining Models
This scenario positions you as an analyst who has been tasked with creating some projections
based on past sales data. You are unsure about how to configure the time series algorithm for
best results (ARIMA? ARTXP? What hints to provide?). Moreover, you know that the modeling
process typically involves building several models and testing different scenarios.
Rather than build variations on the model ad hoc, you decide to automatically generate multiple
related models, varying the parameters systematically for each model. This way you can easily
create many models, each using a different combination of periodicity hints and algorithm type.
After you have created and processed all the models, you will put the historical data plus the
predictions for each model into a series of reports to see which models provide good results.
This walkthrough demonstrates these features:
• Automatically building multiple mining models using parameters stored in SQL Server
• Generating bulk predictions for each model and storing them in SQL Server
• Comparing trends from the models by putting them side by side in reports
Solution Walkthrough
This section describes the complete solution that builds multiple models and creates queries
that return predictions from each model. It contains these parts:
[1] Analysis Services project: To create this project, follow the instructions in the Data Mining
tutorial on MSDN (http://msdn.microsoft.com/en-us/library/ms169846.aspx) to create a
Forecasting mining structure and default time series mining model.
[2] Integration Services project: You will create a new project, containing multiple packages:
• A package that builds multiple models, using the Analysis Services Execute DDL task
• A package that processes multiple models, using the Analysis Services Processing task
• A package that creates predictions from all models, using the Data Mining Query task
Scope
The following Integration Services tasks and components are used in this walkthrough. For
more information from SQL Server Books Online, click on the link in the Task or component
column.
Task or component                      Used for
Execute SQL task                       Gets variable values, and creates tables to store results
Analysis Services Execute DDL task     Creates individual models
Analysis Services Processing task      Populates the models with data
Foreach Loop container                 Builds and processes multiple data mining models
Script task                            Builds the required XMLA commands
Data Mining Query task                 Creates predictions from each model
Data Flow task                         Manages and merges prediction results
OLE DB source                          Gets data from temporary prediction table
OLE DB destination                     Writes predictions to permanent table
Derived Column transformation          Adds metadata about predictions
Even though the following Integration Services components are also very useful for data mining,
they are not used in this walkthrough—look for examples in a later paper:
• Data Profiling task
• Conditional Split transformation
• Percentage Sampling transformation
• Lookup and Fuzzy Lookup transformations
• Data Mining Training destination
Note: The SQL Server Reporting Services project containing the reports that compare models
is not included here, even though this project generates all the data required for the reports.
That is because the report creation process is somewhat lengthy to describe, especially if you
are not familiar with Reporting Services. Moreover, since all the prediction data is stored in the
relational database, there are other reporting clients you can use, including Microsoft
PowerPivot for Excel and Project Crescent. However, we hope to describe the process in a
separate article later on the TechNet Wiki
(http://social.technet.microsoft.com/wiki/contents/articles/default.aspx).
Overall Process
Phase 1 - Preparation: The definition of the models you want to create is stored in SQL Server
as a set of parameters, values, and model names.
Phase 2 – Model creation: Integration Services retrieves the model definitions and passes the
parameter values to a Foreach Loop that builds and then executes the XML for Analysis (XMLA)
statement for each model.
Phase 3 – Model processing: Integration Services retrieves a list of available models, and then
it processes each model by populating it with data.
Phase 4 - Prediction: Integration Services issues a prediction query to each processed model.
Each set of predictions is saved to a SQL Server table.
Phase 5 – Reporting and analysis: The prediction trends for each model are compared by
using reports (created by Reporting Services, PowerPivot, or your favorite reporting client) using
the data in the relational table.
Phase 1 - Preparation
In this phase, you set up the structure, sample data, and parameters your packages will use.
Before you build the working packages, you need to complete the following tasks:
• Create the Forecasting mining structure used by all mining models.
• Generate the sample XMLA that represents the default time series mining model, to use as a template.
• Create a table that stores replacement parameters for the new models, and then insert the parameter values.
The following section describes how to perform these tasks.
Create the Forecasting mining structure and default time series mining model
To create multiple models based on a single mining structure, you need to create the
Forecasting mining structure first. Based on that mining structure, you also need to create a
time series model that can be used as the template for generating other models. If you do not
already have a mining structure capable of supporting a time series model, you can build one by
following the steps described in the Microsoft Time Series tutorial
(http://msdn.microsoft.com/en-us/library/ms169846.aspx) in SQL Server Books Online.
Extract and edit the XMLA statement
Next, use the time series model from the previous step to extract an XMLA statement that will
serve as a template for all other models.
You can get the XMLA statement for any model or structure by using the scripting options in
SQL Server Management Studio:
1. In SQL Server Management Studio, right-click the time series model.
2. Click Script Mining Model as.
3. Save the XMLA to a text file.
4. Open the text file in Notepad or another plain-text editor.
After you generate XMLA for the default time series model by using the script option, it looks
like the following code. (The XMLA statement for models can be lengthy, so only an excerpt
is shown here.) The XMLA statement always includes the database, the mining structure,
metadata such as the model name, and the algorithm used for analysis. It can optionally
include multiple parameters.
<ParentObject>
  <DatabaseID>Forecasting Models</DatabaseID>
  <MiningStructureID>Forecasting</MiningStructureID>
</ParentObject>
<ObjectDefinition>
  <MiningModel>
    <ID>ARIMA_1-10-30</ID>
    <Name>ARIMA_1-10-30</Name>
    <Algorithm>Microsoft_Time_Series</Algorithm>
    <AlgorithmParameters>
      <AlgorithmParameter>
        <Name>FORECAST_METHOD</Name>
        <Value xsi:type="xsd:string">ARIMA</Value>
      </AlgorithmParameter>
      <AlgorithmParameter>
        <Name>PERIODICITY_HINT</Name>
        <Value xsi:type="xsd:string">{1,10,30}</Value>
      </AlgorithmParameter>
    </AlgorithmParameters>
    <Columns>
      <!-- column definitions omitted from this excerpt -->
    </Columns>
  </MiningModel>
</ObjectDefinition>
Next, make the following changes to the command text that you extracted:
• Add the parameters that you want to change, if they are not already present in the model. Default parameter values are not necessarily part of the scripted XMLA output, so if your base model does not explicitly set any parameters, you will need to add the XMLA section that contains them.
• Remove unnecessary white space and all line breaks. For this walkthrough, the XMLA is stored as a string in a variable, which cannot contain line breaks. If you leave in any line breaks, the problem is not detected at package validation; instead, at run time, the Analysis Services engine attempts to execute the XMLA and fails with an error.
To clean up the file, use your favorite text editor. White spaces such as tabs and multiple
space characters are okay but you can remove them if you like, to shorten the string
variable. There is no limit on the size of string variables, but there is a 4,000-character
limit in the expression editor.
5. If your model does not already contain the parameters FORECAST_METHOD and
PERIODICITY_HINT, use the code listed earlier, and copy the XML node that begins
with <AlgorithmParameters> and ends with </AlgorithmParameters>. Paste it into the
text file containing the XMLA command, directly below the line that defines the algorithm,
and before the section that defines the columns.
6. Edit the entire XMLA statement to remove line breaks. You can use any text editor that
you like, so long as you verify that the result is a single line of text.
Prepare the replacement parameters
To create new models, you must update the basic XMLA command that you just created by
inserting different values for the parameters. Among the parameters you must update are the
model names and the model ID. Before you do this, you may find it helpful to review the format
of the parameters you will change:
• FORECAST_METHOD – Can have the values MIXED (default), ARIMA, or ARTXP.
• PERIODICITY_HINT – Can have any combination of numbers separated by commas and enclosed by curly braces.
• MODEL_ID – Must be unique for each model you create, or an error will be generated.
• MODEL_NAME – Should match the MODEL_ID; optional, but having them match makes the process easier to understand.
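For example, for the model ARTXP_1-7-10 that is defined in the parameters table later in this section, the parameter section of the generated XMLA would end up looking like this (line breaks shown here for readability only):

<AlgorithmParameters>
  <AlgorithmParameter>
    <Name>FORECAST_METHOD</Name>
    <Value xsi:type="xsd:string">ARTXP</Value>
  </AlgorithmParameter>
  <AlgorithmParameter>
    <Name>PERIODICITY_HINT</Name>
    <Value xsi:type="xsd:string">{1,7,10}</Value>
  </AlgorithmParameter>
</AlgorithmParameters>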
Integration Services is extremely flexible, so there are many different ways to store the
parameters and insert them into the model XMLA. For example, you could:
• Store the parameters as text in a SQL Server table, and then insert them into the XMLA within a Foreach Loop, by using an ADO.NET iterator.
• Save the XMLA command as a text file and read it using a Flat File connection. Save the variables in a configuration file and apply them at run time.
• Save the XMLA command as an .xml file, and then read it into a package by using an XML source. Insert the variables into the XML by using the properties and methods of the XML task.
• Create multiple XMLA files in advance, and then read the files with a combination of a Foreach Loop and an XML source connection.
However, for this scenario, you need to be able to easily add new sets of parameters, and to
view and update the complete list of models and parameters.
Therefore, you’ll use the first method: create the parameter-value pairs as records in a SQL
Server database, and then read in the new values at run time by using a Foreach Loop
container. This way, you can easily view or update the parameters by using SQL queries.
Run the following statement to create the parameters table.
USE [DMReporting] -- substitute your database name here
GO
/****** Object: Table [dbo].[ModelParameters] Script Date: 11/09/2010 10:56:26 ******/
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE TABLE [dbo].[ModelParameters](
    [RecordID] [int] IDENTITY(1,1) NOT NULL,
    [RecordDate] [datetime] NULL,
    [ModelID] [nvarchar](50) NULL,
    [ModelName] [nvarchar](50) NULL,
    [ForecastMethod] [nvarchar](50) NULL,
    [PeriodicityHint] [nvarchar](50) NULL
) ON [PRIMARY]
GO
The following table lists the parameters that are used to build models in this walkthrough. Insert
these values into the parameters table you created by using the script.
ModelID          ForecastMethod   PeriodicityHint
ARIMA_1-7-10     ARIMA            {1,7,10}
ARIMA_1-10-30    ARIMA            {1,10,30}
ARIMA_nohints    ARIMA            {1}
MIXED_1-7-10     MIXED            {1,7,10}
MIXED_1-10-30    MIXED            {1,10,30}
MIXED_nohints    MIXED            {1}
ARTXP_1-7-10     ARTXP            {1,7,10}
ARTXP_1-10-30    ARTXP            {1,10,30}
ARTXP_nohints    ARTXP            {1}
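For example, the first row can be loaded with a statement like the following (a sketch; repeat it for the remaining rows, keeping ModelName identical to ModelID as recommended earlier).

INSERT INTO dbo.ModelParameters
    (RecordDate, ModelID, ModelName, ForecastMethod, PeriodicityHint)
VALUES
    (GETDATE(), N'ARIMA_1-7-10', N'ARIMA_1-7-10', N'ARIMA', N'{1,7,10}')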
This scenario uses the parameters FORECAST_METHOD and PERIODICITY_HINT because
they are among the most important parameters for time series models (also because they are
string values and easy to change!).
However, the parameters that you change will be completely different for other algorithms. For
example, if you build a clustering model, you might decide to change the
CLUSTERING_METHOD parameter and build models using each of the four clustering
methods, such as K-Means. You might also try altering the MINIMUM_SUPPORT parameter, or
trying a variety of cluster seeds. For a list of the parameters provided by the different algorithms,
see the algorithm technical reference topics (http://msdn.microsoft.com/en-us/library/cc280427.aspx) in MSDN.
Important note for data miners: Altering parameter values can strongly affect the model
results. Therefore, you should have some sort of plan for analyzing the results and weeding out
badly fit models.
For example, because the time series algorithm is very sensitive to periodicity hints, it can
produce poor results if you provide the wrong hint. If you specify that the data contains weekly
cycles and it actually contains monthly cycles, the algorithm attempts to fit the data to the
suggested weekly cycle and might produce odd results. Some of the models generated by this
automation process demonstrate this behavior.
There are many ways that you can check the validity of models:
• Use descriptive statistics and metadata for the individual models to eliminate models that have characteristics of overfitting or poor fit.
• Validate data sets and models by using cross-validation or one of the other accuracy measures provided by SQL Server. For more information, see Validating Data Mining Models (http://technet.microsoft.com/en-us/library/ms174493.aspx) in SQL Server Books Online.
• Choose only parameters and values that make sense for the business problem; use business rules to guide your modeling.
This completes the preparations, and you can now build the three packages.
The instructions for each package begin with a diagram that illustrates the package workflow
and briefly describes the package components.
The diagram is followed by steps that you can follow to configure each task or destination.
Phase 2 - Model Creation
In Phase 2, you build a package that creates many mining models, using the template and
parameters you prepared in Phase 1. From among these models, you can later choose the one
that best suits your needs. This package includes the following tasks:
• The initial Execute SQL task, which gets the model names and model parameters from the relational database and then stores the results of that query in an ADO.NET rowset variable
• A Foreach Loop container, which iterates over the values stored in the ADO.NET rowset variable and then passes the new model names and model parameters, one at a time, to the Script task and the Analysis Services Execute DDL task inside the Foreach Loop
• The Script task, which loads the base XMLA statement from a variable, inserts the new parameters, and then writes out the new XMLA to another variable
• The Analysis Services Execute DDL task, which executes the updated XMLA statement contained in the variable
Create package and variables (CreateParameterizedModels.dtsx)
1. Create a new Integration Services package and name it
CreateParameterizedModels.dtsx.
2. Set up the variable that stores the model parameters. The variable name should be
User::objAllModelParameters, and the variable should have the type Object. Make sure
that the variable has package scope, because it passes values from the query results,
returned by the Get Model Parameters (Execute SQL task), to a Foreach Loop.
Configure the Execute SQL task (Get Model Parameters)
1. Add an Execute SQL task to the package you just created and name it Get Model
Parameters.
2. In the Execute SQL Task Editor, specify the database that stores the model
parameters. This walkthrough uses an OLE DB connection manager.
3. For the SQLSourceType property, specify Direct input, and then paste in the following
query.
SELECT ModelID AS MODEL_ID,
ModelID AS MODEL_NAME,
ForecastMethod AS FORECAST_METHOD,
PeriodicityHint AS PERIODICITY_HINT
FROM dbo.ModelParameters
13
4. For the ResultSet property, choose Full result set. This enables you to store a multi-row
result in a variable.
5. In the Result Set pane, for Result Name, type 0 (zero), and then assign the variable
User::objAllModelParameters.
Configure the Foreach Loop container (Foreach Model Definition)
Now that you have loaded a set of parameters into an object variable, you pass the variable to a
Foreach Loop container, which then performs an operation for each row of data:
1. Create a Foreach Loop container, and name it Foreach Model Definition.
2. Select the Foreach Loop container and then open the Variables window, to create the
following variables with the data types specified, scoped to the Foreach Loop container:
User::strBaseXMLA         String
User::strModelID          String
User::strModelName        String
User::strForecastMethod   String
User::strPeriodicityHint  String
User::strModelXMLA        String
3. Open the Foreach Loop Editor. For Collection, choose the Foreach ADO enumerator.
4. In the Foreach Loop Editor, configure the enumerator by choosing the variable
User::objAllModelParameters in the ADO object source variable drop-down list. Do not
change the default enumeration mode – you are only passing in one table, so the default
setting, Rows in the first table, is correct.
5. Click Variable Mappings, and then map each variable to the index of the corresponding
column in the parameter result set, as follows:

User::strModelID          0
User::strModelName        1
User::strForecastMethod   2
User::strPeriodicityHint  3
6. Add a Script task inside the Foreach Loop container, and name it Update Model XMLA.
7. In the Script Task Editor, specify the properties of the variables as follows:
User::strBaseXMLA         Read-only
User::strModelID          Read-only
User::strModelName        Read-only
User::strForecastMethod   Read-only
User::strPeriodicityHint  Read-only
User::strModelXMLA        Read/write
8. Click Edit Script to add the code that replaces the string for each variable value.
Note: The code for this task is included in the Appendix. The Script task performs a
simple operation: it finds the default values in the basic XMLA and inserts the new
parameter values by doing string replacement. You could also do this by using regular
expressions or XML methods, of course.
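For orientation, the replacement logic amounts to something like the following C# sketch; the appendix contains the actual code used in this solution. The sketch assumes the template model from Phase 1 (ARIMA_1-10-30) and the variable names from step 7.

public void Main()
{
    // Start from the single-line template XMLA prepared in Phase 1.
    string xmla = (string)Dts.Variables["User::strBaseXMLA"].Value;

    // Replace the template model ID and name first, so that the later
    // replacement of the FORECAST_METHOD value cannot alter them.
    // (ID and Name are identical in the template, so one pass covers both.)
    xmla = xmla.Replace("ARIMA_1-10-30",
        (string)Dts.Variables["User::strModelID"].Value);
    xmla = xmla.Replace(">ARIMA<",
        ">" + (string)Dts.Variables["User::strForecastMethod"].Value + "<");
    xmla = xmla.Replace("{1,10,30}",
        (string)Dts.Variables["User::strPeriodicityHint"].Value);

    // Write the customized XMLA to the read/write variable.
    Dts.Variables["User::strModelXMLA"].Value = xmla;
    Dts.TaskResult = (int)ScriptResults.Success;
}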
9. Add an Analysis Services Execute DDL task inside the Foreach Loop container and name
it Execute Model XMLA.
10. For the Connection property, specify the instance of Analysis Services where your
models are stored.
11. On the DDL tab, for SourceType, choose Variable, and then select the variable
User::strModelXMLA.
This completes the package that creates the models. You can now execute just this package by
right-clicking the package in Solution Explorer and then clicking Execute Package.
After the package runs, if there are no errors, you can connect to your Analysis Services
database by using SQL Server Management Studio and see the list of new models that were
created. However, you cannot browse the models or build prediction queries yet. That is
because the models are just metadata until they are processed, and they contain no data or
patterns. In the next package, you will process the models.
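If you prefer a query to browsing the object tree, you can run a quick check such as the following DMX statement against the Analysis Services database. It is a sketch using the same schema rowset that is queried in Phase 3; IS_POPULATED should be false for the new, unprocessed models.

SELECT MODEL_NAME, IS_POPULATED
FROM $system.DMSCHEMA_MINING_MODELS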
Phase 3 - Process the Models
This package gets a list of valid models from the Analysis Services server, and then it
processes the models using the Analysis Services Processing task.
Until processed, the model that you created by using XMLA is just a definition: a collection of
metadata that defines parameters and data source bindings. Processing gets the data from the
Forecasting data source, and then it generates patterns based on the algorithm you specified.
(For more information about the architecture of mining models, and about processing, see
Mining Structures (http://msdn.microsoft.com/en-us/library/ms174757.aspx) in the MSDN
Library.)
In summary, this is how the package handles processing:
• The first Data Mining Query task issues a query to get a list of valid models. That list is written to a temporary table in the relational engine.
• The next task, an Execute SQL task, retrieves the model names from that table and then puts them in an ADO.NET rowset variable.
• The Foreach Loop container takes the ADO.NET rowset variable contents as input, and then it processes each model serially, using the embedded Analysis Services Processing task.
• Finally, you update the status of the models.
Create package and variables (ProcessEmptyModels.dtsx)
1. Create a new Integration Services package and name it ProcessEmptyModels.dtsx.
2. With the package background selected, add a user variable, objModelsList. This
variable will have package scope and will be used to store the list of models that are
available on the server.
Add a Data Mining Query task (Execute DMX Query)
1. Create a new Data Mining Query task and name it Execute DMX Query.
2. In the Data Mining Query Task Editor, on the Mining Model tab, specify the Analysis
Services database that contains the time series mining models.
3. For Mining Structure, choose Forecasting.
4. Click the Query tab. Here, instead of creating a prediction query, paste in the following
text of a content query. A content query returns metadata about the model and data
already stored in the model in the form of summary statistics.
SELECT MODEL_NAME, IS_POPULATED, LAST_PROCESSED, TRAINING_SET_SIZE
FROM $system.DMSCHEMA_MINING_MODELS
Not all of these columns are needed for processing, but you can add the columns now
and update the information later.
5. On the Output tab, for Connection, select the relational database where you will store
the results. For this solution, it is <local server name>/DMReporting.
6. For Output table, type a temporary table name (in this solution, tmpProcessingStatus)
and then select the option Drop and re-create the output table.
Create Execute SQL task (List Unprocessed Models)
1. Add a new Execute SQL Task, and name it List Unprocessed Models. Connect it to the
previous task.
2. In the Execute SQL Task Editor, for Connection, use an OLE DB connection, and then
choose the server and database: for example, <local server name>/DMReporting.
3. For Result set, select Full result set.
4. For SQLSourceType, select Direct input.
5. For SQL Statement, type the following query text.
SELECT MODEL_NAME FROM tmpProcessingStatus
6. On the Result Set tab, assign the columns in the result set to variables. There is only
one column in the result set, so you assign the variable, User::objModelsList, to
ResultSet 0 (zero).
Create a Foreach Loop container (Foreach Model in Variable)
By now you should be pretty comfortable with using the combination of an Execute SQL task
and a Foreach Loop container.
1. Create a new Foreach Loop container and name it Foreach Model in Variable. Connect
it to the previous task.
2. With the Foreach Loop container selected, open the Variables window, and then add
three variables scoped to the Foreach Loop container. The latter two variables work
together: you store the processing command template in one variable, and then you use
an Integration Services expression to alter the text and save the changes to the second
variable:
strModelName1     String
strXMLAProcess1   String
strXMLAProcess2   String
3. In the Foreach Loop Editor, set the enumerator type to ForEach ADO enumerator.
4. In the Enumerator configuration pane, set ADO object source variable to
User::objModelsList.
5. Set Enumeration mode to Rows in the first table.
6. In the Variable mappings pane, assign the variable User::strModelName1 to Index 0.
This means that each row of the single-column table returned by the query will be fed
into the variable.
Add an Analysis Services Processing task to the Foreach Loop (Process
Current Model)
The editor for this task requires that you first connect to an Analysis Services database and then
choose from a list of objects that can be processed. However, because you need to automate
this task, you can’t use the interface to choose the objects to process. So how do you iterate
through a list of objects for processing?
The solution is to use an expression to alter the contents of the property,
ProcessingCommand. You use the variable, strXMLAProcess1, which you set up earlier, to
store the basic XMLA for processing a model, but you insert a placeholder that you can modify
later when you read the variable. You alter the command using an expression and write the new
XMLA out to a second variable, strXMLAProcess2.
1. Drag a new Analysis Services Processing task into the Foreach Loop container you just
created. Name it Process Current Model.
2. With the Foreach Loop selected, open the Variables window, and then select the
variable User::strXMLAProcess2.
3. In the Properties pane, select Evaluate as expression and set it to True.
4. For the value of the variable, type or build this expression.
REPLACE( @[User::strXMLAProcess1], "ModelNameHere", @[User::strModelName1] )
5. In the Analysis Services Processing Task Editor, click Expressions, and then
expand the list of expressions.
6. Select ProcessingCommand and then type the variable name as follows:
@[User::strXMLAProcess2]
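For reference, the processing template stored in strXMLAProcess1 might look like the following sketch. It assumes the database and mining structure IDs used in Phase 1 and full processing; as with the model-creation XMLA, the variable value must be a single line (line breaks are shown here for readability only).

<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>Forecasting Models</DatabaseID>
    <MiningStructureID>Forecasting</MiningStructureID>
    <MiningModelID>ModelNameHere</MiningModelID>
  </Object>
  <Type>ProcessFull</Type>
</Process>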
Another way to train the model would be to add a processing task within the same Foreach
Loop that you used to create the model. However, there are good reasons to build and process
the models in separate packages. For example:


Processing can be time-consuming, and it depends on connections to source data.
It is easier to debug problems when model creation and processing are in separate
packages.
Moreover, the Data Mining Query task that is provided in the Control Flow can be used to
execute many different types of queries against an Analysis Services data source. You can use
schema rowset queries within this task to get information about other Analysis Services objects,
including cubes and tabular models, or even run Data Mining Extensions (DMX) DDL
statements. (In contrast, the Data Mining Query Transformation component, available in the
Data Flow, can only be used to create predictions against an existing mining model.)
The final step in this phase is to add a task that updates the status of your mining models.
Add a Data Mining Query task after the Foreach Loop (Update Processing
Status)
This task uses the Data Mining Query task to get the updated status of the mining models, and
write that to a relational data table.
1. Right-click the Data Mining Query task you created before, because it has all the right
connections and the correct query text, and then click Copy.
2. Paste the task after the Foreach Loop container and connect it to the loop.
3. Rename the task Update Processing Status.
4. Open the Data Mining Query Task Editor, click the Output tab, and verify that the
option Drop and re-create the output table is selected.
This completes the package. You can now execute this package as before.
When you execute this package, the actual processing of each model can take a fairly long
time, depending on how many models are available. You might want to add logging to the
package to track the time used for processing each model.
Phase 4 - Create Predictions for All Models
In this package, you create prediction queries, using the Data Mining Query task, and run the
queries for each of the models that you just created and processed:
• You first use an Execute SQL task to query the list of models in the database, and save that list in a variable.
• The Foreach Loop then uses the variable to customize the prediction targets, by inserting the model name from the variable into the two Data Mining Query tasks.
• You write the prediction results to a table in the relational database, using a pair of Data Flow tasks, which also write out some useful metadata.
Create package and variables (PredictionsAllModels.dtsx)
1. Create a new Integration Services package and name it PredictionsAllModels.dtsx.
2. Create a variable with package scope as follows:
objProcessedModels Object
Create Execute SQL task (Get Processed Models)
1. Create a new Execute SQL task and name it Get Processed Models.
2. In the Execute SQL Task Editor, for Connection, use an OLE DB connection, and for
the server, type <local server name>/DMReporting.
3. For Result set, select Full result set.
4. For SQLSourceType, select Direct input.
5. For SQL Statement, type the following query text. (The purpose of adding the WHERE
condition is to ensure that you do not create a prediction against a model that has not
been processed, which would generate an error).
SELECT MODEL_NAME FROM dbo.tmpProcessingStatus
WHERE LAST_PROCESSED IS NOT null
6. On the Result Set tab, assign the columns in the result set to variables. There is only
one column in the result set, so you assign the variable User::objProcessedModels to
ResultSet 0 (zero).
Tip: When you are working with data mining models and especially when you are building
complex queries, we recommend that you build DMX queries beforehand by opening the model
directly in Business Intelligence Developer Studio and using Prediction Query Builder, or by
launching Prediction Query Builder from SQL Server Management Studio. The reason is that
when you build queries by using the data mining designers in SQL Server Management Studio
or Business Intelligence Developer Studio, Analysis Services does some validation, which
enables you to browse and select valid objects. However, the Query Builder provided in the
Data Mining Query task does not have this context and cannot validate or help with your
selections.
Create Execute SQL task (Get Series Names)
This task is not strictly necessary for prediction, but it creates data that is useful later for
reporting.
Recall that the time series data mining model is based on sales data for different product lines in
different regions, with each combination of a product line plus a region making a single series.
For example, you can predict sales for the M200 product in Europe or the M200 product in
North America. Here, the series name is extracted and stored in a table, making it easier to
group and filter the predictions later in Reporting Services:
1. Add a new Execute SQL task and name it Get Series Names.
2. Connect it to the previous task.
3. In the Execute SQL Task Editor, choose an OLE DB connection, and for Connection,
type <local server name>/DMReporting.
4. For Result set, select None.
5. For SQLSourceType, select Direct input.
6. For SQL Statement, type the following query text.
IF EXISTS
(SELECT [ModelRegion] FROM DMReporting.dbo.tmpModelRegions)
BEGIN
TRUNCATE TABLE DMReporting.dbo.tmpModelRegions
END
INSERT DMReporting.dbo.tmpModelRegions
SELECT DISTINCT [ModelRegion]
FROM AdventureWorksDW2008R2.dbo.vTimeSeries
Create Foreach Loop container (Predict Foreach Model)
In this Foreach Loop, you create two pairs of tasks: a prediction query plus data flow to
handle the results, one for Amount and one for Quantity.
You might ask, why generate the results for Amount and Quantity separately, when the
Prediction Query Builder allows you to predict both at once?
The reason is that the data mining query returns a nested rowset for each series you
predict, but the providers in Integration Services can work only with flattened rowsets. If you
predict both Amount and Quantity in one query, the rowset contains many nulls when
flattened. Rather than try to remove the nulls and sort out the results, it is easier to generate
a separate set of results and then combine them later in the Integration Services data flow.
1. Create a new Foreach Loop container and name it Predict Foreach Model. Connect it to
the task, Get Processed Models.
2. With the Foreach Loop container selected, open the Variables window and create a new
variable scoped to the task, as follows:
strModelName String
3. Return to the Foreach Loop Editor, and set the enumerator type to ForEach ADO
enumerator.
4. In the Enumerator configuration pane, set ADO object source variable to
User::objProcessedModels.
5. Set Enumeration mode to Rows in the first table.
6. In the Variables mapping pane, assign the variable User::strModelName to Index 0.
Each row of the single-column table returned by the query is fed into the variable
strModelName, which is in turn used to update the prediction query in the next set of
tasks.
Create variables for the Foreach Loop container
Much of the work in this package is done by the variable assignments. You create one set of
variables that store the text of the prediction queries, and then you insert the name of the
mining models by looking it up in another variable, strModelName. This illustrates a useful
technique in Integration Services: updating the contents of a variable by using an expression
as the variable definition.
1. With the Foreach Loop selected, create four new variables:
strQueryBaseAmount   String
strQueryBaseQty      String
strPredictAmt        String
strPredictQty        String
2. For the value of variable strQueryBaseAmount, type the following query.
SELECT FLATTENED 'ModelNameHere' as [Model Name],
[ModelNameHere].[Model Region] as [Model and Region],
(SELECT $TIME as NewTime,
Amount as NewValue,
PredictStDev([Amount]) as ValueStDev,
PredictVariance([Amount]) as ValueVariance
FROM PredictTimeSeries([ModelNameHere].[Amount],10) )
AS Predictions
FROM [ModelNameHere]
Important: The query here is formatted for readability, but the query will not work if you
copy and paste these statements into the variable as is. You must copy the statement
into a text editor first and remove all line breaks. Unfortunately the Integration Services
editors do not detect line breaks or raise any errors while you are editing the task, but
when you run the package, you will get an error. So be sure to remove the line breaks
first!
3. For the value of variable strQueryBaseQty, type the following query after removing the
line breaks.
SELECT FLATTENED 'ModelNameHere' as [Model Name],
[ModelNameHere].[Model Region] as [Model and Region],
(SELECT $TIME as NewTime,
Quantity as NewValue,
PredictStDev([Quantity]) as ValueStDev,
PredictVariance([Quantity]) as ValueVariance
FROM PredictTimeSeries([ModelNameHere].[Quantity],10) )
AS Predictions
FROM [ModelNameHere]
Notice the placeholder, ModelNameHere, in this procedure. This placeholder will be
replaced with a valid model name, which the package gets from the variable
strModelName.
The next steps explain how to create an expression that updates the query text each
time the loop is executed.
4. In the Variables window, select the variable strPredictQty, and then open or select the
Properties window to see the extended properties of the variable.
5. Locate Evaluate as Expression and set the value to True.
6. Locate Expression and type or paste in the following expression.
REPLACE( @[User::strQueryBaseQty], "ModelNameHere", @[User::strModelName] )
7. Repeat this process for the variable strPredictAmt, using the following expression.
REPLACE( @[User::strQueryBaseAmount], "ModelNameHere", @[User::strModelName] )
Create the Data Mining Query task (Predict Amt)
Now that you’ve configured the variables, most of the work is done. All you have to do is create
a pair of Data Mining Query tasks. Each task gets the updated query string out of the variable
that you just created, runs the query, and then saves the predictions to the specified output:
1. Drop a new Data Mining Query task into the Foreach Loop container. Name it Predict
Amt.
2. In the Data Mining Query Task Editor, on the Mining Model tab, for Connection,
choose the name of the Analysis Services instance that hosts your models. For
example: <servername>.ForecastingModels.
3. For Mining structure, choose Forecasting.
4. On the Output tab, for Connection, choose the instance of the database engine where
you will store the results. For example, <servername>.DMReporting.
5. For Output table, select or type the table name, tmpPredictionResults. Choose the
option Drop and re-create the output table. (Note: If this package has never been
run, you must type the name of the table, and the task will then create it. However, if you
are rerunning the package, the table already exists, so you must drop and then rebuild
it.)
6. On the Query tab, for Build query, you can paste in the base query temporarily. It will
be replaced with the contents of a variable. After you run the package once, you should
see the text of the base query.
7. With the Predict Amt task selected, open the Properties pane and locate the
Expressions property.
8. Expand the list of expressions and add the variable @[User::strPredictAmt] as the
value for the QueryString property. You can also select the value from a list by clicking
the Browse (...) button.
Create the second Data Mining Query task (Predict Qty)
Repeat the steps just described for a query that does the same thing, only with Quantity as
the predictable column:
1. Create a new Data Mining Query task named Predict Qty.
2. Repeat steps 2-6 from the previous procedure exactly as described.
3. With the Predict Qty task selected, open the Properties pane and locate the
Expressions property.
4. Expand the list of expressions and add the variable @[User::strPredictQty] as the
value for the QueryString property.
After you run the package once, the DMX statement contains blank brackets, like these.
SELECT FLATTENED '' as [Model Name],
[].[Model Region] as [Model and Region],
(SELECT $TIME as NewTime, Amount as NewValue,
PredictStDev([Amount]) as ValueStDev,
PredictVariance([Amount]) as ValueVariance
FROM PredictTimeSeries([].[Amount],10) )
AS Predictions
FROM []
These brackets will be populated by a variable that supplies the model name at run time.
To summarize all the variable activity at run time:
• The package gets a variable with a list of models.
• The loop gets the name of one model from the list.
• The loop gets a prediction query from a variable, inserts the model name, and writes out a new prediction query.
• The query task executes the updated prediction query.
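For example, after the loop reads the model name ARIMA_1-10-30 from the list, the Amount query that the task executes would read as follows (line breaks added here for readability only).

SELECT FLATTENED 'ARIMA_1-10-30' as [Model Name],
[ARIMA_1-10-30].[Model Region] as [Model and Region],
(SELECT $TIME as NewTime,
Amount as NewValue,
PredictStDev([Amount]) as ValueStDev,
PredictVariance([Amount]) as ValueVariance
FROM PredictTimeSeries([ARIMA_1-10-30].[Amount],10) )
AS Predictions
FROM [ARIMA_1-10-30]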
Note that these prediction queries all write their results to the same temporary table, which is
dropped and then rebuilt during each loop. Therefore, you need to add a Data Flow task in
between, which moves the results to the archive table and also adds some metadata.
Create Data Flow tasks to Archive the Results
Remember that you created separate predictions for the data values, Amount and Quantity, to
avoid dealing with a lot of nulls in the results. Next they are merged back together for reporting,
to make a table that looks like this.
Job ID  Time executed  Series       Time slice     Prediction type  Predicted value  StDev  Variance
012     2010-09-20     M200 Europe  January 2012   Sales Amount     283880           nnn    nnn
012     2010-09-20     M200 Europe  February 2012  Sales Amount     251225           nnn    nnn
012     2010-09-20     M200 Europe  January 2012   Sales Quantity   507              nnn    nnn
012     2010-09-20     M200 Europe  February 2012  Sales Quantity   758              nnn    nnn
You can also add any extra metadata that might be useful later, such as the date the predictions
were generated, a job ID, and so forth.
Let’s take another look at the DMX query statements used to generate the predictions.
SELECT $TIME as NewTime, Amount as NewValue, PredictStDev([Amount]) as
ValueStDev, PredictVariance([Amount]) as ValueVariance FROM PredictTimeSeries
Ordinarily, the column names generated in a prediction query are based on the predictable
column name, so the names would be something like PredictAmount and PredictQuantity.
However, you can use a column alias in the output (here, it is NewValue) to make it easier to
combine predicted values.
Again, because Integration Services is so flexible, there are lots of ways you might accomplish
this task:
• Store results in memory and merge them before writing to the archive table.
• Store the results in different columns, one for each prediction type.
• Write the results to temporary tables and merge them later.
• Use the Integration Services raw file format to quickly write out and then read the interim results.
However, in this scenario, you want to verify the prediction data that is generated by each
query. So you use the following approach:
• Write predictions to a temporary table.
• Use an OLE DB source component to get the predictions that were written to the temporary table.
• Use a Derived Column transformation to clean up the data and add some simple metadata.
• Save the results to the archive table that is used for reporting on all models.
The graphic illustrates the overall task flow within each Data Flow task.
Create Data Flow tasks (Archive Results Qty, Archive Results Amt)
1. Within the loop Predict Foreach Model, create two Data Flow tasks and name them
Archive Results Qty and Archive Results Amt.
2. Connect each Data Flow task to its related Data Mining Query task, in the order shown
in the earlier Control Flow diagram for Package 3.
Note: You must have these tasks in a sequence, because they use the same temporary
table and archive table. If Integration Services executes the tasks in parallel, the
processes could create conflicts when attempting to access the same table.
3. In each Data Flow task, add the following three components:
• An OLE DB source that reads from tmpPredictionResults
• A Derived Column transformation as defined in the following table
• An OLE DB destination that writes to the table ArchivedPredictions
4. Create expressions in each Derived Column transformation, to generate the data for the
new columns as follows.
Task name            Derived column name  Data type  Value
Archive Results Qty  PredictionDate       datetime   GETDATE()
Archive Results Qty  PredictedValue       string     Quantity
Archive Results Amt  PredictionDate       datetime   GETDATE()
Archive Results Amt  PredictedValue       string     Amount
Tip: Isolating the data flows for each prediction type has another advantage: it is much, much
easier to modify the package later. For example, you might decide that there is no good reason
for creating a separate prediction for quantity. Instead of editing your query or the output, you
can just disable that part of the package and it will still run without modification – you just won’t
have predictions for Quantity.
Run, Debug, and Audit Packages
That’s it – the packages contain all the tools you need to dynamically create and update multiple
related data mining models.
The packages for this scenario have been designed so that you can run them individually. We
recommend that you run each package on its own at least once, to get a feel for what each
package produces. Later you can add logging to the packages to track errors, or create a parent
task to connect them by adding an Execute Package task.
Phase 5 - Analyze and Report
Now that you have a set of predictions for multiple models, you are probably anxious to see the
trends, and to analyze differences.
Using the Data Mining Viewer
The quickest way to view individual models is by using the Data Mining viewers. The Microsoft
Time Series Viewer (http://technet.microsoft.com/en-us/library/ms175331.aspx) is particularly
handy because it combines the historical data with the predictions for each series, and it
displays error bars for the predictions.
However, some users are not comfortable with using Business Intelligence Development Studio.
Even if you use the Data Mining Add-ins for Microsoft Office, which provide a Microsoft Visio
viewer and a time series browser, the amount of detail in the time series viewer can be
overwhelming.
In contrast, analysts typically want even more detail, including statistics embedded in the model
content, together with the metadata you captured about the source and the model parameters.
It’s impossible to please everyone!
Fortunately Reporting Services lets you pick the data you want, add extra data sets and linked
reports, filter, and group, so you can create reports that meet the needs of each set of users.
Using Reporting Services for Data Mining Results
Our requirements for the basic report were as follows:
• All related models should be in one chart, for quick comparison.
• To simplify comparisons, we can present the predictions for Amount and Quantity separately.
• The results should be separable by model and region.
• We need to compare predictions for the same periods, for multiple models.
• Rather than working with multiple sources of data, we would like all data in a relational store.
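A dataset that meets these requirements can come straight from the archive table. The following query is a minimal sketch; it assumes the flattened column names produced by the prediction queries (Model Name, Model and Region, NewTime, NewValue, ValueStDev, ValueVariance) plus the PredictedValue and PredictionDate columns added in the data flow.

SELECT [Model Name], [Model and Region], NewTime, NewValue,
       ValueStDev, ValueVariance, PredictionDate
FROM dbo.ArchivedPredictions
WHERE PredictedValue = 'Amount' -- one report per prediction type
ORDER BY [Model Name], [Model and Region], NewTime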
Additional requirements might include:
• A chart showing historical values along with predictions.
• Statistics derived from comparison of prediction values.
• Metadata about each model in a linked report.
As the analyst, you might want even more detail:
• First and last dates used for training each model
• A list of the algorithm parameters and pattern formulas
• Descriptive statistics that summarize the variability and range of source data in each of the series, or across series
However, for the purposes of this walkthrough, there is already plenty of detail for comparing
models. You can always add this data later and then present it in linked reports.
The following graphic shows the Reporting Services report that compares the prediction results
for each model:
Notice that you can configure a report to show all kinds of information in ToolTips—in this
example, as you pause the mouse over a prediction, you see the standard deviation and
variance for the predictions.
The next graphic shows a series of charts that have been copied into a matrix. By using a
matrix, you can create a set of filtered charts. This series of graphs shows predictions for
Amount for all models.
(Matrix of charts: one chart per combination of product line M200, R750, or T1000 and region Europe, North America, or Pacific.)
Interpreting the Results
If you are familiar with time series modeling, a couple of trends begin to jump out at you just
from scanning these charts:
• There are some extreme series in ARIMA models – possibly the periodicity hint is to blame.
• Predictions in ARTXP models cut off at a certain point in many series. This is expected, because ARTXP detects the instability of a model and does not make predictions if they are not reliable.
• You would expect the MIXED models to generally perform better, because they combine the best characteristics of ARTXP and ARIMA. Indeed, they seem more reliable, though you would want to verify that.
The following trend lines are interesting, and they illustrate some problems you might see with
models. The results might indicate that the data is bad, there is inadequate data, or the data is
too variable to fit.
(Charts: R750 Europe (Amount), R750 Europe (Quantity), R250 North America (Amount), R250 North America (Quantity).)
When you see wildly varying trends from models on the same data, you should of course
re-examine the model parameters, but you might also use cross-prediction or aggregate your
data differently, to avoid being influenced too strongly by a single data series:

• With cross-prediction, you can build a reliable model from aggregated data or a series with solid data, and then make predictions based on that model for all series. ARTXP models and mixed models support cross-prediction.
• If you do not have enough data to meaningfully analyze each region or product line separately, you might get better results by aggregating by product or region or both, and creating predictions from the aggregate model.
Discussion
Data mining can be a labor-intensive process. From data acquisition and preparation to
modeling, testing, and exploration of the results, much effort is needed to ensure that the data
supports the intended analysis and that the output of the model is meaningful.
Some parts of the model-building process will always require human intervention –
understanding the results, for example, requires careful review by an expert who can assess
whether the numbers make sense.
However, by automating some part of the data mining process, Integration Services can not
only speed the process, but also potentially improve the results. For example, if you don’t know
which mixture of algorithms produces the best results, or what the possible time cycles are in
your data, you can use automation to experiment.
Moreover, there are benefits beyond simple time saving.
The Case for Ensemble Models
Automation supports the creation of ensemble models. Ensemble models (roughly speaking)
are models that are built on the same data, but use different methods of analysis.
Typically the results of multiple models are compared and/or combined, to yield results that are superior to those of any single model. Assessing multiple models for the same prediction task is now considered a best practice, even with the best data. Some reasons that are often cited for using ensemble models include the following (a query sketch for comparing results across models appears after the list):
• Avoiding overfitting. When a model is too closely patterned on a specific training set, you get great predictions on a test data set that matches the training data, and a lot of variability when you try the model out on real-world data.
• Variable selection. Each algorithm type processes data differently and delivers different insights. Rather than compare predictions, as we did here, you might use one algorithm to identify the most important variables or to prescreen for correlations that can mask other, more interesting patterns.
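Because this walkthrough saves every model's predictions to a relational table, such comparisons are easy to script. The following T-SQL is a sketch only; dbo.AllModelPredictions and its columns are assumptions that stand in for the results table created earlier in the walkthrough.

-- Sketch only: the table and column names are assumptions.
SELECT
    ModelRegion,
    PredictTime,
    MIN(PredictedAmount)   AS LowEstimate,       -- most pessimistic model
    MAX(PredictedAmount)   AS HighEstimate,      -- most optimistic model
    AVG(PredictedAmount)   AS MeanAcrossModels,  -- a naive averaging "ensemble"
    STDEV(PredictedAmount) AS ModelDisagreement  -- large values flag unstable series
FROM dbo.AllModelPredictions
GROUP BY ModelRegion, PredictTime
ORDER BY ModelRegion, PredictTime;

Where ModelDisagreement is small, the models corroborate one another; where it is large, the series deserves closer scrutiny.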
There has been much research in recent years on the best methods for combining the
estimates from ensemble models — merging, bagging, voting, averaging, weighting by posterior
evidence, gating, and so forth. A discussion of ensemble models is beyond the scope of this
paper, and you will note that we did not attempt to combine prediction results in this paper; we
only presented them for comparison. However, we encourage you to read the linked resources
to learn more about these techniques.
Closing the Loop: Interpreting and Getting Feedback on Models
Now that you have summarized the results in an easy-to-read report, what's next? Typically, you just think of more questions to answer!
• What about internal promotions or known events? Have we eliminated known correlations?
• Local and cultural events can significantly affect sales of particular products. Rather than expect the same seasonality in multiple regions, should we separate the regions for modeling?
• Should we choose a time series algorithm that can account for the effects of random or cyclical external events, or do we want to smooth the data to find overall trends?
• Can we compare these projection graphs with a projection done by the traditional business method of year-to-date comparisons?
• How do these predictions compare to the percentage increases targeted by the business?
Fortunately, because you have created an extensible framework for incorporating data mining into analysis by using Integration Services, it will be relatively easy to collect more data, update the models, and refine your presentation.
Conclusion
This paper introduced a framework for automation of data mining, with results saved to a
relational data store, to encourage a systematic approach to predictive analytics.
This walkthrough showed that it is relatively easy to set up Integration Services packages that
create data mining models and generate predictions from them. A framework like the one
demonstrated here could be extended to support further parameterization, encourage the use of
ensemble models, and incorporate data mining in other analytic workflows.
Resources
[1] Jamie MacLennan: Walkthrough of SQL Server 2005 Integration Services for data mining
http://www.sqlserverdatamining.com/ssdm/Default.aspx?tabid=96&Id=338
[2] Microsoft Research: Ensemble models
http://academic.research.microsoft.com/Paper/588724.aspx
[3] Reporting Services tutorials
http://msdn.microsoft.com/en-us/library/bb522859.aspx
[4] Michael Ohler: Assessing forecast accuracy
http://www.isixsigma.com/index.php?option=com_k2&view=item&id=1550:assessing-forecast-accuracy-be-prepared-rain-or-shine&Itemid=1&tmpl=component&print=1
[5] Statistical methods for assessing mining models
http://ms-olap.blogspot.com/2010/12/do-you-trust-your-data-mining-results.html
[6] John Maindonald: Data Mining from a Statistical Perspective
http://maths.anu.edu.au/~johnm/dm/dmpaper.html
Acknowledgements
I am indebted to my coworkers for their assistance and encouragement. Carla Sabotta
(technical writer, Integration Services) provided invaluable feedback on the steps in each of the
SSIS packages, ensuring that I didn’t leave out anything. Ranjeeta Nanda of the Integration
Services test team kindly reviewed the code in the Script task. Mary Lingel (technical writer,
Reporting Services) took my complex data source and developed a set of reports that made it
look simple.
Code for Script Task
The following code can be added to the Script task to change values in the XML model
definition. This very simple sample was written using VB.NET, but the Script task supports C#
as well.
A number of message boxes have been added to verify that the task was processing the XML
as expected. You would eventually comment these out. You would also want to add string
length checking and other validation to prevent DMX injection.
Public Sub Main()
    'Get the base XMLA definition and create a working copy for the output.
    'MessageBox requires Imports System.Windows.Forms at the top of ScriptMain.
    Dim strXMLABaseDef As String = Dts.Variables("strBaseXMLA").Value.ToString()
    Dim strXMLANewDef As String = strXMLABaseDef

    'Create local variables and fill them with the values from the SQL query.
    Dim txtModelID As String = Dts.Variables("strModelID").Value.ToString()
    Dim txtModelName As String = Dts.Variables("strModelName").Value.ToString()
    Dim txtForecastMethod As String = Dts.Variables("strForecastMethod").Value.ToString()
    Dim txtPeriodicityHint As String = Dts.Variables("strPeriodicityHint").Value.ToString()

    'First, update the base XMLA with the new model ID and model name:
    '  <ID>ForecastingDefault</ID>
    '  <Name>ForecastingDefault</Name>
    Dim txtNewID As String = "<ID>" & txtModelID & "</ID>"
    Dim txtNewName As String = "<Name>" & txtModelName & "</Name>"
    strXMLANewDef = strXMLANewDef.Replace("<ID>ForecastingDefault</ID>", txtNewID)
    strXMLANewDef = strXMLANewDef.Replace("<Name>ForecastingDefault</Name>", txtNewName)

    'Display the model names - for troubleshooting only.
    MessageBox.Show(strXMLANewDef, "Verify new model ID and name")

    'Create temporary variables for the replacement operations.
    Dim strParameterName As String = ""
    Dim strParameterValue As String = ""

    'Update the value for FORECAST_METHOD. Because all possible values
    '(MIXED, ARIMA, ARTXP) have exactly five characters, a simple replace works.
    strParameterName = "FORECAST_METHOD"
    strParameterValue = "MIXED" 'default value
    If strXMLABaseDef.Contains(strParameterValue) Then
        'Replace the default value MIXED with the value from the SQL Server query.
        strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtForecastMethod)
        'Display the forecast method - for troubleshooting only.
        MessageBox.Show(strXMLANewDef, "Check Forecast Method", MessageBoxButtons.OK)
    Else
        MessageBox.Show("The XMLA definition does not include the parameter " & _
            strParameterName & ".", "Problem with base XMLA", MessageBoxButtons.OK)
    End If

    'Look for a PERIODICITY_HINT value.
    strParameterName = "PERIODICITY_HINT"
    strParameterValue = "{1}" 'default value
    If strXMLABaseDef.Contains(strParameterName) Then
        'Replace the default value {1} with the value from the variable.
        strXMLANewDef = strXMLANewDef.Replace(strParameterValue, txtPeriodicityHint)
        MessageBox.Show(strXMLANewDef, "Check Periodicity Hint", MessageBoxButtons.OK)
    Else
        MessageBox.Show("The XMLA definition does not include the parameter " & _
            strParameterName & ".", "Problem with base XMLA", MessageBoxButtons.OK)
    End If

    'Save the completed definition to the package variable.
    Dts.Variables("strModelXMLA").Value = strXMLANewDef
    Dts.TaskResult = ScriptResults.Success
End Sub
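As a starting point for the validation mentioned earlier, checks like the following could be added near the top of Main, before the replacement operations. This is a sketch only; the list of allowed values reflects the documented FORECAST_METHOD settings, and the length cap and format check for the periodicity hint are arbitrary assumptions.

'Sketch only: validate inputs before using them in replacements.
Dim allowedMethods() As String = {"MIXED", "ARIMA", "ARTXP"}
If Array.IndexOf(allowedMethods, txtForecastMethod.ToUpper()) = -1 Then
    MessageBox.Show("Unexpected FORECAST_METHOD value: " & txtForecastMethod, _
        "Validation failed", MessageBoxButtons.OK)
    Dts.TaskResult = ScriptResults.Failure
    Return
End If
'Reject periodicity hints that are too long or not of the form {n, ...};
'the cap of 20 characters is an arbitrary assumption.
If txtPeriodicityHint.Length > 20 _
    OrElse Not txtPeriodicityHint.StartsWith("{") _
    OrElse Not txtPeriodicityHint.EndsWith("}") Then
    MessageBox.Show("Unexpected PERIODICITY_HINT value: " & txtPeriodicityHint, _
        "Validation failed", MessageBoxButtons.OK)
    Dts.TaskResult = ScriptResults.Failure
    Return
End If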
For more information:
http://www.microsoft.com/sqlserver/: SQL Server Web site
http://technet.microsoft.com/en-us/sqlserver/: SQL Server TechCenter
http://msdn.microsoft.com/en-us/sqlserver/: SQL Server DevCenter
Did this paper help you? Please give us your feedback. On a scale of 1 (poor) to 5 (excellent), how would you rate this paper, and why? For example:
• Are you rating it high because it has good examples, excellent screen shots, clear writing, or another reason?
• Are you rating it low because of poor examples, fuzzy screen shots, or unclear writing?
This feedback will help us improve the quality of white papers we release.
Send feedback.