CHANGE DATA CAPTURE FOR SPECIFIED INTERVAL PACKAGE SAMPLE

SQL Server Technical Article

Writer: Sandra Ward
Published: November 2008
Applies To: SQL Server 2008

Summary: These are the accompanying notes describing the Change Data Capture for Specified Interval Package Sample available on CodePlex. The sample demonstrates the use of CDC technology in support of SSIS incremental load packages.

Copyright

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2008 Microsoft Corporation. All rights reserved.

Microsoft is a registered trademark of Microsoft Corporation in the United States and/or other countries. The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Table of Contents

Getting Started
Sample Structure
  The Sample Environment Test Harness
  Package Architecture for Incremental Load
SetupCDCSample Package
  Initializing the Environment and Enabling Change Data Capture
    Execute SQL Task - Create Tables and Enable Change Data Capture
    Creating Capture Instances for the Source Tables
    Generating Wrapper TVFs for the Query Functions
  Generating the Sample Workload
    Script Tasks – Apply Inserts and Updates to CDC Enabled Tables
    Script Task – Mark Workload Completion
  Initializing the Target Tables
    Execute SQL Task – Create Database Snapshot
    Data Flow Tasks to Perform the Initial Load
  Setting Up the Periodic Requests for Change Data
    Execute SQL Task – Verify Capture Process is Started
    Execute SQL Task – Determine Datetime Base for Initial Extraction Interval
    Script Task – Log Capture Not Started Message
  Extracting and Processing Change Data
    For Loop Container - Cycle Master at 10 Second Intervals
  Validating Incremental Load and Reporting Completion Status
    Execute SQL Task – Check for Mismatch in Replicas
    VB Script Task – Output Run Completion Status
  SetupCDCSample Package Variables
MasterCDC Package
  The Change Data Capture Database Validity Interval
  The Change Data Capture Validity Interval for a Capture Instance
  A Closer Look at an SSIS Wrapper Function
  Master Package Configurations and Variables
    Package Configurations for Initializing Package Variables
    Master Package Variables
  Master Package Tasks
    Execute SQL Task – Check for Data
    Execute Package Tasks – Extract Change Data for CDC Enabled Tables
    Script Task – Log Extract Error
    Script Task – Delay
    Script Task – Log Extraction Complete
Child Packages for the Change Data Capture for Specified Interval Package Sample
  Child Package Variables
  Child Package Tasks
    Script Task - Generate SQL Data Query
    Data Flow Task – Process Change Data
  Error Logging in the Child Packages
Wrappers for CDC TVFs
  Instantiated Multi-Statement TVF Wrappers
  A Closer Look at an SSIS Wrapper Function
    Use of datetime Values in the Wrapper Signature
    Returning Column Information in the Wrapper Function
    Extracting Information from the CDC Update Mask
  Customizing Wrapper Functions for the First Extraction Interval
Running the Change Data Capture for Specified Interval Package Sample in BI Studio
Conclusion

Getting Started

This document goes through the Change Data Capture for Specified Interval Package Sample in detail, describing how the Change Data Capture feature in SQL Server 2008 can be used to support ETL from an SSIS package. Code for the Change Data Capture for Specified Interval Package Sample is available through CodePlex. The README supplied with the sample provides detailed information on installing the sample files locally, along with the requirements needed to run the sample. This accompanying document is available on MSDN.

The Change Data Capture for Specified Interval Package Sample makes use of the databases AdventureWorks2008 and AdventureWorksDW2008, both of which are available for download on CodePlex. Follow the instructions at the download site to ensure they are properly loaded in your environment.

NOTE: After loading AdventureWorks2008, execute sp_helpuser to determine whether the database user ‘dbo’ has an associated login. If the LoginName for the returned ‘dbo’ entry is NULL, the database user ‘dbo’ has been orphaned. Run the following command to associate the database user ‘dbo’ with the login ‘sa’:

exec sp_changedbowner 'sa'

The database user ‘dbo’ must be associated with a valid login in order for the Change Data Capture capture process to successfully harvest changes from the log and deposit them in the database change tables.

There are two package variables in the SetupCDCSample package that are key to successfully running the sample: SQLServerInstallPath and BasePath. SQLServerInstallPath, which defaults to c:\program files\Microsoft SQL Server\, identifies the install path for SQL Server on the local machine. If this is different in your environment, modify the package variable appropriately. BasePath, which defaults to @[SQLServerInstallPath] + “100\Samples\Integration Services\Package Samples\Change Data Capture for Specified Interval Package Sample\Change Data Capture Sample\”, identifies the standard install path of the sample relative to the SQL Server install path. If you have installed the sample elsewhere, set this path appropriately.

Sample Structure

The Change Data Capture feature of SQL Server 2008 allows the DML activity against database tables to be captured in change tables. The change data is made available to applications through table valued functions that query the change tables. This sample demonstrates the use of CDC technology from within SSIS incremental load packages to obtain source table changes to be applied to a Data Mart.

The Sample Environment Test Harness

A setup package is provided that both initializes the sample environment and provides a test harness for driving the data extraction process. The package begins by initializing the source tables to an initial state. It then configures the database for Change Data Capture, and creates a capture instance for each table to be tracked. The package then launches several tasks to generate DML activity against the source tables.
At the same time that DML is being applied to the source tables, a database snapshot is taken by the Setup package. The snapshot is then used to provide data for an initial load of the target tables within the Data Mart. Metadata maintained in the snapshot is used to identify the starting point for the initial extraction for the incremental load.

Once the initial load completes, the setup package enters a loop that launches the master package to harvest changes for 10 second intervals. The work of the master package and its associated child packages, which obtain change data for a specified interval and apply the changes to the target environment, represents the principal focus of the sample.

The loop logic continues to monitor the progress of the DML tasks through global variables. When the loop logic determines that the DML tasks have all completed and that the window of the next extraction interval has moved beyond the period of time when the workload was applied, the loop logic terminates. On completion, SQL CHECKSUM is used to compare the source tables to the replicas. Table differences, if present, are noted in an event log entry that is made to record the status of the completed run.

Package Architecture for Incremental Load

The core of the sample consists of four packages: one master package and three child packages. The master package obtains the extraction interval for the incremental load from the setup package, and then verifies that the interval lies within the current Change Data Capture validity interval for the database. If the low end-point of the extraction interval is earlier in time than the minimum commit time for change table entries, the master package logs an error to the application event log and terminates. This is an indication that change table cleanup has been too aggressive, and the needed change data is no longer available in the change tables. If the high end-point of the extraction interval is later in time than the maximum commit time associated with a change table entry, the capture process has not completed the population of the change tables for the interval. In this case, the master package delays, giving the capture process the opportunity to catch up before checking again. The delay loop will continue to execute until the capture process has caught up, or until an iteration limit is reached and the package terminates, depositing an error message in the event log.

If both endpoints of the extraction interval are in range, the master package launches the child packages to retrieve the change data and update the Data Mart. The master package waits for notification from all packages before writing an informational message to the application event log at the conclusion of the extraction cycle. The log entry identifies the start and end times of the extraction interval as well as other package variables.

SetupCDCSample Package

The package SetupCDCSample.dtsx sets up the test environment for the sample. It uses AdventureWorks2008 as its source database, and AdventureWorksDW2008 as the target. The database snapshot AdventureWorks2008_dbss is created for AdventureWorks2008 to provide a consistent view of the source tables for the initial load. The diagram below shows the control flow for the setup package.
Figure 1: SetupCDCSample Package

Initializing the Environment and Enabling Change Data Capture

Execute SQL Task - Create Tables and Enable Change Data Capture

The SetupCDCSample package begins with a single SSIS Execute SQL Task. Its purpose is to run the T-SQL script CDCSetupTables.sql to set up the environment for Change Data Capture.

The script begins by generally cleaning up the sample environment, removing any objects created in previous runs. This makes it straightforward to rerun the sample and still allow the created objects to endure at the end of the run. The script then enables Change Data Capture for the database AdventureWorks2008. The Change Data Capture feature of SQL Server 2008, which allows the DML activity against database tables to be captured in change tables, must initially be enabled at the database level by a member of the fixed server role sysadmin. You can use the following T-SQL query to determine whether Change Data Capture is already enabled for a database:

SELECT is_cdc_enabled from sys.databases
WHERE name = 'AdventureWorks2008' AND is_cdc_enabled = 1

Within the sample, this query is used to first determine whether Change Data Capture has already been enabled for the database. If it has, the following stored procedure is run to disable it.

exec sys.sp_cdc_disable_db

Disabling Change Data Capture at the database level will clean up all of the Change Data Capture metadata for the database, including the metadata associated with capture instances that have previously been created for tracked tables. This allows each execution of the sample to run from a clean test environment. The following stored procedure is then executed to enable Change Data Capture for AdventureWorks2008:

exec sys.sp_cdc_enable_db

The script then removes preexisting tables and functions from the schema CDCSample in AdventureWorks2008. The schema CDCSample is used for the new tables that will be created to serve as the source tables for the SSIS packages used to extract change data. The following three tables are created in the CDCSample schema of AdventureWorks2008, each mirroring an existing AdventureWorks2008 table:

CDCSample.Customer
CDCSample.CreditCard
CDCSample.WorkOrder

Three multi-statement table valued functions are generated and instantiated at this time: CDCSample.fn_net_changes_Customer, CDCSample.fn_net_changes_CreditCard and CDCSample.fn_net_changes_WorkOrder. Three additional custom table valued functions are also created to handle the initial extraction interval when synchronizing to the initial load. Both sets of functions serve as wrapper functions for the CDC generated functions used to query for change data. They will be discussed in detail when the SSIS packages for extracting change data are examined.

A portion of the table data in the original AdventureWorks2008 tables is then used to initialize the source tables in the CDCSample schema. The remaining data will be used later to generate a dynamic workload when tracking is enabled. With the source tables initialized, the script now creates three destination tables in the database AdventureWorksDW2008, one for each source table. For the purposes of the sample, the column structure of the destination tables is identical to that of the source tables.
The function of the sample is to simply apply the changes associated with the source to the destination to allow the destination to reflect changes to the source in a timely fashion.

Creating Capture Instances for the Source Tables

Once the source tables are initialized, the setup script creates a capture instance for each of the source tables. While the database itself must be enabled for Change Data Capture by a member of the sysadmin server role, capture instances for individual tables can be created by members of the db_owner database role. The stored procedure sys.sp_cdc_enable_table is used to create a capture instance for a source table. The following calls are made within the setup script to create capture instances for the three source tables.

exec sys.sp_cdc_enable_table @source_schema = 'CDCSample', @source_name = 'Customer',
    @capture_instance = 'Customer', @supports_net_changes = 1, @role_name = null
exec sys.sp_cdc_enable_table @source_schema = 'CDCSample', @source_name = 'CreditCard',
    @capture_instance = 'CreditCard', @supports_net_changes = 1, @role_name = null
exec sys.sp_cdc_enable_table @source_schema = 'CDCSample', @source_name = 'WorkOrder',
    @capture_instance = 'WorkOrder', @supports_net_changes = 1, @role_name = null

The @source_schema and @source_name parameters are the schema and table name of the source table to be tracked. The @capture_instance parameter is the name chosen for the associated capture instance. Any name can be chosen, but within a given database, the capture instance name must be unique. Since it is used to identify the change data associated with a given source table, it usually makes sense to name the capture instance in a manner that provides cues to its associated source table. If not specified, it will default to the schema name followed by the table name, separated by an underscore. In the sample, the table name has been used as the capture instance name.

The parameter @supports_net_changes is used to indicate that functions to query for net changes should be generated for the capture instance. The source table must have a primary key or a defined unique index if this parameter is set to 1. Here, the term ‘net changes’ is used in a very specific way. The function returning net changes will only return a single row for each changed row for a given query window, representing the final state of the row at the end of the interval. The operation returned with the row will be the one needed to correctly apply the row to the destination. In contrast, when a query for all changes is requested, a row is returned in the result set for each committed change to a row of the table.

The parameter @role_name = null is used to indicate that no gating role is being used to restrict access to change data. In order to have access to change data, the requestor must have select access to the captured columns of the source table. In addition, if the caller is not sysadmin or db_owner and a gating role has been defined, the caller must also be a member of the gating role. The @role_name parameter must always be supplied. If, however, it is explicitly set to null when the capture instance is created, no gating role is used and select access alone is sufficient to gain access to the change data.

Generating Wrapper TVFs for the Query Functions

At the time Change Data Capture is enabled for a source table, the single statement Table Valued Functions (TVFs) needed to query the change table for change data are automatically generated.
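For illustration only, a direct call to the generated net changes function for the Customer capture instance created above has the following general shape. The sketch below simply asks for all net changes currently available for the instance; sys.fn_cdc_get_min_lsn and sys.fn_cdc_get_max_lsn are the documented CDC helper functions for retrieving LSN end-points.

declare @from_lsn binary(10), @to_lsn binary(10)
-- low end-point of the validity interval for the Customer capture instance
select @from_lsn = sys.fn_cdc_get_min_lsn(N'Customer')
-- high end-point of the validity interval for the database
select @to_lsn = sys.fn_cdc_get_max_lsn()
-- query the generated net changes function directly, using LSN end-points
select * from cdc.fn_cdc_get_net_changes_Customer(@from_lsn, @to_lsn, N'all')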
Functions to query for all changes are always generated, while those used to extract net changes are generated only if the @supports_net_changes parameter is set to 1. The principal drawback of the generated TVFs with respect to SSIS developers is that the query interval used to identify the range for which change data is needed is Log Sequence Number (LSN) based rather than time based. Change Data Capture does, however, provide a stored procedure that will script wrapper functions for the generated TVFs that use time based rather than LSN based parameters as interval boundaries. The sample setup script makes use of this capability and instantiates wrapper functions for each of the defined net changes functions after the capture instances are created.

The following simple stored procedure is created to generate the wrappers. The call to sys.sp_cdc_generate_wrapper_function when no parameters are specified will return a result set that includes scripts to generate wrappers for all of the capture instances that the caller is authorized to access. The instantiation loop then selects from among those wrappers only those that wrap the net changes queries.

create procedure [CDCSample].[generate_wrappers]
as
begin
    declare @wrapper_functions table
        (function_name sysname, create_stmt nvarchar(max))

    insert into @wrapper_functions
    exec sys.sp_cdc_generate_wrapper_function

    declare @stmt nvarchar(max)

    declare #hfunctions cursor local fast_forward for
        select create_stmt from @wrapper_functions
        where function_name like 'fn_net_changes%'

    open #hfunctions
    fetch #hfunctions into @stmt
    while (@@fetch_status <> -1)
    begin
        exec sp_executesql @stmt
        fetch #hfunctions into @stmt
    end
    close #hfunctions
    deallocate #hfunctions
end

The above procedure is then called after the capture instances have been created.

exec CDCSample.generate_wrappers

Generating the Sample Workload

Script Tasks – Apply Inserts and Updates to CDC Enabled Tables

Once the environment has been enabled for Change Data Capture, three Execute SQL Tasks are launched in parallel to generate DML activity against the source tables. After each task completes, a second task is launched to generate additional activity. In total, there are six tasks that generate load: one applying inserts and one applying updates, for each of three source tables. The update load only targets table rows populated as part of the initial load. A seeded random number generator is used so that the results are reproducible. The insert load is generated from the portion of the table rows not included within the initial load. By periodically enforcing a 10 second delay between batches of inserts and updates, the load is spread across several minutes. This enforced delay makes it easier to demonstrate techniques for systematically moving the query window when data needs to be extracted periodically. Whereas the period for an actual ETL system is most typically 24 hours, the period used in this sample is 20 seconds. The principle used to walk the change tracking timeline, however, is the same irrespective of the size of the extraction interval.

Structurally, all of the Execute SQL Tasks that are used to generate workload are identical. They take no parameters and do not generate result sets. Each runs a SQL script to apply changes to CDC source tables. The table below shows the common SQL Statement attributes of the workload tasks.
SQL Statement Attribute      Value
ConnectionType               OLE DB
Connection                   AdventureWorks2008
SQLSourceType                File connection

Below, the script files for generating workload are paired with their corresponding tasks. In each case the SQLStatementSource is a file connection whose ConnectionString is set through an expression.

Execute SQL Task for Workload    ConnectionString Expression
Insert Customer Table            @[ScriptPath] + 'CDCCustomerInsert.sql'
Modify Customer Table            @[ScriptPath] + 'CDCCustomerModify.sql'
Insert CreditCard Table          @[ScriptPath] + 'CDCCreditCardInsert.sql'
Modify CreditCard Table          @[ScriptPath] + 'CDCCreditCardModify.sql'
Insert WorkOrder Table           @[ScriptPath] + 'CDCWorkOrderInsert.sql'
Modify WorkOrder Table           @[ScriptPath] + 'CDCWorkOrderModify.sql'

Script Task – Mark Workload Completion

The Script Task, Mark Workload Completion, runs after all the DML workload has been applied to the source tables. It is used to set package variables flagging the completion of the workload tasks and noting the completion time. The loop that periodically launches the master package to harvest change data monitors these package variables and terminates sample execution after the entire workload has been applied to the target environment.

The EntryPoint property, which defines the first method that executes when the script task runs, is MarkWorkloadCompletion. The ReadWriteVariables property allows the following package variables to be set by the script: User::WorkloadCompleted, User::WorkloadEndTime.

The VB script code to support setting these package variables is shown below.

Public Sub MarkWorkloadCompletion()
    Dim varWorkloadEndTime As Variable = Dts.Variables("User::WorkloadEndTime")
    Dim varWorkloadCompleted As Variable = Dts.Variables("User::WorkloadCompleted")

    Dts.VariableDispenser.LockForWrite("User::WorkloadEndTime")
    Dts.VariableDispenser.LockForWrite("User::WorkloadCompleted")

    varWorkloadEndTime.Value() = Now
    varWorkloadCompleted.Value() = 1

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

Initializing the Target Tables

One of the principal issues to address when applying change data to a target environment is the synchronization of the stream of change data to the initial load. Simply put, the initial load of the target reflects a snapshot of the source at some point in time. The challenge is to determine the last change represented within the snapshot in order to have a basis for determining the first incremental change that needs to be applied.

For Change Data Capture, the recommended strategy for synchronization makes use of a database snapshot created after the source tables are enabled. Use of the snapshot as the source for the initial load ensures the cross table consistency of the tracked tables. More importantly, however, the metadata maintained for the snapshot allows you to precisely determine the LSN to use when you apply the first incremental load. The sample treats the first extraction interval differently than the others, using an LSN value for the low end-point and a datetime value as the high end-point. This is needed to ensure that no changes are missed or repeated as a result of the synchronization process. A custom wrapper function is created to deal with this special interval.

Execute SQL Task – Create Database Snapshot

As described above, a database snapshot is created to be used as the source for the initial load of the target tables.
The snapshot is created concurrently with the execution of the workload tasks to demonstrate that the content of the target tables subsequent to the initial load is not explicitly predetermined. Snapshot metadata will be used to determine the appropriate starting point for the incremental loads that follow. The package variable User::CreateSnapshot contains the statement used to create the database snapshot. The content of the variable is derived from an expression that references the package variable User::BasePath to construct the path where the snapshot file is to be located.

"CREATE DATABASE AdventureWorks2008_dbss ON
(NAME = AdventureWorks2008_Data,
FILENAME = '" + @[BasePath] + "AdventureWorks2008_data.ss')
AS SNAPSHOT OF AdventureWorks2008;"

The SQL statement has no input parameters and no result set is returned.

Data Flow Tasks to Perform the Initial Load

Once the snapshot is created, it can be queried concurrently to obtain a consistent initial load for the target tables. The sample makes use of data flow tasks to perform the initial load, but any technology that uses the snapshot as the source for the data can be used. Use of the snapshot as the source guarantees the cross table consistency of the target tables that are loaded.

The three Data Flow Tasks used to perform the initial load of the target tables are all structured identically. What differs among the three tasks are the names of the source and destination tables that provide the end points of the data flow. Each consists of an OLE DB source referencing the database snapshot, and an OLE DB destination that references AdventureWorksDW2008, the target database. All of the columns of the source table are output to the data flow. In general, column names associated with the source are mapped to identical names in the destination. Columns defined as computed columns, however, are ignored.

The table below shows the common attributes of the OLE DB source and destination Data Flow components:

OLE DB Source Attribute         Value
OLE DB connection manager       AdventureWorks2008_dbss
Data access mode                Table or view

OLE DB Destination Attribute    Value
OLE DB connection manager       AdventureWorksDW2008
Data access mode                Table or view

Below is the association between individual Data Flow Tasks and the source and destination tables.

Data Flow Task                              Source/Destination Table
Load Target Customer Table From Snapshot    CDCSample.Customer
Load CreditCard Table From Snapshot         CDCSample.CreditCard
Load WorkOrder Table From Snapshot          CDCSample.WorkOrder

Setting Up the Periodic Requests for Change Data

Once the initial load of the target tables has completed, the sample verifies that the capture process responsible for harvesting changes from the transaction log and depositing them in the change tables is active. If there are no entries in cdc.lsn_time_mapping, this is an indication that the capture process did not auto-start and that SQL Agent is not running. If an active capture process is not detected, the sample will log an error in the event log and terminate. If the presence of an active capture process is verified, metadata from the database snapshot is used to determine the LSN that will anchor the initial incremental load. A base time for computing the high end-point for the initial load is also determined at this time.
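As an aside, if the capture job failed to auto-start, it can be started manually once SQL Agent is running. The statements below are a minimal sketch using the documented sys.sp_cdc_start_job procedure; the sample itself does not attempt this remediation and simply reports the condition and terminates.

use AdventureWorks2008
go
-- start the Change Data Capture capture job for the database (SQL Agent must be running)
exec sys.sp_cdc_start_job @job_type = N'capture'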
Execute SQL Task – Verify Capture Process is Started

If there are no entries in cdc.lsn_time_mapping, this is an indication that the capture process did not auto-start when Change Data Capture was enabled. This typically occurs because SQL Agent is not running. This condition is detected here so that an event log message can be posted to indicate the failure to auto-start and the run can be terminated. The package variable User::CaptureStarted is used to report this condition. The following SQL Statement is defined as Direct input:

declare @start_time datetime, @capturestarted bit
select @start_time = min(tran_end_time) from cdc.lsn_time_mapping
if @start_time is null
begin
    select @capturestarted = 0
end
else
begin
    select @capturestarted = 1
end
select @capturestarted as CaptureStarted

The SQL statement has no input parameters. The result set is defined as a single row with the column CaptureStarted:

Result Name       Variable Name
CaptureStarted    User::CaptureStarted

If there are no entries in the cdc.lsn_time_mapping table, the script task Log Capture Not Started Message is called to log an informational message indicating that the capture process did not auto-start; the capture job requires SQL Agent to be running.

Execute SQL Task – Determine Datetime Base for Initial Extraction Interval

After the initial load of the target tables, the sample uses an Execute SQL task to determine the end-points for the first extraction interval. Subsequent extraction intervals will be based off the initial interval, advancing at ten second intervals. For the initial extraction interval, the lower bound is expressed as an LSN value to ensure that no incremental changes are lost between the snapshot and the first incremental load. This LSN value is determined by using the stored procedure sys.sp_cdc_dbsnapshotLSN to retrieve the last commit LSN from the database snapshot. This value is saved as the package variable LastLSN.

Once the lower boundary expressed as an LSN value is determined, the sample uses information in the cdc.lsn_time_mapping table to assign an approximate datetime value to this LSN. It is this approximate datetime value that is used to compute the high end-point for the initial extraction interval. The script has a built-in delay to wait until the capture process has processed all LSN values through the determined LastLSN value. After this is ensured, the minimum tran_end_time of the entries in cdc.lsn_time_mapping with start_lsn value greater than LastLSN is then determined. This value is saved in the package variable ExtractEndTime. Later, in the loop that cycles the master package every 10 seconds, this value will be used as the base from which the high end-point of the first extraction interval is determined.
The following SQL Statement is defined as Direct input:

declare @command nvarchar(max), @database_name nvarchar(1000)
    ,@lastLSN binary(10), @max_lsn binary(10), @start_time datetime
    ,@lastLSNstr nvarchar(42)
exec sys.sp_cdc_dbsnapshotLSN 'AdventureWorks2008_dbss'
    ,@lastLSN output, @lastLSNstr output
select @max_lsn = sys.fn_cdc_get_max_lsn()
while (@lastLSN >= @max_lsn)
begin
    waitfor delay '00:00:10'
    select @max_lsn = sys.fn_cdc_get_max_lsn()
end
select @start_time = min(tran_end_time) from cdc.lsn_time_mapping
    where start_lsn > @lastLSN
select @start_time = convert(nvarchar(40), @start_time, 20)
select @start_time as ExtractStartTime, @start_time as ExtractEndTime,
    @lastLSNstr as LastLSN

The SQL statement has no input parameters. The result set is defined as a single row with the columns ExtractEndTime and LastLSN.

Result Name       Variable Name
ExtractEndTime    User::ExtractEndTime
LastLSN           User::LastLSN

Script Task – Log Capture Not Started Message

The Script Task, Log Capture Not Started Message, is used to log an informational message to the Windows event log to indicate that the capture process did not auto-start. The precedence constraint that defines the workflow between the Execute SQL Task Determine Extraction Start Time and this task is the following: the Completion outcome combined with the expression @CaptureStarted == false, which must evaluate to TRUE.

The EntryPoint property, which defines the first method that executes when the script task runs, is LogCaptureProcessNotStartedMessage. The ReadOnlyVariables property setting makes the following package variables available to the script: System::PackageName, System::StartTime, User::CaptureStarted.

The VB script code to support writing to the event log is shown below.

Public Sub LogCaptureProcessNotStartedMessage()
    Dim varPackageName As Variable = Dts.Variables("System::PackageName")
    Dim varStartTime As Variable = Dts.Variables("System::StartTime")
    Dim varCaptureStarted As Variable = Dts.Variables("User::CaptureStarted")

    Dim sLog As String
    Dim sEventMessage As String
    Dim sMachine As String
    Dim sSource As String

    Dts.VariableDispenser.LockForRead("System::PackageName")
    Dts.VariableDispenser.LockForRead("System::StartTime")
    Dts.VariableDispenser.LockForRead("User::CaptureStarted")

    sLog = "Application"
    sSource = varPackageName.Value().ToString
    sEventMessage = "The CDC Capture Process was not started. " _
        & "Make certain that SQL Agent is running." _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "The Package: " + varPackageName.Value().ToString _
        & Chr(10) _
        & "Started: " & varStartTime.Value().ToString _
        & Chr(10) _
        & "Current Time:" & System.DateTime.Now _
        & Chr(10) _
        & "=============================================" _
        & Chr(10) _
        & "Capture Started: " & varCaptureStarted.Value().ToString
    sMachine = "."
    Dim ELog As New EventLog(sLog, sMachine, sSource)
    ELog.WriteEntry(sEventMessage, EventLogEntryType.Information)

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

The event log message posted will be similar to that shown in Figure 2:

Figure 2: Event Log Message Signaling Capture Process Not Started

Extracting and Processing Change Data

After the boundaries of the first incremental load have been determined, the setup package enters a loop which invokes the master package to harvest change data at 10 second intervals until all the changes from the generated workload have been applied to the target.

For Loop Container - Cycle Master at 10 Second Intervals

The Loop Container logic itself is designed to execute the container tasks until all of the changes in the generated workload have been extracted and applied to the target, and then terminate. Two conditions need to be satisfied in order for the loop to terminate. First, the workload generation tasks all need to have completed. This is determined by checking the package variable WorkloadCompleted, which is set by the Mark Workload Completion task that runs after all of the load generation tasks run. Second, all of the changes associated with the workload must have been applied to the target tables. This is determined by checking the proposed start time for the next extraction interval. If this value is greater than the time when the Workload Completion Task ran, then we are assured that the extraction intervals already processed fully cover the time when changes were applied to the source tables and the run can terminate.

The loop container also maintains a package variable that is used as an interval counter. It initializes the variable to 0 and then increments it by one during each iteration. Since the interval counter is a package variable, it can be passed to the master package launched from within the loop.

For Loop Properties

Property Name       Property Value
InitExpression      @IntervalID = 0
EvalExpression      !((@WorkloadCompleted == 1) && (@ExtractStartTime > @WorkloadEndTime))
AssignExpression    @IntervalID = @IntervalID + 1

Within the for loop container are two tasks: a Script Task that computes the new end-points for the extraction interval and an Execute Package Task that invokes the master package driving the extraction cycle.

Script Task - Set Extract Interval

The Script Task Set Extract Interval is used to determine the end-points for the next query window. It uses the high end-point of the previous window as the low end-point of the current window. It then adds 10 seconds to the new low end-point to obtain the new high end-point. The computed end-points are deposited in the package variables User::ExtractStartTime and User::ExtractEndTime.

The EntryPoint property, which defines the first method that executes when the script task runs, is SetExtractInterval. The ReadWriteVariables property setting makes the following package variables available to the script: User::ExtractStartTime, User::ExtractEndTime.

The VB script code to support computing the query interval end-points is shown below.
Public Sub SetExtractInterval()
    Dim varStartTime As Variable = Dts.Variables("User::ExtractStartTime")
    Dim varEndTime As Variable = Dts.Variables("User::ExtractEndTime")

    Dts.VariableDispenser.LockForWrite("User::ExtractStartTime")
    Dts.VariableDispenser.LockForWrite("User::ExtractEndTime")

    varStartTime.Value() = varEndTime.Value()
    varEndTime.Value() = DateAdd(DateInterval.Second, 10, varStartTime.Value())

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

Execute Package Task – Run Master to Extract Data

The Execute Package Task, Run Master to Extract Data, functions as the test harness for the ETL extraction cycle. Its sole function is to launch the master package that will in turn launch the individual extraction packages for the individual tables.

Validating Incremental Load and Reporting Completion Status

Once the generated workload has been applied to the target environment, the state of the replicas should match that of the source tables. The SQL Script Task Check for Mismatch in Replicas uses SQL CHECKSUM to verify that the contents of the source and target tables match, setting package variables to record the status of the run. Finally, the VB Script Task Output Run Completion Status is launched to output the run status to the event log.

Execute SQL Task – Check for Mismatch in Replicas

The Execute SQL Task Check for Mismatch in Replicas is used to determine whether the replicas created match the source tables. CHECKSUM is used to compare the table contents and package variables are set for each of the tracked tables to indicate whether differences were detected. The following SQL Statement is defined as Direct input:

declare @CustomerMismatch int, @CreditCardMismatch int, @WorkOrderMismatch int,
    @Checksum bigint, @ChecksumDW bigint
select @CustomerMismatch = 0, @CreditCardMismatch = 0, @WorkOrderMismatch = 0,
    @Checksum = 0, @ChecksumDW = 0

select @Checksum = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorks2008.CDCSample.Customer
select @ChecksumDW = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorksDW2008.CDCSample.Customer
if (@Checksum <> @ChecksumDW)
begin
    set @CustomerMismatch = 1
end

select @Checksum = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorks2008.CDCSample.CreditCard
select @ChecksumDW = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorksDW2008.CDCSample.CreditCard
if (@Checksum <> @ChecksumDW)
begin
    set @CreditCardMismatch = 1
end

select @Checksum = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorks2008.CDCSample.WorkOrder
select @ChecksumDW = CHECKSUM_AGG(CHECKSUM(*)) from AdventureWorksDW2008.CDCSample.WorkOrder
if (@Checksum <> @ChecksumDW)
begin
    set @WorkOrderMismatch = 1
end

select @CustomerMismatch as CustomerMismatch, @CreditCardMismatch as CreditCardMismatch,
    @WorkOrderMismatch as WorkOrderMismatch

The SQL statement has no input parameters. The result set is defined as a single row with the columns CustomerMismatch, CreditCardMismatch, and WorkOrderMismatch.

Result Name           Variable Name
CustomerMismatch      User::CustomerMismatch
CreditCardMismatch    User::CreditCardMismatch
WorkOrderMismatch     User::WorkOrderMismatch

VB Script Task – Output Run Completion Status

The VB Script Task Output Run Completion Status writes a status entry to the Event Log indicating the result of the comparisons made between the source tables and the target tables that were updated using Change Data Capture. The EntryPoint property, which defines the first method that executes when the script task runs, is OutputRunCompletionStatus.
The ReadOnlyVariables property setting makes the following package variables available to the script: System::PackageName, System::StartTime, User::CreditCardMismatch, User::CustomerMismatch, User::WorkOrderMismatch.

The VB script code to log completion status is shown below. Note that a comparison failure for any of the tables will cause the task to complete with a failure status.

Public Sub OutputRunCompletionStatus()
    Dim varPackageName As Variable = Dts.Variables("System::PackageName")
    Dim varStartTime As Variable = Dts.Variables("System::StartTime")
    Dim varCustomerMismatch As Variable = Dts.Variables("User::CustomerMismatch")
    Dim varCreditCardMismatch As Variable = Dts.Variables("User::CreditCardMismatch")
    Dim varWorkOrderMismatch As Variable = Dts.Variables("User::WorkOrderMismatch")

    Dim sLog As String
    Dim sEventMessage As String
    Dim sMachine As String
    Dim sSource As String
    Dim sCustomer As String
    Dim sCreditCard As String
    Dim sWorkOrder As String

    sCustomer = "Customer replica is identical."
    sCreditCard = "CreditCard replica is identical."
    sWorkOrder = "WorkOrder replica is identical."

    Dts.VariableDispenser.LockForRead("System::PackageName")
    Dts.VariableDispenser.LockForRead("System::StartTime")
    Dts.VariableDispenser.LockForRead("User::CustomerMismatch")
    Dts.VariableDispenser.LockForRead("User::CreditCardMismatch")
    Dts.VariableDispenser.LockForRead("User::WorkOrderMismatch")

    Dts.TaskResult = ScriptResults.Success

    If varCustomerMismatch.Value = 1 Then
        sCustomer = "Customer replica does not match source."
        Dts.TaskResult = ScriptResults.Failure
    End If
    If varCreditCardMismatch.Value = 1 Then
        sCreditCard = "CreditCard replica does not match source."
        Dts.TaskResult = ScriptResults.Failure
    End If
    If varWorkOrderMismatch.Value = 1 Then
        sWorkOrder = "WorkOrder replica does not match source."
        Dts.TaskResult = ScriptResults.Failure
    End If

    sLog = "Application"
    sSource = varPackageName.Value().ToString
    sEventMessage = "CDC SSIS Sample Completion Status" & Chr(10) _
        & "=============================================" & Chr(10) _
        & "The Package: " + varPackageName.Value().ToString _
        & Chr(10) _
        & "Started: " & varStartTime.Value().ToString _
        & Chr(10) _
        & "Current Time:" & System.DateTime.Now _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "Customer table: " & sCustomer _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "CreditCard table: " & sCreditCard _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "WorkOrder table: " & sWorkOrder _
        & Chr(10) _
        & "============================================="
    sMachine = "."
    Dim ELog As New EventLog(sLog, sMachine, sSource)
    ELog.WriteEntry(sEventMessage, EventLogEntryType.Information)

    Dts.Variables.Unlock()
End Sub

SetupCDCSample Package Variables

The following package variables are defined for the setup package:

BasePath (String)
  Install path for Change Data Capture for Specified Interval Package Sample packages.
  Default: @[SQLServerInstallPath] + "100\\Samples\\Integration Services\\Package Samples\\Change Data Capture for Specified Interval Package Sample\\Change Data Capture Sample\\"

ScriptPath (String)
  Install path for Change Data Capture for Specified Interval Package Sample scripts.
  Default: @[BasePath] + "Scripts\\"

SQLServerInstallPath (String)
  Install path for SQL Server.
  Default: c:\program files\Microsoft SQL Server\

CaptureStarted (Boolean)
  Flag indicating the capture process has started.
  Default: False

CreateSnapshot (String)
  SQL statement to create the database snapshot.
  Default: "CREATE DATABASE AdventureWorks2008_dbss ON ( NAME = AdventureWorks2008_Data, FILENAME = '" + @[BasePath] + "AdventureWorks2008_data.ss' ) AS SNAPSHOT OF AdventureWorks2008;"

ExtractEndTime (datetime)
  End time of the next extraction.
  Default: 5/6/2007 8:54 AM

ExtractStartTime (datetime)
  Start time of the next extraction.
  Default: 5/6/2007 8:54 AM

IntervalID (Int32)
  Extraction interval identifier.
  Default: 0

LastLSN (String)
  LSN anchor for the first extraction.
  Default: 0x00000000000000000000

WorkloadCompleted (Int32)
  Flag indicating that all of the workload generation tasks have completed: 0 while the workload is still being applied; 1 once the workload is complete.
  Default: 0

WorkloadEndTime (datetime)
  Time of workload completion.
  Default: 2/21/2008 12:31 PM

CustomerMismatch (Int32)
  0 if the Customer replica is identical at the end of the run; 1 if validation detected a mismatch.
  Default: 0

CreditCardMismatch (Int32)
  0 if the CreditCard replica is identical at the end of the run; 1 if validation detected a mismatch.
  Default: 0

WorkOrderMismatch (Int32)
  0 if the WorkOrder replica is identical at the end of the run; 1 if validation detected a mismatch.
  Default: 0

MasterCDC Package

The package MasterCDC is responsible for launching all of the individual packages used to extract change data for the SQL Server source tables and apply the changes to the Data Mart. The principal task of the master package is to verify that the extraction interval passed from the parent lies within the current Change Data Capture database validity interval.

The Change Data Capture Database Validity Interval

The Change Data Capture validity interval for a database is simply the time interval for which change data is currently available for its capture instances. In principle, it begins when the first capture instance is created for a database table, and extends forward in time to the present. In practice, the validity interval is a moving window just as the extraction interval is a moving window. Change data deposited in change tables would grow unmanageably if it weren't periodically and systematically pruned. By default, only three days of data are retained in the change tables. Hence, the period of time covered by the validity interval is typically 72 hours. While the cleanup process works to move the low end-point of the validity interval to the right on the Change Data Capture timeline, the capture process does the same for the high end-point.
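For reference, the current end-points can be inspected directly. The following minimal sketch shows the database validity interval read from cdc.lsn_time_mapping, together with the corresponding values obtained through the documented CDC helper functions; the capture instance name Customer is the one created by this sample.

-- database validity interval, from the commit times recorded in cdc.lsn_time_mapping
select min(tran_end_time) as ValidityIntervalLow,
       max(tran_end_time) as ValidityIntervalHigh
from cdc.lsn_time_mapping

-- low end-point for the Customer capture instance and high end-point for the database,
-- expressed through the CDC helper functions
select sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_min_lsn(N'Customer')) as CustomerValidityLow,
       sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn()) as DatabaseValidityHigh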
Since the capture process extracts change data from the transaction log, there is a built-in latency between the time that a change is committed to a source table and the time that the change appears within its associated change table. While this latency is typically small, it is nevertheless important to remember that change data is not available until the capture process has processed the related log entries.

For each transaction that results in one or more entries appearing within database change tables, an entry is logged to the cdc.lsn_time_mapping table. Each entry in the mapping table contains both a commit Log Sequence Number, or commit LSN, and a transaction commit time (columns start_lsn and tran_end_time, respectively). Change Data Capture uses the table cdc.lsn_time_mapping to identify the current bounds of the database validity interval, with the smallest and largest commit times represented in the entries of the cdc.lsn_time_mapping table denoting the low and high end-points of the validity interval for the database.

The Change Data Capture Validity Interval for a Capture Instance

While it is often the case that the database validity interval and an individual capture instance's validity interval will coincide, this is not always true. The validity interval of the capture instance begins when the capture process recognizes the capture instance and begins logging associated changes to its change table. As a result, if capture instances are created at different times, each will initially have a different low end-point. The start_lsn column of the result set returned by sys.sp_cdc_help_change_data_capture shows the current low end-point for each defined capture instance. When the cleanup process cleans up change table entries, it adjusts the start_lsn values of capture instances to reflect the new low water mark for available change data; only those capture instances with start_lsn values currently less than the new low water mark are adjusted. Over time, if no new capture instances are created, the validity intervals for all individual instances will coincide with the database validity interval.

The validity interval is important to consumers of change data because the extraction interval for a request must be fully covered by the current Change Data Capture validity interval for the capture instance. If the low end-point of the extraction interval is to the left of the low end-point of the validity interval, there could be missing change data due to aggressive cleanup. If the high end-point of the extraction interval is to the right of the high end-point of the validity interval, the capture process has not yet processed through the time period represented by the extraction interval and change data could also be missing. This relationship is illustrated in the diagram below.

Figure 3: Change Data Capture Validity Intervals

A Closer Look at an SSIS Wrapper Function

It is important to note that the CDC query functions will fail if the request interval is not fully covered by the validity interval. In this sample, validity checks are systematically performed up front in the master package, so that the recoverable case, where the capture process simply needs to catch up, can be dealt with automatically. The first task of the master package is to verify the extraction interval that it is passed.
If the low end-point of the extraction interval is outside the validity interval, the package logs an error to the event log and exits with error status. If the high end-point of the extraction interval identifies a time that is ahead of current processing by the CDC capture process, the master package will delay to allow the capture process to catch up. Once the extraction interval is in range, the master package launches the child packages. Each child package picks up the datetime values defining the extraction interval from its parent through Package Configurations. When all child packages have completed, the master package logs an event message indicating completion. The control flow diagram for the master package is shown below.

Figure 4: Master Package for Incremental Data Extraction

Master Package Configurations and Variables

Package Configurations for Initializing Package Variables

The master package uses Package Configurations to initialize several runtime parameters. The table below shows the parent variables that are used to initialize local package variables of the master package.

Configuration Name          Configuration Type        Configuration String      Target Object      Target Property
Configure ExtractStartTime  Parent Package Variable   User::ExtractStartTime    ExtractStartTime   Value
Configure ExtractEndTime    Parent Package Variable   User::ExtractEndTime      ExtractEndTime     Value
Configure IntervalID        Parent Package Variable   User::IntervalID          IntervalID         Value
Configure Last LSN          Parent Package Variable   User::LastLSN             LastLSN            Value
Configure Base Path         Parent Package Variable   User::BasePath            BasePath           Value

Master Package Variables

The following package variables are defined for the master package. All are scoped to the MasterCDC package.

BasePath (String)
  Path where Change Data Capture for Specified Interval Package Sample packages are installed

DataReady (Int32)
  Code indicating query status:
  0 = Need to wait for capture process
  1 = Start time predates validity interval
  2 = Ready for data query (interval > 1)
  3 = Ready for first data query
  5 = Timeout ceiling reached waiting for capture process

Delay (Int32)
  Milliseconds to delay before rechecking whether the capture process has caught up

ExtractStartTime (datetime)
  Start time of extraction interval (interval > 1)

ExtractEndTime (datetime)
  End time of extraction interval

IntervalID (Int32)
  Interval ID

LastLSN (String)
  LSN anchor used for the first extraction interval

TimeoutCeiling (Int32)
  Number of cycles to delay waiting for the capture process to catch up prior to terminating with an error

TimeoutCount (Int32)
  Number of delays that have already been invoked

Master Package Tasks

Execute SQL Task – Check for Data

The Execute SQL Task, Check for Data, is used by the master package to determine how to proceed based upon the relationship of the requested extraction interval to the current Change Data Capture database validity interval and the current timeout count. It assigns one of five possible values to the package variable DataReady to indicate which of five possible conditions holds. It requires an OLE DB connection to the source database. The following SQL Statement is defined as Direct input:
declare @DataReady int, @TimeoutCount int
if not exists (select tran_end_time from cdc.lsn_time_mapping
               where tran_end_time > ?)
    select @DataReady = 0
else if ? = 0
    select @DataReady = 3
else if not exists (select tran_end_time from cdc.lsn_time_mapping
                    where tran_end_time <= ?)
    select @DataReady = 1
else
    select @DataReady = 2
select @TimeoutCount = ?
if (@DataReady = 0)
    select @TimeoutCount = @TimeoutCount + 1
else
    select @TimeoutCount = 0
if (@TimeoutCount > ?)
    select @DataReady = 5
select @DataReady as DataReady, @TimeoutCount as TimeoutCount

The SQL statement has five input parameters:

Variable Name            Direction    Data Type    Parameter Name
User::ExtractEndTime     Input        DATE         0
User::IntervalID         Input        SHORT        1
User::ExtractStartTime   Input        DATE         2
User::TimeoutCount       Input        SHORT        3
User::TimeoutCeiling     Input        SHORT        4

The result set is defined as a single row with the columns DataReady and TimeoutCount:

Result Name     Variable Name
DataReady       User::DataReady
TimeoutCount    User::TimeoutCount

The SQL query first determines whether there are any entries in the cdc.lsn_time_mapping table that are later than the requested end of the extraction interval. If there are none, this means that the capture process has not yet processed all the changes in the request interval and DataReady is set to 0. If the capture process is caught up, the Interval ID is next checked. If it is set to 0, this is the first interval to be processed, which requires special treatment. DataReady in this case is assigned a value of 3 to indicate this is the first interval. If this is not the first interval, a check is then made to determine whether the starting point of the extraction interval is earlier than every entry in the cdc.lsn_time_mapping table. If it is, DataReady is set to 1, indicating the extraction interval is invalid and cannot be automatically corrected. Finally, if the starting point of the interval is not outside the validity interval, DataReady is set to 2, indicating that change data can be extracted from the change tables for the interval.

Once DataReady has been determined, the query determines whether the master package is in a delay loop waiting for the capture process to catch up. If DataReady is 0, the timeout counter is incremented and then checked to see if the configured number of wait intervals has been exhausted. If the counter exceeds the ceiling (currently set to 20), DataReady is set to 5 to indicate that the wait for the capture process has timed out. Note that whenever a non-zero value is determined for DataReady, the timeout counter is reset to 0.

After execution of the Execute SQL Task Check for Data by the package, there are three possible paths that can be taken: If DataReady is 0, the required changes have not yet been propagated to the change tables. In this case, the Script Task Delay executes to wait for a period of time to allow the capture process to catch up. If DataReady is 1, the low end-point of the extraction interval is outside the Change Data Capture validity interval. In this case, an error is logged by the VB Script Task Log Extract Error and the package terminates, since there is no automatic recovery if there is a possibility of data loss. Similarly, if DataReady is 5, the allowable number of delays has been exhausted and the Script Task Log Extract Error is called to log the timeout error. Finally, if DataReady is 2 or 3, all changes through the indicated ExtractEndTime have been deposited in change tables and it is possible to allow the data extraction packages to gather change data. In this case, the three data extraction packages are launched.
Execute Package Tasks – Extract Change Data for CDC Enabled Tables

The Execute Package Tasks Extract Customer Data, Extract CreditCard Data, and Extract WorkOrder Data launch the individual extraction packages for the individual tables.

Script Task – Log Extract Error

The Script Task, Log Extract Error, is used to log an error to the Windows event log. The precedence constraint that defines the workflow between the Execute SQL Task Check for Data and this task is the following: a Success completion code, and the expression @DataReady == 1 || @DataReady == 5 evaluates to TRUE.

The EntryPoint property, which defines the first method that executes when the script task runs, is LogExtractError. The ReadOnlyVariables property setting makes the following package variables available to the script: User::DataReady, System::ExecutionInstanceGUID, User::ExtractStartTime, System::PackageName, System::StartTime. The VB script code to support writing to the event log is shown below.

Public Sub LogExtractError()
    Dim varPackageName As Variable = Dts.Variables("System::PackageName")
    Dim varStartTime As Variable = Dts.Variables("System::StartTime")
    Dim varInstanceID As Variable = Dts.Variables("System::ExecutionInstanceGUID")
    Dim varExtractStartTime As Variable = Dts.Variables("User::ExtractStartTime")
    Dim varDataReady As Variable = Dts.Variables("User::DataReady")

    Dim sLog As String
    Dim sEventMessage As String
    Dim sMachine As String
    Dim sSource As String
    Dim iDataReady As Integer

    Dts.VariableDispenser.LockForRead("System::PackageName")
    Dts.VariableDispenser.LockForRead("System::StartTime")
    Dts.VariableDispenser.LockForRead("System::ExecutionInstanceGUID")
    Dts.VariableDispenser.LockForRead("User::ExtractStartTime")
    Dts.VariableDispenser.LockForRead("User::DataReady")

    sLog = "Application"
    sSource = varPackageName.Value().ToString
    iDataReady = varDataReady.Value()

    If iDataReady = 1 Then
        sEventMessage = "Start Time Error"
    Else
        sEventMessage = "Timeout Error"
    End If

    sEventMessage = sEventMessage _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "The Package: " + varPackageName.Value().ToString _
        & Chr(10) _
        & "Started: " & varStartTime.Value().ToString _
        & Chr(10) _
        & "Current Time:" & System.DateTime.Now _
        & Chr(10) _
        & "=============================================" _
        & Chr(10) _
        & "Extract Start Time: " _
        & varExtractStartTime.Value().ToString _
        & Chr(10) _
        & "Execution GUID: " & varInstanceID.Value().ToString

    sMachine = "."
    Dim ELog As New EventLog(sLog, sMachine, sSource)
    ELog.WriteEntry(sEventMessage, EventLogEntryType.Error)

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Failure
End Sub

Script Task – Delay

The Script Task Delay is used to delay for a given interval of time. The precedence constraint that defines the workflow between the Execute SQL Task Check for Data and this task is the following: a Success completion code, and the expression @DataReady == 0 && @TimeoutCount <= @TimeoutCeiling evaluates to TRUE.

The EntryPoint property, which defines the first method that executes when the script task runs, is Delay. The ReadOnlyVariables property setting makes the following package variable available to the script: User::Delay. The VB script code to support delaying for a period of time is shown below.
Public Sub Delay()
    Dim varDelay As Variable = Dts.Variables("User::Delay")
    Dim iDelay As Integer

    Dts.VariableDispenser.LockForRead("User::Delay")
    iDelay = varDelay.Value()
    Threading.Thread.Sleep(iDelay)

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

Script Task – Log Extraction Complete

The Script Task, Log Extraction Complete, is used to log a completion message to the Windows event log. The precedence constraint that defines the workflow between each of the three Execute Package Tasks and this task is Completion.

The EntryPoint property that defines the first method that executes when the script task runs is LogExtractionCompletion. The ReadOnlyVariables property setting makes the following package variables available to the script: User::DataReady, System::ExecutionInstanceGUID, User::ExtractEndTime, User::ExtractStartTime, User::IntervalID, System::PackageName, System::StartTime. The VB script code to support writing the completion notification to the event log is shown below.

Public Sub LogExtractionCompletion()
    Dim varPackageName As Variable = Dts.Variables("PackageName")
    Dim varStartTime As Variable = Dts.Variables("StartTime")
    Dim varInstanceID As Variable = Dts.Variables("ExecutionInstanceGUID")
    Dim varExtractStartTime As Variable = Dts.Variables("ExtractStartTime")
    Dim varExtractEndTime As Variable = Dts.Variables("ExtractEndTime")
    Dim varIntervalID As Variable = Dts.Variables("IntervalID")
    Dim varDataReady As Variable = Dts.Variables("DataReady")

    Dim sLog As String
    Dim sEventMessage As String
    Dim sMachine As String
    Dim sSource As String

    Dts.VariableDispenser.LockForRead("PackageName")
    Dts.VariableDispenser.LockForRead("StartTime")
    Dts.VariableDispenser.LockForRead("ExecutionInstanceGUID")
    Dts.VariableDispenser.LockForRead("ExtractStartTime")
    Dts.VariableDispenser.LockForRead("ExtractEndTime")
    Dts.VariableDispenser.LockForRead("IntervalID")
    Dts.VariableDispenser.LockForRead("DataReady")

    sLog = "Application"
    sSource = varPackageName.Value().ToString

    sEventMessage = "Extract Complete" _
        & Chr(10) _
        & "=============================================" & Chr(10) _
        & "The Package: " + varPackageName.Value().ToString _
        & Chr(10) _
        & "Started: " & varStartTime.Value().ToString _
        & Chr(10) _
        & "Current Time:" & System.DateTime.Now _
        & Chr(10) _
        & "=============================================" _
        & Chr(10) _
        & "Extract Start Time: " & varExtractStartTime.Value().ToString _
        & Chr(10) _
        & "Extract End Time: " & varExtractEndTime.Value().ToString _
        & Chr(10) _
        & "Interval ID: " & varIntervalID.Value().ToString _
        & Chr(10) _
        & "Data Ready: " & varDataReady.Value().ToString _
        & Chr(10) _
        & "Execution GUID: " & varInstanceID.Value().ToString

    sMachine = "."
    Dim ELog As New EventLog(sLog, sMachine, sSource)
    ELog.WriteEntry(sEventMessage, EventLogEntryType.Information)

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

Child Packages for the Change Data Capture for Specified Interval Package Sample

The sample child packages all use instantiated wrapper functions to query for change data for all extraction intervals after the first interval. For the first interval, they use a customized version of the generated wrapper function that allows the low end-point of the extraction interval to be expressed as an LSN value rather than a datetime value. The two call forms are contrasted below.
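To make the distinction concrete, the two wrappers for the WorkOrder table would be invoked as follows. These calls are illustrative only; the datetime values and the LSN string are placeholders rather than values from an actual run.

-- Intervals after the first: the instantiated wrapper takes datetime end-points.
select * from dbo.fn_net_changes_WorkOrder('20081101 10:00:00', '20081101 10:10:00', 'all')

-- First interval: the customized wrapper anchors the low end-point on an LSN string.
select * from dbo.fn_net_changes_WorkOrder_First('0x0000003A000001F80003', '20081101 10:10:00', 'all')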
The datetime boundaries of the extraction interval, as well as the LSN boundary used for the first interval, are obtained from parent package variables when the packages are launched. The packages also obtain a flag indicating whether this is the initial extraction interval so that the first interval can be handled as a special case.

The basic control flow for the packages is straightforward. A Script Task is called first to construct the SQL query that will be used to query for change data. Control is then passed to the Data Flow Task to request and process the change data. The data flow task uses an OLE DB source component to perform the query, directing the returned result set to a conditional split transformation. The conditional split uses the operation returned in each result set row to direct the rows to appropriate transformations: deletes and updates are sent to OLE DB command transformations, while inserts are directed to an OLE DB destination. The control and data flow for these child packages is shown in the figure below.

Figure 5: Child Packages Using Multi-Statement TVFs for Data Access

The Configurations used by the child extract packages use package variables from the launching master package to initialize their own package variables.

Configuration Name | Configuration Type | Configuration String | Target Object | Target Property
Configure Data Ready | Parent Package Variable | User::DataReady | DataReady | Value
Configure Start Time | Parent Package Variable | User::ExtractStartTime | StartTime | Value
Configure End Time | Parent Package Variable | User::ExtractEndTime | EndTime | Value
Configure Last LSN | Parent Package Variable | User::LastLSN | LastLSN | Value

Child Package Variables

The following package variables are defined for the child packages:

Name | Data Type | Scope | Description
DataReady | Int32 | CDCWorkOrderExtract | Code indicating query status: 0 = need to wait for capture process; 1 = start time predates validity interval; 2 = ready for data query (interval > 1); 3 = ready for first data query; 5 = timeout ceiling reached waiting for capture process
ExtractStartTime | datetime | CDCWorkOrderExtract | Start time of extraction interval (interval > 1)
ExtractEndTime | datetime | CDCWorkOrderExtract | End time of extraction interval
LastLSN | String | CDCWorkOrderExtract | LSN anchor used for first extraction interval
SQLDataQuery | String | CDCWorkOrderExtract | Query to use to obtain change data

Child Package Tasks

Script Task – Generate SQL Data Query

The Script Task Generate SQL Data Query is used to set up the query for change data. It allows us to work around the inability to pass parameters directly to table-valued functions in an OLE DB source. The parameters are passed to the VB script code, which composes the query string and returns the desired select statement in the package variable SQLDataQuery.

One additional issue bears mentioning. Our general strategy for systematically processing a stream of change data is to use the high end-point of the previous interval to determine the low end-point of the subsequent interval, and to compute a new high end-point based upon the needs of the application environment. This strategy works well for all intervals except the initial interval, when there is no previous interval. The chaining of successive intervals is illustrated below.
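For example, two successive requests against the WorkOrder wrapper might look like the following. The datetime values are placeholders; the point is only that the end time of one call is passed, unmodified, as the start time of the next.

-- Extraction interval n
select * from dbo.fn_net_changes_WorkOrder('20081101 10:00:00', '20081101 10:10:00', 'all')

-- Extraction interval n+1: the previous end time becomes the new start time
select * from dbo.fn_net_changes_WorkOrder('20081101 10:10:00', '20081101 10:20:00', 'all')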
In general, the destination and source will not be synchronized when a decision is made to use CDC technology to apply incremental loads, and the first task will be to identify an anchor for the destination. The anchor is defined as an LSN lying within the Change Data Capture validity interval of a capture instance such that (1) all changes with start LSN values up to and including the anchor are already reflected in the destination, and (2) all changes with start LSN values greater than the anchor have yet to be applied. Once the anchor is determined, it is used to explicitly set the LSN boundary for the initial extraction. For this sample, a database snapshot provides data for the initial load, allowing an appropriate anchor LSN for the first extraction to be determined directly from snapshot metadata. While use of a database snapshot in preparing the initial load is not a requirement, it greatly simplifies the synchronization process.

The Script Task Generate SQL Data Query is used to generate the query to be used to extract change data. It uses the passed package variable DataReady to determine whether the query is for the first interval or for a subsequent interval. If it is the first interval, it constructs a call to the function fn_net_changes_WorkOrder_First, which takes an LSN value as its first parameter. Otherwise, it constructs a call to the function fn_net_changes_WorkOrder, which uses two datetime values as the interval end-points.

The EntryPoint property that defines the first method that executes when the script task runs is GenerateSQLQuery. The ReadOnlyVariables property setting makes the following package variables available to the script: User::DataReady, User::EndTime, User::LastLSN, User::StartTime. The ReadWriteVariables property setting allows the following package variable to be set by the script: User::SQLDataQuery. The VB script code to support constructing the data query is shown below.
Public Sub GenerateDataQuery()
    Dim varStartTime As Variable = Dts.Variables("User::StartTime")
    Dim varEndTime As Variable = Dts.Variables("User::EndTime")
    Dim varDataReady As Variable = Dts.Variables("User::DataReady")
    Dim varSQLDataQuery As Variable = Dts.Variables("User::SQLDataQuery")
    Dim varLastLSN As Variable = Dts.Variables("User::LastLSN")

    Dim iDataReady As Integer
    Dim sStartTime As String
    Dim sEndTime As String
    Dim sLastLSN As String
    Dim dStartTime As DateTime
    Dim dEndTime As DateTime

    Dts.VariableDispenser.LockForRead("User::StartTime")
    Dts.VariableDispenser.LockForRead("User::EndTime")
    Dts.VariableDispenser.LockForRead("User::DataReady")
    Dts.VariableDispenser.LockForWrite("User::SQLDataQuery")
    Dts.VariableDispenser.LockForWrite("User::LastLSN")

    iDataReady = varDataReady.Value()
    dStartTime = varStartTime.Value()
    sStartTime = dStartTime.ToString("G", DateTimeFormatInfo.InvariantInfo)
    dEndTime = varEndTime.Value()
    sEndTime = dEndTime.ToString("G", DateTimeFormatInfo.InvariantInfo)
    sLastLSN = varLastLSN.Value().ToString

    If iDataReady = 2 Then
        varSQLDataQuery.Value = "select * from dbo.fn_net_changes_WorkOrder('" _
            & sStartTime & "','" & sEndTime & "', 'all')"
    Else
        varSQLDataQuery.Value = "select * from dbo.fn_net_changes_WorkOrder_First('" _
            & sLastLSN & "','" & sEndTime & "', 'all')"
    End If

    Dts.Variables.Unlock()
    Dts.TaskResult = ScriptResults.Success
End Sub

Data Flow Task – Process Change Data

The Data Flow Task Process Change Data calls a table-valued function to query the source for change data and then applies the change data returned in the result set to the destination.

OLE DB Source

The query used by these child packages is obtained from a package variable and makes use of an instantiated wrapper function. (Wrapper functions are discussed in detail at the end of this document.)

Data Access Mode | SQL Command from variable
Variable Name | User::SQLDataQuery

The result set returned by the query includes the metadata column __CDC_OPERATION, which identifies the operation to be used when applying the change to the destination. A conditional split transformation is used to direct the rows to one of three possible components based upon the following defined conditions:

Order | Output Name | Condition
1 | Inserts | __CDC_OPERATION == "I"
2 | Updates | __CDC_OPERATION == "UN"
3 | Deletes | __CDC_OPERATION == "D"

OLE DB Command Data Mart Deletes

The Delete flow is directed to the OLE DB Command Data Mart Deletes. The following command is used to apply the changes within this flow to the destination:

delete from CDCSample.WorkOrder where WorkOrderID = ?

The command requires a column from the result set in order to successfully apply the delete; in general, these are the primary key columns for the table. The delete command above is for the child package processing changes to the WorkOrder table, which has a single primary key column, WorkOrderID. The mapping of result set columns to command parameters can be seen in the following diagram:

Figure 6: Result Set Mapping for Deletes

OLE DB Command Data Mart Updates

The Update flow is directed to the OLE DB Command Data Mart Updates. The following command is used to apply the changes within this flow to the destination:

update CDCSample.WorkOrder
set ProductID = ?, OrderQty = ?, ScrappedQty = ?, StartDate = ?, EndDate = ?,
    DueDate = ?, ScrapReasonID = ?, ModifiedDate = ?
where WorkOrderID = ?
This command requires all of the source columns from the result set in order to successfully apply the update. The mapping of result set columns to command parameters can be seen in the following diagram:

Figure 7: Result Set Mapping for Updates

OLE DB Destination Data Mart Inserts

The Insert flow is directed to the OLE DB Destination. The table name alone identifies the target for the insert. The following diagram shows how the result set columns are mapped to the table columns:

Figure 8: Result Set Mapping for Inserts

Error Logging in the Child Packages

Each of the child packages makes use of the SSIS Log Provider for Windows Event Log to allow logging for the OnError event. Logging is configured using the Configure SSIS Logs dialog box, which appears in the designer when right-clicking a selected package and choosing Logging. This dialog box also allows the current logging settings to be examined. For the child packages, it shows the SSIS Log Provider for Windows Event Log as the Provider type on the Providers and Logs tab. Under the Details tab, the OnError condition is set.

Even when range validation is done up front, it is always possible that conditions will change between the time the check is performed and the time the TVF executes. Applications must always be prepared to deal with the possibility of range errors. The CDC TVFs check to ensure that the defined extraction interval is fully covered by the validity interval for the capture instance and raise an error if this requirement is not met. In the sample, if range errors are encountered by any of the child packages, OLE DB will log the error to the event log, producing an entry similar to the following.

Figure 9: Range Error Posted from OLEDB

The error returned for range errors is error 313, "An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_net_changes_ …". The figure above shows this error.

Wrappers for CDC TVFs

Instantiated Multi-Statement TVF Wrappers

Wrapper functions for consuming CDC change data serve several important purposes. First, and most importantly, they allow an SSIS Data Flow task to be used in a straightforward manner to query for change data. The inline TVFs that are generated to access CDC change tables do not allow the column structure of the returned result set to be determined by an OLE DB provider. This limitation, however, can be dealt with in a straightforward fashion by wrapping the inline TVFs with a multi-statement TVF that explicitly identifies the columns returned by the CDC function. The scripted wrapper functions do precisely this.

While the need to determine column information about the returned result set is the principal reason for using a wrapper function around the generated CDC TVF, the wrapper serves several other useful purposes. The generated CDC query functions used to gather change data use Log Sequence Numbers (LSNs) to mark the boundaries of the query window. While LSNs are invaluable in ensuring that data can be retrieved from change tables in a systematic manner that guarantees no lost or repeated data, they have virtually no meaning to the application layer that wants to consume change data. Particularly for Data Mart applications, the request for change data is most typically bounded by datetime values.
The CDC feature has built-in functionality for systematically mapping between datetime values and LSN values. The wrapper function is an ideal place to address these mapping issues, allowing the application to define its request interval as a datetime pair, with the wrapper function performing the necessary translation between these values and the LSN values needed to query the generated CDC TVFs.

In addition to shielding the SSIS application from dealing with LSN values directly, the wrapper function can perform column filtering on the data returned from the CDC TVF more efficiently than column filtering done when querying the wrapper function. The result set returned by the CDC TVF is expected to serve a variety of clients, and typically the captured columns included within a capture instance are not client specific. When the requesting package needs only a subset of the captured columns, this column filtering is best done within the wrapper code.

Finally, the wrapper code allows information returned from the generated TVF to be recoded in a form that is more easily consumed by the calling application. In its simplest form, recoding can be used to convert the integer-based operation codes into more meaningful single-character codes that provide built-in cues to their meaning, i.e., mapping 2 to 'I' for insert, 1 to 'D' for delete, and 4 to 'UN' for update new values. The more interesting case, however, is the recoding of column update information extracted from the CDC update mask. When a CDC function returning net changes is called using the row filter option 'all with mask', an update mask is generated that identifies all of the column values that changed due to the update. If more than one update occurred during the interval, the mask represents the aggregate changes for all updates. This ability to easily determine the columns that will change, prior to applying an update, can be extremely useful during ETL dimension processing. Because this information is typically only needed for a small subset of columns, the wrapper is an appropriate place to extract this information from the mask and return it as simple flag columns to the SSIS application layer.

We will now look at a simple example of a wrapper function for the table WorkOrder in AdventureWorks2008.

A Closer Look at an SSIS Wrapper Function

Use of datetime Values in the Wrapper Signature

What we first note when examining a wrapper function is that its signature contains two datetime values: a start time and an end time. Typically, it is most natural for an SSIS application to use a time interval to delimit a query range, since this allows the CDC change data to be easily related to other Data Mart data. While the CDC TVFs do not allow datetime values to be used directly as query end-points, mapping functions are provided that make it possible to systematically translate datetime values into corresponding LSNs. The wrapper functions provide an ideal location in which to embed this mapping logic.

In the generated wrapper functions scripted by the procedure sys.sp_cdc_generate_wrapper_function, the following convention is used to guarantee that no data is repeated or skipped: the caller of the wrapper agrees to pass the end time of the previous data query as the start time of the next query, without modification. Within the wrapper, the calls to the mapping functions are always constructed in exactly the same way.
This makes it possible to ensure that, given the same time value passed in the previous call, the mapping function will regenerate the same LSN value in the subsequent call. This is key to guaranteeing that the resulting LSN-based intervals do not have breaks or overlaps. Finally, the function sys.fn_cdc_increment_lsn is used to increment the previous end LSN by one to obtain the next start LSN value. Note that it is guaranteed that there are no LSNs that lie between an LSN and that LSN value incremented by 1. Note also that the LSNs passed to the TVFs define a closed interval, so that all change entries having LSN values within the defined interval, including the boundary values, will be included in the returned result.

Below we have extracted the portion of the wrapper function that addresses the mapping of datetime values to LSN values to define the extraction interval. Note that passing a null value for the start time signals the wrapper function to use the current low end-point of the capture instance validity interval as the interval start time. Similarly, a null end time value will cause the high end-point of the capture instance validity interval to be used as the high end-point for the extraction interval. This serves a couple of useful purposes. When you are designing the package to process the result set returned by the wrapper function, you can set the default string that defines the call to use null parameters. This allows you to get the needed column information for the result set without worrying too much about providing a valid extraction interval. The call to the wrapper function using null values will only fail if the cdc.lsn_time_mapping table has no entries.

Some of the additional checking performed by the wrapper function deserves explanation. We know that the CDC LSN-based query functions will return a non-zero value in @@error if the LSN range does not fall within the validity interval of the capture instance. We would like the same to be true when the LSN-based query functions are called from within a generated wrapper function. In order to ensure this, we need to check the datetime range explicitly, prior to mapping it to an LSN value. If the datetime range value is not within bounds, it is mapped to NULL, forcing the range error when the underlying TVF is called. Also note the following check:

if @from_lsn is not null and @to_lsn is not null and (@from_lsn = sys.fn_cdc_increment_lsn(@to_lsn))
    return

This is a legitimate condition that can occur when there are no entries in cdc.lsn_time_mapping in the interval between the start and end time. This causes the calls sys.fn_cdc_map_time_to_lsn('largest less than or equal', @start_time) and sys.fn_cdc_map_time_to_lsn('largest less than or equal', @end_time) to return the same LSN value. Incrementing the start LSN by one causes the condition to evaluate to true. In this case, it is appropriate to return with an empty result set.
CREATE function [dbo].[fn_net_changes_WorkOrder]
(
    @start_time datetime = null,
    @end_time datetime = null,
    @row_filter_option nvarchar(30) = N'all'
)
…
begin
    declare @from_lsn binary(10), @to_lsn binary(10)

    if (@start_time is null)
        select @from_lsn = [sys].[fn_cdc_get_min_lsn]('WorkOrder')
    else begin
        if ([sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_min_lsn]('WorkOrder')) > @start_time)
           or ([sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_max_lsn]()) < @start_time)
            select @from_lsn = null
        else
            select @from_lsn = [sys].[fn_cdc_increment_lsn]
                ([sys].[fn_cdc_map_time_to_lsn]('largest less than or equal', @start_time))
    end

    if (@end_time is null)
        select @to_lsn = [sys].[fn_cdc_get_max_lsn]()
    else begin
        if [sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_max_lsn]()) < @end_time
            select @to_lsn = null
        else
            select @to_lsn = [sys].[fn_cdc_map_time_to_lsn]('largest less than or equal', @end_time)
    end

    if @from_lsn is not null and @to_lsn is not null
       and (@from_lsn = [sys].[fn_cdc_increment_lsn](@to_lsn))
        return
    …
end

Returning Column Information in the Wrapper Function

The wrapper function is structured as a multi-statement TVF. The result set returned by the function is defined explicitly, which allows the OLE DB provider to make column information available to the calling program. The need to include this information explicitly would make this one of the more tedious aspects of preparing wrapper functions manually. Note that the result from querying the CDC TVF is inserted into the defined table before exiting the function. The defined table includes all of the table columns plus a final column, __CDC_OPERATION. This column is a more user-friendly recoding of the __$operation column returned by the CDC TVF.

create function [dbo].[fn_net_changes_WorkOrder]
(
    @start_time datetime = null,
    @end_time datetime = null,
    @row_filter_option nvarchar(30) = N'all'
)
returns @resultset table
(
    [WorkOrderID] int
    ,[ProductID] int
    ,[OrderQty] int
    ,[ScrappedQty] smallint
    ,[StartDate] datetime
    ,[EndDate] datetime
    ,[DueDate] datetime
    ,[ScrapReasonID] smallint
    ,[ModifiedDate] datetime
    ,[__CDC_OPERATION] varchar(2)
)
as
begin
    …
    insert into @resultset
    select [WorkOrderID]
        ,[ProductID]
        ,[OrderQty]
        ,[ScrappedQty]
        ,[StartDate]
        ,[EndDate]
        ,[DueDate]
        ,[ScrapReasonID]
        ,[ModifiedDate]
        ,case [__$operation]
            when 1 then 'D'
            when 2 then 'I'
            when 3 then 'UO'
            when 4 then 'UN'
            when 5 then 'M'
            else null
         end as [__CDC_OPERATION]
    from [cdc].[fn_cdc_get_net_changes_WorkOrder](@from_lsn, @to_lsn, @row_filter_option)
    return
end
go

Extracting Information from the CDC Update Mask

CDC functionality includes the ability to identify all column values that changed for each identified update operation. When querying for all changes, this information is always returned. When querying for net changes, the update mask is only returned as a non-null value when the filter option 'all with mask' is selected. While this information can be extremely useful, its representation in mask form is awkward for SSIS to consume directly. CDC does, however, provide SQL functions to assist in extracting data from the update mask, making wrapper functions an ideal place for performing this extraction. While information from the update mask is not used within the current sample, it is nevertheless useful to show how logic to extract data from the update mask can be embedded within a simple wrapper function.
While the CDC-supplied mask provides information on all captured columns, it is likely that the application will only be interested in this information for a handful of columns. In the example below, the wrapper for the WorkOrder table is modified to include a final flag column that indicates, on update, whether the column OrderQty has changed.

For this sample we made use of the default wrapper functions, so we were able to generate the wrappers for all defined, accessible capture instances with a single call. It is possible, however, to tailor the call to the scripting stored procedure to further customize the generated wrapper. In this case, we want to explicitly identify columns that need an update flag. We do this by coding the optional parameter @update_flag_list as a comma-separated list of columns for which update information is needed. The call below generates a wrapper with an additional output column that, for update operations, signals whether or not the column OrderQty was modified.

exec sys.sp_cdc_generate_wrapper_function
    @capture_instance = 'WorkOrder',
    @update_flag_list = 'OrderQty'

When examining the code below, note that the returned table now includes an additional column, OrderQty_uflag, after __CDC_OPERATION. This column will hold the additional update flag. Next, note the call to the function sys.fn_cdc_get_column_ordinal to obtain the column ordinal for the column of interest. Once the column ordinal is known, the function sys.fn_cdc_is_bit_set can be applied in the select statement for each returned update to determine whether the bit corresponding to the column is set in the returned mask. Finally, note that when the TVF dbo.fn_net_changes_WorkOrder is invoked, the row filter option must be set to 'all with mask' to signal, in the net changes query, that the mask should be computed.
create function [dbo].[fn_net_changes_WorkOrder]
(
    @start_time datetime = null,
    @end_time datetime = null,
    @row_filter_option nvarchar(30) = N'all'
)
returns @resultset table
(
    [WorkOrderID] int
    ,[ProductID] int
    ,[OrderQty] int
    ,[ScrappedQty] smallint
    ,[StartDate] datetime
    ,[EndDate] datetime
    ,[DueDate] datetime
    ,[ScrapReasonID] smallint
    ,[ModifiedDate] datetime
    ,[__CDC_OPERATION] varchar(2)
    ,[OrderQty_uflag] bit
)
as
begin
    declare @from_lsn binary(10), @to_lsn binary(10)
    declare @ordinal_1 int
    select @ordinal_1 = [sys].[fn_cdc_get_column_ordinal]('WorkOrder', 'OrderQty')

    if (@start_time is null)
        select @from_lsn = [sys].[fn_cdc_get_min_lsn]('WorkOrder')
    else begin
        if ([sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_min_lsn]('WorkOrder')) > @start_time)
           or ([sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_max_lsn]()) < @start_time)
            select @from_lsn = null
        else
            select @from_lsn = [sys].[fn_cdc_increment_lsn]
                ([sys].[fn_cdc_map_time_to_lsn]('largest less than or equal', @start_time))
    end

    if (@end_time is null)
        select @to_lsn = [sys].[fn_cdc_get_max_lsn]()
    else begin
        if [sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_max_lsn]()) < @end_time
            select @to_lsn = null
        else
            select @to_lsn = [sys].[fn_cdc_map_time_to_lsn]('largest less than or equal', @end_time)
    end

    if @from_lsn is not null and @to_lsn is not null
       and (@from_lsn = [sys].[fn_cdc_increment_lsn](@to_lsn))
        return

    insert into @resultset
    select [WorkOrderID]
        ,[ProductID]
        ,[OrderQty]
        ,[ScrappedQty]
        ,[StartDate]
        ,[EndDate]
        ,[DueDate]
        ,[ScrapReasonID]
        ,[ModifiedDate]
        ,case [__$operation]
            when 1 then 'D'
            when 2 then 'I'
            when 3 then 'UO'
            when 4 then 'UN'
            when 5 then 'M'
            else null
         end as [__CDC_OPERATION]
        ,case [__$operation]
            when 4 then
                case [__$update_mask]
                    when null then null
                    else [sys].[fn_cdc_is_bit_set](@ordinal_1, [__$update_mask])
                end
            else null
         end as [OrderQty_uflag]
    from [cdc].[fn_cdc_get_net_changes_WorkOrder](@from_lsn, @to_lsn, @row_filter_option)
    return
end

Customizing Wrapper Functions for the First Extraction Interval

Our strategy for synchronizing the initial load from a database snapshot forces us to deal with the first extraction interval specially, allowing the lower bound of the extraction interval to be expressed as an LSN value, while continuing to honor a datetime value as the upper bound. While these customized functions are not currently scripted when wrapper functions are generated, it is straightforward to hand-modify the generated wrappers to produce a version designed specifically to handle the initial extraction interval. In this sample, the code for these functions is included explicitly in the script file CDCSetupTables.sql. The modified WorkOrder script is shown below, with the changes appearing in black. These changes include a change to the function signature, the conversion of the string representation of the LSN to a binary(10), some range validation against the capture instance validity interval, and an incremental adjustment to get all changes after the defined anchor LSN.
-- Generate custom wrapper functions for initial incremental load of
-- WorkOrder
create function [dbo].[fn_net_changes_WorkOrder_First]
(
    @start_lsn_str varchar(40) = null,
    @end_time datetime = null,
    @row_filter_option nvarchar(30) = N'all'
)
returns @resultset table
(
    [WorkOrderID] int,
    [ProductID] int,
    [OrderQty] int,
    [StockedQty] int,
    [ScrappedQty] smallint,
    [StartDate] datetime,
    [EndDate] datetime,
    [DueDate] datetime,
    [ScrapReasonID] smallint,
    [ModifiedDate] datetime,
    [__CDC_OPERATION] varchar(2)
)
as
begin
    declare @from_lsn binary(10), @to_lsn binary(10), @start_lsn binary(10)

    if (@start_lsn_str is null)
        select @from_lsn = [sys].[fn_cdc_get_min_lsn]('WorkOrder')
    else begin
        select @start_lsn = [dbo].[HexStrToVarBin](@start_lsn_str)
        if ([sys].[fn_cdc_get_min_lsn]('WorkOrder') > @start_lsn)
           or ([sys].[fn_cdc_get_max_lsn]() < @start_lsn)
            select @from_lsn = null
        else
            select @from_lsn = [sys].[fn_cdc_increment_lsn](@start_lsn)
    end

    if (@end_time is null)
        select @to_lsn = [sys].[fn_cdc_get_max_lsn]()
    else begin
        if [sys].[fn_cdc_map_lsn_to_time]([sys].[fn_cdc_get_max_lsn]()) < @end_time
            select @to_lsn = null
        else
            select @to_lsn = [sys].[fn_cdc_map_time_to_lsn]('largest less than or equal', @end_time)
    end

    if @from_lsn is not null and @to_lsn is not null
       and (@from_lsn = [sys].[fn_cdc_increment_lsn](@to_lsn))
        return

    insert into @resultset
    select [WorkOrderID],
        [ProductID],
        [OrderQty],
        [StockedQty],
        [ScrappedQty],
        [StartDate],
        [EndDate],
        [DueDate],
        [ScrapReasonID],
        [ModifiedDate],
        case [__$operation]
            when 1 then 'D'
            when 2 then 'I'
            when 3 then 'UO'
            when 4 then 'UN'
            when 5 then 'M'
            else null
        end as [__CDC_OPERATION]
    from [cdc].[fn_cdc_get_net_changes_WorkOrder](@from_lsn, @to_lsn, @row_filter_option)
    return
end
GO

Running the Change Data Capture for Specified Interval Package Sample in BI Studio

To run the Change Data Capture for Specified Interval Package Sample in BI Studio, open the Change Data Capture for Specified Interval Package Sample project and execute the package SetupCDCSample.dtsx. It should take several minutes to run. Once the execution environment has been set up, three of the workload tasks are launched to generate DML against the AdventureWorks2008 tables that are being tracked. These workload tasks have built-in 10-second delays to guarantee that the load is spread over several minutes, regardless of the hardware on which it is run.

Figure 10: Setup Package Performing Initial Load and Workload Generation Concurrently

While the workload is being applied, a task to generate the database snapshot for AdventureWorks2008 is also launched. After it completes, the data flow tasks are allowed to run against the snapshot database to initially load the target tables in AdventureWorksDW2008. Once the initial load completes, the base time for the initial extraction interval is determined and the master incremental load package is invoked. Each time the master package is invoked, it first verifies that the desired extraction interval is contained within the validity interval for the database. If necessary, it will delay to give the capture process time to process all the changes through the high end-point of the next extraction interval. Once the workload has been applied, the Mark Workload Completion task sets package variables to indicate that the workload has completed and to identify a completion time.
When the start time of an extraction interval exceeds that time, the loop cycling the master package will terminate.

Figure 11: Setup Package at Successful Run Completion

The control flow diagram above shows the setup package at the end of a successful run.

Conclusion

The Change Data Capture feature of SQL Server 2008 makes change data available in a relational format. This document describes how SSIS packages can leverage this feature to handle incremental loads to a Data Mart, using a sample available on Codeplex as the vehicle to drive the discussion.

For more information:
SQL Server Web site: http://www.microsoft.com/sql/default.mspx
SQL Server TechCenter: http://technet.microsoft.com/en-us/sqlserver/default.aspx
SQL Server DevCenter: http://msdn2.microsoft.com/en-us/sqlserver/default.aspx

Did this paper help you? Please give us your feedback. Tell us, on a scale of 1 (poor) to 5 (excellent), how you would rate this paper and why you have given it this rating. For example:
Are you rating it high due to having good examples, excellent screenshots, clear writing, or another reason?
Are you rating it low due to poor examples, fuzzy screenshots, or unclear writing?
This feedback will help us improve the quality of white papers we release. Send feedback.