A Sample Event Consumer Program For Updating Netezza with DB2 Changed Data

Introduction

This article introduces a sample program that retrieves DB2 changed data in a delimited-values format from WebSphere MQ and puts the changed data into Netezza. The sample is based on the event publishing function found in the IBM InfoSphere Data Replication (IIDR) Q Replication technology. Event publishing captures changed data from a database transaction log and makes it available for applications to consume. The consuming applications may range from in-house programs to off-the-shelf ETL tools, such as InfoSphere DataStage, that perform complex transformations on the data.

Netezza appliances deliver advanced complex analytics functions to enterprise data warehousing and business intelligence applications. As more applications are processed by the appliance, the need to move data to the Netezza appliance increases.

Sample Scenario

The sample event consumer program use cases include active-active and data warehousing environments maintained by the Q Replication technology of the IIDR product. The sample scenario used in this article is a data warehousing scenario with site 1 as the DB2 for z/OS production environment and site 2 as a DB2 for z/OS server maintained by Q Replication. Site 2 also runs a Netezza appliance. In the picture below, unidirectional replication is used to maintain the DB2 for z/OS server on site 2. A Q Capture program on site 2 feeds DB2 changed data into the Netezza appliance. In active-active scenarios with bidirectional replication between the two sites, a capture program would already be running on site 2.

A benefit of scenarios that rely on a Q Apply program to maintain the DB2 for z/OS server on site 2 is the ability to perform the initial load of the Netezza database without impacting the DB2 for z/OS production environment on site 1. The procedure is described in more detail in a later section.
The solution at a glance

To feed DB2 changed data into a Netezza appliance, you need a Q Capture program (which provides the event publishing function) and the sample event consumer program called asnconsumer introduced in this article. The sample program runs on USS (Unix System Services for z/OS), UNIX, Linux, or Windows.

The high-level solution includes three main stages. In the first stage, you identify the tables for which changed data is to be captured. The Q Capture program's event publishing function puts committed transactions into a local WebSphere MQ queue. The second and third stages run independently of each other and run continuously. They are delivered by the main threads of the event consumer sample program. The three stages are shown in the diagram below:

In the second stage, the event consumer sample program reads the queue and writes data to the local file system (USS, UNIX, etc.). One file is created for each table with changed data. The program reads data from the queue for a user-defined amount of time (the default is five minutes). At the end of that time, reading of WebSphere MQ data stops temporarily, the files are moved to a "refresh" directory, and the reading of MQ data starts again.

In the third stage, the event consumer sample program relies on JDBC to load the refresh directory data files into changed data tables in Netezza (the TMPCD_XXX tables in the picture) and to call a SQL procedure that performs the necessary delete and insert statements for the target tables (the REFRESH_SQL procedure). The event consumer sample program does not issue update statements against the Netezza target database. If there is no data to process, the sample program simply waits for data to become available. The event consumer sample program runs continually until you stop it with a Ctrl-C user interrupt.
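The second stage described above (read the queue for a fixed window, then rotate the per-table files into the refresh directory) can be sketched in Java. This is a minimal illustration, not the sample's actual code: an in-memory BlockingQueue stands in for the WebSphere MQ queue, the class and method names are ours, and the 300 ms window replaces the five-minute default.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.*;
import java.util.concurrent.*;

// Sketch of the stage-2 batch window. Assumption: an in-memory queue stands in
// for the WebSphere MQ queue that the real sample reads via the MQ API.
public class BatchWindowSketch {

    // Read messages for a fixed window, appending each delimited changed-data
    // row to one file per source table, named <TABLE>.dat.
    static Map<String, Path> writeWindow(BlockingQueue<String[]> queue,
                                         Path workDir, long windowMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + windowMillis;
        Map<String, Path> files = new HashMap<>();
        while (System.currentTimeMillis() < deadline) {
            // msg[0] = table name, msg[1] = delimited changed-data row
            String[] msg = queue.poll(50, TimeUnit.MILLISECONDS);
            if (msg == null) continue;
            Path f = files.computeIfAbsent(msg[0], t -> workDir.resolve(t + ".dat"));
            Files.write(f, (msg[1] + System.lineSeparator()).getBytes(),
                        StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }
        return files;
    }

    // Close the window: move every file to the refresh directory; the caller
    // then resumes reading the queue for the next window.
    static void rotate(Map<String, Path> files, Path refreshDir) throws IOException {
        Files.createDirectories(refreshDir);
        for (Path f : files.values()) {
            Files.move(f, refreshDir.resolve(f.getFileName()),
                       StandardCopyOption.REPLACE_EXISTING);
        }
    }

    public static void main(String[] args) throws Exception {
        Path work = Files.createTempDirectory("cdwork");
        BlockingQueue<String[]> q = new LinkedBlockingQueue<>();
        q.add(new String[] {"EMP", "I,10,\"SMITH\""});
        q.add(new String[] {"DEPT", "D,42"});
        Map<String, Path> files = writeWindow(q, work, 300); // 300 ms demo window
        rotate(files, work.resolve("refresh"));
        System.out.println(Files.exists(work.resolve("refresh").resolve("EMP.dat")));  // true
        System.out.println(Files.exists(work.resolve("refresh").resolve("DEPT.dat"))); // true
    }
}
```

Because the rotation happens between windows, the loader stage (stage 3) only ever sees complete files in the refresh directory.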
Nothing runs on Netezza other than two SQL procedures: the REFRESH_SQL procedure mentioned above and the ADDSTMTS procedure that defines a Netezza end-user table to the event consumer program. Instructions on how to create the SQL procedures (and the required SQL tables) are provided in the last section of this article. The event consumer program creates all its files on the local file system (USS) of site 2, not in Netezza.

Initial Load of the Netezza Target Database

Before you start the asnconsumer sample, you first need to load the Netezza end-user tables with data from the DB2 source using your own load procedures. The sample scenario described above delivers a simple way to coordinate the load process with concurrent updates being made to the DB2 tables, without impacting the DB2 source production environment. When the Q Apply program maintains the DB2 for z/OS server at site 2, you can find a reliable point in time in DB2 from which to load Netezza: stop the Q Apply program at a particular point in time, load the data into Netezza using your own load procedure, and restart the Q Apply program when the load is completed for all tables. Since the Q Apply program delivers transaction consistency, this procedure guarantees that Netezza is initially loaded with DB2 data that preserves transaction consistency, without impacting DB2 on site 1. Changes will accumulate in the WebSphere MQ queue on site 2 while Q Apply is stopped, but they will be picked up by Q Apply when it comes back up.

Updating Netezza (more on the third stage)

The event consumer sample program runs on the same system where DB2 and WebSphere MQ run (site 2 in the diagrams of this article) and issues JDBC calls to access the (remote) Netezza target database. Data file names in the local file system are hardcoded as XXX.dat, and the changed data table names are hardcoded as TMPCD_XXX in Netezza, where XXX is the original DB2 table name (without the schema name).
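The naming convention just described is easy to see in code. A small illustration, assuming a hypothetical helper class (the convention is the sample's; the class and method names here are ours):

```java
// Illustration of the hardcoded naming convention: XXX.dat data files and
// TMPCD_XXX changed-data tables, where XXX is the DB2 table name without
// its schema. The helper names are ours, not the sample's.
public class CdNames {

    // Strip the schema qualifier, keep only the table name.
    static String baseName(String db2Table) {
        int dot = db2Table.lastIndexOf('.');
        return dot < 0 ? db2Table : db2Table.substring(dot + 1);
    }

    static String dataFile(String db2Table) { return baseName(db2Table) + ".dat"; }

    static String cdTable(String db2Table)  { return "TMPCD_" + baseName(db2Table); }

    public static void main(String[] args) {
        System.out.println(dataFile("HR.EMPLOYEE")); // EMPLOYEE.dat
        System.out.println(cdTable("HR.EMPLOYEE"));  // TMPCD_EMPLOYEE
    }
}
```

Note that because the schema name is dropped, two DB2 tables that differ only by schema would collide on the same file and changed-data table names.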
Changed data tables are only created when there is changed data for a table, and they are defined as temporary tables in the Netezza database. The sample program uses Netezza "external" tables to load the Netezza changed data tables from the data files (step 3a in the diagram below). External tables process bulk loads of data efficiently, and the sample program loads the changed data tables one at a time. Once the changed data tables are loaded, the sample program calls a SQL procedure named REFRESH_SQL to refresh the Netezza end-user tables. The procedure issues a single delete and a single insert SQL statement for each table (step 3b in the diagram below), and each statement includes its own subquery that identifies the correct set of changed data rows to apply. After the delete and insert statements are executed for the end-user tables, a SQL commit is executed, the data files that were processed are deleted, and the temporary changed data tables are dropped.

The diagram below shows how a Netezza table is updated by populating a temporary staging table (step 3a) and by issuing one delete and one insert SQL statement against the table (step 3b):

The delete statement is always executed before the insert statement, and a DB2 update statement is processed as a delete/insert pair (a Netezza database optimization). The delete subquery ('deleteSubquery' in the diagram above) derives the set of rows to delete from the end-user table based on the data present in the temporary TMPCD_XXX table and the Netezza end-user table. Likewise, the insert subquery ('insertSubquery') derives the set of rows to insert into the end-user table. The subqueries rely on the existence of a primary key for each Netezza end-user table; Netezza allows primary keys to be defined but does not enforce them. The table below explains the logic used by the delete and insert subqueries.
The logic finds the qualifying rows from the temporary TMPCD_XXX table; the delete subquery combines two answer sets. The qualifying rows may be a subset of the initial TMPCD_XXX table rows, because an optimization minimizes the set of rows required:

Subquery: Delete
Qualifying row from the temporary TMPCD_XXX table: Insert or Update changed data row whose after-image key values match the key values of a row in the Netezza end-user table (exist in the Netezza end-user table).
Rationale: Remove the record from the Netezza end-user table so that the Insert or Update statement succeeds when it is executed after the delete.

Subquery: Delete
Qualifying row from the temporary TMPCD_XXX table: Update or Delete changed data row whose before-image key values match the key values of a row in the Netezza end-user table.
Rationale: Remove the "old" record value of an Update statement and remove a regular Delete record from the Netezza end-user table.

Subquery: Insert
Qualifying row from the temporary TMPCD_XXX table: Insert or Update changed data row whose after-image key values do not already exist in the Netezza end-user table. Optimization: if the same after-image key values exist in another row of the TMPCD_XXX table, the key values used are those of the most recent row.
Rationale: Add a regular Insert record and add the "new" record value of an Update statement to the Netezza end-user table. Optimization: when multiple rows exist for the same key values in the TMPCD_XXX table, retain only the most current row. Netezza analytic functions are used to build an intermediary answer set that filters redundant Insert or Update rows before applying the condition listed in the qualifying-row entry above.

The delete and insert statements with their subqueries are issued by the Netezza SQL procedure called REFRESH_SQL. The next section shows you how to download the asnconsumer zip file, which includes the script that you need to run to create the REFRESH_SQL procedure, and highlights the other tasks you need to complete before you run the event consumer sample.
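To make the shape of steps 3a and 3b concrete, here is a minimal Java sketch of the SQL that the solution could generate for a table with a single key column. This is an illustration only: the column names OPTYPE (I/U/D operation type), SEQ (log ordering), and B_<key> (before-image key), the comma delimiter, and the inlining of the delete/insert statements (which the real sample issues from the REFRESH_SQL procedure) are all our assumptions; the actual logic ships in asnconsumer.zip.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch only. OPTYPE, SEQ, and B_<key> are illustrative column names; in the
// real sample the delete/insert statements are issued by the REFRESH_SQL
// procedure rather than inlined as they are here.
public class RefreshSketch {

    // Step 3a: a Netezza transient external table bulk-loads one data file
    // into its TMPCD_XXX changed-data table (REMOTESOURCE 'JDBC' lets the
    // JDBC driver stream a client-side file to the appliance).
    static String externalLoadSql(String table, String dataFile) {
        return "INSERT INTO TMPCD_" + table
             + " SELECT * FROM EXTERNAL '" + dataFile + "'"
             + " USING (DELIMITER ',' REMOTESOURCE 'JDBC')";
    }

    // Step 3b, delete: the union of the two answer sets described above --
    // after-image keys of Insert/Update rows plus before-image keys of
    // Update/Delete rows.
    static String deleteSql(String table, String key) {
        return "DELETE FROM " + table + " WHERE " + key + " IN ("
             + "SELECT " + key + " FROM TMPCD_" + table + " WHERE OPTYPE IN ('I','U')"
             + " UNION "
             + "SELECT B_" + key + " FROM TMPCD_" + table + " WHERE OPTYPE IN ('D','U'))";
    }

    // Step 3b, insert: keep only the most current Insert/Update row per key,
    // using a ROW_NUMBER analytic to filter redundant rows.
    static String insertSql(String table, String key) {
        return "INSERT INTO " + table + " SELECT " + key + ", DATA FROM ("
             + "SELECT " + key + ", DATA,"
             + " ROW_NUMBER() OVER (PARTITION BY " + key + " ORDER BY SEQ DESC) RN"
             + " FROM TMPCD_" + table + " WHERE OPTYPE IN ('I','U')) Q WHERE RN = 1";
    }

    // Delete always runs before insert; one commit covers the whole refresh,
    // after which the temporary changed-data table is dropped.
    static void refresh(Connection con, String table, String key, String dataFile)
            throws SQLException {
        try (Statement st = con.createStatement()) {
            st.execute(externalLoadSql(table, dataFile));
            st.execute(deleteSql(table, key));
            st.execute(insertSql(table, key));
            st.execute("DROP TABLE TMPCD_" + table);
        }
        con.commit();
    }
}
```

Because the delete runs first, every qualifying Insert/Update row can then be inserted without colliding with an existing key in the end-user table, which is why the solution never needs a SQL update statement.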
Putting it all together: the asnconsumer.zip file

Before you run the event consumer sample program, a Q Replication environment must be up and running for a DB2 source. Instructions on how to complete the prerequisite steps and configure your environment are provided in a README file included in the asnconsumer zip file, which you first need to download from the Samples section of the Q+SQL Replication Forum on developerWorks. To configure your environment after downloading the asnconsumer.zip file, follow the steps below:

1. Unzip the asnconsumer.zip file by running the command: unzip asnconsumer.zip
2. Follow the instructions provided in the README file, which includes the following sections for the event consumer sample program:
   ◦ Prerequisites
   ◦ Contents of the asnconsumer.zip file
   ◦ Install Tasks
   ◦ One-time Setup
   ◦ Per Source Table Configuration (required for each Netezza end-user table maintained by the asnconsumer sample)
   ◦ Logic Overview (includes a description of the asnconsumer sample threads)
   ◦ Sample Invocation
   ◦ Parameters

The asnconsumer.zip file includes all the script files you need to run the asnconsumer sample program in your environment. It also includes the asnconsumer.jar file (the executable) and the asnconsumer.java file (the source code). If you have a non-Netezza database and would like to change the asnconsumer sample logic to support it, follow the instructions in the Install Tasks section of the README on how to build a new asnconsumer.jar file after making changes to the original asnconsumer.java source.