ETL TESTING CONCEPTS

Prerequisites for Learning ETL Testing
1. Experience in software testing
2. Good knowledge of databases (Oracle or SQL Server)
3. Good knowledge of UNIX or Linux commands
4. Basic knowledge of an ETL tool (Informatica, DataStage, ODI, etc.)

TESTING ENVIRONMENT FOR ETL
1. Linux/UNIX load server
2. ETL tool
3. WinSCP
4. PuTTY
5. Data warehouse DB (Netezza/Teradata)

TEST DATA PREPARATION
- Missing data (NULL values)
- Data type validation (passing non-numeric data in a numeric field)
- Field length validation (passing a value longer than the defined field length)
- Number-of-columns validation (passing a record with more column values than the table defines)
- Date formats (actual format DD/MM/YYYY, but data passed as YYYY/DD/MM)
- SSN with and without separators (123-45-2389 passed as 123452389)
- Padding of data
- Order of sequence for columns
- Junk data (?, '' and special characters)
- Business-rule dates (e.g., a DOB field must not be greater than the current date, but we pass tomorrow's date)

CONSTRAINT TESTING
During constraint testing, the objective is to validate unique constraints, primary keys, foreign keys, indexes, and relationships. The test script should include these validation points. Some ETL processes can be developed to validate constraints during the loading of the warehouse. The checks, which can be driven from the database's data dictionary as sketched after this list, cover:
- All table names
- Column names for each table
- Column types for each table
- Whether a column allows NULL or not
- Default definitions
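The queries below are a minimal sketch of how these checks can be pulled from the data dictionary, assuming an Oracle target; on SQL Server the INFORMATION_SCHEMA views serve the same purpose.

-- Column names, types, lengths, NULLability and defaults for every table
SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE, DATA_LENGTH, NULLABLE, DATA_DEFAULT
FROM USER_TAB_COLUMNS
ORDER BY TABLE_NAME, COLUMN_ID;

-- Primary key (P), unique (U) and foreign key (R) constraints per table
SELECT TABLE_NAME, CONSTRAINT_NAME, CONSTRAINT_TYPE, R_CONSTRAINT_NAME
FROM USER_CONSTRAINTS
WHERE CONSTRAINT_TYPE IN ('P', 'U', 'R')
ORDER BY TABLE_NAME;

The output can then be compared against the data model document to confirm that the warehouse tables were built as designed.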
SOURCE TO TARGET COUNT TESTING
The objective of the count test scripts is to determine whether the record counts in the source match the record counts in the target. Some ETL processes are capable of capturing record count information such as records read, records written, records in error, etc. If the ETL process used can capture that level of detail and create a list of the counts, allow it to do so; this will save time during the validation process. It is always good practice to use queries to double-check the source-to-target counts.

Example:
1. Record count verification in the source as per the business rule:
SELECT COUNT(*) FROM (SELECT DEPTNO, AVG(SAL), MAX(SAL), MIN(SAL), SUM(SAL) FROM EMP GROUP BY DEPTNO);
2. Record count verification in the target:
SELECT COUNT(*) FROM TRG_AGGR;

SOURCE TO TARGET DATA VALIDATION TESTING
No ETL process is smart enough to perform source-to-target field-to-field validation. This piece of the testing cycle is the most intensive and requires the most thorough analysis of the data.

Example:
Data verification in the source as per the business rule:
SELECT DEPTNO, AVG(SAL), MAX(SAL), MIN(SAL), SUM(SAL) FROM EMP GROUP BY DEPTNO;
Data verification in the target:
SELECT * FROM TRG_AGGR;

Comparing source against target:
SELECT * FROM EMP
MINUS
SELECT * FROM TRG_EMP;

SELECT COUNT(*) FROM EMP
MINUS
SELECT COUNT(*) FROM TRG_EMP;

DATA VALIDATION TESTING
1. Verify that data is transformed correctly according to the various business requirements and rules.
2. Make sure that all projected data is loaded into the data warehouse without any data loss or truncation.
3. Make sure that the ETL application appropriately rejects invalid data, replaces it with default values, and reports it.
4. Make sure that data is loaded into the data warehouse within the prescribed and expected time frames, to confirm improved performance and scalability.

BATCH SEQUENCE & DEPENDENCY TESTING
ETLs in a DW are essentially a sequence of processes that execute in a particular order. Dependencies exist among the various processes, and maintaining them is critical to the integrity of the data. Executing the sequences in the wrong order can result in inaccurate data in the warehouse. The testing process must include at least two iterations of the end-to-end execution of the whole batch sequence, and data must be checked for integrity during this testing. The most common errors caused by an incorrect sequence are referential integrity failures, incorrect end-dating (if applicable), and rejected records.

JOB RESTART TESTING
In a real production environment, ETL jobs/processes fail for a number of reasons (for example, database failures or connectivity failures), and a job can fail when it is only partly executed. A good design always allows a job to be restarted from the point of failure. Although this is more of a design suggestion/approach, it is recommended that every ETL job be built and tested for restart capability.

DUPLICATE TESTING
Duplicate testing must be performed at each stage of the ETL process and in the final target table. This testing involves checking for duplicate rows and also for multiple rows with the same primary key, neither of which can be allowed. Both checks can be automated with simple aggregate queries, as sketched below.
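A minimal sketch of both duplicate checks, reusing the target table TRG_EMP from the earlier examples and assuming EMPNO is its primary key (the column list is illustrative):

-- Multiple rows sharing the same primary key value
SELECT EMPNO, COUNT(*)
FROM TRG_EMP
GROUP BY EMPNO
HAVING COUNT(*) > 1;

-- Fully duplicate rows (all columns identical)
SELECT EMPNO, ENAME, SAL, DEPTNO, COUNT(*)
FROM TRG_EMP
GROUP BY EMPNO, ENAME, SAL, DEPTNO
HAVING COUNT(*) > 1;

Both queries should return zero rows when the target is clean.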
PERFORMANCE TESTING
This is the most important aspect after data validation. Performance testing should check whether the ETL process completes within the load window.

VOLUME TESTING
Verify that the system can process the maximum expected quantity of data for a given cycle in the expected time.

NEGATIVE TESTING
Negative testing checks whether the application fails where it should fail, using invalid inputs and out-of-boundary scenarios, and verifies the behavior of the application in those cases.

INITIAL & INCREMENTAL LOAD TESTING
Type 1 - Update a source value, run the ETL process, and verify that the change was applied.
Type 2 - Update a source value, run the ETL process, and verify that the change was applied to a new row and the old row was end-dated.

INCREMENTAL & SCD TYPE 2 IMPLEMENTATION TESTING
There are a number of universal verifications that have to be carried out for any kind of data warehouse testing. Below is the list of items treated as essential for validation in ETL testing:
- Verify that data transformation from source to destination works as expected
- Verify that the expected data is added to the target system
- Verify that all DB fields and field data are loaded without any truncation
- Verify data checksums for record count matches
- Verify that proper error logs with full details are generated for rejected data
- Verify NULL value fields
- Verify that duplicate data is not loaded
- Verify data integrity

ETL Testing Challenges:
ETL testing is quite different from conventional testing, and there are many challenges in performing data warehouse testing. Here is a list of a few ETL testing challenges I experienced on my project:
- Incompatible and duplicate data.
- Loss of data during the ETL process.
- Unavailability of an inclusive test bed.
- Testers having no privileges to execute ETL jobs on their own.
- Very high volume and complexity of data.
- Faults in business processes and procedures.
- Trouble acquiring and building test data.
- Missing business flow information.

Incremental (Delta) Load Test
Posted by Tootia on Tue Aug 30, 2011:

Hi all,
This is my first post on this forum. We have recently started to implement a dimensional data warehouse for our organisation, and I am the main person in charge of implementing the ETL jobs. We use SAP BO Data Services as the ETL tool. My question is: what is the best and most efficient way of testing the Delta load (whether for Dimension or Fact tables)?

Since each table has a corresponding SQL script (its mapping rules), testing the Initial load is easy: we simply check the target table in the DW against its SQL script in a test case (written and run in MS SQL Server Management Studio). Testing the Delta load, however, is time-consuming and not as straightforward as Initial load testing. For each job there are usually several source tables that contribute to loading the DW table. In the Delta load job, I have a condition on each source table: if a record's modified date is greater than the last ETL load, the record is picked up and contributes to the load; otherwise it is not. To fully test the job, I have to insert and update a record in each source table, run the job, and check whether the record passed through (affected the target record) or not. If a job has only 4 source tables, we need at least 8 test cases to test the Delta load: one test case to insert a new record into each table and one to update an existing record in each table (4 * 2 = 8 test cases). That is why I find testing the Delta load tedious and frustrating. I am wondering whether there is a better way to fully test the ETL job for Incremental load than the approach I explained above.
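One way to reduce the per-record effort is to verify each run set-wise instead of tracing individual records. The queries below are only a sketch of that idea; SRC_CUSTOMER, DW_CUSTOMER, MODIFIED_DATE, CURRENT_FLAG and the :prev_load_time bind variable are hypothetical names, not taken from the post above, and on SQL Server MINUS would be EXCEPT.

-- Every source row changed since the previous load must appear
-- in the target as the current version; zero rows = pass
SELECT CUST_ID, CUST_NAME, CUST_CITY
FROM SRC_CUSTOMER
WHERE MODIFIED_DATE > :prev_load_time
MINUS
SELECT CUST_ID, CUST_NAME, CUST_CITY
FROM DW_CUSTOMER
WHERE CURRENT_FLAG = 'Y';

-- SCD Type 2 sanity check: every business key must have exactly
-- one current row; zero rows = pass
SELECT CUST_ID
FROM DW_CUSTOMER
GROUP BY CUST_ID
HAVING SUM(CASE WHEN CURRENT_FLAG = 'Y' THEN 1 ELSE 0 END) <> 1;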
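The seed data itself can also be scripted once per source table and replayed before every regression run, rather than being hand-built for each cycle. Again, the table, column, and key values are only illustrative (SYSDATE is the Oracle form; GETDATE() on SQL Server):

-- One insert and one update per source table, run before each test cycle
INSERT INTO SRC_CUSTOMER (CUST_ID, CUST_NAME, CUST_CITY, MODIFIED_DATE)
VALUES (99001, 'DELTA-TEST-INSERT', 'TESTVILLE', SYSDATE);

UPDATE SRC_CUSTOMER
SET CUST_CITY = 'DELTA-TEST-UPDATE',
    MODIFIED_DATE = SYSDATE
WHERE CUST_ID = 10001;  -- an existing record reserved for testing

COMMIT;

With the two scripts (seed, then verify), the insert and update paths of all source tables can be covered in a single job run, rather than through two hand-built test cases per table.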