Extract, Transform, Load

Agenda
  Review
    Analysis (Bus Matrix, Info Package)
    Logical Design (Dimensional Modeling)
    Physical Design (Spreadsheet)
    Implementation (Data Mart Relational Tables)
  ETL Process Overview
  ETL Components
    Staging Area
    Extraction
    Transformation
    Loading
  Documenting High-Level ETL Requirements
  Documenting Detailed ETL Flows
  Example ETL

ETL Overview
  Reshaping relevant data from source systems into useful information stored in the DW
  Extract
    Copying and integrating data from OLTP and other data sources in preparation for cleansing and loading into the DW
  Transform
    Cleaning and converting data to prepare it for loading into the DW
  Load
    Putting cleansed and converted data into the DW

ETL Overview, cont…
  Not Really New, BUT…
    Much more data
    Includes rearranging, summarizing
    Data used for strategic decision-making
  Characteristics:
    Process AND technology
    Detailed, highly dependent tasks
    Consumes, on average, 75% of DW development
    An ongoing process for the life of the DW
  Requirements:
    Well-documented
    Automated
    Flexible

Data Warehouse Project Lifecycle
  Source: Mundy, Thornthwaite, and Kimball (2006). The Microsoft Data Warehouse Toolkit, Wiley Publishing Inc., Indianapolis, IN.

High Level Design of ETL
  Initial documentation of:
    ETL Process
    ETL Process “Flow” or Architecture
  What data do we need and where is it coming from (i.e., “E”)?
    Physical DW Design Spreadsheet shown previously
  What are the major transformation/cleansing needs (i.e., “T”)?
    “Extend” Physical DW Design Spreadsheet OR ETL Map
  What’s the sequence of activities for ETL?
    ETL Map

ETL Process
  1. Determine target data
  2. Determine data sources
  3. Prepare data mapping
  4. Organize data staging area
  5. Establish data extraction rules
  6. Establish data transformation rules
  7. Plan aggregate tables
  8. Establish data load procedures
  9. Load dimension tables
  10. Load fact tables

ETL Process Flow
  (Diagram labels: 1, Dim Model; 2, Spreadsheet; 3, Spreadsheet)

Review: Dimensional Modeling

Review: DM Implementation
  DimStudent
    CREATE TABLE DimStudent(
      student_sk int identity(1,1)
      , student_id varchar(9)
      , firstname varchar(30)
      , lastname varchar(30)
      , major varchar(6)
      , classification varchar(25)
      , gpa numeric(3, 2)
      , clubname varchar(25)
      , undergrad_school varchar(25)
      , gmat int
      , undergrad_or_grad varchar(10)
      , CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk));
    GO
  FactEnrollment
    CREATE TABLE fact_enrollment(
      student_sk int
      , class_sk int
      , date_sk int
      , professor_sk int
      , location_sk int
      , termyear_sk int
      , coursegrade numeric(2, 1)
      , CONSTRAINT fact_enrollment_pk PRIMARY KEY (student_sk, class_sk, date_sk, professor_sk)
      , CONSTRAINT fact_enrollment_student_fk FOREIGN KEY (student_sk) REFERENCES dimstudent (student_sk)
      , CONSTRAINT fact_enrollment_class_fk FOREIGN KEY (class_sk) REFERENCES dimclass (class_sk)
      , CONSTRAINT fact_enrollment_date_fk FOREIGN KEY (date_sk) REFERENCES dimtime (date_sk)
      , CONSTRAINT fact_enrollment_professor_fk FOREIGN KEY (professor_sk) REFERENCES dimprofessor (professor_sk)
      , CONSTRAINT fact_enrollment_location_fk FOREIGN KEY (location_sk) REFERENCES dimlocation (location_sk)
      , CONSTRAINT fact_enrollment_termyear_fk FOREIGN KEY (termyear_sk) REFERENCES dimtermyear (termyear_sk));
    GO

Review: DW Physical Design, cont…

ETL Process Flow
  (Diagram labels: 4; 5, SSIS; 6, 7, Map & SSIS; 8, 9, 10, SSIS)

ETL Staging Area
  Information hub, facilitating the enriching stages that data goes through to populate a DW
  Advantages:
    Separates source systems and DW
    Minimizes ETL impact on source AND DW systems
  Can consist of multiple “hubs”
    “upload” area
    “staging” area
    “DW load images”

ETL Staging Area, cont…

Clean, Transform Source Data

Common Transformations
  Format Revisions
  Key Restructuring, Lookup
  Handling of Null Values
  Decoding fields
  Calculated, Derived values
  Merging of Data

Common Transformations, cont…
  Splitting of single fields
  Character set conversion
  Units of measurement conversion
  Date/time conversion
  Summarization
  Deduplication

Common Transformations, cont…
  Other Data Quality Issues
    Standardize values
    Validate values
    Identify mismatches, misspellings
    Etc…
  Data Quality Suggestions:
    Appoint “Data Stewards”
    Ensure ETL programs have control checks
    Data Profiling…

Comparison of Models

Transformations Example
  DimTime
    Create SK
    Insert row w/SK = -1
  DimClass
    Generate SK
    Insert row w/SK = -1
  DimProfessor
    Generate SK
    Insert row w/SK = -1
    Expand rank values (use SQL case)
    Expand department values (join prof to departments)
  DimLocation
    Generate SK
    Insert row w/SK = -1
    Get distinct city/state combinations from student tbl
    Expand state values (needs lookup table)
  DimStudent
    Generate SK
    Insert row w/SK = -1
  DimTermYear
    Generate SK
    Insert row w/SK = -1
    Get distinct term/year combinations from section
  FactEnrollment
    Add SKs: student, section, prof (join registration to all the dims; left join to prof)

Data Profiling
  Systematic analysis of the content of a data source
  Goals:
    Anticipate potential data quality issues upfront
    Build quality corrections and controls into ETL process
  Manual and/or Tool-assisted

Profiling Example: Manual
  CustID | Account Number | Customer Type | Title | First Name | Last Name | Gender | Email | Phone | Address Line1 | Address Line2 | State | Postal Code | Country
  11000 | AW00011000 | I | Mr. | Jon | Yang | F | jon24@adventureworks.com. | 1(11) 500 5550162 | 3761 N. 14th St | (blank) | Queensland | 4700 | AU
  11001 | AW00011001 | I | (blank) | Eugene | Huang | F | eugene10@adventureworks.com. | 500-555-0110 | 2243 W St. | (blank) | Victoria | 3198 | AU
  11002 | AW00011002 | I | (blank) | Ruben | Torres | F | ruben35@advantureworks.com. | 1(11) 500 5550184 | 5844 Linden Dr | (blank) | New South Wales | 7001 | AU
  11003 | AW00011003 | I | (blank) | Christy | Zhu | F | christy12@adventureworks.com. | 1(11) 500 5550162 | 1825 Village Pl. | (blank) | Queensland | 2113 | (blank)
  11004 | AW00011004 | I | Mrs. | Elizabeth | Johnson | F | elizabeth5@adventureworks.com. | (500) 555-0131 | 7553 Harness Circle | (blank) | (blank) | 4169 | OZ
  11005 | AW00011005 | I | (blank) | Julio | Ruiz | M | julio1@adventureworks.com. | 1(11) 500 5550151 | 7305 Humphrey Drive | (blank) | New South Wales | 2500 | AU

Profiling Example: SSIS

Documenting ETL High Level Design
  Add to existing DW Physical Design Spreadsheet

Documenting ETL High Level Design

Low Level Design of ETL Process
  Detailed documentation of:
    What data do we need and where is it coming from?
    What are the major transformation/cleansing needs?
    What’s the sequence of activities for ETL?
  Can use a tool like SSIS

Extracting Data from Sources

Extracting Source Data
  Two forms:
    1. Static Data Capture
      Point-in-time snapshot
      Initial loads and periodic refreshes
    2. Revised Data Capture
      Only data that has been added, updated, or deleted since the last load
      Ongoing incremental loads
      Two timeframes
        Immediate
        Deferred

Static Data Capture
  (T)SQL Scripts
    e.g., small number of tables/rows
  Export/Import Tables
    e.g., database or non-database sources
  Backup/Restore Database
    e.g., copying SQL Server source database for initial load ETL
  Detach/Attach Database
    e.g., copying older SQL Server version to newer SQL Server version for initial load ETL

Revised Data Capture
  Immediate / Real-time
    ETL side: procs get changed data from log in real time and update ETL staging tables
    OLTP side: triggers update ETL staging tables
    OLTP side: apps write to OLTP AND ETL staging tables
  Deferred
    ETL side: procs get changed data from OLTP tables based on timestamps
    ETL side: procs do file comparison
    OLTP side: changed data capture (SS 2008)

Class Performance DW Example
  Create ClassPerformanceDW database
  Using ClassPerformanceDW database…
    Create ClassPerformanceDW tables using SQL Script
    http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql

ETL Example using SQL Scripts
  One "Master Script"
  Calls seven "table transform/load" scripts

"Master" Script
    --be sure to turn on SQLCMD mode (Query > SQLCMD Mode) in order to run this script
    Use ClassPerformanceDW
    print 'loading dimclass table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimclass.sql"
    print 'loading dimprofessor table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimprofessor.sql"
    print 'loading dimstudent table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimstudent.sql"
    print 'loading dimtime table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimtime.sql"
    print 'loading dimlocation table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimlocation.sql"
    print 'loading dimtermyear table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimtermyear.sql"
    print 'loading factenrollment table'
    go
    :r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_factenrollment.sql"

Loading Data into Target Structures

Load "DimProfessor" Script (pg. 1 of 3)
    use ClassPerformanceDW
    go
    set nocount on
    print 'remove existing data from dimprofessor'
    go
    --truncate is not allowed while a foreign key references the table, so drop the FK first
    if object_id('dbo.dimprofessor', 'u') is not null
    begin
      if object_id('dbo.factenrollment', 'u') is not null
      begin
        ALTER TABLE factenrollment drop CONSTRAINT [factenrollment_professor_fk];
      end
      truncate table DimProfessor;
    end
    go
    print 'adding oltp prof data to dimprofessor'
    print 'professor_sk will be automatically inserted'
    insert into dimprofessor
      ( professor_id
      , firstname
      , lastname
      , rank
      , department)
    select prof_id, firstname, lastname, rank, dept
    from regnOLTP.dbo.prof;
    go

Load "DimProfessor" Script (pg. 2 of 3)
    print 'decoding rank field'
    UPDATE dimprofessor
    SET dimprofessor.rank =
      case dimprofessor.rank
        when 'asst' then 'assistant prof'
        when 'assc' then 'associate prof'
        when 'prof' then 'full prof'
        else dimprofessor.rank  --keep unrecognized codes instead of setting them to NULL
      end ;
    go
    print 'decoding department field using imported excel spreadsheet'
    UPDATE dimprofessor
    SET dimprofessor.department = regnOLTP.dbo.departments.deptname
    FROM dimprofessor, regnOLTP.dbo.departments
    WHERE dimprofessor.department = regnOLTP.dbo.departments.deptid ;
    go

Load "DimProfessor" Script (pg. 3 of 3)
    print 'adding SK -1 row'
    --identity_insert must be on to supply an explicit value for the identity column
    set identity_insert dimprofessor on
    go
    insert into dimprofessor
      ( professor_sk
      , professor_id
      , firstname
      , lastname
      , rank
      , department)
    values (-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
    GO
    set identity_insert dimprofessor off
    go
    Set nocount off

Load "FactEnrollment" Script
    print 'adding oltp registration data to factenrollment'
    insert into factenrollment
      ( student_sk
      , class_sk
      , date_sk
      , professor_sk
      , location_sk
      , termyear_sk
      , coursegrade)
    --rows with no match in the left-joined dims fall back to the SK = -1 "unknown" row
    select student_sk, class_sk, date_sk, coalesce(professor_sk, -1), coalesce(location_sk, -1), termyear_sk, final_grade
    from (((((((regnOLTP.dbo.registration
      INNER JOIN dimstudent on regnOLTP.dbo.registration.stud_id = dimstudent.student_id)
      INNER JOIN dimclass on regnOLTP.dbo.registration.crn = dimclass.crn)
      INNER JOIN dimtime on CONVERT(varchar(10), regnOLTP.dbo.registration.regn_date, 101) = dimtime.actualdate)
      INNER JOIN regnOLTP.dbo.section on dimclass.crn = regnOLTP.dbo.section.crn)
      INNER JOIN dimtermyear on regnOLTP.dbo.section.term = dimtermyear.term AND regnOLTP.dbo.section.year = dimtermyear.year)
      INNER JOIN regnOLTP.dbo.student on regnOLTP.dbo.student.stud_id = regnOLTP.dbo.registration.stud_id)
      LEFT JOIN dimprofessor on regnOLTP.dbo.section.prof_id = dimprofessor.professor_id)
      LEFT JOIN dimlocation on regnOLTP.dbo.student.city = dimlocation.city AND regnOLTP.dbo.student.state = dimlocation.state_abbreviation;
    go

Entire Transform/Load "Package"
  http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip

Documenting ETL Low Level Design: SSIS
  Comes with SQL Server
  Helps document and automate ETL process
  Based on defining
    Packages
    Tasks
  One approach
    A package for each target table
    A "master" package

SSIS Package Examples: Master

SSIS Package Examples: Extract All

SSIS Package Examples: Extract Changed using CDC
  E.g., SELECT * from cdccustomer WHERE cdc_chg_date > etl_last_capture_date;

SSIS Package Examples: Transforms

SSIS Package Examples: Load
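
Sketch: Deferred Revised Data Capture by Timestamp
  The deferred, ETL-side option under Revised Data Capture (procs get changed data based on timestamps) and the query on the "Extract Changed using CDC" slide can be illustrated roughly as follows. This is only a minimal sketch, not the course's SSIS package: it is written in Python with an in-memory SQLite database standing in for SQL Server so that it runs anywhere. The names cdccustomer, cdc_chg_date, and etl_last_capture_date come from the slide; the column list, the sample change dates, and both helper functions are hypothetical.

```python
import sqlite3


def make_sample_source():
    """Build an in-memory stand-in for the source table.

    The columns and the change dates are made-up sample values.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE cdccustomer (cust_id INTEGER, name TEXT, cdc_chg_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO cdccustomer VALUES (?, ?, ?)",
        [
            (11000, "Jon Yang", "2024-01-01"),
            (11001, "Eugene Huang", "2024-01-05"),
            (11002, "Ruben Torres", "2024-01-09"),
        ],
    )
    return conn


def extract_changed(conn, etl_last_capture_date):
    """Return rows changed after the last capture plus the new capture date.

    ISO-formatted date strings sort chronologically, so a plain string
    comparison is enough for this sketch.
    """
    rows = conn.execute(
        "SELECT cust_id, name, cdc_chg_date FROM cdccustomer "
        "WHERE cdc_chg_date > ? ORDER BY cdc_chg_date",
        (etl_last_capture_date,),
    ).fetchall()
    # Advance the capture date only when something was actually extracted.
    new_capture_date = rows[-1][2] if rows else etl_last_capture_date
    return rows, new_capture_date


if __name__ == "__main__":
    source = make_sample_source()
    changed, capture_date = extract_changed(source, "2024-01-03")
    for row in changed:
        print(row)
    print("next etl_last_capture_date:", capture_date)
```

  Saving the returned capture date after a successful load is what makes the next run incremental. A strict > comparison can miss rows that are written later with the same timestamp as the saved value, so a real job would likely need a finer-grained timestamp or a sequence number.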