Initial Loads

Extract, Transform, Load
Agenda
 Review
    Analysis (Bus Matrix, Info Package)
    Logical Design (Dimensional Modeling)
    Physical Design (Spreadsheet)
    Implementation (Data Mart Relational Tables)
 ETL Process Overview
 ETL Components
    Staging Area
    Extraction
    Transformation
    Loading
 Documenting High-Level ETL Requirements
 Documenting Detailed ETL Flows
 Example ETL
ETL Overview
 Reshaping relevant data from source systems
into useful information stored in the DW
 Extract
 Copying and integrating data from OLTP and
other data sources in preparation for cleansing
and loading into the DW
 Transform
 Cleaning and converting data to prepare it for
loading into the DW
 Load
 Putting cleansed and converted data into the DW
ETL Overview, cont…
 Not Really New, BUT…
 Much more data
 Includes rearranging, summarizing
 Data used for strategic decision-making
 Characteristics:
 Process AND technology
 Detailed, highly-dependent tasks
 Consumes an average of 75% of DW development
 An on-going process for life of DW
 Requirements:
 Well-documented
 Automated
 Flexible
Data Warehouse Project Lifecycle
Source: Mundy, Thornthwaite, and Kimball (2006). The Microsoft Data Warehouse Toolkit. Wiley Publishing, Inc., Indianapolis, IN.
High Level Design of ETL
 Initial documentation of:
 ETL Process
 ETL Process “Flow” or Architecture
 What data do we need and where is it coming
from (i.e., “E”)?
 Physical DW Design Spreadsheet shown previously
 What are the major transformation/cleansing
needs (i.e., “T”)?
 “Extend” Physical DW Design Spreadsheet OR
 ETL Map
 What’s the sequence of activities for ETL?
 ETL Map
ETL Process
1. Determine target data
2. Determine data sources
3. Prepare data mapping
4. Organize data staging area
5. Establish data extraction rules
6. Establish data transformation rules
7. Plan aggregate tables
8. Establish data load procedures
9. Load dimension tables
10. Load fact tables
ETL Process Flow
[Figure: ETL process flow diagram. Annotations: step 1, Dim Model; step 2, Spreadsheet; step 3, Spreadsheet.]
Review: Dimensional Modeling
Review: DM Implementation
DimStudent
CREATE TABLE DimStudent(
  student_sk         int identity(1,1)
, student_id         varchar(9)
, firstname          varchar(30)
, lastname           varchar(30)
, major              varchar(6)
, classification     varchar(25)
, gpa                numeric(3, 2)
, clubname           varchar(25)
, undergrad_school   varchar(25)
, gmat               int
, undergrad_or_grad  varchar(10)
, CONSTRAINT dimstudent_pk PRIMARY KEY (student_sk));
GO
FactEnrollment
CREATE TABLE factenrollment(
  student_sk     int
, class_sk       int
, date_sk        int
, professor_sk   int
, location_sk    int
, termyear_sk    int
, coursegrade    numeric(2, 1)
, CONSTRAINT factenrollment_pk PRIMARY KEY
    (student_sk, class_sk, date_sk, professor_sk)
, CONSTRAINT factenrollment_student_fk FOREIGN KEY (student_sk)
    REFERENCES dimstudent (student_sk)
, CONSTRAINT factenrollment_class_fk FOREIGN KEY (class_sk)
    REFERENCES dimclass (class_sk)
, CONSTRAINT factenrollment_date_fk FOREIGN KEY (date_sk)
    REFERENCES dimtime (date_sk)
, CONSTRAINT factenrollment_professor_fk FOREIGN KEY (professor_sk)
    REFERENCES dimprofessor (professor_sk)
, CONSTRAINT factenrollment_location_fk FOREIGN KEY (location_sk)
    REFERENCES dimlocation (location_sk)
, CONSTRAINT factenrollment_termyear_fk FOREIGN KEY (termyear_sk)
    REFERENCES dimtermyear (termyear_sk));
GO
Review: DW Physical Design, cont…
ETL Process Flow
[Figure: ETL process flow diagram. Annotations: step 4; step 5, SSIS; steps 6 and 7, Map & SSIS; steps 8, 9, and 10, SSIS.]
ETL Staging Area
 Information hub, facilitating the enriching
stages that data goes through to populate a DW
 Advantages:
 Separates source systems and DW
 Minimizes ETL impact on source AND DW systems
 Can consist of multiple “hubs”
 “upload” area
 “staging” area
 “DW load images”
ETL Staging Area, cont…
Clean, Transform Source Data
Common Transformations
 Format Revisions
 Key Restructuring, Lookup
 Handling of Null Values
 Decoding fields
 Calculated, Derived values
 Merging of Data
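Several of the transformations listed above can be sketched in a single T-SQL query. This is only an illustration, not part of the course scripts: the table variable @stg_student, its columns, and the code values ('FR', 'GR') are assumptions made up for the example.

```sql
-- Hypothetical staging rows; names and code values are illustrative only.
DECLARE @stg_student TABLE (
    stud_id   varchar(9),
    firstname varchar(30),
    lastname  varchar(30),
    class_cd  char(2),          -- coded classification
    major     varchar(6) NULL
);

INSERT INTO @stg_student (stud_id, firstname, lastname, class_cd, major) VALUES
    ('000000001', ' ann ', 'lee',   'FR', NULL),
    ('000000002', 'Bob',   'SMITH', 'GR', 'MIS');

SELECT
    stud_id,
    -- format revision: trim padding and normalize case
    UPPER(LTRIM(RTRIM(lastname))) AS lastname,
    -- merging of data: one display name built from two fields
    LTRIM(RTRIM(firstname)) + ' ' + LTRIM(RTRIM(lastname)) AS fullname,
    -- decoding: expand a code into a readable value
    CASE class_cd
        WHEN 'FR' THEN 'freshman'
        WHEN 'GR' THEN 'graduate'
        ELSE 'unknown'
    END AS classification,
    -- null handling: substitute an explicit default
    COALESCE(major, 'undecl') AS major,
    -- derived value: computed from another column
    CASE WHEN class_cd = 'GR' THEN 'grad' ELSE 'undergrad' END AS undergrad_or_grad
FROM @stg_student;
```

The multi-row VALUES clause assumes SQL Server 2008 or later.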
Common Transformations, cont…
 Splitting of single fields
 Character set conversion
 Units of measurement conversion
 Date/time conversion
 Summarization
 Deduplication
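A rough T-SQL sketch of splitting, date conversion, deduplication, and summarization follows. The table variable @stg_reg and its combined city_state column are hypothetical; the split assumes every value contains a comma, which real data would need to be checked for first.

```sql
-- Hypothetical staging rows; the combined city_state field is illustrative only.
DECLARE @stg_reg TABLE (
    stud_id    varchar(9),
    crn        int,
    regn_date  varchar(10),     -- date held as mm/dd/yyyy text
    city_state varchar(40)
);

INSERT INTO @stg_reg (stud_id, crn, regn_date, city_state) VALUES
    ('000000001', 1001, '01/15/2010', 'Waco, TX'),
    ('000000001', 1001, '01/15/2010', 'Waco, TX'),   -- exact duplicate
    ('000000002', 1001, '01/16/2010', 'Austin, TX');

-- deduplication: keep one row per (stud_id, crn)
WITH ranked AS (
    SELECT stud_id, crn, regn_date, city_state,
           ROW_NUMBER() OVER (PARTITION BY stud_id, crn ORDER BY regn_date) AS rn
    FROM @stg_reg
)
SELECT
    stud_id,
    crn,
    -- date/time conversion: style 101 is mm/dd/yyyy
    CONVERT(date, regn_date, 101) AS regn_date,
    -- splitting of a single field into two
    LTRIM(RTRIM(LEFT(city_state, CHARINDEX(',', city_state) - 1))) AS city,
    LTRIM(RTRIM(SUBSTRING(city_state, CHARINDEX(',', city_state) + 1, 40))) AS state
FROM ranked
WHERE rn = 1;

-- summarization: distinct students per class
SELECT crn, COUNT(DISTINCT stud_id) AS students_enrolled
FROM @stg_reg
GROUP BY crn;
```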
Common Transformations, cont…
 Other Data Quality Issues
 Standardize values
 Validate values
 Identifying mismatches, misspellings
 Etc…
 Data Quality Suggestions:
 Appoint “Data Stewards”
 Ensure ETL programs have control checks
 Data Profiling…
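One possible form of an ETL control check is a row-count reconciliation between source and target after a load. The sketch below assumes the regnOLTP.dbo.prof source and dimprofessor target used in the load scripts later in this deck, and that the only extra target row is the SK = -1 'unknown' row; the error text is made up.

```sql
-- Control check sketch: compare source and target row counts after loading dimprofessor.
DECLARE @src_rows int, @tgt_rows int;

SELECT @src_rows = COUNT(*) FROM regnOLTP.dbo.prof;

-- exclude the 'unknown' placeholder row (SK = -1) from the target count
SELECT @tgt_rows = COUNT(*) FROM dimprofessor WHERE professor_sk <> -1;

IF @src_rows <> @tgt_rows
    RAISERROR('dimprofessor has %d rows but the source has %d rows', 16, 1, @tgt_rows, @src_rows);
ELSE
    PRINT 'dimprofessor row count check passed';
```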
Comparison of Models
Transformations Example
 All six dimensions (DimTime, DimClass, DimProfessor, DimLocation, DimStudent, DimTermYear)
    Generate SK
    Insert row w/SK = -1
 DimProfessor
    Expand rank values (use SQL case)
    Expand department values (join prof to departments)
 DimLocation
    Get distinct city/state combinations from student tbl
    Expand state values (needs lookup table)
 DimTermYear
    Get distinct term/year combinations from section
 FactEnrollment
    Add SKs: student, section, prof (join registration to all the dims; left join to prof)
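As an illustration of the DimLocation transformations, the distinct city/state step and the state lookup could be combined in one INSERT. This is a sketch under assumptions: the city and state_abbreviation columns appear in the fact load script, but the state_name column and the regnOLTP.dbo.state_lookup table are hypothetical names, and location_sk is assumed to be an identity column.

```sql
-- Sketch only: state_name and regnOLTP.dbo.state_lookup are hypothetical names.
INSERT INTO dimlocation (city, state_abbreviation, state_name)
SELECT DISTINCT
    s.city,
    s.state,
    -- expand the abbreviation via the lookup; fall back when no match exists
    COALESCE(l.state_name, 'unknown')
FROM regnOLTP.dbo.student AS s
LEFT JOIN regnOLTP.dbo.state_lookup AS l
    ON l.state_abbreviation = s.state;
```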
Data Profiling
 Systematic analysis of the content of a data
source
 Goals:
 Anticipate potential data quality issues upfront
 Build quality corrections and controls into ETL
process
 Manual and/or Tool-assisted
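Manual profiling is often just a handful of aggregate queries per column. The sketch below profiles the state column of regnOLTP.dbo.student (a table and column referenced by the fact load script); which columns are worth profiling is an assumption for illustration.

```sql
-- Column profile: volume, missing values, cardinality, and length range.
SELECT
    COUNT(*)                                       AS total_rows,
    SUM(CASE WHEN state IS NULL THEN 1 ELSE 0 END) AS null_states,
    COUNT(DISTINCT state)                          AS distinct_states,
    MIN(LEN(state))                                AS min_length,
    MAX(LEN(state))                                AS max_length
FROM regnOLTP.dbo.student;

-- Value distribution: rare values are often misspellings or nonstandard codes.
SELECT state, COUNT(*) AS row_count
FROM regnOLTP.dbo.student
GROUP BY state
ORDER BY row_count DESC;
```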
Profiling Example: Manual
CustID | Account Number | Customer Type | Title | First Name | Last Name | Gender | Email | Phone | Address Line1 | Address Line2 | State | Postal Code | Country
11000 | AW00011000 | I | Mr. | Jon | Yang | F | jon24@adventureworks.com. | 1(11) 500 5550162 | 3761 N. 14th St | | Queensland | 4700 | AU
11001 | AW00011001 | I | | Eugene | Huang | F | eugene10@adventureworks.com. | 500-555-0110 | 2243 W St. | | Victoria | 3198 | AU
11002 | AW00011002 | I | | Ruben | Torres | F | ruben35@advantureworks.com. | 1(11) 500 5550184 | 5844 Linden Dr | | New South Wales | 7001 | AU
11003 | AW00011003 | I | | Christy | Zhu | F | christy12@adventureworks.com. | 1(11) 500 5550162 | 1825 Village Pl. | | Queensland | 2113 |
11004 | AW00011004 | I | Mrs. | Elizabeth | Johnson | F | elizabeth5@adventureworks.com. | (500) 555-0131 | 7553 Harness Circle | | New South Wales | 2500 | AU
11005 | AW00011005 | I | | Julio | Ruiz | M | julio1@adventureworks.com. | 1(11) 500 5550151 | 7305 Humphrey Drive | | | 4169 | OZ
Profiling Example: SSIS
Documenting ETL High Level Design
 Add to existing DW Physical Design
Spreadsheet
Documenting ETL High Level Design
Low Level Design of ETL Process
 Detailed documentation of:
 What data do we need and where is it coming
from?
 What are the major transformation/cleansing
needs?
 What’s the sequence of activities for ETL?
 Can use tool like SSIS
Extracting Data from Sources
Extracting Source Data
 Two forms:
1. Static Data Capture
    Point-in-time snapshot
    Initial loads and periodic refreshes
2. Revised Data Capture
    Only data that has been added, updated, or deleted since last load
    Ongoing incremental loads
    Two timeframes: Immediate, Deferred
Static Data Capture
 (T)SQL Scripts
 e.g., small number of tables/rows
 Export/Import Tables
 e.g., database or non-database sources
 Backup/Restore Database
 e.g., copying sqlserver source database for initial
load ETL
 Detach/Attach Database
 e.g., copying older sqlserver version to newer
sqlserver version for initial load ETL
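Two of these static capture options can be sketched in T-SQL. The staging table name stg_prof, the file paths, the copy database name, and the logical file names in the MOVE clauses are all assumptions; RESTORE FILELISTONLY reports the real logical names for a given backup.

```sql
-- (T)SQL script option: snapshot a small source table into a new staging table.
-- stg_prof is a hypothetical name; SELECT ... INTO creates the table.
SELECT *
INTO stg_prof
FROM regnOLTP.dbo.prof;

-- Backup/restore option: copy the whole source database for an initial load.
-- Paths and logical file names below are placeholders.
BACKUP DATABASE regnOLTP
TO DISK = N'C:\backups\regnOLTP.bak'
WITH INIT;

-- List the logical file names contained in the backup.
RESTORE FILELISTONLY FROM DISK = N'C:\backups\regnOLTP.bak';

RESTORE DATABASE regnOLTP_copy
FROM DISK = N'C:\backups\regnOLTP.bak'
WITH MOVE N'regnOLTP'     TO N'C:\data\regnOLTP_copy.mdf',
     MOVE N'regnOLTP_log' TO N'C:\data\regnOLTP_copy_log.ldf';
```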
Revised Data Capture
 Immediate / Real-time
    ETL side: procs get changed data from log in real time and update ETL staging tables
    OLTP side: triggers update ETL staging tables
    OLTP side: apps write to OLTP AND ETL staging tables
 Deferred
    ETL side: procs get changed data from OLTP tables based on timestamps
    ETL side: procs do file comparison
    OLTP side: changed data capture (SQL Server 2008)
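The deferred, timestamp-based option might look like the sketch below. It assumes a last_modified datetime column on the source table, a staging table stg_registration, and a one-row-per-table control table etl_control; all three are hypothetical and not part of the course scripts. Note that timestamps alone do not reveal deleted rows.

```sql
-- Deferred capture sketch: pull rows changed since the previous ETL run.
-- etl_control, stg_registration, and last_modified are hypothetical names.
DECLARE @last_capture datetime;
DECLARE @this_capture datetime = GETDATE();

SELECT @last_capture = last_capture_date
FROM etl_control
WHERE table_name = 'registration';

INSERT INTO stg_registration (stud_id, crn, regn_date, final_grade)
SELECT stud_id, crn, regn_date, final_grade
FROM regnOLTP.dbo.registration
WHERE last_modified >  @last_capture
  AND last_modified <= @this_capture;   -- fixed upper bound avoids missing rows changed mid-run

-- Record the new high-water mark for the next run.
UPDATE etl_control
SET last_capture_date = @this_capture
WHERE table_name = 'registration';
```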
Class Performance DW Example
 Create ClassPerformanceDW database
 Using ClassPerformanceDW database…
 Create ClassPerformanceDW tables using SQL
Script

http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables/create_class_performance_dw_tables.sql
ETL Example using SQL Scripts
 One "Master Script"
 Calls seven "table transform/load" scripts
"Master" Script
--be sure to turn on Query, SQLCMD mode in order to run this script
Use ClassPerformanceDW
print 'loading dimclass table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimclass.sql"
print 'loading dimprofessor table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimprofessor.sql"
print 'loading dimstudent table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimstudent.sql"
print 'loading dimtime table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimtime.sql"
print 'loading dimlocation table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimlocation.sql"
print 'loading dimtermyear table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_dimtermyear.sql"
print 'loading factenrollment table'
go
:r "Z:\Documents\Gina\teaching\sqlserver\scripts\generate_class_performance_dw_tables\load_factenrollment.sql"
Loading Data into Target Structures
Load "DimProfessor" Script (pg. 1 of 3)
use ClassPerformanceDW
go
set nocount on
print 'remove existing data from dimprofessor'
go
if object_id('dbo.dimprofessor', 'u') is not null
begin
if object_id('dbo.factenrollment', 'u') is not null
begin
ALTER TABLE factenrollment drop CONSTRAINT [factenrollment_professor_fk];
end
truncate table DimProfessor;
end
go
print 'adding oltp prof data to dimprofessor'
print 'professor_sk will be automatically inserted'
insert into dimprofessor (
professor_id
, firstname
, lastname
, rank
, department)
select
prof_id, firstname, lastname, rank, dept
from
regnOLTP.dbo.prof;
go
Load "DimProfessor" Script (pg. 2 of 3)
print 'decoding rank field'
UPDATE dimprofessor
SET dimprofessor.rank = case dimprofessor.rank
        when 'asst' then 'assistant prof'
        when 'assc' then 'associate prof'
        when 'prof' then 'full prof'
        else dimprofessor.rank -- keep unrecognized codes instead of setting them to NULL
    end
;
go
print 'decoding department field using imported excel spreadsheet'
UPDATE dimprofessor
SET dimprofessor.department = regnOLTP.dbo.departments.deptname
FROM dimprofessor
INNER JOIN regnOLTP.dbo.departments
    ON dimprofessor.department = regnOLTP.dbo.departments.deptid
;
go
Load "DimProfessor" Script (pg. 3 of 3)
print 'adding SK -1 row'
set identity_insert dimprofessor on
go
insert into dimprofessor (
professor_sk
, professor_id
, firstname
, lastname
, rank
, department)
values
(-1, -1, 'unknown', 'unknown', 'unknown', 'unknown');
GO
set identity_insert dimprofessor off
go
Set nocount off
Load "FactEnrollment" Script
print 'adding oltp registration data to factenrollment'
insert into factenrollment (
  student_sk
, class_sk
, date_sk
, professor_sk
, location_sk
, termyear_sk
, coursegrade)
select
  ds.student_sk
, dc.class_sk
, dt.date_sk
, coalesce(dp.professor_sk, -1) -- unmatched professors map to the 'unknown' row (SK = -1)
, coalesce(dl.location_sk, -1)  -- unmatched locations map to the 'unknown' row (SK = -1)
, dty.termyear_sk
, r.final_grade
from regnOLTP.dbo.registration as r
INNER JOIN dimstudent as ds on r.stud_id = ds.student_id
INNER JOIN dimclass as dc on r.crn = dc.crn
INNER JOIN dimtime as dt on CONVERT(varchar(10), r.regn_date, 101) = dt.actualdate
INNER JOIN regnOLTP.dbo.section as sec on dc.crn = sec.crn
INNER JOIN dimtermyear as dty on sec.term = dty.term AND sec.year = dty.year
INNER JOIN regnOLTP.dbo.student as st on st.stud_id = r.stud_id
LEFT JOIN dimprofessor as dp on sec.prof_id = dp.professor_id
LEFT JOIN dimlocation as dl on st.city = dl.city AND st.state = dl.state_abbreviation;
go
Entire Transform/Load "Package"
http://business.baylor.edu/gina_green/teaching/sqlserver/scripts/generate_class_performance_dw_tables.zip
Documenting ETL Low Level Design: SSIS
 Comes with SQL Server
 Helps document and automate ETL process
 Based on defining
 Packages
 Tasks
 One approach
 A package for each target table
 A "master" package
SSIS Package Examples: Master
SSIS Package Examples: Extract All
SSIS Package Examples: Extract Changed using CDC
E.g., SELECT * FROM cdccustomer WHERE cdc_chg_date > etl_last_capture_date;
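For comparison, SQL Server's built-in change data capture (SQL Server 2008 and later) exposes changes through generated functions rather than a hand-built change table. The sketch below assumes CDC is enabled on dbo.registration in regnOLTP with the default capture instance name dbo_registration; it also requires a supporting edition, a running SQL Server Agent, and appropriate permissions. A real package would read from the last processed LSN rather than the minimum LSN.

```sql
USE regnOLTP;
GO
-- Enable CDC for the database, then for one source table.
EXEC sys.sp_cdc_enable_db;
GO
EXEC sys.sp_cdc_enable_table
    @source_schema = N'dbo',
    @source_name   = N'registration',
    @role_name     = NULL;          -- no gating role for this sketch
GO
-- Read all changes captured so far for the default capture instance.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_registration');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_registration(@from_lsn, @to_lsn, N'all');
-- __$operation: 1 = delete, 2 = insert, 4 = update (after image)
```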
SSIS Package Examples: Transforms
SSIS Package Examples: Load