Designing the Data Warehouse for High Performance

High Performance Data Warehouse
Design and Construction
ETL Processing
prepared by
Stephen A. Brobst
sbrobst@alum.mit.edu
(617) 422-0800
Copyright © 2000, 2001. Stephen A. Brobst. Do not duplicate or distribute without written permission.
ETL Processing
[Diagram: operational data flows through data transformation into the enterprise warehouse and integrated data marts, and is replicated out to dependent data marts or departmental warehouses; IT users work against the upstream systems, business users against the warehouse and marts.]
Data Acquisition from OLTP Systems
Why is it hard?
 Multiple source systems technologies.
 Inconsistent data representations.
 Multiple sources for the same data element.
 Complexity of required transformations.
 Scarcity and cost of legacy cycles.
 Volume of legacy data.
Data Acquisition from OLTP Systems
Many possible source systems technologies:
* Flat files
* VSAM
* IMS
* IDMS
* DB2 (many flavors)
* Adabas
* Excel
* Access
* Oracle
* Informix
* Sybase
* Ingres
* Model 204
* DBF Format
* RDB
* RMS
* Compressed files
* Many others...
Data Acquisition from OLTP Systems
Inconsistent data representation: Same data, different domain
values...
Examples:
 Date value representations:
- 1996-02-14
- 02/14/1996
- 14-FEB-1996
- 960214
- 14485
 Gender value representations:
- M/F
- M/F/PM/PF
- 0/1
- 1/2
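A minimal sketch of domain-value normalization for the examples above, in Python. The format list, the per-system gender codings, and the day-serial epoch are illustrative assumptions that must be confirmed against each source system.

```python
from datetime import date, datetime, timedelta

# Candidate source formats from the examples above; ordering matters when a
# string could plausibly match more than one pattern.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d-%b-%Y", "%y%m%d"]

def normalize_date(value, serial_epoch=None):
    """Map any known source representation to a single canonical date."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    # Day serials like "14485" count days from a system-specific epoch;
    # the correct epoch must come from the source system's documentation.
    if value.isdigit() and serial_epoch is not None:
        return serial_epoch + timedelta(days=int(value))
    raise ValueError(f"unrecognized date representation: {value!r}")

# Gender codes conflict across systems ("1" may mean F in one system and M
# in another), so the mapping must be keyed by source system. The codings
# shown here are assumptions for illustration only.
GENDER_MAPS = {
    "hr":      {"M": "M", "F": "F", "PM": "M", "PF": "F"},
    "billing": {"0": "M", "1": "F"},
}

def normalize_gender(value, source):
    return GENDER_MAPS[source][value]
```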
Data Acquisition from OLTP Systems
Multiple sources for the same data element:
 Need to establish precedence between source
systems on a per data element basis.
 Take data element from source system with
highest precedence where element exists.
 Must sometimes establish “group precedence”
rules to maintain data integrity.
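A sketch of per-element and group precedence in Python; the source names and precedence orders are hypothetical.

```python
# Per-element precedence: take the value from the highest-precedence source
# in which the element exists (sources listed best-first).
PRECEDENCE = {
    "email":   ["crm", "billing", "legacy"],
    "address": ["billing", "crm", "legacy"],
}

def resolve(element, values_by_source):
    for source in PRECEDENCE[element]:
        value = values_by_source.get(source)
        if value is not None:
            return value
    return None

# Group precedence: street, city, and zip must all come from the same
# source, otherwise the assembled address may mix systems and lose integrity.
ADDRESS_GROUP = ("street", "city", "zip")

def resolve_group(records_by_source, order=("billing", "crm", "legacy")):
    for source in order:
        record = records_by_source.get(source, {})
        if all(record.get(f) is not None for f in ADDRESS_GROUP):
            return {f: record[f] for f in ADDRESS_GROUP}
    return dict.fromkeys(ADDRESS_GROUP)
```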
Data Acquisition from OLTP Systems
Complexity of required transformations:
 Simple scalar transformations.
– 0/1 => M/F
 One to many element transformations.
– 6x30 address field => street1, street2, city, state, zip
 Many to many element transformations.
– Householding and Individualization of customer
records
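To make the first two levels concrete, a sketch in Python. The 6 x 30 address handling is deliberately naive (production systems use address-standardization software), and householding is far too involved to show here.

```python
import re

# Simple scalar transformation: 0/1 => M/F.
def to_gender(code):
    return {"0": "M", "1": "F"}[code]

# One-to-many transformation: a legacy 6 x 30-character address block
# becomes discrete columns. The pattern match is only a sketch.
def split_address(block):
    lines = [block[i:i + 30].strip() for i in range(0, 180, 30)]
    lines = [l for l in lines if l]
    m = re.match(
        r"(?P<city>.+?),?\s+(?P<state>[A-Z]{2})\s+(?P<zip>\d{5}(?:-\d{4})?)$",
        lines[-1])
    return {
        "street1": lines[0],
        "street2": lines[1] if len(lines) > 2 else None,
        "city":  m.group("city")  if m else None,
        "state": m.group("state") if m else None,
        "zip":   m.group("zip")   if m else None,
    }
```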
Data Acquisition from OLTP Systems
Scarcity and cost of legacy cycles:
 Generally want to off-load transformation
cycles to open systems environment.
 Often requires new skill sets.
 Need efficient and easy way to deal with
mainframe data formats such as EBCDIC
and packed decimal.
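Both conversions are mechanical once the data is off the mainframe. A sketch using Python's built-in EBCDIC code pages and a hand-rolled COMP-3 (packed decimal) decoder; code page 037 is an assumption, since the right one depends on the source system.

```python
from decimal import Decimal

def unpack_comp3(raw, scale=0):
    """Decode IBM packed decimal (COMP-3): two BCD digits per byte, with
    the low nibble of the last byte holding the sign (0xD = negative)."""
    digits = []
    for b in raw[:-1]:
        digits += [b >> 4, b & 0x0F]
    digits.append(raw[-1] >> 4)
    sign = -1 if (raw[-1] & 0x0F) == 0x0D else 1
    value = 0
    for d in digits:
        value = value * 10 + d
    return Decimal(sign * value).scaleb(-scale)

# Character data converts with the standard library's EBCDIC code pages.
assert b"\xC2\xD9\xD6\xC2\xE2\xE3".decode("cp037") == "BROBST"
# Packed field X'12345C' with two implied decimal places is +123.45.
assert unpack_comp3(b"\x12\x34\x5C", scale=2) == Decimal("123.45")
```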
Data Acquisition from OLTP Systems
Volume of legacy data:
 Need lots of processing and I/O to effectively
handle large data volumes.
 2GB file limit in older versions of UNIX is not
acceptable for handling legacy data - need full
64-bit file system.
 Need efficient interconnect bandwidth to
transfer large amounts of data from legacy
sources.
Data Acquisition from OLTP Systems
What does the solution look like?
 Meta data driven transformation architecture.
 Modular software solutions with component
building blocks.
 Parallel software and hardware architectures.
Data Acquisition from OLTP Systems
Meta data driven transformation architecture:
 Need multiple meta data structures.
– Source meta data
– Target meta data
– Transformation meta data
 Must avoid “hard coding” for maintainability.
 Automatic generation of transformations from meta
data structures.
 Meta data repository ideally accessible by APIs and
end user tools.
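A toy illustration of the idea, with the rule repository reduced to a Python list. Real products keep these rules in a shared meta data repository and generate the transformations from them; the field names and rule types here are invented.

```python
from datetime import datetime

# Transformation meta data: each rule maps a source field to a target field.
# Adding a mapping is a meta data change, not a code change -- no hard coding.
TRANSFORM_RULES = [
    {"source": "GNDR_CD",  "target": "gender",     "type": "map",
     "arg": {"0": "M", "1": "F"}},
    {"source": "BIRTH_DT", "target": "birth_date", "type": "date",
     "arg": "%y%m%d"},
    {"source": "CUST_NM",  "target": "name",       "type": "copy", "arg": None},
]

def apply_rules(row, rules=TRANSFORM_RULES):
    """Generic engine that interprets the rules against one source row."""
    out = {}
    for rule in rules:
        value = row.get(rule["source"])
        if rule["type"] == "map":
            out[rule["target"]] = rule["arg"].get(value)
        elif rule["type"] == "date":
            out[rule["target"]] = datetime.strptime(value, rule["arg"]).date()
        elif rule["type"] == "copy":
            out[rule["target"]] = value
    return out

row = {"GNDR_CD": "1", "BIRTH_DT": "960214", "CUST_NM": "SMITH, JANE"}
# apply_rules(row) -> {'gender': 'F', 'birth_date': date(1996, 2, 14),
#                      'name': 'SMITH, JANE'}
```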
Data Acquisition from OLTP Systems
Modular software structures with component
building blocks:
 Want a data flow driven transformation
architecture that supports multiple processing
steps.
 Meta data structures should map inputs and
outputs between each transformation module.
 Leverage pre-packaged tools for
transformation steps wherever possible.
Data Acquisition from OLTP Systems
Parallel software and hardware architectures:
 Use data parallelism (partitioning) to allow
concurrent execution of multiple job streams.
 Software architecture must allow efficient repartitioning of data between steps in the
transformation process.
 Want powerful parallel hardware architectures
with many processors and I/O channels.
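A sketch of the partition / process / repartition pattern, using Python's multiprocessing as a stand-in for a parallel ETL engine; the keys and the placeholder transform are illustrative.

```python
from multiprocessing import Pool

def partition(rows, key, n):
    """Hash-partition rows on a key so partitions can run concurrently."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[hash(row[key]) % n].append(row)
    return parts

def transform(part):
    # Placeholder transformation step operating on one partition.
    return [dict(r, amount_cents=r["amount"] * 100) for r in part]

if __name__ == "__main__":
    rows = [{"acct": i % 50, "cust": i % 7, "amount": i} for i in range(10_000)]
    with Pool(4) as pool:
        # Step 1 partitions on account; step 2 needs the data grouped by
        # customer, so rows are repartitioned on the new key in between.
        step1 = pool.map(transform, partition(rows, "acct", 4))
        merged = [r for part in step1 for r in part]
        step2 = pool.map(transform, partition(merged, "cust", 4))
```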
A Word of Warning
The data quality in the source systems will be
much worse than what you expect.
 Must allocate explicit time and resources to
facilitate data clean-up.
 Data quality is a continuous improvement
process - must institute TQM program to be
successful.
 Use “house of quality” technique to prioritize
and focus data quality efforts.
ETL Processing
It is important to look at the big picture.
Data acquisition time may include:
 Extracts from source systems.
 Data movement.
 Transformations.
 Data loading.
 Index maintenance.
 Statistics collection.
 Summary data maintenance.
 Data mart construction.
 Backups.
Loading Strategies
Once we have transformed data, there are three
primary loading strategies:
1. Full data refresh with “block slamming” into
empty tables.
2. Incremental data refresh with “block
slamming” into existing (populated) tables.
3. Trickle feed with continuous data acquisition
using row level insert and update operations.
Loading Strategies
We must also worry about rolling off “old” data
as its economic value drops below the cost for
storing and maintaining it.
[Diagram: new data rolls into the warehouse as old data rolls off.]
Loading Strategies
Choice in loading strategy depends on tradeoffs
in data freshness and performance, as well as
data volatility characteristics.
What is the goal?
 Increased data freshness.
 Increased data loading performance.
[Diagram: increased data freshness implies real-time availability with low update rates; increased loading performance implies minimal load time and high update rates, with delayed availability.]
Loading Strategies
Should consider:
 Data storage requirements.
 Impact on query workloads.
 Ratio of existing to new data.
 Insert versus update workloads.
Loading Strategies
Tradeoffs in data loading with a high percentage of data changes per data block:
[Chart: Rows/CPU/Sec (0-40,000) versus Rows/DB affected (0-600) for Full Refresh (Table Copy), Incremental Update, Incremental Insert, Shadow Table + Insert-Select, and Trickle Feed.]
Loading Strategies
Tradeoffs in data loading with a low percentage of data changes per data block:
[Chart: Rows/CPU/Sec (0-500) versus Rows/DB affected for Incremental Update, Incremental Insert, Trickle Feed, Shadow Table + Insert-Select, and Table Copy.]
Full Refresh Strategy
Completely re-load table on each refresh.
Step 1: Load table using block slamming.
Step 2: Build indexes.
Step 3: Collect statistics.
This is a good (simple) strategy for small tables
or when a high percentage of rows in the data
changes on each refresh (greater than 10%).
e.g., reference lookup tables or account tables where
balances change on each refresh.
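The three steps as they might look driven from a script, assuming a DB-API style cursor and an external block-loading utility; the COLLECT STATISTICS syntax shown is Teradata's, and every name here is illustrative.

```python
def full_refresh(cursor, block_load, source_file):
    # Step 1: block-slam into the empty table (no logging, no indexes, no RI).
    cursor.execute("DELETE FROM ref_lookup")          # start from empty
    block_load(table="ref_lookup", path=source_file)  # e.g. a FastLoad wrapper
    # Step 2: build indexes only after the data is in place.
    cursor.execute("CREATE INDEX ref_lookup_code_ix ON ref_lookup (code)")
    # Step 3: refresh optimizer statistics (Teradata-flavored syntax).
    cursor.execute("COLLECT STATISTICS ON ref_lookup COLUMN code")
```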
Full Refresh Strategy
Performance hints:
 Remove referential integrity (RI) constraints from table definitions for loading operations.
– Assume that data cleansing takes place in transformations.
 Remove secondary index specifications from table definition.
– Build indices after table has been loaded.
 Make sure target table logging is disabled during loads.
Full Refresh Strategy
Consider using “shadow” tables to allow refresh
to take place without impacting query
workloads.
1. Load shadow table.
2. Replace-view operation to redirect queries to the refreshed table, making the new data visible (sketched below).
Trades storage for availability.
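A sketch of the switch, assuming two physical tables that alternate roles behind one view; REPLACE VIEW is Teradata syntax (CREATE OR REPLACE VIEW elsewhere), and all names are illustrative.

```python
def refresh_with_shadow(cursor, block_load, source_file, live, shadow):
    # Queries always reference account_v, never a physical table, so the
    # reload is invisible until the view is switched.
    cursor.execute(f"DELETE FROM {shadow}")
    block_load(table=shadow, path=source_file)
    cursor.execute(f"REPLACE VIEW account_v AS SELECT * FROM {shadow}")
    # The previous live table becomes the shadow for the next refresh cycle.
    return shadow, live
```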
Incremental Refresh Strategy
Incrementally load new data into existing target
table that has already been populated from
previous loads.
Two primary strategies:
1. Incremental load directly into target table.
2. Use shadow table load followed by insert-select
operation into target table.
Incremental Refresh Strategy
Design considerations for incremental load directly into
target table using RDBMS utilities:
 Indices should be maintained automatically.
 Re-collect statistics if table demographics have changed significantly.
 Typically requires a table lock to be taken during the block slamming operation.
 Do you want to allow for “dirty” reads?
 Logging behavior differs across RDBMS products.
Incremental Refresh Strategy
Design considerations for shadow table
implementation:
 Use block slamming into empty “shadow” table having identical structure to target table.
 Staging space required for shadow table.
 Insert-select operation from shadow table to target table will preserve indices.
 Locking will normally escalate to table level lock.
 Beware of log file size constraints.
 Beware of performance overhead for logging.
 Beware of rollbacks if operation fails for any reason.
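Once the shadow is loaded, the variant reduces to two statements; a sketch with illustrative names. The single insert-select is exactly where the logging and rollback cautions above apply.

```python
def incremental_via_shadow(cursor, block_load, source_file):
    cursor.execute("DELETE FROM sales_shadow")        # empty, identical structure
    block_load(table="sales_shadow", path=source_file)
    # Set-oriented move into the live table; indexes are maintained as rows
    # land, and the whole statement rolls back if it fails partway through.
    cursor.execute("INSERT INTO sales SELECT * FROM sales_shadow")
```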
Incremental Refresh Strategy
Both incremental load strategies described preserve index structures
during the loading operation.
However, there is a cost to maintaining indexes during the loads...
 Rule-of-thumb: Each secondary index maintained during the load
costs 2-3 times the resources of the actual row insertion of data into
a table.
 Rule-of-thumb: Consider dropping and re-building index structures
if the number of rows being incrementally loaded is more than 10%
of the size of the target table.
Note: Drop and re-build of secondary indices may not be acceptable
due to availability requirements of the DW.
Trickle Feed
Acquire data on a continuous basis into
RDBMS using row level SQL insert and
update operations.
 Data is made available to DW “immediately” rather than waiting for batch loading to complete.
 Much higher overhead for data acquisition on a per record basis as compared to batch strategies.
 Row level locking mechanisms allow queries to proceed during data acquisition.
 Typically relies on Enterprise Application Integration (EAI) for data delivery.
Trickle Feed
A tradeoff exists between data freshness and
insert efficiency:
 Buffering rows for insertion allows for fewer round trips to RDBMS...
 … but waiting to accumulate rows into the buffer impacts data freshness.
Suggested approach: Use a threshold that
buffers up to M rows, but never waits more
than N seconds before sending a buffer of
data for insertion.
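A minimal sketch of the M-rows-or-N-seconds threshold; in production the time-based flush would also fire from a timer thread so a quiet feed cannot strand rows in the buffer.

```python
import time

class TrickleBuffer:
    """Flush after max_rows accumulate or max_wait seconds pass,
    whichever comes first."""
    def __init__(self, insert_fn, max_rows=500, max_wait=5.0):
        self.insert_fn = insert_fn    # sends one multi-row INSERT round trip
        self.max_rows = max_rows
        self.max_wait = max_wait
        self.rows = []
        self.first_arrival = None

    def add(self, row):
        if not self.rows:
            self.first_arrival = time.monotonic()
        self.rows.append(row)
        if (len(self.rows) >= self.max_rows
                or time.monotonic() - self.first_arrival >= self.max_wait):
            self.flush()

    def flush(self):
        if self.rows:
            self.insert_fn(self.rows)
            self.rows = []
            self.first_arrival = None
```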
ELT versus ETL
There are two fundamental approaches to data
acquisition:
 ETL is extract, transform, load, in which transformation takes place on a transformation server using either an “engine” or generated code.
 ELT is extract, load, transform in which data
transformations take place in the relational
database on the data warehouse server.
Of course, hybrids are also possible...
ETL Processing
ETL processing performs the transform
operations prior to loading data into the
RDBMS.
1. Extract data from the source systems.
2. Transform data into a form consistent with
the target tables.
3. Load the data into the target tables (or to
shadow tables).
ETL Processing
ETL processing is typically performed using
resources on the source systems platform(s)
or a dedicated transformation server.
[Diagram: source systems perform pre-transformations and feed a transformation server, which loads the data warehouse.]
ETL Processing
Perform the transformations on the source
system platform if available resources exist
and there is significant data reduction that can
be achieved during the transformations.
Perform the transformations on a dedicated
transformation server if the source systems
are highly distributed, lack capacity, or have
high cost per unit of computing.
ETL Processing
Two approaches for ETL processing:
1. Engine: ETL processing using an
interpretive engine for applying transformation
rules based on meta data specifications.
- e.g., Ascential, Informatica
2. Code Generation: ETL processing using
code generated based on meta data
specification.
- e.g., Ab Initio, ETI
ELT Processing
 First, load “raw” data into empty tables using RDBMS block slamming utilities.
 Next, use SQL to transform the “raw” data into
a form appropriate to the target tables.
– Ideally, the SQL is generated using a meta data driven tool
rather than hand coding.
 Finally, use insert-select into the target table for incremental loads, or view switching if a full refresh strategy is used (sketch below).
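A sketch of the transformation step expressed as SQL against a raw staging table, driven from a script; the CASE and the FORMAT cast are Teradata-flavored assumptions, and all table and column names are illustrative.

```python
TRANSFORM_SQL = """
INSERT INTO customer (cust_id, gender, birth_date)
SELECT  src_id,
        CASE gndr_cd WHEN '0' THEN 'M' WHEN '1' THEN 'F' END,
        CAST(birth_dt AS DATE FORMAT 'YYMMDD')  -- Teradata-style format cast
FROM    customer_raw
WHERE   src_id IS NOT NULL
"""

def elt_load(cursor, block_load, source_file):
    cursor.execute("DELETE FROM customer_raw")           # staging table
    block_load(table="customer_raw", path=source_file)   # e.g. FastLoad
    cursor.execute(TRANSFORM_SQL)   # set-oriented transform inside the RDBMS
```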
ELT Processing
The DW server is the transformation server for ELT processing.
[Diagram: files from source systems move over network or channel connections and are loaded into the data warehouse with Teradata FastLoad.]
ELT Processing
 ELT processing obviates the need for a separate transformation server.
– Assumes that spare capacity exists on the DW server to support transformation operations.
 ELT leverages the built-in scalability and manageability of the parallel RDBMS and HW platform.
 Must allocate sufficient staging area space to support the load of raw data and execution of the transformation SQL.
 Works well only for batch oriented transforms because SQL is optimized for set processing.
Bottom Line
• ETL is a significant task in any DW deployment.
• Many options for data loading strategies: need to
evaluate tradeoffs in performance, data freshness,
and compatibility with source systems environment.
• Many options for ETL/ELT deployment: need to
evaluate tradeoffs in where and how transformations
should be applied.