Data Warehouse - Computer Information Systems

Data Warehouse
DATA TRANSFORMATION
Extract Transform Insert
•Extract data from operational system, transform and insert into data warehouse
Why ETI?
•Will your warehouse produce correct information with the current data?
•How can I ensure warehouse credibility?
Excuses for NOT
Transforming Legacy Data
•Old data works fine, new will work as well.
•Data will be fixed at point of entry through GUI.
•If needed, data will be cleaned after new system populated; after proof-of-concept pilot.
•Keys join the data most of the time.
•Users will not agree to modifying or standardizing their data.
Levels of Migration Problem
•Existing metadata is insufficient and unreliable
  •Metadata must hold for all occurrences
  •Metadata must represent business and technical attributes
•Data values incorrectly typed and accessible
  •Value form extracted from storage
  •Value meaning inferred from its content
•Entity keys unreliable or unavailable
  •Inferred from related values
Metadata Challenge
•Metadata gets out of synch with details it summarizes
  •Business grows faster than systems designed to capture business info
•Not at the right level of detail
  •Multiple values in a single field
  •Multiple meanings to a single field
  •No fixed format for value
•Expressed in awkward or limited terms
  •Program/compiler view rather than business view
Character-level Challenge
•Value instance level
  •Spelling, aliases
  •Abbreviations, truncations, transpositions
  •Inconsistent storage formats
•Named type level
  •Multiple meanings, contextual meanings
  •Synonyms, homonyms
•Entity level
  •No common keys or representation
  •No integrated view across records, files, systems
Some Data Quality examples
•The magic shrinking vendor file
•127 ways to spell...
•Data surprises in individual fields
•Cowbirds and Data Fields
•Magic numbers and embedded intelligence
The Magic Shrinking Vendor File
A medical claims processor was having trouble with their Insurance Vendor file. They thought they had 300,000 Insurance Vendors.
When they cleaned up their data, they discovered they had only 27,000 unique Insurance Vendors.
127 ways to spell...
•Over 127 different ways to spell AT&T
•Over 1000 ways to spell duPont
Data surprises in individual fields
Meta (field names):
NAME                      SOC. SEC. #    TELEPHONE
Actual Data Values:
Denise Mario DBA          228-02-1975    6173380300
Marc Di Lorenzo ETAL      999999999      3380321
Tom & Mary Roberts        025-37-1888
First Natl Provident      34-2671434     415-392-2000
Digital 15 State St.      101010101      508-466-1200
Astorial Fedrl Savings    LN#12-756      212-235-1000
Kevin Cooke, Receiver     18-7534216     FAX 528-9825
John Doe Trustee for K    111111111      5436
Source: Vality
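
One way to surface surprises like these is to profile a field by the shape of its values rather than trusting the field name. The sketch below is illustrative only: the helper name value_pattern and the idea of mapping digits to 9 and letters to A are assumptions, not something the slides prescribe. It is applied to the SOC. SEC. # values shown above.

```python
import re
from collections import Counter

def value_pattern(value):
    """Reduce a value to its shape: every digit becomes 9, every letter becomes A."""
    shape = re.sub(r"[0-9]", "9", value)
    return re.sub(r"[A-Za-z]", "A", shape)

# The SOC. SEC. # column from the slide above.
ssn_values = [
    "228-02-1975", "999999999", "025-37-1888", "34-2671434",
    "101010101", "LN#12-756", "18-7534216", "111111111",
]

# Count how many values share each shape; rare or unexpected shapes
# (tax IDs, loan numbers, filler digits) stand out immediately.
pattern_counts = Counter(value_pattern(v) for v in ssn_values)
for pattern, count in pattern_counts.most_common():
    print(f"{pattern:<12} {count}")
```

Only two of the eight values have the expected 999-99-9999 shape; the rest look like employer tax IDs, a loan number, or filler digits, which is the kind of finding the field name alone would hide.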
Cowbirds and Data Fields
•Cowbirds lay their eggs in other birds' nests
•Users put data fields that are not used to other purposes
Magic Numbers and Embedded Intelligence
Customer Number = XXXX-YY-ZZ
XXXX = 1st 4 Positions of Zip Code
If YY = 00-70 Then Cust = Pharmacy
If YY = 80-89 Then Cust = Hospital
Except if YY = 82 and ZZ = ** Which Means...
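
The embedded intelligence above can be made explicit during transformation by decoding the key into separate fields. This is a minimal sketch under stated assumptions: the function name and the returned field names are hypothetical, codes outside the two listed ranges are labeled "Unknown", and the YY = 82 / ZZ = ** exception is only flagged, because the slide does not say what it means.

```python
def decode_customer_number(customer_number):
    """Split XXXX-YY-ZZ and turn the embedded codes into explicit fields."""
    zip_prefix, yy, zz = customer_number.split("-")
    if yy == "82" and zz == "**":
        # The slide elides the meaning of this exception, so it is only flagged.
        customer_type = "Exception (meaning not given)"
    elif yy.isdigit() and 0 <= int(yy) <= 70:
        customer_type = "Pharmacy"
    elif yy.isdigit() and 80 <= int(yy) <= 89:
        customer_type = "Hospital"
    else:
        # Assumption: anything the rules do not cover is reported, not guessed.
        customer_type = "Unknown"
    return {
        "zip_prefix": zip_prefix,
        "type_code": yy,
        "suffix": zz,
        "customer_type": customer_type,
    }

print(decode_customer_number("0213-45-01"))
print(decode_customer_number("0213-82-**"))
```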
Orr's Laws of Data Quality
Law #1 - “Data that is not used cannot be correct!”
Law #2 - “Data quality is a function of its use, not its collection!”
Law #3 - “Data will be no better than its most stringent use!”
Law #4 - “Data quality problems increase with the age of the system!”
Law #5 - “Data quality laws apply equally to metadata!”
Law #6 - “The less likely something is to occur, the more traumatic it will be when it happens!”
Legacy Data Contaminants Found in Migrations
•Lack of standards
•Data surprises in individual fields
•Legacy information buried in free form fields
•Legacy myopia: multiple account numbers block consolidated view
•Anomaly nightmare: complex matching and consolidation
4 Fundamental Types of Transformation
•Simple Transformation
  •Fundamental building blocks of all data transformations
  •One field at a time
•Cleansing and Scrubbing
  •Ensure consistent formatting and usage of field or related group of fields
  •Checks valid values
4 Fundamental Types of Transformation (cont'd)
•Integration
  •Takes operational data from one or more sources and maps it, field by field, to new data structure
•Aggregation and Summarization
  •Remove low level of detail
  •Data for data mart
Simple Transformation
•Convert data element from one type to another
  •semantic value same
  •rename elements
•Date time conversion
  •standard warehouse format
•Decode encoded fields
  •M F vs C S MM
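
The three simple transformations above (rename, date conversion, decode) each touch one field at a time. The sketch below shows one possible shape for them; the source field names, the list of source date formats, and the code table are all hypothetical, and ISO 8601 is assumed as the "standard warehouse format".

```python
from datetime import datetime

# Hypothetical source layouts and code tables, for illustration only.
SOURCE_DATE_FORMATS = ("%m/%d/%Y", "%Y%m%d", "%Y-%m-%d")
FIELD_RENAMES = {"CUST_NM": "customer_name", "BRTH_DT": "birth_date", "SEX_CD": "gender"}
GENDER_CODES = {"M": "Male", "F": "Female"}

def to_warehouse_date(raw):
    """Try each known source format and return the date as YYYY-MM-DD."""
    text = raw.strip()
    for fmt in SOURCE_DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date value: {raw!r}")

def decode(code, table):
    """Replace an encoded value with its meaning; unknown codes are kept visible."""
    return table.get(code.strip().upper(), "Unknown")

def transform_record(source):
    """Rename elements, then convert and decode one field at a time."""
    renamed = {FIELD_RENAMES.get(name, name): value for name, value in source.items()}
    renamed["birth_date"] = to_warehouse_date(renamed["birth_date"])
    renamed["gender"] = decode(renamed["gender"], GENDER_CODES)
    return renamed

row = transform_record({"CUST_NM": "Tom Roberts", "BRTH_DT": "03/15/1960", "SEX_CD": "m"})
print(row)
```

Raising on an unrecognized date, rather than guessing, is a deliberate choice here: a silent default would hide exactly the surprises the earlier slides warn about.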
Cleansing and Scrubbing
•Actual content examined
  •Range checking
  •Enumerated lists
  •Dependency checking
•Uniform representation for dw
  •address information
    •parse to components
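
The checks named above can be sketched as small rules applied to each record. Everything specific here is an assumption made for illustration: the field names, the 0 to 120 age range, the list of valid states, the closed-account rule, and the very small set of street suffixes the address parser recognizes.

```python
import re

VALID_STATES = {"MA", "NY", "CA"}  # hypothetical enumerated list

def check_record(record):
    """Return a list of problems found by examining the actual content."""
    problems = []
    # Range check: value must fall inside an agreed interval.
    if not 0 <= record["age"] <= 120:
        problems.append("age out of range")
    # Enumerated list: value must be one of the allowed codes.
    if record["state"] not in VALID_STATES:
        problems.append("state not in valid list")
    # Dependency check: one field constrains another.
    if record["status"] == "closed" and not record.get("close_date"):
        problems.append("closed account without close_date")
    return problems

ADDRESS_RE = re.compile(r"^(\d+)\s+(.*?)\s+(St|Ave|Rd|Blvd)\.?$")

def parse_address(line):
    """Parse a simple street address into components, or return None."""
    match = ADDRESS_RE.match(line.strip())
    if match is None:
        return None
    number, street, suffix = match.groups()
    return {"number": number, "street": street, "suffix": suffix}

print(check_record({"age": 130, "state": "ZZ", "status": "closed", "close_date": None}))
print(parse_address("15 State St."))
```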
Integration
•Simple field level mappings: 80-90%
•Complex integration
  •No common identifier
    •probable matches
    •2-stage process, isolation/reconciliation
  •Multiple sources for same target element
    •contradictory
  •Missing data
  •Derived/calculated data
    •redundant?
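
When records share no common identifier, probable matches have to be found from the values themselves. The sketch below is one possible reading of a two-stage approach, not necessarily what the slide intends: stage 1 isolates candidates by grouping normalized names on their first character, and stage 2 reconciles them by scoring similarity with Python's difflib. The 0.85 threshold and the sample names are assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

def normalize(name):
    """Lowercase and drop everything except letters and digits."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def probable_matches(names, threshold=0.85):
    # Stage 1 (isolation): group names so only plausible candidates are compared.
    blocks = defaultdict(list)
    for name in names:
        key = normalize(name)
        if key:
            blocks[key[0]].append((name, key))
    # Stage 2 (reconciliation): score each candidate pair inside a group.
    pairs = []
    for candidates in blocks.values():
        for (name_a, key_a), (name_b, key_b) in combinations(candidates, 2):
            score = SequenceMatcher(None, key_a, key_b).ratio()
            if score >= threshold:
                pairs.append((name_a, name_b, round(score, 2)))
    return pairs

names = ["AT&T", "A T & T", "duPont", "Du Pont", "Digital"]
pairs = probable_matches(names)
for name_a, name_b, score in pairs:
    print(name_a, "<->", name_b, score)
```

Pairs above the threshold are only candidates; contradictory values between the matched records still need a survivorship rule or a human decision.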
Aggregation and Summarization
•Summarization is the addition of like values along one or more business dimensions
  •add daily sales by stores for monthly sales by region
•Aggregation is the addition of different business elements into common total
  •daily product sales plus monthly consulting sales give monthly combined sales amount
•Details of process available in metadata
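
The two examples above can be worked through with a few rows of made-up data. In this sketch the row layout, the figures, and the function names are all illustrative assumptions; the first function summarizes like values (store sales) along the time and region dimensions, and the second aggregates two different business elements (product and consulting sales) into one total.

```python
from collections import defaultdict

# Hypothetical daily product sales: (date, region, store, amount).
daily_store_sales = [
    ("1997-03-01", "East", "S1", 100.0),
    ("1997-03-02", "East", "S2", 150.0),
    ("1997-03-01", "West", "S3", 200.0),
    ("1997-04-01", "East", "S1", 50.0),
]
# Hypothetical monthly consulting sales, keyed by month.
monthly_consulting = {"1997-03": 500.0, "1997-04": 300.0}

def summarize_monthly_by_region(rows):
    """Summarization: add like values (sales) along month and region."""
    totals = defaultdict(float)
    for date, region, _store, amount in rows:
        totals[(date[:7], region)] += amount
    return dict(totals)

def aggregate_combined(product_rows, consulting_by_month):
    """Aggregation: add different business elements into a common monthly total."""
    combined = defaultdict(float)
    for date, _region, _store, amount in product_rows:
        combined[date[:7]] += amount
    for month, amount in consulting_by_month.items():
        combined[month] += amount
    return dict(combined)

monthly_by_region = summarize_monthly_by_region(daily_store_sales)
combined_sales = aggregate_combined(daily_store_sales, monthly_consulting)
print(monthly_by_region)
print(combined_sales)
```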
Data Re-engineering Problem
•Programming for the unknown
  •Unanticipated values, structures and patterns
•Programming for noise and uncertainty
  •Conflicting and missing values
•Programming for productivity and efficiency
  •Changing data values, changing user requirements
  •High volumes, non-linear searches
Conventional data transformation methods do not solve the metadata and data value challenges; data re-engineering is needed. (Stephen Brown, Vality Corp.)
Data Re-engineering Process
Sources:
•External Files
•Legacy Applications
•Historical Extracts
Steps:
•Data Investigation and Metadata Mining
•Data Standardization
•Data Integration
•Data Survivorship and Formatting
Targets:
•Customer Information Systems
•Data Warehouses
•Client/server Applications
•Consolidations
Natural Laws of Data Re-engineering
•Data has no standard
•You can’t predict or legislate format or content
•Data will evolve faster than its capture and storage systems
•You can’t write rules for what you don’t know and can’t see
•Instructions for handling data are within the data
•Don’t trust the metadata, make the data reveal itself
•Revealed metadata is knowledge about the business
•Revealed metadata validates warehouse design
•Revealed metadata supports conversion project management
•Revealed metadata is insurance against misinformation
Buy tool or manually code programs?
3 - DW Tools
Technologies
2nd Generation ETL Suites / Environments
Repositories
DB & System Monitors
Meta Data Browsers
Data Visualization
Data Mining
Job Schedulers
DB Design
Replication/Distribution Tools
CASE
EIS
MOLAP/ROLAP/LowLAP
Q&R/MQE/MRE
RDBMS Utilities
1st Generation ETL
Universal Repositories
Processes
•Design
•Mapping
•Extract
•Scrub
•Transform
•Load
•Index
•Aggregation
•Replication
•Data Set Distribution
Meta Data
System Monitoring
•Access & Analysis
•Resource Scheduling & Distribution
Transformation
•Choosing between tool and manually coded programs
  •Time frames - tools take longer
    •select, configure, learn
  •Budgets - short term or long term
  •Size of warehouse - initial project small enough for coding
  •Size and skills of warehouse team
  •Tool automatically generates and maintains metadata
Hand Generated Code
•Upside
  •No learning curve
  •Inherent skills
  •In house capabilities
  •Usually simple
  •No culture change/mandate (CASE)
•Downside
  •Manual meta data
  •Maintenance challenge when talent level changes
  •No automation
Tools
•Upside
  •Easy to maintain as talent level changes
  •Automatic meta data
  •May gain efficiencies
  •Integration with repositories
  •Integration with other tools
    •Schedulers
    •Monitors
    •Meta data management
Tools
•Downside
  •Cost (1st generation tools very high $)
  •Learning curve
  •Enforced culture change
    •Must use tool for all changes
  •Speed, may be slower to implement
  •May require additional resources
Manual Code / 1st Generation ETL Tools Process
[Diagram: on the source mainframe or C/S system, source OLTP systems feed an Extract Program and a Transform Program, which write a file. A File Transfer Program moves the file to the data warehouse client/server system, where a Load Program, an Index Program and an Aggregation Program run. Job scheduling and control, and meta data load/maintenance, are external.]
Copyright © 1997, Enterprise Group, Ltd.
2nd Generation ETL Tools Process
[Diagram: source OLTP systems on the source mainframe or C/S system feed a transformation engine on its own C/S system, which populates the data warehouse or data mart on a C/S system.]
Transformation Engine (with caching):
•Monitoring
•Scheduling
•Extraction
•Scrubbing
•Transformation
•Load
•Index
•Aggregation
•Meta Data Load
•Meta Data Maint.
Copyright © 1997, Enterprise Group, Ltd.
2nd Generation ETL Environment Process
[Diagram: source OLTP systems on the source mainframe or C/S system feed a transformation engine on a C/S system, which populates a data warehouse and several data marts, each on a C/S system. Enterprise meta data sits alongside the engine and serves the user process.]
User Process:
•Surf Meta Data
•Request Resource
•Schedule Delivery
Transformation Engine (with caching):
•Monitoring
•Scheduling
•Extraction
•Scrubbing
•Transformation
•Load
•Index
•Aggregation
•Meta Data Load
•Meta Data Maint.
•Request Broker
Copyright © 1997, Enterprise Group, Ltd.
1st Generation ETL Tools
Hampered by:
•High cost (average deal prices in the $250-400k range)
•Long learning curves
•Perceived value (most teams felt they could write better code)
•Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor)
•Core capabilities (complex transformations still required manual code)
•Management requirements (users still had to manage all the programs generated)
•Performance issues (the resulting programs could not leverage parallelism)
Important 2nd Generation ETL tool features:
•Transformation engine design
•Ability to leverage parallel server technology
•CDC (Change Data Capture, which allows only the new data to be extracted)
•Incremental aggregation (ability to add CDC incremental data to existing aggregations)
•Limited or no use of temporary files or data base tables (virtual caching only)
•Common, open and extensible meta data repository
•Enterprise scalability
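
CDC and incremental aggregation from the list above fit together naturally: extract only what changed since the last run, then fold it into the totals already stored. The sketch below assumes a timestamp-based form of CDC with a modified_at column and a stored high-water mark; real tools may instead read database logs or use triggers, and this simple version handles new rows only, not updates or deletes. All names and figures are hypothetical.

```python
def extract_changes(source_rows, high_water_mark):
    """CDC by timestamp: keep only rows modified after the last extract."""
    return [row for row in source_rows if row["modified_at"] > high_water_mark]

def apply_increment(aggregates, changed_rows):
    """Incremental aggregation: add the new rows to the existing totals."""
    updated = dict(aggregates)
    for row in changed_rows:
        updated[row["region"]] = updated.get(row["region"], 0.0) + row["amount"]
    return updated

# ISO 8601 timestamps compare correctly as plain strings.
source_rows = [
    {"region": "East", "amount": 100.0, "modified_at": "1997-03-01T09:00:00"},
    {"region": "West", "amount": 200.0, "modified_at": "1997-03-01T10:00:00"},
    {"region": "East", "amount": 50.0, "modified_at": "1997-03-02T08:30:00"},
]
existing_aggregates = {"East": 100.0, "West": 200.0}  # totals as of the last run
last_run = "1997-03-01T23:59:59"

changes = extract_changes(source_rows, last_run)
new_aggregates = apply_increment(existing_aggregates, changes)
print(changes)
print(new_aggregates)
```

The point of the design is that the work per run is proportional to the number of changed rows, not to the size of the source table.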
Important 2nd Generation ETL tool features:
•Common UI (User Interface) across all tools
•Extensive selection of transformation algorithms
•Easily extensible scrub and transform algorithm library
•Extensive heterogeneous source and target support
•Native OLAP data set target support
•System monitoring & management
•Enterprise meta data repository (content, resources, structure, etc.)
•Transform once, populate many (populate multiple targets with a single transformation output)
Important 2nd Generation ETL tool features:
•Integrated enterprise scale scrubbing capabilities
•Seamless interoperability with external point solution tools
•Integrated information access, analysis, scheduling and delivery
•Aggregate aware information request broker (enables virtual data warehouse)
•Ad hoc aggregation monitoring and management
•Pipeline parallelism / very high throughput
•Native drivers (source and target)
OLTP <> OLAP
•OLTP
  •normalized
•OLAP
  •tools must provide multidimensional conceptual view of data ("Providing OLAP to User Analysts", E.F. Codd)
  •redundant data
Multidimensional Model
•Data stored as facts and dimensions
•Sales Fact Cube