Data Warehouse: DATA TRANSFORMATION

Extract - Transform - Insert (ETI)
- Extract data from the operational system, transform it, and insert it into the data warehouse.

Why ETI?
- Will your warehouse produce correct information with the current data?
- How can I ensure warehouse credibility?

Excuses for NOT Transforming Legacy Data
- The old data works fine; the new will work as well.
- Data will be fixed at the point of entry through the GUI.
- If needed, data will be cleaned after the new system is populated, after a proof-of-concept pilot.
- Keys join the data most of the time.
- Users will not agree to modifying or standardizing their data.

Levels of Migration Problem
- Existing metadata is insufficient and unreliable
  - Metadata must hold for all occurrences
  - Metadata must represent business and technical attributes
- Data values are incorrectly typed or not directly accessible
  - A value's form must be extracted from storage
  - A value's meaning must be inferred from its content
- Entity keys are unreliable or unavailable
  - Must be inferred from related values

Metadata Challenge
- Metadata gets out of sync with the details it summarizes
  - The business grows faster than the systems designed to capture business information
- Not at the right level of detail
  - Multiple values in a single field
  - Multiple meanings for a single field
  - No fixed format for a value
- Expressed in awkward or limited terms
  - Program/compiler view rather than business view

Character-Level Challenge
- Value instance level
  - Spelling, aliases
  - Abbreviations, truncations, transpositions
  - Inconsistent storage formats
- Named type level
  - Multiple meanings, contextual meanings
  - Synonyms, homonyms
- Entity level
  - No common keys or representation
  - No integrated view across records, files, systems

Some Data Quality Examples
- The magic shrinking vendor file
- 127 ways to spell...
- Data surprises in individual fields
- Cowbirds and data fields
- Magic numbers and embedded intelligence

The Magic Shrinking Vendor File
A medical claims processor was having trouble with its insurance vendor file. They thought they had 300,000 insurance vendors. When they cleaned up their data, they discovered they had only 27,000 unique insurance vendors.

127 Ways to Spell...
- Over 127 different ways to spell AT&T
- Over 1,000 ways to spell duPont

Data Surprises in Individual Fields

  NAME                     SOC. SEC. #   TELEPHONE
  Denise Mario DBA         228-02-1975   6173380300
  Marc Di Lorenzo ETAL     999999999     3380321
  Tom & Mary Roberts       025-37-1888
  First Natl Provident     34-2671434    415-392-2000
  Digital 15 State St.     101010101     508-466-1200
  Astorial Fedrl Savings   LN#12-756     212-235-1000
  Kevin Cooke, Receiver    18-7534216    FAX 528-9825
  John Doe Trustee for K   111111111     5436

  Source: Vality
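The surprises in the table above are exactly what a data investigation pass is meant to flag. Below is a minimal profiling sketch in Python; the record layout, field names, and validation rules (SSN format, placeholder values, 10-digit phone) are illustrative assumptions, not rules prescribed in this material.

    import re

    # Illustrative records modeled on the Vality examples above (made-up layout).
    records = [
        {"name": "Denise Mario DBA",      "ssn": "228-02-1975", "phone": "6173380300"},
        {"name": "Marc Di Lorenzo ETAL",  "ssn": "999999999",   "phone": "3380321"},
        {"name": "Kevin Cooke, Receiver", "ssn": "18-7534216",  "phone": "FAX 528-9825"},
    ]

    SSN_PATTERN = re.compile(r"^\d{3}-\d{2}-\d{4}$")            # expected formatted SSN
    PLACEHOLDER_SSNS = {"999999999", "111111111", "101010101"}  # suspicious filler values

    def profile(record):
        """Return a list of 'data surprise' flags for one record."""
        flags = []
        if not SSN_PATTERN.match(record["ssn"]):
            flags.append("SSN not in 999-99-9999 format")
        if record["ssn"].replace("-", "") in PLACEHOLDER_SSNS:
            flags.append("SSN looks like a placeholder value")
        if len(re.sub(r"\D", "", record["phone"])) != 10:
            flags.append("telephone is not a 10-digit number")
        for marker in ("DBA", "ETAL", "TRUSTEE", "RECEIVER"):
            if marker in record["name"].upper():
                flags.append(f"name field carries extra legal text ({marker})")
        return flags

    for r in records:
        print(r["name"], "->", profile(r))

Even this toy pass surfaces the free-form text, placeholder keys, and malformed telephone numbers hiding in "clean-looking" fields.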
Cowbirds and Data Fields
- Cowbirds lay their eggs in other birds' nests.
- Users do the same thing with data: they deposit their own information in fields that are not being used for their intended purpose.

Magic Numbers and Embedded Intelligence

  Customer Number = XXXX-YY-ZZ
    XXXX = first 4 positions of the ZIP code
    If YY = 00-70 then Cust = Pharmacy
    If YY = 80-89 then Cust = Hospital
    Except if YY = 82 and ZZ = **, which means...

Orr's Laws of Data Quality
- Law #1: "Data that is not used cannot be correct!"
- Law #2: "Data quality is a function of its use, not its collection!"
- Law #3: "Data will be no better than its most stringent use!"
- Law #4: "Data quality problems increase with the age of the system!"
- Law #5: "Data quality laws apply equally to metadata!"
- Law #6: The less likely something is to occur, the more traumatic it will be when it happens!

Legacy Data Contaminants Found in Migrations
- Lack of standards
- Data surprises in individual fields
- Legacy information buried in free-form fields
- Legacy myopia: multiple account numbers block a consolidated view
- Anomaly nightmare: complex matching and consolidation

4 Fundamental Types of Transformation
- Simple transformation
  - The fundamental building block of all data transformations
  - One field at a time
- Cleansing and scrubbing
  - Ensures consistent formatting and usage of a field or related group of fields
  - Checks for valid values

4 Fundamental Types of Transformation (cont'd)
- Integration
  - Takes operational data from one or more sources and maps it, field by field, onto a new data structure
- Aggregation and summarization
  - Removes the lowest level of detail
  - Produces data for data marts

Simple Transformation
- Convert a data element from one type to another
  - The semantic value stays the same
  - Rename elements
  - Standard warehouse format
- Date/time conversion
- Decode encoded fields
  - e.g., gender coded as M/F in one source and as different code values in another

Cleansing and Scrubbing
- The actual content is examined
  - Range checking, enumerated lists, dependency checking
- Uniform representation for the data warehouse
  - e.g., address information parsed into its components

Integration
- Simple field-level mappings cover roughly 80-90% of the work
- Complex integration
  - No common identifier
    - Probable matches
    - Two-stage process: isolation, then reconciliation
  - Multiple sources for the same target element
    - Contradictory values
    - Missing data
  - Derived/calculated data
    - Redundant?

Aggregation and Summarization
- Summarization is the addition of like values along one or more business dimensions
  - e.g., add daily sales by store to get monthly sales by region
- Aggregation is the addition of different business elements into a common total
  - e.g., daily product sales plus monthly consulting sales give a monthly combined sales amount
- Details of the process are available in the metadata (a worked sketch follows below)
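A small sketch of the distinction, reusing the slide's own example: daily store sales are summarized up to monthly sales by region, then aggregated with consulting sales into a combined monthly total. The record layout, dollar figures, and store-to-region mapping are made-up assumptions.

    from collections import defaultdict

    # Illustrative daily detail records (store, month, product sales in dollars).
    daily_sales = [
        {"store": "Boston-01", "month": "1997-03", "product_sales": 1200.0},
        {"store": "Boston-02", "month": "1997-03", "product_sales":  950.0},
        {"store": "Albany-01", "month": "1997-03", "product_sales":  700.0},
    ]
    # Assumed store-to-region mapping (would normally come from a dimension table).
    store_region = {"Boston-01": "Northeast", "Boston-02": "Northeast", "Albany-01": "Northeast"}

    # Summarization: add like values (product sales) along business dimensions (region, month).
    monthly_by_region = defaultdict(float)
    for row in daily_sales:
        monthly_by_region[(store_region[row["store"]], row["month"])] += row["product_sales"]

    # Aggregation: add *different* business elements (product sales plus consulting sales)
    # into a common monthly total.
    monthly_consulting = {"1997-03": 500.0}   # assumed separate consulting feed
    combined_monthly = {
        month: sum(v for (region, m), v in monthly_by_region.items() if m == month)
               + monthly_consulting.get(month, 0.0)
        for month in {m for (_, m) in monthly_by_region}
    }

    print(dict(monthly_by_region))   # {('Northeast', '1997-03'): 2850.0}
    print(combined_monthly)          # {'1997-03': 3350.0}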
Data Re-engineering Problem
- Programming for the unknown
  - Unanticipated values, structures, and patterns
- Programming for noise and uncertainty
  - Conflicting and missing values
- Programming for productivity and efficiency
  - Changing data values, changing user requirements
  - High volumes, non-linear searches

"Conventional data transformation methods do not solve the metadata and data value challenges; data re-engineering is needed." (Stephen Brown, Vality Corp.)

Data Re-engineering Process
- Inputs: historical extracts from external files and legacy applications
- Steps: data investigation and metadata mining; data standardization; data integration; data survivorship and formatting
- Targets: customer information systems, data warehouses, client/server applications, consolidations

Natural Laws of Data Re-engineering
- Data has no standard
- You can't predict or legislate format or content
- Data will evolve faster than its capture and storage systems
- You can't write rules for what you don't know and can't see
- Instructions for handling data are within the data
- Don't trust the metadata; make the data reveal itself
- Revealed metadata is knowledge about the business
- Revealed metadata validates warehouse design
- Revealed metadata supports conversion project management
- Revealed metadata is insurance against misinformation

Buy a Tool, or Manually Code Programs?

3 - DW Tools Technologies
- Tool categories: 1st-generation ETL; 2nd-generation ETL suites/environments; universal repositories; repositories; DB & system monitors; metadata browsers; data visualization; data mining; job schedulers; DB design; replication/distribution tools; CASE; EIS; MOLAP/ROLAP/LowLAP; Q&R/MQE/MRE; RDBMS utilities
- Processes covered: design, mapping, extract, scrub, transform, load, index, aggregation, replication, data set distribution, metadata, system monitoring, access & analysis, resource scheduling & distribution

Transformation: Tool vs. Manually Coded Programs
- Choosing between a tool and manually coded programs depends on:
  - Time frames: tools take longer (select, configure, learn)
  - Budgets: short term or long term
  - Size of warehouse: is the initial project small enough for hand coding?
  - Size and skills of the warehouse team
  - A tool automatically generates and maintains metadata

Hand-Generated Code
- Upside
  - No learning curve
  - Inherent skills, in-house capabilities
  - Usually simple
  - No culture change/mandate (as with CASE)
- Downside
  - Manual metadata
  - Maintenance challenge when the talent level changes
  - No automation

Tools
- Upside
  - Easy to maintain as the talent level changes
  - Automatic metadata
  - May gain efficiencies
  - Integration with repositories
  - Integration with other tools (schedulers, monitors, metadata management)
- Downside
  - Cost (1st-generation tools are very expensive)
  - Learning curve
  - Enforced culture change (the tool must be used for all changes)
  - Speed: may be slower to implement
  - May require additional resources

Manual Code / 1st-Generation ETL Tools Process
- Source OLTP systems on a mainframe or C/S system; target data warehouse on a client/server system
- External job scheduling and control; external metadata load/maintenance
- A chain of separately managed programs: extract, transform, file transfer, file load, index, aggregation
  (Diagram copyright © 1997, Enterprise Group, Ltd.)
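A minimal hand-coded sketch of that style of process: separately written extract, transform, and load steps chained through flat files, with no generated metadata. The file names, column names, and gender decode are illustrative assumptions; in the 1st-generation picture each step (plus file transfer, index build, and aggregation) would be a separately scheduled program.

    import csv

    # Step 1: extract - pull rows from an operational source (here, a flat extract file).
    def extract(path):
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    # Step 2: transform - simple field-at-a-time conversions into the warehouse format.
    def transform(rows):
        out = []
        for row in rows:
            out.append({
                "customer_id": row["CUSTNO"].strip(),
                "gender": {"M": "MALE", "F": "FEMALE"}.get(row["SEX"], "UNKNOWN"),
                "sale_date": row["SALEDT"],          # assumed already YYYY-MM-DD
                "amount": f'{float(row["AMT"]):.2f}',
            })
        return out

    # Step 3: load - write a load-ready file for the warehouse bulk loader.
    def load(rows, path):
        with open(path, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["customer_id", "gender", "sale_date", "amount"])
            writer.writeheader()
            writer.writerows(rows)

    if __name__ == "__main__":
        load(transform(extract("orders_extract.csv")), "orders_load.csv")

Everything here (field mappings, decode tables, run order) lives only in the code, which is precisely the metadata and maintenance burden the 2nd-generation engines below are meant to take over.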
2nd-Generation ETL Tools Process
- Source OLTP systems on a mainframe or C/S system feed a transformation engine running on a C/S system
- The engine (with caching) handles monitoring, scheduling, extraction, scrubbing, transformation, load, index, aggregation, metadata load, and metadata maintenance
- Target: a data warehouse or data mart on a C/S system
  (Diagram copyright © 1997, Enterprise Group, Ltd.)

2nd-Generation ETL Environment Process
- Source OLTP systems on a mainframe or C/S system feed the same transformation engine (monitoring, scheduling, extraction, scrubbing, transformation, load, index, aggregation, metadata load, metadata maintenance, caching)
- Enterprise metadata with a request broker
- User process: surf metadata, request resources, schedule delivery
- Targets: data warehouses and multiple data marts on C/S systems
  (Diagram copyright © 1997, Enterprise Group, Ltd.)

1st-Generation ETL Tools Hampered By:
- High cost (average deal prices in the $250-400K range)
- Long learning curves
- Perceived value (most teams felt they could write better code)
- Cultural challenges (like a CASE tool, the team must use the code generator for all creation and changes, no matter how minor)
- Core capabilities (complex transformations still required manual code)
- Management requirements (users still had to manage all the generated programs)
- Performance issues (the resulting programs could not leverage parallelism)

Important 2nd-Generation ETL Tool Features:
- Transformation engine design
- Ability to leverage parallel server technology
- CDC (Change Data Capture, which allows only the new data to be extracted)
- Incremental aggregation (the ability to add CDC incremental data to existing aggregations)
- Limited or no use of temporary files or database tables (virtual caching only)
- Common, open, and extensible metadata repository
- Enterprise scalability

Important 2nd-Generation ETL Tool Features (cont'd):
- Common UI (user interface) across all tools
- Extensive selection of transformation algorithms
- Easily extensible scrub and transform algorithm library
- Extensive heterogeneous source and target support
- Native OLAP data set target support
- System monitoring & management
- Enterprise metadata repository (content, resources, structure, etc.)
- Transform once, populate many (populate multiple targets with a single transformation output)

Important 2nd-Generation ETL Tool Features (cont'd):
- Integrated enterprise-scale scrubbing capabilities
- Seamless interoperability with external point-solution tools
- Integrated information access, analysis, scheduling, and delivery
- Aggregate-aware information request broker (enables a virtual data warehouse)
- Ad hoc aggregation monitoring and management
- Pipeline parallelism / very high throughput
- Native drivers (source and target)

OLTP <> OLAP
- OLTP: normalized data
- OLAP: redundant data; tools must provide a multidimensional conceptual view of the data ("Providing OLAP to User Analysts", E. F. Codd)

Multidimensional Model
- Data stored as facts and dimensions
- Sales fact cube (sketched below)
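A minimal sketch of a sales fact cube in Python: each fact record carries keys into store, product, and time dimensions plus a sales measure, and a roll-up sums the measure along whichever dimensions are requested. The dimension names and values are illustrative assumptions, not taken from the slides.

    from collections import defaultdict

    # Fact records: dimension keys (store, product, month) plus a numeric measure (sales).
    facts = [
        {"store": "Boston-01", "product": "aspirin",  "month": "1997-03", "sales": 120.0},
        {"store": "Boston-01", "product": "bandages", "month": "1997-03", "sales":  80.0},
        {"store": "Albany-01", "product": "aspirin",  "month": "1997-04", "sales":  60.0},
    ]

    def roll_up(facts, *dims):
        """Sum the sales measure along the requested dimensions of the cube."""
        cube = defaultdict(float)
        for f in facts:
            cube[tuple(f[d] for d in dims)] += f["sales"]
        return dict(cube)

    # Slice the cube different ways:
    print(roll_up(facts, "product"))          # total sales by product
    print(roll_up(facts, "store", "month"))   # total sales by store and month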