What is ETL? • ETL: Extract Transformation and Load • The term is used to describe data migration or the data conversion process • ETL may be part of the regularly repeated business process • The amount and complexity of data grew dramatically over the past decade so the ETL processes became more complex and demanding 1 Phases of ETL Project 1. 2. 3. 4. 5. 6. 7. 8. Requirements Analysis Design Proof of Concept Development Testing Execution Verification 2 Requirements • Scope of the data migration - what data is required in the target system and how is it going to be used? • Execution requirements – has to be within certain timeframe, sequence, geographic location, repeatability, acceptable system down time, etc. • Source data retention period, backup and restore requirements • Requirements should be made with this in mind: – Data is the company’s most valuable asset. – Consequences of corrupt data are usually very costly. 3 Analysis • Understanding the source data – Data Dictionary that is used for designing ETL process has to be created – Mission critical task – Frequently underestimated (importance and time) • All available resource should be used to do analysis properly: – Available system documentation including Data Model and Data Dictionary – People – Reverse engineering 4 Design • • • • • Choice of methodology Choice of technology Design Target Database Design ETL process Data Mapping Document – Map source data to the target database – Specifies transformation rules – Specifies generated data (not from source) • Design ETL verification process • Ensures that all requirements are addressed 5 Proof of Concept • Helps to determine or estimate: – Feasibility of the concept – Development time – Performance, capacity and execution time – Requirements to be met • Gain knowledge about the technology • Code produced in this phase usually can be reused during the development phase 6 Development • Includes: – Produce code and processes as per Design and Data Mapping Document – Data verification scripts or programs as per Test Plan – Execution scripts as per Execution Plan – Unit testing – performed and documented by developers. • Typical Challenges: – Inadequate requirements and design documents – Developers unfamiliar with technology 7 Testing Ensures that requirements are met Test Plan is highly recommended Types of testing: Functional, stress, load, integration, connectivity, regression Challenges: Automation and repeatability (testing and verification scripts) Creation of the Test Data Extracting small data sets from large data volumes Confidential data may not be made available for testing 8 Execution • Execution plan should include: – – – – – Sequence of tasks Time of execution and expected duration of execution Checkpoints and success criteria Back out plan and continuation of business Resources involved • For mission, critical system down time could be limited or even entirely unacceptable • Execution should be controlled and verified 9 Verification • Confirms that the data migration was successful • Determined during the design phase • Various methodologies and technologies could be used • Automated verifications are highly recommended (driven by requirements) 10 Why ETL Projects Fail? • • • • Underestimate complexity of the project Overlook or neglect phases of the project Wrong choice of technology Common misconceptions like – Expensive ETL tools will solve all problems – No or very little programming will be required – We don't need or we don't have time for plans, but we know exactly what we need to do 11 Problems with Mainstream ETL Tools • • • • • • • • • Maintaining license and consultants is very expensive Significant time required to learn Usually require dedicated hardware Cannot take advantage of the database vendor proprietary technologies that are optimized for the fastest data migration Complex tasks very often require integration with other technologies Very limited performance Only small amount of provided functionality is actually required for ETL project Very limited application for Data Analysis Huge discrepancy between marketing promises and actual performance 12 Malloc Inc. • During the 15 years in the IT Consulting business, a proprietary ETL methodology and technology was developed • Consists of two major modules: – Database Analyzer – G-DAO Framework • Major advantages: – Inexpensive, easier to learn and performs better than mainstream ETL Software – Any Java developer can master it and start using it within several days – It is proven and it works 13 Database Analyzer • Produces ETL Data Analysis Reports in various formats • Major usage: – To analyze and understand source data and the database attributes – Create data mapping and transformation documents – Create a data dictionary – Suggests ways to improve database design – Valuable source of information for Business Analysis, Data Architects, Developers, and Database Administrators. 14 Database Analyzer Reports • Intuitive, descriptive and easy to read • In HTML format • Can be imported and edited in major document editors such as MS Word 15 Graphical User Interface • User friendly GUI Interface • It can also run in Batch mode for lengthy analysis (large data sources) 16 G-DAO Framework • Java Code Generator • Eliminates huge legwork to develop the code required for the ETL Process • Uses analysis performed by Database Analyzer to produce the code optimized for a particular database • Code may be used for purposes other than ETL (any kind of database access and data manipulation) • Uses the advantage of an almost unlimited world of java libraries (no proprietary language and interfaces) 17 G-DAO Framework Advantages • Data can flow directly from source to target (no need for intermittent storage into files) • XA transactions are supported for all major databases • Functionality is limited only by the limitation of the JDBC driver and Java language • Easy to learn and implement • No dedicated hardware required • Provides a platform for any kind of business application that requires data access 18 Malloc Inc Incorporated in 1993 in Toronto, Canada Provided IT Consulting to: Oracle Corporation General Electric Citibank Royal Bank Bank of Montreal The Prudential Standard Life and many more Most recent implementations of Database Analyzer and G-DAO Framework Citibank Royal Bank of Canada http://www.mallocinc.com 19