Gaétan Hervé
Group Manager
ELCA Informatique S.A.

Data Integration in Business Intelligence Project

• The Microsoft ETL … SSIS
• SSIS Connectivity
• SSIS New Features
• Thread Optimization
• Lookup Caching
• Change Data Capture in SSIS
• Merge Statement in SSIS
• Conclusion and Questions

Why Data Integration is always an important step in a BI project

Business needs
[Diagram: Shops & Online Ordering; Sales and Marketing reporting; Auto generated reports via a single system; New Client; Purchasing; Registering; Purchasing; Navision & AS400]

Technical challenges
[Diagram: Navision Database; Custom & Purchasing Information; Reporting architecture; Data integration; Static Data; AS400; Historical Data; Sales, Stock & Purchasing, Accounting; Dimensional Data (Reference Data); CRM System (Aquitaine); Historical Transactions; Historical Accounting Data; Customer Rating; BI; SQL Server studio; DB administrator; (Web) Reports; Users; BO Admin tools; powerUsers]

Business Needs
• Marketing reports
• Self registering customers via ticketing and customers via marketing list (CRM)

Technical challenges
• Architecture
• Data Integration

What the components of the Microsoft ETL are and how it works

The Microsoft ETL
• SSIS (SQL Server Integration Services)
• A package is a group of ETL tasks
• Packages are created and tested in Visual Studio
• Packages are deployed either to the file system or to the database
• Packages can be run via the SQL Server Agent

How SSIS can log operations, read data, transform it and load it into the destination table

What the available data sources for SSIS are (key for an ETL)

Main ‘Mission’ of an ETL
• Connect to a data source, get the data, transform it and load it into another data source
=> Connectivity is a key feature

[Diagram: Data Sources → Source Provider → SSIS boundary → Destination Provider → Data Sources]

Application Systems
• SAP - MySAP ERP, R/3
• Peoplesoft
• Siebel

Relational DB Systems
• SQL Server
• Oracle
• IBM DB2

Structured/semistructured data
• MS Excel, CSV
• Text File, Flat File
• XML, EDI

Queue Systems & Protocols
• MSMQ
• (s)FTP – sFTP is not supported out of the box
• HTTP(s)

Considerations: Application Systems
• Data Types
• Metadata
• 64-bit
• SSIS integration – Custom Source, Script Component, Standard Provider Stack (ADO.NET, OleDB, ODBC)
• Supported Host Application Versions
• Microsoft, Partners, 3rd party vendors

Considerations: Relational DB Systems
• Supported Data Types
• Metadata extraction
• 64-bit drivers
• How to connect - ADO.NET, ODBC, OleDB
• Custom Features for connectors – i.e. Bulk Load/Write

Considerations: Structured/semistructured data
• What are they - XML, EDI, Flat File, Excel, CSV
• Data type conversion
• SSIS components – XML, Flat File, Text File
• 3rd party components – DataDefractor by Interactive Edge
• Extensibility Story – Script & Custom components

Considerations: Queue Systems & Protocols
• Mostly untyped systems
• Control Flow Tasks vs. Data Flow Components
• Streaming Data vs. Recordset
• Data Behind The Firewall
• Web Service Support
• (s)FTP – sFTP is not supported out of the box
• HTTP(s)
• IBM WebSphere MQ – no out-of-the-box support

SQL Server
• Has the richest and most flexible support
• SQL-only components: Bulk Insert Task, SQL Server Destination, Transfer Tasks, DB Maintenance Tasks, SQL Server Mobile Destination, Fast Load Options in OleDB Destination
• Use SQL Server Destination: faster than Bulk Insert or OleDB Destination with the “fast load” option

Supported sources
SQL Server, DB2, DB2/400, Oracle, SAP, Access, Excel, Office 2007, Sybase, Informix, Teradata, FoxPro, File DBs, Adabas, CISAM, DISAM, Ingres II, Oracle Rdb, RMS, Enscribe, SQL/MP, IMS/DB, VSAM, LDAP
The Connectivity White Paper has the full list: http://ssis.wik.is/Connectivity_White_Paper

How the new features help improve integration and reduce data processing and loading delays

Integration today
• Increasing data volumes
• Increasingly diverse sources
• More users and use cases
Requirements reached the tipping point
• Low-impact source extraction
• Efficient transformation
• Bulk loading techniques

Current SSIS Thread Scheduler
• Threads affinitized to dataflow subtrees
• Thread starvation on highly-parallel designs
• Single thread for each synchronous path
• Non-linear scale-up (plateau)

SSIS Pipeline Parallelism
• Rewrote the thread scheduler
• Improved performance and scale
• Thread pool shared across multiple components
Benefits
• Better performance (50%) in highly-parallel designs
• Less manual tuning during development (lower TCO)
• Better hardware utilization (higher ROI)
• It just works!

Loading reference data in the ETL process is expensive
• Dimension lookups are core to ETL
• Table joins need to be performed outside the database
• Often involves staging the data
• Bottleneck – resource intensive
Efficient lookups are key to optimal ETL performance
• Multiple modes of operation
• Wide array of data sources
• Cache sharing and reuse

Problems in current SSIS Lookup component
• Cache is reloaded on every execution and/or loop
• Cache sharing semantics are ‘magic’
• Caches can only be loaded through OleDb

Flexible cache implementation
• Cache-load is a separate operation from Lookup
• Hydrated and dehydrated to the file system
• Amortize cache-load across multiple cache-reads
• Caches can be explicitly shared
Adaptable
• Caches can be loaded from any source (SQL, Text, Mainframe,…)
• Track cache hits and misses
• Cascaded Lookup patterns
Multiple modes
• Full Cache (pre-load all rows, most memory, fastest)
• Partial Cache (on miss, query database and store result)
• No Cache (pass-through to DB, least memory, slowest)

[Diagram: First Process: populates cache from any source (customer.csv) and saves to disk. Subsequent Process: cache rehydrated from disk (Fact Sales). Hydrate cache from file or get it from memory; save cache to disk or persist in memory. Cache sharable and durable.]

How the lookup works with two distinct steps: cache loading and data usage

[Diagram: For each file in a directory (Sales Facts, Inventory Facts, Complaints Facts): Read File xx; LookUp Time; LookUp; Write Data; Write Data; Write Data; Write Data]

[Diagram: Application Database; Data Warehouse; ODS; SSIS; SSIS; Source Data Extraction; Warehouse Load]

Extracting data from the source is expensive
• Efficient extraction is key to improving ETL performance
• Involves bulk loading data into staging areas or warehouse
• Time consuming & resource intensive
• Triggers (synchronous IO penalty)
• Timestamp columns (Schema changes)
• Complex queries (delayed IO penalty)
• Custom (ISV, mirror, snapshot, …)

Incremental data load is key to efficient extraction
• Need to know what changed at source since a point in time
• Expensive lookups to determine changed columns
• Providing information up front about which columns changed will improve efficiency

Information about what changed at the source
• Operation (Insert, Update, Delete)
• Update mask (which columns changed)
Changes captured from the log asynchronously
• Minimal impact on source system
• Log reader can be scheduled to run during idle time

[Diagram: OLTP; Enabled per table; Change Tables; Data Warehouse / ODS]

Hidden change tables store captured changes
• One change table per source table that is tracked
• Retention-based cleanup jobs
CDC APIs provide access to change data
• Table valued functions and scalar functions provide access to change data and CDC metadata
• TVF allows the changes to be gathered for specific intervals, enabling incremental population of the DW
• Transactional consistency is maintained across multiple tables for the same request interval
• Can be used on an existing proprietary application

How information extracted from CDC tables can be used in SSIS to load the ODS

Database I/O is typically the major cost in ETL
• Large number of rows
• Complex semantics
• Indexes, constraints, triggers, …
Inserts, Updates & Deletes included in same source stream
• Usually with no way to distinguish them
• Solved using inelegant patterns (ELT)
• Contention and b/locking
How do we lower the cost?
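The problem named above, a source stream that mixes inserts, updates and deletes with no way to tell them apart, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not part of SSIS or SQL Server: the function name classify_changes, the qty column and the sample rows are hypothetical, and treating a row that is missing from the source as a delete is only valid when the source is a full snapshot.

```python
from typing import Dict, List, Sequence, Tuple

Row = Dict[str, object]


def classify_changes(
    source_rows: Sequence[Row],
    target_rows: Sequence[Row],
    key_columns: Sequence[str],
) -> Dict[str, List[Row]]:
    """Split a full source snapshot into inserts, updates, deletes and unchanged rows.

    Hypothetical helper: rows are matched on key_columns only, and every
    non-key difference counts as an update.
    """

    def key_of(row: Row) -> Tuple[object, ...]:
        return tuple(row[column] for column in key_columns)

    target_by_key = {key_of(row): row for row in target_rows}
    seen_keys = set()
    changes: Dict[str, List[Row]] = {
        "insert": [], "update": [], "delete": [], "unchanged": [],
    }

    for row in source_rows:
        key = key_of(row)
        seen_keys.add(key)
        existing = target_by_key.get(key)
        if existing is None:
            changes["insert"].append(row)      # not in target
        elif existing != row:
            changes["update"].append(row)      # in target, data differs
        else:
            changes["unchanged"].append(row)   # in target, same data

    for key, row in target_by_key.items():
        if key not in seen_keys:
            changes["delete"].append(row)      # not in source (snapshot assumption)

    return changes


# Illustrative data only; the column names are assumptions for this sketch.
source = [
    {"SK_Date_ID": 1, "SK_Item_ID": 10, "qty": 5},
    {"SK_Date_ID": 1, "SK_Item_ID": 11, "qty": 7},
    {"SK_Date_ID": 2, "SK_Item_ID": 10, "qty": 1},
]
target = [
    {"SK_Date_ID": 1, "SK_Item_ID": 10, "qty": 5},
    {"SK_Date_ID": 1, "SK_Item_ID": 11, "qty": 3},
    {"SK_Date_ID": 1, "SK_Item_ID": 12, "qty": 9},
]
result = classify_changes(source, target, ["SK_Date_ID", "SK_Item_ID"])
```

Every source row needs a keyed comparison against the target before the right operation is known, which is the work that either the database or the ETL tool has to absorb.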
MERGE
• Simplify semantics
• Simplify development
• Improve overall performance
Single statement can deal with Inserts, Updates & Deletes all at once
• Canonical statement similar to existing standards
• Includes both SCD-1 and SCD-2 semantics
• Includes DELETE semantics
Performance Goals
• 20% faster
• Minimal logging on inserts (2x)
• Optimized loading directly from text file – OPENROWSET(BULK…)

Typical Solution
• Clean the source data, load it into Tbl_Staging
• Index Tbl_Staging
• UPDATE Warehouse INNER JOIN Tbl_Staging ON…
• INSERT Warehouse LEFT JOIN Tbl_Staging ON…
With MERGE
• MERGE Warehouse USING Tbl_Staging ON…

    INSERT INTO target (…)
    SELECT field1, recordeffendate
    FROM (
        MERGE target_table AS target
        USING (SELECT * FROM source_table) AS source
            ON  source.SK_Date_ID = target.SK_Date_ID
            AND source.SK_Item_ID = target.SK_Item_ID
        WHEN MATCHED THEN
            UPDATE SET …
        WHEN NOT MATCHED BY TARGET THEN
            INSERT …
        WHEN NOT MATCHED BY SOURCE THEN
            DELETE …
        OUTPUT $action, source.field1, inserted.RecordEffEndDate
    ) AS SCD2 (action, field1, recordeffendate)
    WHERE SCD2.action = 'UPDATE';

How to integrate the MERGE ‘chain’ in SSIS
[Diagram: In Target; Not in Target; Not in Source; Same Data]

Conclusion
SQL Server 2008 Release focused on Performance & Scale
Improved ETL processing
• SSIS Connectivity
• SSIS Lookup
• SSIS Pipeline Parallelism
• Change Data Capture (CDC)
• MERGE
Benefits
• Less manual tuning during development (lower TCO)
• Better hardware utilization (higher ROI)
• Smaller batch window (agility)

ELCA is one of the main independent Swiss companies in the IT development and system integration field. We develop, integrate, operate and maintain IT solutions using custom developed applications, as well as industry standards.
• Founded in 1968
• Employees > 450
• Offices: Lausanne (headquarters), Zurich, Geneva, Bern, London, Madrid, Paris, Ho Chi Minh City (Vietnam)
• Turnover CHF 57 M, uninterrupted positive results for 20 years
• Quality Standards: ISO 9001 since 1993, CMMI Level 3 since 2007
• Awards

Resources
• Microsoft BI: www.microsoft.com/bi
• SQL Server Integration Services: http://www.microsoft.com/sql/technologies/integration/default.mspx
• Guided Tours for SSIS: http://www.microsoft.com/sql/technologies/integration/tours.mspx
• Technical Portal for SSIS: http://technet.microsoft.com/en-gb/sqlserver/default.aspx
• Developer Portal for SSIS: http://msdn2.microsoft.com/en-us/sql/aa336312.aspx
• Safe Software FME Extension for SSIS: http://www.safe.com/microsoft

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Competencies and main focus
• Business Consulting
• System Integration
• Software Development
• Operations Management
Strategy, Architecture, Compliance, BI / DWH, CRM, ECM, Security, Ticketing, Individual SW, Project Mgmt, Testing
Properties: Quality, Turnkey, Fixed price, Offshore
Payment Services, Virtualization, Operations

With more than 10 years of experience and more than 30 experienced consultants and engineers, ELCA is one of the leaders in the Business Intelligence market in Switzerland. We deliver state-of-the-art solutions for our customers:
• Financial consolidation for management and regulatory reporting
• Analytical CRM solutions for marketing and sales
• Balanced scorecard for performance management
• Business Process Management solutions
• Integrated operational reporting
• Risk management solutions
• Resource optimization support