Gaétan Hervé
Group Manager
ELCA Informatique S.A.
Data Integration in Business Intelligence
Project
The Microsoft ETL … SSIS
SSIS Connectivity
SSIS New Features
Thread Optimization
Lookup Caching
Change Data Capture in SSIS
Merge Statement in SSIS
Conclusion and Questions
Why Data Integration is always an
important step in a BI project
Business needs
Shops & Online Ordering
Sales and Marketing
reporting
Auto generated reports
via a single system
New Client
Purchasing
Registering
Purchasing
Navision & AS400
Technical challenges
Navision
Database
Customer & Purchasing Information
Reporting architecture
Data integration
Static Data
AS400
Historical
Data
Sales, Stock & Purchasing, Accounting
Dimensional Data (Reference Data)
CRM System
(Aquitaine)
Historical Transactions
Historical Accounting Data
Customer Rating
BI
SQL Server studio
DB administrator
(Web) Reports
Users
BO Admin tools
powerUsers
Business Needs
Marketing reports
Self registering
customers via ticketing
and customers via
marketing list (CRM)
Technical challenges
Architecture
Data Integration
The components of the Microsoft ETL
and how they work
The Microsoft ETL
SSIS (SQL Server Integration Services)
A group of ETL tasks is a package
Packages are created and tested in Visual Studio
Packages are deployed either to the file system or to the database
Packages are run via the SQL Server Agent
How SSIS can log operations, read
data, transform it and load it into
the destination table
What are the available data sources
for SSIS (key for an ETL)
Main ‘Mission’ of an ETL
Connect to a data source, get the data, transform
it and load it into another data source
=> Connectivity is a key feature
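The connect, get, transform and load steps named above can be sketched in a few lines. This is only an illustration in Python, not SSIS itself: the CSV text, the `sales` table and the function names are assumptions made for the example, and an in-memory SQLite database stands in for the destination.

```python
import csv
import io
import sqlite3
from typing import Iterable, Iterator, Tuple

# Hypothetical source data; a real package would read a file or a database.
SOURCE_CSV = "shop,amount\n Lausanne ,12.50\nZurich,7\n"


def extract(text: str) -> Iterator[dict]:
    """Connect to the source and yield one dict per record."""
    return csv.DictReader(io.StringIO(text))


def transform(rows: Iterable[dict]) -> Iterator[Tuple[str, float]]:
    """Clean and type the raw values."""
    for row in rows:
        yield row["shop"].strip(), float(row["amount"])


def load(conn: sqlite3.Connection, rows: Iterable[Tuple[str, float]]) -> int:
    """Write the transformed rows to the destination and return the row count."""
    batch = list(rows)
    conn.execute("CREATE TABLE IF NOT EXISTS sales (shop TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", batch)
    conn.commit()
    return len(batch)
```

Chaining the three functions as `load(conn, transform(extract(SOURCE_CSV)))` mirrors a source, a transformation and a destination in a data flow; swapping `extract` for another reader is the point where connectivity matters.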
Data Sources
Source Provider
SSIS boundary
Destination Provider
Data Sources
Application Systems
• SAP - MySAP ERP, R/3
• PeopleSoft
• Siebel
Relational DB Systems
• SQL Server
• Oracle
• IBM DB2
Structured/semistructured data
• MS Excel, CSV
• Text File, Flat File
• XML, EDI
Queue Systems & Protocols
• MSMQ
• (s)FTP – sFTP is not supported out of the box
• HTTP(s)
Considerations
• Data Types
• Metadata
• 64-bit
• SSIS integration – Custom Source, Script Component, Standard Provider Stack (ADO.NET, OleDB, ODBC)
• Supported Host Application Versions
• Microsoft, Partners, 3rd party vendors
Considerations
• Supported Data Types
• Metadata extraction
• 64-bit drivers
• How to connect - ADO.NET, ODBC, OleDB
• Custom Features for
connectors – i.e. Bulk
Load/Write
Considerations
• What are they - XML, EDI, Flat File,
Excel, CSV
• Data type conversion
• SSIS components – XML, Flat File,
Text File
• 3rd party components – Data
Defractor by Interactive Edge
• Extensibility Story – Script & Custom
components
Considerations
• Mostly untyped systems
• Control Flow Tasks vs. Data Flow
Components
• Streaming Data vs. Recordset
• Data Behind The Firewall
• Web Service Support
• (s)FTP – sFTP is not supported out of the box
• HTTP(s)
• IBM WebSphere MQ – no out-of-the-box support
Has the richest and most flexible support
• SQL-only components: Bulk Insert Task, SQL Server Destination, Transfer Tasks, DB Maintenance Tasks, SQL Server Mobile Destination, Fast Load Options in OleDB Destination
Use SQL Server Destination
• Faster than Bulk Insert or OleDB Destination with “fast load” option
SQL Server
DB2
DB2/400
Oracle
SAP
Access
Excel
Office 2007
Sybase
Informix
Teradata
FoxPro
File DBs
Adabas
CISAM
DISAM
Ingres II
Oracle Rdb
RMS
Enscribe
SQL/MP
IMS/DB
VSAM
LDAP
The Connectivity White Paper has the full list:
http://ssis.wik.is/Connectivity_White_Paper
How the new features help to
improve integration and reduce data
processing and loading delays
Integration today
Increasing data volumes
Increasingly diverse
sources
More users and use cases
Requirements reached the
tipping point
Low-impact source
extraction
Efficient transformation
Bulk loading techniques
Current SSIS Thread Scheduler
Threads affinitized to dataflow subtrees
Thread starvation on highly-parallel designs
Single thread for each synchronous path
Non-linear scale-up (plateau)
SSIS Pipeline Parallelism
Rewrote the thread scheduler
Improved performance and scale
Thread pool shared across multiple components
Benefits
Better performance (50%) in highly-parallel designs
Less manual tuning during development (lower TCO)
Better hardware utilization (higher ROI)
It just works!
Loading reference data in the ETL process is expensive
Dimension lookups are core to ETL
Table joins need to be performed outside the database
Often involves staging the data
Bottleneck – resource intensive
Efficient lookups are key to optimal ETL performance
Multiple modes of operation
Wide array of data sources
Cache sharing and reuse
Problems in current SSIS Lookup component
Cache is reloaded on every execution and/or loop
Cache sharing semantics ‘magic’
Caches can only be loaded through OleDb
Flexible cache implementation
Cache-load is a separate operation to Lookup
Hydrated and dehydrated to the file system
Amortize cache-load across multiple cache-reads
Caches can be explicitly shared
Adaptable
Caches can be loaded from any source (SQL, Text, Mainframe,…)
Track cache hits and misses
Cascaded Lookup patterns
Multiple modes
Full Cache (pre-load all rows, most memory, fastest)
Partial Cache (on miss, query database and store result)
No Cache (pass-through to DB, least memory, slowest)
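The trade-off between the three modes can be made concrete with a toy lookup. This is a minimal sketch in Python, not the SSIS Lookup component: the `Lookup` class, the `query_db` callback standing in for the reference database and the counters are all assumptions of the example.

```python
from typing import Callable, Dict, Iterable, Optional, Tuple

Row = Tuple[str, int]  # (business key, surrogate key)


class Lookup:
    """Toy dimension lookup supporting 'full', 'partial' and 'none' cache modes."""

    def __init__(self, mode: str,
                 query_db: Callable[[str], Optional[int]],
                 preload: Iterable[Row] = ()) -> None:
        if mode not in ("full", "partial", "none"):
            raise ValueError(f"unknown cache mode: {mode}")
        self.mode = mode
        self.query_db = query_db
        # Full cache: all reference rows are loaded before the first lookup.
        self.cache: Dict[str, int] = dict(preload) if mode == "full" else {}
        self.hits = 0
        self.misses = 0
        self.db_queries = 0

    def get(self, key: str) -> Optional[int]:
        if self.mode == "none":
            # No cache: every lookup is passed through to the database.
            self.db_queries += 1
            return self.query_db(key)
        if key in self.cache:
            self.hits += 1
            return self.cache[key]
        self.misses += 1
        if self.mode == "full":
            return None  # key is absent from the preloaded reference data
        # Partial cache: query the database on a miss and keep found rows.
        self.db_queries += 1
        value = self.query_db(key)
        if value is not None:
            self.cache[key] = value
        return value
```

Running the same stream of keys through each mode and comparing `db_queries` shows the cost profile from the slide: none for a full cache, one per distinct key for a partial cache, one per row without a cache.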
First Process: populates cache from any source and saves to disk
Subsequent Process: cache rehydrated from disk
customer.csv
Fact Sales
Hydrate cache from file or get it from memory
Save cache to disk or persist in memory
Cache sharable and durable
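Separating cache loading from cache use can be sketched as follows. This is an illustration in Python under stated assumptions: the column names `customer_code` and `customer_sk`, the JSON file format and the function names are invented for the example and say nothing about the actual SSIS cache file format.

```python
import csv
import io
import json
import os
import tempfile
from typing import Dict


def load_cache_from_source(csv_text: str) -> Dict[str, int]:
    """First process: populate the cache from any source (here, CSV text)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row["customer_code"]: int(row["customer_sk"]) for row in reader}


def dehydrate(cache: Dict[str, int], path: str) -> None:
    """Save the cache to disk so later runs can reuse it."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cache, fh)


def hydrate(path: str) -> Dict[str, int]:
    """Subsequent process: rebuild the cache from disk without touching the source."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


def demo() -> Dict[str, int]:
    """Load once, persist, then rehydrate as a later run would."""
    path = os.path.join(tempfile.mkdtemp(), "customer.cache.json")
    cache = load_cache_from_source("customer_code,customer_sk\nC1,10\nC2,20\n")
    dehydrate(cache, path)
    return hydrate(path)
```

The cost of reading the source is paid once in `load_cache_from_source`; every later run only pays for `hydrate`, which is how the load is amortized across multiple cache reads.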
How the lookup works with two distinct
steps : cache loading and data usage
For each file in a directory
Sales Facts
Inventory Facts
Complaints Facts
Read File
xx LookUp
Time LookUp
Write Data
Application Database
Data Warehouse
ODS
SSIS
SSIS
Source Data Extraction
Warehouse Load
Extracting data from the source is expensive
Efficient extraction is key to improving ETL performance
Involves bulk loading data into staging areas or warehouse
Time consuming & resource intensive
Triggers (synchronous IO penalty)
Timestamp columns (Schema changes)
Complex queries (delayed IO penalty)
Custom (ISV, mirror, snapshot, …)
Incremental data load is key to efficient extraction
Need to know what changed at source since a point in time
Expensive lookups to determine changed columns
Providing information up front about which columns changed
will improve efficiency
Information about what changed at the source
Operation (Insert, Update, Delete)
Update mask (which columns changed)
Changes captured from the log asynchronously
Minimal impact on source system
Log reader can be scheduled to run during idle time
OLTP
Enabled per table
Change Tables
Data Warehouse / ODS
Hidden change tables store captured changes
One change table per source table that is tracked
Retention-based cleanup jobs
CDC APIs provide access to change data
Table valued functions and scalar functions provide
access to change data and CDC metadata
TVF allows the changes to be gathered for specific
intervals enabling incremental population of DW
Transactional consistency is maintained across
multiple tables for the same request interval
Can be used on an existing proprietary application
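How a consumer might use the operation code and the update mask can be sketched as follows. The operation codes follow the values SQL Server CDC exposes in its `__$operation` column (1 = delete, 2 = insert, 3 = update before image, 4 = update after image). Everything else is an assumption of this Python sketch: the mask is modelled as a plain integer in which bit i marks the i-th captured column, and the ODS is a dictionary keyed by `id`.

```python
from typing import Dict, List, Tuple

DELETE, INSERT, UPDATE_BEFORE, UPDATE_AFTER = 1, 2, 3, 4

Change = Tuple[int, int, dict]  # (operation, update mask, row image)


def changed_columns(mask: int, columns: List[str]) -> List[str]:
    """Return the names of the columns whose bit is set in the mask."""
    return [name for i, name in enumerate(columns) if (mask >> i) & 1]


def apply_changes(ods: Dict[int, dict], changes: List[Change],
                  columns: List[str], key: str = "id") -> Dict[int, dict]:
    """Replay captured changes, in order, against an in-memory ODS table."""
    for operation, mask, row in changes:
        if operation == INSERT:
            ods[row[key]] = dict(row)
        elif operation == UPDATE_AFTER:
            target = ods.setdefault(row[key], {key: row[key]})
            # Only copy the columns the mask reports as changed.
            for column in changed_columns(mask, columns):
                target[column] = row[column]
        elif operation == DELETE:
            ods.pop(row[key], None)
        # UPDATE_BEFORE rows carry the old image and are skipped here.
    return ods
```

Because the mask names the changed columns up front, the consumer does not need to compare the incoming row with the stored row to find out what to update.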
How information extracted from CDC
tables can be used in SSIS to load the
ODS
Database I/O is typically the major cost in ETL
Large number of rows
Complex semantics
Indexes, constraints, triggers, …
Inserts, Updates & Deletes included in same source stream
Usually with no way to distinguish them
Solved using inelegant patterns (ELT)
Contention and b/locking
How do we lower the cost?
Simplify semantics
Simplify development
Improve overall performance
Single statement can deal with Inserts, Updates & Deletes all at once
Canonical statement similar to existing standards
Includes both SCD-1 and SCD-2 semantics
Includes DELETE semantics
Performance Goals
20% faster
Minimal logging on inserts (2x)
Optimized loading directly from text file – OPENROWSET(BULK…)
Typical Solution
Clean the source data, load it into Tbl_Staging
Index Tbl_Staging
UPDATE Warehouse INNER JOIN Tbl_Staging ON…
INSERT Warehouse LEFT JOIN Tbl_Staging ON…
MERGE Warehouse USING Tbl_Staging ON…
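-- Outer INSERT consumes the rows returned by the MERGE's OUTPUT clause
INSERT INTO target (…)
SELECT field1, RecordEffEndDate
FROM
(
MERGE target_table AS target
USING (SELECT * FROM source_table) AS source
ON source.SK_Date_ID = target.SK_Date_ID AND
source.SK_Item_ID = target.SK_Item_ID
-- Row exists on both sides
WHEN MATCHED THEN UPDATE SET
…
-- Row exists in the source only
WHEN NOT MATCHED BY TARGET THEN INSERT
…
-- Row exists in the target only
WHEN NOT MATCHED BY SOURCE THEN DELETE
…
OUTPUT $action, source.field1, inserted.RecordEffEndDate
)
AS SCD2 (action, field1, RecordEffEndDate)
-- Keep only the rows that were updated
WHERE SCD2.action = 'UPDATE'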
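The three branches of a MERGE and its per-row action output can be mimicked outside the database to make the semantics explicit. This is a minimal sketch in Python, not a description of how SQL Server executes the statement: the `merge` function and the dictionaries keyed by (SK_Date_ID, SK_Item_ID) are assumptions of the example.

```python
from typing import Dict, List, Tuple

Key = Tuple[int, int]  # (SK_Date_ID, SK_Item_ID)


def merge(target: Dict[Key, dict],
          source: Dict[Key, dict]) -> List[Tuple[str, Key]]:
    """Synchronize target with source in one pass and report each action."""
    actions: List[Tuple[str, Key]] = []
    for key, row in source.items():
        # Key on both sides: UPDATE. Key only in the source: INSERT.
        action = "UPDATE" if key in target else "INSERT"
        target[key] = dict(row)
        actions.append((action, key))
    # Key only in the target: DELETE.
    for key in [k for k in target if k not in source]:
        del target[key]
        actions.append(("DELETE", key))
    return actions
```

Filtering the returned list on `"UPDATE"` plays the role of the outer query that keeps only the updated rows, for example to write the expired versions of a type 2 dimension.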
How to integrate the MERGE ‘chain’ in
SSIS
In Target
Not in Target
Not in Source
Same Data
SQL Server 2008
Release focused on Performance & Scale
Improved ETL processing
SSIS Connectivity
SSIS Lookup
SSIS Pipeline Parallelism
Change Data Capture (CDC)
MERGE
Benefits
Less manual tuning during development (lower TCO)
Better hardware utilization (higher ROI)
Smaller batch window (agility)
ELCA is one of the main independent Swiss companies in the IT development and
system integration field.
We develop, integrate, operate and maintain IT solutions using custom developed
applications, as well as industry standards.
• Founded in: 1968
• Employees: > 450
• Offices: Lausanne (headquarters), Zurich, Geneva, Bern, London, Madrid, Paris, Ho Chi Minh City (Vietnam)
• Turnover: CHF 57 M, uninterrupted positive results for 20 years
• Quality Standards: ISO 9001 since 1993, CMMI Level 3 since 2007
• Awards
Microsoft BI
www.microsoft.com/bi
SQL Server Integration Services
http://www.microsoft.com/sql/technologies/integration/default.mspx
Guided Tours for SSIS
http://www.microsoft.com/sql/technologies/integration/tours.mspx
Technical Portal for SSIS
http://technet.microsoft.com/en-gb/sqlserver/default.aspx
Developer Portal for SSIS
http://msdn2.microsoft.com/en-us/sql/aa336312.aspx
Safe Software FME Extension for SSIS
http://www.safe.com/microsoft
© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.
Competencies and main focus
Business Consulting
System Integration
Software Development
Operations Management
Strategy
Architecture
Compliance
BI / DWH
CRM
ECM
Security
Ticketing
Individual SW
Project Mgmt
Testing
Properties
Quality
Turnkey
Fixed price
Offshore
Payment Services
Virtualization
Operations
• With more than 10 years of experience and more than 30 experienced consultants and engineers, ELCA is one of the leaders in the Business Intelligence market in Switzerland.
• We deliver state-of-the-art solutions for our customers:
  • Financial consolidation for management and regulatory reporting
  • Analytical CRM solutions for marketing and sales
  • Balanced scorecard for performance management
  • Business Process Management solutions
  • Integrated operational reporting
  • Risk management solutions
  • Resource optimization support