BI-User-Group-SSIS

advertisement
Deep Dive into ETL
Implementation with SQL
Server Integration Services
Anton Rozenson
Anton.Rozenson@GnetGroup.com
About this training event
• First event with core focus on Microsoft BI technology
• Help the community by sharing our learning and experience
based on real world scenarios
• Network with peers and learn from you!
• These training events will be held every 2 months
Agenda
• Importance and complexity of ETL process
• ETL Architecture
• Changed Data Capture challenge and options
• Data Flow design and performance considerations
• SSIS project deployment
• Package execution options
• Performance monitoring in SSIS catalog
Moving data
• “Data Warehouse is a system that extracts, cleans, conforms,
and delivers source data into a dimensional data store and
then supports and implements querying and analysis for the
purpose of decision making” – Ralph Kimball, Joe Casertam.
(2004) “The Data Warehouse ETL Toolkit
• Estimated 80% of work in
building Data warehouse solution
is related to ETL design and
implementation
• Data Warehouse is only as good
as the data that it contains
Common ETL Architecture
Sources
Staging
EDW
Flat File
Data
warehouse
RDBMS
Cloud
• Changed data
• Reference data
• Artifacts and Error
tables
• Consumption
ready
• De-normalized
• Clean data
ELT Architecture
Sources
EDW
Flat File
Data
warehouse
RDBMS
Cloud
• Changed data
• Reference data
• Artifacts and Error
tables
• De-normalized
• Retains
traceable
business key
CDC Challenge
What Changed?
When changed ?
Unable to modify source
systems to include CDC
attribution?
Reliable
timestamps?
Need to reliably determine modified
data for incremental loading
New data
Updated data
Deleted data
CDC options
Source system can provide time stamps
• Reliability of process and completeness of information?
• Did all yesterday’s transactions committed?
• Does Source system include work flows that can cause
late arrival of data records?
• How to determine deleted records?
Comparing EDW data to Source system to determine differences
• Very expensive query affecting Source system
• Does Source System incorporates archival process?
Operational Data Store (ODS) can help
• Keeping change history
• Clearly define state of data records
• Stores metadata
CDC in SQL 2014
In SQL Server, change data capture offers an effective solution to the
challenge of efficiently performing incremental loads from source
tables to data marts and data warehouses.
Change Data Capture process stores transaction information from
SQL log into system tables in CDC Schema.
Data from CDC tables can be extracted by using table valued
functions generated when CDC is enabled on the table.
Key concepts:
• LSN – a binary timestamp representation used to restrict
changed set
• All Changes – changed set includes all DML transactions
• Net Changes – changed set includes last DML transactions
based on unique index.
Demo
Data Flow design challenges
How should Data Flow perform?
Consider following factors when designing an ETL Solution
• Source System structure and ability to execute logic such as
sorting and filtering.
• Parallel processing should be used with caution. Fastest way
to load a table is Fast Load with table lock. This prevents
loading data in parallel. Partition switching can be an answer.
• Requirements for data availability in EDW
• Service Level Agreement (SLA)
Data Flow blocking tasks
Data flow transformations in SSIS use memory/buffers in different
ways. The way a transformation uses memory can dramatically
impact the performance of your package. Transformation buffer
usage can be classified into 3 categories: Non Blocking, Partially
Blocking, and (Full) Blocking.
Non Blocking transformations: Audit, Character Map, Conditional
Split, Copy Column, Data Conversion, Derived Column, Import
Column, Lookup, Multicast, Percentage sampling, Row count, Row
sampling, Script component
Partially Blocking transformations: Data mining, Merge, Merge
Join, Pivot/Unpivot, Term Extraction, Term Lookup, Union All
Blocking transformations: Aggregate, Fuzzy Grouping, Fuzzy
Lookup, Sort
Work around blocking
transformations in Data Flow
It is not always possible to avoid using blocking or partially blocking
transformations, but in some cases it is possible.
For example Merge transformation requires data set to be
sorted. While Sort transformation is expensive, in some cases
sorting can be handled in the source query. Make sure to set
IsSorted property of Data Source output to true and assign
proper SortKeyPosition to output columns.
Another example is Aggregate Transformation. Use Script
transformation to perform aggregation of data and return result to
a variable.
Demo
Project Deployment model
Project deployment model allows following features:
• Parameters can be used in expressions or tasks. Parameters can
reference an environment variable. Environment variable values
are resolved at the time of package execution.
• An environment is a container of variables that can be referenced
by Integration Services projects. Environments allow you to
organize the values that you assign to a package. For example,
you might have environments named "Dev", "test", and
"Production".
• SSISDB catalog allows you to use folders to organize your projects
and environments.
• Catalog stored procedures and views can be used to manage
Integration Services objects in the catalog.
Package Execution
An execution is an instance of a package execution.
Package execution can be scheduled via SQL Agent job. SQL Agent
provides an easy to use interface for mapping of Project parameters
to environment variables.
Packages can also be executed via Execute package tasks from
another SSIS package. This allows creation of robust workflow
incorporated into the Master Package.
Package Execution
SSIS catalog allows package execution to be controlled
programmatically from within T-SQL. A Number of stored procedures
are provided to manage Package Execution.
catalog.create_execution creates an instance of package execution
and assigns Execution_ID.
catalog.set_execution_parameter_value assigns parameters to the
instance of package execution. Execution parameters control Logging
Level, Dump settings, Synchronized execution option as well as
ability to assign values to Project or Package scoped parameters.
catalog.start_execution starts an instance of execution.
Execution Monitoring
Catalog provides a set of standard reports allowing administrators
easy access to execution performance and statistics.
For details about executions, validations, messages that are logged
during operations, and contextual information related to errors, query
these views.
executions list of Executions includes environmental data
execution_data_statistics data flow performance information
execution_parameter_values list of run time parameters
event_messages messages that were logged during executions
Demo
Questions
Additional Resources
What's New in SQL Server 2014
http://msdn.microsoft.com/en-us/library/bb500435.aspx
SSIS Catalog
http://msdn.microsoft.com/en-us/library/hh479588.aspx
Deployment of Projects and Packages
http://msdn.microsoft.com/en-us/library/hh213290(v=sql.120).aspx
Change Data Capture
http://technet.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
Change Data Capture (SSIS)
http://msdn.microsoft.com/en-us/library/bb895315.aspx
CDC Flow Components
http://msdn.microsoft.com/en-us/library/hh231087(v=sql.120).aspx
Enable and Disable Change Data Capture (SQL Server)
http://msdn.microsoft.com/en-us/library/cc627369.aspx
SQL Server OLE DB Deprecation and Integration Services
http://blogs.msdn.com/b/mattm/archive/2012/01/09/sql-server-ole-db-deprecation-and-integration-services.aspx
oData Data source setup
http://www.microsoft.com/en-us/download/details.aspx?id=42280
oData samples
http://services.odata.org/
Contact Us
www.gnetgroup.com
Neelesh Raheja
VP, Consulting Services
Anton Rozenson
BI Solution Architect
Neelesh.Raheja@gnetgroup.com
Anton.Rozenson@gnetgroup.com
@PracticalBI
blog.gnetgroup.com
facebook.com/gnetgroup
linkedin.com/company/143712
youtube.com/user/GNetGroup
twitter.com/GnetGroup
Download