Build a Metadata-Driven ETL Platform by Extending Microsoft SQL Server Integration Services

SQL Server Technical Article

Writers: Tianying He, Mike Gudyka
Technical Reviewer: Darvey Lavender
Contributors: Eric Sullivan, John Barrows, Erik Swanson, Bob Rohde, Pankaj Sharma, Veronica D'Souza, Shawn Archer, Adrian Hill, Paul Zangaglia, Lucas Hryniewicki, Amaranath Dabbara
Published: March 2008
Applies To: SQL Server 2008

Summary: SQL Server 2008 Integration Services (SSIS) provides a flexible and scalable architecture that enables high-performance data extract, transform, and load (ETL). The Microsoft Business Intelligence Center of Excellence has extended SSIS to a metadata-driven platform to more effectively build, deploy, and manage ETL processes in large data warehousing environments.

Copyright

This is a preliminary document and may be changed substantially prior to final commercial release of the software described herein.

The information contained in this document represents the current view of Microsoft Corporation on the issues discussed as of the date of publication. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information presented after the date of publication.

This White Paper is for informational purposes only. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS DOCUMENT.

Complying with all applicable copyright laws is the responsibility of the user. Without limiting the rights under copyright, no part of this document may be reproduced, stored in or introduced into a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopying, recording, or otherwise), or for any purpose, without the express written permission of Microsoft Corporation.

Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property rights covering subject matter in this document. Except as expressly provided in any written license agreement from Microsoft, the furnishing of this document does not give you any license to these patents, trademarks, copyrights, or other intellectual property.

© 2008 Microsoft Corporation. All rights reserved.

Microsoft, Excel, and SQL Server are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other countries.

The names of actual companies and products mentioned herein may be the trademarks of their respective owners.

Table of Contents

Introduction
Challenges and Product Limitations
    Enterprise Standard
    Developer Productivity
    Data Lineage and Audit Trail
    Scale Out with Commodity Hardware
Usage Scenario of Metadata-Driven ETL
Platform Architecture
Metadata Repository
Builder
    Defining SSIS Control Flow
    Dynamic Generating SSIS Packages
Controller and Worker
    Distributed Execution
    Unified Logging
Monitor
Metadata Editor
ETL Pattern Library
Further Enhancements
    Metadata Repository Manager
    Business Rule Engine
    Data Quality
    Putting It Together
Conclusion

Introduction

Microsoft SQL Server™ 2008 Integration Services (SSIS) enables enterprise customers to create, deploy, and manage high-performance data integration solutions. Some of the most common scenarios of using SSIS are building data warehouses (DW) and developing business intelligence (BI) solutions.
A data warehouse is defined by Bill Inmon as “a subject oriented, integrated, time variant, nonvolatile collection of data in support of management decision.” Extract, transform, and load (ETL) is a crucial process in data warehousing that involves extracting data from outside sources, transforming it to fit business needs, and ultimately loading it into the end target, usually the data warehouse. ETL is an important part of the process of bringing heterogeneous and asynchronous source extracts into a homogeneous environment.

SQL Server Integration Services can pull data from a wide variety of sources including relational databases, Microsoft Excel® files, XML files, and flat files. It also includes a rich set of tools and components for developing and executing ETL packages. You can create SSIS packages manually by using SQL Server Business Intelligence Development Studio or programmatically by using SSIS APIs.

Although SSIS offers powerful capabilities for building robust ETL solutions, customers still face many challenges when implementing large-scale data warehouses. This paper describes how to extend SSIS to a metadata-driven platform to better address those challenges.

Challenges and Product Limitations

The following sections discuss the challenges of developing large enterprise data warehouse systems and the limitations of SQL Server 2008 Integration Services.

Enterprise Standard

For a large data warehousing system in an enterprise environment, it is important to standardize ETL processes, including unified logging, checkpoints, and exception handling. A standard precisely describes expected behaviors and actions. It helps prevent developers from creating different flavors of SSIS packages, which are difficult for other developers to understand. Standards also help to improve the productivity of the team, the quality of the application, and the maintainability and understandability of the system.
Although creating SSIS packages based on predefined templates is a good practice, this paper introduces a comprehensive metadata-driven approach to enforce enterprise standards.

Developer Productivity

Developers can create and deploy SSIS packages by using SQL Server Business Intelligence Development Studio, which offers a flexible way to define and execute ETL tasks. SQL Server Business Intelligence Development Studio is very effective for designing individual, simple ETL packages. For large data warehousing systems with hundreds of packages, however, it is labor intensive and error prone to develop, test, deploy, and maintain that many SSIS packages for data acquisition, integration, and distribution.

A cost-effective alternative is to enable developers to define ETL processes by using metadata definitions, without having to worry about common tasks such as logging and exception handling. ETL patterns are custom implementations of data movement, transformation, integration, and quality management. Reusable ETL patterns improve developer productivity. ETL patterns can be defined by using metadata and implemented by using custom stored procedures or managed code.

Data Lineage and Audit Trail

Enterprise customers need advanced BI capabilities, such as using data lineage to track data integration from operational systems to BI reports. Data lineage helps to show where the data came from and what rules were applied to the data along the way. It provides a complete view of the data lifecycle. This type of advanced BI capability requires an integrated metadata system. While the SQL Server BI toolset is mostly metadata-driven internally, every major component of SQL Server, from database tables to XML files, keeps metadata in its own independent structure with its own access methods.
Metadata is kept in different formats across Microsoft SQL Server Integration Services (SSIS), Analysis Services (SSAS), and Reporting Services (SSRS). While you can use the Microsoft® SQL Server™ 2005 BI Metadata Sample toolkit to analyze and view dependencies and lineage of objects, it is mostly used for impact analysis.

Scale Out with Commodity Hardware

By design, the SQL Server Integration Services pipeline operates almost exclusively in memory. The potential disadvantage is that large amounts of data and complicated transformations require a large amount of memory. SSIS does not support Address Windowing Extensions (AWE). Scaling out to multiple packages across processes is the only way to take advantage of larger amounts of memory. For very large memory requirements, consider using 64-bit systems.

High-end computers can be expensive, and customers are looking for alternatives that use commodity hardware. Instead of scaling up, scaling out by using distributed processing on commodity hardware can be a cost-effective option. Distributed processing not only improves the scalability of the system, but also increases its reliability by removing single points of failure.

To address these challenges, this paper presents the design of a metadata-driven ETL platform that focuses on optimizing the acquisition, integration, and distribution needs of enterprise data warehouses. The platform is complementary to SQL Server Integration Services. It helps reduce the total cost of ownership of large enterprise data warehouse systems and BI solutions.

Usage Scenario of Metadata-Driven ETL

To discuss metadata-driven ETL, we must first understand what metadata is. In short, metadata is data about data. Metadata is used to add context for the data or to hide complexity from users who do not need to know or understand the details of the data. Metadata is classified by usage as technical metadata and business metadata.
In the ETL process, developers primarily deal with technical metadata.

Following is a usage scenario describing a metadata-driven ETL development process. In the scenario, developers extract data from a relational data source to a relational database. Data transformation during the data movement is not included. The basic flow of the extraction process is as follows:

1. The ETL developer defines the source and destination of the extraction process, including servers, relational databases, and tables.
2. For each table, the schema (columns, indexes, and constraints) can either be automatically retrieved by the system or manually specified by the ETL developer.
3. The ETL developer specifies the mapping between source and destination at the database, table, and column level.
4. The ETL developer selects either a full or delta load.
5. The ETL developer configures an orchestration process. For each table, ETL developers define one or multiple steps for performing the extraction.
6. The ETL developer specifies how ETL packages should be executed: at a scheduled time or on demand.
7. The system dynamically generates one or more SSIS packages.
8. The system deploys the SSIS packages to a distributed execution environment.
9. The system executes ETL jobs and captures the job status.

The outcomes of the usage scenario are:

- Metadata defining an end-to-end extraction process is captured and stored in a metadata repository.
- SSIS packages are executed.
- Data is moved from source systems to destinations.
- The ETL job status is captured and logged.
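The developer-facing part of the scenario above (steps 1 through 6) can be sketched as a small metadata definition. This is a minimal illustration in Python, not the platform's actual schema: the class names `TableExtract` and `ExtractionJob`, their fields, and the sample server, database, and table names are all hypothetical. The sketch captures the developer's choices and derives the source query for a full load; delta detection is left out because the paper does not specify how it is done.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableExtract:
    """One table to extract. All names here are hypothetical."""
    source_table: str
    destination_table: str
    column_map: Dict[str, str]   # source column -> destination column (step 3)
    load_type: str = "full"      # "full" or "delta" (step 4)
    steps: List[str] = field(default_factory=lambda: ["extract"])  # step 5

    def __post_init__(self) -> None:
        # Reject definitions that could not produce a usable package.
        if self.load_type not in ("full", "delta"):
            raise ValueError(f"unknown load type: {self.load_type}")
        if not self.column_map:
            raise ValueError("at least one column mapping is required")

    def full_load_query(self) -> str:
        """Build the SELECT for a full load from the mapped source columns."""
        columns = ", ".join(self.column_map)
        return f"SELECT {columns} FROM {self.source_table}"


@dataclass
class ExtractionJob:
    """Source, destination, tables, and schedule for one extraction (steps 1 and 6)."""
    source_server: str
    source_database: str
    destination_server: str
    destination_database: str
    tables: List[TableExtract]
    schedule: Optional[str] = None  # None means "run on demand"


job = ExtractionJob(
    source_server="SRC01", source_database="Sales",
    destination_server="DW01", destination_database="Staging",
    tables=[TableExtract("dbo.Orders", "stg.Orders",
                         {"OrderID": "order_id", "Amount": "amount"})],
)
print(job.tables[0].full_load_query())
# prints: SELECT OrderID, Amount FROM dbo.Orders
```

Validating the definition when it is created, rather than when a package is generated, is one way to surface metadata errors before anything is deployed.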
Typical extraction-related metadata includes:

- System environmental parameters, such as server name, data source name, folder location, and connection parameters
- List of tables to be extracted
- Columns to be extracted for each table
- Delta detection method for each table
- Source-to-destination mapping for data movement
- ETL processes and job schedules
- Number of retries in case of failure
- Logging parameters

This use case provides a high-level description of a data extraction process to illustrate the design of the platform. (A delta load is a process in which only the changes in the source tables are loaded into the warehouse database.) There are many other use cases for data extraction, transformation, and load. The following sections demonstrate how to extend SSIS to support a metadata-driven ETL approach.

Platform Architecture

The design goals of the platform include improving the productivity of developers, enforcing ETL standards, supporting a cost-effective way to deploy large data warehouses on commodity hardware, and providing a centralized metadata repository for lineage tracking.

The solution architecture of the platform is shown in Figure 1. The intent of this paper is not to document all the implementation details. Rather, it describes the concepts and key components of the platform and how they are connected with SQL Server Integration Services. For a more detailed architecture diagram, see Figure 6 at the end of this paper.

Figure 1: Platform architecture

The key components of the platform include:

- Metadata repository. Stores the ETL definitions, which describe the data sources, destinations, data mappings, data transformations, and orchestration processes
- Metadata editor. Used for managing metadata via a graphical user interface
- Builder. Generates SSIS packages based on metadata stored in the metadata repository
- Controller/worker runtime environment. Consists of a controller and a number of workers to manage the distributed execution of SSIS packages and unified logging
- Logging repository. Stores status data for building and executing SSIS packages
- Monitor. Used for monitoring and reporting system status

The following sections describe each component in detail. With an open architecture centered on the metadata repository, this platform can be further extended.

Metadata Repository

The metadata repository is used to store technical and business metadata, including but not limited to data source and destination definitions, data movement and pattern definitions, and orchestration process definitions. It can be further extended to include business rule definitions, data quality metric definitions, and so on.

Figure 2 shows a sample metadata model (the data model is based on the book Universal Meta Data Models by David Marco and Michael Jennings). Note that this model is for illustration purposes only and does not necessarily reflect the actual data model we implemented. In the example:

- Data package defines the data source and destination entities to be used in the ETL process. Data packages can be implemented as a hierarchy and include data groups and data elements. For relational data sources, a database is a type of data package, a table is a type of data group, and a column is a type of data element.
- Data movement defines the mapping between the source and destination. The mapping can be done at multiple levels. For relational databases, it can be at the database, table, and column level.
- Data transformation defines the transformation rules that will be applied to the data movement. Transformation rules can be implemented by using stored procedures and managed code. Reusable code can be saved as ETL patterns.

In addition, other metadata is required by the platform. For example, data orchestration defines how the ETL jobs will be executed and the precedence of tasks.
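The data package hierarchy and multi-level data movement described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the repository's actual model: the class names mirror the terms in the text, while `expand_table_mapping` and its rule of matching columns by case-insensitive name are hypothetical choices made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DataElement:
    """A data element, for example a column."""
    name: str
    data_type: str


@dataclass
class DataGroup:
    """A data group, for example a table."""
    name: str
    elements: List[DataElement] = field(default_factory=list)


@dataclass
class DataPackage:
    """A data package, for example a database."""
    name: str
    groups: Dict[str, DataGroup] = field(default_factory=dict)

    def add_group(self, group: DataGroup) -> None:
        self.groups[group.name] = group


def expand_table_mapping(source: DataGroup,
                         destination: DataGroup) -> List[Tuple[str, str]]:
    """Expand a table-level movement into column-level pairs.

    Assumption: columns are matched by case-insensitive name, and source
    columns with no match in the destination are skipped.
    """
    dest_by_name = {e.name.lower(): e.name for e in destination.elements}
    return [
        (e.name, dest_by_name[e.name.lower()])
        for e in source.elements
        if e.name.lower() in dest_by_name
    ]


orders_src = DataGroup("Orders", [DataElement("OrderID", "int"),
                                  DataElement("Amount", "money"),
                                  DataElement("Notes", "nvarchar")])
orders_dst = DataGroup("Orders", [DataElement("orderid", "int"),
                                  DataElement("amount", "money")])
print(expand_table_mapping(orders_src, orders_dst))
# prints: [('OrderID', 'orderid'), ('Amount', 'amount')]
```

Storing a mapping at the table level and expanding it on demand keeps the metadata short for the common case where source and destination columns line up, while still allowing explicit column-level overrides.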
Figure 2: Sample metadata model

The focus of this paper is not the details of the metadata model. What is important is that the metadata repository is the hub of the platform.

Builder

The builder is designed to automatically generate SSIS packages and instances of custom code based on metadata definitions. Before the release of SSIS, many organizations developed their own custom code to perform ETL. SSIS makes it possible for organizations to leverage their existing investments in data integration and take full advantage of features such as unified logging and distributed execution. Standards and best practices are enforced by the builder.

Defining SSIS Control Flow

Control flow is a key component of SQL Server Integration Services. This section explains how to define control flows by using a predefined template.

SSIS provides powerful and flexible tools out of the box. While flexibility is important for some customers, other customers may place more emphasis on standardized processes. For enterprise customers, enforcing standards not only improves the productivity of developers, but also helps reduce maintenance and operational costs. Instead of using SQL Server Business Intelligence Development Studio to design SSIS control flows, we provided an orchestration template as an abstraction layer. With this template, we can enforce standards and apply best practices behind the scenes. This is especially useful for large data warehousing development.

In the orchestration template, ETL processes are defined within a simple hierarchy consisting of systems, processes, and tasks. A task represents a unit of running code, such as the execution of a stored procedure or an SSIS data flow task. Processes are simply a means of defining groups of tasks.
The hierarchy can be described in outline form:

System
    SystemVariable
    SystemConnection
    Process
        ProcessPrecedent
        Task
            TaskPrecedent

A system is analogous to an SSIS package, a process is analogous to an SSIS sequence container, and a task is analogous to an SSIS task object. Figure 3 shows an example of how an ETL package can be defined by using the orchestration template.

Figure 3: Defining SSIS control flow by using a template

By default, all tasks and processes run in parallel. Precedents can be set at either the task or the process level. In Figure 3, the Load Dimension task must complete before the Load Sales Facts task can start, and all Extract Sales Data tasks must complete before any Load Sales Data tasks can start. Precedents can be qualified with an expression by using SSIS expression syntax.

ETL processes that are defined by using the orchestration template are stored in the metadata repository and used by the builder to generate SSIS packages.

Dynamic Generating SSIS Packages

SSIS packages can be created programmatically based on defined metadata. The following SSIS namespaces and objects are used to generate the package:

Microsoft.SqlServer.Dts.Runtime
    Application
    Connections
    ConnectionManager
    DtsContainer
    Executable
    LogProviderBase
    Package
    PrecedenceConstraint
    Sequence
    Task
    TaskHost
    Variable
Microsoft.SqlServer.Dts.Tasks.ExecuteSQLTask

On the Microsoft Developer Network (MSDN), you can find documentation and sample code that shows how to programmatically create SSIS packages. For more information, see Building Packages Programmatically.

Figure 4 shows a portion of the SSIS package generated for the example described in Figure 3. Because an SSIS package with proper exception handling and logging can be complex even in simple scenarios, it would not be easy to create the package manually from scratch by using SQL Server Business Intelligence Development Studio.
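Before a builder can emit precedence constraints, it has to resolve the task-level and process-level precedents into one dependency graph. The following is a minimal sketch of that resolution step in plain Python, using the standard library's `graphlib` (Python 3.9 or later). It does not use the SSIS object model, and the function name `execution_batches`, its parameters, and the two extract task names are hypothetical. The process and load task names follow the Figure 3 example.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Dict, List, Set, Tuple


def execution_batches(
    processes: Dict[str, List[str]],
    task_precedents: List[Tuple[str, str]],
    process_precedents: List[Tuple[str, str]],
) -> List[List[str]]:
    """Group tasks into batches; tasks within a batch can run in parallel.

    Each precedent is a (before, after) pair. A process-level precedent is
    expanded so that every task in `after` waits for every task in `before`.
    """
    depends_on: Dict[str, Set[str]] = {
        task: set() for tasks in processes.values() for task in tasks
    }
    for before, after in task_precedents:
        depends_on[after].add(before)
    for before, after in process_precedents:
        for task in processes[after]:
            depends_on[task].update(processes[before])

    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular precedents
    batches: List[List[str]] = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        batches.append(ready)
        sorter.done(*ready)
    return batches


processes = {
    # The two extract task names are hypothetical placeholders.
    "Extract Sales Data": ["Extract Orders", "Extract Customers"],
    "Load Sales Data": ["Load Dimension", "Load Sales Facts"],
}
batches = execution_batches(
    processes,
    task_precedents=[("Load Dimension", "Load Sales Facts")],
    process_precedents=[("Extract Sales Data", "Load Sales Data")],
)
for batch in batches:
    print(batch)
# prints:
# ['Extract Customers', 'Extract Orders']
# ['Load Dimension']
# ['Load Sales Facts']
```

Tasks with no precedents land in the first batch, which matches the template's default of running everything in parallel. Circular precedents are rejected up front rather than producing a package that can never finish.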
Generating SSIS packages based on metadata definitions automates numerous repetitive housekeeping tasks, reduces the risk of errors, and enhances developer productivity.

Figure 4: Example of an SSIS control flow diagram generated by the builder

At run time, the builder employs the orchestration template to dynamically create SSIS packages from the metadata specification and deploys them to the target worker servers. The next section explains the distributed execution of ETL packages based on a controller and worker architecture.

Controller and Worker

The run-time environment of the platform is based on a controller/worker architecture. The following sections provide more detail on distributed execution of SSIS packages and unified logging by using controllers and workers.

Distributed Execution

In the architecture that supports distributed execution, ETL operations can occur on any one of many worker servers, while process-defining metadata resides on a centralized controller server. The package executed on the worker server sends progress reports back to the controller and writes events to the standard logging system. The controller and worker architecture is shown in Figure 5.

Figure 5: Controller and worker architecture

The controller server hosts a central metadata database. It also hosts the components that create and deploy SSIS packages, including the builder. The worker server hosts the SSIS packages and client logging components. While the system supports distributed processing, the platform does not mandate it. Both controller and worker components can be installed and run on the same server.

Unified Logging

Logging in this platform consists of a client component and a central component. A client logging component runs on the worker server to collect and manage logging events.
This component uses SQL Server Service Broker to define a logging conversation and send messages from the worker server to the centralized logging repository. The ETL process produces common output that makes it possible to review the status of ETL jobs. In addition, an SSIS log provider and a logging interface for custom code are provided so that all messages, including the SSIS log, platform run-time messages, and custom code events, are written to the same logging stream.

Monitor

The monitor provides a consistent and user-friendly interface for reporting on system status. The tool can report the current status and historical statistics in an easily consumable format. The monitoring tool facilitates a better customer experience and saves debugging and troubleshooting time. The monitoring tool can be further enhanced with proactive notification and integration with Microsoft System Center (formerly named Microsoft Operations Manager).

Metadata Editor

Currently, developers use templates stored in a Microsoft Excel file for data entry. An import wizard is used to load metadata definitions from Excel worksheets to a central metadata repository. Further improvements to the ETL platform may include tools to receive metadata (tables, columns, and relationships) from data modeling tools.

A metadata editor is planned but not yet implemented to capture and maintain metadata definitions for ETL. It will provide a friendly user interface for creating, retrieving, updating, and deleting metadata.

Note: Not all metadata must be captured manually. The metadata editor should be capable of performing scans and discovering source database schemas in order to build the initial target schema as well as to uncover source system changes that may impact ETL jobs.
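The change detection mentioned in the note above can be illustrated with a short comparison of two schema snapshots. This is only a sketch of the idea, since the paper describes the editor as planned rather than implemented: the names `SchemaDiff` and `diff_schema` are hypothetical, and each snapshot is assumed to be a simple map from column name to data type for one table.

```python
from typing import Dict, List, NamedTuple


class SchemaDiff(NamedTuple):
    added: List[str]     # columns present only in the discovered schema
    removed: List[str]   # columns present only in the stored metadata
    retyped: List[str]   # columns whose data type changed


def diff_schema(stored: Dict[str, str], discovered: Dict[str, str]) -> SchemaDiff:
    """Compare column-name -> data-type maps for one table."""
    added = sorted(set(discovered) - set(stored))
    removed = sorted(set(stored) - set(discovered))
    retyped = sorted(
        name for name in set(stored) & set(discovered)
        if stored[name] != discovered[name]
    )
    return SchemaDiff(added, removed, retyped)


stored = {"OrderID": "int", "Amount": "money", "Notes": "nvarchar(100)"}
discovered = {"OrderID": "bigint", "Amount": "money", "Region": "char(2)"}
print(diff_schema(stored, discovered))
# prints: SchemaDiff(added=['Region'], removed=['Notes'], retyped=['OrderID'])
```

A removed or retyped column is the kind of source change that would break an existing extraction, so a scan could flag the affected ETL jobs for review before their next run.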
ETL Pattern Library

Recent studies show that almost half of all ETL implementations use hand-coded extracts for some or all of the work. For data integration between operational systems for migrations and consolidations, or for real-time ETL, custom coding is even more common. It would be valuable for organizations to leverage their existing investments in custom code. In this platform, extended from SSIS, developers can take full advantage of features such as logging, debugging, and BI integration by wrapping the custom code as reusable ETL patterns. An ETL pattern can be implemented by using SQL stored procedures or managed code written for .NET.

The platform already includes many commonly used ETL patterns, such as extract, slowly changing dimensions, and Change Data Capture APIs. It enables ETL developers to extend the pattern library with their own reusable code.

Further Enhancements

The ETL platform can be further extended to include the features described in the following sections.

Metadata Repository Manager

Because of the increased complexity of the metadata repository, the platform will have a graphical user interface (GUI) to support administrative features. For example, the GUI can be used to configure repository security, allowing specific user groups to access repository data. The repository manager should provide commonly used administrative functions.

Business Rule Engine

Instead of implementing business rules in custom code for data movement, transformation, and quality management, business rules can be stored in a metadata repository and consumed by a business rule engine. This will enhance corporate standards, improve data quality, and support the auditing and tracking of data lineage.

Data Quality

Data quality metrics will be used to measure the completeness, validity, consistency, and accuracy of data.
For example, data quality metrics can ensure that nonconforming data is flagged as an error, loaded to an error staging area, and not loaded to the data warehouse until corrected. A successful data quality management strategy involves three key tasks: profiling, cleansing, and auditing. The metadata repository can be further extended to store metadata about data quality.

Putting It Together

A proposed system architecture, with future enhancements, is shown in Figure 6. It highlights major components of the platform as well as how the platform is integrated with SSIS.

[Figure 6 diagram. Labeled components include: Designer, Metadata Editor, Metadata Repository Manager, Metadata Repository, Builder, SSIS Package Generator, ETL Pattern Instance Generator, Package Deployer, ETL Pattern Library, BizRule Engine, Controller/Worker, Load Balancer, SSIS Runtime, SSIS Data Flow Components, SSIS Adapter, Custom Data Adapters, Logger, Logging Repository, Monitor, Real Time Monitor, and Historical Data Analyzer.]

Figure 6: Proposed future system architecture

Besides generating SSIS packages dynamically by using metadata definitions, the platform also enables SSIS packages to be executed in a distributed environment with consistent logging. It also includes reusable code and best practices implemented as ETL patterns. Those patterns complement SSIS data flow components for common ETL operations. A set of standardized APIs will be provided for accessing the metadata and logging repositories. By extending SSIS to support metadata-driven ETL, the platform helps to address many of the challenges encountered when building large data warehouse systems and BI solutions.
Conclusion

Microsoft SQL Server 2008 Integration Services (SSIS) offers strong capabilities for high-performance ETL and is a cost-effective product for developing enterprise data warehouse solutions. By standardizing ETL processes, improving developer productivity, and reducing operational cost, the metadata-driven ETL platform built on top of SSIS described in this paper enables enterprise customers to reduce the total cost of ownership of data warehousing systems.

For more information:

- SQL Server Web site
- SQL Server TechCenter on Microsoft TechNet
- SQL Server DevCenter on MSDN