1er Simposio Latinoamericano
Data Quality Fundamentals
Miguel Angel Granados Troncoso

Agenda
• Scenarios
• Definitions, Processes and Standards
• Data Quality Services (DQS)
• DQS Solutions

[Figure: twelve numbered solution scenarios: Required 9s & Protection; Blazing-Fast Performance; Organizational Compliance; Peace of Mind; Rapid Data Exploration; Managed Self-Service BI; Credible, Consistent Data; Scalable Analytics & DW; Scale on Demand; Fast Time to Solution; Optimized Productivity; Extend Any Data, Anywhere]

#7 Credible, Consistent Data
Companies with accurate data perform better¹

Performers     % of master data complete & accurate     Hrs spent per employee each week searching for info
Top 20%        91%                                      1.2 hrs
Middle 50%     68%                                      2.8 hrs
Bottom 30%     Under 50%                                6 hrs

Delivered with:
• Data Quality Services
• Master Data Services
• Single BI Semantic Model

¹Source: “Turning Pain into Productivity with Master Data Management,” Aberdeen Group, Feb 2011

Why is Data Quality Important?
Data quality problems cost U.S. businesses more than $600 billion a year.
Source: Data Warehousing Institute (TDWI)
Costs associated with bad data include:
• Excess inventory
• Higher supply chain costs
• Higher direct marketing costs
• Billing
• And more…

Common Data Quality Issues
Each entry gives the data quality issue, the question it asks, and a sample data problem.
• Format: Do values follow consistent formatting standards?
  – Telephone number formats: xxxxxxxxxx, (xxx) xxx-xxxx, 1.xxx.xxx.xxxx, etc.
• Standard: Are data elements consistently defined and understood?
  – ‘Gender code’ = M, F, U versus ‘Gender code’ = 0, 1, 2
• Consistent: Do values represent the same meaning?
  – How is revenue presented? Dollars, Euro, both?
• Complete: Is all necessary data present?
  – 20% of customers’ last names are blank; 50% of zip codes are 99999
• Accurate: Does the data accurately represent reality or a verifiable source?
  – A supplier is listed as ‘Active’ but went out of business six years ago
• Valid: Do data values fall within acceptable ranges?
  – Salary values should be between 60,000 and 120,000
• Duplicates: Data appears several times
  – Both John Ryan and Jack Ryan appear in the system. Are they the same person?

Agenda
  Scenarios
• Definitions, Processes and Standards
• Data Quality Services (DQS)
• DQS Solutions

Data Governance
[Figure: data governance diagram spanning the strategic to tactical levels. Labels: IT Governance; Data Governance; Data Management; Data Quality; Data Correctness; Data Standardization; Master Data Management]

Data Quality
• Data quality consists of verifying whether the data is suitable for its intended use in operations, decision making and planning.
[Figure labels: Domain Management; Discovery; Value Management; Knowledge Discovery]

Quality Control Efforts
• Know the context of the data
• Profile the required data
• Create and maintain quality standards
• Track data quality

Requirements for Data Quality Solution
• Monitoring: Tracking and monitoring the state of data quality activities and the quality of data.
• Profiling: Analysis of the data source, providing insight into the quality of the data in order to identify data quality issues.
• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.
• Matching: Identifying, linking and removing duplications within or across sets of data.

How to Manage Data Quality?
Data quality management entails the establishment and deployment of:
– Roles
– Responsibilities
– Policies
– Procedures
– Technology
[Figure labels: People; Technology; Processes]

Data Quality Standards
ISO 8000
• Data quality principles
• Characteristics that define data quality
• Processes that ensure data quality
ISO 22745
• Defines open technical dictionaries
• Applying dictionaries to master data
International Association for Information and Data Quality: http://www.iaidq.org/

Agenda
  Scenarios
  Definitions, Processes and Standards
• Data Quality Services (DQS)
• DQS Solutions

Data Quality Services (DQS) is a knowledge-driven data quality solution that enables IT pros and data stewards to easily improve the quality of their data.

DQS Solution Concepts
• Knowledge-Driven: Based on a Data Quality Knowledge Base (DQKB) that is reusable for a variety of data quality improvements
• Semantics: Data is mapped into Data Domains, which capture its semantics
• Knowledge Discovery: Acquire additional knowledge through data samples and user feedback
• Open and Extendible: Supports the use of user-generated knowledge and IP by 3rd party reference data providers
• Easy to Use: Compelling user experience designed for increased productivity

Data Quality Knowledge Base (DQKB)
• Repository of knowledge about data:
  – Domains define values and rules for each field
  – Matching policies define rules for identifying duplicate records
[Figure labels: Domains; Composite Domains; Matching Policy]

DQS Knowledge Sources
• Windows Azure Marketplace™ DataMarket: Cleanse and enrich data with Reference Data Services from DataMarket
• 3rd Party Reference Data Providers: Open integration with external 3rd party reference data providers
• DQS Data Store: Website that contains DQS knowledge available for downloading
• Organization Data: Create domains from your own data sources
• Out of the Box Knowledge: A set of data domains that come out of the box with DQS

What is a Domain?
• Domains are specific to a data field
• Domains contain the rules for the data
• Domains can be individual or composite
[Figure labels: Domain; Values; Reference Data; Rules and Relationships]

What is a Reference Data Service?
• The Azure Marketplace hosts specialist data cleansing providers
  – Set up an account
  – Subscribe to a reference service
  – Map your domain to the reference service
[Figure labels: KB; Address; Name; First Name; Family Name]

DQS Architecture Overview
[Figure: DQS architecture diagram. Labels:
  DQS Clients: DQS Client; SSIS DQS Cleansing Component; Other DQS Clients; Future Clients: Excel, SharePoint, MDS…
  DQS Cloud Services: DQS Store - KB, Domains; DataMarket - Categorized Reference Data; 3rd Party Reference Data; Reference Data API (Browse, Set, Validate…); Reference Data API (Browse, Get, Update…)
  DQS Server: DQS Engine; Knowledge Discovery and Management; Interactive DQ Projects; Knowledge Discovery; Data Profiling; Exploration; Cleansing; Matching; Reference Data Services; Administration
  Stores: DQ Projects Store; Common Knowledge Store; DQ Active Projects; Published KBs; Reference Data]
© 2010 Microsoft Corporation. Microsoft Materials - Confidential. All rights reserved.

Agenda
  Scenarios
  Definitions, Processes and Standards
  Data Quality Services (DQS)
• DQS Solutions

DQS process
[Figure: DQS process diagram. Labels: Build; Use; Knowledge Management; Reference Data; Enterprise Data; Knowledge Base; DQ Projects; Integrated Profiling; Notifications; Progress; Status]

Interactive Cleansing – DQS Project
• Analyzes the quality of source data
• Automatically corrects and enriches the data
• Manual approval/rejection of suggestions provided by the cleansing algorithm/reference data services

Batch Cleansing - Using SSIS
[Figure: batch cleansing data flow. Labels: DQS Server; Knowledge Base; Values/Rules; Reference Data Definition; Matching Policy; SSIS Package; SSIS Data Flow; Source; DQS Cleansing Component; Destination]

Matching – DQS Project

Why Match?
• Identify duplicates within the data source
• Create a consolidated view of the data

DQS Matching
• Build a matching policy
• Matching training
• Create a matching project
• Choose survivors

Agenda
  Scenarios
  Definitions, Processes and Standards
  Data Quality Services (DQS)
  DQS Solutions

Q&A
• Personal Blog: http://www.granadostroncoso.com.mx
• PASS Mexico City Chapter: http://mexico.sqlpass.org (@PASSMXDF)
• SolidQ Journal: http://www.solidq.com/sqj/Pages/Home.aspx
• Microsoft: http://www.microsoft.com/sqlserver/en/us/solutions-technologies/SQL-Server2012-business-intelligence.aspx
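
Appendix: matching sketch
The matching steps listed under DQS Matching (build a matching policy, find duplicates, choose survivors) can be illustrated outside DQS with a small Python sketch. This is not the DQS matching algorithm: it assumes a weighted string similarity from the standard-library difflib module, and the field names, weights, threshold and sample records are all hypothetical.

```python
from difflib import SequenceMatcher

# Hypothetical matching policy: field name -> weight (weights sum to 1.0).
MATCHING_POLICY = {"first_name": 0.3, "last_name": 0.5, "city": 0.2}
MIN_SCORE = 0.8  # assumed minimum score for two records to count as duplicates


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def match_score(r1, r2, policy=MATCHING_POLICY):
    """Weighted sum of per-field similarities, as defined by the policy."""
    return sum(weight * similarity(r1[field], r2[field])
               for field, weight in policy.items())


def find_clusters(records, min_score=MIN_SCORE):
    """Group each record with the first cluster whose pivot it matches."""
    clusters = []
    for rec in records:
        for cluster in clusters:
            if match_score(rec, cluster[0]) >= min_score:
                cluster.append(rec)
                break
        else:
            clusters.append([rec])  # no match found: start a new cluster
    return clusters


def choose_survivor(cluster):
    """Survivorship rule (assumed): keep the record with the most filled fields."""
    return max(cluster, key=lambda r: sum(1 for v in r.values() if v))


# Hypothetical sample data.
records = [
    {"id": 1, "first_name": "John", "last_name": "Ryan", "city": "Mexico City", "phone": ""},
    {"id": 2, "first_name": "Jon", "last_name": "Ryan", "city": "Mexico City", "phone": "5550100"},
    {"id": 3, "first_name": "Maria", "last_name": "Lopez", "city": "Puebla", "phone": "5550199"},
]

clusters = find_clusters(records)
survivors = [choose_survivor(c) for c in clusters]
for s in survivors:
    print(s["id"], s["first_name"], s["last_name"])
```

With these assumed weights, "John Ryan" and "Jon Ryan" fall into one cluster, while "John Ryan" and "Jack Ryan" in the same city would score just below the 0.8 threshold. This is why a matching policy usually needs to be tuned against sample data before a matching project is run.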