Data Quality Knowledge Base - Shanghai SQL Server User Group

1st Latin American Symposium
Data Quality Fundamentals
Miguel Angel Granados Troncoso
Agenda
• Scenarios
• Definitions, Processes and Standards
• Data Quality Services (DQS)
• DQS Solutions
• Required 9s & Protection
• Rapid Data Exploration
• Managed Self-Service BI
• Scale on Demand
• Organizational Compliance
• Blazing-Fast Performance
• Fast Time to Solution
• Peace of Mind
• Credible, Consistent Data
• Scalable Analytics & DW
• Optimized Productivity
• Extend Any Data, Anywhere
#7
Credible, Consistent Data
Companies with accurate data perform better¹
Performer group          % of master data complete & accurate   Hrs spent per employee each week searching for info
Top 20% Performers       91%                                    1.2 hrs
Middle 50% Performers    68%                                    2.8 hrs
Bottom 30% Performers    Under 50%                              6 hrs
Delivered with: Data Quality Services, Master Data Services, Single BI Semantic Model
¹Source: “Turning Pain into Productivity with Master Data Management,” Aberdeen Group, Feb 2011
Why is Data Quality Important?
Data quality problems cost U.S. businesses more than $600 billion a year.
Data Warehousing Institute (TDWI)
Costs associated with bad data include:
• Excess inventory
• Higher supply chain costs
• Higher direct marketing costs
• Billing
• And more…
Common Data Quality Issues
Data Quality Issue: question to ask, followed by a Sample Data Problem
• Format: Do values follow consistent formatting standards?
  Sample: telephone number formats xxxxxxxxxx, (xxx) xxx-xxxx, 1.xxx.xxx.xxxx, etc.
• Standard: Are data elements consistently defined and understood?
  Sample: ‘Gender code’ = M, F, U versus ‘Gender code’ = 0, 1, 2
• Consistent: Do values represent the same meaning?
  Sample: How is revenue presented? Dollars, Euro, both?
• Complete: Is all necessary data present?
  Sample: 20% of customers’ last names are blank, 50% of zip codes are 99999
• Accurate: Does the data accurately represent reality or a verifiable source?
  Sample: a supplier is listed as ‘Active’ but went out of business six years ago
• Valid: Do data values fall within acceptable ranges?
  Sample: salary values should be between 60,000 and 120,000
• Duplicates: Data appears several times
  Sample: both John Ryan and Jack Ryan appear in the system. Are they the same person?
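Several of the issues above can be tested mechanically. The following is a minimal Python sketch, not part of the original deck: the sample records, the field names and the choice of (xxx) xxx-xxxx as the agreed phone format are assumptions made for illustration.

```python
import re

# Hypothetical sample records; field names are assumptions for illustration.
customers = [
    {"last_name": "Ryan", "phone": "(555) 123-4567", "zip": "90210", "salary": 75000},
    {"last_name": "", "phone": "5551234567", "zip": "99999", "salary": 130000},
    {"last_name": "Lee", "phone": "1.555.123.4567", "zip": "10001", "salary": 61000},
]

# Assumed standard: one agreed phone format, (xxx) xxx-xxxx.
PHONE_STANDARD = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")


def check_record(rec):
    """Return the list of data quality issues found in one record."""
    issues = []
    if not PHONE_STANDARD.match(rec["phone"]):
        issues.append("format")      # phone is not in the agreed format
    if not rec["last_name"].strip() or rec["zip"] == "99999":
        issues.append("complete")    # blank last name or placeholder zip code
    if not 60_000 <= rec["salary"] <= 120_000:
        issues.append("valid")       # salary outside the acceptable range
    return issues


for rec in customers:
    print(rec["last_name"] or "<blank>", check_record(rec))
```

Accuracy and duplicate detection are not covered here, since they need an external source of truth or a matching step.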
Data Governance
[Diagram, from strategic to tactical: IT Governance, Data Governance, Data Management, Data Quality. Data Management areas: Data Correctness, Data Standardization, Master Data Management.]
Data Quality
• Data quality consists of verifying whether data is suitable for its intended use in operations, decision making and planning.
[Diagram labels: Domain Management, Discovery, Value Management, Knowledge Discovery]
Quality Control Efforts
• Know the context of the data
• Profile the required data
• Create and maintain quality standards
• Track data quality
Requirements for Data Quality Solution
• Monitoring: Tracking and monitoring the state of data quality activities and quality of data.
• Profiling: Analysis of the data source; providing insight into the quality of the data, to identify data quality issues.
• Cleansing: Amend, remove or enrich data that is incorrect or incomplete. This includes correction, standardization and enrichment.
• Matching: Identifying, linking and removing duplications within or across sets of data.
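As an illustration of profiling, the sketch below summarizes one column by completeness and value frequency. It is a simplified Python example rather than the way DQS profiles data; the sample rows and the list of placeholder values are assumptions.

```python
from collections import Counter

# Hypothetical rows; "99999" and "" are treated as placeholders (an assumption).
rows = [
    {"gender": "M", "zip": "90210"},
    {"gender": "F", "zip": "99999"},
    {"gender": "0", "zip": ""},
    {"gender": "M", "zip": "99999"},
]


def profile(rows, column, placeholders=("", "99999")):
    """Summarize one column: completeness ratio, distinct count, top values."""
    values = [r[column] for r in rows]
    filled = [v for v in values if v not in placeholders]
    return {
        "completeness": len(filled) / len(values),
        "distinct": len(set(filled)),
        "top_values": Counter(filled).most_common(3),
    }


print(profile(rows, "zip"))
print(profile(rows, "gender"))
```

Running the same summary on a schedule and comparing the results over time is one simple way to approach monitoring.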
How to Manage Data Quality?
Data quality management entails the establishment and deployment of:
– Roles
– Responsibilities
– Policies
– Procedures
– Technology
[Diagram labels: People, Technology, Processes]
Data Quality Standards
ISO 8000
• Data Quality Principles
• Characteristics that define data quality
• Processes that ensure data quality
ISO 22745
• Defines open technical dictionaries
• Applies dictionaries to master data
International Association for Information and Data Quality
http://www.iaidq.org/
Data Quality Services (DQS) is a knowledge-driven data quality solution that enables IT pros and data stewards to easily improve the quality of their data.
DQS Solution Concepts
Knowledge-Driven
Based on a Data Quality Knowledge Base (DQKB) that is reusable for a variety of data quality improvements
Semantics
Data is mapped into Data Domains, which capture its Semantics
Knowledge Discovery
Acquire additional knowledge through data samples and user feedback
Open and Extendible
Support use of user-generated knowledge and IP by 3rd party reference data providers
Easy to Use
Compelling user experience designed for increased productivity
Data Quality Knowledge Base (DQKB)
• Repository of knowledge about data:
– Domains define values and rules for each field
– Matching policies define rules for identifying duplicate records
[Diagram labels: Domains, Composite Domains, Matching Policy]
DQS Knowledge Sources
Windows Azure Marketplace™ Data Market
Cleanse and enrich data with Reference Data Services from DataMarket
3rd Party Reference Data Providers
Open integration with external 3rd party reference data providers
DQS Data Store
Website that contains DQS knowledge available for downloading
Organization Data
Create domains from your own data sources
Out of the Box Knowledge
A set of data domains that come out of the box with DQS
What is a Domain?
• Domains are specific to a data field
• Domains contain the rules for the data
• Domains can be individual or composite
[Diagram: a Domain holds Values, Reference Data, and Rules and Relationships]
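To make the idea of a domain concrete, here is a hedged Python sketch of a domain as a data structure holding values, synonyms and rules. The `Domain` class, its `cleanse` method, the status names and the 0/1/2 to M/F/U mapping are all hypothetical; they illustrate the concept and are not the DQS object model.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Domain:
    """Hypothetical stand-in for a domain: known values, synonyms and rules."""
    name: str
    valid_values: Set[str] = field(default_factory=set)
    synonyms: Dict[str, str] = field(default_factory=dict)  # variant -> leading value
    rules: List[Callable[[str], bool]] = field(default_factory=list)

    def cleanse(self, value: str) -> Tuple[str, str]:
        """Return (output value, status) for one input value."""
        value = value.strip()
        if value in self.valid_values:
            return value, "correct"
        if value in self.synonyms:
            return self.synonyms[value], "corrected"
        if all(rule(value) for rule in self.rules):
            return value, "new"       # passes the rules but is not yet known
        return value, "invalid"


gender = Domain(
    name="Gender code",
    valid_values={"M", "F", "U"},
    synonyms={"0": "M", "1": "F", "2": "U"},  # assumed mapping, illustration only
    rules=[lambda v: len(v) == 1 and v.isalpha()],
)

for raw in ["M", " 1 ", "X", "male"]:
    print(repr(raw), gender.cleanse(raw))
```

Keeping values, synonyms and rules together in one object is what makes the knowledge reusable across data sources that share the same field.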
What is a Reference Data Service?
• The Azure Marketplace hosts specialist data cleansing providers
– Set up an account
– Subscribe to a reference service
– Map your domain to the reference service
[Diagram labels: KB; Address; Name; First Name; Family Name]
DQS Architecture Overview
DQS Clients
• DQS Client: Knowledge Discovery and Management, Interactive DQ Projects, Administration
• Other DQS Clients: SSIS DQS Cleansing Component
• Future Clients: Excel, SharePoint, MDS…
DQS Cloud Services
• DQS Store - KB, Domains
• DataMarket - Categorized Reference Data
• 3rd Party Reference Data
• Reference Data API (Browse, Set, Validate…)
• Reference Data API (Browse, Get, Update…)
DQS Server
• DQS Engine: Knowledge Discovery, Data Profiling, Exploration, Matching, Cleansing, Reference Data Services
• DQ Projects Store: DQ Active Projects
• Common Knowledge Store: Published KBs, Reference Data
DQS process
[Diagram labels: Knowledge Management (Build); Knowledge Base; DQ Projects (Use); Enterprise Data; Reference Data; Integrated Profiling; Notifications; Progress Status]
Interactive Cleansing – DQS Project
• Analyzes the quality of source data
• Automatically corrects and enriches the data
• Manual approval/rejection of suggestions provided by the cleansing algorithm/reference data services
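The three steps above can be sketched as a small Python example: score each value against known values, correct automatically when the score is high, and leave mid-range cases as suggestions for a person to approve or reject. The city list, both thresholds, the status names and the use of `difflib` similarity are assumptions for illustration; DQS uses its own algorithms.

```python
from difflib import SequenceMatcher

KNOWN_CITIES = ["Shanghai", "Mexico City", "Seattle"]  # hypothetical domain values
AUTO_CORRECT_MIN = 0.90  # assumed threshold: apply the correction automatically
SUGGEST_MIN = 0.70       # assumed threshold: propose it for manual review


def best_match(value, candidates):
    """Return the closest known value and its similarity score in [0, 1]."""
    scored = [
        (SequenceMatcher(None, value.lower(), c.lower()).ratio(), c)
        for c in candidates
    ]
    score, candidate = max(scored)
    return candidate, score


def cleanse(value):
    """Classify one value as correct, corrected, suggested or new."""
    if value in KNOWN_CITIES:
        return {"source": value, "output": value, "status": "correct"}
    candidate, score = best_match(value, KNOWN_CITIES)
    if score >= AUTO_CORRECT_MIN:
        return {"source": value, "output": candidate, "status": "corrected"}
    if score >= SUGGEST_MIN:
        return {"source": value, "output": candidate, "status": "suggested"}
    return {"source": value, "output": value, "status": "new"}


def review(result, approve):
    """Apply a data steward's decision to a suggested correction."""
    if result["status"] != "suggested":
        return result
    if approve:
        return {**result, "status": "approved"}
    return {**result, "output": result["source"], "status": "rejected"}


for raw in ["Seattle", "Seatle", "Shanghia", "Tokyo"]:
    print(raw, cleanse(raw))
```

Two thresholds keep automatic changes limited to near-certain cases while still surfacing plausible fixes for manual review.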
Batch Cleansing - Using SSIS
• DQS server: Knowledge Base (Values/Rules, Reference Data Definition, Matching Policy)
• SSIS Package: SSIS Data Flow with Source → DQS Cleansing Component → Destination
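The source, cleansing and destination stages of such a data flow can be mimicked in plain Python to show the shape of a batch run. This is an analogy only, not the SSIS API: the stage functions, the knowledge base dictionary, the `_Output` and `_Status` column suffixes and the status names are all assumptions.

```python
import csv
import io

# Hypothetical knowledge base: valid values plus known corrections per column.
KNOWLEDGE_BASE = {
    "country": {
        "valid": {"Mexico", "China"},
        "corrections": {"Mexcio": "Mexico", "PRC": "China"},
    }
}

SOURCE_CSV = "id,country\n1,Mexico\n2,Mexcio\n3,Atlantis\n"


def source(text):
    """Source stage: stream rows from CSV text."""
    yield from csv.DictReader(io.StringIO(text))


def cleansing(rows, kb, column):
    """Cleansing stage: add output and status columns next to the source column."""
    domain = kb[column]
    for row in rows:
        value = row[column]
        if value in domain["valid"]:
            output, status = value, "Correct"
        elif value in domain["corrections"]:
            output, status = domain["corrections"][value], "Corrected"
        else:
            output, status = value, "Invalid"
        yield {**row, f"{column}_Output": output, f"{column}_Status": status}


def destination(rows, status_column):
    """Destination stage: split rows into a clean set and a set needing review."""
    clean, needs_review = [], []
    for row in rows:
        target = needs_review if row[status_column] == "Invalid" else clean
        target.append(row)
    return clean, needs_review


clean, needs_review = destination(
    cleansing(source(SOURCE_CSV), KNOWLEDGE_BASE, "country"), "country_Status"
)
print(len(clean), "clean rows;", len(needs_review), "rows for review")
```

Routing rows by status lets an unattended batch load clean data while setting aside the rest for an interactive project.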
Matching – DQS Project
Why Match?
• Identify duplicates within the data source
• Create consolidated view of data
DQS Matching
• Build a matching policy
• Train the matching policy
• Create a matching project
• Choose survivors
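A minimal Python sketch of these ideas follows: a matching policy expressed as per-field weights and a minimum score, a greedy clustering pass, and a survivorship rule. The records, weights, threshold, similarity measure and survivorship rule are all assumptions for illustration and do not reproduce the DQS matching algorithm.

```python
from difflib import SequenceMatcher

# Hypothetical records for illustration.
records = [
    {"id": 1, "name": "John Ryan", "city": "Seattle", "phone": "5551234567"},
    {"id": 2, "name": "Jack Ryan", "city": "Seattle", "phone": "5551234567"},
    {"id": 3, "name": "Mary Smith", "city": "Shanghai", "phone": ""},
]

# Hypothetical matching policy: per-field weights and a minimum matching score.
WEIGHTS = {"name": 0.5, "city": 0.2, "phone": 0.3}
MIN_SCORE = 0.80


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_score(r1, r2):
    """Weighted sum of field similarities under the policy."""
    return sum(w * similarity(r1[f], r2[f]) for f, w in WEIGHTS.items())


def cluster(records):
    """Greedy clustering: a record joins the first cluster whose pivot it matches."""
    clusters = []
    for rec in records:
        for group in clusters:
            if match_score(group[0], rec) >= MIN_SCORE:
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters


def choose_survivor(group):
    """Assumed survivorship rule: keep the record with the most filled fields."""
    return max(group, key=lambda r: sum(1 for v in r.values() if v))


for group in cluster(records):
    print([r["id"] for r in group], "survivor:", choose_survivor(group)["id"])
```

Under this assumed policy the two Ryan records land in one cluster because city and phone agree; whether they really are the same person remains a decision for the data steward, which is why the weights and threshold need tuning against known examples.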
Q&A
Personal Blog
http://www.granadostroncoso.com.mx
PASS Mexico City Chapter
http://mexico.sqlpass.org
@PASSMXDF
SolidQ Journal
http://www.solidq.com/sqj/Pages/Home.aspx
Microsoft
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/SQL-Server2012-business-intelligence.aspx