Class slides May 30, 2001

advertisement
Data Quality
Class 3
Goals



Dimensions of Data Quality
Enterprise Reference Data
Data Parsing
Dimensions of Data Quality

Poor data quality is similar to obscenity–

It seems as if there are no real ways to measure it,
but you know it when you see it!
In reality, data quality can be measured
–
The frame of reference for measurement is different
Dimensions of Data Quality 2




Data Models
Data Values
Data Presentation
Data Policy
Example: Sales Database

Sales and marketing database
–
–


Current customers
Sales leads
Name, address, contact data
For current customers, sales data
Data Quality of Data Models




Clarity of definition
Comprehensiveness
Flexibility
Robustness
Data Quality of Data Models 2




Essentialness
Attribute granularity
Precision of domains
Homogeneity
Data Quality of Data Models 3




Naturalness
Identifiability
Obtainability
Relevance
Data Quality of Data Models 4



Simplicity
Semantic Consistency
Structural Consistency
Data Quality of Data Values





Accuracy
Null values
Completeness
Consistency
Currency
Accuracy



Agreement with established sources
Database of record
Other corroborative sources
Null Values

Null vs. Missing
–
–
–
–
–
Unavailable
Not applicable
No value
Not classified
Truly null
Completeness




Mandatory attributes require values
Optional attributes may hold values (when and
how?)
Inapplicable attributes may not have a value
(also when and how?)
Completeness constraints
Consistency




Are values in one set consistent with values in
another set?
Consistency relations between attributes in the
same table
Consistency assertions across columns
Consistency relationships between tables
Currency/Timeliness



What data is current?
How is it kept up-to-date?
Time expectations for accessibility to data
Data Quality of Data Presentation




Appropriateness
Correct Interpretation
Flexibility
Format Precision
Data Quality of Data Presentation 2



Portability
Representation Consistency
Representation of Null Values
Data Quality of Data Policy





Access
Metadata
Privacy
Fault-tolerance
Security
Reference Data




Relatively static
Referred to from within many tables
Shared data
Examples:
–
–
–
Product catalog
Security type classification
Currency codes
Reference Data 2



Metadata
Data Domains
Mappings between those domains
Domains


A data domain is a subclassed data type
Domains can be described using enumerations
–

Example: states, cities, product codes
Domains can be described using functions
–
Example: formatted trouble ticket ids such as CCNNNN
Domain Membership


Data attributes can be affiliated with data
domains
Test a value to make sure it is valid within a
domain
–
–
For enumerated domains, a lookup works
For generated/described domains, check to see if
the value could be generated by function
Domains as Metadata



The existence of a data domain is metadata
Keep track of which tables have attributes
making use of which domains
Manage domains as enterprise reference data
Domain Tables


Domains are sets of values
Maintain in two tables:
–
–

Domain name table (domain_name, domain_id)
Domain value table (domain_id, value)
Example:
–
–
(2, “STATES”)
{(2, “NY”), (2, “MA”), (2, “CT”), (2, “VA”)…}
Mappings




Map values in one domain to values in other
domains
1-1, 1-Many, Many-1, Many-Many
Represents relations from one set to another
set
1-1, Many-1 are functions
Mappings as Metadata



Semi-static mappings are also metadata
Keep track of which tables refer to mappings
Enterprise reference data
Mapping Tables

Maintain mapping information in two tables:
–
–
Mapping name table (from_domain, to_domain,
mapping_id, mapping name)
Mapping table (mapping_id, source_value,
target_value)
Example:


(“Currency”, “Country”, 65,
“CurrencyToCountry”)
(65, “USD”, “US)
Tables




Inference of data fields from source data
In our case, this is straightforward based on
SGML tags in the data
Determine table structure in context of use of
metadata and reference data
In other words, try to maintain normal form
Keys




Primary key: must be unique across all values in table
Primary key may be assigned based on internal
increasing value, or maybe extant in the data
Foreign key: relates values in one table to values in
another table
Foreign keys must exist in target table (=referential
integrity)
Data Profiling and Parsing

Goals:
–
–
–
Identify data domains
Identify mappings
Identify candidate keys
Data Column Profiling

For each table
–
For each column







Type inference
Subclassed Type inference
Count the number of distinct values
Enumerate the distinct values
Sort the distinct values
Look for patterns of usage
Document discoveries
Data Table Profiling

For each Table
–
–
–
–
Identify candidate keys
Identify data domains
Identify data mappings
Look for patterns
Multiple Table Profiling



Identify foreign key relationships
Validate referential integrity
Look for patterns
Data Parsing

2 kinds of data domains
–
–
Enumerated, consisting of a predefined set of
values
Inferred/Functional, consisting of values that are
validated based on a set of rules
Type Inference



Data value sets all conform to a data type
All values within a column must belong to the same
type
Propose data type by a series of inferences
–
–
–
–
Initial assumption is that all values belong to strictest type
Test for violations to type restriction
As a violation is discovered, loosen the restriction and test
again for violations
Measure conformance at each level
Subclassed Type Inference


Once type has been inferred, look for
additional restrictions within type
Example:
–


Integer vs. Integer within a range (0..100)
For integers and real values, look for ranges
For character strings, look for length, context,
character patterns
String Parsing

Classify characters into groups
–
–
–


Alphabetic
Numeric/Digits
Punctuation (this can be further refined)
Transform all values within a column set into
their corresponding pattern
Example: 789-23-1100 would change to
–
DDDPDDPDDDD
String Parsing




Given a set of representative strings, perform
column analysis again
Look for high frequency counts for specific
patterns
This will provide proposed functional domains
We can characterize certain domains by format
(e.g., telephone numbers, SSNs, Tax Ids, UPC
codes, etc.)
Word Parsing


More advanced technique to explore domain
types
Classify word tokens in terms of predefined
characterization
–
Example: Name words, business words, transaction
words, connector words, titles, etc.
Primary Key Discovery


Iterative process to find candidate keys
A candidate key is one or more attributes
whose values, when composed, can uniquely
locate a record
Primary Key Discovery 2

For I = 1, number of attributes
–
For each set S composed of I attributes do


Are the composed values unique across the tables?
If so, add to candidate key set
Next Assignment

Type Inferencer
–

Propose data types for analyzed columns
Format Parser
–
–
–
–
Transform strings into pattern strings
Propose format for a column
Catalog discovered named formats (telno, SSN,
etc.)
More details on posting on web site
Download