
Course Name: Business Intelligence
Year: 2009
Data Quality and Information Compliance
14th Meeting
Source of this Material
Loshin, David (2003). Business Intelligence:
The Savvy Manager’s Guide. Chapter 9
Bina Nusantara University
The Business Case
According to the 2001 PricewaterhouseCoopers Global Data Management
Survey, fully 75% of the respondents (senior-level executives) had experienced
significant problems as a result of defective data. These problems included:
• Extra costs to prepare reconciliations
• Delays or scrapping of new systems
• Failure to collect receivables
• Inability to deliver orders
• Lost sales
According to the Data Warehousing Institute's report Data Quality and the
Bottom Line, poor-quality customer data costs U.S. businesses $611 billion a
year. And that figure covers customer data alone; consider how much all other
kinds of poor-quality data can cost.
More Than Just Names and Addresses
What is particularly interesting is that some of these assessments of the
commercial costs of poor data quality are based on relatively simple metrics
related to incorrect names and addresses. Data quality is defined in terms of
how each data consumer desires to use the data, and to this end we must
discuss some dimensions across which data quality can be measured.
• Data Quality Dimensions
There are potentially many dimensions of data quality dealing with data models, data
domains, and data presentation, but the ones that usually attract the most attention
are dimensions that deal with data values, namely: Accuracy, Completeness,
Consistency, and Timeliness.
• Accuracy
Accuracy refers to the degree to which data values agree with an identified source of
correct information.
• Completeness
Completeness refers to the expectation that data instances contain all the information
they are supposed to.
More Than Just Names and Addresses (cont…)
• Consistency
Consistency refers to data values in one data set being consistent with values in
another data set.
• Currency/Timeliness
Currency refers to the degree to which information is current with the world that it
models. Currency can measure how up to date information is and whether it is correct
despite possible time-related changes. Timeliness refers to the time expectation for
accessibility of information. Timeliness can be measured as the time between when
information is expected and when it is readily available for use.
• Data Quality Expectations
The level of data quality is determined by the data consumers in terms of meeting or
beating their own defined expectations.
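As a rough sketch (not from Loshin's text), two of these dimensions can be given simple, measurable proxies over a small hypothetical record set; the field names and dates below are illustrative only:

```python
# Hypothetical customer records: completeness is the fraction of populated
# values in a field; currency is the age of the stalest record.
from datetime import datetime

records = [
    {"name": "Alice", "email": "alice@example.com", "updated": "2009-01-10"},
    {"name": "Bob",   "email": None,                "updated": "2009-03-02"},
    {"name": "Carol", "email": "carol@example.com", "updated": "2008-11-30"},
]

def completeness(recs, field):
    """Fraction of records whose `field` is populated (not null)."""
    filled = sum(1 for r in recs if r[field] is not None)
    return filled / len(recs)

def currency(recs, as_of):
    """Age in days of the stalest record relative to an as-of date."""
    oldest = min(datetime.strptime(r["updated"], "%Y-%m-%d") for r in recs)
    return (as_of - oldest).days

email_completeness = completeness(records, "email")  # 2 of 3 emails filled
```

The point is that each dimension only becomes actionable once it is reduced to a number a data consumer can compare against an expectation.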
Types of Errors
In this section we look at some common sources of errors.
• Attribute Granularity
The granularity of an attribute refers to how much information is embedded within that
attribute.
• Finger Flubs and Transcription Errors
Many data errors creep in right at the point of introduction into the processing
stream. At a data entry or transcription stage, individuals or systems are likely to
introduce variations in spellings, abbreviations, phonetic similarities, transposed
letters within words, misspelled words, miskeyed letters, etc.
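Such transcription variants can often be caught by comparing incoming values against a reference list with an edit-distance measure. A minimal sketch, with hypothetical reference values:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

reference = ["Springfield", "Shelbyville"]  # hypothetical master list

def nearest(value, candidates, max_dist=2):
    """Return the closest reference value within max_dist edits, else None."""
    best = min(candidates, key=lambda c: edit_distance(value, c))
    return best if edit_distance(value, best) <= max_dist else None
```

A miskeyed "Sprngfield" lands one edit away from "Springfield" and is flagged as a likely flub, while a genuinely unknown value stays unmatched.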
• Floating Data
Imprecision in metadata modeling may result in lack of clarity as to whether certain
kinds of data go into specific fields.
• Implicit and Explicit Nullness
The question of nullness is interesting, because the absence of a value may provide
more fodder for making inferences than the presence of an incorrect value.
Examples of explicit nulls are shown in Figure 14-1.
Types of Errors (cont…)
Figure 14-1
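The distinction can be sketched in code: an explicit null uses an agreed token that says why a value is absent, while an implicit null is simply empty. The tokens below are hypothetical examples of the kind Figure 14-1 illustrates:

```python
# Hypothetical explicit-null tokens and their meanings.
EXPLICIT_NULLS = {
    "N/A": "not applicable",
    "UNK": "unknown",
    "TBD": "not yet available",
}

def classify(value):
    """Label a field value as a real value, an explicit null, or an implicit null."""
    if value is None or value == "":
        return "implicit null"
    if value in EXPLICIT_NULLS:
        return "explicit null: " + EXPLICIT_NULLS[value]
    return "value"
```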
• Semistructured Formats
Data records in regular databases are formatted in a structured way; information in
free text is said to be unstructured. Somewhere in the middle is what we call
semistructured data, which is essentially free text with some conceptual connectivity.
• Strict Format Conformance
We naturally tend to view information in patterns, and frequently data modelers will
reverse this order and impose a pattern or set of patterns on a set of attributes that
may actually prevent those fields from being properly filled in.
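A sketch of the problem, using hypothetical phone-number patterns: an over-strict pattern rejects legitimate values that merely use a different layout, while a more tolerant pattern accepts them.

```python
import re

# One imposed layout only -- rejects valid numbers written differently.
strict = re.compile(r"^\(\d{3}\) \d{3}-\d{4}$")
# Tolerant of layout: digits plus common separators, 7-15 characters.
lenient = re.compile(r"^\+?[\d\s().-]{7,15}$")

def conforms(pattern, value):
    """True if the value matches the pattern in its entirety."""
    return pattern.fullmatch(value) is not None
```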
Types of Errors (cont…)
• Transformation Errors
Errors may be introduced during the extraction and transformation process. If the
transformation rules are not completely correct or if there is a flaw in the
transformation application, errors will be created where none originally existed.
• Overloaded Attributes
Either as a reflection of poor data modeling or as a result of changing business
concerns, a single data field may actually contain more than one data value.
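A minimal illustration, assuming a hypothetical name field that also carries a care-of line; a small rule can split the overloaded value back into two attributes:

```python
def split_overloaded(name_field):
    """Separate 'SMITH JOHN C/O JONES MARY' into addressee and care-of party."""
    if " C/O " in name_field:
        addressee, care_of = name_field.split(" C/O ", 1)
        return {"name": addressee, "care_of": care_of}
    return {"name": name_field, "care_of": None}
```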
Data Cleansing
A large part of the cleansing process involves identification and elimination of
duplicate records. The data cleansing process consists of phases that try to
understand what is in the data, transform that into a standard form, identify
incorrect data, and then attempt to correct it.
• Parsing
The first step in data cleansing is parsing, which is the process of identifying
meaningful tokens within a data instance and then analyzing the token streams for
recognizable patterns. The parsing process segregates each word, attempts to
determine the relationship between that word and previously defined token sets, and
forms sequences of tokens.
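The parsing step can be sketched as tokenizing a string and tagging each token against predefined token sets; the token sets and sample address here are hypothetical:

```python
# Hypothetical token sets for address parsing.
STREET_TYPES = {"ST", "AVE", "RD", "BLVD"}
DIRECTIONS = {"N", "S", "E", "W"}

def parse_address(text):
    """Tag each token as a number, direction, street type, or plain word."""
    tagged = []
    for tok in text.upper().replace(".", "").split():
        if tok.isdigit():
            tagged.append((tok, "NUMBER"))
        elif tok in DIRECTIONS:
            tagged.append((tok, "DIRECTION"))
        elif tok in STREET_TYPES:
            tagged.append((tok, "STREET_TYPE"))
        else:
            tagged.append((tok, "WORD"))
    return tagged
```

The resulting tag sequence (NUMBER, DIRECTION, WORD, STREET_TYPE) is the "recognizable pattern" that later cleansing stages work from.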
• Standardization
Standardization is the process of transforming data into a form specified as a
standard. That standard may be one defined by some governing body or may just
refer to the segregation of subcomponents within a data field that has been
successfully parsed.
Data Cleansing (cont…)
• Abbreviation Expansion
An abbreviation is a compact representation of some alternate recognized entity, and
finding and standardizing abbreviations is another rule-oriented aspect of data
cleansing. Abbreviations must be parsed and recognized, and then a set of
transformational business rules can be used to change abbreviations into their
expanded form.
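Such transformational rules can be as simple as a lookup table applied token by token; the table entries below are illustrative:

```python
# Hypothetical abbreviation-to-expansion rule table.
ABBREVIATIONS = {"ST": "STREET", "AVE": "AVENUE", "INC": "INCORPORATED"}

def expand(text):
    """Replace each recognized abbreviation token with its expanded form."""
    tokens = text.upper().replace(".", "").split()
    return " ".join(ABBREVIATIONS.get(t, t) for t in tokens)
```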
• Correction
Once components of a string have been identified and standardized, the next stage
of the process attempts to correct those data values that are not recognized and to
augment correctable records with the corrected information. The correction process is
based on maintaining a set of incorrect values and their corrected forms.
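A sketch of such a correction table, with hypothetical entries and an audit log so that every change remains traceable:

```python
# Maintained table of known-bad values and their corrected forms (illustrative).
CORRECTIONS = {"NEW YROK": "NEW YORK", "CHICGAO": "CHICAGO"}

def correct(value, log):
    """Return the corrected form of a value, recording any change made."""
    fixed = CORRECTIONS.get(value.upper(), value.upper())
    if fixed != value.upper():
        log.append((value, fixed))  # audit trail of applied corrections
    return fixed
```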
• Updating Missing Fields
One aspect of data cleansing is being able to fill in fields that are missing
information. It is important to understand, though, that without a well-documented
and agreed-to set of rules for determining how to fill in a missing field, doing so can
be counterproductive and dangerous.
Business Rule-Based Information Compliance
Information compliance is a concept that incorporates the definition of business
rules for measuring the level of conformance of sets of data with client
expectations.
• A Data Quality Rule Framework
Our framework for articulating quality expectations looks at how that data is used and
how we can express rules from this holistic approach. This can be decomposed into
the definition of metadata-like reference data sets and assertions that relate to
values, records, columns, and tables within a collection of data sets.
• Domains
A domain is any value set that has structural rules or explicit semantic rules
governing validity. Either way, these expectations restrict the values that an
attribute takes: whether the rules are syntactic or semantic, we can define an explicit
set of restrictions on a set of values within a type and call that a domain.
• Mappings
A mapping is a relation between domain A and domain B, defined as a set of pairs of
values {a,b} such that a is a member of domain A and b is a member of domain B.
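Both ideas reduce to set membership, which the following sketch shows with hypothetical ZIP-code and state domains (not a complete reference):

```python
# A domain is an explicit value set; a mapping is a set of pairs across domains.
STATE_DOMAIN = {"NY", "CA", "IL"}
ZIP_DOMAIN = {"10001", "90210", "60601"}
ZIP_TO_STATE = {("10001", "NY"), ("90210", "CA"), ("60601", "IL")}

def in_domain(value, domain):
    """Domain membership: the value is taken from the defined set."""
    return value in domain

def mapping_member(a, b, mapping):
    """Mapping membership: the pair (a, b) belongs to the named mapping."""
    return (a, b) in mapping
```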
Business Rule-Based Information Compliance
(cont…)
• Null Conformance
If nulls are allowed, the rule specifies that when a data attribute’s value is null, it
must use one of a set of defined null representations.
• Value Restrictions
A value restriction describes some business knowledge about a range of values; a
value restriction rule constrains a value to lie within the defined range.
• Domain and Mapping Membership
Domain membership asserts that an attribute’s value is always taken from a
previously defined data domain. A mapping membership rule asserts that the relation
between two attributes or fields is restricted based on a named mapping.
• Completeness and Exemption
A completeness rule specifies that when a condition is true, a record is incomplete
unless all attributes on a provided list are not null. An exemption rule says that if a
condition is true, then those attributes in a named list should not have values.
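These two rule types can be expressed as condition-plus-attribute-list pairs; the order record and field names below are hypothetical:

```python
def completeness_violation(record, condition, required):
    """Violated when the condition holds but some required attribute is null."""
    return condition(record) and any(record.get(f) is None for f in required)

def exemption_violation(record, condition, forbidden):
    """Violated when the condition holds but some listed attribute has a value."""
    return condition(record) and any(record.get(f) is not None for f in forbidden)

# A shipped order must carry a ship date and must not carry a cancel reason.
order = {"status": "SHIPPED", "ship_date": None, "cancel_reason": None}
shipped = lambda r: r["status"] == "SHIPPED"
```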
Business Rule-Based Information Compliance
(cont…)
• Consistency
Consistency refers to maintaining a relationship between two (or more) attributes
based on the content of those attributes. A consistency rule indicates that if a
particular condition holds true, then a consequent must also be true.
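A consistency rule is therefore a logical implication, which can be sketched directly; the salary/grade relationship here is a hypothetical example:

```python
def consistent(record, condition, consequent):
    """An implication: violated only when the condition holds and the consequent fails."""
    return (not condition(record)) or consequent(record)

# Hypothetical rule: senior-grade employees must earn at least 80,000.
is_senior = lambda r: r["grade"] == "SENIOR"
pay_floor = lambda r: r["salary"] >= 80000
```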
• Continuous Data Quality Monitoring and Improvement
We iteratively improve data quality by identifying sources of poor data quality,
asserting a set of rules about our expectations for the data, and implementing a
measurement application using those rules.
Management Issue
Data quality is probably the most critical aspect of the business intelligence
process, and it is a problem that needs to be addressed. Be assured that some
investment must be made, preferably at the start of the project, in ensuring high
levels of data quality.
• Pay Now or Pay (More) Later
The most critical management issue associated with data quality is the lack of true
understanding of its importance. No BI program can be successful if the data that
feeds it is faulty.
• Personalization of Quality
Quality becomes personalized when the exposure of data quality problems is viewed
by an employee as a personal attack, something to be avoided at all costs.
• Data Ownership and Responsibilities
A formal structure for data ownership and a responsibility chain should be defined
before the data integration component of implementation.
Management Issue (cont…)
• Correction Versus Augmentation
Another important distinction is between the repetitive application of the same
corrections to the same data each time that data set is extracted from its sources and
brought into the BI environment, and a true data quality improvement process.