Document 11038344

advertisement
HU/0
.M414
no.
^3
ALFRED
P.
WORKING PAPER
SLOAN SCHOOL OF MANAGEMENT
Data Quality Requirements Analysis
and Modeling
WP #3515-93
December 1992
CISL
Richard Y.
WP#
92-04
Wang
Reddy
H. B. Kon
M.
P.
Sloan School of Management,
MIT
MASSACHUSETTS
INSTITUTE OF TECHNOLOGY
50 MEMORIAL DRIVE
CAMBRIDGE, MASSACHUSETTS 02139
To appear
in the Journal of Decision
Special Issue
Support Systems (DSS)
on Information Technologies and Systems
Data Quality Requirements Analysis
and Modeling
WP #3515-93
Deceml)er 1992
CISL
Richard Y.
WP# 92-^
Wang
Reddy
H. B. Kon
M.
P.
Sloan School of Management,
*
MIT
see page bottom for complete address
Richard Y.
Wang E53-317
M. P. Reddy E53-322
Henry B. Kon E53-322
Sloan School of Management
Massachusetts Institute of Technology
01239
Cambridge,
MA
E30SeE35Es3BW.'.
smpv';
•
To
appsear in the Journal of Decision Support Systems (DSS)
Special Issue
Toward Quality
on Information Technologies and Systems
An
Data:
Richard
Attribute-Based Approach
Y.
Wang
Reddy
Henry B. Kon
M.
P.
November 1992
(CIS-92-04, revised)
Composite Information Systems Laboratory
E53-320, Sloan School of
Management
Massachusetts Institute of Technology
Cambridge, Mass. 02139
ATTN:
Prof.
Richard
Wang
(617) 253-0442
Bitnet Address: rwang@sloan.mit.edu
©
1992 Richard Y.
ACKNOWLEDGEMENTS
Wang, M.P. Reddy, and Henry
B.
Kon
reported herein has been supported, in part, by
MIT's International Financial Service Research Center and MIT's Center for
Information Systems Research. The authors wish to thank Stuart Madnick and
Amar Gupta for their comments on earlier versions of this paper. Thanks are also
due to Amar Gupta for his support and Gretchen Fisher for helping prepare this
manuscript.
Work
1.
Introduction
1.1.
1.2.
Data quality: an attribute-based example
Research focus and paper organization
1.3.
2.
5
Rationale for cell-level tagging
related to data tagging
5
2.2.
Work
6
2.3.
Terminology
7
Data quality requirements analysis
3.1.
3.2.
3.3.
3.4.
4.
2
4
4
Research background
2.1.
3.
1
Dimensions of data quality
The
Step
Step
Step
Step
1:
2:
3:
4:
Establishing the applications view
Determine (subjective) quality parameters
Determine (objective) quality indicators
Creating the quality schema
attribute-based
4.1.
4.2.
4.3.
model of data quality
Data structure
Data integrity
Data manipulation
4.3.1. Ql-CompatibiUty and QIV-Equal
4.3.2. Quality Indicator Algebra
4.3.2.1.
Selection
4.3.2.2.
Projection
4.3.2.3.
Union
4.3.2.4.
Difference
4.3.2.5.
Cartesian Product
8
9
9
10
11
12
12
15
15
15
18
18
19
20
22
24
5.
Discussion and future directions
25
6.
References
27
7.
Appendix A: Premises about data quality requirements analysis
29
29
Premises related to data quality modeling
Premises related to data quality definitions and standards across
users
30
7.3. Premises related to a single user
30
7.1.
7.2.
Toward Quality
An
Data:
Attribute-Based Approach
Introduction
1.
Organizations in industries such as banking, insurance,
retail,
consumer marketing, and health
care are increasingly integrating their business processes across functional, product,
lines.
The
integration of these business processes, in turn, accelerates
application systems for product development, product delivery,
1989).
As a
many
medium
more than 60%
& Neter,
1981).
size corporations (with
of the firms
more
effective
&
Short,
applications today require access to corporate functional and product
of errors (Johnson, Leitch,
surveyed 500
for
and customer service (Rockart
Unfortunately, most databases are not error-free, and
databases.
number
result,
demand
and geographic
had problems
some contain
a surprisingly large
In a recent industry executive report. Computer-world
annual sales of more than $20 million), and rejX)rted that
The Wall Street journal also reported
in data quality.^
that:
to computers, huge databases brimming with information are at our fingertips, just
waiting to be tapped. They can be mined to find sales prospects among existing customers; they
can be analyzed to unearth costly corporate habits; they can be manipulated to divine future
trends. Just one problem: Those huge databases may be hill of junk. ... In a world where people
Thanks
are
moving
to total quality
management, one
of the critical areas
is
data.'^
In general, inaccurate, out-of-date, or incomplete data can
socially
and economically (Laudon,
1986; Liepins
&
Uppuluri, 1990; Liepins, 1989;
Zarkovich, 1966). Managing data quality, howrever,
achieve zero
defect
datar' this
may
have significant impacts both
is
a
complex
task.
Although
not always be necessary or attainable
for,
Wang &
it
Kon, 1992;
would be
among
ideal to
others, the
following two reasons:
First, in
many
applications,
addresses in database marketing
customers,
it is
it
is
may
not always be necessary to attain zero defect data. Mailing
a good example.
not necessary to have the correct city
Second, there
is
name
sending promotional materials to target
In
in
an address as long as the zip code
a cost/quality tradeoff in implementing data quality programs.
is correct.
Ballou and
Pazer found that "in an overwhelming majority of cases, the best solutions in terms of error rate
reduction
is
the worst in terms of cost" (Ballou
&
Pazer, 1987).
The Pareto
losses are never uniformly distributed over the quality characteristics.
distributed in such a
way
Rather, the losses are always
that a small percentage of the quality characteristics, "the vital few,"
always contributes a high percentage of the quality
2
Computerworld, September 28, 1 992, p. 80-84.
The Wall Street Journal, May 26, 1992, page B6.
3
just like the well publicized
1
Principle also suggests that
concept of zero
loss.
As
defect products in the
a result, the cost
manufacturing
improvement
literature.
potential
is
high for "the
few" projects whereas the
vital
cure costs more than the disease (Juran
"trivial
& Gryna,
many"
1980).
defects are not worth tackling because the
sum, when the cost
In
is
prohibitively high,
it
is
not feasible to attain zero defect data.
Given
may
that zero defect data
make
and
we
This suggests that
able to judge the quality of data.
characteristics of the data
not always be necessary nor attainable,
manufacturing process
its
it
indicators such as
who
it
would be
originated the data,
.
From
the data
which are
these quality indicators, the user can
know
useful to
when
useful to be
tag data with quality indicators
a judgment of the quality of the data for the specific application at hand.
decision to purchase stocks, for example,
would be
was
In
making
a financial
the quality of data through quality
collected,
and how the data was
collected.
In this paper,
we propose an
attribute-based
model
that facilitates cell-level tagging of data.
Included in this attribute-based model are a mathematical model description that extends the
relational
process
model, a
SQL
set of quality integrity rules,
queries that are augmented with quality indicator requirements.
indicators, the user can
make
a better interpretation of the data
In order to establish the relationship
data.
and a quality indicator algebra which can be used
From
to
these quality
and determine the believability
between data quality dimensions and quality
of the
indicators, a
data quality requirements analysis methodology that extends the Entity Relationship (ER) model
is
also presented.
Just as
it
is difficult to
product which define
its
manage product
quality,
it
is
also difficult to
characteristics that define data quality.
quality,
for the
LL
quality without understanding the attributes of the
manage data
quality without understanding the
Therefore, before one can address issues involved in data
one must define what data quality means. In the following subsection, we present a definition
dimensions of data quality.
Dimensions of data quality
Accuracy
is
the most obvious dimension
when
it
comes
to
data quality.
"errors occur because of delays in processing times, lengthy correction times,
stringent data edits" (Morey, 1982).
In addition to defining
Morey suggested
and overly or
insufficiently
accuracy as "the recorded value
conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value
of date),
completeness
(all
that
is
is in
not out
values for a certain variables are recorded), and consistency (the
representation of the data value
is
the
same
in all cases) as the
key dimensions of data quality (Ballou
&
Pazer, 1987).
Huh
et al. identified accuracy, completeness, consistency,
important dimensions of data quality (Huh,
It
is
(e.g.,
Juran
et al., 1990; Juran, 1979;
(1)
Data quality
(2)
Data quality
is
We illustrate
these
is
certain data retrieved
the data
for quality control
have been well established
in
&
Pazer, 1985; Garvin, 1983; Garvin, 1987; Garvin, 1988;
Gryna, 1980; Morey, 1982;
Wang &
Guarrascio, 1991).
It
is
also
intrinsic characteristics of data quality:
a multi-dimensional concept.
a hierarchical concept.
two
characteristics
from a database.
must be accessible
must be able
&
two
interesting to note that there are
the data
methods
Juran, 1979), neither the dimensions of quality for manufacturing nor for
data have been rigorously defined (Ballou
Huh,
et a!., 1990).
interesting to note that although
the manufacturing field
and currency as the most
by considering how a user may make decisions based on
First the
(the user has the
to interpret the data (the user
user must be able to get to the data, which
means and
privilege to get the data).
means
that
Second, the user
understands the syntax and semantics of the data). Third,
must be useful (data can be used as an input
to the user's decision
making
process). Finally, the
data must be believable to the user (to the extent that the user can use the data as a decision input).
Resulting from this
and
believability.
list
In order to
that can be accessed); to
and
to
are the following four dimensions:
be accessible
consistent, credible,
to the user, the data
be useful, the data must be relevant
be believable, the user
may
consider,
and accurate Timeliness,
.
accessibility, interpretability, usefulness,
among
(fits
must be available
(exists in
requirements for making the decision);
other factors, that the data be complete, timely,
in turn, can be characterized
by currency (when
item was stored in the database) and volatility (how long the item remains valid).
the data quality dimensions illustrated in this scenario.
currenT) Qion-volatile;
Figure
1:
some form
A Hierarchy of Data Quality Dimensions
Figure
1
the data
depicts
These multi-dimensional concepts and hierarchy of data quality dimensions provide a
conceptual framework for understanding the characteristics that define data quality. In this paper,
focus on interpretability and believability, as
we
we
consider accessibility to be primarily a function of the
information system and usefulness to be primarily a function of an interaction between the data and the
application domain.
12.
The idea of data tagging
Data quality: an
attrib ute-based
Suppose an analyst maintains
this effort
Data
may
illustrated
more concretely below.
example
a database
contain attributes such as
may be collected
is
on technology companies. The schema used
company name, CEO name, and earnings
over a period of time and come from a variety of sources.
Table
Company Name
1:
Company
Information
to
support
estimate (Table
1).
value
may have
necessary to
may,
know
in turn,
arbitrary
a set of quality indicators associated with
In
it.
the quality of the quality indicators themselves, in
have another
set of associated quality indicators.
number of underlying
As
some
applications,
it
may
be
which case a quality indicator
such, an attribute
may have
levels of quality indicators. This constitutes a tree structure, as
shown
an
in
Figure 2 below.
(attribute)
nndicator)
^indicator")
tK:.
3x...
(^indicator)
^
Figiu-e 2:
(mdlcator)
An attribute with
quality indicators
Conventional spreadsheet programs and database systems are not appropriate for handling
data which
is
structured in this manner.
In particular, they lack the quality integrity constraints
necessary for ensuring that quality indicators are always tagged along with the data (and deleted
when
the data
is
deleted)
an attribute with
In order to associate
developed
and the algebraic operators necessary
to facilitate the
is
immediate quality indicators, a mechanism must be
Section 2 presents the research background.
presents the data quality requirements analysis methodology.
based data model. Discussion and future directions are
2.
we
we
present the attribute-
in Section 5.
Research background
and present the terminology used
in this pap)er.
Rationale for cell-level tag gin g
Any
relation.
same
made
In section 4,
Section 3
discuss our rationale for tagging data at the cell level, summarize the
literature related to data tagging,
2.1.
set
it.
organized as follows.
In this section
query processing.
hnkage between the two, as well as between a quality indicator and the
of quality indicators associated with
This paper
its
for attribute-based
characteristics of data at the relation level should be applicable to all instances of the
It is,
quality.
however, not reasonable
to
assume
that all instances
(i.e.,
tuples) of a relation
Therefore, tagging quality indicators at the relation level
quality heterogeneity at the instance level.
is
have the
not sufficient to handle
By the same token, any
to all attribute
characteristics of data tagged at the tuple level should be applicable
values in the tuple.
However, each
attribute value in a tuple
different sources, through different collection methods,
Therefore, tagging data at the tuple level
basic unit of manipulation,
We now examine
12.
Work
is
it
is
and updated
also insufficient.
may
be collected from
at different points in time.
Since the attribute value of a
necessary to tag quality information at the
cell is
the
cell level.
the literature related to data tagging.
related to data tag gin g
A mechanism
for tagging data has
DENOTE
operations to tag and un-tag the
operators
is to
been proposed by Codd.
name
It
includes
of a relation to each tuple.
permit both the schema information and the database extension
uniform way (Codd, 1979).
It
does not, however, allow
NOTE, TAG, and
The purpose
to
of these
be manipulated in a
for the tagging of other data (such as source) at
either the tuple or cell level.
Although self-describing data
schema
to
level
files
at the
(McCarthy, 1982; McCarthy, 1984; McCarthy, 1988), no specific solution has been offered
manipulate such quality information
A
and meta-data management have been proposed
at the tuple
and
cell levels.
rule-based representation language based on a relational schema has been proposed to store
data semantics at the instance level (Siegel
attribute values based
the tuple level as
on values of other
opposed
&
Madnick, 1991). These rules are used
attributes in the tuple.
to the cell level,
and thus
to derive
meta-
However, these rules are specified
at
cell-level operations are not inherent in the
model.
A
polygen model (poly = multiple, gen = source) (Wang
&
Madnick, 1990) has been proposed
tag multiple data sources at the cell level in a heterogeneous database environment
imp)ortant to
know
where
it
to
is
not only the originating data source but also the intermediate data sources which
contribute to final query results.
The
research, however, focused
on the "where from" perspective and
did not provide mechanisms to deal with more general quality indicators.
In (Sciore, 1991), annotations are
oriented environment.
general treatment
is
to
However, data quality
support the temporal dimension of data in an objectis
a multi-dimensional concept.
necessary to address the data quality issue.
calculus-based language
data.
used
is
provided
to
Therefore, a
more
More importantly, no algebra
or
support the manipulation of annotations associated with the
The examination of the above research
functionality of our attribute-based model,
23.
an extension of existing data models
is
support the
to
required.
Terminology
To
•
efforts suggests that in order
we
facilitate further discussion,
An
introduce the following terms:
a pplication attribute refers to an attribute associated with an entity or a relationship in an
entity-relationship (ER) diagram.
This would include the data traditionally associated with
an application such as part number and supplier.
•
A
quality parameter
defines
when
is
a qualitative or subjective
dimension of data quality that a user of data
For example, believability and timeliness are such
evaluating data quality.
dimensions.
•
As introduced
in Section
characteristics of data
collection
•
A
method
defined by users to
its
is
may
the value determined (directly or indirectly)
by the user of data
Functions can be
For example, the quality
.
is
a
measured
may have
have discussed the rationale
characteristic of the stored data.
For example, the
a quality indicator value The Wall Street Journal.
for cell-level tagging,
and introduced the terminology used
methodology
Data source, creation time, and
be defined as high or low depending on the quality indicator source
data quality indicator source
tagging,
process.'*
quality indicators to quality parameters.
quality indicator value
We
manufacturing
parameter based on underlying quality indicators.
map
parameter credibility
A
quality indicators provide objective information about the
are examples of such objective measures.
for a particular quality
•
and
quality parameter value
of the data
1,
for the specification of data quality
in this paper.
summarized work
related to data
In the next section,
parameters and indicators.
The
we
intent
present a
is to
allow
users to think through their data quality requirements, and to determine which quality indicators
would be appropriate
for a given application.
We consider an indicator objective if it is generated using a well defined and widely accepted measure.
Data quality requirements analysis
3.
In general, different users
data
may have
may have
different quality characteristics.
thorough treatment of these
requirements analysis (Batini, Lenzirini,
between
The reader
is
referred to
different tyf)es of
Appendix
A
for a
&
an effort similar
is
in
spirit
Navathe, 1986; Navathe, Batini,
on quality aspects of the
Based on
data.
traditional data requirements analysis
&
to
traditional data
Ceri, 1992; Teorey,
this similarity, parallels
and data quality requirements
can be drawn
analysis.
depicts the steps involved in performing the proposed data quality requirements analysis.
application requirements
Stepl
determine the application view of data
I
application view
application's
i
Step 2
quality requlrments
determine (subjective) quality parameters for the
application
candidate quality
paramaters
T
parameter view
__i
Step3
determine (objective) quality Indicators for the
application
quality view (1)
•
s'teD'4
-
1
quality view
-*^
I
(I)
quality view (n)
^
quality view Integration
quality
Figure
The
input, output
more
issues.
Data quality requirements analysis
1990), but focusing
and
different data quality requirements,
3:
and
The process
schema
of data quality requirements analysis
objective of each step are described in the following subsections.
Figure 3
3.1.
Step
Step
this paper.
&
Establishing the applications view
1:
1 is
A
the
whole of the
traditional data
modeling process and
will not
comprehensive treatment of the subject has been presented elsewhere
Navathe, 1986; Navathe,
&
Batini,
we
A company
Figure 4
documents the a pplication view
for
management
has a company name, a CEO, and an
earnings estimate, while a stock has a share price, a stock exchange (NYSE,
that
(Batini, Lenzirini,
are interested in designing a portfolio
system which contains companies that issue stocks.
An ER diagram
in
Ceri, 1992; Teorey, 1990).
For illustrative purposes, suppose that
symbol.
be elaborated upon
AMS,
or OTC),
our running example
is
and a
ticker
shown below
in
.
COMPANY
STOCK
JSSUES
-<^COI
COMPANY
NAME
SHARE PRICE
Z^
CEO NAME
EARNINGS ESTIMATE
Figure
2.2,
Step
2:
4:
Application view (output from Step
1)
Determine (subjective) quality parameters
The goal
in this step
These parameters need
to
dimensional concept, and
is
parameters from the user given an application view.
be gathered from the user in a systematic
may be
illustrates the addition of the
application view.
to elicit quality
5:
as data quality
is
operationalized for tagging purposes in different ways.
two high
level parameters, interpretability
Each quality parameter identified
Figure
way
Interpretability
is
shown
a multi-
Figure 5
and believability.
inside a "cloud" in the diagram.
and believability added
to the application
view
to the
Interpretabilitv can be defined through quality indicators such as data units
and scale
(e.g., in
millions).
defined through currency and volatility
stock entity which
is
-o
ii
Step
3:
6:
is
Timeliness, in turn, can be
identified in this step are
referred to as the parameter view.
We
added
to
focus here on the
STOCK
Parameter view for the stock entity
(partial
output from Step
2)
Determine (objective) quality indicators
The goal
in
Step 3
is to
operationalize the primarily subjective quality parameters identified
in Step 2 into objective quality indicators.
rectangle)
The quality parameters
.
in Figure 6.
_ _
Figure
.
The resulting view
shown
dollars)
Believability can be defined in terms of lower-level quality parameters
such as completeness, timeliness, consistency, credibility, and accuracy
the application view.
(e.g., in
and
is
is
depicted as a tag (using a dotted-
attached to the corresponding quality parameter (from Step
view The portion of the quality view
.
-o-
Figure
Each quality indicator
7:
for the stock entity in the
2),
running example
is
creating the quality
shown
in Figure 7.
4 UNITS?
STOCK
The portion of the
lnterpretable^>.^g^^^^jg,
quality view for the stock entity (output from Step 3)
10
Corresponding
currency units in which share price
price
is
is
parameter interpretable are the more objective quality indicators
to the quality
is
measured
(e.g.,
$ vs. ¥)
the latest closing price or latest nominal price.
and status which says whether the share
Similarly, the believability of the share price
indicated by the quality indicators source and reporting date
.
For each quality indicator identified in a quality view,
indicators for a quality indicator, then Steps 2-3 are repeated,
example, the quality of the attribute Earnings Estimate
(i.e.,
the
name
who provided
of the journal) but also
on the second
it
making
may depend
level source
the Earnings Estimate figure to the journal
depicted below in Figure
if
(i.e.,
important
is
this
an
name
and the Reporting
have quality
For
iterative process.
not only on the
the
to
first level
source
of the financial analyst
date).
This scenario
is
8.
'earnings estimate
Believable
I
,
'
_ - J
,
REPORTING DATE'
I
SOURCE
r---"
•
,
I
L
Believable
\
(ANALYSTS NAME
I
_1
I
REPORTING DATE'
J
J
,
Figure
All quality
8:
Quality indicators of quality indicators
views are integrated in Step 4
to
generate the quality schema, as discussed in the
following subsection.
lA,
Step
Creating the quality schema
4:
When
the design
multiple quality views
must be consolidated
Lenzirini,
&
single global
is
may
large
result.
and more than one
set of application
To eliminate redundancy and
suffice.
is
involved,
inconsistency, these quality views
into a single global view, in a process similar to
schema integration
(Batini,
Navathe, 1986), so that a variety of data quality requirements can be met. The resulting
view
is
called the quality schema.
This involves the integration of quality indicators.
may
requirements
In
more complicated
indicators in order to decide
cases,
it
may
what indicators
In simpler cases, a
union of these indicators
be necessary to examine the relationships
to include in the quality
11
among
schema. For example,
it is
the
likely
that
one quality view
may have age
may have
as an indicator, whereas another quaUty view
time for the same quaUty parameter. In
time
this case, creation
may
creation
be chosen for the quality schema
because age can be computed given current rime and crearion time.
We have
presented a step-by-step procedure
model
in a position to present the attribute-based data
We are now
data quality requirements.
to specify
supporting the storage, retrieval, and
for
processing of quality indicators as spjecified in the quality schema.
The attribute-based model of data
4.
We
choose to extend the relational model because the structure and semantics of the relational
approach are widely understood. Following the
attribute-based data
model
data manipulation.
We
is
assume
Data
that the reader
(a)
1982), the presentation of the
data structure, (b) data integrity, and
familiar with the relational
is
model (Codd,
(c)
1970;
1983).
structxire
As shown
Figure 2 (Section
in
levels of quality indicators.
and the
1),
an attribute
to facilitate the linkage
In
number
arbitrary
its
of underlying
immediate quality indicators, a
between the two, as well as between a quality
with
set of quality indicators associated
the quality key concept.
may have an
In order to associate an attribute with
mechanism must be developed
indicator
model (Codd,
relational
divided into three parts:
Codd, 1979; Date, 1990; Maier,
4.L
quality
This mechanism
it.
extending the relational model,
Codd made
is
developed through
clear the
need
to
uniquely
identify tuples through a system- wide unique identifier, called the tuple ID (Codd, 1979; Khoshafian
&
Copeland, 1990).^
This concept
an attribute
Specifically,
applied in the attribute-based model to enable this linkage.
is
scheme
in a relation
attribute, consisting of the attribute
and
is
a quality
expanded
key
into
where EEc
is
the quality key for the attribute
expanded scheme
is
referred to as a quality
EE (Tables
scheme
defines a quality scheme for the quality relation
indicators are associated with the attributes
.
is
expanded
3-6 are
EE^)
into (EE,
embedded
In Table 4, ((CN, nik),
Company.
CN
quality
pair, called a
.
For example, the attribute Earnings Estimate (EE) in Table 3
4
an ordered
in
Figure
in
9).
Table
This
<CEO, nik), (EE, EEc))
The "nik" indicates
that
and CEO; whereas EEc indicates
no quality
that
EE has
associated quality indicators.
Correspondingly, each
quality
cell,
cell in
a relational tuple
is
expanded
into
consisting of an attribute value and a quality key value. This
Similarly, in the object-oriented literature, the ability to
property of an object-oriented data model.
12
make
an ordered
pair, called a
expanded tuple
references through object identity
is
is
referred to
considered
a basic
as a quality tuple and the resulting relation (Table 4)
key value
is
referred to as a quality relation
in a quality cell refers to the set of quality indicator values
the attribute value.
This set of quality indicator values
tuple called a quality indicator tuple
quality indicator tuples
is
quality indicator relation
Under
is
A
.
composed
quality relation
referred to as the quality indicator
the relational
Relation for
3:
scheme
that defines the
.
Company
Company
;E0 Name
(CEO)
Name (CN)
Earnings
Estimate (EE)
IBM
Akers
7
id002c
DELL
Dell
3
model
I
Table
4:
I
I
Qual^
i\^ Relation for>^mpany
for>Gompar
{CN, nik)
tid
idOOU
(IBM, nik)
(Dell, nik)
idOOZc
Table
idl02c
of a set of these time-varying
idOOU
the attribute-based
idlOlc
form a kind of quality
model
tid
EEg
to
The quality scheme
.
5:
(Barron's, id201<)
Tables
SRCU
Jnl, id202<»
6:
(EE, EEC)
(Akers, nilO
(Dell,
Level-One QIR
(SRC1.SRC1<!>
(Wall St
f
(CEO, nik
for the
(News
(7,idl01«)
nik)
|.(3,idl02«)
EE
attribute
Date, nik) (Entrv Clerk, nik'
(Oct 5
"92,
nik)
(Oct 6
'92,
nik)
Level-Two QIR
for the
EE
Each quality
immediately associated with
grouped together
called a quality indicator relation
Table
Under
is
.
(Joe,
nil^
(Mary,
attribute
nil«^
indicators for
In general,
Each attribute in qr2
a.
in turn,
,
can have a quality indicator relation associated with
an attribute can have n-levels of quality indicator relations associated with
it,
n
^
0.
it.
For
example. Tables 5-6 are referred to respectively as level-one and level-two quality indicator relations
for the attribute
We
Earnings Estimate.
define a quality scheme set as the collection of a quality scheme and
indicator schemes that are associated with
scheme
We define a
Company.
set for
quality indicators.
A
indicator schemes, quality
In Figure 9, Tables 3-6 collectively define the quality
it.
is
defined as a set of quality scheme sets that describes the
Figure 10 illustrates the relationship
scheme
the quality
quality database as a database that stores not only data but also
quality schema
structure of a quality database.
all
sets,
and
among
quality schemes, quality
the quality schema.
Quality
Schema
Figure 10 Quality schemes, quality indicator schemes, quality scheme sets, and the quality schema
We now
present a mathematical definition of the quality relation.
developed in the relational model,
domain
(in
for a
Table
product
4,
system-wide unique
7€
EE where EE
is
a
we
id
domain
as a set of values of similar type.
domain
for earnings estimate).
Let
m
is
ID be
the
be a domain for an attribute
DID
be defined on the Cartesian
DID).
be a quality key value associated with an attribute value d where d €
quality relation (qr) of degree
Let
D
identifier (in Table 4, idlOlc € ID). Let
D X ID (in Table 4, <7, idlOU) €
Let
define a
Following the constructs
defined on the
m+1 domains (m>0;
in Table 4,
D and
m=3)
if it is
id € ID.
A
a subset of
the Cartesian product:
ID X DIDi X DID2 X
Let qthe a quality tuple, which
is
designated
an element
in a
X DIDm-
quality relation.
Then a
quality relation qr
as:
qr
The
is
...
=
{qt qt
I
=
<id,
didi, did2,
•••,
didm) where
integrity constraints for the attribute-based
14
model
id € ID, did; € DID),
is
presented next.
j
=
1,
...
,m)
4^
Data integrity
A
fundamental property of the attribute-based model
corresponding quality (including
we mean
atomic unit
all
attribute value
corresponding quality indicators also need
In other words, an attribute value
We
that an attribute value
and
descendant) indicator values are treated as an atomic unit
whenever an
that
is
and
its
to
is
.
created, deleted, retrieved, or modified,
its
By
its
be created, deleted, retrieved, or modified respectively.
corresponding quality indicator values behave atomically.
refer to this property as the atomicity property hereafter.
This property
is
enforced by a set of
quality referential integrity rules as defined below.
Insertion
key present
Insertion of a tuple in a quality relation
:
schema
in the tuple (as specified in the quality
indicator tuple
must be inserted
must ensure
that for each non-null quality
definition), the
corresponding quality
into the child quality indicator relation. For each non-null quality key
in the inserted quality indicator tuple, a
corresponding quality indicator tuple must be inserted
at the
next level. This process must be continued recursively until no more insertions are required.
Deletion
key present
Deletion of a tuple in a quality relation must ensure that for each non-null quality
:
the tuple, corresponding quality information
in
must be deleted from the
corresponding to the quality key. This process must be continued recursively until a tuple
with
all
encountered
null quality keys.
Modification
:
an attribute value
If
quality indicator values of that attribute
We now introduce a quality
U.
is
table
is
modified in a quality relation, then the descendant
must be modified.
indicator algebra for the attribute-based model.
Data manipulation
In
order to present the algebra formally,
we
Let
.
Ql-Compatibility and QIV-Equal
a-i
and
associated with
attributes aj
and
a2
Ol-Compatible with
a2
then the attributes
assume
be two application attributes.
Let QKa,) denote the set of quality indicators
Let S be a set of quality indicators.
aj.
are defined to be
We
define two key concepts that are fundamental
and OlV-Equal
to the quality indicator algebra: Ol-compatibility
4.3.1.
first
shown
a^
and
that the
respect to S.6
in Figure 11 are
a2
shown
If
S
C QKaj) and
For example,
if
Ql-Compatible with respect
in Figure 11 are not
numeric subscripts
(e.g.,
S
S =
to S.
C
QI(a2), then ai and a2
{qii, qi2, qi2i),
Whereas
Ql-Compatible with respect
if
S =
then the
(qii, qi22}/
to S.
qin) n^^P 'he quality indicators to unique positions
quality indicator tree.
15
in the
Figure 11: Ql-Compatibility Example
Let 3] and 82 be Ql-Compatible with respect to
Let
respectively.
(qi2(W]
)
= V2
qi(wi
)
be the value of quality indicator
in Figure 12).
of ai
and
82
value Wi where qi g S
qi for the attribute
QIV-Equal with
Define Wj and W2 to be
W2 be values
Let wi and
S.
respect to S provided that qi(W]
=
)
c
V qi
qi(w2)
S=
6
S,
{qij, qi2i),
denoted as wi = W2. In Figure
example, w^ and W2 are QIV-Equal with respect
12, for
but not QIV-Equal with respect to S =
{qii, qisi)
because qi3i(wi) =
to
whereas qi3i(w2) =
v^-^
X31.
(a2,W2)
(a^.w.,)
/\
(''hr^l)
^2)
(qii2
\\\
(qi21'^2l'
/\
(q'22'''22) «'*31'^31)
(<"l
1
M
^11
qi^g
•
v_
\
\
^2
>
^21
<'''21
'"bl
>
'
"31
>
y
Figure 12: QIV-Equal Example
In practice,
specify
level
it
is
the elements of
all
up
Let wi
the quality indicators to be
we
alleviate the situation,
and
depth
if
the following
(2)
S consists of
a^
and
attributes.
Let ai
a2 respectively, then
two conditions are
all
compared
(i.e.,
to
introduce i-level Ql-compatibility
(i-
in
which aU the quality
in a quality indicator tree are considered.
two application
and W2 be values of
Compatible
,
To
to a certain level of
Let ai and a2 be
S
S).
all
QIV-Equal) as a special case for Ql-compatibility (QlV-equal)
indicators
to
tedious to explicitly state
satisfied:
and
ai
be Ql-Compatible with respect
wj and W2 are defined
(1)
ai
quality indicators present within
and
i
82 are
to
be
i-level
to S.
QI-
Ql-Compatible with respect
levels of the quality indicator tree of
ai (thus of 82).
By
If
the
'i'
is
same
the
token, i-level QIV-Equal
maximum
level of
be maximum-level Ol-Compatible
by wi
='" W2,
.
between wi and W2, denoted by Wj
depth in the quality indicator
Similarly,
tree,
then
a-i
='
W2, can be defined.
and
82 are defined to
maximum-level QIV-Equal between Wj and W2, denoted
can also be defined.
16
To exemplify the algebraic operations
quality relations having the
Large_and_Medium (Tables
Figure
same
quality
7, 7.1, 7.2
14).
<CN,
Table 7
nil«r>
in the quality indicator algebra,
scheme
set as
in Figure 13)
shown
in Figure
9.
we
They are
and Small_and_Medium (Tables
introduce two
referred to as
8, 8.1,
and
8.2 in
If
QUALITY"
the clause "with
absent in a user query, then
is
on the quality of data
explicit constraints
would not be compared
in the retrieval process;
however, the quality indicator values associated with
SQL
The quality
that
retrieved as well.
syntax, the dot notation
indicator algebra
namely
algebraic operations,
and
(That
attribute a.
QUALITY"
let
QR
and
QS
be two quality tuples. Let Sg be a
t2
Sg
is,
is
clause.) Let the
are identical.
Let
ti .a
=
is a
quality
equal with resfject to Sg.
define the five orthogonal quality relational
and Cartesian product.
be two quality schemes and
Let a and b be
two
let
qr and qs be two
attributes in both
set of quality indicators Sf)ecified
QR and
by the user
for the
constructed form the specifications given by the user in the "with
term
*
we
QR and QS respectively.
quality relations associated with
t^.a
t2.a
=
denote that the values of the attribute a in the tuples
t2.a
denote that the values of attribute a
Similarly, let
='
ti .a
t2.a
and
=
t] .a
maximum-level QlV-equal respectively between the values of
""
t2.a
and
ti.a
in the tuples
denote
t^
i-level
and
t2
t]
and
are QIV-
QlV-equal and
t2.a.
Selection
4.3.2.I.
Selection
its
SRC2 which
presented in the following subsection.
is
selection, projection, union, difference,
In the following operations,
t2
identifies
in turn is a quality indicator to EE.
Following the relational algebra (Klug, 1982),
ti
to identify a quality indicator in the
Quality Indicator Algebra
4.3.2.
QS. Let
used
is
EE.SRC1.SRC2
In Figure 9, for example,
which
indicator for SRCl,
no
that the user has
being retrieved. In that case quality indicator values
In the extended
quality indicator tree.
means
is
would be
the applications data
it
a unary operation which selects only a horizontal subset of a quality relation (and
is
corresponding quality indicator relations) based on the conditions specified in the Selection
There are two types of conditions
operation.
in the Selection operation: regular conditions for
an
application attribute and quality conditions for the quality indicator relations corresponding to the
The
application attribute.
o^C
(qr)
where
=
(t
C(ti
)
I
V
=
ti .b);
6 qr,
ej <& ©2 <&
or (ti.aGti.b);
=""
ti
selection, a'^c (qr)' is defined as follows:
ei**
is
V ae QR,
... <i)
o e,
=
"i
<i)
ti .a)
62''
d)
a
(t.a
...d)
=
""
t, .a))
ep''; Gj is
in
of the forms (qik = constant) or (ti.a =
qik € QKa); <t e
indicators to be
e,,
((t.a
{
a, v, -,
);
6
a C(ti
)}
one of the forms:
''
ti.b)or
(ti.a
= (g, s, s, #, <, >, =); and Sa,b
compared during the comparison
18
of
tj .a
and
tj .b.
is
(ti .a
6 constant)
=' ti.b)or (ti.a
the set of quality
Example
1:
Get
all
Large_and_Medium companies whose earnings estimate
is
over 2 and
is
supplied by
Zacks Investment Research.
A corresponding extended SQL query is shown as follows:
algebra.
SELECT
FROM
CN, CEO, EE
WHERE
EE>2
LARGE_AND_MEDIUM
with
QUALITY
This
SQL query can be accomplished through
The
result is
EE.SRC1.SRC2= 'Zacks'
shown below.
a Selection operation in the quality indicator
SELECT
CN,EE
FROM LARGE_and_MEDIUM
This
SQL query can be accomplished through a
<CN,
nil«>
Projection operation.
The
result is
shown below.
<CN,
nil(f>
Note
This
also that unlike the relational union, the
is illustrated
Example
Example
in
3-3
:
operation in Example 3-b:
with
The
SM.CN, SM.CEO, SM.EE
SMALL_and_MEDIUM SM
LM.CN, LM.CEO, LM.EE
LARGE_and_MEDIUM LM
QUALITY
result is
(LM.EE.SRC1= SM.EE.SRCl)
shown below.
<CN,
is
not commutative.
3-3 below.
Consider the following extended
SELECT
FROM
UNION
SELECT
FROM
quality union operation
nile>
SQL query which
switches the order of the union
Example
4:
as
Get
all
the companies which are classified as only
Large_and_Medium companies but not
Small_and_Medium companies.
A corresponding SQL query is shown as follows:
SELECT
FROM
DIFFERENCE
LM.CN, LM.CEO, LM.EE
SELECT
SM.CN, SM.CEO, SM.EE
LARGE_and_MEDIUM LM
FROM
SMALL_and_MEDIUM SM
With
QUALITY
This
SQL query
(LM.EE.SRC1.SRC2 = SM.EE.SRC1.SRC2)
can be accomplished through a Difference operation.
below.
<CN,
nil>
The
result is
shown
Cartesian Product
4.3.2.5.
The Cartesian product
Let
ti
€ qr
the tuple
t2
e qs. Let
The tuple
t2.
of degree r+s.
qr X^ qs =
t(l)
=
t(r+l)
The
t](i)
t
is
also a binary operation.
denote the
i'^
I
V
t,(l)
=
t]
A
t2(l)
in the quality relation resulting
<LM.CN, m\t>
qs,
be of degree
t]
and
t2(i)
r
and QS be of degree
denote the
i"^ attribute
from the Cartesian product of qr and qs
denoted as qr X^
will
s.
of
be
qs, is defined as follows:
G qr, Vt2 6 qs,
t(l)
A
=" tid) A
t(r+l)
="
t(2)
t2(l)
A
=
t,(2)
t(r+2)
result of the Cartesian product
shown below.
QR
attribute of the tuple
The Cartesian product of qr and
{ t
Let
A
=
t(2)
t2(2)
='" t,(2)
a
A
t(r)
=
t(r+2) ="" t2(2)
a
...
ti(r)
...
a
t{r)
t(r+s)
=
="" t,(r)
tzCs)
a
a
t(r+s)
="
t2(s)
between Large_and_Medium and Small_and_Medium
is
)
We
have presented the attribute-based model including a description of the model structure, a
model, and
set of integrity constraints for the
a quality indicator algebra.
algebraic operations are exemplified in the context of the
the capabilities of this
model and future research
SQL
In addition,
each of the
query. The next section discusses
some
of
directions.
Discussion and future directions
5.
The attribute-based model can be applied
in
many
different
ways and some
of
them are
listed
below:
•
The
ability of the
retain the origin
•
A
user can
Example
1,
model
to
filter
the data retrieved from a database
for instance, only the data furnished
QUALITY
multiple levels
makes
The reputation
inspected.
it
possible to
in Figure 9 illustrates this.
according to quality requirements.
by Zacks Investment Research
is
In
retrieved as
EE.SRCl.SRC2='Zacks'."
Data authenticity and believability can be improved by data inspection and
quality indicator value could indicate
•
at
and intermediate data sources. The example
specified in the clause "with
•
support quality indicators
who
certification.
inspected or certified the data and
when
it
A
was
of the inspector will enhance the believability of the data.
The quality indicators associated with data can help
to resolve
semantic incompatibility
capability
is
among
clarify data semantics,
which can be used
data items received from different sources.
This
very useful in an interoperable environment where data in different databases
have different semantics.
•
Quality indicators associated with an attribute
values. For example,
it
can be interpreted
the spouse
name
is
if
may
facilitate a better interpretation of null
the value retrieved for the spouse field
(i.e.,
is
empty
in
an employee record,
tagged) in several ways, such as (I) the employee
unknown, or
(3) this
tuple
is
inserted into the
is
employee
unmarried,
table
(2)
from the
materialization of a view over a table which does not have spouse field.
•
In a data quality control process,
the source of error
In this paper,
and processed.
when
errors are detected, the data administrator can identify
by examining quality indicators such as data source or collection method.
we have
Specifically,
investigated
we have
requirements analysis and specification,
how
(1)
(2)
quality indicators
may
be specified, stored, retrieved,
established a step-by-step procedure for data quality
presented a model for the structure, storage, and processing
25
of quality relations
and quality indicator
relations (through the algebra),
functionalities related to data quality administration
We
investigating
its
combining accurate monthly data with
mechanisms
components.
to determine the quality of
(2) In
order
to
(3)
touched upon
control.
are actively pursuing research in the following areas:
of derived data (e.g.,
values of
and
and
(1) In
order to determine the quality
weekly
less accurate
data),
we
are
derived data based on the quality indicator
use this model for existing databases, which do not have
tagging capability, they must be extended with quality schemas instantiated with appropriate
quality indicator values.
effective.
(3)
We
are exploring the possibility of
Though we have chosen
the relational
model
oriented approach appears natural to model data and
quality control
methods),
we
its
making such
a transformation cost-
to represent the quality
quality indicators.
schema, an object-
Because
many
of the
mechanisms are procedure oriented and o-o models can handle procedures
are investigating the pros
and cons of the object-oriented approach.
26
(i.e.,
References
6.
[I]
[2]
[3]
Ballou, D. P. & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multioutput Information Systems. Management Science, 11(2), pp. 150-162.
Ballou, D. P. & Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures in Information
Systems. International Journal of Management Science, 15(6), pp. 509-521.
Batini,
C,
Lenzirini, M.,
&
database schema integration.
[4J
Codd,
ACM,
[5]
E. F. (1970).
A
Navathe, S. (1986). A comparative analysis of methodologies for
ACM Computing Survey, 1£(4), pp. 323 - 364.
relational
model of data
for large shared data banks.
Communications
of the
ii(6), pp. 377-387.
Codd,
E. F. (1979).
capture more meaning.
ACM
practical foundation for productivity, the 1981
ACM
Extending the relational database model
to
Transactions on Database Systems, 1(4), pp. 397-434.
[6]
Codd,
E. F. (1982).
Relational database:
Turing Award Lecture. Communications
An
[7]
Date, C.
[8]
Garvin, D. A. (1983). Quality on the
J.
(1990).
Introduction
to
A
of the
ACM,
25(2), pp. 109-117.
Database Systems (5th ed.). Reading,
line.
MA: Addison-Wesley.
Harvard Business Review, (September- October), pp. 65-
75.
[9]
Garvin, D. A. (1987). Competing on the eight dimensions of quality. Harvard Business Review,
(November-December), pp. 101-109.
[10]
Garvin, D. A. (1988). Managing Quality-The Strategic and Competitive Edge
The Free
[II]
[12]
Huh,
(1 ed.).
New
York:
Press.
Y. U., et
al.
Data Quality. Information and Software Technology, 22(8), pp. 559-565.
(1990).
Johnson, J. R., Leitch, R. A., & Neter, ]. (1981). Characteristics of Errors in Accounts Receivable and
Inventory Audits. Accounting Review, 56(April), pp. 270-293.
[13]
Juran,
J.
M.
(1979). Quality Control
[14]
Juran,
J.
M.
&
Gryna,
F.
M.
Handbook (3rd
ed.).
(1980). Quality Planning
New
York: McGraw-Hill Book Co.
and Analysis (2nd
ed.).
New
York:
McGraw
Hill.
[15]
S. N. & Copeland, G. P. (1990). Object
San Mateo, CA: Morgan Kaufmann.
Khoshafian,
37-46).
Identity. In S. B.
Zdonik& D. Maier
(Ed.), (pp.
[16]
Klug, A. (1982). Equivalence of relational algebra and relational calculus query languages having
aggregate functions. The Journal of ACM, 22, pp. 699-717.
[17]
Laudon, K. C. (1986). Data (Quality and Due Process
Communications of the ACM, 22(1), pp. 4-11.
[18]
Large Interorganizational Record Systems.
& Uppuluri, V. R. R. (1990). Data Quality Control: Theory and Pragmatics (pp. 360).
Marcel Dekker, Inc.
Liepins, G. E.
New York:
[19]
in
Liepins, O. E. (1989).
Sound Data Are
a
Sound Investment. Quality Programs, (September), pp.
61-
63.
[20]
Maier, D. (1983). The Theory of Relational Databases (1st ed.). Rockville,
MD: Computer
Science
Press.
[21]
McCarthy,
J.
L.
(1982). Metadata
Management
for
Large Statistical Databases. Mexico City,
Mexico. 1982. pp. 234-243.
[22]
McCarthy,
J.
L. (1984). Scientific Information
School, Monterey,
CA.
1984. pp.
27
= Data + Meta-data. U.S. Naval Postgraduate
[23]
[24]
McCarthy, J. L. (1988). The Automated Data Thesaurus: A New Tool for Scientific Information.
Proceedings of the 11th International Codata Conference, Karlsruhe, Germany. 1988. pp.
Morey,
C. (1982). Estimating
R.
Communications of
[251
Navathe,
and Sons.
[26]
Rockart,
Batini,
S.,
&
F.
J.
the
ACM,
C, &
Short,
J.
and Improving the Quality
of Information in the MIS.
25(May), pp. 337-342.
Ceri, S. (1992). The Entity Relationshiip Approach
E. (1989). IT in the 1990s:
Sloan Management Review, Sloan School
New
york: Wiley
Managing Organizational Interdependence.
Management, MIT,
of
.
lfl(2),
pp. 7-17.
Using Annotations to Support Multiple Kinds of Versioning in an Object-Oriented
Database System. ACM Transactions on Database Systems, l£(No. 3, September 1991), pp. 417-438.
[27]
Sciore, E. (1991).
[28]
Siegel,
&
M.
Madnick,
S.
E.
A
(1991).
metadata approach
to
resolving
semantic conflicts.
Barcelona, Spain. 1991. pp.
[29]
[30]
Teorey, T.
Mateo, CA
(1990). Database
J.
:
Morgan Kaufman
&
Modeling and Design: The Entity-Relationship Approach
Towards
Wang,
R. Y.
Wang
(Ed.), Information Technology in Action: Trends
Kon, H.
.
San
Publisher.
B. (1992).
Total Data Quality
Management (TDQM).
and Perspectives Englewood
In R. Y.
Cliffs,
NJ:
Prentice Hall.
[31]
Wang, Y. R. & Guarrascio, L. M. (1991). Dimensions of Data Quality: Beyond Accuracy. (CISL-91Composite Information Systems Laboratory, Sloan School of Management, Massachusetts
06)
Institute of
[32]
Wang,
Technology, Cambridge,
Y. R.
&
Madnick,
MA,
S. E. (1990).
02139 June 1991.
A
Polygen Model for Heterogeneous Database Systems: The
Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538.
[33]
Zarkovich. (1966). Quality of
United Nations.
Statistical
Data
.
Rome: Food and Agriculture Organization
\
28
of the
Appendix A: Premises about data quality requirements
7.
Below we present premises related
analysis.
To
facilitate further discussion,
refers to t>oth quality parameters
to
we
analysis
data quality modeling and data quality requirements
define a data quality attribute as a collective term that
and quality indicators as shown
in
Figure A.l. (This term
is
referred
to as a quality attribute hereafter.)
Data
Quality
Attributes
(collective)
among
Figure A.l: Relationship
Premises related
7.1.
quality attributes, quality parameters,
to data quality
Data quality modeling
is
and quality
indicators.
modeling
an extension of traditional data modeling methodologies.
As data
modeling captures many of the structural and semantic issues underlying data, data quality modeling
captures
many
premises relate
of the structural
to these
(Premise
and semantic issues underlying data
data quality modeling issues.
(Relatedness between entity and quality attributes):
1.1)
attribute can be considered either as
quality attribute.
application
name
entity attribute
be included; alternatively,
From
an entity attribute
For example, the
may be an
it
of a teller
if initial
may be modeled
(i.e.,
judgment
call
on
In
some
cases a quahty
an application entity's attribute) or as a
who performs
a transaction in a banking
application requirements state that the teller's
name
as a quality attribute.
a modeling perspective, whether an attribute should be
a quality attribute is a
The following four
quality.
modeled as an
the part of the design team,
entity attribute or
and may depend on the
initial
application requirements as well as eventual uses of the data, such as the inspection of the data for
distribution to external users, or for integration with other data of different quality.
distribution
and integration of the information
quality of the data they use.
When
the data
information of different quality, that quality
A
guideline to this judgment
is
is
is that
If,
exported to their users,
to ask
what information the
on the other hand, the information
process, such as
when, where, and by
often the users of a given system
"know"
of
the
however, or combined with
may become unknown.
provides application information such as a customer
attribute.
The relevance
whom
name and
relates
the data
attribute provides.
address,
more
it
may be
the attribute
considered an entity
to asf)ects of the data
was manufactured, then
If
this
manufacturing
may
be a quality
attribute.
In short, the objective of the data quality requirement analysis
attributes, but also to
is
not
strictly to
develop quality
ensure that important dimensions of data quality are not overlooked entirely in
requirement analysis.
29
(Premise
(Quality attribute non-orthogonality):
1.2)
one another.
For example, the two quality parameters credibility and timeliness are
orthogonal
to
related
not orthogonal), such as for real time data.
(i.e.,
(Premise
may
(Heterogeneity and hierarchy in the quality of supplied data): Quality of data
1.3)
differ across databases, entities, attributes,
may
university database
example:
and
instances.
be of higher quality than data
data about alumni (an entity)
Attribute example:
Different quality attributes need not be
in the student entity,
may
Entity
be less reliable than data about students (an
entity).
may
in
may be more
be
accurate than are addresses.
less interpretable than that of a
Because
human
insight
is
needed
for data quality
modeling and different people may have
different opinions regarding data quality, different quality definitions
this
Instance
domestic student.
Premises related to data quality definitions and standards across users
7.2.
call this
information in a
John Doe's personal database.
grades
example: data about an international student
Database example:
phenomenon
"data quality
is
in the
and standards may
result.
We
eye of the beholder." The following two premises entail
phenomenon.
(Premise 2.1)
indicators
may
(Users define different quality attributes):
vary from one user to another. Quality parameter example: for a manager the quality
parameter for a research report
may need
to
Quality parameters and quality
may
be inexpensive, whereas for a financial trader, the research report
be credible and timely.
Quality indicator example:
the
manager may measure
may measure
indicator may be
inexpensiveness in terms of the quality indicator (monetary) cost, whereas the trader
inexpensiveness in terms of opportunity cost of her
own
time and thus the quality
retrieval time.
(Premise 2.2) (Users have different quality standards): Acceptable levels of data quality
differ
from one user
to another.
For example, an investor following the
movement
of a stock
consider a fifteen minute delay for share price to be sufficiently timely, whereas a trader
price quotes in real time
A
single user
data used. This
to a single
may have
phenomenon
is
user
different quality attributes
summarized
in
and quality standards
for the different
Premise 3 below.
(Premise 3) (For a single user; non-uniform data quality attributes and standards):
have different quality attributes and quality standards across databases,
instances. Across attributes example:
than for the
for certain
number
who needs
may not consider fifteen minutes to be timely enough.
Premises related
7.3.
may
may
A
user
may need
fact that
30
user
higher quality information for the phone
of employees. Across instances example:
companies, but not for others due to the
A
A
user
may need
some companies
may
entities, attributes, or
number
high quality information
are of particular interest.
Date Due
^luidggf
-1 ^ft?;i
Lib-26-67
Mn
"
"
i.iPRaRiF<;
nun
"Nil
Mllllllllllllilli
3
TOfiO
QDflMb554
K/cT
1
Download