Document 11074833

advertisement
M.I.T.
LrSRARIES
-
DEV\^Y
^
Pewey
'
HD28
.M414
"1^
MIT LIBRARIES
llhl
I
3 9080 00932 7625
Toward Quality Data: An
Attribute-Based Approach
Richard Y. Wang
M.P. Reddy
Henry
B.
WP #3762
November
PROFIT
Productivity
From
"PROHT"
Kon
1992
#92-01
Information Technology
Research
Initiative
Sloan School of Management
Massachusetts Institute of Technology
Cambridge,
MA 02139 USA
(617)253-8584
Fax: (617)258-7579
Copyright Massachusetts
Institute
of Technology 1992. The research described
herein has been supported (in whole or in part) by the Productivity
Technology (PROFIT) Research
PROFIT
sponsor firms.
Initiative at
MIT. This copy
is
From Information
for the exclusive use of
Productivity
From Information Technology
(PROFIT)
The Productivity From Information Technology (PROFIT) Initiative was established
on October 23, 1992 by MIT President Charles Vest and Provost Mark VVrighton "to
study the use of information technology in both the private and public sectors and
to enhance productivity in areas ranging from finance to transportation, and from
manufacturing to telecommunications." At the time of its inception, PROFIT took
over the Composite Information Systems Laboratory and Handwritten Character
Recognition Laboratory. These two laboratories are now involved in research related to context mediation and imaging respectively.
SSACHUSETTSNSTl.OTE
OF TECHNOLOGY
MAY 2
3 1995
UBRARIES
PROFFT has undertaken joint efforts with a number of research centers,
laboratories, and programs at MIT, and the results of these efforts are documented
in Discussion Papers published by PROFFT and/or the collaborating MFI entity.
In addition,
Correspondence can be addressed
to:
The "PROFFF" Initiative
RoomE53-310, MFI
50 Memorial Drive
Cambridge, MA 02142-1247
TeL (617) 253-8584
Fax: (617) 258-7579
E-Mail: profit@mit.edu
Toward Quality
ABSTRACT
data resource
complex
is
The need
always be attainable.
it
would be
manufacturing process.
From
facilitates cell-level
which are
it
would be
may
a
not
criteria
characteristics of the data
make
how
Specifically,
quality indicators
we propose an
Included in
may
in
and
their
its
own
be specified, stored,
attribute-based data
this attribute-based
mathematical model description that extends the relational model, a
integrity rules,
and
queries that are
augmented with
a quality indicator algebra
believability of the data.
is
the data for the specific application at hand.
tagging of data.
indicators, the user can
however,
useful to be able to
these quality indicators, users can
This paper investigates
and processed.
quality,
ideal to achieve zero defect data, this
This suggests that
indicators
judgment of the quality of
retrieved,
Managing data
critical.
nianagement of the
Moreover, different users may have different
determining the quality of data.
tag data with quality
Attribute-Based Approach
for a quality perspective in the
becon^ing increasingly
Although
task.
An
Data:
make
which can be used
quality indicator requirements.
a better interpretation of the data
In order to establish the relationship
dimensions
and quality
methodology
that extends the Entity Relationship
indicators,
a
data
quality
model
is
to
From
model
that
model are
a
set of quality
process
SQL
these quality
and determine
the
between data quality
requirements
also presented.
analysis
1.
Introduction
1.1.
1.2.
1.3.
2.
4
5
5
2.2.
Rationale for cell-level tagging
Work related to data tagging
2.3.
Terminology
7
6
Data quality requirements analysis
3.1.
3.2.
3.3.
3.4.
4.
2
4
Research background
2.1.
3.
1
Dimensions of data quality
Data quality: an attribute-based example
Research focus and paper organization
The
1:
Establishing the applications view
9
2:
9
Step
Step
4:
Determine (subjective) quality parameters
Determine (objective) quality indicators
Creating the quality schema
3:
attribute-based
4.1.
4.2.
4.3.
8
Step
Step
model
of data quality
4.3.2.
11
12
Data structure
Data integrity
Data manipulation
4.3.1.
10
12
15
15
Ql-Compatibility and QIV-Equal
Quality Indicator Algebra
15
4.3.2.1.
Selection
18
18
4.3.2.2.
Projection
19
Union
20
22
24
4.3.2.3.
4.3.2.4.
Difference
4.3.2.5.
Cartesian Product
5.
Discussion and future directions
25
6.
References
27
7.
Appendix A: Premises about data quality requirements
analysis
29
29
Premises related to data quality modeling
across
standards
and
definitions
7.2. Premises related to data quality
30
users
30
7.3. Premises related to a single user
7.1.
Toward Quality
An
Data:
Introduction
1.
Organizations
in industries
Attribute-Based Approach
such as banking, insurance,
retail,
consumer marketing, and health
care are increasingly integrating their business processes across functional, product, and geographic
lines.
The integration
of these business processes, in turn, accelerates
demand
for
more
effective
application systems for product development, product delivery, and customer service (Rockart
1989).
As
applications today require access
to
corporate functional and product
Unfortunately, most databases are not error-free, and some contain a surpnsingly large
databases.
number
many
a result,
Short,
St.
of errors (Johnson, Leitch, it Neter, 1981).
medium
surveyed 500
more than 60%
Thanks
In a recent
size corporations (with annual sales of
of the firms
had problems
in
data quality.^
industry executive report, Computerworld
more than $20
and reported
The Wall Street lournal also reported
computers, huge databases brimming with information are
to
million),
at
our fingertips,
that
that:
)ust
waiting to be tapped. They can be mined to find sales prospects among existing customers; they
can be analyzed to unearth costly corporate habits, they can be manipulated to divine future
trends. Just one problem: Those huge databases may be full of |unk. ... In a world where p>eople
are
moving
to total quality
management, one
of the cntical areas
In general, inaccurate, out-of-date, or incomplete data
socially
and economically (Laudon,
Managing data
Zarkovich, 1966).
achieve zero
defect
data}
this
1986; Liepins
quality,
may
&
is
data.^
can have significant impacts both
Uppuluri, 1990; Liepins, 1989;
however,
is
a
complex
task.
Although
not always be necessary or attainable
for,
Wang k
it
Kon, 1992;
would be
among
ideal to
others, the
following two reasons:
First, in
many
applications,
addresses in database marketing
customers,
it
is nfil
it
is
may
a
not always be necessary to attain zero defect dau.
good example.
necessai7 to have the correct city
Second, there
is
name
In
sending promotional materials
in
an address as long as the zip code
a cost/quality tradeoff in implementing data quality programs.
Pazer found that "in an overwhelming ma)onty of cases, the best solutions
reduction
losses are
is
the worst in terms of cost" (Ballou
nev«
k
Pazer, 1987).
way
1
2
3
loss.
Computerworld, Sept«nib«r 28, 1992. p. 80-84.
Th€ WaU Slr*«t Joum*!, May 26, 1992, pagt B6.
)ust like the weil puWldzed coiKept of itw dtfect products
m
As
the
is
correct.
Ballou and
also suggests that
Rather, the losses are always
that a small percentage of the quality characteristics,
always contributes a high percentage of the quality
to target
terms of error rate
The Pareto Principle
uniformly distributed over the quality characteristics.
distributed in such a
in
Mailing
"the vital few,
a result, the cost improvenrtent potential
manufactunng
literittir*.
is
high for "the
few" projects whereas the
vital
cure costs more than the disease (Juran
&
"trivial
Cryna,
many"
1980).
defects are not worth tackling because the
sum, when the cost
in
prohibitively high,
is
it
is
not feasible to attain zero defect data.
Given
may
that zero defect data
make
judgment
a
deasion
to
and
we
This suggests that
able to judge the quality of data.
characteristics of the data
not always be necessary nor attainable,
manufacturing process
its
it
indicators such as
who
for
example,
would be
it
originated the data,
.
From
useful to
when
useful to be
tag data with quality indicators
know
the data
which are
these quality indicators, the user can
of the quality of the data for the specific application at hand.
purchase stocks,
would be
was
In
making
a financial
the quality of data through quality
collected,
and how
the data
was
collected.
In this paper,
we
pro|X)se an attribute-based model that facilitates cell-level tagging of data.
Included in this attribute-based model are a mathematical model description that extends the
relational
model, a
SQL
process
set of quality integrity rules,
In
a quality indicator algebra
which can be used
queries that are augmented with quality indicator requirements.
indicators, the user can
data.
and
order
make
a better interpretation of the data
to establish the relationship
From
to
these quality
and determine the believability of the
between data quality dimensions and quality indicators, a
data quality requirements analysis methodology that extends the Entity Relationship (ER) model
is
also presented.
Just as
it
is difficult to
product which define
its
manage product
quality,
it is
also difficult to
characteristics that define data quality.
quality,
for the
LL
quality without understanding the attnbutes of the
manage data
quality without understanding the
Therefore, before one can address issues involved in data
one must define what data quality means.
In the following subsecfion,
we
present a definition
dimensioru of data quality.
Dimension
Accuracy
of data guall^
is
the most obvious dimension
when
it
comes
to data quality.
"errors ocair beciuse of delays in processing times, lengthy correction times,
stringent data edits" (Morey, 1982).
In addifion to defining
Morey suggested
and overly or
insufficiently
accuracy as "the recorded value
conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value
of date),
completeness
(all
that
is
is
m
not out
values for a certain variables are recorded), and consistency (the
representafion of the data value
is
the
same
in all cases)
as the key dimensions of data quality (Ballou
&
Pazer, 1987).
Huh
et al. identified
accuracy, completeness, consistency, and currency as the most
important dimensions of data quality (Huh,
It
the
,
1990; Juran, 1979; Juran
(1)
Data quality
(2)
Data quality
is a
is
the data
two
from
characteristics by considering
First the user
(the user has the
believability.
In
to
means and
to
he accessible
be believable, the user
consistent, credible,
Guarrascio, 1991).
a user
may make
It
is
also
decisions based on
to get to the data,
privilege to get the data).
to the user's
which means
that
Second, the user
may
to the user, the
and accurate Timeliness,
.
among
making
process). Finally, the
accessibility, interpretability, usefulness,
data must be available (exists in
must be relevant
consider,
decision
user can use the data as a decision input).
are the following four dimensions:
order
how
must be able
(to the extent that the
that can be accessed); to be useful, the data
and
Wang &
data (the user understands the syntax and semantics of the data). Third,
data must be believable to the user
and
for
Pazer, 1985, Garvin, 1983; Garvin, 1987; Garvin, 1988;
must be useful (data can be used as an input
list
manufacturing nor
intrinsic characteristics of data quality:
a database.
to interpret the
Resulting from this
of quality for
in
a hierarchical concept.
must be accessible
must be able
k
have been well established
multi-dimensional concept.
illustrate these
certain data retneved
the data
dimensions
Gryna, 1980, Morey, 1982;
<Sc
two
interesting to note that there are
We
tor quality control
field (e.g., Juran, 1979), neither the
data have been ngorously defined (Ballou
et al
methods
interesting to note that although
IS
manufactunng
Huh,
ot al., 1990)
(fits
requirements for making the deasion);
other factors, that the data be complete, timely.
in turn,
can be characterized by currency (when the data
item was stored in the database) and volatility (how long the item renruiins valid).
the data quality dimensions illustrated in this scenario.
currtnt^ Qhon-vol«tll^
Figure
1:
some form
A Hierarchy of DaU Quality Dimensions
Figure
1
depicts
These multi-dimensional concepts and hierarchy of data quahty dimensions provide
a
conceptual framework for understanding the characteristics that define data quality.
In this paper,
focus on interpretability and believability, as
a function of the
information system and usefulness
application domain.
iu2,
The idea
Data quality: an
to
this effort
Data
may
attri bute-based
a
over
a
penod
be primanly
function of an interaction between the data and the
more concretely below.
illustrated
database on technology companies. The schema used
company name, CEO name, and earnings
of time
Table
Company .Mame
is
a
to
example
contain attributes such as
may be collected
consider cKCcssibility
be primarily
of data tagging
Suppose an analyst maintains
we
we
and come from
1:
Company
a
vanety of sources.
Information
to
support
estimate (Table
1).
value
may have
necessary
may,
to
know
in turn,
arbitrary
a set of quality indicators associated
with
In
it.
the quality of the quality mdicators themselves, in
have another
applications,
which case
it
mav
be
a quality indicator
As such, an attnbute mav have an
set of associated quality indicators.
number of underlying
some
levels of quality indicators. This consntutes a tree structure, as
shown
in
Figure 2 below.
(attribute)
(indicator)
(indicatoT)
tK:
zr\...
(indicator)
^
Figure
2:
An
(indicator)
attribute with quality indicators
Conventional spreadsheet programs and database systems are not appropriate for handling
data which
is
structured in this manner.
In particular, they lack the quality integrity constraints
necessary for ensunng that quality indicators are always tagged along with the data (and deleted
when
the data
is
deleted) and the algebraic operators necessary for attnbute-based query processing.
an attribute with
In order to associate
developed
of quality indicators associated with
is
immediate quality indicators,
between the two, as well as between
to facilitate the linkage
This paper
its
based data model. Discussion and future direchons are made
RarionaU
Any
relation.
same
quality.
set
we
Section 3
present the attnbute-
in Section 5.
and present
cell level,
summarize the
the terminology used in this paper.
fnr f»ll.l»v»» tagging
characteristics of
It is,
In section 4,
discuss our rationale for tagging data at the
literature related to data tagging,
and the
Research background
2.
LI.
a quality indicator
Section 2 presents the research background.
presents the data quality requirements analysis methodology
we
mechanism must be
it.
organized as follows.
In this section
a
dau
at the relation level
however, not reasonable
to
assume
should be applicable to
that all instances
(i.e.,
instances of the
tuples) of a relarion have the
Therefore, tagging quality indicators at the relation level
quality heterogeneity at the instance level.
all
is
not sufficient to handle
By the same token, any characteristics
to all attribute
of data
However, each
values in the tuple.
tagged
at the
tuple level should be applicable
attribute value in a tuple
may
be collected from
different sources, through different collection methods, .ind updated at different points
Therefore, tagging data at the tuple level
is
basic unit of manipulation,
to tag quality
We now
2ol
Work
examine
necessary
is
it
also msutticient
Smce
information
m
time.
the attribute value of a cell
is
the
at the cell level.
the literature related to data tagging.
related to data tagging
A mechanism
DENOTE
operations
operators
is to
for tagging data has
been proposed by Codd.
and un-tag the name
to tag
It
includes
of a relation to each tuple.
permit both the schema information and the database extension
uniform way (Codd,
1979).
It
does
not,
however, allow
for the
NOTE,
T.AG, and
The purpose
of these
be manipulated
to
in a
tagging of other data (such as source) at
either the tuple or cell level.
Although self-describing data
schema
to
level
and meta-data management have been proposed
(McCarthy, 1982; McCarthy, 1984; McCarthy, 1988), no
marupulate such quality information
A
files
at the hjple
and
specific solution has
been offered
cell levels.
rule-based representation language based on a relational schema has been proposed
data semantics at the instance level (Siegel
attribute values based
the tuple level as
on values of other
opposed
k
Madnick,
These rules are used
1991).
However, these
attributes in the tuple.
to the cell level,
and thus
at the
to
to store
denve meta-
rules are specified at
cell-level operations are not inherent in the
model.
A
polygen model
(px)Iy
= multiple, gen = source) (Wang
&
Madnick, 1990) has been proposed
tag multiple data sources at the cell level in a heterogeneous database environment
important
to
know
where
it
to
is
not only the originating data source but also the intermediate data sources which
contribute to final query results.
The
research, however, focused
on the "where from" perspective and
did not provide mechanisms to deal with more general quality indicators.
In (Sciore, 1991), annotations are
oriented environment
general treatment
is
to
However, data quality
support the temporal dimension of data in an objectis
a
multi-dimensional concept.
necessary to address the data quality issue.
calculus-based language
data.
used
is
provided
to
Therefore, a
more
More importantly, no algebra
or
support the manipulation of annotations associated with the
The examination
above research
the
of
efforts
suggests that
in
functionality of our attnbute-based model, an extension of existing data models
—
2^
order
is
support the
to
required.
Tcnninology
To
An
•
we
facilitate further discussion,
application ^ttrit>Mte refers
to
entity-relationship (ER) diagram.
introduce the following terms:
an attnbute associated with an entity or a relationship
m
an
This would mclude the data traditionally associated with
an application such as part number and supplier.
A
•
quality parameter
defines
is
a qualitative or subjective
when evaluating
dimension of data quality
that a user of data
For example, believability and timeliness are such
data quality.
dimensions.
•
As introduced
in Section
characteristics of data
collection
•
A
method
1,
and
manufacturing process.^
its
Data source, creation
time, and
are examples of such objective measures.
quality parameter value
for a particular quality
defined by users
quality indicators provide objective information about the
to
is
the value determined (directly or indirectly) by the user of data
parameter based on underlying quality indicators.
map
quality indicators to quality parameters.
Functions can be
For example, the quality
parameter credibility may be defined as high or low depending on the quality indicator source
of the data
•
A
.
quality indicator value
is
data quality indicator source
We
tagging,
a
measured
may have
have discussed the rationale
for cell-level tagging,
for the specification of data quality
in
this paper.
We
for a
summarized work
to
related to data
In the next section,
parameters and indicators.
users to think through their data quality requirements, and
would be appropriate
For example, the
a quality indicator value The Wall Street journal.
and introduced the terminology used
methodology
characteristic of the stored data.
The
we
intent is to allow
determine which quality indicators
given application.
consider u\ uidicatof objective
li it
i»
present a
generated using a well defined and widely accepted m«««if«.
Data quality requirements analysis
3.
In general, different users
data
may have
may have
different quality characteristics.
thorough treatment of these
The reader
is
referred to
Appendix A
requirements analysis (Batini, Lenzirini,
on quality aspects
it
is
an
effort
similar
in
spirit
X'avathe, 1986; Mavathe, Batini,
Based on
of the data.
&
for a
more
to
traditional
data
Ceri, 1992; Teorey,
this similarity, parallels
between traditional data requirements analysis and data quality requirements
can be drawn
analysis.
depicts the steps involved in performing the proposed data quality requirements arulysis.
application requirements
i
Step1
detemiine the application view of data
I
application view
application's
Step 2
quality requinnents^
determine (subjective) quality parameters for the
application
candkjata quality
paramaters
T
parameter view
Step3
determine (objective) quality Indicators for the
application
T
quality vl«w
(1
)
s"ter4
•
-
quality view
^
I
(I)
•
^^
quality view (n)
^
quality view Integration
1schema
quality
Figure
The
input, output
of
issues.
Data quality requirements analysis
1990), but focusing
and different tvpes
different data quality requirements,
3:
and
The process of daU quality requirtmento analysis
objective of each step are
descnbed
in the
following subsections.
Figure
3
IL
—
Step
Step
1
IS
the
whole of the
A comprehensive
this paper.
&c
Establishing the applications vievy
1:
Navathe, 1986; Navathe,
Batini, &c
system which contains companies
An ER diagram
symbol.
Figure 4
a stock
that
will not
be elaborated upon
in
treatment ot the subiect has been presented elsewhere (Batini, Leazirini,
Cen.
1992; Teorev. 1990).
For illustrative purposes, suppose that
earnings estimate, while
modeling process and
traditional data
we
ire interested in
that issue stocks.
designing a portfolio management
company has
.A
a
company name,
has a share pnce. a stock exchange (NfYSE,
documents
AMS,
the a pplication view for our running
a
CEO, and an
or OTC), and a ticker
example
is
shown below
in
.
COMPANY
ISSUES
STOCK
SHARE PRICE
CEO NAME
EARNINGS ESTIMATE
Figure
2JL
Step
2:
4:
Application view (output from Step
1)
Determine (subjective) q uality parametera
The goal
in this step
These parameters need
to
dimensional concept, and
is
parameters from the user given an application view
be gathered from the user
may
illustrates the addition of the
application view.
to elicit quality
5:
systematic
way
as data quality
is
be operationalized for tagging purposes in different ways.
two high
level parameters, interpretability
Each quality parameter identified
Figure
in a
Interpretability
is
shown
a multi-
Figure
and believability.
inside a "cloud" in the diagram.
and believability added
to the application
view
5
to the
Interpretability can be defined through quality indicators such as data units (eg., in dollars)
and scale
Believabilitv can be defined in terms ot lower-level quality parameters
(e.g., in millions).
such as completeness, timeliness, consistency credibility, and accuracy
,
The quality parameters
defined through currency and volatility.
The resulting view
the application view.
stock entity which
is
shown
in
Figure
i^
Step
3:
identified in this step are
We
referred to as the parameter view.
can be
added
to
focus here on the
6.
Parameter view for the stock entity
in
Step 3
is to
(partial
output from Step
2)
operationaiize the pnmarily subjective quality parameters identified
Step 2 into objective quality indicators.
rectangle)
in turn,
Determine (objective) quality indicators
The goal
in
6:
Timeliness,
STOCK
-O-i
Figure
is
.
and
is
Each quality indicator
is
depicted as a tag (using a dotted-
attached to the corresponding quality parameter (from Step
view The portion of the quality view
.
for the stock entity in the
2),
running example
is
creating the quality
shown
in Figure 7.
UNITSf
-Oi
Figure
7:
STOCK
The portion
of the quality
view
10
for the stock entity (output
from Step
3)
Corresponding
currency umts
pnce
IS
to the quality
which share price
in
is
parameter intcrpretable are the more objective quality indicators
measured
(e
g
,
5 vs
the latest closing price or latest nominal price
IS
V)
and status which says whether the share
Similarly, the believabilitv of the share price
indicated by the quality indicators source and reportin)^ Jate
For each quality indicator identified
a quality
in
.
view,
indicators for a quality indicator, then Steps 2-3 are repeated,
if
it
making
important
is
this
the
name
of the )Ournal) but also on the second level source
(i.e.,
the
name
the Earnings Estimate figure to the journal and the Reporting date).
depicted l>elow
in
source
This scenario
is
8.
.ANALYSTS NAME
Figure
All quality
first level
For
of the financial analyst
who provided
Figure
have quality
an iterative process.
example, the quality of the attnbute Earnings Estimate may depend not only
on the
(i.e.,
to
8:
REPORTINQ DATE'
'
Quality indicators of quality indicators
views are integrated
in Step 4 to generate the quality schenia, as discussed in the
following subsection.
l±
Stgp
Crgaring tht quality
4:
When
the design
multiple quality views
must be consolidated
Lenzirini,
k
is
may
vh^ma
large
result.
and more than one
set of application requirenrtents is involved,
To eliminate redundancy and
into a single global view, in a process similar to
is
(Batini,
called the quality schema.
This involves the integration of quality indicators.
suffice.
schema integration
Navathe, 1986), so that a variety of data quality requirements can be met. The resulting
single global view
may
inconsistency, these quality views
In
more complicated
indicators in order to decide
cases,
it
may
what indicators
In
be necessary
simpler cases, a union of these indicators
to
examine the relationships among the
to include in the quality
11
schema. For example,
it
is
likely
that
one quality view may have
a^
tim^ for the same quality parameter.
as an indicator, whereas another quality view
may
time
In this case, creation
mav have
creation
be chosen for the qualitv schema
because age can be computed given current time and creation time.
We
have presented
a step-by-step
procedure
processing of quality indicators as specified
The
4.
choose
in the quality
now
of data quality
extend the relational model because the structure and semanhcs of the relational
to
attribute-based data
model
data manipulation.
We
il,
are
schema.
model
attribute-based
approach are widely understood. Following the
Codd,
We
specify data quality requirements.
present the attnbute-based data model tor supporting the storage, retrieval, and
in a position to
We
to
is
divided into three parts;
assume
model (Codd,
relational
that the reader
1982), the presentation of the
data structure,
(a)
data integrity, and
(b)
familiar with the relational
is
model (Codd,
(c)
1970;
1979; Date, 1990; Maier, 1983).
Data structure
As shown
in Figure 2 (Section
levels of quality indicators.
and the
an attribute
may have
to facilitate the linkage
In
its
immediate quality indicators, a
between the two, as well as between
set of quality indicators associated with
the quality key concept.
an arbitrary number of underlying
In order to associate an attribute with
mechanism must be developed
indicator
1),
This
it.
extending the relational model,
mechanism
Codd made
is
a quality
developed through
clear the
need
uniquely
to
identify tuples through a system-wide unique identifier, called the tuple ID (Codd, 1979; Khoshafian
Sc
Copeland, 1990). ^
an attribute
Specifically,
is
applied
in the
in a relation
scheme
is
This concept
attribute, consisting of the attribute
and
a qualitv
attnbute-based model
expanded
key
into
where EEr
is
an ordered
is
is
referred to as a qualitv
scheme
defines a quality scheme for the quality relation
indicators are associated with the attributes
.
In Table 4,
Company.
CN
qualitv
pair, called a
expanded
the quality key for the attribute EE (Tables 3-6 are
expanded scheme
enable this linkage.
.
For example, the attribute Earnings Estimate (EE) in Table 3
4
to
embedded
«CN,
The
into (EE, EEt) in Table
nil«>,
in
(CEO,
Figure
nilc),
9).
This
(EE, EEt))
"nilc" indicates that no quality
and CEO; whereas EEc indicates
that
EE has
associated quality indicators.
Correspondingly, each
quality
cell,
cell in a relational
tuple
is
expanded
into
an ordered
pair, called a
consisting of an attribute value and a quality key value. This expanded tuple
SuniUrly, in the ob)«ct-on«nt«d lHtr«ture, the ability to
property of an ob^-onenled data model.
12
mak«
references through objtet infinity
is
is
referred to
corsdered
a basic
as a quality tupl^ and the resulting relation (Table
key value
4) is
referred to as a quality relation
in a quality cell refers to the sot ot quality indicator
the attribute value.
This set of quality indicator values
tuple called a quality indicator tuple
quality indicator tuples
is
quality indicator relation
Under
is
quality relation
.A
.
the relational
scheme
Relation for
3:
Company
:E0 Name
Earnings
iCEO)
Estimate (EE)
IBM
Akers
DELL
Dell
model
4:
Qua tuy
w
Rela tion to
onCpmpany
(CN,
tid
(IBM,nil<:)
id002«
(Dell,
Table
(Barron s,id20U)
SRClc
Figure
(SRC2,
idlOU)
idl02t>
EE attnbute
for the
(Oct6
Level-Two QIR
6:
nik)
Date.
nilt)j
(Oct 5 92, niU)
idl02(f '<WallStJnl,id202e>
Tables
(Dell,
News
(SRCl.SRCU)
(7,
,.(3,
(Akers, nil(>
nik)
Level-One QIR
5;
(EE EE«>
(EE,
E
(jCEoTniki
nilc)
idOOU
EE<
nilt)
(Entry Clerk.
(Joe, nilt)
(Mary, nile)
92, nil«>
EE
for the
attribute
(Report date,
nilt;
id201< (Zacks, nilt)
(Sep
id202c (Zacks, nil<)
(Sepl5 92,nilO
9:
The Quality Scheme
1
nil<:
92, nil<>
Company
Set for
quality key thus serves as a foreign key, relating an attribute (or quality indicator) value
a quality indicator relation for the
to its associated quality indicator tuple.
For example. Table 5
attnbute Earnings Estimate and Table 6
a quality indicator relation for the attnbute
data) in Table
is
5.
The quality
cell
its
is
<Wall St
)nl,
is
SRCl
(source of
id202«) in Table 5 contains a quality key value, id202<,
a tuple id (primary key) in Table 6.
Let qri be a quality relation
then
that defines the
.
I
which
kind of quality
of a set of these time-varying
.Name tCN)
id002(j
Table
The
a
Company
idOOU
the attribute-based
3
form
model
tid
idlOU
to
The quality scheme
.
referred to as the quality indicator
Table
Under
composed
called a qualitv indicator relation
Each quality
values immediately associated with
grouped together
is
.
quality key
and
must be non-null
a
an attnbute
(i.e.,
containing a quality indicator tuple for
a,
not
then
in qri.
"nil<").
all
13
If
a has associated quality indicators,
Let qr2 be the quality indicator relation
the attributes of qr2 are called level-one quality
indicators for
a.
Each attnbute
in qrj
,
in turn,
an attribute can have n-levels of quality indicator relations associated with
In general,
example. Tables 5-6 are referred
for the attribute
We
to respectively as Icvol-ono
and level-two quality indicator
define a quality scheme set as the collection of a quality scheme and
set for
s
n
it,
0.
it.
For
relations
Earnings Estimate.
indicator schemes that are associated with
scheme
can have a quality indicator relation associated with
We
Company.
quality indicators.
A
Figure
9,
the quality
Tables 3-6 collectively define the quality
define a quality database as a database that stores not only data but also
quality
schema
structure of a quality database.
indicator schemes, quality
In
it.
all
is
defined as a
Figure 10 illustrates the relationship
scheme
sets,
scheme
set of quality
among
sets that describes the
quality schemes, quality
and the quality schema.
Quality
Schema
Quality schemes, quality indicator schemes, quality scheme
Figtire 10
We now
developed
domain
(in
Table
product
present a matheinatical definition of the quality relation.
in the relahonal
4,
model,
system-wide unique
for a
7 €
sets,
EE where EE
D X ID (in Table 4, <7.
is
we
domain
define a
and the quality schema
Following the constructs
as a set of values of sirrular type.
identifier (in Table 4, idlOlc € ID).
a don^ain for earnings estimate).
the
D
be a domain for an attnbute
DID
be defined on the Cartesian
Let
Let
ID be
Let
idlOU) € DID).
Let uf be a quality key value associated with an attribute value d
quality relation (qr) of degree tn
is
defined on the
m+1 domains (m>0;
where d
in Table 4,
€
D
m»3)
and
if it
is
id €
ID.
A
a subset of
the Cartesian product:
ID X DIDi X DID2 X
Let
is
(jt
designated
be a quahty tuple, which
an element
in a
X DIDm-
quality relation.
Then
a quality relation qr
as:
(^
The
is
...
«
Iqt
I
qt » <id, did], did2,
••,
didm) where
integrity constramts for the attnbute-based
14
model
id « ID, didj c DID},
is
presented next.
j
-
1,
...
,m)
Data integ rity
4J.
A fundamental property
corresponding quality (including
we mean
atomic unit
that
all
whenever an
other words, an attribute value and
We
attribute ".alue
its
to
is
that an attribute value
and
.
created, deleted, retrieved, or modified,
its
Bv
its
bo created, deleted, retrieved, or modified respectively
corresponding quality indicator values behave atomicallv
property as the atomicity property hereafter.
refer to this
is
descendant) indicator values are treated as an atomic unit
corresponding quality indicators also need
In
model
of the attribute-based
This property
is
enforced by a set of
quality referential integrity rules as defined below.
Insertion
key present
Insertion of a tuple in a quality relanon
:
in the tuple (as specified in the quality
indicator tuple
must be
must ensure
schema
definition), the
inserted into the child quality indicator relation.
in the inserted quality indicator tuple, a
that for
each non-null quality
corresponding quality
For each non-null quality key
corresponding quality indicator tuple must be inserted
at the
next level. This process must be continued recursively until no more insertions are required.
Deletion
key present
with
all null
must be deleted from the
the tuple, corresponding quality information
in
corresponding
Deletion of a tuple in a quality relation must ensure that for each non-null quality
:
to the quality key.
This process must be continued recursively until a tuple
table
encountered
is
quality keys.
Modification
:
If
an attribute value
is
modified
in a quality relation, then the
descendant
quality indicator values of that attribute must be modified.
We now
4A
introduce a quality indicator algebra for the attnbute-based model.
Data manipulation
In order to present the algebra formally,
to the quality indicator algebra:
4.3.1.
and
associated with
attributes a,
aj be
ai-
are defined to be
define two key concepts that are fundamental
Ql-compatibilitv and OIV-Equal
and
two application
Ql-Compatible with
aj
assume
attributes.
.
Let QI(a,) denote the set of quality indicators
Let S be a set of quality indicators.
shown
in Figure
then the attributes a, and aj shown
We
first
Ql-Compatibility and QIV-Equal
Let a,
6
we
that the
H
in
If
S
i;;
QKa,) and S C QKaj), then
For example,
respect to S.6
if
S =
are Ql-Compatible with respect to
Figure 11 are
numeric subscripts (eg,
ncji
qi,,)
quality indicator tree.
15
S.
(qii, qij, qiji),
Whereas
Ql-Comparible with respect
map
if
S =
a,
and
a^
then the
(qii
qi22).
,
to S.
the quality indicators to unique positions in the
Figure
11:
Ql-Compatibility Example
Let a, and a2 be Ql-Compatible with respect to
respectively.
qHw,) be
Let
(qi2(wi) = V2 in Figure
V
qi(w2)
S =
qi e S,
12).
Let w, and wj be values of a.
the value of quality indicator qi for the attribute value w,
Define w, and wj to be
denoted as w, = W2.
QIV-Equal with
In Figure 12, for
but n^l QIV-Equal with respect to S =
(qi,, qi2i),
S.
resfject to
where
and
a;
qi e S
S provided that qi(w;
=
)
example, w, and W2 are QIV-Equal with respect
to
qh,) because qi3i(w,) = V3, whereas qi3](w2) =
(qi,,
X3,.
(32,^2)
(q'l-
(qi2-V2)
V,)
\\\
/\
C^r^n^
(qi,.v,) (qij.v^)
(PI3. ^3)
Figure
In practice,
it
QIV-Equal) as
indicators
up
a
12:
S).
To
to
S
,
and
^1
1
)
C'lz^
to a certain level of
depth
if
the following
(2)
S consists of
a,
and
aj
^
('"3r'3i)
y
the quality indicators to be
we
compared
(i.e.,
to
introduce i-level Ql-compatibility
(i-
which
aU. the quality
in a quality indicator tree are considered.
Let a,
and
respectively, then w,
two conditions are
all
\
\
3) ('"2r ^2i)
special case for Ql-compatibility (QlV-equal) in
and W2 be values of
Compatible
all
alleviate the situation,
Let a, and a^ be two application attributes.
Let w,
C'n
QIV-Equal Example
tedious to explicitly state
elements of
sfjecify all the
level
is
/\
(l*22-^22) «"3r^31>
(^'l2-^12)<'"2r''21^
(qij.vj)
satisfied.
(1)
a^
be Ql-Compatible with respect
and W2 are defined
a,
quality indicators present within
and
i
82 are
to
be
i-levgl
to S.
QI-
Ql-Compatible with respect
levels of the quality indicator tree of
a, (thus of 82).
By the same token,
If
'i'
is
the
i-level
maximum
level of
be maximum-level Ql-Comparible
by w,
=*" W2,
QIV-Equal between w, and W2, denoted by w,
.
depth
in the quality indicator tree,
Similarly,
=> W2,
can be defined.
then a, and a2 are defined to
maximum-lev el QIV-Equal between w, and W2, denoted
can also be defined.
16
To exemplify
the algebraic operations in the quality indicator algebra,
quality relations having the
Large_and_Medium
Figure
(Tables
same
7,
7
1,
quality
Table 7
set js
7 2 in Figure 13)
14).
<CN.
scheme
nilj>
shown
in
Figure
9.
we introduce two
They are
and Small_and. Medium (Tables
referred to as
8, 8 1,
and
8 2 in
'with
the clause
If
explicit constraints
SQL
extended
which
The quality
that
in a
being retrieved.
is
syntax, the dot notation
in turn
is
namely
quality relations associated with
and
tj
no
case quality indicator values
to identify a quality indicator in the
example, EE.SRCl.SRC2 identifies SRC2 which
a quality
is
presented in the following subsection.
(That
QUALITY"
clause.)
are identical.
QR
QR
Let
S^
is,
is
define the five orthogonal quality relational
and QS
and Cartesian product.
respectively.
and b be two
Let a
qr and qs be two
let
attributes in both
a set of quality indicators specified
by the user
a =^* t2.a denote that the values of attnbute a in the tuples
t,
for the
Similarly, let
t,
='
.a
t2.a
and
=
a
t,
""
the values of ti.a
t2.a
and
denote
and
ti
i-Ievel
t2
and
t,
are
QIV-
QlV-equal and
t2.a.
Selection
Selection
is a
unary operation which
selects only a
honzonul subset
of a
quaUty relation (and
corresponding quality indicator relations) based on the conditions specified
and quality conditions
application attribute.
The
o^C^qD-UI Vti
whereC(ti)»e,
or (tiaetib);
ti .b);
selection, a'^c (qr)-
€ qr.
for the quality indicator relations
ei''
is
corresponding
QR,((t.a =
ti
'I
a) A(t.a
=
(t)...*^;
""
t,.a))
(
a. v.
compared during
^
);
9
to the
6; is
in
a C(ti))
one of the forms:
(tt-a
9
constant)
of the forms (qik = constant) or (ti.a =^'*ti.b)or (tia ='ti.b)or
*6
an
defined as follows:
'S
<ftej ,»...<& e„<t>e,<i<t) ez
qik € QUa);
indicators to be
Vac
in the Selection
in the Selection ofjeration: regular conditions for
There are two types of conditions
application attribute
="
and
constructed form the specifications given by the user in the "with
maximum-level QIV -equal respectively between
op>eration.
QR
Let the term ti.a = tj.a denote that the values of the attnbute a in the tuples
equal with respect to Sg.
4.3.2.I.
we
and QS be two quality schemes and
be two quality tuples. Let S^ be
a.
its
used
is
selection, projection, union, difference,
In the following operations, let
ti
that the user has
a quality indicator to EE.
is
indicator algebra
algebraic operations,
attribute
In that
means
however, the quality indicator values associated with
Following the relational algebra (Klug, 1982),
t,
it
Quality Indicator Algebra
4.3.2.
QS. Let
user query, then
retrieved as well.
In Figure 9, for
quality indicator tree.
absent
in the retrieval process,
would be
the applications data
indicator for SRCl,
is
on the quality of data
would not be compared
In the
QUALITY"
= (S. 2. s. *, <. >. -); and Sa,b
the comp>anson of
18
ti
a
and
t]
b.
>«
(ti
a
the set of quality
Example
Get
1:
all
Large_and_Medium companies whoso earnmgs estimate
is
over
2
and
is
supplied bv
Zacks Investment Research.
A
algebra.
corresponding extended
SQL query
is
^hown
SELECT
FROM
CN, CEO. EE
WHERE
EE>2
EESRCrSRC2=Zacks'
LARGE AND MEDIUM
with
QUALITY
This
SQL query
The
result
is
can be accomplished through
shown below.
<CN,
Table 9
Table
as tollows:
9.1
nile>
a Selection
operation in the quality indicator
SELECT
CN, EE
FROM LARGE, and_MEDIUM
This
SQL query
can be accomplished through a Proiection operation. The result
<C\,
r\\\i>
is
shown below.
<C.\',
niU>
\ote
This
is
also that unlike the relational union, the
Example
illustrated in
Example
op>eration in
3-3
:
Example
3-3
Consider the following extended
3-b:
SMCN. SMCEO, SM EE
SMALL_and_MEDILM SM
UNION
SELECT
FROM
LARCE_and MEDIUM LM
The
LM.CN, LM CEO, LM EE
QUALITY
result
is
(LM.EE.SRC1= SMEE.SRCl)
shown below.
<CNI,
is
not
commutative
below
SELECT
FROM
with
quahty union operation
mlo
SQL query which
switches the order ot the union
Example
4:
as
A
Get
all
the
companies which are
clabSiticcJ as
only Large_and_Medium companies but not
Small_and_Medium companies.
corresponding
SQL query
is
shown
as rollows:
SELECT
FROM
DIFFERENCE
LM.CN, LMCEO, LM EE
SELECT
SM CN, SM CEO, SM EE
LARCE.and.MEDILM LM
FROM
with
QUALITY
This
SQL query
SMALL_and_MEDILV1 S.V1
= SM EE SRCl SRC2)
(LM EE.SRCLSRC2
can be accomplished through a Difference operation.
below.
<CN,
nil>
The
result
is
shown
Cartesian Product
4.3.2.5.
The Cartesian product
Let
t,
€ qr
the tuple
6 qs. Let
t2
The tuple
tj.
qr X^ qs =
(
t
I
X^ t,
denote the
i'^
<L.MCN' niio
QR
qs,
be of degree
t,
and
t2(i)
r
and QS be
denote the
i'^
of degree
attribute of
from the Cartesian product of qr and qs
denoted as qr X^
qs,
is
s.
will
be
defined as follows:
€ qr, Vt2 6 qs,
A t(r+l)
='"
t(2)
=
t,(2)
A t(2)="^ t,(2)A
tjd) A t(r+2) =
result of the Cartesian product
shown below.
Let
attribute of the tuple
in the quality relation resulting
= t,(l)At(l)="'t,(l)A
t(r+l) = tjd)
The
t
also a binary operation
The Cartesian product of qr and
of degree r+s.
t(l)
t,(i)
is
t2(2)
a t(r+2)
...
=""
t(r)
t2(2)
= t,(r)A t(r)='"t:(r)A
a
...
t(r+s)
=
t2(s)
a
t(r+s)
='
t2(s)
between Large_and_Medium and Small_and_Medium
is
)
We
have presented the attnbute-based model including
set of integrity constraints for the
model, and
the capabilities of this
mdicator algebra.
a quality
algebraic operations are exemplified in the context or the
model and future research
a description of the
SQL
model
structure, a
In addition,
each of the
query. The next section discusses
some
of
directions.
Discussion and future direcrions
5.
The attribute-based model can be applied
in
many
different
ways and some
of
them are
listed
below:
•
The
retain the origin
•
A
model
ability of the
user can
Example
filter
the data retneved from a database
QUALITY
in
makes
The reputation
inspected.
according
to quality
who
requirements.
capability
is
very useful
in
is
In
retrieved as
among
certification.
inspected or certified the data and
when
it
A
was
enhance the believability of the data.
The quality indicators associated with data can help
semantic incompatibility
this.
EE.SRCl.SRC2='Zacks'."
of the inspector will
to resolve
possible to
it
Figure 9 illustrates
Data authenticity and believability can be improved by data inspection and
quality indicator value could indicate
•
at multiple levels
only the data furnished by Zacks Investment Research
specified in the clause "with
•
support quality indicators
and intermediate data sources. The example
for instance,
1,
to
clarify data semantics,
which can be used
data items received from different sources.
an interoperable environment where data
in different
This
databases
have different semantics.
•
Quality indicators associated with an attribute
values.
it
For example,
can be interpreted
the spouse
name
is
if
may
facilitate a better interpretation of null
the value retneved for the spouse field
(i.e.,
tagged) in several ways, such as
unknown, or
(3) this
tuple
is
is
empty
(1) the
•
In a data quality control process,
the source of error
In this paper,
and processed.
when
by examining quality
we have
Specifically,
investigated
we have
requirements analysis and specification,
(2)
is
employee
unmarried,
table
(2)
from the
field.
errors are detected, the data adnrvirustrator can identify
indicators such as data source or collection method.
how
(1)
employee
inserted into the
materialization of a view over a table which does not have spouse
m an employee record,
quality indicators
may be
specified, stored, retrieved,
established a step-by-step procedure for data quality
presented a model for the structure, storage, and processing
25
of quality relations
and quality indicator relations (through the algebra), and
of
are artively pursuing research
denved data
investigating
values of
its
(e.g.,
m
the following areas:
combining accurate monthly data
mechanisms
components.
to
determine the quality of
(2) In
order
to
touched upon
and control.
functionalities related to data quality administration
We
(3)
vvith
(Din order
less
to
determine the quality
accurate weekly data),
we
are
derived data based on the quality mdicator
use this model for existing databases, which do not have
tagging capability, they must be extended with quality schemas instantiated with appropriate
quality indicator values.
effective.
(3)
We
are exploring the possibility of
Though we have chosen
the relational
model
oriented approach appjears natural to model data and
quality control
methods),
we
its
to
making such
a transformation cost-
represent the quality schema, an object-
quality indicators.
Because many of the
mechanisms are procedure oriented and o-o models can handle procedures
are investigating the pros
and cons
of the object-oriented approach.
26
(i.e.,
References
6.
11
Ballou, D. P.
Pazer, H.
St
(1985)
L.
Modeling Data and Process Quality
output Information Systems. Management
2)
&
Ballou, D. P.
in Multi-input, Multi-
Science. 11(2), pp. 150-162.
Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures in Information
of Management Science. li(6), pp. 509-521.
Systems. International fournal
31
41
Batini, C, Lenztnni, M., & Navathe, S. (1986). A comparative analysis of methodologies
database schema integration. ACM Computing Survey, 1S(4), pp. 323 - 364.
Codd,
ACM,
51
relational
model
of data for large shared data banks.
CommunicaHons
of the
li(6), pp. 377-387.
Codd,
Codd,
capture more meaning.
ACM
practical foundation for productivity, the 1981
ACM
Extending the relational database model
on Database Systems, i(4), pp. 397-434.
E. F. (1979).
Transactiofxs
61
A
E. F. (1970).
for
E. F. (1982).
Relational database:
Turing Award Lecture. Commumcations
An
A
of the
ACM,
to
i5(2), pp. 109-117.
7)
Date, C.
8)
Garvin, D. A. (1983). Quality on the
9)
Garvin, D. A. (1987). Competing on the eight dimensions of quality. Harvard Business Review.
J.
(1990).
Introduction
to
Database Systems (5lh
line.
ed.).
MA: Addison-Wesley.
Reading,
Harvard Business Review. (September- October), pp. 65-
(November-December), pp. 101-109.
10]
Garvin, D. A. (1988). Managing Quaiity-The Strategic and Competitive Edge
The Free Press.
11)
Huh,
12)
Y. U., et
al.
(1990).
DaU
(Quality. Information
(1 ed.).
New
York;
and Software Technology. 12(8), pp. 559-565.
Johnson, J. R., Leitch, R. A., k Neter, J. (1981). Characteristics of Errors in Accounts Receivable and
Inventory Audits. Accounting Review. Sfi(April), pp. 270-293.
131
Juran,
J.
M.
(1979). Quality Control
141
Juran,
J.
M.
&
Gryna,
F.
M.
Handbook (3rd
ed.).
New
York: McGraw-Hill Book Co.
(1980). Quality Planning and Analysis (2nd ed.).
New
York:
McGraw
Hill.
151
16)
17)
Khoshafian. S. N. & Copeland, G. P. (1990). Object Identity. In
37-46). San Mateo, CA: Morgan Kauhnann.
Uudon.
K.
(Ed.), (pp.
C (1986). Data Quality and Due Process in Large Interorganizational Record Systems.
of tht
ACM,
2S(1). pp- 4-11.
Uppuhiri. V. R. R. (1990). Data Quality ControL Theory and Pragmatica (pp. 360).
York: tfciiil Dekkcr, Inc.
Liepins, G.
New
19]
Zdonik* D. Maier
iQug, A. (1982). Equivalence of relational algebra and relational calculus query languages having
aggregate hinctions. Tht Jourml of ACM. 23, pp 699-717.
Communicatiom
18]
S. B.
Uepins, O.
LIk
C0M9). Sound
Data Are a Sound Investment. Quality Programs. (September), pp. 61-
63.
201
Maier, D. (19«3).
TV
Thtory of Rektional Databases (1st
ed.). Rockville,
MD: Computer
Science
Presa.
21]
McCarthy,
Medea
22]
J.
L
(1982).
MeHdatM Management
for
Urge
Statistical
Databasa. Mexico City,
1982. pp. 234-243.
McCarthy,
J.
L. (1984). ScienHfic Information -
School, Monterey,
CA.
1984. pp.
27
Data > Meta-data. U.S. Naval Postgraduate
(231
McCarthy,
(24)
Morey,
L. (1988). The Automated Data Thesaurus:
A >Jew Tool for Scientific Information.
J.
Proceedings of the 11th International Codata Conference, Karlsruhe, Germany. 1988.
pp.
(1982).
C.
R.
Communications
(251
N'avathe,
of the
Batini,
S.,
Estimating and Improving the Quality of Information
ACM. 2i(May), pp. 337-342
C,
Cen,
Si
S.
(1992)
The Entity Rdatwnshiip Approach
.
New
m
the
MIS.
york: Wiley
and Sons.
(261
Rockart,
Sloan
F.
J.
&
Short,
J.
IT in the 1990s:
E. (1989).
Management Review. Sloan
Managing Organizational Interdependence.
School of Management, .Vl/T, liZ(2), pp. 7-17.
E. (1991). Using Annotations to Support Multiple Kinds of Versioning in an Object-Onented
Database System. ACM Transactions on Database Systems. ]£(No. 3, September 1991), pp. 417-438.
(271
Score,
(281
Siegel,
&
M.
Madnick,
S.
E.
(1991).
A metadata approach
to
resolving
semantic
conflicts.
Barcelona, Spain. 1991. pp.
(29)
(301
Teorey, T.
Mateo, CA
J.
:
(1990). Database
Morgan Kaufman
Modeling and Design: The Entity-Relationship Approach
Wang,
R. Y. St
Wang
(Ed.), Information Technology
Kon, H.
.
San
Publisher.
B. (1992).
Towards
m
Total Data Quality
Action:
Management (TDQM). In R. Y.
Englewood Cliffs, NJ:
Trends and Perspectives
Prentice Hall.
[31]
Wang,
Y. R. <fc Guarrascio, L. M. (1991). Dimensions of Data Quality: Beyond Accuracy. (CISL-91Composite Infornuhon Systems Laboratory, Sloan School of Management, Massachusetts
Institute of Technology, Cambndge, MA, 02139 June 1991.
06)
Y. R. & Madnick, S. E. (1990). A Poly gen Model for Heterogeneous Database Systems: The
Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538.
(321
Wang,
[33]
Zarkovich. (1966). Quality of
United Nations.
Statistical
Data
.
Rome: Food and Agriculture Organization
28
of the
Appendix A: Premises about data
7.
Below we present premises
To
analysis.
related to data quality
facilitate further discussion,
refers to both quality parameters
quality requirements analysis
we
modeling and data quality requirements
define a data quality
and quality indicators
as
shown
ath-ilyv,tg
in
as a collective term that
Figure A.l. (This term
is
referred
to as a quality attribute hereafter.)
Data
Quality
Attrlbutas
(coll*cllva)
among
Figure A.l: Relationship
Premises related
7.1.
to data quality
Data quality modeling
is
modeling captures many of the
many
captures
premises relate
(Premise
1.1>
modeling
an extension of traditional data modeling methodologies.
structural
of the structural
to these
quality attributes, quality parameters, and quality indicaton.
and semantic issues underlying
and semantic issues underlying data
may
application
a
name
For example, the
be an entity attribute
it
if
of a teller
initial
may be modeled
is
a judgnnent call
on
(i.e.,
In
some
cases a quality
an application entity's attribute) or as a
who performs
a transaction in a
banking
application requirements state that the teller's
name
as a quality attribute.
modeling perspective, whether an
a quality attribute
The following four
quality.
(Relatedness between enhty and quality attributes):
be included; alternatively,
From
As data
modeling
data quality modeling issues.
attribute can be considered either as an entity attnbuie
quality attribute.
data, data quality
attribute should be
modeled as an
the part of the design team,
entity attribute or
and may depend on the
initial
application requirements as well as eventual uses of the data, such as the inspection of the data for
distribution to external users, or for integration with other data of different quality.
distribution
and integration of the information
quality of the data they use.
When
the data
is
is
that often the users of a
The relevance
given system "know"
of
the
exported to their users, however, or combined with
information of different quality, that q\iality nruy become unknown.
A
g\iidclfaM to this
judgment
is
to
ask what information the attribute provides.
If,
on
process, such as
the other hand, the information relates more
when, where, and by
whom
the data
the attribute
may be
considered an entity
to aspects of the
data manufacturing
provides applicMtton information such as a customer name and address,
attribute.
If
it
was nnanufactured, then
this
may
be a quality
attribute.
In short, the objective of the datt quality requirement analysis
attributes,
but also
to
is
not strictly to develop quality
ensure that important dimensions of data quality are not overlooked entirely
requirement analysis.
29
in
(Premise
orthogonal
related
not orthogonal), such as for real time data.
(Heterogeneity and hierarchy
(Prenruse 1.3)
may
differ across databases, entities, attributes,
may be
university database
example:
Attribute example:
may
example: data about an international student
Premises related
Because
and instances.
human
supplied data): Quality of data
Database example:
may
may be more
needed
is
this
phenomenon
"data quality
is in
in a
Entity
entity).
Instance
be less interpretable than that of a domestic student.
to data quality definitions
insight
accurate than are addresses.
and standards across users
for data quality
modeling and different people may have
different opinions regarding data quality, different quality definitions
call this
information
be less reliable than data about students (an
the student entity, grades
in
in the quality of
of higher quality than data in John Doe's personal database.
data about alumni (an entity)
7.2.
Different quality attributes need not be
For example, the two quality parameters credibility and timeliness are
one another.
to
(i.e.,
(Quality attnbute non-orthogonality):
1.2)
the eye of the beholder."
and standards may
The following two
result.
We
prenruses entail
phenomenon.
(Premise
indicators
may
(Users define different quality attributes):
2.1)
vary from one user
f>arameter for a research report
may need
to
in
and quality
manager the
quality
be inexpensive, whereas for a financial trader, the research report
be credible and timely.
inexpensiveness
(Quality parameter example: for a
to another.
may
(Quality parameters
Quality indicator example:
terms of the quality indicator (m(5netary)
inexpensiveness in terms of opportunity cost of her
own
cost,
the
manager may measure
whereas the trader may measure
time and thus the quality indicator
may
be
retrieval time.
(Premise
differ
2.2)
from one user
(Users have different quality standards):
For example, an investor following the
to another.
consider a fifteen nninute delay for share price
price quotes in real time
may
Premises related
7.3.
A
single user
data used. This
to
movement
of a stock
be sufficiently timely, whereas a trader
is
different quality attributes
summarized
in
and quality standards
for the different
Premise 3 below.
have different quality attributes and quality standards across databases,
instances. Across attributes exajnple:
for certain
numba of employees.
compB\ics, but
iK>t for
A
user
may need
Across instances example:
others
due
to the fact that
^9;
30
A
user nnay
entities, attributes, or
higher quality information for the phorie numt)er
3 9080 00932 7625
^97
who needs
to a single user
(Premise 3) (For a single user; non-uniform data quality attributes and standards):
than for the
may
may
not consider fifteen minutes to be timely enough.
may have
phenomenon
Acceptable levels of data quality
A
user
may need
some companies
high quality information
are of particular interest
Date Due
Lib-26-67
Download