Data and information quality - Computing and Information Systems

Data and Information Quality: an Information-theoretic
Perspective
Wei Hu and Junkang Feng
Decision making and the efficiency of business flow
depend heavily on the quality of the information
systems implemented. The evaluation of data quality
(DQ) and information quality (IQ) has been treated
as a challenging issue in the field of information
systems management for the last twenty years.
However, we observe the definitions of DQ and IQ in
the literature are not necessarily convincing, which
seems to have hampered the development of deep
and sound understanding of the issue, and of
practically applicable and effective measures and
techniques for their evaluation. It would appear that
this might be caused by the seeming lack of
research on the definitions of data quality and
information quality from an information-theoretic
perspective. Through our review of relevant
information systems literature, we believe that a
rigorous and theoretically sound foundation is
highly desirable to provide an insight into specifying
and distinguishing the terms ‘data quality’ and
‘information quality’.
This paper presents a data-info quality model under
an Information Source (S) – Information Bearer (B)
– Information Receiver (R) framework based upon
theories of semantic information, including
Dretske’s semantic theory of information, Devlin’s
‘infon’ theory, Stamper’s Organizational Semiotics,
and Floridi’s revised standard definition of
information. We present a set of definitions, compare
data quality with information quality, and outline the
objective and subjective aspects involved in
addressing this problem. This model forms a basis
for our further research in data and information
quality assessment.
1. Introduction
Information systems (IS) have played a key role in
organizations for decision-making and efficient
business flow for years. Issues regarding the
evaluation of data quality (DQ) and information
quality (IQ) have been increasingly recognized
within the field of information systems
management in recent years. Numerous research
efforts have been made in this area from different
disciplines and using different research approaches
for the purpose of developing data and information
quality concepts and methods (Ballou & Pazer, 85;
Burgess et al., 04; Dedeke, 00; English, 99; Eppler,
01; Hill, 04; Lee et al., 02; Liu & Chi, 02; Price &
Shanks, 04; Redman, 01; Wand & Wang, 96; Wang
& Strong, 96; etc). Hundreds of tools have been
produced for evaluating quality in practice since
1996 (English, 99).
Research and practice indicate that data or
information quality should be defined accurately and
should be taken as encompassing multiple dimensions.
Many data or information quality frameworks have been
presented in the literature. They contain quality
dimensions or categories, normally derived by
some research method in a specific domain,
with a set of quality metrics, criteria, components,
items, or attributes. Eppler (01) gives five future
directions for information quality research. The
quest for a more generic framework and the
development of frameworks that show
interdependencies between different quality criteria
are emphasised. It appears, however, that there is still
a lack of theoretical underpinning for the exploration
of interdependencies or inter-relationships among
the quality indicators proposed. This makes it
difficult for a professional to choose
an appropriate framework, with its large set of criteria,
for the task in hand within an organization.
When investigating further, we observe that the
terms ‘data quality’ and ‘information quality’ are
considered synonyms by many if not all. They are
usually interchangeable in relevant quality literature.
The very concept of IQ is somewhat nebulous
(Ballou et al., 03-04). It makes the discussion of the
aforementioned question difficult and ambiguous. It
seems to us that the notions of DQ and IQ are yet to
be defined adequately in a grounded way.
In this paper, we wish to argue that a set of
well-established theories, including Dretske’s semantic
theory of information (Dretske, 81), Devlin’s ‘infon’
theory (Devlin, 91), Stamper’s Organizational
Semiotics (Stamper, 97) and Floridi’s revised
standard definition of information (Floridi, 05),
would provide a novel insight into investigating DQ
and IQ and shed light on the interdependencies
among different quality indicators. The specific aim
of this paper is to explore how this problem might be
approached from another perspective, namely an
information-theoretic perspective, so that further
research might be pursued to develop a quality
framework with which to analyze and rearrange existing
quality categories, or derive new ones, for quality
assessment in practice.
This paper is organized as follows. We first review
existing studies about DQ and IQ and the limitations
of them that we notice particularly in Section 2. In
Section 3, we present the basic notions of the
theories referenced in this paper, and then introduce
an information-centric framework for information
systems and information flow. In Section 4, we
propose a data-info quality model for understanding
‘data’, ‘data quality’, ‘information quality’ and their
inter-relationships. And then we use this model to
analyze quality categories from some existing
approaches. Finally, in Section 5 we give
conclusions and indicate future work.
2. Literature Review
Existing studies have reached a consensus that DQ
and IQ are multi-dimensional concepts. Research
efforts have been made to derive quality indicators
for the development of different quality frameworks.
Wang and some other researchers (Lee et al., 02;
Wang & Strong, 96; etc.), following the methods
developed in marketing research for determining the
quality characteristics of products, present a
framework of information quality (IQ) from the
information consumer’s perspective. They group all
of their IQ dimensions into four IQ categories:
Intrinsic IQ, Contextual IQ, Representational IQ, and
Accessibility IQ. English gives three reasons for
measuring information quality and two definitions of
IQ (English, 99). One is its inherent quality, and the
other is its pragmatic quality. His approach to quality
includes three components, namely data definition
quality, data content quality, and data presentation
quality. DeLone and McLean’s review of the MIS
literature during the 1980’s reports twenty-three IQ
measures from nine previous studies (DeLone &
McLean, 92; DeLone & McLean, 03). D&M
(DeLone and McLean) say: “understandably, most
measures of information quality are from the
perspective of the user of this information and are
thus fairly subjective in character” (DeLone &
McLean, 92).
Furthermore, we find that there are different
classifications for existing approaches to DQ and IQ
in terms of different perspectives. We illustrate them
in Table 1.
In addition, Eppler (01) reviews twenty
information quality frameworks that appeared in the
literature from 1989 to 1999 in sixteen different
application contexts. Many of the approaches to quality
problems in his findings, however, are proposed
from a management, manufacturing, or technology
perspective. He claims that the majority of the
frameworks he studied are context-specific rather
than generic and widely applicable. He evaluates the
frameworks along two dimensions: analytic
criteria and pragmatic criteria.
Perspective | Classifications
Research Approaches (Price & Shanks, 04) | Empirical research; Practitioner-based approach; Theoretical approach; Literature-based approach; Integrated approach
Communities (Lee et al., 02) | Academics’ view; Practitioners’ view
Subject Domains (Burgess et al., 04) | Software Quality; Data Quality; Information Quality; Web Quality
Table 1: Classifications of existing approaches to
DQ/IQ
Through reviewing the literature, it seems to us that
there is a lack of overarching theoretical perspectives
or approaches for classifying existing quality
frameworks with respect to their quality indicators
delivered. Fundamental questions still remain as to
how quality should be defined and the specific
criteria that should be used to evaluate information
quality (Price & Shanks, 04). As mentioned in
Section 1, we argue that defining and
distinguishing DQ and IQ should be addressed as a
priority. However, Price and
Shanks (04) state that “due to the
lack of agreement on the precise definition of
information in the literature, we choose to restrict
our usage of the term information to informal
discussion and avoid its use in formal definitions”. It
is difficult to achieve an agreement on the definitions
of the terms Data and Information. To this end, we
attempt to use an information-theoretical perspective
for seeking a solution and providing a fresh insight
as it would seem necessary to construct a formal and
theoretically sound quality framework under which
we derive quality criteria and categories.
Existing studies normally consider data or
information as a type of product or output of an
information system and use the analogy between
data and products to develop measurement models of
DQ and IQ (Kahn, 97; Lee et al., 02; Price &
Shanks, 04; etc). In the literature, the definitions of
data quality and information quality are
distinguished depending on whether information is
considered to be a product or a service. However, the
analogical approach is still limited because data are
after all different from products (Liu & Chi, 02).
Theoretical approaches do appear in the literature.
Wand and Wang derive quality definitions by
anchoring them in ontological foundations, based
on the notion that the role of an information system
is to provide a representation of an application
domain as perceived by the user. For the information
system to function properly, both the representation
and interpretation transformations, involved in the
development and use of an information system, need
to be performed flawlessly (Wand & Wang, 96). This
results in a set of four intrinsic data quality
dimensions: complete, unambiguous, meaningful,
and correct. A semiotic information quality
framework (Price & Shanks, 04) is presented to
define information quality and corresponding quality
categories in terms of three semiotic levels, namely
syntactic, semantic, and pragmatic, defined by
Morris (38) and in terms of definitions for data,
information and meaning by Mingers (95). Hill (04)
proposes an information-theoretic model based upon
Shannon & Weaver’s information theory for the
purpose of considering customer information quality
in an organization. It provides a quantitative
assessment of proposed information quality
improvements. However, there seems to be a lack of
knowledge of, and attempts at, using an
information-theoretic perspective for investigating both
the terms ‘data’ and ‘information’ and DQ and IQ. For
example, Wand and Wang (96) derived four DQ
attributes, which are only a small sample of the
attributes used in assessing intrinsic DQ. This might be
due to the lack of an understanding of the subjective
and objective nature of the domain.
3. An Information-theoretic approach
to quality
Through reviewing the literature, we believe that an
information-theoretical underpinning for the terms
‘data’, ‘information’, DQ, and IQ should shed light
on the quest for a generic quality model for the
purpose of exploring interdependencies or
inter-relationships among quality indicators proposed in
various quality frameworks. In this paper, we present
an overall model for such a purpose that is based
upon a set of well-established theories.
Theories of Semantic Information
Information is still an ‘explicandum’ (Floridi, 05) in
the academic community today. Numerous attempts
have been made to define it, but many of them are
‘merry-go-round’ definitions (Stamper, 97). Shannon
and Weaver’s paper (49) over half a century ago
gives a mathematical model of communication, in
which they use probability to define the amount of
information that is caused by ‘reduction in
uncertainty’. This covers only the engineering aspect
of information creation and transmission. Dretske
(81) makes a profound paradigm shift from
engineering aspect to semantic aspect of
information. We take Dretske’s account of the
relationship between information and knowledge to
be an important insight, which we intend to use as a
way of incorporating epistemological considerations
into the theory of information.
Following Dretske, information will be taken as
created by or associated with a state of affairs among
a set of possibilities of a situation, the occurrence or
realization of which reduces the uncertainty of the
situation. We focus on claims of the form ‘a’s being
F carries the information that b is G’. From the point
of view of semiotics, which has been used in
developing a science for information systems, we
say that one signal, a’s being F, carries information
about a state of affairs, b is G. Relevant to this,
Dretske establishes the following definition of
information content:
Let k be prior knowledge about a specific
information source, r being F carries the information
that s is G if and only if the conditional probability
of s being G given that r is F is 1 (and less than 1
given k alone).
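The definition above can be written compactly in probabilistic notation. The symbols r, s, F, G and k are those of the text; conditioning on k in the first term as well follows Dretske's original wording and is an assumption on our part, since the sentence above leaves it implicit.

```latex
\[
r \text{ being } F \text{ carries the information that } s \text{ is } G
\iff
P\bigl(G(s) \mid F(r),\, k\bigr) = 1
\ \text{ and }\
P\bigl(G(s) \mid k\bigr) < 1 .
\]
```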
Following the above definition, we proposed our first
basic notion, called ‘data bears information’ (Hu &
Feng, 02; Xu & Feng, 02), which is re-illustrated
in Figure 1.
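Figure 1: Simplification on information level and
data level
[Diagram: at the information level, X and Y are related by ‘being simplified to’; at the data level, A and B are related by ‘being simplified to’; the data level bears the information level.]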
The main point relevant to this paper that this
diagram illustrates is that a representation/signal is
considered to represent/carry part of the information
existing in the real world. When the source of
information, namely that part of the real world, is
changed or simplified, a new representation/signal
could be used to replace the old one. For example, in
the database area, we could use entity-relationship
(ER) schemas to design a conceptual representation
for a university (a part of the real world). When the
information requirements of the university’s
information systems are modified, the
representation used to bear the information source,
namely the ER schemas in this case, would be
rearranged accordingly.
Information can be transmitted. A state of affairs,
say r1, is a particular case or an instantiation of a
general situation, say r. The reduction in uncertainty
at r due to the occurrence of r1 may be accounted for
by one or more events, say s1, s2,...,sn, that occur at
another general situation, say s. This gives rise to a
special kind of relationship - ‘informational
relationship’ (Dretske, 81) - between these two
general situations r and s. An informational
relationship captures a certain degree of dependency
between a state of affairs r1 of a general situation r
and what takes place in another general situation s.
This dependency can be demonstrated by the fact
that r1’s appearance alters the distribution of
probabilities of the various possibilities at s. The
dependency is a type of regularity concerning
different general situations based upon nomic
dependencies (Dretske, 81), logic, or norms, etc. in a
social setting. Due to this relationship, information
created at s is transmitted to r. We will call s the
‘information source’, and r ‘the bearer of
information about s’. Moreover, a state of affairs r1
at r can be seen as a signal that carries information
about s. A sign/signal carries information about
states of affairs in the world – what it signifies, even
though the sign/signal may never be actually
observed by anyone. Besides, if it is recorded, r1
becomes a piece of data. Thus data carry
information. In general, data in a database system are
a collection of recorded signals or events, which bear
certain information about the source within a process
of information transmission. Information is carried
by a sign and is objective and in analogue form.
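The dependency described above, in which the appearance of r1 alters the distribution of probabilities at s, can be illustrated with a small sketch. The joint distribution, the state names and the function names below are hypothetical and chosen only for illustration; prior knowledge k is assumed to be already reflected in the joint distribution. This is one possible reading of the informational relationship, not a procedure given by Dretske.

```python
from fractions import Fraction

# Hypothetical joint distribution P(source state, signal state).
# Source s: the age of a tree; bearer r: the ring count on its stump.
joint = {
    ("age_40", "rings_40"): Fraction(1, 2),
    ("age_80", "rings_80"): Fraction(1, 2),
}

def prior(joint, source_state):
    """P(source_state) before any signal is observed."""
    return sum(p for (s, _), p in joint.items() if s == source_state)

def posterior(joint, source_state, signal_state):
    """P(source_state | signal_state), obtained by direct conditioning."""
    p_signal = sum(p for (_, r), p in joint.items() if r == signal_state)
    if p_signal == 0:
        raise ValueError("signal state has zero probability")
    p_both = joint.get((source_state, signal_state), Fraction(0))
    return p_both / p_signal

def carries_information(joint, source_state, signal_state):
    """Dretske-style test: the posterior is 1 while the prior is below 1."""
    return (posterior(joint, source_state, signal_state) == 1
            and prior(joint, source_state) < 1)

print(prior(joint, "age_80"))                      # 1/2
print(posterior(joint, "age_80", "rings_80"))      # 1
print(carries_information(joint, "age_80", "rings_80"))  # True
```

If the ring count were only loosely tied to the age, the posterior would fall below 1 and, under the definition quoted earlier, the signal would not carry the information that the tree is 80 years old, even though it would still shift the probabilities at the source.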
Therefore we believe that it would be beneficial to
look into problems regarding data and information
from the perspectives adopted by various semantic
information theories, which might help reach the
root and reveal the essence of the problems. In order
to introduce a theoretically sound foundation for the
notions of information and our data-info quality
model, we start with the ontological assumption that
information is objective. In the beginning there was
information. The word came later (Dretske, 81). The
existence of information is independent of its
interpreters or receivers (agents). We notice that
Floridi defines four types of data: primary data;
metadata; operational data; derivative data (Floridi,
05). He revises the ‘standard definition of
information’ and adds a fourth condition to it. His
work will be discussed further in Section 4.
The S-B-R Framework
To facilitate further studies of information within the
context of information systems, that is, to gain
insight and to be able to explain various phenomena
in human communication, information creation and
transformation, and the development of information
systems, an overarching framework seems highly
desirable, even necessary. The various theories and
semiotics mentioned above can be seen, among other
things, to address the issue of information and
information flow in different ways and to emphasize
different aspects of it. We find that all these may be
incorporated within a framework, which would help
make sense of them, and make good use of them in
understanding information and information flow. We
believe that such a framework should be formulated
from the point of view of how information is created,
carried and finally received. Therefore we have
created a framework consisting of Information
Source, Information Bearer and Information
Receiver, and the links between them. We call such
an abstract model the ‘S-B-R Framework’
(illustrated in Figure 2).
We use a simple example to show how this
framework might work. As illustrated in Figure 2,
some information is created due to reduction in
uncertainty at an information source: for example,
the tree is 80 years old, out of many possibilities
such as its being 40 years old or 80 years old.
This information can be carried
by an information bearer due to an informational
relationship between the source and the bearer,
which may be based upon some ‘nomic
dependencies’ (Dretske, 81). An information bearer
provides an opportunity for an information receiver,
for example a human agent, to receive information
about the information source. By consulting an
information bearer, an information receiver can
acquire information (illustrated by the dotted line in
Figure 2) if the receiver is aware of and attuned to
some constraints (Devlin, 91), which formulate the
dependency and therefore the informational
relationship between the bearer and the source.
Information Source
Information must be created in the first place.
Following Dretske, any situation may be regarded as
a source of information as long as reduction in
uncertainty takes place. It could be a Universe of
Discourse, a particular situation (Devlin, 91), a
relation, an event with uncertain outcomes, and so on.
For example, the situation ‘choosing one from eight
employees to do an unpleasant job’ can be an
information source S.
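As a worked instance of ‘reduction in uncertainty’, and assuming (this is our assumption, not stated above) that each of the eight employees is equally likely to be chosen, Shannon's measure gives the amount of information generated at S when one employee is selected:

```latex
\[
I(S) = \log_2 n = \log_2 8 = 3 \text{ bits}, \qquad n = 8 \text{ equally likely outcomes}.
\]
```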
Figure 2: S-B-R Framework
[Diagram: Information Source (S), Information Bearer (B) and Information Receiver (R). B carries information about S and provides an opportunity for R to receive information about S; R accesses/interprets B to receive information about S. Example: a tree stump bears information such as the age of the tree when it was felled (i.e., ‘Tree is 80 years old’), the type of the tree, and the animals that live in the vicinity; a human being sees the stump and receives this information.]
In addition, we maintain that the literal meaning, if
any, of a bearer is independent of the information
that it bears. It is only accidental that the former is
(part) of the latter.
From the point of view of semiotics, an information
source S can be seen as the ‘sign object’ (Falkenberg,
98) that conforms to the definition of ‘sign’ given by
Charles Sanders Peirce. It is a thing that the sign
alludes or refers to.
Information Bearer
Information flow necessarily requires some
representation of information, which we call the
bearer. An Information Bearer can be a traffic
light or signal, a physical sign or an IT system.
Following Stamper (97), anything, say x, can
function as a sign if it can stand for something else,
say y, for the people in some community. Here, x is
an information bearer for y. With our S-B-R
framework, our ontological assumptions are that
information may or may not be carried by a bearer;
information can be conveyed only through a bearer;
and information is independent of whether one
receives it or can receive it or not. For example, if a
book were written in ancient Chinese, we would
consider that it carries certain information no matter
whether we can read it or not.
Considering the structure of a sign given by Peirce,
we agree that the ‘representamen’, which is a thing
serving as the ‘carrier’ of the sign, is independent of
its meaning (Falkenberg, 98). For example, an entity
in an Entity Relationship data schema might refer to
something that has no semantic correspondence with
the meaning of the name given to that entity.
Information Receiver
Following Devlin, we maintain that, to be able to
receive information borne by a bearer, an information
receiver must be aware of and actually invoke some
relevant ‘constraints’ (Devlin, 91). Different
receivers may receive different information from the
same bearer. The users of an information system are
information receivers. In a system integration
environment, an agent or a mediator can be an
information receiver, which may process information
further.
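The interplay of bearer, constraints and receiver described above can be sketched in a few lines. The class names, the tree-stump states and the two receivers are hypothetical illustrations; representing a constraint as a simple pair (bearer state, source fact) is a simplifying assumption and much coarser than Devlin's notion.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Constraint:
    """A regularity linking a bearer state to a state of affairs at the source."""
    bearer_state: str
    source_fact: str

@dataclass
class Bearer:
    """An information bearer: a set of observable states (signals or data)."""
    states: set

@dataclass
class Receiver:
    """A receiver attuned to some constraints (possibly none)."""
    name: str
    constraints: list = field(default_factory=list)

    def receive(self, bearer):
        """Return the source facts this receiver can obtain from the bearer."""
        return {c.source_fact for c in self.constraints
                if c.bearer_state in bearer.states}

stump = Bearer(states={"80 rings", "oak bark"})
forester = Receiver("forester", [
    Constraint("80 rings", "tree was 80 years old when felled"),
    Constraint("oak bark", "tree is an oak"),
])
hiker = Receiver("hiker")  # sees the same stump but invokes no constraints

print(sorted(forester.receive(stump)))
print(sorted(hiker.receive(stump)))  # [] because no constraint is invoked
```

The stump is the same in both calls; only the constraints invoked by the receiver differ, which is how the sketch reflects the statement that different receivers may receive different information from the same bearer.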
4. A Data-info quality model
Information Quality is critical in organizations
(Ballou et al., 03-04; DeLone & McLean, 03).
Early research efforts in Data Quality at MIT led to
the development of the Total Data Quality
Management (TDQM) cycle: Define, Measure,
Analyze, and Improve (Wang, 04). Tu & Wang
worked on ER extensions at the attribute level via
modeling data quality of the original schemas (Tu &
Wang, 93). Brodie (80) places the role of data quality
within the life-cycle framework with an emphasis on
database constraints. We believe that data quality has
a close relationship with the tasks of information
systems design and information quality has an
inter-relationship with data quality of an information
system.
In this section we put forward an observation,
namely, that it might be helpful to go back to the basics
of information systems development. A similar
perspective has been utilized by Wand and Wang
(96). In this paper, we use another perspective,
namely an information-theoretic perspective, to look at
Information Systems from the point of view of
information flow from the source of information to
the receiver of the information via some information
bearer, for the purpose of forming a data-info quality
model. This idea comes from a seemingly widely
accepted opinion that an information system is
designed to store data (including multi-media data)
and provide information to the information
consumers. It is an ‘information-bearing’ medium for
the purpose of serving business processes and
performance within an organization. Furthermore, it
appears that there is a lack of a practical, theoretically
grounded information-centric model in the literature
with which to explore and analyze an inevitable
phenomenon, namely information flow, in IS
development and IS evaluation, in particular DQ
evaluation and IQ evaluation. The motivation of our
work is to make a contribution at
the theoretical level through our model and to address
the relevant issues mentioned in Section 1.
Definitions
Many definitions of the terms ‘data quality’ and
‘information quality’ have been proposed in the
literature. Eppler lists seven definitions of
information quality from reviewing existing
literature on information quality published from
1989 to 1999 (Eppler, 01). It seems that many of
them are defined from a management, manufacturing,
or technology perspective. Some definitions of both
terms are ambiguous and overlap. We wish to
argue that this might be caused by the lack of a
sound theoretical foundation. The S-B-R framework
described above might fill in this gap by providing a
fresh insight into the problem and help define ‘data’
and ‘information’ for studying ‘data quality’ and
‘information quality’. Drawing on relevant literature
regarding data quality and information quality and
under the S-B-R framework, we generalize a
conceptual model for considering these two terms as
illustrated in Figure 3. We call it the ‘data-info
quality model’.
In the diagram, S normally contains three parts in the
context of Information Systems Development. They
are ‘original user requirements’, ‘user expectations’,
and ‘organizational needs’. The latter two change
due to the dynamic nature of organizational goals,
business strategies and performance. In the middle of
the diagram, B is an information system that is a
carrier or a mediator of information source S. It can
be an ERP system, a CRM system, and so on, in the
core of which lies a data engine, such as a database
or a data warehouse. R, the information receiver,
receives information, which is part of S, by
accessing and interpreting B.
Following the notion of ‘data bears information’
discussed in Section 3 and the objectives of data
quality and information quality evaluation in
context (Wang et al., 95), we look at the Information
Bearer (B) for assessing ‘data quality’. In
other words, the assessment of data quality is a task of
defining the quality of an information bearer. For
assessing the ‘information quality’ of an information
system, we examine the linkage between
Information Source (S) and Information Bearer (B),
and the linkage between Information Receiver (R)
and Information Bearer (B). In other words, to
assess information quality, we have to take the whole
chain from S through B to R into consideration. We
examine how well the information bearer represents
the information source, and how well the
information bearer supports the information receiver.
That is to say, we look at how good the bearer is at
conveying information to the receiver who would
use perception and other cognitive means for this
purpose. To enable such assessment, we present the
information-theoretic definitions of data, information,
data quality and information quality below.
Definition 1. Data is a set of values recorded in an
information system, which are collected from the
real world, generated from some pre-defined
procedures, indicating the nature of stored values, or
regarding usage of stored values themselves; or, a
model for the purpose of organizing, constraining,
representing those values in an information system
for its consumers.
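Figure 3: A Data-info Quality Model
[Diagram: the Information Source S comprises the original user requirement, new organizational needs and user expectations; the Information Bearer B is a database/data warehouse (including data value, structure, constraints, etc.); the Information Receiver R is an information consumer or a machine. B represents S and R accesses B. DQ is marked at B; IQ is marked across the linkages between S, B and R.]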
We define data here in a broad sense to cover values
and structures existing in an information system.
Following Floridi, the first type above can be of four
types (namely primary data, metadata, operational
data, and derivative data) according to their sources
and purposes. The second type has a direct impact on
the organization of data of the first type in terms of
requirements.
Definition 2. Information, carried by non-empty,
well-formed, meaningful, and truthful data (Floridi,
05), is a set of states of affairs, which are part of the
real world and independent of its receivers.
We define information in an objective way following
Dretske and Floridi. Floridi (05) revises the standard
definition of information by adding a fourth
condition that information must be truthful. As
explained by Floridi, ‘Truthful’ is used here as
synonymous for ‘true’, to mean ‘representing or
conveying true contents about the referred situation
or topic’.
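Definition 2 can be read as a conjunction of four conditions on the data that carry the information. The sketch below makes that reading explicit; the record type and function names are hypothetical, and how each condition would actually be judged for a given data item is deliberately left open.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataAssessment:
    """Hypothetical record of the four conditions for one data item."""
    non_empty: bool
    well_formed: bool
    meaningful: bool
    truthful: bool

def is_semantic_content(a):
    """Non-empty, well-formed and meaningful data, regardless of truth."""
    return a.non_empty and a.well_formed and a.meaningful

def is_information(a):
    """Semantic content that also satisfies the fourth, truthfulness, condition."""
    return is_semantic_content(a) and a.truthful

accurate_report = DataAssessment(True, True, True, True)
wrong_report = DataAssessment(True, True, True, False)

print(is_information(accurate_report))  # True
print(is_information(wrong_report))     # False
```

Keeping the first three conditions in a separate function reflects the point that data can satisfy them and still fail to carry information when the truthfulness condition does not hold.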
Definition 3. Data Quality is the intrinsic quality of
data (a type of information bearer) itself.
This definition reveals the objective characteristics
of the task of evaluating the quality of data, such as
representation and precision. It is in conformity
with the discussion of the ‘syntactic quality criteria’
reported in the work of Price and Shanks (04),
the ‘inherent information quality characteristics’
defined by English (99), and the ‘intrinsic’ and
‘contextual’ data quality categories proposed by Wang
and Strong (96).
Definition 4. Information Quality is the degree to
which the information is represented and to which
the information can be perceived and accessed.
The term ‘information quality’ is defined from two
directions in our data-info quality model. It is not a
one-array concept; rather it is the degree of some
relevant correspondence between the information
source and the information bearer, and between the
information bearer and the information receiver
respectively. From a semiotic perspective, our work
on this level is also in conformity with the ‘semantic
quality criteria’ and the ‘pragmatic quality criteria’
reported by Price and Shanks (04), the ‘pragmatic
information quality characteristics’ defined by
English (99), and the ‘representational’ and
‘accessibility’ data quality categories proposed by
Wang and Strong (96).
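The two directions of Definition 4 can be illustrated by scoring each linkage separately. In the sketch below, states of affairs are modelled as plain strings and each degree is a simple set-overlap ratio; both choices, and all function and variable names, are our illustrative assumptions rather than a measure proposed in this paper. The facts borne by the bearer are assumed to be truthfully represented.

```python
def representation_degree(source_facts, borne_facts):
    """Share of source facts that the bearer carries (S-B linkage)."""
    if not source_facts:
        raise ValueError("source_facts must not be empty")
    return len(source_facts & borne_facts) / len(source_facts)

def reception_degree(borne_facts, received_facts):
    """Share of borne facts that the receiver actually obtains (B-R linkage)."""
    if not borne_facts:
        return 0.0
    return len(borne_facts & received_facts) / len(borne_facts)

def information_quality(source_facts, borne_facts, received_facts):
    """Report both linkages separately rather than as a single number."""
    return {
        "S-B": representation_degree(source_facts, borne_facts),
        "B-R": reception_degree(borne_facts, received_facts),
    }

# Hypothetical stock-information example.
source = {"price", "volume", "date"}   # what the organization needs
borne = {"price", "volume"}            # what the database records
received = {"price"}                   # what this user can interpret

print(information_quality(source, borne, received))
```

Returning two values instead of one keeps the sketch aligned with the view that information quality concerns two correspondences, one between source and bearer and one between bearer and receiver.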
5. Data Quality vs. Information Quality
According to Floridi (05), nonempty, well-formed
and meaningful data may be of poor quality. Data
that are incorrect, imprecise or inaccurate are still
data and they are often recoverable, but, if they are
not truthful, they can only constitute misinformation,
which is not information at all. Following Floridi
and considering our data-info quality model, we
believe that high data quality is a necessary
condition for seeking high information quality
within an information system. It is not, however, a
sufficient condition. For example, a well-organized
database using Chinese characters that has recorded
accurate and timely stock information does not have
high information quality if its users include some
non-Chinese speakers, even though the system has
high data quality. As another example, suppose a
decision-maker is provided with a stock report
containing a set of complete, readable, and
well-formatted data. He/she will not obtain any
information if the data are untrue or do not
accurately reflect the real situation. Therefore, high
information quality should be based upon high data
quality, and the data must be appropriately presented
and accessible to the information consumer.
Based upon our above thinking and definitions
regarding data quality and info quality, we can
rearrange existing quality dimensions and criteria in
the literature into a new framework, as shown in
Table 2 and Table 3 respectively. It is intuitively
organized based upon our experience and
corresponding description of the selected dimensions
from the literature.
Source | Data Quality | Information Quality
Price and Shanks (04) | Syntactic quality | Semantic quality, pragmatic quality
English (99) | Inherent characteristics | Pragmatic characteristics
Wang et al. (96) | Intrinsic, contextual | Representational, accessibility
Dedeke (00) | Ergonomic, accessibility | Representation
Table 2: Some existing quality dimensions
rearranged within a data-info quality framework
Data Quality | Information Quality
Accuracy, format, timeliness, precision, amount of data, etc. | Relevancy, accessibility, usefulness, readability, completeness, consistency, reliability, importance, truthfulness, etc.
Table 3: Some existing quality criteria rearranged
within a data-info quality framework
Interdependencies among quality dimensions and
criteria can be further explored and studied from the
point of view of the inter-relationship between
data quality and information quality. The quality
criteria for the former will clearly have an impact on the
latter. Distinguishing information quality from data
quality will help IS professionals and organizations
derive required and appropriate quality criteria for
the task in hand. Further analysis and validation of the
aforementioned issues will be reported in our future
publications.
Objectivity vs. Subjectivity
In the relevant literature, the notion of data or
information quality depends on the actual use of
data. Both are normally investigated from the
viewpoint of information consumers. In the work
of Wand and Wang (96), a design-oriented approach
is proposed to define data quality based upon a
concept called ‘possible data deficiencies’ in a
system context. Ballou and Pazer’s study focuses
primarily on intrinsic dimensions that can be
measured objectively (Ballou & Pazer, 85; Ballou &
Pazer, 95). However, it would appear that the issue
of subjectivity versus objectivity involved in
data and information quality evaluation in
information systems is hardly addressed
adequately. We believe that addressing this issue is
important: not only can an insight into the problem be
gained, but it should also benefit the selection of
research methods for the development of a
methodology for assessing data quality and
information quality.
Our preliminary thinking about this philosophical
issue is that it can be looked at from the ‘S-B-R’
perspective. In Figure 3, we have shown that
information quality is concerned with two separate
linkages: one between S and B, and one between B
and R. The first linkage embodies the objective
aspect of the problem, following our ontological
assumption on information. It is design-oriented or
system-oriented. Therefore, theoretical techniques
(e.g., SQL query design and schema transformation)
and quantitative research methods will contribute to
detecting the problems and providing solutions to
them. The second linkage should be looked at
within a social setting, and is therefore predominantly
inter-subjective or subjective (Mingers, 95). For
example, different groups of information consumers
may have different qualifications and different
knowledge backgrounds, and therefore may receive
different information from accessing the same
information bearer. Qualitative research methods
may contribute to identifying problems, reaching
conclusions and obtaining solutions. Much more
work should be carried out along this avenue.
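The contrast between the two linkages can be illustrated with a small sketch. It assumes, purely for illustration, that the source and the bearer are both represented as dictionaries of facts and that a group of consumers is characterized by the set of fields it can interpret. The function names `sb_accuracy` and `br_received` and the sample stock data are hypothetical and are not part of the model.

```python
def sb_accuracy(source: dict, bearer: dict) -> float:
    """S-B linkage: share of source facts that the bearer records correctly.

    This is an objective, quantitative measure; it does not depend on
    who reads the bearer.
    """
    if not source:
        return 1.0
    matches = sum(1 for key, value in source.items() if bearer.get(key) == value)
    return matches / len(source)


def br_received(bearer: dict, understood_fields: set) -> dict:
    """B-R linkage: the part of the bearer that a given group can interpret.

    The result depends on the receiver, so the same bearer can yield
    different information for different groups.
    """
    return {key: value for key, value in bearer.items() if key in understood_fields}


# Hypothetical data: the real situation (S) and what the system stores (B).
source = {"price": 10.5, "volume": 2000, "pe_ratio": 14.2}
bearer = {"price": 10.5, "volume": 2000, "pe_ratio": 41.2}  # one transcription error

# Hypothetical receiver groups with different knowledge backgrounds.
analysts = {"price", "volume", "pe_ratio"}
novices = {"price"}

accuracy = sb_accuracy(source, bearer)             # same value for every receiver
analyst_view = br_received(bearer, analysts)       # all three fields
novice_view = br_received(bearer, novices)         # only the price
```

The first function could be evaluated by a query over the stored data alone, whereas the input to the second (what each group understands) would have to be established by studying the consumers themselves, for instance through interviews.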
6. Summary and Future Work
In this paper, we have examined some fundamental
issues concerning data and information quality
evaluation from an information-theoretic perspective
that is informed by a set of well-established theories.
We have proposed a data-info quality model based
upon an information-centric framework to provide a
rigorous theoretical foundation for (1) defining and
distinguishing the terms ‘data quality’ and
‘information quality’; (2) discussing the
inter-relationships between the two terms; and
(3) studying the subjective and objective
characteristics of data quality and information
quality. A more generic
framework for data and information quality and a set
of quality categories and criteria with their
interdependencies articulated will be reported in
future publications.
This model is being validated through a two-stage
survey. First, a series of interviews will be organized
with selected organizations, enterprises, and
institutions in the UK and China. The goal is to
elaborate the model using a qualitative research
method and to generate a data-info quality
framework. Then, a questionnaire will be used to test
the proposed quality framework in the real world and
to categorize quality criteria.
References
Ballou, D. P. and H. L. Pazer, “Modeling Data and
Process Quality in Multi-input, Multi-output
Information Systems”, Management Science, 31(2)
1985, pp. 150-162.
Ballou, D. P. and H. L. Pazer, “Designing
Information Systems to Optimize the
Accuracy-Timeliness Tradeoff”, Information Systems
Research, 6(1) 1995, pp. 51-72.
Ballou, D., Madnick, S., and Wang, R. Y., “Special
Section: Assuring Information Quality”, Journal of
Management Information Systems, Winter 2003-4,
Vol. 20, No. 3, pp. 9-11.
Brodie, M. L., “Data quality information systems,
information, and management,” vol. 3, pp. 245-258,
1980.
Burgess, M., Fiddian, N. J., and Gray, W., “Quality
measures and the information consumer”, ICIQ 2004.
Dedeke, A. “A Conceptual Framework for
Developing Quality Measures for Information
Systems”, Proceedings of the 2000 Conference on
Information Quality (IQ-2000), Cambridge, MA,
USA, 2000, pp. 126-128.
DeLone, W. H. and McLean, E. R. “Information
Systems Success: The Quest for the Dependent
Variable”, Information Systems Research, Volume 3,
No. 1, March 1992, pp. 60-95.
DeLone, W. H. and McLean, E. R. “The DeLone and
McLean model of information systems success: A
ten-year update”, Journal of Management
Information Systems, 19(4), 2003, pp. 9-30.
Devlin, K. Logic and Information. Cambridge
University Press, Cambridge, 1991.
Dretske, F. I. Knowledge and the Flow of
Information, Basil Blackwell, Oxford, 1981.
English, L. P., Improving Data Warehouse and
Business Information Quality. Wiley & Sons, New
York, 1999.
Eppler, M. J., “The concept of information quality:
an interdisciplinary evaluation of recent information
quality frameworks”, Studies in Communication
Sciences, 1 (2001) pp. 167-182.
Falkenberg, D. E., Hesse, W., Stamper, R., et al. A
Framework of Information Systems Concepts – The
FRISCO Report (web edition), IFIP, 1998.
Floridi, L., “Is Semantic Information Meaningful
Data?”, Philosophy and Phenomenological
Research, Vol. LXX, No. 2, March 2005.
Hill, G., “An information-theoretic model of
customer information quality”, Proc. IFIP Int’l
Conf. on Decision Support Systems, Italy, 2004.
Hu, W. and Feng, J. 2002. “Some considerations for
a semantic analysis of conceptual data schemata”, In
Systems Theory and Practice in the Knowledge Age,
(E. Ragsdell et al.), Kluwer Academic/Plenum
Publishers. New York. ISBN 0-306-47247-3.
Kahn, B. K., Strong, D. M. and Wang, R. Y., “A
Model for Delivering Quality Information as Product
and Service”, in Conference on Information Quality,
Cambridge, MA, pp. 80-94, 1997.
Lee, Y. W., Strong, D., Kahn, B., and Wang, R.,
“AIMQ: a methodology for information quality
assessment”, Information and Management, 40(2)
pp. 133-146, 2002.
Liu, L. and Chi, L., “Evolutional Data Quality: A
Theory-specific View”, ICIQ 2002.
Mingers, J. “Information and meaning: foundations
for an intersubjective account”, Information Systems
Journal, 1995; 5:285-306.
Morris, C., “Foundations of the Theory of Signs”, in
International Encyclopedia of Unified Science, vol.1,
University of Chicago Press, London, 1938.
Price, R.J., Shanks, G.A., “A semiotic information
quality framework”, In R. Meredith, G. Shanks, D.
Arnott and S. Carlsson (eds.) Proceedings of the
2004 IFIP International Conference on Decision
Support Systems (DSS2004): Decision Support in an
Uncertain and Complex World, Prato, Italy, 1-3 July:
658-672.
Redman, T., Data Quality: The Field Guide, New
Jersey: Digital Press, 2001.
Shannon, C. E. and Weaver, W. The mathematical
theory of communication. Urbana: University of
Illinois, 1949.
Stamper, R. “Organisational Semiotics”, In
Information Systems: An Emerging Discipline?,
Mingers, J and Stowell, F. ed. The McGraw-Hill
Companies, London, 1997.
Tu, S.Y. and Wang, R. Y., “Modeling Data Quality
and Context Through Extension of the ER Model”,
Massachusetts Institute of Technology (MIT) Sloan
School of Management, Cambridge, MA,
TDQM-93-13, 1993.
Wand, Y. and Wang, R. Y., “Anchoring Data
Quality Dimensions in Ontological Foundations”,
Communications of the ACM, 39(11): 86-95, 1996
Wang, R.Y., Kon, H.B., and Madnick, S.E., “Data
quality requirements analysis and modeling”, Proc.
Ninth Int’l Conf. on Data Engineering, pp. 670-677,
Vienna, 1993.
Wang, R. Y., Storey, V. C., and Firth, C. P., 1995,
“A Framework for Analysis of Data Quality
Research”, IEEE Transactions on Knowledge and
Data Engineering, Vol. 7, No. 4, 1995.
Wang, R.Y. and Strong. D.M. (1996) “Beyond
Accuracy: What Data Quality Means to Data
Consumers”, Journal of Management Information
Systems, 12(4): 5-34.
Wang, R. Y., “Data Quality: Theory in Practice”,
EPA 23rd Annual Conference, April 2004.
Xu, H. and Feng, J., “Towards a Definition of the
‘Information Bearing Capability’ of a Conceptual
Data Schema”, In Systems Theory and Practice in
the Knowledge Age, (E. Ragsdell et al.), Kluwer
Academic/Plenum Publishers. New York.
ISBN 0-306-47247-3, 2002.