A Process View of Data Quality

Henry Kon, Jacob Lee, Richard Wang

WP #3766, March 1993
PROFIT #93-05

Productivity From Information Technology "PROFIT" Research Initiative
Sloan School of Management
Massachusetts Institute of Technology
Cambridge, MA 02139 USA
(617) 253-8584
Fax: (617) 258-7579

Copyright Massachusetts Institute of Technology 1993. The research described herein has been supported (in whole or in part) by the Productivity From Information Technology (PROFIT) Research Initiative at MIT. This copy is for the exclusive use of PROFIT sponsor firms.

Productivity From Information Technology (PROFIT)

The Productivity From Information Technology (PROFIT) Initiative was established on October 23, 1992 by MIT President Charles Vest and Provost Mark Wrighton "to study the use of information technology in both the private and public sectors and to enhance productivity in areas ranging from finance to transportation, and from manufacturing to telecommunications." At the time of its inception, PROFIT took over the Composite Information Systems Laboratory and Handwritten Character Recognition Laboratory. These two laboratories are now involved in research related to context mediation and imaging respectively.

In addition, PROFIT has undertaken joint efforts with a number of research centers, laboratories, and programs at MIT, and the results of these efforts are documented in Discussion Papers published by PROFIT and/or the collaborating MIT entity.

Correspondence can be addressed to:
The "PROFIT" Initiative
Room E53-310, MIT
50 Memorial Drive
Cambridge, MA 02142-1247
Tel: (617) 253-8584
Fax: (617) 258-7579
E-Mail: profit@mit.edu

ABSTRACT

We posit that the term data quality, though used in a variety of research and practitioner contexts, has been inadequately conceptualized and defined. To improve data quality, we must bound and define the concept of data quality.

In the past, researchers have tended to take a product-oriented view of data quality. Though necessary, this view is insufficient for three reasons. First, data quality defects, in general, are difficult to detect by simple inspection of the data product. Second, definitions of data quality dimensions and defects, while useful intuitively, tend to be ambiguous and interdependent. Third, in line with a cornerstone of TQM philosophy, emphasis should be placed on process management to improve product quality.

The objective of this paper is to characterize the concept of data quality from a process perspective. A formal process model of an information system (IS) is developed which offers precise process constructs for characterizing data quality. With these constructs, we rigorously define the key dimensions of data quality. The analysis also provides a framework for examining the causes of data quality problems. Finally, facilitated by the exactness of the model, an analysis is presented of the interdependencies among the various data quality dimensions.

1. INTRODUCTION

Total Quality Management techniques aim to drive defect levels in business processes to a minimum based on continuous improvement and customer focus [Ishikawa, 1985]. The data that supports these processes, however, has been measured defective in the 10-75% range for a variety of applications [Johnson, Leitch, & Neter, 1981; Laudon, 1986; Morey, 1982; O'Neill & Vizine-Goetz, 1988]. There is clearly a need for a coordinated, concerted and continual effort to address the data quality issue. We postulate that we can neither measure nor manage what we cannot define, and this is the motivation for a precise explication of the concept of data quality.

In order to relate our work to past research, we consider data quality broadly to involve the satisfaction of information systems users with the information they receive. Indeed, quality in its most general sense is operationalized based on conformance to customer requirements [Juran & Gryna, 1988 p.2.2]. Data quality, when interpreted as IS success, has been variously operationalized into scales such as the "perceived usefulness" and "perceived ease of use" of an IS [Davis, 1989; Moore & Benbasat, 1991]. Included in the scales are such items as "makes job easier" and "clear and understandable". These scales, however, do not define (nor are intended to define) objective and measurable characteristics of the data. Such objective definitions would be required to facilitate the management of data quality. Measuring the quality of a data set requires specific characteristics of the data as a basis for measurement.

Other researchers have developed scales for user information satisfaction, which do include specific technical characteristics of the data such as accuracy and timeliness. At best, however, only intuitive definitions of these terms are provided. For example, a research subject may be asked to rate the importance of information accuracy, but the term itself is left undefined [Baroudi & Orlikowski, 1988]. Alternatively, intuitive definitions may be provided, such as: accuracy - the correctness of the output information [Bailey & Pearson, 1983 p.541]. Although user information satisfaction research deals with data quality issues, its focus is not the objective assessment of data quality, but rather in understanding what overall characteristics of a system and data are important to users. It is a means of measuring characteristics of the data users and not of the data itself. The measures themselves are subjective and a general foundation for data quality definition is not developed.

Other researchers working on data quality have focused more on the data product and less so on the user or organizational context [Morey, 1982; Ballou & Pazer, 1985; Huh, et al., 1990]. Data quality is typically broken down into various dimensions such as accuracy, completeness, consistency, and timeliness. The problem with these approaches is that, although a data-centered view is taken, the definition and scope of the dimensions are unsatisfying. For example, accuracy is defined as "a measure of agreement with an identified source" [Huh, et al., 1990 p.560], or completeness as "all values for a certain variable are recorded" [Ballou & Pazer, 1985 p.153]. While these definitions may be intuitive, they are imprecise and highly aggregated, interdependencies exist among them, and no rigorous theoretical basis is offered for them.

In summary, a product view of data quality focuses on characteristics of the data product or on satisfaction of the information systems users with the information system. Such a view is intuitive, as we tend to think of product quality in terms of product characteristics and user requirements. However, we believe that the product-oriented approach to defining data quality is inherently lacking in three respects:

(1) Data quality defects, unlike those of many standard manufactured products, are difficult to observe or detect. In product manufacturing, each manufactured item may be compared to a standard product specification to determine conformance. In data manufacturing, each piece of non-redundant data is unique. There is no standard product. How do we know if the data represents the 'true' value? A premise of this paper is that the way to detect defects is to focus on the process of data creation and manipulation. Thus process quality becomes a surrogate for data quality.

(2) Product-oriented efforts to derive a minimal and orthogonal set of data quality dimensions show unconvincing results. Objections such as "Why this set of dimensions?" and "Are there any others?" cannot be answered satisfactorily.
Various quality dimensions are interdependent, having common causes. For example, timeliness and completeness are related, as data which is generated too late, or is slow in arriving, results both in untimely data elements and incomplete data sets. Simple definitions, while useful for general understanding, do not facilitate the clear communication needed for quality management.

(3) Finally, as per the quality experts Deming [Deming, 1982] and Taguchi [Taguchi, 1979], causes of quality problems in general are inherent in the production process. A key TQM philosophy is that causes of defects in the production process should be detected and eliminated. Thus, in order to analyze and improve data quality, we must look at production process quality, and not simply at product quality.

We have a poor grounding as to the concept of data quality. We believe that a process-oriented approach will be more fruitful. A process is the set of inter-related activities that transforms input data into output data. This paper defines and formalizes the data production process using a source-receiver model of an information system (IS). We posit that the constructs developed are fundamental and sufficiently precise to serve as a platform for characterizing the concept of data quality. Thus, data quality problems can be defined in terms of component defects in the data production process.

Paper Objective and Strategy

The objectives of this paper are: (1) to develop a set of formal process constructs which constitute a model of an IS as a data delivery vehicle, (2) the identification and analysis of orthogonal components of process non-conformance, and (3) the use of these constructs to define five (non-orthogonal) dimensions of data quality and to provide an analysis of their interdependencies.

The contribution of the paper is the development of a theoretically grounded process model of an information system and conceptualization of data quality. The model developed draws from the work of Mario Bunge in Semantics [Bunge, 1974] and Ontology [Bunge, 1977]. The process approach has implications for systems design and operation. From a product view alone, even a robust classification of data defects does not suggest how system design and operation might be enhanced to reduce problems.

We adopt a three step strategy. First, we define an ideal IS as one that delivers data to the end user via a data production process that conforms to user requirements. The data production process maps some aspects of interest in the world into a symbolic representation for use by a data user. We identify necessary conditions for this mapping to conform to user requirements. Second, we decompose the production process into orthogonal components. Thirdly, these process components form the basis for the characterization of data quality dimensions and defects through conformance and non-conformance to user requirements.

In Section 2 we present an overall description and high level formalization of the model. In Section 3 we present the data production process in greater detail, including the detailed process constructs. In Section 4 we develop the definitions of data quality dimensions and analyze their interdependencies. In Section 5 we conclude the paper and discuss its implications.

2. THE IDEAL INFORMATION SYSTEM

Wand and Weber postulated generally that "information systems are built to provide information that otherwise would have required the effort of observing or predicting some reality with which we are concerned. From this postulate an information system is a representation of some perceived reality. It is a human-created representation of a real world system as perceived by someone." [Wand & Weber, 1990]. We adopt this point of view, stating the following premise.

Premise 1: An IS is a means of providing a mapping from aspects of the world into a symbolic representation of those aspects via some human perception.

We posit that data product defects are due to data production process defects. As per the quality literature in general, we operationalize quality as conformance to user requirements. We thus operationalize data quality as conformance of the data production process to user requirements. In this respect, data quality is a binary condition [Pall, 1987]. We do not aim here to characterize levels of data quality. Either the process conforms (has quality) or does not conform (does not have quality) with respect to user requirements. This leads to the next premise:

Premise 2: Data quality is conformance of the data production process to a single user requirement.

Figure 1 illustrates our model. We see from the figure that a data originator observes the world (Arrow A), on behalf of a data user. This observation (perception) is the measurement of the world of interest by the originator and results in a perceived reality as construed by the originator. Next, this perception is encoded (data entry) into the IS via some data input device (Arrow B). We assume:

Premise 3: The originator records states of the world as opposed to changes in states.

For example, the originator would record inventory levels as opposed to changes in inventory. This encoding results in the symbolic representation of the originator's perception as data. Lastly, the user decodes (interprets) the symbols (Arrow C).

[Figure 1: An information system model. Labels: world, perceived reality, input device, storage, display device, user; Arrows A, B and C; data production shown along a time axis.]

We proceed to formalize our notion of an ideal IS by first stating some ontological principles [Bunge, 1977]. The world is made up of things. The characteristics of things of the world are construed via attributes which are defined and ascribed to the things by the observer. We measure the attributes of things with respect to reference frames.
A reference frame may include elements such as the time of measurement and the condition of the observer, e.g., the observer's velocity when measuring the speed of another object. The reference frame, in general, would provide information about any variable factors believed to affect the outcome value of the measurement.

We are interested in a set of r things in the world $\{x_i \mid i = 1, \ldots, r\}$. For each thing $x_i$ we are interested in a set of q attributes $\{a_k \mid k = 1, 2, \ldots, q\}$. For each attribute there may be one or more reference frames from which to measure it. The reference frames chosen up to time $t_1$ form a set $\{m_j \mid j = 1, 2, \ldots, n\}$. Each triple $\omega_{ijk} = \langle x_i, m_j, a_k \rangle$ is an aspect of the world that is of interest and has an associated value $v_{ijk}$. This value is obtained via some form of measurement. Knowledge of $v_{ijk}$ and its associated triple $\langle x_i, m_j, a_k \rangle$ constitutes a unit component of the data originator's perceived reality. A more formal discussion of the concept of perceived reality is given in Section 3.1.

For a thing $x_i$ and reference frame $m_j$, let

$A_{ij} = \{\langle x_i, m_j, a_k \rangle \mid k = 1, 2, \ldots, q\}$ and $H_i = \bigcup_{j=1}^{n} A_{ij}$

where $H_i$ is the history of the thing $x_i$ over all the reference frames of interest. Then

$R = \bigcup_{i=1}^{r} H_i$

is the set of triples that are of interest, with reference frames up to $t_1$.

The knowledge of R, as perceived by the data originator, is then encoded into a symbolic form. This may be a relational table for example. In this case, each triple $\omega_{ijk}$ corresponds to a cell and each data element within the cell represents a value $v_{ijk}$. Specifically, each cell and its data element constitutes an atomic well-formed formula (wff). The encoding function, decoding function and wffs form part of a symbolic language which will be formally presented in the next section. These wffs are then interpreted by the end-user via a decoding function (Arrow C).

The data production process can therefore be represented by the binary relation g from the set R to the set D, where D is the set of atomic wffs delivered as output to a data consumer as of time $t_2$. Recall that R is the set of aspects of the world of interest, where we are interested in reference frames up to $t_1$. g expresses the actual production process: how aspects of the world are mapped from R to D. The temporal aspects of the production process are inherent in the specification of R and D, as they are defined for specific times $t_1$ and $t_2$.

Definition: Data quality is defined as the conformance of g to user requirements.

We refer to g as $g_p$ when g conforms to user requirements. Each triple in R must be uniquely represented as data in D. This leads to the proposition:

Proposition: For quality data, $g_p$ must be a bijective function (one-to-one and onto) that maps aspects of things to the appropriate wffs for the user's symbolic language. Formally,

$g_p: R \to D$

The precise form of $g_p$ must be determined by user requirements because various symbolic representations (e.g. relational or hierarchical) may satisfy the above requirement.

3. THE DATA PRODUCTION PROCESS

In the last section we represented the production of data as the relation g, a high level abstraction. In this section we delve into greater engineering detail, specifying the production process in terms of its specific steps.

Our treatment of an IS is similar in methodology and spirit to Wand and Weber, e.g., [Wand & Weber, 1988], in the sense that it is an IS modeling formalism based on Bunge [Bunge, 1974; Bunge, 1977]. However, where Wand and Weber focus on the design and analysis of systems, we focus on data production itself, including measurement, encoding, and decoding of data between the data originator and the data user. In the following sub-sections we decompose the production process g.

3.1. Measurement and Perceived Reality
Each triple $\omega_{ijk} \in R$ is assigned a value $v_{ijk} \in V$ by the data originator via a function:

$\chi: R \to V$

This is analogous to the state function proposed by Bunge [Bunge, 1977 p.125]. However, Bunge's state function iterates over a domain which is a set of reference frames for an attribute of a particular thing. The function $\chi$, on the other hand, iterates over R, which is preferable for the current presentation.

The data originator's knowledge of $\omega_{ijk}$ consists not only of the value $v_{ijk}$ but also the associated attribute $a_k$ of the thing $x_i$ as perceived via a reference frame $m_j$. More appropriately, we define the measurement function (Arrow A in Fig. 1) as:

$F: R \to S$

where S is a set of atomic statements which represents the originator's knowledge of R. For example, we may be interested in the attribute "temperature" in the city of "Rome", which is the particular thing of interest. The temperature in Rome is to be measured by an alcohol thermometer daily. Thus, the date becomes the reference frame and the mode of measurement, the thermometer, is represented by the function F. Let the value of the temperature on a particular day t be 28°F. Then

F(Rome, t, temperature) = s

where s is the atomic statement "The temperature in Rome on day t as measured by an alcohol thermometer is 28°F."

The particular form of the measurement represented by F depends on the particular attribute in the argument of F. In the temperature example, F represents measurement via a thermometer. If "weight" were the attribute of interest, then F would represent a weighing machine. We note also that different measurement procedures are appropriate for different attributes.

At this point, the difference between an attribute and a reference frame needs to be highlighted. What is included as part of an attribute and what is included in a reference frame? This distinction depends on what is fixed and what is variable across instances of measurement. If, for example, we are interested in the "temperature at 8:00am on Valentine's Day 1993" at various locations within Italy, then "temperature at 8:00am on Valentine's Day 1993" corresponds to the attribute, with M corresponding to the set of various locations. If, on the other hand, we are interested in the "temperature in Rome" at various times, then "temperature in Rome" corresponds to the attribute while M corresponds to the set of various times. In the first case, the date and time are fixed for the domain of interest (and therefore incorporated as part of the attribute definition) while locations vary (and are therefore incorporated as reference frames). In the second, the location is fixed while the times of measurement vary. Note that S is the set of statements that forms the originator's knowledge or perceived reality.

We provide below a more detailed example to be used for the remainder of the paper.

Example: We use a frozen-foods company's inventory system as the domain of interest. The things of interest will be n cold-storage warehouses $x_i$, $1 \le i \le n$, owned by this company. A given warehouse will have 3 attributes of interest: "Warehouse id", "number of frozen dinners at beginning of day" as measured by procedure p1, and "temperature at beginning of day" as measured by procedure p2, which we call ID ($a_1$), QTY ($a_2$) and TEMP ($a_3$) respectively. QTY is equal to the number of dinners physically on the shelves. The reference frame is the date t on which the measurements are taken. The values $v_1$ (3-digit string), $v_2$ (in units of one) and $v_3$ (in °F) correspond to ID, QTY and TEMP respectively for day t. The corresponding statements are given by

$F(x_i, a_1, t) = s_1$
$F(x_i, a_2, t) = s_2$
$F(x_i, a_3, t) = s_3$

where
$s_1$ = "the ID of warehouse i is $v_1$ at the beginning of day t"
$s_2$ = "the QTY of frozen dinners in warehouse i at the beginning of day t as measured by procedure p1 is $v_2$" and
$s_3$ = "TEMP in warehouse i at the beginning of day t as measured by procedure p2 is $v_3$"

In summary, thus far we have defined things with their attributes and reference frames, which are instantiated via measurement into atomic statements. In the next section we develop a formalism for the encoding of statements into data, and the interpretation of the data by the user.

3.2. Symbolic Language

The knowledge of the data originator is encoded as wffs of a symbolic language defined by Bunge [Bunge, 1974]. We introduce the encoding function E, which is a component of a symbolic language. E represents the mapping used by the originator to map statements in S to corresponding wffs in D.

$E: S \to D$

The decoding function $E^{-1}$ (Arrow C in Figure 1) is used by the user to interpret wffs back to the statements they signify.

We note that an atomic wff not only contains data that represents the value $v_{ijk}$, but also additional information about the reference frame and attribute definitions, as well as how they were attained. Often, when the term "data quality" is used, it is not clear whether it refers to the quality of the data elements alone, or whether it refers to the quality of this additional information as well. Consider the display of a relational database table on a screen. The data in the cells represent attribute values. In addition to the data values, the user may need column names, table names, and other information to understand the data. Each of these may have some type of quality associated with them. In modern databases, this extra information (e.g., besides the data elements and their column names) is referred to as metadata [McCarthy, 1984], which may be implemented via schema information, data dictionaries, semantically rich data model constructs, or systems documentation, depending on the boundary of the concept of the data and IS. Metadata may include both information regarding the intrinsic meaning of the data [Collett, Huhns, & Shen, 1991; Siegel & Madnick, 1991], as well as indicators of the quality of the data value as an electronic artifact [Wang & Madnick, 1990; Wang, Kon, & Madnick, 1993].

We acknowledge the possibility of inaccuracies in metadata as well as in data. For example, the time of measurement may have been recorded wrongly. Thus the possible need for qualification of metadata as well. This recursiveness in metadata specification is a universal problem in metadata modeling [Siegel & Madnick, 1991; Wang, Reddy, & Kon, 1993]. We assume only one level of metadata.

We can now more completely characterize g as a composite function. It consists of the measurement function F and the encoding function E. This is reflected in Fig. 2.

We now provide an example specification based on the example started in Section 3.1. Items (1), (2) and (3) define R.

(1) THINGS: The things of interest are corporate warehouses.
(2) ATTRIBUTES: ID, QTY and TEMP are the attributes of interest, using units of single dinners and °F for QTY and TEMP respectively.
(3) REFERENCE FRAME: Information is to be generated once at the beginning of each day from January 1, 1993.
(4) MEASUREMENT: procedures p1 and p2 for QTY and TEMP respectively.
(5) LANGUAGE: Encoding should be done using the following notation: 4 digits, left justified. Decoding is a reverse mapping of the encoding through which the user interprets the numerical data.
(6) INSTANTIATION DELAYS: Values are to be available on the IS display within three hours of measurement.
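The constructs of Sections 2 and 3 can be made concrete with a small sketch. The Python below is an illustration under stated assumptions, not part of the model itself: the two warehouses, the two dates, the toy value function `chi`, the tuple representation of statements and wffs, and the reading of "4 digits, left justified" as padding to four characters are all hypothetical choices. It builds R as the set of triples, composes g from F and E, and tests the bijectivity required by the Proposition, including a defective process `g_lossy` (also hypothetical) that drops the reference frame.

```python
from itertools import product

# Hypothetical toy instance of the warehouse example; every name is illustrative.
THINGS = ["W1", "W2"]                    # things x_i (warehouses)
FRAMES = ["1993-01-01", "1993-01-02"]    # reference frames m_j (measurement dates)
ATTRIBUTES = ["ID", "QTY", "TEMP"]       # attributes a_k

# R: the set of (thing, frame, attribute) triples of interest.
R = set(product(THINGS, FRAMES, ATTRIBUTES))


def chi(triple):
    """Value function chi: R -> V, here an assumed state of the world."""
    thing, day, attr = triple
    i = THINGS.index(thing) + 1
    if attr == "ID":
        return f"{i:03d}"                    # 3-digit string
    if attr == "QTY":
        return 100 * i + FRAMES.index(day)   # dinners on the shelves
    return -5 * i                            # temperature in degrees F


def F(triple):
    """Measurement F: R -> S. A statement keeps the triple and adds its value."""
    return triple + (chi(triple),)


def E(statement):
    """Encoding E: S -> D. The value becomes a 4-character, left-justified symbol."""
    thing, day, attr, value = statement
    return (thing, day, attr, str(value).ljust(4))


def E_inv(wff):
    """Decoding E^-1: D -> S. Recovers the value as text."""
    thing, day, attr, symbol = wff
    return (thing, day, attr, symbol.rstrip())


def g(triple):
    """Data production process g, the composition of F and E."""
    return E(F(triple))


def g_lossy(triple):
    """A defective process that drops the reference frame from the wff."""
    thing, _day, attr, symbol = g(triple)
    return (thing, attr, symbol)


def is_bijection(mapping, domain, codomain):
    """True if mapping is one-to-one on domain and onto codomain."""
    images = [mapping(w) for w in domain]
    return len(set(images)) == len(images) and set(images) == set(codomain)


D = {g(w) for w in R}
print(len(R), is_bijection(g, R, D))                      # 12 True
print(is_bijection(g_lossy, R, {g_lossy(w) for w in R}))  # False
```

In this toy instance `g_lossy` fails because a warehouse ID is the same on both days, so two distinct triples collapse onto one wff. This is one way of seeing why the wff has to carry metadata about the reference frame as well as the data value.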
In the following section we use the IS model to analyze and define several fundamental dimensions of data quality.

4. DEFINING THE DIMENSIONS OF DATA QUALITY

We have shown that data quality depends on the quality of the data production process as embodied in g, the mapping from R to D. We have also decomposed g into its various process components. In this section we look at how the data production process fails in terms of non-conformance to user requirements. We define the specific components of non-conformance by which data quality is lost.

The reasons for non-conformance may correspond to any of several stages in a systems design life cycle (SDLC). First, an analysis and design stage transforms user requirements into a design model of the IS. Second, an implementation stage converts the design specification into an operational system. We identify here a third component not identified in the SDLC perspective: the data production stage. Deviations may exist between the intended operation and the actual operation of the IS. We define data quality over the aggregate transformation from user requirements through actual data as delivered and interpreted.

In Table 1 below, each item describes a component of non-conformance of the process constructs.

[Table 1: components of non-conformance by process construct. Only the column header "Process Construct" and the first row label "Thing" are recoverable.]

4.1.1. Accuracy

Accuracy is a point concept which requires each wff to properly correspond to the triple $\omega$ it signifies (e.g., if there are two dinners, 'two' is observed and the symbol '2' is entered). It characterizes the relation from a single aspect of the world to a single wff.

$g_p(\omega) = g_a(\omega)$ for all $\omega \in R$.

In terms of the process constructs, a wff is accurate if, and only if

$E_p(F_p(x_i, m_j, a_k)) = E_a(F_a(x_i, m_j, a_k))$

i.e. the aggregate result of measurement and encoding is according to P. Note that this condition accounts for the case where measurement errors and encoding errors offset each other.

4.1.2. Completeness
Completeness is a set-based concept that requires that all $\omega \in R_p$ are measured and mapped onto the co-domain:

co-domain($g_p$) = co-domain($g_a$).

In terms of process constructs, a data set $D_a$ is complete if, and only if

$R_p = R_a$ and $E_p(F_p(\omega)) = E_a(F_a(\omega))$ for all $\omega \in R_p$.

This means that all the aspects of the world of interest are measured and encoded accurately. For example, the condition on $R_p$ is violated if P specifies all company warehouses as things of interest and A specifies only warehouses in California, or if P specifies the attributes QTY and TEMP and A specifies only QTY. These are incompleteness problems: one in things and one in attributes.

4.1.3. Relevance

Relevance is a point concept. It requires that each wff in $D_a$ be the result of a mapping from an aspect of the world of interest $R_p$.

$g_p^{-1}(d_i) \in R_p$ for all $d_i \in D_a$

4.1.4. Timeliness

Timeliness requires that the time delay between measurement and the instantiation of wffs is at or below the maximum specified.

$t_e - t_d \le DI_{max}$ for each $\omega \in R_p$

($t_e$ and $t_d$ were defined in Section 3.3 and $DI_{max}$ in specification item 6 in Section 3.4.)

4.1.5. Interpretability

In general, interpretability requires that the user map each wff back to the statements they signify as per P. This requires that the user be able to interpret the data element as well as various aspects of the metadata. For example, the data element itself must be interpreted so that the data symbol '3' means the number three. The definition of the attribute must also be known or interpreted (e.g., QTY means the number of frozen dinners). In addition, reference frame information may provide indicators of the reliability of the data, for example, based on when the data was measured. Thus, for data to be interpretable with respect to the user:

for $d_i \in D_a$, $d_i$ is a wff, and
$E_p^{-1}(d_i) = E_a^{-1}(d_i)$ for $d_i \in D_a$

To be specific, we distinguish $D_p$ from $D_a$. $D_a$ is the actual data set, for which some elements may not be well formed. This necessitates the first of the two requirements. The key to interpretability is whether, given a wff, accurate or not, it can be properly interpreted by a user of the language.

We posit that these five dimensions are sufficient to characterize data of perfect quality with respect to a given user specification. We make no claims as to the necessity (minimality) of the dimensions. This is due to the interdependencies that exist among the dimensions, which we discuss next.

4.2. Interdependencies

The above analysis demonstrates the complexity underlying the concept of data quality. Intuitive or simple definitions of data quality may be inadequate to operationalize data quality measures. Similarly, attempts to find orthogonal dimensions of data quality may be misguided. As we discuss below, interdependencies exist among the dimensions, even given the exactness of our model constructs. The discussion on interdependencies will not be a formal one, but rather is presented to illustrate several obvious interdependencies.

Timeliness and accuracy: As data ages, it may cease to reflect the current state of our dynamic world. In applications where the data is taken to represent the current state of the world, timeliness of data delivery is related to data accuracy. For example, the latest value of a person's phone number may become inaccurate with time.

Timeliness and completeness: If we consider timeliness in terms of instantiation delay then, as mentioned earlier, timeliness and completeness are related concepts. Data which is generated too late, or is slow in arriving, results both in untimely data values and (temporarily) incomplete data sets. The required data is unavailable.

Completeness and accuracy: A wff which is inaccurate is unusable by the user, and thus the desired wff is, for practical purposes, not in the database for the user, causing incompleteness.
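The point definitions of Section 4.1 and two of the interdependencies above can be illustrated with a short sketch. It is a hedged illustration only: the dictionaries `SPEC` and `ACTUAL`, their toy values, the measurement of delay in whole hours, and the helper names are all assumptions introduced here, and completeness is read as "every specified triple has an accurate wff among those delivered". Aging of data (timeliness and accuracy) is not modeled.

```python
DI_MAX_HOURS = 3      # specification item (6): display within three hours
DAY = "1993-01-01"

# Specified process g_p: triple -> wff symbol (assumed toy values).
SPEC = {
    ("W1", DAY, "QTY"): "120 ",
    ("W1", DAY, "TEMP"): "-4  ",
    ("W2", DAY, "QTY"): "75  ",
}

# Actual process g_a: triple -> (symbol, hours from measurement to display).
ACTUAL = {
    ("W1", DAY, "QTY"): ("120 ", 1),
    ("W1", DAY, "TEMP"): ("-40 ", 2),   # encoding error
    ("W2", DAY, "QTY"): ("75  ", 5),    # arrives late
    ("W3", DAY, "QTY"): ("10  ", 1),    # not an aspect of interest
}


def accurate(triple):
    """Accuracy: the actual wff equals the specified wff for this triple."""
    return triple in SPEC and triple in ACTUAL and ACTUAL[triple][0] == SPEC[triple]


def relevant(triple):
    """Relevance: the delivered wff maps back to a triple of interest."""
    return triple in SPEC


def timely(triple):
    """Timeliness: the instantiation delay is at or below the maximum."""
    return DI_MAX_HOURS >= ACTUAL[triple][1]


def missing(delivered):
    """Specified triples that lack an accurate wff among the delivered ones."""
    return {w for w in SPEC if w not in delivered or not accurate(w)}


def complete(delivered):
    """Completeness: nothing specified is missing."""
    return not missing(delivered)


on_time = {w for w in ACTUAL if timely(w)}

# Timeliness and completeness: the late wff is absent at the deadline.
print(set(SPEC) - on_time)
# Completeness and accuracy: the inaccurate wff counts as missing.
print(missing(set(ACTUAL)))
# Relevance: one delivered wff corresponds to nothing in the specification.
print([w for w in ACTUAL if not relevant(w)])
```

Under these assumptions a single late record shows up both as a timeliness failure and as a gap in the data set available at the deadline, and a single encoding error shows up both as an accuracy failure and as a completeness failure.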
Accuracy and interpretability in metadata is : As we include metadata interpretable as to what the data is. is our definition of the data, inaccuracy For example, related to the interpretability of data. height as of a certain date, then the date in metadata if a wff were express a child's to which makes for the height value it Thus, inaccuracy in the date would affect the interpretability of the height. Thus we see that interdependencies exist DISCUSSION AND CONCLUSION 5. A model of an information system has been presented in which data process quality was treated as a surrogate for data product quality. constructs. Using the among dimensions often though of as independent. nrtodel we The model contains an explicit and orthogonal set of process are able to develop five data quality dimension definitions: accuracy, completeness, relevance, timeliness, and interpretability. They are shown, under the explicitly stated assumptions of the model, We have seen to inter-relate in various ways. that data quality is not a simple concept. minimal and orthogonal set of data quality dimensior\s single dimensions nor their interdependencies world and an infonnation system are required intuitive definitions of data, applicability to solving quality data. have is trivial definitions. IS, and A this paper. for detailed analysis. Efforts in the past to the some Neither well-defined model of the measurement, and the information system problems related that there exists brought into question by may be which appeal to limited in their development of information systems that deliver Data quality improvements must come through assisting and infonning those who manage, design, imi^ement, and operate infonnation systems. In of the The notion its this respect, a process oriented impacts on data quality, are more informative than This paper has shown static analyas data product dimensions. 
that interpretability, accuracy, timeliness, relevance, and completeness have complex constructs underlying them. More rigorous definitions of data, schema information, user understanding of data, generation of data, and time are necessary to define rigorously the components of data quality. Even the relatively simple source-receiver model of an IS developed in this paper requires a thorough analysis for its implications to data quality.

We have demonstrated a certain arbitrariness in past conceptualizations of data characteristics, considering words such as timeliness and completeness as independent dimensions of data quality when in fact there are strict relationships between them. It is clear from this effort that operationalizing data quality requires a strong model of the process by which the data is created and solid definitions for components and stages of data creation. Data quality characteristics can be traced to characteristics of the information system delivery vehicle that creates the data.

This research has demonstrated the utility of using fundamental concepts, such as those provided by the process model, as a foundation for data quality measurement. The constructs are domain and data model independent. The primitives developed here provide an unambiguous vocabulary for further work in developing data quality measures, and may assist in designing quality information systems.

References

[1] Bailey, J. E. & Pearson, S. W. (1983). Development of a Tool for Measuring and Analyzing Computer User Satisfaction. Management Science, 29(5), pp. 530-545.
[2] Ballou, D. P. & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multi-output Information Systems. Management Science, 31(2), pp. 150-162.
[3] Baroudi, J. J. & Orlikowski, W. J. (1988). A Short-Form Measure of User Information Satisfaction: A Psychometric Evaluation and Notes on Use. Journal of Management Information Systems.
[4] Bunge, M. (1974). Semantics I: Sense and Reference. Boston: D. Reidel Publishing Company.
[5] Bunge, M. (1977). Ontology I: The Furniture of the World. Boston: D. Reidel Publishing Company.
[6] Collett, C., Huhns, M. N., & Shen, W. (1991). Resource Integration Using a Large Knowledge Base in Carnot. IEEE Computer.
[7] Davis, F. D. (1989). Perceived Usefulness, Perceived Ease of Use, and User Acceptance of Information Technology. Management Information Systems Quarterly, (September), pp. 319-340.
[8] Deming, W. E. (1982). Out of the Crisis. Cambridge: MIT Center for Advanced Engineering.
[9] Huh, Y. U., et al. (1990). Data Quality. Information and Software Technology, 32(8), pp. 559-565.
[10] Ishikawa, K. (1985). What is Total Quality Control? The Japanese Way. Englewood Cliffs, NJ: Prentice-Hall.
[11] Johnson, J. R., Leitch, R. A., & Neter, J. (1981). Characteristics of Errors in Accounts Receivable and Inventory Audits. Accounting Review, 56(April), pp. 270-293.
[12] Juran, J. M. & Gryna, F. M. (1988). Quality Control Handbook (4th ed.). New York: McGraw-Hill Book Co.
[13] Laudon, K. C. (1986). Data Quality and Due Process in Large Interorganizational Record Systems. Communications of the ACM, 29(1), pp. 4-11.
[14] McCarthy, J. L. (1984). Scientific Information = Data + Meta-data. U.S. Naval Postgraduate School, Monterey, CA. 1984.
[15] Moore, G. C. & Benbasat, I. (1991). Developing an Instrument to Measure the Perceptions of Adopting an Information Technology Innovation. 2(3), pp. 192-222.
[16] Morey, R. C. (1982). Estimating and Improving the Quality of Information in the MIS. Communications of the ACM, 25(May), pp. 337-342.
[17] O'Neill, E. T. & Vizine-Goetz, D. (1988). Quality Control in Online Databases. In M. E. Williams (Ed.), Annual Review of Information Science and Technology (pp. 125-156). Elsevier Publishing Company.
[18] Pall, G. A. (1987). Quality Process Management. Englewood Cliffs, NJ: Prentice-Hall, Inc.
[19] Siegel, M. & Madnick, S. E. (1991). A metadata approach to resolving semantic conflicts. VLDB Conference. Barcelona, Spain. 1991.
[20] Taguchi, G. (1979). Introduction to Off-line Quality Control. Nagoya, Japan: Central Japan Quality Control Association.
[21] Wand, Y. & Weber, R. (1988). An Ontological Analysis of Some Fundamental Information Systems Concepts. The Ninth International Conference on Information Systems, Minneapolis, Minnesota, USA. 1988.
[22] Wand, Y. & Weber, R. (1990). Mario Bunge's Ontology as a Formal Foundation for Information Systems Concepts. In P. Weingartner & G. J. W. Dorn (Ed.), Studies on Mario Bunge's Treatise. Amsterdam: Rodopi.
[23] Wang, R. Y., Kon, H. B., & Madnick, S. E. (1993). Data Quality Requirements Analysis and Modeling in Data Engineering. Vienna, Austria. 1993.
[24] Wang, R. Y., Reddy, P., & Kon, H. B. (1993). An Attribute-based Model of Data for Data Quality Management, to appear in the Journal of Decision Support Systems (DSS).
[25] Wang, Y. R. & Madnick, S. E. (1990). A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538.