HU/0 .M414 no. ^3 ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT Data Quality Requirements Analysis and Modeling WP #3515-93 December 1992 CISL Richard Y. WP# 92-04 Wang Reddy H. B. Kon M. P. Sloan School of Management, MIT MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 To appear in the Journal of Decision Special Issue Support Systems (DSS) on Information Technologies and Systems Data Quality Requirements Analysis and Modeling WP #3515-93 Deceml)er 1992 CISL Richard Y. WP# 92-^ Wang Reddy H. B. Kon M. P. Sloan School of Management, * MIT see page bottom for complete address Richard Y. Wang E53-317 M. P. Reddy E53-322 Henry B. Kon E53-322 Sloan School of Management Massachusetts Institute of Technology 01239 Cambridge, MA E30SeE35Es3BW.'. smpv'; • To appsear in the Journal of Decision Support Systems (DSS) Special Issue Toward Quality on Information Technologies and Systems An Data: Richard Attribute-Based Approach Y. Wang Reddy Henry B. Kon M. P. November 1992 (CIS-92-04, revised) Composite Information Systems Laboratory E53-320, Sloan School of Management Massachusetts Institute of Technology Cambridge, Mass. 02139 ATTN: Prof. Richard Wang (617) 253-0442 Bitnet Address: rwang@sloan.mit.edu © 1992 Richard Y. ACKNOWLEDGEMENTS Wang, M.P. Reddy, and Henry B. Kon reported herein has been supported, in part, by MIT's International Financial Service Research Center and MIT's Center for Information Systems Research. The authors wish to thank Stuart Madnick and Amar Gupta for their comments on earlier versions of this paper. Thanks are also due to Amar Gupta for his support and Gretchen Fisher for helping prepare this manuscript. Work 1. Introduction 1.1. 1.2. Data quality: an attribute-based example Research focus and paper organization 1.3. 2. 5 Rationale for cell-level tagging related to data tagging 5 2.2. Work 6 2.3. Terminology 7 Data quality requirements analysis 3.1. 3.2. 3.3. 3.4. 4. 2 4 4 Research background 2.1. 3. 1 Dimensions of data quality The Step Step Step Step 1: 2: 3: 4: Establishing the applications view Determine (subjective) quality parameters Determine (objective) quality indicators Creating the quality schema attribute-based 4.1. 4.2. 4.3. model of data quality Data structure Data integrity Data manipulation 4.3.1. Ql-CompatibiUty and QIV-Equal 4.3.2. Quality Indicator Algebra 4.3.2.1. Selection 4.3.2.2. Projection 4.3.2.3. Union 4.3.2.4. Difference 4.3.2.5. Cartesian Product 8 9 9 10 11 12 12 15 15 15 18 18 19 20 22 24 5. Discussion and future directions 25 6. References 27 7. Appendix A: Premises about data quality requirements analysis 29 29 Premises related to data quality modeling Premises related to data quality definitions and standards across users 30 7.3. Premises related to a single user 30 7.1. 7.2. Toward Quality An Data: Attribute-Based Approach Introduction 1. Organizations in industries such as banking, insurance, retail, consumer marketing, and health care are increasingly integrating their business processes across functional, product, lines. The integration of these business processes, in turn, accelerates application systems for product development, product delivery, 1989). As a many medium more than 60% & Neter, 1981). size corporations (with of the firms more effective & Short, applications today require access to corporate functional and product of errors (Johnson, Leitch, surveyed 500 for and customer service (Rockart Unfortunately, most databases are not error-free, and databases. number result, demand and geographic had problems some contain a surprisingly large In a recent industry executive report. Computer-world annual sales of more than $20 million), and rejX)rted that The Wall Street journal also reported in data quality.^ that: to computers, huge databases brimming with information are at our fingertips, just waiting to be tapped. They can be mined to find sales prospects among existing customers; they can be analyzed to unearth costly corporate habits; they can be manipulated to divine future trends. Just one problem: Those huge databases may be hill of junk. ... In a world where people Thanks are moving to total quality management, one of the critical areas is data.'^ In general, inaccurate, out-of-date, or incomplete data can socially and economically (Laudon, 1986; Liepins & Uppuluri, 1990; Liepins, 1989; Zarkovich, 1966). Managing data quality, howrever, achieve zero defect datar' this may have significant impacts both is a complex task. Although not always be necessary or attainable for, Wang & it Kon, 1992; would be among ideal to others, the following two reasons: First, in many applications, addresses in database marketing customers, it is it is may not always be necessary to attain zero defect data. Mailing a good example. not necessary to have the correct city Second, there is name sending promotional materials to target In in an address as long as the zip code a cost/quality tradeoff in implementing data quality programs. is correct. Ballou and Pazer found that "in an overwhelming majority of cases, the best solutions in terms of error rate reduction is the worst in terms of cost" (Ballou & Pazer, 1987). The Pareto losses are never uniformly distributed over the quality characteristics. distributed in such a way Rather, the losses are always that a small percentage of the quality characteristics, "the vital few," always contributes a high percentage of the quality 2 Computerworld, September 28, 1 992, p. 80-84. The Wall Street Journal, May 26, 1992, page B6. 3 just like the well publicized 1 Principle also suggests that concept of zero loss. As defect products in the a result, the cost manufacturing improvement literature. potential is high for "the few" projects whereas the vital cure costs more than the disease (Juran "trivial & Gryna, many" 1980). defects are not worth tackling because the sum, when the cost In is prohibitively high, it is not feasible to attain zero defect data. Given may that zero defect data make and we This suggests that able to judge the quality of data. characteristics of the data not always be necessary nor attainable, manufacturing process its it indicators such as who it would be originated the data, . From the data which are these quality indicators, the user can know useful to when useful to be tag data with quality indicators a judgment of the quality of the data for the specific application at hand. decision to purchase stocks, for example, would be was In making a financial the quality of data through quality collected, and how the data was collected. In this paper, we propose an attribute-based model that facilitates cell-level tagging of data. Included in this attribute-based model are a mathematical model description that extends the relational process model, a SQL set of quality integrity rules, queries that are augmented with quality indicator requirements. indicators, the user can make a better interpretation of the data In order to establish the relationship data. and a quality indicator algebra which can be used From to these quality and determine the believability between data quality dimensions and quality of the indicators, a data quality requirements analysis methodology that extends the Entity Relationship (ER) model is also presented. Just as it is difficult to product which define its manage product quality, it is also difficult to characteristics that define data quality. quality, for the LL quality without understanding the attributes of the manage data quality without understanding the Therefore, before one can address issues involved in data one must define what data quality means. In the following subsection, we present a definition dimensions of data quality. Dimensions of data quality Accuracy is the most obvious dimension when it comes to data quality. "errors occur because of delays in processing times, lengthy correction times, stringent data edits" (Morey, 1982). In addition to defining Morey suggested and overly or insufficiently accuracy as "the recorded value conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value of date), completeness (all that is is in not out values for a certain variables are recorded), and consistency (the representation of the data value is the same in all cases) as the key dimensions of data quality (Ballou & Pazer, 1987). Huh et al. identified accuracy, completeness, consistency, important dimensions of data quality (Huh, It is (e.g., Juran et al., 1990; Juran, 1979; (1) Data quality (2) Data quality is We illustrate these is certain data retrieved the data for quality control have been well established in & Pazer, 1985; Garvin, 1983; Garvin, 1987; Garvin, 1988; Gryna, 1980; Morey, 1982; Wang & Guarrascio, 1991). It is also intrinsic characteristics of data quality: a multi-dimensional concept. a hierarchical concept. two characteristics from a database. must be accessible must be able & two interesting to note that there are the data methods Juran, 1979), neither the dimensions of quality for manufacturing nor for data have been rigorously defined (Ballou Huh, et a!., 1990). interesting to note that although the manufacturing field and currency as the most by considering how a user may make decisions based on First the (the user has the to interpret the data (the user user must be able to get to the data, which means and privilege to get the data). means that Second, the user understands the syntax and semantics of the data). Third, must be useful (data can be used as an input to the user's decision making process). Finally, the data must be believable to the user (to the extent that the user can use the data as a decision input). Resulting from this and believability. list In order to that can be accessed); to and to are the following four dimensions: be accessible consistent, credible, to the user, the data be useful, the data must be relevant be believable, the user may consider, and accurate Timeliness, . accessibility, interpretability, usefulness, among (fits must be available (exists in requirements for making the decision); other factors, that the data be complete, timely, in turn, can be characterized by currency (when item was stored in the database) and volatility (how long the item remains valid). the data quality dimensions illustrated in this scenario. currenT) Qion-volatile; Figure 1: some form A Hierarchy of Data Quality Dimensions Figure 1 the data depicts These multi-dimensional concepts and hierarchy of data quality dimensions provide a conceptual framework for understanding the characteristics that define data quality. In this paper, focus on interpretability and believability, as we we consider accessibility to be primarily a function of the information system and usefulness to be primarily a function of an interaction between the data and the application domain. 12. The idea of data tagging Data quality: an attrib ute-based Suppose an analyst maintains this effort Data may illustrated more concretely below. example a database contain attributes such as may be collected is on technology companies. The schema used company name, CEO name, and earnings over a period of time and come from a variety of sources. Table Company Name 1: Company Information to support estimate (Table 1). value may have necessary to may, know in turn, arbitrary a set of quality indicators associated with In it. the quality of the quality indicators themselves, in have another set of associated quality indicators. number of underlying As some applications, it may be which case a quality indicator such, an attribute may have levels of quality indicators. This constitutes a tree structure, as shown an in Figure 2 below. (attribute) nndicator) ^indicator") tK:. 3x... (^indicator) ^ Figiu-e 2: (mdlcator) An attribute with quality indicators Conventional spreadsheet programs and database systems are not appropriate for handling data which is structured in this manner. In particular, they lack the quality integrity constraints necessary for ensuring that quality indicators are always tagged along with the data (and deleted when the data is deleted) an attribute with In order to associate developed and the algebraic operators necessary to facilitate the is immediate quality indicators, a mechanism must be Section 2 presents the research background. presents the data quality requirements analysis methodology. based data model. Discussion and future directions are 2. we we present the attribute- in Section 5. Research background and present the terminology used in this pap)er. Rationale for cell-level tag gin g Any relation. same made In section 4, Section 3 discuss our rationale for tagging data at the cell level, summarize the literature related to data tagging, 2.1. set it. organized as follows. In this section query processing. hnkage between the two, as well as between a quality indicator and the of quality indicators associated with This paper its for attribute-based characteristics of data at the relation level should be applicable to all instances of the It is, quality. however, not reasonable to assume that all instances (i.e., tuples) of a relation Therefore, tagging quality indicators at the relation level quality heterogeneity at the instance level. is have the not sufficient to handle By the same token, any to all attribute characteristics of data tagged at the tuple level should be applicable values in the tuple. However, each attribute value in a tuple different sources, through different collection methods, Therefore, tagging data at the tuple level basic unit of manipulation, We now examine 12. Work is it is and updated also insufficient. may be collected from at different points in time. Since the attribute value of a necessary to tag quality information at the cell is the cell level. the literature related to data tagging. related to data tag gin g A mechanism for tagging data has DENOTE operations to tag and un-tag the operators is to been proposed by Codd. name It includes of a relation to each tuple. permit both the schema information and the database extension uniform way (Codd, 1979). It does not, however, allow NOTE, TAG, and The purpose to of these be manipulated in a for the tagging of other data (such as source) at either the tuple or cell level. Although self-describing data schema to level files at the (McCarthy, 1982; McCarthy, 1984; McCarthy, 1988), no specific solution has been offered manipulate such quality information A and meta-data management have been proposed at the tuple and cell levels. rule-based representation language based on a relational schema has been proposed to store data semantics at the instance level (Siegel attribute values based the tuple level as on values of other opposed & Madnick, 1991). These rules are used attributes in the tuple. to the cell level, and thus to derive meta- However, these rules are specified at cell-level operations are not inherent in the model. A polygen model (poly = multiple, gen = source) (Wang & Madnick, 1990) has been proposed tag multiple data sources at the cell level in a heterogeneous database environment imp)ortant to know where it to is not only the originating data source but also the intermediate data sources which contribute to final query results. The research, however, focused on the "where from" perspective and did not provide mechanisms to deal with more general quality indicators. In (Sciore, 1991), annotations are oriented environment. general treatment is to However, data quality support the temporal dimension of data in an objectis a multi-dimensional concept. necessary to address the data quality issue. calculus-based language data. used is provided to Therefore, a more More importantly, no algebra or support the manipulation of annotations associated with the The examination of the above research functionality of our attribute-based model, 23. an extension of existing data models is support the to required. Terminology To • efforts suggests that in order we facilitate further discussion, An introduce the following terms: a pplication attribute refers to an attribute associated with an entity or a relationship in an entity-relationship (ER) diagram. This would include the data traditionally associated with an application such as part number and supplier. • A quality parameter defines when is a qualitative or subjective dimension of data quality that a user of data For example, believability and timeliness are such evaluating data quality. dimensions. • As introduced in Section characteristics of data collection • A method defined by users to its is may the value determined (directly or indirectly) by the user of data Functions can be For example, the quality . is a measured may have have discussed the rationale characteristic of the stored data. For example, the a quality indicator value The Wall Street Journal. for cell-level tagging, and introduced the terminology used methodology Data source, creation time, and be defined as high or low depending on the quality indicator source data quality indicator source tagging, process.'* quality indicators to quality parameters. quality indicator value We manufacturing parameter based on underlying quality indicators. map parameter credibility A quality indicators provide objective information about the are examples of such objective measures. for a particular quality • and quality parameter value of the data 1, for the specification of data quality in this paper. summarized work related to data In the next section, parameters and indicators. The we intent present a is to allow users to think through their data quality requirements, and to determine which quality indicators would be appropriate for a given application. We consider an indicator objective if it is generated using a well defined and widely accepted measure. Data quality requirements analysis 3. In general, different users data may have may have different quality characteristics. thorough treatment of these requirements analysis (Batini, Lenzirini, between The reader is referred to different tyf)es of Appendix A for a & an effort similar is in spirit Navathe, 1986; Navathe, Batini, on quality aspects of the Based on data. traditional data requirements analysis & to traditional data Ceri, 1992; Teorey, this similarity, parallels and data quality requirements can be drawn analysis. depicts the steps involved in performing the proposed data quality requirements analysis. application requirements Stepl determine the application view of data I application view application's i Step 2 quality requlrments determine (subjective) quality parameters for the application candidate quality paramaters T parameter view __i Step3 determine (objective) quality Indicators for the application quality view (1) • s'teD'4 - 1 quality view -*^ I (I) quality view (n) ^ quality view Integration quality Figure The input, output more issues. Data quality requirements analysis 1990), but focusing and different data quality requirements, 3: and The process schema of data quality requirements analysis objective of each step are described in the following subsections. Figure 3 3.1. Step Step this paper. & Establishing the applications view 1: 1 is A the whole of the traditional data modeling process and will not comprehensive treatment of the subject has been presented elsewhere Navathe, 1986; Navathe, & Batini, we A company Figure 4 documents the a pplication view for management has a company name, a CEO, and an earnings estimate, while a stock has a share price, a stock exchange (NYSE, that (Batini, Lenzirini, are interested in designing a portfolio system which contains companies that issue stocks. An ER diagram in Ceri, 1992; Teorey, 1990). For illustrative purposes, suppose that symbol. be elaborated upon AMS, or OTC), our running example is and a ticker shown below in . COMPANY STOCK JSSUES -<^COI COMPANY NAME SHARE PRICE Z^ CEO NAME EARNINGS ESTIMATE Figure 2.2, Step 2: 4: Application view (output from Step 1) Determine (subjective) quality parameters The goal in this step These parameters need to dimensional concept, and is parameters from the user given an application view. be gathered from the user in a systematic may be illustrates the addition of the application view. to elicit quality 5: as data quality is operationalized for tagging purposes in different ways. two high level parameters, interpretability Each quality parameter identified Figure way Interpretability is shown a multi- Figure 5 and believability. inside a "cloud" in the diagram. and believability added to the application view to the Interpretabilitv can be defined through quality indicators such as data units and scale (e.g., in millions). defined through currency and volatility stock entity which is -o ii Step 3: 6: is Timeliness, in turn, can be identified in this step are referred to as the parameter view. We added to focus here on the STOCK Parameter view for the stock entity (partial output from Step 2) Determine (objective) quality indicators The goal in Step 3 is to operationalize the primarily subjective quality parameters identified in Step 2 into objective quality indicators. rectangle) The quality parameters . in Figure 6. _ _ Figure . The resulting view shown dollars) Believability can be defined in terms of lower-level quality parameters such as completeness, timeliness, consistency, credibility, and accuracy the application view. (e.g., in and is is depicted as a tag (using a dotted- attached to the corresponding quality parameter (from Step view The portion of the quality view . -o- Figure Each quality indicator 7: for the stock entity in the 2), running example is creating the quality shown in Figure 7. 4 UNITS? STOCK The portion of the lnterpretable^>.^g^^^^jg, quality view for the stock entity (output from Step 3) 10 Corresponding currency units in which share price price is is parameter interpretable are the more objective quality indicators to the quality is measured (e.g., $ vs. ¥) the latest closing price or latest nominal price. and status which says whether the share Similarly, the believability of the share price indicated by the quality indicators source and reporting date . For each quality indicator identified in a quality view, indicators for a quality indicator, then Steps 2-3 are repeated, example, the quality of the attribute Earnings Estimate (i.e., the name who provided of the journal) but also on the second it making may depend level source the Earnings Estimate figure to the journal depicted below in Figure if (i.e., important is this an name and the Reporting have quality For iterative process. not only on the the to first level source of the financial analyst date). This scenario is 8. 'earnings estimate Believable I , ' _ - J , REPORTING DATE' I SOURCE r---" • , I L Believable \ (ANALYSTS NAME I _1 I REPORTING DATE' J J , Figure All quality 8: Quality indicators of quality indicators views are integrated in Step 4 to generate the quality schema, as discussed in the following subsection. lA, Step Creating the quality schema 4: When the design multiple quality views must be consolidated Lenzirini, & single global is may large result. and more than one set of application To eliminate redundancy and suffice. is involved, inconsistency, these quality views into a single global view, in a process similar to schema integration (Batini, Navathe, 1986), so that a variety of data quality requirements can be met. The resulting view is called the quality schema. This involves the integration of quality indicators. may requirements In more complicated indicators in order to decide cases, it may what indicators In simpler cases, a union of these indicators be necessary to examine the relationships to include in the quality 11 among schema. For example, it is the likely that one quality view may have age may have as an indicator, whereas another quaUty view time for the same quaUty parameter. In time this case, creation may creation be chosen for the quality schema because age can be computed given current rime and crearion time. We have presented a step-by-step procedure model in a position to present the attribute-based data We are now data quality requirements. to specify supporting the storage, retrieval, and for processing of quality indicators as spjecified in the quality schema. The attribute-based model of data 4. We choose to extend the relational model because the structure and semantics of the relational approach are widely understood. Following the attribute-based data model data manipulation. We is assume Data that the reader (a) 1982), the presentation of the data structure, (b) data integrity, and familiar with the relational is model (Codd, (c) 1970; 1983). structxire As shown Figure 2 (Section in levels of quality indicators. and the 1), an attribute to facilitate the linkage In number arbitrary its of underlying immediate quality indicators, a between the two, as well as between a quality with set of quality indicators associated the quality key concept. may have an In order to associate an attribute with mechanism must be developed indicator model (Codd, relational divided into three parts: Codd, 1979; Date, 1990; Maier, 4.L quality This mechanism it. extending the relational model, Codd made is developed through clear the need to uniquely identify tuples through a system- wide unique identifier, called the tuple ID (Codd, 1979; Khoshafian & Copeland, 1990).^ This concept an attribute Specifically, applied in the attribute-based model to enable this linkage. is scheme in a relation attribute, consisting of the attribute and is a quality expanded key into where EEc is the quality key for the attribute expanded scheme is referred to as a quality EE (Tables scheme defines a quality scheme for the quality relation indicators are associated with the attributes . is expanded 3-6 are EE^) into (EE, embedded In Table 4, ((CN, nik), Company. CN quality pair, called a . For example, the attribute Earnings Estimate (EE) in Table 3 4 an ordered in Figure in 9). Table This <CEO, nik), (EE, EEc)) The "nik" indicates that and CEO; whereas EEc indicates no quality that EE has associated quality indicators. Correspondingly, each quality cell, cell in a relational tuple is expanded into consisting of an attribute value and a quality key value. This Similarly, in the object-oriented literature, the ability to property of an object-oriented data model. 12 make an ordered pair, called a expanded tuple references through object identity is is referred to considered a basic as a quality tuple and the resulting relation (Table 4) key value is referred to as a quality relation in a quality cell refers to the set of quality indicator values the attribute value. This set of quality indicator values tuple called a quality indicator tuple quality indicator tuples is quality indicator relation Under is A . composed quality relation referred to as the quality indicator the relational Relation for 3: scheme that defines the . Company Company ;E0 Name (CEO) Name (CN) Earnings Estimate (EE) IBM Akers 7 id002c DELL Dell 3 model I Table 4: I I Qual^ i\^ Relation for>^mpany for>Gompar {CN, nik) tid idOOU (IBM, nik) (Dell, nik) idOOZc Table idl02c of a set of these time-varying idOOU the attribute-based idlOlc form a kind of quality model tid EEg to The quality scheme . 5: (Barron's, id201<) Tables SRCU Jnl, id202<» 6: (EE, EEC) (Akers, nilO (Dell, Level-One QIR (SRC1.SRC1<!> (Wall St f (CEO, nik for the (News (7,idl01«) nik) |.(3,idl02«) EE attribute Date, nik) (Entrv Clerk, nik' (Oct 5 "92, nik) (Oct 6 '92, nik) Level-Two QIR for the EE Each quality immediately associated with grouped together called a quality indicator relation Table Under is . (Joe, nil^ (Mary, attribute nil«^ indicators for In general, Each attribute in qr2 a. in turn, , can have a quality indicator relation associated with an attribute can have n-levels of quality indicator relations associated with it, n ^ 0. it. For example. Tables 5-6 are referred to respectively as level-one and level-two quality indicator relations for the attribute We Earnings Estimate. define a quality scheme set as the collection of a quality scheme and indicator schemes that are associated with scheme We define a Company. set for quality indicators. A indicator schemes, quality In Figure 9, Tables 3-6 collectively define the quality it. is defined as a set of quality scheme sets that describes the Figure 10 illustrates the relationship scheme the quality quality database as a database that stores not only data but also quality schema structure of a quality database. all sets, and among quality schemes, quality the quality schema. Quality Schema Figure 10 Quality schemes, quality indicator schemes, quality scheme sets, and the quality schema We now present a mathematical definition of the quality relation. developed in the relational model, domain (in for a Table product 4, system-wide unique 7€ EE where EE is a we id domain as a set of values of similar type. domain for earnings estimate). Let m is ID be the be a domain for an attribute DID be defined on the Cartesian DID). be a quality key value associated with an attribute value d where d € quality relation (qr) of degree Let D identifier (in Table 4, idlOlc € ID). Let D X ID (in Table 4, <7, idlOU) € Let define a Following the constructs defined on the m+1 domains (m>0; in Table 4, D and m=3) if it is id € ID. A a subset of the Cartesian product: ID X DIDi X DID2 X Let qthe a quality tuple, which is designated an element in a X DIDm- quality relation. Then a quality relation qr as: qr The is ... = {qt qt I = <id, didi, did2, •••, didm) where integrity constraints for the attribute-based 14 model id € ID, did; € DID), is presented next. j = 1, ... ,m) 4^ Data integrity A fundamental property of the attribute-based model corresponding quality (including we mean atomic unit all attribute value corresponding quality indicators also need In other words, an attribute value We that an attribute value and descendant) indicator values are treated as an atomic unit whenever an that is and its to is . created, deleted, retrieved, or modified, its By its be created, deleted, retrieved, or modified respectively. corresponding quality indicator values behave atomically. refer to this property as the atomicity property hereafter. This property is enforced by a set of quality referential integrity rules as defined below. Insertion key present Insertion of a tuple in a quality relation : schema in the tuple (as specified in the quality indicator tuple must be inserted must ensure that for each non-null quality definition), the corresponding quality into the child quality indicator relation. For each non-null quality key in the inserted quality indicator tuple, a corresponding quality indicator tuple must be inserted at the next level. This process must be continued recursively until no more insertions are required. Deletion key present Deletion of a tuple in a quality relation must ensure that for each non-null quality : the tuple, corresponding quality information in must be deleted from the corresponding to the quality key. This process must be continued recursively until a tuple with all encountered null quality keys. Modification : an attribute value If quality indicator values of that attribute We now introduce a quality U. is table is modified in a quality relation, then the descendant must be modified. indicator algebra for the attribute-based model. Data manipulation In order to present the algebra formally, we Let . Ql-Compatibility and QIV-Equal a-i and associated with attributes aj and a2 Ol-Compatible with a2 then the attributes assume be two application attributes. Let QKa,) denote the set of quality indicators Let S be a set of quality indicators. aj. are defined to be We define two key concepts that are fundamental and OlV-Equal to the quality indicator algebra: Ol-compatibility 4.3.1. first shown a^ and that the respect to S.6 in Figure 11 are a2 shown If S C QKaj) and For example, if Ql-Compatible with respect in Figure 11 are not numeric subscripts (e.g., S S = to S. C QI(a2), then ai and a2 {qii, qi2, qi2i), Whereas Ql-Compatible with respect if S = then the (qii, qi22}/ to S. qin) n^^P 'he quality indicators to unique positions quality indicator tree. 15 in the Figure 11: Ql-Compatibility Example Let 3] and 82 be Ql-Compatible with respect to Let respectively. (qi2(W] ) = V2 qi(wi ) be the value of quality indicator in Figure 12). of ai and 82 value Wi where qi g S qi for the attribute QIV-Equal with Define Wj and W2 to be W2 be values Let wi and S. respect to S provided that qi(W] = ) c V qi qi(w2) S= 6 S, {qij, qi2i), denoted as wi = W2. In Figure example, w^ and W2 are QIV-Equal with respect 12, for but not QIV-Equal with respect to S = {qii, qisi) because qi3i(wi) = to whereas qi3i(w2) = v^-^ X31. (a2,W2) (a^.w.,) /\ (''hr^l) ^2) (qii2 \\\ (qi21'^2l' /\ (q'22'''22) «'*31'^31) (<"l 1 M ^11 qi^g • v_ \ \ ^2 > ^21 <'''21 '"bl > ' "31 > y Figure 12: QIV-Equal Example In practice, specify level it is the elements of all up Let wi the quality indicators to be we alleviate the situation, and depth if the following (2) S consists of a^ and attributes. Let ai a2 respectively, then two conditions are all compared (i.e., to introduce i-level Ql-compatibility (i- in which aU the quality in a quality indicator tree are considered. two application and W2 be values of Compatible , To to a certain level of Let ai and a2 be S S). all QIV-Equal) as a special case for Ql-compatibility (QlV-equal) indicators to tedious to explicitly state satisfied: and ai be Ql-Compatible with respect wj and W2 are defined (1) ai quality indicators present within and i 82 are to be i-level to S. QI- Ql-Compatible with respect levels of the quality indicator tree of ai (thus of 82). By If the 'i' is same the token, i-level QIV-Equal maximum level of be maximum-level Ol-Compatible by wi ='" W2, . between wi and W2, denoted by Wj depth in the quality indicator Similarly, tree, then a-i =' W2, can be defined. and 82 are defined to maximum-level QIV-Equal between Wj and W2, denoted can also be defined. 16 To exemplify the algebraic operations quality relations having the Large_and_Medium (Tables Figure same quality 7, 7.1, 7.2 14). <CN, Table 7 nil«r> in the quality indicator algebra, scheme set as in Figure 13) shown in Figure 9. we They are and Small_and_Medium (Tables introduce two referred to as 8, 8.1, and 8.2 in If QUALITY" the clause "with absent in a user query, then is on the quality of data explicit constraints would not be compared in the retrieval process; however, the quality indicator values associated with SQL The quality that retrieved as well. syntax, the dot notation indicator algebra namely algebraic operations, and (That attribute a. QUALITY" let QR and QS be two quality tuples. Let Sg be a t2 Sg is, is clause.) Let the are identical. Let ti .a = is a quality equal with resfject to Sg. define the five orthogonal quality relational and Cartesian product. be two quality schemes and Let a and b be two let qr and qs be two attributes in both set of quality indicators Sf)ecified QR and by the user for the constructed form the specifications given by the user in the "with term * we QR and QS respectively. quality relations associated with t^.a t2.a = denote that the values of the attribute a in the tuples t2.a denote that the values of attribute a Similarly, let =' ti .a t2.a and = t] .a maximum-level QlV-equal respectively between the values of "" t2.a and ti.a in the tuples denote t^ i-level and t2 t] and are QIV- QlV-equal and t2.a. Selection 4.3.2.I. Selection its SRC2 which presented in the following subsection. is selection, projection, union, difference, In the following operations, t2 identifies in turn is a quality indicator to EE. Following the relational algebra (Klug, 1982), ti to identify a quality indicator in the Quality Indicator Algebra 4.3.2. QS. Let used is EE.SRC1.SRC2 In Figure 9, for example, which indicator for SRCl, no that the user has being retrieved. In that case quality indicator values In the extended quality indicator tree. means is would be the applications data it a unary operation which selects only a horizontal subset of a quality relation (and is corresponding quality indicator relations) based on the conditions specified in the Selection There are two types of conditions operation. in the Selection operation: regular conditions for an application attribute and quality conditions for the quality indicator relations corresponding to the The application attribute. o^C (qr) where = (t C(ti ) I V = ti .b); 6 qr, ej <& ©2 <& or (ti.aGti.b); ="" ti selection, a'^c (qr)' is defined as follows: ei** is V ae QR, ... <i) o e, = "i <i) ti .a) 62'' d) a (t.a ...d) = "" t, .a)) ep''; Gj is in of the forms (qik = constant) or (ti.a = qik € QKa); <t e indicators to be e,, ((t.a { a, v, -, ); 6 a C(ti )} one of the forms: '' ti.b)or (ti.a = (g, s, s, #, <, >, =); and Sa,b compared during the comparison 18 of tj .a and tj .b. is (ti .a 6 constant) =' ti.b)or (ti.a the set of quality Example 1: Get all Large_and_Medium companies whose earnings estimate is over 2 and is supplied by Zacks Investment Research. A corresponding extended SQL query is shown as follows: algebra. SELECT FROM CN, CEO, EE WHERE EE>2 LARGE_AND_MEDIUM with QUALITY This SQL query can be accomplished through The result is EE.SRC1.SRC2= 'Zacks' shown below. a Selection operation in the quality indicator SELECT CN,EE FROM LARGE_and_MEDIUM This SQL query can be accomplished through a <CN, nil«> Projection operation. The result is shown below. <CN, nil(f> Note This also that unlike the relational union, the is illustrated Example Example in 3-3 : operation in Example 3-b: with The SM.CN, SM.CEO, SM.EE SMALL_and_MEDIUM SM LM.CN, LM.CEO, LM.EE LARGE_and_MEDIUM LM QUALITY result is (LM.EE.SRC1= SM.EE.SRCl) shown below. <CN, is not commutative. 3-3 below. Consider the following extended SELECT FROM UNION SELECT FROM quality union operation nile> SQL query which switches the order of the union Example 4: as Get all the companies which are classified as only Large_and_Medium companies but not Small_and_Medium companies. A corresponding SQL query is shown as follows: SELECT FROM DIFFERENCE LM.CN, LM.CEO, LM.EE SELECT SM.CN, SM.CEO, SM.EE LARGE_and_MEDIUM LM FROM SMALL_and_MEDIUM SM With QUALITY This SQL query (LM.EE.SRC1.SRC2 = SM.EE.SRC1.SRC2) can be accomplished through a Difference operation. below. <CN, nil> The result is shown Cartesian Product 4.3.2.5. The Cartesian product Let ti € qr the tuple t2 e qs. Let The tuple t2. of degree r+s. qr X^ qs = t(l) = t(r+l) The t](i) t is also a binary operation. denote the i'^ I V t,(l) = t] A t2(l) in the quality relation resulting <LM.CN, m\t> qs, be of degree t] and t2(i) r and QS be of degree denote the i"^ attribute from the Cartesian product of qr and qs denoted as qr X^ will s. of be qs, is defined as follows: G qr, Vt2 6 qs, t(l) A =" tid) A t(r+l) =" t(2) t2(l) A = t,(2) t(r+2) result of the Cartesian product shown below. QR attribute of the tuple The Cartesian product of qr and { t Let A = t(2) t2(2) ='" t,(2) a A t(r) = t(r+2) ="" t2(2) a ... ti(r) ... a t{r) t(r+s) = ="" t,(r) tzCs) a a t(r+s) =" t2(s) between Large_and_Medium and Small_and_Medium is ) We have presented the attribute-based model including a description of the model structure, a model, and set of integrity constraints for the a quality indicator algebra. algebraic operations are exemplified in the context of the the capabilities of this model and future research SQL In addition, each of the query. The next section discusses some of directions. Discussion and future directions 5. The attribute-based model can be applied in many different ways and some of them are listed below: • The ability of the retain the origin • A user can Example 1, model to filter the data retrieved from a database for instance, only the data furnished QUALITY multiple levels makes The reputation inspected. it possible to in Figure 9 illustrates this. according to quality requirements. by Zacks Investment Research is In retrieved as EE.SRCl.SRC2='Zacks'." Data authenticity and believability can be improved by data inspection and quality indicator value could indicate • at and intermediate data sources. The example specified in the clause "with • support quality indicators who certification. inspected or certified the data and when it A was of the inspector will enhance the believability of the data. The quality indicators associated with data can help to resolve semantic incompatibility capability is among clarify data semantics, which can be used data items received from different sources. This very useful in an interoperable environment where data in different databases have different semantics. • Quality indicators associated with an attribute values. For example, it can be interpreted the spouse name is if may facilitate a better interpretation of null the value retrieved for the spouse field (i.e., is empty in an employee record, tagged) in several ways, such as (I) the employee unknown, or (3) this tuple is inserted into the is employee unmarried, table (2) from the materialization of a view over a table which does not have spouse field. • In a data quality control process, the source of error In this paper, and processed. when errors are detected, the data administrator can identify by examining quality indicators such as data source or collection method. we have Specifically, investigated we have requirements analysis and specification, how (1) (2) quality indicators may be specified, stored, retrieved, established a step-by-step procedure for data quality presented a model for the structure, storage, and processing 25 of quality relations and quality indicator relations (through the algebra), functionalities related to data quality administration We investigating its combining accurate monthly data with mechanisms components. to determine the quality of (2) In order to (3) touched upon control. are actively pursuing research in the following areas: of derived data (e.g., values of and and (1) In order to determine the quality weekly less accurate data), we are derived data based on the quality indicator use this model for existing databases, which do not have tagging capability, they must be extended with quality schemas instantiated with appropriate quality indicator values. effective. (3) We are exploring the possibility of Though we have chosen the relational model oriented approach appears natural to model data and quality control methods), we its making such a transformation cost- to represent the quality quality indicators. schema, an object- Because many of the mechanisms are procedure oriented and o-o models can handle procedures are investigating the pros and cons of the object-oriented approach. 26 (i.e., References 6. [I] [2] [3] Ballou, D. P. & Pazer, H. L. (1985). Modeling Data and Process Quality in Multi-input, Multioutput Information Systems. Management Science, 11(2), pp. 150-162. Ballou, D. P. & Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures in Information Systems. International Journal of Management Science, 15(6), pp. 509-521. Batini, C, Lenzirini, M., & database schema integration. [4J Codd, ACM, [5] E. F. (1970). A Navathe, S. (1986). A comparative analysis of methodologies for ACM Computing Survey, 1£(4), pp. 323 - 364. relational model of data for large shared data banks. Communications of the ii(6), pp. 377-387. Codd, E. F. (1979). capture more meaning. ACM practical foundation for productivity, the 1981 ACM Extending the relational database model to Transactions on Database Systems, 1(4), pp. 397-434. [6] Codd, E. F. (1982). Relational database: Turing Award Lecture. Communications An [7] Date, C. [8] Garvin, D. A. (1983). Quality on the J. (1990). Introduction to A of the ACM, 25(2), pp. 109-117. Database Systems (5th ed.). Reading, line. MA: Addison-Wesley. Harvard Business Review, (September- October), pp. 65- 75. [9] Garvin, D. A. (1987). Competing on the eight dimensions of quality. Harvard Business Review, (November-December), pp. 101-109. [10] Garvin, D. A. (1988). Managing Quality-The Strategic and Competitive Edge The Free [II] [12] Huh, (1 ed.). New York: Press. Y. U., et al. Data Quality. Information and Software Technology, 22(8), pp. 559-565. (1990). Johnson, J. R., Leitch, R. A., & Neter, ]. (1981). Characteristics of Errors in Accounts Receivable and Inventory Audits. Accounting Review, 56(April), pp. 270-293. [13] Juran, J. M. (1979). Quality Control [14] Juran, J. M. & Gryna, F. M. Handbook (3rd ed.). (1980). Quality Planning New York: McGraw-Hill Book Co. and Analysis (2nd ed.). New York: McGraw Hill. [15] S. N. & Copeland, G. P. (1990). Object San Mateo, CA: Morgan Kaufmann. Khoshafian, 37-46). Identity. In S. B. Zdonik& D. Maier (Ed.), (pp. [16] Klug, A. (1982). Equivalence of relational algebra and relational calculus query languages having aggregate functions. The Journal of ACM, 22, pp. 699-717. [17] Laudon, K. C. (1986). Data (Quality and Due Process Communications of the ACM, 22(1), pp. 4-11. [18] Large Interorganizational Record Systems. & Uppuluri, V. R. R. (1990). Data Quality Control: Theory and Pragmatics (pp. 360). Marcel Dekker, Inc. Liepins, G. E. New York: [19] in Liepins, O. E. (1989). Sound Data Are a Sound Investment. Quality Programs, (September), pp. 61- 63. [20] Maier, D. (1983). The Theory of Relational Databases (1st ed.). Rockville, MD: Computer Science Press. [21] McCarthy, J. L. (1982). Metadata Management for Large Statistical Databases. Mexico City, Mexico. 1982. pp. 234-243. [22] McCarthy, J. L. (1984). Scientific Information School, Monterey, CA. 1984. pp. 27 = Data + Meta-data. U.S. Naval Postgraduate [23] [24] McCarthy, J. L. (1988). The Automated Data Thesaurus: A New Tool for Scientific Information. Proceedings of the 11th International Codata Conference, Karlsruhe, Germany. 1988. pp. Morey, C. (1982). Estimating R. Communications of [251 Navathe, and Sons. [26] Rockart, Batini, S., & F. J. the ACM, C, & Short, J. and Improving the Quality of Information in the MIS. 25(May), pp. 337-342. Ceri, S. (1992). The Entity Relationshiip Approach E. (1989). IT in the 1990s: Sloan Management Review, Sloan School New york: Wiley Managing Organizational Interdependence. Management, MIT, of . lfl(2), pp. 7-17. Using Annotations to Support Multiple Kinds of Versioning in an Object-Oriented Database System. ACM Transactions on Database Systems, l£(No. 3, September 1991), pp. 417-438. [27] Sciore, E. (1991). [28] Siegel, & M. Madnick, S. E. A (1991). metadata approach to resolving semantic conflicts. Barcelona, Spain. 1991. pp. [29] [30] Teorey, T. Mateo, CA (1990). Database J. : Morgan Kaufman & Modeling and Design: The Entity-Relationship Approach Towards Wang, R. Y. Wang (Ed.), Information Technology in Action: Trends Kon, H. . San Publisher. B. (1992). Total Data Quality Management (TDQM). and Perspectives Englewood In R. Y. Cliffs, NJ: Prentice Hall. [31] Wang, Y. R. & Guarrascio, L. M. (1991). Dimensions of Data Quality: Beyond Accuracy. (CISL-91Composite Information Systems Laboratory, Sloan School of Management, Massachusetts 06) Institute of [32] Wang, Technology, Cambridge, Y. R. & Madnick, MA, S. E. (1990). 02139 June 1991. A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538. [33] Zarkovich. (1966). Quality of United Nations. Statistical Data . Rome: Food and Agriculture Organization \ 28 of the Appendix A: Premises about data quality requirements 7. Below we present premises related analysis. To facilitate further discussion, refers to t>oth quality parameters to we analysis data quality modeling and data quality requirements define a data quality attribute as a collective term that and quality indicators as shown in Figure A.l. (This term is referred to as a quality attribute hereafter.) Data Quality Attributes (collective) among Figure A.l: Relationship Premises related 7.1. quality attributes, quality parameters, to data quality Data quality modeling is and quality indicators. modeling an extension of traditional data modeling methodologies. As data modeling captures many of the structural and semantic issues underlying data, data quality modeling captures many premises relate of the structural to these (Premise and semantic issues underlying data data quality modeling issues. (Relatedness between entity and quality attributes): 1.1) attribute can be considered either as quality attribute. application name entity attribute be included; alternatively, From an entity attribute For example, the may be an it of a teller if initial may be modeled (i.e., judgment call on In some cases a quahty an application entity's attribute) or as a who performs a transaction in a banking application requirements state that the teller's name as a quality attribute. a modeling perspective, whether an attribute should be a quality attribute is a The following four quality. modeled as an the part of the design team, entity attribute or and may depend on the initial application requirements as well as eventual uses of the data, such as the inspection of the data for distribution to external users, or for integration with other data of different quality. distribution and integration of the information quality of the data they use. When the data information of different quality, that quality A guideline to this judgment is is is that If, exported to their users, to ask what information the on the other hand, the information process, such as when, where, and by often the users of a given system "know" of the however, or combined with may become unknown. provides application information such as a customer attribute. The relevance whom name and relates the data attribute provides. address, more it may be the attribute considered an entity to asf)ects of the data was manufactured, then If this manufacturing may be a quality attribute. In short, the objective of the data quality requirement analysis attributes, but also to is not strictly to develop quality ensure that important dimensions of data quality are not overlooked entirely in requirement analysis. 29 (Premise (Quality attribute non-orthogonality): 1.2) one another. For example, the two quality parameters credibility and timeliness are orthogonal to related not orthogonal), such as for real time data. (i.e., (Premise may (Heterogeneity and hierarchy in the quality of supplied data): Quality of data 1.3) differ across databases, entities, attributes, may university database example: and instances. be of higher quality than data data about alumni (an entity) Attribute example: Different quality attributes need not be in the student entity, may Entity be less reliable than data about students (an entity). may in may be more be accurate than are addresses. less interpretable than that of a Because human insight is needed for data quality modeling and different people may have different opinions regarding data quality, different quality definitions this Instance domestic student. Premises related to data quality definitions and standards across users 7.2. call this information in a John Doe's personal database. grades example: data about an international student Database example: phenomenon "data quality is in the and standards may result. We eye of the beholder." The following two premises entail phenomenon. (Premise 2.1) indicators may (Users define different quality attributes): vary from one user to another. Quality parameter example: for a manager the quality parameter for a research report may need to Quality parameters and quality may be inexpensive, whereas for a financial trader, the research report be credible and timely. Quality indicator example: the manager may measure may measure indicator may be inexpensiveness in terms of the quality indicator (monetary) cost, whereas the trader inexpensiveness in terms of opportunity cost of her own time and thus the quality retrieval time. (Premise 2.2) (Users have different quality standards): Acceptable levels of data quality differ from one user to another. For example, an investor following the movement of a stock consider a fifteen minute delay for share price to be sufficiently timely, whereas a trader price quotes in real time A single user data used. This to a single may have phenomenon is user different quality attributes summarized in and quality standards for the different Premise 3 below. (Premise 3) (For a single user; non-uniform data quality attributes and standards): have different quality attributes and quality standards across databases, instances. Across attributes example: than for the for certain number who needs may not consider fifteen minutes to be timely enough. Premises related 7.3. may may A user may need fact that 30 user higher quality information for the phone of employees. Across instances example: companies, but not for others due to the A A user may need some companies may entities, attributes, or number high quality information are of particular interest. Date Due ^luidggf -1 ^ft?;i Lib-26-67 Mn " " i.iPRaRiF<; nun "Nil Mllllllllllllilli 3 TOfiO QDflMb554 K/cT 1