M.I.T. LrSRARIES - DEV\^Y ^ Pewey ' HD28 .M414 "1^ MIT LIBRARIES llhl I 3 9080 00932 7625 Toward Quality Data: An Attribute-Based Approach Richard Y. Wang M.P. Reddy Henry B. WP #3762 November PROFIT Productivity From "PROHT" Kon 1992 #92-01 Information Technology Research Initiative Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 02139 USA (617)253-8584 Fax: (617)258-7579 Copyright Massachusetts Institute of Technology 1992. The research described herein has been supported (in whole or in part) by the Productivity Technology (PROFIT) Research PROFIT sponsor firms. Initiative at MIT. This copy is From Information for the exclusive use of Productivity From Information Technology (PROFIT) The Productivity From Information Technology (PROFIT) Initiative was established on October 23, 1992 by MIT President Charles Vest and Provost Mark VVrighton "to study the use of information technology in both the private and public sectors and to enhance productivity in areas ranging from finance to transportation, and from manufacturing to telecommunications." At the time of its inception, PROFIT took over the Composite Information Systems Laboratory and Handwritten Character Recognition Laboratory. These two laboratories are now involved in research related to context mediation and imaging respectively. SSACHUSETTSNSTl.OTE OF TECHNOLOGY MAY 2 3 1995 UBRARIES PROFFT has undertaken joint efforts with a number of research centers, laboratories, and programs at MIT, and the results of these efforts are documented in Discussion Papers published by PROFFT and/or the collaborating MFI entity. In addition, Correspondence can be addressed to: The "PROFFF" Initiative RoomE53-310, MFI 50 Memorial Drive Cambridge, MA 02142-1247 TeL (617) 253-8584 Fax: (617) 258-7579 E-Mail: profit@mit.edu Toward Quality ABSTRACT data resource complex is The need always be attainable. it would be manufacturing process. From facilitates cell-level which are it would be may a not criteria characteristics of the data make how Specifically, quality indicators we propose an Included in may in and their its own be specified, stored, attribute-based data this attribute-based mathematical model description that extends the relational model, a integrity rules, and queries that are augmented with a quality indicator algebra believability of the data. is the data for the specific application at hand. tagging of data. indicators, the user can however, useful to be able to these quality indicators, users can This paper investigates and processed. quality, ideal to achieve zero defect data, this This suggests that indicators judgment of the quality of retrieved, Managing data critical. nianagement of the Moreover, different users may have different determining the quality of data. tag data with quality Attribute-Based Approach for a quality perspective in the becon^ing increasingly Although task. An Data: make which can be used quality indicator requirements. a better interpretation of the data In order to establish the relationship dimensions and quality methodology that extends the Entity Relationship indicators, a data quality model is to From model that model are a set of quality process SQL these quality and determine the between data quality requirements also presented. analysis 1. Introduction 1.1. 1.2. 1.3. 2. 4 5 5 2.2. Rationale for cell-level tagging Work related to data tagging 2.3. Terminology 7 6 Data quality requirements analysis 3.1. 3.2. 3.3. 3.4. 4. 2 4 Research background 2.1. 3. 1 Dimensions of data quality Data quality: an attribute-based example Research focus and paper organization The 1: Establishing the applications view 9 2: 9 Step Step 4: Determine (subjective) quality parameters Determine (objective) quality indicators Creating the quality schema 3: attribute-based 4.1. 4.2. 4.3. 8 Step Step model of data quality 4.3.2. 11 12 Data structure Data integrity Data manipulation 4.3.1. 10 12 15 15 Ql-Compatibility and QIV-Equal Quality Indicator Algebra 15 4.3.2.1. Selection 18 18 4.3.2.2. Projection 19 Union 20 22 24 4.3.2.3. 4.3.2.4. Difference 4.3.2.5. Cartesian Product 5. Discussion and future directions 25 6. References 27 7. Appendix A: Premises about data quality requirements analysis 29 29 Premises related to data quality modeling across standards and definitions 7.2. Premises related to data quality 30 users 30 7.3. Premises related to a single user 7.1. Toward Quality An Data: Introduction 1. Organizations in industries Attribute-Based Approach such as banking, insurance, retail, consumer marketing, and health care are increasingly integrating their business processes across functional, product, and geographic lines. The integration of these business processes, in turn, accelerates demand for more effective application systems for product development, product delivery, and customer service (Rockart 1989). As applications today require access to corporate functional and product Unfortunately, most databases are not error-free, and some contain a surpnsingly large databases. number many a result, Short, St. of errors (Johnson, Leitch, it Neter, 1981). medium surveyed 500 more than 60% Thanks In a recent size corporations (with annual sales of of the firms had problems in data quality.^ industry executive report, Computerworld more than $20 and reported The Wall Street lournal also reported computers, huge databases brimming with information are to million), at our fingertips, that that: )ust waiting to be tapped. They can be mined to find sales prospects among existing customers; they can be analyzed to unearth costly corporate habits, they can be manipulated to divine future trends. Just one problem: Those huge databases may be full of |unk. ... In a world where p>eople are moving to total quality management, one of the cntical areas In general, inaccurate, out-of-date, or incomplete data socially and economically (Laudon, Managing data Zarkovich, 1966). achieve zero defect data} this 1986; Liepins quality, may & is data.^ can have significant impacts both Uppuluri, 1990; Liepins, 1989; however, is a complex task. Although not always be necessary or attainable for, Wang k it Kon, 1992; would be among ideal to others, the following two reasons: First, in many applications, addresses in database marketing customers, it is nfil it is may a not always be necessary to attain zero defect dau. good example. necessai7 to have the correct city Second, there is name In sending promotional materials in an address as long as the zip code a cost/quality tradeoff in implementing data quality programs. Pazer found that "in an overwhelming ma)onty of cases, the best solutions reduction losses are is the worst in terms of cost" (Ballou nev« k Pazer, 1987). way 1 2 3 loss. Computerworld, Sept«nib«r 28, 1992. p. 80-84. Th€ WaU Slr*«t Joum*!, May 26, 1992, pagt B6. )ust like the weil puWldzed coiKept of itw dtfect products m As the is correct. Ballou and also suggests that Rather, the losses are always that a small percentage of the quality characteristics, always contributes a high percentage of the quality to target terms of error rate The Pareto Principle uniformly distributed over the quality characteristics. distributed in such a in Mailing "the vital few, a result, the cost improvenrtent potential manufactunng literittir*. is high for "the few" projects whereas the vital cure costs more than the disease (Juran & "trivial Cryna, many" 1980). defects are not worth tackling because the sum, when the cost in prohibitively high, is it is not feasible to attain zero defect data. Given may that zero defect data make judgment a deasion to and we This suggests that able to judge the quality of data. characteristics of the data not always be necessary nor attainable, manufacturing process its it indicators such as who for example, would be it originated the data, . From useful to when useful to be tag data with quality indicators know the data which are these quality indicators, the user can of the quality of the data for the specific application at hand. purchase stocks, would be was In making a financial the quality of data through quality collected, and how the data was collected. In this paper, we pro|X)se an attribute-based model that facilitates cell-level tagging of data. Included in this attribute-based model are a mathematical model description that extends the relational model, a SQL process set of quality integrity rules, In a quality indicator algebra which can be used queries that are augmented with quality indicator requirements. indicators, the user can data. and order make a better interpretation of the data to establish the relationship From to these quality and determine the believability of the between data quality dimensions and quality indicators, a data quality requirements analysis methodology that extends the Entity Relationship (ER) model is also presented. Just as it is difficult to product which define its manage product quality, it is also difficult to characteristics that define data quality. quality, for the LL quality without understanding the attnbutes of the manage data quality without understanding the Therefore, before one can address issues involved in data one must define what data quality means. In the following subsecfion, we present a definition dimensioru of data quality. Dimension Accuracy of data guall^ is the most obvious dimension when it comes to data quality. "errors ocair beciuse of delays in processing times, lengthy correction times, stringent data edits" (Morey, 1982). In addifion to defining Morey suggested and overly or insufficiently accuracy as "the recorded value conformity with the actual value," Ballou and Pazer defined timeliness (the recorded value of date), completeness (all that is is m not out values for a certain variables are recorded), and consistency (the representafion of the data value is the same in all cases) as the key dimensions of data quality (Ballou & Pazer, 1987). Huh et al. identified accuracy, completeness, consistency, and currency as the most important dimensions of data quality (Huh, It the , 1990; Juran, 1979; Juran (1) Data quality (2) Data quality is a is the data two from characteristics by considering First the user (the user has the believability. In to means and to he accessible be believable, the user consistent, credible, Guarrascio, 1991). a user may make It is also decisions based on to get to the data, privilege to get the data). to the user's which means that Second, the user may to the user, the and accurate Timeliness, . among making process). Finally, the accessibility, interpretability, usefulness, data must be available (exists in must be relevant consider, decision user can use the data as a decision input). are the following four dimensions: order how must be able (to the extent that the that can be accessed); to be useful, the data and Wang & data (the user understands the syntax and semantics of the data). Third, data must be believable to the user and for Pazer, 1985, Garvin, 1983; Garvin, 1987; Garvin, 1988; must be useful (data can be used as an input list manufacturing nor intrinsic characteristics of data quality: a database. to interpret the Resulting from this of quality for in a hierarchical concept. must be accessible must be able k have been well established multi-dimensional concept. illustrate these certain data retneved the data dimensions Gryna, 1980, Morey, 1982; <Sc two interesting to note that there are We tor quality control field (e.g., Juran, 1979), neither the data have been ngorously defined (Ballou et al methods interesting to note that although IS manufactunng Huh, ot al., 1990) (fits requirements for making the deasion); other factors, that the data be complete, timely. in turn, can be characterized by currency (when the data item was stored in the database) and volatility (how long the item renruiins valid). the data quality dimensions illustrated in this scenario. currtnt^ Qhon-vol«tll^ Figure 1: some form A Hierarchy of DaU Quality Dimensions Figure 1 depicts These multi-dimensional concepts and hierarchy of data quahty dimensions provide a conceptual framework for understanding the characteristics that define data quality. In this paper, focus on interpretability and believability, as a function of the information system and usefulness application domain. iu2, The idea Data quality: an to this effort Data may attri bute-based a over a penod be primanly function of an interaction between the data and the more concretely below. illustrated database on technology companies. The schema used company name, CEO name, and earnings of time Table Company .Mame is a to example contain attributes such as may be collected consider cKCcssibility be primarily of data tagging Suppose an analyst maintains we we and come from 1: Company a vanety of sources. Information to support estimate (Table 1). value may have necessary may, to know in turn, arbitrary a set of quality indicators associated with In it. the quality of the quality mdicators themselves, in have another applications, which case it mav be a quality indicator As such, an attnbute mav have an set of associated quality indicators. number of underlying some levels of quality indicators. This consntutes a tree structure, as shown in Figure 2 below. (attribute) (indicator) (indicatoT) tK: zr\... (indicator) ^ Figure 2: An (indicator) attribute with quality indicators Conventional spreadsheet programs and database systems are not appropriate for handling data which is structured in this manner. In particular, they lack the quality integrity constraints necessary for ensunng that quality indicators are always tagged along with the data (and deleted when the data is deleted) and the algebraic operators necessary for attnbute-based query processing. an attribute with In order to associate developed of quality indicators associated with is immediate quality indicators, between the two, as well as between to facilitate the linkage This paper its based data model. Discussion and future direchons are made RarionaU Any relation. same quality. set we Section 3 present the attnbute- in Section 5. and present cell level, summarize the the terminology used in this paper. fnr f»ll.l»v»» tagging characteristics of It is, In section 4, discuss our rationale for tagging data at the literature related to data tagging, and the Research background 2. LI. a quality indicator Section 2 presents the research background. presents the data quality requirements analysis methodology we mechanism must be it. organized as follows. In this section a dau at the relation level however, not reasonable to assume should be applicable to that all instances (i.e., instances of the tuples) of a relarion have the Therefore, tagging quality indicators at the relation level quality heterogeneity at the instance level. all is not sufficient to handle By the same token, any characteristics to all attribute of data However, each values in the tuple. tagged at the tuple level should be applicable attribute value in a tuple may be collected from different sources, through different collection methods, .ind updated at different points Therefore, tagging data at the tuple level is basic unit of manipulation, to tag quality We now 2ol Work examine necessary is it also msutticient Smce information m time. the attribute value of a cell is the at the cell level. the literature related to data tagging. related to data tagging A mechanism DENOTE operations operators is to for tagging data has been proposed by Codd. and un-tag the name to tag It includes of a relation to each tuple. permit both the schema information and the database extension uniform way (Codd, 1979). It does not, however, allow for the NOTE, T.AG, and The purpose of these be manipulated to in a tagging of other data (such as source) at either the tuple or cell level. Although self-describing data schema to level and meta-data management have been proposed (McCarthy, 1982; McCarthy, 1984; McCarthy, 1988), no marupulate such quality information A files at the hjple and specific solution has been offered cell levels. rule-based representation language based on a relational schema has been proposed data semantics at the instance level (Siegel attribute values based the tuple level as on values of other opposed k Madnick, These rules are used 1991). However, these attributes in the tuple. to the cell level, and thus at the to to store denve meta- rules are specified at cell-level operations are not inherent in the model. A polygen model (px)Iy = multiple, gen = source) (Wang & Madnick, 1990) has been proposed tag multiple data sources at the cell level in a heterogeneous database environment important to know where it to is not only the originating data source but also the intermediate data sources which contribute to final query results. The research, however, focused on the "where from" perspective and did not provide mechanisms to deal with more general quality indicators. In (Sciore, 1991), annotations are oriented environment general treatment is to However, data quality support the temporal dimension of data in an objectis a multi-dimensional concept. necessary to address the data quality issue. calculus-based language data. used is provided to Therefore, a more More importantly, no algebra or support the manipulation of annotations associated with the The examination above research the of efforts suggests that in functionality of our attnbute-based model, an extension of existing data models — 2^ order is support the to required. Tcnninology To An • we facilitate further discussion, application ^ttrit>Mte refers to entity-relationship (ER) diagram. introduce the following terms: an attnbute associated with an entity or a relationship m an This would mclude the data traditionally associated with an application such as part number and supplier. A • quality parameter defines is a qualitative or subjective when evaluating dimension of data quality that a user of data For example, believability and timeliness are such data quality. dimensions. • As introduced in Section characteristics of data collection • A method 1, and manufacturing process.^ its Data source, creation time, and are examples of such objective measures. quality parameter value for a particular quality defined by users quality indicators provide objective information about the to is the value determined (directly or indirectly) by the user of data parameter based on underlying quality indicators. map quality indicators to quality parameters. Functions can be For example, the quality parameter credibility may be defined as high or low depending on the quality indicator source of the data • A . quality indicator value is data quality indicator source We tagging, a measured may have have discussed the rationale for cell-level tagging, for the specification of data quality in this paper. We for a summarized work to related to data In the next section, parameters and indicators. users to think through their data quality requirements, and would be appropriate For example, the a quality indicator value The Wall Street journal. and introduced the terminology used methodology characteristic of the stored data. The we intent is to allow determine which quality indicators given application. consider u\ uidicatof objective li it i» present a generated using a well defined and widely accepted m«««if«. Data quality requirements analysis 3. In general, different users data may have may have different quality characteristics. thorough treatment of these The reader is referred to Appendix A requirements analysis (Batini, Lenzirini, on quality aspects it is an effort similar in spirit X'avathe, 1986; Mavathe, Batini, Based on of the data. & for a more to traditional data Ceri, 1992; Teorey, this similarity, parallels between traditional data requirements analysis and data quality requirements can be drawn analysis. depicts the steps involved in performing the proposed data quality requirements arulysis. application requirements i Step1 detemiine the application view of data I application view application's Step 2 quality requinnents^ determine (subjective) quality parameters for the application candkjata quality paramaters T parameter view Step3 determine (objective) quality Indicators for the application T quality vl«w (1 ) s"ter4 • - quality view ^ I (I) • ^^ quality view (n) ^ quality view Integration 1schema quality Figure The input, output of issues. Data quality requirements analysis 1990), but focusing and different tvpes different data quality requirements, 3: and The process of daU quality requirtmento analysis objective of each step are descnbed in the following subsections. Figure 3 IL — Step Step 1 IS the whole of the A comprehensive this paper. &c Establishing the applications vievy 1: Navathe, 1986; Navathe, Batini, &c system which contains companies An ER diagram symbol. Figure 4 a stock that will not be elaborated upon in treatment ot the subiect has been presented elsewhere (Batini, Leazirini, Cen. 1992; Teorev. 1990). For illustrative purposes, suppose that earnings estimate, while modeling process and traditional data we ire interested in that issue stocks. designing a portfolio management company has .A a company name, has a share pnce. a stock exchange (NfYSE, documents AMS, the a pplication view for our running a CEO, and an or OTC), and a ticker example is shown below in . COMPANY ISSUES STOCK SHARE PRICE CEO NAME EARNINGS ESTIMATE Figure 2JL Step 2: 4: Application view (output from Step 1) Determine (subjective) q uality parametera The goal in this step These parameters need to dimensional concept, and is parameters from the user given an application view be gathered from the user may illustrates the addition of the application view. to elicit quality 5: systematic way as data quality is be operationalized for tagging purposes in different ways. two high level parameters, interpretability Each quality parameter identified Figure in a Interpretability is shown a multi- Figure and believability. inside a "cloud" in the diagram. and believability added to the application view 5 to the Interpretability can be defined through quality indicators such as data units (eg., in dollars) and scale Believabilitv can be defined in terms ot lower-level quality parameters (e.g., in millions). such as completeness, timeliness, consistency credibility, and accuracy , The quality parameters defined through currency and volatility. The resulting view the application view. stock entity which is shown in Figure i^ Step 3: identified in this step are We referred to as the parameter view. can be added to focus here on the 6. Parameter view for the stock entity in Step 3 is to (partial output from Step 2) operationaiize the pnmarily subjective quality parameters identified Step 2 into objective quality indicators. rectangle) in turn, Determine (objective) quality indicators The goal in 6: Timeliness, STOCK -O-i Figure is . and is Each quality indicator is depicted as a tag (using a dotted- attached to the corresponding quality parameter (from Step view The portion of the quality view . for the stock entity in the 2), running example is creating the quality shown in Figure 7. UNITSf -Oi Figure 7: STOCK The portion of the quality view 10 for the stock entity (output from Step 3) Corresponding currency umts pnce IS to the quality which share price in is parameter intcrpretable are the more objective quality indicators measured (e g , 5 vs the latest closing price or latest nominal price IS V) and status which says whether the share Similarly, the believabilitv of the share price indicated by the quality indicators source and reportin)^ Jate For each quality indicator identified a quality in . view, indicators for a quality indicator, then Steps 2-3 are repeated, if it making important is this the name of the )Ournal) but also on the second level source (i.e., the name the Earnings Estimate figure to the journal and the Reporting date). depicted l>elow in source This scenario is 8. .ANALYSTS NAME Figure All quality first level For of the financial analyst who provided Figure have quality an iterative process. example, the quality of the attnbute Earnings Estimate may depend not only on the (i.e., to 8: REPORTINQ DATE' ' Quality indicators of quality indicators views are integrated in Step 4 to generate the quality schenia, as discussed in the following subsection. l± Stgp Crgaring tht quality 4: When the design multiple quality views must be consolidated Lenzirini, k is may vh^ma large result. and more than one set of application requirenrtents is involved, To eliminate redundancy and into a single global view, in a process similar to is (Batini, called the quality schema. This involves the integration of quality indicators. suffice. schema integration Navathe, 1986), so that a variety of data quality requirements can be met. The resulting single global view may inconsistency, these quality views In more complicated indicators in order to decide cases, it may what indicators In be necessary simpler cases, a union of these indicators to examine the relationships among the to include in the quality 11 schema. For example, it is likely that one quality view may have a^ tim^ for the same quality parameter. as an indicator, whereas another quality view may time In this case, creation mav have creation be chosen for the qualitv schema because age can be computed given current time and creation time. We have presented a step-by-step procedure processing of quality indicators as specified The 4. choose in the quality now of data quality extend the relational model because the structure and semanhcs of the relational to attribute-based data model data manipulation. We il, are schema. model attribute-based approach are widely understood. Following the Codd, We specify data quality requirements. present the attnbute-based data model tor supporting the storage, retrieval, and in a position to We to is divided into three parts; assume model (Codd, relational that the reader 1982), the presentation of the data structure, (a) data integrity, and (b) familiar with the relational is model (Codd, (c) 1970; 1979; Date, 1990; Maier, 1983). Data structure As shown in Figure 2 (Section levels of quality indicators. and the an attribute may have to facilitate the linkage In its immediate quality indicators, a between the two, as well as between set of quality indicators associated with the quality key concept. an arbitrary number of underlying In order to associate an attribute with mechanism must be developed indicator 1), This it. extending the relational model, mechanism Codd made is a quality developed through clear the need uniquely to identify tuples through a system-wide unique identifier, called the tuple ID (Codd, 1979; Khoshafian Sc Copeland, 1990). ^ an attribute Specifically, is applied in the in a relation scheme is This concept attribute, consisting of the attribute and a qualitv attnbute-based model expanded key into where EEr is an ordered is is referred to as a qualitv scheme defines a quality scheme for the quality relation indicators are associated with the attributes . In Table 4, Company. CN qualitv pair, called a expanded the quality key for the attribute EE (Tables 3-6 are expanded scheme enable this linkage. . For example, the attribute Earnings Estimate (EE) in Table 3 4 to embedded «CN, The into (EE, EEt) in Table nil«>, in (CEO, Figure nilc), 9). This (EE, EEt)) "nilc" indicates that no quality and CEO; whereas EEc indicates that EE has associated quality indicators. Correspondingly, each quality cell, cell in a relational tuple is expanded into an ordered pair, called a consisting of an attribute value and a quality key value. This expanded tuple SuniUrly, in the ob)«ct-on«nt«d lHtr«ture, the ability to property of an ob^-onenled data model. 12 mak« references through objtet infinity is is referred to corsdered a basic as a quality tupl^ and the resulting relation (Table key value 4) is referred to as a quality relation in a quality cell refers to the sot ot quality indicator the attribute value. This set of quality indicator values tuple called a quality indicator tuple quality indicator tuples is quality indicator relation Under is quality relation .A . the relational scheme Relation for 3: Company :E0 Name Earnings iCEO) Estimate (EE) IBM Akers DELL Dell model 4: Qua tuy w Rela tion to onCpmpany (CN, tid (IBM,nil<:) id002« (Dell, Table (Barron s,id20U) SRClc Figure (SRC2, idlOU) idl02t> EE attnbute for the (Oct6 Level-Two QIR 6: nik) Date. nilt)j (Oct 5 92, niU) idl02(f '<WallStJnl,id202e> Tables (Dell, News (SRCl.SRCU) (7, ,.(3, (Akers, nil(> nik) Level-One QIR 5; (EE EE«> (EE, E (jCEoTniki nilc) idOOU EE< nilt) (Entry Clerk. (Joe, nilt) (Mary, nile) 92, nil«> EE for the attribute (Report date, nilt; id201< (Zacks, nilt) (Sep id202c (Zacks, nil<) (Sepl5 92,nilO 9: The Quality Scheme 1 nil<: 92, nil<> Company Set for quality key thus serves as a foreign key, relating an attribute (or quality indicator) value a quality indicator relation for the to its associated quality indicator tuple. For example. Table 5 attnbute Earnings Estimate and Table 6 a quality indicator relation for the attnbute data) in Table is 5. The quality cell its is <Wall St )nl, is SRCl (source of id202«) in Table 5 contains a quality key value, id202<, a tuple id (primary key) in Table 6. Let qri be a quality relation then that defines the . I which kind of quality of a set of these time-varying .Name tCN) id002(j Table The a Company idOOU the attribute-based 3 form model tid idlOU to The quality scheme . referred to as the quality indicator Table Under composed called a qualitv indicator relation Each quality values immediately associated with grouped together is . quality key and must be non-null a an attnbute (i.e., containing a quality indicator tuple for a, not then in qri. "nil<"). all 13 If a has associated quality indicators, Let qr2 be the quality indicator relation the attributes of qr2 are called level-one quality indicators for a. Each attnbute in qrj , in turn, an attribute can have n-levels of quality indicator relations associated with In general, example. Tables 5-6 are referred for the attribute We to respectively as Icvol-ono and level-two quality indicator define a quality scheme set as the collection of a quality scheme and set for s n it, 0. it. For relations Earnings Estimate. indicator schemes that are associated with scheme can have a quality indicator relation associated with We Company. quality indicators. A Figure 9, the quality Tables 3-6 collectively define the quality define a quality database as a database that stores not only data but also quality schema structure of a quality database. indicator schemes, quality In it. all is defined as a Figure 10 illustrates the relationship scheme sets, scheme set of quality among sets that describes the quality schemes, quality and the quality schema. Quality Schema Quality schemes, quality indicator schemes, quality scheme Figtire 10 We now developed domain (in Table product present a matheinatical definition of the quality relation. in the relahonal 4, model, system-wide unique for a 7 € sets, EE where EE D X ID (in Table 4, <7. is we domain define a and the quality schema Following the constructs as a set of values of sirrular type. identifier (in Table 4, idlOlc € ID). a don^ain for earnings estimate). the D be a domain for an attnbute DID be defined on the Cartesian Let Let ID be Let idlOU) € DID). Let uf be a quality key value associated with an attribute value d quality relation (qr) of degree tn is defined on the m+1 domains (m>0; where d in Table 4, € D m»3) and if it is id € ID. A a subset of the Cartesian product: ID X DIDi X DID2 X Let is (jt designated be a quahty tuple, which an element in a X DIDm- quality relation. Then a quality relation qr as: (^ The is ... « Iqt I qt » <id, did], did2, ••, didm) where integrity constramts for the attnbute-based 14 model id « ID, didj c DID}, is presented next. j - 1, ... ,m) Data integ rity 4J. A fundamental property corresponding quality (including we mean atomic unit that all whenever an other words, an attribute value and We attribute ".alue its to is that an attribute value and . created, deleted, retrieved, or modified, its Bv its bo created, deleted, retrieved, or modified respectively corresponding quality indicator values behave atomicallv property as the atomicity property hereafter. refer to this is descendant) indicator values are treated as an atomic unit corresponding quality indicators also need In model of the attribute-based This property is enforced by a set of quality referential integrity rules as defined below. Insertion key present Insertion of a tuple in a quality relanon : in the tuple (as specified in the quality indicator tuple must be must ensure schema definition), the inserted into the child quality indicator relation. in the inserted quality indicator tuple, a that for each non-null quality corresponding quality For each non-null quality key corresponding quality indicator tuple must be inserted at the next level. This process must be continued recursively until no more insertions are required. Deletion key present with all null must be deleted from the the tuple, corresponding quality information in corresponding Deletion of a tuple in a quality relation must ensure that for each non-null quality : to the quality key. This process must be continued recursively until a tuple table encountered is quality keys. Modification : If an attribute value is modified in a quality relation, then the descendant quality indicator values of that attribute must be modified. We now 4A introduce a quality indicator algebra for the attnbute-based model. Data manipulation In order to present the algebra formally, to the quality indicator algebra: 4.3.1. and associated with attributes a, aj be ai- are defined to be define two key concepts that are fundamental Ql-compatibilitv and OIV-Equal and two application Ql-Compatible with aj assume attributes. . Let QI(a,) denote the set of quality indicators Let S be a set of quality indicators. shown in Figure then the attributes a, and aj shown We first Ql-Compatibility and QIV-Equal Let a, 6 we that the H in If S i;; QKa,) and S C QKaj), then For example, respect to S.6 if S = are Ql-Compatible with respect to Figure 11 are numeric subscripts (eg, ncji qi,,) quality indicator tree. 15 S. (qii, qij, qiji), Whereas Ql-Comparible with respect map if S = a, and a^ then the (qii qi22). , to S. the quality indicators to unique positions in the Figure 11: Ql-Compatibility Example Let a, and a2 be Ql-Compatible with respect to respectively. qHw,) be Let (qi2(wi) = V2 in Figure V qi(w2) S = qi e S, 12). Let w, and wj be values of a. the value of quality indicator qi for the attribute value w, Define w, and wj to be denoted as w, = W2. QIV-Equal with In Figure 12, for but n^l QIV-Equal with respect to S = (qi,, qi2i), S. resfject to where and a; qi e S S provided that qi(w; = ) example, w, and W2 are QIV-Equal with respect to qh,) because qi3i(w,) = V3, whereas qi3](w2) = (qi,, X3,. (32,^2) (q'l- (qi2-V2) V,) \\\ /\ C^r^n^ (qi,.v,) (qij.v^) (PI3. ^3) Figure In practice, it QIV-Equal) as indicators up a 12: S). To to S , and ^1 1 ) C'lz^ to a certain level of depth if the following (2) S consists of a, and aj ^ ('"3r'3i) y the quality indicators to be we compared (i.e., to introduce i-level Ql-compatibility (i- which aU. the quality in a quality indicator tree are considered. Let a, and respectively, then w, two conditions are all \ \ 3) ('"2r ^2i) special case for Ql-compatibility (QlV-equal) in and W2 be values of Compatible all alleviate the situation, Let a, and a^ be two application attributes. Let w, C'n QIV-Equal Example tedious to explicitly state elements of sfjecify all the level is /\ (l*22-^22) «"3r^31> (^'l2-^12)<'"2r''21^ (qij.vj) satisfied. (1) a^ be Ql-Compatible with respect and W2 are defined a, quality indicators present within and i 82 are to be i-levgl to S. QI- Ql-Compatible with respect levels of the quality indicator tree of a, (thus of 82). By the same token, If 'i' is the i-level maximum level of be maximum-level Ql-Comparible by w, =*" W2, QIV-Equal between w, and W2, denoted by w, . depth in the quality indicator tree, Similarly, => W2, can be defined. then a, and a2 are defined to maximum-lev el QIV-Equal between w, and W2, denoted can also be defined. 16 To exemplify the algebraic operations in the quality indicator algebra, quality relations having the Large_and_Medium Figure (Tables same 7, 7 1, quality Table 7 set js 7 2 in Figure 13) 14). <CN. scheme nilj> shown in Figure 9. we introduce two They are and Small_and. Medium (Tables referred to as 8, 8 1, and 8 2 in 'with the clause If explicit constraints SQL extended which The quality that in a being retrieved. is syntax, the dot notation in turn is namely quality relations associated with and tj no case quality indicator values to identify a quality indicator in the example, EE.SRCl.SRC2 identifies SRC2 which a quality is presented in the following subsection. (That QUALITY" clause.) are identical. QR QR Let S^ is, is define the five orthogonal quality relational and QS and Cartesian product. respectively. and b be two Let a qr and qs be two let attributes in both a set of quality indicators specified by the user a =^* t2.a denote that the values of attnbute a in the tuples t, for the Similarly, let t, =' .a t2.a and = a t, "" the values of ti.a t2.a and denote and ti i-Ievel t2 and t, are QIV- QlV-equal and t2.a. Selection Selection is a unary operation which selects only a honzonul subset of a quaUty relation (and corresponding quality indicator relations) based on the conditions specified and quality conditions application attribute. The o^C^qD-UI Vti whereC(ti)»e, or (tiaetib); ti .b); selection, a'^c (qr)- € qr. for the quality indicator relations ei'' is corresponding QR,((t.a = ti 'I a) A(t.a = (t)...*^; "" t,.a)) ( a. v. compared during ^ ); 9 to the 6; is in a C(ti)) one of the forms: (tt-a 9 constant) of the forms (qik = constant) or (ti.a =^'*ti.b)or (tia ='ti.b)or *6 an defined as follows: 'S <ftej ,»...<& e„<t>e,<i<t) ez qik € QUa); indicators to be Vac in the Selection in the Selection ofjeration: regular conditions for There are two types of conditions application attribute =" and constructed form the specifications given by the user in the "with maximum-level QIV -equal respectively between op>eration. QR Let the term ti.a = tj.a denote that the values of the attnbute a in the tuples equal with respect to Sg. 4.3.2.I. we and QS be two quality schemes and be two quality tuples. Let S^ be a. its used is selection, projection, union, difference, In the following operations, let ti that the user has a quality indicator to EE. is indicator algebra algebraic operations, attribute In that means however, the quality indicator values associated with Following the relational algebra (Klug, 1982), t, it Quality Indicator Algebra 4.3.2. QS. Let user query, then retrieved as well. In Figure 9, for quality indicator tree. absent in the retrieval process, would be the applications data indicator for SRCl, is on the quality of data would not be compared In the QUALITY" = (S. 2. s. *, <. >. -); and Sa,b the comp>anson of 18 ti a and t] b. >« (ti a the set of quality Example Get 1: all Large_and_Medium companies whoso earnmgs estimate is over 2 and is supplied bv Zacks Investment Research. A algebra. corresponding extended SQL query is ^hown SELECT FROM CN, CEO. EE WHERE EE>2 EESRCrSRC2=Zacks' LARGE AND MEDIUM with QUALITY This SQL query The result is can be accomplished through shown below. <CN, Table 9 Table as tollows: 9.1 nile> a Selection operation in the quality indicator SELECT CN, EE FROM LARGE, and_MEDIUM This SQL query can be accomplished through a Proiection operation. The result <C\, r\\\i> is shown below. <C.\', niU> \ote This is also that unlike the relational union, the Example illustrated in Example op>eration in 3-3 : Example 3-3 Consider the following extended 3-b: SMCN. SMCEO, SM EE SMALL_and_MEDILM SM UNION SELECT FROM LARCE_and MEDIUM LM The LM.CN, LM CEO, LM EE QUALITY result is (LM.EE.SRC1= SMEE.SRCl) shown below. <CNI, is not commutative below SELECT FROM with quahty union operation mlo SQL query which switches the order ot the union Example 4: as A Get all the companies which are clabSiticcJ as only Large_and_Medium companies but not Small_and_Medium companies. corresponding SQL query is shown as rollows: SELECT FROM DIFFERENCE LM.CN, LMCEO, LM EE SELECT SM CN, SM CEO, SM EE LARCE.and.MEDILM LM FROM with QUALITY This SQL query SMALL_and_MEDILV1 S.V1 = SM EE SRCl SRC2) (LM EE.SRCLSRC2 can be accomplished through a Difference operation. below. <CN, nil> The result is shown Cartesian Product 4.3.2.5. The Cartesian product Let t, € qr the tuple 6 qs. Let t2 The tuple tj. qr X^ qs = ( t I X^ t, denote the i'^ <L.MCN' niio QR qs, be of degree t, and t2(i) r and QS be denote the i'^ of degree attribute of from the Cartesian product of qr and qs denoted as qr X^ qs, is s. will be defined as follows: € qr, Vt2 6 qs, A t(r+l) ='" t(2) = t,(2) A t(2)="^ t,(2)A tjd) A t(r+2) = result of the Cartesian product shown below. Let attribute of the tuple in the quality relation resulting = t,(l)At(l)="'t,(l)A t(r+l) = tjd) The t also a binary operation The Cartesian product of qr and of degree r+s. t(l) t,(i) is t2(2) a t(r+2) ... ="" t(r) t2(2) = t,(r)A t(r)='"t:(r)A a ... t(r+s) = t2(s) a t(r+s) =' t2(s) between Large_and_Medium and Small_and_Medium is ) We have presented the attnbute-based model including set of integrity constraints for the model, and the capabilities of this mdicator algebra. a quality algebraic operations are exemplified in the context or the model and future research a description of the SQL model structure, a In addition, each of the query. The next section discusses some of directions. Discussion and future direcrions 5. The attribute-based model can be applied in many different ways and some of them are listed below: • The retain the origin • A model ability of the user can Example filter the data retneved from a database QUALITY in makes The reputation inspected. according to quality who requirements. capability is very useful in is In retrieved as among certification. inspected or certified the data and when it A was enhance the believability of the data. The quality indicators associated with data can help semantic incompatibility this. EE.SRCl.SRC2='Zacks'." of the inspector will to resolve possible to it Figure 9 illustrates Data authenticity and believability can be improved by data inspection and quality indicator value could indicate • at multiple levels only the data furnished by Zacks Investment Research specified in the clause "with • support quality indicators and intermediate data sources. The example for instance, 1, to clarify data semantics, which can be used data items received from different sources. an interoperable environment where data in different This databases have different semantics. • Quality indicators associated with an attribute values. it For example, can be interpreted the spouse name is if may facilitate a better interpretation of null the value retneved for the spouse field (i.e., tagged) in several ways, such as unknown, or (3) this tuple is is empty (1) the • In a data quality control process, the source of error In this paper, and processed. when by examining quality we have Specifically, investigated we have requirements analysis and specification, (2) is employee unmarried, table (2) from the field. errors are detected, the data adnrvirustrator can identify indicators such as data source or collection method. how (1) employee inserted into the materialization of a view over a table which does not have spouse m an employee record, quality indicators may be specified, stored, retrieved, established a step-by-step procedure for data quality presented a model for the structure, storage, and processing 25 of quality relations and quality indicator relations (through the algebra), and of are artively pursuing research denved data investigating values of its (e.g., m the following areas: combining accurate monthly data mechanisms components. to determine the quality of (2) In order to touched upon and control. functionalities related to data quality administration We (3) vvith (Din order less to determine the quality accurate weekly data), we are derived data based on the quality mdicator use this model for existing databases, which do not have tagging capability, they must be extended with quality schemas instantiated with appropriate quality indicator values. effective. (3) We are exploring the possibility of Though we have chosen the relational model oriented approach appjears natural to model data and quality control methods), we its to making such a transformation cost- represent the quality schema, an object- quality indicators. Because many of the mechanisms are procedure oriented and o-o models can handle procedures are investigating the pros and cons of the object-oriented approach. 26 (i.e., References 6. 11 Ballou, D. P. Pazer, H. St (1985) L. Modeling Data and Process Quality output Information Systems. Management 2) & Ballou, D. P. in Multi-input, Multi- Science. 11(2), pp. 150-162. Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures in Information of Management Science. li(6), pp. 509-521. Systems. International fournal 31 41 Batini, C, Lenztnni, M., & Navathe, S. (1986). A comparative analysis of methodologies database schema integration. ACM Computing Survey, 1S(4), pp. 323 - 364. Codd, ACM, 51 relational model of data for large shared data banks. CommunicaHons of the li(6), pp. 377-387. Codd, Codd, capture more meaning. ACM practical foundation for productivity, the 1981 ACM Extending the relational database model on Database Systems, i(4), pp. 397-434. E. F. (1979). Transactiofxs 61 A E. F. (1970). for E. F. (1982). Relational database: Turing Award Lecture. Commumcations An A of the ACM, to i5(2), pp. 109-117. 7) Date, C. 8) Garvin, D. A. (1983). Quality on the 9) Garvin, D. A. (1987). Competing on the eight dimensions of quality. Harvard Business Review. J. (1990). Introduction to Database Systems (5lh line. ed.). MA: Addison-Wesley. Reading, Harvard Business Review. (September- October), pp. 65- (November-December), pp. 101-109. 10] Garvin, D. A. (1988). Managing Quaiity-The Strategic and Competitive Edge The Free Press. 11) Huh, 12) Y. U., et al. (1990). DaU (Quality. Information (1 ed.). New York; and Software Technology. 12(8), pp. 559-565. Johnson, J. R., Leitch, R. A., k Neter, J. (1981). Characteristics of Errors in Accounts Receivable and Inventory Audits. Accounting Review. Sfi(April), pp. 270-293. 131 Juran, J. M. (1979). Quality Control 141 Juran, J. M. & Gryna, F. M. Handbook (3rd ed.). New York: McGraw-Hill Book Co. (1980). Quality Planning and Analysis (2nd ed.). New York: McGraw Hill. 151 16) 17) Khoshafian. S. N. & Copeland, G. P. (1990). Object Identity. In 37-46). San Mateo, CA: Morgan Kauhnann. Uudon. K. (Ed.), (pp. C (1986). Data Quality and Due Process in Large Interorganizational Record Systems. of tht ACM, 2S(1). pp- 4-11. Uppuhiri. V. R. R. (1990). Data Quality ControL Theory and Pragmatica (pp. 360). York: tfciiil Dekkcr, Inc. Liepins, G. New 19] Zdonik* D. Maier iQug, A. (1982). Equivalence of relational algebra and relational calculus query languages having aggregate hinctions. Tht Jourml of ACM. 23, pp 699-717. Communicatiom 18] S. B. Uepins, O. LIk C0M9). Sound Data Are a Sound Investment. Quality Programs. (September), pp. 61- 63. 201 Maier, D. (19«3). TV Thtory of Rektional Databases (1st ed.). Rockville, MD: Computer Science Presa. 21] McCarthy, Medea 22] J. L (1982). MeHdatM Management for Urge Statistical Databasa. Mexico City, 1982. pp. 234-243. McCarthy, J. L. (1984). ScienHfic Information - School, Monterey, CA. 1984. pp. 27 Data > Meta-data. U.S. Naval Postgraduate (231 McCarthy, (24) Morey, L. (1988). The Automated Data Thesaurus: A >Jew Tool for Scientific Information. J. Proceedings of the 11th International Codata Conference, Karlsruhe, Germany. 1988. pp. (1982). C. R. Communications (251 N'avathe, of the Batini, S., Estimating and Improving the Quality of Information ACM. 2i(May), pp. 337-342 C, Cen, Si S. (1992) The Entity Rdatwnshiip Approach . New m the MIS. york: Wiley and Sons. (261 Rockart, Sloan F. J. & Short, J. IT in the 1990s: E. (1989). Management Review. Sloan Managing Organizational Interdependence. School of Management, .Vl/T, liZ(2), pp. 7-17. E. (1991). Using Annotations to Support Multiple Kinds of Versioning in an Object-Onented Database System. ACM Transactions on Database Systems. ]£(No. 3, September 1991), pp. 417-438. (271 Score, (281 Siegel, & M. Madnick, S. E. (1991). A metadata approach to resolving semantic conflicts. Barcelona, Spain. 1991. pp. (29) (301 Teorey, T. Mateo, CA J. : (1990). Database Morgan Kaufman Modeling and Design: The Entity-Relationship Approach Wang, R. Y. St Wang (Ed.), Information Technology Kon, H. . San Publisher. B. (1992). Towards m Total Data Quality Action: Management (TDQM). In R. Y. Englewood Cliffs, NJ: Trends and Perspectives Prentice Hall. [31] Wang, Y. R. <fc Guarrascio, L. M. (1991). Dimensions of Data Quality: Beyond Accuracy. (CISL-91Composite Infornuhon Systems Laboratory, Sloan School of Management, Massachusetts Institute of Technology, Cambndge, MA, 02139 June 1991. 06) Y. R. & Madnick, S. E. (1990). A Poly gen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538. (321 Wang, [33] Zarkovich. (1966). Quality of United Nations. Statistical Data . Rome: Food and Agriculture Organization 28 of the Appendix A: Premises about data 7. Below we present premises To analysis. related to data quality facilitate further discussion, refers to both quality parameters quality requirements analysis we modeling and data quality requirements define a data quality and quality indicators as shown ath-ilyv,tg in as a collective term that Figure A.l. (This term is referred to as a quality attribute hereafter.) Data Quality Attrlbutas (coll*cllva) among Figure A.l: Relationship Premises related 7.1. to data quality Data quality modeling is modeling captures many of the many captures premises relate (Premise 1.1> modeling an extension of traditional data modeling methodologies. structural of the structural to these quality attributes, quality parameters, and quality indicaton. and semantic issues underlying and semantic issues underlying data may application a name For example, the be an entity attribute it if of a teller initial may be modeled is a judgnnent call on (i.e., In some cases a quality an application entity's attribute) or as a who performs a transaction in a banking application requirements state that the teller's name as a quality attribute. modeling perspective, whether an a quality attribute The following four quality. (Relatedness between enhty and quality attributes): be included; alternatively, From As data modeling data quality modeling issues. attribute can be considered either as an entity attnbuie quality attribute. data, data quality attribute should be modeled as an the part of the design team, entity attribute or and may depend on the initial application requirements as well as eventual uses of the data, such as the inspection of the data for distribution to external users, or for integration with other data of different quality. distribution and integration of the information quality of the data they use. When the data is is that often the users of a The relevance given system "know" of the exported to their users, however, or combined with information of different quality, that q\iality nruy become unknown. A g\iidclfaM to this judgment is to ask what information the attribute provides. If, on process, such as the other hand, the information relates more when, where, and by whom the data the attribute may be considered an entity to aspects of the data manufacturing provides applicMtton information such as a customer name and address, attribute. If it was nnanufactured, then this may be a quality attribute. In short, the objective of the datt quality requirement analysis attributes, but also to is not strictly to develop quality ensure that important dimensions of data quality are not overlooked entirely requirement analysis. 29 in (Premise orthogonal related not orthogonal), such as for real time data. (Heterogeneity and hierarchy (Prenruse 1.3) may differ across databases, entities, attributes, may be university database example: Attribute example: may example: data about an international student Premises related Because and instances. human supplied data): Quality of data Database example: may may be more needed is this phenomenon "data quality is in in a Entity entity). Instance be less interpretable than that of a domestic student. to data quality definitions insight accurate than are addresses. and standards across users for data quality modeling and different people may have different opinions regarding data quality, different quality definitions call this information be less reliable than data about students (an the student entity, grades in in the quality of of higher quality than data in John Doe's personal database. data about alumni (an entity) 7.2. Different quality attributes need not be For example, the two quality parameters credibility and timeliness are one another. to (i.e., (Quality attnbute non-orthogonality): 1.2) the eye of the beholder." and standards may The following two result. We prenruses entail phenomenon. (Premise indicators may (Users define different quality attributes): 2.1) vary from one user f>arameter for a research report may need to in and quality manager the quality be inexpensive, whereas for a financial trader, the research report be credible and timely. inexpensiveness (Quality parameter example: for a to another. may (Quality parameters Quality indicator example: terms of the quality indicator (m(5netary) inexpensiveness in terms of opportunity cost of her own cost, the manager may measure whereas the trader may measure time and thus the quality indicator may be retrieval time. (Premise differ 2.2) from one user (Users have different quality standards): For example, an investor following the to another. consider a fifteen nninute delay for share price price quotes in real time may Premises related 7.3. A single user data used. This to movement of a stock be sufficiently timely, whereas a trader is different quality attributes summarized in and quality standards for the different Premise 3 below. have different quality attributes and quality standards across databases, instances. Across attributes exajnple: for certain numba of employees. compB\ics, but iK>t for A user may need Across instances example: others due to the fact that ^9; 30 A user nnay entities, attributes, or higher quality information for the phorie numt)er 3 9080 00932 7625 ^97 who needs to a single user (Premise 3) (For a single user; non-uniform data quality attributes and standards): than for the may may not consider fifteen minutes to be timely enough. may have phenomenon Acceptable levels of data quality A user may need some companies high quality information are of particular interest Date Due Lib-26-67