DEW-* HD28 .M414 no. <?3 ALFRED P. WORKING PAPER SLOAN SCHOOL OF MANAGEMENT Quality Data Objects WP #3517-93 December 1992 CISL Richard Y. M. P. WP# 92-06 Wang Reddy Sloan School of Management, MIT MASSACHUSETTS INSTITUTE OF TECHNOLOGY 50 MEMORIAL DRIVE CAMBRIDGE, MASSACHUSETTS 02139 Quality Data Objects WP #3517-93 December 1992 CISL Richard Y. M. P. WP# 92-06 Wang Reddy Sloan School of Management, * MIT see page bottom for complete address Richard Y. Wang E53-322 M. Prabhakara Reddy Sloan School of Management Massachusetts Institute of Technology Cambridge, MA 01239 3 1993 1 Quality Data Objects Richard Y. Wang M. Prabhakara Reddy December 1992 CISL-92-06 Composite Information Systems Laboratory E53-320. Sloan School of Management Technology Cambridge. Mass. 02139 Massachusetts Institute of Prof. Richard Wang (617)2530442 Bitnet Address: rwang@eagle.mit.edu ATTN: 1992 Richard Y. Wang and M. Prabhakara Reddy Work reported herein has been supported, in part, by MIT's Total Data Quality Management (TDQM) project. MIT's Productivity From Information Technology (PROFIT) Consortium. MIT's International Financial Service Research Center (IFSRC) and MIT's Center for Information Systems Research (CISR). In particular, the authors wish to thank Prof. Stuart Madnick and Dr. Amar Gupta for their support to this research. Acknowledgments Quality Data Objects ABSTRACT Needs resources are becoming increasingly This paper investigates critical. associate data with quality information that can help users quality of data. investigate its Specifically, includes a description of the and object, a mechanism The behavior object. we propose and behavior. structure management for a quality perspective in the datum how make judgments The structure of the quality data datum to of the the concept of quality data object object, its to associate the of data and object corresponding quality description object with its of the quality object includes a set of quality description methods quality dimensions (such as timeliness, completeness, credibility). to measure In addition, we have developed a quality data object algebra that includes quality comparison methods and an algebra domain. It that extends the relational algebra to the quality data object allows for a systematic construction of retrieval methods for quality data objects. The concept of quality data the design object object presented in the paper and manufacture of data products. proposed in this We is step toward envision that the quality data It will enable users to the quality of data products according to their chosen criteria; hope first paper can be used as basic building blocks for the design, manufacture, and delivery of quality data products. users to a buy data products based on their quality requirements. that the concepts of quality data objects improve data quality and data reusability. it measure will also enable In this manner, and quality data products we will help 1. 2. Introduction 1 work 1.1. Related 1.2. Research Focus 2 3 The quality data object 2.1. 5 Structure of the quality data object 2.1.1. Semantics of the is-a-quality-of link 2.1.1.1. 2.2. 6 Difference between is-a-part-of and is-a-quality-of 6 The quality data object schema Behavior of the quality data object 2.2.1. Currency 3.2. 8 9 Volatility 10 2.2.3. Timeliness 11 11 2.2.5. Accuracy Completeness 2.2.6. Credibility 12 A quality data object algebra 3.1. 7 2.2.2. 2.2.4. 3. 5 Difference between is-a and is-a-quality-of 2.1.1.2. 2.1.2. 5 Quality comparison methods in a quality-description object i_deep_equal _method q_equal _method Quality algebraic methods 12 13 14 3.1.1. 15 3.1.2. 15 3.2.1. 3.2.2. 3.2.3. 3.2.4. 3.2.5. Method Union Method Difference Method Projection Method Cartesian Product Method Selection 4. Concluding remarks 5. Appendix 6. References A 16 16 16 17 17 18 19 20 22 Quality Data Objects Introduction 1. The quality implemented by a conventional data base management of data in a database systems (DBMS) has been treated, primarily, through functionalities such as recovery, concurrency, and security control integrity, (e.g., Bernstein Codd, 1982; Codd, Wood, 1981; Hoffman, 1977; Hsiao, Kerr, Qian & 1986; Date, 1981; Date, 1990; & Denning 1981; Bernstein, et & some failure has consistency of data is Madnick, 1978; Korth & failures, that are it is known to a state that is to be when multiple users update the database concurrently. Integrity aims may be caused by by mistakes on the part of the operator or the application programmer, by system the last of these, however, falsification; is much not so a matter of of security; protecting the database against illegal operations, as opposed to those merely invalid, is the responsibility of the security subsystem (Date, 1985). These functionalities are necessary but not the end-user's perspective (Johnson, Leitch, Liepins, 1989; & rendered the current state incorrect. Concurrency control ensures that the preserved even by deliberate integrity as 1970; Silberschatz, 1986; Martin, 1973; preventing invalid updates against the database from happening. Invalid updates errors in data entry, Codd, 1981; al., Denning, 1979; Fernandez, Summers, Wiederhold, 1986; Ullman, 1982). Recovery restores the database correct after at & Goodman, Wang & & sufficient to ensure data quality in the Neter, 1981; Laudon, 1986; Liepins Kon, 1992; Zarkovich, 1966). Integrity constraints and example, are essential to ensuring data quality in a database, but they are often & DBMS from Uppuluri, 1990; validity checks, for just the beginning of a continuing data-integrity program that will ultimately address the real needs of users for data that can be used as an input to the user's decision making process (Maxwell, 1989). In general, data in the DBMS may be used by a range of different organizational functions with different perceptions of what constitutes quality data in terms of dimensions such as accuracy, completeness, consistency, and timeliness (Ballou & Pazer, 1987; Huh, et al., 1990; Redman, 1992). Consider the following example scenarios: • A person's name is carried as Rockart in yet another. J. F. Rockart in once place, John All are technically "true" provided by the conventional in the database consistently ? F. Rockart in another, and Jack and would pass the DBMS, but which one should be considered integrity constraints as accurate and stored A • client workstation at the end runs business applications using data downloaded from a database server Whereas, data of each day. in the server is updated instantly with changes and new information through on-line transaction processing. Thus, data in the client workstation is never current from the server and some user's viewpoint. Earning estimates for companies are stored • and how are not, making it difficult to in a database but who made these estimates, when, judge the credibility of the data by those who are not familiar with the context. In these and other similar situations, the quality of data a matter of data validity but rather of its information that can help users usag e. how would be useful It make judgments managed by the to associate and manage data is equipped with the capabilities measure the quality of data they need and conforms with LL is not so data with quality to structure in such a way that users can be to retrieve the data that their quality requirements. Related work An attribute-based research that facilitates cell-level tagging of data has been proposed to enable users to retrieve data that conforms with their quality requirements (Wang, Kon, 1993; Wang, Reddy, & Kon, effort are a methodology 1992; Wang & Madnick, for analyzing data quality model description, relational indicator algebra can be used to process From The problem with Li, 1987), and Codd, 1970; SQL (1) In corresponding quality description through the created through the concept of quality key. Madnick, ER model proposed an attribute-based model encompassing a quality indicator algebra that extends the 1979; Codd, 1982; Codd, 1986). The quality queries that are augmented with quality indicator these quality indicators, the user can this research is twofold: & Included in this attribute-based research requirements that extends the a set of quality integrity rules, model proposed by Codd (Codd, requirements. 1990). & by Chen (Chen, 1976; Chen, 1984; Chen, 1991; Chen a much of the quality of data for the specific application at hand. The research question here to DBMS join (2) In make order a better judgment of the quality of to associate the data. application data with operation in the model, an artificial link needs order to be able to judge the quality of data, to it its be is necessary to compute data quality dimension values and other procedure-oriented quality measures. Although these could be accomplished using the that of the object-oriented approach. relational approach, Moreover, this research it is not as natural compared to did not address issues involved in measuring data quality dimension values. In other related research efforts that aim at annotating data, self-describing data files and meta-data management have been proposed at the schema level (McCarthy, 1982; McCarthy, 1984; however, McCarthy, 1988); information at the instance dimension of data concept. in no level. been offered specific solution has In (Sciore, 1991), annotations are manipulate such quality to used an object-oriented environment. However, data quality Therefore, a more general treatment is Still is a multi-dimensional necessary to address the data quality issue. importantly, no algebra or calculus-based language annotations associated with the data. support the temporal to provided is More support the manipulation of to other research efforts (Codd, 1979; Siegel & Madnick, 1991) have dealt with data tagging without either an algebra or a set of quality measures for data quality dimensions. Li Research Focus In this paper, we advocate that data quality must be modeled as an integral part of a data object rather than simply as a set of functionalities of the modeling construct of quality data object in which each datum procedures used to indicate the quality of the data object. that set of quality algebraic Many concepts in methods the associated with appropriate data and We present a paradigm can be applied They are fundamental The reader via the object-oriented approach. set of quality measure methods timeliness), is support the quality data object to in our decision to referred to the model the quality data Appendix object for a detailed discussion of constructs in the object-oriented paradigm such as inheritance, method, polymorphism, active and message can be applied value, In this research, each to support the quality data object. datum is modeled the quality information corresponding to the as an object called a datum is constructed from a object . n) datum object and its datum object As shown in Figure . called a quality description object quality-of link associates a quality description object with ..., we propose that supports the manipulation of quality data objects. the object-oriented (Banerjee, 1987; Snyder, 1986). 1, is specifically, compute quality dimension values (such as accuracy, consistency, completeness, and and a how DBMS. More its datum object. associated quality description object . The The composite is 1, is-a- object called a quality data Instance variables of a quality description object include descriptive data (qualityjndicatorj, i= and procedural data (quality_procedurej, j= 1, ..., m). Figure A 1: A Quality Data quality data object called Earnings-Estimate is Object exemplified in Figure 2. Note that Eamings- Estimate-Qual and Source-1-Qual are quality description objects for Earnings-Estimate and Source-1 respectively. Note also Qual, a quality data object. is itself that Source-1, an attribute of the quality description object Eamings-Estimate- Earnings- Eatlmats is-a-quality-of Figure 2 The Quality Data Object Earnings-Estimate Section 2 presents the quality data object. Section 3 presents a quality data object algebra that allows for the construction of methods which conform with the user's quality requirements. Conclusions and future directions are presented in Section 4. The 2. quality data object In this section, the quality data object presented in terms of is structure and behavior. its Included in the structure of the quality data object are a definition of the components of a quality data object, the semantics of is-a-quality-of, and the quality data object schema. the quality data object are a discussion of dimensions of data quality Included in the behavior of and quality measure methods and messages. 2.1. Structure of the quality data object Following the object structure defined Khoshafian & Copeland, 1990; Zdonik & in the object-oriented Maier, 1990), we paradigm (Banerjee, 1987; define two object types for the quality data object. Let I denote the such as integer, An • set of real, string. system generated Let identifiers. B denote the set of base atomic types Then object is defined as a primitive object provided that its value belongs to B. The value of a primitive object can not be further subdivided. In the context of the quality data object, every datum An • object is a primitive object. object is defined as a tuple object are distinct attribute names and i 4 's if its in Figure a is-a-quality-of link. object which is 1, is is the quality description object The composite object resulting a n :i n > where ai's In the context of the quality I. from is associated with this association is a unit of manipulation. Thus every quality data object is its datum object through defined as a quality data a composite object. This levels. Semantics of the is-a-quality-of link Note that there is no specific mechanism quality description object with the primitive generalization (is-a) nor the aggregation the is-a-quality-of link. and the ..., a tuple object. composite property can be nested in an arbitrary number of 2.1.1. of the form <ai:ii, a2:i2, are distinct identifiers from data object, every quality description object As shown value The is-a-part-of link is is-a link is used in the object-oriented datum (is-a-part-oft used to associate object. More paradigm to associate the specifically, neither the construct can be used to capture the semantics of to associate a subclass object an object with its with its super class assembly object (Banerjee, 1987). object; Difference between 2.1.1.1. The datum link is-a-quality-of object and its and is-a is is-a-quality-of conceptually different from quality description object semantically different from is-a is because the relation between a is-a not a super-class vs. subclass relation. because the construct inheritance that associated with is-a is It is is not applicable to the is-a-quality-of link. Difference between is-a-part-of and i$-a-quality-of 2.1.1.2. The conceptual difference between is-a-quality-of and is-a-part-of represents the relation between the objects having part and assembly relation; datum represents the association between a To present object the semantic difference and between that is-a-part-of is whereas is-a-quality-of quality description object. its is-a-quality-of and to object Oj, then O, we first discuss the said to have composite is-a-part-of, semantics of is-a-part-of. If there is reference from Oj. object of O,. object, a is-a-part-of link The object Oj is from object O, called the parent object of O, Based on whether an object has a and whether the existence of an object composite references have been formalized composite reference, reference, and (4) (2) is-a-part-of link object Oj depends on the existence of & (Kim, Bertino, parent object, four types of its Garza, 1989): (1) dependent exclusive dependent shared composite (3) independent shared composite reference. is-a-part-of and is-a-quality-of comes from has only two composite references instead of four in the case of the fact that is-a- Specifically, let of • Oq and A Od set of objects to Od denote a datum object and whom Oq Oq has a is-a-quality-of Let del(O q ) refer to . a quality description object of link. We is-a-part-of. them as dependent exclusive quality reference and dependent shared quality reference denote the component called the is with only one object or more than one independent exclusive composite reference, The semantic difference between quality-of and the is Od . Let Q<O q ) and del(Od ) denote deletion respectively. Then, dependent exclusive quality reference from Od to Oq means that Q(O q ) = {O d }, and del(O d ) implies del(O q ). • A dependent shared quality reference from O d then del(Od ) implies del(O q ). In existence If Q(Oq D (Od ) ) to Oq means that its ) then del(Od ) implies {O d } dependent quality references the quality description object depends on the existence of Q(Oq 2 (O d corresponding datum is is If Q(Oq deleted from treated as a object. ). weak object ) = (O d ) Q(O q ). and its Dependent exclusive quality references increases storage overhead. Whereas, dependent shared quality references are beneficial from the storage view point but causes problems during deletion and update. 2.1.2. The quality data object schema Quality data objects are used as building blocks to construct a quality data object schema. For exposition purposes, we first illustrate, in Figure 3, a composite object company in the object-oriented paradigm, which has instance variables Company-Name, CEO-Name, and Earnings-Estimate (each of the instance variables is a primitive object, hence the composite object company). Company Company-Name^ (CEO-Nam*} (^Earnings-Estimate > < Figure Let us now suppose 3: The Object Company that out of these three primitive objects, the Estimate are quality sensitive, and are converted into quality data objects as CEO-Name and shown in Figure 4 below. is-a-quality-of L D> Collection-Procedure Figure 4: The Quality Data Object Q-Company Earnings- The quality data objects communicate with in the object-oriented object it encapsulated as a unit of manipulation. is through pre-defined methods only. paradigm. to retrieve the data that Q-company In addition, conforms with it It behaves has the capabilities to in the That is, other same way as an object measure the quality of data and users' quality requirements, as will be discussed in Section 2.2. Using quality data objects as basic building blocks, more complex objects can be constructed through other object-oriented constructs such as aggregation Figure 5, for (is-a-part-of) and generalization (is-a). In example, the Directed Acyclic Graph (DAG) constructed with the quality data objects Q- Company, Q-IT-Department, Q-Finance-Department, and Q-High-Tech-Company forms a quality data object schema. Q-Company is-a-part-of is-a-part-of Q-fT-Department 5: Quality of Object Schema have presented the quality data object unique construct that is paradigm, now it is Q-Hgh-Tech -Company Q-Finance-Department Figure We is-a to the quality in terms of its Through structure. the is-a-quality-of data object and the other constructs in the object-oriented possible to construct a quality data object schema. behavior of the quality data object that will addresses the issues of The next how to section presents the measure the quality of data. 2,2. Behavior of the quality data object The multi-dimensional and (Wang, Reddy, considering & how Kon, 1992; a user hierarchical characteristics of data quality Wang & may make Strong, 1992). illustrate these two characteristics here decisions based on certain data retrieved from a database. user must be able to get to the data, which means and We were investigated privilege to get the data). means that the data must be accessible by First the (the user has the Second, the user must be able to interpret the data (the user understands the syntax and semantics of the data). Third, the data must be useful (data can be used as an input to the user's decision making process). Finally, the data must be believable extent that the user can use the data as a decision input). Resulting from this dimensions: accessibility, interpretability, usefulness, the user, the data must be available (exists in and some form believability. that can In list to the user (to the are the following four order to be accessible be accessed); to to be useful, the data must be relevant consider, among requirements for making the decision); and (fits to be believable, the user may other factors, that the data be complete, timely, consistent, credible, and accurate Timeliness, in turn, can be characterized by currency (when the data item and volatility (how long the item remains valid) characteristics of data quality provide a conceptual was . stored in the database) These multi-dimensional and hierarchical . framework for defining the behavior of the quality data object. In general, the behavior of object datum of the quality data object, both messages meant for an is encapsulated in objects its methods and messages. and quality description and update, their creation, deletion, objects will In the context have methods and just like objects in the object-oriented paradigm. Each message Message The described using the following syntax. is := (receiver) (receiver) part is (message_nameX[(argument) an identifier ]) denoting an object that receives and interpret the message. The (message_name) gives the name of the message which helps the receiving The (argument) part message with a particular method. by the method in the receiving object. of the message object to associate the carries data which is required A message can have zero, one, or more than one arguments, as the brackets " indicate. Each method is described using the following syntax. Method_name: (name of Invoked_by: (messaget, message]) Method_action: (procedural description of the method) the method) Only those methods and messages paper. Below 2-2.1. we define key methods that measure data quality. Currency The currency dimension is solely a characteristic of storage of the data. currency on a continuous scale from possible, state value of C is 1 to 1. to the oldest stored data. computed dynamically using indicator value tagged to every instance. • related to the data quality aspect are presented in this The Let state C would be assigned We propose to measure to data that are as current as represent the measure for currency (0 £ C £ 1). The the creation time of the instance. Creation time is a quality Depending on the message, the currency method determine the currency of an individual instance, can: determine the average currency of the instances of the class, determine the percentage of instances whose currency meets one of the following conditions (referred to as 6) ( 1 ) A when compared to the total instances of the class available in the database: below or above a user -defined currency description of the method is level, (2) in between a user-defined currency interval. given below. Method_name: Currency _method Invoked_by: (receiver) currency(instance_variable) (receiver) average_currency(instance_variable) (receiver) 8-currency(instance_variable, 9) For the message currency, the method returns a pairwise value, (instance value, Method_action: currency), for all For the message the instances satisfying the qualification. average _currency, the currency method returns the average currency of instances of the instance variable. method returns For the message 9-currency, all the currency the percentage of the currency values of the instances that satisfy the condition 8. 2,2.2. Volatility The time. state intrinsic property of the data was For example, the fact that George Washington remains true no matter may an volatility of data is how long ago that be woefully out of date. We fact propose to was recorded. On measure volatility Let Xj denote a random The variable, volatility is i = 1, 2, ..., measured N, then V is on computed - i=l N-l N J^Xi 10 its storage president of the United States a continuous scale from 1 via the coefficient of variation, £ X? andX= unrelated to the other hand, yesterday's stock quote V=4 X where S= first is do not change over time) and refers to data that are not volatile at all (they that are constantly in flux. the which NX 3 to 1 where refers to data denoted by V. as follows (Kazmier, 1976): A description of the method is given below. Method_name: Volatility_method Invoked_by: (receiver) Method_action: The system monitors updates volatility(instance_variable, qualification_for_instances) to the value of an instance variable that is being modified and simultaneously computes the following three required parameters order to compute the coefficient of variation: in N, the (a) total number of sum of N updates, (b) X, the average of all updated values, and (c) 2^ X; the , i=l squares of updated values. These three parameters are stored as quality indicators in the quality description object corresponding to each instance variable. all The method returns a pairwise value, (instance value, volatility), for qualified instances. 2.23. Timeliness Timeliness situation is to is defined as a function of currency and volatility of a data value. The most stable have data for which the currency (unchanging) or both. For such data there data are old (currency = 1) combining currency and representing the best and and highly is (entered very recently) or the volatility of no timeliness problem. The worst volatile (volatility volatility via their 1 is = 1). We propose T= VCV root-mean square: to situation arises when measure timeliness by where 5 T < 1 with the worst case. A description of the method is given below. Method_name: Timeliness_method Invoked_by: (receiver) timeliness(instance_variable, qualification_for_instances) Method_action: The method returns a pairwise qualified instances to the 2.2.4. value, (instance value, Timeliness), for for all message sender. Accuracy In general, a user can test the accuracy of the data present in a database with a set of sample data considered to be accurate by the user. For example, a user payroll database can basis of this test, first the user who wants check whether his salary (known data) makes judgment whether accuracy on a continuous scale from to 1 where to state accurate) case. 11 is to check the accuracy of a recorded correctly or not. query the database or not. On the We propose to measure refers to best (all accurate) and 1 the worst (none A method description of the is given below. Mithod_name: Accuracy_method Invoked_by: (receiver) Method The action: first accuracy( instance_variable, list_of_known_instances) argument in the parenthesis gives the whose accuracy was that were known to the of an instance variable The second argument gives message sender. This known a list of instances of instances is considered as true values. The method computes the percentage, denoted as p, of list match between the true values and the recorded values and returns the (1-p) to message sender. Completeness 2.2.5. Following Ballou and Pazer, recorded (Ballou where required. name Pazer, 1987). refers to the best state measure & implies that it we define completeness as We propose and 1 to values for a certain variables are measure completeness on a continuous the worst case. has no null instances all scale from to 1 For a given instance variable, the completeness in the database, Using values recorded for the instance variable are null. whereas the measure this, a 1 implies all the user can measure the degree of completeness of the database regarding an instance variable. A description of the method is given below. Method_name: Completeness_method Invoked_by: (receiver) completeness(instance_variable) Method The method measures action: instance variable the percentage, denoted as p, of when compared available in the database 2.2.6. The empty instances of the to the total instances of the instance variable and returns p to the message sender. Credibility credibility of a datum in a database is present in the quality description object of the computed based on datum and (2) the set Let x be an instance. Let q be the quality indicator of x ( the user wants to use to compute the credibility of x. and (1) of specifications given let "J' assigned to the quality indicator q by the user. Let 4 =0. The credibility of x is the user. Let uv, be the user's specified value for qt and 4 5j by be the number of quality indictors be the recorded value of the quality indicator q for x in the database. Let uv, =rv, else the quality indicator values computed by 12 8( w ( rv ( be the credibility weight be a binary variable defined as follows: the following expression: let 8i =1 if I wi*5j. A description of the method is given below. Method_name: Credibility_method Invoked_by: (i) (receiver) credibility-a(instance_variable [, (ii) (quality_indicator, quality _indicator_value, credibility_weight)]) (receiver) credibility-b(instance_variable [, (iii) (quality _indicator, quality_indicator_value )]) (receiver) credibility-c(instance_variable [, (quality_indicator, quality_indicator_value, credibility_weight)], desired_credibility) Method_action: For the message credibility-a, the method returns values instance_variable and their associated credibilities. quality_indicator is If of weight for each not specified (message credibility-b) then the method assumes equal weight for each quality indicator specified by the user and returns values of the instance variable and their associated credibilities. credibility-c, the whose We quality. method returns only those values credibility is the more than or equal For of the instance variable, to the desired_credibility. have presented the methods and messages that measure the key dimensions of data They define an important part of the behavior behavioral component of the quality data object the user's quality requirements. is In the next section, we critical present an algebra for the quality data object methods for the quality data object. A qualify data object algebra In order to retrieve quality data object instances from a database, it is those quality data object instances that conform with requirements for both the quality description portion. This requires a set of perform the operations such as The other the capability to retrieve data that conforms with that allows for a systematically construction of retrieval 3. of the quality data object. methods selection, projection, and to necessary to identify datum portion and the perform the comparisons and an algebra to join of quality data objects. Section 3.1 presents quality comparison methods. Section 3.2 presents an algebra that extends the relational algebra to the quality data object domain. 13 3.1. Quality comparison methods in a quality-description obj ect In this subsection the methods which are two different quality & We Copeland, 1990). and 0_equal i_deq?_equal, M_deep_equal, that underlie these Two primitive objects are defined to be QjLeep_equa\ Two tuple objects are defined to be l_deq>_equal on the same the values they take 2_deep_equal if l)_deep_equal. Let O] 0_deep_equal if their datum values and datum their values matches. if they have the Two and Two portions are identical. two quality data objects are i_deep_equal and Ojare We up to the " 1 j 02 that and set of attributes if tuple objects are defined to be two tuple Similarly, on the same attribute are (i- i_deep_equal. is quality data objects are l_deep_equal illustrate the is Let o-[ maximum the 7. In order to do to qi] , and the value of souree-1 and the value of source-2 would correspond In Figure to the object 02 7, let o-i and 02 be would correspond to the same. is . 7. first o-[ and 02, . exemplify the In Figure would correspond to vj Object c^ However, values they take on qi 31 are different (v 31 vs. x 31 >. follows that 0] we 2, to vo- earnings Source-1 Source-2 would correspond The quality data object objects. because both have the same datum value, v all the corresponding to v\-[. two quality data values they take on qi^qi^qiaare their level are identical. depth of both it, be a quality data object in Figure estimate would correspond to 01, and the value of earnings estimate would correspond if defined as M_deep_equal, denoted by Oi = M 02 above concepts through Figure 2. 'V is If first datum values and their if level are identical. \_deep_equal, then this relation notation used in Figure 7 via Figure qill, same the corresponding quality indicator values at the quality indicator values O] Ot on equality two comparison methods. the values they take denote two tuple objects of the context of the quality data object, two quality data objects are defined to be In Similarly, =* 02 if and some define the concepts of 0_deep_equal, on same attributes are 1_deep_equal. be i_deep_equal to if in detail also discussed, based first attribute are 0_deep_equal. the values they take objects are defined if two methods are special cases of these definitions provided in (Khoshafian and comparison methods are discussed not M_deep_equal to 02. 14 Ot is Since the is 1 Ot is 0_deep_equal _deep _equal to 02 because the not 2_deep_equal to 02 because the maximum number of level of Oj is 2, it ^ OfcCo) (qi,.v,) |,v,) /\ (4iV v 1l) («" 2' 1 (qi 2 .v - 2) (qi 3 3.1.1. v ("VV Wj'ty ) 3 \ v \ V 12 )(q, 2r 21 (q' ) Figure We now , 7: 3 v r 3i> (q' (^11^11) Quality Description objects : o, 12 . W3.V3) \ \ v , 2 ) (q 2 j r v (qi 3r 2i> x 3i> and 02 present the two comparison methods: i_deep_equal and 9_equal. i_deep_equal .method A description of the method is given below. Method_name: i_deep_equal _method Invoked_by: (receiver) i_deep_equal(object 1 ,object2, no_of_levels) Method_action: This method compares object and object2 and then returns True object2are i_deep_equal, where 'V is object! if and the no_of_levels specified by the user, else returns False. 3.1.2. 6_equal .method Two 0_deep_equal. quality data objects are 0_equal, If two objects o, and 02 have 9_equal then relation For example, consider 9 9_equal. objects Oi the values they take if = {qi,, qi 2/ qin, qi2il- is denoted by 9_M_deq>_equal, then relation is is Oi The quality description Similarly one can define 9_i_deep_equal_method and and 02 are 9_i_deep_equal, then relation on the attributes in 9 are =e 02. objects o, and 9_M_deep _equal_method. denoted by o, =6 (>) oj denoted by 01 = e<M) 02. For example, 9 = If { 03 are If two two objects o, and 02 ha ve qii } then o, and 02 have 9_M_deep_equal. A description of the method is given below. Method_name: 9_equal .method Invoked_by: (receiver) 9_equal(objecti,object2 , 9) Method This method compares object! and object2 action: object2 where 9 is and then returns True the set of quality indicators specified False. 15 by the if object, =9 user, else returns These two quality comparison methods are used to define the following quality algebraic methods. 3.2. Qua lity algebraic In this section, 3.2.1 Selection methods we Method Selecnon_method must selected first introduce quality algebraic methods to operate on quality data objects. selects only a subset of objects satisfy the selection criterion. Let order predicates. This operation creates collection O, which satisfy the predicates and the predicate q symbolically denoted as a^(0,p,q), aq A p and where ( m <, n) objects of type The predicate p q. n objects of type is I method is T. T from that each object Let p and q be members the on the datum a constraint of object The selection_method defined as follows: is O) a p(o) a (O, p,q) = (o (o e description of the m a collection of constraint on the quality description object. a is O be from an object collection such q(o)) given below. Method_name: Selection_Method Invoked_by: (receiver) selection(object_class, data_constiaint, quality _constraint) Method_action: This method checks each instance of the object_class to see whether they and the quality _constraint, and returns satisfy data_constraint object all instances of the object_class which satisfy both of these constraints. \2-2 Union Method In O] be this union_method the two operand quality data / the collection of n objects of type method is a collection of from the collection 0\ and when compared as u^ (Oi, O2, 9) In the objects (where n is <, p the collection of <, n+m) selects only those instances to the instances of ^q (Oi O2 , p T and O2 be object collections must be of the same m objects of type T. of type T. This method type. Let The result of selects all instances from the collection O2 which are not duplicates Oi. The logic of the union_method, which is symbolically denoted defined as follows , 6) » {o I Voe Oi above expression, " -. ) u { o V02 € I (ot= 02) a (01= and 02 are considered duplicates provided 0_deep_equal with respect to all O2 3d Oi a (o = 02)} " is that their quality indicators in 16 e 9. meant M 02) a {(o 1 =°o2 ) a (o,= to eliminate duplicates. datum portions Note -, that the are the 6 02)) Objects 01 same and they above definition } for the are union is commutative from the view point of the user who defined 9 c»i= 02 9. In general it is not commutative because, u does not mean (oi^cc). A description of the method is given below. Method_name: Union_method Invoked_by: (receiver) union(Oi,C>2,9) Method_action: Let resultl be the set of instances in the object collection 0\. Let result! be all O2 such the subset (need not be strict subset) of instances of from result =resultl In difference_method, the method is u result!. to any instance This method returns the set any instance in resultl. Let result. Method 3.23 Difference collection of 0_deep_equal and not 9_equl result! is not that two operand n objects of type T and O2 be a a collection of p which are not 0_deep_equal to objects in logic of the difference_method, —q (Oi,02,9) = collection of p 5 objects (where object collections O2 m objects of type T. n) of type T. The with respect to " (0\ denoted as O2 , {olVo,60i 3o2e02,(o= M , must be of the same all The O] be type. Let a result of the difference result consists of objects only from Oi quality indicator specified in 9. The 9) is defined as follows o,) a^ {(o 1 -°o a ) a(o,- 8 02)}} A description of the method is given below. Method_name: Difference_method Invoked_by: (receiver) difference(Oi, 02/9) Method_action: Let result be the set of all 0_deep_equal and 9_equal to the instances of Oi except those that any instance of O2. This method returns the are set result. 3.2.4 Projection Let Method O be an object collection of m objects of type T. Projection_method generates p (where p < m) T from the object collection O. Let o be an object in the collection O. The function returns an object o' of type T from the object o. The projection method also eliminates duplicate objects from the objects of type f q result. The logic of the projection_method, which is symbolically denoted as fl (O, f:T', 9) is as follows n q (O, f:T', 9)=(( n (O, f:T)}— [oj I o o € v 2 n (O, 17 f:T'), {(o^Oj) a (o^Oz)} 1 defined where n A (O, f:D = (f(o)l description of the 0€ O) method is given below. Method_name: Projection_method Invoked_by: (receiver) projecrion(0, 8) Method_action: Let O 7 be the object type whose instances variables are variables of O. 0\ Let f a subset of the instance be a function which takes an instance of Let resultl be the set of instances of (J. Let result C O and resultl instances generated by eliminating duplicates from the set resultl. instantiates be the set of The method returns the set result. 3.2.5 Cartesian Product Let Oi be an Method object collection of n objects of type Ti, of type T2. Let o, be an object in the collection constructs a new object 01 © 02 of type T3 from 0\ and 01 and let 02. / and 02 be let O2 be an object collection of an object in the collection O2. m objects The method Objects of type T3 consists of instance variables from both Ti and T2. The logic of the cartesian_product_method, denoted as n q (O, f:T', 0) is defined as follows X q A (Oi,02) = (o I Vo, e description of the Oi V02 method is e 02,0 = 0, ©02) given below. Method_name: Cartesian_product_method Invoked_by: (receiver) cartesian_product(Oi Method_action: Let O3 be a new and of 02- Let f object type , O2) which will have all the instance variables of 0\ be the function which take instances of Oi and instances of O2 and with these instances, the function method returns the set of instances of O3. f instantiates the object type O3. This Other algebraic methods such as Intersection_method and Join_method can be defined using the above defined five algebraic methods. 18 Concluding remarks 4. In this paper, help users we have make judgments question was capabilities to how investigated how with quality information that can to associate data of the quality of data for the specific application at hand. and manage data to structure in such a measure the quality of data they need and way that users could to retrieve the Our research be equipped with the data that conforms with their quality requirements. Toward object object. is this goal, we have proposed the concept of quality data object in which each datum associated with appropriate data and procedures used to indicate the quality of the Specifically, the is-a-quality-of link is corresponding quality description object. associated quality description object is proposed The composite to associate a object constructed called a quality data object. access object instances which matches users' quality requirements. measure methods that It object with It its provides methods which can also provides a set of quality we have developed volatility, timeliness, a quality data object algebra comparison methods and an algebra that extends the relational algebra quality data object domain. its from a datum object and compute quality dimension values including currency, accuracy, consistency, and completeness. In addition, that includes quality It datum datum to the allows for a systematic construction of retrieval methods for quality data objects. The concept of quality data manufacture of data products. object presented in the We envision paper is a first that the quality data object step toward the design and proposed in this paper can be used as basic building blocks for the design, manufacture, and delivery of quality data products. enable users to measure the quality of data products according to their chosen It will it will also enable users to purchase data products based on their quality requirements. In this manner, we hope that the concepts of quality data objects and quality data products data reusability. 19 will help criteria; improve data quality and Appendix A 5. Many concepts in the object-oriented paradigm can be applied They are fundamental object. approach. In this in our decision to model the quality data Appendix we discuss how constructs inheritance, method, to support the quality data object via the object-oriented in the object-oriented polymorphism, active value, and message can be exploited to paradigm such as support the quality data object. We present features of the object-oriented paradigm and relate them to the quality data object. Modeling Paradigm objects (Kim, 1989; Kim, paradigm, In the object-oriented 1990). This paradigm is conceptual entities are modeled as all particularly interesting to us because both data quality can be represented as objects, as Figures 1-2 illustrate. representation schemes for data and Estimate is modeled as an object In Figure 2, for It its eliminates the dichotomy of example, the datum object Earnings- its quality. its quality description attributes such as Source-1 and and and Reporting- Date are also modeled as objects. Inheritance Objects in an object hierarchy can inherit both the data and methods from their parent objects in the object hierarchy (Banerjee, 1987; Snyder, 1986; Zdonik context of the quality data object, quality information just like data Method The behavior state of an Zdonik object. a quality data object & of and methods an object Maier, 1990). is inherited Maier, 1990). by its In the child object, the Therefore, both quality indicators and quality automatically inherited. is procedures can be reused (Banerjee, 1987; whenever & in the object-oriented in the object-oriented A method paradigm is encapsulated in methods consists of code that manipulates In the context of the quality data object, dimension values are procedure-oriented, and are paradigm. mechanisms used difficult to to and returns the determine data quality express declaratively. Therefore, the procedural capability in the object-oriented paradigm can be used effectively to define quality procedures in a quality data object. For example, timeliness of a quality data object As another example, oriented and can be encapsulated as a method. deleted, and modified by the methods (Wang, Reddy, St In the object-oriented objects to define different procedures, of arguments (Zdonik & procedure- since objects are instantiated, of the object, the corresponding quality integrity constraints Kon, 1992) can be embedded Polymorphism is in the definition of these methods. paradigm, the same method name can be used in different and the same method can take Maier, 1990). This feature is different types or different number important in the context of the quality data object because data quality measure methods can be defined differently in different objects with the same 20 name. Moreover, the evaluation of a procedure depends on the type and number of arguments passed the procedure which, in turn, depend on the quality requirements of a user. One believability can be invoked with different sets of arguments: Estimate if the immediate source (e.g. the Wall Street Journal) is consider additional quality indicators such as source of source Zacks Investment Research which is user For example, the method may believe the Earnings- credible whereas another user (e.g., to may the Wall Street Journal quoted considered very credible by the investment community) as important in determining the believability. Using polymorphism, both of the users can use the same method but with different sets of arguments. Active values In the object-oriented paradigm, the values of active instance variables are computed is at run time based on values of other instance variables (Zdonik useful in computing data quality dimension values dynamically. in the eyes of the beholder (Wang, Kon, object need to & & Maier, 1990). This feature Since data quality, in a sense, lies Madnick, 1993), some quality dimensions of a quality data be computed dynamically based on (1) user requirements and (2) data and procedures encapsulated in the quality data object. For example, timeliness of a quality data object can not be stored as a value. It Messages messages (Maier & must be computed dynamically upon demand, as discussed In the object-oriented paradigm, objects can Stein, 1987). in Section 2. communicate with one another through Messages, together with any arguments that messages, constitute the public interface of an object. This feature is handy may be passed with the in the context of quality data object because the extra complexity introduced in a quality data object can be encapsulated by the interface of an object which is nothing but a collection of messages. 21 References 6. HI |2| Ballou, D P. & Pazer, H. L. (1987). Cost/Quality Tradeoffs for Control Procedures Systems. International lournal of Management Science, ii(6), pp. 509-521. Banerjee, J., et al,. (1987). Data Model Issues for Object-Oriented Applications. in Information ACM Transactions on Office Information Systems. £(1). |3] [4| Bernstein, P. A., et al. Chen, Chen, CA. [71 Query Processing (1981). System in a in Distributed Database Systems. for Distributed Databases (SDD-1). Transactions on Database Systems, 6(4), pp. 602-623. P. The Entity-Relationship Model (1976). P. ACM/TODS, L [61 N. (1981). Concurrency Control 11(2), pp. 185-221. ACM [5] & Goodman, Bernstein, P. A. Computing Surveys, - Toward Unified View of Data. a pp- 166-193. An P. P. (1984). Algebra for a Directional Binary Entity- Relationship Model. Los Angeles, 1984. pp. 37-40. Chen, Entity -Relationsip Approach P. (1991). P. to Database Design Wellesley, . MA: Q.E.D. Information Sciences. [8] Chen, & P. P. The Lattice Concept North-Holland. Li, L. (1987). Relationship Approach [9] Codd, ACM, [10] Codd, E. F. (1970). A relational model of data in Entity Set. In S. Spaccapaietra (Ed.), Entity- for large shared data banks. Communications of the capture more meaning. ACM practical foundation for productivity, the 1981 ACM 11(6), pp. 377-387. Extending the relational database model E. F. (1979). to Transactions on Database Systems, i(4), pp. 397-434. Ill) Codd, E. F. (1982). Relational database: Turing Award Lecture. Communications [12] Codd, E. F. (1986). An The Second relational. A ACM, of the 25(2), pp. 109-117. management systems that are claimed to be on Data Engineering, Los Angeles, CA. 1986. pp. evaluation scheme for database International Conference 720-729. [13] Date, C. J. (1981). Referential Integrity. The Proceedings of the 7th International Conference on Very Large Data bases, Cannes, France. 1981. MA: Addison-Wesley. [14] Date, C. J. (1985). An Introduction to Database Systems (15) Date, C. J. (1990). An Introduction to Database Systems (5th ed.). Reading, [16] Denning, D. [17] Fernandez, E. B., Summers, R. MA: Addison-Wesley. [18] Hoffman, L. E. J. & Denning, (1977). P. J. (1979). Data Security. C, & Wood, Modern Methods for C . Reading, ACM MA: Addison-Wesley. Comp. Surv., U(3). (1981). Database Security and Integrity Computer Security and Privacy . . Englewood Readings, Cliffs, NJ: Prentice Hall. [19] Hsiao, D. K., Kerr, D. S., & Madnick, S. M. (1978). Privacy and Security of Data Communications and Data bases. Proceedings of the 4th International Conference on Very Large Data Bases, 1978. [20] Huh, [21] Johnson, Y. U., et J. al. (1990). Data Quality. Information and Softvmre Technology, 32(8), pp. 559-565. R., Leitch, R. A., & Neter, J. (1981). Characteristics of Errors in Accounts Receivable and Inventory Audits. Accounting Review, 5JH April), pp. 270-293. [22] Kazmier, L. J. (1976). Business Statistics . New York, N.Y.: 22 McGraw Hill Company. [23 Khoshafian, S. N. & Copeland, G. P. (1990). Object Identity. In San Mateo, CA: Morgan Kaufmann. S. B. Zdonik& D. Maier (Ed.), (pp. 37-46). [24 [25 Kim, W., Lochovsky, Frederick H. (1989). Object-Oriented Concepts, Databases, and Applications (1 ed.). New York: Addison-Wesley. Kim, W. (1990). Databases: Definition and Research Directions. Object-Oriented IEEE Transactions on Knowledge and Data Engineering, 2(3). [26 [27] Kim, W., Bertino, E., & Garza, J. F. (1989). Proceedings of the 1989 ACM SIGMOD Internaltional Conference on the Management of Data. ACM SIGMOD, Portland, Oregan. 1989. pp. 337-347. & Korth, H. Silberschatz, A. (1986). Database System Concepts New . York: McGraw-Hill Book Company. [28 Laudon, K. C. (1986). Data Quality and Due Process in Large Interorganizational Record Systems. Communications of the ACM, 220), pp. 4-11. [29 Liepins, G. E. & Uppuluri, V. R. R. (1990). Data Quality Control: Theory and Pragmatics (pp. 360). Marcel Dekker, Inc. New York: [30 Liepins, O. E. (1989). Sound Data Are a Sound Investment. Quality Programs, (September), pp. 61- 63. [31 & Maier, D. Cambridge, [32 Martin, Stein, (1987). J. Development and Implentation of an Object-Oriented (1973). Security, Accuracy, J. DBMS . MA: MIT Press. and Privacy in Computer Systems . Englewood Cliffs, NJ: Prentice Hall. [33 Maxwell, B. S. (1989). Beyond "Data Validity": Improving the Quality of HRIS Data. Personnel, , pp. 48-58. [34 McCarthy, L. J. (1982). Metadata Management for Large Statistical Databases. Mexico City, Mexico. 1982. pp. 234-243. [35 McCarthy, L. (1984). Scientific Information J. School, Monterey, CA. = Data + Meta-data. U.S. Naval Postgraduate 1984. A New Tool for Scientific Information. J. L. (1988). The Automated Data Thesaurus: Proceedings of the 11th International Codata Conference, Karlsruhe, Germany. 1988. [36 McCarthy, [37] Qian, X. & Wiederhold, G. (1986). Knowledge-based integrity constraint on Very Large Data Bases, 1986. pp. 3-12. validation. The Twelfth International Conference [38] Redman, [39] Sciore, E. (1991). [40 Siegel, T. C. (1992). Data Quality: Management and Technology . New York, NY: Bantam Books. Using Annotations to Support Multiple Kinds of Versioning in an Object-Oriented Database System. ACM Transactions on Database Systems, l£(No. 3, September 1991), pp. 417-438. M. & Madnick, S. E. (1991). A metadata approach to resolving semantic conflicts. Barcelona, Spain. 1991. [41 Snyder, A. (1986). Encapsulation and Inheritance OOPSLA-86, [42 Ullman, J. in Object-Oriented Programming Languages. 1986. pp. 38-45. D. (1982). Principles of Database Systems . Rockville, Maryland, USA: Computer Science Press. [43 & Towards Total Data Quality Management (TDQM). In R. Y. and Perspectives Englewood Cliffs, NJ: Wang, R. Y. Wang (Ed.), Information Technology in Action: Trends Kon, H. B. (1992). Prentice Hall. 23 [44| [45] Wang, R. Y., Kon, H. B., k Madnick, S. E. (1993). Data Quality Requirements Analysis and Modeling in Data Engineering, in the Proceedings of the 9th International Conference on Data Engingeenng, Vienna. 1993. Wang, R. Y., Reddy, M. P., & Kon, H. B. (1992). Toward Quality Data: An Attribute-based Approach, to appear in the journal of Decision Support Systems (DSS), Special Issue on Information Technologies and Systems,. Y. R. & Madnick, S. E. (1990). A Polygen Model for Heterogeneous Database Systems: The Source Tagging Perspective. Brisbane, Australia. 1990. pp. 519-538. [46] Wang, [47] Wang, & Y. R. Strong, D. (1992). Dimensions of Data Quality: Beyond Accuracy. Submitted for Publication. [481 Zarkovich. (1966). Quality of United Nations. [49] Zdonik, S. California: B. & Data Maier, D. (1990). Readings Morgan Kaufmann (I Statistical A . Rome: Food and Agriculture Organization in Publishers, Inc. 24 ~t Object Olriented Database Systems (pp. of the 629). Date Due Lib-26-67 MIT LIBRARIES 3 TOflO 00flMb55E 5