Knowledge versus Data

Gio Wiederhold
Department of Computer Science, Stanford University
October 1985; converted to MS Word 24 Jan 2003.

Appeared in Brodie, Mylopoulos, and Schmidt (eds.), "On Knowledge Base Management Systems: Integrating Artificial Intelligence and Database Technologies," Springer Verlag, Feb. 1986.

Preamble

The terms knowledge and data, and knowledge bases and databases, are frequently seen in papers discussing future information systems. They are rarely defined precisely, and little agreement exists about the domain and scope encompassed by these terms. This note presents one set of definitions showing how these terms may be used to make useful distinctions in the systems we are building. By defining these two terms, and not letting the two be synonymous, we establish a classification scheme for objects seen in information systems.

We are aware that classifications as proposed here are artificial. We have to consider what the objective of the proposed classification is, since objects may be classified along more than one dimension. For instance, a classification of flowers used by a botanist is not necessarily useful for a florist, who is concerned with costs and sales, nor for a gardener, who is concerned with planting seasons.

The Model

In our work we are concerned with describing systems which provide information, typically information to be used for decision-making in enterprises. These systems automate tasks which, in a traditional environment, are carried out to a large extent by people, using computers only to store and communicate data. At the point of decision-making an expert, armed with knowledge, considers data which has been selected as relevant to the problem at hand and makes the decision. The selected data provides information. The knowledge provided by the expert has been obtained through education and experience. Figure 1 shows the flow leading to decision-making.
[Figure 1: Function of Knowledge and Data in Enterprises. Labels in the figure: Knowledge Loop, Data Loop, Storage, Education, Selection, Recording, Integration, Experience, Abstraction, State changes, Decision-making, Action.]

We have not sketched or named the processes which permit these transformations. Neither have we indicated in this sketch the feedback loops through which the systems gain new data and new knowledge. Such feedback is essential to assure long-term stability of information systems. Feedback into the data box occurs through the collection of observations, modeling the real world. Feedback into the knowledge box occurs through the development of generalizations and abstractions, perhaps formalized through the scientific loop of hypothesis generation, hypothesis verification, review, publication, and dissemination.

Our model of knowledge-based systems follows this description of the use of information, knowledge, and data. The objective of the classification we propose in this note is to assist in the design and organization of information systems where these functions are being automated. Organization in turn is necessary in order to make large systems manageable.

Automating the decision-making process means that both knowledge and information must be accessible to the inferencing process. We observe, however, that the source of the knowledge, namely the expert, and the source of the information, the collected data, are typically distinct. Only in academic exercises do we find that the expert also has time to gather and validate the data. In industry the collected data is not provided by the expert. The expert may specify, using expert knowledge, the selection and processing steps that reduce data to information. We define information, following Shannon [SW48], as data that conveys material that was previously unknown to the receiver.
The Test to Distinguish Data and Knowledge

From these observations we derive a litmus test for distinguishing knowledge from data:

If we can trust an automatic process or clerk to collect the material, then we are talking about data. The correctness of data with respect to the real world can be objectively verified by comparison with repeated observations of the real world. Eventually the state of the world changes, and we have to trust prior observations.

If we look for an expert to provide the material, then we are talking about knowledge. Knowledge includes abstractions and generalizations of voluminous material. These are typically less precise and cannot easily be verified objectively. Many definitions, necessary to organize systems, are knowledge as well; we look towards experts for the definitions which are important building blocks for further abstraction, categorization, and generalization [SS77].

This definition now permits us to make processing distinctions between data and knowledge. Since data reflects the current state of the world at the level of instances, it will include much detail, will be voluminous, and will appear in reports which are used at lower levels of the enterprise for verification. Where instances change rapidly, much data must also be collected over time if a complete historical picture is desired.

Knowledge will not change as frequently. Knowledge may be complex but will deal with generalizations and hence refer to entity types rather than to entity instances.

The differences also mean that different data structures may be appropriate for the representation of data and the representation of knowledge, at least in the pre-processing stages before the decision-making process occurs. Structures for data are often simple, to accommodate frequent updates; knowledge, being updated by experts or learning systems, can benefit from more complex representations.
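The contrast between instance-level data and type-level knowledge can be made concrete with a minimal sketch. It is not taken from any system described in this note: the class names (Observation, Generalization, DataStore, KnowledgeBase), the pump example, and the 90 degree threshold are all hypothetical, chosen only to show a flat, frequently appended structure for data next to a structure keyed by entity type for knowledge.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# All names and values below are hypothetical illustrations.


@dataclass(frozen=True)
class Observation:
    """Data: one instance-level fact that a clerk or sensor could record."""
    entity_id: str
    entity_type: str
    attribute: str
    value: float
    observed_on: str  # ISO date; history accumulates as instances change


@dataclass(frozen=True)
class Generalization:
    """Knowledge: an expert-supplied statement about an entity type."""
    entity_type: str
    attribute: str
    statement: str
    expected: Callable[[float], bool]  # what the expert expects of a value
    provided_by: str


class DataStore:
    """Flat, append-only list: a simple structure for frequent updates."""

    def __init__(self) -> None:
        self._rows: List[Observation] = []

    def record(self, obs: Observation) -> None:
        self._rows.append(obs)

    def instances_of(self, entity_type: str) -> List[Observation]:
        return [r for r in self._rows if r.entity_type == entity_type]


class KnowledgeBase:
    """Keyed by entity type: knowledge refers to types, not instances."""

    def __init__(self) -> None:
        self._by_type: Dict[str, List[Generalization]] = {}

    def add(self, g: Generalization) -> None:
        self._by_type.setdefault(g.entity_type, []).append(g)

    def about(self, entity_type: str) -> List[Generalization]:
        return list(self._by_type.get(entity_type, []))


# Many observations (voluminous, collected routinely) ...
store = DataStore()
store.record(Observation("pump-1", "pump", "temperature_c", 71.5, "1985-10-01"))
store.record(Observation("pump-1", "pump", "temperature_c", 73.0, "1985-10-02"))
store.record(Observation("pump-2", "pump", "temperature_c", 68.2, "1985-10-02"))

# ... versus one generalization (rarely changed, supplied by an expert).
kb = KnowledgeBase()
kb.add(Generalization(
    entity_type="pump",
    attribute="temperature_c",
    statement="pumps normally run below 90 C",
    expected=lambda v: v < 90.0,
    provided_by="plant engineer",
))

print(len(store.instances_of("pump")), "observations;",
      len(kb.about("pump")), "generalization")
```

In this sketch a single generalization covers every instance of its entity type, so the data store can grow and change daily while the knowledge base is touched only when an expert revises a statement.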
Consistency of Data and Knowledge

Newly acquired data may conflict with existing knowledge. Integrity constraints in databases may prevent some such data from being entered, to protect the database. Other data will enter, perhaps altering the generalizations made in the associated knowledge base.

For instance, QUIST makes the assumption, acquired from human experts, that supertankers are not found in the Mediterranean [King80]. This generally valid assumption has only an economic basis, not a legal or physical one. It may be violated by unusual instances, for example when a supertanker enters the Mediterranean for repairs. For query optimization in the domain of transportation the QUIST generalization, attached to the abstraction "supertanker", will remain valid.

Many artificial intelligence systems deal with expert-provided estimates of the uncertainty of knowledge. If the uncertainty is quantified, as in MYCIN and its successors, new instances of conflicting data will increase the degree of uncertainty [Shor73]. Learning techniques must be embedded to automate the updating of certainty factors.

Systems which do not manage uncertainty, such as those based on logic, have a problem dealing with conflicting data. Checking for conflicts is in itself costly. When a conflict is identified, either the data must be considered to be in error or the knowledge must be false. Without an expert's presence the former alternative must be taken, even if the state of the real world is then not correctly represented. An expert, before revising the knowledge, will verify the data, since in large databases we will always find some erroneous observations. Even if the data is verified to be correct, the general base knowledge need not be wrong; it is more likely that it was incomplete. New knowledge must then be added to cover the case in question and similar cases which can be foreseen.

Note that in this discussion the role of data versus knowledge seems obvious.
We have found the distinction clear in all our experiments [Wied84].

Problems

We must, however, consider problems with our definition of data and knowledge. Items which appear as data at a higher level in the enterprise may appear to be knowledge to personnel involved at a lower level. In general, the assignment of data and knowledge is situation dependent.

In large-scale systems which serve multiple objectives some further categorization may be necessary. Unfortunately, we have no practical experience in dealing with this degree of complexity. Neither have we seen results of systems of such scope.

One candidate information system structure is to permit multiple layers, and to let the knowledge component of one system layer become the supporting data for the next system layer. Such a structure appears to be nice and general but may be quite difficult to manage. A realistic current example of multiple-level data is found when database systems manage aggregate or derived data. Aggregate data is a combination of knowledge, defined by an aggregation routine, and the source data on which the knowledge operates.

Another approach to dealing with alternative knowledge definitions is to use a model based on views over knowledge, similar to views over data. This concept was proposed by us in [Miss84] and is a current research topic here. It appears that the view concept in databases raises many problems which are best dealt with at a higher level of abstraction. Such an approach requires a conceptual extension of the view definition and processing schemes that now support the notion of views in databases [Kell86]. Views over knowledge structures may, for instance, support distinct hierarchical categorizations of knowledge, each appropriate to some set of applications.

Conclusion

The question remains whether our classification is generally useful. It may not be useful if only concepts of querying of knowledge and associated data are being considered.
We are convinced of its power when we design systems to deal with large collections of data. The concerns in such systems include processing mechanisms and architecture. The ability to deal with the largest fraction of the system content, namely the data, in a regular and simple way is worth much to us. The distinction becomes a basis for modularity.

When we consider issues such as update, errors in data, and uncertainty of knowledge, the distinctions also become conceptually useful. For instance, in large databases the probability that errors exist approaches certainty, and we cannot afford to disable general inferencing because of spurious errors in the database. In applications where the domain is sufficiently large or complex, knowledge must be incomplete. In such systems conflicts of data instances with the general knowledge will remain.

We can see that applications aiding management in planning will depend on the knowledge being adequate for processing, and will ignore outlying instances of data. Administrative functions may analyze the database for exceptions, that is, conflicts of data and knowledge. The distinction between these application types is rooted in the distinction we make between data and knowledge.

Acknowledgement

This research was supported by the Defense Advanced Research Projects Agency, contract N39-84-C-211, for Management of Knowledge and Database. Experience leading to these conclusions came from work as described in [WBW85]. Michael Brodie provided detailed and helpful comments.

References

[Kell86] Arthur M. Keller: "Choosing a View Update Translator by Dialog at View Definition Time"; to appear in IEEE Computer, Jan. 1986.

[King80] Jonathan King: "Modelling Concepts for Reasoning about Access to Knowledge"; Proc. of the ACM Workshop on Data Abstraction, Data Bases, and Conceptual Modelling, Pingree Park CO, June 23-26, 1980, ACM-SIGPLAN Notices, vol.16 no.1, Jan. 1981.
[Miss84] Michele Missikoff and Gio Wiederhold: "Towards a Unified Approach for Expert and Database Systems"; Proc. First Workshop on Expert Database Systems, Kiawah Island, South Carolina, Oct. 1984, Institute of Information Management, Technology and Policy, Univ. of South Carolina, vol.2, pp.186-206.

[SW48] Claude E. Shannon and Warren Weaver: The Mathematical Theory of Communication; the Univ. of Illinois Press, 1962, reprinted from the Bell System Technical Journal, 1948.

[Shor73] E. Shortliffe, S. G. Axline, B. G. Buchanan, T. C. Merigan and S. N. Cohen: "An Artificial Intelligence Program to Advise Physicians Regarding Antimicrobial Therapy"; Computers and Biomed. Res., vol.6 no.6, Dec. 1973, pp.544-560.

[SS77] John M. Smith and Diane C. P. Smith: "Database Abstractions: Aggregation and Generalization"; ACM TODS, vol.2 no.2, Jun. 1977, pp.105-133.

[Wied84] Gio Wiederhold: "Knowledge and Database Management"; IEEE Software, vol.1 no.1, January 1984, pp.63-73.

[WBW85] Gio Wiederhold, Robert L. Blum, and Michael Walker: "An Integration of Knowledge and Data Representation"; this proceedings.