Knowledge versus Data
Gio Wiederhold
Department of Computer Science
Stanford University
October 1985, converted to MS Word 24 Jan 2003.
Appeared in Brodie, Mylopoulos, and Schmidt (eds.), "On Knowledge Base Management Systems: Integrating Artificial
Intelligence and Database Technologies," Springer Verlag, Feb. 1986.
The terms knowledge and data, and knowledge bases and databases, are frequently seen in papers discussing future
information systems. They are rarely defined precisely, and little agreement exists about the domain and scope
encompassed by these terms. This note presents one set of definitions and shows how these terms may be used to make
useful distinctions in the systems we are building.
By defining these two terms, and not letting the two be synonymous, we establish a classification scheme for objects
seen in information systems. We are aware that classifications such as those proposed here are artificial. We have to
consider what the objective of the proposed classification is, since objects may be classified along more than one
dimension. For instance, a classification of flowers used by a botanist is not necessarily useful for a florist, who is
concerned with costs and sales, nor for a gardener, who is concerned with planting seasons.
The Model
In our work we are concerned with describing systems which provide information, typically information to be used for
decision-making in enterprises. These systems automate tasks which, in a traditional environment, are carried out to a
large extent by people, using computers only to store and communicate data. At the point of decision-making an
expert, armed with knowledge, considers data which has been selected as relevant to the problem at hand and makes
the decision. The selected data provides information. The knowledge provided by the expert has been obtained
through education and experience. Figure 1 shows the flow leading to decision-making.
[Figure omitted; surviving labels: Knowledge Loop, Data Loop, State changes]
Figure 1 Function of Knowledge and Data in Enterprises
We have not sketched or named the processes which permit these transformations. Neither have we indicated in this
sketch the feedback loops through which the systems gain new data and new knowledge. Such feedback is essential
to assure long-term stability of information systems. Feedback into the data box occurs through the collection of
observations, modeling the real world. Feedback into the knowledge box occurs through the development of
generalizations and abstractions, perhaps formalized through the scientific loop of hypothesis generation, hypothesis
verification, review, publication, and dissemination.
Our model of knowledge-based systems follows this description of the use of information, knowledge, and data. The
objective of the classification we propose in this note is to assist in the design and organization of information systems
where these functions are being automated. Organization in turn is necessary in order to make large systems
Automating the decision-making process means that the inferencing process must be able to access both knowledge and
information. We observe, however, that the source for the knowledge, namely the expert, and the source for the
information, the collected data, are typically distinct. Only in academic exercises do we find that the expert also has
time to gather and validate the data. In industry the collected data is not provided by the expert. The expert may
specify, using expert knowledge, the selection and processing steps that reduce data to information. We define
information, following Shannon [SW48], as data that conveys material that was previously unknown to the receiver.
The Test to Distinguish Data and Knowledge
From these observations we derive a litmus test for distinguishing knowledge versus data:
If we can trust an automatic process or clerk to collect the material then we are talking about data. The correctness
of data with respect to the real world can be objectively verified by comparison with repeated observations of the
real world. Eventually the state of the world changes, and we have to trust prior observations.
If we look for an expert to provide the material then we are talking about knowledge. Knowledge includes abstractions
and generalizations of voluminous material. These are typically less precise and cannot easily be verified objectively.
Many definitions, necessary to organize systems, are knowledge as well; we look towards experts for the definitions,
which are important building stones for further abstractions, categorization, and generalization [SS77].
This definition permits us to now make processing distinctions between data and knowledge.
Since data reflects the current state of the world at the level of instances, it will include much detail, will be
voluminous, and will appear in reports which are used at lower levels of the enterprise for verification. Where
instances change rapidly, much data must also be collected over time if a complete historical picture is desired.
Knowledge will not change as frequently. Knowledge may be complex but will deal with generalizations and hence
refer to entity types rather than to entity instances.
The differences also mean that different data structures may be appropriate for the representation of data and the
representation of knowledge, at least in the pre-processing stages before the decision-making process occurs.
Structures for data are often simple, to accommodate frequent updates; knowledge, being updated by experts or
learning systems, can benefit from more complex representations.
Consistency of Data and Knowledge
Data which are newly acquired may conflict with existing knowledge. Integrity constraints in databases may prevent
some such data from being entered, to protect the database. Other data will enter, perhaps altering the generalizations
made in the associated knowledge base. For instance, QUIST makes the assumption, acquired from human experts, that
supertankers are not found in the Mediterranean [King80]. This generally valid assumption has only an economic basis,
not a legal or physical one. It may be violated by unusual instances, for example when a supertanker enters the
Mediterranean for repairs. For query optimization in the domain of transportation the QUIST generalization, attached
to the abstraction of "supertanker", will remain valid.
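The two outcomes described above, data refused by an integrity constraint and data admitted although it conflicts with
a generalization, might be sketched as follows. This is an illustration only and not QUIST's implementation: the names
Ship, integrity_ok, conflicts_with_generalization, and insert_ship, the tonnage constraint, and all values are
hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Ship:
    name: str
    ship_type: str
    region: str
    deadweight_tons: int


def integrity_ok(ship: Ship) -> bool:
    """Hard constraint (hypothetical): data violating it is refused entry."""
    return ship.deadweight_tons > 0


def conflicts_with_generalization(ship: Ship) -> bool:
    """Soft, expert-supplied generalization attached to the type 'supertanker':
    supertankers are not found in the Mediterranean."""
    return ship.ship_type == "supertanker" and ship.region == "Mediterranean"


def insert_ship(database: list, exceptions: list, ship: Ship) -> bool:
    """Store the ship unless it violates the integrity constraint.

    A conflict with the generalization is recorded but does not block entry,
    and the generalization itself is left unchanged.
    """
    if not integrity_ok(ship):
        return False
    database.append(ship)
    if conflicts_with_generalization(ship):
        exceptions.append(ship.name)
    return True


database, exceptions = [], []
results = [
    insert_ship(database, exceptions, Ship("Alpha", "supertanker", "Persian Gulf", 300_000)),
    insert_ship(database, exceptions, Ship("Beta", "supertanker", "Mediterranean", 280_000)),
    insert_ship(database, exceptions, Ship("Gamma", "ferry", "Mediterranean", -5)),
]
```

In this sketch the exceptional supertanker is stored and noted, while the generalization stays available to a query
optimizer that accepts occasional exceptions.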
Many artificial intelligence systems deal with expert-provided estimates of uncertainty of knowledge. If the
uncertainty is quantified, as in MYCIN and its successors, new instances of conflicting data will increase the degree of
uncertainty [Shor73]. Learning techniques must be embedded to automate updating of certainty factors.
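A minimal sketch of such automated updating is given below. It is an assumption made for illustration and is not
MYCIN's certainty-factor calculus: the class UncertainRule and its Laplace-smoothed estimate are hypothetical, and a
real system would presumably start from the expert's own estimate rather than from counts alone.

```python
from dataclasses import dataclass


@dataclass
class UncertainRule:
    description: str
    supporting: int = 0   # observed instances consistent with the rule
    conflicting: int = 0  # observed instances that contradict the rule

    def observe(self, consistent: bool) -> None:
        """Record one new data instance as supporting or conflicting."""
        if consistent:
            self.supporting += 1
        else:
            self.conflicting += 1

    def certainty(self) -> float:
        """Laplace-smoothed fraction of supporting instances, strictly between 0 and 1."""
        return (self.supporting + 1) / (self.supporting + self.conflicting + 2)


rule = UncertainRule("supertankers are not found in the Mediterranean")
for _ in range(8):
    rule.observe(consistent=True)
before = rule.certainty()   # 9 / 10

rule.observe(consistent=False)
after = rule.certainty()    # 9 / 11, lower than before
```

Each conflicting instance lowers the estimate without discarding the rule, which is the behavior the paragraph above
attributes to systems that quantify uncertainty.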
Systems which do not manage uncertainty, such as those based on logic, have a problem dealing with conflicting data.
Checking for conflicts is in itself costly. When a conflict is identified, either the data must be considered to be in
error or the knowledge must be false. Without an expert's presence the former alternative must be taken, even if the
state of the real world is then not correctly represented.
An expert, before revising the knowledge, will verify the data, since in large databases we will always find some
erroneous observations. Even when the data is verified to be correct, the general base knowledge need not be wrong; it
is more likely that it was incomplete. New knowledge must then be added which will cover the case in question and
similar cases which can be foreseen.
Note that in this discussion the role of data versus knowledge seems obvious. We have found the distinction clear in all
our experiments [Wied84].
We must, however, consider problems of our definition of data and knowledge. Items which appear as data at a higher
level in the enterprise may appear to be knowledge to personnel involved in a lower level. In general, the assignment
of data and knowledge is situation dependent. In large-scale systems which serve multiple objectives some further
categorization may be necessary.
Unfortunately, we have no practical experience in dealing with this degree of complexity. Neither have we seen results
of systems of such scope.
One candidate information system structure is to permit multiple layers, and let the knowledge component of one
system layer become the supporting data for the next system layer. Such a structure appears to be nice and general but
may be quite difficult to manage. A realistic current example of multilevel data is found when database
systems manage aggregate or derived data. Aggregate data is a combination of knowledge, defined by an aggregation
routine, and the source data on which the knowledge operates.
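The aggregate-data case can be illustrated with a small sketch. The records, the function average_by_department, and
all values are hypothetical; the point is only that the derived values combine an aggregation routine (the knowledge)
with instance-level source data, and can in turn serve as data for a higher layer.

```python
# Source data: instance-level observations (hypothetical values).
salaries = [
    {"employee": "A", "department": "sales", "salary": 40_000},
    {"employee": "B", "department": "sales", "salary": 50_000},
    {"employee": "C", "department": "research", "salary": 60_000},
]


def average_by_department(rows):
    """Aggregation routine: the knowledge that defines the derived value."""
    groups = {}
    for row in rows:
        groups.setdefault(row["department"], []).append(row["salary"])
    return {dept: sum(values) / len(values) for dept, values in groups.items()}


# Derived (aggregate) data: the result of applying the routine to the source data.
derived = average_by_department(salaries)

# At the next layer the derived values are treated as plain data again.
highest_paid_department = max(derived, key=derived.get)
```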
Another approach to deal with alternative knowledge definitions is to use a model based on views over knowledge,
similar to views over data. This concept was proposed by us in [Miss84] and is a current research topic here. It
appears that the view concept in databases raises many problems which are best dealt with at a higher level of
abstraction. Such an approach requires a conceptual extension of view definition and processing schemes that now
support the notion of views in databases [Kell86]. Views over knowledge structures may, for instance, support distinct
hierarchical categorizations of knowledge, each appropriate to some set of applications.
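As a rough illustration of distinct categorizations over shared material, the sketch below reuses the flower example
from the introduction. It is not the scheme of [Miss84]; the table flowers, the mapping VIEWS, the function categorize,
and the price and season values are all hypothetical.

```python
# Shared descriptions of the items; the price and season values are illustrative only.
flowers = {
    "tulip": {"family": "Liliaceae", "price_band": "low", "planting_season": "autumn"},
    "rose": {"family": "Rosaceae", "price_band": "high", "planting_season": "spring"},
    "lily": {"family": "Liliaceae", "price_band": "high", "planting_season": "autumn"},
}

# Each view names the attribute by which one class of application categorizes the items.
VIEWS = {
    "botanist": "family",
    "florist": "price_band",
    "gardener": "planting_season",
}


def categorize(view: str) -> dict:
    """Group the shared items under the categorization chosen by the given view."""
    key = VIEWS[view]
    categories = {}
    for name, attributes in flowers.items():
        categories.setdefault(attributes[key], []).append(name)
    return categories


botanist_view = categorize("botanist")
gardener_view = categorize("gardener")
```

Both views are computed from the same underlying material; only the categorization differs, much as database views
present different subsets of the same stored data.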
The question remains whether our classification is generally useful. It may not be useful if only concepts of querying
of knowledge and associated data are being considered.
We are convinced of its power when we design systems to deal with large collections of data. The concerns in such
systems include processing mechanisms and architecture. The ability to deal with the largest fraction of the system
content, namely the data, in a regular and simple way is worth much to us. The distinction becomes a basis for
When we consider issues such as update, errors in data, and uncertainty of knowledge, the distinctions also become
conceptually useful. For instance, in large databases, the probability of existence of errors approaches certainty, and
we cannot afford to disable general inferencing because of spurious errors in the database.
In applications where the domain is sufficiently large or complex, knowledge must be incomplete. In such systems
conflicts of data instances with the general knowledge will remain. We can see that applications aiding management in
planning will depend on the knowledge being adequate for processing, and will ignore outlying instances of data.
Administrative functions may analyze the database for exceptions, that is, conflicts of data and knowledge. The
distinction of these application types is rooted in the distinction we make of data and knowledge.
This research was supported by the Defense Advanced Research Projects Agency, contract N39-84-C-211 for
Management of Knowledge and Database. Experience leading to these conclusions came from work as described in
[WBW85]. Michael Brodie provided detailed and helpful comments.
[Kell86] Arthur M. Keller: "Choosing a View Update Translator by Dialog at View Definition Time"; to appear in
IEEE Computer, Jan. 1986.
[King80] Jonathan King: "Modelling Concepts for Reasoning about Access to Knowledge"; Proc. of the ACM
Workshop on Data Abstraction, Data Bases, and Conceptual Modelling, Pingree Park CO, June 23-26, 1980,
ACM-SIGPLAN Notices, vol.16 no.1, Jan. 1981.
[Miss84] Michele Missikoff and Gio Wiederhold: "Towards a Unified Approach for Expert and Database Systems";
Proc. First Workshop on Expert Database Systems, Kiawah Island, South Carolina, Oct. 1984, Institute of
Information Management, Technology and Policy, Univ. of South Carolina, vol.2, pp.186-206.
[SW48] Claude E. Shannon and Warren Weaver: The Mathematical Theory of Communication; the Univ. of Illinois
Press, 1962, reprinted from the Bell System Technical Journal, 1948.
[Shor73] E. Shortliffe, S. G. Axline, B. G. Buchanan, T. C. Merigan and S. N. Cohen: "An Artificial
Intelligence Program to Advise Physicians Regarding Antimicrobial Therapy"; Computers and
Biomed. Res., vol.6 no.6, Dec. 1973, pp.544-560.
[SS77] John M. Smith and Diane C. P. Smith: "Database Abstractions: Aggregation and Generalization";
ACM TODS, vol.2 no.2, Jun. 1977, pp.105-133.
[Wied84] Gio Wiederhold: "Knowledge and Database Management"; IEEE Software, vol.1 no.1, January 1984,
[WBW85] Gio Wiederhold, Robert L. Blum, and Michael Walker: "An Integration of Knowledge and Data
Representation"; this proceedings. Brodie, Mylopoulos, and Schmidt (eds.), "On Knowledge Base Management
Systems: Integrating Artificial Intelligence and Database Technologies," Springer Verlag, Feb. 1986.