word - University of California, Irvine

advertisement
From Computer Data to Human Knowledge: A
Cognitive Approach to Knowledge Discovery and Data
Mining
Michael J. Pazzani
Department Name:
Institution:
Information and Computer Science
University of California, Irvine
Contact Information
Michael J. Pazzani
Information and Computer Science
444 Computer Science Bldg.
University of California
Irvine, CA 92697-3425
Phone: (949) 824-7405
Fax : (949) 824-4035
Email: pazzani@ics.uci.edu
http://www.ics.uci.edu/~pazzani
WWW PAGE
http://www.ics.uci.edu/~pazzani/CognitiveKDD.html
Project Award Information



Award Number: 9731990
Duration: 10/01/1998 – 9/30/2001
Title: From Computer Data to Human Knowledge: A Cognitive Approach to Knowledge
Discovery and Data Mining
Keywords
Knowledge Discovery in Database; Cognitive Science; Comprehensibility of Learned Models
Project Summary
The research is concerned with intelligent decision aids that can be developed by data mining
techniques. Experience has shown that such systems can learn accurate models, but that experts in
areas where those models are used in decision aids are often reluctant to trust them because they do
not use the same tests, intermediate conclusions, or abstractions that the experts have grown to
trust. Experts also want models that are stable under small changes in the data being analyzed.
Psychologists have uncovered numerous factors that simplify the learning, understanding, and
communication of category and process information by humans. This research seeks to explore
these psychological principles in light of the output of existing KDD algorithms and to develop and
evaluate new KDD algorithms that will provide output that is easy for people to learn, use, and
communicate to others. With the results of this research, it should be possible to make such
decision aids more "human centered", so that they will be used more often and more effectively in
practice.
This is a joint project between Dorrit Billman (Georgia Institute of Technology) and Michael
Pazzani (University of California, Irvine). Billman and Pazzani have complementary abilities that
will produce the synergy of interdisciplinary work at its best, but also a common pool of
assumptions and knowledge to facilitate communication and interaction. Pazzani is a computer
scientist who has done psychology experiments, while Billman is a psychologist who has done
computational work. This collaboration will bring cognitive principles into the field of Knowledge
Discovery and Databases.
Goals, Objectives, and Targeted Activities
Our activities for the project this year fall under two main themes.
1. The evaluation of systems for multiple linear regression. Multiple linear regression (e.g., Draper
and Smith, 1981) is a technique for finding a linear relationship between a set of explanatory
variables (xi) and a dependent variable (y): y = b0 + b1x1 + b2x2 + ... + bnxn. The coefficients,
(bi) provide an indication on the influence of an explanatory variable on the dependent variable.
Such analysis might help guide future decision-making. For example, many lenders use a credit
score to help determine whether to make a loan. This score is a combination of many factors
such as income, debt, and past payment history which positively or negatively affect the credit
risk of a borrower. However, multiple regression as used in practice can produce models that
are unacceptable to experts because factors that should positively affect a decision may have
negative coefficients and vice versa. Through psychological investigations, we have verified
that users are sensitive to the signs of coefficients used in linear models. We have proposed and
evaluated a constrained form of regression (Independent Sign Regression) that produces models
that have similar predictive accuracy to multiple linear regression, but whose results are more
acceptable to users.
2. The creation and deployment of intelligent agents for text filtering. This activity also seeks to
use learning algorithms that produce understandable results. In one system developed (Billsus
& Pazzani, 1998), the agent can explain the reasons for classifications, accept feedback on
whether the reason is acceptable to the user, and revise the user’s profile based upon feedback
to the explanation.
Indication of Success
The primary result this year is the creation of a constrained form of regression that produces linear
models that generalize as well as those produced by multiple linear regression (see Table 1) yet
produces results that subjects are more willing to use (see Table 2).
Table 1. Predictive Mean Squared Error of Regression Routines.
Database
Alzheimer
Autompg
Baseball
CS Dept
Housing
Pollution
Multiple
Linear
Regression
0.184
10.6
8.74e+5
0.244
23.7
3.53e+3
Independent
Sign
Regression
0.166
10.5
8.55e+5
0.213
27.6
1.6e+3
Table 2. Average Subjects Ratings for Linear Equations.
Regression Algorithm
Multiple Linear Regression
Independent Sign Regression
Mean
Rating
-0.816
0.603
Project Impact
Although this research project was initiated on Sept. 30, 1998, it has attracted the attention of
several corporations; the PI has been invited to speak at Microsoft Research Laboratories in
Redmond, WA and HNC Software in San Diego, CA.
GPRA Outcome Goals
We believe the Independent Sign Regression algorithm represents an advance that will be
applicable to a wide variety of situations, particularly those in which people with little knowledge
of assumptions behind data mining algorithms apply these algorithms to diverse data sets.
Project References
1. Pazzani, M. & Bay, S. (submitted). The Independent Sign Bias: Gaining Insight from Multiple
Linear Regression. Annual Meeting of the Cognitive Science Society.
2. Billsus, D. and Pazzani, M. (in press). A Personal News Agent that Talks, Learns and Explains.
The Third International Conference on Autonomous Agents (Agents '99), Seattle, Washington,
May 1-5, 1999.
3. Intelligent Agent software; NewsDude http://www.ics.uci.edu/~dbillsus/NewsDude
Area Background
The goal of data mining is to investigate algorithms for providing insight into some phenomenon by
analyzing a database of examples of that phenomenon. The specific focus of our investigation is to
constrain and bias algorithms for creating models of data so that these models are understandable
and coherent to users of knowledge discovery and data mining systems.
Large databases are being collected in science, business, and medicine due to advances in methods
for collecting, storing, and integrating data. The potential benefit of these rich information sources
has scarcely been tapped and the societal effects scarcely envisioned. Not only could this provide
new discovery methods in science and new decision-making tools for business, but also new bases
for policy making in health or economics, new tools for medical diagnosis, new information about
products for consumers, and a host of other possibilities.
In response to the availability of these very large databases, a variety of techniques have been
developed and applied to recover useful information. These techniques have drawn from statistics,
pattern recognition, machine learning, and neural networks to build models describing regularities
in the data. The goal of this modeling is to help people understand the data by discovering
predictive or descriptive models. However, to date research in data mining has not paid attention to
the cognitive factors that make the resulting models coherent, credible, easy to learn, easy to use,
and easy to communicate to others. Without attention to the human user, the social benefits of data
mining cannot be fully realized.
We anticipate that principles of human learning and reasoning will guide the design of new data
mining algorithms to produce models that are easier for users of KDD systems to understand and
that properties of learning algorithms will add to the understanding of human psychological
processes.
Area References
Davies, J. & Billman, D. (1996). Consistent Contrast in Unsupervised Learning. in Program of the
Eighteenth Annual Conference of the Cognitive Science Society. Erlbaum: Hillsdale, NJ.
Draper, N. & Smith, H. (1981). Applied Regression Analysis. John Wiley & Sons.
Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge
Discovery: An Overview. In Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth,
Ramasamy Uthurusamy (Eds.): Advances in Knowledge Discovery and Data Mining (pp. 134). AAAI/MIT Press.
Kelley, H. (1971). Causal schemata and the attribution process. In E. Jones, D. Kanouse, H.
Kelley, N. Nisbett, S. Valins, & B. Weiner (Eds.), Attribution: Perceiving the causes of
behavior (pp 151-174). Morristown, NJ: General Learning Press.
Murphy, G.L. & Allopenna, P.D. (1994). The locus of knowledge effects in concept learning.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 203-222.
Pazzani, M. (1991a). The influence of prior knowledge on concept acquisition: Experimental and
computational results. Journal of Experimental Psychology: Learning, Memory & Cognition,
17, 3.
Spiegelhalter, D., Dawid, P., Lauritzen, S. and Cowell, R. (1993). Bayesian Analysis in Expert
Systems. Statistical Science, 8, 219-283.
Tufte, E.R. (1990). Envisioning Information. Connecticut: Graphics Press.
Download