From Computer Data to Human Knowledge: A Cognitive Approach to Knowledge Discovery and Data Mining

Michael J. Pazzani
Department Name: Information and Computer Science
Institution: University of California, Irvine

Contact Information
Michael J. Pazzani
Information and Computer Science
444 Computer Science Bldg.
University of California
Irvine, CA 92697-3425
Phone: (949) 824-7405
Fax: (949) 824-4035
Email: pazzani@ics.uci.edu
http://www.ics.uci.edu/~pazzani

WWW PAGE
http://www.ics.uci.edu/~pazzani/CognitiveKDD.html

Project Award Information
Award Number: 9731990
Duration: 10/01/1998 – 9/30/2001
Title: From Computer Data to Human Knowledge: A Cognitive Approach to Knowledge Discovery and Data Mining

Keywords
Knowledge Discovery in Databases; Cognitive Science; Comprehensibility of Learned Models

Project Summary
This research is concerned with intelligent decision aids developed with data mining techniques. Experience has shown that such systems can learn accurate models, but that experts in the areas where those models serve as decision aids are often reluctant to trust them, because the models do not use the same tests, intermediate conclusions, or abstractions that the experts have grown to trust. Experts also want models that are stable under small changes in the data being analyzed. Psychologists have uncovered numerous factors that simplify the learning, understanding, and communication of category and process information by humans. This research explores these psychological principles in light of the output of existing KDD algorithms, and develops and evaluates new KDD algorithms whose output is easy for people to learn, use, and communicate to others. With the results of this research, it should be possible to make such decision aids more "human centered", so that they are used more often and more effectively in practice.
This is a joint project between Dorrit Billman (Georgia Institute of Technology) and Michael Pazzani (University of California, Irvine). Billman and Pazzani have complementary abilities that should produce the synergy of interdisciplinary work at its best, together with a common pool of assumptions and knowledge that facilitates communication and interaction. Pazzani is a computer scientist who has done psychology experiments, while Billman is a psychologist who has done computational work. This collaboration will bring cognitive principles into the field of Knowledge Discovery in Databases.

Goals, Objectives, and Targeted Activities
Our activities for the project this year fall under two main themes.

1. The evaluation of systems for multiple linear regression. Multiple linear regression (e.g., Draper and Smith, 1981) is a technique for finding a linear relationship between a set of explanatory variables ($x_i$) and a dependent variable ($y$): $y = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_n x_n$. Each coefficient ($b_i$) indicates the influence of the corresponding explanatory variable on the dependent variable. Such an analysis might help guide future decision-making. For example, many lenders use a credit score to help determine whether to make a loan. This score combines many factors, such as income, debt, and past payment history, each of which positively or negatively affects the credit risk of a borrower. However, multiple regression as used in practice can produce models that are unacceptable to experts, because factors that should positively affect a decision may receive negative coefficients and vice versa. Through psychological investigations, we have verified that users are sensitive to the signs of the coefficients used in linear models. We have proposed and evaluated a constrained form of regression (Independent Sign Regression) that produces models with predictive accuracy similar to that of multiple linear regression, but whose results are more acceptable to users.

2.
The creation and deployment of intelligent agents for text filtering. This activity also seeks to use learning algorithms that produce understandable results. In one system developed (Billsus & Pazzani, 1998), the agent can explain the reasons for its classifications, accept feedback on whether each reason is acceptable to the user, and revise the user's profile based upon that feedback.

Indication of Success
The primary result this year is the creation of a constrained form of regression that produces linear models that generalize as well as those produced by multiple linear regression (see Table 1), yet yields results that subjects are more willing to use (see Table 2).

Table 1. Predictive Mean Squared Error of Regression Routines.

Database                      Alzheimer  Autompg  Baseball  CS Dept  Housing  Pollution
Multiple Linear Regression    0.184      10.6     8.74e+5   0.244    23.7     3.53e+3
Independent Sign Regression   0.166      10.5     8.55e+5   0.213    27.6     1.6e+3

Table 2. Average Subject Ratings for Linear Equations.

Regression Algorithm          Mean Rating
Multiple Linear Regression    -0.816
Independent Sign Regression   0.603

Project Impact
Although this research project was initiated only on Sept. 30, 1998, it has already attracted the attention of several corporations; the PI has been invited to speak at Microsoft Research Laboratories in Redmond, WA and HNC Software in San Diego, CA.

GPRA Outcome Goals
We believe the Independent Sign Regression algorithm represents an advance that will be applicable to a wide variety of situations, particularly those in which people with little knowledge of the assumptions behind data mining algorithms apply these algorithms to diverse data sets.

Project References
1. Pazzani, M. & Bay, S. (submitted). The Independent Sign Bias: Gaining Insight from Multiple Linear Regression. Annual Meeting of the Cognitive Science Society.
2. Billsus, D. & Pazzani, M. (in press). A Personal News Agent that Talks, Learns and Explains.
The Third International Conference on Autonomous Agents (Agents '99), Seattle, Washington, May 1-5, 1999.
3. Intelligent Agent software: NewsDude, http://www.ics.uci.edu/~dbillsus/NewsDude

Area Background
The goal of data mining research is to investigate algorithms that provide insight into some phenomenon by analyzing a database of examples of that phenomenon. The specific focus of our investigation is to constrain and bias the algorithms that create models of data, so that these models are understandable and coherent to users of knowledge discovery and data mining systems.

Large databases are being collected in science, business, and medicine due to advances in methods for collecting, storing, and integrating data. The potential benefit of these rich information sources has scarcely been tapped and the societal effects scarcely envisioned. These sources could provide not only new discovery methods in science and new decision-making tools for business, but also new bases for policy making in health or economics, new tools for medical diagnosis, new information about products for consumers, and a host of other possibilities.

In response to the availability of these very large databases, a variety of techniques have been developed and applied to recover useful information. These techniques have drawn on statistics, pattern recognition, machine learning, and neural networks to build models describing regularities in the data. The goal of this modeling is to help people understand the data by discovering predictive or descriptive models. However, research in data mining to date has not paid attention to the cognitive factors that make the resulting models coherent, credible, easy to learn, easy to use, and easy to communicate to others. Without attention to the human user, the social benefits of data mining cannot be fully realized.
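The sign problem noted earlier for multiple linear regression is a concrete case of a model that fits the data yet is not credible to a user. The following is a minimal sketch of that effect and of one way to constrain coefficient signs. It is not the Independent Sign Regression algorithm itself, whose details are not given in this report. The sketch assumes that each variable's allowed sign is taken from its own covariance with the dependent variable, and it finds the constrained fit by brute force over variable subsets. The data and all function names (`solve`, `ols`, `covariance_sign`, `sign_constrained_fit`) are made up for illustration.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(columns, y):
    """Least-squares fit y ~ b0 + sum(b_j * x_j); returns ([b0, b1, ...], sse)."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    coef = solve(xtx, xty)  # normal equations
    sse = sum((t - sum(c * v for c, v in zip(coef, r))) ** 2
              for r, t in zip(rows, y))
    return coef, sse


def covariance_sign(x, y):
    """Sign (+1 or -1) of the covariance between x and y, taken in isolation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return 1 if cov >= 0 else -1


def sign_constrained_fit(columns, y):
    """Best least-squares fit whose coefficients agree with covariance_sign.

    Assumption of this sketch: brute force over subsets of variables, with
    excluded variables given coefficient 0. Only practical for a few variables.
    """
    signs = [covariance_sign(col, y) for col in columns]
    n = len(columns)
    best = None
    for size in range(n + 1):
        for subset in combinations(range(n), size):
            coef, sse = ols([columns[j] for j in subset], y)
            if all(coef[i + 1] * signs[j] >= 0 for i, j in enumerate(subset)):
                full = [coef[0]] + [0.0] * n
                for i, j in enumerate(subset):
                    full[j + 1] = coef[i + 1]
                if best is None or sse < best[1]:
                    best = (full, sse)
    return best


# Made-up data: x1 and x2 rise together, and each rises with y on its own.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [2.0, 4.5, 5.0, 7.5, 8.0]

free_coef, free_sse = ols([x1, x2], y)
fixed_coef, fixed_sse = sign_constrained_fit([x1, x2], y)
print("unconstrained:", [round(c, 3) for c in free_coef])
print("sign-constrained:", [round(c, 3) for c in fixed_coef])
```

On this toy data the unconstrained fit should come out near y = 1 + 2 x1 - 0.5 x2, giving x2 a negative coefficient even though x2 and y rise together. The constrained fit should instead drop x2 and come out near y = 0.9 + 1.5 x1, trading some fit quality for coefficients whose signs match what a user would expect.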
We anticipate that principles of human learning and reasoning will guide the design of new data mining algorithms to produce models that are easier for users of KDD systems to understand, and that properties of learning algorithms will add to the understanding of human psychological processes.

Area References
Davies, J. & Billman, D. (1996). Consistent Contrast in Unsupervised Learning. In Program of the Eighteenth Annual Conference of the Cognitive Science Society. Erlbaum: Hillsdale, NJ.
Draper, N. & Smith, H. (1981). Applied Regression Analysis. John Wiley & Sons.
Fayyad, U.M., Piatetsky-Shapiro, G., & Smyth, P. (1996). From Data Mining to Knowledge Discovery: An Overview. In U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, & R. Uthurusamy (Eds.), Advances in Knowledge Discovery and Data Mining (pp. 1-34). AAAI/MIT Press.
Kelley, H. (1971). Causal schemata and the attribution process. In E. Jones, D. Kanouse, H. Kelley, N. Nisbett, S. Valins, & B. Weiner (Eds.), Attribution: Perceiving the causes of behavior (pp. 151-174). Morristown, NJ: General Learning Press.
Murphy, G.L. & Allopenna, P.D. (1994). The locus of knowledge effects in concept learning. Journal of Experimental Psychology: Learning, Memory, and Cognition, 19, 203-222.
Pazzani, M. (1991a). The influence of prior knowledge on concept acquisition: Experimental and computational results. Journal of Experimental Psychology: Learning, Memory, and Cognition, 17, 3.
Spiegelhalter, D., Dawid, P., Lauritzen, S., & Cowell, R. (1993). Bayesian Analysis in Expert Systems. Statistical Science, 8, 219-283.
Tufte, E.R. (1990). Envisioning Information. Connecticut: Graphics Press.