Lecture 03: Categorization
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2003
Credits to Marti Hearst and Warren Sack for some of the slides in this lecture
IS 202 - FALL 2003 2003.09.02 - SLIDE 1

Today
• Review of Last Time
  – What Is Information?
• Categorization
• Discussion Questions
• Action Items for Next Time

Assignment 1 - Discussion
• What is information, according to your background or area of expertise?

What Is Information?
• Relating data to a context (“situational interpretation”)
• Anything that is important to anyone (“significance”)
• World → data → information → knowledge
• Requires community of interpretation
• All information is dependent on context
• Capable of being recorded and stored and transmitted (also in physical form – e.g., fossils)
• Information must be recorded
• Information is a record of something that can be reused
• Information is a commodity

What Is Information?
• Negentropy
• Potential energy to become knowledge
• Potential for it to be built upon
• Questions
  – Does information have to be related to “true” data?
  – Can information be downgraded to data if it is forgotten?

Human Communication Theory?
[Figure: communication model with the labels Source, Message, Encoding, Channel, Noise, Decoding, Destination]

The Conduit Metaphor
• Language functions like a conduit, transferring thoughts bodily from one person to another
• In writing and speaking, people insert their thoughts or feelings in the words
• Words accomplish the transfer by containing the thoughts or feelings and conveying them to others
• In listening or reading, people extract the thoughts and feelings once again from the words

Toolmakers’ Paradigm

How Much Information Today?
• See report by Hal Varian and Peter Lyman
  http://www.sims.berkeley.edu/research/projects/how-much-info/
• Total annual information production including print, film, magnetic media, etc.
  – Upper Bound 2,120,539 Terabytes (10^12 bytes)
  – Lower Bound 635,480 Terabytes
  – I.e., between 1 and 2 Exabytes per year (10^18 bytes)
• How do we organize THIS?

Categorization
09/02/2003  Categorization
09/04/2003  Knowledge Representation
09/09/2003  Lexical Relations and WordNet
09/11/2003  Metadata Introduction
09/16/2003  Controlled Vocabulary Introduction
09/18/2003  Thesaurus Design and Construction

Foucault on Borges
• This passage quotes “a certain Chinese encyclopedia” in which it is written that ‘animals are divided into: (a) belonging to the Emperor, (b) embalmed, (c) tame, (d) suckling pigs, (e) sirens, (f) fabulous, (g) stray dogs, (h) included in the present classification, (i) frenzied, (j) innumerable, (k) drawn with a very fine camelhair brush, (l) et cetera, (m) having just broken the water pitcher, (n) that from a long way off look like flies.’
  – Michel Foucault, The Order of Things, 1970

Yahoo! Categorization

Yahoo!
Categorization Detail

Why Study Categorization?
• Categorization is central to how we organize information and the world
• Categorization is a core cognitive process
• In recent years, centuries-old views of categorization have been revised
• Understanding how people categorize can help us design information systems that do a better job at organization and retrieval

Why Read Lakoff?
• Very influential figure in recent thinking about human categorization, metaphor, and cognition
• Provides summary of historical work and develops syncretic model of cognition and categorization
• Clear explanations using examples
• Professor at UC Berkeley (Department of Linguistics)

George Lakoff
• Lakoff’s research covers many areas of Conceptual Analysis within Cognitive Linguistics
  – The nature of human conceptual systems, especially metaphor systems for concepts such as time, events, causation, emotions, morality, the self, politics, etc.
  – The development of Cognitive Social Science, which applies ideas of Cognitive Semantics to the Social Sciences
  – The implications of Cognitive Science for Philosophy, in collaboration with Mark Johnson, Chair of Philosophy at the University of Oregon
  – Neural foundations of conceptual systems and language, in collaboration with Jerome Feldman, of the International Computer Science Institute, seeking to develop biologically motivated structured connectionist systems to model both the learning of conceptual systems and their neural representations
  – The cognitive structure, especially the metaphorical structure, of mathematics, in collaboration with Rafael Núñez

George Lakoff
• Selected publications
  – Metaphors We Live By (with Mark Johnson) Univ. of Chicago Press. 1980.
  – Women, Fire, and Dangerous Things. University of Chicago Press. 1987.
  – More Than Cool Reason.
(with Mark Turner) Univ. of Chicago Press. 1989.
  – Moral Politics. University of Chicago Press. 1996.
  – Philosophy in The Flesh. Basic Books, 1999.
  – Where Mathematics Comes From: How the Embodied Mind Brings Mathematics into Being. (with Rafael Núñez). Basic Books. 2000.
  – Moral Politics: How Liberals and Conservatives Think. Second Edition. University of Chicago Press, 2002.

Objectivist Views
• Thought is mechanical manipulation of symbols
• The mind is an abstract machine
• Symbols get their meaning from correspondences to the external world
• Symbols are internal representations
• Abstract symbols stand in correspondence with the external world independent of the interpreting organism
• The human mind is a mirror of nature
• Human bodies play no role in characterizing concepts
• Thought is abstract and disembodied
• Exclusively symbolic machines are capable of thought
• Thought can be broken down into simple “building blocks”
• Thought is defined by mathematical logic

Experientialist Views
• Thought is embodied
• Thought is imaginative
• Thought has gestalt properties
• Thought utilizes basic-level categorization and basic-level primacy
• Thought uses prototypes and family resemblances as organizing structures
• Conceptual structure can be described using cognitive models that have the above properties
• The theory of cognitive models incorporates what was right about the traditional view of categorization, meaning, and reason, while accounting for the empirical data on categorization and fitting the new view overall

Central Conceptual Issue
• Do meaningful thought and reason concern merely the manipulations of abstract symbols and their correspondence to an objective reality, independent of any embodiment (except, perhaps, for limitations imposed by the organism)?
• Do meaningful thought and reason essentially concern the nature of the organism doing the thinking, including the nature of its body, its interaction in its environment, its social character, and so on?

Categorization
• Classical categorization
  – Necessary and sufficient conditions for membership
  – Generic-to-specific monohierarchical structure
• Modern categorization
  – Characteristic features (family resemblances)
  – Centrality/typicality (prototypes)
  – Basic-level categories

Defining Category Membership
• Necessary and sufficient conditions
  – Every condition must be met
  – No other conditions can be required
• Example: A prime number:
  – An integer divisible only by itself and 1. Source: Webster's Revised Unabridged Dictionary, © 1996, 1998 MICRA, Inc.
• Example: mother
  – A woman who has given birth to a child.

Defining Category Membership
• Necessary and sufficient conditions for Mother?
  – mother(A,B) -> female(A), gave-birth-to(A,B), same-species(A,B)
• What about
  – Birth mother vs. adoptive mother
  – Surrogate mother
  – Transgenic mother

Can Category Membership Be Defined?
• What are the necessary and sufficient conditions for something to be a game?
• Famous example by Wittgenstein
  – Classic categories assume clear boundaries defined by common properties (necessary and sufficient conditions)
• How do we categorize games?
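The classical definitions above (prime number, mother) can be sketched as predicates: membership holds exactly when every listed condition holds. This is an illustrative sketch, not part of the original slides. The `Candidate` record and its field names are hypothetical, and the `n < 2` guard follows the standard mathematical definition of a prime, which the dictionary wording quoted above leaves implicit.

```python
from dataclasses import dataclass


def is_prime(n: int) -> bool:
    """Classical membership test: all conditions must hold, nothing else matters."""
    if n < 2:  # assumption: standard definition requires n > 1
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            return False
        divisor += 1
    return True


@dataclass
class Candidate:
    """Hypothetical record describing how person A relates to child B."""
    female: bool
    gave_birth_to_child: bool
    same_species: bool


def is_mother_classical(c: Candidate) -> bool:
    # Mirrors the slide's rule:
    # mother(A,B) -> female(A), gave-birth-to(A,B), same-species(A,B)
    return c.female and c.gave_birth_to_child and c.same_species


birth_mother = Candidate(female=True, gave_birth_to_child=True, same_species=True)
adoptive_mother = Candidate(female=True, gave_birth_to_child=False, same_species=True)

print([n for n in range(1, 20) if is_prime(n)])
print(is_mother_classical(birth_mother))
print(is_mother_classical(adoptive_mother))
```

The predicate works cleanly for primes but rejects the adoptive mother outright, which is the kind of boundary problem the "What about" cases raise for classical definitions.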
Definition of Game
• Counterexample: “Game”
  – No common properties shared by all games
    • Card games, ball games, Olympic games, children’s games
  – Not all games involve competition: ring-around-the-rosy
  – Not all games involve skill: dice games
  – Not all games involve luck: chess
  – No fixed boundary to category
    • Can be extended to new games (e.g., video games)
• Alternative notion of category membership
  – Concepts related by family resemblances

Properties of Categorization
• Family resemblance
  – Members of a category may be related to one another without all members having any property in common
    • Instead, they may share a large subset of traits
    • Some attributes are more likely given that others have been seen
  – Example: feathers, wings, twittering, ...
    • Likely to be a bird, but not all features apply to “emu”
    • Unlikely to see an association with “barks”

Properties of Categorization
• Example: Prime numbers
  – Definition: An integer divisible only by itself and 1
  – Examples: 2, 3, 5, 7, 11, 13, 17, …
• A very clear-cut category. Or is it?
  – Can one number be “more prime” than another?
• Centrality
  – Some members of a category may be “better examples” than others, i.e., “prototypical” members
    • Example: robins vs. chickens vs.
emus

Properties of Categorization
• Characteristic features
  – Perceived degree of category membership has to do with which features help define the category
  – Members usually do not have ALL the necessary features, but have some subset
  – Those members that have more of the central features are seen as more central members
  – People have conceptions of typical members

Testing for Centrality/Typicality
• Ask a series of questions, compare how long it takes people to answer
  – True or false:
    • An apple is a fruit
    • A plum is a fruit
    • A coconut is a fruit
    • An olive is a fruit
    • A tomato is a fruit
• Rosch and Mervis
  – The more features a fruit shares with the other fruits, the more typical a member of the class it is

Characteristic Features
• Is a cat on a mat a cat?
• Is a dead cat a cat?
• Is a photo of a cat a cat?
• Is a cat with three legs a cat?
• Is a cat that barks a cat?
• Is a cat with a dog’s brain a cat?
• Is a cat with every cell replaced by a dog’s cells a cat?

Properties of Categorization
• Basic-level categories
  – Categories are organized into a hierarchy from the most general to the most specific, but the level that is most cognitively basic is “in the middle” of the hierarchy
• Basic-level primacy
  – Basic-level categories are functionally primary with respect to factors including ease of cognitive processing (learning, reasoning, recognition, etc.)
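The Rosch and Mervis observation from the typicality slide earlier in this lecture (the more features a member shares with the other members, the more typical it is) can be sketched as a simple score. This is a rough illustration, not their experimental procedure: the function name `family_resemblance` is hypothetical, and the feature lists are invented for the example, loosely following the robin/chicken/emu contrast used in this lecture.

```python
from typing import Dict, Set


def family_resemblance(members: Dict[str, Set[str]]) -> Dict[str, int]:
    """Score each member by how many OTHER members share each of its features."""
    scores: Dict[str, int] = {}
    for name, features in members.items():
        total = 0
        for feature in features:
            # Count the other category members that also have this feature.
            total += sum(
                1
                for other, other_features in members.items()
                if other != name and feature in other_features
            )
        scores[name] = total
    return scores


# Invented feature lists, for illustration only.
birds = {
    "robin": {"feathers", "wings", "flies", "twitters"},
    "sparrow": {"feathers", "wings", "flies", "twitters"},
    "chicken": {"feathers", "wings", "clucks"},
    "emu": {"feathers", "wings", "runs_fast"},
}

scores = family_resemblance(birds)
for name in sorted(scores, key=scores.get, reverse=True):
    print(name, scores[name])
```

With these made-up features the robin scores higher than the chicken or the emu, matching the intuition that it is the more prototypical bird. No single feature has to be shared by every member for the score to be meaningful, and the ranking depends entirely on which features are listed.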
Basic-Level Categories
• Brown 1958, 1965; Berlin et al. 1972, 1973
• Folk biology:
  – Unique beginner: plant, animal
  – Life form: tree, bush, flower
  – Generic name: pine, oak, maple, elm
  – Specific name: Ponderosa pine, white pine
  – Varietal name: Western Ponderosa pine
• No overlap between levels
• Level 3 is basic
  – Corresponds to genus
  – Folk biological categories correspond accurately to scientific biological categories only at the basic level

Psychologically Primary Levels
SUPERORDINATE    BASIC LEVEL    SUBORDINATE
animal           dog            terrier
furniture        chair          rocker
• Children take longer to learn superordinate categories above the basic level
• Superordinate categories above the basic level are not associated with mental images or motor actions

Basic-Level Categorization
• Perception
  – Overall perceived shape
  – Single mental image
  – Fast identification
• Function
  – General motor program
• Communication
  – Shortest, most commonly used and contextually neutral words
  – First learned by children
• Knowledge Organization
  – Most attributes of category members stored at this level

Middle-Out Categorization
• Top down
  – Object
    • Writing implement
      – Pen
• Bottom up
  – Sanford Uniball Black Pen
    • Ink Pen
      – Pen
• Middle out
  – Writing implement
    • Pen
      – Ink Pen

Summary
• Processes of categorization underlie many of the issues having to do with information organization
• Categorization is messier than our computer systems would like
• Human categories have graded membership, consisting of family resemblances
  – Family resemblance is expressed in part by which subset of features is shared
  – It is also determined by underlying understandings of the world that do not get represented in most systems
• Basic-level categories, as well as subordinate and superordinate categories, seem to be
cognitively real and therefore important in the design of information organization and retrieval systems

Discussion Questions (Lakoff)
• Margaret Tong on Lakoff
  – If categorization is embodied, i.e., is a consequence of bodily experience, and if there is a pool of information so large that it must be categorized by a computer (beyond human capacity to categorize), then does the computer incorporate ‘bodily experience’? If so, how? If not, does it have to rely on the so-called classical view of categorization?
  – The objects under study by various researchers are mostly physical, such as trees, birds and colors. Would the same theory apply if the entity to categorize is information, which is somewhat intangible?

Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
  – Does the existence of prototype members offer any support for the conduit theory of language discussed last week? For instance, does the fact that diverse peoples across many cultures will all select focal blue as the best example of their word for blue imply that there really is a transmittable idea contained in that word?

Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
  – In the discussion of basic-level categories, Lakoff opens with Brown, whose examples of basic names determined by distinctive actions seem to fall at the “life form” level - flower, ball, cat, etc. Then he attempts to slide seamlessly into the discussion of Tzeltal plant classification, where the basic level is the “genus” level – oak, maple, etc. (or, to relate to the earlier examples, rose, baseball, Persian). It seems to me, though I’ve not experimented, that children actually learn Brown’s life-form-level words before the genus-level (tree before oak, flower before rose). So what is the basic-level category? Does it really exist, and is it predictable?
Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
  – In terms of our real concerns, i.e., organization of information in a library or database, it seems that prototype theory is readily applicable but basic-level theory is less so. I can see where, given a question like “How did the Egyptians build the pyramids?”, people would categorize potentially relevant information and feel that some pieces of information were better than others (like in Barsalou’s ad-hoc categories referred to at the bottom of page 45). But I’m not sure I understand Lakoff’s claim that basic-level categories can be extended from the physical world into “event categories” by way of metaphor (bottom of page 47). What does he mean by event categories? Would these event categories have relevance to questions of information retrieval or abstract knowledge organization like the one posed above?

Discussion Questions (Lakoff)
• Simon King on Lakoff
  – Does experientialism completely invalidate objectivism as a model of cognition? Even if we accept prototype theory and believe that human thought is not based on rigid classical categorization, isn’t it possible that humans find it simpler to internally represent categories by their prototypical members and that this is simply a cognitive shortcut? The same thoughts could be formed by manipulating categories, even though this is not how humans think. Does it even make sense to think about cognition outside of our human context?

Discussion Questions (Lakoff)
• Simon King on Lakoff
  – Does objectivism preclude imagination and creativity? If thought is atomistic, does this mean that there can be no intuition or ‘leaps’ of logic?
Lakoff states “every time we categorize something in a way that does not mirror nature, we are using general human imaginative capabilities.” Does this mean that imagination can be considered a form of logical error or a mistaken internal representation of the world? Is imagination a requirement of ‘thought’ or can some organism or system be said to think if it operates on logic alone?

Next Time
• Knowledge Representation

Homework (!)
• Read the handouts
  – “The Vocabulary Problem in Human-System Communication” (G. W. Furnas, T. K. Landauer, L. M. Gomez, S. T. Dumais)
  – “Commonsense-Based Interfaces” (M. Minsky)
  – “CYC: A Large-Scale Investment in Knowledge Infrastructure” (D. B. Lenat)