Categorization

Lecture 03: Categorization
SIMS 202: Information Organization and Retrieval
Prof. Ray Larson & Prof. Marc Davis
UC Berkeley SIMS
Tuesday and Thursday 10:30 am - 12:00 pm
Fall 2003
Credits to Marti Hearst and Warren Sack for some of the slides in this lecture
IS 202 - FALL 2003
2003.09.02 - SLIDE 1
Today
• Review of Last Time
– What Is Information?
• Categorization
• Discussion Questions
• Action Items for Next Time
2003.09.02 - SLIDE 2
Assignment 1 - Discussion
• What is information, according to your
background or area of expertise?
2003.09.02 - SLIDE 3
What Is Information?
• Relating data to a context (“situational interpretation”)
• Anything that is important to anyone (“significance”)
• World → data → information → knowledge
• Requires community of interpretation
• All information is dependent on context
• Capable of being recorded and stored and transmitted (also in physical form – e.g., fossils)
• Information must be recorded
• Information is a record of something that can be reused
• Information is a commodity
2003.09.02 - SLIDE 4
What Is Information?
• Negentropy
• Potential energy to become knowledge
• Potential for it to be built upon
• Questions
– Does information have to be related to “true”
data?
– Can information be downgraded to data if it is
forgotten?
2003.09.02 - SLIDE 5
Human Communication Theory?
[Diagram labels: Source, Message, Encoding, Channel, Noise, Decoding, Message, Destination]
2003.09.02 - SLIDE 6
The Conduit Metaphor
• Language functions like a conduit, transferring
thoughts bodily from one person to another
• In writing and speaking, people insert their
thoughts or feelings in the words
• Words accomplish the transfer by containing the
thoughts or feelings and conveying them to
others
• In listening or reading, people extract the
thoughts and feelings once again from the words
2003.09.02 - SLIDE 7
Toolmakers’ Paradigm
2003.09.02 - SLIDE 8
How Much Information Today?
• See report by Hal Varian and Peter Lyman
http://www.sims.berkeley.edu/research/projects/how-much-info/
• Total annual information production including
print, film, magnetic media, etc.
– Upper Bound 2,120,539 Terabytes (10^12 bytes)
– Lower Bound 635,480 Terabytes
– I.e., between 1 and 2 Exabytes per year (10^18 bytes)
• How do we organize THIS?
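As a quick check of the unit arithmetic on this slide, the sketch below converts both bounds from terabytes to exabytes. It assumes decimal units (1 TB = 10^12 bytes, 1 EB = 10^18 bytes); the function and variable names are illustrative only.

```python
TERABYTE = 10**12  # bytes per terabyte (decimal definition, as on the slide)
EXABYTE = 10**18   # bytes per exabyte

def terabytes_to_exabytes(terabytes):
    """Convert a size in terabytes to exabytes using decimal units."""
    return terabytes * TERABYTE / EXABYTE

upper_eb = terabytes_to_exabytes(2_120_539)
lower_eb = terabytes_to_exabytes(635_480)

print(f"upper bound: {upper_eb:.2f} EB")
print(f"lower bound: {lower_eb:.2f} EB")
```

Under these assumptions the upper bound is about 2.12 EB and the lower bound about 0.64 EB, so "between 1 and 2 Exabytes" is best read as a rounded figure.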
2003.09.02 - SLIDE 9
Categorization
09/02/2003  Categorization
09/04/2003  Knowledge Representation
09/09/2003  Lexical Relations and WordNet
09/11/2003  Metadata Introduction
09/16/2003  Controlled Vocabulary Introduction
09/18/2003  Thesaurus Design and Construction
2003.09.02 - SLIDE 10
Foucault on Borges
• This passage quotes “a certain Chinese
encyclopedia” in which it is written that ‘animals
are divided into: (a) belonging to the Emperor,
(b) embalmed, (c) tame, (d) suckling pigs, (e)
sirens, (f) fabulous, (g) stray dogs, (h) included
in the present classification, (i) frenzied, (j)
innumerable, (k) drawn with a very fine
camelhair brush, (l) et cetera, (m) having just
broken the water pitcher, (n) that from a long
way off look like flies.’
– Michel Foucault, The Order of Things, 1970
2003.09.02 - SLIDE 11
Yahoo! Categorization
2003.09.02 - SLIDE 12
Yahoo! Categorization Detail
2003.09.02 - SLIDE 13
Why Study Categorization?
• Categorization is central to how we
organize information and the world
• Categorization is a core cognitive process
• In recent years, centuries-old views of
categorization have been revised
• Understanding how people categorize can
help us design information systems that do
a better job at organization and retrieval
2003.09.02 - SLIDE 14
Why Read Lakoff?
• Very influential figure in recent thinking
about human categorization, metaphor,
and cognition
• Provides summary of historical work and
develops syncretic model of cognition and
categorization
• Clear explanations using examples
• Professor at UC Berkeley (Department of
Linguistics)
2003.09.02 - SLIDE 15
George Lakoff
• Lakoff’s research covers many areas of Conceptual
Analysis within Cognitive Linguistics
– The nature of human conceptual systems, especially metaphor
systems for concepts such as time, events, causation, emotions,
morality, the self, politics, etc.
– The development of Cognitive Social Science, which applies
ideas of Cognitive Semantics to the Social Sciences
– The implications of Cognitive Science for Philosophy, in
collaboration with Mark Johnson, Chair of Philosophy at the
University of Oregon
– Neural foundations of conceptual systems and language, in
collaboration with Jerome Feldman, of the International
Computer Science Institute, seeking to develop biologically-motivated
structured connectionist systems to model both the
learning of conceptual systems and their neural representations
– The cognitive structure, especially the metaphorical structure, of
mathematics, in collaboration with Rafael Núñez
2003.09.02 - SLIDE 16
George Lakoff
• Selected publications
– Metaphors We Live By (with Mark Johnson) Univ. of
Chicago Press. 1980.
– Women, Fire, and Dangerous Things. University of
Chicago Press. 1987.
– More Than Cool Reason. (with Mark Turner) Univ. of
Chicago Press. 1989.
– Moral Politics. University of Chicago Press. 1996.
– Philosophy in The Flesh. Basic Books, 1999.
– Where Mathematics Comes From: How the
Embodied Mind Brings Mathematics into Being. (with
Rafael Núñez). Basic Books. 2000.
– Moral Politics: How Liberals and Conservatives Think.
Second Edition. University of Chicago Press, 2002.
2003.09.02 - SLIDE 17
Objectivist Views
• Thought is mechanical manipulation of symbols
• The mind is an abstract machine
• Symbols get their meaning from correspondences to the external
world
• Symbols are internal representations
• Abstract symbols stand in correspondence with the external world
independent of the interpreting organism
• The human mind is a mirror of nature
• Human bodies play no role in characterizing concepts
• Thought is abstract and disembodied
• Exclusively symbolic machines are capable of thought
• Thought can be broken down into simple “building blocks”
• Thought is defined by mathematical logic
2003.09.02 - SLIDE 18
Experientialist Views
• Thought is embodied
• Thought is imaginative
• Thought has gestalt properties
• Thought utilizes basic-level categorization and basic-level primacy
• Thought uses prototypes and family resemblances as
organizing structures
• Conceptual structure can be described using cognitive
models that have the above properties
• The theory of cognitive models incorporates what was
right about the traditional view of categorization,
meaning, and reason, while accounting for the empirical
data on categorization and fitting the new view overall
2003.09.02 - SLIDE 19
Central Conceptual Issue
• Do meaningful thought and reason concern
merely the manipulations of abstract symbols
and their correspondence to an objective reality,
independent of any embodiment (except,
perhaps, for limitations imposed by the
organism)?
• Do meaningful thought and reason essentially
concern the nature of the organism doing the
thinking—including the nature of its body, its
interaction in its environment, its social
character, and so on?
2003.09.02 - SLIDE 20
Categorization
• Classical categorization
– Necessary and sufficient conditions for
membership
– Generic-to-specific monohierarchical structure
• Modern categorization
– Characteristic features (family resemblances)
– Centrality/typicality (prototypes)
– Basic-level categories
2003.09.02 - SLIDE 21
Defining Category Membership
• Necessary and sufficient conditions
– Every condition must be met
– No other conditions can be required
• Example: A prime number:
– An integer divisible only by itself and 1.
Source: Webster's Revised Unabridged Dictionary, © 1996, 1998 MICRA, Inc.
• Example: mother
– A woman who has given birth to a child.
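A classical category of this kind can be written as a yes/no membership test. The sketch below encodes the prime-number definition as a predicate; the exclusion of integers below 2 is an assumption taken from the usual mathematical convention, not from the dictionary wording quoted above.

```python
def is_prime(n):
    """Classical membership test: True only if every condition holds."""
    if n < 2:
        # Assumption: 1, 0, and negative integers are excluded by convention.
        return False
    divisor = 2
    while divisor * divisor <= n:
        if n % divisor == 0:
            # Divisible by something other than itself and 1.
            return False
        divisor += 1
    return True

print([n for n in range(1, 20) if is_prime(n)])
```

The predicate returns only True or False, which is exactly the clear boundary that classical categorization assumes.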
2003.09.02 - SLIDE 22
Defining Category Membership
• Necessary and sufficient conditions for
Mother?
– mother(A,B) -> female(A), gave-birth-to(A,B),
same-species(A,B)
• What about
– Birth mother vs. adoptive mother
– Surrogate mother
– Transgenic mother
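To see where the rule above strains, here is a minimal sketch that applies it to a few cases. The `Relation` record and the truth values assigned to each case are hypothetical illustrations, not data from the lecture.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Relation:
    """Hypothetical facts about a candidate mother A and a child B."""
    female: bool
    gave_birth: bool
    same_species: bool

def is_mother_classical(relation):
    # Each condition is necessary; together they are treated as sufficient.
    return relation.female and relation.gave_birth and relation.same_species

cases = {
    "birth mother": Relation(female=True, gave_birth=True, same_species=True),
    "adoptive mother": Relation(female=True, gave_birth=False, same_species=True),
    "surrogate mother": Relation(female=True, gave_birth=True, same_species=True),
}

for label, relation in cases.items():
    print(label, is_mother_classical(relation))
```

With these assumed values the rule rejects the adoptive mother and accepts the surrogate, even though everyday usage treats both cases as less clear-cut than a single boolean suggests.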
2003.09.02 - SLIDE 23
Can Category Membership Be Defined?
• What are the necessary and sufficient
conditions for something to be a game?
• Famous example by Wittgenstein
– Classic categories assume clear boundaries
defined by common properties (necessary
and sufficient conditions)
• How do we categorize games?
2003.09.02 - SLIDE 24
Definition of Game
• Counterexample: “Game”
– No common properties shared by all games
• Card games, ball games, Olympic games,
children’s games
– Competition? Not ring-around-the-rosy
– Skill? Not dice games
– Luck? Not chess
– No fixed boundary to category
• Can be extended to new games (e.g., video
games)
• Alternative notion of category membership
– Concepts related by family resemblances
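The family-resemblance idea can be sketched with sets: no feature is shared by every game, yet pairs of games overlap. The feature lists below are invented for illustration and are far cruder than any real analysis of games.

```python
from itertools import combinations

# Hypothetical feature sets, chosen only to illustrate the argument.
games = {
    "chess": {"competition", "skill", "two players"},
    "dice game": {"competition", "luck"},
    "ring-around-the-rosy": {"amusement", "group activity"},
    "solitaire": {"skill", "luck", "amusement"},
}

# Features common to ALL games: the classical view needs this to be non-empty.
common = set.intersection(*games.values())
print("shared by all:", common)

# Pairwise overlaps: the "family resemblances" that hold the category together.
for (name_a, feats_a), (name_b, feats_b) in combinations(games.items(), 2):
    shared = feats_a & feats_b
    if shared:
        print(f"{name_a} / {name_b}: {sorted(shared)}")
```

In this toy data the intersection is empty, while chess, dice games, solitaire, and ring-around-the-rosy are still linked through a chain of partial overlaps.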
2003.09.02 - SLIDE 25
Properties of Categorization
• Family resemblance
– Members of a category may be related to one
another without all members having any
property in common
• Instead, they may share a large subset of traits
• Some attributes are more likely given that others
have been seen
– Example: feathers, wings, twittering, ...
• Likely to be a bird, but not all features apply to
“emu”
• Unlikely to see an association with “barks”
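The claim that some attributes become more likely once others have been seen can be illustrated with simple co-occurrence counts. The animals and attribute sets below are made up for the example; only the bird-related features come from the slide.

```python
# Hypothetical observations: each animal with the attributes it shows.
animals = {
    "robin": {"feathers", "wings", "twitters", "flies"},
    "sparrow": {"feathers", "wings", "twitters", "flies"},
    "emu": {"feathers", "wings"},
    "dog": {"fur", "barks"},
    "cat": {"fur"},
}

def conditional_frequency(target, given):
    """Fraction of animals showing `given` that also show `target`."""
    with_given = [attrs for attrs in animals.values() if given in attrs]
    if not with_given:
        return 0.0
    return sum(target in attrs for attrs in with_given) / len(with_given)

for attribute in ("wings", "twitters", "barks"):
    print(attribute, "given feathers:", conditional_frequency(attribute, "feathers"))
```

In this toy data every feathered animal has wings, two of the three twitter (the emu does not), and none bark.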
2003.09.02 - SLIDE 26
Properties of Categorization
• Example: Prime numbers
– Definition: An integer divisible only by itself and 1
– Examples: 2, 3, 5, 7, 11, 13, 17, …
• A very clear-cut category. Or is it?
– Can one number be “more prime” than another?
• Centrality
– Some members of a category may be “better
examples” than others, i.e., “prototypical” members
• Example: robins vs. chickens vs. emus
2003.09.02 - SLIDE 27
Properties of Categorization
• Characteristic features
– Perceived degree of category membership
has to do with which features help define the
category
– Members usually do not have ALL the
necessary features, but have some subset
– Those members that have more of the central
features are seen as more central members
– People have conceptions of typical members
2003.09.02 - SLIDE 28
Testing for Centrality/Typicality
• Ask a series of questions, compare how long it
takes people to answer
– True or false:
• An apple is a fruit
• A plum is a fruit
• A coconut is a fruit
• An olive is a fruit
• A tomato is a fruit
• Rosch and Mervis
– The more features a fruit shares with the other fruits,
the more typical a member of the class it is
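One simplified way to turn this idea into a number is to weight each attribute by how many category members have it, then sum those weights for each member. The sketch below does that in the spirit of a family resemblance score; the attribute lists are invented for illustration and are not Rosch and Mervis's data.

```python
from collections import Counter

# Hypothetical attribute lists for the fruits on the slide.
fruits = {
    "apple": {"sweet", "round", "grows on tree", "eaten raw"},
    "plum": {"sweet", "round", "grows on tree", "eaten raw"},
    "coconut": {"round", "grows on tree", "hard shell"},
    "olive": {"grows on tree", "salty", "small"},
    "tomato": {"round", "eaten raw", "used as vegetable"},
}

# Weight of an attribute = number of category members that have it.
attribute_counts = Counter(a for attrs in fruits.values() for a in attrs)

def family_resemblance(name):
    """Sum of attribute weights for one member; higher means more typical."""
    return sum(attribute_counts[a] for a in fruits[name])

for name in sorted(fruits, key=family_resemblance, reverse=True):
    print(name, family_resemblance(name))
```

With these assumed attributes, apple and plum score highest and olive lowest, matching the intuition that they are better and worse examples of "fruit".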
2003.09.02 - SLIDE 29
Characteristic Features
• Is a cat on a mat a cat?
• Is a dead cat a cat?
• Is a photo of a cat a cat?
• Is a cat with three legs a cat?
• Is a cat that barks a cat?
• Is a cat with a dog’s brain a cat?
• Is a cat with every cell replaced by a dog’s cells a cat?
2003.09.02 - SLIDE 30
Properties of Categorization
• Basic-level categories
– Categories are organized into a hierarchy
from the most general to the most specific, but
the level that is most cognitively basic is “in
the middle” of the hierarchy
• Basic-level primacy
– Basic-level categories are functionally primary
with respect to factors including ease of
cognitive processing (learning, reasoning,
recognition, etc.)
2003.09.02 - SLIDE 31
Basic-Level Categories
• Brown 1958, 1965, Berlin et al., 1972, 1973
• Folk biology:
– Unique beginner: plant, animal
– Life form: tree, bush, flower
– Generic name: pine, oak, maple, elm
– Specific name: Ponderosa pine, white pine
– Varietal name: Western Ponderosa pine
• No overlap between levels
• Level 3 is basic
– Corresponds to genus
– Folk biological categories correspond accurately to
scientific biological categories only at the basic level
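The five-level folk taxonomy can be sketched as a child-to-parent table, with the basic level found by walking up to the third level (generic name). The parent links below are assumed from the slide's examples, and the function names are illustrative.

```python
LEVELS = ["unique beginner", "life form", "generic name",
          "specific name", "varietal name"]

# Child -> parent links; one parent each, since levels do not overlap.
PARENT = {
    "Western Ponderosa pine": "Ponderosa pine",
    "Ponderosa pine": "pine",
    "white pine": "pine",
    "pine": "tree",
    "oak": "tree",
    "tree": "plant",
}

def path_to_root(name):
    """Return [name, parent, grandparent, ...] up to the unique beginner."""
    path = [name]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def level_of(name):
    """Name of the taxonomy level, based on depth below the unique beginner."""
    return LEVELS[len(path_to_root(name)) - 1]

def basic_level_name(name):
    """Ancestor (or the name itself) at the generic level, or None if above it."""
    root_first = path_to_root(name)[::-1]
    basic_index = LEVELS.index("generic name")
    return root_first[basic_index] if len(root_first) > basic_index else None

print(level_of("white pine"), "->", basic_level_name("white pine"))
print(level_of("Western Ponderosa pine"), "->", basic_level_name("Western Ponderosa pine"))
```

A retrieval interface built on such a table could label items by their basic-level name ("pine") while still storing the more specific term.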
2003.09.02 - SLIDE 32
Psychologically Primary Levels
SUPERORDINATE    BASIC LEVEL    SUBORDINATE
animal           dog            terrier
furniture        chair          rocker
• Children take longer to learn superordinate
categories above the basic level
• Superordinate categories above the basic
level are not associated with mental
images or motor actions
2003.09.02 - SLIDE 33
Basic-Level Categorization
• Perception
– Overall perceived shape
– Single mental image
– Fast identification
• Function
– General motor program
• Communication
– Shortest, most commonly used and contextually neutral words
– First learned by children
• Knowledge Organization
– Most attributes of category members stored at this level
2003.09.02 - SLIDE 34
Middle-Out Categorization
• Top down
– Object
• Writing implement
– Pen
• Bottom up
– Sanford Uniball Black Pen
• Ink Pen
– Pen
• Middle out
– Writing implement
• Pen
– Ink Pen
2003.09.02 - SLIDE 35
Summary
• Processes of categorization underlie many of the issues
having to do with information organization
• Categorization is messier than our computer systems
would like
• Human categories have graded membership, consisting
of family resemblances
– Family resemblance is expressed in part by which subset of
features is shared
– It is also determined by underlying understandings of the world
that do not get represented in most systems
• Basic-level categories, as well as subordinate and
superordinate categories, seem to be cognitively real
and therefore important in the design of information
organization and retrieval systems
2003.09.02 - SLIDE 36
Discussion Questions (Lakoff)
• Margaret Tong on Lakoff
– If categorization is embodied, i.e., is a consequence
of bodily experience, and if there is a pool of
information so large that it must be categorized by a
computer (beyond human capacity to categorize),
then does the computer incorporate ‘bodily
experience’? If so, how? If not, does it have to rely
on the so-called classical view of categorization?
– The objects under study by various researchers are
mostly physical, such as trees, birds and colors.
Would the same theory apply if the entity to
categorize is information, which is somewhat
intangible?
2003.09.02 - SLIDE 37
Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
– Does the existence of prototype members offer
any support for the conduit theory of language
discussed last week? For instance, does the
fact that diverse peoples across many
cultures will all select focal blue as the best
example of their word for blue imply that there
really is a transmittable idea contained in that
word?
2003.09.02 - SLIDE 38
Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
– In the discussion of basic-level categories, Lakoff
opens with Brown, whose examples of basic names
determined by distinctive actions seem to fall at the
“life form” level - flower, ball, cat, etc. Then he
attempts to slide seamlessly into the discussion of
Tzeltal plant classification, where the basic level is the
“genus” level – oak, maple, etc. (or, to relate to the
earlier examples, rose, baseball, Persian). It seems to
me, though I’ve not experimented, that children
actually learn Brown’s life-form-level words before the
genus-level (tree before oak, flower before rose). So
what is the basic-level category? Does it really exist,
and is it predictable?
2003.09.02 - SLIDE 39
Discussion Questions (Lakoff)
• Carolyn Cracraft on Lakoff
– In terms of our real concerns, i.e., organization of information in
a library or database, it seems that prototype theory is readily
applicable but basic-level theory is less so. I can see where,
given a question like “How did the Egyptians build the
pyramids?”, people would categorize potentially relevant
information and feel that some pieces of information were better
than others (like in Barsalou’s ad-hoc categories referred to at
the bottom of page 45). But I’m not sure I understand Lakoff’s
claim that basic-level categories can be extended from the
physical world into “event categories” by way of metaphor
(bottom of page 47). What does he mean by event categories?
Would these event categories have relevance to questions of
information retrieval or abstract knowledge organization like the
one posed above?
2003.09.02 - SLIDE 40
Discussion Questions (Lakoff)
• Simon King on Lakoff
– Does experientialism completely invalidate
objectivism as a model of cognition? Even if we
accept prototype theory and believe that human
thought is not based on rigid classical categorization,
isn’t it possible that humans find it simpler to internally
represent categories by their prototypical members
and that this is simply a cognitive shortcut? The
same thoughts could be formed by manipulating
categories, even though this is not how humans think.
Does it even make sense to think about cognition
outside of our human context?
2003.09.02 - SLIDE 41
Discussion Questions (Lakoff)
• Simon King on Lakoff
– Does objectivism preclude imagination and creativity?
If thought is atomistic does this mean that there can
be no intuition or ‘leaps’ of logic? Lakoff states “every
time we categorize something in a way that does not
mirror nature, we are using general human
imaginative capabilities.” Does this mean that
imagination can be considered a form of logical error
or a mistaken internal representation of the world? Is
imagination a requirement of ‘thought’ or can some
organism or system be said to think if it operates on
logic alone?
2003.09.02 - SLIDE 42
Next Time
• Knowledge Representation
2003.09.02 - SLIDE 43
Homework (!)
• Read the handouts
– “The Vocabulary Problem in Human-System
Communication” (G. W. Furnas, T. K.
Landauer, L. M. Gomez, S. T. Dumais)
– “Commonsense-Based Interfaces” (M.
Minsky)
– “CYC: A Large-Scale Investment in
Knowledge Infrastructure” (D. B. Lenat)
2003.09.02 - SLIDE 44