Uncovering Mental Representations with Markov Chain Monte Carlo Adam N. Sanborn

advertisement
Uncovering Mental Representations with
Markov Chain Monte Carlo
Adam N. Sanborn
Indiana University, Bloomington
Thomas L. Griffiths
University of California, Berkeley
Richard M. Shiffrin
Indiana University, Bloomington
Abstract
A key challenge for cognitive psychology is the investigation of mental representations, such as object categories, subjective probabilities, choice utilities,
and memory traces. In many cases, these representations can be expressed
as a non-negative function defined over a set of objects. We present a behavioral method for estimating these functions. Our approach uses people as
components of a Markov chain Monte Carlo (MCMC) algorithm, a sophisticated sampling method originally developed in statistical physics. Experiments 1 and 2 verified the MCMC method by training participants on various category structures and then recovering those structures. Experiment 3
demonstrated that the MCMC method can be used estimate the structures
of the real-world animal shape categories of giraffes, horses, dogs, and cats.
Experiment 4 combined the MCMC method with multidimensional scaling
to demonstrate how different accounts of the structure of categories, such as
prototype and exemplar models, can be tested, producing samples from the
categories of apples, oranges, and grapes.
Determining how people mentally represent different concepts is one of the central
goals of cognitive psychology. Identifying the content of mental representations allows us to
explore the correspondence between those representations and the world, and to understand
their role in cognition. Psychologists have developed a variety of sophisticated techniques
for estimating certain kinds of mental representations, such as the spatial or featural repreThe authors would like to thank Jason Gold, Rich Ivry, Michael Jones, Woojae Kim, Krystal Klein, Tania
Lombrozo, Chris Lucas, Angela Nelson, Rob Nosofsky and Jing Xu for helpful comments. ANS was supported
by an Graduate Research Fellowship from the National Science Foundation and TLG was supported by grant
number FA9550-07-1-0351 from the Air Force Office of Scientific Research while completing this research.
Experiments 1, 2, and 3 were conducted while ANS and TLG were at Brown University. Preliminary results
from Experiments 2 and 3 were presented at the 2007 Neural Information Processing Systems conference.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
2
sentations underlying similarity judgments, from behavior (e.g., Shepard, 1962; Shepard &
Arabie, 1979; Torgerson, 1958). However, the standard method used for investigating the
vast majority of mental representations, such as object categories, subjective probabilities,
choice utilities, and memory traces, is asking participants to make judgments on a set of
stimuli that the researcher selects before the experiment begins. In this paper, we describe a
novel experimental procedure that can be used to efficiently identify mental representations
that can be expressed as a non-negative function over a set of objects, a broad class that
includes all of the examples mentioned above.
The basic idea behind our approach is to design a procedure that produces samples
of stimuli that are not chosen before the experiment begins, but are adaptively selected to
concentrate in regions where the function in question has large values. More formally, for
any non-negative function f (x) over a space of objects X , we can define a corresponding
probability distribution p(x) ∝ f (x), being the distribution obtained when f (x) is normalized over X . Our approach provides a way to draw samples from this distribution. For
example, if we are investigating a category, which is associated with a probability distribution or similarity function over objects, this procedure will produce samples of objects
that have high probability or similarity for that category. We can also use the samples
to compute any statistic of interest concerning a particular function. A large number of
samples will provide a good approximation to the means, variances, covariances, and higher
moments of the distribution p(x). This makes it possible to test claims about mental representations made by different cognitive theories. For instance, we can compare the samples
that are produced with the claims about subjective probability distributions that are made
by probabilistic models of cognition (e.g., Anderson, 1990; Ashby & Alfonso-Reese, 1995;
Oaksford & Chater, 1998).
The procedure that we develop for sampling from these mental representations is
based on a method for drawing samples from complex probability distributions known as
Markov chain Monte Carlo (MCMC) (an introduction is provided by Neal, 1993). In this
method, a Markov chain is constructed in such a way that it is guaranteed to converge to a
particular distribution, allowing the states of the Markov chain to be used in the same way
as samples from that distribution. Transitions between states are made using a series of
local decisions as to whether to accept a proposed change to a given object, requiring only
limited knowledge of the underlying distribution. As a consequence, MCMC can be applied in contexts where other Monte Carlo methods fail. The first MCMC algorithms were
used for solving challenging probabilistic problems in statistical physics (Metropolis, Rosenbluth, Rosenbluth, Teller, & Teller, 1953), but they have become commonplace in a variety
of disciplines, including statistics (Gilks, Richardson, & Spiegelhalter, 1996) and biology
(Huelsenbeck, Ronquist, Nielsen, & Bollback, 2001). These methods are also beginning to
be used in modeling of cognitive psychology (Farrell & Ludwig, in press; Griffiths, Steyvers,
& Tenenbaum, 2007; Morey, Rouder, & Speckman, 2008; Navarro, Griffiths, Steyvers, &
Lee, 2006)
In standard uses of MCMC, the distribution p(x) from which we want to generate
samples is known – in the sense that it is possible to write down at least a function proportional to p(x) – but difficult to sample from. Our application to mental representations is
more challenging, as we are attempting to sample from distributions that are unknown to
the researcher, but implicit in the behavior of our participants. To address this problem, we
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
3
designed a task that will allow people to act as elements of an MCMC algorithm, letting us
sample from the distribution p(x) associated with the underlying representation f (x). This
task is a simple two-alternative forced choice. Much research has been devoted to relating
the magnitude of psychological responses to choice probabilities, resulting in mathematical
models of these tasks (e.g., Luce, 1963). We point out an equivalence between a model of
human choice behavior and one of the elements of an MCMC algorithm, and then use this
equivalence to develop a method for obtaining information about different kinds of mental
representations. This allows us to harness the power of a sophisticated sampling algorithm
to efficiently explore these representations, while never presenting our participants with
anything more exotic than a choice between two alternatives.
While our approach can be applied to any representation that can be expressed as
a non-negative function over objects, our focus in this paper will be on the case of object
categories. Grouping objects into categories is a basic mental operation that has been
widely studied and modeled. Most successful models of categorization assign a category to
a new object according to the similarity of that object to the category representation. The
similarity of all possible new objects to the category representation can be thought of as a
non-negative function over these objects, f (x). This function, which we are interested in
determining, is a consequence of not only category membership, but also of the decisional
processes that extend partial category membership to objects that are not part of the
category representation.
Our method is certainly not the first developed to investigate categorization. Researchers have invented many methods to investigate the categorization process: some determine the psychological space in which categories lie, others determine how participants
choose between different categories, and still others determine what objects are members of
a category. All of these methods have been applied in simple training experiments, but they
do not always scale well to the large stimulus spaces needed to represent natural stimuli.
Consequently, the majority of experiments that investigate human categorization models
are training studies. Most of these training experiments use simple stimuli with only a
few dimensions, such as binary dimensional shapes (Shepard, Hovland, & Jenkins, 1961),
lengths of a line (Huttenlocher, Hedges, & Vevea, 2000), or Munsell colors (Nosofsky, 1987;
Roberson, Davies, & Davidoff, 2000), though more complicated stimuli have also been used
(e.g., Posner & Keele, 1968). Research on natural perceptual categories has focused on exploring the relationships of category labels (Storms, Boeck, & Ruts, 2000; Rosch & Mervis,
1975; Rosch, Mervis, Gray, Johnson, & Boyes-Braem, 1976), with a few studies examining
members of a perceptual category using simple parameterized stimuli (e.g., Labov, 1973).
It is difficult to investigate natural perceptual categories with realistic stimuli because
natural stimuli have a great deal of variability. The number of objects that can be distinguished by our senses is enormous. Our MCMC method makes it possible to explore the
structure of natural categories in relatively large stimulus spaces. We illustrate the potential
of this approach by estimating the probability distributions associated with a set of natural
categories, and asking whether these distributions are consistent with different models of
categorization. An enduring question in the categorization literature is whether people represent categories in terms of exemplars or prototypes (J. D. Smith & Minda, 1998; Nosofsky
& Zaki, 2002; Minda & Smith, 2002; Storms et al., 2000; Nosofsky, 1988; Nosofsky & Johansen, 2000; Zaki, Nosofsky, Stanton, & Cohen, 2003; E. E. Smith, Patalano, & Jonides,
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
4
1998; Heit & Barsalou, 1996). In an exemplar-based representation, a category is simply
the set of its members, and all of these members are used when evaluating whether a new
object belongs to the category (e.g., Medin & Schaffer, 1978; Nosofsky, 1986). In contrast, a
prototype model assumes that people have an abstract representation of a category in terms
of a typical or central member, and use only this abstract representation when evaluating
new objects (e.g., Posner & Keele, 1968; Reed, 1972). These representations are associated
with different probability distributions over objects, with exemplar models corresponding to
a nonparametric density estimation scheme that admits distributions of arbitrary structure
and prototype models corresponding to a parametric density estimation scheme that favors
unimodal distributions (Ashby & Alfonso-Reese, 1995). The distributions for different natural categories produced by MCMC could thus be used to discriminate between these two
different accounts of categorization.
The plan of the paper is as follows. The next section describes MCMC in general
and the Metropolis method and Hastings acceptance rules in particular. We then describe
the experimental task we use to connect human judgments to MCMC, and outline how
this task can be used to study the mental representation of categories. Experiment 1 tests
three variants of this task using simple one-dimensional stimuli and trained distributions,
while Experiment 2 uses one of these variants to show that our method can be used to
recover different category structures from human judgments. Experiment 3 applies the
method to four natural categories, providing estimates of the distributions over animal
shapes that people associate with giraffes, horses, cats, and dogs. Experiment 4 combines
the MCMC method with multidimensional scaling in order to explore the structure of
natural categories in a perceptual space. This experiment illustrates how the suitability of a
simple prototype model for describing natural categories could be tested by combining these
two methodologies. Finally, in the General Discussion we discuss how MCMC relates to
other experimental methods, and describe some potential problems that can be encountered
in using MCMC, together with possible solutions.
Markov chain Monte Carlo
Generating samples from a probability distribution is a common problem that arises
when working with probabilistic models in a wide range of disciplines. This problem is
rendered more difficult by the fact that there are relatively few probability distributions for
which direct methods for generating samples exist. For this reason, a variety of sophisticated
Monte Carlo methods have been developed for generating samples from arbitrary probability distributions. Markov chain Monte Carlo (MCMC) is one of these methods, being
particularly useful for correctly generating samples from distributions where the probability of a particular outcome is multiplied by an unknown constant – a common situation in
statistical physics (Newman & Barkema, 1999) and Bayesian statistics (Gilks et al., 1996).
A Markov chain is a sequence of random variables where the value taken by each
variable (known as the state of the Markov chain) depends only on the value taken by
the previous variable (Norris, 1997). We can generate a sequence of states from a Markov
chain by iteratively drawing each variable from its distribution conditioned on the previous
variable. This distribution expresses the transition probabilities associated with the Markov
chain, which are the probabilities of moving from one state to the next. An ergodic Markov
chain, having a non-zero probability of reaching any state from any other state in a finite and
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
5
aperiodic number of iterations, eventually converges to a stationary distribution over states.
That is, after many iterations the probability that a chain is in a particular state converges
to a fixed value, the stationary probability, regardless of the initial value of the chain. The
stationary distribution is determined by the transition probabilities of the Markov chain.
Markov chain Monte Carlo algorithms are schemes for constructing Markov chains
that have a particular distribution (the “target distribution”) as their stationary distribution. If a sequence of states is generated from such a Markov chain, the states towards the
end of the sequence will arise with probabilities that correspond to the target distribution.
These states of the Markov chain can be used in the same way as samples from the target
distribution, although the correlations among the states that result from the dynamics of
the Markov chain will reduce the effective sample size. The original scheme for constructing
these types of Markov chains is the Metropolis method (Metropolis et al., 1953), which was
subsequently generalized by Hastings (1970).
The Metropolis method
The Metropolis method specifies a Markov chain by splitting the transition probabilities into two parts: a proposal distribution and an acceptance function. A proposed state
is drawn from the proposal distribution (which can depend on the current state) and the
acceptance function determines the probability with which that proposal is accepted as the
next state. If the proposal is not accepted, then the current state is taken as the next state.
This procedure defines a Markov chain, with the probability of the next state depending
only on the current state, and it can be iterated until it converges to the stationary distribution. Through careful choice of the proposal distribution and acceptance function, we
can ensure that this stationary distribution is the target distribution.
A sufficient, but not necessary, condition for specifying a proposal distribution and
acceptance function that produce the appropriate stationary distribution is detailed balance,
with
p(x)q(x∗ |x)a(x∗ ; x) = p(x∗ )q(x|x∗ )a(x; x∗ )
(1)
where p(x) is the stationary distribution of the Markov chain (our target distribution),
q(x∗ |x) is the probability of proposing a new state x∗ given the current state x, and a(x∗ ; x)
is the probability of accepting that proposal. We can motivate detailed balance by first
imagining that we are running a Markov chain in which we draw a new proposed state from
a uniform distribution over all possible proposals and then accept the proposal with some
probability. The transition probability of the Markov chain is equal to the probability of
accepting this proposal. We would like for the states of the Markov chain to act as samples
from a target distribution. A different way to think of this constraint is to require the ratio
of the number of visits of the chain to state x to the number of visits of the chain to state x∗
to be equal to the ratio of the probabilities of these two states under the target distribution.
A sufficient, but not necessary, condition for satisfying this requirement is to enforce this
constraint for every transition probability. Formally we can express this equality between
ratios as
p(x)
a(x; x∗ )
=
(2)
p(x∗ )
a(x∗ ; x)
Now, to generalize, instead of assuming that each proposal is drawn uniformly from the space
of parameters, we assume that each proposal is drawn from a distribution that depends on
6
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
the current state, q(x∗ |x). In order to correct for a non-uniform proposal we need to multiply
each side of Equation 2 by the proposal distribution. The result is equal to the detailed
balance equation.
The Metropolis method specifies an acceptance function which satisfies detailed balance for any symmetric proposal distribution, with q(x∗ |x) = q(x|x∗ ) for all x and x∗ . The
Metropolis acceptance function is
a(x∗ ; x) = min
p(x∗ )
p(x)
,1 .
(3)
meaning that proposals with higher probability than the current state are always accepted.
To apply this method in practice, a researcher must be able to determine the relative
probabilities of any two states. However, it is worth noting that the researcher need not
know p(x) exactly. Since the algorithm only uses the ratio p(x∗ )/p(x), it is sufficient to
know the distribution up to a multiplicative constant. That is, the Metropolis method can
still be used even if we only know a function f (x) ∝ p(x).
Hastings acceptance functions
Hastings (1970) gave a more general form for acceptance functions that satisfy detailed
balance, and extended the Metropolis method to admit asymmetric proposal distributions.
The key observation is that Equation 1 is satisfied by any acceptance rule of the form
s(x∗ , x)
a(x∗ ; x) =
1+
p(x) q(x∗ |x)
p(x∗ ) q(x|x∗ )
(4)
where s(x∗ , x) is a symmetric in x and x∗ and 0 ≤ a(x∗ ; x) ≤ 1 for all x and x∗ . Taking
s(x∗ , x) =

p(x) q(x∗ |x)
p(x∗ ) q(x|x∗ )


 1 + p(x∗ ) q(x|x∗ ) , if p(x) q(x∗ |x) ≥ 1
(5)

∗)
∗

p(x∗ ) q(x|x∗ )
 1 + p(x ) q(x|x
p(x) q(x∗ |x) , if p(x) q(x∗ |x) ≤ 1
yields the Metropolis acceptance function for symmetric proposal distributions, and generalizes it to
p(x∗ ) q(x|x∗ )
a(x∗ ; x) = min
,
1
(6)
p(x) q(x∗ |x)
for asymmetric proposal distributions. With a symmetric proposal distribution, taking
s(x∗ , x) = 1 gives the Barker acceptance function
a(x∗ ; x) =
p(x∗ )
p(x∗ ) + p(x)
(7)
where the acceptance probability is proportional to the probability of the proposed and the
current state under the target distribution (Barker, 1965). While the Metropolis acceptance function has been shown to result in lower asymptotic variance (Peskun, 1973) and
faster convergence to the stationary distribution (Billera & Diaconis, 2001) than the Barker
acceptance function, it has also been argued that neither rule is clearly dominant across all
situations (Neal, 1993).
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
7
Summary
Markov chain Monte Carlo provides a simple way to generate samples from probability
distributions for which other sampling schemes are infeasible. The basic procedure is to start
a Markov chain at some initial state, chosen arbitrarily, and then apply one of the methods
for generating transitions outlined above, proposing a change to the state of the Markov
chain and then deciding whether or not to accept this change based on the probabilities
of the different states under the target distribution. After allowing enough iterations for
the Markov chain to converge to its stationary distribution (known as the “burn-in”), the
states of the Markov chain can be used to answer questions about the target distribution
in the same way as a set of samples from that distribution.
An acceptance function from human behavior
The way that transitions between states occur in the Metropolis-Hastings algorithm
already has the feel of a psychological task: two objects are presented, one being the
current state and one the proposal, and a choice is made between them. We could thus
imagine running an MCMC algorithm in which a computer generates a proposed variation
on the current state and people choose between the current state and the proposal. In this
section, we consider how to lead people to choose between two objects in a way that would
correspond to a valid acceptance function.
From a task to an acceptance function
Assume that we want to gather information about a mental representation characterized by a non-negative function over objects, f (x). In order for people’s choices to act as an
element in an MCMC algorithm that produces samples from a distribution proportional to
f (x), we need to design a task that leads them to choose between two objects x∗ and x in
a way that corresponds to one of the valid acceptance functions introduced in the previous
section. In particular, it is sufficient to construct a task such that the probability with
which people choose x∗ is
f (x∗ )
a(x∗ ; x) =
(8)
f (x∗ ) + f (x)
this being the Barker acceptance function (Equation 7) for the distribution p(x) ∝ f (x).
Equation 8 has a long history as a model of human choice probabilities, where it is
known as the Luce choice rule or the ratio rule (Luce, 1963). This rule has been shown
to provide a good fit to human data when participants choose between two stimuli based
on a particular property (Bradley, 1954; Clarke, 1957; Hopkins, 1954), though the rule
does not seem to scale with additional alternatives (Morgan, 1974; Rouder, 2004; Wills,
Reimers, Stewart, Suret, & McLaren, 2000). When f (x) is a utility function or probability
distribution, the behavior described in Equation 8 is known as probability matching and
is often found empirically (Vulkan, 2000). The ratio rule has also been used to convert
psychological response magnitudes into response probabilities in many models of cognition
such as the Similarity Choice Model (Luce, 1963; Shepard, 1957), the Generalized Context
Model (Nosofsky, 1986), the Probabilistic Prototype model (Ashby, 1992; Nosofsky, 1987),
the Fuzzy Logical Model of Perception (Oden & Massaro, 1978), and the TRACE model
(McClelland & Elman, 1986).
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
8
There are additional simple theoretical justification for using Equation 8 to model
people’s response probabilities. One justification is to use the logistic function to model
a soft threshold on the log odds of two alternatives (cf. Anderson, 1990; Anderson &
Milson, 1989). The log odds in favor of x∗ over x under the distribution p(x) will be
∗)
Λ = log p(x
p(x) . Assuming that people receive greater reward for a correct response than
an incorrect response, the optimal solution to the problem of choosing between the two
objects is to deterministically select x∗ whenever Λ > 0. However, we could imagine that
people place their thresholds at slightly different locations or otherwise make choices nondeterministically, such that their choices can be modeled as a logistic function on Λ with
the probability of choosing x∗ being 1/(1 + exp{−Λ}). An elementary calculation shows
that this yields the acceptance rule given in Equation 8.
Another theoretical justification is to assume that log probabilities are noisy and
the decision is deterministic. The decision is made by sampling from each of the noisy log
probabilities and choosing the higher sample. If the additive noise follows a Type I Extreme
Value (e.g., Gumbel) distribution, then it is known that the resulting choice probabilities
are equivalent to the ratio rule (McFadden, 1974; Yellott, 1977).
The correspondence between the Barker acceptance function and the ratio rule suggests that we should be able to use people as elements in MCMC algorithms in any context
where we can lead them to choose in a way that is well-modeled by the ratio rule. We can
then explore the representation characterized by f (x) through a series of choices between
two alternatives, where one of the alternatives that is presented is that selected on the
previous trial and the other is a variation drawn from an arbitrary proposal distribution.
The result will be an MCMC algorithm that produces samples from p(x) ∝ f (x), providing
an efficient way to explore those parts of the space where f (x) takes large values.
Allowing a wider range of behavior
Probability matching, as expressed in the ratio rule, provide a good description of
human choice behavior in many situations, but motivated participants can produce behavior
that is more deterministic than probability matching (Vulkan, 2000; Ashby & Gott, 1988).
Several models of choice, particularly in the context of categorization, have been extended
in order to account for this behavior (Ashby & Maddox, 1993) by using an exponentiated
version of Equation 8 to map category probabilities onto response probabilities,
a(x∗ ; x) =
f (x∗ )γ
f (x∗ )γ + f (x)γ
(9)
where γ raises each term on the right side of Equation 8 to a constant. The parameter
γ makes it possible to interpolate between probability matching and purely deterministic
responding: when γ = 1 participants probability match, and when γ = ∞ they choose
deterministically.
Equation 9 can be derived from an extension to the “soft threshold” argument given
above. If responses follow a logistic function on the log posterior odds with gain γ, the
probability of choosing x∗ is 1/(1 + exp{γΛ}) where Λ is defined above. The case discussed
above corresponds to γ = 1, and as γ increases, the threshold moves closer to a step function
at Λ = 0. Carrying out the same calculation as for the case where γ = 1 in the general case
yields the acceptance rule given in Equation 9.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
9
The acceptance function in Equation 9 contains an unknown parameter, so we will
not gain as much information about f (x) as when we make the stronger assumption that
participants probability match (i.e. fixing γ = 1). Substituting Equation 9 into Equation 1
and assuming a symmetric proposal distribution we can show that the exponentiated choice
rule is an acceptance function for a Markov chain with stationary distribution
p(x) ∝ f (x)γ
(10)
Thus, using the weaker assumptions of Equation 9 as a model of human behavior, we can
estimate f (x) up to a constant exponent. The relative probabilities of different objects will
remain ordered, but cannot be directly evaluated. Consequently, the distribution defined
in Equation 10 has the same peaks as the function produced by directly normalizing f (x),
and the ordering of the variances for the dimensions is unchanged, but the exact values of
the variances will depend on the parameter γ.
Categories
The results in the previous section provide a general method for exploring mental
representations that can be expressed in terms of a non-negative function over a set of
objects. In this section, we consider how this approach can be applied to the case of
categories, providing a concrete example of a context in which this method can be used and
an illustration of how tasks can be constructed to meet the criteria for defining an MCMC
algorithm. We begin with a brief review of the methods that have been used to study
mental representations of category structure. In the following discussion it is important
to keep in mind that performance in any task used to study categories involves both the
mental representations with which the category members, or category abstractions, are held
in memory, and the processes that operate on those representations to produce the observed
data. In this article our aim is to map the category structure that results from both the
representations and processes acting together. Much research in the field has of course been
aimed at specifying the representations, the processes, or both, as described briefly in the
review that follows.
Methods for studying categories
Cognitive psychologists have extensively studied the ways in which people group objects into categories, investigating issues such as whether people represent categories in
terms of exemplars or prototypes (J. D. Smith & Minda, 1998; Nosofsky & Zaki, 2002;
Minda & Smith, 2002; Storms et al., 2000; Nosofsky, 1988; Nosofsky & Johansen, 2000;
Zaki et al., 2003; E. E. Smith et al., 1998; Heit & Barsalou, 1996). Many different types
of methodologies have been developed in an effort to distinguish the effects of prototypes
and exemplars, and to test other theories of categorization. Two of the most commonly
used methods are multidimensional scaling (MDS; Shepard, 1962; Torgerson, 1958) and
additive clustering (Shepard & Arabie, 1979). These methods each take a set of similarity
judgments and find the representation that best respects these judgments. MDS uses similarity judgments to constrain the placement of objects within a similarity space – objects
judged most similar are placed close together and those judged dissimilar are place far apart.
Additive clustering works in a similar manner, except that the resulting representation is
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
10
based on discrete features that are either shared or not shared between objects. These two
methodologies are used in order to establish the psychological space for the stimuli in an
experiment, assisting in the evaluation of categorization models. MDS has been applied
to many natural categories, including Morse code (Rothkopf, 1957) and colors (Nosofsky,
1987). Traditional MDS does not scale well to a large number of objects because all pairwise
judgments must be collected, but alternative formulations have been developed to deal with
large numbers of stimuli (Hadsell, Chopra, & LeCun, 2006; de Silva & Tenenbaum, 2003;
Goldstone, 1994).
Discriminative tests are a second avenue for comparing theories about categorization.
In these tests, participants are shown a single object and asked to pick the category to
which it belongs. The choices that are made reflect the structures of all the categories under
consideration and can be analyzed to determine the decision boundaries between categories.
This is perhaps the most commonly used method for distinguishing between many different
types of models (e.g., Nosofsky, Gluck, Palmeri, McKinley, & Glauthier, 1994; Ashby &
Gott, 1988), and it has been combined with MDS to provide a more diagnostic test (e.g,
Nosofsky, 1986, 1987). In addition, discriminative tests can be applied to naturalistic stimuli
with large numbers of parameters. In psychophysics, the response classification technique
is a discriminative test that has been used to infer the decision boundary that guides
perceptual decisions between two stimuli (Ahumada & Lovell, 1971). This methodology
has been applied to visualize the decision boundary between images with as many as 10,000
pixels (Gold, Murray, Bennett, & Sekuler, 2000).
A third method for exploring category structure is to evaluate the strength of category
membership for a set of the objects. In categorization studies, this goal is generally accomplished by asking participants for typicality ratings of stimuli (e.g., Storms et al., 2000;
Bourne, 1982). This procedure was originally used to justify the existence of a category
prototype (Rosch & Mervis, 1975; Rosch, Simpson, & Miller, 1976), but typicality ratings
have also been used to support exemplar models (Nosofsky, 1988). Typicality ratings are
difficult to collect for a large set of stimuli, because a judgment must be made for every
object. In situations in which there are a large number of objects and few belong to a
category, this method will be very inefficient. The key claim of this paper is the idea that
the MCMC method can provide the same insight into categorization as given by typicality
judgments, but in a form that scales well to the large numbers of stimuli needed to explore
natural categories.
Representing categories
Computational models of categorization quantify different hypotheses about how categories are represented. These models were originally developed as accounts of the cognitive
processes involved in categorization, and specify the structure of categories in terms of a
similarity function over objects. Subsequent work identified these models as being equivalent to schemes for estimating a probability distribution (a problem known as density
estimation), and showed that the underlying representations could be expressed as probability distributions over objects (Ashby & Alfonso-Reese, 1995). We will outline the basic
ideas behind two classic models – exemplar and prototype models – from both of these
perspectives. While we will use the probabilitistic perspective throughout the rest of the
paper, the equivalence between these two perspectives mean that our analyses carry over
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
11
equally well to the representation of categories as a similarity function over objects.
Exemplar and prototype models share the basic assumption that people assign stimuli
to categories based on similarity. Given a set of N − 1 stimuli x1 , . . . , xN −1 with category
labels c1 , . . . , cN −1 , these models express the probability that stimulus N is assigned to
category c as
ηN,c βc
p(cN = c|xn ) = P
(11)
c0 ηN,c0 βc0
where ηN,c is the similarity of the stimulus xN to category c and βc is the response bias
for category c. The key difference between the models is in how ηN,c , the similarity of a
stimulus to a category, is computed.
In an exemplar model (e.g., Medin & Schaffer, 1978; Nosofsky, 1986), a category is
represented by all of the stored instances of that category. The similarity of stimulus N to
category c is calculated by summing the similarity of the stimulus to all stored instances of
the category. That is,
X
ηN,i
(12)
ηN,c =
i|ci =c
where ηN,i is a symmetric measure of the similarity between the two stimuli xN and xi . The
similarity measure is typically defined as a decaying exponential function of the distance
between the two stimuli, following Shepard (1987). In a prototype model (e.g., Reed,
1972), a category c is represented by a single prototypical instance. In this formulation, the
similarity of a stimulus N to category c is defined to be
ηN,c = ηN,pc
(13)
where pc is the prototypical instance of the category and ηN,pc is a measure of the similarity
between stimulus N and the prototype pc , as used in the exemplar model. One common
way of defining the prototype is as the centroid of all instances of the category in some
psychological space, i.e.,
1 X
pc =
xi
(14)
Nc i|c =c
i
where Nj is the number of instances of the category (i.e. the number of stimuli for which
ci = c). There are obviously a variety of possibilities within these extremes, with similarity
being computed to a subset of the exemplars within a category, and these have been explored
in more recent models of categorization (e.g., Love, Medin, & Gureckis, 2004; Vanpaemel,
Storms, & Ons, 2005).
Taking a probabilistic perspective on the problem of categorization, a learner should
believe stimulus N belongs to category c with probability
p(xN |cN = c)p(cN = c)
0
0
c0 p(xN |cN = c )p(cN = c )
p(cN = c|xN ) = P
(15)
with the posterior probability of category c being proportional to the product of the probability of an object with features xN being produced from that category and the prior
probability of choosing that category. The distribution p(x|c), which we use as shorthand
for p(xN |cN = c), reflects the learner’s knowledge of the structure of the category, taking
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
12
into account the features and labels of the previous N −1 objects. Ashby and Alfonso-Reese
(1995) observed a connection between this Bayesian solution to the problem of categorization and the way that choice probabilities are computed in exemplar and prototype models
(i.e. Equation 11). Specifically, ηN,c can be identified with p(xN |cN = c), while βc corresponds to the prior probability of category c, p(cN = c). The difference between exemplar
and prototype models thus comes down to different ways of estimating p(x|c).
The definition of ηN,c used in an exemplar model (Equation 12) corresponds to estimating p(x|c) as the sum of a set of functions (known as “kernels”) centered on the xi
already labeled as belonging to category c, with
p(xN |cN = c) ∝
X
k(xN , xi )
(16)
i|ci =c
where k(x, xi ) is a probability distribution centered on xi .1 This is a method that is widely
used for approximating distributions in statistics, being a simple form of nonparametric
density estimation (meaning that it can be used to identify distributions without assuming that they come from an underlying parametric family) called kernel density estimation
(e.g., Silverman, 1986). The definition of ηN,c used in a prototype model (Equation 13)
corresponds to estimating p(x|c) by assuming that the distribution associated with each
category comes from an underlying parametric family, and then finding the parameters
that best characterize the instances labeled as belonging to that category. The prototype
corresponds to these parameters, with the centroid being an appropriate estimate for distributions whose parameters characterize their mean. Again, this is a common method for
estimating a probability distribution, known as parametric density estimation, in which the
distribution is assumed to be of a known form but with unknown parameters (e.g., Rice,
1995). As with the similarity-based perspective, there are also probabilistic models that
interpolate between exemplars and prototypes (e.g., Rosseel, 2002).
Tasks resulting in valid MCMC acceptance rules
Whether they are represented as similarity functions (being non-negative functions
over objects), or probability distributions, it should be possible to use our MCMC method
to investigate the stucture of categories. The critical step is finding a question that we can
ask people that leads them to respond in the way identified by Equation 8 when presented
with two alternatives. That is, we need a question that leads people to choose alternatives
with probability proportional to their similarity to the category or probability under that
category. Fortunately, it is possible to argue that several kinds of questions should have
this property. We consider three candidates.
Which stimulus belongs to the category?. The simplest question we can ask people is
to identify which of the two stimuli belongs to the category, on the assumption that this is
true for only one of the stimuli. We give a Bayesian analysis of this question in Appendix
A, showing that stimuli should be selected in proportion to their probability under the
category distribution p(x|c) under some minimal assumptions. It is also possible to argue
for this claim via an analogy to the standard categorization task, in which Equation 8 is
1
R
The constant of proportionality is determined by k(x, xi )dx, being
and is absorbed into βc to produce direct equivalence to Equation 12.
1
Nc
if
R
k(x, xi ) dx = 1 for all i,
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
13
widely used to describe choice probabilities. In the standard categorization task, people
are shown a single stimulus and asked to select one of two categories. In our task, people
are shown two stimuli and asked to identify which belongs to a single category. Just as
people make choices in proportion to the similarity between the stimulus and the category
in categorization, we expect them to do likewise here.
Which stimulus is a better example of the category?. The question of which stimulus
belongs to the catgory is straightforward for participants to interpret when it is clear that
one stimulus belongs to the category and the other does not. However, if both stimuli
have very high or very low values under the probability distribution, then the question
can be confusing. An alternative is to ask people which stimulus is a better example of
the category – asking for a typicality rating rather than a categorization judgment. This
question is about the representativeness of an example for a category. Tenenbaum and
Griffiths (2001) provide evidence for a representativeness measure that is the ratio of the
posterior probabilities of the alternatives. In Appendix A, we show that under the same
assumptions as those for the previous question, answers to this question should also follow
the Barker acceptance function. As above, this question can also be justified by appealing to
similarity-based models: Nosofsky (1988) found that a ratio rule combined with an exemplar
model predicted response probabilities when participants were asked to choose the better
example of a category from a pair of stimuli.
Which stimulus is more likely to have come from the category?. An alternative means
of relaxing the assumption that only one of the two stimuli came from the category is
to ask people for a judgment of relative probability rather than a forced choice. This
question directly measures p(x|c) or the corresponding similarity function. Assuming that
participants apply the ratio rule to these quantities, their responses should yield the Barker
acceptance function for the appropriate stationary distribution.
Summary
Based on the results in this section, we can define a simple method for drawing samples from the probability distribution (or normalized similarity function) associated with a
category using MCMC. Start with a parameterized set of objects and start with an arbitrary
member of that set. On each trial, a proposed object is drawn from a symmetric distribution
around the original object. A person chooses between the current object and the proposed
new object, responding to a question about which object is more consistent with the category. Assuming that people’s choice behavior follows the ratio rule given in Equation 8, the
stationary distribution of the Markov chain is the probability distribution p(x|c) associated
with the category, and samples from the chain provide information about the mental representation of that category. As outlined above, this procedure can also provide information
about the relative probabilities of different objects when people’s behavior is more deterministic than probability matching. To explore the value of this method for investigating
category structures we conducted a series of four experiments. The first two experiments
examine whether we can recover artificial category structures produced through training,
while the other two experiments apply it to estimation of the distributions associated with
natural categories.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
14
Figure 1. Examples of the largest and smallest fish stimuli presented to participants during training
in Experiments 1 and 2. The relative size of the fish stimuli are shown here; true display sizes are
given in the text.
Experiment 1
To test whether our assumptions about human decision making were accurate enough
to use MCMC to sample from people’s mental representations, we trained people on a
category structure and attempted to recover the resulting distribution. A simple onedimensional categorization task was used, with the height of schematic fish (see Figure 1)
being the dimension along which category structures were defined. Participants were trained
on two categories of fish height – a uniform distribution and a Gaussian distribution – being
told that they were learning to judge whether a fish came from the ocean (the uniform
distribution) or a fish farm (the Gaussian distribution). Once participants were trained,
we collected MCMC samples for the Gaussian distributions by asking participants to judge
which of two fish came from the fish farm using one of the three questions introduced
above. The three questions conditions were used to find empirical support for the theoretical
conclusion that all three questions would allow samples to be drawn from a probability
distribution. Assuming participants accurately represent the training structure and use a
ratio decision rule, the samples drawn will reflect the structure on which participants were
trained for each question condition.
Method
Participants. Forty participants were recruited from the Brown University community. The data from four participants were discarded due to computer error in the presentation. Two additional participants were discarded for reaching the upper or lower bounds
of the parameter range. These participants were replaced to give a total of 12 participants
using each of the three MCMC questions. Each participant was paid $4 for a 35 minute
session.
Stimuli. The experiment was presented on a Apple iMac G5 controlled by a script
running in Matlab using PsychToolbox extensions (Brainard, 1997; Pelli, 1997). Participants were seated approximately 44 cm away from the display. The stimuli were a modified
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
15
version of the fish stimuli used in Huttenlocher et al. (2000). The fish were constructed from
three ovals, two gray and one black, and a gray circle on a black background. Fish were all
9.1 cm long with heights drawn from the Gaussian and uniform distributions in training.
Because the display did not allow a continuous range of fish heights to be displayed, the
heights varied in steps of 0.13 cm. Examples of the smallest and largest fish are shown in
Figure 1. During the the MCMC trials, fish were allowed to range in height from 0.03 cm
to 8.35 cm.
Procedure.
Each participant was trained to discriminate between two categories of fish: ocean
fish and fish farm fish. Participants were instructed, “Fish from the ocean have to fend for
themselves and as a result they have an equal probability of being any size. In contrast,
fish from the fish farm are all fed the same amount of food, so their sizes are similar and
only determined by genetics.” These instructions were meant to suggest that the ocean fish
were drawn from a uniform distribution and the fish farm fish were drawn from a Gaussian
distribution, and that the variance of the ocean fish was greater than the variance of the
fish farm fish.
Participants saw two types of trials. In a training trial, either the uniform or Gaussian
distribution was selected with equal probability, and a single sample (with replacement) was
drawn from the selected distribution. The uniform distribution had a lower bound of 2.63
cm and an upper bound of 5.75 cm, while the Gaussian distribution had a mean of 4.19 cm
and a standard deviation of 0.39 cm. The sampled fish was shown to the participant, who
chose which distribution produced the fish. Feedback was then provided on the accuracy
of this choice. In an MCMC trial, two fish were presented on the screen. Participants
chose which of the two fish came from the Gaussian distribution. Neither fish had been
sampled from the Gaussian distribution. Instead, one fish was the state of a Markov chain
and the other fish was the proposal. The state and proposal were unlabeled and they were
randomly assigned to either the left or right side of the screen. Three MCMC chains were
interleaved during the MCMC trials. The start states of the chains were chosen to be 2.63
cm, 4.20 cm, and 5.75 cm. Relative to the training distributions, the start states were
over-dispersed, facilitating assessment of convergence. The proposal was chosen from a
symmetric discretized pseudo-Gaussian distribution with a mean equal to the current state
and standard deviation equal to the training Gaussian standard deviation. The probability
of proposing the current state was set to zero.
The question asked on the MCMC trials was varied between-participants. During the
test phase, participants were asked, “Which fish came from the fish farm?”, “Which fish
is a better example of a fish from the fish farm?”, or “Which fish was more likely to have
come from the fish farm?”. Twelve participants were assigned to each question condition.
All participants were trained on the same Gaussian distribution, with mean µ = 4.19 cm
and standard deviation σ = 3.9 mm.
The experiment was broken up into blocks of training and MCMC trials. The experiment began with 120 training trials, followed by alternating blocks of 60 MCMC trials
and 60 training trials. Training and MCMC trials were interleaved to keep participants
from forgetting the training distributions. After a total of 240 MCMC trials, there was one
final block of 60 test trials. These were identical to the training trials, except participants
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
16
were not given feedback. The final block provided an opportunity to test the discriminative
predictions of the MCMC method.
Results and Discussion
Participants were excluded if the state of any chain reached the edge of the parameter
range or if their chains did not converge to the stationary distribution. We used a heuristic
for determining convergence to the stationary distribution: every chain had to cross another
chain.2 For the participants that were retained, the first 20 trials were removed from each
chain. Figure 2 provides justification for removing the first 20 trials, as each example
participant (one from each question condition) required about this many trials for the
chains to converge. The remaining 60 trials from each chain were pooled and used in
further analyses.
The acceptance rates of the proposals should be reasonably low in MCMC applications. A high acceptance rate generally results from proposal distributions that are narrow
compared to the category distribution and thus do not explore the space well. Very low
acceptance rates occur if the proposals are too extreme and the low acceptance rate also
causes the algorithm to explore the space more slowly. A rate of 20% to 40% is often considered a good value (Roberts, Gelman, & Gilks, 1997). In this experiment, participants
accepted between 33% and 35% of proposals in the three conditions. A more extensive
exploration of the effects of acceptance rates is presented in the General Discussion.
The chains for one example participant per condition are shown in Figure 2. The
right side of the figure compares the training distribution to two estimates of the stationary distribution formed from the samples: a nonparametric kernel density estimate, and a
parametric estimate made by assuming that the stationary distribution was Gaussian. The
nonparametric kernel density estimator used a Gaussian kernel with width optimal for producing Gaussian distributions (Bowman & Azzalini, 1997), although we should note that
this kernel does not necessarily produce Gaussian distributions. The parametric density
estimator assumed that the samples come from a Gaussian distribution and estimates the
parameters of the Gaussian distribution that generated the samples. Both density estimation methods show that the samples do a good job of reproducing the Gaussian training
distributions. A more quantitative measure of how well the participants reproduced the
training distribution is to evaluate the means and standard deviations of the samples relative to the training mean and standard deviation. The averages over participants are shown
in Figure 3. Over all participants, the mean of the samples is not significantly different from
the training mean and the standard deviation of the samples is not significantly different
from the standard deviation of the training distribution for any of the questions. The t-tests
for the mean and standard deviation for each question are shown in Table 1.
The final testing block provided an opportunity to test the predictions of samples
collected by the MCMC method for a separate set of discriminative judgments. The fish
farm distribution was assumed to be Gaussian with the parameters determined by the
mean and standard deviation of the MCMC samples. The uniform distribution on which
participants were trained was assumed for the ocean fish category. The category with the
2
Many heuristics have been proposed for assessing convergence (Brooks & Roberts, 1998; Mengersen,
Robert, & Guihenneuc-Jouyaux, 1999; Cowles & Carlin, 1996). The heuristic we used is simple to apply in
a one-dimensional state space.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
17
Figure 2. Markov chains produced by participants in Experiment 1. The three rows are participants
from the three question conditions. The panels in the first column show the behavior of the three
Markov chains per participant. The black lines represent the states of the Markov chains, the dashed
line is the mean of the Gaussian training distribution, and the dot-dashed lines are two standard
deviations from the mean. The second column shows the densities of the training distributions.
These training densities can be compared to the MCMC samples, which are described by their
kernel density estimates and Gaussian fits in the last two columns.
Figure 3. Results of Experiment 1. The bar plots show the mean of µ and σ across the MCMC
samples produced by participants in all three question conditions. Error bars are one standard error.
The black dot indicates the actual value of µ and σ for each condition, which corresponds closely
with the MCMC samples.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
18
Table 1: Statistical Significance Tests for Question Conditions in Experiment 1
Question
Which?
Better Example?
More Likely?
Statistic
Mean
Std Dev
Mean
Std Dev
Mean
Std Dev
df
11
11
11
11
9
9
t
0.41
-1.46
0.00
0.72
-0.95
-0.22
p
0.69
0.17
1.00
0.48
0.37
0.83
higher probability for each fish size was used as the prediction, consistent with assuming that
participants make optimal decisions. The discriminative responses matched the prediction
on 69% of trials, which was significantly greater than the chance value of 50%, z = 17.7, p <
0.001.
Experiment 2
The results of Experiment 1 indicate that the MCMC method can accurately produce
samples from a trained structure for a variety of questions. In the second experiment, the
design was extended to test whether participants trained on different structures would show
clear differences in the samples that they produced. Specifically, participants were trained
on four distributions differing in their mean and standard deviation to observe the effect on
the MCMC samples.
Method
Participants. Fifty participants were recruited from the Brown University community.
Data from one participant was discarded for not finishing the experiment, data from another
was discarded because the chains reached a boundary, and the data of eight others were
discarded because their chains did not cross. After discarding participants, there were ten
participants in the low mean and low standard deviation condition, ten participants in the
low mean and high standard deviation condition, nine participants in the high mean and low
standard deviation condition, and eleven participants in the high mean and high standard
deviation condition.
Stimuli. Stimuli were the same as Experiment 1.
Procedure. The procedure for this experiment is identical to that of Experiment 1,
except for three changes. The first change is that participants were all asked the same question, “Which fish came from the fish farm?” The second difference is that the mean and
the standard deviation of the Gaussian were varied in four between-participant conditions.
Two levels of the mean, µ = 3.66 cm and µ = 4.72 cm, and two levels of the standard deviation, σ = 3.1 mm and σ = 1.3 mm, were crossed to produce the four between-participant
conditions. The uniform distribution was the same across training distributions and was
bounded at 2.63 cm and 5.76 cm. The final change was that the standard deviation of the
proposal distribution was set to 2.2 mm in this experiment.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
19
Results and Discussion
As in the previous experiment, it took approximately 20 trials for the chains to
converge, so only the remaining 60 trials per chain were analyzed. The acceptance rates in
the four conditions ranged from 38% to 45%. The distributions on the right hand side of
Figure 4 show the training distribution and the nonparametric and parametric estimates of
the stationary distribution produced from the MCMC samples. The distributions estimated
for the participants shown in this figure match well with the training distributions. The
mean, µ, and standard deviation, σ, was computed from the MCMC samples produced by
each participant. The average of these estimates for each condition is shown in Figure 5.
As predicted, µ was higher for participants trained on Gaussians with higher means, and
σ was higher for participants trained on Gaussians with higher standard deviations. These
differences were statistically significant, with a one-tailed Student’s t-test for independent
samples giving t(38) = 7.36, p < 0.001 and t(38) = 2.01, p < 0.05 for µ and σ respectively.
The figure also shows that the means of the MCMC samples corresponded well with the
actual means of the training distributions. The standard deviations of the samples tended to
be higher than the training distributions, which could be a consequence of either perceptual
noise (increasing the effective variation in stimuli associated with a category) or choices
being made in a way consistent with the exponentiated choice rule with γ < 1. Perceptual
noise seems to be implicated because both standard deviations in this experiment (which
were overestimated) were less than the standard deviation in the previous experiment (which
was correctly estimated). The prediction of the MCMC samples for discriminative trials
was determined by the same method used in Experiment 1. The discriminative responses
matched the prediction on 75% of trials, significantly greater than the chance value of 50%,
z = 24.5, p < 0.001.
Experiment 3
The results of Experiments 1 and 2 suggest that the MCMC method can accurately
reproduce a trained structure. Experiment 3 used this method to gather samples from the
natural (i.e., untrained) categories of giraffes, horses, cats, and dogs, representing these
animals using nine-parameter stick figures (Olman & Kersten, 2004). These stimuli are less
realistic than full photographs of animals, but have a representation that is tuned to the
category, so that perturbations of the parameters will have a better chance of producing a
similar animal than than pixel perturbations. The complexity of these stimuli is still fairly
high, hopefully allowing the capture of a large number of interesting aspects of natural
categories. The samples produced by the MCMC method were used to examine the mean
animals in each category, what variables are important in defining membership of a category,
and how the distributions of the different animals are related.
The use of natural categories also provided the opportunity to compare the results
of the MCMC method with other approaches to studying category representations. These
stick figure stimuli were developed by Olman and Kersten (2004) to illustrate that classification images, one of the discriminative approaches introduced earlier in the paper, could
be applied to parameter spaces as well as to pixel spaces. Discriminative methods like classification images can be used to determine the decision bound between categories, but the
results of discrimination judgments should not be used to estimate the parameters charac-
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
20
Figure 4. Markov chains produced by participants in Experiment 2. The four rows are participants
from each of the four conditions. The panels in the first column show the behavior of the three
Markov chains per participant. The black lines represent the states of the Markov chains, the dashed
line is the mean of the Gaussian training distribution, and the dot-dashed lines are two standard
deviations from the mean. The second column shows the densities of the training distributions.
These training densities can be compared to the MCMC samples, which are described by their
kernel density estimates and Gaussian fits in the last two columns.
Figure 5. Results of Experiment 2. The bar plots show the mean of µ and σ across the MCMC
samples produced by participants in all four training conditions. Error bars are one standard error.
The black dot indicates the actual value of µ and σ for each condition, which corresponds with the
MCMC samples.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
21
terizing the mean of a category. The experiment empirically highlights why discriminative
judgments should not be used for this purpose, and serves as a complement to the theoretical argument given in the discussion. In addition, we analyzed the results to determine
whether it would have been feasible to use typicality judgments on a random subset of
examples to identify the underlying category structure. Collecting typicality judgments is
efficient when most stimuli belong to a category, but can be inefficient when a category
distribution is non-zero in only a small portion of a large parameter space. The MCMC
method excels in the latter situation, as it uses a participant’s previous responses to choose
new trials.
Method
Participants. Eight participants were recruited from the Brown University community. Each participant was tested in five sessions, each of which could be split over multiple
days. A session lasted for a total of one to one-and-a-half hours for a total time of five to
seven-and-a-half hours. Participants were compensated with $10 per hour of testing.
Stimuli. The experiment was presented on a Apple iMac G5 controlled by a script
running in Matlab using PsychToolbox extensions (Brainard, 1997; Pelli, 1997). Participants were seated approximately 44 cm away from the display. The nine-parameter stick
figure stimuli that we used were developed by Olman and Kersten (2004). The parameter
names and ranges are shown in Figure 6. From the side view, the stick figures had a minimum possible width of 1.8 cm and height of 2.5 mm. The maximum possible width and
height were 10.9 cm and 11.4 cm respectively. Each stick figure was plotted on the screen
by projecting the three dimensional description of the lines onto a two dimensional plane
from a particular viewpoint. To prevent viewpoint biases, the stimuli were rotated on the
screen by incrementally changing the azimuth of the viewpoint, while keeping the elevation
constant at zero. The starting azimuth of the pair of stick figures on a trial was drawn from
a uniform distribution over all possible values and the period of rotation was fixed at six
seconds. The perceived direction of rotation is ambiguous in this experiment.
Procedure. The first four sessions of the experiment were used to sample from a
participant’s category distribution for four animals: giraffes, horses, cats, and dogs. Each
session was devoted to a single animal and the order in which the animals were tested was
counterbalanced across participants. On each trial, a pair of stick figures was presented,
with one stick figure being the current state of the Markov chain and the other the proposed
state of the chain. Participants were asked on each trial to pick “Which animal is the
”,
where the blank was filled by the animal tested during that session. Participants made their
responses via keypress. The proposed states were drawn from a multivariate discretized
pseudo-Gaussian with a mean equal to the current state and a diagonal covariance matrix.
Informal pilot testing showed good results were obtained by setting the standard deviation
of each variable to 7% of the range of the variable. Proposals that went beyond the range
of the parameters were automatically rejected and were not shown to the participant.
To assess convergence, three independent Markov chains were run during each session.
The starting states of the Markov chains were the same across sessions. These starting states
were at the vertices of an equilateral triangle embedded in the nine-dimensional parameter
space. The coordinates of each vertex were either 20% of the range away from the minimum
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
22
Figure 6. Range of stick figures shown in Experiment 3 by parameter. The top line for each
parameter is the range in the discrimination session, while the bottom line is the range in the
MCMC sessions. Lengths are relative to the fixed body length of one.
value or 20% of the range away from the maximum value for each parameter. Participants
responded to 999 trials in each of the first four sessions, with 333 trials presented from each
chain. The automatically rejected out-of-range trials did not count against the number
of trials in a chain, so more samples were collected than were presented. The number of
collected trials ranged from 1252 to 2179 trials with an average of 1553 trials.
The fifth and final session was a 1000 trial discrimination task, reproducing the procedure that Olman and Kersten (2004) used to estimate the structure of visual categories
of animals. On each trial, participants saw a single stick figure and made a forced response
of giraffe, horse, dog, or cat. Stimuli were drawn from a uniform distribution over the
restricted parameter range shown in Figure 6. This range was used in Olman and Kersten
(2004), and provided a higher proportion of animal-like stick figures at the expense of not
testing portions of the animal distributions. Unlike in the MCMC sessions, the stimuli
presented were independent across trials.
Results and Discussion
In order to determine the number of trials that were required for the Markov chains to
converge to the stationary distribution, the samples were projected onto a plane. The plane
used was the best possible plane for discriminating the different animal distributions for a
particular participant, as calculated through linear discriminant analysis (Duda, Hart, &
Stork, 2000). Trials were removed from the beginning of each chain until the distributions
looked as if they had converged. Removal of 250 trials appeared to be sufficient for most
chains, and this criterion was used for all chains and all participants. The acceptance rate of
the proposals across subjects and conditions ranged between 14% and 31%. Averaged across
individuals, it ranged from 21% to 26%, while averaged across categories, acceptance rates
ranged from 19% to 27%. Figure 7 shows the chains of the giraffe condition of Participant 3
projected onto the plane that best discriminates the categories. The largest factor in the
horizontal component was neck length, while the largest factor in the vertical component
was head length. The left panel shows the starting points and all of the samples for each
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
23
Figure 7. Markov chains and samples from Participant 3 in Experiment 3. The left panel displays
the three chains from the giraffe condition. The discs represent starting points, the dotted lines are
discarded samples, and the solid lines are retained samples. The right panel shows the samples from
all of the animal conditions for Participant 3 projected onto the plane that best discriminates the
classes. The quadrupeds in the bubbles are examples of MCMC samples.
chain. The right panel shows the remaining samples after the burn-in was removed. From
this figure, we can see that the giraffe and horse distributions are relatively distinct, while
the cat and dog distributions show more overlap.
Displaying a nine-dimensional joint distribution is impossible, so we must look at
summaries of the distribution. A basic summary of the animal distributions that were
sampled is the mean over samples. The means for all combinations of participants and
quadrupeds are shown in Figure 8. As support for the MCMC method, these means tend
to resemble the quadrupeds that participants were questioned about. While the means
provide face validity for the MCMC method, the distributions associated with the different
categories contain much more interesting information. Samples from each animal category
are shown in Figure 7, which give a sense of variation within a category. Also, the marginal
distributions shown in Figure 9 give a more informative picture of the joint distribution than
the means. The marginal distributions were estimated by taking a kernel density estimate
along each parameter for each of the quadrupeds on the data pooled over participants.
The Gaussian kernel used had its width optimized for estimating Gaussian distributions
(Bowman & Azzalini, 1997). In Figure 9 we can see that neck length and neck angle
distinguish giraffes from the other quadrupeds. Horses generally have a longer head length
than the other quadrupeds. Dogs have a somewhat distinct head length and head angle.
Cats have the longest tails, shortest head length, and a narrowest foot spread.
The data summaries along one or two dimensions may exaggerate the overlap between
categories. It is possible that the overlap differs between participants or the distributions
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
24
Figure 8. Means across samples for all participants and all quadrupeds in Experiment 3. The rows
of the figure correspond to giraffes, horses, cats, and dogs respectively. The columns correspond to
the eight participants.
are even more separated in the full parameter space. In order to examine how much the joint
distributions overlap, a multivariate Gaussian distribution was fit to each set of animal samples. Using these distributions for each quadruped, We calculated how many samples from
each quadruped fell within the 95% confidence region of each animal distribution. Table 2
shows the number of samples that fall within each distribution averaged over participants.
The diagonal values are close to the anticipated value of 95%. Between quadrupeds, horses
and giraffes share a fair amount of overlap and dogs and cats share a fair amount of overlap.
Between these two groups, cats and dogs overlap a bit less with horses than they do with
each other. Also, cats and dogs overlap very little with giraffes. There is an interesting
asymmetry between the cat and dog distributions. The cat distribution contains 14% of
the dog samples, but the dog distribution only contains 3.7% of the cat samples.
The discrimination session allows us to compare the outcome of the MCMC method
with that of a more standard experimental task. The mean animals from all of the MCMC
trials and from all of the discrimination trials (both pooled over participants) are shown
in Figure 10. The discrimination judgments were made on stimuli drawn randomly from
the parameter range used by Olman and Kersten (2004). The mean stick figures that
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
25
Figure 9. Marginal kernel density estimates for each of the nine stimulus parameters, aggregating
over all participants in Experiment 3. Examples of a parameter’s effect are shown on the horizontal
axis. The middle stick figure is given the central value in the parameter space. The two flanking
stick figures show the extremes of that particular parameter with all other parameters held constant.
Table 2: Proportion of Samples that Fall Within Quadruped 95% Confidence Regions (Averaged
Over Participants) in Experiment 3
Distribution
Giraffe
Horse
Cat
Dog
Giraffe
0.954
0.097
0.000
0.021
Samples
Horse Cat
0.048 0.006
0.960 0.046
0.044 0.936
0.044 0.037
Dog
0.000
0.058
0.143
0.954
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
26
Figure 10. Mean quadrupeds across participants for the MCMC and discrimination sessions in
Experiment 3. The rows of the figure correspond to the data collection method and the columns to
giraffes, horses, cats, and dogs respectively.
result from these discrimination trials look reasonable, but MCMC pilot results showed
that the marginal distributions were often pushed against the edges of the parameter ranges.
The parameter ranges were greatly expanded for the MCMC trials as shown in Figure 6,
resulting in a space that was approximately 3,800 times the size of the discrimination
space. Over all participants and animals, only 0.4% of the MCMC samples fell within the
restricted parameter range used in the discrimination session. This indicates that the true
category distributions lie outside this restricted range, and that the mean animals from
the discrimination session are actually poor examples of the categories – something that
would have been hard to detect on the basis of the discrimination judgments alone. This
result was borne out by asking people whether they thought the mean MCMC animals
or the corresponding mean discrimination animals better represented the target category.
A separate group of 80 naive participants preferred most mean MCMC animals over the
corresponding mean discrimination animals. Given a forced choice between the MCMC and
discrimination mean animals, the MCMC version was chosen by 99% of raters for giraffes,
73% of raters for horses, 25% of raters for cats, and 69% of raters for dogs.
With an added assumption the results from the MCMC method can be used to
predict discriminative judgments. The MCMC samples estimate the category densities and
it is straightforward to use this information to predict discrimination judgments if optimal
decision bounds between categories are assumed. Multivariate Gaussian distributions were
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
27
Table 3: Proportion of Parameter Space Occupied by Individual Participant Samples in Experiment 3
Giraffe
0.00004
Horse
0.00006
Cat
0.00003
Dog
0.00002
estimated for each category and for each stimulus shown in the discrimination session.
These four probabilities were normalized and then the prediction was the category with the
maximum probability from four animal distributions, consistent with an optimal decision.
Using this method, the predictions matched human responses on 32% of trials, significantly
higher than the chance value of 25%, z = 14.5, p < 0.001. (Other assumptions, such
as probability matching to make predictions or using Gaussian kernels to construct the
distribution produced very similar results.) The discrimination results were confined to a
small region of the parameter space in which few MCMC samples were drawn, which may
have contributed to the weakness of the prediction.
In a large space, randomly selecting trials for typicality judgments will be inefficient
when the category regions are small and spread out. This is exactly the situation encountered in Experiment 3. The projection of the samples from the nine-dimensional parameter
space onto a plane, as in Figure 7, exaggerates the size of the relevant region for a category
relative to the whole parameter space. The volume of the space occupied by the different
categories can be measured by constructing convex hulls around all of the post-burn-in
samples from each participant for each animal. Convex hulls connect the most extreme
points in a cloud to form a surface that has no concavities and contains all of the points
(Barnett, 1976). Dividing the volume of the space taken up by each animal (averaged over
participants) by the total volume of the parameter space result in the very small ratios
shown in Table 3: the categories take up only a tiny part of the nine-dimensional space
of stick-figures. This analysis depends on the bounds of the parameters: selecting a much
larger space than necessary would artificially decrease the volume ratios. We tested whether
these ratios are artificially decreased by computing the size of the least enclosing hypercube
for all of the samples pooled across participants and animal conditions. The ratio of this
hypercube to the total parameter space was 0.99, negating the criticism that the parameter space was larger than necessary. The MCMC method thus efficiently honed in on the
small portion of the space where each category was located, while typicality judgments on
a random subset of examples would have been mostly uninformative.
Experiment 4
In this experiment, we outline how the MCMC method can be used to test models
of categorization with natural categories. We do not provide a definitive result, but instead outline how difficulties in applying the MCMC method to testing models of natural
categories might be overcome. Theoretically, we are able to test categorization models,
such as prototype and exemplar models, because they make different predictions about the
distribution over objects associated with categories. As outlined above, prototype models
produce unimodal distributions, while exemplar models are more flexible, approximating
a distribution with a mixture of kernels centered on each example. The MCMC method
produces samples from this distribution, which can be used to distinguish these models.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
28
Categorization models are usually tested with training experiments, but the efficiency of
the MCMC method in exploring complex parameter spaces allows for the testing of categorization models using more realistic stimuli and natural categories. Using this method,
we can examine whether the distributions associated with natural categories more closely
resemble those predicted by prototype or exemplar models. More precisely, we can test
whether these distributions have properties that rule out a prototype model, as an exemplar model can emulate a prototype model given an appropriate set of exemplars. For
example, an exemplar model with two exemplars that are very near one another would
be indistinguishable from a unimodal distribution. As a result, the test is strong in only
one direction: a multimodal distribution is evidence against a simple prototypical category
representation. In the other direction we would only be able to claim that the prototype
model could fit the data, but not provide direct evidence that the exemplar model is unable
to do so.
In Experiment 3, there was a suggestion of multiple modes in the marginal distribution
of tail angle for several animal shape categories. However, beyond any arguments about
the strength of the results, the design of the experiment did not have all of the controls
necessary to provide strong evidence against the prototype model. Most importantly, the
current experiment demonstrates how to test for category multimodality in a psychological
space, instead of testing in an experimenter-defined parameter space. Categorization models
are defined over a psychological space (Reed, 1972; Nosofsky, 1987), and it is possible that
a transformation from one space to another would change the shape of the function. In
order to assess multimodality, the samples must be projected to a psychological space and
then the probability distributions can be assessed there.
Multidimensional scaling (MDS) is the standard method used to create a psychological
space, and was used in this experiment to project the MCMC samples into a psychological
space. Classical MDS is most commonly applied MDS method in psychological experiments,
but it was impractical in this situation because a very large number of pairwise similarity
judgments would have been needed to construct a scaling solution for the MCMC samples.
Instead, an MDS variant known as Dimensionality Reduction by Learning an Invariant
Mapping (DrLIM; Hadsell et al., 2006) was used. DrLIM has not been used in psychological experiments to our knowledge, but was chosen because its advantages over classical
MDS. First, it utilizes neighborhood judgments instead of requiring all pairwise similarity
judgments between examples. Neighborhood judgments can be made more quickly than
the full set of pairwise similarity judgments – while the similarity of an item to every other
item needs to be implicitly determined in order to find its neighbors, participants need
only respond with the target item’s neighbors. Second, DrLIM uses these neighborhood
judgments to train an explicit function that maps the parameter space onto the similarity
space. Classical MDS ignores the parameters of the items and does not produce an explicit
mapping function. As a result, DrLIM can use the neighborhood judgments collected for
a small set of stimuli in order to produce a solution for all of the stimuli in the entire
parameter space.
The following experiment demonstrates how to test the simple prototype model using the categories of apples, grapes, and oranges. These categories are basic-level categories (Rosch, Mervis, et al., 1976; Liu, Golinkoff, & Sak, 2001), like the animal categories
(Snodgrass & Vanderwart, 1980), the level at which the prototype model is defined (Rosch,
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
29
Mervis, et al., 1976). However, there is reason to expect a stronger result of multimodality
with these categories: the stimuli were colored shapes and varieties of apples and grapes
have different colors.
Method
Participants. Eight participants were recruited from the Indiana University community. Each participant was tested in two phases, with the second phase allowed to extend
over multiple days. Participants completed the experiment in five to seven sessions with a
total time ranging from five to seven-and-a-half hours and were compensated $10 per hour.
Stimuli. The experiment was presented in a darkened room on a Apple eMac controlled by a script running in Matlab using PsychToolbox extensions (Brainard, 1997; Pelli,
1997). The display was calibrated using a Spyder2 colorimeter to the sRGB standard with
a D65 temperature point. Observers were seated approximately 44 cm away from the display. Stimuli were six-parameter shapes with three parameters determining the shape and
three parameters determining the color. The three shape parameters were the radii of the
circles, the horizontal distance between circle centers, and the vertical distance between
circle centers as shown in Figure 11. The range was [0, 0.5] for the the radii, [0, 1] for the
horizontal distance, and [−0.866, 0.866] for the vertical distance. The maximum values for
each of these parameters produced a shape with maximum size: 2.5 cm in the MDS session
and 7.5 cm in the MCMC sessions. Negative values of the horizontal and vertical parameters resulted in the changes in the relative position of the circle centers (e.g., a negative
vertical parameter resulted in an internal structure of two circles above a single circle).
The three color parameters corresponded to the three parameters in the CIE Lab standard.
The allowable range for L was [0, 100], while the ranges for a and b were [−100, 100]. Each
stimulus was topped by a green sprig (sRGB value of [34, 139, 34]) with a height and width
1/10th and 1/40th that of the maximum fruit size respectively. Each fruit stimulus was
placed on a “dinner plate” constructed of three circles: two white and one black. The three
circles were overlayed to produce a white plate with a black ring. The plate was 1.3512
times the maximum fruit size with an inner white region radius that was 90% of the total
plate radius. These values were chosen so that largest possible stimulus would not touch
the black ring on the plate. A schematic ruler was always present on the screen, indicating
that participants should behave as if the plates were 10 inches in diameter.
Procedure. There were two phases: an MDS session and an MCMC session that
interleaved three basic-level categories (apples, grapes, and oranges). The MDS session
consisted of ninety trials of neighborhood judgments. On each trial one of the stimuli
was used as the target and participants chose the neighbors from all of the remaining
stimuli. Participants were instructed to think of the shapes as fruit and to make their
neighbor judgments on that basis. These stimuli were picked so that they would include
the most common fruit. Table 4 shows the factors and levels that were crossed to make
the ninety stimuli. (For a shape parameter x, x = sx0 .) A screen capture of this phase
of the experiment is shown in Figure 12. The number of potential neighbors exceeded the
number of stimuli that could be comfortably displayed on a single screen, so several pages
of stimuli were used, with a maximum of 32 stimuli per page. The locations of the potential
neighbors were randomized on each trial to reduce the possible bias caused by the ordering
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
30
Figure 11. Shape parameters used in constructing fruit stimuli in Experiment 4. The r, h, and v
correspond to the radii, horizontal, and vertical parameters. Black lines are for reference only and
were not visible as the shape was a single monochromatic color. A green sprig was placed at the top
of each stimulus to signify where the vine would attach.
Table 4: Stimulus Properties Crossed to Produce Multidimensional Scaling Stimulus Set in Experiment 4
Color (L, a, b)
Red
(53.24, 80.09, 67.20)
Cardinal
(42.72, 62.89, 28.48)
Forest Green (50.59, −49.59, 45.02)
Yellow
(97.14, −21.56, 94.48)
Brown
(40.44, 27.50, 50.14)
Eggplant
(34.88, 64.05, −30.59)
Orange
(74.93, 23.93, 78.95)
Peach
(91.95, 1.80, 27.19)
Tan
(74.98, 5.02, 24.43)
Shape (r0 , h0 , v 0 )
Circle
(0.5, 0, 0)
Fat Circle
(0.5, 0.3, 0)
Very Fat Circle
(0.5, 1, 0)
Stawberry-shaped
(0.5, 1, 0.866)
Pear-shaped
(0.5, 1, −0.866)
Size (s)
Small (0.33)
Large (0.67)
of the potential neighbors. Participants were allowed to navigate back and forth between
the pages to choose the nearest neighbors to the target stimulus (by putting those stimuli
on the neighbor plates). Once all of the pages had been viewed and five nearest neighbors
had been selected, the END button appeared and participants were allowed to end the trial
whenever they were satisfied with their choices.
A series of MCMC sessions followed the MDS session. Each MCMC trial displayed
two stimuli on plates and participants were asked to respond via keypress to the stimulus
that was a better example of a particular category. One stimulus was the current state of
a Markov chain and the other was the proposed state. Whichever stimulus was selected
became the new state of the chain. For each tested category, three chains were interleaved.
Three chains for each of the three basic level categories (apples, grapes, and oranges) were
all interleaved randomly with one another. There were 667 trials per chain yielding 2001
trials presented per basic level category. The number of samples was often far greater than
the number of presented trials because all proposals outside of the parameter ranges were
automatically rejected. The number of samples in a category condition for a participant
ranged from 3,572 to 7,416 with an average of 4,650. The proposal distribution and chain
start points were the same across all categories and participants. The proposal distribution
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
31
Figure 12. Sample screen from the session collecting nearest neighbor data in Experiment 4.
Participants selected the stimuli that they considered the nearest neighbors of each stimulus.
was a mixture with a 0.8 weighting on a diagonal-covariance Gaussian with standard deviations equal to 7% of each parameter’s range, a 0.1 weight on a uniform distribution across
all shape parameters, and a 0.1 weight on a uniform distribution across all color parameters.
(When proposing on a subset of parameters the other parameters remained fixed.) The uniform distributions were used in order to make large jumps between potentially isolated peaks
in the subjective probability distributions. The large jumps were proposed separately for
shape and color parameters because these dimensions are psychologically separable (Garner
& Felfoldy, 1970).
Multidimensional Scaling Analysis
This section provides details on the method used to analyze the neighborhood data
and the scaling results. DrLIM is a MDS method that uses neighborhood judgments to find
the best function to map stimuli from a parameter space to a similarity space. The best
function is selected by finding the parameters for a class of functions that best minimize
a particular loss function. This loss function increases as the distance between items that
are neighbors increases and as the distance between items that are not neighbors decreases.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
32
Non-neighbors only impact the loss function within a certain margin. The exact loss function
is
~ 1, X
~ 2 ) = (1 − Y ) 1 (DW )2 + (Y ) 1 (max{0, m − DW })2
L(W, Y, X
(17)
2
2
where W is the set of parameters, Y is the response (neighbors are 1 and non-neighbors
~ 1 and X
~ 2 are the two members of the pair, m is the margin outside of which
are −1), X
the non-neighbors do not contribute to the loss function, and DW is the Euclidean distance
~ 1 and X
~ 2 as inputs and parameters W .
between the function’s outputs given X
In order to allow as large a class of nonlinear functions as possible, a feedforward
artificial neural network (ANN) with a hidden layer was used. The number of inputs was
equal to the number of parameters of the stimuli and the number of outputs determined
the dimensionality of the similarity space. (Values on these outputs give the coordinates of
the items in the similarity space.) To produce a distance measure between items, a siamese
ANN was used, meaning that a single set of parameters governed the behavior of the pair
of networks required to produce the distance between a pair of items. Pairs of stimuli were
considered neighbors when if either of the stimuli was the target, the other was chosen to
be one of its five nearest neighbors.
The network structure is shown in Figure 13. Thirty hidden nodes were used and
network weights were updated via stochastic gradient descent. Weights were initially drawn
from a uniform distribution with a range of [−0.7, 0.7]. Training error was backpropogated
through the network with a total gradient equal to the sum of the gradients of the network
for each input, multiplied by the gradient of the loss function, and a learning rate parameter.
The gradient of the loss function was DW for neighbors, −(m−DW ) for non-neighbors within
the margin, and zero for non-neighbors outside the margin. The margin was set to 0.3. The
learning rate parameter was equivalent to dividing the total gradient by the number of
training pairs. The network was trained for 80 epochs on each run and 30 runs with new
random starting weights were used to avoid local minimums in the loss function. This
simulation was repeated for each participant for each possible number of output dimensions
(one to six).
The neighborhood judgments of an example participant, Participant 7, are shown in
Figure 14. These judgments show a strong influence of color, while shape is also be a factor.
Across different colors, some were considered better neighbors than others, in particular
peach and tan were close neighbors. These neighborhood judgments and those of the other
participants were the input into the DrLIM method described above. The values of the loss
function for each of the participants for each possible number of dimensions in the solution
are shown in Figure 15. The loss function tended to decrease with more dimensions, and
the small departures from monotonicity are due to the difficulty in finding the best weights
for each output dimension. An inspection of this figure reveals an elbow point at about
three dimensions, but the remainder of the analyses will use the two dimensional solution
to simplify the presentation.
The variability across participants in neighborhood judgments is illustrated by plots
of the MDS stimuli onto the best two dimensional solutions in Figure 16. The MDS solution
for Participant 7 reflects the trends in the neighborhood judgments (Figure 14): a strong
clustering by color, but with a contribution of shape as well. This pattern was common
across participants, with only Participant 5 showing a very different pattern. Shape is the
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
33
Figure 13. Illustration of the structure of the artificial neural network used to implement the
DrLIM method for the multidimensional scaling analysis in Experiment 4. An input layer feeds the
parameters forward to a hidden layer and then an output layer for each of the stimuli. The output
layer in this illustration is for a two dimensional solution. A pair of stimuli are fed through the
network (parameters are the same in the two networks) and the values of the outputs are fed into a
distance function, and then into a loss function based on a participant’s neighborhood judgments.
strongest cue for Participant 5 and size also appears to be more important than color.
Participant 4 shows the same basic color clustering as Participant 7, but with the colors
divided into groups. Participant 2 shows this phenomenon as well, but in this case only the
green fruit shapes are isolated.
While the scaling solution presented here satisfies the loss function used in DrLIM,
there are many possible ways to define what is meant by a good scaling solution. Research
in dimensionality reduction has produced many different algorithmic approaches to this
problem and these approaches can produce different solutions (Hadsell et al., 2006; Hinton
& Salkhutdinov, 2006; Roweis & Saul, 2000; Tenenbaum, de Silva, & Langford, 2000).
In order to make strong claims about models it will be important for the neighborhood
judgments to lead to consistent representations under a variety of plausible assumptions.
Results and Discussion
The key analysis step in testing whether a prototype model is a good description of
the selected categories is to determine the number of clusters in the MCMC samples. A
larger and more comprehensive investigation of these category structures could use a variety
of methods for determining the number of clusters, including fitting mixture models to the
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
34
Figure 14. Neighborhood judgments made by Participant 7 in Experiment 4. The center of each
cluster was the target and the five surrounding shapes were the neighbors selected by the participant
on that trial.
data. In one mixture model approach, a certain number of components (or clusters) is
assumed and then the parameters that produce the maximum likelihood of the data are
found. The maximum likelihoods from different numbers of clusters are then compared
using a model selection criterion, such as the Bayesian Information Criterion (Schwartz,
1978). More thoroughly Bayesian options exist as well, such as inferring a distribution
over the number of clusters by using a Dirichlet process mixture model (Antoniak, 1974;
Ferguson, 1983; Neal, 1998). Since our goal here is primarily to illustrate how the MCMC
approach could be used to address questions about prototype and exemplar models, we will
restrict ourselves to a qualitative analysis of the data.
Inspection of the data suggested that the procedure used in this preliminary experiment did not provide enough samples for the MCMC algorithm to explore the entire
stationary distribution. Acceptance rates across subjects and questions ranged from 2.5%
to 15%, which indicates that the proposal distribution may have been too wide. Discarding
the first half of the samples from each chain results in the left column of Figure 17. Each
panel shows the states of the three Markov chains sampling from one of Participant 7’s
categories, plotted on the best two-dimensional MDS solution. The three chains show a fair
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
35
Figure 15. Value of the loss function optimized by the DrLIM algorithm for multidimensional
scaling in Experiment 4, as a function of dimension. Each line represents an individual participant.
amount of convergence: each region of the samples seems to involve at least two chains.
However, the ideal would be for every chain to have a presence in each cluster of samples.
The samples pooled across chains for each category are shown in the right column. The
separation between samples shown in Figure 17 suggests that they were drawn from a multimodal distribution, but simulations showed that similar numbers of clusters could have
been produced from three non-converged chains sampling from a category distribution with
a single mode.
This experiment outlines the procedures necessary for the MCMC method to produce evidence against a simple prototype model of natural categories. An advanced MDS
method was adapted to use for human experiments for the purpose of determining an
explicit mapping between the stimulus space and the psychological space. Basic-level categories were tested using the MCMC method. While this experiment was primarily intended
to illustrate how the MCMC method can be used to explore theoretical questions about
the representation of category structures, the preliminary results provide some suggestion
of multimodality. Further, more comprehensive, experiments will be required to obtain
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
36
Figure 16. Two dimensional scaling solutions for the stimuli presented during the neighborhood
task in Experiment 4. Panels correspond to individual participants. The locations of individual fruit
stimuli reflect their locations in the recovered psychological space.
definitive evidence as to whether these categories have structures that are inconsistent with
prototype models.
General Discussion
We have developed a Markov chain Monte Carlo method for exploring probability
distributions associated with different mental representations. This method allows people
to act as a component of an MCMC algorithm, constructing a task for which choice probabilities follow a valid acceptance function. By choosing between the current state and a
proposal according to these assumptions, people produce a Markov chain with a stationary
distribution that is informative about their mental representations. Our experiments illustrate how this approach can be used to investigate the mental representation of category
structures. The results of Experiments 1 and 2 indicate that the MCMC method accurately
uncovers differences in mental representations that result from training people on categories
with different structures. However, the potential of the method lies in being able to dis-
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
37
Figure 17. States of the Markov chains and samples for Participant 7 in Experiment 4, plotted
on the best two dimensional MDS solution. In the first column, each of the three chains for a
fruit condition are displayed in a different color. Dotted lines are discarded samples, solid lines are
retained samples, and the black discs are the starting points for each chain. The second column
shows the samples that were retained in each fruit condition.
cover the structures that people associate with categories that they have learned through
interacting with the world. Experiment 3 collected samples from natural categories of animals over a nine-dimensional stick figure space. These samples illustrated the differences
between the representations of these animals along the parameters of the stimuli, giving
more insight into the representations than could be gleaned by finding the discrimination
boundary alone. Experiment 4 illustrated a test of a central question about representation
of natural categories. Multidimensional scaling was combined with MCMC to probe the
category structure of apples, grapes, and oranges. The results provided a demonstration of
how to test a models of natural categories and open up the possibilities of future empirical
investigations with this method.
In the remainder of the paper, we consider the impact of relaxing the assumptions we
have made about human decision making, how the MCMC method relates to other methods
that have been used for exploring mental representations, and provide some cautions and
recommendations for users of this method. The latter draw on both our experience in
running these experiments and the theoretical properties of MCMC, and are intended to
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
38
make it easier for other researchers to use the MCMC method to explore psychological
questions. We close with a brief discussion of the implications of our results, and some
possible future directions.
Relaxing the assumptions
To match human decision making to the MCMC algorithm, we made several strong
assumptions about human decisions:
1. Participants use a ratio rule to make decisions
2. Category representations are unchanged throughout the experiment
3. Decisions are based only on the current stimulus pair
We have already addressed sensitivity to the first of these assumptions, showing that
the algorithm will produce meaningful results for behavior that is more or less deterministic
than probability matching. In the remainder of this section, we address the other two assumptions. The implications of relaxing the second assumption are considered in the section
on category drift and the robustness of the method to violations of the third assumption
are considered in the context effects section.
Category drift. As the MCMC analysis is based on the assumption that the category
distribution remains constant over time, a concern in the experiments was to make sure
that participants’ categories were not influenced by the test trials. In a pilot version of
Experiment 1, training and test trials were not interleaved, instead all of the training trials
were presented at the beginning of the experiment. Data from an example participant from
the pilot experiment is shown in Figure 18. The Markov chains converge, but once they
converge the chains tend to move together. This result, if not simply due to chance, is due
to changes in the underlying category structure. If the distribution we are drawing samples
from is not constant, then the samples will reflect a combination of all of the states of the
dynamic distribution, not the stationary distribution that is assumed. In Experiments 1
and 2, we attempted to reduce category drift through interleaving blocks of training and test
trials. This strategy seemed effective as the behavior in Figure 18 was not observed. This
caution is necessary for natural categories as well. If the categories participants have formed
are fragile, then they could change throughout a MCMC test session through the influence
of test trials. And if the stimuli have more than one dimension, it will be much more difficult
to identify category drift by examining whether the Markov chains are traveling together.
Context effects. Two types of context effects are considered in this section: context
effects arising from between-trial dependencies and context effects that are constant over
the course of the experiment. Sequential effects are nearly ubiquitous in cognitive science,
with examples including representations of spatial or temporal intervals (Gilden, Thornton,
& Mallon, 1995), serial position effects in free recall (e.g., Murdock, 1962; Raaijmakers &
Shiffrin, 1981), and priming (Schooler, Shiffrin, & Raaijmakers, 2001; Howes & Osgood,
1954). However, MCMC was developed for the digital computer and assumes that decisions
depend only on the current stimuli and are not influenced by any previous trials. In the
MCMC task, there are a multiple plausible between-trial dependencies that could arise.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
39
Figure 18. Markov chain states for a pilot participant that illustrates category drift. The chains
converge and then move together over time.
Participants could grow attached to (or tired with) a particular stimulus and change the
probability of accepting a new stimulus.
We can attempt to model these context effects with a more general decision rule.
Instead of Equation 7, we will use a rule that assumes participants can remember which
example was presented on a previous trial. This rule assumes that participants have a fixed
chance of ignoring the category probabilities of the stimuli and instead are biased to choose
either the state of the Markov chain or the proposal of the Markov chain,
a(x∗ ; x) = gb + (1 − g)
p(x∗ )
p(x∗ ) + p(x)
(18)
where g is the probability of making a biased choice, and b is the probability of choosing
the proposal when making a biased choice. Interestingly, if participants are biased towards
remaining in the same state, the stationary distribution does not change. We can see
this in Equation 18 if b = 0. In this case the acceptance probability of a new state is
scaled downward by 1 − g and the acceptance function still satisfies the detailed balance
equation (Equation 1). Simulations verified this result, as shown in Figure 19. The left panel
shows one million states from an unbiased decision rule that is sampling from a mixture
of Gaussians distribution. The middle panel shows one million states from a decision rule
that flips a fair coin and either uses the ratio rule or remains in the current state. These
simulations illustrate how the stationary distribution does not change when participants
are biased towards the current state as in Equation 18.
The outcome is not as faithful to the stationary distribution for participants who can
remember the previous choice, and are then biased towards choosing the new stimulus. This
bias can be modeled by setting b = 1 in Equation 18. We do not have analytical results for
this case, but the right panel in Figure 19 shows simulated results from a mixture of two
Gaussians. The modes of the distribution seem to be fairly unbiased, while the variance
of each of the Gaussian components seems to have increased. As the exact spread of the
distribution cannot be determined if participants are using an exponentiated decision rule
(Equation 9), it does not seem like a simple bias toward the proposed state will much harm
the interpretation of the results. Context effects arising from between-trial dependencies can
40
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
���������
��������
�
��������������
�����
��������������
��������
����
����
����
����
����
����
����
����
����
����
����
����
�
�
���������
�
���������
���������
Figure 19. States of a Markov chain for unbiased and biased decision rules. The states are shown
in a histogram with the expected results from the true stationary distribution plotted as a dashed
curve over the bars. The left panel is the unbiased decision, the middle panel is a decision biased by
always choosing the current state on half of the trials (b = 0 and g = 0.5), and the right panel is a
decision by always choosing the proposal on the half the trials (b = 1 and g = 0.5).
certainly take a more complicated form than Equation 18, but the simulations for a simple
form of bias show that the method has some robustness to between-trial dependencies.
These bias effects are undesirable, even if they do not change the asymptotic results.
This reason is that the convergence of the Markov chain will be slowed because fewer
trials are informative. An experimental method for mitigating these effects is to interleave
multiple Markov chains from multiple categories. Multiple chains introduce variety into
the stimuli participants judge during any short period of time and chains from different
categories increase that variety greatly. This diversity should help counter any fatigue or
affinity participants build up toward a stimulus or set of stimuli.
Context effects that remain constant during an experiment can come from a large
variety of sources including the internal and external conditions of the experiment (Criss
& Shiffrin, 2004). Instructions in particular can have effects that are unforeseen by the
researcher (McKoon & Ratcliff, 2001). In the MCMC experiments, the previous categories
on which a participant has been tested may influence a participant’s conception of other
category structures. The label used for the category could have an effect as well (for
instance, cats versus felines), though Experiment 1 shows that the results are robust to the
form of the question. There is no way to produce a context-free experiment, so the samples
gathered using the MCMC method must be considered with respect to the context of the
experiment.
Other methods for studying mental representations
The results of Experiments 3 and 4 demonstrate that the MCMC method can be used
to explore categories that people have built up through real-world experience. However,
this is certainly not the only method that has been developed to gain insight into people’s
concepts and categories. In this section, several existing methods of gathering and analyzing
categorization data are reviewed and compared to the MCMC method.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
41
Multidimensional scaling and additive clustering. Multidimensional scaling (MDS;
Shepard, 1962; Torgerson, 1958) and additive clustering (AC; Shepard & Arabie, 1979)
have been effectively used in many domains to convert judgments of similarity into a psychological structure. Similarity judgments are assumed to reflect the distances between
stimuli in a mental space. MDS uses the relationship between similarity and distance to
construct a space for the stimuli that best respects people’s similarity judgments, while AC
constructs a featural representation. As demonstrated in Experiment 4, MDS can be used
as a complementary method to the MCMC method. MDS/AC are methods for representing the stimuli, while MCMC determines the probability distribution. As a result, samples
gathered by MCMC can be projected from the original parameter space into the psychological space derived by MDS. This transformation gives the MCMC samples additional
significance and allows researchers to test categorization models, such as the prototype
model, that are defined within a psychological space.
The judgments made during an MDS or AC trial share properties with the judgments
made during an MCMC trial. MDS and AC data are collected by one of two methods:
directly, by rating the similarity or dissimilarity of pairs of stimuli (e.g., Navarro & Lee,
2004), or indirectly, by counting the confusions between pairs of stimuli (e.g., Rothkopf,
1957; Nosofsky, 1987). The MCMC task also requires judgments between pairs of stimuli –
specifically which stimulus is a better example of a particular category. Both of these tasks
depend on an instructional context. In Experiment 4, the MDS context was fruit, while the
MCMC context was a particular fruit category. The difference between the tasks is that
participants are asked about different aspects of the stimulus pair. MDS/AC data are the
psychological distances between a pair of stimuli, while MCMC data are the members of
each pair that are the better examples. Even if asked in the same context, these questions
will produce different information, which are suitable for different applications.
Discriminative methods. A discriminative choice is defined here as a decision between
a set of category labels for a single stimulus. These data are used by discriminative methods, such as the estimation of classification images (Ahumada & Lovell, 1971; Gold et
al., 2000; Li, Levi, & Klein, 2004; Olman & Kersten, 2004; Victor, 2005; Tjan & Nandy,
2006), to make inferences about category structures. Classification images provide a way
to estimate the boundaries between categories in high-dimensional stimulus spaces. In a
typical classification image experiment, participants classify stimuli that have had noise
added. A common choice for visual stimuli is to draw samples from pixel-wise independent
Gaussian distributions with a mean equal to one of the images participants are asked to
find. Participants are presented this noisy stimulus and asked to make a forced choice as
to the identity of the noiseless image. Based on the combination of the actual stimulus and
the response made, a classification image is a hyperplane that separates the responses into
different classes (Murray, Bennett, & Sekuler, 2002).
Discriminative methods make different assumptions about human categorization than
the MCMC method, and for the purposes of predicting discriminative judgments, can be
best understood using the distinction between discriminative and generative approaches to
classification that is used in machine learning (Jordan, 1995; Ng & Jordan, 2001). Following
this distinction, discriminative methods implicitly view human categorization as involving
some function that maps stimuli directly to category labels, and attempt to estimate the
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
42
Figure 20. Discriminative choices do not determine category means. The left panel shows two
sets of Gaussian distributions with equal within-set variance and an optimal decision boundary at
zero. The right panel shows the response probabilities for the right-hand category for each pair of
distributions. Despite different means in the left panel, the response probabilities in the right panel
are equal.
parameters of that process. This approach provides information about how a judgment
is made about two or more alternatives, and it is useful when the question of interest
is how people make these discriminations, and what features of the stimuli are relevant.
However, other information, such as the mean and variance of the category, is lost when
using discriminative methods.
Experiment 3 showed empirically that the mean of a set of discriminative judgments
does not provide a good estimate of the natural category mean. In training studies the
exemplars are known, and the category means for a particular model can be estimated
from the exemplars so this problem does not arise. However, in general, this approach
faces a significant obstacle: discriminative trials do not contain the information necessary to
determine a category mean unless very restrictive assumptions are made about the structure
of a category. Figure 20 shows the probability distributions associated with two sets of
categories. While these categories look very different, they lead to the same predictions
about discriminative responses. The left panel shows two pairs of Gaussian distributions
that have the same optimal decision boundary, but different means and variances. Using
the ratio rule of Equation 8, the response probabilities for choosing the distribution of each
pair with the higher mean were computed and are displayed in the right panel. Despite the
difference in category centers, the response probabilities for the two sets of distributions in
a discriminative paradigm are exactly the same. The multivariate analysis in Appendix B
extends this result, showing that the means of the categories that can mimic a particular
discriminative response function do not have to be proportional to the original means. These
results demonstrate that it is not possible to determine category means from discriminative
data, unless even stronger assumptions than optimal boundaries, Gaussian distributions,
and equal covariance are made.
In contrast, the MCMC method corresponds to a generative view of discriminative
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
43
judgments, assuming that each category is represented as a probability distribution over
objects, and that categorization is the consequence of using Bayes’ rule to work back from
these distributions to a decision as to how to categorize new objects. The samples gathered approximate the actual distributions themselves, allowing us to learn more about the
categories than if a discriminative model had been used. A researcher can perform any test
on these distributions, participant to there being enough data to support the test. The
advantage of the generative nature of the MCMC method is that additional information is
obtained, such as estimates of category means and variability as in Experiment 3.
Discriminative judgments can be predicted using the probability distributions estimated by MCMC, by making assumptions about the decision process that participants use.
The discrimination predictions based on MCMC judgments and optimal decision boundaries were compared to actual discrimination judgments in Experiments 1, 2, and 3. The
match was above chance, but was far from perfect. In general, if the experimenter is only
interested in a very particular and limited set of information about category discriminations, then the lack of flexibility of the discriminative model can be an advantage. Fewer
trials will be necessary to produce the same results, assuming the assumptions made by the
discriminative model are correct (Jordan, 1995; Ng & Jordan, 2001). Classification images
provide an example of the efficiency of discriminative methods in collecting boundary information, having been successfully applied to estimating linear boundaries between images of
10,000 pixels (e.g., Gold et al., 2000). The MCMC method has the advantage of revealing
gradient information, but would be impractical for estimating a decision boundary in this
situation. Its strength is in providing a much richer picture of the structure of a category
than information about the boundaries that determine discrimination judgments.
Typicality ratings. Typicality ratings are a useful tool for determining how well objects
represent a category. These ratings are most commonly used to explore natural categories
and were instrumental in supporting the view that natural basic-level categories have a
prototypical representation (Rosch & Mervis, 1975). Later work showed that typicality
ratings in a training paradigm were well-described by an exemplar model (Nosofsky, 1988).
The most commonly used categorization tasks are discrimination tasks, but the information
gathered by typicality ratings is often more stable across different contexts. As an example,
the boundaries of color categories show a lot of variation across different cultures. In
constrast, focal colors, or the best examples of colors, do not show much variation across
cultures (Lakoff, 1987). Other experiments suggest that the same result occurs for odors
(Chrea, Valentin, Sulmont-Ross, Hoang Nguyen, & Abdi, 2005). These results and others
(Labov, 1973), show that category boundaries are more sensitive to context effects than
category centers. Thus, typicality judgments may be more useful for studying basic cognitive
processes than discrimination judgments.
A typicality rating task and the task used in the MCMC method are very similar.
The ratings can be typicality judgments of single objects (e.g., Storms et al., 2000) or
a pairwise judgment of which object is more typical (e.g., Bourne, 1982). The pairwise
judgment is exactly the task used in the MCMC method. However, the MCMC method
provides a way to estimate the distributions associated with natural categories that is more
efficient than previous methods for gathering typicality ratings. Experiments that collect
typicality ratings have relied on the researcher to choose the test objects (e.g., Labov, 1973;
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
44
Rosch, Mervis, et al., 1976; Nosofsky, 1988). If the researcher has a lot of knowledge about
the structure of the category, then this method will be efficient. However, with natural
categories the researcher often has less knowledge, because he or she does not know what
examples were used to construct the category. Out of many hypotheses about a natural
category structure, few will be true. The resulting experiment will require a large set of
stimuli to be tested, and few trials will be informative.
Experiment 3 demonstrated just this kind of situation: the distributions associated
with each individual animal shape were only non-zero in a small region of the parameter
space required to test all of the animals. Choosing a set of animal shapes for participants
to rate would have been inefficient because almost all would fall outside the bounds of
the categories. The MCMC method is more effective in this situation because it uses a
participant’s previous responses to choose informative new test objects. In general, the
MCMC algorithm is particularly effective for sampling from distributions that occupy a
small region of a parameter space (Neal, 1993). As a result, the MCMC method presented
here is well suited for determining the distribution associated with a natural category.
Categorization models. Categorization research has benefited from intensive investigation using a large number of mathematical models. These models have been developed
to fit the data from a variety of tasks including binary classifications (e.g., Nosofsky et al.,
1994; Ashby & Gott, 1988; Nosofsky, 1987), typicality judgments (e.g., Storms et al., 2000;
Nosofsky, 1988), or stimulus reconstructions (e.g., Huttenlocher et al., 2000; Vevea, 2006).
In order to produce predictions, the models need to be adapted for each task in which the
researcher is interested. The MCMC method is not a model of categorization, but instead is
a method for collecting data. To make this discussion more concrete, we will describe how
the MCMC method relates to a very well known model of categorization: the Generalized
Context Model (GCM; Nosofsky, 1986).
The GCM stores all exemplars that a participant has observed. In order to make a
category judgment, the summed similarity of a new stimulus to each of the examples from
a category is computed, as shown in Equation 12. These summed similarities for a category
are combined in a ratio rule with the summed similarities of other categories to response
probabilities, as in Equation 11. In essence, the exemplars are peaks of a similarity function
over the stimulus parameter space. The generalization gradient and dimension weights of
the GCM determine the spread around these peaks. Each category has a separate similarity
function and the strength of an example in a particular category is determined by the value
of the similarity function at the parameters of the test stimulus. An illustration of the
similarity function created by the GCM is shown in Figure 21.
There is already an explicit model of the MCMC task – participants respond according
to the ratio rule based on the similarity functions of the different categories. This is the
same assumption as that used in the GCM, except we make this assumption as part of the
design of our experiment rather than part of the analysis. If the assumption holds, the
results our method produces samples from a probability distribution that is proportional
to the summed similarity of the objects to the exemplars. Making this assumption earlier
in the process of data collection allows it to be leveraged to collect more informative data
via MCMC. However, it should be noted that this method does not draw samples from
the stored exemplars in the GCM. If people are behaving according to the GCM, then
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
45
Figure 21. Relationship of samples to exemplars in the Generalized Context Model. The crosses
on the horizontal axis are exemplars, the solid curve is the resulting similarity function, the dashed
curve is the exponentiated function, and the circles are samples from the exponentiated normalized
function.
the samples will be a proportional approximation to the similarity function that the GCM
produces for a category, as shown in Figure 21. These samples will reflect the exemplars plus
the generalization gradient and the dimension weights, not the exemplars themselves. Also,
if participants deviate from probability matching and follow the choice rule in Equation 9,
the resulting probability distributions will show an effect of the γ parameter and will not
be equivalent to the similarity function used to represent exemplars in the GCM.
Psychophysical staircases. The psychophysical method of adaptive staircases
(Cornsweet, 1962) is used to find a participant’s threshold in a task in which the intensity
of the stimulus is varied. A simple implementation of this method is to show a participant
a series of trials, half of which contain a stimulus and half of which are noise alone. The
participant responds as to whether the stimulus is present. If the response is correct, then
the intensity of the stimulus is lowered to make the task more difficult. If the response is
incorrect, the intensity is raised to make it easier. This staircase automatically converges
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
46
on the 50% accuracy threshold. The number of consecutive correct or incorrect responses
required can be manipulated to change the percent accuracy to which the staircase will
converge (Levitt, 1971).
The MCMC method resembles the method of adaptive staircases in that they both
adjust trial presentation based on participant responses. Both have a current state in the
stimulus space and transition to a new state based on human responses. Both methods
use Markov chains – for psychophysical staircases, the next state depends only on the
responses to the current state. However, the Markov chains of the two methods use different
contingencies for making transitions. The MCMC method transitions to the state that the
participant chooses, while a staircase transitions based on the correctness of the participant’s
response. The stationary distribution of the MCMC method is the category distribution
while the stationary distribution of a psychophysical staircase is unknown apart from its
target mean. The most salient difference between the two methods is that the MCMC
method has no ground truth. Rather than converging to a distribution centered around a
point with a certain objective level of accuracy, it converges to a distribution reflecting a
participant’s subjective degrees of belief about the stimuli.
Cautions and recommendations
While the MCMC method has many desirable theoretical properties and the experiments illustrate that it can produce meaningful results for natural categories, there are
several issues that can arise in its implementation. In this section, we describe some of the
potential problems and some ways in which they can be solved.
Designing the experiment and analyzing the data
Burn-in. As mentioned previously, MCMC requires a number of trials to converge to
the stationary distribution. These pre-convergence trials are not samples from the distribution of interest and must be discarded. Unfortunately, there are few theoretical guarantees
about when a Markov chain has converged to its stationary distribution. A large and varied set of heuristics have been developed for assessing convergence, and we applied different
heuristics in different experiments. In Experiments 1 and 2, we waited until the chains
crossed, a simple heuristic that can be applied in one dimension. Since the chains will definitely cross once they reach their stationary distibution (as each chain should visit every
state with probability proportional to its stationary probability), this gives us a simple criterion to check for convergence. In Experiment 3, a variant of an early standard known as
the “thick pen technique” (Gelfand, Hills, Racine-Poon, & Smith, 1990) was applied. This
heuristic involves looking at a fixed number of chains and removing the early samples from
each until they are visually indistinguishable. More objective measures have been developed to determine whether a Markov chain has converged, but are usually more difficult to
implement and can be computationally expensive. There is quite a large number of these
methods and they vary in their applicability to different sampling problems (see Brooks &
Roberts, 1998; Mengersen et al., 1999; Cowles & Carlin, 1996 for reviews).
Dependent samples. An issue with MCMC that affects the power of the method is
that there is a dependency between successive samples. The states of a Markov chain are
not drawn independently – the overall probability of transitioning to a new state depends
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
47
on the current state. There are many ways to deal with sample dependence. A common
method is to keep every nth sample and discard the rest, as the dependence between samples
decreases as the distance between samples increases. Thus, by dropping out a large enough
number of intermediate samples, the remaining set will be pseudo-independent. In the
experiments, all of the samples were retained. A sample can be dependent when computing
the moments of a distribution, or any other expectation with respect to that distribution.
However, if a calculation that involved the number of samples (e.g., t-test, ANOVA) were
used, the number of states of the Markov chain cannot be used as the sample size. One
alternative is to use the pseudo-independent sample in the calculation. A second alternative
is to compute the effective sample size of the Markov chain states and using that as the
sample size.
Bounded parameter spaces. The MCMC method uses a symmetric proposal distribution, which requires care when applying to bounded parameter spaces. For states near the
boundary, a large number of proposals will be outside the bounds. A common impulse in
this situation is to truncate the proposal distribution and normalize. However, this scheme
will violate detailed balance (Equation 1) because the proposal probabilities between a state
at the edge of the space and a state in the middle of the space will not be the same. Experiments 1 and 2 dealt with this problem by allowing proposals outside of the parameter
range and rejecting participants who accepted out-of-bounds proposals. This method is not
very appealing, especially for experiments in which each participant requires a long time
to run. A more sophisticated approach was adopted for Experiments 3 and 4. Proposals
of parameter values outside of the parameter range were allowed, but these proposals were
automatically rejected – adding another sample of the Markov chain at the current state
without presenting a trial to the participant.
The justification for this approach is straightforward. In committing ourselves to a
restricted range of parameter values, we only want to measure the subjective function f (x)
over those values. Thus, we treat parameters outside the range as having f (x) equal to
zero. As a consequence, we will automatically reject any proposal that goes outside the
range, and converge to a stationary distribution that is proportional to f (x) normalized
over the set of parameter values allowed in the experiment. In principle, we can actually
run the MCMC procedure without bounds on the values of the parameters, since it should
converge to a sensible distribution regardless, but we have not tested this in practice.
We can also check that rejecting proposals outside the parameter range is valid by separately showing detailed balance holds for in-bounds and out-of-bounds proposals. Within
the in-bounds region detailed balance holds because the proposal probability between any
two states using a symmetric distribution is equal and a Barker acceptance function is assumed. Between the in-bounds and out-of-bounds regions proposals are allowed, but an
always-reject acceptance function has been adopted. Assuming that in-bounds proposals
from an out-of-bounds state are always rejected (which will never actually be implemented),
detailed balance is satisfied – between any in-bounds and out-of-bounds state both sides of
the detailed balance equation are zero.
Selection of a proposal distribution. When implementing the MCMC method, a proposal distribution must be chosen. In the limit, any distribution that is symmetric and
allows the Markov chain to visit every state in the space will produce samples from the
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
48
same distribution. However for any practical number of trials, some proposal distributions
will be more efficient than others. In Experiments 1-3, we used Gaussian distributions
that had a variance along each parameter equal to a fixed percentage of that parameter’s
range. The percentage for each experiment was chosen based on pilot studies, looking for a
percentage that would give an average acceptance probability of around 20% - 40%. This
value was sought because it is the optimal acceptance probability when sampling from a
Gaussian distribution (Roberts et al., 1997) and we anticipated that our subjective probability distributions would be similar to Gaussians, consistent with several standard models
of human category learning (Ashby & Alfonso-Reese, 1995; Reed, 1972).
Recent research in statistics has focused on developing automated methods for adjusting a proposal distribution. A variety of related algorithms collectively known as adaptive
MCMC were developed for this reason (Gilks, Roberts, & George, 1994; Gelfand & Sahu,
1994; Gilks, Roberts, & Sahu, 1998). These algorithms work by using samples from the
chain in order to adjust the proposal distribution to be more efficient. Similar methods could
be used to adjust proposal distributions when using MCMC as an experimental procedure,
potentially increasing the efficiency of the method by shortening burn-in and ensuring better
exploration of the distributions associated with different categories.
Isolated modes. A concern in any attempt to characterize a natural category is that
there may be peaks of the probability distribution that are isolated from the rest of the peaks
by regions of stimuli that are not members of the category. This concern was especially valid
in Experiment 4 in which the mode of red apples might be isolated from the mode of green
apples by a region of poor category members. Drawing samples from isolated peaks can be
difficult under MCMC. The width of the proposal distribution can be increased to raise the
probability of proposing a state in an isolated mode, but has the downside of decreasing the
overall probability of accepting proposals. In Experiment 4, we used a mixture distribution
that combined a high proportion of local proposals with the possibility of making large
jumps in the parameter space. Despite using this proposal distribution, the chains did not
mix well in this experiment, which could be due to chains not switching between isolated
modes.
One alternative approach is to use algorithms developed in the machine learning literature to specifically deal with this issue. The key to these methods is to sample from the
target distribution at different temperatures, meaning that samples are drawn from distributions proportional to the target distribution raised to different powers. Low temperatures
refer to large exponents and high temperatures refer to small exponents. Metropolis-coupled
MCMC (Geyer, 1991) runs several chains with transition kernels at different temperatures
in parallel. The states of the chains are occasionally exchanged, allowing chains with small
proposal widths to make large jumps. Simulated tempering (Marinari & Parisi, 1992) is a
similar algorithm, but instead of running several chains in parallel, one chain is run and
transition kernels of different temperatures are swapped. These ideas provide help in most
applications, and it may be possible to apply them in the human MCMC procedure. In our
setting, the “temperature” of a chain corresponds to the value of the parameter γ used in
Equation 9. Consequently, MCMC methods that exploit variation in temperature could be
implemented by developing a way to adjust the γ parameter of a participant’s acceptance
function, perhaps through instructions or manipulation of the motivation of the participant
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
49
via rewards.
Conclusion
Markov chain Monte Carlo is one of the basic tools in modern statistical computing,
providing the basis for numerical simulations conducted in a wide range of disciplines. The
results presented in this paper suggest that this class of algorithms can also be of use in
psychology, not just for numerical simulation, but also for uncovering mental representations. The experiments in this paper have applied the MCMC method to categorization
tasks, but it could also be used in a variety of other domains as well. Many formal models of
human cognition (Oaksford & Chater, 1998; Chater, Tenenbaum, & Yuille, 2006; Anderson,
1990; Shiffrin & Steyvers, 1997) assume that there is a range of degrees of belief in different hypothetical outcomes. The MCMC method is designed to estimate these subjective
functions. Using this technique, interesting data could be collected in many different areas
of study, including investigating the subjective response to a probe of memory or exploring
the internal values of different response alternatives in decision making. We are particularly
excited about the prospects for using this approach together with probabilistic models of
cognition, making it possible to estimate many of the distributions that otherwise have to
be assumed in these models.
Beyond the potential of MCMC for uncovering mental representations, we view one
of the key contributions of this research to be the idea that we can use people as elements in
randomized algorithms, provided we can predict their behavior, and use those algorithms to
learn more about the mind. The early success of this approach suggests that other MCMC
methods could be used to explore mental representations, either by using the Barker acceptance function or by establishing other connections between statistical inference and human
behavior. The general principle of allowing people to act as components of randomized
algorithms can teach us a great deal about cognition.
References
Ahumada, A. J., & Lovell, J. (1971). Stimulus features in signal detection. Journal of the Acoustical
Society of America, 49 , 1751-1756.
Anderson, J. R. (1990). The adaptive character of thought. Hillsdale, NJ: Erlbaum.
Anderson, J. R., & Milson, R. (1989). Human memory: An adaptive perspective. Psychological
Review , 96 , 703-719.
Antoniak, C. (1974). Mixtures of Dirichlet processes with applications to Bayesian nonparametric
problems. The Annals of Statistics, 2 , 1152-1174.
Ashby, F. G. (1992). Multidimensional models of perception and cognition. Hillsdale, NJ: Erlbaum.
Ashby, F. G., & Alfonso-Reese, L. A. (1995). Categorization as probability density estimation.
Journal of Mathematical Psychology, 39 , 216-233.
Ashby, F. G., & Gott, R. E. (1988). Decision rules in the perception and categorization of multidimensional stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition,
14 , 33-53.
Ashby, F. G., & Maddox, W. T. (1993). Relations between prototype, exemplar, and decision bound
models of categorization. Journal of Mathematical Psychology, 37 , 372-400.
Barker, A. A. (1965). Monte Carlo calculations of the radial distribution functions for a protonelectron plasma. Australian Journal of Physics, 18 , 119-133.
Barnett, V. (1976). The ordering of multivariate data. Journal of the Royal Statistical Society.
Series A (General), 139 , 318-355.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
50
Billera, L. J., & Diaconis, P. (2001). A geometric interpretation of the Metropolis-Hastings algorithm.
Statistical Science, 16 , 335-339.
Bourne, L. (1982). Typicality effects in logically defined categories. Memory & Cognition, 10 , 3-9.
Bowman, A. W., & Azzalini, A. (1997). Applied smoothing techniques for data analysis: The kernel
approach with S-plus illustrations. Oxford: Oxford University Press.
Bradley, R. A. (1954). Incomplete block rank analysis: On the appropriateness of the model of a
method of paired comparisons. Biometrics, 10 , 375-390.
Brainard, D. H. (1997). The psychophysics toolbox. Spatial Vision, 10 , 433-436.
Brooks, S., & Roberts, O. (1998). Assessing convergence of Markov chain Monte Carlo. Statistics
and Computing, 8 , 319-335.
Chater, N., Tenenbaum, J. B., & Yuille, A. (2006). Special issue on “Probabilistic models of
cognition”. Trends in Cognitive Sciences, 10 (7).
Chrea, C., Valentin, D., Sulmont-Ross, C., Hoang Nguyen, D., & Abdi, H. (2005). Semantic,
typicality and odor representation: a cross-cultural study. Chemical Senses, 30 , 37-49.
Clarke, F. R. (1957). Constant-ratio rule for confusion matrices in speech communication. The
Journal of the Acoustical Society of America, 29 , 715-720.
Cornsweet, T. N. (1962). The staircase-method in psychophysics. The American Journal of Psychology, 75 , 485-491.
Cowles, M. K., & Carlin, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: A
comparative review. Journal of the American Statistical Association, 91 , 883-904.
Criss, A. H., & Shiffrin, R. M. (2004). Context noise and item noise jointly determine recognition
memory: a comment on Dennis and Humphreys (2001). Psychological Review , 111 , 800-807.
de Silva, V., & Tenenbaum, J. B. (2003). Global versus local methods in nonlinear dimensionality reduction. In S. T. S. Becker & K. Obermayer (Eds.), Advances in neural information
processing systems 15 (pp. 705–712). Cambridge, MA: MIT Press.
Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. New York: Wiley.
Farrell, S., & Ludwig, C. (in press). Bayesian and maximum likelihood estimation of hierarchical
response time models. Psychonomic Bulletin & Review .
Ferguson, T. S. (1983). Bayesian density estimation by mixtures of normal distributions. In
M. Rizvi, J. Rustagi, & D. Siegmund (Eds.), Recent advances in statistics (p. 287-302). New
York: Academic Press.
Garner, W. R., & Felfoldy, G. L. (1970). Integrality of stimulus dimensions of various types of
information processing. Cognitive Psychology, 1 , 225-241.
Gelfand, A. E., Hills, S. E., Racine-Poon, A., & Smith, A. F. M. (1990). Illustration of Bayesian
inference in normal data models using Gibbs sampling. Journal of the American Statistical
Association, 85 , 972-985.
Gelfand, A. E., & Sahu, S. K. (1994). On Markov chain Monte Carlo acceleration. Journal of
Computational and Graphical Statistics, 3 , 261-276.
Geyer, C. (1991). Markov chain Monte Carlo maximum likelihood. In E. M. Keramidas (Ed.), Proceedings of the 23rd symposium on the interface: Computing science and statistics. Interface
Foundation.
Gilden, D. L., Thornton, T., & Mallon, M. W. (1995, March). 1/f Noise in Human Cognition.
Science, 267 , 1837-1839.
Gilks, W. R., Richardson, S., & Spiegelhalter, D. J. (Eds.). (1996). Markov chain Monte Carlo in
practice. Suffolk: Chapman and Hall.
Gilks, W. R., Roberts, G. O., & George, E. I. (1994). Adaptive direction sampling. The Statistician,
43 , 179-189.
Gilks, W. R., Roberts, G. O., & Sahu, S. K. (1998). Adaptive Markov chain Monte Carlo through
regeneration. Journal of the American Statistical Association, 93 , 1045-1054.
Gold, J., Murray, R., Bennett, P., & Sekuler, A. (2000). Deriving behavioural receptive fields for
visually completed contours. Current Biology, 10 (11), 663-666.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
51
Goldstone, R. (1994). An efficient method for obtaining similarity data. Behavioral Research
Methods, Instruments, & Computers, 26 , 381-386.
Griffiths, T. L., Steyvers, M., & Tenenbaum, J. B. (2007). Topics in semantic representation.
Psychological Review , 114 , 211-244.
Hadsell, R., Chopra, S., & LeCun, Y. (2006). Dimensionality reduction by learning an invariant
mapping. In Computer vision and pattern recognition, 2006 IEEE computer society conference
on (Vol. 2, p. 1735-1742).
Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications.
Biometrika, 57 , 97-109.
Heit, E., & Barsalou, L. W. (1996). The instantiation principle in natural categories. Memory, 4 ,
413-451.
Hinton, G. E., & Salkhutdinov, R. R. (2006). Reducing the dimensionality of data with neural
networks. Science, 313 , 504-507.
Hopkins, J. W. (1954). Incomplete block rank analysis: Some taste test results. Biometrics, 10 ,
391-399.
Howes, D., & Osgood, C. E. (1954). On the combination of associative probabilities in linguistic
contexts. The American Journal of Psychology, 67 , 241-258.
Huelsenbeck, J. P., Ronquist, F., Nielsen, R., & Bollback, J. P. (2001). Bayesian inference of
phylogeny and its impact on evolutionary biology. Science, 294 , 2310-2314.
Huttenlocher, J., Hedges, L. V., & Vevea, J. L. (2000). Why do categories affect stimulus judgment?
Journal of Experimental Psychology: General , 129 , 220-241.
Jordan, M. I. (1995). Why the logistic function? a tutorial discussion on probablities and neural
networks (Computational Cognitive Science Technical Report No. 9503).
Labov, W. (1973). The boundaries of words and their meanings. In B. Aarts (Ed.), New ways of
analyzing variation in english (p. 340-373). Oxford University Press.
Lakoff, G. (1987). Women, fire, and dangerous things: What categories reveal about the mind.
Chicago: University of Chicago Press.
Levitt, H. (1971). Transformed up-down methods in psychoacoustics. The Journal of the Acoustical
Society of America, 2 , 467-477.
Li, R. W., Levi, D. M., & Klein, S. A. (2004). Perceptual learning improves efficiency by re-tuning
the decision ’template’ for position discrimination. Nature Neuroscience, 7 (2), 178-183.
Liu, J., Golinkoff, R. M., & Sak, K. (2001). One cow does not an animal make: Young children can
extend novel words at the superordinate level. Child Development, 72 , 1674-1694.
Love, B. C., Medin, D. L., & Gureckis, T. M. (2004). SUSTAIN: A network model of category
learning. Psychological Review , 111 , 309-332.
Luce, R. D. (1963). Detection and recognition. In R. D. Luce, R. R. Bush, & E. Galanter (Eds.),
Handbook of mathematical psychology, volume 1 (p. 103-190). New York and London: John
Wiley and Sons, Inc.
Marinari, E., & Parisi, G. (1992). Simulated tempering: A new Monte Carlo scheme. Europhysics
Letters, 19 , 451-458.
McClelland, J. L., & Elman, J. L. (1986). The TRACE model of speech perception. Cognitive
Psychology, 18 , 1-86.
McFadden, D. (1974). Conditional logit analysis of qualitative choice behavior. In P. Zarembka
(Ed.), Frontiers of econometrics. Academic Press.
McKoon, G., & Ratcliff, R. (2001). Counter model for word identification: reply to Bowers (1999).
Psychological Review , 108 , 674-681.
Medin, D. L., & Schaffer, M. M. (1978). Context theory of classification learning. Psychological
Review , 85 , 207-238.
Mengersen, K. L., Robert, C. P., & Guihenneuc-Jouyaux, C. (1999). MCMC convergence diagnostics:
a ’reviewww’. In J. Berger, B. J., A. P. Dawid, & A. F. M. Smith (Eds.), Bayesian statistics
6 (p. 415-440). Oxford: Oxford Sciences.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
52
Metropolis, A. W., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., & Teller, E. (1953).
Equations of state calculations by fast computing machines. Journal of Chemical Physics, 21 ,
1087-1092.
Minda, J. P., & Smith, J. D. (2002). Comparing prototype-based and exemplar-based accounts of
category learning and attentional allocation. Journal of Experimental Psychology: Learning,
Memory, and Cognition, 28 , 275-292.
Morey, R. D., Rouder, J. N., & Speckman, P. L. (2008). A statistical model for discriminating
between subliminal and near-liminal performance. Journal of Mathematical Psychology, 52 ,
21-36.
Morgan, B. J. T. (1974). On Luce’s choice axiom. Journal of Mathematical Psychology, 11 , 107-123.
Murdock, B. B. (1962). The serial position effect of free recall. Journal of Experimental Psychology,
64 , 482-488.
Murray, R. F., Bennett, P. J., & Sekuler, A. B. (2002). Optimal methods for calculating classification
images: weighted sums. Journal of Vision, 2 , 79-104.
Navarro, D. J., Griffiths, T. L., Steyvers, M., & Lee, M. D. (2006). Modeling individual differences
using Dirchlet processes. Journal of Mathematical Psychology, 50 , 101-122.
Navarro, D. J., & Lee, M. D. (2004). Common and distinctive features in stimulus similarity: a
modified version of the contrast model. Psychonomic Bulletin & Review , 11 , 961-974.
Neal, R. M. (1993). Probabilistic inference using Markov Chain Monte Carlo methods (Tech. Rep.
No. CRG-TR-93-1). Department of Computer Science, University of Toronto.
Neal, R. M. (1998). Markov chain sampling methods for Dirichlet process mixture models (Tech.
Rep. No. 9815). Department of Statistics, University of Toronto.
Newman, M. E. J., & Barkema, G. T. (1999). Monte carlo methods in statistical physics. Oxford:
Clarendon Press.
Ng, A. Y., & Jordan, M. I. (2001). On discriminative vs. generative classifiers: A comparison of
logistic regression and naive Bayes. In Nips (p. 841-848).
Norris, J. R. (1997). Markov chains. Cambridge, UK: Cambridge University Press.
Nosofsky, R. M. (1986). Attention, similarity, and the identification-categorization relationship.
Journal of Experimental Psychology: General , 115 , 39-57.
Nosofsky, R. M. (1987). Attention and learning processes in the identification and categorization of
integral stimuli. Journal of Experimental Psychology: Learning, Memory, and Cognition, 13 ,
87-108.
Nosofsky, R. M. (1988). Exemplar-based accounts of relations between classifcation, recognition,
and typicality. Journal of Experimental Psychology: Learning, Memory, and Cognition, 14 ,
700-708.
Nosofsky, R. M., Gluck, M., Palmeri, T. J., McKinley, S. C., & Glauthier, P. (1994). Comparing
models of rule-based classification learning: A replication and extension of Shepard, Hovland,
and Jenkins (1961). Memory and Cognition, 22 , 352-369.
Nosofsky, R. M., & Johansen, M. K. (2000). Exemplar-based accounts of “multiple-system” phenomena in perceptual categorization. Psychonomic Bulletin & Review , 7 , 375-402.
Nosofsky, R. M., & Zaki, S. R. (2002). Exemplar and prototype models revisited: Response strategies, selective attention, and stimulus generalization. Journal of Experimental Psychology:
Learning, Memory, and Cognition, 28 , 924-940.
Oaksford, M., & Chater, N. (Eds.). (1998). Rational models of cognition. Oxford University Press.
Oden, G. C., & Massaro, D. W. (1978). Integration of featural information in speech perception.
Psychological Review , 85 , 172-191.
Olman, C., & Kersten, D. (2004). Classification objects, ideal observers, and generative models.
Cognitive Science, 28 , 227-239.
Pelli, D. G. (1997). The VideoToolbox software for visual psychophysics: Transforming numbers
into movies. Spatial Vision, 10 , 437-442.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
53
Peskun, P. H. (1973). Optimum Monte-Carlo sampling using Markov chains. Biometrika, 60 ,
607-612.
Posner, M. I., & Keele, S. W. (1968). On the genesis of abstract ideas. Journal of Experimental
Psychology, 77 , 353-363.
Raaijmakers, J. G. W., & Shiffrin, R. M. (1981). Search of associative memory. Psychological
Review , 88 , 93-134.
Reed, S. K. (1972). Pattern recognition and categorization. Cognitive Psychology, 3 , 393-407.
Rice, J. A. (1995). Mathematical statistics and data analysis (2nd ed.). Belmont, CA: Duxbury.
Roberson, D., Davies, I., & Davidoff, J. (2000). Color categories are not universal: replications and
new evidence from a stone-age culture. Journal of Experimental Psychology: General , 129 ,
369-398.
Roberts, G. O., Gelman, A., & Gilks, W. R. (1997). Weak convergence and optimal scaling of
random walk metropolis algorithms. Annals of Applied Probability, 7 , 110-120.
Rosch, E., & Mervis, C. B. (1975). Family resemblances: studies in the internal structure of
categories. Cognitive Psychology, 7 , 573-605.
Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects
in natural categories. Cognitive Psychology, 8 , 382-439.
Rosch, E., Simpson, C., & Miller, R. S. (1976). Structural bases of typicality effects. Journal of
Experimental Psychology: Human Perception and Performance, 2 , 491-502.
Rosseel, Y. (2002). Mixture models of categorization. Journal of Mathematical Psychology, 46 ,
178-210.
Rothkopf, E. Z. (1957). A measure of stimulus similiarity and errors in some paired-associate
learning tasks. Journal of Experimental Psychology, 53 , 94-101.
Rouder, J. N. (2004). Modeling the effects of choice-set size on the processing of letters and words.
Psychological Review , 111 , 80-93.
Roweis, S. T., & Saul, L. K. (2000). Nonlinear dimensionality reduction by locally linear embedding.
Science, 290 , 2323-2326.
Schooler, L. J., Shiffrin, R. M., & Raaijmakers, J. G. W. (2001). A Bayesian model for implicit
effects in perceptual identification. Psychological Review , 108 .
Schwartz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6 , 461-464.
Shepard, R. N. (1957). Stimulus and response generalization: A stochastic model relating generalization to distance in psychological space. Psychometrika, 22 , 325-345.
Shepard, R. N. (1962). The analysis of proximities: Multidimensional scaling with an unknown
distance function: I. Psychometrika, 27 , 124-140.
Shepard, R. N. (1987). Toward a universal law of generalization for psychological science. Science,
237 , 1317-1323.
Shepard, R. N., & Arabie, P. (1979). Additive clutering: Representation of similarities as combinations of discrete overlapping properties. Psychological Review , 86 , 87-123.
Shepard, R. N., Hovland, C. I., & Jenkins, H. M. (1961). Learning and memorization of classifications. Psychological Monographs, 75 . (13, Whole No. 517)
Shiffrin, R. M., & Steyvers, M. (1997). A model for recognition memory: REM: Retrieving Effectively
from Memory. Psychonomic Bulletin & Review , 4 , 145-166.
Silverman, B. W. (1986). Density estimation. London: Chapman and Hall.
Smith, E. E., Patalano, A. L., & Jonides, J. (1998). Alternative strategies of categorization.
Cognition, 65 , 167-196.
Smith, J. D., & Minda, J. P. (1998). Prototypes in the mist: The early epochs of category learning.
Journal of Experimental Psychology: Learning, Memory, and Cognition, 24 , 1411-1436.
Snodgrass, J. G., & Vanderwart, M. (1980). A standardized set of 260 pictures: norms for name
agreement, image agreement, familiarity, and visual complexity. Journal of Experimental
Psychology: Human Learning and Memory, 6 , 174-215.
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
54
Storms, G., Boeck, P. D., & Ruts, W. (2000). Prototype and exemplar-based information in natural
language categories. Journal of Memory and Language, 42 , 51-73.
Tenenbaum, J. B., de Silva, V., & Langford, J. C. (2000). A global geometric framework for nonlinear
dimensionality reduction. Science, 290 , 2319-2323.
Tenenbaum, J. B., & Griffiths, T. L. (2001). The rational basis of representativeness. In Proceedings
of the 23rd annual meeting of the cognitive science society.
Tjan, B., & Nandy, A. (2006). Classification images with uncertainty. Journal of Vision, 6 , 387-413.
Torgerson, W. S. (1958). Theory and methods of scaling. New York: Wiley.
Vanpaemel, W., Storms, G., & Ons, B. (2005). A varying abstraction model for categorization. In
Proceedings of the 27th Annual Conference of the Cognitive Science Society. Mahwah, NJ:
Erlbaum.
Vevea, J. L. (2006). Recovering stimuli from memory: a statistical method for linking discrimination
and reproduction responses. British Journal of Mathematical and Statistical Psychology, 59 ,
321-346.
Victor, J. (2005). Analyzing receptive fields, classification images and functional images: challenges
with opportunities for symmetry. Nature Neuroscience, 8 , 1651-1656.
Vulkan, N. (2000). An economist’s perspective on probability matching. Journal of Economic
Surveys, 14 , 101-118.
Wills, A. J., Reimers, S., Stewart, N., Suret, M., & McLaren, I. P. L. (2000). Tests of the ratio rule
in categorization. The Quarterly Journal of Experimental Psychology, 53A, 983-1011.
Yellott, J. I. (1977). The relationship between Luce’s choice axiom, Thurstone’s theory of comparative judgment, and the double exponential distribution. Journal of Mathematical Psychology,
15 , 109-144.
Zaki, S. R., Nosofsky, R. M., Stanton, R. D., & Cohen, A. L. (2003). Prototype and exemplar
accounts of category learning and attentional allocation: A reassessment. Journal of Experimental Psychology: Learning, Memory, and Cognition, 29 , 1160-1173.
Appendix A
Bayesian analysis of choice tasks
Consider the following task. You are shown two objects, x1 and x2 , and told that one
of those objects comes from a particular category, c. You have to choose which object you
think comes from that category. How should you make this decision?
We can analyze this choice task from the perspective of a rational Bayesian learner.
The choice between the objects is a choice between two hypotheses: The first hypothesis,
h1 , is that x1 is drawn from the probability distribution associated with a category p(x|c)
and x2 is drawn from g(x), an alternative distribution that governs the probability of a noncategory object being presented to the participant. The second hypothesis, h2 , is that x1
is from the alternative distribution and x2 is from the category distribution. The posterior
probability of the first hypothesis given the data is determined via Bayes’ rule,
p(h1 |x1 , x2 ) =
=
p(x1 , x2 |h1 )p(h1 )
p(x1 , x2 |h1 )p(h1 ) + p(x1 , x2 |h2 )p(h2 )
p(x1 |c)g(x2 )p(h1 )
p(x1 |c)g(x2 )p(h1 ) + p(x2 |c)g(x1 )p(h2 )
(19)
where we use the category distribution p(x|c) and its alternative g(x) to calculate p(x1 , x2 |h).
We will now make two assumptions. The first assumption is that the prior probabilities of the hypotheses are the same. Since there is no a priori reason to favor one of the
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
55
objects over the other, this assumption seems reasonable. The second assumption is that
the probabilities of the two stimuli under the alternative distribution are approximately
equal, with g(x1 ) ≈ g(x2 ). If people assume that the alternative distribution is uniform,
then the probabilities of the two stimuli will be exactly equal. However, the probabilities
will still be roughly equal under the weaker assumption that the alternative distribution
is fairly smooth and x1 and x2 differ by only a small amount relative to the support of
that distribution. The difference between x1 and x2 is likely to be small if the proposal
distribution is narrow. With these assumptions Equation 19 becomes
p(h1 |x1 , x2 ) ≈
p(x1 |c)
p(x1 |c) + p(x2 |c)
(20)
with the posterior probability of h1 being set by the probabilities of x1 and x2 in that
category.
The analysis of the “Which is more typical?” question proceeds similarly. Tenenbaum
and Griffiths (2001) defined a Bayesian measure of the representativeness of a stimulus x
for a category c to be
p(x|c)
r(x, c) = P
(21)
0
0
c0 6=c p(x|c )p(c )
where p(c0 ) is the probability of an alternative category, normalized to exclude c. This
measure indicates the extent to which x provides evidence for c, as opposed to all other
categories. Applying the Luce choice rule, we might expect that participants asked to decide
whether x1 or x2 were more typical of c would choose x1 with probability
p(choose x1 ) = r(x1 , c)/(r(x1 , c) + r(x2 , c))
(22)
If x1 and x2 are approximately equally likely, averaging over all categories, then
P
P
0
0
0
0
c0 6=c p(x2 |c )p(c ). As above, the
c0 6=c p(x1 |c )p(c ) will be approximately the same as
fact that x1 and x2 should be relatively similar makes this assumption likely to hold. In
this case, Equation 22 reduces to the familiar Barker acceptance rule for p(x|c) (Equation
7).
Appendix B
The information available from discriminative data
Assume we have two categories with Gaussian distribution representations. The two
categories have equal covariance matrices Σ and assume that discriminative judgments are
made via a ratio rule (Equation 8). Without loss of generality, let the midpoint between
the means of categories 1 and 2 be zero, so that µ1 = −µ2 . The probability of assigning a
stimulus x to category 1 is
P (c = 1|x) =
N (x; µ1 , Σ)
N (x; µ1 , Σ) + N (x; µ2 , Σ)
(23)
where N (x; µ, Σ) is the probability density function for a Gaussian with mean µ and covariance matrix Σ, and we assume that the two categories have equal prior probability.
Dividing through by the numerator we obtain
P (c = 1|x) =
1+
exp{− 12 [(x
− µ2
)T Σ−1 (x
1
− µ2 ) − (x − µ1 )T Σ−1 (x − µ1 )]}
(24)
UNCOVERING MENTAL REPRESENTATIONS WITH MCMC
56
being a logistic function. Using the distributive property and µ1 = −µ2 , the argument of
the exponential in the denominator reduces to −µT1 Σ−1 x − xT Σ−1 µ1 , which simplifies to
−2µT1 Σ−1 x, being a linear function of x.
From the results in the previous paragraph, we can see that the mean µ1 can be
multiplied by any constant and produce the same discriminative judgments, as long as
the covariance Σ is multiplied by the same constant. This result was used to produce the
univariate example in Figure 20. In the multivariate case, this result means that discriminative judgments can at least determine the the means of the categories up to a multiplicative
constant. However, by diagonalizing the covariance matrix, we can show that the multivariate category means are even less constrained. The covariance matrix is diagonalized by
Σ = QT DQ, factoring it into matrices of eigenvectors Q and a diagonal matrix of eigenvalues D. The eigenvector matrices can be used to transform µ1 and x into a new basis,
giving
−2µT1 QT D−1 Qx = −2(µ01 )T Dx0
(25)
where µ0A and x0 are the vectors transformed into the new basis and QT = Q−1 because Q
is an orthogonal matrix.
In the right-hand side of Equation 25, each entry of the transformed mean vector
is multiplied only by the variance along that dimension. Now each element of the mean
vector can be multiplied by a different coefficient, as long as the corresponding variance is
divided by that coefficient. Assume we now have two more categories, categories 3 and 4,
with µ3 = −µ4 and a shared covariance matrix Σ0 . Let K be a diagonal matrix with each
entry corresponding to the ratio of an entry of the transformed mean of category 3, µ3 Q,
divided by an entry of the transformed mean of category 1, µ1 Q. Then µ3 can take any
value where Σ0 is positive semi-definite, with
Σ0 = QT DKQ
(26)
and still produce an equivalent boundary between categories 3 and 4 to that between categories 1 and 2.
Not all values of µ3 will produce a positive semi-definite Σ0 , but µ3 can take many
values that are not a multiplicative constant of µ1 . This analysis shows that discriminative judgments between a pair of categories cannot determine the category means without
additional information. Even knowing that the categories are multivariate Gaussian with
equal covariance, the best a researcher can say is that the means are equidistant from a particular point (or, more generally, a particular hyperplane) in a metric determined by their
covariance matrix, as this is the relevant criterion for producing the same discriminative
judgment. Though this result was shown assuming a ratio rule, odds ratios of multivariate
Gaussians, such as N (x; µ1 , Σ)/N (x; µ2 , Σ), have the same property. Dividing the ratio by
the numerator results in Equation 24 without the additive constant in the denominator.
Download