Chapter 1
Basic Concepts
If it can go wrong, it will. —Murphy
The phenomenon of toast falling from a table to land butter-side
down on the floor is popularly held to be empirical proof of the
existence of Murphy’s Law. —Matthews, 1995, p. 172
In this chapter, we present a few basic concepts about statistical modeling. These concepts, which form a cornerstone for the rest of the book,
are most easily discussed in the context of a simple example. We use an
example motivated by Murphy’s Law. As noted by Matthews in the above
quote, the probability that toast falls butter-side down is an important quantity
in testing Murphy’s Law. In order to estimate this probability, toast can be
flipped; Figure 1.1 provides an example of a toast flipper. The toast is first
buttered and then placed butter-side up at Point A of the device. Then the
experimenter pushes on the lever (Point B) and observes whether the toast
hits the floor butter-side up or butter-side down. We will call the whole act
flipping toast as it is analogous to flipping coins.
1.1 Random Variables

1.1.1 Outcomes and Events
The basic ideas of modeling are explained with reference to two simple experiments. In the first, Experiment 1, a piece of toast is flipped once. In
Figure 1.1: Flipping toast. A piece of toast is placed butter-side up at Point A.
Then, force is applied to Point B, launching the toast off the table.
the second, Experiment 2, a piece of toast is flipped twice. The first step in
analysis is describing the outcomes and sample space.
Definition 1 (Outcome) An outcome is a possible result of
an experiment.
There are two possible outcomes for Experiment 1: (1) the piece of toast
falls butter-side down or (2) it falls butter-side up. We denote the former
outcome by D and the latter by U (for “down” and “up,” respectively). For
Experiment 2, the outcomes are denoted by ordered pairs as follows:
• (D, D) : The first and second pieces fall butter-side down.
• (D, U) : The first piece falls butter-side down and the second falls
butter-side up.
• (U, D) : The first piece falls butter-side up and the second falls butter-side down.
• (U, U) : The first and second pieces fall butter-side up.
Definition 2 (Sample Space) The sample space is the set of
all outcomes.
The sample space for Experiment 1 is {U, D}. The sample space for
Experiment 2 is {(D, D), (U, D), (D, U), (U, U)}.
Although outcomes describe the results of experiments, they are not sufficient for most analyses. To see this insufficiency, consider Murphy’s Law
in Experiment 2. We are interested in whether one or more of the flips is
butter-side up. There is no outcome that uniquely represents this event.
Therefore, it is common to consider events:
Definition 3 (Events) Events are sets of outcomes. They are,
equivalently, subsets of the sample space.
There are four events associated with Experiment 1 and sixteen associated
with Experiment 2. For Experiment 1, the four events are:
{U}, {D}, {U, D}, ∅.
The event {U, D} refers to the case that either the toast falls butter-side
down or it falls butter-side up. Barring the miracle that the toast lands on
its side, it will always land either butter-side up or butter-side down. Even
though this event seems uninformative, it is still a legitimate event and is
included. The null set (∅) is an empty set; it has no elements.
The 16 events for Experiment 2 are:

{(D, D)}, {(D, U)}, {(U, D)}, {(U, U)},
{(D, D), (D, U)}, {(D, D), (U, D)}, {(D, D), (U, U)}, {(D, U), (U, D)}, {(D, U), (U, U)}, {(U, D), (U, U)},
{(D, D), (D, U), (U, D)}, {(D, D), (D, U), (U, U)}, {(D, D), (U, D), (U, U)}, {(D, U), (U, D), (U, U)},
{(D, D), (D, U), (U, D), (U, U)}, ∅.
1.1.2 Probability
Definition 4 (Probability) Probabilities are numbers assigned to events. The number reflects our degree of belief in
the plausibility of the event. This number ranges from zero (the
event will never occur in the experiment) to one (the event will
always occur). The probability of event A is denoted P r(A).
To explain how probability works, it is helpful at first to assume the
probabilities of at least some of the events are known. For now, let’s assume
the probability that toast falls butter-side down is .7; i.e., Pr({D}) = .7.
It is desirable to place probabilities on the other events as well. The following
concepts are useful in doing so:
Definition 5 (Union) The union of sets A and B, denoted
A ∪ B, is the set of all elements that are either in A or
in B. For example, if A = {1, 2, 3} and B = {3, 4, 5}, then
A ∪ B = {1, 2, 3, 4, 5}.
Definition 6 (Intersection) The intersection of sets A and
B, denoted A ∩ B, is the set of all elements that are both in A
and in B. For example, if A = {1, 2, 3} and B = {3, 4, 5}, then
A ∩ B = {3}.
Probabilities are placed on events by applying the following three rules
of probability:
Event     Pr
{D}       .7
{U}       .3
{U, D}    1
∅         0

Table 1.1: Probabilities of events in Experiment 1.
Definition 7 (The Three Rules of Probability) The three
rules of probability are:
1. P r(A) ≥ 0, where A is any event,
2. P r(sample space) = 1, and
3. if A ∩ B = ∅, then P r(A ∪ B) = P r(A) + P r(B).
The three rules are known as the Kolmogorov axioms of probability (Kolmogorov, 1950). It is relatively easy to apply the rules to the probability of
events. Table 1.1 shows the probabilities of events for Experiment 1.
For Experiment 2, let’s assume the following probabilities on the events
corresponding to single outcomes: P r({(D, D)}) = .49, P r({(D, U)}) = .21,
P r({(U, D)}) = .21, and P r({(U, U)}) = .09. Using the above rules, we can
compute the probabilities on the events. These probabilities are shown in
Table 1.2.
1.1.3 Random Variables
Suppose we are interested in the number of butter-side-down flips. According
to Murphy’s Law, this number should be large. For each experiment, this
number varies according to the probability of events. The concept of a random
variable captures this dependence:
Event                                  Probability
{(D, D)}                               .49
{(D, U)}                               .21
{(U, D)}                               .21
{(U, U)}                               .09
{(D, D), (D, U)}                       .70
{(D, D), (U, D)}                       .70
{(D, D), (U, U)}                       .58
{(D, U), (U, D)}                       .42
{(D, U), (U, U)}                       .30
{(U, D), (U, U)}                       .30
{(D, U), (U, D), (U, U)}               .51
{(D, D), (U, D), (U, U)}               .79
{(D, D), (D, U), (U, U)}               .79
{(D, D), (D, U), (U, D)}               .91
{(D, D), (D, U), (U, D), (U, U)}       1
∅                                      0

Table 1.2: Probabilities of events in Experiment 2.
Outcome    Value of X
D          1
U          0

Table 1.3: Definition of a random variable X for Experiment 1.
Outcome    Value of X
(D, D)     2
(D, U)     1
(U, D)     1
(U, U)     0

Table 1.4: Definition of a random variable X for Experiment 2.
Definition 8 (Random Variable) A random variable (RV)
is a function that maps events into sets of real numbers. Probabilities on events become probabilities on the corresponding sets
of real numbers.
Let random variable X denote the number of butter-side down flips. X
is defined for Experiments 1 and 2 in Tables 1.3 and 1.4, respectively.
Random variables map experimental results into numbers. This mapping also applies to probability: probabilities on events in the natural world
transfer to numbers. Tables 1.5 and 1.6 show this mapping.
Value of X    Corresponding Event    Probability
1             {D}                    .7
0             {U}                    .3

Table 1.5: Probabilities associated with random variable X for Experiment 1.
Value of X    Corresponding Event      Probability
2             {(D, D)}                 .49
1             {(D, U), (U, D)}         .42
0             {(U, U)}                 .09

Table 1.6: Probabilities associated with random variable X for Experiment 2.
Random variables are typically typeset in upper-case; e.g., X. Random
variables take on values and these values are typeset in lower-case; e.g., x.
The expression P r(X = x) refers to the probability that an event corresponding to x will occur. The mappings in Tables 1.5 and 1.6 are typically
expressed as the function f (x) = P r(X = x), which is called the probability
mass function.
Definition 9 (Probability Mass Function) The probability
mass function, f(x), provides the probability for a particular
value of a random variable.
For Experiment 1, the probability mass function of X is

$$f(x) = \begin{cases} .3 & x = 0 \\ .7 & x = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1.1)$$

For Experiment 2, it is

$$f(x) = \begin{cases} .09 & x = 0 \\ .42 & x = 1 \\ .49 & x = 2 \\ 0 & \text{otherwise} \end{cases} \qquad (1.2)$$

Probability mass functions always sum to 1, i.e., $\sum_x f(x) = 1$, as a consequence of the Kolmogorov axioms in Definition 7. As can be seen, the above probability mass functions sum to 1.
As discussed in the preface, we use the computer package R to aid with
statistical analysis. R can be used to plot these probability mass functions.
The following code is for plotting the probability mass function for Experiment 2, shown in Eq. (1.2). The first step in plotting is to assign values to x
and f . These assignments are implemented with the statements x=c(0,1,2)
and f=c(.09,.42,.49). The symbol c() stands for “concatenate” and is
used to join several numbers into a vector. The values of a variable may be seen
by simply typing the variable name at the R prompt; e.g.,
> x=c(0,1,2)
> f=c(.09,.42,.49)
> x
[1] 0 1 2
> f
[1] 0.09 0.42 0.49
The function plot() can be used to plot one variable as a function of
another. Try plot(x,f,type=’h’). The resulting graph should look like
Figure 1.2. There are several types of plots in R including scatter plots, bar
plots, and line plots. The type of plot is specified with the type option. The
option type=’h’ specifies thin vertical lines, which is appropriate for plotting
probability mass functions. Help on any R command is available by typing
help with the command name in parentheses, e.g., help(plot). The points
on top of the lines were added with the command points(x,f).
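Putting these commands together, a minimal sketch of the code behind Figure 1.2 (the axis labels and limits are our own choices, not part of the original commands) is:

x=c(0,1,2)                          #values of X
f=c(.09,.42,.49)                    #probability mass function for Experiment 2
plot(x,f,type='h',xlim=c(-1,3),ylim=c(0,.6),
  xlab='Value of X',ylab='Probability Mass Function')   #thin vertical lines
points(x,f)                         #points on top of the lines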
The random variable X, which denotes the number of butter-side-down
flips, is known as a discrete random variable. The reason is that probability
is assigned to discrete points; for X, it is the discrete points of 0, 1, and
2. There is another type of random variable, a continuous random variable,
in which probability is assigned to intervals rather than points. The differences between discrete and continuous random variables will be discussed in
Chapter 4.
1.1.4 Parameters
Up to now, we have assumed probabilities on some events (those corresponding to outcomes) and used the laws of probability to assign probabilities to
the other events. In experiments, though, we do not assume probabilities; instead, we estimate them from data. We introduce the concept of a parameter
to avoid assuming probabilities.

Figure 1.2: Probability mass function for random variable X, the number of butter-side-down flips, in Experiment 2.
Definition 10 (Parameter) Parameters are mathematical
variables on which a random variable may depend. Probabilities
on events are functions of parameters.
For example, let the outcomes of Experiment 1 depend on parameter p
as follows: P r(D) = p. Because the probability of all outcomes must sum to
1.0, the probability of event U must be 1 − p. The resulting probability mass
function for X, the number of butter-side down flips, is

$$f(x; p) = \begin{cases} 1 - p & x = 0 \\ p & x = 1 \\ 0 & \text{otherwise} \end{cases} \qquad (1.3)$$
The use of the semicolon in f (x; p) indicates that the function is of one
variable, x, for a given value of the parameter p.
Let’s consider the probabilities in Experiment 2 to be parameters defined
as p1 = P r(D, D), p2 = P r(D, U), p3 = P r(U, D). By the laws of probability,
P r(U, U) = 1 − p1 − p2 − p3 . The resulting probability mass function on X is
$$f(x; p_1, p_2, p_3) = \begin{cases} 1 - p_1 - p_2 - p_3 & x = 0 \\ p_2 + p_3 & x = 1 \\ p_1 & x = 2 \\ 0 & \text{otherwise} \end{cases} \qquad (1.4)$$
This function is still of one variable, x, for given values of p1 , p2 , and p3 .
Although the account of probability and random variables presented here
is incomplete, it is sufficient for the development that follows in the book.
A more complete basic treatment can be found in mathematical statistics
textbooks such as Hogg & Craig (1978) and Rice (1995).
Problem 1.1.1 (Your Turn)
You and a friend are playing a game with a four-sided die. Each of the
sides, labeled A, B, C, and D, has equal probability of landing up on
any given throw.
1. There are two throws left in the game; list all of the possible
outcomes for the last two throws. Hint: these outcomes may be
expressed as ordered pairs.
2. In order to win, you need to throw an A or B on each of the final
two throws. List all the outcomes that are elements in the event
that you win.
3. In this game there are only two possible outcomes: winning and
losing. Given the information in part 2, what is the probability mass function for the random variable that maps the event
that you win to 0 and the event that you lose to 1? Plot this
function in R.
1.2 Binomial Distribution
Experiment 1, in which a piece of toast is flipped once, plays an important
role in developing more sophisticated models. Experiment 1 is an example
of a Bernoulli trial, defined below.
Definition 11 (Bernoulli Trial) Bernoulli trials are experiments with two mutually exclusive (or dichotomous) outcomes.
Examples include a flip of toast; the sex of a baby (assuming
only male or female outcomes); or, for our purposes, whether
a participant produces a correct response (or not) on a trial in
a psychology experiment. By convention, one of the outcomes
is called a success, the other a failure. A random variable is
distributed as a Bernoulli if it has two possible values: 0 (for
failure) and 1 (for success).
As a matter of notation, we will consider the butter-side-down result in
toast flipping as a success and the butter-side-up result as a failure. Random
variables from Bernoulli trials have a single parameter p and probability mass
function given by Eq. (1.3). In general, the value of p is not known a priori
and must be estimated from the data.
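In R, a single Bernoulli trial can be generated as a binomial with one trial; a minimal sketch (the choice of five trials and p = .7 here is arbitrary):

rbinom(5,1,.7)   #five Bernoulli trials with p=.7; 1=success, 0=failure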
Experiment 2 is a sequence of two Bernoulli trials. We let X1 and X2
denote the outcomes (either success or failure) of these two trials. Let p1 and
p2 be the probability of success parameter for X1 and X2 , respectively. If
p1 = p2 , then random variables X1 and X2 are called identical. Note that
identical does not mean that the results of the flips are the same, e.g., both
flips are successes or both are failures. Instead, it means that the probabilities
of a success are the same. The concept of identical random variables can be
extended: whenever two random variables have the same probability mass
function, they are identical. If the result of a Bernoulli trial is not affected by
the result of the others, then the RVs are called independent. If two random
variables are both independent and identically distributed, then they are
called iid.
In order to understand the concepts of independent and identically distributed, it may help to consider a concrete example. Suppose a basketball
player is shooting free throws. If one throw influences the next, for instance,
if a player gets discouraged because he or she misses a throw, this is a violation of independence, but not necessarily of identical distribution. Although
one throw may affect the next, if on average they are the same, identical
distribution is not violated. If a player gets tired and does worse over time,
regardless of the outcome of his or her throws, then identical distribution is
violated. It is possible to violate one and not the other.
Definition 12 (Bernoulli Process) A sequence of independent and identically distributed Bernoulli trials is called a
Bernoulli Process.
In a Bernoulli process, each Xi is a function of the same parameter p.
Furthermore, because each trial is independent, the order of outcomes is
unimportant. Therefore, it makes sense to define a new random variable, Y ,
which is the total number of successes, i.e., $Y = \sum_{i=1}^{N} X_i$.
Definition 13 (Binomial Random Variable) The random
variable which denotes the number of successes in a Bernoulli
process is called a binomial. The binomial random variable has
a single parameter, p (probability of success on a single trial). It
is also a function of a constant, N, the number of trials. It can
take any integer value from 0 to N.
Definition 14 (Random Variable Notation) It is common
to use the character “∼” to indicate the distribution of a random variable. A binomial random variable is indicated as
Y ∼ Binomial(p, N). Here, Y is the number of successes in
N trials where the probability of success on each trial is p.
The variable p is considered a parameter but the variable N is not. The
value of N is known exactly and supplied by the experimenter. The true
value of p is generally unknown and must be estimated. The probability
mass function of a binomial random variable describes the probability of
observing y successes in N trials:
$$\Pr(Y = y) = f(y; p) = \begin{cases} \binom{N}{y} p^y (1-p)^{N-y} & y = 0, \ldots, N \\ 0 & \text{otherwise} \end{cases} \qquad (1.5)$$

The term $\binom{N}{y}$ refers to the “choose function,” given by

$$\binom{N}{y} = \frac{N!}{y!(N-y)!}.$$
Let’s look at the probability mass function in R. Try the following for
N = 20 and p = .7:
y=0:20              #assigns y to 0,1,2,..,20
f=dbinom(y,20,.7) #probability mass function
plot(y,f,type=’h’)
points(y,f)
In the above code, dbinom() is an R function that returns the probability
mass function of a binomial. Variable y is a vector taking on 21 values
(0,1,..,20). Because the first argument of dbinom() is a vector, the output is
also a vector with one element for each element of y. Type f to see the 21
values. The first value of f corresponds to the first value of y; the second
value of f to the second value of y; and so on. The resulting plot is shown
in Figure 1.3.
The goal of Experiments 1 and 2 is to learn about p. These experiments,
however, are too small to learn much. Instead, we need a larger experiment
with more flips; for generality, consider the case in which we had N flips. One
common-sense approach is to take the number of successes in a Bernoulli
process, Y, and divide by N, the number of trials. A function of random
variables that estimates a parameter is called an estimator. The common-sense estimator of p is p̂ = Y /N. It is conventional to place the caret over
an estimator as in p̂. This distinguishes it from the true, but unknown
parameter p. Note that because p̂ is a function of a random variable, it is
also a random variable. Because estimators are random variables themselves,
studying them requires more background about random variables.
Figure 1.3: Probability mass function for a binomial random variable with N = 20 and p = .7.
1.3 Expected Values of Random Variables

1.3.1 Expected Value
The expected value is the center or theoretical average of a random variable.
For example, the center of the distribution in Figure 1.3 is at 14. Expected
value is defined as:
Definition 15 (Expected Value) The expected value of a
discrete random variable X is given as:
$$E(X) = \sum_x x f(x; p). \qquad (1.6)$$
The expected value of a random variable is closely related to the concept of
an average, or mean. Typically, to compute a mean of a set of values, one
adds all the values together and divides by the total number of values. In
the case of an expected value, however, each possible value is weighted by
the probability that it will occur before summing. The expected value of a
random variable is also called its first moment, population mean, or simply
its mean. It is important to differentiate between the expected value of a
random variable and the sample mean, which will be discussed subsequently.
16
CHAPTER 1. BASIC CONCEPTS
Consider the following example. Let X be a random variable denoting the
outcome of a Bernoulli trial with parameter p = .7. Then, the expected value
is given by $E(X) = \sum_x x f(x; p) = (0 \times .3) + (1 \times .7) = .7$. More generally,
the expected value of a Bernoulli trial with parameter p is E(X) = p.
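This weighted sum is easy to check in R; a minimal sketch for the Bernoulli example with p = .7:

x=c(0,1)      #possible values of X
f=c(.3,.7)    #probability mass function
sum(x*f)      #expected value; returns 0.7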
1.3.2 Variance
Whereas the expected value measures the center of a random variable, variance measures its spread.
Definition 16 (Variance) The variance of a discrete random
variable is given as:
$$V(X) = \sum_x [x - E(X)]^2 f(x; p) \qquad (1.7)$$
Just as the expected value is a weighted sum, so is the variance of a random
variable. The variance is the sum of all possible squared deviations from
the expected value, weighted by their probability of occurring. An equivalent
equation for variance is given as V(X) = E[(X − E(X))2 ]; that is, variance
is the expected squared deviation of a random variable from its mean. The
variance of a random variable is different from the variance of a sample, which
will be discussed subsequently.
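Continuing the Bernoulli example with p = .7, the variance can be computed as the same kind of weighted sum; a minimal sketch:

x=c(0,1)
f=c(.3,.7)
mu=sum(x*f)          #expected value, .7
sum((x-mu)^2*f)      #variance; returns .21 = .7*.3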
Another common measure of the spread of a random variable is the standard deviation. The standard deviation of a random variable is the square root of the variance, i.e., $SD(X) = \sqrt{V(X)}$. Standard deviation is often used as a
measure of spread rather than variance because it is in the same units as the
random variable. Variance, in contrast, is in squared units, which are more
difficult to interpret. The standard deviation of an estimator has its own
name: standard error.
Definition 17 (Standard Error) The standard deviation of
a parameter estimator is called the standard error of the estimator.
Problem 1.3.1 (Your Turn)
1. It is common in psychology to ask people their opinion of statements, e.g., “I am content with my life.” Responses are often
collected on a Likert scale; e.g., 1=strongly disagree, 2=disagree,
3=neutral, 4=agree, 5=strongly agree. The answer may be considered a random variable. Suppose the probability mass function
for the above question is given as f (x) = (.05, .15, .25, .35, .2) for
x = 1, . . . , 5, respectively. Plot this probability mass function.
Compute the expected value. Does the expected value appear to
be at the center of the distribution? Compute the variance.
2. Let Y be a binomial RV with N = 3 and parameter p. Show
E(Y ) = 3p.
1.3.3 Expected Value of Functions of Random Variables
It is often necessary to consider functions of random variables. For example,
the common sense estimator of p, p̂ = Y /N, is a function of random variable
Y . The following two rules are convenient in computing the expected value
of functions of random variables.
Definition 18 (Two rules of expected values) Let X, Y ,
and Z all denote random variables, and let Z = g(X). The
following rules apply to the expected values:
1. The expected value of the sum is the sum of the expected values: $E(X + Y) = E(X) + E(Y)$, and

2. the expected value of a function of a random variable is

$$E(Z) = \sum_x g(x) f(x; p),$$

where f is the probability mass function of X.
The first rule can be used to find the expected value of a binomial random variable. By definition, binomial RV Y is defined as $Y = \sum_{i=1}^{N} X_i$, where the $X_i$ are iid Bernoulli trials. Hence, by Rule 1,

$$E(Y) = E\left(\sum_i X_i\right) = \sum_i E(X_i) = \sum_i p = Np.$$
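This result is easy to check numerically with dbinom(); a minimal sketch for N = 20 and p = .7:

y=0:20
sum(y*dbinom(y,20,.7))   #expected value of the binomial; returns 14 = 20*.7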
The second rule can be used to find the expected value of p̂. The random variable is p̂ = g(Y), where g(Y) = Y/N. The expected value of p̂ is given by:

$$\begin{aligned} E(\hat{p}) &= E(g(Y)) \\ &= \sum_x (x/N) f(x; p) \\ &= (1/N) \sum_x x f(x; p) \\ &= (1/N) E(Y) \\ &= (1/N)(Np) \\ &= p. \end{aligned}$$
While p̂ may vary from experiment to experiment, its average will be p.
1.4 Sequences of Random Variables

1.4.1 Realizations
Consider Experiment 1, the single flip of toast and the random variable,
X, the number of butter-side-down flips. Before the experiment, there are
two possible values that X could take with nonzero probability, 0 and 1.
Afterward, there is one result. The result is called the realization of X.
Definition 19 (Realization) The realization of a RV is the
value it attains in an experiment.
Consider Experiment 2 in which two pieces of toast are flipped. Before
the experiment is conducted, the possible values of X are 0, 1, and 2. Afterward, the realization of X can be only one of these values. The same is
true of estimators. Consider the random variables Y ∼ Binomial(p, N) and
common-sense estimator p̂ = Y /N. After an experiment, these will have realizations denoted y and y/N, respectively. The realization of an estimator is
called an estimate.
It is easy to generate realizations in R. For binomial random variables,
the appropriate function is rbinom(): type rbinom(1, 20, .7). The first
argument is the number of realizations, which is 1. The second is N, the
number of trials. The third is p, the probability of success on a trial. Try
the command a few times. The output of each command is one realization
of an experiment with 20 trials.
1.4.2 Law of Large Numbers
Consider rbinom(5, 20, .7). This should yield five replicates; the five
realizations from five separate experiments. There are two interpretations
of the five realizations. The first, sometimes prominent in undergraduate
introductory texts, is that these five numbers are samples from a common
distribution. The second, which is more common in advanced treatments of
probability, is that the realizations are from different, though independent
and identically distributed, random variables. Replicate experiments can be
represented as a sequence of random variables, and in this case, we write:
$$Y_i \overset{iid}{\sim} \text{Binomial}(p = .7, N = 20), \quad i = 1, \ldots, 5.$$
Each Yi is a different random variable, but all Yi are independent and distributed as identical binomials. Each i could represent a different trial, a
different person, or a different experimental condition.
Of course, we are not limited to 5 replicates; for example y=rbinom(200,
20, .7) produces 200 replicates and stores them in vector y. To see a histogram, type hist(y, breaks=seq(-.5, 20.5, 1), freq=T). We prefer a
different type of histogram for looking at realizations of discrete random
variables—one in which the y-axis is not the raw counts but the proportion, or relative frequency, of counts. These histograms are called relative
frequency histograms.
Definition 20 (Relative Frequency Histogram) Let $Y_i \overset{iid}{\sim} Y$
be a sequence of M independent and identically distributed
discrete random variables and let y1 , .., yM be a sequence of corresponding realizations. Let hM (j) be the proportion of realizations with value j. The relative frequency histogram is a plot of
hM (j) against j.
Relative frequency histograms may be drawn in R with the following
code:
freqs=table(y) #frequencies of realization values
props=freqs/200 # proportions of realization values
plot(props, xlab=’Value’, ylab=’Relative Frequency’)
The code draws the histogram as a series of lines. The relative histogram
plot looks like a probability mass function. Figure 1.4A shows that this is
no coincidence. The lines are the relative frequency histogram; the points
are the probability mass function for a binomial with N = 20 and p = .7
(The points were produced with the points() function. The specific form is
points(0:21,dbinom(0:21,20,.7),pch=21)).
Figure 1.4: A. Relative Frequency histogram and probability mass function
roughly match with 200 realizations. B. The match is near perfect with
100,000 realizations.
The match between the relative frequency histogram and the pmf is not exact. The
problem is that there are only 200 realizations. Figure 1.4B shows the match
between probability mass function and the relative frequency histogram when
there are 10,000 realizations. Here, the match is nearly perfect. This match
indicates that as the number of realizations grows, the relative frequency
histogram converges to the probability mass function. The convergence is a
consequence of the Law of Large Numbers. The Law of Large Numbers says, informally,
that the proportion of realizations attaining a particular value will converge
to the true probability of that realization. More formally,
$$\lim_{M \to \infty} h_M(j) = f(j; p),$$
where f is the probability mass function of Y .
The fact that the relative frequency histogram of samples converges to
the probability mass function is immensely helpful in understanding random
variables. Often, it is difficult to write down the probability mass function
of a random variable but easy to generate samples of realizations. By generating a sequence of realizations from independent and identically distributed
random variables, it is possible to see how the probability mass function behaves. This approach is called the simulation approach and we use it liberally
as a teaching tool.
We can use the simulation approach to approximate the probability mass
function for the common-sense estimator p̂ = Y /N with the following R code:
Figure 1.5: Simulated probability mass function for a common-sense estimator of p for a binomial with N = 20 and p = .7.
y=rbinom(10000,20,.7)
p.hat=y/20 #10,000 iid replicates of p-hat
freq=table(p.hat)
plot(freq/10000,type=’h’)
The resulting plot is shown in Figure 1.5. The plot shows the approximate
probability mass function for the p̂ estimator. The distribution of an estimator is so often of interest that it has a special name: a sampling distribution.
Definition 21 (Sampling Distribution) A sampling distribution is the probability mass function of an estimator.
1.5 Estimators
Estimators are random variables that are used to estimate parameters from
data. We have seen one estimator, the common-sense estimator of p in a
binomial: p̂ = Y /N. Two others are the sample mean and sample variance defined below, which are used as estimators for the expected value and
variance of an RV, respectively.
Definition 22 (Sample Mean and Sample Variance)
Let Y1 , Y2, ..., YM be a collection of M random variables. The
sample mean and sample variance are defined as

$$\bar{Y} = \frac{\sum_i Y_i}{M}, \qquad (1.8)$$

$$s^2_Y = \frac{\sum_i (Y_i - \bar{Y})^2}{M - 1}, \qquad (1.9)$$

respectively.
How good are these estimators? To answer this question, we first discuss
properties of estimators.
1.5.1 Properties of estimators
To evaluate the usefulness of estimators, statisticians usually discuss three
basic properties: bias, efficiency, and consistency. Bias and efficiency are
illustrated in Table 1.7. The data are the results of weighing a hypothetical
person of 170 lbs on two hypothetical scales four separate times. Bias refers
to the mean of repeated estimates. Scale A is unbiased because the mean of
the estimates equals the true value of 170 lbs. Scale B is biased. The mean
is 172 lbs, which is 2 lbs greater than the true value of 170 lbs. Examining the
values for scale B, however, reveals that scale B has a smaller degree of error
than scale A. Scale B is called more efficient than Scale A. High efficiency
means that expected error is low. Bias and efficiency have the same meaning
for estimators as they do for scales. Bias refers to the difference between
the average value of an estimator and a true value. Efficiency refers to the
amount of spread in an estimator around the true value.
The bias and efficiency of any estimator depend on the sample size. For
example, the common-sense estimator $\hat{p} = \sum_{i=1}^{N} Y_i / N$ provides a better
Table 1.7: Two Hypothetical Scales

        Scale A    Scale B
        180        174
        160        170
        175        173
        165        171
Mean    170        172
Bias    0          2.0
RMSE    7.91       2.55
estimate with increasing N. Let θ̂N denote an estimator which estimates
parameter θ, for a sample size of N.
Definition 23 (Bias) The bias of an estimator is given by $B_N$:

$$B_N = E(\hat{\theta}_N) - \theta$$
Bias refers to the expected value of an estimator. We have already proven
that estimator p̂ is unbiased (Section 1.3.3). Both sample mean and sample
variance are also unbiased. Other common estimators, however, are biased.
One example is the sample correlation. Fortunately, this bias reduces toward
zero with increasing N. Unbiasedness is certainly desirable, but not critical.
Many of the estimators discussed in this book will have some degree of bias.
Problem 1.5.1 (Your Turn)
Let Yi ; i = 1..N be a sequence of N independent and identically distributed random variables. Show that the sample mean is unbiased for
all N (hint: use the rules of expected value in Definition 18).
Definition 24 (Efficiency) Efficiency refers to the expected
degree of error in estimation. We use root-mean-squared error
(RMSE) as a measure of efficiency:
$$\text{RMSE} = \sqrt{E[(\hat{\theta}_N - \theta)^2]} \qquad (1.10)$$
More efficient estimators have less error, on average, than less efficient estimators. Sample mean and sample variance are the most efficient
unbiased estimators of expected value and variance, respectively. One of the
main issues in estimation is the trade-off between bias and efficiency. Often, the most efficient estimator of a parameter is biased, and this facet is
explored in the following section.
The final property of estimators is consistency.
Definition 25 (Consistency) An estimator is consistent if
$$\lim_{N \to \infty} \text{RMSE}(\hat{\theta}_N) = 0$$
Consistency means that as the sample size gets larger and larger, the estimator converges to the true value of the parameter. If an estimator is consistent, then one can estimate the parameter to arbitrary accuracy. To get
more accurate estimates, one simply increases the sample size. Conversely, if
an estimator is inconsistent, then there is a limit to how accurately the parameter can be estimated, even with infinitely large samples. Most common
estimators in psychology, including the sample mean, sample variance, and
sample correlation, are consistent.
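Consistency of p̂ can be illustrated with the simulation method: the simulated RMSE shrinks toward zero as N grows. A minimal sketch (the particular values of N are our own choices):

p=.7
for (N in c(10,100,1000))
{
  p.hat=rbinom(10000,N,p)/N          #10,000 simulated estimates for this N
  print(sqrt(mean((p.hat-p)^2)))     #simulated RMSE; decreases with N
}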
Because the sample mean and sample variance converge to the expected value
and variance, respectively, they can be used to estimate these properties.
For example, let's approximate the expected value, variance, and standard
error of p̂ with the sample statistics in R. We first generate a sequence of
realizations $y_1, \ldots, y_M$ for binomial random variables $Y_i \overset{iid}{\sim} Y$, $i = 1, \ldots, M$. For
each realization, we compute an estimate $\hat{p}_i = y_i/N$. The sample mean,
sample variance, and sample standard deviation approximate the expected
value, variance, and standard error. To see this, run the following R code:
y=rbinom(10000,20,.7)
p.hat=y/20
mean(p.hat)    #sample mean
var(p.hat)     #sample variance (N-1 in denominator)
sd(p.hat)      #sample std. deviation (N-1 in denominator)
Problem 1.5.2 (Your Turn)
How does the standard error of p̂ depend on the number of trials N?
Let’s use the simulation method to further study the common-sense estimator of the expected value of the binomial, the sample mean. Suppose in
an experiment, we had ten binomial RVs, each the result of 20 toast flips.
Here is a formal definition of the problem:
$$Y_i \overset{iid}{\sim} \text{Binomial}(p, 20), \quad i = 1, \ldots, 10,$$

$$\bar{Y} = \frac{\sum_i Y_i}{10}.$$
The following code generates 10 replicates from a binomial, each of 20
flips. Here we have defined a custom function called bsms() (bsms stands
for “binomial sample mean sampler”). Try it a few times. This is analogous
to having 10 people each flip 20 coins, then returning the mean number of
heads across people.
#define function
bsms=function(m,n,p)
{
z=rbinom(m,n,p)
mean(z)
}
#call function
bsms(10,20,.7)
Figure 1.6: Relative frequency plot of 10,000 calls to the function bsms(). For this plot, bsms() computed the mean of 10 realizations from binomials with N = 20 and p = .7.
The above code returns a single number as output: the sample mean of 10
binomials. Since the sample mean is an estimator, it has a sampling distribution. The bsms() function returns one realization of the sample mean. If we
are interested in the sampling distribution of the sample mean, we need to
sample it many times and plot the results in a relative frequency histogram.
This can be done by repeatedly calling bsms(). Here is the code for 10,000
replicates of bsms():
M=10000
bsms.realization=1:M   #define the vector bsms.realization
for(m in 1:M) bsms.realization[m]=bsms(10,20,.7)
bsms.props=table(bsms.realization)/M
plot(bsms.props, xlab="Estimate of Expected Value (Sample Mean)",
ylab="Relative Frequency", type='h')
The resulting histogram is shown in Figure 1.6. The new programming
element is the for loop. Within it, function bsms() is called M times, each
result being stored to a different element of bsms.realization. However,
we cannot reference elements in a vector without first reserving space. The
line bsms.realization=1:M defines the vector, and in the process, reserves
space for it.
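The loop-and-preallocate pattern is general, but the same realizations can also be obtained with R's replicate() function; a minimal alternative sketch, assuming M and bsms() are defined as above:

bsms.realization=replicate(M,bsms(10,20,.7))   #M calls to bsms()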
Problem 1.5.3 (Your Turn)
1. What is the expected value of the sample mean of ten binomial
random variables with N = 20 and p = .5? What is the approximate value from the above simulation? Are the values close?
What is the simulation approximation for the standard error?
2. Manipulate the number of trials, N, in each binomial RV through
a few levels: 5 trials, 20 trials, 80 trials. What is the effect on the
sampling distribution of Ȳ ?
3. Manipulate the number of random variables in the sample mean
through a few levels: e.g., a mean of 4, 10, or 50 binomials. What
is the effect on the sampling distribution of Ȳ ?
4. What is the effect of raising or lowering the number of replicates M?
1.6 Three Binomial Probability Estimators
Consider the three following estimators for p: p̂0 , p̂1 , and p̂2 .
$$\hat{p}_0 = \frac{Y}{N}, \qquad (1.11)$$

$$\hat{p}_1 = \frac{Y + .5}{N + 1}, \qquad (1.12)$$

$$\hat{p}_2 = \frac{Y + 1}{N + 2}. \qquad (1.13)$$
Let’s use R to examine the properties of these three estimators for 10
flips with p = .7. The following code uses the simulation method. It draws
10,000 replicates from a binomial distribution and computes the value for
each estimator for each replicate.
p=.7
N=10
z=rbinom(10000,N,p)
est.p0=z/N
est.p1=(z+.5)/(N+1)
est.p2=(z+1)/(N+2)
bias.p0=mean(est.p0)-p
rmse.p0=sqrt(mean((est.p0-p)^2))
bias.p1=mean(est.p1)-p
rmse.p1=sqrt(mean((est.p1-p)^2))
bias.p2=mean(est.p2)-p
rmse.p2=sqrt(mean((est.p2-p)^2))
Figure 1.7 shows the sampling distributions for the three estimators.
These sampling distributions tend to be roughly centered around the true
value of the parameter, p = .7. Estimator p̂2 is the least spread out, followed
by pˆ1 and pˆ0 . Bias and efficiency of the estimators are indicated. Although
estimator pˆ0 is unbiased, it is also the least efficient! Figure 1.8 shows bias
and efficiency for all three estimators for the full range of p. The conventional
estimator p̂0 is unbiased for all true values of p, but the other two estimators
are biased for extreme probabilities. None of the estimators are always more
efficient than the others. For intermediate probabilities, estimator p̂2 is most
efficient; for extreme probabilities, estimator p̂0 is most efficient. Typically,
researchers have some idea of what types of probabilities of success to expect
in their experiments. This knowledge can therefore be used to help pick the
best estimator for a particular situation. We recommend p̂1 as a versatile alternative to p̂0 for many applications even though it is not the common-sense
estimator.
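As a small worked example of how the estimators differ, suppose y = 7 successes are observed in N = 10 flips:

y=7
N=10
y/N             #p0-hat: .70
(y+.5)/(N+1)    #p1-hat: about .68
(y+1)/(N+2)     #p2-hat: about .67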
[Three panels, one per estimator, show the sampling distributions plotted against the estimated probability of success: p̂0 (Bias = 0, RMSE = .145), p̂1 (Bias = −.018, RMSE = .133), and p̂2 (Bias = −.033, RMSE = .125).]

Figure 1.7: Sampling distribution of p̂0 , p̂1 , and p̂2 . Bias and root-mean-squared-error (RMSE) are included. This figure depicts the case that there
are N = 10 trials with a p = .7.
Figure 1.8: Bias and root-mean-squared-error (RMSE) for the three estimators as a function of true probability. Solid, dashed, and dashed-dotted lines
denote the characteristics of p̂0 , p̂1 , and p̂2 , respectively.
Problem 1.6.1 (Your Turn)
The estimators of the binomial probability parameter discussed above
all have the form (Y + a)/(N + 2a). We have advocated using the
estimator p̂1 = (Y + .5)/(N + 1), but there are many other possible
estimators corresponding to other choices of a. Examine what happens to the efficiency of
an estimator as a gets large. Why would we choose a = .5 over, say,
a = 20?
Chapter 2
The Likelihood Approach
Throughout this book, we use a general set of techniques for analysis that are
based on the likelihood function. In this chapter we present these techniques
within the context of three examples involving the binomial distribution.
At the end of the chapter, we provide an overview of the theoretical justification for the likelihood approach. In the following chapters, we use this
approach to analyze pertinent models in cognitive and perceptual psychology.
Throughout this book, analysis is based on the following four steps:
1. Define a hierarchy of models.
2. Express the likelihood functions for the models.
3. Find the parameters that maximize the likelihood functions.
4. Compare the values of likelihood to decide which model is best.
2.1 Estimating a Probability
We illustrate the likelihood approach within the context of an example. As
previously discussed, the binomial describes the number of successes in a
set of Bernoulli trials. Let’s consider the toast-flipping experiment discussed
previously. The goal is to estimate the true probability, p, that a piece of
toast lands butter-side down. As before, let N denote the number of pieces
flipped and let random variable Y denote the number of butter-side down
flips and y denote the datum, a realization of Y .
2.1.1 A hierarchy of models
In this example, we define a single model,
Y ∼ Binomial(p, N).
Clearly, there is no hierarchy with a single model; subsequent examples will
include multiple models arranged in a hierarchy.
2.1.2 Express the likelihood function
The next step in a likelihood analysis is to express the likelihood function.
We first define the function for this example and then give a more general definition. Likelihood functions are closely related to probability mass functions.
For the binomial, the probability mass function, P r(Y = y), is
$$f(y; p) = \begin{cases} \binom{N}{y} p^y (1-p)^{N-y} & y = 0, \ldots, N \\ 0 & \text{otherwise} \end{cases} \qquad (2.1)$$
We can rewrite the probability mass function as a function of the parameter given realization y:
$$L(p; y) = \binom{N}{y} p^y (1-p)^{N-y} \qquad (2.2)$$
The right-hand side of the equation is the same as the probability mass
function; the difference is on the left-hand side. Here, we have switched the
arguments to reflect the fact that the likelihood function, denoted by L, is a
function of the parameter p.
Definition 26 (Likelihood Function) For a discrete random
variable, the likelihood function is the probability mass function
expressed as a function of the parameters.
Let’s examine the likelihood function for the binomial in R. First, we
define a function called likelihood().
likelihood=function(p,y,N)
return(dbinom(y,N,p))
Now, let’s examine the likelihood, a function of p, for the case in which 5
successes were observed in 10 flips.
p=seq(0,1,.01)
like=likelihood(p,5,10)
plot(p,like,type=’l’)
The seq() function in the first line assigns p to the vector (0, .01, .02, ..., 1).
The second line computes the value of the likelihood for each of these values of p. Figure 2.1 shows the resulting plot (Panel A). The likelihood is
unimodal and is centered over .5. Also shown in the figure are likelihoods
for 50 successes out of 100 flips, 7 successes out of 10 flips, and 70 successes
out of 100 flips. Two patterns are evident. First, the maximum value of the
likelihood is at y/N. Second, the width of the likelihood is a function of N.
The larger N, the smaller the range of parameter values that are likely for
the observation y.
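All four panels of Figure 2.1 can be drawn with a short loop; a minimal sketch, assuming likelihood() is defined as above (the vectors y.obs and N.obs are our own names for the four data sets):

par(mfrow=c(2,2))               #2-by-2 grid of panels
y.obs=c(5,50,7,70)              #successes for panels A-D
N.obs=c(10,100,10,100)          #flips for panels A-D
p=seq(0,1,.01)
for (i in 1:4) plot(p,likelihood(p,y.obs[i],N.obs[i]),type='l',ylab='Likelihood')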
It is a reasonable question to ask what a particular value of likelihood
means. For example, in Panel A, the maximum of the likelihood (about
.25) is far smaller than that in Panel B. In most applications, the actual
value of likelihood is not important. For the binomial, likelihood depends
on the observation, the parameter p and the number of Bernoulli trials, N.
For estimation, it is the shape of the function and the location of the peak
that are important, as discussed subsequently. For model comparison, the
difference in likelihood values among models is important.
Figure 2.1: A plot of likelihood for (A) 5 successes from 10 flips, (B) 50
successes from 100 flips, (C) 7 successes from 10 flips, and (D) 70 successes
from 100 flips
Problem 2.1.1 (Your turn)
The Poisson random variable describes the number of rare events in
an interval. For example, it may be used to model the number of car
accidents during rush hour; the number of earthquakes in a year in a
certain region; or the number of deer in an area of land. The probability
mass function of a Poisson random variable is given by
$$f(y; \lambda) = \begin{cases} \dfrac{\lambda^y e^{-\lambda}}{y!} & y = 0, 1, 2, \ldots; \; \lambda > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (2.3)$$
where λ is the parameter of interest. Using R, draw two probability
mass functions, one for λ = .5, the other for λ = 10. Draw a graph of
the likelihood for y = 4.
2.1.3 Find the parameters that maximize the likelihood
The basic premise of maximum likelihood estimation is given in the following
definition:
Definition 27 (Maximum Likelihood Estimate) A maximum likelihood (ML) estimate is the parameter value that maximizes the likelihood function for a given set of data.
There are two basic methods of deriving maximum likelihood estimators
(MLEs): (1) use calculus to derive solutions, (2) use a numerical method to
find the value of the parameter that maximizes the likelihood. This second
method is easily implemented in R.
The first step in maximizing a likelihood is to use the natural logarithm
of the likelihood (the log likelihood, denoted by l) as the main function of interest. This transformation is helpful whether one uses calculus or numerical
methods. Figure 2.2 shows log likelihood functions for the binomial distribution. To find the log likelihood, one takes the logarithm of the likelihood
function, e.g., for the binomial:
$$\begin{aligned} l(p; y) &= \log[L(p; y)] \\ &= \log\left[\binom{N}{y} p^y (1-p)^{N-y}\right] \\ &= \log\binom{N}{y} + \log(p^y) + \log\left((1-p)^{N-y}\right) \\ &= \log\binom{N}{y} + y \log p + (N - y) \log(1 - p). \end{aligned} \qquad (2.4)$$
Definition 28 (log likelihood Function) The log likelihood
function is the natural logarithm of the likelihood function.
There are two types of methods to find maximum likelihood estimates.
The first is based on calculus and is discussed below. The calculus methods
are limited in their application and many problems must be solved with
numerical methods. Throughout the book, we will use the second type of method,
numerical methods, and their implementation in R.
2.1.4 Calculus Methods to find MLEs
In this section, we briefly show how calculus may be used to solve the MLE
for the binomial example. Calculus is not necessary to understand the vast
majority of the material in this book. We provide this section as a service to
those students with calculus. Those students without it can skip this section
without loss. This section is therefore both advanced and optional.
Our goal is to find the value of p that maximizes the log likelihood function, l(p; y). To do so we take the derivative of l(p; y) and set it equal to
zero. From Eq. (2.4):
$$l(p; y) = \log\binom{N}{y} + y \log p + (N - y) \log(1 - p).$$
Figure 2.2: Plots of log likelihood for (A) 5 successes from 10 flips, (B) 50
successes from 100 flips, (C) 7 successes from 10 flips, and (D) 70 successes
from 100 flips
Step 1: Differentiate l(p; y) with respect to p.

$$\begin{aligned} \frac{\partial l(p; y)}{\partial p} &= \frac{\partial}{\partial p}\left[\log\binom{N}{y} + y \log p + (N - y) \log(1 - p)\right], \\ &= \frac{\partial}{\partial p}\left[\log\binom{N}{y}\right] + \frac{\partial}{\partial p}\left[y \log p\right] + \frac{\partial}{\partial p}\left[(N - y) \log(1 - p)\right], \\ &= 0 + \frac{y}{p} - \frac{N - y}{1 - p}, \\ &= \frac{y}{p} - \frac{N - y}{1 - p}. \end{aligned}$$
Step 2: Set the derivative to zero and solve for p.

$$\begin{aligned} \frac{y}{p} - \frac{N - y}{1 - p} &= 0, \\ \frac{y}{p} &= \frac{N - y}{1 - p}, \\ (1 - p)y &= (N - y)p, \\ y - py &= Np - yp, \\ y &= Np, \\ \hat{p} &= \frac{y}{N}, \end{aligned} \qquad (2.5)$$
where y is the number of observed successes in a particular experiment. For
a binomial, the proportion of successes is the maximum likelihood estimator
of parameter p.
Problem 2.1.2 (Your Turn)
Using calculus methods, derive the maximum likelihood estimator of
parameter λ in the Poisson distribution (see Eq. 2.3).
2.1.5 Numerical Optimization to find MLEs
There are three steps in using R to find MLEs. The first is to define a log
likelihood function; the second is to enter the data; the third is to maximize
the log likelihood function.
• Step 1: Define a log likelihood function for the binomial:
loglike=function(p,y,N)
return(dbinom(y,N,p,log=T))
Note the log=T option in dbinom(). With this option, dbinom() returns the log of the probability mass function.
• Step 2: Enter the data. Suppose we observed five successes on ten flips.
N=10
y=5
• Step 3: Find the maximum likelihood estimate. There are a few different numerical methods implemented in R for optimization. For models
with one parameter, the function optimize() is an appropriate choice
(Brent, 1973). Here is an example.
optimize(loglike,interval=c(0,1),maximum=T,y=y,N=N)
The first argument is the function to be maximized; the second argument is the interval on which the parameter may range, the third
argument indicates that the function is to be maximized (the other
alternative is that it be minimized); the other arguments are passed to
the function loglike(). Here is the output from R.
$maximum
[1] 0.5
$objective
[1] -1.402043
The data were 5 successes in 10 flips. The maximum is what we expect,
p̂ = .5. The objective is the value of the log likelihood function at the
maximum.
Problem 2.1.3 (Your Turn)
Use optimize() to find the ML estimate of λ of a Poisson distribution
for y = 4 counts.
2.1.6 Select the best model
In this case, there is one model and it is best by default. In all subsequent
examples, we will have more than one model to choose from.
2.2 Is buttered toast fair?
Coins are considered fair; that is, the probability of the coin landing on either
side is p = .5. In our exploration of Murphy’s Law, we can ask whether
buttered toast is fair. We define two models: one, the general model, has p
free to be any value. The second model, the restricted model, has p fixed at
.5. The restricted model instantiates the proposition that buttered toast is
fair.
2.2.1 Define a Hierarchy of models
General Model:      Y ∼ Binomial(p, N).
Restricted Model:   Y ∼ Binomial(.5, N).
It is clear here that there is a hierarchical relationship between the general
and restricted model. The restricted model may be obtained by limiting the
parameter value of the general model. In this sense, the restricted model is
nested within the general one.
Definition 29 (Nested Models) Model B is considered
nested within Model A if there exists a restriction on the
parameters of Model A that yields Model B.
2.2.2 Express the likelihoods
The likelihood functions for the general model (denoted by Lg ) and restricted
model (denoted by Lr ) are:
$$L_g(p; y) = \binom{N}{y} p^y (1-p)^{N-y} \qquad (2.6)$$

$$L_r(.5; y) = \binom{N}{y} .5^y .5^{N-y} \qquad (2.7)$$
Note that the likelihood for the restriction is not quite a function of parameters. In all other applications, the likelihoods will be proper functions of
parameters.
2.2.3 Maximize the likelihoods
We can use the calculus-based results we found above to maximize the likelihood functions. Estimates are:
$$\hat{p}_g = \frac{y}{N}, \qquad \hat{p}_r = .5,$$
where y is the number of observed successes.
2.2.4 Select the best model
Model selection is done through comparing maximized likelihoods of nested
models. Let’s suppose we observed 13 successes out of 20 trials. Is this
number of successes extreme enough to reject the restricted model in favor
of the general model? Using R, we can find the following log likelihood values
for 13 successes of 20 trials.
Figure 2.3: Maximized log likelihoods as a function of the number of successes
for the general and restricted models.
N=20
y=13
mle.general=y/N
mle.restricted= .5 #by definition
log.like.general=dbinom(y,N,mle.general,log=T)
log.like.restricted=dbinom(y,N,mle.restricted,log=T)
log.like.general
[1] -1.690642
log.like.restricted
[1] -2.604652
The log likelihood is greater for the general model than the restricted
model (-1.69 vs. -2.60). This fact, in itself, is not informative. Figure 2.3
shows maximized log likelihood for the restricted and general models for all
possible outcomes. Maximized likelihood for the general model is as great or
greater than that of the restricted model for all outcomes. This trend always
holds: general models always have maximized log likelihoods at least as high as their nested
restrictions.
This leads to the question of whether a restricted model is ever appropriate. The answer is overwhelmingly positive. The restricted model, if it is
true, is a simpler and more parsimonious account of flipping toast. Having
evidence for its plausibility or implausibility represents the accumulation of
knowledge.
We will fail to reject the restricted model if the difference in log likelihood
between the restricted and the general model is not too great. If there is
a great difference, however, then we may reject the restricted model. For
example, if the number of successes is 4, there is a large difference between the
general model and the restricted model, indicating that the restriction is ill-suited. The likelihood ratio statistic (G2 ) is used to formally assess whether
the difference in log likelihood is sufficiently large to reject a restricted model
in favor of a more general one.
Definition 30 (Likelihood ratio statistic) Consider likelihood functions for a general and restricted (properly nested)
model, Lg and Lr with parameters pg and pr , respectively. Then

$$G^2 = -2 \log\left(\frac{L_r(\hat{p}_r; y)}{L_g(\hat{p}_g; y)}\right)$$
Statistic G2 is conveniently expressed with log likelihoods. In this case:
$$G^2 = -2[l_r(\hat{p}_r; y) - l_g(\hat{p}_g; y)], \qquad (2.8)$$
where lg and lr are the respective log likelihood functions.
If the restricted model holds, then G2 follows a chi-square distribution.
The chi-square distribution has a single parameter, the degrees of freedom
(df). For a likelihood ratio test of nested models, the degrees of freedom is
the difference in the number of parameters between the general and restricted
model. In this case, the general model has one parameter while the restricted
model has none; hence, the difference is 1. The critical .05 value of a chi-square distribution with 1 degree of freedom is about 3.84. Hence for our
case, if G2 < 3.84 then we fail to reject the restriction; otherwise, we reject
the restriction.
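The critical value itself can be obtained in R from the chi-square quantile function; a quick check:

qchisq(.95,1)   #critical .05 value for 1 degree of freedom; about 3.84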
Here is G2 for 13 successes of 20 trials:
Figure 2.4: Likelihood ratio statistics (G2 ) as a function of outcome. The
horizontal line denotes the critical value of 3.84. We fail to reject the restricted model (fair toast) when the number of successes is in between 6 and
14, inclusive. If the number of successes is more extreme, then we can reject
the fair-toast model.
G2=-2*(log.like.restricted-log.like.general)
G2
[1] 1.828022
The value is about 1.83, which is less than 3.84. Hence, for 13 successes in
20 trials, we cannot reject the restriction that buttered toast is fair.
Figure 2.4 shows which outcomes will lead to a rejection of the fair-toast
restriction: all of those outcomes with likelihood ratio statistics above 3.84.
The horizontal line denotes the value of 3.84. From the figure, it is clear that
observing fewer than 6 or more than 14 successes out of twenty flips would
lead to a G2 greater than 3.84 and, therefore, to a rejection of the statement
that toast is fair.
Testing the applicability of a restricted model against a more general one
is a form of null hypothesis testing. In this case, the restricted model serves
as the null hypothesis. When G² is high, the restricted model is implausible, leading us to reject the restriction and accept the more general alternative.
2.3 Comparing Conditions
Suppose we have two different conditions in a memory experiment. For
example, we are investigating the effects of word frequency on recall performance. In our experiment, participants study two types of words: high
and low frequency. High frequency words are those used often in everyday
speech (e.g., horse, table); low frequency words are used rarely (e.g., lynx, ottoman). We wish to know whether recall performance is significantly better
in one condition than the other. Although there are several ways of tackling
this problem, we use the likelihood framework here. Let’s suppose that each
condition consists of a number of trials. On each trial, performance may either be a success (e.g., a word was recalled) or a failure (e.g., a word was not
recalled). We let Yl and Yh be the number of successes in the low-frequency and high-frequency conditions, respectively. Likewise, we let Nl and Nh be the number
of trials in these conditions, respectively. Suppose the data are as follows:
yl = 6 of Nl = 20 low-frequency words were recalled and yh = 7 of Nh = 13
high-frequency words were recalled.
2.3.1 Define a hierarchy of models
General Model:

Yl ∼ Binomial(pl, Nl),    (2.9)
Yh ∼ Binomial(ph, Nh).    (2.10)

Restricted Model:

Yl ∼ Binomial(p, Nl),    (2.11)
Yh ∼ Binomial(p, Nh).    (2.12)
The difference in the above models is in the probability of a success. In
the general model, there are two such probabilities, pl and ph , with one for
each condition. This notation is used to indicate that these two probabilities
need not be the same value. The restricted model has one parameter p,
indicating that the probability of success in both conditions must be the
same. If the restricted model does not fit well when compared to the general
model, then we may conclude that performance depends on word frequency.
The next step is expressing the likelihoods. The likelihood functions for
these models are more complicated than for previous models. The complication comes about because the models involve two random variables, Yl and
Yh . These complications are overcome by introducing a new concept: joint
probability mass function. The probability mass function was introduced earlier; it describes how probability is distributed for a single random variable.
The joint probability mass function describes how probability is distributed
across all combinations of outcomes for two (or more) random variables.
2.3.2 Joint Probability Mass Functions
Definition 31 (Joint Probability Mass Function) Let X and Y be two discrete random variables. Then the joint probability mass function is given by

f(x, y) = Pr(X = x and Y = y).
The following is an example of a joint probability mass function. Suppose
X is a random variable that has equal mass on 1, 2, and 3 and zero mass
otherwise. The probability mass function for X is given by:
f_X(x) = \begin{cases} 1/3 & x = 1 \\ 1/3 & x = 2 \\ 1/3 & x = 3 \\ 0 & \text{otherwise} \end{cases}
Because we will be considering more than one random variable, we subscript probability mass functions (e.g., f_X) for clarity.
Suppose Y = X + 1 with probability .5 and Y = X − 1 with probability .5. The joint probability mass function for X and Y is denoted f_{X,Y} and is given by:


f_{X,Y}(x, y) = \begin{cases}
1/6 & x = 1, y = 0 \\
1/6 & x = 1, y = 2 \\
1/6 & x = 2, y = 1 \\
1/6 & x = 2, y = 3 \\
1/6 & x = 3, y = 2 \\
1/6 & x = 3, y = 4 \\
0 & \text{otherwise}
\end{cases}
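As a concrete illustration, this joint probability mass function can be tabulated in R; a minimal sketch (the object names are ours):

#joint pmf of X and Y as a matrix; rows index x = 1,2,3, columns index y = 0,...,4
f.xy=matrix(0,nrow=3,ncol=5)
rownames(f.xy)=1:3   #values of X
colnames(f.xy)=0:4   #values of Y
f.xy["1","0"]=1/6; f.xy["1","2"]=1/6
f.xy["2","1"]=1/6; f.xy["2","3"]=1/6
f.xy["3","2"]=1/6; f.xy["3","4"]=1/6
sum(f.xy)       #total mass is 1
rowSums(f.xy)   #marginal pmf of X: 1/3, 1/3, 1/3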
The likelihood of the parameters is the joint probability mass function
expressed as a function of the parameters. For the general model:
L(pl , ph ; l, h) = fYl ,Yh (l, h; pl , ph ).
(2.13)
We introduced independence in the previous chapter; if two random variables are independent, then the outcome of one does not influence the outcome of the other. Under this condition, the joint probability mass function is the product of the individual mass functions. These individual probability mass functions are called marginal probability mass functions.
Definition 32 (Independent Random Variables) Let X and Y be random variables with probability mass functions f_X and f_Y, respectively. If these random variables are independent, then

f_{X,Y}(x, y) = f_X(x) f_Y(y).
Is independence justified for either the general or restricted model? It
would seem, at first consideration, that independence cannot be used, especially for the restricted model in which both random variables are functions
of the same parameter p. If p is high, both Yl and Yh tend to be high in value;
if p is low, both Yl and Yh tend to be low in value. Hence, knowledge of Yl
would imply knowledge of p which would imply knowledge of Yh , seemingly
violating independence.
This reasoning, fortunately, is not fully accurate. Models, such as the restricted model, may specify dependencies among random variables through common parameters. The critical issue is whether there are dependencies outside this specification. In other words, given a common value of
p, does knowing the value of Yh provide any more information about Yl ?
Implicit in the definition of a model is that there are no other dependencies. Therefore, the likelihood for a model may be constructed by appealing to independence. For the general model,

f_{Y_l,Y_h}(l, h; p_l, p_h) = f_{Y_l}(l; p_l, p_h) \times f_{Y_h}(h; p_l, p_h).    (2.14)

For the restricted model,

f_{Y_l,Y_h}(l, h; p) = f_{Y_l}(l; p) \times f_{Y_h}(h; p).    (2.15)
2.3.3 Expressing the Likelihoods
With this digression on conditional independence finished, we return to the
problem of expressing the likelihood for the models. For the general model,
we start with Eq. (2.14). Because, for the general model, Yh does not depend on pl and Yl does not depend on ph, the joint likelihood can be simplified:

f_{Y_l,Y_h}(l, h; p_l, p_h) = f_{Y_l}(l; p_l) \times f_{Y_h}(h; p_h).
Substituting the probability mass function of a binomial yields:
f_{Y_l,Y_h}(l, h; p_l, p_h) = \begin{cases} \binom{N_l}{l} p_l^{\,l} (1-p_l)^{N_l-l} \binom{N_h}{h} p_h^{\,h} (1-p_h)^{N_h-h} & l = 0, \ldots, N_l;\ h = 0, \ldots, N_h \\ 0 & \text{otherwise} \end{cases}
The likelihood function for pl and ph in the general model is therefore
L_G(p_l, p_h; y_l, y_h) = \binom{N_l}{y_l} p_l^{\,y_l} (1-p_l)^{N_l-y_l} \binom{N_h}{y_h} p_h^{\,y_h} (1-p_h)^{N_h-y_h}    (2.16)
The restricted model has a single parameter p. The likelihood function
is obtained by setting ph = p and pl = p:
L_R(p; y_l, y_h) = \binom{N_l}{y_l} p^{y_l} (1-p)^{N_l-y_l} \binom{N_h}{y_h} p^{y_h} (1-p)^{N_h-y_h}    (2.17)
The log likelihood functions are given by:

l_G(p_l, p_h; y_l, y_h) = \log\binom{N_l}{y_l} + \log\binom{N_h}{y_h} + y_l \log(p_l) + (N_l - y_l)\log(1-p_l) + y_h \log(p_h) + (N_h - y_h)\log(1-p_h),    (2.18)

l_R(p; y_l, y_h) = \log\binom{N_l}{y_l} + \log\binom{N_h}{y_h} + y_l \log(p) + (N_l - y_l)\log(1-p) + y_h \log(p) + (N_h - y_h)\log(1-p).    (2.19)
Eq. (2.19) can be simplified further:

l_R(p; y_l, y_h) = \left[\log\binom{N_l}{y_l} + \log\binom{N_h}{y_h}\right] + (y_l + y_h)\log(p) + [(N_l + N_h) - (y_l + y_h)]\log(1-p).    (2.20)
2.3.4 Maximize the Likelihoods
General Model
There are a few approaches we can take to maximizing likelihoods. The most
intuitive is to note that Yh and Yl are independent samples with no common
parameters. In this case, we can maximize them separately. So, we use our
standard binomial MLE result:
\hat{p}_l = \frac{y_l}{N_l},    (2.21)

\hat{p}_h = \frac{y_h}{N_h}.    (2.22)
Applying these to the data yields p̂l = 6/20 = .3 and p̂h = 7/13 = .538. If we fail to see this intuitive approach, we can use R to maximize
likelihood numerically as follows.
The log likelihood function in Eq. (2.18) may be maximized as written,
but it is not optimal. One property of likelihood is that the maximum and the
shape do not change when the function is multiplied by a constant. Likewise,
the maximum of log likelihood does not change with addition of a constant.
This invariance is shown in Figure 2.5. Because of this property, likelihood
is often defined only up to a constant of multiplication. Therefore, if l is a
log likelihood, then l + c is also a log likelihood, where c is a constant that
does not involve the parameters.
This fact improves the speed and precision of finding the parameters that maximize log likelihood for the general model. The terms \log\binom{N_l}{y_l} and \log\binom{N_h}{y_h} do not depend on the parameters. An equally valid log likelihood for l_G is:

l_G = y_l \log(p_l) + (N_l - y_l)\log(1-p_l) + y_h \log(p_h) + (N_h - y_h)\log(1-p_h).
Precision and speed are gained by not having to evaluate the constant terms
involving the log of the choose function.
The code below uses numerical methods to maximize the log likelihood
function for the general model. In the code, pL, pH, yL, yH, NL, and
NH correspond to pl , ph , yl , yh , Nl , and Nh respectively. The function
nll.general() returns the negative of the log likelihood. The reason for
this choice is discussed below.
Figure 2.5: The maximum of the log likelihood function does not change with the addition of a constant. Panel A shows the log likelihood function for a binomial with N = 100 and 70 observed successes. The vertical line denotes the maximum at .7. Panel B shows the log likelihood with the constant \log\binom{N}{y} subtracted. The function has the same shape and maximum as the function in Panel A.
# negative log likelihood of the general model
nll.general=function(par,y,N)
{
pL=par[1]
pH=par[2]
yL=y[1]
yH=y[2]
NL=N[1]
NH=N[2]
ll=yL*log(pL)+(NL-yL)*log(1-pL)+yH*log(pH)+(NH-yH)*log(1-pH)
return(-ll)
}
Function nll.general() is a function of two parameters. The function optimize() in R finds the minimum (or, optionally, the maximum) of a function of one parameter. The function optim() is used when there is more than one parameter. By default, optim() minimizes rather than maximizes functions. This is the reason we use the negative log likelihood: the parameter values that minimize the negative log likelihood maximize the log likelihood.
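As an aside, optim() can also maximize directly through its control argument, so negating the log likelihood is a convention rather than a necessity. A minimal sketch, reusing nll.general() from above:

#equivalent approach: maximize the log likelihood itself via fnscale=-1
ll.general=function(par,y,N) -nll.general(par,y,N)
optim(c(.5,.5),ll.general,y=c(6,7),N=c(20,13),control=list(fnscale=-1))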
par=c(.5,.5) #starting values
N=c(20,13) #number of trials in each condition
y=c(6,7) #number of successes
#use maximization of two or more variables
optim(par,nll.general,y=y,N=N)
Here is the output from the optim()¹ call above:
$par
[1] 0.2999797 0.5384823
$value
[1] 21.1897
$counts
function gradient
      69       NA
$convergence
[1] 0
$message
NULL
The numerically-maximized parameter values match those from the intuitive
method above.
Function optim() can use a few different algorithms; the default is based
on the simplex algorithm of Nelder & Mead (1965) which, while slow, is
often successful in finding local minima (Press, Flannery, Teukolsky, and
Vetterling, 1992). We discuss minimization in more depth in Chapter 5.
Restricted Model
The restricted model can also be solved intuitively. Under the model, performance in the low- and high-frequency conditions reflects a single Bernoulli
process. Hence, we can pool the data across conditions. Pooling the number
¹For more information about what each of these values means, try ?optim in R to get help on optim().
of successes across both conditions yields y = 6 + 7 = 13. Doing the same
for the number of trials yields N = 20 + 13 = 33. This pooling is evident in
Eq. (2.20); when we group terms, the restricted log likelihood is the log likelihood function of a binomial with N = Nl +Nh trials and y = yl +yh successes.
If we fail to see this intuitive approach, R may be used to solve the problem numerically. Because the restricted model has a single parameter, it is best to use optimize() instead of optim(). The function nll.restricted() returns the negative log likelihood for the restricted model:
#negative log likelihood of restricted model
nll.restricted=function(p,y,N)
{
yL=y[1]
yH=y[2]
NL=N[1]
NH=N[2]
ll=yL*log(p)+(NL-yL)*log(1-p)+yH*log(p)+(NH-yH)*log(1-p)
return(-ll)
}
N=c(20,13)
y=c(6,7)
optimize(nll.restricted,interval=c(0,1),y=y,N=N,maximum=F)
Here is the output from the optimize() call above:
$minimum
[1] 0.393931
$objective
[1] 22.12576
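As a quick check, this value agrees with the intuitive pooled-data estimate; a minimal sketch:

#pooled estimate for the restricted model
(6+7)/(20+13)   #about .394, matching $minimum above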
2.3.5 Select the best model
We use the likelihood ratio statistic, G2 , to select the best model. To do so
in R, we first assign the output of the numerical minimization to variables:
gen=optim(par,nll.general,y=y,N=N)
res=optimize(nll.restricted,interval=c(0,1),y=y,N=N,maximum=F)
The results can be seen by typing gen or res. Unfortunately, outputs of
optim() and optimize() are not consistent. The value of the minimized
function is called value in optim() and objective in optimize(). The
value of log likelihood is given by:
gen.ll=-gen$value
res.ll=-res$objective
Statistic G² is obtained by G2=-2*(res.ll-gen.ll); for the example, G² is about 1.87. In this case we retain the restricted model because G² < 3.84.
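Readers who prefer a p value can obtain the chi-square tail probability directly; a minimal sketch, continuing from the objects above:

G2=-2*(res.ll-gen.ll)
1-pchisq(G2,df=1)   #greater than .05, so the restriction is retained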
Problem 2.3.1 (Your turn)
Let’s revisit the Poisson model. You are a traffic engineer and there
have been too many accidents at rush hour at one particular intersection. To see if the accidents are caused by excessive speed, you place a
police car 100 meters before the intersection. The day before you place
the police car, there are 12 accidents; the day after, 4 accidents. Use a
likelihood ratio test to determine if placing the police car was effective
in lowering the accident rate λ.
2.4 Standard Errors for Maximum Likelihood Estimators
Maximum likelihood estimators are random variables. Consequently, they have sampling distributions and standard errors. We provide here a brief description of how to derive standard errors for maximum likelihood estimates, presenting the implementation without its theoretical underpinning. Unfortunately, the mathematics of this underpinning is beyond the scope of this book. A formal treatment may be found in several standard advanced statistics texts, including Lehmann (1991).
The standard error of a maximum likelihood estimator is related to the
curvature of the log likelihood function at the maximum. In Figure 2.2, the
likelihood function in Panel B has more curvature at the maximum than
that in Panel A. This curvature reflects how much variability there is in
the sampling distribution of the MLE with more curvature corresponding to
smaller variability. Not surprisingly, log likelihood curvature and variability
are both functions of the number of observations.
To compute an index of curvature at the maximum in R, we add the
option hessian=TRUE to the optim() call. The following code returns an
index of curvature:
par=c(.5,.5)
gen=optim(par,nll.general,y=y,N=N,hessian=TRUE)
Type gen and notice the additional field $hessian, which is a matrix. The
diagonal elements of its inverse are of interest, and standard errors for the parameters are
calculated by genSE=sqrt(diag(solve(gen$hessian))). The elements of
the vector genSE correspond to the standard errors for the parameters in the
vector par.
Many researchers report standard errors along with estimates. Standard
errors, in our opinion, give a rough guide to the amount of variability in
the estimate as well as calibrate the eye as to the magnitude of significant
effects (Rouder & Morey, 2005). We discuss the related concept of confidence
intervals in Chapter 7.
The following code plots standard errors in R. For convenience, we define
a new function errbar(), which we can use for adding error bars to any plot.
errbar=function(x,y,height,width,lty=1){
arrows(x,y,x,y+height,angle=90,length=width,lty=lty)
arrows(x,y,x,y-height,angle=90,length=width,lty=lty)
}
We plot the estimates in bar plot format:
xpos=barplot(gen$par,names.arg=c(’Low Frequency’,
’High Frequency’),col=’white’,ylim=c(0,1),space=.5,
ylab=’Probability Estimate’)
The x-axis in a bar plot is labeled by condition. R assigns numeric values to the x-axis so that you may add lines, annotations, or error bars. These values are stored in variable xpos. To add error bars, use errbar(xpos,gen$par,genSE,.3).
Figure 2.6: Standard errors and confidence intervals on maximum likelihood
estimates from the word frequency example. Bars represent point estimates,
and error bars represent standard errors of the estimator.
The last value is simply the width of the error bar. The resulting bar plot with standard error bars is shown in Figure 2.6. The overlap of the error bars indicates that any effect of word frequency is quite small given the variability of the estimates. The likelihood ratio test confirms the lack of statistical significance.
Problem 2.4.1 (Your Turn)
Consider the previous problem with the Poisson distribution model of
traffic accidents. Plot your estimates of λ for the general model with
standard error bars.
2.5 Why Maximum Likelihood?
The likelihood approach is used throughout this book for analysis. In this
section, we provide an overview of the theoretical justification for this choice.
There are several other good alternatives to maximum likelihood including
minimizing the mean squared error between model predictions and observed
data. Likelihood has the following advantages under mild technical conditions²:
1. Maximum likelihood estimates, while often biased in small sample sizes,
are consistent; i.e., in the limit of a large number of samples, the estimates of a parameter converge to its true value.
2. Maximum likelihood estimates are asymptotically maximally efficient—
in the limit of a large number of samples, no other estimate has a
smaller error. This lower bound is known as the Cramér-Rao lower
limit in the statistics literature, and it can be shown that maximum
likelihood estimators approach the limit with large sample sizes.
3. Maximum likelihood estimates are asymptotically normally distributed.
As the sample size becomes larger, the sampling distribution converges
to a normal distribution. This licenses the use of normal-model based
statistics in analyzing maximum likelihood estimates. A discussion of
normal-model based statistics is provided in Chapter 7.
4. The likelihood approach is tractable for simple, nonlinear models. These models are more realistic accounts of psychological processes than standard normal-based models. Because they are more realistic, they provide more detailed tests of psychological theory than standard ANOVA or regression models. We join others who recommend likelihood as a viable alternative for the types of models well-suited for cognitive and perceptual psychology (see also Glover & Dixon, 2004; Myung, 2001).
5. Likelihood is a stepping stone in understanding Bayesian techniques.
We have argued that these techniques will become increasingly valuable
in psychological contexts (e.g., Rouder & Lu, in press; Rouder, Lu,
Speckman, Sun, and Jiang, in press; Rouder, Sun, Speckman, Lu &
Dzhou, 2003). Knowledge gained in this book about analysis with
likelihoods transfers to the Bayesian framework more so than that of
other estimation methods.
²For these advantages to hold, the model must be regular. Regularity is a set of conditions that guarantee that the likelihood function is sufficiently smooth. A formal definition may be found in Lehmann (1991).
Given these advantages, it is worth considering the drawbacks of maximum likelihood. There are three related drawbacks:
1. Maximum likelihood estimates are often biased in finite samples. Researchers may simply feel uncomfortable with biased estimators. Consequently, there is much development in maximally efficient unbiased
estimators. We feel comfortable recommending ML and are not overly
concerned with bias. We do recommend that researchers understand
the magnitude of bias for their application; the simulation method is useful for this purpose.
2. Although maximum likelihood estimates have excellent asymptotic properties, they may not be optimally efficient for small sample sizes. Sometimes, other methods are just as good asymptotically and better in finite samples (see Heathcote, Brown, and Mewhort, 2002, Brown & Heathcote, 2004, and Speckman & Rouder, 2004, for an exchange about a different estimation method that appears to outperform likelihood in finite samples).
3. Likelihood tests of composite hypotheses are not necessarily the most
powerful. Composite hypotheses are the type that we consider in this
book and correspond to the case in which one model is a restriction of
another. For some model classes, there are more powerful tests (a good
example is the t-test). Of course, developing most-powerful tests for
nonlinear models is a difficult problem for many applications whereas
the application of the likelihood method is often straightforward and
tractable.
In sum, the likelihood method is not necessarily ideal for every situation,
but it is straightforward and tractable with many nonlinear models. With
it, psychologists can test theories to a greater level of detail than is possible
with standard linear models.
More advanced, in-depth treatments of maximum likelihood techniques
can be found in mathematical statistics texts. While these in-depth treatments do require some calculus knowledge, the advanced student can benefit from learning about properties of maximum likelihood estimators. Some
good texts to consider for further reading are Hogg & Craig (1978), Lehmann
(1991), and Rice (1998).
Chapter 3
The High-Threshold Model
Experimental psychologists learn about cognition by measuring how people
react to various stimuli. In many cases, these reactions indicate how well
people process stimuli; e.g., how well people can detect a faint tone or how
well they can remember words. In most investigations, it is essential to have
a measure of how well people perform on a task.
3.1 The Signal-Detection Experiment
Consider the problem of assessing how well a participant can hear a tone
of a given frequency and volume when it is embedded in noise. A simple
experiment may consist of two types of trials: one in which the target tone
is presented embedded in noise, and a second in which the noise is presented
without the tone. Trials with an embedded tone are called signal trials; trials
without the tone are called noise trials. Participants listen to a sequence of
signal and noise trials in which both types are intermixed. Their task is
to indicate whether the target tone is present or absent. Let’s consider an
experiment in which the experimenter presents 100 signal trials and 50 noise
trials. Table 3.1 shows a sample set of data.
Experiments with data in the form of Table 3.1 are called signal-detection
experiments. Psychologists express the results of such experiments in terms
of four events:
1. Hit: Participant responds “tone present” on a signal trial.
                 Response
Stimulus   Tone Present   Tone Absent   Total
Signal          75             25        100
Noise           30             20         50
Total          105             45        150

Table 3.1: Sample data for a signal-detection experiment.
2. Miss: Participant responds “tone absent” on a signal trial.
3. False Alarm: Participant responds “tone present” on a noise trial.
4. Correct Rejection: Participant responds “tone absent” on a noise trial.
Hit and correct rejection events are correct responses while false alarm and
miss events are error responses.
Signal detection experiments are used in many domains besides the detection of tones. One prominent example is in the study of memory. In
many memory experiments, participants study a set of items, and then at
a later time, are tested on them. In the recognition memory paradigm both
previously studied items and unstudied novel items are presented at the test
phase. The participant indicates whether the item was previously studied or
is novel. In this case, a studied item is analogous to the signal stimulus and
the novel item is analogous to a noise stimulus. Consequently, the miss error
occurs when a participant is presented a studied item and indicates that it is
novel. The false alarm error occurs when a participant is presented a novel
item and indicates it was studied.
It is reasonable to ask why there are two different types of correct and
error events. Why not just measure overall accuracy? One of the most
dramatic examples of the importance of differentiating the errors comes from
the repressed-memory literature. The controversy stems from the question of
whether it is possible to recall events that did not happen, especially those
regarding sexual abuse. According to some, child sexual abuse is such a
shocking event that memory for it may be repressed (Herman & Schatzow,
1987). This repressed memory may then be “recovered” at some point later
in life. The memory, when repressed, is a miss; when recovered, it is a
hit. Other researchers question the veracity of these recovered memories,
claiming it is doubtful that a memory of sexual abuse can be repressed and
then recovered (Loftus, 1993). The counter claim is that the sexual abuse
may not have occurred. The “recovered memory” is actually a false alarm.
In this case, differentiating between misses and false alarms is critical in
understanding how to evaluate claims of recovered memories.
The results of a signal detection experiment are commonly expressed as
the following rates:
• Hit Rate: The proportion of tone-present responses on signal trials. The hit rate in Table 3.1 is .75.
• Miss Rate: The proportion of tone-absent responses on signal trials. The miss rate in Table 3.1 is .25.
• False-Alarm Rate: The proportion of tone-present responses on noise
trials. The false-alarm rate in Table 3.1 is .6.
• Correct-Rejection Rate: The proportion of tone-absent responses on
noise trials. The correct-rejection rate in Table 3.1 is .4.
To perform analysis on data from signal-detection experiments, let’s use
random variables to denote counts of events. Let RVs Yh , Ym , Yf , and Yc
denote the number of hits, misses, false alarms, and correct rejections, respectively. For example, Yh denotes the number of hits. Data, such as the
entries in Table 3.1, are denoted with yh , ym , yf , and yc respectively. Let
Ns and Nn refer to the number of signal and noise trials, respectively. The
hit rate is yh /Ns ; the miss rate is ym /Ns ; the false-alarm rate is yf /Nn ; the
correct-rejection rate is yc /Nn .
Each signal trial results in either a hit or miss event; likewise, each noise trial results in either a false-alarm or correct-rejection event. Hence,

Ns = yh + ym,    Nn = yf + yc.

Because Ns and Nn are known rather than estimated, it is only necessary to record the numbers of hits and false alarms. From these numbers, the numbers of misses and correct rejections can be calculated. Therefore, there are only two independent pieces of data in the signal detection experiment.
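The four rates for Table 3.1 can be computed in a few lines of R; a minimal sketch (the object names are ours):

yh=75; ym=25; yf=30; yc=20                      #counts from Table 3.1
Ns=yh+ym; Nn=yf+yc                              #numbers of signal and noise trials
c(hit=yh/Ns, miss=ym/Ns, fa=yf/Nn, cr=yc/Nn)    #.75, .25, .60, .40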
3.1.1 A Simple Binomial Model of Hits and False Alarms
The simplest model of the signal-detection experiment is given as

Yh ∼ B(ph, Ns),    (3.1)
Yf ∼ B(pf, Nn),    (3.2)
where ph and pf refer to the true probabilities of hits and false alarms, respectively. The other two probabilities, the probability of a miss and the probability of a correct rejection are denoted pm and pc , respectively. The outcome
of a signal trial may only be a hit or a miss. Consequently, ph + pm = 1.
Likewise, pf + pc = 1. Hence, there are only two free parameters (ph , pf ) of
concern. Once these two are estimated, estimates of (pm , pc ) can be obtained
by subtraction.
The model may be analyzed by treating each component independently. Hence, by the results with the binomial distribution, maximum likelihood estimates are given by

p̂h = yh/Ns,
p̂f = yf/Nn.

The terms p̂h and p̂f are the hit and false alarm rates, respectively.
3.2 The High-Threshold Model
The problem with the binomial model above is that it yields two different
measures of performance: the hit rate and the correct-rejection rate. In most
applications, researchers are interested in a single measure of performance.
The high-threshold model provides this. It posits that perception is all-ornone. A participant is either in one of two mental states on a signal trial.
They either have detected the target tone, with probability d, or failed to do
so, with probability 1 − d. When participants fail to detect the tone, they
still may guess that it had been presented with probability g. Figure 3.1
provides a graphical representation of the model. There are two ways a hit event can occur: the first is through successful detection. A hit may also occur when detection fails yet the participant still guesses that the stimulus is present. This route to a hit occurs with probability (1 − d)g. Summing these yields the probability of a hit: ph = d + (1 − d)g. For noise trials, there is no target
Figure 3.1: The high-threshold model.
to detect. False alarms are produced only by guessing: pf = g. Substituting
these relations into the binomial model on hits and false alarms (Eqs. 3.1 &
3.2) yields:
Yh ∼ Binomial(d + (1 − d)g, Ns),    (3.3)
Yf ∼ Binomial(g, Nn),    (3.4)
where 0 < d, g < 1. The goal is to estimate parameters d and g. We use the
four-step likelihood approach. The first step, defining a hierarchy of models
has been done. The above model is the only one.
3.2.1 Express the likelihood
The likelihood is derived from the joint probability mass function f (yh , yf ).
As discussed in Chapter 2, given common parameters, we treat yh and yf as
independent. Therefore the joint probability mass function may be obtained
by multiplying the marginal probability mass functions. These marginals are
given by:
f(y_h; d, g) = \binom{N_s}{y_h} (d + (1-d)g)^{y_h} (1 - (d + (1-d)g))^{N_s - y_h},

and

f(y_f; d, g) = \binom{N_n}{y_f} g^{y_f} (1-g)^{N_n - y_f}.
These equations may be simplified by substituting y_m for N_s − y_h, y_c for N_n − y_f, and (1 − d)(1 − g) for 1 − (d + (1 − d)g). Making these substitutions and multiplying these marginal probability mass functions yields

f(y_h, y_f; d, g) = \binom{N_s}{y_h} (d + (1-d)g)^{y_h} ((1-d)(1-g))^{y_m} \times \binom{N_n}{y_f} g^{y_f} (1-g)^{y_c}.
The log likelihood may be obtained by rewriting this equation as a function of d and g and then taking logarithms:

l(d, g; y_h, y_f) = \log\binom{N_s}{y_h} + \log\binom{N_n}{y_f} + y_h \log(d + (1-d)g) + y_m \log((1-d)(1-g)) + y_f \log(g) + y_c \log(1-g).
Some of the terms are not functions of the parameters and may be omitted:

l(d, g; y_h, y_f) = y_h \log(d + (1-d)g) + y_m \log((1-d)(1-g)) + y_f \log(g) + y_c \log(1-g).
3.2.2 Maximize the Likelihood
Either calculus methods or numerical methods may be used to provide maximum likelihood estimates. The calculus methods provide the following solutions:
\hat{d} = \frac{y_h/N_s - y_f/N_n}{1 - y_f/N_n},    (3.5)

\hat{g} = y_f/N_n.    (3.6)

Typically, these are written in terms of the hit and false alarm rates, p̂h and p̂f, respectively:

\hat{d} = \frac{\hat{p}_h - \hat{p}_f}{1 - \hat{p}_f},

\hat{g} = \hat{p}_f.
These equations are used to generate estimates from the sample data in
Table 3.1:
\hat{d} = \frac{75/100 - 30/50}{1 - 30/50} = .375,

\hat{g} = y_f/N_n = .6.
The following is an implementation in R. We present it for two reasons:
(1) the programming techniques implemented here are useful in analyzing
subsequent models; and (2) estimates of standard errors are readily obtained
in the numerical approach. In the program y is a vector of data. It has
four elements, (yh , ym , yf , yc ) and may be assigned values in Table 3.1 with
y=c(100,50,30,20). Vector par is the vector of parameters (d, g)
The first step is to compute the negative log likelihood:
#negative log likelihood of high-threshold model
nll.ht=function(par,y)
{
d=par[1]
g=par[2]
ll=y[1]*log(d+(1-d)*g)+y[2]*log((1-d)*(1-g))+
y[3]*log(g)+y[4]*log(1-g)
return(-ll)
}
Although the above code is valid, we may rewrite it to make it easier to
read and modify. The change is motivated by noting that the log likelihood
can be put in following form:
l = C + yh log(ph ) + ym log(pm ) + yf log(pf ) + yc log(pc ),
where C is the log of the choose terms that do not depend on parameters.
This log likelihood can be rewritten as
l = \sum_i y_i \log(p_i),
where i ranges over the four events. Let p denote a vector of probabilities
(ph , pm , pf , pc ). The function is rewritten as:
#negative log likelihood of high-threshold model
nll.ht=function(par,y)
{
d=par[1]
g=par[2]
p=1:4 # reserve space
p[1]=d+(1-d)*g #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=g # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
The next step is to maximize the function:
y=c(75,25,30,20)
par=c(.5,.5)
optim(par,nll.ht,y=y)
Execution of this code yields estimates of d̂ = .3751 and ĝ = .5999, which are acceptably close to the closed-form answers of d̂ = .375 and ĝ = .6.
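Because the numerical route was motivated partly by the ease of obtaining standard errors, here is a minimal sketch that adds them, following the Hessian-based method of Chapter 2:

#standard errors for d and g via the Hessian of the negative log likelihood
y=c(75,25,30,20)
ht=optim(c(.5,.5),nll.ht,y=y,hessian=TRUE)
ht$par                          #estimates of d and g
sqrt(diag(solve(ht$hessian)))   #standard errors for d and g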
3.3 Selective Influence in the High Threshold Model
Although the high-threshold model provides estimates of performance (d̂), it does so by assuming that perception is all-or-none. Is this aspect correct?
One way of testing it is to perform a selective influence experiment. In
a selective influence experiment of the high-threshold model, the researcher
designates a manipulation that should affect one parameter and not the other.
Parameter d is a bottom-up strength parameter. In a tone experiment, it
would reflect factors that determine the strength of perception of the tone
including its volume and frequency. Parameter g is a top-down parameter.
To influence g and not d, a researcher may use differing payoffs. In one
condition, Condition 1, the researcher may pay 10c for each hit and 1c for
each correct rejection. In a second condition, Condition 2, the researcher may
                Frequencies of Responses
               Hit   Miss   False Alarm   Correct Rejection
Condition 1     40    10         30               20
Condition 2     15    35          2               48

Table 3.2: Hypothetical data to test selective influence in the High Threshold model.
pay the reverse (1c for each hit and 10c for each correct rejection). Condition 1 favors tone-present responses; Condition 2 favors tone-absent responses. Parameter g should therefore be higher in Condition 1 than Condition 2.
Parameter d does not reflect these payoffs and should be invariant to the
manipulation. Suppose the experiment was run with 50 signal and 50 noise
trials in each condition. Hypothetical data is given in Table 3.2.
There are two parts to the selective influence test. The first is whether the manipulation affected g as hypothesized. The second is whether the manipulation had no effect on d. This second test is at least as important as the first; the invariance of d, if it occurs, is necessary for support of the model. We follow the four steps in answering this question.
3.3.1 Hierarchy of Models
We form three models. Let di and gi be the sensitivity and guessing rate in the ith condition, where i is 1 or 2. Likewise, let Yh,i and Yf,i denote the number of hits and false alarms in the ith condition, respectively. Let Ns,i and Nn,i denote the number of signal and noise trials in the ith condition, respectively. Model 1, the most general model, is constructed by allowing separate sensitivity and guessing rates.

Model 1:

Yh,i ∼ B(di + (1 − di)gi, Ns,i),    (3.7)
Yf,i ∼ B(gi, Nn,i).    (3.8)

Model 1 has four parameters: d1, g1, d2, g2.
Model 2 is constructed as a restriction on Model 1 by assuming that sensitivity is equal in both conditions: d = d1 = d2.

Model 2:

Yh,i ∼ B(d + (1 − d)gi, Ns,i),    (3.10)
Yf,i ∼ B(gi, Nn,i).    (3.11)

Model 2 has three parameters (d, g1, g2).
Model 3 is the other restriction of Model 1. Although sensitivity may vary across conditions, the guessing rate is assumed to be equal: g = g1 = g2.

Model 3:

Yh,i ∼ B(di + (1 − di)g, Ns,i),    (3.13)
Yf,i ∼ B(g, Nn,i).    (3.14)
The three models form a hierarchy with Model 1 being the most general
and Models 2 and 3 being proper restrictions. This hierarchy allows us to
test the selective influence hypotheses. Accordingly, the expected variation
of g can be tested by comparing Model 3 to Model 1. Likewise, the expected
invariance of d can be tested by comparing Model 2 to Model 1.
3.3.2 Express the Likelihoods
We express the likelihoods within R. Our approach is to specify a general log
likelihood function for any one condition. It takes as input the four hit, miss,
false alarm, and correct rejection counts and two parameters. We will use this
function repeatedly, even when fitting the restricted models. The comments
indicate how par and y should be assigned when calling the function.
#negative log likelihood for the high-threshold model in one condition
#assign par=c(d,g)
#assign y=c(h,m,f,c)
nll.condition=function(par,y)
{
p=1:4
d=par[1]
g=par[2]
p[1]=d+(1-d)*g
p[2]=1-p[1]
p[3]=g
p[4]=1-p[3]
return(-sum(y*log(p)))
}
Given common parameters, data from the different conditions are independent. The joint likelihood across conditions is the product of the likelihoods for each condition, and the joint log likelihood across conditions is the sum of the log likelihoods for each condition. The following function, nll.1(), computes the negative log likelihood for Model 1. It does so by calling the individual-condition log likelihood function nll.condition() twice and adding the results. Because Model 1 specifies different parameters for each condition, each call to nll.condition() has different parameters.
#negative log likelihood for Model 1:
#assign par4=d1,g1,d2,g2
#assign y8=(h1,m1,f1,c1,h2,m2,f2,c2)
nll.1=function(par4,y8)
{
nll.condition(par4[1:2],y8[1:4])+   #condition 1
nll.condition(par4[3:4],y8[5:8])    #condition 2
}
The inputs to the function are the vector of four parameters (d1, g1, d2, g2) and the vector of eight data points from the two conditions.
In Model 2, there is a single detection parameter d. The log likelihood
for this model is evaluated similarly to that in Model 1. The difference is
that when nll.condition() is called for each condition, it is done with a
common detection parameter. The input is the vector of three parameters
and eight data points:
#negative log likelihood for Model 2:
#assign par3=d,g1,g2
#assign y8=(h1,m1,f1,c1,h2,m2,f2,c2)
nll.2=function(par3,y8)
{
nll.condition(par3[1:2],y8[1:4])+
nll.condition(par3[c(1,3)],y8[5:8])
}
The negative log likelihood function for Model 3 is given by
#negative log likelihood for Model 3:
#par3=d1,d2,g
#y8=(h1,m1,f1,c1,h2,m2,f2,c2)
nll.3=function(par3,y8)
{
nll.condition(par3[c(1,3)],y8[1:4])+
nll.condition(par3[2:3],y8[5:8])
}
3.3.3 Maximize the Likelihoods
Maximization may be done with the optim call:
dat=c(40,10,30,20,15,35,2,48)
#Model 1
par=c(.5,.5,.5,.5) #starting values
mod1=optim(par,nll.1,y8=dat,hessian=T)
#Model 2
par=c(.5,.5,.5) #starting values
mod2=optim(par,nll.2,y8=dat,hessian=T)
#Model 3
par=c(.5,.5,.5) #starting values
mod3=optim(par,nll.3,y8=dat,hessian=T)
The above code produces a number of warnings: “NaNs produced in
log(x).” These warnings are inconsequential for this application. They come
about because optim does not know that sensitivity and guess parameters
may only be between 0 and 1. In Section ?? we will discuss a solution to this
problem.
Figure 3.2: Parameter estimates and standard errors from the hierarchy of models. Bars are parameter estimates from the general model, and points are the common estimates from the restricted models.
The output is in variables mod1, mod2, and mod3. There is one element of the analysis that is of concern. The estimate of d̂2 for Model 3, given in mod3, is -.03. This estimate is invalid and we discuss a solution in Section ??. For now the value of d̂2 may be set to 0. Figure 3.2 provides an appropriate
graphical representation of the results. It was constructed with barplot and
errbar as discussed in Chapter 2. The bar plots are from Model 1, the
general model. The point between the two bar-plotted detection estimates
is the common detection estimate from Model 2. The point between the two
bar-plotted guessing estimates is the common guessing estimate from Model
3. From these plots, it would seem that the manipulation certainly affected
g. The case for d is more ambiguous, but it seems plausible that d depends on
the payoff, which would violate selective influence and question the veracity
of the model.
3.3.4 Testing Selective Influence
Although Figure 3.2 is informative, it is no substitute for formal hypothesis
tests. The first part of the selective influence test is whether g depended on
condition. The test is performed with a likelihood ratio test. The value of G²
may be computed in R as 2*(mod3$value-mod1$value). The value is 41.28
(with d̂2 set to 0 the value is 41.37). Under the null hypothesis that g1 = g2, this value should be distributed as a chi-square. As mentioned previously, the degrees of freedom for the test is the difference in the number of parameters in the models, which is 1. The critical value of the chi-square statistic with 1 degree of freedom is 3.84. Hence, Model 3 can be rejected in favor of Model 1. The payoff manipulation did indeed influence g as hypothesized.
The second part of selective influence is the invariance of d. From Figure 3.2, it is evident that there is a large disparity of sensitivity across the conditions (.50 vs. .27). This difference appears relatively large given the standard errors. Yet, the value of G² (2*(mod2$value-mod1$value)) is 1.26, which is less than the critical value of 3.84. Therefore, the invariance cannot be rejected.
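A minimal sketch consolidating the two tests, with p values added via pchisq():

#selective influence tests via likelihood ratio statistics
G2.g=2*(mod3$value-mod1$value)   #test of g1 = g2; about 41.3
G2.d=2*(mod2$value-mod1$value)   #test of d1 = d2; about 1.26
1-pchisq(G2.g,df=1)              #far below .05: reject the restriction on g
1-pchisq(G2.d,df=1)              #above .05: retain the restriction on d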
This latter finding is somewhat surprising given the relatively large size of the effect in Figure 3.2. As a quick check of this obtained invariance, it helps to inspect model predictions. The model predictions for a condition can be obtained by:

\hat{p}_h = \hat{d} + (1 - \hat{d})\hat{g},    (3.15)
\hat{p}_f = \hat{g}.    (3.16)
With these equations, the predictions from Model 1 and Model 2 are shown in Table 3.3 (a short sketch after the table reproduces the Model 2 values). As can be seen, Model 2 does a fair job at predicting the data, even though the parameter estimate of d is different from d1 and d2 in Model 1. This result is evidence for the invariance of d. When there is a common detection parameter, the ability to predict the data is almost as good as with condition-specific detection parameters. The lesson learned is that it may be difficult with nonlinear models to inspect parameter values with standard errors and decide if they differ significantly.
                          Hit Rate   False-Alarm Rate
Condition 1
  Data                       .800           .600
  Model 1 Prediction         .800           .600
  Model 2 Prediction         .750           .640
Condition 2
  Data                       .300           .040
  Model 1 Prediction         .300           .040
  Model 2 Prediction         .322           .038

Table 3.3: Predictions derived from the High Threshold model.
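A minimal sketch reproducing the Model 2 predictions in Table 3.3 from the fitted parameters (recall that par3 = (d, g1, g2) for Model 2):

#predicted hit and false-alarm rates under Model 2, via Eqs. (3.15) and (3.16)
d=mod2$par[1]; g1=mod2$par[2]; g2=mod2$par[3]
c(hit1=d+(1-d)*g1, fa1=g1, hit2=d+(1-d)*g2, fa2=g2)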
Problem 3.3.1 (Your Turn)
You are testing the validity of the high-threshold model for the perception of faint audio tones with a selective influence test. You manipulate
the volume of the tone through two levels: low and very low. The manipulation is hypothesized to affect d and not g. The obtained data are
given below. Use R to test for selective influence.
            Hits   Misses   False Alarms   Correct Rejections
Low           33      17          42                  9
Very Low      40      10          30                 20
Problem 3.3.2 (Your Turn)
Let’s gain some insight into why the detection estimates across conditions seem quite different in Figure 3.2 even though they are not
statistically so. One reason for the appearance of difference is that the
standard error bars are drawn symmetrically around the parameter estimate. Let’s see if this is accurate. We ask whether there is skew to the
sampling distribution of the common d estimator in Model 2. Use the
simulation method to construct the sampling distribution of d in Model
2 for true values (d = .3, g1 = .64, g2 = .04) with 50 noise and signal
trials in each of the two conditions. Plot the sampling distribution as
a relative-frequency histogram. Is it skewed? If so, in which direction?
For the purposes of this problem, set negative estimates of g2 to zero.
3.4 Receiver Operating Characteristic
There is a common graphical approach to assessing models in signal detection experiments. Psychologists typically graph the hit rate as a function of the false-alarm rate. The resulting plot is called a receiver operating characteristic or ROC plot. Table 3.4 shows data from a hypothetical experiment which is a test of the selective influence of payoffs. There are 500 signal trials and 500 noise trials per condition. The resulting hit rates are .81, .70, .57, .50, and .30 for Conditions A through E, respectively. The resulting false-alarm rates are .60, .47, .37, .20, and .04 for Conditions A through E, respectively. The ROC plot for these data is shown in Figure 3.3. There is a point for each condition. The x-axis value is the false-alarm rate; the y-axis value is the hit rate.
The lines in Figure 3.3 are predictions from the high-threshold model. Each line corresponds to a particular value of d. The points on the line are obtained by varying g. The line is the prediction for the case of invariance of sensitivity, and it is called the isosensitivity curve (Luce, 1963). The high-threshold model predicts straight-line isosensitivity curves with a slope of (1 − d) and an intercept of d. The following is the derivation of this result:
              Reward for Correct Response              Data
              Signal Trial    Noise Trial    Hit   Miss   FA    CR
Condition A       10c             1c         404    96    301   199
Condition B        7c             3c         348   152    235   265
Condition C        5c             5c         287   213    183   317
Condition D        3c             7c         251   249    102   398
Condition E        1c            10c         148   352     20   480

Table 3.4: Hypothetical data for a signal detection experiment with payoffs.
Figure 3.3: ROC plot. The points are the data from Table 3.4. Lines denote
predictions of the high-threshold model.
Let i index condition. By the high-threshold model,

p_{h,i} = d + (1 − d)g_i,    p_{f,i} = g_i.

Substituting the latter equation into the former yields

p_{h,i} = d + (1 − d)p_{f,i},

which is a straight-line relationship. Straight-line ROCs are characteristic of models with all-or-none mental processes.
ROC plots can be made in R with the plot() command. Straight lines,
denoting isosensitivity curves, may be added with the lines() command.
The details are left as part of the exercise below.
Problem 3.4.1 (Your Turn)
Fit the high-threshold model to the data in Table 3.4. Fit a general
model with separate parameters for each condition. The 10 parameters
in this model are (dA , gA , dB , gB , dC , gC , dD , gD , dE , gE ). Fit a common
detection model; the six parameters are (d, gA , gB , gC , gD , gE ).
1. Estimate parameters for both models. Make a graph showing the
detection parameters for each condition (from the general model)
with standard errors.
2. Plot the data as an ROC and add a line denoting the common-detection model.
3. Perform a likelihood ratio test to see if detection varies across
conditions.
3.5 The Double High-Threshold Model

3.5.1 Basic Model
The high-threshold model is useful because it provides separate estimates of
detection and guessing probabilities. It is, however, not the only model to do
Figure 3.4: The double high-threshold model.
so. A closely related alternative is the double high-threshold model. Like the
high-threshold model, the double high-threshold model is also predicated
on all-or-none mental processes. In contrast to the high-threshold model,
however, the double high-threshold model posits that participants may enter
a noise-detection state in which they are sure no signal has been presented.
The model is shown graphically in Figure 3.4. The model is the same as
the high-threshold model for signal trials: either the signal is detected, with
probability d, or not. If the signal is not detected, the participant guesses as
before. On noise trials, participants either detect that the target is absent
(with probability d) or enter a guessing state. Model equations are given by
Yh ∼ Binomial(d + (1 − d)g, Ns),    (3.17)
Yf ∼ Binomial((1 − d)g, Nn),    (3.18)

where 0 < d, g < 1.
Analysis of this model is analogous to the high-threshold model. The log
likelihood is given by:
l(d, g; y_h, y_f) = y_h \log(d + (1-d)g) + y_m \log((1-d)(1-g)) + y_f \log((1-d)g) + y_c \log(d + (1-d)(1-g)).
Either calculus methods or numerical methods may be used to provide
maximum likelihood estimates. The calculus methods provide the following
estimates:
\hat{d} = \frac{y_h}{N_s} - \frac{y_f}{N_n},    (3.19)

\hat{g} = \frac{y_f/N_n}{1 - \hat{d}}.    (3.20)
These can be rewritten in terms of hit and false alarm rates:

\hat{d} = \hat{p}_h - \hat{p}_f,    (3.21)

\hat{g} = \frac{\hat{p}_f}{1 - \hat{d}}.    (3.22)
The estimate d̂ in this case is simply the hit rate minus the false alarm rate. This estimate is often used as a measure of performance in memory experiments (e.g., Anderson, Craik, & Naveh-Benjamin, 1998). Equation (3.19) shows that this measure may be derived from a double high-threshold model. Moreover, the validity of the measure may be assessed in a particular domain via a suitable selective influence test.
Implementation in R is straightforward. The log likelihood is computed
by
#negative log likelihood for double high-threshold model
nll.dht=function(par,y)
{
d=par[1]
g=par[2]
p=1:4 # reserve space
p[1]=d+(1-d)*g #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]= (1-d)*g# probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
The ROC of the double high-threshold model can be derived by observing
that ph = d + pf . Hence, the ROC is a straight line with y-intercept of d and
a slope of 1.0. Plots of ROCs for a few values of d are shown in Figure 3.5.
Problem 3.5.1 (Your Turn)
Do the analyses in Your Turn 3.4 for the double high-threshold model.
Figure 3.5: ROC lines of the double high-threshold model for several values
of d.
3.5.2 Double High-Threshold Model and Overall Accuracy
In the preceding sections, we stressed the importance of differentiating miss errors from false alarm errors. Some researchers prefer to use overall accuracy as a measure of performance. Overall accuracy, denoted by c, is

c = \frac{Y_h + Y_c}{N_s + N_n}.

It may be estimated by

\hat{c} = \frac{y_h + y_c}{N_s + N_n}.

In many experiments, the number of signal and noise trials is equal (N = N_n = N_s). In this case, accuracy is given by

c = (Y_h + Y_c)/2N.
It may be surprising that this measure may be theoretically motivated
by the double high-threshold model. For signal trials, the number of correct
responses is Yh = Ns (d + (1 − d)g); for noise trials, the number of correct
responses is Yc = Nn (d + (1 − d)(1 − g)). Overall accuracy is
c = \frac{N_s(d + (1-d)g) + N_n(d + (1-d)(1-g))}{N_s + N_n}.

If N_s = N_n, then the above equation may be reduced to

c = \frac{d}{2} + \frac{1}{2}.

This relationship indicates that overall accuracy is a simple linear transform of d when N_n = N_s. Overall accuracy for this case may be derived from a double high-threshold model; its validity as a measure is tested by testing the selective influence of the double high-threshold model.
3.5.3 Process Dissociation Procedure
The most influential double high-threshold model is Jacoby’s process dissociation procedure for the study of human memory (Jacoby, 1991). Currently,
human memory is often conceptualized as consisting of several separate systems or components. Perhaps the most fundamental piece of evidence for
separate components comes from the study of anterograde amnesics. These
patients typically suffer a stroke or a head injury. They have fairly well-preserved memories from before the stroke or injury, but have impairment in
forming new memories. They do poorly in direct tests for the memory of
recent events such as a recognition memory test. They make many more
miss and false alarm errors than appropriate control participants.
Although amnesics are greatly impaired in direct memory tasks such as
recognition memory, they are far less impaired on indirect memory tasks.
One such task is stem completion. In this task participants are shown a list
of words at study as before. At test, they are given a word stem, such as br____. Their task is to complete the stem with the first word that comes
to mind. Control participants without amnesia typically show a tendency to
complete stems with studied words. For example, if no word starting with
stem br is studied, typical completions are words like bread and brother but
not bromide. If bromide is studied, however, it is far more likely to be used as
the stem completion than otherwise. This type of test is considered indirect
because the experimenter does not ask whether an item was studied or not.
Instead, the presence of memory is inferred from its ability to indirectly affect
a mental action such as completing the stem with the first word that comes to
mind. Most surprisingly, amnesics have somewhat preserved performance on
indirect tests. While an amnesic may not recall studying a specific word, that
word still has an elevated chance of being used by the amnesic to complete
the stem at test (Graf & Schacter, 1985).
This finding, as well as many related ones, has led to the current conceptualization of memory as consisting of two systems or components. One
of these components reflects conscious recollection, which is willful and produces the feeling of explicitly remembering an event. This type of memory is
primarily used in direct tasks. The other component is the automatic, unconscious, residual activation of previously processed or encountered material.
Automatic activation corresponds to the feeling of familiarity but without
explicit memorization. Within this conceptualization, amnesics’ deficit is in
the conscious recollective component but not in the automatic component.
Hence, they tend to have less impairment in tasks that do not require a
conscious recollection. There are a few variations of this dichotomy (e.g.,
Jacoby, 1991; Schacter, 1990; Squire, 1994), but the conscious-automatic one
is influential.
The goal of the process dissociation procedure is to measure the degree of conscious and automatic processing, and we describe its application to the stem completion task. The task starts with a study phase in which participants are presented a sequence of items. There are two test conditions in process dissociation: an include condition and an exclude condition. In the include condition, participants are instructed to complete the stem with a previously studied word. In the exclude condition, participants are instructed to complete the stem with any word other than the one studied. In the include condition, stem completion with a studied word can occur either through successful conscious recollection or automatic activation. In the exclude condition, successful conscious recollection does not lead to stem completion with the studied item; instead, it leads to stem completion with a different item.
The following notation is used to implement the model. Let Ni and
Ne denote the number of words in the include and exclude test conditions,
respectively. Let random variables Yi,s and Yi,n be the frequency of stems
completed with a studied word and a word not studied, respectively, in the
include condition. Let random variables Ye,s and Ye,n denote the same for
words in the exclude condition. It is assumed that recollection is all-or-none
Figure 3.6: The process dissociation procedure model.
and occurs with a probability of r. Automatic activation is also all-or-none and occurs with probability a. Figure 3.6 depicts the model. Resulting data are modeled as:
Yi,s ∼ Binomial(Ni, r + (1 − r)a),    (3.23)
Ye,s ∼ Binomial(Ne, (1 − r)a).    (3.24)
The model is a double high-threshold model where conscious recollection plays the role of detection and automatic activation plays the role of guessing (Buchner, Erdfelder, & Vaterrodt-Plunneck, 1995). Because process dissociation is a double high-threshold model, all of the previous development may be used in analysis. Estimators for r and a are given by:
r̂ = yi,s /Ni − ye,s /Ne,   (3.25)
â = (ye,s /Ne)/(1 − r̂).   (3.26)
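To make the estimators concrete, here is a minimal R sketch; the counts below are hypothetical and chosen only for illustration.

#Estimators of Equations 3.25 and 3.26 for one (hypothetical) data set.
#yi.s and ye.s are the numbers of stems completed with a studied word
#in the include and exclude conditions, respectively.
yi.s=60; Ni=100
ye.s=20; Ne=100
r.hat=yi.s/Ni-ye.s/Ne        #Equation 3.25; here .40
a.hat=(ye.s/Ne)/(1-r.hat)    #Equation 3.26; here .33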
Problem 3.5.2 (Your Turn)
As people age, their memory declines. Let’s use the process dissociation procedure to assess the locus of the decline. The table, below,
shows hypothetical data for younger and elderly adults. The columns
“Studied” and “Not Studied” show the number of stems completed with
a studied word and a word not studied, respectively. The total number
of stems is the sum of these two numbers. Given these data, test for
the effects of aging on r and a.
                    Younger                   Elderly
Condition     Studied   Not Studied     Studied   Not Studied
Include          69          81            37         113
Exclude          36          64            12          88
Chapter 4
The Theory of Signal Detection
The theory of signal detection (Green & Swets, 1966) is the dominant model-based method of assessing performance in perceptual and cognitive psychology. We describe the model for the tone-in-noise signal detection experiment.
The model, however, is applied more broadly to all sorts of two-choice tasks
in the literature, including those used in assessing memory. Macmillan and Creelman (1991) provide an extensive review of the variations and uses of this
flexible model. It is important to distinguish between the theory of signal
detection and signal-detection experiments. The former is a specific model
like the high-threshold model; the latter is an experimental design with two
stimuli and two responses.
The presentation of the model relies on continuous random variables,
which were not covered in Chapter 1. We first discuss this type of random
variable as well as density, cumulative distribution, and quantile functions.
Then we introduce the normal distribution, upon which the theory of signal
detection is based. After covering this background material in the first half
of the chapter, we present the theory itself in the second half.
4.1 Continuous Random Variables
In previous chapters, we proposed models based on the binomial distribution.
One feature of the binomial is that there is only mass on a select number of outcomes. Figure 4.1, top-left, shows the probability mass function for a binomial with 50 trials (p = .3). Only the points 0, 1, ..., 50 have nonzero mass. Because mass falls on select discrete points, the binomial is called
Figure 4.1: Top: Probability mass functions for discrete random variables.
Bottom: Density functions for continuous random variables.
a discrete distribution. Another example of a discrete distribution is the
Poisson distribution. The top-right panel shows the Poisson probability mass
function. For the Poisson, mass occurs on all positive integers and zero.
There are gaps between points where there is no mass.
In contrast to discrete distributions, continuous distributions have mass
on intervals rather than on discrete points. The best-known continuous distribution is the normal. The bottom-left panel (Figure 4.1) shows a density function of a normal distribution. Here there is density on every single
point. The bottom-right panel shows another example of a continuous random variable—the uniform distribution. Notice that where there is density,
it is on an interval rather than on single points separated by gaps. The density function serves a similar purpose to the probability mass function and is
discussed in detail below.
4.1.1 The Density Function
In discrete distributions, probability mass functions describe the probability
that a random variable takes a specific realization. The density function
is different. At first consideration, it is tempting to interpret density as
a probability. This interpretation, however, is flawed. To see this flaw,
consider density as probability for Figure 4.1D. If density were probability,
then we could ask, “What is the probability of observing either 1/3 or 2/3?” The density at each point is 1.0; hence, the summed density for the two points is
2.0. Probability for any event can never exceed 1; hence, density cannot be
probability.
Density is interpretable on intervals rather than on individual points. To
compute the probability that a random variable takes a realization within
an interval, we compute the area under the density function on the interval.
Consider again the uniform distribution of Figure 4.1D. Suppose we wish to
know the probability that an observation is between 1/4 and 3/4. The area
under the density function on this interval, which is the probability, is p =
1/2. The uniform is a particularly easy distribution to compute the area on
an interval because the area is that of a rectangle. The relationship between
the density function, area, and probability is formalized in the following
definition:
Definition 33 (Density Function) The density function of a continuous random variable X is a function such that the area under the function between a and b corresponds to Pr(a < X ≤ b).
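The uniform example above can be checked in R; the following is a small sketch assuming the distribution in Figure 4.1D is a uniform on [0, 1].

#Density at a point is not a probability:
dunif(1/3)+dunif(2/3)                        #2, which exceeds 1
#Probability is area under the density on an interval:
integrate(dunif,lower=1/4,upper=3/4)$value   #0.5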
What is the probability that a continuous random variable takes any single value? It is the area under a single point, which is zero. This condition
makes sense. For example, we can ask what the probability is that a person weighs 170lbs. There are many people who report their weight at 170lbs,
but this report is only an approximation. In fact, very few people worldwide weigh between 169.99lbs and 170.01lbs. Surely almost nobody weighs
between 169.999999lbs and 170.000001lbs. As we decrease the size of the
interval around 170lbs, the probability that anybody’s weight falls in the
interval becomes smaller. In the limit, nobody can possibly weigh exactly
170lbs. Hence the probability of someone weighing exactly some weight, to
arbitrary precision, is zero.
In order to more fully understand the density function, it is useful to
consider the units of the axes. The unit of the x-axis of a density function is straightforward—it is the unit of measurement of the random variable. For example, the x-axis of the normal density in Figure 4.1 is in units of pounds. The unit of the y-axis is more subtle. It is found by considering the units of area. On any graph, the unit of area under the curve is given by:

Unit of Area = x-axis unit × y-axis unit.   (4.1)

The area under the density function corresponds to probability. Probability is a pure number without any physical unit such as pounds or inches. For the normal density over weight in Figure 4.1, in which the x-axis unit is pounds, the y-axis unit must be 1/pounds in order for Equation 4.1 to hold. In general, then, the unit of density on the y-axis is the reciprocal of that
of the random variable.
One of the more useful properties of densities is that they describe the
convergence of histograms with increasingly large samples. In Chapter 1 with
discrete RVs, we advocated relative-frequency histograms because they converge to the appropriate probability mass function. For continuous RVs, the appropriate histogram is termed a relative-area histogram. In relative-area histograms, the area of a bin corresponds to the proportion of responses that fall within the bin. Figure 4.2 (left) shows a relative-area histogram of 100 realizations from a normal distribution. The bin between 150lbs and 160lbs has a height value .015 (in units of 1/lbs) and an area of .15. Fifteen percent of the realizations fell in this bin. The advantage of relative-area histograms is that they converge¹ to the density function. The right panel shows an example of this convergence. There are 20,000 realizations in the
histogram.
¹Convergence involves shrinking the bin size. As the number of realizations grows, the bins should become smaller, and, in the limit, the bins should become infinitesimally small.
Figure 4.2: Convergence of relative-area histograms to a density function.
Left: Histogram of 100 observations. Right: Histogram of 20,000 observations.
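The convergence shown in Figure 4.2 is easy to reproduce in rough form. The sketch below draws a relative-area histogram of simulated weights and overlays the density function; the mean and standard deviation (170 and 25 pounds) are assumptions made only for illustration, as the values used for the figure are not stated.

#Relative-area histogram of 20,000 simulated weights with the density overlaid.
w=rnorm(20000,mean=170,sd=25)
hist(w,freq=FALSE,xlab="Weight in pounds",main="")
curve(dnorm(x,mean=170,sd=25),add=TRUE)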
4.1.2 Cumulative Distribution Functions
The area under the density function for an interval describes the probability
that an outcome will be in that interval. The cumulative distribution function
describes the probability that an outcome will be less than or equal to a point.
Definition 34 (Cumulative Distribution Function (CDF))
Let F denote the cumulative distribution function of random
variable X. Then, F (x) = P r(X ≤ x).
Figure 4.3 shows the relationship between density and cumulative distribution functions for a uniform between 0 and 2. There are two dotted
vertical lines, labeled a and b, for a = .5 and b = 1.3. The values of
density and CDF are shown in left and right panels, respectively. The area
under the density function to the left of a is .25, and this is the value of the
CDF in the right panel. Likewise, the area under the density function to the
left of b is .65, and this is also graphed in the right panel. Cumulative distribution
functions are limited to the [0, 1] interval and are always increasing.
The cumulative distribution function can be used to compute the probability that an observation occurs on the interval [a, b]:

Pr(a < X ≤ b) = F(b) − F(a).   (4.2)
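Equation 4.2 can be verified for the uniform of Figure 4.3 with R's punif(); the values .25 and .65 below are the CDF values at a = .5 and b = 1.3 discussed above.

punif(1.3,min=0,max=2)-punif(0.5,min=0,max=2)   #.65 - .25 = .40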
Figure 4.3: Density (left) and cumulative distribution function (right) for a
uniform random variable. The cumulative distribution function is the area
under the density function to the left of a value.
The relationship between density and cumulative distribution functions
for continuous variables may be expressed with calculus. We provide them
for students with knowledge of calculus:
F(x) = ∫_{−∞}^{x} f(τ) dτ,
f(x) = dF(x)/dx.
Cumulative distribution functions are also defined for discrete random
variables. For the binomial model of toast-flipping, the CDF describes the
probability of obtaining x or fewer butter-side-down flips. Figure 4.4 shows
the CDF for 10 flips (with p = .5). A few points deserve comment. First,
some points are open (not filled-in) while others are closed. The associated
value of the function at these points is the closed point. For example, the
CDF at x = 4 is .38 and is not .17. Second, the CDF is defined easily for
fractions, like 3.5, which do not correspond to outcomes. The probability of
observing 3.5 or fewer butter-side-down flips is the same as observing 3 or
fewer butter-side-down flips. This facet accounts for the stair-step characteristic in the graph.
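These values can be confirmed with the binomial CDF function pbinom():

pbinom(4,size=10,prob=.5)     #about .38, the CDF at x = 4
pbinom(3,size=10,prob=.5)     #about .17
pbinom(3.5,size=10,prob=.5)   #same as the CDF at 3, reflecting the stair-step shape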
4.1.3 Quantile Function
Quantiles play a major role in some of the psychological models we consider.
The easiest way to explain quantiles is to consider percentiles. The 75th
Figure 4.4: Cumulative distribution function for a binomial with N = 10
and p = .5.
percentile is the value below which 75% of the distribution lies. For example,
for the uniform in Figure 4.3, the value 1.5 is the 75th percentile because
75% of the area is below this point. Quantiles are percentiles for
distributions, except they are indexed by fractions rather than by percentage
points. The .75 quantile corresponds to the 75th percentile.
Definition 35 (Quantile) The pth quantile of a distribution is
the value qp such that Pr(X ≤ qp) = p.
The quantile function takes a probability p and returns the associated pth
quantile for a distribution. The quantile function is the inverse of the cumulative distribution function. Whereas the CDF returns the proportion of mass
below a given point, the quantile function returns the point below which a
given proportion of the mass lies. Examples of density functions, cumulative
distribution functions and quantile functions for three different continuous
distributions are shown in Figure 4.5. The top row is for a uniform between
0 and 2; the middle row is for a normal distribution; the bottom row is for
an exponential distribution. The exponential is a skewed distribution used
to model the time between events such as earthquakes, light bulb failures,
Figure 4.5: Density, cumulative probability, and quantile functions for uniform, normal, and exponential distributions.
or action potential spikes in neurons. The inverse relationship between CDF
and quantile functions is evident in the graph.
Several quantiles have special names. The .25, .50 and .75 quantiles of a
distribution are known as the first, second, and third quartiles, respectively,
because they divide the distribution into quarters. The .50 quantile is also
known as the median.
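A quick check of the uniform example in R:

qunif(.75,min=0,max=2)                      #1.5, the .75 quantile
punif(qunif(.75,min=0,max=2),min=0,max=2)   #.75; the CDF inverts the quantile function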
4.1.4 Expected Value of Continuous Random Variables
The following section is based on calculus. It is not critical for understanding
the remaining topics in this book. The expected value of a distribution is its
center, and is also often called the mean of the distribution. In Chapter 1
we defined the expected value of discrete random variables in terms of their
probability mass functions, E(X) = Σx x f(x). This definition is not appropriate for continuous random variables. Instead, expected value is defined in terms of integrals. The following hold for the expected value and variance of continuous random variables:

E(X) = ∫_{−∞}^{∞} x f(x) dx,
V(X) = ∫_{−∞}^{∞} (x − E[X])² f(x) dx.

Likewise, the expected value of a function of a random variable, g(X), is given by:

E[g(X)] = ∫_{−∞}^{∞} g(x) f(x) dx.
The other properties of expected value discussed in Chapter 1 hold for
continuous random variables. Most importantly, the expected value can typically be estimated by the sample mean. Therefore, instead of calculating expected values by evaluating integrals, we can use simulation. As
before, we simply draw an appropriately large number of random samples
and then calculate the sample mean.
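As a sketch of this simulation approach, consider an exponential distribution with rate 1 (an arbitrary choice); its expected value and variance are both 1.

x=rexp(100000,rate=1)   #100,000 realizations
mean(x)                 #close to 1, the expected value
var(x)                  #close to 1, the variance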
4.1.5 The Normal Distribution
The normal distribution is the basis of many common inferential tests including t-tests, ANOVA, and regression. The normal has two parameters, µ
and σ 2 . These are called the mean and variance, respectively, because if X is
a normal random variable, it can be shown that E(X) = µ and V(X) = σ 2 .
A concept essential for understanding signal detection is the standard
normal distribution:
Definition 36 (Standard Normal) A standard normal random variable is distributed as a normal with µ = 0 and σ 2 = 1.
The cumulative distribution function and quantile function of the standard normal are denoted by Φ(x) and Φ−1(p), respectively.
4.1.6 Random Variables in R
The R package has built-in density, cumulative distribution, and quantile
functions for a large number of distributions. Functions dnorm(), pnorm(),
and qnorm() are the density, cumulative distribution function, and quantile
function of the normal, respectively. Likewise, functions dunif(), punif(),
qunif() are the corresponding R functions for the uniform distribution. The
syntax generalizes: a ’d’ before a random variable name refers to either the
density or a probability mass function (depending on whether the random
variable is continuous or discrete). A ’p’ and a ’q’ before the name refer to
the CDF and quantile functions, respectively. An ’r’ before the name, as in
rnorm() or rbinom(), produces realizations from the random variable. Syntax for these functions can be obtained through help(), e.g., help(dunif).
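A brief illustration of the naming scheme for the standard normal:

dnorm(0)      #density at 0, about .399
pnorm(1.96)   #cumulative distribution function, about .975
qnorm(.975)   #quantile function, about 1.96
rnorm(3)      #three random realizations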
Quantile functions are useful in finding criterial values for test statistics.
The chi-squared distribution, for example, describes the distribution of G2
under the null hypothesis. In conventional hypothesis testing, the goal is to
specify the probability of mistakenly rejecting the null hypothesis when it
is true. The probability is called the Type I error rate and is often denoted
as α. In psychology the convention is to set α = .05. This setting directly
determines the criterion of the test statistic. For G2 , we wish to set a criterion
so that when the null is true, we reject it 5% of the time. The situation is
depicted in the left panel of Figure 4.6. The criterion is therefore the value of a chi-squared distribution below which 95% of the mass lies; i.e.,
the .95 quantile. This value is given by qchisq(.95,df), where df is the
degrees of freedom. Criterial bounds for t-tests, F-tests, and z-tests can be
found with qt(), qf(), and qnorm() , respectively. With the t-tests and
z-tests, researchers are often interested in two-tail alternative hypotheses.
These bounds are provided with .025 and .975 quantiles. An example for
a t-distribution with four degrees of freedom is shown in the right panel of
Figure 4.6.
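The criterial values shown in Figure 4.6 can be computed directly:

qchisq(.95,df=1)        #about 3.84, the .95 quantile of a chi-squared with 1 df
qt(c(.025,.975),df=4)   #about -2.78 and 2.78, two-tailed bounds for a t with 4 df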
Figure 4.6: Criterial bounds for the chi-square distribution with 1 df and the
t distribution with 4 df.
Problem 4.1.1 (Your turn)
1. Use R to plot the density function of a standard normal from -3
to 3.
2. Use R to plot the cumulative distribution function of a standard
normal from -3 to 3.
3. Use the cumulative distribution function to find the median of
the standard normal. Is your answer what you’d expect? Where
is this point on the density function?
4. Use R to plot the quantile function of a standard normal from
.001 to .999. Use abline(v=c(.025,.975)) to put vertical lines
on your plot. What are the y-values where these lines cross the
quantile function? How much probability mass is on the interval
between these lines?
5. Use R to plot a histogram of 10,000 realizations drawn from a
standard normal. How does this compare with your plot from
Problem #1 above?
6. Use these 10,000 realizations to estimate the expected value, median, and variance of the standard normal.
Figure 4.7: The signal detection model.
4.2 Theory of Signal Detection

It is important to distinguish between a signal detection experiment and the theory of signal detection: one is a type of experiment; the other is a model. There are many theories applicable to the signal detection experiment besides the theory of signal detection, including the threshold theories of the previous chapter.
We describe the theory of signal detection for the tone detection experiment. Participants monitor the input for a tone. The resulting sensation is assumed to be a random variable called strength. Tone-absent trials tend to have low strength while tone-present trials tend to have higher strength. Hypothetical distributions of strength are shown in Figure 4.7; in the figure the distribution for tone-present trials has greater strength on average than that for tone-absent trials. These distributions are modeled as normal distributions:

S ∼ Normal(µ = 0, σ² = 1) for tone-absent trials,
S ∼ Normal(µ = d′, σ² = 1) for tone-present trials.   (4.3)
To make a response, the participant sets a criterial bound on strength. This
bound is denoted by c and is presented as a vertical line in Figure 4.7. If the
strength of a stimulus is larger than c, then the participant responds “tone
present;” otherwise the participant responds “tone absent.”
Analysis begins with model predictions about hit, false-alarm, miss, and
correct-rejection probabilities. Correct rejection probability is the easiest to
derive. Correct rejection events occur when strengths from the tone-absent
distribution are below the criterial bound c. This probability is the CDF of
a standard normal at c, which is denoted as Φ(c).
pc = Φ(c).   (4.4)
The probability of a false alarm is 1 minus the probability of a correct rejection. Hence,

pf = 1 − Φ(c).   (4.5)
The equations for hits and misses are only a bit more complicated. The
probability of a miss is the probability that an observation from a normal
with mean of d′ and variance of 1 is less than c. This can be written as
pm = F (c; µ = d′ , σ 2 = 1). It is more standard, however, to express these
probabilities in relation to the standard normal. This is done by noting that
the probability that an observation from N(d′ , 1) is less than c is the same
as an observation from N(0, 1) is less than c − d′ :
pm = Φ(c − d′).   (4.6)

Because ph = 1 − pm,

ph = 1 − Φ(c − d′).   (4.7)
Equations 4.5 through 4.7 describe underlying probabilities, not data. The resulting data are distributed as binomials, e.g.,

Yh ∼ Binomial(ph, Ns),
Yf ∼ Binomial(pf, Nn).

Substituting in for ph and pf provides the complete specification of the signal detection model:

Yh ∼ Binomial(1 − Φ(c − d′), Ns),   (4.8)
Yf ∼ Binomial(1 − Φ(c), Nn).   (4.9)
4.2.1 Analysis
In this section, we provide analysis for a single condition. Data are the numbers of hits, misses, false alarms, and correct rejections and are denoted with the vector y = (yh, ym, yf, yc).
is as follows: Equation 4.5 can be rewritten as c = Φ−1 (1 − pf ). Equation 4.7
can be rewritten as c−d′ = Φ−1 (1−ph ). The standard normal is a symmetric
distribution. Hence, the area below a point x is the same as the area above
point −x. This fact implies that Φ−1 (p) = −Φ−1 (1 − p). Using this fact and
a little algebraic rearrangement, the following hold:
c = −Φ−1(pf),
d′ = Φ−1(ph) − Φ−1(pf).

Conventional estimators are obtained by using empirically observed hit and false-alarm rates:

ĉ = −Φ−1(p̂f),   (4.10)
d̂′ = Φ−1(p̂h) − Φ−1(p̂f).   (4.11)
Fortunately, these conventional estimators are the maximum likelihood
estimators. Hence for a single condition, it is easiest to simply use the above
estimators rather than numerically maximizing likelihood. These estimators
may be computed in R:
#hit.rate and fa.rate are the observed hit and false-alarm rates
dprime.est=qnorm(hit.rate)-qnorm(fa.rate)
c.est=-qnorm(fa.rate)
Although these estimators are useful in single-condition cases, they are
not useful for testing the invariance of parameters across conditions. For
example, the above method may not be used to estimate a common sensitivity
parameter across two conditions with different bounds. These more realistic
cases may be analyzed with a likelihood approach.
The first step is writing down the log likelihood. The log likelihood for a
binomial serves as a suitable starting point:
l(ph, pf; yh, yf) = yh log(ph) + ym log(pm) + yf log(pf) + yc log(pc).   (4.12)
Substituting expressions for the probabilities yields:
l(d′, c; yh, yf) = yh log(1 − Φ(c − d′)) + ym log(Φ(c − d′)) + yf log(1 − Φ(c)) + yc log(Φ(c)).   (4.13)
This log likelihood may be computed in R:
#log likelihood for signal detection
#par=c(d',c)
#y=c(hit,miss,fa,cr)
ll.sd=function(par,y)
{
  p=1:4
  p[1]=1-pnorm(par[2],par[1],1)   #hit probability
  p[2]=1-p[1]                     #miss probability
  p[3]=1-pnorm(par[2],0,1)        #false-alarm probability
  p[4]=1-p[3]                     #correct-rejection probability
  sum(y*log(p))
}
Let’s estimate d′ and c for data y = (40, 10, 30, 20):
y=c(40,10,30,20)
par=c(1,0) #starting values
optim(par,ll.sd,y=y,control=list(fnscale=-1)) #fnscale=-1 maximizes rather than minimizes
The results are d̂′ = .588, and ĉ = −.253. These estimates match those
obtained from Equations 4.10 and 4.11.
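The same values follow from the closed-form estimators:

hit.rate=40/50; fa.rate=30/50
qnorm(hit.rate)-qnorm(fa.rate)   #0.588, Equation 4.11
-qnorm(fa.rate)                  #-0.253, Equation 4.10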
4.2.2 ROC Curves for Signal Detection
In Chapter 3, we described the receiver operating characteristic (ROC) plot. The points in Figure 4.8A show the ROC for the hit and false-alarm rate data provided in Table 3.4. The ROC is useful when an experimental manipulation is assumed to not affect sensitivity. For this case, different models make different predictions about the isosensitivity curve. The isosensitivity curves for the signal detection model for a few values of d′ are shown in Figure 4.8A. These isosensitivity curves are different from those of the high-threshold and double high-threshold models; they are curved rather than straight lines.
There is an alternative to the ROC plot specifically suited for the signal detection model. The alternative, called a zROC plot, is shown in Figure 4.8B. In this plot, standard normal quantiles of hit and false-alarm rates
(i.e., Φ−1 (p̂h ) and Φ−1 (p̂f )) are plotted rather than the hit and false-alarm
Figure 4.8: ROC and zROC plots for the data from Table 3.4. Signal detection model predictions are overlaid as lines.
rates themselves. The main motivation for doing so is that isosensitivity
curves on this plot are straight lines with a slope of 1.0. To see why this is
true, note that Eq. (4.11) can be generalized for i conditions as follows:
d′i = Φ−1 (p̂h,i) − Φ−1 (p̂f,i ).
To derive an isosensitivity curve, we assume each condition has the same
sensitivity; therefore, d′i may be replaced by d′ . Rearranging yields
Φ−1 (p̂h,i) = Φ−1 (p̂f,i ) + d′ .
For notational convenience, let yi = Φ−1 (p̂h,i) and xi = Φ−1 (p̂f,i ). Then,
yi = xi + d′ ,
which is the equation for a straight line with a slope of 1.0 and an intercept
of d′ . If the signal detection model holds and the conditions each have the
same sensitivity, then the zROC points should fall on a straight line with
slope 1.0. The function Φ−1 is also called a z-transform and z-transformed
proportions are also called z-scores.
To draw a zROC, we use qnorm(p), where p is either the hit or false-alarm
rate. The following code plots the zROC for the data in Table 3.4.
hit.rate=c(.81,.7,.57,.5,.30)
fa.rate=c(.6,.47,.37,.2,.04)
plot(qnorm(fa.rate),qnorm(hit.rate),ylab="z(Hit Rate)",
xlab="z(False-Alarm Rate)",ylim=c(-2,2),xlim=c(-2,2))
Signal-detection isosensitivity lines can be overlaid on the plot by drawing lines with the abline() function. The syntax for drawing lines with
specified slope and intercept is abline(a=intercept,b=slope). For example, the diagonal with slope of 1 and an intercept of 1.5 is drawn with
abline(a=1.5,b=1) or simply abline(1.5,1).
Problem 4.2.1 (Your Turn)
Fit the signal-detection model to the data in Table 3.4. Fit a general
model with separate parameters for each condition. The 10 parameters
in this model are (d′A , cA , d′B , cB , d′C , cC , d′D , cD , d′E , cE ). Fit a common
sensitivity model; the six parameters are (d′ , cA , cB , cC , cD , cE ).
1. Estimate parameters for both models. Make a graph showing
the sensitivity parameters for each condition (from the general
model) with standard errors.
2. Plot the data as a zROC; add a line denoting the common-sensitivity model.
3. Perform a likelihood ratio test for the common-sensitivity hypothesis.
4. Estimate the common-detection high-threshold model fit for the
data (see Your Turn 3.4). Plot this model’s prediction in the
zROC plot.
4.2.3 Null Counts
One vexing problem in the analysis of the signal detection model is the
occasional absence of either miss or false-alarm events. When there are no
misses, the hit-rate is p̂h = 1.0. Recall that d̂′ = Φ−1 (p̂h ) − Φ−1 (p̂f ). When
p̂h = 1.0, the term Φ−1 (p̂h ) is infinite (try qnorm(1)), leading to an estimate
of d̂′ = ∞. Likewise, when there are no false alarms, the term Φ−1 (p̂f ) is
negatively infinite (try qnorm(0)), leading again to an estimate of d̂′ = ∞.
An infinite sensitivity estimate is indeed problematic.
The presence of null counts is therefore a problem in need of redress. The
problem occurs in the estimates of ph and pf when there are no misses or no
false alarms. It is reasonable to assume that the underlying true values ph and
pf are never 1.0 and 0.0, respectively. As the number of trials is increased, it is
expected that participants eventually make both false alarm and miss errors.
Accordingly, estimates of p̂h and p̂f should not be 1 and 0, respectively, even
when there are no misses and false alarms. Snodgrass and Corwin (1988)
recommend using the p̂1 estimator we have previously introduced in Chapter
1 (Equation 1.12): p̂1 = (y + .5)/(N + 1). This estimator keeps the estimates
from extremes of 0 or 1. For signal detection, estimates of hit and false-alarm
rates are given by:
p̂h = (yh + .5)/(Ns + 1),
p̂f = (yf + .5)/(Nn + 1).
This correction can be implemented within the formal likelihood approach by
adding .5 to each observed cell count. For moderate values of probabilities,
it will lead to more efficient estimation (see Figure 1.8).
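In R, the correction is one line per rate; the counts below are hypothetical, with no false alarms.

yh=50; Ns=50; yf=0; Nn=50
ph.hat=(yh+.5)/(Ns+1)          #.990 rather than 1.0
pf.hat=(yf+.5)/(Nn+1)          #.0098 rather than 0.0
qnorm(ph.hat)-qnorm(pf.hat)    #a finite estimate of d'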
There is a second correction used in the literature to keep estimates from
being too extreme, called the 1/2N rule (Berkson, 1953):

p̂ = 1/(2N) if y = 0,
p̂ = y/N if 0 < y < N,
p̂ = 1 − 1/(2N) if y = N.
This correction is implemented in R as follows:
p=y/N
p[p==0]=1/(2*N)
p[p==1]=1-1/(2*N)
The above code introduces some new programming elements. Let’s work
through an example with three conditions. Suppose for each condition, there
are 20 observations (N = 20) and the number of successes (hits or false
alarms) is y = (2, 0, 20). The code works as follows: From the first line, p
is a vector with values (.1, 0, 1). The second line of code is more complex.
Consider first the term p==0. The symbol == tests each term for equality
with 0, and so this line returns a vector of true and false values. In this case,
the second element is true, because p[2] does equal 0. The left-hand side,
p[p==0] refers to all of those elements in which the term within the brackets
is true, i.e., all those in which p does indeed equal zero. These elements
are replaced with the value of 1/(2N). The third line operates analogously; it replaces all estimated proportions of 1.0 with the value 1 − 1/(2N). Hautus and
Lee (1998) provide further discussion of the properties of these estimators.
Problem 4.2.2 (Your Turn)
In application, it matters little whether a Snodgrass-Corwin or a Berkson correction is used. Let’s compare them using R.
1. An experimenter runs a signal detection experiment with 20 signal and noise trials. The true value of d′ and c are 1 and .5,
respectively. Simulate the sampling distributions of d̂′ and ĉ for
both correction methods using 100,000 replicate experiments (if
your computer is old and slow, then perhaps 10,000 replicates is
more appropriate). Plot these as histograms (there are four separate histograms, obtained by combining the two parameters (d′,
c) with the two correction methods).
2. Estimate the bias and RMSE for sensitivity of each correction
method. Which of these is more efficient? One convenient method
of comparing the two corrections is to compute a ratio of RMSE
for the two methods. Let the numerator be the RMSE from the
Snodgrass-Corwin correction and the denominator be the RMSE
from the Berkson correction. With this convention numbers less
than 1.0 indicate better efficiency for the Snodgrass-Corwin correction; numbers greater than 1.0 indicate the reverse.
3. Of course, true values are not limited to (d′ = 1, c = .5). Try
your code for true values (d′ = 1.5, c = .2).
4. Let’s explore the corrections for a range of true values. Let
true d′ = (.2, .4, .., 2.8). For each of these values, let c =
(0, .1d′ , .2d′, .., .9d′ , d′). Compute the efficiency ratio of the two
methods. There should be 154 different RMSE ratios. Use
the contour function to plot these. This method provides an assessment of the relative efficiency of the two methods for a wide range of parameter values. As an alternative, use the filled.contour function. Also try these plots for the logarithm of
the efficiency ratio. This quantity has the advantage of being
positive when efficiency favors one of the correction methods and
negative when it favors the other. Hint: Use loops through true
values in your R code.
                              Condition A   Condition B   Condition C
Reward for Correct Response
  Signal Trial                    10c            5c            1c
  Noise Trial                      1c            5c           10c
Data
  Hit                              82            68            48
  Miss                             18            36            52
  FA                               62            44            28
  CR                               36            56            72
Table 4.1: Hypothetical data for a signal-detection experiment with payoffs.
4.3 Error Bars for ROC Plots
The previously-drawn ROC plots lack error bars. As mentioned previously,
standard errors are a rough guide to the variability in parameter estimates.
In this section, we consider methods of using error bars for ROC plots.
Data points in ROC graphs are composed of hit and false-alarm rates.
These rates are estimates of the true hit and false-alarm probabilities. The
following equation may be used for computing standard errors for probability
estimates from data distributed as a binomial:
SE(p̂) =
s
p(1 − p)
.
N
Figure 4.9A shows an example of an ROC plot with standard errors. For
each point, there is one error bar in the vertical direction indicating the
standard error of the hit-rate estimate and one in the horizontal direction
indicating the standard error of the false-alarm rate estimate. The ROC plot
comes from the hypothetical data in Table 4.1, which serves as a convenient
example in demonstrating how to draw these error bars in R.
The novel elements in these plots are the horizontal error bars, which are drawn
with the following code:
horiz.errbar=function(x,y,height,width,lty=1)
{
  arrows(x,y,x+width,y,angle=90,length=height,lty=lty)   #right half of the bar
  arrows(x,y,x-width,y,angle=90,length=height,lty=lty)   #left half of the bar
}
The following code uses horiz.errbar to draw the standard errors.
Figure 4.9: A: ROC plot for the data from Table 4.1 with standard errors on
hit and false-alarm rates. The black line represents the isosensitivity curve
for the value of d′ obtained by maximum likelihood. B: zROC plot for the
same data with standard errors on sensitivity estimate. The solid line is the
isosensitivity curve for the best-fitting signal detection model. The dotted
lines are standard errors on sensitivity.
hit=c(82,68,48)
fa=c(62,44,28)
N=100
hit.rate=hit/N
fa.rate=fa/N
std.err.hits=sqrt(hit.rate*(1-hit.rate)/N)
std.err.fa=sqrt(fa.rate*(1-fa.rate)/N)
#plot ROC
plot(fa.rate,hit.rate,xlim=c(0,1),ylim=c(0,1))
errbar(fa.rate,hit.rate,height=std.err.hits,width=.05)
horiz.errbar(fa.rate,hit.rate,height=.05,width=std.err.fa)
An alternative approach is to draw standard errors on more substantive
parameters. Figure 4.9B provides an example for the zROC curve. The
fitted model is a signal detection model with a common sensitivity estimate
for all three conditions. The resulting isosensitivity curve is the solid line.
Standard errors were derived from optim with the option hessian=T. For the
data in Table 4.1, the estimate of sensitivity is 0.56 and its standard error is
0.077. These standard errors are plotted as parallel isosensitivity curves and
are denoted with dotted lines. The code for drawing these lines is
dprime.est=.56  #as derived from an optim call
std.err=.077    #as derived from an optim call
plot(qnorm(fa.rate),qnorm(hit.rate))
abline(a=dprime.est,b=1) #isosensitivity curve
abline(a=dprime.est+std.err,b=1,lty=2) #plus standard error
abline(a=dprime.est-std.err,b=1,lty=2) #minus standard error
These two approaches to placing standard errors should not be used on
the same plot. Placing standard errors on both hits and false alarms and on
substantive parameters is often confusing and we recommend using only one
of these approaches. A good rule-of-thumb is that standard errors should be
placed on hit and false-alarm rates when several different models are being
compared, whereas standard errors should be placed on model parameters
when a specific hypothesis is being tested.
Problem 4.3.1 (Your Turn)
Using the hypothetical data in Table 3.4, create two plots:
1. Create an ROC plot with standard errors on hit and false alarms.
2. Create a zROC plot with standard errors on d′ .
These plots should resemble Figure 4.9.
Chapter 5
Advanced Threshold and Signal-Detection Models
In the previous two chapters we discussed basic models of task performance.
Unfortunately, there are many examples in the literature in which these basic
models fail to account for empirically observed ROC functions (e.g., Luce,
1963; Ratcliff, Sheu, & Gronlund, 1993). In this chapter, we expand our coverage to more flexible and powerful models. The first model we discuss is the general high-threshold model, which is a generalization of both the high-threshold and double high-threshold models. The second model is a generalization of
the signal detection model. The final model, Luce’s low-threshold model,
introduces the idea that people can perceive stimuli that are not present.
The models introduced in the previous chapter have two basic parameters:
one for detection or sensitivity and another for guessing or response bias.
The models introduced in this chapter have three basic parameters: two for
sensitivity and a third for response bias. Three-parameter models cannot
be fit to a single condition of a signal-detection experiment because each
condition provides only two independent observations: the numbers of hits
and false alarms. Instead, these models are estimated with observations from
many conditions. In order to demonstrate analysis, we consider the payoff
experiment example of Table 3.4. In this hypothetical experiment, the stimuli
are constant across conditions, hence parameters describing sensitivity should
not vary. Models that are consistent with invariance of sensitivity parameters
are appropriate whereas those that are inconsistent are not.
Although these new models are more flexible, they present two problems
in analysis. First, the methods we have used for numerically minimizing
the likelihood fail for these models. They do not return parameters that
maximize likelihood. Second, the previously introduced techniques for model
comparison are insufficient for the three new models. None of the new models
is a restriction of another; e.g., the models are not nested. The likelihood
ratio statistic, while appropriate for nested models, is not appropriate for
non-nested ones. We discuss an alternative method, based on the Akaike
Information Criterion (AIC, Akaike, 1974), to make these comparisons. The
remainder of the chapter is divided into three sections: a discussion of more
advanced numerical techniques for improving optimization, a discussion of
the three new models, and a discussion of nonnested model comparisons.
5.1 Improving Optimization

5.1.1 The Problem
The likelihoods of the models we consider in this chapter cannot be maximized with the methods of the previous chapters without modification. The
problem is that with default settings, the optim() function often fails to
find the maximum in problems with many parameters. To demonstrate this
failure, we start with a simple optimization problem. Suppose we have a set
of observations y1 , ..., yN and a set of parameters θ1 , ..., θN . We wish to find
the best values of θ1 ...θN that minimize the following function h:
h = Σ_{i=1}^{n} (yi − θi)².
The function is a sum-of-squared-differences formula. It is at a minimum
when all the differences are zero; i.e., when θ1 = y1 , θ2 = y2 , ..., θn = yn . If
θi = yi, then the sum of squared differences is h = 0. Function optim() does
a good job of finding this minimum for a handful of parameters. The following
code serves as an example. It minimizes h with respect to parameters θ1 , .., θ4
for four pieces of data: y1 = 3, y2 = 8, y3 = 13, y4 = 18.
h=function(theta,y) return(sum((theta-y)^2))
y=c(3,8,13,18)
par=rep(10,4) #starting values
optim(par,h,y=y)
The results are:
$par
[1] 3.000866 8.000700 13.000482 17.999283
$value
[1] 1.986393e-06
As expected, the minimum of h is very close to h = 0 and parameter values
are nearly equal to their respective data points.
The results are not so good for twenty parameters. Consider the following
code:
y=1:20 #integers from 1 to 20
par=rep(10,20) #starting values
optim(par,h,y=y)
Results are:
$par
[1] 0.3535526 1.7091429 4.1538558 3.0865141 5.8710791 5.2499118
[7] 7.8665955 8.8864199 7.8365200 9.2342382 11.6387186 13.8285510
[13] 13.6197244 13.8850712 12.5633528 15.3360254 16.7935952 16.6734207
[19] 19.3004116 19.8297374
$value
[1] 19.91514
These results are troubling as the estimates are surprisingly far from their
true values, and the function minimizes to a value of 19.9 instead of 0.
This example demonstrates that the optim() function with default settings is unable to handle problems with more than a few parameters; it is not foolproof. In the following sections, we explore a few strategies to increase the accuracy of optimization.
5.1.2 Nested Optimization
One good approach to increasing the accuracy of optimization is to frame
the analysis so that it involves several separate optimizations with each being
over a smaller number of parameters. Consider the above example with the
function h and twenty parameters. Notice that the value of θ1 that minimizes
h is a function of only y1 and not of the other nineteen data points. This fact
holds analogously for the other parameters as well—the appropriate value of
each parameter depends on a single data point. Consider the following code
for minimization that takes advantage of this fact:
h=function(theta,obs) (theta-obs)^2 #to-be-minimized function
y=1:20
par.est=rep(0,20) #reserve space
min=0
for (i in 1:20)
{
est=optimize(h,interval=c(-100,100),obs=y[i])
min=min+est$objective
par.est[i]=est$minimum #store the estimates
}
In this code, we call optimize twenty times from within the loop. Each pass
through the loop optimizes a single parameter. The resulting parameter values are stored in the vector par.est. The minimum of the function is stored
in min. The results are that the estimated values (par.est) are nearly the
true values and the function minimizes to nearly 0. For this case, performing
20 one-parameter optimizations is far more accurate than performing a single
twenty-parameter optimization.
This strategy of framing analysis so that it involves multiple optimizations
with smaller numbers of parameters is often natural in psychological contexts.
Consider the analysis of the high-threshold model for payoffs data (Table 3.4)
as an example. The model reflecting selective influence has six parameters:
(d, gA , gB , gC , gD , gE ). We start by assuming the true value of d. Of course,
this assumption is unwarranted. We only use the assumption to get started
and will soon dispense of it during estimation. If the true value of d is known,
then estimation of gA only depends on data in Condition A, estimation of gB
only depends on the data of Condition B and so on. This fact leads naturally
to multiple optimization calls. The first step is to compute the likelihood for
g in a single condition given a fixed value of d:
#negative log likelihood for high-threshold model
#one condition, function of g for given d
#y=c(hit,miss,fa,cr) for one condition
nll.ht.given.d=function(g,y,d)
{
  p=1:4            # reserve space
  p[1]=d+(1-d)*g   # probability of a hit
  p[2]=1-p[1]      # probability of a miss
  p[3]=g           # probability of a false alarm
  p[4]=1-p[3]      # probability of a correct rejection
  return(-sum(y*log(p)))
}
The body of the function is identical to that in Section 3.2.3; the difference
is how the parameters are passed. The function in Section 3.2.3 is minimized
with respect to two parameters. The current function will be minimized with
respect to g alone. The minimized negative log likelihood across all five conditions for a known value of d is:
#d is detection parameter
#dat is hA,mA,faA,cA,...hE,mE,faE,cE
nll.ht=function(d,dat)
{
return(
optimize(nll.ht.given.d,interval=c(0,1),y=dat[1:4],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[5:8],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[9:12],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[13:16],d=d)$objective+
optimize(nll.ht.given.d,interval=c(0,1),y=dat[17:20],d=d)$objective
)
}
The function nll.ht can be called for any value of d, and when it is called, it
performs five one-parameter optimizations. Of course, we wish to estimate d
rather than assume it. This can be done by optimizing nll.ht with respect
to d. Here is the code for the data in Table 3.4:
dat=c(404,96,301,199,348,152,235,265,
287,213,183,317,251,249,102,398,148,352,20,480)
g=optimize(nll.ht,interval=c(0,1),dat=dat)
The result of this last optimization provides the minimum of the negative log
likelihood as well as the ML estimate of d. The minimum here is 2901.941,
which is probably lower than the negative log likelihood you found for the
restricted model in Your Turn 3.4.1 (depending on the starting values you
chose). To find the ML estimates of the guessing parameters, each of the
optimization statements in nll.ht may be called with the ML estimate of d.
We call this approach nested optimization. Optimization of parameters
specific to conditions (gA , gB , .., gE ) is nested within parameters common
across all conditions (d). Overall nested optimization often provides for more
accurate results than a large single optimization. The disadvantage of nested
optimization is that it does not immediately provide standard error estimates
for parameters. One strategy is to do two separate optimizations. The first
of these is with nested optimizations. Afterward, the obtained ML estimates
can be used as starting values for a single optimization of all parameters with
a single optim() call. In this case, optim() should return these starting values as parameter estimates. In addition, it will return the Hessian which can
be used to estimate standard errors as discussed in Chapter 2.
Problem 5.1.1 (Your Turn)
1. Use nested optimization to estimate parameters of the high-threshold model for multiple conditions with a common detection parameter. The code in this section provides a partial solution: it returns the common detection parameter d. It does not return the guessing parameters (gA, gB, ..., gE). You will need to modify the code
to return these estimates. Test your code with the data of Table 3.4 and compare the results with those found in Your Turn
4.2.1. Estimate standard errors for the six parameters.
2. Use nested optimization to estimate the signal detection model
with common d′ . Estimate all six parameters and their standard
errors.
5.1.3 Parameter Transformations
When optimizing log likelihood for the high-threshold and double high-threshold
models in Chapter 3, you may have noticed that R reported some warnings.
The warnings reflect a mismatch between the models and the minimization.
In the models, parameters are probabilities and must be constrained between zero and one. The default algorithm in optim(), however, assumes
that parameters can take on any real value. In the course of optimization,
simplex may try to evaluate the log likelihood for an invalid value, for example, d = −.5. When this happens, the logarithms are undefined and optim()
reports the condition as a warning and goes on optimizing.
There are two disadvantages to having this mismatch. The first is that
optim() may return an invalid parameter value. For example, in Chapter
3 we presented a high-threshold model with a common guessing parameter
(Model 3, page 68). Unfortunately, the obtained ML estimate of d2 was
negative. We previously ignored this transgression, but that is not an ideal
solution. The second disadvantage is that having a mismatch often results in
longer and less accurate optimization. It takes time to evaluate functions with
invalid parameter values and many such calls will lead to inaccurate results.
Parameter transformations are a general and easy-to-implement solution.
The basic idea is to construct a mapping from all real values into valid
ones. Figure 5.1 provides an example for the high-threshold model. The
x-axis shows all real values; the y-axis shows the valid values between zero
and one. We allow optim() algorithm to choose any value of the x axis, and
term this value z. For example, in optimizing the high-threshold model, the
algorithm might choose to try a value z = −1. Instead of trying to evaluate
the log likelihood with this value, we evaluate the log likelihood with the
value on the y axis, in this case, .27.
The function in Figure 5.1 is

p = 1/(1 + e^{−z}).   (5.1)

This particular transform is the logit or log-odds transform of a probability parameter. The inverse mapping from p to z is given by:

z = log(p/(1 − p)).   (5.2)
These two equations are built into R. Equation 5.1 may be evaluated with
the plogis() function; Equation 5.2 may be evaluated with the qlogis()
Figure 5.1: The logit transform.
function. Here is an example of using this transform for the high-threshold model. We pass the vector z of unconstrained parameter values to the function and then transform them to valid values with plogis().
# negative log likelihood, high-threshold model
#par ranges across all reals
nll=function(z,y)
{
par=plogis(z) #transform to [0,1]
d=par[1]
g=par[2]
p=1:4 # reserve space
p[1]=d+(1-d)*g #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=g # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
This function may be called as usual:
y=c(75,25,30,20)
z=c(qlogis(.5),qlogis(.5)) #ranges from -infty to infty
results=optim(z,nll,y=y)
plogis(results$par)
Function optim() evaluates the log likelihood function with various values
of z that are free to vary across the reals. The results are for z, which ranges
across all reals. To interpret these values, they should also be transformed
to probabilities; this is done with the plogis(result$par) statement.
Rerun the code from Chapter 3 and then run the above code for comparison. The parameter estimates are almost identical. Look at the $counts
returned by optim(). It is lower for the current code (45 evaluations) than
for that in Chapter 3 (129 evaluations). When we transformed parameters, R
required only one-third the evaluation calls, saving time in the optimization
process.
Problem 5.1.2 (Your Turn)
In our discussion of the high-threshold model (Chapter 3), we presented
a model we called “Model 3” with a common guessing parameter. Unfortunately, the obtained ML estimate of d2 was negative. Use the
transformed-parameters strategy to re-estimate this model. Compare
the results with those presented in Chapter 3.
5.1.4 Convergence
Optimization routines work by repeatedly evaluating the function until a
minimum is found. By default, optim() continues to search for better parameter values until one of two conditions is met: (1) new iterations do not lower the value of the to-be-minimized function much, or (2) a maximum number of iterations is reached. If optim() reaches this maximum number
of iterations, then it is said to have not converged and optim() returns a
value of 1 for the $convergence field. If the algorithm stops before this
maximum number of iterations because new iterations do not lower the function value much, then the algorithm is said to have converged and a value of
0 is returned in the convergence field. The maximum number of iterations
defaults to 500 for the default algorithm in optim().
Convergence should be monitored. If convergence is not obtained, it is
possible that the parameter estimates found are not the maximum likelihood estimates. Sometimes the brute force method of raising the maximum
number of iterations is effective. The maxit option in optim() controls this
maximum. Let’s consider the previous example in which a single optim()
P
2
call is used to minimize the function h = 20
i=1 (yi − θi ) . To increase the
maximum number of iterations, the following call is made:
y=1:20
par=rep(10,20)
optim(par,h,control=list(maxit=10000),y=y) #takes a while
We achieve convergence and the estimates are much better than for 500 iterations (although the function minimizes to .08 instead of zero). A related
approach for convergence is to run the algorithm repeatedly. For each repetition, the previous parameter values serve as the new starting values. For
example,
par=rep(10,20)
a1=optim(par,h,control=list(maxit=10000),y=y)
a2=optim(a1$par,h,control=list(maxit=10000),y=y)
The results of the second call are reasonable (the function minimizes to .006
instead of zero).
The advantage of these brute-force approaches of raising the number of
function evaluations is that they are trivially easy to implement. The disadvantage is that they often are not as effective as nesting optimization and
transforming parameters.
5.1.5 An Alternative Algorithm
The default algorithm in optim() is simplex (Nelder & Mead, 1965), which is known for its versatility and robustness (e.g., Press et al., 1992). The simplex algorithm, however, is one of many algorithms for function optimization.¹

¹The scope of this book precludes a discussion of the theory behind and comparisons between optimization methods. The interested reader is referred to Press et al. (1992) and Nocedal and Wright (1999).
We have used simplex within optim() because it is known to work moderately well for problems with constraints, such as those in which parameters
are constrained to be between zero and one. There are other choices in R
and the function nlm() is often useful. Suppose we wish to minimize the function h = Σi (yi − θi)² for 200 parameters rather than 4 or 20. This is impossible with a single simplex call, yet it works quickly with a single nlm() call:
y=1:200
par=rep(100,200)
optim(par,h,y=y)
nlm(h,par,y=y)
The outputted estimates are accurate and the function h is minimized to
zero. The syntax of nlm() closely resembles that in optimize() and may be
seen with the statement help(nlm). Function nlm() returns the fields $minimum and $estimate, the minimized value and the parameter estimates, respectively. Field $code provides diagnostics, which are covered in help(nlm). Function nlm() also returns the Hessian for standard errors when called with hessian=TRUE.
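Continuing the example above, the returned fields may be accessed as follows; the hessian=TRUE option adds the Hessian to the output (and slows the call somewhat for 200 parameters).

fit=nlm(h,par,y=y,hessian=TRUE)
fit$minimum          #essentially zero
fit$estimate[1:5]    #first few estimates, close to 1, 2, 3, 4, 5
fit$code             #convergence diagnostic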
Function nlm() has some drawbacks. It often fails when parameters are
constrained. It cannot be used, for example, with the high-threshold model
without parameter transformation. We have found that it sometimes fails
even after parameters are transformed. When nlm() fails, it tends to return
implausible parameter estimates. This is an advantage as the poor quality of
estimation is easily detected. One effective optimization strategy is to combine optim() (with simplex) and nlm(). First run optim() to get somewhat
close to ML values, and then run nlm() to find the true minimum.
5.1.6 Caveats
Optimization is far from foolproof. The lack of an all-purpose, sure-fire numerical optimizer is perhaps the most significant drawback to the numerical
approach we advocate. As a result, it is incumbent on the researcher to
use numerical optimization with care and wisdom. We recommend that a
researcher consider additional safeguards to understand the quality of their
optimizations. Here are a few:
• Repeat optimization with different starting points. It is a good
idea to repeat optimization from a number of different starting values.
Hopefully, many starting values lead to the same minimum.
• Fix some parameters. Once a minimization is found, it is often easy
to fix some of the parameters and re-fit the remaining ones as free.
For example, in the high-threshold model, we can fix detection to the
estimated value of d and re-fit a model in which guessing parameters
are free to vary. If the original optimization is good, the resulting
parameter estimates for guessing should be identical in the original fit
and in the re-fit.
• Examine simulated data. Once a minimization has been found, the parameter estimates can then be used to generate a large data set. The model can then be fit to this new, artificial data set. The difference between the parameter estimates from the empirical data and those from the artificial data should not only be small, but should decrease as the sample size of the artificial data is increased.
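The following is a minimal sketch of the first safeguard, assuming the 20-parameter function h() and the data y from earlier in this section; the starting ranges are arbitrary.

#Repeat optimization from several random starting points; trustworthy
#optimizations should reach (nearly) the same minimum
set.seed(1)
values=replicate(10,optim(runif(20,0,20),h,control=list(maxit=10000),y=y)$value)
round(values,3) #hopefully all values are (nearly) equal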
5.2
General High-Threshold Model
With the discussion of optimization complete, we return to models of psychological process. The first of the three models presented in this chapter is
the general high-threshold model (Figure 5.2). The three parameters are the
sensitivity to signal (ds ), the sensitivity to noise (dn ), and the guessing bias
(g). The equations for hits and false alarms are
ph = ds + (1 − ds)g,    (5.3)
pf = (1 − dn)g.    (5.4)
This model is a generalization of both the high-threshold model and
the double high-threshold model. It reduces to the high-threshold model
if dn = 0, and it reduces to a double high-threshold model if ds = dn . The
isosensitivity curve on the ROC plot is obtained by varying g while keeping
ds and dn constant. The predicted curve is a straight line with y-intercept
at ds and slope of (1 − ds )/(1 − dn ). Two examples of isosensitivity curves
are drawn in Figure 5.3A. In this chapter, we fit all of the models to the
hypothetical data from the payoff experiment in Table 3.4. The model we fit
has selective invariance of the detection parameters dn and ds . There are a
total of seven parameters (ds , dn , gA , gB , gC , gD , gE ) across the five conditions.
It is helpful to fit the model with the nested optimization strategy. First,
log likelihood of g is expressed for known detection parameters (ds , dn ).
Figure 5.2: The general high-threshold model.
Figure 5.3: Isosensitivity curves for three models. A: General high-threshold model. Lines 1 and 2 have detection parameters (ds = .5, dn = .3) and (ds = .7, dn = .15), respectively. The former is closer to a double high-threshold model while the latter is closer to a high-threshold model. B: Unequal-variance signal detection model. Lines 1 and 2 have parameters (d′ = 1, σ = 1.4) and (d′ = .25, σ = .8), respectively. C: Low-threshold model. Lines 1 and 2 have detection parameters (ds = .6, dn = .3) and (ds = .8, dn = .15), respectively.
#negative log likelihood for general threshold model, 1 condition
#det=c(ds,dn)
#y=c(hit,miss,fa,cr)
#g=guessing
nll.ght.1=function(g,y,det)
{
ds=det[1]
dn=det[2]
p=1:4 # reserve space
p[1]=ds+(1-ds)*g #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=(1-dn)*g # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
We find the best value of g for each condition with optimize(). Because
optimize() works well on restricted intervals such as [0, 1], there is no reason
to transform these guessing parameters. The negative log likelihood across all conditions is obtained by adding the negative log likelihoods from the individual conditions. Note the use of transformed detection parameters.
#log likelihood of general high-threshold model
#dat=c(h1,m1,f1,c1,...,h5,m5,f5,c5)
#zdet=logit-transformed detection parameters
nll.ght=function(zdet,dat)
{
det=plogis(zdet)
return(
optimize(nll.ght.1,interval=c(0,1),y=dat[1:4],det=det)$objective+
optimize(nll.ght.1,interval=c(0,1),y=dat[5:8],det=det)$objective+
optimize(nll.ght.1,interval=c(0,1),y=dat[9:12],det=det)$objective+
optimize(nll.ght.1,interval=c(0,1),y=dat[13:16],det=det)$objective+
optimize(nll.ght.1,interval=c(0,1),y=dat[17:20],det=det)$objective
)
}
Next, the log likelihood for the model is maximized by finding the appropriate detection parameters (ds , dn ). Both nlm() and optim() work well with
transformed detection parameters in this application.
dat=c(404,96,301,199,348,152,235,265,
287,213,183,317,251,249,102,398,148,352,20,480)
zdet=rep(0,2) #starting values of zdet
est=optim(zdet,nll.ght,dat=dat)
plogis(est$par) #estimates of (ds,dn) on the probability scale
We will discuss the results after introducing the remaining models.
5.3
Signal Detection with Unequal Variance
The signal detection model in the previous chapter assumed that the variance
of the tone-present strength distribution was the same as the tone-absent
strength distribution. A more general model is one in which these variances
are not assumed equal. The variance of the tone-absent distribution is still set
to 1, but the variance of the tone-present distribution is free and denoted as
σ 2 . The model is called the free-variance signal-detection model. Figure 5.4
provides a graphical representation.
The probabilities of hits and false alarms are:

pf = 1 − Φ(c),    (5.5)
ph = 1 − F(c, d′, σ²),    (5.6)

where F(x, µ, σ²) is the CDF for a normal with parameters (µ, σ²). It is conventional to rewrite this equation in terms of the CDF for the standard normal: F(x, µ, σ²) = Φ([x − µ]/σ). Therefore,

ph = 1 − Φ([c − d′]/σ).    (5.7)
Even though it is more conventional to express the model in terms of the
standard normal, it is more convenient not to do so in the R implementation.
The model reduces to the equal-variance signal detection model if σ 2 = 1.
The isosensitivity ROC curve is obtained by varying c while keeping d′ and
σ² constant. The predicted curve is curvilinear and passes through the origin and (1,1); two examples of isosensitivity curves are drawn in Figure 5.3B.

Figure 5.4: The free-variance signal detection model.

For the five conditions in Table 3.4, the model has 7 parameters
(d′, σ², c1, c2, c3, c4, c5). The log likelihood for known d′ and σ² in a single condition may be evaluated in R:
#negative log likelihood of free-variance signal detection model
#par=c(d,sigma)
#y=c(hits,misses,fa’s,cr’s)
nll.fvsd.1=function(c,y,par)
{
d=par[1]
sigma=par[2]
p=1:4 # reserve space
p[1]=1-pnorm(c,d,sigma) #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=1-pnorm(c) # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
Problem 5.3.1 (Your Turn)
Fit the free variance signal detection model to the data in Table 3.4.
Use a common d′ , a common σ 2 , and a separate c parameter for each
condition. Be sure to use the nested optimization strategy. Do not use
transformed parameters as the parameters of the free-variance signal-detection model are not restricted. Functions optim() and nlm()
should give identical answers.
5.4
Low-Threshold Model
Luce (1963) proposed a qualitatively different model for the detection of tones
than the general high-threshold or signal-detection model. He noted that
empirically obtained ROC curves for auditory detection were not straight
lines, and hence inconsistent with the high-threshold model. Yet, based on
other experiments in which participants needed to both detect tones and
then identify their frequency, Luce concluded that decisions were based on
all-or-none representations of information. Luce proposed the low-threshold
model as a threshold model that does not predict straight-lined ROCs. This
model is similar to the general high-threshold model in that perception is
assumed all-or-none. In the general high-threshold model, it is assumed that
either the correct stimulus is detected or that the participant guesses. In the
low-threshold model, by contrast, it is assumed that people can misperceive
stimuli—that is they can detect a tone’s presence even when it is absent.
When the tone is presented, the participant either detects the tone or
detects its absence. There is no guessing in this model. The probability that
the participant detects the tone is ds and dn for signal and noise stimuli,
respectively. Parameters ds and dn are sensitivity parameters and do not
reflect bias from manipulations such as payoffs. Of course, responses do
depend on payoffs and an additional specification is needed. The simplest
model is to propose that responses are biased by an amount b:
ph = ds + b,
pf = dn + b.
In this case, there is a bias toward “signal present” responses if b is positive
and one toward “signal absent” responses if b is negative. This model, however, has a logical flaw. To see this, consider the effect of Condition E, in
which participants earn 10c for every correct rejection but only 1c for every
hit. In this case, both hit and false-alarm rates should be quite low and b
should be quite negative. If dn is less than the magnitude of a negative b, then the predicted false alarm rate will be negative. For example, if b = −.3 and dn = .1, then the predicted false-alarm rate is −.2.
In Luce's low-threshold model, bias is a relative fraction rather than an absolute amount. Suppose ds = .7. The largest conceivable positive bias is .3; any more bias would lead to a hit rate greater than 1. The bias parameter in Luce's model indexes the fraction of this largest effect. For example, if the bias is b = .1, the effect is 10% of the largest conceivable bias amount. For the case ds = .7, the hit rate is .7 + (.1)(.3) = .73. For the same example, suppose dn = .2. The largest conceivable amount of positive bias is .8; any more would result in false-alarm rates above 1.0. If b = .1, then the false-alarm rate is .2 + (.1)(.8) = .28. The hit and false-alarm
probabilities are
ph = ds + (1 − ds)b   if 0 ≤ b ≤ 1,   and   ph = ds + ds b   if −1 ≤ b < 0,    (5.8)
pf = dn + (1 − dn)b   if 0 ≤ b ≤ 1,   and   pf = dn + dn b   if −1 ≤ b < 0.    (5.9)
The relative bias, b, varies between −1 and 1.
Figure 5.3C shows isosensitivity predictions for this model. These are
obtained by keeping dn and ds constant while varying the bias parameter b.
The isosensitivity curve consists of two straight lines: one from the origin to
point (dn , ds ) and the other from point (dn , ds ) to the point (1, 1). The line
starting at the origin is termed the lower limb and results from b < 0. The
line ending at (1, 1) is termed the upper limb and results from b > 0. The
point (dn , ds ) is obtained when b = 0.
To implement the low-threshold model in R, it is convenient to first define
the following function that describes the amount of bias as a function of p
and b.
amount.of.bias=function(b,p)
ifelse(b>0,(1-p)*b,p*b)
The ifelse function has been introduced earlier (p.xx). If b > 0 then the
second argument ((1-p)*b) is evaluated and returned; otherwise the third
argument (p*b) is evaluated and returned.
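As a quick check, the function reproduces the worked example above (ds = .7, dn = .2, b = .1):

#Check the worked example: ds=.7, dn=.2, b=.1
.7+amount.of.bias(.1,.7) #hit rate, .73
.2+amount.of.bias(.1,.2) #false-alarm rate, .28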
With this function, it is straightforward to implement a function that
evaluates the log likelihood for a single condition. The following is the log
likelihood of b for known detection parameters (ds , dn ).
#low-threshold model
#det=c(ds,dn)
#y=c(hits,misses,fa’s,cr’s)
nll.lt.1=function(b,y,det)
{
ds=det[1]
dn=det[2]
p=1:4 # reserve space
p[1]=ds+amount.of.bias(b,ds) #probability of a hit
p[2]=1-p[1] # probability of a miss
p[3]=dn+amount.of.bias(b,dn) # probability of a false alarm
p[4] = 1-p[3] #probability of a correct rejection
return(-sum(y*log(p)))
}
Problem 5.4.1 (Your Turn)
Fit the low-threshold model to the data in Table 3.4. Be sure to use
nested optimization. Also, be sure to use transformed parameters for
ds and dn .
[Figure 5.5 here. The model hierarchy and fit statistics shown in the figure are:
Binomial model (10 parameters): log L = −32.57, AIC = 85.15.
General high-threshold model (7 parameters): log L = −34.57, AIC = 83.13; G2 = 3.99 against the binomial model.
Free-variance signal detection model (7 parameters): log L = −36.80, AIC = 87.60; G2 = 8.45* against the binomial model.
Low-threshold model (7 parameters): log L = −36.70, AIC = 87.40; G2 = 8.25* against the binomial model.
High-threshold model (6 parameters): log L = −44.40, AIC = 100.80; G2 = 19.66* against the general high-threshold model.
Double high-threshold model (6 parameters): log L = −36.21, AIC = 84.43; G2 = 3.30 against the general high-threshold model.
Equal-variance signal detection model (6 parameters): log L = −48.03, AIC = 108.07; G2 = 22.47* against the free-variance signal detection model.]
Figure 5.5: A hierarchy of seven models for the payoff experiment in Table 3.4. Included are model comparison statistics G2 , log likelihood, and
AIC.
5.5
Nested-Model and AIC Analyses
We have now discussed seven different models of the payoff experiment of
Table 3.4. These models are expressed as a tree in Figure 5.5. At the top
level is a binomial model with a separate ph and pf probability parameter
for each condition. There are ten free parameters for the five conditions.
This model is the most general and all other models are nested within it. At
the next level are the general high-threshold model, the free-variance signal
detection model, and the low-threshold model. Each of these models has
seven parameters. At the bottom level are the six-parameter models: the high-threshold model, the double high-threshold model, and the equal-variance signal detection model. The lines between models depict nesting relationships; e.g., the double high-threshold model is nested within the general high-threshold model
but not within the free-variance signal detection model.
The data are plotted as points in an ROC plot in Figure 5.6A. The lines are the best-fitting isosensitivity curves of the three high-threshold models. Figure 5.6B presents the same data; the lines are the best-fitting isosensitivity curves of the two signal detection models and the low-threshold model. It is obvious that there are poor fits for the high-threshold model and the equal-variance signal detection model.
Figure 5.6: ROC plots of the data and model predictions. Left: Isosensitivity predictions of the general high-threshold, high-threshold, and double high-threshold models. Right: Isosensitivity predictions of the low-threshold, free-variance signal detection, and equal-variance signal detection models. Standard errors on hit and false alarm rates are never bigger than .022.
It is difficult to draw further conclusions
from inspection.
Comparison of nested models may be made with the log likelihood ratio test, as discussed in previous chapters. Values of G2 between nested models are indicated in Figure 5.5 for these comparisons. Values with asterisks are significant, indicating that the corresponding restricted model is inappropriate. According to this analysis, the only appropriate models are the general high-threshold model and the double high-threshold model; neither of these restrictions can be rejected relative to its more general model.
For this application, it is not necessary to compare across non-nested models to decide which is the most parsimonious. In general, however, it is often helpful to make such comparisons. There are a number of approaches discussed in the modern statistics literature. We describe the Akaike Information Criterion (AIC) approach (Akaike, 1973) because it has been recommended in psychology (e.g., Ashby and Ells, 2003) and is convenient in a likelihood framework. There are other approaches to comparing non-nested
models, including the Bayesian information criterion (BIC; Schwarz, 1978) and the Bayes factor (Kass & Raftery, 1995; Myung & Pitt, 1997). These other methods are more complex and are outside the scope of this book.
The AIC measure for a model is:
AIC = −2 log L(θ∗) + 2M,    (5.10)
where L is the likelihood of the model, θ∗ are the MLEs of the parameters,
and M is the number of parameters. The lower the AIC, the better the
model fit. The model with the lowest AIC measure is selected as the most
parsimonious. AIC measures for the seven models are shown in Figure 5.5.
The general high-threshold model is the most parsimonious, followed by the
double high-threshold model and the binomial model.
Consider the AIC measure for two models that have the same number
of parameters. In this case, the model with the lower AIC score is the one
with the higher log likelihood. This is a reasonable boundary condition. Log
likelihood values, however, are insufficient when two models have different
numbers of parameters. As discussed in Chapter 2, models with more parameters tend to have higher log likelihoods. For example, the binomial model
with ten parameters will always have higher likelihood than any of its restrictions simply because it has a greater number of parameters. The AIC
measure accounts for the number of parameters by penalizing models with
more parameters. For each additional parameter, the AIC score is raised by
2 points (this value of 2 is not arbitrary; it is derived from statistical theory).
Because of this penalty, the AIC score for the general high-threshold model
is lower than that of the binomial model, even though the latter has greater
log likelihood.
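As a small illustration, AIC may be computed directly from a minimized negative log likelihood, which is what optim() and nlm() return; the values below are those shown in Figure 5.5.

#AIC from a minimized negative log likelihood (nll) and M free parameters
aic=function(nll,M) 2*nll+2*M
aic(34.57,7) #general high-threshold model, about 83.1
aic(32.57,10) #binomial model, about 85.1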
The AIC and likelihood ratio test analyses concord fairly well in that they both favor the general high-threshold and double high-threshold models over the other competitors. They appear to disagree, however, on which of these two is the most appropriate. According to the likelihood ratio test, the double high-threshold model may not be rejected in favor of the general high-threshold model. In contrast, according to AIC, the general high-threshold model is preferred over the double high-threshold model. The disagreement is more apparent than real because the analyses have different logical bases. The likelihood ratio test is vested in the logic of null hypothesis testing. In this case, the double high-threshold restriction (ds = dn) serves as the null hypothesis and we do not have sufficient evidence to reject it at the .05
level. We do have some evidence, however, that it is not fitting perfectly well. The expected value of a chi-square with one degree of freedom is 1.0. The obtained value of G2, 3.3, is reasonably large. While it is not sufficiently great to reject the restriction at the .05 level, it is at the .1 level (try 1-pchisq(3.3,1)). The AIC measure reflects this information and
accords an advantage to the general high-threshold model. A statement that
reconciles these two approaches is that while the evidence favors the general
high-threshold model, it is not sufficient to reject the double high-threshold
model restriction.
Chapter 6
Multinomial Models
In the previous chapters, we focused on paradigms in which participants were presented with two response options. Here, we consider paradigms in which participants are presented with more than two options. One example is a confidence-ratings task. In this task, participants indicate their confidence in their judgments by choosing one of several options such as, "I have low confidence." The binomial is not the appropriate distribution for this paradigm because confidence ratings span several values. Instead, we use the multinomial distribution, a generalization of the binomial suitable for more than two
responses. After presenting the multinomial, we present several psychological models: the signal detection model of confidence, Rouder & Batchelder’s
storage-retrieval model (Rouder & Batchelder, 1998), Luce’s similarity choice
model (Luce, 1963b), multidimensional scaling models (Shepard, Romney, &
Nerlove, 1967), and Nosofsky’s generalized context model (Nosofsky, 1986).
Each of these models may be implemented as substantive restrictions on the
general multinomial model.
6.1
Multinomial distribution
The confidence-ratings task provides a suitable paradigm for presenting the
multinomial distribution. In the simplest case, participants are presented
one of two stimuli, such as tones embedded in noise or noise alone. The
participant chooses from a set of options as indicated in Table 6.1. The
table also provides hypothetical data. The multinomial model is applicable
to the results from a particular stimulus. The top row of Table 6.1 shows data from the tone-in-noise stimuli.
                          Response
                 Tone Absent            Tone Present
Stimulus         High  Medium  Low      Low  Medium  High    Total
Tone in Noise       3       7   14       24      32    12       92
Noise Alone        12      16   21       15      10     6       80

Table 6.1: Hypothetical data from a confidence-ratings task.
The outcomes of N tone-in-noise trials may be described by values y1, y2, ..., yI, where I is the number of options and yi is the number of trials on which the ith option was chosen. For the top row of Table 6.1, I = 6, y1 = 3, y2 = 7, y3 = 14, y4 = 24, y5 = 32, and y6 = 12. The sum of all counts must equal the number of trials N; i.e., Σi yi = N. We let random variable Yi denote the frequency count for each category. After data are obtained, values yi are realizations of Yi. Because N is always known, one of the counts may be calculated from the others; e.g., yI = N − Σ_{i=1}^{I−1} yi. For Table 6.1, there are five independent pieces of data for each stimulus.
In Chapter 2, we introduced the concept of a joint probability mass function. The joint probability mass function of random variables X and Y was f(x, y) = Pr(X = x and Y = y). The multinomial distribution is a joint distribution over random variables Y1, Y2, ..., YI. It has I probability parameters p1, .., pI, where pi denotes the probability that the response on a trial is the ith option. The multinomial probability mass function is

f(y1, y2, .., yI) = [N! / ∏_{i=1}^{I} yi!] ∏_{i=1}^{I} pi^{yi},    (6.1)

where ∏ denotes the product of the terms and is analogous to Σ for sums. On each trial, one of the response options must be chosen; hence, Σi pi = 1. Consequently, one of the probability parameters may be calculated from the others. There are, therefore, I − 1 free parameters in the distribution. When a sequence of random variables Y1, .., YI is distributed in this manner, we write

Y1, .., YI ∼ Multinomial(p1, .., pI, N),    (6.2)

where N, as defined above, is the total number of trials.
Problem 6.1.1 (Your Turn)
Show that if I = 2, the pmf of a multinomial distribution is the same
as that of a binomial distribution.
The likelihood of the parameters given the data is obtained by rewriting the joint pmf as a function of the parameters:

L(p1, .., pI; y1, ..., yI) = [N! / ∏_{i=1}^{I} yi!] ∏_{i=1}^{I} pi^{yi}.

The log likelihood is

l(p1, .., pI; y1, ..., yI) = log N! − Σi log yi! + Σi yi log pi.    (6.3)

The terms log N! and −Σi log yi! do not depend on the parameters and may be omitted in analysis. The remaining term, Σi yi log pi, is identical to the critical term in the log likelihood of the binomial. Maximum likelihood estimates of pi can be obtained by numerical minimization or by calculus methods. The calculus methods yield:

p̂i = yi/N.    (6.4)
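As a quick illustration in R (using the tone-in-noise row of Table 6.1), the built-in function dmultinom() evaluates the multinomial pmf, and the ML estimates are simply the observed proportions:

#Tone-in-noise row of Table 6.1
y=c(3,7,14,24,32,12)
N=sum(y) #92 trials
p.hat=y/N #ML estimates of p_1,..,p_6 (Eq. 6.4)
dmultinom(y,size=N,prob=p.hat) #pmf evaluated at the MLEs (Eq. 6.1)
sum(y*log(p.hat)) #the critical term of the log likelihood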
6.2
Signal Detection Model of a Confidence-Rating Task
In the previous chapters, we assumed that payoffs affect response parameters
and not perceptual ones. By varying payoff conditions, it was possible to
construct an ROC plot of the data and overlay isosensitivity predictions of
various models. The advantage of the payoff method is that it may be used
to test many different models simultaneously (see Figure 5.5). There is a
second, more common method for achieving the same aim: ask participants
to rate their confidence. This method is somewhat easier and often less costly
to implement.
6.2.1
The model
It is straightforward to adapt a signal-detection model for confidence-ratings
data. According to the theory of signal detection, stimuli give rise to perceptual strengths. Responses are then determined by a decision bound. For
the two-choice paradigm, there is one decision bound. Strengths below the
bound produce a tone-absent response; strengths above it produce a tone-present response. Confidence-rating data are accounted for by positing I − 1
bounds. These bounds are denoted c1 , c2 , ..cI−1 with c1 ≤ c2 ≤ ... ≤ cI−1 .
The model with these bounds is shown in Figure 6.1. Strengths below c1 result in a "Tone absent with High Confidence" response; strengths between c1 and c2 result in a "Tone Absent with Medium Confidence" response, and so on.

Figure 6.1: The signal detection model for confidence ratings. The left and right distributions represent the noise and signal-plus-noise distributions, respectively. The five bounds, c1, .., c5, divide up the strengths into six intervals from "Tone absent with high confidence" to "Tone present with high confidence."
6.2.2
Analysis: General Multinomial Model
In order to analyze this model, we start with a general multinomial model.
We then implement the confidence-ratings signal detection model as a nested
restriction on the multinomial probability parameters. The data from a
confidence-rating experiment can be denoted as Yi,j where i refers to the
response category and j refers to the stimulus. Stimuli are either signal-in-noise (j = s) or noise-alone (j = n). Responses are the confidence categories
(i = 1, 2, .., I). The data in Table 6.1 serve as an example.
For the general model, the data are modeled with a pair of multinomial
distributions:
(Y1,s, .., YI,s) ∼ M(p1,s, .., pI,s, Ns),    (6.5)
(Y1,n, .., YI,n) ∼ M(p1,n, .., pI,n, Nn).    (6.6)

In this model, Ns and Nn are the numbers of signal and noise trials, respectively. Parameters pi,j are the probabilities of the ith response to the jth stimulus and are subject to the restrictions Σi pi,s = 1 and Σi pi,n = 1. ML estimates for pi,j are analogous to those for the binomial and are p̂i,j = Yi,j/Nj.
Calculation is straightforward and an example for the data of Table 6.1 is:
y.signal=c(3,7,14,24,32,12)
y.noise=c(12,16,21,15,10,6)
NS=sum(y.signal)
NN=sum(y.noise)
parest.signal=y.signal/NS
parest.noise=y.noise/NN
Although parameter estimates for the general multinomial models are
straightforward, it is useful in analyzing the signal detection model to express
the log likelihood for the multinomial model. The first step is to express the
likelihood for a single stimulus:
#negative log likelihood for one stimulus
nll.mult.1=function(p,y) -sum(y*log(p))
We assume independence holds across stimuli; hence, the overall log likelihood for both stimuli in the general model is
nll.general=nll.mult.1(parest.signal,y.signal)+
nll.mult.1(parest.noise,y.noise)
6.2.3
Analysis: The Signal Detection Restriction
The signal detection model for confidence ratings may be implemented as a set of restrictions on the multinomial probabilities. We describe here the case for the
free-variance signal-detection model. In this model, the probability of the ith response is the probability that an observation falls in the interval [ci−1, ci] (with c0 defined as −∞ and cI defined as +∞ for convenience). This probability is simply the difference in cumulative distribution functions (see Eq. 4.2). Hence:

pi,j = F(ci; µj, σj²) − F(ci−1; µj, σj²),    (6.7)
where F is the CDF of the normal distribution with mean µj and variance σj2 .
The values of µj and σj2 reflect the stimulus. For noise trials, these values are
set to 0 and 1, respectively. For signal trials, however, these values are free
parameters d′ and σ 2 , respectively. The following R function, sdprob.1(),
computes the values of pi,j for any µj , σj2 , and vector of bounds.
sdprob.1=function(mean,sd,bounds)
{
cumulative=c(0,pnorm(bounds,mean,sd),1)
p.ij=diff(cumulative)
return(p.ij)
}
To understand the code, consider the case in which the stimulus is noise alone (standard normal) and the bounds are (−2, −1, 0, 1, 2). The vector cumulative is assigned the values (0, .02, .16, .5, .84, .98, 1). The middle five values are the areas under the normal density function to the left of each of the five bounds. Eq. 6.7 describes the areas between the bounds, which are the successive differences between these cumulative values. These differences are conveniently obtained by the diff() function.
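This case may be checked directly by calling the function:

#Noise-alone case with bounds -2,-1,0,1,2; the six returned values are the
#areas between successive bounds (and outside the extreme bounds)
round(sdprob.1(0,1,c(-2,-1,0,1,2)),2)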
Now that we can compute model-based probabilities for a single condition, it is fairly straightforward to estimate d′, σ, and the bounds. We provide code that computes the (negative) log likelihood of (d′, σ², c1, .., c5) and leave it to the reader to optimize it. The following function returns the total negative log likelihood across both stimuli as a function of the model parameters:
#negative log likelihood of the free-variance signal detection model
#for the confidence-ratings paradigm
#par=c(d,sigma,bounds)
#y=y_(1,s),..,y_(I,s),y_(1,n),..,y_(I,n)
nll.sigdet=function(par,y)
{
I=length(y)/2
d=par[1]
sigma=par[2]
bounds=par[3:length(par)]
p.noise=sdprob.1(0,1,bounds)
p.signal=sdprob.1(d,sigma,bounds)
nll.signal=nll.mult.1(p.signal,y[1:I])
nll.noise=nll.mult.1(p.noise,y[(I+1):(2*I)])
return(nll.signal+nll.noise) #total negative log likelihood across both stimuli
}
Problem 6.2.1 (Your Turn)
Decide whether the free-variance signal detection model is appropriate
for the data in Table 6.1 by comparing it with the general multinomial
model with a nested likelihood ratio test.
6.3
Multinomial Process Tree Models
Batchelder and Riefer (1999; Riefer & Batchelder, 1988) have advocated a
class of multinomial models, called multinomial process tree models (MPT),
for many applications in perception and cognition. The formal definition1
of an MPT model is provided by Batchelder and Riefer (1999) and Hu and
Batchelder (1994); we provide a more informal definition and a detailed example. Informally, MPT models are models with the following three properties: (1) latent processes are assumed to be all-or-none, (2) processing at one
stage is contingent on the results of the previous stage, and (3) the models
may be expressed as “tree” diagrams. The high-threshold model meets these
properties: First, the psychological processes of detection and guessing are
all-or-none. Second, processes are contingent—guessing occurs contingent
on a detection failure. Third, the tree-structure of the model is evident in
Figure 3.1.
6.3.1
Storage-retrieval model
Rouder and Batchelder (1998) present a MPT model for separating storage
and retrieval effects in a bizarre-imagery task. Storage is the process of
forming memory traces of to-be-remembered items. Retrieval, in contrast, is
the process of recovering those traces for output. It is well known that bizarre
material, such as a dog riding a bicycle, is easier to recall than comparable
common material, such as a dog chasing a bicycle. Rouder and Batchelder
1
Hu and Batchelder (1994) describe technical restrictions on MPT models that are
often but not always met in practice. The benefit of considering these restrictions is that
if they are met, then a certain statistical algorithm, the expectation-maximization (EM)
algorithm (Dempster, Laird, and Rubin, 1977) is guaranteed to find the true maximum
likelihood estimates (Hu & Batchelder, 1994). A discussion of the restrictions and of the
EM algorithm is outside the scope of this book.
Event     Free-Recall Result           Cued-Recall Result
E1        both words free recalled     correct cued recall
E2        one word free recalled       correct cued recall
E3        no words free recalled       correct cued recall
E4        two words free recalled      incorrect cued recall
E5        one word free recalled       incorrect cued recall
E6        no words free recalled       incorrect cued recall
Table 6.2: Possible results for each word pair in the storage-retrieval experiment (Riefer & Rouder, 1992).
asked whether the mnemonic benefit of bizarre material was in storage or in
retrieval processes.
The data we analyze comes from Riefer and Rouder (1992) who presented
participants with sentences such as: “The DOG rode the BICYCLE.” Participants were asked to imagine the event described in the sentence. After a
delay, they were asked to perform two different memory tests. The first test
was a free-recall test in which participants were given a blank piece of paper
and asked to write down as many capitalized words as they can remember.
After the free-recall test, participants were given a cued-recall test. Here, the
first capitalized word of the sentence was given (e.g., DOG) and participants
were asked to write down the other capitalized word (e.g., BICYCLE). For
each sentence, the free-recall test result is scored on the number of words of a
pair recalled (either 0, 1, or 2); the cued-recall test is scored as either correct
or incorrect. The possible results of both tests are given in Table 6.2.
Riefer and Rouder (1992) argued that whereas storage is necessary for
both free recall and cued recall, retrieval is not as needed for cued recall as it
is for free recall. The cue aids retrieval of the second capitalized word allowing
for successful recall even when the participant was unable to retrieve the first
word or association on their own. Rouder and Batchelder (1998) embedded
this argument in a MPT model shown in Figure 6.2. Each cognitive process
is assumed to be all-or-none. The first branch is for associative storage,
which entails the storage of both capitalized words and their association.
Associative storage is sufficient and necessary for cued recall. If associative
storage is successful, then the participant has the opportunity to retrieve the
association and items and does so successfully with probability r. Associative
storage and retrieval are sufficient for free recall of both items and correct
cued recall. If retrieval of the association fails, participants can retrieve each
item independently as a singleton with probability s. If associative storage fails, the participants still may store and retrieve each item as a singleton with probability u.

Figure 6.2: The Rouder and Batchelder storage-retrieval model for bizarre imagery. See Table 6.2 for description of events E1, .., E6.
The equations for the model may be derived from Figure 6.2. Probabilities
on a path multiply. For example, the probability of an E3 event is P r(E3 ) =
a(1 − r)(1 − s)2 . When there are two paths to an event, the probabilities of
the paths add. For example, there are two ways to get an E1 event: either
the association is stored and retrieved (with probability ar) or associative
retrieval fails and both items are retrieved as singletons (with probability
a(1 − r)s2 ). Adding these up yields P r(E1 ) = ar + a(1 − r)s2 . The full set
of equations is:
Pr(E1) = ar + a(1 − r)s²,
Pr(E2) = a(1 − r)[2s(1 − s)],
Pr(E3) = a(1 − r)(1 − s)²,
Pr(E4) = (1 − a)u²,
Pr(E5) = (1 − a)[2u(1 − u)],
Pr(E6) = (1 − a)(1 − u)².
6.3.2
Analysis
Estimating the parameters of the model is relatively straightforward. The following code computes the negative log likelihood for the model. Because all of the parameters are bounded on the (0, 1) interval, we pass logistic-transformed variables.
#Storage Retrieval Model (negative log likelihood)
#y=#E1, #E2,..#E6
#par=c(a,r,s,u) logit transformed
nll.sr.1=function(par,y)
{
a=plogis(par[1]) #transform variables to 0,1
r=plogis(par[2])
s=plogis(par[3])
u=plogis(par[4])
p=1:6
p[1]=a*(r+(1-r)*s^2)
p[2]=2*a*(1-r)*s*(1-s)
p[3]=a*(1-r)*(1-s)^2
p[4]=(1-a)*u^2
p[5]=2*(1-a)*u*(1-u)
p[6]=(1-a)*(1-u)^2
-sum(y*log(p))
}
Problem 6.3.1 (Your Turn)
Riefer and Rouder (1992) report the following frequencies for bizarre
and common sentences, respectively.
                          Event
Condition     E1    E2    E3    E4    E5    E6
Bizarre      103     2    46     0     7    22
Common        80     0    65     3     9     2
1. Estimate parameters for the bizarre condition. Be sure to use parameter transformations. Hint: use optim() to get approximate
estimates. Use these estimates as starting values for nlm() to get
true ML estimates.
2. Do the same for the common condition.
3. Test the hypothesis that the bizarreness effect is a storage advantage. The log likelihood of the general model is simply the sum of
the log likelihoods across the conditions. Use nested optimization
for the restricted model.
4. Test the hypothesis that the bizarreness effect is a retrieval advantage. Use nested optimization for the restricted model.
6.4
Similarity Choice Model
6.4.1
The Choice Axiom
The study of how people choose between competing alternatives is relevant
to diverse fields such as decision-making, marketing, and perceptual psychology. Psychologists seek common decision-making processes when describing
choices as diverse as diagnosing diseases or deciding between ordering apple
and orange juice. Luce's choice axiom (1959) and similarity choice model (1963) have had a large impact in all of these fields.
Q. Which of the following is brewed and served at
The Flatbranch Brewery in Columbia, Missouri?
FOUR CHOICES
A) Ant Eater Lager
B) Oil Change Stout
C) Tiger Ale
D) Chancellor’s Reserve
BELIEFS
A) .20
B) .19
C) .60
D) .01
TWO CHOICES
B) Oil Change Stout
D) Chancellor’s Reserve
REVISED BELIEFS
B) .95
D) .05
Figure 6.3: An example of the choice axiom.
The choice axiom is explained in the context of the following example (
Figure 6.3). Suppose a contestant in a game show is asked a question about
a local beer in a small Midwestern city. The contestant is unsure of the
correct answer but is able to assign probability values to the four choices
as indicated in the figure. In particular, the contestant believes choice (2)
is nineteen times as likely to be correct as choice (4). The game show host
then eliminates choices (1) and (3). According to the choice axiom, the ratio
between choices (2) and (4) should be retained whether choices (1) and (3)
are available or not. If this 19:1 ratio is preserved, the probabilities in the
two-choice case become .95 and .05, respectively.
The choice axiom is also an instantiation of the law of conditional probability. The law of conditional probability provides an ideal means of updating probabilities when conditions change and states that probabilities are normalized over the available choices. For the example in Figure 6.3, let pi and p∗i denote the belief in the ith alternative before and after two choices are eliminated. According to the law of conditional probability:

p∗i = pi / (p2 + p4),   i = 2, 4.    (6.8)
The correct answer for this question is (2). When probabilities follow the
law of conditional probability, we say the decision maker properly conditions
on events.
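A minimal sketch of this update in R, using the beliefs in Figure 6.3:

#Renormalize the surviving beliefs after choices (1) and (3) are eliminated
beliefs=c(.20,.19,.60,.01) #beliefs over choices (1)-(4)
surviving=c(2,4)
beliefs[surviving]/sum(beliefs[surviving]) #.95 and .05, as in the figure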
6.4.2
Similarity Choice Model
In this section we implement a choice-axiom-derived model, the similarity
choice model (Luce, 1963), for the case of letter identification. Under normal
viewing conditions, we rarely mistakenly identify letters. In order to study
how people identify letters, it is helpful to degrade the viewing conditions
so that people make mistakes. One way of doing this is to present a letter
briefly and follow it with a pattern mask consisting of “###.” If the letter is
presented sufficiently quickly, performance will reflect guessing. As the stimulus duration is increased, performance increases until it reaches near-perfect
levels. Of interest to psychologists is how letters are confused when performance is intermediate. These patterns of confusion provide a means of telling
how similar letters are to each other. As discussed subsequently, measures
of similarity can be used to infer the mental representation of stimuli.
Several authors have proposed that participants use the choice axiom in
decision making (e.g., Clark, 1957; Luce, 1959; and Shepard, 1957). The
choice axiom, however, may fail for the trivial reason of response bias. For
example, a participant may favor the first response option presented (or
the one presented on the left if options are presented simultaneously). The
Similarity Choice Model (SCM) adds response biases to the choice axiom.
The probability of response j to stimulus i is
pi,j = ηi,j βj / Σ_{k=1}^{J} ηi,k βk.    (6.9)
In this model, ηi,j describes the similarity between stimuli i and j, and
βj describes the bias toward the jth response. Similarities range between
0 and 1. The similarity between any item and itself is 1. To make the
model identifiable, it is typically assumed that similarity is symmetric, i.e., ηi,j = ηj,i. For example, in letter identification, the similarity of a to b is the same as that of b to a. The assumption of symmetry is not without critics (Tversky, 1979) and is dispensed with in some models (e.g., Keren & Baggen, 1981).
One upshot of these restrictions on similarity is a reduction in the number
of free parameters. To see this reduction, consider the paradigm in which
participants identify the first three letters a, b, c. For this case there are
I = J = 3 stimuli and responses. The full matrix of all ηi,j has 9 elements.
The below matrix shows the effect of imposing the restriction ηi,i = 1 and
ηi,j = ηj,i.
[ η1,1  η1,2  η1,3 ]          [ 1     η1,2  η1,3 ]
[ η2,1  η2,2  η2,3 ]    ↦     [ η1,2  1     η2,3 ]        (6.10)
[ η3,1  η3,2  η3,3 ]          [ η1,3  η2,3  1    ]
In this case, the restrictions reduce the number of parameters from 9 to 3.
To ensure that the SCM predictions for response probabilities are between 0 and 1, response biases are always positive. Although there are J response bias terms, only J − 1 of them are free. To see this, consider what happens to pi,j when all response bias values are doubled: the factor of 2 cancels in the numerator and denominator of Eq. (6.9). Hence, one of the response biases can be set to 1.0 without any loss of generality. As a matter of convenience, we always set β1 = 1.0 and estimate the remaining J − 1 values.
6.4.3
Analysis
Consider SCM for the identification of the first three letters a, b, c. Table 6.3
depicts the format of data. Random variable Yij is the number of times
stimulus i elicits response j. In the table, the stimuli are denoted by lowercase letters (even though the physical stimuli may be upper-case) and the
responses are denoted by upper-case letters. Table 6.3 is called a confusion
matrix.
The general multinomial model for this confusion matrix is
(Ya,A, Ya,B, Ya,C) ∼ Multinomial(pa,A, pa,B, pa,C, Na),    (6.11)
(Yb,A, Yb,B, Yb,C) ∼ Multinomial(pb,A, pb,B, pb,C, Nb),    (6.12)
(Yc,A, Yc,B, Yc,C) ∼ Multinomial(pc,A, pc,B, pc,C, Nc).    (6.13)
For each of these three component multinomials, the three probability parameters represent the probability of a particular response given a particular stimulus.
                      Response
Stimulus      A      B      C      # of trials per stimulus
Stimulus a    Ya,A   Ya,B   Ya,C   Na
Stimulus b    Yb,A   Yb,B   Yb,C   Nb
Stimulus c    Yc,A   Yc,B   Yc,C   Nc
Table 6.3: Confusion matrix for three stimuli (a, b, c) associated with three
responses (A, B, C).
These three probabilities must sum to 1.0. Hence, for each component, there are two free parameters. For the three stimuli, there are a total of six free parameters. The log likelihood is

l = Σ_{i=1}^{3} Σ_{j=1}^{3} yi,j log pi,j.    (6.14)
ML estimates of the probability parameters are the appropriate proportions;
i.e., p̂a,A = ya,A /Na .
The SCM model for the identification of three letters has five free parameters (η1,2 , η1,3 , η2,3 , β2 , β3 ). The other parameters can be derived from
these five. Analysis proceeds by simply expressing the probabilities in Equations (6.11) through (6.13) as functions of the five free parameters. For
example, probability p1,1 may be expressed as

p1,1 = η1,1 β1 / Σk η1,k βk = 1 / (1 + η1,2 β2 + η1,3 β3).
Likewise, the multinomial log likelihood function (Eq. 6.14) is expressed as
a function of the five free parameters and maximized with respect to these
parameters. Analysis for a confusion matrix of arbitrary I and J proceeds
analogously.
6.4.4
Implementation in R
In this section, we implement analysis of SCM in R. One element that makes
this job difficult is that the similarities, ηi,j , are most easily conceptualized
as a matrix, as in Eq. 6.10. Yet, all of our previous code relied on passing
parameters as vectors. To complicate matters, not all of the elements of the
similarity matrix are free. Hence, the code must keep careful track of which
matrix elements are free and which are derived. Consequently, the code
is belabored. Even so, we present it because these types of problems are
common in programming complex models, and this code serves as a suitable
exemplar. Readers not interested in programming these models can skip this
section without loss.
The first step in implementing the SCM model is to write a function
that yields the log likelihood of the data as a function of all similarity and
response parameters, whether they are free or derived. The following code
does so. It calculates the log likelihood for each stimulus within a loop and
steps through the loop for all I stimuli. Before running the code, be sure to
define I; e.g., I=3.
#eta is an I-by-I matrix of similarities
#beta is an I element array of response biases
#Y is stimulus-by-response matrix of frequencies
nll.scm.first=function(eta,beta,Y)
{
nll=0
for (stim in 1:I)
{
denominator=sum(eta[stim,]*beta) #denominator in Eq. 6.9
p=(eta[stim,]*beta)/denominator #all J p[i,] at once
nll=nll-sum(Y[stim,]*log(p)) #add negative log likelihoods
}
return(nll)
}
This function is not directly suitable for minimization because it is a function of both free parameters and derived parameters. The matrix eta has
I 2 elements; yet the model specifies far fewer similarity parameters. In fact,
after accounting for ηi,i = 1 and ηi,j = ηj,i there are only I(I − 1)/2 free similarity parameters. Likewise, there are only I − 1 response bias parameters.
The goal is to maximize likelihood with respect to these free parameters.
The following code partially meets this goal; it maps the free similarity parameters, denoted par.e, into the matrix eta.
#par.e is an I(I-1)/2 element vector of free similarity parameters
#eta is an I-by-I matrix of all similarities
par2eta=function(par.e)
{
eta=matrix(1,ncol=I,nrow=I) #create an I-by-I matrix of 1s
eta[upper.tri(eta)]=par.e #fill the upper triangle with the free parameters
eta[lower.tri(eta)]=t(eta)[lower.tri(eta)] #mirror so that eta is symmetric
return(eta)
}
The novel elements in this code are the functions upper.tri() and lower.tri().
These functions map the elements of the vector par.e into the appropriate
locations in the matrix eta. To see how these work, try the following lines
sequentially:
x=matrix(1:9,nrow=3,byrow=T) #type x to see the matrix
upper.tri(x)
lower.tri(x)
x[upper.tri(x)]
x[lower.tri(x)]
x[upper.tri(x)]=c(-1,-2,-3) #type x to see the matrix
The following function, par2beta, returns all beta as a function of the
I − 1 free response bias parameters.
#par.b is I-1 free response bias parameters
#code returns all I response bias parameters
par2beta=function(par.b)
{
return(c(1,par.b))
}
In SCM, there are I(I − 1)/2 free similarity parameters and I − 1 free response bias parameters. The following function returns the log likelihood as a
function of these I(I −1)/2+I −1 free parameters. Because all similarity parameters are restricted to be between zero and 1, we use logistic-transformed
similarity parameters. Likewise, because response biases must always be
positive, we use an exponential transform of response bias parameters. The
function ex is positive for a real values of x.
nll.scm=function(par,dat) {
#par is concatenation of I(I-1)/2 logistic transformed similarity
#parameters and (I-1) exponential transformed response bias parameters
#dat is confusion matrix
par.e=plogis(par[1:(I*(I-1)/2)])
par.b=exp(par[((I*(I-1)/2)+1):((I*(I-1)/2)+I-1)])
beta=par2beta(par.b)
eta=par2eta(par.e)
return(nll.scm.first(eta,beta,dat))
}
The following code shows the analysis of a sample confusion matrix. The
first line reads sample data into the matrix dat.
I=3
dat=matrix(scan(),ncol=3,byrow=T)
49 1 15
6 22 5
17 1 35
par=c(.1,.1,.1,1,1) #eta’s=.1, beta=1
par[1:3]=qlogis(par[1:3]) #logit-transformed similarities
par[4:5]=log(par[4:5]) #log-transformed response biases (exp() is applied in nll.scm)
g=optim(par,nll.scm,dat=dat,control=list(maxit=11000))
The results may be found by transforming parameters:
> # similarities
> par2eta(plogis(g$par[1:3]))
           [,1]       [,2]       [,3]
[1,] 1.00000000 0.07529005 0.38558239
[2,] 0.07529005 1.00000000 0.07982558
[3,] 0.38558239 0.07982558 1.00000000
> # response biases
> par2beta(exp(g$par[4:5]))
[1] 1.0000000 0.2769761 0.7926932
From these data, it may be seen that a and c are more similar to each other than either is to b. In addition, there is a greater response bias toward responses A and C than toward B.
Problem 6.4.1 (Your Turn)
Decide if SCM is appropriate for the letter confusion data in the above
R code by comparing it with the general multinomial model with a
nested likelihood ratio test.
Problem 6.4.2 (Your Turn)
Rouder (2001, 2004) tested the appropriateness of SCM by manipulating the number of choices in a letter identification task. He reasoned
that if SCM held, the similarity between letters should not be a function of the choice set. Consider the following fictitious data for a four-choice and a two-choice condition:
             Four-Choice Condition       Two-Choice Condition
Stimulus      A     B     C     D          A      B
a            49     5     1     2         83     10
b            12    35     7     1         15     76
c             1     5    22    11
d             1     6    18    19
Decide if the similarity between a and b is the same across the two
conditions. Consider the following steps:
1. Estimate a general SCM model with separate similarities across
both conditions. This can be done by separately finding estimates
for the four-choice SCM model for the four-choice condition and
the two-choice SCM model for the two-choice condition. What
is the joint log likelihood of these parameters across both conditions?
2. Estimate a model in which there is one ηa,b across both conditions.
What is the log likelihood of this common ηa,b model?
3. With these two log likelihoods, it is straightforward to construct
a likelihood ratio test of the common parameter. Construct a
likelihood ratio test and report the result.
6.5
SCM and Dimensional Scaling
The SCM models similarity between two stimuli. A related concept is mental
distance. Mental distance is the inverse of similarity; two items are considered
close to each other in a mental space if they are highly similar, and hence,
easily confused. Using distances, it is possible to build a space of mental
representation. To better motivate the concept of a space of representation,
consider the three sets of stimuli in Figure 6.4. The top left panel shows
a sequence of nine lines, labeled A, B,...,I. These lines differ in a single
dimension—length. It is reasonable to suspect the mental representation
of this set varies on a single dimension. The left, middle column shows a
hypothetical unidimensional representation. The mental representation of
all of the stimuli are on a straight line. The smallest and largest stimuli
are disproportionately far from the others indicating that these stimuli are
least likely to be confused. The bottom left shows an alternative mental
representation in two dimensions. The primary mental dimension is still
length, but there is a secondary dimension that indicates how central or
peripheral the stimuli are. The top center panel shows a different set of
stimuli: a set of circles with inscribed diameters. These stimuli differ on two
dimensions: size and the angle of the diameter. It is reasonable to suspect
that the mental representation of this set varies on two dimensions as well.
A hypothetical mental space is shown below these stimuli. In the figure,
there is greater distance between stimuli on the size dimension than on the
angle dimension, indicating that size is more salient than angle. The right
panel shows select letters in Asomtavruli, an ancient script of the Georgian
language. It is reasonable to suspect that the mental representation of this set
spans many dimensions. The goal of multidimensional scaling is to identify
the number of dimensions in the mental representation of a set of stimuli and
the position of each stimulus within this set.
Dimensional models of mental space may be formulated as restrictions on
SCM. We follow Shepard (1957) who related distance to similarity as
ηi,j = exp(−di,j),    (6.15)
where di,j is the distance between stimuli i and j. In SCM, the greatest
similarity is of an item to itself, which is set to 1.0. This level of similarity
corresponds to a distance of zero. In SCM, as items become less confusable,
similarity decreases toward zero. According to Equation (6.15), as items become less confusable, distance increases.
Figure 6.4: Top: Three types of stimuli of increasing dimensionality. Bottom: Hypothetical mental spaces for line segments and circles with inscribed
diameters.
[Figure 6.5 here. The hierarchy shown in the figure is: the general multinomial model, with I(I−1) parameters (72 for 9 stimuli); the SCM in I−1 dimensions, with I(I−1)/2 + (I−1) parameters (44 for 9 stimuli); the K-dimension SCM, with (Σ_{i=1}^{K−1} i) + K(I−K) + (I−1) parameters; the two-dimension SCM, with 2(I−1) + (I−2) parameters (23 for 9 stimuli); and the one-dimension SCM, with 2(I−1) parameters (16 for 9 stimuli).]
Figure 6.5: Hierarchical relationship between models of mental distance.
In fact, SCM can be parameterized with distance instead of similarity. This distance-based SCM model is given as

pi,j = exp(−di,j)βj / Σ_{k=1}^{J} exp(−di,k)βk.    (6.16)
Figure 6.5 describes the nested-models approach to dimensional models. At the top level is the general multinomial model. The most general restriction is an SCM model with distance parameters that obey three restrictions: (1) the distance between any item and itself is di,i = 0; (2) distance is symmetric, e.g., di,j = dj,i; and (3) the shortest distance between any two points is a straight line, e.g., di,j ≤ di,k + dk,j. An SCM model that obeys these three restrictions contains distances that can be represented in at most an (I − 1)-dimensional space. Lower-dimensional models are restrictions on the distances and are represented as submodels. The most restrictive model, with the fewest parameters, is the one in which the distances are constrained so that the mental space is a single dimension. This model, in fact, is the easiest to analyze and we consider it next.
6.5.1
Unidimensional SCM Model
For a unidimensional model, we may represent the position of the ith item as a single number, xi. The distance between any two items is

di,j = |xi − xj|,

where |x| is the absolute value of x. The following SCM model relates identification performance to the one-dimensional mental representations of the lines:

pi,j = exp(−|xi − xj|)βj / Σk exp(−|xi − xk|)βk.    (6.17)
Fortunately, only minor changes to the SCM code are needed to analyze
this model. There are I positions xi , but the first one of these may be set to
0 without any loss. Therefore, there are only I − 1 free position parameters.
Likewise, there are I − 1 free response biases. The first step is modifying the
mapping from free parameters to similarities:
par2eta.1D=function(par.e)
{
x=c(0,par.e) #set first item’s position at zero
d=as.matrix(dist(x,diag=T,upper=T)) #type d to see distances
return(exp(-d)) #returns similarities
}
The function dist() is a built-in R function for computing distances from
positions. It returns data in a specific format that is not useful to us; consequently, we use the function as.matrix() to represent the distances as a
matrix.
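A quick illustration with three hypothetical positions shows what dist() and as.matrix() produce:

#dist() returns a compact "dist" object; as.matrix() expands it to a full
#symmetric matrix of pairwise distances
as.matrix(dist(c(0,1,3),diag=T,upper=T))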
The last step is to modify the nll to send the appropriate parameters to
par2eta.1D() and par2beta(). Because all real numbers are valid positions,
we do not transform parameters.
nll.scm.1D=function(par,dat) {
par.e=par[1:(I-1)]
par.b=par[I:(2*I-2)]
beta=par2beta(par.b)
eta=par2eta.1D(par.e)
return(nll.scm.first(eta,beta,dat))
}
Problem 6.5.1 (Your Turn)
Fit the one-dimensional SCM model to the following confusion matrix
for the identification of line segments:
                         Response
Item    A    B    C    D    E    F    G    H
 a     84    8    7    1    0    0    0    0
 b     14   45   23    9    5    4    0    0
 c      8   10   36   19   15    7    3    2
 d      2    8   17   31   21    9    8    4
 e      3    3    6   28   34   14    5    7
 f      2    4    8   18   18   32   11    7
 g      0    2    5    6    7    9   49   22
 h      0    0    2    0    3    5    5   85
Optimization may be improved by first using an optim() call to get good starting values for an nlm() call. Plot the resulting positions. Notice that the spacing is not quite uniform. This enhanced distance between the first and second items and the penultimate and ultimate items is common in one-dimensional absolute identification and is called the bow effect (Luce, Nosofsky, Green, & Smith, 1982).
6.5.2
Multidimensional SCM Models
To test the appropriateness of the one-dimensional SCM model, we can embed the one-dimensional model within a higher-dimension SCM model. In
the most general case, each item is represented by a point in a higher-dimensional space. The location of the ith item can be denoted by (x_{i,1}, x_{i,2}, ..., x_{i,M}),
where M is the number of dimensions in the mental representation.
The construction of distance is more complex with multiple dimensions.
One general form of distance is called Minkowski distance.
Definition 37 (Minkowski Distance) The Minkowski distance between any two points is

d_{i,j} = \left( \sum_{m=1}^{M} |x_{i,m} - x_{j,m}|^{r} \right)^{1/r}.   (6.18)
To understand Minkowski distance, let's take a simple case of two stimuli that exist in two dimensions. The stimuli's positions are (x_1, y_1) and (x_2, y_2). For these two stimuli and for r = 2, Equation (6.18) reduces to

d_{1,2} = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}.

This is the familiar formula for the distance of a straight line between two points. When r = 2, the distance is called the Euclidean distance. Figure 6.6 shows the Euclidean distance between two points as well as two other distances. When r = 1, the distance is called the city-block distance. For the example with two points in two dimensions, city-block distances are computed with only vertical and horizontal lines. Much like navigating in a dense city, diagonal lines are not admissible paths between points. The city-block distance in the figure is 7. The maximum distance occurs as r → ∞; the distance is then the maximum difference between the points on any single dimension. In the figure, the differences are 4 (x-direction) and 3 (y-direction). The maximum difference is 4. We use the Euclidean distance here, although researchers have proposed the city-block distance for certain classes of stimuli (e.g., Shepard, 1986).

Figure 6.6: Three distance measures between 2 points.
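As a quick illustration (ours, not from the original text), the following lines compute the three distances for a pair of points whose coordinates differ by 4 and 3, as in Figure 6.6:

minkowski=function(p1,p2,r) (sum(abs(p1-p2)^r))^(1/r)  #Equation 6.18
minkowski(c(0,0),c(4,3),r=1)   #city-block distance: 7
minkowski(c(0,0),c(4,3),r=2)   #Euclidean distance: 5
max(abs(c(0,0)-c(4,3)))        #maximum distance (r -> infinity): 4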
The following code implements a two-dimensional SCM model for the line lengths. There are I points, each with two position parameters. The first item may be placed at (0, 0). The y-coordinate of the second point may also be placed at 0 without loss. There are, therefore, 2(I − 1) − 1 free position parameters and I − 1 free response biases. The total, therefore, is 3(I − 1) − 1 free parameters. The function par2eta.2D() converts the 2(I − 1) − 1 position parameters to similarities:
par2eta.2D=function(par.e)
{
  x=c(0,par.e[1:(I-1)])           #x-coordinate of each of the I points
  y=c(0,0,par.e[I:(2*(I-1)-1)])   #y-coordinate of each of the I points
  points=cbind(x,y)
  d=as.matrix(dist(points,"euclidean",diag=T,upper=T))
  return(exp(-d))
}
The cbind() function makes a matrix by binding vectors together as columns. In the above code, it makes a matrix of 2 columns and I rows; the first and second columns are the x-coordinates and y-coordinates of the points, respectively. The negative log-likelihood function is
nll.scm.2D=function(par,dat)
{
  par.e=par[1:(2*(I-1)-1)]
  par.b=par[(2*(I-1)):(3*(I-1)-1)]
  beta=par2beta(par.b)
  eta=par2eta.2D(par.e)
  return(nll.scm.first(eta,beta,dat))
}
Optimization is performed as follows:
par=c(1:7,rep(0,6),rep(1/8,7))
start=optim(par,nll.scm.2D,dat=dat)
g.2D=nlm(nll.scm.2D,start$par,dat=dat,iterlim=200)
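The estimated configuration can be recovered from the nlm() output by unpacking the parameters in the order they were packed. A small sketch (ours, not from the original text; it assumes the fit above converged and that the items are the eight line lengths labeled a through h):

est=g.2D$estimate
x.hat=c(0,est[1:(I-1)])                      #x-coordinates; first item fixed at (0,0)
y.hat=c(0,0,est[I:(2*(I-1)-1)])              #y-coordinates; second item's y fixed at 0
plot(x.hat,y.hat,pch=19)                     #estimated two-dimensional configuration
text(x.hat,y.hat,labels=letters[1:I],pos=3)  #label items a through h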
Problem 6.5.2 (Your Turn)
Decide if the one-dimensional restriction of the two-dimensional SCM
model is appropriate for the line-length data with a likelihood ratio
test.
One of the drawbacks of the SCM models is that the number of parameters grows quickly with increasing numbers of items. With sixteen items, for example, a three-dimensional SCM model has 56 parameters. While this is not a prohibitive number in some contexts, it is difficult to achieve stable ML estimates with the methods we have described.
Fortunately, there are standard, high-performance multidimensional scaling techniques (Cox & Cox, 2001; Shepard, Romney, & Nerlove, 1972; Torgerson, 1958). The mathematical bases of these techniques are outside the scope of this book. Instead, we describe their R implementation. When distances are known, the function cmdscale() provides a Euclidean representation. An example is built into R and may be found with ?cmdscale.
For psychological applications, however, we rarely know mental distances. There are a few common alternatives. One alternative is to simply ask people to rate the similarity of items, two at a time. These data can be transformed to distances using Equation 6.15. One critique of this approach is that similarity data are treated as a ratio scale. For example, suppose a participant rates the similarity between items i and j as a "2" and that between items k and l as a "4." The implicit claim is that k is twice as similar to l as i is to j. This may be too strong; instead, it may be more prudent to consider only the ordinal relations, e.g., k is more similar to l than i is to j. Fortunately, there is a form of multidimensional scaling, called nonmetric multidimensional scaling, that is based on these ordinal relations. In R, two functions work well: sammon() and isoMDS(). Both of these functions are in the package MASS (Venables & Ripley, 2002). The package is loaded with the library() command; e.g., library(MASS).
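As a rough illustration (ours, not from the original text; the similarity matrix below is hypothetical, and the conversion of similarities to distances assumes the exponential relation implied by Equation 6.16), nonmetric scaling might be run as follows:

library(MASS)
sim=matrix(c(10, 7, 3, 2, 1,
              7,10, 4, 3, 2,
              3, 4,10, 6, 4,
              2, 3, 6,10, 5,
              1, 2, 4, 5,10),nrow=5,byrow=T)  #hypothetical similarity ratings
d=-log(sim/10)                  #assumed conversion of similarities to distances
fit=isoMDS(as.dist(d),k=2)      #nonmetric multidimensional scaling, two dimensions
plot(fit$points,pch=19)         #recovered two-dimensional configuration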
6.6
Generalized Context Model of Categorization
The Generalized Context Model (Nosofsky, 1986) describes how people categorize novel stimuli. According to the model, categories are represented by exemplars. For example, shoes are represented by a set of stored mental exemplars of shoes. They are not represented by a set of verbal rules, such as "a shoe is a piece of footwear that is longer than it is high." GCM posits that a novel piece of footwear will be classified as a shoe as opposed to a boot if it is more similar to the stored exemplars of shoes than it is to the stored exemplars of boots.
GCM has primarily been tested with novel rather than natural categories.
Consider an experiment in which the Asomtavruli letters in Figure 6.4 are
assigned by the experimenter to two different categories. Suppose left-column
letters are assigned to Category A and right-column letters are assigned to
Category B. Further suppose that a participant has been shown these six
stimuli and their category labels numerous times and has well learned the
pairings. GCM describes how participants will classify previously unseen
stimuli, such as middle-column letters.
In GCM, it is assumed that a to-be-classified item and the exemplars are represented as points in a multidimensional space. The position of the to-be-classified item is denoted by y = (y_1, .., y_M); the position of the ith exemplar is denoted x_i = (x_{i,1}, .., x_{i,M}). Exemplars belong to specific categories; let sets A and B denote the exemplars that belong to Categories A and B, respectively. In the Asomtavruli letter example, the letters in the left column belong to A; those in the right column belong to B. The similarity between the to-be-classified item and the ith exemplar is determined by their Minkowski distance:

η_{y,i} = \exp\left(-\left(\sum_m α_m |y_m - x_{i,m}|^r\right)^{1/r}\right), \quad r = 1, 2.   (6.19)

The exponent r is usually assumed beforehand. The new parameters in Equation 6.19 are (α_1, .., α_M), which differentially weight the importance of the dimensions. The motivation for the inclusion of these parameters is that participants may stress differences in some dimensions more than others in categorization. These weights are relative, and the first one may be set to 1.0 without any loss.
When to-be-classified item y is presented, it elicits activity for the categories. This activation depends on the overall similarity of the item to the exemplars for specific categories. The activation for Category A is
a = \sum_{i \in A} η_{y,i},

where the sum is over all exemplars of Category A. Activation for Category B is given analogously:

b = \sum_{i \in B} η_{y,i}.

The probability that stimulus y is placed in Category A may be given as

p(y, A) = \frac{a\,β_A}{a\,β_A + b\,β_B},
where βA and βB are response biases.
In fitting GCM, researchers do not let the position parameters of the exemplars or of the to-be-classified stimuli be free. Instead, these are estimated prior to the categorization experiment through either a similarity-ratings task or an absolute identification experiment. The data from these supplementary tasks are submitted to a multidimensional scaling routine (see the previous section) to yield distances. The remaining free parameters are the weights α and the response biases β. GCM is attractive because one can predict categorization from identification data with a minimal number of additional parameters.
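To make the model concrete, here is a minimal sketch (ours, not from the original text) of the choice probability for a single to-be-classified item under Euclidean distance (r = 2); the exemplar matrix X, category labels cat, weights alpha, and the two biases are assumed inputs:

gcm.prob.A=function(y,X,cat,alpha,beta.A,beta.B)
{
  d=sqrt(colSums(alpha*(t(X)-y)^2))     #weighted Euclidean distance to each exemplar
  eta=exp(-d)                           #similarity to each exemplar (Equation 6.19)
  a=sum(eta[cat=="A"])                  #activation of Category A
  b=sum(eta[cat=="B"])                  #activation of Category B
  return(a*beta.A/(a*beta.A+b*beta.B))  #probability of an "A" response
}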
Problem 6.6.1 (Your Turn)
Let's use Asomtavruli letters to test GCM. The stimuli in our experiment are sixteen Georgian letters, which exist in a four-dimensional space as follows:
                     Coordinates
Letter     x1       x2       x3       x4    Category
  1     -0.167    0.136    0.042   -0.088      A
  2     -0.445    0.265   -0.008   -0.218      A
  3     -0.308    0.325   -0.046    0.241      A
  4     -0.648    0.120    0.138    0.350      A
  5     -0.254   -0.048    0.354    0.428      A
  6      0.001   -0.746   -0.080    0.033      B
  7      0.013   -0.357   -0.887    0.203      B
  8      0.441   -0.064    0.261    0.309      B
  9      0.014   -0.206   -0.065    0.671      B
 10      0.251    0.492    0.157   -0.102      B
 11     -0.357    0.387   -0.267   -0.084      ?
 12     -0.218    0.402   -0.218   -0.044      ?
 13      0.466    0.113   -0.054   -0.124      ?
 14      0.549    0.289   -0.047   -0.052      ?
 15     -0.082   -0.181    0.782   -0.075      ?
 16     -0.218    0.402   -0.218   -0.044      ?
In our experiment, a participant learns to classify Letters 1 to 5 into Category A and Letters 6 to 10 into Category B. They are then repeatedly tested on Letters 11 through 16 without any feedback. The table below contains the number of times each letter was placed in each category:
          Response Category
Letter       A        B
  11        28       22
  12        26       24
  13        16       34
  14        21       29
  15        23       27
  16        30       20
There are 6 pieces of data and 4 free parameters (α2 , α3 , α4 , βB ). Estimate these parameters with a Euclidean distance GCM model and
decide if the model fits well by comparing it to a general binomial
model (the general model is binomial as there are only two choices in
the categorization task).
6.7
Going Forward
In this and the preceding chapters, we have shown how substantive models
may be analyzed as restrictions on general binomial and multinomial models.
The data for these models are frequency counts of responses. Frequency count
data of this type are called categorical data; models of categorical data are
called categorical models. The focus of the next two chapters is on continuous
variables such as response time. Once again, we embed substantive models
as restrictions on nested models whenever it is feasible. In the next chapter,
we discuss the normal model which is the basis for ANOVA and regression.
Following that, we discuss substantive and statistical models of response
time. The last chapter is devoted to hierarchical models.
Chapter 7
The Normal Model
The first six chapters focused on substantive models for binomial and multinomial data. The binomial and multinomial are ideal for data that is discrete,
such as the frequency of events. Psychologists often deal with data which is
more appropriately modeled as continuous. A common example is the time
to complete a task in an experiment, or response time (RT). Because RT
may take any positive value, it is appropriate to model it with a continuous
random variable. In some contexts, even discrete data may be more conveniently modeled with continuous RVs. One example is intelligence quotients
(IQ). IQ scores are certainly discrete, as there are a fixed number of questions
on a test. Even though it is discrete, it is typically modeled as a continuous
RV. The most common model of data in psychology is the normal. In this
chapter we cover models based on the normal distribution.
7.1
The Normal-Distribution Model
IQ scores are typically modeled as normals. Consider the case in which we
obtain IQ test scores from a set of participants. The ith participant’s score,
i = 1, .., I, is denoted by random variable Yi which is assumed to be a normal:
Yi ∼ Normal(µ, σ 2 ), i = 1, .., I.
(7.1)
Parameters µ and σ 2 are the mean and variance of the distribution. The goal
is to estimate these parameters. One way to do this is through maximum
likelihood.
The first step is expressing the likelihood. Accordingly, we start with
the probability density function (pdf). The pdf for the normally distributed
random variable Yi is
f_{Y_i}(y_i) = \frac{1}{\sqrt{2\pi}\,σ} e^{-\frac{(y_i-µ)^2}{2σ^2}}.
We must then express the joint pdf for all I random variables. Because the
observations are independent, the joint pdf is obtained by multiplying
f_{Y_1,..,Y_I}(y_1, .., y_I) = \frac{1}{(2\pi)^{I/2}σ^I} e^{-\frac{\sum_{i=1}^{I}(y_i-µ)^2}{2σ^2}}.
The likelihood is obtained by rewriting the pdf as a function of the parameters.
L(µ, σ^2; y_1, .., y_I) = σ^{-I} e^{-\frac{\sum_{i=1}^{I}(y_i-µ)^2}{2σ^2}},
where the term in 2π has been omitted. Log-likelihood is
l(µ, σ^2; y_1, .., y_I) = -I \log σ - \sum_{i=1}^{I} \frac{(y_i-µ)^2}{2σ^2}.
The next step in estimating µ and σ^2 is maximizing the log-likelihood. This may be done either numerically or with calculus methods. The calculus methods yield the following estimators:

\hat{µ} = \frac{\sum_{i=1}^{I} y_i}{I}   (7.2)

\hat{σ}^2 = \frac{\sum_{i=1}^{I} (y_i - \hat{µ})^2}{I}   (7.3)
The estimator for µ, µ̂, is the sample mean. Under the assumption that Yi
are normal random variables, the sampling distribution of µ̂ is also normal:
µ̂ ∼ Normal(µ, σ 2 /I).
Figure 7.1 shows the relationship between normally distributed data Y_i and the normally distributed estimate µ̂. In the figure, the higher-variance probability density function is the distribution of the data and has a mean of 100 and a standard deviation of 15. The lower-variance density is the distribution of the sample mean of 9 observations. The standard error of µ̂ in this case is 15/√9 = 5.
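This relationship is easy to check by simulation (a small sketch, ours, not from the original text):

set.seed(1)
y=matrix(rnorm(10000*9,mean=100,sd=15),ncol=9)  #10000 samples of N = 9 IQ scores
mu.hat=rowMeans(y)                              #sample mean of each sample
sd(mu.hat)                                      #close to 15/sqrt(9) = 5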
Figure 7.1: The distribution of the IQ population and the distribution of the sample mean, with N = 9.
The maximum likelihood estimator σ̂^2, however, is not the sample variance. Sample variance, s^2, is defined as

s^2 = \frac{\sum_{i=1}^{I} (y_i - \hat{µ})^2}{I-1}.

The difference between s^2 and σ̂^2 is in the denominator; the former involves a factor of I − 1 while the latter involves a factor of I. The practical consequences of this difference are explored in the following exercise.
Problem 7.1.1 (Your Turn)
Consider the model for IQ, Yi ∼ Normal(µ = 100, σ 2 = 100), where
i = 1, .., 10.
1. Use the simulation method to explore the sampling distributions
of the ML estimator (σˆ2 ) and the classic estimator s2 . How different are they? Compare them to the distribution of a χ2 random
variable with 9 degrees of freedom. What do you notice?
2. Compute the RMSE and bias for each estimate. Which is more
efficient?
7.2
Comparing Two Means
The main appeal of the normal model is that methods of inference are well
known. Consider, for example, a hypothetical investigation of the effect of ginko root extract on intelligence. Participants may be split into two groups: those receiving ginko root extract (the treatment group) and those receiving a placebo pill (the control group). After receiving either the ginko extract or the placebo, participants take an intelligence test. The resulting scores
may be denoted Yi,j , where i indicates participant and j indicates the group
(either treatment or control).
The normal model is
Yi,j ∼ Normal(µj , σ 2 )
(7.4)
Figure 7.2: The independent-groups t test model.
The model is shown graphically in Figure 7.2. The main question is whether the treatment affected intelligence. The question is assessed by asking whether µ_treatment equals µ_control.
This question may certainly be answered with a likelihood ratio test. It is more standard, however, to use a t test (the two tests are in fact equivalent in this case). The rationale for a t test may be found in several texts of statistics for behavioral scientists (e.g., Hays, 1994). We focus on the R implementation.
treatment=c(103,111,112,89,120,123,92,105,87,126)
control=c(81,114,105,75,104,98,114,106,92,122)
t.test(treatment,control,var.equal=T)
The output indicates the t value, the degrees of freedom for the test, and the p value. The p value denotes the probability of observing a t value as extreme or more extreme than the one observed under the null hypothesis that the true means are equal. In this case, the p value is about 0.39. We cannot reject the null hypothesis that the two groups have the same mean.

Figure 7.3: Hypothetical data with confidence intervals.
It is common when working with the normal model to plot µ̂, as shown in
Figure 7.3. The bars denote confidence intervals (CIs) rather than standard
errors. Confidence intervals have an associated percentile range and the 95%
confidence interval is typically used. The interpretation of CI, like much
of standard statistics, is rooted in the concept of repeated experiments. In
the limit that an experiment is repeated infinitely often, the true value of
the parameter will lie in the 95% CI for 95% of the replicates. Confidence
intervals are constructed with the t distribution as follows.
#for treatment condition
mean(treatment)+c(-1,1)*qt(.025,9)*sd(treatment)/sqrt(10)
#for control condition
mean(control)+c(-1,1)*qt(.025,9)*sd(control)/sqrt(10)
Alternatively, R will compute the confidence intervals for you if you use the t.test() function. Reporting CIs or standard errors is equally acceptable in most contexts. We recommend CIs over standard errors in general, though we tend to make exceptions when plotting CIs clutters a plot more than plotting standard errors.
Problem 7.2.1 (Your Turn)
Compare the t test with ML estimation. Using the model in Equation 7.4, construct a nested ML test of the hypothesis that the two group means are equal. Use 10 subjects, and examine the rejection rates when the two group means are 0, .5, 1, and 1.5 group standard deviations apart. Do the tests perform differently?
7.3
Factorial Experiments
The t test works well when there are two groups to compare. In the example
above, there was one factor manipulated and that factor had two groups. If
we wanted to manipulate another factor, or add another treatment, the t test
would be inappropriate.
Suppose that in addition to the effect of ginko root, we were also interested in how exposure to certain types of music affects IQ. There has been some suggestion that the music of Mozart increases IQ for a short time (Rauscher, Shaw, & Ky, 1993). This controversial effect has been dubbed the "Mozart effect". We could design an experiment in which we manipulated both treatment with ginko root and exposure to music. This would
produce 2 × 2 = 4 groups. The four groups are shown in Figure 7.4. This
design is called a factorial design because every combination of the factors is
represented.
We can think of running this experiment in two ways. In a between-subjects design, each participant is a member of only one group. In a within-subjects design, each participant will typically be involved in every combination. Both designs have advantages; for the purposes of this section we will consider the between-subjects design.

Figure 7.4: A factorial design with 2 factors.
For an experimental design like the one described above, the most widely used model for analysis is ANOVA. The details and theory behind ANOVA analyses are covered in elementary statistics texts (Hays, 19xx). The basic ANOVA model for the 2-factor, between-subjects design above is

IQ_{ijk} ∼ Normal(µ_{jk}, σ^2),   (7.5)
µ_{jk} = µ_0 + α_j + β_k + (αβ)_{jk}.   (7.6)
The IQ score of the ith participant in the jth ginko root treatment and the kth music treatment is distributed as a normal. This normal has a mean which depends on the condition. The terms µ_0, α_j, and β_k are the grand mean and the effects of the ginko and music treatments, respectively. Typically we apply the constraints that \sum_j α_j = \sum_k β_k = 0. The last term, (αβ)_{jk}, is called the interaction term, and describes any deviation from additivity of the treatment effects. For instance, ginko may only be effective in the presence of music. For each effect, the null hypothesis is that all treatment means are the same.
R has functions for ANOVA analyses built in. Consider the experimental
design above. Data for a hypothetical experiment using this design is in the
file factorial.dat. Download this file into your working directory, then use the
following code to load and analyze it.
dat=read.table('factorial.dat',header=T)
summary(aov(IQ~Ginko*Music,data=dat))
The result should be a table like the one in Table 7.1. The p values in
the last column tell us the probability under the null hypothesis (that all
treatment means are the same) of getting an F statistic as large as the one
obtained. If p is less than our prespecified alpha, we reject the null and conclude that not all treatment means are equal. In this case, there is a
significant effect of ginko root, but not of exposure to music.
To see this graphically, consider Figure 7.5. The boxplot shows that the
median scores of the groups receiving ginko are higher than the groups not
receiving it, which accords with the ANOVA analysis. It also allows us to
quickly check the equal-variance assumption of the ANOVA test; in this case
there does not appear to be any violation of this assumption.
              Df   Sum Sq  Mean Sq  F value   Pr(>F)
Ginko          1   1690.0   1690.0   5.9455  0.01982 *
Music          1     28.9     28.9   0.1017  0.75168
Ginko:Music    1    547.6    547.6   1.9265  0.17368
Residuals     36  10233.0    284.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Table 7.1: ANOVA table for the hypothetical IQ data.
Figure 7.5: A boxplot of the IQ data for the four groups (NG.NM, G.NM, NG.M, G.M).
7.4
Confidence Intervals in Factorial Designs
Confidence intervals in factorial designs are more difficult to use. Because of the variety of possible designs (between- and within-subjects, fixed- and random-effects models) it is impossible to prescribe one method that will work for every design. Also, the quantity on which the confidence interval is placed can vary. One can place confidence intervals on group means, differences between groups, or standardized effect sizes. The choice depends on how you wish to use the intervals; if they are for inference, you may wish to put the confidence intervals on effect sizes. If the intervals are for graphical purposes, you may want to put them on group means instead. For the purposes of simplicity, we will deal with only the latter approach. Confidence intervals on effect sizes are discussed in depth in (cites).
One simple way to put confidence intervals on the groups in between-subjects designs like the one above is to extend the approach used with the t test to each group. We can compute a sample mean and standard deviation for each group and use them to build individual confidence intervals. A better approach, however, is to use the estimate of σ^2, the error variance, provided by the ANOVA analysis. This is the mean squared error in the ANOVA table. The 95% confidence interval for the ijth group is then

CI_{ij} = \bar{X}_{ij} \pm t_{.025}(\mbox{error } df)\,\sqrt{\frac{MSE}{N_{ij}}},   (7.7)
where Nij is the number of subjects in the ijth group.
Applying this to the data and plotting,
gmeans=tapply(dat$IQ,list(dat$Ginko,dat$Music),mean)
size=qt(.025,36)*sqrt(284.2/10)
g=barplot(gmeans,beside=T,ylim=c(70,130),xpd=F)
errbar(g,gmeans,size,.3,1)
The resulting plot is shown in Figure 7.6. Two things are apparent from this
plot. First, with only 10 participants in each group, our ability to localize
means with 95% confidence is limited; the confidence intervals are fairly wide.
Second, the mean of the two dark bars (no ginko) is less than the mean of the
light bars (ginko treatment). The pattern of means is similar to the pattern
of medians from the boxplot in Figure 7.5.
Figure 7.6: Barplot of the IQ data with 95% confidence intervals.
7.5
Regression

7.5.1
Ordinary Least-Squares Regression
Regression is perhaps the most popular method of assessing the relationships
between variables. We provide a brief example of how to do regression in R.
The example comes from an experiment by Gomez, Perea & Ratcliff (in
press) who asked participants to perform a lexical decision task. In this task,
participants decide whether a string of letters is a valid English word. For
example, the string cafe is a valid English word while the string mafe is not. In this task, it has been well established that responses to common words are faster than responses to rare words. The goal is to explore some of the possible functional relationships between word frequency and response time (RT). Word frequency is a measure of the number of times a word occurs for every million words of text in magazines (Kucera & Francis, 1968).
Gomez et al. collected over 9,000 valid observations across 55 participants
and 400 words. The basic data for this example are the mean RTs for each
of the 400 words computed by averaging across participants. The following
code loads the data and draws a scatter plot of RT as a function of word
frequency. Before running it, download the file regression.dat and set it in your working directory.
dat=read.table('regression.dat',header=T)
colnames(dat) #returns the column names of dat
plot(dat$freq,dat$rt,cex=.3,pch=20,col='red')
There are four hundred points, with each coming from a different word.
From the scatter plot in the left panel of Figure 7.7, it seems evident that as
frequency increases, RT decreases.
The linear regression model is

RT_j ∼ Normal(β_0 + β_1 f_j, σ^2).   (7.8)

RT_j and f_j denote the mean RT to the jth item and the frequency of the jth item, respectively. Parameters β_0 and β_1 are the intercept and slope of the best-fitting line, respectively. The model is written equivalently as

RT_j = β_0 + β_1 f_j + ε_j,

where ε_j is taken to be independent and identically distributed normal error centered at zero; i.e.,

ε_j ∼ Normal(0, σ^2).

Figure 7.7: Response time as a function of word frequency (left panel) and log word frequency (right panel).

The standard approach to estimating this model is based on least-squares regression (Pedhazur). The code for analysis is
g=lm(dat$rt~dat$freq)
summary(g)
abline(g)
The first line fits the regression model. The command lm() stands for "linear model," and is extremely flexible. The argument is the specification of the model. The output is stored in the object g. The command summary(g) returns a summary of the fit. Of immediate interest are the estimates β̂_0 and β̂_1, which are .753 seconds and -.008 seconds per frequency unit, respectively. Also reported are standard errors and a test of whether these statistics are significantly different from zero. The model accounts for 21.4% of the total variance in the data (multiple R^2 = .214). The last line adds the regression line to the plot. This line has intercept β̂_0 and slope β̂_1.
Problem 7.5.1 (Parameter Estimates in Regression)
1. Are slope and intercept estimates biased? Consider the case in which we are evaluating the effects of behavioral therapy on IQ for autistic three-year-olds. Suppose for each participant, the change in IQ, denoted Y_i, is related to the number of days in therapy, denoted X_i, as follows:
Y_i ∼ Normal(1 + .05X_i, σ^2 = 2).
Let’s assume that an experimenter observes 50 children. Let’s
assume that the time in therapy, Xi , is distributed as a normal:
Xi ∼ Normal(100, 10).
To answer the question, simulations should proceed as follows.
For each replicate experiment, first draw Xi values for all 50 children from a normal. Then, for each value Xi , draw a value of Yi
from the above regression equation. Then, analyze the 50 pairs
of (Xi , Yi ) with lm(). Save the intercept and slope estimates.
Repeat the process until you have a reasonable estimate of the sampling distribution of the slope and intercept (a sketch of this simulation loop appears after this problem).
2. The results should indicate that the slope and intercept estimators are unbiased. Does the bias depend on the distribution of
Xi ? Assess the bias when Xi is distributed as a binomial with
N = 120 and p = .837.
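A minimal sketch of the simulation loop for part 1 (ours, not from the original text; it assumes the book's Normal(µ, σ^2) convention, so Normal(100, 10) is drawn with standard deviation sqrt(10)):

set.seed(2)
R=1000                                  #number of replicate experiments
b0=b1=numeric(R)
for (r in 1:R)
{
  X=rnorm(50,mean=100,sd=sqrt(10))      #days in therapy for 50 children
  Y=rnorm(50,mean=1+.05*X,sd=sqrt(2))   #IQ change from the regression equation
  fit=lm(Y~X)
  b0[r]=coef(fit)[1]                    #intercept estimate
  b1[r]=coef(fit)[2]                    #slope estimate
}
mean(b0)                                #compare to the true intercept, 1
mean(b1)                                #compare to the true slope, .05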
7.5.2
Nonparametric Smoothing
Although the regression line appears reasonable for the Gomez et al. data, there may be more parsimonious models. One sensible method of checking the validity of a regression line is to fit a nonparametric regression model. A nonparametric model does not assume a particular form for the data, but instead fits a smooth curve. One nonparametric regression model is lowess (Cleveland, 1981). In a small interval, lowess fits the points with a polynomial regression line. Points further away from the center of the interval are weighted less than those near the center. These polynomial fits are done across the range of the points, and a kind of "average" line is constructed. In this way, a fit to the points is generated without recourse to parametric assumptions.
In R, lowess is implemented by the commands lowess() or loess(). The syntax of the former is somewhat more convenient and we use it here. The nonparametric smoothing line may be added to the plot by
lines(lowess(dat$freq,dat$rt),lty=2)
The nonparametric smooth is the dotted line (and is produced with the
option lty=2). The regression line overestimates the nonparametric smooth
in the middle of the range while underestimating it in the extremes. The
nonparametric smooth is a more faithful representation of the data. The
discordance between the nonparametric line and the regression line indicates
a misfit of the regression model.
RT is frequently modeled as a function of the logarithm of word frequency, e.g.,

RT_j ∼ Normal(β_0 + β_1 \log f_j, σ^2).
Figure 7.7, right panel, shows this relationship along with the regression line and the lowess nonparametric smooth. The figure is created with the following code:
logfreq=log(dat$freq)
plot(logfreq,dat$rt,cex=.3,pch=20,col='red')
w=seq(2,20,4) #tick marks for axis drawn in next statement
axis(3,at=log(w),labels=w) #create axis on top
g=lm(dat$rt~logfreq)
summary(g)
abline(g)
lines(lowess(logfreq,dat$rt),lty=2)
The new element in the code is the use of an axis on top of the figure, and
this is done through the axis() command. The intercept is .775 seconds
and the slope is -.051 seconds per log unit of frequency. The slope indicates how many seconds faster RT is when word frequency is multiplied by e, the base of the natural logarithm (e ≈ 2.72).
This interpretation of slope is awkward. If the slope is multiplied by log 2, the interpretation is the amount of time saved for each doubling of word frequency. In this case, the amount is .035 sec. The model accounts for 25.0% of the variance, which is somewhat more than the 21.4% accounted for by the linear model. Moreover, the regression line is closer to the nonparametric smooth for this model than it is for the linear model, indicating an improved fit.
7.5.3
Multiple Regression
In the lexical decision application, Gomez et al. manipulated two other factors that may affect RT: word length and neighborhood size. Words were either 4 or 5 letters in length. The neighborhood of a word refers to all words that look very similar. It is conventionally operationalized as the collection of words that differ from the target by only one letter; e.g., hot is in the neighborhood of hat. The size of the neighborhood for a word refers to the number of neighbors. For Gomez et al.'s words, these varied from 0 to 26. Researchers are often interested in how the three variables account for RT. This problem is termed a multiple regression problem as there are multiple predictors of the dependent measure. The model is

RT_i ∼ Normal(β_0 + β_1 f_i + β_2 n_i + β_3 δ(l_i − 5), σ^2),   (7.9)

where f_i, n_i, and l_i are the frequency, neighborhood size, and length of the ith word, and δ(l_i − 5) is an indicator that the word is 5 letters long.
This can be accomplished in R with the following code:
dat$length=as.factor(dat$length)
summary(lm(rt~freq+length+neigh,data=dat))
Interpretation of the parameters is analogous to that in the simpler one-predictor regression case. For details on multiple regression, consult any introductory statistics text.
7.5.4
Maximum Likelihood
The maximum likelihood approach provides very similar estimates to the least-squares approach for ordinary and multiple regression. The likelihood for the multiple regression model is constructed in the same way as in Section 7.1, with the mean β_0 + β_1 f_i + β_2 n_i + β_3 δ(l_i − 5) replacing µ.
Problem 7.5.2 (Your turn)
• The linear and logarithm models of the effect of word frequency
on RT are not nested. Use the AIC statistic to compare them.
• Consider the power model:
RT_j ∼ Normal(α + β f_j^γ, σ^2).
Estimate the parameters with maximum likelihood. Compare
this model with the log and linear models through AIC. Note,
this model is a generalization of the linear one. The linear model
holds if γ = 1. Check if the linear model restriction holds with a
likelihood ratio test.
7.5.5
A z-ROC Signal-Detection Application
One advantage of the ML approach over the least-squares approach is that it is more flexible. Many advanced normal models, such as hierarchical linear models and structural equation models, are analyzed with likelihood-based techniques. Here we show the superiority of ML over a more common least-squares approach for the analysis of the free-variance signal detection model.
In Section x.x, we discussed z-ROCs as a common graphical representation for analyzing the signal detection model. The z-ROC plot is a plot of Φ^{-1}(p̂_h) as a function of Φ^{-1}(p̂_f). According to the free-variance model, p_h = Φ((d′ − c)/σ) and p_f = Φ(−c). Substituting yields

Φ^{-1}(p_h) = \frac{d′}{σ} + \frac{Φ^{-1}(p_f)}{σ}.

This last equation describes a line with slope 1/σ and intercept d′/σ. Based on this fact, researchers have used linear regression of z-ROC plots to estimate σ and d′ (e.g., Ratcliff, Sheu, & Gronlund, 1993). Specifically,

d̂′ = b̂_0 / b̂_1,
σ̂ = 1 / b̂_1,

where b̂_0 and b̂_1 are the estimated intercept and slope of the regression line.
To see how the regression technique is implemented, consider an experiment with a payoff manipulation. Payoff is manipulated through three levels, and for each level there are 100 signal and 100 noise trials. The numbers of hits across the three conditions are 69, 89, and 98, respectively; the numbers of false alarms are 12, 23, and 38, respectively. The following code plots a z-ROC, fits a line, and provides estimates of d′ and σ.
hit.rate=c(69,89,98)/100
fa.rate=c(12,23,38)/100
z.hit=qnorm(hit.rate)
z.fa=qnorm(fa.rate)
plot(z.fa,z.hit)
g=lm(z.hit~z.fa)
abline(g)
summary(g)
coef=g$coefficients #first element is intercept, second is slope
est.sigma=1/coef[2]
est.dprime=coef[1]/coef[2]
Estimates of the intercept and slope are 2.58 and 1.79, respectively; corresponding estimates of d′ and σ are 1.44 and .558, respectively.
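The ML method referred to below is developed in Section x.x and is not reproduced in the original text here. As a rough sketch of how it could be implemented for the free-variance model (ours, under the model p_h = Φ((d′ − c)/σ), p_f = Φ(−c)), one can maximize the joint binomial likelihood of the hit and false-alarm counts:

nll.sdt=function(par,h,f,N)
{
  dprime=par[1]
  sigma=exp(par[2])              #log parameterization keeps sigma positive
  crit=par[-(1:2)]               #one criterion per payoff condition
  ph=pnorm((dprime-crit)/sigma)  #predicted hit rates
  pf=pnorm(-crit)                #predicted false-alarm rates
  -sum(dbinom(h,N,ph,log=TRUE)+dbinom(f,N,pf,log=TRUE))
}
h=c(69,89,98)                    #hit counts from the example above
f=c(12,23,38)                    #false-alarm counts
est=optim(c(1,0,qnorm(1-f/100)),nll.sdt,h=h,f=f,N=100)
est$par[1]                       #ML estimate of d'
exp(est$par[2])                  #ML estimate of sigma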
We use the simulation method to assess the accuracy of this approach vs.
the ML approach. Consider a five-condition payoff experiment with true d′ =
1.3, true criteria c = (.13, .39, .65, .91, 1.17) and true σ = 1.2. Assume there
are 100 noise and 100 signal trials for each of the five conditions. For each
replicate experiment, data were generated and then d′ and σ were estimated
two ways: 1. from least squares regression, and 2. from the maximum
likelihood method discussed in Section x.x. Figure 7.8 shows the results.
The top panel shows a histogram of estimates of d′ for the two methods. The
likelihood estimators are more accurate; the regression method is subject to
an overestimation bias. The middle panel shows the same for σ; once again,
the likelihood method is more accurate. These differences in bias are not
severe and it is reasonable to wonder if they are systematic. The bottom row
shows scatter plots. The x-axis value is the likelihood estimate for a replicate
experiment; the y-axis value is the regression method estimate. For every
replicate experiment, the regression method yielded greater estimates of d′ and σ, showing that there is a systematic difference between the methods. In sum, the ML method is more accurate because it does not suffer the same degree of systematic overestimation.
Problem 7.5.3 (Your Turn)
Do the simulation described in the previous paragraph, then make a
figure similar to Figure 7.8 to present the results.
Why are the likelihood estimates better than the regression ones? The regression method assumes that the regressor variable, the variable on the x-axis, is known with certainty. Consider the earlier Your Turn example in which we regressed IQ gain (denoted Y_i) onto the number of days of treatment (denoted X_i) for autistic children. The x-axis variable, time in treatment, is, for each participant, a constant that is known to exact precision. This statement holds true regardless of the overall distribution of these times. This precision is assumed in the regression model. Let's explore a violation. Suppose, due to some sloppy bookkeeping, we had error-prone information about how long each individual was in therapy. Let X′_i be the length recorded, and let's assume X′_i ∼ Normal(X_i, 5). We wish to recover the slope and intercept when we regress Y onto X. Since we don't have the true values X_i, we use our error-prone values X′_i instead. You will show, in the following problem, that the estimates of slope and intercept are biased. The bias in slope is always toward zero; that is, estimated slopes are always flattened with respect to the true slope.
Let's return to the signal detection problem. Let's say we wanted to study subliminal priming. Subliminal priming is often operationalized as priming in the absence of prime detectability (cite). For instance, imagine a paradigm where trials are constructed as in Figure 7.9. The task is to determine whether a given number is less than or greater than 5.
Figure 7.8: A comparison of maximum likelihood and least squares methods of estimating signal detection parameters. The top row shows histograms of 1000 estimates of d′ with ML and LS, respectively; the bottom row shows 1000 estimates of σ with ML and LS. Blue (solid) vertical lines represent mean estimates, and red (dashed) vertical lines represent true values.
Figure 7.9: The sequence of events in a trial of the priming paradigm: a foreperiod (578 ms), a forward mask (#####; 66 ms), a prime (e.g., 2; 22 ms), a backward mask (#####; 66 ms), and a target (e.g., 6; 200 ms); the trial then continues until a response is made.
With the sequence in Figure 7.9, there are two possible types of trial: participants can be instructed to respond to the prime or to the target. When participants respond to the target, the prime can actually affect the decision about the target. If both numbers are on the same side of 5, responses are faster than if they are not.
In a subliminal priming paradigm, the primes are displayed so briefly that it is difficult to see them. If there is a priming effect in the target task when detectability in the prime task is 0, we call this subliminal priming. Although this is conceptually straightforward, actually establishing that detectability is 0 is difficult.
One method proposed by Greenwald et al. (cite) uses linear regression to establish subliminal priming. In this method, priming effects are regressed onto detectability (typically measured by d′), and a nonzero y-intercept is taken as evidence for subliminal priming. An example of this is shown in Figure 7.10.
Figure 7.10: Priming effect (in RT) as a function of prime detection (d′).
Problem 7.5.4 (Your Turn)
Let’s examine the effects of a random predictor on the regression
method in subliminal priming through simulation. In order to do this,
we will assume that there is no subliminal priming; instead assume that the relationship between detectability and priming effect is

γ_i = d′_i,   (7.10)
d′_i ≥ 0.   (7.11)
In this case, the priming effect for the ith subject is linearly related to
detectability. Follow these steps:
1. Sample 15 true d′ from a normal with mean .5 and standard
deviation .3. Set any negative d′ values to 0. Assume that c =
d′ /2.
2. Sample each participant’s priming effect from a normal with mean
γi and standard deviation .1. Call these γ̂i .
3. Sample 50 signal and 50 noise trials for each person, and estimate
d′ . Call these d̂′i
4. Use linear regression to estimate the slope and intercept predicting γ̂i from d̂′i . Save the slope and intercept estimates and the p
values. Repeat these steps 1000 times.
5. Are the slope and intercept estimates biased? What is the true
type-I error rate for the intercept if you use p < .05 to reject the
null? What does this mean for inferences of subliminal priming?
Several of the examples in this chapter have involved specific instances where traditional models such as least-squares regression may provide biased
estimates or incorrect inferences. This focus is not intended to dissuade
researchers from using traditional techniques. Rather, we believe that they
are important tools in research. It is important, however, to use the proper
tool in the proper situation. Whenever you decide to use a model, be sure
that you understand the assumptions of the model and ask whether they are
appropriate for your application. Sometimes equal-variance normal errors
are not appropriate, and a more sophisticated technique is warranted. Other
times, ANOVA or regression will serve well. It is up to the researcher to
make an informed decision regarding the best analysis for the situation.
Chapter 8
Response Times
The main emphasis of this chapter and the next is the modeling of response time. Response time may be described as the time needed to complete a specified task. Response time typically serves as a measure of performance. Stimuli that yield lower response times are thought to be processed with more mental facility than those that yield higher response times. In this sense, response time is often used analogously to accuracy, with the difference being that a lower RT corresponds to better performance. Although this direct use of RT is the dominant mode, there are a growing number of researchers who have used other characteristics of RT to draw more detailed conclusions about processing. This chapter and the next provide some of the material needed to study RT models. The goal of this chapter is to lay out statistical models of response time; the goal of the next is to briefly introduce a few process models.
8.1
Graphing Distributions
Response time differs from response choice in that it is modeled as a continuous random variable. It is therefore helpful to consider methods of examining distributions of continuous data. Consider the simple case in which a single participant observes stimuli in two experimental conditions. We model the data in each condition as a sequence of independent and identically distributed random variables. For the first condition, the observations may be modeled as realizations of a sequence of independent and identically distributed random variables X_1, .., X_N; e.g.,

X_1, .., X_N \stackrel{iid}{\sim} X,

where N is the number of observations. Likewise, for the second condition, the data may be modeled as

Y_1, .., Y_M \stackrel{iid}{\sim} Y.
The question at hand is to compare the distributions with graphical methods. Data in the first condition are given as
x1=c(0.794, 0.629, 0.597, 0.57, 0.524, 0.891, 0.707, 0.405, 0.808, 0.733,
0.616, 0.922, 0.649, 0.522, 0.988, 0.489, 0.398, 0.412, 0.423, 0.73,
0.603, 0.481, 0.952, 0.563, 0.986, 0.861, 0.633, 1.002, 0.973, 0.894,
0.958, 0.478, 0.669, 1.305, 0.494, 0.484, 0.878, 0.794, 0.591, 0.532,
0.685, 0.694, 0.672, 0.511, 0.776, 0.93, 0.508, 0.459, 0.816, 0.595)
The data in the second condition are given as
x2=c(0.503, 0.5, 0.868, 0.54, 0.818, 0.608, 0.389, 0.48, 1.153, 0.838,
0.526, 0.81, 0.584, 0.422, 0.427, 0.39, 0.53, 0.411, 0.567, 0.806,
0.739, 0.655, 0.54, 0.418, 0.445, 0.46, 0.537, 0.53, 0.499, 0.512,
0.444, 0.611, 0.713, 0.653, 0.727, 0.649, 0.547, 0.463, 0.35, 0.689,
0.444, 0.431, 0.505, 0.676, 0.495, 0.652, 0.566, 0.629, 0.493, 0.428)
Figures 8.1 through 8.4 show several different methods of plotting the data. Each of these methods shows a different facet of the data. We highlight the relative advantages and disadvantages of each. In exploring data, we often find it useful to make multiple plots of the same data so that we may more fully understand them.
8.1.1
Box Plots
Figure 8.1 shows box plots of the data in both conditions. Box plots are a very good way of gaining an initial idea about distributions. The thick middle line of the plot corresponds to the median; the box corresponds to the 25th and 75th percentiles. The height of this box, the distance between the 25th and 75th percentiles, is called the interquartile range.
Figure 8.1: Boxplots of two distributions.
The whiskers extend to the most extreme point within 1.5 times the interquartile range past the box. Observations outside the whiskers are denoted with small circles; these should be considered extreme observations. One advantage of box plots is that several distributions can be compared at once. For the displayed plots, it may be seen that the RT data in the second condition are quicker and less variable than those in the first condition. Boxplots are drawn in R with the boxplot() command; Figure 8.1 was drawn with the command boxplot(x1,x2,names=c("Condition 1","Condition 2"),ylab="Response Time (sec)").
8.1.2
Histograms
The top row of Figure 8.2 shows histograms of the distributions for each condition separately. We have plotted relative-area histograms as these converge to probability density functions (see Chapter 4). The advantage of histograms is that they provide a detailed and intuitive approximation of the density function. There are two main disadvantages to histograms. First, it is difficult to draw more than one distribution's histogram per plot. This fact often makes it less convenient to compare distributions with histograms than with other graphical methods. For example, the comparison between the two distributions is easier with two boxplots (Figure 8.1) than with two histograms. The second disadvantage is that the shape of the histogram depends on the choice of boundaries for the bins. The bottom row shows what happens if the bins are chosen too finely (panel C) or too coarsely (panel D). An alternative approach to histograms is advocated by Ratcliff (1979), who chooses bins with equal area instead of equal width.
198
2.0
CHAPTER 8. RESPONSE TIMES
B
2.0
0.0
1.0
Density
1.0
0.0
0.5
Density
1.5
3.0
A
0.4
0.6
0.8
1.0
1.2
1.4
0.4
0.6
1.2
1.4
1.5
0.5
1.0
D
0.0
2
4
Density
6
C
0
Density
1.0
Response Time
8
Response Time
0.8
0.4
0.6
0.8
1.0
1.2
1.4
Response Time
0.4
0.6
0.8
1.0
1.2
Response Time
Figure 8.2: A & B: Histograms for the first and second conditions, respectively. C & D: Histogram for Condition 1 with bins chosen too finely (5 ms)
and too coarsely (400 ms), respectively.
An example for Condition 1 is shown in Figure 8.3. Here, each bin has area .2 but the width changes with height. Equal-area histograms may be drawn in R by setting the bin widths with the quantile() command. The figure was drawn with hist(x1,prob=T,main="",xlab="Response Time",xlim=c(.3,1.5), breaks=quantile(x1,seq(0,1,.2))).

Figure 8.3: Equal-area alternative histogram for Condition 1.
8.1.3
Smoothed Histograms
Figure 8.4 (left) provides an example of smoothed or kernel-density histograms. One heuristic for smoothing is to consider a line drawn smoothly between the midpoints of each bin. The key is the smoothness; the line is constrained to change gradually. Smoothed histograms are drawn by a more complex algorithm, but the heuristic is sufficient for our purposes. The advantage of smoothed histograms is that they approximate density functions and that several may be presented on a single graph, facilitating comparison. The disadvantage is that smoothed histograms always have some degree of distortion when compared to the true density function. They tend to be a bit more variable than the true density and they cannot capture sharp changes. The density() function smooths histograms. The plot in the left panel is drawn with plot(density(x1),xlim=c(.2,1.5),lty=2,lwd=2,main="",xlab="Response Time (sec)") and the line for the second condition is added with lines(density(x2),lwd=2).

Figure 8.4: Left: Smoothed histograms. Right: Empirical cumulative distribution functions.
8.1.4
Empirical Cumulative Distribution Plots
Figure 8.4 (right) provides an example of an empirical cumulative distribution
plot. These plots have several advantages. First, several distributions may be
displayed in a single plot, facilitating comparisons. Second, the analyst does
not need to make any choices as he or she does in drawing histograms (size
and placement of bins) or smoothed histograms (the amount of smoothing).
Third, empirical CDFs provide a more detailed display of the distribution
than box plots. The disadvantage is that these plots are not easy to interpret
at first. It takes experience to effectively read these graphs. We find that
after viewing several of these, they become nearly as intuitive as histograms.
Often, these graphs are very useful for exploring data and observing how
models fit and misfit data. Given these advantages, an empirical CDF plot
is a strong candidate for data representation in many contexts.
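The original text does not list the plotting commands for this panel; one reasonable way to produce it with base R's ecdf() function (a sketch, ours) is:

plot(ecdf(x1),verticals=T,do.points=F,lty=2,lwd=2,
     main="",xlab="Response Time (sec)",ylab="Probability")
plot(ecdf(x2),verticals=T,do.points=F,lwd=2,add=T)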
8.2
Descriptive Statistics About Distributions
In the previous section we described how to graphically display distributions.
There are also descriptive statistics used to describe distributions. In this
section we describe the two most common sets of these statistics: moments
and quantiles.
8.2.1
Moments
Moments are a standard set of statistics used to describe a distribution. The terminology is borrowed from mechanics, in which the moments of an object describe where its center of mass is, its momentum when spun, and how much wobble it has when spun. Moments are numbered; that is, we speak of the first moment, the second moment, etc. of a distribution. The first moment of a distribution is its expected value; i.e., the first moment of RV X is E(X) (see Eq. xx and yy). The first moment, therefore, is an index of the center, or middle, of a distribution, and it is estimated with the sample mean. Higher moments come in two flavors, raw and central. Raw moments are given as E(X^n), where n is an integer; the second and third raw moments, for example, are given by E(X^2) and E(X^3). Central moments are given as E[(X − E(X))^n]; they are taken about the mean rather than about zero.
Central moments are far more common than raw moments. In fact, central moments are so common that they are sometimes referred to simply as the moments, and we do so here. The second moment is given by E[(X − E(X))^2], which is also the variance. Variance indexes the spread or dispersion of a distribution and is estimated with the sample variance (Eq. ). The third and fourth moments are integral to the computation of skewness and kurtosis: skewness is defined as E[(X − µ)^3]/σ^3 and kurtosis as E[(X − µ)^4]/σ^4, where µ = E(X) and σ^2 is the variance.
Skewness indexes asymmetries in the shape of a distribution. Figure xx provides an example. The normal distribution (solid line) is symmetric around its mean. The corresponding skewness value is 0. The distributions with long right tails all have positive skewness values. If these distributions had been skewed in the other direction and had long left tails, the skewness would be negative.
Kurtosis is especially useful for characterizing symmetric distributions. A normal has a kurtosis of 3.
In R, the sample mean and variance are given by mean() and var(), respectively. Functions for skewness and kurtosis are not built into base R, but may be defined as follows:
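One simple pair of definitions (our sketch for the placeholder in the source; these use the plug-in, n-denominator moment estimators):

skew=function(x)
{
  m=mean(x)
  s=sqrt(mean((x-m)^2))   #n-denominator standard deviation
  mean((x-m)^3)/s^3       #third standardized moment
}
kurt=function(x)
{
  m=mean(x)
  s2=mean((x-m)^2)        #n-denominator variance
  mean((x-m)^4)/s2^2      #fourth standardized moment; equals 3 for a normal
}
skew(x1)                  #skewness of the Condition 1 RT data
kurt(x1)                  #kurtosis of the Condition 1 RT data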
8.2.2
Quantiles
In Chapter 3, we explored and defined quantile functions. One method of describing a distribution is to simply list its quantiles. For example, it is not uncommon to characterize a distribution by its .1, .3, .5, .7, and .9 quantiles. This listing is analogous to box plots and portrays, in list form, the same basic information.
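In R, such a listing is returned by the quantile() function; for example (a small illustration using the Condition 1 data):

quantile(x1,probs=c(.1,.3,.5,.7,.9))  #the .1, .3, .5, .7, and .9 quantiles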
8.3
Comparing Distributions
The main goal in analysis of RT is not to simply describe distributions but
to also compare them. We find it useful to consider three properties when
comparing distributions: location, scale, and shape:
8.3.1
Location
Figure 8.5 shows distributions that differ only in location. The left, center, and right panels show the effects of location changes on probability density functions, cumulative distribution functions, and quantile functions, respectively. The effect is most easily described with reference to probability density and
cumulative distribution functions: a location effect is a shift or translation
of the entire distribution. Location changes are easiest to see in the quantile functions; the lines are parallel over all probabilities. (In the figure, the
lines may not appear parallel, but this appearance is an illusion. For every
probability, the top line is .2 seconds larger than the bottom one.) It is
straightforward to express location changes formally. Let X and Y be two
random variables with density functions denoted by f and g, respectively,
and cumulative distribution functions denoted by F and G. If X and Y
differ only in location, then the following relations hold:
Y = X + a,
g(t) = f(t − a),
G(t) = F(t − a),
G^{-1}(p) = a + F^{-1}(p),
for location difference a. Location is not the same as mean. It is true that if two distributions differ only in location, then they differ in mean by the same amount. The opposite, however, does not hold. There are several changes that result in changes in mean besides location.
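These relations are easy to check numerically (a small illustration, ours): adding a constant a to the Condition 1 data shifts every quantile by exactly a.

y=x1+.2                                            #shift Condition 1 by a = .2 sec
quantile(y,c(.1,.5,.9))-quantile(x1,c(.1,.5,.9))   #each difference equals .2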
8.3.2
Scale
Scale refers to the dispersion of a distribution, but it is different from variance. Figure 8.6 shows three pairs of distributions that differ in scale. The top row shows two zero-centered normals that differ in scale. The panels, from left to right, show density, cumulative distribution, and quantile functions. For the normal, scale describes the dispersion from the mean (depicted with a vertical dotted line in the density plot). The middle row shows the case for a Weibull distribution. The Weibull is a reasonable model for RT as it is unimodal and skewed to the right; it is discussed subsequently. The depicted Weibull distributions have the same location and shape; they differ only in scale.
Figure 8.5: Distributions that differ in location are shifted. The plots show, from left to right, location changes in density, cumulative distribution, and quantile functions, respectively. The arrowed line at the right shows a difference of .2 sec, the distance separating the quantile functions for all probability values.
For the Weibull, scale describes the dispersion from the lowest value (depicted with a vertical line) rather than from the mean. The bottom row shows scale changes in an ex-Gaussian distribution, another popular RT distribution that is described subsequently. For this distribution the scale describes the amount of dispersion from a value that is near the mode (depicted with a vertical line). The quantile plot of the ex-Gaussian is omitted as we do not know of a convenient expression for this function.
Formally, if X and Y differ in location and scale, so that Y = a + bX with scale factor b > 0, then
g(t) = (1/b) f((t - a)/b),
G(t) = F((t - a)/b),
G^{-1}(p) = a + b F^{-1}(p).
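As a similar numerical check (our sketch, with arbitrary values a = .1 and b = 2 and a Weibull base distribution), the relation G^{-1}(p) = a + bF^{-1}(p) can be verified directly from the built-in quantile function:

a=.1
b=2
p=seq(.1,.9,.1)
qx=qweibull(p,shape=2,scale=.3)         #quantiles of X
qy=a+qweibull(p,shape=2,scale=b*.3)     #quantiles of Y = a + bX
qy-(a+b*qx)                             #zero up to rounding error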
8.3.3 Shape

Shape is a catch-all category that refers to any change in distributions that
cannot be described as a location or scale change. Figure 8.7 shows some
examples of shape changes. The left panel shows two distributions; the one
drawn with the dotted line is more symmetric than the one drawn with the solid line.
Figure 8.6: Distributions that differ in scale. The plots show, from left to right, scale changes in density, cumulative distribution, and quantile functions, respectively. The top, middle, and bottom rows show scale changes for the normal, Weibull, and ex-Gaussian distributions, respectively.
Figure 8.7: Distributions that differ in shape.
The center and right panels show more subtle shape differences. The center
panel shows the case in which the right tail of the dotted-line distribution
is stretched relative to the right tail of the solid-line distribution. There
is no comparable stretching on the left side of the distributions. Because
the stretching is inconsistent, the effect is a shape change and not a scale
change. The right panel shows two symmetric distributions. Nonetheless,
they are different, with the solid one having more mass in the extreme tails
and in the center. There is no stretching that can take one distribution into
the other; hence they have different shapes.
8.3.4 The Practical Consequences of Shape

In some sense, shape is the most important of the three properties in the
location-scale-shape trichotomy. From a statistical point of view, it often
makes little sense to compare the location and scale of distributions that
vary in shape. For example, it would make little sense to compare the effect
of scale across two conditions if the distribution in the first condition was
normal and in the second condition was exponential. The ordering of scales
has no meaning when the shape changes. Conversely, comparisons of scale
are licensed whenever there is shape invariance. From a psychological point of
view, this type of constraint makes sense. Shape serves as an indicator of
mental processing: distributions with different shapes may indicate a change
in strategy or cognitive architecture. For example, the distributions of RT
from serial processing and parallel processing often vary in shape. It would
be fairly meaningless to talk about changes in overall scale across conditions
which differ in architecture. If two distributions vary in shape, then this
variability warrants further exploration. Why and how they differ are important clues to how the conditions affect mental processing. If shape does
not change, then the analyst may study how scale changes across conditions.
8.4 Quantile-Quantile Plots

Quantile-quantile plots (also called QQ plots) are especially useful for assessing whether distributions differ in location, scale, or shape. QQ plots are
formed by plotting the quantiles of one distribution against those of another.
An example is constructed by considering the case presented in the beginning
of the chapter (page xx). Here a single participant provided 50 RTs in each
of two conditions. The data in the first and second condition were denoted
in R by vectors x1 and x2, respectively. The QQ plot is shown in Figure 8.8.
The first step in constructing it is sorting RTs from smallest to largest. For
each of the 50 observations there is a point. The first point (marked with
the lower of the three arrows) comes from the smallest RT in each condition;
the y-axis value is the smallest RT in the first condition, the x-axis value is
the smallest RT in the second condition. The next point is from the pair of
next-smallest observations, and so on. The graph is drawn by

range=c(.2,1.5)                               #common axis limits
par(pty="s")                                  #square plotting region
plot(sort(x2),sort(x1),xlim=range,ylim=range)
The assignment of distributions to axes in the QQ plot is a matter of personal
choice. We prefer to draw the larger distribution on the y-axis and will
maintain this convention throughout.
The reason this plot is called a quantile-quantile plot may not be immediately evident. Consider the median of each distribution. Because there are
50 observations in each condition, the median is estimated by the 25th observation. The upper arrow denotes the median from both conditions. This
point, therefore, is a median-median point. The middle arrow denotes the same
for the .25 quantile; the y-axis value is the estimate of the .25 quantile for the first
condition; the x-axis value is the estimate of the .25 quantile for the second
condition. Each point may be thought of likewise: the points are quantiles
of one distribution plotted against the corresponding quantiles of another.
Figure 8.8: Quantile-quantile (QQ) plot of two distributions. The lower, middle, and upper arrows indicate the smallest RT, the .25 quantile, and the median, respectively.
One aspect that simplifies the drawing of QQ plots in the above example
is that there are the same number of observations in each condition. QQ
plots can still be drawn when there are different numbers, and R provides
a convenient function, qqplot(). For example, suppose there were only 11
observations in the first condition: z=x1[1:11]. The QQ plot is drawn with
qqplot(x2,z). In this plot, there are 11 points. The function computes the
appropriate 11 values from x2 corresponding to the same quantiles as the 11
points in z.
QQ plots graphically depict the location-scale-shape loci of the effects of
variables. Figure 8.9 shows the relationships. The left column shows the
pdfs (top) of two distributions that differ in location. The resulting QQ plot
(bottom) is a straight line with a slope of 1.0. The y-intercept indexes the
degree of location change. The middle column shows the same for scale. The
QQ plot is a straight line and the slope denotes the scale change. In the figure,
the scale of the slow distribution is twice that of the fast one; the slope of
the QQ plot is 2. The right column shows the case for shape changes. If
shape changes, then the QQ plots are no longer straight lines and show some
degree of curvature. The QQ plot of the sample data x1 and x2 (Figure 8.8)
indicates that the primary difference between the conditions is a scale effect.
One drawback to QQ plots is that it is often visually difficult to inspect
small effects. Typical effects in subtle tasks, such as priming, are on the
order of 30 ms or less. QQ plots are not ideally suited to express such small
effects because each axis must encompass the full range of the distribution,
which often spans a second or more. The goal, then, is to produce
a graph that, like the QQ plot, is diagnostic for location, scale, and shape
changes, but for which small effects are readily apparent. The solution is
the delta plot, which is shown in Figure 8.10. Like QQ plots, these plots
are derived from quantiles. The y-axis in these plots is the difference between
quantiles; the x-axis is the average of quantiles. The R code to draw
delta plots is
p=seq(.1,.9,.1)                               #probabilities for the deciles
df=quantile(x1,p)-quantile(x2,p)              #difference between quantiles
av=(quantile(x1,p)+quantile(x2,p))/2          #average of quantiles
plot(av,df,ylim=c(-.05,.25),
ylab="RT Difference (sec)",xlab="RT Average (sec)")
abline(h=0)
axis(3,at=av,labels=p,cex.axis=.6,mgp=c(1,.5,0))  #cumulative probability on top axis
We prefer to also indicate the cumulative probability associated with an
average on delta plots. This is indicated by the top axis and drawn by the
last line in the above R code.
Delta plots provide the same basic information as QQ plots. The top-right panel shows the case when two distributions vary in location; the delta
plot is a straight horizontal line. The bottom-left panel shows changes in
scale; the straight line has positive slope. The bottom-right panel shows
changes in shape; the line has curvature.
Figure 8.9: QQ plots are useful for comparing distributions. Left: Changes in location affect the intercept of the QQ plot. Middle: Changes in scale affect the slope of the QQ plot. Right: Changes in shape add curvature to the QQ plot.

Figure 8.10: Delta plots are also useful for comparing distributions. Top-Left: Delta plot of two distributions. Top-Right: Location changes result in a straight horizontal delta plot. Bottom-Left: Changes in scale are reflected in an increasing slope in the delta plot. Bottom-Right: Changes in shape result in a curved delta plot.

The choice of the average quantile on the x-axis of the delta plot seems arbitrary.
A different choice might be the cumulative probability of the quantiles,
and this choice has been used in the literature (xx). It turns out there is a
justification for the use of average quantile values, and this justification is
demonstrated in Figure 8.11. First, note that the unfilled points are just the
QQ plot of the sample; their position is identical to Figure 8.8. The
solid points were obtained by simply rotating the figure through 45 degrees.
The new x-axis value of each point is (1/√2)(x1(p) + x2(p)); the new y-axis
value is (1/√2)(x1(p) - x2(p)), where x1(p) and x2(p) are the observed pth
quantiles for observations x1 and x2, respectively. Multiplying the new y-axis
value by √2 and dividing the new x-axis value by √2 yields the delta plot.

Figure 8.11: A demonstration that delta plots are rotated QQ plots.
Hence, delta plots retain all of the good features of QQ plots, but are placed
on a more convenient scale.
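A minimal numerical sketch of this equivalence (ours; x1 and x2 are the sample vectors used above):

p=seq(.1,.9,.1)
q1=quantile(x1,p)
q2=quantile(x2,p)
rot.x=(q1+q2)/sqrt(2)        #x-axis value after the 45-degree rotation
rot.y=(q1-q2)/sqrt(2)        #y-axis value after the 45-degree rotation
rot.y*sqrt(2)                #equals q1-q2, the delta-plot difference
rot.x/sqrt(2)                #equals (q1+q2)/2, the delta-plot average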
8.5 Stochastic Dominance and Response Times

One of the most useful concepts in the analysis of distributions is stochastic
dominance. If one distribution stochastically dominates another, then the
first is unambiguously bigger than the second. For example, the distributions
in Figure 8.5 show stochastic dominance; the one denoted by dotted lines
dominates the one denoted by solid lines. In contrast, the left and right
panels of Figure 8.7 show violations of stochastic dominance. In both panels,
neither of the compared distributions is unambiguously bigger than the other.
Formally, a random variable X with cumulative distribution function F stochastically dominates a random variable Y with cumulative distribution function G if F(t) ≤ G(t) for all t; equivalently, P(X > t) ≥ P(Y > t) for all t.

Stochastic dominance is implicitly assumed in most statistical tests. For
example, t-tests, ANOVA, and regression stipulate that observations are normally distributed with equal variance. Under the equal-variance assumption,
the specified relationship is a location change, which is always stochastically
dominant. Stochastic dominance even underlies nonparametric tests in statistics.
Without stochastic dominance, it is impossible to say that one distribution
is unambiguously greater than another.
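One rough empirical check of stochastic dominance (our sketch; x1 and x2 are the sample vectors from above) is to ask whether the quantile differences all share the same sign:

p=seq(.05,.95,.05)
d=quantile(x1,p)-quantile(x2,p)
all(d>=0)     #TRUE is consistent with x1 stochastically dominating x2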
Little attention has been paid to stochastic dominance in experimental
psychology. Most manipulations appear to obey stochastic dominance; if a
manipulation slows RT, it appears to do so across the distribution. For example, Brown and Wagenmakers note that RT mean and standard deviation
tend to covary in several tasks. Even so, it is conceivable that stochastic dominance may be violated in a few priming paradigms. Heathcote, Mewhort,
and Popiel report an intriguing example from a Stroop task. In this task
participants must name the color of the ink that forms the words. In the concordant condition, the word describes the color; e.g., the word green is written in
green ink, the word red is written in red ink, and so on. In the inconcordant
condition, the word and its color mismatch; e.g., the word green is written in
red ink. Response times are slower in the inconcordant than the concordant condition. Surprisingly, according to Heathcote, Mewhort, and Popiel (1991),
this effect may violate stochastic dominance. These researchers report that
concordancy speeds the fastest responses while slowing the slowest. This
type of effect, however, is most likely rare; at least it is rarely reported.
8.6 Parametric Models

The previous RT properties and methods are nonparametric, as no specific
form of the random variables was assumed. The vast majority of analyses are
parametric; that is, the researcher assumes a particular form. In this book, we
have covered in depth the binomial, multinomial, and normal distributions.
These distributions are ill-suited for modeling RT. We introduce a few more
common choices: the ex-Gaussian, Weibull, lognormal, gamma, and inverse
Gaussian. We focus more on how to fit these distributions than on a
discussion of their successes and failures in the literature. In the following
chapter, we focus on select substantive models of response time.
8.6.1 The Weibull

The three-parameter Weibull is a standard distribution in statistics with pdf

f(t; ψ, θ, β) = (β/θ) ((t - ψ)/θ)^{β-1} exp[-((t - ψ)/θ)^{β}],   θ, β > 0.
The parameters ψ, θ, and β are location, scale, and shape parameters, respectively. Figure 8.12 shows the effect of changing each parameter while
leaving the remaining ones fixed. This figure provides one rationale for the
Weibull distribution; it is a convenient means of measuring location, scale,
and shape. The location parameter ψ is the minimum value, and the scale
parameter θ describes the dispersion of mass above this point. The Weibull
is flexible with regard to shape. When the shape is β = 1.0, the Weibull
reduces to an exponential. As the shape parameter increases past 1.0, the
Weibull becomes less skewed. At β = 3.43, the Weibull is approximately
symmetric. In general, shapes that characterize RT distributions lie between
β = 1.2 and β = 2.5. The Weibull is stochastically dominant for effects
in location and scale but not for effects in shape. Therefore, the Weibull is especially useful for modeling manipulations that affect location and scale, but
not those that affect shape. The Weibull has a useful process interpretation
which we cover in the following chapter.
Maximum likelihood estimation of the Weibull is straightforward as R already has built-in Weibull functions (dweibull(), pweibull(), qweibull(),
rweibull()). The log likelihood of a single observation t at parameters
(ψ, θ, β) may be computed by dweibull(t-psi, shape=beta, scale=theta, log=T).
The log likelihood of a set of independent and identically distributed observations, such as x (page cc), is given by sum(dweibull(x-psi, shape=beta,
scale=theta, log=T)). Estimation of Weibull parameters is straightforward:

#par = psi,theta,beta
nll.wei=function(par,dat)
return(-sum(dweibull(dat-par[1],shape=par[3],scale=par[2],log=T)))
par=c(min(x)-.05,.3,2)    #start psi slightly below the smallest observation
optim(par,nll.wei,dat=x)
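As a quick sanity check (our sketch, not from the text), one may simulate Weibull data with known parameters and confirm that optim() approximately recovers them:

set.seed(1)
sim=.3+rweibull(500,shape=1.8,scale=.4)   #psi=.3, theta=.4, beta=1.8
start=c(min(sim)-.05,.3,2)                #start psi below the smallest observation
g=optim(start,nll.wei,dat=sim)
g$par                                     #should be near (.3, .4, 1.8)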
8.6.2 The Lognormal

The lognormal is used similarly to the Weibull. The pdf is

f(t; ψ, µ, σ²) = 1/((t - ψ) σ √(2π)) exp[-(log(t - ψ) - µ)² / (2σ²)],
where the parameters are (ψ, µ, σ²). Parameters ψ and σ² are location and shape
parameters, respectively. Parameter µ is not a scale parameter, but exp(µ) is.

Figure 8.12: Probability density functions for Weibull, lognormal, and inverse Gaussian distributions. Left: Changes in shift parameters; Middle: Changes in scale parameters; Right: Changes in shape parameters.

The reason for the use of µ and σ² in this context is as follows: Let
Z be a normal random variable with mean µ and variance σ². Let
T, the response time, be T = ψ + exp(Z). Then T is distributed as a lognormal
with parameters (ψ, µ, σ²). Figure 8.12 shows how changes in parameters
correspond to changes in location, scale, and shape. Like the Weibull, the
log normal is stochastically dominant in location and scale parameters, but
not in the shape parameter. Hence, like the Weibull, it is a good model
for shift and scale change, but not for shape change. The main pragmatic
difference between the log normal and the Weibull is in the left tail of the
distribution. The log normal left tail is more curved and gradual than the
Weibull’s. Because scale is given by exp µ, reasonable values of µ are often
negative. The values of Figure 8.12 range from µ = −1.4 to µ = 1. Values
of shape σ typically range from .4 to 1.0. The negative log-likelihood is given as
nll.lnorm=function(par,dat)
-sum(dlnorm(dat-par[1],meanlog=par[2],sdlog=par[3],log=T))
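Estimation then proceeds just as for the Weibull; a minimal sketch with assumed starting values (ψ slightly below the smallest observation, rough guesses for µ and σ) is:

par=c(min(x)-.05,-1,.5)
optim(par,nll.lnorm,dat=x)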
8.6.3 Inverse Gaussian

The inverse Gaussian, which is also known as the Wald, plays a prominent
role in substantive models of RT. Here, we focus on a location-scale-shape
parameterization given by

f(x; ψ, λ, φ) = √(λ/(2π)) (x - ψ)^{-3/2} exp[-((x - ψ)φ - λ)² / (2λ(x - ψ))].
Here, parameters (ψ, λ, φ) are location, scale, and shape. The following R
code provides the pdf and negative log-likelihood of the inverse Gaussian.
#2-parameter inverse gaussian
dig=function(x,lambda,phi) #density of IG
sqrt(lambda/(2*pi))*x^{-1.5}*exp(-(x*phi-lambda)^2/(2*lambda*x))
nll.ig=function(par,dat)
-sum(log(dig(dat-par[1],lambda=par[2],phi=par[3])))
Reasonable values for shape parameter φ range from 1 to 20 or so.
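A fit to data parallels the other distributions; the following is a minimal sketch with assumed starting values:

par=c(min(x)-.05,2,5)    #psi slightly below the smallest observation; rough lambda and phi
optim(par,nll.ig,dat=x)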
One feature of the inverse Gaussian is that it is stochastically dominant in
all three parameters. This feature is in contrast to the Weibull and lognormal,
which are only stochastically dominant in location and scale. Even though
this stochastic dominance seems to be an advantage, it is not necessarily so.
The inverse Gaussian does not change much in shape across wide ranges of
parameter settings. In fact, shape changes describe more subtle behavior of
the tail without much affecting the overall asymmetry of the distribution.
The consequence is that it is often difficult to estimate inverse Gaussian parameters. Figure ?? provides an illustration; here vastly different parameters
give rise to similar distributions. It is evident that it would take large sample sizes to distinguish the two distributions. The inverse Gaussian may be
called weakly identifiable because it takes large sample sizes to identify the
parameters.
8.6.4 ex-Gaussian
The ex-Gaussian is the most popular descriptive model of response time. It
is motivated by assuming that RT is the sum of two processes. For example,
Hohle (1965), who introduced the model, speculated that RT was the sum
of the time to make a decision and a time to execute a response. The first of
these two was assumed to be distributed as an exponential (see Figure 8.13),
the second was assumed to be a normal. The sum of an exponential and
a normal distribution is an ex-Gaussian distribution. The exponential component has a single parameter, τ , which describes its scale. The normal
component has two parameters: µ and σ². The ex-Gaussian, therefore, has
three parameters: µ, σ, and τ. Examples of the ex-Gaussian are provided
in Figure 8.13.
The ex-Gaussian pdf is given by

f(t; µ, σ, τ) = (1/τ) exp[µ/τ + σ²/(2τ²) - t/τ] Φ((t - µ - σ²/τ)/σ),

where Φ is the standard normal cumulative distribution function. It is programmed in R as
dexg<-function(t,mu,sigma,tau)
{
temp1=(mu/tau)+((sigma*sigma)/(2*tau*tau))-(t/tau)
temp2=((t-mu)-(sigma*sigma/tau))/sigma
(exp(temp1)*pnorm(temp2))/tau
}
Figure 8.13: Some stuff here.

With this pdf, ML estimates are easily obtained:

nll.exg=function(par,dat)
-sum(log(dexg(dat,mu=par[1],sigma=par[2],tau=par[3])))
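Because the ex-Gaussian is the sum of a normal and an exponential, simulated data are easy to generate, which provides a convenient check of the fit (our sketch with arbitrary true values):

set.seed(2)
sim=rnorm(500,mean=.4,sd=.05)+rexp(500,rate=1/.2)   #mu=.4, sigma=.05, tau=.2
g=optim(c(.4,.1,.1),nll.exg,dat=sim)
g$par                                               #should be near (.4, .05, .2)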
Hohle originally postulated a tight correspondence between mental processes and ex-Gaussian components. Hohle and Taylor tested this correspondence by performing selective influence experiments; unfortunately, no selective influence was identified. In modern examples, the ex-Gaussian is used
descriptively: parameter µ is a measure of central tendency, parameter σ indexes the extent of the left tail, and parameter τ indexes the extent of the
right tail. Most researchers find that manipulations that lengthen RT do so,
primarily, by extending τ. Changes in τ and µ obey stochastic dominance;
changes in σ violate it.
In the standard parameterization, parameters σ and τ are both shape
parameters for the ex-Gaussian and there is no scale parameter. Shape, however, is preserved if changes in σ affect τ to the same degree. For example, if
both τ and σ double, then shape is preserved. Hence, it is possible to define
a new parameter η = τ/σ as the shape of the ex-Gaussian. As η increases,
the distribution becomes more skewed. The three parameters of
the ex-Gaussian may then be expressed as (µ, σ, η). In this parameterization, the
parameters correspond to the location, scale, and shape of the distribution.
The bottom row of Figure 8.13 shows the effects of changing any one of these parameters. The distribution is stochastically dominant in location and shape but
not stochastically dominant in scale. We do not recommend the ex-Gaussian
for this reason: stochastically dominant scale changes tend to describe the effects of
many manipulations and participant variables, and the ex-Gaussian cannot accommodate them. The following code provides
location, scale, and shape estimates for the ex-Gaussian:
nll.exg.lss=function(par,dat)
-sum(log(dexg(dat,mu=par[1],sigma=par[2],tau=par[3]*par[2])))
8.7 Analysis Across Several Participants

8.7.1 Group Level Analysis
Most studies are concerned with analysis across several participants. Response times cannot simply be aggregated across people. To see the harm
of this practice, consider the example in Figure 8.14. The top row shows histograms of observations from two individuals. The bottom right shows the
case in which the observations are aggregated, that is, they are grouped together. The resulting distribution is more variable than those for the two
individuals. Moreover, it is bimodal whereas the distributions for the two
individuals are unimodal. In sum, aggregation greatly distorts RT distributions.

Figure 8.14: Some stuff here.
We address the question of how to draw a group-level distribution. Defining a group-level distribution is straightforward if shape is assumed not to
vary across participants and far more difficult otherwise. If shape is invariant,
then the jth participant's response time distribution, Tj, can be expressed as

Tj = ψj + θj Z,

where Z is a base random variable with a location of 0 and a scale of 1.0.
For example, if all participants had a Weibull distribution with shape β =
1.7, then Z would be distributed as a Weibull with density function
f(t) = 1.7 t^{0.7} exp(-t^{1.7}). In the case that all participants have the same shape, the
group RT, denoted by S, is defined as
S = ψ̄ + θ̄Z.
There are two methods for plotting group-level distribution S without
committing to a particular form of Z. The first is quantile averaging.1 To
quantile-average a set of observations, the researcher first estimates quantiles in each distribution. For example, if x1 and x2 are data from two
participants, then we might estimate the deciles of each; e.g.,

decile.p=seq(.1,.9,.1)
x1.d=quantile(x1,decile.p)
x2.d=quantile(x2,decile.p)

The next step is to average these deciles: group.d=(x1.d+x2.d)/2. Here
group.d are the deciles of the group distribution S that is defined above.
If we wish to know this distribution in greater detail, all we need to do
is increase the number of probabilities for which we compute quantiles.
While quantile averaging is popular, there is an easier but equivalent method
of drawing group RT distributions when assuming shape invariance. For
each participant, simply take the z-transform of all scores. For example,
z1=(x1-mean(x1))/sd(x1) and z2=(x2-mean(x2))/sd(x2). The z-transformed
data may be aggregated, multiplied by the average standard deviation, and
shifted by the average mean. For example,

mean.mean=mean(c(mean(x1),mean(x2)))
mean.sd=mean(c(sd(x1),sd(x2)))
s=mean.sd*(c(z1,z2))+mean.mean
hist(s)

1 A related method to quantile averaging is Vincentizing (cites). Instead of taking a standard estimate of a quantile, the researcher averages all observations in a certain range. For example, the .1 Vincentile might be the average of all observations above the .05 quantile and below the .15 quantile. We do not recommend Vincentizing on principle. Quantile estimators are well understood in statistics; Vincentiles have never been explored.
If shape is invariant, then the aggregated values in s are samples from the group distribution S
defined above. Quantile averaging and z-transforming yield identical results
in the limit of an increasing number of observations per participant and nearly
identical results with smaller sample sizes. For all practical purposes, they
may be considered equivalent.
The question of how to define a group-level distribution when shape
varies is difficult. A researcher can always perform quantile averaging or z-transforms, but the results are not particularly interpretable without shape
invariance. The problem is that if shape varies, the group-level distribution
will not belong to the same family of distributions as the individuals'. If researchers
suspect that shape changes across people, then specific parametric models
are more appropriate than quantile averaging or z-transforms. The bottom-left plot of Figure 8.14 shows the group-level distribution derived from z-transforms for samples x1 and x2.
A related question is whether group-level distributions are helpful in estimating group-level parameters in specific parametric forms. Rouder and
Speckman explored this question in detail. If shape does not vary across
participants, then there is no advantage to drawing group-level distributions
before estimation. The question is more complex when shape does change.
As the number of observations per participant increases, estimating each participant's parameters is advantageous. However, for small sample sizes, there
is stability in the group-level distribution that may lead to better parameter estimates. Rouder and Speckman show, for example, a large benefit of
quantile-averaging data when fitting a Weibull if there are fewer than 60-80
observations per participant.
8.7.2 Inference

In the previous section, we described how to draw group-level distributions.
These distributions are useful in exploring the possible loci of effects. In this
section, we discuss a different problem: how to draw conclusions about
the loci of effects. If the researcher is willing to adopt a parametric model,
such as a Weibull or ex-Gaussian, the problem is well formed. For example,
suppose previous research indicates that an effect is in parameter τ. We
describe how to use a likelihood ratio test for this example. We let Xijk
denote the response time for the ith participant (i = 1, . . . , I) observing a
stimulus in the jth condition (j = 1, 2) for the kth replicate (k = 1, . . . , K).
The general model is given by
Xijk ∼ ex-Gaussian(µij , σij , τij ).
An effect in τ may be tested against the restriction that τij does not vary
across conditions, i.e., τi = τi1 = τi2, yielding the restricted model

Xijk ∼ ex-Gaussian(µij, σij, τi).

The general model has 6I parameters; the restricted model has 5I parameters. Hence, the likelihood ratio test has I degrees of freedom. We provide the
code and an example of this type of test in the next section.
There is a strong caveat to this test. We model each participant's parameters as free to take on any value without constraint. For example, µij may
be any real value; σij and τij may be any positive real value. Although this approach
seems appropriate, it does not allow for generalization to new participants. In statistical parlance, each participant's parameters are treated as
fixed effects. The results of the likelihood ratio test generalize to repeated experiments with these specific participants. It is important to note that they
do not generalize to other participants. It is possible to generalize the model
such that inference can be made to new participants. The resulting model
is a hierarchical or multilevel one (cf. Rouder et al., 2003). The analysis of
hierarchical models is outside the scope of this book.
A second approach, which does provide for generalization to new participants, is to model the resulting parameter estimates. For example, it is conventional to see a t-test on ex-Gaussian parameters (cites). A paired t-test on
τ would imply the following null model:

τi2 − τi1 ∼ Normal(0, σ²).

Researchers using the t-test should be sensitive to the normality and equal-variance assumptions in the model. If these are not met, an appropriate
nonparametric test, such as the signed-rank test (cite), may be used.
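As a minimal sketch (ours; tau1 and tau2 are assumed to be vectors of per-participant τ estimates in the two conditions):

t.test(tau2,tau1,paired=T)        #paired t-test on the tau estimates
wilcox.test(tau2,tau1,paired=T)   #Wilcoxon signed-rank test as a nonparametric alternative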
8.8 An Example Analysis

The existence of several options for modeling RT necessitates that the analyst make choices. Perhaps the most productive approach is to let the data
themselves guide the choices. To illustrate, we consider the case in which 40
participants each read a set of nouns and verbs. The dependent measure is
the time to read a word, and each participant reads 100 of each type. The
main goal of analysis is to draw conclusions about the effect of nouns vs.
verbs.
Sample data are provided in file chapt6.dat. They may be read in with the
following command: dat=read.table('chapt6.dat',header=T). A listing
of dat will reveal 4 columns (subject, part-of-speech, replicate, response time)
and 4000 rows. In this example, we simulated the data from an inverse
Gaussian and built the effect of part-of-speech into scale. Therefore, we
hope to recover a scale effect and not shift or shape effects.
Had this set been obtained from participants, cleaning procedures would
have been needed to remove errors, premature responses, and inordinate
lapses in attention. We find it helpful to define the following vectors to aid
analysis:
sub=dat$sub
pos=dat$pos
rt=dat$rt
8.8.1 Analysis of Means

A quick analysis of means is always recommended. Figure 8.15, top row,
shows the relevant plots. The left plot is a line plot of condition means for
each participant. The following code computes the means with a tapply()
command and plots the resultant matrix with a matplot() command:

ind.m=tapply(rt,list(sub,pos),mean) #Matrix of participant-by-pos means
I=nrow(ind.m) #I: Number of participants
matplot(t(ind.m),typ='l',col='grey',lty=1,axes=F,xlim=c(.7,2.3),
ylab='Response Time (sec)')
One new element in this code is the axes=F option in the matplot() statement. This option suppresses the drawing of axes. When the axes are automatically plotted, values of 1 and 2 are drawn on the x-axis, which have
no relevance to the reader. We manually add a more appropriate x-axis
with the command axis(1,labels=c('Nouns','Verbs'),at=c(1,2)). The
"1" in the first argument indicates the x-axis. The y-axis and box are added
with axis(2) and box(), respectively. The final element to add is the group
means:

grp.m=apply(ind.m,2,mean)
lines(1:2,grp.m,lwd=2)
The termination points are added with matpoints(t(ind.m),pch=21,bg='red').
The right plot is a graphical display of the effect of part-of-speech. It shows a boxplot of each individual's
effect along with 95% CI error bars around the mean effect. This graph was
drawn with

effect=ind.m[,2]-ind.m[,1]
boxplot(effect,col='lightblue',ylab="Verb-Noun Effect (sec)")
#errbar() draws the error bar; it is a helper rather than a base R function
errbar(1.3,mean(effect),qt(.975,I-1)*sd(effect)/sqrt(length(effect)),.2)
points(1.3,mean(effect),pch=21,bg='red')
abline(0,0)

As can be seen, there is a 27 ms advantage for nouns; moreover, over 3/4 of
the 40 participants show a noun advantage. The significance of the effect may
be confirmed with a paired t-test, t.test(ind.m[,1],ind.m[,2],paired=T); for these data, t(39) = 6.03, p < .001.
8.8.2 Delta Plots

To explore the effect of part-of-speech on distributional relationships, we
drew a delta plot for each individual (Figure 8.15, bottom-left). We used a
loop to calculate deciles for each individual reading nouns and verbs and
stored the results in matrices noun and verb. The rows index individuals; the
columns index deciles:
decile.p=seq(.1,.9,.1)
noun=matrix(ncol=length(decile.p),nrow=I)
verb=matrix(ncol=length(decile.p),nrow=I)
for (i in 1:I)
{
noun[i,]=quantile(rt[sub==i & pos==1],p=decile.p)
verb[i,]=quantile(rt[sub==i & pos==2],p=decile.p)
}
Figure 8.15: Some stuff here.

ave=(verb+noun)/2
dif=verb-noun
The last two lines are matrices for the delta plot: the difference between
deciles and the average of deciles. These matrices of deciles may be drawn
with matplot(ave,dif,typ='l'). The result is a colorful mess. The
problem is that matplot() treats each column as a line to be plotted. We
desire a line for each participant, that is, for each row. The trick is to
transpose the matrices. The appropriate command, with a few options to
improve the plot, is

matplot(t(ave),t(dif),typ='l',
lty=1,col='blue',ylab="Difference in Deciles (sec)",
xlab="Average of Deciles (sec)")
abline(0,0)
These individual delta plots are too noisy to be informative. This fact
necessitates a group-level plot. We therefore assume that participants do not
vary in shape and quantile-average deciles across people. Deciles of the group-level distributions are given by
grp.noun=apply(noun,2,mean)
grp.verb=apply(verb,2,mean)
A delta plot from these group-level deciles is provided in the bottom-right
panel and drawn with
dif=grp.verb-grp.noun
ave=(grp.verb+grp.noun)/2
plot(ave,dif,typ='l',ylim=c(0,.05),
ylab="Difference of Deciles",xlab="Average of Deciles")
points(ave,dif,pch=21,bg='red')
abline(0,0)
The indication from this plot is that the effect of part-of-speech is primarily
in scale.
8.8.3 Ex-Gaussian Analysis

We fit the ex-Gaussian distributional model to each participant's data.
Xij ∼ ex-Gauss(µij , σij , τij ),
where i = 1, . . . , I indexes participants and j = 1, 2 indexes part of speech.
The code to do so is:
r=0
J=2   #number of part-of-speech conditions
est=matrix(nrow=I*J,ncol=7)
for (i in 1:I)
for (j in 1:J)
{
r=r+1
par=c(.4,.1,.1)
g=optim(par,nll.exg,dat=rt[sub==i & pos==j])
est[r,]=c(i,j,g$par,g$convergence,g$value)
}
The matrix est has seven columns: a participant label, a condition label,
ML estimates of µ, σ, and τ, an indication of convergence, and the minimized
negative loglikelihood. All optimizations should converge, and this facet may
be checked by confirming that sum(est[,6]) is zero. Figure 8.16 (left) provides
boxplots of the effect of part-of-speech on each parameter. The overall effects
on µ, σ, and τ are 9.4 ms, 1.5 ms, and 16.6 ms, respectively.
Researchers typically test the effects of the manipulation on a parameter
by a t-test. For example, the effect of part-of-speech on τ is marginally significant by a t-test (values here). This t-test approach is less principled than a likelihood
ratio test, as the researcher is forced to assume that the parameter estimates are
normally distributed and of equal variance. The likelihood ratio test provides a more principled alternative as it accounts for the proper sampling distributions of the
parameters. The above code provides estimates of the general model in
which all parameters are free to vary. The overall negative loglikelihood for
the general model may be obtained by summing the last column; it is xx.xx.
The restricted model constrains τ to be equal across conditions for each participant, i.e., Xij ∼ ex-Gauss(µij, σij, τi), with 5I parameters.
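The text does not list the restricted fit itself, so the following is our sketch of how it might be carried out, reusing dexg() and nll.exg(): each participant is fit once per condition under the general model and once jointly under the restriction of a common τ, and twice the difference in summed negative log-likelihoods is referred to a chi-square distribution with I degrees of freedom.

#negative log-likelihood of one participant's data under the restricted model
#par=(mu1,sigma1,mu2,sigma2,tau); dat1 and dat2 hold the RTs in the two conditions
nll.exg.res=function(par,dat1,dat2)
-sum(log(dexg(dat1,mu=par[1],sigma=par[2],tau=par[5])))-
sum(log(dexg(dat2,mu=par[3],sigma=par[4],tau=par[5])))

nll.gen=0
nll.res=0
for (i in 1:I)
{
g1=optim(c(.4,.1,.1),nll.exg,dat=rt[sub==i & pos==1])
g2=optim(c(.4,.1,.1),nll.exg,dat=rt[sub==i & pos==2])
gr=optim(c(.4,.1,.4,.1,.1),nll.exg.res,
dat1=rt[sub==i & pos==1],dat2=rt[sub==i & pos==2])
nll.gen=nll.gen+g1$value+g2$value
nll.res=nll.res+gr$value
}
lr=2*(nll.res-nll.gen)     #likelihood ratio statistic
1-pchisq(lr,df=I)          #p-value on I degrees of freedom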
Figure 8.16: Some stuff here.
One interpretation of these analyses is that the effect of part-of-speech
is primarily in τ (a shape parameter). This interpretation would be incorrect. We generated the data with only scale effects. The Achilles heel of
the ex-Gaussian is that scale effects violate stochastic dominance. The inverse Gaussian posits stochastically dominant scale effects. Figure ?? shows the
consequence. When scale changes, the effect on the ex-Gaussian is a large
change in τ and smaller changes in µ and σ. Indeed, this occurs with the
estimates from the data. Therefore, the ex-Gaussian is of limited use because
it frequently leads to the misinterpretation of scale effects as τ effects. We
believe this distribution should be used with great caution. Moreover, we
suspect that many of the reported effects in τ in the literature may in
fact be stochastically dominant scale effects.
We also performed a Weibull distribution analysis on each individual's
data. The results are shown in Figure 8.16 (right). The shift and shape effects
are insignificant; the scale effect is more significant than it appears. Scale
effects are multiplicative; hence the ratio is the best way to compare scales.
We computed the ratio of the scale of the verb RT distribution to the scale
of the noun RT distribution for each participant. These scales are shown in
Figure ??. Notice that the spacing is not equidistant; that is ... Significance is
most appropriately assessed on the log scale; in this case, we use the following
model:

log(θiV/θiN) ∼ Normal(µ, σ²),

where θiN and θiV are the ith participant's Weibull scale parameters in the noun and verb conditions, respectively.
The effect of part-of-speech on scale is significant ().
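A minimal sketch of this test (ours; theta.noun and theta.verb are assumed vectors holding each participant's estimated Weibull scale in the noun and verb conditions):

log.ratio=log(theta.verb/theta.noun)   #log scale ratios, one per participant
t.test(log.ratio)                      #tests whether the mean log ratio is zero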
Bibliography

[1] N. D. Anderson, F. I. M. Craik, and M. Naveh-Benjamin. The attentional demands of encoding and retrieval in younger and older adults: I. Evidence from divided attention costs. Psychology & Aging, 13:405–423, 1998.

[2] W. H. Batchelder and D. M. Riefer. Theoretical and empirical review of multinomial process tree modeling. Psychonomic Bulletin and Review, 6:57–86, 1999.

[3] R. Brent. Algorithms for Minimization without Derivatives. Prentice-Hall, Englewood Cliffs, N.J., 1973.

[4] A. Buchner, E. Erdfelder, and B. Vaterrodt-Plunneck. Toward unbiased measurement of conscious and unconscious memory processes within the process dissociation framework. Journal of Experimental Psychology: General, 124:137–160, 1995.

[5] F. R. Clarke. Constant ratio rule for confusion matrices in speech communications. Journal of the Acoustical Society of America, 29:715–720, 1957.

[6] T. F. Cox and M. A. A. Cox. Multidimensional Scaling, 2nd Edition. Chapman and Hall / CRC Press, Boca Raton, FL, 1994.

[7] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.

[8] S. Glover and P. Dixon. Likelihood ratios: A simple and flexible statistic for empirical psychologists. Psychonomic Bulletin & Review, 11:791–806, 2004.

[9] P. Graf and D. L. Schacter. Implicit and explicit memory for new associations in normal and amnesic subjects. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11:501–518, 1985.

[10] M. J. Hautus and A. L. Lee. The dispersions of estimates of sensitivity obtained from four psychophysical procedures: Implications for experimental design. Perception & Psychophysics, 60:638–649, 1998.

[11] J. L. Herman and E. Schatzow. Recovery and verification of memories of childhood sexual trauma. Psychoanalytic Psychology, 4:1–14, 1987.

[12] R. V. Hogg and A. T. Craig. Introduction to Mathematical Statistics. MacMillan, New York, 1978.

[13] X. Hu and W. H. Batchelder. The statistical analysis of general processing tree models with the EM algorithm. Psychometrika, 59:21–47, 1994.

[14] L. L. Jacoby. A process dissociation framework: Separating automatic from intentional uses of memory. Journal of Memory and Language, 30:513–541, 1991.

[15] G. Keren and S. Baggen. Recognition models of alphanumeric characters. Perception & Psychophysics, 29:234–246, 1981.

[16] A. N. Kolmogorov. Foundations of the Theory of Probability. Chelsea, New York, 1950.

[17] E. L. Lehmann. Theory of Point Estimation. Wadsworth, 1991.

[18] E. F. Loftus. The reality of repressed memories. American Psychologist, 48:518–537, 1993.

[19] R. D. Luce. Individual Choice Behavior. Wiley, New York, 1959.

[20] R. D. Luce. Detection and recognition. In R. D. Luce, R. R. Bush, and E. Galanter, editors, Handbook of Mathematical Psychology (Vol. 1). Wiley, New York, 1963.

[21] R. D. Luce. A threshold theory for simple detection experiments. Psychological Review, 70:61–79, 1963.

[22] R. D. Luce, R. M. Nosofsky, D. M. Green, and A. F. Smith. The bow and sequential effects in absolute identification. Perception & Psychophysics, 32:397–408, 1982.

[23] R. A. J. Matthews. Tumbling toast, Murphy's law, and fundamental constants. European Journal of Physics, 16:172–175, 1995.

[24] J. A. Nelder and R. Mead. A simplex method for function minimization. Computer Journal, 7:308–313, 1965.

[25] R. M. Nosofsky. Attention, similarity, and the identification-categorization relationship. Journal of Experimental Psychology: General, 115:39–57, 1986.

[26] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and F. P. Flannery. Numerical Recipes in C: The Art of Scientific Computing, 2nd Ed. Cambridge University Press, Cambridge, England, 1992.

[27] J. Rice. Mathematical Statistics and Data Analysis. Brooks/Cole, Monterey, CA, 1998.

[28] D. M. Riefer and W. H. Batchelder. Multinomial modeling and the measure of cognitive processes. Psychological Review, 95:318–339, 1988.

[29] J. N. Rouder. Absolute identification with simple and complex stimuli. Psychological Science, 12:318–322, 2001.

[30] J. N. Rouder. Modeling the effects of choice-set size on the processing of letters and words. Psychological Review, 111:80–93, 2004.

[31] J. N. Rouder and W. H. Batchelder. Multinomial models for measuring storage and retrieval processes in paired associate learning. In C. Dowling, F. Roberts, and P. Theuns, editors, Progress in Mathematical Psychology. Erlbaum, Hillsdale, NJ, 1998.

[32] J. N. Rouder and R. D. Morey. Relational and arelational confidence intervals: A comment on Fidler et al. (2004). Psychological Science, 16:77–79, 2005.

[33] J. N. Rouder, D. Sun, P. L. Speckman, J. Lu, and D. Zhou. A hierarchical Bayesian statistical framework for response time distributions. Psychometrika, 68:587–604, 2003.

[34] D. L. Schacter. Perceptual representation systems and implicit memory: Toward a resolution of the multiple memory systems debate. Erlbaum, Hillsdale, NJ, 1990.

[35] R. N. Shepard, A. K. Romney, and S. B. Nerlove. Multidimensional Scaling: Theory and Applications in the Behavioral Sciences: I. Theory. Seminar Press, Oxford, 1972.

[36] R. N. Shepard. Stimulus and response generalization: A stochastic model relating generalization to distance in a psychological space. Psychometrika, 22:325–345, 1957.

[37] J. G. Snodgrass and J. Corwin. Pragmatics of measuring recognition memory: Applications to dementia and amnesia. Journal of Experimental Psychology: General, 117:34–50, 1988.

[38] L. R. Squire. Declarative and nondeclarative memory: Multiple brain systems supporting learning and memory. In D. L. Schacter and E. Tulving, editors, Memory Systems 1994. MIT Press, Cambridge, MA, 1994.

[39] W. S. Torgerson. Theory and Methods of Scaling. Wiley, New York, 1958.

[40] A. Tversky. Elimination by aspects: A theory of choice. Psychological Review, 79:281–299, 1972.