ANIMAL BEHAVIOUR, 2005, 70, 571–578
doi:10.1016/j.anbehav.2004.12.009
State-dependent learning and suboptimal choice: when
starlings prefer long over short delays to food
LORENA POMPILIO & ALEX KACELNIK
Department of Zoology, University of Oxford
(Received 30 July 2004; initial acceptance 3 September 2004;
final acceptance 22 December 2004; published online 11 July 2005; MS. number: 8228)
Recent studies have used labels such as ‘work ethics’, ‘sunk costs’ and ‘state-dependent preferences’ for
apparent anomalies in animals’ choices. They suggest that preference between options relates to the
options’ history, rather than depending exclusively on the expected payoffs. For instance, European
starlings, Sturnus vulgaris, trained to obtain identical food rewards from two sources while in two levels of
hunger preferred the food source previously associated with higher hunger, regardless of the birds’ state at
the time of testing. We extended this experimentally and theoretically by studying starlings choosing
between sources that differed not only in history but also in the objective properties (delay until reward) of
the payoffs they delivered. Two options (PF and H) were initially presented in single-option sessions when
subjects were, respectively, prefed or hungry. While option PF offered a delay until reward of 10 s in all
treatments, option H delivered delays of 10, 12.5, 15 and 17.5 s in four treatments. When training was
completed, we tested preference between the options. When delays in both options were equal (10 s), the
birds strongly preferred H. When delay in H was 17.5 s, the birds were indifferent, with intermediate
results for intermediate treatments. Preference was not mediated by disrupted knowledge of the delays.
Thus, preferences were driven by past state-dependent gains, rather than by the joint effect of the birds’
state at the time of choice and knowledge of the absolute properties of each alternative, as assumed in
state-dependent, path-independent models of optimal choice.
© 2005 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.
Several recent studies of avian decision making have dealt
with two related and somewhat counterintuitive observations. The first is that when subjects are initially trained to
obtain equally sized rewards after either hard or light
work, and are then offered a choice without the work
requirement, they prefer the reward previously associated
with harder work (Clement et al. 2000; Kacelnik & Marsh
2002; originally reported by Marsh 1999). The second is
that when subjects are trained with equal rewards in
different states of deprivation, and are then offered
a choice, they prefer the reward previously associated
with greater deprivation (Marsh et al. 2004). Enhanced
preference for stimuli associated with greater need is
probably far from restricted to birds: locusts, Locusta
migratoria, too prefer a signal (an odour in this case) that
has been previously paired with food encountered under
a state of greater need, regardless of state at the time of
testing (Pompilio 2004; L. Pompilio, A. Kacelnik & S. T. Bahmar, unpublished data). Here we discuss the findings on birds, propose some theoretical clarifications, and present an experimental study aimed at establishing the strength of the effect and the relevance of previously proposed theoretical models.
Correspondence: A. Kacelnik, Department of Zoology, Oxford University, South Parks Road, Oxford OX1 3PS, U.K. (email: alex.kacelnik@zoo.ox.ac.uk).
Clement et al. (2000) referred to the first phenomenon
as ‘work ethic’, but the label should not be seen to imply
that the birds would prefer to work harder, as this was not
the case: the preference was between the outcomes of the
two levels of work and not between the levels of work
themselves. Kacelnik & Marsh (2002) instead related their
findings to the notions of ‘sunk costs’ and ‘Concorde
fallacy’ that are used to describe cases when subjects
(suboptimally) attribute value to options in relation to
their history instead of to their future payoffs (Dawkins &
Carlisle 1976; Thaler 1980).
In terms of the mechanisms that may lead to these
phenomena, Clement et al. (2000) proposed that the
contrast between the hedonic state of the individual
immediately prior to reward and at the time of reward
may be greater after greater effort than after less effort.
Hedonic state is a concept often used in reference to the
agents’ wellbeing (see e.g. Berridge & Robinson 1998), but
to our knowledge there is no agreed operational definition. The proposed mechanism may occur if delivery of
reward always brings the animal to the same hedonic
state, while the previous condition (the amount of work)
leads to a different state prior to the reward being
delivered. Thus, the differential increase in hedonic state
is greater because harder work is more hedonically
negative. Marsh (1999) and Kacelnik & Marsh (2002)
proposed a similar mechanism inspired by functional
considerations, and without the assumption that the
rewards left the animals in the same hedonic state regardless of the starting point. In Marsh’s (1999, page 79)
words, ‘Rewards realised after much effort could be locally
assessed as being more valuable because of depressed
energy reserves at the time of reward payoff. This reasoning is based on the assumption that an otherwise equivalent food payoff has more fitness value when an
individual is in poorer condition than when it is in better
condition, and that such relative considerations influence
the way that outcome value is represented in memory’. If
what the subject finds reinforcing is a correlate of fitness
gain, then a food source that (because of the circumstances in which it was typically encountered) has previously yielded higher gains than an alternative would be
preferred even when the circumstances have changed so
that the gains from the two sources at the time of the
choice are identical.
In Marsh et al.’s (2004) experiment, the differential need
reduction during acquisition was caused not by extra work
but by the experimenters manipulating directly the
energetic state of their subjects. They proposed that two
payoffs of equal size would be valued differently if one
typically occurred when reserves were lower, based on the
assumption that the fitness versus reserves function is
most often concave (e.g. McNamara & Houston 1982). In
their experiment, Marsh et al. (2004) trained European
starlings, Sturnus vulgaris, to peck differently coloured keys
offering equal food rewards. The two food sources were
encountered in different sessions, while the birds were
either food deprived or fed. After training for an equal
number of reinforcements in each state, the birds progressed into choice tests in which they were presented
simultaneously with the two food sources. Regardless of
whether they were hungry or fed at the time of the choice,
the starlings significantly preferred the source that had
previously been met under higher deprivation, suggesting
that it is indeed plausible that the earlier results obtained
by changing the work requirement might have been
mediated by some effect of work on the need reduction
afforded by the reward. Before revising the ideas involved,
it is worth revisiting the bases on which normative models
of choice are developed. We do this below.
Normative models of decision making are characterized
by referring to a scale of value, or ‘utility’, dependent on
(but distinct from) the metrics of the events involved.
The following statement, although written over two and
a half centuries ago, makes the case for the relevance of
this scale as clearly as anything written since: ‘The price
of an item is dependent only on the thing itself and is
equal for everyone; the utility, however, is dependent on
the particular circumstances of the person making the
estimate. Thus, there is no doubt that a gain of one
thousand ducats is more significant to a pauper than to
a rich man though both gain the same amount.’
(Bernoulli 1738, page 24).
Bernoulli’s use of the term ‘utility’ is not the same as
that used in modern microeconomics, and is also different
from the use of fitness in evolutionary biology. However,
for the present purposes what his text highlights is the
distinction between metrics of outcomes and metrics of
value. For instance, a bee may experience sugar solutions
that are 20, 40 or 60% in concentration: it may be possible
to show a monotonic scale for the perception of concentration and a nonmonotonic scale for preference for
concentration. In Bernoulli’s example, there is no suggestion that the pauper or the rich man perceive the sum as
anything different from a thousand ducats, but they are
said to value the same amount differently. Both the
objective events and their perceived representations
(which are treated as causal factors in the decision process)
may be thought of as having dimensions of state. For
instance, the monetary rewards can be referred to a scale
of wealth, while the sugar in a solution has dimensions of
energetic reserves. Both the pauper and the rich man
become a thousand ducats better off from the prize, and
they may both be aware that this is the case. In contrast,
the value assigned to the events or to their perceived
representations has dimensions of either fitness or utility,
or of some correlated measure of motivational attractiveness, and hence they differ between the two subjects. For
once, the pauper has more reasons to celebrate than the
rich man, even if their final states will preserve their initial
difference in wealth. Crucially, the relation between the
perception of absolute properties of the reward and its
utility need not be linear and may not even be monotonic
(although it normally would be for the case of money).
In behavioural ecology, the ideal dependent variable is
evolutionary fitness and the independent variables are
metrics of the physical properties of rewards. However,
both dimensions are often replaced by operational (‘cognitive’) dimensions, where the dependent variable is some
motivational correlate of fitness (i.e. ‘pleasure’) and the
independent variable is the subject’s knowledge of the
event, for instance, the bee’s receptor’s intensity of response to concentration or its knowledge of the size of
a reward or the length of handling time.
In the present study we manipulated the state of the
starlings (as done by Marsh et al. 2004) and systematically
varied, in different treatments, the objective properties
(delay until food) of the option associated with the state of
higher deprivation. This manipulation allows us to test
several hypotheses, which we present using Fig. 1 as
a framework.
The figure presents a positive but decelerated relation
between a dependent ‘value’ ordinate and an independent
‘state’ abscissa. The abscissa is an approximate representation of energetic state, so that a hungrier animal is
placed closer to the origin. However, its dimensions are
the same as the metrics of the reward, so that consuming
a reward has the effect of displacing the animal’s state to
the right. The reward metric is not confined to reward size,
and in the present experiment it is not: the metric manipulated here is delay. We
consider that rewards delivered with greater immediacy cause a lower opportunity loss and hence a higher net payoff, so we represent all outcomes as causing a displacement along this axis, and more immediate rewards as causing a larger displacement than more delayed rewards.

Figure 1. Value (measured in fitness or in a motivational correlate) as a function of state of the subject. The outcome of each interaction with a food source produces a state change proportional to the outcome’s magnitude in some metric of state. Outcomes with greater immediacy (delay⁻¹) have lower opportunity cost and cause larger displacement. The subject encounters two food sources, each when it (the subject) is in one of two states, hungry (H) or prefed (PF). The outcomes are labelled MH and MPF and their metrics are represented as arrows that cause a positive state displacement. The subject’s representation of these metrics (M̂H and M̂PF) is shown in the inset. The value of each outcome (vH and vPF) is the vertical displacement that corresponds to each change in state. The first derivative (marginal rate) of the value versus state function at each initial state is indicated by the slopes of the tangents sH and sPF. In this example the objective metrics of the outcomes are equal, but the subject’s representation differs because it sees the outcome obtained when hungrier as larger (or having greater immediacy) than the alternative.
Figure 1 represents a subject that could be found in two
states, which we label with the suffixes H (for ‘hungry’) and
PF (for ‘prefed’). Rewards of magnitude MH or MPF cause
positive displacements in the abscissa and upward displacements vH and vPF in the ordinate. The slope of the value
versus state function at the points of encounter with the
rewards (namely the first derivative or marginal rate of the
function) is indicated by the slopes of the tangents sH and
sPF. Using this representation, we may consider four
hypotheses for mechanisms that could explain the reported effects of past experience. These are described below.
(1) Magnitude priority: preferences are driven primarily
by reward metrics (Mi), and secondarily by additional
criteria. This is a lexicographic rule. In lexicographic
models such as this one, the dimensions of the targets
are judged in order, and the highest-order dimension in
which options differ controls the choice. In terms of Fig. 1,
differences in the horizontal arrows showing reward
metrics would have primary and even exclusive control.
This is compatible both with previous observations and
with conventional reward maximizing, because in the
published experiments the rewards were equal, so that the
comparison among magnitudes would have yielded no
difference, and control could have passed to second-order
dimensions such as the history of each source. The
prediction of this hypothesis is that preference will always
be for source H when MH > MPF and for source PF when
the reverse is true, regardless of the strength of the
inequality. Other, normally irrelevant dimensions would
enter the scene when MH and MPF are strictly equal as in
the experiments reported previously.
(2) Value priority: preferences are driven by the value
(vi) provided by each option at the time of learning. As
already mentioned, two payoffs of equal size would be
remembered as having different value if one typically
occurred when reserves were lower, based on the assumption that the fitness versus reserves function is concave. In
functional terms, this is equivalent to choosing based on
the past fitness gain. This is compatible with previous
reports because either harder work or greater deprivation
placed the animals under conditions where the value
function was steeper, and, since rewards were of equal size,
the vertical displacement (vH) was larger for the hard work
or hungry option. The present hypothesis differs from the
previous one in that the impact of reward metrics is
mediated through the consequences in value, so that the
animal will prefer option H whenever vH > vPF. Under this hypothesis preference for the H option can persist even if MH < MPF (namely if the reward from source H is of lower magnitude). The strength of this preference, however, should decrease as the inequality (vH > vPF) becomes less
extreme, mediated through the effect of magnitude on
value.
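The value-priority logic can be illustrated numerically. This is a minimal sketch, not the authors' model: the square-root value function, the state units and the reward magnitudes are all arbitrary assumptions, chosen only to show that, under concavity, a smaller reward met in a hungrier state can still yield the larger value gain.

```python
import math

def value(state):
    # Illustrative concave value-versus-state function (assumption:
    # square root stands in for the fitness/utility curve of Fig. 1).
    return math.sqrt(state)

def gain(state, magnitude):
    # Vertical displacement v_i: value after consuming a reward of the
    # given magnitude, minus value at the initial state.
    return value(state + magnitude) - value(state)

# Hypothetical state units: hungry (H) lies closer to the origin.
v_H = gain(1.0, 1.0)    # smaller reward, encountered when hungry
v_PF = gain(4.0, 1.5)   # larger reward, encountered when prefed

# Concavity makes the hungry-state gain larger despite the smaller reward.
print(v_H > v_PF)  # True
```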
(3) State priority: preferences are driven exclusively by
the state of need when subjects learned about the options.
In terms of Fig. 1, preferences would depend on the slopes
of the tangents (sH, sPF). This could occur because natural
selection may have calibrated the valuation mechanism to
the marginal fitness gains of the value versus state
function. This mechanism would sustain preference for
option H regardless of the magnitude of its reward (MH).
Unlike hypothesis 2, Mi would not even have an indirect
effect.
(4) Perceptual distortion: preferences are driven by the subject’s internal representation of the metrics of the rewards (M̂i), and these representations are affected by state at the time of learning. In Fig. 1, this is represented
by the cartoon inset showing a bird that remembers the
metric of one reward as smaller than the alternative even
if the reward magnitudes were objectively equal. This idea
is inspired by Killeen’s (1984) hypothesis that temporal
intervals may be measured and represented internally by
means of a ‘clock’ that varies its rate depending on state of
arousal. In this hypothesis, if a subject learns about two
stimuli SH and SPF both lasting 10 s but it meets these
stimuli under low and high levels of arousal, respectively,
it would store more ‘ticks’ of the clock for the duration of
SPF because it is encountered when the clock runs faster. If
the internally represented intervals are measured later on
(for instance using a behavioural peak procedure) under
equal arousal, one would observe a shorter interval for SH
than for SPF. If the preference is for the option with the
most favourable internal representation, the bird will
choose SH because it perceives it as yielding a shorter
delay. Thus, in this case time of reward expectation should
be the best predictor of preference. For instance, if subjects
prefer option H when it yields 15 s of delay over option PF
that always yields 10 s of delay then their pecking
behaviour should show that the time of expectation of
reward is shorter for option H, thus revealing a perceptual
distortion caused by the training conditions.
In the present study we designed the experimental procedure to separate, as far as possible, the predictions of these four hypotheses.
To distinguish between the four hypothetical mechanisms
described above, we used four treatments, one of which
reproduced the conditions of Marsh et al. (2004), whereas
in the others the rewards were no longer equal.
METHODS
Subjects
Our subjects were six wild-caught European starlings
with extensive experimental histories. The birds were kept
in individual cages (120 × 60 cm and 50 cm high) that
served as housing and test chambers. They were visually
but not acoustically isolated on a light:dark cycle of
12:12 h and at a temperature of 16–18 °C. Fresh drinking
water was always available and bathing trays were provided at least twice a week. Birds were also permitted to
feed ad libitum on turkey starter crumbs and supplementary mealworms (Tenebrio sp.) for 2.5 h at the end of the
last experimental session. From previous experience this
regime is known to allow the starlings’ body weights to
remain stable at approximately 90% of their free feeding
value (Bateson 1993).
The birds were caught during 2002 and kept under an
English Nature licence. The experiments took place in
March 2003. All the birds remained healthy and were
released into the wild (University Parks, Oxford) during
the following summer.
Apparatus
Each cage had a panel with a central food hopper and
two square response keys. The keys could be illuminated
with yellow, red, blue, orange, violet, white, pink or green.
Computers running Arachnid language (Paul Fray Ltd.,
Cambridge, U.K.) served for control and data collection.
The size of food rewards was fixed for all treatments at
2 units of semicrushed and sieved Orlux pellets (ca. 0.02 g/
unit) and delivered at a rate of 1 unit/s by automatic pellet
dispensers (Campden Instruments, Cambridge, U.K.). The
delays to delivery were varied as explained below.
Experimental Protocol
We used a within-subject design in which each bird was
exposed to four treatments (the order of the treatments
was balanced across birds). Each treatment lasted for 6
days (5 days of training plus 1 day of testing). In all the
treatments, on the test day the starlings had to choose
between two options (signalled with differently coloured
keys). During training one of the options (option H, for
‘hungry’) had been encountered in a higher level of
deprivation (the subjects were food deprived for 3 h before
the session started). The other option (option PF, for
‘prefed’) had been encountered in a lower level of
deprivation (the subjects were deprived as in the other
group but were then fed for 10 min before the session
started).
Subjects were exposed to only forced trials during
training and only test trials during testing. In forced trials
only one key was active. The trial started with the key
blinking (0.7 s ON–0.3 s OFF). The first peck caused the
light to stay steadily ON and the programmed delay
started running. The first peck after the programmed
delay had elapsed triggered the delivery of the programmed amount of food, followed by a fixed intertrial
interval (ITI) of 80 s during which all keys were OFF. If no
peck was registered during the 5 s after the programmed
delay had elapsed the bird lost the reward, a new ITI
started, and the same option was presented after the ITI.
This 5-s limited hold was aimed at reducing differences
between experienced and programmed delay. Thus, in
operant terms the schedule could be described as a discrete-trials, limited-hold, response-initiated, fixed interval.
Choice trials started with two keys that simultaneously
blinked. The first peck on either of them caused that key
to turn steadily ON and the other to turn OFF. After that,
the trial continued as in forced trials. During test trials we
recorded the number of pecks/s during the delay programmed for each option.
Option PF was programmed as a fixed interval (FI) of
10 s in all the treatments. In the alternative (option H), we
manipulated the delay to food across treatments: FI10 s,
FI12.5 s, FI15 s and FI17.5 s in treatments TR1, TR2, TR3
and TR4, respectively. Regardless of the preference of the
subjects, the energetic rate of gain or immediacy offered
by the options can be ranked in inverse order of length of
the delay. We used this metric as the objective measure of
the outcome of each option (equivalent to Mi in Fig. 1).
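The ranking of the options by this objective metric can be sketched as follows (the treatment labels are ours; immediacy is simply the reciprocal of the programmed delay):

```python
# Programmed delays to food (s); option H varies across treatments TR1-TR4.
delays = {"PF": 10.0, "H_TR1": 10.0, "H_TR2": 12.5, "H_TR3": 15.0, "H_TR4": 17.5}

# Immediacy (1/delay) serves as the objective outcome metric M_i of Fig. 1:
# the longer the delay, the lower the immediacy.
immediacy = {name: 1.0 / d for name, d in delays.items()}
ranking = sorted(immediacy, key=immediacy.get, reverse=True)
print(ranking)  # H options with delays of 12.5-17.5 s rank below the 10-s options
```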
Pretraining
All subjects were given a ‘reminder’ pretraining to eat
from the hopper and peck at each of the eight hues (red,
green, yellow, blue, orange, violet, white and pink)
randomly presented on each of the pecking keys. Initially,
each peck was followed by reinforcement. On the next 3
days of pretraining, the fixed interval was gradually
increased to 17.5 s.
Training
During this stage, the birds experienced only forced
trials to allow them to learn about the properties of the
options. They experienced two sessions per day, one of
each type: Hungry and Prefed. Each session consisted of
the following parts, in sequential order: (1) a food deprivation period (2 h 50 min); (2) a ‘manipulation’ period
(10 min); (3) a ‘key-pecking’ period, where the birds had
to complete 20 forced trials (this period could last up to
50 min, depending on how long it took the birds to
complete 20 trials); and (4) an ad libitum food period
(turkey crumbs, for the final 15 min of each session)
designed to reduce carry-over influences across sessions
and equalize state for the following session.
The manipulation determined the kind of session: in
Prefed sessions, the experimenter entered the room and
provided ad libitum food (turkey crumbs) for 10 min. In
Hungry sessions, the experimenter entered the room at
the same times as in Prefed sessions, but no food was
provided. In the key-pecking period that immediately
followed, a coloured lamp (red, green, yellow, violet, pink,
orange, white or blue) corresponding with the session
type and treatment was illuminated behind one of the
pecking keys on either side (randomly determined across
trials) of the panel. The association of colours with session
types was balanced across birds and treatments, but for
each bird one colour was always associated with one
option type (e.g. red with PF and green with H) and
a different pair of colours were used for each treatment.
The key-pecking period was programmed to terminate
after the subject completed 20 forced trials or after 50 min,
whichever came first (in fact, the birds always completed
the 20 forced trials in less than 50 min).
The two daily sessions began at 0800 and 1200 hours,
respectively. For each subject, one session of each type was
given each day, with order of presentation balanced across
subjects and reversing between days.
Test
After 5 days of training, the birds experienced 1 day
with two choice sessions. These sessions consisted of 15
choice trials each. In choice trials, subjects expressed their
preference between options H and PF. The state of the
subjects at the time of testing was manipulated as during
the training period and the order in which subjects
experienced the two states was balanced across birds:
within the group, three of the subjects had the morning
session in the Hungry state and the afternoon session in
the Prefed state. The remaining three birds within each
group had the opposite order. Thus, each subject contributed 30 choices, 15 in each state at the time of testing.
Data Analysis
Preferences as revealed by choices
The proportion of choices for each option during the
choice sessions was considered as the measure of preference. Because preference is analysed in the form of
proportions, the data were arcsine square-root transformed and residuals were inspected for normality and
homogeneity of variance (Zar 1999). None of these
assumptions were violated (Jarque–Bera test of normality
and Bartlett’s test of the hypothesis that error variance of
the dependent variable was equal across the different
treatments: P > 0.05 in both cases).
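The transform used here is standard; a minimal sketch (the proportions shown are invented placeholders, not our data):

```python
import math

def arcsine_sqrt(p):
    # Variance-stabilizing transform for proportions (Zar 1999):
    # maps p in [0, 1] to asin(sqrt(p)) in [0, pi/2].
    return math.asin(math.sqrt(p))

# Hypothetical choice proportions for six birds in one treatment:
ph = [0.9, 0.8, 0.85, 0.95, 0.7, 0.8]
transformed = [arcsine_sqrt(p) for p in ph]
```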
Pecking patterns
We analysed the pecking patterns during choice trials,
converted into 1-s time bins. To estimate whether the
subjects could differentiate between H and PF options, we
compared the relative percentage of accumulated pecks at
bin 10 between the two options comprising each treatment (hereafter, R(10)). To obtain R(10) we first calculated
individual values as follows:
Rb(10) = CHb / (CHb + CPFb)          (1)
where Rb(10) is the score for bird b. CHb and CPFb are the
mean cumulative numbers of pecks up to 10 s into the
delay to food for the H and the PF options, respectively,
for bird b. We calculated one proportion per bird, per
treatment, and averaged the results across subjects to
obtain R(10).
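Equation (1) can be computed directly from binned peck counts. A sketch under the assumption that each trial is recorded as a list of per-1-s-bin counts (the example data are invented):

```python
def rb10(trials_H, trials_PF):
    # Rb(10), equation (1): option H's share of the mean pecks
    # accumulated over the first 10 one-second bins of the delay.
    c_H = sum(sum(trial[:10]) for trial in trials_H) / len(trials_H)
    c_PF = sum(sum(trial[:10]) for trial in trials_PF) / len(trials_PF)
    return c_H / (c_H + c_PF)

# Equal pecking at both options gives Rb(10) = 0.5 (indifference);
# slower pecking at H pushes Rb(10) below 0.5.
print(rb10([[2] * 10], [[2] * 10]))  # 0.5
print(rb10([[1] * 10], [[3] * 10]))  # 0.25
```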
RESULTS
Preferences as Revealed by Choices
We are interested in the effect of several factors on
preference: state during training, state during testing, and
metrics of the options. To simplify the language, we
conducted all the analysis in terms of the proportion of
choices for option H, the one that had been experienced
during training in sessions when the subject was hungrier.
We refer to this proportion as PH. Figure 2 shows PH in
choice sessions when the subjects were, at the time of
testing, in the Prefed state or in the Hungry state and also
the results pooling all the data. Qualitatively, it can be
seen that (1) all of the bars lie above the indifference line,
indicating that there was a preference for the option
associated with the Hungry state during training, even
though this option was either equal to or worse than the
alternative; (2) there is no major difference between the
left and centre two groups of bars, indicating that state at
the time of testing was largely ineffective; and (3) the
strength of preference for option H decreased towards
indifference as the delay to food in this option increased
from 10 s (when it was equal to the alternative) to 17.5 s.
For statistical analysis we considered two measures of
preference: the proportion of subjects that favoured the H
option (regardless of strength of preference) and the
proportion of choices for this option across subjects and
treatments (TR1–TR4).
The proportion of subjects was tested as follows. Of the 30 choice opportunities each bird had (two sessions of 15 trials each, one in each of the energetic states), all six birds preferred the H option (i.e. PH > 0.5) in TR1–TR3. In TR4, PH was greater than 0.5 for four of the six birds. Thus, the proportion of subjects preferring H is significantly higher than random for TR1, TR2 and TR3 (binomial test: P = 0.031), and nonsignificantly so for TR4 (P = 0.688).
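The two binomial P values can be reproduced with a two-tailed exact sign-test calculation (a sketch, not the authors' code):

```python
from math import comb

def binom_two_tailed(k, n):
    # Two-tailed exact binomial test against p = 0.5: double the
    # probability of an outcome at least as extreme as k out of n.
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(binom_two_tailed(6, 6), 3))  # 0.031 (TR1-TR3: six of six birds)
print(round(binom_two_tailed(4, 6), 3))  # 0.688 (TR4: four of six birds)
```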
For the second measure of preference, that is, to test the magnitude of preference, we carried out four one-sample t tests, one per treatment, comparing PH across birds against a random expectation of 0.5. The result indicates that the preference for H is significantly higher than 0.5 for TR1, TR2 and TR3 (TR1: t1,5 = 6.386, P = 0.001; TR2: t1,5 = 5.263, P = 0.003; TR3: t1,5 = 5.181, P = 0.004) and nonsignificant for TR4 (t1,5 = 1.418, P = 0.215).
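The corresponding statistic is the standard one-sample t form; a sketch with invented PH values (df = n − 1 = 5 for six birds):

```python
import math
import statistics

def one_sample_t(xs, mu=0.5):
    # One-sample t statistic: (mean - mu) / (s / sqrt(n)), with df = n - 1.
    n = len(xs)
    return (statistics.mean(xs) - mu) / (statistics.stdev(xs) / math.sqrt(n))

# Hypothetical PH values for six birds in one treatment:
ph = [0.9, 0.8, 0.85, 0.75, 0.7, 0.8]
t = one_sample_t(ph)  # large positive t indicates preference above 0.5
```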
Figure 2. Mean proportion of choices ± SD for the H options (i.e. those experienced during training when the subject was hungrier than in the prefed option). Whereas the PF option always had a programmed delay to food of 10 s, the H option had a programmed delay of 10, 12.5, 15 or 17.5 s according to treatment. The subjects were tested either hungry or prefed. ‘Overall’ shows the data after pooling across testing states. The horizontal line indicates indifference. Asterisks indicate a significant (P < 0.05) difference from random.
We examined the influences of state during testing and of treatment on choice proportions by running a GLM, with PH as response variable, subject (1 to 6) as a random factor, state during testing (Prefed or Hungry) as a fixed factor and delay until food associated with the H option (10, 12.5, 15 or 17.5 s) as a continuous variable. State during testing did not have significant effects (F1,24 = 0.33, P = 0.592) and there were no significant interactions between state during testing and treatment (F1,24 = 0.15, P = 0.698) or between subjects and state during testing (F5,24 = 0.33, P = 0.891). There were, however, significant effects of treatment: PH decreased as the delay to food in the H option increased (F1,24 = 32.23, slope: −0.10008, P < 0.0001). We did not find interaction effects between treatment and subjects (F5,24 = 1.09, P = 0.390).
Pecking Patterns

The analysis of choice showed that the birds preferred delays of 12.5 s and 15 s to delays of 10 s, or were indifferent between a delay of 17.5 s and a delay of 10 s, provided that the history of the first option was associated with a state of greater need. This could be mediated by a distorted perception of the delays (hypothesis 4: perceptual distortion).
To test this hypothesis, we analysed the pecking pattern during test trials and compared the pecks accumulated up to the 10th second between each pair of options presented in each treatment. The assumption is that this number of pecks should be greater for the option in which the reward is perceived as being more imminent, and this should be expressed by the index R(10) (see Methods). Figure 3 shows the R(10) index recorded under both energetic states and pooled during testing. The pooled data show that R(10) was almost exactly 0.5 when both options had delays of 10 s and was lower than 0.5 when the H option had a longer delay. A one-sample t test per treatment comparing R(10) against a random expectation of 0.5 yielded significant differences from 0.5 for treatments TR3 and TR4 tested under the Prefed condition. Comparison with the choice results shown in Fig. 2 reveals interesting effects. For TR1 and TR2, H was preferred, but R(10) did not differ from indifference; for TR3, H was preferred but R(10) was significantly less than 0.5; and for TR4 the birds were neutral but R(10) was significantly lower than 0.5. The most striking comparison is for TR3: the birds significantly preferred H when it delivered a delay 50% longer than the alternative PF, yet the pecking patterns show that they could distinguish the options in terms of their programmed delay until reward.

DISCUSSION
Our subjects were allowed to learn the properties of two
options while in two different states, and after extensive
training they were allowed to choose between the options.
While one option (PF) was always experienced when the
subject was not very hungry and always offered a delay of
10 s, the other (H) was encountered when the subject was
hungrier, and delivered delays of 10, 12.5, 15 and 17.5 s in
four treatments. Similarly to Marsh et al. (2004), we found
that subjects facing two options of equal objective value
preferred the one associated with a state of higher
deprivation during training. The preference for H declined as the delay in this option increased, reaching neutrality when H offered a delay of 17.5 s. The state of
the animals at the time of expressing their preference had
no detectable effect.
Of the four hypotheses presented in the introduction,
we can discard the first, third and fourth, for the following
reasons.
Hypothesis 1 postulated an absolute priority for the
objective metrics of the reward, explaining previous
effects of state (Clement et al. 2000; Marsh et al. 2004) by the fact that the subjects (pigeons, Columba livia, and starlings) incurred no costs, as the options were objectively identical. This is not the case here, where starlings preferred an option yielding a delay of 15 s over another yielding a delay of 10 s.

Figure 3. Mean ± SD relative proportion of accumulated pecks for H options (i.e. those experienced during training when the subject was hungrier than in the prefed option) at bin 10 in test trials. For further details see legend of Fig. 2.
Hypothesis 3 postulated that only state at the time of training was influential, as the subject would attend only to its state-dependent marginal rate of gain, independently of reward metrics. This hypothesis can be discarded here because the preference for option H declined as the loss resulting from choosing it increased.
Finally, hypothesis 4 postulated that the paradoxical
preference was due to a distorted perception of the delays
to food. If this were true, time-dependent pecking rate
(which is known to reflect reward expectation) should
have paralleled preference. In this study we did not collect
data in probe trials where the precise timing of reward
expectation could be judged, but extensive evidence
published elsewhere (see, for instance, Kacelnik & Brunner
2002) proves that starlings do learn these intervals
accurately. In addition, the analysis of relative cumulative
number of pecks indicated that reward expectation after
10 s of waiting was lower in some cases when preference
was higher. Critically, when H offered a delay of 15 s,
preference for H was significantly higher than 50%,
whereas reward expectation in the same option as measured by relative cumulative pecking rate was significantly
lower.
This leaves us with hypothesis 2, which posits that preference is driven by the remembered ‘value’ of previous outcomes of each source, and raises the issue (which we cannot fully explain) of why, in functional terms, the choice mechanism should have evolved these properties.
Clearly, this mechanism would produce sensible choices
whenever the probability of encountering each option is
uncorrelated with the subject’s state. The animal would learn that source A yields higher gains than source B in some metric correlated with fitness, and would then choose A over B whenever a choice occurs. However, as in our experimental situation, the mechanism can produce paradoxical consequences if fitness gain is state dependent and encounters with each option occur in systematically different states.
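The logic of this mechanism can be sketched as a toy simulation. This is purely illustrative and not the authors' model: the gain function and all numbers below are assumptions, chosen so that the option encountered while hungrier acquires the larger remembered value despite its objectively longer delay.

```python
# Toy state-dependent valuation learner (illustrative assumptions only).
# Assumed hedonic gain on a visit: marginal value of food in the current
# state (larger when hungrier), divided by the delay paid to obtain it.

def experienced_gain(state_need: float, delay_s: float) -> float:
    """Hypothetical state-dependent gain stored in memory."""
    return state_need / delay_s

# Training: PF is always met when pre-fed (low need) with a 10-s delay;
# H is always met when hungry (high need) with a 15-s delay (as in TR3).
value_pf = experienced_gain(state_need=1.0, delay_s=10.0)
value_h = experienced_gain(state_need=2.0, delay_s=15.0)

# Test: choice compares remembered values, ignoring current state.
preferred = "H" if value_h > value_pf else "PF"
print(preferred)  # prints "H": the longer delay is chosen
```

Because encounters are confounded with state during training, the stored value of H exceeds that of PF, reproducing the paradoxical preference; if both options were sampled in the same state, the same rule would favour the shorter delay.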
These possibly costly paradoxes may be counterbalanced if, first, they are rare in nature and, second, the
mechanism has advantages over possible alternatives in
a sufficiently large number of other circumstances. We do
not know how frequently situations such as those implemented here (where the subject’s state correlates with
probability of encounter) occur in nature, and
arguing about this point may have to wait until formal
modelling is implemented, but we may speculate about
the possible benefits such systems may yield. Consider the
possibility that (as in our hypothesis 1) decision makers
remember the metrics of each alternative and then,
whenever a choice is required, pick the alternative with
more favourable parameters (such as shorter delay or
larger reward). The drawback of this mechanism, in comparison with remembering a variable correlated with fitness gain, is that it may be harder to choose between
actions whose consequences have different dimensions,
such as amount and delay, mean and variance, or
nutritional value and predation risk. A mechanism sensitive to past gains, instead, would simply allow the animal
to compare alternatives in terms of the resulting benefits
in some common currency, a hedonic scale related to
fitness (McFarland & Sibly 1975; Cabanac 1992).
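The contrast between the two mechanisms can be sketched as follows. This is an illustrative assumption, not a model from the paper: metric-based choice must invoke an ad hoc rule to trade off incommensurable dimensions, whereas gain-based choice reduces to a single scalar comparison.

```python
from dataclasses import dataclass

@dataclass
class Option:
    amount: float      # food items per reward (one dimension)
    delay_s: float     # seconds until reward (a different dimension)
    past_gain: float   # remembered value on a common hedonic scale

# Hypothetical options differing on both dimensions at once.
a = Option(amount=2.0, delay_s=10.0, past_gain=0.20)
b = Option(amount=3.0, delay_s=20.0, past_gain=0.15)

# Metric-based choice needs an explicit rule to weigh amount against
# delay; here, one arbitrary possibility (rate of intake):
rate_a, rate_b = a.amount / a.delay_s, b.amount / b.delay_s

# Gain-based choice is a single comparison on the common currency:
choice = "a" if a.past_gain > b.past_gain else "b"
print(choice)  # prints "a"
```

Any new dimension (variance, predation risk, nutrient content) forces the metric-based rule to be extended, while the common-currency comparison is unchanged, which is the advantage suggested above.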
In summary, while state-dependent valuation learning
can lead to paradoxical and even costly choices in some
situations, it may in fact be a very efficient way to act in
comparison with other realistic possibilities. Although this is speculative, the present findings highlight the
importance of the agent’s history. While for good theoretical reasons normative models of decision making tend
to be path independent (Houston & McNamara 1999), so
that only present state and future consequences are
included, empirical findings such as the present ones
and those related to other paradoxical observations (such
as framing: Marsh & Kacelnik 2002) suggest that path-dependent models may be due for an important contribution. In fact, subtle differences in state when experiencing options may have lasting cognitive consequences capable of explaining future differences in preferences between individuals in the same energetic state, and this may have to be incorporated in the models.
Two themes appear appropriate for future research. First,
formal evolutionary modelling may be used to evaluate
the likelihood that a mechanism such as the one we
postulate may evolve in competition with other proposed
alternatives. Second, the model itself should be tested
empirically to judge the extent to which it may explain
a wider range of anomalies found across diverse paradigms
of animal choice.
Acknowledgments
We thank Cynthia Schuck-Paim and Martin Rossi for
advice, discussion and comments on the manuscript.
L.P. was supported by a Clarendon Fund Scholarship from
the University of Oxford. The research was partially
funded by the BBSRC Grant 43/S13483 to A.K.
References
Bateson, M. 1993. Currencies for decision making: the foraging
starling as a model animal. D.Phil. thesis, University of Oxford.
Bernoulli, D. 1738. Specimen theoriae novae de mensura sortis
[Exposition of a new theory on the measurement of risk].
Translated in Econometrica, 22, 23–36 (1954).
Berridge, K. C. & Robinson, T. E. 1998. What is the role of
dopamine in reward: hedonic impact, reward learning, or
incentive salience? Brain Research Reviews, 28, 309–369.
Cabanac, M. 1992. Pleasure: the common currency. Journal of
Theoretical Biology, 155, 173–200.
Clement, T. S., Feltus, J. R., Kaiser, D. H. & Zentall, T. R. 2000.
‘Work ethic’ in pigeons: reward value is directly related to the
effort or time required to obtain the reward. Psychonomic Bulletin
& Review, 7, 100–106.
Dawkins, R. & Carlisle, T. R. 1976. Parental investment, mate
desertion and a fallacy. Nature, 262, 131–133.
Houston, A. I. & McNamara, J. M. 1999. Models of Adaptive
Behaviour: An Approach Based on State. Cambridge: Cambridge
University Press.
Kacelnik, A. & Brunner, D. 2002. Timing and foraging: Gibbon’s
scalar expectancy theory and optimal patch exploitation. Learning
and Motivation, 33, 177–195.
Kacelnik, A. & Marsh, B. 2002. Cost can increase preference in
starlings. Animal Behaviour, 63, 245–250.
Killeen, P. R. 1984. Incentive theory III: adaptive clocks. In: Timing
and Time Perception (Ed. by J. Gibbon & L. Allan), pp. 515–527.
New York: New York Academy of Sciences.
McFarland, D. J. & Sibly, R. M. 1975. The behavioural final
common path. Philosophical Transactions of the Royal Society of
London, Series B, 270, 265–293.
McNamara, J. M. & Houston, A. I. 1982. Short-term behaviour and
lifetime fitness. In: Functional Ontogeny (Ed. by D. J. McFarland),
pp. 60–87. London: Pitman.
Marsh, B. 1999. Making the best choice: judgement and strategic
decision making under conditions of risk and uncertainty. D.Phil.
thesis, University of Oxford.
Marsh, B. & Kacelnik, A. 2002. Framing effects and risky decisions
in starlings. Proceedings of the National Academy of Sciences, U.S.A.,
99, 3352–3355.
Marsh, B., Schuck-Paim, C. & Kacelnik, A. 2004. State-dependent
learning affects foraging choices in starlings. Behavioral Ecology,
15, 396–399.
Pompilio, L. 2004. Animal choice and the construction of
preferences. D.Phil. thesis, University of Oxford.
Thaler, R. 1980. Towards a positive theory of consumer choice.
Journal of Economic Behaviour and Organization, 1, 39–60.
Zar, J. H. 1999. Biostatistical Analysis. 4th edn. Englewood Cliffs,
New Jersey: Prentice Hall.