ANIMAL BEHAVIOUR, 2005, 70, 571–578
doi:10.1016/j.anbehav.2004.12.009
State-dependent learning and suboptimal choice: when
starlings prefer long over short delays to food
LORENA POMPILIO & ALEX KACELNIK
Department of Zoology, University of Oxford
(Received 30 July 2004; initial acceptance 3 September 2004;
final acceptance 22 December 2004; published online 11 July 2005; MS. number: 8228)
Recent studies have used labels such as ‘work ethics’, ‘sunk costs’ and ‘state-dependent preferences’ for
apparent anomalies in animals’ choices. They suggest that preference between options relates to the
options’ history, rather than depending exclusively on the expected payoffs. For instance, European
starlings, Sturnus vulgaris, trained to obtain identical food rewards from two sources while in two levels of
hunger preferred the food source previously associated with higher hunger, regardless of the birds’ state at
the time of testing. We extended this experimentally and theoretically by studying starlings choosing
between sources that differed not only in history but also in the objective properties (delay until reward) of
the payoffs they delivered. Two options (PF and H) were initially presented in single-option sessions when
subjects were, respectively, prefed or hungry. While option PF offered a delay until reward of 10 s in all
treatments, option H delivered delays of 10, 12.5, 15 and 17.5 s in four treatments. When training was
completed, we tested preference between the options. When delays in both options were equal (10 s), the
birds strongly preferred H. When delay in H was 17.5 s, the birds were indifferent, with intermediate
results for intermediate treatments. Preference was not mediated by disrupted knowledge of the delays.
Thus, preferences were driven by past state-dependent gains, rather than by the joint effect of the birds’
state at the time of choice and knowledge of the absolute properties of each alternative, as assumed in
state-dependent, path-independent models of optimal choice.
© 2005 The Association for the Study of Animal Behaviour. Published by Elsevier Ltd. All rights reserved.
Several recent studies of avian decision making have dealt
with two related and somewhat counterintuitive observations. The first is that when subjects are initially trained to
obtain equally sized rewards after either hard or light
work, and are then offered a choice without the work
requirement, they prefer the reward previously associated
with harder work (Clement et al. 2000; Kacelnik & Marsh
2002; originally reported by Marsh 1999). The second is
that when subjects are trained with equal rewards in
different states of deprivation, and are then offered
a choice, they prefer the reward previously associated
with greater deprivation (Marsh et al. 2004). Enhanced
preference for stimuli associated with greater need is
probably far from restricted to birds: locusts, Locusta
migratoria, too prefer a signal (an odour in this case) that
has been previously paired with food encountered under
a state of greater need, regardless of state at the time of
testing (Pompilio 2004; L. Pompilio, A. Kacelnik & S. T. Bahmar, unpublished data). Here we discuss the findings on birds, propose some theoretical clarifications, and present an experimental study aimed at establishing the strength of the effect and the relevance of previously proposed theoretical models.
Correspondence: A. Kacelnik, Department of Zoology, Oxford University, South Parks Road, Oxford OX1 3PS, U.K. (email: alex.kacelnik@zoo.ox.ac.uk).
Clement et al. (2000) referred to the first phenomenon
as ‘work ethic’, but the label should not be seen to imply
that the birds would prefer to work harder, as this was not
the case: the preference was between the outcomes of the
two levels of work and not between the levels of work
themselves. Kacelnik & Marsh (2002) instead related their
findings to the notions of ‘sunk costs’ and ‘Concorde
fallacy’ that are used to describe cases when subjects
(suboptimally) attribute value to options in relation to
their history instead of to their future payoffs (Dawkins &
Carlisle 1976; Thaler 1980).
In terms of the mechanisms that may lead to these
phenomena, Clement et al. (2000) proposed that the
contrast between the hedonic state of the individual
immediately prior to reward and at the time of reward
may be greater after greater effort than after less effort.
Hedonic state is a concept often used in reference to the
agents’ wellbeing (see e.g. Berridge & Robinson 1998), but
to our knowledge there is no agreed operational definition. The proposed mechanism may occur if delivery of
reward always brings the animal to the same hedonic
state, while the previous condition (the amount of work)
leads to a different state prior to the reward being
delivered. Thus, the differential increase in hedonic state
is greater because harder work is more hedonically
negative. Marsh (1999) and Kacelnik & Marsh (2002)
proposed a similar mechanism inspired by functional
considerations, and without the assumption that the
rewards left the animals in the same hedonic state regardless of the starting point. In Marsh’s (1999, page 79)
words, ‘Rewards realised after much effort could be locally
assessed as being more valuable because of depressed
energy reserves at the time of reward payoff. This reasoning is based on the assumption that an otherwise equivalent food payoff has more fitness value when an
individual is in poorer condition than when it is in better
condition, and that such relative considerations influence
the way that outcome value is represented in memory’. If
what the subject finds reinforcing is a correlate of fitness
gain, then a food source that (because of the circumstances in which it was typically encountered) has previously yielded higher gains than an alternative would be
preferred even when the circumstances have changed so
that the gains from the two sources at the time of the
choice are identical.
In Marsh et al.’s (2004) experiment, the differential need
reduction during acquisition was caused not by extra work
but by the experimenters manipulating directly the
energetic state of their subjects. They proposed that two
payoffs of equal size would be valued differently if one
typically occurred when reserves were lower, based on the
assumption that the fitness versus reserves function is
most often concave (e.g. McNamara & Houston 1982). In
their experiment, Marsh et al. (2004) trained European
starlings, Sturnus vulgaris, to peck differently coloured keys
offering equal food rewards. The two food sources were
encountered in different sessions, while the birds were
either food deprived or fed. After training for an equal
number of reinforcements in each state, the birds progressed into choice tests in which they were presented
simultaneously with the two food sources. Regardless of
whether they were hungry or fed at the time of the choice,
the starlings significantly preferred the source that had
previously been met under higher deprivation, suggesting
that it is indeed plausible that the earlier results obtained
by changing the work requirement might have been
mediated by some effect of work on the need reduction
afforded by the reward. Before revising the ideas involved,
it is worth revisiting the bases on which normative models
of choice are developed. We do this below.
Normative models of decision making are characterized
by referring to a scale of value, or ‘utility’, dependent on
(but distinct from) the metrics of the events involved.
The following statement, although written over two and
a half centuries ago, makes the case for the relevance of
this scale as clearly as anything written since: ‘The price
of an item is dependent only on the thing itself and is
equal for everyone; the utility, however, is dependent on
the particular circumstances of the person making the
estimate. Thus, there is no doubt that a gain of one
thousand ducats is more significant to a pauper than to
a rich man though both gain the same amount.’
(Bernoulli 1738, page 24).
Bernoulli’s use of the term ‘utility’ is not the same as
that used in modern microeconomics, and is also different
from the use of fitness in evolutionary biology. However,
for the present purposes what his text highlights is the
distinction between metrics of outcomes and metrics of
value. For instance, a bee may experience sugar solutions
that are 20, 40 or 60% in concentration: it may be possible
to show a monotonic scale for the perception of concentration and a nonmonotonic scale for preference for
concentration. In Bernoulli’s example, there is no suggestion that the pauper or the rich man perceive the sum as
anything different from a thousand ducats, but they are
said to value the same amount differently. Both the
objective events and their perceived representations
(which are treated as causal factors in the decision process)
may be thought of as having dimensions of state. For
instance, the monetary rewards can be referred to a scale
of wealth, while the sugar in a solution has dimensions of
energetic reserves. Both the pauper and the rich man
become a thousand ducats better off from the prize, and
they may both be aware that this is the case. In contrast,
the value assigned to the events or to their perceived
representations has dimensions of either fitness or utility,
or of some correlated measure of motivational attractiveness, and hence they differ between the two subjects. For
once, the pauper has more reasons to celebrate than the
rich man, even if their final states will preserve their initial
difference in wealth. Crucially, the relation between the
perception of absolute properties of the reward and its
utility need not be linear and may not even be monotonic
(although it normally would be for the case of money).
In behavioural ecology, the ideal dependent variable is
evolutionary fitness and the independent variables are
metrics of the physical properties of rewards. However,
both dimensions are often replaced by operational (‘cognitive’) dimensions, where the dependent variable is some
motivational correlate of fitness (i.e. ‘pleasure’) and the
independent variable is the subject’s knowledge of the
event, for instance, the bee’s receptor’s intensity of response to concentration or its knowledge of the size of
a reward or the length of handling time.
In the present study we manipulated the state of the
starlings (as done by Marsh et al. 2004) and systematically
varied, in different treatments, the objective properties
(delay until food) of the option associated with the state of
higher deprivation. This manipulation allows us to test
several hypotheses, which we present using Fig. 1 as
a framework.
The figure presents a positive but decelerated relation
between a dependent ‘value’ ordinate and an independent
‘state’ abscissa. The abscissa is an approximate representation of energetic state, so that a hungrier animal is
placed closer to the origin. However, its dimensions are
the same as the metrics of the reward, so that consuming
a reward has the effect of displacing the animal’s state to
the right. The reward metric is not confined to reward size,
and in the present experiment it is not: the metric manipulated here is delay. We
consider that rewards delivered with greater immediacy cause a lower opportunity loss and hence a higher net payoff, so we represent all outcomes as causing a displacement along this axis, and more immediate rewards as causing a larger displacement than more delayed rewards.

Figure 1. Value (measured in fitness or in a motivational correlate) as a function of state of the subject. The outcome of each interaction with a food source produces a state change proportional to the outcome’s magnitude in some metric of state. Outcomes with greater immediacy (delay⁻¹) have lower opportunity cost and cause larger displacement. The subject encounters two food sources, each when it (the subject) is in one of two states, hungry (H) or prefed (PF). The outcomes are labelled MH and MPF and their metrics are represented as arrows that cause a positive state displacement. The subject’s representation of these metrics (M̂H and M̂PF) is shown in the inset. The value of each outcome (vH and vPF) is the vertical displacement that corresponds to each change in state. The first derivative (marginal rate) of the value versus state function at each initial state is indicated by the slopes of the tangents sH and sPF. In this example the objective metrics of the outcomes are equal, but the subject’s representation differs because it sees the outcome obtained when hungrier as larger (or having greater immediacy) than the alternative.
Figure 1 represents a subject that could be found in two
states, which we label with the suffixes H (for ‘hungry’) and
PF (for ‘prefed’). Rewards of magnitude MH or MPF cause
positive displacements in the abscissa and upward displacements vH and vPF in the ordinate. The slope of the value
versus state function at the points of encounter with the
rewards (namely the first derivative or marginal rate of the
function) is indicated by the slopes of the tangents sH and
sPF. Using this representation, we may consider four
hypotheses for mechanisms that could explain the reported effects of past experience. These are described below.
(1) Magnitude priority: preferences are driven primarily
by reward metrics (Mi), and secondarily by additional
criteria. This is a lexicographic rule. In lexicographic
models such as this one, the dimensions of the targets
are judged in order, and the highest-order dimension in
which options differ controls the choice. In terms of Fig. 1,
differences in the horizontal arrows showing reward
metrics would have primary and even exclusive control.
This is compatible both with previous observations and
with conventional reward maximizing, because in the
published experiments the rewards were equal, so that the
comparison among magnitudes would have yielded no
difference, and control could have passed to second-order
dimensions such as the history of each source. The
prediction of this hypothesis is that preference will always
be for source H when MH > MPF and for source PF when
the reverse is true, regardless of the strength of the
inequality. Other, normally irrelevant dimensions would
enter the scene when MH and MPF are strictly equal as in
the experiments reported previously.
(2) Value priority: preferences are driven by the value
(vi) provided by each option at the time of learning. As
already mentioned, two payoffs of equal size would be
remembered as having different value if one typically
occurred when reserves were lower, based on the assumption that the fitness versus reserves function is concave. In
functional terms, this is equivalent to choosing based on
the past fitness gain. This is compatible with previous
reports because either harder work or greater deprivation
placed the animals under conditions where the value
function was steeper, and, since rewards were of equal size,
the vertical displacement (vH) was larger for the hard work
or hungry option. The present hypothesis differs from the
previous one in that the impact of reward metrics is
mediated through the consequences in value, so that the
animal will prefer option H whenever vH > vPF. Under this hypothesis preference for the H option can persist even if MH < MPF (namely if the reward from source H is of lower magnitude). The strength of this preference, however, should decrease as the inequality (vH > vPF) becomes less
extreme, mediated through the effect of magnitude on
value.
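The value-priority logic can be illustrated numerically. This is a minimal sketch, not the authors' model: the square-root value function, the state units and the reward magnitudes are all arbitrary assumptions, chosen only to show that, under concavity, a smaller reward met in a hungrier state can still yield the larger value gain.

```python
import math

def value(state):
    # Illustrative concave value-versus-state function (assumption:
    # square root stands in for the fitness/utility curve of Fig. 1).
    return math.sqrt(state)

def gain(state, magnitude):
    # Vertical displacement v_i: value after consuming a reward of the
    # given magnitude, minus value at the initial state.
    return value(state + magnitude) - value(state)

# Hypothetical state units: hungry (H) lies closer to the origin.
v_H = gain(1.0, 1.0)    # smaller reward, encountered when hungry
v_PF = gain(4.0, 1.5)   # larger reward, encountered when prefed

# Concavity makes the hungry-state gain larger despite the smaller reward.
print(v_H > v_PF)  # True
```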
(3) State priority: preferences are driven exclusively by
the state of need when subjects learned about the options.
In terms of Fig. 1, preferences would depend on the slopes
of the tangents (sH, sPF). This could occur because natural
selection may have calibrated the valuation mechanism to
the marginal fitness gains of the value versus state
function. This mechanism would sustain preference for
option H regardless of the magnitude of its reward (MH).
Unlike hypothesis 2, Mi would not even have an indirect
effect.
(4) Perceptual distortion: preferences are driven by the subject’s internal representation of the metrics of the rewards (M̂i), and these representations are affected by state at the time of learning. In Fig. 1, this is represented
by the cartoon inset showing a bird that remembers the
metric of one reward as smaller than the alternative even
if the reward magnitudes were objectively equal. This idea
is inspired by Killeen’s (1984) hypothesis that temporal
intervals may be measured and represented internally by
means of a ‘clock’ that varies its rate depending on state of
arousal. In this hypothesis, if a subject learns about two
stimuli SH and SPF both lasting 10 s but it meets these
stimuli under low and high levels of arousal, respectively,
it would store more ‘ticks’ of the clock for the duration of
SPF because it is encountered when the clock runs faster. If
the internally represented intervals are measured later on
(for instance using a behavioural peak procedure) under
equal arousal, one would observe a shorter interval for SH
than for SPF. If the preference is for the option with the
most favourable internal representation, the bird will
choose SH because it perceives it as yielding a shorter
delay. Thus, in this case time of reward expectation should
be the best predictor of preference. For instance, if subjects
prefer option H when it yields 15 s of delay over option PF
that always yields 10 s of delay then their pecking
behaviour should show that the time of expectation of
reward is shorter for option H, thus revealing a perceptual
distortion caused by the training conditions.
In the present study we designed the experimental procedure to separate, as far as possible, the predictions of these four hypotheses.
To distinguish between the four hypothetical mechanisms
described above, we used four treatments, one of which
reproduced the conditions of Marsh et al. (2004), whereas
in the others the rewards were no longer equal.
METHODS
Subjects
Our subjects were six wild-caught European starlings
with extensive experimental histories. The birds were kept
in individual cages (120 × 60 cm and 50 cm high) that
served as housing and test chambers. They were visually
but not acoustically isolated on a light:dark cycle of
12:12 h and at a temperature of 16–18 °C. Fresh drinking
water was always available and bathing trays were provided at least twice a week. Birds were also permitted to
feed ad libitum on turkey starter crumbs and supplementary mealworms (Tenebrio sp.) for 2.5 h at the end of the
last experimental session. From previous experience this
regime is known to allow the starlings’ body weights to
remain stable at approximately 90% of their free feeding
value (Bateson 1993).
The birds were caught during 2002 and kept under an
English Nature licence. The experiments took place in
March 2003. All the birds remained healthy and were
released into the wild (University Parks, Oxford) during
the following summer.
Apparatus
Each cage had a panel with a central food hopper and
two square response keys. The keys could be illuminated
with yellow, red, blue, orange, violet, white, pink or green.
Computers running Arachnid language (Paul Fray Ltd.,
Cambridge, U.K.) served for control and data collection.
The size of food rewards was fixed for all treatments at
2 units of semicrushed and sieved Orlux pellets (ca. 0.02 g/
unit) and delivered at a rate of 1 unit/s by automatic pellet
dispensers (Campden Instruments, Cambridge, U.K.). The
delays to delivery were varied as explained below.
Experimental Protocol
We used a within-subject design in which each bird was
exposed to four treatments (the order of the treatments
was balanced across birds). Each treatment lasted for 6
days (5 days of training plus 1 day of testing). In all the
treatments, on the test day the starlings had to choose
between two options (signalled with differently coloured
keys). During training one of the options (option H, for
‘hungry’) had been encountered in a higher level of
deprivation (the subjects were food deprived for 3 h before
the session started). The other option (option PF, for
‘prefed’) had been encountered in a lower level of
deprivation (the subjects were deprived as in the other
group but were then fed for 10 min before the session
started).
Subjects were exposed to only forced trials during
training and only test trials during testing. In forced trials
only one key was active. The trial started with the key
blinking (0.7 s ON–0.3 s OFF). The first peck caused the
light to stay steadily ON and the programmed delay
started running. The first peck after the programmed
delay had elapsed triggered the delivery of the programmed amount of food, followed by a fixed intertrial
interval (ITI) of 80 s during which all keys were OFF. If no
peck was registered during the 5 s after the programmed
delay had elapsed the bird lost the reward, a new ITI
started, and the same option was presented after the ITI.
This 5-s limited hold was aimed at reducing differences
between experienced and programmed delay. Thus, in
operant terms the schedule could be described as a discrete-trials, limited-hold, response-initiated, fixed interval.
Choice trials started with two keys that simultaneously
blinked. The first peck on either of them caused that key
to turn steadily ON and the other to turn OFF. After that,
the trial continued as in forced trials. During test trials we
recorded the number of pecks/s during the delay programmed for each option.
Option PF was programmed as a fixed interval (FI) of
10 s in all the treatments. In the alternative (option H), we
manipulated the delay to food across treatments: FI10 s,
FI12.5 s, FI15 s and FI17.5 s in treatments TR1, TR2, TR3
and TR4, respectively. Regardless of the preference of the
subjects, the energetic rate of gain or immediacy offered
by the options can be ranked in inverse order of length of
the delay. We used this metric as the objective measure of
the outcome of each option (equivalent to Mi in Fig. 1).
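The ranking of the options by this objective metric can be sketched as follows (the treatment labels are ours; immediacy is simply the reciprocal of the programmed delay):

```python
# Programmed delays to food (s); option H varies across treatments TR1-TR4.
delays = {"PF": 10.0, "H_TR1": 10.0, "H_TR2": 12.5, "H_TR3": 15.0, "H_TR4": 17.5}

# Immediacy (1/delay) serves as the objective outcome metric M_i of Fig. 1:
# the longer the delay, the lower the immediacy.
immediacy = {name: 1.0 / d for name, d in delays.items()}
ranking = sorted(immediacy, key=immediacy.get, reverse=True)
print(ranking)  # H options with delays of 12.5-17.5 s rank below the 10-s options
```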
Pretraining
All subjects were given a ‘reminder’ pretraining to eat
from the hopper and peck at each of the eight hues (red,
green, yellow, blue, orange, violet, white and pink)
randomly presented on each of the pecking keys. Initially,
each peck was followed by reinforcement. On the next 3
days of pretraining, the fixed interval was gradually
increased to 17.5 s.
Training
During this stage, the birds experienced only forced
trials to allow them to learn about the properties of the
options. They experienced two sessions per day, one of
each type: Hungry and Prefed. Each session consisted of
the following parts, in sequential order: (1) a food deprivation period (2 h 50 min); (2) a ‘manipulation’ period
(10 min); (3) a ‘key-pecking’ period, where the birds had
to complete 20 forced trials (this period could last up to
50 min, depending on how long it took the birds to
complete 20 trials); and (4) an ad libitum food period
(turkey crumbs, for the final 15 min of each session)
designed to reduce carry-over influences across sessions
and equalize state for the following session.
The manipulation determined the kind of session: in
Prefed sessions, the experimenter entered the room and
provided ad libitum food (turkey crumbs) for 10 min. In
Hungry sessions, the experimenter entered the room at
the same times as in Prefed sessions, but no food was
provided. In the key-pecking period that immediately
followed, a coloured lamp (red, green, yellow, violet, pink,
orange, white or blue) corresponding with the session
type and treatment was illuminated behind one of the
pecking keys on either side (randomly determined across
trials) of the panel. The association of colours with session
types was balanced across birds and treatments, but for
each bird one colour was always associated with one
option type (e.g. red with PF and green with H) and
a different pair of colours were used for each treatment.
The key-pecking period was programmed to terminate
after the subject completed 20 forced trials or after 50 min,
whichever came first (in fact, the birds always completed
the 20 forced trials in less than 50 min).
The two daily sessions began at 0800 and 1200 hours,
respectively. For each subject, one session of each type was
given each day, with order of presentation balanced across
subjects and reversing between days.
Test
After 5 days of training, the birds experienced 1 day
with two choice sessions. These sessions consisted of 15
choice trials each. In choice trials, subjects expressed their
preference between options H and PF. The state of the
subjects at the time of testing was manipulated as during
the training period and the order in which subjects
experienced the two states was balanced across birds:
within the group, three of the subjects had the morning
session in the Hungry state and the afternoon session in
the Prefed state. The remaining three birds within each
group had the opposite order. Thus, each subject contributed 30 choices, 15 in each state at the time of testing.
Data Analysis
Preferences as revealed by choices
The proportion of choices for each option during the
choice sessions was considered as the measure of preference. Because preference is analysed in the form of
proportions, the data were arcsine square-root transformed and residuals were inspected for normality and
homogeneity of variance (Zar 1999). None of these
assumptions were violated (Jarque–Bera test of normality
and Bartlett’s test of the hypothesis that error variance of
the dependent variable was equal across the different
treatments: P > 0.05 in both cases).
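The transform used here is standard; a minimal sketch (the proportions shown are invented placeholders, not our data):

```python
import math

def arcsine_sqrt(p):
    # Variance-stabilizing transform for proportions (Zar 1999):
    # maps p in [0, 1] to asin(sqrt(p)) in [0, pi/2].
    return math.asin(math.sqrt(p))

# Hypothetical choice proportions for six birds in one treatment:
ph = [0.9, 0.8, 0.85, 0.95, 0.7, 0.8]
transformed = [arcsine_sqrt(p) for p in ph]
```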
Pecking patterns
We analysed the pecking patterns during choice trials,
converted into 1-s time bins. To estimate whether the
subjects could differentiate between H and PF options, we
compared the relative percentage of accumulated pecks at
bin 10 between the two options comprising each treatment (hereafter, R(10)). To obtain R(10) we first calculated
individual values as follows:
Rb(10) = CHb / (CHb + CPFb)          (1)
where Rb(10) is the score for bird b. CHb and CPFb are the
mean cumulative numbers of pecks up to 10 s into the
delay to food for the H and the PF options, respectively,
for bird b. We calculated one proportion per bird, per
treatment, and averaged the results across subjects to
obtain R(10).
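Equation (1) can be computed directly from binned peck counts. A sketch under the assumption that each trial is recorded as a list of per-1-s-bin counts (the example data are invented):

```python
def rb10(trials_H, trials_PF):
    # Rb(10), equation (1): option H's share of the mean pecks
    # accumulated over the first 10 one-second bins of the delay.
    c_H = sum(sum(trial[:10]) for trial in trials_H) / len(trials_H)
    c_PF = sum(sum(trial[:10]) for trial in trials_PF) / len(trials_PF)
    return c_H / (c_H + c_PF)

# Equal pecking at both options gives Rb(10) = 0.5 (indifference);
# slower pecking at H pushes Rb(10) below 0.5.
print(rb10([[2] * 10], [[2] * 10]))  # 0.5
print(rb10([[1] * 10], [[3] * 10]))  # 0.25
```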
RESULTS
Preferences as Revealed by Choices
We are interested in the effect of several factors on
preference: state during training, state during testing, and
metrics of the options. To simplify the language, we
conducted all the analysis in terms of the proportion of
choices for option H, the one that had been experienced
during training in sessions when the subject was hungrier.
We refer to this proportion as PH. Figure 2 shows PH in
choice sessions when the subjects were, at the time of
testing, in the Prefed state or in the Hungry state and also
the results pooling all the data. Qualitatively, it can be
seen that (1) all of the bars lie above the indifference line,
indicating that there was a preference for the option
associated with the Hungry state during training, even
though this option was either equal to or worse than the
alternative; (2) there is no major difference between the
left and centre two groups of bars, indicating that state at
the time of testing was largely ineffective; and (3) the
strength of preference for option H decreased towards
indifference as the delay to food in this option increased
from 10 s (when it was equal to the alternative) to 17.5 s.
For statistical analysis we considered two measures of
preference: the proportion of subjects that favoured the H
option (regardless of strength of preference) and the
proportion of choices for this option across subjects and
treatments (TR1–TR4).
The proportion of subjects was tested as follows. Of the 30 choice opportunities each bird had (two sessions of 15 trials each, one in each of the energetic states), all six birds preferred the H option (i.e. PH > 0.5) in TR1–TR3. In TR4, PH was greater than 0.5 for four of the six birds. Thus, the proportion of subjects preferring H is significantly higher than random for TR1, TR2 and TR3 (binomial test: P = 0.031), and nonsignificantly so for TR4 (P = 0.688).
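The two binomial P values can be reproduced with a two-tailed exact sign-test calculation (a sketch, not the authors' code):

```python
from math import comb

def binom_two_tailed(k, n):
    # Two-tailed exact binomial test against p = 0.5: double the
    # probability of an outcome at least as extreme as k out of n.
    extreme = max(k, n - k)
    tail = sum(comb(n, i) for i in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(round(binom_two_tailed(6, 6), 3))  # 0.031 (TR1-TR3: six of six birds)
print(round(binom_two_tailed(4, 6), 3))  # 0.688 (TR4: four of six birds)
```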
For the second measure of preference, that is, to test the magnitude of preference, we carried out four one-sample t tests, one per treatment, comparing PH across birds against a random expectation of 0.5. The result indicates that the preference for H is significantly higher than 0.5 for TR1, TR2 and TR3 (TR1: t1,5 = 6.386, P = 0.001; TR2: t1,5 = 5.263, P = 0.003; TR3: t1,5 = 5.181, P = 0.004) and nonsignificant for TR4 (t1,5 = 1.418, P = 0.215).
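The corresponding statistic is the standard one-sample t form; a sketch with invented PH values (df = n − 1 = 5 for six birds):

```python
import math
import statistics

def one_sample_t(xs, mu=0.5):
    # One-sample t statistic: (mean - mu) / (s / sqrt(n)), with df = n - 1.
    n = len(xs)
    return (statistics.mean(xs) - mu) / (statistics.stdev(xs) / math.sqrt(n))

# Hypothetical PH values for six birds in one treatment:
ph = [0.9, 0.8, 0.85, 0.75, 0.7, 0.8]
t = one_sample_t(ph)  # large positive t indicates preference above 0.5
```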
Figure 2. Mean proportion of choices ± SD for the H options (i.e. those experienced during training when the subject was hungrier than in the prefed option). Whereas the PF option always had a programmed delay to food of 10 s, the H option had a programmed delay of 10, 12.5, 15 or 17.5 s according to treatment. The subjects were tested either hungry or prefed. ‘Overall’ shows the data after pooling across testing states. The horizontal line indicates indifference. Asterisks indicate a significant (P < 0.05) difference from random.
We examined the influences of state during testing and of treatment on choice proportions by running a GLM, with PH as response variable, subject (1 to 6) as a random factor, state during testing (Prefed or Hungry) as a fixed factor and delay until food associated with the H option (10, 12.5, 15 or 17.5 s) as a continuous variable. State during testing did not have significant effects (F1,24 = 0.33, P = 0.592) and there were no significant interactions between state during testing and treatment (F1,24 = 0.15, P = 0.698) or between subjects and state during testing (F5,24 = 0.33, P = 0.891). There were, however, significant effects of treatment: PH decreased as the delay to food in the H option increased (F1,24 = 32.23, slope: −0.10008, P < 0.0001). We did not find interaction effects between treatment and subjects (F5,24 = 1.09, P = 0.390).
Pecking Patterns

The analysis of choice showed that the birds preferred delays of 12.5 s and 15 s to delays of 10 s, or were indifferent between a delay of 17.5 s and a delay of 10 s, provided that the history of the first option was associated with a state of greater need. This could be mediated by a distorted perception of the delays (hypothesis 4: perceptual distortion).
To test this hypothesis, we analysed the pecking pattern during test trials and compared the pecks accumulated up to the 10th second between each pair of options presented in each treatment. The assumption is that this number of pecks should be greater for the option in which the reward is perceived as being more imminent, and this should be expressed by the index R(10) (see Methods). Figure 3 shows the R(10) index recorded under both energetic states and pooled during testing. The pooled data show that R(10) was almost exactly 0.5 when both options had delays of 10 s and was lower than 0.5 when the H option had a longer delay. A one-sample t test per treatment comparing R(10) against a random expectation of 0.5 yielded significant differences from 0.5 for treatments TR3 and TR4 tested under the Prefed condition. Comparison with the choice results shown in Fig. 2 reveals interesting effects. For TR1 and TR2, H was preferred, but R(10) did not differ from indifference; for TR3, H was preferred but R(10) was significantly less than 0.5; and for TR4 the birds were neutral but R(10) was significantly lower than 0.5. The most striking comparison is for TR3: the birds significantly preferred H when it delivered a delay 50% longer than the alternative PF, yet the pecking patterns show that they could distinguish the options in terms of their programmed delay until reward.

DISCUSSION
Our subjects were allowed to learn the properties of two
options while in two different states, and after extensive
training they were allowed to choose between the options.
While one option (PF) was always experienced when the
subject was not very hungry and always offered a delay of
10 s, the other (H) was encountered when the subject was
hungrier, and delivered delays of 10, 12.5, 15 and 17.5 s in
four treatments. Similarly to Marsh et al. (2004), we found
that subjects facing two options of equal objective value
preferred the one associated with a state of higher
deprivation during training. The preference for H declined as the delay in this option increased, reaching neutrality when H offered a delay of 17.5 s. The state of
the animals at the time of expressing their preference had
no detectable effect.
Of the four hypotheses presented in the introduction,
we can discard the first, third and fourth, for the following
reasons.
Hypothesis 1 postulated an absolute priority for the
objective metrics of the reward, explaining previous
effects of state (Clement et al. 2000; Marsh et al. 2004) by the fact that the subjects (pigeons, Columba livia, and starlings) incurred no costs, as the options were objectively identical. This is not the case here, where starlings preferred an option yielding a delay of 15 s over another yielding a delay of 10 s.

Figure 3. Mean ± SD relative proportion of accumulated pecks for H options (i.e. those experienced during training when the subject was hungrier than in the prefed option) at bin 10 in test trials. For further details see legend of Fig. 2.
Hypothesis 3 postulated that only state at the time of training was influential, as the subject would attend only to its state-dependent marginal rate of gain, independently of reward metrics. This hypothesis can be discarded here because the preference for option H declined as the loss resulting from choosing it increased.
Finally, hypothesis 4 postulated that the paradoxical
preference was due to a distorted perception of the delays
to food. If this were true, time-dependent pecking rate
(which is known to reflect reward expectation) should
have paralleled preference. In this study we did not collect
data in probe trials where the precise timing of reward
expectation could be judged, but extensive evidence
published elsewhere (see, for instance, Kacelnik & Brunner
2002) proves that starlings do learn these intervals
accurately. In addition, the analysis of relative cumulative
number of pecks indicated that reward expectation after
10 s of waiting was lower in some cases when preference
was higher. Critically, when H offered a delay of 15 s,
preference for H was significantly higher than 50%,
whereas reward expectation in the same option as measured by relative cumulative pecking rate was significantly
lower.
This leaves us with hypothesis 2, which posits that preference is driven by the remembered ‘value’ of previous outcomes of each source, and raises the issue (which we cannot fully explain) of why, in functional terms, the choice mechanism should have evolved these properties.
Clearly, this mechanism would produce sensible choices
whenever the probability of encountering each option is
uncorrelated with the subject’s state. The animal would learn that source A yields higher gains than source B in some metric correlated with fitness, and would then choose A over B whenever a choice occurs. However, as in our experimental situation, the mechanism can produce paradoxical consequences if fitness gain is state dependent and encounters with each option occur in systematically different states.
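The logic of this mechanism can be sketched as a toy simulation. This is purely illustrative and not the authors' model: the gain function and all numbers below are assumptions, chosen so that the option encountered while hungrier acquires the larger remembered value despite its objectively longer delay.

```python
# Toy state-dependent valuation learner (illustrative assumptions only).
# Assumed hedonic gain on a visit: marginal value of food in the current
# state (larger when hungrier), divided by the delay paid to obtain it.

def experienced_gain(state_need: float, delay_s: float) -> float:
    """Hypothetical state-dependent gain stored in memory."""
    return state_need / delay_s

# Training: PF is always met when pre-fed (low need) with a 10-s delay;
# H is always met when hungry (high need) with a 15-s delay (as in TR3).
value_pf = experienced_gain(state_need=1.0, delay_s=10.0)
value_h = experienced_gain(state_need=2.0, delay_s=15.0)

# Test: choice compares remembered values, ignoring current state.
preferred = "H" if value_h > value_pf else "PF"
print(preferred)  # prints "H": the longer delay is chosen
```

Because encounters are confounded with state during training, the stored value of H exceeds that of PF, reproducing the paradoxical preference; if both options were sampled in the same state, the same rule would favour the shorter delay.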
These possibly costly paradoxes may be counterbalanced if, first, they are rare in nature and, second, the
mechanism has advantages over possible alternatives in
a sufficiently large number of other circumstances. We do
not know how frequently situations such as those implemented here (where the subject’s state correlates with
probability of encounter) occur in nature, and
arguing about this point may have to wait until formal
modelling is implemented, but we may speculate about
the possible benefits such systems may yield. Consider the
possibility that (as in our hypothesis 1) decision makers
remember the metrics of each alternative and then,
whenever a choice is required, pick the alternative with
more favourable parameters (such as shorter delay or
larger reward). The drawback of this mechanism, in comparison with remembering a variable correlated with fitness gain, is that it may be harder to choose between
actions whose consequences have different dimensions,
such as amount and delay, mean and variance, or
nutritional value and predation risk. A mechanism sensitive to past gains, instead, would simply allow the animal
to compare alternatives in terms of the resulting benefits
in some common currency, a hedonic scale related to
fitness (McFarland & Sibly 1975; Cabanac 1992).
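The contrast between the two mechanisms can be sketched as follows. This is an illustrative assumption, not a model from the paper: metric-based choice must invoke an ad hoc rule to trade off incommensurable dimensions, whereas gain-based choice reduces to a single scalar comparison.

```python
from dataclasses import dataclass

@dataclass
class Option:
    amount: float      # food items per reward (one dimension)
    delay_s: float     # seconds until reward (a different dimension)
    past_gain: float   # remembered value on a common hedonic scale

# Hypothetical options differing on both dimensions at once.
a = Option(amount=2.0, delay_s=10.0, past_gain=0.20)
b = Option(amount=3.0, delay_s=20.0, past_gain=0.15)

# Metric-based choice needs an explicit rule to weigh amount against
# delay; here, one arbitrary possibility (rate of intake):
rate_a, rate_b = a.amount / a.delay_s, b.amount / b.delay_s

# Gain-based choice is a single comparison on the common currency:
choice = "a" if a.past_gain > b.past_gain else "b"
print(choice)  # prints "a"
```

Any new dimension (variance, predation risk, nutrient content) forces the metric-based rule to be extended, while the common-currency comparison is unchanged, which is the advantage suggested above.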
In summary, while state-dependent valuation learning
can lead to paradoxical and even costly choices in some
situations, it may in fact be a very efficient way to act in
comparison with other realistic possibilities. Although this is speculative, the present findings highlight the
importance of the agent’s history. While for good theoretical reasons normative models of decision making tend
to be path independent (Houston & McNamara 1999), so
that only present state and future consequences are
included, empirical findings such as the present ones
and those related to other paradoxical observations (such
as framing: Marsh & Kacelnik 2002) suggest that path-dependent models may be due for an important contribution. In fact, subtle differences in state when experiencing options may have lasting cognitive consequences capable of explaining future differences in preferences between individuals in the same energetic state, and this may have to be incorporated in the models.
Two themes appear appropriate for future research. First,
formal evolutionary modelling may be used to evaluate
the likelihood that a mechanism such as the one we
postulate may evolve in competition with other proposed
alternatives. Second, the model itself should be tested
empirically to judge the extent to which it may explain
a wider range of anomalies found across diverse paradigms
of animal choice.
Acknowledgments
We thank Cynthia Schuck-Paim and Martin Rossi for
advice, discussion and comments on the manuscript.
L.P. was supported by a Clarendon Fund Scholarship from
the University of Oxford. The research was partially
funded by the BBSRC Grant 43/S13483 to A.K.
References
Bateson, M. 1993. Currencies for decision making: the foraging
starling as a model animal. D.Phil. thesis, University of Oxford.
Bernoulli, D. 1738. Specimen theoriae novae de mensura sortis
[Exposition of a new theory on the measurement of risk].
Translated in Econometrica, 22, 23–36 (1954).
Berridge, K. C. & Robinson, T. E. 1998. What is the role of
dopamine in reward: hedonic impact, reward learning, or
incentive salience? Brain Research Reviews, 28, 309–369.
Cabanac, M. 1992. Pleasure: the common currency. Journal of
Theoretical Biology, 155, 173–200.
Clement, T. S., Feltus, J. R., Kaiser, D. H. & Zentall, T. R. 2000.
‘Work ethic’ in pigeons: reward value is directly related to the
effort or time required to obtain the reward. Psychonomic Bulletin
& Review, 7, 100–106.
Dawkins, R. & Carlisle, T. R. 1976. Parental investment, mate
desertion and a fallacy. Nature, 262, 131–133.
Houston, A. I. & McNamara, J. M. 1999. Models of Adaptive
Behaviour: An Approach Based on State. Cambridge: Cambridge
University Press.
Kacelnik, A. & Brunner, D. 2002. Timing and foraging: Gibbon’s
scalar expectancy theory and optimal patch exploitation. Learning
and Motivation, 33, 177–195.
Kacelnik, A. & Marsh, B. 2002. Cost can increase preference in
starlings. Animal Behaviour, 63, 245–250.
Killeen, P. R. 1984. Incentive theory III: adaptive clocks. In: Timing
and Time Perception (Ed. by J. Gibbon & L. Allan), pp. 515–527.
New York: New York Academy of Sciences.
McFarland, D. J. & Sibly, R. M. 1975. The behavioural final
common path. Philosophical Transactions of the Royal Society of
London, Series B, 270, 265–293.
McNamara, J. M. & Houston, A. I. 1982. Short-term behaviour and
lifetime fitness. In: Functional Ontogeny (Ed. by D. J. McFarland),
pp. 60–87. London: Pitman.
Marsh, B. 1999. Making the best choice: judgement and strategic
decision making under conditions of risk and uncertainty. D.Phil.
thesis, University of Oxford.
Marsh, B. & Kacelnik, A. 2002. Framing effects and risky decisions
in starlings. Proceedings of the National Academy of Sciences, U.S.A.,
99, 3352–3355.
Marsh, B., Schuck-Paim, C. & Kacelnik, A. 2004. State-dependent
learning affects foraging choices in starlings. Behavioral Ecology,
15, 396–399.
Pompilio, L. 2004. Animal choice and the construction of
preferences. D.Phil. thesis, University of Oxford.
Thaler, R. 1980. Towards a positive theory of consumer choice.
Journal of Economic Behaviour and Organization, 1, 39–60.
Zar, J. H. 1999. Biostatistical Analysis. 4th edn. Englewood Cliffs,
New Jersey: Prentice Hall.