Class 21 - Delayed Reinforcement


A Thought Experiment

• 2 doors

• .1 and .2 probability of getting a dollar respectively

• Can get a dollar behind both doors on the same trial

• Dollars stay there until collected, but never more than 1 dollar per door.

• What order of doors do you choose?

Patterns in the Data

• If choices are made moment by moment, should be orderly patterns in the choices: 2, 2, 1, 2, 2, 1…

• Results are mixed, but promising when time is used as the measure
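The moment-by-moment pattern in the thought experiment can be sketched in a few lines. This is only an illustration under stated assumptions: each empty door gets a chance to be baited before every trial, both doors were visited just before the first trial, and the chooser always takes the door currently most likely to hold a dollar. The function names `p_dollar` and `momentary_maximizing` are hypothetical.

```python
def p_dollar(p_setup, trials_since_visit):
    """Probability that a dollar is waiting behind a door that has gone
    unvisited for `trials_since_visit` baiting opportunities."""
    return 1.0 - (1.0 - p_setup) ** trials_since_visit


def momentary_maximizing(n_trials, p1=0.1, p2=0.2):
    """Choose, on every trial, the door with the highest current payoff probability."""
    # Assumption: both doors were emptied immediately before trial 1.
    since = {1: 1, 2: 1}
    setup = {1: p1, 2: p2}
    choices = []
    for _ in range(n_trials):
        probs = {door: p_dollar(setup[door], since[door]) for door in (1, 2)}
        choice = max(probs, key=probs.get)
        choices.append(choice)
        # The chosen door is emptied; the other door keeps accumulating chances.
        for door in since:
            since[door] = 1 if door == choice else since[door] + 1
    return choices


pattern = momentary_maximizing(6)
print(pattern)
```

Under these assumptions the sketch reproduces the 2, 2, 1 cycle: door 1 is only worth visiting once it has been left alone for three trials.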

What Works Best Right Now

• Maximizing local rates and moment to moment choices can lower overall reinforcement rate.

• Short-term vs. long-term

Delay and Self-Control

Delayed Reinforcers

• Many of life’s reinforcers are delayed…

– Eating right, studying, etc.

• Delay obviously devalues a reinforcer

– How are effects of reinforcers affected by delay?

– Why choose the immediate, smaller reward?

– Why ever show self-control?

Remember Superstition?

• Temporal, not causal

– Causal, with delay, very hard

• Same with delay of reinforcement

– Effects decrease with delay

• But how does it occur?

• Are there reliable and predictable effects?

• Can we quantify the effect?

How Do We Measure Delay Effects?

Studying preference of delayed reinforcers

Humans:

- verbal reports at different points in time

- “what if” questions

Humans AND nonhumans:

A. Concurrent chains

B. Titration

All are choice techniques.

A. Concurrent chains

Concurrent chains are simply concurrent schedules -- usually concurrent equal VI VI -- in which reinforcers are delayed.

When a response is reinforced, usually both concurrent schedules stop and become unavailable, and a delay starts.

Sometimes the delays are in blackout with no response required to get the final reinforcer (an FT schedule);

Sometimes the delays are actually schedules, with an associated stimulus, like an FI schedule, that requires responding.

The concurrent-chain procedure

[Figure: the concurrent-chain procedure. Initial links (choice phase): Conc VI VI. Terminal links (outcome phase): VI a s or VI b s, each ending in food.]

An example of a concurrent-chain experiment

MacEwen (1972) investigated choice between two terminal-link FI schedules, or between two terminal-link VI schedules, one of which was always twice as long as the other.

The initial links were always concurrent VI 60-s VI 60-s schedules.

The terminal-link schedules were:

Shorter terminal link    Longer terminal link
FI 5 s                   FI 10 s
FI 10 s                  FI 20 s
FI 20 s                  FI 40 s
FI 40 s                  FI 80 s
VI 5 s                   VI 10 s
VI 10 s                  VI 20 s
VI 20 s                  VI 40 s
VI 40 s                  VI 80 s

Constant reinforcer (delay and immediacy) ratio in the terminal links: all immediacy ratios are 2:1.

[Figure: results for Bird M6 plotted against the smaller FI or VI value (s), with separate functions for FI terminal links and VI terminal links.]

From the generalised matching law, we would expect:

\[ \log\left(\frac{B_1}{B_2}\right) = a_d \log\left(\frac{D_2}{D_1}\right) + \log c \]

If $a_d$ were constant, then because $D_2/D_1$ was kept constant, we would expect no change in choice with changes in the absolute size of the delays.

$D_2/D_1$ was kept constant throughout (2:1 in every pair of terminal links).
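As a rough check on that reasoning, the prediction can be computed for each pair of terminal-link delays. This is a minimal sketch: the sensitivity `a_d = 1.0` and bias `log_c = 0.0` are placeholder values chosen for illustration, not estimates from MacEwen's data, and `predicted_log_ratio` is a hypothetical helper name.

```python
import math


def predicted_log_ratio(d1, d2, a_d=1.0, log_c=0.0):
    """Generalised matching for delay: log(B1/B2) = a_d * log(D2/D1) + log c."""
    return a_d * math.log10(d2 / d1) + log_c


# (shorter delay D1, longer delay D2) in seconds; every pair has D2/D1 = 2.
pairs = [(5, 10), (10, 20), (20, 40), (40, 80)]
predictions = [predicted_log_ratio(d1, d2) for d1, d2 in pairs]

for (d1, d2), pred in zip(pairs, predictions):
    print(f"D1 = {d1:>2} s, D2 = {d2:>2} s -> predicted log(B1/B2) = {pred:.3f}")
```

With any fixed `a_d`, every pair gives the same predicted log ratio, so a change in preference across pairs implies that `a_d` itself changed.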

But choice did change, so $a_d$ did NOT remain constant:

[Figure: results for Bird M6 plotted against the smaller FI or VI value (s), with separate functions for FI terminal links and VI terminal links.]

But this does give us some data to answer some other questions…

Shape of the Delay Function

• Now that we have some data…

• How does reinforcer value change over time?

• What is the shape of the decay function?

Basically, the effects that reinforcers have on behaviour decrease as the reinforcers are more and more delayed after the reinforced response.

[Figure: a concave-upwards graph, plotted against reinforcer delay.]

Delay Functions

• What is the “real” delay function?

\[ V_t = \frac{V_0}{1 + Kt} \]

\[ V_t = \frac{V_0}{(1 + Kt)^s} \]

\[ V_t = \frac{V_0}{M + Kt^s} \]

\[ V_t = \frac{V_0}{M + t^s} \]

\[ V_t = V_0 \exp(-Mt) \]

Exponential versus hyperbolic decay

It is important to understand how the effects of reinforcers decay over time, because different sorts of decay predict different effects.

The two main candidates:

Exponential decay -- the rate of decay remains constant over time

Hyperbolic decay -- the rate of decay decreases over time (as in memory, too)

Exponential decay

\[ V_t = V_0 e^{-bt} \]

$V_t$: value of the delayed reinforcer at time t

$V_0$: value of the reinforcer at 0-s delay

t: delay in seconds

b: a parameter that determines the rate of decay

e: the base of natural logarithms

Hyperbolic decay

\[ V_t = \frac{h V_0}{h + t} \]

In this equation, all the variables are the same as in the exponential decay, except that h is the half-life of the decay -- the time over which the value of $V_0$ is reduced to half its initial value.

Hyperbolic decay is strongly supported by Mazur’s research.
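The difference between the two candidates can be seen by comparing how much value survives each additional second of delay. The sketch below assumes illustrative parameters (`b = 0.2`, `h = 5.0`, `v0 = 3.0`); they are not fitted to any data set, and the function names are hypothetical.

```python
import math


def exponential_value(v0, t, b):
    """Exponential decay: V_t = V_0 * e^(-b t)."""
    return v0 * math.exp(-b * t)


def hyperbolic_value(v0, t, h):
    """Hyperbolic decay: V_t = h * V_0 / (h + t), where h is the half-life."""
    return h * v0 / (h + t)


def survival_over_one_second(value_fn, t, **params):
    """Fraction of value remaining after one more second of delay, starting at t."""
    return value_fn(1.0, t + 1, **params) / value_fn(1.0, t, **params)


v0, b, h = 3.0, 0.2, 5.0

# At t = h the hyperbolic value is exactly half of V_0.
half_life_value = hyperbolic_value(v0, h, h)

# Exponential: the same proportion is lost every second, early or late.
exp_early = survival_over_one_second(exponential_value, 0, b=b)
exp_late = survival_over_one_second(exponential_value, 10, b=b)

# Hyperbolic: proportionally less is lost per second as the delay grows.
hyp_early = survival_over_one_second(hyperbolic_value, 0, h=h)
hyp_late = survival_over_one_second(hyperbolic_value, 10, h=h)

print(half_life_value, exp_early, exp_late, hyp_early, hyp_late)
```

The exponential survival fraction is identical at short and long delays, while the hyperbolic one moves towards 1 as the delay increases, which is what "the rate of decay decreases over time" means here.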

[Figure: hyperbolic decay and exponential decay functions plotted against reinforcer delay (s).]

Two sorts of decay fitted to MacEwen's (1972) data.

Hyperbolic is clearly better.

Not that clean, but…

[Figure: two panels, hyperbolic decay and exponential decay, each plotted against the smaller delay (s).]

Studying Delay Using Indifference

• Titration procedures.

B: Titration - Finding the point of preference reversal

The titration procedure was introduced by Mazur:

- one standard (constant) delay and

- one adjusting delay.

These may differ in what schedule they are (e.g., FT versus VT with the same size reinforcers for both), or they may be the same schedule (both FT, say) with different magnitudes of reinforcers.

What the procedure does is to find the value of the adjusting delay that is equally preferred to the standard delay -- the indifference point in choice.

For example:

- reinforcer magnitudes are the same

- standard schedule is VT 30 s

- adjusting schedule is FT

How long would the FT schedule need to become to make preference equal?


Titration: Procedure

Trials are in blocks of 4.

The first 2 are forced choice, randomly one to each alternative.

The last 2 are free choice.

If, on the last 2 trials, the subject chooses the adjusting schedule twice, the adjusting delay is increased by a small amount.

If it chooses the standard twice, the adjusting delay is decreased by a small amount.

If choice is equal (1 of each), there is no change.

(Compare the von Békésy procedure in audition.)
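The adjustment rule above can be sketched as a small simulation. Everything about the simulated subject is an assumption made for illustration: it values each option hyperbolically with `k = 1.0` and always picks the higher-valued option, the standard is a 2-s reinforcer at a 10-s delay, the adjusting option is a 6-s reinforcer, and the step size is 1 s. Forced-choice trials are left out because they do not change the adjusting delay.

```python
def adjust_delay(adjusting_delay, free_choices, step=1.0):
    """Apply the titration rule to the two free-choice trials of one block."""
    n_adjusting = free_choices.count("adjusting")
    if n_adjusting == 2:
        return adjusting_delay + step            # adjusting chosen twice: lengthen it
    if n_adjusting == 0:
        return max(0.0, adjusting_delay - step)  # standard chosen twice: shorten it
    return adjusting_delay                       # one of each: no change


def hyperbolic_value(magnitude, delay, k=1.0):
    """Assumed valuation rule for the simulated subject: V = A / (1 + k D)."""
    return magnitude / (1.0 + k * delay)


def simulate_titration(n_blocks=100, start_delay=0.0):
    """Run blocks of trials and return the final adjusting delay."""
    standard_value = hyperbolic_value(magnitude=2.0, delay=10.0)  # hypothetical standard
    delay = start_delay
    for _ in range(n_blocks):
        adjusting_value = hyperbolic_value(magnitude=6.0, delay=delay)
        # A deterministic subject makes the same choice on both free trials.
        choice = "adjusting" if adjusting_value > standard_value else "standard"
        delay = adjust_delay(delay, [choice, choice])
    return delay


final_delay = simulate_titration()
print(final_delay)
```

With these assumed values the two options are equal in value at an adjusting delay of 32 s, and the simulated delay climbs to that region and then oscillates around it, which is how the procedure locates an indifference point.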

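Mazur's titration procedure

[Figure: Mazur's titration procedure. ITI, trial start, then choice (peck). One alternative leads to the standard delay + red houselight, followed by 2-s food and blackout (BO); the other leads to the adjusting delay + green houselight, followed by 6-s food.]

Why the postreinforcer blackout?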

Mazur’s Findings

• Different magnitudes, finding delay

– 2-sec rf delayed 8 sec = 6 sec rf delayed 20 sec.

• Equal magnitudes, variable vs. fixed delay

– Fixed delay 20 sec = variable delay 30 sec

• Why preference for variable?

– Hyperbolic decay and interval weighting.
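One way to see why hyperbolic decay favours variable delays: the short delays in a variable schedule keep most of their value, and that more than offsets the long ones. The sketch below is illustrative only. The two-delay schedule (2 s and 58 s, mean 30 s) and `k = 1.0` are made-up values, not Mazur's conditions or parameters.

```python
def hyperbolic_value(delay, k=1.0, v0=1.0):
    """Hyperbolic value of a reinforcer of size v0 delayed by `delay` seconds."""
    return v0 / (1.0 + k * delay)


def mean_value(delays, k=1.0):
    """Average the values of the individual delays, not the delays themselves."""
    return sum(hyperbolic_value(d, k) for d in delays) / len(delays)


def equivalent_fixed_delay(value, k=1.0, v0=1.0):
    """Fixed delay whose hyperbolic value equals `value` (inverse of the function above)."""
    return (v0 / value - 1.0) / k


variable_delays = [2.0, 58.0]            # hypothetical schedule, mean = 30 s
fixed_value = hyperbolic_value(30.0)     # fixed delay equal to that mean
variable_value = mean_value(variable_delays)
equivalent = equivalent_fixed_delay(variable_value)

print(fixed_value, variable_value, equivalent)
```

Under these assumptions the variable schedule is worth more than a fixed delay with the same mean, so it would be matched only by a considerably shorter fixed delay.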

Moving on to Self-Control

• Which would you prefer?

– $1 in an hour

– $2 tomorrow

Moving on to Self-Control (continued)

• Which would you prefer?

– $1 in a month

– $2 in a month and a day

Here’s the problem:

Preference reversal

In positive self control, the further you are away from the smaller and larger reinforcers, the more likely you are to accept the larger, more delayed reinforcer.

But the closer you get to the first one, the more likely you are to choose the smaller, more immediate one.

Friday night:

“Alright, I am setting my alarm clock to wake me up at 6.00 am tomorrow morning, and then I’ll go jogging.” ...

Saturday 6.00 am:

“Hmm….maybe not today.”


Assume: At the moment in time when we make the choice, we choose the reinforcer that has the highest current value...

To be able to understand why preference reversal occurs, we need to know how the value of a reinforcer changes with the time by which it is delayed...

Outside the laboratory, the majority of reinforcers are delayed.

Studying the effects of delayed reinforcers is therefore very important.


Animal research: Preference reversal

Green, Fisher, Perlow, & Sherman (1981)

Choice between a 2-s and a 6-s reinforcer.

Larger reinforcer delayed 4 s more than the smaller.

Choice response (across conditions) required from 2 to 28 s before the smaller reinforcer.

We will call this time T .

[Figure: timeline of the procedure. The choice is made T s before the small reinforcer (T ranges from 28 s down to 2 s); the large reinforcer follows the small one by 4 s.]

Green et al. (continued)

Thus, if T was 10 s, at the choice point,

• the smaller reinforcer was 10 s away

• the larger was 14 s away

So, as T is changed over conditions, we should see preference reversal.

Control condition: two equal-sized reinforcers were delayed, one 28 s the other 32 s.

Preference was strongly towards the reinforcer that came sooner.

So, at delays that long, pigeons can still clearly tell which reinforcer is sooner and which one later.

[Figure: Green et al. (1981), mean data, plotted against the value of T. Positive values indicate self control; negative values indicate impulsivity.]

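Which Delay Function Predicts This?

[Figure: exponential decay functions for reinforcers of magnitude 2 and magnitude 6, plotted against seconds from the smaller reinforcer.]

[Figure: hyperbolic decay functions for reinforcers of magnitude 2 and magnitude 6, plotted against seconds from the smaller reinforcer.]

Only hyperbolic decay can explain preference reversal.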
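The claim can be checked numerically for the Green et al. arrangement (2-s versus 6-s reinforcer, the larger one 4 s later). This is a sketch under assumptions: both reinforcers decay with the same parameter, and the values `b = 0.2` and `h = 1.0` are chosen only for illustration.

```python
import math


def exponential_value(magnitude, delay, b=0.2):
    """Exponential decay: V = A * e^(-b D)."""
    return magnitude * math.exp(-b * delay)


def hyperbolic_value(magnitude, delay, h=1.0):
    """Hyperbolic decay: V = h * A / (h + D)."""
    return h * magnitude / (h + delay)


def prefers_larger(value_fn, t, small=2.0, large=6.0, gap=4.0):
    """True if, T seconds before the small reinforcer, the large one has the higher value."""
    return value_fn(large, t + gap) > value_fn(small, t)


times = [0, 0.5, 2, 5, 10, 20, 28]   # values of T in seconds
exp_pref = [prefers_larger(exponential_value, t) for t in times]
hyp_pref = [prefers_larger(hyperbolic_value, t) for t in times]

print("exponential:", exp_pref)
print("hyperbolic: ", hyp_pref)
```

With a shared decay rate, the exponential value ratio is the same at every T, so preference never changes. The hyperbolic ratio depends on T, so with these parameters preference switches from the smaller to the larger reinforcer as T grows.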

Hyperbolic predictions shown the same way

[Figure: hyperbolic value functions plotted against time, with the point marked where choice reverses.]

Using strict matching theory to explain preference reversal

The concatenated strict matching law for reinforcer magnitude and delay (see the generalised matching lecture) is:

\[ \frac{B_1}{B_2} = \frac{M_1}{M_2} \cdot \frac{D_2}{D_1} \]

where M is reinforcer magnitude, and D is reinforcer delay.

Note that for delay, a longer delay is less preferred, and therefore $D_2$ is on top.

(OK, we know SM isn’t right, and delay sensitivity isn’t constant)

We will take the situation used by Green et al. (1981), and work through what the STRICT matching law predicts:

The baseline is: $M_1 = 2$, $M_2 = 6$, $D_1 = 0$, $D_2 = 4$

\[ \frac{B_1}{B_2} = \frac{M_1}{M_2} \cdot \frac{D_2}{D_1} = \frac{2 \times 4}{6 \times 0} = \frac{8}{0} = \infty \]

The choice ratio is infinite. Thus, the subject is predicted always to take the smaller, zero-delayed, reinforcer.

Now, add T = 0.5 s, so $M_1 = 2$, $M_2 = 6$, $D_1 = 0.5$, $D_2 = 4.5$

\[ \frac{B_1}{B_2} = \frac{M_1}{M_2} \cdot \frac{D_2}{D_1} = \frac{2 \times 4.5}{6 \times 0.5} = \frac{9}{3} = 3 \]

The subject is predicted to prefer the smaller magnitude reinforcer three times more than the larger magnitude reinforcer, and again be impulsive. But its preference for the immediate reinforcer has decreased a lot.

Then, when T = 1,

\[ \frac{B_1}{B_2} = \frac{2 \times 5}{6 \times 1} = \frac{10}{6} \approx 1.67 \]

The choice is now less impulsive.

For T = 2, the preference ratio $B_1/B_2$ is 1 -- so now, the strict matching law predicts indifference between the two choices.

For T = 10, the preference ratio is 0.47 -- more than 2:1 towards the larger, more delayed, reinforcer. That is, the subject is now showing self control.

The whole function is shown next -- predictions for Green et al. (1981) assuming strict matching.
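The worked values above can be reproduced with a short function. This is a sketch of the strict matching calculation only; the function name `strict_matching_ratio` and its default arguments (magnitudes 2 and 6, a 4-s gap between reinforcers) are taken from the example, not from any published code.

```python
import math


def strict_matching_ratio(t, m1=2.0, m2=6.0, gap=4.0):
    """Strict matching prediction B1/B2 = (M1/M2) * (D2/D1).

    Alternative 1 is the smaller reinforcer, delayed T seconds;
    alternative 2 is the larger reinforcer, delayed T + gap seconds.
    """
    d1 = t
    d2 = t + gap
    if d1 == 0:
        return math.inf   # an immediate reinforcer gives an infinite ratio
    return (m1 * d2) / (m2 * d1)


for t in [0, 0.5, 1, 2, 10]:
    ratio = strict_matching_ratio(t)
    if ratio > 1:
        label = "impulsive"
    elif ratio < 1:
        label = "self control"
    else:
        label = "indifferent"
    print(f"T = {t:>4} s: B1/B2 = {ratio:.2f} ({label})")
```

Ratios above 1 favour the smaller, sooner reinforcer; ratios below 1 favour the larger, later one.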

This graph plots log($B_2/B_1$), rather than $B_1/B_2$, and shows how self control increases as you go back in time from when the reinforcers are due.

[Figure: matching law predictions, log($B_2/B_1$) plotted against the value of T, with regions labelled self control and impulsive.]

Green et al.’s actual data

[Figure: Green et al. (1981), mean data, plotted against the value of T. Positive values indicate self control; negative values indicate impulsivity.]

Commitment

• Do this now

• Don’t have a choice to do the bad thing later

• Halloween candy


Commitment in the laboratory

Rachlin & Green (1972)

Pigeons chose between:

EITHER allowing themselves a later choice between a small short-delay (SS) reinforcer or a large long-delay (LL) reinforcer,

OR denying themselves this later choice, so that they can only get the LL reinforcer.

[Figure: the Rachlin & Green (1972) procedure, showing time T, the blackout, the reinforcer, and the three outcomes: larger later, smaller sooner, and larger later with no choice.]

Operant Conditioning

As they moved the time T at which the commitment response was offered earlier in time from the reinforcers (from 0.5 to 16 s), preference should reverse.

Indeed, Rachlin and Green found that 4 out of 5 birds developed commitment (what we might call a commitment strategy) when T was larger.


Mischel & Baker (1975)

Experimenter puts one pretzel on a table and leaves the room for an unspecified amount of time.

If the child rings a bell, experimenter will come back and child can eat the pretzel.

If the child waits, experimenter will come back with 3 pretzels.

Most children chose the impulsive option.

But there is apparently a correlation with age, SES, IQ scores.

(correlation!)


Mischel & Baker (1975)

Self control was less likely if children were instructed to think about the taste of the pretzels (e.g., how crunchy they are).

Self control was more likely if they were instructed to think about the shape or colour of the pretzels.

Much of the human data was replicated with animals by Grosch & Neuringer (1981).

For example, making food reinforcers visible disrupted self control, but an extraneous task helped self control.

Can nonhumans be trained to show sustained self control?

Mazur & Logue (1978) - Fading in self control

            Delay (s)    Magnitude (s)
Choice 1    6            2
Choice 2    6            6

Preferred Choice 2 (larger magnitude, same delay) -- Self control

Over 11,000 trials, they faded the delay to the smaller magnitude (Choice 1) to 0 s -- and self control was maintained!

Additionally, and this is important, self control was maintained even when the outcomes were reversed between the keys.

In other words, the pigeons didn’t have to be re-taught to choose the self control option, but applied it to the new situation.

Contingency contracting

A common therapeutic procedure: e.g., “I give you my CD collection, and agree that if I don't lose 0.5 kg per week, you can chop up one of my CDs -- each week.”

You use the facts of self control -- i.e., you say "let's start this a couple of weeks from now" and the client will readily agree -- if you said, "starting today", they most likely would not.

It's easy to give up anything next week...


Other Commitment Procedures

• Tell your friend to pick you up

• Let everyone know you’ve stopped smoking

• Avoid discriminative stimuli

• Train incompatible behaviors

• Bring consequences closer in time

Social dilemmas

A lot of the world’s problems are problems of self control on a macro scale.

-Investment strategies

Rachlin, H. (2006). Notes on discounting. Journal of the Experimental Analysis of Behavior, 85, 425-435.

“ In general, if a variable can be expressed as a function of its own maximum value, that function may be called a discount function. Delay discounting and probability discounting are commonly studied in psychology, but memory, matching, and economic utility also may be viewed as discounting processes.”
