Chapter 4: The Basic Findings In Instrumental/Operant Conditioning
Overview: This chapter is arranged in four major sections. The first presents the background to
Instrumental Conditioning, covering the early work by Watson and Thorndike, the basic
findings, what they thought was going on, and what some of the standard paradigms for
Instrumental Conditioning are. The second presents many of the basic principles that determine
when conditioning will occur, and whether it will be excitatory or inhibitory. The third
discusses important exceptions to these principles, and examines the complex interactions that
can arise. Finally, the fourth section briefly examines several alternative accounts of the type of
association that can form. Several additional accounts of learning are introduced here, most
notably Tolman's Cognitive Expectancy Approach and Skinner's Radical Behaviorism.
This section closes by examining some of the interrelationships between Instrumental and
Classical Conditioning.
I. Introduction To Instrumental Conditioning
We now shift from the topic of classical conditioning to that of instrumental (or as Skinner
terms it, operant) conditioning. This topic will prove a bit more complex in its findings. As you
will see, however, many of the ideas that were important in classical conditioning will prove
relevant here. Indeed, there has long been a debate over whether classical and operant
conditioning ought to be regarded as truly different forms of learning. They appear to differ in
the sense that classical conditioning generally involves the presence of reflex actions, whereas
instrumental conditioning generally involves modifications of voluntary behavior contingent on
presence of reinforcers or punishers. Whether that is a sufficient reason to distinguish them is
arguable, as we will see later. My sense of the field today is that most theorists would like to see
similar theories explain the results in both. Thus, it will not surprise you, for example, that a
modified version of the Rescorla-Wagner model has also been proposed for instrumental
conditioning.
Let's start with some historical background.
A. Background: Two Early Views Of Instrumental Conditioning
We will look at two quite different claims about the nature of instrumental conditioning.
One comes from Watson, the author of the 1913 behaviorist manifesto, Psychology as the
Behaviorist Views It, and the second comes from Thorndike, who can probably safely be credited
with conducting the first truly sophisticated and careful observations of complex animal learning.
Their accounts differ in ways that prefigured an important debate about what was needed for
learning to occur.
First, however, let us distinguish instrumental conditioning from classical conditioning. In
instrumental conditioning, an animal makes one of a number of possible responses in the
presence of some stimulus complex or context. That response may lead to some outcome. We
typically define learning in this circumstance as an alteration in some observed characteristic of
the response such as its frequency, latency, or amplitude. We will revisit this definition in more
detail later, once we have examined several theories of what gets acquired, and why. For now,
we can talk about instrumental conditioning as the type of learning involved in navigating a
maze, choosing the correct one of several doors to run to, or even performing some response that
will be successful in avoiding a future shock. In instrumental conditioning, new responses may
be taught that differ from any reflexive response already in the animal's behavioral repertoire.
Watson: Contiguity of S & R
As you know from Chapter 1, Watson attempted to redefine the field of psychology in
response to then-current mentalism. We have looked at several of the assumptions he brought
along. Basically, he was an extreme environmentalist who believed that most -- if not all -- of
our actions were under the learned control of associations. Based on his knowledge of work
being done by people like Pavlov, Watson certainly believed that living things were born with a
repertoire of reflexes. However, association quickly acquired control of previously reflexive
responses, and indeed helped modify those responses to create new responses. Some idea of
Watson's radicalism on this point may be gathered from a very famous quote (1926, p. 10):
Give me a dozen healthy infants, well-formed, and my own specified world to bring them up in
and I'll guarantee to take any one at random and train him to become any type of specialist I
might select -- doctor, lawyer, artist, merchant-chief and, yes, even beggar-man and thief,
regardless of his talents, penchants, tendencies, abilities, vocations, and race of his ancestors.
This was radical in at least two related ways. First, from a scientific perspective, it clearly denied
the relevance of genetic or inherited influences on current behavior. And second, from a social
perspective, it was about as different a position as one could expect to find from the then-prevailing attitudes about race and class.
In any case, Watson's primary idea was that an association could form between a stimulus
and a response (in addition to the type of association found in classical conditioning). But he was
a strict contiguity theorist on the issue of S-R associations: A response made in the presence of
a stimulus might associate with it, and under certain circumstances, would be likely to be seen
when that stimulus recurred. Those circumstances were defined by essentially two principles.
The first, a principle of frequency, stated that the association strengthened each time the
response was made to the stimulus, so that all things being equal, a frequent response was much
more likely to be emitted by the animal than a less frequent response. In addition, however, there
was a principle of recency: All things being equal, a recent response was more likely to be
emitted than a less recent response.
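Neither principle requires anything beyond a record of what the animal has done. As a purely illustrative sketch (Watson stated frequency and recency verbally, not as a formula), here is one way to express the idea in a few lines of Python; the scoring scheme and the numbers in it are my own assumptions, chosen only to show how frequency and recency jointly favor one response over another.

# Illustrative sketch only: Watson's frequency and recency principles were verbal,
# not mathematical. Here we simply score each candidate response to a stimulus
# by (a) how often it has occurred and (b) how recently it last occurred.

def predict_response(history, now, recency_weight=1.0):
    """history: list of (time, response) events in the presence of a stimulus.
    Returns the response with the highest combined frequency + recency score."""
    scores = {}
    for t, response in history:
        scores.setdefault(response, 0.0)
        scores[response] += 1.0                                # frequency principle
        scores[response] += recency_weight / (1 + (now - t))   # recency principle
    return max(scores, key=scores.get)

# A response that has been made often, and made recently, gets the highest score.
history = [(1, "paw at door"), (2, "meow"), (3, "paw at door"), (4, "pull rope")]
print(predict_response(history, now=5))

A response that has been made often, and made most recently, gets the highest score, which is all the two principles claim.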
What you should particularly note about the brief description of Watson's system above is
the complete lack of any reference to a reinforcer, a perhaps surprising omission to students
who have been introduced to the idea that instrumental/operant conditioning is in large part about
the effects of rewards and punishments. That wasn't so for Watson, and it has not always been so
for later theorists as widely divergent as Guthrie and Tolman (see below and in the next chapter).
But on a preliminary and casual analysis of classical conditioning, the notion of a reward or
punishment does not seem greatly relevant in discussing whether the association forms.
(Nevertheless, some theorists refer to the UCS as a reinforcer on a broad definition that a
reinforcer is what makes a response more likely; presence of the UCS, whether in excitatory-appetitive or excitatory-aversive conditioning, certainly accomplishes that!) Why, then, ought we
to include it in instrumental conditioning?
And even though Watson talked about associations between stimuli and responses, he also
allowed for the possibility of associations between responses themselves. Thus, in the case of
animals learning to run a maze, the analysis of what is going on will involve a complex series of
muscle movements involving motor responses. (Much of our behavior is complex, rather than the
execution of simple responses to individual stimulus triggers.) Rather than talk about external
stimuli controlling each succeeding muscle movement that gets the animal from Point A to Point
B in the maze, Watson claimed that a chain of responses could be linked together that would be
initially set off by an external stimulus. Of course, to the extent that any response also involves
internal stimulation, one could still analyze chains in terms of stimulus-response links, so that
each muscle movement in the chain serves as the response to the previous movement, and the
stimulus for the next.
We have already discussed in Chapter 1 Watson's insistence that thinking could be reduced
to subvocal speech. He also conducted experiments in emotional conditioning. In a famous
study with Rayner, Watson conditioned a young child, Little Albert, to be afraid of a white rat.
Every time Albert played (apparently happily, at first) with the rat, an experimenter would creep
up behind Albert and strike a metal bar, making a loud clanging noise that frightened Albert and
caused him to cry. After several such occasions (six, in fact), Albert started to cry at the sight of
the rat. Note how this could be analyzed from the point of view of classical conditioning: The
noise caused the apparent emotional response of fear, whereas the rat served as the CS.
Given what you know of Watson's views on mentalism, you may be somewhat surprised to
discover him talking about the topic of emotions. However, for Watson, emotions were not
underlying mentalistic events, but rather, the behavioral components (the crying, the
whimpering, the shaking, etc.) observed in reaction to certain stimuli. Thus, Watson maintained a
perfect consistency with respect to his position that positivism required dealing strictly with a
behavioral level. Perhaps that is why he did not talk about reinforcers. Thorndike, a
contemporary of Watson's, was developing a theory of learning based on reinforcers, and
although he defined them in a sufficiently behavioristic fashion, he was nevertheless attacked by
others for apparently sneaking mentalistic terms back into hard-nosed scientific psychology.
To reiterate a point I made earlier, later behaviorists have on occasion adopted a strict
contiguity approach to learning. Most notable among these as a successor to Watson was
Guthrie, whose principle of conditioning stated (1952, p. 23):
A combination of stimuli which has accompanied a movement will on its recurrence tend to be
followed by that movement. Note that nothing is here said about...reinforcement or pleasant
effects.
As we will see later, such approaches were in part a reaction to work by Tolman and his
colleagues suggesting that learning could occur in the absence of rewards or punishers. The
question that faces such theorists then becomes one of explaining how and why rewards and
punishers seem to influence the course of learning.
Thorndike & Puzzleboxes: Reinforcement-Based Learning
Rewards and punishers, in contrast, played a pivotal role in the work of Thorndike, who is
often credited with founding the field of instrumental conditioning. Thorndike published a
monograph in 1898 on his studies with animals such as cats. He set up an experimental apparatus
termed a puzzle box: a cage in which the animal was placed, and which could be escaped
through the performance of a simple response such as pulling on a rope attached to a door. These
studies really involved the first careful, detailed observations of what animals in general learned,
as opposed to anecdotal stories collected of amazing things animals did that obviously proved
their intelligence. (Television still plays into that sort of approach, needless to say!)
Thorndike asked a very simple question: Would escape from a puzzle box exhibit any signs
of intelligence? Would it display evidence of insight, in which the animal would be able to
glance about its environment, understand that the rope was attached to the door, and realize that
it needed only to pull on the rope to get out? To answer this question, Thorndike repeatedly
placed animals in the same puzzle box, and measured how long it took them to escape. And what
he found was that the time to escape decreased only gradually. By the end of the experiment,
after 20 or so trials, cats would easily leave the box by performing the appropriate response as
soon as they were placed in it. But, their history clearly demonstrated that this had to have been a
learned response. In particular, Thorndike pointed out that an animal making the correct response
on a given trial early in training would not necessarily choose that same response as its first
response on the next trial. So, rather than insight, he concluded that learning involved trial-anderror.
Trial-and-error refers to the gradual accumulation of correct responses through a slow
process of trying out all sorts of possibilities, and slowly weeding out the ones that do not work.
As did Watson, Thorndike thought animals were acquiring associations between stimulus
configurations (such as the puzzle box) and certain responses. But unlike Watson, he claimed
that an additional factor was important in the acquisition of these associations: They would
depend on the outcome of the animal's actions. This involved a principle Thorndike termed the
Law of Effect. Put briefly, this law claimed that an association between a stimulus and a
response would strengthen if the response were followed by a satisfactory state of affairs, and
would weaken if the response were followed by an unsatisfactory state of affairs. Thus,
Thorndike deliberately included Bentham's notion of hedonistic value as a principle governing
the formation of an association, in contrast to Watson. Rather than being a simple contiguity
theory, this was a reinforcement theory: In modern terms, learning of an association will occur
when there is a reinforcer following a response.
There are, of course, a number of interpretations available to account for how a reinforcer
might operate according to the law of effect. One of the first to come to most people's minds is a
teleological or purposive explanation: The animal performs a response because it desires the
outcome. But of course, desiring an outcome is a mental state that involves an object not present
at the time the animal is performing the response. That type of explanation would violate the
positivist program Watson insisted everyone follow. Thus, as an alternative, we might propose
that a positive outcome has an automatic effect of strengthening the association: The animal does
not perform the response because it wants the outcome, but rather because the response is
strongly associated to the stimulus that is present.
Here is what Thorndike actually said regarding satisfying and unsatisfying states (1913, p.
2):
By a satisfying state of affairs is meant one which the animal does nothing to avoid, often doing
things which maintain or renew it. By an annoying state of affairs is meant one which the animal
does nothing to preserve, often doing things which put an end to it.
Although he was accused of using hopelessly mentalistic terms in describing learning as
depending on satisfactory or unsatisfactory states, his actual definition provided a clear
behavioral test for determining when one or the other state was present. In that sense, it ought to
have troubled people no more than Watson's use of the term "emotional."
Note too that Thorndike did not include the outcome in the association. As we will see,
other theorists have claimed that associations to the outcome may also form, so that we can have
S-R associations, R-O associations, and even S-O associations. To anticipate how such a model
might differ from Thorndike's, a strong S-R association may exist despite a highly unpleasant or
unsatisfying outcome: The presence of an R-O association in that event may serve to inhibit the
R excited by presence of an associated stimulus.
Thorndike also proposed another principle, the Law of Exercise (sometimes called the Law
of Use). This was essentially a principle of practice, somewhat similar to Watson's notion of
frequency: An association would strengthen if practiced. Both laws were revised in his later
work: the Law of Effect was essentially restricted to satisfactory outcomes, and the Law of Use
was modified to include outcomes rather than simple exercise.
Thorndike also spoke of the value of different satisfactory states, so that strong satisfiers
would do a better job of strengthening an association than weak satisfiers. And as an interesting
historical footnote, he actually contradicted one of the major principles of strict contiguity by
proposing an early version of belongingness by which some things would be more likely to
associate together than others.
In some sense, Skinner may be regarded as Thorndike's intellectual successor. Skinner
proposed similar ideas involving the law of reinforcement and the law of punishment.
According to Skinner, a reinforcer was any event that, following a response, made that response
more likely, whereas a punisher was any event that had the opposite effect. To try to identify
reinforcers and punishers in a way that wasn't completely circular (and also wasn't mentalistic),
Skinner imposed a condition of transituationality: A reinforcer or punisher, once identified in
terms of its effects on one response, also has to be shown capable of having a similar effect in
other situations, on other responses. Otherwise, we find ourselves defining a response as that
which, when followed by a reinforcer, increases in frequency. And that type of definition, of
course, reciprocally defines responses and reinforcers in terms of one another in an uninteresting,
circular fashion.
With this as background, let us look at some of the basic findings in instrumental
conditioning.
B. Some Basic Findings
Generalization, Discrimination, & Contrasts
Many of the basic findings will prove familiar, although there will also be some additional
results of interest. But in any case, as was true of classical conditioning, we obtain
generalization, discrimination, and contrasts.
The usual procedure for obtaining generalization involves pairing a response with an
outcome in the presence of a specific stimulus, and then presenting other stimuli to see whether
there is a similar response to them. As outcomes may be of two sorts (reinforcers and punishers),
we may obtain two different types of generalization gradients. The gradient associated with use
of a reinforcer is termed the gradient of excitation, whereas the gradient associated with use of
a punisher is termed the gradient of inhibition. In an excitatory gradient, we look for
responding to novel stimuli that is above the background or baseline or operant level; and in a
gradient of inhibition, we look for responding below normal.
Typically, when a response has been reinforced in the presence of the stimulus, that
stimulus is referred to as S+. Similarly, when the response has been punished, the stimulus is
referred to as S-. Watson and Rayner used an S- with Little Albert. In their work, they also
reported obtaining generalization: Albert developed fear reactions to other stimuli (such as
rabbits and coats) involving the features white and fur. Although they had planned on reversing
the fear conditioning, Albert's mother removed him from the daycare where they were doing
their experiments.
A good example of a gradient of excitation may be found in the work of Guttman and
Kalish. They took four different groups of pigeons and trained them to peck at a colored key.
The key differed in color for the four groups (530, 550, 580, or 600 nanometers). Then, in a
generalization test, Guttman and Kalish presented a series of 11 colors, one at a time, and simply
counted the number of pecks per 6 minute period that each color received. These colors included
the original (the S+), 5 colors above the S+ in wavelength, and 5 colors below. Their results
appear in Figure 1.
Several features of these results
should be noted. First, the stimulus that
received the most pecks for each group
was S+: That is where the peak of
each generalization gradient may be
found. (As you will see later, this need
not always be the case. Certain
experiences such as discrimination
training may alter the peak and shape
of a generalization gradient.) Second,
there was a relatively smooth dropoff of responding as the wavelengths of
the stimuli increasingly differed from
the S+. And finally, the curves were
symmetric: The left-hand side of each
curve looked approximately like the right-hand side.
Similar features may be found in a gradient of inhibition. Rather than look for a peak,
however, we search for a valley representing the lowest level of responding. Here, as the stimuli
increasingly differ, we ought to find increasing recovery of responding. Thus, a gradient of
inhibition looks a bit like an upside down gradient of excitation. In each, the idea is that
similarity of stimuli maps into similarity of responses.
As was true of classical conditioning, there will be occasions in which we want to train an
animal to treat apparently similar stimuli as if they were different. As a parent, you might think
there to be good reason to train a child to fear rats without desiring that such reactions extend
also to rabbits or cats. The standard technique for teaching a discrimination in
instrumental/operant conditioning will prove similar to that introduced in classical conditioning:
We present the outcome whenever the organism makes the response in the presence of one
stimulus, but not in the presence of another. To introduce technical terms, the stimulus that
signals an effective response (effective in the sense of producing an outcome) is called the
discriminative stimulus or discriminative cue (SD). The stimulus that should come to signal an
ineffective response is normally represented with a delta symbol. As I am posting this to the web
where delta symbols are a bit tricky to insert into normal text, I will adopt the practice of using
S+ and S- in this situation, as well.
There are other techniques to train a discrimination, as you will see in a later chapter. Rather
than associate one stimulus with no outcome, we can associate it with the need for a different
response. Thus, perhaps the animal will need to turn to the left for food when a red light is
present, but will need to turn to the right for food when an orange light appears. Such a technique
is referred to as choice discrimination. Alternatively, we might slowly introduce the second
stimulus into the animal's environment, presenting it initially at very low levels of intensity. If
the intensity is slowly increased, we may find that our animal has never responded to it, thus
foregoing generalization. (You should be wondering whether there is something like latent
inhibition going on with this procedure!) This technique is referred to as errorless
discrimination. Each technique appears to have different effects on the generalization gradient.
In particular, the standard technique using S+ and S- seems to cause the peak to move away from
S+, and to the side opposite S-, a phenomenon referred to as peak shift. Moreover, peak shift is
typically associated with a gradient that is no longer symmetrical: The gradient appears to be
'bunched up' on the S- side.
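One way to see how summed excitation and inhibition could produce a peak shift is with a small numerical sketch. Everything below is assumed for illustration: the Gaussian-shaped gradients, the wavelengths, and the peak heights and widths are arbitrary choices, not anyone's data. The only point is that when an inhibitory gradient centered on S- is subtracted from an excitatory gradient centered on S+, the net peak ends up displaced away from S-.

import math

def gradient(stimulus, center, peak, width):
    """Illustrative generalization gradient: responding falls off smoothly
    with distance from the trained stimulus (arbitrary units)."""
    return peak * math.exp(-((stimulus - center) ** 2) / (2 * width ** 2))

S_PLUS, S_MINUS = 550, 570            # hypothetical wavelengths in nm
wavelengths = range(500, 601, 10)

net = {}
for w in wavelengths:
    excitation = gradient(w, S_PLUS, peak=100, width=25)   # from reinforcement at S+
    inhibition = gradient(w, S_MINUS, peak=60, width=25)   # from nonreinforcement at S-
    net[w] = excitation - inhibition                        # algebraic summation

peak_stimulus = max(net, key=net.get)
print(peak_stimulus)   # lands below 550: the net peak has shifted away from S-

With these made-up numbers the best net responding falls at 540 nm rather than at the 550 nm S+, that is, on the side opposite S-; you can also see the gradient 'bunching up' (and even dipping below zero) on the S- side.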
Finally, we may also note the existence of contrast effects associated with these phenomena.
There are several types of contrasts found in instrumental conditioning. One that accompanies
peak shift is termed behavioral contrast. Hanson, for example, compared discrimination
learning and non-discrimination learning groups. The discrimination-learning group displayed a
peak shift. Their responding to the S+ dropped off considerably. But the responding to the
stimulus that was the new peak increased dramatically. This group displayed about twice as
many responses to this untrained stimulus compared to the control group that did not have
discrimination training. Thus, in a behavioral contrast, responding occurs to a novel stimulus
at a greater level than would be expected on the basis of simple generalization.
Inhibition In Extinction & Punishment
Extinction in instrumental conditioning will involve essentially the same process of
decoupling that we saw in classical conditioning. That is, we generally remove the outcome after
which we ought to see the response return to normal or baseline levels (sometimes referred to as
operant levels). As is the case for acquisition (and for classical conditioning), the learning curve
for extinction typically involves diminishing returns. How long it takes a response to return to
pre-learning (baseline) levels is referred to as its resistance to extinction: Responses that
quickly return to baseline levels have a low resistance to extinction, whereas those that take a
long time to return to baseline have a high resistance. Resistance to extinction will depend on a
number of factors including the value of the outcome (see below), the energy required to make
the response (more physically demanding responses generally have lower resistance), and the
past history of training (responses learned under partial reinforcement conditions generally have
greater resistance to extinction than those acquired under continuous reinforcement conditions:
see below and the later chapter on partial reinforcement and extinction).
As was true in classical conditioning, extinction in instrumental conditioning is viewed as a
type of inhibitory learning. Following extinction, we obtain similar patterns of spontaneous
recovery and relearning that we did with classical conditioning: An extinguished response tends
to recur after a while (spontaneous recovery), arguing against any claim that the association
acquired during the acquisition phase had actually been destroyed or forgotten. Similarly, pairing
the extinguished response with the outcome results in much faster acquisition (relearning),
another argument suggesting extinction does not destroy the original learning.
Moreover, a stimulus associated with extinction appears to act as an aversive stimulus for
the animal, suggesting some degree of inhibition. Daly, for example, found that rats would learn
to escape from a place where they had earlier expected a reinforcer. When the reward was no
longer available, the contextual cues associated with that location were sufficient to motivate the
animal to avoid them by learning a new response that got it out of that situation.
Complicating the picture somewhat is the fact that an outcome may be a punisher.
Punishment, of course, often suppresses a response. There has been an argument extending as far
back as Thorndike concerning the effectiveness of punishment. Many people have reported that
punishers seem to have, at best, temporary suppressive effects on on-going behavior. However,
that issue appears to involve the intensity of the punisher. There is now plenty of evidence that
highly aversive punishers may have long-lasting effects. According to Bolles, stimuli present
when a punisher occurs may become conditioned danger signals that will tend to interfere with
on-going behavior by activating the animal's instinctive defenses (SSDRs: species-specific
defense reactions). Rats, for example, will run, freeze, or fight. So, in this case, a conditioned
suppression-like reaction may occur because one of these responses will be incompatible with
other excitatory responses such as pressing a lever for food.
In the case of a punished response, of course, extinction of that response by no longer
associating it with an aversive outcome ought to inhibit the stimulus's ability to act as a danger
signal triggering an SSDR. Inhibition of aversion in this case means seeing less aversion.
One more point while we are (briefly) on the subject of punishment: One of the difficulties
theorists have had with the effects of punishment (and with positing a general principle that
punished responses decrease in frequency) may be seen from a study by Brown, Martin, and
Morrow. They taught rats to run an alleyway to escape shock. Basically, the alleyway was
electrified, so the animals needed to run to the goal box (the only non-electrified, safe portion of
the alleyway). When the shock was turned off, there was fairly quick extinction of running.
However, two other groups of rats were also put through an extinction procedure. For one
of these groups, the shock was also turned off in the start box, so that they would actually be
punishing themselves for venturing out of the start area. The other group had the final 2 feet (of a
10-foot alleyway) electrified, so that they would be punished for trying to get to the goal box.
Curiously enough, these two groups did not extinguish anywhere near as rapidly: By the 6th day
of extinction, they were still running to the goal box, giving themselves needless shocks. Thus,
punishment sometimes can actually prolong the response being punished. This effect is called
vicious circle behavior.
Within the framework of a model such as Bolles's theory, a finding like that of Brown et al.
may be accounted for in terms of shock continuing to trigger the rat's running SSDR. It is also
possible that vicious circle behavior may ensue because of multiple mechanisms. Thus, in
another experiment, Badia and Culbertson set up a situation in which shock could be signaled
or unsignaled. In signaled shock, a stimulus will come on slightly before the shock. In this
study, they allowed rats to learn a response whose only reinforcer involved shocks being
signaled. Their animals acquired the response. Moreover, Badia, Culbertson, and Harsh found
that given a choice between unsignaled mild shocks of short duration and signaled shocks of
longer duration and higher intensity, the animals still performed the response, thus apparently
subjecting themselves to more punishment than was necessary. This type of vicious circle
behavior seems different from that of Brown et al. Rather than involve danger signals triggering
SSDRs, it seems to implicate a tradeoff between severity of the shock and predicting when it
ought to occur. On the other hand, Bolles also talks about safety signals that indicate a period
free from danger. In the unsignaled condition, there are no safety signals. Thus, this type of
vicious circle behavior may well result from an organism's search for safety signals. That ought
to remind you a bit of the work on compensatory or antagonistic conditioning, and its adaptive
value: Being in a highly aroused and tense physiological state because of the continual presence
of danger is physiologically stressful; safety signals help moderate the wear and tear.
Mediated Learning & Secondary Reinforcers
In classical conditioning, we discussed several types of mediated learning (higher-order
conditioning; sensory preconditioning) involving building chains of associations that would
allow distant events to become associated together (recall also the Dwyer et al. study presented at
the end of the last chapter). One of the major mechanisms for mediation in instrumental
conditioning is secondary reinforcement (although, as we will see, there are certainly aspects of
classical conditioning that govern this mechanism). Secondary reinforcement is learned
reinforcement: an otherwise neutral stimulus that acquires the ability to motivate new learning or
performance. Primary reinforcement, by way of contrast, is assumed to operate reflexively
because of an organism's genetic makeup.
Skinner may be credited with first making the distinction between primary and secondary
reinforcers. The standard example of secondary reinforcers operating in human societies is the
use of money. In our society, money includes round pieces of metal and rectangular pieces of
paper that have an extraordinary power to motivate behavior. In other societies, different objects
serve a similar function (tooled shells, for instance). These objects are not valuable in themselves
(aside from aesthetic considerations of design, etc.), but supposedly take on their value by means
of serving as a medium of exchange for intrinsically valuable goods such as food or drink.
Presumably, they acquire their reinforcing properties by being associated with primary
reinforcers.
Essentially, then, secondary reinforcers are believed to be conditioned through a process of
classical conditioning involving the following set-up:
CS (neutral stimulus) & UCS (primary reinforcer such as food)
Once we have established a pairing between the CS and a primary reinforcer, we may then test
for its value as a secondary reinforcer. Our experimental design would be as follows:
Group          Classical Conditioning    Instrumental Acquisition
experimental   CS & primary RF           R in presence of S+ followed by CS
control        (Nothing)                 R in presence of S+ followed by CS
If we see an increase in responding in the experimental group compared to the control group,
then our CS has acquired reinforcing properties. This example should make clear why this is an
instance of mediated learning: The effect of the CS in the experimental group occurs by virtue of
its link with the primary reinforcer or UCS. When that link weakens, the value of the CS as a
secondary reinforcer ought also to weaken. Thus, in times of inflation when more money is
required to buy the same food, the reinforcing properties of a dollar or five dollars weaken.
On a classical conditioning analysis of secondary reinforcement, we would expect to obtain
findings like those we've already seen in the previous chapters. Several examples of such
findings might be mentioned. One involves a study by Egger and Miller. They trained pigeons
using a design similar to this one:
Group   Classical Conditioning        Instrumental Acquisition
1       CS1 --> CS2 --> primary RF    R followed by CS1 or CS2
2       CS1 --> CS2 --> primary RF    R followed by CS1 or CS2
        CS1 --> ..... --> no RF
As you can see from this design, Group 2 had discrimination training in the sense that presence
or absence of CS2 was relevant to predicting presence or absence of the UCS. Not surprisingly,
given its better signal value, CS2 turned out to be the secondary reinforcer for the instrumental
acquisition phase in this group. But what about Group 1? CS2 certainly has better contiguity with
the UCS. However, in terms of signal value, it is not adding anything to what CS1 already
predicts. Thus, it is redundant, and we would predict from models like Kamin's or Mackintosh's
that CS2 would be blocked. Consistent with this prediction, the secondary reinforcer for instrumental
acquisition in Group 1 is CS1, and not CS2.
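The blocking logic being appealed to here is easy to demonstrate with a toy Rescorla-Wagner simulation. To keep the code short, the sketch below uses the standard two-phase blocking design (pretrain CS1 alone, then reinforce the CS1+CS2 compound) rather than Egger and Miller's serial arrangement, and the learning rate and asymptote are arbitrary values chosen for illustration; the qualitative result, a redundant CS2 that acquires almost no associative strength, is what matters.

# A minimal Rescorla-Wagner sketch of blocking (standard two-phase design,
# not Egger & Miller's exact serial procedure). Parameter values are arbitrary.

ALPHA, LAMBDA = 0.3, 1.0   # learning rate and asymptote of conditioning

def rw_trial(V, cues, reinforced):
    """Update associative strengths V for the cues present on one trial."""
    total = sum(V[c] for c in cues)
    error = (LAMBDA if reinforced else 0.0) - total
    for c in cues:
        V[c] += ALPHA * error
    return V

V = {"CS1": 0.0, "CS2": 0.0}

for _ in range(30):                 # Phase 1: CS1 alone predicts the UCS
    rw_trial(V, ["CS1"], True)
for _ in range(30):                 # Phase 2: CS1+CS2 compound, still reinforced
    rw_trial(V, ["CS1", "CS2"], True)

print(V)  # CS1 near 1.0; CS2 near 0.0 -- the redundant cue is blocked

Run as is, CS1 ends near the asymptote of 1.0 while CS2 stays near zero; delete the pretraining phase and the two cues instead split the available associative strength. That is why redundancy, and not mere contiguity with the UCS, is the key to which stimulus becomes the effective secondary reinforcer.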
Another example is rather cute. It comes from the Brelands, former students of Skinner's
who tried to train animals to perform in commercials using the principles they had learned. It is
also cute because it involves the notion of money as a secondary reinforcer. In one instance, they
attempted to train a pig to roll a (fake) coin into a piggy bank. During the training, the coin was
paired with a primary reinforcer, since they wanted to use the coin as a secondary reinforcer for
the responses involved in rolling. The procedure worked for a brief while, but then the pig
started treating the coin as if it were similar to the food it had been paired with: It tried to root the
coin just as it would have rooted real food. This result, termed instinctive drift, is perhaps one of
the clearest demonstrations of the involvement of classical conditioning in secondary reinforcers,
though it also serves to remind us that Watson's claim about reflexes quickly becoming
overwhelmed by learned associations radically overstates the case.
Secondary reinforcers play an important role in certain aspects of therapy and classroom
behavior. In clinical and educational settings, behavior modification techniques based on
principles of conditioning are used to try to change unacceptable behavior. These techniques
typically include a component of secondary reinforcement by which objects such as poker chips
may be accumulated for making desired responses (or avoiding undesired responses), and later
traded for privileges such as snacks, movies, pencils, etc. Use of such secondary reinforcers
involves the construction of what is called a token economy.
Other findings relevant to the involvement of classical conditioning in secondary
reinforcement include the intensity of the primary reinforcer (more intense primary reinforcers
yield more effective secondary reinforcers), the number of times the putative secondary
reinforcer is paired with the primary reinforcer, and the delay between these events. You should
be able to figure out why models such as Rescorla-Wagner or Wagner's rehearsal model, for
example, would support these findings.
One more important phenomenon while we are on the subject of mediated conditioning and
secondary reinforcement: Most behavior involves a complex series of responses executed in a
certain rapid and relatively smooth order. How is it that each single response can be reinforced?
There hardly seems time for that. And how is it that organisms in real environments (rather than
the laboratory where a researcher can control reinforcers and stimuli) acquire such complex
organizations? The answer to these questions involves the concept of chaining, and will prove to
rely heavily on secondary reinforcers.
We briefly introduced the notion of response chains earlier. An example will illustrate this
concept. Let's set ourselves the task of teaching pigeons to Time Warp. The Time Warp is the
dance from the Rocky Horror Picture Show. It (as is true of all dances) may be regarded as a
series of steps in a chain. In the case of the time warp, there are 5 steps (The Rocky Horror Show,
1975):
It's just a jump to the left, and then a step to the right. With your hands on your hips, you bring
your knees in tight. But it's the pelvic thrust, They really drive you insane. Let's do the Time
Warp again.
Normally, we would try to teach a chain backwards. So, we will train the last step first. That
involves teaching the pigeon a pelvic thrust. We have our response here, but we need a stimulus
and a reinforcer. Let's use a red light for the stimulus (seems appropriate, huh?), and some drink
for the reinforcer. Our design then is:
Phase   Stimulus (CS)   Response        Reinforcer (UCS)
1       red light       pelvic thrust   drink
Note particularly that I have also labeled the stimulus a CS, and the reinforcer a UCS. This
is meant to suggest that classical conditioning will be going on simultaneously with instrumental
conditioning: The stimulus is paired not only with the response, but also with the outcome.
Thus, as a result of instrumental conditioning, the animal should do a pelvic thrust to the red
light. But, as a result of classical conditioning, the red light ought to become a secondary
reinforcer. And that should suggest to you the rest of the design. Here it is in full:
Phase   Stimulus (CS)    Response         Reinforcer (UCS)
1       red light        pelvic thrust    drink
2       blue light       knees in tight   red light
3       green light      wings on hips    blue light
4       yellow light     step to right    green light
5       white light      jump to left     yellow light
So, if you look for the moment just at Phase 2, notice that we will reward the pigeon for bringing
its knees in tight by following that response with the red light. If the red light is a secondary
reinforcer, then the animal will acquire the response. And note too that the red light also serves
as the signal for the next step after knees in tight: the pelvic thrust. And finally, note that in Phase
2, we ought to obtain second-order conditioning: Two CSs (the blue and red lights) are being
paired. If successful, this means that the blue light now also becomes a secondary reinforcer.
At the end of this, the sequence will be that a white light serves as the signal for a jump to
the left; that's reinforced by the yellow light (thanks to fourth-order conditioning) which also
signals Step 2 (a step to the right); that's reinforced by the green light (thanks to third-order
conditioning), which signals Step 3 (wings on hips); that's reinforced by the blue light (thanks to
second-order conditioning), which signals Step 4 (knees in tight); and that is reinforced by the
red light (thanks to first-order conditioning), which finally signals the last step of the dance.
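Before turning to the control condition, it may help to see the backward-chaining logic laid out as a short sketch. The code below simply expresses the training schedule as data (nothing in it is a model of learning): phase 1 is reinforced with the primary reinforcer, and every later phase is reinforced with the previous phase's stimulus, on the assumption that classical conditioning has already made that stimulus a secondary reinforcer.

# The Time Warp training schedule expressed as data. Each phase's stimulus
# is assumed (via classical conditioning) to become a secondary reinforcer,
# which is then used to reinforce the response trained in the next phase.

steps = [            # (stimulus, response) in the order they are TRAINED
    ("red light",    "pelvic thrust"),
    ("blue light",   "knees in tight"),
    ("green light",  "wings on hips"),
    ("yellow light", "step to right"),
    ("white light",  "jump to left"),
]

primary_reinforcer = "drink"

# Build the backward-chaining schedule: phase 1 uses the primary reinforcer,
# every later phase uses the previous phase's stimulus as its reinforcer.
schedule = []
for i, (stimulus, response) in enumerate(steps):
    reinforcer = primary_reinforcer if i == 0 else steps[i - 1][0]
    schedule.append((i + 1, stimulus, response, reinforcer))

for phase, stimulus, response, reinforcer in schedule:
    print(f"Phase {phase}: {stimulus} -> {response}, reinforced by {reinforcer}")

At performance time the chain then runs off in the reverse of the training order: white light, jump to the left, yellow light, step to the right, and so on down to the drink.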
We haven't talked about a control experiment for this, but our control would be something
like the following:
Phase   Stimulus (CS)    Response         Reinforcer (UCS)
1       red light        pelvic thrust    drink
2       blue light       knees in tight   green light
3       yellow light     wings on hips    white light
4       orange light     step to right    purple light
5       white light      jump to left     green light
In this control experiment, only the first step ought to be acquired. The secondary reinforcer
from the first phase is never used in the later phases, and none of these is ever paired with a
primary reinforcer. Indeed, based on work in discrimination training, we might predict that the
other colors would become somewhat inhibitory (since they tend to signal absence of UCS).
But in real-world chains, of course, such individual discriminative cues and reinforcers do
not always appear to be present (although you could argue that they are present in a dance in
terms of the auditory stimuli represented in the music!). And we can solve that mystery by going
back and analyzing responses as having stimulus components. Responses are also being
associated with a UCS, so that doing a response can act as a secondary reinforcer! Thus,
jumping to the left may be reinforced by stepping to the right, eliminating the need for all of
these intervening light stimuli. If you thought our pigeon was caught in a very awkward situation,
you were right: By considering the stimulus components of a response, we find a way to make
the concept of response chains a lot more realistic, and their execution smoother.
Interference
Because of the nature of instrumental conditioning, it is possible to have several different
responses associated with the same stimulus, or the same responses and stimuli associated with
several different outcomes. Under those conditions, sometimes complex patterns of results may
be found. In particular, certain combinations of events appear to result in interference. We will
consider two sorts of interference briefly in this section, and then revisit the issue in later
chapters. The two involve response competition and approach-avoidance conflicts.
Response competition involves one response interfering with or competing with another. In
fact, response competition is one of the theories regarding the process of extinction (see the
chapter on partial reinforcement and extinction). The basic idea here is that the animal is being
cued to perform incompatible responses.
An excellent example of response competition occurs in an experiment by Fowler and
Miller. They trained rats to run to a goal box. During extinction, all of the rats were shocked on
entering the goal box, but half of them were shocked on their front paws, and the other half were
shocked on their rear paws. The animals shocked on their front paws jerked back, whereas the
animals shocked on their rear paws jerked forwards. Moving forward is a response compatible
with running into the goal box, but moving backwards is an incompatible response. Despite the
fact that both groups received shock or punishment for entering the goal box, the front-paws
group extinguished more rapidly. The new response caused by the shock in this case interfered
with the old response.
Other examples of response competition come from work with humans in the verbal
learning paradigm. Here, subjects are often asked to learn a list of word pairs, and tested on
how successful they are at recalling the second word when presented with the first as a retrieval
cue. So, if you studied a pair such as SHORT-LAKE, the experimenter might say SHORT, and
you would need to reply with LAKE. As you may gather, we can identify the first word of a pair
as the stimulus term, and the second as the response term. Numerous studies show interference
when we ask people to learn several lists in which the same stimulus words are present, but there
are different response words. Response competition will certainly not turn out to be the sole
explanation of these findings (see, for example, Melton & Irwin, and Postman's review). But it
assuredly handles some of what is going on, as we find intrusions of the earlier responses during
learning of the later responses.
As for approach-avoidance conflicts, we may ask what happens when a response is
associated with both an aversive and an appetitive outcome. That situation happens more
frequently than you might think. In discrimination training, for example, we try to alter the
excitatory generalization to the S- by associating it with lack of a reward. But, that means that the
inhibition building up for S- may also generalize to the S+, canceling it out, to some extent (one
of the explanations for peak shift). Thus, discrimination training involves two stimuli, each of
which may be claimed to have some excitatory and some inhibitory components.
What ought to happen should thus reflect, in some sense, the summation of the excitation
and inhibition, as was the case in use of the summation test in classical conditioning (see the
discussion of algebraic summation theory in the chapter on attention and categorization for
more details).
As an interesting footnote, Dollard and Miller tried to combine aspects of Freudian
psychoanalytic theory and learning theory to describe some of the conflicts humans might be
subject to. They identified several different types of conflicts, but one they termed an approach-avoidance conflict. In this situation, there is a tradeoff between the positive and negative
components of making a response. As an example, we might take a rat running down an
alleyway to obtain some food. Suppose that the goal box is associated both with food and with a
shock. What will the rat do? One analysis of this situation (culled from several different studies)
appears in Figure 2.
In this figure, we look at some measure of strength of a response as a function of how far
from the goal the animal is. There are actually two opposed tendencies graphed in this figure:
The tendency to approach the goal for a reinforcer, and the tendency to go away from the goal
due to punishment. The solid line represents a typical, idealized avoidance gradient: The closer
an animal is to an aversive or noxious stimulus, the more vigorously it leaves. As it gets further
and further away, its response (running, for example) gets weaker and weaker. In contrast, the
approach gradient graphed by the dotted line demonstrates the reverse finding: the closer to a
desired reinforcement, the faster or more vigorously the animal approaches it.
Several additional features of Figure 2 are important. One is that the avoidance gradient is
typically steeper than the gradient of approach. And the other is that in this figure, the lines
cross. And because they do, we obtain an approach-avoidance conflict, with the spot at which the
lines cross representing the conflict
point.
If you look to the right of the
conflict point, you will see that
approach is stronger than avoidance.
Thus, right of this spot, the animal
should tend to head towards the goal.
But once it passes the conflict point
and approaches, then avoidance
becomes stronger, driving it back. So,
the model predicts that an animal will
waver around the conflict point,
developing large amounts of
frustration in the process. There will
be some tendency here for the animal
to simply escape this situation, if that is at all an option.
Finally, an increase or decrease in the amount of reinforcement or punishment in this model
will essentially move the relevant gradient up or down. Increasing the punisher, for example,
should move the solid line up, and that will result in the conflict point (the spot at which the lines
cross) moving further away from the goal. In like fashion, increasing reinforcement moves the
approach gradient up, causing the spot at which the lines cross to occur closer to the goal.
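If we idealize the two gradients as straight lines, the conflict point is simply where those lines cross, and the effects just described can be read off from the algebra. The slopes and intercepts in the sketch below are invented numbers; the only assumptions built into it are the ones taken from the figure, namely that avoidance starts higher at the goal and falls off more steeply than approach.

# Idealized straight-line gradients in arbitrary units. Response strength is
# a function of distance from the goal; the avoidance gradient starts higher
# at the goal and is steeper than the approach gradient.

def conflict_point(approach_at_goal, approach_slope, avoid_at_goal, avoid_slope):
    """Distance from the goal at which approach and avoidance strengths are equal.
    Solves: approach_at_goal - approach_slope*d = avoid_at_goal - avoid_slope*d."""
    return (avoid_at_goal - approach_at_goal) / (avoid_slope - approach_slope)

baseline = conflict_point(80, 2, 100, 5)
stronger_punisher = conflict_point(80, 2, 130, 5)   # avoidance gradient raised
stronger_reward = conflict_point(90, 2, 100, 5)     # approach gradient raised

print(baseline)            # about 6.7 units from the goal
print(stronger_punisher)   # about 16.7: conflict point moves farther from the goal
print(stronger_reward)     # about 3.3: conflict point moves closer to the goal

Raising the avoidance intercept (a stronger punisher) pushes the crossing point farther from the goal; raising the approach intercept (a stronger reinforcer) pulls it closer, which is exactly the pattern described above.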
This is by no means all that Dollard and Miller have to say about what approach-avoidance
conflicts entail. You may be interested in reading their book on personality and psychotherapy
for more information.
The Partial Reinforcement Effect
A final basic phenomenon we will discuss in this section involves the partial
reinforcement effect. Skinner and his colleagues studied various aspects of reinforcement-based
learning under what they claimed were 'real-world' conditions. Specifically, they asked what
would happen to learning when reinforcers appeared on only some trials. The findings were quite
interesting. Namely, with partial reinforcements, there was greater resistance to extinction.
We will look at partial reinforcement in more detail in a later chapter. For now, let me
mention that a number of variables interact with partial reinforcement. In particular, amount of
reinforcement will prove to play a pivotal role. In studies such as those conducted by Roberts,
we find that animals that have been continuously reinforced will display increased
resistance to extinction with small reinforcers. Roberts looked at extinction of alleyway
running in rats whose reinforcers ranged from 1 to 25 food pellets. Over 36 extinction trials,
there was little evidence of a change in the 1-pellet group, whereas the 25-pellet group was
performing at well less than half their rate prior to extinction. However, this effect appears to
depend on how much training an animal has had during the acquisition phase; it assumes a fairly
substantial amount of acquisition (see D'Amato). In contrast, animals that have been partially
reinforced will display increased resistance with large reinforcers (see, for example, Ratliff
and Ratliff). The first result, in particular, strikes many people as counterintuitive on first coming
across it. After all, shouldn't large reinforcers result in better learning, and shouldn't better
learning be longer-lasting learning?
There are in fact a number of explanations for the partial reinforcement effect. For the
moment, however, I will mention one to help you remember the results. This is Amsel's
Frustration Hypothesis, cited in the first chapter. According to Amsel, continuously reinforced
animals will experience more frustration when they lose a large reinforcer. And since frustration
acts as an aversive stimulus, these animals will avoid whatever it is that is causing the frustration.
So, with more frustration, there is faster learning of avoidance. But in contrast, animals in partial
reinforcement are being trained to tolerate frustration. With larger reinforcers, they are trained
specifically to handle greater and greater frustration. Thus, when they are placed in the highly
frustrating situation of extinction (in which the expected reward fails to materialize), they will be
better able to adapt to this situation.
Number of trials during acquisition also has different effects for continuously and partially
reinforced groups. For continuously reinforced groups, more training results in lesser resistance,
whereas for partially reinforced groups, more training results in greater resistance. The increased
training in partially reinforced groups translates into better training to tolerate frustration, but the
increased training in continuously reinforced groups translates into higher expectation of a
reward (and thus, a ruder awakening when it is no longer there).
In short, extinction is frustrating, because expected rewards don't occur. How much
resistance to extinction you will have will thus depend partly on how much frustration you
experience during extinction, and on how much frustration you have been trained to
tolerate during acquisition. The amount of frustration experienced during extinction depends
on the size of the reinforcer you expected. (Not getting an expected $50 is a lot more frustrating
than not getting an expected $1.) In addition, continuously reinforced animals have not been
trained to tolerate any frustration whatsoever.
C. Some Basic Paradigms
We have already introduced two general paradigms involving acquisition and extinction. In
acquisition, an outcome is typically paired with making a response in the presence of a stimulus;
in extinction, that pairing typically ceases. Within this broad framework (particularly with
respect to acquisition), we may distinguish several additional paradigms.
In appetitive or approach learning, the animal makes a response that results in a desired
reward. This is the type of learning involving reinforcement that we have implicitly and
explicitly discussed so far. But it is not the only paradigm based on reinforcement. Another that
deserves particular note is omission training, in which an animal has to suppress or withhold a
response in order to get its reward. Sheffield, for example, trained dogs to salivate in the
presence of a tone associated with food, and then shifted them to omission training. In this latter
phase, the dogs had to avoid salivating to the tone for several seconds to get the food. Omission
training is typically difficult at first, and displays a relatively slow learning curve. However,
there are several studies suggesting that in the long run, it will be as effective as extinction in
decreasing the frequency of a response. Omission training is sometimes referred to as negative
punishment to indicate that making the response is associated with removal of a reinforcer
(which thus acts as a punishment).
Another paradigm based on reinforcement is escape learning. In escape learning, the
animal learns a response that gets it away from punishment, either by turning off the punisher, or
by allowing the animal to leave the area where the punishment was administered. Escape
learning is closely associated with another paradigm, avoidance learning. In avoidance learning,
the punishment is intermittent rather than continuous. If the animal makes the proper response
before the punishment comes on, it will succeed in canceling that punishment. In avoidance
learning, animals typically start out by escaping the aversive stimulation (making a response
during the punishment that stops it), and then come to make the response early enough that they
subsequently successfully avoid the aversive stimulation.
Punishment training (or aversive learning), of course, involves the administration of an
unpleasant, aversive outcome following a response. Thus, punishment training, omission
training, and extinction all have in common reducing the level of a given response, whereas
appetitive learning, escape learning, and avoidance learning attempt to increase response level.
There are some obvious interplays in paradigms here, depending on which response you focus
on. Often, aspects of several different paradigms combine: One response may be punished while
another is reinforced.
We may also distinguish between signaled and unsignaled learning. A discrete, distinct
stimulus is present in signaled learning, but not in unsignaled learning. Thus, for example, in
unsignaled avoidance, shocks can occur at regular intervals that could be avoided if the animal
responds shortly before the shock's onset. There is no physical stimulus signaling the shock; the
animal in this case needs to rely on an internal sense of time. In unsignaled conditions, features
such as time or the contextual cues presumably act as stimuli.
Another paradigm, transfer training, will prove important, especially when we focus on
discrimination in a later chapter. In transfer training, we look at the effects of learning one task
on another. Transfer might be nonexistent (zero), positive (facilitation: the learning is faster),
or negative (inhibition: there is interference). In addition, transfer effects might be proactive (in
which we look at the effect of an earlier task on the learning or performance of a later task), or
retroactive (in which we see how the later task influences performance on the earlier one).
A final paradigm involves shaping. Normally, approach learning applies to responses that
are not especially frequent to start with, since we want to track an increase in frequency as one
of our measures of learning. Thus, we find ourselves in the following situation: We sit in the lab,
watching our animal subject, waiting for it to make the desired response so that we can
administer the reinforcer.
Such a procedure will obviously be inefficient. In some cases (such as a pig rolling a coin),
the wait may be very long indeed! Hence, a technology has developed that involves increasing
the probability of having the animal emit that response so that we can then train it further
through reinforcement. This technology, called shaping, requires reinforcing successive
approximations to the desired response.
Shaping works as follows. We start out by identifying a high-frequency component of the
response we want, and we reinforce that. So, if we want our rat to press a bar on the left side of
an experimental chamber, then a high-frequency component would involve having the rat be in
the left half of the chamber. While it is exploring its environment, we reinforce for crossing over
to the left. Then, as it increases its time on the left, we drop the reinforcer. That will cause the
behavior to become more variable. We await some response yet closer to what we want to train
(such as being near the bar), and when that occurs we reintroduce the reinforcer. And then, of
course, we cycle the process through again in order to obtain yet a closer approximation (such as
touching the bar). Shaping is a very powerful technique, not only because of its ability to 'coax'
low frequency responses out of an animal, but also -- and especially -- because of its ability to
mold a response that is not normally part of the animal's repertoire! Thus, by combining
shaping and chaining, instrumental conditioning allows us to train totally new responses, rather
than just transfer stimulus control of an old response to a new stimulus.
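To make the logic of the shaping procedure concrete, here is a minimal Python sketch. It assumes a hypothetical similarity() score between 0 and 1 for how close an observed bit of behavior is to the target response, plus callables for observing behavior and delivering a reinforcer; it is an illustration of the successive-approximation loop described above, not a model of any actual training session.

def shape(observe_behavior, deliver_reinforcer, similarity,
          criteria=(0.25, 0.5, 0.75, 1.0)):
    """Reinforce successively closer approximations to a target response."""
    for criterion in criteria:             # each step demands a closer approximation
        while True:
            behavior = observe_behavior()  # watch the animal's ongoing behavior
            if similarity(behavior) >= criterion:
                deliver_reinforcer()       # reinforce the current approximation
                break                      # then raise the criterion
            # while reinforcement is withheld, behavior becomes more variable,
            # eventually producing a closer approximation to reinforce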
D. A Note About Terminology: Operant vs. Instrumental Learning
Finally, we ought to note a distinction that is sometimes made between what is termed
instrumental conditioning, and operant conditioning. In instrumental conditioning, the
emphasis is on a discrete trial, a situation in which there is a clear starting point and a clear
terminus. We may measure how long it takes the animal to make the response during the trial, or
we may measure the relative probability of the animal's success. So, to take Thorndike's
puzzlebox apparatus, the start of the trial occurs when a cat is placed in the puzzlebox, and it
ends when the cat has made the escape response. How long this takes is what we are interested
in. Similarly, in maze learning, the trial starts with the animal being placed in the start box, and
ends when the animal has found its way to the goal box. (Or alternatively, we can specify the
trial as what happens in some amount of time from when we have placed the animal in the start
box. Where has it gotten to in, say, an allowed 30 seconds? Learning here will show up as
increased probability of having made the correct response within the time frame of the trial.) In a
third example, choice discrimination, the trial starts when the animal is exposed to two stimuli,
and ends when the animal makes a response relevant to one of them. Our interest in this situation
typically involves whether the animal has chosen the correct stimulus.
There are no discrete trials in operant conditioning, on the other hand. A standard apparatus
for operant conditioning involves a Skinner Box, a chamber with something that can be
manipulated (a key to peck; a bar to press; a lever to move); various discriminative stimuli that
may be turned on or off (lights; noises); and means to automatically administer reinforcements or
punishments (food or shock dispensers connected to the bar, for example). Particularly with
respect to such simple responses as pressing a bar to obtain food, the interest will be more in how
rapidly those responses are executed. We don't stop the animal between responses in order to set
up another trial. Rather, we typically look at characteristics of response rate over time.
This distinction between discrete and continuous trials might also be expressed in a slightly
different manner. On a discrete trial, you can succeed only once (or perhaps 8 times if we use an
apparatus like Olton's radial maze, discussed in the previous chapter), whereas on continuous
trials you have the opportunity to obtain virtually unlimited reinforcements. So, the difference
between instrumental and operant conditioning in part involves whether there is a constraint on
how many reinforcing events an animal can seek out. That having been said, I will generally treat
these as equivalent.
II. Basic Requirements For Effective Conditioning
Many of the principles for effective conditioning will prove familiar from our discussion of
classical conditioning. Thus, number of pairings of a response with an outcome will prove
important in characterizing how quickly we see changes in characteristics of the response (its
strength, its amplitude, its latency, its probability, etc.) that signal evidence of learning. By the
same token, number of times a response fails to be followed by an outcome will be important,
not only in describing the course of extinction, but also (as in classical conditioning) in
describing the contingency between a response and an outcome. Below, we will briefly consider
additional principles having to do with temporal contiguity, outcome characteristics, and
contingency.
Before we do, however, we ought to note several features that make the situation a bit more
interesting. First, of course, is the issue of partial reinforcement. We will delay fuller discussion
of that to a later chapter. Second, there is the fact that in operant conditioning, an animal is
effectively in charge of whether to emit the response or withhold it. Obviously, researchers in
classical conditioning may easily arrange pairings of the CS and the UCS to achieve any desired
contingency. But in operant conditioning, controlling how many times the reinforcer occurs
when a response is emitted versus when a response is not emitted is clearly trickier. Third,
because of the presence of three events (stimulus, response, outcome), there are three potential
associations to worry about (S-R, S-O, and R-O). That means that we can ask about temporal
contiguity (or contingency) not just of response and outcome, but also of stimulus and response,
and of stimulus and outcome. The situation thus becomes significantly more complex.
Not all theorists believe that all three associations form. Thorndike, to remind you, accepted
only an S-R association, as did Watson. But, researchers such as Rescorla have made a very
strong case that the other associations are there, as well. Thus, Colwill and Rescorla used the
devaluation paradigm on a reinforcer after the response had been acquired. If a reinforcer's only
function is to stamp in the association (as claimed by Thorndike), devaluing the reinforcer ought
not to influence the response the animal gives to the stimulus. In the abstract, the design for this
type of experiment would be similar to the following:
Group          Acquisition Phase   Phase 2     Test Phase
experimental   R to S for RF       RF & LiCl   R to S?
control        R to S for RF       (Nothing)   R to S?
However, Colwill and Rescorla found a much less vigorous response following devaluation. Devaluation, then, must exert its effect through an R-O or S-O association.
What about an S-O association? From our discussion of chaining and higher-order
conditioning, you already know that this association forms. Further evidence of this comes from
the Rescorla study mentioned in the previous chapter, in which a stimulus that caused higher
levels of responding during extinction became inhibitory, as measured by the summation and
retardation tests. We had earlier read about a classical conditioning version of that study, but
Rescorla also ran the same study with an instrumental conditioning set-up, and obtained the same
results. Because the S-R association in these types of experiments is rapidly relearned
following extinction while the S remains inhibitory, Rescorla claims that the inhibition
doesn't involve the S-R link! And as a final example, consider a classic study by Seward and
Levy on a phenomenon termed latent extinction. In their study, two groups of rats learned to
run to a goal box for a reward. Following acquisition, one group had the experience of being
placed directly in the goal without the reward. Then, both groups were put through extinction:
Group          Acquisition   Phase 2              Phase 3
experimental   run for RF    put in goal, no RF   extinction
control        run for RF    (Nothing)            extinction
In this experiment, the control group extinguished more slowly than the experimental group.
Presumably, the stimulus elements of the goal box had now become associated with some
inhibition for the experimental group, making their running to it less desirable.
Below, given the theoretical importance of reinforcement in operant conditioning, we will
concentrate on principles having to do with its presence relative to the response.
A. Temporal Parameters
A principle of fairly long standing (and which forms a part of many behavior-level theories)
has been that there must be temporal contiguity between the response and the outcome. In fact,
many studies report what is called a gradient of reinforcement in approach or appetitive
learning: The longer the delay between the response and the reinforcer, the weaker the learning.
A well-known experiment demonstrating the gradient of reinforcement was conducted by
Grice. Grice used a choice discrimination paradigm in which rats had to enter one of two rooms
or chambers. The rooms were different colors (black or white), and the rat was reinforced for
entering one of these but not the other. However, there were several groups of rats who differed
in terms of how long it took to get the reinforcer after choosing the correct color. All rats were
immediately placed in a neutral-color room where the reinforcer was given, but one group
received their reinforcer immediately, while others had to wait. The group with the longest wait
was reinforced after 10 sec. Essentially, Grice found a very rapid fall-off of learning. After about
1 sec, there was no evidence that the discrimination had been learned.
Depending on the response and the circumstances (see the next major section below), longer
delays in which learning still occurs have been reported. As an example, consider a study by
Capaldi, in which two groups of rats were trained to run to a goal box. One group was rewarded
as soon as it reached the goal box, but the other group had to wait 10 sec for its reward. Both
indeed learned to run to the goal box, but the running speed (and the initial velocity out of the
start box) was significantly depressed for the 10-sec delay group. Thus, their learning seems to
have been affected by the delay.
Sometimes, a delay of 1 or 5 sec seems to result in no learning, and at other times (as in
Capaldi's experiment), longer delays will be tolerated. Generally, however, the speed of learning
as measured by vigor or probability of the response (or number of trials to acquire it) will be
influenced by the response-reinforcer delay. Extrapolating from Skinner's claims, we may
present one theory for why this is so: Namely, as the delay period increases, the odds increase
that the animal will perform some other piece of behavior before the reinforcer is given. The
association may then form between that response and the outcome, rather than between the
effective response and the reward.
According to Skinner, temporal contiguity by itself is all that is needed for the formation of
an association. Skinner cites the example of superstitious behavior to demonstrate this. In
superstitious behavior, animals are reinforced at random, and need perform no response
whatsoever. Yet, Skinner in one of his studies reported that pigeons in this circumstance were
displaying apparently learned behaviors such as head shaking. He claimed that the reinforcer by
dumb luck must have been presented just after the pigeon had tossed its head, so that head
tossing was strengthened as a response in this situation. The increased possibility of acquiring
superstitious behavior that interferes with other learning might thus partly explain why temporal
contiguity is important.
From the perspective of a more cognitive, representational-level approach, we may posit a
similar idea expressed in very different terms. Given the presence of a reinforcer, the animal's
task is to determine which of a number of previous responses might be the one that worked. As
the number of responses increases, the task becomes more difficult. Moreover, because causes
normally result in relatively immediate effects (excepting, of course, situations such as illness or
food poisoning: note the relevance to the taste aversions paradigm), organisms may be
genetically predisposed to connect recent behavior with the current outcome (a principle of
causal recency).
A similar principle, of course, applies to aversive situations. Fowler and Trapold, in an experiment on escape learning, varied how long it took for shock to turn off once the rat had run
to a goal box. The best learning/performance occurred for a group of rats whose shock was
turned off as soon as they entered the goal box. Animals that had to wait a bit for shock to turn
off did worse.
Finally, Boe and Church found that the effectiveness of punishment decreased with delay.
Unless punishment is administered very shortly after an animal's response, it will not prove very
effective. Dog owners who come home and punish puppies for earlier 'accidents' are most likely
to be associating themselves with the aversive outcome, and training fear of the owner and the
spot where the dog was punished. That is certainly not the same thing as housebreaking a pet.
B. Outcome Strength
Two types of outcomes have generally been discussed: reinforcement and punishment.
Each, however, may be further subdivided into two sorts, positive and negative. Positive
outcomes generally involve the presentation of a stimulus that changes a relatively neutral state
into the state specified by the outcome. Thus, positive reinforcement (generally referred to
without the use of the word "positive") involves the provision of something desirable that
normally results in appetitive behavior, and positive punishment (also typically referred to
without the modifying adjective) involves the provision of something undesirable that normally
results in aversive behavior.
The other two types of outcomes are negative reinforcement and negative punishment. It
will help you to keep these straight by recalling that anything that is labeled a reinforcer, positive
or negative, should operate by the law of reinforcement: It ought to increase the response that it
follows. Similarly, anything labeled a punisher, positive or negative, ought to work by the law of
punishment: It ought to decrease the response that it follows. That having been said, a negative
reinforcer takes on its reinforcing properties because some response the animal makes results in
removal of aversive stimulation. Negative reinforcement, of course, is the basis for escape
learning. And in similar fashion, a negative punisher acquires its punishing properties by virtue
of the fact that the animal makes a response leading to removal of a reward or privilege. Thus,
positive outcomes involve the presentation of stimulus events, and negative outcomes involve the
removal of certain stimulus events.
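As a compact summary of this 2 x 2 classification, here is a small Python sketch: whether a stimulus is presented or removed, crossed with whether that stimulus is appetitive or aversive. Only the four outcome names come from the discussion above; the key labels and dictionary itself are illustrative.

# Map (presented/removed, appetitive/aversive) onto the four outcome types.
OUTCOME_TYPES = {
    ("presented", "appetitive"): "positive reinforcement -- response increases",
    ("removed",   "aversive"):   "negative reinforcement -- response increases",
    ("presented", "aversive"):   "positive punishment -- response decreases",
    ("removed",   "appetitive"): "negative punishment -- response decreases",
}

print(OUTCOME_TYPES[("removed", "aversive")])  # the outcome type behind escape learning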
With respect to each, there appears to be a general principle that higher levels of strength
result in stronger or faster or more vigorous responding, consistent with a claim that outcome
strength influences speed of learning. Concerning positive reinforcement, for example, Kraeling
taught three groups of rats to run an alleyway for a drink reinforcement that varied in the amount
of sucrose concentration (recall that rats have a sweet tooth, so higher sucrose concentrations act
as more effective reinforcers). Each group was given one trial per day for 99 days. At the end,
they had each reached asymptote as measured by how fast they ran. However, the asymptotes
differed for the three groups: The group with the highest sucrose concentration had the fastest
asymptotic running speed whereas the group with the lowest concentration had the slowest speed.
Crespi found similar results (see below, Figure 3): Rats given large amounts of reinforcement on
each trial (64 pellets) showed faster running than rats given small amounts of reinforcement (4
pellets).
An experiment by Trapold and Fowler can illustrate the operation of this principle with
amount of negative reinforcement. They conducted an experiment in which rats had to run to
escape shock. Five groups of animals were given 20 trials of escape learning. The groups
differed in the intensity of the shock (varying from 120 volts up to 400 volts). Faster acquisition
of the escape response occurred with the larger shocks.
Finally, a classic experiment by Boe and Church may be used to illustrate the principle
with positive punishment. Boe and Church trained four groups of rats to press a bar for a reward,
and put each through extinction. Prior to extinction, however, three of these groups were put
through punishment training in which, for 15 minutes, a bar press gave the animal a shock. The
groups differed in intensity of the shock (35, 75, or 220 volts). Thus, the design was as follows:
Group   Acquisition Phase       Punishment          Extinction Phase
1       RF for barpress (bp)    (None)              No RF for bp
2       RF for barpress         bp --> 35 Volts     No RF for bp
3       RF for barpress         bp --> 75 Volts     No RF for bp
4       RF for barpress         bp --> 220 Volts    No RF for bp
The question, of course, was how punishment of bar pressing would help speed up removal of
that response. Over 9 sessions of extinction training, the group with the weak shock proved not
all that different from the group with no punishment: Each engaged in a substantial number of
responses during the course of extinction. However, quite different results occurred for the 75
and 220 volt groups: They showed a much lower level of responding during extinction. Indeed,
the 220 volt group hardly responded at all! Thus, effectiveness of punishment in suppressing
behavior will depend in part on severity of punishment. As the contrast between the control
group and the 35 volt group demonstrates, weak punishers may have little permanent effect
compared to extinction.
Of course, there are other variables that will influence the operation of an outcome. As you
know from an earlier discussion, aversive stimulation can have the paradoxical effect of
increasing the response it is meant to stamp out (vicious circle behavior). Also, the same amount
of an outcome packaged in different ways may effectively act as different amounts. Thus, for
example, Campbell, Batsche, and Batsche found that a reinforcer divided into smaller amounts
worked better than a reinforcer presented as one large amount. And to remind you, those
manipulations that seem to promote higher asymptotic levels during acquisition (in continuous
reinforcement) generally also promote the fastest extinction.
There are also contrasts that may occur when an organism experiences several different
levels of a reinforcement. An experiment by Crespi will illustrate these. Crespi trained rats to
run to a goal box for food (the apparatus here involved a straight alleyway in which rats are
released at one end of a corridor or tunnel, and have to run to the other end). One group was
given a large reward, a second group was given a medium reward, and a third group was given a
small reward. In each case, their running speed was measured. Then, the large-reward and small-reward groups were shifted to the medium reward. Thus, the design was something like the
following:
Group   Phase 1 (acquisition)   Phase 2 (maintenance)
1       64-pellet reward        16-pellet reward
2       16-pellet reward        16-pellet reward
3       4-pellet reward         16-pellet reward
You will note that I have labeled the two phases here acquisition and maintenance. The
rats in the acquisition phase received 20 learning trials, and their average running speed at the
end of training was measured. In a maintenance phase, on the other hand, we look at
performance after learning has occurred (that is, presumably after the association has formed).
In Crespi's study, there were 8 maintenance trials. Figure 3 presents the results after learning, and
on the eighth maintenance trial. As you can see from this figure, Group 2 showed some slight
increase; not surprising, since additional reinforced pairings ought to result in a stronger
association according to most standard theories of instrumental conditioning. But notice what
happened to the other two groups: A shift to a much smaller reward caused a negative contrast
by which running speed slowed down considerably, whereas a shift to a much larger reward
resulted in a corresponding increase (a positive contrast).
It is important to note that all groups received the same amount of learning in Phase 2 (in terms of number of trials and what the reinforcer was). Thus, we might have expected each group to display the same relative improvement. But that did not happen. Because these contrasts occurred during a post-acquisition period involving identical additional training, they are generally interpreted as demonstrating an effect not on learning (or acquisition), but rather on performance. That argument is particularly compelling for Group 1: They continued to receive additional reinforced training during Phase 2, yet they apparently got worse! Contrasts of this sort are termed incentive contrasts.
Such contrasts should suggest that perhaps outcome amount is related more to an animal's
motivation to perform a response than whether that response gets learned in the first place.
Contrast effects may occur under a variety of conditions (see, for example, Flaherty's
review). They do tend to be temporary, however. Flaherty suggests, in particular, that negative
contrasts may reflect frustration at obtaining the less desired reward. Consistent with this,
tranquilized animals generally do not exhibit contrasts.
C. Contingency
We have already briefly alluded to the difficulties inherent in controlling contingency
between a response and an outcome in operant conditioning. We have also briefly talked about
the fact that Skinner (and Watson and Thorndike) claimed that contiguity was all that was needed
for learning. As in classical conditioning, however, there are claims that contingency is also
required for learning to occur. We will close out this section by considering the evidence that
contingency influences learning and performance.
To start, let us adapt the notion of a contingency space discussed in the previous chapter.
Previously, we had looked at the relative probabilities of the UCS when the CS was present or
absent. Now, we look at the relative probability of an outcome when the animal makes a
response (Probability 1), or withholds it (i.e., does not make the response: Probability 2).
Essentially, analogous to what happens in classical conditioning, many theorists will claim that
when Probability 1 exceeds Probability 2, there ought to be excitation: The animal is more likely
to make the response because the odds of getting a reward increase. (We are assuming reward outcomes rather than punishers here!) In contrast, when Probability 1 is below Probability 2, then
it makes more sense for the animal not to respond: The response ought to be inhibited. Finally,
when the two probabilities are equal (so that there is zero contingency), we would expect to find
no evidence of acquisition.
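To make these predictions concrete, here is a minimal Python sketch of the operant contingency space just described, assuming reward outcomes; the function name and labels are illustrative, not standard terminology.

def predicted_effect(p_outcome_given_response: float,
                     p_outcome_given_no_response: float) -> str:
    """Compare the two probabilities that define the operant contingency space."""
    if p_outcome_given_response > p_outcome_given_no_response:
        return "positive contingency: responding should be excited (increase)"
    if p_outcome_given_response < p_outcome_given_no_response:
        return "negative contingency: responding should be inhibited"
    return "zero contingency: no acquisition expected"

print(predicted_effect(0.5, 0.1))  # excitation
print(predicted_effect(0.3, 0.3))  # zero contingency, as in Hammond's study below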
How well does this notion of contingency hold up? One interesting study that attempts to
assess this notion of an operant contingency space was performed by Hammond. Hammond set
up different contingencies between bar pressing and a reinforcer for several groups of rats.
Whenever the two probabilities were equal, the rats failed to display any change in bar pressing.
In these circumstances, there were always some pairings of the response with the outcome that
ought to have resulted in the association forming, if Skinner's claims made in the section on
superstitious behavior were correct. Indeed, we might regard such a circumstance as similar to a
partial reinforcement schedule. Nevertheless, no evidence of learning occurred. In contrast, rats
for whom Probability 1 exceeded Probability 2 did increase their bar pressing.
Moreover, in a follow-up, Hammond found that rats that had already acquired bar pressing
stopped when the two probabilities were made equal. This pattern of findings certainly goes
against Skinner's claims that contiguity is sufficient. Instead, it strongly suggests an exquisite
sensitivity to contingency on the rats' part.
Why, then, do we obtain superstitious behavior? According to many critics, Skinner has
probably failed to consider the fact that reinforcers such as food also act as UCSs that elicit
certain responses, and that the contextual cues act as a CS that gets conditioned to the UCS. On
such an account, superstitious behavior really isn't: It is fairly straight-forward classical
conditioning of the sort we have been discussing in the last two chapters.
An example may serve to drive this point home. One claimed example of superstitious
behavior has been autoshaping. In autoshaping with pigeons, food is presented after a key lights up (Brown & Jenkins, 1968). After a while, pigeons peck the key, although they need not do so to obtain the food. Autoshaping is a very useful tool for training pigeons, because it seems that the pigeons will train themselves and save you the work. But describing it that way sidesteps an analysis of this situation as classical conditioning in which the lighted key serves as the CS. A nuanced
discussion of how classical conditioning can partly explain some of the response characteristics
(along with some of the problems for a simple stimulus substitution view of classical conditioning)
may be found in Staddon and Simmelhag (1971). Consistent with their discussion, Jenkins
and Moore (1973), for example, used either food pellets or drink as the reinforcer for different
groups of pigeons in an autoshaping paradigm. Pigeons pecked at solid food with a closed beak,
but opened the beak slightly to drink the liquid. Thus, what appears to be contiguity without contingency in operant conditioning is sometimes a classical conditioning situation in which there is both contiguity and contingency (but the contingency involves the CS and the UCS, rather than a response and an outcome). On the other hand, Staddon and Simmelhag do point out that such terminal responses differ from the interim responses that can be produced before them, and that
do seem to better fit with Skinner's notion of superstitious behavior.
Another experiment done with pigeons was performed by Killeen. The pigeons faced a
horizontal array of 3 keys, the middle one of which was lit. They were trained to peck at this
middle key. About 5% of the time, one of their pecks at the key would cause its light to go off,
and the lights of the two surrounding keys to come on. Another 5% of the time, a computer
would automatically turn off the center key while turning on the others. The question Killeen
asked was whether the pigeon was aware that it was responsible for this change.
How can we assess a pigeon's knowledge of the circumstances? Killeen reasoned that a
pigeon aware of whether its action had turned off the light would easily be able to learn another
response that would depend on that action. So, the experiment was arranged so that the pigeon
would get rewarded for pecking at one of the surrounding lights when it was responsible for
turning them on, but would have to peck at the other light when it was the computer that had
turned them on. That would be a difficult task to accomplish without sensitivity to contingency,
but Killeen's pigeons came through. They did indeed show they could discriminate events caused
by their own behavior from events caused by some other external cause.
Such sensitivity is not reported in all studies, however. In a quite clever study by Thomas,
rats obtained random free reinforcements, but could also make a response (bar pressing) that
would give them a reinforcement on demand. The catch was that the number of random reinforcers would drop somewhat after the rat made that response. In other words, more reinforcers
were available if the rat did not respond. In contrast to the studies above, the rats in this
experiment actually did learn to bar press, which resulted in their getting less food!
One more study on reinforcement-based contingency may be mentioned. This study, done
by Watson (not the same Watson who redefined psychology as behaviorism!), used 3-month-old human infants. Watson set up a contingency between their turning their heads and a
reinforcer of a mobile above their cribs turning for several seconds. A second group had the
same experience of mobile reinforcer, but the second group's mobile movements had nothing to
do with any of their responses. Although both groups initially displayed a great deal of interest
and pleasure in the mobile when it started moving, only the contingent group maintained this
reaction. Thus, Watson argued that the contingent infants had some sense of mastery over the
mobile that the non-contingent infants did not, some awareness that the mobile's movements
were due to their own actions. They were sensitive to contingency.
Contingency will also prove important in escape and avoidance learning. In particular,
Seligman and Maier have studied a phenomenon termed learned helplessness. The
experimental set-up for learned helplessness typically involves something like the following
design:
Group          Phase 1                        Phase 2
Experimental   inescapable, noncontingent P   escape learning
Control        (Nothing)                      escape learning
They find that unavoidable, non-contingent punishment results in the animals in the
Experimental Group not learning to escape, once a contingency is set up between an escape
response and avoidance of the shock. The Control Group, in contrast, readily acquires the
response. According to Seligman's explanation of these results (the cognitive deficit
hypothesis), the animals in the Experimental Group have acquired a mistaken belief. Based on
the randomness of the shocks and their inability to escape them in Phase 1, they have mistakenly
learned that there is no response that will be effective in avoiding or escaping shock. (You may
want to compare this to various explanations for learned irrelevance in classical conditioning.)
Thus, they cease trying to discover an effective response, so that learning is no longer attempted
in Phase 2. Seligman has argued that some similar mechanism in humans may account for certain
episodes of depression.
III. Exceptions & Complex Interactions
We have seen, at least implicitly, an emphasis on reinforcement theories in which a
reinforcer is necessary for appetitive or approach learning (see also Thorndike). Indeed, the
notion of reinforcement is so pervasive that, as you will see in the next chapter, some theorists
have even claimed that extinction depends on there being a reinforcer present during the animal's
extinction training! Although there are a variety of associational or behavior-level theories to
account for instrumental and operant conditioning, we will direct the exceptions below to the
most conservative of these: the claim that learning requires a reinforcer, needs only
temporal contiguity, operates through an automatic process of slowly strengthening an
association, and stamps in specific muscle movements that increase in likelihood in the
presence of the S+.
A. Long-Delay Learning
We will start with the issue of temporal contiguity. As was true in classical conditioning, we
will also find instances of long-delay learning in operant conditioning. Indeed, we might start by
noting that the work of Garcia and Koelling, presented previously as an instance of long-delay
classical conditioning, might just as easily have been labeled instrumental conditioning: An
animal makes a response of drinking saccharin-flavored water, and experiences an outcome of
becoming ill several hours later. As this reinterpretation of the learned taste aversions
paradigm illustrates, it may sometimes be difficult to draw sharp, clear boundaries between
classical and instrumental conditioning (again, see Staddon & Simmelhag, 1971, for a discussion
of some of these issues; see also the last section in this chapter).
Leaving aside the taste aversions work, however, there are other studies suggesting
relatively long delays are possible. Lieberman, Davidson, and Thomas, for example, presented
a series of experiments in which pigeons had to peck the right or left side of a key. They found
that some of their animal subjects were able to learn the response even with delays of 7 sec or
longer (an extraordinarily long delay for a pigeon). The animals that were able to learn were the
ones whose correct response was followed by an unusual event (a marker). In their experiment,
the marker involved the key briefly turning a different color (from white to red on its left half
and green on its right half) after it had been pecked. Other work by Lieberman and his colleagues
has demonstrated that such marking can result in animals learning a discrimination even when
the reinforcement is delayed a full minute. Since the marker involved a non-reinforced stimulus
occurring after the relevant response but well before the reinforcement, the existence of a
marking effect poses a challenge to the idea that temporal contiguity is always necessary for
learning.
Indeed, this study ought to remind you a bit of some of the work we discussed regarding
rehearsal and surprisingness (in particular, the work by Wagner, Rudy, and Whitlow and that of
Hall and Pearce). Surprising events are apt to be rehearsed more. So, a distinctive surprising
event following a response may result in that response being rehearsed for a longer period of
time or becoming more distinct in memory (and thus more likely to be sampled as the cause of
the reinforcement). Lieberman et al.'s take on this (the marking hypothesis) combines elements
of both of the above (1985, p. 622):
[T]he effect of the marker proved to depend critically on what response preceded it: If a correct
response was marked on food trials, then correct responding increased; if an incorrect response
was marked, then incorrect responding increased. The most plausible explanation for this
result, we believe, is that the marker triggered a memory search that focused attention on the
preceding response, thereby increasing the likelihood that it would be remembered. [emphasis
added]
Another example of long-delay learning concerns a study by D'Amato, Sarafin, and
Salmon. They delayed reinforcers by at least 30 minutes in training trials with rats. In one
experiment, animals were placed in one of the two goal boxes of a T-maze (an apparatus that
looks like a T, in which the animal runs from the start box at the base of the T to one of the two
arms at the top), then put in the start box and fed 30 minutes later. Despite this delay, the animals
exhibited differential running to the arm in which they had been placed. Note, in particular, that
no additional events or stimulus cues were present during the wait in the start box that may have
become associated with the food: Once the animal was let out of the start box, the stimulus cues
around the correct arm may have primed the memory of being in that arm.
Finally, note too that the work we have already mentioned by Olton using the 8-arm radial
maze suggests that rats are quite adept at finding food in the maze without retracing their steps,
and without generally revisiting an already-visited arm. As they visit arms at random, they would
appear to maintain some information in short-term memory concerning which responses have
already been made. Given the length of time it takes to visit all 8 arms, this clearly qualifies as a
type of long-delay learning.
Numerous mechanisms for long-delay learning have been proposed. One that plays off of
the notion of secondary reinforcers has been proposed by Spence. This involves the mechanism
of an anticipatory fractional goal response (rg). Note that the response in this instance is
written with a lower-case r rather than an upper-case R. The reason is that the r is treated as one
component or fraction of a more complex response, the goal response (Rg), the animal makes on
reaching the goal and obtaining its reward. There will be numerous fractions or component
responses such as chewing, swallowing, salivating, etc. These get conditioned to the stimulus
cues present shortly before the animal enters the goal box. So, the association involves:
SGoalCues ----------> rg
But since these components or fractions are associated with food, they also become secondary
reinforcers through higher-order conditioning. Thus, the cues present as the animal enters the
goal area act as a reinforcer before the animal has actually received any food on that trial.
In addition, these components also have stimulus properties they are associated with,
although these, of course are unlearned: Chewing, swallowing, etc., all cause certain physical
sensations. So, the association ought also to include these, as follows (the dots indicate an
unlearned association):
SGoalCues ----------> rg ..... sg
And as was true of the response fraction, we indicate the stimulus fraction with a lower-case s.
The rg ..... sg is termed a mediator because it is a unit that may come between a stimulus and a
response in a chain of associations.
In Spence's theory, these mediators become anticipatory; that is, they start being
conditioned to earlier and earlier spots in a sequence. Thus, in a complex maze, the stimulus cues
right before the cues that led to the goal box also take on secondary reinforcing properties. If we
regard the goal cues as being at spot X, we will take the cues before these as being at spot X-1.
Then, through classical conditioning we have:
SX-1 ----------> SGoalCues ----------> rg ..... sg
Or, to represent this by a shortcut:
SX-1 ----------> rg ..... sg
And if at this point the animal needs to make a left turn to get to the area of the goal box, then the
associations at this point involve:
SX-1 ----------> rg ..... sg ----------> RLeft
And of course, we may now carry the procedure through to spot X-2 (the cues present before the
X-1 cues). Thus, fractional goal responses are effectively conditioned throughout a complex
chain in a process that should remind you of our example of the Time Warp. Consistent with this
theory, animals do tend to learn a complex maze backwards (although not all results support the
theory: In particular, researchers have not found evidence of anticipatory drooling at the various
spots or choice points of a maze).
Thus, in theory, the presence of secondary reinforcers may help to bridge what appears to
be a long delay. That would mean that long delays are really much shorter, since we need to
assess the delay in terms of the first reinforcer present after a response. In this case, that first
reinforcer will be a short-delay secondary reinforcer.
While secondary reinforcers and anticipatory fractional goal responses might account for
some of the long-delay results, however, they cannot account for all of them. In particular, the
study by D'Amato et al. would seem difficult to explain, since the animal is being fed in the start
box, so that any secondary reinforcers ought first to be associated with it, rather than the arm the
animal runs to. Similarly, Olton's results would not fall under this mechanism, because
secondary reinforcers ought to become associated with an arm the animal has already visited,
making it more likely the animal will revisit the same arm on the next trial. But that generally
doesn't happen. And that it doesn't happen makes sense, according to foraging theory: An
animal foraging in the wild for food is likely to deplete a food source, so there is adaptive value
in searching for food in different locations. But, searching for food in this manner also requires a
memory system that can keep track of where food was found previously, so as to avoid that spot.
Other factors that make long-delay learning possible include the presence of something to
make the correct response distinct, or the occurrence of little intervening activity between the
effective response and the reward. We have already discussed distinctiveness in terms of
Lieberman's marking hypothesis: A distinct response is likely to be more salient, rehearsed more,
and thus more readily available in memory when the occurrence of a reward triggers a search for
events that might have been responsible for it. The notion of little intervening activity will
similarly play off of a memory mechanism. With little or no intervening activity, the last
response will still be the one most likely to be recovered from memory. But as activity increases,
so do the possibilities of disruption (recall what Wagner, Rudy, and Whitlow found with posttrial episodes!), and choosing the wrong response as being the cause of the reward (response
competition).
B. Belongingness
On a strict associational account, an association ought to form between any response and any effective outcome. However, several studies (in addition to the work we have already discussed on learned taste aversions) suggest that this need not be the case.
One of these studies, conducted by Shettleworth, involved an experiment with hamsters.
Shettleworth identified six high frequency activities in hamsters that included face washing,
digging, scent marking, hind leg scratching, rearing, and front paw scraping. When each of these
was subsequently paired with a food reinforcer, only digging, rearing, and front paw scraping
were affected. Such restriction of the operation of a reinforcer represents a violation of the
requirement that reinforcers be trans-situational: Here, at least, are three responses that a
reinforcer of food will not affect.
Another study illustrating belongingness comes out of the work of Premack. By allowing
kids to play with gumball machines (dispensing candy) and pinball (game) machines, Premack
identified kids who were players or eaters (based on the relative proportion of time they spent
with each machine). He then set up an experiment using the following design:
Group   Subjects   Response & Reinforcer
1A      players    play to eat
1B      players    eat to play
2A      eaters     play to eat
2B      eaters     eat to play
Thus, there was now a contingency between responding on one machine, and responding on the
other: In one case (play to eat), kids would have to increase their time on the pinball machine to
get an opportunity to use the gumball machine; in the other (eat to play), the reverse was
required: kids would have to increase responding to the gumball machine to get a shot at the
pinball machine. Only groups 1B and 2A showed learning. Thus, what counts as an effective
reinforcer for one child may be completely ineffective for another. (Similar results hold up for
animals: see the next chapter).
Also as a potential illustration of belongingness, we might mention the fact that certain
responses that are easy to acquire with positive reinforcement become very difficult to acquire
with negative reinforcement. Pecking a key for pigeons, for example, is difficult to train with
negative reinforcement (e.g., MacPhail). The explanation for this latter result may have to do
with Bolles's theory of safety and danger signals. Negative reinforcement involves the presence
of danger signals that trigger SSDRs (species-specific defense reactions). Such responses may well interfere with the desired
response, particularly if that desired response involves approaching the danger signal or
aversive stimulus! Such a notion is similar to a more general principle of preparedness posited
by Seligman: Responses may be ordered on a continuum ranging from prepared responses at
one extreme to contraprepared responses at the other. Prepared responses are responses quite
similar to what an animal would naturally do in a given situation, whereas contraprepared responses are the exact opposite of what the animal would normally do (approaching danger rather than fleeing from it, for example). According to Seligman's principle, the closer a response is to the prepared end of the continuum, the more easily it should be learned. So, the exact same
response may be acceptable in one circumstance, but not in another.
C. Acquisition Without Direct Reinforcement
A number of studies question whether a reinforcer is necessary for forming or strengthening
an association. The classic experiment illustrating this phenomenon was done by Tolman and
Honzik. Their design involved having rats learn a maze. The rats were given one trial per day for
17 days. The experiment involved the following design:
Group   Treatment
1       no RF; removed when reach goal box
2       RF on each day when reach goal box
3       no RF until the 11th day
The question Tolman and Honzik asked was how the third group would perform on days 11
through 17. Since these days represented the first time this group had experienced reinforcement,
a reinforcement-based account of learning would suggest that these animals started learning only
on Day 11. But in fact, on the 12th day, these animals were performing as well as (in fact, slightly
better than) the animals reinforced from the beginning (Group 2): They had learned to navigate
the maze in the absence of a reinforcer. This finding, termed latent learning, suggests that
reinforcement may be more important for performance (motivating an animal to show its
knowledge) rather than acquisition.
Another similar result involves a study by Butler in which monkeys learned a response
whose consequence involved being given access to a window looking out on a parking lot. While
curiosity might be called a reinforcer, it seems a bit of a stretch in this case. The problem is that
we have no way of independently identifying when learning would be expected to occur in the
absence of any other reinforcer such as food, and when it would not. When is the animal
curious?
A third study involves the area of observational learning. In a famous experiment by
Bandura, kids watched a tape of a clown playing with toys. Children in the vicarious
reinforcement group saw the clown being rewarded, but children in the vicarious punishment
group saw the clown being punished for the way he played. Later, when these kids were given a
chance to play with the same toys, the kids in the vicarious reinforcement group displayed the
same behaviors: evidence that they had learned by watching. The kids in the punishment group
played in a very different manner. But they had acquired the responses as well: When the
experimenter asked them to show what the clown had done, they were able to do so. Thus, we
find from this study that reinforcement and punishment may have an effect at a distance:
Watching others be reinforced can serve as a reinforcement. Such a notion takes us far afield
from the original idea of an appetitive stimulus that follows an emitted response (note that the
children had not made the response themselves, and note also that all children had learned the
response, though some of them had suppressed it until given permission by the experimenter to
play the way the clown had played).
A final study on observational learning illustrates that the notion is not restricted to humans.
Kohn and Dennis taught one group of rats a choice discrimination. A second group that was able to watch the training of the first group learned the choice discrimination faster. Both the
Kohn and Dennis and the Bandura studies suggest a point we will explore in the next subsection;
namely, that responses do not need to first be emitted in order for learning to occur.
D. Non-Response-Dependent Acquisition
There are several embarrassments for a theory that claims learning requires making a
response. Some of these embarrassments are more severe than others, but we can categorize them into two sorts: learning prior to responding, and response variations during acquisition or
performance.
Learning Prior To Responding
As we saw in the observational learning studies, organisms (human and non-human) can
start the process of learning prior to actually performing a physical response. To reiterate a theme
that was important in classical conditioning, this seems strongly to suggest that an association
forms between mental representations. Let's consider three more examples of this. They are
relevant because, unlike the Bandura and Kohn and Dennis studies, they will not involve
learning by observing others perform (i.e., learning by imitation). Thus, they cannot easily be
accounted for in terms of vicarious reinforcement.
The first is a study by McNamara, Long, and Wike. They looked at how long it took to
train two groups of rats to learn a maze. One group, however, was initially placed one-by-one in
a 'wagon' and dragged through the maze. This group did not perform any of the running or
turning responses, but they did get to observe their environments. And as you have probably
guessed, this group learned to run to the goal box (where they had been dragged) faster than the
group without that experience. Presumably, while being dragged through the maze, they had
opportunities to observe the various stimulus cues and form a representation of their
environments (a cognitive map: see the discussion below and in the next chapter on Tolman's
theory of learning). Learning to navigate through those environments was thus speeded up for
this group.
The second study involved the acquisition of cognitive maps by chimpanzees. In this study,
Menzel had animal subjects watch while food was hidden in slightly under 20 different locations
in a large field. The animals were brought along as Menzel hid the food over the field.
Subsequently, they were released at the starting point, and observed while they collected the
food. Two findings are relevant here. First, they knew the locations of the food, despite having
made no response themselves. And second, they did not collect the food in the same order as
it was hidden. Thus, we would not want to claim that a path through the field (consisting of a
chain of locations to visit) was learned (as may have been the case in the McNamara et al.
study). They clearly did not imitate Menzel's path or chain in this instance. The results again
seem to suggest that animals can acquire representations that are map-like, and that their learning
will show up as an enhanced ability to successfully find food (rather than as the performance of a
given order of responses).
Our third study was introduced earlier in this chapter. This is the study on latent extinction
by Seward and Levy. To remind you, animals placed directly in a goal box in which they have
previously been fed more rapidly extinguished running to that goal box than a group without this
initial experience. Insofar as extinction is normally defined as requiring the process of making a
non-reinforced response, both groups should have extinguished at the same rate: The initial
experience of the experimental group did not involve making a non-reinforced running response!
But that didn't happen. Presumably, through classical conditioning, the goal box had become
associated with food, so that the absence of food may have aroused frustration or inhibition. Thus, a change in the value of the outcome (a type of devaluation) made extinction easier for the experimental group.
Response Variations
Alternatively, we may ask whether a given response, once it is acquired, is stamped in. On a
strict associationist account such as Watson's or Hull's, motor movements are trained. However,
the evidence strongly seems to disconfirm this, too.
Consider a study by Macfarlane. In this study, rats were trained to run a T-maze. Once they
had acquired this response, Macfarlane flooded the maze and put the rats into the start box. They
swam to the goal box that had previously been reinforced. The point of this study, of course, is
that swimming and running technically involve different muscle movements. So, if Watson's
view of learning were correct, we ought not to find evidence of learning when a different
response is executed. But consistent with our discussion of cognitive maps, these animals had
learned where to go: How to get there was not all that important.
A quite similar point occurs in studies with the use of the Morris maze. The Morris maze is
a pool of opaque, milky-white liquid that has a platform somewhere underneath the water. The
platform is close enough to the surface to enable a rat to keep its head above water without
having to swim. Morris and his colleagues have found that rats released into the pool from the
same spot eventually discover the platform (not having to expend energy on swimming is the
reinforcer), and then learn to swim straight towards it. What is important from our perspective,
however, is that these animals still head towards the platform when they are released from a new
location: They are able to adjust their angle of swimming relative to the landmarks in the room
that tell them in what direction the platform ought to be. That technically involves a different
response than the one these animals made during acquisition. That they can execute the proper
novel response again illustrates the involvement of cognitive maps in learning. What drives
performance here is where to get to, and how to get there as soon as possible.
And indeed, people who study acquisition often report that animals will perform a number
of different physical responses that appear equally effective in obtaining reinforcement. Rats
need not (and will not) always press the bar with the same paw.
Finally, although it takes us slightly off of the focus of this section, we may also mention
studies that show animals do not always prefer to make a response that has just been reinforced
(and that should therefore be relatively strong). On a T-maze, for example, rats have a tendency
to visit alternate arms (e.g., Dember and Fowler). This should remind you of our discussion of
foraging theory. Similarly, Harlow has demonstrated that primates can learn a win-shift, lose-stay strategy in choice discrimination in which they have to select the non-reinforced stimulus on
the next trial. Responding to the previous S- and avoiding a response to the previous S+ ought to
be difficult under normal associationist assumptions. Under foraging theory assumptions, it
ought not to be that difficult.
E. Multiple Stimuli & 'Compound Conditioning'
Stimuli are complex events that can be broken down into yet simpler stimuli. Moreover,
several different stimuli may be present on a given trial, each associated with its own reinforcer.
In this section, we consider what happens in these circumstances.
Herrnstein's Matching Law
Consider the following situation: A pigeon is trained to peck at a key for reinforcement. We
use a variable interval schedule (see the chapter on extinction and partial reinforcement for
more details) in which the first peck after some random interval of time is reinforced. There may
be a red key in one block of trials, and the first effective peck at it will yield two reinforcers. In
another block of acquisition trials, there may be a blue key that will be rewarded with one
reinforcer when pecked. And finally, in yet another block of trials, a green key may be associated
with five reinforcers. Following this training, what will happen when the pigeon is put into an
operant chamber containing all three keys? Situations such as this were investigated by
Herrnstein.
The answer may surprise you. (And it will perhaps startle you to find that the answer doesn't
depend on species: College-level humans have displayed the same result!) A first-guess
common-sense theory most people come up with is that the animal (or human) will spend all of
its time on the key that has the best value -- the green key. But this often does not happen.
Instead, the animal (and the human) will distribute its responses in proportion to the
reinforcements available. That is, it will peck all of the keys over a period of time, but will peck
the green key proportionately more often than the red key, and the red proportionately more
often than the blue.
Herrnstein has called this the matching law. To provide a simple formula for this law, let us
assume we have a series of possible responses corresponding to a series of stimuli. In that case,
Responses to S1/Total Responses = RFs from S1/Total Available RFs
So, to calculate the proportion of times our animal spends with each key in the example above,
we would calculate the following:
Proportion of Responses to SGreen = 5/(5+2+1) = 5/8 = .625
Proportion of Responses to SRed = 2/(5+2+1) = 2/8 = .25
Proportion of Responses to SBlue = 1/(5+2+1) = 1/8 = .125
That is, it should distribute 62.5% of its responses to the green key, 25% to the red key, and only
12.5% to the blue key.
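If you want to check the arithmetic, here is a brief Python sketch of the matching law calculation above, where each key's value is simply the number of reinforcers it yields (the function name is illustrative, not Herrnstein's).

def matching_proportions(values):
    """Predicted proportion of responses to each alternative (matching law)."""
    total = sum(values.values())
    return {key: v / total for key, v in values.items()}

print(matching_proportions({"green": 5, "red": 2, "blue": 1}))
# {'green': 0.625, 'red': 0.25, 'blue': 0.125}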
The matching law applies generally whenever there is a difference in value of the
reinforcer. We know that temporal contiguity can affect the value of a reinforcer, and a version
of the matching law has been formulated for this situation, as well. But in this case, value
depends on the reciprocal of the delay. A reinforcer that is given after a short delay has more
value than one given after a long delay. So, given the same reinforcer presented at 2 and at 8 sec
delays, its value would be 1/2 (.5) and 1/8 (.125), respectively. In this case, the matching law
would have the following formula:
Responses to S1/Total Responses = value of RF from S1/Total Available values
Let us take another example. We'll again use a pigeon trained to peck at a red, blue, or green
key. This time, the pigeon gets the same reinforcer from each, but at different delays: 2 sec for
pecking at the red key, 4 sec for pecking at the blue key, and 8 sec for pecking at the green key.
Before even doing any of the calculations, you ought to correctly predict that the red key will be
pecked most (fastest reward), and the green key least (slowest reward).
But let's do the calculations. First, we need to calculate the values based on the delays.
Remember that these are reciprocals. So, the values are:
red: 1/2 = .5 blue: 1/4 = .25 green: 1/8 = .125
Plugging these into our formula will yield the following results:
Proportion of Responses to SGreen = .125/(.125+.25+.5) = .125/.875 = .143
Proportion of Responses to SRed = .5/(.125+.25+.5) = .5/.875 = .571
Proportion of Responses to SBlue = .25/(.125+.25+.5) = .25/.875 = .286
You can see that our predictions turn out to be correct. Specifically, our pigeon ought to
distribute 57.1% of its responses to the red key, but only 14.3% to the green. (As a check on your
calculations, by the way, note that the proportions ought to add up to 100%!)
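The same kind of calculation handles delays once each delay is converted to a value by taking its reciprocal. Again, this is only an illustrative sketch:

    # Value is assumed to be the reciprocal of the delay to reinforcement.
    delays = {"red": 2, "blue": 4, "green": 8}            # seconds to reinforcement
    values = {key: 1 / d for key, d in delays.items()}    # red .5, blue .25, green .125
    total = sum(values.values())
    proportions = {key: v / total for key, v in values.items()}
    print(proportions)   # red ~.571, blue ~.286, green ~.143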
The matching law also applies to aversive outcomes. However, there is some evidence that
it does not apply to all instrumental situations (see, for example, Allison). Some discrepancies
have been found with fixed interval schedules (see the chapter on extinction and
partial reinforcement). Nevertheless, the matching law illustrates complex processes that depend on comparing
momentary values from multiple stimuli. Several theories have been proposed to explain the
matching law. One of these, the melioration theory, claims that animals are assessing the
momentary odds of a payoff. When one stimulus has paid off, then the animal works on the
stimulus that is next most likely to pay off. Although the example differs somewhat, it is
reminiscent of people playing multiple slot machines who shift to a different machine as soon as
the machine they're on has scored.
Compound Conditioning
The phrase compound conditioning is typically discussed in reference to classical
conditioning. Nevertheless, it is also appropriate here, as there are many instances in which
several stimulus elements are present during learning. The typical (and not always correct)
assumption is that a complex stimulus is the basis for generalization: The animal has associated
a response with all of the elements of that stimulus, and generalizes its responses to other
stimulus complexes as a function of how many elements overlap. However, we sometimes obtain
results reminiscent of blocking or overshadowing. We will look at these in more detail in the
chapter on attention and categorization, but let us briefly consider several studies that will
illustrate the point.
The first is a study by Reynolds. Reynolds taught pigeons to peck at a key that included
several different stimulus elements. Specifically, there was a white triangle against a red
background on top of the reinforced key (as opposed to a circle against a green background for
the non-reinforced key). In a later generalization test, Reynolds checked for how many pecks the
pigeons would give to a completely red key, a completely green key, a triangular key, and a
circular key. If the pigeons were under the control of the total stimulus complex (triangle & red),
then we would expect significant generalization to both red and triangle, since they contain
elements of the original training complex. Instead, he found that one pigeon pecked just to the
red key, and the other pecked just to the triangular key. In this case, one element of the complex
was the only effective or salient element; it completely overshadowed the other.
A second study that illustrates a similar phenomenon was conducted by Wagner, Logan,
Haberlandt, and Price. They set up a design for two groups of animals that was something like
the following:
Group   Compound Stimulus   RF Frequency
1       Light & Tone1       50%
        Light & Tone2       50%
2       Light & Tone1       100%
        Light & Tone2       0%
This study, of course, manipulates signal value. Thus, for the first group, the light has better
signal value than either of the tones, because the light predicts more reinforcers (each tone by
itself only predicts half the reinforcers the light does; you would have to attend to both tones to
predict as many reinforcers as the light: two things to track rather than one). In contrast, the light
is a worse predictor than Tone1 in the second group: paying attention to the light will work only
half the time (as it does for Group 1), but paying attention to Tone1 will work all of the time.
Wagner et al. find that the light does a more thorough job of controlling responding in the first
group, despite the fact that it really predicts the same number of reinforcers in both groups (note the
similarity to how we set up a blocking design in classical conditioning).
Can we get blocking in the traditional fashion? We ought to predict a blocking effect if
instrumental conditioning operates the same way classical conditioning does. That is, given the
following design:
Group   Phase 1            Phase 2
1       RF for R to S1     RF for R to S1&S2
2       (Nothing)          RF for R to S1&S2
we would predict that the animals in Group 1 should block to S2, whereas the animals in Group 2
might be expected to show some conditioning to both S1 and S2. Thomas, Mariner, and Sherry
used this type of design with pigeons, and obtained evidence of blocking.
Finally, consider a study by Lawrence and DeRivera. They used stimuli involving
different shades of grey. In this study, animals had to perform one response if the darker shade
was on top of a lighter shade, and the opposite response if the lighter shade was on top of the
darker. During training, the bottom shade in all cases was a neutral grey.
The question Lawrence and DeRivera asked was whether the animals were under control of
just the top color, or were comparing the top color with the bottom color. To make a long story
short, they found that animals were essentially comparing the top shade to the bottom shade.
When two of the lighter shades were presented, for example, which response the animal
performed depended on whether the lighter of these shades was on the top or the bottom, even
though both shades had been associated with the same response during acquisition. In this
case, responding was apparently based on a comparison rule such as darker on top or lighter on
top. Such comparative sensitivity is referred to as relational learning. It is incompatible with the
claim that an association forms between independent stimulus elements and the response. That
assumption forms part of many associationist models such as Hull's or Spence's theories (see the
next chapter), but it also appears inconsistent with a model like the Rescorla-Wagner model, in
which each stimulus is treated independently of the others on a conditioning trial. Under certain
circumstances, stimulus configurations are not equal to the sum of their component parts.
IV. Several Views Of Instrumental Conditioning
We have implicitly and explicitly discussed several views of what instrumental
conditioning involves. On many associationist accounts, there will be an association between a
stimulus and a response, at a minimum, and there may or may not be associations with the
outcome. Learning involves the formation of these associations, or their strengthening or
weakening. In contrast, as many of the studies we have just reviewed on cognitive maps and
response variations suggest, the stimulus-response link may not be all that critical. In this
section, we briefly introduce two diametrically opposed views of learning. These are the radical
behaviorist approach of Skinner and the cognitive expectancy theory of Tolman.
A. Skinner's View
We start with a notion Skinner expresses that makes him fundamentally different from
practically everyone else in the field of learning: Skinner's refusal to engage in formal
theorizing. For Skinner, theories involve hypothetical constructs that cannot be directly observed,
and that therefore ought not to be discussed. And since theories (for most scientists and
researchers) generate predictions to be tested, you may correctly guess that Skinner's notion of
what experiments are about will also differ. Perhaps the best way of putting it is this: Skinner's
system only requires that we be able to predict and control behavior. When we have
accomplished that goal, we need go no further. So, we do experiments to see which
environmental contingencies control behavior in which ways.
For Skinner, there are essentially two broad categories of behavior: operants and
respondents. Respondents cover behaviors that are reflexively elicited from the animal by the
presence of specific stimuli. Other stimuli present at the same time can acquire the ability to
elicit these responses. Thus, he includes the work in classical conditioning
under the category respondent. But operants are more characteristic of what most of us (and
most higher animals) do: They are the pieces of behavior that an animal emits in a given
situation.
The distinction between elicited and emitted behavior is an important one for Skinner.
Emitted behavior is not triggered by any single stimulus linked to that behavior.
Operants are presumably influenced by a whole complex of events including the contextual
stimuli, genetic constraints, and motivational factors having to do with the animal's state of
hunger or thirst at the time. This is such a complex set of determiners that for all practical
purposes, Skinner refuses to talk about which stimuli connect with which responses. Thus,
contrary to most behaviorist theorists, Skinner does not talk about an association forming
between a given stimulus and a given response, nor about whether and when that association
strengthens or weakens. Rather, stimuli for Skinner serve something of the same function as
occasion setters: They signal times at which a response-outcome contingency occurs.
We ought to take a moment to note that the notion of a response-outcome contingency in
Skinner's work does not mean the same thing as an operant contingency space. This notion
merely means that the experimenter has set up a condition whereby a given response will
occasionally be followed by an outcome. Contiguity of the response and the outcome constitutes
the relevant learning mechanism, as we have seen in the previous discussion of superstitious
behavior. But that is not to say that stimuli are irrelevant. When an outcome follows a response
in the presence of one stimulus but not another, the animal learns a discrimination. The response
comes under the control of the first stimulus (stimulus control), not in the sense that there is a
triggering effect, but in the sense that the first stimulus becomes part of the entire stimulus
situation in which responding alters.
We ought also to take a moment to discuss this notion of a response. Although Skinner
did use the term, it is in some sense odd, given the de-emphasis on any specific cause of the
response. Given his view, it will not surprise you to learn that he did not insist that learning
require the same physical response to increase or decrease in frequency (unlike Watson and Hull,
who both claimed an association formed with specific muscular movements). So, many of the
objections we looked at above to the learning of specific movements do not apply to Skinner.
Instead, he adopted a functional definition of a response: any set of responses that achieve the
same function qualify as the same response. Thus, if the function is to get to the left side of a
T-maze, running to it, swimming to it, backing up to it, and casually crawfishing to it all count as
the same response, because they all accomplish the same function.
Moreover, through shaping and chaining, complex operants followed by a single reinforcer
are strengthened as a group, so that behavior may be organized into quite long sequences. The
sequence of getting out your car key, inserting it into the lock, unlocking the door, taking the key
out, opening the door, getting in, closing the door, putting the key in the ignition, and turning it
may all be reinforced by the motor coming on, but will all be punished by an engine that refuses
to turn. If this seems a strange example to bring up, it isn't, really. Above all, Skinner was always
concerned with the practical aspects of modifying behavior in the real world, and exploring how
real-world contingencies affected behavior. Thus, in part as a pragmatist, he was concerned with
what worked, and not with elaborate theories of why or how.
Perhaps because he was a pragmatist, he and his followers also tended to avoid
experimental designs involving large groups of animals or people. His focus was on the
individual, and whatever changes could be observed in the individual. Control that individual's
behavior to some extent, and you have demonstrated sufficient explanation for why it occurs
(since you are now able reliably to predict the presence or absence of that behavior).
Among his contributions was the study of partial reinforcement schedules (see Ferster and
Skinner), behavior modification techniques and token economies, the notion of superstitious
behavior, behavioral-level definitions of reinforcers and punishers (as opposed to the theoretical
definitions we will see in the next chapter), work on secondary reinforcers and punishers,
teaching machines, and the notion of negative reinforcement (which involves a different
definition than the one I have used earlier: For Skinner, negative reinforcers are aversive events
that operate as reinforcers by being removed. This is a subtle difference, but recall that we have
defined negative reinforcement in terms of the removal of the event, not whether the event itself
is aversive). Skinner, unlike Watson, was also happy talking about the conditioning of private
events. And shortly before his death in 1990, he lambasted the current emphasis in American
Psychology on cognitive models and memory systems. Ironically, his approach had become
isolated from mainstream research at the same time that his behavior modification procedures
had become a normal part of educational and clinical management techniques.
I don't think Skinner would have been disturbed by any research finding whatsoever. Since
he refused to build formal theories, no finding could really have been inconsistent with his
approach. In some sense, I think of Skinner as engaged in a process of cataloging or categorizing
behavior: What are the situations under which this response increases? What are the situations
under which it displays resistance to extinction? What mechanisms are effective for altering a
response? How can we set our environments up to provide maximal efficiency? These were the
issues that occupied him.
As an example of work inspired by Skinner's approach, we may consider a famous series of
experiments by people like Verplanck and Greenspoon on verbal conditioning. They used a
reinforcer of agreement (e.g., "uh huh" or even a pencil tap) and showed that people increased
whatever it was that the "uh huh" followed: plural nouns rather than singular nouns, affective
rather than descriptive statements, etc. Following acquisition, Greenspoon put his subjects through
extinction, and following that, questioned them to see if they had been aware of what was going
on. He claimed they weren't. Thus, several theorists made the very strong claim that humans
could easily be conditioned without their awareness (a claim Watson would have loved, of
course).
Later theorists such as Dulany and Spielberger and DeNike challenged those studies. They
pointed out a number of potential problems. One, for example, was that whatever people might
have thought was going on would have been implicitly disconfirmed once extinction started,
since they would now be collecting evidence against their hypothesis. Another was that a
number of correlated hypotheses could have increased responding, but that these weren't
assessed by the earlier experimenters. For instance, if you suspect you're being reinforced for
mentioning breeds of dogs, then you may say "chihuahuas, collies, dachshunds, terriers, pugs,"
etc. Note that these are plurals. But when the experimenter asks you what you believed the
purpose of the experiment involved, you report that you were being stroked for coming out with
dogs, which gets you coded (unfairly) as having shown conditioning without awareness. And
finally, Dulany demonstrated that the people who showed verbal conditioning were those who at
the time had a correlated hypothesis (i.e., were aware that something was going on, and had a
theory that would result in increasing responses that the experimenter would count as correct by
administering reinforcement).
I think a true Skinnerian wouldn't have been much bothered by Dulany's or Spielberger and
DeNike's results. Awareness for them could be defined operationally as a series of answers to
questions on a survey (much as Watson defined emotions as nothing more than certain behaviors
like crying or shaking). Those answers constitute verbal behavior, as well. All a true Skinnerian
need do in this situation is talk about the conditions under which one type of verbal behavior
(performance on a survey) accompanies another (conditioning). I include this example because I
want to give you a flavor of the extraordinarily different ways in which people interpret scientific
research. To go back to the work by the philosopher Thomas Kuhn (mentioned in Chapter 1),
Skinner's approach represents a completely different paradigm. And people in different
paradigms can only rarely have useful discussions with one another about the foundational and
philosophical assumptions that make science and the world meaningful for them. It is a bit like
arguing religious beliefs.
B. Tolman's Expectancy Approach
Skinner's first major publication, The behavior of organisms, appeared in 1938. To give you
a time frame, Tolman published one of his major works, Purposive behavior in animals and men,
in 1932. At the time, behaviorism was the only game in town. Tolman regarded himself as a
behaviorist, but he was like no other theorist then around. As you may gather from the title of his
book, he believed that behavior was guided by an organism's purposes and goals. Thus, his brand
of behaviorism is called purposive behaviorism.
Tolman was a theorist, in contrast to Skinner. He built models around the notion of
intervening variables that came between a stimulus and a response. These variables in large
part involved cognitions: beliefs, expectancies, desires, and knowledge. The argument that he
and others who have used intervening variables make is that theories with such variables are
more successful in their predictions than those without. He didn't worry about Watson's dictum
that private events were illegitimate in a science of psychology, since the proof of the legitimacy
of the concept for Tolman was its track record. Tolman was the precursor of people like Bandura
who built models around observational learning, and more generally, of the cognitive revolution
that occurred in American psychology in the 1960s.
Several cognitions were particularly important in Tolman's work. One of these we have
already met: the notion of a cognitive map. According to Tolman, animals observing, exploring,
and experiencing their environments would come to have representations of the lay-out of those
environments. Thus, he performed experiments in which he showed that animals would demonstrate
they had learned an environment once properly motivated to do so (e.g., the Tolman and Honzik
experiment on latent learning discussed earlier), and that they knew how to get around obstacles
and take novel shortcuts when their normal routes were no longer available. We can describe
such learning in terms of stimulus-stimulus (S-S) associations, but in many of Tolman's studies
it was clearly observational learning.
Another cognition that was quite important (and that prefigured many modern theories of
learning) was the notion of an expectancy or expectation: a belief that some event ought to
occur in some situation based on past experiences. In discussing the situations found in
instrumental and classical conditioning, Tolman provided examples of two types of expectancies.
One may be written as follows:
Ej: S1 -----> S2
This may be read as stating the content of expectancy Ej: That expectancy tells us that when S1
occurs, S2 may be expected to follow. If we substitute the CS for S1 and the UCS for S2, we see
that we obtain a situation corresponding to classical conditioning. But this situation extends far
beyond classical conditioning. It may also explain how we build cognitive maps: we learn that
this part of the route is normally succeeded by this other part.
As for instrumental conditioning, the expectancies may be given as below:
Ek: S3 - Ra -----> S4
El: S3 - Rb -----> S5
These two expectancies, Ek and El (the subscripts are just to keep them separate; we have a huge
number of expectancies in which these stimuli are specified rather than indicated through
abstract mathematical variables), basically state that in the presence of stimulus S3, one response
(Ra) will lead to stimulus S4, and the other response (Rb) will result in a different stimulus. If you
view these latter stimuli as rewards or punishers, then you obtain the expectancies that account
for approach or avoidance.
We can now add the notions of value and valence to account for what an animal will do.
The value of an expectancy has to do with the strength of its terminal stimulus. If we temporarily
regress to speaking of reinforcers and punishers, there are strong and weak reinforcers, just as
there are strong and weak punishers. As you might expect, the strong ones are of greater
motivational value than the weak ones. As for whether something is positive or negative, this
involves the notion of its valence or sign.
Given this, we can now state some simple rules for what an animal will do in any given
situation. One is that, given a choice between two positive-valence responses, the animal will
choose the stronger. As an example, consider the following expectancies in an experiment with
monkeys:
Ek: Tone - Lift White Cup -----> find banana chip
El: Tone - Lift Blue Cup -----> find piece of lettuce
Here, we presume the animal is faced with a choice involving two down-turned cups. Each has a
reinforcer hidden beneath it, and the animal may choose the reinforcer underneath one of the
cups. From past experience, it has learned that the white cup hides a banana chip, and the blue
cup hides a lettuce leaf. Banana chips are stronger reinforcers: They are high-value, positive-valence outcomes. Thus, our principle states the animal ought to choose the white cup. In
common words, choose the better of two goods.
Our second rule will involve negative valences. We have a rat in a chamber that may leave
by one of two doors. It is being shocked in the chamber, so there is every reason to leave. If it
goes through the north door, the shock reduces by half, and if it goes through the south door, the
shock reduces by a fourth. In this case, the principle is to choose the weaker of two negative-valence outcomes. Or in plain English, if you have to, go for the lesser of two evils.
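If it helps to see the two rules side by side, here is a small sketch of my own devising (the choose function and the option tuples are illustrative, not Tolman's notation); each option carries a value (the strength of the terminal stimulus) and a valence (its sign):

    # Rule 1: among positive-valence options, choose the strongest.
    # Rule 2: if only negative-valence options remain, choose the weakest.
    def choose(options):
        positives = [o for o in options if o[2] > 0]
        if positives:
            return max(positives, key=lambda o: o[1])
        return min(options, key=lambda o: o[1])

    # Choose the better of two goods:
    print(choose([("white cup -> banana chip", 5, +1),
                  ("blue cup -> lettuce", 1, +1)])[0])
    # Choose the lesser of two evils:
    print(choose([("door A -> weak shock", 0.25, -1),
                  ("door B -> strong shock", 0.5, -1)])[0])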
That gives you a bit of a taste of Tolman's theory. We will talk about several more relevant
studies later. Tolman and Hull (see the next section), in particular, constantly chased one
another's experiments and theories, arguing about whether the results suggested the need for
cognitive factors or not. Interestingly enough, they both utilized intervening variables, although
Hull's system was far more developed and organized than Tolman's. But Tolman had a gift for
finding the weakness in one of Hull's claims, and doing an experiment that would seem to
demonstrate a result the exact opposite of what Hull predicted. That in part was the case with the
latent learning study, since Hull's model allows formation of an association only with a very
special type of reinforcing event called a drive reduction. But in latent learning, no such event
occurred on the first 10 trials for the group running without a reinforcer.
It's hard to imagine two approaches more different, and yet in some respects similar, than
Tolman's and Skinner's. They were both concerned with large-scale behavior rather than the
minute muscle movements of Hull's system. But Tolman freely speculated on intervening
variables while Skinner loathed them. One placed the cause of behavior squarely within a
cognitive or representation-level approach, but the other kept as close to a behavioral-level
approach as was possible. And the specific details of each theorist's approach were generally
ignored by most people, although each had enormous influence on subsequent work or
theoretical approaches. Indeed, one of Tolman's students, Krechevsky, developed the notion that
animals during learning test hypotheses about which stimulus element they are supposed to
notice. As we will see in a later chapter, this notion evolved into modern-day attentional theories
of discrimination learning.
C. A Note On The Interrelationship Between Classical & Operant Conditioning
It should by now be obvious that there are many close interrelationships between classical
and instrumental conditioning, Skinner's distinction between the two notwithstanding. In some
cases, they are difficult to tell apart (as in the work on learned taste aversions, which may be
analyzed from either a classical or an instrumental conditioning perspective). In others, they are
clearly intertwined. Secondary reinforcers and aspects of chaining clearly involve classical
conditioning, and classical conditioning has sometimes been used to account for what happens in
avoidance learning (see the discussion of Mowrer's two-factor theory of avoidance learning in
the next chapter). So, how different are they, really?
Several theorists have tried to answer this question from different perspectives. One
approach simply attempts to cut the Gordian knot by applying similar models to each. In a later
chapter in which we examine theories of discrimination learning, we will come across attentional
and rehearsal concepts that will remind you of the corresponding models in classical
conditioning. There have been some attempts, for example, to modify the Rescorla-Wagner
model to handle the strength of instrumental learning by treating the S and the R as CSs in
compound conditioning, and the RF as the UCS (see, for example, Wasserman, Elek, Chatlosh,
and Baker). We have already seen some of the predictions regarding blocking and
overshadowing that would naturally arise from application of such a model.
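To give a feel for how such a translation might look, here is a minimal sketch assuming a standard Rescorla-Wagner update, with the stimulus S and the response R treated as two compounded "CS" elements and the reinforcer playing the role of the UCS (the function rw_update and its learning-rate parameters are my own illustrative choices, not the published formulation):

    # One Rescorla-Wagner trial: the strength of every element present is nudged
    # toward lam (1 if the reinforcer occurs, 0 if it does not) by the shared
    # prediction error (lam minus the summed strength of the present elements).
    def rw_update(V, present, lam, alpha=0.3, beta=1.0):
        v_total = sum(V[e] for e in present)
        for e in present:
            V[e] += alpha * beta * (lam - v_total)
        return V

    V = {"S": 0.0, "R": 0.0}
    for _ in range(50):                  # responding in the presence of S is always reinforced
        rw_update(V, ["S", "R"], lam=1)
    print(V)                             # S and R end up sharing the available strength

On this kind of account, blocking falls out in the usual way: if S alone had first been trained to asymptote, the prediction error on later compound trials would be near zero, leaving little strength for R (or for an added S2) to acquire.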
Another approach is more direct: If Skinner is correct in identifying these two types of
learning as really involving different types of muscle systems (voluntary versus involuntary
muscles), then we ought not to be able to instrumentally condition involuntary reflexes.
However, there is now quite a lot of work in the area of biofeedback on different species
(including humans) demonstrating that involuntary responses can be modified through the
operation of instrumental reinforcers (see, for example, Miller).
Whether the same models ultimately apply to these two areas or not, do note that there will
always be the other type of learning occurring whenever you train instrumental or classical
conditioning. Reinforcers and punishers are also significant biological events that act as UCSs,
so that instrumental conditioning using outcomes should generally include some aspects of
classical conditioning. Similarly, biologically significant UCSs in classical conditioning may act
as outcomes influencing the responses the animal makes prior to their presentation.
Learning in the real world doesn't always occur in neat packets that allow one type of
association to form, and not another.
Partial Bibliography
Amsel, A. (1958). The role of frustrative nonreward in noncontinuous reward situations.
Psychological Bulletin, 55, 102-119.
Allison, J. (1989). The nature of reinforcement. In S.B. Klein & R.R. Mowrer (Eds.),
Contemporary learning theories: Instrumental conditioning theory and the impact of biological
constraints on learning (13-39). NJ: Erlbaum.
Badia, P., & Culbertson, S. (1972). The relative aversiveness of signalled vs. unsignalled
escapable and inescapable shock. Journal of the Experimental Analysis of Behavior, 17, 463-471.
Badia, P., Culbertson, S., & Harsh, J. (1973). Choice of longer or stronger signalled shock vs.
shorter or weaker unsignalled shock. Journal of the Experimental Analysis of Behavior, 19, 25-32.
Bandura, A. (1965). Influence of models' reinforcement contingencies on the acquisition of
imitative responses. Journal of Personality and Social Psychology, 1, 589-595.
Boe, E.E., & Church, R.M. (1967). Permanent effects of punishment during extinction. Journal
of Comparative and Physiological Psychology, 63, 486-492.
Bolles, R.C. (1970). Species-specific defense reactions and avoidance learning. Psychological
Review, 77, 32-48.
Breland, K., & Breland, M. (1961). The misbehavior of organisms. American Psychologist, 16,
681-684.
Brown, J.S., Martin, R.C., & Morrow, M.W. (1964). Self-punitive behavior in the rat: Facilitative
effects of punishment on resistance to extinction. Journal of Comparative and Physiological
Psychology, 57, 127-133.
Brown, P.L., & Jenkins, H.M. (1968). Auto-shaping of the pigeon's key peck. Journal of the
Experimental Analysis of Behavior, 11, 1-8.
Butler, R.A. (1953). Discrimination learning by rhesus monkeys to visual-exploration motivation.
Journal of Comparative and Physiological Psychology, 46, 95-98.
Campbell, P.E., Batsche, C.J., & Batsche, G.M. (1972). Spaced-trials reward magnitude effects
in the rat: Single versus multiple food pellets. Journal of Comparative and Physiological
Psychology, 81, 360-364.
Capaldi, E.J. (1978). Effects of schedule and delay of reinforcement on acquisition speed.
Animal Learning and Behavior, 6, 330-334.
Colwill, R.M., & Rescorla, R.A. (1985). Postconditioning devaluation of a reinforcer affects
instrumental responding. Journal of Experimental Psychology: Animal Behavior Processes, 11,
120-132.
Crespi, L.P. (1942). Quantitative variation in incentive and performance in the white rat.
American Journal of Psychology, 55, 467-517.
Daly, H.B. (1974). Reinforcing properties of escape from frustration aroused in various learning
situations. In G.H. Bower (Ed.), The psychology of learning and motivation (Vol 8, 187-231).
NY: Academic.
D'Amato, M.R. (1970). Experimental psychology: Methodology, psychophysics, and learning.
NY: McGraw-Hill.
D'Amato, M.R., Sarafin, W.R., & Salmon, D. Long-delay conditioning and instrumental
learning: Some new findings. In N.E. Spear and R.R. Miller (Eds.), Information processing in
animals: Memory mechanisms (113-142). NJ: Erlbaum.
Dember, W.N., & Fowler, H. (1958). Spontaneous alternation behavior. Psychological Bulletin,
55, 412-428.
Dollard, J.C., & Miller, N.E. (1950). Personality and psychotherapy. NY: McGraw-Hill.
Dulany, D. E. (1968). Awareness, rules, and propositional control: A confrontation with S-R
behavior theory. In T.R. Dixon & D.C. Horton (Eds.), Verbal behavior and general behavior
theory. NJ: Prentice-Hall.
Dwyer, D.M., Mackintosh, N.J., & Boakes, R.A. (1998). Simulatneous activation of the
representations of absent cues results in the formation of an excitatory association between them.
Journal of Experimental Psychology: Animal Behavior Processes, 24, 163-171.
Egger, M.D., & Miller, M.E. (1963). When is a reward reinforcing? An experimental study of
the information hypothesis. Journal of Comparative and Physiological Psychology, 56, 132-137.
Ferster, C.B., & Skinner, B.F. (1957). Schedules of reinforcement. NY: Appleton-Century-Crofts.
Flaherty, C.F. (1982). Incentive contrast: A review of behavioral changes following shifts in
reward. Animal Learning and Behavior, 10, 409-440.
Fowler, H., & Miller, N.E. (1963). Facilitation and inhibition of runway performance by hind- and forepaw shock of various intensities. Journal of Comparative and Physiological Psychology,
56, 801-806.
Fowler, H., & Trapold, M.A. (1962). Escape performance as a function of delay of
reinforcement. Journal of Experimental Psychology, 63, 464-467.
Garcia, J., Ervin, F.R., & Koelling, R.A. (1966). Learning with prolonged delay of
reinforcement. Psychonomic Science, 5, 121-122.
Greenspoon, J. (1955). The reinforcing effect of two spoken sounds on the frequency of two
responses. American Journal of Psychology, 68, 409-416.
Grice, G.R. (1948). The relation of secondary reinforcement to delayed reward in visual
discrimination learning. Journal of Experimental Psychology, 38, 1-16.
Guthrie, E.R. (1952). The psychology of learning (Revised edition). NY: Harper & Row.
Guttman, N., & Kalish, H.I. (1956). Discriminability and stimulus generalization. Journal of
Experimental Psychology, 51, 79-88.
Hammond, L.J. (1980). The effect of contingency upon the appetitive conditioning of free
operant behavior. Journal of the Experimental Analysis of Behavior, 34, 297-304.
Hanson, H.M. (1959). Effects of discrimination training on stimulus generalization. Journal of
Experimental Psychology, 58, 321-334.
Herrnstein, R.J. (1970). On the law of effect. Journal of the Experimental Analysis of Behavior,
13, 243-266.
Jenkins, H.M., & Moore, B.R. (1973). The form of the autoshaped response with food or water
reinforcers. Journal of the Experimental Analysis of Behavior, 20, 163-181.
Killeen, P. (1978). Superstition: A matter of bias, not detectability. Science, 199, 88-90.
Kohn, B., & Dennis, M. (1972). Observation and discrimination learning in the rat: Specific and
nonspecific effects. Journal of Comparative and Physiological Psychology, 78, 292-296.
Kraeling, D. (1961). Analysis of amount of reward as a variable in learning. Journal of
Comparative and Physiological Psychology, 54, 560-565.
Krechevsky, I. (1932). "Hypotheses" in rats. Psychological Review, 39, 516-532.
Kuhn, T.S. (1970). The structure of scientific revolutions (second edition). Chicago: University
of Chicago Press.
Lawrence, D.H., & DeRivera. J. (1954). Evidence for relational transposition. Journal of
Comparative and Physiological Psychology, 47, 465-471.
Lieberman, D.A., Davidson, F.H., & Thomas, G.V. (1985). Marking in pigeons: The role of
memory in delayed reinforcement. Journal of Experimental Psychology: Animal Behavior
Processes, 11, 611-624.
Macfarlane, D.A. (1930). The role of kinesthesis in maze learning. University of California
Publications in Psychology, 4, 277-305.
MacPhail, E.M. (1968). Avoidance responding in pigeons. Journal of the Experimental Analysis
of Behavior, 11, 625-632.
McNamara, H.J., Long, J.B., & Wike, F.L. (1956). Learning without response under two
conditions of external cues. Journal of Comparative and Physiological Psychology, 49.
Melton, A.W., & Irwin, J.M. (1940). The influence of degree of interpolated learning on
retroactive inhibition and overt transfer of specific responses. American Journal of Psychology,
53, 173-203.
Menzel, E.W. (1978). Cognitive mapping in chimpanzees. In S.H. Hulse, H. Fowler, & W.K.
Honig (Eds.), Cognitive processes in animal behavior (375-422). NJ: Erlbaum.
Miller, N.E. (1978). Biofeedback and visceral learning. Annual Review of Psychology, 29, 373-404.
Morris, R.G.M., Garrud, P., Rawlins, J.N.P., & O'Keefe, J. (1982). Place navigation impaired in rats with hippocampal lesions. Nature, 297, 681-683.
Olton, D.S. (1978). Characteristics of spatial memory. In S.H. Hulse, H. Fowler, & W.K. Honig
(Eds.), Cognitive processes in animal behavior (341-373). NJ: Erlbaum.
Postman, L. (1974). Transfer, interference, and forgetting. In J.W. Kling and L.A. Riggs (Eds.),
Experimental psychology. NY: Holt, Rinehart, & Winston.
Premack, D. (1959). Toward empirical behavior laws: I. Positive reinforcement. Psychological
Review, 66, 219-233.
Ratliff, R.G., & Ratliff, A.R. (1971). Runway acquisition and extinction as a joint function of
magnitude of reward and percentage of rewarded acquisition trials. Learning and Motivation, 2,
289-295.
Rescorla, R.A. (1997). Response-inhibition in extinction. Quarterly Journal of Experimental
Psychology. B. Comparative and Physiological Psychology, 50B, 238-252.
Reynolds, G.S. (1961). Attention in the pigeon. Journal of the Experimental Analysis of
Behavior, 4, 57-71.
Roberts, W.A. (1969). Resistance to extinction following partial and consistent reinforcement
with varying magnitudes of reward. Journal of Comparative and Physiological Psychology, 67,
395-400.
Seligman, M.E.P. (1970). On the generality of laws of learning. Psychological Review, 77, 406-418.
Seligman, M.E.P., & Maier, S.F. (1967). Failure to escape traumatic shock. Journal of
Experimental Psychology, 74, 1-9.
Seward, J.P., & Levy, N. (1949). Sign learning as a factor in extinction. Journal of Experimental
Psychology, 39, 660-668.
Sheffield, F.D. (1965). Relation between classical conditioning and instrumental learning. In
W.F. Prokasy (Ed.), Classical conditioning: A symposium (302-322). NY: Appleton-Century-Crofts.
Shettleworth, S.J. (1975). Reinforcement and the organization of behavior in golden hamsters:
Hunger, environment, and food reinforcement. Journal of Experimental Psychology: Animal
Behavior Processes, 1, 56-87.
Skinner, B.F. (1938). The behavior of organisms: An experimental analysis. NY: Appleton-Century-Crofts.
Skinner, B.F. (1964). Behaviorism at fifty. In T.W. Wann (Ed.), Behaviorism and
phenomenology: Contrasting bases for modern psychology (79-108). Chicago: U. Chicago Press.
Spence, K.W. (1947). The role of secondary reinforcement in delayed reward learning.
Psychological Review, 54, 1-8.
Spielberger, C.D., & DeNike, L. (1966). Descriptive behaviorism versus cognitive theory in
verbal operant conditioning. Psychological Review, 73, 306-326.
Staddon, J.E.R., & Simmelhag, V.L. (1971). The "superstition" experiment: A reexamination of
its implications for the principles of adaptive behavior. Psychological Review, 78, 3-43.
The Rocky Horror Show. (1975). Original Australian Cast Album. Elephant Records. (Yep: It's
better Rock & Roll than the later American album based on the movie version, in my opinion...)
Thomas, G. (1981). Contiguity, reinforcement rate, and the law of effect. Quarterly Journal of
Experimental Psychology, 33B, 33-43.
Thomas, D.R., Mariner, R.W., & Sherry, G. (1969). Role of pre-experimental experience in the
development of stimulus control. Journal of Experimental Psychology, 79, 375-376.
Thorndike, E.L. (1898). Animal intelligence: An experimental study of the associative processes
in animals. Psychological Review Monograph Supplements 2, no. 8.
Thorndike, E.L. (1913). The psychology of learning. NY: Columbia University.
Tolman, E. C. (1932). Purposive behavior in animals and men. NY: Century.
Tolman, E.C., & Honzik, C.H. (1930). Introduction and removal of reward and maze
performance in rats. University of California Publications in Psychology, 4, 257-275.
Trapold, M.A., & Fowler, H. (1960). Instrumental escape performance as a function of the
intensity of noxious stimulation. Journal of Experimental Psychology, 60, 323-326.
Verplanck, W.S. (1955). The operant, from rat to man. Transactions of the New York Academy of
Sciences, Series 11, 17 (8), 594-601.
Wagner, A.R., Logan, F.A., Haberlandt, K., & Price, T. (1968). Stimulus selection in animal
discrimination learning. Journal of Experimental Psychology, 76, 171-180.
Wasserman, E.A., Elek, S.M., Chatlosh, D.C., & Baker, A.G. (1993). Rating causal relations:
Role of probability in judgments of response-outcome contingency. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 19, 174-188.
Watson, J.B. (1913). Psychology as the behaviorist views it. Psychological Review, 20, 158-177.
Watson, J.B. (1926). Excerpts from "What the nursery has to say about instincts." In C. Murchison
(Ed.), Psychologies of 1925. NY: Clark U. Press.
Watson, J.B. (1930). Behaviorism. NY: Norton.
Watson, J.B., & Rayner, R. (1920). Conditioned emotional reactions. Journal of Experimental
Psychology, 3, 1-14.
Watson, J.S. (1967). Memory and "contingency analysis" in infant learning. Merrill-Palmer
Quarterly, 12, 55-76.
Some Relevant Internet Sites (but there are many more out there!):
Animal Cognition Page:
(http://www.pigeon.psy.tufts.edu/psych26/history.htm)
(note particularly the links to Thorndike, Tolman, and Skinner's stuff. There's a graph of
Thorndike's results with puzzle-box learning that you should look at.)
The Behaviorist Manifesto:
(Watson's classic paper)
(http://www.yorku.ca/dept/psych/classics/Watson/views.htm)
Emotional Conditioning:
(http://www.yorku.ca/dept/psych/classics/Watson/emotion.htm)
(The Watson & Rayner paper reporting on their study with Little Albert)
Cognitive Maps:
(http://www.yorku.ca/dept/psych/classics/Tolman/Maps/maps.htm)
(A paper by Tolman reviewing some of the cognitive map studies)
(Note: The three papers above are from the Classics in the History of Psychology webpage; you
have a link for it in Chapter 1)
1. Chapter © 1998 by Claude G. Cech