Questionnaire Scales: Part 1
Slide 1
The goal of this lecture is to inform you about the different types of question formats that you
could use in questionnaires.
Slide 2
As the note to the right indicates, researchers can develop a broad range of scales for
measuring things. I included this Beaufort wind force scale as one example of a type of scale
you might not have thought existed.
Slide 3
Although I will try in this lecture to structure the different types of scales as much as possible,
the point of showing you this cartoon—perhaps my favorite marketing research cartoon—is that
it never hurts to be creative when trying to design a scale. Although being creative differs from
confusing respondents with bizarre question formats, there can be a fine line between confusing
and effective questions.
Slide 4
One key point of this lecture is that the precise language you use in your questions is important.
Different words will elicit different answers. You really have to think about each word that you
include in your questions. The cartoon illustrates the euphemisms that we sometimes use to
describe things: 'security division of an automobile wreckage site,' a.k.a. 'junkyard dog.' The first
descriptor sounds very different from the second descriptor. Similarly, using some words in
questions will inspire very different answers than using other words.
Slide 5
There are many different ways to ask the same question. Asking those questions in different
ways can yield very different responses.
Slide 6
Here are six different response formats for the same question: “How likely would you be to buy
Grandma’s peach cobbler?” In Case #1, there are five dashes, with anchors very likely and very
unlikely. In Case #2, there are the same anchors, but the numbers 1 through 5 have replaced
the dashes. In Case #3, there are five check boxes, with the first and last one labeled. In Case
#4, dashes have returned, but instead of only labeling the endpoints, the intermediate points are
labeled as well. By the way, it’s strongly recommended that you not merely provide respondents
with anchoring endpoints; you also should describe the intermediate points in meaningful
language. Case #5 looks like half boxes in which the person can check off their answers. Case
#6 shows a scale from +2 to -2 with the same anchors, very likely to very unlikely.
Slide 7
This slide is similar to the previous one, except that seven-point rather than five-point scales are
shown. In part, I included this slide for example #4 about ‘Cheer detergent is.’ Notice the scale
point descriptors: very harsh, harsh, somewhat harsh, neither harsh nor gentle, somewhat
gentle, gentle, and very gentle. You’ll need to describe each scale point for any question you
design.
Slide 8
This slide summarizes research that shows the format of a question as simple as “What is your
age?” makes a meaningful difference. The top of the slide indicates that there are three basic
ways to ask about a respondent’s age: (1) What is your age? (2) In what year were you born?
and (3) In which category does your age fall? The table at the bottom of the slide shows that
answers to the age question differed meaningfully by question format. As each set of 800
respondents that received a given question format should have similar age profiles—thanks to
the law of large numbers and random sampling—the age profiles for each group should differ
only by sampling error. Instead, people asked about their age directly tend to answer somewhat
younger than people asked the year of their birth and meaningfully younger than people asked
the category in which their age fell. The bottom of the slide shows that a meaningfully larger
percent of respondents refused to answer the direct age and year of birth questions than
refused to answer the age category question. Although it seems counterintuitive that categorical
responses are more accurate despite rounding error, their higher accuracy and lower nonresponse rate make the age category measure superior to the other age measures.
Slide 9
This figure depicts the remainder of this lecture. When talking about different types of scales,
researchers tend to divide scales into comparative and non-comparative forms. Then, within
those categories, there are sub-categories of scales. Because you’re probably most familiar with
non-comparative scales, I’ll start with them.
Slide 10
Non-comparative (or monadic) rating scales ask about a single concept. Here’s an example:
“Now that you’ve had an automobile for about one year, please tell us how satisfied you are with
its engine power and pickup.” The range of responses runs from completely satisfied to very
dissatisfied.
Slide 11
In contrast, a comparative rating scale asks respondents to rate something by comparing it to a
benchmark or series of benchmarks. An example of a comparative rating scale: “Please indicate
how the amount of authority in your present job compares with the amount of authority that
would be ideal for this job.” This question asks for a comparison between a current professional
job versus a benchmark ideal job. The responses are too much, about right, and too little.
Slide 12 (No Audio)
Slide 13
The most popular non-comparative scale is the Likert scale. I’ll first show you several examples
of Likert items. In this first example, the Likert item concerns tennis. The statement “It is more
fun to play a tough competitive tennis match than to play an easy one” is matched with
response alternatives that run from strongly agree to strongly disagree.
Slide 14
Here’s an example of several Likert-type items for assessing consumer beliefs about a
department store. The format here has the statements on the left-hand side: Duncan's
Department Store has lower prices than competitors; Merchandise displays at Duncan’s
Page | 2
Department Store are messy; Clerks at Duncan’s Department Store are not very friendly; and
The downtown Duncan’s Department Store is a convenient location. Notice that some of these
items are positive and some of these items are negative. The responses range from strongly
agree to strongly disagree and the response categories are summarized by letters instead of
numbers: 'SA' instead of 1, 'A' instead of 2, and so on. One also could use blanks or
boxes that require a checkmark. Many formats will work for Likert-type items. I recommend a
number format because it's easier to enter such data into a computer. Clearly, it's easier to type
a string of numbers than to look at boxes or blanks or letters, convert them into a number
mentally, and then type that number into a computer.
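To make that data-entry point concrete, here is a minimal Python sketch (not from the slides; the letter codes and the mapping are illustrative assumptions) of translating lettered responses into numbers before analysis:

```python
# Hypothetical mapping from lettered Likert responses to numeric codes.
# The letter codes and coding direction are assumptions for illustration;
# match them to whatever labels appear on the actual questionnaire.
letter_to_number = {"SA": 1, "A": 2, "N": 3, "D": 4, "SD": 5}

raw_responses = ["SA", "D", "N", "A", "SD"]  # one respondent's lettered answers
numeric_responses = [letter_to_number[r] for r in raw_responses]
print(numeric_responses)  # [1, 4, 3, 2, 5]
```

With a number format, this middle translation step disappears entirely.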
Slide 15
Here’s an example of several Likert-type items that use an importance scale rather than an
agreement scale. Here, respondents indicate not important, slightly important, important, very
important, or extremely important.
Slide 16
Here’s an example of several Likert-type items in which blanks are used. Again, all of these are
Likert-type items, but I recommend you use numbers instead of blanks because it’s easier to enter
the data from that type of question format.
Slide 17
To this point I’ve been careful to say that the previous slides showed Likert-type items. The Likert
scale is actually a set or series of such items. A single item is not a Likert scale, although many
researchers misuse the term and refer to single Likert-type items as Likert scales. Nonetheless,
Likert scales are multiple-item scales; hence, the notion of summing responses to the multiple
items to achieve a total score. Likert scales are very popular for many reasons, including that
they are easy, or at least relatively easy, to write and respondents are familiar with such
questions. Even if respondents ignore your instructions, as most do, they’ll still be able to
answer your questions properly.
Slide 18
This is probably the most popular format for Likert scale items. The scaling 5 to 1 could easily
be reversed as 1 to 5. Typically, scale numbers run from strongly agree to strongly disagree,
and it’s probably best to make ‘agree’ a larger number than ‘disagree’. The 10 items shown
here—such as “the commercial was soothing,” “the commercial was not entertaining,” and “the
commercial was insulting”—could be related to people’s impressions of how enjoyable it was to
view the commercial or the quality of the commercial. If all 10 items relate to the same basic
underlying notion, then we can sum people’s scores on these items to derive an overall score of
the commercial’s likeability. The assumption is, of course, that all the questions are phrased in
the same direction. Notice the items (1) “the commercial was soothing,” (2) “the commercial was
NOT entertaining,” (3) “the commercial was insulting,” (4) “the commercial was silly,” (5) “the
commercial was too ‘hard sell’,” and (6) “the characters in the commercial were realistic.” Items
#1 and #6 are positive items; people who strongly agreed with those items must have liked the
commercial. To strongly agree with items #2, #3, and #4 is to dislike the commercial. To take a
sum that would be meaningful, either the negative or the positive items would need to be reverse
scored. Reverse scoring puts all the answers in the same direction. If I were scoring all these
items in a positive direction and someone answered ‘4’ for question #2 (the commercial was NOT
entertaining), then I would enter it into the computer as a 2. If someone answered 5 to “the
commercial was insulting,” then I would score it as a 1. Reverse scoring allows a meaningful
sum of the scores across all these items to derive an overall likeability score for the commercial.
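The arithmetic behind reverse scoring is simple: on a 1-to-5 scale, a reversed answer x becomes 6 - x (more generally, max + min - x). Here is a minimal sketch, assuming—as in the lecture’s examples—that items #2, #3, and #4 are the negatively worded ones; the answers themselves are fabricated:

```python
def reverse_score(answer, scale_min=1, scale_max=5):
    """Flip an answer on a numeric rating scale: on 1-5, a 5 becomes 1 and a 4 becomes 2."""
    return scale_max + scale_min - answer

# One respondent's (fabricated) answers to the 10 commercial items.
answers = [4, 4, 5, 3, 2, 5, 1, 2, 4, 3]
negative_items = {2, 3, 4}  # 1-based numbers of the negatively worded items

rescored = [reverse_score(a) if i in negative_items else a
            for i, a in enumerate(answers, start=1)]
likeability = sum(rescored)  # overall likeability score for the commercial
print(rescored, likeability)
```

Note that reverse_score(4) returns 2 and reverse_score(5) returns 1, exactly the conversions described above.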
Slide 19
This slide illustrates what I mean by the kind of data that Likert-type items yield. You can see
that 10 different people were asked to respond to 10 different items. Let’s assume these are the
10 items from the previous slide and that the matrix shows each person’s answers. We’ll assume
that the numbers in the matrix have already been reverse scored, so that for the negative items a
‘5’ means a ‘1’ and a ‘4’ means a ‘2’. The sums in the last column show scores that run from 25 for person #4
to 40 for person #9. Later in the semester, when I talk about reliability, I’ll try to make sense of
that last row, which is labeled item-to-total correlations. Let me preface that explanation by
noting that, in deciding whether to ask all those different questions and whether summing all
those scores makes sense, researchers determine whether people’s responses to each question
are related to their responses to the other questions.
Researchers assume that if all the questions address the same underlying construct, then the
answers to those questions should be somewhat consistent. If the answers to one question are
unrelated to the answers to other questions, then that’s a problem. If the answers to one
question are strongly but negatively related to the answers to other questions, then reverse
scoring is needed. Looking at the numbers in the last row, remember that a correlation close to
+1 indicates an item is highly related to the other items. By that standard, the answers to item #4
are not especially related to the answers to the other items, whereas the answers to the other
items are highly related to one another.
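As a preview of that reliability discussion, here is a minimal sketch of how item-to-total correlations could be computed from a respondents-by-items matrix. The numbers are fabricated, and the version shown is the common ‘corrected’ variant that correlates each item with the total of the remaining items, so an item’s own score doesn’t inflate its correlation:

```python
import numpy as np

# Rows = respondents, columns = items; scores assumed already reverse scored.
# Column 4 is deliberately fabricated to be out of step with the others.
scores = np.array([
    [4, 5, 4, 2, 4],
    [3, 3, 4, 5, 3],
    [5, 4, 5, 1, 5],
    [2, 2, 3, 4, 2],
    [4, 4, 4, 3, 5],
])

totals = scores.sum(axis=1)
for item in range(scores.shape[1]):
    rest = totals - scores[:, item]  # total of the other items
    r = np.corrcoef(scores[:, item], rest)[0, 1]
    print(f"item {item + 1}: item-to-total r = {r:.2f}")
```

Run on data like the matrix in this slide, item #4’s correlation would stand out as low or negative while the other items’ correlations cluster near +1.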
Slide 20
This slide illustrates the types of statements one might develop to measure people’s attitudes
toward product quality and warranty responsibility. Some of these items are fairly wordy. The
goal is to keep items as concise as possible, yet fully explain the underlying notions. In
“Products today are built to last a long time before needing service or repair” the language is
relatively simple and it entails the issue of obsolescence and ultimately warranty responsibility.
“Too many products available to customers are unnecessarily complex” is relatively simple
language. “Customers rarely get stuck with products that don’t work, since most products today
have good guarantees” is simple as well. Although these items are imperfect, they are the type
of items that people can normally read, understand, and respond to meaningfully using the sorts
of 5- or 7-point scales with which you’re familiar.
Slide 21
I include this figure because parts of it are informative yet I also disagree with parts of it. For the
scale categories and the labels we would provide for them, the ones for quality are fine: well
above average, above average, average, below average, and well below average. The
importance categories, interest categories, satisfaction categories, and even the uniqueness
categories make sense. However, I disagree with using Likert-type scales to assess frequency
or truth. I strongly discourage using Likert-type scales for frequency because the frequency
descriptors mean vastly different things to different people. For example, I might say that I
sometimes drink coffee. What that means is that I might brew myself a couple of pots a week
and drink 6 to 8 cups each time. Somebody else might give the same answer, but mean they
drink one cup a month. Somebody else might give that same answer, but mean they drink two
cups a day. What I mean by ‘sometimes’ and what someone else means by ‘sometimes’, as it
relates to coffee consumption, may differ markedly, and the same answer shouldn’t reflect
vastly different behaviors. As for Likert-type scales and truth, I’m a logician at heart, so
something is either true or false. For a larger report, certain aspects may be true and other
aspects may be false, but to say that something is somewhat true, from a logical standpoint,
makes little sense. Thus, I also discourage using Likert-type items to assess the degree to
which people believe something is true.
Slide 22
Thurstone scales, which are to some extent related to Likert-type scales, often are ignored by
undergraduate marketing research textbooks. Such scales are valuable and provide excellent
data, but they are far harder to construct than Likert-type scales. I’ll show you why in the next
several slides.
Slide 23
When constructing a Thurstone scale, researchers create a series of items to which people can
respond either yes or no. Those items are designed and sequenced such that respondents are
increasingly likely to respond ‘yes’ as they progress from item to item. The ideal Thurstone scale
would yield a series of one answer (‘no’) followed by a series of the opposite answer (‘yes’). The
point at which responses change from ‘no’ to ‘yes’ is the point of measurement interest.
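Under that ideal pattern, scoring a respondent amounts to finding where the answers flip from ‘no’ to ‘yes’. A minimal sketch, assuming answers are coded 0 for ‘no’ and 1 for ‘yes’ and the items are already ordered:

```python
def switch_point(answers):
    """Return the 1-based position of the first 'yes' in an ideal
    no ... no, yes ... yes Thurstone response pattern (0 = no, 1 = yes),
    or None if the respondent never answers 'yes'."""
    for i, answer in enumerate(answers, start=1):
        if answer == 1:
            return i
    return None

print(switch_point([0, 0, 0, 1, 1, 1]))  # 4: this respondent switches at item #4
```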
Slide 24
To some extent, you can think of a Thurstone scale as similar in concept to a standardized
exam like the ACT or SAT. There is a range of questions, in terms of difficulty, on those types of
exams; examiners assume that they can identify the abilities of students taking the exam by
examining the pattern of responses to those questions. They assume everyone can answer the
easiest questions correctly and very few people can answer the hardest questions correctly, but
there won’t be a random sort of response in which easy questions are missed while difficult
ones are answered correctly. Think about identifying a series of questions that relate to some
topic, with each question progressively more positive or more negative. Researchers would
expect people’s responses would shift at some point as the statements become more positive.
Here’s an example of trying to form a Thurstone scale by asking a series of judges to rate the
items by the likelihood that someone would agree or disagree with them.
Slide 25
We can take those experts’ responses and average them to derive the scale values
shown in the second column. Seemingly, the item that people would be least likely to agree with
is item #3, with a scale value of 9.9. Item #2, with a scale value of 2.0, would be the item with
which they’d be most likely to agree.
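Classic Thurstone scaling typically takes the mean (or median) of the judges’ ratings for each item, which is what this sketch assumes. The ratings below are fabricated, chosen only so the resulting values echo the slide’s 2.0 and 9.9:

```python
# Hypothetical ratings from ten judges for three items on an 11-point
# favorability continuum (all numbers fabricated for illustration).
judge_ratings = {
    "item 1": [6, 7, 6, 5, 7, 6, 6, 7, 5, 6],
    "item 2": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "item 3": [10, 10, 10, 10, 10, 10, 10, 10, 10, 9],
}

for item, ratings in judge_ratings.items():
    scale_value = sum(ratings) / len(ratings)  # mean of the judges' ratings
    print(f"{item}: scale value = {scale_value:.1f}")  # 6.1, 2.0, 9.9
```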
Slide 26
This is an example of a response curve a series of Thurstone items might yield. It becomes
progressively more likely that someone will switch from ‘no’ to ‘yes’ or ‘disagree’ to ‘agree’ as
the items progress from #1 to #11. No one would agree with item #1, but everyone would agree
with item #11. Seemingly, at items #6 and #7, roughly 50% of people begin to agree.
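That curve is simply the percentage of respondents answering ‘yes’ at each item. A minimal sketch with fabricated responses:

```python
# Rows = respondents, columns = 11 Thurstone items (0 = no, 1 = yes).
# Fabricated data purely to illustrate how the response curve is computed.
responses = [
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]

n = len(responses)
for item in range(11):
    pct_yes = 100 * sum(person[item] for person in responses) / n
    print(f"item {item + 1}: {pct_yes:.0f}% agree")  # 0% at item #1, 100% at item #11
```

In this fabricated matrix, agreement crosses 50% around items #6 and #7, mirroring the curve on the slide.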
Slide 27
Another type of non-comparative scale that’s popular is the Semantic Differential (SD) scale.
With SD scales, there’s a series of bi-polar rating items. The bi-polar adjectives that anchor the
endpoints of the scale could be items such as good and bad. In entering the response data into
the computer, researchers assign a number to each scale point. Although many people refer to
SD scales as scales with bi-polar ratings, the true SD scale assumes three underlying attitudinal
dimensions that everyone, regardless of culture or language, uses to evaluate things in their
social environment. These three dimensions are evaluation, power, and activity. For a properly
constructed SD scale, all the items will relate to one of these three dimensions. However, over
time people have adapted SD-type scales so that items may not be related to one of these three
underlying dimensions.
Slide 28
Here’s an example of SD scales for measuring attitudes towards tennis. The bi-polar adjectives
are exciting versus calm, interesting versus dull, simple versus complex, and passive versus
active. When a respondent receives this scale, the instructions indicate that a checkmark or
other type of mark should be placed on the line such that the proximity of that mark to each of the
adjectives indicates that respondent’s attitude on that particular dimension.
Slide 29
Here are some SD scales for attitudes towards a jazz saxophone recording. These items relate
to audio recordings in the same way the previous items related to playing tennis.
Slide 30
The next three slides provide additional examples of SD scales. I provide these examples to
indicate the range of topics on which SD scales can be used and the range of formats that can
be used for SD scales.
Slides 31 to 32 (No Audio)
Slide 33
Although SD scales are popular, I don’t recommend them for several reasons.

Respondents will tend to misuse those scales. Unfortunately, many people don’t read
the instructions to questionnaires; as a result, instead of checking the appropriate box or
marking the appropriate area on the line between bi-polar adjectives, they merely circle
one of the endpoints. If someone circles one of those bi-polar adjectives, we cannot tell
whether (1) they meant to check off the box or the area of the line closest to that
adjective, or (2) they misread the instructions entirely. As a result, we can’t use that person’s
data in our analysis.

It’s far more difficult to construct SD scales than Likert scales. If nothing else, we’re
limited to only a few words and it’s difficult to summarize complex notions in so few
words. Likert scales permit many more words—although not an infinite number.

Negation is not necessarily an opposite. Many of the bi-polar adjectives in the previous
examples pair a word with its negation: ‘something’ and ‘not something’. Sometimes ‘not
something’ is the opposite, but other times it is not. For example, ‘not black’ includes all
the other colors: yellow, blue, green, red, and orange; the true opposite of ‘black’ is
‘white’. Unfortunately, most people who construct SD scales use negation even when it is
not a true opposite.
There’s no advantage to SD scales. Likert scales are easier to construct. Respondents are far
more familiar with them and are more likely to use them properly. That said, there’s something
in marketing research called profile analysis. I’ll show you some examples and explain why the
use of SD scales to construct profiles is not an advantage either.
Slide 34
Here’s an example of a profile analysis for three different beers. This figure summarizes
responses from numerous people on the SD items, across three different beers: a regional
brand, Miller, and Budweiser. Miller’s managers might use this display by comparing Miller’s
position to these other beers, sensing where there are meaningful gaps, and then addressing
these gaps by modifying their product or promotional efforts. In this example, Miller is perceived
as the highest-priced beer. If that perception poses a problem, then Miller’s managers might run
ads to reinforce that Miller is a reasonably priced, good-value-for-the-dollar beer, or these
managers might decide to modify their pricing policy.
Slide 35
The gaps in this figure relating major airlines to commuter airlines suggest that respondents
viewed major airlines as having quieter equipment but being more expensive and being less
polite than commuter airlines. If these are people’s general perceptions, it’s easy to understand
how a major airline might try to address them through conventional marketing means.
Slide 36
I include this example of a profile to show that not only can existing brands be compared to one
another, but a consumer’s ideal brand can be compared to existing brands. In this example for a
color television, ‘I’ represents the ideal rating for a brand. Brand ‘A’ is perceived as very
expensive and far more expensive than Brand ‘B’, yet Brand ‘B’ also is perceived as being too
expensive relative to an ideal brand.
Slide 37
Finally, here’s an image profile for a savings bank. The major gaps suggest that the present
bank is perceived as being far more old-fashioned than the ideal bank, which is perceived as
more modern. The ideal bank also is perceived as larger, more innovative, and a leader, relative
to the present bank. If the responses summarized here represented a sample of my customers,
then I might believe it’s wise to renovate my bank, update its procedures, and install new and
more modern equipment. Such changes seemingly would bring my bank more in line with the
ideal for my current customer base.
Slide 38
Although the preceding four examples suggest that profile analyses provide marketers with
much useful information, I believe these analyses tend to confuse decision making about the
best course of action. Here are three reasons why I believe this is true.

Only a few brands can be depicted. In the examples, there were never more than three
brands compared. Admittedly, these graphs were in black and white, and with color you
might be able to compare five brands. However, real markets tend to have more than
four or five competitors, so such maps are quite incomplete. There are alternative
mapping procedures in marketing that can depict far more than five brands, and I
encourage people to use those perceptual maps rather than profile analyses.

The attributes are not necessarily independent of one another. In the other mapping
procedures in marketing, there are ways to guarantee that the underlying dimensions on
which we’re assessing things are independent. In this case, I could be asking three or
four questions that I’m unaware relate to the same underlying notion. In part, by not
ensuring the attributes are independent, I may be confusing a marketing manager about
modifying his or her brand.

Finally, the profiles don’t indicate which attributes are of greater or lesser importance. In
the previous banking example, it seems that modern and innovative are major gaps that
the bank needs to address, yet it’s quite possible that bank customers view both gaps as
trivial. Perhaps customers care about and hence patronize this bank because it offers
high-quality personal service. Without a way of knowing which attributes are important
or unimportant to customers, it’s impossible to interpret these profiles meaningfully.
As a result, there’s no reason for you to use an SD scale when a Likert scale is easier to
construct, is less likely to be misused by respondents, and will yield the types of maps that are
far more informative than profile analyses.
Slide 39
Another example of a non-comparative scale is a Stapel scale. As this slide indicates, the goal
with a Stapel scale is to avoid the difficulty of constructing scales with bi-polar adjectives;
instead, a single adjective is used. All the advantages, disadvantages, and results of using
Stapel scales are similar to those of SD scales, although Stapel scales are easier to construct
and administer. I’ve already argued against the use of SD scales, so I merely present this
alternative as deep background.
Slide 40
The next three slides provide three examples of Stapel scales: one for Bloomingdale’s, one for
Kmart, and one for compact cars. Please note the extensive instructions associated with a
Stapel scale. Respondents are unfamiliar with such scales; as a result, they are unlikely to use
them properly.
Slide 41
For Stapel scales, like the one for Kmart, I find respondents ignore the instructions to circle the
appropriate positive or negative number. Notice the Stapel scale here runs from +5 to -5. In the
other example, it ran from +3 to -3. In the next example, it’s formatted in an entirely different
way. Stapel scales can have different numbers of scale points. However, I’ve found that many
respondents confronted with Stapel scales circle the adjective or phrase they believe is most
descriptive of the object in question, rather than circle the appropriate scale number. If they
believe that Kmart provides low prices but has a cold atmosphere, then they’ll circle those two
phrases but ignore the ‘friendly employees’ and ‘slow service’ scales.
Slide 42
One textbook author, about fifteen years ago, believed that this constituted a Stapel scale and I
have no reason to believe otherwise. Notice the scale’s format is vastly different. Respondents
are given a series of adjectives and then a scale of 1 to 7, indicating the degree to which each
adjective describes compact cars, from ‘not at all’ to ‘perfectly’. Again, it’s far easier to construct a Likert
scale and I urge you to use those rather than scales formatted this way.