Questionnaire Scales: Part 2
Slide 1
Part two of this lecture on questionnaire scales continues with a discussion of non-comparative and comparative scales.
Slide 2
If you’ve made it through Part 1 of this lecture, then you know that I urge you to use Likert scales because they’re the easiest to construct, the least confusing to respondents, and functionally equivalent to any other scale format you might select. What are the formatting issues for designing non-comparative scales in general, and Likert scales in particular? There are five issues.
First, you should provide a verbal description for each category, and those descriptions must be concise and precise.
Second, you’ll need to choose the number of categories for the Likert-type items. Discriminating between people (and in marketing, discrimination is a good word because it means differentiating groups of people according to their preferences) may require spotting subtle differences. If you write questions with few response choices, then it’ll be difficult to identify distinct groups of people. The rule of thumb is that scale items should have at least four categories, and typically five to nine. If you provide more than nine categories, people will be unable to make clear distinctions, such as the difference between ‘15’ and ‘16’ on a 20-point scale.
Third, you’ll need to choose either a balanced or an unbalanced scale. By balanced, I mean an equal number of positive and negative scale points; unbalanced means an unequal number of those points. Conventional wisdom dictates that you use balanced scales unless you know that respondents tend to answer toward one end of the scale. This is an issue in ethics research: due to social desirability bias, many respondents tend to answer toward the positive end of the scale. Adding more points to the positive end spreads those respondents out and makes it easier to differentiate among the people crowding that end of the scale.
Fourth, you’ll need to decide whether to use an odd or an even number of categories, or scale points. This is a somewhat arbitrary decision. I recommend that you use an odd number of scale points only if respondents could be truly neutral or indifferent to a Likert-type statement. By using an even number of scale points, you force respondents to fall on one side of the fence or the other. If you provide an odd number with a middle, neutral point, respondents can become lazy and answer ‘neutral’ instead of carefully considering whether they are slightly more favorable or unfavorable toward the statement.
Finally, you’ll need to decide whether to force respondents to answer your question. By force, I mean excluding a ‘don’t know’ answer option. Without a ‘don’t know’ option, people who have no opinion often circle the midpoint of the scale, thus confusing lack of knowledge with indifference. If you believe respondents could be unknowledgeable about the statement, then you should allow for a ‘don’t know’ response. I’ll show examples of all five issues on the next slide.
Slide 3
In this first example about the taste of Wonder Bread, the five scale points are balanced. It’s a forced-choice item because you’re not giving respondents the option of saying they don’t know, and there’s an odd number of scale points. This scale assumes that someone could be indifferent toward the taste of Wonder Bread. In the second example about Ultra Bright Toothpaste, the scale is balanced; the number of positive and negative statements is identical. It’s also forced choice, but now there’s an even number of intervals. This scale assumes that respondents can have either a somewhat positive or a somewhat negative opinion, but cannot be indifferent about Ultra Bright Toothpaste. In the third example about the reaction to an ad, it’s an unbalanced scale because there are three favorable statements and only one negative statement.
It’s a forced-choice item because respondents don’t have the option of answering ‘don’t know’, and there’s an odd number of intervals. Finally, in the last example about a Sears downtown store, the item is balanced in the sense that there are as many positive as negative statements. It’s not forced response because there’s an ‘I don’t know’ option, and there’s an odd number of intervals excluding the ‘don’t know’ response, which lies off the scale continuum. All of these formats are perfectly reasonable.
Slide 4
As I mentioned in an earlier lecture, we often develop multiple items to assess objects on a given attribute. This slide summarizes the approach for developing a multi-item scale to assess a single construct like store image. First, I’d review the theoretical work on store image. Based on that work, I’d generate a large pool of items suggested by theory, secondary data, and any qualitative research. Next, I’d select a reduced set of items based on expert judges. For example, I might develop a set of 40 items and then ask several colleagues to examine those items and select the ones they believe best represent the construct. Then, I’d administer that reduced set of items to a sample of respondents, analyze their responses, and reduce the set further to the items that constitute my final scale. The technical aspects of the requisite statistical analysis will be addressed in a subsequent lecture.
Slide 5
Returning to single-item scales, here’s an example of one you might find in many marketing research questionnaires: a purchase-intention scale. The top scale contains five points and the bottom scale contains 11 points.
Slide 6
Here are alternative formats for purchase-intent questions that relate to my comments about question formatting. In the first example, the scale is balanced and has a neutral point. In the second example, it’s balanced without a neutral point.
In the third example, it’s balanced but not forced because there’s a ‘don’t know’ answer choice. (I’m dubious about placing that choice third on the scale because that placement implies it’s part of the continuum rather than a separate option.) The fourth example is a graphic scale; the fifth example is dichotomous, in the sense that there are only two choices: would or would not buy; and in the last example, the purchase-intent scale is unbalanced because more of its scale points relate to the possibility of purchase.
Slide 7
Graphic rating scales present respondents with a graphic continuum and ask them to respond accordingly. Graphic rating scales can be useful in several circumstances, in particular when respondents’ language capabilities are suspect.
Slide 8
Language isn’t an issue for this ladder scale, but it serves as an analogy for the way people think about life as climbing a ladder to success. The top of the ladder represents the best possible life and the lowest rung represents the worst possible life. In a way, this graphic symbolizes the underlying construct.
Slide 9
Here’s an example of a thermometer scale, used to evaluate the quality of food at a restaurant called Outpost’s Steak n’ Fries. I’m uncertain why researchers would use such a scale, other than the chance that its novelty increases the response rate.
Slide 10
The next three slides present scales that younger children might use to indicate their attitudes toward an object. Younger children’s verbal abilities may be minimal; as a result, these types of scales may provide more accurate assessments of their attitudes. For example, I assume that respondents to the first scale, which asks “How much did you like the boy in the commercial?”, are meant to circle one of three pictures: smile, neutral, or frown. This reply should indicate that respondent’s assessment of the boy in the commercial.
Slide 11
Here is a smiling-face scale.
Although verbal instructions are present, the child doesn’t read them; instead, an interviewer reads these instructions aloud: “Tell me how much you like the Pull-back teddy bear by pointing to the face that best shows how much you like it. If you did not like the Pull-back teddy bear at all, you should point to Face 1. If you liked it very much, you should point to Face 4. Now, how much did you like the Pull-back teddy bear?” Young children could respond to a question in this format.
Slide 12
Graphic scales could be used for children or for adults with language limitations. Respondents can indicate their reaction with a number of stars, where five stars means ‘really liked it’ and one star means ‘really hated it’, or with stick figures, where the figure with open arms means ‘liked it a lot’ and the figure with the thumb pointing downward means ‘didn’t like it at all’. This is very similar to Slide 10, which asked how much the child liked the boy in the commercial.
Slide 13
Ignoring constant-sum scales for the moment, this table provides a good summary of the relative advantages and disadvantages of the different scales I’ve discussed to this point.
Slide 14
As a quick reminder, non-comparative scales ask respondents to consider one attribute or one object at a time, whereas comparative scales ask respondents to consider multiple attributes or multiple objects at one time.
Slide 15
Ranking scales are a type of comparative scale. Here’s an example of a rank-order scale for eye shadow. Six different brands are ranked on three different characteristics: the quality of the container, the quality of the applicator, and the quality of the eye shadow itself. This is as complex a ranking task as I’d recommend you ask respondents to perform. Ranking more than a half-dozen things on a given attribute is probably too difficult for most respondents. Although such scales reliably indicate the most preferred (or highest ranked) and the least preferred (or lowest ranked) objects, the rankings for all other objects are unreliable.
Slide 16
Here’s the type of data we might collect for rank ordering of four items. In this case, 10 people have been asked to rank order four items: A, B, C, and D. Person #1 ranked ‘B’ most preferred, ‘A’ second-most preferred, ‘C’ third-most preferred, and ‘D’ least preferred. Similarly, Persons #2 through #10 ranked the same four items.
Slide 17
As I mentioned in the lecture on levels of measurement, researchers must analyze rank-order data carefully: it is neither interval nor ratio scaled, so it can’t be analyzed with traditional parametric statistics. Researchers can’t take the mean of the ranks and say object ‘A’ has the highest mean rank. Instead, they must create tabulations like the one depicted on this slide. For the four brands A, B, C, and D, this table summarizes the previous data table by presenting the number of times each brand is ranked first, second, third, and fourth. This is a meaningful and statistically correct way to summarize the data on the previous slide.
Slide 18
Paired-comparison scales have certain favorable psychometric properties relative to ranking scales. Respondents are presented with two objects at a time and asked to pick the one they prefer. This is a psychometrically simple task, so almost all respondents can perform it properly. Think about when you’ve been asked to compare audio speakers. After the salesperson asked you about your budget and the type of music you like, he or she ushered you into a listening room, picked two different pairs of speakers, and then played music of the type you like, first on one set of speakers, then on the other, then back to the first set, then back to the second, et cetera. Going back and forth allows you to compare effortlessly; even people with untrained ears can hear differences when asked to compare speaker system #1 to speaker system #2.
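The tabulation described for Slide 17 can be sketched in Python. This is a minimal illustration, not the slide’s actual data: the helper name `tabulate_ranks` is hypothetical, and only Person #1’s ranking (B, A, C, D) comes from the lecture; the other two respondents are invented.

```python
# Minimal sketch: summarize rank-order data as counts per rank position (Slide 17)
# instead of averaging ranks. `tabulate_ranks` is a hypothetical helper name.
# Only Person #1's ranking comes from the lecture; the others are invented.

RANKINGS = [
    ["B", "A", "C", "D"],  # Person #1: B most preferred ... D least preferred
    ["A", "B", "D", "C"],  # hypothetical respondent
    ["B", "C", "A", "D"],  # hypothetical respondent
]


def tabulate_ranks(rankings, brands):
    """Return {brand: [times ranked 1st, times ranked 2nd, ...]}."""
    table = {brand: [0] * len(brands) for brand in brands}
    for ranking in rankings:
        # Each ranking must use every brand exactly once (a complete rank order).
        if sorted(ranking) != sorted(brands):
            raise ValueError(f"incomplete or invalid ranking: {ranking}")
        for position, brand in enumerate(ranking):
            table[brand][position] += 1
    return table


if __name__ == "__main__":
    brands = ["A", "B", "C", "D"]
    table = tabulate_ranks(RANKINGS, brands)
    print("Brand  1st  2nd  3rd  4th")
    for brand in brands:
        print(f"{brand:<6}" + "".join(f"{count:>5}" for count in table[brand]))
```

Each rank-position column sums to the number of respondents, so the table stays a count of ordinal judgments rather than an average of ranks.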
However, if that same salesperson asked you to compare five different speaker systems at one time, you’d be hard-pressed to do so well. By the time you were listening to speaker system #3, you’d no longer recall how speaker system #1 sounded. People can easily respond about two things at a time; it’s beyond most people’s ability to respond meaningfully about four or five things at a time. Paired-comparison scales create more reliable rank-order data, with one proviso, which is specified in the second bullet point: ranking things from most to least preferred, or from most to least important, often requires a large number of scales. Assume 10 brands that respondents must rank from most to least preferred. For a ranking scale, the 10 items would be listed and people asked to write a number from 1 through 10 next to each item according to how they rank it. Although a seemingly straightforward task, people won’t do it well because they’re being asked to compare too many things at the same time. This ranking question could be asked as a set of paired-comparison scales: Which do you prefer, 1 or 2? 1 or 3? 1 or 4? et cetera. To complete this task, people would respond, as the formula indicates, to (10 × 9)/2 = 45 separate questions. Instead of filling in ten numbers, they would need to answer 45 separate questions, which can be fatiguing. Once respondents become fatigued, they’ll no longer carefully discriminate between the objects, and their answers will become unreliable, thus defeating the purpose of using paired-comparison scales instead of rank-order scales. I recommend that you never ask people to rank more than a half-dozen things at a time. If you want them to rank up to 10 or 11 things, consider paired-comparison scales. If you want them to rank more than 10 or 11 things, there are alternatives I’ll discuss in a subsequent lecture.
Slide 19
Here’s an example of using paired-comparison scales to plan an ad for a restaurant.
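The question count in Slide 18 follows from comparing each of n items with every other item exactly once, which takes n(n − 1)/2 questions. The sketch below checks that count and shows one simple way to turn paired-comparison choices into a rank order by counting how many pairs each item wins. The win-counting rule, the helper names `number_of_pairs` and `rank_by_wins`, and the simulated respondent are all assumptions for illustration, not the lecture’s method.

```python
from itertools import combinations
from math import comb


def number_of_pairs(n_items: int) -> int:
    """Questions needed to compare every item with every other item once: n(n - 1) / 2."""
    return n_items * (n_items - 1) // 2


def rank_by_wins(items, choose):
    """Present every pair once; `choose(a, b)` returns the preferred item.

    Items are then ranked by how many pairs they won (one simple scoring rule, assumed here).
    """
    wins = {item: 0 for item in items}
    for a, b in combinations(items, 2):
        wins[choose(a, b)] += 1
    ranking = sorted(items, key=lambda item: wins[item], reverse=True)
    return ranking, wins


if __name__ == "__main__":
    brands = [f"Brand {i}" for i in range(1, 11)]  # 10 hypothetical brands
    print(number_of_pairs(len(brands)), comb(len(brands), 2))  # both give 45

    # Hypothetical respondent who always prefers the higher-numbered brand.
    def prefers_higher_number(a, b):
        return max(a, b, key=lambda name: int(name.split()[1]))

    ranking, wins = rank_by_wins(brands, prefers_higher_number)
    print(ranking[:3], wins[ranking[0]])
```

For the six restaurant features in Slide 19, `number_of_pairs(6)` gives 15, which is the size of a complete paired-comparison table for that example.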
Restaurants have different features, such as type of food, fun place to go, prices, location, service, and atmosphere. If people rank those six things from most to least important, the things ranked most and least important will be ranked reliably, but not the other four things. In designing an ad, the restaurant owner would like to know the most important attributes people consider when selecting a restaurant. Instead of asking respondents to rank those six things, the researcher could provide them with this paired-comparison table and ask them which is more important: type of food or service; fun place to go or quality of food. This paired-comparison task is relatively simple from a psychometric standpoint.
Slide 20
Here’s another example of an abbreviated paired-comparison set for suntan products. I include this slide to show that the instructions for these types of questions are relatively simple. Respondents are likely to read such instructions and respond accordingly.
Slide 21
(No Audio)
Slide 22
Another type of comparative scale is the constant-sum scale. As I mentioned in the lecture on levels of measurement, constant-sum scales have one very favorable property: they generate ratio-scale data. In this example, respondents are asked to allocate 100 points across seven characteristics of tennis sportswear. If ‘comfortable to wear’ received 20 points and ‘made in the USA’ received 10 points, it’s safe to say that comfortable to wear is twice as important as made in the USA. That level of analysis is unavailable when dealing with nominal, ordinal, or interval scaled data. One limitation of constant-sum scales is that most respondents will be unfamiliar with them and may not read the instructions properly; as a result, they may simply place check marks in the Number of Points column next to the features they believe are most important. Such data are unusable for subsequent analysis.
Slide 23
Here’s an example of a constant-sum scale for automobiles.
Although the points won’t always sum properly, it’s easy to norm them; for example, if someone inadvertently allocates only 80 points, then multiplying all the points by 5/4 (that is, 100/80) creates a sum of 100 points. Such norming allows that person’s responses to be added to other people’s responses in a meaningful way. Administering a constant-sum scale via the Internet eliminates this problem because the software can be programmed to norm the data and force it to sum to 100 points.
Slide 24
Here’s an example of a poor constant-sum scale that I found in a marketing research textbook I used several years ago. It asks respondents to allocate 10 points according to the last 10 times they purchased shampoo. This scale is flawed in two ways. First, people will not recall which shampoo they bought nine or ten purchases ago. Given how often people buy shampoo (perhaps once every three to four months), you’re asking them to recall shampoo they purchased roughly three years ago. The likelihood of remembering that correctly is very low; hence, the time horizon for this question is problematic. Second, the 10 points must be allocated across far too many items. If you wonder why I sometimes show you poorly formatted scales, it’s because they’re the best tool for explaining what you should avoid.
Slide 25
Here’s an example of a weighted paired-comparison scale, called constant sum with paired comparison. The instructions ask respondents to divide 11 points between each pair of hand and body lotions. Points are divided so that the more preferred item receives more points than the less preferred item, in proportion to the degree of preference. No item can receive more than 11 points. Such data reveal what is preferred (A or B) and the degree to which it’s preferred.
Slide 26
Q sorts are a method for sorting a large number of things, as will be illustrated on the next two slides.
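Returning briefly to the constant-sum scales of Slides 22 and 23, the norming step can be sketched as rescaling each respondent’s allocation by 100 divided by the points actually allocated, before averaging across respondents. The helper names `norm_to_total` and `average_allocations` and both respondents’ allocations are hypothetical; only the feature names ‘comfortable to wear’ and ‘made in the USA’ come from the lecture.

```python
def norm_to_total(points, total=100.0):
    """Rescale one respondent's constant-sum allocation so it sums to `total`.

    Example: a respondent who allocated only 80 points is multiplied by 100/80 = 5/4.
    """
    allocated = sum(points.values())
    if allocated <= 0:
        raise ValueError("no points allocated; this response cannot be normed")
    factor = total / allocated
    return {feature: value * factor for feature, value in points.items()}


def average_allocations(responses, total=100.0):
    """Norm every response first, then average each feature across respondents.

    Assumes every response lists the same features.
    """
    normed = [norm_to_total(response, total) for response in responses]
    features = normed[0].keys()
    return {f: sum(r[f] for r in normed) / len(normed) for f in features}


if __name__ == "__main__":
    # Hypothetical respondents; the first allocated only 80 of the 100 points.
    respondent_1 = {"comfortable to wear": 16, "made in the USA": 8, "other features": 56}
    respondent_2 = {"comfortable to wear": 30, "made in the USA": 10, "other features": 60}

    normed = norm_to_total(respondent_1)
    print(normed)
    # Ratio-scale reading: comfort is twice as important as origin for this respondent.
    print(normed["comfortable to wear"] / normed["made in the USA"])
    print(average_allocations([respondent_1, respondent_2]))
```

Norming before averaging keeps a respondent who under-allocated from counting for less than one who allocated the full 100 points.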
Slide 27
Suppose we want people to sort or rank 75 magazines from most to least preferred. Such a task is impractical with a paired-comparison approach (it would require (75 × 74)/2 = 2,775 questions), and the data we’d receive from a traditional ranking approach would be highly unreliable. How can we best identify the most and least preferred magazines out of a set of 75? One possibility is to give people a mechanical sorting task. In this case, people receive a deck of cards, and on each card is a picture of a magazine. The instructions read “Please choose the nine magazines you most prefer of the 75. Once you’ve selected the nine most preferred, please list the magazine names on the form in the column headed Most Preferred. Now select the next nine.” This is one way to sort the magazines. Another way to run a Q sort is to provide the pile of 75 cards and ask people to divide the cards into two piles, a more preferred pile and a less preferred pile, which essentially asks for repeated paired comparisons: here’s a magazine; put it in one of two categories. After they’ve sorted the 75 cards into more and less preferred piles, ask them to take the more preferred pile and divide it into two piles: the most preferred of the more preferred and the less preferred of the more preferred. By mechanically sorting cards in this fashion, people are making repeated binary comparisons; in other words, they are breaking down the ranking of 75 items into a series of paired comparisons. This format makes the task doable for respondents.
Slide 28
Here’s another example of a Q sorting task. In this case, respondents are asked to sort 100 bank advertising slogans from most unique to least unique. The advertiser assumes that the more unique slogans will be more memorable and hence more effective, and the less unique slogans will be less memorable and hence less effective. Both this example and the previous example require manual sorting of physical cards.
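The binary-split Q sort from Slide 27 can be sketched as a recursive procedure: the respondent splits a pile into more and less preferred halves, then splits each resulting pile again for a fixed number of rounds. This is a sketch under stated assumptions: `q_sort_by_halves` and `simulated_respondent_split` are hypothetical names, the 75 magazines are placeholders, and the simulated respondent simply prefers higher-numbered magazines so the result can be checked.

```python
from typing import Callable, List, Tuple

Pile = List[str]
Splitter = Callable[[Pile], Tuple[Pile, Pile]]


def q_sort_by_halves(pile: Pile, split: Splitter, rounds: int) -> List[Pile]:
    """Binary-split Q sort: `split` divides a pile into (more, less) preferred,
    and each resulting pile is divided again, for `rounds` rounds.

    Returns the final piles ordered from most to least preferred.
    """
    if not pile:
        return []
    if rounds == 0 or len(pile) == 1:
        return [pile]
    more, less = split(pile)
    return q_sort_by_halves(more, split, rounds - 1) + q_sort_by_halves(less, split, rounds - 1)


def simulated_respondent_split(pile: Pile) -> Tuple[Pile, Pile]:
    """Hypothetical respondent who prefers higher-numbered magazines and splits piles in half."""
    ordered = sorted(pile, key=lambda name: int(name.split()[-1]), reverse=True)
    half = (len(ordered) + 1) // 2
    return ordered[:half], ordered[half:]


if __name__ == "__main__":
    magazines = [f"Magazine {i:02d}" for i in range(1, 76)]  # 75 placeholder cards
    piles = q_sort_by_halves(magazines, simulated_respondent_split, rounds=3)
    print([len(p) for p in piles])
    print(piles[0])
```

Three rounds of halving turn 75 cards into eight ordered piles of nine or ten cards, so each judgment the respondent makes is only ever “more or less preferred than the rest of this pile.”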
Q sorts also can be performed online with computer software. Respondents don’t need access to physical cards; the sorting process could be done in a virtual space.
Slide 29
(No Audio)
Slide 30
The dollar-metric scale is one of my favorite scales because it yields high-quality data and it’s easy for respondents to use. Here, the question relates to different types of containers for fruit juice. Respondents are asked to indicate which of two containers they prefer and then how much more they’d be willing to pay for juice delivered in that type of container (relative to the unchosen container). This type of information can help marketers make sound design decisions. In this example, the glass container is preferred to the can by $0.07. If a juice producer decided to introduce glass containers and charged only an additional $0.05 for them (relative to the juice sold in cans), then customers would be likely to buy the juice in glass because they’d be receiving a bargain: they’re willing to pay $0.07 more for juice in a glass container but are being asked to pay only $0.05 more. Dollar-metric data can be used with cost data to help marketers optimize the design of their products.
Slide 31
Magnitude-estimation scales combine features of constant-sum scales and paired-comparison scales. In this case, people are asked, on a scale of 0 to 100, to indicate the relative degree to which they agree or disagree with a certain statement.
Slide 32
If we consider the endpoints of each line as two different notions, then line-marking scales are another example of comparative scales. In this marking scale, the proximity of the X to each endpoint indicates the degree to which respondents believe that endpoint describes the object.
Slide 33
Here’s a summary of the relative advantages and disadvantages of various comparative scales.
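Returning to the dollar-metric scale on Slide 30, the bargain reasoning can be sketched by averaging respondents’ stated premiums and checking how many would pay more than the premium actually charged. The responses below are invented so that their average matches the slide’s 7 cents; the signed encoding (positive means extra paid for glass, negative means extra paid for the can) and the helper name `share_getting_bargain` are assumptions for illustration. Working in whole cents avoids floating-point rounding on money.

```python
from statistics import mean

# Hypothetical dollar-metric answers, in cents, for glass versus can.
# Positive: extra a respondent would pay for glass; negative: extra they would pay for the can.
GLASS_PREMIUMS_CENTS = [10, 7, 5, 12, -3, 8, 9, 4, 6, 12]


def share_getting_bargain(premiums_cents, price_premium_cents):
    """Fraction of respondents whose stated premium exceeds the premium actually charged."""
    return sum(p > price_premium_cents for p in premiums_cents) / len(premiums_cents)


if __name__ == "__main__":
    average = mean(GLASS_PREMIUMS_CENTS)
    print(f"Average premium for glass: {average} cents")
    share = share_getting_bargain(GLASS_PREMIUMS_CENTS, 5)
    print(f"Share who would see a 5-cent premium as a bargain: {share:.0%}")
```

The average alone says glass is worth about 7 cents more; the per-respondent view shows how that value is spread, which is what a marketer would weigh against the added cost of glass.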
Slide 34
(No Audio)
Slide 35
To briefly recap this lecture on questionnaire scales, I described the various non-comparative and comparative scales that can be used in a questionnaire. For non-comparative scales, I gave many reasons for preferring Likert-type scales. For comparative scales, I recommended that you always keep the respondent’s task reasonable. Rank-order scales are acceptable for ranking a few items. Paired-comparison scales are preferred for ranking more than a few items. If many items must be ranked, then a Q sort is required.