Questionnaire Scales: Part 1
Slide 1
The goal of this lecture is to inform you about the different types of question formats that you
could use in questionnaires.
Slide 2
As the note to the right indicates, researchers can develop a broad range of scales for
measuring things. I included this Beaufort wind force scale as one example of a type of scale
you might not have thought existed.
Slide 3
Although I will try in this lecture to structure the different types of scales as much as possible,
the point of showing you this cartoon—perhaps my favorite marketing research cartoon—is that
it never hurts to be creative when trying to design a scale. Although being creative differs from
confusing respondents with bizarre question formats, there can be a fine line between confusing
and effective questions.
Slide 4
One key point of this lecture is that the precise language you use in your questions is important.
Different words will elicit different answers. You really have to think about each word that you
include in your questions. The cartoon illustrates the euphemisms that we sometimes use to
describe things: 'security division of an automobile wreckage site,' a.k.a. 'junkyard dog.' The first
descriptor sounds very different from the second descriptor. Similarly, using some words in
questions will inspire very different answers than using other words.
Slide 5
There are many different ways to ask the same question. Asking those questions in different
ways can yield very different responses.
Slide 6
Here are six different response formats for the same question: “How likely would you be to buy
Grandma’s peach cobbler?” In Case #1, there are five dashes, with anchors very likely and very
unlikely. In Case #2, there are the same anchors, but the numbers 1 through 5 have replaced
the dashes. In Case #3, there are five check boxes, with the first and last one labeled. In Case
#4, dashes have returned, but instead of only labeling the endpoints, the intermediate points are
labeled as well. By the way, it’s strongly recommended that you not merely provide respondents
with anchoring endpoints; you also should describe the intermediate points in meaningful
language. Case #5 looks like half boxes in which the person can check off their answers. Case
#6 shows a scale from +2 to -2 with the same anchors, very likely to very unlikely.
Slide 7
This slide is similar to the previous one, except that seven-point rather than five-point scales are
shown. In part, I included this slide for example #4 about ‘Cheer detergent is.’ Notice the scale
point descriptors: very harsh, harsh, somewhat harsh, neither harsh nor gentle, somewhat
gentle, gentle, and very gentle. You’ll need to describe each scale point for any question you
design.
Slide 8
This slide summarizes research that shows the format of a question as simple as “What is your
age?” makes a meaningful difference. The top of the slide indicates that there are three basic
ways to ask about a respondent’s age: (1) What is your age? (2) In what year were you born?
and (3) In which category does your age fall? The table at the bottom of the slide shows that
answers to the age question differed meaningfully by question format. As each set of 800
respondents that received a given question format should have similar age profiles—thanks to
the law of large numbers and random sampling—the age profiles for each group should differ
only by sampling error. Instead, people asked about their age directly tend to answer somewhat
younger than people asked the year of their birth and meaningfully younger than people asked
the category in which their age fell. The bottom of the slide shows that a meaningfully larger
percent of respondents refused to answer the direct age and year of birth questions than
refused to answer the age category question. Although it seems counterintuitive that categorical
responses are more accurate despite rounding error, their higher accuracy and lower nonresponse rate make the age category measure superior to the other age measures.
Slide 9
This figure depicts the remainder of this lecture. When talking about different types of scales,
researchers tend to divide scales into comparative and non-comparative forms. Then, within
those categories, there are sub-categories of scales. Because you’re probably most familiar with
non-comparative scales, I’ll start with them.
Slide 10
Non-comparative (or monadic) rating scales ask about a single concept. Here’s an example:
“Now that you’ve had an automobile for about one year, please tell us how satisfied you are with
its engine power and pickup.” The range of responses runs from completely satisfied to very
dissatisfied.
Slide 11
In contrast, a comparative rating scale asks respondents to rate something by comparing it to a
benchmark or series of benchmarks. An example of a comparative rating scale: “Please indicate
how the amount of authority in your present job compares with the amount of authority that
would be ideal for this job.” This question asks for a comparison between a current professional
job versus a benchmark ideal job. The responses are too much, about right, and too little.
Slide 12 (No Audio)
Slide 13
The most popular non-comparative scale is the Likert scale. I’ll first show you several examples
of Likert items. In this first example, the Likert item concerns tennis. The statement “It is more
fun to play a tough competitive tennis match than to play an easy one” is matched with
response alternatives that run from strongly agree to strongly disagree.
Slide 14
Here’s an example of several Likert-type items for assessing consumer beliefs about a
department store. The format here has the statements on the left-hand side: Duncan's
Department Store has lower prices than competitors; Merchandise displays at Duncan’s
Page | 2
Department Store are messy; Clerks at Duncan’s Department Store are not very friendly; and
The downtown Duncan’s Department Store is a convenient location. Notice that some of these
items are positive and some of these items are negative. The responses range from strongly
agree to strongly disagree and the response categories are summarized by letters instead of
numbers: 'SA' instead of 1, 'A' instead of 2, and so on. One also could use blanks or
boxes that require a checkmark. Many formats will work for Likert-type items. I recommend a
number format because it's easier to enter such data into a computer. Clearly, it's easier to type
a string of numbers than to look at boxes or blanks or letters, convert them into a number
mentally, and then type that number into a computer.
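To make that data-entry point concrete, here is a minimal Python sketch (not from the slides; the letter codes and the mapping are illustrative assumptions) of translating lettered responses into numbers before analysis:

```python
# Hypothetical mapping from lettered Likert responses to numeric codes.
# The letter codes and coding direction are assumptions for illustration;
# match them to whatever labels appear on the actual questionnaire.
letter_to_number = {"SA": 1, "A": 2, "N": 3, "D": 4, "SD": 5}

raw_responses = ["SA", "D", "N", "A", "SD"]  # one respondent's lettered answers
numeric_responses = [letter_to_number[r] for r in raw_responses]
print(numeric_responses)  # [1, 4, 3, 2, 5]
```

With a number format, this middle translation step disappears entirely.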
Slide 15
Here’s an example of several Likert-type items that use an importance scale rather than an
agreement scale. Here, respondents indicate not important, slightly important, important, very
important, or extremely important.
Slide 16
Here’s an example of several Likert-type items in which blanks are used. Again, all of these are
Likert-type items, but I recommend you use numbers instead of blanks because it’s easier to enter
the data from that type of question format.
Slide 17
To this point I’ve been careful to say that the previous slides showed Likert-type items. The Likert
scale is actually a set or series of such items. A single item is not a Likert scale, although many
researchers misuse the term and refer to single Likert-type items as Likert scales. Nonetheless,
Likert scales are multiple-item scales; hence, the notion of summing responses to the multiple
items to achieve a total score. Likert scales are very popular for many reasons, including that
they are easy, or at least relatively easy, to write and respondents are familiar with such
questions. Even if respondents ignore your instructions, as most do, they’ll still be able to
answer your questions properly.
Slide 18
This is probably the most popular format for Likert scale items. The scaling 5 to 1 could easily
be reversed as 1 to 5. Typically, scale numbers run from strongly agree to strongly disagree,
and it’s probably best to make ‘agree’ a larger number than ‘disagree’. The 10 items shown
here—such as “the commercial was soothing,” “the commercial was not entertaining,” and “the
commercial was insulting”—could be related to people’s impressions of how enjoyable it was to
view the commercial or the quality of the commercial. If all 10 items relate to the same basic
underlying notion, then we can sum people’s scores on these items to derive an overall score of
the commercial’s likeability. The assumption is, of course, that all the questions are phrased in
the same direction. Notice the items (1) “the commercial was soothing,” (2) “the commercial was
NOT entertaining,” (3) “the commercial was insulting,” (4) “the commercial was silly,” (5) “the
commercial was too ‘hard sell’,” and (6) “the characters in the commercial were realistic.” Items
#1 and #6 are positive items; people who strongly agreed with those items must have liked the
commercial. To strongly agree with items #2, #3, and #4 is to dislike the commercial. To take a
sum that would be meaningful, either the negative or the positive items would need to be reverse
scored. Reverse scoring puts all the answers in the same direction. If I were scoring all these
items in a positive direction and someone answered ‘4’ for question #2 (the commercial was NOT
entertaining), then I would enter it into the computer as a 2. If someone answered 5 to “the
commercial was insulting,” then I would score it as a 1. Reverse scoring allows a meaningful
sum of the scores across all these items to derive an overall likeability score for the commercial.
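The arithmetic behind reverse scoring is simple: on a 1-to-5 scale, a reversed answer x becomes 6 - x (more generally, max + min - x). Here is a minimal sketch, assuming—as in the lecture’s examples—that items #2, #3, and #4 are the negatively worded ones; the answers themselves are fabricated:

```python
def reverse_score(answer, scale_min=1, scale_max=5):
    """Flip an answer on a numeric rating scale: on 1-5, a 5 becomes 1 and a 4 becomes 2."""
    return scale_max + scale_min - answer

# One respondent's (fabricated) answers to the 10 commercial items.
answers = [4, 4, 5, 3, 2, 5, 1, 2, 4, 3]
negative_items = {2, 3, 4}  # 1-based numbers of the negatively worded items

rescored = [reverse_score(a) if i in negative_items else a
            for i, a in enumerate(answers, start=1)]
likeability = sum(rescored)  # overall likeability score for the commercial
print(rescored, likeability)
```

Note that reverse_score(4) returns 2 and reverse_score(5) returns 1, exactly the conversions described above.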
Slide 19
This slide illustrates what I mean by the kind of data that Likert-type items yield. You can see
that 10 different people were asked to respond to 10 different items. Let’s assume these are the
10 items from the previous slide and that the matrix shows each person’s answers. We’ll assume
that the numbers in the matrix have already been reverse scored, so that for the negative items a
‘5’ means a ‘1’ and a ‘4’ means a ‘2’. The sums in the last column show scores that run from 25 for person #4
to 40 for person #9. Later in the semester, when I talk about reliability, I’ll try to make sense of
that last row, which is labeled item-to-total correlations. Let me preface that explanation by
noting that, in deciding whether to ask all those different questions and whether summing all
those scores makes sense, researchers determine whether people’s responses to each question
are related to their responses to the other questions.
Researchers assume that if all the questions address the same underlying construct, then the
answers to those questions should be somewhat consistent. If the answers to one question are
unrelated to the answers to other questions, then that’s a problem. If the answers to one
question are strongly but negatively related to the answers to other questions, then reverse
scoring is needed. Looking at the numbers in the last row, remember that a correlation close to
+1 indicates an item is highly related to the other items. By that standard, the answers to item #4
are not especially related to the answers to the other items, whereas the answers to the other
items are highly related to one another.
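As a preview of that reliability discussion, here is a minimal sketch of how item-to-total correlations could be computed from a respondents-by-items matrix. The numbers are fabricated, and the version shown is the common ‘corrected’ variant that correlates each item with the total of the remaining items, so an item’s own score doesn’t inflate its correlation:

```python
import numpy as np

# Rows = respondents, columns = items; scores assumed already reverse scored.
# Column 4 is deliberately fabricated to be out of step with the others.
scores = np.array([
    [4, 5, 4, 2, 4],
    [3, 3, 4, 5, 3],
    [5, 4, 5, 1, 5],
    [2, 2, 3, 4, 2],
    [4, 4, 4, 3, 5],
])

totals = scores.sum(axis=1)
for item in range(scores.shape[1]):
    rest = totals - scores[:, item]  # total of the other items
    r = np.corrcoef(scores[:, item], rest)[0, 1]
    print(f"item {item + 1}: item-to-total r = {r:.2f}")
```

Run on data like the matrix in this slide, item #4’s correlation would stand out as low or negative while the other items’ correlations cluster near +1.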
Slide 20
This slide illustrates the types of statements one might develop to measure people’s attitudes
toward product quality and warranty responsibility. Some of these items are fairly wordy. The
goal is to keep items as concise as possible, yet fully explain the underlying notions. In
“Products today are built to last a long time before needing service or repair” the language is
relatively simple and it entails the issue of obsolescence and ultimately warranty responsibility.
“Too many products available to customers are unnecessarily complex” is relatively simple
language. “Customers rarely get stuck with products that don’t work, since most products today
have good guarantees” is simple as well. Although these items are imperfect, they are the type
of items that people can normally read, understand, and respond to meaningfully using the sorts
of 5- or 7-point scales with which you’re familiar.
Slide 21
I include this figure because parts of it are informative yet I also disagree with parts of it. For the
scale categories and the labels we would provide for them, the ones for quality are fine: well
above average, above average, average, below average, and well below average. The
importance categories, interest categories, satisfaction categories, and even the uniqueness
categories make sense. However, I disagree with using Likert-type scales to assess frequency
or truth. I strongly discourage using Likert-type scales for frequency because the frequency
descriptors mean vastly different things to different people. For example, I might say that I
sometimes drink coffee. What that means is that I might brew myself a couple of pots a week
and drink 6 to 8 cups each time. Somebody else might give the same answer, but mean they
drink one cup a month. Somebody else might give that same answer, but mean they drink two
cups a day. What I mean by ‘sometimes’ and what someone else means by ‘sometimes’, as it
relates to coffee consumption, may differ markedly, and the same answer shouldn’t reflect
vastly different behaviors. As for Likert-type scales and truth, I’m a logician at heart, so
something is either true or false. For a larger report, certain aspects may be true and other
aspects may be false, but to say that something is somewhat true, from a logical standpoint,
makes little sense. Thus, I also discourage using Likert-type items to assess the degree to
which people believe something is true.
Slide 22
Thurstone scales, which are to some extent related to Likert-type scales, often are ignored by
undergraduate marketing research textbooks. Such scales are valuable and provide excellent
data, but they are far harder to construct than Likert-type scales. I’ll show you why in the next
several slides.
Slide 23
When constructing a Thurstone scale, researchers create a series of items to which people can
respond either yes or no. Those items are designed and sequenced such that respondents are
increasingly likely to respond ‘yes’ as they progress from item to item. The ideal Thurstone scale
would yield a series of one answer (‘no’) followed by a series of the opposite answer (‘yes’). The
point at which responses change from ‘no’ to ‘yes’ is the point of measurement interest.
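Under that ideal pattern, scoring a respondent amounts to finding where the answers flip from ‘no’ to ‘yes’. A minimal sketch, assuming answers are coded 0 for ‘no’ and 1 for ‘yes’ and the items are already ordered:

```python
def switch_point(answers):
    """Return the 1-based position of the first 'yes' in an ideal
    no ... no, yes ... yes Thurstone response pattern (0 = no, 1 = yes),
    or None if the respondent never answers 'yes'."""
    for i, answer in enumerate(answers, start=1):
        if answer == 1:
            return i
    return None

print(switch_point([0, 0, 0, 1, 1, 1]))  # 4: this respondent switches at item #4
```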
Slide 24
To some extent, you can think of a Thurstone scale as similar in concept to a standardized
exam like the ACT or SAT. There is a range of questions, in terms of difficulty, on those types of
exams; examiners assume that they can identify the abilities of students taking the exam by
examining the pattern of responses to those questions. They assume everyone can answer the
easiest questions correctly and very few people can answer the hardest questions correctly, but
there won’t be a random sort of response in which easy questions are missed while difficult
ones are answered correctly. Think about identifying a series of questions that relate to some
topic, with each question progressively more positive or more negative. Researchers would
expect people’s responses would shift at some point as the statements become more positive.
Here’s an example of trying to form a Thurstone scale by asking a series of judges to rate the
items by the likelihood that someone would agree or disagree with them.
Slide 25
We can take those experts’ responses and average them to derive the scale values
shown in the second column. Seemingly, the item that people would be least likely to agree with
is item #3, with a scale value of 9.9. Item #2, with a scale value of 2.0, would be the item with
which they’d be most likely to agree.
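Classic Thurstone scaling typically takes the mean (or median) of the judges’ ratings for each item, which is what this sketch assumes. The ratings below are fabricated, chosen only so the resulting values echo the slide’s 2.0 and 9.9:

```python
# Hypothetical ratings from ten judges for three items on an 11-point
# favorability continuum (all numbers fabricated for illustration).
judge_ratings = {
    "item 1": [6, 7, 6, 5, 7, 6, 6, 7, 5, 6],
    "item 2": [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
    "item 3": [10, 10, 10, 10, 10, 10, 10, 10, 10, 9],
}

for item, ratings in judge_ratings.items():
    scale_value = sum(ratings) / len(ratings)  # mean of the judges' ratings
    print(f"{item}: scale value = {scale_value:.1f}")  # 6.1, 2.0, 9.9
```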
Slide 26
This is an example of a response curve a series of Thurstone items might yield. It becomes
progressively more likely that someone will switch from ‘no’ to ‘yes’ or ‘disagree’ to ‘agree’ as
the items progress from #1 to #11. No one would agree with item #1, but everyone would agree
with item #11. Seemingly, at items #6 and #7, roughly 50% of people begin to agree.
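That curve is simply the percentage of respondents answering ‘yes’ at each item. A minimal sketch with fabricated responses:

```python
# Rows = respondents, columns = 11 Thurstone items (0 = no, 1 = yes).
# Fabricated data purely to illustrate how the response curve is computed.
responses = [
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1],
]

n = len(responses)
for item in range(11):
    pct_yes = 100 * sum(person[item] for person in responses) / n
    print(f"item {item + 1}: {pct_yes:.0f}% agree")  # 0% at item #1, 100% at item #11
```

In this fabricated matrix, agreement crosses 50% around items #6 and #7, mirroring the curve on the slide.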
Slide 27
Another type of non-comparative scale that’s popular is the Semantic Differential (SD) scale.
With SD scales, there’s a series of bi-polar rating items. The bi-polar adjectives that anchor the
endpoints of the scale could be items such as good and bad. In entering the response data into
the computer, researchers assign a number to each scale point. Although many people refer to
SD scales as scales with bi-polar ratings, the true SD scale assumes three underlying attitudinal
dimensions that everyone, regardless of culture or language, uses to evaluate things in their
social environment. These three dimensions are evaluation, power, and activity. For a properly
constructed SD scale, all the items will relate to one of these three dimensions. However, over
time people have adapted SD-type scales so that items may not be related to one of these three
underlying dimensions.
Slide 28
Here’s an example of SD scales for measuring attitudes towards tennis. The bi-polar adjectives
are exciting versus calm, interesting versus dull, simple versus complex, and passive versus
active. When a respondent receives this scale, the instructions indicate that a checkmark or
other type of mark should be placed on the line such that the proximity of that mark to each of the
adjectives indicates that respondent’s attitude on that particular dimension.
Slide 29
Here are some SD scales for attitudes towards a jazz saxophone recording. These items relate
to audio recordings in the same way the previous items related to playing tennis.
Slide 30
The next three slides provide additional examples of SD scales. I provide these examples to
indicate the range of topics on which SD scales can be used and the range of formats that can
be used for SD scales.
Slides 31 to 32 (No Audio)
Slide 33
Although SD scales are popular, I don’t recommend them for several reasons.

Respondents will tend to misuse those scales. Unfortunately, many people don’t read
the instructions to questionnaires; as a result, instead of checking the appropriate box or
marking the appropriate area on the line between bi-polar adjectives, they merely circle
one of the endpoints. If someone circles one of those bi-polar adjectives, we cannot tell
whether (1) they meant to check off the box or the area of the line closest to that
adjective, or (2) they misread the instructions entirely. As a result, we can’t use that person’s
data in our analysis.

It’s far more difficult to construct SD scales than Likert scales. If nothing else, we’re
limited to only a few words and it’s difficult to summarize complex notions in so few
words. Likert scales permit many more words—although not an infinite number.

Negation is not necessarily an opposite. Many of the bi-polar adjectives in the previous
examples pair a word with its negation: ‘something’ and ‘not something’. Sometimes ‘not
something’ is the opposite, but other times it is not. For example, ‘not black’ includes all
the other colors: yellow, blue, green, red, and orange; the true opposite of ‘black’ is
‘white’. Unfortunately, most people who construct SD scales use negation even when it is
not a true opposite.
There’s no advantage to SD scales. Likert scales are easier to construct. Respondents are far
more familiar with them and are more likely to use them properly. That said, there’s something
in marketing research called profile analysis. I’ll show you some examples and explain why the
use of SD scales to construct profiles is not an advantage either.
Slide 34
Here’s an example of a profile analysis for three different beers. This figure summarizes
responses from numerous people on the SD items, across three different beers: a regional
brand, Miller, and Budweiser. Miller’s managers might use this display by comparing Miller’s
position to these other beers, sensing where there are meaningful gaps, and then addressing
these gaps by modifying their product or promotional efforts. In this example, Miller is perceived
as the highest-priced beer. If that perception poses a problem, then Miller’s managers might run
ads to reinforce that Miller is a reasonably priced, good-value-for-the-dollar beer, or these
managers might decide to modify their pricing policy.
Slide 35
The gaps in this figure relating major airlines to commuter airlines suggest that respondents
viewed major airlines as having quieter equipment but being more expensive and being less
polite than commuter airlines. If these are people’s general perceptions, it’s easy to understand
how a major airline might try to address them through conventional marketing means.
Slide 36
I include this example of a profile to show that not only can existing brands be compared to one
another, but a consumer’s ideal brand can be compared to existing brands. In this example for a
color television, ‘I’ represents the ideal rating for a brand. Brand ‘A’ is perceived as very
expensive and far more expensive than Brand ‘B’, yet Brand ‘B’ also is perceived as being too
expensive relative to an ideal brand.
Slide 37
Finally, here’s an image profile for a savings bank. The major gaps suggest that the present
bank is perceived as being far more old-fashioned than the ideal bank, which is perceived as
more modern. The ideal bank also is perceived as larger, more innovative, and a leader, relative
to the present bank. If the responses summarized here represented a sample of my customers,
then I might believe it’s wise to renovate my bank, update its procedures, and install new and
more modern equipment. Such changes seemingly would bring my bank more in line with the
ideal for my current customer base.
Slide 38
Although the preceding four examples suggest that profile analyses provide marketers with
much useful information, I believe these analyses tend to confuse decision making about the
best course of action. Here are three reasons why I believe this is true.

Only a few brands can be depicted. In the examples, there were never more than three
brands compared. Admittedly, these graphs were in black and white, and with color you
might be able to compare five brands. However, real markets tend to have more than
four or five competitors, so such maps are quite incomplete. There are alternative
mapping procedures in marketing that can depict far more than five brands, and I
encourage people to use those perceptual maps rather than profile analyses.

The attributes are not necessarily independent of one another. In the other mapping
procedures in marketing, there are ways to guarantee that the underlying dimensions on
which we’re assessing things are independent. In this case, I could be asking three or
four questions that I’m unaware relate to the same underlying notion. In part, by not
ensuring the attributes are independent, I may be confusing a marketing manager about
modifying his or her brand.

Finally, the profiles don’t indicate which attributes are of greater or lesser importance. In
the previous banking example, it seems that modern and innovative are major gaps that
the bank needs to address, yet it’s quite possible that bank customers view both gaps as
trivial. Perhaps customers care about and hence patronize this bank because it offers
high-quality personal service. Without a way of knowing which attributes are important
or unimportant to customers, it’s impossible to interpret these profiles meaningfully.
As a result, there’s no reason for you to use an SD scale when a Likert scale is easier to
construct, is less likely to be misused by respondents, and will yield the types of maps that are
far more informative than profile analyses.
Slide 39
Another example of a non-comparative scale is a Stapel scale. As this slide indicates, the goal
with a Stapel scale is to avoid the difficulty of constructing scales with bi-polar adjectives;
instead, a single adjective is used. All the advantages, disadvantages, and results of using
Stapel scales are similar to those of SD scales, although Stapel scales are easier to construct
and administer. I’ve already argued against the use of SD scales, so I merely present this
alternative as deep background.
Slide 40
The next three slides provide three examples of Stapel scales: one for Bloomingdale’s, one for
Kmart, and one for compact cars. Please note the extensive instructions associated with a
Stapel scale. Respondents are unfamiliar with such scales; as a result, they are unlikely to use
them properly.
Slide 41
For Stapel scales, like the one for Kmart, I find respondents ignore the instructions to circle the
appropriate positive or negative number. Notice the Stapel scale here runs from +5 to -5. In the
other example, it ran from +3 to -3. In the next example, it’s formatted in an entirely different
way. Stapel scales can have different numbers of scale points. However, I’ve found that many
respondents confronted with Stapel scales circle the adjective or phrase they believe is most
descriptive of the object in question, rather than circle the appropriate scale number. If they
believe that Kmart provides low prices but has a cold atmosphere, then they’ll circle those two
phrases but ignore the ‘friendly employees’ and ‘slow service’ scales.
Slide 42
One textbook author, about fifteen years ago, believed that this constituted a Stapel scale and I
have no reason to believe otherwise. Notice the scale’s format is vastly different. Respondents
are given a series of adjectives and then a scale of 1 to 7, indicating the degree to which each
adjective describes compact cars, from ‘not at all’ to ‘perfectly’. Again, it’s far easier to construct a Likert
scale and I urge you to use those rather than scales formatted this way.