Reliability and Validity

Slide 1
In this lecture, I discuss measurement, validity, and reliability. Although in everyday English these terms are used interchangeably, they have vastly different meanings from a measurement perspective.

Slide 2
When researchers measure respondents in an observational or experimental study, they’re typically looking for inter-respondent differences. What causes the differences in those measures between respondents? Those differences could reflect ‘true’ differences among respondents, or they could reflect random or systematic error. Reliability and validity address the random and systematic error components of a measure; validity, in particular, addresses whether a measure is an accurate indicator of the construct it is meant to capture.

Slide 3
What does a researcher mean by a reliable measure? Reliability in a research context is the degree to which measures are free from random error and, therefore, yield consistent results. In other words, it’s the repeatability of a measure. If I take the same measurement on the same object or person on multiple occasions, will I get the same number or score?

Slide 4
In this cartoon, the researcher is responding to a question about reliability. Unfortunately, his results were unreliable because he checked the data twice and got two different results. If these measures had been reliable, he would have gotten the same result both times.

Slide 5
In contrast to reliability, which addresses the repeatability or stability of a measure, validity is a more theoretical notion. With validity, what researchers try to assess is the ability of a scale to measure the intended construct. In other words, are we measuring the right thing?

Slide 6
In this cartoon, the researcher is concerned about a validity issue. The caption indicates that 92% of people lie on polls. In a humorous way, the cartoon illustrates what happens when research questions don’t measure what was intended.

Slide 7
This cartoon clearly differentiates validity from reliability. As you’ll notice from the first frame on the left, the manager says, “That was an interesting report Jones. Are we confident it is correct?” Jones could provide one of two different answers. In the top frame on the right, his answer concerns validity when he says, “Sure boss. It doesn’t matter how many people you interview as long as you measure the right thing.” That is a validity response. Alternatively, in the bottom frame, Jones says, “Sure boss. Our sampling procedures and response rates were as good as you can get.” This suggests that if they were to field the survey again, then they’d get a similar result; that is a reliability response.

Slide 8
Rulers are examples of measures that are both reliable and valid: reliable, in the sense that if the same object is measured with the same ruler on consecutive days, the results would be similar; and valid, because a ruler is supposed to measure length and indeed it does that precisely.

Slide 9
This slide shows how validity and reliability are related to marketing-related measures. The upper-left-hand corner illustrates the circumstance under which a measure is valid and reliable. In this case, the measurement needs to occur only once to learn its value. Alternatively, the bottom-right-hand corner illustrates the circumstance under which a measure is neither valid nor reliable. In this case, it’s not usable in any meaningful fashion; it’s impossible to learn the true value by using a measure that’s neither valid nor reliable.
The interesting circumstances are the ones on the diagonal, cells B and C. Cell B (the upper-right-hand corner) illustrates the circumstances under which a measure is somewhat unreliable, in the sense that taking the same measure repeatedly would produce different values; yet from a theoretical standpoint, the measure reflects the correct underlying notion, so it’s measuring what it was meant to measure. In such a circumstance, the solution for a researcher is to take repeated measurements and then average them.

Cell C (the bottom-left-hand corner) illustrates the circumstances under which a measure is repeatable but not valid. It’s possible that the measure is systematically biased; if the magnitude and direction of that bias are known, then all that’s needed is a single ‘corrected’ measure. If, for example, a researcher wanted to measure social status and was limited to a measure of income—which should reflect social status but could be biased systematically—then all that’s needed is a single measure of income that is corrected for its bias.

In summary, the ideal of reliable and valid measures is illustrated in Cell A. The impossible situation is Cell D, in which a measure is neither valid nor reliable. With measures of the type shown in Cell B—valid but unreliable—the solution is to take repeated measures and then average them. With measures of the type shown in Cell C—reliable but not valid—the solution is to determine whether the invalid measure reflects the underlying construct of interest, determine the magnitude and direction of the bias, take the single measure, and adjust it accordingly.
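To make the Cell B and Cell C remedies concrete, here is a minimal simulation sketch in Python; the true score, noise level, and bias value are invented purely for illustration. It shows that averaging repeated noisy-but-unbiased measurements recovers the true value, and that a single systematically biased measurement can be corrected once the size and direction of the bias are known.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 50.0                       # hypothetical 'true' score for one respondent

# Cell B: valid but unreliable -- each measurement is unbiased but noisy.
# Remedy: take repeated measurements and average them.
noisy_measures = true_value + rng.normal(loc=0.0, scale=5.0, size=20)
print("Cell B, one measurement:      ", round(noisy_measures[0], 1))
print("Cell B, average of 20 repeats:", round(noisy_measures.mean(), 1))   # close to 50

# Cell C: reliable but not valid -- repeatable, yet shifted by a known systematic bias.
# Remedy: take a single measurement and adjust it for the known bias.
known_bias = 8.0                        # assumed size and direction of the bias
biased_measure = true_value + known_bias
print("Cell C, corrected measurement:", biased_measure - known_bias)       # exactly 50.0
```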
Slide 10
This slide summarizes the causes of variation in measures among people. Researchers hope that the first cause is the dominant one: the difference in the number or score yielded by a measure relates to true differences among people on the characteristics of interest. There are other causes of variation—the error term indicated on slide #2 of this lecture—that are related to validity and reliability. All but the second cause relate to reliability; in contrast, the second cause, which is differences due to stable characteristics of respondents (such as intelligence and education), suggests a validity issue. With the second cause, the measure is being influenced by stable characteristics that it was never intended to capture. The remaining causes—things like short-term personal factors—relate to reliability.

People respond differently based on the situation. People’s responses to questions differ by their physical well-being or how motivated they are to answer a question at a given time. For example, I might respond one way to a telephone interviewer who calls me at home three minutes before I must leave, but a different way if I don’t plan to leave my house that day. I might respond differently based on the way an interview is administered. The greater intimacy of personal interviews may induce different responses than self-administered interviews.

Response variation may be attributable to the specific items included in a questionnaire. There’s an infinite number of ways researchers can ask the same question or try to measure the same underlying construct. The specific item(s) chosen to measure a construct may cause some variation in people’s responses due to the interpretation of the specific words or formatting chosen. There could be ambiguity in the measure, something my lectures on question and questionnaire design have urged you to avoid. Nonetheless, despite researchers’ best efforts, spoken language is somewhat ambiguous by nature. Spoken language also is somewhat complex, so for any question beyond the absolutely simplest, its complexity or ambiguity may influence, to some extent, people’s responses. There could be mechanical issues; for example, insufficient blank space on a self-administered questionnaire may encourage people to shorten their answers, and an unprofessional-looking questionnaire may discourage people from providing complete answers. Finally, efforts to code or to score responses, especially if they are open-ended, could introduce error.

Slide 11
What methods are used to assess measurement reliability and validity? I’ll talk briefly about those and then very briefly about generalizability.

Slide 12
To assess the reliability or repeatability of their measures, researchers might use any of these five approaches; different approaches are more suitable to different circumstances.

Test/re-test reliability means measuring a person at one time, then re-measuring that same person at a different time, and finally comparing the answers. For example, if I’m looking at attitude measures and I believe that attitudes are relatively stable, then I’d expect similar answers on repeated administrations of the measure. Test/re-test reliability is the degree of answer similarity.

Another way to assess reliability is splitting the sample. I could ask 200 people the same question, then randomly split those 200 people into two groups of 100 people, and then determine if the answers from the first group are consistent (on average) with the answers from the second group. If so, then I’ve achieved split-sample reliability.

Alternative forms reliability relates to the infinite number of questions appropriate for assessing any underlying construct. To assess store image, for example, I might create one set of questions to ask one set of people and a different set of questions to ask a second set of people. Then, I’d compare the responses of the first group to the second group. If I’ve created a reliable measure, then the alternative forms ought to yield comparable responses.

Another way to assess reliability is internal comparison. As I’ve mentioned in previous lectures, some marketing constructs require multiple items for measurement. In such cases, a single item—like one used to assess income or the number of children in a household—is insufficient. Assessing a notion like store image, which is far more complex, requires multiple items. To some extent, I could assess the reliability of those items by examining how the answers to each item relate to the scores on all the other items. What I’d like is some consistency among responses to all the items. If one item tends to produce scores unrelated to the others, then I know that item should be deleted from the set of items meant to assess the underlying construct. A reasonably consistent set of items is internal-comparison reliable.

Finally, if I’m running something like a content analysis, then there’s a fair amount of subjectivity involved in assigning scores to objects or people. Because of that subjectivity, it’s useful to have multiple scorers and to check their scores for consistency; that’s what scorer reliability addresses. The sketch below illustrates the first and fourth of these approaches with simulated data.
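This is a minimal sketch of those two checks, using simulated answers; the sample size, scale points, and noise levels are invented for illustration. The test/re-test check correlates two administrations of the same attitude item, and the internal-comparison check correlates each item in a multi-item scale with the sum of the remaining items.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200

# Test/re-test (approach #1): the same 7-point attitude item asked on two occasions.
true_attitude = rng.normal(loc=4.0, scale=1.0, size=n)
wave1 = np.clip(np.round(true_attitude + rng.normal(0, 0.5, n)), 1, 7)
wave2 = np.clip(np.round(true_attitude + rng.normal(0, 0.5, n)), 1, 7)
print("test/re-test correlation:", round(np.corrcoef(wave1, wave2)[0, 1], 2))  # high r -> repeatable answers

# Internal comparison (approach #4): four items meant to tap the same construct.
items = np.column_stack([true_attitude + rng.normal(0, 0.8, n) for _ in range(4)])
for j in range(items.shape[1]):
    others = np.delete(items, j, axis=1).sum(axis=1)         # sum of the remaining items
    r = np.corrcoef(items[:, j], others)[0, 1]
    print(f"item {j + 1} vs. remaining items: r = {r:.2f}")  # a low r flags an item to drop
```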
Coming back to the fourth approach, internal comparison reliability can be used to assess reliability ‘after the fact’; in other words, after I’ve collected my data. After I’ve fielded my questionnaire, I want to verify that all items related to one underlying construct are consistent with one another, and I can do that after the fact in a way that’s easy and inexpensive. All five aforementioned ways are good for assessing reliability. The first three ways to assess reliability—test/re-test, split samples, and alternative forms—require a larger sample, which would entail either more time or more expense (in advance planning and a larger budget). Way #4 is a relatively straightforward way to check the consistency of multiple-item measures. Way #5 is used to assess reliability when assessments or scores are assigned somewhat subjectively, as they would be in a content analysis.

Slide 13
There are several ways to assess measure validity. This slide summarizes most of those approaches.

Content or face validity is merely an assessment of whether or not scale items are adequately representative of the underlying construct of interest. Typically, content or face validity is assessed by asking subject-matter experts if the items make sense. For example, if I’m designing a scale related to retailing, I might find five professors of retailing and ask them if those items make sense; if they agree the items make sense in this context, then I have content or face validity.

If I’m interested in predictive validity, also referred to as criterion validity, what I’m trying to assess is whether or not a measure is predictive of something else I wish to predict. It doesn’t matter whether or not the measure makes sense from a theoretical standpoint; if a measure is predictive, then it has predictive validity, although theoretically related measures tend to be more predictive. Nonetheless, the issue is whether or not one measure tends to predict another thing that I want to predict. To the degree that one measure predicts another, I have criterion or predictive validity.

I also might be interested in whether or not my measures properly converge or discriminate between concepts. If I have a strong sense that I’m measuring the correct thing, then I should be able to develop alternative measures, and the scores on those measures should be consistent. Think about measuring IQ. There’s an infinite number of ways we can assess someone’s intelligence. If I have a good handle on what constitutes intelligence, then I ought to receive consistent scores on different IQ tests for the same person. To the extent the scores are inconsistent, that suggests a lack of understanding about the underlying notion of intelligence. I’d like convergent validity: multiple measurements of the same thing providing consistent answers. I’d also like to be certain that my different scales measure different things; I’d like to know that I haven’t confused measuring one thing with measuring something else. To the extent I can guarantee that my scale is measuring one thing and not something else, I’ve achieved discriminant validity.

Finally, I’d like each measure to be related in a proper theoretical manner to the other marketing constructs. To the extent that a measure is consistent with measures of other marketing constructs, I’ve achieved nomological validity.
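Returning to the IQ example, here is a minimal sketch of the convergent and discriminant checks with simulated scores; the constructs, test names, and noise levels are invented for illustration. Two alternative measures of the same construct should correlate highly, while a measure of an unrelated construct should not.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200
intelligence = rng.normal(100, 15, n)    # construct that two different IQ tests try to capture
extraversion = rng.normal(0, 1, n)       # an unrelated construct

iq_test_a = intelligence + rng.normal(0, 5, n)         # alternative measures of the same thing
iq_test_b = intelligence + rng.normal(0, 5, n)
extraversion_scale = extraversion + rng.normal(0, 0.3, n)

# Convergent validity: alternative measures of the same construct should agree.
print("IQ test A vs. IQ test B:   ", round(np.corrcoef(iq_test_a, iq_test_b)[0, 1], 2))
# Discriminant validity: a measure should not track measures of unrelated constructs.
print("IQ test A vs. extraversion:", round(np.corrcoef(iq_test_a, extraversion_scale)[0, 1], 2))
```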
Slide 14
There are two other issues worth addressing before ending this lecture; the first is shown by this cartoon, in which two researchers contemplate whether or not some research results are reliable, valid, and generalizable. I’ve already discussed reliability and validity. Generalizability is whether or not the results of a study can be generalized to a larger population or to a broader set of circumstances. For example, based on experiments that often rely on a few hundred subjects, researchers try to draw conclusions about how the general population, or at least a much larger population, might respond. The stronger the evidence that the responses of a few hundred people are generalizable to a larger population, the more comfortable marketers are with generalizing the results of an experiment.

Slide 15
The other issue worth considering is sensitivity. A measure can be reliable, in the sense that it’s repeatable; it can be valid, in the sense that it measures what was intended and is consistent with theoretical notions; and it can be generalizable, in the sense that the sample that was drawn is representative of the larger population in question. In contrast, sensitivity is whether or not a measure accurately detects differences among people. A measure can be reliable, valid, and generalizable, but relatively insensitive. Sensitivity is critical for researchers trying to classify people into different groups, as would be the case for segmentation analysis.
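One common way a measure can lack sensitivity is by being too coarse to register real differences. Here is a minimal sketch; the two respondents, their scores, and the cutoff are invented for illustration. A 7-point item separates two people whose underlying attitudes differ, while a yes/no version of the same item gives both the identical score.

```python
# Two respondents whose underlying attitudes genuinely differ.
person_a, person_b = 4.1, 5.9

seven_point = [round(x) for x in (person_a, person_b)]   # -> [4, 6]: the difference is detected
yes_no = [int(x > 4.0) for x in (person_a, person_b)]    # -> [1, 1]: the difference is missed

print("7-point item:", seven_point)
print("yes/no item: ", yes_no)
```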