Reliability and Validity

Slide 1
In this lecture, I discuss measurement, validity, and reliability. Although in everyday English
these terms are often used interchangeably, they have vastly different meanings from a
measurement perspective.
Slide 2
When researchers think about measuring respondents with an observational or experimental
study, they’re typically looking for inter-respondent differences. What causes those differences
in those measures between respondents? Well, those differences could reflect ‘true’ differences
among respondents, or they could reflect random or systematic error. Reliability addresses the
random-error component of a measure; validity addresses the systematic-error component, that is,
whether the measure is an accurate indicator of the characteristic it is intended to capture.
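One standard way to make this decomposition explicit, drawn from classical test theory rather than stated on the slide itself, is:

    X_O = X_T + X_S + X_R   (observed score = true score + systematic error + random error)

Reliability concerns how small the random term X_R is; validity also concerns the systematic term X_S, because a measure can be perfectly repeatable yet still biased.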
Slide 3
What does a researcher mean by a reliable measure? Reliability in a research context is the
degree to which measures are free from random error and, therefore, yield consistent results. In
other words, it’s the repeatability of a measure. If I take that same measurement on the same
object or person on multiple occasions, will I get the same number or score?
Slide 4
In this cartoon, the researcher is responding to a question about reliability. Unfortunately, his
results were unreliable because he checked the data twice and got two different results. If these
measures had been reliable, he would have gotten the same result.
Slide 5
In contrast to reliability, which addresses issues about the repeatability or stability of a measure,
validity is a more theoretical notion. In validity, what researchers try to assess is the ability of a
scale to measure the intended construct. In other words, are we measuring the right thing?
Slide 6
In this cartoon, the researcher is concerned about a validity issue. The caption indicates that
92% of people lie on polls. In a humorous way, the cartoon illustrates what happens when
research questions don’t measure what was intended.
Slide 7
This cartoon clearly differentiates validity from reliability. As you’ll notice from the first frame on
the left, the manager says, “That was an interesting report, Jones. Are we confident it is
correct?” Jones could provide one of two different answers. In the top frame on the right, his
answer addresses validity when he says, “Sure boss. It doesn’t matter how many people you
interview as long as you measure the right thing.” That is a validity response. Alternatively, in
the bottom frame, Jones says, “Sure boss. Our sampling procedures and response rates were
as good as you can get.” That is a reliability response: it suggests that if they were to field the
survey again, then they’d get a similar result.
Slide 8
Rulers are examples of measures that are reliable and valid; reliable, in the sense that if the
same object is measured with the same ruler on consecutive days, then the results would be
similar. A ruler is a valid measure because a ruler is supposed to measure length and indeed it
does that precisely.
Slide 9
This slide shows how validity and reliability are related to marketing-related measures.
The upper-left-hand corner illustrates the circumstance under which a measure is valid and
reliable. In this case, the measurement needs to occur only once to learn its value. Alternatively,
the bottom-right-hand corner illustrates the circumstance under which a measure is neither valid
nor reliable. In this case, it’s not usable in any meaningful fashion; it’s impossible to learn the
true value by using a measure that’s neither valid nor reliable.
The interesting circumstances are the ones on the diagonal, cells B and C. Cell B (the
upper-right-hand corner) illustrates the circumstances under which a measure is valid but
somewhat unreliable: taking the same measure repeatedly would produce different values, yet
from a theoretical standpoint the measure reflects the correct underlying notion; it’s measuring
what it was meant to measure. In such a circumstance, the solution for the researcher is to take
repeated measurements and then average them.
Cell C (the bottom-left-hand corner) illustrates the circumstances under which a measure is
repeatable but not valid. Such a measure may be systematically biased; if the magnitude and
direction of that bias are known, then all that’s needed is a single ‘corrected’ measure. If, for
example, a researcher wanted to measure social status and was limited to a measure of income
(which should reflect social status but could be biased systematically), then all that’s needed is
a single measure of income that is corrected for its bias.
In summary, the ideal of reliable and valid measures is illustrated in Cell A. The unusable
situation is Cell D, in which a measure is neither valid nor reliable. With measures of the type
shown in Cell B (valid but unreliable), the solution is to take repeated measures and then
average them. With measures of the type shown in Cell C (reliable but not valid), the solution is
to determine whether the invalid measure reflects the underlying construct of interest, determine
the magnitude and direction of the bias, take the single measure, and adjust it accordingly.
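As a rough numerical illustration of these two remedies, here is a minimal Python sketch; the true value of 50, the assumed bias of +5 units, and all variable names are hypothetical and chosen only for illustration, not taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
true_value = 50.0

# Cell B: valid but unreliable -- unbiased, but contaminated by random error.
# Averaging repeated measurements drives the random component toward zero.
noisy_measures = true_value + rng.normal(loc=0.0, scale=10.0, size=20)
print("single measure:", round(noisy_measures[0], 1))              # may sit far from 50
print("average of 20 measures:", round(noisy_measures.mean(), 1))  # close to 50

# Cell C: reliable but not valid -- repeatable, yet systematically biased.
# If the direction and magnitude of the bias are known (assumed +5 here),
# a single corrected measure is enough.
known_bias = 5.0
biased_measure = true_value + known_bias      # the instrument always reads 5 units high
corrected_measure = biased_measure - known_bias
print("corrected measure:", corrected_measure)  # exactly 50
```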
Slide 10
This slide summarizes the causes of variation in measures among people. Researchers hope
that the first cause is the dominant one: the difference in the number or score yielded by a
measure relates to true differences among people on the characteristic of interest. There are
other causes of variation (the error term indicated on slide #2 of this lecture) that are related to
validity and reliability. The second cause, differences due to stable characteristics of
respondents (such as intelligence and education), suggests a validity issue: the measure is
being influenced by stable characteristics that it was not intended to capture. The remaining
causes, things like short-term personal factors, relate to reliability.

• People respond differently based on the situation. People’s responses to questions differ
by their physical well-being or how motivated they are to answer a question at a given
time. For example, I might respond one way to a telephone interviewer who calls me at
home three minutes before I must leave, but a different way if I don’t plan to leave my
house that day. I might respond differently based on the way an interview is
administered.

• The greater intimacy of personal interviews may induce different responses than
self-administered interviews.

• Response variation may be attributable to the specific items included in a questionnaire.
There’s an infinite number of ways researchers can ask the same question or try to
measure the same underlying construct. The specific item(s) chosen to measure a
construct may cause some variation in people’s responses due to the interpretation of
the specific words or formatting chosen.

• There could be ambiguity in the measure, something my lectures on question and
questionnaire design have urged you to avoid. Nonetheless, despite researchers’ best
efforts, spoken language is somewhat ambiguous by nature. Spoken language is also
somewhat complex, so questions not of the absolutely simplest nature invite complexity,
and that complexity or ambiguity may influence, to some extent, people’s responses.

• There could be mechanical issues; for example, insufficient blank space on a
self-administered questionnaire may encourage people to shorten their answer. An
unprofessional-looking questionnaire may discourage people from providing complete
answers. Finally, efforts to code or to score responses, especially if they are
open-ended, could introduce error.
Slide 11
What methods are used to assess measurement reliability and validity? I’ll talk briefly about
those and then very briefly about generalizability.
Slide 12
To assess the reliability or repeatability of their measures, researchers might use any of these
five approaches; different approaches are more suitable to different circumstances. (A brief
numerical sketch of several of these checks follows the summary at the end of this slide.)

• Test/re-test reliability means measuring a person at one time, then re-measuring that
same person at a different time, and finally comparing the answers. For example, if I’m
looking at attitude measures and I believe that attitudes are relatively stable, then I’d
expect similar answers on repeated measure administrations. Test/re-test reliability is
the degree of answer similarity.

• Another way to assess reliability is splitting the sample. I could ask 200 people the same
question, then randomly split those 200 people into two groups of 100 people, and then
determine if the answers from the first group are consistent (on average) with the
answers from the second group. If so, then I’ve achieved split-sample reliability.

• Alternative forms reliability relates to the infinite number of questions appropriate for
assessing any underlying construct. To assess store image, for example, I might create
one set of questions to ask one set of people and a different set of questions to ask a
second set of people. Then, I’d compare the responses of the first group to the second
group. If I’ve created a reliable measure, then the alternative forms ought to yield
comparable responses.

• Another way to assess reliability is internal comparison. As I’ve mentioned in previous
lectures, some marketing constructs require multiple items for measurement. In such
cases, a single item (of the sort used to assess income or the number of children in a
household) is insufficient; assessing a notion like store image, which is far more
complex, requires multiple items. To some extent, I could assess the reliability of those
items by examining how the answers to each item relate to the scores on all the other
items. What I’d like is some consistency among responses to all the items. If one item
tends to produce scores unrelated to the others, then I know that item should be deleted
from the set of items meant to assess the underlying construct. A reasonably consistent
set of items has internal-comparison reliability.

• Finally, if I’m running something like a content analysis, then there’s a fair amount of
subjectivity involved in assigning scores to objects or people. Because of that subjectivity,
it’s useful to check multiple scorers’ assignments for their consistency. That’s what scorer
reliability addresses. Coming back to the fourth approach, internal comparison reliability
can be used to assess reliability ‘after the fact’; in other words, after I’ve collected my
data. After I’ve fielded my questionnaire, I want to verify that all items related to one
underlying construct are consistent with one another. I can do that after the fact in a way
that’s easy and inexpensive.
All five aforementioned ways are good for assessing reliability. The first three ways to assess
reliability (test/re-test, split samples, and alternative forms) require a larger sample, which
would entail either more time or more expense (in advance planning and a larger budget). Way
#4 is a relatively straightforward way to check for the consistency of multiple-item measures.
Way #5 is used to assess reliability when assessments or scores are assigned somewhat
subjectively, as they would be in a content analysis.
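To make a few of these checks concrete, here is a minimal Python sketch using simulated, purely hypothetical data (the sample of 200 respondents, the four "store image" items, and all variable names are illustrative assumptions, not figures from the lecture): test/re-test reliability as the correlation between two administrations, split-sample reliability as a comparison of the two groups' average answers, and internal-comparison reliability summarized by Cronbach's alpha.

```python
import numpy as np

rng = np.random.default_rng(1)

# Test/re-test: measure the same 200 people twice and correlate the answers.
test_1 = rng.normal(5.0, 1.0, size=200)
test_2 = test_1 + rng.normal(0.0, 0.3, size=200)   # stable attitudes, small random error
print("test/re-test correlation:", round(np.corrcoef(test_1, test_2)[0, 1], 2))

# Split-sample: randomly split 200 respondents into two groups of 100 and
# check that the groups give similar answers on average.
answers = rng.normal(5.0, 1.0, size=200)
order = rng.permutation(200)
group_a, group_b = answers[order[:100]], answers[order[100:]]
print("group means:", round(group_a.mean(), 2), "vs", round(group_b.mean(), 2))

# Internal comparison: Cronbach's alpha for a multi-item scale
# (rows = respondents, columns = items); higher alpha = more consistent items.
def cronbach_alpha(items):
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1.0 - item_variances / total_variance)

# Four hypothetical store-image items driven by one underlying attitude.
attitude = rng.normal(5.0, 1.0, size=(200, 1))
store_image_items = attitude + rng.normal(0.0, 0.5, size=(200, 4))
print("Cronbach's alpha:", round(cronbach_alpha(store_image_items), 2))
```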
Slide 13
There are several ways to assess measure validity. This slide summarizes most of those
approaches.
Content or face validity is merely an assessment of whether or not scale items are adequately
representative of the underlying construct of interest. Typically, content validity or face validity is
assessed by asking subject-matter experts if the items make sense. For example, if I’m
designing a scale related to retailing, I might find five professors of retailing and ask them if
those items make sense; if they agree the items make sense in this context, then I have content
or face validity.
If I’m interested in predictive validity, also referred to as criterion validity, what I’m trying to
assess is whether or not a measure is predictive of something else I wish to predict. It doesn’t
matter whether or not the measure makes sense from a theoretical standpoint; if a measure is
predictive, then it has predictive validity, although theoretically related measures tend to be
more predictive. Nonetheless, the issue is whether or not one measure tends to predict another
thing that I want to predict. To the degree that one measure predicts that other thing, I have
criterion or predictive validity.
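A minimal sketch of a predictive-validity check, with hypothetical data: suppose a purchase-intention score is being used to predict later spending (both variables and their relationship are invented for illustration). The evidence for criterion validity is simply how strongly the measure correlates with the thing it is supposed to predict.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical: does a purchase-intention measure predict later spending?
purchase_intention = rng.normal(4.0, 1.0, size=300)
later_spending = 20.0 * purchase_intention + rng.normal(0.0, 15.0, size=300)

criterion_r = np.corrcoef(purchase_intention, later_spending)[0, 1]
print("correlation with the criterion:", round(criterion_r, 2))
# The stronger this correlation, the stronger the claim to
# predictive (criterion) validity, regardless of theoretical elegance.
```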
I also might be interested in whether or not my measures properly converge or discriminate
between concepts. If I have a strong sense that I’m measuring the correct thing, then I should
be able to develop alternative measures and the scores on those measures should be
consistent. Think about measuring IQ. There’s an infinite number of ways we can assess
someone’s intelligence. If I have a good handle on what constitutes intelligence, then I ought to
receive consistent scores on different IQ tests for the same person. To the extent the scores are
inconsistent, that suggests a lack of understanding about the underlying notion of intelligence.
I’d like convergent validity: multiple measurements of the same thing should provide consistent
answers. I’d also like to be certain that my different scales measure different things, that is, to
know that I haven’t confused measuring one thing with measuring something else. To the extent
I can guarantee that my scale is measuring one thing and not something else, I’ve achieved
discriminant validity.
Finally, I’d like each measure to be related in a proper theoretical manner to other marketing
constructs. To the extent that a measure is consistent with measures of other marketing
constructs, I’ve achieved nomological validity.
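As a rough sketch of convergent and discriminant checks, again with invented data (the "store image" and "price sensitivity" constructs and all numbers are hypothetical): two alternative measures of the same construct should correlate highly with each other, and much less with a measure of a different construct.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 250

# Hypothetical constructs: two alternative store-image scales plus an
# unrelated price-sensitivity scale.
store_image = rng.normal(0.0, 1.0, size=n)
image_scale_a = store_image + rng.normal(0.0, 0.4, size=n)
image_scale_b = store_image + rng.normal(0.0, 0.4, size=n)
price_sensitivity = rng.normal(0.0, 1.0, size=n)

convergent_r = np.corrcoef(image_scale_a, image_scale_b)[0, 1]
discriminant_r = np.corrcoef(image_scale_a, price_sensitivity)[0, 1]
print("convergent correlation (same construct):", round(convergent_r, 2))       # should be high
print("discriminant correlation (different construct):", round(discriminant_r, 2))  # near zero
```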
Slide 14
There are two other issues worth addressing before ending this lecture; the first is shown by this
cartoon, in which two researchers contemplate whether or not some research results are
reliable, valid, and generalizable. I’ve already discussed reliability and validity. Generalizability is
whether or not the results of a study can be generalized to a larger population or to a broader
set of circumstances. For example, based on experiments that often rely on a few hundred
subjects, researchers try to draw conclusions about how the general population, or at least a
much larger population, might respond. The stronger the evidence that the responses of a few
hundred people are generalizable to a larger population, the more comfortable marketers are
with generalizing the results of an experiment.
Slide 15
The other issue worth considering is the issue of sensitivity. A measure can be reliable, in the
sense that it’s repeatable; it can be valid, in the sense that it measures what was intended and
is consistent with theoretical notions; and it can be generalizable, in the sense that the sample
that was drawn is representative of the larger population in question. In contrast, sensitivity is
whether or not a measure accurately detects differences among people. A measure can be
reliable, valid, and generalizable, but relatively insensitive. Sensitivity is critical for researchers
trying to classify people into different groups, as would be the case for segmentation analysis.