Research Skills for Psychology Majors: Everything You Need to Know to Get Started
Validity
Contents of This Chapter
Validity in Measurement
  Construct Validity
    Face Validity
    Recipe for Squid and Anchovy Pizza (sidebar)
  Content Validity
  Criterion Validity
Validity in Research
  Internal Validity
    Empirical (Defined) (sidebar)
    Threats to Internal Validity
      Manipulation Checks
      Experimenter Effects and Demand Characteristics
      Demand Characteristics
      Placebo Effects
      True Experiment (Reviewed) (sidebar)
      Regression Effects
      Selection (Biased Sampling)
      Carry-Over Effects
      Third-Variable Confounding
  External Validity
    Authoritarian Characteristics (sidebar)
  Ecological Validity
    Social Loafing (Overview) (sidebar)
Reference
Validity, the deeper and more complicated twin of Reliability, is outlined in a
cursory manner in this chapter. Validity is indeed a complex topic, so for a
more complete treatment of the subject an advanced methods book should
be consulted. You may have this opportunity in a later course.
Validity, like reliability, takes many forms and perhaps the term is used too broadly.
Generally, validity is used in two contexts: (1) evaluating the quality of a measurement instrument or method (a true twin of reliability); and (2) evaluating the quality of a research study, especially an experiment (maybe a cousin). This chapter
will cover both uses of the term. These two sides of validity have in common the
question, “Is it really what it says it is?” In other words, validity is an issue in interpreting the real meaning of the research and focuses on the relationships among
what the researcher had in mind from the start, what the researcher did, and what
the outcome really means. Reliability, while certainly necessary, is a less sweeping
concept.
Validity in Measurement
Measurement takes many forms, and although the most common form of measurement in social and behavioral science seems to be the self-report survey
instrument, we measure constructs in a variety of ways. For example, classic
research on interpersonal aggression operationalizes “aggression” by measuring
the number and intensity of electric shocks that a person gives to another. Effort
in group process experiments is assessed by measuring how loudly research
subjects are willing to shout in a soundproofed room. Attitudes can be measured
by determining the diameter of the pupil (the hole in the middle of the eye). All such
measures raise the question: what is really being measured?
Construct Validity
The question, “What is really being measured?” is the central issue of construct
validity. Of all types of measurement validity, construct validity is the most important. A construct is a theoretical concept whose existence means something in
the context of a hypothesis or theory. For example, when we study the effect of
anger on aggression, “anger” and “aggression” are theoretical constructs that must
be operationalized–represented by something we can measure and quantify–in
order to be studied. Constructs, hypotheses, and operationalization are discussed
in detail in chapters 5 and 7.
Face Validity
Whenever a measure is used in research, the first question is
whether or not it is a valid representation of the construct.
Establishing the construct validity of a measure can be very difficult. The easiest way is to rely on face validity: the measure
seems valid “on the face of it,” i.e., it just looks right. For example,
if we want to measure attitudes toward anchovy pizza, the
author’s favorite (but squid is good too, if you can find it in
America), we might ask,
Anchovy pizza tastes (check one):
Delicious ___ ___ ___ ___ ___ ___ ___ Awful
[Image: Anchovy pizza]
Often face-valid measures like this one have sufficient construct
validity.
A more complex method for assessing construct validity is to
compare the measure in question to other measures that are
supposed to measure the same thing, and to measures that
are not supposed to measure the same thing. A valid measure
should evidence high correlations with the former, and low
correlations with the latter. For example, the anchovy pizza
measure should show a moderate correlation with a measure
of appreciation for seafood, one would suppose. Another approach is to administer the measure to groups of people who, based on some other research, are
expected ahead of time to differ in a certain direction on the measure. Italians
like the author should appreciate anchovies and squid more than people from boring northern places. If the results don’t come out as expected, the measure may
have a problem.

Recipe for Squid and Anchovy Pizza
A recipe for squid and anchovy pizza (in Japanese and English):
http://www.takenet.or.jp/~francois/rect20j.html
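To make this logic concrete, here is a minimal sketch in Python of such a convergent/discriminant check. The scores and the two comparison measures (seafood appreciation as the “should correlate” measure, extraversion as the “should not correlate” measure) are invented for illustration, not taken from any real study:

    # Convergent/discriminant validity check with invented data (Python 3.10+).
    from statistics import correlation

    anchovy   = [7, 6, 2, 5, 1, 6, 3, 4]   # attitude toward anchovy pizza, 1-7
    seafood   = [6, 7, 3, 5, 2, 5, 2, 4]   # seafood appreciation (same people)
    extravert = [4, 5, 3, 4, 5, 3, 4, 5]   # an unrelated trait (same people)

    # A valid anchovy-pizza measure should correlate highly with the first
    # comparison measure and weakly with the second.
    print(f"Convergent:   r = {correlation(anchovy, seafood):.2f}")
    print(f"Discriminant: r = {correlation(anchovy, extravert):.2f}")

With these toy numbers the convergent correlation comes out high and the discriminant correlation near zero, the pattern a construct-valid measure should show.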
Content Validity
Content validity is closely related to construct validity. Sometimes it is important
that a measure assess a sufficiently broad or comprehensive range of the parts
or components of a construct. Some constructs are very simple, like “attitude
toward anchovy pizza” while others are complex and multifaceted, like “knowledge
of psychology research methods.” To be valid, a measure of a complex construct
must be sufficiently comprehensive, picking up most of its parts or components.
This is another way of saying that sometimes a measure can’t be considered to
have construct validity unless it meets this “comprehensiveness criterion.”
For example, a measure of a knowledge domain such as “psychology research
methods,” would not have construct validity unless its content were valid, i.e., assessed a broad range of all the knowledge and skills involved in doing psychological
research. A final exam in such a course that focused only on research design but
ignored all the other aspects of methods would lack content validity due to
insufficient breadth. It would therefore lack construct validity because the construct
“knowledge of research methods” would not be adequately covered. Students
refer to tests like this as “stupid” and “unfair.”
Criterion Validity
Another way to judge the validity of a measure is whether it does a good job of
predicting something with which it ought to be related, termed a criterion. A
criterion can be just about anything that is, itself, high in construct validity. For example, we might want to predict the starting salary of new psychology graduates,
a variable that indicates “success” to many people. Starting salary is the criterion.
The validity of a measure can be assessed by how well it predicts this criterion. In
this example, we might be interested in Grade Point Average as the predictor measure. The criterion validity of GPA is evaluated on the basis of how well it predicts
salary just after graduation. Predicting a criterion from a previously measured
variable like GPA is termed predictive validity. When the predictor and the
criterion are assessed at the same time, the term concurrent validity is used.
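Computationally, criterion validity is just the correlation between predictor and criterion. Here is a minimal sketch with invented numbers for eight hypothetical graduates (no real data):

    # Predictive validity: correlate an earlier predictor (GPA) with a
    # later criterion (starting salary). Data are invented (Python 3.10+).
    from statistics import correlation

    gpa    = [2.8, 3.1, 3.4, 3.6, 3.9, 2.5, 3.0, 3.7]   # measured before graduation
    salary = [38, 41, 45, 47, 52, 35, 40, 50]           # starting salary, $1000s

    # The validity coefficient: how well GPA predicts the criterion.
    print(f"Predictive validity of GPA: r = {correlation(gpa, salary):.2f}")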
Criterion validity depends on the practical value of the measure more so than
its theoretical value. In other words, the measure is only as good as its ability to
predict something interesting or useful. Construct validity may in fact be poor in
a successful, criterion-valid measure because sometimes we don’t have to know
exactly what construct the measure is assessing for it to be useful in predicting a
criterion. If animals consistently howl more just before an earthquake, we need
not know why; we just have to get out of the house with our clothes on.
The author’s youthful attempts to build equations that predict the outcomes of
horse races are a discouraging case in point. The idea here was to (1) measure
everything about every horse’s history (taken from racing forms) then (2)
use statistical methods to construct an equation to represent what works
best to predict the race outcomes, then (3) use the equation as a guide
for wagering, then (4) get rich quick with little effort. Just exactly what the
predictors measure, in the context of how fast a horse can fly, is not considered in this sort of fantasy and therefore there really are no “constructs” on
the predictor side. Lacking constructs and a theory, there is no way to, for
example, predict how the equation must change if, say, it rains just before the
race. Don’t try this at home.
[Illustration: Flying horses]
Validity in Research
The question, “What does it really mean?” is as pertinent to the research itself as
it is to the measures employed in the research. Technically, validity in measurement is a component of the overall validity of the research itself. If the measures
are not valid, then the research cannot be valid. Two broad types of validity are of
concern in research: internal validity and external validity. Internal validity concerns the meaning of the observed relationships between variables, i.e., can we be
certain that the relationship we see is really what we think it is? External validity
concerns the extent to which the research results can be generalized beyond a
research study. Valid research must work in more than one setting, with other
types of subjects, using other operationalizations of the constructs, and in the
“real” world.
Internal Validity
The goal of research in social and behavioral science is to build
and test theories and models through empirical methods. Ideally, the empirical methods take the form of experiments in
which clear cause and effect relationships can potentially be
demonstrated, while some researchers must be content to
simply observe relationships between variables. In both cases,
the essential goal is to determine if the observed relationships
mean what they are supposed to mean. In an experiment,
where variables are manipulated and controlled in such a way
that the independent variable can be said to cause the dependent variable (see chapter 10), internal validity is achieved when
the researcher is confident that three conditions have been
met:
1. The operationalization of the independent variable–the causal variable–has
construct validity, i.e., it was done correctly and means what the theory says
it should mean.
2. The operationalization of the dependent variable has construct validity, as
described in the previous section.
3. The IV clearly is responsible for the observed change in the DV; the DV’s
relationship to the IV cannot be explained in some other way.
Conditions (1) and (2) are essentially the same: the researcher must prove that
his or her manipulated IV has construct validity just as the DV, a measured variable,
must be valid. Condition (3) is something else.

Empirical (Defined)
“Empirical” means operationalizing constructs then collecting data, the preferred way of knowing in modern science. An “empiricist” is a scientist or philosopher who believes in such things. At one time this was a new idea, but it is now the accepted norm in science. “Non-empirical” means several different things, such as developing theories using rational thought, informal observation of life, intuition, etc.
Threats to Internal Validity
Beginning at least with the brilliant work of the methodologist, philosopher of
science, and cross-cultural psychologist Donald Campbell in the mid-20th century, we evaluate the internal and external validity of research in terms of the extent
to which it successfully avoids the many “threats” to validity. Campbell and others
identified many threats to validity, only some of which will be covered here. These
threats, as a whole, are often referred to as confounding variables, confounding
effects, or simply confounds. The story of empirical research is the struggle against
confounds.
Generally, a confound is a sort of contamination. The hoped-for cause-effect relationship between the IV and the DV (or between two DVs in a non-experimental
design) is compromised. In an experiment, this contamination can take the form
of the IV losing construct validity or of the effect of the IV on the DV being due
to something besides the IV. These situations are explained and illustrated in the
following sections.
The most common type of confound in experimental research concerns the
construct validity of the IV. Is the IV really operationalizing the correct theoretical construct, or does it mean something else (or a combination of these two
possibilities)? Take for example research on embarrassment in social psychology.
Experiments have been performed in which the IV is a manipulation of embarrassment: some subjects are led to be embarrassed, and some are not. Americans
don’t like to sing in public, so in American research we can embarrass people by
making them sing in the presence of others. In some studies, they have been asked
to sing the Star Spangled Banner with its problematic high notes. The question we
must deal with is whether or not singing is really embarrassing, and whether or
not singing vs. not singing is also manipulating some other construct that would
contaminate the meaning of the experiment.
Manipulation Checks
We try to check the former possibility by including a manipulation check in the
experiment, a measure of the manipulation’s effectiveness. In this case, we might
ask the subjects afterwards if they were embarrassed. Of course, one could
question the construct validity of this measure, too. Maybe people are unwilling
to report their embarrassment (because they’re embarrassed about being embarrassed), or maybe it’s subtle or “unconscious.” The latter possibility, in which we
unwittingly manipulate both the intended construct and something else on top of
it, is more complicated. What if singing versus not singing the Star Spangled Banner
manipulates both embarrassment and patriotism? Which of these two psychological
states was responsible for the effects on the DV? Good and lucky researchers
anticipate this sort of problem and arrange their manipulations from the start
to avoid confounds, but most often someone else catches the problem after the
study has been conducted (even published). So how would you fix this study?
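In practice, a manipulation check is analyzed by comparing the check measure across conditions. A minimal sketch with invented ratings (the scale and the numbers are hypothetical):

    # A manipulation check with invented ratings: after the study, subjects
    # answer "How embarrassed were you?" on a 1-7 scale.
    from statistics import mean, stdev

    sing    = [6, 5, 7, 6, 5, 6]   # singing (embarrassment) condition
    no_sing = [2, 3, 2, 1, 3, 2]   # no-singing (control) condition

    print(f"Singing: M = {mean(sing):.1f}, SD = {stdev(sing):.1f}")
    print(f"Control: M = {mean(no_sing):.1f}, SD = {stdev(no_sing):.1f}")
    # If the singing group does not report more embarrassment than the
    # control group, the manipulation (or the check itself) is suspect.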
There are several other threats to the internal validity of experimental and non-experimental research. Some of these are summarized in the following sections.
Experimenter Effects and Demand Characteristics
Careless researchers can confound an experiment by acting differently toward
subjects (people, rats, etc.) in different conditions of the study. For example,
in the embarrassment study described previously, it would be a problem if the
experimenter were to be embarrassed by the subjects’ singing. If this happened, it
would be hard to know if the effect of the IV on the DV were due to the subjects’
embarrassment or to the researcher’s awkward behavior in the singing condition.
Keeping the experimenter as far away from the workings of the experiment as
possible, and trying to keep him/her ignorant of which condition of the experiment is currently being run for as long as possible, are common ways to reduce
this problem. If the researcher knew the hypothesis of the study, he or she might
unconsciously act differently in the various experimental conditions, biasing the
results. These experimenter expectancy effects have been found to occur
even in rat research. One solution is to not tell the experimenter the hypothesis.
Demand Characteristics
Features of the experimental setting that cue the subjects as to what they think
is expected of them are called demand characteristics, perhaps because the
setting “demands” something of the subject. In order to please the experimenter,
save face, or avoid embarrassment, the subjects may decide to do the right thing as
they see it. Whether or not the subjects are correct in their assumptions about
what the experiment is about or what they “ought to” do, their behavior becomes
a function of this assumption rather than of the IV itself. If subjects in the embarrassment study decided that this must surely be an experiment on patriotism, then
they might decide to act as patriotically as possible throughout the study just to
please the experimenter. Sometimes subjects act in accordance with what the
hypothesis actually predicted, producing a successful experiment for the wrong
reason, and sometimes they get it backwards, producing an unsuccessful experiment for the wrong reason. The former situation is potentially more problematic
for the researcher’s career.
Placebo Effects
Placebo effects are a close relative of demand characteristics
but are usually associated with experimental manipulations
involving IVs that are supposed to produce a medical benefit for the subject,
such as a drug. In a placebo effect, the subject makes a conscious or unconscious assumption about the outcome of the
treatment or drug, then consciously or unconsciously gets better. The actual processes by which this occurs are not entirely
clear. Medical researchers are wise to this problem, and to that
of experimenter expectancies, so they use double-blind procedures in which neither the experimenters nor the patients are
aware of what condition the patients are in. Hence the famous
“sugar pill placebo” that the control group receives.
True Experiment (Reviewed)
A true experiment has three characteristics:
1. Random selection of subjects from the target population
2. Random assignment of subjects to conditions
3. Manipulation of IVs and a high level of control over the situation
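Criterion 2 is simple to carry out. A minimal sketch in Python (the subject labels are invented):

    # Random assignment (criterion 2): let chance, not the researcher,
    # decide who goes in which condition. Subject labels are invented.
    import random

    subjects = [f"S{i}" for i in range(1, 21)]   # twenty hypothetical subjects
    random.shuffle(subjects)

    half = len(subjects) // 2
    treatment, control = subjects[:half], subjects[half:]
    print("Treatment:", treatment)
    print("Control:  ", control)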
Regression Effects
Some of the threats to internal validity are more common in non-experimental
research, such as quasi-experimental, differential, and correlational studies. These
designs were presented in Chapter 4. The advantage of true experiments is that
they can, almost by definition, avoid some of these threats. However, true experiments are often impossible to perform and researchers must settle for quasi-experiments. Quasi-experiments violate one or more of the criteria for a true
experiment (see sidebar).
Regression effects can occur when the first or second criterion for a true experiment is not met. For example, if your research concerns alternate methods
for training a skill, it should ideally begin with two randomly assigned groups:
the group that gets your training procedure and a control group. But in the real
world, perhaps you must start training the least-skilled people first, so based on
a pretest of the whole sample these people are put in your treatment group.
After the training procedure, you may well find greater improvement in the
treatment group. Was your procedure successful? Maybe not. Perhaps the
low-skill people placed in the treatment group tended to test lower on the pretest
due to non-skill-related factors (having a bad day, bad luck, etc.), and vice versa
for the control group, so on the post-test after your training procedure they
tested closer to their average skill level (and vice versa). The scores of the two
groups have “regressed toward the mean.”

[Figure: Skill level (low to high) of the training group and the no-training control group across four steps: Step 1: Select Sample; Step 2: Measure Skills, Assign to Conditions; Step 3: Manipulation; Step 4: Measure Skills. The two groups’ scores converge toward the mean from pretest to post-test.]
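Regression toward the mean is easy to demonstrate by simulation. Here is a minimal sketch with made-up parameters in which the “training” does nothing at all, yet the low-pretest group appears to improve:

    # Regression toward the mean, simulated: each person has a stable true
    # skill, and each test adds random "luck." The training itself does
    # NOTHING, yet the low-pretest group appears to improve.
    import random

    random.seed(1)
    true_skill = [random.gauss(50, 10) for _ in range(1000)]
    pretest    = [s + random.gauss(0, 10) for s in true_skill]
    posttest   = [s + random.gauss(0, 10) for s in true_skill]  # luck re-rolls

    # Assign the lowest-scoring half of the pretest to "training."
    order = sorted(range(1000), key=lambda i: pretest[i])
    trained, control = order[:500], order[500:]

    def mean_gain(group):
        return sum(posttest[i] - pretest[i] for i in group) / len(group)

    print(f"'Trained' group gain: {mean_gain(trained):+.1f}")   # comes out positive
    print(f"Control group gain:   {mean_gain(control):+.1f}")   # comes out negative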
Selection (Biased Sampling)
Regression effects are one of the many ways in which sampling can be biased.
Whenever criterion two of a true experiment is violated, the researcher can’t
be sure if the observed outcome of the study is due to the IV or to some pre-existing difference in the subjects who are placed in the experimental conditions.
Quasi-experiments frequently violate this criterion. A study of the effects of a
drug prevention program in high schools might have to use intact samples, such as
homerooms or whole schools in different conditions of the experiment. School
A might be given the treatment (drug prevention program) and school B might
serve as the control group. This study violates criteria one and two (and possibly criterion three to some extent) and is therefore a quasi-experiment. If the
two schools are different in any way (which is likely), the outcome of the study is
hard to interpret—the IV is contaminated with a selection bias. Procedures have
been developed to check for the presence of this contamination and to reduce its
impact, making important research of this kind feasible.
Selection bias can be subtle and unexpected. For example, an error in random
assignment could be made in a true experiment. Studies of social interaction
sometimes observe people interacting in a naturalistic setting, videotaping then
analyzing the interaction. A study might look at how some aspect of the interaction–topic of conversation, for example–affects how talkative people are. Participants are randomly assigned to discuss topic A (why anchovy pizza is delicious)
or topic B (why tuition is too high because the faculty are getting rich). So far so
good, but an unwitting experimenter might schedule groups to discuss topic A in
the morning and topic B in the afternoon, just to keep things simple. If the results
show that topic A produces less discussion than topic B, is it because (a) the topics
themselves affect interaction; (b) people are drowsy and less communicative in the
morning; (c) people are less hungry in the morning; (d) all of the above? In this
example, the selection bias is not in the intrinsic characteristics of the subjects, but
rather in their temporary states of mind when they take part in the study.
Carry-Over Effects
Some research designs require the subjects to experience something more than
once, such as repeated applications of the IV or repeated measurement of the DV.
Termed repeated measures or pre-post designs, these designs were introduced
in the Basic Research Design chapter. For example, an undergraduate thesis
performed at Florida Tech a few years ago (Rivero, 1993) examined the effect of
stress on errors in flying a flight simulator. (“Error” means “crash.”) Subjects
were students in the School of Aeronautics flight program. In this type of study,
participants experience both conditions of the experiment, stress and no-stress,
over a long series of “trials” (in this case, each trial was a landing). The problem
with this kind of research is that the effects of experiencing one condition of the
design (stress or no-stress) might affect or carry over to the other condition.
For example, would the participant fly better in a no-stress condition right after
crashing the plane in the stress condition? Or vice versa? Carry-over effects make it
difficult to know if the IV is more (or less) effective than it would have been if the
carry-over had not taken place.
What is the solution? The cleanest fix is to use a between-groups design in which
some subjects experience only the stress condition throughout the experiment,
and others experience only the no-stress condition. This is also a last resort, because it means using twice as many subjects. Another solution is to carefully try
all combinations of condition orders to see if the order makes a difference, a
technique known as counterbalancing. All repeated measures designs use this
technique when possible.
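A minimal sketch of counterbalancing in Python, assuming the two conditions of the flight-simulator study are presented in blocks (condition and subject names are invented):

    # Counterbalancing: enumerate every order of the conditions and rotate
    # subjects through them, so order effects cancel across the sample.
    from itertools import permutations

    conditions = ["stress", "no-stress"]
    orders = list(permutations(conditions))   # all possible orders

    subjects = ["S1", "S2", "S3", "S4"]
    for i, subject in enumerate(subjects):
        assigned = orders[i % len(orders)]    # alternate orders across subjects
        print(subject, "->", " then ".join(assigned))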
Third-Variable Confounding
Two common internal validity problems are encountered in correlational and
differential designs. First, the study may lose internal validity when one of the
variables is confounded with another variable “outside the model.” “Outside the
model” refers to variables that the researcher had not thought were relevant in
the study, either explicitly in her hypothesis or implicitly in her understanding of
the research area. This is another way of saying that there is a problem with the
construct validity of the variable. For example, a differential-design study that
examines the relationship between culture and values would have to deal with
the perennial problem that “culture” is as confounded as it is fascinating. When
two cultures are compared on a value dimension, there are so many differences
between the cultures that no single construct can be identified
as responsible for an observed value difference. Is it because
the cultures differ in wealth? Child-rearing practices? History
of internal warfare? Modernization? The list of confounded
variables is endless, and culture is sometimes referred to as the
“Global X.” Hence, good cross-cultural research is much more
complex than a two-variable comparison and involves multiple
cultures.
Second, internal validity is compromised if a third variable not
initially considered in the model accounts for the relationship
between the variables. For example, research has demonstrated a positive relationship between the personality dimension
Authoritarianism (see sidebar) and racial prejudice. Perhaps
this relationship is a simple one as illustrated in model A.
Authoritarian Characteristics
1. Concern with power and toughness
2. Dislike of tender-mindedness, psychological analysis
3. Conventional values rigidly followed
4. Condemn and reject others who don’t follow conventional values
5. Overly concerned with others’ sexuality
6. Hostility
7. Projecting own negative emotional impulses on others
8. Submission and conformity to idealized moral authorities
9. Fatalism, superstition, rigid categorical thinking

[Figure A: Simple relationship between two variables. A curved line connects Authoritarianism and Prejudice.]
[Figure B: Third-variable confounding. Social Class has separate paths to both Authoritarianism and Prejudice.]

In model A, the curved line indicates a relationship in which the
causal direction is not known or assumed. In fact, most psychologists would assume that Authoritarianism affects or leads
to prejudice, but model B indicates a different possibility. Social
class has been shown to be related to both Authoritarianism
and to prejudice, such that middle-class people are lower on
both. Social class is a differential variable whose relationships to Authoritarianism and to prejudice can be traced to several
types of parental influences as well as to adult experiences in society.
In model B, social class is the third variable, and the relationship in
model A is said to be “spurious,” or a “spurious correlation.”
The correlation between Authoritarianism and prejudice is actually
due to their common relationships with social class. More than one
spurious variable may be present.
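The logic of model B can be probed statistically with a partial correlation, which removes the third variable’s contribution. A minimal sketch with invented scores for ten people (not real data on these constructs):

    # Probing a possibly spurious correlation with invented scores:
    # authoritarianism (A), prejudice (P), social class (C).
    # (Python 3.10+ for statistics.correlation.)
    from statistics import correlation
    import math

    C = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]   # social class
    A = [8, 7, 7, 6, 5, 6, 4, 3, 2, 3]   # authoritarianism
    P = [9, 8, 7, 7, 6, 5, 4, 4, 3, 2]   # prejudice

    r_AP, r_AC, r_PC = correlation(A, P), correlation(A, C), correlation(P, C)

    # Standard partial-correlation formula: the A-P relationship that
    # remains after the common ties to C are removed.
    r_AP_C = (r_AP - r_AC * r_PC) / math.sqrt((1 - r_AC**2) * (1 - r_PC**2))
    print(f"r(A,P) = {r_AP:.2f}; r(A,P | C) = {r_AP_C:.2f}")

In this toy data set the strong simple correlation between A and P largely disappears once C is held constant, which is the statistical signature of a spurious correlation.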
External Validity
A valid research study, regardless of the design employed, must generalize to other
people, measures, times, and places. Because research is conducted to build and
test theories and models, a study that only works with a certain kind of sample
and one way of operationalizing each construct is not very useful. A robust theory
demands wide generalization. External validity is the extent to which a study’s
findings can be generalized. Logically, external validity belongs to the theory, not
to the study, but by convention we use the term to apply to specific research studies.
The first place we look for external validity is in the sampling. Would the results
of the study be the same if a different target population were used? Most research
in American psychology is conducted on white middle-class college sophomores
enrolled in Introductory Psychology courses, an observation that enrages cross-cultural psychologists, most of whom have had extensive experience in other
cultures. Can the results of this research generalize to “normal” people, such
as working class adults? African Americans? Chinese living 100km from a paved
road? Yanamamo Indians engaged in inter-village raiding and wife-stealing?
The second place to look is the operationalization of the variables, both IVs and DVs. A masterful attempt to generalize the
operationalization of variables was performed by Bibb Latané
and his colleagues in their development of Social Impact Theory
and one of its most visible phenomena, social loafing (see
sidebar). From the core finding, they applied their
theory to the performance of swimmers and runners, the quality of Beatles songs, generating ideas (brainstorming), and others. They generalized the sampling to children, Japanese Honda
managers, Taiwanese and Malaysian school children, and kids
who grew up in Melbourne, Florida. Tasks involving producing
a lot of something such as loudness (termed maximizing tasks)
and tasks involving cognitive skills such as counting (optimizing
tasks) were employed. Overall, robust support for the social
loafing effect was found, while interacting factors and cultural
differences were identified.
Social Loafing (Overview)
The core finding in the social loafing literature is that as the number of people responsible for performing a task increases, the effort each person puts into the task decreases. If this reduction in effort is great enough, the task outcome declines as the number of workers increases. Several task variables moderate this effect, such as the extent to which each person has a sense of personal responsibility for the success of the group effort.
In the earliest social loafing research, college students were asked to shout as loudly as possible in groups or alone. Their shouting “performance” was assessed using a sound level meter. The researchers found that, as group size increased, individuals’ loudness decreased.

Ecological Validity
The third place to look for generalization is the extent to
which the research will work the same way in the real world, that is, in natural
settings. Research that only works in laboratory settings and either cannot be
replicated in the real world or has no real-world analog is said to lack ecological
validity. For example, if social loafing only occurred in the laboratory settings
where it was first researched–special soundproofed rooms, subjects wearing
blindfolds and heavy earphones–it would not be ecologically valid and would not
have caught people’s attention. However, when it was shown that swimmers swam
faster in individual than in relay events, and that managers in the vaunted Honda
Motor Company loafed almost as much as American college students, the research
gained both attention and ecological validity. Latané won a prestigious scientific
award. The author got a job.
Reference
Rivero, M. (1993). Effect of alternate warning systems on gear-up landings in a flight
simulator. Undergraduate thesis, Florida Institute of Technology, Melbourne, FL.