Research Skills for Psychology Majors: Everything You Need to Know to Get Started

Validity

Contents of This Chapter

Validity in Measurement
    Construct Validity
        Face Validity
        Recipe for Squid and Anchovy Pizza
    Content Validity
    Criterion Validity
Validity in Research
    Internal Validity
        Empirical (Defined)
    Threats to Internal Validity
        Manipulation Checks
        Experimenter Effects and Demand Characteristics
        Demand Characteristics
        Placebo Effects
        True Experiment (Reviewed)
        Regression Effects
        Selection (Biased Sampling)
        Carry-Over Effects
        Third-Variable Confounding
    External Validity
        Authoritarian Characteristics
        Ecological Validity
        Social Loafing (Overview)
Reference

Validity, the deeper and more complicated twin of Reliability, is outlined in a cursory manner in this chapter. Validity is indeed a complex topic, so for a more complete treatment of the subject an advanced methods book should be consulted. You may have this opportunity in a later course. Validity, like reliability, takes many forms, and perhaps the term is used too broadly. Generally, validity is used in two contexts: (1) evaluating the quality of a measurement instrument or method (a true twin of reliability); and (2) evaluating the quality of a research study, especially an experiment (maybe a cousin).
This chapter will cover both uses of the term. These two sides of validity have in common the question, "Is it really what it says it is?" In other words, validity is an issue in interpreting the real meaning of the research, and it focuses on the relationships among what the researcher had in mind from the start, what the researcher did, and what the outcome really means. Reliability, while certainly necessary, is a less sweeping concept.

Validity in Measurement

Measurement takes many forms, and although the most common form of measurement in social and behavioral science seems to be the self-report survey instrument, we measure constructs in a variety of ways. For example, classic research on interpersonal aggression operationalizes "aggression" by measuring the number and intensity of electric shocks that a person gives to another. Effort in group process experiments is assessed by measuring how loudly research subjects are willing to shout in a sound-proofed room. Attitudes can be measured by determining the diameter of the pupil (the hole in the middle of your eye). All such measures beg the question: what is really being measured?

Construct Validity

The question, what is really being measured? is the central issue of construct validity. Of all types of measurement validity, construct validity is the most important. A construct is a theoretical concept whose existence means something in the context of a hypothesis or theory. For example, when we study the effect of anger on aggression, "anger" and "aggression" are theoretical constructs that must be operationalized–represented by something we can measure and quantify–in order to be studied. Constructs, hypotheses, and operationalization are discussed in detail in chapters 5 and 7.

Face Validity

Whenever a measure is used in research, the first question is whether or not it is a valid representation of the construct. Establishing the construct validity of a measure can be very difficult. The easiest way is to rely on face validity: the measure seems valid "on the face of it," i.e., it just looks right. For example, if we want to measure attitudes toward anchovy pizza, the author's favorite (but the squid is good too, if you can find it in America), we might ask:

Anchovy pizza tastes (check one):
Delicious ___ ___ ___ ___ ___ ___ ___ Awful

[Sidebar] Recipe for Squid and Anchovy Pizza
Recipe for Squid and Anchovy Pizza (in Japanese and English): http://www.takenet.or.jp/~francois/rect20j.html

Often face-valid measures like this one have sufficient construct validity. A more complex method for assessing construct validity is to compare the measure in question to other measures that are supposed to measure the same thing, and to measures that are not supposed to measure the same thing. A valid measure should evidence high correlations with the former and low correlations with the latter. For example, the anchovy pizza measure should show a moderate correlation with a measure of appreciation for seafood, one would suppose. Another approach is to administer the measure to groups of people who, based on some other research, are expected ahead of time to differ in a certain direction on the measures. Italians like the author should appreciate anchovies and squid more than people from boring northern places. If the results don't come out as expected, the measure may have a problem.
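The logic of these convergent and discriminant comparisons is easy to see in a short simulation. The sketch below is not from this chapter: the measures, sample size, and effect sizes are all invented, and Python with NumPy is used only because it is convenient for the arithmetic. A new measure should correlate substantially with a related measure (convergent) and weakly with an unrelated one (discriminant).

    import numpy as np

    rng = np.random.default_rng(seed=1)
    n = 200

    # A latent taste for strong seafood flavors (unobservable in real life).
    seafood_liking = rng.normal(0, 1, n)

    # The new anchovy-pizza measure and an established seafood-appreciation
    # measure both tap that latent taste, plus measurement noise.
    anchovy_scale = seafood_liking + rng.normal(0, 1, n)
    seafood_scale = seafood_liking + rng.normal(0, 1, n)

    # An unrelated measure (attitude toward tuition) shares nothing with it.
    tuition_scale = rng.normal(0, 1, n)

    convergent = np.corrcoef(anchovy_scale, seafood_scale)[0, 1]
    discriminant = np.corrcoef(anchovy_scale, tuition_scale)[0, 1]

    print(f"convergent r (should be moderate to high): {convergent:.2f}")
    print(f"discriminant r (should be near zero): {discriminant:.2f}")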
Content Validity

Content validity is closely related to construct validity. Sometimes it is important that a measure assess a sufficiently broad or comprehensive range of the parts or components of a construct. Some constructs are very simple, like "attitude toward anchovy pizza," while others are complex and multifaceted, like "knowledge of psychology research methods." To be valid, a measure of a complex construct must be sufficiently comprehensive, picking up most of its parts or components. This is another way of saying that sometimes a measure can't be considered to have construct validity unless it meets this "comprehensiveness criterion." For example, a measure of a knowledge domain such as "psychology research methods" would not have construct validity unless its content were valid, i.e., unless it assessed a broad range of all the knowledge and skills involved in doing psychological research. A final exam in such a course that only focused on research design but ignored all the other aspects of methods would lack content validity for lack of sufficient breadth. It would therefore lack construct validity because the construct "knowledge of research methods" would not be adequately covered. Students refer to tests like this as "stupid" and "unfair."

Criterion Validity

Another way to judge the validity of a measure is whether it does a good job predicting something to which it ought to be related, termed a criterion. A criterion can be just about anything that is, itself, high in construct validity. For example, we might want to predict the starting salary of new psychology graduates, a variable that indicates "success" to many people. Starting salary is the criterion. The validity of a measure can be assessed by how well it predicts this criterion. In this example, we might be interested in Grade Point Average as the predictor measure. The criterion validity of GPA is evaluated on the basis of how well it predicts salary just after graduation. Predicting a criterion from a previously measured variable like GPA is termed predictive validity. When the predictor and the criterion are assessed at the same time, the term concurrent validity is used.

Criterion validity depends on the practical value of the measure more so than its theoretical value. In other words, the measure is only as good as its ability to predict something interesting or useful. Construct validity may in fact be poor in a successful, criterion-valid measure, because sometimes we don't have to know exactly what construct the measure is assessing for it to be useful in predicting a criterion. If animals howl more just before an earthquake, consistently, we need not know why; we just have to get out of the house with our clothes on. The author's youthful attempts to build equations that predict the outcome of horse races are a discouraging case in point. The idea here was to (1) measure everything about every horse's history (taken from racing forms), then (2) use statistical methods to construct an equation to represent what works best to predict the race outcomes, then (3) use the equation as a guide for wagering, then (4) get rich quick with little effort. Just exactly what the predictors measure, in the context of how fast a horse can fly, is not considered in this sort of fantasy, and therefore there really are no "constructs" on the predictor side. Lacking constructs and a theory, there is no way to, for example, predict how the equation must change if, say, it rains just before the race. Don't try this at home.
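A small simulation can make the predictive validity idea concrete. The sketch below is not from the chapter; the GPA and salary numbers are fabricated purely for illustration, and a single correlation stands in for a real validation study.

    import numpy as np

    rng = np.random.default_rng(seed=2)
    n = 150

    # Predictor measured before graduation.
    gpa = np.clip(rng.normal(3.0, 0.5, n), 0.0, 4.0)

    # Criterion measured later: salary depends modestly on GPA plus noise.
    salary = 30000 + 4000 * gpa + rng.normal(0, 6000, n)

    predictive_validity = np.corrcoef(gpa, salary)[0, 1]
    print(f"predictive validity of GPA for salary: r = {predictive_validity:.2f}")

Had the predictor and criterion been measured at the same time, the same correlation would be read as concurrent validity instead.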
Validity in Research

The question, What does it really mean? is as pertinent to the research itself as it is to the measures employed in the research. Technically, validity in measurement is a component of the overall validity of the research itself: if the measures are not valid, then the research cannot be valid. Two broad types of validity are of concern in research: internal validity and external validity. Internal validity concerns the meaning of the observed relationships between variables, i.e., can we be certain that the relationship we see is really what we think it is? External validity concerns the extent to which the research results can be generalized beyond a research study. Valid research must work in more than one setting, with other types of subjects, using other operationalizations of the constructs, and in the "real" world.

Internal Validity

The goal of research in social and behavioral science is to build and test theories and models through empirical methods (see sidebar). Ideally, the empirical methods take the form of experiments in which clear cause and effect relationships can potentially be demonstrated, while some researchers must be content to simply observe relationships between variables. In both cases, the essential goal is to determine if the observed relationships mean what they are supposed to mean.

[Sidebar] Empirical (Defined)
"Empirical" means operationalizing constructs then collecting data, the preferred way of knowing in modern science. An "empiricist" is a scientist or philosopher who believes in such things. At one time this was a new idea, but it is now the accepted norm in science. "Nonempirical" means several different things, such as developing theories using rational thought, informal observation of life, intuition, etc.

In an experiment, where variables are manipulated and controlled in such a way that the independent variable can be said to cause the dependent variable (see chapter 10), internal validity is achieved when the researcher is confident that three conditions have been met:

1. The operationalization of the independent variable–the causal variable–has construct validity, i.e., it was done correctly and means what the theory says it should mean.
2. The operationalization of the dependent variable has construct validity, as described in the previous section.
3. The IV clearly is responsible for the observed change in the DV; the DV's relationship to the IV cannot be explained in some other way.

Conditions (1) and (2) are essentially the same: the researcher must prove that his or her manipulated IV has construct validity just as the DV, a measured variable, must be valid. Condition (3) is something else.

Threats to Internal Validity

Beginning at least with the brilliant work of the methodologist, philosopher of science, and cross-cultural psychologist Donald Campbell during the mid-20th century, we evaluate the internal and external validity of research in terms of the extent to which it successfully avoids the many "threats" to validity. Campbell and others identified many threats to validity, only some of which will be covered here. These threats, as a whole, are often referred to as confounding variables, confounding effects, or simply confounds. The story of empirical research is the struggle against confounds.

Generally, a confound is a sort of contamination: the hoped-for cause-effect relationship between the IV and the DV (or between two DVs in a non-experimental design) is compromised.
In an experiment, this contamination can take the form of the IV losing construct validity, or of the effect of the IV on the DV being due to something besides the IV. These situations are explained and illustrated in the following sections.

The most common type of confound in experimental research concerns the construct validity of the IV. Is the IV really operationalizing the correct theoretical construct, or does it mean something else (or a combination of these two possibilities)? Take for example research on embarrassment in social psychology. Experiments have been performed in which the IV is a manipulation of embarrassment: some subjects are led to be embarrassed, and some are not. Americans don't like to sing in public, so in American research we can embarrass people by making them sing in the presence of others. In some studies, they have been asked to sing the Star Spangled Banner, with its problematic high notes. The question we must deal with is whether or not singing is really embarrassing, and whether or not singing vs. not singing is also manipulating some other construct that would contaminate the meaning of the experiment.

Manipulation Checks

We try to check the former possibility by including a manipulation check in the experiment, a measure of the manipulation's effectiveness. In this case, we might ask the subjects afterwards if they were embarrassed. Of course, one could question the construct validity of this measure, too. Maybe people are unwilling to report their embarrassment (because they're embarrassed about being embarrassed), or maybe it's subtle or "unconscious."

The latter possibility, in which we unwittingly manipulate both the intended construct and something else on top of it, is more complicated. What if singing versus not singing the Star Spangled Banner manipulates both embarrassment and patriotism? Which of these two psychological states was responsible for the effects on the DV? Good and lucky researchers anticipate this sort of problem and arrange their manipulations from the start to avoid confounds, but most often someone else catches the problem after the study has been conducted (even published). So how would you fix this study?

There are several other threats to the internal validity of experimental and non-experimental research. Some of these are summarized in the following sections.

Experimenter Effects and Demand Characteristics

Careless researchers can confound an experiment by acting differently toward subjects (people, rats, etc.) in different conditions of the study. For example, in the embarrassment study described previously, it would be a problem if the experimenter were to be embarrassed by the subjects' singing. If this happened, it would be hard to know if the effect of the IV on the DV were due to the subjects' embarrassment or to the researcher's awkward behavior in the singing condition. Keeping the experimenter as far away from the workings of the experiment as possible, and trying to keep him/her ignorant of which condition of the experiment is currently being run for as long as possible, are common ways to reduce this problem. If the researcher knew the hypothesis of the study, he or she might unconsciously act differently in the various experimental conditions, biasing the results. These experimenter expectancy effects have been found to occur even in rat research. One solution is to not tell the experimenter the hypothesis.
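As an illustration only (this is a hypothetical housekeeping scheme, not a procedure described in the chapter), condition assignments can be generated in advance and stored under opaque codes, so the person running the sessions never learns which code means which condition until the data are in. The sketch uses Python's standard random module:

    import random

    random.seed(42)

    # The key linking codes to conditions stays in a sealed file,
    # not on the experimenter's run sheet.
    condition_key = {"A": "sing the anthem (embarrassment)",
                     "B": "no singing (control)"}

    subject_ids = [f"S{i:03d}" for i in range(1, 21)]
    codes = ["A"] * 10 + ["B"] * 10   # balanced group sizes
    random.shuffle(codes)             # random assignment to conditions

    # The run sheet shows only the opaque code for each subject.
    for sid, code in zip(subject_ids, codes):
        print(sid, code)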
Demand Characteristics

Features of the experimental setting that cue the subjects as to what they think is expected of them are called demand characteristics, perhaps because the setting "demands" something of the subject. In order to please the experimenter, save face, or avoid embarrassment, the subjects may decide to do the right thing as they see it. Whether or not the subjects are correct in their assumptions about what the experiment is about or what they "ought to" do, their behavior becomes a function of this assumption rather than of the IV itself. If subjects in the embarrassment study decided that this must surely be an experiment on patriotism, then they might decide to act as patriotically as possible throughout the study just to please the experimenter. Sometimes subjects act in accordance with what the hypothesis actually predicted, producing a successful experiment for the wrong reason, and sometimes they get it backwards, producing an unsuccessful experiment for the wrong reason. The former situation is potentially more problematic for the researcher's career.

Placebo Effects

Placebo effects are a close relative of demand characteristics but are usually associated with experimental manipulations involving IVs that produce a medical benefit for the subject, such as a drug. In a placebo effect, the subject makes a conscious or unconscious assumption about the outcome of the treatment or drug, then consciously or unconsciously gets better. The actual processes by which this occurs are not entirely clear. Medical researchers are wise to this problem, and to that of experimenter expectancies, so they use double-blind procedures in which neither the experimenters nor the patients are aware of what condition the patients are in. Hence the famous "sugar pill placebo" that the control group receives.

[Sidebar] True Experiment (Reviewed)
A true experiment has three characteristics:
1. Random selection of subjects from the target population
2. Random assignment of subjects to conditions
3. Manipulation of IVs and a high level of control over the situation

Regression Effects

Some of the threats to internal validity are more common in non-experimental research, such as quasi-experimental, differential, and correlational studies. These designs were presented in Chapter 4. The advantage of true experiments is that they can, almost by definition, avoid some of these threats. However, true experiments are often impossible to perform, and researchers must settle for quasi-experiments. Quasi-experiments violate one or more of the criteria for a true experiment (see sidebar).

Regression effects can occur when the first or second criterion for a true experiment is not met. For example, if your research concerns alternate methods for training a skill, it should ideally begin with two randomly assigned groups: the group that gets your training procedure and a control group. But in the real world, perhaps you must start training the least-skilled people first, so based on a pretest of the whole sample, these people are put in your treatment group. After the training procedure, you may well find greater improvement in the treatment group.

[Figure: Skill level (low to high) for the training group and the no-training control group across four steps. Step 1: select sample; Step 2: measure skills, assign to conditions; Step 3: manipulation; Step 4: measure skills.]

Was your procedure successful? Maybe not. Perhaps the low-skill people placed in the treatment group tended to test lower on the pretest due to non-skill-related factors (having a bad day, bad luck, etc.), and vice versa for the control group, so on the post-test after your training procedure they tested closer to their average skill level (and vice versa). The scores of the two groups have "regressed toward the mean."
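Regression toward the mean is easy to demonstrate with simulated data. In the sketch below (not from the chapter; all numbers are invented), no training occurs at all: everyone's true skill stays fixed, and pretest and posttest scores differ only by luck. The low scorers still "improve."

    import numpy as np

    rng = np.random.default_rng(seed=3)
    n = 1000

    # Everyone's true skill is stable; observed scores add independent luck.
    true_skill = rng.normal(50, 5, n)
    pretest = true_skill + rng.normal(0, 10, n)
    posttest = true_skill + rng.normal(0, 10, n)

    # "Assign" the below-median pretest scorers to the training group.
    low = pretest < np.median(pretest)

    print(f"'trained' group: pretest {pretest[low].mean():.1f} "
          f"-> posttest {posttest[low].mean():.1f}")
    print(f"control group:  pretest {pretest[~low].mean():.1f} "
          f"-> posttest {posttest[~low].mean():.1f}")
    # The low group rises and the high group falls with no training at all:
    # both sets of scores have regressed toward the mean.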
Selection (Biased Sampling)

Regression effects are one of the many ways in which sampling can be biased. Whenever criterion two of a true experiment is violated, the researcher can't be sure if the observed outcome of the study is due to the IV or to some preexisting difference in the subjects who are placed in the experimental conditions. Quasi-experiments frequently violate this criterion. A study of the effects of a drug prevention program in high schools might have to use intact samples, such as homerooms or whole schools, in different conditions of the experiment. School A might be given the treatment (the drug prevention program) and School B might serve as the control group. This study violates criteria one and two (and possibly criterion three to some extent) and is therefore a quasi-experiment. If the two schools are different in any way (which is likely), the outcome of the study is hard to interpret—the IV is contaminated with a selection bias. Procedures have been developed to check for the presence of this contamination and to reduce its impact, making important research of this kind feasible.

Selection bias can be subtle and unexpected. For example, an error in random assignment could be made in a true experiment. Studies of social interaction sometimes observe people interacting in a naturalistic setting, videotaping and then analyzing the interaction. A study might look at how some aspect of the interaction–topic of conversation, for example–affects how talkative people are. Participants are randomly assigned to discuss topic A (why anchovy pizza is delicious) or topic B (why tuition is too high because the faculty are getting rich). So far so good, but an unwitting experimenter might schedule groups to discuss topic A in the morning and topic B in the afternoon, just to keep things simple. If the results show that topic A produces less discussion than topic B, is it because (a) the topics themselves affect interaction; (b) people are drowsy and less communicative in the morning; (c) people are less hungry in the morning; or (d) all of the above? In this example, the selection bias is not in the intrinsic characteristics of the subjects, but rather in their temporary states of mind when they take part in the study.

Carry-Over Effects

Some research designs require the subjects to experience something more than once, such as repeated applications of the IV or repeated measurement of the DV. Termed repeated measures or pre-post designs, these designs were introduced in the Basic Research Design chapter. For example, an undergraduate thesis performed at Florida Tech a few years ago (Rivera, 1993) examined the effect of stress on errors in flying a flight simulator. ("Error" means "crash.") Subjects were students in the School of Aeronautics flight program. In this type of study, participants experience both conditions of the experiment, stress and no-stress, over a long series of "trials" (in this case, each trial was a landing). The problem with this kind of research is that the effects of experiencing one condition of the design (stress or no-stress) might affect, or carry over to, the other condition. For example, would the participant fly better in a no-stress condition right after crashing the plane in the stress condition? And vice versa? Carry-over effects make it difficult to know if the IV is more (or less) effective than it would have been if the carry-over had not taken place.

What is the solution? The cleanest fix is to use a between-groups design in which some subjects experience only the stress condition throughout the experiment, and others experience only the no-stress condition. This is a last resort, however, because it means using twice as many subjects. Another solution is to carefully try all combinations of condition orders (a technique termed counterbalancing) to see if the order makes a difference, as sketched below. All repeated measures designs use this technique when possible.
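A minimal sketch of counterbalancing, assuming the two conditions from the flight simulator example (the code and subject labels are hypothetical, not from the thesis):

    from itertools import permutations

    conditions = ["stress", "no-stress"]
    orders = list(permutations(conditions))   # every possible order

    subjects = [f"S{i:02d}" for i in range(1, 9)]
    for i, subject in enumerate(subjects):
        order = orders[i % len(orders)]       # alternate orders across subjects
        print(subject, "->", " then ".join(order))

With each order used equally often, order effects can be estimated and separated from the effect of the conditions themselves.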
Third-Variable Confounding

Two common internal validity problems are encountered in correlational and differential designs. First, the study may lose internal validity when one of the variables is confounded with another variable "outside the model." "Outside the model" refers to variables that the researcher had not thought were relevant to the study, either explicitly in her hypothesis or implicitly in her understanding of the research area. This is another way of saying that there is a problem with the construct validity of the variable. For example, a differential-design study that examines the relationship between culture and values would have to deal with the perennial problem that "culture" is as confounded as it is fascinating. When two cultures are compared on a value dimension, there are so many differences between the cultures that no single construct can be identified as responsible for an observed value difference. Is it because the cultures differ in wealth? Child-rearing practices? History of internal warfare? Modernization? The list of confounded variables is endless, and culture is sometimes referred to as the "Global X." Hence, good cross-cultural research is much more complex than a two-variable comparison and involves multiple cultures.

Second, internal validity is compromised if a third variable not initially considered in the model accounts for the relationship between the variables. For example, research has demonstrated a positive relationship between the personality dimension Authoritarianism (see sidebar) and racial prejudice. Perhaps this relationship is a simple one, as illustrated in model A. In model A, the curved line indicates a relationship in which the causal direction is not known or assumed. In fact, most psychologists would assume that Authoritarianism affects or leads to prejudice, but model B indicates a different possibility. Social class has been shown to be related to both Authoritarianism and to prejudice, such that middle class people are lower on both. Social class is a differential variable whose relationships to Authoritarianism and to prejudice can be traced to several types of parental influences as well as to adult experiences in society. In model B, social class is the third variable, and the relationship in model A is said to be "spurious," or a "spurious correlation." The correlation between Authoritarianism and prejudice is actually due to their common relationships with social class. More than one spurious variable may be present.

[Figure: Model A, a simple relationship between two variables: Authoritarianism and Prejudice joined by a curved line. Model B, third-variable confounding: Social Class related to both Authoritarianism and Prejudice.]

[Sidebar] Authoritarian Characteristics
1. Concern with power and toughness
2. Dislike of tender-mindedness, psychological analysis
3. Conventional values rigidly followed
4. Condemn and reject others who don't follow conventional values
5. Overly concerned with others' sexuality
6. Hostility
7. Projecting own negative emotional impulses on others
8. Submission and conformity to idealized moral authorities
9. Fatalism, superstition, rigid categorical thinking
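The spurious correlation in model B can also be demonstrated with simulated data. In the sketch below (not from the chapter; the effect sizes are invented), both variables are driven by a common third variable, and the correlation between them largely vanishes once that third variable is statistically held constant, here via a partial correlation computed from regression residuals:

    import numpy as np

    rng = np.random.default_rng(seed=4)
    n = 500

    # Social class drives both variables; they share no direct link.
    social_class = rng.normal(0, 1, n)
    authoritarianism = -0.7 * social_class + rng.normal(0, 1, n)
    prejudice = -0.7 * social_class + rng.normal(0, 1, n)

    r_raw = np.corrcoef(authoritarianism, prejudice)[0, 1]

    def residuals(y, x):
        # What is left of y after removing its linear relationship with x.
        slope, intercept = np.polyfit(x, y, 1)
        return y - (slope * x + intercept)

    r_partial = np.corrcoef(residuals(authoritarianism, social_class),
                            residuals(prejudice, social_class))[0, 1]

    print(f"raw r = {r_raw:.2f}")
    print(f"partial r (social class held constant) = {r_partial:.2f}")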
External Validity

A valid research study, regardless of the design employed, must generalize to other people, measures, times, and places. Because research is conducted to build and test theories and models, a study that only works with a certain kind of sample and one way of operationalizing each construct is not very useful. A robust theory demands wide generalization. External validity is the extent to which a study's findings can be generalized. Logically, external validity belongs to the theory, not to the study, but by convention we use the term to apply to specific research studies.

The first place we look for external validity is in the sampling. Would the results of the study be the same if a different target population were used? Most research in American psychology is conducted on white middle class college sophomores enrolled in Introductory Psychology courses, an observation that enrages cross-cultural psychologists, most of whom have had extensive experience in other cultures. Can the results of this research generalize to "normal" people, such as working class adults? African Americans? Chinese living 100 km from a paved road? Yanomamo Indians engaged in inter-village raiding and wife-stealing?

The second place to look is the operationalization of the variables, both IVs and DVs. A masterful attempt to generalize the operationalization of variables was performed by Bibb Latané and his colleagues in their development of Social Impact Theory and one of its most visible phenomena, social loafing (see sidebar). From the core finding, they applied their theory to the performance of swimmers and runners, the quality of Beatles songs, generating ideas (brainstorming), and others. They generalized the sampling to children, Japanese Honda managers, Taiwanese and Malaysian school children, and kids who grew up in Melbourne, Florida. Tasks involving producing a lot of something, such as loudness (termed maximizing tasks), and tasks involving cognitive skills, such as counting (optimizing tasks), were employed. Overall, robust support for the social loafing effect was found, while interacting factors and cultural differences were identified.

[Sidebar] Social Loafing (Overview)
The core finding in the social loafing literature is that as the number of people responsible for performing a task increases, the effort each person puts into the task decreases. If this reduction in effort is great enough, the task outcome declines as the number of workers increases. Several task variables moderate this effect, such as the extent to which each person has a sense of personal responsibility for the success of the group effort. In the earliest social loafing research, college students were asked to shout as loudly as possible in groups or alone. Their shouting "performance" was assessed using a sound level meter. The researchers found that, as group size increased, individuals' loudness decreased.

Ecological Validity

The third place to look for generalization is the extent to which the research will work the same way in the real world, that is, in natural settings.
Research that works only in laboratory settings, and either cannot be replicated in the real world or has no real-world analog, is said to lack ecological validity. For example, if social loafing only occurred in the laboratory settings where it was first researched–special soundproofed rooms, subjects wearing blindfolds and heavy earphones–it would not be ecologically valid and would not have caught people's attention. However, when it was shown that swimmers swam faster in individual than in relay events, and that managers in the vaunted Honda Motor Company loafed almost as much as American college students, the research gained both attention and ecological validity. Latané won a prestigious scientific award. The author got a job.

Reference

Rivera, M. (1993). Effect of alternate warning systems on gear-up landings in a flight simulator. Undergraduate thesis, Florida Institute of Technology, Melbourne, FL.