Designing the Study

This chapter includes information on the following topics:

- Types of Experiments
- Variables
- Experimental Definitions
- Design Types
- Selecting Measures
- Design Obstacles and Threats

Types of Experiments

Although the term experiment is used most of the time, not all research projects are true experiments. For a project to qualify as a true experiment, there must be an independent variable that the experimenter manipulates, and a dependent variable that is measured as the independent variable changes. Studies that do not involve the manipulation of one variable to study changes in another are referred to as quasi-experimental. Most of the research that takes place at the undergraduate level is quasi-experimental. The most common study of this type is the distribution of questionnaire packets composed of several scales or measures, followed by the study of the outcomes of those measures relative to each other. Although this type of study is not a true experiment, its contribution to the science and the research literature is valid. Quasi-experimental designs often serve as the basis for more complex, truly experimental designs.

There are many benefits to quasi-experimental designs, particularly from the viewpoint of an undergraduate. First, a quasi-experiment is simpler. Everything about it is simpler: research design, selecting measures, recruiting participants, administering the items, coding and tracking responses, and data analysis. There are simply fewer major mistakes that have to be avoided. This makes it a great way to get your feet wet, and a great place to start while you become familiar with statistics and the writing aspect of research. Second, quasi-experiments allow for more independent work. A faculty advisor is much more likely to let an undergraduate work independently if there will be minimal risk to participants and a less complicated research design than if the student wants to bring people into a lab and manipulate various aspects of the environment. This is not to say that a student is not capable of completing a true experiment, or that no professor will be willing to advise such a project. Nonetheless, the experience of completing a quasi-experiment before independently taking on a true experiment provides a boost in ability that is invaluable in the execution of a true experiment.

Variables

In every research project, there are at least two, and sometimes three, types of variables: independent, dependent, and organismic. The independent variable, often referred to as the IV, can include condition of assignment, the time of day at which participants complete the study, or the order in which stimuli are presented to participants. Independent variables are those that the experimenter manipulates intentionally. The dependent variable, also known as the DV, is the variable that is being measured. Most research involves manipulating the IV in order to study changes in the DV. The third type of variable is the organismic, or subject, variable. These variables are similar to independent variables, except that the experimenter cannot manipulate them. Organismic variables can be used to divide participants into groups, allowing a DV to be compared across levels of the organismic variable instead of an IV. Gender, age, eye color, height, and weight are all examples of organismic, or subject, variables.
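To make the distinction concrete, here is a minimal sketch (in Python with pandas; the data and column names are invented for illustration) of comparing a DV across groups formed by an organismic variable:

    import pandas as pd

    # Hypothetical dataset: 'age_group' is an organismic (subject) variable
    # the experimenter cannot manipulate; 'anxiety_score' is the measured DV.
    data = pd.DataFrame({
        "age_group": ["18-25", "18-25", "26-40", "26-40", "41+", "41+"],
        "anxiety_score": [12, 15, 9, 11, 8, 7],
    })

    # Compare the DV across levels of the organismic variable.
    print(data.groupby("age_group")["anxiety_score"].mean())

Because age_group was never manipulated, any difference found this way describes a relationship between groups, not a cause-and-effect conclusion.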
In addition to being familiar with the various types of variables, it is necessary to understand the levels of measurement of the variables you are using. There are three levels of measurement: nominal, ordinal, and scale. Nominal variables are those measured in terms of categories of membership; no category is necessarily better than another, they are just different categories. For instance, gender is a nominal variable: female is not greater than or less than male, it is just a different category. Other nominal variables include political and religious affiliations, race or ethnicity, and family composition.

Ordinal variables are similar to nominal variables in that they are measured in terms of membership. The difference is that an ordinal variable indicates a specific order, or ranking, of the classifications. The categories freshman, sophomore, junior, and senior are all categories of membership. Because they can be ordered from least advanced to most advanced, they constitute an ordinal variable instead of a nominal one. Military rank is another ordinal variable: each rank is a category (so it seems to be nominal), but there is a definite order to the ranks, which makes the variable ordinal. The key feature of ordinal variables is that they indicate a ranked order.

The third level of measurement is scale, which can be broken down further into interval and ratio. An interval scale indicates that the magnitude of something is being measured and that the spaces between any two consecutive points are equal, but there is no true zero point. Temperature is an example of an interval measure: there is no absolute zero, as we experience when winter temperatures dip below zero. A ratio scale is also a measure of quantity or magnitude with equal distances between points, but it does have an absolute zero point. Age, number of course credits completed, dollar amount of annual income, and number of siblings or children are all ratio scale variables.

A third classification of variables is whether they are continuous or discontinuous. Continuous implies a scale along which an individual can fall at any point. For instance, if you measure the height of participants in inches and do not round to the nearest inch, any number of inches or fraction of an inch is possible. This is a continuous measure. A discontinuous variable implies that, within the range of possible values, an individual cannot fall between points. Nominal and ordinal data are both discontinuous. Calculating the age of participants in whole days is an example of a scale variable that is also discontinuous: if you are measuring age in whole days, it is not possible to have half a day or an eighth of a day. This inability to fall between measurement points makes the variable discontinuous.
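One way to see the levels of measurement in practice is to look at how a statistics package asks you to declare them. A sketch, assuming pandas (the data are invented):

    import pandas as pd

    # Nominal: categories of membership with no inherent order.
    affiliation = pd.Categorical(["Republican", "Democrat", "Independent"])

    # Ordinal: categories with a ranked order.
    year = pd.Categorical(
        ["sophomore", "freshman", "senior", "junior"],
        categories=["freshman", "sophomore", "junior", "senior"],
        ordered=True,
    )
    print(year.min(), year.max())  # the ranking makes min/max meaningful

    # Scale (ratio): numeric with a true zero; fractions allowed, so continuous.
    height_inches = pd.Series([64.25, 70.5, 68.0])

    # Scale but discontinuous: age counted only in whole days.
    age_days = pd.Series([6935, 7402, 8121], dtype="int64")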
Experimental Definitions

In the English language, there are several words that have multiple meanings. Unfortunately, the same problem exists in research. To ensure that your audience understands exactly what you are talking about, you must provide definitions of your constructs and variables. Constructs are the major concepts you are examining. Even if it seems obvious that everyone in the world is familiar with a concept, you must define it for your study. You do not have to define things so simple that the average elementary school student knows what they are, like age, gender, race, or religious affiliation. You do need to define everything else in terms of how it is relevant to your study.

Independent and dependent variables must also be defined for your audience. The definition of an independent variable is the experimental operational definition, which identifies the exact manipulation of the independent variable that occurred. The operational definition of the dependent variable states what it is (according to you), how it is being observed, and how it is being measured. Writing good definitions can be more difficult than it sounds. The purpose of defining variables is to convey to the audience the precise way you are defining, manipulating, and measuring the variables under consideration.

Design Types

There are three categories that cover most experimental designs: within-subjects, between-subjects, and mixed. Each of these names indicates the way that comparisons and analyses take place. Within-subjects studies rely on one group of participants to complete all the measures, and comparisons are made between those measures for the same group of subjects. Correlational studies are often within-subjects designs: all participants might complete the same surveys, and correlations between responses to those measures are then calculated to identify a relationship between variables. This type of design might be used to determine that there is a relationship between two variables, say yearly income and open-mindedness. You might be able to identify a relationship between these two variables, but keep in mind that you cannot determine a cause-and-effect relationship, just that a relationship exists. You do not know for sure whether having more money causes participants to be more or less open-minded, or whether how open-minded an individual is affects how much money they will be able to earn. This is the problem of bidirectionality: not knowing whether one variable causes the other, or whether they merely coexist.

A final consideration is the effect of a third variable. Suppose that the previous example is a real study, and you have identified a relationship between yearly income and open-mindedness. It might be that the two are not as related as you think; perhaps both are closely related to a third variable, such as level of education or age, that causes them to appear related even when they really are not.

When considering relationships, it is also important to understand the direction of the relationship. Correlations can be positive or negative. A positive relationship is one in which, as one variable increases, so does the other. Income level increasing as age increases is a positive relationship. Negative relationships are those in which, as one variable goes up or down, the other variable moves in the opposite direction. If you find that creative ability decreases as age increases, you have identified a negative relationship.

Between-subjects designs, sometimes referred to as group comparison designs, are studies that compare responses between two or more groups of participants. If your research design includes the assignment of participants to different conditions so you can compare responses between the groups, you have a between-subjects design.
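Returning to the income and open-mindedness example above, here is a brief sketch (Python with SciPy; the data are simulated and the variable names hypothetical) of computing a correlation. The resulting r describes the direction and strength of the relationship, but not causation:

    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(0)
    # Simulated scores: yearly income and an open-mindedness measure.
    income = rng.normal(50_000, 12_000, size=100)
    openness = 0.0002 * income + rng.normal(0, 2, size=100)

    r, p = pearsonr(income, openness)
    print(f"r = {r:.2f}, p = {p:.3f}")
    # Positive r: the variables rise together; a negative r would mean one
    # falls as the other rises. Neither result rules out bidirectionality
    # or a third variable driving both.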
Mixed designs are studies designed to compare responses between groups as well as within groups. Suppose you have three conditions and participants are randomly assigned to one of them. If all of your participants complete the same measures and you look for relationships among the constructs measured, you are using within-subjects design elements. If, in the same study, you look at differences between two or more groups, say between men and women, you are using between-subjects comparisons. This means the design is mixed. Regardless of the data you collect, there is always some way to evaluate it through both between-subjects and within-subjects comparisons. This does not mean that every study has a mixed design. The classification of your study as between-subjects, within-subjects, or mixed should be based on your hypotheses. Let your primary objectives guide the development of your experimental design. It is usually more satisfying to design your study to fit your hypothesis than to alter your hypothesis to fit the design you have developed.

Selecting Measures

When it comes to selecting the measures to use in a research project, you have two basic choices. You can either use a measure that has been developed and statistically supported by another researcher, or you can attempt to create your own. There are many advantages to using someone else's measure: it already exists, so all you have to do is copy it, score it, and cite it. In general, measures that have made it into the field's top journals have been through extensive statistical procedures to verify that they are good measures. So why would you choose not to use someone else's measure? Just because someone intended to create a survey that measures a particular construct does not mean they succeeded. It might also be that something about their sample made the measure work for them, even though it might not be as useful to you. In most situations, you would not know this until you have collected your data.

Creating your own measure, however glamorous it might sound, is a lot of work. If you want to create a questionnaire, you have to come up with a list of questions that might be relevant to what you want the survey to measure. Once your data are collected, you must analyze the survey to determine whether there is anything good in it. It is common to find that a survey meant to measure one thing really measures several different things, called factors. This is good, as it usually means the scale contains relevant subscales, but trying to figure out what those factors might be can be time consuming and tricky (see the sketch below for one way to begin exploring them).

When coming up with a list of questions to include in your measure, there is some controversy over how many questions you must start with. Some researchers believe strongly that you must have a couple hundred questions to have any prayer of producing a decent measure. Other researchers figure that if you can come up with 20 questions and what you get is a solid measure or a couple of distinct factors, that is all you need. If you are new to research and are thinking about creating your own measure, talk to your advisor. If they are going to be helping you through the process, it is not a bad idea to do it their way the first time around. In general, it is less work to use someone else's measure. The amount of time you have to complete the project, your confidence in your ability to learn more complicated analysis procedures, and not finding a measure you are really satisfied with are the key factors that influence whether researchers try to create new measures of their own.
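If you do try to identify the factors hiding in a new measure, exploratory factor analysis is one common approach. A hedged sketch (Python with scikit-learn; the pilot data are randomly generated, and the choice of three factors is arbitrary):

    import numpy as np
    from sklearn.decomposition import FactorAnalysis

    rng = np.random.default_rng(1)
    # Hypothetical pilot data: 200 respondents x 20 Likert items (1-5).
    responses = rng.integers(1, 6, size=(200, 20)).astype(float)

    # Ask for three candidate factors; in practice you would compare
    # several solutions and inspect which items load together.
    fa = FactorAnalysis(n_components=3).fit(responses)

    # Loadings: how strongly each item relates to each candidate factor.
    # Items that load on the same factor may form a subscale.
    print(np.round(fa.components_.T, 2))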
If you decide to create your own measure, be patient with yourself. You are not likely to produce a wonderful, amazing, ground-breaking questionnaire the first time around, but you might surprise yourself with something that actually works.

When you are deciding between measures to figure out which ones best fit your study, consider how feasible they are. If you want to measure many different constructs in one study, it might be better to select shorter measures; if you are looking at just one or two specific concepts, length is not quite as important. Also, be aware that while many measures are available free of charge, there are some you have to pay for. Additionally, just because an article reports a new measure that sounds perfect for you does not mean that the measure itself or a scoring key is available in print. You might need to contact the author to ask for a copy and for permission to use it in your study. If you are on a tight schedule, this may not be feasible. Choose measures that work best for the situation in which you find yourself as you attempt to conduct each study. If you find a measure you just have to try but it is not feasible for your present research, get a copy and file it away. You always have the option to work it into a future project.

There are as many different scales used by measures as there are measures. The most common type of scale used for personality or behavior assessment is called a Likert scale. A Likert scale usually has two endpoints with points between, and participants are asked to rate their agreement or disagreement with a statement by indicating where on the scale their opinion falls. Likert scales can have either an even or an odd number of values. With an odd number, there is a midpoint; with an even number, there is not. Controversy about which method is better is ongoing. At this point, it really comes down to personal preference. If you are using a scale developed by someone else, use the same response scale they used. If you change the scale, you have changed the measure, and your results are no longer comparable to those previously found using that measure. If you are creating your own measure, however, the choice of an odd or even number of points is up to you.

The major argument for an even number of points is as follows: when you are asking questions that are highly emotional or might make participants feel guilty about their honest responses, a midpoint on the scale provides a safe "neutral" zone where participants can sit without actually responding to the items. If there is an even number of points on the scale, the participant must choose a side of the issue. Even if they select one of the two middle numbers on the scale, they have chosen a side. It is theorized that this method leads to more honest answers on controversial topics. The argument in support of a Likert scale with an odd number of points is just as compelling, however. A midpoint does not force participants to take sides on an issue they honestly have never considered. There are topics that are not relevant to everyone, and a midpoint allows people to express this. As a basic rule of thumb, if you want to ask questions that are not guilt or emotion provoking, go ahead and use a midpoint. But if you allow a neutral zone in the midst of guilt or emotion provoking items, you might find that none of your participants appear to have feelings on controversial issues. A few examples of topics that are more likely to provoke guilt or emotional reactions include prejudice (toward any group), stereotypes (again, of any group), political agendas, and self-report of aggression.
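The midpoint distinction is easy to see in how responses would be coded. A small sketch (plain Python; the scales and the helper function are invented for illustration):

    # An odd-point scale has a neutral midpoint; an even-point scale does not.
    odd_scale = [1, 2, 3, 4, 5]      # 3 is the neutral midpoint
    even_scale = [1, 2, 3, 4, 5, 6]  # no midpoint: 3 and 4 each lean one way

    def chose_a_side(response, scale):
        """True if the response commits to one side of the statement."""
        center = (scale[0] + scale[-1]) / 2
        return response != center

    print(chose_a_side(3, odd_scale))   # False: sat on the neutral midpoint
    print(chose_a_side(3, even_scale))  # True: 3 falls below the 3.5 center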
With all studies, researchers must be sensitive to the possibility of socially desirable responding and the extent to which their topic might encourage it. Socially desirable responding is the tendency of participants to report what they think is more socially accepted or expected of them. This is a real concern when addressing sensitive or volatile issues. The more controversial an issue is, the more likely participants are to monitor their responses. A tricky thing about socially desirable responding is that it is not always intentional. Often, participants are not really conscious of the fact that they are doing it. When you are measuring topics that might encourage participants to provide socially desirable responses, include one of the many measures designed to detect the degree to which a participant is responding desirably. A number of social desirability scales are available for free, and they range from being very short and simple to being more complex and measuring whether desirable responding is occurring consciously or subconsciously. If you decide that your study merits the use of a desirable responding measure, place it in the battery of questionnaires so that it follows immediately after the most provocative measure, as this is where desirable responding is most likely to be an issue worth considering.

Regardless of what measures you decide to use, or what topic you are studying, you need to be aware of the different ways that a survey measure can be evaluated. Namely, you should understand the reliability and validity of the scale. A measure that is reliable can be depended on to produce consistent results. A common way to evaluate reliability is to look at what is called test-retest reliability. If a measure is reliable by the test-retest method, a group of participants should be able to complete the same survey at two different times and yield very similar results. Instruments that meet this requirement are often thought of as measuring traits; they measure something that is largely immune to circumstance and remains fairly constant most of the time. The other type of reliability to know about is interitem reliability. Interitem reliability is the extent to which items intended to measure the same concept actually do so. This type of reliability is often examined using correlations and Cronbach's alpha.
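Both kinds of reliability can be checked with a few lines of code. A sketch (Python with NumPy and SciPy; the scores are invented), showing test-retest reliability as a correlation between two administrations and interitem reliability as Cronbach's alpha:

    import numpy as np
    from scipy.stats import pearsonr

    def cronbach_alpha(items):
        """Cronbach's alpha for an (n_respondents x n_items) score matrix."""
        items = np.asarray(items, dtype=float)
        k = items.shape[1]
        item_vars = items.var(axis=0, ddof=1)      # variance of each item
        total_var = items.sum(axis=1).var(ddof=1)  # variance of total scores
        return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

    # Interitem reliability: do the items hang together?
    scores = np.array([[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 3, 3]])
    print(round(cronbach_alpha(scores), 2))

    # Test-retest reliability: same respondents, two time points.
    time1 = np.array([20, 25, 30, 22, 28])
    time2 = np.array([21, 24, 31, 23, 27])
    r, _ = pearsonr(time1, time2)
    print(round(r, 2))  # a high r suggests the measure taps a stable trait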
Validity is a little more complicated than reliability, but only because there are numerous types of validity. Validity is the extent to which a measure actually measures what we intend it to measure. Validity can be either internal (knowing for a fact that the changes observed were the result of the independent variable) or external (how well the findings generalize to other samples and situations). Other types of validity to consider when evaluating possible measures include concurrent validity, content validity, face validity, and predictive validity. Concurrent validity is the degree of correlation between scores on the measure you are examining and scores on measures known and trusted to measure the same concept. In a sense, concurrent validity is a measure of how one scale compares to another.

Content validity is how well the items included in the measure cover the different elements of what it is designed to measure. Suicide is a very complex concept to study. A measure that only gathers information about how a person would commit suicide if they were to attempt it leaves out a lot of other elements. Such a measure should also include items intended to assess the desire to commit suicide, how often an individual contemplates suicide, and many other issues relevant to the topic. In short, for a measure to be high in content validity, it must cover as narrow or as broad a range of concepts as exists in what it is attempting to measure.

Face validity is how obvious a method of manipulation or measurement is. If you use a yardstick to measure a football field, your method of measurement is very obvious, and therefore has high face validity. Predictive validity, like face validity, is fairly simple: it is the degree to which the measure predicts behavior. This is of special concern with some topics, as what people report doing or having done is not necessarily what happened. There are so many reasons why a person might misrepresent their behaviors or intentions that it is not realistic to attempt to evaluate all of the possible motivations. Measures that are high in predictive validity are those that have managed to bridge the gap between what people say they do and what they actually do.

Design Obstacles and Threats

There are a number of problems that can threaten the validity of a study. To produce valid results, researchers must check their designs for these obstacles and threats before conducting the study. Research with many validity issues is of little or no use in most instances. Some of the most common threats to validity are confounding, order effects, maturation, mortality, and participant history.

Confounding occurs when a relationship found between two variables is invalid because the change in the dependent variable might actually be due to a variable other than the manipulated independent variable. Relationships exist everywhere, especially in research; one goal of the researcher is to explain relationships, and this cannot be done if a relationship might be due to several different variables.

Order effects are changes in response due to the order in which stimuli are presented. Suppose an experimenter wanted to measure hostility and sexism, was using a negative scenario in the measurement of sexism, and always measured hostility second. The responses obtained on the measure of hostility are likely to have been influenced by the scenario presented with the sexism measure. This is an order effect: an influence on participants' responses that is based on the order in which stimuli are presented. The easiest way to account for this issue is to counterbalance stimuli, which means systematically varying the order in which the stimuli are presented. In the example above, half of the participants would complete the sexism measure first, and the other half would complete the hostility measure first. For true counterbalancing to occur, participants must be randomly assigned to one of the order conditions. The more measures being used, the greater the number of possible orders.
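Generating and assigning the order conditions is straightforward. A sketch (plain Python; the measure names are hypothetical):

    import itertools
    import random

    measures = ["sexism_scale", "hostility_scale", "mood_scale"]

    # Full counterbalancing: every possible presentation order.
    orders = list(itertools.permutations(measures))
    print(len(orders))  # 3 measures -> 6 possible orders

    # Randomly assign each participant to one of the order conditions.
    random.seed(0)
    participants = ["P1", "P2", "P3", "P4", "P5", "P6"]
    assignment = {p: random.choice(orders) for p in participants}

Note how quickly the number of orders grows: four measures already yield 24 possible orders, which is one reason full counterbalancing is easiest with a small number of measures.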
History is the threat created by the individual experiences of each participant. Someone who wants to study prejudice toward a minority group is going to get very different results if that group has recently been accused of terrorist acts against the participant's country than if the targeted minority group has never been accused of attempting to injure the participant's group. An experimenter who wants to evaluate the math abilities of college students would have a history threat to worry about if the participant group were all third- or fourth-year math majors. Testing threat is a type of history threat in which a participant's having completed the same or another survey in the past influences how they respond to the current one.

Maturation is one threat to internal validity that we cannot stop, but it is also one we can easily work around. Maturation is the change in observed behavior (or survey response) that results from psychological or physical changes that the participant experiences during the experiment. Unless you are asking participants to be involved for only a few minutes to complete really low-stress tasks, this is something to be concerned with. In most research that students conduct, maturation can be virtually eliminated by taking two precautionary steps. The first is to carefully select measures that get the information we want without gathering a lot of miscellaneous information. This means participation will take less time, allowing fewer chances for changes in the participant's physical state. The second is counterbalancing: presenting the stimuli in varied orders when possible will help to balance out cases where psychological maturation during participation would otherwise skew results.

Subject mortality is the rate at which participants drop out of different conditions of a study. Inconsistent subject mortality across conditions is a threat to the internal validity of the design. There are several things you can do to help prevent subject mortality from becoming a problem. First, design all conditions to take approximately the same amount of time and effort, and to be relatively equal in the level of stress they induce. The more similar these factors, the more similar the dropout rates will be.
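A quick way to check whether dropout differs across conditions (a sketch, assuming SciPy; the counts are invented):

    from scipy.stats import chi2_contingency

    # Hypothetical counts per condition: [completed, dropped out].
    condition_a = [45, 5]
    condition_b = [30, 20]

    chi2, p, dof, expected = chi2_contingency([condition_a, condition_b])
    print(f"chi2 = {chi2:.2f}, p = {p:.3f}")
    # A small p suggests dropout rates differ across conditions, which
    # threatens internal validity: the participants who remain in one
    # condition may no longer be comparable to those in the other.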