Testing Scientific Explanations (In words – slides page 7) Most people are curious about the causes of things. For example, people wonder whether exercise improves memory – and, if so, why? Or, we wonder whether using cell phones causes cancer – and, if so, how? Science allows us to systematically test competing hypotheses about such relations and the explanations for them. For example, exercise may affect neurotransmitters or increase blood flow to the brain, either of which could improve memory. Scientists conduct experiments to test their hypotheses. Any conclusions are only as good as the experiments that were used to support them. In this module, we will teach you to detect problems with experiments that lead to conclusions that seem suspect.
A hypothesis specifies a testable relationship between variables. Thus, a hypothesis has three important elements: the independent variable, the dependent variable, and the type of relationship between them. The independent variable (IV from here on) is the condition that we vary to see if it has an effect on the other variable. In the examples above, the independent variables are (A) exercise and (B) cell phone usage, because we think they will causally affect another variable. The dependent variable (DV from here on) is the variable that we think will be affected. It is also called an outcome variable. The dependent variables in the examples above are (A) memory and (B) cancer. Finally, both example hypotheses assume a causal relationship; specifically, that the IV causes an increase in the DV. You should know that hypotheses are not always about causal relationships. One could also have correlational hypotheses, which merely predict that two variables change in relation to each other, but do not assume that a change in one causes the change in the other.
For example, we could change the verb in our example hypotheses to get these correlational hypotheses: (A) more frequent exercise is associated with better memory or (B) cell phone usage is related to the incidence of cancer.
Figure 1. General form and examples of causal conclusions: IV causes DV; exercise (IV) causes an increase in memory (DV); cell phone usage (IV) causes an increase in cancer (DV).
When evaluating research, the first things you should do are to identify the independent and dependent variables and determine whether a causal or correlational prediction is being made. There are more specific issues related to these topics that you should consider when evaluating research. In these pages, you will read about some important flaws that may occur when research is conducted.
A. Possible problems with IVs and DVs
(1) Use valid measure of IVs and DVs
Causal hypotheses predict that a change in the IV will cause a change in the DV. To test such a hypothesis, one therefore has to manipulate or vary an appropriate IV of interest. For the cell phone hypothesis, for example, you could compare cancer rates (DV) for participants who are grouped based on minutes of cell phone usage per day (e.g., more than 2 hours a day, 0.5-2 hours a day, less than 30 minutes a day). This is a valid definition of cell phone usage (IV) because you are varying the amount of usage. An invalid definition of the IV would be the amount of money spent on a cell phone per month (e.g., more than $100, $40-99, or less than $40 a month). This is invalid because it doesn't really indicate how much time one spends on the phone. You could have a very cheap plan, pay for much more time than you actually use, or call a lot but only at off hours. It is also important to have a valid measure of your DV. For example, if you want to test whether exercise (IV) improves memory (DV), you need to measure memory in a way that accurately reflects the aspect you are interested in.
A standard memory test would be a valid measure; weighing someone's head would not.
Construct validity. We are often interested in variables that cannot be directly measured or observed, such as memory, intelligence, or depression. We refer to such variables as constructs. When our variables are constructs, we have to rely on scientifically developed measures as indicators of those constructs. For example, we may use a score on an intelligence test to indicate that a person is intelligent. Because these are not direct measures, they always depend on the developer's ideas of what the construct is. Because these ideas can vary, an experimenter has to be very careful in the choice of measure. When a measure really reflects the construct well, we say it has construct validity. Construct validity of the IV in our example is whether working out on a treadmill (6 hours per week) is actually a "good" instance of exercising, whereas construct validity of the DV is whether our test of verbal and spatial memory is a "good" measure of what people mean by memory. For example, giving a person a gym membership would not be a valid manipulation of exercise, and self-reported judgment of one's own memory is an invalid measure of memory. If your experiment does not have good construct validity, then you are not actually manipulating or measuring what you are claiming to – and thus, you will not test your actual hypothesis.
Content validity. A second type of validity is content validity. Most constructs have many aspects, and the more of them you include in your measure or manipulation, the better your content validity becomes. For example, if a professor gives students a study guide for the test but all the questions cover only one or two of the areas, students would complain that it is not a fair measure of what they were supposed to learn. They are really arguing that the test does not have content validity.
For the exercise example, measuring the number of items recalled from a list of words may have construct validity, but content validity could be improved by including other types of memory, such as recognition memory or spatial memory (e.g., maps and images). We could improve the content validity of our IV by also including 3 hours a week of anaerobic exercise (e.g., weight lifting). So, in summary, you have to manipulate and measure the variables you claim you are, not some other related variables, in order to have construct validity. If you don't, you cannot draw valid conclusions because you are testing the relation between variables that are different from what you stated. Furthermore, you have to vary the most important elements of a variable, not just any one aspect of it, in order to have content validity. Valid measures are required to be able to appropriately interpret the results from an experiment and draw appropriate conclusions.
(2) Must have comparison group(s)
When testing the effect of an IV (e.g., exercise) on a DV (e.g., memory), you have to vary the IV. To test a causal hypothesis (e.g., exercise improves memory), you need to manipulate the IV, exercise. To test a correlational hypothesis (e.g., exercise is associated with better memory), you just compare already existing groups that differ on their level of exercise. Most of the time, scientists want to test causal hypotheses. To do so, you need one or more comparison groups that are not exposed to the treatment. For example, your treatment group would exercise 6 times a week while the comparison group would not exercise at all. Your comparison group, however, does not have to be a no-treatment group. For instance, you could have several groups that vary in exercise frequency (e.g., 6 times a week, 3 times a week, 1 time per week, and no exercise). These additional comparison groups allow you to determine how much each unit of exercise adds to improving memory.
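A multi-group design like this can be sketched in code. The following Python sketch randomly assigns a hypothetical pool of 40 volunteers to four exercise conditions; the participant labels, condition names, and group sizes are purely illustrative:

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into conditions,
    giving every participant an equal chance of ending up in any group."""
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(shuffled):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# Hypothetical design: four exercise-frequency conditions, 40 volunteers
conditions = ["6x/week", "3x/week", "1x/week", "none"]
participants = [f"P{n:02d}" for n in range(1, 41)]
groups = randomly_assign(participants, conditions, seed=1)
for c in conditions:
    print(c, len(groups[c]))  # each condition ends up with 10 participants
```

Dealing a shuffled list round-robin keeps the group sizes equal while preserving the key property of random assignment: no pre-existing characteristic of a participant determines which condition they land in.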
Although there are many things to consider in designing your comparison groups, the two most important flaws are having no comparison group and having a poor comparison group. No comparison group. Although it may seem good enough to simply show that a DV changes from before to after our IV manipulation (thus, a difference between “pretest” and “posttest” performance), this is not the case. It may be that everyone would improve on the second test even without the treatment. For example, people may perform better on our memory test the second time, even without exercise in between. To be able to attribute any change to the IV, we need a baseline for comparing change. This baseline is a comparison group not given the treatment. Poor comparison group. A more complicated problem is when the experimenters have two or more groups, but they do not vary on the critical IV that they claim they do, or something besides the IV is varied between groups. For instance, if people in the exercise condition work out in a fun group class, then any improvement may be due to the social interaction. To make the groups as similar as possible, you could have the no exercise group also attend a sedentary group class like watching others exercise. Notice how much care has to go into deciding what your comparison group should be. Comparison groups allow researchers to rule out alternative explanations for any effects, but only if the comparison group is equivalent in all regards except for the IV. If the researcher uses chance to determine each participant’s group assignment, such as flipping a coin (called random assignment), then this randomly assigned comparison group allows the researcher to make causal conclusions about the effect of the IV on the DV. In contrast, the researcher could put people into groups based on their natural, pre-existing characteristics, such as how much they usually exercise per week (0 hours, 3 hours, or 6 hours). 
But because many other things vary with how much someone chooses to exercise per week (e.g., diet, fitness level, motivation level, weight), it could be that one of these other reasons could explain the improved memories. Random assignment means that everyone in a study has an equal chance of being in any condition, and this allows us to distribute such differences across conditions so that, hopefully, the only thing that varies is our manipulation (IV). Thus, even the best comparison group will not allow the researcher to make a causal statement about the relationship between the IV and DV if people are not randomly assigned to condition.
(3) Use appropriately sensitive DVs
To test a hypothesis, we vary one or more IVs to see whether they affect performance on the DV. We have to make sure that the test or measure we are using for the DV is sensitive enough to detect any possible effect. There are two main flaws that can affect DV sensitivity. Ceiling or floor effects. Ceiling effects happen when a test is so easy that everyone achieves a near perfect score (at the ceiling or top of the scale). So even if you have a perfect treatment, you would never know it, because you cannot make people do better than the control condition, which is already performing at the top of the test. Floor effects occur when the test is so hard that participants are not getting any items right. In this case, your manipulation, even if effective, may not be powerful enough to improve the scores even a little. Imprecise measures. The second concern with DV sensitivity is the use of an imprecise measure. This requires the researcher to determine how fine a scale should be used. For example, suppose you want to measure how much someone exercises. You could use a three-point scale (none, some, a lot). This is easy to answer, but if people start to exercise a little more, it is unlikely that this would be reflected by this broad 3-option scale.
If, however, you measured exercise on a 100% scale, you would be able to see numbers change representing even slight increases in exercising.
B. Problems with eliminating unwanted variables
When you test the effect of an IV on some DV, it is important that you control the rest of the environment. There are always other variables that also vary between groups that could affect the outcome, but researchers try to isolate the effect of the IV by minimizing the influence of these other variables. You want to hold as many things constant between groups as possible, especially if you think they may affect the DV. Otherwise, you cannot be sure that your varying of the IV(s) was really the reason for any change in performance (DV). For our exercise example, there are several possible variables that we might worry about (such as motivation, starting health level, previous level of exercise, attitude and experience with exercise, and people working out beyond what is required for the study). Two standard practices to help the groups be comparable before a treatment are to randomly assign participants to your groups, and to manipulate the IV rather than using naturally existing groups (which may differ along other variables). While there are many serious flaws that concern controlling the environment, we will only mention two: attrition and experimenter bias.
(4) Limit attrition
Groups can become different even after random assignment because sometimes participants "drop out" or fail to complete the experiment. We call this attrition. Attrition can be a serious problem, especially if the number of participants failing to complete the experiment is different for the various conditions. This can lead to differences between groups that affect the DV, which makes it impossible to conclude that the IV was responsible for any DV differences between groups.
Attrition is most problematic if it is just one group that loses participants, or if participants with specific characteristics that may affect the DV drop out. Attrition is less of a concern when it is random and not due to the participants, such as equipment failure. However, researchers should always report the level of attrition in all conditions. Drop out. Attrition due to drop out can be a serious flaw. This happens when, after randomly assigning participants to conditions or groups, a participant leaves without finishing, or stops showing up for sessions. When this happens, the groups may then be different for a reason other than the manipulation. In the case of the exercise experiment, we might find that unmotivated people may be more likely to drop out of the exercise group than out of the comparison group. This would leave the experimental group with more highly-motivated individuals than the control group. Because highly-motivated individuals will probably try harder on the memory test (DV), we would likely find memory differences between groups even if the exercise has no effect, simply because of the characteristics of those who dropped out. Missing information. Attrition due to missing information is usually less of a threat but is still a concern. This happens when a participant completes the experiment but fails to answer every question. In our exercise-memory experiment, we could find missing information if participants do not indicate their gender or age on the pre-interview survey, or, more importantly, if they skip one or more trials in the memory test. Missing information could lead to biased results, for example if people in one group systematically skip specific problems.
(5) Limit experimenter bias
One of the easiest factors of the environment to control is the effect of the researcher or experimenter.
We know from several experiments that the behavior or expectations of a researcher can have a significant impact on the results of an experiment. When this happens, it is not possible to determine whether the results were due to the IV or to actions of the experimenter. There are two important flaws to consider here. The first type of flaw focuses on our motivation (conscious or unconscious) to bias the results; the second type of flaw focuses on our ability to bias the results. Conflict of interest. Conflict of interest occurs when a researcher has a strong investment (fame or fortune) in a particular outcome of the experiment. This is a problem because the researcher may intentionally or unintentionally bias the experiment. For our exercise example, if we were paid by an exercise equipment company to show that using their equipment leads to better memory, we would have a conflict of interest. You should know that a conflict of interest doesn't guarantee that a researcher will bias the results, but it creates a situation in which a researcher is more motivated to influence the outcome. Opportunity for bias. Opportunity for bias is a flaw in which the experimenter does not take important precautions to reduce his or her chances of affecting the behavior of the participants. The best precaution is using a double-blind technique in which the researcher is unaware of which condition the participant is in. A second method is to reduce contact with the participants. Most experiments are automated so that a computer presents the instructions rather than the experimenter doing so. A third method is to use automated or objective scoring of data. If that is not possible, then two raters should score all data independently (inter-rater reliability) without knowing the participant's condition (double-blind).
For our memory experiment, if we use a recall rather than a multiple-choice test of memory, then the experimenter will have to determine which recall responses are correct and which are incorrect. This is an opportunity for bias. Therefore, two raters who are unaware of conditions should be used. We need to reduce the opportunity and motivation for experimenters to bias the results of an experiment, or we will not be able to rely on the results.
C. Problems with the sample
(6) Use appropriate sample size
Of course, we can only test a hypothesis on a sample from the larger population. There are several possible flaws that can occur in an experiment that could be due to the characteristics of such a sample. We will only focus on those that relate to the size of the sample. In experiments, we use statistics to compare performance on the DV(s) to see if there is a difference between conditions. The size of the sample that we need to get reliable results is determined by how variable the population is. Variability refers to the extent to which the population's scores on a DV tend to naturally vary, regardless of any sort of manipulation. Let's say we have two conditions (exercise and no exercise) in the memory experiment above. Imagine using a highly variable population such as 5-year-olds worldwide to study the relationship between exercise and memory. It is likely that there is automatically a great deal of difference in the memory scores of all 5-year-olds even prior to any sort of exercise manipulation. When scores vary a lot in a population, and we draw small samples, our groups are likely to be quite different at the outset of the study, because people with extreme scores (high or low) have a big impact on the group average. As we increase sample size, however, individual extreme scores get increasingly balanced out by other scores, leading to more comparable groups.
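This balancing-out effect can be illustrated with a small simulation in Python. The population mean and spread below are arbitrary, assumed values, not data from any real memory test: we repeatedly draw two groups from the same untreated population and see how far apart their averages land.

```python
import random
import statistics

def average_group_gap(sample_size, trials=2000, seed=7):
    """Draw two groups from the SAME untreated population many times and
    return the average absolute difference between the two group means."""
    rng = random.Random(seed)
    gaps = []
    for _ in range(trials):
        # Assumed population: memory scores with mean 50, SD 15
        a = [rng.gauss(50, 15) for _ in range(sample_size)]
        b = [rng.gauss(50, 15) for _ in range(sample_size)]
        gaps.append(abs(statistics.mean(a) - statistics.mean(b)))
    return statistics.mean(gaps)

print(f"groups of 5:   average gap {average_group_gap(5):.1f} points")
print(f"groups of 100: average gap {average_group_gap(100):.1f} points")
```

Even with no treatment at all, small groups start out noticeably different, while large groups are nearly identical at the outset; any treatment effect smaller than that starting gap is hard to detect with small samples.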
Typically, we have little information about how much a variable varies in a population; so, it is always a good idea to have large samples. There are primarily two flaws associated with an inadequate sample size: lack of power and low generalizability. Small sample size and lack of power. Power, or the likelihood of finding a difference that really does exist between groups, depends on the size of the sample. Power increases with larger samples because, as we have seen, there is less extreme variation in the groups. Therefore, a null effect found in a study that uses small samples could simply be due to the lower power that comes with smaller sample sizes. In other words, there could really be a difference between groups, but it is not evident in the study because the sample sizes are so small. Small sample size and limited generalizability. Researchers are always interested in results that we can have some confidence are most likely true for all members of a population. The second flaw associated with small sample sizes is the limited extent to which we can generalize a result obtained with a small sample to the larger population of interest. So, you may find a significant difference between two groups, but if the samples are small it is difficult to generalize these results to other individuals. Remember that scores vary in a population, and the chances of having unusual (or unreplicable) results are higher in small samples, where extreme scores have a large impact on our statistical measures. One thing to keep in mind is that there is no perfect sample size to use – not even 30 subjects. It all depends on the variability of the population. If the population is highly variable (e.g., adult humans), then 20 subjects may not be enough to detect a true difference or to generalize to the population. If the population is relatively homogeneous (e.g., potatoes from a certified lab supplier), then 4 subjects may be large enough.
D. Problems with Conclusions
(7) Make tentative conclusions
When researchers finish collecting data for a study, they analyze the data and prepare a manuscript to report the results of the study to others in the scientific community. The results of the study should always be reported using tentative conclusions. That is, instead of talking of "proving" a hypothesis true, you will always word conclusions in a way that acknowledges the possibility of future revisions of the hypothesis. This is because findings should be considered in light of the limitations that studies typically possess. Two flaws can occur when a researcher is making a conclusion. Conclusions too strong. The first flaw involves drawing too strong a conclusion. This occurs when a researcher fails to acknowledge the possibility of alternative (and at times, better) explanations for the results. A researcher's conclusion must always reflect the limitations of an experiment, and the potential for error. Even in the absence of obvious problems with an experiment, there is always the possibility that some effect was due to chance (i.e., factors other than the IV) and not to the manipulation. Need for replication. The second flaw related to the tentative nature of the conclusions that are reported has to do with the need for replication in research. Replication involves repeating a study with the same (or very similar) methods but with different participants and sometimes with different researchers. When many different experimenters obtain similar results when conducting a similar study with different subjects, we can be more confident that the results generalize to a larger population. To the extent that participants are varied in replications, we gain information about how well an effect holds across age groups, countries, cultures, species, and so on. Furthermore, a consistent result allows researchers to make stronger conclusions.
(8) Requirements for causal conclusions
Recall that the goal of an experiment is to try to find support for your hypothesis and, if possible, rule out alternative explanations at the same time. To find support for a hypothesized causal relationship, you need to consider several things. First, consider your IVs and DVs. Your DV must be valid and sensitive. Your IV should also be valid and manipulated by the experimenter. This means that participants are randomly assigned to conditions, with at least one condition being a good comparison group. Second, you want to eliminate unwanted variables, such as reducing attrition (especially unequal drop out among conditions) and limiting the effects of the experimenter. Finally, you need to make sure to have a large enough sample, which depends on the variability in the population. This reading provides only a subset of the things you will want to consider about IVs, DVs, and controls. But we think these are a good start to helping you think about testing hypotheses through research and the type of conclusions you can make from those studies. We hope this chapter increases your interest in finding out more about evaluating research.
QA9. Using the score card below, determine the flaws, if any, with each experiment description. If there are none, write "good experiment". No credit will be given for questions such as "how many subjects were there?". Something is either clearly a flaw or it is not, so do not put in comments like "it doesn't say how many subjects there were".
A. Possible problems with IVs and DVs
(1) Use valid measure of IVs and DVs – manipulate (IV) and measure (DV) the variables you claim you are, especially the most important aspects of the variable. This is important because valid measures are required to be able to appropriately interpret the results from an experiment and draw appropriate conclusions.
Flaw: (1) Validity of DV
(2) Must have comparison group(s) – randomly assign participants to one or more groups that are not exposed to the treatment or are exposed to less of the treatment, so you can rule out alternative explanations for any effects. Flaws: (1) Poor or missing comparison group, (2) No random assignment
(3) Use appropriately sensitive DVs – use tests or measures on which people do not score so well or so poorly that you fail to detect a true treatment effect simply because your measures are insensitive. Flaws: (1) Sensitivity of DV, (2) DV is not scored objectively
B. Problems with unwanted, extraneous or confounding variables
(4) Limit attrition – try to limit the number of participants who fail to complete the experiment or who skip information, because this can lead to differences between groups that affect the DV, which makes it impossible to conclude that the IV was responsible for any DV differences between groups. Flaw: (1) Mortality or attrition
(5) Limit experimenter bias – try to limit the impact of the behaviors or expectations of the researcher on the results of an experiment. When these have an impact, it is not possible to determine whether the results were due to the IV or to actions of the experimenter. Flaws: (1) Experimenter bias, (2) Conflict of interest
C. Problems with the sample
(6) Use appropriate sample size – make sure that you have a large enough sample to detect a true effect of the treatment and a large enough sample to generalize to the entire population. Flaws: (1) Small sample size; (2) Poor sample selection
D. Problems with Conclusions
(7) Make tentative conclusions – experiments always have limits, and conclusions must reflect the possibility that they could be shown to be incorrect. Replicating the experiment and getting the same result helps increase our confidence in the results but will never prove the hypothesis true.
Flaws: (1) Not a tentative conclusion / premature generalization of results
(8) Requirements for causal conclusions – to make a causal conclusion, one has to randomly assign participants to a valid IV which has at least one comparison condition, measure a DV that is sensitive and valid, and then control unwanted variables (e.g., limit attrition and experimenter bias). Flaw: (1) Inappropriate causal statement
Testing Scientific Explanations
Scientists are curious about the causes of things. E.g., does exercise improve memory – and, if so, why? Does using cell phones cause cancer – and, if so, how?
Science allows us to systematically test competing hypotheses of such relations and the explanations for them. E.g., exercise → neurotransmitters → increased memory; exercise → more blood to the brain → increased memory.
Use the scientific method to conduct experiments to test such hypotheses. Our conclusions are only as good as the experiments used to test our hypotheses.
Hypothesis – a testable relationship between 2 or more variables. 3 important elements:
1. Independent variable (IV) – what we want to see is effective; the thing we vary in an experiment
2. Dependent variable (DV) – the variable that we think will be affected; the outcome variable
3. Type of relationship between them – causal or correlational
Causal: IV causes DV; exercise (IV) causes an increase in memory (DV); cell phone usage (IV) causes an increase in cancer (DV)
Correlational: people who exercise more have better memory; more frequent cell phone usage is related to the incidence of cancer.
NONCAUSAL – verbs of association: is associated with, correlates with, goes along with, co-occurs with, is related to
CAUSAL verbs:
Positive: improve, better, develop, increase
Negative: worsen, hinder, hurt, destroy, stop, impede, weaken, reduce, lower, decrease, slow
Nondirectional: affect, have an effect on, have influence on, have impact on, lead to, cause, bring about
Evaluating research: identify the IVs and DVs, and determine if a causal or correlational prediction or conclusion is being made. Then evaluate based on:
A. Possible problems with IVs and DVs: (1) Use valid measure of IVs and DVs, (2) Must have comparison group(s), (3) Use appropriately sensitive DVs
B. Problems with eliminating unwanted variables: (4) Limit attrition, (5) Limit experimenter bias
C. Problems with the sample: (6) Use appropriate sample size
D. Problems with Conclusions: (7) Make tentative conclusions, (8) Requirements for causal conclusions
Possible problems with IVs and DVs
(1) Use valid measure of IVs and DVs
Constructs – variables that cannot be directly measured or observed (e.g., memory, intelligence, or depression); we rely on scientifically developed measures as indicators of them. We need to manipulate or vary an appropriate IV.
Valid definition of cell phone usage: minutes of cell phone usage per day (e.g., 2+ hours a day, 0.5-2 hours a day, less than 30 mins a day)
Invalid definition of cell phone usage: amount of money spent on a cell phone per month (e.g., more than $100, $40-99, less than $40 a month). It doesn't really indicate how much time one spends on the phone.
Valid definition of memory improvement: number of items correctly recalled
Invalid definition of memory improvement: weighing someone's head or self-reported perception of memory improvement
* 2 problems with VALIDITY *
1. Construct validity – the degree to which the test measures the construct it is supposed to measure.
Exercise: hours per week on treadmill (0, 3, 6 h/w)
If your experiment does not have good construct validity, then you are not actually manipulating or measuring what you are claiming to – and thus, you will not test your actual hypothesis.
2. Content validity – the degree to which your measure reflects the actual material, substance, or content of the variables measured. Most constructs have many aspects. The more of them you include in your measure or manipulation, the better your content validity becomes. Memory: test several types of memory (e.g., verbal, spatial).
Summary: To draw good conclusions, you have to manipulate and measure the variables you claim you are (construct) as completely as possible (content).
Possible problems with IVs and DVs
(2) Must have comparison group(s)
To test a correlational hypothesis, you just compare already existing groups that differ on their level of exercise: exercise (IV) is associated with increased memory (DV).
To test a causal hypothesis, you need to manipulate the IV (randomly assign to different levels of the IV): exercise (IV) causes an increase in memory (DV).
For causal conclusions, you need one or more comparison groups that are not exposed to the treatment. 2 groups: exercise (6 times a week, 0 times a week). 3 groups: exercise (6 / week, 3 / week, and 1 / week).
* 2 problems with COMPARISON GROUPS *
1. No comparison group. Pretest (memory test) → treatment (exercise) → posttest (memory test)? It may be that everyone would improve on the second test even without the treatment (e.g., they learn to take the test, are more focused, more motivated). To be able to attribute any change to the IV, we need a baseline (no or less treatment) for comparing change.
2. Poor comparison group. The groups vary on things other than the IV. E.g., if people in the exercise condition work out in a fun group class, then any improvement may be due to the social interaction.
Comparison groups allow researchers to rule out alternative explanations for any effects, but only if the comparison group is equivalent in all regards except for the IV.

To support causal statements, you need random assignment. The way to make sure the groups are as similar as possible except for the manipulation is to randomly assign participants to conditions (using chance).

Contrast this with putting people into groups based on their natural, pre-existing characteristics, such as how much they usually exercise per week (0, 3, or 6 hours). Because many other things vary with how much someone chooses to exercise per week (e.g., diet, fitness level, motivation level, weight), one of these other factors could explain the improved memories.

Summary: you need to randomly assign participants to a "good" comparison group to make causal conclusions.

A. Possible problems with IVs and DVs
(3) Use appropriately sensitive DVs

To test a hypothesis, we vary one or more IVs to see whether they affect performance on the DV. We have to make sure that the test or measure we are using for the DV is sensitive enough to detect any possible effect.

* 2 problems with DV SENSITIVITY *

1. Ceiling or floor effects. People either all do extremely well (ceiling effects) or all do extremely poorly (floor effects), making it hard to find a real effect. Your manipulation, even if actually effective, may not appear to change the scores because they cannot move any higher (or lower).

2. Imprecise measures. The scale is not detailed enough to find a difference. E.g., categorizing people's recall as "a lot," "some," or "none": it would take a very large jump in memory to go from "some" to "a lot."

B. Problems with eliminating unwanted variables

Isolate the variables of interest – when you test the effect of an IV on some DV, it is important to control the rest of the environment (i.e., eliminate or minimize the influence of other variables). The goal is that the groups are not different before the manipulation.
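Random assignment, mentioned above, can be sketched in a few lines. This is a minimal illustration, not a prescribed procedure; the function name and the exercise conditions are assumptions for the example.

```python
import random

def randomly_assign(participants, conditions, seed=None):
    """Shuffle participants, then deal them round-robin into the
    conditions, so pre-existing differences spread across groups by chance."""
    rng = random.Random(seed)
    pool = list(participants)
    rng.shuffle(pool)
    groups = {c: [] for c in conditions}
    for i, person in enumerate(pool):
        groups[conditions[i % len(conditions)]].append(person)
    return groups

# 30 participants dealt into three exercise conditions of 10 each.
groups = randomly_assign(range(30), ["6x/week", "3x/week", "0x/week"], seed=42)
```

Because assignment depends only on chance, characteristics like diet, fitness, and motivation tend to be balanced across the groups rather than clustered in one of them.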
- Randomly assign participants to your groups.
- Manipulate the IV rather than using naturally existing groups (which may differ on other variables).

Otherwise, you cannot be sure that your varying of the IV(s) was really the reason for any change in performance (DV). E.g., for exercise and memory, what other factors might matter? Motivation, starting health level, previous level of exercise, and attitude toward and experience with exercise.

Of the many serious flaws that concern controlling the environment, we will mention only two: attrition and experimenter bias.

B. Problems with eliminating unwanted variables
(4) Limit attrition

Attrition – participants "drop out" of or fail to complete the experiment. Attrition can lead groups to become different. It is most problematic if just one group loses participants, or if participants with specific characteristics that may affect the DV drop out.

* 2 problems with ATTRITION *

1. Drop out – after participants have been randomly assigned to conditions or groups, a participant leaves without finishing or stops showing up for sessions. When this happens, the groups may become different for a reason other than the manipulation. E.g., unmotivated people may be more likely to drop out of the exercise group than out of the comparison group. Then, in addition to assigned exercise, the groups also differ in motivation level (highly motivated individuals may try harder on the memory test (DV)).

2. Missing information – a participant fails to answer every question. Missing information can lead to biased results if people in one group systematically skip specific questions.

B. Problems with eliminating unwanted variables
(5) Limit experimenter bias

We know from several experiments that the behavior or expectations of a researcher can have a significant impact on the results of an experiment. When this happens, it is not possible to determine whether the results were due to the IV or to the actions of the experimenter.
* 2 problems with EXPERIMENTER BIAS *

1. Conflict of interest – when a researcher has a strong investment (fame or fortune) in a particular outcome of the experiment. E.g., a researcher paid by an exercise equipment company to show that using its equipment leads to better memory. The researcher may intentionally or unintentionally bias the experiment.

2. Opportunity for bias – important precautions are not taken to reduce the chances of the experimenter affecting participants' behavior. The best precaution is a double-blind technique, in which the researcher is unaware of which condition the participant is in. Also, reduce contact with the participants (e.g., automate the experiment), and make scoring automated or as objective as possible (check inter-rater reliability).

C. Problems with the sample
(6) Use an appropriate sample size

We use statistics to compare performance on the DV to see if there is a difference between groups. E.g., M(exercise) = 9.4 items correctly recalled vs. M(control) = 7.2 items correctly recalled.

The sample size we need to get reliable results is determined by how variable the population is. Variability refers to the extent to which the population's scores on a DV tend to vary naturally, regardless of any manipulation. Variability is inevitable, but the greater the variability, the harder it is to detect a difference. A group of 20- to 24-year-old college students will have less variability in memory than a group of 20- to 80-year-olds from the general population. E.g., a 2.2-item difference may be detectable in the twentysomething population because of its small variability. (There are other possible flaws, such as sample selection, that are not covered here.)

* 2 problems with SAMPLE SIZE *

1. Small sample size and lack of power. Power – the likelihood of finding a difference that really does exist between groups. Power increases with larger samples because there is less extreme variation in larger groups.
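The link between sample size and power can be illustrated with a small simulation. The numbers are invented for illustration (a real 2.2-item difference, a population standard deviation of 4), and the "significant at ~2 standard errors" rule is a rough z-test stand-in for a proper statistical test.

```python
import random
import statistics

def one_experiment(n_per_group, true_diff, sd, rng):
    """One simulated two-group memory experiment; call it 'significant'
    if the group means differ by more than ~2 standard errors."""
    control = [rng.gauss(0, sd) for _ in range(n_per_group)]
    exercise = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
    se = (2 * sd**2 / n_per_group) ** 0.5
    return abs(statistics.mean(exercise) - statistics.mean(control)) > 2 * se

def power(n_per_group, true_diff=2.2, sd=4.0, runs=500, seed=7):
    """Fraction of simulated experiments that detect the real difference."""
    rng = random.Random(seed)
    return sum(one_experiment(n_per_group, true_diff, sd, rng)
               for _ in range(runs)) / runs
```

With 5 participants per group, only a small fraction of the simulated experiments detect the real 2.2-item difference; with 50 per group, most of them do.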
Therefore, if you do not find a difference between groups with small samples, it could simply be that you do not have enough power. For many experiments using humans, 20-30 participants per group is often large enough; for potatoes or plants, 4 or 5 may be enough (it depends on the variability).

2. Small sample size and limited generalizability. Generalizability – the extent to which the results will likely be true for all members of a population or all materials (recall that we test only some people and some materials). We may be less confident that our results will generalize to the larger population of interest if we have smaller samples. This is really a problem only when you do find a significant difference between groups: scores vary in a population, and the chances of unusual results are higher in small samples, where extreme scores have a large impact on our statistical measures.

D. Problems with Conclusions
(7) Make tentative conclusions

Instead of talking about "proving" a hypothesis true, we can only say that our results support the hypothesis. Conclusions are based on probability (e.g., .05), not deduction. Even in the absence of obvious problems with an experiment, there is always the possibility that some effect was due to chance (e.g., 5 conclusions out of 100 could be wrong) and not to the manipulation. Conclusions must acknowledge the possibility of future revisions of the hypothesis.

* 2 problems with CONCLUSIONS *

1. Conclusions too strong. A researcher fails to acknowledge the possibility of alternative explanations for the results. A researcher's conclusion must always reflect the limitations of an experiment and the potential for error.

2. Need for replication. Replication involves repeating a study with the same (or very similar) methods but with different participants, and sometimes with different researchers. When we obtain similar results, we can be more confident that the results generalize to a larger population.

D. Problems with Conclusions
(8) Requirements for causal conclusions

To find support for a hypothesized causal relationship, you need to consider several things:
1st: Your IVs and DVs must be valid and sensitive.
2nd: Your IV must be manipulated by the experimenter, with participants randomly assigned to conditions and at least one condition serving as a good comparison group.
3rd: You need to eliminate unwanted variables, for example by reducing attrition (especially unequal dropout among conditions) and limiting the influence of the experimenter.
4th: You need a large enough sample, which depends on the variability in the population.
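The earlier point that conclusions rest on probability, not proof, can also be simulated. Here two groups are drawn from the same population, so there is no real effect at all, yet roughly 5% of the simulated experiments still cross a ~2-standard-error "significance" threshold by chance. All numbers are invented for illustration.

```python
import random
import statistics

def null_experiment(n_per_group, rng):
    """Two groups drawn from the SAME population (no real effect); call it
    'significant' if the means differ by more than ~2 standard errors."""
    a = [rng.gauss(0, 1) for _ in range(n_per_group)]
    b = [rng.gauss(0, 1) for _ in range(n_per_group)]
    se = (2 / n_per_group) ** 0.5
    return abs(statistics.mean(a) - statistics.mean(b)) > 2 * se

rng = random.Random(3)
false_positive_rate = sum(null_experiment(30, rng) for _ in range(2000)) / 2000
```

This is why conclusions must stay tentative and why replication matters: about 1 experiment in 20 will "find" an effect that is not there, and only repeating the study can sort chance results from real ones.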