Research Methods in Psychology
Data Analysis and Interpretation: Part II. Tests of Statistical Significance and the Analysis Story

Null Hypothesis Significance Testing (NHST)
Null hypothesis testing is used to determine whether mean differences among groups in an experiment are greater than the differences that are expected simply because of error variation (chance).

Null Hypothesis Significance Testing (NHST)
The first step in null hypothesis testing is to assume that the groups do not differ, that is, that the independent variable did not have an effect (the null hypothesis, H0). Probability theory is used to estimate the likelihood of the experiment's observed outcome, assuming the null hypothesis is true.

NHST (continued)
A statistically significant outcome is one that has a small likelihood of occurring if the null hypothesis is true. When the outcome is statistically significant, we reject the null hypothesis and conclude that the independent variable did have an effect on the dependent variable. A statistically significant outcome indicates that the difference between means obtained in an experiment is larger than would be expected if error variation alone (i.e., chance) were responsible for the outcome.

NHST (continued)
How small does the probability have to be in order to decide that a finding is statistically significant? The consensus among members of the scientific community is that outcomes with a probability of less than 5 times out of 100 (p < .05), if the null hypothesis were true, are judged to be statistically significant. This probability is called alpha (α), or the level of significance.

NHST (continued)
What does a statistically significant outcome tell us? An outcome with a probability just below .05 (and thus statistically significant) has about a 50/50 chance of being repeated in an exact replication of the experiment. As the probability of the outcome decreases (e.g., p = .025, p = .01, p = .005), the likelihood of observing a statistically significant outcome (p < .05) in an exact replication increases. APA recommends reporting the exact probability of the outcome.

NHST (continued)
What do we conclude when a finding is not statistically significant? When the outcome is not statistically significant (p > .05), we do not reject the null hypothesis. However, we do not necessarily accept the null hypothesis either; that is, we do not conclude that the independent variable had no effect. We simply cannot draw a conclusion about the effect of the independent variable. Some factor in the experiment may have prevented us from observing an effect of the independent variable (e.g., too few participants).

NHST (continued)
Because decisions about the outcome of an experiment are based on probabilities, Type I or Type II errors may occur. A Type I error occurs when the null hypothesis is rejected but the null hypothesis is true. That is, we claim that the effect of the independent variable is statistically significant (because we observed an outcome with p < .05) when there really is no effect of the independent variable. The probability of a Type I error is alpha, the level of significance (α = .05).

NHST (continued)
A Type II error occurs when the null hypothesis is false but it is not rejected. That is, we claim that the effect of the independent variable is not statistically significant (because we observed an outcome with p > .05) when there really is an effect of the independent variable that our experiment missed. Because of the possibility of Type I and Type II errors, researchers are tentative in their claims. We use words such as "support for the hypothesis" or "consistent with the hypothesis" rather than stating that a hypothesis has been "proven."
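The meaning of alpha as the Type I error rate can be illustrated with a short simulation. The sketch below is not part of the original slides; the population values (mean 50, SD 10) and sample sizes are made up for illustration. When the null hypothesis is true, roughly 5% of exact replications still produce p < .05.

```python
# Minimal simulation (hypothetical values) of the Type I error rate when H0 is true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000
false_rejections = 0

for _ in range(n_experiments):
    # Both "groups" are drawn from the same population, so the null hypothesis is true.
    group1 = rng.normal(loc=50, scale=10, size=20)
    group2 = rng.normal(loc=50, scale=10, size=20)
    _, p = stats.ttest_ind(group1, group2)
    if p < alpha:
        false_rejections += 1          # a Type I error: rejecting a true null hypothesis

print(f"Proportion of Type I errors: {false_rejections / n_experiments:.3f}")
# Expected output: approximately 0.05, matching the level of significance.
```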
NHST: Comparing Two Means
The appropriate inferential statistical test when comparing two means obtained from different groups of participants is a t-test for independent groups. The appropriate test when comparing two means obtained from the same participants (or matched groups) is a repeated measures (within-subjects) t-test. A measure of effect size should be reported when NHST is used.

Comparing Two Means (continued)
Independent groups t-test. The t-test for independent groups is defined as the difference between two sample means (e.g., treatment group and control group) divided by the standard error of the mean difference ($s_{M_1 - M_2}$). The calculation formula is:

$$ t = \frac{M_1 - M_2}{\sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} $$

Comparing Two Means (continued)
Standard deviation and variance. The standard deviation (SD or s) is a measure of how far, on average, a score (X) is from the mean (M):

$$ s = \sqrt{\frac{\sum (X - M)^2}{N - 1}} $$

The variance (s²) is a measure of variability; it is the square of the standard deviation.

Comparing Two Means (continued)
Either by using a calculator or a computer, obtain a value for the t statistic. Next, identify the probability associated with the outcome. If a computer and statistical software are used, the probability of the outcome will be presented with the value for t as part of the output. If the value for t is calculated using the formula, the probability of the outcome can be found by using the t table (Table A.2 of the Appendix) with df = N - 2.

Comparing Two Means (continued)
If the probability of the outcome is less than .05 (p < .05), reject the null hypothesis of no difference between the means, and conclude that the independent variable had a statistically significant effect on the dependent variable. If the probability of the outcome is greater than .05 (p > .05), do not reject the null hypothesis of no difference between the means. With a nonsignificant outcome, we withhold judgment about the effect of the independent variable. Calculate the effect size. Determine the power of the statistical test.

Comparing Two Means (continued)
A measure of effect size should always be calculated. For two means, Cohen's d can be calculated using values from the t-test:

$$ d = \frac{2t}{\sqrt{df}} $$

Sometimes a large effect size can be observed with an outcome that is not statistically significant. This can occur when there is not sufficient power to detect the effect of the independent variable (e.g., too few participants).

Comparing Two Means (continued)
A repeated measures (within-subjects) t-test is used to test the difference between performance in the treatment condition and the control condition in a repeated measures design or matched groups design:

$$ t = \frac{\bar{D}}{s_{\bar{D}}} $$

where $\bar{D}$ is the mean of the difference scores between the treatment and control conditions for each participant and $s_{\bar{D}}$ is the standard error of the mean of the difference scores.

Comparing Two Means (continued)
The standard error of the mean is:

$$ s_M = \frac{s}{\sqrt{N}} $$

Comparing Two Means (continued)
Although the formula for calculating the t statistic in the repeated measures design is slightly different, the procedures for NHST are the same. The t value is obtained, followed by the associated probability value. If p < .05, reject the null hypothesis, and conclude the independent variable had a statistically significant effect on the dependent variable. If p > .05, do not reject the null hypothesis; the outcome of the statistical test was not significant.
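The two t-tests and Cohen's d described above can be computed with standard statistical software. The sketch below uses hypothetical scores (not from the slides) and scipy's t-test functions, which implement the same pooled-variance and difference-score formulas given above.

```python
# Minimal sketch (hypothetical scores) of the independent groups and
# repeated measures t-tests, with Cohen's d computed from t and df.
import numpy as np
from scipy import stats

treatment = np.array([12, 15, 14, 10, 13, 16, 11, 14])
control   = np.array([ 9, 11, 10,  8, 12, 10,  9, 11])

# Independent groups t-test (pooled variance, as in the formula above)
t_ind, p_ind = stats.ttest_ind(treatment, control)
df_ind = len(treatment) + len(control) - 2
d_ind = 2 * t_ind / np.sqrt(df_ind)            # Cohen's d = 2t / sqrt(df)
print(f"Independent groups: t({df_ind}) = {t_ind:.2f}, p = {p_ind:.3f}, d = {d_ind:.2f}")

# Repeated measures t-test: the same participants in both conditions,
# so the test is based on each participant's difference score (D).
t_rep, p_rep = stats.ttest_rel(treatment, control)
diffs = treatment - control
print(f"Mean difference D-bar = {diffs.mean():.2f}, "
      f"standard error = {stats.sem(diffs):.2f}")
print(f"Repeated measures: t({len(diffs) - 1}) = {t_rep:.2f}, p = {p_rep:.3f}")
```

Because the repeated measures test removes variation due to individual participants, it will typically produce a larger t value than the independent groups test for the same scores.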
Data Analysis Involving More Than Two Conditions
An experiment can have one independent variable with more than two levels, or an experiment might have two or more independent variables (each with at least two levels) in a complex design. The most frequently used statistical procedure for experiments with more than two conditions is analysis of variance (ANOVA), which uses null hypothesis significance testing (NHST).

ANOVA
Analysis of variance (ANOVA) is an inferential statistics test used to determine whether an independent variable has had a statistically significant effect on a dependent variable. The logic of ANOVA is based on identifying sources of error variation and systematic variation in the data. In a properly conducted experiment, the differences among participants should be the only source of error variation within each group. The experimental procedures should be held constant within each condition to decrease error variation.

ANOVA (continued)
The second source of variation in a random groups design is variation between the groups. If the null hypothesis is true (no difference between the groups), any observed difference among the means of the groups can be attributed to error variation: the differences among the people in the groups. Thus, when the null hypothesis is assumed to be true, any differences among means in the experiment are attributed to error variation within the groups and error variation between the groups.

ANOVA (continued)
When the null hypothesis is false (the independent variable has an effect), the means for the conditions of the experiment should be different. An independent variable that has an effect on behavior should produce systematic differences in the means across the conditions of the experiment. Therefore, when the independent variable has an effect on behavior, differences among group means can be attributed to the effect of the independent variable (systematic variation) plus error variation.

ANOVA (continued)
The F-test is a statistical test that allows us to determine whether the variation due to the independent variable is larger than what would be expected based on error variation alone. The conceptual definition of the F-test is:

$$ F = \frac{\text{variation between groups}}{\text{variation within groups}} $$

ANOVA (continued)
Because "variation between groups" can be attributed to error variation plus systematic variation, and "variation within groups" is attributed to error variation, the F ratio can be rewritten as:

$$ F = \frac{\text{error variation} + \text{systematic variation}}{\text{error variation}} $$

ANOVA (continued)
If the null hypothesis is true, there is no systematic variation between groups (no effect of the independent variable), and the F ratio has an expected value of 1.00; error variation divided by error variation equals 1.00:

$$ F = \frac{\text{error variation} + 0}{\text{error variation}} = 1.00 $$

ANOVA (continued)
As the amount of systematic variation increases (due to the effect of the independent variable), the expected value of the F ratio becomes greater than 1.00:

$$ F = \frac{\text{error variation} + \text{systematic variation (effect of the IV)}}{\text{error variation}} > 1.00 $$

How much greater than 1.00 does the F ratio have to be before we can be confident that it reflects true systematic variation due to the independent variable (and not simply chance factors)? This is where NHST comes in: to be statistically significant, the F value needs to be large enough so that its probability of occurrence, if the null hypothesis were true, is less than our level of significance (p < .05).
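The behavior of the F ratio can be demonstrated with a short simulation. This sketch is not part of the original slides; the population means, SD, and sample sizes are hypothetical. It shows that F values cluster near 1.00 when the null hypothesis is true and tend to exceed 1.00 when the independent variable has an effect.

```python
# Minimal sketch (hypothetical values) of the F-ratio logic described above.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def average_f(group_means, n_per_group=20, n_experiments=2000):
    """Simulate many experiments and return the average F value."""
    fs = []
    for _ in range(n_experiments):
        groups = [rng.normal(loc=m, scale=10, size=n_per_group) for m in group_means]
        f, _ = stats.f_oneway(*groups)
        fs.append(f)
    return np.mean(fs)

# Null hypothesis true: all population means are identical.
print("Mean F when H0 is true:  ", round(average_f([50, 50, 50, 50]), 2))   # about 1.0
# Independent variable has an effect: population means differ.
print("Mean F when H0 is false: ", round(average_f([50, 53, 56, 59]), 2))   # well above 1.0
```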
ANOVA (continued)
The logic of statistical inference with ANOVA is similar to that used with the t-test. The first step is to assume no difference among the means of the experiment. If the omnibus F-test is statistically significant, we reject the null hypothesis of no difference among means. A statistically significant F-test indicates that there is a difference somewhere among the means in the experiment. The statistically significant omnibus, or overall, F-test does not indicate which means are different.

ANOVA (continued)
The ANOVA Summary Table provides the information for estimating the sources of variance: between groups (systematic + error variation) and within groups (error variation).

Source            Sum of Squares (SS)   df   Mean Square (MS)   F      p
Group (between)   54.55                  3   18.18              7.80   .002
Error (within)    37.20                 16    2.33

The Mean Square for the "Group" independent variable provides an estimate of systematic variation plus error variation. The Mean Square for "Error" provides an estimate of error variation. The F-test is the Group MS divided by the Error MS (18.18 ÷ 2.33 = 7.80). This F-test is statistically significant because .002 < .05.

ANOVA (continued)
The between-groups sum of squares is equal to the sum of the squared differences between each group mean and the overall (grand) mean. The within-groups sum of squares is equal to the sum of the squared differences between each individual score in a group and the mean of that group. The total sum of squares is equal to the sum of the between-groups SS and the within-groups SS. A mean square (MS) is simply SS divided by df.

ANOVA (continued)
Because the F-test is statistically significant, we reject the null hypothesis and conclude that the independent variable had a statistically significant effect on the dependent variable. The significant F-test tells us that the group means in the experiment are different, but it does not tell us which means in the experiment are different. It is essential to examine the means to interpret the effect of the independent variable. Simply finding out whether the effect was statistically significant or not is not sufficient when analyzing data.
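The entries in an ANOVA summary table can be computed directly from the raw scores. The sketch below uses hypothetical data (not the values in the table above); in the standard computational form, each group mean's squared deviation from the grand mean is counted once for every score in that group.

```python
# Minimal sketch (hypothetical data) of building a one-way ANOVA summary table.
import numpy as np
from scipy import stats

# Four independent groups, five hypothetical participants per group.
groups = [
    np.array([3, 5, 4, 6, 5]),
    np.array([7, 8, 6, 9, 7]),
    np.array([5, 6, 7, 5, 6]),
    np.array([9, 8, 10, 9, 11]),
]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-groups SS: squared deviation of each group mean from the grand mean,
# weighted by the number of scores in the group.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
# Within-groups (error) SS: squared deviation of each score from its group mean.
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)
ms_between = ss_between / df_between
ms_within = ss_within / df_within

f_ratio = ms_between / ms_within
p_value = stats.f.sf(f_ratio, df_between, df_within)   # upper-tail probability

print(f"Group (between): SS = {ss_between:.2f}, df = {df_between}, MS = {ms_between:.2f}")
print(f"Error (within):  SS = {ss_within:.2f}, df = {df_within}, MS = {ms_within:.2f}")
print(f"F({df_between}, {df_within}) = {f_ratio:.2f}, p = {p_value:.4f}")
# The same F and p can be obtained directly with stats.f_oneway(*groups).
```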
Calculating Effect Size for Designs with Three or More Independent Groups
The effect size for experiments with three or more groups is based on measures of "strength of association." These measures allow researchers to estimate the amount of variability (variance) in participants' scores that can be attributed to the effect of the independent variable. Larger effect sizes indicate that the independent variable can account for, or "explain," participants' performance more than smaller effect sizes. In ANOVA, a popular measure of strength of association is eta squared (η²).

Effect Size (continued)
Eta squared is easily calculated from values found in the ANOVA summary table:

$$ \eta^2 = \frac{\text{Sum of Squares Between Groups}}{\text{Total Sum of Squares}} $$

Eta squared can also be calculated simply from the report of an F-test:

$$ \eta^2 = \frac{F \times df_{\text{effect}}}{(F \times df_{\text{effect}}) + df_{\text{error}}} $$

Effect Size (continued)
Another measure of effect size for use with three or more groups is Cohen's f. Calculate f using the value for eta squared:

$$ f = \sqrt{\frac{\eta^2}{1 - \eta^2}} $$

Cohen's suggested guidelines for interpreting effect sizes using f are: small, f = .10; medium, f = .25; large, f = .40.

Assessing Power for Independent Groups Designs
Suppose a researcher observes an effect size of f = .40 (a "large" effect), but the effect of the independent variable is not statistically significant. Suppose there were 5 participants in each of four conditions (df = 3 for the effect of the independent variable). By referring to the power tables (Table A.5 in the Appendix), the researcher discovers that the power was .26.

Power (continued)
When power = .26, a statistically significant outcome would occur in only approximately one-fourth of the attempts to conduct this experiment under these circumstances (i.e., with 5 participants in each of 4 conditions and an effect size of .40). Typically, before they begin their research, researchers identify the number of participants they would need for power = .80 (a statistically significant outcome would occur in 80% of the attempts of an experiment). In this example, the power table indicates we would need 18 participants in each of the 4 conditions, for a total of N = 72.

Comparisons of Two Means
When an independent variable with three or more levels is statistically significant, the next step is to identify which of the group means in the experiment are different. These analyses are called "comparisons of two means," and each comparison focuses on a particular difference between two means. For example, suppose that an experiment has two control groups and one treatment group, and that the F-test for this independent variable with three levels is statistically significant.

Comparisons of Two Means (continued)
One comparison, in this example, would be to determine whether the mean for the treatment group is significantly different from the average of the means for the two control groups. A t-test can be used to compare two means using the following formula:

$$ t = \frac{M_1 - M_2}{\sqrt{MS_{\text{error}}\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} $$

The MS error comes from the ANOVA summary table; n1 and n2 are the sample sizes associated with each mean in the test.

Comparisons of Two Means (continued)
The statistical significance of the t-test can be obtained by checking a t-test table (Table A.2 in the text), or by using a computer program in which the observed t value and df are entered and the exact probability of the result is obtained. One Web site to check is: http://math.uc.edu/~brycw/classes/148/tables.htm. Cohen's d can be calculated for the comparison using the following formula:

$$ d = \frac{2t}{\sqrt{df_{\text{error}}}} $$
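The effect size and comparison formulas above are straightforward to compute. The sketch below is not part of the original slides; it reuses the F, df, and MS error values from the summary table example above, while the two condition means (8.4 and 5.6) are hypothetical.

```python
# Minimal sketch of eta squared, Cohen's f, and a comparison of two means.
import math
from scipy import stats

# Eta squared and Cohen's f from a reported F-test, e.g., F(3, 16) = 7.80.
F, df_effect, df_error = 7.80, 3, 16
eta_sq = (F * df_effect) / (F * df_effect + df_error)
cohens_f = math.sqrt(eta_sq / (1 - eta_sq))
print(f"eta squared = {eta_sq:.2f}, Cohen's f = {cohens_f:.2f}")

# Comparison of two means using MS error from the ANOVA summary table.
m1, m2 = 8.4, 5.6          # hypothetical condition means
n1, n2 = 5, 5              # participants per condition
ms_error = 2.33            # error (within) mean square from the summary table
t = (m1 - m2) / math.sqrt(ms_error * (1 / n1 + 1 / n2))
p = 2 * stats.t.sf(abs(t), df_error)          # two-tailed probability
d = 2 * t / math.sqrt(df_error)               # Cohen's d for the comparison
print(f"t({df_error}) = {t:.2f}, p = {p:.3f}, d = {d:.2f}")
```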
Repeated Measures Analysis of Variance
The general procedures and logic for null hypothesis testing using repeated measures analysis of variance are similar to those used for independent groups analysis of variance. Before beginning the ANOVA for a complete repeated measures design, a summary score (e.g., a mean) must be computed for each participant in each condition. Descriptive data are calculated to summarize performance for each condition of the independent variable across all participants.

Repeated Measures ANOVA (continued)
The primary way that ANOVA differs for repeated measures is in the estimation of error variation, or residual variation. Residual variation is the variation that remains when systematic variation due to the independent variable and to participants is removed from the estimate of total variation.

Repeated Measures ANOVA (continued)
Variation due to having different participants in the conditions is eliminated in repeated measures designs because the same individuals participate in each condition. Because this source of variation is eliminated, repeated measures designs are more sensitive than independent groups designs; they are better able to detect the effect of an independent variable when that effect is present.

Two-Factor Analysis of Variance for Independent Groups Designs
Complex designs have two or more independent variables, each with two or more levels. The ANOVA indicates the statistical significance of the main effect of each independent variable and of the interaction effect(s) between variables. The analysis of a complex design differs depending on whether an interaction effect is statistically significant or not.

Analysis of a Complex Design with an Interaction Effect
If the omnibus (overall) ANOVA reveals a statistically significant interaction effect, the source of the interaction is identified using simple main effects analyses and comparisons of two means. A simple main effect is the effect of one independent variable at one level of a second independent variable. If an independent variable has three or more levels, comparisons of two means can be used to examine the source of a simple main effect by comparing means two at a time. After the simple main effects are analyzed, researchers examine the main effects of the independent variables.

Analysis of a Complex Design with an Interaction Effect (continued)
Confidence intervals may be drawn around group means to provide information regarding how precisely the population means have been estimated. The wider the intervals around the sample means, the less precise the estimate of the population means. A rule of thumb for interpreting confidence intervals is that if the intervals around two means do not overlap, then a difference between the population means is likely.

Analysis with No Interaction Effect
If an omnibus ANOVA indicates that the interaction effect between independent variables is not statistically significant, the next step is to determine whether the main effects of the independent variables are statistically significant. The source of a statistically significant main effect can be specified more precisely by comparing means two at a time and by constructing confidence intervals.
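Confidence intervals around group means can be constructed with the t distribution. The sketch below is not from the original slides; the condition labels and scores are hypothetical, and the interval uses the standard error of the mean (s / sqrt(N)) with a two-tailed critical t value.

```python
# Minimal sketch (hypothetical scores) of 95% confidence intervals around group means.
import numpy as np
from scipy import stats

conditions = {
    "Condition A": np.array([12, 15, 14, 10, 13, 16, 11, 14]),
    "Condition B": np.array([ 9, 11, 10,  8, 12, 10,  9, 11]),
}

for name, scores in conditions.items():
    mean = scores.mean()
    sem = stats.sem(scores)                          # standard error of the mean: s / sqrt(N)
    t_crit = stats.t.ppf(0.975, df=len(scores) - 1)  # critical t for a 95% interval
    lower, upper = mean - t_crit * sem, mean + t_crit * sem
    print(f"{name}: M = {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
# If the two intervals do not overlap, a difference between the population means
# is likely (the rule of thumb described above).
```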
Reporting Results of a Complex Design
The following should be included when describing the results of a complex design experiment:
Description of the independent variables and definition of the levels (conditions) of each;
Summary statistics for the cells of the design in text, a table, or a figure, including, when appropriate, confidence intervals for group means;
Report of F-tests for main effects and interaction effects with exact probabilities.

Reporting Results (continued)
Effect size measure for each effect;
Statement of power for nonsignificant effects;
Simple main effects analyses when an interaction effect is statistically significant and, if appropriate, comparisons of means two at a time;
Verbal description of a statistically significant interaction effect (when present), referring the reader to differences between cell means across levels of the independent variables.

Reporting Results (continued)
Comparisons of two means, when appropriate, to clarify the sources of systematic variation among means contributing to a main effect;
Conclusion that you wish the reader to make from the results of this analysis.