Statistical Conclusion Validity

Like it or not, most quantitative social science research is based on the testing of statistical hypotheses. Most commonly the statistical hypothesis that is tested (commonly called the "null hypothesis") is one that asserts that there is no relationship between two variables or sets of variables. The statistical conclusion that is drawn following such testing can be viewed as dichotomous. The two possible conclusions are:

- A "statistically significant" result: The data are so discrepant with the null hypothesis that we reject the null hypothesis. Since the null hypothesis asserts that there is no relationship between the variables, this amounts to our asserting that there is a relationship between the variables. Such a result is commonly referred to as a statistically "significant" result.

- A "nonsignificant" result: The data are not so discrepant with the null hypothesis that we are comfortable with rejecting the null hypothesis. That is, we remain uncertain regarding whether or not there is a relationship between the variables.

It can be argued that the correlation between continuous variables is not likely ever to be exactly zero. Accordingly, we may interpret a failure to reject the null hypothesis to mean that the tested correlation is simply not large enough for us to be sure that it is not zero. This is pretty much the same as saying that we are not sure of the sign of the correlation, positive or negative.

"Statistical conclusion validity" refers to the degree to which one's analysis allows one to make the correct decision regarding the truth or approximate truth of the null hypothesis. Please note that statistical conclusion validity does not involve determining whether or not a causal relationship exists between the variables of interest -- that is a matter of internal validity. Statistical conclusion validity involves your decision regarding whether or not variables are related to one another. After having demonstrated that variables are related, we can turn our attention to concerns regarding whether or not the relationship is a cause-effect relationship.

Before one can appreciate a discussion of statistical conclusion validity, one must understand the basic logic of hypothesis testing. Accordingly, I shall now review that topic.

The Basic Logic of Null Hypothesis Statistical Testing (NHST)

Here is a listing of the actions that are taken in NHST, from start to finish:

1. State null and alternative hypotheses.
2. Decide on a criterion of statistical significance.
3. Decide on a test statistic.
4. Determine how much data will need to be collected to have a good chance of detecting any important deviation between the null hypothesis and the truth.
5. Collect the data.
6. Enter the data into a computer file.
7. Screen the data for data entry errors, unbelievable values, and violations of assumptions of the planned statistical analysis. Correct such problems, if possible.
8. Compute basic descriptive statistics necessary to describe the effect obtained.
9. Compute the value of the test statistic and from it obtain the p value.
10. Compare the p value to the criterion of significance and make the statistical conclusion.
11. Compute strength of effect estimates.

Now I shall expand on each of the above.

For parametric hypothesis testing one first states a null hypothesis (H0). The H0 specifies that some parameter has a particular value or has a value in a specified range of values.
For nondirectional hypotheses, a single value is stated. For example, ρXY = 0. ρ is the Greek letter "rho," which is commonly used to stand for the value of a Pearson correlation coefficient in a population. Accordingly, this null hypothesis states that the value of the correlation between variables X and Y is zero in the population from which our data were randomly sampled. For directional hypotheses, a value less than or equal to (or greater than or equal to) some specified value is hypothesized. For example, ρXY ≤ 0.

The alternative hypothesis (H1) is the antithetical complement of the H0. If the H0 is ρXY = 0, the H1 is ρXY ≠ 0. If the H0 is ρXY ≤ 0, the H1 is ρXY > 0. The H0 and the H1 are mutually exclusive and exhaustive: One, but not both, must be true. Very often the behavioral scientist wants to reject the H0 and assert the H1. For example, e may think that misanthropy is positively correlated with support for animal rights, so e sets up the H0 that ρXY ≤ 0, hoping to show that the H0 is not reasonable, thus asserting the H1 that ρXY > 0. Sometimes, however, one wishes not to reject a H0. For example, I may have a mathematical model that predicts that the average amount of rainfall on an April day in Soggy City is 9.5 mm, and if my data lead me to reject that H0, then I have shown my model to be inadequate and in need of revision.

The H0 is tested by gathering data that are relevant to the hypothesis and determining how well the data fit the H0. If the fit is poor, we reject the H0 and assert the H1. We measure how well the data fit the H0 with an exact significance level, p, which is the probability of obtaining a sample as or more discrepant with the H0 than that which we did obtain, assuming that the H0 is true. The higher this p, the better the fit between the data and the H0. If this p is low, we have cast doubt upon the H0. If p is very low, we reject the H0. How low is very low? Very low is usually .05 -- the criterion used to reject the H0 is p ≤ .05 for behavioral scientists, by convention, but I opine that an individual may set e's own criterion by considering the implications thereof, such as the likelihood of falsely rejecting a true H0 (an error which will be more likely if one uses a higher criterion, such as p ≤ .10).

After stating null and alternative hypotheses, one needs to decide what criterion of statistical significance to employ (more on this later). Let us suppose that I have decided on the .05 level. One also needs to find a test statistic which can be used to obtain the p value. Suppose that I wish to test the null hypothesis that ρXY = 0, where X and Y are misanthropy and support for animal rights. One appropriate test statistic would be Student's t, which can be computed by hand as

    t = r√(n - 2) / √(1 - r²),  on n - 2 df.

After deciding on a test statistic, one needs to determine how much data will be needed. More on this later. For now, let us suppose that I decide I want to have enough data to have an 80% chance of detecting a correlation of magnitude ρXY = .25 using a .05 criterion of significance. I determine that I need N = 126.
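How might one arrive at such an N? Here is a minimal Python sketch (my illustration, not part of this handout) using the Fisher z approximation; the function name and defaults are my own choices, and the approximation lands near the N = 126 obtained from exact power computations.

    # Rough sample size needed to detect a population correlation rho
    # with the desired power at a two-tailed alpha, via Fisher's z.
    # (Illustrative sketch only; exact power routines differ slightly.)
    import math
    from scipy.stats import norm

    def approx_n(rho, alpha=.05, power=.80):
        z_alpha = norm.ppf(1 - alpha / 2)            # critical z, two-tailed
        z_beta = norm.ppf(power)                     # z for the desired power
        fz = 0.5 * math.log((1 + rho) / (1 - rho))   # Fisher z transform of rho
        return math.ceil(((z_alpha + z_beta) / fz) ** 2 + 3)

    print(approx_n(.25))  # 124 -- close to the N = 126 from exact computation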
Next the data are collected, entered into a computer file, and screened. More on this later. Suppose that I managed to gather more than the desired 126 units of data and that after screening I am left with good data from 154 persons. Using my favorite statistical software (SAS), I obtain the following statistical output:

    Variable         N     Mean      Std Dev
    misanthropy      154   2.32078   0.67346
    animal_rights    154   2.37969   0.53501

    Pearson Correlation Coefficients, N = 154
    Prob > |r| under H0: Rho=0

                     animal_rights
    misanthropy      0.22064
                     0.0060

Notice that the test statistic (t or F) is not given. This is commonly the case when the parameter being tested is a Pearson correlation coefficient. Since the obtained p value (.006) does not exceed the criterion of statistical significance (.05), I reject the null hypothesis and conclude that misanthropy is correlated (positively) with support for animal rights. Pearson r can be considered a strength of effect estimate, but some people prefer to report r2 instead, so I might elect to report that r2 = .049.
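Although SAS does not print the t here, it is easy to recover it from the output above using the hand formula given earlier. A minimal Python sketch (my own check, not part of the handout):

    # Recompute the test statistic and p value from the r and N reported
    # above, using t = r*sqrt(n - 2)/sqrt(1 - r^2) on n - 2 df.
    import math
    from scipy.stats import t as t_dist

    r, n = 0.22064, 154
    t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
    p = 2 * t_dist.sf(abs(t), df=n - 2)   # two-tailed p value

    print(round(t, 3), round(p, 4))  # t = 2.789, p = .006, matching SAS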
Choosing the Criterion for Statistical Significance

There are two ways that your statistical conclusion can be right and two ways that it can be wrong, as outlined in the table below:

                         True Hypothesis
    Decision             H0                    H1
    -------------------------------------------------------
    Reject H0            Type I error (α)      correct decision
    "Accept" H0          correct decision      Type II error (β)

Consider first the H0 column of the table. This column is appropriate if the null hypothesis is absolutely true. Of course, you do not know whether the null hypothesis is true or not -- if you did, you would not need to bother with any inferential statistics. One could argue that when dealing with continuous variables the null hypothesis is never or almost never exactly true, but it could be close to true, and I recommend that we treat a nearly true null hypothesis the same as a true null hypothesis. So, we are imagining that the null hypothesis is true or nearly so. For our example, that means that there is no relationship between misanthropy and support for animal rights. If my statistical analysis leads me to reject that true null hypothesis, then I have made a Type I error. If I have used the .05 criterion of statistical significance, then I will make that type of error 5% of the times that I test absolutely true null hypotheses. I refer to this error rate as the a priori conditional probability of a Type I error, and I use the symbol alpha (α) to represent this error rate. It is a priori because it is computed before I have gathered the data. It is conditional because it is calculated based on the condition that the null hypothesis is actually absolutely true. If the null hypothesis is not true, then one cannot make a Type I error.

I got a nonzero r from the data in my sample, r = .22. If the null hypothesis is really true (that is, in the population from which my data were randomly sampled, ρ = 0), what are the chances that I would get an absolute value of r as large as .22, due just to chance (sampling error)? The answer to that question is p, which we have already computed to be .006.

So, what is the conditional probability that one will make a correct decision in the case when the null hypothesis is actually true? That is simple -- given that the null is true, you must have either made a mistake and rejected it (a Type I error) or made a correct decision and not rejected it. These two conditional probabilities must sum to 1, so the conditional probability of a correct decision is (1 - α), or, for our example, 95%. This probability is sometimes referred to as the confidence coefficient (although that term is more often used in the context of constructing confidence intervals).

Now consider the rightmost column in the table. This column is appropriate if the null hypothesis is false -- that is, the alternative hypothesis is true. If you reject the false null hypothesis, you have made a correct decision. The conditional probability of making the correct decision is called power. If you fail to reject the false null hypothesis, you have made a Type II error. The conditional probability of a Type II error is symbolized by the Greek letter beta (β). The lower one sets the criterion for α, the larger β will be, ceteris paribus, so one should not just set α very low and think e has no chance of making any errors. Again, you must either make the correct decision or not, so power and beta must sum to one: Power = 1 - β.

Common practice in psychology is simply to set the criterion for statistical significance at .05, just because everybody else does so. IMHO, better practice would be for the researcher to set the criterion after considering the relative seriousness of Type I and Type II errors. Imagine that you are testing an experimental drug that is supposed to reduce blood pressure but is suspected of inducing cancer. You administer the drug to 10,000 rodents. Since you know that the tumor rate in these rodents is normally 10%, your H0 is that the tumor rate in drug-treated rodents is 10% or less. That is, the H0 is that the drug is safe, that it does not increase the cancer rate. The H1 is that the drug does induce cancer, that the tumor rate in treated rodents is greater than 10%. [Note that the H0 always includes an "=," but the H1 never does.] A Type II error, failing to reject the H0 of safety when the drug really does cause cancer, seems more serious than a Type I error here (assuming that there are other safe treatments for hypertensive folks, so we don't need to weigh risk of cancer against risk of hypertension), so we would not want to set α so low that β was unacceptably large. If that H0 (the drug is safe) is false, we want to be sure we reject it. That is, we want to have a powerful test, one with a high probability of detecting false null hypotheses.

Now suppose we are testing the drug's effect on blood pressure. The H0 is that the mean decrease in blood pressure after giving the drug (pre-treatment BP minus post-treatment BP) is less than or equal to zero (the drug does not reduce BP). The H1 is that the mean decrease is greater than zero (the drug does reduce BP). Now a Type I error (claiming the drug reduces BP when it actually does not) is clearly more dangerous than a Type II error (not finding the drug effective when indeed it is), again assuming that there are other effective treatments and ignoring things like your boss' threat to fire you if you don't produce results that support e's desire to market the drug. You would want to set the criterion for α relatively low here.

What Affects Statistical Conclusion Validity?

Anything that increases the probability that you will make a Type I error or a Type II error will reduce your statistical conclusion validity. The conditional probability of a Type I error is under your direct control when you set the criterion for statistical significance. Anything which decreases your power (increases the probability of a Type II error) will also reduce your statistical conclusion validity. Power is affected by several factors; a simulation illustrating two of them follows the list below. Power is greater with:

- a higher criterion for statistical significance
- a larger sample size (n)
- smaller population variance in the criterion variable (achieved by controlling extraneous variables, using homogeneous groups of subjects, and using reliable methods of measuring and manipulating variables)
- a greater difference between the actual value of the tested parameter and the value specified by the null hypothesis
- directional hypotheses (if the predicted direction, specified in the alternative hypothesis, is correct)
- some types of tests (the t test) rather than others (the sign test)
- some research designs (matched subjects) under some conditions (the matching variable correlated with the DV)
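To make α and power concrete, here is a small Monte Carlo sketch in Python (my own illustration, not from the handout; the function name, seed, and settings are my choices). With a true null (ρ = 0) the test should reject about 5% of the time; with ρ = .25 and N = 126 it should reject about 80% of the time, as planned earlier.

    # Monte Carlo illustration of the Type I error rate and of power
    # for the test of a Pearson correlation. (Illustrative sketch only.)
    import numpy as np
    from scipy.stats import pearsonr

    rng = np.random.default_rng(1)

    def rejection_rate(rho, n, alpha=.05, reps=10_000):
        hits = 0
        for _ in range(reps):
            # draw n (X, Y) pairs with population correlation rho
            x = rng.standard_normal(n)
            y = rho * x + (1 - rho ** 2) ** 0.5 * rng.standard_normal(n)
            if pearsonr(x, y)[1] <= alpha:   # p value at or below criterion
                hits += 1
        return hits / reps

    print(rejection_rate(rho=0.00, n=126))  # about .05: alpha, the Type I rate
    print(rejection_rate(rho=0.25, n=126))  # about .80: power against rho = .25

Rerunning the sketch with a larger n, or with a true ρ farther from the hypothesized zero, raises the second rate, illustrating two of the factors listed above.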
Determining How Much Data Is Required

Read the document Estimating the Sample Size Necessary to Have Enough Power.

Entering the Data into a Computer File

I prefer to put my data into a plain text file, using a word processor such as Microsoft Word. There are many other alternatives, such as using a spreadsheet (Excel), a database, or the data entry routine provided by a statistical package (SPSS). It is generally a good idea to put the data for each subject on a separate line. If you have a lot of data for each subject, then you might need more than one line for each subject.

If you have only a few scores for each subject, it is most convenient to use list input. With list input you place a delimiter between each score and the next. The delimiter is a character that tells the statistical program that it has just finished reading the score for one variable and the score for the next variable is coming up. I prefer to use a blank space as a delimiter, but others prefer commas, tabs, semicolons, or other special characters. Look at the following data in list input, where on each line the first score is an ID number, the second score is a code for a categorical variable, the third score is a code for a second categorical variable, and the fourth score is from a continuous variable:

    001 0 0 3.4
    002 0 0 4.8
    003 0 1 4.5
    004 0 1 5.9
    005 1 0 6.3
    006 1 0 5.9
    007 1 1 10.7
    008 1 1 9.2

When you have a lot of data for each subject, you may prefer to use column input. In column input the data for each variable always occur in a fixed range of columns, and delimiters are unnecessary. Look at the data below. These are the same data as above, but arranged with ID in columns 1-3, categorical variable A in column 4, categorical variable B in column 5, and the continuous variable in columns 6-9.

    00100 3.4
    00200 4.8
    00301 4.5
    00401 5.9
    00510 6.3
    00610 5.9
    0071110.7
    00811 9.2

It is very important that your data file contain an ID number for each subject, and it should be the same ID number that is recorded on the medium on which the data were originally collected (such as the survey form on which the subjects recorded their responses). If during the screening of the data you find scores that are hard to believe, you can check the ID number for each subject with apparently bad data and then go back to the original data recording medium to double-check those scores.

Double entry of the data can help you find typos made during data entry. With this procedure you have two different data-entry workers enter the same data into computer files. You then use a special program to compare the two data files. The special program will tell you where the two files differ with respect to the values of the entered data, and then you can go back to the original data recording medium to make corrections.
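Such a compare program can be very short. A minimal Python sketch (my own; the file names here are hypothetical stand-ins for your two entered files):

    # Minimal double-entry checker: report any lines on which two versions
    # of the same data file disagree. File names are hypothetical examples.
    from itertools import zip_longest

    with open("entry1.txt") as f1, open("entry2.txt") as f2:
        pairs = zip_longest(f1, f2, fillvalue="")   # handles unequal lengths
        for num, (a, b) in enumerate(pairs, start=1):
            if a.rstrip("\n") != b.rstrip("\n"):
                print(f"line {num}: {a.rstrip()!r} vs {b.rstrip()!r}")

Any line flagged sends you back to the original data recording medium, just as described above.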
If you use plain text files, Microsoft Word can help you do the double-entry data checking. Give it a try: Download Screen2210.txt and Screen2210B.txt, saving each to your hard drive. Then open Screen2210.txt in Word. After it has opened, click Tools, Track Changes, Compare Documents. In the "Select File to Compare With Current Document" window, select Screen2210B.txt. Word will show you lines for which the two files differ. For example, look at the display for subject number 013:

    0132002898544    0132002898545

On the left is the data line from the open document; on the right is the data line from the comparison document. The first three columns are the ID number. If you look carefully, you will see that the two lines differ with respect to the very last number. In the open file it is a 4; in the comparison file it is a 5. Somebody made a mistake. It is time to go back to the original data recording medium for subject number 013 and see what the correct score is. There are two more lines for which these two files differ. See if you can identify the lines and how they differ.

Using SPSS

Since you will be conducting a statistical analysis as part of your writing assignment this semester, it is important that you learn how to use SPSS if you have not done so already. Please go to my SPSS Lessons Page and read the following documents. You should also spend some time in the lab practicing the skills taught in these lessons.

- An Introduction to SPSS for Windows -- booting SPSS, entering data, basic descriptive statistics including schematic plots, saving data and output.
- Using SPSS to Explore the INTROQ Data File -- importing data from a plain text file, assigning value labels, recoding variables, contingency tables and Pearson Chi-Square, bivariate linear correlation.
- Using SPSS to Screen Data -- finding out-of-range data, outliers, and violations of normality assumptions, and transforming to reduce skewness.

Review of Basic Descriptive Statistics

You won't get much value out of using SPSS if you have forgotten what you learned in your statistics class. Accordingly, I recommend that you review by reading the documents under the headings "Introduction and Descriptive Statistics" and "Basics of Parametric Inference" on my Statistics Lessons Page.

Copyright 2003, Karl L. Wuensch - All rights reserved.

Fair Use of this Document