Statistical Conclusion Validity
Like it or not, most quantitative social science research is based on the testing of
statistical hypotheses. Most commonly the statistical hypothesis that is tested (commonly
called the “null hypothesis”) is one that asserts that there is no relationship
between two variables or sets of variables. The statistical conclusion that is drawn
following such testing can be viewed as dichotomous. The two possible conclusions
are:

A “statistically significant” result: The data are so discrepant with the null
hypothesis that we reject the null hypothesis. Since the null hypothesis asserts
that there is no relationship between the variables, this amounts to our asserting
that there is a relationship between the variables. Such a result is commonly
referred to as a statistically “significant” result.

A “nonsignificant” result: The data are not so discrepant with the null
hypothesis that we are comfortable with rejecting the null hypothesis. That is, we
remain uncertain regarding whether or not there is a relationship between the
variables. It can be argued that the correlation between continuous variables is
not likely ever to be exactly zero. Accordingly we may interpret a failure to reject
the null hypothesis to mean that the tested correlation is simply not large enough
for us to be sure that it is not zero. This is pretty much the same as saying that
we are not sure with respect to the sign of the correlation, positive or negative.
“Statistical conclusion validity” refers to the degree to which one’s analysis allows
one to make the correct decision regarding the truth or approximate truth of the null
hypothesis. Please note that statistical conclusion validity does not involve determining
whether or not a causal relationship exists between the variables of interest -- that is a
matter of internal validity. Statistical conclusion validity involves your decision regarding
whether or not variables are related to one another. After having demonstrated that
variables are related, then we can turn our attention to concerns regarding whether or
not the relationship is a cause-effect relationship.
Before one can appreciate a discussion on statistical conclusion validity, one
must understand the basic logic of hypothesis testing. Accordingly I shall now review
that topic.
The Basic Logic of Null Hypothesis Statistical Testing (NHST)
Here is a listing of the actions that are taken in NHST from start to finish:
•	State null and alternative hypotheses.
•	Decide on a criterion of statistical significance.
•	Decide on a test statistic.
•	Determine how much data will need to be collected to have a good chance of detecting any important deviation between the null hypothesis and the truth.
•	Collect the data.
•	Enter the data into a computer file.
•	Screen the data for data entry errors, unbelievable values, and violations of assumptions of the planned statistical analysis. Correct such problems, if possible.
•	Compute basic descriptive statistics necessary to describe the effect obtained.
•	Compute the value of the test statistic and from it obtain the p value.
•	Compare the p value to the criterion of significance and make the statistical conclusion.
•	Compute strength of effect estimates.
Now I shall expand on each of the above.
For parametric hypothesis testing one first states a null hypothesis (H0). The
H0 specifies that some parameter has a particular value or has a value in a specified
range of values. For nondirectional hypotheses, a single value is stated. For
example, ρxy = 0. ρ is the Greek letter ‘rho,’ which is commonly used to stand for the
value of a Pearson correlation coefficient in a population. Accordingly, this null
hypothesis states that the value of the correlation between variables X and Y is zero in
the population from which our data were randomly sampled. For directional
hypotheses a value of less than or equal to (or greater than or equal to) some specified
value is hypothesized. For example, ρxy ≤ 0.
The alternative hypothesis (H1) is the antithetical complement of the H0. If the
H0 is ρxy = 0, the H1 is ρxy ≠ 0. If H0 is ρxy ≤ 0, H1 is ρxy > 0. The H0 and the H1 are
mutually exclusive and exhaustive: One, but not both, must be true.
Very often the behavioral scientist wants to reject the H0 and assert the H1.
For example, e may think that misanthropy is positively correlated with support for
animal rights, so e sets up the H0 that ρxy ≤ 0, hoping to show that the H0 is not
reasonable, thus asserting the H1 that ρxy > 0. Sometimes, however, one wishes not to
reject a H0. For example, I may have a mathematical model that predicts that the
average amount of rainfall on an April day in Soggy City is 9.5 mm, and if my data lead
me to reject that H0, then I have shown my model to be inadequate and in need of
revision.
The H is tested by gathering data that are relevant to the hypothesis and
determining how well the data fit the H. If the fit is poor, we reject the H and assert
the H1. We measure how well the data fit the H with an exact significance level, p,
which is the probability of obtaining a sample as or more discrepant with the H
than is that which we did obtain, assuming that the H is true. The higher this p,
the better the fit between the data and the H. If this p is low we have cast doubt upon
the H. If p is very low, we reject the H. How low is very low? Very low is usually .05
-- the criterion used to reject the H is p  .05 for behavioral scientists, by convention,
but I opine that an individual may set e’s own criterion by considering the implications
3
thereof, such as the likelihood of falsely rejecting a true H (an error which will be more
likely if one uses a higher criterion, such as p  .10).
After stating null and alternative hypotheses, one needs to decide what criterion
of statistical significance to employ (more on this later). Let us suppose that I have
decided on the .05 level.
One also needs to find a test statistic which can be used to obtain the p value.
Suppose that I wish to test the null hypothesis that ρxy = 0, where X and Y are
misanthropy and support for animal rights. One appropriate test statistic would be
Student’s t, which can be computed by hand as t = r√(n - 2) / √(1 - r²), on n - 2 df.
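To make this concrete, here is a minimal Python sketch (my own illustration, not part of the original handout) that computes this t and its two-tailed p value, using the r and N that are obtained in the example below:

    # Compute Student's t for testing H0: rho = 0, and its two-tailed p value.
    from math import sqrt
    from scipy import stats

    r, n = 0.22064, 154                   # sample r and N from the example below
    t = r * sqrt(n - 2) / sqrt(1 - r**2)  # t = r*sqrt(n-2)/sqrt(1-r^2)
    p = 2 * stats.t.sf(abs(t), df=n - 2)  # two-tailed p on n-2 df
    print(f"t({n - 2}) = {t:.3f}, p = {p:.4f}")  # t(152) = 2.789, p = 0.0060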
After deciding on a test statistic, one needs to determine how much data will
be needed. More on this later. For now, let us suppose that I decide I want to have
enough data to have an 80% chance of detecting a correlation of magnitude ρxy = .25
using a .05 criterion of significance. I determine that I need N = 126.
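Such a determination is typically made with power-analysis software. Here is a minimal Python sketch of the common Fisher-z approximation (my own illustration; exact routines, such as SAS PROC POWER, can return a slightly different N, such as the 126 used here):

    # Approximate N needed to detect rho with the desired power (Fisher z).
    from math import atanh, ceil
    from scipy.stats import norm

    rho, alpha, power = 0.25, 0.05, 0.80
    z_alpha = norm.ppf(1 - alpha / 2)   # two-tailed criterion of significance
    z_beta = norm.ppf(power)
    n = ceil(((z_alpha + z_beta) / atanh(rho)) ** 2 + 3)
    print(n)                            # 124 by this approximation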
Next the data are collected, entered into a computer file, and screened.
More on this later.
Suppose that I managed to gather more than the desired 126 units of data and
that after screening I am left with good data from 154 persons. Using my favorite
statistical software (SAS), I obtain the following statistical output:
    Variable         N     Mean      Std Dev
    misanthropy      154   2.32078   0.67346
    animal_rights    154   2.37969   0.53501

    Pearson Correlation Coefficients, N = 154
    Prob > |r| under H0: Rho=0

                   animal_rights
    misanthropy    0.22064
                   0.0060
Notice that the test statistic (t or F) is not given. This is commonly the case when
the parameter being tested is a Pearson correlation coefficient.
Since the obtained p value (.006) does not exceed the criterion of statistical
significance (.05), I reject the null hypothesis and conclude that misanthropy is
correlated (positively) with support for animal rights.
Pearson r can be considered a strength of effect estimate, but some people
prefer to report r² instead, so I might elect to report that r² = .22064² ≈ .049.
Choosing the Criterion for Statistical Significance
There are two ways that your statistical conclusion can be right and two ways
that it can be wrong, as outlined in the table below:
                          True Hypothesis
    Decision            H0                    H1
    Reject H0           Type I error (α)      correct decision
    “Accept” H0         correct decision      Type II error (β)
Consider first the H0 column of the table. This column is appropriate if the
null hypothesis is absolutely true. Of course, you do not know if the null hypothesis
is true or not -- if you did, you would not need to bother with any inferential statistics.
One could argue that when dealing with continuous variables the null hypothesis is
never or almost never exactly true, but it could be close to true, and I recommend that
we treat a nearly true null hypothesis the same as a true null hypothesis. So, we are
imagining that the null hypothesis is true or nearly so. For our example, that means that
there is no relationship between misanthropy and support for animal rights. If my
statistical analysis leads me to reject that true null hypothesis, then I have made a
Type I error. If I have used the .05 criterion of statistical significance, then I will make
that type of error 5% of the times that I test absolutely true null hypotheses. I refer to
this error rate as the a priori conditional probability of a Type I error, and I use the
symbol alpha (α) to represent this error rate. It is a priori because it is computed before
I have gathered the data. It is conditional because it is calculated based on the
condition that the null hypothesis is actually absolutely true. If the null hypothesis is not
true, then one cannot make a Type I error.
I got a nonzero r from the data in my sample, r = .22. If the null hypothesis is
really true (that is, in the population from which my data were randomly sampled, ρ =
0), what are the chances that I would get an absolute value of r as large as .22, due just
to chance (sampling error)? The answer to that question is p, which we have already
computed to be .006.
So, what is the conditional probability that one will make a correct decision in the
case when the null hypothesis is actually true? That is simple -- given that the null is
true, you must have either made a mistake and rejected it (a Type I error) or made a
correct decision and not rejected it. These two conditional probabilities must sum to 1,
so the conditional probability of a correct decision is (1 - α), or, for our example, 95%.
This probability is sometimes referred to as the confidence coefficient (although that
term is more often used in the context of constructing confidence intervals).
Now consider the rightmost column in the table. This column is appropriate if the
null hypothesis is false -- that is, the alternative hypothesis is true. If you reject the
false null hypothesis, you have made a correct decision. The conditional probability of
making the correct decision is called power. If you fail to reject the false null
hypothesis, you have made a Type II error. The conditional probability of a Type II
error is symbolized by the Greek letter beta (β). The lower one sets the criterion for α, the
larger β will be, ceteris paribus, so one should not just set α very low and think e has no
chance of making any errors. Again, you must either make the correct decision or not,
so power and beta must sum to one. Power = 1 - β.
Common practice in psychology is simply to set the criterion for statistical
significance at .05, just because everybody else does so. IMHO, better practice would
be for the researcher to set the criterion after considering the relative seriousness of
Type I and Type II errors.
Imagine that you are testing an experimental drug that is supposed to reduce
blood pressure, but is suspected of inducing cancer. You administer the drug to 10,000
rodents. Since you know that the tumor rate in these rodents is normally 10%, your H0
is that the tumor rate in drug-treated rodents is 10% or less. That is, the H0 is that the
drug is safe, that it does not increase the cancer rate. The H1 is that the drug does induce
cancer, that the tumor rate in treated rodents is greater than 10%. [Note that the H0
always includes an “=,” but the H1 never does.] A Type II error, failing to reject the H0 of
safety when the drug really does cause cancer, seems more serious than a Type I error
here (assuming that there are other safe treatments for hypertensive folks so we don’t
need to weigh risk of cancer versus risk of hypertension), so we would not want to set
α so low that β was unacceptably large. If that H0 (drug is safe) is false, we want to be
sure we reject it. That is, we want to have a powerful test, one with a high probability of
detecting false null hypotheses.
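A hypothesis like this one maps onto a one-tailed exact binomial test. Here is a minimal Python sketch (my own illustration; the observed tumor count is made up) of testing H0: rate ≤ .10 against H1: rate > .10:

    # Exact one-tailed binomial test of H0: tumor rate <= .10.
    from scipy import stats

    n, tumors = 10_000, 1_065          # hypothetical observed tumor count
    result = stats.binomtest(tumors, n, p=0.10, alternative="greater")
    print(f"p = {result.pvalue:.4f}")  # a small p rejects the H0 of safety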
Now suppose we are testing the drug’s effect on blood pressure. The H0 is that
the mean decrease in blood pressure after giving the drug (pre-treatment BP minus
post-treatment BP) is less than or equal to zero (the drug does not reduce BP). The H1
is that the mean decrease is greater than zero (the drug does reduce BP). Now a
Type I error (claiming the drug reduces BP when it actually does not) is clearly more
dangerous than a Type II error (not finding the drug effective when indeed it is), again
assuming that there are other effective treatments and ignoring things like your boss’
threat to fire you if you don’t produce results that support e’s desire to market the drug.
You would want to set the criterion for α relatively low here.
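That hypothesis corresponds to a one-tailed, one-sample t test on the pre-minus-post differences. A minimal Python sketch (my own, with made-up numbers):

    # One-tailed test of H0: mean BP decrease <= 0, on pre-minus-post scores.
    import numpy as np
    from scipy import stats

    decrease = np.array([4.1, -1.2, 6.3, 2.0, 5.5, 0.7, 3.9, 1.8])  # hypothetical
    t, p = stats.ttest_1samp(decrease, popmean=0, alternative="greater")
    print(f"t = {t:.2f}, one-tailed p = {p:.4f}")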
What Affects Statistical Conclusion Validity?
Anything that increases the probability that you will make a Type I error or a Type
II error will reduce your statistical conclusion validity. The conditional probability of a
Type I error is under your direct control when you set the criterion for statistical
significance. Anything which decreases your power (increases the probability of a
Type II error) will also reduce your statistical conclusion validity. Power is affected by
several factors. Power is greater with
•	a higher criterion for statistical significance
•	larger sample size (n)
•	smaller population variance in the criterion variable (achieved by controlling extraneous variables, using homogeneous groups of subjects, and using reliable methods of measuring and manipulating variables)
•	greater difference between the actual value of the tested parameter and the value specified by the null hypothesis
•	directional hypotheses (if the predicted direction, specified in the alternative hypothesis, is correct)
•	some types of tests (t test) than others (sign test)
•	some research designs (matched subjects) under some conditions (matching variable correlated with DV)
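The sample-size effect, for instance, is easy to demonstrate by simulation. Here is a minimal Python sketch (my own illustration) that estimates power for the test of H0: ρ = 0 at several values of n, assuming a true correlation of .25 and the .05 two-tailed criterion:

    # Monte Carlo estimate of power for the test of rho = 0 as n grows.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(1)
    rho, alpha, reps = 0.25, 0.05, 2000
    cov = [[1.0, rho], [rho, 1.0]]      # population correlation matrix

    for n in (30, 60, 126):
        rejections = 0
        for _ in range(reps):
            x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
            r, p = stats.pearsonr(x, y)
            rejections += p < alpha
        print(f"n = {n:3d}: estimated power = {rejections / reps:.2f}")
    # power climbs toward the 80% target as n approaches 126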
Determining How Much Data Is Required
Read the document Estimating the Sample Size Necessary to Have Enough
Power.
Entering the Data into a Computer File
I prefer to put my data into a plain text file, using a word processor such as
Microsoft Word. There are many other alternatives, such as using a spreadsheet
(Excel), a database, or the data entry routine provided by a statistical package (SPSS).
It is generally a good idea to put the data for each subject on a separate line. If
you have a lot of data for each subject, then you might need more than one line for each
subject.
If you have only a few scores for each subject, it is most convenient to use
list input. With list input you place a delimiter between each score and the next score.
The delimiter is a character that tells the statistical program that it has just finished
reading the score for one variable and now the score for the next variable is coming up.
I prefer to use a blank space as a delimiter, but some others prefer commas, tabs,
semicolons, or other special characters. Look at the following data in list input, where
on each line the first score is an ID number, the second score a code for a categorical
variable, the third score a code for a second categorical variable, and the fourth score
from a continuous variable:
001 0 0 3.4
002 0 0 4.8
003 0 1 4.5
004 0 1 5.9
005 1 0 6.3
006 1 0 5.9
007 1 1 10.7
008 1 1 9.2
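As an aside, here is a minimal Python sketch (my own; the file name is hypothetical) of reading such list-input data, treating any run of blank spaces as the delimiter:

    # Read whitespace-delimited (list input) data into a data frame.
    import pandas as pd

    names = ["id", "a", "b", "score"]
    df = pd.read_csv("list_input.dat", sep=r"\s+", names=names,
                     dtype={"id": str})  # keep leading zeros in the ID
    print(df)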
When you have a lot of data for each subject, you may prefer to use
column input. In column input the data for each variable always occur in a fixed
range of columns and delimiters are unnecessary. Look at the data below. These are
the same data as above, but arranged with ID in columns 1-3, categorical variable A in
column 4, categorical variable B in column 5, and the continuous variable in columns 6-9.
00100 3.4
00200 4.8
00301 4.5
00401 5.9
00510 6.3
00610 5.9
0071110.7
00811 9.2
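Fixed-column data like these can be read by telling the software where each field begins and ends. Here is a minimal Python sketch (my own; the file name is hypothetical, and the column positions are given as 0-based start/stop pairs):

    # Read fixed-width (column input) data: ID in cols 1-3, A in col 4,
    # B in col 5, and the continuous score in cols 6-9.
    import pandas as pd

    colspecs = [(0, 3), (3, 4), (4, 5), (5, 9)]
    names = ["id", "a", "b", "score"]
    df = pd.read_fwf("column_input.dat", colspecs=colspecs, names=names,
                     dtype={"id": str})  # keep leading zeros in the ID
    print(df)  # subject 007 parses as a=1, b=1, score=10.7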
It is very important that your data file contain an ID number for each subject, and
it should be the same ID number that is recorded on the medium in which the data were
originally collected (such as the survey form on which the subjects recorded their
responses). If during the screening of the data you find scores that are hard to believe,
you can then check to see what the ID number is for each subject with apparently bad
data and then go back to the original data recording medium to double check on those
scores.
Double entry of the data can help you find typos made during data entry. With
this procedure you have two different data-entry workers enter the same data into
computer files. You then use a special program to compare the two data files. The
special program will tell you where the two files differ with respect to the values of the
entered data, and then you can go back to the original data recording medium to make
corrections.
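Such a comparison program is simple to write. Here is a minimal Python sketch (my own illustration) that reports every line on which two data files disagree:

    # Compare two independently entered data files, line by line
    # (assumes the two files have the same number of lines).
    def compare_files(path_a, path_b):
        with open(path_a) as fa, open(path_b) as fb:
            for line_num, (a, b) in enumerate(zip(fa, fb), start=1):
                if a.rstrip("\n") != b.rstrip("\n"):
                    print(f"Line {line_num} differs:")
                    print(f"  {path_a}: {a.rstrip()}")
                    print(f"  {path_b}: {b.rstrip()}")

    compare_files("Screen2210.txt", "Screen2210B.txt")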
If you use plain text files, Microsoft Word can help you do the double entry data
checking. Give it a try:
Download Screen2210.txt and Screen2210B.txt, saving each to your hard drive.
Then open Screen2210.txt in Word. After it has opened, click Tools, Track Changes,
Compare Documents. In the “Select File to Compare With Current Document” window,
select Screen2210B.txt. Word will show you lines for which the two files differ. For
example, look at the display for subject number 13:
0132002898544    0132002898545
On the left is the data line from the open document, on the right is the data line from the
comparison document. The first three columns are the ID number. If you look carefully,
you will see that the two lines differ with respect to the very last number. In the open file
that is a 4, in the comparison file it is a 5. Somebody made a mistake. It is time to go
back to the original data recording medium for subject number 013 and see what the
correct score is. There are two more lines for which these two files differ. See if you
can identify the lines and how they differ.
Using SPSS
Since you will be conducting a statistical analysis as part of your writing
assignment this semester, it is important that you learn how to use SPSS if you have
not done so already. Please go to my SPSS Lessons Page and read the following
documents. You should also spend some time in the lab practicing the skills taught in
these lessons.
•	An Introduction to SPSS for Windows -- booting SPSS, entering data, basic descriptive statistics including schematic plots, saving data and output.
•	Using SPSS to Explore the INTROQ Data File -- importing data from a plain text file, assigning value labels, recoding variables, contingency tables and Pearson Chi-Square, bivariate linear correlation.
•	Using SPSS to Screen Data -- finding out-of-range data, outliers, and violations of normality assumptions, and transforming to reduce skewness.
Review of Basic Descriptive Statistics
You won’t get much value out of using SPSS if you have forgotten what you
learned in your statistics class. Accordingly, I recommend that you review by reading
the documents under the headings “Introduction and Descriptive Statistics” and “Basics
of Parametric Inference” on my Statistics Lessons Page.
Copyright 2003, Karl L. Wuensch - All rights reserved.

Fair Use of this Document