Exploring Inferential Statistics and Their Discontents
Stephen W. Watts
Northcentral University
Exploring Inferential Statistics and Their Discontents
Jackson (2012) Chapter Exercises
#2a. This is a one-tailed test. We are interested in whether the product prevents cavities.
#2b. H0: μ new TP ≥ μ oth; Ha: μ new TP < μ oth
#2c. Zobt = (M – μ) / σM; Zobt = (1.5 – 1.73) / (1.12 / √60) = -0.23 / 0.145 = -1.59
#2d. Zcv = -1.645
#2e. H0 should not be rejected. The difference in cavities between the new toothpaste and
other brands is not large enough to support the claim, since Zobt does not exceed Zcv in magnitude.
#2f. CI = M ± Z * SEM; M = 1.5, Z = ± 1.96, SEM = 1.12 / √60 = 1.12 / 7.746 = 0.145
1.5 ± 1.96 * 0.145 = 1.5 ± 0.284; CI = 1.216 to 1.784 cavities.
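As a check on the arithmetic in #2c and #2f, the z statistic and confidence interval can be recomputed in a few lines. This is only a verification sketch using the values given in the exercise (μ = 1.73, σ = 1.12, M = 1.5, N = 60), carrying full precision rather than rounded intermediates:

```python
import math

# Values from the exercise: population mean/SD, sample mean, sample size.
mu, sigma = 1.73, 1.12
m, n = 1.5, 60

sem = sigma / math.sqrt(n)       # standard error of the mean
z_obt = (m - mu) / sem           # z statistic for the sample mean
ci_low = m - 1.96 * sem          # 95% confidence interval bounds
ci_high = m + 1.96 * sem

print(round(z_obt, 2))                       # -1.59
print(round(ci_low, 3), round(ci_high, 3))   # 1.217 1.783
```

With full precision the interval endpoints differ from the rounded hand computation only in the third decimal place.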
#4. As the degrees of freedom increase, the critical value decreases. By comparing tobt
with a tcv based on df = 13, there is a greater chance that he will fail to reject the null hypothesis
even though it should be rejected: a Type II error.
#6a. This is a two-tailed test; the hypothesis is that there is a difference, but no direction is specified.
#6b. H0: μ cm = μ pop; Ha: μ cm ≠ μ pop
#6c. t = ( M – μ) / SEM; tobt = (59 – 58) / 1.016 = 0.984
#6d. t(13)cv = ±2.160
#6e. H0 should not be rejected. The difference demonstrated in spatial ability between
those who listen to classical music and those in the general population who do not listen to
classical music is not large enough to reject the null hypothesis.
#6f. CI = M ± tcv * SEM = 59 ± 2.160*1.016 = 59 ± 2.195  CI = 56.805 to 61.195
#8a. χ2 = Σ (O – E)2 / E = (31 – 24)2 / 24 + (89 – 96)2 / 96 = 72 / 24 + 72 / 96 = 49 / 24 +
49 / 96 = 2.042 + 0.510 = 2.552; χ2obt = 2.552
#8b. df = 1
#8c. χ2cv = 3.841
#8d. The null hypothesis is rejected if χ2obt is greater than χ2cv; this is not the case. The
researcher will fail to reject the null hypothesis, because the proportion of people exercising in
California does not differ significantly from the proportion exercising in the United States as a whole.
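The goodness-of-fit χ2 in #8a can be verified the same way; a minimal sketch using the observed and expected frequencies from the exercise:

```python
# Observed and expected counts from #8: exercisers vs. non-exercisers.
observed = [31, 89]
expected = [24, 96]

# Chi square: sum of squared deviations scaled by the expected counts.
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(chi2, 3))   # 2.552, below the critical value of 3.841
```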
#2a. This study meets the assumptions of the Independent-Groups t Test with a one-tailed test.
#2b. H0: μNM ≥ μ Mus; Ha: μ NM < μ Mus
#2c. tcv = 1.86; t(8)obt = 2.193; p < .05, one-tailed test (Used SPSS to get obtained value.)
#2d. Reject the null hypothesis, and conclude that studying without music leads to better
attention to the material.
#2e. r2 = t2 / (t2 + df) = 2.1932 / (2.1932 + 8) = 4.809 / (4.809 + 8) = 4.809 / 12.809; r2 =
0.375, a large effect size.
#2f. [Graph of mean test scores for the No Music and Music groups.]
#2g. CI = (M1 – M2) ± tcv(sM1 – M2). The standard error of the difference is not the
difference of the two SEMs; it is consistent with the obtained t: sM1 – M2 = (M1 – M2) / tobt =
1.556 / 2.193 = 0.710. CI = 1.556 ± 1.86 * 0.710 = 1.556 ± 1.320; CI = 0.236 to 2.876.
#4a. This study meets the assumptions of the t Test for Correlated Groups with a one-tailed test.
#4b. H0: μNM ≥ μ Mus; Ha: μ NM < μ Mus
#4c. tcv = 2.015; t(5)obt = 2.739; p < 0.05, one-tailed test. (Used Excel to get the obtained value.)
[Excel output (t-Test: Paired Two Sample for Means): Music M = 6.667, variance = 1.467;
No Music M = 7.667, variance = 0.667; t Stat = 2.739; t Critical one-tail = 2.015.]
#4d. Reject the null hypothesis, and conclude that studying without music leads to better
test scores.
#4e. r2 = t2 / (t2 + df) = 2.7392 / (2.7392 + 5) = 7.502 / (7.502 + 5) = 7.502 / 12.502; r2 =
0.60, a large effect size.
#4f. [Graph of mean test scores for the No Music and Music groups.]
#4g. CI = MD ± tcv(sMD), where MD = 7.667 – 6.667 = 1. The standard error of the mean
difference follows from the obtained t: sMD = MD / tobt = 1 / 2.739 = 0.365. CI = 1 ± 2.015
* 0.365 = 1 ± 0.736; CI = 0.264 to 1.736.
#6a. This study meets the assumptions of the Wilcoxon Rank-Sum Test with a one-tailed test.
#6b. H0: μgs ≥ μ rs; Ha: μ gs < μ rs
#6c. W(n1=7, n2=7)cv = 39; W(n1 = 7, n2 = 7) = 43, p = 0.05, one-tailed test.
[Table of taste-test scores and ranks for the Red Sauce and Green Sauce groups.]
#6d. Fail to reject the null hypothesis. Taste scores for the two sauces did not differ significantly.
#8a. χ2 (1, N = 105) = 6.732, p < 0.05
[Contingency table of observed seating frequencies by gender (front vs. back row; e.g., 27
front, 19 back for one group); χ2obt = 6.732, df = 1.]
#8b. df = (2-1)*(2-1) = 1
#8c. χ2cv = 3.841
#8d. Reject the null hypothesis. There is a significant difference in seating preferences
between women and men. More men sit in the back row and more women sit in the front row.
Part I Assignment Question Answers
What are degrees of freedom? How are they calculated? The degrees of freedom are
the number of scores in a sample that are free to vary. For any given mean, all of the values
can change freely except the last one, which is fixed by the requirement that the scores sum to
N times the mean. Thus, the degrees of freedom can be calculated with the formula df = N – 1.
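The idea that the last score is not free to vary can be illustrated with a short sketch (the scores here are hypothetical):

```python
# Fix N and the mean; then N - 1 scores may be anything, but the last
# score is forced to whatever value preserves the total N * mean.
n, mean = 5, 10.0
free_scores = [8.0, 12.0, 9.0, 14.0]      # N - 1 freely chosen values
last_score = n * mean - sum(free_scores)  # determined, not free
df = n - 1                                # hence df = N - 1 = 4

print(last_score)   # 7.0
```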
What do inferential statistics allow you to infer? The inference in inferential statistics
is what you can say about a population based on the conclusions found in a research study on a
sample of the population following specific sampling and statistical procedures.
What is the General Linear Model (GLM)? Why does it matter? The general linear
model (GLM) unifies various statistical models into a flexible generalization of linear regression
such that response variables can be other than normally distributed. Since the goal of data
analysis “is to summarize or describe accurately what is happening in the data” (Trochim &
Donnelly, 2012, p. 297) the GLM enhances this ability. The GLM has brought together
disparate statistical tools into one model, while also allowing for the discovery of advanced
models, such as SEM and HLM, for testing complex models.
Compare and contrast parametric and nonparametric statistics. Why and in what
types of cases would you use one over the other? Parametric statistics are used when
analyzing interval or ratio data. To use parametric statistics, the data must be bell-shaped and
will have known or calculable population means and standard deviations. Parametric statistics
allow for more accurate and precise estimates than nonparametric measures as long as the
assumptions upon which they are based are true. Parametric statistics allow for predictability of
scores based on probabilities. Nonparametric statistics are used when analyzing nominal or
ordinal data. With nonparametric statistics the distribution of data does not have to be
bell-shaped, and population parameters are not needed. Nonparametric statistics are used in situations
where (a) an assumption about the probability distribution is not warranted, (b) ranking is involved,
or (c) preferences are being assessed. The design of a study may affect which particular statistical
test is used, but it does not differentiate between parametric and nonparametric statistics.
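The rank-based logic of a nonparametric test can be seen in a small sketch: scores are replaced by their ranks in the pooled sample, so no distributional assumptions are needed. The data are hypothetical and ties are ignored for simplicity:

```python
# Core of a rank-sum (Wilcoxon/Mann-Whitney) test: pool the two groups,
# rank the pooled scores, and sum the ranks belonging to one group.
group_a = [3, 5, 6]
group_b = [4, 7, 8]

pooled = sorted(group_a + group_b)
rank_of = {score: i + 1 for i, score in enumerate(pooled)}  # no ties here
w_a = sum(rank_of[s] for s in group_a)   # rank sum for group A

print(w_a)   # ranks 1 + 3 + 4 = 8
```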
Why is it important to pay attention to the assumptions of the statistical test? What
are your options if your dependent variable scores are not normally distributed?
Parametric statistics are only accurate, or have statistical power, when the assumptions upon
which the test is based are true. If the test assumptions are not met, the results can be inaccurate,
misleading, or wrong. When dependent variable scores are not strictly normally distributed but
are still roughly bell-shaped, a parametric t-test analysis can still be performed, because the t test
is reasonably robust to modest departures from normality. If, however, the dependent variable
scores are not bell-shaped, the only option for analysis is the use of nonparametric statistical tests.
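A rough, assumption-light screen for bell-shapedness is the sample skewness, which is near zero for symmetric data. This sketch uses hypothetical scores; in practice a formal test such as Shapiro-Wilk would usually be preferred:

```python
# Population-style skewness g1 = m3 / m2**1.5; clearly nonzero values
# signal an asymmetric (non-bell-shaped) distribution.
scores = [1, 2, 2, 3, 10]
n = len(scores)
mean = sum(scores) / n
m2 = sum((x - mean) ** 2 for x in scores) / n
m3 = sum((x - mean) ** 3 for x in scores) / n
g1 = m3 / m2 ** 1.5

print(round(g1, 2))   # 1.36: strongly right-skewed
```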
Part II Assignment Question Answers
What does p = .05 mean? What are some misconceptions about the meaning of p
=.05? Why are they wrong? Should all research adhere to the p = .05 standard for
significance? Why or why not? The significance level, or alpha, of a study represents the
maximum acceptable probability of committing a Type I error: rejecting the null hypothesis when it
is true. Schmidt (2010) identified six misconceptions regarding significance: that it (a) guarantees
reliable replication, (b) identifies the size of a relationship, (c) indicates no relationship when
results are not significant, (d) is essential to research, (e) guarantees impartiality, and (f)
contributes to the field. Significance bears no relationship to replication; it indicates that, when
there is no actual relationship between variables, there is a 1 in 20 chance that a significant
finding will nevertheless be found. Correlation or effect size identifies the strength of a
relationship; significance does not. Both Type I and Type II errors are always possible, because
there is always a chance that a significant result will be found even when no relationship exists,
or that a nonsignificant result will be found even when a relationship does exist. Significance
identifies the probability of only the first possibility. Significance can be manipulated through
sample size, is not the best measure for summarizing research data, is not essential to research,
does not guarantee impartiality of observations, and may distort and “retard the development of
cumulative knowledge” (Schmidt, 2010, p. 239).
Cohen (1992) recommends that when multiple null hypotheses are being tested the
significance level be lowered so that the “experimentwise [sic] risk not become too large” (p. 156)
and conclusion validity not be decreased by the “fishing and the error rate problem” (Trochim &
Donnelly, 2012, p. 255). While a significance level of 0.05 is traditional, there is nothing
sacrosanct about that level of Type I risk, and Faul, Erdfelder, Lang, and Buchner (2007)
concluded that there is no reason why the exact significance level of results should not be
reported in reports and articles.
Compare and contrast the concepts of effect size and statistical significance. There
are four important components of statistical power: significance level, sample size, effect size, and
power (Cohen, 1992; Faul, Erdfelder, Lang, & Buchner, 2007; Trochim & Donnelly, 2012).
Many researchers focus on significance almost to the exclusion of the other components (Cohen,
1992). The idea behind statistical analysis of research is to determine whether a treatment
creates two separate populations whose differences can be measured. The effect size is a
measure showing the degree the experimental mean is expected to deviate from the control mean
or the variance accounted for in a study. Statistical significance is the maximum allowable risk
of erroneously rejecting the null hypothesis and committing a Type I error. The two are similar
in that, as either decreases, the minimum sample size must increase in order to maintain sufficient
power to find a significant result; as either increases, the minimum sample size needed to maintain
consistent power decreases. Other than their relationship to statistical power, however, significance
and effect size are not similar at all.
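This trade-off can be sketched with the usual normal-approximation formula for a two-group comparison. The z values below assume a two-tailed α = .05 and power = .80; exact t-based values (as in Cohen's tables) run slightly higher:

```python
import math

# Required n per group rises sharply as the standardized effect size d
# shrinks, holding alpha and power fixed (normal approximation).
def n_per_group(d, z_alpha=1.96, z_beta=0.8416):
    return math.ceil(2 * ((z_alpha + z_beta) / d) ** 2)

print(n_per_group(0.8))   # large effect  -> 25 per group
print(n_per_group(0.5))   # medium effect -> 63 per group
print(n_per_group(0.2))   # small effect  -> 393 per group
```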
What is the difference between a statistically significant result and a clinically or
“real world” significant result? Give examples of both. A statistically significant result
traditionally can occur by chance 1 in 20 times even when the null hypothesis is true. Clinical, or
real-world, significance occurs when research results are useful in expanding the knowledge of
the field. Statistical significance indicates only that the results obtained differ from the population
mean and probably are not due to chance. Schmidt (2010) described a meta-analysis in which
eight studies were found to be statistically nonsignificant while another eight studies were found
to be statistically significant, even though all 16 studies applied the same “test of decision making
to supervisory rating of job performance in various midlevel jobs” (p. 234).
A study can return statistical significance even if the null hypothesis is true or not be statistically
significant even though the null hypothesis is false; clinical significance is never achieved unless
the underlying data represent an important difference (Carver, 1978). Schmidt’s meta-analysis is
an example of practical significance because eliminating the sampling and measurement error
from the 16 studies resulted in “basically a single value (.32) because there is nearly no
variation” (p. 236).
What is NHST? Describe the assumptions of the model. Null hypothesis significance
testing (NHST) assumes that differences between observed means and the value specified by H0
are due principally to sampling variance, that is, to chance (Kirk, 2003). If a statistically significant
result is found, the null hypothesis is rejected on the assumption that the result is probably not due
to chance alone.
Describe and explain three criticisms of NHST. Three criticisms of null hypothesis
significance testing are: (a) the wrong question is being answered, (b) it is a trivial exercise, and
(c) it reduces a continuous (ratio-level) p value to a dichotomous decision, losing much in the
conversion. Researchers
want to ascertain truth when conducting research; what is the probability that this treatment will
consistently result in the outcomes predicted (reliability) and that the research hypothesis is true
(validity)? Significance testing contributes to answering neither of these questions, but instead
answers a different question; what is the probability that the results obtained were not a chance
occurrence? The assumptions of significance testing identify that there is sampling variance in
all research measures. Based on this assumption, the probability that two means on any measure
will be exactly the same to an infinite number of decimal places is zero. Therefore, the null
hypothesis will be false in every case, making it a trivial exercise to compare anything with it.
The third criticism of NHST is that significance testing creates an arbitrary “go-no-go decision
straddled over p = .05” (Cohen, 1992, p. 156) or a “cliff effect” (Kirk, 2003, p. 87) that tends to
mark significant findings as important and nonsignificant findings as unimportant; a “hurdle, with
the statistical significance test coming before further consideration of the results” (Carver, 1978,
p. 388). This hurdle, in essence, places chance and sample size before real-world significance.
Describe and explain two alternatives to NHST. What do their proponents consider
to be their advantages? An alternative to NHST proposed by Schmidt (2010) is to follow the
example of physical scientists and utilize confidence intervals. He suggests that confidence
intervals provide more information than significance tests do and are just as objective. He
further cites the 1999 APA Task Force report on significance testing, which recommended that
confidence intervals be reported in research results. Kirk (2003) noted that confidence intervals
provide all of the information associated with significance but also contain “a range of values within
which the population parameter is likely to lie, . . . the same unit of measurement as the data, . . .
[and] are especially useful in assessing the practical significance of results” (p. 88).
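Both a confidence interval and an effect size are easy to compute from raw data. A minimal sketch with hypothetical scores, a hypothetical comparison mean of 58, and the tabled critical value t(7) = 2.365:

```python
import math

# 95% CI around a sample mean plus Cohen's d against a comparison mean.
scores = [59, 61, 58, 60, 62, 57, 63, 60]   # hypothetical data
n = len(scores)
mean = sum(scores) / n
sd = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
sem = sd / math.sqrt(n)

t_cv = 2.365                    # two-tailed .05 critical value, df = 7
ci = (mean - t_cv * sem, mean + t_cv * sem)
d = (mean - 58) / sd            # Cohen's d vs. the comparison mean

print(round(ci[0], 2), round(ci[1], 2))   # 58.33 61.67
print(d)                                   # 1.0
```

Unlike a bare p value, the interval is in the same units as the data, and d conveys the magnitude of the difference directly.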
Both Schmidt (2010) and Kirk (2003) also propose the reporting of effect sizes in
research results. Effect sizes summarize data by identifying the strength of an association.
Reporting this strength allows an estimate to be made of the practical usefulness of the findings.
Which type of analysis would best answer the research question you stated in
Activity 1? Justify your answer. I did not propose or state a research question in Activity 1,
nor was one asked of me.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational
Review, 48(3), 378-399. Retrieved from
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159. doi:10.1037/0033-2909.112.1.155
Faul, F., Erdfelder, E., Lang, A.-G., & Buchner, A. (2007). G*Power 3: A flexible statistical
power analysis program for the social, behavioral, and biomedical sciences. Behavior
Research Methods, 39(2), 175-191. Retrieved from http://www.psycho.uni-duesseldorf.de/abteilungen/aap/gpower3/download-and-register/Dokumente/GPower3-BRM-Paper.pdf
Jackson, S. L. (2012). Research methods and statistics: A critical thinking approach (4th ed.).
Belmont, CA: Wadsworth Cengage Learning.
Kirk, R. E. (2003). The importance of effect magnitude. In S. F. Davis (Ed.), Handbook of
research methods in experimental psychology (pp. 83-105). Malden, MA: Blackwell Publishing.
Schmidt, F. (2010). Detecting and correcting the lies that data tell. Perspectives on Psychological
Science, 5(3), 233-242. doi:10.1177/1745691610369339
Trochim, W. M. K., & Donnelly, J. P. (2008). The research methods knowledge base (3rd ed.).
Mason, OH: Cengage Learning.