Two Groups and One Continuous Variable

Psychologists and others are frequently interested in the relationship between a dichotomous
variable and a continuous variable – that is, they have two groups of scores. There are many ways
such a relationship can be investigated. I shall discuss several of them here, using the data on sex,
height, and weight of graduate students, in the file SexHeightWeight.sav. For each of 49 graduate
students, we have sex (female, male), height (in inches), and weight (in pounds).
After screening the data to be sure there are no errors, I recommend preparing a schematic
plot – side by side box-and-whiskers plots. In SPSS, Analyze, Descriptive Statistics, Explore. Scoot
‘height’ into the Dependent List and ‘sex’ into the Factor List. In addition to numerous descriptive
statistics, you get this schematic plot:
[Schematic plot: side-by-side box-and-whiskers plots of HEIGHT (58 to 76 inches) by SEX; Female, N = 28; Male, N = 21.]
The height scores for male graduate students are clearly higher than those for female graduate
students, with relatively little overlap between the two distributions. The descriptive statistics show
that the two groups have similar variances and that the within-group distributions are not badly
skewed, but somewhat platykurtic. I would not be uncomfortable using techniques that assume
normality.
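The five-number summaries that a schematic plot displays are easy to compute directly. Here is a minimal Python sketch using hypothetical heights (not the actual SexHeightWeight.sav values):

```python
import numpy as np

# Hypothetical height data (not the actual SexHeightWeight.sav values)
heights = {
    "Female": np.array([61, 62, 63, 64, 64, 65, 66, 67]),
    "Male":   np.array([68, 69, 70, 70, 71, 72, 74]),
}

# Five-number summary per group -- the same quantities a schematic
# (box-and-whiskers) plot displays
for group, x in heights.items():
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    print(group, x.min(), q1, med, q3, x.max())
```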
Student’s T Test. This is probably the most often used procedure for testing the null
hypothesis that two population means are equal. In SPSS, Analyze, Compare Means, Independent
Samples T Test, height as test variable, sex as grouping variable, define groups with values 1 and 2.
The output shows that the mean height for the sample of men was 5.7 inches greater than for
the women and that this difference is significant by a separate variances t test, t(46.0) = 8.175, p <
.001. A 95% confidence interval for the difference between means runs from 4.28 inches to 7.08
inches.
When dealing with a variable for which the unit of measure is not intrinsically meaningful, it is a
good idea to present the difference in means and the confidence interval for the difference in means
in standardized units. While I don’t think that is necessary here (you probably have a pretty good
idea regarding how large a difference of 5.7 inches is), I shall for pedagogical purposes compute
Cohen’s d (the sample statistic) and a confidence interval for Cohen’s δ (the parameter). In doing so,
I shall use a special SPSS script and the separate variances values for t and df. See the document
Confidence Intervals, Pooled and Separate Variances T. For these data, d = 2.36 (quite a large
difference) and the 95% confidence interval for δ runs from 1.61 (large) to 3.09 (even larger). We can
be quite confident that the difference in height between men and women is large.
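The sample statistic d can also be computed by hand from the two group means and a pooled standard deviation. A minimal Python sketch, using the common pooled-SD formula (not the SPSS script mentioned above) and toy data:

```python
import numpy as np

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * np.var(x, ddof=1)
                  + (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2)
    return (np.mean(x) - np.mean(y)) / np.sqrt(pooled_var)

# Toy check: the means differ by 2 and the pooled SD is 1, so d = 2
print(cohens_d([3, 4, 5], [1, 2, 3]))  # 2.0
```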
Group Statistics (height)

sex      N    Mean     Std. Deviation   Std. Error Mean
Female   28   64.893   2.6011           .4916
Male     21   70.571   2.2488           .4907
Independent Samples Test: t test for Equality of Means (height)

                              t        df       Sig.         Mean         Std. Error   95% CI     95% CI
                                                (2-tailed)   Difference   Difference   Lower      Upper
Equal variances assumed       -8.005   47       .000         -5.6786      .7094        -7.1057    -4.2515
Equal variances not assumed   -8.175   45.981   .000         -5.6786      .6946        -7.0767    -4.2804
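For readers working outside SPSS, the separate variances (Welch) t test can be reproduced with scipy. A sketch with hypothetical data, not the actual file:

```python
import numpy as np
from scipy import stats

# Hypothetical heights (not the actual SexHeightWeight.sav data)
women = np.array([62, 63, 64, 65, 66, 64, 63, 65])
men = np.array([68, 70, 71, 72, 69, 70, 73])

# equal_var=False requests the separate-variances (Welch) t test
t, p = stats.ttest_ind(men, women, equal_var=False)
print(t, p)
```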
Point Biserial Correlation. Here we simply compute the Pearson r between sex and height.
The value of that r is .76, and it is statistically significant, t(47) = 8.005, p < .001. This analysis is
absolutely identical to an independent samples t test with a pooled error term (see the t test output
above) or an Analysis of Variance with pooled error. The value of r here can be used as a strength of
effect estimate. Square it and you have an estimate of the proportion of variance in height that is
“explained” by sex.
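The equivalence between the point-biserial r and the pooled-variance t can be verified numerically; the relation r = t/√(t² + df) is exact. A sketch with hypothetical data:

```python
import numpy as np
from scipy import stats

# Hypothetical data: 0 = female, 1 = male, paired with heights
sex = np.array([0, 0, 0, 0, 1, 1, 1, 1])
height = np.array([63, 64, 65, 66, 69, 70, 71, 73])

r, _ = stats.pearsonr(sex, height)                          # point-biserial r
t, _ = stats.ttest_ind(height[sex == 1], height[sex == 0])  # pooled-variance t
df = len(height) - 2

# The identity r = t / sqrt(t**2 + df) ties the two analyses together
print(r, t / np.sqrt(t**2 + df))
```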
One-Way Independent Samples Parametric ANOVA. This too is absolutely equivalent to
the t test. The value of F obtained will equal the square of the value of t, the numerator df will be one,
the denominator df will be N – 2 (just like with t), and the p value (one-tailed in F, which is appropriate
for a nondirectional hypothesis) will be identical to the two-tailed p value from t. Our t of 8.005, when
squared, yields an F of 64.08.
ANOVA (height)

                 Sum of Squares   df   Mean Square   F        Sig.
Between Groups   386.954          1    386.954       64.078   .000
Within Groups    283.821          47   6.039
Total            670.776          48
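The F = t² equivalence is easy to demonstrate directly; a sketch with hypothetical two-group data:

```python
import numpy as np
from scipy import stats

# Hypothetical two-group data
g1 = np.array([63, 64, 65, 66, 64])
g2 = np.array([69, 70, 71, 73, 70])

t, p_t = stats.ttest_ind(g1, g2)   # pooled-variance t
f, p_f = stats.f_oneway(g1, g2)    # one-way ANOVA F

print(f, t**2)   # F equals t squared
print(p_f, p_t)  # and the p values match
```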
Discriminant Function Analysis. This also is equivalent to the independent samples t test,
but looks different. In SPSS, Analyze, Classify, Discriminant, sex as grouping variable, height as
independent variable. Under Statistics, ask for unstandardized discriminant function coefficients.
Under Classify, ask that prior probabilities be computed from group sizes and that a summary table
be displayed. In discriminant function analysis a weighted combination of the predictor variables is
used to predict group membership. For our data, that function is DF = -27.398 + .407*height. The
correlation between this discriminant function and sex is .76 (notice that this is identical to the
point-biserial r computed earlier) and is statistically significant, χ²(1, N = 49) = 39.994, p < .001.
Using this model we are able to correctly predict a person’s sex 83.7% of the time.
Canonical Discriminant Function Coefficients (unstandardized)

             Function 1
height       .407
(Constant)   -27.398

Eigenvalues

Function   Eigenvalue   % of Variance   Cumulative %   Canonical Correlation
1          1.363a       100.0           100.0          .760
a. First 1 canonical discriminant functions were used in the analysis.

Wilks' Lambda

Test of Function(s)   Wilks' Lambda   Chi-square   df   Sig.
1                     .423            39.994       1    .000
Classification Resultsa

                            Predicted Group Membership
               sex          Female    Male       Total
Original Count Female       26        2          28
               Male         6         15         21
         %     Female       92.9      7.1        100.0
               Male         28.6      71.4       100.0
a. 83.7% of original grouped cases correctly classified.
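With a single predictor, equal population variances, and (roughly) equal priors, the linear discriminant classification rule reduces to a cutoff halfway between the group means. A hypothetical-data sketch of that special case (SPSS's actual rule also weights by the group sizes used as priors):

```python
import numpy as np

# Hypothetical training data
female = np.array([62, 63, 64, 65, 66])
male = np.array([69, 70, 71, 72, 73])

# With one predictor, equal priors, and equal variances, the linear
# discriminant rule reduces to a cutoff halfway between the group means
cutoff = (female.mean() + male.mean()) / 2   # 67.5 here

def predict(height):
    return "Male" if height > cutoff else "Female"

print(predict(61), predict(72))  # Female Male
```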
Logistic Regression. This technique is most often used to predict group membership from
two or more predictor variables. The predictors may be dichotomous or continuous variables. Sex is
a dichotomous variable in the data here, so let us test a model predicting sex from height. In SPSS,
Analyze, Regression, Binary Logistic. Identify sex as the dependent variable and height as a
covariate.
The χ² statistic shows us that we can predict sex significantly (p < .001) better if we know the
person’s height than if all we know is the marginal distribution of the sexes (28 women and 21 men).
The odds ratio for height is 2.536. That is, the odds that a person is male are multiplied by 2.536 for
each one inch increase in height. That is certainly a large effect. The classification table shows us
that using the logistic model we could, from heights, correctly predict the sex of a person 83.7% of the
time. If we did not know the person’s height, our best strategy would be to predict ‘woman’ every
time, and we would be correct 28/49 = 57% of the time.
Omnibus Tests of Model Coefficients

                 Chi-square   df   Sig.
Step 1   Step    38.467       1    .000
         Block   38.467       1    .000
         Model   38.467       1    .000
Variables in the Equation

                     B         S.E.     Wald     df   Sig.   Exp(B)
Step 1a   height     .931      .287     10.534   1    .001   2.536
          Constant   -63.488   19.548   10.548   1    .001   .000
a. Variable(s) entered on step 1: height.
Classification Tablea

                               Predicted sex        Percentage
Step 1   Observed sex          Female    Male       Correct
                   Female      26        2          92.9
                   Male        6         15         71.4
         Overall Percentage                         83.7
a. The cut value is .500
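The logistic coefficients can be obtained by maximizing the likelihood directly. A sketch using scipy.optimize on hypothetical data (the data, starting values, and the small epsilon guard are illustrative only):

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical data: 1 = male, 0 = female (note the groups overlap slightly,
# so the maximum likelihood estimates are finite)
height = np.array([62., 63, 64, 65, 68, 66, 69, 70, 71, 72])
male = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

def neg_log_likelihood(beta):
    b0, b1 = beta
    p = 1 / (1 + np.exp(-(b0 + b1 * height)))  # predicted P(male)
    eps = 1e-9  # guard against log(0)
    return -np.sum(male * np.log(p + eps) + (1 - male) * np.log(1 - p + eps))

fit = minimize(neg_log_likelihood, x0=[0.0, 0.0])
b0, b1 = fit.x
print(np.exp(b1))  # odds ratio for a one-inch increase in height
```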
Wilcoxon Rank Sums Test. If we had reason not to trust the assumption that the population
data are normally distributed, we could use a procedure which makes no such assumption, such as
the Wilcoxon Rank Sums Test (which is equivalent to a Mann-Whitney U Test). In SPSS, Analyze,
Nonparametric Tests, Two Independent Samples, height as test variable, sex as grouping variable
with values 1 and 2, Exact Test selected.
The output shows that the difference between men and women is statistically significant.
SPSS gives mean ranks. Most psychologists would prefer to report medians. From Explore, used
earlier, the medians are 64 (women) and 71 (men).
Ranks

         sex      N    Mean Rank   Sum of Ranks
height   Female   28   15.73       440.50
         Male     21   37.36       784.50
         Total    49

Test Statisticsa

                         height
Mann-Whitney U           34.500
Wilcoxon W               440.500
Z                        -5.278
Asymp. Sig. (2-tailed)   .000
Exact Sig. (2-tailed)    .000
Exact Sig. (1-tailed)    .000
Point Probability        .000
a. Grouping Variable: sex
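Outside SPSS, the same test is available as scipy's Mann-Whitney U. A sketch with hypothetical heights, reporting medians alongside the test:

```python
import numpy as np
from scipy import stats

# Hypothetical heights
women = np.array([61, 62, 63, 64, 65, 66, 64, 63])
men = np.array([69, 70, 71, 72, 70, 73])

# Mann-Whitney U test (equivalent to the Wilcoxon rank-sum test)
u, p = stats.mannwhitneyu(women, men, alternative="two-sided")
print(u, p)
print(np.median(women), np.median(men))  # report medians alongside the test
```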
Kruskal-Wallis Nonparametric ANOVA on Ranks
As with the parametric ANOVA, this ANOVA can be used with 2 or more independent samples.
While the results of the analysis will not be absolutely equivalent to those of a Wilcoxon rank sum
test, it does test the same null hypothesis.
Test Statisticsa,b

              height
Chi-Square    27.854
df            1
Asymp. Sig.   .000
a. Kruskal Wallis Test
b. Grouping Variable: sex
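A hypothetical-data sketch with scipy's Kruskal-Wallis test, whose H statistic is referred to a chi-square distribution with k – 1 df:

```python
import numpy as np
from scipy import stats

# Hypothetical heights for the two groups
women = np.array([61, 62, 63, 64, 65, 66])
men = np.array([69, 70, 71, 72, 73])

# Kruskal-Wallis H test on ranks; with two groups it tests the same
# null hypothesis as the Wilcoxon rank-sum test
h, p = stats.kruskal(women, men)
print(h, p)
```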
Resampling Statistics. Using the bootstrapping procedure in David Howell’s resampling
program and the SexHeight.dat file, a 95% confidence interval for the difference in medians runs from
4 to 8. Since the null value (0) is not included in this confidence interval, the difference between
group medians is statistically significant. Using the randomization test for differences between two
medians, the nonrejection region runs from -2.5 to 3. Since our observed difference in medians is -7,
we reject the null hypothesis of no difference.
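The percentile bootstrap for a difference in medians is straightforward to code. A sketch with hypothetical data (the resample count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical heights
women = np.array([61, 62, 63, 64, 65, 66, 64, 63])
men = np.array([69, 70, 71, 72, 70, 73, 74])

# Bootstrap the difference in group medians by resampling each group
# with replacement
diffs = np.array([
    np.median(rng.choice(men, size=len(men), replace=True))
    - np.median(rng.choice(women, size=len(women), replace=True))
    for _ in range(5000)
])
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(lo, hi)  # a 95% percentile bootstrap CI for the median difference
```

If the interval excludes 0, the difference in medians is judged statistically significant, as in the text above.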
Karl L. Wuensch
East Carolina University
March, 2015