Homework #10 Solutions

PHCO 0504 – Introduction to Biostatistics
Homework #10 - Solutions
Due December 2nd at the beginning of class
1. A person’s muscle mass is expected to decrease with age. To explore this
relationship in women, a nutritionist randomly selected four women from each
10-year age group, beginning with age 40 and ending with age 79. The results
follow; X is age, and Y is a measure of muscle mass.
X 71 64 43 67 56 73 68 56 76 65 46 58 45 53 49 78
Y 82 91 100 68 87 73 78 80 65 84 116 76 97 100 105 77
a. Create (and include here) a scatter plot of the values of muscle mass against
age, with the best fit line added to the plot. Does the assumption that the
relationship between X and Y is linear look reasonable?

[Figure: scatter plot of muscle mass (Y) against age (X) with the fitted line.
Muscle mass = 148.14 - 1.02 * age; R-Square = 0.67]
The assumption that the relationship between X and Y is linear looks
reasonable.
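The fitted line on the plot can be reproduced outside SPSS. The following is a minimal Python sketch, not part of the original SPSS solution; the names `ages`, `muscle`, and `fit_line` are illustrative assumptions, and the data are the 16 (X, Y) pairs listed in the problem statement.

```python
# Least-squares fit of muscle mass (Y) on age (X); illustrative sketch only.
ages = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65, 46, 58, 45, 53, 49, 78]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = fit_line(ages, muscle)
print(f"Muscle mass = {intercept:.2f} + {slope:.2f} * age")
```

After rounding, the intercept and slope should agree with the values printed on the SPSS plot (148.14 and -1.02).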
b. Create histograms of the X and the Y values (separately). Evaluate the
assumption that both variables need to be normally distributed in order to
draw inference from the calculated Pearson correlation.
Thursday, November 18, 2004
Page 1/11
[Figure: histogram of age. Mean = 60.5, Std. Dev = 11.36, N = 16]
[Figure: histogram of muscle mass. Mean = 86.2, Std. Dev = 14.22, N = 16]
Since the sample size is small (16), we cannot conclude from these histograms
that either variable is not normally distributed.
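The summary statistics printed beneath the two histograms can be checked directly from the data. This is a small sketch using Python's standard `statistics` module; it is an illustration added here, not part of the SPSS output, and the variable names are assumptions.

```python
import statistics

# Age (X) and muscle mass (Y) for the 16 women in problem 1.
ages = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65, 46, 58, 45, 53, 49, 78]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77]

for label, values in (("Age", ages), ("Muscle mass", muscle)):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (n - 1 denominator)
    print(f"{label}: Mean = {mean:.1f}, Std. Dev = {sd:.2f}, N = {len(values)}")
```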
Correlations
                                   Age        Muscle mass
Age          Pearson Correlation   1          -.818**
             Sig. (2-tailed)       .          .000
             N                     16         16
Muscle mass  Pearson Correlation   -.818**    1
             Sig. (2-tailed)       .000       .
             N                     16         16
**. Correlation is significant at the 0.01 level (2-tailed).
c. Fit the regression model to these data using SPSS. Include the output
regarding the fit of the regression line (The Coefficients) and the
coefficient of determination (The “Model Fit”)
Model Summary(b)
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .818(a)   .669       .645                8.470
a. Predictors: (Constant), Age
b. Dependent Variable: Muscle mass
Coefficients(a)
                 Unstandardized Coefficients   Standardized Coefficients
Model            B          Std. Error         Beta                        t         Sig.
1  (Constant)    148.141    11.837                                         12.515    .000
   Age           -1.024     .192               -.818                       -5.320    .000
a. Dependent Variable: Muscle mass
d. Formally interpret the slope of the regression function. (“If X increases
by…”)
If a woman’s age increases by 1 year, we expect her measured muscle mass to
decrease by 1.024 units on average.
e. Perform a hypothesis test (with all 5 steps) using the output above to see if
there is a significant linear relationship between X and Y.
Assumptions
Independent random sample of subjects (each XY pair)
Errors are normally distributed (We’ll get to diagnostics!)
Errors have the same variance regardless of the X value
The mean of Y is a linear function of X
Hypotheses
H0: β = 0 vs. HA: β ≠ 0
Test Statistic: t = (b - 0)/SE = -5.320, with n - 2 = 14 degrees of freedom
P-value < 0.001 (SPSS reports Sig. = .000)
Conclusion: There is strong evidence (p-value < 0.001) that the measure of
muscle mass is linearly related to age.
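The test statistic in the Coefficients table can be rebuilt from the raw data. The sketch below is an illustration in Python rather than the SPSS procedure itself; the variable names are assumptions, and the critical value 2.145 is the usual two-sided 5% value of t with 14 degrees of freedom taken from a standard t table.

```python
import math

ages = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65, 46, 58, 45, 53, 49, 78]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(muscle) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, muscle))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual standard error ("Std. Error of the Estimate"), n - 2 degrees of freedom.
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ages, muscle))
resid_se = math.sqrt(sse / (n - 2))

# Standard error of the slope and the t statistic for H0: beta = 0.
se_slope = resid_se / math.sqrt(sxx)
t_stat = (slope - 0) / se_slope

T_CRIT = 2.145  # two-sided 5% critical value, t distribution with 14 df
print(f"SE(b) = {se_slope:.3f}, t = {t_stat:.3f}, df = {n - 2}")
print("Reject H0 at the 5% level:", abs(t_stat) > T_CRIT)
```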
f. In order to check the assumptions of the hypothesis test above, create a
histogram of the residuals. Use this histogram to evaluate one of the
assumptions above.
[Figure: histogram of the regression standardized residuals. Dependent
variable: muscle mass. Mean = 0.00, Std. Dev = .97, N = 16]
With only 16 cases, the histogram shows no clear departure from normality, so
the assumption that the errors are normally distributed seems reasonable.
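The standardized residuals behind this histogram can be recomputed as a check. The sketch below assumes that "standardized" means each residual divided by the residual standard error (with n - 2 in the denominator); under that assumption the standard deviation of the 16 values comes out near the .97 shown on the histogram. The text histogram and all names are illustrative.

```python
import math
import statistics
from collections import Counter

ages = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65, 46, 58, 45, 53, 49, 78]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(muscle) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, muscle))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Raw residuals, then residuals scaled by the residual standard error.
residuals = [y - (intercept + slope * x) for x, y in zip(ages, muscle)]
resid_se = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
standardized = [e / resid_se for e in residuals]

print(f"Std. Dev = {statistics.stdev(standardized):.2f}, N = {n}")

# Text histogram: each bin covers [left, left + 0.5).
bins = Counter(math.floor(z / 0.5) * 0.5 for z in standardized)
for left in sorted(bins):
    print(f"{left:5.1f} | {'*' * bins[left]}")
```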
g. Evaluate the assumption of homoscedasticity by examining the original
scatterplot of the data. (Does the assumption seem reasonable for this
data? Why or why not?)
[Figure: scatterplot with Regression Standardized Predicted Value on the
horizontal axis. Dependent variable: muscle mass]
With only 16 cases, it is hard to claim that the variance is not constant, so
the assumption of homoscedasticity seems reasonable for these data.
h. State the value of and interpret the coefficient of determination.
R² = 0.669, which means that 66.9% of the variation in Y (measure of
muscle mass) can be explained by a linear function of X (age).
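To make the "proportion of variation explained" concrete, R² can be computed as 1 - SSE/SST. The following Python sketch is an added illustration, not SPSS output; the adjusted R² line uses the usual textbook adjustment for one predictor, and the variable names are assumptions.

```python
ages = [71, 64, 43, 67, 56, 73, 68, 56, 76, 65, 46, 58, 45, 53, 49, 78]
muscle = [82, 91, 100, 68, 87, 73, 78, 80, 65, 84, 116, 76, 97, 100, 105, 77]

n = len(ages)
mean_x = sum(ages) / n
mean_y = sum(muscle) / n
sxx = sum((x - mean_x) ** 2 for x in ages)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(ages, muscle))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Total variation in Y and the part left unexplained by the fitted line.
sst = sum((y - mean_y) ** 2 for y in muscle)
sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(ages, muscle))

r_square = 1 - sse / sst
# Adjusted R-square for a model with one predictor (n - 2 residual df).
adj_r_square = 1 - (1 - r_square) * (n - 1) / (n - 2)

print(f"R Square = {r_square:.3f}, Adjusted R Square = {adj_r_square:.3f}")
```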
i. Write down the estimate of the regression function. Use this regression to
predict the muscle mass of a person who is 56 years old. When a person
ages by two years, how much do you expect their muscle mass to change?
Regression function: estimated mean = a + bx = 148.141 - 1.024x.
For a person who is 56 years old, predicted value = 148.141 - 1.024*56 =
90.797.
When a person ages by two years, we expect their muscle mass to
decrease by 2*1.024 = 2.048.
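The arithmetic in this part can be written as a one-line function. This is a small illustrative sketch that plugs in the rounded SPSS coefficients; `predict_muscle_mass` is a hypothetical helper name.

```python
# Rounded coefficients from the SPSS Coefficients table.
INTERCEPT = 148.141
SLOPE = -1.024


def predict_muscle_mass(age):
    """Estimated mean muscle mass at a given age, from the fitted line."""
    return INTERCEPT + SLOPE * age


prediction_56 = predict_muscle_mass(56)
# Expected change when age goes up by two years: 2 * slope.
change_two_years = predict_muscle_mass(58) - predict_muscle_mass(56)

print(f"Predicted muscle mass at age 56: {prediction_56:.3f}")
print(f"Expected change over two years: {change_two_years:.3f}")
```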
2. A hospital administrator wished to study the relation between patient satisfaction
(Y) and patient’s age (in years), severity of illness (an index) and anxiety (an
index). The administrator randomly selected 23 patients and collected the data
presented below.
Subject   Satisfaction   Age   Severity   Anxiety
1         50             51    2.3        48
2         40             48    2.2        66
3         28             43    1.8        89
4         42             50    2.2        46
5         52             62    2.9        26
6         29             48    2.4        89
7         38             55    2.2        47
8         53             54    2.2        57
9         29             46    1.9        88
10        33             49    2.1        60
11        29             52    2.3        77
12        43             50    2.3        60
13        36             46    2.3        57
14        41             44    1.8        70
15        49             54    2.9        36
16        45             48    2.4        54
17        29             50    2.1        77
18        43             53    2.4        67
19        34             51    2.3        51
20        36             56    2.5        79
21        89             70    4.0        90
22        55             51    2.4        49
23        44             58    2.9        52
a. From the data given above, consider satisfaction to be the response and
severity of illness to be the predictor. Include on this plot the best fit line
(as calculated by SPSS).
i. Create, using SPSS, a scatter plot that includes the fitted line.

[Figure: scatter plot of satisfaction against severity of illness with the
fitted line. Satisfaction = -12.52 + 22.90 * severity; R-Square = 0.64]
ii. Identify the outlier point.
The outlier point is for subject # 21, with satisfaction (Y) value 89
and severity value 4.0.
iii. Calculate the fitted lines and the correlations (both given on the
scatter plot with the fitted line) both with and without this point.
From the plot in part i, with the outlier point, the fitted line is
Satisfaction = -12.52 + 22.90 * severity.
The correlation = sqrt (R-Square) = sqrt(0.64) = 0.8 (positive, because the
slope is positive).

[Figure: scatter plot of satisfaction against severity of illness without the
outlier, with the fitted line. Satisfaction = 7.01 + 14.25 * severity;
R-Square = 0.26]
Without the outlier point, the fitted line is
Satisfaction = 7.01 + 14.25 * severity.
The correlation = sqrt (R-Square) = sqrt(0.26) = 0.5099.
iv. Does the outlier cause the correlation to be larger or smaller? Does
it seem to have an influence on the slope of the line?
The outlier causes the correlation to be larger (from 0.5099 to 0.8).
It seems to have an influence on the slope of the line (slope
changes from 14.25 to 22.90).
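The effect of subject 21 on the fitted line and on the Pearson correlation can be checked directly. The following Python sketch is an illustration, not the SPSS procedure; `line_and_r` and `OUTLIER_INDEX` are hypothetical names, and the data are copied from the table for subjects 1 through 23. Because it works from unrounded sums, the correlations it prints differ slightly in the third decimal from the values obtained above by taking the square root of a rounded R-Square.

```python
import math

# Satisfaction and severity of illness for subjects 1..23 (in subject order).
satisfaction = [50, 40, 28, 42, 52, 29, 38, 53, 29, 33, 29, 43,
                36, 41, 49, 45, 29, 43, 34, 36, 89, 55, 44]
severity = [2.3, 2.2, 1.8, 2.2, 2.9, 2.4, 2.2, 2.2, 1.9, 2.1, 2.3, 2.3,
            2.3, 1.8, 2.9, 2.4, 2.1, 2.4, 2.3, 2.5, 4.0, 2.4, 2.9]
OUTLIER_INDEX = 20  # subject 21, zero-based position in the lists


def line_and_r(x, y):
    """Return (intercept, slope, Pearson r) for the regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope, sxy / math.sqrt(sxx * syy)


with_outlier = line_and_r(severity, satisfaction)

# Drop subject 21 from both lists and refit.
x_wo = severity[:OUTLIER_INDEX] + severity[OUTLIER_INDEX + 1:]
y_wo = satisfaction[:OUTLIER_INDEX] + satisfaction[OUTLIER_INDEX + 1:]
without_outlier = line_and_r(x_wo, y_wo)

for label, (a, b, r) in (("With outlier", with_outlier),
                         ("Without outlier", without_outlier)):
    print(f"{label}: Satisfaction = {a:.2f} + {b:.2f} * severity, r = {r:.3f}")
```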
v. Calculate the Spearman Correlation both with and without the
outlier. Interpret the correlation with the outlier.
With the outlier:
Correlations (Spearman's rho)
                                               Satisfaction   severity of illness
Satisfaction          Correlation Coefficient  1.000          .570**
                      Sig. (2-tailed)          .              .005
                      N                        23             23
severity of illness   Correlation Coefficient  .570**         1.000
                      Sig. (2-tailed)          .005           .
                      N                        23             23
**. Correlation is significant at the .01 level (2-tailed).
Interpretation:
Correlation of 0.570 between satisfaction and severity of illness is
positive, which means that satisfaction tends to increase as the
severity of illness increases.
Without the outlier:
Correlations (Spearman's rho)
                                               Satisfaction   severity of illness
Satisfaction          Correlation Coefficient  1.000          .507*
                      Sig. (2-tailed)          .              .016
                      N                        22             22
severity of illness   Correlation Coefficient  .507*          1.000
                      Sig. (2-tailed)          .016           .
                      N                        22             22
*. Correlation is significant at the .05 level (2-tailed).
b. Consider satisfaction to be the response and anxiety to be the predictor.
i. Create, using SPSS, a scatter plot that includes the fitted line.

[Figure: scatter plot of satisfaction against anxiety with the fitted line.
Satisfaction = 50.90 - 0.14 * anxiety; R-Square = 0.04]
ii. Identify the outlier point.
The outlier point is for subject # 21, with satisfaction (Y) value 89
and anxiety value 90.
iii. Calculate the fitted lines and the correlations (both given on the
scatter plot with the fitted line) both with and without this point.
From the plot in part i, with the outlier point, the fitted line is
Satisfaction = 50.90 - 0.14 * anxiety.
The correlation = - sqrt (R-Square) = - sqrt(0.04) = -0.2 (negative, because
the slope is negative).

[Figure: scatter plot of satisfaction against anxiety without the outlier,
with the fitted line. Satisfaction = 63.25 - 0.38 * anxiety; R-Square = 0.58]
Without the outlier point, the fitted line is
Satisfaction = 63.25 - 0.38 * anxiety.
The correlation = - sqrt (R-Square) = - sqrt(0.58) = -0.7616.
iv. Does the outlier cause the correlation to be larger or smaller? Does
it seem to have an influence on the slope of the line?
The outlier causes the correlation to be smaller in absolute value
(from -0.7616 to -0.2). It also has an influence on the slope
of the line (the slope changes from -0.38 to -0.14).
v. Calculate the Spearman Correlation both with and without the
outlier. Interpret the correlation with the outlier.
With the outlier:
Correlations (Spearman's rho)
                                        Satisfaction   anxiety
Satisfaction   Correlation Coefficient  1.000          -.507*
               Sig. (2-tailed)          .              .014
               N                        23             23
anxiety        Correlation Coefficient  -.507*         1.000
               Sig. (2-tailed)          .014           .
               N                        23             23
*. Correlation is significant at the .05 level (2-tailed).
Interpretation:
The correlation of -0.507 between satisfaction and anxiety is negative,
which means that satisfaction tends to decrease as anxiety increases
(and to increase as anxiety decreases).
Without the outlier:
Correlations (Spearman's rho)
                                        Satisfaction   anxiety
Satisfaction   Correlation Coefficient  1.000          -.723**
               Sig. (2-tailed)          .              .000
               N                        22             22
anxiety        Correlation Coefficient  -.723**        1.000
               Sig. (2-tailed)          .000           .
               N                        22             22
**. Correlation is significant at the .01 level (2-tailed).
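c. For the above cases, which correlation seems to be more “robust” (or less
sensitive) to outliers? Explain.
The Spearman correlation seems to be more “robust” (or less sensitive) to
outliers because it is the correlation between the ranks of the X values and
the ranks of the Y values, so an extreme observation counts only through its
rank. Here, removing subject 21 moves the Spearman correlation from 0.570 to
0.507 (severity) and from -0.507 to -0.723 (anxiety), while the Pearson
correlation moves from 0.8 to 0.5099 and from -0.2 to -0.7616.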
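The comparison can be reproduced for the anxiety data with a short script. This is a minimal Python sketch, assuming that the Spearman coefficient is the Pearson correlation of the ranks with tied values given the average of their ranks; the function names are illustrative and are not taken from SPSS.

```python
import math

# Satisfaction and anxiety for subjects 1..23 (in subject order).
satisfaction = [50, 40, 28, 42, 52, 29, 38, 53, 29, 33, 29, 43,
                36, 41, 49, 45, 29, 43, 34, 36, 89, 55, 44]
anxiety = [48, 66, 89, 46, 26, 89, 47, 57, 88, 60, 77, 60,
           57, 70, 36, 54, 77, 67, 51, 79, 90, 49, 52]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Subject 21 is at zero-based position 20.
anx_wo = anxiety[:20] + anxiety[21:]
sat_wo = satisfaction[:20] + satisfaction[21:]

pearson_with, pearson_without = pearson(anxiety, satisfaction), pearson(anx_wo, sat_wo)
spearman_with, spearman_without = spearman(anxiety, satisfaction), spearman(anx_wo, sat_wo)

print(f"Pearson:  with = {pearson_with:.3f}, without = {pearson_without:.3f}")
print(f"Spearman: with = {spearman_with:.3f}, without = {spearman_without:.3f}")
```

The Spearman coefficient shifts much less than the Pearson coefficient when subject 21 is dropped, which is the sense in which it is less sensitive to the outlier.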
d. Explain why satisfaction and anxiety may not be in a direct causal
relationship. (Why might it be misleading to call patient satisfaction the
“response?”) (This gives a justification for using correlation as the basis
for the analysis rather than regression.)
First of all, from the description of the data collection above, it is not clear
which comes first in chronological order: patient satisfaction (or lack
thereof) or anxiety. It may be that patients who were dissatisfied were
consequently more anxious. There may also be confounding variables:
severity of illness may serve as a confounder here; other alternatives
include suddenness of illness, quality of nurse care, etc.
e. Using anxiety as the predictor and removing the outlier from the data,
calculate a confidence interval for the population correlation from the set
of 22 individuals. Interpret this interval.
Let
\[ z = 0.5 \ln\left(\frac{1+r}{1-r}\right), \qquad z_U = z + \frac{1.96}{\sqrt{n-3}}, \qquad z_L = z - \frac{1.96}{\sqrt{n-3}} \]
The 95% Confidence Interval is
\[ \left( \frac{e^{2 z_L} - 1}{e^{2 z_L} + 1}, \; \frac{e^{2 z_U} - 1}{e^{2 z_U} + 1} \right) \]
With r = -.723 (Spearman correlation), the 95% confidence interval for the
population correlation is (-0.8772, -0.4335). I am 95% confident that the
population correlation falls between -0.8772 and -0.4335.
If you use r = -.763 (Pearson correlation), the 95% confidence interval for
the population correlation is (-0.8963, -0.5033). I am 95% confident that
the population correlation falls between -0.8963 and -0.5033.
Neither interval includes zero and, therefore, there is a significant
negative correlation between anxiety and satisfaction. One needs to be
careful here. Just because we don’t like the “outlier” doesn’t mean we
should necessarily exclude it from the analysis. In practice, we need to go
back to the original data, identify this outlying patient and see if there is a
mistake in recording his responses or perhaps a misunderstanding on his
part as to the questions, or maybe he was surveyed at a different point in
his hospital stay than the other patients. Something seems to be making
this patient very dissimilar from the rest of the population of patients.
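The confidence interval formula in this part can be evaluated in a few lines. The sketch below is an added illustration in Python: `fisher_ci` is a hypothetical helper, `math.atanh(r)` equals 0.5 ln((1 + r)/(1 - r)), and `math.tanh(z)` equals (e^{2z} - 1)/(e^{2z} + 1), so the code follows the same transform and back-transform as the formula.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation via Fisher's z."""
    z = math.atanh(r)                        # 0.5 * ln((1 + r) / (1 - r))
    half_width = z_crit / math.sqrt(n - 3)   # 1.96 / sqrt(n - 3) for 95%
    z_lower, z_upper = z - half_width, z + half_width
    # Back-transform each limit: tanh(z) = (e^(2z) - 1) / (e^(2z) + 1).
    return math.tanh(z_lower), math.tanh(z_upper)


spearman_ci = fisher_ci(-0.723, 22)
pearson_ci = fisher_ci(-0.763, 22)

print(f"Spearman r = -.723: ({spearman_ci[0]:.4f}, {spearman_ci[1]:.4f})")
print(f"Pearson  r = -.763: ({pearson_ci[0]:.4f}, {pearson_ci[1]:.4f})")
```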
f. What assumptions must be true for the confidence interval created above
to be reliable? Evaluate these assumptions. (If the assumptions are not
true, the confidence interval may not include the parameter 95% of the
time! … or equivalently, the error rate may not be .05.) (2 pts)
The following assumptions are evaluated:
• Subjects are a random, independent sample
Assumed. We don’t really have information about this.
• (Satisfaction and anxiety are “measured independently”)
• Satisfaction and anxiety are paired for each patient
True.
• Anxiety values were measured and not controlled
Obviously given the nature of the experiment, anxiety values would
not have been controlled because that would be unethical.
• Satisfaction and anxiety values sampled from Normal distributions
[Figure: histogram of satisfaction. Mean = 42.0, Std. Dev = 13.21, N = 23]
[Figure: histogram of anxiety. Mean = 62.4, Std. Dev = 17.73, N = 23]
With sample sizes of 23 for both measurements, we might expect these
histograms to look at least a bit unimodal and symmetric. Normality
may be more reasonable for satisfaction than for anxiety.
• Relationship between satisfaction and anxiety is linear.
Based on the scatterplot in part (b,iii) this assumption seems
reasonable. One cannot pick out a curved pattern in the scatterplot.