PHCO 0504 – Introduction to Biostatistics
Homework #10 - Solutions
Due December 2nd at the beginning of class
Thursday, November 18, 2004

1. A person’s muscle mass is expected to decrease with age. To explore this relationship in women, a nutritionist randomly selected four women from each 10-year age group, beginning with age 40 and ending with age 79. The results follow; X is age, and Y is a measure of muscle mass.

X   71  64  43  67  56  73  68  56  76  65  46  58  45  53  49  78
Y   82  91 100  68  87  73  78  80  65  84 116  76  97 100 105  77

a. Create (and include here) a scatter plot of the values of muscle mass against age, with the best fit line added to the plot. Does it look like the assumption that the relationship between X and Y is linear is reasonable?

[Figure: Scatter plot of muscle mass against age with fitted line. Muscle mass = 148.14 - 1.02 * x; R-Square = 0.67]

Yes. The assumption that the relationship between X and Y is linear looks reasonable.

b. Create histograms of the X and the Y values (separately). Evaluate the assumption that both variables need to be normally distributed in order to draw inference from the calculated Pearson correlation.

[Figure: Histogram of age (Mean = 60.5, Std. Dev = 11.36, N = 16) and histogram of muscle mass (Mean = 86.2, Std. Dev = 14.22, N = 16)]

With a sample size of only 16, the histograms cannot show convincingly that either variable departs from a normal distribution, so we cannot say that the variables are not normally distributed.

Correlations
                                    Age        Muscle mass
Age          Pearson Correlation    1          -.818**
             Sig. (2-tailed)        .          .000
             N                      16         16
Muscle mass  Pearson Correlation    -.818**    1
             Sig. (2-tailed)        .000       .
             N                      16         16
**. Correlation is significant at the 0.01 level (2-tailed).

c. Fit the regression model to these data using SPSS. Include the output regarding the fit of the regression line (the “Coefficients”) and the coefficient of determination (the “Model Fit”).

Model Summary(b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .818(a)    .669       .645                8.470
a. Predictors: (Constant), Age
b. Dependent Variable: Muscle mass

Coefficients(a)
                  Unstandardized Coefficients   Standardized Coefficients
Model             B           Std. Error        Beta                        t         Sig.
1   (Constant)    148.141     11.837                                        12.515    .000
    Age           -1.024      .192              -.818                       -5.320    .000
a. Dependent Variable: Muscle mass

d. Formally interpret the slope of the regression function. (“If X increases by…”)

If a woman’s age increases by 1 year, we expect her measured muscle mass to decrease by 1.024 units on average.

e. Perform a hypothesis test (with all 5 steps) using the output above to see if there is a significant linear relationship between X and Y.

Assumptions:
Independent random sample of subjects (each XY pair)
Errors are normally distributed (We’ll get to diagnostics!)
Errors have the same variance regardless of the X value
The mean of Y is a linear function of X

Hypotheses: H0: β = 0 vs. HA: β ≠ 0

Test statistic: t = (b - 0)/SE(b) = -5.320

P-value: < 0.001 (SPSS reports Sig. = .000)

Conclusion: There is strong evidence (p-value < 0.001) that the measure of muscle mass is linearly related to age.

f. In order to check the assumptions of the hypothesis test above, create a histogram of the residuals. Use this histogram to evaluate one of the assumptions above.

[Figure: Histogram of the regression standardized residuals. Dependent variable: Muscle mass; Mean = 0.00, Std. Dev = .97, N = 16]

With only 16 cases, the histogram shows no clear departure from normality, so the assumption that the errors are normally distributed seems reasonable.

g. Evaluate the assumption of homoscedasticity by examining the original scatterplot of the data. (Does the assumption seem reasonable for this data? Why or why not?)
[Figure: Scatterplot. Dependent variable: Muscle mass; horizontal axis: Regression Standardized Predicted Value]

With only 16 cases, it is hard to claim that the variance is not constant, so the assumption of homoscedasticity seems reasonable for these data.

h. State the value of and interpret the coefficient of determination.

R^2 = 0.669, which means that 66.9% of the variation in Y (the measure of muscle mass) can be explained by a linear function of X (age).

i. Write down the estimate of the regression function. Use this regression to predict the muscle mass of a person who is 56 years old. When a person ages by two years, how much do you expect their muscle mass to change?

Regression function: estimated mean = a + bx = 148.141 - 1.024x.
For a person who is 56 years old, predicted value = 148.141 - 1.024*56 = 90.797.
When a person ages by two years, we expect their muscle mass to decrease by 2*1.024 = 2.048.

2. A hospital administrator wished to study the relation between patient satisfaction (Y) and patient’s age (in years), severity of illness (an index) and anxiety (an index). The administrator randomly selected 23 patients and collected the data presented below.

Subject   Satisfaction   Age   Severity   Anxiety
1         50             51    2.3        48
2         40             48    2.2        66
3         28             43    1.8        89
4         42             50    2.2        46
5         52             62    2.9        26
6         29             48    2.4        89
7         38             55    2.2        47
8         53             54    2.2        57
9         29             46    1.9        88
10        33             49    2.1        60
11        29             52    2.3        77
12        43             50    2.3        60
13        36             46    2.3        57
14        41             44    1.8        70
15        49             54    2.9        36
16        45             48    2.4        54
17        29             50    2.1        77
18        43             53    2.4        67
19        34             51    2.3        51
20        36             56    2.5        79
21        89             70    4.0        90
22        55             51    2.4        49
23        44             58    2.9        52

a. From the data given above, consider satisfaction to be the response and severity of illness to be the predictor. Include on this plot the best fit line (as calculated by SPSS).

i. Create, using SPSS, a scatter plot that includes the fitted line.
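The fitted line requested in part (i) can be cross-checked outside SPSS. The following is a minimal sketch in plain Python, assuming the data table above is transcribed correctly; the names `DATA`, `fit_line`, and `severity_fit` are hypothetical helpers introduced only for this illustration, not part of the assignment.

```python
# (subject, satisfaction, severity of illness), copied from the data table.
DATA = [
    (1, 50, 2.3), (2, 40, 2.2), (3, 28, 1.8), (4, 42, 2.2), (5, 52, 2.9),
    (6, 29, 2.4), (7, 38, 2.2), (8, 53, 2.2), (9, 29, 1.9), (10, 33, 2.1),
    (11, 29, 2.3), (12, 43, 2.3), (13, 36, 2.3), (14, 41, 1.8), (15, 49, 2.9),
    (16, 45, 2.4), (17, 29, 2.1), (18, 43, 2.4), (19, 34, 2.3), (20, 36, 2.5),
    (21, 89, 4.0), (22, 55, 2.4), (23, 44, 2.9),
]


def fit_line(points):
    """Least-squares fit of y = a + b*x; returns (a, b, r_squared)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in points)
    syy = sum((y - y_bar) ** 2 for _, y in points)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in points)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


def severity_fit(exclude=()):
    """Fit satisfaction on severity, optionally leaving out some subjects."""
    points = [(sev, sat) for subj, sat, sev in DATA if subj not in exclude]
    return fit_line(points)


if __name__ == "__main__":
    a, b, r2 = severity_fit()
    print(f"Satisfaction = {a:.2f} + {b:.2f} * severity, R-Square = {r2:.2f}")
```

The optional `exclude` argument refits the line with selected subjects left out, which is convenient for the with-and-without comparison asked for in part (iii).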
[Figure: Scatter plot of satisfaction against severity of illness with fitted line, all 23 subjects. Satisfaction = -12.52 + 22.90 * severity; R-Square = 0.64]

ii. Identify the outlier point.

The outlier is subject #21, with satisfaction (Y) value 89 and severity value 4.0.

iii. Calculate the fitted lines and the correlations (both given on the scatter plot with the fitted line) both with and without this point.

From the plot in part i, with the outlier the fitted line is Satisfaction = -12.52 + 22.90 * severity. The slope is positive, so the correlation = sqrt(R-Square) = sqrt(0.64) = 0.8.

[Figure: Scatter plot of satisfaction against severity of illness with fitted line, outlier removed. Satisfaction = 7.01 + 14.25 * severity; R-Square = 0.26]

Without the outlier, the fitted line is Satisfaction = 7.01 + 14.25 * severity. The correlation = sqrt(R-Square) = sqrt(0.26) = 0.5099.

iv. Does the outlier cause the correlation to be larger or smaller? Does it seem to have an influence on the slope of the line?

The outlier causes the correlation to be larger (0.8 with the outlier versus 0.5099 without it). It also influences the slope of the line (22.90 with the outlier versus 14.25 without it).

v. Calculate the Spearman Correlation both with and without the outlier. Interpret the correlation with the outlier.

With the outlier:

Correlations (Spearman's rho)
                                                Satisfaction   severity of illness
Satisfaction          Correlation Coefficient   1.000          .570**
                      Sig. (2-tailed)           .              .005
                      N                         23             23
severity of illness   Correlation Coefficient   .570**         1.000
                      Sig. (2-tailed)           .005           .
                      N                         23             23
**. Correlation is significant at the .01 level (2-tailed).

Interpretation: The Spearman correlation of 0.570 between satisfaction and severity of illness is positive, which means that satisfaction tends to increase as the severity of illness increases.

Without the outlier:

Correlations (Spearman's rho)
                                                Satisfaction   severity of illness
Satisfaction          Correlation Coefficient   1.000          .507*
                      Sig. (2-tailed)           .              .016
                      N                         22             22
severity of illness   Correlation Coefficient   .507*          1.000
                      Sig. (2-tailed)           .016           .
                      N                         22             22
*. Correlation is significant at the .05 level (2-tailed).

b. Consider satisfaction to be the response and anxiety to be the predictor.

i. Create, using SPSS, a scatter plot that includes the fitted line.

[Figure: Scatter plot of satisfaction against anxiety with fitted line, all 23 subjects. Satisfaction = 50.90 - 0.14 * anxiety; R-Square = 0.04]

ii. Identify the outlier point.

The outlier is subject #21, with satisfaction (Y) value 89 and anxiety value 90.

iii. Calculate the fitted lines and the correlations (both given on the scatter plot with the fitted line) both with and without this point.

From the plot in part i, with the outlier the fitted line is Satisfaction = 50.90 - 0.14 * anxiety. The slope is negative, so the correlation = -sqrt(R-Square) = -sqrt(0.04) = -0.2.

[Figure: Scatter plot of satisfaction against anxiety with fitted line, outlier removed. Satisfaction = 63.25 - 0.38 * anxiety; R-Square = 0.58]

Without the outlier, the fitted line is Satisfaction = 63.25 - 0.38 * anxiety. The correlation = -sqrt(R-Square) = -sqrt(0.58) = -0.7616. (Computed from the unrounded R-Square, the Pearson correlation is -.763.)

iv. Does the outlier cause the correlation to be larger or smaller? Does it seem to have an influence on the slope of the line?

The outlier causes the correlation to be smaller in absolute value (-0.2 with the outlier versus -0.7616 without it). It also influences the slope of the line (-0.14 with the outlier versus -0.38 without it).

v. Calculate the Spearman Correlation both with and without the outlier. Interpret the correlation with the outlier.

With the outlier:

Correlations (Spearman's rho)
                                         Satisfaction   anxiety
Satisfaction   Correlation Coefficient   1.000          -.507*
               Sig. (2-tailed)           .              .014
               N                         23             23
anxiety        Correlation Coefficient   -.507*         1.000
               Sig. (2-tailed)           .014           .
               N                         23             23
*. Correlation is significant at the .05 level (2-tailed).
Interpretation: The Spearman correlation of -0.507 between satisfaction and anxiety is negative, which means that satisfaction tends to decrease as anxiety increases (and to increase as anxiety decreases).

Without the outlier:

Correlations (Spearman's rho)
                                         Satisfaction   anxiety
Satisfaction   Correlation Coefficient   1.000          -.723**
               Sig. (2-tailed)           .              .000
               N                         22             22
anxiety        Correlation Coefficient   -.723**        1.000
               Sig. (2-tailed)           .000           .
               N                         22             22
**. Correlation is significant at the .01 level (2-tailed).

c. For the above cases, which correlation seems to be more “robust” (or less sensitive) to outliers? Explain.

The Spearman correlation seems to be more “robust” (or less sensitive) to outliers because it is the correlation between the ranks of the X values and Y values, so an extreme observation counts only through its rank. Removing subject #21 moves the Pearson correlation from 0.8 to 0.5099 for severity and from -0.2 to -0.7616 for anxiety, while the Spearman correlation moves only from .570 to .507 and from -.507 to -.723.

d. Explain why satisfaction and anxiety may not be in a direct causal relationship. (Why might it be misleading to call patient satisfaction the “response?”) (This gives a justification for using correlation as the basis for the analysis rather than regression.)

First of all, from the description of the data collection above, it is not clear which comes first in chronological order: patient satisfaction (or lack thereof) or anxiety. It may be that patients who were dissatisfied were consequently more anxious. There may also be confounding variables: severity of illness may serve as a confounder here; other alternatives include suddenness of illness, quality of nurse care, etc.

e. Using anxiety as the predictor and removing the outlier from the data, calculate a confidence interval for the population correlation from the set of 22 individuals. Interpret this interval.

Let

\[ z = 0.5 \ln\left(\frac{1+r}{1-r}\right), \qquad z_U = z + \frac{1.96}{\sqrt{n-3}}, \qquad z_L = z - \frac{1.96}{\sqrt{n-3}}. \]

The 95% confidence interval is

\[ \left( \frac{e^{2 z_L} - 1}{e^{2 z_L} + 1}, \; \frac{e^{2 z_U} - 1}{e^{2 z_U} + 1} \right). \]

With r = -.723 (Spearman correlation) and n = 22, the 95% confidence interval for the population correlation is (-0.8772, -0.43354). I am 95% confident that the population correlation falls between -0.8772 and -0.43354.
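The interval above can be reproduced with a few lines of code. This is a minimal sketch of the z-transformation interval defined in part (e), using only the Python standard library; the function name `fisher_ci` and the default `z_crit=1.96` are assumptions made for this illustration.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Confidence interval for a population correlation via the z-transform.

    r: sample correlation, n: number of pairs, z_crit: normal critical value.
    """
    z = 0.5 * math.log((1 + r) / (1 - r))      # z = 0.5 ln((1+r)/(1-r))
    half_width = z_crit / math.sqrt(n - 3)     # 1.96 / sqrt(n-3)
    z_lower, z_upper = z - half_width, z + half_width

    def back_transform(value):
        # (e^{2z} - 1) / (e^{2z} + 1), the inverse of the z-transform
        e = math.exp(2 * value)
        return (e - 1) / (e + 1)

    return back_transform(z_lower), back_transform(z_upper)


if __name__ == "__main__":
    lower, upper = fisher_ci(-0.723, 22)
    print(f"95% CI for r = -0.723, n = 22: ({lower:.4f}, {upper:.4f})")
```

Calling `fisher_ci` with a different `r` gives the corresponding interval for any other correlation estimate computed from the same 22 patients.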
If you use r = -.763 (Pearson correlation), the 95% confidence interval for the population correlation is (-0.8963, -0.5033). I am 95% confident that the population correlation falls between -0.8963 and -0.5033.

Neither interval includes zero and, therefore, there is a significant negative correlation between anxiety and satisfaction.

One needs to be careful here. Just because we don’t like the “outlier” doesn’t mean we should necessarily exclude it from the analysis. In practice, we need to go back to the original data, identify this outlying patient and see if there is a mistake in recording his responses or perhaps a misunderstanding on his part as to the questions, or maybe he was surveyed at a different point in his hospital stay than the other patients. Something seems to be making this patient very dissimilar from the rest of the population of patients.

f. What assumptions must be true for the confidence interval created above to be reliable? Evaluate these assumptions. (If the assumptions are not true, the confidence interval may not include the parameter 95% of the time! … or equivalently, the error rate may not be .05.) (2 pts)

The following assumptions are evaluated:

Subjects are a random, independent sample: Assumed. We don’t really have information about this. (Satisfaction and anxiety are “measured independently.”)

Satisfaction and anxiety are paired for each patient: True.

Anxiety values were measured and not controlled: Obviously, given the nature of the experiment, anxiety values would not have been controlled because that would be unethical.

Satisfaction and anxiety values are sampled from Normal distributions:

[Figure: Histogram of satisfaction (Mean = 42.0, Std. Dev = 13.21, N = 23) and histogram of anxiety (Mean = 62.4, Std. Dev = 17.73, N = 23)]

With sample sizes of 23 for both measurements, we might expect these histograms to look at least a bit unimodal and symmetric. Normality may be more reasonable for satisfaction than for anxiety.

Relationship between satisfaction and anxiety is linear: Based on the scatterplot in part (b, iii), this assumption seems reasonable. One cannot pick out a curved pattern in the scatterplot.
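As a cross-check on the robustness comparison in part 2(c), the Pearson and Spearman correlations between satisfaction and anxiety can be recomputed with and without subject #21. The sketch below is plain Python and assumes the Problem 2 data table is transcribed correctly; `pearson`, `average_ranks`, `spearman`, and `anxiety_correlations` are hypothetical helper names, and the Spearman coefficient is taken here to be the Pearson correlation of average ranks (ties share the mean of their positions).

```python
import math

# (subject, satisfaction, anxiety), copied from the Problem 2 data table.
DATA = [
    (1, 50, 48), (2, 40, 66), (3, 28, 89), (4, 42, 46), (5, 52, 26),
    (6, 29, 89), (7, 38, 47), (8, 53, 57), (9, 29, 88), (10, 33, 60),
    (11, 29, 77), (12, 43, 60), (13, 36, 57), (14, 41, 70), (15, 49, 36),
    (16, 45, 54), (17, 29, 77), (18, 43, 67), (19, 34, 51), (20, 36, 79),
    (21, 89, 90), (22, 55, 49), (23, 44, 52),
]


def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values all receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of positions start+1 .. end+1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation as the Pearson correlation of average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


def anxiety_correlations(exclude=()):
    """Return (pearson, spearman) for anxiety vs. satisfaction."""
    rows = [(anx, sat) for subj, sat, anx in DATA if subj not in exclude]
    xs = [anx for anx, _ in rows]
    ys = [sat for _, sat in rows]
    return pearson(xs, ys), spearman(xs, ys)


if __name__ == "__main__":
    for label, excluded in (("with subject 21", ()), ("without subject 21", (21,))):
        p, s = anxiety_correlations(excluded)
        print(f"{label}: Pearson = {p:.3f}, Spearman = {s:.3f}")
```

Comparing the two printed lines shows how much each coefficient moves when the single outlying patient is dropped, which is the comparison part 2(c) asks about.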