Uploaded by Hailelule Aleme

Linear Regression Exercises Students(2)

EBQ Course, Biostatistics Exercises: Linear Regression
Exercise 1:
Midwives often measure the estriol level in a urine sample from women in late pregnancy, since
the estriol level is believed to be related to the birth weight of the child. Hence, the test provides indirect
evidence of abnormally small fetuses. To investigate the relationship between the estriol level and the
birth weight of the baby, a study was conducted; its results are considered in this exercise.
1. Reproduce the table and figure based on the Estriol data.
2. Check the estimates a and b in R. Compare these estimates with the numbers in the aforementioned
Figure.
3. Try to calculate the values of a and b by hand.
4. Check whether the conditions of the linear regression model are fulfilled.
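For question 3, the hand calculation can be sketched as follows. This is only an illustration under stated assumptions: the vectors `x` and `y` are hypothetical toy numbers, not the estriol data, and a and b denote the intercept and slope of the fitted line y = a + bx.

```r
# Hypothetical toy data, NOT the estriol data from the study
x <- c(7, 9, 12, 14, 16, 17, 20, 24)
y <- c(25, 27, 30, 31, 33, 32, 38, 40)

# Corrected sum of squares of x and corrected sum of cross-products
Lxx <- sum((x - mean(x))^2)
Lxy <- sum((x - mean(x)) * (y - mean(y)))

# Least-squares estimates "by hand"
b <- Lxy / Lxx               # slope
a <- mean(y) - b * mean(x)   # intercept

# The same estimates from lm()
fit <- lm(y ~ x)
print(c(a = a, b = b))
print(coef(fit))
```

Both printed vectors should agree up to rounding, which is a quick way to check a hand calculation against R.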
Table 1: Rosner Table 11.3 p 478
SOURCE       DF   SS       MS       F       p
Regression    1   250.57   250.57   17.16   0.000
Error        29   423.43    14.6
Total        30   674.00
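The entries of Table 1 are linked by simple arithmetic, which can be verified in R. The sketch below uses only the sums of squares and degrees of freedom printed in the table; the object names are illustrative.

```r
# Sums of squares and degrees of freedom copied from Table 1
ss_reg <- 250.57; df_reg <- 1
ss_err <- 423.43; df_err <- 29

# Mean squares are sums of squares divided by their degrees of freedom
ms_reg <- ss_reg / df_reg
ms_err <- ss_err / df_err

# F statistic and its p-value under the F(1, 29) distribution
f_stat <- ms_reg / ms_err
p_val  <- pf(f_stat, df_reg, df_err, lower.tail = FALSE)

# Proportion of the total variation explained by the regression
r_sq <- ss_reg / (ss_reg + ss_err)

print(round(c(MS_error = ms_err, F = f_stat, R2 = r_sq), 3))
print(p_val)
```

The p-value is below 0.001, which is why the table reports it as 0.000 when rounded to three decimals.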
Exercise 2:
A study in El Paso, Texas, examined the relationship between exposure to lead and the mental
development of children. There are different ways of measuring exposure to lead. The method used in this
study is the lead level in the blood: the control group consists of children with levels below 40 µg/100 mL,
measured in 1972 and 1973 (CSCN2=0); in the group of exposed children the lead level exceeded
40 µg/100 mL in 1972 or 1973 (CSCN2=1). Important variables in the study are the number of finger-wrist
taps (FWT) per 10 seconds with the dominant hand and the score on the Wechsler full-scale IQ test (IQF).
(Rosner Ex. 11.50)
Use the file Lead_ch11.txt and answer the following questions:
1. Perform a t-test to compare MAXFWT in the groups CSCN2=0 and CSCN2=1. (Tip: use
var.test(MAXFWT ~ CSCN2, data=lead) and t.test(MAXFWT ~ CSCN2, data=lead, var.equal=...))
2. Show that fitting a linear regression model with Y = MAXFWT and X = CSCN2 gives
exactly the same result. In particular, what is the connection between the t-value and the F-value of the ANOVA
table derived from the linear regression model?
3. What are the assumptions in both approaches? (You don’t have to check those)
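The link asked for in question 2 can be explored on simulated data before turning to Lead_ch11.txt. The sketch below is an illustration only: `lead_sim` is a simulated stand-in whose column names mirror the exercise, and the group sizes, means and standard deviations are arbitrary assumptions.

```r
set.seed(1)

# Simulated stand-in for the lead data (NOT the real file)
lead_sim <- data.frame(
  CSCN2  = rep(c(0, 1), times = c(60, 35)),
  MAXFWT = c(rnorm(60, mean = 55, sd = 10), rnorm(35, mean = 48, sd = 10))
)

# Two-sample t-test assuming equal variances
tt <- t.test(MAXFWT ~ CSCN2, data = lead_sim, var.equal = TRUE)

# Simple linear regression on the 0/1 group indicator
fit <- lm(MAXFWT ~ CSCN2, data = lead_sim)
aov_tab <- anova(fit)

t_val <- unname(tt$statistic)
f_val <- aov_tab[["F value"]][1]
p_reg <- aov_tab[["Pr(>F)"]][1]

# The squared t statistic equals the F statistic, and the p-values coincide
print(c(t_squared = t_val^2, F = f_val))
print(c(p_ttest = tt$p.value, p_regression = p_reg))
```

Note that the equality only holds for the pooled-variance t-test (`var.equal = TRUE`); the Welch version uses different degrees of freedom.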
Exercise 3:
Data from a study on the use of antibiotics in hospitals in Pennsylvania are used in this exercise. The
dataset contains information about patients who were discharged from a hospital, collected as part of a
retrospective review of patient files.
Answer the following questions using the hospital dataset (hospital.txt). (Rosner: Problems 11.13 - 11.15)
1. (11.13) Find the best-fitting linear relationship between duration of hospitalisation and age.
2. (11.14) Test for significance of this relationship. State any underlying assumptions you have used.
3. Construct confidence intervals for the regression parameters.
4. Test whether the conditions of the linear regression model are satisfied. Are there any outliers in the
observations (check this using residual plots)?
5. (11.15) What is R² for this regression?
6. Construct a 95% prediction interval for the duration in the hospital of an individual of age 35 years
and a 99% confidence interval for the mean duration in the hospital for individuals of that age.
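The R functions needed for questions 3 to 6 can be tried out on simulated data first. In the sketch below, `hosp_sim` and its columns `age` and `dur_stay` are hypothetical names and simulated values; the variable names in hospital.txt may differ.

```r
set.seed(2)

# Simulated stand-in for the hospital data (NOT the real file)
hosp_sim <- data.frame(age = round(runif(25, min = 4, max = 90)))
hosp_sim$dur_stay <- 5 + 0.08 * hosp_sim$age + rnorm(25, sd = 4)

fit <- lm(dur_stay ~ age, data = hosp_sim)

# 95% confidence intervals for intercept and slope
print(confint(fit, level = 0.95))

# R squared of the fitted model
print(summary(fit)$r.squared)

# Observations with large standardized residuals (rule of thumb: |r| > 2)
print(which(abs(rstandard(fit)) > 2))

# Intervals for a patient aged 35
new_patient <- data.frame(age = 35)
pred_int <- predict(fit, newdata = new_patient,
                    interval = "prediction", level = 0.95)
conf_int <- predict(fit, newdata = new_patient,
                    interval = "confidence", level = 0.99)
print(pred_int)   # interval for one individual's duration
print(conf_int)   # interval for the mean duration at this age
```

The prediction interval is wider than the confidence interval because it also accounts for the variation of an individual observation around the mean.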
Exercise 4:
Suppose we are interested in the relation between carbon monoxide concentration and the density of cars in
a geographic area. The number of cars per hour (to the nearest 500 cars per hour) and the concentration
of carbon monoxide (CO) in parts per million at a particular street corner are measured, and the data are
grouped by cars per hour. The data are given in Table 11.19.
Table 2: Rosner Table 11.19 CO concentration and car density at a particular street corner
Cars/hour (×10³)   CO concentration (ppm)     Number of samples
1.0                9.0, 6.8, 7.7              3
1.5                9.6, 6.8, 11.3             3
2.0                12.3, 11.8                 2
3.0                20.7, 19.2, 21.6, 20.6     4
1. Construct a scatter plot and discuss. Do you expect a strong relationship between the number of cars
and the CO concentration?
2. (11.17) Is the CO concentration related to the number of cars per hour? Use an F-test to check this.
3. Draw the regression line and construct pointwise 95% confidence intervals and prediction intervals
around the estimated regression line.
4. (11.18) What is the average CO concentration if 2500 cars per hour are on the road?
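One possible starting point in R is to enter the twelve samples of Table 2 as individual observations. The data frame name `co` and its column names are illustrative choices; the values are taken from the table.

```r
# Table 2 entered as one row per sample; cars in thousands per hour
co <- data.frame(
  cars = rep(c(1, 1.5, 2, 3), times = c(3, 3, 2, 4)),
  conc = c(9.0, 6.8, 7.7,
           9.6, 6.8, 11.3,
           12.3, 11.8,
           20.7, 19.2, 21.6, 20.6)
)

fit <- lm(conc ~ cars, data = co)

# ANOVA table with the F-test for the slope
aov_tab <- anova(fit)
print(aov_tab)

# Pointwise 95% confidence and prediction bands on a grid of car densities
grid <- data.frame(cars = seq(1, 3, by = 0.25))
conf_band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.95)
print(cbind(grid, conf_band))
print(cbind(grid, pred_band))

# Estimated mean CO concentration at 2500 cars per hour (cars = 2.5)
est_2500 <- predict(fit, newdata = data.frame(cars = 2.5))
print(est_2500)
```

The bands can be added to a scatter plot of the data with `plot()`, `abline(fit)` and `lines()`.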
Exercise 5:
The data in Table 11.20 show the infant-mortality rates per 1000 live births in the United States for the
period 1960-1979 (Rosner 11.21 - 11.24).
Table 3: Rosner Table 11.20 U.S. infant-mortality rates per 1000 live births, 1960-1979
x      y        x      y
1960   26       1974   16.7
1965   24.7     1975   16.1
1970   20       1976   15.2
1971   19.1     1977   14.1
1972   18.5     1978   13.8
1973   17.7     1979   13
x=year, y=infant-mortality rate per 1000 live births
1. (11.21) Fit a linear regression line relating infant-mortality rate to chronological year using these data.
2. (11.22) Construct 95% confidence intervals for α and β.
3. (11.23) If the present trends continue for the next 10 years, then what would be the predicted infant-mortality rate in 1989?
4. (11.24) Provide a 95% prediction interval for the estimate in (11.23) (question 3).
5. Calculate the Pearson correlation coefficient.
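A minimal sketch of how these questions could be approached in R, with the values of Table 3 typed in directly. The names `imr`, `year` and `rate` are illustrative choices.

```r
# Table 3 entered by hand: year and infant-mortality rate per 1000 live births
imr <- data.frame(
  year = c(1960, 1965, 1970:1979),
  rate = c(26, 24.7, 20, 19.1, 18.5, 17.7, 16.7, 16.1, 15.2, 14.1, 13.8, 13)
)

fit <- lm(rate ~ year, data = imr)

# Estimated intercept and slope with 95% confidence intervals
print(coef(fit))
print(confint(fit, level = 0.95))

# Predicted rate in 1989 with a 95% prediction interval
pred_1989 <- predict(fit, newdata = data.frame(year = 1989),
                     interval = "prediction", level = 0.95)
print(pred_1989)

# Pearson correlation between year and rate
r <- cor(imr$year, imr$rate)
print(r)
```

Keep in mind that 1989 lies ten years beyond the observed period, so the prediction is an extrapolation and relies on the linear trend continuing.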
Exercise 6:
The “patient satisfaction data” can be found in “patsat.txt”. This study was conducted to investigate the
relationship between patient satisfaction and the age of the patient, the severity of the disease and an anxiety
index. In total, 46 patients were selected.
1. Investigate whether patient satisfaction depends on age, severity and/or anxiety.
2. We have already considered a partial F-test for one regressor. However, the partial F-test can
also be applied to a group of covariates. More specifically,
$$F = \frac{(\mathrm{ResSS}_{red} - \mathrm{ResSS}_{full}) / (k_{full} - k_{red})}{\mathrm{ResMS}_{full}} \sim F(k_{full} - k_{red},\; n - k_{full} - 1)$$
Use the partial F-test to test whether severity and anxiety have a joint significant effect on the outcome.
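The partial F statistic defined above can be computed step by step and compared with the result of `anova()` for two nested models. The sketch below runs on simulated data: `pat_sim`, its column names and all coefficients used to generate it are assumptions for illustration, not the contents of patsat.txt.

```r
set.seed(3)
n <- 46

# Simulated stand-in for the patient satisfaction data (NOT the real file)
pat_sim <- data.frame(
  age      = rnorm(n, mean = 40, sd = 9),
  severity = rnorm(n, mean = 50, sd = 4),
  anxiety  = rnorm(n, mean = 2.3, sd = 0.3)
)
pat_sim$satisfaction <- with(pat_sim,
  150 - 1.1 * age - 0.4 * severity - 13 * anxiety) + rnorm(n, sd = 10)

# Full model (k_full = 3 regressors) and reduced model (k_red = 1 regressor)
full <- lm(satisfaction ~ age + severity + anxiety, data = pat_sim)
red  <- lm(satisfaction ~ age, data = pat_sim)
k_full <- 3
k_red  <- 1

# Residual sums of squares and residual mean square of the full model
res_ss_full <- sum(resid(full)^2)
res_ss_red  <- sum(resid(red)^2)
res_ms_full <- res_ss_full / (n - k_full - 1)

# Partial F statistic and p-value, following the formula
f_stat <- ((res_ss_red - res_ss_full) / (k_full - k_red)) / res_ms_full
p_val  <- pf(f_stat, k_full - k_red, n - k_full - 1, lower.tail = FALSE)
print(c(F = f_stat, p = p_val))

# The same test via anova() on the nested models
cmp <- anova(red, full)
print(cmp)
```

The second row of the `anova()` output should show the same F value and p-value as the manual calculation.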