EBQ Course, Biostatistics
Exercises: Linear Regression

Exercise 1:
Midwives often perform a test to measure the estriol level in a urine sample of heavily pregnant women, since the estriol level is believed to be related to the birth weight of the child. Hence, the test provides indirect evidence of an abnormally small fetus. In order to investigate the relationship between the estriol level and the birth weight of the baby, a study was conducted, and its results are considered in this exercise.

1. Reproduce the table and figure based on the Estriol data.
2. Check the estimates a and b in R. Compare these estimates with the numbers in the aforementioned figure.
3. Try to calculate the values of a and b by hand.
4. Check whether the conditions of the linear regression model are fulfilled.

Table 1: Rosner Table 11.3, p. 478

SOURCE        DF    SS        MS        F        p
Regression     1    250.57    250.57    17.16    0.000
Error         29    423.43     14.60
Total         30    674.00

Exercise 2:
A study in El Paso, Texas, examined the relationship between exposure to lead and the mental development of children. There are different ways of measuring exposure to lead. The method used in this study is based on the lead level in the blood: the control group consists of children whose lead level was less than 40 µg/100 mL in 1972 and 1973 (CSCN2=0); in the group of exposed children the lead level exceeded 40 µg/100 mL in 1972 or 1973 (CSCN2=1). Important variables in the study are the number of finger-wrist taps (FWT) per 10 seconds with the dominant hand and the score on the Wechsler full-scale IQ test (IQF). (Rosner Ex. 11.50)

Use the file Lead_ch11.txt and answer the following questions:

1. Perform a t-test to compare MAXFWT in the groups CSCN2=0 and CSCN2=1. (Tip: use var.test(MAXFWT ~ CSCN2, data = lead) and t.test(MAXFWT ~ CSCN2, data = lead, var.equal = ...))
2. Show that fitting a linear regression model with Y = MAXFWT and X = CSCN2 provides exactly the same results, i.e., what is the connection between the t-value and the F-value of the ANOVA table derived from the linear regression model?
3. What are the assumptions in both approaches? (You don't have to check those.)

Exercise 3:
Data from a study on the use of antibiotics in hospitals in Pennsylvania are used in this exercise. The dataset contains information about patients who were discharged from a hospital, collected as part of a retrospective review of patient files. Answer the following questions using the hospital dataset (hospital.txt). (Rosner: Problems 11.13-11.15)

1. (11.13) Find the best-fitting linear relationship between duration of hospitalisation and age.
2. (11.14) Test for significance of this relationship. State any underlying assumptions you have used.
3. Construct confidence intervals for the regression parameters.
4. Test whether the conditions of the linear regression model are satisfied. Are there any outliers in the observations (check this using residual plots)?
5. (11.15) What is R^2 for this regression?
6. Construct a 95% prediction interval for the duration in the hospital of an individual of age 35 years and a 99% confidence interval for the mean duration in the hospital for individuals of that age.

Exercise 4:
Suppose we are interested in the relation between carbon monoxide concentration and the density of cars in a geographic area. The number of cars per hour (to the nearest 500 cars per hour) and the concentration of carbon monoxide (CO) in parts per million at a particular street corner are measured, and the data are grouped by cars per hour. The data are given in Table 11.19.

Table 2: Rosner Table 11.19. CO concentration and car density at a particular street corner

Cars/hour (×10^3)    CO concentration (ppm)     Number of samples
1                    9, 6.8, 7.7                3
1.5                  9.6, 6.8, 11.3             3
2                    12.3, 11.8                 2
3                    20.7, 19.2, 21.6, 20.6     4

1. Construct a scatter plot and discuss. Do you expect a strong relationship between the number of cars and the CO concentration?
2. (11.17) Is the CO concentration related to the number of cars per hour? Use an F-test to check this.
3. Draw the regression line and construct pointwise 95% confidence intervals and prediction intervals around the estimated regression line.
4. (11.18) What is the average CO concentration if 2500 cars per hour are on the road?

Exercise 5:
The data in Table 11.20 show the infant-mortality rates per 1000 live births in the United States for the period 1960-1979. (Rosner 11.21-11.24)

Table 3: Rosner Table 11.20. U.S. infant-mortality rates per 1000 live births, 1960-1979

x       y       x       y
1960    26      1974    16.7
1965    24.7    1975    16.1
1970    20      1976    15.2
1971    19.1    1977    14.1
1972    18.5    1978    13.8
1973    17.7    1979    13

x = year, y = infant-mortality rate per 1000 live births

1. (11.21) Fit a linear regression line relating infant-mortality rate to chronological year using these data.
2. (11.22) Construct 95% confidence intervals for α and β.
3. (11.23) If the present trends continue for the next 10 years, what would be the predicted infant-mortality rate in 1989?
4. (11.24) Provide a 95% prediction interval for the estimate in question 3 (11.23).
5. Calculate the Pearson correlation coefficient.

Exercise 6:
The "patient satisfaction data" can be found in "patsat.txt". This study was conducted to investigate the relationship between patient satisfaction and the age of the patient, the severity of the disease and an anxiety index. In total, 46 patients were selected.

1. Investigate whether patient satisfaction depends on age, severity and/or anxiety.
2. We have already considered the partial F-test for a single regressor. The same test can also be applied to a group of covariates. More specifically,

$$F = \frac{(\mathrm{ResSS}_{\mathrm{red}} - \mathrm{ResSS}_{\mathrm{full}}) / (k_{\mathrm{full}} - k_{\mathrm{red}})}{\mathrm{ResMS}_{\mathrm{full}}} \sim F(k_{\mathrm{full}} - k_{\mathrm{red}},\; n - k_{\mathrm{full}} - 1),$$

where "red" and "full" refer to the reduced and the full model, and k denotes the number of regressors in the respective model. Use the partial F-test to test whether severity and anxiety have a joint significant effect on the outcome.
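The partial F statistic from Exercise 6 can be computed step by step in R. The following is only a minimal sketch under stated assumptions: the data are simulated as a stand-in for patsat.txt, and the column names (satisfaction, age, severity, anxiety) as well as the coefficients used in the simulation are hypothetical, not taken from the actual file. It computes F from the residual sums of squares and compares it with the result of anova() applied to the two nested models.

```r
# Simulated stand-in for patsat.txt; column names and coefficients are
# assumptions made for illustration only.
set.seed(1)
n <- 46
patsat <- data.frame(
  age      = rnorm(n, mean = 40, sd = 9),
  severity = rnorm(n, mean = 50, sd = 4),
  anxiety  = rnorm(n, mean = 2.3, sd = 0.3)
)
patsat$satisfaction <- 150 - 1.1 * patsat$age - 0.4 * patsat$severity -
  10 * patsat$anxiety + rnorm(n, sd = 10)

# Full model (all three regressors) and reduced model (age only)
full <- lm(satisfaction ~ age + severity + anxiety, data = patsat)
red  <- lm(satisfaction ~ age, data = patsat)

# Number of regressors, excluding the intercept
k_full <- length(coef(full)) - 1
k_red  <- length(coef(red)) - 1

# Residual sums of squares and residual mean square of the full model
res_ss_full <- sum(resid(full)^2)
res_ss_red  <- sum(resid(red)^2)
res_ms_full <- res_ss_full / (n - k_full - 1)

# Partial F statistic and its p-value
F_stat  <- ((res_ss_red - res_ss_full) / (k_full - k_red)) / res_ms_full
p_value <- pf(F_stat, k_full - k_red, n - k_full - 1, lower.tail = FALSE)

# The same test via the built-in comparison of nested models
tab <- anova(red, full)
print(c(F_by_hand = F_stat, F_anova = tab$F[2]))
print(c(p_by_hand = p_value, p_anova = tab$`Pr(>F)`[2]))
```

With the real data, the same steps should apply after reading the file with read.table() and replacing the hypothetical column names by those actually present in patsat.txt.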