The Islamic University of Gaza
Faculty of Commerce
Economics & Applied Statistics Department
Regression Analysis – Dr. Samir Safi
02/12/2008

Final Project
Due Tuesday, December 21st, 2008

The primary objective of the Study on the Efficacy of Nosocomial Infection Control (SENIC Project) was to determine whether infection surveillance and control programs have reduced the rates of nosocomial (hospital-acquired) infection in United States hospitals. This data set consists of a random sample of 113 hospitals selected from the original 338 hospitals surveyed. The data presented are for the 1975-76 study period. The 12 variables are:

Variable (Name): Description

Identification number (ID): 1-113
Length of stay (Stay): Average length of stay of all patients in hospital (in days)
Age (Age): Average age of patients (in years)
Infection risk (Risk): Average estimated probability of acquiring infection in hospital (in percent)
Routine culturing ratio (Culturing): Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
Routine chest X-ray ratio (X_ray): Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100
Number of beds (Beds): Average number of beds in hospital during study period
Medical school affiliation (Affiliation): 1 = Yes, 2 = No
Region (Region): Geographic region, where 1 = NE, 2 = NW, 3 = S, 4 = W
Average daily census (Census): Average number of patients in hospital per day during study period
Number of nurses (Nurses): Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number part time)
Available facilities and services (Facilities): Percent of 35 potential facilities and services that are provided by the hospital

Reference: Special issue, "The SENIC Project," American Journal of Epidemiology 111 (1980), pp. 465-653. Data obtained from Robert W. Haley, M.D.,
Hospital Infections Program, Center for Infectious Disease, Center for Disease Control, Atlanta, Georgia 30333.

The data are in a file named PROJECT on the course website.

Refer to the SENIC data set. The average length of stay in a hospital (Y) is anticipated to be related to infection risk, available facilities and services, and routine chest X-ray ratio. Assume that the first-order regression model ($Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$) is appropriate for each of the three predictor variables.

a. Regress average length of stay on each of the three predictor variables. State the estimated regression functions.
b. Plot the three estimated regression functions and data on separate graphs. Does a linear relation appear to provide a good fit for each of the three predictor variables?
c. Calculate MSE for each of the three predictor variables. Which predictor variable leads to the smallest variability around the fitted regression line based on the MSE?
d. Using $r^2$ as the criterion, which predictor variable accounts for the largest reduction in the variability of the average length of stay?
e. For each of the three fitted regression models, obtain the residuals and prepare a residual plot against X and a normal probability plot. Summarize your conclusions. Is the linear regression model ($Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$) more appropriate in one case than in the others?

The second-order regression model ($Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$) is to be fitted for relating number of nurses (Y) to available facilities and services (X).

f. Fit the second-order regression model. Plot the residuals against the fitted values. How well does the second-order model appear to fit the data?
g. Obtain $R^2$ for the second-order regression model. Also obtain the coefficient of simple determination $r^2$ for the first-order regression model. Has the addition of the quadratic term in the regression model substantially increased the coefficient of determination?
h. Test whether the quadratic term can be dropped from the regression model; use $\alpha = .10$.
State the alternatives, decision rule, and conclusion.

Length of stay (Y) is to be predicted, and the pool of potential predictor variables includes all other variables in the data set except medical school affiliation and region. It is believed that a model with $\log_{10} Y$ as the response variable and the predictor variables in first-order terms with no interaction terms will be appropriate. Consider cases 57-113 to constitute the model-building data set to be used for the following analysis.

i. Obtain the scatter plot matrix and the correlation matrix of the X variables. Is there evidence of strong linear pairwise associations among the predictor variables here?
j. Obtain the three best subsets according to the $C_p$ criterion. Which of these subset models appears to have the smallest bias?

The regression model containing age, routine chest X-ray ratio, and average daily census in first-order terms is to be evaluated in detail on the model-building data set.

k. Obtain the residuals and plot them against $\hat{Y}$, the fitted values of length of stay, and against each of the predictor variables in the model. On the basis of these plots, should any modifications of the model be made?
l. Prepare a normal probability plot of the residuals. Test the reasonableness of the normality assumption using $\alpha = .05$. What do you conclude?
m. Obtain the scatter plot matrix, the correlation matrix of the X variables, and the variance inflation factors. Are there any indications that serious multicollinearity problems are present? Explain.
n. Obtain the studentized deleted residuals and prepare a dot plot of these residuals. Are any outliers present? State your conclusion.
o. Cases 62, 75, 106, and 112 are moderately outlying with respect to their X values, and case 87 is reasonably far outlying with respect to its Y value. Obtain DFFITS, DFBETAS, and Cook's distance values for these cases to assess their influence. What do you conclude?
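The diagnostics named in parts m through o can all be derived from the diagonal of the hat matrix. The following is a minimal sketch in Python with NumPy, offered only as an illustration and not as part of the course materials. The function names (influence_measures, variance_inflation_factors) are hypothetical, and the arrays X_demo and y_demo are synthetic placeholders, not the SENIC data; DFBETAS is omitted for brevity.

```python
import numpy as np

def influence_measures(X, y):
    """Leverage, studentized deleted residuals, DFFITS and Cook's distance.

    X is an (n, k) array of predictors WITHOUT the intercept column.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])           # design matrix with intercept
    p = Xd.shape[1]                                 # number of regression parameters
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    h = np.einsum("ij,jk,ik->i", Xd, XtX_inv, Xd)   # diagonal of the hat matrix
    b = XtX_inv @ Xd.T @ y                          # least-squares coefficients
    e = y - Xd @ b                                  # ordinary residuals
    sse = e @ e
    mse = sse / (n - p)
    # Studentized deleted residuals, computed without refitting n times
    t = e * np.sqrt((n - p - 1) / (sse * (1 - h) - e**2))
    dffits = t * np.sqrt(h / (1 - h))
    cooks_d = e**2 / (p * mse) * h / (1 - h) ** 2
    return {"leverage": h, "t_deleted": t, "dffits": dffits, "cooks_d": cooks_d}

def variance_inflation_factors(X):
    """VIFs as the diagonal of the inverse correlation matrix of the predictors."""
    R = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    return np.diag(np.linalg.inv(R))

# Synthetic placeholder data (NOT the SENIC data): 57 cases, 3 predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(57, 3))
y_demo = 1.0 + X_demo @ np.array([0.5, -0.2, 0.3]) + rng.normal(scale=0.1, size=57)

measures = influence_measures(X_demo, y_demo)
vif = variance_inflation_factors(X_demo)
```

To use this on the project data, X_demo and y_demo would be replaced by the three predictor columns and the response for cases 57-113; the entries of each returned array for the cases of interest can then be compared with the guidelines given in the course.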
Infection risk (Y) is to be regressed against length of stay (X1), age (X2), routine chest X-ray ratio (X3), and medical school affiliation (X4).

p. Fit a first-order regression model. Let X4 = 1 if the hospital has a medical school affiliation and 0 if not.
q. Estimate the effect of medical school affiliation on infection risk using a 98 percent confidence interval. Interpret your interval estimate.
r. It has been suggested that the effect of medical school affiliation on infection risk may interact with the effects of age and routine chest X-ray ratio. Add appropriate interaction terms to the regression model, fit the revised regression model, and test whether the interaction terms are helpful; use $\alpha = .10$. State the alternatives, decision rule, and conclusion.

Length of stay (Y) is to be regressed on age (X1), routine culturing ratio (X2), average daily census (X3), available facilities and services (X4), and region (X5, X6, X7).

s. Fit a first-order regression model. Let X5 = 1 if NE and 0 otherwise, X6 = 1 if NW and 0 otherwise, and X7 = 1 if S and 0 otherwise.
t. Test whether the routine culturing ratio can be dropped from the model; use a level of significance of .05. State the alternatives, decision rule, and conclusion.
u. Examine whether the effect on length of stay for hospitals located in the western region differs from that for hospitals located in the other three regions by constructing an appropriate confidence interval for each pairwise comparison. Summarize your findings.
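The indicator coding in part s and the "can this term be dropped" tests in parts r and t follow one pattern: fit a full and a reduced model and compare their error sums of squares. The sketch below, in Python with NumPy, is one possible way to set this up and is not a prescribed solution. The helper names (sse_and_df, general_linear_f, region_indicators) are hypothetical, and the data are synthetic stand-ins, not the SENIC data. The resulting statistic still has to be compared with the appropriate F critical value from a table.

```python
import numpy as np

def sse_and_df(X, y):
    """Least-squares fit with intercept; return (SSE, error degrees of freedom)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    e = y - Xd @ b
    return float(e @ e), len(y) - Xd.shape[1]

def general_linear_f(X_full, X_reduced, y):
    """F* = [(SSE_R - SSE_F) / (df_R - df_F)] / [SSE_F / df_F]."""
    sse_f, df_f = sse_and_df(X_full, y)
    sse_r, df_r = sse_and_df(X_reduced, y)
    return ((sse_r - sse_f) / (df_r - df_f)) / (sse_f / df_f)

def region_indicators(region):
    """Columns X5, X6, X7 for region codes 1, 2, 3; code 4 (W) is the baseline."""
    region = np.asarray(region)
    return np.column_stack([(region == code).astype(float) for code in (1, 2, 3)])

# Synthetic stand-in data (NOT the SENIC data)
rng = np.random.default_rng(1)
n = 113
region = rng.integers(1, 5, size=n)              # region codes 1..4
quant = rng.normal(size=(n, 4))                  # stand-ins for X1..X4
y = 2.0 + quant @ np.array([0.4, 0.0, 0.3, 0.2]) + rng.normal(scale=0.5, size=n)

X_full = np.column_stack([quant, region_indicators(region)])
X_reduced = np.delete(X_full, 1, axis=1)         # drop X2 (second column)
f_star = general_linear_f(X_full, X_reduced, y)  # 1 numerator df, n - 8 denominator df
```

Because W is the baseline category here, each region coefficient measures the difference between that region and W, which is the quantity part u asks about.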