Final Project

The Islamic University of Gaza
Faculty of Commerce
Economics & Applied Statistics Department
Regression Analysis – Dr. Samir Safi
02/12/2008
Due Tuesday, December 21st, 2008
The primary objective of the Study on the Efficacy of Nosocomial Infection Control (SENIC Project)
was to determine whether infection surveillance and control programs have reduced the rates of
nosocomial (hospital-acquired) infection in United States hospitals. This data set consists of a random
sample of 113 hospitals selected from the original 338 hospitals surveyed. The data presented are for the
1975-76 study period. The 12 variables are:
Variable Name                                   Description
Identification number (ID)                      1-113
Length of stay (Stay)                           Average length of stay of all patients in hospital (in days)
Age (Age)                                       Average age of patients (in years)
Infection risk (Risk)                           Average estimated probability of acquiring infection in hospital (in percent)
Routine culturing ratio (Culturing)             Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
Routine chest X-ray ratio (X_ray)               Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100
Number of beds (Beds)                           Average number of beds in hospital during study period
Medical school affiliation (Affiliation)        1 = Yes, 2 = No
Region (Region)                                 Geographic region, where: 1 = NE, 2 = NW, 3 = S, 4 = W
Average daily census (Census)                   Average number of patients in hospital per day during study period
Number of nurses (Nurses)                       Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number part time)
Available facilities and services (Facilities)  Percent of 35 potential facilities and services that are provided by the hospital
Reference: Special issue, “The SENIC Project,” American Journal of Epidemiology 111 (1980), pp. 465-653. Data obtained
from Robert W. Haley, M.D., Hospital Infections Program, Center for Infectious Disease, Center for Disease Control, Atlanta,
Georgia 30333.
The data are in a file named PROJECT on the course website.
Refer to the SENIC data set. The average length of stay in a hospital (Y) is anticipated to be
related to infection risk, available facilities and services, and routine chest X-ray ratio. Assume
that the first-order regression model ($Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$) is appropriate for each of the three
predictor variables.
a. Regress average length of stay on each of the three predictor variables. State the estimated
regression function.
b. Plot the three estimated regression functions and data on separate graphs. Does a linear relation
appear to provide a good fit for each of the three predictor variables?
c. Calculate MSE for each of the three predictor variables. Which predictor variable leads to the
smallest variability around the fitted regression line based on the MSE?
d. Using $r^2$ as the criterion, which predictor variable accounts for the largest reduction in the
variability of the average length of stay?
e. For each of the three fitted regression models, obtain the residuals and prepare a residual plot
against X and a normal probability plot. Summarize your conclusions. Is the linear regression model
($Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$) more appropriate in one case than in the others?
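The quantities asked for in parts (a), (c), and (d) can be sketched in a few lines. This is a minimal pure-Python illustration, not a prescribed solution: the function name `fit_simple` and the toy numbers are assumptions made for the example and are not taken from the PROJECT file.

```python
def fit_simple(x, y):
    """Least-squares fit of y = b0 + b1*x; returns estimates, MSE, r^2, residuals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx                      # slope estimate
    b0 = y_bar - b1 * x_bar             # intercept estimate
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    ssto = sum((yi - y_bar) ** 2 for yi in y)
    return {
        "b0": b0,
        "b1": b1,
        "mse": sse / (n - 2),           # two parameters estimated
        "r2": 1 - sse / ssto,
        "residuals": residuals,
    }


# Toy values (hypothetical, for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.1, 5.9, 8.2, 9.8]
fit = fit_simple(x, y)
print(fit["b0"], fit["b1"], fit["mse"], fit["r2"])
```

Running the same function once per predictor gives three MSE and $r^2$ values that can be compared directly, and the returned residuals can be plotted against X for part (e).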
A second-order regression model ($Y_i = \beta_0 + \beta_1 X_i + \beta_2 X_i^2 + \varepsilon_i$) is to be fitted for relating
number of nurses (Y) to available facilities and services (X).
f. Fit the second-order regression model. Plot the residuals against the fitted values. How well does
the second-order model appear to fit the data?
g. Obtain $R^2$ for the second-order regression model. Also obtain the coefficient of simple
determination $r^2$ for the first-order regression model. Has the addition of the quadratic term in the
regression model substantially increased the coefficient of determination?
h. Test whether the quadratic term can be dropped from the regression model; use $\alpha = .10$. State
the alternatives, decision rule, and conclusion.
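One way to approach part (h) is the general linear test, which compares the error sum of squares of the full (second-order) and reduced (first-order) models. The sketch below is a pure-Python illustration under stated assumptions: the helpers `solve` and `sse_of_fit` are hypothetical names, the predictor is centered before squaring (a common choice, not something the assignment requires), and the data are made up.

```python
def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def sse_of_fit(columns, y):
    """Fit y on an intercept plus the given predictor columns; return (coefficients, SSE)."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    b = solve(xtx, xty)                 # normal equations (X'X) b = X'y
    sse = sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
              for row, yi in zip(design, y))
    return b, sse


# Hypothetical toy data, not SENIC values
x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0]
y = [45.0, 70.0, 110.0, 170.0, 250.0, 340.0]
n = len(y)
x_bar = sum(x) / n
xc = [xi - x_bar for xi in x]           # centered predictor
xc2 = [v ** 2 for v in xc]              # quadratic term

b_full, sse_full = sse_of_fit([xc, xc2], y)   # full model: 3 parameters
b_red, sse_red = sse_of_fit([xc], y)          # reduced model: 2 parameters

# F* = [(SSE_R - SSE_F) / 1] / [SSE_F / (n - 3)]
f_star = (sse_red - sse_full) / (sse_full / (n - 3))
print(f_star)
```

The statistic would then be compared with the tabled value $F(1-\alpha;\, 1,\, n-3)$; the sketch deliberately stops short of looking up that critical value.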
Length of stay (Y) is to be predicted, and the pool of potential predictor variables includes all
other variables in the data set except medical school affiliation and region. It is believed that a
model with $\log_{10} Y$ as the response variable and the predictor variables in first-order terms
with no interaction terms will be appropriate. Consider cases 57-113 to constitute the model-building data set to be used for the following analysis.
i. Obtain the scatter plot matrix and the correlation matrix of the X variables. Is there evidence of
strong linear pairwise associations among the predictor variables here?
j. Obtain the three best subsets according to the Cp criterion. Which of these subset models appears
to have the smallest bias?
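For part (j), Mallows' criterion for a subset model with $p$ parameters is $C_p = SSE_p / MSE_{full} - (n - 2p)$, where $MSE_{full}$ comes from the model containing the whole predictor pool. The following pure-Python sketch enumerates every subset; the helper names (`solve`, `fit_sse`, `mallows_cp`) and the toy data are assumptions for illustration only.

```python
from itertools import combinations


def solve(a, b):
    """Solve the square system a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for r in range(n - 1, -1, -1):
        z[r] = (m[r][n] - sum(m[r][c] * z[c] for c in range(r + 1, n))) / m[r][r]
    return z


def fit_sse(columns, y):
    """SSE from regressing y on an intercept plus the given predictor columns."""
    design = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(design[0])
    xtx = [[sum(row[j] * row[k] for row in design) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(design, y)) for j in range(p)]
    b = solve(xtx, xty)
    return sum((yi - sum(bj * xj for bj, xj in zip(b, row))) ** 2
               for row, yi in zip(design, y))


def mallows_cp(predictors, y):
    """Return {subset of names: Cp} for every non-empty subset of the pool."""
    n = len(y)
    names = list(predictors)
    mse_full = fit_sse([predictors[k] for k in names], y) / (n - (len(names) + 1))
    table = {}
    for size in range(1, len(names) + 1):
        for subset in combinations(names, size):
            p = size + 1                      # parameters, including the intercept
            sse_p = fit_sse([predictors[k] for k in subset], y)
            table[subset] = sse_p / mse_full - (n - 2 * p)
    return table


# Hypothetical toy pool of three predictors (not SENIC values)
pool = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [5.0, 3.0, 8.0, 1.0, 9.0, 2.0, 7.0, 4.0],
}
y = [3.1, 3.9, 7.2, 6.8, 11.1, 10.2, 15.3, 13.9]
cp = mallows_cp(pool, y)
best_three = sorted(cp, key=cp.get)[:3]       # smallest Cp first
print(best_three)
```

By construction the full model always has $C_p = p$, so it is the subsets with fewer predictors whose $C_p$ values carry information; a value close to $p$ is usually read as a sign of little bias.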
The regression model containing age, routine chest X-ray ratio, and average daily census in
first-order terms is to be evaluated in detail on the model-building data set.
k. Obtain the residuals and plot them against $\hat{Y}$, the fitted values of length of stay, and each of the
predictor variables in the model. On the basis of these plots, should any modifications of the
model be made?
l. Prepare a normal probability plot of the residuals. Test the reasonableness of the normality
assumption using $\alpha = .05$. What do you conclude?
m. Obtain the scatter plot matrix, the correlation matrix of the X variables, and the variance inflation
factors. Are there any indications that serious multicollinearity problems are present? Explain.
n. Obtain the studentized deleted residuals and prepare a dot plot of these residuals. Are any
outliers present? State your conclusion.
o. Cases 62, 75, 106, and 112 are moderately outlying with respect to their X values, and case 87 is
reasonably far outlying with respect to its Y value. Obtain DFFITS, DFBETAS, and Cook’s
distance values for these cases to assess their influence. What do you conclude?
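For part (l), one common approach is to correlate the ordered residuals with their expected values under normality and compare the coefficient with a tabled critical value. The sketch below assumes the approximation $\sqrt{MSE}\, z\big((k - .375)/(n + .25)\big)$ for the expected value of the k-th smallest residual; the function names, the toy residuals, and the choice of four estimated parameters are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist


def expected_under_normality(n, mse):
    """Approximate expected values of the n ordered residuals under normality."""
    z = NormalDist()  # standard normal
    return [sqrt(mse) * z.inv_cdf((k - 0.375) / (n + 0.25)) for k in range(1, n + 1)]


def correlation(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    u_bar, v_bar = sum(u) / n, sum(v) / n
    suv = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    suu = sum((a - u_bar) ** 2 for a in u)
    svv = sum((b - v_bar) ** 2 for b in v)
    return suv / sqrt(suu * svv)


# Hypothetical residuals from a model with 4 estimated parameters (assumed)
residuals = [-1.2, 0.4, 0.9, -0.3, 0.1, -0.6, 1.1, -0.4]
n_params = 4
n = len(residuals)
mse = sum(e ** 2 for e in residuals) / (n - n_params)

expected = expected_under_normality(n, mse)
r = correlation(sorted(residuals), expected)
print(r)
```

A coefficient close to 1 is consistent with normal error terms. The decision rule needs the critical value for the given n and $\alpha$ from a table, which the sketch does not attempt to reproduce.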
Infection risk (Y) is to be regressed against length of stay (X1), age (X2), routine chest X-ray
ratio (X3), and the medical school affiliation (X4).
p. Fit a first-order regression model. Let X4 = 1 if the hospital has medical school affiliation and 0 if
not.
q. Estimate the effect of medical school affiliation on infection risk using a 98 percent confidence
interval. Interpret your interval estimate.
r. It has been suggested that the effect of medical school affiliation on infection risk may interact
with the effects of age and routine chest X-ray ratio. Add appropriate interaction terms to the
regression model, fit the revised regression model, and test whether the interaction terms are
helpful; use $\alpha = .10$. State the alternatives, decision rule, and conclusion.
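Parts (p) and (r) both depend on how the affiliation variable is coded. The data set stores it as 1 = Yes, 2 = No, so it has to be recoded to the 1/0 indicator X4 before interaction terms are formed. A minimal sketch, with hypothetical function names and made-up input values:

```python
def affiliation_indicator(code):
    """Map the data set's coding (1 = Yes, 2 = No) to the indicator X4 (1 / 0)."""
    if code not in (1, 2):
        raise ValueError(f"unexpected affiliation code: {code!r}")
    return 1 if code == 1 else 0


def design_row(stay, age, xray, affiliation_code):
    """One row of the design matrix for the model with interactions.

    Columns: intercept, X1 (stay), X2 (age), X3 (X-ray ratio), X4,
    X2*X4, X3*X4.
    """
    x4 = affiliation_indicator(affiliation_code)
    return [1.0, stay, age, xray, x4, age * x4, xray * x4]


# Hypothetical hospitals, for illustration only
print(design_row(9.0, 55.0, 80.0, 1))  # affiliated: interaction columns equal age and xray
print(design_row(9.0, 55.0, 80.0, 2))  # not affiliated: interaction columns are zero
```

With this layout the interaction model has seven parameters, so testing whether both interaction terms can be dropped compares the models with and without the last two columns through $F^* = [(SSE_R - SSE_F)/2] / [SSE_F/(n - 7)]$.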
Length of stay (Y) is to be regressed on age (X1), routine culturing ratio (X2), average daily
census (X3), available facilities and services (X4), and region (X5, X6, X7).
s. Fit a first-order regression model. Let X5 = 1 if NE and 0 otherwise, X6 = 1 if NW and 0
otherwise, and X7 = 1 if S and 0 otherwise.
t. Test whether the routine culturing ratio can be dropped from the model; use a level of significance of
.05. State the alternatives, decision rule, and conclusion.
u. Examine whether the effect on length of stay for hospitals located in the western region differs
from that for hospitals located in the other three regions by constructing an appropriate
confidence interval for each pairwise comparison. Summarize your findings.
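The indicator coding in part (s) makes the western region the reference class, which is what lets part (u) be answered from the fitted coefficients. A minimal sketch of the coding, using the region codes listed in the variable table; the function and constant names are assumptions for illustration.

```python
REGION_LABELS = {1: "NE", 2: "NW", 3: "S", 4: "W"}  # codes as listed in the variable table


def region_indicators(code):
    """Return (X5, X6, X7); the western region (code 4) is the reference class."""
    if code not in REGION_LABELS:
        raise ValueError(f"unexpected region code: {code!r}")
    return (int(code == 1), int(code == 2), int(code == 3))


for code, label in REGION_LABELS.items():
    print(label, region_indicators(code))
```

Under this coding the coefficients of X5, X6, and X7 measure how mean length of stay in the NE, NW, and S regions differs from the western region with the other predictors held fixed, so each pairwise comparison with the west is an interval for one of those three coefficients. If a joint confidence level is wanted for the three intervals, a Bonferroni adjustment is one possible choice.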