Model Predictions

Module H8 Practical 9
Objectives:
By the end of this practical you should be able to:


• make predictions from a simple or multiple regression model;
• compute confidence intervals for a prediction, making use of the standard
errors for the prediction produced by statistical software;
• understand the difference between a confidence interval for a mean and
the confidence interval for a prediction concerning an individual
sampling unit;
• have greater confidence in model selection procedures.


1. Return to the same data set used in the previous practical, i.e. the country level data
available in the Stata file named mdg_africa.dta. Here you will look at relationships with
infant mortality. Some variables of relevance are reproduced below.
country   - name of country
sadc      - whether the country is SADC (1=yes, 0=no)
infmort   - Infant mortality rate per 1000 live births
povbelow  - Percent of the population below the poverty line (< $1 per day) - 2002
literacy  - Percent of adults who are literate (2000-2004)
bthatten  - Percent of births attended by skilled health personnel
uwater    - Urban population with access to improved water source
rwater    - Rural population with access to improved water source
usanit    - Urban population with access to improved sanitation
rsanit    - Rural population with access to improved sanitation
Primary Objective: To identify the best subset of factors affecting infant mortality and
use the resulting equation to predict infant mortality in the future with a more easily
measurable set of variables. (It is assumed here that infant mortality is difficult to measure
because of large numbers of unreported deaths).
SADC Course in Statistics
(a) Consider the four simple linear regressions below. Fit each regression and decide
which would be the most appropriate regression to consider for a prediction based on just
one explanatory variable, giving attention also to the ease with which the explanatory
variable may be measured in the future.
• Regression of infant mortality on literacy rate
• Regression of infant mortality on percent of births attended by skilled persons
• Regression of infant mortality on percent of rural population with access to an
improved water source
• Regression of infant mortality on percent of urban population with access to an
improved water source
Note down some of the key results in the table below.
Variable name | Adjusted R² | Estimate of residual variation (s²) | d.f. for s² | Equation of regression line
literacy      |             |                                     |             |
bthatten      |             |                                     |             |
rwater        |             |                                     |             |
uwater        |             |                                     |             |
Which explanatory variable did you choose, and why?
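The quantities asked for in the table above are normally read from your software's regression output. As a check on what they mean, the sketch below computes them by hand for a simple linear regression. It is illustrative only: the data values are invented toy numbers (not taken from mdg_africa.dta), and the function name fit_simple_regression is hypothetical.

```python
def fit_simple_regression(x, y):
    """Fit y = intercept + slope * x by least squares and return key summaries."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    # Residual and total sums of squares
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    tss = sum((yi - y_bar) ** 2 for yi in y)
    df = n - 2                          # d.f. for s2: n minus two estimated parameters
    s2 = rss / df                       # estimate of residual variation
    r2 = 1 - rss / tss
    adj_r2 = 1 - s2 / (tss / (n - 1))   # adjusted R-squared
    return {"intercept": intercept, "slope": slope, "s2": s2,
            "df": df, "r2": r2, "adj_r2": adj_r2}


# Invented toy values, for illustration only (NOT real country data).
toy_x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
toy_y = [130.0, 118.0, 104.0, 99.0, 85.0, 80.0, 62.0, 58.0]

fit = fit_simple_regression(toy_x, toy_y)
print("Equation: y = %.2f + (%.3f) x" % (fit["intercept"], fit["slope"]))
print("s2 = %.2f on %d d.f., adjusted R2 = %.3f" % (fit["s2"], fit["df"], fit["adj_r2"]))
```

Comparing s² and adjusted R² across the four candidate regressions is one way to fill in the table; ease of future measurement still has to be judged separately.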
(b) Keeping the variable chosen above, add each of the other three in turn to the model to
assess whether two explanatory variables would make a better predictor than one. Note
down your conclusions below.
(c) For the model you have chosen above,
• check the behaviour of the residuals; and then
• predict infant mortality for a country in which the explanatory variable(s)
take(s) a pre-specified value or values (you may choose what the pre-specified value
or values are).
• Determine the standard error of the prediction using your computer software. (Hint: if
the pre-specified value is not an x-value in your data set, you may need, depending on
the capabilities of your software, to add another “dummy” record to the data file with
a missing value for infant mortality and your pre-specified value as the x (or x’s)).
• Use your standard errors to determine a 95% confidence interval for the true value of a
predicted mean value as well as a predicted individual value.
Note down your results below.
(i) Prediction:
(ii) Standard error of the mean prediction:
(iii) 95% confidence interval for the true value of the predicted mean:
(iv) 95% confidence interval for the true value of a predicted individual value:
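The difference between items (iii) and (iv) can be seen in a small worked sketch for a simple linear regression. It assumes the usual least-squares formulas: the standard error for an individual value adds the residual variance s² to the squared standard error of the mean prediction. The data are invented toy numbers, the function name predict_with_intervals is hypothetical, and the t value 2.447 is the upper 2.5% point of the t distribution on 6 d.f. taken from standard tables.

```python
import math


def predict_with_intervals(x, y, x0, t_crit):
    """Predict y at x0 from a simple linear regression, with 95% intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)                      # residual variance estimate
    prediction = intercept + slope * x0
    # Standard error of the predicted MEAN at x0
    se_mean = math.sqrt(s2 * (1 / n + (x0 - x_bar) ** 2 / sxx))
    # Standard error for an INDIVIDUAL value at x0: adds the residual variance
    se_indiv = math.sqrt(s2 + se_mean ** 2)
    return {
        "prediction": prediction,
        "s2": s2,
        "se_mean": se_mean,
        "se_indiv": se_indiv,
        "ci_mean": (prediction - t_crit * se_mean, prediction + t_crit * se_mean),
        "ci_indiv": (prediction - t_crit * se_indiv, prediction + t_crit * se_indiv),
    }


# Invented toy values, for illustration only (NOT real country data).
toy_x = [20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0]
toy_y = [130.0, 118.0, 104.0, 99.0, 85.0, 80.0, 62.0, 58.0]
T_CRIT_6DF = 2.447  # t table value for 95% two-sided interval, 6 d.f.

result = predict_with_intervals(toy_x, toy_y, 65.0, T_CRIT_6DF)
print("Prediction: %.2f" % result["prediction"])
print("95%% CI for mean:       (%.2f, %.2f)" % result["ci_mean"])
print("95%% CI for individual: (%.2f, %.2f)" % result["ci_indiv"])
```

The interval for an individual value is always the wider of the two, because it allows for the scatter of single units about the regression line as well as the uncertainty in the line itself.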
(d) Note down your conclusions concerning the appropriateness of the above prediction
and any reservations you might have in using these predictions for making policy decisions.
2. Open the Stata file named Tabora_RuralWomen.dta. These data are a subset of the
data used in the lecture session and correspond to information on rural female headed
households.
Variables found to be most appropriate in a prediction equation for log consumption
expenditure (in variable lnexpdf) with the full data set were the following (see also
corresponding PowerPoint presentation notes):
• Household size (in variable hhsize);
• Squared household size (in variable hhsize2);
• Whether household had access to clean water;
• Number of days meat was eaten in past week (in variable qmeat);
• Number of days milk taken in past week (in variable qmilk);
• Whether household owned an iron (in variable iron);
• Whether household owned a table (in variable table);
• Whether household had paid for wheat flour in past month (in variable wheatf);
• Whether household had bought seed in the past 12 months (in variable seeds);
• Usual number of meals taken per day (in variable num_meal).
The aim is to examine whether the same subset of variables is also suitable for predicting
income poverty of just rural female headed households.
(a) Fit a regression model with all of the above as explanatory variables. How many of the
variables are now significant? Would you regard this model as being appropriate as a
prediction equation for just the rural female headed households? Note down reasons for
your answer below.
(b) Adopt a backward elimination procedure to delete variables one by one. At each stage
examine results from your regression model and decide which variable should be deleted at
the next stage from your model. Proceed until all remaining variables are significant in the
fitted model. Note down answers to the questions below.
(i) How much of the variability in lnexpdf is explained by this model?
(ii) What is your prediction equation?
(iii) Are all variables in your equation above easily measurable if the model equation is to
be used as an equation to predict income poverty? If not, which would you consider
omitting from the model?
(iv) Do the signs of the regression coefficients make sense? If not, would you be happy
including such variables in your model?
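The backward elimination procedure in part (b) can be sketched as a loop: fit the model, find the weakest remaining variable, drop it if it is not significant, and refit. The sketch below is a simplification under stated assumptions. It uses a fixed cut-off on |t| (2.0, roughly the 5% two-sided level for moderate degrees of freedom) rather than the p-values your software reports, the data are invented toy numbers, and the names ols, invert and backward_eliminate are hypothetical helpers, not functions from any package.

```python
import math


def invert(m):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    p = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(m)]
    for c in range(p):
        pivot = max(range(c, p), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        pv = aug[c][c]
        aug[c] = [v / pv for v in aug[c]]
        for r in range(p):
            if r != c:
                f = aug[r][c]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[c])]
    return [row[p:] for row in aug]


def ols(columns, y):
    """Least-squares fit with an intercept; returns coefficients, SEs, s2, d.f."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    inv = invert(xtx)
    coef = [sum(inv[j][k] * xty[k] for k in range(p)) for j in range(p)]
    resid = [yi - sum(c * v for c, v in zip(coef, r)) for r, yi in zip(rows, y)]
    df = n - p
    s2 = sum(e * e for e in resid) / df
    se = [math.sqrt(s2 * inv[j][j]) for j in range(p)]
    return coef, se, s2, df


def backward_eliminate(data, y, t_cutoff):
    """Drop the variable with the smallest |t| until all remaining |t| >= t_cutoff."""
    kept = list(data)
    while kept:
        coef, se, _, _ = ols([data[name] for name in kept], y)
        # Index 0 is the intercept, so variable i sits at position i + 1.
        t_abs = {name: abs(coef[i + 1] / se[i + 1]) for i, name in enumerate(kept)}
        weakest = min(kept, key=lambda name: t_abs[name])
        if t_abs[weakest] >= t_cutoff:
            break
        kept.remove(weakest)
    return kept


# Invented toy data: y depends strongly on x1; x2 plays no part in generating y.
toy = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0],
    "x2": [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0, 2.0, 8.0],
}
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.1]
toy_y = [5.0 + 2.0 * a + e for a, e in zip(toy["x1"], noise)]

kept = backward_eliminate(toy, toy_y, 2.0)
print("Variables retained:", kept)
```

An automatic rule like this is only a starting point. Questions (iii) and (iv) above ask for judgement about measurability and the signs of coefficients, which no cut-off can supply.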
(c) Write down the form of your final prediction equation (this may be the same as the
equation under (b)(ii) above).
(d) Examine the appropriateness of your prediction equation by following procedures
similar to those outlined in question 1 part (c). For convenience, these questions are
reproduced below.


• check the behaviour of the residuals; and then
• predict income poverty for households which take the following values for each of
the possible explanatory variables: hhsize=5; water = yes; qmeat = 1; qmilk = 0;
iron = 0; table = 1; wheatf = 0; seeds = 0; num_meal = 2.
• Determine the standard error of the prediction using your computer software.
(Hint: if the specified set of values for the explanatory variables is not in your
data set, depending on the capabilities of your software, you may need to add
another “dummy” record to the data file with a missing value for lnexpdf and
your specified explanatory variable values as the x’s).
• Use your standard errors to determine a 95% confidence interval for the true value
of a predicted mean value as well as a predicted individual value.
Note down your results below.
(i) Prediction:
(ii) Standard error of the prediction:
(iii) 95% confidence interval for the true value of the predicted mean:
(iv) 95% confidence interval for the true value of a predicted individual value:
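Two details are easy to overlook when working through the prediction above. First, if squared household size is still in your final model, the specified household needs a value for hhsize2 as well as for hhsize. Second, the response is on the log scale, so the prediction and both intervals refer to lnexpdf, not to expenditure itself. The note below assumes that lnexpdf is the natural logarithm of consumption expenditure (an assumption: the handout says only "log"), and the symbols L and U for the interval end points are introduced here purely for illustration.

```latex
\begin{align*}
\texttt{hhsize} = 5 \quad &\Rightarrow \quad \texttt{hhsize2} = 5^2 = 25 \\[4pt]
(L,\ U) \text{ for } \texttt{lnexpdf} \quad &\Rightarrow \quad
\left(e^{L},\ e^{U}\right) \text{ for consumption expenditure}
\end{align*}
```

For the individual value, exponentiating the end points gives an interval for that household's expenditure directly. For the mean, the back-transformed interval refers to the geometric mean of expenditure rather than its arithmetic mean, which is worth bearing in mind before using the result in a policy setting.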
(e) Note down your conclusions concerning the appropriateness of the above prediction
and any reservations you might have in using these predictions for making policy decisions.