SADC Course in Statistics
Module H8 Practical 9
Model Predictions

Objectives:
By the end of this practical you should be able to:
  - make predictions from a simple or multiple regression model
  - compute confidence intervals for a prediction, making use of standard errors for the prediction produced from statistical software
  - understand the difference between a confidence interval for a mean and the confidence interval for a prediction concerning an individual sampling unit
  - have greater confidence in model selection procedures

1. Return to the same data set used in the previous practical, i.e. the country level data available in the Stata file named mdg_africa.dta. Here you will look at relationships with infant mortality. Some variables of relevance are reproduced below.

    country  - Name of country
    sadc     - Whether the country is SADC (1=yes, 0=no)
    infmort  - Infant mortality rate per 1000 live births
    povbelow - Percent of the population below the poverty line (< $1 per day), 2002
    literacy - Percent of adults who are literate (2000-2004)
    bthatten - Percent of births attended by skilled health personnel
    uwater   - Urban population with access to improved water source
    rwater   - Rural population with access to improved water source
    usanit   - Urban population with access to improved sanitation
    rsanit   - Rural population with access to improved sanitation

Primary Objective: To identify the best subset of factors affecting infant mortality and use the resulting equation to predict infant mortality in the future with a more easily measurable set of variables. (It is assumed here that infant mortality is difficult to measure because of large numbers of unreported deaths.)

(a) Consider the four simple linear regressions below.
Fit each regression and decide which would be the most appropriate regression to consider for a prediction based on just one explanatory variable, giving attention also to the ease with which the explanatory variable may be measured in the future.
  - Regression of infant mortality on literacy rate
  - Regression of infant mortality on percent of births attended by skilled persons
  - Regression of infant mortality on percent of rural population with access to an improved water source
  - Regression of infant mortality on percent of urban population with access to an improved water source

Note down some of the key results in the table below.

    Variable name | Adjusted R^2 | Estimate of residual variation (s^2) | d.f. for s^2 | Equation of regression line
    literacy      |              |                                      |              |
    bthatten      |              |                                      |              |
    rwater        |              |                                      |              |
    uwater        |              |                                      |              |

Which explanatory variable did you choose, and why?

(b) Keeping the variable chosen above, add each of the other three in turn to the model to assess whether two explanatory variables would make a better predictor than one. Note down your conclusions below.

(c) For the model you have chosen above, check the behaviour of the residuals, and then predict infant mortality for a country in which the explanatory variable(s) take a pre-specified value or values (you may choose what the pre-specified value or values are). Determine the standard error of the prediction using your computer software. (Hint: if the pre-specified value is not an x-value in your data set, you may need, depending on the capabilities of your software, to add another "dummy" response in the data file with a missing value for infant mortality and your pre-specified value as the x (or x's).)

Use your standard errors to determine a 95% confidence interval for the true value of a predicted mean value as well as a predicted individual value. Note down your results below.
(i) Prediction:
(ii) Standard error of the mean prediction:
(iii) 95% confidence interval for the true value of the predicted mean:
(iv) 95% confidence interval for the true value of a predicted individual value:

(d) Note down your conclusions concerning the appropriateness of the above prediction and any reservations you might have in using these predictions for making policy decisions.

2. Open the Stata file named Tabora_RuralWomen.dta. These data are a subset of the data used in the lecture session and correspond to information on rural female headed households. Variables found to be most appropriate in a prediction equation for log consumption expenditure (in variable lnexpdf) with the full data set were the following (see also the corresponding PowerPoint presentation notes):
  - Household size (in variable hhsize);
  - Squared household size (in variable hhsize2);
  - Whether household had access to clean water;
  - Number of days meat was eaten in past week (in variable qmeat);
  - Number of days milk taken in past week (in variable qmilk);
  - Whether household owned an iron (in variable iron);
  - Whether household owned a table (in variable table);
  - Whether household had paid for wheat flour in past month (in variable wheatf);
  - Whether household had bought seed in the past 12 months (in variable seeds);
  - Usual number of meals taken per day (in variable num_meal).

The aim is to examine whether the same subset of variables is also suitable for predicting income poverty of just rural female headed households.

(a) Fit a regression model with all of the above as explanatory variables. How many of the variables are now significant? Would you regard this model as being appropriate as a prediction equation for just the rural female headed households? Note down reasons for your answer below.
(b) Adopt a backward elimination procedure to delete variables one by one. At each stage examine the results from your regression model and decide which variable should be deleted from your model at the next stage. Proceed until all remaining variables are significant in the fitted model. Note down answers to the questions below.

(i) How much of the variability in lnexpdf is explained by this model?
(ii) What is your prediction equation?
(iii) Are all variables in your equation above easily measurable if the model equation is to be used to predict income poverty? If not, which would you consider omitting from the model?
(iv) Do the signs of the regression coefficients make sense? If not, would you be happy including such variables in your model?

(c) Write down the form of your final prediction equation (this may be the same as the equation under (b)(ii) above).

(d) Examine the appropriateness of your prediction equation by following procedures similar to those outlined in question 1 part (c). For convenience, these questions are reproduced below.
  - Check the behaviour of the residuals; and then
  - predict income poverty for households which take the following values for each of the possible explanatory variables: hhsize = 5; water = yes; qmeat = 1; qmilk = 0; iron = 0; table = 1; wheatf = 0; seeds = 0; num_meal = 2.

Determine the standard error of the prediction using your computer software. (Hint: if the specified set of values for the explanatory variables is not in your data set, depending on the capabilities of your software, you may need to add another "dummy" response in the data file with a missing value for lnexpdf and your specified explanatory variable values as the x's.)
Use your standard errors to determine a 95% confidence interval for the true value of a predicted mean value as well as a predicted individual value. Note down your results below.

(i) Prediction:
(ii) Standard error of the prediction:
(iii) 95% confidence interval for the true value of the predicted mean:
(iv) 95% confidence interval for the true value of a predicted individual value:

(e) Note down your conclusions concerning the appropriateness of the above prediction and any reservations you might have in using these predictions for making policy decisions.
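
Note on the two intervals: the interval for a predicted mean uses only the standard error of the mean prediction, while the interval for an individual value must also allow for the residual variation s^2 around the regression line. The short Python sketch below illustrates that arithmetic only. It is not part of the practical's Stata workflow, the function name prediction_intervals is hypothetical, and the input numbers are made up for illustration rather than taken from either data set. It assumes you already have the prediction, the standard error of the mean prediction, s^2, and the two-sided 5% t-value for the residual degrees of freedom (from tables or your software).

```python
import math


def prediction_intervals(fit, se_mean, s2, t_crit):
    """Return (interval for the mean, interval for an individual value).

    fit     : predicted value from the fitted regression
    se_mean : standard error of the mean prediction (from software)
    s2      : estimate of residual variation (residual mean square)
    t_crit  : two-sided t-value for the residual d.f. (supplied by the user)
    """
    # An individual value varies about the mean with variance s2, so the
    # two sources of uncertainty are added on the variance scale.
    se_indiv = math.sqrt(se_mean ** 2 + s2)
    mean_interval = (fit - t_crit * se_mean, fit + t_crit * se_mean)
    indiv_interval = (fit - t_crit * se_indiv, fit + t_crit * se_indiv)
    return mean_interval, indiv_interval


# Hypothetical inputs chosen for easy arithmetic, NOT results from the data.
mean_interval, indiv_interval = prediction_intervals(
    fit=10.0, se_mean=0.3, s2=0.16, t_crit=2.0
)
print("Interval for the mean:       ", mean_interval)
print("Interval for an individual:  ", indiv_interval)
```

Because s^2 is added to the squared standard error of the mean, the interval for an individual value is always the wider of the two, and it does not shrink towards zero width however large the sample becomes.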