SADC Course in Statistics
Module H8 Practical 6

Multiple Regression Analysis

Objectives: By the end of this practical you should be able to:
- conduct a multiple regression analysis and write down the fitted equation
- explain what hypotheses are being tested through the t-tests
- interpret the results of such tests of hypotheses
- have greater confidence in interpreting the meaning of R²
- have greater confidence in conducting and interpreting results of a residual analysis
- appreciate how t-probabilities change when x-variables are dropped from the model

1. The data for this practical concerns rural female headed households from the Kigoma region of Tanzania, found in the Excel sheet Kigoma_RuralWomen in file H8_data.xls, and also in the Stata file named Kigoma_RuralWomen.dta. A listing of the variables is given at the end of this question.

In this practical, you will be considering the relationship between log consumption expenditure (in variable lnexpdf) and the following four variables:
- Household size (in variable hhsize)
- Age of household head (in variable age)
- Number of employed adults in household (in variable empl)
- Usual number of meals per day (in variable num_meal)

The aim is to investigate the extent to which hhsize, age, empl and num_meal explain the variability in income poverty, so that the resulting multiple regression equation could be used to predict poverty levels of households not in the data set.

(a) Start by producing a matrix plot between the dependent variable, lnexpdf, and the four explanatory variables. What conclusions can you make from these plots? Write these down below.

(b) Fit a multiple linear regression model with log consumption expenditure (lnexpdf) as the Dependent Variable and the variables hhsize, age, empl and num_meal as the Explanatory Variables. Note down the parameter estimates, their standard errors and the t-probabilities in the table below.
Variable name          Parameter estimate   Standard error   t-probability
household size
age
number employed
no of meals per day
constant

What conclusions may be drawn from the results above? Write them down with respect to the contribution that each of the four explanatory variables makes to the model.

(c) How much of the variability in log(consumption expenditure) is explained by the four variables included in the regression?

How much of the variation has been left unexplained?

Give an estimate for the measure of unexplained variability:

What are the degrees of freedom associated with this estimate?

(d) Would you consider dropping any of the variables in your model and re-fitting the model with the remaining variables? If so, which variable would you consider dropping, and why?

(e) Re-fit the model with variables hhsize, age, and empl. Are you happy with the resulting model? How much of the variation in lnexpdf is now left unexplained?

Note: Although this seems to suggest that dropping num_meal was the best choice in (d), this is not necessarily the case as you will see in the next session.

(f) Fit a model relating lnexpdf to hhsize and empl and write down answers to the following questions.

What is the equation describing your multiple regression model?

Are both hhsize and empl important in explaining the variability in lnexpdf? Give reasons for your answer.

What percentage of the variability in lnexpdf is explained by hhsize and empl?

(g) Finally, conduct a residual analysis to examine possible departures from model assumptions and to identify any extreme observations. What are your conclusions?
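As a reminder when working through parts (b) and (c), the standard quantities behind the regression output can be written as below. This is only a sketch in conventional notation: the symbols (the coefficients, their estimates b_j, the sample size n, and the residual and total sums of squares RSS and TSS) are assumptions introduced here and do not appear in the handout.

```latex
% Model with the four explanatory variables of part (b)
\[
  \mathtt{lnexpdf}_i = \beta_0 + \beta_1\,\mathtt{hhsize}_i + \beta_2\,\mathtt{age}_i
      + \beta_3\,\mathtt{empl}_i + \beta_4\,\mathtt{num\_meal}_i + \varepsilon_i
\]

% t-test for one coefficient, given that the other three variables are in the model
\[
  H_0:\ \beta_j = 0 \quad \text{against} \quad H_1:\ \beta_j \neq 0,
  \qquad
  t_j = \frac{b_j}{\mathrm{s.e.}(b_j)} \sim t_{n-5} \ \text{ under } H_0
\]

% Proportion of variation explained, and the estimate of unexplained variability
\[
  R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},
  \qquad
  s^2 = \frac{\mathrm{RSS}}{n-5}
\]
```

The divisor n - 5 is the residual degrees of freedom for a model with four explanatory variables and a constant; it becomes n - 4 and n - 3 for the reduced models in parts (e) and (f).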
Listing of data on Kigoma rural female headed households

------------------------------------------------------------------------------
               storage  display   value
variable name  type     format    label      variable label
------------------------------------------------------------------------------
hhid           float    %9.0g                household id
urb_rur        float    %9.0g     urb_rur    urban or rural
region         float    %9.0g     region     region
zone           float    %9.0g                agro-ecological zone
stratum        float    %9.0g     stratum    division of tanzania into 3 groups
hh_wt          float    %9.0g                final household weight
expen          float    %9.0g                expenditure per adult equivalent
lnexpdf        float    %9.0g                log (to base e) of expenditure per adult equivalent
hhsize         float    %9.0g                household size
hhsize2        float    %9.0g
age            float    %9.0g                age of household head
agesq          float    %9.0g
sexhead        float    %9.0g     sexhead    sex
edu            float    %9.0g     edu        education level of hh head
act1           float    %9.0g     act1       primary activity of household head
act2           float    %9.0g     act2       secondary activity of household head
empl           float    %9.0g     empl       number of adults employed (inc. self-empl)
depratio       float    %9.0g                dependency ratio
pprm           float    %9.0g                continuous variable for persons per room
p_room         float    %9.0g     p_room     persons per sleeping room
walls          float    %9.0g     walls      status of walls
water          float    %9.0g     water      source of water supply
fuelight       float    %9.0g     fuelight   source of fuel for lighting
fuelght2       float    %9.0g     fuelght2   source of fuel for lighting (detailed)
toilet         float    %9.0g     toilet     toilet facilities
qmeat          float    %9.0g                in past wk, days meat eaten
qmilk          float    %9.0g                in past wk, days milk taken
num_meal       float    %9.0g     num_meal   no. of meals per day
radio          float    %9.0g     radio      radio or radio cassette owned?
bicycle        float    %9.0g     bicycle    bicycle owned?
watch          float    %9.0g     watch      watch owned?
iron           float    %9.0g     iron       iron owned?
mosqnet        float    %9.0g     mosqtnet   mosquito net owned?
table          float    %9.0g     table      table owned?
sofa           float    %9.0g     sofa       sofa owned?
lamp           float    %9.0g     lamp       lamp owned?
soap           float    %9.0g     soap       paid for soap (either bar or powder)?
wheatf         float    %9.0g     wheatf     paid for wheat flour?
anyland        float    %9.0g     anyland    household owns any land for farming/pastoralism
landarea       float    %9.0g                acres of land owned by hh for farming/pastoralism
cashinc        float    %9.0g     cashinc    households main source of cash
------------------------------------------------------------------------------

2. If you have time, try also the exercise below.

The average annual rainfall (mm) is given below for 16 stations in the southern Pennines, UK. The elevation of the rain gauges is also given. Of interest is to explore whether the annual rainfall has any relationship with either the elevation or the altitude. The data is also available in the worksheet named penrain in the Excel file H8_data.xls. It is also available in the Stata file named penrain.dta.

Station              Average Annual   Gauge Elevation   Maximum Altitude (m)
                     Rainfall (mm)    (metres)          within 2 km
Wessenden                 1273             366                518
Blackmoorfoot             1094             244                328
Huddersfield               956             235                290
Yateholme                 1616             308                547
Harden Moss               1345             369                475
Wakefield                  670              35                 76
Langsett                  1059             250                370
Underbank                  949             184                355
Cannon Hall                738             113                210
Barnsley                   640              40                118
Chew                      1536             532                541
Bottoms Reservoir         1074             153                385
Yeoman Hey                1329             239                503
Dunford                   1442             329                484
Broomhead Manor           1246             418                490
Moor Hall                  856             124                340

(a) Carry out a suitable analysis to investigate whether elevation or altitude or both contribute significantly to the model. You may wish to conduct two simple linear regressions first, one with elevation alone as the explanatory variable, and then with altitude alone as the explanatory variable. Finally fit both elevation and altitude. Note down the key results of these three regressions in the tables below and give your comments.
(i) Regression of annual rainfall on elevation:

Variable name     Parameter estimate   Standard error   t-probability
elevation
constant

Adj. R²:

Comments:

(ii) Regression of annual rainfall on altitude:

Variable name     Parameter estimate   Standard error   t-probability
altitude
constant

Adj. R²:

Comments:

(iii) Regression of annual rainfall on elevation and altitude:

Variable name     Parameter estimate   Standard error   t-probability
elevation
altitude
constant

Adj. R²:

Comments:

(iv) Overall conclusions from the above analyses:

(b) Once you have chosen the model that best describes the variation in annual rainfall amounts, carry out an analysis of residuals to determine the validity of your conclusions. If there are doubts about the model assumptions, what remedial action might you consider? Take these actions and comment on whether you think they have been effective.
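The three regressions in Question 2(a) could be sketched as follows. The handout expects Stata or Excel, so Python is a substitution made here, using only the standard library. The helper names `invert` and `ols` are hypothetical, and the data are transcribed from the rainfall table in Question 2. The sketch reports estimates, standard errors, t-values and adjusted R². It does not compute t-probabilities, because the standard library has no t distribution; the t-values would need to be compared with t tables on the residual degrees of freedom shown.

```python
import math

# Data transcribed from the rainfall table in Question 2 (16 stations, table order).
rain = [1273, 1094, 956, 1616, 1345, 670, 1059, 949,
        738, 640, 1536, 1074, 1329, 1442, 1246, 856]   # average annual rainfall (mm)
elev = [366, 244, 235, 308, 369, 35, 250, 184,
        113, 40, 532, 153, 239, 329, 418, 124]         # gauge elevation (m)
alt = [518, 328, 290, 547, 475, 76, 370, 355,
       210, 118, 541, 385, 503, 484, 490, 340]         # max altitude within 2 km (m)


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination (hypothetical helper)."""
    n = len(matrix)
    # Augment the matrix with the identity; the right half becomes the inverse.
    aug = [[float(v) for v in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def ols(y, *xs):
    """Least-squares fit of y on the x-variables plus a constant (hypothetical helper)."""
    n = len(y)
    X = [[1.0] + [float(x[i]) for x in xs] for i in range(n)]
    k = len(X[0])                      # number of parameters, constant included
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yi - sum(b * v for b, v in zip(beta, row)) for row, yi in zip(X, y)]
    rss = sum(e * e for e in resid)    # residual (unexplained) sum of squares
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    df = n - k                         # residual degrees of freedom
    s2 = rss / df                      # residual mean square
    se = [math.sqrt(s2 * inv[a][a]) for a in range(k)]
    return {
        "estimates": beta,             # constant first, then x-variables in order
        "std_errors": se,
        "t_values": [b / s for b, s in zip(beta, se)],
        "df": df,
        "r2": 1 - rss / tss,
        "adj_r2": 1 - s2 / (tss / (n - 1)),
        "residuals": resid,
    }


models = {
    "elevation": ols(rain, elev),
    "altitude": ols(rain, alt),
    "elevation + altitude": ols(rain, elev, alt),
}
for name, fit in models.items():
    print(name, "| residual df =", fit["df"], "| adj R2 =", round(fit["adj_r2"], 3))
    for b, s, t in zip(fit["estimates"], fit["std_errors"], fit["t_values"]):
        print(f"  estimate {b:10.4f}   s.e. {s:9.4f}   t {t:7.2f}")
```

The `residuals` entry of the chosen model can be plotted against the fitted values or the explanatory variables as a starting point for part (b). The normal-equations approach used here is adequate for a data set this small, though statistical software typically uses more numerically stable methods.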