SADC Course in Statistics

Module H8 Practical 4

Correlation and the Coefficient of Determination

Objectives:
By the end of this practical you should:
  - be able to produce and interpret values of the correlation coefficient for pairs of quantitative variables;
  - know how the coefficient of determination (R2) can be calculated from an anova table, and how it may be interpreted;
  - be able to judge the practical value of measures of correlation.

1. In this practical we will again use the data for the sample of rural female-headed households, available in the worksheet named Kilijaro_RuralWomen in the Excel workbook H8_data.xls. The Stata file Kilijaro_RuralWomen.dta contains the same data. Some variables of interest are given below. The listing of the data file that follows part (e) gives all the variables.

  - Log consumption expenditure per adult equivalent per month (in variable lnexpdf), as a proxy measure of income poverty;
  - Number of persons per sleeping room (in variable pprm);
  - Household size (in variable hhsize);
  - Dependency ratio (in variable depratio) = number of dependents in the household (depend) / (hhsize - depend);
  - Number of days meat was eaten in past week (in variable qmeat);
  - Number of days milk was taken in past week (in variable qmilk);
  - Number of cattle and other large livestock (in variable qcattl).

(a) Explore the relationship between lnexpdf and each of the other variables by plotting lnexpdf against each of them in turn. Each time, guess the value of the correlation coefficient by eye and note your guess in the table below. Then calculate the corresponding correlation coefficients using Stata to check how close your guesses are to the true values, and enter the exact values in the table as well.

Correlations with log consumption expenditure

Variable name and description                      Guessed correlation   Actual correlation
pprm = no. of persons per sleeping room
hhsize = household size
depratio = dependency ratio
qmeat = no. of days meat eaten in past week
qmilk = no. of days milk taken in past week
qcattl = number of cattle & other large animals

(b) What conclusions can you draw from each correlation value concerning the degree of association that each variable has with lnexpdf? Do the values tell you a great deal? Are they all practically useful?

(c) Which other variables in your data file do you think might be associated with the income poverty proxy? Explore the extent to which they are associated with lnexpdf. Note down your findings below.

(d) Select the variable that you think has the greatest association with lnexpdf and fit a simple linear regression model to lnexpdf with the selected variable as your explanatory variable. Note down your results in the analysis of variance table below.

Source of variation   d.f.   S.S.   M.S. = S.S./d.f.   F   F prob
Regression
Residual
Total

Use your results above to calculate the coefficient of determination (R2) and note it down below. If your computer output generates R2 automatically when producing the anova table, verify that your calculation agrees with the R2 reported. Also write down below your interpretation of the R2 value you find.

Value of R2 =

Interpretation of R2:

(e) Interpret the different components of the anova table and write down what conclusions may be drawn from its results. Also make a note of how much of the variability in lnexpdf is left unexplained after fitting the model.
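As a cross-check on the hand calculation in part (d), the sketch below (written in Python rather than Stata, purely for illustration) builds the anova quantities for a simple linear regression, computes R2 as regression S.S. divided by total S.S., and compares it with the square of the correlation coefficient. The x and y values are invented for illustration and are not taken from Kilijaro_RuralWomen.dta; the function name anova_simple_regression is hypothetical.

```python
from statistics import mean

def anova_simple_regression(x, y):
    """Return anova quantities for a least-squares fit of y on x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    total_ss = sum((yi - ybar) ** 2 for yi in y)   # total S.S., d.f. = n - 1
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    reg_ss = slope * sxy                           # regression S.S., d.f. = 1
    res_ss = total_ss - reg_ss                     # residual S.S., d.f. = n - 2
    res_ms = res_ss / (n - 2)
    return {
        "slope": slope,
        "intercept": intercept,
        "reg_ss": reg_ss,
        "res_ss": res_ss,
        "total_ss": total_ss,
        "res_ms": res_ms,
        "F": (reg_ss / 1) / res_ms,                # regression M.S. / residual M.S.
        "r": sxy / (sxx * total_ss) ** 0.5,        # correlation coefficient
        "R2": reg_ss / total_ss,                   # coefficient of determination
    }

# Invented illustrative data (NOT from the practical's data file).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2.1, 2.9, 3.2, 4.8, 5.1, 5.8, 7.2, 7.9]

fit = anova_simple_regression(x, y)
print(f"R2 from anova table    = {fit['R2']:.4f}")
print(f"square of correlation  = {fit['r'] ** 2:.4f}")
print(f"proportion unexplained = {fit['res_ss'] / fit['total_ss']:.4f}")
```

With a single explanatory variable the first two printed values agree, and the third is 1 - R2, which is the quantity asked for at the end of part (e).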
Listing of data in file Kilijaro_RuralWomen.dta

------------------------------------------------------------------------------
variable  storage display value
name      type    format  label     variable label
------------------------------------------------------------------------------
hhid      float   %9.0g             household id
urb_rur   float   %9.0g   urb_rur   urban or rural
region    float   %9.0g   region    region
zone      float   %9.0g             agro-ecological zone
stratum   float   %9.0g   stratum   division of tanzania into 3 groups
hh_wt     float   %9.0g             final household weight
expadeqf  float   %9.0g             expenditure per adult equivalent
lnexpdf   float   %9.0g             ln(expenditure per adult equivalent) - actual
hhsize    float   %9.0g             household size
hhsize2   float   %9.0g
size      float   %9.0g   size      grouped household size
age       float   %9.0g             age of household head
agesq     float   %9.0g
sexhead   float   %9.0g   sexhead   sex
edu       float   %9.0g   edu       education level of hh head
act1      float   %9.0g   act1      primary activity of household head
act2      float   %9.0g   act2      secondary activity of household head
empl      float   %9.0g   empl      number of adults employed (inc. self-empl)
depratio  float   %9.0g             dependency ratio
tenure    float   %9.0g   tenure    status of tenure
depend    float   %9.0g             no of dependents
nondep    float   %9.0g             no of nondependent hh members
pprm      float   %9.0g             continuous variable for persons per room
p_room    float   %9.0g   p_room    no of persons per sleeping room
floor     float   %9.0g   floor     floor status
walls     float   %9.0g   walls     status of walls
roofs     float   %9.0g   roofs     status of roof
water     float   %9.0g   water     source of water supply
fuelcook  float   %9.0g   fuelcook  source of fuel for cooking
fuelck2   float   %9.0g   fuelck2   source of fuel for cooking (detailed)
fuelight  float   %9.0g   fuelight  source of fuel for lighting
fuelght2  float   %9.0g   fuelght2  source of fuel for lighting (detailed)
toilet    float   %9.0g   toilet    toilet facilities
qmeat     float   %9.0g             in past wk, days meat eaten
qfish     float   %9.0g             in past wk, days fish eaten
qmilk     float   %9.0g             in past wk, days milk taken
larganim  float   %9.0g   larganim  whether own large sized animals (cattle, etc)
qcattl    float   %9.0g             no of cattle and other large livestock
medanim   float   %9.0g   medanim   whether own medium sized animals (sheep, goat)
goatsp    float   %9.0g             no of goats, sheep and other medium anims
poultry   float   %9.0g             quantity of poultry
anyland   float   %9.0g   anyland   household owns any land for farming/pastoralism
landarea  float   %9.0g             acres of land owned by hh for farming/pastoralism
radio     float   %9.0g   radio     radio or radio cassette owned?
motcycle  float   %9.0g   motcycle  motor cycle owned?
bicycle   float   %9.0g   bicycle   bicycle owned?
beds      float   %9.0g   beds      beds owned?
wadrobe   float   %9.0g   wadrobe   wardrobe owned?
mosqnet   float   %9.0g   mosqtnet  mosquito net owned?
hoes      float   %9.0g   hoes      hoe owned?
wbarrow   float   %9.0g   wbarrow   wbarrow owned?
iron      float   %9.0g   iron      iron owned?
sofa      float   %9.0g   sofa      sofa owned?
lamp      float   %9.0g   lamp      lamp owned?
cashinc   float   %9.0g   cashinc   households main source of cash
fertil    float   %9.0g   fertil    whether hh paid for fertiliser/manure in past 12 months?
seeds     float   %9.0g   seeds     whether hh paid for seeds in past 12 months?
pesti     float   %9.0g   pesti     whether hh paid for pesticides/weed killer in past 12 months?
labour    float   %9.0g   labour    whether hh paid for casual labour in past 12 months?
rand1     float   %9.0g             Random division of data by this
------------------------------------------------------------------------------

2. Anscombe (1973) invented the following data to demonstrate the importance of graphs in regression analysis. There are four data sets, as given below. They are available in the Stata worksheet Anscombe.dta.

x1      y1      y2      y3     x2      y4
10     8.04    9.14    7.46     8     6.58
 8     6.95    8.14    6.77     8     5.76
13     7.58    8.74   12.74     8     7.71
 9     8.81    8.77    7.11     8     8.84
11     8.33    9.26    7.81     8     8.47
14     9.96    8.10    8.84     8     7.04
 6     7.24    6.13    6.08     8     5.25
 4     4.26    3.10    5.39     8     5.56
12    10.84    9.13    8.15     8     7.91
 7     4.82    7.26    6.42     8     6.89
 5     5.68    4.74    5.73    19    12.50

Source: Anscombe, F.J. (1973) Graphs in Statistical Analysis, American Statistician, 27, pp. 17-21.

(a) Carry out a regression analysis on each of the four data sets and note down in the table below some summary statistics from each regression.

Summary statistic    y1 vs x1   y2 vs x1   y3 vs x1   y4 vs x2
F-value
p-value for F
Residual MS (s2)
Reg. equation
R2

Do you have any comments on the results you find above?

(b) Now plot the data corresponding to each regression. Is a simple linear regression model sensible in each case?

(c) Write down the key message(s) you have learnt from this exercise.
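For anyone wanting to check their Stata output for part (a) independently, the sketch below (in Python, as an illustration only and not part of the practical itself) fits a least-squares line to each of the four Anscombe data sets as tabulated in question 2, and prints the regression equation, R2, F-value and residual MS. The helper name summarise_fit is hypothetical. The p-value for F is not computed here, since that needs the F distribution, and should be read from the Stata output.

```python
from statistics import mean

# Anscombe (1973) data as tabulated in question 2.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x2 = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

def summarise_fit(x, y):
    """Summary statistics for a simple linear regression of y on x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    total_ss = sum((yi - ybar) ** 2 for yi in y)
    slope = sxy / sxx
    intercept = ybar - slope * xbar
    reg_ss = slope * sxy                        # regression S.S. (1 d.f.)
    res_ms = (total_ss - reg_ss) / (n - 2)      # residual M.S. (n - 2 d.f.)
    return {
        "slope": slope,
        "intercept": intercept,
        "R2": reg_ss / total_ss,
        "F": reg_ss / res_ms,
        "res_ms": res_ms,
    }

datasets = {
    "y1 vs x1": (x1, y1),
    "y2 vs x1": (x1, y2),
    "y3 vs x1": (x1, y3),
    "y4 vs x2": (x2, y4),
}

results = {name: summarise_fit(x, y) for name, (x, y) in datasets.items()}
for name, s in results.items():
    print(f"{name}: y = {s['intercept']:.2f} + {s['slope']:.3f} x, "
          f"R2 = {s['R2']:.3f}, F = {s['F']:.2f}, s2 = {s['res_ms']:.3f}")
```

The printed summaries only cover part (a). The plots asked for in part (b) still need to be drawn, because none of these statistics shows the shape of the data.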