Identifying Response and Explanatory Variables

Module H8 Practical 4
Correlation and the Coefficient of Determination
Objectives:
By the end of this practical you should:
• be able to produce and interpret values of the correlation coefficient for pairs of quantitative variables;
• know how the coefficient of determination (R²) can be calculated from an ANOVA table, and how it may be interpreted;
• be able to judge the practical value of measures of correlation.
1. In this practical we will again use the data for the sample of rural female-headed
households, available in the worksheet named Kilijaro_RuralWomen in the Excel
workbook H8_data.xls. The Stata file Kilijaro_RuralWomen.dta contains the same
data.
Some variables of interest are given below. A full list of variables appears in the listing of Kilijaro_RuralWomen.dta later in this practical.
• Log consumption expenditure per adult equivalent per month (in variable lnexpdf), as a proxy measure of income poverty;
• Number of persons per sleeping room (in variable pprm);
• Household size (in variable hhsize);
• Dependency ratio = number of dependents in HH (depend) / (hhsize - depend) (in variable depratio);
• Number of days meat was eaten in past week (in variable qmeat);
• Number of days milk was taken in past week (in variable qmilk);
• Number of cattle and other large livestock (in variable qcattl).
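The dependency ratio defined above is a simple quotient, and a small sketch may make the arithmetic concrete. The function name and the household figures are hypothetical, and returning None when a household has no non-dependent members is a choice made for this illustration; it is not necessarily how depratio was constructed in the data file.

```python
def dependency_ratio(dependents, hhsize):
    """Number of dependents divided by the number of non-dependent members."""
    non_dependents = hhsize - dependents
    if non_dependents <= 0:
        # The ratio is undefined when nobody in the household is a non-dependent.
        return None
    return dependents / non_dependents


# Hypothetical household: 3 dependents among 5 members, so 3 / (5 - 3).
print(dependency_ratio(3, 5))  # prints 1.5
```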
(a) Explore the relationship between lnexpdf and each of the other variables by
plotting lnexpdf against each of them in turn. Each time, guess the value of the
correlation coefficient by eye and note your answer in the table below.
Then check how close your guesses are to the true values by calculating the corresponding
correlation coefficients using Stata, and enter the exact values in the table as well.
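The practical asks for the correlations to be computed in Stata (its correlate command reports them). As a cross-check on what that number is, the following is a minimal Python sketch of the Pearson correlation coefficient computed from first principles. The function name and the four data points are made up for illustration; they are not taken from the survey data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Corrected sums of squares and cross-products.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)


# Made-up illustrative values, not survey data.
x = [1, 2, 3, 4]
y = [2, 1, 4, 3]
print(f"r = {pearson_r(x, y):.2f}")
```

With the real data, the two sequences would be lnexpdf and one of the other variables, restricted to households where both values are present.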
SADC Course in Statistics
Module H8 Practical 4 – Page 1
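Correlations with log consumption expenditure

Variable name and description                  Guessed Correlation   Actual Correlation
pprm = no. of persons per sleeping room        ___________________   __________________
hhsize = household size                        ___________________   __________________
depratio = dependency ratio                    ___________________   __________________
qmeat = no. of days meat eaten in past week    ___________________   __________________
qmilk = no. of days milk taken in past week    ___________________   __________________
qcattl = number of cattle & other large animals  _________________   __________________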
(b) What conclusions can you draw from each correlation value concerning the degree of
association that each variable has with lnexpdf? Do they tell you a great deal? Are they
all practically useful?
(c) What other variables in your data file do you think might be associated with
the income poverty proxy? Explore the extent to which they are associated with lnexpdf.
Note down your findings below.
(d) Select the variable that you think has the greatest association with lnexpdf and fit a
simple linear regression model to lnexpdf with the selected variable as your explanatory
variable. Note down your results in the analysis of variance table below.
Source of Variation   d.f.    S.S.      M.S. = S.S./d.f.   F        F prob
Regression            ____    _______   _______            ______   ______
Residual              ____    _______   _______
Total                 ____    _______
Use your results above to calculate the coefficient of determination (R²) and note it down
below. If your computer output gives R² automatically when producing the ANOVA
table, verify that your calculation agrees with the R² reported there.
Also write down below your interpretation of the R² value you find.
Value of R² =
Interpretation of R²:
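For reference, the calculation asked for here is the regression sum of squares divided by the total sum of squares. The sketch below shows that arithmetic in Python; the function name is hypothetical and the sums of squares are invented round numbers, chosen only to make the division easy to follow. Substitute the values from your own ANOVA table.

```python
def r_squared(ss_regression, ss_total):
    """Coefficient of determination from two entries of the ANOVA table."""
    if ss_total <= 0:
        raise ValueError("total sum of squares must be positive")
    return ss_regression / ss_total


# Invented sums of squares, for illustration only.
ss_regression = 30.0
ss_residual = 70.0
ss_total = ss_regression + ss_residual  # Total S.S. = Regression S.S. + Residual S.S.

r2 = r_squared(ss_regression, ss_total)
print(f"R2 = {r2:.2f}")  # prints: R2 = 0.30
```

With these invented figures, the fitted line would account for 30% of the variation in the response about its mean.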
(e) Interpret the different components of the ANOVA table and write down what
conclusions may be drawn from its results. Also make a note of how
much of the variability in lnexpdf is left unexplained after fitting the model.
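To see how the components in part (e) fit together, here is a rough Python sketch that builds the ANOVA quantities for a simple linear regression directly from the data. It assumes an ordinary least-squares fit of y on a single x; the function name, the dictionary keys and the four data points are all hypothetical and are not taken from the survey. The F probability is left out because it needs the F distribution from a statistics package.

```python
def anova_simple_regression(x, y):
    """ANOVA quantities for a least-squares fit of y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_residual

    df_regression, df_residual = 1, n - 2
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    # F compares explained variation with residual variation per degree of freedom.
    f_value = ms_regression / ms_residual if ms_residual > 0 else float("inf")

    return {
        "ss_regression": ss_regression,
        "ss_residual": ss_residual,
        "ss_total": ss_total,
        "df_regression": df_regression,
        "df_residual": df_residual,
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f_value": f_value,
        "unexplained": ss_residual / ss_total,  # equals 1 - R-squared
    }


# Made-up illustrative values, not survey data.
table = anova_simple_regression([1, 2, 3, 4], [2, 1, 4, 3])
for name, value in table.items():
    print(f"{name:15s} {value:.3f}")
```

The "unexplained" entry is the proportion of the total variation left in the residuals, which is the quantity the last sentence of part (e) asks about.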
Listing of data in file Kilijaro_RuralWomen.dta
-------------------------------------------------------------------------------
               storage display  value
variable name  type    format   label     variable label
-------------------------------------------------------------------------------
hhid           float   %9.0g              household id
urb_rur        float   %9.0g    urb_rur   urban or rural
region         float   %9.0g    region    region
zone           float   %9.0g              agro-ecological zone
stratum        float   %9.0g    stratum   division of tanzania into 3 groups
hh_wt          float   %9.0g              final household weight
expadeqf       float   %9.0g              expenditure per adult equivalent
lnexpdf        float   %9.0g              ln(expenditure per adult equivalent) - actual
hhsize         float   %9.0g              household size
hhsize2        float   %9.0g
size           float   %9.0g    size      grouped household size
age            float   %9.0g              age of household head
agesq          float   %9.0g
sexhead        float   %9.0g    sexhead   sex
edu            float   %9.0g    edu       education level of hh head
act1           float   %9.0g    act1      primary activity of household head
act2           float   %9.0g    act2      secondary activity of household head
empl           float   %9.0g    empl      number of adults employed (inc. self-empl)
depratio       float   %9.0g              dependency ratio
tenure         float   %9.0g    tenure    status of tenure
depend         float   %9.0g              no of dependents
nondep         float   %9.0g              no of nondependent hh members
pprm           float   %9.0g              continuous variable for persons per room
p_room         float   %9.0g    p_room    no of persons per sleeping room
floor          float   %9.0g    floor     floor status
walls          float   %9.0g    walls     status of walls
roofs          float   %9.0g    roofs     status of roof
water          float   %9.0g    water     source of water supply
fuelcook       float   %9.0g    fuelcook  source of fuel for cooking
fuelck2        float   %9.0g    fuelck2   source of fuel for cooking (detailed)
fuelight       float   %9.0g    fuelight  source of fuel for lighting
fuelght2       float   %9.0g    fuelght2  source of fuel for lighting (detailed)
toilet         float   %9.0g    toilet    toilet facilities
qmeat          float   %9.0g              in past wk, days meat eaten
qfish          float   %9.0g              in past wk, days fish eaten
qmilk          float   %9.0g              in past wk, days milk taken
larganim       float   %9.0g    larganim  whether own large sized animals (cattle, etc)
qcattl         float   %9.0g              no of cattle and other large livestock
medanim        float   %9.0g    medanim   whether own medium sized animals (sheep, goat)
goatsp         float   %9.0g              no of goats, sheep and other medium anims
poultry        float   %9.0g              quantity of poultry
anyland        float   %9.0g    anyland   household owns any land for farming/ pastoralism
landarea       float   %9.0g              acres of land owned by hh for farming/pastoralism
radio          float   %9.0g    radio     radio or radio cassette owned?
motcycle       float   %9.0g    motcycle  motor cycle owned?
bicycle        float   %9.0g    bicycle   bicycle owned?
beds           float   %9.0g    beds      beds owned?
wadrobe        float   %9.0g    wadrobe   wardrobe owned?
mosqnet        float   %9.0g    mosqtnet  mosquito net owned?
hoes           float   %9.0g    hoes      hoe owned?
wbarrow        float   %9.0g    wbarrow   wbarrow owned?
iron           float   %9.0g    iron      iron owned?
sofa           float   %9.0g    sofa      sofa owned?
lamp           float   %9.0g    lamp      lamp owned?
cashinc        float   %9.0g    cashinc   households main source of cash
fertil         float   %9.0g    fertil    whether hh paid for fertiliser/manure in past 12 months?
seeds          float   %9.0g    seeds     whether hh paid for seeds in past 12 months?
pesti          float   %9.0g    pesti     whether hh paid for pesticides/weed killer in past 12 months?
labour         float   %9.0g    labour    whether hh paid for casual labour in past 12 months?
rand1          float   %9.0g              Random division of data by this
-------------------------------------------------------------------------------

2. Anscombe (1973) invented the following data to demonstrate the importance of
graphs in regression analysis. There are four data sets, as given below. They are
available in the Stata worksheet Anscombe.dta.
x1     y1      y2      y3      x2     y4
10     8.04    9.14    7.46    8      6.58
8      6.95    8.14    6.77    8      5.76
13     7.58    8.74    12.74   8      7.71
9      8.81    8.77    7.11    8      8.84
11     8.33    9.26    7.81    8      8.47
14     9.96    8.10    8.84    8      7.04
6      7.24    6.13    6.08    8      5.25
4      4.26    3.10    5.39    8      5.56
12     10.84   9.13    8.15    8      7.91
7      4.82    7.26    6.42    8      6.89
5      5.68    4.74    5.73    19     12.50
Source: Anscombe, F.J. (1973) Graphs in Statistical Analysis, American Statistician, 27,
pp.17-21.
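(a) Carry out a regression analysis on each of the four data sets and note down in the
table below some summary statistics from each regression.

Summary statistic    y1 vs x1    y2 vs x1    y3 vs x1    y4 vs x2
F-value              ________    ________    ________    ________
p-value for F        ________    ________    ________    ________
Residual MS (s²)     ________    ________    ________    ________
Reg. equation        ________    ________    ________    ________
R²                   ________    ________    ________    ________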
Do you have any comments on the results you find above?
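If you would like to check your Stata output for part (a) independently, the following is a minimal Python sketch that fits a least-squares line to each of the four data sets and reports the summary statistics in the table. The data are transcribed from the listing in this practical; the function and key names are hypothetical. The p-value for F is omitted because it requires the F distribution (in Python, one option would be scipy.stats.f.sf, which assumes SciPy is installed).

```python
# Anscombe's data, transcribed from the table in this practical.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x2 = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]


def regression_summary(x, y):
    """Least-squares fit of y = intercept + slope*x with ANOVA-based summaries."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_regression = ss_total - ss_residual
    residual_ms = ss_residual / (n - 2)  # residual mean square, s squared

    return {
        "slope": slope,
        "intercept": intercept,
        "f_value": ss_regression / residual_ms,  # regression d.f. is 1
        "residual_ms": residual_ms,
        "r2": ss_regression / ss_total,
    }


summaries = {
    "y1 vs x1": regression_summary(x1, y1),
    "y2 vs x1": regression_summary(x1, y2),
    "y3 vs x1": regression_summary(x1, y3),
    "y4 vs x2": regression_summary(x2, y4),
}

for label, s in summaries.items():
    print(
        f"{label}: y = {s['intercept']:.2f} + {s['slope']:.3f} x, "
        f"F = {s['f_value']:.2f}, s2 = {s['residual_ms']:.3f}, R2 = {s['r2']:.3f}"
    )
```

Compare the four printed lines with one another before moving on to part (b).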
(b) Now plot the data corresponding to each regression. Is a simple linear regression
model sensible in each case?
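The plots for part (b) are best produced in Stata (for example with its scatter command) or another graphics package. Purely as an illustration of what is being plotted, the sketch below draws a crude text scatter plot using only standard Python, so it runs without any plotting library. The function name and the grid size are arbitrary choices for this sketch, and points that fall in the same character cell overlap.

```python
# Anscombe's data, transcribed from the table in this practical.
x1 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x2 = [8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 19]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]


def ascii_scatter(x, y, width=40, height=12):
    """Return a text scatter plot: x increases to the right, y increases upwards."""
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    x_span = (x_max - x_min) or 1  # avoid dividing by zero if all x are equal
    y_span = (y_max - y_min) or 1
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        col = round((xi - x_min) / x_span * (width - 1))
        row = (height - 1) - round((yi - y_min) / y_span * (height - 1))
        grid[row][col] = "*"
    return "\n".join("".join(line) for line in grid)


for title, xs, ys in [
    ("y1 vs x1", x1, y1),
    ("y2 vs x1", x1, y2),
    ("y3 vs x1", x1, y3),
    ("y4 vs x2", x2, y4),
]:
    print(title)
    print(ascii_scatter(xs, ys))
    print()
```

Look at the shape of each cloud of points and ask whether a straight line is a reasonable description of it.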
(c) Write down the key message(s) you have learnt from this exercise.