Lecture 3 handout

PY1PR2 lecture 3:
Simple regression
Dr David Field
Summary
• Andy Field covers simple regression in the first
half of chapter 7
• Regression is a technique you can use when you
want to predict the value of a variable from the
value of another variable
• Relationship between correlation and regression
• Finding the “line of best fit”
• Interpreting regression results and using them to
make predictions
• Assessing how good the fit is
Regression vs. Correlation
• The three highlighted scatter plots all show examples of perfect positive
correlation
• But there is an obvious difference between them
– the slope
• Simple regression assesses the slope of the relationship between two
variables, while correlation assesses how strong / tight the relationship is
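The distinction can be checked numerically. The sketch below uses made-up data (not the lecture data) for three perfectly linear relationships: the correlation is 1 in every case, but the slope differs. The helper names `pearson_r` and `slope` are illustrative only.

```python
from math import sqrt

def mean(values):
    return sum(values) / len(values)

def pearson_r(xs, ys):
    """Pearson correlation: how tight the linear relationship is."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def slope(xs, ys):
    """Least-squares slope of ys on xs: how steep the relationship is."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

# Hypothetical data: y is an exact multiple of x, with three different gradients
xs = [1, 2, 3, 4, 5]
for gradient in (0.5, 1.0, 2.0):
    ys = [gradient * x for x in xs]
    print(f"r = {pearson_r(xs, ys):.2f}, slope = {slope(xs, ys):.2f}")
```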
How regression measures the slope
• Make the assumption that the slope or relationship
can be described by a straight line
– if the scatter plot of two variables shows a curved
relationship you can’t use regression
• Find the “best fitting” line, and then the gradient of
that line is the measure of the slope
• There are an infinite number of possible straight
lines you could draw on the scatter plot of two
variables
– regression includes a method of finding the best one
Example data
• Today’s lecture, and the regression workshop, will
make use of some real data to illustrate how simple
regression is used to explore the relationship
between variables
• The sample is a sample of rich, developed nations
• The variables are
– average annual income per head
– income inequality
– % of population suffering from any form of mental illness
during the past year (WHO world mental health survey)
– life expectancy (workshop)
• Source: Wilkinson & Pickett (2009) The Spirit Level
Variation between the per capita incomes of rich countries
[Figure: income per head in $ per year (axis from 10000 to 40000) for Portugal,
Israel, Greece, New Zealand, Spain, Singapore, Italy, Sweden, Finland, France,
Germany, UK, Australia, Netherlands, Japan, Belgium, Canada, Austria, Ireland,
Denmark, Switzerland, Norway, USA]
Income Gap – the 20:20 ratio
• Average annual income is much larger in some developed
countries than others
• Countries also differ in terms of how wide the spread
around the average income is
– In some countries there is a great deal of variation around the
mean (large SD)
– In other countries there is a small SD
• Economists think about this in terms of inequality
– How rich is rich?
– Subjectively, in some countries, rich means double the average
income
– In other countries rich means 4 times the average income
• Quantifying inequality as an index
– The top 20% of earners in a country are defined as “rich”
– The bottom 20% are defined as “poor”
– The mean income of the top 20% is then divided by the mean
income of the bottom 20% producing an “inequality ratio”
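As a rough illustration of the index, the sketch below computes the ratio for two made-up lists of incomes. The function name `ratio_20_20` and all the numbers are hypothetical; real figures are calculated from national income data.

```python
def ratio_20_20(incomes):
    """Mean income of the top 20% divided by mean income of the bottom 20%."""
    ordered = sorted(incomes)
    k = max(1, len(ordered) // 5)   # number of earners in each 20% group
    bottom = ordered[:k]            # the "poor"
    top = ordered[-k:]              # the "rich"
    return (sum(top) / k) / (sum(bottom) / k)

# Hypothetical incomes (in thousands) for ten earners in two invented countries
equal_country = [18, 20, 22, 24, 25, 26, 28, 30, 34, 38]
unequal_country = [8, 10, 14, 18, 22, 26, 32, 40, 60, 84]

print(round(ratio_20_20(equal_country), 2))    # top mean 36 / bottom mean 19
print(round(ratio_20_20(unequal_country), 2))  # top mean 72 / bottom mean 9
```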
Income and inequality as predictor variables
• Wilkinson & Pickett (2009) analysed the relationship
between income, income inequality, and the
prevalence of a range of health and social
problems in different countries
– e.g. homicide rate, imprisonment rate, teenage birth
rate, social mobility
• To illustrate linear regression, we will focus on the
relationship between income, income inequality,
and a psychological variable:
– % of population suffering from any form of mental
illness during the past year
– The scatter plot indicates that mental health problems are more prevalent
in more unequal societies
– To quantify the relationship with regression, the first step is to find the
“line of best fit”
– The line of best fit can be estimated by eye
– This line is obviously a poor summary of the trend on the graph
– The slope of the line is much too steep
– This line looks like it captures the relationship fairly well
– The solid red line captures the relationship fairly well
– But possibly the green dotted line does a better job
– Regression uses a mathematical technique to decide which of all the
possible lines is the best fitting line
The method of least squares
The method of least squares can be used to decide which of these
two lines provides a better model of the data
• Calculate the difference between each data point and the line, in terms of
the predicted variable (mental health problems)
• Taking each difference as the line's value minus the data point, positive
numbers mean the model has overestimated, negative numbers mean it has
underestimated
• To measure the fit, square the difference scores and sum them (why square
the difference scores?)
• Next, calculate the sum of squared differences from the line on the right
• Whichever line has a smaller total squared difference score is a better
model of the data
• There are always an infinite number of lines to compare, so regression
uses a mathematical technique to find the one that minimizes the squared
differences
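The comparison of two candidate lines can be sketched as follows. The data and both lines are invented for illustration (they are not the lecture's scatter plots), and `sum_squared_differences` is a hypothetical helper name.

```python
def sum_squared_differences(xs, ys, b0, b1):
    """Sum of squared vertical differences between the data and the line b0 + b1 * x."""
    total = 0.0
    for x, y in zip(xs, ys):
        predicted = b0 + b1 * x          # value of the line at this x
        total += (y - predicted) ** 2    # squared difference for this data point
    return total

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]

line_a = (1.0, 1.5)   # (intercept, slope): a plausible guess by eye
line_b = (0.0, 3.0)   # (intercept, slope): much too steep

ss_a = sum_squared_differences(xs, ys, *line_a)
ss_b = sum_squared_differences(xs, ys, *line_b)
print(ss_a, ss_b)     # the line with the smaller total is the better model
```

Here line A has the smaller total, so it is the better of these two models, although neither is necessarily the best possible line.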
Describing the line of best fit
• Once you have obtained the line of best fit it can
be drawn on a scatter plot, and it can be
described by two numbers
– the first number, called the intercept or b0, is the value
of the predicted variable (e.g. mental health) where the
predictor has a value of 0, which is where the line crosses
the vertical axis of the graph
– the second number, called b1, is the gradient or slope of
the line, and tells you what happens to the y axis value
of the best fit line when you increase the value on the x
axis by 1 unit
• Together, they are known as the regression
coefficients
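The lecture does not give formulas for the coefficients, but for simple regression the least squares values have a standard closed form. The notation is an assumption made here for illustration: x is the predictor, y the predicted variable, and a bar denotes a mean.

```latex
b_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
b_0 = \bar{y} - b_1 \bar{x}
```

The second formula implies that the line of best fit always passes through the point (mean of x, mean of y).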
[Figure: a solid red line and a dashed line plotted with Outcome on the y axis
(0 to 16) and Predictor on the x axis (0 to 10)]
• The value of the intercept, b0, for both lines is 8
• The solid red line has a positive value of b1 (gradient or slope)
• For the dashed line, b1 is a negative number, indicating that as the
predictor increases the predicted value decreases
[Figure: three straight lines plotted with Outcome on the y axis (0 to 16) and
Predictor on the x axis]
• These 3 lines all have the same positive slope value, b1
• But each has a different intercept, b0
The line of best fit as a “model” of the data
• A statistical model is a way of describing the most
important aspects of a set of data that is simpler
than the data itself
– a straight line is simpler than the scatter plot it
summarizes
• Like all statistical models, the line of best fit can
be used to predict values of the outcome for a
specific value of the predictor
– outcome = b0 + (b1 * specific value of predictor)
– see later for examples
• Best fit line for prediction of mental health by inequality
– b0 and b1 are expressed in the units of the predicted variable, % in
this case
– b1 is 3.7% per unit of the predictor variable
– In other words, moving from an inequality ratio of 3 to 4 increases the
predicted rate of mental illness by 3.7 percentage points
– b1 = 3.7, b0 = 6.6
Using the model to predict values
• Data for the % of population suffering from any
form of mental illness during the past year was not
available for Greece
– but we do know that Greece has an income inequality
ratio of 6.19, which is 3.28 units higher than the most
equal country, Japan
– We can use the formula for straight lines, combined
with the values of b0 and b1 to predict the level of
mental illness in Greece
– b0 + (b1 * 3.28)
– 6.6 + (3.28 * 3.7) = 18.74%
• You can also check the predicted value for Greece graphically
– draw a line up from the x axis to meet the regression line at the point
corresponding to the predictor value for Greece
– draw a line across to the y axis and read off the predicted value
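The Greece calculation can be reproduced in a few lines. The coefficients are the lecture's values, and the predictor value 3.28 follows the lecture in expressing inequality as units above Japan. The function name `predict` is just an illustrative choice.

```python
B0 = 6.6   # intercept from the lecture, in %
B1 = 3.7   # slope from the lecture, in % per unit of the inequality ratio

def predict(predictor_value, b0=B0, b1=B1):
    """Straight line formula: outcome = b0 + (b1 * specific value of predictor)."""
    return b0 + b1 * predictor_value

# 3.28 is the predictor value used for Greece in the lecture
greece = predict(3.28)
print(round(greece, 2))   # 18.74
```

Increasing the predictor by one unit, e.g. comparing `predict(4)` with `predict(3)`, changes the prediction by b1.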
Making predictions for extreme values of
the predictor
• Imagine we obtained a measurement of income
inequality for two new countries
– a capitalist country with no welfare state and zero
taxation, inequality ratio 18 (much higher than any
country in our sample)
– a communist country, inequality ratio 1.5 (less than half
the most equal country in our sample, Japan)
• We could use the equation for straight lines to
predict levels of mental illness for both countries
– Doing this is referred to as “extrapolation”
– Can you think of any problems with doing this, or
reasons for caution?
How well does the model fit?
• Regression is guaranteed to find the best fitting
straight line, but it might still be a poor fit if the
two variables are only weakly related
• There are two ways of assessing the model
1) R2, the proportion of the total variance in the predicted
variable that the best fitting line accounts for
2) a null hypothesis test:
• what is the probability of obtaining the b1 value in the
sample if the true value of b1 in the population is zero?
• a b1 of 0 means that as the value of the predictor
increases the value of the predicted variable stays the same
• (as inequality increases mental illness stays the same)
• Let’s compare the fit of two models that predict the
proportion of the population with mental health
problems
Predicting mental health from income
b1 = 0.65
b0 = -1.7 (zero income = less than zero mental
health problems?!?!)
Assessing the two models of mental health: R2
• To assess the line of best fit, it is compared to an
even simpler model of the predicted variable
– If I had mental health data for the 11 countries on the
scatter plot, but no data about income or inequality, and
I was asked to predict the level of mental health
problems in Greece, my best guess would be the mean
of the mental health data
– The simplest possible model of the predicted variable is
its mean
– Calculate the sum of the squared deviations from the
mean (total sum of squares)
– Calculate the sum of the squared deviations from the
line of best fit (residual sum of squares)
– If the line is a good model, the residual sum should be
much smaller than the total sum
The line of best fit versus the mean
mean of Y
Can you find any countries where the mean is a better model of mental
health problems than the line of best fit for inequality?
The line of best fit versus the mean
Can you find any countries where the mean is a better model of mental
health problems than the line of best fit for income?
For how many countries is the model obviously better than the mean?
Calculating R2
• R2 is a descriptive statistic that describes how
much better the model is at explaining variation in
the predicted variable than using the mean as a
model
• It is expressed as a proportion of the total
variation (variance) in the predicted variable
– therefore it has a maximum of 1 and a minimum of 0
$$R^2 = \frac{\text{total sum of squares} - \text{residual sum of squares}}{\text{total sum of squares}}$$
Note: in simple regression, the square root of R2 is the size of the Pearson
correlation between the two variables (the correlation takes the sign of b1)
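The R2 calculation can be sketched as below. The data are made up (the lecture data are not reproduced here), and `fit_line` and `r_squared` are hypothetical helper names. The line is fitted with the standard least squares formulas.

```python
from math import sqrt

def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line of ys on xs."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx
    return mean_y - b1 * mean_x, b1

def r_squared(xs, ys):
    """(total sum of squares - residual sum of squares) / total sum of squares."""
    b0, b1 = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    total_ss = sum((y - mean_y) ** 2 for y in ys)                        # deviations from the mean
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))  # deviations from the line
    return (total_ss - residual_ss) / total_ss

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
r2 = r_squared(xs, ys)
print(f"R2 = {r2:.2f}, square root = {sqrt(r2):.2f}")
```

For these invented numbers R2 is about 0.84 and its square root, about 0.91, matches the Pearson correlation of the two lists.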
Which variable is a better predictor?
R2 = 0.16, correlation = 0.39
R2 = 0.55, correlation = 0.74
Is b1 significantly different from zero?
• If the true value of b1 in the population is zero, what is the
probability that a random sample of this size would have a
value of b1 as big or bigger than the observed value?
• The p value is provided by a t test
• t = b1 / standard error of b1
• standard error of b1 is based on the SD of the residual
(deviation) scores
– if the data points are close to the line of best fit SE will be small, if
far away, SE will be large
• otherwise identical to the t test for the difference between
two sample means
• if p < 0.05 we support the hypothesis that the predictor
variable is useful in estimating the value of the outcome
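A minimal sketch of the t calculation, using made-up data. It assumes the usual textbook formula for the standard error of b1: the SD of the residuals (calculated with n - 2 degrees of freedom) divided by the square root of the sum of squared deviations of the predictor from its mean. The lecture only says that the standard error is "based on" the residual SD, so the exact formula is an assumption here. The function name `slope_t_statistic` is hypothetical.

```python
from math import sqrt

def slope_t_statistic(xs, ys):
    """Return (b1, standard error of b1, t, degrees of freedom)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    df = n - 2                              # two coefficients were estimated
    residual_sd = sqrt(residual_ss / df)    # SD of the residual (deviation) scores
    se_b1 = residual_sd / sqrt(sxx)         # assumed textbook formula for SE of b1
    return b1, se_b1, b1 / se_b1, df

# Hypothetical data
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 7, 8]
b1, se_b1, t, df = slope_t_statistic(xs, ys)
print(f"b1 = {b1:.2f}, SE = {se_b1:.3f}, t({df}) = {t:.2f}")
```

The p value is then read from the t distribution with n - 2 degrees of freedom, which needs a statistics package or table and is not computed in this sketch.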
Are the predictors statistically significant?
t(9) = 1.29, p = 0.23, NS
t(9) = 3.3, p = 0.009
The simple linear regression model
• As we saw earlier, simple regression is simply a
model of the data as a straight line:
– Outcome = b0 + (b1 * specific value of predictor)
• But that equation is not quite complete, because
the regression model needs to reflect the fact that
the data points rarely all lie exactly on the line
• Therefore, you will usually see the regression
equation written as some symbolic variant of:
– Outcome = b0 + (b1 * specific value of predictor) +
“residual error”
• residual error refers to the deviation score from
the model (the vertical lines drawn on the scatter
plots earlier)
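One common symbolic variant is shown below. The notation is an assumption for illustration rather than the lecture's own: y_i is the outcome for case i, x_i is its predictor value, and the last term is its residual error.

```latex
y_i = b_0 + b_1 x_i + \varepsilon_i
```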
Things to bear in mind
• Often, you can’t be sure that the predicted variable is being
caused by the predictor
– it might be the other way round (and you can swap the predictor
and predicted around and run the analysis again)
– in some cases, e.g. inequality and mental health, it does not make
sense to run the regression the other way around
• Nobody would claim that mental health problems cause
inequality
• but you might make an argument that a 3rd variable causes the
variation in both observed variables
• The predicted variable should be a continuous variable
measured on an interval or ratio scale
• If a scatter plot suggests a non-linear relationship, you
can’t use simple regression
If you’d like to evaluate the effects of inequality and other variables
yourself, more data is available in the book or on the website,
e.g. more unequal US states have higher levels of health and social problems
Statistical terms for revision
• model
• method of least squares
• residual
• regression coefficients
– intercept, b0
– slope or gradient, b1
• best fitting line
• extrapolation
• R2