Correlation and Simple Linear Regression
Mike Tucker, 27th April

Correlation and Linear Regression
Correlation measures the extent to which two variables co-vary. It is based on a standardised version of the
covariance:
Variance (s2) of a random variable x: s2 = Σ(xi - x̄)2 / (n - 1)
Standard deviation (s): s = √s2
Covariance (Cov) of two random variables x, y: Cov(x, y) = Σ(xi - x̄)(yi - ȳ) / (n - 1)
Correlation (r): r = Cov(x, y) / (sx sy)
The correlation (Pearson product-moment correlation coefficient) varies between -1 (perfectly negatively correlated)
and +1 (perfectly positively correlated).
The square of the correlation (r2) describes the proportion of variation shared between two variables.
[Figures: scatter plots illustrating a strong positive correlation and a strong negative correlation.]
Note:
The Pearson correlation only measures the extent of the LINEAR relationship between two variables. For example, two variables perfectly related by y = x2, with x taking values symmetric about zero, have a correlation coefficient of 0.
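To make this concrete, here is a minimal Python sketch (an addition to the handout, assuming numpy is available) that computes the Pearson correlation for a perfect y = x2 relationship over a range of x symmetric about zero:

```python
import numpy as np

# x values symmetric about zero; y is a perfect, but purely non-linear, function of x
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = x ** 2

# Pearson r via the definition: covariance divided by the product of the standard deviations
r = np.cov(x, y, ddof=1)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(r)  # 0.0 -- no linear association, despite the perfect relationship
```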
Causality
Correlations tell us nothing about whether the two variables are causally related, or the direction of causality:
Exercise:
Use Excel to compute the correlation between these two variables 'by hand' (a Python alternative is sketched after the table):
x: 3  4  5  6  7  8  9  3
y: 1  2  3  4  5  4  6  8
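For readers who prefer code to Excel, the following Python sketch computes r step by step from the covariance and the two standard deviations. The x and y values below follow the two-column reading of the table above, which is only one plausible reconstruction of its layout, so treat them as illustrative:

```python
import math

# Assumed x/y pairs from the exercise table (layout reconstructed, so purely illustrative)
x = [3, 4, 5, 6, 7, 8, 9, 3]
y = [1, 2, 3, 4, 5, 4, 6, 8]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance and standard deviations (dividing by n - 1 throughout)
cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)
sd_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
sd_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

r = cov_xy / (sd_x * sd_y)
print(f"r = {r:.3f}, r squared = {r ** 2:.3f}")
```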
Simple Linear Regression
Simple regression is a minor extension of correlation. Regression models are used to describe in more detail the
relationship between two variables. A straight line is fitted through the data. The line used minimises the sum of
squared deviations from the line (hence least squares regression). We will use some data relating the age of trees to
their girth:
Regression line showing the relationship between the girth of trees and their age.
The red arrows show the individual errors or residuals. The line is chosen so as to minimise the sum of all the
squared residuals.
Regression equation:
Yi = b0 + b1Xi + ei
Predicted value: Ŷi = b0 + b1Xi
Y = Dependent Variable (or criterion)
X = Independent Variable
b0 = the Y intercept
b1 = the slope (the change in Y for a unit increase in X)
ei = the residual (error) for observation i
Any particular Y value (Yi) can be predicted from knowing the associated X value (Xi):
Ŷi = b0 + b1Xi
Or for a particular tree (tree i):
Predicted Age of Tree i = 12.2 + 1.13 * Girth of Trunk i
The intercept, b0, is where the line of best fit crosses the Y axis, i.e. the predicted Y when X = 0.
This does not always have a meaningful interpretation. In this example it answers the question 'what is the predicted age of a tree that has 0 cm girth?'.
The slope b1: suppose, as in this example, the slope is 1.13.
This means that for every 1 cm increase in a tree's girth the estimated age of the tree increases by 1.13 years.
Hence a tree that had a girth of 40 cm would be predicted to be 11.3 years older than one with a 30 cm girth.
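As a sketch of where the slope and intercept come from, the snippet below estimates b1 and b0 with the usual least-squares formulas (b1 = Cov(x, y) / Var(x); b0 = mean(y) - b1 * mean(x)) and then makes a prediction. The girth/age values are invented for illustration and are not the handout's tree data:

```python
import numpy as np

# Invented girth (cm) and age (years) values -- not the original tree data set
girth = np.array([20.0, 25.0, 30.0, 35.0, 40.0, 45.0])
age = np.array([33.0, 42.0, 47.0, 51.0, 58.0, 62.0])

# Least-squares estimates: slope = Cov(x, y) / Var(x); intercept = mean(y) - slope * mean(x)
b1 = np.cov(girth, age, ddof=1)[0, 1] / np.var(girth, ddof=1)
b0 = age.mean() - b1 * girth.mean()
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")

# Predicted age at 40 cm girth, and the predicted difference between 40 cm and 30 cm girths
print(f"predicted age at 40 cm girth: {b0 + b1 * 40:.1f} years")
print(f"predicted difference (40 cm vs 30 cm): {b1 * 10:.1f} years")
```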
The standardised model
The regression equation can be standardised by normalising (subtracting the mean and dividing by the SD) the scores of both variables. In this case the intercept is always zero (the model is centred) and the slope (now symbolised by beta, β) represents the change in Y (measured in SDs) for a 1 SD increase in X.
Assessing the model
In the simple regression case the assessment of the model's significance is no different from testing the significance of the correlation coefficient (the standardised slope, β, is the same as the correlation coefficient). The square of this value (R square) gives the proportion of the variation in Y that is explained by (or shared with) X.
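A quick numerical check (a sketch with invented x and y values, assuming numpy) that in simple regression the standardised slope equals the Pearson correlation, and that its square is the proportion of variation explained:

```python
import numpy as np

# Invented data, purely for the numerical check
x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([1.5, 3.0, 4.5, 5.0, 7.5, 8.0])

# Standardise both variables (z-scores): subtract the mean, divide by the SD
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)

# Slope of the standardised regression (its intercept is zero because both means are zero)
beta = np.cov(zx, zy, ddof=1)[0, 1] / np.var(zx, ddof=1)
r = np.corrcoef(x, y)[0, 1]

print(f"standardised slope = {beta:.4f}, Pearson r = {r:.4f}")   # the two agree
print(f"R squared (proportion of variation explained) = {r ** 2:.4f}")
```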
Multiple Regression
The topic is complex, so only the basics are covered here.
The principle is the same as in simple linear regression. It is a common procedure in non-experimental research, where data collected on several measures are used to model some outcome measure.
Multiple regression forms a model based on several predictors:
Y = b0 + b1X1 + b2X2 + b3X3 + …
It is harder to visualise what is going on: with two predictors the 'prediction line' is actually a prediction plane (a flat surface).
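A minimal sketch of fitting a two-predictor model by ordinary least squares in Python (numpy's lstsq); the data are invented and the variable names are placeholders, not the tree data set:

```python
import numpy as np

# Invented outcome y and two predictors x1, x2
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x2 = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0])
y = np.array([3.1, 4.0, 7.2, 7.9, 11.1, 11.8])

# Design matrix with a leading column of ones for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares solution of y = b0 + b1*x1 + b2*x2
(b0, b1, b2), *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, b2 = {b2:.3f}")

# With two predictors the fitted model is a plane: a prediction for any (x1, x2) pair
print("predicted y at x1 = 3, x2 = 2:", round(b0 + b1 * 3 + b2 * 2, 3))
```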
SPSS Example
We will use the Tree Age data set, first to form a simple regression model for predicting a tree's Age from its Girth, and then to form a multiple regression model using several other predictors to try to get a better (more accurate) model.
Data file TreeAgeData.sav
SPSS menu commands:
Analyse/regression/linear regression
Dependent: Age
Independent: Girth
Note: Beta = R
R square = proportion of variation in Age that can be predicted from Girth
The ANOVA table tells us that, overall, the model is a significant predictor of Age (note also that the regression Sum of Squares / Total Sum of Squares = .674 = R2).
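The identity noted above (regression Sum of Squares divided by Total Sum of Squares equals R square) is easy to verify directly; this sketch uses the same invented x and y values as the earlier standardised-slope check, not the SPSS tree data:

```python
import numpy as np

x = np.array([2.0, 4.0, 5.0, 7.0, 9.0, 10.0])
y = np.array([1.5, 3.0, 4.5, 5.0, 7.5, 8.0])

# Fit the simple regression and compute the fitted (predicted) values
b1 = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

# ANOVA decomposition: Total SS = Regression SS + Residual SS
ss_total = np.sum((y - y.mean()) ** 2)
ss_regression = np.sum((y_hat - y.mean()) ** 2)

r = np.corrcoef(x, y)[0, 1]
print(f"Regression SS / Total SS = {ss_regression / ss_total:.4f}")
print(f"R squared                = {r ** 2:.4f}")   # the two match
```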
The regression equation is obtained from the coefficients table:
Predicted Age = 12.228 + (1.126 * Girth)
Or in standardised form:
Age (SD units) = .821 * Girth (SD units)
Performing a multiple regression:
Re-run the procedure, adding the Tree Height and Canopy Spread variables to the Independents list:
The figure for R is now the multiple correlation coefficient. It represents the correlation of the observed Y values
with the predicted Y values from the multiple regression equation.
The ANOVA table for the regression tells you whether the regression model as a whole explains a significant amount
of variation in the dependent (Y) variable. Here it is highly significant (p<.001) which means that using SPREAD,
GIRTH and HEIGHT to predict AGE results in a significantly better prediction than simply using the average of the Y
values (the average AGE).
The interpretation of the coefficients (or slopes) is a little more complicated in multiple regression than in simple regression, although the model equation is a simple extension of the simple case:
Age = 16.283 + (.559 * Girth) + (5.4 * Height) + (-.751 * Spread)
The complication arises because, except in rare circumstances, the predictors are likely to be correlated with each other:
Analyse/correlate/bivariate
Variables: age, height, girth, spread
Significant correlations are marked **
Note:
1. Both GIRTH and HEIGHT are significantly correlated with AGE (but SPREAD isn’t)
2. The two predictors GIRTH and HEIGHT are themselves correlated (r = .72)
SPSS will produce scatter plots of these relationships. To view them all at once use:
Graphs/Legacy/Scatter/Matrix and put all variables (except disease) into the variables box. This should always be done as a routine check that there are no obvious non-linear relationships amongst the variables.
Little or no correlation between SPREAD and anything else.
Strong, positive, linear correlations between AGE and HEIGHT, AGE and GIRTH, and HEIGHT and GIRTH. From the correlation matrix we know these are highly significant.
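Outside SPSS, the same routine check can be done with pandas (a sketch; the values are invented and the column names merely mirror the variables in this example, they are not TreeAgeData.sav):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Invented data with placeholder columns mirroring the example's variable names
df = pd.DataFrame({
    "age":    [30, 35, 40, 45, 52, 58, 63, 70],
    "girth":  [18, 22, 25, 30, 33, 38, 42, 47],
    "height": [4.0, 4.5, 5.2, 5.8, 6.5, 7.0, 7.6, 8.3],
    "spread": [3.1, 2.8, 3.5, 2.9, 3.6, 3.0, 3.4, 3.2],
})

# Correlation matrix -- the analogue of Analyse/correlate/bivariate
print(df.corr().round(2))

# Scatter-plot matrix -- the analogue of Graphs/Legacy/Scatter/Matrix
pd.plotting.scatter_matrix(df, figsize=(6, 6))
plt.show()
```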
As SPREAD is not a significant predictor, we can run the regression again, this time including only GIRTH and HEIGHT as predictors.
At this point all predictors are significant, so this can be considered the 'final' model. (Although the constant is not significant, there is little to be gained by removing it; more on this later.)
Interpreting the coefficients
1. The Final Model:
AGE = -6.1 + (.56 * GIRTH) + (5.1 * HEIGHT)
For a 1 unit increase in GIRTH, holding everything else (i.e. HEIGHT) constant, the predicted AGE increases by .56 units. Similarly, for a 1 unit increase in HEIGHT, holding everything else (i.e. GIRTH) constant, the predicted AGE increases by 5.1 units.
NB: the units here are whatever units the raw data were measured in (in this case a 1 cm increase in girth corresponds to a .56 year increase in age, but a 1 metre increase in height corresponds to a 5.1 year increase in age).
The p values associated with the individual predictors test whether that predictor contributes anything given that
the other variables are included in the model. Their absolute values cannot be taken as an indicator of relative
‘importance’. There are three main reasons for this.
1. The values of the coefficients depend on the scale of measurement of the predictor variables (in this example tree GIRTH is measured in centimetres, tree HEIGHT in metres and SPREAD in feet). Because of this, the standardised coefficients are a much better measure of relative importance.
2. When the predictors are correlated, the values of the coefficients refer to their incremental influence on the model. Two predictors may each be a good predictor of the Y variable on their own, but when put together the one with the larger correlation dominates and the other may have little to contribute over and above it. In other words, because they are highly correlated they convey similar information, making one or the other more or less redundant.
3. Especially with highly correlated predictors, the coefficients are not very stable: if you were to run the regression again on a new sample you could end up with an equally good model (in terms of the overall R square value) but with markedly different coefficients.
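This instability can be demonstrated with a small simulation (a sketch, not from the handout): two almost identical predictors are generated repeatedly, and the fitted slopes swing from sample to sample even though R square barely moves:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 30

for sample in range(3):
    x1 = rng.normal(size=n)
    x2 = x1 + rng.normal(scale=0.1, size=n)    # x2 is almost the same as x1 (r close to 1)
    y = 2 * x1 + 2 * x2 + rng.normal(size=n)   # true model weights the predictors equally

    X = np.column_stack([np.ones(n), x1, x2])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)

    y_hat = X @ b
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    print(f"sample {sample + 1}: b1 = {b[1]:6.2f}, b2 = {b[2]:6.2f}, R squared = {r2:.3f}")
# The individual slopes vary markedly while the overall fit stays essentially the same
```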
The intercept b0
• The intercept often has no realistic interpretation. It simply arises from the computation of the best-fitting regression line. Technically it is the value of Y when all the predictors are zero.
• In some models there is a logical reason why, when all the predictors are zero, the Y variable should also be zero. The tree age data are an example: when the height and girth are zero you would presume the age also to be zero, and a regression model which disagrees with this might be thought to be 'incorrect'. However, this is often simply a misapplication of the model:
Here the regression model has been based on a restricted range of data.
It may well provide a good fit for the data within this range but should not be used to make predictions for x values
outside this range. In this example the true intercept is zero but would be negative according to the regression
model. The true relationship here is not linear, but over the range considered a linear model is adequate.
Comparing Models
There are several methods of obtaining the most suitable regression model.
Although the 'best' model might intuitively be the one that explains most of the variation in Y (i.e. has the highest R square), this is not necessarily the most appropriate choice.
The choice depends on the reason for constructing the regression model. Is it:
• simply to get the best prediction?
• to test certain theoretical models?
• to find a sufficiently good and cost-effective / parsimonious model?
A model may use 5 predictors to model a dependent variable and give an R square of 78%. Although this is significantly better than a model with only 2 predictors (say R square = 69%), the difference in R square may be small enough not to warrant the use of all 5 predictors (e.g. because of cost or time considerations in collecting the data if the model were to be used in practice).
Models may be compared by looking at the difference in SSresidual between the FULL model (with more predictors) and the REDUCED model (with fewer predictors):
F = [(SSresidual(Reduced) - SSresidual(Full)) / (d.f.Red - d.f.Full)] / [SSresidual(Full) / d.f.Full]
which is distributed as F(d.f.Red - d.f.Full, d.f.Full), and which SPSS carries out for you.
The same comparison can be done quickly using the Block command from the regression dialogue.
This allows you to specify and compare models that differ on one or more predictors.
Example: adding Girth in block 1, followed by Girth, Height and Spread in block 2, and making sure 'R square change' is checked on the Statistics tab, will get SPSS to evaluate the significance of the difference between the two models, i.e. does adding Height and Spread significantly improve the model:
From the table you can see that the value of R square (which is a measure of the overall goodness of fit of the
model) increases by .177 by adding height and spread to the model. The ‘Sig of F change’ evaluates the significance
of this change – here the improvement in the model is highly significant.
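The R square change test that SPSS reports can also be computed by hand using the F formula from the Comparing Models section. The sketch below does this for invented data (scipy is assumed to be available for the p value):

```python
import numpy as np
from scipy import stats

# Invented data: outcome y, one predictor for the reduced model (x1) and two extra predictors
rng = np.random.default_rng(1)
n = 40
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 1.5 * x2 + rng.normal(size=n)

def ss_residual(X, y):
    """Residual sum of squares from an ordinary least-squares fit."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return np.sum((y - X @ b) ** 2)

ones = np.ones(n)
ss_red = ss_residual(np.column_stack([ones, x1]), y)           # reduced model: x1 only
ss_full = ss_residual(np.column_stack([ones, x1, x2, x3]), y)  # full model: x1, x2, x3

df_red, df_full = n - 2, n - 4                                 # residual degrees of freedom
F = ((ss_red - ss_full) / (df_red - df_full)) / (ss_full / df_full)
p = stats.f.sf(F, df_red - df_full, df_full)
print(f"F change = {F:.2f} on ({df_red - df_full}, {df_full}) df, p = {p:.4f}")
```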
Categorical Predictors
Categorical (as opposed to continuous) variables can also be used as predictors in a regression model.
E.g.
SEX (male, female)
Type of Residence (detached, semi-detached, terraced)
Disease(present, absent)
…etc.
Categorical predictors are handled with 'dummy coding', using 0s and 1s.
If the category has more than two values then several columns need to be used to code it. In general if a category
has n levels then n-1 columns are required to code it.
The interpretation of regression coefficients for categorical predictors is different – suppose you had a variable
indicating whether someone smoked. This could be coded as 0 for non-smoker and 1 for smoker. If the outcome (Y)
variable was life expectancy in years, then a coefficient of -6.2 for the 'Smoker' variable would indicate that life expectancy for smokers is 6.2 years less than that for non-smokers (all other variables held constant).
If the variable had 3 possible values (e.g. High Risk, Medium Risk, Low risk) then you would need 2 variables – the
first containing a 1 if the person was high risk and a 0 otherwise, the second a 1 if they were medium risk and a 0
otherwise. The low risk category is coded by the fact that both of these columns would contain 0.
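In Python, pandas can create the n-1 dummy columns automatically. The sketch below uses an invented three-level risk variable and orders the categories so that 'low' is the reference level coded by zeros in both dummy columns, matching the scheme described above:

```python
import pandas as pd

# Invented categorical predictor with three levels
df = pd.DataFrame({"risk": ["low", "high", "medium", "low", "high", "medium"]})

# Fix the category order so that 'low' is the first level and hence the reference category
df["risk"] = pd.Categorical(df["risk"], categories=["low", "medium", "high"])

# n levels -> n - 1 dummy (0/1) columns; rows that are 0 in both columns are 'low' risk
dummies = pd.get_dummies(df["risk"], prefix="risk", drop_first=True, dtype=int)
print(dummies)
```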