Correlation and Linear Regression
Dr Mike Tucker, University of Plymouth

Correlation
Correlation measures the extent to which two variables co-vary. It is based on a standardised version of the covariance:

Variance (s²) of a random variable x:  s² = Σ(xi − x̄)² / (n − 1)
Standard deviation (s):  s = √(s²)
Covariance (Cov) of two random variables x, y:  Cov(x, y) = Σ(xi − x̄)(yi − ȳ) / (n − 1)
Correlation (r):  r = Cov(x, y) / (sx × sy)

The correlation (Pearson product-moment correlation coefficient) varies between −1 (perfectly negatively correlated) and +1 (perfectly positively correlated). The square of the correlation (r²) describes the proportion of variation shared between the two variables.

[Figure: scatter plots illustrating a strong positive correlation and a strong negative correlation.]

Note: the Pearson correlation only measures the extent of the LINEAR relationship between two variables.

[Figure: two variables that are perfectly related (y = x²) but whose correlation coefficient is 0.]

Causality
Correlations tell us nothing about whether the two variables are causally related, or about the direction of causality.

Exercise
Use Excel to compute the correlation between these two variables 'by hand':

x: 3 4 5 6 7 8 9 3
y: 1 2 3 4 5 4 6 8

Simple linear regression
Simple regression is a minor extension of correlation. Regression models are used to describe in more detail the relationship between two variables: a straight line is fitted through the data. The line used minimises the sum of squared deviations from the line (hence 'least squares' regression). We will use some data relating the age of trees to their girth.

[Figure: regression line showing the relationship between the girth of trees and their age. The red arrows show the individual errors, or residuals. The line is chosen so as to minimise the sum of all the squared residuals.]

Regression equation
Yi = b0 + b1Xi + ei     Ŷi = b0 + b1Xi
Y = the dependent variable (or criterion)
X = the independent variable
b0 = the Y intercept
b1 = the slope (the change in Y for a unit increase in X)

Any particular Y value (Yi) can be predicted from knowing the associated X value (Xi): Ŷi = b0 + b1Xi. Or, for a particular tree (tree i):

Age of tree i = 12.2 + 1.13 × girth of trunk i

The intercept, b0, is where the line of best fit crosses the Y axis, i.e. when X = 0. This does not always have a meaningful interpretation; in this example it is the predicted age of a tree that has 0 cm girth.

The slope, b1: suppose, as in this example, the slope is 1.13. This means that for every 1 cm increase in a tree's girth the estimated age of the tree increases by 1.13 years. Hence a tree with a girth of 40 cm would be predicted to be 11.3 years older than one with a 30 cm girth.

The standardised model
The regression equation can be standardised by normalising (subtracting the mean and dividing by the SD) the scores of both variables. In this case the intercept is always zero (the model is centred) and the slope (now symbolised by beta, β) represents the change in Y (measured in SDs) for a 1 SD increase in X.
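These relationships can be checked numerically. The short Python sketch below is illustrative only (it is not part of the original SPSS materials, and the small x and y arrays are invented): it computes the covariance, the Pearson correlation, and the least-squares slope and intercept, and confirms that after z-scoring both variables the fitted slope equals the correlation and the intercept is zero.

```python
import numpy as np

# Invented illustrative data (not the tree data set)
x = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 10.0])
y = np.array([3.0, 5.0, 4.0, 8.0, 9.0, 11.0])

n = len(x)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (n - 1)   # covariance
r = cov_xy / (x.std(ddof=1) * y.std(ddof=1))                 # Pearson correlation

b1 = cov_xy / x.var(ddof=1)      # slope = Cov(x, y) / s_x^2
b0 = y.mean() - b1 * x.mean()    # intercept = y-bar - b1 * x-bar

# Standardised (z-scored) model: the intercept is 0 and the slope equals r
zx = (x - x.mean()) / x.std(ddof=1)
zy = (y - y.mean()) / y.std(ddof=1)
beta = np.sum(zx * zy) / np.sum(zx * zx)

print(f"r = {r:.3f}, b0 = {b0:.3f}, b1 = {b1:.3f}, standardised slope = {beta:.3f}")
```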
Assessing the model
In the simple regression case the assessment of the model's significance is no different from the significance of the correlation coefficient (the standardised slope, β, is the same as the correlation coefficient). The square of this value (R square) gives the proportion of the variation in Y that is explained by (or shared with) X.

Multiple regression
The topic is complex, so only the basics are covered here. The principle is the same as simple linear regression. It is a common procedure in non-experimental research, where data collected from several measures are used to model some outcome measure. Multiple regression forms a model based on several predictors:

Y = b0 + b1X1 + b2X2 + b3X3 + …

It is harder to visualise what is going on (with two predictors the 'prediction line' is actually a prediction plane, a flat surface).

SPSS example
We will use the tree age data set, first to form a simple regression model for predicting a tree's age from its girth, and then to form a multiple regression model using several other predictors to try to get a better (more accurate) model.

Data file: TreeAgeData.sav
SPSS menu commands: Analyse / Regression / Linear Regression
Dependent: Age
Independent: Girth

Note: Beta = R, and R square = the proportion of variation in Age that can be predicted from Girth. The ANOVA tells us that overall the model is a significant predictor of Age (see also that the regression sum of squares / total sum of squares = .674 = R²). The regression equation is obtained from the coefficients table:

Predicted Age = 12.228 + (1.126 × Girth)

Or in standardised form:

Age (SD units) = .821 × Girth (SD units)

Performing a multiple regression
Re-run the procedure, but add the Tree Height and Canopy Spread variables to the Independents list.

The figure for R is now the multiple correlation coefficient. It represents the correlation of the observed Y values with the predicted Y values from the multiple regression equation. The ANOVA table for the regression tells you whether the regression model as a whole explains a significant amount of variation in the dependent (Y) variable. Here it is highly significant (p < .001), which means that using SPREAD, GIRTH and HEIGHT to predict AGE results in a significantly better prediction than simply using the average of the Y values (the average AGE).

The interpretation of the coefficients (or slopes) is a little more complicated in multiple regression than in simple regression, although the model equation is a simple extension of simple regression:

Age = 16.283 + (.559 × Girth) + (5.4 × Height) + (−.751 × Spread)

This is because, except in rare circumstances, the predictors are likely to be correlated with each other.

Analyse / Correlate / Bivariate
Variables: age, height, girth, spread
Significant correlations are marked **.

Note:
1. Both GIRTH and HEIGHT are significantly correlated with AGE (but SPREAD isn't).
2. The two predictors GIRTH and HEIGHT are themselves correlated (r = .72).

SPSS will produce scatter plots of these relationships. To view them all at once use Graphs / Legacy / Scatter / Matrix and put all variables (except disease) into the variables box. This should always be done as a routine check that there are no obvious non-linear relationships amongst the variables. Here there is little or no correlation between SPREAD and anything else, and strong, positive, linear correlations between AGE and HEIGHT, AGE and GIRTH, and HEIGHT and GIRTH. From the correlation matrix we know these are highly significant.
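For readers working outside SPSS, the same checks can be sketched in Python. This is only an illustrative equivalent of the SPSS steps above: it assumes the tree data have been exported to a hypothetical CSV file (TreeAgeData.csv) with columns named age, girth, height and spread, which is not something the original notes describe.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical CSV export of TreeAgeData.sav with columns: age, girth, height, spread
trees = pd.read_csv("TreeAgeData.csv")

# Correlation matrix and scatter-plot matrix (rough equivalent of the SPSS checks)
print(trees[["age", "girth", "height", "spread"]].corr())
pd.plotting.scatter_matrix(trees[["age", "girth", "height", "spread"]])

# Simple regression: age predicted from girth
simple = smf.ols("age ~ girth", data=trees).fit()
print(simple.summary())          # coefficients, R-squared, overall F test

# Multiple regression: girth, height and spread as predictors
multiple = smf.ols("age ~ girth + height + spread", data=trees).fit()
print(multiple.summary())
```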
As SPREAD is not a significant predictor, we can run the regression again, this time including only GIRTH and HEIGHT as predictors. At this point all predictors are significant, so this can be considered the 'final' model. (Although the constant is not significant, there is little to be gained by removing it; more on this later.)

Interpreting the coefficients
The final model:

AGE = −6.1 + .56 × GIRTH + 5.1 × HEIGHT

For a 1 unit increase in GIRTH, holding everything else (i.e. HEIGHT) constant, the predicted AGE increases by .56 units. Similarly, for a 1 unit increase in HEIGHT, holding everything else (i.e. GIRTH) constant, the predicted AGE increases by 5.1 units. NB: the units here refer to whatever units the raw data were measured in (in this case a 1 cm increase in girth results in a .56 year increase in age, but a 1 metre increase in height results in a 5.1 year increase in age).

The p values associated with the individual predictors test whether that predictor contributes anything given that the other variables are included in the model. The absolute values of the coefficients cannot be taken as an indicator of relative 'importance'. There are three main reasons for this:

1. The values of the coefficients depend on the scale of measurement of the predictor variable (in this example tree GIRTH is measured in centimetres, whereas tree HEIGHT is measured in metres and SPREAD in feet). Because of this the standardised coefficients are a much better measure of relative importance.
2. When the predictors are correlated, the values of the coefficients refer to their incremental influence on the model. Two predictors may both be good predictors of the Y variable on their own, but put together the one with the larger correlation dominates and the other may not have much to contribute over and above it. In other words, because they are highly correlated they both convey similar information, making one or the other more or less redundant.
3. Especially with highly correlated predictors, the stability of the coefficients is not high: if you were to run the regression again on a new sample you could end up with an equally good model (in terms of the overall R square value) but one with markedly different coefficients.

The intercept b0
The intercept often has no realistic interpretation; it simply arises from the computation of the best-fitting regression line. Technically it is the value of Y when all the predictors are zero. In some models there is a logical reason why, when all the predictors are zero, the Y variable should also be zero. The tree age data are an example: when the height and girth are zero you would presume the age also to be zero, and a regression model which disagrees with this might be thought to be 'incorrect'. However, this is often simply a misapplication of the model.

[Figure: a regression line fitted to data from a restricted range of x values.] Here the regression model has been based on a restricted range of data. It may well provide a good fit for the data within this range, but it should not be used to make predictions for x values outside this range. In this example the true intercept is zero but would be negative according to the regression model. The true relationship here is not linear, but over the range considered a linear model is adequate.
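As a small illustration of that last point (a simulation with made-up data, not taken from the original notes): when the true relationship is curved and only a restricted range of x is observed, an ordinary least-squares line can fit well inside that range yet give a clearly wrong, here negative, intercept when extrapolated back to x = 0.

```python
import numpy as np

rng = np.random.default_rng(1)

# True relationship is curved (y = 0.01 * x^2, so the true value at x = 0 is 0),
# but we only observe x values in a restricted range, 30 to 60.
x = rng.uniform(30, 60, size=200)
y = 0.01 * x**2 + rng.normal(0, 1, size=200)

# Ordinary least-squares fit of a straight line to the restricted-range data
b1, b0 = np.polyfit(x, y, 1)

print(f"fitted intercept b0 = {b0:.2f} (true value at x = 0 is 0)")
print(f"fitted slope     b1 = {b1:.2f}")
# The line fits well within 30-60 but extrapolates to a negative value at x = 0.
```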
Comparing models
There are several methods of obtaining the most suitable regression model. Although the 'best' model might intuitively be the one that explains most of the variation in Y (i.e. has the highest R square), this is not necessarily the best or most appropriate approach. The choice depends on the reason for constructing the regression model: is it simply to get the best prediction, to test certain theoretical models, or to find a sufficiently good and cost-effective / parsimonious model?

A model may use 5 predictors to model a dependent variable and give an R square of 78%. However, although significantly better than a model with only 2 predictors (say R square = 69%), the difference in the R square value may be sufficiently small not to warrant the use of all 5 predictors (e.g. price / time considerations in collecting the data if the model were to be used in practice).

Models may be compared by looking at the difference in SS residual between the FULL model (with more predictors) and the REDUCED model (with fewer predictors):

F = [(SSresidual(Reduced) − SSresidual(Full)) / (d.f.Reduced − d.f.Full)] / [SSresidual(Full) / d.f.Full]

which is distributed as F(d.f.Reduced − d.f.Full, d.f.Full), and which SPSS carries out for you.

The same comparison can be done quickly using the Block command from the regression dialogue. This allows you to specify and compare models that differ on one or more predictors. For example, add girth in block 1, followed by girth, height and spread in block 2, and make sure R squared change is checked on the Statistics tab; SPSS will then evaluate the significance of the difference between the two models, i.e. does adding height and spread significantly improve the model? From the table you can see that the value of R square (which is a measure of the overall goodness of fit of the model) increases by .177 when height and spread are added to the model. The 'Sig. F Change' evaluates the significance of this change; here the improvement in the model is highly significant.

Categorical predictors
Categorical (as opposed to continuous) variables can also be used as predictors in a regression model, e.g. Sex (male, female), Type of residence (detached, semi-detached, terraced), Disease (present, absent), etc. Categorical predictors use 'dummy coding' with 0s and 1s. If the category has more than two values then several columns need to be used to code it; in general, if a category has n levels then n − 1 columns are required to code it.

The interpretation of regression coefficients for categorical predictors is different. Suppose you had a variable indicating whether someone smoked, coded as 0 for non-smoker and 1 for smoker. If the outcome (Y) variable were life expectancy in years, then a coefficient of −6.2 for the 'smoker' variable would indicate that life expectancy for smokers is 6.2 years less than that for non-smokers (all other variables held constant). If the variable had 3 possible values (e.g. high risk, medium risk, low risk) then you would need 2 variables: the first containing a 1 if the person was high risk and a 0 otherwise, the second a 1 if they were medium risk and a 0 otherwise. The low risk category is coded by the fact that both of these columns contain 0.
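A brief sketch of how this dummy coding looks in practice (illustrative Python with an invented risk variable, not part of the original SPSS example): a 3-level categorical predictor is expanded into n − 1 = 2 indicator columns, and the omitted level ('low' here) becomes the reference category against which the other coefficients are measured.

```python
import pandas as pd

# Invented example data: a 3-level risk category for six people
people = pd.DataFrame({"risk": ["high", "medium", "low", "medium", "high", "low"]})

# Manual dummy coding, matching the scheme described above:
# 'low' is the reference category, coded by zeros in both columns.
people["high_risk"] = (people["risk"] == "high").astype(int)
people["medium_risk"] = (people["risk"] == "medium").astype(int)
print(people)

# pandas can also build the n - 1 columns automatically; putting 'low' first
# in the category order makes it the reference level that gets dropped.
risk = pd.Categorical(people["risk"], categories=["low", "medium", "high"])
print(pd.get_dummies(risk, prefix="risk", drop_first=True))
```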