Statistics for Health Research
Correlation and Linear Regression
Peter T. Donnan
Professor of Epidemiology and Biostatistics

CONTENTS
• Coefficients of correlation
  • meaning
  • values
  • role
  • significance
• Regression
  • line of best fit
  • prediction
  • significance

INTRODUCTION
• Correlation
  • the strength of the linear relationship between two variables
• Regression analysis
  • determines the nature of the relationship
• Example: is there a relationship between the number of units of alcohol consumed and the likelihood of developing cirrhosis of the liver?

PEARSON'S COEFFICIENT OF CORRELATION (r)
• Measures the strength of the linear relationship between one dependent and one independent variable
  • curvilinear relationships need other techniques
• Values lie between +1 and -1
  • perfect positive correlation: r = +1
  • perfect negative correlation: r = -1
  • no linear relationship: r = 0

PEARSON'S COEFFICIENT OF CORRELATION
[Figure: scatter plots illustrating r = +1, r = -1, r = 0 and r = 0.6]

SCATTER PLOT
• BMD: dependent variable (the variable we make inferences about)
• Calcium intake: independent variable

NON-NORMAL DATA
[Figure not recovered]

NORMALISED
[Figure not recovered]

SPSS OUTPUT: SCATTER PLOT
[Figure not recovered]

SPSS OUTPUT: CORRELATIONS
[Figure not recovered]

Interpreting correlation
• A large or statistically significant r does not necessarily imply:
  • a strong relationship
    • the significance of r depends on sample size: in a large sample even a small r can be significant, and in a small sample a large r can arise by chance
  • cause and effect
    • there is a strong correlation between the number of televisions sold and the number of cases of paranoid schizophrenia
    • this does not mean that watching TV causes paranoid schizophrenia
    • it may be due to an indirect relationship

Interpreting correlation
• Variation in the dependent variable is due to:
  • the relationship with the independent variable: r²
  • random (unexplained) factors: 1 - r²
• r² is the Coefficient of Determination
• e.g. r = 0.661, so r² = 0.661² = 0.44
  • less than half of the variation in the dependent variable is due to the independent variable

Agreement
• Correlation should never be used to determine the level of agreement between repeated measures from different:
  • measuring devices
  • users
  • techniques
• It measures only the degree of linear relationship
• You can have high correlation with poor agreement, e.g. a device that always reads 10 units higher than another correlates perfectly with it but never agrees with it

Non-parametric correlation
• Make no assumptions about the distribution of the data
• Carried out on ranks
• Spearman's ρ
  • easy to calculate
• Kendall's τ
  • has some advantages over ρ
    • distribution has better statistical properties
    • easier to identify concordant / discordant pairs
• Usually both lead to the same conclusions

Role of regression
• Shows how one variable changes with another
• By determining the line of best fit
  • linear
  • curvilinear

Line of best fit
• Simplest case: linear
• Line of best fit between:
  • dependent variable Y: BMD
  • independent variable X: dietary intake of Calcium
• Y = a + bX
  • a: value of Y when X = 0
  • b: change in Y when X increases by 1

Role of regression
• Used to predict the value of the dependent variable when the value of the independent variable(s) is known
  • within the range of the known data
  • extrapolation is risky! e.g. the relation between age and bone age
• Does not imply causality

SPSS OUTPUT: REGRESSION
[Figure not recovered]

Multiple regression
• More than one independent variable
• e.g. BMD dependent on:
  • age
  • gender
  • calorific intake
  • use of bisphosphonates
  • exercise
  • etc.

Logistic regression
• The dependent variable is binary: yes / no
  • e.g. predict whether a patient with Type 1 diabetes will undergo limb amputation given history of prior ulcer, time diabetic, etc.
  • the result is a probability
• Can be extended to more than two categories
  • e.g. outcome after treatment: recovered, in remission, died

Summary
• Correlation
  • strength of linear relationship between two variables
  • Pearson's: parametric
  • Spearman's / Kendall's: non-parametric
  • Interpret with care!
• Regression
  • line of best fit
  • prediction
  • multiple regression
  • logistic regression

Statistics for Health Research
Regression: Checking the Model
Peter T. Donnan
Professor of Epidemiology and Biostatistics

Objectives of session
• Recognise the need to check the fit of the model
• Carry out checks of assumptions in SPSS for simple linear regression
• Understand the predictive model
• Understand residuals

How is the fitted line obtained?
• Use the method of least squares (LS)
• Seek to minimise the sum of squared vertical differences between each point and the fitted line
• Results in parameter estimates, or regression coefficients, of slope (b) and intercept (a): y = a + bx

[Figure: fitted line y = a + bx with intercept a; vertical axis Dependent (y), horizontal axis Explanatory (x)]

Consider the regression of minimum LDL cholesterol achieved on age
• Select Regression, Linear…
• Dependent (y): Min LDL achieved
• Independent (x): Age_Base

Output from SPSS linear regression

Coefficients(a)
                       Unstandardized Coefficients   Standardized Coefficients
Model 1                B         Std. Error          Beta                        t         Sig.
(Constant)             2.024     .105                                            19.340    .000
Age at baseline        -.008     .002                -.121                       -4.546    .000
a. Dependent Variable: Min LDL achieved

N.B. 0.008 may look very small, but it represents the DECREASE in LDL achieved for each increase of one unit of age, i.e. ONE year.

Output from SPSS linear regression: test of the slope (same Coefficients table as above)
• H0: slope b = 0
• Test statistic t = slope / SE = -4.546 (SPSS uses the unrounded estimates; the rounded values -0.008 / 0.002 give about -4), with p < 0.001, so statistically significant
• Predicted LDL = 2.024 - 0.008 × Age

Prediction Equation from linear regression
• Predicted LDL achieved = 2.024 - 0.008 × Age
• So for a man aged 65 the predicted LDL achieved = 2.024 - 0.008 × 65 = 1.504

Age    Predicted Min LDL
45     1.664
55     1.584
65     1.504
75     1.424

Assumptions of Regression
1. Relationship is linear
2. Outcome variable, and hence residuals or error terms, are approx. Normally distributed

Use Graphs and Scatterplot to obtain the Lowess line of fit
1. Create the Scatterplot and then double-click it to enter the chart editor
2. Choose the icon 'Add fit line at total'
3. Then select the type of fit, such as Lowess

Linear assumption: Fitted lowess smoothed line
• The Lowess smoothed line (red) gives a good eyeball examination of the linear assumption (green)

Definition of a residual
• A residual is the difference between the actual value and the predicted value (fitted line), i.e. the unexplained variation
• r_i = y_i - E(y_i)
• or r_i = y_i - (a + b x_i)

Residuals
• To assess the residuals in SPSS linear regression, select Plots…
• Plot the normalised (standardised) residual against the normalised (standardised) predicted value of LDL
• Select the histogram of residuals and the normal probability plot

In SPSS linear regression, select Statistics…
• Model fit
• Select confidence intervals for the regression coefficients
• Select Durbin-Watson for serial correlation, and Casewise diagnostics for identification of outliers

Output: Scatterplot of residuals vs. predicted
Note:
1) Mean of residuals = 0
2) Most of the data lie within + or - 3 SDs of the mean

Output: Histogram of standardised residuals
• Plot of residuals with normal curve superimposed

Output: Cumulative probability plot
• Look for deviation from the diagonal line to indicate non-normality

Output: Description of residuals
Descriptive statistics for residuals

Residuals Statistics(a)
                        Minimum      Maximum      Mean        Std. Deviation   N
Predicted Value         1.314867     1.843205     1.556478    .0878548         1383
Residual                -1.65389     4.0658469    .0000000    .7181448         1383
Std. Predicted Value    -2.750       3.264        .000        1.000            1383
Std. Residual           -2.302       5.660        .000        1.000            1383
a. Dependent Variable: Min LDL achieved

Worth investigation? Subjects with standardised residuals > 3

Casewise Diagnostics(a)
Case Number   Std. Residual   Min LDL   Predicted   Residual
164           5.660           5.5840    1.518153    4.0658471
209           4.395           4.5260    1.368685    3.1573148
250           3.143           3.7875    1.529325    2.2581750
268           3.064           3.8730    1.671664    2.2013357
274           3.227           4.0953    1.777153    2.3180975
362           4.095           4.5350    1.593460    2.9415398
517           3.636           4.3240    1.711788    2.6122125
849           3.968           4.3290    1.478113    2.8508873
1047          4.207           4.4360    1.413686    3.0223141
1075          3.885           4.4040    1.613219    2.7907805
1103          3.519           3.9905    1.462584    2.5279157
1229          3.016           3.7660    1.599254    2.1667456
1290          3.975           4.2345    1.379107    2.8553933
a. Dependent Variable: Min LDL achieved

Output: Model fit and serial correlation

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate   Durbin-Watson
1       .121(a)   .015       .014                .7184048                     2.034
a. Predictors: (Constant), Age at baseline

• R: size of the correlation between min LDL achieved and Age at baseline, here 0.121 (the direction is negative, as shown by the slope and Beta)
• R²: % of variation explained, here 1.5%, not particularly high
• Durbin-Watson test: serial correlation of residuals; should be approximately 2 if there is no serial correlation

Summary
After fitting any regression model, check the assumptions:
• Functional form: linearity is the default, but often not the best fit; consider quadratic…
• Check residuals for approx. normality
• Check residuals for outliers (> 3 SDs)
• All accomplished within SPSS

Practical on Model Checking
Read in 'LDL Data.sav'
1) Fit an age squared term in the min LDL model and check the fit of the model compared to the linear fit
   (Hint: use Transform / Compute to create the age squared term, and fit age and age²)
2) Fit separate linear regressions of min Chol achieved with predictors of
   (a) baseline Chol
   (b) APOE_lin
   (c) adherence
   Check assumptions and interpret results
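
For readers working outside SPSS, the model checks covered in this session (least-squares fit, standardised residuals, R², Durbin-Watson, and the comparison of a linear fit with an age squared term) can be sketched in Python with NumPy. This is only an illustrative sketch under stated assumptions: the data are simulated and are not 'LDL Data.sav', the function name check_model is hypothetical, and the standardised residual is taken as residual divided by the standard error of the estimate, which is consistent with the SPSS tables above (4.0658 / 0.7184 = 5.660).

```python
import numpy as np


def check_model(X, y):
    """Least-squares fit plus the residual summaries used in the session.

    X is a design matrix whose first column is all ones (the intercept).
    check_model is a hypothetical helper name, not an SPSS or NumPy function.
    """
    # Least squares: coefficients minimising the sum of squared residuals
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef                    # r_i = y_i - fitted value
    n, p = X.shape
    sse = float(resid @ resid)              # residual sum of squares
    s = np.sqrt(sse / (n - p))              # std. error of the estimate
    sst = float(np.sum((y - y.mean()) ** 2))
    return {
        "coef": coef,
        "resid": resid,
        "std_resid": resid / s,             # assumed to match SPSS 'Std. Residual'
        "r_squared": 1.0 - sse / sst,
        # Durbin-Watson: close to 2 when residuals show no serial correlation
        "durbin_watson": float(np.sum(np.diff(resid) ** 2)) / sse,
    }


# Simulated data (an assumption for illustration, NOT the course data set).
# The slope and intercept are borrowed from the fitted equation in the slides
# only so that the simulation resembles the worked example.
rng = np.random.default_rng(0)
age = rng.uniform(40, 85, size=500)
ldl = 2.024 - 0.008 * age + rng.normal(0.0, 0.7, size=500)

ones = np.ones_like(age)
linear = check_model(np.column_stack([ones, age]), ldl)
quadratic = check_model(np.column_stack([ones, age, age ** 2]), ldl)

print("linear coefficients (a, b):", linear["coef"])
print("R squared, linear vs quadratic:",
      linear["r_squared"], quadratic["r_squared"])
print("Durbin-Watson (linear):", linear["durbin_watson"])
print("cases with |standardised residual| > 3:",
      int(np.sum(np.abs(linear["std_resid"]) > 3)))
```

Because the quadratic model contains the linear model as a special case, its R² can never be lower; the practical question is whether the improvement is large enough to matter, which is what the comparison in Practical 1 asks you to judge.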