Correlation and regression analysis
Week 8
Research Methods & Data Analysis
Dr. Mario Mazzocchi

Lecture outline
• Correlation
• Regression analysis
• The least squares estimation method
• SPSS and regression output
• Task overview

Correlation
• Correlation measures the extent to which two (or more) variables are related
  – Correlation expresses a relationship that is not necessarily exact (e.g. height and weight)
  – Positive correlation indicates that the two variables move in the same direction
  – Negative correlation indicates that they move in opposite directions

Covariance
• Covariance measures the “joint variability” of two variables
• If two variables are independent, their covariance is zero (however, Cov = 0 does not imply that the two variables are independent)
$$\mathrm{Cov}(x, y) = \sigma_{xy} = E(xy) - E(x)E(y)$$
• where E(…) denotes the expected value (i.e. the average value)

Correlation coefficient
• The correlation coefficient r measures (in the range –1 to +1) the strength of the linear relationship between two variables
  – r = 0 means no linear correlation
  – r = +1 means perfect positive correlation
  – r = –1 means perfect negative correlation
• Perfect correlation indicates that all observations lie exactly on a straight line, i.e. y is an exact linear function of x

Correlation coefficient and covariance
• Pearson correlation coefficient:
$$r = \frac{\mathrm{Cov}(x, y)}{\sqrt{\mathrm{Var}(x)\,\mathrm{Var}(y)}}$$
• Correlation coefficient, POPULATION:
$$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
• Correlation coefficient, SAMPLE:
$$r = \frac{s_{xy}}{s_x s_y}, \qquad s_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i - \frac{\sum_{i=1}^{n} x_i}{n} \cdot \frac{\sum_{i=1}^{n} y_i}{n}$$

Bivariate and multivariate correlation
• Bivariate correlation
  – 2 variables
  – Pearson correlation coefficient
• Partial correlation
  – The correlation between two variables after allowing for the effect of other “control” variables
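The covariance and correlation formulas above can be checked by hand on a small data set. The following is a minimal Python sketch, assuming the 1/n form of the formulas shown on the slides; the function names and the height/weight values are hypothetical and are not taken from the lecture data set.

```python
import math


def mean(values):
    """Arithmetic mean, used here as the expected value E(.)."""
    return sum(values) / len(values)


def covariance(x, y):
    """Cov(x, y) = E(xy) - E(x)E(y), using the 1/n form from the slides."""
    xy = [xi * yi for xi, yi in zip(x, y)]
    return mean(xy) - mean(x) * mean(y)


def pearson_r(x, y):
    """r = Cov(x, y) / sqrt(Var(x) * Var(y)), where Var(x) = Cov(x, x)."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


# Hypothetical toy data (illustration only, not the lecture data set).
height = [150, 160, 170, 180, 190]
weight = [52, 58, 69, 77, 90]

print("covariance: ", covariance(height, weight))
print("correlation:", pearson_r(height, weight))
```

Because r divides the covariance by the two standard deviations, it does not depend on the units of measurement, whereas the covariance does.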

Significance level in correlation
• Level of correlation (value of the correlation coefficient): indicates to what extent the two variables “move together”
• Significance of correlation (p value): since the correlation coefficient is computed on a sample, this indicates whether the relationship appears to be statistically significant
• Examples
  – Correlation is 0.50, but not significant: the sampling error is so high that the actual correlation could even be 0
  – Correlation is 0.10 and highly significant: the level of correlation is very low, but we can be confident that it is different from 0

Correlation and covariance in SPSS
• Choose between bivariate & partial

Bivariate correlation
• Select the variables you want to analyse
• Request the significance level (two-tailed)
• Ask for additional statistics (if necessary)

Bivariate correlation output

Correlations
                                       Shopping style   Use coupons   Amount spent
Shopping style   Pearson Correlation   1                .157**        .159**
                 Sig. (2-tailed)       .                .000          .000
                 N                     779              779           779
Use coupons      Pearson Correlation   .157**           1             .291**
                 Sig. (2-tailed)       .000             .             .000
                 N                     779              779           779
Amount spent     Pearson Correlation   .159**           .291**        1
                 Sig. (2-tailed)       .000             .000          .
                 N                     779              779           779
**. Correlation is significant at the 0.01 level (2-tailed).

Partial correlations
• List of variables to be analysed
• Control variables

Partial correlation output

- - -  P A R T I A L   C O R R E L A T I O N   C O E F F I C I E N T S  - - -

Controlling for..   SIZE   STYLE

             AMTSPENT    USECOUP     ORG
AMTSPENT     1.0000      .2677       -.0116
             (    0)     (  775)     (  775)
             P= .        P= .000     P= .746
USECOUP      .2677       1.0000      .0500
             (  775)     (    0)     (  775)
             P= .000     P= .        P= .164
ORG          -.0116      .0500       1.0000
             (  775)     (  775)     (    0)
             P= .746     P= .164     P= .

(Coefficient / (D.F.) / 2-tailed Significance)
" . " is printed if a coefficient cannot be computed

Partial correlations still measure the correlation between two variables, but eliminate the effect of other variables, i.e. the correlations are computed as if all consumers shopped in stores of identical size and had the same shopping style.

Bivariate and partial correlations
• Correlation between Amount spent and Use of coupon
  – Bivariate correlation: 0.291 (p value 0.000)
  – Partial correlation: 0.268 (p value 0.000)
• The amount spent is positively correlated with the use of coupons (0=no use, 1=from newspaper, 2=from mailing, 3=both)
• The level of correlation does not change much after accounting for different shop sizes and shopping styles

Linear regression analysis
$$y_i = \alpha + \beta x_i + \varepsilon_i$$
• y_i: dependent variable
• α: intercept
• β: regression coefficient
• x_i: independent variable (explanatory variable, regressor…)
• ε_i: error

Regression analysis
[Figure: scatter plot of cholesterol (mg/100 ml) on the y axis against age on the x axis]

Example
• We want to investigate whether there is a relationship between cholesterol and age in a sample of 18 people
• The dependent variable is the cholesterol level
• The explanatory variable is age

What regression analysis does
• Determines whether a relationship exists between the dependent and explanatory variables
• Determines how much of the variation in the dependent variable is explained by the independent variable (goodness of fit)
• Allows us to predict the values of the dependent variable
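The partial correlations discussed earlier in this lecture can be illustrated with a small computation. The sketch below assumes a single control variable z (the SPSS output controls for two, SIZE and STYLE) and compares two standard routes: the first-order formula based on the three bivariate correlations, and the correlation between the residuals left after regressing each variable on the control. All function names and data values are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def cov(x, y):
    """Cov(x, y) = E(xy) - E(x)E(y) (1/n form)."""
    return mean([a * b for a, b in zip(x, y)]) - mean(x) * mean(y)


def corr(x, y):
    """Pearson correlation coefficient."""
    return cov(x, y) / math.sqrt(cov(x, x) * cov(y, y))


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    r_xy, r_xz, r_yz = corr(x, y), corr(x, z), corr(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


def residuals(v, z):
    """Residuals of a least squares regression of v on z (with intercept)."""
    b = cov(v, z) / cov(z, z)
    a = mean(v) - b * mean(z)
    return [vi - (a + b * zi) for vi, zi in zip(v, z)]


def partial_corr_residuals(x, y, z):
    """Same quantity, computed as the correlation between the two residual series."""
    return corr(residuals(x, z), residuals(y, z))


# Hypothetical toy data (illustration only, not the lecture data set).
size = [1, 1, 2, 2, 3, 3, 1, 2]
coupons = [0, 1, 0, 2, 1, 3, 2, 3]
amount = [100, 130, 150, 190, 200, 260, 150, 230]

print("bivariate r:        ", corr(amount, coupons))
print("partial r (formula):", partial_corr(amount, coupons, size))
print("partial r (resid.): ", partial_corr_residuals(amount, coupons, size))
```

The two partial values should agree up to rounding error, which is one way to read "eliminating the effect" of the control variable: what remains is the correlation between the parts of the two variables that the control does not explain.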

Regression and correlation
• Correlation: no causal relationship is assumed
• Regression: we assume that the explanatory variables “cause” the dependent variable
  – Bivariate: one explanatory variable
  – Multivariate: two or more explanatory variables

How to estimate the regression coefficients
• The objective is to estimate the population parameters α and β on our data sample:
$$y_i = a + b x_i + e_i$$
• A good way to estimate them is by minimising the errors e_i, where each e_i is the difference between the actual observation and the estimated (predicted) one

[Figure: linear regression of cholesterol (mg/100 ml) on age, with the fitted line Cholesterol (mg/100 ml) = 140.36 + 4.58 * age, R-Square = 0.65]
The objective is to identify the line (i.e. the a and b coefficients) that minimises the distance between the actual points and the fitted line.

The least squares method
• This is based on minimising the sum of the squared distances (errors) rather than the distances themselves
$$b = \frac{\mathrm{Cov}(x, y)}{\mathrm{Var}(x)} = \frac{s_{xy}}{s_x^2} = r\,\frac{s_y}{s_x}$$
$$a = \bar{y} - b\,\bar{x}$$

Bivariate regression in SPSS

Regression dialog box
• Dependent variable
• Explanatory variable
• Leave this unchanged!

Regression output

Coefficients(a)
               Unstandardized Coefficients   Standardized Coefficients
Model 1        B          Std. Error         Beta                        t        Sig.
(Constant)     140.359    34.715                                         4.043    .001
Age            4.577      .838               .807                        5.464    .000
a. Dependent Variable: Cholesterol (mg/100 ml)

• B: value of the coefficients
• Sig.: statistical significance (is the coefficient different from 0?)

Model diagnostics: goodness of fit

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .807(a)  .651       .629                45.218
a. Predictors: (Constant), Age

The value of R square lies between 0 and 1 and represents the proportion of total variation that is explained by the regression model.

R-square
$$SS_y = SS_{reg} + SS_{res}$$
(total variation = variation explained by regression + residual variation)
$$R^2 = \frac{SS_{reg}}{SS_y}$$
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
$$\hat{y}_i = a + b x_i$$

Multivariate regression
• The principle is identical to bivariate regression, but there are more explanatory variables
• The goodness of fit can be measured through the adjusted R-square, which takes into account the number of explanatory variables
$$y_i = b_0 + b_1 x_{1i} + b_2 x_{2i} + \dots + b_n x_{ni} + e_i$$

Multivariate regression in SPSS
• Analyze / Regression / Linear
• Simply select more than one explanatory variable

Output

Coefficients(a)
                     Unstandardized Coefficients   Standardized Coefficients
Model 1              B          Std. Error         Beta                        t         Sig.
(Constant)           296.482    19.792                                         14.980    .000
Health food store    9.721      15.012             .024                        .648      .517
Size of store        9.753      6.070              .059                        1.607     .109
Gender               -69.598    7.483              -.302                       -9.301    .000
Vegetarian           -1.910     12.570             -.005                       -.152     .879
Shopping style       22.760     6.069              .123                        3.750     .000
Use coupons          30.417     3.512              .285                        8.662     .000
a. Dependent Variable: Amount spent

Coefficient interpretation
• The constant represents the amount spent when all other variables are 0 (£296.5)
• The coefficients of Health food store, Size of store and Vegetarian are not significantly different from 0
• Gender coeff = -69.6: on average, being a woman (G=1) implies spending £69.6 less
• Shopping style coeff = +22.8 × S
  – S=1 (shops for himself) = +22.8
  – S=2 (shops for himself & spouse) = +45.6
  – S=3 (shops for himself & family) = +68.4
Categorization problems?
• Coupon use coeff = +30.4 × C
  – C=1 (do not use coupons) = +30.4
  – C=2 (coupons from newspapers) = +60.8
  – C=3 (coupons from mailings) = +91.2
  – C=4 (coupons from both) = +121.6

Prediction
• On average, how much will someone with the following characteristics spend?
  – Male (G=0)
  – Shopping for family (S=3)
  – Not using coupons (C=1)
$$AMT = 296.5 - 69.6\,G + 22.8\,S + 30.4\,C = 296.5 - 69.6(0) + 22.8(3) + 30.4(1) = 395.3$$

How good is the model?

Model Summary
Model   R        R Square   Adjusted R Square   Std. Error of the Estimate
1       .439(a)  .193       .187                104.08167
a. Predictors: (Constant), Use coupons, Vegetarian, Gender, Health food store, Shopping style, Size of store

• The regression model explains only about 19% of the total variation in the amount spent (R square = .193, adjusted R square = .187)

Task A
• Examine the relationship between the amount spent and the following customer characteristics:
  – Being male/female
  – Being vegetarian
  – Shopping for himself / for himself and others
  – Shopping style (weekly, bi-weekly, etc.)
Potential methods:
• Battery of hypothesis tests & analysis of variance
• Regression analysis

Task B
• Examine the relationship between the amount spent and the following customer characteristics:
  – Hypothesis: the average amount spent in health-oriented shops is higher than in other shops. True or false?
  – Test the same hypothesis accounting for different shop sizes
Potential methods:
• Battery of hypothesis tests & analysis of variance
• Regression analysis

Task C
• Find a relationship between the average amount spent per store and the following store characteristics:
  – Size of store
  – Health-oriented store
  – Store organisation
Potential methods:
• Transform the customer data set into a store data set
• Battery of ANOVA
• Regression analysis

Task D
• Hypothesis: is the amount spent by those who use coupons significantly higher?
• What is the most effective way of distributing coupons?
  – By mail
  – In newspapers
  – Both
Potential methods:
• Recode the variable into 1=not using coupons and 2=using coupons
• Hypothesis testing
• Analysis of variance
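
For the tasks that list regression analysis as a potential method, it may help to see the least squares formulas from this lecture (b = s_xy / s_x², a = ȳ − b·x̄) and the R-square decomposition worked through outside SPSS. The following is a minimal Python sketch for the bivariate case only; the function names and the age/cholesterol values are hypothetical and are not the 18 observations used in the lecture.

```python
def mean(v):
    return sum(v) / len(v)


def least_squares(x, y):
    """Return (a, b) for y = a + b*x, with b = s_xy / s_x^2 and a = ybar - b*xbar."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = mean([xi * yi for xi, yi in zip(x, y)]) - x_bar * y_bar
    s_xx = mean([xi * xi for xi in x]) - x_bar**2
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def sums_of_squares(x, y, a, b):
    """Return (SS_y, SS_reg, SS_res) for the fitted line y_hat = a + b*x."""
    y_bar = mean(y)
    fitted = [a + b * xi for xi in x]
    ss_y = sum((yi - y_bar) ** 2 for yi in y)
    ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return ss_y, ss_reg, ss_res


def predict(a, b, x_new):
    """Predicted value of the dependent variable for a new x."""
    return a + b * x_new


# Hypothetical toy data (illustration only, not the lecture sample).
age = [25, 30, 35, 40, 45, 50, 55, 60]
cholesterol = [250, 270, 310, 300, 350, 360, 400, 410]

a, b = least_squares(age, cholesterol)
ss_y, ss_reg, ss_res = sums_of_squares(age, cholesterol, a, b)
print("intercept a:", a)
print("slope b:    ", b)
print("R-square:   ", ss_reg / ss_y)

# Prediction with the coefficients reported in the lecture output
# (a = 140.36, b = 4.58) for a hypothetical 40-year-old.
print("predicted cholesterol at age 40:", predict(140.36, 4.58, 40))
```

In the bivariate case the R-square obtained this way equals the square of the Pearson correlation coefficient, which matches the lecture output (R = .807, R square = .651).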