Correlation and regression
Chapter 8
Statistics for Marketing & Consumer Research
Copyright © 2008 Mario Mazzocchi

Correlation
• Correlation measures the strength and direction of the relationship between two metric variables
• For example, the relationship between weight and height, or between consumption and price: it is not a perfect (deterministic) relationship, but we expect to find one on average

Correlation
• Correlation measures to what extent two (or more) variables are related
• It expresses a relationship that is not necessarily precise (e.g. height and weight)
• Positive correlation indicates that the two variables move in the same direction
• Negative correlation indicates that they move in opposite directions
• The question is: do the two variables move together?

Covariance
• Covariance measures the co-movement of two variables x and y across n observations
• Sample covariance estimate:

COV(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}

• For each observation i, a situation where both x and y are above (or both below) their respective sample means increases the covariance, while a situation where one variable is above its sample mean and the other below decreases it
• Unlike variance, covariance can take both positive and negative values
• If x and y always move in opposite directions, all terms in the summation are negative, leading to a large negative covariance; if they always move in the same direction, the covariance is large and positive

From covariance to correlation
• Covariance, like variance, depends on the measurement units
• Measuring prices in dollars and consumption in ounces gives a different covariance from measuring prices in euros and consumption in kilograms, even when both refer to exactly the same goods and observations
• Some form of normalization is needed to avoid the measurement-unit problem
• The usual approach is standardization: subtract the mean and divide by the standard deviation
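The last two slides translate directly into a few lines of code. A minimal sketch in Python/NumPy (the data and the currency conversion factor are made up for illustration):

```python
import numpy as np

# Hypothetical weekly prices (dollars) and consumption (ounces)
price_usd = np.array([2.0, 2.5, 3.0, 3.5, 4.0])
cons_oz = np.array([40.0, 36.0, 30.0, 26.0, 20.0])

def sample_cov(x, y):
    # s_xy = sum((x_i - xbar) * (y_i - ybar)) / (n - 1)
    return ((x - x.mean()) * (y - y.mean())).sum() / (len(x) - 1)

print(sample_cov(price_usd, cons_oz))  # matches np.cov(price_usd, cons_oz)[0, 1]

# Same data in euros and kilograms (assumed rate 0.9 EUR/USD, 1 oz = 0.0283495 kg):
# the covariance changes with the units...
print(sample_cov(price_usd * 0.9, cons_oz * 0.0283495))

# ...but the covariance of the standardized variables does not; it is the
# (unit-free) correlation coefficient
def standardize(x):
    return (x - x.mean()) / x.std(ddof=1)

print(sample_cov(standardize(price_usd), standardize(cons_oz)))
```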
Correlation
• Since the numerator of the covariance is already based on differences from the means, all that is required is dividing by the sample standard deviations of both x and y (the (n−1) factors cancel):

CORR(X, Y) = r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)\, s_x s_y} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}}

Correlation coefficient
• The correlation coefficient r measures the relationship between two variables on a scale from −1 to +1
• r = 0 means no correlation
• r = +1 means perfect positive correlation
• r = −1 means perfect negative correlation
• Perfect correlation indicates an exact linear relationship: a given variation in x always corresponds to a proportional variation in y

Correlation and causation
• Note that no assumption or consideration is made about causality
• A positive correlation between x and y does not mean that an increase in x causes an increase in y, only that the two variables move together (to some extent)
• Correlation is therefore symmetric: r_xy = r_yx

Correlation as a sample statistic
• Correlation is more than a descriptive indicator; it can be regarded as a sample statistic, which allows hypothesis testing
• The sample measure of correlation is affected by sampling error: for example, a small but positive correlation observed in a sample might hide a zero (or negative) true correlation in the population

Correlation and inference
• Some assumptions (checks) are needed:
a) the relationship between the two variables should be linear (a scatterplot helps to identify nonlinear relationships)
b) the error variance should be similar across different levels of the variables
c) the two variables should come from similar statistical distributions
d) if the two variables can be assumed to come from normal distributions, it becomes possible to run hypothesis tests

Hypothesis testing on correlations
• Condition (d) above can be ignored when the sample is large enough (fifty or more observations)
• In these cases one can exploit the probabilistic nature of sampling to run a hypothesis test on the sample correlation coefficient
• The null hypothesis to be tested is that the correlation coefficient in the population is zero

Bivariate correlations
• There are two elements to be considered:
• the value of the correlation coefficient r, which indicates to what extent the two variables move together
• the significance of the correlation (a p-value), which supports the decision whether to reject the hypothesis that r = 0 (no correlation in the population)
• Examples:
– a correlation coefficient r = 0.6 suggests a relatively strong relationship, but a p-value well above 0.05 indicates that the hypothesis of zero population correlation cannot be rejected at the 95% confidence level
– r = 0.1 with a p-value below 0.01: thanks to a larger sample, one can be confident at the 99% level that there is a positive relationship between the two variables, although the relationship is weak
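Both elements (r and its p-value) come out of a single call in most statistics packages. A minimal sketch with SciPy, using simulated data (the coefficients and sample size are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = 0.3 * x + rng.normal(size=100)   # weak positive relationship by construction

r, p_value = stats.pearsonr(x, y)    # H0: population correlation is zero
print(f"r = {r:.3f}, p = {p_value:.4f}")
# A small r can still be significant (small p) if the sample is large enough.
```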
The influence of third factors
• The correlation coefficient between x and y is only meaningful if one can safely assume that no other intervening variable affects the values of x and y
• In the supermarket example, a negative correlation between prices and consumption is expected
• However, suppose that one day the government introduces a new tax which reduces average available income by 10%. Consumers have less money and consume less. The supermarket tries to retain its customers by cutting all prices, so that the price reduction mitigates the impact of the new tax. If we only observe prices and consumption, we may observe lower prices together with lower consumption, and the bivariate correlation coefficient might return a positive value
• Thus, we can only use the correlation coefficient when the ceteris paribus condition holds (all other relevant variables being constant)
• This is rarely the case, so it is necessary to control for other influential variables, such as income in the price-consumption relationship

Partial correlation
• The partial correlation coefficient evaluates the relationship between two variables after controlling for the effects of one or more additional variables
• For example, if x is price, y is consumption and z is income, the partial correlation coefficient corrects the correlation between x and y for the correlation between x and z and the correlation between y and z:

r_{xy \cdot z} = \frac{r_{xy} - r_{xz}\, r_{yz}}{\sqrt{(1 - r_{xz}^2)(1 - r_{yz}^2)}}

• This can be generalized to control for more variables
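The formula above is easy to verify numerically on the supermarket story. A minimal sketch (the data-generating coefficients are invented, and `partial_corr` is an illustrative helper, not a standard library function):

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, z):
    """Partial correlation of x and y controlling for z,
    from the three pairwise Pearson correlations."""
    r_xy = stats.pearsonr(x, y)[0]
    r_xz = stats.pearsonr(x, z)[0]
    r_yz = stats.pearsonr(y, z)[0]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(0)
income = rng.normal(size=200)                        # common driver (the tax shock)
price = 0.7 * income + rng.normal(size=200)          # prices cut when income falls
cons = -0.8 * price + 3.0 * income + rng.normal(size=200)

print(stats.pearsonr(price, cons)[0])   # raw correlation: pushed upward (here positive)
                                        # because price and consumption fall together
print(partial_corr(price, cons, income))  # negative: the price effect at constant income
```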
Other correlation statistics
• Part (semi-partial) correlation: controls for the correlation between the influencing variable z and only one of the two variables x and y
• Non-parametric correlation statistics (rank-order association):
• Spearman's rho
• Kendall's tau-b
• Multiple correlation coefficient (regression analysis): the joint relationship between one variable (the dependent variable) and a set of other variables

Correlation and covariance in SPSS
• Choose between bivariate and partial correlation

Bivariate correlation
• Select the variables you want to analyse
• Non-parametric association measures can be requested
• Request the significance level (two-tailed)
• Descriptive statistics (including the covariance) are available as options

Pearson bivariate correlation output
• Variables: "In a typical week how much fresh or frozen chicken do you buy for your household consumption (Kg.)?" (below: chicken purchases), average price, income level
• Pearson correlation (Sig. 2-tailed; N):
  chicken purchases × average price: −.327** (.000; 438)
  chicken purchases × income level: .088 (.125; 304)
  average price × income level: .091 (.117; 300)
  **. Correlation is significant at the 0.01 level (2-tailed).

Non-parametric tests output
• Kendall's tau-b (Sig. 2-tailed; N):
  chicken purchases × average price: −.469** (.000; 438)
  chicken purchases × income level: .059 (.176; 304)
  average price × income level: −.008 (.854; 300)
• Spearman's rho (Sig. 2-tailed; N):
  chicken purchases × average price: −.630** (.000; 438)
  chicken purchases × income level: .079 (.171; 304)
  average price × income level: −.009 (.880; 300)
  **. Correlation is significant at the 0.01 level (2-tailed).

Partial correlations
• List of variables to be analysed
• Control variables

Partial correlation output
• Control variable: average price
• Chicken purchases × income level: partial correlation = .129, Sig. (2-tailed) = .026, df = 297
• Partial correlations still measure the correlation between two variables, but eliminate the effect of the other variables: here the coefficient reflects the relationship between consumption and income for consumers facing the same price

Bivariate linear regression

y_i = \alpha + \beta x_i + \varepsilon_i

• y_i is the dependent variable, x_i the explanatory variable, α the intercept, β the regression coefficient and ε_i the (random) error term
• Causality (from x to y) is now assumed
• "Regressing" stands for going backwards from the dependent variable to its determinant
• The error term embodies everything that is not accounted for by the linear relationship
• The unknown parameters α and β need to be estimated (usually on sample data). We refer to the sample parameter estimates as a and b

Least squares estimation of the unknown parameters
• For given values of the parameters, the error (residual) term for each observation is

e_i = y_i - a - b x_i

• The least squares parameter estimates are those that minimize the sum of squared errors:

SSE = \sum_{i=1}^{n} (y_i - a - b x_i)^2 = \sum_{i=1}^{n} e_i^2
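In the bivariate case this minimization has a closed-form solution, b = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)² and a = ȳ − b x̄. A minimal sketch on made-up data:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# Closed-form least squares estimates for the bivariate model
b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

residuals = y - (a + b * x)
sse = (residuals ** 2).sum()
print(a, b, sse)

# Any other (a, b) pair gives a larger sum of squared errors
sse_other = ((y - (a + 0.1 + b * x)) ** 2).sum()
assert sse_other > sse
```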
Assumptions on the error term (1)
1. The error term has a zero mean
2. The variance of the error term does not vary across cases (homoskedasticity)
3. The error term for each case is independent of the error term for other cases
4. The error term is also independent of the values of the explanatory (independent) variable
5. The error term is normally distributed

Assumptions on the error term (2)
1. The error term has a zero mean
• otherwise there would be a systematic bias, which should instead be captured by the intercept
2. The variance of the error term does not vary across cases (homoskedasticity)
• for example, the error variability should not become larger for cases with very large values of the independent variable. Heteroskedasticity is the opposite condition
3. The error term for each case is independent of the error term for other cases
• The omission of relevant explanatory variables would break this assumption: an omitted independent variable (correlated across cases by definition) ends up in the residual term and induces correlation

Assumptions on the error term (3)
4. The error term is also independent of the values of the explanatory (independent) variable
• otherwise the variable would not be truly independent, being affected by changes in the dependent variable
• A frequent problem is sample selection bias, which occurs when non-probabilistic samples are used, that is, the sample only includes units from a specific group
• Example: measuring response to advertising by sampling those who purchase a deodorant after seeing an advert; those who saw the advert but decided not to buy are not taken into account. Correlated observations do not enter the analysis, and this leads to a correlated error term
5. The error term is normally distributed
• This corresponds to assuming that the dependent variable is normally distributed for any value of the independent variable(s)
• Normality makes hypothesis testing easier, but there are ways to overcome the problem if the distribution is not normal

Prediction
• Once a and b have been estimated, it is possible to predict the value of the dependent variable for any given value of the explanatory variable:

\hat{y}_j = a + b x_j

Model evaluation
• An evaluation of the model performance can be based on the residuals, which provide information on how well the model predictions fit the original data (goodness-of-fit)
• Since the parameters a and b are estimated on a sample, just like a mean, they are accompanied by standard errors, which measure the precision of these estimates and depend on the sample size
• Knowledge of the standard errors opens the way to hypothesis testing and confidence intervals for the regression coefficients (see lecture 6)
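Under the assumptions above, the standard error of the slope has a simple form: se(b) = s / √Σ(x_i − x̄)², with s² = SSE/(n − 2). A minimal sketch continuing the least-squares example from before:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
a = y.mean() - b * x.mean()

sse = ((y - a - b * x) ** 2).sum()
s2 = sse / (len(x) - 2)                            # residual variance estimate
se_b = np.sqrt(s2 / ((x - x.mean()) ** 2).sum())   # standard error of the slope
t_stat = b / se_b                                  # t statistic for H0: beta = 0
print(se_b, t_stat)
```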
Hypothesis testing on regression coefficients
• t-test on each of the individual coefficients
• Null hypothesis: the corresponding population coefficient is zero
• t statistic: simply divide the estimate (for example a, the estimate of α) by its standard error (s_a)
• The p-value allows one to decide whether or not to reject the null hypothesis that the coefficient is zero, depending on the confidence level
• F-test (multiple independent variables, as discussed later)
• It is run jointly on all coefficients of the regression model
• Null hypothesis: all coefficients are zero
• The F-test in linear regression corresponds to the ANOVA test (and the GLM is a regression model which can be adopted to run ANOVA techniques)

Coefficient of determination
• The sum of squared errors, SSE = Σ(y_i − a − b x_i)² = Σe_i², measures the variability which is not explained by the regression model
• SSE is a portion of the total variation, measured by SST = Σ(y_i − ȳ)²; the remainder, SSR = Σ(a + b x_i − ȳ)², is the portion of variability explained by the regression model:

SST = SSR + SSE

• The natural candidate for measuring how well the model fits the data is the coefficient of determination, which varies between 0 (the model explains none of the variability of the dependent variable) and 1 (the model fits the data perfectly):

R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}

• In a bivariate regression, the coefficient of determination is the square of the correlation coefficient between y and x

Bivariate regression in SPSS
• Analyze / Regression / Linear

Regression output
• Model summary: R = .232, R² = .054, adjusted R² = .052, std. error of the estimate = .532591 (predictor: household size). Only 5% of total variation is explained by the model (the correlation is 0.23)
• ANOVA: regression SS = 8.036 (df = 1), residual SS = 141.259 (df = 498), total SS = 149.295 (df = 499); F = 28.329, Sig. = .000. The F-test rejects the hypothesis that all coefficients are zero
• Coefficients (dependent variable: eggs):
  (Constant): B = .235, Std. Error = .049, t = 4.834, Sig. = .000
  Household size: B = .095, Std. Error = .018, Beta = .232, t = 5.323, Sig. = .000
• Both parameters are statistically different from zero according to the t-test
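The same kind of output (coefficients, standard errors, t-tests, F-test, R²) can be reproduced outside SPSS. A minimal sketch with statsmodels on simulated data that loosely mimics the egg-consumption example (the coefficients are invented):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
household_size = rng.integers(1, 7, size=500).astype(float)
eggs = 0.24 + 0.1 * household_size + rng.normal(scale=0.5, size=500)

X = sm.add_constant(household_size)   # adds the intercept column
model = sm.OLS(eggs, X).fit()
print(model.summary())                # coefficients, t-tests, F-test, R-squared
```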
Multiple regression
• The principle is identical to bivariate regression, but there are more explanatory variables:

y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_k x_{ki} + \varepsilon_i

Additional assumption
6. The independent variables are also independent of each other
• Otherwise we run into a double-counting problem and it becomes very difficult to separate the effects of the individual variables
• Consequences: inefficient estimates; an apparently good model but poor forecasts

Collinearity
• Assumption 6 refers to the so-called collinearity (or multicollinearity) problem
• Collinearity exists when two explanatory variables are correlated
• Perfect collinearity: one of the variables has a perfect (+1 or −1) correlation with another variable, or with a linear combination of more than one variable. This makes estimation impossible
• Strong collinearity makes the coefficient estimates unstable and inefficient, which means that the standard errors of the estimates are inflated compared to the best possible solution
• Furthermore, the solution becomes very sensitive to the choice of which variables to include in the model
• When there is multicollinearity the model might look very good at first glance, but produce poor forecasts

Goodness-of-fit
• The coefficient of determination R² (still computed as the ratio of SSR to SST) always increases with the inclusion of additional regressors
• This conflicts with the parsimony principle: models with many explanatory variables are more demanding in terms of data (higher costs) and computation
• If alternative nested models are compared, those with more explanatory variables always fit better
• Thus, a more appropriate indicator is the adjusted R², which accounts for the number of explanatory variables (k) in relation to the number of observations (n):

\bar{R}^2 = 1 - (1 - R^2)\, \frac{n - 1}{n - k - 1}

Multiple regression in SPSS
• Analyze / Regression / Linear
• Simply select more than one explanatory variable
• Click on STATISTICS for collinearity diagnostics and further statistics

Additional statistics
• Part and partial correlations are provided among the additional statistics

Output
• Model summary: R = .439, R² = .193, adjusted R² = .176, std. error of the estimate = 1.54949. Predictors: average price, "chicken is a safe food", "in my household we like chicken", gross annual household income range, age, household size
• The model accounts for 19.3% of the variability in the dependent variable; after adjusting for the number of regressors, the R² is 0.176
• ANOVA: regression SS = 166.371 (df = 6), residual SS = 696.268 (df = 290), total SS = 862.639 (df = 296); F = 11.549, Sig. = .000. The null hypothesis that all coefficients are zero is strongly rejected. Dependent variable: weekly household purchases of fresh or frozen chicken (kg)

Output
• Coefficients: the constant is B = −.362 (Std. Error = .684, t = −.529, Sig. = .597). Only household size (B = .277, Std. Error = .074, Beta = .210, t = 3.756, Sig. = .000) and average price (B = −.108, Std. Error = .020, Beta = −.299, t = −5.492, Sig. = .000) emerge as significantly different from zero; the coefficients of the attitudinal variables, age and income are not significant
• Tolerance values close to 1 and low VIF values indicate that multicollinearity is not an issue
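Tolerance and VIF can be computed directly: for each regressor j, VIF_j = 1/(1 − R_j²), where R_j² comes from regressing x_j on the other regressors, and tolerance is 1/VIF_j. A minimal sketch with statsmodels (the variables are simulated stand-ins for the survey data):

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(2)
n = 300
income = rng.normal(size=n)
household = 0.3 * income + rng.normal(size=n)   # mildly correlated with income
price = rng.normal(size=n)
X = sm.add_constant(np.column_stack([income, household, price]))

for j, name in enumerate(["income", "household", "price"], start=1):
    vif = variance_inflation_factor(X, j)
    print(f"{name}: VIF = {vif:.2f}, tolerance = {1 / vif:.2f}")
```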
Coefficient interpretation – intercept
• The constant represents the predicted amount purchased when all other variables are zero. Its estimate is negative, but the hypothesis that the constant is zero is not rejected
• A household with zero members and no income is unlikely to consume chicken
• In general, estimates of the intercept are often unsatisfactory, because frequently there are no data points with values of the independent variables close or equal to zero

Coefficient interpretation
• The significant coefficients tell one that:
• each additional household member means an increase in consumption of 277 grams
• a £1 increase in price leads to a decrease in consumption of 108 grams

Stepwise regression procedure
• Explores each explanatory variable before entering it in the model (a code sketch of the same idea follows the stepwise output below). The procedure:
1. Adds the variable which shows the highest bivariate correlation with the dependent variable
2. The partial correlations of all remaining candidate variables (after controlling for the independent variable already included in the model) are explored, and the explanatory variable with the highest partial correlation coefficient enters the model
3. The model is re-estimated with two explanatory variables; the decision whether to keep the second one is based on the variation of the F-value or other goodness-of-fit statistics, such as the adjusted R², information criteria, etc. If the variation is not significant, the second variable is not included in the model; otherwise it stays, and the process continues with a third variable (back to step 2)
• At each step, the procedure may also drop a variable already included in the model, if removing it causes no significant decrease in the F-value (or in whatever stepwise criterion is targeted)

Forward and backward
• Forward regression works exactly like stepwise regression, but variables are only entered, never dropped. The process stops when there is no further significant increase in the F-value
• Backward regression starts by including all the independent variables and works backwards: at each step it drops the variable whose removal causes the smallest (non-significant) decrease in the F-value, and it stops when any further removal would cause a significant decrease

Stepwise regression in SPSS
• Choose the variable-selection method in the regression dialog and proceed as usual

Stepwise regression output
• Model 1 (average price only): constant B = 2.067 (SE .170, t = 12.190, Sig. .000); average price B = −.124 (SE .020, Beta = −.343, t = −6.270, Sig. .000)
• Model 2 (average price and household size): constant B = 1.128 (SE .268, t = 4.204, Sig. .000); average price B = −.111 (SE .019, Beta = −.305, t = −5.685, Sig. .000); household size B = .314 (SE .071, Beta = .238, t = 4.427, Sig. .000)
• Dependent variable: weekly household purchases of fresh or frozen chicken (kg)
• The first model only includes the "average price" variable; in the second step household size enters; no other variable enters the model. Tolerance (.975) and VIF (1.026) confirm that collinearity is not a problem
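SPSS implements the selection loop internally, but the forward part of the logic is easy to sketch. A simplified forward-selection loop using coefficient p-values as the entry criterion (a hypothetical simplification; real stepwise procedures typically use the F-change statistic or information criteria):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

def forward_select(y, X, alpha=0.05):
    """Greedy forward selection: at each step add the candidate
    whose coefficient has the lowest p-value, if below alpha."""
    selected = []
    candidates = list(X.columns)
    while candidates:
        pvals = {}
        for c in candidates:
            model = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = model.pvalues[c]
        best = min(pvals, key=pvals.get)
        if pvals[best] >= alpha:
            break                      # no remaining variable is significant
        selected.append(best)
        candidates.remove(best)
    return selected

rng = np.random.default_rng(3)
X = pd.DataFrame(rng.normal(size=(200, 3)), columns=["price", "hh_size", "age"])
y = 2.0 - 0.5 * X["price"] + 0.3 * X["hh_size"] + rng.normal(size=200)
print(forward_select(y, X))   # typically ['price', 'hh_size']; 'age' never enters
```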
Regression, ANOVA and the GLM
• (Factorial) ANOVA can be seen as a regression model where all explanatory variables are binary (dummy) variables
• Each of the dummy variables indicates whether a given factor is present or not
• The t-tests on the coefficients of the dummies are mean-comparison tests
• The F-test on the regression model is the test for factorial (n-way) ANOVA
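This equivalence is easy to verify numerically. A minimal sketch comparing a regression on a 0/1 group dummy with the corresponding two-sample t-test and one-way ANOVA (two made-up groups):

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(4)
group_a = rng.normal(loc=5.0, size=50)
group_b = rng.normal(loc=5.5, size=50)

# Regression with a 0/1 dummy for group membership
y = np.concatenate([group_a, group_b])
dummy = np.concatenate([np.zeros(50), np.ones(50)])
fit = sm.OLS(y, sm.add_constant(dummy)).fit()

# The t-test on the dummy coefficient matches the two-sample t-test
# (same magnitude; the sign depends on the coding), and the regression
# F statistic equals the one-way ANOVA F.
t, p = stats.ttest_ind(group_a, group_b)
print(fit.tvalues[1], t)
print(fit.fvalue, stats.f_oneway(group_a, group_b).statistic)
```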