PY1PR2 lecture 3: Simple regression
Dr David Field

Summary
• Andy Field covers simple regression in the first half of chapter 7
• Regression is a technique you can use when you want to predict the value of one variable from the value of another variable
• Relationship between correlation and regression
• Finding the “line of best fit”
• Interpreting regression results and using them to make predictions
• Assessing how good the fit is

Regression vs. Correlation
• The three highlighted scatter plots all show examples of perfect positive correlation
• But there is an obvious difference between them: the slope
• Simple regression assesses the slope of the relationship between two variables, while correlation assesses how strong / tight the relationship is

How regression measures the slope
• Make the assumption that the relationship can be described by a straight line
  – if the scatter plot of two variables shows a curved relationship, simple regression is not appropriate
• Find the “best fitting” line; the gradient of that line is the measure of the slope
• There are an infinite number of possible straight lines you could draw on the scatter plot of two variables
  – regression includes a method of finding the best one

Example data
• Today’s lecture, and the regression workshop, will make use of some real data to illustrate how simple regression is used to explore the relationship between variables
• The sample is a sample of rich, developed nations
• The variables are
  – average annual income per head
  – income inequality
  – % of population suffering from any form of mental illness during the past year (WHO world mental health survey)
  – life expectancy (workshop)
• Source: Wilkinson & Pickett (2009) The Spirit Level

[Figure: Variation between the per capita incomes of rich countries, in $ per year (axis from 10000 to 40000). Countries shown: Portugal, Israel, Greece, New Zealand, Spain, Singapore, Italy, Sweden, Finland, France, Germany, UK, Australia, Netherlands, Japan, Belgium, Canada, Austria, Ireland, Denmark, Switzerland, Norway, USA]

Income Gap – the 20:20 ratio
• Average annual income is much larger in some developed countries than others
• Countries also differ in how wide the spread around the average income is
  – In some countries there is a great deal of variation around the mean (large SD)
  – In other countries there is a small SD
• Economists think about this in terms of inequality
  – How rich is rich?
  – Subjectively, in some countries, rich means double the average income
  – In other countries rich means 4 times the average income
• Quantifying inequality as an index
  – The top 20% of earners in a country are defined as “rich”
  – The bottom 20% are defined as “poor”
  – The mean income of the top 20% is then divided by the mean income of the bottom 20%, producing an “inequality ratio”

Income and inequality as predictor variables
• Wilkinson & Pickett (2009) analysed the relationship between income, income inequality, and the prevalence of a range of health and social problems in different countries
  – e.g. homicide rate, imprisonment rate, teenage birth rate, social mobility
• To illustrate linear regression, we will focus on the relationship between income, income inequality, and a psychological variable:
  – % of population suffering from any form of mental illness during the past year

[Figures: a sequence of scatter plots of mental illness against income inequality, with candidate lines drawn on them]
  – The scatter plot indicates that mental health problems are more prevalent in more unequal societies
  – To quantify the relationship with regression, the first step is to find the “line of best fit”
  – The line of best fit can be estimated by eye
  – This line is obviously a poor summary of the trend on the graph
  – The slope of the line is much too steep
  – This line looks like it captures the relationship fairly well
  – The solid red line captures the relationship fairly well
  – But possibly the green dotted line does a better job
  – Regression uses a mathematical technique to decide which of all the possible lines is the best fitting line

The method of least squares
The method of least squares can be used to decide which of these two lines provides a better model of the data
• Calculate the difference between each data point and the line, in terms of the predicted variable (mental health problems)
• Taking the difference as the line’s value minus the data point, positive numbers mean the model has overestimated, negative numbers mean it has underestimated
• To measure the fit, square the difference scores and sum them (why square the difference scores?)
• Next, calculate the sum of squared differences from the second line (the one on the right)
• Whichever line has the smaller total squared difference score is the better model of the data
• There are always an infinite number of lines to compare, so regression uses a mathematical technique to find the one that minimizes the sum of squared differences

Describing the line of best fit
• Once you have obtained the line of best fit it can be drawn on a scatter plot, and it can be described by two numbers
  – the first number, called the intercept or b0, is the value of the predicted variable (e.g. mental health) when the predictor has a value of 0, i.e. where the line crosses the vertical axis of the graph
  – the second number, called b1, is the gradient or slope of the line, and tells you how much the y axis value of the best fit line changes when you increase the value on the x axis by 1 unit
• Together, they are known as the regression coefficients

[Figure: two lines plotted with Outcome (0 to 16) on the y axis against Predictor (0 to 10) on the x axis]
• The value of the intercept, b0, for both lines is 8
• The solid red line has a positive value of b1 (gradient or slope)
• For the dashed line, b1 is a negative number, indicating that as the predictor increases the predicted value decreases

[Figure: three parallel lines plotted with Outcome (0 to 16) against Predictor]
• These 3 lines all have the same positive slope value, b1
• But each has a different intercept, b0

The line of best fit as a “model” of the data
• A statistical model is a way of describing the most important aspects of a set of data that is simpler than the data itself
  – a straight line is simpler than the scatter plot it summarizes
• Like all statistical models, the line of best fit can be used to predict values of the outcome for a specific value of the predictor
  – outcome = b0 + (b1 * specific value of predictor)
  – see later for examples
• Best fit line for prediction of mental health by inequality (b1 = 3.7, b0 = 6.6)
  – b0 has the same units as the predicted variable, % in this case
  – b1 is 3.7% per unit of the predictor variable
  – In other words, moving from an inequality ratio of 3 to 4 increases the predicted rate of mental illness by 3.7 percentage points

Using the model to predict values
• Data for the % of population suffering from any form of mental illness during the past year was not available for Greece
  – but we do know that Greece has an income inequality ratio of 6.19, which is 3.28 units higher than the most equal country, Japan
  – We can use the formula for straight lines, combined with the values of b0 and b1, to predict the level of mental illness in Greece
  – b0 + (b1 * 3.28)
  – 6.6 + (3.28 * 3.7) = 18.74%
• You can also check the predicted value for Greece graphically
  – draw a line up from the x axis to meet the regression line at the point corresponding to the predictor value for Greece
  – draw a line across to the y axis and read off the predicted value

Making predictions for extreme values of the predictor
• Imagine we obtained a measurement of income inequality for two new countries
  – a capitalist country with no welfare state and zero taxation, inequality ratio 18 (much higher than any country in our sample)
  – a communist country, inequality ratio 1.5 (less than half that of the most equal country in our sample, Japan)
• We could use the equation for straight lines to predict levels of mental illness for both countries
  – Doing this is referred to as “extrapolation”
  – Can you think of any problems with doing this, or reasons for caution?

How well does the model fit?
• Regression is guaranteed to find the best fitting straight line, but it might still be a poor fit if the two variables are only weakly related
• There are two ways of assessing the model
  1) R², the proportion of the total variance in the predicted variable that the best fitting line accounts for
  2) a null hypothesis test:
     • what is the probability of obtaining the b1 value in the sample if the true value of b1 in the population is zero?
     • a b1 of 0 means that as the value of the predictor increases the value of the predicted variable stays the same (as inequality increases mental illness stays the same)
• Let’s compare the fit of two models that predict the proportion of the population with mental health problems

Predicting mental health from income
[Figure: scatter plot of mental health problems against income with the line of best fit; b1 = 0.65, b0 = -1.7]
(zero income = less than zero mental health problems?!?!)
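
The least squares fit and the prediction step described in the slides above can be sketched in Python. This is only a minimal illustration, not the lecture's own analysis. The function names `fit_line`, `predict` and `sum_squared_differences` are made up for this sketch, and the five (x, y) pairs are hypothetical toy numbers, not the Wilkinson & Pickett data. The closed-form expressions for b1 and b0 are the standard least squares solution, which the slides do not spell out. The last step reuses the b0 = 6.6 and b1 = 3.7 quoted on the slides for the Greece prediction.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sum of squares of the predictor.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b1 = sxy / sxx               # slope (gradient)
    b0 = mean_y - b1 * mean_x    # intercept
    return b0, b1


def predict(b0, b1, x):
    """outcome = b0 + (b1 * specific value of predictor)"""
    return b0 + b1 * x


def sum_squared_differences(b0, b1, xs, ys):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - predict(b0, b1, x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical toy data (NOT the lecture's data set).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

b0, b1 = fit_line(xs, ys)
best_ss = sum_squared_differences(b0, b1, xs, ys)
# A much steeper line "drawn by eye", for comparison.
steep_ss = sum_squared_differences(0.0, 1.5, xs, ys)

print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
print(f"squared differences: best line {best_ss:.2f}, steep line {steep_ss:.2f}")

# Prediction for Greece using the coefficients quoted on the slides.
greece = predict(6.6, 3.7, 3.28)
print(f"predicted mental illness for Greece: {greece:.2f}%")
```

The steep line has a much larger sum of squared differences than the fitted line, which is the comparison the method of least squares makes across all possible lines. Applying `predict` to an inequality ratio far outside the sampled range would run without complaint, which is exactly why extrapolation needs caution.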

Assessing the two models of mental health: R²
• To assess the line of best fit, it is compared to an even simpler model of the predicted variable
  – If I had mental health data for the 11 countries on the scatter plot, but no data about income or inequality, and I was asked to predict the level of mental health problems in Greece, my best guess would be the mean of the mental health data
  – The simplest possible model of the predicted variable is its mean
  – Calculate the sum of the squared deviations from the mean (total sum of squares)
  – Calculate the sum of the squared deviations from the line of best fit (residual sum of squares)
  – If the line is a good model, the residual sum should be much smaller than the total sum

The line of best fit versus the mean
[Figure: scatter plot with the line of best fit for inequality and the mean of Y marked]
Can you find any countries where the mean is a better model of mental health problems than the line of best fit for inequality?

The line of best fit versus the mean
[Figure: scatter plot with the line of best fit for income and the mean of Y marked]
Can you find any countries where the mean is a better model of mental health problems than the line of best fit for income? For how many countries is the model obviously better than the mean?

Calculating R²
• R² is a descriptive statistic that describes how much better the model is at explaining variation in the predicted variable than using the mean as a model
• It is expressed as a proportion of the total variation (variance) in the predicted variable
  – therefore it has a maximum of 1 and a minimum of 0

$$R^2 = \frac{\text{total sum of squares} - \text{residual sum of squares}}{\text{total sum of squares}}$$

Note: the square root of R² is the Pearson correlation of the two variables (ignoring its sign)

Which variable is a better predictor?
[Figures: two scatter plots with lines of best fit. First: R² = 0.16, correlation = 0.39. Second: R² = 0.55, correlation = 0.74]

Is b1 significantly different from zero?
• If the true value of b1 in the population is zero, what is the probability that a random sample of this size would have a value of b1 as big or bigger than the observed value?
• The p value is provided by a t test
• t = b1 / standard error of b1
• the standard error of b1 is based on the SD of the residual (deviation) scores
  – if the data points are close to the line of best fit the SE will be small; if they are far away, the SE will be large
• otherwise the test works like the t test for the difference between two sample means
• if p < 0.05 we reject the null hypothesis that b1 is zero, which supports the hypothesis that the predictor variable is useful in estimating the value of the outcome

Are the predictors statistically significant?
[Figures: the same two scatter plots. First: t(9) = 1.29, p = 0.23, NS. Second: t(9) = 3.3, p = 0.009]

The simple linear regression model
• As we saw earlier, simple regression is simply a model of the data as a straight line:
  – Outcome = b0 + (b1 * specific value of predictor)
• But that equation is not quite complete, because the regression model needs to reflect the fact that the data points rarely all lie exactly on the line
• Therefore, you will usually see the regression equation written as some symbolic variant of:
  – Outcome = b0 + (b1 * specific value of predictor) + “residual error”
• residual error refers to the deviation score from the model (the vertical lines drawn on the scatter plots earlier)

Things to bear in mind
• Often, you can’t be sure that the predicted variable is being caused by the predictor
  – it might be the other way round (and you can swap the predictor and predicted variables around and run the analysis again)
  – in some cases, e.g. inequality and mental health, it does not make sense to run the regression the other way around
     • Nobody would claim that mental health problems cause inequality
     • but you might make an argument that a 3rd variable causes the variation in both observed variables
• The predicted variable should be a continuous variable measured on an interval or ratio scale
• If a scatter plot suggests a non-linear relationship, you can’t use simple regression

If you’d like to evaluate the effects of inequality and other variables yourself, more data is available in the book or on the website, e.g. more unequal US states have higher levels of health and social problems

Statistical terms for revision
• model
• method of least squares
• residual
• regression coefficients
  – intercept, b0
  – slope or gradient, b1
• best fitting line
• extrapolation
• R²
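
As a worked illustration of the two ways of assessing fit covered in this lecture, the sketch below computes R² and the t statistic for b1 in Python. It is a rough sketch under stated assumptions: the function name `regression_summary` and the toy data are hypothetical (these are not the lecture's 11 countries), and the standard error formula used, the residual SD divided by the square root of the predictor's sum of squares, is the usual textbook one rather than anything quoted on the slides. It stops at t and its degrees of freedom (n − 2); turning t into a p value needs the t distribution, which is left to statistical software.

```python
import math


def regression_summary(xs, ys):
    """Fit a least squares line and return its fit statistics as a dict."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x

    # Total sum of squares: squared deviations from the mean of y.
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    # Residual sum of squares: squared deviations from the fitted line.
    ss_residual = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = (ss_total - ss_residual) / ss_total

    # Standard error of b1 (assumed textbook formula): residual SD / sqrt(sxx).
    df = n - 2
    se_b1 = math.sqrt(ss_residual / df) / math.sqrt(sxx)
    t = b1 / se_b1

    # Pearson correlation, to compare with the square root of R squared.
    pearson_r = sxy / math.sqrt(sxx * ss_total)

    return {"b0": b0, "b1": b1, "r_squared": r_squared,
            "pearson_r": pearson_r, "t": t, "df": df}


# Hypothetical toy data (NOT the lecture's data set).
summary = regression_summary([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
print(f"R squared = {summary['r_squared']:.2f}")
print(f"t({summary['df']}) = {summary['t']:.2f}")
```

With a positive slope, as in this toy data, the square root of `r_squared` equals `pearson_r`, matching the note on the R² slide.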