Quantitative Methods
Notes on Correlation and Regression

WARNING: These notes are a short summary of the topics of correlation and regression. They are not a substitute for reading the textbook or coming to the lecture!!!

1 Scatter Plot

The purpose here is to analyze the relationship between two variables. We will usually denote the variables by X and Y. We want to investigate whether the data contains evidence of a relationship or not, and to what degree. If there is a clear choice of dependent versus independent variables, then the convention is to use the notation

X = independent variable (explanatory)
Y = dependent variable (measured response)

We already covered one tool to see if two variables are related: the cross-tabulation. If at least one of the variables is nominal or ordinal, then a cross-tabulation is the method of choice. Otherwise, if the level of measurement for both variables is interval, then it will be more informative to draw a scatter plot: we put the variable X on the x-axis and the variable Y on the y-axis, and draw a dot for each pair of data (X, Y). Figure 1 shows an example of a scatter plot.

Figure 1: Example of a scatter plot, relating arm strength and grip strength.

We would conclude that there is a relationship between the variables if we can see a clear pattern in the scatter plot. The pattern could be in the form of a straight line, or a curve, or any definite shape that appears. On the other hand, if the scatter plot shows a random cloud of dots, with no clear pattern, then we would conclude that there is no apparent relationship.

2 Correlation

Correlation is a measure of the linear relationship between the two variables. In other words, it says how much the scatter plot looks like a straight line. If all the points on the scatter plot are perfectly aligned, this would be an example of perfect correlation.

Remarks:
• If there is a correlation between X and Y, then there is a relationship between the variables.
• If there is no correlation at all, there could still be a relationship!

We can measure correlation precisely through the calculation of the correlation coefficient, which is denoted by r. The correlation coefficient r is a number between −1 and 1, with the following interpretation:
• |r| = 1 : perfect correlation, all the points on the scatter plot are on a straight line
• 0.5 < |r| < 1 : strong correlation
• 0.25 < |r| < 0.5 : medium correlation
• 0.1 < |r| < 0.25 : weak correlation
• |r| < 0.1 : no correlation

A positive correlation means that the scatter plot looks like a line with a positive slope. Similarly, a negative correlation means that the scatter plot looks like a straight line with a negative slope. Figure 2 displays many examples of scatter plots with their associated correlation coefficients.

Figure 2: Examples of scatter plots with their corresponding correlation coefficients.

NOTE: The size of the correlation coefficient gives the strength of the correlation, and has nothing to do with how steep the line is. See the second row of scatter plots in Figure 2.

3 Regression Line

The regression line is the straight line that “best fits” the data on the scatter plot. Mathematically, it is the line that minimizes the vertical distances between the data points and the line (more precisely, the sum of their squares). The regression line is used as a model for the relationship between the two variables; however, it will only be a good model if the correlation is strong! In other words, the strength of the correlation indicates whether the regression line represents a good fit for the data.

Figure 3: Two examples of regression lines. On the left, the scatter plot shows a strong positive correlation (r = 0.7) and the regression line is a good fit. On the right, the correlation is much weaker (r = −0.3) and as a consequence the regression line does not fit the data as well.

The equation for the regression line will be in the form

Y = a + bX,

where a is the y-intercept and b is the slope.
(YES I know, this is confusing, since you are probably used to writing the equation of a straight line as y = ax + b, where a is the slope...)

3.1 Making predictions

One of the purposes of the regression line is to be able to make predictions for new data. For example, say X is the wing length of a bird (in cm), and Y is the age of the bird (in days). Using birds born in captivity, we can measure their wing lengths and their corresponding ages to gather data. With this data, we can find the regression line; suppose we find the equation:

Y = −1.61 + 3.33 X

Now, we go into the wild and capture a bird. We measure the length of its wings, and we find X = 4 cm. We don’t know how old the bird is, but we can use the regression line to predict its age:

Y = −1.61 + 3.33(4) ≈ 11.7 days.

Is this a good prediction? We will be very confident in this answer if we have a strong correlation (the regression line is a good fit), and much less confident if the correlation is weak.
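
For readers who like to check things on a computer, here is a minimal Python sketch of the ideas in these notes: computing the correlation coefficient r, finding the intercept a and slope b of the regression line Y = a + bX by least squares, and making a prediction. The function names (pearson_r, fit_line, predict) and the small data set are made up for illustration and are not from the lecture; only the equation Y = −1.61 + 3.33 X is taken from the example above.

```python
import math


def pearson_r(xs, ys):
    """Correlation coefficient r of paired data (a number between -1 and 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def fit_line(xs, ys):
    """Least-squares intercept a and slope b of the line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # y-intercept
    return a, b


def predict(a, b, x):
    """Predicted Y for a new X, using the regression line Y = a + bX."""
    return a + b * x


# Hypothetical measurements, invented for this sketch (NOT the lecture's data).
wing_cm = [1.5, 2.2, 3.1, 3.8, 4.5]
age_days = [3.0, 6.0, 8.0, 11.0, 14.0]

r = pearson_r(wing_cm, age_days)
a, b = fit_line(wing_cm, age_days)
print("r =", round(r, 3), " a =", round(a, 2), " b =", round(b, 2))

# The equation quoted in the notes, evaluated at X = 4 cm.
print(round(predict(-1.61, 3.33, 4), 1))  # 11.7
```

As in the notes, the prediction is only as trustworthy as the correlation is strong, so it makes sense to look at r before relying on the output of predict.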