Correlation and regression

Quantitative Methods
Notes on Correlation and Regression
WARNING: These notes are a short summary on the topics of correlation and regression. They
are not a substitute for reading the textbook or coming to the lecture!!!
1 Scatter Plot
The purpose here is to analyze the relationship between two variables. We will usually denote the
variables by X and Y. We want to investigate whether the data contains evidence of a relationship
or not, and to what degree. If there is a clear choice of dependent versus independent variables,
then the convention is to use the notation
X = independent variable (explanatory)
Y = dependent variable (measured response)
We already covered one tool to see if two variables are related: the cross-tabulation. If at
least one of the variables is nominal or ordinal, then a cross-tabulation is the method of choice.
Otherwise, if the level of measurement for both variables is interval, then it will be more informative
to draw a scatter plot: we put the variable X on the x-axis, and the variable Y on the y-axis,
and draw a dot for each pair of data (X, Y). Figure 1 shows an example of a scatter plot.
Figure 1: Example of a scatter plot, relating the arm strength and grip strength.
We would conclude that there is a relationship between the variables if we can see a
clear pattern in the scatter plot. The pattern could be in the form of a straight line, or a
curve, or any definite shape that appears. On the other hand, if the scatter plot shows a random
cloud of dots, with no clear pattern, then we would conclude that there is no apparent relationship.
2 Correlation
Correlation is a measure of the linear relationship between the two variables. In other words, it
says how much the scatter plot looks like a straight line. If all the points on the scatter plot are
perfectly aligned, this would be an example of perfect correlation.
Remarks:
• If there is a correlation between X and Y, then there is a relationship between the variables.
• If there is no correlation at all, there could still be a relationship!
We can measure correlation precisely through the calculation of the correlation coefficient, which
is denoted by r. The correlation coefficient r is a number between −1 and 1, with the following
interpretation:
• |r| = 1 : perfect correlation, all the points on the scatter plot are on a straight line
• 0.5 ≤ |r| < 1 : strong correlation
• 0.25 ≤ |r| < 0.5 : medium correlation
• 0.1 ≤ |r| < 0.25 : weak correlation
• |r| < 0.1 : no correlation
A positive correlation means that the scatter plot looks like a line with a positive slope. Similarly,
a negative correlation means that the scatter plot looks like a straight line with a negative slope.
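These notes do not state a formula for r. A minimal sketch, assuming the usual Pearson sample correlation coefficient (the sum of products of deviations from the means, divided by the square root of the product of the two sums of squared deviations), could look like the following. The function names pearson_r and strength and the five data points are hypothetical and chosen only for illustration. The sketch also assumes that a value falling exactly on a threshold belongs to the stronger category.

```python
import math

def pearson_r(xs, ys):
    """Pearson sample correlation coefficient for paired data (assumed definition of r)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of the deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when one variable is constant")
    return sxy / math.sqrt(sxx * syy)

def strength(r):
    """Label |r| with the categories listed in these notes.

    Assumption: a value exactly on a threshold goes to the stronger category.
    """
    size = abs(r)
    if math.isclose(size, 1.0):
        return "perfect"
    if size >= 0.5:
        return "strong"
    if size >= 0.25:
        return "medium"
    if size >= 0.1:
        return "weak"
    return "no correlation"

# Hypothetical data, not taken from the notes
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
r = pearson_r(xs, ys)
print(round(r, 2), strength(r))  # 0.8 strong
```

The sign of the returned value gives the direction of the correlation, and its absolute value gives the strength.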
Figure 2 displays many examples of scatter plots with their associated correlation coefficients.
Figure 2: Examples of scatter plots with their corresponding correlation coefficients.
NOTE: The size of the correlation coefficient gives the strength of the correlation, and has nothing
to do with the slope of the line. See the second row of scatter plots in Figure 2.
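Two of the remarks in this section can be checked numerically: the size of r does not depend on the slope, and r can be zero even when the variables are clearly related. The sketch below assumes the usual Pearson definition of r. The helper pearson_r and all the data are hypothetical.

```python
import math

def pearson_r(xs, ys):
    """Pearson sample correlation coefficient (assumed definition of r)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Points exactly on a steep line and on a shallow line:
# both are perfectly correlated even though the slopes differ by a factor of 100.
xs = [1, 2, 3, 4, 5]
steep = [10 * x for x in xs]
shallow = [0.1 * x for x in xs]
r_steep = pearson_r(xs, steep)
r_shallow = pearson_r(xs, shallow)

# Points exactly on the curve Y = X^2, with X symmetric around 0:
# Y is completely determined by X, yet the linear correlation is zero.
sym_x = [-2, -1, 0, 1, 2]
curve = [x ** 2 for x in sym_x]
r_curve = pearson_r(sym_x, curve)

print(round(r_steep, 6), round(r_shallow, 6), round(r_curve, 6))
```

The first two values come out as 1 up to rounding, and the third as 0, which is why a scatter plot is worth drawing before relying on r alone.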
3 Regression Line
The regression line is the straight line that “best fits” the data on the scatter plot. Mathematically,
it is the line that minimizes the sum of the squared vertical distances between the data points and
the line. The regression line is used as a model for the relationship between the two variables;
however, it will only be a good model if the correlation is strong! In other words, the strength of
the correlation indicates if the regression line represents a good fit for the data.
Figure 3: Two examples of regression lines. On the left, the scatter plot shows a strong positive
correlation (r = 0.7) and the regression line is a good fit. On the right, the correlation coefficient
is much weaker (r = −0.3) and as a consequence the regression line does not fit the data as well.
The equation for the regression line will be in the form Y = a + bX, where a is the y-intercept
and b is the slope. (YES I know, this is confusing since you are probably used to writing the
equation of a straight line as y = ax + b where a is the slope...)
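As a rough illustration of how a and b can be obtained from data, the sketch below assumes the standard least-squares formulas: the slope b is the sum of products of deviations divided by the sum of squared deviations of X, and the intercept a is chosen so that the line passes through the point of means. The function name fit_line and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX (assumed standard formulas)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two samples of equal length with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are equal, so the slope is undefined")
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # y-intercept: the line passes through (mean_x, mean_y)
    return a, b

# Hypothetical data, not taken from the notes
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]
a, b = fit_line(xs, ys)
print(f"Y = {a:.2f} + {b:.2f} X")  # Y = 0.60 + 0.80 X
```

Note that the function returns the intercept first and the slope second, matching the Y = a + bX convention used in these notes.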
3.1 Making predictions
One of the purposes of the regression line is to make predictions for new data. For
example, say X is the wing length of a bird (in cm), and Y is the age of the bird (in days). Using
birds born in captivity, we can measure their wing lengths and their corresponding ages to gather
data. With this data, we can find the regression line; suppose we find the equation:
Y = −1.61 + 3.33 X
Now, we go into the wild and capture a bird. We measure the length of its wings, and we find
X = 4 cm. We don’t know how old the bird is, but we can use the regression line to predict its age:
Y = −1.61 + 3.33(4) = 11.71 ≈ 11.7 days.
Is this a good prediction? We will be very confident in this answer if we have a strong correlation
(regression line is a good fit), and much less confident if the correlation is weak.
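The prediction step is a single substitution into the equation of the line. A minimal sketch using the coefficients from the bird example above is shown below. The function name predict_age is hypothetical.

```python
def predict_age(wing_length_cm, a=-1.61, b=3.33):
    """Predicted age in days from wing length in cm, using Y = a + bX.

    The default a and b are the coefficients from the bird example in these notes.
    """
    return a + b * wing_length_cm

age = predict_age(4)
print(f"{age:.1f} days")  # 11.7 days
```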