Chapter 4 – Summarizing Bivariate Data

Linear Regression and Correlation

This chapter introduces an important method for making inferences about a linear correlation (or relationship) between two variables, and for describing such a relationship with an equation that can be used to predict the value of one variable given the value of the other. We will only look at linear relationships, which means that when graphed, the points approximate a straight-line pattern. We will also introduce a way of measuring the strength of a linear relationship.

Data that consist of ordered pairs are called bivariate data. A scatterplot is a display of ordered pairs plotted on a set of axes.

4.1 - Correlation

Two variables have a linear relationship if the data tend to cluster around a ............................. when plotted on a scatterplot. A strong correlation means that the points in the scatterplot are closely clustered around a straight line.

Two variables are positively associated if large values of one variable are associated with ........................ values of the other.
Two variables are negatively associated if large values of one variable are associated with ......................... values of the other.

The linear correlation coefficient, denoted r, measures the strength of a linear relationship between the paired quantitative x- and y-values in a sample.

The value of r is always between −1 and +1 inclusive. That is, −1 ≤ r ≤ 1.

If r = 1, there is a perfect positive linear correlation (or association).
If r is close to 1, there is a strong positive linear correlation.
If r is positive but close to zero, there is a weak positive linear correlation.
If r = 0, there is no linear correlation.
If r is negative but close to zero, there is a weak negative linear correlation.
If r is close to −1, there is a strong negative linear correlation.
If r = −1, there is a perfect negative linear correlation.

Interchanging all x- and y-values will not change r.
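For readers who want to see what the calculator does behind the scenes, here is a minimal Python sketch of the standard formula for r (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The formula is not stated in these notes, and the function name `correlation` and the small data set are made up purely for illustration. The last lines check the property that interchanging x and y leaves r unchanged.

```python
import math

def correlation(xs, ys):
    """Linear correlation coefficient r for paired samples xs and ys."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products about the means.
    # r is undefined (division by zero) if all x-values or all y-values are equal.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r_xy = correlation(xs, ys)
r_yx = correlation(ys, xs)  # x and y interchanged
print(round(r_xy, 4), round(r_yx, 4))
```

Because the formula treats x and y symmetrically, the two printed values agree. Points that fall exactly on a rising line give r = 1, and points exactly on a falling line give r = −1.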
A common error is to conclude that correlation implies causality, meaning that one variable causes the results of the second variable. Correlation is NOT causation. There might be other variables, known or unknown, called confounders, which affect both of the variables that we are studying.

Remember that r measures the strength of a linear relationship, not the strength of a relationship that is non-linear. The graphs below certainly seem to show variables with some sort of relationship, but not a linear one, so r will not be helpful in these cases.

In this class, we will not spend time on the tedious calculations required to find a correlation coefficient or the linear regression equation (next section) by hand; we will use our calculators instead. The emphasis will be on the interpretation of the results.

Ex. Global Warming: The following data set shows the temperatures at different levels of CO2.

CO2 (in parts per million)    Temperature (in Celsius)
314                           13.9
317                           14.0
320                           13.9
326                           14.1
331                           14.0
339                           14.3
346                           14.1
354                           14.5
361                           14.5
369                           14.4

(a) Draw a scatterplot, first by hand, then using your calculator, where the amount of CO2 is the independent variable and the temperature is the dependent variable.
(b) Find the correlation coefficient.

4.2, 4.3 - The Least Squares Regression Line

If one can rent a car for $180/week plus $0.25/mile, we can write an equation for the cost of renting a car for a week as y = 180 + 0.25x, where y represents the cost per week and x represents the number of miles driven in a week. This linear equation gives an exact value of y for any given x. However, variables often don't have an exact relationship in which one variable is determined completely by the other. But if they appear to have a linear relationship, we can find the graph and equation of the straight line that best represents this relationship. This straight line is called the least squares regression line (or line of best fit).
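The name "least squares" refers to the quantity the line makes as small as possible: the sum of the squared vertical distances between the data points and the line. The following Python sketch illustrates this with the CO2 data from the Global Warming example. The function names `sum_squared_errors` and `least_squares_line` are hypothetical, and the slope and intercept formulas are the standard textbook ones, which these notes leave to the calculator. The "nudged" lines are arbitrary alternatives chosen only for comparison.

```python
# CO2 (ppm) and temperature (Celsius) from the Global Warming example.
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]
temp = [13.9, 14.0, 13.9, 14.1, 14.0, 14.3, 14.1, 14.5, 14.5, 14.4]

def sum_squared_errors(a, b, xs, ys):
    """Sum of squared vertical distances from the points to y = a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Return (a, b) for the least squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx            # slope
    a = mean_y - b * mean_x  # intercept: the line passes through (mean_x, mean_y)
    return a, b

a, b = least_squares_line(co2, temp)
best = sum_squared_errors(a, b, co2, temp)

# Arbitrary alternative lines: nudge the intercept or the slope a little.
nudged = [
    sum_squared_errors(a + 0.1, b, co2, temp),
    sum_squared_errors(a - 0.1, b, co2, temp),
    sum_squared_errors(a, b + 0.001, co2, temp),
    sum_squared_errors(a, b - 0.001, co2, temp),
]
print(best, nudged)
```

Every nudged line produces a larger sum of squared errors than the fitted line, which is what "best" means for the line of best fit.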
The general form of this linear regression equation is y = a + bx (compare this equation to the more familiar form y = mx + b). The regression line is seen as a measure of the mean value of y for a given value of x.

Describe the following:
x =
y =
a =
b =
Positive linear relationship:
Negative linear relationship:

Describe how we graph a line y = a + bx:

Requirements for finding a linear regression line and its correlation coefficient:
1. The sample of paired data is a random sample of quantitative data.
2. Visual examination of the scatterplot shows that the points approximate a straight-line pattern.
3. Any outliers must be removed if they are known to be errors. Consider the effects of any outliers that are not known errors. (In a scatterplot, an outlier is a point lying far away from the other data points.)

Describe how to find a linear regression equation on a TI83/84:

More problems that apply to the Global Warming example on the previous page:
(c) Find the equation of the regression line, where the amount of CO2 is the independent variable and the temperature is the dependent variable.
(d) Draw the regression line in the same coordinate system as the scatterplot.
(e) Mark the errors (residuals) between the actual points and the corresponding points on the regression line. (In general I will not ask you to draw the residuals, only for this problem.)
(f) Interpret the slope for this particular problem, using correct units.
(g) Interpret the y-intercept for this particular problem, using correct units.
(h) Find the predicted temperature for a recent year in which the concentration of CO2 is 370.9 parts per million. Is the predicted temperature close to the actual temperature of 14.5°C?
(i) If the CO2 increases by 15 parts per million, how much would you expect the temperature to change?

Regression lines are often useful for predicting the value of one variable, given some particular value of the other variable.
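As a way of checking calculator work on the Global Warming problems, the following Python sketch fits y = a + bx to the CO2 data and then uses the line for a prediction, for the residuals, and for the expected change over a 15 ppm increase. It is only an illustration under the standard least squares formulas. The names `predict`, `residuals`, and `change_for_15_ppm` are made up here, and the calculator remains the method used in this class.

```python
# CO2 (ppm) and temperature (Celsius) from the Global Warming example.
co2 = [314, 317, 320, 326, 331, 339, 346, 354, 361, 369]
temp = [13.9, 14.0, 13.9, 14.1, 14.0, 14.3, 14.1, 14.5, 14.5, 14.4]

n = len(co2)
mean_x = sum(co2) / n
mean_y = sum(temp) / n
sxx = sum((x - mean_x) ** 2 for x in co2)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(co2, temp))

b = sxy / sxx            # slope: degrees Celsius per 1 ppm of CO2
a = mean_y - b * mean_x  # y-intercept: value of the line at 0 ppm

def predict(x):
    """Predicted temperature (Celsius) on the regression line at x ppm."""
    return a + b * x

predicted = predict(370.9)

# Residual = actual y minus the y-value on the line at the same x.
residuals = [y - predict(x) for x, y in zip(co2, temp)]

# Moving 15 ppm along the line changes the prediction by 15 times the slope.
change_for_15_ppm = 15 * b

print(f"y = {a:.3f} + {b:.5f}x")
print(f"predicted at 370.9 ppm: {predicted:.2f}")
print(f"expected change for +15 ppm: {change_for_15_ppm:.3f}")
```

The residuals of a least squares line always add up to zero (apart from rounding), which is a quick sanity check on the fit. Note that the intercept corresponds to 0 ppm, far outside the sampled CO2 values, so it should be interpreted with care.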
If the regression line fits the data quite well, then it makes sense to use its equation for predictions (use a scatterplot and the correlation coefficient to determine how well the line fits). However, don't base predictions on values that are far beyond the boundaries of the known sample data, as the linear relationship may not hold true there.

An influential point is a point that, when included in a scatterplot, strongly affects the position of the least-squares regression line.

Ex.

When a scatterplot contains outliers:
• Compute the least-squares regression line both with and without each outlier to determine which outliers are influential.
• Report the equations of the least-squares regression line both with and without each influential point.
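The with-and-without comparison described above can be sketched in Python as follows. The data set is hypothetical, made up so that one point lies far from the rest, and `least_squares_line` is an illustrative helper based on the standard least squares formulas rather than anything defined in these notes.

```python
def least_squares_line(points):
    """Return (a, b) for y = a + b*x fitted to a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

# Hypothetical data, made up for illustration; the last point is the outlier.
points = [(1, 2.0), (2, 2.9), (3, 4.1), (4, 5.0), (5, 6.1), (12, 3.0)]
outlier = (12, 3.0)

a_with, b_with = least_squares_line(points)
a_without, b_without = least_squares_line([p for p in points if p != outlier])

print(f"with outlier:    y = {a_with:.3f} + {b_with:.3f}x")
print(f"without outlier: y = {a_without:.3f} + {b_without:.3f}x")
```

In this made-up data set, removing the single outlier changes the slope substantially, so that point would be reported as influential. How large a change counts as "strongly affects" is a judgment call that the sketch does not try to automate.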