Algebra 1 Summer Institute 2014

The Weather Turbulence

The National Climatic Data Center collects data on weather conditions at various locations. It classifies each day as clear, partly cloudy, or cloudy. Using data taken over a number of years, it reports the following variables:

𝐱 = elevation above sea level (in feet)
𝐲 = mean number of clear days per year
𝐰 = mean number of partly cloudy days per year
𝐳 = mean number of cloudy days per year

Could a city's elevation above sea level be used to predict the number of clear, partly cloudy, or cloudy days per year the city experiences? If a scatter plot of the data suggests a linear pattern, a linear model (for example, the least-squares line obtained from a calculator or computer software) can provide a reasonable description of the relationship between elevation and one of the day counts. The linear model will be evaluated by considering how close the data points are to the graph of the line. The equation of the linear model will then be used to answer the statistical question. We will mostly concentrate on the association between elevation and the number of clear days.

The table below shows data for 14 U.S. cities.

City                  𝐱: Elevation (ft.)   𝐲: Clear days   𝐰: Partly cloudy days   𝐳: Cloudy days
Albany, NY                    275                69                 111                  185
Albuquerque, NM             5,311               167                 111                   87
Anchorage, AK                 114                40                  60                  265
Boise, ID                   2,838               120                  90                  155
Boston, MA                     15                98                 103                  164
Helena, MT                  3,828                82                 104                  179
Lander, WY                  5,557               114                 122                  129
Milwaukee, WI                 672                90                 100                  175
New Orleans, LA                 4               101                 118                  146
Raleigh, NC                   434               111                 106                  149
Rapid City, SD              3,162               111                 115                  139
Salt Lake City, UT          4,221               125                 101                  139
Spokane, WA                 2,356                86                  88                  191
Tampa, FL                      19               101                 143                  121

All day counts are mean numbers of days per year.
Data Source: http://www.ncdc.noaa.gov/oa/climate/online/ccd/cldy.html

1. Work in groups of two. Create a scatter plot in Excel or GeoGebra of the data on elevation and mean number of clear days.

2. Do you see a pattern in the scatter plot, or does it look like the data points are scattered?

3. How would you describe the relationship between elevation and mean number of clear days for these 14 cities? That is, does the mean number of clear days tend to increase as elevation increases, or does it tend to decrease as elevation increases?

4. Do you think that a straight line would be a good way to describe the relationship between the mean number of clear days and elevation? Why do you think this?

We have noticed that the pattern is not very strong. How strong or weak is it? When we suspect a linear association between two variables, we can measure its strength with a number called the correlation coefficient. The most common one is the Pearson product-moment correlation coefficient.

Generally, the correlation coefficient of a sample is denoted by r, and the correlation coefficient of a population is denoted by ρ or R.

The sign and the absolute value of a correlation coefficient describe the direction and the magnitude of the relationship between two variables.

- The value of a correlation coefficient ranges between -1 and 1.
- The greater the absolute value of a correlation coefficient, the stronger the linear relationship.
- The strongest linear relationship is indicated by a correlation coefficient of -1 or 1.
- The weakest linear relationship is indicated by a correlation coefficient equal to 0.
- A positive correlation means that if one variable gets bigger, the other variable tends to get bigger.
- A negative correlation means that if one variable gets bigger, the other variable tends to get smaller.

Keep in mind that the Pearson product-moment correlation coefficient only measures linear relationships. Therefore, a correlation of 0 does not mean zero relationship between two variables; rather, it means zero linear relationship.
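To see that r = 0 does not rule out a relationship, here is a minimal Python sketch. It assumes made-up illustration points on the curve y = x² (they are not from the weather data), and the helper name `pearson_r` is hypothetical. It uses the deviation-score formula for r presented in the section "How to Calculate a Correlation Coefficient".

```python
import math

def pearson_r(xs, ys):
    """Pearson r from deviation scores: sum(xy) / sqrt(sum(x^2) * sum(y^2))."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]          # deviation scores for x
    dy = [y - y_bar for y in ys]          # deviation scores for y
    sum_xy = sum(a * b for a, b in zip(dx, dy))
    sum_x2 = sum(a * a for a in dx)
    sum_y2 = sum(b * b for b in dy)
    return sum_xy / math.sqrt(sum_x2 * sum_y2)

# Hypothetical illustration data: y is completely determined by x (y = x^2),
# but the points are symmetric about x = 0, so there is no linear trend.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # zero linear relationship despite a perfect curved relationship
```

Here the positive and negative products of deviations cancel, so the numerator of r is zero even though every point lies exactly on a curve.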
(It is possible for two variables to have zero linear relationship and a strong curvilinear relationship at the same time.)

The scatterplots below show how different patterns of data produce different degrees of correlation.

[Figure: six scatterplots. Panel titles, in order: Maximum positive correlation (r = 1.0); Strong positive correlation (r = 0.80); Zero correlation (r = 0); Maximum negative correlation (r = -1.0); Moderate negative correlation (r = -0.43); Strong correlation & outlier (r = 0.71).]

Several points are evident from the scatterplots.

- When the slope of the line in the plot is negative, the correlation is negative; when the slope is positive, the correlation is positive.
- The strongest correlations (r = 1.0 and r = -1.0) occur when the data points fall exactly on a straight line.
- The correlation becomes weaker as the data points become more scattered.
- If the data points fall in a random pattern, the correlation is close to zero.
- Correlation is affected by outliers. Compare the first scatterplot with the last scatterplot. The single outlier in the last plot greatly reduces the correlation (from 1.00 to 0.71).

How to Calculate a Correlation Coefficient

The formula is based on deviation scores. A deviation score is the difference between a raw score and the mean of the raw scores. For example, the deviation score for x is:

$$x_i = X_i - \bar{X}$$

Where:
$x_i$ is the deviation score for observation i,
$X_i$ is the raw score for observation i,
$\bar{X}$ is the mean of all raw scores.

The most common formula for computing a product-moment correlation coefficient (r) between two variables is:

$$r = \frac{\sum xy}{\sqrt{\left(\sum x^2\right)\left(\sum y^2\right)}}$$

Where:
Σ is the summation symbol,
$x_i = X_i - \bar{X}$ is the deviation score for x, $X_i$ is the raw score for observation i, and $\bar{X}$ is the mean x value,
$y_i = Y_i - \bar{Y}$ is the deviation score for y, $Y_i$ is the raw score for observation i, and $\bar{Y}$ is the mean y value.

5. Using Excel, let's calculate the correlation coefficient r. Open Excel and enter the data, then find the means of X and Y, the deviation scores x and y, and the sums.
You will need the following columns: $X$, $Y$, $x = X - \bar{X}$, $y = Y - \bar{Y}$, $xy$, $x^2$, and $y^2$.

The Coefficient of Determination

The coefficient of determination (denoted by $R^2$) is a key output of regression analysis. It is interpreted as the proportion of the variance in the dependent variable that is predictable from the independent variable.

- The coefficient of determination ranges from 0 to 1.
- An $R^2$ of 0 means that the dependent variable cannot be predicted from the independent variable.
- An $R^2$ of 1 means the dependent variable can be predicted without error from the independent variable.
- An $R^2$ between 0 and 1 indicates the extent to which the dependent variable is predictable. An $R^2$ of 0.10 means that 10 percent of the variance in Y is predictable from X; an $R^2$ of 0.20 means that 20 percent is predictable; and so on.

If you know the linear correlation (r) between two variables, then the coefficient of determination is easily computed using the following formula:

$$R^2 = r^2$$

6. Compute $R^2$ for our example. What does the coefficient of determination tell us in the context of the problem?

In the last activity, we created some lines and experimented with residuals to determine which line was a better fit. In this activity we will work out mathematically how to find the equation of the least squares regression line.

The Least Squares Regression Line

Linear regression finds the straight line, called the least squares regression line or LSRL, that best represents the observations in a bivariate data set. Suppose Y is a dependent variable, and X is an independent variable. The population regression line is:

$$Y = \beta_0 + \beta_1 X$$

Where $\beta_0$ is a constant, $\beta_1$ is the regression coefficient, X is the value of the independent variable, and Y is the value of the dependent variable.
Given a random sample of observations, the population regression line is estimated by:

$$\hat{y} = b_0 + b_1 x$$

Where $b_0$ is a constant, $b_1$ is the regression coefficient, x is the value of the independent variable, and $\hat{y}$ is the predicted value of the dependent variable.

Normally, you will use a computational tool, such as a software package (e.g., Excel) or a graphing calculator, to find $b_0$ and $b_1$. You enter the X and Y values into your program or calculator, and the tool solves for each parameter.

In the unlikely event that you find yourself on a desert island without a computer or a graphing calculator, you can solve for $b_0$ and $b_1$ "by hand". Here are the equations. The two expressions for $b_1$ are equivalent.

$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$$

$$b_1 = r \cdot \frac{s_y}{s_x}$$

$$b_0 = \bar{y} - b_1 \bar{x}$$

Where:
$b_0$ is the constant in the regression equation,
$b_1$ is the regression coefficient,
r is the correlation between x and y,
$x_i$ is the X value of observation i,
$y_i$ is the Y value of observation i,
$\bar{x}$ is the mean of X,
$\bar{y}$ is the mean of Y,
$s_x$ is the standard deviation of X, and
$s_y$ is the standard deviation of Y.

7. Using Excel, do the computations to find the equation of the regression line.

Properties of the Regression Line

When the regression parameters ($b_0$ and $b_1$) are defined as described above, the regression line has the following properties.

- The line minimizes the sum of squared differences between observed values (the y values) and predicted values (the $\hat{y}$ values computed from the regression equation).
- The regression line passes through the point $(\bar{x}, \bar{y})$, that is, through the mean of the X values and the mean of the Y values.
- The regression constant ($b_0$) is equal to the y-intercept of the regression line.
- The regression coefficient ($b_1$) is the average change in the dependent variable (Y) for a 1-unit change in the independent variable (X). It is the slope of the regression line.

The least squares regression line is the only straight line that has all of these properties.

8. Construct a scatter plot that displays the data for 𝐱 = elevation above sea level (in feet) and 𝐰 = mean number of partly cloudy days per year. Based on the scatter plot you constructed, is there a relationship between elevation and the mean number of partly cloudy days per year? If so, how would you describe the relationship? Explain your reasoning.
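The Excel computations in questions 5 to 7 can also be checked in a few lines of code. The following is a minimal Python sketch, not the handout's required method. It assumes the elevation (x) and clear-day (y) values as read from the table for the 14 cities, so re-check them against the source before relying on the results. The function name `fit_line` is hypothetical. It applies the deviation-score formula for r, the relation R² = r², and the "by hand" formulas for b1 and b0.

```python
import math
import statistics

# Assumption: values transcribed from the table, in the table's city order
# (Albany, NY through Tampa, FL). x = elevation (ft.), y = mean clear days/year.
elevation = [275, 5311, 114, 2838, 15, 3828, 5557, 672, 4, 434, 3162, 4221, 2356, 19]
clear_days = [69, 167, 40, 120, 98, 82, 114, 90, 101, 111, 111, 125, 86, 101]

def fit_line(xs, ys):
    """Return r, R^2, b0, and b1 computed from deviation scores."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    dx = [x - x_bar for x in xs]                  # deviation scores for x
    dy = [y - y_bar for y in ys]                  # deviation scores for y
    sum_xy = sum(a * b for a, b in zip(dx, dy))   # sum of products xy
    sum_x2 = sum(a * a for a in dx)               # sum of x^2
    sum_y2 = sum(b * b for b in dy)               # sum of y^2
    r = sum_xy / math.sqrt(sum_x2 * sum_y2)
    b1 = sum_xy / sum_x2                          # slope
    b0 = y_bar - b1 * x_bar                       # intercept
    return {"r": r, "r_squared": r ** 2, "b0": b0, "b1": b1,
            "x_bar": x_bar, "y_bar": y_bar}

fit = fit_line(elevation, clear_days)

# Cross-check the slope with the second formula, b1 = r * (sy / sx),
# using sample standard deviations.
b1_check = fit["r"] * statistics.stdev(clear_days) / statistics.stdev(elevation)

print(f"r   = {fit['r']:.3f}")
print(f"R^2 = {fit['r_squared']:.3f}")
print(f"y-hat = {fit['b0']:.2f} + {fit['b1']:.5f} x")
print(f"slope from r * sy / sx = {b1_check:.5f}")
```

To explore question 8 the same way, replace `clear_days` with the partly cloudy column (𝐰) from the table and call `fit_line` again.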