Correlation and Regression (Ch 4)

Why Regression and Correlation
- We need to be able to analyze the relationship between two variables (up to now we have only looked at single variables).
- We can use one variable to predict the other, but regression and correlation do NOT imply causation.
- Example: in the late 1940s, analysts found a strong correlation between the amount of ice cream consumed and the rate of onset of polio. Did ice cream cause polio? No, but both peak in the summer months, and that shared seasonality produces the strong correlation.

Scatter Plots (pg 158)
[Scatter plot: Cost of Lots in Glen Ellyn, IL; Square Footage (in 100's) on the x-axis vs. Sale Price (in $1000's) on the y-axis]

Possible Relationships (pg 159)
- Positive linear
- Negative linear
- No apparent relationship
- Nonlinear relationship

Correlation Coefficient (r)
- How we quantify the strength and direction of the linear relationship.
- Always between -1 and 1, inclusive.
- r = 1 or r = -1 is a perfect fit (the data points lie on a perfectly straight line).
- r = 0 means no linear correlation at all.

Correlation Coefficient
- r = Σ(x - x̄)(y - ȳ) / [(n - 1) s_x s_y]
  where s_x is the sample standard deviation of the x-values and s_y is the sample standard deviation of the y-values.
- It takes too long to calculate by hand, so use Excel: =CORREL(array1, array2)

Test for Linear Correlation
- Find the absolute value of r.
- In Table G, go to row n.
- Compare |r| to the critical value in Table G.
- If |r| is GREATER than the critical value from Table G, the variables ARE linearly correlated (either positively or negatively).
- If |r| is LESS than or EQUAL to the critical value from Table G, the variables are NOT linearly correlated.

Lot/Sale Price Example
- Is there a linear correlation between the size of the lot and the sale price for the data from the earlier slide?

Regression Lines: Least-Squares Method
- ŷ = mx + b   (notice the notation for the y; it is called "y hat")
- m = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²
- b = ȳ - m·x̄
- The slope (m) is the estimated change in y per unit of x (how much y increases or decreases for each one-unit increase in x).
- The y-intercept (b) is the predicted value of y when x is zero.
- Let's find the equation of the regression line for the lot/sale price example, using Excel.

Excel for the Least-Squares Method
- Enter the data in Excel.
- Insert chart: marked scatter (no lines connecting the points).
- Click the plus sign next to your chart.
- Check "Trendline."
- Click the arrow to the right of "Trendline."
- Go to more options.
- Select Type: Linear.
- Options: check the box to display the equation and the box to display the R-squared value (we talk about r squared later).
- Pg 175, Figure 4.15

Use of Regression Lines
- Use regression lines to predict.
- Interpolation: predicting within the range of the plotted data.
- Extrapolation: predicting outside the range of the plotted data.
- There is possible error in both interpolation and extrapolation (the predicted value vs. the observed value).
- The prediction error (or residual) is y - ŷ.
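The slides themselves use Excel (=CORREL and the chart trendline), but if it helps to see the formulas above spelled out, here is a minimal Python sketch of the correlation coefficient, the least-squares line, and the residuals. The square-footage and sale-price numbers are made up for illustration; they are not the textbook's data.

    from math import sqrt

    # Hypothetical lot data: square footage (in 100's) and sale price (in $1000's).
    x = [10, 25, 40, 60, 85, 110]
    y = [75, 120, 160, 230, 310, 420]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Sample standard deviations (divide by n - 1).
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

    # Correlation coefficient: r = sum of (x - x_bar)(y - y_bar) / ((n - 1) s_x s_y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    r = sxy / ((n - 1) * s_x * s_y)

    # Least-squares regression line: y_hat = m*x + b
    m = sxy / sum((xi - x_bar) ** 2 for xi in x)
    b = y_bar - m * x_bar

    # Residual (prediction error) for each point: y - y_hat
    residuals = [yi - (m * xi + b) for xi, yi in zip(x, y)]

    print(f"r = {r:.4f}")
    print(f"regression line: y_hat = {m:.3f} x + {b:.3f}")
    print("residuals:", [round(e, 1) for e in residuals])

In Excel, the same r comes from =CORREL(array1, array2), and the slope and intercept match the equation the trendline displays on the chart.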
SSE, SST, and SSR
- We are going to briefly talk about three different measures that have ugly equations. Individually the numbers are not that useful, but at the end we will put them together to find a useful value, so just be patient.
- And, just so you know, we will use Excel to calculate all of these values too.

Sum of Squares Error (SSE)
- We want our prediction errors to be small.
- We use SSE to measure the prediction error:
  SSE = Σ(y - ŷ)²
- When we use the least-squares criterion, SSE is minimized. We will use Excel in just a little bit.

Standard Error of the Estimate (s)
- Gives a measure of a typical residual (a typical prediction error, kind of like the "average" error); we want it to be small.
  s = √( SSE / (n - 2) )

SST
- Is SSE = 12 "small," which would indicate that our regression line is useful? We have to find a couple of other values to help answer this.
- Total Sum of Squares (SST):
  SST = Σ(y - ȳ)²
  or, where s² is the sample variance of the y-values,
  SST = (n - 1) s²

SSR
- The Sum of Squares Regression (SSR) measures the improvement in the accuracy of our estimates when we use the regression equation instead of relying only on the mean of the y-values (ȳ):
  SSR = Σ(ŷ - ȳ)²

SST, SSR, SSE
- SST = SSR + SSE
- Pg 190, ex 4.16
- Go to Excel: File, Options, Add-Ins, Go, check Analysis ToolPak, OK.
- On your "Data" tab, "Data Analysis" should appear at the right.
- Select Data Analysis, select Regression, OK.
- Fill in the y-values and the x-values, OK.
- In the table that appears, under ANOVA, the SS column is where we get SST (Total), SSR (Regression), and SSE (Residual).

Coefficient of Determination
- r² = SSR / SST
- Measures the goodness of fit of the regression equation to the data (always between 0 and 1, inclusive).
- This was on our original trendline graph!

Coefficient of Determination (continued)
- Read the tan box below problem 4.17, pg 190.
- The closer the coefficient of determination is to 1, the better the fit of the regression equation to the data (0 is a horrible fit).
- Pg 194, #43
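To tie the pieces together, here is a companion Python sketch (again with made-up data, not the output of Excel's Analysis ToolPak) showing how SSE, SST, SSR, the coefficient of determination, and the standard error of the estimate are computed, and how SST = SSR + SSE checks out numerically.

    from math import sqrt

    # Hypothetical data (same style as the earlier sketch).
    x = [10, 25, 40, 60, 85, 110]
    y = [75, 120, 160, 230, 310, 420]

    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n

    # Least-squares slope and intercept (same formulas as the regression slide).
    m = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / \
        sum((xi - x_bar) ** 2 for xi in x)
    b = y_bar - m * x_bar
    y_hat = [m * xi + b for xi in x]

    SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # error around the line
    SST = sum((yi - y_bar) ** 2 for yi in y)               # total variation in y
    SSR = sum((yh - y_bar) ** 2 for yh in y_hat)           # variation explained by the line

    r_squared = SSR / SST        # coefficient of determination
    s = sqrt(SSE / (n - 2))      # standard error of the estimate

    print(f"SSE = {SSE:.1f}, SSR = {SSR:.1f}, SST = {SST:.1f} (SSR + SSE = {SSR + SSE:.1f})")
    print(f"r^2 = {r_squared:.4f}, standard error of the estimate s = {s:.2f}")

These correspond to the Regression, Residual, and Total rows of the SS column in the ANOVA table that Excel's Regression tool produces.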