Simple Regression

Correlation tells us how strongly Y and X are related, but regression estimates the form of this relationship. We'll begin with simple regression, which assumes the form:

$\hat{Y}_i = b_0 + b_1 X_i$

Y is the variable we want to predict
X is the variable we believe influences how Y behaves
Ŷi is the estimated value of Y at Xi
b0 is the Y-intercept in the equation
b1 is the slope of the regression line

Our goal: find the straight line that best fits the data we've collected. The best equation will be the one that minimizes the error in fit.

The equation is: $\hat{Y}_i = b_0 + b_1 X_i$
The fit error is thus: $e_i = Y_i - \hat{Y}_i$

[Figure: scatterplot with a fitted line. Points above the line have positive (+) errors; points below it have negative (–) errors.]

The fit error for the ith point on the scatterplot diagram is:

$e_i = Y_i - \hat{Y}_i$

We would like the sum of the + errors to be the same size as the sum of the – errors, so that they cancel. However, there are many lines that can make this happen.

So, which of these solutions is the best one? Select the line with the minimum sum of squared error terms. This is called least-squares regression.

Intercept: $b_0 = \bar{Y} - b_1 \bar{X}$
Slope: $b_1 = \frac{SS_{xy}}{SS_x} = \frac{\mathrm{COVAR}(x, y)}{\mathrm{Var}(x)} \cdot \frac{n}{n-1}$ *

* Note: COVAR here is Excel's function, which computes the population covariance, not the sample covariance. Multiplying by n/(n – 1) converts it to the sample covariance, to match the sample variance Var(x).

In Excel, the estimates can be obtained in several ways:
◦ Calculate the values directly from the means, variances, and covariances.
◦ For one-variable (simple) regression, add a trendline to a chart.
◦ Use the Data Analysis Tool, Regression.
◦ Use the Excel function LINEST.

[Figure: scatterplot of Y against X with a fitted line from Excel's Trend Line function: y = 0.0297x + 0.1912]

The LINEST function must be entered as an array formula. For the example, highlight the cells E3:F7, type the formula "=LINEST(Orders,Weight,1,1)", then press CTRL-SHIFT-ENTER.
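As a cross-check outside Excel, the least-squares formulas for b0 and b1 given earlier can be sketched in Python. This is only an illustration: the function names and the five data points are made up, not taken from the orders example. It computes the slope two ways (SS_xy / SS_x, and population covariance over sample variance scaled by n/(n – 1)) and confirms that the fit errors sum to zero.

```python
# Sketch of the least-squares formulas; the data below are made up for illustration.

def least_squares(x, y):
    """Return (b0, b1) for the line y_hat = b0 + b1 * x using SS_xy / SS_x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_x
    b0 = y_bar - b1 * x_bar
    return b0, b1


def slope_from_covar_and_var(x, y):
    """Slope via population covariance and sample variance, scaled by n / (n - 1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Population covariance divides by n (as Excel's COVAR does)
    covar_pop = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / n
    # Sample variance divides by n - 1
    var_sample = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
    return covar_pop / var_sample * n / (n - 1)


x = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical X values
y = [2.0, 4.0, 5.0, 4.0, 5.0]  # hypothetical Y values

b0, b1 = least_squares(x, y)
b1_alt = slope_from_covar_and_var(x, y)
# Positive and negative fit errors should cancel for the least-squares line
residual_sum = sum(yi - (b0 + b1 * xi) for xi, yi in zip(x, y))

print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")
print(f"slope via covariance/variance route = {b1_alt:.4f}")
print(f"sum of fit errors = {residual_sum:.6f}")
```

Both routes should give the same slope, which is the point of the n/(n – 1) correction in the note on COVAR.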
Remember that the variables are X = weight of mail in pounds and Y = orders in 1000s.

The estimated intercept (b0) tells us that if there were no mail, we would still have a minimum of (.1912)(1000) or 191.2 orders per day.

The estimated slope (b1) tells us that each pound of mail tends to bring with it (.0297)(1000) or 29.7 orders.

There are two standard ways to judge how well the line fits:
1. How much of the variation in the Y values (orders) can be attributed to the different values of X (weight of mail)?
2. In general, how small (or large) are the errors in fit?

The Coefficient of Determination:

$R^2 = \frac{\text{variation in Y explained by the X-Y relationship}}{\text{variation in Y}}$

The R2 value is:
◦ Always between 0 and 1
◦ The proportion of variation explained by the model
◦ The square of the correlation (for simple regression)

ANOVA table:
Total variation in the Y values is SST = 449.76
The amount of unexplained variation is SSE = 12.12
The difference is thus the variation explained by the regression equation, SSR = 449.76 – 12.12 = 437.64
The ratio of explained to total variation is how we get R2 = 437.64/449.76 = .973

For every observation i, its error is given by:

$e_i = Y_i - \hat{Y}_i$

To find the "typical error," use this formula:

$S = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}}$

This is the "Standard Error", which is also √MSE.

The typical error (called the standard error of prediction) for our regression model is S = .7258.

This means that we typically misestimate the actual number of orders per day by (.7258)(1000) = 725.8.

That may sound like a lot, but consider that we receive between 5 and 20 thousand orders each day, with an average of (13.22)(1000) = 13,220. The percentage error is then only 725.8 / 13,220 = 5.5%.
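The two fit measures above (R2 and the standard error S) can be sketched in Python as follows. The helper name fit_measures and the five-point data set are hypothetical and serve only to show the mechanics; b0 = 2.2 and b1 = 0.6 are the least-squares estimates for that made-up data. The last few lines redo the R2 and percentage-error arithmetic using only the figures quoted in the text (SST, SSE, S, and the average of 13.22 thousand orders).

```python
import math

def fit_measures(x, y, b0, b1):
    """Return SST, SSE, SSR, R^2 and the standard error S for a fitted line."""
    n = len(y)
    y_bar = sum(y) / n
    errors = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # e_i = Y_i - Y_hat_i
    sse = sum(e ** 2 for e in errors)                        # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)                 # total variation
    ssr = sst - sse                                          # explained variation
    return {
        "SST": sst,
        "SSE": sse,
        "SSR": ssr,
        "R2": ssr / sst,
        "S": math.sqrt(sse / (n - 2)),  # typical error, sqrt(MSE)
    }

# Hypothetical data and its least-squares line (not the orders data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
measures = fit_measures(x, y, b0=2.2, b1=0.6)
for name, value in measures.items():
    print(f"{name} = {value:.4f}")

# Arithmetic check using only the figures quoted in the text
sst_quoted, sse_quoted = 449.76, 12.12
r2_quoted = (sst_quoted - sse_quoted) / sst_quoted
pct_error = 0.7258 / 13.22  # S divided by average orders, both in 1000s
print(f"R2 from quoted ANOVA figures = {r2_quoted:.3f}")
print(f"percentage error = {pct_error:.1%}")
```

Note that SSR = SST – SSE holds here because the line is the least-squares fit; for an arbitrary line the decomposition would not apply.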