Linear Regression

The simple linear regression model describes the relationship between one dependent variable (y) and one independent variable (x). A dependent variable is a random variable whose variation is predicted by an independent variable. The relationship is specified as:

y = b0 + b1x

Where:
• b0 and b1 are fixed numbers. The value of b0 determines the point at which the straight line crosses the y-axis: the y-intercept.
• The value of b1 determines the slope of the line: the amount by which y changes for every unit increase in x.
  If b1 < 0 the slope is negative.
  If b1 > 0 the slope is positive.
  If b1 = 0 the slope is zero -- the line is horizontal (parallel to the x-axis).

STA/FIN Class Notes: Regression Analysis (Simple)
Warning!! These notes contain direct references to copyrighted material. Do not quote, copy, or replicate without permission: mailto:nina@uri.edu
Class Notes to accompany: Introductory Statistics, 8th Ed, by Neil A. Weiss
Prepared by: Nina Kajiji, Ph.D.
Last Update: 1-DEC-10

Notation Used

Sxx = Σx² − (Σx)²/n    OR    Σ(x − x̄)²
Sxy = Σxy − (Σx)(Σy)/n    OR    Σ(x − x̄)(y − ȳ)
Syy = Σy² − (Σy)²/n    OR    Σ(y − ȳ)²

b1 = Sxy / Sxx
b0 = (1/n)(Σy − b1Σx)

The Least-Squares Criterion
The straight line that best fits a set of data points is the one for which the sum of squared errors is smallest.

Regression Line
The straight line that fits a set of data points best according to the least-squares criterion is the regression line.

The Regression Equation
The equation of the regression line is stated as: y = b0 + b1x

Estimation & Prediction
Estimation or prediction is simply applying the regression equation to an x-value to obtain an estimate of the corresponding y-value. That is, ŷ = b0 + b1x.
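The shortcut formulas above translate directly into code. A minimal sketch in Python; the data points here are hypothetical, purely for illustration:

```python
# Least-squares fit using the shortcut formulas Sxx, Sxy, b1, b0 from the notes.
# The data lists are hypothetical toy values, for illustration only.

def fit_line(x, y):
    """Return (b0, b1) for the regression line y-hat = b0 + b1*x."""
    n = len(x)
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n                   # Sxx
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n   # Sxy
    b1 = sxy / sxx                                                     # slope
    b0 = (sum(y) - b1 * sum(x)) / n                                    # intercept
    return b0, b1

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
b0, b1 = fit_line(x, y)   # slope is about 1.99, intercept about 0.05
```

The intercept line mirrors b0 = (1/n)(Σy − b1Σx) exactly; the alternative deviation-form formulas give the same answer.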
The Total Sum of Squares
The total of the squared deviations of the observed y-values from their mean is called the total sum of squares:

SST = Syy = Σ(y − ȳ)²

Error Sum of Squares
The total squared error is called the error sum of squares:

SSE = Syy − (Sxy)²/Sxx = Σ(y − ŷ)²

Sum of Squares Due to Regression
The total amount of variation explained by the regression line is called the regression sum of squares:

SSR = (Sxy)²/Sxx = Σ(ŷ − ȳ)²

The Regression Identity

SST = SSR + SSE

The Coefficient of Determination
The r-square, or coefficient of determination, is the percentage reduction in total squared error obtained by using the regression equation instead of the sample mean to predict the observed y-values. In other words, it is the amount of variation in the dependent variable that is explained by the independent variable.

r² = SSR / SST    OR    r² = 1 − (SSE / SST)    OR    r² = (Sxy)² / (Sxx·Syy)

Linear Correlation
• The square root of r² gives the magnitude of the linear correlation (r) between x and y; r carries the same sign as the slope b1.
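The sums of squares and r² follow the same pattern; SSE falls out of the regression identity. A small sketch (hypothetical data again):

```python
# SST, SSR, SSE and r-square, following the definitions and the
# regression identity SST = SSR + SSE. Data is hypothetical.

def sums_of_squares(x, y):
    n = len(x)
    sxx = sum(xi * xi for xi in x) - sum(x) ** 2 / n
    syy = sum(yi * yi for yi in y) - sum(y) ** 2 / n
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - sum(x) * sum(y) / n
    sst = syy                 # total sum of squares
    ssr = sxy ** 2 / sxx      # variation explained by the regression line
    sse = sst - ssr           # unexplained variation (regression identity)
    r2 = ssr / sst            # coefficient of determination
    return sst, ssr, sse, r2

sst, ssr, sse, r2 = sums_of_squares([1, 2, 3, 4], [1.1, 1.9, 3.2, 3.8])
```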
• r ranges from −1 to +1.
• Positive values indicate a positive correlation; negative values indicate a negative correlation.
• Correlation does not imply causation: even if x and y are strongly correlated, it does not follow that changes in x cause changes in y.

Other Terms

Extrapolation
Using the regression equation to make predictions for x-values outside the range of x-values in the sample data.

Outliers & Influential Observations
Recall: an outlier is an observation that lies outside the overall pattern of the data. In a regression context, an outlier is a data point that lies far from the regression line. An influential observation is a data point whose removal causes the regression equation to change considerably.

Scatter Plots
A plot of x against y used to visualize the pattern of the sample data. If the plot shows a non-linear relationship between x (the predictor variable) and y (the response variable), DO NOT use linear regression methods.

Four Data Sets Having the Same Summary Statistics (Source: Anscombe, 1973)

 x1-x3    y1      y2      y3      x4     y4
   4     4.26    3.10    5.39      8    6.58
   5     5.68    4.74    5.73      8    5.76
   6     7.24    6.13    6.04      8    7.71
   7     4.82    7.26    6.42      8    8.84
   8     6.95    8.14    6.77      8    8.47
   9     8.81    8.77    7.11      8    7.04
  10     8.04    9.14    7.46      8    5.25
  11     8.33    9.26    7.81      8    5.56
  12    10.84    9.13    8.15      8    7.91
  13     7.58    8.74   12.74      8    6.89
  14     9.96    8.10    8.84     19   12.50

Mean     9.00 (x)  7.50 (y) for every data set
STD.     3.32 (x)  2.03 (y) for every data set

[Figure: scatter plots of Data Sets 1-4, which look entirely different despite sharing the same summary statistics.]

Inferential Methods

Assumptions for Regression Inferences

1. Population Regression Line (Assumption I): There is a straight line, y = β0 + β1x, such that for each x-value, the mean of the corresponding population of y-values lies on that straight line.

2. Equal Standard Deviations (Assumption II): The standard deviation, σ, of the population of y-values corresponding to a particular x-value is the same, regardless of the x-value.

3. Normality (Assumption III): For each x-value, the corresponding population of y-values is normally distributed.
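Anscombe's point can be checked directly from the table above: fitting each of the four data sets gives essentially the same slope, intercept, and r-square, even though their scatter plots look completely different. A sketch:

```python
# Verify that the four data sets in the table share (approximately)
# the same regression line and coefficient of determination.

def fit_stats(x, y):
    n = len(x)
    sxx = sum(v * v for v in x) - sum(x) ** 2 / n
    syy = sum(v * v for v in y) - sum(y) ** 2 / n
    sxy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n
    b1 = sxy / sxx
    b0 = (sum(y) - b1 * sum(x)) / n
    r2 = sxy ** 2 / (sxx * syy)
    return b0, b1, r2

x123 = [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]      # shared by sets 1-3
y1 = [4.26, 5.68, 7.24, 4.82, 6.95, 8.81, 8.04, 8.33, 10.84, 7.58, 9.96]
y2 = [3.10, 4.74, 6.13, 7.26, 8.14, 8.77, 9.14, 9.26, 9.13, 8.74, 8.10]
y3 = [5.39, 5.73, 6.04, 6.42, 6.77, 7.11, 7.46, 7.81, 8.15, 12.74, 8.84]
x4 = [8] * 10 + [19]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]

stats = [fit_stats(x123, y1), fit_stats(x123, y2),
         fit_stats(x123, y3), fit_stats(x4, y4)]
# Each tuple is roughly (3.0, 0.50, 0.67) -- yet only a scatter plot
# reveals which data sets are actually suited to a straight-line model.
```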
Standard Error of the Estimate
The standard error of the estimate is defined by:

se = Sqrt(SSE / (n − 2))

It provides an estimate of the common population standard deviation σ. The se indicates how far the observed y-values are from the predicted y-values, on average.

Residual Analysis for the Regression Model
If the assumptions for regression inferences are met, then the following two conditions should hold.

1. A plot of the residuals against the x-values should fall roughly in a horizontal band centered and symmetric about the x-axis. If a pattern is evident, you probably need to use some analytical method other than simple linear regression. For example, the following graph shows that a quadratic relationship exists in the residuals.

[Figure: Linearity Test -- residuals vs. units for dependent variable "Min", showing a curved (quadratic) pattern.]

2. A normal probability plot of the residuals should be roughly linear. For example, the following graph shows that the normality assumption is violated.

[Figure: Normal Probability Plot -- sorted residuals vs. expected residuals for dependent variable "Min", departing clearly from a straight line.]

Hypothesis Tests for the Slope

Ho: β1 = 0
Ha: β1 ≠ 0

The test statistic has a t-distribution with df = (n − 2).
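Residual analysis starts from the residuals y − ŷ themselves. A minimal sketch computing the residuals and se; the data and fitted line here are hypothetical (chosen so the toy line fits exactly):

```python
# Residuals and the standard error of the estimate se = sqrt(SSE / (n - 2)).
# Toy data and coefficients are hypothetical, for illustration only.
import math

def residuals_and_se(x, y, b0, b1):
    resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]  # y - y-hat
    sse = sum(e * e for e in resid)                         # error sum of squares
    se = math.sqrt(sse / (len(x) - 2))                      # standard error
    return resid, se

# A line that fits the toy data exactly: y = 1 + 2x, so every residual is 0.
resid, se = residuals_and_se([1, 2, 3, 4], [3, 5, 7, 9], 1, 2)
```

In practice one would plot `resid` against the x-values and inspect it for the horizontal-band pattern described above.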
t = (b1 − β1) / (se / Sqrt(Sxx))

Under Ho (β1 = 0) this reduces to t = b1 / (se / Sqrt(Sxx)). If the value of the test statistic falls in the rejection region, reject the null; otherwise do not reject the null.

Confidence Intervals for the Slope
The endpoints of the confidence interval for β1 are:

b1 ± tα/2 · (se / Sqrt(Sxx)),  with df = n − 2

Confidence Intervals for Means in Regression

ŷp ± tα/2 · se · Sqrt(1/n + (xp − x̄)² / Sxx),  with df = n − 2

Confidence Intervals for a Population y-value Given an x-value (Prediction Interval)

ŷp ± tα/2 · se · Sqrt(1 + 1/n + (xp − x̄)² / Sxx),  with df = n − 2

Aside: The F Distribution
• Not symmetric (it is right-skewed and takes only non-negative values).
• Its center is not at zero.

Using the F-Table
• You need a significance level, numerator degrees of freedom (n1 − 1), and denominator degrees of freedom (n2 − 1).
• To find the left-tailed critical value, take the reciprocal of the right-tailed value with the degrees of freedom reversed.
• For example: for a two-tailed test at alpha = 0.05 (0.025 in each tail) with n1 = 10 and n2 = 7, df = (9, 6). From the table, the right-tail critical value is 5.5234. The left-tail value is 1/4.317 ≈ 0.232, where 4.317 is the right-tail critical value for (6, 9) degrees of freedom.
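The three interval formulas differ only in the term under the square root. A sketch, with the t critical value supplied from a table rather than computed; the numbers in the sample call are taken from the Corvette example in these notes, and xp = 4 is an arbitrary illustrative query point:

```python
# Interval formulas for the slope, the mean response, and an individual
# y-value. The t critical value must be looked up for df = n - 2.
import math

def slope_ci(b1, se, sxx, t_crit):
    """Confidence interval for the slope beta1."""
    half = t_crit * se / math.sqrt(sxx)
    return b1 - half, b1 + half

def mean_ci(yhat_p, se, n, xp, xbar, sxx, t_crit):
    """Confidence interval for the mean y-value at x = xp."""
    half = t_crit * se * math.sqrt(1 / n + (xp - xbar) ** 2 / sxx)
    return yhat_p - half, yhat_p + half

def prediction_interval(yhat_p, se, n, xp, xbar, sxx, t_crit):
    """Prediction interval for an individual y-value at x = xp.
    Identical to mean_ci except for the extra '1 +' under the root."""
    half = t_crit * se * math.sqrt(1 + 1 / n + (xp - xbar) ** 2 / sxx)
    return yhat_p - half, yhat_p + half

# Slope CI with b1 = -27.9, se = 14.247, Sxx = 30.9, t(df=8) = 2.306:
lo, hi = slope_ci(-27.9, 14.247, 30.9, 2.306)

# The prediction interval is always wider than the mean CI at the same xp
# (yhat_p = 200 and xp = 4 are illustrative placeholders).
m = mean_ci(200, 14.247, 10, 4, 4.1, 30.9, 2.306)
p = prediction_interval(200, 14.247, 10, 4, 4.1, 30.9, 2.306)
```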
Regression Example << Regrsam.xls >>

Age (x)   Price (y)     xy     x²      y²
   6        125         750    36    15625
   6        115         690    36    13225
   6        130         780    36    16900
   2        260         520     4    67600
   2        219         438     4    47961
   5        150         750    25    22500
   4        190         760    16    36100
   5        163         815    25    26569
   1        260         260     1    67600
   4        160         640    16    25600
TOTALS:    41   1772   6403   199   339680

Syy = 339680 − (1772)²/10 = 25681.60
Sxx = 199 − (41)²/10 = 30.90
Sxy = 6403 − (41)(1772)/10 = −862.20

b1 = Sxy / Sxx = −27.90
b0 = (1/10)(1772 − b1(41)) = 291.60

SST = Syy = 25681.600
SSR = (Sxy)²/Sxx = 24057.891
SSE = SST − SSR = 1623.709
se = Sqrt(SSE / (n − 2)) = 14.247
r-square = (1 − (SSE / SST)) × 100 = 93.68%
r = −Sqrt(r-square) = −0.97 (negative because b1 < 0)

Hypothesis Testing

Ho: β1 = 0
Ha: β1 ≠ 0

t = b1 / (se / Sqrt(Sxx)) = −10.887
Critical t at 95% and df = 8: ±2.306

Since the calculated value falls in the rejection region, we reject the null hypothesis. There is enough evidence to conclude that the age of a Corvette is useful for predicting its price.

Confidence Interval at 95%

−27.9 + 2.306 × (14.247 / Sqrt(30.90)) = −21.99
−27.9 − 2.306 × (14.247 / Sqrt(30.90)) = −33.81

CI = (−33.81, −21.99)
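The whole example can be reproduced from the raw data in the table. A sketch; note that the intercept divides by n = 10, the actual sample size:

```python
# Reproduce the Corvette example end to end from the raw data above.
import math

age   = [6, 6, 6, 2, 2, 5, 4, 5, 1, 4]
price = [125, 115, 130, 260, 219, 150, 190, 163, 260, 160]

n = len(age)                                                       # 10
sxx = sum(x * x for x in age) - sum(age) ** 2 / n                  # 30.9
syy = sum(y * y for y in price) - sum(price) ** 2 / n              # 25681.6
sxy = (sum(x * y for x, y in zip(age, price))
       - sum(age) * sum(price) / n)                                # -862.2

b1 = sxy / sxx                          # slope, about -27.90
b0 = (sum(price) - b1 * sum(age)) / n   # intercept; divide by n = 10
ssr = sxy ** 2 / sxx                    # regression sum of squares
sse = syy - ssr                         # error sum of squares
se = math.sqrt(sse / (n - 2))           # standard error, about 14.247
r2 = 1 - sse / syy                      # about 0.9368
t = b1 / (se / math.sqrt(sxx))          # test statistic, about -10.89
```

Since |t| exceeds the critical value 2.306 at df = 8, the code reaches the same conclusion as the hand calculation.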