COMPLETE BUSINESS STATISTICS by AMIR D. ACZEL & JAYAVEL SOUNDERPANDIAN, 7th edition. Prepared by Lloyd Jaisingh, Morehead State University. Chapter 10: Simple Linear Regression and Correlation. McGraw-Hill/Irwin. Copyright © 2009 by The McGraw-Hill Companies, Inc. All rights reserved.

Chapter 10: Simple Linear Regression and Correlation
• Using Statistics
• The Simple Linear Regression Model
• Estimation: The Method of Least Squares
• Error Variance and the Standard Errors of Regression Estimators
• Correlation
• Hypothesis Tests about the Regression Relationship
• How Good is the Regression?
• Analysis of Variance Table and an F Test of the Regression Model
• Residual Analysis and Checking for Model Inadequacies
• Use of the Regression Model for Prediction
• The Solver Method for Regression

LEARNING OBJECTIVES
After studying this chapter, you should be able to:
• Determine whether a regression experiment would be useful in a given instance
• Formulate a regression model
• Compute a regression equation
• Compute the covariance and the correlation coefficient of two random variables
• Compute confidence intervals for regression coefficients
• Compute a prediction interval for the dependent variable

LEARNING OBJECTIVES (continued)
After studying this chapter, you should be able to:
• Test hypotheses about regression coefficients
• Conduct an ANOVA experiment using regression results
• Analyze residuals to check whether the assumptions about the regression model are valid
• Solve regression problems using spreadsheet templates
• Use the LINEST function to carry out a regression

10-1 Using Statistics
• Regression refers to the statistical technique of modeling the relationship between variables.
• In simple linear regression, we model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y will be a straight-line relationship.
• A graphical sketch of the pairs (X, Y) is called a scatter plot.

[Figure: Scatterplot of Advertising Expenditures (X) and Sales (Y)]
This scatterplot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship. The line represents the nature of the relationship on average.

[Figure: Examples of Other Scatterplots. Six panels showing various patterns of association between X and Y.]

Model Building
The inexact nature of the relationship between advertising and sales suggests that a statistical model might be useful in analyzing the relationship. A statistical model separates the systematic component of a relationship from the random component:

  Data = Systematic component + Random errors

In ANOVA, the systematic component is the variation of means between samples or treatments (SSTR) and the random component is the unexplained variation (SSE). In regression, the systematic component is the overall linear relationship, and the random component is the variation around the line.
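To make the scatter-plot step above concrete, here is a minimal Python sketch. The advertising and sales figures in it are hypothetical stand-ins, not the data behind the slide's plot.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical advertising expenditures (X) and sales (Y); illustrative only.
advertising = np.array([5, 8, 12, 16, 20, 25, 30, 35, 40, 45])
sales = np.array([22, 30, 38, 50, 55, 70, 79, 88, 100, 118])

plt.scatter(advertising, sales)
plt.xlabel("Advertising")
plt.ylabel("Sales")
plt.title("Scatterplot of Advertising Expenditures (X) and Sales (Y)")
plt.show()
```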
10-2 The Simple Linear Regression Model
The population simple linear regression model:

  Y = β₀ + β₁X + ε
  (nonrandom or systematic component β₀ + β₁X, plus random component ε)

where
• Y is the dependent variable, the variable we wish to explain or predict;
• X is the independent variable, also called the predictor variable;
• ε is the error term, the only random component in the model, and thus the only source of randomness in Y;
• β₀ is the intercept of the systematic component of the regression relationship;
• β₁ is the slope of the systematic component.

The conditional mean of Y:  E[Y|X] = β₀ + β₁X

Picturing the Simple Linear Regression Model
[Figure: Regression plot of the line E[Y] = β₀ + β₁X, with β₀ the intercept, β₁ the slope, and the error εᵢ the vertical distance from the observed point Yᵢ to the line.]
The simple linear regression model gives an exact linear relationship between the expected or average value of Y, the dependent variable, and X, the independent or predictor variable:

  E[Yᵢ] = β₀ + β₁Xᵢ

Actual observed values of Y differ from the expected value by an unexplained or random error:

  Yᵢ = E[Yᵢ] + εᵢ = β₀ + β₁Xᵢ + εᵢ

Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The values of the independent variable X are assumed fixed (not random); the only randomness in the values of Y comes from the error term εᵢ.
• The errors εᵢ are normally distributed with mean 0 and variance σ², and the errors are uncorrelated (not related) in successive observations. That is: ε ~ N(0, σ²).
[Figure: Identical normal distributions of errors, all centered on the regression line E[Y] = β₀ + β₁X.]

10-3 Estimation: The Method of Least Squares
Estimation of a simple linear regression relationship involves finding estimated or predicted values of the intercept and slope of the linear regression line.

The estimated regression equation:

  Y = b₀ + b₁X + e

where b₀ estimates the intercept of the population regression line, β₀; b₁ estimates the slope of the population regression line, β₁; and e stands for the observed errors, the residuals from fitting the estimated regression line b₀ + b₁X to a set of n points.

The estimated regression line:

  Ŷ = b₀ + b₁X

where Ŷ (Y-hat) is the value of Y lying on the fitted regression line for a given value of X.

Fitting a Regression Line
[Figure: Four panels showing the data, errors from an arbitrary fitted line, errors from the least squares regression line, and how the least squares line minimizes those errors.]

Errors in Regression
[Figure: The fitted regression line Ŷ = b₀ + b₁X, an observed data point Yᵢ, its predicted value Ŷᵢ, and the error eᵢ between them.]
For the observed data point Yᵢ at Xᵢ, the predicted value is Ŷᵢ = b₀ + b₁Xᵢ, and the error is eᵢ = Yᵢ − Ŷᵢ.

Least Squares Regression
The sum of squared errors in regression is:

  SSE = Σᵢ eᵢ² = Σᵢ (yᵢ − ŷᵢ)²   (summing over i = 1, …, n)

The least squares regression line is the line that minimizes the SSE with respect to the estimates b₀ and b₁.
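As a quick numeric illustration of the least-squares criterion just stated, the sketch below fits a line with numpy and confirms that an arbitrary alternative line has a larger SSE. The data are hypothetical.

```python
import numpy as np

# Hypothetical (x, y) data; illustrative only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

def sse(b0, b1):
    """Sum of squared errors for the line y-hat = b0 + b1*x."""
    residuals = y - (b0 + b1 * x)
    return np.sum(residuals ** 2)

# Least-squares estimates (np.polyfit returns [slope, intercept] for degree 1).
b1_ls, b0_ls = np.polyfit(x, y, 1)

print(f"least squares: b0={b0_ls:.3f}, b1={b1_ls:.3f}, SSE={sse(b0_ls, b1_ls):.4f}")
# Any other candidate line has a larger SSE:
print(f"candidate line b0=0, b1=2:       SSE={sse(0.0, 2.0):.4f}")
```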
The normal equations (whose solution minimizes SSE with respect to b₀ and b₁):

  Σyᵢ = nb₀ + b₁Σxᵢ
  Σxᵢyᵢ = b₀Σxᵢ + b₁Σxᵢ²

[Figure: The SSE surface as a function of b₀ and b₁; SSE is minimized at the least squares values of b₀ and b₁.]

Sums of Squares, Cross Products, and Least Squares Estimators
Sums of squares and cross products:

  SS_X = Σ(x − x̄)² = Σx² − (Σx)²/n
  SS_Y = Σ(y − ȳ)² = Σy² − (Σy)²/n
  SS_XY = Σ(x − x̄)(y − ȳ) = Σxy − (Σx)(Σy)/n

Least squares regression estimators:

  b₁ = SS_XY / SS_X
  b₀ = ȳ − b₁x̄

Example 10-1

  Miles    Dollars   Miles²        Miles × Dollars
  1211     1802      1466521       2182222
  1345     2405      1809025       3234725
  1422     2005      2022084       2851110
  1687     2511      2845969       4236057
  1849     2332      3418801       4311868
  2026     2305      4104676       4669930
  2133     3016      4549689       6433128
  2253     3385      5076009       7626405
  2400     3090      5760000       7416000
  2468     3694      6091024       9116792
  2699     3371      7284601       9098329
  2806     3998      7873636       11218388
  3082     3555      9498724       10956510
  3209     4692      10297681      15056628
  3466     4244      12013156      14709704
  3643     5298      13271449      19300614
  3852     4801      14837904      18493452
  4033     5147      16265089      20757852
  4267     5738      18207288      24484046
  4498     6420      20232004      28877160
  4533     6059      20548088      27465448
  4804     6426      23078416      30870504
  5090     6321      25908100      32173890
  5233     7026      27384288      36767056
  5439     6964      29582720      37877196
  ------   -------   -----------   -----------
  79,448   106,605   293,426,946   390,185,014

  SS_X = Σx² − (Σx)²/n = 293,426,946 − (79,448)²/25 = 40,947,557.84
  SS_XY = Σxy − (Σx)(Σy)/n = 390,185,014 − (79,448)(106,605)/25 = 51,402,852.4
  b₁ = SS_XY/SS_X = 51,402,852.4/40,947,557.84 = 1.255333776 ≈ 1.26
  b₀ = ȳ − b₁x̄ = 106,605/25 − (1.255333776)(79,448/25) = 274.85

[Template (partial output) that can be used to carry out a Simple Regression.]
[Template (continued) that can be used to carry out a Simple Regression.]
[Template (continued): Residual Analysis. The plot shows the absence of a relationship between the residuals and the X-values (miles).]
[Template (continued): Note that the normal probability plot is approximately linear. This would indicate that the normality assumption for the errors has not been violated.]

Total Variance and Error Variance
[Figure: Two panels contrasting the total variation of Y, seen when looking at Y alone, with the smaller error variance of Y, seen when looking along the regression line.]

10-4 Error Variance and the Standard Errors of Regression Estimators
Degrees of freedom in regression: df = n − 2 (n total observations less one degree of freedom for each parameter estimated, b₀ and b₁). Square and sum all regression errors to find SSE:

  SSE = Σ(Y − Ŷ)² = SS_Y − (SS_XY)²/SS_X = SS_Y − b₁SS_XY

An unbiased estimator of σ², denoted by s², is:

  MSE = SSE/(n − 2)

Example 10-1:

  SSE = SS_Y − b₁SS_XY = 66,855,898 − (1.255333776)(51,402,852.4) = 2,328,161.2
  MSE = SSE/(n − 2) = 2,328,161.2/23 = 101,224.4
  s = √MSE = √101,224.4 = 318.158

Standard Errors of Estimates in Regression
The standard error of b₀ (intercept), where s = √MSE:

  s(b₀) = s · √(Σx²/(n · SS_X))

The standard error of b₁ (slope):

  s(b₁) = s/√SS_X

Example 10-1:

  s(b₀) = 318.158 · √(293,426,946/((25)(40,947,557.84))) = 170.338
  s(b₁) = 318.158/√40,947,557.84 = 0.04972

Confidence Intervals for the Regression Parameters
A (1 − α)100% confidence interval for β₀:  b₀ ± t_(α/2, n−2) · s(b₀)
A (1 − α)100% confidence interval for β₁:  b₁ ± t_(α/2, n−2) · s(b₁)

Example 10-1, 95% confidence intervals:

  b₀ ± t_(0.025, 23) · s(b₀) = 274.85 ± (2.069)(170.338) = 274.85 ± 352.43 = [−77.58, 627.28]
  b₁ ± t_(0.025, 23) · s(b₁) = 1.25533 ± (2.069)(0.04972) = 1.25533 ± 0.10287 = [1.15246, 1.35820]

[Figure: The 95% confidence interval for the slope, centered at the least-squares point estimate b₁ = 1.25533. Since 0 lies outside the interval, 0 is not a possible value of the regression slope at the 95% level.]
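The Example 10-1 arithmetic above can be verified with a short Python sketch. It uses the miles/dollars data from the table and the critical value t(0.025, 23) = 2.069 quoted in the slides; everything else follows the formulas shown.

```python
import math

# Example 10-1 data: miles traveled and dollars charged (from the table above).
miles = [1211, 1345, 1422, 1687, 1849, 2026, 2133, 2253, 2400, 2468,
         2699, 2806, 3082, 3209, 3466, 3643, 3852, 4033, 4267, 4498,
         4533, 4804, 5090, 5233, 5439]
dollars = [1802, 2405, 2005, 2511, 2332, 2305, 3016, 3385, 3090, 3694,
           3371, 3998, 3555, 4692, 4244, 5298, 4801, 5147, 5738, 6420,
           6059, 6426, 6321, 7026, 6964]
n = len(miles)

sum_x, sum_y = sum(miles), sum(dollars)
ss_x = sum(x * x for x in miles) - sum_x ** 2 / n
ss_y = sum(y * y for y in dollars) - sum_y ** 2 / n
ss_xy = sum(x * y for x, y in zip(miles, dollars)) - sum_x * sum_y / n

b1 = ss_xy / ss_x                   # slope estimate, about 1.2553
b0 = sum_y / n - b1 * sum_x / n     # intercept estimate, about 274.85

sse = ss_y - b1 * ss_xy             # about 2,328,161.2
mse = sse / (n - 2)                 # about 101,224.4
s = math.sqrt(mse)                  # about 318.158

s_b0 = s * math.sqrt(sum(x * x for x in miles) / (n * ss_x))  # about 170.34
s_b1 = s / math.sqrt(ss_x)                                    # about 0.0497

t = 2.069  # t(0.025, 23), as quoted in the slides
print(f"b0 CI: [{b0 - t * s_b0:.2f}, {b0 + t * s_b0:.2f}]")    # about [-77.58, 627.28]
print(f"b1 CI: [{b1 - t * s_b1:.5f}, {b1 + t * s_b1:.5f}]")    # about [1.15246, 1.35820]
```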
[Template (partial output) that can be used to obtain confidence intervals for β₀ and β₁.]

10-5 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1:

  ρ = −1        indicates a perfect negative linear relationship
  −1 < ρ < 0    indicates a negative linear relationship
  ρ = 0         indicates no linear relationship
  0 < ρ < 1     indicates a positive linear relationship
  ρ = 1         indicates a perfect positive linear relationship

The absolute value of ρ indicates the strength or exactness of the relationship.

[Figure: Illustrations of Correlation. Six scatter panels for ρ = −1, ρ = 0, ρ = 1, ρ = −.8, ρ = 0, and ρ = .8.]

Covariance and Correlation
The covariance of two random variables X and Y:

  Cov(X, Y) = E[(X − μ_X)(Y − μ_Y)]

where μ_X and μ_Y are the population means of X and Y, respectively.

The population correlation coefficient:

  ρ = Cov(X, Y)/(σ_X σ_Y)

The sample correlation coefficient:

  r = SS_XY/√(SS_X · SS_Y)

Note: r and b₁ always have the same sign: if r < 0, then b₁ < 0; if r = 0, then b₁ = 0; if r > 0, then b₁ > 0.

Example 10-1:

  r = SS_XY/√(SS_X · SS_Y) = 51,402,852.4/√((40,947,557.84)(66,855,898))
    = 51,402,852.4/52,321,943.29 = 0.9824

Hypothesis Tests for the Correlation Coefficient

  H₀: ρ = 0   (no linear relationship)
  H₁: ρ ≠ 0   (some linear relationship)

Test statistic:

  t_(n−2) = r/√((1 − r²)/(n − 2))

Example 10-1:

  t = 0.9824/√((1 − 0.9651)/(25 − 2)) = 0.9824/0.0389 = 25.25

Since t_(0.005, 23) = 2.807 < 25.25, H₀ is rejected at the 1% level.

10-6 Hypothesis Tests about the Regression Relationship
[Figure: Three scatter panels in which no linear relationship exists: constant Y, unsystematic variation, and a nonlinear relationship.]

A hypothesis test for the existence of a linear relationship between X and Y:

  H₀: β₁ = 0
  H₁: β₁ ≠ 0

Test statistic for the existence of a linear relationship between X and Y:

  t_(n−2) = b₁/s(b₁)

where b₁ is the least-squares estimate of the regression slope and s(b₁) is the standard error of b₁. When the null hypothesis is true, the statistic has a t distribution with n − 2 degrees of freedom.

Hypothesis Tests for the Regression Slope
Example 10-1:

  H₀: β₁ = 0,  H₁: β₁ ≠ 0
  t = b₁/s(b₁) = 1.25533/0.04972 = 25.25

Since 25.25 > t_(0.005, 23) = 2.807, H₀ is rejected at the 1% level, and we may conclude that there is a relationship between charges and miles traveled.

Example 10-4:

  H₀: β₁ = 1,  H₁: β₁ ≠ 1
  t = (b₁ − 1)/s(b₁) = (1.24 − 1)/0.21 = 1.14

Since 1.14 < t_(0.05, 58) = 1.671, H₀ is not rejected at the 10% level. We may not conclude that the beta coefficient is different from 1.

10-7 How Good is the Regression?
The coefficient of determination, r², is a descriptive measure of the strength of the regression relationship, a measure of how well the regression line fits the data. Each observation's deviation from the mean decomposes as:

  (y − ȳ) = (y − ŷ) + (ŷ − ȳ)
  Total deviation = Unexplained deviation (error) + Explained deviation (regression)

Squaring and summing over all observations:

  Σ(y − ȳ)² = Σ(y − ŷ)² + Σ(ŷ − ȳ)²
  SST = SSE + SSR

[Figure: A point's total deviation from ȳ split into the explained deviation (from ȳ to the regression line) and the unexplained deviation (from the line to the point).]

  r² = SSR/SST = 1 − SSE/SST

r² is the percentage of total variation explained by the regression.
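A small sketch, using the Example 10-1 sums of squares from the slides, that reproduces the sample correlation and its t statistic:

```python
import math

# Sums of squares from Example 10-1 (as computed in the slides above).
ss_x, ss_y, ss_xy = 40_947_557.84, 66_855_898, 51_402_852.4
n = 25

r = ss_xy / math.sqrt(ss_x * ss_y)               # about 0.9824
t_stat = r / math.sqrt((1 - r ** 2) / (n - 2))   # about 25.25

print(f"r = {r:.4f}, t = {t_stat:.2f}")
# Compare with t(0.005, 23) = 2.807 from the slides: H0 (rho = 0) is
# rejected at the 1% level.
```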
The Coefficient of Determination
[Figure: Three panels illustrating r² = 0, r² = 0.50, and r² = 0.90 as the share of SST accounted for by SSR, together with a fitted-line plot of dollars against miles for Example 10-1.]

Example 10-1:

  r² = SSR/SST = 64,527,736.8/66,855,898 = 0.96518

10-8 Analysis-of-Variance Table and an F Test of the Regression Model

  Source of    Sum of    Degrees of   Mean      F Ratio
  Variation    Squares   Freedom      Square
  Regression   SSR       1            MSR       MSR/MSE
  Error        SSE       n − 2        MSE
  Total        SST       n − 1        MST

Example 10-1:

  Source of    Sum of         Degrees of   Mean           F Ratio   p-Value
  Variation    Squares        Freedom      Square
  Regression   64,527,736.8   1            64,527,736.8   637.47    0.000
  Error        2,328,161.2    23           101,224.4
  Total        66,855,898.0   24

[Template (partial output) that displays the Analysis of Variance and an F Test of the Regression Model.]

10-9 Residual Analysis and Checking for Model Inadequacies
[Figure: Four residual plots, against x (or ŷ) and against time:]
• Homoscedasticity: residuals appear completely random; no indication of model inadequacy.
• Heteroscedasticity: the variance of the residuals increases as x changes.
• Residuals exhibit a linear trend with time.
• A curved pattern in the residuals, resulting from an underlying nonlinear relationship.

Normal Probability Plots of the Residuals
[Figures: Normal probability plots of residuals for distributions that are flatter than normal, more peaked than normal, positively skewed, and negatively skewed.]

10-10 Use of the Regression Model for Prediction
• Point prediction: a single-valued estimate of Y for a given value of X, obtained by inserting the value of X into the estimated regression equation.
• Prediction interval:
  - For a value of Y given a value of X: reflects variation in the regression line estimate plus variation of points around the regression line.
  - For an average value of Y given a value of X: reflects variation in the regression line estimate only.

Errors in Predicting E[Y|X]
[Figure: Two panels showing (1) uncertainty about the slope of the regression line, bounded by upper and lower limits on the slope, and (2) uncertainty about the intercept, bounded by upper and lower limits on the intercept.]

Prediction Interval for E[Y|X]
[Figure: The prediction band for E[Y|X] around the regression line.]
• The prediction band for E[Y|X] is narrowest at the mean value of X.
• The prediction band widens as the distance from the mean of X increases.
• Predictions become very unreliable when we extrapolate beyond the range of the sample itself.

Additional Error in Predicting an Individual Value of Y
[Figure: (3) Variation around the regression line; the prediction band for an individual Y is wider than the prediction band for E[Y|X].]

Prediction Interval for a Value of Y
A (1 − α)100% prediction interval for Y:

  ŷ ± t_(α/2) · s · √(1 + 1/n + (x − x̄)²/SS_X)

Example 10-1 (X = 4,000):

  {274.85 + (1.2553)(4,000)} ± 2.069 · 318.16 · √(1 + 1/25 + (4,000 − 3,177.92)²/40,947,557.84)
  = 5,296.05 ± 676.62 = [4,619.43, 5,972.67]

Prediction Interval for the Average Value of Y
A (1 − α)100% prediction interval for E[Y|X]:

  ŷ ± t_(α/2) · s · √(1/n + (x − x̄)²/SS_X)

Example 10-1 (X = 4,000):

  {274.85 + (1.2553)(4,000)} ± 2.069 · 318.16 · √(1/25 + (4,000 − 3,177.92)²/40,947,557.84)
  = 5,296.05 ± 156.48 = [5,139.57, 5,452.53]

[Template output with prediction intervals.]

10-11 The Excel Solver Method for Regression
The Solver macro available in Excel can also be used to conduct a simple linear regression. See the text for instructions.
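Both prediction intervals at X = 4,000 can be reproduced with the quantities from Example 10-1; a minimal sketch:

```python
import math

# Quantities from Example 10-1 (as computed in the slides above).
n, b0, b1 = 25, 274.85, 1.2553
s, x_bar, ss_x = 318.16, 3177.92, 40_947_557.84
t = 2.069  # t(0.025, 23)

x_new = 4000
y_hat = b0 + b1 * x_new  # point prediction, about 5,296.05

# Half-widths: the interval for an individual Y includes the extra "1 +"
# term for scatter around the line; the interval for E[Y|X] does not.
extra = 1 / n + (x_new - x_bar) ** 2 / ss_x
half_y = t * s * math.sqrt(1 + extra)   # about 676.6
half_mean = t * s * math.sqrt(extra)    # about 156.5

print(f"individual Y: [{y_hat - half_y:.2f}, {y_hat + half_y:.2f}]")
print(f"mean E[Y|X]:  [{y_hat - half_mean:.2f}, {y_hat + half_mean:.2f}]")
```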
Using Minitab: Fitted-Line Plot for Regression
[Figure: Minitab fitted-line plot with fitted equation Y = −0.8465 + 1.352X, S = 0.184266, R-Sq = 95.2%, R-Sq(adj) = 94.8%.]
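For comparison, a rough Python equivalent of a Minitab fitted-line plot. The data here are simulated stand-ins (the slide does not show the raw data), so the numbers will not match the Minitab output exactly.

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulated (x, y) data standing in for the Minitab example's unseen data.
rng = np.random.default_rng(0)
x = np.linspace(5.5, 7.5, 20)
y = -0.8 + 1.35 * x + rng.normal(0, 0.2, x.size)

b1, b0 = np.polyfit(x, y, 1)
y_hat = b0 + b1 * x

n = x.size
sse = np.sum((y - y_hat) ** 2)
sst = np.sum((y - y.mean()) ** 2)
s = np.sqrt(sse / (n - 2))                         # Minitab's "S"
r_sq = 1 - sse / sst                               # "R-Sq"
r_sq_adj = 1 - (sse / (n - 2)) / (sst / (n - 1))   # "R-Sq(adj)"

plt.scatter(x, y)
plt.plot(x, y_hat)
plt.title(f"Y = {b0:.4f} + {b1:.3f} X   S={s:.4f}  "
          f"R-Sq={100*r_sq:.1f}%  R-Sq(adj)={100*r_sq_adj:.1f}%")
plt.xlabel("X")
plt.ylabel("Y")
plt.show()
```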