10-1 Chapter 6: Simple Linear Regression and Correlation

10-2 Introduction
Many problems in engineering and science involve exploring the relationship between two or more variables. Regression analysis is a statistical technique for modeling the relationship between variables. There are two types of regression model: the multiple regression model and the simple linear regression model. A regression model that contains more than one regressor (independent, or predictor) variable is called a multiple regression model. A simple linear regression model has only one independent variable, or regressor. Simple linear regression is used to model the relationship between two variables.
• One of the variables, denoted by Y, is called the dependent variable, and the other, denoted by X, is called the independent variable.
• The model we will use to depict the relationship between X and Y is a straight-line relationship.

10-3 Scatter diagram/plot
[Figure: scatterplot of Advertising Expenditures (X, horizontal axis, 0 to 50) against Sales (Y, vertical axis, 0 to 140).]
A scatter diagram is a graph on which each (x, y) pair is represented as a point plotted in a two-dimensional coordinate system. This scatter plot locates pairs of observations of advertising expenditures on the x-axis and sales on the y-axis. We notice that:
• Larger (smaller) values of sales tend to be associated with larger (smaller) values of advertising.
• The scatter of points tends to be distributed around a positively sloped straight line.
• The pairs of values of advertising expenditures and sales are not located exactly on a straight line. The scatter plot reveals a more or less strong tendency rather than a precise linear relationship.

10-4
Therefore, it is probably reasonable to assume that the mean of the random variable Y is related to x by a straight-line relationship. This line represents the nature of the relationship on average.

10-5 Examples of Other Scatterplots
[Figure: six scatterplots of Y against X illustrating different patterns of association.]

10-6 Simple Linear Regression Model
The mean of the random variable Y is related to x by the following straight-line relationship:

    E[Y \mid x] = \beta_0 + \beta_1 x

where the slope β₁ and the intercept β₀ of the line are called the regression coefficients. While the mean of Y is a linear function of x, the actual observed value y does not fall exactly on a straight line. The expected value of Y is assumed to be a linear function of x; the actual value of Y is determined by the mean value function (the linear model) plus a random error term, ε.

10-7
Therefore, the actual value of Y (the dependent variable) is determined by the following model:

    Y = \beta_0 + \beta_1 X + \varepsilon

where β₀ + β₁X is the nonrandom (systematic) component and ε is the random component. We will call this model the simple linear regression model, where:
• Y is the dependent variable, the variable we wish to explain or predict;
• X is the independent variable, also called the regressor or predictor variable;
• ε is the error term, the only random component in the model;
• β₀ is the intercept of the systematic component of the regression relationship;
• β₁ is the slope of the systematic component.
Note: the choice of the model is based on inspection of a scatter diagram.

10-8 Assumptions of the Simple Linear Regression Model
• The relationship between X and Y is a straight-line relationship.
• The errors εᵢ are normally distributed with mean 0 and variance σ², and the errors are uncorrelated (not related) in successive observations. That is, ε ~ N(0, σ²).
[Figure: identical normal distributions of the errors, all centered on the regression line E[Y] = β₀ + β₁x.]
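Before turning to estimation, the model and its error assumptions can be made concrete with a short simulation. The sketch below is not part of the original slides: the coefficient values β₀ = 5, β₁ = 2, σ = 1.5 and the grid of x values are made-up assumptions, chosen only to illustrate how observations scatter around the mean line.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical regression coefficients and error standard deviation
# (made-up values, used only to illustrate the model assumptions).
beta0, beta1, sigma = 5.0, 2.0, 1.5

# Regressor values and the mean response E[Y | x] = beta0 + beta1 * x.
x = np.linspace(0.0, 10.0, 50)
mean_y = beta0 + beta1 * x

# Observed responses: Y = beta0 + beta1 * x + eps, with errors drawn
# independently from N(0, sigma^2), exactly as the model assumes.
eps = rng.normal(loc=0.0, scale=sigma, size=x.size)
y = mean_y + eps

# The points (x, y) scatter around the straight line, which is the
# pattern a scatter diagram is used to detect.
print(y[:5])
```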
10-9 Estimation: The Method of Least Squares
Suppose that we have n pairs of observations (x₁, y₁), (x₂, y₂), …, (xₙ, yₙ) from the model

    Y = \beta_0 + \beta_1 X + \varepsilon

Estimation of a simple linear regression relationship involves finding estimates of the regression coefficients β₀ and β₁ of the linear regression line. The estimates of β₀ and β₁ should result in a line that is (in some sense) a "best fit" to the data. The estimated regression coefficients should be determined so that the sum of the squares of the vertical deviations of the points from the line is minimized. This criterion for estimating the regression coefficients is called the method of least squares.

10-10 Errors in Regression
[Figure: the fitted regression line Ŷ = β̂₀ + β̂₁X, with an observed data point yᵢ above the line; the vertical distance between the observed point and the predicted value ŷᵢ at x = xᵢ is the error eᵢ = yᵢ − ŷᵢ.]

10-11 Least Squares Regression
The sum of squared errors in regression is

    SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \quad \text{where } \hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i

The least squares regression line is the one that minimizes the SSE with respect to the estimates β̂₀ and β̂₁. The least squares estimators of β₀ and β₁, say β̂₀ and β̂₁, must therefore satisfy the two minimization conditions

    \frac{\partial\, SSE}{\partial \hat{\beta}_0} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i) = 0
    \frac{\partial\, SSE}{\partial \hat{\beta}_1} = -2\sum_{i=1}^{n}(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i)x_i = 0

10-12
Simplifying these two equations yields

    \sum_{i=1}^{n} y_i = n\hat{\beta}_0 + \hat{\beta}_1 \sum_{i=1}^{n} x_i
    \sum_{i=1}^{n} x_i y_i = \hat{\beta}_0 \sum_{i=1}^{n} x_i + \hat{\beta}_1 \sum_{i=1}^{n} x_i^2

These equations are called the least squares normal equations. The solution to the normal equations results in the least squares estimators β̂₀ and β̂₁.

10-13
Solving the normal equations gives

    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
    \hat{\beta}_1 = \frac{\sum x_i y_i - (\sum x_i)(\sum y_i)/n}{\sum x_i^2 - (\sum x_i)^2/n}

10-14
Notationally, it is occasionally convenient to give special symbols to the numerator and denominator of the previous equation:

    S_{xy} = \sum x_i y_i - \frac{(\sum x_i)(\sum y_i)}{n}, \quad S_{xx} = \sum x_i^2 - \frac{(\sum x_i)^2}{n}

Therefore,

    \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}}

10-15
The fitted or estimated regression line is therefore

    \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X

Note that each pair of observations satisfies the relationship

    y_i = \hat{\beta}_0 + \hat{\beta}_1 x_i + e_i

where eᵢ = yᵢ − ŷᵢ is called the residual. The residual describes the error in the fit of the model to the ith observation yᵢ.

10-16 Example

Miles    Dollars    Miles²       Miles × Dollars
1211     1802       1466521      2182222
1345     2405       1809025      3234725
1422     2005       2022084      2851110
1687     2511       2845969      4236057
1849     2332       3418801      4311868
2026     2305       4104676      4669930
2133     3016       4549689      6433128
2253     3385       5076009      7626405
2400     3090       5760000      7416000
2468     3694       6091024      9116792
2699     3371       7284601      9098329
2806     3998       7873636      11218388
3082     3555       9498724      10956510
3209     4692       10297681     15056628
3466     4244       12013156     14709704
3643     5298       13271449     19300614
3852     4801       14837904     18493452
4033     5147       16265089     20757852
4267     5738       18207288     24484046
4498     6420       20232004     28877160
4533     6059       20548088     27465448
4804     6426       23078416     30870504
5090     6321       25908100     32173890
5233     7026       27384288     36767056
5439     6964       29582720     37877196
79,448   106,605    293,426,946  390,185,014

    S_{xx} = \sum x^2 - \frac{(\sum x)^2}{n} = 293{,}426{,}946 - \frac{79{,}448^2}{25} = 40{,}947{,}557.84
    S_{xy} = \sum xy - \frac{(\sum x)(\sum y)}{n} = 390{,}185{,}014 - \frac{(79{,}448)(106{,}605)}{25} = 51{,}402{,}852.4
    \hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{51{,}402{,}852.4}{40{,}947{,}557.84} = 1.255333776 \approx 1.26
    \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \frac{106{,}605}{25} - (1.255333776)\frac{79{,}448}{25} = 274.85

10-17 Error Variance and the Standard Errors of Regression Estimators
Square and sum all regression errors to find SSE:

    SSE = \sum (y_i - \hat{y}_i)^2 = S_{yy} - \frac{(S_{xy})^2}{S_{xx}} = S_{yy} - \hat{\beta}_1 S_{xy}

The degrees of freedom in regression are df = n − 2. An unbiased estimator of σ², denoted by s², is

    MSE = \frac{SSE}{n - 2}
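These formulas are easy to check numerically. The following is a minimal Python sketch (not part of the original slides) that reproduces the Miles/Dollars computations from the column totals of the table above; its printed values should match the worked example that follows.

```python
# Column totals from the Miles/Dollars table (n = 25 observations).
n = 25
sum_x = 79_448          # sum of Miles
sum_y = 106_605         # sum of Dollars
sum_x2 = 293_426_946    # sum of Miles^2
sum_xy = 390_185_014    # sum of Miles * Dollars
S_yy = 66_855_898       # corrected sum of squares of y, given on the slide

# Corrected sums of squares and cross products.
S_xx = sum_x2 - sum_x ** 2 / n       # 40,947,557.84
S_xy = sum_xy - sum_x * sum_y / n    # 51,402,852.4

# Least squares estimates of the slope and intercept.
b1 = S_xy / S_xx                     # 1.255333776, about 1.26
b0 = sum_y / n - b1 * sum_x / n      # 274.85

# Error sum of squares, its mean square, and the standard error of regression.
SSE = S_yy - b1 * S_xy               # 2,328,161.2
MSE = SSE / (n - 2)                  # 101,224.4
s = MSE ** 0.5                       # 318.158

print(b1, b0, SSE, MSE, s)
```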
Example:

    SSE = S_{yy} - \hat{\beta}_1 S_{xy} = 66{,}855{,}898 - (1.255333776)(51{,}402{,}852.4) = 2{,}328{,}161.2
    MSE = \frac{SSE}{n - 2} = \frac{2{,}328{,}161.2}{23} = 101{,}224.4
    s = \sqrt{MSE} = \sqrt{101{,}224.4} = 318.158

10-18 Standard Errors of Estimates in Regression
The standard error of β̂₀ (intercept) is

    s(\hat{\beta}_0) = s\sqrt{\frac{\sum x^2}{n\, S_{xx}}}, \quad \text{where } s = \sqrt{MSE}

The standard error of β̂₁ (slope) is

    s(\hat{\beta}_1) = \frac{s}{\sqrt{S_{xx}}}

Example:

    s(\hat{\beta}_0) = 318.158\sqrt{\frac{293{,}426{,}946}{(25)(40{,}947{,}557.84)}} = 170.338
    s(\hat{\beta}_1) = \frac{318.158}{\sqrt{40{,}947{,}557.84}} = 0.04972

10-19 Confidence Intervals for the Regression Parameters
In addition to point estimates of the slope and intercept, it is possible to obtain confidence interval estimates of these parameters. The width of these confidence intervals is a measure of the overall quality of the regression line.

10-20 Example
95% confidence intervals:

    \hat{\beta}_0 \pm t_{0.025,\,25-2}\, s(\hat{\beta}_0) = 274.85 \pm (2.069)(170.338) = 274.85 \pm 352.43 = [-77.58,\ 627.28]
    \hat{\beta}_1 \pm t_{0.025,\,25-2}\, s(\hat{\beta}_1) = 1.25533 \pm (2.069)(0.04972) = 1.25533 \pm 0.10287 = [1.15246,\ 1.35820]

10-21 Prediction of New Observations
An important application of a regression model is predicting new or future observations of Y corresponding to a specified level of the regressor variable x. If x₀ is the value of the regressor variable of interest, then

    \hat{Y}_0 = \hat{\beta}_0 + \hat{\beta}_1 x_0

is the point estimator of the new or future value of the response Y₀.

10-22 Correlation
The correlation between two random variables, X and Y, is a measure of the degree of linear association between the two variables. The population correlation, denoted by ρ, can take on any value from −1 to 1:
• ρ = −1 indicates a perfect negative linear relationship;
• −1 < ρ < 0 indicates a negative linear relationship;
• ρ = 0 indicates no linear relationship;
• 0 < ρ < 1 indicates a positive linear relationship;
• ρ = 1 indicates a perfect positive linear relationship.
The absolute value of ρ indicates the strength or exactness of the relationship.

10-23 Illustrations of Correlation
[Figure: six scatterplots of Y against X illustrating ρ = −1, ρ = −0.8, ρ = 0 (two panels), ρ = 0.8, and ρ = 1.]

10-24 Covariance and Correlation
The covariance of two random variables X and Y is

    \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]

where μ_X and μ_Y are the population means of X and Y, respectively. The population correlation coefficient is

    \rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}

The sample correlation coefficient* is

    r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}

*Note: r has the same sign as the slope estimate β̂₁. If r < 0, then β̂₁ < 0; if r = 0, then β̂₁ = 0; if r > 0, then β̂₁ > 0.

Example:

    r = \frac{51{,}402{,}852.4}{\sqrt{(40{,}947{,}557.84)(66{,}855{,}898)}} = \frac{51{,}402{,}852.4}{52{,}321{,}943.29} = 0.9824
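Pulling the remaining pieces together, the sketch below (again, not from the slides) computes the standard errors, the 95% confidence intervals, a point prediction, and the sample correlation for the Miles/Dollars example. The prediction point x₀ = 4000 miles is a made-up value for illustration, and the multiplier 2.069 is the t(0.025, 23) quantile quoted on slide 10-20.

```python
# Quantities carried over from the Miles/Dollars example above.
n = 25
S_xx, S_yy, S_xy = 40_947_557.84, 66_855_898, 51_402_852.4
sum_x2 = 293_426_946
b0, b1, s = 274.85, 1.255333776, 318.158

# Standard errors of the intercept and slope estimates.
se_b0 = s * (sum_x2 / (n * S_xx)) ** 0.5   # 170.338
se_b1 = s / S_xx ** 0.5                    # 0.04972

# 95% confidence intervals, with t(0.025, n-2) = t(0.025, 23) = 2.069.
t = 2.069
ci_b0 = (b0 - t * se_b0, b0 + t * se_b0)   # roughly (-77.58, 627.28)
ci_b1 = (b1 - t * se_b1, b1 + t * se_b1)   # roughly (1.15246, 1.35820)

# Point prediction of a new observation at a hypothetical x0 = 4000 miles.
x0 = 4000
y0_hat = b0 + b1 * x0                      # roughly 5296 dollars

# Sample correlation coefficient; note it has the same sign as b1.
r = S_xy / (S_xx * S_yy) ** 0.5            # roughly 0.9824

print(ci_b0, ci_b1, y0_hat, r)
```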