Chapter 8. SIMPLE LINEAR REGRESSION ANALYSIS

Correlation Analysis (page 579)

Given: bivariate data = {(X1, Y1), (X2, Y2), …, (Xn, Yn)}

Note: Association in bivariate data means a systematic connection between changes in one variable and changes in the other. If both variables were measured on at least an ordinal scale, then the direction of the association can be described as either positive or negative. When an increase in one variable tends to be accompanied by an increase in the other, the variables are positively associated. On the other hand, when an increase in one variable tends to be accompanied by a decrease in the other, the variables are negatively associated.

Objective of correlation analysis: to measure the strength and direction of the linear association between two variables.

Scatter Diagram (page 579)

The first step in correlation analysis is to plot the individual pairs of observations on a two-dimensional graph called the scatter diagram. This helps you visualize the possible underlying linear relationship between the two variables.

Using Microsoft Excel:
Step 1. Highlight the data.
Step 2. Click Insert, then choose Scatter.

Linear Correlation Coefficient (page 580)

Definition 18.1. The linear correlation coefficient, denoted by ρ (the Greek letter rho), is a measure of the strength of the linear relationship existing between two variables, say X and Y, that is independent of their respective scales of measurement. It is defined as

\rho = \frac{Cov(X, Y)}{\sigma_X \sigma_Y}

where Cov(X, Y) is the covariance of X and Y, and σX and σY are their standard deviations.

Properties of ρ (page 580)

- The linear correlation coefficient can only assume values between -1 and 1, inclusive of the endpoints.
- The sign of ρ describes the direction of the linear relationship between X and Y. A positive value of ρ means that the line slopes upward to the right, so as X increases, the value of Y increases. A negative value of ρ means that the line slopes downward to the right, so as X increases, the value of Y decreases.
- If ρ = 0, then there is no linear correlation between X and Y. However, this does not mean a lack of association. It is possible to obtain a zero correlation even if the two variables are related, as long as their relationship is nonlinear, such as a quadratic relationship.
- When ρ is -1 or 1, there is a perfect linear relationship between X and Y and all the points (x, y) fall on a line whose slope is not equal to 0. (ρ is undefined when the slope is 0, since Var(Y) = 0 in this case.)
- A ρ that is close to 1 or -1 indicates a strong linear relationship. A strong linear relationship does not necessarily imply that X causes Y or that Y causes X. It is possible that a third variable caused the change in both X and Y, producing the observed relationship. This is an important point to remember not only when studying relationships, but also when comparing two populations, say by using a t-test. Unless we collected our data using a well-designed experiment in which we were able to randomize the treatments and substantially control the extraneous variables, we need the more complex causal models to study causality. Otherwise, we simply describe the observed relationship or the observed difference between means.
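As an alternative to the Excel steps above, the scatter diagram and a sample estimate of ρ can be produced in a few lines of Python. The sketch below is illustrative only: it assumes numpy and matplotlib are installed and uses made-up data that is not from the textbook.

```python
import numpy as np
import matplotlib.pyplot as plt

# Made-up illustrative data (not from the textbook examples)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

# Scatter diagram: the Python counterpart of Excel's Insert > Scatter
plt.scatter(x, y)
plt.xlabel("X")
plt.ylabel("Y")
plt.title("Scatter diagram")
plt.show()

# Sample linear correlation coefficient, a point estimate of rho
r = np.corrcoef(x, y)[0, 1]
print(f"r = {r:.4f}")
```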
Pearson Product Moment Correlation Coefficient (page 581)

Definition 18.2. The Pearson product moment correlation coefficient between X and Y, denoted by r, is defined as

r = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{\sqrt{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} Y_i^2 - \left(\sum_{i=1}^{n} Y_i\right)^2}} = \frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i - \bar{X})^2 \sum_{i=1}^{n}(Y_i - \bar{Y})^2}}

This is a point estimator of ρ.

Scatter Diagrams of Various Data Sets with Different Values of r (Figure 18.2)

Figure 18.2 (not reproduced here) shows scatter diagrams of four data sets with r = -1, r = 1, r = 0, and r = 0.87, each plotting Y against X.

Guide:
Strong correlation: around 0.6 to 1 or around -1 to -0.6
Medium correlation: around 0.3 to 0.6 or around -0.6 to -0.3
Weak correlation: around 0.1 to 0.3 or around -0.3 to -0.1

Remark
- If r = 1, then all the data points lie on a line whose slope is positive.
- If r = -1, then all the data points lie on a line whose slope is negative.
- If r = 0, we cannot conclude that all the data points lie on a line whose slope is 0 (a horizontal line).

Example (n = 5):

X     -4    -2    0    2    4    Sum:  0
Y      4     2    0    2    4    Sum: 12
XY   -16    -4    0    4   16    Sum:  0
X²    16     4    0    4   16    Sum: 40
Y²    16     4    0    4   16    Sum: 40

r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}} = \frac{(5)(0) - (0)(12)}{\sqrt{[(5)(40) - (0)^2][(5)(40) - (12)^2]}} = 0

(A scatter diagram of these five points shows a V-shaped pattern, Y = |X|: Y is completely determined by X, yet r = 0 because the relationship is not linear.)

Test of Hypothesis

Ho: ρ = 0 vs Ha: ρ ≠ 0

Test statistic: T = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

Critical region: |t| > t_{α/2}(v = n - 2)

Even if we are able to establish that there is a linear relationship between two variables, we still do not conclude that X causes Y. There may be a third variable that is correlated with both X and Y that is responsible for the apparent correlation.

Examples

Example 18.2 (page 581) and Example 18.3 (page 583).

Exercise 1 (page 584). Suppose a breeder of Thoroughbred horses wishes to determine whether a linear relationship exists between the gestation period and the length of life of a horse. The breeder collected the following data from various stables across the region.

Horse   Gestation Period (days)   Length of Life (years)
1       416                       24
2       280                       25.75
3       290                       20
4       309                       22
5       365                       20
6       356                       21.5
7       403                       23.5
8       300                       21.75
9       265                       21
10      400                       21

a.) Plot a scatter diagram of the data on the gestation period and the length of life of a horse. Does there appear to be a linear relationship between the variables?
b.) Compute the Pearson correlation coefficient between the gestation period and the length of life of a horse. What conclusion can you draw based on the value of the correlation coefficient? Does this support your observation in a.)?
c.) Test whether ρ is different from 0 using a 0.05 level of significance.

Computing for r

Horse    X      Y       X²        Y²         XY
1        416    24      173056    576        9984
2        280    25.75   78400     663.0625   7210
3        290    20      84100     400        5800
4        309    22      95481     484        6798
5        365    20      133225    400        7300
6        356    21.5    126736    462.25     7654
7        403    23.5    162409    552.25     9470.5
8        300    21.75   90000     473.0625   6525
9        265    21      70225     441        5565
10       400    21      160000    441        8400
Total    3384   220.5   1173632   4892.625   74706.5

r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}} = \frac{(10)(74706.5) - (3384)(220.5)}{\sqrt{[(10)(1173632) - (3384)^2][(10)(4892.625) - (220.5)^2]}} = 0.0956

Hypothesis Testing About ρ

Ho: ρ = 0 vs Ha: ρ ≠ 0 at α = 0.05.

Test statistic: T = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}

Decision rule: Reject Ho if |t| > t_{0.025}(v = 8); that is, reject Ho if t > 2.306 or t < -2.306.

Computed value of the test statistic: t = \frac{0.0956}{\sqrt{\dfrac{1 - (0.0956)^2}{10 - 2}}} = 0.2718

Do not reject Ho. There is insufficient evidence at the 0.05 level of significance to conclude that there is a linear relationship between gestation period and length of life of horses.
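As a cross-check on the hand computation for Exercise 1, the minimal sketch below (assuming numpy is available) reproduces r and the test statistic for the horse data; it is an illustration, not part of the textbook solution.

```python
import numpy as np

# Gestation period (days) and length of life (years) from Exercise 1
x = np.array([416, 280, 290, 309, 365, 356, 403, 300, 265, 400], dtype=float)
y = np.array([24, 25.75, 20, 22, 20, 21.5, 23.5, 21.75, 21, 21])
n = len(x)

# Pearson product moment correlation coefficient
num = n * np.sum(x * y) - np.sum(x) * np.sum(y)
den = np.sqrt((n * np.sum(x**2) - np.sum(x)**2) * (n * np.sum(y**2) - np.sum(y)**2))
r = num / den                              # about 0.0956

# Test statistic for Ho: rho = 0 vs Ha: rho != 0
t = r / np.sqrt((1 - r**2) / (n - 2))      # about 0.27

print(f"r = {r:.4f}, t = {t:.4f}")
# Compare |t| with the tabled critical value t_0.025(v = 8) = 2.306 quoted above;
# since |t| < 2.306, do not reject Ho.
```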
Simple Linear Regression Model (page 585)

Definition 18.3. The simple linear regression model is given by the equation

Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i

where
- Yi is the value of the response variable (continuous) for the ith element,
- Xi is the value of the explanatory variable (continuous) for the ith element,
- β0 is a regression coefficient that gives the Y-intercept of the regression line,
- β1 is a regression coefficient that gives the slope of the line,
- εi is the random error term for the ith element, where the εi's are independent and normally distributed with mean 0 and variance σ² for i = 1, 2, …, n,
- n is the number of elements.

Remarks (page 586)

The regression line (the blue line in the figure on page 586) is E(Y | X = x) = μY|x = β0 + β1x, since E(εi) = 0.

- The random error εi is the vertical gap between the ith observation and the regression line. εi is a random variable and we will never know its realized value because β0 and β1 are unknown.
- The random error term accounts for all other factors that affect the value of Y but cannot be explained by the linear relationship between X and Y. This includes all other variables related to Y as well as measurement errors.
- We require the εi's to be independent random variables. For any fixed value of X, these random variables are normally distributed. The mean of any εi is 0 and its variance is σ². (That is, we do not allow the variation of the εi's to differ across values of X.) Consequently, for a fixed value of X = x, Y ~ Normal(β0 + β1x, σ²).
- β0 is the mean of Y when X = 0. β1 is the change in the mean value of Y for every unit increase in the value of X.

Steps in Doing Simple Linear Regression Analysis (page 587)

Step 1. Obtain the equation that best fits the data.
Step 2. Evaluate the equation to determine the strength of the relationship for prediction and estimation.
Step 3. Determine if the assumptions on the error terms are satisfied.
Step 4. If the model fits the data adequately, use the equation for prediction and for describing the nature of the relationship between the variables.

Estimation Using the Method of Least Squares (page 588)

The estimated regression equation is given by Ŷ = b0 + b1X. We use this formula to compute the predicted value of Y for a given value of X. We also use it to compute the predicted value of the ith observation in the sample data: Ŷi = b0 + b1Xi.

The method of least squares derives the values of b0 and b1 that minimize

\sum_{i=1}^{n} \left( Y_i - (b_0 + b_1 X_i) \right)^2 = \sum_{i=1}^{n} e_i^2

Based on this criterion, the following formulas for b0, the estimate of β0, and b1, the estimate of β1, are obtained:

b_1 = \frac{n\sum_{i=1}^{n} X_i Y_i - \left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n\sum_{i=1}^{n} X_i^2 - \left(\sum_{i=1}^{n} X_i\right)^2} \quad\text{and}\quad b_0 = \bar{Y} - b_1 \bar{X}

Graphical Representation

- The random error term is the vertical gap between the ith observation and the true regression line μY|x = β0 + β1x: since μY|Xi = β0 + β1Xi, the error is εi = Yi - (β0 + β1Xi).
- The residual is the vertical gap between the ith observation and the estimated regression line Ŷ = b0 + b1X: since Ŷi = b0 + b1Xi, the residual is ei = Yi - (b0 + b1Xi).
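The least-squares formulas above translate directly into code. The following sketch defines a hypothetical helper, least_squares_fit, that returns b0 and b1 from paired data using those closed-form expressions; it assumes numpy is available and is not part of the textbook.

```python
import numpy as np

def least_squares_fit(x, y):
    """Return (b0, b1) for the fitted line Y-hat = b0 + b1*X using the
    closed-form least-squares formulas given above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# The residuals defined above are then e_i = Y_i - (b0 + b1 * X_i):
# b0, b1 = least_squares_fit(x, y)
# residuals = y - (b0 + b1 * x)
```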
Example

Example 18.5 (page 589); Example 11.12 (Mendenhall/Scheaffer). The data below represent a sample of mathematics achievement scores and calculus grades for 10 independently selected college freshmen. Plot the scatter diagram and use the method of least squares to fit a line to the given 10 points.

Student   Math Achievement Score   Calculus Grade
1         39                       65
2         43                       78
3         21                       52
4         64                       82
5         57                       92
6         47                       89
7         28                       73
8         75                       98
9         34                       56
10        52                       75

Computing for b0 and b1

Student   X     Y     X²      XY
1         39    65    1521    2535
2         43    78    1849    3354
3         21    52    441     1092
4         64    82    4096    5248
5         57    92    3249    5244
6         47    89    2209    4183
7         28    73    784     2044
8         75    98    5625    7350
9         34    56    1156    1904
10        52    75    2704    3900
Total     460   760   23634   36854

b_1 = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2} = \frac{(10)(36854) - (460)(760)}{(10)(23634) - (460)^2} = 0.766

b_0 = \bar{Y} - b_1 \bar{X} = 76 - (0.766)(46) = 40.78

Estimated regression equation: Ŷ = 40.78 + 0.766X

Using the Estimated Regression Equation

Estimated regression equation: Ŷ = 40.78 + 0.766X, where Y = calculus grade and X = math achievement score.

- If the model fits the data adequately, we can use this equation for prediction purposes and for describing the nature of the relationship between the variables.
- We can only predict the value of Y within the values of X in our data set, that is, for math achievement scores (X) from 21 to 75 only. For example, when X = 50, we predict the calculus grade to be Ŷ = 40.78 + (0.766)(50) = 79.06.
- b1 = 0.766 means that as the math achievement score (X) increases by 1 unit, the mean calculus grade (Y) is estimated to increase by 0.766.
- b0 = 40.78 has no meaningful interpretation because X = 0 is not within the range of values used in the estimation.

Confidence Interval Estimation (pages 590-592)

An estimator for σ²:

MSE = \frac{SSE}{n - 2} = \frac{\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2}{n - 2}

Confidence interval estimator for β1:

\left( b_1 - t_{\alpha/2}(v = n-2)\, S_{b_1},\; b_1 + t_{\alpha/2}(v = n-2)\, S_{b_1} \right) \quad\text{where}\quad S_{b_1} = \sqrt{\frac{MSE}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}}}

Confidence interval estimator for β0:

\left( b_0 - t_{\alpha/2}(v = n-2)\, S_{b_0},\; b_0 + t_{\alpha/2}(v = n-2)\, S_{b_0} \right) \quad\text{where}\quad S_{b_0} = \sqrt{\frac{MSE \sum_{i=1}^{n} X_i^2}{n\left(\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}\right)}}

Hypothesis Testing

To test if there is a significant linear relationship between Y and X:

Ho: β1 = 0 vs Ha: β1 ≠ 0

Test statistic: T = \frac{b_1}{S_{b_1}} where S_{b_1} = \sqrt{\frac{MSE}{\sum X_i^2 - (\sum X_i)^2 / n}}

Critical region: |t| > t_{α/2}(v = n - 2)

Example

Predicted values: Ŷ = 40.78 + 0.766X; residuals: e = Y - Ŷ.

Student   X     Y     X²      Predicted Ŷ   Residual e   Squared residual
1         39    65    1521    70.6411       -5.6411       31.8216
2         43    78    1849    73.7033        4.2967       18.4615
3         21    52    441     56.8610       -4.8610       23.6289
4         64    82    4096    89.7801       -7.7801       60.5302
5         57    92    3249    84.4212        7.5788       57.4385
6         47    89    2209    76.7656       12.2344      149.6815
7         28    73    784     62.2199       10.7801      116.2108
8         75    98    5625    98.2013       -0.2013        0.0405
9         34    56    1156    66.8133      -10.8133      116.9265
10        52    75    2704    80.5934       -5.5934       31.2858
Total     460   760   23634                        SSE = 606.025869

Ho: β1 = 0 vs Ha: β1 ≠ 0 at α = 0.05.

Test statistic:

T = \frac{b_1}{S_{b_1}} = \frac{0.766}{0.174985} = 4.375 \quad\text{where}\quad S_{b_1} = \sqrt{\frac{SSE/(n-2)}{\sum X_i^2 - (\sum X_i)^2/n}} = \sqrt{\frac{606.025869/8}{23634 - (460)^2/10}} = 0.174985

Critical region: |t| > t_{α/2}(v = n - 2); that is, reject Ho if t > 2.306 or t < -2.306.

Since t = 4.375 > 2.306, reject Ho. There is sufficient evidence at the 0.05 level of significance to conclude that there is a significant linear relationship between math achievement score and calculus grade.
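The sketch below (assuming numpy is available) reproduces the computations in this example, from b0 and b1 through SSE, S_b1, the t statistic, and a 95% confidence interval for β1 using the tabled value 2.306 quoted above; it is a cross-check, not part of the textbook solution.

```python
import numpy as np

# Math achievement scores (X) and calculus grades (Y) from Example 18.5
x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([65, 78, 52, 82, 92, 89, 73, 98, 56, 75], dtype=float)
n = len(x)

# Least-squares estimates (about b0 = 40.78 and b1 = 0.766)
b1 = (n * np.sum(x * y) - np.sum(x) * np.sum(y)) / (n * np.sum(x**2) - np.sum(x)**2)
b0 = y.mean() - b1 * x.mean()

# Residuals, SSE, MSE, and the standard error of b1 (about 0.175)
y_hat = b0 + b1 * x
sse = np.sum((y - y_hat) ** 2)             # about 606.03
mse = sse / (n - 2)
s_b1 = np.sqrt(mse / (np.sum(x**2) - np.sum(x)**2 / n))

# Test statistic for Ho: beta1 = 0 vs Ha: beta1 != 0 (about 4.375)
t = b1 / s_b1
print(f"b0 = {b0:.2f}, b1 = {b1:.3f}, SSE = {sse:.2f}, t = {t:.3f}")

# 95% confidence interval for beta1, using the tabled value t_0.025(v = 8) = 2.306
ci = (b1 - 2.306 * s_b1, b1 + 2.306 * s_b1)
print(f"95% CI for beta1: ({ci[0]:.3f}, {ci[1]:.3f})")
```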
Coefficient of Determination (page 593)

Definition 18.4. The coefficient of determination, denoted by R², is defined as the proportion of the variability in the observed values of the response variable that can be explained by the explanatory variable through their linear relationship.

Remarks

- We can use the coefficient of determination to assess the goodness-of-fit of the linear regression model.
- The realized value of the coefficient of determination is between 0 and 1. Usually this value is expressed as a percentage, so that we may interpret it as the percentage of the variation in the values of Y that is explained by the explanatory variable X through the model.
- If the model has perfect predictability then R² = 1. If the model has no predictive capability then R² = 0.

Relationship between r and b1

b_1 = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{n\sum X_i^2 - (\sum X_i)^2} \quad\text{and}\quad r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}}

Note:

r = b_1 \sqrt{\frac{n\sum X_i^2 - (\sum X_i)^2}{n\sum Y_i^2 - (\sum Y_i)^2}}

so r and b1 always have the same sign, and in simple linear regression R² = r².

Computing for R²

Student   X     Y     X²      Y²      XY
1         39    65    1521    4225    2535
2         43    78    1849    6084    3354
3         21    52    441     2704    1092
4         64    82    4096    6724    5248
5         57    92    3249    8464    5244
6         47    89    2209    7921    4183
7         28    73    784     5329    2044
8         75    98    5625    9604    7350
9         34    56    1156    3136    1904
10        52    75    2704    5625    3900
Total     460   760   23634   59816   36854

r = \frac{n\sum X_i Y_i - (\sum X_i)(\sum Y_i)}{\sqrt{[n\sum X_i^2 - (\sum X_i)^2][n\sum Y_i^2 - (\sum Y_i)^2]}} = \frac{(10)(36854) - (460)(760)}{\sqrt{[(10)(23634) - (460)^2][(10)(59816) - (760)^2]}} = 0.8398

R²(100%) = (0.8398)²(100%) = 70.52%

70.52% of the variability of the grades in calculus can be explained by the math achievement scores.
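The R² computation for Example 18.5 can be cross-checked with the short sketch below (assuming numpy is available); it also verifies the note above that r equals b1 times the square root of the ratio of the corrected sums of squares.

```python
import numpy as np

# Math achievement scores (X) and calculus grades (Y) from Example 18.5
x = np.array([39, 43, 21, 64, 57, 47, 28, 75, 34, 52], dtype=float)
y = np.array([65, 78, 52, 82, 92, 89, 73, 98, 56, 75], dtype=float)
n = len(x)

# Corrected sums of squares and cross products
sxy = n * np.sum(x * y) - np.sum(x) * np.sum(y)
sxx = n * np.sum(x**2) - np.sum(x)**2
syy = n * np.sum(y**2) - np.sum(y)**2

r = sxy / np.sqrt(sxx * syy)          # about 0.8398
b1 = sxy / sxx                        # slope estimate
r_from_b1 = b1 * np.sqrt(sxx / syy)   # equals r, illustrating the note above
r_squared = r ** 2                    # about 0.7052

print(f"r = {r:.4f}, R^2 = {r_squared:.4f} "
      f"({r_squared * 100:.2f}% of the variability in calculus grades "
      "is explained by math achievement scores)")
```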