Simple Linear Regression
• Linear regression model
• Prediction
• Limitation
• Correlation

Example: Computer Repair
A company markets and repairs small computers. How fast (Time) an electronic component (a computer Unit) can be repaired is very important to the efficiency of the company. The variables in this example are Time and Units.

Hmm… how long will it take me to repair this unit?
Goal: to predict the length of repair Time for a given number of computer Units.

Computer Repair Data
Units  Minutes    Units  Minutes
  1      23         6      97
  2      29         7     109
  3      49         8     119
  4      64         9     149
  4      74         9     145
  5      87        10     154
  6      96        10     166

Graphical Summary of Two Quantitative Variables
Scatterplot of the response variable against the explanatory variable.
• What is the overall (average) pattern?
• What is the direction of the pattern?
• How much do the data points vary from the overall (average) pattern?
• Are there any potential outliers?

Summary for Computer Repair Data
Scatterplot (Time vs. Units). Some simple conclusions:
• Time is linearly related to the number of computer Units.
• (The length of) Time increases as (the number of) Units increases.
• The data points are close to the line.
• No potential outliers.

Numerical Summary of Two Quantitative Variables
• Regression model
• Correlation

Linear Regression Model
Y: the response variable
X: the explanatory variable
Y = b0 + b1·X + error
[Figure: a line with intercept b0 and slope b1, the rise in Y per one-unit increase in X.]

Linear Regression Model
The regression line models the relationship between X and Y on average.

Prediction
Ŷ: the predicted value of Y for a given X value.
Regression equation: Ŷ = b̂0 + b̂1·X
E.g., how long will it take to repair 3 computer units?
Ŷ = 4.16 + 15.51·X, so for X = 3, Ŷ = 4.16 + 15.51×3 ≈ 50.69 minutes.

The Limitation of the Regression Equation
The regression equation cannot be used to predict Y for X values that lie (far) beyond the range in which the data were observed.
E.g., predicting WT (weight) from HT (height): Ŷ = −205 + 5·X.
Given a HT of 40", the regression equation gives a WT of −205 + 5×40 = −5 pounds!!

The Unpredicted Part
The value Y − Ŷ is the part the regression equation (model) cannot predict; it is called the "residual."
[Figure: the vertical distance between a data point and the regression line is its residual.]

Correlation between X and Y
X and Y might be related to each other in many ways: linear or curved.

Examples of Different Levels of Correlation
[Four scatterplots of y vs. x:]
• r = .98   strong linearity
• r = .71   moderate linearity
• r = −.09  nearly uncorrelated
• r = .00   curved (nonlinear) relationship

(Pearson) Correlation Coefficient of X and Y
• A measure of the strength of the LINEAR association between X and Y.
• The correlation coefficient of X and Y is
  r_xy = Σᵢ₌₁ⁿ (yᵢ − ȳ)(xᵢ − x̄) / √(S_yy · S_xx) = S_xy / √(S_yy · S_xx)

Correlation Coefficient of X and Y
• −1 ≤ r ≤ 1.
• The magnitude of r measures the strength of the linear association between X and Y.
• The sign of r indicates the direction of the association: "−" negative association, "+" positive association.

Correlation Coefficient
• If r is close to 0, the best line to fit the data points is nearly horizontal, and the value of X does not change our prediction of Y.
• If |r| is close to 1, a line fits the data points almost perfectly.

Goodness of Fit of the SLR Model
• For a single data point: the residual.
• For the whole dataset: R².
• R² (= r²) is the proportion of the variation in Y explained by (the variation in) X.
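To make the prediction slides concrete, here is a minimal Python sketch that fits the least-squares line to the computer repair data listed above and predicts the repair time for 3 units. The slides report the fitted equation Ŷ = 4.16 + 15.51·X but do not derive the estimators; the formulas b̂1 = S_xy/S_xx and b̂0 = ȳ − b̂1·x̄ used below are the standard least-squares estimates, and the variable and function names are illustrative choices, not part of the lecture.

```python
# Illustrative sketch: least-squares fit for the computer repair data.
# The 14 (Units, Minutes) pairs are taken from the "Computer Repair Data" slide.
units   = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

n = len(units)
x_bar = sum(units) / n          # mean number of units (6)
y_bar = sum(minutes) / n        # mean repair time (97.2143)

# Sums of squares and cross-products (the S_xx and S_xy quantities on the slides).
s_xx = sum((x - x_bar) ** 2 for x in units)
s_xy = sum((y - y_bar) * (x - x_bar) for x, y in zip(units, minutes))

# Standard least-squares estimates (not derived on the slides).
b1 = s_xy / s_xx                # slope: about 15.51 minutes per additional unit
b0 = y_bar - b1 * x_bar         # intercept: about 4.16 minutes

def predict(x):
    """Predicted repair time (minutes) for x units; sensible only for x within 1..10."""
    return b0 + b1 * x

print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} x")
print(f"prediction for 3 units: {predict(3):.2f} minutes")

# Residual for the observed point (3 units, 49 minutes): the part the line cannot predict.
print(f"residual at x = 3: {49 - predict(3):.2f} minutes")
```

The restriction in `predict` mirrors the extrapolation warning: outside the observed range of 1 to 10 units, the equation can give meaningless answers, just as the height-weight example produces a negative weight.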
Table for Computing the Mean, St. Deviation, and Corr. Coef.

 i       yᵢ     yᵢ − ȳ    (yᵢ − ȳ)²    xᵢ     xᵢ − x̄    (xᵢ − x̄)²    (yᵢ − ȳ)(xᵢ − x̄)
 1       y₁     y₁ − ȳ    (y₁ − ȳ)²    x₁     x₁ − x̄    (x₁ − x̄)²    (y₁ − ȳ)(x₁ − x̄)
 2       y₂     y₂ − ȳ    (y₂ − ȳ)²    x₂     x₂ − x̄    (x₂ − x̄)²    (y₂ − ȳ)(x₂ − x̄)
 …       …      …         …            …      …         …            …
 n       yₙ     yₙ − ȳ    (yₙ − ȳ)²    xₙ     xₙ − x̄    (xₙ − x̄)²    (yₙ − ȳ)(xₙ − x̄)
Total    Σyᵢ    0         S_yy         Σxᵢ    0         S_xx         S_xy

From the totals: ȳ = Σyᵢ/n, x̄ = Σxᵢ/n, and r_xy = S_xy / √(S_yy · S_xx).

Example: Computer Repair Time
Σ yᵢ = 1361, n = 14, ȳ = 1361/14 = 97.2143
S_yy = Σ (yᵢ − ȳ)² = 27768.35
Σ xᵢ = 84, x̄ = 84/14 = 6
S_xx = Σ (xᵢ − x̄)² = 114
S_xy = Σ (yᵢ − ȳ)(xᵢ − x̄) = 1768
r_xy = S_xy / √(S_yy · S_xx) = .9937

Exercise
(1) Fill in the following table, then compute the mean and st. deviation of Y and X.
(2) Compute the corr. coef. of Y and X.
(3) Draw a scatterplot.

 i       xᵢ     xᵢ − x̄    (xᵢ − x̄)²    yᵢ     yᵢ − ȳ    (yᵢ − ȳ)²    (yᵢ − ȳ)(xᵢ − x̄)
 1       −.3    −.3       .09          .1     −.9       .81          .27
 2       −.2    −.2       .04          .4     −.6       .36          .12
 3       −.1              .01          .7
 4       .1               .01          1.2    .2
 5       .2               .04          1.6    .6
 6       .3     .3        .09          2.0
Total    0      *         *            6.0    *         *            *
(blank cells and entries marked * are to be filled in)

The Influence of Outliers
[Scatterplot of Y3 vs. X3 with one outlying point:]
• The slope becomes bigger (it is pulled toward the outlier).
• The value of r becomes smaller (less linear).

The Influence of Outliers
[Scatterplot of y vs. x with outlying points:]
• The slope becomes clearer (pulled toward the outliers).
• The value of |r| becomes larger (more linear: 0.159 → 0.935).
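For readers who want to verify the "Example: Computer Repair Time" numbers, this Python sketch builds the columns of the computation table (deviations, squared deviations, cross-products) and reports the means, standard deviations, S_yy, S_xx, S_xy, and r_xy. The sample standard deviation with an n − 1 denominator is the usual definition but is not spelled out on the slides, so treat that part as an assumption; the names `data`, `dev_x`, and so on are illustrative.

```python
import math

# (Units, Minutes) pairs from the "Computer Repair Data" slide.
data = [(1, 23), (2, 29), (3, 49), (4, 64), (4, 74), (5, 87), (6, 96),
        (6, 97), (7, 109), (8, 119), (9, 149), (9, 145), (10, 154), (10, 166)]
xs = [x for x, _ in data]
ys = [y for _, y in data]

n = len(data)
x_bar = sum(xs) / n                         # 6
y_bar = sum(ys) / n                         # 97.2143

# Columns of the computation table.
dev_x = [x - x_bar for x in xs]             # x_i - x_bar (sums to 0)
dev_y = [y - y_bar for y in ys]             # y_i - y_bar (sums to 0)
s_xx  = sum(d ** 2 for d in dev_x)          # 114
s_yy  = sum(d ** 2 for d in dev_y)          # 27768.36 (slide shows 27768.35)
s_xy  = sum(dx * dy for dx, dy in zip(dev_x, dev_y))   # 1768

# Sample standard deviations (usual n - 1 denominator; assumed, not shown on the slides).
s_x = math.sqrt(s_xx / (n - 1))
s_y = math.sqrt(s_yy / (n - 1))

# Pearson correlation coefficient, as defined on the slides.
r_xy = s_xy / math.sqrt(s_xx * s_yy)        # about 0.9937

print(f"x_bar = {x_bar:.4f}, y_bar = {y_bar:.4f}")
print(f"S_xx = {s_xx:.2f}, S_yy = {s_yy:.2f}, S_xy = {s_xy:.2f}")
print(f"s_x = {s_x:.3f}, s_y = {s_y:.3f}, r_xy = {r_xy:.4f}, R^2 = {r_xy ** 2:.4f}")
```

The same loop can be pointed at the exercise data to fill in that table once the blank cells have been worked out by hand.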
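The last two slides describe how a single outlier can pull the fitted slope toward it and change r. The sketch below illustrates the first case with a made-up outlying point that is not from the slides: it recomputes the slope and r for the computer repair data after adding one point far above the fitted line. The direction of the change (slope pulled up, |r| reduced) matches the first outlier slide; the exact numbers depend entirely on the hypothetical point chosen here.

```python
import math

def slope_and_r(xs, ys):
    """Least-squares slope and Pearson r, using the same formulas as the sketches above."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return s_xy / s_xx, s_xy / math.sqrt(s_xx * s_yy)

units   = [1, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 10, 10]
minutes = [23, 29, 49, 64, 74, 87, 96, 97, 109, 119, 149, 145, 154, 166]

print("without outlier:", slope_and_r(units, minutes))             # slope ~15.51, r ~0.99

# Hypothetical outlier far above the fitted line (invented for illustration).
print("with outlier   :", slope_and_r(units + [10], minutes + [300]))
# The slope is pulled upward toward the outlier and r drops (to roughly 0.86 with
# this particular point), echoing "slope toward the outlier, r becomes smaller".
```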