Statistics for Engineers (MA 223), Spring Quarter, 1999-2000

WS 6 – Linear Regression

Most of you have already seen linear regression in action, e.g. fitting a line to some data points. We are going to do the same thing, but we will be making some assumptions about the data and also getting some statistics to determine how well the line fits the data.

We assume that for each value of the independent variable (also called the regressor) $x$, the corresponding $y$'s (called the response) will satisfy
$$E(Y \mid x) = \beta_0 + \beta_1 x \qquad \text{or} \qquad Y = \beta_0 + \beta_1 x + \epsilon.$$
This means that for a given $x$ there will be lots of $y$'s (this is not unreasonable in an experimental setting). However, in practice you may have only one $y$ for a given $x$, but you may be varying the $x$ and thus getting lots of data points (with different $x$'s). Since we theoretically are getting lots of $y$'s for a given $x$, we can ask for the expected value (and also the distribution) of the $y$'s. The above equations tell us that the expected values of the $y$'s fall along a line (see Fig. 10-2 on page 433). The $\epsilon$ in the above formula represents a random error, which we will assume to be normally distributed $N(0, \sigma)$.

Important: in an experiment we usually use a number of different $x$'s, say $x_1, x_2, \ldots, x_n$, so when we get responses, rather than being extremely lucky and getting $E(Y \mid x)$, we will simply get some $y$ from the distribution of $y$'s for that particular $x$. Thus we would not expect our data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$ to lie on the line of averages $y = \beta_0 + \beta_1 x$. So what you do in linear regression is use the data points that you have (i.e. from the experiment) to find a line $y = \hat{\beta}_0 + \hat{\beta}_1 x$ which approximates the theoretical line $y = \beta_0 + \beta_1 x$.

Now when you see expressions like $y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ (formula 10-3, page 434), you should understand what they mean. Even though we are using the data point $(x_i, y_i)$ (i.e. from experimentation, when we let $x = x_i$ we get the response $y_i$), we know that if we were to experiment again with $x_i$, we would probably get another $y$, since we have a random error $\epsilon_i$. In our discussion we will be assuming that the random errors $\epsilon_i$ all have the same distribution, namely the normal distribution $N(0, \sigma)$.

Question 1 (10 points) on Quiz 6. Before 5 pm on Sunday, May 14, email me the regression line which you find in Minitab when you use the 6 $x_i$'s 0.5, 1.2, 2.4, 3.1, 4, 4.6 and corresponding $y$'s drawn from the normal distributions $y = 2x + 1 + \epsilon$, where $\epsilon \sim N(0, 0.3)$ (note that with this notation I am using $\sigma = 0.3$). So, for example, $y_1$ should be chosen from the normal distribution $N(2(0.5) + 1,\; 0.3) = N(2, 0.3)$. Do this for each of the $x_i$'s and then get the regression line for your data points from Minitab. Notice that each student will have a different regression line. Record your regression line in your notes so we can use it to find some confidence intervals later on. (A sketch of this simulation in Python appears after the notation below.)

Some Notation: We have $n$ data points $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
$$\bar{x} = \frac{\sum x_i}{n}, \qquad \bar{y} = \frac{\sum y_i}{n}$$
and
$$S_{xx} = \sum (x_i - \bar{x})^2 = \sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}$$
$$S_{xy} = \sum (x_i - \bar{x})(y_i - \bar{y}) = \sum x_i y_i - \frac{\left(\sum x_i\right)\left(\sum y_i\right)}{n}$$
$$S_{yy} = \sum (y_i - \bar{y})^2 = \sum y_i^2 - \frac{\left(\sum y_i\right)^2}{n} \qquad \longleftarrow \text{Error in book (Formula 10-11, p. 436).}$$

Regression Line: $y = \hat{\beta}_0 + \hat{\beta}_1 x$, where $\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}}$ and $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$. (Note that this says the "average" data point $(\bar{x}, \bar{y})$ is on the regression line.) The notation $\hat{y}_i$ will represent the $y$-value on the regression line corresponding to the $x$-value $x_i$, i.e. $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$.
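The quiz asks for Minitab, but if you want to double-check your regression line, the same simulation and least-squares fit can be done in a short Python script. This is a minimal sketch under my own choices (numpy, an unseeded random generator so that each run, like each student, gets different errors); it is not a substitute for the Minitab output you are asked to email.

```python
import numpy as np

rng = np.random.default_rng()  # unseeded: each run draws fresh random errors

# The six regressor values from Quiz 6
x = np.array([0.5, 1.2, 2.4, 3.1, 4.0, 4.6])

# Draw each y_i from N(2*x_i + 1, sigma = 0.3), i.e. y = 2x + 1 + eps
sigma = 0.3
y = 2 * x + 1 + rng.normal(0, sigma, size=x.size)

n = x.size
x_bar, y_bar = x.mean(), y.mean()

# S_xx and S_xy exactly as defined in the notation above
S_xx = np.sum((x - x_bar) ** 2)
S_xy = np.sum((x - x_bar) * (y - y_bar))

b1 = S_xy / S_xx         # slope estimate, beta_1 hat = S_xy / S_xx
b0 = y_bar - b1 * x_bar  # intercept estimate, beta_0 hat = y_bar - b1 * x_bar

print(f"regression line: y = {b0:.3f} + {b1:.3f} x")
```

Since the "average" data point $(\bar{x}, \bar{y})$ lies on the fitted line, one quick sanity check is that `b0 + b1 * x_bar` equals `y_bar`, and the estimates should land near the true line $y = 1 + 2x$.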
$$SSE = \sum (y_i - \hat{y}_i)^2 \qquad \text{called the error sum of squares}$$
$$SSR = \sum (\hat{y}_i - \bar{y})^2 \qquad \text{called the regression sum of squares}$$
$$SST = \sum (y_i - \bar{y})^2 \qquad \text{called the total sum of squares (same as } S_{yy} \text{ above)}$$

Some Relationships:
$$SST = SSR + SSE$$
$$SSE = SST - \hat{\beta}_1 S_{xy}$$
$$SSR = \hat{\beta}_1 S_{xy}$$
$$r^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST} \qquad \text{called the coefficient of determination}$$
$$r = \pm\sqrt{r^2} \qquad \text{with the sign chosen to be the same as the sign of } \hat{\beta}_1$$
$$s^2 = \frac{SSE}{n-2} \text{ is an approximation to } \sigma^2 \text{ (i.e. } E\!\left(\tfrac{SSE}{n-2}\right) = \sigma^2 \text{)}.$$
Note that in the text $s^2$ is denoted by $\hat{\sigma}^2$, p. 445.

Some Statistics:
$$E(\hat{\beta}_1) = \beta_1, \qquad Var(\hat{\beta}_1) = \frac{\sigma^2}{S_{xx}},$$
and $\hat{\beta}_1$ is normally distributed (i.e. $N\!\left(\beta_1, \tfrac{\sigma}{\sqrt{S_{xx}}}\right)$).

Therefore, if we use $s$ (recall $s = \sqrt{SSE/(n-2)}$) as an approximation to $\sigma$, we can test hypotheses using
$$t = \frac{\hat{\beta}_1 - \beta_1}{s / \sqrt{S_{xx}}}$$
with $n - 2$ degrees of freedom. Also, a $(1-\alpha)100\%$ CI for $\beta_1$ is
$$\hat{\beta}_1 \pm t_{n-2,\,\alpha/2} \, \frac{s}{\sqrt{S_{xx}}}.$$

$$E(SSR) = \sigma^2 + \beta_1^2 S_{xx}$$
Thus under a hypothesis of the form $H_0\colon \beta_1 = 0$, it would follow that $E(SSR) = \sigma^2$.

Notice that under a hypothesis of the form $\beta_1 = 0$, we have two approximations of $\sigma^2$: from the previous page we have $E\!\left(\tfrac{SSE}{n-2}\right) = \sigma^2$ (this is true without the hypothesis), and above we have $E(SSR) = \sigma^2$. Therefore, as in the ANOVA, the quotient of these two approximations follows an $F$ distribution (here with 1 and $n-2$ degrees of freedom, since $SSR$ has a single degree of freedom). So if the null hypothesis is $H_0\colon \beta_1 = 0$ and the $F$ value
$$F = \frac{SSR}{SSE/(n-2)}$$
is "too far from 1," we can reject the null hypothesis and conclude that we have enough evidence to say that $\beta_1 \neq 0$, and thus that there is a linear relationship between $x$ and $y$.

These results can be summarized in an ANOVA table (see pages 450-452). See Table 10-2 on page 438 for an example of Minitab output.
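The sums of squares and the inference procedures above can likewise be checked numerically. Below is a minimal sketch in Python, again my own illustration rather than the book's method: numpy and scipy are assumptions, the seed 223 is an arbitrary choice made only so the run is reproducible, and the variable names mirror the handout's notation. It computes $r^2$, $s^2$, the 95% confidence interval for $\beta_1$, and the $F$ statistic for $H_0\colon \beta_1 = 0$ directly from the formulas above.

```python
import numpy as np
from scipy import stats

# Recreate the Quiz 6 data and fit from the earlier sketch
# (seed 223 is arbitrary, chosen only for reproducibility).
rng = np.random.default_rng(223)
x = np.array([0.5, 1.2, 2.4, 3.1, 4.0, 4.6])
y = 2 * x + 1 + rng.normal(0, 0.3, size=x.size)
n = x.size

S_xx = np.sum((x - x.mean()) ** 2)
S_xy = np.sum((x - x.mean()) * (y - y.mean()))
b1 = S_xy / S_xx
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x                    # fitted values on the regression line

# Sums of squares, exactly as defined above
SSE = np.sum((y - y_hat) ** 2)         # error sum of squares
SSR = np.sum((y_hat - y.mean()) ** 2)  # regression sum of squares
SST = np.sum((y - y.mean()) ** 2)      # total sum of squares (= S_yy)
assert np.isclose(SST, SSR + SSE)      # the relationship SST = SSR + SSE

r2 = SSR / SST                         # coefficient of determination
s2 = SSE / (n - 2)                     # approximation to sigma^2
s = np.sqrt(s2)

# (1 - alpha)100% CI for beta_1:  b1 +/- t_{n-2, alpha/2} * s / sqrt(S_xx)
alpha = 0.05
t_crit = stats.t.ppf(1 - alpha / 2, df=n - 2)
half = t_crit * s / np.sqrt(S_xx)

# F test of H0: beta_1 = 0, with 1 and n-2 degrees of freedom
F = SSR / (SSE / (n - 2))
p_value = stats.f.sf(F, 1, n - 2)

print(f"r^2 = {r2:.3f},  s^2 = {s2:.4f}")
print(f"95% CI for beta_1: ({b1 - half:.3f}, {b1 + half:.3f})")
print(f"F = {F:.1f},  p = {p_value:.2g}")
```

Because the data are generated with a true slope $\beta_1 = 2$ and a small $\sigma$, the confidence interval should comfortably exclude 0 and the $F$ value should be far from 1, matching the conclusion that there is a linear relationship between $x$ and $y$.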