3. test statistic:

   Z = (x̄1 − x̄2 − 0) / √( s1²/n1 + s2²/n2 )

   Under the null hypothesis Z has a standard normal distribution; we will consider large negative values of Z as evidence against H0.

4. computation: z = (1205 − 1400 − 0) / √( 1000²/30 + 900²/40 ) = −0.84

5. p-value: P(Z < −0.84) = 0.2005

   This is not a very small value; we therefore have only very weak evidence against H0.

Example 6.3.5 queueing systems

Two very complicated queueing systems: we would like to know whether there is a difference in the large-t probabilities of a server being available. We run simulations for each system (each run with a different random seed) and check whether at time t = 2000 a server is available:

                                              System 1         System 2
  number of runs                              n1 = 1000        n2 = 500
  runs with a server available at t = 2000    551              303
  sample proportion                           p̂1 = 551/1000    p̂2 = 303/500

How strong is the evidence of a difference between the t = 2000 availability of a server for the two systems?

1. null hypothesis: H0 : p1 = p2 (p1 − p2 = 0)

2. alternative hypothesis: Ha : p1 ≠ p2 (p1 − p2 ≠ 0)

3. Preliminary: note that, if there were no difference between the two systems, a plausible pooled estimate of the availability of a server would be

   p̂ = (n1 p̂1 + n2 p̂2) / (n1 + n2) = (551 + 303) / (1000 + 500) = 0.569

   A test statistic is:

   Z = (p̂1 − p̂2 − 0) / √( p̂(1 − p̂) · (1/n1 + 1/n2) )

   Under the null hypothesis Z has a standard normal distribution; we will consider large values of |Z| as evidence against H0.

4. computation: z = (0.551 − 0.606) / ( √(0.569 · (1 − 0.569)) · √(1/1000 + 1/500) ) = −2.03

5. p-value: P(|Z| > 2.03) = 0.04

   This is fairly strong evidence of a real difference in the t = 2000 availabilities of a server between the two systems.
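The following Python sketch mirrors this pooled two-proportion z test; it is a minimal illustration, and the function name, variable names, and the use of the error function for the normal CDF are my own choices, not part of the text.

```python
# Minimal sketch of the pooled two-proportion z test from Example 6.3.5
# (standard library only; names are illustrative choices).
from math import erf, sqrt

def phi(z):
    # standard normal CDF via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def two_proportion_z_test(x1, n1, x2, n2):
    p1_hat, p2_hat = x1 / n1, x2 / n2
    p_hat = (x1 + x2) / (n1 + n2)            # pooled estimate under H0: p1 = p2
    se = sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
    z = (p1_hat - p2_hat) / se
    p_value = 2 * (1 - phi(abs(z)))          # two-sided alternative
    return z, p_value

z, p = two_proportion_z_test(551, 1000, 303, 500)
print(f"z = {z:.2f}, p-value = {p:.3f}")     # roughly z = -2.03, p-value ≈ 0.04
```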
6.4 Regression

A statistical investigation only rarely focuses on the distribution of a single variable. We are often interested in comparisons among several variables, in changes in a variable over time, or in relationships among several variables. The idea of regression is that we have a vector X1, …, Xk and try to approximate the behavior of Y by finding a function g(X1, …, Xk) such that Y ≈ g(X1, …, Xk). The simplest possible version is:

6.4.1 Simple Linear Regression (SLR)

Situation: k = 1 and Y is approximately linearly related to X, i.e. g(x) = b0 + b1 x.

Notes:

• A scatterplot of Y vs X should show the linear relationship.

• The linear relationship may hold only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables: e.g. if y ≈ c x^b, this is nonlinear in x, but it implies that

   ln y ≈ b · ln x + ln c,

  so with y′ := ln y and x′ := ln x the relationship is linear; on a log scale for both the x- and the y-axis one gets a straight line.

Example 6.4.1 Mileage vs Weight

Measurements on 38 1978-79 model automobiles. Gas mileage in miles per gallon as measured by Consumers' Union on a test track; weight as reported by the automobile manufacturer. A scatterplot of MPG versus Weight shows an inverse (roughly proportional to 1/weight) relationship:

[Figure: scatterplot of MPG versus Weight]

Transforming weight by x ↦ 1/x, i.e. to weight⁻¹, a scatterplot of MPG versus 1/Weight reveals a linear relationship:

[Figure: scatterplot of MPG versus 1/Weight]

Example 6.4.2 Olympics - long jump

The results of the long jump for all Olympic games between 1900 and 1996 are (no games were held in 1916, 1940, and 1944):

  year   long jump (in m)     year   long jump (in m)
  1900   7.19                 1960   8.12
  1904   7.34                 1964   8.07
  1908   7.48                 1968   8.90
  1912   7.60                 1972   8.24
  1920   7.15                 1976   8.34
  1924   7.45                 1980   8.54
  1928   7.74                 1984   8.54
  1932   7.64                 1988   8.72
  1936   8.06                 1992   8.67
  1948   7.82                 1996   8.50
  1952   7.57
  1956   7.83

A scatterplot of long jump versus year shows:

[Figure: scatterplot of long jump (in m) versus year, coded as years since 1900]

The plot shows that it is perhaps reasonable to say that y ≈ β0 + β1 x.

The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1 x, how do we derive empirical values of β0, β1 from n data points (x, y)? The standard answer is the "least squares" principle:

[Figure: scatterplot with a candidate line y = b0 + b1 x and the vertical distances from the points to the line]

In comparing lines that might be drawn through the plot we look at

   Q(b0, b1) = Σ (yi − (b0 + b1 xi))²        (all sums run over i = 1, …, n)

i.e. we look at the sum of squared vertical distances from the points to the line and attempt to minimize this sum of squares:

   d/db0 Q(b0, b1) = −2 Σ (yi − (b0 + b1 xi))
   d/db1 Q(b0, b1) = −2 Σ xi (yi − (b0 + b1 xi))

Setting the derivatives to zero gives the normal equations:

   n b0 + b1 Σ xi = Σ yi
   b0 Σ xi + b1 Σ xi² = Σ xi yi

The least squares solutions for b0 and b1 are:

   b1 = Σ (xi − x̄)(yi − ȳ) / Σ (xi − x̄)²
      = ( Σ xi yi − (1/n) Σ xi · Σ yi ) / ( Σ xi² − (1/n)(Σ xi)² )        (slope)

   b0 = ȳ − b1 x̄ = (1/n) Σ yi − b1 · (1/n) Σ xi        (y-intercept at x = 0)

These solutions produce the "best fitting line".

Example 6.4.3 Olympics - long jump, continued

X := year − 1900 (so x = 0, 4, …, 96), Y := long jump (in m), n = 22:

   Σ xi = 1100,  Σ xi² = 74608,  Σ yi = 175.518,  Σ yi² = 1406.109,  Σ xi yi = 9079.584

The parameters of the best fitting line are:

   b1 = (9079.584 − 1100 · 175.518/22) / (74608 − 1100²/22) = 0.0155   (in m per year)
   b0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037   (in m)

The regression equation is: long jump = 7.204 + 0.0155 · (year − 1900)  (in m).

It is useful, in addition, to be able to judge how well the line describes the data, i.e. how "linear looking" a plot really is. There are a couple of ways of doing this:

6.4.1.1 The sample correlation r

This is the sample analogue of the theoretical correlation ρ that we would compute if we had the random variables X and Y and their joint distribution:

   r := Σ (xi − x̄)(yi − ȳ) / √( Σ (xi − x̄)² · Σ (yi − ȳ)² )
      = ( Σ xi yi − (1/n) Σ xi · Σ yi ) / √( (Σ xi² − (1/n)(Σ xi)²) · (Σ yi² − (1/n)(Σ yi)²) )

The numerator is the numerator of b1, and one factor under the root in the denominator is the denominator of b1. Because of its connection to ρ, the sample correlation r satisfies (this is not obvious to see, and we won't prove it):

• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1.

Example 6.4.4 Olympics - long jump, continued

   r = (9079.584 − 1100 · 175.518/22) / √( (74608 − 1100²/22) · (1406.109 − 175.518²/22) ) = 0.8997
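As a cross-check, the least squares and correlation formulas can be evaluated directly from the five sums above. The following Python sketch does exactly that; the helper names (sxx, sxy, …) are my own shorthand, not notation from the text.

```python
# Sketch reproducing Examples 6.4.3 and 6.4.4 from the summary sums alone.
from math import sqrt

n = 22
sum_x, sum_x2 = 1100, 74608              # x = year - 1900
sum_y, sum_y2 = 175.518, 1406.109        # y = long jump in m
sum_xy = 9079.584

sxx = sum_x2 - sum_x**2 / n              # sum of (x - x_bar)^2
syy = sum_y2 - sum_y**2 / n              # sum of (y - y_bar)^2
sxy = sum_xy - sum_x * sum_y / n         # sum of (x - x_bar)(y - y_bar)

b1 = sxy / sxx                           # slope, approx. 0.0155 m per year
b0 = sum_y / n - b1 * sum_x / n          # intercept, approx. 7.204 m
r = sxy / sqrt(sxx * syy)                # sample correlation, approx. 0.8997

print(f"b1 = {b1:.4f}, b0 = {b0:.4f}, r = {r:.4f}")
```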
A second measure for the goodness of fit is:

6.4.1.2 The coefficient of determination R²

This is based on a comparison of the "variation accounted for" by the line versus the "raw variation" of y. The idea is that

   SST = Σ (yi − ȳ)² = Σ yi² − (1/n)(Σ yi)²        (Total Sum of Squares)

is a measure of the variability of y. (It is (n − 1) · s_y².)

[Figure: data points with the constant prediction ȳ and the corresponding errors of prediction yi − ȳ]

After fitting the line ŷ = b0 + b1 x, one no longer predicts y by ȳ and suffers the errors of prediction above, but only the errors ei := yi − ŷi. So, after fitting the line,

   SSE = Σ ei² = Σ (yi − ŷi)²        (Error Sum of Squares)

is a measure of the remaining (residual) error variation.

[Figure: data points with the fitted line y = b0 + b1 x and the residuals yi − ŷi]

The fact is that SST ≥ SSE, so

   SSR := SST − SSE ≥ 0.

SSR is taken as a measure of the "variation accounted for" by fitting the line. The coefficient of determination R² is defined as

   R² = SSR / SST

Obviously 0 ≤ R² ≤ 1; the closer R² is to 1, the better the linear fit.

Example 6.4.5 Olympics - long jump, continued

SST = Σ yi² − (1/n)(Σ yi)² = 1406.109 − 175.518²/22 = 5.81. What are SSE and SSR?

      y     x       ŷ    y − ŷ   (y − ŷ)²
  7.185     0   7.204   −0.019      0.000
  7.341     4   7.266    0.075      0.006
  7.480     8   7.328    0.152      0.023
  7.601    12   7.390    0.211      0.045
  7.150    20   7.513   −0.363      0.132
  7.445    24   7.575   −0.130      0.017
  7.741    28   7.637    0.104      0.011
  7.639    32   7.699   −0.060      0.004
  8.060    36   7.761    0.299      0.089
  7.823    48   7.947   −0.124      0.015
  7.569    52   8.009   −0.440      0.194
  7.830    56   8.071   −0.241      0.058
  8.122    60   8.133   −0.011      0.000
  8.071    64   8.195   −0.124      0.015
  8.903    68   8.257    0.646      0.417
  8.242    72   8.319   −0.077      0.006
  8.344    76   8.381   −0.037      0.001
  8.541    80   8.443    0.098      0.010
  8.541    84   8.505    0.036      0.001
  8.720    88   8.567    0.153      0.024
  8.670    92   8.629    0.041      0.002
  8.500    96   8.691   −0.191      0.036
                               SSE = 1.107

So SSR = SST − SSE = 5.810 − 1.107 = 4.703 and R² = SSR/SST = 0.8095.

Connection between R² and r

R² = SSR/SST is the squared sample correlation of y and ŷ. If, and only if, we use a linear function of x to predict y, i.e. ŷ = b0 + b1 x, the correlation between ŷ and x is ±1. Then, and only then, R² equals the squared sample correlation between y and x:

   R² = r²   if and only if   ŷ = b0 + b1 x

Example 6.4.6 Olympics - long jump, continued

R² = 0.8095 = (0.8997)² = r².

It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R², to doing inference, i.e. making confidence intervals, predictions, … based on the fitted line. But for that, we need a probability model.

6.4.2 Simple Linear Regression Model

In words: for input x the output y is normally distributed with mean β0 + β1 x = µ_{y|x} and standard deviation σ. In symbols:

   yi = β0 + β1 xi + εi   with εi i.i.d. normal N(0, σ²)

β0, β1, and σ² are the parameters of the model and have to be estimated from the data (the data pairs (xi, yi)). Pictorially:

[Figure: regression line with the normal density of y given x]

How do we get estimates for β0, β1, and σ²? Point estimates: β̂0 = b0, β̂1 = b1 from the least squares fit (which gives β̂0 and β̂1 the name least squares estimates). And σ²? σ² measures the variation around the "true" line β0 + β1 x; we don't know that line, only b0 + b1 x. Should we base the estimation of σ² on this fitted line? The "right" estimator for σ² turns out to be

   σ̂² = (1/(n − 2)) Σ (yi − ŷi)² = SSE / (n − 2).

Example 6.4.7 Olympics - long jump, continued

   β̂0 = b0 = 7.2037 (in m)
   β̂1 = b1 = 0.0155 (in m per year)
   σ̂² = SSE/(n − 2) = 1.107/20 = 0.055

Overall, we assume a linear regression model of the form y = 7.2037 + 0.0155 x + ε, with ε ∼ N(0, 0.055).
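To tie the pieces together, here is a hedged Python sketch that refits the line from the coded data in the table of Example 6.4.5 (x = year − 1900, y = long jump in m) and recovers SSE, R², and σ̂². The variable names are mine, and since the tabulated y values carry only three decimals, the last printed digit may differ slightly from the hand computation above.

```python
# Sketch: refit the long jump data and reproduce SSE, R^2 and sigma^2-hat.
x = [0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52,
     56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96]
y = [7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639, 8.060, 7.823,
     7.569, 7.830, 8.122, 8.071, 8.903, 8.242, 8.344, 8.541, 8.541, 8.720,
     8.670, 8.500]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

b1 = sxy / sxx                                          # approx. 0.0155
b0 = y_bar - b1 * x_bar                                 # approx. 7.204
y_hat = [b0 + b1 * xi for xi in x]

sst = sum((yi - y_bar) ** 2 for yi in y)                # approx. 5.81
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # approx. 1.107
r_squared = 1 - sse / sst                               # approx. 0.81
sigma2_hat = sse / (n - 2)                              # approx. 0.055

print(f"SSE = {sse:.3f}, R^2 = {r_squared:.3f}, sigma^2-hat = {sigma2_hat:.3f}")
```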