Slide set 30
Stat 330 (Spring 2015)
Last update: April 21, 2015

Topic 4: Regression

Motivations: Statistical investigations only rarely focus on the distribution of a single variable. We are often interested in comparisons among several variables, in changes in a variable over time, or in relationships among several variables.

Idea: In regression we have a random vector (X_1, ..., X_k) with realization (x_1, ..., x_k), and we try to approximate the behavior of Y by finding a function g(X_1, ..., X_k) such that Y ≈ g(X_1, ..., X_k).

Target: We are going to talk about simple linear regression: k = 1 and Y is approximately linearly related to X, i.e. y = g(x) = b_0 + b_1 x is a linear function.
(1) A scatterplot of Y vs. X (the points (x_i, y_i) in the x-y plane) should show the linear relationship.
(2) The linear relationship may hold only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables.

Example

♠ What does the "right" scale mean? For example, y ≈ c x^b is nonlinear in x, but it implies

    ln y ≈ b · ln x + ln c,

so with y' := ln y and x' := ln x this is a linear relationship y' ≈ b x' + ln c; on a log scale for both the x- and y-axis one gets a straight line.

♠ Example (Mileage vs. Weight): Measurements on 38 1978-79 model automobiles. Gas mileage in miles per gallon as measured by Consumers' Union on a test track; weight as reported by the automobile manufacturer. A scatterplot of mpg versus weight shows an inversely proportional relationship. However, after transforming weight to 1/weight, a scatterplot of mpg versus 1/weight reveals a linear relationship.
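The "right scale" idea above can be sketched in Python. This is a minimal illustration with made-up constants c = 2 and b = 1.5 (not from the slides): if y = c·x^b exactly, then fitting a least squares line to (ln x, ln y) recovers b as the slope and ln c as the intercept.

```python
import math

# Made-up power-law data y = c * x^b (hypothetical constants, not from the slides).
c_true, b_true = 2.0, 1.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [c_true * x ** b_true for x in xs]

# Transform to the "right" scale: ln y = b * ln x + ln c is linear in ln x.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]

# Least squares slope and intercept on the transformed data.
n = len(lx)
xbar, ybar = sum(lx) / n, sum(ly) / n
slope = (sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly))
         / sum((u - xbar) ** 2 for u in lx))
intercept = ybar - slope * xbar          # this is ln c

print(round(slope, 6), round(math.exp(intercept), 6))   # → 1.5 2.0
```

Since the data here are noise-free, the transformation recovers b and c exactly; with real data (as in the mileage example) the straightened scatterplot would only be approximately linear.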
♥ Example (Olympics - long jump): Results for the long jump for all Olympic games between 1900 and 1996 are:

    year   long jump (m)    year   long jump (m)    year   long jump (m)    year   long jump (m)
    1900   7.19             1928   7.74             1960   8.12             1984   8.54
    1904   7.34             1932   7.64             1964   8.07             1988   8.72
    1908   7.48             1936   8.06             1968   8.90             1992   8.67
    1912   7.60             1948   7.82             1972   8.24             1996   8.50
    1920   7.15             1952   7.57             1976   8.34
    1924   7.45             1956   7.83             1980   8.54

A scatterplot of long jump versus year shows that it is perhaps reasonable to say that y ≈ β_0 + β_1 x.

Regression via least squares

♠ Least squares: The first issue to be dealt with in this context is: if we accept that y ≈ β_0 + β_1 x, how do we derive empirical values of β_0, β_1 from n data points (x_i, y_i)? The standard answer is the "least squares" principle.

♠ b_0 and b_1 are the estimates for β_0 and β_1 given the data (sometimes denoted by β̂_0 and β̂_1).

♠ The least squares solution will produce the "best fitting line".
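The least squares principle can be sketched in Python on the long jump data above: score any candidate line y = b_0 + b_1·x (with x = year − 1900) by its sum of squared vertical distances Q(b_0, b_1), and prefer the line with the smaller Q. As a sanity check, the fitted coefficients 7.204 and 0.0155 quoted later in these slides beat a flat line at the mean jump:

```python
# Olympic long jump data (year, jump in meters), 1900-1996.
years = [1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952,
         1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996]
jumps = [7.19, 7.34, 7.48, 7.60, 7.15, 7.45, 7.74, 7.64, 8.06, 7.82, 7.57,
         7.83, 8.12, 8.07, 8.90, 8.24, 8.34, 8.54, 8.54, 8.72, 8.67, 8.50]
xs = [y - 1900 for y in years]

def Q(b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, jumps))

q_flat = Q(sum(jumps) / len(jumps), 0.0)   # flat line at the mean jump
q_fit = Q(7.204, 0.0155)                   # fitted line from these slides
print(q_flat > q_fit)                      # the sloped line has the smaller Q
```

Minimizing Q over all (b_0, b_1), done analytically on the next slides, is exactly what singles out the "best fitting line".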
♠ In comparing lines that might be drawn through the plot we look at

    Q(b_0, b_1) = Σ_{i=1}^n (y_i − (b_0 + b_1 x_i))².

♠ So we look at the sum of squared vertical distances from the points to the line and attempt to minimize this sum of squares:

    ∂Q/∂b_0 = −2 Σ_{i=1}^n (y_i − (b_0 + b_1 x_i))
    ∂Q/∂b_1 = −2 Σ_{i=1}^n x_i (y_i − (b_0 + b_1 x_i))

♠ Setting the derivatives to zero gives the normal equations:

    n b_0 + b_1 Σ_{i=1}^n x_i = Σ_{i=1}^n y_i
    b_0 Σ_{i=1}^n x_i + b_1 Σ_{i=1}^n x_i² = Σ_{i=1}^n x_i y_i

♠ The least squares solutions for b_0 and b_1 are:

    b_1 = Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)²
        = [Σ x_i y_i − (1/n) Σ x_i · Σ y_i] / [Σ x_i² − (1/n)(Σ x_i)²]
        = S_xy / S_xx = slope

    b_0 = ȳ − b_1 x̄ = (1/n) Σ y_i − b_1 · (1/n) Σ x_i = y-intercept

Example on regression

♠ Example (Olympics long jump): Let X := number of years since 1900 (sample value x = year − 1900) and Y := long jump (value y), so with n = 22:

    Σ x_i = 1100,  Σ x_i² = 74608,  Σ y_i = 175.518,  Σ y_i² = 1406.109,  Σ x_i y_i = 9079.584.

♠ The parameters of the best fitting line are:

    b_1 = (9079.584 − 1100 · 175.518 / 22) / (74608 − 1100² / 22) = 0.0155
    b_0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037

♠ The regression equation is "long jump = 7.204 + 0.016 · (year − 1900)".

Correlation and regression line

♠ To measure linear association between random variables X and Y, we would compute the correlation ρ if we had their joint distribution.

♠ The sample correlation r is what we get from the sample.

♥ Formula for r:

    r := Σ(x_i − x̄)(y_i − ȳ) / √[ Σ(x_i − x̄)² · Σ(y_i − ȳ)² ]
       = [Σ x_i y_i − (1/n) Σ x_i · Σ y_i] / √[ (Σ x_i² − (1/n)(Σ x_i)²) · (Σ y_i² − (1/n)(Σ y_i)²) ]

♥ The numerator of r is the numerator of b_1, and one factor under the root in the denominator is the denominator of b_1.
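As a quick check, the slope and intercept formulas can be evaluated in Python directly from the summary sums on this slide (a minimal sketch, no fitting library assumed):

```python
# Summary statistics for the Olympic long jump data (x = year - 1900).
n = 22
sum_x, sum_x2 = 1100, 74608
sum_y, sum_xy = 175.518, 9079.584

# b1 = S_xy / S_xx and b0 = ybar - b1 * xbar, as derived above.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
b0 = sum_y / n - b1 * sum_x / n

print(round(b1, 4), round(b0, 4))   # → 0.0155 7.2037
```

This reproduces the regression equation "long jump = 7.204 + 0.016 · (year − 1900)" up to rounding.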
The sample correlation r is connected to the theoretical correlation ρ, so some nontrivial results are expected:

• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b_1.

♠ Example (Olympics, continued):

    r = (9079.584 − 1100 · 175.518 / 22) / √[ (74608 − 1100²/22) · (1406.109 − 175.518²/22) ] = 0.8997

♠ Both b_1 > 0 and r > 0, which corresponds to positive correlation, i.e. an increasing trend.
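The same summary sums give r; a short Python check of the computation above:

```python
import math

# Summary statistics for the Olympic long jump data (x = year - 1900).
n = 22
sum_x, sum_x2 = 1100, 74608
sum_y, sum_y2, sum_xy = 175.518, 1406.109, 9079.584

# Sample correlation: S_xy / sqrt(S_xx * S_yy).
num = sum_xy - sum_x * sum_y / n
den = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r = num / den

print(round(r, 4))   # → 0.8997
```

Note the numerator is the same S_xy used for b_1, so r > 0 exactly when b_1 > 0, consistent with the last bullet above.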