Stat 330 (Spring 2015): slide set 30
Last update: April 21, 2015

Motivations: Statistical investigations only rarely focus on the distribution of a single variable.

Ideas: The idea of regression is that we have a random vector (X1, ..., Xk) whose realization is (x1, ..., xk), and we try to approximate the behavior of Y by finding a function g(X1, ..., Xk) such that Y ≈ g(X1, ..., Xk).

Target: We are going to talk about simple linear regression: k = 1 and Y is approximately linearly related to X, i.e. y = g(x) = b0 + b1x is a linear function.

(1) A scatterplot of Y vs. X (the points (xi, yi) in the x-y plane) should show the linear relationship.
(2) The linear relationship may hold only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables.

♠ What does the "right" scale mean? For example, y ≈ c·x^b is nonlinear in x, but it implies that

    ln y ≈ b·ln x + ln c,

so on a log scale for both the x- and y-axis one gets a linear relationship.

Example (Mileage vs. Weight): Measurements on 38 1978-79 model automobiles: gas mileage in miles per gallon as measured by Consumers' Union on a test track, and weight as reported by the automobile manufacturer. A scatterplot of mpg versus weight shows an inversely proportional relationship. However, after transforming weight to weight⁻¹, a scatterplot of mpg versus weight⁻¹ reveals a linear relationship.

♥ Example (Olympics - long jump): Results for the long jump for all Olympic games between 1900 and 1996 are:

    year           1900  1904  1908  1912  1920  1924  1928  1932  1936  1948  1952
    long jump (m)  7.19  7.34  7.48  7.60  7.15  7.45  7.74  7.64  8.06  7.82  7.57

    year           1956  1960  1964  1968  1972  1976  1980  1984  1988  1992  1996
    long jump (m)  7.83  8.12  8.07  8.90  8.24  8.34  8.54  8.54  8.72  8.67  8.50
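A minimal sketch of the "right scale" idea above: data generated from y = c·x^b look nonlinear against x, but the pairs (ln x, ln y) fall on a line, so a least-squares line through them recovers b (slope) and ln c (intercept). The constants c = 2.0, b = 0.5 and the x-values here are made up for illustration, not from the slides.

```python
# Data following y = c*x^b: nonlinear in x, but linear after taking logs.
import math

c_true, b_true = 2.0, 0.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [c_true * x ** b_true for x in xs]

# Least-squares line through (ln x, ln y): slope estimates b,
# exp(intercept) estimates c.
lx = [math.log(x) for x in xs]
ly = [math.log(y) for y in ys]
n = len(xs)
xbar = sum(lx) / n
ybar = sum(ly) / n
slope = sum((u - xbar) * (v - ybar) for u, v in zip(lx, ly)) \
        / sum((u - xbar) ** 2 for u in lx)
intercept = ybar - slope * xbar

print(round(slope, 6), round(math.exp(intercept), 6))  # 0.5 2.0
```

Since the synthetic data contain no noise, the fitted slope and exp(intercept) match b and c up to floating-point error.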
Topic 4: Regression

We are often interested in comparisons among several variables, in changes in a variable over time, or in relationships among several variables.

Regression via least squares

A scatterplot of long jump versus year (plot omitted here) shows that it is perhaps reasonable to say that

    y ≈ β0 + β1x

♠ In comparing lines that might be drawn through the plot, we look at how far the points fall from each line.
♠ The least squares solution will produce the "best fitting line".

♠ Least squares: The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1x, how do we derive empirical values of β0, β1 from n data points (x, y)? The standard answer is the "least squares" principle.
♠ b0 and b1 are estimates for β0 and β1 given the data (sometimes denoted by β̂0 and β̂1).
♠ So, we look at the sum of squared vertical distances from points to the line and attempt to minimize this sum of squares:

    Q(b0, b1) = Σi=1..n (yi − (b0 + b1xi))²

♠ Taking partial derivatives:

    ∂Q(b0, b1)/∂b0 = −2 Σi=1..n (yi − (b0 + b1xi))
    ∂Q(b0, b1)/∂b1 = −2 Σi=1..n xi(yi − (b0 + b1xi))

♠ Setting the derivatives to zero gives:

    n·b0 + b1 Σ xi = Σ yi
    b0 Σ xi + b1 Σ xi² = Σ xi yi

♠ Least squares solutions for b0 and b1 are:

    b1 = Sxy/Sxx = [Σ xiyi − (1/n)(Σ xi)(Σ yi)] / [Σ xi² − (1/n)(Σ xi)²] = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)²   (slope)
    b0 = ȳ − b1·x̄ = (1/n) Σ yi − b1 · (1/n) Σ xi   (y-intercept)

Example on regression

♠ Example (Olympics long jump): X := # of years from 1900 (sample value denoted by x = year − 1900), Y := long jump (value: y), so with n = 22:

    Σ xi = 1100, Σ yi = 175.518, Σ xi² = 74608, Σ yi² = 1406.109, Σ xiyi = 9079.584

♠ The parameters for the best fitting line are:

    b1 = (9079.584 − 1100·175.518/22) / (74608 − 1100²/22) = 0.0155
    b0 = 175.518/22 − (1100/22)·0.0155 = 7.2037

♠ The regression equation is "long jump = 7.204 + 0.016·(year − 1900)".

Correlation and regression line

♠ To measure linear association between random variables X and Y, we would compute the correlation ρ if we had their joint distribution.
♠ The sample correlation r is what we would get from the sample.
♥ Formula for r:

    r := Σ(xi − x̄)(yi − ȳ) / √[Σ(xi − x̄)² · Σ(yi − ȳ)²]
       = [Σ xiyi − (1/n)(Σ xi)(Σ yi)] / √([Σ xi² − (1/n)(Σ xi)²] · [Σ yi² − (1/n)(Σ yi)²])

♥ The numerator is the numerator of b1, and one factor under the root in the denominator is the denominator of b1.

The sample correlation r is connected to the theoretical correlation ρ, so some nontrivial results are expected:

• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line
• r has the same sign as b1

♠ Example (Olympics - continued):

    r = (9079.584 − 1100·175.518/22) / √[(74608 − 1100²/22)(1406.109 − 175.518²/22)] = 0.8997

♠ Both b1 > 0 and r > 0, which corresponds to positive correlation, or an increasing trend.
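As a quick numeric check, the summary statistics from the Olympic example can be plugged into the formulas for b1, b0, and r above; this short sketch reproduces the slide's values.

```python
# Least-squares slope/intercept and sample correlation computed from the
# summary statistics in the Olympic long-jump example (x = year - 1900).
from math import sqrt

n = 22
sum_x, sum_y = 1100, 175.518
sum_x2, sum_y2 = 74608, 1406.109
sum_xy = 9079.584

Sxx = sum_x2 - sum_x ** 2 / n        # Σxi² − (Σxi)²/n
Syy = sum_y2 - sum_y ** 2 / n        # Σyi² − (Σyi)²/n
Sxy = sum_xy - sum_x * sum_y / n     # Σxiyi − (Σxi)(Σyi)/n

b1 = Sxy / Sxx                       # slope
b0 = sum_y / n - b1 * sum_x / n      # intercept: ȳ − b1·x̄
r = Sxy / sqrt(Sxx * Syy)            # sample correlation

print(round(b1, 4), round(b0, 4), round(r, 4))  # 0.0155 7.2037 0.8997
```

Note that b1 and r share the numerator Sxy, which is why they always have the same sign.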