Slide set 30 Stat 330 (Spring 2015) Last update: April 21, 2015

Topic 4: Regression
Motivations: Statistical investigations only rarely focus on the distribution
of a single variable. We are often interested in comparisons among several
variables, in changes in a variable over time, or in relationships among
several variables.
Ideas: The idea of regression is that, given a random vector (X1, . . . , Xk)
with realization (x1, . . . , xk), we try to approximate the behavior of Y
by finding a function g(X1, . . . , Xk) such that Y ≈ g(X1, . . . , Xk).
Target: We are going to talk about simple linear regression: k = 1 and Y
is approximately linearly related to X, e.g. y = g(x) = b0 + b1x is a linear
function.
(1). A scatterplot of Y vs. X (the points (xi, yi) in the x-y plane) should show the
linear relationship.
(2). The linear relationship may hold only after a transformation of X and/or
Y, i.e. one needs to find the "right" scale for the variables.
Example
♠ What is meant by the "right" scale? For example, y ≈ c x^b is nonlinear
in x, but taking logarithms gives

ln y ≈ b · ln x + ln c,

so with y′ := ln y and x′ := ln x the relationship is linear: y′ ≈ b x′ + ln c.
Hence, on a log scale for both the x- and y-axis, one gets a linear relationship.
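As a quick numerical sketch of this transformation (with made-up constants c = 2 and b = 0.5, chosen only for illustration), the slope between any two points on the log-log scale is the constant b:

```python
import math

# Hypothetical power-law relationship y = c * x^b (c, b chosen for illustration)
c, b = 2.0, 0.5
xs = [1.0, 2.0, 4.0, 8.0, 16.0]
ys = [c * x**b for x in xs]

# On the log scale, y' = ln y and x' = ln x satisfy y' = b*x' + ln c exactly,
# so the slope between consecutive transformed points is constant (= b).
log_pts = [(math.log(x), math.log(y)) for x, y in zip(xs, ys)]
slopes = [(y2 - y1) / (x2 - x1)
          for (x1, y1), (x2, y2) in zip(log_pts, log_pts[1:])]
print(slopes)  # each slope ≈ 0.5
```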
Example (Mileage vs. Weight): Measurements on 38 1978-79 model
automobiles. Gas mileage in miles per gallon as measured by Consumers'
Union on a test track; weight as reported by the automobile manufacturer.
A scatterplot of mpg versus weight shows an inversely proportional
relationship.
However, after transforming weight x to weight⁻¹ (i.e. 1/weight), a scatterplot of mpg
versus weight⁻¹ reveals a linear relationship.
♥ Example (Olympics - long jump): Results for the long jump for all
Olympic Games between 1900 and 1996 are:

year   long jump (m)   year   long jump (m)   year   long jump (m)   year   long jump (m)
1900   7.19            1904   7.34            1908   7.48            1912   7.60
1920   7.15            1924   7.45            1928   7.74            1932   7.64
1936   8.06            1948   7.82            1952   7.57            1956   7.83
1960   8.12            1964   8.07            1968   8.90            1972   8.24
1976   8.34            1980   8.54            1984   8.54            1988   8.72
1992   8.67            1996   8.50
A scatterplot of long jump versus year shows:
The plot shows that it is perhaps reasonable to say that
y ≈ β0 + β1 x
Regression via least squares
♠ Least squares: The first issue to be dealt with in this context is: if we
accept that y ≈ β0 + β1x, how do we derive empirical values of β0, β1 from
n data points (xi, yi)? The standard answer is the "least squares" principle:
♠ b0 and b1 are estimates for β0 and β1 given the data (sometimes denoted
by β̂0 and β̂1)
♠ The least squares solution will produce the "best fitting line".
♠ In comparing lines that might be drawn through the plot we look at:

Q(b0, b1) = Σ_{i=1}^n (yi − (b0 + b1 xi))²

♠ So, we look at the sum of squared vertical distances from points to the
line and attempt to minimize this sum of squares:

∂Q/∂b0 (b0, b1) = −2 Σ_{i=1}^n (yi − (b0 + b1 xi))

∂Q/∂b1 (b0, b1) = −2 Σ_{i=1}^n xi (yi − (b0 + b1 xi))
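These partial derivatives can be sanity-checked numerically: with a small made-up data set (values chosen only for illustration), the analytic formulas agree with central finite differences of Q:

```python
# Finite-difference check of the partial derivatives of Q(b0, b1),
# using a small made-up data set (values chosen only for illustration).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]

def Q(b0, b1):
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

def dQ_db0(b0, b1):
    return -2 * sum(y - (b0 + b1 * x) for x, y in zip(xs, ys))

def dQ_db1(b0, b1):
    return -2 * sum(x * (y - (b0 + b1 * x)) for x, y in zip(xs, ys))

h = 1e-6
b0, b1 = 0.5, 1.5
approx0 = (Q(b0 + h, b1) - Q(b0 - h, b1)) / (2 * h)  # central difference in b0
approx1 = (Q(b0, b1 + h) - Q(b0, b1 - h)) / (2 * h)  # central difference in b1
print(abs(approx0 - dQ_db0(b0, b1)) < 1e-4)  # True
print(abs(approx1 - dQ_db1(b0, b1)) < 1e-4)  # True
```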
♠ Setting the derivatives to zero gives the normal equations:

n b0 + b1 Σ_{i=1}^n xi = Σ_{i=1}^n yi

b0 Σ_{i=1}^n xi + b1 Σ_{i=1}^n xi² = Σ_{i=1}^n xi yi
♠ Least squares solutions for b0 and b1 are:

b1 = (Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi · Σ_{i=1}^n yi) / (Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)²)
   = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²
   = Sxy / Sxx = slope

b0 = ȳ − b1 x̄ = (1/n) Σ_{i=1}^n yi − b1 · (1/n) Σ_{i=1}^n xi = y-intercept
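The closed-form solutions above translate directly into code; a minimal sketch (the function name is my own):

```python
def least_squares_fit(xs, ys):
    """Return (b0, b1) minimizing the sum of squared vertical distances."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))  # Sxy
    sxx = sum((x - xbar) ** 2 for x in xs)                      # Sxx
    b1 = sxy / sxx          # slope
    b0 = ybar - b1 * xbar   # y-intercept
    return b0, b1

# Points lying exactly on y = 1 + 2x: the fit recovers intercept 1, slope 2.
b0, b1 = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)  # 1.0 2.0
```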
Example on regression
♠ Example (Olympics long jump): X := # of years from 1900
(sample value denoted by x = year − 1900), Y := long jump (value: y), so with n = 22:

Σ_{i=1}^n xi = 1100,   Σ_{i=1}^n xi² = 74608,   Σ_{i=1}^n yi = 175.518,
Σ_{i=1}^n yi² = 1406.109,   Σ_{i=1}^n xi yi = 9079.584
♠ The parameters for the best fitting line are:

b1 = (9079.584 − (1100 · 175.518)/22) / (74608 − 1100²/22) = 0.0155

b0 = 175.518/22 − (1100/22) · 0.0155 = 7.2037

♠ The regression equation is "long jump = 7.204 + 0.016 · (year − 1900)".
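The arithmetic can be reproduced from the sums given above (a minimal check in Python):

```python
# Sums from the Olympics long-jump data (n = 22), as given on the slide
n = 22
sum_x, sum_y = 1100, 175.518
sum_x2, sum_xy = 74608, 9079.584

b1 = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x**2 / n)
b0 = sum_y / n - b1 * sum_x / n  # b0 = ybar - b1 * xbar

print(round(b1, 4), round(b0, 4))  # 0.0155 7.2037
```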
Correlation and regression line
♠ To measure linear association between random variables X and Y , we
would compute correlation ρ if we had their joint distribution.
♠ The sample correlation r is the analogous quantity computed from the sample.
♥ Formula for r:

r := Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )

  = (Σ_{i=1}^n xi yi − (1/n) Σ_{i=1}^n xi · Σ_{i=1}^n yi)
    / √( (Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)²) · (Σ_{i=1}^n yi² − (1/n)(Σ_{i=1}^n yi)²) )
♥ The numerator of r equals the numerator of b1, and one factor under the
square root in the denominator equals the denominator of b1.
The sample correlation r is connected to the theoretical correlation ρ, so
some nontrivial properties are expected:
• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1.
♠ Example (Olympics - continued):

r = (9079.584 − (1100 · 175.518)/22) / √( (74608 − 1100²/22) · (1406.109 − 175.518²/22) ) = 0.8997
♠ Both b1 > 0 and r > 0, which corresponds to a positive correlation, i.e.
an increasing trend.
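The correlation computation, and the sign agreement between r and b1, can be checked numerically from the same sums (a small Python sketch):

```python
from math import sqrt

# Sums from the Olympics long-jump data (n = 22), as given on the slides
n = 22
sum_x, sum_y = 1100, 175.518
sum_x2, sum_y2, sum_xy = 74608, 1406.109, 9079.584

num = sum_xy - sum_x * sum_y / n          # shared numerator of r and b1
b1 = num / (sum_x2 - sum_x**2 / n)
r = num / sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))

print(round(r, 4))        # 0.8997
print(r > 0 and b1 > 0)   # True: r has the same sign as b1
```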