Stat 330 (Spring 2015)
Slide set 30
Last update: April 21, 2015

Topic 4: Regression

Motivations: Statistical investigations only rarely focus on the distribution
of a single variable. We are often interested in comparisons among several
variables, in changes in a variable over time, or in relationships among
several variables.

Ideas: The idea of regression is that we have a random vector (X1, . . . , Xk)
whose realization is (x1, . . . , xk), and we try to approximate the behavior of Y
by finding a function g(X1, . . . , Xk) such that Y ≈ g(X1, . . . , Xk).

Target: We are going to talk about simple linear regression: k = 1 and Y is
approximately linearly related to X, e.g. y = g(x) = b0 + b1x is a linear function.

(1) A scatterplot of Y vs. X (the points (xi, yi) in the x-y plane) should show
the linear relationship.
(2) The linear relationship may hold only after a transformation of X and/or Y,
i.e. one needs to find the "right" scale for the variables.

Example

♠ What does the "right" scale mean? For example, y ≈ cx^b is nonlinear in x,
but it implies that

    ln y ≈ b ln x + ln c,

which is linear once ln x and ln y are treated as the new x and y, so on a log
scale for both the x- and y-axes one gets a linear relationship.

♥ Example (Mileage vs. Weight): Measurements on 38 1978-79 model automobiles.
Gas mileage in miles per gallon as measured by Consumers' Union on a test track;
weight as reported by the automobile manufacturer.
A scatterplot of mpg versus weight shows an inversely proportional relationship.
However, after transforming weight to weight^(-1), a scatterplot of mpg versus
weight^(-1) reveals a linear relationship.
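To make the log-log idea concrete, here is a minimal Python sketch (my own
illustration, not from the slides; the data and the constants c_true and b_true
are made up) showing that a power-law relationship becomes an ordinary
straight-line fit after the transformation:

```python
import numpy as np

# Synthetic data from y = c * x**b with a little multiplicative noise.
rng = np.random.default_rng(0)
c_true, b_true = 2.0, 1.5
x = np.linspace(1.0, 10.0, 50)
y = c_true * x**b_true * np.exp(rng.normal(0.0, 0.05, size=x.size))

# On the original scale the relationship is curved; after taking logs,
# ln y ~ b * ln x + ln c is a straight line, so a linear fit recovers b and c.
b_hat, ln_c_hat = np.polyfit(np.log(x), np.log(y), deg=1)
print(b_hat, np.exp(ln_c_hat))  # close to 1.5 and 2.0
```

The mileage example uses the same trick, with the reciprocal transformation
weight → 1/weight instead of a logarithm.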
♥ Example (Olympics - long jump): Results for the long jump for all Olympic
games between 1900 and 1996 are:

year   long jump (in m)   year   long jump (in m)   year   long jump (in m)   year   long jump (in m)
1900   7.19               1904   7.34               1908   7.48               1912   7.60
1920   7.15               1924   7.45               1928   7.74               1932   7.64
1936   8.06               1948   7.82               1952   7.57               1956   7.83
1960   8.12               1964   8.07               1968   8.90               1972   8.24
1976   8.34               1980   8.54               1984   8.54               1988   8.72
1992   8.67               1996   8.50
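For later reference, here is the same data entered as arrays in a short Python
sketch (the array names years and jumps are mine, not from the slides),
together with a quick check of the totals used later and the scatterplot
discussed on the next slide:

```python
import numpy as np
import matplotlib.pyplot as plt

# Olympic long jump results from the table above.
years = np.array([1900, 1904, 1908, 1912, 1920, 1924, 1928, 1932, 1936, 1948, 1952,
                  1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984, 1988, 1992, 1996])
jumps = np.array([7.19, 7.34, 7.48, 7.60, 7.15, 7.45, 7.74, 7.64, 8.06, 7.82, 7.57,
                  7.83, 8.12, 8.07, 8.90, 8.24, 8.34, 8.54, 8.54, 8.72, 8.67, 8.50])

x = years - 1900                 # the predictor used in the regression example below
print(x.sum(), jumps.sum())      # 1100 and about 175.5 (the slides report 175.518,
                                 # presumably from unrounded measurements)

plt.scatter(years, jumps)
plt.xlabel("year")
plt.ylabel("long jump (m)")
plt.show()
```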
A scatterplot of long jump versus year shows an increasing, roughly linear trend.
The plot shows that it is perhaps reasonable to say that

    y ≈ β0 + β1x

♠ In comparing lines that might be drawn through the plot, we look at how close
each candidate line comes to the plotted points.
♠ The least squares solution will produce the "best fitting line".
Regression via least squares

♠ Least squares: The first issue to be dealt with in this context is: if we
accept that y ≈ β0 + β1x, how do we derive empirical values of β0 and β1 from
n data points (xi, yi)? The standard answer is the "least squares" principle.

♠ b0 and b1 are estimates for β0 and β1 given the data (sometimes denoted by
β̂0 and β̂1).

♠ So, we look at the sum of squared vertical distances from the points to the
line and attempt to minimize this sum of squares:

    Q(b0, b1) = Σ_{i=1}^n (yi − (b0 + b1xi))²
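As a quick numerical illustration (my own, not from the slides), Q can be
expanded in terms of summary sums, so two candidate lines for the Olympic data
can be compared without the raw points; the sums used here are the ones
reported in the regression example later in this slide set:

```python
# Q(b0, b1) = Σy² - 2·b0·Σy - 2·b1·Σxy + n·b0² + 2·b0·b1·Σx + b1²·Σx²,
# evaluated with the Olympic long jump sums (n = 22, x = year - 1900).
def Q(b0, b1, n=22, sum_x=1100.0, sum_y=175.518,
      sum_xx=74608.0, sum_yy=1406.109, sum_xy=9079.584):
    return (sum_yy - 2 * b0 * sum_y - 2 * b1 * sum_xy
            + n * b0 ** 2 + 2 * b0 * b1 * sum_x + b1 ** 2 * sum_xx)

print(Q(7.0, 0.020))   # about 1.52 for a rough guess at a line
print(Q(7.2, 0.015))   # about 1.13 for a line close to the least squares fit
```

The least squares estimates derived next are exactly the (b0, b1) that make
this quantity as small as possible.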
♠ Taking partial derivatives of Q with respect to b0 and b1:

    ∂Q(b0, b1)/∂b0 = −2 Σ_{i=1}^n (yi − (b0 + b1xi))

    ∂Q(b0, b1)/∂b1 = −2 Σ_{i=1}^n xi (yi − (b0 + b1xi))

♠ Setting the derivatives to zero gives the normal equations:

    n·b0 + b1 Σ_{i=1}^n xi = Σ_{i=1}^n yi

    b0 Σ_{i=1}^n xi + b1 Σ_{i=1}^n xi² = Σ_{i=1}^n xi yi
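The normal equations are just a 2×2 linear system, so they can also be solved
directly; a minimal sketch (mine, not from the slides), again using the Olympic
long jump sums that appear in the example later:

```python
import numpy as np

# Normal equations for the Olympic data (n = 22, x = year - 1900):
#   n*b0      + b1*sum(x)   = sum(y)
#   b0*sum(x) + b1*sum(x^2) = sum(x*y)
A = np.array([[22.0, 1100.0],
              [1100.0, 74608.0]])
rhs = np.array([175.518, 9079.584])

b0, b1 = np.linalg.solve(A, rhs)
print(b0, b1)   # about 7.204 and 0.0155, matching the slides
```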
♠ Least squares solutions for b0 and b1 are:

    b1 = [Σ_{i=1}^n xi yi − (1/n)(Σ_{i=1}^n xi)(Σ_{i=1}^n yi)] / [Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)²]
       = Σ_{i=1}^n (xi − x̄)(yi − ȳ) / Σ_{i=1}^n (xi − x̄)²  = Sxy / Sxx  = slope

    b0 = ȳ − b1 x̄ = (1/n) Σ_{i=1}^n yi − b1 · (1/n) Σ_{i=1}^n xi  = y-intercept
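These formulas are easy to code up; here is a small sketch (my own helper,
least_squares_line is not from the slides) that computes the slope and
intercept from raw data and checks the result against numpy's built-in
polynomial fit:

```python
import numpy as np

def least_squares_line(x, y):
    """Return (b0, b1) with b1 = Sxy / Sxx and b0 = ybar - b1 * xbar."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xbar, ybar = x.mean(), y.mean()
    sxy = ((x - xbar) * (y - ybar)).sum()
    sxx = ((x - xbar) ** 2).sum()
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    return b0, b1

# Tiny check on made-up data: both approaches should agree.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])
print(least_squares_line(x, y))
print(np.polyfit(x, y, deg=1))   # returns (slope, intercept), i.e. (b1, b0)
```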
Example on regression

♠ Example (Olympics long jump): X := number of years since 1900 (sample value
denoted by x = year − 1900), Y := long jump (value: y), so with n = 22:

    Σ_{i=1}^n xi = 1100,   Σ_{i=1}^n yi = 175.518,
    Σ_{i=1}^n xi² = 74608,   Σ_{i=1}^n yi² = 1406.109,   Σ_{i=1}^n xi yi = 9079.584

♠ The parameters for the best fitting line are:

    b1 = (9079.584 − 1100·175.518/22) / (74608 − 1100²/22) = 0.0155

    b0 = 175.518/22 − (1100/22)·0.0155 = 7.2037

♠ The regression equation is "long jump = 7.204 + 0.016·(year − 1900)".
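A quick sketch (mine, not from the slides) reproducing this arithmetic from the
summary sums:

```python
n = 22
sum_x, sum_y = 1100.0, 175.518
sum_xx, sum_xy = 74608.0, 9079.584

# Closed-form least squares estimates from the previous slides.
b1 = (sum_xy - sum_x * sum_y / n) / (sum_xx - sum_x ** 2 / n)
b0 = sum_y / n - b1 * (sum_x / n)
print(round(b1, 4), round(b0, 4))   # 0.0155 and 7.2037
```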
Correlation and regression line

♠ To measure linear association between random variables X and Y, we would
compute the correlation ρ if we had their joint distribution.
♠ The sample correlation r is what we would get from the sample.

♥ Formula for r:

    r := Σ_{i=1}^n (xi − x̄)(yi − ȳ) / √( Σ_{i=1}^n (xi − x̄)² · Σ_{i=1}^n (yi − ȳ)² )

       = [Σ_{i=1}^n xi yi − (1/n)(Σ_{i=1}^n xi)(Σ_{i=1}^n yi)]
         / √( [Σ_{i=1}^n xi² − (1/n)(Σ_{i=1}^n xi)²] · [Σ_{i=1}^n yi² − (1/n)(Σ_{i=1}^n yi)²] )

♥ The numerator is the numerator of b1, and one part under the root of the
denominator is the denominator of b1.
The sample correlation r is connected to the theoretical correlation ρ, so some
nontrivial results are expected:
• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1.

♠ Example (Olympics-continued):

    r = (9079.584 − 1100·175.518/22) / √( (74608 − 1100²/22)·(1406.109 − 175.518²/22) ) = 0.8997

♠ Both b1 > 0 and r > 0, which corresponds to a positive correlation, i.e. an
increasing trend.
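A matching sketch (mine, not from the slides) checking this correlation from
the same summary sums:

```python
from math import sqrt

n = 22
sum_x, sum_y = 1100.0, 175.518
sum_xx, sum_yy, sum_xy = 74608.0, 1406.109, 9079.584

sxy = sum_xy - sum_x * sum_y / n
sxx = sum_xx - sum_x ** 2 / n
syy = sum_yy - sum_y ** 2 / n

r = sxy / sqrt(sxx * syy)
print(round(r, 4))              # 0.8997
print(r > 0 and sxy / sxx > 0)  # True: r and b1 = Sxy/Sxx share the sign of Sxy
```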