3. test statistic is:
\[
Z = \frac{\bar{x}_1 - \bar{x}_2 - 0}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}
\]
Under the null hypothesis Z has a standard normal distribution; we will consider large negative values of Z as evidence against H0.
4. computation: z = (1205 − 1400 − 0)/√(1000²/30 + 900²/40) = −0.84
5. p-value: P(Z < −0.84) = 0.2005
This is not a very small value; we therefore have only very weak evidence against H0.
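This computation is easy to script; below is a minimal Python sketch of the same z and p-value calculation (the use of scipy.stats.norm for the normal CDF is an assumption of the sketch, the notes do not prescribe any software).

\begin{verbatim}
from math import sqrt
from scipy.stats import norm

xbar1, s1, n1 = 1205, 1000, 30   # sample 1: mean, std. dev., sample size
xbar2, s2, n2 = 1400, 900, 40    # sample 2: mean, std. dev., sample size

# two-sample z statistic from the formula above
z = (xbar1 - xbar2 - 0) / sqrt(s1**2 / n1 + s2**2 / n2)
p_value = norm.cdf(z)            # one-sided p-value: P(Z < z)

print(f"z = {z:.2f}, p-value = {p_value:.4f}")   # z = -0.84, p-value close to 0.20
\end{verbatim}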
Example 6.3.5 Queueing systems
Consider two very complicated queueing systems. We would like to know whether there is a difference in the large-t probabilities of a server being available.
We run simulations of each system and check whether a server is available at time t = 2000:
                                            System 1                  System 2
runs (each with a different random seed)    n1 = 1000                 n2 = 500
server available at time t = 2000?          p̂1 = 551/1000 = 0.551     p̂2 = 303/500 = 0.606
How strong is the evidence of a difference in the t = 2000 availability of a server between the two systems?
1. state null hypothesis: H0 : p1 = p2 (p1 − p2 = 0)
2. alternative hypothesis: Ha : p1 ≠ p2 (p1 − p2 ≠ 0)
3. Preliminary: note that, if there were no difference between the two systems, a plausible estimate of the availability of a server would be
\[
\hat{p} = \frac{n_1 \hat{p}_1 + n_2 \hat{p}_2}{n_1 + n_2} = \frac{551 + 303}{1000 + 500} = 0.569
\]
a test statistic is:
\[
Z = \frac{\hat{p}_1 - \hat{p}_2 - 0}{\sqrt{\hat{p}(1 - \hat{p}) \cdot \left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}
\]
Under the null hypothesis Z has a standard normal distribution; we will consider large absolute values of Z as evidence against H0.
4. computation: z = (0.551 − 0.606)/(√(0.569 · (1 − 0.569)) · √(1/1000 + 1/500)) = −2.03
5. p-value: P(|Z| > 2.03) = 0.04
This is fairly strong evidence of a real difference in the t = 2000 availabilities of a server between the two systems.
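The pooled two-proportion test can be scripted the same way; a minimal Python sketch (again assuming scipy for the normal CDF):

\begin{verbatim}
from math import sqrt
from scipy.stats import norm

x1, n1 = 551, 1000   # system 1: runs with a server available, total runs
x2, n2 = 303, 500    # system 2: runs with a server available, total runs

p1_hat, p2_hat = x1 / n1, x2 / n2
p_hat = (x1 + x2) / (n1 + n2)      # pooled estimate under H0: p1 = p2

z = (p1_hat - p2_hat - 0) / sqrt(p_hat * (1 - p_hat) * (1 / n1 + 1 / n2))
p_value = 2 * norm.cdf(-abs(z))    # two-sided p-value: P(|Z| > |z|)

print(f"z = {z:.2f}, p-value = {p_value:.3f}")   # z = -2.03, p-value about 0.04
\end{verbatim}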
6.4 Regression
A statistical investigation only rarely focuses on the distribution of a single variable. We are often interested in comparisons among several variables, in changes in a variable over time, or in relationships among several variables.
The idea of regression is that we have a vector X1 , . . . , Xk and try to approximate the behavior of Y by
finding a function g(X1 , . . . , Xk ) such that Y ≈ g(X1 , . . . , Xk ).
The simplest possible version is:
6.4.1 Simple Linear Regression (SLR)
Situation: k = 1 and Y is approximately linearly related to X, i.e. g(x) = b0 + b1 x.
Notes:
• Scatterplot of Y vs X should show the linear relationship.
• The linear relationship may hold only after a transformation of X and/or Y, i.e. one needs to find the "right" scale for the variables: e.g. if y ≈ c·x^b, this is nonlinear in x, but it implies that
\[
\underbrace{\ln y}_{=:\, y'} \approx b \underbrace{\ln x}_{=:\, x'} + \ln c,
\]
so on a log scale for both the x- and y-axis one gets a linear relationship (a small numerical sketch follows below).
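As an illustration of this linearization, the following Python sketch fits a line to (ln x, ln y) for a noisy power law; the constants c = 2.5, b = 0.7 and the noise level are made up for the illustration.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.5 * x**0.7 * np.exp(rng.normal(0, 0.05, size=x.size))   # noisy y = c*x^b

# fit a straight line on the log-log scale: ln y = b*ln x + ln c
slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
print(f"b is approx. {slope:.2f}, c is approx. {np.exp(intercept):.2f}")
# recovers values close to b = 0.7 and c = 2.5
\end{verbatim}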
Example 6.4.1 Mileage vs Weight
Measurements on 38 1978-79 model automobiles. Gas mileage in miles per gallon as measured by Consumers’
Union on a test track. Weight as reported by automobile manufacturer.
A scatterplot of mpg versus weight shows an inversely proportional relationship:
[Scatterplot of MPG versus Weight]

Transform weight by 1/x to weight⁻¹. A scatterplot of mpg versus weight⁻¹ reveals a linear relationship:

[Scatterplot of MPG versus 1/Wgt]
Example 6.4.2 Olympics - long jump
Results for the long jump at all Olympic Games between 1900 and 1996 are:
year   long jump (in m)        year   long jump (in m)
1900   7.19                    1960   8.12
1904   7.34                    1964   8.07
1908   7.48                    1968   8.90
1912   7.60                    1972   8.24
1920   7.15                    1976   8.34
1924   7.45                    1980   8.54
1928   7.74                    1984   8.54
1932   7.64                    1988   8.72
1936   8.06                    1992   8.67
1948   7.82                    1996   8.50
1952   7.57
1956   7.83

A scatterplot of long jump versus year shows:
[Scatterplot of long jump (in m) versus year, with year measured in years since 1900]
The plot shows that it is perhaps reasonable to say that
\[
y \approx \beta_0 + \beta_1 x.
\]
The first issue to be dealt with in this context is: if we accept that y ≈ β0 + β1 x, how do we derive empirical values of β0, β1 from n data points (xi, yi)? The standard answer is the "least squares" principle:
[Scatterplot of y versus x with a candidate line y = b0 + b1 x drawn through the points]
In comparing lines that might be drawn through the plot we look at
\[
Q(b_0, b_1) = \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)^2,
\]
i.e. we look at the sum of squared vertical distances from points to the line and attempt to minimize this sum of squares:
\[
\frac{d}{db_0} Q(b_0, b_1) = -2 \sum_{i=1}^{n} \bigl(y_i - (b_0 + b_1 x_i)\bigr)
\]
\[
\frac{d}{db_1} Q(b_0, b_1) = -2 \sum_{i=1}^{n} x_i \bigl(y_i - (b_0 + b_1 x_i)\bigr)
\]
Setting the derivatives to zero gives:
\[
n b_0 + b_1 \sum_{i=1}^{n} x_i = \sum_{i=1}^{n} y_i
\]
\[
b_0 \sum_{i=1}^{n} x_i + b_1 \sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} x_i y_i
\]
Least squares solutions for b0 and b1 are:
\[
b_1 = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i}{\sum_{i=1}^{n} x_i^2 - \frac{1}{n} \left(\sum_{i=1}^{n} x_i\right)^2}
    = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
    \qquad \text{(slope)}
\]
\[
b_0 = \bar{y} - \bar{x}\, b_1 = \frac{1}{n} \sum_{i=1}^{n} y_i - b_1 \frac{1}{n} \sum_{i=1}^{n} x_i
    \qquad \text{(y-intercept at } x = 0\text{)}
\]
These solutions produce the “best fitting line”.
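These formulas translate directly into code. Below is a minimal Python sketch; the function name least_squares_line and the toy data in the usage lines are illustrative, not taken from the text.

\begin{verbatim}
import numpy as np

def least_squares_line(x, y):
    """Return (b0, b1) minimizing Q(b0, b1) = sum of (y_i - (b0 + b1*x_i))^2."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
    b0 = y.mean() - b1 * x.mean()
    return b0, b1

# toy usage (made-up data): intercept comes out near 1.09, slope near 1.94
b0, b1 = least_squares_line([0, 1, 2, 3], [1.1, 2.9, 5.2, 6.8])
print(b0, b1)
\end{verbatim}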
Example 6.4.3 Olympics - long jump, continued
X := year − 1900 (years since 1900), Y := long jump (in m)
\[
\sum_{i=1}^{n} x_i = 1100, \quad
\sum_{i=1}^{n} x_i^2 = 74608, \quad
\sum_{i=1}^{n} y_i = 175.518, \quad
\sum_{i=1}^{n} y_i^2 = 1406.109, \quad
\sum_{i=1}^{n} x_i y_i = 9079.584
\]
The parameters for the best fitting line are:
\[
b_1 = \frac{9079.584 - \frac{1100 \cdot 175.518}{22}}{74608 - \frac{1100^2}{22}} = 0.0155 \text{ (in m per year)}
\]
\[
b_0 = \frac{175.518}{22} - \frac{1100}{22} \cdot 0.0155 = 7.2037 \text{ (in m)}
\]
The regression equation is
long jump = 7.204 + 0.016 · x (in m), where x = year − 1900.
In addition, it is useful to be able to judge how well the line describes the data, i.e. how "linear looking" a plot really is.
There are a couple of ways of doing this:
6.4.1.1 The sample correlation r
This is the empirical counterpart of the theoretical correlation ρ we would compute if we had random variables X and Y and knew their distribution.
\[
r := \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2}}
   = \frac{\sum_{i=1}^{n} x_i y_i - \frac{1}{n} \sum_{i=1}^{n} x_i \cdot \sum_{i=1}^{n} y_i}
          {\sqrt{\left(\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right)\left(\sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2\right)}}
\]
The numerator is the numerator of b1; one factor under the root in the denominator is the denominator of b1.
Because of its connection to ρ, the sample correlation r satisfies (this is not obvious to see, and we won't prove it):
• −1 ≤ r ≤ 1
• r = ±1 exactly when all (x, y) data pairs fall on a single straight line.
• r has the same sign as b1 .
Example 6.4.4 Olympics - long jump, continued
\[
r = \frac{9079.584 - \frac{1100 \cdot 175.518}{22}}{\sqrt{\left(74608 - \frac{1100^2}{22}\right)\left(1406.109 - \frac{175.518^2}{22}\right)}} = 0.8997
\]
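The same value can be reproduced from the five sums; a minimal Python sketch:

\begin{verbatim}
from math import sqrt

n = 22
sum_x, sum_x2 = 1100, 74608
sum_y, sum_y2 = 175.518, 1406.109
sum_xy = 9079.584

num = sum_xy - sum_x * sum_y / n
den = sqrt((sum_x2 - sum_x**2 / n) * (sum_y2 - sum_y**2 / n))
print(f"r = {num / den:.4f}")   # r = 0.8997
\end{verbatim}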
A second measure for goodness of fit:
6.4.1.2 Coefficient of determination R²
This is based on a comparison of “variation accounted for” by the line versus “raw variation” of y.
The idea is that
\[
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = SST \quad \text{(Total Sum of Squares)}
\]
is a measure of the variability of y. (It equals (n − 1) · s_y².)
[Scatterplot of y versus x with the horizontal line at ȳ; the vertical deviations y_i − ȳ are the errors made when predicting y by ȳ]
After fitting the line ŷ = b0 + b1 x, one no longer predicts y by ȳ and suffers the errors of prediction above, but rather only the errors
ŷ_i − y_i =: e_i.
So, after fitting the line,
\[
\sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = SSE \quad \text{(Sum of Squares of Errors)}
\]
is a measure of the remaining (residual) error variation.
[Scatterplot of y versus x with the fitted line y = b0 + b1 x; the vertical deviations y_i − ŷ_i are the residuals]
The fact is that SST ≥ SSE.
So: SSR := SST − SSE ≥ 0.
SSR is taken as a measure of “variation accounted for” in the fitting of the line.
The coefficient of determination R² is defined as:
\[
R^2 = \frac{SSR}{SST}
\]
Obviously 0 ≤ R² ≤ 1; the closer R² is to 1, the better the linear fit.
Example 6.4.5 Olympics - long jump, continued
\[
SST = \sum_{i=1}^{n} y_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} y_i\right)^2 = 1406.109 - \frac{175.518^2}{22} = 5.81.
\]
SSE and SSR?
  y      x    ŷ       y − ŷ    (y − ŷ)²
7.185    0   7.204    -0.019    0.000
7.341    4   7.266     0.075    0.006
7.480    8   7.328     0.152    0.023
7.601   12   7.390     0.211    0.045
7.150   20   7.513    -0.363    0.132
7.445   24   7.575    -0.130    0.017
7.741   28   7.637     0.104    0.011
7.639   32   7.699    -0.060    0.004
8.060   36   7.761     0.299    0.089
7.823   48   7.947    -0.124    0.015
7.569   52   8.009    -0.440    0.194
7.830   56   8.071    -0.241    0.058
8.122   60   8.133    -0.011    0.000
8.071   64   8.195    -0.124    0.015
8.903   68   8.257     0.646    0.417
8.242   72   8.319    -0.077    0.006
8.344   76   8.381    -0.037    0.001
8.541   80   8.443     0.098    0.010
8.541   84   8.505     0.036    0.001
8.720   88   8.567     0.153    0.024
8.670   92   8.629     0.041    0.002
8.500   96   8.691    -0.191    0.036
                      SSE =     1.107
So SSR = SST − SSE = 5.810 − 1.107 = 4.703 and R² = SSR/SST = 0.8095.
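For reference, the whole SST/SSE/SSR/R² computation can be reproduced from the tabulated data; a minimal Python/NumPy sketch (x is year − 1900 and y the long jump in m, as in the table above):

\begin{verbatim}
import numpy as np

x = np.array([0, 4, 8, 12, 20, 24, 28, 32, 36, 48, 52,
              56, 60, 64, 68, 72, 76, 80, 84, 88, 92, 96])
y = np.array([7.185, 7.341, 7.480, 7.601, 7.150, 7.445, 7.741, 7.639,
              8.060, 7.823, 7.569, 7.830, 8.122, 8.071, 8.903, 8.242,
              8.344, 8.541, 8.541, 8.720, 8.670, 8.500])

# least squares fit and fitted values
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
y_hat = b0 + b1 * x

sst = np.sum((y - y.mean())**2)   # total sum of squares
sse = np.sum((y - y_hat)**2)      # sum of squares of errors
ssr = sst - sse                   # variation accounted for by the line
print(f"SST = {sst:.3f}, SSE = {sse:.3f}, SSR = {ssr:.3f}, R^2 = {ssr/sst:.4f}")
# roughly SST = 5.81, SSE = 1.11, SSR = 4.70, R^2 = 0.81
\end{verbatim}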
Connection between R² and r
R² is SSR/SST; that is the squared sample correlation of y and ŷ.
If (and only if!) we use a linear function in x to predict y, i.e. ŷ = b0 + b1 x, the correlation between ŷ and x is ±1.
Then (and only then!) R² is equal to the squared sample correlation between y and x, i.e. to r²:
\[
R^2 = r^2 \quad \text{if and only if} \quad \hat{y} = b_0 + b_1 x
\]
Example 6.4.6 Olympics - long jump, continued
R2 = 0.8095 = (0.8997)2 = r2 .
It is possible to go beyond simply fitting a line and summarizing the goodness of fit in terms of r and R2 to
doing inference, i.e. making confidence intervals, predictions, . . . based on the line fitting. But for that, we
need a probability model.
6.4.2 Simple Linear Regression Model
In words: for input x the output y is normally distributed with mean β0 + β1 x = µy|x and standard deviation
σ.
In symbols: yi = β0 + β1 xi + εi with εi i.i.d. normal N(0, σ²).
β0, β1, and σ² are the parameters of the model and have to be estimated from the data (the data pairs (xi, yi)).
Pictorially:
[Sketch: for each x, the density of y given x is a normal curve centered on the line β0 + β1 x]
How do we get estimates for β0 , β1 , and σ 2 ?
Point estimates:
β̂0 = b0, β̂1 = b1 from the Least Squares fit (which gives β̂0 and β̂1 the name Least Squares Estimates).
And σ²?
σ 2 measures the variation around the “true” line β0 + β1 x - we don’t know that line, but only b0 + b1 x.
Should we base the estimation of σ 2 on this line?
The “right” estimator for σ² turns out to be:
\[
\hat{\sigma}^2 = \frac{1}{n-2} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \frac{SSE}{n-2}.
\]
Example 6.4.7 Olympics - long jump, continued
\[
\hat{\beta}_0 = b_0 = 7.2037 \text{ (in m)}
\]
\[
\hat{\beta}_1 = b_1 = 0.0155 \text{ (in m per year)}
\]
\[
\hat{\sigma}^2 = \frac{SSE}{n-2} = \frac{1.107}{20} = 0.055.
\]
Overall, we assume a linear regression model of the form:
y = 7.2037 + 0.0155x + ε, with ε ∼ N(0, 0.055).
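As an illustration of what this fitted model asserts, one can simulate observations from it; a minimal Python sketch (the choice of x = 100, i.e. the year 2000, and the random seed are arbitrary and only for illustration):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(1)
b0, b1, sigma2 = 7.2037, 0.0155, 0.055   # fitted parameters from above

x = 100                                   # year 2000, measured as years since 1900
eps = rng.normal(0, np.sqrt(sigma2), size=5)
y_sim = b0 + b1 * x + eps                 # five simulated long jump results

print(y_sim)   # values scattered around 7.2037 + 1.55 = 8.75 m
\end{verbatim}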