ESTIMATING A REGRESSION LINE

F. Chiaromonte
This course is about REGRESSION ANALYSIS:
• Constructing quantitative descriptions of the statistical association between y (response variable) and x (predictor, or explanatory variable) from the sample data.
• Introducing models, so that estimates of, and inferences on, the parameters of these descriptions can be interpreted in relation to the underlying population.
MULTIPLE regression, when we consider more than one predictor variable.
Fitting a line through bi-variate sample data (as a descriptor)
Least squares fit: find the line that minimizes the sum
of squared (vertical) distances from the sample points.
$y = \beta_0 + \beta_1 x$  (generic equation of a line)

$$\min_{\beta_0,\,\beta_1} \left\{ \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \right\} \qquad \text{(objective function)}$$
Normal equations (obtained by setting the derivatives of the objective function to zero):
$$\sum_{i=1}^{n} y_i = n\beta_0 + \beta_1 \sum_{i=1}^{n} x_i$$

$$\sum_{i=1}^{n} x_i y_i = \beta_0 \sum_{i=1}^{n} x_i + \beta_1 \sum_{i=1}^{n} x_i^2$$
Solution (unique)
$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
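As a minimal numerical sketch of these formulas (assuming NumPy and a small made-up dataset; the data values are purely illustrative):

```python
import numpy as np

# Illustrative (made-up) bivariate sample
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

x_bar, y_bar = x.mean(), y.mean()

# Least squares slope and intercept from the closed-form solution
b1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
b0 = y_bar - b1 * x_bar

print(b0, b1)
# Same result as np.polyfit(x, y, deg=1), which minimizes the same sum of squares
```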
$\hat{y}_i = b_0 + b_1 x_i$  (fitted value)
$e_i = y_i - \hat{y}_i$  (residual)
for sample point $i = 1, \dots, n$
Geometric properties of the least squares line:

$$\sum_{i=1}^{n} e_i = 0 \qquad \sum_{i=1}^{n} e_i^2 = \min \qquad \sum_{i=1}^{n} \hat{y}_i = \sum_{i=1}^{n} y_i \qquad \sum_{i=1}^{n} x_i e_i = 0 \qquad \sum_{i=1}^{n} \hat{y}_i e_i = 0$$

[Figure: scatter plot of y against x with the fitted line $y = b_0 + b_1 x$ passing through $(\bar{x}, \bar{y})$, with a residual $e$ and a fitted value $\hat{y}$ marked.]
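A short sketch, under the same assumptions (NumPy, illustrative data), that checks these identities numerically:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

y_hat = b0 + b1 * x      # fitted values
e = y - y_hat            # residuals

# Geometric properties of the least squares line (up to rounding error)
assert np.isclose(e.sum(), 0)                    # residuals sum to zero
assert np.isclose(y_hat.sum(), y.sum())          # fitted values and responses have equal sums
assert np.isclose((x * e).sum(), 0)              # residuals orthogonal to the predictor
assert np.isclose((y_hat * e).sum(), 0)          # residuals orthogonal to the fitted values
assert np.isclose(b0 + b1 * x.mean(), y.mean())  # line passes through (x-bar, y-bar)
```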
Simple linear regression MODEL:
Assume the sample data is generated as
$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

$x_i,\ i = 1, \dots, n$: fixed (or conditioned on)
$\varepsilon_i,\ i = 1, \dots, n$: random errors such that
$E(\varepsilon_i) = 0\ \forall i$  (no systematic component)
$\mathrm{var}(\varepsilon_i) = \sigma^2\ \forall i$  (constant variance)
$\mathrm{cor}(\varepsilon_i, \varepsilon_j) = 0\ \forall i \neq j$  (uncorrelated)
The values of y (given various values of x) scatter about a line, with constant variance and no correlation among the departures from the line associated with different observations…
…quite simplistic, but very useful in many applications!
Note: distribution of errors is unspecified for now.
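A sketch of how data satisfying these assumptions could be simulated (parameter values and sample size are arbitrary choices for illustration; normal errors are used only as a convenient way to simulate, since the model itself does not specify the error distribution):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
beta0, beta1, sigma = 1.0, 0.5, 0.3   # illustrative population parameters
x = np.linspace(0, 10, n)             # fixed predictor values

# Errors: mean zero, constant variance sigma^2, uncorrelated
eps = rng.normal(0.0, sigma, size=n)

y = beta0 + beta1 * x + eps           # responses scatter about the line
```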
If we assumed a bell-shaped distribution for the errors (which we will do later!), here is how the “population” picture would look:

$E(y_i) = \beta_0 + \beta_1 x_i$
$\mathrm{var}(y_i) = \sigma^2$
$\mathrm{cor}(y_i, y_j) = 0,\ \forall i \neq j$
Interpretation of parameters:
β1=slope; change in the mean of y when x changes by one unit.
β0=intercept; if x=0 is a meaningful value, mean of y when x=0.
σ2=error variance; scatter of y about the regression line.
Under the assumptions of our simple linear regression model, the slope and intercept of the least squares line are (point) estimates of the population slope and intercept, with the following very important properties:
GAUSS-MARKOV THEOREM: Under the conditions of the simple linear
regression model:
• b1 and b0 are unbiased estimates for β1 and β0: $E(b_1) = \beta_1$, $E(b_0) = \beta_0$
• they have the smallest variance (hence the smallest MSE, since they are unbiased) among all unbiased estimates that can be computed as linear functions of the response values.
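A small Monte Carlo sketch of the unbiasedness claim (assuming NumPy; the true parameter values are arbitrary): averaging b1 and b0 over many simulated samples should approximately recover β1 and β0.

```python
import numpy as np

rng = np.random.default_rng(1)
n, beta0, beta1, sigma = 30, 1.0, 0.5, 0.4   # illustrative values
x = np.linspace(0, 10, n)                    # fixed design

b0_draws, b1_draws = [], []
for _ in range(5000):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    b0 = y.mean() - b1 * x.mean()
    b1_draws.append(b1)
    b0_draws.append(b0)

print(np.mean(b1_draws), np.mean(b0_draws))  # close to beta1 = 0.5 and beta0 = 1.0
```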
Linearity:

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sum_{i=1}^{n} k_i y_i$$

$$b_0 = \bar{y} - b_1 \bar{x} = \sum_{i=1}^{n} \frac{1}{n}\, y_i - \sum_{i=1}^{n} k_i \bar{x}\, y_i = \sum_{i=1}^{n} \tilde{k}_i y_i$$
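A quick numerical check of this linearity representation (illustrative data, NumPy assumed): building the weights $k_i$ and $\tilde{k}_i$ explicitly and confirming that they reproduce b1 and b0.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

Sxx = np.sum((x - x.mean()) ** 2)
k = (x - x.mean()) / Sxx                 # weights giving b1
k_tilde = 1.0 / len(x) - k * x.mean()    # weights giving b0

b1 = np.sum((x - x.mean()) * (y - y.mean())) / Sxx
b0 = y.mean() - b1 * x.mean()

assert np.isclose(np.sum(k * y), b1)        # b1 is a linear function of the y_i
assert np.isclose(np.sum(k_tilde * y), b0)  # so is b0
```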
Point estimation of error variance
$$s^2 = \frac{1}{n-2} \sum_{i=1}^{n} \big( y_i - (b_0 + b_1 x_i) \big)^2 = \frac{1}{n-2} \sum_{i=1}^{n} e_i^2$$

$\sum_{i=1}^{n} e_i^2$ = SSE (error sum of squares); $s^2$ = MSE for the regression line;
$n-2$ = dof of the SSE (constraints = two parameters of the line).

Unbiased for $\sigma^2$:  $E(s^2) = \sigma^2$
Point estimation of mean response
$$\hat{E}(y \text{ at } x) = \hat{y}(x) = b_0 + b_1 x$$
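A brief sketch of these two point estimates (NumPy; the data and the value of x at which the mean response is estimated are illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

e = y - (b0 + b1 * x)        # residuals
SSE = np.sum(e ** 2)         # error sum of squares
s2 = SSE / (n - 2)           # MSE: unbiased estimate of the error variance

y_hat_at_3 = b0 + b1 * 3.0   # estimated mean response at x = 3
print(s2, y_hat_at_3)
```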
Example: Simple linear regression for y = log length ratio between human and chicken DNA, on x = log large insertion ratio, as sampled on n = 100 genome windows.
Estimates from least squares
Line parameters:
Intercept: 0.19210
Slope: 0.21777
Error variance: 0.033 on 98 dof’s
Mean responses:
$\hat{y}(1) = 0.41$
$\hat{y}(2) = 0.63$ … would you trust this?
Example: Simple linear regression for y = mortality rate due to malignant skin melanoma per 10 million people, on x = latitude, as sampled on 49 US states.

#    State        LAT    MORT
1    Alabama      33.0   219
2    Arizona      34.5   160
3    Arkansas     35.0   170
4    California   37.5   182
5    Colorado     39.0   149
...
49   Wyoming      43.0   134

[Regression plot: Mortality = 389.189 - 5.97764 Latitude; S = 19.1150, R-Sq = 68.0 %, R-Sq(adj) = 67.3 %; Mortality (100-200) plotted against Latitude (30-50).]

Estimates from least squares
Line parameters:
Intercept: 389.19
Slope: -5.977
Error variance: 365.57 on 47 dof's
Mean responses:
$\hat{y}(30) = 209.88$ … would you trust this?
$\hat{y}(40) = 150.11$
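As a quick check, plugging the reported intercept and slope into the fitted line reproduces these mean responses:

$$\hat{y}(30) = 389.19 - 5.977 \cdot 30 = 209.88, \qquad \hat{y}(40) = 389.19 - 5.977 \cdot 40 = 150.11.$$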
Maximum likelihood estimation under normality
$x_i,\ i = 1, \dots, n$: fixed (or conditioned on)

Assume (error distribution): $\varepsilon_i,\ i = 1, \dots, n$ iid $N(0, \sigma^2)$

Then $y_i \sim N(\beta_0 + \beta_1 x_i,\ \sigma^2)$ and independent, $i = 1, \dots, n$
Likelihood function:

$$L(\beta_0, \beta_1, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{ -\frac{1}{2\sigma^2} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \right\} = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \big( y_i - (\beta_0 + \beta_1 x_i) \big)^2 \right\}$$
$$\max_{\beta_0,\,\beta_1,\,\sigma^2} L(\beta_0, \beta_1, \sigma^2) \qquad \text{(objective function)}$$

Solution (unique):

$$b_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x} \qquad \text{same as from the least squares fit!}$$

$$\tilde{s}^2 = \frac{1}{n} \sum_{i=1}^{n} \big( y_i - (b_0 + b_1 x_i) \big)^2 = \frac{n-2}{n}\, s^2$$
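A numerical sketch of this equivalence (assuming NumPy and SciPy; data are illustrative): minimizing the negative log-likelihood recovers the least squares b0 and b1, together with an error-variance estimate that uses divisor n rather than n - 2.

```python
import numpy as np
from scipy.optimize import minimize

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
n = len(x)

def neg_log_likelihood(theta):
    beta0, beta1, log_sigma2 = theta        # log-parameterize sigma^2 to keep it positive
    sigma2 = np.exp(log_sigma2)
    resid = y - (beta0 + beta1 * x)
    return 0.5 * n * np.log(2 * np.pi * sigma2) + np.sum(resid ** 2) / (2 * sigma2)

fit = minimize(neg_log_likelihood, x0=np.array([0.0, 0.0, 0.0]))
beta0_mle, beta1_mle, sigma2_mle = fit.x[0], fit.x[1], np.exp(fit.x[2])

# Compare with the least squares solution and with SSE / n
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
s2_tilde = np.sum((y - (b0 + b1 * x)) ** 2) / n

print(beta0_mle, beta1_mle, sigma2_mle)
print(b0, b1, s2_tilde)   # should approximately agree
```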
Some remarks
• A strong statistical association between y and x does not automatically imply causation (e.g. a functional relationship of linear form); x can “proxy” for the real causal variable, perhaps in a spurious fashion!
• In observational studies, x (likewise y) is not controlled by the researchers; we
“condition” on the observed values of x. In experimental studies, x is controlled;
we can consider the values of x as fixed (although assignment of x levels to units
may be arranged at random in an experimental design). Experimental design
facilitates causality assessment.
• Extrapolating a statistical association (e.g. a regression line) outside the range
of the data on which it was “fitted” is dangerous; we don’t really know how the
association would be shaped where we didn’t get to look! With an experimental
design we can make sure we cover the range that is of interest.