Estimation of regression function

732G21/732A35/732G28

732G21 Sambandsmodeller
http://www.ida.liu.se/~732G21
One semester = Regression analysis + analysis of variance (teacher: Lotta Hallberg)
732G28 Regression methods
http://www.ida.liu.se/~732G28
Half of semester = Regression analysis
732A35 Linear statistical models
http://www.ida.liu.se/~732A22
Almost one semester = Regression analysis + analysis of variance (teacher: Lotta Hallberg)

Course language: English, but you may use Swedish

We use It’s learning (accessed via Student portal) (show…)

9 Lectures

8 computer labs. Deadlines are around 5 days after each lab ends

8 Lessons=I solve problems on the whiteboard + lab discussion

One written final exam

Course book: Kutner, M.H., Nachtsheim, C.J., Neter, J. and Li, W. Applied
Linear Statistical Models with Student Data CD, 5th Edition, ISBN
0073108742.

Linear statistical models are widely used in
◦ Business
◦ Economics
◦ Engineering
◦ Social, biological sciences
◦ Etc
Example:
A database contains the prices of houses sold in Linköping in 2009, together with their
age, size and other parameters.
◦ Given the parameters of a new house:
  ◦ determine its approximate market price
  ◦ determine reasonable price bounds

Analysis of databases
No   Area (X1)   Age (X2)   Price (Y)
1    320         14         2,530,000
2    210         1          1,800,000
…    …           …          …

Observations (records, cases) in rows

Variables in columns
◦ Explanatory variables (predictors, inputs) Xi
◦ Response Y, we assume Y=f(X1,…,Xn)
In this lecture, models with only one explanatory variable
Real data can seldom be described exactly by Y=βX (because of observation
errors, missing inputs, etc.)
Example: Age and salary for a sample of eight persons from
a company.
Age (x)   Salary (y)
21        17
32        30
40        27
56        35
61        44
55        38
39        36
33        25
[Figure: scatterplot of Salary (y) against Age (x)]

The relation shown is almost linear
Linear regression analysis: find a linear function as close as
possible to the data
[Figure: scatterplot of Salary (y) against Age (x) with the fitted line y = 0.5471x + 8.4545]

For each X, there is a probability distribution P(Y=y|X=x) of Y

The aim is to find a regression function E(Y|X=x)
Construction of regression models



Selection of prediction variables (variance reduction)
Functional form (from theory, approximation)
Domain of the model
Software
◦ MINITAB
◦ SAS
◦ SPSS
◦ Matlab
◦ Excel
Formal statement
$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$

$Y_i$ is the i-th response value
$\beta_0$, $\beta_1$ are the model parameters (regression parameters: intercept and slope)
$X_i$ is the i-th predictor value
$\varepsilon_i$ are i.i.d. random variables with expectation zero and variance $\sigma^2$
Features (show…)
$E(Y_i) = \beta_0 + \beta_1 X_i$
$\sigma^2(Y_i) = \sigma^2$

Any two responses $Y_i$ and $Y_j$, $i \neq j$, are uncorrelated
Meaning of regression parameters
◦ $\beta_0$: mean response at X = 0
◦ $\beta_1$: change in $E(Y)$ per unit increase in X
Given data set
$S = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$
Method of least squares:
◦ Observed response: $Y_i$
◦ Estimated response: $\beta_0 + \beta_1 X_i$
◦ Deviation: $Y_i - (\beta_0 + \beta_1 X_i)$
The regression fit is good when the deviations are small overall (see picture), so we minimize the sum of squared deviations
$Q = \sum_{i=1}^{n} (Y_i - \beta_0 - \beta_1 X_i)^2$
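A small Python sketch of the criterion Q, evaluated on the salary data (ages and salaries as read from the age/salary example). The first two candidate lines are hypothetical choices made up for comparison, not part of the course material; the third uses the rounded coefficients of the fitted line y = 0.5471x + 8.4545.

```python
# Salary data transcribed from the age/salary example (eight persons).
ages = [21, 32, 40, 56, 61, 55, 39, 33]
salaries = [17, 30, 27, 35, 44, 38, 36, 25]


def sum_of_squares(beta0, beta1, xs, ys):
    """Least-squares criterion Q: sum of squared deviations from the line."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical candidate lines, chosen only for comparison (assumption).
q_flat = sum_of_squares(31.5, 0.0, ages, salaries)   # horizontal line at the mean salary
q_guess = sum_of_squares(10.0, 0.5, ages, salaries)  # a rough guess by eye

# Rounded coefficients of the fitted line from the slides.
q_fitted = sum_of_squares(8.4545, 0.5471, ages, salaries)

print(f"Q(flat) = {q_flat:.2f}, Q(guess) = {q_guess:.2f}, Q(fitted) = {q_fitted:.2f}")
```

The fitted line should give the smallest Q of the three; any other choice of intercept and slope can only increase it.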

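How to find the minimum of Q?
$\frac{\partial Q}{\partial \beta_0} = 0, \qquad \frac{\partial Q}{\partial \beta_1} = 0$
Estimators of $\beta_0$ and $\beta_1$
$b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
$b_0 = \bar{Y} - b_1 \bar{X}$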
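The estimator formulas can be checked "by hand" with a few lines of Python. This is only a sketch using the standard library; the function name `least_squares_fit` is an illustrative choice, and the data are the ages and salaries from the salary example.

```python
# Salary data transcribed from the age/salary example (eight persons).
ages = [21, 32, 40, 56, 61, 55, 39, 33]
salaries = [17, 30, 27, 35, 44, 38, 36, 25]


def least_squares_fit(xs, ys):
    """Return (b0, b1) for simple linear regression of ys on xs."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope estimator b1.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar
    return b0, b1


b0, b1 = least_squares_fit(ages, salaries)
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}")  # prints: b0 = 8.4545, b1 = 0.5471
```

The result agrees with the fitted line y = 0.5471x + 8.4545 shown for the salary data.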
Exercise (For salary data, MINITAB):
1. Make a scatterplot (Scatterplot…, with and without regression line)
2. Perform regression using ”Regression…”
3. Perform regression using ”Fitted line plot..”
4. Calculate coefficients by hand
[Figure: fitted line plot for the salary data, Salary (y) against Age (x), with the fitted line y = 0.5471x + 8.4545]
Gauss-Markov theorem



Estimators $b_0$ and $b_1$ are unbiased and have minimum
variance among all unbiased linear estimators
Unbiased: bias $= E(b_0) - \beta_0 = 0$, i.e. $E(b_0) = \beta_0$
Analogously, $E(b_1) = \beta_1$
Show illustration…
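One possible illustration is a small simulation, sketched here in Python. Everything in it is an assumption made for the example: the "true" parameters, the normal error distribution, and the competing "endpoint" estimator (the slope of the line through the observations with the smallest and largest X, which is also linear and unbiased). Only the X values are taken from the salary example.

```python
import random
import statistics

xs = [21, 32, 40, 56, 61, 55, 39, 33]  # ages from the salary example, held fixed
BETA0, BETA1, SIGMA = 8.0, 0.5, 4.0    # hypothetical true parameters (assumption)


def least_squares_slope(xs, ys):
    """Least-squares slope estimator b1."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    return s_xy / s_xx


def endpoint_slope(xs, ys):
    """Competing linear unbiased estimator: slope through the two extreme-X points."""
    i_min = xs.index(min(xs))
    i_max = xs.index(max(xs))
    return (ys[i_max] - ys[i_min]) / (xs[i_max] - xs[i_min])


rng = random.Random(0)  # fixed seed so the run is reproducible
ls_slopes, ep_slopes = [], []
for _ in range(5000):
    # Simulate responses from Y_i = beta0 + beta1 * X_i + eps_i with normal errors.
    ys = [BETA0 + BETA1 * x + rng.gauss(0.0, SIGMA) for x in xs]
    ls_slopes.append(least_squares_slope(xs, ys))
    ep_slopes.append(endpoint_slope(xs, ys))

mean_ls = statistics.mean(ls_slopes)
mean_ep = statistics.mean(ep_slopes)
var_ls = statistics.variance(ls_slopes)
var_ep = statistics.variance(ep_slopes)
print(f"least squares: mean = {mean_ls:.4f}, variance = {var_ls:.5f}")
print(f"endpoint:      mean = {mean_ep:.4f}, variance = {var_ep:.5f}")
```

Both averages should be close to the assumed slope 0.5 (unbiasedness), while the least-squares slope should show the smaller variance, as the theorem states.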

Mean (expected) response
$E(Y) = \beta_0 + \beta_1 X$
Point estimator of the mean response (fitted value)
$\hat{Y} = b_0 + b_1 X$
Residuals
$e_i = Y_i - \hat{Y}_i$
Plot of residuals (obtain it with MINITAB)
[Figure: residuals plotted against Age]

Properties of residuals
1. $\sum_{i=1}^{n} e_i = 0$ (because $\partial Q / \partial \beta_0 = 0$)
2. $\sum_{i=1}^{n} e_i^2$ is the minimum possible
3. $\sum_{i=1}^{n} Y_i = \sum_{i=1}^{n} \hat{Y}_i$ (because of 1)
4. $\sum_{i=1}^{n} X_i e_i = 0$, $\sum_{i=1}^{n} \hat{Y}_i e_i = 0$ (can be shown)
5. The regression line always goes through $(\bar{X}, \bar{Y})$
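These properties can be checked numerically. The following Python sketch does so for the salary data (ages and salaries from the salary example); the variable names are illustrative, and the sums are zero only up to floating-point rounding.

```python
# Salary data transcribed from the age/salary example (eight persons).
ages = [21, 32, 40, 56, 61, 55, 39, 33]
salaries = [17, 30, 27, 35, 44, 38, 36, 25]

n = len(ages)
x_bar = sum(ages) / n
y_bar = sum(salaries) / n

# Least-squares estimates b1 and b0.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, salaries))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar

fitted = [b0 + b1 * x for x in ages]                    # fitted values
residuals = [y - f for y, f in zip(salaries, fitted)]   # e_i = Y_i - fitted value

sum_e = sum(residuals)                                  # property 1
sum_y, sum_fitted = sum(salaries), sum(fitted)          # property 3
sum_xe = sum(x * e for x, e in zip(ages, residuals))    # property 4, first sum
sum_fe = sum(f * e for f, e in zip(fitted, residuals))  # property 4, second sum
line_at_x_bar = b0 + b1 * x_bar                         # property 5

print(f"sum e_i          = {sum_e:.2e}")
print(f"sum Y_i, sum fit = {sum_y:.4f}, {sum_fitted:.4f}")
print(f"sum X_i e_i      = {sum_xe:.2e}")
print(f"sum fit_i e_i    = {sum_fe:.2e}")
print(f"line at mean X   = {line_at_x_bar:.4f}, mean Y = {y_bar:.4f}")
```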

Estimate of variance of single population (sample variance)
$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

In regression, we compute $s^2$ using residuals (look at residual plot)
$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} e_i^2$
$s^2 = MSE = \frac{SSE}{n-2}$


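Why divide by n-2? Because then MSE is unbiased: $E(MSE) = \sigma^2$
Important: in general, the unbiased estimator is
$s^2 = MSE = \frac{SSE}{n-d}$
where d is the number of estimated model parameters, so SSE has n-d degrees of freedom
Example: Compute residuals, SSE, MSE, and find them in the MINITAB output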
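For comparison with the MINITAB output, SSE and MSE for the salary data can be sketched in plain Python as follows. The data are the ages and salaries from the salary example; d = 2 because the model has two estimated parameters (intercept and slope), and the variable names are illustrative.

```python
# Salary data transcribed from the age/salary example (eight persons).
ages = [21, 32, 40, 56, 61, 55, 39, 33]
salaries = [17, 30, 27, 35, 44, 38, 36, 25]

n = len(ages)
d = 2  # number of estimated parameters: b0 and b1
x_bar = sum(ages) / n
y_bar = sum(salaries) / n

# Least-squares estimates.
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(ages, salaries))
      / sum((x - x_bar) ** 2 for x in ages))
b0 = y_bar - b1 * x_bar

# Residuals e_i = Y_i - (b0 + b1 * X_i).
residuals = [y - (b0 + b1 * x) for x, y in zip(ages, salaries)]

sse = sum(e ** 2 for e in residuals)  # error sum of squares
mse = sse / (n - d)                   # unbiased estimate of sigma^2

# Ordinary sample variance of the salaries, ignoring age, for comparison.
sample_variance = sum((y - y_bar) ** 2 for y in salaries) / (n - 1)

print(f"SSE = {sse:.3f}, MSE = {mse:.3f}, sample variance of Y = {sample_variance:.3f}")
```

MSE should come out well below the plain sample variance of the salaries, which reflects how much of the variation in salary the regression on age accounts for.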

Minitab
◦ Graph → Scatterplot
◦ Stat → Regression
◦ Stat → Fitted Line Plot

Course book, Ch. 1 up to page 27.