Multiple Linear Regression

AMS 572 Group #2
Outline
• Jinmiao Fu—Introduction and History
• Ning Ma—Establishing and Fitting the Model
• Ruoyu Zhou—Multiple Regression Model in Matrix Notation
• Dawei Xu and Yuan Shang—Statistical Inference for Multiple Regression
• Yu Mu—Regression Diagnostics
• Chen Wang and Tianyu Lu—Topics in Regression Modeling
• Tian Feng—Variable Selection Methods
• Hua Mo—Chapter Summary and Modern Application
Introduction
• Multiple linear regression attempts to model
the relationship between two or more
explanatory variables and a response variable
by fitting a linear equation to observed data.
Every value of the independent variable x is
associated with a value of the dependent
variable y.
Example: the relationship between an adult's health
and his or her daily intake of wheat, vegetables,
and meat.
History
Karl Pearson (1857–1936)
Lawyer, Germanist, eugenicist, mathematician, and statistician.
Contributions: the correlation coefficient, the method of moments,
Pearson's system of continuous curves, chi distance, the p-value,
statistical hypothesis testing theory, statistical decision theory,
Pearson's chi-square test, and principal component analysis.
Sir Francis Galton FRS (16 February 1822 – 17 January 1911)
Anthropologist and polymath; doctoral student: Karl Pearson.
In the late 1860s, Galton conceived the standard deviation. He created
the statistical concept of correlation and also discovered the properties
of the bivariate normal distribution and its relationship to regression
analysis.
Galton invented the use of the regression line
(Bulmer 2003, p. 184), and was the first to
describe and explain the common phenomenon of
regression toward the mean, which he first
observed in his experiments on the size of the
seeds of successive generations of sweet peas.
The publication by his cousin
Charles Darwin of The Origin of
Species in 1859 was an event
that changed Galton's life. He
came to be gripped by the work,
especially the first chapter on
"Variation under
Domestication" concerning the
breeding of domestic animals.
Adrien-Marie Legendre (18
September 1752 – 10 January 1833)
was a French mathematician. He
made important contributions to
statistics, number theory, abstract
algebra and mathematical analysis.
He developed the least
squares method, which has
broad application in linear
regression, signal processing,
statistics, and curve fitting.
Johann Carl Friedrich
Gauss (30 April 1777 –
23 February 1855) was a
German mathematician
and scientist who
contributed significantly
to many fields, including
number theory, statistics,
analysis, differential
geometry, geodesy,
geophysics, electrostatics,
astronomy and optics.
Gauss, who was 23 at the time, heard about the problem of predicting
the orbit of the newly discovered Ceres and tackled it. After three months of
intense work, he predicted a position for Ceres in
December 1801—just about a year after its first
sighting—and this turned out to be accurate within
a half-degree. In the process, he so streamlined the
cumbersome mathematics of 18th century orbital
prediction that his work—published a few years
later as Theory of Celestial Movement—remains a
cornerstone of astronomical computation.
It introduced the Gaussian gravitational constant,
and contained an influential treatment of the
method of least squares, a procedure used in all
sciences to this day to minimize the impact of
measurement error. Gauss was able to prove the
method in 1809 under the assumption of
normally distributed errors (see Gauss–Markov
theorem; see also Gaussian). The method had
been described earlier by Adrien-Marie
Legendre in 1805, but Gauss claimed that he had
been using it since 1795.
Sir Ronald Aylmer Fisher FRS (17
February 1890 – 29 July 1962)
was an English statistician,
evolutionary biologist, eugenicist
and geneticist. He was described
by Anders Hald as "a genius who
almost single-handedly created
the foundations for modern
statistical science," and Richard
Dawkins described him as "the
greatest of Darwin's successors".
In addition to "analysis of
variance", Fisher invented
the technique of maximum
likelihood and originated
the concepts of sufficiency,
ancillarity, Fisher's linear
discriminator and Fisher
information.
Probabilistic Model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i, \quad i = 1, 2, \ldots, n$$
where
• $y_i$ is the observed value of the random variable (r.v.) $Y_i$, which depends on the fixed predictor values $x_{i1}, x_{i2}, \ldots, x_{ik}$;
• $\beta_0, \beta_1, \ldots, \beta_k$ are unknown model parameters;
• $n$ is the number of observations;
• the random errors $\epsilon_i$ are i.i.d. $N(0, \sigma^2)$.
Fitting the Model
• The LS method provides estimates of the unknown model parameters
$\beta_0, \beta_1, \ldots, \beta_k$ that minimize
$$Q = \sum_{i=1}^{n} \left[ y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) \right]^2$$
• The LS estimates are obtained by setting the partial derivatives
$\partial Q / \partial \beta_j = 0$ ($j = 0, 1, \ldots, k$) and solving the resulting normal equations.
Tire tread wear vs. mileage (Example 11.1 in the textbook)
• The table gives the measurements on the groove of one tire after every 4000 miles.
• Our goal: to build a model for the relation between the mileage and the groove depth of the tire.

Mileage (in 1000 miles)   Groove Depth (in mils)
 0                        394.33
 4                        329.50
 8                        291.00
12                        255.17
16                        229.33
20                        204.83
24                        179.00
28                        163.83
32                        150.33
SAS code: fitting the model
data example;
  input mile depth @@;
  sqmile = mile*mile;
  datalines;
0 394.33 4 329.5 8 291 12 255.17 16 229.33 20 204.83 24 179 28 163.83 32 150.33
;
run;

proc reg data=example;
  model depth = mile sqmile;
run;
The fitted model from the SAS output is Depth = 386.26 - 12.77 mile + 0.172 sqmile.
Goodness of Fit of the Model
• Residuals:
$$e_i = y_i - \hat{y}_i \quad (i = 1, 2, \ldots, n)$$
where the $\hat{y}_i$ are the fitted values
$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_{i1} + \cdots + \hat{\beta}_k x_{ik} \quad (i = 1, 2, \ldots, n)$$
• An overall measure of the goodness of fit:
Error sum of squares (SSE): $\min Q = SSE = \sum_{i=1}^{n} e_i^2$
Total sum of squares (SST): $SST = \sum_{i=1}^{n} (y_i - \bar{y})^2$
Regression sum of squares (SSR): $SSR = SST - SSE$
1. Transform the Formulas to Matrix Notation
• $X = \begin{pmatrix} 1 & x_{11} & x_{12} & \cdots & x_{1k} \\ 1 & x_{21} & x_{22} & \cdots & x_{2k} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \cdots & x_{nk} \end{pmatrix}$
• $X$ holds the predictor variables.
• The first column of $X$, a column of 1's, corresponds to the constant term $\beta_0$.
  (We can treat this as $\beta_0 x_{i0}$ with $x_{i0} = 1$.)
• Finally let
$$\beta = \begin{pmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_k \end{pmatrix} \quad \text{and} \quad \hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \vdots \\ \hat{\beta}_k \end{pmatrix}$$
where $\beta$ is the (k+1)×1 vector of unknown parameters and $\hat{\beta}$ is the vector of their LS estimates.
• The formula
$$Y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$$
becomes
$$Y = X\beta + \epsilon$$
• Simultaneously, the normal equations, such as
$$\hat{\beta}_0\, n + \hat{\beta}_1 \sum_{i=1}^{n} x_{i1} + \cdots + \hat{\beta}_k \sum_{i=1}^{n} x_{ik} = \sum_{i=1}^{n} y_i,$$
become
$$X'X\hat{\beta} = X'y.$$
Solving this equation for $\hat{\beta}$ gives
$$\hat{\beta} = (X'X)^{-1} X'y$$
(if the inverse of the matrix $X'X$ exists).
2. Example 11.2 (Tire Wear Data: Quadratic Fit Using Hand Calculations)
• We will do Example 11.1 again in this part using the matrix approach.
• For the quadratic model to be fitted,
$$X = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 4 & 16 \\ 1 & 8 & 64 \\ 1 & 12 & 144 \\ 1 & 16 & 256 \\ 1 & 20 & 400 \\ 1 & 24 & 576 \\ 1 & 28 & 784 \\ 1 & 32 & 1024 \end{pmatrix} \quad \text{and} \quad y = \begin{pmatrix} 394.33 \\ 329.50 \\ 291.00 \\ 255.17 \\ 229.33 \\ 204.83 \\ 179.00 \\ 163.83 \\ 150.33 \end{pmatrix}$$
• According to the formula
$$\hat{\beta} = (X'X)^{-1} X'y$$
we need to calculate $X'X$ first and then invert it to get $(X'X)^{-1}$:
$$X'X = \begin{pmatrix} 9 & 144 & 3264 \\ 144 & 3264 & 82{,}944 \\ 3264 & 82{,}944 & 2{,}245{,}632 \end{pmatrix}, \qquad (X'X)^{-1} = \begin{pmatrix} 0.6606 & -0.0773 & 0.0019 \\ -0.0773 & 0.0140 & -0.0004 \\ 0.0019 & -0.0004 & 0.0000 \end{pmatrix}$$
• Finally, we calculate the vector of LS estimates:
$$\hat{\beta} = \begin{pmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{pmatrix} = (X'X)^{-1} X'y = \begin{pmatrix} 386.265 \\ -12.772 \\ 0.172 \end{pmatrix}$$
• Therefore, the LS quadratic model is
$$\hat{y} = 386.265 - 12.772x + 0.172x^2.$$
This is the same model we obtained in Example 11.1.
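To check the hand calculation, here is a minimal SAS/IML sketch (assuming the SAS/IML product is available; the data values are those of Example 11.1) that reproduces (X'X)^{-1} X'y:

proc iml;
  /* design matrix: intercept, mileage, mileage squared */
  x = {1  0    0,  1  4   16,  1  8   64,  1 12  144,  1 16  256,
       1 20  400,  1 24  576,  1 28  784,  1 32 1024};
  y = {394.33, 329.50, 291.00, 255.17, 229.33, 204.83, 179.00, 163.83, 150.33};
  xpx  = x` * x;              /* X'X */
  beta = inv(xpx) * x` * y;   /* LS estimates (X'X)^(-1) X'y */
  print xpx, beta;
quit;

The printed beta should match the hand-calculated estimates above up to rounding.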
Statistical Inference for
Multiple Regression
• Determine which predictor variables have
statistically significant effects
• We test the hypotheses:
$$H_{0j}: \beta_j = 0 \quad \text{vs.} \quad H_{1j}: \beta_j \neq 0$$
• If we can't reject $H_{0j}$, then $x_j$ is not a significant predictor of y.
Statistical Inference on the β's
• Review of statistical inference for Simple Linear Regression:
$$\hat{\beta}_1 \sim N\!\left(\beta_1, \frac{\sigma^2}{S_{xx}}\right) \quad \Rightarrow \quad \frac{\hat{\beta}_1 - \beta_1}{\sigma/\sqrt{S_{xx}}} \sim N(0,1)$$
$$\frac{(n-2)S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-2}$$
Combining these via $t = \dfrac{N(0,1)}{\sqrt{W/(n-2)}}$ (with $W \sim \chi^2_{n-2}$ independent of the numerator), we get
$$\frac{\hat{\beta}_1 - \beta_1}{S/\sqrt{S_{xx}}} \sim t_{n-2}$$
Statistical Inference on the β's
• What about Multiple Regression? The steps are similar:
$$\hat{\beta}_j \sim N\!\left(\beta_j, \sigma^2 V_{jj}\right) \quad \Rightarrow \quad \frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1)$$
$$\frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$$
Combining these via $t = \dfrac{N(0,1)}{\sqrt{W/[n-(k+1)]}}$, we get
$$\frac{\hat{\beta}_j - \beta_j}{S\sqrt{V_{jj}}} \sim t_{n-(k+1)}$$
Statistical Inference on the β's
• What is $V_{jj}$? Why is $\hat{\beta}_j \sim N(\beta_j, \sigma^2 V_{jj})$?
1. Mean
Recall from simple linear regression that the least squares estimators of the regression parameters are unbiased:
$$E(\hat{\beta}_0) = \beta_0, \qquad E(\hat{\beta}_1) = \beta_1$$
Here, the vector of least squares estimators is also unbiased:
$$E(\hat{\beta}) = \begin{pmatrix} E(\hat{\beta}_0) \\ \vdots \\ E(\hat{\beta}_k) \end{pmatrix} = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_k \end{pmatrix} = \beta$$
Statistical Inference on the β's
• 2. Variance
• Constant variance assumption: $V(\epsilon_i) = \sigma^2$, so
$$\mathrm{Var}(Y) = \begin{pmatrix} \sigma^2 & 0 & \cdots & 0 \\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I$$
where $I$ is the n×n identity matrix.
Statistical Inference on the β's
• $\hat{\beta} = (X^T X)^{-1} X^T Y = CY$ with $C = (X^T X)^{-1} X^T$, so
$$\mathrm{Var}(\hat{\beta}) = \mathrm{Var}(CY) = C\,\mathrm{Var}(Y)\,C^T = (X^T X)^{-1} X^T (\sigma^2 I)\left[(X^T X)^{-1} X^T\right]^T = \sigma^2 (X^T X)^{-1}$$
• Let $V_{jj}$ be the jth diagonal entry of the matrix $(X^T X)^{-1}$; then
$$\mathrm{Var}(\hat{\beta}_j) = \sigma^2 V_{jj}$$
Statistical Inference on the β's
• Summing up, $E(\hat{\beta}_j) = \beta_j$ and $\mathrm{Var}(\hat{\beta}_j) = \sigma^2 V_{jj}$, and we get
$$\hat{\beta}_j \sim N(\beta_j, \sigma^2 V_{jj}) \quad \Rightarrow \quad \frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1)$$
Statistical Inference on the β's
• As in simple linear regression, the unbiased estimator of the unknown error variance $\sigma^2$ is
$$S^2 = \frac{\sum_i e_i^2}{n-(k+1)} = \frac{SSE}{\text{d.f.}} = MSE$$
$$W = \frac{[n-(k+1)]S^2}{\sigma^2} = \frac{SSE}{\sigma^2} \sim \chi^2_{n-(k+1)}$$
and $S^2$ and $\hat{\beta}_j$ are statistically independent.
Statistical Inference on the β's
• Therefore,
$$\frac{\hat{\beta}_j - \beta_j}{\sigma\sqrt{V_{jj}}} \sim N(0,1), \qquad \frac{[n-(k+1)]S^2}{\sigma^2} \sim \chi^2_{n-(k+1)}$$
$$\Rightarrow \quad t = \frac{\hat{\beta}_j - \beta_j}{S\sqrt{V_{jj}}} = \frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)} \sim t_{n-(k+1)}, \qquad SE(\hat{\beta}_j) = s\sqrt{v_{jj}}$$
Statistical Inference on the β's
• Derivation of the confidence interval for $\beta_j$:
$$P\!\left(-t_{n-(k+1),\alpha/2} \le \frac{\hat{\beta}_j - \beta_j}{SE(\hat{\beta}_j)} \le t_{n-(k+1),\alpha/2}\right) = 1-\alpha$$
$$\Rightarrow\; P\!\left(\hat{\beta}_j - t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j) \le \beta_j \le \hat{\beta}_j + t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j)\right) = 1-\alpha$$
The 100(1-α)% confidence interval for $\beta_j$ is
$$\hat{\beta}_j \pm t_{n-(k+1),\alpha/2}\,SE(\hat{\beta}_j)$$
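PROC REG reports these t-statistics by default, and the CLB option adds the corresponding confidence limits for each coefficient; a short sketch using the tire data set defined earlier (this option list is a sketch, not the deck's original code):

proc reg data=example;
  model depth = mile sqmile / clb;   /* CLB: 95% confidence limits for each beta */
run;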
Statistical Inference on the β's
• An α-level test of the hypotheses
$$H_{0j}: \beta_j = \beta_j^0 \quad \text{vs.} \quad H_{1j}: \beta_j \neq \beta_j^0$$
rejects $H_{0j}$ if
$$|t_j| = \left|\frac{\hat{\beta}_j - \beta_j^0}{SE(\hat{\beta}_j)}\right| > t_{n-(k+1),\alpha/2}$$
since $P(\text{Reject } H_{0j} \mid H_{0j} \text{ is true}) = P(|t_j| > c) = \alpha$ for $c = t_{n-(k+1),\alpha/2}$.
Prediction of Future Observations
• Having fitted a multiple regression model, suppose we wish to predict the future value of Y for a specified vector of predictor variables $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$.
• One way is to estimate $E(Y^*)$ by a confidence interval (CI).
Prediction of Future Observations
$$\hat{\mu}^* = \widehat{E(Y^*)} = \hat{\beta}_0 + \hat{\beta}_1 x_1^* + \cdots + \hat{\beta}_k x_k^* = (x^*)^T \hat{\beta}$$
$$\mathrm{Var}[(x^*)^T \hat{\beta}] = (x^*)^T\,\mathrm{Var}(\hat{\beta})\,x^* = \sigma^2 (x^*)^T (X^T X)^{-1} x^* = \sigma^2 (x^*)^T V x^*$$
Replacing $\sigma^2$ by its estimate $s^2 = MSE$, which has $n-(k+1)$ d.f., and using the same method as in Simple Linear Regression, a (1-α)-level CI for $\mu^*$ is given by
$$\hat{\mu}^* - t_{n-(k+1),\alpha/2}\,s\sqrt{(x^*)^T V x^*} \;\le\; \mu^* \;\le\; \hat{\mu}^* + t_{n-(k+1),\alpha/2}\,s\sqrt{(x^*)^T V x^*}$$
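A hedged SAS sketch of this prediction step: appending the new x* row with a missing response lets PROC REG report the fitted value and intervals for it; CLM gives the CI for the mean and CLI the prediction interval for an individual Y*. The new mileage value 36 and the data set names pred and both are assumptions for illustration.

data pred;
  mile = 36; sqmile = 36*36; depth = .;   /* hypothetical new point, response missing */
run;
data both;
  set example pred;                        /* the missing depth is not used in fitting */
run;
proc reg data=both;
  model depth = mile sqmile / clm cli;     /* CI for the mean and prediction interval */
run;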
• F-Test for the β's
Consider:
$$H_0: \beta_1 = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one } \beta_j \neq 0.$$
Here $H_0$ is the overall null hypothesis, which states that none of the x variables are related to y; the alternative states that at least one of them is related.
How to Build an F-Test
• The test statistic F = MSR/MSE follows an F-distribution with k and n-(k+1) d.f. The α-level test rejects $H_0$ if
$$F = \frac{MSR}{MSE} > f_{k,\,n-(k+1),\,\alpha}$$
Recall that MSE (the error mean square) is
$$MSE = \frac{\sum_{i=1}^{n} e_i^2}{n-(k+1)}$$
with n-(k+1) degrees of freedom.
The relation between F and r²
F can be written as a function of r². Using the formulas
$$SSR = r^2\,SST, \qquad SSE = (1-r^2)\,SST,$$
F can be written as
$$F = \frac{r^2\,[n-(k+1)]}{k\,(1-r^2)}$$
We see that F is an increasing function of r² and tests its significance.
Analysis of Variance (ANOVA)
The relation between SST, SSR, and SSE:
$$SST = SSR + SSE$$
where, respectively,
$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \qquad SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$
The corresponding degrees of freedom (d.f.) are:
$$d.f.(SST) = n-1, \qquad d.f.(SSR) = k, \qquad d.f.(SSE) = n-(k+1).$$
ANOVA Table for Multiple Regression

Source of Variation   Sum of Squares (SS)   Degrees of Freedom (d.f.)   Mean Square (MS)        F
Regression            SSR                   k                           MSR = SSR/k             F = MSR/MSE
Error                 SSE                   n-(k+1)                     MSE = SSE/[n-(k+1)]
Total                 SST                   n-1

This table gives a clear view of the analysis of variance for multiple regression.
Extra Sum of Squares Method for Testing Subsets of Parameters
Previously we considered the full model with k predictors. Now consider the partial model
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i \quad (i = 1, 2, \ldots, n)$$
in which the last m coefficients are set to zero. We can test these m coefficients to check their significance:
$$H_0: \beta_{k-m+1} = \cdots = \beta_k = 0 \quad \text{vs.} \quad H_1: \text{at least one of } \beta_{k-m+1}, \ldots, \beta_k \neq 0.$$
Building the F-Test Using the Extra Sum of Squares Method
Let $SSR_{k-m}$ and $SSE_{k-m}$ be the regression and error sums of squares for the partial model. Since SST is fixed regardless of the particular model,
$$SST = SSR_{k-m} + SSE_{k-m} = SSR_k + SSE_k$$
and therefore
$$SSE_{k-m} - SSE_k = SSR_k - SSR_{k-m}$$
The α-level F-test rejects the null hypothesis if
$$F = \frac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} > f_{m,\,n-(k+1),\,\alpha}$$
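In SAS this extra sum of squares F-test can be carried out with PROC REG's TEST statement; a sketch for the cement data (example1, defined later in the deck), testing that β3 and β4 are jointly zero:

proc reg data=example1;
  model y = x1 x2 x3 x4;
  ExtraSS: test x3 = 0, x4 = 0;   /* F-test that beta3 = beta4 = 0 */
run;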
Remarks on the F-test
The numerator d.f. is m, the number of coefficients set to zero, while the denominator d.f. is n-(k+1), the error d.f. for the full model.
The MSE in the denominator is the normalizing factor, an estimate of σ² for the full model. If the ratio is large, we reject H0.
Links between ANOVA and the Extra Sum of Squares Method
Taking m = k, so that the partial model contains only the intercept, we have
$$SSE_0 = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SST, \qquad SSE_k = SSE$$
From the above we can derive
$$SSE_0 - SSE_k = SST - SSE = SSR$$
Hence the F-ratio equals
$$F = \frac{SSR/k}{SSE/[n-(k+1)]} = \frac{MSR}{MSE}$$
with k and n-(k+1) d.f.
5 Regression Diagnostics
5.1 Checking the Model Assumptions
Plots of the residuals against individual
predictor variables: check for linearity
A plot of the residuals against fitted values:
check for constant variance
A normal plot of the residuals:
check for normality
A run chart of the residuals: check whether the random errors are autocorrelated.
Plots of the residuals against any omitted predictor variables: check whether any of the omitted predictor variables should be included in the model.
Examples (figures and SAS code shown on the slides): plots of the residuals against
individual predictor variables, a plot of the residuals against the fitted values, and a
normal plot of the residuals.
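The code itself is not reproduced in this transcript; a minimal sketch of how such residual plots can be produced for the tire model (the output data set name diag is an assumption):

proc reg data=example;
  model depth = mile sqmile;
  output out=diag p=fitted r=resid;     /* save fitted values and residuals */
run;

proc sgplot data=diag;                  /* residuals vs. fitted values */
  scatter x=fitted y=resid;
  refline 0 / axis=y;
run;

proc univariate data=diag normal;       /* normal (Q-Q) plot of the residuals */
  var resid;
  qqplot resid / normal(mu=est sigma=est);
run;

Plotting resid against mile (or any omitted predictor) in the same way covers the remaining checks in the list above.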
5.2 Checking for Outliers and Influential Observations
• Standardized residuals:
$$e_i^* = \frac{e_i}{SE(e_i)} = \frac{e_i}{s\sqrt{1-h_{ii}}}$$
Large $|e_i^*|$ values indicate outlier observations.
• Hat matrix:
$$H = X(X'X)^{-1}X'$$
If the hat matrix diagonal $h_{ii} > \dfrac{2(k+1)}{n}$, then the ith observation is influential.
Examples (figures shown on the slides): graphical exploration of outliers and a leverage plot.
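Both quantities can be obtained from PROC REG's OUTPUT statement; a hedged sketch using the cement data (example1, defined later) with its n = 13 observations and k = 2 predictors, applying the cutoffs given above:

proc reg data=example1;
  model y = x1 x2;
  output out=infl student=std_res h=leverage;   /* standardized residuals and hat diagonals */
run;

data flagged;
  set infl;
  if abs(std_res) > 2 or leverage > 2*(2+1)/13 then flag = 1;   /* |e*| > 2 or h > 2(k+1)/n */
run;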
5.3 Data Transformation
Transformations of the variables (both y and the x's) are often necessary to satisfy the assumptions of linearity, normality, and constant error variance. Many seemingly nonlinear models can be written in the multiple linear regression form after a suitable transformation. For example,
$$y = \beta_0 x_1^{\beta_1} x_2^{\beta_2}$$
becomes, after a log transformation,
$$\log y = \log \beta_0 + \beta_1 \log x_1 + \beta_2 \log x_2$$
or
$$y^* = \beta_0^* + \beta_1^* x_1^* + \beta_2^* x_2^*$$
Multicollinearity
• Multicollinearity occurs when two or more predictors
in the model are correlated and provide redundant
information about the response.
• Examples of multicollinear predictors are the height and weight of a person, years of education and income, and the assessed value and square footage of a home.
• Consequences of high multicollinearity:
a. Increased standard errors of the estimates of the β's
b. Often confusing and misleading results.
Detecting Multicollinearity
• Easy way: compute correlations between all pairs of predictors. If some r are close to 1 or -1, remove one of the two correlated predictors from the model.
(The slide shows the correlation matrix of predictors X1, X2, X3, with the pairwise correlations off the diagonal; a correlation equal to 1 means two predictors are collinear, while a correlation of 0 means they are independent.)
Detecting Multicollinearity
• Another way: calculate the variance inflation factor for each predictor $x_j$:
$$VIF_j = \frac{1}{1 - R_j^2}$$
where $R_j^2$ is the coefficient of determination of the model that includes all predictors except the jth predictor.
• If $VIF_j \ge 10$, then there is a problem of multicollinearity.
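PROC CORR and PROC REG's VIF option give both checks directly; a short sketch for the cement data (example1, defined later in the deck):

proc corr data=example1;
  var x1 x2 x3 x4;                /* pairwise correlations among the predictors */
run;

proc reg data=example1;
  model y = x1 x2 x3 x4 / vif;    /* variance inflation factor for each predictor */
run;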
Multicollinearity-Example
• See Example 11.5 on page 416. The response is the heat of cement on a per gram basis (y), and the predictors are tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).
Multicollinearity-Example
• Estimated parameters in the first-order model:
  ŷ = 62.4 + 1.55x1 + 0.510x2 + 0.102x3 - 0.144x4.
• F = 111.48 with p-value below 0.0001. Individual t-statistics and p-values: 2.08 (0.071), 0.70 (0.501), 0.14 (0.896), -0.20 (0.844).
• Note that the sign on β4 is the opposite of what is expected, and the very high F would suggest more than just one significant predictor.
Multicollinearity-Example
• Correlations: r13 = -0.824 and r24 = -0.973. Also, the VIFs were all greater than 10. So there is a multicollinearity problem in this model, and we need a variable selection algorithm to help us choose the necessary variables.
Multicollinearity-Subset Selection
• Algorithms for Selecting Subsets
– All possible subsets
• Only feasible with a small number of potential predictors (maybe 10 or fewer)
• Then one or more of the possible numerical criteria can be used to find the overall best subset
– Leaps and bounds method
• Identifies the best subsets for each value of p
• Requires fewer variables than observations
• Can be quite effective for medium-sized data sets
• Advantage of having several slightly different models to compare
Multicollinearity-Subset Selection
– Forward stepwise regression
• Start with no predictors
– First include predictor with highest correlation with response
– In subsequent steps add predictors with highest partial correlation with
response controlling for variables already in equations
– Stop when numerical criterion signals maximum (minimum)
– Sometimes eliminate variables when t value gets too small
• Only possible method for very large predictor pools
• Local optimization at each step, no guarantee of finding
overall optimum
– Backward elimination
• Start with all predictors in equation
– Remove predictor with smallest t value
– Continue until numerical criterion signals maximum (minimum)
• Often produces different final model than forward stepwise
method
Multicollinearity-Best Subsets Criteria
• Numerical Criteria for Choosing Best Subsets
– No single generally accepted criterion
• Should not be followed too mindlessly
– Most common criteria combine a measure of fit with a penalty for increasing complexity (number of predictors)
– Coefficient of determination
• Ordinary multiple R-square
• Always increases with an increasing number of predictors, so not very good for comparing models with different numbers of predictors
– Adjusted R-square
• Will decrease if the increase in R-square with increasing p is small
Multicollinearity-Best Subsets Criteria
– Residual mean square (MSEp)
• Equivalent to adjusted R-square, except that we look for the minimum
• The minimum occurs when an added variable doesn't decrease the error sum of squares enough to offset the loss of an error degree of freedom
– Mallows' Cp statistic
• Should be about equal to p; look for small values near p
• Requires an estimate of the overall error variance
– PRESS statistic
• The model associated with the minimum value of PRESSp is chosen
• Intuitively easier to grasp than the Cp criterion
Multicollinearity-Forward Stepwise
• First include the predictor with the highest correlation with the response (it enters if its F-statistic exceeds FIN = 4; output shown on the slide).
Multicollinearity-Forward Stepwise
• In subsequent steps, add the predictor with the highest partial correlation with the response, controlling for the variables already in the equation (enter Xi if Fi > FIN = 4; remove Xi if Fi < FOUT = 4).
Multicollinearity-Forward Stepwise
(Output shown on the slide: at each step, predictors with F > FIN = 4 are entered and predictors with F < FOUT = 4 are removed.)
Multicollinearity-Forward Stepwise
• Summarizing the stepwise algorithm: our "best model" should include only x1 and x2, namely
  y = 52.5773 + 1.4683x1 + 0.6623x2
Multicollinearity-Forward Stepwise
• Checking the significance of the model and of the individual parameters again, we find that the p-values are all small and each VIF is far less than 10.
Multicollinearity-Best Subsets
• Alternatively, we can stop when the numerical criterion signals a maximum (minimum), and sometimes eliminate variables when their t-values get too small.
Multicollinearity-Best Subsets
• The largest R-squared value, 0.9824, is associated with the full model.
• The best subset that minimizes the Cp criterion includes x1 and x2.
• The subset that maximizes adjusted R-squared (or, equivalently, minimizes MSEp) is x1, x2, x4. The adjusted R-squared increases only from 0.9744 to 0.9763 by the addition of x4 to the model already containing x1 and x2.
• Thus the simpler model chosen by the Cp criterion is preferred; the fitted model is
  y = 52.5773 + 1.4683x1 + 0.6623x2
Polynomial Models
• Polynomial models are useful in situations where the analyst knows that curvilinear effects are present in the true response function.
• We can do this with more than one explanatory variable using a polynomial regression model, e.g., the second-order model in two variables
$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \epsilon$$
Multicollinearity-Polynomial Models
• Multicollinearity is a problem in polynomial
regression (with terms of second and higher order): x
and x2 tend to be highly correlated.
• A special solution in polynomial models is to use $z_i = x_i - \bar{x}$ instead of just $x_i$. That is, first subtract each predictor from its mean and then use the deviations in the model.
Multicollinearity – Polynomial Models
• Example: x = 2, 3, 4, 5, 6 and x² = 4, 9, 16, 25, 36. As x increases, so does x²; $r_{x,x^2} = 0.98$.
• With $\bar{x} = 4$, z = -2, -1, 0, 1, 2 and z² = 4, 1, 0, 1, 4. Thus z and z² are no longer correlated: $r_{z,z^2} = 0$.
• We can get the estimates of the β's from the estimates of the γ's (the coefficients of the centered model), since substituting z = x - x̄ into the centered model and expanding expresses it in terms of the original x.
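A hedged sketch of the centering step in SAS (the data set poly and the variable names x and y are placeholders, not from the deck):

proc standard data=poly mean=0 out=centered;
  var x;                 /* center x at its mean; x now holds z = x - xbar */
run;

data centered;
  set centered;
  z2 = x*x;              /* squared centered term */
run;

proc reg data=centered;
  model y = x z2;        /* gamma estimates for the centered polynomial model */
run;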
Dummy Predictor Variable
The dummy variable is a simple and
useful method of introducing into a
regression analysis information
contained in variables that are not
conventionally measured on a numerical
scale, e.g., race, gender, region, etc.
Dummy Predictor Variable
• The categories of an ordinal variable could be
assigned suitable numerical scores.
• A nominal variable with c≥2 categories can be
coded using c – 1 indicator variables, X1,…,Xc-1,
called dummy variables.
• Xi = 1 for the ith category and 0 otherwise
• X1 = ... = Xc-1 = 0 for the cth category
Dummy Predictor Variable
• If y is a worker's salary and
  Di = 1 if a non-smoker, Di = 0 if a smoker,
we can model this in the following way:
$$y_i = \alpha + \beta D_i + u_i$$
Dummy Predictor Variable
• Equally, we could use the dummy variable in a model with other explanatory variables. In addition to the dummy variable we could also add years of experience (x), to give:
$$y_i = \alpha + \beta D_i + \gamma x_i + u_i$$
$$E(y_i) = (\alpha + \beta) + \gamma x \quad \text{for a non-smoker}$$
$$E(y_i) = \alpha + \gamma x \quad \text{for a smoker}$$
Dummy Predictor Variable
y
Non-smoker
Smoker
α+β
α
x
95
Dummy Predictor Variable
• We can also add the interaction between smoking and experience with respect to their effects on salary:
$$y_i = \alpha + \beta D_i + \gamma x_i + \delta D_i x_i + u_i$$
$$E(y_i) = (\alpha + \beta) + (\gamma + \delta) x \quad \text{for a non-smoker}$$
$$E(y_i) = \alpha + \gamma x \quad \text{for a smoker}$$
Dummy Predictor Variable
y
Non-smoker
Smoker
α+β
α
x
97
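A hedged SAS sketch of coding the dummy and interaction terms (the data set workers and the variables smoker, experience, and salary are hypothetical):

data workers2;
  set workers;
  d  = (smoker = 'no');    /* 1 for non-smokers, 0 for smokers */
  dx = d * experience;     /* interaction of smoking status and experience */
run;

proc reg data=workers2;
  model salary = d experience dx;
run;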
Standardized Regression Coefficients
• We typically want to compare predictors in terms of the magnitudes of their effects on the response variable.
• We use standardized regression coefficients to judge the effects of predictors with different units.
Standardized Regression Coefficients
• They are the LS parameter estimates obtained by running a regression on standardized variables, defined as follows:
$$y_i^* = \frac{y_i - \bar{y}}{s_y}, \qquad x_{ij}^* = \frac{x_{ij} - \bar{x}_j}{s_{x_j}} \quad (i = 1, 2, \ldots, n;\; j = 1, 2, \ldots, k)$$
• where $s_y$ and $s_{x_j}$ are the sample SDs of $y_i$ and $x_j$.
Standardized Regression Coefficients
• Let $\hat{\beta}_0^* = 0$
• and $\hat{\beta}_j^* = \hat{\beta}_j \left(\dfrac{s_{x_j}}{s_y}\right) \quad (j = 1, 2, \ldots, k)$
• The magnitudes of the $\hat{\beta}_j^*$ can be directly compared to judge the relative effects of the $x_j$ on y.
Standardized Regression Coefficients
• Since $\hat{\beta}_0^* = 0$, the constant can be dropped from the model. Let $y^*$ be the vector of the $y_i^*$'s and $X^*$ the matrix of the $x_{ij}^*$'s. Then
$$\frac{1}{n-1} X^{*\prime} X^* = R = \begin{pmatrix} 1 & r_{x_1 x_2} & \cdots & r_{x_1 x_k} \\ r_{x_2 x_1} & 1 & \cdots & r_{x_2 x_k} \\ \vdots & & \ddots & \vdots \\ r_{x_k x_1} & r_{x_k x_2} & \cdots & 1 \end{pmatrix}, \qquad \frac{1}{n-1} X^{*\prime} y^* = r = \begin{pmatrix} r_{y x_1} \\ r_{y x_2} \\ \vdots \\ r_{y x_k} \end{pmatrix}$$
Standardized Regression Coefficients
• So we can get
$$\hat{\beta}^* = \begin{pmatrix} \hat{\beta}_1^* \\ \vdots \\ \hat{\beta}_k^* \end{pmatrix} = (X^{*\prime} X^*)^{-1} X^{*\prime} y^* = R^{-1} r$$
• This method of computing the $\hat{\beta}_j^*$'s is numerically more stable than computing the $\hat{\beta}_j$'s directly, because all entries of R and r are between -1 and 1.
Standardized Regression Coefficients
• Example (given on page 424 of the textbook)
• From the calculation, we can obtain
$$\hat{\beta}_1 = 0.19244, \qquad \hat{\beta}_2 = 0.3406$$
and the sample standard deviations of x1, x2, and y are
$$s_{x_1} = 6.830, \qquad s_{x_2} = 0.641, \qquad s_y = 1.501$$
• Then we have
$$\hat{\beta}_1^* = \hat{\beta}_1\left(\frac{s_{x_1}}{s_y}\right) = 0.875, \qquad \hat{\beta}_2^* = \hat{\beta}_2\left(\frac{s_{x_2}}{s_y}\right) = 0.105$$
• Note that $\hat{\beta}_1^* > \hat{\beta}_2^*$, although $\hat{\beta}_1 < \hat{\beta}_2$. Thus x1 has a larger effect than x2 on y.
Standardized Regression Coefficients
• We can also use the matrix method to compute standardized regression coefficients.
• First we compute the correlation matrix between x1, x2, and y:
$$R = \begin{pmatrix} 1 & 0.913 \\ 0.913 & 1 \end{pmatrix}, \qquad r = \begin{pmatrix} 0.971 \\ 0.904 \end{pmatrix}$$
• Next calculate
$$R^{-1} = \frac{1}{1 - r_{x_1 x_2}^2}\begin{pmatrix} 1 & -r_{x_1 x_2} \\ -r_{x_1 x_2} & 1 \end{pmatrix} = \begin{pmatrix} 6.009 & -5.486 \\ -5.486 & 6.009 \end{pmatrix}$$
• Hence
$$\hat{\beta}^* = \begin{pmatrix} \hat{\beta}_1^* \\ \hat{\beta}_2^* \end{pmatrix} = R^{-1} r = \begin{pmatrix} 0.875 \\ 0.105 \end{pmatrix}$$
which is the same result as before.
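In SAS, the STB option on PROC REG's MODEL statement reports these standardized coefficients directly; a short sketch (the data set name mydata is a placeholder):

proc reg data=mydata;
  model y = x1 x2 / stb;   /* STB prints the standardized regression coefficients */
run;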
How to decide their salaries?

Player          Age   Position   Experience   Goals per year   Salary
Lionel Messi    23    Attacker   5 years      more than 20     10,000,000 EURO/yr
Carles Puyol    32    Defender   11 years     less than 1      5,000,000 EURO/yr
How to select variables?
• 1) Stepwise Regression
• 2) Best Subset Regression
Stepwise Regression
• Partial F-test
• Partial Correlation Coefficients
• How to do it by SAS?
• Drawbacks
Partial F-test
(p-1)-Variable Model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \epsilon_i$$
p-Variable Model:
$$Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{p-1} x_{i,p-1} + \beta_p x_{i,p} + \epsilon_i$$
How to do the test?
$$H_{0p}: \beta_p = 0 \quad \text{vs.} \quad H_{1p}: \beta_p \neq 0$$
We reject $H_{0p}$ in favor of $H_{1p}$ at level α if
$$F_p = \frac{(SSE_{p-1} - SSE_p)/1}{SSE_p/[n-(p+1)]} > f_{\alpha,\,1,\,n-(p+1)}$$
Another way to interpret the test:
• Test statistic:
$$t_p = \frac{\hat{\beta}_p}{SE(\hat{\beta}_p)}, \qquad t_p^2 = F_p$$
• We reject $H_{0p}$ at level α if
$$|t_p| > t_{n-(p+1),\,\alpha/2}$$
Partial Correlation Coefficients
$$r^2_{yx_p|x_1,\ldots,x_{p-1}} = \frac{SSE_{p-1} - SSE_p}{SSE_{p-1}} = \frac{SSE(x_1,\ldots,x_{p-1}) - SSE(x_1,\ldots,x_p)}{SSE(x_1,\ldots,x_{p-1})}$$
Test statistic:
$$F_p = t_p^2 = \frac{r^2_{yx_p|x_1,\ldots,x_{p-1}}\,[n-(p+1)]}{1 - r^2_{yx_p|x_1,\ldots,x_{p-1}}}$$
• Add $x_p$ to the regression equation that includes $x_1, \ldots, x_{p-1}$ only if $F_p$ is large enough.
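PROC REG can display these squared partial correlations with the PCORR2 model option (a sketch, using the cement data example1 defined just below; PCORR2 reports the squared partial correlation of each predictor with y given all the others):

proc reg data=example1;
  model y = x1 x2 x3 x4 / pcorr2;
run;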
How to do it in SAS? (Ex 11.9, a continuation of Ex 11.5)
The table shows data on the heat evolved in calories during the hardening of cement on a per gram basis (y) along with the percentages of four ingredients: tricalcium aluminate (x1), tricalcium silicate (x2), tetracalcium alumino ferrite (x3), and dicalcium silicate (x4).

No.   X1   X2   X3   X4      Y
 1     7   26    6   60    78.5
 2     1   29   15   52    74.3
 3    11   56    8   20   104.3
 4    11   31    8   47    87.6
 5     7   52    6   33    95.9
 6    11   55    9   22   109.2
 7     3   71   17    6   102.7
 8     1   31   22   44    72.5
 9     2   54   18   22    93.1
10    21   47    4   26   115.9
11     1   40   23   34    83.8
12    11   66    9   12   113.3
13    10   68    8   12   109.4
SAS Code
data example1;
  input x1 x2 x3 x4 y;
  datalines;
7 26 6 60 78.5
1 29 15 52 74.3
11 56 8 20 104.3
11 31 8 47 87.6
7 52 6 33 95.9
11 55 9 22 109.2
3 71 17 6 102.7
1 31 22 44 72.5
2 54 18 22 93.1
21 47 4 26 115.9
1 40 23 34 83.8
11 66 9 12 113.3
10 68 8 12 109.4
;
run;

proc reg data=example1;
  model y = x1 x2 x3 x4 / selection=stepwise;
run;
(SAS stepwise output shown on the slides.)
Interpretation
• At the first step, x4 is chosen into the equation, as it has the largest correlation with y among the 4 predictors.
• At the second step, we choose x1 into the equation, as it has the highest partial correlation with y controlling for x4.
• At the third step, since $r_{yx_2|x_4,x_1}$ is greater than $r_{yx_3|x_4,x_1}$, x2 is chosen into the equation rather than x3.
Interpretation
• At the 4th step, we removed x4 from the model since its partial F-statistic is too small.
• From Ex 11.5, we know that x4 is highly correlated with x2. Note that in Step 4 the R-square is 0.9787, which is slightly higher than 0.9725, the R-square of Step 2. This indicates that even though x4 is the best single predictor of y, the pair (x1, x2) is a better predictor than the pair (x1, x4).
Drawbacks
• The final model is not guaranteed to be optimal in any specified sense.
• It yields a single final model, while in practice there are often several equally good models.
Best Subset Regression
• Comparison to Stepwise Method
• Optimality Criteria
• How to do it by SAS?
Comparison to Stepwise Regression
• In best subsets regression, a subset of variables is chosen that optimizes a well-defined objective criterion.
• The best subsets algorithm permits determination of a specified number of best subsets, from which the choice of the final model can be made by the investigator.
Optimality Criteria
• $r_p^2$-Criterion:
$$r_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST}$$
• Adjusted $r_p^2$-Criterion:
$$r_{adj,p}^2 = 1 - \frac{SSE_p/[n-(p+1)]}{SST/(n-1)} = 1 - \frac{MSE_p}{MST}$$
Optimality Criteria
• $C_p$-Criterion:
The standardized mean square error of prediction is
$$\Gamma_p = \frac{1}{\sigma^2}\sum_{i=1}^{n} E\!\left[\hat{Y}_{ip} - E(Y_i)\right]^2$$
$\Gamma_p$ involves unknown parameters such as the $\beta_j$'s, so we minimize a sample estimate of $\Gamma_p$, Mallows' $C_p$-statistic:
$$C_p = \frac{SSE_p}{\hat{\sigma}^2} + 2(p+1) - n$$
Optimality Criteria
• In practice, we often use the $C_p$-criterion because of its ease of computation and its ability to judge the predictive power of a model.
How to do it in SAS? (Ex 11.9)
proc reg data=example1;
  model y = x1 x2 x3 x4 / selection=adjrsq mse cp;
run;
(SAS best-subsets output shown on the slide.)
Interpretation
• The best subset which minimizes the $C_p$-criterion is {x1, x2}, the same model selected using stepwise regression in the previous example.
• The subset which maximizes $r^2_{adj,p}$ is {x1, x2, x4}. However, $r^2_{adj,p}$ increases only from 0.9744 to 0.9763 by the addition of x4 to the model which already contains x1 and x2.
• Thus, the model chosen by the $C_p$-criterion is preferred.
Model (Extension of Simple Regression):
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik} + \epsilon_i$$
Multiple Regression Model: $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ are unknown parameters.

Fitting the MLR Model (least squares method):
$$Q = \sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik})\right]^2$$
$$\frac{\partial Q}{\partial \beta_0} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right] = 0$$
$$\frac{\partial Q}{\partial \beta_j} = -2\sum_{i=1}^{n}\left[y_i - (\beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik})\right]x_{ij} = 0$$
Goodness of fit of the model: $r^2 = \dfrac{SSR}{SST}$

MLR Model in Matrix Notation:
$$Y = X\beta + \epsilon, \qquad \hat{\beta} = (X'X)^{-1}X'Y$$
Statistical Inference for Multiple Regression

Statistical inference on the β's:
Hypotheses: $H_{0j}: \beta_j = 0$ vs. $H_{1j}: \beta_j \neq 0$
Test statistic: $T = \dfrac{\hat{\beta}_j - \beta_j}{S\sqrt{v_{jj}}} = \dfrac{Z}{\sqrt{W/[n-(k+1)]}} \sim t_{n-(k+1)}$

Overall F-test:
Hypotheses: $H_0: \beta_1 = \cdots = \beta_k = 0$ vs. $H_a:$ at least one $\beta_j \neq 0$
Test statistic: $F = \dfrac{MSR}{MSE} = \dfrac{r^2\{n-(k+1)\}}{k(1-r^2)}$

Regression Diagnostics: residual analysis, data transformation.
The General Hypothesis Test:
Compare the full model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} + \epsilon_i$
with the partial model $Y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_{k-m} x_{i,k-m} + \epsilon_i$
Hypotheses: $H_0: \beta_{k-m+1} = \cdots = \beta_k = 0$ vs. $H_a:$ at least one of these $\beta_j \neq 0$
Test statistic: $F_0 = \dfrac{(SSE_{k-m} - SSE_k)/m}{SSE_k/[n-(k+1)]} \sim f_{m,\,n-(k+1)}$
Reject $H_0$ when $F_0 > f_{m,\,n-(k+1),\,\alpha}$

Estimating and Predicting Future Observations:
Let $x^* = (x_0^*, x_1^*, \ldots, x_k^*)'$ and $\mu^* = E(Y^*) = \beta_0 + \beta_1 x_1^* + \cdots + \beta_k x_k^* = x^{*\prime}\beta$
Pivotal quantity: $T = \dfrac{\hat{\mu}^* - \mu^*}{s\sqrt{x^{*\prime}Vx^*}} \sim t_{n-(k+1)}$
CI for the estimated mean $\mu^*$: $\hat{\mu}^* \pm t_{n-(k+1),\,\alpha/2}\,s\sqrt{x^{*\prime}Vx^*}$
PI for a future $Y^*$: $\hat{Y}^* \pm t_{n-(k+1),\,\alpha/2}\,s\sqrt{1 + x^{*\prime}Vx^*}$
Topics in Regression Modeling:
• Multicollinearity
• Polynomial Regression
• Dummy Predictor Variables
• Logistic Regression Model

Variable Selection Methods:
• Stepwise regression:
  partial F-test $F_p = \dfrac{(SSE_{p-1} - SSE_p)/1}{SSE_p/[n-(p+1)]}$
  partial correlation coefficient $r^2_{yx_p|x_1,\ldots,x_{p-1}} = \dfrac{SSE_{p-1} - SSE_p}{SSE_{p-1}}$
• Stepwise Regression Algorithm
• Best Subsets Regression
Strategy for building a MLR model
Application of the MLR model
Linear regression is widely used in biology, chemistry, finance, and the social sciences to describe possible relationships between variables. It ranks as one of the most important tools used in these disciplines.
(Diagram: multiple linear regression at the center, surrounded by application areas: financial markets, biology, housing prices, heredity, chemistry.)
Example
Broadly speaking, an asset pricing model can be expressed as
$$r_i = a_i + b_{1i} f_1 + b_{2i} f_2 + \cdots + b_{ki} f_k + \epsilon_i$$
where $r_i$, $f_k$, and $k$ denote the expected return on asset i, the kth risk factor, and the number of risk factors, respectively; $\epsilon_i$ denotes the specific return on asset i.
The equation can also be expressed in matrix notation; the matrix of coefficients $b$ is called the factor loading matrix.
Possible risk factors:
• Inflation rate
• GDP
• Interest rate
• Rate of return on the market portfolio
• Employment rate
• Government policies
Method
• Step 1: Find the efficient factors
(EM algorithms, maximum likelihood)
• Step 2: Fit the model and estimate the factor
loading
(Multiple linear regression)
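A hedged sketch of step 2 in SAS (the data set returns and the factor series f1-f3 are hypothetical placeholders):

proc reg data=returns;
  model r = f1 f2 f3;   /* the estimated coefficients are the factor loadings */
run;

The reported R-square is the coefficient of multiple determination referred to on the next slide.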
• By fitting the multiple linear regression and running the data in SAS, we can get the factor loadings and the coefficient of multiple determination r².
• From the SAS output we can identify the factors that most affect the return and then build an appropriate multiple-factor model.
• We can use the model to predict the future return and make a good choice!
Thank you