Model Building III – Remedial Measures

KNNL – Chapter 11
Unequal (Independent) Error Variances –
Weighted Least Squares (WLS)
• Case 1 – Error Variances known exactly (VERY rare)
• Case 2 – Error Variances known up to a constant
– Occasionally information is available on the experimental units regarding the relative magnitudes of their variances (unusual)
– If the “observations” are means of different numbers of units (each with equal variance) at the various X levels, the variance of observation i is σ²/nᵢ, where nᵢ is known
• Case 3 – Variance (or Standard Deviation) is related to one or more predictors, and the relation can be modeled (see the Breusch-Pagan Test in Chapter 3)
• Case 4 – Ordinary Least Squares with estimated (robust) variances
WLS – Case 1 Known Variances - I

$$Y_i = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1} + \varepsilon_i \qquad \varepsilon_i \sim N\left(0,\sigma_i^2\right),\ i = 1,\ldots,n \qquad \sigma\{\varepsilon_i,\varepsilon_j\} = 0 \ (i \neq j)$$

$$\sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix}$$
Maximum Likelihood Estimation:
$$L(\boldsymbol{\beta}) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left[-\frac{1}{2\sigma_i^2}\left(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}\right)^2\right]$$

Setting $w_i = \dfrac{1}{\sigma_i^2}$:

$$L(\boldsymbol{\beta}) = \left[\prod_{i=1}^{n} \left(\frac{w_i}{2\pi}\right)^{1/2}\right] \exp\left[-\frac{1}{2}\sum_{i=1}^{n} w_i \left(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}\right)^2\right]$$

To maximize $L(\boldsymbol{\beta})$, we need to minimize

$$Q_w = \sum_{i=1}^{n} w_i \left(Y_i - \beta_0 - \beta_1 X_{i1} - \cdots - \beta_{p-1} X_{i,p-1}\right)^2$$

Note that observations with smaller $\sigma_i^2$ have larger weights $w_i$ in the weighted least squares criterion.
WLS – Case 1 Known Variances - II
Easiest to set up in matrix form, where:

$$\mathbf{W} = \begin{bmatrix} w_1 & 0 & \cdots & 0 \\ 0 & w_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & w_n \end{bmatrix} \qquad \sigma^2\{\mathbf{Y}\} = \sigma^2\{\boldsymbol{\varepsilon}\} = \mathbf{W}^{-1}$$

Normal Equations:

$$(\mathbf{X'WX})\,\mathbf{b}_w = \mathbf{X'WY} \;\Rightarrow\; \mathbf{b}_w = (\mathbf{X'WX})^{-1}\mathbf{X'WY} = \mathbf{AY} \qquad \mathbf{A} = (\mathbf{X'WX})^{-1}\mathbf{X'W}$$

$$E\{\mathbf{b}_w\} = \mathbf{A}E\{\mathbf{Y}\} = \mathbf{AX}\boldsymbol{\beta} = (\mathbf{X'WX})^{-1}\mathbf{X'WX}\boldsymbol{\beta} = \boldsymbol{\beta}$$

$$\sigma^2\{\mathbf{b}_w\} = \mathbf{A}\,\sigma^2\{\mathbf{Y}\}\,\mathbf{A'} = (\mathbf{X'WX})^{-1}\mathbf{X'W}\mathbf{W}^{-1}\mathbf{WX}(\mathbf{X'WX})^{-1} = (\mathbf{X'WX})^{-1}$$
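As an illustration, here is a minimal NumPy sketch of the Case 1 computations; the data and the known variances σᵢ² are hypothetical, and the variable names are illustrative only.

```python
import numpy as np

# Hypothetical data: n = 5 observations, one predictor
X = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
Y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])
sigma2 = np.array([0.5, 0.5, 1.0, 2.0, 4.0])    # known error variances (Case 1)

W = np.diag(1.0 / sigma2)                        # weights w_i = 1 / sigma_i^2

# Normal equations: (X'WX) b_w = X'WY
XtWX = X.T @ W @ X
XtWY = X.T @ W @ Y
b_w = np.linalg.solve(XtWX, XtWY)                # weighted least squares estimates

var_b_w = np.linalg.inv(XtWX)                    # sigma^2{b_w} = (X'WX)^{-1}
print(b_w, np.sqrt(np.diag(var_b_w)))
```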
WLS – Case 2 – Variance Known up to Constant - I
Data are means of unequal numbers of replicates at the X levels; the same approach applies whenever the weights are known up to a proportionality constant.

$$\sigma^2\{\bar{Y}_i\} = \sigma_i^2 = \frac{\sigma^2}{r_i} \qquad r_i = \text{number of replicates at the } i\text{th level of X}$$

$$\sigma^2\{\mathbf{Y}\} = \sigma^2 \begin{bmatrix} \tfrac{1}{r_1} & 0 & \cdots & 0 \\ 0 & \tfrac{1}{r_2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \tfrac{1}{r_n} \end{bmatrix} \qquad w_i = \frac{1}{\sigma_i^2} = \frac{r_i}{\sigma^2} \;\Rightarrow\; \mathbf{W} = \frac{1}{\sigma^2}\mathbf{W}^*, \quad \mathbf{W}^* = \begin{bmatrix} r_1 & 0 & \cdots & 0 \\ 0 & r_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & r_n \end{bmatrix}$$

$$\mathbf{X'WX} = \frac{1}{\sigma^2}\mathbf{X'W^*X} \;\Rightarrow\; (\mathbf{X'WX})^{-1} = \sigma^2(\mathbf{X'W^*X})^{-1} \qquad \mathbf{X'WY} = \frac{1}{\sigma^2}\mathbf{X'W^*Y}$$

$$\mathbf{b}_w = (\mathbf{X'WX})^{-1}\mathbf{X'WY} = (\mathbf{X'W^*X})^{-1}\mathbf{X'W^*Y}$$

$$MSE_w = \frac{\sum_{i=1}^{n} w_i^*\left(Y_i - \hat{Y}_i\right)^2}{n-p} \qquad \sigma^2\{\mathbf{b}_w\} = \sigma^2(\mathbf{X'W^*X})^{-1} \qquad s^2\{\mathbf{b}_w\} = MSE_w\,(\mathbf{X'W^*X})^{-1}$$
WLS – Case 2 – Variance Known up to Constant - II
Japanese study of Rosuvastatin for patients with high cholesterol. n = 6 doses; the number of patients per dose (r_i) varied: (15, 17, 12, 14, 18, 13).
After preliminary plots, it was determined that Y = change in LDL and X = ln(Dose).
Data (dose-group means):

DoseGrp   Dose   ln(Dose)   ChgLDL (Y)   r_i
1         1      0.0000     -35.8        15
2         2.5    0.9163     -45.0        17
3         5      1.6094     -52.7        12
4         10     2.3026     -49.7        14
5         20     2.9957     -58.2        18
6         40     3.6889     -66.0        13

With X = [1, ln(Dose)] and W* = diag(15, 17, 12, 14, 18, 13):

$$\mathbf{X'W^*X} = \begin{bmatrix} 89 & 169.005 \\ 169.005 & 458.0243 \end{bmatrix} \qquad (\mathbf{X'W^*X})^{-1} = \begin{bmatrix} 0.037538 & -0.01385 \\ -0.01385 & 0.007294 \end{bmatrix} \qquad \mathbf{X'W^*Y} = \begin{bmatrix} -4535.8 \\ -9624.3 \end{bmatrix}$$

$$\mathbf{b}_w = (\mathbf{X'W^*X})^{-1}\mathbf{X'W^*Y} = \begin{bmatrix} -36.96 \\ -7.38 \end{bmatrix}$$

Fitted values and weighted residuals:

Y        Yhat_w    e_w    r_i·(e_w)²
-35.8    -36.96     1.2     20.14
-45.0    -43.72    -1.3     27.99
-52.7    -48.83    -3.9    179.82
-49.7    -53.94     4.2    251.82
-58.2    -59.05     0.9     13.11
-66.0    -64.17    -1.8     43.75

$$\sum_i r_i\,e_{w,i}^2 = 536.63 \qquad n - p = 4 \qquad MSE_w = \frac{536.63}{4} = 134.16$$

$$s^2\{\mathbf{b}_w\} = MSE_w\,(\mathbf{X'W^*X})^{-1} = \begin{bmatrix} 5.04 & -1.86 \\ -1.86 & 0.98 \end{bmatrix} \qquad s\{b_{w0}\} = 2.24, \quad s\{b_{w1}\} = 0.99$$
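For readers who want to verify the computations, here is a short NumPy sketch that reproduces the quantities above from the tabulated dose, change-in-LDL, and replicate counts (no values beyond those on the slide are assumed):

```python
import numpy as np

lnDose = np.array([0.0000, 0.9163, 1.6094, 2.3026, 2.9957, 3.6889])
Y      = np.array([-35.8, -45.0, -52.7, -49.7, -58.2, -66.0])     # mean change in LDL per dose group
r      = np.array([15, 17, 12, 14, 18, 13])                        # replicates (patients) per dose

X      = np.column_stack([np.ones(6), lnDose])
Wstar  = np.diag(r.astype(float))                                  # W* = diag(r_i)

XtWX   = X.T @ Wstar @ X                  # [[89, 169.005], [169.005, 458.02]]
XtWY   = X.T @ Wstar @ Y                  # [-4535.8, -9624.3]
b_w    = np.linalg.solve(XtWX, XtWY)      # approx [-36.96, -7.38]

e_w    = Y - X @ b_w                      # residuals of the group means
MSE_w  = (r * e_w**2).sum() / (len(Y) - 2)          # approx 134.16
s2_b_w = MSE_w * np.linalg.inv(XtWX)                # approx [[5.04, -1.86], [-1.86, 0.98]]
print(b_w, MSE_w, np.sqrt(np.diag(s2_b_w)))         # s{b_w} approx (2.24, 0.99)
```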
WLS - Case 3 – Estimated Variances
• Use squared residuals (estimated variances) or absolute residuals (estimated standard deviations) from OLS to model their levels as functions of the predictor variables
– Plot the residuals, squared residuals, and absolute residuals versus the fitted values and the predictors
– Use the model-building techniques previously applied to Y to obtain a model for either the variances or the standard deviations as functions of the Xs or of the mean
– Fit estimated WLS with the estimated weights (1/variances)
– Iterate until the regression coefficients are stable (iteratively reweighted least squares)

$$\hat{w}_i = \frac{1}{\hat{v}_i} \ \text{(estimated variance function)} \qquad \text{or} \qquad \hat{w}_i = \frac{1}{\left(\hat{s}_i\right)^2} \ \text{(estimated standard-deviation function)}$$

$$\mathbf{b}_{\hat{w}} = \left(\mathbf{X'\hat{W}X}\right)^{-1}\mathbf{X'\hat{W}Y} \qquad MSE_{\hat{w}} = \frac{\left(\mathbf{Y}-\mathbf{X}\mathbf{b}_{\hat{w}}\right)'\hat{\mathbf{W}}\left(\mathbf{Y}-\mathbf{X}\mathbf{b}_{\hat{w}}\right)}{n-p}$$
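A minimal sketch of one version of this procedure, modeling the absolute OLS residuals as a linear function of the predictor and iterating; the data and the choice of |e| (rather than e²) are illustrative assumptions:

```python
import numpy as np

def lsq(X, Y, W=None):
    """(Weighted) least squares fit; returns the coefficient vector."""
    if W is None:
        W = np.eye(len(Y))
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# Hypothetical data whose spread grows with x
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 40)
Y = 3 + 2 * x + rng.normal(scale=0.5 * x)        # error SD roughly proportional to x
X = np.column_stack([np.ones_like(x), x])

b = lsq(X, Y)                                    # Step 0: ordinary least squares
for _ in range(5):                               # iterate until coefficients stabilize
    e = Y - X @ b
    g = lsq(X, np.abs(e))                        # model |e_i| (estimated SD) as a function of x
    s_hat = np.clip(X @ g, 1e-6, None)
    W_hat = np.diag(1.0 / s_hat**2)              # estimated weights w_hat_i = 1 / s_hat_i^2
    b = lsq(X, Y, W_hat)                         # estimated (iteratively reweighted) WLS
print(b)
```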
Case 4 – OLS with Estimated Variances

$$\sigma^2\{\mathbf{Y}\} = \sigma^2\{\boldsymbol{\varepsilon}\} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \qquad E\{\mathbf{Y}\} = \mathbf{X}\boldsymbol{\beta}$$

$$\mathbf{b} = (\mathbf{X'X})^{-1}\mathbf{X'Y} = \mathbf{AY} \qquad \mathbf{A} = (\mathbf{X'X})^{-1}\mathbf{X'}$$

$$\sigma^2\{\mathbf{b}\} = \mathbf{A}\,\sigma^2\{\mathbf{Y}\}\,\mathbf{A'} = (\mathbf{X'X})^{-1}\mathbf{X'}\,\sigma^2\{\boldsymbol{\varepsilon}\}\,\mathbf{X}(\mathbf{X'X})^{-1}$$

Note: $E\{e_i^2\} \approx \sigma_i^2$. Use $e_i^2$ as an estimator of $\sigma_i^2$:

$$s^2\{\mathbf{b}\} = (\mathbf{X'X})^{-1}\mathbf{X'}\mathbf{S}_0\mathbf{X}(\mathbf{X'X})^{-1} \qquad \mathbf{S}_0 = \begin{bmatrix} e_1^2 & 0 & \cdots & 0 \\ 0 & e_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & e_n^2 \end{bmatrix}$$

This is referred to as White's estimator, and it is robust to misspecification of the form of the error variance.
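A minimal NumPy sketch of White's estimator as written above (the HC0 form); the data are hypothetical:

```python
import numpy as np

def white_se(X, Y):
    """OLS coefficients with White's heteroskedasticity-robust standard errors (HC0)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ Y                        # ordinary least squares
    e = Y - X @ b                                # OLS residuals
    S0 = np.diag(e**2)                           # S_0 = diag(e_1^2, ..., e_n^2)
    cov_b = XtX_inv @ X.T @ S0 @ X @ XtX_inv     # (X'X)^-1 X' S_0 X (X'X)^-1
    return b, np.sqrt(np.diag(cov_b))

# Hypothetical data with non-constant error variance
rng = np.random.default_rng(1)
x = np.linspace(0, 5, 50)
X = np.column_stack([np.ones_like(x), x])
Y = 1 + 0.8 * x + rng.normal(scale=0.2 + 0.3 * x)
print(white_se(X, Y))
```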
Multicollinearity – Remedial Measures - I
• Goal: Prediction – Multicollinearity not an issue if new
cases have similar multicollinearity among predictors
– A firm uses shipment size (number of pallets) and weight (tons) to predict unloading times. If future shipments have a similar correlation between the predictors, predictions and prediction intervals remain valid
• Linear Combinations of Predictors can be generated
that are uncorrelated (Principal Components)
– Good: Multicollinearity gone
– Bad: New variables may not have “physical” interpretation
• Use of Cross-Section and Time Series data in Economics
– Model: Demand = f(Income, Price)
– Step 1: Cross-section (single time point): regress Demand on Income (obtain b_I)
– Step 2: Time series: regress the residuals from Step 1 on Price (obtain b_P)
Multicollinearity – Ridge Regression - I
• Mean Square Error of an Estimator = Variance + Bias²
• Goal: Add Bias to estimator to reduce MSE(Estimator)
• Makes use of Standard Regression Model with Correlation Transformation

$$MSE\left\{b^R\right\} = E\left\{\left(b^R - \beta\right)^2\right\} = \sigma^2\left\{b^R\right\} + \left[E\left\{b^R\right\} - \beta\right]^2 = \text{Variance} + (\text{Bias})^2$$

Correlation Transformation / Standardized Regression Model:

$$Y_i^* = \frac{Y_i - \bar{Y}}{\sqrt{n-1}\,s_Y} \qquad X_{ik}^* = \frac{X_{ik} - \bar{X}_k}{\sqrt{n-1}\,s_k} \qquad Y_i^* = \beta_1^* X_{i1}^* + \cdots + \beta_{p-1}^* X_{i,p-1}^* + \varepsilon_i^* \;\Rightarrow\; \mathbf{r_{XX}}\,\mathbf{b} = \mathbf{r_{YX}}$$

Ridge Regression Estimator (with $c \geq 0$):

$$\left(\mathbf{r_{XX}} + c\mathbf{I}\right)\mathbf{b}^R = \mathbf{r_{YX}} \;\Rightarrow\; \mathbf{b}^R = \left(\mathbf{r_{XX}} + c\mathbf{I}\right)^{-1}\mathbf{r_{YX}} = \begin{bmatrix} b_1^R \\ \vdots \\ b_{p-1}^R \end{bmatrix}$$

Goal: choose a small c such that the estimators stabilize (flatten) and the VIFs get small
Multicollinearity – Ridge Regression - II
Computational Formulas:
Variance Inflation Factors:
Diagonal elements of the matrix:

$$\left(\mathbf{r_{XX}} + c\mathbf{I}\right)^{-1}\mathbf{r_{XX}}\left(\mathbf{r_{XX}} + c\mathbf{I}\right)^{-1}$$

Error Sum of Squares (Correlation-Transformed Data):

$$SSE_R = \sum_{i=1}^{n}\left(Y_i^* - \hat{Y}_i^*\right)^2 \qquad \hat{Y}_i^* = b_1^R X_{i1}^* + \cdots + b_{p-1}^R X_{i,p-1}^*$$

$$R_R^2 = 1 - SSE_R \quad \text{since } SSTO_R = 1$$
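A minimal sketch of the ridge computations on correlation-transformed data, combining the estimator, the VIF formula, and R²_R from the two slides above; the data and the grid of c values are illustrative assumptions:

```python
import numpy as np

def correlation_transform(Z):
    """Apply the correlation transformation to each column of Z."""
    n = Z.shape[0]
    return (Z - Z.mean(axis=0)) / (np.sqrt(n - 1) * Z.std(axis=0, ddof=1))

# Hypothetical data with two highly correlated predictors
rng = np.random.default_rng(2)
x1 = rng.normal(size=60)
x2 = x1 + rng.normal(scale=0.1, size=60)            # nearly collinear with x1
Y = 2 + x1 + 0.5 * x2 + rng.normal(size=60)

Xs = correlation_transform(np.column_stack([x1, x2]))
Ys = correlation_transform(Y.reshape(-1, 1)).ravel()
rXX = Xs.T @ Xs                                      # correlation matrix of the predictors
rYX = Xs.T @ Ys                                      # correlations of Y with each predictor

# Ridge trace: watch the coefficients, VIFs, and R^2_R as c increases from 0
for c in [0.0, 0.01, 0.05, 0.1, 0.5]:
    M = np.linalg.inv(rXX + c * np.eye(2))
    bR = M @ rYX                                     # b^R = (r_XX + cI)^-1 r_YX
    VIF = np.diag(M @ rXX @ M)                       # diag of (r_XX + cI)^-1 r_XX (r_XX + cI)^-1
    R2_R = 1 - np.sum((Ys - Xs @ bR) ** 2)           # R^2_R = 1 - SSE_R, since SSTO_R = 1
    print(c, np.round(bR, 3), np.round(VIF, 1), round(R2_R, 3))
```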
Robust Regression for Influential Cases
• Influential cases, once ruled out as recording errors or as indications that new predictors are needed, can have their impact on the regression model reduced
• Two commonly applied Robust Regression Models:
– Least Absolute Residuals (LAR) Regression – Choose the coefficients that minimize the sum of absolute deviations (as opposed to squared deviations in OLS). There is no closed-form solution; specialized programs must be used (it can be run in R)
– IRLS Robust Regression – Uses Iteratively Re-weighted Least Squares, where the weights are based on how much of an outlier each case is (lower weights for larger outliers)
IRLS – Robust Regression
1. Choose weight function (we will use Huber here)
2. Obtain starting weights for each case.
3. Use starting weights in Weighted Least Squares and obtain residuals after fitting equation
4. Use Residuals from step 3 to generate new weights.
5. Continue iterations to convergence (small changes in estimated regression coefficients or residuals or fitted values)
Applying the procedure with the Huber weight function (1.345 is a tuning parameter chosen for high efficiency relative to the normal model):

1. Huber weight function:

$$w = \begin{cases} 1 & |u| \leq 1.345 \\[4pt] \dfrac{1.345}{|u|} & |u| > 1.345 \end{cases}$$

2. Use the initial residuals from the OLS fit, after scaling by the Median Absolute Deviation (a robust estimate of σ):

$$MAD = \frac{1}{0.6745}\,\text{median}\left\{\left|e_i - \text{median}\{e_i\}\right|\right\} \qquad u_i = \frac{e_i}{MAD}$$

3. Iterate to convergence

Note: Bisquare weight function:

$$w = \begin{cases} \left[1 - \left(\dfrac{u}{4.685}\right)^2\right]^2 & |u| \leq 4.685 \\[4pt] 0 & |u| > 4.685 \end{cases}$$
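A minimal sketch of the Huber IRLS procedure just described; the data (with one contaminated observation) and the convergence tolerance are illustrative assumptions:

```python
import numpy as np

def huber_w(u, k=1.345):
    """Huber weight: 1 for |u| <= k, k/|u| otherwise."""
    au = np.maximum(np.abs(u), 1e-12)     # guard against division by zero
    return np.minimum(1.0, k / au)

def irls_huber(X, Y, tol=1e-6, max_iter=50):
    b = np.linalg.solve(X.T @ X, X.T @ Y)            # Step 0: OLS starting values
    for _ in range(max_iter):
        e = Y - X @ b
        MAD = np.median(np.abs(e - np.median(e))) / 0.6745   # robust scale estimate
        u = e / MAD                                  # scaled residuals
        W = np.diag(huber_w(u))
        b_new = np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)    # weighted least squares step
        if np.max(np.abs(b_new - b)) < tol:          # small change => converged
            return b_new
        b = b_new
    return b

# Hypothetical data with one gross outlier
rng = np.random.default_rng(3)
x = np.arange(1, 21, dtype=float)
X = np.column_stack([np.ones_like(x), x])
Y = 5 + 2 * x + rng.normal(scale=0.5, size=20)
Y[10] += 40                                          # contaminate one observation
print(irls_huber(X, Y))                              # coefficients stay near (5, 2)
```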
Nonparametric Regression
• Locally Weighted Regressions (Lowess)
– Works best with few predictors, when the data have been transformed to approximate normality with constant error variance, and when the predictors have been scaled by their standard deviations (sqrt(MSE), or MAD when outliers are present)
– Applies WLS with weights based on the distances from the individual points in a neighborhood to the target point (q = proportion of the data used in the neighborhood, typically 0.40 to 0.60)
• Regression Trees
– X-space is broken down into multiple sub-spaces (1-step at a
time), and each region is modeled by the mean response
– Each step is chosen to minimize the within region sum of
squares
Lowess Method
• Assumes the model has been selected and built so that the assumptions of approximately normal errors with constant variance hold.
• Predictors measured in different units should be scaled by their SD or MAD in the distance measure
Assuming p - 1 = 2 predictor variables and q = proportion of the sample used at each point:

Distance measure (each point from the target point):

$$d_i = \left[\left(\frac{X_{i1} - X_{h1}}{s_1}\right)^2 + \left(\frac{X_{i2} - X_{h2}}{s_2}\right)^2\right]^{1/2}$$

Weight function:

$$w_i = \begin{cases} \left[1 - \left(\dfrac{d_i}{d_q}\right)^3\right]^3 & d_i < d_q \\[4pt] 0 & d_i \geq d_q \end{cases} \qquad d_q = \text{size of the neighborhood such that a proportion } q \text{ of the sample is included}$$

Local fitting: fit a first-order or second-order model by WLS and obtain the fitted value at each target point
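A minimal sketch of a lowess fit at a single target point with one predictor, using the tricube weight above; the data, the value of q, and the function name are illustrative assumptions:

```python
import numpy as np

def lowess_at(x0, x, y, q=0.5):
    """First-order local WLS fit at target x0 using tricube weights in the neighborhood."""
    d = np.abs(x - x0)                                   # distances to the target point
    dq = np.sort(d)[int(np.ceil(q * len(x))) - 1]        # radius containing proportion q of the sample
    w = np.where(d < dq, (1 - (d / dq) ** 3) ** 3, 0.0)  # tricube weights
    X = np.column_stack([np.ones_like(x), x])
    W = np.diag(w)
    b = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)        # local linear WLS fit
    return b[0] + b[1] * x0                              # fitted value at the target

# Hypothetical nonlinear data
rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 80))
y = np.sin(x) + rng.normal(scale=0.2, size=80)
grid = np.linspace(0.5, 9.5, 19)
smooth = np.array([lowess_at(g, x, y, q=0.5) for g in grid])
print(np.round(smooth, 2))
```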
Regression Trees
• Splits region covered by the predictor variables into
increasing number of sub-regions, where predicted value
is mean response of all cases falling in the sub-region.
• Goal: Choose “cut-points” that minimize Error Sum of
Squares for the current number of sub-regions –
computationally intensive, no closed form solution
• Problem: SSE will continually drop until there are as
many sub-regions as distinct combinations of predictors
(SSE → 0)
• Validation samples can be used to determine the number of splits (one fewer than the number of sub-regions) that minimizes the Mean Square Prediction Error (MSPR)
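A minimal sketch of a single cut-point search with one predictor, scoring each candidate cut by the within-region sum of squares as described above; the data are hypothetical:

```python
import numpy as np

def best_split(x, y):
    """Find the cut-point on x minimizing the within-region SSE (region means as predictions)."""
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None)
    for i in range(1, len(xs)):                      # candidate cut between xs[i-1] and xs[i]
        left, right = ys[:i], ys[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, (xs[i - 1] + xs[i]) / 2)
    return best                                      # (minimum SSE, cut-point)

# Hypothetical step-shaped data
rng = np.random.default_rng(5)
x = rng.uniform(0, 10, 100)
y = np.where(x < 4, 2.0, 7.0) + rng.normal(scale=0.5, size=100)
print(best_split(x, y))                              # cut-point should land near 4
```

A full tree repeats this search recursively within each sub-region; a validation sample can then be used to choose how many sub-regions to keep.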
Bootstrapping to Estimate Precision
• Computationally intensive methods used to estimate
precision of estimators in non-standard situations
• Based on re-sampling (with replacement) from observed
samples, re-computing estimates repeatedly
(nonparametric). Parametric methods also exist (not
covered here)
• Once the original sample is observed and the quantity of interest estimated, many samples of size n (selected with replacement) are drawn, each sample providing a new estimate.
• The standard deviation of these bootstrap estimates serves as an estimate of the standard error
Two Basic Approaches
• Fixed X Sampling (Model is good fit, constant variance,
predictor variables are fixed, i.e. controlled experiment)
– Fit the regression, obtain all fitted values and residuals
– Keeping the corresponding X-level(s) and fitted values, resample the n residuals (with replacement)
– Add the bootstrapped residuals to the fitted values, and re-fit
the regression (repeat process many times)
• Random X Sampling (Not sure of adequacy of model:
fit, variance, random predictor variables)
– After fitting regression, and estimating quantities of interest,
sample n cases (with replacement) and re-estimate quantities
of interest with “new” datasets (repeat many times)
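A minimal sketch of both resampling schemes for a simple linear regression; the data and the number of bootstrap samples (m = 1000) are illustrative assumptions:

```python
import numpy as np

def fit_slope(x, y):
    return np.polyfit(x, y, 1)[0]                     # slope of a degree-1 (simple linear) fit

rng = np.random.default_rng(6)
x = np.linspace(0, 10, 30)
y = 1 + 0.5 * x + rng.normal(scale=1.0, size=30)

b1 = fit_slope(x, y)
fitted = np.polyval(np.polyfit(x, y, 1), x)
resid = y - fitted
m = 1000

# Fixed X sampling: keep X and fitted values, resample the residuals with replacement
boot_fixed = np.array([fit_slope(x, fitted + rng.choice(resid, size=len(x), replace=True))
                       for _ in range(m)])

# Random X sampling: resample whole cases (X, Y pairs) with replacement
idx = rng.integers(0, len(x), size=(m, len(x)))
boot_random = np.array([fit_slope(x[i], y[i]) for i in idx])

print(boot_fixed.std(ddof=1), boot_random.std(ddof=1))   # bootstrap estimates of s{b1}
```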
Bootstrap Confidence Intervals – Reflection Method
Suppose we are interested in estimating β₁ in simple regression (this generalizes to other parameters):

1) Fit the regression on the original data set and obtain

$$b_1 = \frac{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)\left(Y_i - \bar{Y}\right)}{\sum_{i=1}^{n}\left(X_i - \bar{X}\right)^2}$$

2) Obtain m bootstrap samples and their estimated slopes: $b_{1j}^*, \ j = 1,\ldots,m$

3) Order the $b_{1j}^*$ from smallest to largest and obtain $b_1^*(\alpha/2)$ and $b_1^*(1-\alpha/2)$ (the lower and upper percentiles of the bootstrap distribution)

4) Compute $d_1 = b_1 - b_1^*(\alpha/2)$ and $d_2 = b_1^*(1-\alpha/2) - b_1$

5) Approximate $(1-\alpha)100\%$ CI for $\beta_1$: $\; b_1 - d_2 \leq \beta_1 \leq b_1 + d_1$
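Continuing the bootstrap sketch above (b1 and boot_random are assumed to be defined there), the reflection-method interval follows directly, here with α = 0.05:

```python
# Continuing from the bootstrap sketch above (b1, boot_random already defined)
import numpy as np

alpha = 0.05
lo, hi = np.percentile(boot_random, [100 * alpha / 2, 100 * (1 - alpha / 2)])
d1 = b1 - lo                        # d1 = b1 - b*_1(alpha/2)
d2 = hi - b1                        # d2 = b*_1(1 - alpha/2) - b1
print(b1 - d2, b1 + d1)             # approximate 95% reflection CI for beta_1
```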