9. Adding a variable
CH1. What is what
CH2. A simple SPF
CH3. EDA
CH4. Curve fitting
CH5. A first SPF
CH6. Which fit is fitter
CH7. Choosing the objective function
CH8. Theoretical stuff (skip)
CH9. Adding variables
CH10. Choosing a model equation
In the previous session we discussed the objective function.
In this session:
1. The necessary and the sufficient conditions
2. The VIEDA
3. How to add AADT
4. The new CURE plots
5. The NM – time is a variable
The fit with ‘Segment Length’ alone was bad.
That AADT needs to be added is clear.
However, for other variables, intuition is insufficient:
• Add all variables in the data?
• Only those with statistically significant parameters?
• Only when the fit is improved materially?
• Use other considerations?
Recall the danger:
Overfitting in a nutshell
Also recall the purpose:
Two perspectives on SPF
E{μ} and σ{μ} = f(Traits, parameters)
Applications-centered perspective
Cause-and-effect-centered perspective
The perspective determines how modeling is done
When to add variable?
Add variable if significant
SPF workshop February 2014, UBCO
To add or not to add?
• You do not add a variable because the data is there;
• You do not add a variable because the parameter is statistically significant;
• You do add a variable if doing so improves the accuracy of the estimates of E{μ} and σ{μ}.
Two necessary conditions must be met to consider adding a variable.
One sufficient condition must be met to justify adding a variable.
The necessary condition: there is Bias-in-Use
Example:
A (Colorado, rural, two-lane) road segment is 0.5 miles long
and has AADT=500. What is its μ?
Based on the model we have now:
$\hat{E}\{\mu \mid L = 0.5\ \text{miles}\} = 1.656 \times 0.5^{0.871} = 0.905$ I&F crashes in 5 years
This would be an unbiased estimate if we knew only the
length of this segment.
But we also know its AADT! What then is the bias-in-use?
How large would the bias-in-use be?
After adding AADT to the SPF we will have:
$\hat{E}\{\mu \mid L = 0.5\ \text{miles},\ \text{AADT} = 500\} = 0.213$ I&F crashes in 5 years
Not adding AADT to the model equation would make us estimate μ as 0.905, whereas segments with L = 0.5 miles and AADT = 500 have, on average, 0.213 I&F crashes.
The bias-in-use due to the omission of AADT would be 0.905 − 0.213 = 0.692 I&F crashes.
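The arithmetic of this example can be checked with a short Python sketch. The coefficients 1.656 and 0.871 and the value 0.213 are taken from the slides; the function and variable names are only illustrative, and the L & AADT estimate is entered as a given number because its model equation is not shown at this point.

```python
def spf_length_only(length_miles, b0=1.656, b1=0.871):
    """Length-only SPF from the slide: E{mu} = b0 * L**b1 (I&F crashes in 5 years)."""
    return b0 * length_miles ** b1


# Estimate that uses only what the current model equation contains (segment length).
estimate_length_only = spf_length_only(0.5)

# Estimate once AADT = 500 is also used; copied from the slide, not recomputed here.
estimate_length_and_aadt = 0.213

# Bias-in-use: what we would report minus what such segments have on average.
bias_in_use = estimate_length_only - estimate_length_and_aadt

print(f"L only: {estimate_length_only:.3f}, bias-in-use: {bias_in_use:.3f}")
```

To three decimals this reproduces the 0.905 and 0.692 quoted on the slides.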
Bias-in-use exists when, for the unit or the population of interest, we have information about some safety-related variables that are not in the model equation.
The size of the bias-in-use depends on:
• what traits of the unit or population are known but are not in the SPF;
• the level of these missing traits.
The absence from the SPF of a safety-related variable whose level is known in applications causes bias-in-use; adding it to the model equation reduces the bias-in-use.
Generally: if we know more about the unit (or population) whose safety is of interest than what is in the model equation, the estimates of μ (or of E{μ}) will be biased.
E.g., we know that the aggregate is polished, but friction was not in the SPF.
Conversely, if we included in the model equation a variable whose value is not easily known for the unit (or population) of interest, it would be a hindrance.
E.g., friction is in the SPF but is not known for the segment.
Should a guess be plugged into the model equation, the estimate of E{μ} will be biased.
Two kinds of bias
Bias-in-fit:
When the model equation does not fit the data for some range of variable values.
Depends on the modeler.
Bias-in-use:
When for a unit or population of interest we know some safety-relevant variables that are not in the SPF.
Depends on what the modeler put into the SPF, how the SPF was reported, and what information the user has about units of interest.
It follows that a variable is a candidate for inclusion in the model equation if:
a. It is safety-related (its absence causes bias-in-use);
b. Information about it is available.
These are the two necessary conditions.
Recall: a variable is safety-related if there is a regular relationship between the residuals and the candidate variable.
To determine whether a variable is safety-related: do a VIEDA.
The Variable Introduction EDA: ‘VIEDA’
Question: Is AADT safety-related?
We now have (only) Segment Length in the model.
Segment Length and AADT are correlated.
Is AADT still needed?
Purpose of VIEDA:
1. Does the Observed/Fitted ratio have a regular relationship with AADT?
2. If yes, what function can represent AADT in the model?
Question: Is there a ‘regular relationship’ between AADT and the current residuals?
Open #12: VIEDA for AADT on Pivot Table workpage
Here is what we have now from NB fit (#11)
Task:
• Use a Pivot Table to create AADT bins: 0-500, 500-1000, ...
• Put observed and fitted crashes into the bins.
• How do they compare?
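For readers working outside the spreadsheet, the same binning task can be sketched in Python. This is only an illustration of the Pivot Table step, not the workshop's spreadsheet: the function name and the sample data are hypothetical, and a bin labelled 500 is assumed to cover AADT from 500 up to, but not including, 1000.

```python
from collections import defaultdict


def bin_observed_fitted(aadt, observed, fitted, width=500):
    """Sum observed and fitted crashes in AADT bins and form the O/F ratio per bin.

    Returns {bin lower bound: (sum observed, sum fitted, observed/fitted)}.
    Fitted sums are assumed to be positive, as they are for a power-function SPF.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for q, obs, fit in zip(aadt, observed, fitted):
        lower = int(q // width) * width  # e.g. AADT 730 falls in the bin starting at 500
        sums[lower][0] += obs
        sums[lower][1] += fit
    return {lower: (o, f, o / f) for lower, (o, f) in sorted(sums.items())}


# Hypothetical segments, for illustration only.
aadt = [100, 400, 700, 900]
observed = [1, 0, 2, 1]
fitted = [0.5, 0.5, 1.0, 1.0]
result = bin_observed_fitted(aadt, observed, fitted)
for lower, (obs_sum, fit_sum, ratio) in result.items():
    print(lower, obs_sum, fit_sum, round(ratio, 2))
```

A ratio that drifts systematically from bin to bin is the 'regular relationship' the VIEDA looks for.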
1. Drag
2. Drag these
Right click to get
Select
After ‘OK’ you get
What do you see?
Open the ‘Analysis’ workpage of #12
Copied from pivot table
Added
If I multiply fitted by 0.78 I get observed.
[Figure: multipliers (0 to 15) plotted against AADT (0 to 20000)]
An alternative is to use the ungrouped results and Nadaraya-Watson smoothing.
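A minimal sketch of how Nadaraya-Watson smoothing could be applied to ungrouped results is given below. The slides do not say which kernel, bandwidth, or quantity is smoothed; the Gaussian kernel, the bandwidth of 1000 AADT, the choice to smooth observed and fitted counts separately, and the sample data are all assumptions made for illustration.

```python
import math


def nadaraya_watson(x0, xs, ys, bandwidth):
    """Kernel-weighted average of ys at x0, using a Gaussian kernel (an assumed choice)."""
    weights = [math.exp(-0.5 * ((x0 - x) / bandwidth) ** 2) for x in xs]
    return sum(w * y for w, y in zip(weights, ys)) / sum(weights)


# Hypothetical ungrouped segments: AADT, fitted crashes, observed crashes.
aadt = [300, 800, 1500, 2500, 4000, 7000, 12000]
fitted = [0.9, 1.1, 1.0, 1.4, 1.2, 1.6, 1.3]
observed = [0.78 * f for f in fitted]  # built so that the multiplier is 0.78 everywhere


def smoothed_multiplier(x0, bandwidth=1000.0):
    """Smoothed observed over smoothed fitted: a local Observed/Fitted ratio at AADT = x0."""
    return (nadaraya_watson(x0, aadt, observed, bandwidth)
            / nadaraya_watson(x0, aadt, fitted, bandwidth))


for q in (500, 2000, 6000):
    print(q, round(smoothed_multiplier(q), 3))
```

Unlike the binned version, no bin boundaries need to be chosen, but the bandwidth plays a similar role and should be chosen with care.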
The two VIEDA questions about AADT:
Question 1: Is there a ‘regular relationship’?
Answer: Yes
Question 2: What does it look like?
Answer: This is what the multiplier function should look like.
[Figure: multipliers (0 to 15) plotted against AADT (0 to 20000)]
May want to try a power function, $\text{AADT}^{\beta_2}$.
So, now that AADT has met the necessary conditions ...
The sufficient condition
Question: should all ‘candidate’ variables that meet the necessary conditions be in the model equation?
Answer: Only if including them increases estimation accuracy.
Accuracy Gain: If AADT is known then adding it as a variable will reduce bias-in-use.
Accuracy Loss: Every added variable decreases the accuracy with which E{μ} is estimated.
[Figure: Gain versus Loss]
Count of segments
• If estimated only by Segment Length: an average count of 106;
• If estimated only by AADT: an average count of 377;
• But if estimated by both: an average count of 4.
Is it a net gain?
Not so simple with parametric C-F
If without the variable the model equation is expression ‘A’, and adding the variable makes it into the expression ‘A×B’, then, by the laws of error propagation:
$V\{A \times B\} \cong B^2 \times V\{A\} + A^2 \times V\{B\}$
The term $A^2 \times V\{B\}$ is the addition due to the variance of expression ‘B’.
Thus, if expression ‘B’ is $\text{AADT}^{\beta_2}$, then V{B} will reflect the uncertainty about AADT and the uncertainty about $\beta_2$.
The sufficient condition:
Add the variable if it reduces the EMSE of Ê{μ}.
Expected Mean Square Error of Ê{μ}:
$\text{EMSE}\{\hat{E}\{\mu\}\} = (\text{bias-in-use})^2 + V\{\hat{E}\{\mu\}\}$
Said differently: add a variable if by doing so the (bias-in-use)² is reduced more than V{Ê{μ}} is increased.
Surprisingly, this needs research work.
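The comparison implied by the sufficient condition can be written as a small Python sketch. The function names are illustrative; the bias of 0.692 comes from the earlier example, while the variance values are invented purely to show the mechanics, since the slides note that estimating these quantities still needs research.

```python
def emse(bias_in_use, variance):
    """Expected mean square error: squared bias-in-use plus variance of the estimate."""
    return bias_in_use ** 2 + variance


def worth_adding(bias_without, var_without, bias_with, var_with):
    """True if adding the variable lowers the EMSE of the estimate of E{mu}."""
    return emse(bias_with, var_with) < emse(bias_without, var_without)


# Bias-in-use of 0.692 is from the AADT example; both variances are hypothetical.
print(worth_adding(bias_without=0.692, var_without=0.01, bias_with=0.0, var_with=0.03))

# A variable that removes only a small bias may not pay for the variance it adds.
print(worth_adding(bias_without=0.05, var_without=0.01, bias_with=0.0, var_with=0.03))
```

The first case favours adding the variable and the second does not, which is the trade-off between Gain and Loss described above.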
How to add AADT to the C-F Spreadsheet
Open #13. NB fit with L and AADT on ‘L & AADT’ workbook.
Before addition of AADT (only L)
Now with L & AADT
New parameter
Separate columns for each variable
Scale parameter here
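What the spreadsheet columns compute can be sketched in Python as a negative binomial log-likelihood. This is an illustration under stated assumptions, not a copy of the workshop spreadsheet: the functional form b0 × L^b1 × (AADT/1000)^b2 mirrors the model equation shown later in this chapter, the shape parameter is taken as b_i = 𝒷 × L_i, and the data and parameter values are hypothetical.

```python
import math


def nb_log_pmf(y, mu, b):
    """Log-probability of y crashes under a negative binomial with mean mu and shape b."""
    return (math.lgamma(y + b) - math.lgamma(b) - math.lgamma(y + 1)
            + b * math.log(b / (b + mu)) + y * math.log(mu / (b + mu)))


def nb_log_likelihood(params, lengths, aadts, counts):
    """Sum of NB log-probabilities; one term per segment (one spreadsheet row)."""
    b0, b1, b2, scale = params
    total = 0.0
    for length, aadt, y in zip(lengths, aadts, counts):
        mu = b0 * length ** b1 * (aadt / 1000.0) ** b2  # assumed model equation
        b = scale * length                               # assumed: b_i = scale * L_i
        total += nb_log_pmf(y, mu, b)
    return total


# Hypothetical data and starting guesses, for illustration only.
lengths = [0.5, 1.2, 2.0]
aadts = [500, 1500, 3000]
counts = [0, 1, 3]
log_lik = nb_log_likelihood((0.2, 1.0, 0.9, 2.0), lengths, aadts, counts)
print(round(log_lik, 4))
```

In the spreadsheet, Solver varies the parameters to make this sum as large as possible; a numerical optimizer would play the same role here.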
‘Solver’ solution
Compare ‘L only’ with ‘L & AADT’.
What changed?
1. b1 is now well outside its ±0.06 confidence interval;
2. 𝒷 is now larger, which is good;
3. Log-likelihood is larger, which is also good.
Observation 1. b1 is now very different
a. Whatever the parameter values are, they will change when a (correlated) variable is added.
b. This is the ‘Omitted Variable Bias’.
c. There always are omitted variables.
d. The usual parameter accuracy statistics assume:
- No omitted variables;
- A simple function exists and is correct;
- No error in variables.
How b1 changed (from Chapter 5)
Type            Method                         b1
Conventional    OLS                            0.866
Conventional    Poisson Likelihood             0.860
Conventional    Negative Binomial Likelihood   0.871
Unconventional  Absolute Differences           0.911
Unconventional  χ²                             0.737
Unconventional  Total Absolute Bias χ²         0.882
Now, with AADT added, b1 is 1.08.
One can begin to trust the parameter values when they do not change much as new variables are introduced into the model equation.
Observation 2: 𝒷 is now larger
Recall that $V\{\mu_i\} = (E\{\mu_i\})^2 / b_i$ and $b_i = 𝒷 \times L_i$.
The larger 𝒷 is, the smaller V{μ} is, and the better the SPF is when used for HSM purposes!
Now there are three CURE plots to examine:
• for Segment Length,
• for AADT,
• and an ‘overall’ one, for ‘Fitted Value’.
CURE for Segment Length:
Adding AADT brought an improvement, but concerns remain.
Outliers?
No outlier here.
What can explain the drop?
Functional form – not likely
Terrain variable? Perhaps.
Other?
CURE for AADT:
• Not good.
• Large bias-in-fit (26%) in A→B.
• Function too high.
• Can be improved by choice of function?
The former two CURE plots had L or AADT on the X-axis and gave hints about how to change the function for these variables.
• For L, sort by ‘miles’.
• For AADT, sort by ‘AADT 94-98’.
• For ‘overall fit’, sort by ‘Fitted Value’.
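The sort-then-accumulate step behind each CURE plot can be sketched in Python as follows. The function name and sample data are hypothetical, and only the cumulative residual curve is computed; the confidence limits usually drawn around a CURE plot are left out of this sketch.

```python
def cure(sort_values, observed, fitted):
    """Cumulative residuals (observed minus fitted) after sorting by the chosen variable.

    Returns (sorted x values, cumulative residuals), ready to plot as a CURE curve.
    """
    order = sorted(range(len(sort_values)), key=lambda i: sort_values[i])
    xs, cumulative = [], []
    running = 0.0
    for i in order:
        running += observed[i] - fitted[i]
        xs.append(sort_values[i])
        cumulative.append(running)
    return xs, cumulative


# Hypothetical segments, for illustration only.
miles = [2.0, 0.5, 1.0]
observed = [3, 0, 1]
fitted = [2.5, 0.5, 1.0]

# Sorting by 'miles' gives the CURE plot for Segment Length; passing AADT or the
# fitted values as sort_values gives the other two plots.
xs, cumulative = cure(miles, observed, fitted)
print(xs, cumulative)
```

A long upward or downward run in the cumulative residuals marks a range where the function is, respectively, too low or too high.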
The ‘overall fit’
Here the X-axis is: Fitted Value.
Function too high
The (only?) way to raise it from 0 to A is to allow an intercept.
The way to lower it from B to C is to use a more flexible function: an S-shaped function (Logistic, Hoerl, … Chapter 10).
The Negative Multinomial: Time is a variable
The data we have
But we used only 5 years, and then only averages and sums!
C-F spreadsheet for NM with a five-year panel
The data
Computations & starting guesses
Computations for log-likelihood
Solver Options
ML estimates
Time is a variable
Year         1994     1995   1996   1997     1998     Sum
Observed     1759     ...    ...    1837     1850     9229
Fitted       1752.2   ...    ...    1887.7   1968.7   9229.0
O/F Ratios   1.001    ...    ...    0.973    0.940    1.000
Estimating yearly scale parameters
373/359 = 1.04
Under identical conditions, there were 4% more crashes in 1995 than in 1994. Time is a variable.
What does it represent?
With 13 years, 13 yearly scale parameters
For extrapolation
For years with data
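Saving on parameters
$\beta_0 [1 - \beta_{slope}(\text{Year} - 1986)]$
The model equation so far
$E\{\mu\} = 0.202\,[1 - 0.019(\text{Year} - 1986)]\,(\text{Segment Length})^{1.082} \left(\frac{\text{AADT}}{1000}\right)^{0.906}$
and
$V\{\mu\} = \frac{(E\{\mu\})^2}{2.085 \times \text{Segment Length}}$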
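The model equation so far can be evaluated with a short Python sketch. The coefficients are those shown on the slide; the function names are illustrative, Segment Length is assumed to be in miles, and the example inputs are hypothetical.

```python
def expected_mu(year, length_miles, aadt):
    """E{mu} from the slide's model equation (coefficients as reported there)."""
    time_factor = 1.0 - 0.019 * (year - 1986)
    return 0.202 * time_factor * length_miles ** 1.082 * (aadt / 1000.0) ** 0.906


def variance_mu(year, length_miles, aadt):
    """V{mu} = (E{mu})**2 / (2.085 * Segment Length), as on the slide."""
    return expected_mu(year, length_miles, aadt) ** 2 / (2.085 * length_miles)


# Hypothetical segment: 0.5 miles long, AADT of 500, estimated for 1994 and for 1998.
for year in (1994, 1998):
    e_mu = expected_mu(year, 0.5, 500)
    v_mu = variance_mu(year, 0.5, 500)
    print(year, round(e_mu, 4), round(v_mu, 5))
```

Because year enters the equation, the same segment gets a different estimate for each year of interest, which is what tailoring the estimate to the year means in practice.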
The attractions of the NM model
• It keeps the advantages of the NB;
• It makes use of all available information;
• It allows one to tailor the estimate to the year of interest;
• If there is regularity, it allows one to estimate for the future.
Exercise:
Use all 13 years of data. Is the sequence of β0’s regular? If yes, replace it by a function.
How to report
Usually the modeller introduces variables A, B, C and D into the model equation, one after the other, and then reports the parameters of the final, ‘fully loaded’ model.
Suppose the user only knows the values of A, B and C for some unit or population. Now the fully loaded model is of no use.
However, if the modeller reported the results at every stage (model with A, model with A&B, model with A&B&C, and model with A&B&C&D), users could make use of the model that matches the information they have.
Conclusion: Report all practical combinations.
Summary for section 9. (Adding Traits)
1. The question was ‘whether’ and ‘how’;
2. Bias-in-use: if we know more than is in the SPF;
3. A variable is a candidate if: (a) it is still safety-relevant and (b) information about it is available in applications;
4. The sufficient condition: reduction in bias² > increase in the variance of the estimate of E{μ};
5. Whether a variable is safety-related (‘s-r’) is established by VIEDA;
6. AADT was s-r and the multiplier increased with AADT;
7. Adding a variable to the C-F spreadsheet is straightforward;
8. As a result all parameters change. This means: all are provisional;
9. Each variable has a CURE plot. Each CURE plot suggests the further steps;
10. Our data are yearly. To make use of them, use the NM likelihood function;
11. Now ‘year’ is a variable;
12. To be of practical use, report every model.