9. Adding a variable

CH1. What is what
CH2. A simple SPF
CH3. EDA
CH4. Curve fitting
CH5. A first SPF
CH6. Which fit is fitter
CH7. Choosing the objective function
CH8. Theoretical stuff (skip)
CH9. Adding variables
CH10. Choosing a model equation

In the previous session we discussed the objective function.

In this session:
1. The necessary and the sufficient
2. The VIEDA
3. How to add AADT
4. The new CURE plots
5. The NM: time is a variable

The fit with ‘Segment Length’ alone was bad. That AADT needs to be added is clear. For other variables, however, intuition is insufficient:
• Add all variables in the data?
• Only those with statistically significant parameters?
• Only when the fit is improved materially?
• Use other considerations?
Recall the danger: overfitting in a nutshell.

Also recall the purpose. Two perspectives on SPF:
E{μ} and σ{μ} = f(Traits, parameters)
• Applications-centered perspective
• Cause-and-effect-centered perspective
The perspective determines how modeling is done.
When to add a variable? Add variable if significant

SPF workshop February 2014, UBCO

To add or not to add?
• You do not add a variable because the data is there;
• You do not add a variable because the parameter is statistically significant;
• You do add a variable if doing so improves the accuracy of the estimates of E{μ} and σ{μ}.
Two necessary conditions must be met to consider adding a variable. One sufficient condition must be met to justify adding a variable.

The necessary condition: there is Bias-in-Use
Example: A (Colorado, rural, two-lane) road segment is 0.5 miles long and has AADT = 500. What is its μ?
Based on the model we have now:
$\hat{E}\{\mu \mid L = 0.5\ \text{miles}\} = 1.656 \times 0.5^{0.871} = 0.905$ I&F crashes in 5 years
This would be an unbiased estimate if we knew only the length of this segment. But we also know its AADT! What then is the bias-in-use?

How large would the bias-in-use be?
After adding AADT to the SPF we will have:
$\hat{E}\{\mu \mid L = 0.5\ \text{miles}\ \&\ AADT = 500\} = 0.213$ I&F crashes in 5 years
Not adding AADT to the model equation would make us estimate the μ as 0.905, whereas segments with L = 0.5 and AADT = 500 have, on average, 0.213 I&F crashes. The bias-in-use due to the omission of AADT would be 0.905 - 0.213 = 0.692 I&F crashes.

Bias-in-use exists when, for the unit or the population of interest, we have information about some safety-related variables that are not in the model equation.
The size of the bias-in-use depends on:
• what traits of the unit or population are known but are not in the SPF;
• the level of these missing traits.
The absence from the SPF of a safety-related variable whose level is known in applications causes bias-in-use. Adding it to the model equation reduces the bias-in-use.

Generally: If we know more about the unit (or population) whose safety is of interest than what is in the model equation, the estimates of μ (or of E{μ}) will be biased.
E.g., we know that the aggregate is polished, but friction was not in the SPF.
Conversely, if we included in the model equation a variable whose value is not easily known for the unit (or population) of interest, it would be a hindrance. Should a guess be plugged into the model equation, the estimate of E{μ} will be biased.
E.g., friction is in the SPF but is not known for the segment.

Two kinds of bias
Bias-in-fit: When the model equation does not fit the data for some range of variable values. Depends on the modeler.
Bias-in-use: When for a unit or population of interest we know some safety-relevant variables that are not in the SPF. Depends on what the modeler put into the SPF, how the SPF was reported, and what information the user has about units of interest.

It follows that a variable is a candidate for inclusion in the model equation if:
a. It is safety-related (causes bias-in-use);
b. Information about it is available.
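The bias-in-use arithmetic of the Colorado example can be retraced in a few lines. This is a minimal sketch: the function and variable names are illustrative assumptions, the length-only parameters (1.656 and 0.871) and the 0.213 estimate are the values quoted in the slides, and the with-AADT estimate is taken as given rather than recomputed.

```python
# Sketch of the bias-in-use arithmetic; names are illustrative, not from the workshop files.

def spf_length_only(length_miles, beta0=1.656, beta1=0.871):
    """Length-only SPF quoted in the slides: E{mu} = beta0 * L**beta1 (I&F crashes in 5 years)."""
    return beta0 * length_miles ** beta1

# Estimate when only the segment length is used.
estimate_l_only = spf_length_only(0.5)

# Estimate quoted in the slides once AADT = 500 is also in the model equation.
estimate_l_and_aadt = 0.213

# Bias-in-use: the estimate we would report minus what such segments average.
bias_in_use = estimate_l_only - estimate_l_and_aadt

print(f"L only:      {estimate_l_only:.3f}")
print(f"L and AADT:  {estimate_l_and_aadt:.3f}")
print(f"bias-in-use: {bias_in_use:.3f}")
```

The point of the sketch is only that the bias-in-use is a difference between two estimates of the same segment, one of which ignores information the user actually has.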
These are the two necessary conditions.

Recall: A variable is safety-related if there is a regular relationship between the residuals and the candidate variable.
To determine whether a variable is safety-related: do a VIEDA.

The Variable Introduction EDA: ‘VIEDA’
Question: Is AADT safety-related? We now have (only) Segment Length in the model. Segment Length and AADT are correlated. Is AADT still needed?
Purpose of VIEDA:
1. Does the Observed/Fitted ratio have a regular relationship with AADT?
2. If yes, what function can represent AADT in the model?

Question: Is there a ‘regular relationship’ between AADT and the current residuals?
Open #12: VIEDA for AADT, on the ‘Pivot Table’ workpage. Here is what we have now from the NB fit (#11).
[Screenshot: results of the NB fit from #11]

Task:
• Use the Pivot Table to create AADT bins 0-500, 500-1000, ....
• Put observed and fitted crashes into the bins.
• How do they compare?
[Screenshots: Pivot Table setup; callouts ‘1. Drag’, ‘2. Drag these’, ‘Right click to get’, ‘Select’]
[Screenshot: what you get after ‘OK’]
What do you see?

Open the ‘Analysis’ workpage of #12.
[Table: columns copied from the pivot table, plus added columns]
If I multiply fitted by 0.78 I get observed.
[Figure: Multipliers (0 to 15) plotted against AADT (0 to 20000)]

An alternative is to use the ungrouped results and Nadaraya-Watson smoothing.

The two VIEDA questions about AADT:
Question 1: Is there a ‘regular relationship’? Answer: Yes.
Question 2: What does it look like? Answer: This is what the multiplier function should look like.
[Figure: Multipliers (0 to 15) plotted against AADT (0 to 20000)]
May want to try $AADT^{\beta_2}$.
So, now that AADT met the necessary conditions ….

The sufficient condition
Question: Should all ‘candidate’ variables that meet the necessary conditions be in the model equation?
Answer: Only if including them increases estimation accuracy.
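Returning to the Pivot Table task of the VIEDA: the same binning can be sketched outside the spreadsheet. The sketch below is an illustration under assumptions. The record layout (AADT, observed, fitted), the function name and the sample numbers are invented for demonstration and are not the Colorado data; only the idea of summing observed and fitted crashes per AADT bin and comparing them comes from the slides.

```python
from collections import defaultdict

BIN_WIDTH = 500  # AADT bins 0-500, 500-1000, ... as in the task

def observed_to_fitted_by_bin(records, bin_width=BIN_WIDTH):
    """Sum observed and fitted crashes per AADT bin; return the O/F multiplier per bin.

    records: iterable of (aadt, observed, fitted) tuples (hypothetical layout).
    Returns {bin_lower_edge: observed_sum / fitted_sum}, ordered by bin.
    """
    sums = defaultdict(lambda: [0.0, 0.0])
    for aadt, observed, fitted in records:
        lower = int(aadt // bin_width) * bin_width
        sums[lower][0] += observed
        sums[lower][1] += fitted
    return {lower: obs / fit for lower, (obs, fit) in sorted(sums.items()) if fit > 0}

# Made-up records for illustration only: (AADT, observed, fitted by the L-only model).
sample = [
    (300, 0, 0.9), (450, 1, 1.1),
    (700, 1, 1.0), (900, 2, 1.4),
    (1200, 3, 1.2), (1400, 2, 0.8),
]

multipliers = observed_to_fitted_by_bin(sample)
for lower, ratio in multipliers.items():
    print(f"AADT {lower}-{lower + BIN_WIDTH}: multiplier {ratio:.2f}")
```

A multiplier that changes steadily from bin to bin, as in this made-up sample, is the kind of ‘regular relationship’ the first VIEDA question asks about; the shape of that change is what the second question is after.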
Accuracy Gain: If AADT is known, then adding it as a variable will reduce bias-in-use.
Accuracy Loss: Every added variable decreases the accuracy with which E{μ} is estimated.
[Figure: Gain versus Loss]

Count of segments
If estimated only by Segment Length, average of 106; if estimated only by AADT, average of 377; but if by both, average of 4. Is it a net gain?

Not so simple with parametric C-F
If without the variable the model equation is expression ‘A’, and adding the variable makes it into the expression ‘A×B’, then by the laws of error propagation:
$V\{A \times B\} \approx B^2 \times V\{A\} + A^2 \times V\{B\}$
The term $A^2 \times V\{B\}$ is the addition due to the variance of expression ‘B’.
Thus, if expression ‘B’ is $AADT^{\beta_2}$, then V{B} will reflect the uncertainty about AADT and the uncertainty about β2.

The sufficient condition: Add if it reduces the EMSE of Ê{μ}
Expected Mean Square Error of Ê{μ}:
$\mathrm{EMSE}\{\hat{E}\{\mu\}\} = (\text{bias-in-use})^2 + V\{\hat{E}\{\mu\}\}$
Said differently: add a variable if by doing so (bias-in-use)² is reduced more than V{Ê{μ}} is increased.
Surprisingly, this needs research work.

How to add AADT to the C-F Spreadsheet
Open #13: NB fit with L and AADT, on the ‘L & AADT’ workbook.
[Screenshot: before addition of AADT (only L)]
[Screenshot: now with L & AADT; callouts ‘New parameter’, ‘Separate columns for each variable’, ‘Scale parameter here’]
[Screenshot: ‘Solver’ solution]

Compare ‘L only’ with ‘L & AADT’. What changed?
1. β1 is now way out of the ±0.06 confidence interval;
2. 𝒷 is now larger, which is good;
3. Log-likelihood is larger, which is also good.

Observation 1: β1 is now very different
a. Whatever the parameter values are, they will change when a (correlated) variable is added.
b. This is the ‘Omitted Variable Bias’.
c. There always are omitted variables.
d. The usual parameter accuracy statistics assume:
- No omitted variables,
- Simple function exists and is correct,
- No error in variables.

How β1 changed (from Chapter 5)
Methods (conventional, then unconventional): OLS; Poisson Likelihood; Negative Binomial Likelihood; Absolute Differences; χ²; Total Absolute Bias; χ²
β1: 0.866, 0.860, 0.871, 0.911, 0.737, 0.882
Now, with AADT added, β1 is 1.08.
One can begin to trust the parameter values when they do not change much as new variables are introduced into the model equation.

Observation 2: 𝒷 is now larger
Recall that $V\{\mu_i\} = (E\{\mu_i\})^2 / b_i$ and $b_i = 𝒷 \times L_i$.
The larger 𝒷 is, the smaller V{μ} is, and the better the SPF is when used for HSM purposes!
Now there are three CURE plots to examine: for Segment Length, for AADT, and an ‘overall’ one for ‘Fitted Value’.

CURE for Segment Length:
[Figure: CURE plot for Segment Length]
Addition of AADT: improvement. Remaining concerns.
Outliers? No outlier here. What can explain the drop?
• Functional form? Not likely.
• Terrain variable? Perhaps.
• Other?

CURE for AADT:
[Figure: CURE plot for AADT, with points A and B marked]
• Not good.
• Large bias-in-fit (26%) in A→B.
• Function too high.
• Can be improved by choice of function?

The former two CURE plots had L or AADT on the X-axis and gave hints about how to change the function for these variables.
For L, sort by ‘miles’.
For AADT, sort by ‘AADT 94-98’.
For ‘overall fit’, sort by ‘Fitted Value’.

The ‘overall fit’
[Figure: CURE plot with Fitted Value on the X-axis; points 0, A, B and C marked; ‘Function too high’]
Here the X-axis is: Fitted Value.
The (only?) way to raise it from 0 to A is to allow an intercept.
The way to lower it from B to C is to use a more flexible, S-shaped function (Logistic, Hoerl, …; see Chapter 10).

The Negative Multinomial: Time is a variable
[Screenshot: the data we have]
But we used only 5 years, and then only averages and sums!
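The sorting instructions for the three CURE plots translate directly into a sort followed by a running sum. The sketch below assumes that a residual is observed minus fitted and uses invented rows; the function name and the dictionary layout are illustrative and are not part of the workshop spreadsheet.

```python
def cumulative_residuals(rows, sort_key):
    """Return (x, cumulative residual) pairs after sorting rows by the chosen variable.

    rows: list of dicts with 'observed', 'fitted' and the sort variable (hypothetical layout).
    A residual is taken to be observed - fitted (an assumption of this sketch).
    """
    ordered = sorted(rows, key=lambda r: r[sort_key])
    points = []
    running = 0.0
    for r in ordered:
        running += r["observed"] - r["fitted"]
        points.append((r[sort_key], running))
    return points

# Invented rows for illustration only.
rows = [
    {"miles": 0.8, "aadt": 1500, "observed": 2, "fitted": 1.5},
    {"miles": 0.2, "aadt": 400, "observed": 0, "fitted": 0.5},
    {"miles": 1.5, "aadt": 900, "observed": 1, "fitted": 2.0},
    {"miles": 0.5, "aadt": 2500, "observed": 3, "fitted": 2.0},
]

cure_by_length = cumulative_residuals(rows, "miles")   # CURE for Segment Length
cure_by_aadt = cumulative_residuals(rows, "aadt")      # CURE for AADT
cure_overall = cumulative_residuals(rows, "fitted")    # 'overall fit'

for x, cum in cure_by_aadt:
    print(f"AADT {x}: cumulative residual {cum:+.1f}")
```

Under this sign convention, a stretch where the running sum keeps falling is a stretch where the fitted values are too high, which is how the ‘Function too high’ remarks on the plots can be read.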
C-F spreadsheet for NM with a five-year panel
[Screenshot: the data; computations and starting guesses]
[Screenshot: computations for log-likelihood; Solver options]

ML estimates: time is a variable
Year   Observed   Fitted   O/F ratio
1994   1759       1752.2   1.001
1995   ...        ...      ...
1996   ...        ...      ...
1997   1837       1887.7   0.973
1998   1850       1968.7   0.940
Sum    9229       9229.0   1.000

Estimating yearly scale parameters
373/359 = 1.04: under identical conditions, there were 4% more crashes in 1995 than in 1994.
Time is a variable. What does it represent?

With 13 years, 13 yearly scale parameters
[Figure: yearly scale parameters; regions marked ‘For years with data’ and ‘For extrapolation’]

Saving on parameters: $\beta_0[1 - \beta_{slope}(Year - 1986)]$

The model equation so far
$E\{\mu\} = 0.202\,[1 - 0.019\,(Year - 1986)] \times (\text{Segment Length})^{1.082} \times (AADT/1000)^{0.906}$
and
$V\{\mu\} = (E\{\mu\})^2 / [2.085 \times (\text{Segment Length})]$

The attractions of the NM model
• It keeps the advantages of the NB;
• It makes use of all available information;
• It allows one to tailor the estimate to the year of interest;
• If there is regularity, it allows one to estimate for the future.
Exercise: Use all 13 years of data. Is the sequence of β0’s regular? If yes, replace it by a function.

How to report
Usually the modeller introduces variables A, B, C and D into the model equation, one after the other, and then reports the parameters of the final, ‘fully loaded’ model. Suppose the user only knows the values of A, B and C for some unit or population. Now the fully loaded model is of no use. However, if the modeller reported the results at every stage (model with A, model with A&B, model with A&B&C, and model with A&B&C&D), users could make use of the model that matches the information they have.
Conclusion: Report all practical combinations.

Summary for section 9 (Adding Traits)
1. The question was ‘whether’ and ‘how’;
2. Bias-in-use: if we know more than is in the SPF;
3.
A variable is a candidate if: (a) it is still safety-relevant and (b) information about it is available in applications;
4. The sufficient condition: reduction in bias² > increase in variance of the estimate of E{μ};
5. Whether a variable is ‘s-r’ (safety-related) is established by VIEDA;
6. AADT was s-r and the multiplier increased with AADT;
7. Adding a variable to the C-F spreadsheet is straightforward;
8. As a result all parameters change. This means: all are provisional;
9. Each variable has a CURE plot. Each CURE plot suggests the further steps;
10. Our data are yearly. To make use of them, use the NM likelihood function;
11. Now ‘year’ is a variable;
12. To be of practical use, report every model.
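The model equation reached by the end of the session can be written as a small function. This is a sketch under assumptions: the equation is read from the slide as E{μ} = 0.202[1 - 0.019(Year - 1986)] × L^1.082 × (AADT/1000)^0.906 with V{μ} = (E{μ})²/(2.085 × L); the function names, the parameter names and the example segment are illustrative.

```python
def expected_mu(length_miles, aadt, year,
                beta0=0.202, beta_slope=0.019, beta1=1.082, beta2=0.906):
    """E{mu} from the model equation as read from the slide (parameter names are assumed)."""
    year_factor = beta0 * (1.0 - beta_slope * (year - 1986))
    return year_factor * length_miles ** beta1 * (aadt / 1000.0) ** beta2

def variance_mu(length_miles, aadt, year, shape_per_mile=2.085):
    """V{mu} = (E{mu})**2 / (shape_per_mile * L), following b_i = b * L_i."""
    e_mu = expected_mu(length_miles, aadt, year)
    return e_mu ** 2 / (shape_per_mile * length_miles)

# Illustrative segment: 0.5 miles long, AADT = 500, year 1996.
e_mu = expected_mu(0.5, 500, 1996)
v_mu = variance_mu(0.5, 500, 1996)
print(f"E{{mu}} = {e_mu:.4f}, V{{mu}} = {v_mu:.5f}")
```

Because ‘year’ is an argument, the same function tailors the estimate to the year of interest; whether it may be used for years beyond the data depends on the regularity of the yearly scale parameters, as the exercise asks.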