Introduction to Predictive Modeling
with Examples
Nationwide Insurance Company, November 2
D. A. Dickey
Cool < ------------------------ > Nerdy
“Analytics” = “Statistics”
“Predictive Modeling” = “Regression”
Part 1: Simple Linear Regression
If the Life Line is long and deep, then this
represents a long life full of vitality and
health. A short line, if strong and deep,
also shows great vitality in your life and
the ability to overcome health problems.
However, if the line is short and shallow,
then your life may have the tendency to
be controlled by others.
http://www.ofesite.com/spirit/palm/lines/linelife.htm
Wilson & Mather JAMA 229 (1974)
X=life line length
Y=age at death
proc sgplot data=life;
  scatter y=age x=line;   /* the raw points          */
  reg y=age x=line;       /* least squares line      */
run;
Result: Predicted Age at Death = 79.24 – 1.367(lifeline)
(Is this “real”??? Is this repeatable???)
We Use LEAST SQUARES
Squared residuals sum to 9609
[Figure: surface of the error sum of squares SSq versus slope and intercept, truncated at SSq = 9700]
“Best” line is the one that minimizes sum of squared residuals.
Best for this sample – is it the true relationship for everyone?
SAS PROC REG will compute it. What other lines might be the
true line for everyone? Probably not the purple one in the plot.
The red one has slope 0 (no effect). Is the red line unreasonable?
Can we reject H0:slope is 0?
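In symbols, least squares chooses the intercept b0 and slope b1 that minimize

$$\sum_{i=1}^{n}\left(Y_i - b_0 - b_1 X_i\right)^2,$$

which for this sample attains its minimum, 9609, at the fitted line above.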
Simulation: Age at Death = 67 + 0(life line) + e
Error e has normal distribution mean 0 variance 200.
Simulate 20 cases with n= 50 bodies each.
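A minimal SAS sketch of this simulation (the seed and the uniform life line lengths are illustrative assumptions, not from the original slides):

data sim;
  call streaminit(12345);                 /* reproducible random numbers */
  do rep = 1 to 20;                       /* 20 simulated studies        */
    do body = 1 to 50;                    /* n = 50 bodies per study     */
      line = 7 + 5*rand('uniform');       /* an assumed life line length */
      age = 67 + 0*line + sqrt(200)*rand('normal');  /* slope 0, var 200 */
      output;
    end;
  end;
run;

proc reg data=sim outest=slopes noprint;  /* one fitted line per rep     */
  by rep;
  model age = line;
run;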
NOTE: Regression equations:
Age(rep:1) = 80.56253 - 1.345896*line.
Age(rep:2) = 61.76292 + 0.745289*line.
Age(rep:3) = 72.14366 - 0.546996*line.
Age(rep:4) = 95.85143 - 3.087247*line.
Age(rep:5) = 67.21784 - 0.144763*line.
Age(rep:6) = 71.0178 - 0.332015*line.
Age(rep:7) = 54.9211 + 1.541255*line.
Age(rep:8) = 69.98573 - 0.472335*line.
Age(rep:9) = 85.73131 - 1.240894*line.
Age(rep:10) = 59.65101 + 0.548992*line.
Age(rep:11) = 59.38712 + 0.995162*line.
Age(rep:12) = 72.45697 - 0.649575*line.
Age(rep:13) = 78.99126 - 0.866334*line.
Age(rep:14) = 45.88373 + 2.283475*line.
Age(rep:15) = 59.28049 + 0.790884*line.
Age(rep:16) = 73.6395 - 0.814287*line.
Age(rep:17) = 70.57868 - 0.799404*line.
Age(rep:18) = 72.91134 - 0.821219*line.
Age(rep:19) = 55.46755 + 1.238873*line.
Age(rep:20) = 63.82712 + 0.776548*line.
The actual estimate, Predicted Age at Death = 79.24 - 1.367(life line),
would NOT be unusual even if there were no true relationship.
[Figure: distribution of t under H0]

Conclusion:
Estimated slopes vary.
The standard deviation of the estimated slopes is the "standard error" (estimated).
Compute t = (estimate - hypothesized)/standard error.
The p-value is the probability of a larger |t| when the hypothesis is correct (e.g., 0 slope); it is the sum of the two tail areas.
Traditionally p < 0.05 → conclude the hypothesized value is wrong.
p > 0.05 is inconclusive.
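For the life line slope the arithmetic works out to:

$$t = \frac{-1.36697 - 0}{1.59782} = -0.86, \qquad p = 0.19825 + 0.19825 = 0.3965$$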
proc reg data=life;
model age=line;
run;
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1        79.23341            14.83229       5.34     <.0001
line         1        -1.36697             1.59782      -0.86     0.3965

[Figure: t distribution with tail area 0.19825 beyond each of -0.86 and 0.86; total p-value 0.39650]
Conclusion: insufficient evidence against the hypothesis of no linear relationship.
The analogy to a trial:
In court: H0: Innocence, H1: Guilt. Convict only when guilt is shown beyond reasonable doubt (P < 0.05).
Here: H0: True slope is 0 (no association), H1: True slope is not 0. With P = 0.3965 we cannot "convict."
Simulation: Age at Death = 67 + 0(life line) + e
Error e has normal distribution mean 0 variance 200.  WHY?
Simulate 20 cases with n= 50 bodies each.
Want an estimate of the variability around the true line. The true variance is σ².
Use sums of squared residuals (SS).
The sum of squared residuals from the mean is "SS(total)" = 9755.
The sum of squared residuals around the line is "SS(error)" = 9609.
(1) SS(total) - SS(error) is SS(model) = 146.
(2) The variance estimate is SS(error)/(degrees of freedom) = 200.
(3) SS(model)/SS(total) is R², i.e. the proportion of variability "explained" by the model.
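Numerically, for the life line data:

$$\mathrm{SS(model)} = 9755.22 - 9608.70 = 146.52,\qquad \frac{\mathrm{SS(error)}}{\mathrm{DF}} = \frac{9608.70}{48} = 200.18,\qquad R^2 = \frac{146.52}{9755.22} = 0.0150$$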
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       146.51753     146.51753      0.73    0.3965
Error              48      9608.70247     200.18130
Corrected Total    49      9755.22000

Root MSE   14.14854      R-Square   0.0150
Part 2: Multiple Regression
Issues:
(1) Testing joint importance versus individual significance.
    A two-engine plane can still fly if engine #1 fails.
    A two-engine plane can still fly if engine #2 fails.
    Neither engine is critical individually, yet they are jointly critical (can't omit both!!).
(2) Prediction versus modeling individual effects.
(3) Collinearity (correlation among inputs).
Example: Hypothetical company’s sales Y depend on TV
advertising X1 and Radio Advertising X2.
Y = b0 + b1X1 + b2X2 +e
Data Sales; length sval $8; length cval $8;
input store TV radio sales;
(more code)
cards;
1 869 868 9089
2 836 820 8290
(more data)
40 969 961 10130
;

proc g3d data=sales;
scatter radio*TV=sales / shape=sval color=cval zmin=8000;
run;

[Figure: 3-D scatter of sales versus TV and radio advertising, shown from several angles with the P2 axis marked]
Conclusion: Can predict well with just TV, just radio, or both!
SAS code:
proc reg data=next; model sales = TV radio;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       32660996       16330498    358.84    <.0001  (Can't omit both)
Error              37        1683844          45509
Corrected Total    39       34344840

Root MSE   213.32908     R-Square   0.9510  (Explaining 95% of variation in sales)
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       531.11390           359.90429       1.48     0.1485
TV           1         5.00435             5.01845       1.00     0.3251  (can omit TV)
radio        1         4.66752             4.94312       0.94     0.3512  (can omit radio)
Estimated Sales = 531 + 5.0 TV + 4.7 radio, with error variance 45509 (standard deviation 213).
TV spending is approximately equal to radio spending, so, approximately,
Estimated Sales = 531 + 9.7 TV
or
Estimated Sales = 531 + 9.7 radio.
Setting TV = radio (the approximate relationship):
Estimated Sales = 531 + 9.7 TV
is this the BEST TV line?
Estimated Sales = 531 + 9.7 radio
is this the BEST radio line?
Proc Reg Data=Stores;
Model Sales = TV;
Model Sales = radio;
run;
Model Sales = TV:

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       32620420       32620420    718.84    <.0001
Error              38        1724420          45379
Corrected Total    39       34344840

Root MSE   213.02459     R-Square   0.9498

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       478.50829           355.05866       1.35     0.1857
TV           1         9.73056             0.36293      26.81     <.0001
*********************************************************************************************
Model Sales = radio:

Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       32615742       32615742    716.79    <.0001
Error              38        1729098          45503
Corrected Total    39       34344840

Root MSE   213.31333     R-Square   0.9497

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       612.08604           350.59871       1.75     0.0889
radio        1         9.58381             0.35797      26.77     <.0001
Sums of squares capture the variation explained by each variable:
Type I: how much does it explain when it is added to the model in sequence?
Type II: how much does it explain when all other variables are already present
(as if it had been added last)?
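Both kinds can be requested as PROC REG model options; a sketch using the sales model from above:

proc reg data=next;
  model sales = TV radio / ss1 ss2;   /* print Type I and Type II sums of squares */
run;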
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|    Type I SS    Type II SS
Intercept    1       531.11390           359.90429       1.48     0.1485    3964160640        99106
TV           1         5.00435             5.01845       1.00     0.3251      32620420        45254
radio        1         4.66752             4.94312       0.94     0.3512         40576        40576
***********************************************************************************
Parameter Estimates (variables entered in the other order)

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|    Type I SS    Type II SS
Intercept    1       531.11390           359.90429       1.48     0.1485    3964160640        99106
radio        1         4.66752             4.94312       0.94     0.3512      32615742        40576
TV           1         5.00435             5.01845       1.00     0.3251         45254        45254
Summary:
Good predictions are given by
Sales = 531 + 5.0 x TV + 4.7 x Radio
or
Sales = 479 + 9.7 x TV
or
Sales = 612 + 9.6 x Radio
(or lots of others).
Why the confusion? The evil Multicollinearity!! (correlated X's)
Those Mysterious “Degrees of Freedom” (DF)
First Martian → information about the average height, but 0 information about variation.
2nd Martian gives the first piece of information (DF) about the error variance around the mean.
n Martians → n-1 DF for error (variation).

[Figure: Martian Height]

For a fitted line: 2 points → no information on the variation of errors; n points → n-2 error DF.

[Figure: Martian Weight versus Height]
How Many Table Legs?
(regress Y on X1, X2)
Three legs will all touch the floor; the fourth leg gives the first chance to measure error (the first error DF).

Source             DF   Sum of Squares   Mean Square
Model               2       32660996       16330498
Error              37        1683844          45509
Corrected Total    39       34344840

Fit a plane → n-3 (here 37) error DF (2 "model" DF, n-1 = 39 "total" DF).
Regress Y on X1 X2 … X7 → n-8 error DF (7 "model" DF, n-1 "total" DF).
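In general, with n observations and k predictors plus an intercept:

$$\mathrm{DF(model)} = k,\qquad \mathrm{DF(error)} = n - k - 1,\qquad \mathrm{DF(total)} = n - 1 .$$

For the sales data, n = 40 and k = 2, giving the 37 error DF shown above.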
Grades vs. IQ and Study Time
Data tests;
  input IQ Study_Time Grade;
  IQ_S = IQ*Study_Time;   /* interaction term */
cards;
105 10 75
110 12 79
120  6 68
116 13 85
122 16 91
130  8 79
114 20 98
102 15 76
;
Proc reg data=tests; model Grade = IQ;

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       62.57113            48.24164        1.30     0.2423
IQ           1        0.16369             0.41877        0.39     0.7094

Proc reg data=tests; model Grade = IQ Study_Time;

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1        0.73655            16.26280        0.05     0.9656
IQ            1        0.47308             0.12998        3.64     0.0149
Study_Time    1        2.10344             0.26418        7.96     0.0005
Contrast:
TV advertising loses significance when radio is added.
IQ gains significance when study time is added.
Model for Grades:
Predicted Grade = 0.74 + 0.47 x IQ + 2.10 x Study Time
Question:
Does an extra hour of study really deliver 2.10 points for
everyone, regardless of IQ? The current model allows only this.
proc reg; model Grade = IQ Study_Time IQ_S;
Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3      610.81033      203.60344     26.22    0.0043
Error               4       31.06467        7.76617
Corrected Total     7      641.87500

Root MSE   2.78678      R-Square   0.9516

Variable     DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept     1       72.20608            54.07278        1.34     0.2527
IQ            1       -0.13117             0.45530       -0.29     0.7876
Study_Time    1       -4.11107             4.52430       -0.91     0.4149
IQ_S          1        0.05307             0.03858        1.38     0.2410
“Interaction” model:
Predicted Grade =
72.21 - 0.13 x IQ - 4.11 x Study Time + 0.053 x IQ x Study Time
= (72.21 - 0.13 x IQ) + (-4.11 + 0.053 x IQ) x Study Time

IQ = 102 predicts
Grade = (72.21 - 13.26) + (5.41 - 4.11) x Study Time = 58.95 + 1.30 x Study Time
IQ = 122 predicts
Grade = (72.21 - 15.86) + (6.47 - 4.11) x Study Time = 56.35 + 2.36 x Study Time

[Figure: the two prediction lines, slope 1.30 for IQ = 102 and slope 2.36 for IQ = 122]
Adding interaction makes everything insignificant (individually) !
Do we need to omit insignificant terms until only significant ones remain?
Has an acquitted defendant proved his innocence?
Common sense trumps statistics!
Part 3: Diagnosing Problems in Regression
Main problems are
Multicollinearity (correlation among inputs)
Outliers
Proc Corr; Var TV radio sales;
Pearson Correlation Coefficients, N = 40
Prob > |r| under H0: Rho=0

              TV          radio         sales
TV         1.00000       0.99737       0.97457
                          <.0001        <.0001
radio      0.99737       1.00000       0.97450
            <.0001                      <.0001
sales      0.97457       0.97450       1.00000
            <.0001        <.0001
[Figure: TV $ versus Radio $ with principal component axis 1 (P1, the long axis) and axis 2 (P2, the short axis)]

Principal Components

              TV          radio
TV         1.00000       0.99737
                          <.0001
radio      0.99737       1.00000
            <.0001
(1) Center and scale the variables to mean 0, variance 1.
(2) Call these X1 (TV) and X2 (radio).
(3) n variables → total variation is n (n = 2 here).
(4) Find the most variable linear combination P1 = __X1 + __X2.
Variances are 1.9973 out of 2 (along the P1 axis) and 0.0027 out of 2 (along the P2 axis);
the corresponding standard deviations are about 1.413 and 0.051.
The ratio of standard deviations (27.6) is the "condition number"; large → unstable regression.
Rule of thumb: a ratio of 1 is perfect, > 30 is problematic. Spread along the long axis is
27.6 times that along the short axis.
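The components can be computed directly in SAS; a minimal sketch (PROC PRINCOMP works from the correlation matrix by default, i.e., on centered and scaled variables; the output data set name prin is illustrative):

proc princomp data=sales out=prin;   /* eigenvalues 1.99737 and 0.00263, as in the diagnostics below */
  var TV radio;
run;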
Variance Inflation Factor
(1) Regress predictor i on all the others getting r-square: Ri2
(2) VIF is 1/(1- Ri2 ) for variable i (measures collinearity).
(3) VIF > 10 is a problem.
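As a quick check: regressing TV on radio gives an R_i^2 equal to the squared correlation 0.99737², so

$$\mathrm{VIF} = \frac{1}{1 - 0.99737^{2}} \approx 190,$$

agreeing (up to rounding of the correlation) with the Variance Inflation column in the output below.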
Example:
Proc Reg Data=Sales; Model Sales = TV Radio / VIF COLLINOINT;
Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|   Variance Inflation
Intercept    1       531.11390           359.90429       1.48     0.1485              0
TV           1         5.00435             5.01845       1.00     0.3251      190.65722
radio        1         4.66752             4.94312       0.94     0.3512      190.65722
Collinearity Diagnostics (intercept adjusted)

                                           --- Proportion of Variation ---
Number   Eigenvalue   Condition Index         TV           radio
1         1.99737          1.00000          0.00131       0.00131
2         0.00263         27.57948          0.99869       0.99869

We have a MAJOR problem!
(Note: other diagnostics besides VIF and the condition number are available.)
Another problem: Outliers
[Figure: scatter of TV versus radio advertising (each roughly 800 to 1200); the points lie nearly on a straight line]
Example: Add one point to the TV-radio data:
TV 1021, radio 954, Sales 9020.
Proc Reg; Model Sales = TV radio / p r;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               2       33190059       16595030    314.07    <.0001
Error              38        2007865          52839
Corrected Total    40       35197924

Root MSE   229.86639     R-Square   0.9430

Parameter Estimates

Variable    DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept    1       689.01260           382.52628       1.80     0.0796
TV           1        -6.28994             2.90505      -2.17     0.0367
radio        1        15.78081             2.86870       5.50     <.0001

Obs   Dependent Variable   Predicted Value    Residual   Std Error Residual   Student Residual   Cook's D
39          9277                9430         -153.4358         225.3               -0.681          0.006
40         10130                9759          370.5848         226.1                1.639          0.030
41          9020                9322         -301.8727         121.9               -2.476          5.224

Cook's D for the added store 41 is enormous (5.224)!
The ordinary residual for store 41 is not too bad (-301.87).
PRESS residuals:
(1) Remove store i, with sales Y(i).
(2) Fit the model to the other 40 stores.
(3) Get the model prediction P(i) for store i.
(4) The PRESS residual is Y(i) - P(i).
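Equivalently, a standard identity (not shown on the original slides) gives the PRESS residual without refitting, from the ordinary residual $e_i$ and the leverage $h_{ii}$ of store i:

$$e_{(i)} = \frac{e_i}{1 - h_{ii}},$$

so a high-leverage point like store 41 has a PRESS residual much larger in magnitude than its ordinary residual.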
[Figure: regular (O) and PRESS (dot) residuals by store number; store 41 stands out]
proc reg data=raw;
  model sales = TV radio;
  output out=out1 r=r press=press;   /* ordinary and PRESS residuals */
run;
[Figure: view of the data along the P2 axis (2nd principal component)]
Part 4: Classification Variables (dummy variables, indicator variables)
Predicted Accidents = 1181 + 2579 X11
X11 is 1 in November, 0 elsewhere.
Interpretation:
In November, predict 1181+2579(1) = 3660.
In any other month predict 1181 + 2579(0) = 1181.
1181 is average of other months.
2579 is added November effect (vs. average of others)
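A minimal sketch of building such a dummy in a SAS data step (this assumes the crash data carry a SAS date variable named date, as used later for the trend; the original slides do not show this step):

data deer;
  set deer;
  X11 = (month(date) = 11);   /* 1 for November observations, 0 otherwise */
run;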
Model for NC Crashes involving Deer:
Proc reg data=deer; model deer = X11;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               1       30473250       30473250     90.45    <.0001
Error              58       19539666         336891
Corrected Total    59       50012916

Root MSE   580.42294     R-Square   0.6093

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1      1181.09091           78.26421      15.09     <.0001
X11                      1      2578.50909          271.11519       9.51     <.0001
Looks like December and October need dummies too!
Proc reg data=deer; model deer = X10 X11 X12;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model               3       46152434       15384145    223.16    <.0001
Error              56        3860482          68937
Corrected Total    59       50012916

Root MSE   262.55890     R-Square   0.9228

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1       929.40000           39.13997      23.75     <.0001
X10                      1      1391.20000          123.77145      11.24     <.0001
X11                      1      2830.20000          123.77145      22.87     <.0001
X12                      1      1377.40000          123.77145      11.13     <.0001

The average of January through September is 929 crashes per month.
Add 1391 in October, 2830 in November, 1377 in December.
What the heck – let’s do all but one (need “average of rest” so must leave out at least one)
Proc reg data=deer; model deer = X1 X2 … X10 X11;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              11       48421690        4401972    132.79    <.0001
Error              48        1591226          33151
Corrected Total    59       50012916

Root MSE   182.07290     R-Square   0.9682

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1      2306.80000           81.42548      28.33     <.0001
X1                       1      -885.80000          115.15301      -7.69     <.0001
X2                       1     -1181.40000          115.15301     -10.26     <.0001
X3                       1     -1220.20000          115.15301     -10.60     <.0001
X4                       1     -1486.80000          115.15301     -12.91     <.0001
X5                       1     -1526.80000          115.15301     -13.26     <.0001
X6                       1     -1433.00000          115.15301     -12.44     <.0001
X7                       1     -1559.20000          115.15301     -13.54     <.0001
X8                       1     -1646.20000          115.15301     -14.30     <.0001
X9                       1     -1457.20000          115.15301     -12.65     <.0001
X10                      1        13.80000          115.15301       0.12     0.9051
X11                      1      1452.80000          115.15301      12.62     <.0001
With all months but December given dummies, the "average of the rest" is just the
December mean, 2307. Subtract 886 in January, add 1452 in November. October (X10)
is not significantly different from December.
Add date (days since Jan 1 1960 in SAS) to capture trend
Proc reg data=deer; model deer = date X1 X2 … X10 X11;
Analysis of Variance

Source             DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              12       49220571        4101714    243.30    <.0001
Error              47         792345          16858
Corrected Total    59       50012916

Root MSE   129.83992     R-Square   0.9842

Parameter Estimates

Variable    Label       DF   Parameter Estimate   Standard Error   t Value   Pr > |t|
Intercept   Intercept    1     -1439.94000          547.36656      -2.63     0.0115
X1                       1      -811.13686           82.83115      -9.79     <.0001
X2                       1     -1113.66253           82.70543     -13.47     <.0001
X3                       1     -1158.76265           82.60154     -14.03     <.0001
X4                       1     -1432.28832           82.49890     -17.36     <.0001
X5                       1     -1478.99057           82.41114     -17.95     <.0001
X6                       1     -1392.11624           82.33246     -16.91     <.0001
X7                       1     -1525.01849           82.26796     -18.54     <.0001
X8                       1     -1618.94416           82.21337     -19.69     <.0001
X9                       1     -1436.86982           82.17106     -17.49     <.0001
X10                      1        27.42792           82.14183       0.33     0.7399
X11                      1      1459.50226           82.12374      17.77     <.0001
date                     1         0.22341            0.03245       6.88     <.0001

The trend is 0.22 more accidents per day (about 1 per 5 days) and is significantly
different from 0.
Part 5: Logistic Regression
The problem: the response is binary:
yes or no,
accident or no accident,
claim or no claim,
at fault or not at fault.
The prediction is a predicted probability
(of being at fault, for example).
• Logistic idea: map p in (0,1) to L on the whole real line; here p = probability of fabric igniting.
• Use L = ln(p/(1-p)).
• Model L as linear in flame exposure time: predicted L = a + b(time).
• Given exposure time X, compute L = a + bX, then p = e^L/(1 + e^L).
• For observation i, p(i) = e^(a+bX(i)) / (1 + e^(a+bX(i))).
• Write down p(i) if unit i responded, 1 - p(i) if not.
• Multiply all n of these together to get a function Q(a,b); find a, b to maximize it.
Example: Ignition
• Flame exposure time = X
• Ignited: Y = 1; did not ignite: Y = 0
• Y = 1 at X = 11, 12, 14, 15, 17, 25, 30
• Y = 0 at X = 3, 5, 9, 10, 13, 16
• In order of increasing X: Q = (1-p)(1-p)(1-p)(1-p) p p (1-p) p p (1-p) p p p
• The p's are all different: p = f(exposure time)
• Find a, b to maximize the likelihood Q(a,b) (see the sketch below)
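SAS performs this maximization automatically; a minimal sketch for the ignition data (the data set and variable names are illustrative):

data ignite;
  input X Y @@;   /* X = flame exposure time, Y = 1 if the fabric ignited */
cards;
 3 0   5 0   9 0  10 0  11 1  12 1  13 0
14 1  15 1  16 0  17 1  25 1  30 1
;

proc logistic data=ignite;
  model Y(event='1') = X;   /* maximum likelihood estimates of a and b */
run;

The fitted intercept and slope should be close to the maximizing values in the likelihood plot below (about -2.6 and 0.23).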
[Figure: likelihood function Q(a,b); the maximum is near a = -2.6, b = 0.23]
Example: Shuttle Missions
• O-rings failed in the Challenger disaster
• Low temperature at launch
• Prior flights showed "erosion" and "blowby" in O-rings
• Feature: temperature at liftoff
• Target: problem (1), i.e. erosion or blowby, vs. no problem (0)