Neural Networks - UCF Statistics

November 13, 2013
• Collect model fits for 4 problems
• Return reports
• VIFs
• Launch Chapter 11
1
Grocery Data Assignment
X3 (holiday) and X1 (cases shipped) do the job and X2 adds nothing; normality and constant variance are okay; no need for quadratic terms or interactions; no need at all to square X3 (a two-level factor)
2
Some notes
• State conclusion (final model) up front
• Report the model from a fit of X1 and X3, not from a fit with X1, X2, X3 with X2 simply dropped
• Check assumptions for the X1, X3 model, not the model that includes X2
• Box-Cox suggests no transformation even though λ = 2 is “best”
• Interaction bit
3
Notes on writing aspects
• Avoid the imperative form of verbs (“Fit the multivariate model.” “Run the model.” “Be good.” That is, the implied (You) Verb … construction)
• Don’t use contractions. It’s bad form. They’re
considered informal.
• Spell check does not catch wrong words (e.g., “blow” instead of “below”, “not” instead of “note”)
• Writing skills are important (the benefits are considerable)
4
VIFs (not BFFs)
• Variance Inflation Factors
• VIF_i = (1 − R_i²)⁻¹, where R_i² is the R-squared from regressing X_i on the other X’s
• Available in JMP if you know where to look
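For reference, a minimal numpy sketch of this calculation; the made-up x1, x2, x3 data below are only for illustration and are not the class data.

import numpy as np

def vif(X, j):
    """Variance inflation factor for column j of predictor matrix X:
    1 / (1 - R^2), where R^2 comes from regressing X[:, j] on the other columns."""
    y = X[:, j]
    Z = np.column_stack([np.ones(len(y)), np.delete(X, j, axis=1)])  # intercept + other X's
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
    resid = y - Z @ beta
    r2 = 1 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return 1.0 / (1.0 - r2)

# Quick demo on made-up data with two highly correlated predictors
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.05, size=100)   # nearly collinear with x1
x3 = rng.normal(size=100)
X = np.column_stack([x1, x2, x3])
print([round(vif(X, j), 1) for j in range(3)])  # large VIFs for x1, x2; near 1 for x3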
5
• x3 appears to be both a “dud” for
predicting y and not very collinear with
either x1 or x2
6
• In computing VIFs, we need to regress x3 on x1 and x2 and compute 1/(1 − R²), which here is about 100. What gives?
7
8
Added variable plot to the rescue
• Regress x3 and x2 on x1 and save the residuals
• Note that 1.6084963 matches the earlier value
• (Suggest you run it the other way as well.)
[JMP added-variable plot output]
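The added-variable construction can be sketched generically in numpy (the data and the resid_on helper below are hypothetical, not the class example): the slope of the residual-on-residual plot for one predictor reproduces that predictor's coefficient from the full multiple regression.

import numpy as np

def resid_on(v, *xs):
    """Residuals from an OLS fit of v on an intercept plus the given predictors."""
    Z = np.column_stack([np.ones(len(v)), *xs])
    beta, *_ = np.linalg.lstsq(Z, v, rcond=None)
    return v - Z @ beta

rng = np.random.default_rng(2)
x1, x2 = rng.normal(size=(2, 50))
x3 = 0.7 * x1 - 0.7 * x2 + rng.normal(scale=0.1, size=50)   # made-up predictor
y = 2 + x1 + 1.5 * x2 + 0.8 * x3 + rng.normal(size=50)      # made-up response

# Added-variable plot for x3 given x1, x2: plot e_y against e_x3
e_y = resid_on(y, x1, x2)
e_x3 = resid_on(x3, x1, x2)
slope = (e_x3 @ e_y) / (e_x3 @ e_x3)   # through-the-origin slope
print(slope)   # reproduces the x3 coefficient from the full y ~ x1 + x2 + x3 fit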
9
10
11
12
13
Interesting body fat example
• Looks like very little collinearity
• However, massive multicollinearity!
• This is why it can be challenging at times!!!
• Why we don’t throw extra “dud” variables into the model
14
Steps in the analysis
• Multivariate to get acquainted with the data (Analyze distribution, all variables)
• Look for a decent, parsimonious model
  – Linear, interactions, quadratic
  – Stepwise if many variables
  – PRESS vs. root mean square error
  – Added variable plots according to taste
• Check assumptions (lots of plots)
• Check for outliers and influential observations (hat values and Cook’s Dᵢ)
15
Chapter 11: Remedial Measures
We’ll cover in some detail:
11.1 Weighted Least Squares
11.2 Ridge regression
11.4 Regression trees
11.5 Bootstrapping
16
11.1 Weighted Least Squares
Suppose that the constant variance assumption does not hold. Each residual has a different variance, but we keep the zero covariances:
\sigma^2\{\varepsilon_i\} = \sigma_i^2, \qquad \sigma\{\varepsilon_i, \varepsilon_j\} = 0 \text{ for } i \ne j
Least squares is out; what should we do?
17
Use Maximum Likelihood for inspiration!
Likelihood:
L(\beta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\!\left[-\frac{(Y_i - X_i'\beta)^2}{2\sigma_i^2}\right]
Now define the i-th weight to be:
w_i = 1/\sigma_i^2
Then the likelihood is:
L(\beta) = \prod_{i=1}^{n} \sqrt{\frac{w_i}{2\pi}} \exp\!\left[-\frac{w_i}{2}(Y_i - X_i'\beta)^2\right]
18
Taking logarithms, the log likelihood is a constant plus:
-\tfrac{1}{2} Q_w, \qquad Q_w = \sum_{i=1}^{n} w_i (Y_i - X_i'\beta)^2
The criterion is the same as least squares, except each squared residual is weighted by w_i; hence the weighted least squares criterion.
The coefficient vector b_w that minimizes Q_w is the vector of weighted least squares estimates.
19
Matrix Approach to WLS
Let W = \mathrm{diag}(w_1, \dots, w_n). Then:
b_w = (X'WX)^{-1} X'WY, \qquad \sigma^2\{b_w\} = (X'WX)^{-1}
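In code, the closed form is a couple of lines of linear algebra; a minimal numpy sketch, assuming the weights w are known and X already carries an intercept column.

import numpy as np

def wls(X, y, w):
    """Weighted least squares: b_w = (X'WX)^{-1} X'W y with W = diag(w).
    X is assumed to already contain a column of ones for the intercept."""
    XtW = X.T * w                          # equals X' @ diag(w) without forming W
    return np.linalg.solve(XtW @ X, XtW @ y)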
20
21
Usual Case: Variances are Unknown
Need to estimate each variance!
Recall: σᵢ² = E{εᵢ²}, since E{εᵢ} = 0
Give a statistic that can estimate σᵢ²: (the squared residual eᵢ²)
Give a statistic that can estimate σᵢ: (the absolute residual |eᵢ|)
22
Estimating a Standard Deviation Function
Step 1: Do ordinary least squares; obtain the residuals.
Step 2: Regress the absolute values of the residuals against Ŷ or whatever predictor(s) seem to be associated with changes in the variances of the residuals.
Step 3: Use the predicted absolute residual for case i, |êᵢ|, as the estimated standard deviation of εᵢ; call it ŝᵢ.
Step 4: Then ŵᵢ = (1/ŝᵢ)².
23
• Subset x and y for Table 11.1
• Fit y on x and save the residuals; compute the absolute values of the residuals
• Regress these absolute residuals on x
• The predicted values are the estimated standard deviations
• The weights are the reciprocals of the squared standard deviations
• Use these weights with WLS on the original y and x variables to get ŷ = 55.566 + 0.5963x
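A numpy sketch of the whole recipe; the heteroscedastic data below are simulated stand-ins for the Table 11.1 (x, y) columns, which are not reproduced here.

import numpy as np

def ols(X, y):
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Made-up heteroscedastic data standing in for the Table 11.1 columns
rng = np.random.default_rng(3)
x = rng.uniform(20, 60, size=54)
y = 55 + 0.6 * x + rng.normal(scale=0.1 * x)    # sd grows with x

X = np.column_stack([np.ones_like(x), x])

b_ols = ols(X, y)                    # Step 1: OLS fit, save residuals
e = y - X @ b_ols
s_hat = X @ ols(X, np.abs(e))        # Steps 2-3: regress |e| on x; fits estimate the sd's
w = 1.0 / s_hat**2                   # Step 4: weights = 1 / (estimated sd)^2

sw = np.sqrt(w)                      # WLS = OLS on sqrt(w)-scaled data
b_wls = ols(X * sw[:, None], y * sw)
print(b_wls)   # with the text's Table 11.1 data this is roughly (55.566, 0.5963)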
24
Pictures
25
Example
26
Notes on WLS Estimates
1. WLS estimates are minimum variance, unbiased.
2. If you use Ordinary Least Squares (OLS) when variance
is not constant, estimates are still unbiased, just not
minimum variance.
3. If you have replicates at each unique X category, you
can just use the sample standard deviation of the
responses at each category to determine the weight for
any response in the category.
4. R² has no clear-cut meaning here.
5. Must use the standard deviation function value (instead
of s) for confidence intervals for prediction
27
11.2 Ridge Regression
Biased regression to reduce the effect of
multicollinearity.
Shrinkage estimation: Reduce the variance of the
parameters by shrinking them (a bit) in absolute
magnitude. This will introduce some bias, but may
reduce the MSE overall.
Recall: MSE = (bias)² + variance:
E\{(\hat{\beta} - \beta)^2\} = [E\{\hat{\beta}\} - \beta]^2 + \sigma^2\{\hat{\beta}\}
28
29
How to Shrink?
Penalized least squares!
Start with the standardized (correlation-transformed) regression model:
Y_i' = \beta_1' X_{i1}' + \dots + \beta_{p-1}' X_{i,p-1}' + \varepsilon_i'
Add a “penalty” proportional to the total size of the parameters (the proportionality, or biasing, constant is c):
Q^R = \sum_{i=1}^{n}\Big(Y_i' - \sum_{k=1}^{p-1}\beta_k' X_{ik}'\Big)^2 + c\sum_{k=1}^{p-1}(\beta_k')^2
30
Matrix Ridge Solution
b^R = (r_{XX} + c\,I)^{-1}\, r_{YX}
Start with small c and increase it (iteratively) until the coefficients stabilize. The plot of the coefficients against c is called the “ridge trace.”
For the example shown, use c about equal to .02.
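A sketch of the ridge-trace computation under the correlation transformation; the collinear data here are made up, and in practice you would plot the coefficient paths and choose the smallest c where they level off (about .02 in the text's example).

import numpy as np

def ridge_trace(X, y, cs):
    """Standardized ridge coefficients b_R(c) = (r_XX + c I)^{-1} r_YX for each
    biasing constant c, using the correlation transformation of X and y."""
    n, p = X.shape
    Z = (X - X.mean(axis=0)) / (X.std(axis=0, ddof=1) * np.sqrt(n - 1))
    v = (y - y.mean()) / (y.std(ddof=1) * np.sqrt(n - 1))
    return {c: np.linalg.solve(Z.T @ Z + c * np.eye(p), Z.T @ v) for c in cs}

# Demo on made-up collinear data
rng = np.random.default_rng(4)
x1 = rng.normal(size=80)
x2 = x1 + rng.normal(scale=0.1, size=80)
X = np.column_stack([x1, x2])
y = 1 + x1 + x2 + rng.normal(size=80)
for c, b in ridge_trace(X, y, np.linspace(0.0, 0.1, 11)).items():
    print(round(c, 2), np.round(b, 3))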
31
Example
32
• Decision trees… section 11.4
– KDnuggets.com suggests…
– A bit of history
– Breiman et al.
– Respectability
– Oldie but goodie slides SA
– Titanic data
• Some bootstrapping stuff (probably not
tonight)
33
Taxonomy of Methods
Response (Dependent) Variable by Predictor (or Explanatory or Independent) Variables:
• Quantitative response
  – Quantitative predictors: Regression
  – Qualitative predictors: Regression or ANOVA
  – Mixture: Regression or Analysis of Covariance
• Qualitative response, 2 levels
  – Quantitative predictors: Logistic Regression or Discriminant Analysis
  – Qualitative predictors: Logistic Regression or Log-linear Models
  – Mixture: Logistic Regression
• Qualitative response, more than 2 levels
  – Quantitative predictors: Polytomous Logistic Regression or Discriminant Analysis
  – Qualitative predictors: Polytomous Logistic Regression or Log-linear Models
  – Mixture: Polytomous Logistic Regression
34
Data mining and Predictive Modeling
Predictive modeling mainly involves the application of:
• Regression
• Logistic regression
• Regression trees
• Classification trees
• Neural networks
to very large data sets.
The difference, technically, is that because we have so much data, we can rely on validation techniques (the use of training, validation, and test sets) to assess our models. There is much less concern about:
- Statistical significance (everything is significant!)
- Outliers/influence (a few outliers have no effect)
- Meaning of coefficients (models may have thousands of
predictors)
- Distributional assumptions, independence, etc.
35
Data mining and Predictive Modeling
We will talk about some of the statistical techniques
used in predictive modeling, once the data have
been gathered, cleaned, organized. But data
gathering usually involves merging disparate data
from different sources, data warehouses, etc., and
usually represents at least 80% of the work.
General Rule: Your organization’s data warehouse
will not have the information you need to build your
predictive model. (Paraphrased, Usama Fayyad,
VP data, Yahoo)
36
Regression Trees
Idea: Can we cut up the predictor space into
rectangles such that the response is roughly constant
in each rectangle, but the mean changes from
rectangle to rectangle?
We’ll just use the sample average (Ȳ) in each rectangle as our predictor!
Simple, easy-to-calculate, assumption-free,
nonparametric regression method. Note there is no
“equation.” The predictive model takes the form of
a decision tree.
37
Steroid Data
• See file ch11ta08steroidSplitTreeCalc.jmp
• Overall average of y is 17.64; SSE is 1284.8
38
Example: Steroid Data Predictive Model
[Regression tree diagram; the fitted values (region means) are 3.55, 8.133, 13.675, 16.95, and 22.2]
Example: What is Ŷ at Age = 9.5?
39
How do we find the regions (i.e., grow the tree)?
For one predictor X, it’s easy.
Step 1: To find the first split point X_s, make a grid of possible split points along the X axis. Each possible split point divides the X axis into two regions, R_{21} and R_{22}. Now compute SSE for the two-region regression tree:
SSE = \sum_{i \in R_{21}} (Y_i - \bar{Y}_{21})^2 + \sum_{i \in R_{22}} (Y_i - \bar{Y}_{22})^2
Do this for every grid point X. The point that leads to the minimum SSE is the split point.
Step 2: If you now have r regions, determine the best split point for each of the r regions as you did in Step 1; choose the one that leads to the lowest SSE for the r + 1 regions.
Step 3: Repeat Step 2 until SSE levels off (more later on stopping).
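A numpy sketch of the Step 1 grid search for one predictor; the age/level data below are simulated, not the steroid data.

import numpy as np

def best_split(x, y):
    """Grid-search the split point for one predictor: candidate splits are the
    midpoints between sorted unique x values; return the split (and its SSE)
    that minimizes SSE of the two-region tree."""
    def sse(v):
        return float(((v - v.mean()) ** 2).sum()) if len(v) else 0.0

    xs = np.unique(x)
    candidates = (xs[:-1] + xs[1:]) / 2
    best = min(candidates, key=lambda s: sse(y[x <= s]) + sse(y[x > s]))
    return best, sse(y[x <= best]) + sse(y[x > best])

# Demo with made-up (age, response) data
rng = np.random.default_rng(5)
age = rng.uniform(8, 25, size=27)
level = np.where(age < 14, 10, 20) + rng.normal(size=27)
print(best_split(age, level))   # split point should land near age = 14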
40
Illustrate first split with Steroid Data
• See file ch11ta08steroidSplitTreeCalc.jmp
• Overall average of y is 17.64; SSE is 1284.8
41
In the JMP file, aforementioned…
• Point out calculations needed to determine
optimal first split
• Easy but a bit tedious
• Binary vs. multiple splits
• Run it in JMP, be sure to set min # in splits
• Fit conventional model as well…
42
Growing the Steroid Level Tree
[Four panels showing the fitted regions after Split 1, Split 2, Split 3, and Split 4]
43
When do we stop growing?
If you let the growth process go on forever, you’ll eventually have n regions, each with just one observation. The mean of each region is the value of the observation, and R² = 100%. (You fitted n means (parameters), so you have n – n = 0 degrees of freedom for error.) Where to stop?
We do this by data splitting and cross-validation. After each split, use your model (tree) to predict each observation in a hold-out sample and compute MSPR or R² (holdout). As we saw with OLS regression, MSPR will start to increase (R² for the holdout will decrease) when we overfit.
We can rely on this because we have very large sample sizes.
44
What about multiple predictors?
For two or more predictors, no problem.
For each region, we have to determine the best predictor to split on
AND the best split point for that predictor. So if we have p – 1
predictors, and at stage r we have r regions, there are r(p – 1)
possible split points.
Example: Three splits for two predictors
45
GPA Data Results (text)
46
Using JMP for Regression Trees
• Analyze >> Modeling >> Partition
• Exclude at least 1/3 for validation sample using: Rows
>> Row Selection >> Select Randomly; then Rows >>
Exclude
• JMP will automatically give the predicted R2 value (1 –
SSE/SSTO for the validation set)
• You need to manually call for a split (doesn’t fit the tree
automatically)
47
Split button
Note: R² for the hold-out sample. As you grow the tree, this value will peak and begin to decline!
Clicking the red triangle gives options: select “Split History” to see a plot of predicted R² vs. the number of splits.
48
Classification Trees
• Regression tree equivalent of logistic regression
• Response is binary 0-1; the average response in each region is now p̂, not Ȳ
• For each possible split point, instead of SSE, we compute the G² statistic for the resulting 2 × r contingency table:
G^2 = 2 \sum_{i=1}^{2} \sum_{j=1}^{r} \mathrm{Observed} \cdot \log\!\left(\frac{\mathrm{Observed}}{\mathrm{Expected}}\right)
The split goes to the largest G² (equivalently, the smallest p-value). One can also use −log10 of a p-value adjusted in a Bonferroni-like manner; this is called the “LogWorth” statistic, and again the split with the largest value wins.
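A minimal sketch of the G² calculation for one candidate split; the 2 × 2 table of counts here is made up, and JMP's own p-value adjustment is not reproduced.

import numpy as np

def g_squared(table):
    """Likelihood-ratio chi-square G^2 = 2 * sum Observed * log(Observed/Expected)
    for a 2 x r table of counts (rows = response 0/1, columns = regions)."""
    obs = np.asarray(table, dtype=float)
    exp = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    nz = obs > 0                          # treat 0 * log(0) as 0
    return 2 * np.sum(obs[nz] * np.log(obs[nz] / exp[nz]))

# A candidate split producing a 2 x 2 table: strong separation gives a large G^2
print(g_squared([[30, 5],
                 [4, 25]]))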
49
50
51
52
Understanding ROC and Lift Charts
Assessing ability to classify a case (predict) correctly in
logistic regression, classification trees, or neural networks
(with binary responses) as a function of the cutoff value
chosen.
ROC Curve: Plot the true positive rate [P(Ŷ = 1 | Y = 1)] vs. the false positive rate [P(Ŷ = 1 | Y = 0)].
Example 1: Classify the top 40% (of predicted probabilities) as 1; bottom 60% as 0. Same as cutoff = .45, here.
Pred prob:       .49 .48 .47 .46 .43 .41 .38 .36 .32 .29
Data:             1   1   1   0   0   1   1   0   1   0
Classification:   1   1   1   1   0   0   0   0   0   0
                 |--- Top 40% ---|------ Bottom 60% ------|
53
Calculating sensitivity (true pos) and 1-specificity (false pos)
True positive rate: P(Ŷ = 1 | Y = 1) = 3/6 = .5 (Y axis value)
False positive rate: P(Ŷ = 1 | Y = 0) = 1/4 = .25 (X axis value)
[Scatterplot of TruePos vs. FalsePos showing the single point (.25, .5)]
54
Example 2: Classify the top 40% (of predicted probabilities) as 1; bottom 60% as 0. Same as cutoff = .45, here.
Pred prob:       .49 .48 .47 .46 .43 .41 .38 .36 .32 .29
Data:             1   1   1   0   0   0   0   0   0   0
Classification:   1   1   1   1   0   0   0   0   0   0
                 |--- Top 40% ---|------ Bottom 60% ------|
True positive rate: P(Ŷ = 1 | Y = 1) = 3/3 = 1.0 (Y axis value)
False positive rate: P(Ŷ = 1 | Y = 0) = 1/7 = .143 (X axis value)
[Scatterplot of TruePos vs. FalsePos showing the single point (.14, 1.0)]
55
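Both ROC points can be reproduced directly from the definitions; a small Python sketch using the Example 1 data from the slides.

import numpy as np

def roc_point(p_hat, y, cutoff):
    """True-positive and false-positive rates when cases with predicted
    probability >= cutoff are classified as 1."""
    y_hat = (np.asarray(p_hat) >= cutoff).astype(int)
    y = np.asarray(y)
    tpr = np.mean(y_hat[y == 1] == 1)     # P(Yhat=1 | Y=1), sensitivity
    fpr = np.mean(y_hat[y == 0] == 1)     # P(Yhat=1 | Y=0), 1 - specificity
    return tpr, fpr

# Example 1 from the slides:
p_hat = [.49, .48, .47, .46, .43, .41, .38, .36, .32, .29]
y     = [1, 1, 1, 0, 0, 1, 1, 0, 1, 0]
print(roc_point(p_hat, y, 0.45))          # (0.5, 0.25)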
11.5 Bootstrapping in Regression
Bootstrapping:
A method that uses computer simulation, rather than
theory and analytical results, to obtain sampling
distributions of statistics. From these we can estimate
the precision of an estimator.
56
Background: Simulated Intervals
Suppose:
1. Our objective is to get a confidence interval for the
slope in a simple linear regression setting.
2. We know the distribution of Y at each X value.
How can we use computer simulation (Minitab) to get
a confidence interval for the slope?
57
Background: Simulated Intervals
Easy:
1. Obtain a random Y value for each of the n X points
2. Compute the regression.
3. Store b1
Do the above, say 1,000,000 times. Do a histogram of
the b1 values, use the .025 and .975 percentiles!
58
Simulated Intervals: Example
Toluca Company data: Assume E(Y) = 62 + 3.6X and σ = 50.
That is: Y ~ N(62 + 3.6X, σ = 50), i.e., ε ~ N(0, σ = 50).
Exec (run repeatedly with k1 as a run counter; column 'X' holds the lot sizes):
let k1 = k1 + 1
# draw 25 N(0, 50) errors into c3
random 25 c3;
normal 0 50.
# simulate Y at the observed X values
let c4 = 62 + 3.6*'X' + c3
# fit the regression and store the coefficients in c5
Regress c4 1 'X';
Coefficients c5.
# save this run's slope b1 in row k1 of c6
let c6(k1) = c5(2)
59
What if we don’t know the
distribution of errors?
Answer: Use the empirical distribution:
1. Fit the model (assume it is true).
2. Then in the simulation, for each run, obtain a random sample of n residuals (with replacement) from the n observed residuals.
3. Compute the new Y values, run the regression, and
store the bootstrap slope value, b1*
This is the basic approach to the fixed-X sampling
bootstrap
60
To obtain a confidence interval:
1. Could use percentiles as previously.
2. A better approach is the reflection method:
d1 = b1 − b1*(α/2)
d2 = b1*(1 − α/2) − b1
b1 − d2 ≤ β1 ≤ b1 + d1
61
Random-X Sampling Version
When error variances are not constant or predictor
variables cannot be regarded as fixed constants, random
X sampling is used:
For each bootstrap sample, we sample a (Y, X) pair
with replacement from the data set.
In effect we sample rows of the data set with
replacement.
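A Python sketch of the random-X (pairs) bootstrap, including the reflection-method interval from the previous slide; the data here are simulated rather than the Toluca data.

import numpy as np

rng = np.random.default_rng(6)

def pairs_bootstrap_slopes(x, y, n_boot=2000):
    """Random-X (pairs) bootstrap: resample (X, Y) rows with replacement,
    refit the simple linear regression, and collect the bootstrap slopes b1*."""
    n = len(y)
    slopes = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # sample rows with replacement
        slopes[b] = np.polyfit(x[idx], y[idx], 1)[0]
    return slopes

# Demo on made-up data
x = rng.uniform(20, 120, size=25)
y = 62 + 3.6 * x + rng.normal(scale=50, size=25)
b1 = np.polyfit(x, y, 1)[0]
s = pairs_bootstrap_slopes(x, y)
d1 = b1 - np.percentile(s, 2.5)
d2 = np.percentile(s, 97.5) - b1
print(b1 - d2, b1 + d1)        # reflection-method interval for beta1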
62
Fixed X Example—Toluca Data
Assume that the base regression has been run. We
have stored the residuals in column c3, predicted
values in c4.
let k1 = k1 + 1
# resample 25 residuals from c3, with replacement, into c5
sample 25 c3 c5;
replace.
# bootstrap responses: fitted values plus resampled residuals
let c6 = c4 + c5
# refit and store this run's bootstrap slope b1* in row k1 of c8
Regress c6 1 'X';
Coefficients c7.
let c8(k1) = c7(2)
63
Neural Networks
The i-th observation is modeled as a nonlinear function of m derived predictors H_0, …, H_{m-1}:
Y_i = g_Y(H_i'\beta) + \varepsilon_i, \qquad H_i = (H_{i0}, \dots, H_{i,m-1})'
64
Neural Networks
OK, so what is g_Y and how are the predictors derived?
g_Y is usually a logistic function, and the H_j are a nonlinear function of a linear combination of the predictors X:
H_{ij} = g_j(X_i'\alpha_j), \quad j = 1, \dots, m-1 \ (H_{i0} \equiv 1)
Here X_i is the i-th row of the X matrix.
65
Neural Networks
Put these together and you get the neural network model:
Y_i = g_Y\!\Big(\beta_0 + \sum_{j=1}^{m-1} \beta_j\, g_j(X_i'\alpha_j)\Big) + \varepsilon_i
A common choice for all of the nonlinear functions is again the logistic:
g(z) = [1 + \exp(-z)]^{-1}
66
Neural Networks
The g_j functions are sometimes called the “activation” functions:
The original idea was that when a linear combination of the
predictors got large enough, a brain synapse would “fire”
or “activate.” So this was an attempt to model a “step”
input function.
67
68
Neural Networks
Using the logistic for the g_Y and g_j functions leads to the single-hidden-layer, feedforward neural network. This is sometimes called the single-layer perceptron.
Y_i = \left[1 + \exp(-H_i'\beta)\right]^{-1} + \varepsilon_i
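A numpy sketch of the forward pass of this model; the dimensions and data below are made up, and no fitting is done here.

import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_predict(X, alpha, beta):
    """Single-hidden-layer feedforward network with logistic activations.
    X:     n x p matrix whose first column is the constant 1.
    alpha: p x (m-1) coefficients for the hidden nodes H_1, ..., H_{m-1}.
    beta:  length-m output coefficients; H_0 = 1 is the constant node."""
    H = np.column_stack([np.ones(len(X)), logistic(X @ alpha)])   # derived predictors H_0..H_{m-1}
    return logistic(H @ beta)                                     # g_Y is also logistic

# Demo with made-up dimensions: 5 cases, 2 predictors (+ intercept), 3 hidden nodes
rng = np.random.default_rng(7)
X = np.column_stack([np.ones(5), rng.normal(size=(5, 2))])
alpha = rng.normal(size=(3, 3))        # p = 3 columns of X, m - 1 = 3 hidden nodes
beta = rng.normal(size=4)              # m = 4 (constant node + 3 hidden nodes)
print(nn_predict(X, alpha, beta))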
69
Network Representation
Useful to view as network and compare to multiple
regression:
70
Parameter Estimation:
Penalized Least Squares
Recall that we found if too many parameters are fit in OLS,
our ability to predict hold-out data can deteriorate. So we
looked at adjusted R2, AIC, BIC, Mallows Cp, which all
have built-in penalties for having too many parameters.
Dropping some predictors is like setting the corresponding
parameter estimate to zero, which “shrinks” the size of the
regression coefficient vector:
\mathrm{Size}(b) = \sum_i b_i^2
Another way to do this would be to leave all of the
predictors in, but require that there be penalty on the
estimation method for Size(b).
71
Parameter Estimation:
Penalized Least Squares
This leads to the “penalized least squares” method. Choose the parameter estimates to minimize:
Q = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 + p_\lambda
where the overfit penalty
p_\lambda = \lambda \sum_j b_j^2
is proportional to the sum of squares of the estimates.
72
Example Using JMP (SAS) Software
We’ll consider the Ischemic Heart Disease data set in
Appendix C.9. Response is log(total cost subscriber claims),
and the predictors considered are:
Note: X1 is variable 5, X2 is variable 6, X3 is variable 9, and X4
is variable 8. The first 400 observations are used to fit (train)
the model, and the last 388 are held out for validation
73
Example Using JMP (SAS) Software
74
Example Using JMP (SAS) Software
75
Example Using JMP (SAS) Software
76
Comparison with Linear
and Quadratic OLS Fits
77
Comparison of Statistical and NN Terms
78
A Sampling Application
• Frequently, we have an idea about the variability in y based on an x-variable
• Forestry application:
  – What is the average age of trees in a stand?
  – Diameter of a tree is “easy”
  – Age of a tree via ?????
79
• 20 trees in the sample; 1,132 in the forest
• Average diameter in the forest is 10.3; predicted average age from the fit is 118.3
• (The raw average age in the sample is 107.4; the average diameter in the sample is 9.44, i.e., the sample tended to have smaller trees in it. The fit corrects for this.)
• Note: got 118.46 using estimated standard deviations
80