Predicting the Future

An Introduction to Things to Come!
Administrivia
• The details on the class can be found here:
www.stanford.edu/class/hrp259
• Class participation plus lots of little assignments
and 2 exams determine your final grade.
• If you send me a computer virus or other
malicious code (even unintentionally) you will fail
the course.
– Visit ess.stanford.edu to get a virus scanner and keep
it up to date by clicking LiveUpdate….
• Dozens of new viruses are unleashed on
the world every day. Please update your
virus definitions and scan everything
before you send it to me.
How to Predict The Future
• Either the world is driven
completely by random chance
events (and your best bet for
predicting the future is using
Tarot cards or a Magic 8 Ball™)
or there are detectable patterns
in the world.
• If you talk to a preschool teacher
or a PhD in math, they will tell
you that math is all about
pattern detection.
Deterministic Modeling
• To predict the future using a deterministic model,
you would say that if event A happens then B
will happen. For example, in a deterministic
world, if you know that a baby gestates for
exactly 280 days (40 weeks), its weight will be
exactly 7 ½ lbs (3.4 kg). We know that in reality
it is typically in the range of 5.5–10 pounds
(2.5–4.5 kg).
• For better or for worse, we do not live in a
deterministic (Orwellian 1984ish) world and we
cannot usually make EXACT predictions in
medicine.
What We Want….
• You want to be able to fill in a table like this:
Weeks of gestation    Weight at birth (in lbs)
29 weeks              ?
…                     …
38 weeks              ?
39 weeks              ?
40 weeks              ?
…and express it with a simple formula like this:
lbs = weeks * something
The Process
• The process of going from a single
predictor or a set of predictors to a
predicted outcome is called statistical
modeling.
• People get far too excited about figuring
out which statistic (with accompanying p-value
anxiety) to use for the factors that
are used in models.
• Today I want to talk about the process of
building and using the most common
models that you will see in medicine.
What is a Model and Why Care?
• The predictors and the outcomes can be on a
continuous scale (time in days) or categorical
factors (mom smoked, yes or no).
• Generally we try to use all the information
available when we make a prediction about the
future.
– The amount of blood ejected each time the heart
beats (continuous scale) as opposed to whether or
not the heart is beating
– The number of cancer cells seen on a slide (or the
presence or absence of malignant cells)
• The models we build are remarkably similar
regardless of whether we have categorical or
continuous outcomes.
The Structure of a Model
• All the models I learned in school were formulated
at their core like this:
Outcome = baseline + predictor + predictor
Baby’s weight = some number + impact of time + impact of being a smoker
              = some number + (weeks * a number) + a number
• The math can get ugly very quickly depending on
the properties of the outcome (continuous, count,
categories) but the core idea is that these models
are all using additive contributions from some
predictors!
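A minimal Python sketch of that additive structure. The three coefficient values below are made-up placeholders chosen only so the arithmetic is easy to follow; they are not estimates from any dataset.

```python
# Hypothetical coefficients, chosen only to illustrate the additive structure.
BASELINE_LBS = -8.5       # "some number" (the baseline)
LBS_PER_WEEK = 0.4        # "a number" multiplied by weeks
SMOKER_EFFECT_LBS = -0.5  # "a number" added only when mom smoked


def predict_weight(weeks: float, mom_smoked: bool) -> float:
    """Outcome = baseline + impact of time + impact of being a smoker."""
    time_part = LBS_PER_WEEK * weeks
    smoker_part = SMOKER_EFFECT_LBS if mom_smoked else 0.0
    return BASELINE_LBS + time_part + smoker_part


print(predict_weight(40, mom_smoked=False))
print(predict_weight(40, mom_smoked=True))
```

Each predictor contributes its own piece and the pieces are simply summed, which is the core idea regardless of how the outcome is later transformed.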
How many predictors?
• You will be faced with the HARD question of how many
predictors to use.
– Stop and think about the LEVELS of your predictors. If you have
100 births in a dataset and 10 variables, many of which are
categorical, you can quickly find yourself making predictions
based on one or two births.
• Male babies (50%) from smoking mothers (20%) where this is the
mom’s 2nd or later birth (35%) = SMALL numbers.
• From 100 in the sample, you end up with about 3.5 births in this
level (100 * 0.50 * 0.20 * 0.35, if the factors are independent).
• Do you want to make generalizations to the WORLD based on 3 or 4
children?
• Are the factors correlated?
• Are some cheap to measure?
• Can you use some as proxies for others?
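The shrinking-cell arithmetic can be checked with a few lines of Python. This sketch assumes the three factors are independent of one another, which the slides do not state; the function name is made up for illustration.

```python
def expected_cell_count(n_total: int, proportions: list[float]) -> float:
    """Expected number of observations in one combination of levels,
    assuming the predictors are independent of one another."""
    count = float(n_total)
    for p in proportions:
        count *= p
    return count


# 100 births: male (50%), smoking mother (20%), 2nd or later birth (35%)
cell = expected_cell_count(100, [0.50, 0.20, 0.35])
print(round(cell, 2))
```

Every extra categorical predictor multiplies the count by another fraction, so the cells empty out quickly.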
Models
• One of my favorite statistics books is
Michael J. Crawley’s Statistical
Computing. He says:
1. All models are wrong.
2. Some models are better than others.
3. The correct model can never be known with
certainty.
4. The simpler the model, the better it is.
What Makes a Bad Model
• Predicts some outcomes poorly
• Is strongly influenced by a small number of
data points
• Shows systematic patterns in how it fails
to predict
Types of Statistical Modeling
• Think about the extremes of what you could use
when making a model:
• Null model – use the mean of everyone
• A theoretically minimally adequate model
• Current model you have specified
• A theoretical maximal model with every predictor
• Saturated model
– Every predictor and every interaction
– Have everyone predict themselves
What is an interaction?
• Sometimes factors in a model behave
strangely together. A mad scientist invented a
love potion and she looked at happiness in
the two genders and in people on and off her
drug. There was a huge increase in
happiness if you were male and on the drug
but minimal impact of gender or drug by
themselves.
Interaction Gender by Drug
[Figure: histograms of happiness (x-axis, 50 to 140; y-axis, percent, 0 to 35) in four panels, by gender (F, M) and drug status (drugged, normal)]
Panel        Mean     Standard Dev.
F, drugged   99.88    13.78
F, normal    103.7    16.70
M, drugged   120.1    12.73
M, normal    101.8    11.18
Interactions Again
• Imagine that there are two factors that impact a man’s
risk of prostate cancer: being Japanese and
living in Japan when you are a pre-teen. If you
are Japanese, your risk of prostate cancer is
lower than for men of any other race. If you
were raised in Japan, your risk is lower than for
men raised in the USA.
• The interaction term would measure if there is
extra protection for being both Japanese and
having been raised in Japan.
Categorical Interaction
• IsJapanese = 0 or 1 (1= yes and 0 = no)
• IsRaisedJapan = 0 or 1
• IsJapanese * IsRaisedJapan = ??
[Figure: annual probability of prostate cancer (y-axis, 0 to 0.1) for Not Japanese and Japanese men, split by Born USA and Born Japan]
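One way to see what the product of two 0/1 indicators does is to list all four combinations. The sketch below is in Python; the function names and the coefficient arguments (b0, b_japanese, b_raised, b_both) are hypothetical labels, not values from any study.

```python
def interaction_term(is_japanese: int, is_raised_japan: int) -> int:
    """Product of two 0/1 indicators: 1 only when both are 1."""
    return is_japanese * is_raised_japan


def linear_predictor(is_japanese, is_raised_japan, b0, b_japanese, b_raised, b_both):
    """baseline + two main effects + the interaction term."""
    return (b0
            + b_japanese * is_japanese
            + b_raised * is_raised_japan
            + b_both * interaction_term(is_japanese, is_raised_japan))


# Print the interaction column for every combination of the two indicators.
for j in (0, 1):
    for r in (0, 1):
        print(j, r, interaction_term(j, r))
```

The interaction coefficient only "switches on" for men who are both Japanese and raised in Japan, so it measures the extra effect beyond the two separate contributions.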
You can ALWAYS do better.
• You can keep adding more and more predictors
to a model but the price is the loss of
generalizability. Will you get another child who
is exactly that weight, whose mother smoked that
much, of that ethnicity, that gender, whose father
weighed that much, mother weighed that much,
etc.?
• Modeling methods compare the models and
frequently use criteria that penalize for extra
factors (for example, AIC, the Akaike information criterion).
Goals
• I see modeling as having two goals.
• Estimate parameters.
– How much weight gain occurs each week as a baby
is developing?
• Estimate how well it describes your data.
– How far off will my guess be when I predict the next
child?
– Are there regions where my guesses are far off, like
premature or late deliveries?
– Is there a lot of variability at one point and not at
others?
– Can I see any problems when I fit the model to THIS
data?
Looking for Errors
• Statisticians use the word “error”
differently than everyone else.
– You know that you will not have perfect
prediction. Instead, you will be off. That is
error. It does not mean somebody made a
mistake! It just means you can’t make a
perfect prediction.
– Specifying how far you will be off is the fun
and interesting part of statistics. The rest is
just math.
Looking at Errors
Outcome = baseline + predictor + predictor + error
Baby weight = some number + impact of time + impact of being a smoker + error
            = some number + (weeks * a number) + a number + a number drawn from a bell-shaped distribution
Looking for Errors
• Hopefully you will see that, given any specific
predictor value, your guessed values for the
outcome will be close to the values you actually
observe in the outcome. Also, any observed
outcome values that stray too far from your
guess are unlikely.
• That pattern of how far off your guesses are
from your observed data can frequently be
described by a bell-shaped (“normal”) histogram.
So, if you measure errors between your
prediction and the observed outcomes, the
distribution should be “normal.”
Guesses and Errors
[Figure: two histograms. Top: actual weights at 40-week births, running from about 5.5 lbs to 9.5 lbs, with the model's guess at 7.5 lbs. Bottom: errors at 40 weeks, centered on 0 error (the child was 7.5 lbs). Most errors are off by just a bit; the model rarely guesses way too high or way too low.]
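The idea of "guess plus bell-shaped error" can be imitated with simulated data. In this Python sketch the 7.5 lb guess comes from the slides, while the error standard deviation of 1 lb and the seed are assumptions made only for illustration.

```python
import random
import statistics

random.seed(259)  # fixed seed so the sketch is repeatable

GUESS_LBS = 7.5   # the model's guess at 40 weeks (from the slides)
ERROR_SD = 1.0    # hypothetical spread of the errors, in lbs

# Observed weight = model's guess + a number drawn from a bell-shaped distribution
observed = [GUESS_LBS + random.gauss(0, ERROR_SD) for _ in range(1000)]
errors = [y - GUESS_LBS for y in observed]

mean_error = statistics.mean(errors)
share_within_1sd = sum(abs(e) <= ERROR_SD for e in errors) / len(errors)
print(round(mean_error, 3), round(share_within_1sd, 3))
```

The average error should land near zero and roughly two thirds of the errors should fall within one standard deviation, which is what "most errors are off by just a bit" looks like in numbers.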
Looking at Errors
• There are some kinds of errors that you will be
unwilling to accept.
• If I want to predict the number of times an evil
lackey proposes marriage to a mad scientist, I
will not accept a negative number!
• If I am predicting the chance of someone
developing cancer, I will not accept a number
less than 0% or greater than 100%.
• Specifying the type of errors is a critical part of
building a model.
More on Errors
• In addition to specifying the range of legal
values, another critical component is specifying
the variability in the errors.
• If your data is constrained to lie between 0 and
1, what is the variability like?
– If the average is about .5, then you can have scores
that are both above and below that and the variability
drawn as a histogram may be well described by a bell
shaped curve. What is the variability like if the
average is about .95? Now the whole right side of
your curve is hanging off the side of your page!
• If you have count data (e.g., number of cancer
cells), your variability increases with the average
count.
• If you pretend your variability is normally distributed but your
outcome has a limited range, you clearly have problems.
[Figure: bell-shaped curves drawn over an outcome bounded between 0 and 1]
• In theory, the variance of count data increases with the mean.
[Figure: variance (y-axis) against mean (x-axis), both running from 0 to 20000]
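To put a number on "hanging off the side of your page", the sketch below asks how much of a normal curve lies above 1 when the average is 0.5 versus 0.95. The standard deviation of 0.1 is an assumption picked for illustration; `statistics.NormalDist` is part of the Python standard library.

```python
from statistics import NormalDist

SD = 0.1  # hypothetical standard deviation of the scores


def share_above_one(mean: float, sd: float = SD) -> float:
    """Fraction of a normal curve that lies above the upper bound of 1."""
    return 1.0 - NormalDist(mu=mean, sigma=sd).cdf(1.0)


print(share_above_one(0.5))   # essentially none of the curve is past 1
print(share_above_one(0.95))  # a large chunk of the curve is past 1
```

With the average near the boundary, a normal error model assigns a sizable share of its probability to values that can never occur.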
Ordinary Least Squares
• Perhaps the easiest models to draw and
understand are ones where you have a
continuous outcome like weight and a
continuous predictor like time.
• The model is just a line….
• Y = mX + b
Weight = estimated weight gain each week after conception * number
of weeks + weight at 0 weeks
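For a single predictor, the least-squares slope and intercept have a closed form. The Python sketch below uses made-up (weeks, grams) pairs that lie exactly on a line so the answer can be checked by hand; they are not the class data.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + b with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx            # slope: estimated gain per week
    b = mean_y - m * mean_x  # intercept: value of the line at 0 weeks
    return m, b


# Made-up (weeks, grams) pairs for illustration only; not the class data.
weeks = [30, 32, 34, 36, 38, 40]
grams = [1500, 1900, 2300, 2700, 3100, 3500]

m, b = fit_line(weeks, grams)
print(m, b)
```

Note that the intercept, the "weight at 0 weeks", comes out negative here: a straight line fit to late-pregnancy data says nothing sensible about week 0.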
Maximum Likelihood Visual
[Figure: scatter plot of FETAL_WGT_ (y-axis, 1000 to 5000) against GWKS_DEL (x-axis, 30 to 40)]
Bad Models
• All models are wrong.
• Your data is sacred (after you remove the
pregnant men) and you fit models to the
data. You do not fit data to a model. That
difference is not a minor semantic detail.
Poor Predictions
• Sometimes you have data points that are
not well fit by the model. Go to extreme
measures to document those points. If the
data is not a true error, then run the
analysis with it and without it. Include the
point(s) in all your plots with a special
symbol and if one person changes your
inferences, consider excluding them.
– You may have different subgroups that you
have not identified yet.
A True Outlier
[Figure: scatter plot of FETAL_WGT_ (y-axis, 1000 to 5000) against GWKS_DEL (x-axis, 30 to 40); one point is labeled “Induced because of HUGE size”]
Looking at Residuals
• A critical step in examining the quality of a
model is graphically looking at the
residuals.
• Residuals are the differences between the
estimated values and the observed values
for each person/critter/observation.
• Look for curves, changing variability
across the range of values or changes
over time.
Patterns in Residuals
From Crawley: Statistical Computing
Partial Residuals
• When you have multiple predictors in a model,
you can ask for residuals where you have
“controlled for” or “removed the effect of” the
other factors.
• Evaluate software packages on their ability to
produce these graphics. They are completely
missing from Excel, they are not built into SAS
(but can be done with minor work), but they are
trivially easy in R or S-Plus.
Curve Fitting
• Linear models can model curves
– The math is not too bad….
• You can use explicit mathematical
formulas. If you see curves in your
residuals, you can use things like:
– Polynomials or inverse polynomials
– Exponentials
– Power functions
Keywords in Parametric Modeling
and What They Look Like
S-shaped curve
From Crawley Statistical Computing
Nonlinear Regression
• Often the formulas to describe your data are
extraordinarily complicated and you want to use
non-linear or non-parametric modeling instead.
• Key words you will see include:
– Non-parametric smoothing
• Lowess regression
• Spline regression
– GAM
– Tree models
A Bad Fit
• What happens when you fit a straight
linear model to curvilinear data?
[Figure: two panels against X = Age (0 to 50). One shows Y = Size (0 to 140) with the annotation “Is this better than a flat line at the mean?”; the other shows the residuals.]
Is it good?
• A tiny p-value does not mean a good model!
• Where on the output does it tell you that this is a good
or a poor model?
Residuals?
• Flatten the line, then look up and down to see if you are
systematically off.
[Figure: Y = Size (0 to 140) against X = Age (0 to 50)]
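A systematic pattern is easy to produce on purpose. This Python sketch fits a straight line to made-up data that follow a curve (size proportional to age squared) and prints the sign of each residual in age order; residuals are taken here as observed minus fitted.

```python
def fit_line(xs, ys):
    """Ordinary least squares for Y = m*X + b with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


# Made-up curvilinear data: size grows with the square of age.
ages = list(range(0, 51, 5))
sizes = [0.05 * a ** 2 for a in ages]

m, b = fit_line(ages, sizes)
residuals = [y - (m * x + b) for x, y in zip(ages, sizes)]

# Sign of each residual, in age order
signs = "".join("+" if r > 0 else "-" for r in residuals)
print(signs)  # prints ++-------++
```

The line is too low at both ends and too high in the middle. A run of same-signed residuals like this is the curve you look for after flattening the line, and no p-value on the slope will show it.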
Curve Fitting!
• You can build a model that has a curve
using a polynomial… the degree of the
polynomial determines how many “bends”
appear in a curve. So a 2nd degree
polynomial would use x and x² while a 3rd
degree polynomial would use x, x², and
x³. These squared or cubed values don’t
do anything especially complicated. They
are just like adding new variables.
Polynomials
size = intercept + X * something + X² * something else
poly2 = lm(y~poly(x,2))
size = intercept + X * something + X² * something else + X³ * another thing
poly3 = lm(y~poly(x,3))
[Figure: two panels of Y = Size (0 to 140) against X = Age (0 to 50), one for the 2nd degree fit and one for the 3rd degree fit]
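To show that squared and cubed terms really are "just new variables", here is a Python sketch that builds the columns 1, x, x², … and solves the usual least-squares equations. It uses raw powers, whereas R's poly() uses orthogonal polynomials by default; the coefficients differ but the fitted curve is the same. The data are made up and follow a quadratic exactly, and the helper names are hypothetical.

```python
def solve(a, y):
    """Solve the square system a * coef = y by Gaussian elimination with pivoting."""
    n = len(y)
    m = [row[:] + [rhs] for row, rhs in zip(a, y)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coef = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (m[i][n] - tail) / m[i][i]
    return coef


def fit_polynomial(xs, ys, degree):
    """Least squares on the columns 1, x, x^2, ..., x^degree."""
    columns = [[x ** p for x in xs] for p in range(degree + 1)]
    # Normal equations: (X'X) coef = X'y
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(u * y for u, y in zip(ci, ys)) for ci in columns]
    return solve(xtx, xty)


# Made-up data that follow size = 10 + 2*age + 0.05*age^2 exactly.
ages = [0, 10, 20, 30, 40, 50]
sizes = [10 + 2 * a + 0.05 * a ** 2 for a in ages]

intercept, b1, b2 = fit_polynomial(ages, sizes, degree=2)
print(round(intercept, 3), round(b1, 3), round(b2, 3))
```

The fitting routine never knows that one column is the square of another; it treats x² exactly like any other predictor.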
What is a good fit?
• Choosing where to stop adding terms to a
model is as much an art as a science. You
can do comparisons between the models
and ask to see if it is a statistically
significant difference.
• There are systems, like AIC, for
penalizing your model
as you add more and
more factors.
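A sketch of how such a penalty works, in Python. For a least-squares fit with normal errors, AIC can be written as n * ln(RSS / n) + 2k, up to a constant shared by every model fit to the same data, where RSS is the residual sum of squares and k the number of estimated coefficients. The three RSS values below are invented for illustration.

```python
import math


def aic_least_squares(rss: float, n: int, k: int) -> float:
    """AIC for a least-squares fit with normal errors, up to a constant
    shared by all models fit to the same n observations.
    rss = residual sum of squares, k = number of estimated coefficients."""
    return n * math.log(rss / n) + 2 * k


# Hypothetical residual sums of squares for three nested fits to 50 points.
n = 50
candidates = {"line": (900.0, 2), "quadratic": (400.0, 3), "cubic": (398.0, 4)}

scores = {name: aic_least_squares(rss, n, k) for name, (rss, k) in candidates.items()}
best = min(scores, key=scores.get)
print(best)
```

Lower AIC is better. In this made-up case the cubic barely reduces the residual sum of squares, so the penalty of 2 per extra coefficient outweighs the gain and the quadratic wins.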
Generalized Linear Models
• You will eventually move out of the realm of
predicting continuous outcomes with normal
error. When you do, you will move into the
realm of Generalized Linear Models (GLM).
• You want to have a linear model predicting an
outcome where you restrict the possible
outcome values (e.g., only allow values between
0 and 1) and deal with errors not being
consistently normal across the entire range.
• You can change (transform) your outcome and
model this with just another linear model similar
to what I have shown.
GLM in English
• If you are predicting the number of bacteria you
see in a Petri dish, you can not possibly see a
negative number of bacteria. A GLM model can
be written so that your predicted values can not
be negative.
• Contrast this with the baby weight example
where with a bit of bad data for your predictor
value, you could have the formula spit out a
negative weight or a baby weighing a ton.
GLM
• Instead of modeling like this:
Outcome = baseline + predictor + predictor + error (normal/bell-shaped)
• You can model with GLM like this:
Tweaked outcome = baseline + predictor + predictor + not normal error
Ordinary Regression
• So, the ordinary least squares regression
models that I have shown are really just a
special case of GLM. In these cases I specify that
the tweak to the outcome is to just make
the outcome identical to what it was
originally and the error is normal.
• The tweak to the outcome is called the link
and in this case the link is called identity.
Link Functions
• The tweaks to the outcome are called
links:
• Identity link = predicting a continuous
outcome (baby weight)
• Log link = if you can’t have negative
values
• Logit link = if you have to restrict the range
to between 0 and 1
• There are other links.
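The three links named above, and the inverse functions that turn a linear prediction back into the original scale, can be written in a few lines of Python. The function names are chosen here for illustration.

```python
import math


def identity(mu):
    """Identity link: leave the outcome as it is."""
    return mu


def log_link(mu):
    """Log link: the outcome must be greater than 0."""
    return math.log(mu)


def logit(p):
    """Logit link: the outcome must be strictly between 0 and 1."""
    return math.log(p / (1 - p))


def inv_log(eta):
    """Back-transform for the log link: always positive."""
    return math.exp(eta)


def inv_logit(eta):
    """Back-transform for the logit link: always between 0 and 1."""
    return 1 / (1 + math.exp(-eta))


# Whatever the linear part predicts, the back-transformed values stay legal.
for eta in (-5.0, 0.0, 5.0):
    print(eta, inv_log(eta), inv_logit(eta))
```

The linear part of the model is free to produce any number, and the inverse link is what keeps the final prediction inside the legal range.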
Error Structure
• Why bother to specify an error structure other than
normal?
– Strong skew, kurtosis, bounded outcomes, counts that can not be negative
• The shape of the error distribution is not always a bell-shaped
curve. Rather than worrying about the math to describe
those curves, you simply need to know that different
types of data have different error structures.
– Normal errors: continuous outcomes
– Poisson errors: counts
– Binomial errors: proportions
– Gamma errors: continuous outcomes with a constant coefficient of variation
Binary Response
• If you are not dealing with a continuous
outcome, or count data, you will likely have a
binary (yes/no scored as 1 or 0) outcome.
• Clearly you need to do some major tweaking to
the outcome because linear models, as we have
seen, can predict very large and small numbers.
• Also, the variability of a binary outcome is very
different from a continuous variable.
Logistic Regression
• The solution is to specify a link that limits values
to be between 0 and 1 (think of the changed
outcome as being the probability of being scored
1) and use an error term that behaves well with
binary outcomes.
• This is a GLM with a logit link and binomial
errors.
• This kind of analysis is so popular most people
don’t know it is a GLM. Rather, they know it only
as logistic regression.
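A bare-bones sketch of fitting such a model, in Python. Real software uses a faster algorithm; plain gradient ascent on the binomial log-likelihood is swapped in here because it fits in a few lines. The data, step count, and learning rate are all made up for illustration.

```python
import math


def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))


def fit_logistic(xs, ys, steps=5000, rate=0.1):
    """Maximize the binomial log-likelihood by plain gradient ascent.
    Returns (intercept, slope) on the logit scale."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, y in zip(xs, ys):
            miss = y - inv_logit(b0 + b1 * x)  # observed minus predicted probability
            g0 += miss
            g1 += miss * x
        b0 += rate * g0 / n
        b1 += rate * g1 / n
    return b0, b1


# Made-up data: x is a centered predictor, y is a yes/no outcome scored 1 or 0.
xs = [-3, -2, -2, -1, -1, 0, 0, 1, 1, 2, 2, 3]
ys = [0, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
probs = [inv_logit(b0 + b1 * x) for x in xs]
print(round(b0, 2), round(b1, 2))
```

Every predicted value in `probs` is a probability between 0 and 1, no matter how extreme the predictor, which is exactly what the logit link buys you.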
Before Next Time
• The first assignment is on the class
website and is due before the start of class
Wednesday.
• Print the slides as close as possible to
class time.