Comparing models - an example with polynomials

Data Handling & Analysis
Polynomials and model fit
Andrew Jackson
a.jackson@tcd.ie
Linear type data
• How are two measures related?
What do we do about curvature?
• Data are the number of
species (Y) recorded
per time spent looking
for them (X)
• Specifically, these data
come from fisheries
data
• Good proxy for species
diversity in the marine
habitat
Clearly a straight line won’t do
… the residuals are horrible
Polynomials
• Polynomials are models that
are linear in their coefficients
but can describe curvature
– Quadratics
• $Y = b_0 + b_1 X + b_2 X^2$
– Cubics
• $Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3$
– 5th, 6th order
polynomials etc…
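A polynomial of any order can be fitted by ordinary least squares because it is linear in its coefficients. The block below is a minimal sketch of that idea using only the Python standard library. The names fit_polynomial, predict, x_demo and y_demo are hypothetical, and the demo values are made up for illustration; they are not the fisheries data shown in the slides.

```python
def fit_polynomial(x, y, order):
    """Least-squares coefficients [b0, b1, ..., b_order] for Y = b0 + b1*X + ..."""
    n_coef = order + 1
    # Normal equations (A^T A) b = A^T y, where A has columns 1, x, x^2, ...
    ata = [[sum(xi ** (r + c) for xi in x) for c in range(n_coef)]
           for r in range(n_coef)]
    aty = [sum(yi * xi ** r for xi, yi in zip(x, y)) for r in range(n_coef)]
    # Solve by Gaussian elimination with partial pivoting on the augmented matrix
    aug = [row + [rhs] for row, rhs in zip(ata, aty)]
    for col in range(n_coef):
        pivot = max(range(col, n_coef), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n_coef):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n_coef + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution
    coef = [0.0] * n_coef
    for r in reversed(range(n_coef)):
        known = sum(aug[r][c] * coef[c] for c in range(r + 1, n_coef))
        coef[r] = (aug[r][n_coef] - known) / aug[r][r]
    return coef


def predict(coef, x):
    """Evaluate the fitted polynomial at each value of x."""
    return [sum(b * xi ** p for p, b in enumerate(coef)) for xi in x]


# Made-up illustrative data lying exactly on Y = 1 + 2X - 0.5X^2
x_demo = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y_demo = [1.0 + 2.0 * xi - 0.5 * xi ** 2 for xi in x_demo]
quad_coef = fit_polynomial(x_demo, y_demo, order=2)
```

Setting order to 0, 1, 2 or 3 gives the null, straight-line, quadratic and cubic models respectively. For high orders the normal equations become badly conditioned, so a dedicated statistics package is the safer choice in practice.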
Quadratic model
Quadratic residuals
• Better…
• But not so good at
lower values of x
• Try a more complicated
model like a cubic
Cubic model
• Note the double
curvature
• Model appears to
explain the lower values
better
• But how sure are we of
the increase at higher
values?
Cubic residuals
• Better than the
quadratic
• But still over-estimating
the lowest values of x
Log transform the X variable
• Model is
– Y~log(X)
• Appears to explain the
data very well across
the full range
• Check the residuals…
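The Y~log(X) model is still a straight-line fit, just with log(X) as the predictor. A minimal sketch, assuming natural logarithms and strictly positive X; fit_log_model and the demo values are hypothetical and are not the data from the slides.

```python
import math


def fit_log_model(x, y):
    """Least-squares intercept and slope for Y = b0 + b1*log(X)."""
    log_x = [math.log(xi) for xi in x]  # requires every x > 0
    mean_lx = sum(log_x) / len(log_x)
    mean_y = sum(y) / len(y)
    sxy = sum((lx - mean_lx) * (yi - mean_y) for lx, yi in zip(log_x, y))
    sxx = sum((lx - mean_lx) ** 2 for lx in log_x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_lx
    return intercept, slope


# Made-up illustrative data lying exactly on Y = 2 + 3*log(X)
x_log_demo = [1.0, 2.0, 4.0, 8.0, 16.0]
y_log_demo = [2.0 + 3.0 * math.log(xi) for xi in x_log_demo]
log_intercept, log_slope = fit_log_model(x_log_demo, y_log_demo)
```

This model has the same number of coefficients as the straight line (an intercept and a slope), which matters when parameters are counted for model comparison.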
Y~log(X) residuals
• Now these look pretty
near perfect
The null model
• Consists of a mean and a
variance only
• It gives us a benchmark
against which we can test
our models that include
more information
• If we can’t do better than
the null model then we
don’t understand our
data or system!
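As a minimal sketch, the null model can be written as a mean, a variance and the residuals around that mean. The function name fit_null_model and the demo values are hypothetical. The variance here is assumed to be the maximum-likelihood estimate (dividing by n); the slides do not say which estimator was used.

```python
import statistics


def fit_null_model(y):
    """Return the mean, the variance and the residuals of an intercept-only model."""
    mean = statistics.fmean(y)
    # Assumption: maximum-likelihood variance (divide by n, not n - 1)
    variance = statistics.pvariance(y, mu=mean)
    residuals = [yi - mean for yi in y]
    return mean, variance, residuals


# Made-up illustrative response values
y_null_demo = [2.0, 4.0, 4.0, 6.0]
null_mean, null_var, null_resid = fit_null_model(y_null_demo)
```

Every prediction from this model is the same value, the mean, so any pattern left in the residuals against X is information a better model could use.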
Residuals of the null model
Choosing between alternative models
• We now have a choice between 5 models
– Null model (zero order polynomial, which includes an
intercept only – i.e. just a mean and variance model)
– Straight line (first order polynomial)
– Quadratic (second order polynomial)
– Cubic (third order polynomial)
– First order polynomial with log(X)
• How do we select which one to use?
– Higher order polynomials require more parameters
Parsimony as a central tenet
• Parsimony is the application of the
simplest explanation for a phenomenon and
underpins all of science
• So we need to pick the model that
– Fits the data the best, and …
– Uses the fewest parameters
Likelihood of data
AIC for model selection
• We will use Akaike’s Information Criterion
(AIC) to select the most suitable model
• AIC = -2Log(likelihood) + 2k
– Log-likelihood gets bigger the better the fit
– k is the number of parameters in the model
• Lower AIC = more suitable model
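The formula above can be sketched for a least-squares fit as follows. This assumes independent Gaussian errors with the variance estimated by maximum likelihood (RSS / n), which the slides do not state explicitly. Whether the residual variance is counted in k is a convention; the example assumes it is. The function names are hypothetical and the residuals are made up.

```python
import math


def gaussian_log_likelihood(residuals):
    """Maximised Gaussian log-likelihood for a set of model residuals."""
    n = len(residuals)
    rss = sum(r * r for r in residuals)
    sigma2 = rss / n  # maximum-likelihood estimate of the error variance
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2) + 1.0)


def aic(residuals, k):
    """AIC = -2*log(likelihood) + 2k, where k is the number of parameters."""
    return -2.0 * gaussian_log_likelihood(residuals) + 2.0 * k


# Made-up residuals from two hypothetical models with the same k.
# Assumption: k = 2 counts one coefficient plus the error variance.
aic_loose = aic([1.0, -1.0, 1.0, -1.0], k=2)
aic_tight = aic([0.1, -0.1, 0.1, -0.1], k=2)
```

With k held equal, the model with the smaller residuals has the higher log-likelihood and so the lower AIC. Adding a parameter raises AIC by 2 unless the fit improves enough to compensate.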
AIC of our models
Model            AIC
Null model       248.2
Straight line    184.1
Quadratic        142.5
Cubic            124.9
4th order         83.5
5th order         77.6
6th order         77.7
log(X)            68.4
So the log(x) model is the best in this case
Note that adding more orders to the polynomials ceases to confer
any benefit after 5th order. Also… these get increasingly difficult to
explain and relate to biological phenomena
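The comparison can be reproduced from the reported values alone. The sketch below ranks the models by AIC and computes each model's difference from the best one; the AIC values are those given in the slides, while the variable names are hypothetical.

```python
# AIC values as reported in the slides
aic_values = {
    "Null model": 248.2,
    "Straight line": 184.1,
    "Quadratic": 142.5,
    "Cubic": 124.9,
    "4th order": 83.5,
    "5th order": 77.6,
    "6th order": 77.7,
    "log(X)": 68.4,
}

# Lowest AIC is the preferred model
best_model = min(aic_values, key=aic_values.get)

# Models ordered from lowest (best) to highest AIC
ranked = sorted(aic_values, key=aic_values.get)

# Difference in AIC between each model and the best one
delta_aic = {name: round(value - aic_values[best_model], 1)
             for name, value in aic_values.items()}
```

Going from 5th to 6th order raises the AIC slightly (77.6 to 77.7), so the extra parameter no longer pays for itself.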
Conclusions
• AIC provides an objective way to compare
alternative models
• Lower AIC indicates a more parsimonious model
• Only compare AIC between models fitted to exactly
the same response variable
• Only provides relative, and not absolute
indication of model fit
– Still need to check that the model is any good
– Residuals etc…