Data Handling & Analysis
Polynomials and model fit
Andrew Jackson
a.jackson@tcd.ie

Linear type data
• How are two measures related?

What do we do about curvature?
• Data are the number of species recorded (Y) against the time spent looking for them (X)
• Specifically, these are fisheries data
• Good proxy for species diversity in the marine habitat

Clearly a straight line won’t do … the residuals are horrible

Polynomials
• Polynomials are linear models (linear in their coefficients) that can show curvature
  – Quadratics
    • $Y = b_0 + b_1 X + b_2 X^2$
  – Cubics
    • $Y = b_0 + b_1 X + b_2 X^2 + b_3 X^3$
  – 5th, 6th order polynomials etc…

Quadratic model

Quadratic residuals
• Better…
• But not so good at lower values of X
• Try a more complicated model like a cubic

Cubic model
• Note the double curvature
• Model appears to explain the lower values better
• But how sure are we of the increase at higher values?

Cubic residuals
• Better than the quadratic
• But still over-estimating at the lowest values of X

Log transform the X variable
• Model is
  – Y ~ log(X)
• Appears to explain the data very well across the full range
• Check the residuals…

Y ~ log(X) residuals
• Now these look pretty near perfect

The null model
• Consists of a mean and a variance only
• It gives us a benchmark against which we can test our models that include more information
• If we can’t do better than the null model then we don’t understand our data or system!

Residuals of the null model

Choosing between alternative models
• We now have a choice between 5 models
  – Null model (zero order polynomial, which includes an intercept only, i.e. just a mean and variance model)
  – Straight line (first order polynomial)
  – Quadratic (second order polynomial)
  – Cubic (third order polynomial)
  – First order polynomial with log(X)
• How do we select which one to use?
  – Higher order polynomials require more parameters

Parsimony as a central tenet
• Parsimony is the application of the simplest explanation for a phenomenon and underpins all of science
• So…
We need to pick the model that
  – Fits the data best, and …
  – Uses the fewest parameters

Likelihood of data

AIC for model selection
• We will use Akaike’s Information Criterion (AIC) to select the most suitable model
• $\mathrm{AIC} = -2\log(\text{likelihood}) + 2k$
  – The log-likelihood gets bigger the better the fit
  – k is the number of parameters in the model
• Lower AIC = more suitable model

AIC of our models
• Null model: 248.2
• Straight line: 184.1
• Quadratic: 142.5
• Cubic: 124.9
• 4th order: 83.5
• 5th order: 77.6
• 6th order: 77.7
• log(X): 68.4
• So the log(X) model is the best in this case
• Note that adding more orders to the polynomials ceases to confer any benefit after 5th order
• Also… these get increasingly difficult to explain and relate to biological phenomena

Conclusions
• AIC provides an objective way to compare alternative models
• Lower AIC indicates a more parsimonious model
• Must only compare AIC across models of the exact same response variable
• Only provides a relative, not an absolute, indication of model fit
  – Still need to check that the model is any good
  – Residuals etc…
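
Worked sketch: comparing models by AIC

The comparison described in these slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the analysis behind the AIC values quoted above. The data are synthetic (generated from a log curve plus noise), not the fisheries data. The helper names `solve`, `fit` and `aic` are made up for this example. Errors are assumed Gaussian, and k counts the residual variance as one parameter alongside the coefficients, which is a common convention but an assumption here.

```python
import math
import random


def solve(A, b):
    """Solve the square system A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(M[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (M[r][n] - tail) / M[r][r]
    return z


def fit(basis, x, y):
    """Least-squares coefficients for y regressed on the columns basis(x_i)."""
    X = [basis(xi) for xi in x]
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)  # normal equations: (X'X) b = X'y


def aic(basis, x, y):
    """AIC = -2 log(likelihood) + 2k for a least-squares fit with Gaussian errors."""
    coef = fit(basis, x, y)
    fitted = [sum(c * v for c, v in zip(coef, basis(xi))) for xi in x]
    n = len(y)
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma2 = rss / n  # maximum-likelihood estimate of the residual variance
    log_lik = -0.5 * n * (math.log(2 * math.pi * sigma2) + 1)
    k = len(coef) + 1  # assumption: the residual variance counts as a parameter
    return -2 * log_lik + 2 * k


# The five candidate models from the slides, each written as a list of predictor columns.
models = {
    "Null model": lambda v: [1.0],
    "Straight line": lambda v: [1.0, v],
    "Quadratic": lambda v: [1.0, v, v ** 2],
    "Cubic": lambda v: [1.0, v, v ** 2, v ** 3],
    "log(X)": lambda v: [1.0, math.log(v)],
}

# Synthetic stand-in data (NOT the fisheries data): a log curve plus Gaussian noise.
rng = random.Random(1)
x = [float(i) for i in range(1, 31)]
y = [2.0 + 3.0 * math.log(xi) + rng.gauss(0.0, 0.1) for xi in x]

scores = {name: aic(basis, x, y) for name, basis in models.items()}
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:14s} AIC = {score:8.1f}")
best = min(scores, key=scores.get)
print("Lowest AIC:", best)
```

Because every model here is fitted to the same response Y, the AIC values are directly comparable. The lowest score marks the preferred model among the candidates, but it says nothing about whether that model is good in absolute terms, so the residuals still need checking.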