Model fitting lab

The purpose of this lab is to build von Bertalanffy growth curves (a nonlinear model) to
examine the relationship between total length (TL) and age (in years) for sturgeon from the
Yellow River and the Suwannee River, Florida. After building curves for each river
separately, you will test whether the curves are the same between the rivers. You
will then have the option (for bonus points) to test each parameter in the model
individually. Note that this approach can be used with many types of nonlinear models.
We are assuming an additive error structure, but other error structures (such as
multiplicative) can also be fit by transforming the data. This is a simple framework that
you can adapt to many other applications.
Steps:
(1) First read in the data and create plots of age (X-axis) and length (Y-axis).
(2) Fit a linear model with length as the response (y-axis) and age as the predictor
(x-axis) for each river (a simple linear regression model is fine).
(3) Estimate the von Bertalanffy growth curve for each river. The equation is
Predicted length = Linf*(1 - EXP(-k*(age - t0)))
These parameters have biological meaning: Linf is the asymptotic length (related to the
average maximum size of an animal), k is the growth coefficient (how quickly length
approaches Linf, often interpreted as a metabolic rate), age is the observed age at that
specific length, and t0 is the theoretical age at which length is zero (this is usually a
really small number, sometimes a little above or below zero; don’t worry if it is, since
this usually indicates poor data resolution at the younger ages, because young fish are
more difficult to collect). The SAS and R code in this lab writes t0 as "to".
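To make the roles of the three parameters concrete, here is a minimal R sketch of the growth equation. The function name vb_length and the parameter values are illustrative assumptions, not estimates from the lab data.

```r
# von Bertalanffy growth function (vb_length is a hypothetical helper name)
vb_length <- function(age, linf, k, t0) {
  linf * (1 - exp(-k * (age - t0)))
}

# Made-up parameter values, chosen only to show the shape of the curve
linf <- 2000   # asymptotic length
k    <- 0.2    # growth coefficient (per year)
t0   <- -0.5   # theoretical age at which length is zero

ages <- c(t0, 0, 1, 5, 10, 50)
pred <- vb_length(ages, linf, k, t0)

# Predicted length is zero at age t0 and climbs toward linf as age increases
print(data.frame(age = ages, predicted_length = round(pred, 1)))
```

Changing k while holding linf fixed changes how quickly the curve flattens out, which is why k and Linf estimates tend to be strongly correlated in practice.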
Directions to do this in Excel are provided in the spreadsheet. To do this in SAS, use
proc nlin. Read the data in first and make sure everything is OK, then check that there
are columns for fish age, length, and river name.
proc nlin best = 1 maxiter = 500; *best = 1 prints only the best combination of
starting values from the grid below, and maxiter = 500 stops the fit after at
most 500 iterations;
by river; *this tells SAS to fit the model separately for each river (the data
must already be sorted by river);
*below are the starting values for the model parameters. Each range of the
form start to stop by step defines a grid of starting values, and SAS begins
the search from the grid combination with the smallest residual sum of
squares. The grid does not bound the final estimates (use a bounds statement
if you need that). Choose the ranges based on what you know about the data
and the model. If you do not know anything, it can take a long time to fit
the data and you can encounter false minima, which can be bad solutions;
parms linf = 100 to 1800 by 100
      to = 0 to 1 by 0.5
      k = 0 to 5 by 1;
*below is the von Bertalanffy growth equation using the
parameter names given above;
model tl = linf * (1 - exp(-k * (age - to)));
*below I am specifying the output that I will need, and putting this output
in a dataset called c;
output out = c parms = Linf to k p = vbp r = yresid;
title 'VB Growth';
run;
Here is some code that might be helpful for plotting the predicted (vbp) and observed
(tl) lengths against age, using dataset c from the output statement above:
proc gplot data = c;
by river;
plot vbp*age tl*age / overlay;
run;
quit;
**************************
Here is some R code that might help. Try several fitting functions (nlm, nls, and optim
are shown below) and see if they all come up with about the same estimates.
#age and len must already exist as numeric vectors of the same length
#(the observed ages and lengths for one river)
#p is a parameter vector p = (Linf, k, to); these are initial guesses
p<-c(2200,0.3,-0.1)
#fn1 takes the parameter vector p and returns the residual sum of squares (SS).
#plen holds the predicted lengths from our equation; p[1] refers to the first
#element of p, which is our Linf value
fn1<-function(p){
  plen<-p[1]*(1-exp(-p[2]*(age-p[3]))) #predicted lengths given age and p
  ss<-sum((len-plen)^2) #SS criterion for predicted lengths at age
  return(ss) #returns the sum of squares (what you want to minimize)
}
#predlen returns the predicted values for plotting
predlen<-function(p){
  p[1]*(1-exp(-p[2]*(age-p[3])))
}
#The following is essentially the same as a Solver call in Excel. nlm stands for
#nonlinear minimization; its main arguments are
#nlm(function name, starting parameters, hessian = TRUE to return the Hessian),
#plus optional scaling arguments (typsize, fscale). Use ?nlm to get help on using nlm.
#Note there are several different fitting routines in R and SAS.
#R is very sensitive to starting values, so give R different
#starting values, run it again, and see how it does.
#Run ?nlm and see what the convergence codes mean;
#you want to have a code of 1.
fit<-nlm(fn1,p,hessian = TRUE) #fit now stores the estimated parameter info
#Below, fit2 and fitoptim demo two other fitting techniques. optim is the most general,
#as you can choose BFGS and other optimization methods.
#Warning: all of these functions are sensitive to starting values!
#Think about scaling your values so they are all on the same order of magnitude,
#i.e., see how Linf, k, and to (the model parameters) are all really different
#in size. This can cause problems in R (SAS and Excel use an automatic
#scaling routine).
#fit2<-nls(len~Linf*(1-exp(-k*(age-to))),start=list(Linf=2200,k=0.3, to=-0.1))
#fitoptim<-optim(par=p,fn=fn1, method='BFGS')
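#This is how you get the standard errors for the estimated parameters
#using the Hessian information. The residual variance is SS/(n - number of
#parameters), and the Hessian of SS is about twice the cross-product matrix
#used in least squares, hence the factor of 2.
SE<-sqrt(diag(2*fit$minimum/(length(age)-length(p))*solve(fit$hessian)))
p<-fit$estimate #assign the estimated parameters to p
#Hint: the SS you need for the comparison test (Question 2) is the first value
#(labeled minimum) returned by the nlm fitting routine.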
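The code above assumes that age and len already exist. As a hedged, self-contained sketch, the block below simulates a data set with known parameters (the values 2000, 0.2, and -0.5 and the error SD of 50 are arbitrary assumptions, not the lab data), fits the curve with nls, and pulls out the estimates, standard errors, and residual SS.

```r
set.seed(1)

# Simulated (not real) data: known "true" parameters plus additive normal error
true_linf <- 2000
true_k    <- 0.2
true_to   <- -0.5
age <- rep(1:20, each = 10)
len <- true_linf * (1 - exp(-true_k * (age - true_to))) +
  rnorm(length(age), mean = 0, sd = 50)

# Fit the von Bertalanffy curve by nonlinear least squares
fit_nls <- nls(len ~ Linf * (1 - exp(-k * (age - to))),
               start = list(Linf = 2200, k = 0.3, to = -0.1))

est <- coef(fit_nls)                                  # parameter estimates
se  <- summary(fit_nls)$coefficients[, "Std. Error"]  # standard errors
rss <- deviance(fit_nls)                              # residual sum of squares

# The residual SS is the same quantity fn1 would return at the estimates
rss_check <- sum((len - fitted(fit_nls))^2)

print(round(cbind(estimate = est, se = se), 3))
print(c(rss = rss, rss_check = rss_check))
```

Because the true values are known here, this is a convenient way to check that a fitting routine and a set of starting values behave sensibly before moving on to the real data.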
************
Question 1. What are your parameter estimates for each river for both a linear and a
von Bertalanffy growth model? Make a plot showing the observed data and the
predicted values for each model and each river. Create, examine, and turn in the
residual plot and the plot of the predicted and observed values for each river and
model. Write 2-3 sentences explaining which model appears to fit better for each river.
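One possible way to set up the linear versus von Bertalanffy comparison in R is sketched below. The data are simulated for a single hypothetical river (they are not the lab data), and the layout of the plots is only a suggestion.

```r
set.seed(2)

# Simulated stand-in for one river (assumed values, not the lab data)
age <- rep(1:20, each = 8)
len <- 2000 * (1 - exp(-0.2 * (age + 0.5))) + rnorm(length(age), sd = 50)

# Fit both candidate models to the same data
lin_fit <- lm(len ~ age)
vb_fit  <- nls(len ~ Linf * (1 - exp(-k * (age - to))),
               start = list(Linf = 2200, k = 0.3, to = -0.1))

# Residual sums of squares for a rough numerical comparison
rss_lin <- sum(resid(lin_fit)^2)
rss_vb  <- sum(resid(vb_fit)^2)
print(c(linear = rss_lin, von_bertalanffy = rss_vb))

# Left panel: observed data with both fitted lines; right panel: residuals
op  <- par(mfrow = c(1, 2))
ord <- order(age)
plot(age, len, xlab = "Age (years)", ylab = "Total length",
     main = "Observed and predicted")
lines(age[ord], fitted(vb_fit)[ord], lty = 1)
lines(age[ord], fitted(lin_fit)[ord], lty = 2)
legend("bottomright", legend = c("von Bertalanffy", "linear"), lty = c(1, 2))
plot(age, resid(vb_fit), pch = 1, xlab = "Age (years)", ylab = "Residual",
     main = "Residuals")
points(age, resid(lin_fit), pch = 3)
abline(h = 0)
par(op)
```

A curved pattern in the residuals of the linear model (for example, positive in the middle ages and negative at both ends) is the usual visual sign that a straight line is missing the asymptote.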
************
Now the fun part: let's test whether the von Bertalanffy curves differ between rivers. If
these curves differ, then we would suspect that growth rates are different between
the two rivers. If the curves are different, then we could test each of the model
parameters to see which one differs.
For now, we will just test whether the curves differ.
There are a few ways to do this. You could just look at the parameter estimates and
their SEs to see if the intervals overlap.
I like to use the "test of coincident curves", which basically compares the residual sum
of squares from a single curve fit to all of the data (here, one growth curve fit to data
from the Suwannee and the Yellow combined) with the summed residual sums of squares
from the curves fit to each river separately. For more detail see Kimura (1980) or Haddon
(2001) below. This is a really easy test that can be applied to a wide range of model
types. It is similar to a "likelihood ratio test", which we may cover in a few weeks.
To build the test of coincident curves:
Either copy and paste (use Paste Special) in your spreadsheet, or write on a piece of
paper, the parameter values for each river's growth curve and the residual sum
of squares for each river's fit. If you are doing this in
the attached spreadsheet, just highlight your parameter values for each river
individually and paste them into your "base values" column. Then add
the residual sums of squares from the two river-specific fits together to create a
"combined" residual sum of squares. In the attached spreadsheet this is done for you in D8.
Next, build a new set of parameter values with a new SS by fitting a
single growth curve to all of the data pooled across rivers
(how would you fit one curve with the SAS or R code you have just written?).
Remember, you are trying to use the residual SS to compare a "full" model of one growth
curve fit to all the data vs. the residual SS from the two individual curves added
together.
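As a sketch of how the three residual sums of squares could be obtained in R, the block below fits one curve per river and one pooled curve. The two-river data set is simulated, and the helper names sim_river and vb_rss are hypothetical, so treat this as a template rather than the lab solution.

```r
set.seed(3)

# Simulate one river's data (assumed parameter values, not the lab data)
sim_river <- function(river, linf, k, t0, n_per_age = 6) {
  age <- rep(1:20, each = n_per_age)
  len <- linf * (1 - exp(-k * (age - t0))) + rnorm(length(age), sd = 50)
  data.frame(river = river, age = age, len = len)
}
fish <- rbind(sim_river("Suwannee", 2000, 0.20, -0.5),
              sim_river("Yellow",   1800, 0.25, -0.5))

# Fit a single curve to whatever rows are passed in; return the residual SS
vb_rss <- function(d) {
  fit <- nls(len ~ Linf * (1 - exp(-k * (age - to))), data = d,
             start = list(Linf = 2200, k = 0.3, to = -0.1))
  deviance(fit)
}

rss_by_river <- sapply(split(fish, fish$river), vb_rss)  # one fit per river
rss_separate <- sum(rss_by_river)                        # added together
rss_pooled   <- vb_rss(fish)                             # one curve, river ignored

print(c(rss_by_river, separate = rss_separate, pooled = rss_pooled))
```

The pooled residual SS can never be smaller than the summed river-specific SS when both fits reach their minima, because the pooled curve is a special case of the two-curve model.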
The test statistic follows a chi-square distribution. To calculate the test statistic:
Chi-square test stat = -n*ln((residual SS Suwannee + residual SS Yellow)/(residual SS full model))
where n is the number of observations in your full data set (here n = 219) and the
residual SS of the full model comes from a single model fit to the combined data set.
Your degrees of freedom are 3 because there are 3 parameters that you are
testing in the model. You then look up the p-value for the test statistic
calculated above with 3 df. I use the CHIDIST function in Excel, where x is the test
statistic and the second argument is the degrees of freedom. This gives you the p-value.
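The same arithmetic can be sketched in R. The function name coincident_curve_test and the residual sums of squares passed to it are made up for illustration; pchisq with lower.tail = FALSE returns the right-tail probability, which is what Excel's CHIDIST reports.

```r
# Test of coincident curves: chi-square statistic and p-value
# (coincident_curve_test is a hypothetical helper name)
coincident_curve_test <- function(rss_separate, rss_pooled, n, df = 3) {
  chisq <- -n * log(rss_separate / rss_pooled)
  p     <- pchisq(chisq, df = df, lower.tail = FALSE)
  c(chisq = chisq, df = df, p_value = p)
}

# Made-up residual sums of squares, used only to show the calculation
res <- coincident_curve_test(rss_separate = 450000 + 430000,
                             rss_pooled   = 1000000,
                             n            = 219)
print(res)
```

If the two river-specific curves fit no better than the pooled curve, the ratio inside the log is 1, the statistic is 0, and the p-value is 1.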
***********
Question 2. Do the individual curves differ? Create a table with the parameter values
(with SE or CI if calculated), the chi-square value, df, and p-value.
**********
Bonus. I have provided directions in the spreadsheet (columns H, I, and J) on how you
can extend this method to testing individual parameters in the growth curve (very
handy). This can also be done in SAS or R; just remember to set some of the
parameters equal to values you have solved for previously. There is a 0.5 point bonus
for each parameter that you are able to successfully test.
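One possible R variant of the single-parameter test is sketched below. It is an assumption on my part, not the spreadsheet's procedure: instead of fixing a parameter at a previously solved value, it forces one parameter (here Linf) to be shared by both rivers while k and to stay river-specific, then compares residual sums of squares with 1 df. The data are simulated, and sim_river is a hypothetical helper.

```r
set.seed(4)

# Simulated two-river data with the same Linf but different k (assumed values)
sim_river <- function(river, linf, k, t0, n_per_age = 6) {
  age <- rep(1:20, each = n_per_age)
  len <- linf * (1 - exp(-k * (age - t0))) + rnorm(length(age), sd = 50)
  data.frame(river = river, age = age, len = len)
}
fish <- rbind(sim_river("Suwannee", 2000, 0.20, -0.5),
              sim_river("Yellow",   2000, 0.30, -0.5))
fish$river <- factor(fish$river)

# Separate curves: every parameter is indexed by river
separate <- nls(len ~ Linf[river] * (1 - exp(-k[river] * (age - to[river]))),
                data = fish,
                start = list(Linf = c(2200, 2200), k = c(0.3, 0.3),
                             to = c(-0.1, -0.1)))

# Constrained model: one shared Linf, river-specific k and to
shared_linf <- nls(len ~ Linf * (1 - exp(-k[river] * (age - to[river]))),
                   data = fish,
                   start = list(Linf = 2200, k = c(0.3, 0.3),
                                to = c(-0.1, -0.1)))

# Same statistic as the coincident-curves test, but only 1 parameter is tested
n     <- nrow(fish)
chisq <- -n * log(deviance(separate) / deviance(shared_linf))
p     <- pchisq(chisq, df = 1, lower.tail = FALSE)
print(c(chisq = chisq, df = 1, p_value = p))
```

The same pattern applies to k or to: share the parameter being tested, leave the others indexed by river, and compare against the fully separate fit.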
**********
Kimura, D. K. 1980. Likelihood methods for the von Bertalanffy growth curve.
Fishery Bulletin 77:765-776.
Haddon, M. 2001. Modelling and quantitative methods in fisheries. Chapman and
Hall/CRC. Boca Raton, Florida.