Model fitting lab

The purpose of this lab is to build von Bertalanffy growth curves (a nonlinear model) to examine the relationship between length (TL) and age (in years) for sturgeon from the Yellow River and the Suwannee River, Florida. After building curves for each river separately, you will test whether the curves are the same between the rivers. You will then have the option (for bonus points) to test each parameter in the model individually.

Note that this approach can be used with many types of nonlinear models. We are assuming an additive error structure, but other error structures (such as multiplicative) can also be fit by transforming the data, for example by taking logs. This is a simple framework that you can adapt to a lot of other applications.

Steps:

(1) Read in the data and create plots of age (x-axis) and length (y-axis).

(2) Fit a linear model with age (x-axis) and length (y-axis) for each river (a simple linear regression model is fine).

(3) Estimate the von Bertalanffy growth curve for each river. The equation is

Predicted length = Linf * (1 - exp(-k * (age - t0)))

These parameters have biological meaning. Linf is the asymptotic length, which is related to the average maximum size of an animal. k is the growth coefficient, which sets how quickly length approaches Linf and is often interpreted as a metabolic rate. age is the observed age at that specific length. t0 (named to in the SAS and R code) is the theoretical age at which length is zero. t0 is usually a really small number, sometimes a little above or below zero. Don't worry if it is not exactly zero; this usually reflects poor data resolution at the younger ages, because young fish are more difficult to collect.

Directions to do this in Excel are provided in the spreadsheet. To do this in SAS use proc nlin. Read the data in first and make sure everything is ok, then check that there are columns with the fish ages, the lengths, and the river name.
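The roles of the three parameters in the growth equation can be seen by evaluating the equation directly. The following is a minimal R sketch, assuming made-up parameter values (Linf = 2000, k = 0.2, t0 = -0.5) that are not estimates from the lab data. The function name vb_length is hypothetical and used only for illustration.

```r
# von Bertalanffy predicted length at age.
# vb_length is an illustrative name, not part of the lab code.
vb_length <- function(age, linf, k, t0) {
  linf * (1 - exp(-k * (age - t0)))
}

# Hypothetical parameter values chosen only to show the shape of the curve.
ages <- 0:25
pred <- vb_length(ages, linf = 2000, k = 0.2, t0 = -0.5)

# Predicted length is zero when age equals t0 ...
length_at_t0 <- vb_length(-0.5, linf = 2000, k = 0.2, t0 = -0.5)

# ... and rises toward (but never reaches) linf as age increases.
plot(ages, pred, type = "l", xlab = "Age (years)", ylab = "Predicted TL")
abline(h = 2000, lty = 2)  # dashed line marks linf
```

Changing k in this sketch changes how fast the curve bends toward the dashed line, while changing t0 slides the curve left or right along the age axis.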
proc nlin best = 1 maxiter = 500;
   *best = 1 prints only the best starting combination from the parms grid and maxiter = 500 stops the fit after at most 500 iterations;
   by river; *do the operation by river (the data must already be sorted by river);
   *The parms statement below gives a grid of starting values for each parameter (from, to, and step size). SAS evaluates every combination on the grid and starts the search from the best one. Choose the grid based on what you know about the data and the model. With poor starting values it can take a long time to fit the data and you can encounter false minima, which can be bad solutions. Note that this grid does not bound the final estimates (add a bounds statement if you want a bounded optimization);
   parms linf = 100 to 1800 by 100
         to = 0 to 1 by 0.5
         k = 0 to 5 by 1;
   *below is the von Bertalanffy growth equation using the parameter names given above;
   model tl = linf * (1 - exp(-((k)*(age - to))));
   *below I am specifying the output that I will need and putting this output in a dataset called c;
   output out = c parms = Linf to k p = vbp r = yresid;
   title 'VB Growth';
run;

Here is some code that might be helpful in the plotting. It overlays the predicted and observed lengths against age using the output dataset c:

proc gplot data = c;
   by river;
   plot vbp*age tl*age / overlay;
run;
quit;

**************************

Here is some R code that might help. Try nlm, nls, and optim for the fitting functions and see if they all come up with about the same estimates.

# The code assumes age and len are numeric vectors holding the ages and
# lengths for one river.
# p is a parameter vector p = (Linf, k, to); these are initial guesses
p <- c(2200, 0.3, -0.1)

# The following is a function with the argument p.
# In this function we create the predicted values in "plen" based on our
# equation. p[1] refers to the first term in p above,
# which is the Linf value from our equation.
fn1 <- function(p) {
  plen <- p[1] * (1 - exp(-p[2] * (age - p[3])))  # predicted lengths given age and p vector
  ss <- sum((len - plen)^2)                       # sum of squares (SS) criterion for predicted lengths at age
  return(ss)                                      # returns the sum of squares (what you want to minimize)
}

# predlen returns your predicted values for plotting
predlen <- function(p) {
  p[1] * (1 - exp(-p[2] * (age - p[3])))
}

# The following is essentially a Solver call in Excel. nlm stands for
# nonlinear minimization and its arguments are
# nlm(function name, parameters, print hessian, scaling). Use ?nlm to get help on using nlm.
# Note there are several different fitting routines in R and SAS.
# R is very sensitive to starting values, so you need to give R different
# starting values, then run it again and see how it does.
# Run ?nlm and see what the convergence codes are;
# you want fit$code to be 1 (a code of 2 also indicates probable convergence).
fit <- nlm(fn1, p, hessian = TRUE)
# fit now stores the estimated parameter info.

# Below, fit2 and fitoptim demo two other fitting techniques. optim is the most
# general, as you can choose BFGS and other optimization methods.
# Warning: all of these functions are sensitive to starting values!
# Think about scaling your values so they are all on the same order of magnitude,
# i.e., see how Linf, k, and to (the model parameters) are all really different
# in size. This can cause problems in R (SAS and Excel use an automatic
# scaling routine).
#fit2 <- nls(len ~ Linf * (1 - exp(-k * (age - to))), start = list(Linf = 2200, k = 0.3, to = -0.1))
#fitoptim <- optim(par = p, fn = fn1, method = "BFGS")

# This is how you get the standard errors for the estimated parameters
# using the Hessian information. The residual variance is SS / (n - 3)
# because the model has three estimated parameters.
SE <- sqrt(diag(2 * fit$minimum / (length(age) - length(p)) * solve(fit$hessian)))
p <- fit$estimate  # assign the estimated parameters to p

# Hint: the SS you need for the comparison test (Question 2) is the first value
# (labeled minimum) in the output of the nlm fitting routine.

************
Question 1.
What are your parameter estimates for each river for both the linear and the von Bertalanffy growth model? Make a plot showing the observed data and the predicted values for each model and each river. Create, examine, and turn in the residual plot and the plot of the predicted and observed values for each river and model, and write 2-3 sentences explaining which model appears to fit better for each river.

************

Now the fun part: let's test to see if the von Bertalanffy curves differ between rivers. If the curves differ, then we would suspect that growth rates are different between the two rivers, and we could go on to test each of the model parameters to see which one differs. Here we will just test to see if the curves differ.

There are a few ways to do this. You could just look at the parameter estimates and SE estimates to see if they overlap. I like to use the "test of coincident curves", which compares the residual sum of squares from a single curve fit to all of the data (here one growth curve fit to the Suwannee and Yellow data combined) with the summed residual sums of squares from the curves fit to each river separately. For more detail see Kimura (1980) or Haddon (2001) below. This is a really easy test that can be applied to a wide range of model types. It is closely related to the likelihood ratio test, which we may cover in a few weeks.

To build the test of coincident curves, either copy and paste (use paste special) in your spreadsheet, or write on a piece of paper, the parameter values for each river's growth curve and the residual sum of squares for that curve. If you are doing this in the attached spreadsheet, highlight your parameter values for each river individually and paste them into your "base values" column. Then add the residual sums of squares from the two river-specific fits together to create a "combined" residual sum of squares.
In the attached spreadsheet this is done for you in D8. You then want to build a new set of parameter values with a new SS by fitting a single "combined" growth curve to all of the data (how would you fit one curve with the SAS or R code you have just written?). Remember you are using the residual SS to compare a "full" model of one growth curve fit to all the data against the residual SS from the two individual curves added together.

The test statistic is compared with a chi-square distribution. To calculate the test statistic:

Chi-square test stat = -n * ln((residual SS Suwannee + residual SS Yellow) / (residual SS full model))

where n is the number of observations in your full data set (here n = 219) and the residual SS of the full model comes from a single model fit to the combined data set. Your degrees of freedom are 3 because there are 3 parameters that you are interested in testing in the model. You then look up the p-value for the test statistic you calculated above with 3 df. I use the CHIDIST function in Excel, CHIDIST(x, df), where x is the test statistic. This will give you the p-value.

***********
Question 2. Do the individual curves differ? Create a table with the parameter values (SE or CI if calculated), the chi-square value, df, and p-value.
**********
**********
Bonus. I have provided directions in the spreadsheet (columns H, I, and J) on how you can extend this method to testing individual parameters in the growth curve (very handy). This can also be done in SAS or R; just remember to set some of the parameters equal to values you have solved for previously. 0.5 point bonus for each parameter that you are able to successfully test.
**********
Kimura, D. K. 1980. Likelihood methods for the von Bertalanffy growth curve. Fishery Bulletin 77:765-776.
Haddon, M. 2001. Modelling and quantitative methods in fisheries. Chapman and Hall/CRC, Boca Raton, Florida.
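The test of coincident curves described under Question 2 could be sketched in R roughly as follows. This is only an illustration under stated assumptions: the data are simulated stand-ins for two hypothetical rivers "A" and "B" (not the lab data), the helper names vb, sim_river, and vb_rss are made up for this sketch, and nls is used for the fits, although nlm or optim would serve equally well.

```r
set.seed(1)

# von Bertalanffy predicted length at age (illustrative helper).
vb <- function(age, linf, k, t0) linf * (1 - exp(-k * (age - t0)))

# Simulated stand-in data (NOT the lab data): additive normal error around
# a von Bertalanffy curve for one hypothetical river.
sim_river <- function(n, linf, k, t0, river) {
  age <- sample(1:20, n, replace = TRUE)
  data.frame(river = river, age = age,
             len = vb(age, linf, k, t0) + rnorm(n, sd = 50))
}
dat <- rbind(sim_river(100, 2000, 0.2, -0.5, "A"),
             sim_river(100, 1700, 0.2, -0.5, "B"))

# Fit one von Bertalanffy curve to a data frame and return its residual SS.
vb_rss <- function(d) {
  fit <- nls(len ~ Linf * (1 - exp(-k * (age - t0))), data = d,
             start = list(Linf = 1800, k = 0.2, t0 = -0.5))
  sum(resid(fit)^2)
}

# Residual SS from the two river-specific curves, added together.
rss_separate <- sum(sapply(split(dat, dat$river), vb_rss))

# Residual SS from a single curve fit to the combined data.
rss_pooled <- vb_rss(dat)

# Chi-square statistic with 3 df (three parameters tested).
n <- nrow(dat)
chisq_stat <- -n * log(rss_separate / rss_pooled)

# Upper-tail probability, the same quantity Excel's CHIDIST(x, df) returns.
p_value <- pchisq(chisq_stat, df = 3, lower.tail = FALSE)
```

With real data, dat would instead hold the observed river, age, and len columns, and the starting values passed to nls would need to suit those data.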