MT5751 2002/3 Statistical Fundamentals II
Design- vs Model-based inference, General Framework, Plot Sampling Summary

Before lecture: load the library, copy "MT5751 misc functions.txt" into the command window, and set up mysamp as before:

myreg <- generate.region(x.length=80, y.width=50)
mydens <- generate.density(nint.x=80, nint.y=50, southwest=1, southeast=10, northwest=20)
mydens <- add.hotspot(mydens, myreg, x=10, y=20, altitude=15, sigma=10)
mydens <- add.hotspot(mydens, myreg, x=65, y=30, altitude=-15, sigma=15)
mydens <- set.stripe(mydens, myreg, x1=30, y1=-10, x2=10, y2=90, width=10)
mydens <- set.stripe(mydens, myreg, x1=33, y1=40, x2=38, y2=10, value=20, width=20)
pars.pop <- setpars.population(myreg, density.pop=mydens, number.groups=250)
set.seed(12345)
ex.pop <- generate.population(pars.pop)
pars.des.pl <- setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10, method="random", area.covered=0.2)
set.seed(1212)
ex.des <- generate.design.pl(pars.des.pl)
ex.samp <- generate.sample.pl(ex.pop, ex.des)

4. Design-based vs Model-based estimation

Model-based inference

The binomial distribution provides a statistical model of how the data we observe (n) are generated. We can use it to get the sampling distribution, and hence the properties, of our estimator N̂ = n/p, where p is the probability that an animal falls in the covered region (here p = 0.2, the proportion of the survey region covered).

Bias:

Analytically (see Tutorial 1): since n ~ Binomial(N, p), E[n] = Np, and so E[N̂] = E[n]/p = N.

By simulation:

# Create uniform density
uniform.dens <- generate.density(nint.x=10, nint.y=10, southwest=1, southeast=1, northwest=1)
plot(uniform.dens, myreg, eye.horiz=10, eye.vert=30)

# Design as before:
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10, method="random", area.covered=0.2)
set.seed(1212)
mydes <- generate.design.pl(mydes.pars)
plot(mydes)

# Generate lots of populations from the model, survey each, and estimate:
uniform.pop.pars <- setpars.population(myreg, density.pop=uniform.dens, number.groups=250)

# Initialise
i <- 0
est <- c(1:1000)

# One survey "by hand" first:
i <- i+1
uniform.pop <- generate.population(uniform.pop.pars)
mysamp <- generate.sample.pl(uniform.pop, mydes)
plot(mysamp)
myest <- point.est.pl(mysamp)
est[i] <- myest$Nhat.grp
myest$Nhat.grp

# Now loop:
B <- 100
for(j in i:B) {
  uniform.pop <- generate.population(uniform.pop.pars)
  mysamp <- generate.sample.pl(uniform.pop, mydes)
  myest <- point.est.pl(mysamp)
  est[j] <- myest$Nhat.grp
}

# Plot histogram, mean and true N:
hist(est[1:B], nclass=15, xlab="Estimate", freq=F, main="Simulated sampling distribution of MLE of N")
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(est[1:B]), mean(est[1:B])), c(0,16), lty=2, lwd=2, col="red")

The estimator n/p is "model-unbiased". We need not do the simulation: we can do it all analytically (Tutorial 1), and get exact confidence intervals (because we "know" the model).
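For example, everything can be computed directly from the binomial model. The following is a minimal sketch (plain R, not one of the library functions used above; the observed count n = 57 is made up for illustration): it gives the exact model-based expectation of N̂ = n/p, and an exact model-based 95% CI for N obtained by inverting the binomial distribution.

# Sketch: exact model-based calculations under n ~ Binomial(N, p).
# The values of N, p and the observed count n are illustrative only.
N <- 250                                   # true abundance
p <- 0.2                                   # coverage probability
n.values <- 0:N                            # all possible counts
probs <- dbinom(n.values, size=N, prob=p)  # their probabilities under the model
sum((n.values/p) * probs)                  # E[Nhat] = N, i.e. model-unbiased

# Exact 95% CI for N given an observed count n, by finding the N's for
# which the observed n is not in either 2.5% tail of Binomial(N, p):
n <- 57
N.grid <- 50:600
not.too.big   <- pbinom(n, size=N.grid, prob=p) > 0.025     # n not in lower tail
not.too.small <- pbinom(n-1, size=N.grid, prob=p) < 0.975   # n not in upper tail
range(N.grid[not.too.big & not.too.small])                  # CI limits for N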
But what if the model is wrong?

# Non-uniform density
nonuniform.dens <- generate.density(nint.x=80, nint.y=50, southwest=1, southeast=1, northwest=1)
nonuniform.dens <- add.hotspot(nonuniform.dens, myreg, x=30, y=20, altitude=40, sigma=10)
plot(nonuniform.dens, myreg, eye.horiz=10, eye.vert=30)

# Design as before:
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10, method="random", area.covered=0.2)
set.seed(1212)
mydes <- generate.design.pl(mydes.pars)
plot(mydes)

# Generate lots of populations from the model, survey each, and estimate:
nonuniform.pop.pars <- setpars.population(myreg, density.pop=nonuniform.dens, number.groups=250)

# Initialise
i <- 0
est <- c(1:1000)

# One survey "by hand" first:
i <- i+1
nonuniform.pop <- generate.population(nonuniform.pop.pars)
mysamp <- generate.sample.pl(nonuniform.pop, mydes)
plot(mysamp)
myest <- point.est.pl(mysamp)
est[i] <- myest$Nhat.grp
myest$Nhat.grp

# Now loop:
B <- 100
for(j in i:B) {
  nonuniform.pop <- generate.population(nonuniform.pop.pars)
  mysamp <- generate.sample.pl(nonuniform.pop, mydes)
  myest <- point.est.pl(mysamp)
  est[j] <- myest$Nhat.grp
}

# Plot histogram, mean and true N:
hist(est[1:j], nclass=15, xlab="Estimate", freq=F, main="Simulated sampling distribution of MLE of N")
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(est[1:j]), mean(est[1:j])), c(0,16), lty=2, lwd=2, col="red")

Using the wrong model leads to misleading inferences. (Usually we randomize the design even if we use model-based estimators, so in practice things are not nearly as bad as they appear above!)

Model Selection

Can we tell if the model is wrong? Yes, to some extent. For example, do the animals in the sample look uniformly distributed?

windows(height=4)
uniform.pop <- generate.population(uniform.pop.pars)
uniform.samp <- generate.sample.pl(uniform.pop, mydes)
plot(uniform.samp)

windows(height=4)
nonuniform.pop <- generate.population(nonuniform.pop.pars)
nonuniform.samp <- generate.sample.pl(nonuniform.pop, mydes)
plot(nonuniform.samp)

Model selection is a central part of model-based inference. It is a large subject in its own right and unfortunately we do not have time to go into it in any detail in this course. We do cover formal model selection briefly in the context of line transect and mark-recapture estimation.

There are many criteria and ways of choosing between competing statistical models for a survey dataset. We consider only Akaike's Information Criterion (AIC). It is easy to calculate (given the likelihood), versatile, and conceptually fairly straightforward.

AIC

Recall that we maximise the likelihood function to get the MLE, and that the likelihood function is based on a statistical model. Let L be the value of the likelihood at its maximum, and let q be the number of unknown parameters in the likelihood (there is only one in the binomial likelihood we have looked at so far, namely N, so q=1 in this case). The AIC is defined as

AIC = -2log(L) + 2q

The first term, -2log(L), gets smaller as the fit to the data improves (and the likelihood, L, increases). As you add more parameters to a model it becomes more flexible and is able to fit the data better and better, so on the basis of goodness of fit alone you would choose the model with the smallest first term. But the second term, 2q, gets bigger as more parameters are added: it is a penalty for having to estimate more and more parameters from the same amount of data. Choosing the model with the smallest AIC is therefore a compromise between goodness of fit and having to estimate many parameters. The criterion is applied by choosing the model with the smallest AIC among all the models under consideration.
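As a small worked illustration of the definition (a sketch with made-up numbers, not taken from the survey above), the AIC of the binomial plot-sampling model can be computed directly from its maximised log-likelihood:

# Sketch: AIC for the model n ~ Binomial(N, p) with p known and N the only
# unknown parameter (so q = 1). The count n and coverage p are illustrative.
n <- 57
p <- 0.2
N.grid <- n:1000                                     # N cannot be less than n
loglik <- dbinom(n, size=N.grid, prob=p, log=TRUE)   # log-likelihood over a grid of N
Nhat <- N.grid[which.max(loglik)]                    # MLE of N (close to n/p = 285)
q <- 1                                               # one unknown parameter (N)
aic <- -2*max(loglik) + 2*q
Nhat
aic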
Design-based inference

An alternative to model-based inference is to dispense with the model altogether and base the sampling distribution and estimator properties on the randomness we put in the design. Look at the sampling distribution we get when only the design is random (the population is held fixed):

# Initialise
i <- 0
nonu.est <- c(1:1000)
u.est <- c(1:1000)
set.seed(1212)

# One survey "by hand" first (same population, new random design each time):
i <- i+1
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10, method="random", area.covered=0.2)
mydes <- generate.design.pl(mydes.pars)
nonu.samp <- generate.sample.pl(nonuniform.pop, mydes)
plot(nonu.samp)
est <- point.est.pl(nonu.samp)
nonu.est[i] <- est$Nhat.grp
nonu.est[i]

# Nonuniform density:
for(j in 1:500) {
  mydes <- generate.design.pl(mydes.pars)
  nonu.samp <- generate.sample.pl(nonuniform.pop, mydes)
  est <- point.est.pl(nonu.samp)
  nonu.est[j] <- est$Nhat.grp
}

# Uniform density:
for(j in 1:500) {
  mydes <- generate.design.pl(mydes.pars)
  u.samp <- generate.sample.pl(uniform.pop, mydes)
  est <- point.est.pl(u.samp)
  u.est[j] <- est$Nhat.grp
}

# Plot simulated sampling distributions, means and true Ns:
#xmax <- max(nonu.est, u.est)
xmax <- 500
xmin <- min(nonu.est, u.est)

# Nonuniform sampling distribution
windows()
par(mfrow=c(2,1))
hist(nonu.est[1:j], nclass=25, xlab="Estimate", freq=F, main="Simulated sampling distribution of MLE of N (Nonuniform)", xlim=c(xmin,xmax))
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(nonu.est[1:j]), mean(nonu.est[1:j])), c(0,16), lty=2, lwd=2, col="red")

# Uniform sampling distribution
hist(u.est[1:j], nclass=25, xlab="Estimate", freq=F, main="Simulated sampling distribution of MLE of N (Uniform)", xlim=c(xmin,xmax))
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(u.est[1:j]), mean(u.est[1:j])), c(0,16), lty=2, lwd=2, col="red")

# Compare std. err. in the nonuniform and uniform cases
sqrt(var(nonu.est[1:j]))
sqrt(var(u.est[1:j]))

The estimator is "design-unbiased". (This can also be shown algebraically; a small enumeration sketch at the end of this section illustrates the idea.) Within any given plot, the number of animals does not change between surveys. The only reason the estimate changes between surveys is that the sampled plots (the covered region) change between surveys, and this is determined by the design, which is "randomly choose 20 rectangular plots of size 8x5, with replacement, from the 100 8x5 plots in the survey region". (See Table 4.1 in the book for an example of estimator properties when sampling without replacement.)

Summary

Read Sections 3.2.2 and 3.2.3 of the book.

Note:

Design-based CI's are based on the inter-plot variation in animal numbers. Model-based CI's are based on the assumed statistical model. If the model is wrong, the CI's can be misleading (see Prac 1 for examples).

Purely design-based inference is usually not possible with wildlife surveys, because we usually don't know the probability of an animal being detected. Plot sampling is an exception, because the detection probability is known. So most abundance estimation methods are at least partly model-based.

A common and useful tactic is to obtain the estimator from the likelihood function (the MLE), but to base CI estimation on a method that does not rely on the strong assumptions in the likelihood. In this course we often use the nonparametric bootstrap for this reason.
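To see why the estimator is design-unbiased whatever the spatial pattern of the animals, here is a minimal enumeration sketch (plain R, made-up plot counts, and sampling without replacement rather than the with-replacement design used above; the argument is the same in both cases): the average of N̂ over all equally likely designs equals the true N.

# Sketch: design-unbiasedness by enumerating all possible designs (toy example).
counts <- c(0, 1, 2, 7, 10)    # animals in each of K = 5 plots (fixed, clumped)
N <- sum(counts)               # true abundance (20)
K <- length(counts)
k <- 2                         # the design selects k plots at random
p <- k/K                       # coverage probability

designs <- combn(K, k)         # every possible set of sampled plots (equally likely)
Nhat <- apply(designs, 2, function(plots) sum(counts[plots]) / p)

mean(Nhat)                     # equals N = 20 exactly: design-unbiased
N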
5. General Statistical Framework

(Diagram: three sources of randomness in a survey.)
Randomness in the animal population: modelled with a "State model".
Randomness in the design: only animals in the covered region (shaded) are at risk of being detected; the likelihood is conditional on the design.
Randomness in the detection process: modelled with an "Observation model"; only some of the animals (red dots) in the covered region are detected.

The statistical framework for animal abundance estimation methods is based on likelihood functions. These condition on the design (i.e. take it as given). The design determines the covered region. Given this, there are two sources of randomness that affect what you observe on a survey. They are:

1. Randomness in the population
At the simplest level the relevant randomness is only to do with where the animals are. In general it involves animal size, sex, etc. as well. The part of the statistical model that reflects this randomness is called the "State model".

Heterogeneity:

plot(mydens, myreg, eye.horiz=10, eye.vert=30)

# Create a heterogeneous population and look at it
hetero.pop.pars <- setpars.population(myreg, density.pop=mydens, number.groups=250,
                     size.method="poisson", size.min=1, size.max=5, size.mean=1,
                     exposure.method="beta", exposure.min=2, exposure.max=10,
                     exposure.mean=6, exposure.shape=1)
hetero.pop <- generate.population(hetero.pop.pars)
summary(hetero.pop)

hetero.pop.pars <- setpars.population(myreg, density.pop=mydens, number.groups=250,
                     size.method="poisson", size.min=1, size.max=5, size.mean=1,
                     exposure.method="beta", exposure.min=2, exposure.max=10,
                     exposure.mean=6, exposure.shape=1, adjust.interactive=T)
set.seed(2468)
hetero.pop <- generate.population(hetero.pop.pars)
summary(hetero.pop)
plot(hetero.pop, dsf=1.5)

Note that heterogeneity in the population is of no consequence if animals of all types are equally detectable: heterogeneity is an issue only in so far as it affects animal detectability. A population of animals with different detectabilities is called heterogeneous.

2. Randomness in which animals in the covered region are detected
In plot surveys, all animals in the covered region are detected, but in most wildlife surveys this is not the case. There is some random process in operation that determines which of the animals in the covered region are detected. We model this with something called an "Observation model".

The likelihood for plot surveys involves only a state model, because all animals in the covered region are detected. In the examples above, the state model is "animals are equally likely to be anywhere in the survey region, and animals locate themselves independently of each other". In some abundance estimation surveys, the covered region is assumed to be the whole survey region, and all randomness comes from the observation model; simple mark-recapture surveys are a case in point. Some abundance estimation surveys involve both state (process) and observation models.
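Before the WiSP-style examples that follow, here is a minimal stand-alone sketch of the two-part framework (plain R with made-up parameter values, not one of the library functions used above): the state model places the animals, the design fixes the covered region, and the observation model determines which animals in the covered region are detected.

# Sketch of the general framework (illustrative values only).
set.seed(1)
N <- 250                              # true abundance

# State model: animals located uniformly and independently in an 80 x 50 region
x <- runif(N, 0, 80)
y <- runif(N, 0, 50)

# Design (taken as given by the likelihood): cover the strip x <= 16,
# i.e. 20% of the region
covered <- x <= 16

# Observation model: each animal in the covered region is detected
# independently with probability 0.7 (a plot survey would use probability 1)
detected <- covered & (runif(N) < 0.7)

sum(covered)     # number of animals at risk of detection
sum(detected)    # the observed count, i.e. the survey data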
Line transect surveys are an example that involves both:

# (State model done above)

# The design:
pars.des.lt <- setpars.design.lt(myreg, n.transects=3, n.units=3, visual.range=5, percent.on.effort=0.5)
set.seed(121)
des.lt <- generate.design.lt(pars.des.lt)

# Specify the line transect observation model:
pars.sur.lt <- setpars.survey.lt(hetero.pop, des.lt, disthalf.min=1, disthalf.max=3)

# Do the survey:
set.seed(123)
samp.lt <- generate.sample.lt(pars.sur.lt)
plot(samp.lt, whole.population=T, show.paths=T, dsf=1.5)
summary(samp.lt)

Others involve only an observation model, like mark-recapture surveys:

# (State model and population from above)

# Design = "search" the whole survey region
des.cr <- generate.design.cr(myreg, n.occ=2)

# Observation model in which more-exposed animals are more likely to be detected than less-exposed ones:
pars.sur.cr <- setpars.survey.cr(hetero.pop, des.cr, pmin.unmarked=0.001, pmax.unmarked=0.6)
set.seed(123)
samp.cr <- generate.sample.cr(pars.sur.cr)
plot(samp.cr, whole.population=T, dsf=1.5)
summary(samp.cr)

Summary so far:

Estimator properties: An estimator is a random variable; it has a sampling distribution, which determines its bias and precision.

Model- vs Design-based: Model-based estimators are based on a statistical model, which involves assumptions (which may be wrong). Design-based estimators involve fewer assumptions, but purely design-based estimation of animal abundance is usually not possible (because we don't know detection probabilities).

Confidence Intervals: CIs provide a measure of how uncertain we are about our estimate: a 95% CI for N is a random interval that includes the true N 95% of the time. CIs can be estimated in many ways; we concentrate on the nonparametric bootstrap, which is based on weak model assumptions, is versatile, and is easy to apply.

General Statistical Model: There are usually two sources of randomness in the data we observe: the randomness in the population itself, which we capture in a "State model", and the randomness in observing the animals in the covered region, which we capture in an "Observation model". (There is also randomness in the design, but the likelihood takes the design as given.)

MLEs: A maximum likelihood estimator (MLE) of N gives the most likely value of N, given the data and a statistical model. MLEs are model-based estimators; they have some nice properties, such as asymptotic unbiasedness and efficiency.

6. Plot Sampling Summary

Key Idea: Scale up the count from the covered area to the survey area.

State model: Animals are distributed uniformly and independently in the survey region.

Observation model: All animals in the covered region are detected with certainty.

Likelihood function: see the book, page 56.

Main assumptions and the effect of violating them

1. Assumption: All animals in the covered region are detected.
Effect of violation: The estimator is negatively biased, by a factor equal to the probability that an animal in the covered region is detected (see the small simulation sketch below).

2. Model-based assumption: Animals are distributed uniformly and independently of one another. Most animal populations do not distribute themselves independently of one another; they tend to cluster or, in the case of highly territorial animals, to avoid one another. They are also seldom uniformly distributed.
Effect of violation: CI's and variances based on the assumption will be too narrow (if animals cluster) or too wide (if animals are more regularly distributed).
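Finally, a small simulation sketch of the first violation effect (plain R, made-up values): if animals in the covered region are detected with probability g < 1 rather than with certainty, the plot-sampling estimator has expectation gN rather than N.

# Sketch: effect of violating "all animals in the covered region are detected".
set.seed(2)
N <- 250                              # true abundance
p <- 0.2                              # coverage probability
g <- 0.6                              # detection probability within the covered region
Nhat <- numeric(10000)
for (b in 1:10000) {
  n.covered <- rbinom(1, N, p)        # animals falling in the covered region
  n.seen <- rbinom(1, n.covered, g)   # only a fraction g of them are detected
  Nhat[b] <- n.seen / p               # plot-sampling estimator (assumes certain detection)
}
mean(Nhat)                            # close to g*N = 150, not 250: biased by the factor g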