Lecture 1:

MT5751 2002/3
Statistical Fundamentals II
Design- vs Model-based inference, General Framework, Plot Sampling Summary
Before lecture:
• load library
• copy “MT5751 misc functions.txt” into command window
• set up mysamp as before:
myreg <- generate.region(x.length=80, y.width=50)
mydens <- generate.density(nint.x=80, nint.y=50,
southwest=1,southeast=10, northwest=20)
mydens <- add.hotspot(mydens, myreg, x=10, y=20, altitude=15, sigma=10)
mydens <- add.hotspot(mydens, myreg, x=65, y=30,
altitude=-15, sigma=15)
mydens <- set.stripe (mydens, myreg, x1=30, y1=-10,
x2=10, y2=90,width=10)
mydens <- set.stripe (mydens, myreg, x1=33, y1=40, x2=38,
y2=10, value=20, width=20)
pars.pop <-setpars.population (myreg, density.pop=mydens, number.groups=250)
set.seed(12345)
ex.pop <- generate.population(pars.pop)
pars.des.pl<-setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10,
method="random",area.covered = 0.2)
set.seed(1212)
ex.des <- generate.design.pl(pars.des.pl)
ex.samp <- generate.sample.pl(ex.pop, ex.des)
4. Design-based vs Model-based estimation
Model-based inference
The binomial distribution provides a statistical model of how the data we observe (n) is generated. We
can use it to get the sampling distribution, and hence properties, of our estimator
$\hat{N} = n / \pi_c$
Bias:
• Analytically: (see Tutorial 1)
• By simulation:
# Create Uniform Density
uniform.dens <- generate.density(nint.x=10, nint.y=10,
southwest=1,southeast=1, northwest=1)
plot(uniform.dens,myreg,eye.horiz=10,eye.vert=30)
# Design as before:
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10,
n.interval.y=10,method="random",area.covered = 0.2)
set.seed(1212)
mydes <- generate.design.pl(mydes.pars)
plot(mydes)
# generate lots of populations from model, survey, and estimate:
uniform.pop.pars<-setpars.population (myreg,
density.pop=uniform.dens, number.groups=250)
# initialise stuff
i<-0
est<-c(1:1000)
#loop
i<-i+1
uniform.pop <- generate.population(uniform.pop.pars)
mysamp <- generate.sample.pl(uniform.pop,mydes)
plot (mysamp)
myest<-point.est.pl(mysamp)
est[i]<-myest$Nhat.grp
myest$Nhat.grp
B<-100
for(j in i:B) {
uniform.pop <- generate.population(uniform.pop.pars)
mysamp <- generate.sample.pl(uniform.pop,mydes)
myest<-point.est.pl(mysamp)
est[j]<-myest$Nhat.grp
}
# Plot histogram, mean and true N:
hist(est[1:B],nclass=15,xlab="Estimate",freq=F,
main="Simulated sampling distribution of MLE of N")
lines(c(250,250),c(0,16),lwd=2,col="blue")
lines(c(mean(est[1:B]),mean(est[1:B])),c(0,16),lty=2,lwd=2,col="red")
The estimator n/p is “model-unbiased”. We need not do the simulation: we can derive its properties
analytically (Tutorial 1) and get exact confidence intervals (because we “know” the model).
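The same check can be sketched without the course library. The snippet below is a minimal base-R illustration, assuming the binomial model with N = 250 and a covered fraction of 0.2 as in the design above; the names N.true, p.cover and n.sims are illustrative and are not part of the library.

```r
# Minimal sketch (base R only): sampling distribution of Nhat = n/p
# under the binomial model. N.true, p.cover and n.sims are illustrative names.
set.seed(12345)
N.true  <- 250    # true abundance
p.cover <- 0.2    # proportion of the survey region covered
n.sims  <- 10000  # number of simulated surveys

n.obs <- rbinom(n.sims, size = N.true, prob = p.cover)  # counts in covered region
N.hat <- n.obs / p.cover                                # estimate from each survey

mean(N.hat)                        # close to N.true, because E[n]/p = N*p/p = N
var(N.hat)                         # close to the analytic variance below
N.true * (1 - p.cover) / p.cover   # Var(n)/p^2 = N*p*(1-p)/p^2
```

The simulated mean and variance should agree with the analytic values up to Monte Carlo error, which is the sense in which the simulation is unnecessary when the model is known.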
But what if the model is wrong?
# Non-Uniform Density
nonuniform.dens <- generate.density(nint.x=80, nint.y=50,
southwest=1,southeast=1, northwest=1)
nonuniform.dens<-add.hotspot(nonuniform.dens, myreg, x=30, y=20,
altitude=40, sigma=10)
plot(nonuniform.dens,myreg,eye.horiz=10,eye.vert=30)
# Design as before:
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10,
n.interval.y=10,method="random",area.covered = 0.2)
set.seed(1212)
mydes <- generate.design.pl(mydes.pars)
plot(mydes)
# generate lots of populations from model, survey, and estimate:
nonuniform.pop.pars<-setpars.population (myreg,
density.pop=nonuniform.dens, number.groups=250)
# initialise stuff
i<-0; est<-c(1:1000)
#loop
i<-i+1
nonuniform.pop <- generate.population(nonuniform.pop.pars)
mysamp <- generate.sample.pl(nonuniform.pop,mydes)
plot (mysamp)
myest<-point.est.pl(mysamp)
est[i]<-myest$Nhat.grp
myest$Nhat.grp
B<-100
for(j in i:B) {
nonuniform.pop <- generate.population(nonuniform.pop.pars)
mysamp <- generate.sample.pl(nonuniform.pop,mydes)
myest<-point.est.pl(mysamp)
est[j]<-myest$Nhat.grp
}
# Plot histogram, mean and true N:
hist(est[1:j],nclass=15,xlab="Estimate",freq=F,
main="Simulated sampling distribution of MLE of N")
lines(c(250,250),c(0,16),lwd=2,col="blue")
lines(c(mean(est[1:j]),mean(est[1:j])),c(0,16),lty=2,lwd=2,col="red")
Using the wrong model leads to misleading inferences. (Usually we randomize the design even if we
use model-based estimators, so in practice things are not nearly as bad as they appear above!)
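The effect can be reproduced in a stripped-down setting. The following base-R sketch is a one-dimensional analogue, not the library simulation above: the survey region is the unit interval, the covered region is fixed at [0, 0.2], and the Beta(2, 5) location density is an arbitrary illustrative choice of non-uniform model.

```r
# Minimal sketch (base R only): a one-dimensional analogue with a FIXED covered
# region, the interval [0, 0.2] of a unit-length survey region.
set.seed(1212)
N.true  <- 250
p.cover <- 0.2
n.sims  <- 1000

# Simulate one survey: place N animals using the location generator 'rloc',
# count those in the covered interval, and scale up by the covered fraction.
one.survey <- function(rloc) {
  x <- rloc(N.true)
  sum(x <= p.cover) / p.cover
}

est.unif    <- replicate(n.sims, one.survey(runif))
est.nonunif <- replicate(n.sims, one.survey(function(n) rbeta(n, 2, 5)))

mean(est.unif)      # close to 250: the uniform model holds
mean(est.nonunif)   # far from 250: animals concentrate near the covered interval
N.true * pbeta(p.cover, 2, 5) / p.cover   # expected estimate under Beta(2, 5)
```

Because the covered interval never moves, the bias depends entirely on how much of the true density happens to fall inside it.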
Model Selection
Can we tell if the model is wrong?
Yes, to some extent. For example, does the distribution of animals in the sample look uniformly
distributed?
windows(height=4)
uniform.pop <- generate.population(uniform.pop.pars)
uniform.samp <- generate.sample.pl(uniform.pop,mydes)
plot(uniform.samp)
windows(height=4)
nonuniform.pop <- generate.population(nonuniform.pop.pars)
nonuniform.samp <- generate.sample.pl(nonuniform.pop,mydes)
plot(nonuniform.samp)
Model selection is a central part of model-based inference. It is a large subject in its own right and
unfortunately we do not have time to go into it in any detail in this course. We do cover formal model
selection briefly in the context of line transect and mark recapture estimation.
There are many criteria and ways of choosing between competing statistical models for a survey
dataset. We consider only Akaike’s Information Criterion (AIC). It is easy to calculate (given the
likelihood), versatile, and conceptually fairly straightforward.
AIC
Recall that we maximise the likelihood function to get the MLE, and that the likelihood function is
based on a statistical model.


• Let L be the value of the likelihood at its maximum.
• Let q be the number of unknown parameters in the likelihood (there is only one in the
binomial likelihood we have looked at so far, namely N, so q=1 in this case).
The AIC is defined as
AIC = -2log(L) + 2q
The first term, -2log(L), gets smaller as the fit to the data improves (and the likelihood, L, increases).
Now as you add more parameters to the model, it gets more and more flexible, and is able to fit the data
better and better. On the basis of goodness of fit, you’d want to choose the model with the smallest first
term. But the second term, 2q, gets bigger and bigger as more and more parameters are added to the
model. The second term is a penalty for having to estimate more and more parameters from the same
amount of data. Choosing the model with the smallest AIC is a compromise between goodness of fit,
and having to estimate many parameters. The criterion is applied by choosing the model with the
smallest AIC among all the models under consideration.
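As a small worked example of the definition, the base-R sketch below computes AIC by hand for two candidate models. The plot counts, the two-habitat split, and the choice of Poisson models are made up for illustration and are not taken from the course data.

```r
# Minimal sketch (base R only): AIC = -2*log(L) + 2*q for two candidate models.
# The counts and the two-habitat split are made-up illustrative data.
counts  <- c(0, 1, 0, 2, 1, 0, 1, 3, 0, 1,    # plots in habitat A
             4, 6, 3, 5, 7, 4, 5, 6, 3, 5)    # plots in habitat B
habitat <- rep(c("A", "B"), each = 10)

# Model 1: one Poisson mean for all plots (q = 1). The MLE is the sample mean.
logL1 <- sum(dpois(counts, lambda = mean(counts), log = TRUE))
aic1  <- -2 * logL1 + 2 * 1

# Model 2: a separate Poisson mean per habitat (q = 2). MLEs are the group means.
lambda2 <- ave(counts, habitat)   # group mean repeated for each plot
logL2 <- sum(dpois(counts, lambda = lambda2, log = TRUE))
aic2  <- -2 * logL2 + 2 * 2

c(model1 = aic1, model2 = aic2)   # choose the model with the smaller AIC
```

Here the extra parameter improves the fit by much more than the penalty of 2, so the two-mean model has the smaller AIC.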
Design-based inference
An alternative to model-based inference is to dispense with the model altogether and base the
sampling distribution and estimator properties on the randomness we put in the design.
Look at the sampling distribution we got before:
# initialise stuff
i<-0
nonu.est<-c(1:1000)
u.est<-c(1:1000)
set.seed(1212)
#loop
i<-i+1
mydes.pars <- setpars.design.pl(myreg, n.interval.x=10,
n.interval.y=10,method="random",area.covered = 0.2)
mydes <- generate.design.pl(mydes.pars)
nonu.samp <- generate.sample.pl(nonuniform.pop, mydes)
plot(nonu.samp)
est<- point.est.pl(nonu.samp)
nonu.est[i]<-est$Nhat.grp
nonu.est[i]
# Nonuniform density:
for(j in 1:500) {
mydes <- generate.design.pl(mydes.pars)
nonu.samp <- generate.sample.pl(nonuniform.pop, mydes)
est<- point.est.pl(nonu.samp)
nonu.est[j]<-est$Nhat.grp
}
# Uniform density:
for(j in 1:500) {
mydes <- generate.design.pl(mydes.pars)
u.samp <- generate.sample.pl(uniform.pop, mydes)
est<- point.est.pl(u.samp)
u.est[j]<-est$Nhat.grp
}
# Plot simulated sampling distributions, means and true Ns:
#xmax<-max(nonu.est,u.est)
xmax<-500
xmin<-min(nonu.est[1:j],u.est[1:j])
# Nonuniform sampling distribution
windows()
par(mfrow=c(2,1))
hist(nonu.est[1:j],nclass=25,xlab="Estimate",freq=F,
main="Simulated sampling distribution of MLE of N (Nonuniform)",
xlim=c(xmin,xmax))
lines(c(250,250),c(0,16),lwd=2,col="blue")
lines(c(mean(nonu.est[1:j]),mean(nonu.est[1:j])),c(0,16),lty=2,lwd=2,col="red")
# Uniform sampling distribution
hist(u.est[1:j],nclass=25,xlab="Estimate",freq=F,
main="Simulated sampling distribution of MLE of N (Uniform)",
xlim=c(xmin,xmax))
lines(c(250,250),c(0,16),lwd=2,col="blue")
lines(c(mean(u.est[1:j]),mean(u.est[1:j])),c(0,16),lty=2,lwd=2,col="red")
# Compare std. err. in nonuniform and uniform cases
sqrt(var(nonu.est[1:j]))
sqrt(var(u.est[1:j]))
The estimator is “design-unbiased”. (This can also be shown algebraically.)
Within any given plot, the number of animals does not change between surveys. The only
reason the estimate changes between surveys is that the sampled plots (the covered region)
change between surveys, and this is determined by the design (which is “randomly choose 20
rectangular plots of size 8x5, with replacement, from the 100 8x5 plots in the survey region.”)
(See book, Table 4.1 for example of estimator properties when sampling without
replacement.)
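The design-based argument can be sketched in base R without the course library. In the snippet below the population is held fixed and only the choice of plots varies; the plot counts are made up (and deliberately uneven) for illustration, and the design is the one described above: 20 plots drawn with replacement from 100.

```r
# Minimal sketch (base R only): design-based sampling distribution.
# The population is FIXED: 100 plots whose counts sum to 250.
# The plot counts below are made up and deliberately uneven.
set.seed(1212)
plot.counts <- rep(c(0, 1, 2, 7), times = 25)   # 25 * (0 + 1 + 2 + 7) = 250 animals
N.true  <- sum(plot.counts)
n.plots <- 20                                   # plots drawn per survey
p.cover <- n.plots / length(plot.counts)        # 0.2

# Only the design varies between surveys: draw 20 plots with replacement.
one.design <- function() {
  chosen <- sample(length(plot.counts), n.plots, replace = TRUE)
  sum(plot.counts[chosen]) / p.cover
}
design.est <- replicate(5000, one.design())

mean(design.est)        # close to 250 although the population is far from uniform
sqrt(var(design.est))   # spread comes only from which plots were selected
```

No assumption about how animals are distributed enters this calculation; the unbiasedness comes from every plot having the same chance of selection on each draw.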
Summary
Read Sections 3.2.2 and 3.2.3 of the book
Note:


• Design-based CIs are based on the inter-plot variation in animal numbers.
• Model-based CIs are based on the assumed statistical model. If the model is wrong,
the CIs can be misleading (see Prac 1 for examples).
• Purely design-based inference is usually not possible with wildlife surveys (because we
usually don’t know the probability of an animal being detected). Plot sampling is an
exception, because detection probability is known. So most abundance estimation methods are
at least partly model-based.
• A common and useful tactic is to obtain the estimator from the likelihood function (MLE),
but to base CI estimation on a method that is not based on the strong assumptions in the
likelihood. In this course we often use the nonparametric bootstrap for this reason.
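A nonparametric bootstrap for plot sampling can be sketched in a few lines of base R. This is a minimal illustration of the idea, not the course library's implementation: the sampled plot counts are made up, the plots are treated as the resampling units, and a simple percentile interval is used.

```r
# Minimal sketch (base R only): nonparametric bootstrap CI for N from plot counts.
# 'sampled.counts' are made-up counts from 20 sampled plots covering 20% of the region.
set.seed(2468)
sampled.counts <- c(0, 3, 1, 0, 7, 2, 0, 1, 5, 0, 2, 9, 0, 1, 4, 0, 3, 6, 1, 5)
p.cover <- 0.2
N.hat <- sum(sampled.counts) / p.cover          # point estimate

# Resample plots (not animals) with replacement and re-estimate N each time.
B <- 999
boot.est <- replicate(B, sum(sample(sampled.counts, replace = TRUE)) / p.cover)

# Percentile interval: the 2.5% and 97.5% quantiles of the bootstrap estimates.
ci <- quantile(boot.est, probs = c(0.025, 0.975))
N.hat
ci
```

The interval width is driven by the variation between plot counts, so it does not rely on animals being uniformly or independently distributed.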
5. General Statistical Framework
[Figure: sources of randomness in a survey. Randomness in the animal population (modelled with a “State model”). Randomness in the design: only animals in the covered region (shaded) are at risk of being detected (the likelihood is conditional on the design). Randomness in the detection process (modelled with an “Observation model”): only some animals (red dots) in the covered region are detected.]
The statistical framework for animal abundance estimation methods is based on likelihood functions.
These condition on the design (i.e. take it as given). The design determines the covered region. Given
this, there are two sources of randomness that affect what you observe on a survey. They are:
1. Randomness in the population
At the simplest level the relevant randomness is only to do with where the animals are. In
general, it involves animal size, sex, etc., as well. The part of the statistical model that reflects
this randomness is called the “State model”.
Heterogeneity:
plot(mydens,myreg,eye.horiz=10,eye.vert=30)
# Create heterogeneous population and look at it
hetero.pop.pars<-setpars.population (myreg, density.pop=mydens,
number.groups=250, size.method="poisson", size.min=1,
size.max=5, size.mean=1, exposure.method="beta",
exposure.min=2, exposure.max=10, exposure.mean=6,
exposure.shape=1)
hetero.pop <- generate.population(hetero.pop.pars)
summary(hetero.pop)
hetero.pop.pars<-setpars.population (myreg, density.pop=mydens,
number.groups=250, size.method="poisson", size.min=1,
size.max=5, size.mean=1, exposure.method="beta",
exposure.min=2, exposure.max=10, exposure.mean=6,
exposure.shape=1,adjust.interactive=T)
set.seed(2468)
hetero.pop <- generate.population(hetero.pop.pars)
summary(hetero.pop)
plot(hetero.pop,dsf=1.5)
Note that heterogeneity in the population is of no consequence if animals of all types are
equally detectable. Heterogeneity is an issue only as far as it affects animal detectability. A
population of animals with different detectabilities is called heterogeneous.
2. Randomness in which animals in the covered region are detected
In plot surveys, all animals in the covered region are detected, but in most wildlife surveys
this is not the case. There is some random process in operation that determines which of the
animals in the covered region are detected. We model this with something called an
“Observation model”.
The likelihood for plot surveys involves only a state model, because all animals in the covered
region are detected. In the examples above, the state model is “animals are equally likely to be
anywhere in the survey region, and animals locate themselves independently of each other”.
In some abundance estimation surveys, the covered region is assumed to be the whole survey
region, and all randomness comes from the observation model. Simple mark-recapture surveys are
a case in point.
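The two sources of randomness, conditional on the design, can be sketched in base R as follows. This is a minimal illustration only: the single covered strip and the constant detection probability p.detect are simplifying assumptions chosen for the sketch, not the observation models used by the course library.

```r
# Minimal sketch (base R only) of the two sources of randomness, conditional on
# the design. All parameter values and the constant detection probability
# 'p.detect' are illustrative assumptions.
set.seed(123)
N.true   <- 250
x.length <- 80; y.width <- 50   # survey region dimensions, as in generate.region above
strip    <- c(30, 46)           # design: covered strip 30 <= x <= 46 (20% of region)
p.detect <- 0.6                 # observation model: each covered animal seen w.p. 0.6

# State model: animals located uniformly and independently in the region.
x <- runif(N.true, 0, x.length)
y <- runif(N.true, 0, y.width)

# Design: which animals are in the covered region (at risk of detection).
covered <- x >= strip[1] & x <= strip[2]

# Observation model: only some of the covered animals are detected.
detected <- covered & (runif(N.true) < p.detect)

c(in.region = N.true, covered = sum(covered), detected = sum(detected))
# To look at the result: plot(x, y, col = ifelse(detected, "red", "grey"))
```

Setting p.detect to 1 recovers a plot survey, where the state model is the only source of randomness in the likelihood.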
Some abundance estimation surveys involve both state and observation models,
like line transect surveys:
# (State model done above)
# The design:
pars.des.lt<-setpars.design.lt(myreg, n.transects=3, n.units=3,
visual.range=5, percent.on.effort=0.5)
set.seed(121)
des.lt <- generate.design.lt(pars.des.lt)
# Specify the line transect Observation Model:
pars.sur.lt<-setpars.survey.lt (hetero.pop, des.lt,
disthalf.min=1, disthalf.max=3)
# Do the survey:
set.seed(123)
samp.lt <- generate.sample.lt (pars.sur.lt)
plot(samp.lt,whole.population=T, show.paths=T,dsf=1.5)
summary(samp.lt)
Others involve only observation models - like mark recapture surveys:
# (State model and population from above)
# Design = ”search” whole survey region
des.cr <- generate.design.cr(myreg, n.occ=2)
# Observation model with exposed more likely to be detected than unexposed:
pars.sur.cr<-setpars.survey.cr (hetero.pop, des.cr,
pmin.unmarked=0.001, pmax.unmarked =0.6)
set.seed(123)
samp.cr <- generate.sample.cr(pars.sur.cr)
plot (samp.cr,whole.population=T,dsf=1.5)
summary(samp.cr)
Summary so far:
• Estimator properties: An estimator is a random variable; it has a sampling distribution,
which determines its bias and precision.
• Model- vs Design-based: Model-based estimators are based on a statistical model, which
involves assumptions (which may be wrong). Design-based estimators involve fewer
assumptions, but purely design-based estimation for animal abundance estimation is usually
not possible (because we don’t know detection probabilities).
• Confidence Intervals: CIs provide a measure of how uncertain we are about our estimate: a
95% CI for N is a random interval that includes the true N 95% of the time. CIs can be
estimated in many ways; we concentrate on the nonparametric bootstrap. It is based on weak
model assumptions, versatile, and easy to apply.
• General Statistical Model: There are usually two sources of randomness in the data we
observe: the randomness in the population itself, which we capture in a “State model”, and
the randomness in observing the animals in the covered region, which we capture in an
“Observation model”. (There is also randomness in the design, but the likelihood takes the
design as given.)
• MLEs: A maximum likelihood estimator (MLE) of N gives the value of N under which the
observed data are most likely, given a statistical model. MLEs are model-based estimators;
they have some nice properties, like asymptotic unbiasedness and efficiency.
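To make the MLE point concrete, the base-R sketch below maximises a binomial likelihood for N numerically. It assumes the binomial model used for plot sampling earlier, with a known covered fraction of 0.2; the observed count of 47 and the grid upper limit of 600 are made-up illustrative values.

```r
# Minimal sketch (base R only): the MLE of N under the binomial model,
# found by evaluating the log-likelihood over a grid of candidate N values.
# The count n.obs = 47 is a made-up example.
n.obs   <- 47
p.cover <- 0.2

N.grid <- n.obs:600    # N cannot be smaller than the number counted
logL   <- dbinom(n.obs, size = N.grid, prob = p.cover, log = TRUE)
N.mle  <- N.grid[which.max(logL)]

N.mle             # within one animal of n/p
n.obs / p.cover   # the closed-form estimator, treating N as continuous
```

Because N is an integer, the grid maximum can differ from n/p by rounding; when n/p is itself an integer the likelihood takes the same value at n/p and at n/p - 1.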
6. Plot Sampling Summary
Key Idea:
Scale up the count from the covered area to the survey area
State model:
Animals distributed uniformly and independently in the survey region.
Observation model:
All animals in the covered region are detected with certainty.
Likelihood function
(see book, page 56)
Main assumptions and the effect of violating them
1. Assumption: All animals in the covered region are detected.
Effect of violation: Estimator is negatively biased by a factor equal to the probability that an
animal in the covered region is detected.
2. Model-based Assumption: Animals are distributed uniformly and independently of one
another. Most animal populations do not distribute themselves independently of one another;
they tend to cluster or, in the case of highly territorial animals, to avoid one another. They are
also seldom uniformly distributed.
Effect of violation: CIs and variances based on the assumption will be too narrow (if animals
cluster), or too wide (if animals are more regularly distributed).
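Both effects can be checked with a short base-R simulation. This is a minimal sketch under illustrative assumptions: a detection probability of 0.7 for the first violation, and 50 tight clusters of 5 animals (each cluster wholly inside or outside the covered region) for the second.

```r
# Minimal sketch (base R only): effect of violating each assumption.
# All parameter values are illustrative.
set.seed(1212)
N.true <- 250; p.cover <- 0.2; n.sims <- 5000

# Violation 1: covered animals are detected with probability 0.7, not 1.
p.detect <- 0.7
n.seen   <- rbinom(n.sims, N.true, p.cover * p.detect)
mean(n.seen / p.cover) / N.true    # close to p.detect: bias factor of about 0.7

# Violation 2: animals occur in 50 tight clusters of 5, so whole clusters fall
# inside or outside the covered region together.
n.clustered <- 5 * rbinom(n.sims, 50, p.cover)
var(n.clustered / p.cover)         # actual variance of the estimator
N.true * (1 - p.cover) / p.cover   # variance assumed under independence
```

With clusters of 5 the actual variance is roughly five times the binomial value, so an interval based on the independence assumption would be too narrow.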