MT5751 2002/3 Statistical Fundamentals I
Estimator Properties, Maximum Likelihood and Confidence Intervals

Before lecture: load the library and copy the non-library functions in "MT5751 misc functions.txt" into the command window.

1. An Introductory Example

We start by creating a population, surveying it, and estimating its abundance from the survey data in an intuitively sensible way. Investigating the uncertainty associated with our abundance estimate, by surveying the population again and again, leads on to a more formal specification of an estimator and estimator properties.

Create the survey region:

myreg <- generate.region(x.length=80, y.width=50)
plot(myreg)

[Plot of example survey region]

Create the density surface and population:

# Density
mydens <- generate.density(nint.x=80, nint.y=50, southwest=1, southeast=10, northwest=20)
mydens <- add.hotspot(mydens, myreg, x=10, y=20, altitude=15, sigma=10)
mydens <- add.hotspot(mydens, myreg, x=65, y=30, altitude=-15, sigma=15)
mydens <- set.stripe(mydens, myreg, x1=30, y1=-10, x2=10, y2=90, width=10)
mydens <- set.stripe(mydens, myreg, x1=33, y1=40, x2=38, y2=10, value=20, width=20)
plot(mydens, myreg, method="image")

["Heat" plot of example animal density]

plot(mydens, myreg, eye.horiz=10, eye.vert=30)

[Perspective plot of example animal density]

# Population:
pars.pop <- setpars.population(myreg, density.pop=mydens, number.groups=250)
set.seed(12345)
ex.pop <- generate.population(pars.pop)
plot(ex.pop)

[Plot of example animal population]

Plot the survey design (20% coverage; random plot locations):

pars.des.pl <- setpars.design.pl(myreg, n.interval.x=10, n.interval.y=10, method="random", area.covered=0.2)
set.seed(1212)
ex.des <- generate.design.pl(pars.des.pl)
plot(ex.des)

[Plot of covered region]

Generate and inspect the survey data:

ex.samp <- generate.sample.pl(ex.pop, ex.des)
plot(ex.samp)

[Plot of locations of sampled animals]

summary(ex.samp)

[Summary of survey data]

plot(ex.samp, whole.population=T)

[Plot of locations of sampled and unsampled animals]

How do we estimate abundance from these data? Because we covered 20% of the survey region, we expect the covered region to contain about 20% of the population; equivalently, there should be about 1/0.2 = 5 times as many animals in the population as we saw. We saw 44 groups, so if this is 20% of the population, there would be about 44/0.2 = 220 animals in the population.

ex.est <- point.est.pl(ex.samp)
ex.est$Nhat.grp
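The estimator we have just applied can be written as a one-line function. Here is a minimal sketch (the name estimate.N is ours, for illustration only; point.est.pl does the real work):

# Plot-sampling abundance estimator: scale the observed count up by
# the reciprocal of the proportion of the survey region covered.
estimate.N <- function(n, coverage) n / coverage
estimate.N(n=44, coverage=0.2)  # gives 220, as above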
How certain are you of your answer? Do some more surveys and see how much the estimate changes:

# Initialise storage:
i <- 0
est <- rep(0, 1000)

# Resurvey and look at the estimates (repeat these lines a few times):
i <- i+1
mydes <- generate.design.pl(pars.des.pl)
mysamp <- generate.sample.pl(ex.pop, mydes)
plot(mysamp)
myest <- point.est.pl(mysamp)
est[i] <- myest$Nhat.grp
myest$Nhat.grp

# Plot histogram, mean and true N:
hist(est[1:i], nclass=5, xlab="Estimate", main="Simulated Sampling Distribution")
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(est[1:i]), mean(est[1:i])), c(0,16), lty=2, lwd=2, col="red")

[Plot of the estimated sampling distribution of the estimator from the above surveys]

# Do 200 estimates (without plots):
B <- 200
for(j in i:B) {
  mydes <- generate.design.pl(pars.des.pl)
  mysamp <- generate.sample.pl(ex.pop, mydes)
  myest <- point.est.pl(mysamp)
  est[j] <- myest$Nhat.grp
}

# Plot histogram, mean and true N:
hist(est[1:B], nclass=15, xlab="Estimate", freq=F, main="Sampling Distribution")
lines(c(250,250), c(0,16), lwd=2, col="blue")
lines(c(mean(est[1:B]), mean(est[1:B])), c(0,16), lty=2, lwd=2, col="red")

[Plot of the estimated sampling distribution of the estimator from more surveys]

Some Definitions

Coverage probability is the probability that an animal falls in the covered region (ours was c = 0.2 above).

An estimate is the number we use to estimate some parameter (e.g. abundance, N). The estimate from our first survey was N̂ = 220.

An estimator is a function that takes the data and turns it into an estimate. The estimator we used is N̂ = n/c, where n is the data from the survey: the number of groups we saw. (Note that we use "hats" to indicate both estimates and estimators.)

An estimator is a random variable – the estimate changes from one occasion to another in a random way. As such, it has a probability distribution. This is called the sampling distribution of the estimator (it is generated by sampling).

So what? An estimate of N from a survey might be far from the true N. We can't say how far it is (we would need to know the true N to do that), but the sampling distribution allows us to quantify how far from N the estimator will be on average (this is its bias), and how much it varies from one survey occasion to another (its precision). Bias and precision are properties of the estimator: different estimators have different properties.

[Figure: examples of estimators with each combination of high/low bias and high/low precision.]
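We can also approximate the bias and precision of N̂ = n/c by direct simulation, without re-running the survey functions. This is a minimal sketch, assuming each of the N = 250 animals falls in the covered region independently with probability c = 0.2 (assumptions formalised in the next section):

set.seed(1)
c.cover <- 0.2
n.sim <- rbinom(10000, size=250, prob=c.cover)  # 10000 simulated survey counts
Nhat.sim <- n.sim / c.cover                     # 10000 simulated estimates
mean(Nhat.sim) - 250  # estimated bias: average error, close to 0 here
sd(Nhat.sim)          # estimated standard error: smaller means higher precision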
There are many kinds of estimators. We concentrate on maximum likelihood estimators (MLEs). They have some nice properties: they are asymptotically unbiased, asymptotically normal, and asymptotically minimum variance. That is, with large enough sample size, they are unbiased, normally distributed, and have as low a variance (as high a precision) as it is possible for any estimator to have!

2. Maximum Likelihood

Likelihood function and MLE concept

We use plot sampling (as above) to illustrate the ideas behind likelihood functions and maximum likelihood estimation. To quantify the properties of an estimator, we need to construct a probability model that captures the essential features of the survey. To do this, we invariably need some assumptions. For plot sampling, we make the following:

Assumption 1: Animals are detected independently of one another. (Since animals are detected if they fall in the covered region, and the covered region is placed independently of animal locations, Assumption 1 amounts to assuming animals are distributed independently of one another.)

Assumption 2: Constant detection probability: all animals are detected with the same probability, which we denote p.

Under these assumptions n is binomial (see the probability density equation in the book). Considering this probability density as a function of N (not n) gives the likelihood function for N.

Examples: We saw 44 groups; let's work out the probability of seeing this number for various values of N. This gives some idea of which N is most likely.

dbinom(44, 44, 0.2)  # syntax is dbinom(n, N, p)
dbinom(44, 50, 0.2)
dbinom(44, 100, 0.2)
dbinom(44, 150, 0.2)

# or to do a lot at once:
N <- round(seq(44, 350, length=20))
prob <- dbinom(44, N, 0.2)

# to plot them:
plot(N, prob)

We want to find the N with the highest likelihood, given p = 0.2 and observed n = 44. The set of likelihoods for all N ≥ n is the likelihood function; we can draw it:

binom.lik.N.plot(44, 0.2, Nlim=c(120,370), col="blue")
lines(c(120,370), c(0,0))

And we can find the maximum of this function from the plot:

lines(c(220,220), c(0, dbinom(44,220,0.2)), col="red")
text(220, -0.001, label="220", col="red")
title(main="Likelihood for N, given n=44 and p=0.2")

Or we can find it algebraically (or numerically – a grid-search sketch is given at the end of this section).

---------------------- 50 mins to here (first break) ----------------------

Binomial MLE derivation on the board. Note the digamma approximation (Figure 2.3 of the book).

Summary: From our statistical model (binomial in the example above), we can calculate the likelihood of any N, given what we saw (n = 44 in the example above). We find the N with maximum likelihood; this is the maximum likelihood estimate (MLE) of N. Given what we saw, it is the most likely N. The mathematical function that takes the data (n = 44 in the example) as input and gives the MLE as output (the function n/p in the example) is the maximum likelihood estimator (also abbreviated MLE).

The maximum likelihood estimator has a sampling distribution, which we can often derive from our statistical model. To quantify the precision of an estimator, we need to know (or estimate) its sampling distribution. We estimated it above by doing lots of surveys. In practice we have only one survey, so we need to be able to estimate the sampling distribution from this alone.

Binomial example continued: The MLE is n/p; n is binomial and p is a known constant (0.2), so the sampling distribution of the MLE is just the distribution of n with the x-axis rescaled by dividing by p.

Life is seldom this simple, and we often don't even try to estimate the full sampling distribution; we just estimate the variance and/or a confidence interval (CI) to quantify our certainty about the value of N.
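As promised above, the MLE can also be found numerically by a simple grid search. A minimal sketch, assuming the binomial model with n = 44 and p = 0.2 (the upper limit of 400 is an arbitrary choice, wide enough to contain the maximum):

# Evaluate the binomial likelihood over a grid of candidate N values:
n <- 44
p <- 0.2
N.candidates <- n:400                        # N cannot be smaller than n
lik <- dbinom(n, size=N.candidates, prob=p)  # likelihood of each candidate N
N.candidates[which.max(lik)]                 # 220 = n/p, the MLE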
3. Confidence Intervals

What is a confidence interval (CI)? It is a random interval (as opposed to a random number, which is what the MLE is):

N250p01 <- randCI(N=250, p=0.2, B=1, xlim=c(0,500))
N250p01 <- randCI(N=250, p=0.2, B=50)
N250p01 <- randCI(N=250, p=0.2, B=100)

But it is a special kind of random interval: a 95% CI for N includes the true N 95% of the time.

So how do we find a CI? There are many ways.

If we know the sampling distribution: exact (model-based) confidence interval.

n <- 44
pi.c <- 0.2
t <- binom.CI(n, pi.c, Nmin=80, Nmax=400, Naxis=T, Ntick=c(4:26)*20)
lines(c(44,44), c(-0.02,0), col="red")
text(44, -0.021, label="n=44", col="red")

CI is (170; 287), width = 117.

If we don't know the sampling distribution (as is most often the case), we can estimate the CI in various ways, including:

(Model-based) normal approximation:

Nhat <- n/pi.c
sehatN <- sqrt(Nhat*(1-pi.c)/pi.c)
t <- norm.CI(est=Nhat, se=sehatN, xlim=c(50,380), nx=100)

CI is (162; 278), width = 116.

(Model-based) profile likelihood:

t <- binom.proflik.N(n=44, p=0.2, Nlim=c(150,310))

CI is (167; 284), width = 117.

(Design-based) nonparametric bootstrap:

set.seed(162278)
ex.bs.est <- int.est.pl(ex.samp, ci.type="boot.nonpar", nclass=20)

CI is (140; 310), width = 170.

Note that the bootstrap CI is wider than all the others. This is not an accident: the others are too narrow on average for this population. Can you think why? (See Prac 1.)

The nonparametric bootstrap

In our example, we had 20 sampling units (plots), chosen randomly from the 100 sampling units in the population:

plot(ex.samp)

The sampling distribution of the estimator is the distribution of estimates that results from drawing many randomly chosen sets of 20 units from the 100 in the population. A nonparametric bootstrap mimics this using only the units in the original sample. It is like repeating the survey many times, but using only the plots in our sample, not all the plots in the survey region (we don't know how many animals are in the plots we did not sample):

1. We "resample" K=20 plots, with replacement, from the K=20 in our sample.
2. Each time we resample, we calculate an estimate of N.
3. If we resample B=999 times, we have 999 estimates of N; their distribution is our estimated sampling distribution.
4. To get the CI, we use the estimated N below which 2.5% of the 999 estimates fall, and the estimated N above which 2.5% of them fall. (This is called the "percentile method" of calculating a CI from bootstrap resamples; a bare-bones R sketch is given at the end of this handout.)

It is that simple. Advantages of the nonparametric bootstrap include:

1. It involves at most weak assumptions about the distribution of the estimator. (For example, the procedure above involves no assumptions about the distribution of animals in the survey region – which is something that affects the estimator's properties.)
2. It is versatile – it can be used in a wide variety of applications and contexts.
3. It is conceptually simple (although computationally expensive – no big deal nowadays).

We concentrate on the nonparametric bootstrap in what follows, although we do sometimes use other methods.

--------------------------------- break 2 (40 mins) ---------------------------------
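Appendix: a bare-bones sketch of the percentile-method bootstrap described above. The per-plot counts below are hypothetical, chosen only so that they total n = 44; a real analysis would use the observed counts from the plots in ex.samp.

# Hypothetical counts of groups in each of the K = 20 sampled plots
# (illustrative only, not the actual survey counts; they sum to 44):
plot.counts <- c(0,3,1,0,5,2,0,1,4,0,2,6,1,0,3,2,7,0,4,3)
c.cover <- 0.2     # proportion of the survey region covered
B <- 999
Nhat.boot <- numeric(B)
set.seed(1)
for (b in 1:B) {
  # Step 1: resample K=20 plots with replacement from the K=20 in the sample
  resample <- sample(plot.counts, size=length(plot.counts), replace=TRUE)
  # Step 2: estimate N from the resampled counts
  Nhat.boot[b] <- sum(resample) / c.cover
}
# Steps 3 and 4: the B estimates approximate the sampling distribution;
# the percentile-method 95% CI is given by its 2.5% and 97.5% quantiles:
quantile(Nhat.boot, probs=c(0.025, 0.975))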