
Sampling, Regression, Experimental Design and Analysis for Environmental Scientists, Biologists, and Resource Managers

C. J. Schwarz

Department of Statistics and Actuarial Science, Simon Fraser University cschwarz@stat.sfu.ca

January 16, 2011

Contents

1 Maximum Likelihood Estimation & AIC and model selection - a primer
   1.1 Maximum Likelihood Estimation (MLE)
       1.1.1 The probability model
       1.1.2 The likelihood
       1.1.3 The MLE
       1.1.4 The precision of the MLE
       1.1.5 Numerical optimization
       1.1.6 Example 2: Different sizes of sampling units
       1.1.7 Example: What is the probability of becoming pregnant in a month?
       1.1.8 Example: Zero-truncated distribution
       1.1.9 Example: Capture-recapture with 2 parameters
       1.1.10 Summary
   1.2 AIC and Model Selection
       1.2.1 Technical Details
       1.2.2 AIC and regression
       1.2.3 Example: Capture-recapture
       1.2.4 Example: Survival of Juvenile Pygmy Rabbits
   1.3 References

Chapter 1

Maximum Likelihood Estimation & AIC and model selection - a primer

1.1 Maximum Likelihood Estimation (MLE)

These notes provide a (very) brief introduction to Maximum Likelihood Estimation (MLE) at a lower technical level than would be found in a formal course in Statistics. The important part of these notes is the ideas – don’t worry too much about the technical aspects unless you wish to use maximum likelihood for a non-standard problem. In the latter case, please don’t hesitate to contact me to discuss your problem.

A review of MLE is found at http://en.wikipedia.org/wiki/Maximum_likelihood.

Maximum Likelihood Estimation (MLE) was first developed by Sir R. A. Fisher, a famous geneticist and statistician. It is closely related to the concept of probability and provides a comprehensive, justifiable method to estimate parameters from data. It is a standard tool used by statisticians for many problems.

The key advantages of MLE are that it:

• provides a unified framework to obtain estimators;

• is fully efficient in large samples, i.e. it extracts all the information possible from the data;

• provides a standard way to obtain estimates of precision (se) for any data collection scheme, along with a way to estimate confidence intervals.

However, MLE is not a panacea and there are several “problems” with MLE that may require other methods (such as Bayesian methods):

• The properties of MLE in small samples may be neither optimal nor exact.

• The likelihood function may not be computationally tractable. This often arises because the likelihood has to integrate over hidden (latent) variables, making it a multi-dimensional integral.

1.1.1 The probability model

The first step in using MLE is to define a probability model for your data. This is crucial because if your probability model is wrong, then all further inference will also be incorrect. Statisticians recognize that all probability models are wrong, but hope that they are close enough to reality to be useful. Consequently an important part of MLE is model assessment – how well does the model fit the data. This won’t be covered in this primer – contact me for details.

For example, consider the problem of estimating the density of objects (grey dots) in a region, as shown in the figure below.

[Figure: the study region showing the locations of all the objects (grey dots).]

Of course in real life the locations of all the objects would be unknown, because if the locations (and number of objects) were known, then it would be a simple matter to count all of the dots and divide by the study area to obtain the density value! Consequently, three small areas were randomly selected from the area (how would this be done?) and the number of objects within each sampling quadrat was counted, as shown in the figure below:

[Figure: the study region with the three randomly selected quadrats and their counts.]

This gives three counts, 4, 10, and 15. Note that the sampling protocol has to be explicit on what to do with objects that intersect the boundary of the sampling frame – this will not be discussed further in these notes.

A non-parametric way to proceed would be to simply take the average count ( (4 + 10 + 15)/3 = 9.67 ) and report this as the density per sampling unit.[1] However, this simple approach will fail (as seen later) in more complex cases where the data are not as straightforward.

A MLE approach would start with a model for the data. In many cases, counts of objects follow a Poisson distribution. This would be true if objects occurred “at random” on the landscape. It is not always true – for example, objects could tend to fall in clusters (new plants growing from seeds from a parent plant), or objects could be more spread out (competition for resources) than random. In both cases, a more complex probability model could be fit. If a simple Poisson model is used, an important part of the MLE process is model assessment as noted earlier.

A Poisson distribution depends upon the parameter µ which (in this case) is the mean number of dots per sampling unit or the true density in the population.

Let Y represent the random variable for the number of objects found in each sample. If the value of µ is known, then the probability of seeing Y = y objects[2] in a sample follows the Poisson probability law:

\[ P(Y = y \mid \mu) = \frac{e^{-\mu}\,\mu^{y}}{y!} \]

For example, if µ = 1, then $P(Y = 2 \mid \mu = 1) = \frac{e^{-1}\,1^{2}}{2!} = 0.184$.

Of course, the value of µ is NOT known and this is the parameter that we would like to estimate from our sample of size 3.
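As a quick numerical check of the probability law above, here is a minimal sketch (not part of the original notes; it assumes Python with scipy is available, rather than the Excel workbook used later):

from scipy.stats import poisson

mu = 1.0
# P(Y = 2 | mu = 1) = exp(-1) * 1**2 / 2!
print(poisson.pmf(2, mu))   # approximately 0.184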

1.1.2 The likelihood

MLE starts by defining the likelihood of each data value.

\[ L(\mu \mid Y = y) = \frac{e^{-\mu}\,\mu^{y}}{y!} \]

The likelihood looks familiar and is in fact the same probability function, but with the roles of Y and µ reversed. A probability function gives the probability of observing Y given the parameter µ and sums to 1 over all the possible values of Y. The likelihood function is a function of µ given the data and does NOT sum to 1 over the possible values of µ. [Note that while the possible values of Y were 0, 1, 2, 3, . . . , the possible values of µ are the non-negative real line, i.e. 0 ≤ µ.]

[1] This estimator relies on the sampling being a simple random sample from the population and is a design-based estimator, as covered in the chapter on Sampling Theory.

[2] The notation Y = y is read as the random variable Y taking the value y. For example, Y = 1 would imply that the number of objects in the sample was 1.

We will assume that each of the three samples was selected independently of the others. [If this were not true, then the likelihood function below would be more complex.] It also seems reasonable that the density is homogeneous over the entire region (which again can be relaxed if needed). Let $Y_i$ represent the count in the i-th sample, with i = 1, 2, 3. Because of the independence assumption and the assumption of a common parameter for all samples, the joint likelihood function is the product of the individual likelihood functions:

\[ L(\mu \mid Y_1 = y_1, Y_2 = y_2, Y_3 = y_3) = \frac{e^{-\mu}\,\mu^{y_1}}{y_1!} \times \frac{e^{-\mu}\,\mu^{y_2}}{y_2!} \times \frac{e^{-\mu}\,\mu^{y_3}}{y_3!} \]

This could be written more compactly using product notation but is not important for what follows.

So if our sample values were 4, 10 and 15, the likelihood function is:

\[ L(\mu \mid Y_1 = 4, Y_2 = 10, Y_3 = 15) = \frac{e^{-\mu}\,\mu^{4}}{4!} \times \frac{e^{-\mu}\,\mu^{10}}{10!} \times \frac{e^{-\mu}\,\mu^{15}}{15!} \]

and is a function only of µ, i.e. the only unknown quantity is the parameter to be estimated.

1.1.3 The MLE

Fisher argued that a sensible estimate for µ would be the value that MAXIMIZES the LIKELIHOOD (that is why this procedure is called Maximum Likelihood). This intuitively says that the best guess for the unknown parameter maximizes the joint probability of the data and is the most consistent with the data.

The MLE could be found using graphical methods, i.e. plot L as a function of µ and find the value of µ that maximizes the curve, as shown below:

[Figure: the likelihood L plotted as a function of µ.]

However, simple calculus gives us a way to do this without resorting to looking at graphs.

Recall from calculus that the maximum of a function often occurs where the derivative (the slope of the curve) is zero. As the function increases toward the maximum the derivative is positive; as the function decreases away from the maximum, the slope is negative. The slope switches from positive to negative at the maximum.

It turns out that (for mathematical reasons) it is more convenient to look at the log(likelihood), where log() is the natural logarithm to the base e. In our problem:

\[ \log(L) = \log(L(\mu \mid Y_1, Y_2, Y_3)) = \left[-\mu + Y_1\log(\mu) - \log(Y_1!)\right] + \left[-\mu + Y_2\log(\mu) - \log(Y_2!)\right] + \left[-\mu + Y_3\log(\mu) - \log(Y_3!)\right] \]

\[ = -3\mu + (Y_1 + Y_2 + Y_3)\log(\mu) - \log(Y_1!) - \log(Y_2!) - \log(Y_3!) \]

Because the log () function is a monotonic transform, the maximum of log ( L ) will be coincident with the maximum of L .

To find the maximum of log ( L ) , find the first derivative with respect to µ , equate to 0, and solve for µ :

\[ \frac{\partial}{\partial\mu}\log(L) = -3 + \frac{Y_1 + Y_2 + Y_3}{\mu} = 0 \]

or (after rearranging the above) we have

\[ \hat{\mu} = \frac{Y_1 + Y_2 + Y_3}{3} = \bar{Y} \]

or the sample average. The circumflex over the parameter is the usual convention to indicate that the value is an estimator derived from data rather than a known parameter value, i.e. $\hat{\mu}$ is a best guess for µ.

Whew! This seems like a lot of work to get an “obvious” result, but the advantages of MLE will become clearer in future examples.

So in our example, $\hat{\mu} = \frac{4+10+15}{3} = 9.67$.

1.1.4 The precision of the MLE

But never report a “naked estimate”; it is necessary to attach a measure of precision (the standard error) to this estimate. MLE gives us a way to do this as well.

It turns out that a measure of “information” in the data is defined as the negative of the second derivative of the log-likelihood function (caution: some more calculus ahead) and the standard error of the MLE can be found from the inverse of the information (caution: in more complex problems this needs an inverse of a matrix).

The second derivative is found by taking the derivative of the derivative of the log-likelihood function. In our case, a measure of information is found as:

\[ I = -\frac{\partial^2}{\partial\mu\,\partial\mu}\log(L) = \frac{Y_1 + Y_2 + Y_3}{\mu^2} \]

This doesn’t seem too helpful, as it depends upon the value of µ , so the observed information is found by substituting in the value of the MLE:

\[ OI = \frac{Y_1 + Y_2 + Y_3}{\hat{\mu}^2} \]

which after some arithmetic gives

\[ OI = \frac{3}{\hat{\mu}} \]

Almost there! The se of an estimator is found as the square root of the inverse of the information (matrix), or:

\[ se(\hat{\mu}) = \sqrt{\frac{\hat{\mu}}{n}} \]

In our case, we have $se = \sqrt{\frac{9.67}{3}} = 1.80$.

This looks a bit odd compared to the familiar $se(\bar{Y}) = s/\sqrt{n}$ for samples taken from a distribution, but a property of the Poisson distribution is that the standard deviation of the data is equal to $\sqrt{\mu}$, so in fact the se of the MLE is not that different.

Once the se is found, then confidence intervals can be found in the usual way, i.e. an approximate 95% confidence interval is $\hat{\mu} \pm 2\,se$. There are other ways to find confidence intervals using the likelihood function directly (called profile intervals) but these are not covered in this review.
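As a worked instance of these formulas for the three quadrat counts, here is a minimal Python sketch (an illustration added to these notes, not the author's Excel workbook):

import math

counts = [4, 10, 15]
n = len(counts)

mu_hat = sum(counts) / n                 # MLE = sample average = 9.67
se = math.sqrt(mu_hat / n)               # se = sqrt(mu_hat / n) = 1.80

# approximate 95% confidence interval: mu_hat +/- 2 se
ci = (mu_hat - 2 * se, mu_hat + 2 * se)
print(mu_hat, se, ci)                    # 9.67, 1.80, (6.08, 13.26)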

Don’t forget that model assessment must be performed to ensure that our choice of a Poisson distribution is a sensible probability model!

1.1.5 Numerical optimization

The above example could be solved analytically to give a closed-form solution. In many problems this is not possible – in fact the vast majority of problems do NOT have closed form solutions.

However, finding the MLE, the information matrix, and the se can be done using numerical methods. For example, consider the PoissonEqual tab in the Excel workbook available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls, a portion of which is reproduced below. While Excel is NOT the best tool for general maximum likelihood work, it will suffice for our simple examples.


The key steps are to:

• Set up an area for the raw data.

• Set up a cell for the parameter of interest.

• Use the built-in functions (e.g. the Poisson function in Excel) to find the probability of each data value for an (arbitrary) value of the parameter. The initial value chosen is not that important.

• Find the log() of each probability (be sure to use the correct logarithm function, i.e. the ln() function in Excel).

• Sum the log() of each probability to find the total log-likelihood.

The log-likelihood function can then be maximized by changing the value of the parameter and seeing which values give the maximum of the log-likelihood.

Note that the log-likelihood is often negative, so the maximum is the value closest to zero (i.e. −2 is larger than −4) and not the value with the larger absolute value. This can be automated by using the Solver feature of Excel.


To find the se, find the negative of the second derivative of the log-likelihood contribution for each data point. Sum these to get the observed information, and take the square root of the inverse to find the se. It is possible to have Excel compute the information, but this is not demonstrated on the spreadsheets.
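For readers who prefer a scripting language, the following sketch (an added illustration assuming Python with numpy/scipy; it mirrors, but is not, the Excel workbook) carries out the same numerical recipe: maximize the log-likelihood with an optimizer and obtain the se from a numerical second derivative.

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

counts = np.array([4, 10, 15])

def neg_log_lik(mu):
    # negative of the sum of the log Poisson probabilities of the data
    return -np.sum(poisson.logpmf(counts, mu))

# minimizing the negative log-likelihood is the same as maximizing the log-likelihood
fit = minimize_scalar(neg_log_lik, bounds=(0.01, 100), method="bounded")
mu_hat = fit.x                                    # about 9.67

# observed information = second derivative of the negative log-likelihood,
# approximated by a central finite difference
h = 1e-4
obs_info = (neg_log_lik(mu_hat + h) - 2 * neg_log_lik(mu_hat)
            + neg_log_lik(mu_hat - h)) / h**2
se = 1 / np.sqrt(obs_info)                        # about 1.80
print(mu_hat, se)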


1.1.6 Example 2: Different sizes of sampling units

The first example considered cases where the sampling units were the same size.

Consider the sample shown below, with different sized sampling units:

[Figure: the study region sampled with quadrats of different sizes.]

How should an estimate of density be found now? Simply taking the arithmetic average seems silly as the sampling units are different sizes. Perhaps we should standardize each observation to the same size of sampling unit and then take the average? Or perhaps we would take the total number of dots divided by the total sampled area? There is no obvious way to decide which is the better method.

Using the Poisson model and MLE, the problem is not that much more complex than the earlier example. Now the data come in pairs $(Y_i, A_i)$, where $Y_i$ is the number of points in the sampling unit which has area $A_i$.

As before, let µ be the density per unit area (i.e. for a sampling unit of size 1).

According to the properties of the Poisson distribution, a larger sample also follows a Poisson distribution, but the Poisson parameter must be adjusted for the area measured:

\[ P(Y_i \mid \mu, A_i) = \frac{e^{-A_i\mu}\,(A_i\mu)^{Y_i}}{Y_i!} \]

We now proceed as before. The likelihood function for each point is:

\[ L(\mu \mid Y_i, A_i) = \frac{e^{-A_i\mu}\,(A_i\mu)^{Y_i}}{Y_i!} \]

The likelihood function for ALL the data points is:

\[ L(\mu \mid Y_i, A_i, i = 1,\ldots,n) = \prod_{i=1}^{n} \frac{e^{-A_i\mu}\,(A_i\mu)^{Y_i}}{Y_i!} \]

The log-likelihood function for ALL of the data points is:

\[ \log(L) = \sum_{i=1}^{n} \left[ -A_i\mu + Y_i\log(A_i\mu) - \log(Y_i!) \right] \]

We find the MLE by finding the first derivative and setting it to zero:

\[ \frac{\partial}{\partial\mu}\log(L) = -(A_1 + \cdots) + \frac{Y_1 + \cdots}{\mu} = 0 \]

which gives us:

\[ \hat{\mu} = \frac{Y_1 + \cdots}{A_1 + \cdots} \]

or the total observed objects divided by the total area observed.

We find the se by finding the measure of information using the negative of the second derivative of the log-likelihood:

\[ I = -\frac{\partial^2}{\partial\mu\,\partial\mu}\log(L) = \frac{Y_1 + Y_2 + \cdots}{\mu^2} \]

and upon some arithmetic, substituting in the MLE for µ, we get:

\[ OI = \frac{A_1 + A_2 + \cdots}{\hat{\mu}} \]

which upon first inspection looks different from the results from the previous example, but reduces to the same form if all the $A_i$ are equal sized and one unit in size.

The se is then

\[ se(\hat{\mu}) = \sqrt{\frac{\hat{\mu}}{A_1 + A_2 + \cdots}} \]

which reduces to the earlier form if all the samples are of equal size.

The PoissonDifferent tab in the Excel worksheet from http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls repeats the analysis using numerical methods for this case.

1.1.7 Example: What is the probability of becoming pregnant in a month?

For many couples, it is a joyful moment when they decide to try to have children. However, not every couple becomes immediately pregnant on the first attempt to conceive a child, and it may take many months before the woman becomes pregnant.

Fertility scientists are interested in estimating the probability of becoming pregnant in a given month. A sample of couples are enrolled in a study and each couple records the number of months prior to becoming pregnant. Here is the raw data:

2; 6; 5; 0; 0; 4; 0; 3; 10+

where the value of 2 indicates that the couple became pregnant on the 3rd month (i.e. there were two months where the pregnancy did not occur PRIOR to becoming pregnant on the 3rd month). The value 10+ indicates that it took longer than 10 months to get pregnant, but the exact time is unknown because the experiment terminated.

If the exact value were known for all couples (i.e. including the last couple), then the sample average time (in months) PRIOR to becoming pregnant would be a simple estimator for the average number of months PRIOR to becoming pregnant, and the intuitive estimator for the probability of becoming pregnant in each month would be $\frac{1}{1+\bar{Y}}$. The extra ‘1’ in the denominator accounts for the fact that you become pregnant in the NEXT month after being unsuccessful for Y months. For example, if the average time PRIOR to becoming pregnant was 4 months, then the probability of becoming pregnant in each month is 1/(4 + 1) = 0.20.

But what should be done if you have incomplete data (the last couple)? This is an example of censored data[3] where there is information, but it is not obvious how to use it. Just using the value of 10 for the last couple will lead to an underestimate of the average time to become pregnant and an overestimate of the probability of becoming pregnant in a month. You could substitute in a value for the last couple before computing an average, but there is no obvious choice for a value to use – should you use 11, 12, 15, 27, etc.?

This problem is amenable to Maximum Likelihood Estimation. A common probability distribution to model this type of data is the geometric distribution with parameter p representing the probability of becoming pregnant in any month.

Let Y be the number of months PRIOR to becoming pregnant. Then

\[ P(Y = y \mid p) = (1-p)^{y} \times p \]

i.e. there are y “failures to get pregnant” followed by a “success”. For censored data, we add together the probability of becoming pregnant over all the months greater than or equal to the censored value:

\[ P(Y \ge y \mid p) = \sum_{i=y}^{\infty} (1-p)^{i} \times p \]

which after some algebra reduces to

\[ P(Y \ge y \mid p) = (1-p)^{y} \]

We can now construct the likelihood function as the product of the individual terms:

\[ L = \prod_{\text{non-censored}} (1-p)^{Y_i}\, p \;\times\; \prod_{\text{censored}} (1-p)^{Y_i} \]

The log-likelihood is then:

\[ \log(L) = \sum_{\text{non-censored}} \left[ Y_i\log(1-p) + \log(p) \right] + \sum_{\text{censored}} Y_i\log(1-p) \]

[3] Another example of censored data is water quality readings where the concentration of a chemical is below the detection limit and the detection limit provides an upper bound on the actual concentration.


We take the first derivative and set it to zero to find the point where the likelihood is maximized:

\[ \frac{\partial}{\partial p}\log(L) = \sum_{\text{non-censored}} \left[ \frac{-Y_i}{1-p} + \frac{1}{p} \right] + \sum_{\text{censored}} \frac{-Y_i}{1-p} = 0 \]

If there were no censoring, the above equation can be solved explicitly to give

\[ \hat{p}_{\text{if no censoring}} = \frac{1}{\bar{Y} + 1} \]

which has a nice interpretation: the denominator is the average number of months prior to becoming pregnant + the 1 month in which you became pregnant.

Unfortunately, in the presence of censoring, there is NO explicit solution and the MLE MUST be solved numerically.

The Pregnant tab in the Excel workbook available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls has an example of the sample computations, and highlights are shown below.

Again the Solver can be used to do the optimization.


The MLE is $\hat{p} = 0.21$.

The se is found in a similar fashion, i.e. find the second derivative of each contribution to the log-likelihood, and add them together to give a measure of information. Finally, take the square root of the inverse of the information value to give the se: $se(\hat{p}) = 0.066$.

You may be curious to know that the commonly “accepted” value for the probability of becoming pregnant in a month when attempting to become pregnant is about 25%. This value was obtained using methods very similar to what was shown above.

For information on the current “state of the art” for these types of studies, see:

Scheike, T.H. and Keiding, N. (2006). Design and analysis of time-to-pregnancy. Statistical Methods in Medical Research, 15, 127-140. http://dx.doi.org/10.1191/0962280206sm435oa


1.1.8 Example: Zero-truncated distribution

The Poisson distribution is a popular distribution used to model smallish counts.

In some cases, only positive counts can be observed, i.e. you can’t observe a 0 value. This is known as a zero-truncated Poisson distribution. For example, the number of occupants in a car on a freeway can be closely modeled by a zero-truncated Poisson distribution because you can’t observe cars on the road with 0 occupants!

Notice how this differs from censoring. In censored data, the actual value isn’t observed, but there is information on the actual value. For example, because we can’t see the back-seat of a car very well, if we see two occupants in the front seat, we know the total number of occupants must be at least 2, i.e. could be 2 or 3 or 4 etc. In a zero-truncated distribution, the truncated values are simply not possible – for example, you would never see a car with 0 occupants on the freeway!

The probability distribution for a zero-truncated Poisson distribution (Y > 0) is:

\[ P(Y = y \mid \lambda) = \frac{e^{-\lambda}\,\lambda^{y}/y!}{1 - e^{-\lambda}} \]

which is a regular Poisson distribution on the top, but adjusted in the denominator by the probability of observing a Poisson count of 1 or more. This ensures that the probability still adds to 1 when summed over the possible values of Y.

The likelihood development is straightforward and not detailed here. There are no closed-form solutions for the MLE and numerical methods must be used.

The ZeroTruncatedPoisson tab in the Excel workbook available at http://www.stat.sfu.ca/~cschwarz/Stat-650/Notes/MyPrograms/MLE/MLE.xls demonstrates how to find the MLE numerically.
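A sketch of the numerical fit (an added illustration; the car-occupancy counts below are invented for demonstration and are not data from the notes):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import poisson

y = np.array([1, 1, 2, 1, 3, 1, 2, 1, 1, 4])   # hypothetical occupants per car (all > 0)

def neg_log_lik(lam):
    # regular Poisson probability divided by P(Y > 0) = 1 - exp(-lambda)
    return -np.sum(poisson.logpmf(y, lam) - np.log(1 - np.exp(-lam)))

fit = minimize_scalar(neg_log_lik, bounds=(0.01, 50), method="bounded")
print(fit.x)    # the MLE of lambda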

1.1.9 Example: Capture-recapture with 2 parameters

Capture-recapture is a common method to study population dynamics of wildlife. Animals are marked (usually with individually numbered tags) and released back into the population. Subsequent recaptures of marked animals provide information about survival rates, movement rates, etc.

For example, suppose a set of birds was marked and released in year 1 and recaptures took place in years 2 and 3. The data consists of the capture history of each bird expressed as a 3 digit vector. For example, the history 101 indicates a bird that was marked and released in year 1, not seen in year 2, but recaptured again in year 3. If birds are only marked and released in year 1, there are 4 possible capture histories that could occur: 100, 101, 110, or 111. [Because birds are only marked in year 1, the capture histories 011, 010, 001 cannot occur.]

Suppose 1000 birds were marked and released, with the following summary table of recaptures:

History   n_history
100             501
101             140
110             250
111             109

How can these capture histories provide information about the population dynamics?

The basic problem that needs to be overcome is the less than 100% recapture of marked birds. For example, birds with history 101 were not recaptured in year 2 but must have been alive because they were recaptured alive in year 3. However, birds with history 100 may have died between years 1 and 2, or survived to year 2 and just were not recaptured, with a similar ambiguity about the status of the bird in year 3.

To begin, we need to define a probability model for the observed data. There are two parameters of interest. Let φ be the yearly survival rate, i.e. the probability that a bird alive in year i will be alive in year i + 1; and p be the yearly capture rate, i.e. the probability that a bird who is alive in year i will be recaptured in year i. For simplicity we will assume that the survival and recapture rates are constant over time.

Consider first birds with capture history 111. We know for sure that this bird survived between year 1 and year 2, was recaptured in year 2, survived again between year 2 and year 3, and was recaptured in year 3. Consequently the probability of this history is found as:

P (111) = φpφp

Similarly for history 101 we know the bird survived both years, but was only recaptured in year 3. This gives:

P (101) = φ (1 − p ) φp

Notice the (1 − p ) term to account for not being seen in year 2.

The probability of history 110 is more complex. We know the bird survived between year 1 and year 2 and was seen in year 2. We don’t know the fate of the bird after year 2. It either died between year 2 and year 3, or survived between year 2 and year 3 but wasn’t seen in year 3. This gives a probability:

P (110) = φp (1 − φ + φ (1 − p ))

Note the two possible outcomes after the recapture in year 2.

The probability of history 100 is more complex yet! The probability of this history is:

P(100) = (1 − φ) + φ(1 − p)(1 − φ) + φ(1 − p)φ(1 − p)

Can you explain the meaning of each of the three terms in the above expression?

Each and every marked bird released in year 1 MUST have one of the above capture histories. A common probability model for this type of experiment is a multi-nomial model which is the extension of the binomial model for success/failure (2 outcomes) to this case where there are 4 possible outcomes for each bird. The likelihood function is constructed by taking the product of the probability of each history over all of the birds:

\[ L = \prod_{\text{all birds}} P(\text{history}) \]

which reduces to (why?)

\[ L = P(100)^{n_{100}} \times P(101)^{n_{101}} \times P(110)^{n_{110}} \times P(111)^{n_{111}} \]

where P(100) has the expression given previously, etc.

The likelihood is a function of 2 parameters, φ and p. In order to maximize the log-likelihood over 2 parameters, you set up a system of two equations in two unknowns that needs to be solved. Each equation is the partial derivative of the log-likelihood with respect to one of the parameters. There is no closed form solution to the system of equations and it must be solved numerically.

The capture-recapture tab in the Excel workbook has the sample computations, and the Solver can easily maximize the log-likelihood function. The MLEs are $\hat{\phi} = 0.76$ and $\hat{p} = 0.45$.

Unfortunately it is NOT easy to obtain the matrix of second partial derivatives to estimate the information matrix in Excel and so this is not done.
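For completeness, here is a sketch of the same two-parameter fit in Python (an added illustration assuming scipy; the notes use Excel and Program MARK for this problem):

import numpy as np
from scipy.optimize import minimize

n = {"100": 501, "101": 140, "110": 250, "111": 109}

def neg_log_lik(params):
    phi, p = params
    pr = {
        "111": phi * p * phi * p,
        "101": phi * (1 - p) * phi * p,
        "110": phi * p * (1 - phi + phi * (1 - p)),
    }
    pr["100"] = 1 - pr["111"] - pr["101"] - pr["110"]
    # multinomial log-likelihood (the combinatorial constant does not involve phi or p)
    return -sum(n[h] * np.log(pr[h]) for h in n)

fit = minimize(neg_log_lik, x0=[0.5, 0.5], bounds=[(0.01, 0.99), (0.01, 0.99)])
print(fit.x)    # approximately [0.76, 0.45]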

Many of the leading capture-recapture researchers have developed a software package MARK, available from http://www.phidot.org/software/mark/docs/book/, that has implemented maximum likelihood estimation in this type (and many other types) of mark-recapture models. Consult the Program MARK: A Gentle Introduction for more details.


1.1.10 Summary

Maximum likelihood estimation is a standard, flexible method for many statistical problems. In the past, a great deal of effort was placed into finding closed form solutions – now numerical solutions are the norm and it is rare that interesting problems have a closed form solution.

MLE can deal with a wide range of data anomalies that are difficult to deal with in any other way such as truncation or censoring. Usually observations are assumed to be independent of each other which makes the joint likelihood easy to compute, but likelihood methods can also deal with non-independent observations.

The examples above were all for discrete data. Likelihood methods for continuous data (e.g. normal, log-normal, exponential distributions) are similar with the density function being used in place of the probability function when constructing the likelihood.

As noted above, likelihood methods may not perform well with small sample sizes nor with latent (hidden) variables. In these cases, Bayesian methods may be an alternative method for parameter estimation.

1.2 AIC and Model Selection

A large part of Statistics is finding models that are adequate representations of data. In many cases, several models all fit the data to a similar degree - which then is the “best” to use?

Many statistical methods have been developed to select among models in the search for the “best” model. Among these are:

• Stepwise Regression: Variables are selected from a long list of variables according to procedures that start either with a very simple model and then add variables, or procedures that start with a large set of variables and then drop variables.

• All Subsets Regression: For a moderate number of variables, it is possible to examine all the possible subsets of variables and then select the “best” model.

• Likelihood Ratio Tests: When models differ only by a single variable, a statistical test (called the likelihood ratio test) can be used to see if the simpler model is an adequate fit to the data. This is analogous to the backward stepwise regression technique.

However, all of these methods are less than satisfactory for a number of reasons:

• Assuming there is a BEST model: The model selection techniques make an implicit assumption that there is a single BEST model for the data. However, real life is not so nice, and statistical models are only an abstraction of reality.

• No accounting for other models: It often turns out that there are several models that all have similar fits to the data. The methods above do not account for this multiplicity of models, all of which could be reasonable candidates.

• False objectivity: The methods above have a false allure of objectivity. However, the model selected is highly dependent upon small changes to the data, upon the significance level used to select among models, and upon the underlying assumptions about the data (e.g. normality, independence) that are unlikely to be strictly true.

• Standard errors are too small: The standard errors of parameter estimates or prediction errors for predictions are too small in the sense that they are conditional upon the selected model being the only model considered and being the correct model for the data. The standard errors do not include uncertainty in the model selection procedure.

• Failure to account for multiple testing: Each model comparison involves a statistical test. Each test has a probability of a false positive result (the α level). The overall chance of a false positive result among all the model comparisons is much larger than α, yet no adjustment has been made. This is analogous to the use of multiple comparison procedures to examine differences among means in Analysis of Variance.

For these reasons, there has been a shift in emphasis in recent years from finding the “best” model to integrating all models in making predictions.

There are many methods to combine information from competing models – one popular paradigm is the Akaike Information Criterion (AIC) introduced by Akaike (1973). Burnham and Anderson (2002) and Anderson (2008) have a long and detailed look at the use of AIC in model selection in a wide variety of situations applied to ecological problems.

Under the AIC paradigm, the analyst first starts with a candidate set of models that are reasonable from a biological viewpoint. Then each model is fit, and a summary measure that trades off goodness-of-fit and the number of parameters (the AIC value) is computed for each model. The AIC values are used to compute a model weight for each model that summarizes how much weight should be applied to the results of this model in making predictions. Finally, the predictions from each model are weighted by the AIC weights, and the resulting estimate and standard error incorporate both model uncertainty and imprecision from sampling in each of the models.

1.2.1 Technical Details

The basic statistical tool for measuring the fit of a model to data is the likelihood function L(Y; θ), or the logarithm of the likelihood function, log(L(Y; θ)), where θ is the set of parameters used to describe the data Y. The estimates of the parameters that maximize the likelihood or log-likelihood are known as Maximum Likelihood Estimators (MLEs) and have nice statistical properties.

In some settings the likelihood function is explicit (e.g. in capture-recapture models) while in other settings the likelihood function is hidden from the analyst but still lurking in the background (e.g. in regression problems, the sum of squares residual is a 1-1 function of the likelihood function and minimizing the sum of squares residual is equivalent to maximizing the likelihood).

As more parameters are added to the model (e.g. more variables to a regression problem, or a time-varying capture rate is considered rather than a constant survival rate over time), the value of the likelihood function must increase as you can always get a better fit to data with more parameters. However, adding parameters also implies that the total information in the data must be split over more parameters which gives estimates with worse precision, i.e. the standard errors of estimates get larger as more parameters are fit. Is the improvement of fit substantial enough to require the additional parameters with the resulting loss in precision?

Akaike (1973) developed a combined measure that trades off improvement of fit and the number of parameters required to obtain this fit:

\[ AIC = -2\log(L) + 2K \]

where K is the number of parameters in the model and log() is the natural logarithm to the base e. [The multiplier 2 is included for historical reasons.]


As the number of parameters increases (as K increases), the log-likelihood also increases, and so −2log(L) decreases. If adding one additional parameter substantially improves the fit, then the −2log(L) term decreases more than the 2K penalty increases, and the AIC is smaller. If adding one additional parameter gives no substantial increase in fit, then the 2K term dominates and the AIC increases.

The ‘optimal’ tradeoff between fit and the number of parameters occurs with the smallest value of AIC among models in our model set.

The above equation for AIC can be modified slightly to account for small sample sizes (leading to AICc) or for a general lack of fit of any model (leading to QAIC or QAICc). These details are not explored in this overview, but the same general principles are applicable.

While the ‘optimal’ model (among those in the model set) is the one with the lowest AIC, there may be several models whose AIC differs from the lowest by only a small amount. How much support is there for selecting one model over the other? Notice the use of the word support, rather than statistical significance. Burnham and Anderson (2002) and Anderson (2008) recommend several rules of thumb to select among models, based on differences in AIC.

The difference in AIC between a specific model and the best fitting model is denoted as ∆ AIC . By definition, ∆ AIC = 0 for the best fitting model. When the difference in AIC between 2 models ( ∆ AIC ) is less than 2 units, then one is reasonably safe in saying that both models have approximately equal weight in the data. If 2 < ∆ AIC < 7 , then there is considerable support for a real difference between the models, and if ∆ AIC > 7 , then there is strong evidence to support the conclusion of differences between the models.

This can be quantified further by computing an index of relative plausibility using normalized Akaike weights. These weights ($w_i$) for the i-th model in the candidate set are calculated as

\[ w_i = \frac{\exp(-\Delta AIC_i/2)}{\sum \exp(-\Delta AIC/2)} \]

i.e. compute $\exp(-\Delta AIC/2)$ for each model, and then normalize these to sum to 1.

The ratio of the weights of 2 models is sometimes referred to as an index of relative plausibility. For example, a model with $w_i$ = 0.4 would be twice as likely (given the data) as a model with $w_i$ = 0.2. This value is the strength of evidence of this model relative to other models in the set of models considered.

The uncertainty in which model is selected is accommodated through the use of model averaging. In this method, estimates from each model are averaged together using the AIC weights in a weighted average. For example, suppose there are R models in the candidate set, and let $\hat{\theta}_i$ represent the estimate from the i-th model. Then the model averaged estimate is found as:

\[ \hat{\theta}_{avg} = \sum_{i=1}^{R} w_i\,\hat{\theta}_i \]

Buckland et al. (1997) also showed how to estimate a standard error for this averaged estimate that includes both the standard error from each of the candidate models (i.e. sampling uncertainty for each model) and the variation in the estimate among the candidate models (i.e. model uncertainty):

\[ \widehat{se}_{avg} = \sum_{i=1}^{R} w_i \sqrt{ se^2(\hat{\theta}_i) + (\hat{\theta}_i - \hat{\theta}_{avg})^2 } \]

The first component under the square root sign refers to the standard error of each estimate for a particular model; the second component refers to the variation in the estimates around the model averaged estimate.
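To make the weighting and averaging concrete, here is a small sketch (an added illustration; the AIC values, estimates, and se's below are invented, not taken from any example in the notes):

import numpy as np

aic       = np.array([100.0, 102.0, 107.0])   # hypothetical AIC values for three models
estimates = np.array([0.60, 0.61, 0.55])      # hypothetical estimates of theta from each model
ses       = np.array([0.03, 0.03, 0.02])      # hypothetical per-model standard errors

delta = aic - aic.min()                       # Delta AIC
w = np.exp(-delta / 2)
w = w / w.sum()                               # normalized Akaike weights

theta_avg = np.sum(w * estimates)             # model-averaged estimate
# Buckland et al. (1997) unconditional se
se_avg = np.sum(w * np.sqrt(ses**2 + (estimates - theta_avg)**2))
print(w, theta_avg, se_avg)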

What about statistical significance – where is the p-value for this model? The point Burnham and Anderson (2002) and Anderson (2008) make at this stage is that this sort of question represents misplaced focus. Instead, they suggest we should place greater emphasis on the effect size (the magnitude of the difference in estimates between models) than on significance levels. This is analogous to the arguments in favor of using confidence intervals in lieu of p-values in ecology.

1.2.2 AIC and regression

A popular paradigm in selecting regression models is to examine $R^2$ or adjusted $R^2$ values and select the models with the highest value. As noted elsewhere in the notes, there are serious problems with the uncritical use of $R^2$ in regression analysis.

A preferred method is to use AIC in model selection and averaging. It can be shown that if you make the usual assumption that residuals are normally distributed around the regression line, the equation to estimate AIC reduces to:

\[ AIC_{regression} = n \times \log\left(\frac{SSE}{n}\right) + 2K \]

where SSE is the sum of squares error; n is the number of data points; and K is the number of terms in the model (including the intercept), i.e. K = p + 1 where p is the number of predictors (excluding the intercept).


One important aspect to remember is that the scale of Y CANNOT be changed among models, i.e. you can’t do a regression on Y and on log ( Y ) and then simply compare the AIC values. In cases where different transformations on Y are needed, all the models must be fit on the same scale of Y . This may require a non-linear fit for some models.

There is NO problem in trying different transformations on the X variables – all the models with different transformations can be compared without problems.

1.2.3 Example: Capture-recapture

Capture-recapture methods are a common technique to estimate important population parameters such as the yearly survival rate, abundance, movement etc.

One simple capture-recapture experiment, the Cormack-Jolly-Seber model, releases marked animals and, based upon the subsequent recoveries, estimates survival and recapture rates. A brief example of how the probability models are constructed was demonstrated in the section on MLE, and the general theory is explained in Lebreton et al. (1992). We will consider a well studied example, a study on the European Dipper.

The study was conducted over 7 years with both males and females marked.

Here is a sample of the raw data in capture-history format. There were a total of 294 birds that were marked.

History    Male  Female
1111110       1       0
1111100       0       1
1111000       1       0
1111000       0       1
1101110       0       1
. . .

For example, the history 1101110 indicates a female bird that was captured in year 1, had a band applied to its leg, and was released. It was recaptured (alive) in year 2, not seen in year 3, recaptured (alive) in years 4, 5, and 6, and not seen after year 6. However, the fate in year 7 is unknown – the bird either died between years 6 and 7, or it was alive in year 7 and not recaptured – there is no way to know exactly what happened.

The probability model to describe this data has 4 sets of parameters: a set of yearly survival rates for the male birds, a set of yearly survival rates for the female birds, a set of yearly recapture rates for males, and a set of yearly recapture rates for females. There are several biological hypotheses that can be examined in a series of models, and interest lies in which model best describes the data, with estimates of survival and recapture. For example, perhaps males and females have the same yearly survival rates but the rates differ across years. Or males and females have different survival rates, but the survival rate for each sex is constant over time. Similarly, there are multiple models for the capture rates.

In order to keep track of the models, we adopt a standard notation as in Lebreton et al. (1989). Let g represent a group effect (sex), and t represent a time effect. The survival rates are commonly denoted using the parameter φ (phi) and the recapture rates are denoted using p. We can define the following models:

Model             Interpretation

phi(t), p(g)      Survival is the same for males and females, but the common survival rate varies across time. Recapture rates are equal across time, but vary between the groups.

phi(g*t), p(g)    Survival is different for every combination of group (sex) and time, i.e. no two values are equal. Recapture rates are equal across time, but vary between the groups.

phi(g), p(g*t)    Survival is different for males and females (a group effect) but constant over time. Recapture rates are different for every combination of group (sex) and time.

phi(.), p(g*t)    Survival is the same for both males and females (no group effect) and constant over time. Recapture rates are different for every combination of group (sex) and time.

There are many possible combinations.

In addition, it is known that a flood occurred in years 2 and 3; these high waters may also have affected survival/recapture rates. The high waters could have reduced the food supply, and the high waters may have made it more difficult to spot the banded birds. Additional models incorporating the effect of flood were also considered, such as:

Model                 Interpretation

phi(g*Flood), p(g)    Survival is different between males and females, and each sex's survival rate is equal across the non-flood (or flood) years but differs between flood and non-flood years. Recapture rates are equal across time, but vary between the groups.

A total of 20 different models were constructed a priori , i.e. BEFORE looking at the data based on biological guesses as to plausibility. Review some of the models in the list to be sure you understand what the notation indicates.


Models considered for the Dipper study

Model

Phi(t) p(t)

Phi(g*t) p(g*t)

Phi(g) p(g)

Phi(.) p(.)

Phi(Flood) p(.)

Phi(Flood) p(Flood)

Phi(.) p(g)

Phi(g) p(.)

Phi(.) p(Flood)

Phi(t) p(.)

Phi(t) p(g)

Phi(.) p(t)

Phi(g) p(t)

Phi(g*t) p(.)

Phi(g*t) p(g)

Phi(.) p(g*t)

Phi(t) p(g*t)

Phi(g*t) p(t)

Phi(g) p(g*t)

We showed earlier how the probability of each capture history can be written in terms of the parameters $\phi_{gt}$ and $p_{gt}$ – you should select a few models and write out some of the histories. You can also get a sense for the models by constructing a diagram showing the relationship among the parameters (known among users of the computer program MARK as the PIMs). This will be demonstrated in class.

Note that biologically speaking, only the model Phi(g*t) p(g*t) could possibly be true! For example, a phi(g) model would say that the survival rates differ between males and females but are equal across all times. It is logically impossible that the survival rate in year 1 would be equal to the survival rate in year 2 to 40 decimal places! So why would we fit such logically impossible models? The AIC paradigm says that ALL models are wrong, but some models may closely approximate reality. All else being equal, simpler models are preferred over more complex models because the uncertainty in the estimates must be smaller. But even this model likely is wrong, because it (implicitly) assumes that all birds of the same sex have the same survival rate in each year, which likely isn’t true due to innate differences in fitness among individuals. The real world is infinitely complex and can’t possibly be captured by simple models, but we hope that our models are close approximations.

All of the models were fit using Maximum Likelihood using Program MARK. A summary of the results is presented below:


Model                   -2log(L)  Num. Par     AICc  Delta AICc  AICc Weights  Relative Likelihood
{Phi(Flood) p(.)}         660.10         3   666.16        0.00          0.61                 1.00
{Phi(Flood) p(Flood)}     660.06         4   668.16        2.00          0.23                 0.37
{Phi(.) p(.)}             666.84         2   670.87        4.71          0.06                 0.10
{Phi(.) p(g)}             666.19         3   672.25        6.09          0.03                 0.05
{Phi(g) p(.)}             666.68         3   672.73        6.57          0.02                 0.04
{Phi(.) p(Flood)}         666.82         3   672.88        6.72          0.02                 0.03
{Phi(t) p(.)}             659.73         7   674.00        7.84          0.01                 0.02
{Phi(g) p(g)}             666.15         4   674.25        8.09          0.01                 0.02
{Phi(t) p(g)}             659.16         8   675.50        9.34          0.01                 0.01
{Phi(.) p(t)}             664.48         7   678.75       12.59          0.00                 0.00
{Phi(t) p(t)}             656.95        11   679.59       13.43          0.00                 0.00
{Phi(g) p(t)}             664.30         8   680.65       14.49          0.00                 0.00
{Phi(g*t) p(.)}           658.24        13   685.12       18.96          0.00                 0.00
{Phi(g*t) p(g)}           657.90        14   686.92       20.76          0.00                 0.00
{Phi(.) p(g*t)}           662.25        13   689.13       22.97          0.00                 0.00
{Phi(t) p(g*t)}           654.53        17   690.03       23.87          0.00                 0.00
{Phi(g*t) p(t)}           655.47        17   690.97       24.81          0.00                 0.00
{Phi(g) p(g*t)}           662.25        14   691.27       25.11          0.00                 0.00
{Phi(g*t) p(g*t)}         653.95        22   700.46       34.30          0.00                 0.00

First consider the number of parameters for each model. For model {phi(.) p(.)}, there are two parameters: the survival rate (φ) that is common to both sexes and equal across all years, and a similar parameter for the recapture rates. For model {phi(.) p(Flood)}, there are three parameters: the common survival rate, and the two recapture rates for flood vs. non-flood years, each of which is the same across both sexes and the respective years. For model {phi(t) p(.)}, there are 7 parameters: 6 survival rates (between years 1 and 2, 2 and 3, . . . , 6 and 7) that are common to both sexes, and a single recapture rate that is common to both sexes and all years.

At first blush, the number of parameters for model {phi(t) p(t)} should be 12, with six survival rates (as above) and 6 recapture rates (years 2, 3, 4, 5, 6, and 7). [There is NO recapture rate for year 1 because that is when the birds were first captured and released, and there is no information on the probability of first capture in year 1.] But the table above indicates that the number of parameters is 11. Why? It turns out that if the likelihood for this model is examined closely, the parameters $\phi_6$ and $p_7$ always occur together as a pair. This makes this pair of parameters confounded because, for example, if $\phi_6$ is doubled and $p_7$ is halved, then the product remains the same and the value of the likelihood is unchanged. Consequently, rather than counting $\phi_6$ and $p_7$ as two separate parameters, they are counted only as a single confounded parameter, reducing the parameter count by 1. The same problem occurs in any model where both the survival and recapture rates are time dependent.


One of the challenges of MLE in complex models is determining which parameters are identifiable – this is well beyond the scope of these notes. For the moment, just accept that the parameter count in the above table is correct.

As models become more complex (i.e. more parameters), they must fit better: the log-likelihood must increase. But the table (for historical reasons) reports −2log(L), which implies that −2log(L) decreases as models get more complex. But as models become more complex (with more parameters), the same amount of information is split among more and more estimates, leading to estimates with worse precision (i.e. larger standard errors). For example, compare the estimates from models with and without the effects of the flood years on the survival rates:

Model {Phi(Flood) p(.)}

Parameter          Estimate    SE
Phi (non-flood)    0.6070958   0.0309489
Phi (flood)        0.4688271   0.0432355
p                  0.8997889   0.0292696

Model {Phi(.) p(.)}

Parameter          Estimate    SE
Phi                0.560243    0.025133
p                  0.902583    0.028585

Notice that the se of the survival rate estimates in the simpler model are smaller (i.e. more precise) than in the more complex model.

In general, models with a small AICc are “better” in the tradeoff between model fit and model complexity. The AICc (the regular AIC corrected for small sample sizes) incorporates both model complexity (number of parameters) and model fit (likelihood). Look at the AICc for the top two models. The fit is almost the same (−2log(L) is virtually identical), but the second model has one more parameter. Consequently, the AICc “penalizes” the second model for being unnecessarily complex: because the AICc is essentially −2log(L) + 2 × (the number of parameters) plus a small sample adjustment, and −2log(L) is virtually the same for both models, the extra parameter results in almost a 2 unit increase in the AICc.
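As a quick check of that two-unit gap, here is a sketch using the −2log(L) and parameter counts from the table above (the plain AIC is computed; the small-sample AICc correction shifts both values slightly):

neg2logL = {"Phi(Flood) p(.)": 660.10, "Phi(Flood) p(Flood)": 660.06}
K        = {"Phi(Flood) p(.)": 3,      "Phi(Flood) p(Flood)": 4}

aic = {m: neg2logL[m] + 2 * K[m] for m in neg2logL}
print(aic)                                                    # 666.10 and 668.06
print(aic["Phi(Flood) p(Flood)"] - aic["Phi(Flood) p(.)"])    # about 1.96, i.e. roughly 2 units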

The Delta AICc column measures how much worse the other models in the model set are, relative to the “best” model in the model set. A rule of thumb is that differences of less than 3 or 4 units indicate two models of essentially equivalent fit. The reason for the “essentially equivalent” fit is that the model fits (and rankings) would change if the experiment were to be repeated (as any statistic computed from data would). Accordingly, the first 2 models are not really distinguishable.

The AICc Weight column is a measure of the “importance” of each model, relative to the models in the model set. It indicates that most of the weight should be placed on the first two models, with some minor weighting placed on the next 3 or 4 models, but virtually no weight on the remaining models.

In the AIC paradigm there are NO p-values for choosing between models, and NO model selection trying to find the single best model. The AIC paradigm recognizes that ALL of the models presented here are likely wrong (even the most complex model assumes that all animals have the same survival rate within a sex-year combination, which is likely untrue because different animals have different “fitness”). Consequently, trying to select the single “best” model is a fool's paradise! So “testing” if there is a flood effect isn’t done – after all, do you really expect absolutely NO effect (to 40 decimal places) of flooding on survival rates?

So what estimates should be reported? Again, rather than reporting estimates from a single model, the AIC paradigm uses model averaging to take the estimates from the various models and obtain a weighted average (by the AICc weights) of the estimates from the respective models. Let us consider the model averaged estimate for the survival rate for male birds between year 1 and year 2, as presented by MARK:

Apparent Survival Parameter (Phi) Males Parameter 1

Model                      Weight     Estimate    Standard Error
-------------------------  --------   ---------   --------------
{Phi(Flood) p(.)}          0.61200    0.6070958        0.0309489
{Phi(Flood) p(Flood)}      0.22558    0.6077116        0.0311473
{Phi(.) p(.)}              0.05818    0.5602430        0.0251330
{Phi(.) p(g)}              0.02912    0.5606992        0.0251604
{Phi(g) p(.)}              0.02287    0.5702637        0.0353294
{Phi(.) p(Flood)}          0.02125    0.5602377        0.0251253
{Phi(t) p(.)}              0.01215    0.6258353        0.1116455
{Phi(g) p(g)}              0.01073    0.5658226        0.0355404
{Phi(t) p(g)}              0.00572    0.6248539        0.1114060
{Phi(.) p(t)}              0.00113    0.5530902        0.0277124
{Phi(t) p(t)}              0.00074    0.7181819        0.1555472
{Phi(g) p(t)}              0.00044    0.5632387        0.0367877
{Phi(g*t) p(.)}            0.00005    0.6173162        0.1511198
{Phi(g*t) p(g)}            0.00002    0.6109351        0.1497766
{Phi(.) p(g*t)}            0.00001    0.5568856        0.0256228
{Phi(t) p(g*t)}            0.00000    0.7158931        0.1544318
{Phi(g*t) p(t)}            0.00000    0.7075812        0.1945868
{Phi(g) p(g*t)}            0.00000    0.5560768        0.0342813
-------------------------  --------   ---------   --------------
Weighted Average                      0.6012093        0.0320539
Unconditional SE                                       0.0378240

95% CI for Wgt. Ave. Est. (logit trans.) is 0.5253025 to 0.6725445
Percent of Variation Attributable to Model Variation is 28.18%

First you need to understand how you can get a survival rate for males in year 1 from each of the models when sex and/or year does not appear as a model effect? Consider the first model where there is a flood effect on survival but no sex effect. Then the survival rate in year 1 for males will the the estimated survival rate for non-flood years (which is common between the sexes).

In models with Phi(.), the estimate comes from the estimated survival rate that is common to all years and both sexes, and so on.

Notice that the estimated survival rates for males between year 1 and year 2 vary considerably among models, as do the standard errors. The AIC paradigm constructs a weighted average of the estimates from each model, which is reported at the bottom of the output. Because the model weights of the first few models total close to 100%, the final estimate must be close to the estimates from these models.

The model averaged se is not a simple weighted average (see the previous notes) but follows the same philosophy – models with higher AICc weights contribute more to the model averaged values. The Unconditional SE line adds an additional source of uncertainty – that from the models themselves. Notice that the estimates vary considerably about the model averaged value of 0.60.

The effect of the different models (after weighting by the AICc weights) is also incorporated. In this case, the models with vastly different estimates have little weight, so the unconditional se is only slightly larger than the simple model averaged se.
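To make the arithmetic concrete, the sketch below (plain Python, not MARK’s own code) recomputes the model averaged estimate and the unconditional se from the weights, estimates, and standard errors in the table above. It uses one common form of the unconditional variance – the weighted average of each model’s sampling variance plus its squared deviation from the model averaged estimate – which here reproduces the values reported by MARK to within rounding.

# Sketch: model averaging the male year 1-2 survival estimates using the
# AICc weights, estimates, and SEs from the table above.
import math

weights = [0.61200, 0.22558, 0.05818, 0.02912, 0.02287, 0.02125, 0.01215,
           0.01073, 0.00572, 0.00113, 0.00074, 0.00044, 0.00005, 0.00002,
           0.00001, 0.00000, 0.00000, 0.00000]
estimates = [0.6070958, 0.6077116, 0.5602430, 0.5606992, 0.5702637, 0.5602377,
             0.6258353, 0.5658226, 0.6248539, 0.5530902, 0.7181819, 0.5632387,
             0.6173162, 0.6109351, 0.5568856, 0.7158931, 0.7075812, 0.5560768]
ses = [0.0309489, 0.0311473, 0.0251330, 0.0251604, 0.0353294, 0.0251253,
       0.1116455, 0.0355404, 0.1114060, 0.0277124, 0.1555472, 0.0367877,
       0.1511198, 0.1497766, 0.0256228, 0.1544318, 0.1945868, 0.0342813]

# Model averaged estimate: weighted average of the per-model estimates.
est_bar = sum(w * e for w, e in zip(weights, estimates))

# Unconditional SE: within-model variance plus the between-model spread,
# each weighted by the AICc weight.
se_u = math.sqrt(sum(w * (se**2 + (e - est_bar)**2)
                     for w, e, se in zip(weights, estimates, ses)))

print(f"model averaged estimate: {est_bar:.5f}")   # about 0.601
print(f"unconditional SE:        {se_u:.5f}")      # about 0.038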

Now consider the model averaged survival rates for males between year 2 and 3 (a flood year).

Apparent Survival Parameter (Phi) Males Parameter 2

Model                       Weight     Estimate   Standard Error
--------------------------- ---------- ---------- --------------
{Phi(Flood) p(.) }          0.61200    0.4688271  0.0432355
{Phi(Flood) p(Flood) }      0.22558    0.4680133  0.0433897
{Phi(.) p(.) }              0.05818    0.5602430  0.0251330
{Phi(.) p(g) }              0.02912    0.5606992  0.0251604
{Phi(g) p(.) }              0.02287    0.5702637  0.0353294
{Phi(.) p(Flood) }          0.02125    0.5602377  0.0251253
{Phi(t) p(.) }              0.01215    0.4541913  0.0666224
{Phi(g) p(g) }              0.01073    0.5658226  0.0355404
{Phi(t) p(g) }              0.00572    0.4551758  0.0667781
{Phi(.) p(t) }              0.00113    0.5530902  0.0277124
{Phi(t) p(t) }              0.00074    0.4346707  0.0688290
{Phi(g) p(t) }              0.00044    0.5632387  0.0367877
{Phi(g*t) p(.) }            0.00005    0.4614109  0.1007034
{Phi(g*t) p(g) }            0.00002    0.4582630  0.0998610
{Phi(.) p(g*t) }            0.00001    0.5568856  0.0256228
{Phi(t) p(g*t) }            0.00000    0.4336095  0.0679647
{Phi(g*t) p(t) }            0.00000    0.4387672  0.1015258
{Phi(g) p(g*t) }            0.00000    0.5560768  0.0342813
--------------------------- ---------- ---------- --------------
Weighted Average                       0.4817958  0.0414640
Unconditional SE                                  0.0534687

95% CI for Wgt. Ave. Est. (logit trans.) is 0.3792812 to 0.5858662
Percent of Variation Attributable to Model Variation is 39.86%

Some models give the same estimated survival between year 1 and year 2 (non-flood, previous table) and between year 2 and year 3 (flood) – why? The model averaging proceeds in the same way. Notice that the unconditional se is now much larger than the conditional se because there is more variation among the top models in the estimated survival rates.
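The 95% confidence intervals reported by MARK are computed on the logit scale and then back-transformed. Here is a rough sketch of that calculation for the flood-year (year 2 to year 3) estimate, using the delta method to carry the unconditional se to the logit scale; this is only an approximation of what MARK does internally, but it recovers the reported interval of about 0.38 to 0.59 to within rounding.

# Sketch: 95% CI for the model averaged flood-year survival rate,
# computed on the logit scale and back-transformed.
import math

est = 0.4817958     # model averaged estimate (from the table above)
se = 0.0534687      # unconditional SE (from the table above)

logit = math.log(est / (1 - est))
se_logit = se / (est * (1 - est))   # delta method: d/dp logit(p) = 1/(p(1-p))

lo = logit - 1.96 * se_logit
hi = logit + 1.96 * se_logit

# Back-transform the endpoints to the probability scale.
lcl = 1 / (1 + math.exp(-lo))
ucl = 1 / (1 + math.exp(-hi))
print(f"95% CI: {lcl:.4f} to {ucl:.4f}")   # about 0.379 to 0.586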

The model averaged values for males and females can be computed for all time periods:

Model averaged apparent survival rates

                         Est App.
Parameter                Survival     SE     LCI    UCI
(Phi) Males Year 1         0.60      0.04   0.53   0.67
(Phi) Males Year 2         0.48      0.05   0.38   0.59
(Phi) Males Year 3         0.48      0.05   0.38   0.59
(Phi) Males Year 4         0.60      0.04   0.53   0.67
(Phi) Males Year 5         0.60      0.03   0.53   0.67
(Phi) Males Year 6         0.60      0.33   0.09   0.96
(Phi) Females Year 1       0.60      0.04   0.52   0.67
(Phi) Females Year 2       0.48      0.05   0.38   0.58
(Phi) Females Year 3       0.48      0.05   0.38   0.58
(Phi) Females Year 4       0.60      0.04   0.53   0.67
(Phi) Females Year 5       0.60      0.04   0.53   0.67
(Phi) Females Year 6       0.60      0.31   0.11   0.95

Note the use of “apparent survival”: a bird that dies cannot be distinguished from a bird that permanently leaves the study area. Also notice the odd results for the apparent survival rate between years 6 and 7. The se and confidence intervals are very large because this parameter cannot be estimated in some models (see above).

When using model averaging, you must be careful to only average estimates that are comparable and identifiable in all models.


The final model averaged values for males and females can be plotted along with their 95% confidence intervals.
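A figure along these lines can be drawn directly from the model averaged table; a minimal matplotlib sketch (with the values typed in from the table above and asymmetric error bars for the 95% limits) is:

# Sketch: plot the model averaged apparent survival rates with 95% CIs,
# using the values from the table above.
import matplotlib.pyplot as plt

years = [1, 2, 3, 4, 5, 6]
male = dict(est=[0.60, 0.48, 0.48, 0.60, 0.60, 0.60],
            lci=[0.53, 0.38, 0.38, 0.53, 0.53, 0.09],
            uci=[0.67, 0.59, 0.59, 0.67, 0.67, 0.96])
female = dict(est=[0.60, 0.48, 0.48, 0.60, 0.60, 0.60],
              lci=[0.52, 0.38, 0.38, 0.53, 0.53, 0.11],
              uci=[0.67, 0.58, 0.58, 0.67, 0.67, 0.95])

for label, d, offset in [("Males", male, -0.05), ("Females", female, 0.05)]:
    x = [y + offset for y in years]                       # offset sexes so bars do not overlap
    lower = [e - l for e, l in zip(d["est"], d["lci"])]   # distance from estimate down to LCI
    upper = [u - e for e, u in zip(d["est"], d["uci"])]   # distance from estimate up to UCI
    plt.errorbar(x, d["est"], yerr=[lower, upper], fmt="o", capsize=3, label=label)

plt.xlabel("Year")
plt.ylabel("Model averaged apparent survival")
plt.ylim(0, 1)
plt.legend()
plt.show()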

Along with the model fitting above, it is important to conduct thorough model assessments to ensure that even the best fitting model is reasonably sensible. This is not covered in this brief review – refer to the vast literature on capture-recapture models for assistance.

When using the AIC paradigm it is important to specify the model set BEFORE the analysis begins to avoid data dredging, and the model set should be comprehensive enough to include all models of biological interest. Nevertheless, it is important not to simply throw in all possible models and mechanically apply the procedure – each model in the model set should have a biological justification.

The program MARK has an extensive implementation of the AIC paradigm for use in capture-recapture studies. The AIC paradigm is the accepted standard for publishing work that involves capture-recapture experiments. If you submit a manuscript involving the use of capture-recapture methods and do not use AIC, it will likely be returned unread and unreviewed.

1.2.4 Example: Survival of Juvenile Pygmy Rabbits

Please download and read the article:

Price, A. J., Estes-Zumpf, W., and Rachlow, J. (2010). Survival of Juvenile Pygmy Rabbits. Journal of Wildlife Management, 74, 43-47. http://dx.doi.org/10.2193/2008-578

This paper illustrates the use of AIC methods in modern wildlife management research. In this article the authors use capture-recapture methods to study the survival of juvenile pygmy rabbits in east-central Idaho, US. In their study, they attached radio-tags to newly born rabbits and then followed each rabbit every 3 or 4 days to see if it was alive or dead.

The known-fate model that is fit takes into account that each animal is detected with 100% probability (because of the radio tracking) and that, if the animal dies, the time of death is known (to within 3 days). Consequently, the only unknown parameters are the weekly survival rates.

This differs from many capture-recapture experiments where detectability is less than 100% and you must estimate both survival and detection.

Figure 1 shows the Kaplan-Meier survivorship curve. The computation of this curve takes into account possible censoring (due to radio tags failing, etc.) and is the standard way of dealing with known-fate data. For example, suppose you have the following (hypothetical) data on the number of deaths by week.

        Alive at         Deaths by
Week    start of week    end of week
1       50               3
2       45               3
3       42               2
4       35               4
5       . . .            . . .

The survival rate for the first week is .94 = 47/50. This means that there were 47 animals alive at the start of week 2. However, only 45 could be located (2 radios could have failed). The animals with radio tags that could not be located at the start of week 2 are censored – their fate is unknown. The KM method computes the survival rate over week 2 as 42/45, so the cumulative survival rate over the first 2 weeks is computed as 47/50 × 42/45. Because 3 animals died by the end of week 2 and all remaining radios were located at the start of week 3, the number at risk in week 3 is 42. The cumulative survival to the end of week 3 is then computed as 47/50 × 42/45 × 40/42. The table and the KM estimates can be extended to the end of the study in a similar fashion.

The KM estimate is the MLE for this process.
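A minimal sketch of this Kaplan-Meier calculation for the hypothetical table above (plain Python, not MARK): the number at risk each week is the number of animals actually located at the start of that week, so censored animals simply drop out of the denominator.

# Sketch: Kaplan-Meier cumulative survival for the hypothetical known-fate data.
# at_risk = animals located (radio heard) at the start of each week;
# deaths  = deaths by the end of that week.
at_risk = [50, 45, 42, 35]
deaths = [3, 3, 2, 4]

cum_survival = 1.0
for week, (n, d) in enumerate(zip(at_risk, deaths), start=1):
    weekly = (n - d) / n        # conditional survival for this week
    cum_survival *= weekly      # KM product over the weeks so far
    print(f"week {week}: weekly = {weekly:.3f}, cumulative = {cum_survival:.3f}")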

There were several potential predictor variables for the weekly survival rates, as outlined in the METHODS section. Based on biologically reasonable grounds, a set of 14 a priori models was constructed (the model set). Examine the model set in Table 1 – be sure you understand the differences among the models and their biological interpretations. In particular, what does a YEAR × AREA model mean? What does the Constant survival model mean? What does the model with the effect of BORN mean?

ML estimates were found for each model, and AIC was used to rank the relative fit and complexity for the models in the model set (Table 1). Be sure you understand the number of parameters in each model.

What is meant by the sentence in the report, “A set of 9 models was included in the top model set . . . indicated relatively high model uncertainty” (first paragraph under Figure 1)?

Understand how Table 2 was computed and how to interpret it. Your understanding of how the table was computed will have to be conceptual, because the authors have omitted many details from the paper – for example, that survival was modelled on the logit scale and that the covariate for BORN was standardized automatically by MARK. As such, the actual numbers in Table 2 are pretty much useless for hand computations (!).

Look at Figure 3. The bars suggest that the effect of year is about the same in both areas, except translated upwards. What “model” does this suggest? Did this model rank high in Table 1? [Don’t forget to take into account the size of the se shown in the plot.]

Notice that there are NO p-values in the entire paper, and every estimate has an associated measure of precision (a se ).


1.3 References

Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. Pages 267-281 in B. N. Petrov and F. Csaki, editors. Second International Symposium on Information Theory. Akademiai Kiado, Budapest.

Anderson, D. R. (2008). Model Based Inference in the Life Sciences: A Primer on Evidence. Springer, New York.

Buckland, S. T., Burnham, K. P., and Augustin, N. H. (1997). Model selection: an integral part of inference. Biometrics, 53, 603-618.

Burnham, K. P., and Anderson, D. R. (2002). Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach. 2nd Edition. Springer-Verlag, New York, NY.

Lebreton, J.-D., Burnham, K. P., Clobert, J., and Anderson, D. R. (1992). Modeling survival and testing biological hypotheses using marked animals: a unified approach with case studies. Ecological Monographs, 62, 67-118.
