DTC module in statistical modelling and inference
Tuesday Week 1: Statistical modelling – Practical work
These questions will all be discussed in detail during lectures. These notes are primarily a
reminder.
1. Estimating parameters by likelihood
Collect data for everyone in the class on the following two random variables:
- Whether or not they can touch their toes (binary response: 0 = no, 1 = yes).
- Gender (0 = male, 1 = female).
The aim is to see whether gender has any effect on whether or not an individual can touch
their toes. Answer the following:
- Assuming that there is some ‘probability’ that an individual can touch their toes, find
  the maximum likelihood estimate (MLE) for this probability.
- Plot the log-likelihood surface for this parameter. What is the range of values within
  2 units of log-likelihood from the MLE?
- Now estimate parameters under the model where men and women have different
  ‘parameters’.
- Using a likelihood ratio test, calculate a p-value for the hypothesis that men and
  women have the same probability.
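The steps above can be sketched in a few lines. This is a minimal illustration in plain Python (standard library only) rather than Matlab, and the counts are made up for the example, not class data. The helper names (`loglik`, `support_interval`, `chi2_sf_1df`) are hypothetical, and the p-value relies on the usual large-sample chi-square approximation with 1 degree of freedom, which may be rough for a small class.

```python
import math

def xlogy(x, y):
    """Return x * log(y), using the convention 0 * log(0) = 0."""
    if x == 0:
        return 0.0
    return x * math.log(y) if y > 0 else -math.inf

def loglik(p, k, n):
    """Binomial log-likelihood (up to an additive constant) for k successes in n trials."""
    return xlogy(k, p) + xlogy(n - k, 1.0 - p)

def support_interval(k, n, drop=2.0, steps=10000):
    """Grid search for the values of p whose log-likelihood is within `drop` of the maximum."""
    best = loglik(k / n, k, n)
    inside = [i / steps for i in range(1, steps)
              if loglik(i / steps, k, n) >= best - drop]
    return min(inside), max(inside)

def chi2_sf_1df(stat):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical counts, NOT real class data: (number who can touch toes, group size)
men, women = (6, 12), (9, 10)

# Model 1: one shared probability. The MLE is the overall proportion.
k_all, n_all = men[0] + women[0], men[1] + women[1]
p_hat = k_all / n_all
lo, hi = support_interval(k_all, n_all)

# Model 2: separate probabilities for men and women.
ll_null = loglik(p_hat, k_all, n_all)
ll_alt = loglik(men[0] / men[1], *men) + loglik(women[0] / women[1], *women)

# Likelihood ratio test: twice the gain in log-likelihood, 1 extra parameter.
lr_stat = 2.0 * (ll_alt - ll_null)
p_value = chi2_sf_1df(lr_stat)

print("MLE:", p_hat, "range within 2 log-likelihood units:", (lo, hi))
print("LR statistic:", lr_stat, "approximate p-value:", p_value)
```

The binomial coefficient is left out of `loglik` because it does not depend on p and cancels in every comparison made here.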
2. Adding covariates
It seems plausible that other factors influence whether or not you can touch your toes.
Collect these additional data for everyone:
- Age (months)
- Height (cm)
- Typical sporting activity (hours of exercise per week)
We are going to use the logistic function to relate all these covariates to the outcome. We
will use Matlab’s built-in glmfit function, but it is important to understand what this does.
Answer the following:
- Write down the likelihood function for the logistic model.
- Read the documentation for glmfit and use [b, dev, stats] = glmfit(X, y, 'binomial')
  to obtain estimates of the coefficients in the model (X is the matrix of predictors, y
  is the response). What is the probability that a 23-year-old male who is 176 cm in
  height and does 2 hours of exercise per week can touch his toes (under the fitted model)?
- For which predictors is the p-value for the coefficient less than 0.05? What can you
  say about which factors influence the ability to touch one’s toes?
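To make concrete what a logistic fit is maximising, here is a minimal sketch in plain Python (standard library only) rather than Matlab. It evaluates the logistic log-likelihood for a given coefficient vector and turns a linear predictor into a probability. The data rows, the coefficient values in `b_hyp` and the column order are all assumptions made for illustration; real values would come from the class data and from the fitted `b`. The intercept is placed first, matching glmfit's default of adding a constant term.

```python
import math

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def logistic(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def linear_predictor(b, row):
    """b[0] is the intercept; b[1:] pair up with the covariates in `row`."""
    return b[0] + sum(bj * xj for bj, xj in zip(b[1:], row))

def logistic_loglik(b, X, y):
    """Sum over individuals of y*log(p) + (1 - y)*log(1 - p), with p = logistic(eta)."""
    total = 0.0
    for row, yi in zip(X, y):
        eta = linear_predictor(b, row)
        # log p = -log(1 + exp(-eta)) and log(1 - p) = -log(1 + exp(eta))
        total += -yi * log1pexp(-eta) - (1 - yi) * log1pexp(eta)
    return total

# Hypothetical rows, NOT real class data.
# Assumed column order: gender, age (months), height (cm), exercise (hours per week)
X = [[0, 280, 180, 1.0], [1, 300, 165, 3.0], [0, 264, 175, 0.5], [1, 290, 170, 4.0]]
y = [0, 1, 0, 1]

# Hypothetical coefficients standing in for a fitted vector b (intercept first).
b_hyp = [2.0, 1.0, -0.002, -0.01, 0.3]
print("log-likelihood at b_hyp:", logistic_loglik(b_hyp, X, y))

# A 23-year-old male, 176 cm, 2 hours per week. Age is recorded in months.
person = [0, 23 * 12, 176, 2.0]
eta = linear_predictor(b_hyp, person)
p_touch = logistic(eta)
print("predicted probability:", p_touch)
```

Working with log(1 + exp(.)) instead of log(p) directly is a design choice that keeps the sum finite when a fitted probability is very close to 0 or 1.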
3. Linear modelling
In a linear model, we assume that the response variable is normally distributed around its
expected value. Note that while the binomial distribution has a single parameter, the
normal distribution has two (mean and variance). Answer the following:
- Using glmfit, estimate the coefficients for a model in which hours of activity per
  week is the response and age, gender and height are all predictors. What are the
  link and distribution functions for the linear model?
- Using the MLEs for the coefficients, predict the number of hours of activity per week
  for a 100-year-old man who is 150 cm tall. Do you trust this prediction?
- Calculate the difference in log-likelihood between the null model (the covariates
  have no influence on activity) and the full model. Can you reject the null model?
- Why does the output of glmfit not mention the variance parameter for the
  normal distribution?
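The log-likelihood comparison can be sketched as follows, in plain Python (standard library only) rather than Matlab. To stay short, the sketch uses a single predictor (height) instead of the three asked for above, so the test has 1 degree of freedom; the data values and helper names are hypothetical. It relies on two standard facts: the least-squares coefficients do not depend on the variance, and the MLE of the variance is the residual sum of squares divided by n.

```python
import math

def fit_line(x, y):
    """Least-squares intercept and slope, which are also the Gaussian MLEs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def max_gaussian_loglik(residuals):
    """Gaussian log-likelihood with the variance set to its MLE, RSS / n."""
    n = len(residuals)
    sigma2_hat = sum(r * r for r in residuals) / n
    return -0.5 * n * (math.log(2.0 * math.pi * sigma2_hat) + 1.0)

def chi2_sf_1df(stat):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical values, NOT real class data.
height = [150, 160, 165, 170, 175, 180, 185, 190]   # cm
hours = [1.0, 2.5, 2.0, 3.0, 4.5, 3.5, 5.0, 4.0]    # exercise per week
n = len(hours)

# Null model: the response has a constant mean.
mean_hours = sum(hours) / n
resid_null = [h - mean_hours for h in hours]
rss_null = sum(r * r for r in resid_null)

# Full model (one predictor in this sketch): mean depends linearly on height.
a, b = fit_line(height, hours)
resid_full = [yi - (a + b * xi) for xi, yi in zip(height, hours)]
rss_full = sum(r * r for r in resid_full)

ll_null = max_gaussian_loglik(resid_null)
ll_full = max_gaussian_loglik(resid_full)
lr_stat = 2.0 * (ll_full - ll_null)      # equals n * log(rss_null / rss_full)
p_value = chi2_sf_1df(lr_stat)           # large-sample approximation

print("difference in log-likelihood:", ll_full - ll_null)
print("LR statistic:", lr_stat, "approximate p-value:", p_value)
```

With all three predictors the same comparison applies, but the reference distribution would have 3 degrees of freedom.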
4. Constructing bespoke models
In the previous week’s introductory statistics session, people had to remember a permutation
of the numbers 1-10. Either retrieve this information or obtain it again: get a helper to use
Matlab’s randperm function to generate a permutation of the numbers 1-10; they read it
out slowly, you read it back to them, and they record the number of correct answers before
the first mistake. Obtain data for all the class. Think about what a suitable model might be
for these data and how you might introduce covariates into it. Answer the following:
- Under the null model, estimate the parameters for your model.
- Use a likelihood ratio test to ask whether there is evidence that the probability of
  getting the answer wrong increases with the position in the sequence. You can either do
  a grid search or use Matlab’s fminsearch function to optimise.
- Describe one parameterisation of the model that would allow you to ask whether
  gender has any influence on this short-term memory test.
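One possible model, sketched below in plain Python (standard library only) rather than Matlab, treats each position in the sequence as a trial that ends the run with some error probability. This is only one candidate, not necessarily the intended one. Under the null model the error probability is the same at every position; under the alternative its logit changes linearly with position. The scores, the grid ranges and the helper names are all assumptions made for illustration, and the grid is centred on the null estimate so that the alternative can never fit worse than the null.

```python
import math

N_POS = 10  # length of the permutation

def log1pexp(t):
    """Numerically stable log(1 + exp(t))."""
    return t + math.log1p(math.exp(-t)) if t > 0 else math.log1p(math.exp(t))

def position_counts(scores):
    """For each position, count how many people got it right and how many first erred there."""
    correct = [sum(1 for k in scores if k > i) for i in range(N_POS)]
    wrong = [sum(1 for k in scores if k == i) for i in range(N_POS)]
    return correct, wrong

def loglik(correct, wrong, a, b):
    """Log-likelihood when logit(error probability at position i + 1) = a + b * i."""
    total = 0.0
    for i in range(N_POS):
        eta = a + b * i
        # log q = -log(1 + exp(-eta)) and log(1 - q) = -log(1 + exp(eta))
        total += -wrong[i] * log1pexp(-eta) - correct[i] * log1pexp(eta)
    return total

def chi2_sf_1df(stat):
    """Upper tail probability of a chi-square distribution with 1 degree of freedom."""
    return math.erfc(math.sqrt(max(stat, 0.0) / 2.0))

# Hypothetical scores, NOT real class data: correct answers before the first mistake.
scores = [3, 5, 4, 7, 6, 10, 2, 5, 8, 4, 6, 5]
correct, wrong = position_counts(scores)

# Null model: constant error probability, with a closed-form MLE.
q_null = sum(wrong) / (sum(wrong) + sum(correct))
a0 = math.log(q_null / (1.0 - q_null))
ll_null = loglik(correct, wrong, a0, 0.0)

# Alternative: grid search over intercept and slope, including the null point (a0, 0).
ll_alt, a_hat, b_hat = max(
    (loglik(correct, wrong, a0 + 0.1 * i, 0.02 * j), a0 + 0.1 * i, 0.02 * j)
    for i in range(-30, 31)
    for j in range(-50, 51)
)

lr_stat = 2.0 * (ll_alt - ll_null)
p_value = chi2_sf_1df(lr_stat)   # large-sample approximation, 1 extra parameter

print("null error probability:", q_null)
print("grid estimate of slope:", b_hat)
print("LR statistic:", lr_stat, "approximate p-value:", p_value)
```

A gender effect could be explored in the same framework by letting the intercept differ between the two groups and comparing that fit with the shared-intercept model.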