DTC module in statistical modelling and inference

Tuesday Week 1: Statistical modelling – Practical work

These questions will all be discussed in detail during lectures. These notes serve primarily as a reminder.

1. Estimating parameters by likelihood

Collect data for everyone in the class on the following two random variables:

- Whether or not they can touch their toes (binary response: 0 = no, 1 = yes)
- Gender (0 = male, 1 = female)

The aim is to see whether gender has any effect on whether or not an individual can touch their toes.

Answer the following:

- Assuming that there is some ‘probability’ that an individual can touch their toes, find the maximum likelihood estimate (MLE) for this probability.
- Plot the log-likelihood surface for this parameter. What range of parameter values has a log-likelihood within 2 units of the value at the MLE?
- Now estimate parameters under the model where men and women have different ‘parameters’.
- Using a likelihood ratio test, calculate a p-value for the hypothesis that men and women have the same probability.

2. Adding covariates

It seems plausible that other factors influence whether or not you can touch your toes. Collect these additional data for everyone:

- Age (months)
- Height (cm)
- Typical sporting activity (hours of exercise per week)

We are going to use the logistic function to relate all these covariates to the outcome. We will use Matlab’s inbuilt glmfit function, but it is important to understand what this does.

Answer the following:

- Write down the likelihood function for the logistic model.
- Read the documentation for glmfit and use [b, dev, stats] = glmfit(X, y, 'binomial') to obtain estimates of the coefficients in the model (X is the matrix of predictors, y is the response).
- What is the probability that a 23-year-old male who is 176 cm in height and does 2 hours of exercise per week can touch his toes (under the fitted model)?
- For which predictors is the p-value for the coefficient less than 0.05?
- What can you say about which factors influence ability to touch one’s toes?

3. Linear modelling

In a linear model, we assume that the response variable is normally distributed around its expected value. Note that while the binomial distribution has a single parameter, the normal distribution has two (mean and variance).

Answer the following:

- Using glmfit, estimate the coefficients for a model in which hours of activity per week is the response and age, gender and height are all predictors.
- What are the link and distribution functions for the linear model?
- Using the MLEs for the coefficients, predict the number of hours of activity per week for a 100-year-old man who is 150 cm tall. Do you trust this prediction?
- Calculate the difference in log-likelihood between the null model (the covariates have no influence on activity) and the full model. Can you reject the null model?
- Why does the output of glmfit not mention the variance parameter for the normal distribution?

4. Constructing bespoke models

In the previous week’s introductory statistics, people had to remember a permutation of the numbers 1-10. Either retrieve this information or obtain it again: get a helper to generate a permutation of the numbers 1-10 with Matlab’s randperm function; they read it out slowly, you read it back to them, and they record the number of correct answers before the first mistake. Obtain data for the whole class. Think about what a suitable model might be for these data and how you might introduce covariates into it.

Answer the following:

- Under the null model, estimate the parameters for your model.
- Use a likelihood ratio test to ask whether there is evidence that the probability of getting the answer wrong increases with the position in the sequence. You can either do a grid search or use Matlab’s fminsearch function to optimise.
- Describe one parameterisation of the model that would allow you to ask whether gender has any influence on this short-term memory test.
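
As a starting point for the fminsearch route in section 4, here is a minimal sketch of one possible model. It is only one of several reasonable choices. It assumes that the probability of an error at position k follows a logistic function of k, with intercept a and slope b, so the null model is b = 0. The scores in n are made-up placeholder values, and the names pk, pick, logOk, logErr, negll and negll0 are hypothetical helpers introduced here for illustration.

```matlab
% Hypothetical scores: number of correct answers before the first mistake
% (0 to 10), one entry per person. Placeholder values, NOT real class data.
n = [3 5 10 4 6 2 7 5 4 10];
L = 10;                                   % length of the permutation

% Assumed model: the probability of an error at position k is
%   p_k = 1 / (1 + exp(-(a + b*k))),  with theta = [a b].
pk   = @(theta) 1 ./ (1 + exp(-(theta(1) + theta(2) * (1:L))));
pick = @(v, idx) v(idx);                  % helper: index into a computed vector

% Element m+1 of logOk(theta): log P(first m answers are all correct)
logOk  = @(theta) [0, cumsum(log(1 - pk(theta)))];
% Element m+1 of logErr(theta): log P(error at position m+1),
% or 0 when m = L (the whole sequence was recalled, so no error term)
logErr = @(theta) [log(pk(theta)), 0];

% Negative log-likelihood, summed over people
negll = @(theta) -sum(pick(logOk(theta), n + 1) + pick(logErr(theta), n + 1));

% Null model: b = 0, so the error probability is the same at every position
negll0 = @(a) negll([a, 0]);
a0 = fminsearch(negll0, 0);

% Full model: start the search from the null fit
theta1 = fminsearch(negll, [a0, 0]);

% Likelihood ratio statistic, compared with chi-squared on 1 degree of freedom.
% For 1 d.f. the upper tail probability is erfc(sqrt(D/2)).
D = 2 * (negll0(a0) - negll(theta1));
p = erfc(sqrt(max(D, 0) / 2));

fprintf('Null model: error probability = %.3f\n', 1 / (1 + exp(-a0)));
fprintf('Full model: a = %.3f, b = %.3f\n', theta1(1), theta1(2));
fprintf('LRT: D = %.3f, p = %.3f\n', D, p);
```

Under the null model the fitted error probability should agree with the closed-form estimate, the number of people who made a mistake divided by the total number of positions attempted, which is a useful check on the optimiser. The chi-squared comparison is two-sided, so also check that the fitted b is positive before reading a small p-value as evidence that errors become more likely later in the sequence.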