WTCHG course in statistical modelling and data analysis
Likelihood: Practical exercises using Matlab
Gil McVean
Last modified 01/11/2008

1. Calculating MLEs and likelihood surfaces

In population genetics the distribution of the number of mutational differences, $k$, between the copies of a gene in two randomly sampled chromosomes is described by the formula

$$P(k \mid \theta) = \left(\frac{1}{1+\theta}\right)\left(\frac{\theta}{1+\theta}\right)^{k},$$

where $\theta$ is a quantity called the population mutation rate. Answer the following questions.

• What name is given to this form of distribution? [Hint: look back at the distributions lecture]
• What is the expected number of mutational differences between two randomly sampled genes?
• Write down the log-likelihood function and find an expression for the maximum likelihood estimate.
• If I observe 23 differences between a pair of genes at one locus and 15 at another, what is the maximum likelihood estimate of $\theta$?
• Draw the log-likelihood surface for $\theta$ for the above example. Find the values of $\theta$ for which the log-likelihood is 2 units less than the log-likelihood at the MLE.

2. Likelihood ratio tests

In the previous question we informally compared different values for the parameter using the log-likelihood surface. In this question we are going to look more formally at the idea of likelihood ratio tests. The pdf for the Normal distribution is

$$f(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

i. Simulate 10 random variables from a Normal(0,1) distribution using Matlab (either using the built-in function randn or using the polar transformation method).
ii. Find the MLE for the parameter $\mu$ given your sample, assuming that you know the variance. Record the log-likelihood at the MLE.
iii. Find the log-likelihood for $\mu = 0$. Calculate twice the difference in log-likelihood between the MLE and the truth. Record this number.
iv. Repeat steps i-iii 1000 times and plot a histogram of the quantity calculated in iii. Using a qqplot, compare this distribution to that of a chi-squared distribution with one degree of freedom (you can do this by simulating normal random variables and squaring them). What is the probability of observing twice the difference in log-likelihood of more than 3.84?
v. Repeat steps i-iv, but this time estimate both the mean and variance of the normal distribution for each sample of size 10. How does estimating two parameters influence the distribution of the quantity calculated in iii?
vi. Why is the index parameter of the chi-squared distribution referred to as the degrees of freedom?

3. Sufficient statistics [Hard]

Let $X_1, X_2, \ldots, X_n$ be a series of $n$ random variables sampled from the distribution

$$f(x \mid \theta) = \theta x^{-2}, \qquad 0 < \theta \le x < \infty.$$

Answer the following.

• What is a sufficient statistic for $\theta$?
• Find the MLE of $\theta$.
• Find the method of moments estimator of $\theta$.
• Obtain the cdf of the distribution and use this to sample 10 random variables from the distribution with $\theta = 1$. Obtain the MLE for $\theta$. Repeat this 100 times and comment on the performance of the estimator in terms of bias, variance and distribution.
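
As a starting point for the last part of exercise 3, the following is a minimal Matlab sketch of the simulation, not a model answer. It rests on two assumptions that are not stated in the handout and that you should check against your own working: that integrating the pdf gives the cdf $F(x) = 1 - \theta/x$ for $x \ge \theta$, so a draw can be generated as $x = \theta/(1-u)$ with $u$ uniform on (0,1), and that the MLE is the sample minimum (the likelihood $\theta^n \prod_i x_i^{-2}$ increases in $\theta$ up to the constraint $\theta \le \min_i x_i$). The variable names (thetaHat, nrep and so on) are illustrative only.

```matlab
% Sketch: inverse-CDF sampling for f(x|theta) = theta * x^(-2), x >= theta.
% ASSUMPTION (derived here, not given in the handout): the cdf is
% F(x) = 1 - theta/x, so x = theta / (1 - u) with u ~ Uniform(0,1).
theta = 1;      % true parameter value
n     = 10;     % sample size per replicate
nrep  = 100;    % number of replicates

u = rand(n, nrep);          % one column of uniform draws per replicate
x = theta ./ (1 - u);       % inverse-CDF transform, element by element

% ASSUMPTION: the MLE is the smallest observation in each sample.
thetaHat = min(x, [], 1);   % 1-by-nrep row of estimates

% Summaries for commenting on bias and variance of the estimator.
fprintf('mean of estimates:     %.4f\n', mean(thetaHat));
fprintf('variance of estimates: %.4f\n', var(thetaHat));

% Histogram of the estimates, for commenting on their distribution.
hist(thetaHat, 20);
xlabel('estimate of theta');
ylabel('count');
```

No random seed is set, so the printed summaries will differ from run to run. Comparing mean(thetaHat) with the true value of 1 over several runs gives a first impression of the bias.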