Estimation: Practical exercises using R

WTCHG course in statistical modelling and data analysis
Self-assessment exercises
Gil McVean
The aim of this series of questions is to revisit some of the key ideas presented in the first two weeks so as to
make sure you have a thorough grasp of some of the key concepts in statistical analysis. Each question has
three parts – a simple question to answer, a short practical exercise to implement and a more advanced
question for those who want a challenge.
1. Plotting data
A) What aspects of your data might you aim to summarise graphically by the use of i) a histogram, ii) a
boxplot and iii) a QQplot?
B) Using R, read in the file “gc_content.txt”, which contains measurements of GC content in 170
contiguous windows of 1000bp along a part of chromosome 1 in humans. Plot the data in the second
column using a histogram and a boxplot. Summarise what these figures tell you about the distribution of the
data. Using a QQplot, compare the distribution of these data to the standard normal distribution.
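One way to lay out the three plots is sketched below. It is only an outline: the data frame `gc_tab` is filled with simulated stand-in values so that the code runs on its own, and whether the real file has a header line is an assumption you should check.

```r
# Stand-in data (assumption). With the real file use something like
#   gc_tab <- read.table("gc_content.txt")   # add header = TRUE if needed
set.seed(1)
gc_tab <- data.frame(window = seq_len(170), gc = rbeta(170, 20, 30))

x <- gc_tab[, 2]                  # second column holds the GC measurements

op <- par(mfrow = c(1, 3))        # three panels side by side
hist(x, main = "Histogram", xlab = "GC content")
boxplot(x, main = "Boxplot", ylab = "GC content")
qqnorm(x)                         # sample quantiles against standard normal quantiles
qqline(x)                         # reference line through the quartiles
par(op)                           # restore the previous plot layout
```

Curvature in the QQ plot away from the reference line points to skew or heavy tails relative to the normal.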
C) How might you try to reject the null hypothesis that these data are drawn from a normal distribution?
Devise some statistic that is sensitive to departures from normality and use a simulation procedure to assess
the evidence for a departure from normality in the GC content data.
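As an illustration of the simulation idea, the sketch below uses sample skewness as the test statistic. This is one possible choice, not the only one, and the vector `x` is simulated stand-in data rather than the real GC measurements.

```r
set.seed(1)
x <- rbeta(170, 20, 30)           # stand-in for the GC column (assumption)

# Sample skewness: zero in expectation for a normal distribution
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
obs <- skew(x)

# Null distribution of the statistic for normal samples of the same size.
# Skewness does not change under shifts or rescaling, so standard normal
# samples are enough.
null <- replicate(5000, skew(rnorm(length(x))))

# Two-sided Monte Carlo p-value; the +1 terms stop it from being exactly zero
p_val <- (sum(abs(null) >= abs(obs)) + 1) / (length(null) + 1)
p_val
```

A statistic based on kurtosis, or on the largest deviation in the QQ plot, could be dropped into the same procedure.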
2. Distributions
A) Explain the relationship between the exponential distribution, the Poisson distribution and the gamma
distribution.
B) Using R, read in the data set “coverage_data.txt”. These data measure the depth of shotgun
sequence coverage across the same 170kb region of human chromosome 1 as in the previous question. Plot the
data in the second column using a histogram. We want to try to model the data by an exponential
distribution. How can we use the method of moments to give us an estimate of the parameter for the
distribution? Obtain an estimate and use a QQplot to investigate whether it is a good description.
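A minimal sketch of the method-of-moments fit follows. The vector `cov_depth` is a simulated stand-in for the coverage column (an assumption), and the QQ plot is built by hand from `qexp()` because `qqnorm()` only compares against the normal.

```r
set.seed(1)
cov_depth <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)

# For an exponential with rate r, E[X] = 1/r, so match the sample mean
rate_hat <- 1 / mean(cov_depth)

# QQ plot: sorted data against exponential quantiles at plotting positions
n <- length(cov_depth)
theo <- qexp(ppoints(n), rate = rate_hat)
plot(theo, sort(cov_depth),
     xlab = "Exponential quantiles", ylab = "Observed coverage")
abline(0, 1)                      # points close to this line indicate a good fit
```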
C) Fit a gamma distribution to the same data using the method of moments. Remember that the
exponential is a special case of the gamma distribution. How different is the distribution that you have fitted
from the exponential?
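The gamma fit can be sketched in the same way, using the shape and rate parameterisation of `dgamma()`. Again `cov_depth` is simulated stand-in data (an assumption), so the numbers it produces say nothing about the real coverage.

```r
set.seed(1)
cov_depth <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)

m <- mean(cov_depth)
v <- var(cov_depth)

# Gamma: mean = shape / rate and variance = shape / rate^2, hence
shape_hat <- m^2 / v
rate_hat  <- m / v

# Overlay the fitted gamma (solid) and the exponential fit (dashed)
hist(cov_depth, freq = FALSE, main = "", xlab = "Coverage")
curve(dgamma(x, shape = shape_hat, rate = rate_hat), add = TRUE)
curve(dexp(x, rate = 1 / m), add = TRUE, lty = 2)

c(shape = shape_hat, rate = rate_hat)
```

A fitted shape close to 1 would mean the gamma is nearly the exponential.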
3. The Central Limit Theorem
A) State the central limit theorem and explain why it is so important in statistics.
Last modified 01/11/2008
B) Using R, simulate 1000 bootstrap samples from the coverage data and obtain the mean of each sample.
What values do you expect for the mean and the variance of this distribution? Compare the distribution of
these means to a normal distribution with the expected mean and variance. Now repeat, but calculating the
variance of each sample. What distribution do you expect the sample variances to take? Why?
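The bootstrap step might look like the sketch below, where `cov_depth` is simulated stand-in data (an assumption) and the normal curve uses the sample mean and the usual variance-over-n approximation for the variance of the mean.

```r
set.seed(1)
cov_depth <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)
n <- length(cov_depth)

# 1000 bootstrap samples: resample n values with replacement, keep the mean
boot_means <- replicate(1000, mean(sample(cov_depth, n, replace = TRUE)))

exp_mean <- mean(cov_depth)            # expected centre of the bootstrap means
exp_sd   <- sqrt(var(cov_depth) / n)   # approximate standard deviation of the mean

hist(boot_means, freq = FALSE, main = "", xlab = "Bootstrap mean")
curve(dnorm(x, mean = exp_mean, sd = exp_sd), add = TRUE)

# Same procedure for the sample variance
boot_vars <- replicate(1000, var(sample(cov_depth, n, replace = TRUE)))
hist(boot_vars, freq = FALSE, main = "", xlab = "Bootstrap variance")
```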
C) The Pareto distribution is used to describe phenomena in which a small fraction of objects account for
much of the activity/mass (grains of sand, traffic on the internet, sizes of human settlements, etc. – i.e.
long-tailed distributions). It is characterised by two parameters – a shape parameter k (k > 0) and a scale
parameter x_m (x_m > 0). The pdf is given by
$$f(x \mid k, x_m) = \frac{k\,x_m^{k}}{x^{k+1}} \quad \text{for } x \ge x_m$$
Using your powers of integration, work out whether the distribution has finite mean and variance and,
consequently, whether the CLT holds for this distribution.
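The integrals involved can be set up as follows. This is a sketch of the standard calculation for the pdf given above, with k the shape and x_m the scale parameter; try it yourself before checking.

```latex
\begin{aligned}
E[X]   &= \int_{x_m}^{\infty} x \,\frac{k\,x_m^{k}}{x^{k+1}}\,dx
        = k\,x_m^{k}\int_{x_m}^{\infty} x^{-k}\,dx
        = \frac{k\,x_m}{k-1} \quad \text{if } k > 1, \\
E[X^2] &= \int_{x_m}^{\infty} x^{2}\,\frac{k\,x_m^{k}}{x^{k+1}}\,dx
        = k\,x_m^{k}\int_{x_m}^{\infty} x^{1-k}\,dx
        = \frac{k\,x_m^{2}}{k-2} \quad \text{if } k > 2.
\end{aligned}
```

The first integral diverges when k is at most 1 and the second when k is at most 2, so the finite-variance condition of the classical CLT is met only for k > 2.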
4. Estimating uncertainty
A) Explain what is meant by a confidence interval.
B) Calculate a 95% confidence interval for the mean of the GC content data. [Note, it would be good to go
through the details of this calculation before checking that you agree with the results of applying
t.test()].
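The calculation can be sketched as below, assuming the usual t-based interval for a mean. The vector `x` is simulated stand-in data (an assumption), so only the agreement between the two methods is meaningful, not the numbers.

```r
set.seed(1)
x <- rbeta(170, 20, 30)           # stand-in for the GC column (assumption)

n  <- length(x)
se <- sd(x) / sqrt(n)             # standard error of the mean

# mean +/- t quantile * standard error, with n - 1 degrees of freedom
ci <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se
ci

t.test(x)$conf.int                # should agree with the hand calculation
```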
C) It turns out that the first 100 and the second 70 observations in the coverage data come from different
experiments. I want to test the hypothesis that the mean coverage differs between the two experiments.
Using the two-sample t-test, obtain 95% confidence intervals for the difference in coverage between the two
experiments. What is the probability of observing as big a difference under the null? What assumptions
have you had to make in order to obtain these inferences?
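A sketch of the two-sample comparison is given below. The split into the first 100 and last 70 observations follows the question; the data themselves are simulated stand-ins (an assumption).

```r
set.seed(1)
cov_depth <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)

first  <- cov_depth[1:100]        # first experiment
second <- cov_depth[101:170]      # second experiment

# t.test() uses the Welch (unequal variance) version by default;
# var.equal = TRUE gives the classical pooled-variance test instead
res <- t.test(first, second)

res$conf.int                      # 95% CI for the difference in means
res$p.value                       # two-sided p-value under the null of no difference
```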
5. Likelihood
A) What is the likelihood function? Explain the meaning of the terms i) maximum likelihood estimate, ii)
likelihood ratio test and iii) support interval.
B) For the data on shotgun coverage calculate the (log) likelihood function for the parameter of the
exponential distribution over a grid of (say 100) points between the minimum and maximum of the observed
values. Find the value of the parameter that maximises the likelihood. Find the 2-unit support interval and
compare this to a 95% confidence interval calculated using the approach in 4B.
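The grid calculation might be sketched as follows. It assumes the exponential is parameterised by its mean, which is why a grid over the range of the data makes sense; `cov_depth` is simulated stand-in data (an assumption).

```r
set.seed(1)
cov_depth <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)

# Grid of 100 candidate values for the mean of the exponential
mu_grid <- seq(min(cov_depth), max(cov_depth), length.out = 100)

# Log-likelihood at each grid point (rate = 1 / mean)
loglik <- sapply(mu_grid, function(mu) {
  sum(dexp(cov_depth, rate = 1 / mu, log = TRUE))
})

mu_hat  <- mu_grid[which.max(loglik)]               # grid approximation to the MLE
support <- range(mu_grid[loglik >= max(loglik) - 2]) # 2-unit support interval

plot(mu_grid, loglik, type = "l", xlab = "Mean", ylab = "Log-likelihood")
abline(h = max(loglik) - 2, lty = 2)                # cut-off for the support interval
c(mle = mu_hat, lower = support[1], upper = support[2])
```

A finer grid around the maximum gives sharper interval end points.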
C) For the same data set, try to fit a gamma distribution using maximum likelihood. You cannot do this
analytically, so one option is to construct a grid for the two parameters and find an approximation to the
MLEs. Alternatively, you can use a numerical optimisation algorithm, such as the inbuilt function optim().
For example, if you have put the coverage data into a table called coverage with the second column
containing the values, the following code will find MLEs for the parameters of the distribution:
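# Negative log-likelihood of a gamma distribution:
# par[1] is the shape and par[2] is the rate
fn <- function(par, data) {
  n   <- length(data)
  xb  <- mean(data)         # sample mean
  lxb <- mean(log(data))    # mean of the logged values
  -n * (par[1] * log(par[2]) - lgamma(par[1]) - par[2] * xb + (par[1] - 1) * lxb)
}
data <- coverage[, 2]
# Method-of-moments estimates as starting values
a1 <- mean(data)^2 / var(data)
b1 <- mean(data) / var(data)
# optim() minimises, so it is given the negative log-likelihood
optim(c(a1, b1), fn, gr = NULL, data = data)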
Calculate the maximum log-likelihood under the gamma fit and also under the exponential fit. Is it
appropriate to use the standard theory of likelihood ratio tests to ask whether the gamma is a better fit than
the exponential? If so, what is the probability of observing an increase in log-likelihood as great as that you
observed if the null were true?
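A sketch of the comparison is given below. It assumes the exponential is treated as a gamma with shape fixed at 1, so the two models differ by one free parameter. The vector `x` is simulated stand-in data, and optimising over log-parameters is a convenience chosen here to keep both parameters positive.

```r
set.seed(1)
x <- rgamma(170, shape = 3, rate = 0.1)   # stand-in coverage (assumption)

# Negative log-likelihood of the gamma, with par = log(shape), log(rate)
nll <- function(par, data) {
  -sum(dgamma(data, shape = exp(par[1]), rate = exp(par[2]), log = TRUE))
}

# Start from the method-of-moments estimates
start <- log(c(mean(x)^2 / var(x), mean(x) / var(x)))
fit   <- optim(start, nll, data = x)

ll_gamma <- -fit$value                                  # maximised gamma log-likelihood
ll_exp   <- sum(dexp(x, rate = 1 / mean(x), log = TRUE)) # exponential MLE has rate 1/mean

lr    <- 2 * (ll_gamma - ll_exp)                  # likelihood ratio statistic
p_val <- pchisq(lr, df = 1, lower.tail = FALSE)   # one extra parameter
c(ll_gamma = ll_gamma, ll_exp = ll_exp, p = p_val)
```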
6. Linear modelling
A) Suppose I have a response variable Y and two explanatory variables, X and Z. Which of the following is a
valid linear model? (Terms with β are the parameters to be estimated.)
$$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i$$
$$y_i = \beta_1 \exp(-x_i) + \beta_2 / z_i$$
$$y_i = \beta_0 + \beta_1 x_i^2 + \beta_2 x_i z_i$$
$$y_i = \beta_0 + \beta_1 x_i z_i$$
$$y_i = \beta_0 + \beta_1 x_i^{\beta_2}$$
B) We would like to find out about which genomic features influence coverage in the shotgun experiment.
You have already got data on GC content, but another important feature might be repeat content (the
presence of short or long elements that are found in many copies throughout the genome, such as
transposons). In the file “repeat_content.txt” you will find the fraction of each 1kb region that is
repeat DNA.

Carry out a linear model analysis using first GC, second repeat content and third both. Summarise
what you have learnt.

Check whether the assumption of normality in the error is justified by examining the distribution of
the residuals and seeing if there are any systematic biases.

Try using a square-root transformation of the original data. Does this improve the residuals? Does it
change your inferences?

How much of the variation in the original signal have you explained? What is the difference
between Multiple R-squared and Adjusted R-squared?
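The sequence of model fits can be sketched as below. The data frame `d` and its column names are hypothetical, and the values are simulated with an arbitrary relationship so that the code runs on its own; with the real data you would combine the second columns of the three files instead.

```r
set.seed(1)
n <- 170
# Stand-in data (assumption): hypothetical column names and made-up effects
d <- data.frame(gc = rbeta(n, 20, 30), rep_frac = rbeta(n, 2, 5))
d$coverage <- 60 - 30 * d$gc + 10 * d$rep_frac + rnorm(n, sd = 5)

fit_gc   <- lm(coverage ~ gc, data = d)             # GC content only
fit_rep  <- lm(coverage ~ rep_frac, data = d)       # repeat content only
fit_both <- lm(coverage ~ gc + rep_frac, data = d)  # both together
summary(fit_both)          # coefficients, Multiple and Adjusted R-squared

# Residual checks: normality and systematic trends against the fitted values
qqnorm(resid(fit_both)); qqline(resid(fit_both))
plot(fitted(fit_both), resid(fit_both),
     xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Same model on the square-root scale
fit_sqrt <- lm(sqrt(coverage) ~ gc + rep_frac, data = d)
qqnorm(resid(fit_sqrt)); qqline(resid(fit_sqrt))
```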
C) Using the function acf(), plot the autocorrelation of the residuals. What do you notice? Is this a
problem for the linear model analysis? What does it suggest?
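A short sketch of the autocorrelation check follows. The stand-in data are simulated with deliberately autocorrelated noise (an assumption made purely so the plot shows something), and the variable names are hypothetical.

```r
set.seed(1)
n <- 170
gc_frac <- rbeta(n, 20, 30)                            # stand-in GC content (assumption)
err <- as.numeric(arima.sim(list(ar = 0.6), n = n))    # autocorrelated noise (assumption)
coverage <- 60 - 30 * gc_frac + 3 * err

fit <- lm(coverage ~ gc_frac)

# Autocorrelation of the residuals in window order along the chromosome
r <- acf(resid(fit), main = "ACF of residuals")
lag1 <- r$acf[2]             # first element is lag 0, second is lag 1
lag1
```

Bars well outside the dashed bands at small lags indicate that neighbouring windows have correlated errors.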
7. Bayesian inference
A) What, if anything, is the Bayesian equivalent of a P value?
B) Consider flipping a drawing-pin. What is the probability that it will land point up? Characterise your
prior belief about this probability by sketching a distribution and then find values of the parameters of a
beta distribution that make a suitable fit. Now flip a drawing-pin 10 times and record the number of
each event. The posterior distribution for the probability of landing point up is now given by a beta
distribution with parameters a+nU, b+n-nU, where a and b are the parameters from your prior, n is the
number of trials (here 10) and nU is the number of times the pin landed point up. Use R to draw the
prior and posterior. Carry on flipping the pin and look again at the posterior after 20, 30, 40 and 50 flips.
Now combine information across the room. What is your point estimate for the probability? What is
the 95% credible interval (ETPI)?
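The prior-to-posterior update can be drawn as in the sketch below. The prior parameters and the counts are hypothetical values chosen for illustration; substitute your own.

```r
a <- 2; b <- 2               # hypothetical prior parameters
n <- 10; nU <- 7             # hypothetical data: 7 point-up landings in 10 flips

a_post <- a + nU             # posterior is Beta(a + nU, b + n - nU)
b_post <- b + n - nU

# Posterior (solid) and prior (dashed) densities
curve(dbeta(x, a_post, b_post), from = 0, to = 1,
      xlab = "Probability of landing point up", ylab = "Density")
curve(dbeta(x, a, b), add = TRUE, lty = 2)

post_mean <- a_post / (a_post + b_post)            # one possible point estimate
ci <- qbeta(c(0.025, 0.975), a_post, b_post)       # equal-tailed 95% interval
c(mean = post_mean, lower = ci[1], upper = ci[2])
```

Re-running with larger n and nU shows the posterior narrowing as flips accumulate.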
C) A Bayes factor can be used to represent the evidence for different models. Given the data you have
just collected, measure the evidence that the probability of landing point up is not equal to 0.5 after
different numbers of throws. What Bayes factor do you think is necessary to convince you that one
model is better than another?
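One way to compute such a Bayes factor is sketched below. It compares a model with the probability fixed at 0.5 against a model with a beta prior on the probability; the function name `bf10`, the uniform default prior and the example counts are all assumptions made for illustration.

```r
# Bayes factor for "p has a Beta(a, b) prior" against "p = 0.5",
# given nU point-up landings in n flips
bf10 <- function(nU, n, a = 1, b = 1) {
  # Marginal likelihood under the beta prior (beta-binomial), on the log scale
  log_m1 <- lchoose(n, nU) + lbeta(a + nU, b + n - nU) - lbeta(a, b)
  # Likelihood under the point null p = 0.5
  log_m0 <- dbinom(nU, n, 0.5, log = TRUE)
  exp(log_m1 - log_m0)
}

bf10(7, 10)     # hypothetical counts after 10 flips
bf10(35, 50)    # same proportion after 50 flips
```

Values above 1 favour the model with unknown probability and values below 1 favour p = 0.5. The result depends on the prior chosen for the alternative.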