DTC module in Bioinformatics
Practical on recombination I
Gil McVean and Simon Myers, Tuesday June 14th 2005
In this practical you are going to learn about the structure of gene histories that include recombination
and devise a simple, non-parametric method for estimating how much historical recombination has
influenced genetic variation in a real data set.
Simulating recombination histories
The Hudson animator at http://www.coalescent.dk allows you to visualise coalescent simulations with
recombination and an arbitrary number of chromosomes. We will use this to get an intuitive idea of
how much recombination influences gene histories.
Find the web site and follow the links to the Hudson Animator with recombination. You can set both
the number of sequences in the sample, n, and the population-scaled recombination rate, ρ (rho) = 4Ner,
where Ne is the effective population size and r is the probability of a recombination event occurring
across the whole region in one meiosis. When you press recalc you get an animation that shows you
the ancestral recombination graph (ARG) with samples (red), coalescent events (green) and
recombination events (blue).
If you hover the pointer over the coalescent and recombination events, the little panel at the bottom
right shows you details of the ancestral material involved in each lineage and the time at which the
event occurred (scaled in units of 2Ne generations). Simulate a few histories with n=rho=2 so you get
familiar with the setup. You can also click the Trees tab, which gives you a different view of the
ARG. The pointer will start in the middle of the ‘left-most’ tree, with blue triangles indicating the
location of each recombination event along the sequence. If you click anywhere along the segment,
you will be shown the tree at each point along the sequence. Simulate a few histories with n=rho=4 to
familiarise yourself with this alternative view of the ARG.
Now use the simulator to find approximate answers to the following questions. See if you can get exact
answers using coalescent theory (not all questions have simple analytical expressions for their solution,
but you can probably come up with a good approximation!).
a) How many coalescent events are there in an ARG if there are no recombination events?
b) How many coalescent events are there in an ARG if there are r recombination events (here r is a count, not the per-meiosis probability defined above)?
c) For a recombination rate parameter (rho) of 1 and two sequences (n=2), what is the probability that
there are no recombination events?
d) For n=4 and rho=1, what is the probability of there being a single recombination event in the ARG?
If you look at just those simulations where you get a single recombination event, what proportion of the
ARGs are such that you could place mutations (as many as you want) on the graph to make the
recombination event detectable by the four-gamete test (see below if you can’t remember what this is)?
e) Fix rho at 1 and estimate the average number of recombination events in ARGs for n=2, n=4, n=10
and n=20. How do you think the expected number of recombination events scales with the number of
sequences?
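If you would like to check your counts from the animator against a larger number of replicates, the following is a minimal sketch in Python. It makes a simplifying assumption that is not taken from the animator: every lineage recombines at rate rho/2 and every pair of lineages coalesces at rate 1 (time in units of 2Ne generations), and only the number of lineages is tracked, not the ancestral material. Recombination events in non-ancestral material are therefore counted too, so the numbers may differ from those of the Hudson Animator. The function names simulate_event_counts and mean_recombinations are our own.

```python
import random

def simulate_event_counts(n, rho, rng):
    """Follow the number of lineages back to a single ancestor.

    Returns (number of coalescent events, number of recombination events).
    Assumption: all lineages recombine at rate rho/2, whatever material they carry.
    """
    k = n
    n_coal = 0
    n_rec = 0
    while k > 1:
        coal_rate = k * (k - 1) / 2.0   # any pair of lineages may coalesce
        rec_rate = k * rho / 2.0        # each lineage may split in two
        if rng.random() < coal_rate / (coal_rate + rec_rate):
            k -= 1
            n_coal += 1
        else:
            k += 1
            n_rec += 1
    return n_coal, n_rec

def mean_recombinations(n, rho, reps=2000, seed=1):
    """Average number of recombination events over reps simulated histories."""
    rng = random.Random(seed)
    total = sum(simulate_event_counts(n, rho, rng)[1] for _ in range(reps))
    return total / reps

if __name__ == "__main__":
    for n in (2, 4, 10, 20):
        print(n, mean_recombinations(n, rho=1.0))
```

Plotting the averages against n (or against log n) is one way to explore the scaling asked about in (e).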
Detecting recombination events in empirical data
If the mutation rate is very low, such that repeat or back mutation is very unlikely, every time all four
possible combinations of two alleles at two loci are observed (the four-gamete test) we can be sure that
a recombination event must have occurred between them. In the example below, the data on the left
(A) show a detectable recombination and the data on the right (B) do not.
[Figure not reproduced: example data sets (A) and (B).]
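The four-gamete test for a single pair of SNPs can be sketched in a few lines of Python. This is only an illustration: the function name four_gamete_incompatible and the example columns are made up here (they are not the data from the figure), and alleles are assumed to be coded as 0 and 1 with no missing data.

```python
def four_gamete_incompatible(site_a, site_b):
    """Return True if all four two-locus haplotypes (00, 01, 10, 11) are observed.

    site_a and site_b hold the alleles (0 or 1) at two SNPs, listed in the
    same order of chromosomes.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must be typed in the same chromosomes")
    observed = set(zip(site_a, site_b))  # distinct two-locus haplotypes
    return len(observed) == 4

# Hypothetical example data: four chromosomes, two SNPs.
print(four_gamete_incompatible([0, 0, 1, 1], [0, 1, 0, 1]))  # True
print(four_gamete_incompatible([0, 0, 1, 1], [0, 0, 1, 1]))  # False
```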
You are going to develop (and implement) an algorithm for calculating a lower bound on the minimum
number of recombination events that must have occurred in the history of our sample of sequences; we will call this Rmin (Hudson and Kaplan 1985).
In the above examples, it is obvious that (A) requires at least one recombination and (B) zero. In the
examples below, how many do you need?
[Figure not reproduced: example data sets (A), (B) and (C).]
For larger data sets we need to develop a systematic way of calculating Rmin. A good starting point is
to identify all pairs of SNPs for which a recombination event is detectable between them using the
four-gamete test. For example, below, the lines indicate the pairs of ‘incompatible sites’ (those between
which a recombination event is detectable).
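The pairwise scan described above might be sketched as follows. The function name incompatible_pairs and the small data set are hypothetical, sites are numbered from 0, and each chromosome is assumed to be a sequence of 0/1 alleles of equal length.

```python
from itertools import combinations

def incompatible_pairs(haplotypes):
    """List all pairs of sites (i, j), i < j, that fail the four-gamete test.

    haplotypes: list of equal-length sequences of 0/1 alleles, one per chromosome.
    """
    n_sites = len(haplotypes[0])
    pairs = []
    for i, j in combinations(range(n_sites), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}  # observed two-locus haplotypes
        if len(gametes) == 4:
            pairs.append((i, j))
    return pairs

# Hypothetical data: four chromosomes, three SNPs.
haps = [[0, 0, 0],
        [0, 0, 1],
        [0, 1, 0],
        [1, 1, 1]]
print(incompatible_pairs(haps))  # [(1, 2)]
```

Checking every pair costs on the order of (number of SNPs)² × (number of chromosomes) operations, which is acceptable for data sets of a few hundred SNPs.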
To find Rmin, we need to put the fewest possible recombination events down such that all pairwise
constraints are satisfied (a linear programming problem). What is the answer for the above example?
Can you see how to generalise the algorithm to an arbitrary number of SNPs?
If not, consider the following two examples.
[Figure not reproduced: example data sets (A) and (B).]
Note that (B) is identical to (A), except that it has an extra SNP at the right. In (A), we only need a
single recombination event, but when we add the extra SNP to make (B), we generate an
incompatibility between the 2nd and the 4th SNPs, which could not be the result of the same
recombination event that generated the incompatibility between the 1st and 2nd SNPs (note that the
incompatibility between the 1st and the 4th SNPs could already be explained). More generally, by
thinking about adding in one SNP at a time, and looking to see if it generates incompatibilities that are
not currently explained, we can work progressively along the sequence.
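One way to turn this left-to-right reasoning into code is to treat each incompatible pair as an interval and count how many intervals can be chosen that share no gap between adjacent SNPs. The sketch below is an illustration under that reading rather than a prescribed solution; the function name rmin_from_pairs is our own, and the example uses only the incompatibilities mentioned in the text for (B), with SNPs numbered from 1.

```python
def rmin_from_pairs(pairs):
    """Count the recombination events needed to explain all incompatible pairs.

    pairs: iterable of (i, j) site indices with i < j that fail the
    four-gamete test. Two pairs can share an event only if their intervals
    overlap in at least one gap between adjacent sites.
    """
    count = 0
    last_right = None
    # Scan pairs in order of their right-hand site.
    for left, right in sorted(pairs, key=lambda p: p[1]):
        # A new event is needed only if this pair starts at or after the
        # right end of the last pair that was given its own event.
        if last_right is None or left >= last_right:
            count += 1
            last_right = right
    return count

# Incompatibilities named in the text for example (B): 1-2, 1-4 and 2-4.
print(rmin_from_pairs([(1, 2), (1, 4), (2, 4)]))  # 2
```

Pairs (1, 2) and (2, 4) share a SNP but no gap between SNPs, so they cannot be explained by one event, whereas (1, 4) overlaps both and needs no event of its own.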
Explain why the formula
$R_{\min}(i) = \max_{j < i} \left[ R_{\min}(j) + I_{ij} \right]$
provides a solution for Rmin (i and j are indices for the SNPs and Iij is an indicator function that takes
the value 0 if there is no detectable recombination between SNPs i and j and 1 if there is).
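A recurrence of this kind can be evaluated by working along the SNPs from left to right. The sketch below assumes that the running value for the first SNP is 0 (a starting condition not stated in the text), numbers sites from 0, and takes the indicator as a nested list; the name rmin_dp and the example matrix are hypothetical.

```python
def rmin_dp(incompat):
    """Evaluate the running bound for every site.

    incompat[j][i] is 1 if sites j < i fail the four-gamete test, else 0.
    Returns a list whose entry i is the bound for sites 0..i; the last
    entry is the value for the whole sequence.
    """
    n_sites = len(incompat)
    rmin = [0] * n_sites  # assumption: no events are needed for the first site alone
    for i in range(1, n_sites):
        rmin[i] = max(rmin[j] + incompat[j][i] for j in range(i))
    return rmin

# Hypothetical indicator matrix for four SNPs with incompatible
# pairs (0, 1), (0, 3) and (1, 3).
I = [[0, 1, 0, 1],
     [0, 0, 0, 1],
     [0, 0, 0, 0],
     [0, 0, 0, 0]]
print(rmin_dp(I))  # [0, 1, 1, 2]
```

Returning the whole list rather than just the final value shows where along the sequence the bound increases, which is useful when asking whether events cluster.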
Implement an algorithm for calculating Rmin from empirical data (alleles coded as 0s and 1s) and
apply it to the human data set ceph7q31.txt, which you can find on the website (from the HapMap project):
www.stats.ox.ac.uk/~mcvean/DTC/BIOINF/Practicals/ceph7q31.txt
The first line tells you how many sequences and SNPs there are, the second line gives the positions of
the SNPs and the subsequent lines are the data for each chromosome. Do recombination events occur
uniformly along the sequence or do they cluster into particular regions?
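Reading a file with this layout might look like the sketch below. Several details are assumptions, since the text does not specify them: that the first line lists the number of sequences and then the number of SNPs, that positions are whitespace-separated numbers, and that each chromosome line consists only of 0/1 characters, with or without spaces between them. Check these against the actual file before relying on the parser. The function name parse_haplotype_text and the example text are made up for illustration.

```python
def parse_haplotype_text(text):
    """Parse the described layout into (positions, haplotypes).

    Assumed layout: line 1 = "n_sequences n_snps", line 2 = SNP positions,
    then one line of 0/1 alleles per chromosome.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    n_seq, n_snp = (int(x) for x in lines[0].split()[:2])
    positions = [float(x) for x in lines[1].split()]
    haplotypes = []
    for ln in lines[2:2 + n_seq]:
        # Keep only allele characters so spaced and unspaced lines both work.
        haplotypes.append([int(ch) for ch in ln if ch in "01"])
    if (len(positions) != n_snp or len(haplotypes) != n_seq
            or any(len(h) != n_snp for h in haplotypes)):
        raise ValueError("file contents do not match the declared dimensions")
    return positions, haplotypes

# Made-up example in the assumed layout (not the real data).
example = "4 3\n100 250 400\n000\n001\n010\n111\n"
positions, haplotypes = parse_haplotype_text(example)
print(positions)       # [100.0, 250.0, 400.0]
print(haplotypes[3])   # [1, 1, 1]

# For the real data, after downloading the file:
# with open("ceph7q31.txt") as fh:
#     positions, haplotypes = parse_haplotype_text(fh.read())
```

Keeping the positions alongside the haplotypes lets you report where along the sequence your Rmin calculation places its events.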