Homework 2 (due 2/28/05)

1. Some short proofs of the recombination process.
a. Prove Mather’s formula: p(R(J)) = 0.5*p(X(J)>0).
$$
\begin{aligned}
p(R(J)) &= \sum_{n>0} p(R(J), X(J) = n) = \sum_{n>0} p(R(J) \mid X(J) = n)\, p(X(J) = n) \\
        &= \sum_{n>0} \frac{1}{2}\, p(X(J) = n) = \frac{1}{2}\, p(X(J) > 0)
\end{aligned}
$$
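As a numerical sanity check of the identity above, the following sketch sums the series term by term and compares it with the closed form. It assumes, purely for illustration, that the crossover count X(J) is Poisson with mean `lam`; the proof itself does not need this. The function names are hypothetical.

```python
import math

def mather_sum(lam, n_max=100):
    """Sum over n > 0 of p(R | X = n) * p(X = n), with p(R | X = n) = 1/2.

    Assumption for illustration: X ~ Poisson(lam).
    """
    total = 0.0
    pmf = math.exp(-lam)          # P(X = 0)
    for n in range(1, n_max + 1):
        pmf *= lam / n            # P(X = n) computed from P(X = n - 1)
        total += 0.5 * pmf
    return total

def mather_closed_form(lam):
    """0.5 * P(X > 0) for X ~ Poisson(lam)."""
    return 0.5 * (1.0 - math.exp(-lam))

for d in (0.05, 0.5, 2.0):
    lam = 2.0 * d                 # with lam = 2d the closed form is (1/2)(1 - exp(-2d))
    print(d, mather_sum(lam), mather_closed_form(lam))
```

With `lam = 2d` the closed form coincides with the expression (1/2)(1-exp(-2d)) quoted in part d.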
b. Explain what a first-division segregation (FDS) and a second-division
segregation (SDS) are. Why are they interesting?
FDS refers to the separation of the two parental chromosomes and SDS refers to
the separation of two sister chromatids of a single chromosome. FDS reflects
random segregation of parental chromosomes (Mendel’s first law) whereas SDS
reflects recombination events between the parental chromosomes.
c. Prove the inductive formula for the second-division segregation:
S_{k+1} = 4F_k + 2S_k. (Note that in class we proved a similar formula for the
FDS; mimic the technique we used there!)
As in the analysis of F_{k+1} in class, we argue by induction. Given k exchanges,
for each strand choice that results in FDS (i.e., no net recombination after k
exchanges w.r.t. the leftmost end), there are 4 possible strand choices for the
next exchange (i.e., one of the two sister chromatids of the mother chromosome
with either of the two sister chromatids of the father chromosome, and vice
versa), each of which makes the ends of the 4 strands an SDS pattern. Similarly,
for each strand choice that results in SDS, one of the sister chromatids of the
mother chromosome (i.e., the red one on page 22 of lecture 7) can exchange with
the sister chromatid of the same color of the father chromosome, and there are
two such choices altogether. Thus S_{k+1} = 4F_k + 2S_k.
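The recursion can be iterated numerically, as in the sketch below. Two ingredients are assumptions, not statements from this solution: the starting point F_0 = 1, S_0 = 0 (no exchange gives FDS), and the companion recursion F_{k+1} = 2S_k, which is what remains if every exchange has 4 strand choices in total. The in-class FDS formula is not reproduced here, so treat that line as a placeholder for it.

```python
def segregation_counts(k_max):
    """Return lists F, S where F[k], S[k] count the strand-choice sequences
    that give FDS / SDS after k exchanges."""
    F, S = [1], [0]               # assumed start: zero exchanges -> FDS
    for _ in range(k_max):
        f, s = F[-1], S[-1]
        F.append(2 * s)           # assumed companion formula: F_{k+1} = 2 S_k
        S.append(4 * f + 2 * s)   # formula from part c: S_{k+1} = 4 F_k + 2 S_k
    return F, S

F, S = segregation_counts(10)
for k in range(11):
    total = F[k] + S[k]           # equals 4**k under the assumptions above
    print(k, F[k], S[k], S[k] / total)
```

Under these assumptions the SDS fraction S_k / 4^k oscillates towards 2/3, which is the large-d limit of the SDS mapping function s(d)=(2/3)(1-exp(-3d)) quoted in part d.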
d. Discuss the relationship and difference between Haldane’s mapping
function p(R(d))=(1/2)(1-exp(-2d)) and the mapping function for the SDS
frequency s(d)=(2/3)(1-exp(-3d)).
They are defined on different stochastic events. Haldane’s mapping function
gives the probability of a CROSSOVER between two chromosomes (as the result
of an odd number of crossovers), which can happen in both FDS and SDS,
whereas SDS is a 4-strand event resulting in DETECTABLE recombination from
the octad of four spore pairs. Both are functions of map distance based on a
Poisson point process.
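To make the comparison concrete, the sketch below simply evaluates the two formulas from the question at a few map distances (the function names are hypothetical). For small d the values are approximately d and 2d respectively, and for large d they saturate at 1/2 and 2/3.

```python
import math

def haldane(d):
    """Haldane's mapping function p(R(d)) = (1/2)(1 - exp(-2d))."""
    return 0.5 * (1.0 - math.exp(-2.0 * d))

def sds_frequency(d):
    """SDS mapping function s(d) = (2/3)(1 - exp(-3d))."""
    return (2.0 / 3.0) * (1.0 - math.exp(-3.0 * d))

for d in (0.01, 0.1, 0.5, 1.0, 5.0):
    print(f"d={d:5.2f}  haldane={haldane(d):.4f}  sds={sds_frequency(d):.4f}")
```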
2. Genetic Linkage Analysis
a. What does “Hardy-Weinberg equilibrium” mean in terms of inheritance
and mating outcome?
The two parental chromosomes are independently inherited, and the joint
probability factors into a product of individual probabilities.
b. Consider the following pedigree on slide 12, lecture 8:
[Pedigree figure: parents 1 (Dd) and 2; children 3 (dd), 4 (Dd), and 5 (DD).]
What is the conditional probability of the genotype of the individual 2, i.e.,
p(G2=Dd|G1=Dd,G3=dd,G4=Dd,G5=DD)? Use the allele frequencies given in
the previous slides.
$$
p(G_2 = Dd \mid G_1 = Dd, G_3 = dd, G_4 = Dd, G_5 = DD)
= \frac{p(G_2 = Dd, G_1 = Dd, G_3 = dd, G_4 = Dd, G_5 = DD)}
       {\sum_{G_2} p(G_2, G_1 = Dd, G_3 = dd, G_4 = Dd, G_5 = DD)}
$$
The denominator sums over the three possible genotypes of individual 2 (Dd, DD, dd, in that order):
$$
\sum_{G_2} p(G_2, G_1 = Dd, G_3 = dd, G_4 = Dd, G_5 = DD)
$$
= (2 × .01 × .99 × .7) × (2 × .01 × .99 × .3) × (1/2 × 1/2 × .9) × (2 × 1/2 × 1/2 × .7) × (1/2 × 1/2 × .8)
+ (2 × .01 × .99 × .7) × (.01 × .01 × .8) × (1/2 × 0 × .9) × (1/2 × 1 × .7) × (1/2 × 1 × .8)
+ (2 × .01 × .99 × .7) × (.99 × .99 × .1) × (1/2 × 1 × .9) × (1/2 × 1 × .7) × (1/2 × 0 × .8)
The last two terms vanish (a DD parent cannot have a dd child, and a dd parent cannot have a DD child), so the denominator equals the numerator and
$$
p(G_2 = Dd \mid G_1 = Dd, G_3 = dd, G_4 = Dd, G_5 = DD) = 1
$$
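The same posterior can be computed by enumerating the genotype of individual 2, as in the minimal sketch below. It uses the standard Hardy-Weinberg priors (p^2, 2pq, q^2) with the D allele frequency .01 that appears in the terms above, and Mendelian transmission from parents 1 and 2 to children 3, 4, and 5. The weights .3, .8, .1 are copied from the G2-dependent factors of the three terms; what they stand for (presumably values from the slides) is an assumption here. Factors that do not depend on G2 cancel in the ratio and are omitted. All function names are hypothetical.

```python
from itertools import product

P_D = 0.01  # frequency of allele D, as used in the terms above

def hwe_prior(genotype):
    """Hardy-Weinberg prior of a founder genotype written 'DD', 'Dd', or 'dd'."""
    p, q = P_D, 1.0 - P_D
    return {"DD": p * p, "Dd": 2 * p * q, "dd": q * q}[genotype]

def transmission(child, parent_a, parent_b):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    hits = sum(1 for a, b in product(parent_a, parent_b)
               if sorted(a + b) == sorted(child))
    return hits / 4.0

# G2-dependent weights copied from the three terms; their meaning is assumed.
G2_WEIGHT = {"Dd": 0.3, "DD": 0.8, "dd": 0.1}

def posterior_g2(g1="Dd", children=("dd", "Dd", "DD")):
    """Posterior over G2 given G1 and the children's genotypes."""
    joint = {}
    for g2 in ("DD", "Dd", "dd"):
        term = hwe_prior(g2) * G2_WEIGHT[g2]
        for child in children:
            term *= transmission(child, g1, g2)
        joint[g2] = term
    total = sum(joint.values())
    return {g2: term / total for g2, term in joint.items()}

print(posterior_g2())
```

Because a dd child rules out G2 = DD and a DD child rules out G2 = dd, the result does not depend on the particular weights as long as they are positive.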
c. Why are LODs of linkage additive across independent pedigrees?
Because the logarithm operation turns the joint probability from a product of
terms into a summation of the logarithms of the terms.
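A small numerical illustration of this additivity follows. It assumes, for illustration only, phase-known meioses with likelihood theta^r (1-theta)^(n-r) for r recombinants out of n meioses; the per-pedigree counts are made up.

```python
import math

def lod(recombinants, meioses, theta):
    """LOD score log10(L(theta) / L(0.5)) under the assumed phase-known model."""
    r, n = recombinants, meioses
    log_l_theta = r * math.log10(theta) + (n - r) * math.log10(1.0 - theta)
    log_l_null = n * math.log10(0.5)
    return log_l_theta - log_l_null

pedigrees = [(1, 10), (0, 6), (2, 12)]   # made-up (recombinants, meioses) pairs
theta = 0.1

# Sum of the per-pedigree LOD scores.
sum_of_lods = sum(lod(r, n, theta) for r, n in pedigrees)

# LOD of the joint data: independence makes the joint likelihood a product.
ratio = 1.0
for r, n in pedigrees:
    ratio *= (theta ** r * (1.0 - theta) ** (n - r)) / (0.5 ** n)
lod_of_product = math.log10(ratio)

print(sum_of_lods, lod_of_product)
```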
d. Explain why, for n QTLs, there are 2^n distinct genotypes for BC and 3^n
distinct genotypes for IC.
Because for BC there are two possible genotypes for each locus (e.g., (dd, dD) or
(DD,dD)), but for IC there are three possible genotypes (i.e., (dd,DD,dD)).
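The counts can be confirmed by enumerating the multi-locus genotypes directly. In this sketch the per-locus genotype sets are the ones listed in the answer (using the (dd, dD) variant for BC), and `qtl_genotypes` is a hypothetical helper name.

```python
from itertools import product

def qtl_genotypes(n, design):
    """All multi-locus genotypes for n QTLs under the 'BC' or 'IC' design."""
    per_locus = {"BC": ("dd", "dD"), "IC": ("dd", "dD", "DD")}[design]
    return list(product(per_locus, repeat=n))

for n in (1, 2, 3):
    print(n, len(qtl_genotypes(n, "BC")), len(qtl_genotypes(n, "IC")))
```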
3. A common strategy to infer the haplotypes of multiple SNPs is to use a
parsimony criterion to evaluate the results and devise an algorithm that greedily
optimizes this criterion over possible solutions. Now suppose we have the genotype
of 3 SNPs of individual 1: G1={1/0, 1/0,1/0}, and we also know that the genotype
of an individual 2 is: G2={1/1, 0/0,1/1}. Can you guess what a “parsimonious”
solution for individual 1’s haplotypes is? What if G2={0/0, 0/0,0/0}?
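In the first case, individual 1 will have haplotypes {101/010}; in the second case,
{111/000}. In each case one of the two haplotypes is the one already carried by the
homozygous individual 2, which keeps the number of distinct haplotypes minimal.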
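The reasoning behind both answers is a single greedy step: reuse a haplotype that is already known (here, the one carried by the homozygous individual 2) and read off the complementary haplotype. A minimal sketch, with a hypothetical helper name and genotypes written as allele pairs:

```python
def complement_haplotype(genotype, known):
    """Given a genotype and one haplotype assumed to be carried, return the
    other haplotype, or None if `known` is incompatible with the genotype.

    Genotype entries are allele pairs such as (1, 0); haplotypes are tuples.
    """
    other = []
    for (a, b), k in zip(genotype, known):
        if k == a:
            other.append(b)
        elif k == b:
            other.append(a)
        else:
            return None           # known allele does not occur at this SNP
    return tuple(other)

g1 = [(1, 0), (1, 0), (1, 0)]
print(complement_haplotype(g1, (1, 0, 1)))   # individual 2 homozygous for 101
print(complement_haplotype(g1, (0, 0, 0)))   # individual 2 homozygous for 000
```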
a. Why are haplotypes advantageous over single SNPs for linkage analysis?
A haplotype is more discriminative than a single SNP, with 2^n possible
configurations (for n SNPs) rather than merely two.
b. What is the computational and biological meaning for finding “blocks” in
long sequences of SNPs?
Within a block there is high LD between SNPs, meaning that recombination
within a block is infrequent, and the number of haplotype configurations of a
block should also be lower than expected (i.e., << 2^n, where n is the number of
SNPs in the block). Biologically, this suggests treating a block as a
recombination unit; computationally, it can reduce the inference cost.
c. Implement the EM algorithm for Haplotype inference and test it on the
data to be posted.
To open the zipped file, use command “tar zxvf hap.data.tgz”. In the data directory, you
will see a .genos and a .haplos file. The format is as follows:
Genotype:
1603 0 2 -1 1 1 0 0 1 1 1 1 2 1 0
2103 0 -1 2 1 -1 0 0 1 -1 1 1 0 1 -1
…
The first column is the sample id. Starting from the second column, you see the genotype
of 14 SNPs in each sample. “0” denotes the genotype (0,0), “1” denotes the genotype (1,1),
“2” denotes the genotype (0,1), i.e., a heterozygous genotype, and finally, “-1” denotes an
artifact in the measurement of that SNP, and can be ignored.
Haplotype:
1603 0 0 1 -1 1 1 0 0 1 1 1 1 1 1 0
1603 1 0 0 -1 1 1 0 0 1 1 1 1 0 1 0
2103 0 0 0 1 1 1 0 0 1 -1 1 1 0 1 -1
2103 1 0 -1 0 1 0 0 0 1 1 1 1 0 1 0
…
The first column is the sample id; the second column is the index of the paternal or
maternal allele; starting from the third column, you see the haplotype of 14 SNPs in each
of the two alleles of every sample. Note that “-1” denotes an artifact in the measurement
of that SNP, and can be ignored.
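The two formats can be read with a few lines of code. The sketch below uses hypothetical helper names and checks, on the sample lines for id 1603 only, that the two haplotypes imply the listed genotype. Mapping a pair to -1 whenever either allele is -1 is an assumption; the sample lines for 2103 contain genotype entries marked -1 where both haplotype alleles are present, so that rule is not claimed to hold for every sample.

```python
def parse_genotype_line(line):
    """Return (sample_id, genotype codes) from one line of the .genos file."""
    fields = line.split()
    return fields[0], [int(x) for x in fields[1:]]

def parse_haplotype_line(line):
    """Return (sample_id, allele index, alleles) from one line of the .haplos file."""
    fields = line.split()
    return fields[0], int(fields[1]), [int(x) for x in fields[2:]]

def genotype_code(a, b):
    """Genotype code implied by two alleles: 0 for (0,0), 1 for (1,1), 2 for (0,1).

    Assumption: a missing allele (-1) on either side gives a missing genotype.
    """
    if a == -1 or b == -1:
        return -1
    return a if a == b else 2

sid, geno = parse_genotype_line("1603 0 2 -1 1 1 0 0 1 1 1 1 2 1 0")
_, _, hap0 = parse_haplotype_line("1603 0 0 1 -1 1 1 0 0 1 1 1 1 1 1 0")
_, _, hap1 = parse_haplotype_line("1603 1 0 0 -1 1 1 0 0 1 1 1 1 0 1 0")
implied = [genotype_code(a, b) for a, b in zip(hap0, hap1)]
print(sid, implied == geno)
```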
You need to infer the haplotype of the SNPs of all the samples using your program.
Report both the haplotype and the overall error ratio (i.e., the number of wrongly inferred
loci over all heterozygous loci).
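One possible shape for such a program is sketched below; it is not the reference solution. It assumes random pairing of haplotypes (Hardy-Weinberg), a uniform initial frequency estimate, a fixed number of iterations, and exhaustive enumeration of the 2^(h-1) haplotype pairs compatible with a genotype that has h heterozygous SNPs. As a simplification, a “-1” entry is carried through as a placeholder allele in both haplotypes. The names `compatible_pairs` and `em_haplotypes` are hypothetical.

```python
from collections import defaultdict
from itertools import product

def compatible_pairs(genotype):
    """Enumerate unordered haplotype pairs consistent with a genotype vector
    (codes: 0 = (0,0), 1 = (1,1), 2 = (0,1), -1 = missing, kept as placeholder)."""
    het = [i for i, g in enumerate(genotype) if g == 2]
    base = [0 if g == 2 else g for g in genotype]
    if not het:
        h = tuple(base)
        return [(h, h)]
    pairs = []
    # Fix allele 0 at the first heterozygous site of h1 so each pair appears once.
    for bits in product((0, 1), repeat=len(het) - 1):
        h1, h2 = list(base), list(base)
        for site, b in zip(het, (0,) + bits):
            h1[site], h2[site] = b, 1 - b
        pairs.append((tuple(h1), tuple(h2)))
    return pairs

def em_haplotypes(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM and return (freq, phased pairs)."""
    expansions = [compatible_pairs(g) for g in genotypes]
    haps = {h for pairs in expansions for pair in pairs for h in pair}
    freq = {h: 1.0 / len(haps) for h in haps}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in expansions:
            # E-step: posterior weight of each compatible pair under current freq.
            weights = [freq[a] * freq[b] * (1 if a == b else 2) for a, b in pairs]
            total = sum(weights)
            for (a, b), w in zip(pairs, weights):
                counts[a] += w / total
                counts[b] += w / total
        # M-step: expected haplotype counts divided by the number of chromosomes.
        n_chrom = 2.0 * len(genotypes)
        freq = {h: counts[h] / n_chrom for h in haps}
    phased = [max(pairs, key=lambda p: freq[p[0]] * freq[p[1]]) for pairs in expansions]
    return freq, phased

toy = [[1, 0, 1], [0, 1, 0], [2, 2, 2]]      # made-up genotypes, not the posted data
freq, phased = em_haplotypes(toy)
print(phased[2])
```

On the toy input the third individual is phased into the two haplotypes carried by the homozygous individuals. The error ratio can then be computed by comparing the phased pairs with the .haplos file at the heterozygous loci.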