Biotechnology
Homework 5 Answers
Fall 2009
1. Positional Gene cloning: informative meioses & Lod scores.
For this question you can assume that a polymorphic DNA marker for locus “a” can have
various values a1, a2, etc. (e.g. different lengths of a particular PCR product, as for a typical STRP),
that the family tree below is accurate and that locus “a” is on the same chromosome as the disease
gene. You can also assume that the disease is related to dysfunction of a single gene and that
mutant alleles of the relevant disease gene are rare in the general population but you should not
assume anything else. Individuals that clearly exhibit the disease are shown with filled symbols but I
am not disclosing the disease so you have to be open-minded about whether disease of differing
severity would necessarily be apparent and diagnosed.
(i) (a) In each generation roughly half of the children (male or female) of an affected mother have the
disease, so it is most likely a dominant allele that confers disease with high penetrance.
(b) Males cannot transmit an X-linked allele to their sons but we have no examples of affected
fathers to test this possibility, so the disease pedigree does not distinguish between X and autosomal
location (credit for this). However, we see that there are two alleles for “a” even in males and are
told that “a” and the disease gene are on the same chromosome, which must therefore be an
autosome (more credit).
(ii) De novo mutations are very rare, let alone for a particular disease that happens to run in the same
family, so it is almost certain that a parent of 22 and 24 carries a disease gene allele. Since such
alleles are rare in the population parent 10 is by far the more likely source than parent 11. Most
likely then parent 10 has the disease gene allele but does not show disease symptoms, illustrating
incomplete penetrance (i.e. additional factors can modify whether the mutant allele certainly brings
about disease).
(iii) The children include a3 and a2 alleles, so he must be a3 a2.
(iv) Since 4 does not have the disease he is not likely heterozygous for the disease gene allele, so no
information can be gained.
(v) Individual 6 is not heterozygous for “a” alleles so no linkage information can be derived.
(vi) Individual 1 is heterozygous for “a” alleles and the disease gene allele and it is clear for both 7
and 9 which “a” allele and which disease gene allele is inherited, so both are equally informative.
(vii) We know nothing about the parents of 1 and cannot therefore unambiguously determine what is
the cis linkage between “a” and the disease gene alleles (a2 or a3 on the same chromosome as the
mutant gene). We cannot therefore be certain of whether this linkage has changed during meiosis. If
we are trying to decide whether “a” and the disease gene are linked we cannot bias this statistical
evaluation by looking at the family downstream of individual 1 and saying it is more likely we
would see non-recombinants than recombinants. The Lod score methodology provides the
framework for allowing the results to be evaluated in a fair and useful way without pretending that
guesses are facts.
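One hedged way to make this concrete: when the phase in the parent is unknown, a common convention is to average the likelihood over the two possible phases, giving each a weight of 1/2. The Python sketch below assumes that convention; the function name and the example counts are illustrative and are not taken from the pedigree.

```python
import math

def lod_phase_unknown(theta, k, n):
    """Lod score for n informative meioses when parental phase is unknown.

    k is the number of children who would be recombinant under one of the
    two possible phases; under the other phase those same k children are
    the non-recombinants. Each phase is given weight 1/2 (an assumption).
    """
    phase_1 = theta**k * (1 - theta)**(n - k)
    phase_2 = theta**(n - k) * (1 - theta)**k
    likelihood_linked = 0.5 * (phase_1 + phase_2)
    likelihood_unlinked = 0.5**n
    return math.log10(likelihood_linked / likelihood_unlinked)
```

For two children who inherited the same marker allele together with the disease (k = 0, n = 2) and theta = 0.2, this gives log10(1.36), about 0.13, compared with log10(2.56), about 0.41, if the phase were known. The unknown phase costs information but does not require guessing it.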
(viii) For individual 9 we can deduce (from her parents) that a2 is on the same chromosome as the
disease allele, and hence that individual 17 inherited a chromosome that underwent recombination in
the “a” to disease gene interval, while the other four children did not alter the linkage between a2 and
the disease allele (or a1 and the normal allele). Thus, $L(\theta)/L(1/2) = \theta(1-\theta)^4/(1/2)^5$.
(ix) Since 1 of the 5 chromosomes transmitted in this set of meioses is recombinant, as shown by the
children’s genotypes and phenotypes, the most likely value of $\theta$ is 0.2. For that value the Lod
score is $\log_{10}(2^5 \times 0.2 \times 0.8^4) = \log_{10}(2.6214) = 0.419$.
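The arithmetic can be checked with a short Python sketch. It assumes known phase, 1 recombinant and 4 non-recombinants as in the answer above; the function name and the grid of theta values are illustrative choices.

```python
import math

def lod(theta, recombinants, non_recombinants):
    """Lod score for phase-known meioses at recombination fraction theta."""
    n = recombinants + non_recombinants
    likelihood_linked = theta**recombinants * (1 - theta)**non_recombinants
    likelihood_unlinked = 0.5**n
    return math.log10(likelihood_linked / likelihood_unlinked)

# Scan theta from 0.01 to 0.49 and keep the value with the highest Lod score.
thetas = [i / 100 for i in range(1, 50)]
best_theta = max(thetas, key=lambda t: lod(t, 1, 4))
print(best_theta, round(lod(best_theta, 1, 4), 3))  # prints: 0.2 0.419
```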
(x) (a) Simply add the Lod scores for each family, since addition of logs is equivalent to
multiplying the independent probabilities of linkage that they represent.
(b) Some apparently indistinguishable diseases can have different genetic origins. In that case we
would be mapping different genes (at different locations) in different families and markers that are
closely linked in one family would not be closely linked to a different gene (or genes) altered in
another family.
(c) Lod scores can be negative (basically if there are similar numbers of recombinants and non-recombinants), so families which do not share the same genetic origin of disease will likely
contribute negative Lod scores, making it less likely that linkage can be deduced from combining
several families where there is locus heterogeneity (different genetic origins of disease).
To see how a Lod score is negative for unlinked loci consider a family with 2 recombinants and 2
non-recombinants: $Z = \log_{10}\left(\theta^2(1-\theta)^2/(1/2)^4\right)$.
Taking $\theta = 0.2$ as an example, the numerator (top line) is $0.2^2 \times 0.8^2 = 0.0256$, and the denominator is
higher at $0.5^4 = 0.0625$, so Z, the log of a number less than 1.0, is negative. This will generally be true
for equal numbers of recombinants and non-recombinants because $\theta(1-\theta)$ is always less than $(1/2)^2$ for
values of $\theta$ less than 0.5 (which are the only sensible values of $\theta$).
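A minimal Python sketch of parts (a) and (c), assuming known phase in every family. The two families below are hypothetical: one resembles the linked family of part (viii), the other has equal numbers of recombinants and non-recombinants, as expected when the marker is unlinked to the gene altered in that family.

```python
import math

def lod(theta, recombinants, non_recombinants):
    """Lod score for phase-known meioses at recombination fraction theta."""
    n = recombinants + non_recombinants
    likelihood_linked = theta**recombinants * (1 - theta)**non_recombinants
    return math.log10(likelihood_linked / 0.5**n)

# Hypothetical (recombinants, non-recombinants) counts for two families.
families = [(1, 4), (2, 2)]
theta = 0.2

scores = [lod(theta, r, nr) for r, nr in families]
total = sum(scores)  # adding logs multiplies the underlying likelihood ratios

for (r, nr), z in zip(families, scores):
    print(f"{r} recombinant(s), {nr} non-recombinant(s): Z = {z:.3f}")
print(f"combined Z = {total:.3f}")
```

The second family contributes a negative score, so the combined Lod score is lower than that of the first family alone, which is the dilution effect described in (c).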
(xi) A single gene mutation leading to a high penetrance phenotype in a dominant manner is key to
accurate genotyping (with respect to the disease gene) of many individuals within a family. Also, the
more common the disease the more prevalent such families are (and of course having a very
characteristic phenotype allowing precise diagnosis and a single genetic origin are also key factors).
All of this (other than possible locus heterogeneity) will be apparent from taking careful medical
histories of families and drawing up the relevant pedigrees.
Please start a new page
2. (i) The Cap and 3’ end of an mRNA are labeled with biotin through a chemical reaction requiring
2’ and 3’ hydroxyls. mRNAs are primed with oligodT or other primers and reverse transcriptase
synthesizes cDNA under optimal conditions. RNase is then added; it digests away the biotinylated cap
in all molecules where cDNA synthesis did not proceed to the 5’ end of the mRNA, because only a
full-length cDNA protects the cap from digestion. All 3’ biotins will not hybridize to cDNA and will be digested. Hence only
biotin in association with full-length cDNA remains. Biotinylated molecules are collected (with
streptavidin on magnetic beads) and full-length first strand cDNA is then used as a template for the
second strand after, for example, terminal transferase-directed tailing to produce a homopolymeric
primer binding site.
(ii) You could search every sequenced cDNA, comparing to genomic sequence to find instances
where the cDNA has a T instead of a C in genomic DNA. Much of this may be due to
polymorphisms, or even reverse transcriptase errors. Hence, you might take a single mouse,
isolate genomic DNA and sequence the relevant PCR product encompassing the nucleotide of
interest. Then, make RNA from various tissues and sequence RT-PCR products to see if a T in an
mRNA can be produced in a mouse that only has C at the equivalent genomic position.
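The initial database screen could be sketched in Python as follows. This assumes the cDNA has already been aligned to the spliced genomic sequence, so that both strings have the same length and the same strand; the function name and example sequences are hypothetical.

```python
def candidate_c_to_t_sites(genomic, cdna):
    """Return 0-based positions where genomic DNA has C and the cDNA has T.

    Both sequences must be pre-aligned and of equal length (an assumption).
    Hits are only candidates: polymorphisms and reverse transcriptase
    errors look identical at this stage.
    """
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be aligned to the same length")
    pairs = zip(genomic.upper(), cdna.upper())
    return [i for i, (g, c) in enumerate(pairs) if g == "C" and c == "T"]

print(candidate_c_to_t_sites("ACGTCCA", "ACGTTCA"))  # prints: [4]
```

Each candidate position would then need the single-mouse genomic and RT-PCR sequencing described above to rule out the alternative explanations.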
(iii) Two points could be considered. Assembly of contigs from short sequences demands a very
large number of sequence reads. Hence, the larger the genome the harder the task. Similarly,
finding overlaps to construct a contig of sequence for the first time requires more extensive
overlaps than “re-sequencing” a variant genome where the basic assembly of a prototype already
exists. So, just in terms of the number of reads required, large genomes and de novo sequencing
projects are the hardest. However, the biggest problem for sequencing is always repeat
sequences. The problem is particularly acute for short reads because (i) they will rarely span a
repeat, allowing the length of a repeat to be determined, and (ii) there is less chance of
distinguishing two highly related but non-identical sequences (say with 5 or so differences over
700bp). Repeats are a huge problem for higher eukaryotes and for constructing genomic
sequences for the first time. So, sequencing many microbial genomes de novo is an ideal
application, as can be looking for variations of a sequenced genome, even for a complex genome,
as in humans.
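To give a feel for how the number of reads scales with genome size, here is a rough Python sketch. It uses the standard relation coverage = (number of reads x read length) / genome size; the genome sizes, read length and coverage below are round illustrative numbers, not values from the question.

```python
import math

def reads_needed(genome_size_bp, read_length_bp, coverage):
    """Number of reads required to reach a given average coverage."""
    return math.ceil(coverage * genome_size_bp / read_length_bp)

# Illustrative inputs: a ~5 Mb microbial genome versus a ~3 Gb human genome,
# both sequenced with 50-nt reads to 30-fold average coverage.
print(reads_needed(5_000_000, 50, 30))       # prints: 3000000
print(reads_needed(3_000_000_000, 50, 30))   # prints: 1800000000
```

The read count grows in direct proportion to genome size, but this estimate ignores repeats entirely, which is why it understates the difficulty for higher eukaryotes.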
(iv) The high throughput feature means that this approach can be much faster than collecting and
sequencing full-length cDNAs. It has already been seen that RNASeq discovers many more splice
variants and transcribed regions than formerly identified by cloning cDNAs. However, it does not
define complete mRNA molecules and therefore leaves open the question of how variable splicing
in one part of an mRNA is connected to variable splicing in another region (so collecting full-length
cDNAs remains useful).
(v)(a) You need to know as much as possible about the different mRNA (or other RNA) sequences
transcribed from the genome to know how to represent each species on the microarray.
(b) If you have sequence information you can make oligos that represent different mRNA species
even if you physically do not have the cDNAs.
(vi) For non-abundant RNA the signal depends on how many labeled species hybridize to the target
spot. This depends on the density of target molecules and the concentration of labeled species
(which depends on total amount of RNA used to make labeled species, the degree of amplification,
volume of hybridization, and of course the relative abundance of the RNA species in question).
Noise comes mainly from inappropriate hybridization, especially of labeled species corresponding to
abundant RNAs. While high stringency hybridization and use of glass minimize such noise, it is
finite and particularly high for Affymetrix targets, which are only 25 nt long.