Biotechnology Homework 5 Answers Fall 2009

1. Positional Gene cloning: informative meioses & Lod scores.

For this question you can assume that a polymorphic DNA marker for locus “a” can have various values a1, a2, etc. (e.g. different lengths of a particular PCR product, as for a typical STRP), that the family tree below is accurate, and that locus “a” is on the same chromosome as the disease gene. You can also assume that the disease is related to dysfunction of a single gene and that mutant alleles of the relevant disease gene are rare in the general population, but you should not assume anything else. Individuals that clearly exhibit the disease are shown with filled symbols, but I am not disclosing the disease, so you have to be open-minded about whether disease of differing severity would necessarily be apparent and diagnosed.

(i) (a) In each generation roughly half of the children (male or female) of an affected mother have the disease, so it is most likely a dominant allele that confers disease with high penetrance.
(b) Males cannot transmit an X-linked allele to their sons, but we have no examples of affected fathers to test this possibility, so the disease pedigree does not distinguish between X and autosomal location (credit for this). However, we see that there are two alleles for “a” even in males and are told that “a” and the disease gene are on the same chromosome, which must therefore be an autosome (more credit).

(ii) De novo mutations are very rare, let alone for a particular disease that happens to run in the same family, so it is almost certain that a parent of 22 and 24 carries a disease gene allele. Since such alleles are rare in the population, parent 10 is by far the more likely source than parent 11. Most likely, then, parent 10 has the disease gene allele but does not show disease symptoms, illustrating incomplete penetrance (i.e. additional factors can modify whether the mutant allele actually brings about disease).
(iii) The children include a3 and a2 alleles, so he must be a3 a2.

(iv) Since 4 does not have the disease, he is unlikely to be heterozygous for the disease gene allele, so no information can be gained.

(v) Individual 6 is not heterozygous for “a” alleles, so no linkage information can be derived.

(vi) Individual 1 is heterozygous for “a” alleles and for the disease gene allele, and it is clear for both 7 and 9 which “a” allele and which disease gene allele is inherited, so both are equally informative.

(vii) We know nothing about the parents of 1 and therefore cannot unambiguously determine the cis linkage between the “a” and disease gene alleles (whether a2 or a3 is on the same chromosome as the mutant gene). We therefore cannot be certain whether this linkage has changed during meiosis. If we are trying to decide whether “a” and the disease gene are linked, we cannot bias this statistical evaluation by looking at the family downstream of individual “1” and saying it is more likely that we would see non-recombinants than recombinants. The Lod score methodology provides the framework for evaluating the results in a fair and useful way without pretending that guesses are facts.

(viii) For individual 9 we can deduce (from her parents) that a2 is on the same chromosome as the disease allele, and hence that individual 17 inherited a chromosome that underwent recombination in the “a” to disease gene interval, while the other four children did not alter the linkage between a2 and the disease allele (or a1 and the normal allele). Thus, L(θ)/L(1/2) = θ(1-θ)^4 / (1/2)^5

(ix) Since 1 out of 5 chromosomes is recombinant, as shown by the children’s genotypes and phenotypes in this set of meioses, the most likely value of θ is 0.2.
For that value the Lod score is log(2^5 x 0.2 x 0.8^4) = log(2.6214) = 0.419

(x) (a) Simply add the Lod scores for each family, since addition of “logs” is equivalent to multiplying the independent probabilities of linkage that they represent.
(b) Some apparently indistinguishable diseases can have different genetic origins. In that case we would be mapping different genes (at different locations) in different families, and markers that are closely linked in one family would not be closely linked to a different gene (or genes) altered in another family.
(c) Lod scores can be negative (basically if there are similar numbers of recombinants and non-recombinants), so families that do not share the same genetic origin of disease will likely contribute negative Lod scores, making it less likely that linkage can be deduced from combining several families where there is locus heterogeneity (different genetic origins of disease). To see how a Lod score is negative for unlinked loci, consider a family with 2 recombinants and 2 non-recombinants.
Z = log(θ^2 (1-θ)^2 / (1/2)^4)
Taking θ = 0.2 as an example, the numerator (top line) is 0.2^2 x 0.8^2 = 0.0256, and the denominator is higher at 0.5^4 = 0.0625, so Z, the log of a number less than 1.0, is negative. This will generally be true for equal numbers of recombinants and non-recombinants because θ(1-θ) is always less than (1/2)^2 for values of θ less than 0.5 (which are the only sensible values of θ).

(xi) A single gene mutation leading to a high penetrance phenotype in a dominant manner is key to accurate genotyping (with respect to the disease gene) of many individuals within a family. Also, the more common the disease, the more prevalent such families are (and of course having a very characteristic phenotype allowing precise diagnosis and a single genetic origin are also key factors). All of this (other than possible locus heterogeneity) will be apparent from taking careful medical histories of families and drawing up the relevant pedigrees.
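The Lod score arithmetic in question 1, parts (viii) to (x), can be checked numerically. The following is a minimal sketch, assuming phase-known meioses so that the likelihood ratio is θ^R (1-θ)^N / (1/2)^(R+N) for R recombinants and N non-recombinants. The function names (lod, best_theta, combined_lod) and the grid search over θ are illustrative choices, not part of the original answers.

```python
import math


def lod(recombinants, nonrecombinants, theta):
    """Two-point Lod score Z(theta) for phase-known meioses.

    Assumes the likelihood ratio theta^R * (1 - theta)^N / (1/2)^(R + N).
    """
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in the interval (0, 0.5]")
    total = recombinants + nonrecombinants
    ratio = (theta ** recombinants) * ((1 - theta) ** nonrecombinants) / (0.5 ** total)
    return math.log10(ratio)


def best_theta(recombinants, nonrecombinants):
    """Grid search over theta = 0.01 ... 0.50 for the highest Lod score."""
    grid = [i / 100 for i in range(1, 51)]
    return max(grid, key=lambda t: lod(recombinants, nonrecombinants, t))


def combined_lod(families, theta):
    """Sum Lod scores over independent families given as (R, N) pairs."""
    return sum(lod(r, n, theta) for r, n in families)


# Part (ix): 1 recombinant and 4 non-recombinants at theta = 0.2
print(round(lod(1, 4, 0.2), 3))

# Part (x)(c): 2 recombinants and 2 non-recombinants give a negative score
print(round(lod(2, 2, 0.2), 3))

# Part (x)(a): scores from independent families are added
print(round(combined_lod([(1, 4), (2, 2)], 0.2), 3))
```

Summing the two example families shows how a family with equal numbers of recombinants and non-recombinants pulls the combined score down, which is the point made in part (x)(c) about locus heterogeneity.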
Please start a new page

2. (i) The cap and 3’ end of an mRNA are labeled with biotin through a chemical reaction requiring 2’ and 3’ hydroxyls. mRNAs are primed with oligo(dT) or other primers, and reverse transcriptase synthesizes cDNA under optimal conditions. RNase is then added: the biotinylated cap is digested away in every molecule where cDNA synthesis did not proceed to the 5’ end of the mRNA and so could not protect it from digestion. The 3’ biotins are never hybridized to cDNA, so all of them are digested. Hence only biotin in association with full-length cDNA remains. Biotinylated molecules are collected (with streptavidin on magnetic beads), and the full-length first-strand cDNA is then used as a template for the second strand after, for example, terminal transferase-directed tailing to produce a homopolymeric primer binding site.

(ii) You could search every sequenced cDNA, comparing it to genomic sequence to find instances where the cDNA has a T in place of a C in genomic DNA. Many of these differences may be due to polymorphisms, or even reverse transcriptase errors. Hence, you might take a single mouse, isolate genomic DNA and sequence the relevant PCR product encompassing the nucleotide of interest. Then make RNA from various tissues and sequence RT-PCR products to see if a T in an mRNA can be produced in a mouse that only has C at the equivalent genomic position.

(iii) Two points could be considered. Assembly of contigs from short sequences demands a very large number of sequence reads. Hence, the larger the genome, the harder the task. Similarly, finding overlaps to construct a contig of sequence for the first time requires more extensive overlaps than “re-sequencing” a variant genome where the basic assembly of a prototype already exists. So, just in terms of the number of reads required, large genomes and de novo sequencing projects are the hardest. However, the biggest problem for sequencing is always repeat sequences.
The problem is particularly acute for short reads because (i) they will rarely span a repeat, which is what would allow the length of the repeat to be determined, and (ii) there is less chance of distinguishing two highly related but non-identical sequences (say with 5 or so differences over 700 bp). Repeats are a huge problem for higher eukaryotes and for constructing genomic sequences for the first time. So, sequencing many microbial genomes de novo is an ideal application, as can be looking for variations of a sequenced genome, even for a complex genome, as in humans.

(iv) The high-throughput feature means that this approach can be much faster than collecting and sequencing full-length cDNAs. It has already been seen that RNA-Seq discovers many more splice variants and transcribed regions than were formerly identified by cloning cDNAs. However, it does not define complete mRNA molecules and therefore leaves open the question of how variable splicing in one part of an mRNA is connected to variable splicing in another region (so collecting full-length cDNAs remains useful).

(v) (a) You need to know as much as possible about the different mRNA (or other RNA) sequences transcribed from the genome to know how to represent each species on the microarray.
(b) If you have sequence information you can make oligos that represent different mRNA species even if you physically do not have the cDNAs.

(vi) For non-abundant RNA the signal depends on how many labeled species hybridize to the target spot. This depends on the density of target molecules and the concentration of labeled species (which depends on the total amount of RNA used to make labeled species, the degree of amplification, the volume of hybridization, and of course the relative abundance of the RNA species in question). Noise comes mainly from inappropriate hybridization, especially of labeled species corresponding to abundant RNAs.
While high-stringency hybridization and the use of glass minimize such noise, it is finite and particularly high for Affymetrix targets, which are only 25 nt long.
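Returning to question 2, part (ii): the first screening step, finding positions where a cDNA has T and the genomic DNA has C, can be sketched as below. This is only an illustration under strong assumptions: the two sequences are taken to be already aligned and of equal length, the function name c_to_t_candidates is hypothetical, and the toy sequences are made up. Any hit is just a candidate, since polymorphisms and reverse transcriptase errors give the same pattern, which is why the single-mouse genomic DNA versus RT-PCR comparison is still needed.

```python
def c_to_t_candidates(genomic, cdna):
    """Return 0-based positions where genomic DNA has C and the cDNA has T.

    Assumes both sequences are pre-aligned, gap-free and of equal length.
    """
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be pre-aligned to equal length")
    pairs = zip(genomic.upper(), cdna.upper())
    return [i for i, (g, c) in enumerate(pairs) if g == "C" and c == "T"]


# Toy example (made-up sequences): one C-to-T difference at position 4
genomic_seq = "ATGCCAGTC"
cdna_seq = "ATGCTAGTC"
print(c_to_t_candidates(genomic_seq, cdna_seq))
```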