Repeated DNA sequences 1

Prof Duncan Shaw
Molecular & Cell Biology
University of Aberdeen

General sources for these lectures:
Genes & Genomes, by M. Singer & P. Berg, chapters 9, 10 (1991).
Genes V, by B. Lewin, chapters 22, 26, 36 (1994).
I will also use some other references later in the course.

Lecture 1

- Size and complexity of genomes
- Melting of DNA and different classes of DNA sequence
- Types of repeated sequence
- Maintenance of repeats by unequal crossovers
- Tandem repeats
- Microsatellites and disease

In general, more complex organisms tend to have bigger genomes. The diagram shows some examples. For mammals, including humans, the haploid genome is 3,000,000,000 bases (3000 Mb). But the number of genes in mammals is about 80,000, and the average transcript length is 1 to 10 kb. That puts the genes at somewhere between 80 and 800 Mb in total, so only about 10% of a mammalian genome is coding. What is the rest?

The diagram shows a very old experiment with DNA. If you take a sample of genomic DNA, heat it to 95°C to melt it to single strands, and let it cool slowly, you can follow its renaturation into the original double-stranded form by measuring its UV absorbance (A260) or by analysing samples by hydroxyapatite chromatography.

DNA reassociation follows second-order kinetics, since it involves 2 molecules (i.e. the 2 single strands). Its mathematics are as follows:

C = concentration of single-stranded DNA
Co = initial concentration of single-stranded DNA
t = time
k = a constant

The rate of the reaction is:

dC/dt = -kC^2

If this equation is integrated, you get the fraction of DNA still single-stranded at time t:

C/Co = 1/(1 + k.Co.t)

In practice, this is dealt with by plotting the % of the DNA that has reannealed as a function of Cot (the initial [DNA] multiplied by the time elapsed in seconds). This gives a Cot curve, and the diagram shows examples for genomes of increasing complexity. The red curve is poly-dA:poly-dT, the blue is MS2 virus, the green is T4 phage, and the purple is E. coli.

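The integrated rate equation above can be tried out numerically. The following is only a minimal sketch: the function names and the value of k are made up for illustration and do not come from any real measurement. It tabulates the % reannealed at a few Cot values and shows that the half-way point falls at Cot = 1/k, which follows from setting C/Co = 1/2 in the equation.

```python
def fraction_single_stranded(cot, k):
    """Fraction C/Co still single-stranded at a given Cot (Co multiplied by t)."""
    return 1.0 / (1.0 + k * cot)


def cot_half(k):
    """Cot value at which C/Co = 0.5, from solving 1/(1 + k*Cot) = 1/2."""
    return 1.0 / k


k = 0.5  # hypothetical rate constant, chosen only for illustration
for cot in (0.01, 0.1, 1.0, 10.0, 100.0):
    reannealed = 100.0 * (1.0 - fraction_single_stranded(cot, k))
    print(f"Cot = {cot:7.2f}   reannealed = {reannealed:5.1f}%")
print("Cot1/2 =", cot_half(k))
```

Because the half-way point is at 1/k, a DNA sample with a smaller rate constant reaches 50% reannealing at a higher Cot value.
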
A definition: the complexity of a genome is the total length of all the different sequences present in it. So for the red curve, which is poly-dA:poly-dT (i.e. all As on one strand, all Ts on the other), the complexity is 1 bp but the length could be anything. For the other curves, the complexity is about the same as the length, as they don't contain repeated sequences. If you had a genome consisting of 1000 copies of the sequence CAGT, and nothing else, its complexity would be 4 bp and its length would be 4000 bp.

The Cot value at which half the DNA has reannealed (Cot1/2) is a convenient measure of genome complexity. If it is measured for any new genome, and compared to a standard such as the E. coli genome (purple curve), you get the following relationship:

Cot1/2 (new genome) / Cot1/2 (E. coli) = complexity (new genome) / 4.2 Mb

because 4.2 Mb is the complexity of the E. coli genome.

The examples above were all simple genomes (nothing bigger than a bacterium). If you carry out a Cot analysis on mammalian DNA, you get a curve as shown in this diagram (red curve). This is a hypothetical genome with length 700 Mb. As you can see, it reanneals in 3 distinct phases. You can measure the Cot1/2 for each phase and then use it to work out the complexity of the DNA in each phase, using the equation above. You can then work out the total length of DNA in each phase from how far up the Y-axis of the graph it goes. Then if you divide the length by the complexity for each phase, you get the number of copies of the sequence that makes up that phase. This table shows these results for our hypothetical mammalian genome:

Fraction     Complexity    Repeat number
1 fast       340 bp        500,000
2 medium     600 kb        350
3 slow       300,000 kb    1

As a check, 340 bp x 500,000 copies gives 170 Mb for the fast fraction, about a quarter of the genome.

So we can see from this that mammalian DNA contains 3 main classes of DNA sequence, as measured by Cot curves:

1. Highly repeated DNA (up to 1 million copies)
2. Moderately repeated DNA (up to 100,000 copies)
3. Unique sequence DNA (strictly speaking 1 copy, but in practice this also includes sequences with only a few copies)

So, in which class of DNA are the genes? If you mix some radioactively labelled mRNA into the DNA melting experiment (blue curve on diagram above), it mostly reassociates with the 3rd phase of the DNA. Therefore, coding sequences are mostly in the unique-sequence DNA.

Types of repeat

We will now look at the various types of repeated DNA. The basic facts are:

- They may be tandem (i.e. arranged in blocks) or interspersed (distributed all around the genome) (see diagram)
- They may be coding or non-coding
- Copies may be perfect or imperfect
- Origin is via duplication, amplification and/or transposition of the prototype sequence

First example: ribosomal RNA genes (tandemly repeated, coding). There are 3 genes for rRNA: 18S, 28S and 5.8S. The gene products are part of the ribosome. The 3 genes are organised as a tandemly repeated unit, as shown in the diagram. In humans there are 5 blocks of rRNA repeats, on the short arms of the acrocentric chromosomes (13, 14, 15, 21, 22). The total number of copies is 150-200, and individual humans have different numbers, i.e. the repeat number is polymorphic. The sequences of the repeats (including the non-coding parts) are much more similar to one another than you would expect, given that they are very old in evolutionary terms. This suggests that they are somehow interacting with one another to exchange sequence information.

How do these genes interact, and change their copy number? An experiment with yeast shows this (see diagram). A leu- yeast is transformed with a plasmid that has leu2 cloned next to a copy of the rRNA repeat. This can integrate into the yeast's chromosomal rRNA locus by homologous integration, making the transformants leu+. If one of these is mated with a wild-type yeast, and the products of meiosis are analysed, both leu+ and leu- strains are found.
If the structure of the rRNA locus is then investigated, it is found to have lost or gained copies, as shown in the picture. The explanation is unequal crossing-over (between misaligned copies of the rRNA repeat) during meiosis. When unequal crossing-over is combined with a bit of gene conversion (see next lecture), it can account for both the variation in copy number and the homogeneity of sequence between rRNA genes (and more generally in other types of repeat sequence).

Other coding repeat sequences

The following may be tandem or interspersed:

- 5S rRNA (not usually linked to the other rRNA genes described above)
- tRNA. In humans there are about 1300 tRNA genes (10-20 of each type)
- Small nuclear RNAs (snRNA) and small cytoplasmic RNAs (scRNA), e.g. 7SL RNA. This one is part of the "signal recognition particle", a structure involved in transport of secretory and membrane proteins. In the human genome there are three 7SL RNA genes, and hundreds of pseudogenes
- Protein-coding genes in multigene families (globins, histones, collagens, tubulins, actins, immunoglobulins, etc.)

Tandem repeats within genes

Some protein-coding genes are composed of tandemly repeated segments, e.g. collagen genes. Collagen is the major structural protein in vertebrates, and is made up of 3 polypeptide chains forming a triple helix. The protein structure is a repeat:

(gly-X-Y)n

The alpha 2(I) collagen gene is 38 kb long with over 50 exons. Each exon is 54 or 108 bp long, i.e. an exact multiple of the 9 bp that code for one 3-amino-acid repeat: 6 or 12 copies.

Tandem repeats in non-coding DNA

Microsatellite repeats have repeat units from 1 to 5 bp, e.g.

....AAAAAAA....
....CACACACACA....
....CATGCATGCATGCATG....

They are present throughout mammalian genomes, at 1000s of separate loci. Therefore each one is a tandem repeat, but the individual loci where they occur are interspersed repeats.

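To make the definition concrete, here is a rough sketch of how tandem runs with a 1 to 5 bp unit could be picked out of a sequence with a regular expression. The function name and the minimum of 4 copies are assumptions made for this example only; the lecture does not give a minimum copy number, and real repeat-finding tools also handle imperfect copies, which this does not.

```python
import re


def find_microsatellites(sequence, max_unit=5, min_copies=4):
    """Return (start, unit, copies) for each perfect tandem run in sequence.

    max_unit follows the 1 to 5 bp unit size given in the text;
    min_copies is an arbitrary threshold assumed for this sketch.
    """
    # The lazy {1,n}? tries the shortest unit first, so AAAA is reported
    # as 4 copies of A rather than 2 copies of AA.
    pattern = re.compile(r"([ACGT]{1,%d}?)\1{%d,}" % (max_unit, min_copies - 1))
    hits = []
    for match in pattern.finditer(sequence.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# A made-up test sequence: 6 copies of CA with short flanks.
print(find_microsatellites("GGT" + "CA" * 6 + "TTG"))  # [(3, 'CA', 6)]
```

Run over a whole genome, a scan of this kind would report the many separate microsatellite loci.
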
Microsatellites are often polymorphic (especially the ones with high copy numbers), with different numbers of repeat copies in individual alleles. Because of this they are very useful as linkage markers for genetic mapping. The human genome linkage map was based on several thousand microsatellite repeats, mostly CA.

Microsatellites and human disease

Reference: Ashley & Warren (1995). Ann. Rev. Genet. 29, 703-728

One class of microsatellite repeat, trinucleotides, is often found in coding as well as non-coding DNA. If you ignore which DNA strand you are reading, there are only 10 basic types of trinucleotide repeat, since AGC, GCA and CAG are all equivalent (they are cyclic permutations of the same repeating sequence), as are their complements CTG, TGC and GCT, and the same goes for any other combination. Two of these (CAG and CCG) are involved in human genetic disease. In the genes that contain them, the copy number (n) of the repeat is variable. If n<40, there are no symptoms. But if n>50, symptoms of the disease start to show (these thresholds are slightly different in different diseases). In many cases, the bigger n becomes, the worse the symptoms, and the earlier in the patient's life they appear. This phenomenon is called "anticipation".

Examples:

- Fragile X syndrome: (CCG)n in the 5' untranslated region of the FMR1 gene on the X chromosome. Fragile X is the commonest type of inherited mental retardation in boys. The effect of an expansion of the CCG repeat is to stop the gene being transcribed.
- Huntington's disease: (CAG)n in the coding region of the gene, giving (gln)n in the protein. Look back at your lectures from GN3501. There are many other examples of inherited neurological disease like this, with (gln)n in the protein coding sequence.
- Myotonic dystrophy: (CTG)n in the 3' untranslated region of a protein kinase gene. It is an autosomal dominant neuromuscular disease and I covered this in GN3801.

This table summarises the properties of 4 of these diseases. There are now several more examples.
More information on each disease is available in the OMIM database.

Disease                    Trinucleotide  Normal size range  Disease size range  Translated to protein?  Unstable repeat length?
Fragile X                  CGG            6-50               200-2000            no                      very
Myotonic dystrophy         CTG            5-50               100-5000            no                      very
Huntington's disease       CAG            6-34               36-120              yes                     sometimes
Spinocerebellar ataxia 1   CAG            25-36              43-81               yes                     sometimes

As well as varying in length between individuals, the repeats can actually expand as they are passed from parents to their children, so that the condition gets worse in successive generations of a family. This genetic instability is the molecular basis for anticipation. It results from slippage errors during DNA replication that affect the lagging strand (see diagram from lecture on DNA re-arrangements for a similar mechanism).
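
The figure of 10 basic trinucleotide types quoted in the section on microsatellites and disease can be checked by brute force. The sketch below lists all 64 triplets and groups together those that are cyclic permutations of each other on either strand. The function names are invented for this example, and leaving out AAA/TTT and CCC/GGG (which are really mononucleotide repeats) is my assumption about how the count of 10 is reached; with them included the count is 12.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_unit(unit):
    """Alphabetically smallest form among all rotations on both strands."""
    forms = []
    for strand in (unit, reverse_complement(unit)):
        forms.extend(strand[i:] + strand[:i] for i in range(len(strand)))
    return min(forms)


def trinucleotide_classes(include_homopolymers=False):
    """Sorted list of distinct repeat types, one canonical unit per type."""
    classes = set()
    for bases in product("ACGT", repeat=3):
        unit = "".join(bases)
        if not include_homopolymers and len(set(unit)) == 1:
            continue  # skip AAA, CCC, GGG, TTT (assumption, see lead-in)
        classes.add(canonical_unit(unit))
    return sorted(classes)


print(len(trinucleotide_classes()))                    # 10
print(canonical_unit("CAG") == canonical_unit("CTG"))  # True
print(canonical_unit("CCG") == canonical_unit("CGG"))  # True
```

The last line also shows why the Fragile X repeat can be written as either CCG or CGG: they are the same repeat read from opposite strands.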