Repeated DNA sequences - lecture 1

Prof Duncan Shaw
Molecular & Cell Biology
University of Aberdeen
General sources for these lectures:
Genes & Genomes, by M. Singer & P. Berg, chapters 9, 10. (1991)
Genes V, by B. Lewin, chapters 22, 26, 36. (1994)
I will also use some other references later in the course.
Lecture 1
Size and complexity of genomes
Melting of DNA and different classes of DNA sequence
Types of repeated sequence
Maintenance of repeats by unequal crossovers
Tandem repeats
Microsatellites and disease
Lecture 1
In general, more complex organisms have the biggest genomes. The
diagram shows some examples. For mammals, including humans, the
haploid genome is 3,000,000,000 bases (3000 Mb).
But the number of genes in mammals is about 80,000. The average
transcript length is 1 to 10 kb. So, only about 10% of a mammalian
genome is coding. What is the rest?
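The arithmetic behind the 10% figure can be sketched as follows. The genome size and gene number are the ones quoted above; the mean transcript length of 3.75 kb is an assumption chosen from within the 1 to 10 kb range, not a measured value.

```python
GENOME_BP = 3_000_000_000   # haploid mammalian genome, from the text
GENE_COUNT = 80_000         # gene number quoted in the text


def coding_fraction(mean_transcript_bp, genes=GENE_COUNT, genome_bp=GENOME_BP):
    """Fraction of the genome covered by transcripts of the given mean length."""
    return genes * mean_transcript_bp / genome_bp


low = coding_fraction(1_000)    # every transcript at the 1 kb end of the range
high = coding_fraction(10_000)  # every transcript at the 10 kb end of the range
mid = coding_fraction(3_750)    # assumed mean length (hypothetical value)

print(f"coding fraction: {low:.1%} to {high:.1%}, about {mid:.0%} for a 3.75 kb mean")
```

So "about 10%" corresponds to a mean transcript length of a few kb, and even at the 10 kb extreme most of the genome is still unaccounted for.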
The diagram shows a very old experiment with DNA. If you take a sample of
genomic DNA, heat it to 95°C to melt it to single strands, and let it cool slowly,
you can follow its renaturing into the original, double-stranded form by measuring
its UV absorbance (A260) or analysing samples by hydroxyapatite
chromatography.
The kinetics of DNA reassociation are a second-order chemical reaction, since it
involves 2 molecules (i.e. the 2 single strands). Its mathematics are as follows:
C = single-stranded [DNA]
Co = initial single-stranded [DNA]
t = time
k = a constant
The rate of the reaction is:
dC/dt = -kC^2
If this equation is integrated, you can find the fraction of DNA still single-stranded at time t:
C/Co = 1/(1 + k.Co.t)
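For anyone who wants the intermediate steps, the integration can be written out as below. This is a sketch using only the symbols defined above (C, Co written as C_0, t and k), with primes marking the dummy variables of integration.

```latex
\begin{align*}
\frac{dC}{dt} &= -kC^{2} \\
\int_{C_0}^{C} \frac{dC'}{C'^{2}} &= -k \int_{0}^{t} dt' \\
\frac{1}{C} - \frac{1}{C_0} &= kt \\
\frac{C}{C_0} &= \frac{1}{1 + k C_0 t}
\end{align*}
```

Setting C/Co = 1/2 gives k.Co.t = 1, so the product of Co and t at the half-way point is simply 1/k.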
In practice, this is dealt with by plotting the % of the
DNA that has re-annealed, as a function of Cot
(which is initial [DNA] times the time elapsed in
seconds). This gives a Cot curve, and the diagram
shows examples for genomes of increasing
complexity. The red curve is poly-dA:poly-dT, the
blue is MS2 virus, the green is T4 phage, and the
purple is E. coli.
A definition: the complexity of a genome is the total
length of all different sequences present in it. So for
the red curve, which is poly-dA:poly-dT (i.e. all As
on one strand, all Ts on the other) the complexity is 1
but the length could be anything. For the other curves, the complexity is about the same as the
length as they don't contain repeated sequences. If you had a genome consisting of 1000 copies
of the sequence CAGT, and nothing else, its complexity would be 4 and its length would be
4000.
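The difference between length and complexity can be illustrated with a small sketch. It assumes a toy "genome" that is one perfect tandem repeat (as in the CAGT example); real genomes mix many sequence classes, so this is only meant to make the definition concrete. The function name is hypothetical.

```python
def tandem_complexity(sequence):
    """Length of the shortest unit that, repeated end to end, rebuilds the sequence.

    For a sequence with no such internal repeat this is just its full length.
    """
    n = len(sequence)
    for unit_len in range(1, n + 1):
        # A unit must divide the length exactly and regenerate the whole sequence.
        if n % unit_len == 0 and sequence[:unit_len] * (n // unit_len) == sequence:
            return unit_len
    return n


toy_genome = "CAGT" * 1000
print(len(toy_genome), tandem_complexity(toy_genome))   # length, complexity
print(tandem_complexity("A" * 500))                     # one strand of poly-dA
```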
The Cot value at which half the DNA has reannealed (Cot1/2) is a convenient measure of genome
complexity. If it is measured for any new genome, and compared to a standard such as the E. coli
genome (purple curve), you get the following relationship:
Cot1/2 (new genome)/Cot1/2 (E. coli) = complexity (new genome)/4.2 Mb
because 4.2 Mb is the complexity of the E. coli genome.
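Rearranged, the relationship gives the complexity of the new genome directly. The sketch below assumes both Cot1/2 values were measured under the same conditions; the two numbers in the example call are hypothetical, chosen only to show the scaling.

```python
E_COLI_COMPLEXITY_BP = 4_200_000   # 4.2 Mb, the standard used in the text


def complexity_from_cot_half(cot_half_new, cot_half_ecoli,
                             standard_bp=E_COLI_COMPLEXITY_BP):
    """Complexity (bp) of a new genome from its Cot1/2 relative to E. coli."""
    return standard_bp * cot_half_new / cot_half_ecoli


# Hypothetical measurements: the new genome's Cot1/2 is 100 times that of E. coli.
estimate = complexity_from_cot_half(cot_half_new=400, cot_half_ecoli=4)
print(f"{estimate / 1e6:.0f} Mb")
```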
The examples above were all simple genomes (nothing
bigger than a bacterium). If you carry out a Cot analysis on
mammalian DNA, you get a curve as shown in this diagram
(red curve). This is a hypothetical genome with length
700 Mb. As you can see it reanneals in 3 distinct phases. You
can measure the Cot1/2 for each phase and then use it to
work out the complexity of the DNA in each phase, using the
equation above.
You can then work out the total length of DNA in each phase
by how far up the Y-axis of the graph it goes. Then if you
divide the length by the complexity for each phase, you get
the number of copies of the sequence that makes up that
phase. This table shows these results for our hypothetical mammalian genome:
Fraction   Phase    Complexity    Repeat number
1          fast     340 bp        500,000
2          medium   600 kb        350
3          slow     300,000 kb    1
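As a check on the table, the calculation can be run in reverse: multiplying complexity by repeat number gives the total length of DNA in each phase, and the three phases should add up to roughly the 700 Mb genome. This sketch uses only the table values; the variable names are my own.

```python
# Complexity (bp) and repeat number for each phase, taken from the table.
phases = {
    "fast":   {"complexity_bp": 340,         "copies": 500_000},
    "medium": {"complexity_bp": 600_000,     "copies": 350},
    "slow":   {"complexity_bp": 300_000_000, "copies": 1},
}


def phase_length_bp(phase):
    """Total DNA in a phase = complexity x number of copies."""
    return phase["complexity_bp"] * phase["copies"]


lengths = {name: phase_length_bp(p) for name, p in phases.items()}
total_bp = sum(lengths.values())
fractions = {name: length / total_bp for name, length in lengths.items()}

for name in phases:
    print(f"{name:6s} {lengths[name] / 1e6:6.0f} Mb  {fractions[name]:.0%} of genome")
print(f"total  {total_bp / 1e6:6.0f} Mb")
```

Dividing each phase length by its complexity recovers the repeat number, which is the direction the calculation runs when you start from a real Cot curve.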
So we can see from this that mammalian DNA contains 3 main classes of DNA sequence, as
measured by Cot curves:
1. Highly repeated DNA (up to 1 million copies)
2. Moderately repeated DNA (up to 100,000 copies)
3. Unique sequence DNA (strictly speaking 1 copy, but in practice this also includes sequences
with only a few copies)
So, in which class of DNA are the genes?
If you mix some radioactively-labelled mRNA into the DNA melting experiment (blue curve on
diagram above) it mostly reassociates with the 3rd phase of the DNA. Therefore, coding
sequences are mostly in the unique-sequence DNA.
Types of repeat
We will now look at the various types of repeated DNA. The basic
facts are:
- They may be tandem (i.e. arranged in blocks) or interspersed
(distributed all around the genome) (see diagram).
- They may be coding or non-coding
- Copies may be perfect or imperfect
- Origin is via duplication, amplification and/or transposition of the
prototype sequence
First example: ribosomal RNA genes (tandemly repeated, coding).
There are 3 genes for rRNA, 18S, 28S and 5.8S. The gene
products are part of the ribosome. The 3 genes are organised
as a tandemly repeated unit, as shown in the diagram.
In humans there are 5 blocks of rRNA repeats, on the short
arms of the acrocentric chromosomes (13,14,15,21,22). The
total number of copies is 150-200, and individual humans
have different numbers, i.e. the repeat number is
polymorphic. The sequences of the repeats (including the
non-coding parts) are much more similar to one another than
you would expect, given that they are very old in
evolutionary terms. This suggests that they are somehow
interacting with one another to exchange sequence
information.
How do these genes interact, and change their
copy number? An experiment with yeast shows
this (see diagram). A leu- yeast is transformed
with a plasmid that has leu2 cloned next to a copy
of the rRNA repeat. This can integrate into the
yeast's chromosomal rRNA locus by homologous
integration, making the transformants leu+. If this
is mated with a wild type yeast, and the products
of meiosis are analysed, both leu+ and leu- strains
are found. If the structure of the rRNA locus is
then investigated, it is found to have undergone
loss or addition of copies as shown in the picture.
The explanation of this is unequal crossing-over
(between mis-aligned copies of the rRNA repeat)
during meiosis.
When unequal crossing over is combined with a
bit of gene conversion (see next lecture) then it
can account for variation in copy number, and
homogeneity of sequence, between rRNA genes
(and more generally in other types of repeat
sequence).
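The effect of unequal crossing-over on copy number can be sketched with two lists standing for the repeat arrays on the two chromatids. This is a toy model, not a description of the yeast experiment itself: the array size of 5 copies and the break positions are hypothetical, and the exchange is assumed to fall cleanly at a boundary between copies.

```python
def unequal_crossover(chromatid_a, chromatid_b, break_a, break_b):
    """Exchange the arrays at break_a on A and break_b on B.

    If break_a == break_b the copies were aligned and nothing changes in number;
    if they differ, one product gains copies and the other loses the same number.
    """
    recombinant_1 = chromatid_a[:break_a] + chromatid_b[break_b:]
    recombinant_2 = chromatid_b[:break_b] + chromatid_a[break_a:]
    return recombinant_1, recombinant_2


a = [f"a{i}" for i in range(1, 6)]   # 5 repeat copies on one chromatid
b = [f"b{i}" for i in range(1, 6)]   # 5 repeat copies on the other

# Mis-aligned by 2 copies: copy 4 of A has paired with copy 2 of B.
longer, shorter = unequal_crossover(a, b, break_a=4, break_b=2)
print(len(longer), longer)
print(len(shorter), shorter)
```

The total number of copies across the two products is unchanged; it is only redistributed, which is why both gains and losses are seen among the meiotic products.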
Other coding repeat sequences
The following may be tandem or interspersed:
5S rRNA (not usually linked to the other rRNA genes described above)
tRNA. In humans there are about 1300 tRNA genes (10-20 of each type)
Small nuclear RNAs (snRNA) and small cytoplasmic RNAs (scRNA), e.g. 7SL RNA. This one is
part of the "signal recognition particle", a structure involved in transport of secretory and
membrane proteins. In the human genome there are 3 7SL RNA genes, and hundreds of
pseudogenes.
Protein coding genes in multigene families (globins, histones, collagens, tubulins, actins,
immunoglobulins, etc)
Tandem repeats within genes
Some protein-coding genes are composed of tandemly repeated segments, e.g:
Collagen genes. Collagen is the major structural protein in vertebrates, and is made up of 3
polypeptide chains forming a triple helix.
The protein structure is a repeat: (gly-X-Y)n
The alpha 2(I) collagen gene is 38 kb long with over 50 exons. Each exon is 54 or 108 bp long,
i.e. an exact multiple of the 9 bp that encode one gly-X-Y repeat (6 or 12 copies).
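The exon arithmetic is easy to check. A minimal sketch, assuming only that each amino acid is encoded by 3 bp and that one gly-X-Y unit is 3 amino acids:

```python
CODON_BP = 3            # base pairs per amino acid
TRIPLET_RESIDUES = 3    # amino acids in one gly-X-Y unit


def gly_x_y_copies(exon_bp):
    """Number of whole gly-X-Y units encoded by an exon of the given length."""
    bp_per_repeat = CODON_BP * TRIPLET_RESIDUES   # 9 bp per gly-X-Y unit
    copies, remainder = divmod(exon_bp, bp_per_repeat)
    if remainder:
        raise ValueError(f"{exon_bp} bp is not a whole number of gly-X-Y units")
    return copies


for exon_bp in (54, 108):
    print(exon_bp, "bp ->", gly_x_y_copies(exon_bp), "copies of gly-X-Y")
```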
Tandem repeats in non-coding DNA
Microsatellite repeats have repeat units from 1 to 5 bp, e.g.
....AAAAAAA........
....CACACACACA.......
.....CATGCATGCATGCATG............
They are present throughout mammalian genomes, at thousands of separate loci. Therefore each one
is a tandem repeat but the individual loci where they occur are interspersed repeats.
They are often polymorphic (especially the ones with high copy numbers) with different
numbers of repeat copies in individual alleles. Because of this they are very useful as linkage
markers for genetic mapping. The human genome linkage map was based on several thousand
microsatellite repeats, mostly CA.
Microsatellites and human disease
Reference: Ashley & Warren (1995). Ann. Rev. Genet. 29, 703-728
One class of microsatellite repeat, trinucleotides, is often found in coding as well as non-coding
DNA. If you ignore which DNA strand you are reading, there are only 10 basic types of
trinucleotide repeat, since AGC, GCA and CAG are all equivalent (they are cyclic permutations of
the same repeat) and their reverse complements (GCT, TGC and CTG) are the same repeat read from
the other strand. The same goes for any other combination.
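The count of 10 can be confirmed by brute force. The sketch below groups all trinucleotides that are cyclic permutations or reverse complements of one another. It leaves out AAA, CCC, GGG and TTT on the assumption that these are better counted as mononucleotide repeats; the function names are my own.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def rotations(seq):
    """All cyclic permutations of a repeat unit (AGC, GCA, CAG, ...)."""
    return {seq[i:] + seq[:i] for i in range(len(seq))}


def canonical(unit):
    """Alphabetically first member of the unit's equivalence class."""
    return min(rotations(unit) | rotations(reverse_complement(unit)))


def trinucleotide_classes():
    classes = set()
    for bases in product("ACGT", repeat=3):
        unit = "".join(bases)
        if len(set(unit)) == 1:
            continue   # AAA etc. are mononucleotide repeats, not trinucleotides
        classes.add(canonical(unit))
    return sorted(classes)


classes = trinucleotide_classes()
print(len(classes), classes)
```

Note that CAG and CTG fall in one class, and CCG and CGG in another, which is why the same disease repeat can be written either way depending on the strand being read.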
Two of these (CAG and CCG) are involved in human genetic disease. In the genes that contain
them, the copy number (n) of the repeat is variable. If n<40, there are no symptoms. But if n>50,
symptoms of the disease start to show (these thresholds are slightly different in different
diseases). In many cases, the bigger n becomes, the worse the symptoms, and the earlier in the
patient's life they appear. This phenomenon is called "anticipation".
Examples: Fragile X syndrome - (CCG)n in the 5' untranslated region of the FMR1 gene on the X
chromosome. Fragile X is the commonest type of inherited mental retardation in boys. The effect
of an expansion of the CCG repeat is to stop the gene being transcribed.
Huntington's disease - (CAG)n in coding region of the gene, (gln)n in the protein. Look back at
your lectures from GN3501. There are many other examples of inherited neurological disease
like this with (gln)n in the protein coding sequence.
Myotonic dystrophy - (CTG)n in 3' untranslated region of a protein kinase gene. It is an
autosomal dominant neuromuscular disease and I covered this in GN3801.
This table summarises the properties of 4 of these diseases. There are now several more
examples.
Disease                    Trinucleotide   Normal size range   Disease size range   Translated to protein?   Unstable repeat length?
Fragile X                  CGG             6-50                200-2000             no                       very
Myotonic dystrophy         CTG             5-50                100-5000             no                       very
Huntington's disease       CAG             6-34                36-120               yes                      sometimes
Spinocerebellar ataxia 1   CAG             25-36               43-81                yes                      sometimes
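The size ranges in the table can be turned into a simple lookup, sketched below. It uses only the tabulated ranges, so a repeat number that falls between the normal and disease ranges is reported as lying outside them rather than being interpreted. The function and dictionary names are hypothetical, and this is an illustration of the table, not a diagnostic rule.

```python
# (lowest, highest) repeat number for each range, copied from the table.
REPEAT_RANGES = {
    "Fragile X":                {"normal": (6, 50),  "disease": (200, 2000)},
    "Myotonic dystrophy":       {"normal": (5, 50),  "disease": (100, 5000)},
    "Huntington's disease":     {"normal": (6, 34),  "disease": (36, 120)},
    "Spinocerebellar ataxia 1": {"normal": (25, 36), "disease": (43, 81)},
}


def classify_allele(disease, copies):
    """Return 'normal' or 'disease' if copies falls in a tabulated range."""
    for label, (lowest, highest) in REPEAT_RANGES[disease].items():
        if lowest <= copies <= highest:
            return label
    return "outside tabulated ranges"


print(classify_allele("Huntington's disease", 20))
print(classify_allele("Huntington's disease", 45))
print(classify_allele("Fragile X", 100))
```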
As well as varying in length, the repeats can actually expand as they are passed from parents to
their children, so that the condition gets worse in successive generations of a family. This genetic
instability is the molecular basis for anticipation. It results from slippage errors during DNA
replication that affect the lagging strand (see diagram from lecture on DNA re-arrangements for
a similar mechanism).