Repeated DNA sequences 3
Prof Duncan Shaw
Molecular & Cell Biology

Lecture 3
CpG islands - discovery and structure
CpG islands - function and uses in gene cloning
Centromeres - why they are necessary
Types of DNA repeat at centromeres

CpG islands - discovery and structure

References:
Genes and genomes chapter 8.7
Cross SH, Bird AP (1995): "CpG islands and genes". Current Opinion in Genetics and Development 5, 309-314

CpG islands (also known as HTF islands) are not really a repeated sequence, but a special type of DNA sequence with a particular function. Most vertebrate DNA is methylated at the C residue of the sequence ...CG... This means that it can't be cut by certain restriction enzymes with CG in their recognition sites, e.g. HpaII (see top part of diagram). Other enzymes like this include NotI and SacII, which are used to digest mammalian DNA into large fragments for analysis by pulsed-field gel electrophoresis.

If vertebrate DNA is digested with HpaII and run on an ordinary agarose gel, you get a pattern like the lower part of the diagram. The interpretation of this result was that most HpaII sites in vertebrate DNA are methylated at CG and not digested, resulting in the high molecular-weight DNA near the origin of the gel. But a fraction of sites are unmethylated and clustered together, resulting in very small fragments of DNA of 100bp or less. Only about 1% of the total DNA is in this fraction. These clusters of sites were named HTF islands (HpaII Tiny Fragments). The name was later changed to CpG islands.

The next stage was to clone some of the HTF DNA and use it to probe genomic DNA libraries. This allowed fragments containing the islands to be sequenced. It revealed that a typical CpG island is 1-2kb long, and is about 70% G and C (as opposed to 40% in the rest of the genome). Furthermore, in the rest of the genome the dinucleotide ...CG... is under-represented due to the tendency of methylated C to mutate to T, which converts ...CG... to ...TG... In CpG islands there is no such deficit of CpG.
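The two sequence properties described above, G+C content and the CpG deficit, can be computed directly from a sequence. The following is a minimal Python sketch and not part of the original lecture material. The function names and the two toy sequences are invented for illustration. The observed/expected ratio used here (number of CG dinucleotides multiplied by sequence length, divided by the product of the C and G counts) is assumed as one common way of expressing the deficit.

```python
def gc_content(seq: str) -> float:
    """Fraction of bases in seq that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: count(CG) * N / (count(C) * count(G)).

    A value near 1 means CG occurs about as often as the base composition
    predicts; a value well below 1 means CG is under-represented.
    """
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c * g)


# Toy sequences (hypothetical, built only to mimic the figures in the text).
# island_like: 70% G+C, CG roughly as frequent as expected.
island_like = "GCCGAGGTCA" * 20
# bulk_like: 40% G+C, CG strongly under-represented.
bulk_like = ("ATTGCATCAG" * 9 + "ATTGCATACG") * 2

for name, seq in [("island-like", island_like), ("bulk-like", bulk_like)]:
    print(name, round(gc_content(seq), 2), round(cpg_obs_exp(seq), 2))
# island-like 0.7 0.83
# bulk-like 0.4 0.25
```

On real data the same two numbers would normally be computed in a sliding window along the genome, so that islands show up as stretches where both values are high.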
Further studies confirmed that most C in CpG islands is unmethylated, and showed that there are about 45,000 islands in a human genome. Sequencing DNA around islands showed that they are always located at the 5' end of genes, suggesting that they are part of the promoter. But as there are an estimated 80,000 genes in mammals, not all genes have a CpG island. Those that do are usually "housekeeping" genes, i.e. those performing basic cellular functions in all tissues.

The diagram shows a typical gene with a CpG island. The island includes the first exon. Although there are CpG dinucleotides throughout, they are clustered and unmethylated in the island. The average spacing between islands in the genome is about 70kb, but this can vary from 8kb to several Mb in particular regions of the genome.

Function of CpG islands

A function for islands was suspected because they are always associated with the first exon of housekeeping genes. Molecular studies showed that the chromatin in these regions has an "open" configuration, with no nucleosomes or histone H1. This would make the DNA accessible to transcription factors, etc. and hence able to be transcribed.

Bulk genomic DNA is methylated by a specific methyltransferase enzyme, so what stops CpG islands from getting methylated as well? It may be due to flanking sites for the Sp1 DNA binding protein, because if these are removed, the island does get methylated. Possibly bound Sp1 prevents methyltransferase from acting.

An exception to the unmethylated state of islands is the mammalian X chromosome. In female cells one of the Xs is inactive. The CpG islands of its genes are methylated, which represses gene activity. In fragile X syndrome island methylation occurs as part of the disease mechanism: the CpG island of the mutant FMR1 gene is methylated in male cells, preventing its expression. CpG islands are also involved in cancer, e.g. when the promoter of a tumour-suppressor gene is aberrantly methylated, leading to lack of expression and hence a growth advantage in the cells where this occurs.

Another role is in genetic imprinting, the process whereby some genes are differentially expressed depending on which parent they were inherited from. The mechanism for this is associated with the methylation of CpG islands of imprinted genes. There is also some evidence that a few islands (2%) may be reversibly methylated to control gene expression during normal development.

Further insight into the role of CpG islands came from the discovery of the MeCP2 protein. This is a 492aa protein which is abundant in all mammalian cells. It binds to methylated DNA, but not to unmethylated DNA (i.e. CpG islands). In vitro studies using reporter genes show that binding of MeCP2 to a methylated promoter represses transcription. Presumably it does so by preventing access by transcription factors. There is a critical minimum density of methylation sites required for MeCP2 to do this, about 1 per 100bp. This indicates that MeCP2 molecules act in a co-operative manner.

Use of MeCP2 for cloning genes

The DNA binding domain of MeCP2 is about 85 amino acids long and lies near the N-terminus of the protein. It can be cloned and expressed in E. coli, fused to a 6xHis sequence which enables it to bind to nickel ions on an agarose column. DNA fragments passed over this column can then be eluted with increasing salt concentrations as shown in the picture. Methylated DNA binds more tightly and is therefore eluted last.

The MeCP2 column can now be used to isolate DNA fragments that contain CpG islands specifically from genomic DNA. The DNA is digested with an enzyme specific for TTAA, which cuts quite frequently except in CG-rich regions (i.e. CpG islands). The fragments are passed over the column, which binds methylated DNA but not unmethylated DNA.
The flow-through is recycled through the column so that most of the methylated DNA sticks and most of the unmethylated DNA comes through. To purify the CpG islands further, they are next converted to the methylated form by treatment in vitro with methyltransferase enzyme. As they have a higher density of CpG dinucleotides than bulk (non-CpG island) DNA, they will now bind more tightly to the column than any contaminating fragments of bulk DNA and can be specifically eluted with high salt as shown in the previous picture. After repeating this step the CpG island fragments are virtually pure and are cloned into a suitable vector to make a CpG island library. Clones from this library can then be used to isolate corresponding full-length cDNAs or genomic clones that contain the rest of the gene.

Centromeres

The diagram shows a typical eukaryotic chromosome. Both centromeres and telomeres have specific functions in the cell, and particular types of repeated DNA sequence.

The diagram shows the conserved core sequence for a centromere of the yeast Saccharomyces cerevisiae. Functionally, the centromere is the part of the chromosome to which the spindle fibres (microtubules) attach at cell division, ensuring correct segregation of chromosomes to daughter cells. Structurally, it is the site of a disc-like protein structure, the kinetochore. Centromeres also have associated repeated DNA sequences - another class of tandem repeat.

S. cerevisiae has a simple centromere on each of its chromosomes. One of the first to be cloned (from chromosome 11) was linked to the met14 gene, and was discovered when it was observed that a 5.2kb DNA fragment containing met14 could confer stability and correct segregation on the plasmid it was cloned in. Other centromeres were isolated and their sequences compared. They all have a conserved sequence similar to that shown in the picture above. All the conserved elements are necessary for correct centromere function.
The sequence at each yeast centromere is homologous but not identical. By gene targeting, a centromere can be manipulated in various ways and its ability to direct correct chromosomal segregation at cell division can be observed. This shows that a centromere can be reversed or replaced with one from another chromosome, and still function properly, but that mutations of its sequence can destroy its function.

The DNA of yeast centromeres is less accessible than the flanking DNA to enzyme digestion, so it probably forms some special kind of chromatin structure. Specific binding proteins have been found, and it is believed that the combination of these proteins and the centromere-specific DNA sequences is the basis of centromere function.

Centromere-associated DNA repeats

Most eukaryotes have many copies of tandemly-repeated DNA sequences around their centromeres (S. cerevisiae is an exception). These can account for up to 20% of total genomic DNA (in humans it is about 5%). These repeats, called centromeric satellites, are not transcribed. They stain strongly with Giemsa II giving dark bands under the microscope, sometimes referred to as heterochromatin. Similar staining is seen for some other regions e.g. telomeres, rRNA genes.

The structure of the centromeric tandem repeats was investigated by restriction enzyme analysis: genomic DNA is digested with various enzymes, run on an agarose gel, Southern blotted, and probed with a copy of the repeat. This gives a single band for any enzyme that cuts once in every repeat, and a ladder pattern for enzymes that cut some repeats but not others (due to mutations). These patterns are very characteristic of tandem repeats.

The sequence of the repeat varies between species. We will concentrate on those of primates including human. The primate repeat is also known as the "alpha-satellite" repeat and it has a basic 170bp unit, with variations between species and between chromosomes in the same species.
The repeats also show polymorphism between individuals, both in terms of sequence and repeat copy number.

The picture shows some variations on the primate theme. The lower 2 are human. If the sequences of individual repeats within a species are compared, you find 2 kinds of mutation: random ones, which might have occurred at any time, and ones common to several repeat units, which presumably arose by mutation followed by amplification of the repeat by unequal crossing-over.

Some centromeres have blocks of repeats made up of multimers of the 170bp unit. The last one in the picture is one such - it is from the human X chromosome. What happened here is first that the basic 170bp unit became amplified to 12 copies, then various mutations occurred at random positions leading to the BamHI and PstI sites, then the entire 12-mer was itself amplified.
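The restriction patterns expected from tandem repeats can be illustrated with a small simulated digest. The following Python sketch is not from the lecture and rests on stated assumptions: the 170bp unit is an arbitrary A/T-only sequence, the position of the BamHI site within the unit is invented, and the enzyme is treated as cutting at the start of its recognition sequence. Only the fragment sizes matter here. It covers three cases: a site present in every unit, a site lost by mutation from some units, and a site present in only one unit of an amplified 12-mer.

```python
UNIT_LEN = 170
SITE = "GGATCC"  # BamHI recognition sequence

# Hypothetical 170bp unit with no G or C, so the site cannot arise by chance.
plain_unit = "ATTAT" * (UNIT_LEN // 5)
# Same unit with one site placed at an arbitrary position (offset 50).
cut_unit = plain_unit[:50] + SITE + plain_unit[50 + len(SITE):]


def tandem_array(has_site):
    """Join units head to tail; has_site[i] says whether unit i carries the site."""
    return "".join(cut_unit if s else plain_unit for s in has_site)


def digest_sizes(dna, site=SITE):
    """Fragment sizes after cutting linear DNA at the start of every site."""
    cuts = []
    pos = dna.find(site)
    while pos != -1:
        cuts.append(pos)
        pos = dna.find(site, pos + 1)
    bounds = [0] + cuts + [len(dna)]
    return [b - a for a, b in zip(bounds, bounds[1:]) if b > a]


def interior_sizes(dna, site=SITE):
    """Drop the two end fragments, which only reflect where the toy array stops."""
    return digest_sizes(dna, site)[1:-1]


# 1. Site in every unit: one fragment size, i.e. a single band.
every_unit = tandem_array([True] * 24)
print(set(interior_sizes(every_unit)))             # {170}

# 2. Site lost from some units: multiples of 170bp, i.e. a ladder.
some_units = tandem_array([True, True, False, True, False, False, True, True])
print(sorted(set(interior_sizes(some_units))))     # [170, 340, 510]

# 3. Site in one unit of a 12-mer that was then amplified as a block.
twelve_mer = [True] + [False] * 11
higher_order = tandem_array(twelve_mer * 4)
print(set(interior_sizes(higher_order)))           # {2040}
```

The third case gives a single band at 12 x 170 = 2040bp, which is how a site that arose once in a 12-mer and was then amplified along with it would appear on a Southern blot.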