Genome organisation and evolution

Level 3 Molecular Evolution and Bioinformatics
Jim Provan
Reading: Page and Holmes, Sections 3.1.4/5 and 3.3

The eukaryotic genome
[Figure: components of the eukaryotic genome. Coding DNA (single-copy protein-coding genes; multigene families, dispersed or tandemly repeated); regulatory sequences; non-coding DNA (tandemly repeated DNA: satellite DNA, minisatellites, microsatellites; transposable elements and retroviruses; spacer DNA)]

The C-value paradox
• The amount of DNA per haploid genome is known as the C-value
• Contrary to expectation, the amount of DNA is not correlated with complexity:
  – The protist Amoeba dubia has about 200 times more DNA (670,000,000 kbp) than humans (3,300,000 kbp)
  – This cannot be explained by differences in gene number

The structure of genes
• There are many forms of genes:
  – Those which produce a protein, a tRNA or an rRNA are referred to as structural genes
  – Those which control how and when genes are expressed are called regulatory genes
• Some housekeeping genes need to be expressed in all tissues, e.g. those involved in protein synthesis
• Other, tissue-specific genes are only expressed in a particular cell or tissue type, e.g. the insulin gene is only expressed in the pancreatic β-cells
• Whatever their function, all genes contain a coding region which specifies a polypeptide or an RNA molecule

Regulation of gene expression
• Coding regions of genes are usually flanked by regulatory regions which control gene expression through transcription and translation
• Upstream promoter regions:
  – In bacteria, there is a Pribnow box (TATAAT) about 10 bp upstream from where transcription starts, the ‘-35 site’ (TTGACA) about 35 bp upstream and the Shine-Dalgarno box (AGGAGG) about 7 bp before the initiation codon
  – In eukaryotes, as well as the TATA box, some promoter regions contain a CAAT box about 40 bp before the initiation codon and a GC box (GGGCGG) about 110 bp upstream
• Downstream elements such as the polyadenylation signal (AATAAA) signify the end of transcription and increase the stability of RNA transcripts

Structure of a typical gene - alcohol dehydrogenase (Adh)
[Figure: gene structure in a eukaryote and a prokaryote, drawn 5’ to 3’]
• Eukaryote: promoter region (TATA box; CAAT box, in mammals; GC box, GGGCGG), initiation codon, exon 1, intron 1, exon 2, intron 2, exon 3, intron 3, exon 4, stop codon, polyadenylation signal (AATAAA)
• Prokaryote: promoter region (-35 site, TTGACA; Pribnow box, TATAAT; Shine-Dalgarno box, AGGAGG), initiation codon, stop codon

Introns
• Occur frequently within eukaryotic genomes and make up most of the length of very long genes
• The number, size and organisation of introns vary:
  – Histone genes have no introns, whereas the chicken pro-α2-collagen gene has over fifty
  – The SV40 virus contains an intron of 31 bp, whereas the human dystrophin gene has an intron of over 210,000 bp
• Some introns have genes contained within them: the Adh gene in Drosophila is located within an intron of the outspread gene
• Intron-exon boundaries are strongly conserved: introns nearly always begin with GT and end with AG

Types of introns
• Most introns in eukaryotes are spliceosomal introns (‘nuclear introns’), so called because they are spliced by a spliceosome of proteins and RNA
• Some introns can splice without the aid of proteins (“self-splicing introns”):
  – One class, group I introns, are sometimes mobile because they encode proteins such as DNA endonucleases. They are found in mitochondrial and chloroplast genomes, in the rRNA genes of some eukaryotes and in T4 bacteriophage
  – Group II introns are found in organelles and their bacterial ancestors and contain reverse transcriptase-like sequences
  – Group III introns are found in a few protists and are similar to group II introns with the central portion removed

The evolution of introns
• There are two competing hypotheses for the evolution of spliceosomal introns:
  – The introns-early hypothesis, proposed by Walter Gilbert, suggests that introns mark the boundaries between ancient genes which encoded distinct proteins. Throughout evolution these once-independent proteins have been put together in new combinations to produce more complex proteins by exon shuffling
  – An alternative hypothesis (introns-late) suggests that introns only invaded eukaryote genomes fairly recently

The evolution of introns (continued)
• A crucial prediction of the introns-early hypothesis is that spliceosomal introns delineate structural or functional units within proteins:
  – Introns are found in the same places in all known globin genes, including myoglobin and plant leghaemoglobins
  – More frequently, however, introns do not appear to separate functionally distinct parts of proteins
• Another problem with the introns-early hypothesis is the absence of spliceosomal introns from Archaea and Bacteria:
  – Massive intron loss has been postulated, but this does not explain why introns are found in nuclear copies of organelle genes but not in the genes of the organelles or their precursors
• Exon shuffling has probably been a factor in later eukaryotes

Multigene families
• Many genes are found not as individual copies but as members of multigene families, larger groups of related genes:
  – An important evolutionary innovation: proteins with similar function can be arranged so that they are regulated efficiently
  – Vertebrates have a variety of multipolypeptide globin genes, produced by gene duplication, which are adapted to the varying oxygen requirements of different developmental stages
• Not all genes are functional:
  – Pseudogenes arise through gene duplication and accumulate mutations, since only one copy is required
  – Processed pseudogenes, which lack promoters and introns, have been produced by reverse transcription of mRNA

Multigene families (continued)
[Figure: evolution of a globin gene family on a time axis from 0 to 200 millions of years ago: ε (embryonic), γ1 and γ2 (foetal), ψη (pseudogene), δ and β (adult)]

Evolution of multigene families
• The most obvious way in which gene number can change between species is through gene duplication:
  – Can arise through unequal crossing-over
  – May occur by duplication of entire genomes (polyploidy):
    · Common in plants: around 50% of angiosperms are polyploid
    · Xenopus laevis is tetraploid: normal meiosis is possible
    · Other members of the genus Xenopus have chromosome numbers ranging from 20 to 108
  – Another mechanism of gene duplication is transposition
• The fate of a new gene depends on its function: redundancy vs. natural selection
• Genes can also acquire new functions without duplication, e.g. ε-crystallin and LDH

Gene duplication in the Hox gene family
• Homeotic genes control the development of the body plan in animals
• In both vertebrate Hox and invertebrate HOM genes, there is a highly conserved protein motif known as a homeobox
• Mutations in Hox/HOM genes can drastically affect the organisation of body parts

Gene duplication in the Hox gene family (continued)
• Although Hox/HOM genes are related, their organisation differs between organisms:
  – In vertebrates, there are multiple clusters of Hox genes: the mouse has four clusters, each located on a different chromosome and covering over 100 kb
  – HOM genes in Drosophila are found in two clusters, Antennapedia and Bithorax, on the same chromosome
  – In amphioxus, a group of marine invertebrates closely related to the vertebrates, there is a single cluster of at least 10 Hox genes, each of which is homologous to a different Hox gene in vertebrates: the origin of vertebrates coincided with a series of gene duplications
• An example of a dispersed gene family in vertebrates

Gene duplication in the Hox gene family (continued)
[Figure: Hox/HOM clusters in Drosophila (lab, pb, Dfd, Scr, Antp, Ubx, AbdA, AbdB), the hypothetical common ancestor, amphioxus, and vertebrates (four clusters produced by gene duplications)]

Tandem arrays
• Tandem arrays contain multiple copies of genes with the same function
• A good example is the rDNA array:
  – Large quantities of rRNA are required
  – Genes and transcribed spacers are cotranscribed, and repeat units are separated by a non-transcribed spacer
  – Arrays vary in size: 1 copy in Tetrahymena, 19,300 copies in Amphiuma
[Figure: rDNA repeat unit showing the NTS, ETS, 18S, ITS1, 5.8S, ITS2 and 28S regions]

Evolution of rDNA arrays
• Because they contain both highly conserved (18S) and highly variable (NTS) regions, rDNA sequences have been used frequently in molecular systematics
• Despite this, they do not evolve in a simple manner:
  – Although there is a high degree of sequence similarity within species, there is great divergence between them
  – Due to unequal crossing-over and gene conversion, concerted evolution can take place, which allows genes to evolve together by spreading mutations throughout the members of the array
  – This makes phylogenetic analysis difficult, since it is not easy to discern which genes are truly homologous
  – Often leads to “mosaics” of sequences, each with a different phylogenetic history

Non-coding repetitive DNA
Class                    Copy number                    Organisation
Satellite DNA            Highly repetitive (>10^4)      Tandemly repeated
Mini-/microsatellite     Moderately repetitive          Tandemly repeated
Transposable elements    Moderately/highly repetitive   Dispersed

Tandemly repeated DNA
• Much of the non-coding repetitive DNA in eukaryotes consists of tandem repeats of short sequence motifs
• Satellite DNA is located mainly in the heterochromatin and consists of motifs up to 40 kb in length:
  – The α-satellite DNA of primates is based on a 171 bp motif repeated for hundreds of kilobases
  – Over 60% of the genome of Drosophila nasutoides is satellite DNA
• Minisatellites and microsatellites consist of shorter motifs duplicated through unequal crossing-over and DNA slippage:
  – Minisatellite motifs are 11 – 60 bp in length and contain a G-rich “core” sequence
  – Microsatellites are shorter, generally dinucleotide repeats
  – Both exhibit extremely high mutation rates, and multiple alleles are usually found in populations
  – Used in population genetics and forensics

Transposable elements
• Transposable elements increase their copy number by moving around the genome and making additional copies:
  – Around 50% of the maize genome may be transposable elements
  – 10-20% of the Drosophila genome
• Three groups of transposable elements:
  – Class I (retroelements) transpose through an intermediate RNA stage via reverse transcriptase, cf. retroviruses
  – Class II (DNA elements) transpose directly from DNA to DNA
  – Little is known about miniature inverted-repeat transposable elements (MITEs): they are around 100 – 400 bp in length and transpose by as yet unknown means

Transposable elements (continued)
[Figure: structure of transposable elements]
• Class I transposable elements (retroelements):
  – Retrotransposons: LTR, reverse transcriptase, LTR
  – Retroposons: reverse transcriptase, poly-A tail (AAAAAA)
• Class II transposable elements (DNA elements): Ac-like elements, with a transposase gene flanked by short terminal repeats
• Miniature inverted-repeat transposable elements (MITEs), e.g. Tourist and Stowaway, flanked by terminal repeats

Retroelements
• Two subgroups:
  – Retrotransposons contain long terminal repeats (LTRs) at both ends: an example is the copia element, which is found 20 – 60 times in the genome of D. melanogaster
  – Retroposons have no LTRs and have a poly-A tail:
    · Long interspersed nuclear elements (LINEs) are 6 – 8 kb in length and present in thousands of copies: the L1 family is present in 590,000 copies in the human genome (17% of total)
    · Short interspersed nuclear elements (SINEs) do not produce reverse transcriptase and so are not considered true retroelements: they vary in size from 130 – 300 bp and have copy numbers from 50,000 to over 1,000,000
    · Originally derived from RNA transcripts
• Endogenous retroviruses are proviruses which have been integrated into the germ-line of eukaryotes

Class II (DNA) elements
• Possess terminal repeats, but unlike those of retrotransposons these are short (generally < 100 bp) and usually inverted
• Encode a special transposase protein
• Best known types:
  – Mariner elements in animals
  – Hobo and P elements in Drosophila:
    · P elements can move between species and affect host phenotype
    · Increased infertility due to chromosome breakage (hybrid dysgenesis) occurs in D. melanogaster. P elements are not found in closely related species (D. simulans, D. sechellia, D. mauritiana) but are found in more distantly related species, e.g. the D. willistoni group: they were transferred after D. melanogaster split from its sibling species
    · Insertion can have a “knock-out” effect on phenotype, e.g. the white gene in flies lacking red eye pigment
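Worked example (C-value paradox and L1 copy number). The genome sizes and the L1 figures quoted in the slides can be checked with a little arithmetic. This is only a sketch: the function names are hypothetical, and the L1 copy number is taken as 590,000, which is an assumption about the figure intended in the slide.

```python
# Figures quoted in the slides (kbp = thousand base pairs).
AMOEBA_DUBIA_KBP = 670_000_000
HUMAN_KBP = 3_300_000


def c_value_ratio(larger_kbp: int, smaller_kbp: int) -> float:
    """Return how many times larger one haploid genome is than another."""
    return larger_kbp / smaller_kbp


def mean_copy_length_bp(genome_kbp: int, fraction: float, copies: int) -> float:
    """Mean length (bp) of one copy of a repeat family that occupies
    `fraction` of the genome and is present in `copies` copies."""
    return genome_kbp * 1_000 * fraction / copies


ratio = c_value_ratio(AMOEBA_DUBIA_KBP, HUMAN_KBP)
# Assumed inputs: 17% of the human genome, 590,000 copies of L1.
l1_mean = mean_copy_length_bp(HUMAN_KBP, 0.17, 590_000)

print(f"A. dubia / human C-value ratio: {ratio:.0f}")
print(f"Mean L1 copy length: {l1_mean:.0f} bp")
```

Under these assumptions the ratio comes out at roughly 200, as the slide states, and the mean L1 copy is a little under 1 kb. That is well below the 6 – 8 kb quoted for LINEs, which would suggest that most copies are shorter than full length.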
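Worked example (bacterial promoter motifs). The consensus motifs listed under "Regulation of gene expression" can be located in a sequence by exact string matching. The sketch below is illustrative only: `find_motifs` is a hypothetical helper, the toy sequence is invented so that each motif occurs once, and real promoters usually differ from the consensus, so exact matching will miss many of them.

```python
# Consensus motifs quoted in the slides.
BACTERIAL_MOTIFS = {
    "-35 site": "TTGACA",
    "Pribnow box": "TATAAT",
    "Shine-Dalgarno box": "AGGAGG",
}


def find_motifs(seq: str, motifs: dict[str, str]) -> dict[str, list[int]]:
    """Return the 0-based start positions of every exact match of each motif."""
    seq = seq.upper()
    hits: dict[str, list[int]] = {}
    for name, motif in motifs.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start)
            # Step one base forward so overlapping matches are also found.
            start = seq.find(motif, start + 1)
        hits[name] = positions
    return hits


# Hypothetical toy sequence (not a real promoter).
toy_promoter = (
    "GC" + "TTGACA" + "C" * 17 + "TATAAT" + "CCCGCCA"
    + "CCGC" + "AGGAGG" + "CCGCCGC" + "ATG"
)

promoter_hits = find_motifs(toy_promoter, BACTERIAL_MOTIFS)
for name, positions in promoter_hits.items():
    print(name, positions)
```

A position weight matrix would be the more usual tool for degenerate motifs; exact matching is used here only because it maps directly onto the consensus strings given in the slides.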
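Worked example (the GT-AG rule). The statement under "Introns" that introns nearly always begin with GT and end with AG can be expressed as a simple check, followed by removal of the intron to join the flanking exons. This is a minimal sketch: the function names, the toy gene and the intron coordinates are all hypothetical, and real splice-site recognition depends on more than these two dinucleotides.

```python
def is_canonical_intron(intron: str) -> bool:
    """True if the intron starts with GT and ends with AG (the GT-AG rule)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")


def splice(gene: str, introns: list[tuple[int, int]]) -> str:
    """Remove introns and join the exons.

    `introns` holds (start, end) pairs as 0-based, half-open intervals,
    sorted by position and non-overlapping.
    """
    exons = []
    pos = 0
    for start, end in introns:
        if not is_canonical_intron(gene[start:end]):
            raise ValueError(f"intron at {start}-{end} does not follow the GT-AG rule")
        exons.append(gene[pos:start])  # exon lying before this intron
        pos = end
    exons.append(gene[pos:])  # final exon
    return "".join(exons)


# Hypothetical toy gene: exon 1 + intron + exon 2.
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "GATTAA"
spliced = splice(toy_gene, [(6, 18)])
print(spliced)
```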
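Worked example (finding microsatellites). Since microsatellites are described under "Tandemly repeated DNA" as generally dinucleotide repeats, candidate loci can be found with a regular expression that uses a backreference. The helper name, the toy sequence and the threshold of five repeat units are assumptions made for illustration, not values from the slides.

```python
import re


def find_dinucleotide_repeats(seq: str, min_units: int = 5) -> list[tuple[int, str, int]]:
    """Return (start, motif, units) for each run in which a dinucleotide
    is repeated at least `min_units` times in tandem."""
    # ([ACGT]{2}) captures the motif; \1{n,} requires n further copies of it.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if motif[0] == motif[1]:
            continue  # skip homopolymer runs such as AAAAAA
        hits.append((match.start(), motif, len(match.group(0)) // 2))
    return hits


# Hypothetical toy sequence: one (CA)8 repeat, a poly-A run, a short (GT)3.
toy_locus = "GGATC" + "CA" * 8 + "TTGAC" + "A" * 12 + "GT" * 3
repeats = find_dinucleotide_repeats(toy_locus)
print(repeats)
```

Different alleles at such a locus would show up as different values of `units`, which is the property exploited in population genetics and forensics.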
