Genome structure and evolution
Jan Pačes
Institute of Molecular Genetics AS CR

sizes of selected completed genomes

  genome                     chromosomes  size                genes
  Mycoplasma genitalium                   0.58 Mbp            521
  Escherichia coli                        4.6 Mbp (5.4 Mbp)   4 377 (5 416)
  Saccharomyces cerevisiae   16           12.5 Mbp            5 770
  Caenorhabditis elegans     6            ~100 Mbp            19 427
  Arabidopsis thaliana       5            ~115 Mbp            ~28 k
  Drosophila melanogaster    5            ~122 Mbp            13 379
  Homo sapiens               24           ~3.3 Gbp            ~22.5 k

genome complexity

genome sizes
  Arabidopsis thaliana: genome size ~100 Mbp
  Psilotum nudum: genome size ~250 Gbp

irregular genome sizes?
  Schizosaccharomyces pombe
    fission yeast, genome smaller than that of many bacteria
    genome 12 462 637 bp, 4 929 genes
  Mimivirus
    virus of an amoeba
    genome 1 181 404 bp, 1 262 genes
  Tetraodon nigroviridis (pufferfish)
    same number of genes as human, genome size only 1/10th
    300 Mbp, 27 918 genes

C-value
  The C-value is the amount of DNA contained within a haploid nucleus, in picograms.
  Among diploid organisms, the terms C-value and genome size are used interchangeably.
  In polyploids, the C-value may represent two or more genomes contained within the same nucleus.
  In animals, C-values range more than 3,300-fold.

  genome size (bp) = (0.978 × 10^9) × DNA content (pg)
  DNA content (pg) = genome size (bp) / (0.978 × 10^9)
  1 pg = 978 Mb

genome sizes
  0.0023 pg in the parasitic microsporidium Encephalitozoon intestinalis
  1 400 pg in a protist, the free-living amoeba Chaos chaos
  Gregory T, http://www.genomesize.com

C-value enigma
  What types of non-coding DNA are found in different eukaryotic genomes, and in what proportions?
  From where does this non-coding DNA come, and how is it spread and/or lost from genomes over time?
  What effects, or perhaps even functions, does this non-coding DNA have for chromosomes, nuclei, cells, and organisms?
  Why do some species exhibit remarkably streamlined chromosomes, while others possess massive amounts of non-coding DNA?

What is the minimal genome?
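
As a side note to the C-value slide above: the conversion it states (1 pg = 978 Mb) can be sketched in a few lines of Python. The function names pg_to_bp and bp_to_pg are made up for this illustration; the constant and the two example C-values are the ones quoted on the slides.

```python
# Conversion between DNA content (pg) and genome size (bp),
# using the factor quoted on the C-value slide: 1 pg = 0.978 x 10^9 bp.
BP_PER_PG = 0.978e9


def pg_to_bp(dna_content_pg):
    """Genome size in base pairs for a C-value given in picograms."""
    return dna_content_pg * BP_PER_PG


def bp_to_pg(genome_size_bp):
    """C-value in picograms for a genome size given in base pairs."""
    return genome_size_bp / BP_PER_PG


# The two extremes quoted on the "genome sizes" slide.
examples_pg = {
    "Encephalitozoon intestinalis": 0.0023,
    "Chaos chaos": 1400,
}

for species, c_value in examples_pg.items():
    print(f"{species}: {c_value} pg ~ {pg_to_bp(c_value) / 1e6:.2f} Mbp")

# The reverse direction, for the ~3.3 Gbp human genome from the table.
print(f"Homo sapiens: 3.3 Gbp ~ {bp_to_pg(3.3e9):.2f} pg")
```

With these inputs the smallest quoted C-value corresponds to roughly 2.25 Mbp and the largest to roughly 1.37 million Mbp, which is the spread the C-value enigma questions refer to.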
e-cell
  model and reconstruct biological phenomena in silico
  http://www.e-cell.org

Synthetic genomes
  Mycoplasma laboratorium
    Gibson D, et al. (2008): Complete Chemical Synthesis, Assembly, and Cloning of a Mycoplasma genitalium Genome. Science. DOI: 10.1126/science.1151721
  Synthia
    synthetic bacterium whose genome, derived from that of Mycoplasma mycoides, was synthesized from scratch and transplanted into a Mycoplasma capricolum cell
    Gibson D, et al. (2010): Creation of a bacterial cell controlled by a chemically synthesized genome. Science. DOI: 10.1126/science.1190719

just for fun: watermarks
  VENTERINSTITVTE
  CRAIGVENTER
  HAMSMITH
  CINDIANDCLYDE
  GLASSANDCLYDE
  "TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO RECREATE LIFE OUT OF LIFE."
  "SEE THINGS NOT AS THEY ARE, BUT AS THEY MIGHT BE."
  "WHAT I CANNOT BUILD, I CANNOT UNDERSTAND."

Rhodobacter capsulatus, GC content

Homo sapiens, gene distribution
  Saccone S, et al. (2001) Chromosome Res.

structure of human genome
  To date, 3,164.7 million nucleotides have been sequenced.
  The average gene is about 3 thousand nucleotides long; the longest gene (dystrophin) is 2.4 million nucleotides long.
  The number of genes is between 20k and 30k (23k).
  Less than 2% of the genome codes for protein.
  The function of more than 50% of the genes is unknown.
  DNA is more than 99.9% identical between all humans.
  Repetitive elements that do not code for proteins ("junk DNA") make up more than 50% of the human genome.
  The entropy rate is around 1.7 (0.9 for the Y chromosome).
  Around 20% of our genome is transcribed.

importance of "junk" DNA
  syncytin (adapted ancestral env polyprotein)
  social behavior in rodents (and possibly humans)
  DeVries AL and Cheng C-HC (2005): Antifreeze proteins in polar fishes. Fish Physiology
  source of microRNAs
  Peaston A, et al. (2004): Retrotransposons Regulate Host Genes in Mouse Oocytes and Preimplantation Embryos.
  Developmental Cell
  evolution of sequences, for example an antifreeze-protein gene in a species of fish
  Hammock EA, Young LJ (2005): Microsatellite instability generates diversity in brain and sociobehavioral traits. Science
  regulation of gene expression and promotion of genetic diversity
  Blond JL (1999): Molecular characterization and placental expression of HERV-W, a new human endogenous retrovirus family. J Virol
  Woolfe A, et al. (2005): Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol
  LINE-1 is capable of repairing broken strands of DNA.
  Morrish TA, et al. (2002): DNA repair mediated by endonuclease-independent LINE1 retrotransposition. Nature Genetics

synthesizing non-natural parts from natural genomic template
  Journal of Biological Engineering 2009, 3:2, doi:10.1186/1754-1611-3-2
  Pawan K Dhar, Chaw Su Thwin, Kyaw Tun, Yuko Tsumoto, Sebastian Maurer-Stroh, Frank Eisenhaber and Uttam Surana

  The current knowledge of genes and proteins comes from 'naturally designed' coding and non-coding regions. It would be interesting to move beyond natural boundaries and make user-defined parts. To explore this possibility we made six non-natural proteins in E. coli. We also studied their potential tertiary structure and phenotypic outcomes. The chosen intergenic sequences were amplified and expressed using pBAD 202/D-TOPO vector. All six proteins showed significantly low similarity to the known proteins in the NCBI protein database. The protein expression was confirmed through Western blot. The endogenous expression of one of the proteins resulted in the cell growth inhibition. The growth inhibition was completely rescued by culturing cells in the inducer-free medium. Computational structure prediction suggests globular tertiary structure for two of the six non-natural proteins synthesized.
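
To make the idea in the abstract above concrete, here is a minimal sketch of what "expressing" a non-coding stretch means at the sequence level: reading it in triplets with the standard genetic code. This is not the authors' pipeline. The function name translate and the toy sequence are invented for the illustration, and a real analysis would also have to consider reading frames, both strands, and the vector's start codon.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a DNA string from its first base, in frame,
    stopping at the first stop codon or at the end of the sequence.
    Codons containing anything other than A, C, G, T become 'X'."""
    dna = dna.upper()
    peptide = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":  # stop codon
            break
        peptide.append(amino_acid)
    return "".join(peptide)


# Made-up sequence standing in for an amplified intergenic fragment.
toy_fragment = "ATGGCTAAAGGTTGA"
print(translate(toy_fragment))  # MAKG
```

The resulting peptide string is the kind of sequence one could then compare against a protein database, as the abstract describes for the six expressed proteins.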
main events in genome evolution
  mutations (SNP)
  duplications
  rearrangements
  horizontal transfer
  parasitic DNA

how and where to find transposons
  Repbase: database of repetitive elements
    http://www.girinst.org/repbase
  RepeatMasker: search for repeats in a genome sequence
    http://www.repeatmasker.org

repetitive elements in human genome
  Transposons: transposon-derived repeats, interspersed repeats
    45% of the genome
  Micro- and minisatellites: simple sequence repeats, direct repeats of a short motif
    3% of the genome
  Duplications: duplications of genome segments of different length (10-300 kb); inter- and intrachromosomal
    3.3% of the genome
  Other types of repeats: centromeric and telomeric repeats
  IHGSC, Nature 2001

transposons in human (vertebrate) genome
  DNA transposons
  retrotransposons: RNA as intermediate, reverse transcription
    LTR transposons (similar to retroviruses)
    polyA retrotransposons (colinear with mRNA, polyA)

human chromosome 21

DNA transposons
  2-3 kb
  terminal inverted repeats (50-100 bp)
  cut-and-paste mechanism
  3% of the genome
  at least 7 classes, some of them not related

LTR retrotransposons
  LTR: long terminal repeat
  Human Endogenous Retroviruses (HERVs)
  RNA intermediate (RNA pol. II)
  short insertional duplications (4-6 bp)
  8% of the genome
  100 000 elements, tens of families

LINE1 (L1) elements
  LINE: long interspersed elements
  polyA (non-LTR) retrotransposons
  RNA intermediate (internal promoter for RNA pol. II)
  insertion duplications of different length (5-15 bp)
  insertion preference (TT | AAAA)
  17% of the genome
  500 000 elements, often truncated at the 5' end
  30-60 active LINE1 elements in the genome

nonautonomous elements
  They do not encode the enzymes needed for their own transposition.
  For each class of autonomous elements there are corresponding nonautonomous elements.
  Such elements use the replication mechanism specific to the corresponding autonomous elements.
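
The genome fractions and copy numbers on the slides above imply an average length per copy, which gives a quick sanity check on the "often truncated at the 5' end" remark. The sketch below is only back-of-the-envelope arithmetic: it assumes the ~3.3 Gbp human genome size from the genome table, and the name mean_copy_length is invented for this example.

```python
# Approximate haploid human genome size, taken from the genome size table.
GENOME_SIZE_BP = 3.3e9

# (fraction of genome, approximate number of copies), as quoted on the slides.
ELEMENTS = {
    "LTR retrotransposons": (0.08, 100_000),
    "LINE1": (0.17, 500_000),
}


def mean_copy_length(fraction, copies, genome_size=GENOME_SIZE_BP):
    """Average length of one copy, in bp, implied by the genome
    fraction occupied by a repeat class and its copy number."""
    return fraction * genome_size / copies


for name, (fraction, copies) in ELEMENTS.items():
    length = mean_copy_length(fraction, copies)
    print(f"{name}: ~{length:.0f} bp per copy")
```

The LINE1 figure comes out at roughly 1.1 kb per copy. A full-length L1 element is about 6 kb (a value not given on the slides), so the average copy is far shorter than an intact element, consistent with most copies being 5'-truncated.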
SINE (Alu) elements
  SINE: short interspersed elements
  polyA (non-LTR) retrotransposons
  RNA intermediate (internal promoter for RNA pol. III)
  insertion duplications (5-15 bp)
  insertion preference (TT | AAAA)
  10% of the genome
  1 000 000 elements, often truncated at the 5' end

processed pseudogenes
  colinear with mRNA
  missing introns and promoters; polyA
  often 5' truncated
  bordered by direct repeats of different length (4-15 bp)
  insertion sites are similar to those of LINE1
  transposition mediated by L1

coevolution of "DNA parasites"
  DNA transposons
  LTR retrotransposons
  polyA retrotransposons

HERV16 - example
  http://hervd.img.cas.cz

1000 Genomes Project: current status
  Trio project: two families with ~42x coverage
    Yoruba and Caucasian
  Low-coverage project: ~5x coverage of unrelated individuals
    60 Yoruba, 60 Caucasians, 30 Han, 30 Japanese
  Exon project: 8000 exons (900 genes) by capture array, >50x coverage, 700 unrelated individuals
  + 2 individual sequences (Watson and Venter)
  1000GPC, Nature 2010

stability / fluidity of the genome
  ~200 to 300 loss-of-function variants in annotated genes and 50-100 variants previously implicated in inherited disorders, per individual
  germline substitution rate: 10^-8 per base per generation
  1000GPC, Nature 2010

ENCODE
  Encyclopedia Of DNA Elements
  Raney, NAR 2010

genome browsers
  Golden Path: http://genome.ucsc.edu
  ENSEMBL: http://www.ensembl.org

that's it, thank you
  Institute of Molecular Genetics AS CR
  Free and Open Bioinformatics Association