Genome evolution

Genome structure and evolution
Jan Pačes
Institute of Molecular Genetics AS CR
sizes of selected completed genomes
genome                      chromosomes   size                genes
Mycoplasma genitalium       -             0.58 Mbp            521
Escherichia coli            -             4.6 Mbp (5.4 Mbp)   4 377 (5 416)
Saccharomyces cerevisiae    16            12.5 Mbp            5 770
Caenorhabditis elegans      6             ~100 Mbp            19 427
Arabidopsis thaliana        5             ~115 Mbp            ~28 k
Drosophila melanogaster     5             ~122 Mbp            13 379
Homo sapiens                24            ~3.3 Gbp            ~22.5 k
genome complexity
genome sizes
Arabidopsis thaliana

genome size: ~100 Mbp
Psilotum nudum

genome size: ~250 Gbp
irregular genome sizes?

Schizosaccharomyces pombe

fission yeast, genome smaller than that of many bacteria
genome 12 462 637 bp, 4 929 genes

Mimivirus

virus of an amoeba
genome 1 181 404 bp, 1 262 genes
Tetraodon nigroviridis (pufferfish)


same number of genes as human, genome size only 1/10th
300 Mbp, 27 918 genes
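One way to compare the three examples above is gene density (genes per Mbp). The following is a minimal sketch that uses only the figures quoted on these slides; the dictionary layout and the function name `genes_per_mbp` are illustrative choices, not part of the original material.

```python
# Genome size (bp) and gene count, as quoted on the slides above.
GENOMES = {
    "Schizosaccharomyces pombe": (12_462_637, 4_929),
    "Mimivirus": (1_181_404, 1_262),
    "Tetraodon nigroviridis": (300_000_000, 27_918),
}


def genes_per_mbp(size_bp: int, genes: int) -> float:
    """Return gene density in genes per megabase pair."""
    return genes / (size_bp / 1_000_000)


# Print the density for each genome, rounded to whole genes per Mbp.
for name, (size_bp, genes) in GENOMES.items():
    print(f"{name}: {genes_per_mbp(size_bp, genes):.0f} genes per Mbp")
```

With these numbers the virus turns out to be the most gene-dense and the pufferfish the least, which is one way to see that genome size and gene count do not scale together.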
C-value





C-value refers to the amount of DNA contained
within a haploid nucleus
in picograms
among diploid organisms the terms C-value and
genome size are used interchangeably
in polyploids the C-value may represent two or more
genomes contained within the same nucleus
in animals, C-values range more than 3,300-fold



genome size (bp) = (0.978 × 10^9) × DNA content (pg)
DNA content (pg) = genome size (bp) / (0.978 × 10^9)
1 pg = 978 Mbp
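The conversion above can be written as two small helper functions. This is only a sketch of the stated formula; the names `pg_to_bp` and `bp_to_pg` are made up for illustration, and the 3.3 Gbp input is the human genome size quoted earlier in this deck.

```python
BP_PER_PG = 0.978e9  # base pairs per picogram of DNA, from the formula above


def pg_to_bp(picograms: float) -> float:
    """Convert a DNA mass in picograms to a genome size in base pairs."""
    return picograms * BP_PER_PG


def bp_to_pg(base_pairs: float) -> float:
    """Convert a genome size in base pairs to a DNA mass in picograms."""
    return base_pairs / BP_PER_PG


print(f"1 pg corresponds to {pg_to_bp(1.0) / 1e6:.0f} Mbp")
print(f"3.3 Gbp corresponds to {bp_to_pg(3.3e9):.2f} pg")
```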
genome sizes


0.0023 pg in the parasitic microsporidium Encephalitozoon intestinalis
1 400 pg in a protist, the free-living amoeba Chaos chaos
Gregory T http://www.genomesize.com
C-value enigma





What types of non-coding DNA are found in different
eukaryotic genomes, and in what proportions?
From where does this non-coding DNA come, and
how is it spread and/or lost from genomes over
time?
What effects, or perhaps even functions, does this
non-coding DNA have for chromosomes, nuclei,
cells, and organisms?
Why do some species exhibit remarkably
streamlined chromosomes, while others possess
massive amounts of non-coding DNA?
What is the minimal genome?
e-cell

model and reconstruct biological phenomena in silico
http://www.e-cell.org
Synthetic genomes

Mycoplasma laboratorium


Gibson D, et al. (2008): Complete Chemical Synthesis,
Assembly, and Cloning of a Mycoplasma genitalium
Genome. Science. DOI: 10.1126/science.1151721
Synthia


synthetic bacterium: a genome derived from that of
Mycoplasma mycoides was chemically synthesized from scratch
and transplanted into a Mycoplasma capricolum cell
Gibson D, et al. (2010): Creation of a bacterial cell
controlled by a chemically synthesized genome. Science.
DOI: 10.1126/science.1190719
just for fun – watermarks
VENTERINSTITVTE
CRAIGVENTER
HAMSMITH
CINDIANDCLYDE
GLASSANDCLYDE
"TO LIVE, TO ERR, TO FALL, TO TRIUMPH, TO
RECREATE LIFE OUT OF LIFE."
"SEE THINGS NOT AS THEY ARE, BUT AS
THEY MIGHT BE."
"WHAT I CANNOT BUILD, I CANNOT
UNDERSTAND."
Rhodobacter capsulatus, GC content
Homo sapiens, gene distribution
Saccone S, et al. (2001) Chromosome Res.
structure of human genome









To date, 3,164.7 million nucleotides have been read.
The average gene is 3 thousand nucleotides long; the longest
gene (dystrophin) is 2.4 million nucleotides long.
The number of genes is between 20k and 30k (23k).
Less than 2% of the genome codes for protein.
The function of more than 50% of the genes is unknown.
DNA is more than 99.9% identical between all humans.
Repetitive elements, which do not code for proteins ("junk
DNA"), make up more than 50% of the human genome.
The entropy rate is around 1.7 (0.9 for the Y chromosome).
Around 20% of our genome is transcribed.
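The entropy figure above is a statistic of the nucleotide sequence. As a rough illustration only, the sketch below computes the Shannon entropy of overlapping k-mers per nucleotide, assuming the unit is bits per base. The function name, the default k, and the toy sequences are assumptions, and this simple estimator is not necessarily the method behind the numbers on the slide.

```python
from collections import Counter
from math import log2


def entropy_per_base(seq: str, k: int = 1) -> float:
    """Shannon entropy of overlapping k-mers, divided by k (bits per base)."""
    seq = seq.upper()
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    if not kmers:
        raise ValueError("sequence is shorter than k")
    total = len(kmers)
    # Sum p * log2(1/p) over the observed k-mer frequencies.
    h = sum((n / total) * log2(total / n) for n in Counter(kmers).values())
    return h / k


print(entropy_per_base("ACGT" * 50))  # all four bases equally frequent
print(entropy_per_base("A" * 200))    # a single repeated base
```

A sequence that uses all four bases equally reaches the 2-bit maximum for k = 1, while a homopolymer carries no information. Values below 2 therefore indicate compositional bias or repetitiveness.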
importance of “junk” DNA

syncytin (adapted ancestral env polyprotein)


social behavior in rodents (and possibly humans)


DeVries AL and Cheng C-HC (2005): Antifreeze proteins in polar fishes. Fish
Physiology
source of microRNAs


Peaston A, et al (2004): Retrotransposons Regulate Host Genes in Mouse Oocytes
and Preimplantation Embryos. Developmental Cell
evolution of sequences, for example, an antifreeze-protein gene in a
species of fish


Hammock EA, Young LJ (2005): Microsatellite instability generates diversity in
brain and sociobehavioral traits. Science
regulation of gene expression and promotion of genetic diversity


Blond JL, et al (1999): Molecular characterization and placental expression of HERV-W,
a new human endogenous retrovirus family. J Virol
Woolfe A, et al (2005): Highly conserved non-coding sequences are associated
with vertebrate development. PLoS Biol
LINE-1 capable of repairing broken strands of DNA.

Morrish TA, et al (2002): DNA repair mediated by endonuclease-independent LINE-1 retrotransposition. Nature Genetics
synthesizing non-natural parts from natural
genomic template





Journal of Biological Engineering 2009, 3:2
doi:10.1186/1754-1611-3-2
Pawan K Dhar, Chaw Su Thwin, Kyaw Tun, Yuko Tsumoto,
Sebastian Maurer-Stroh, Frank Eisenhaber and Uttam Surana
The current knowledge of genes and proteins comes from 'naturally
designed' coding and non-coding regions. It would be interesting to move
beyond natural boundaries and make user-defined parts. To explore this
possibility we made six non-natural proteins in E. coli. We also studied their
potential tertiary structure and phenotypic outcomes.
The chosen intergenic sequences were amplified and expressed using
pBAD 202/D-TOPO vector. All six proteins showed significantly low similarity
to the known proteins in the NCBI protein database. The protein expression
was confirmed through Western blot. The endogenous expression of one of
the proteins resulted in the cell growth inhibition. The growth inhibition was
completely rescued by culturing cells in the inducer-free medium.
Computational structure prediction suggests globular tertiary structure for
two of the six non-natural proteins synthesized.
main events in genome evolution

mutations (SNP)
duplications
rearrangements
horizontal transfer

parasitic DNA



how and where to find transposons

Repbase



database of repetitive elements
http://www.girinst.org/repbase
RepeatMasker


search for repetitions in genome sequence
http://www.repeatmasker.org
repetitive elements in human genome

Transposons: transposon-derived repeats,
interspersed repeats
45% of the genome

Micro- and minisatellites: simple sequence repeats,
repetition of simple short direct repeats
3% of the genome

Duplications: duplications of genome segments of
different length (10 - 300 kb); inter- and intrachromosomal
3.3% of the genome

Other types of repetitions: centromeric and telomeric
repeats
IHGSC, Nature 2001
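Simple sequence repeats, as described above, are short motifs repeated directly in tandem, so a regular expression with a back-reference is enough to find them in a toy setting. The following is a sketch, not how RepeatMasker works. The function name `find_ssrs` and the thresholds (motif of 1 to 6 bases, at least 4 copies) are arbitrary illustrative choices.

```python
import re


def find_ssrs(seq: str, min_copies: int = 4, max_motif: int = 6):
    """Find perfect tandem repeats of short motifs.

    Returns a list of (start, motif, copies) tuples with 0-based starts.
    """
    # A lazily matched motif followed by at least min_copies - 1 more copies.
    pattern = re.compile(
        r"([ACGT]{1,%d}?)\1{%d,}" % (max_motif, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        motif = m.group(1)
        hits.append((m.start(), motif, len(m.group(0)) // len(motif)))
    return hits


print(find_ssrs("GGCACACACACATT"))
```

The lazy quantifier makes the regex prefer the shortest motif, so a run of A is reported as a mononucleotide repeat rather than as a repeat of AA.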
transposons in human (vertebrate) genome


DNA transposons
retrotransposons
RNA as intermediate, reverse transcription
LTR retrotransposons (similar to retroviruses)
polyA retrotransposons (colinear with mRNA, polyA)
human chromosome 21
DNA transposons





2-3 kb
terminal inverted repeats (50 - 100 bp)
cut-and-paste mechanism
3% of the genome
at least 7 classes, some of them unrelated
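A terminal inverted repeat means that the start of the element reads the same as the reverse complement of its end. The sketch below measures the longest perfect match of that kind. The function names and the toy element are hypothetical, and real inverted repeats are often imperfect, so a practical tool would allow mismatches.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def terminal_inverted_repeat_length(element: str, max_len: int = 100) -> int:
    """Length of the longest perfect terminal inverted repeat.

    Counts how many leading bases equal the reverse complement of the
    trailing bases, up to max_len or half the element length.
    """
    element = element.upper()
    rc = reverse_complement(element)
    limit = min(max_len, len(element) // 2)
    n = 0
    while n < limit and element[n] == rc[n]:
        n += 1
    return n


# Toy element: CAGTG at the start pairs with CACTG at the end.
print(terminal_inverted_repeat_length("CAGTG" + "AAAAATTTCC" + "CACTG"))
```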
LTR retrotransposons






LTR – long terminal repeat
Human Endogenous Retroviruses (HERVs)
RNA intermediate (RNA pol. II)
short insertional duplications (4-6 bp)
8 % of the genome
100 000 elements, tens of families
LINE1 (L1) elements








LINE – long interspersed elements
polyA (non-LTR) retrotransposons
RNA intermediate (internal promoter for RNA pol. II)
insertion duplications of different length (5-15 bp)
insertion preferences (TT | AAAA)
17 % of genome
500 000 elements, often truncated at the 5' end
30-60 active LINE1 elements in genome
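The insertion preference listed above (TT | AAAA) can be located in a sequence with a simple scan. This is a minimal sketch that looks only at the given strand and only for the exact consensus. The function name is hypothetical, and placing the reported position between TT and AAAA follows the notation on the slide.

```python
import re


def find_l1_target_sites(seq: str, motif: str = "TTAAAA"):
    """Return 0-based positions of the TT | AAAA boundary on the given strand."""
    seq = seq.upper()
    # A lookahead reports overlapping occurrences as well.
    return [m.start() + 2 for m in re.finditer(f"(?={motif})", seq)]


print(find_l1_target_sites("GGTTAAAACC"))
```

A fuller analysis would also scan the reverse complement and tolerate degenerate sites, since insertions are not restricted to perfect matches.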
nonautonomous elements

They do not encode the enzymes needed for their own
transposition.

For each class of autonomous elements there are
nonautonomous elements. These use the replication
mechanism specific to the corresponding autonomous
elements.
SINE (Alu) elements







SINE – short interspersed elements
polyA (non-LTR) retrotransposons
RNA intermediate (internal promoter for RNA pol. III)
insertion duplications (5-15 bp)
insertion preferences (TT | AAAA)
10 % of genome
1 000 000 elements, often truncated at the 5' end
processed pseudogenes






colinear with mRNA
missing introns and promoters; polyA
often truncated at the 5' end
bordered by direct repeats of different length (4-15 bp)
insertion sites are similar to LINE1 transposition
generated by L1
coevolution of “DNA parasites”
DNA transposons
LTR retrotransposons
polyA retrotransposons
HERV16 - example
http://hervd.img.cas.cz
1000 Genomes Project
current status
Trio project: two families with ~42x coverage


Yoruba and Caucasian
Low-coverage project: ~5x coverage of unrelated
individuals

60 Yoruba, 60 Caucasians, 30 Han, 30 Japanese

Exon project: 8000 exons (900 genes) by capture
array, >50x coverage, 700 unrelated individuals

+ 2 individual sequences (Watson and Venter)
1000GPC, Nature 2010
stability / fluidity of the genome


each person carries ~200 to 300 loss-of-function variants in
annotated genes and 50-100 variants previously implicated in
inherited disorders
germline substitution rate of ~10^-8 per base pair per
generation
1000GPC, Nature 2010
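The substitution rate above translates into an expected number of new substitutions per generation once it is multiplied by genome size. The following is a back-of-the-envelope sketch. It assumes the ~3.3 Gbp haploid human genome size quoted earlier in this deck and a uniform rate across the genome, and the function name is illustrative.

```python
SUBSTITUTION_RATE = 1e-8   # per base pair per generation, as quoted above
HAPLOID_GENOME_BP = 3.3e9  # human genome size quoted earlier in this deck


def expected_new_substitutions(rate: float, genome_bp: float, ploidy: int = 2) -> float:
    """Expected new germline substitutions per generation.

    Assumes the same rate at every position (a simplification).
    """
    return rate * genome_bp * ploidy


per_haploid = expected_new_substitutions(SUBSTITUTION_RATE, HAPLOID_GENOME_BP, ploidy=1)
per_diploid = expected_new_substitutions(SUBSTITUTION_RATE, HAPLOID_GENOME_BP)
print(f"about {per_haploid:.0f} per haploid genome, {per_diploid:.0f} per diploid genome")
```

Under these assumptions each newborn carries a few dozen new substitutions.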
ENCODE
Encyclopedia Of DNA Elements
Raney, NAR 2010
genome browsers




Golden Path
http://genome.ucsc.edu
ENSEMBL
http://www.ensembl.org
that’s it, thank you
Institute of Molecular Genetics AS CR
Free and Open Bioinformatics Association