- YSU Proteomics/Genomics Research Group

Introduction to Genomics
and the Tree of Life
Chapter 13
Extra-Reading
• Next-generation sequencers
– What can next-generation sequencers do for
genetics/genomics research?
• Comparative genomics
– What can we learn from comparative
genomics?
Outline of today’s lecture
Introduction: 5 perspectives, history of life
Genome-sequencing projects: chronology
Genome analysis: criteria, resequencing, metagenomics
DNA sequencing technologies: Sanger, 454, Solexa
Process of genome sequencing: centers, repositories
Genome annotation: features, prokaryotes, eukaryotes
Five approaches to genomics
As we survey the tree of life, consider these perspectives:
Approach I: cataloguing genomic information
Genome size; number of chromosomes; GC
content; isochores; number of genes; repetitive
DNA; unique features of each genome
Approach II: cataloguing comparative genomic information
Orthologs and paralogs; COGs; lateral gene transfer
Approach III: function; biological principles; evolution
How genome size is regulated; polyploidization;
birth and death of genes; neutral theory of
evolution; positive and negative selection; speciation
Approach IV: Human disease relevance
Approach V: Bioinformatics aspects
Algorithms, databases, websites
Page 519
Introduction
Lessons learned from comparative genomics
What have we learned about genes by comparing genomic
sequences?
What have we learned about regulation?
About 5% of the human genome is under purifying selection
Positively regulated regions
Mechanisms and history of mammalian evolution
Nonuniformity of neutral evolutionary rates within species
Nonuniformity of evolution along the branches of phylogeny
Learning more from existing data
Choice of species
Choice of tools
Future of comparative genomics
Levels of analysis in genomics
Level: DNA; RNA; protein; complexes; pathways; organelles; organs; individuals; species; genus; phylum; kingdom
Topics: genes, chromosomes; ESTs, ncRNA; ORFs, composition; binary, multimeric; variation and disease; speciation
Databases: GenBank; UniGene, GEO; UniProt; BIND; COGs, KEGG; HapMap; TaxBrowser; SGD; JAX mouse; FishBase; TOL
Definitions of terms
Genomics is the study of genomes (the DNA comprising an
organism) using the tools of bioinformatics.
Bioinformatics is the study of proteins, genes, and genomes using
computer algorithms and databases.
Systematics is the scientific study of the kinds and diversity of
organisms and of any and all relationships among them.
Classification is the ordering of organisms into groups on the
basis of their relationships. The relationships may be
evolutionary (phylogenetic) or may refer to similarities of
phenotype (phenetic).
Taxonomy is the theory and practice of classifying organisms.
Pace (2001) described a tree
of life based on small subunit
rRNA sequences.
This tree shows the main
three branches described
by Woese and colleagues.
Fig. 13.1
Page 521
Molecular sequences as basis of trees
Historically, trees were generated primarily using
characters provided by morphological data. Molecular
sequence data are now commonly used, including
sequences (such as small-subunit RNAs) that are
highly conserved.
Visit the European Small Subunit Ribosomal RNA
database for 20,000 SSU rRNA sequences.
Page 523
Tree of life from David Hillis’ lab (based on ~3000 rRNAs)
animals
you are here
plants
protists
fungi
bacteria
archaea
http://www.zo.utexas.edu/faculty/antisense/Download.html
Ribosomal RNA Database
Ribosomal Database Project
http://rdp.cme.msu.edu/index.jsp
Santos, S. R. and Ochman, H. (2004). Identification and phylogenetic
sorting of bacterial lineages with universally conserved
genes and proteins. Environmental Microbiology 6(7):754-9.
►Download fusA (translation elongation factor 2 [EF-2])
►Obtain DNA in the fasta format
►Align by ClustalW in MEGA
►Create a neighbor-joining tree
Page 524
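The step between the alignment and the neighbor-joining tree is a pairwise distance matrix. The sketch below is not what MEGA does internally; it only illustrates one simple distance, the p-distance (proportion of differing sites), on an already aligned FASTA. The sequence names and bases are hypothetical placeholders, not real fusA data.

```python
from itertools import combinations


def parse_fasta(text):
    """Parse FASTA text into a dict of {name: sequence} (names are the first word of each header)."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = ""
        elif name is not None:
            records[name] += line.upper()
    return records


def p_distance(a, b):
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # ignore columns where either sequence has a gap
        compared += 1
        if x != y:
            diffs += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return diffs / compared


# Hypothetical toy alignment (assumption: made-up sequences for illustration only).
ALIGNED = """>seqA
ATGGCTAAAG
>seqB
ATGGCAAAAG
>seqC
ATGACAA-AG
"""

records = parse_fasta(ALIGNED)
for (n1, s1), (n2, s2) in combinations(records.items(), 2):
    print(f"{n1} vs {n2}: p-distance = {p_distance(s1, s2):.3f}")
```

A neighbor-joining program would take a matrix of such distances (usually corrected with a substitution model) as its input.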
European Small Subunit Ribosomal RNA database
(http://www.psb.ugent.be/rRNA/ssu/)
Figure: Neighbor-joining tree of ~150 fusA (GTPase) DNA sequences. Labeled groups include Clostridium, Yersinia pestis, Aquifex aeolicus, Rickettsia, Bacillus anthracis, Wolbachia, Mycoplasma, Mycobacterium, and Treponema.
History of life on earth
4.55 billion years ago (BYA): formation of earth (violent 100 MY period)
4.4-3.8 BYA: last ocean-evaporating impacts
3.9 BYA: oldest dated rocks
3.8 BYA: sun brightened to 70% of today’s luminosity
Atmosphere: ammonia, methane, or carbon dioxide.
Earliest life: RNA, protein
Source: Schopf J.W. (ed.), Life’s Origins (U. Calif. Press, 2002)
Page 521
Millions of years ago (MYA), scale 1000 to 0 (Proterozoic eon, Phanerozoic eon)
Events marked: deuterostome/protostome; insects; echinoderm/chordate; Cambrian explosion; land plants; Age of Reptiles ends
Page 522
Millions of years ago (MYA), scale 100 to 0
Events marked: mass extinction; dinosaurs extinct; mammalian radiation; human/chimp divergence
Page 522
Millions of years ago (MYA), scale 10 to 0
Events marked: Homo sapiens/chimp divergence; Australopithecus (Lucy); earliest stone tools; emergence of Homo erectus
Page 522
Years ago, scale 1,000,000 to 0
Events marked: Homo erectus emerges in Africa; mitochondrial Eve
Page 523
Years ago, scale 100,000 to 0
Events marked: emergence of anatomically modern H. sapiens; Neanderthal and Homo erectus disappear
Page 523
Years ago, scale 10,000 to 0
Events marked: “Ice Man” from Alps; earliest pyramids; Aristotle
Page 523
Years ago, scale 1,000 to 0
Events marked: algebra; Gutenberg; calculus; Darwin, Mendel
Page 523
Chronology of genome sequencing projects
We will next summarize the major achievements in
genome sequencing projects from a chronological
perspective.
Page 525
1976: first viral genome
Fiers et al. sequence bacteriophage MS2 (3,569 base pairs,
Accession NC_001417).
1977:
Sanger et al. sequence bacteriophage φX174.
This virus is 5,386 base pairs (encoding 11 genes).
See accession J02482; NC_001422.
Page 527
1981
Human mitochondrial genome
16,500 base pairs (encodes 13 proteins, 2 rRNA, 22 tRNA)
Today (10/09), over 1800 mitochondrial genomes sequenced
1986
Chloroplast genome
156,000 base pairs (most are 120 kb to 200 kb)
Page 527
Figure labels: mitochondrion; chloroplast; lack mitochondria (?)
Entrez Genomes organelle resource at NCBI
http://www.ncbi.nlm.nih.gov/genomes/ORGANELLES/organelles.html
There are >2100 eukaryotic organelles (10/09)
GOBASE: resource for organelle genomes
http://megasun.bch.umontreal.ca/gobase/
MitoDat: resource for organelle genomes
“This database is dedicated to the nuclear
genes specifying the enzymes, structural
proteins, and other proteins, many still not
identified, involved in mitochondrial
biogenesis and function. MitoDat
highlights predominantly human nuclear-encoded mitochondrial proteins.”
Not updated recently.
http://www-lecb.ncifcrf.gov/mitoDat/
MitoMap: resource for organelle genomes
http://www.mitomap.org/
It is possible to map mutations in human
mitochondrial DNA that are responsible for disease
1995: first genome of a free-living organism,
the bacterium Haemophilus influenzae
Page 530
1996: first eukaryotic genome
The complete genome sequence of the budding yeast
Saccharomyces cerevisiae was reported. We will
describe this genome soon.
Also in 1996, TIGR reported the sequence of the first
archaeal genome, Methanococcus jannaschii.
Page 532
1997:
More bacteria and archaea
Escherichia coli
4.6 megabases, 4200 proteins (38% of unknown function)
1998: first multicellular organism
Nematode Caenorhabditis elegans
97 Mb; 19,000 genes.
1999: first human chromosome
Chromosome 22 (49 Mb, 673 genes)
Page 532
1999: Human chromosome 22 sequenced
2000:
Fruitfly Drosophila melanogaster (13,000 genes)
Plant Arabidopsis thaliana
Human chromosome 21
2001: draft sequence of the human genome
(public consortium and Celera Genomics)
Page 534
Overview of genome analysis
• Selection of genomes for sequencing
• Sequence one individual genome, or several?
• How big are genomes?
• Genome sequencing centers
• Sequencing genomes: strategies
• When has a genome been fully sequenced?
• Repository for genome sequence data
• Genome annotation
Page 537
Applications of Genome Sequencing
Purpose | Template | Example
De novo sequencing | Genome sequencing | Sequencing >1000 influenza genomes
De novo sequencing | Ancient DNA | Extinct Neanderthal genome
De novo sequencing | Metagenomics | Human gut
Resequencing | Whole genomes | Individual humans
Resequencing | Genomic regions | Assessment of genomic rearrangements or disease-associated regions
Resequencing | Somatic mutations | Sequencing mutations in cancer
Transcriptome | Full-length transcripts; Serial Analysis of Gene Expression (SAGE) | Defining regulated messenger RNA transcripts
Transcriptome | Noncoding RNAs | Identifying and quantifying microRNAs in samples
Epigenetics | Methylation changes | Measuring methylation changes in cancer
Table 13.15 p.538
Overview of genome analysis
Fig. 13.8, p. 539
Criteria for selecting genomes for sequencing
Criteria include:
• genome size (some plant genomes are far larger than the human genome)
• cost
• relevance to human disease (or other disease)
• relevance to basic biological questions
• relevance to agriculture
Page 538
Recent projects:
Chicken
Chimpanzee
Cow
Dog
Fungi (many)
Honey bee
Sea urchin
Rhesus macaque
Page 540
Selection criteria
Selection of genomes for sequencing is based
on specific criteria.
For an overview, see a series of white papers posted on
the National Human Genome Research Institute (NHGRI)
website: http://www.genome.gov/10002154
For a description of NHGRI selection criteria, visit:
http://www.genome.gov/10001495
Page 540
Criteria for selecting genomes for sequencing
Sequence one individual genome, or several?
Try one…
--Each genome center may study one
chromosome from an organism
--It is necessary to measure polymorphisms
(e.g. SNPs) in large populations
For viruses, thousands of isolates may be sequenced.
For the human genome, cost is the impediment.
Page 540
Diversity of genome sizes
How big are genomes?
Viral genomes: 1 kb to 350 kb (Mimivirus: 1181 kb)
Bacterial genomes: 0.5 Mb to 13 Mb
Eukaryotic genomes: 8 Mb to 686 Gb (human: ~3 Gb)
Page 540
Genome sizes in nucleotide base pairs
Figure: ranges of genome size, on an axis from 10^4 to 10^11 bp, for plasmids, viruses, bacteria, fungi, plants, algae, insects, mollusks, bony fish, amphibians, reptiles, birds, and mammals.
The size of the human genome is ~3 × 10^9 bp; almost all of its complexity is in single-copy DNA.
The human genome is thought to contain ~30,000-40,000 genes.
http://www3.kumc.edu/jcalvet/PowerPoint/bioc801b.ppt
16 eukaryotic genome projects > 1000 megabases
Genus, species | Subgroup | Size (Mb) | #chr | Common name
Macropus eugenii | Mammals | 3800 | 8 | tammar wallaby
Oryctolagus cuniculus | Mammals | 3500 | 22 | rabbit
Cavia porcellus | Mammals | 3400 | 31 | guinea pig
Pan troglodytes | Mammals | 3100 | 24 | chimpanzee
Homo sapiens | Mammals | 3038 | 23 | human
Bos taurus | Mammals | 3000 | 30 | cow
Dasypus novemcinctus | Mammals | 3000 | 32 | nine-banded armadillo
Loxodonta africana | Mammals | 3000 | 28 | African savanna elephant
Sorex araneus | Mammals | 3000 | (not given) | European shrew
Rattus norvegicus | Mammals | 2750 | 21 | rat
Canis familiaris | Mammals | 2400 | 39 | dog
Zea mays | Land Plants | 2365 | 10 | corn
Aplysia californica | Other Animals | 1800 | 17 | California sea hare
Danio rerio | Fishes | 1700 | 25 | zebrafish
Gallus gallus | Birds | 1200 | 40 | chicken
Triphysaria versicolor | Land Plants | 1200 | (not given) | plant parasite
Ancient DNA projects
Special challenges:
• Ancient DNA is degraded by nucleases
• The majority of DNA in samples derives from unrelated
organisms such as bacteria that invaded after death
• Samples are often contaminated by modern
human DNA
• Determination of authenticity requires special controls,
and analysis of multiple independent extracts
Page 542
Metagenomics projects
Two broad areas:
• Environmental (ecological)
e.g. hot spring, ocean, sludge, soil
• Organismal
e.g. human gut, feces, lung
Page 543
Overview of genome analysis
Twenty genome sequencing centers contributed
to the public sequencing of the human genome.
Many of these are listed at the Entrez genomes site.
(Or see Table 19.3, page 803.)
Page 548
Two approaches to genome sequencing
Whole genome shotgun sequencing (Celera)
Hierarchical shotgun sequencing (public consortium)
Whole Genome Shotgun (from the NCBI website)
An approach used to decode an organism's genome
by shredding it into smaller fragments of DNA which
can be sequenced individually. The sequences of these
fragments are then ordered, based on overlaps in the
genetic code, and finally reassembled into the complete
sequence. The 'whole genome shotgun' (WGS) method is
applied to the entire genome all at once, while the
'hierarchical shotgun' method is applied to large,
overlapping DNA fragments of known location in
the genome.
Page 548
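The idea of ordering fragments by their overlaps can be illustrated with a toy greedy assembler. This is a minimal sketch under strong assumptions (error-free reads, a single strand, no repeats, and a made-up 15-base "genome"); real assemblers used for whole genome shotgun data are far more elaborate. The function names and the `min_len` parameter are hypothetical.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap until no overlaps remain."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while len(reads) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # remaining reads do not overlap: several contigs are left
        merged = reads[best_i] + reads[best_j][best_k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical fragments of the made-up sequence ATGGCGTACGTTAGC.
fragments = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(fragments))
```

If the fragments do not all overlap, the function returns more than one string, which loosely corresponds to a set of separate contigs.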
Human genome project: strategies
Whole genome shotgun sequencing (Celera)
-- given the computational capacity, this approach
is far faster than hierarchical shotgun sequencing
-- the approach was validated using Drosophila
Hierarchical shotgun method
Assemble contigs from various chromosomes, then
sequence and assemble them. A contig is a set of
overlapping clones or sequences from which a sequence
can be obtained. The sequence may be draft or finished.
A contig is thus a chromosome map showing the
locations of those regions of a chromosome where
contiguous DNA segments overlap. Contig maps are
important because they provide the ability to study a
complete, and often large segment of the genome by
examining a series of overlapping clones which then
provide an unbroken succession of information about
that region.
Page 548
Hierarchical shotgun sequencing (public consortium)
-- 29,000 BAC clones
-- 4.3 billion base pairs
-- it is helpful to assign chromosomal loci to
sequenced fragments, especially in light of
the large amount of repetitive DNA in the genome
-- individual chromosomes assigned to centers
Source: IHGSC (2001)
Sequenced-clone contigs are merged to form
scaffolds of known order and orientation
Source: IHGSC (2001)
Fig. 19.8
Page 804
When has a genome been fully sequenced?
A typical goal is to obtain five- to ten-fold coverage.
Finished sequence: a clone insert is contiguously
sequenced to a high-quality standard (error rate of
0.01%). There are usually no gaps in the sequence.
Draft sequence: clone sequences may contain several
regions separated by gaps. The true order and
orientation of the pieces may not be known.
Page 549
Fold coverage | % sequenced
0.25 | 22
0.5 | 39
0.75 | 53
1 | 63
2 | 87.5
3 | 95
4 | 98.2
5 | 99.4
6 | 99.75
7 | 99.91
8 | 99.97
9 | 99.99
10 | 99.995
Page 551
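The relationship between fold coverage and percent sequenced is close to what a simple Poisson model predicts: if reads land uniformly at random, the expected fraction of bases covered at least once at c-fold coverage is 1 - e^(-c). The slides do not state the formula behind the numbers, so treat the sketch below as an assumption that reproduces values near the tabulated ones, not as the source of the table. The function names are hypothetical.

```python
import math


def fold_coverage(num_reads, read_length, genome_size):
    """Average number of times each base is sequenced: c = N * L / G."""
    return num_reads * read_length / genome_size


def percent_sequenced(c):
    """Expected % of the genome covered at least once under a Poisson model: 100 * (1 - e^-c)."""
    return 100.0 * (1.0 - math.exp(-c))


if __name__ == "__main__":
    for c in (0.25, 0.5, 0.75, 1, 2, 3, 4, 5, 10):
        print(f"{c:>5} fold  {percent_sequenced(c):7.3f}% sequenced")
```

For example, `fold_coverage(30_000_000, 500, 3_000_000_000)` gives 5-fold coverage for 30 million reads of 500 bases on a 3 Gb genome.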
Trace repository for genome sequence data
Raw data from many genome sequencing projects
are stored at the trace archive at NCBI or EBI
(main NCBI page, bottom right).
Also visit: http://trace.ensembl.org/
As of October 2008, the Trace Archive had ~2 billion traces.
As of October 2009 it has ~2,108,000,000 traces.
Page 552
Fig. 13.12
Page 553
http://www.jgi.doe.gov/education/
http://www.youtube.com/watch?v=RLsb0pMx_oU&feature=channel_page
A Howard Hughes Medical Institute (HHMI)
video production describing the Whole
Genome Shotgun Sequencing process at
the JGI. This video is viewable on
YouTube in three parts: Part 1 (chapters 1-5), Part 2 (chapters 6-8), Part 3 (chapters 9-14).
Role of comparative genomics
Phylogenetic footprinting
Phylogenetic shadowing
Population shadowing
Page 552
Fig. 13.13
Page 554
Fig. 13.14
Page 555
Genome annotation
Information content in genomic DNA includes:
-- nucleotide composition (GC content)
-- repetitive DNA elements
-- protein-coding genes, other genes
Page 555
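Nucleotide composition is the simplest of these features to compute. A minimal sketch of a GC content calculation follows; it assumes an uppercase or lowercase DNA string and ignores ambiguous bases such as N. The function name and the example sequences are hypothetical.

```python
def gc_content(seq):
    """Percent of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / (gc + at)


# Made-up example sequences (assumption: not taken from any real genome).
for name, seq in [("example1", "ATGCGCGTTA"), ("example2", "ggccnnat")]:
    print(f"{name}: {gc_content(seq):.1f}% GC")
```

In practice GC content is often computed in sliding windows along a chromosome rather than over a whole sequence at once.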
GC content varies across genomes
Figure: histograms of the number of species in each GC class, plotted against GC content (%) from 20 to 80, with separate panels for bacteria, plants, invertebrates, and vertebrates.
Fig. 13.15
Page 556
Gene prediction tools
• http://bioinformatics.ca/links_directory/?subcategory_id=39
• http://www.geneprediction.org/
Common tools
GenScan: http://genes.mit.edu/GENSCAN.html
HMMgene: http://www.cbs.dtu.dk/services/HMMgene/
Microbial: http://www.ncbi.nlm.nih.gov/genomes/MICROBES/glimmer_3.cgi
Fungal: http://www.cbcb.umd.edu/software/GlimmerHMM/