A Genetic Approach to Ordered Sequencing of Arabidopsis

DNA Learning Center
July 15, 2003
W. Richard McCombie
Professor
Cold Spring Harbor Laboratory and
The Watson School of Biological
Sciences
Basic points
• Genome research is advancing very rapidly
• Technologies are driving the progress
• These technologies and the data that result
from them will have a revolutionary effect
on the way biological research is done and
in our understanding of biology and
medicine
Major Topics
• What is genomics and in particular the human
genome program
• Introduction and historical perspective on
sequencing.
• Some information about genomes being
sequenced
• Strategies to analyse genomes
• Comparative genomics
• How genomics has changed and will change biology and
medicine
What is an organism
• At ONE LEVEL, it is the result of the execution
of the code that is its genome
• We do not know the degree to which environment
alters this execution
• We do know that in addition to physical attributes,
many complex processes such as behavior are
influenced by the code
• We now know that in mammals, this code comprises
only about 30,000-40,000 genes and their
control units
The Genome of an organism is:
• The complete set of inherited instructions
for that organism - its complete DNA code
• When operating, it creates a set of proteins in
an organized fashion
• These proteins act to cause growth,
development and reproduction of the
organism
What is genomics
• Genomics is the analysis of the complete set of
genetic instructions of an organism
• These genetic instructions consist of genes, which
direct the production of proteins and their control
elements
• These genes consist of a series of DNA bases
• Previously we could only look at one or at most a
few of these objects or parts at a time
• Technology now enables us to see them all
Why will genomics have such an
impact
• Important biological problems such as
cancer and learning and memory are
extraordinarily complex
• Genomics lets us integrate this complex
information in a meaningful way
• Ultimately, much of biological research will
be driven by computational analysis
Sizes of some important genomes (base pairs)
• Virus: 0.003 - 0.300 million
• Bacteria: 0.8 - 6 million
• Yeast: 15 million
• C. elegans: 100 million
• Rice: 435 million
• Arabidopsis: 130 million
• Fugu: 800 million
• Mouse: 2.5 billion
• Corn: 2.5 billion
• Human: 3 billion
• Wheat: 16-20 billion
• Loblolly pine: 20 billion
Genome sequencing efficiencies
per person
• 1980: 0.1-1 kb per year
• 1985: 1-5 kb per year
• 1990: 25-50 kb per year
• 1996: 100-200 kb per year
• 2000: 500-1000 kb per year
• 2002: 10,000 - 25,000 kb per year
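To put these per-person rates in perspective, the sketch below estimates how many person-years a single pass over a human-sized genome would take at each rate. It uses the upper bound of each range and the 3 billion base human genome size from the genome size slide. The function name is illustrative, and the calculation ignores the multiple coverage a real project needs.

```python
# Upper bound of each per-person throughput range from the slide, in kb per year.
RATES_KB_PER_YEAR = {
    1980: 1,
    1985: 5,
    1990: 50,
    1996: 200,
    2000: 1000,
    2002: 25000,
}

HUMAN_GENOME_BASES = 3_000_000_000  # 3 billion bases, from the genome size slide


def person_years(genome_bases: int, kb_per_year: float) -> float:
    """Person-years needed to read genome_bases once at the given rate."""
    return genome_bases / (kb_per_year * 1000)


for year, rate in RATES_KB_PER_YEAR.items():
    print(f"{year}: {person_years(HUMAN_GENOME_BASES, rate):,.0f} person-years")
```

At the 1980 rate a single pass works out to millions of person-years; at the 2002 rate it is on the order of a hundred.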
Bases in GenBank
[Figure: chart of bases in GenBank, year axis ticks from 1982 to 1997; vertical axis from 0 to 4,000,000,000 bases]
Bases in GenBank 1982-1987
[Figure: chart of bases in GenBank, 1982 to 1987; vertical axis from 0 to 18,000,000 bases]
Methods to analyse a complex
genome
• Mapping
– Genetic
– Physical
• Expressed gene analysis
• Genome sequence analysis
– Complete sequence
– Skimming
– “Rough draft”
Salient features of genome
organization
• Higher organisms have large genomes with
a considerable amount of repeat sequence
• Genes from higher organisms are
interrupted by non-coding regions
• Only a small portion of a genome codes for
genes
• Related organisms have related genomes
Expressed Sequence Tags (sequencing
parts of the processed genes)
• Advantages
– Inexpensive
– “Know” sequence is coding
– Information about tissue or developmental stage expression
• Disadvantages
– Coverage is incomplete
– Position of sequence in the genome is unknown
– Only partial information about each gene
– No information about structural elements
Steps in genome sequencing
• Construction of a large-insert library
• Construction of a small-insert subclone library
• Isolation of DNA
• Sequencing of the DNA fragments (8-10x)
• Assembly of the data into contiguous regions
• Filling the gaps in the sequence and resolving
discrepancies
• Confirmation of the sequence
• Analysis
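The assembly step (joining overlapping reads into contiguous regions) can be illustrated with a toy greedy merge. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and the genome has no repeats. The slides do not say which assembly method was used, and the function names and example reads are made up.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b.

    Returns 0 if the longest such overlap is shorter than min_len.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            return contigs  # no overlaps left: remaining contigs are separated by gaps
        merged = contigs[i] + contigs[j][n:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


# Hypothetical reads that tile a 13-base sequence.
print(greedy_assemble(["ATGGCGT", "GCGTACG", "TACGGAT"]))
```

When reads do not overlap, the function returns more than one contig, which corresponds to the gaps that the later "filling the gaps" step has to close.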
High Accuracy Genomic Sequencing (6-10x
plus resolution of problems)
• Advantages
– Normalized coverage of all genes
– Information about gene structure
– Information about regulatory elements
– Genome organization
• Disadvantages
– Cost
– Time
– Difficult to determine if a sequence codes for a gene
“Rough draft”
• Can be thought of as:
– High coverage skimming
– Low coverage complete sequencing
• Advantages and disadvantages are
intermediate between skimming and
complete sequencing - dependent on the
coverage
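Why coverage matters can be sketched with an idealized model that is not from the slides: if reads land uniformly at random, the expected fraction of bases hit by at least one read at coverage c is 1 - e^(-c) (the Poisson approximation often associated with Lander and Waterman). Real libraries have cloning and sequence biases, so treat these numbers as rough upper bounds.

```python
import math


def expected_fraction_covered(coverage: float) -> float:
    """Expected fraction of bases in at least one read, assuming uniformly random reads."""
    return 1.0 - math.exp(-coverage)


# Coverage levels mentioned on the slides (1x skim up to 10x complete sequencing).
for c in (1, 3, 4, 8, 10):
    print(f"{c}x: {expected_fraction_covered(c):.4%} of bases expected in at least one read")
```

Under this model a 1x skim leaves roughly a third of the genome unsampled, while 8-10x leaves very little, which is consistent with a rough draft sitting between the two.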
Cost of various types of
sequencing (per base)
• “Base perfect” (uncomplicated): $0.3
• 8x shotgun - no finishing: $0.1
• 4x shotgun - no finishing: $0.05
• 3x shotgun - no finishing: $0.04
• 1x shotgun - no finishing: $0.01
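As a rough worked example, the per-base prices can be multiplied by genome sizes from the genome size slide. This sketch assumes the price applies once per base of the target genome (the slide does not say whether it is per finished base or per raw read base), and the names below are illustrative.

```python
# Per-base prices from the slide, in US dollars.
COST_PER_BASE_USD = {
    "base perfect": 0.30,
    "8x shotgun": 0.10,
    "4x shotgun": 0.05,
    "3x shotgun": 0.04,
    "1x shotgun": 0.01,
}

# Genome sizes from the genome size slide, in bases.
GENOME_BASES = {
    "Arabidopsis": 130_000_000,
    "Human": 3_000_000_000,
}


def project_cost_usd(genome_bases: int, cost_per_base: float) -> float:
    """Total cost, assuming the price applies once per base of the target genome."""
    return genome_bases * cost_per_base


for organism, size in GENOME_BASES.items():
    for strategy, price in COST_PER_BASE_USD.items():
        total = project_cost_usd(size, price)
        print(f"{organism:12s} {strategy:13s} ${total / 1e6:,.1f} million")
```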
The Human Genome Project
• The human genome consists of three billion
base pairs built from four bases: Adenine, Cytosine,
Guanine, and Thymine
• Printing out the A,C,G,T would fill over
150,000 telephone book pages
• Disease is often caused by a single
variation in the three billion bases - one
different letter in 150,000 pages
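The telephone book comparison is easy to check. The sketch below uses the two figures from the slide; the 2 bits per base storage estimate is an added assumption (four letters fit in 2 bits), not something the slide states.

```python
GENOME_BASES = 3_000_000_000   # three billion bases, from the slide
PAGES = 150_000                # telephone book pages, from the slide
BITS_PER_BASE = 2              # assumption: four letters (A, C, G, T) fit in 2 bits

letters_per_page = GENOME_BASES // PAGES
megabytes = GENOME_BASES * BITS_PER_BASE / 8 / 1_000_000

print(f"{letters_per_page:,} letters per page")
print(f"{megabytes:,.0f} MB at 2 bits per base")
```

So each page would carry 20,000 letters, and a single-letter disease variant is one character somewhere in that stack.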
The human genome project
• A concerted effort to build resources to
unravel the human control code
• To develop map resources to link genetic
elements (such as disease genes) to a
physical representation of the genome
• To determine the sequence of all of the
DNA that combines to make the human
control code
2-15-01
Genome sequencing assignments
[Figure: Arabidopsis chromosomes I-V, labeled with the sequencing groups assigned to each region: Kazusa, CSHSC, TIGR, SPP, ESSA, Genoscope]
The Arabidopsis genome - basic statistics
feature                            Chr.1         Chr.2         Chr.3         Chr.4         Chr.5
length [Mbp]                       30.4          19.8          23.7          17.8          27.0
GC content                         33.4 %        35.5 %        36.1 %        35.5 %        35.9 %
GC content in coding regions       44.0 %        44.1 %        44.2 %        44.1 %        44.0 %
GC content in non-coding regions   32.4 %        33.3 %        32.4 %        32.8 %        32.5 %
no. of genes                       7046          4036          5126          3825          5874
exon length                        247           259           250           256           242
gene density (kb / gene)           4.3           4.9           4.5           4.6           4.6
EST matches (% genes with at
least one EST above 90%
similarity)                        60.6 %        56.8 %        59.7 %        59.6 %        61.2 %
tRNAs                              105           73            41            81            140
Targeted to mitochondria           445 (11%)     425 (10.5%)   446 (8.7%)    377 (9.9%)    627 (10.7%)
Targeted to chloroplast            543 (15%)     533 (13.2%)   621 (12.1%)   513 (13.4%)   884 (15.1%)
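The GC content figures in these statistics are the percentage of G and C bases in a sequence. A minimal sketch of that calculation follows; the function name and example sequence are made up for illustration, and ignoring ambiguous characters such as N is an assumption rather than the method used for the slide.

```python
def gc_content(sequence: str) -> float:
    """Percentage of G and C among the A, C, G and T bases of a sequence."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at  # ambiguous characters such as N are not counted
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T bases")
    return 100.0 * gc / total


# Hypothetical AT-rich fragment, not taken from the Arabidopsis sequence.
example = "ATGCGCGATATATAT"
print(f"GC content: {gc_content(example):.1f} %")
```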
Gene Families
                   No. of singletons   Gene families containing
                   and distinct
                   gene families       unique    2 members  3 members  4 members  5 members  >5 members
H. influenzae      1587                88.8 %    6.8 %      2.3 %      0.7 %      0.0 %      1.4 %
S. cerevisiae      5105                71.4 %    13.8 %     3.5 %      2.2 %      0.7 %      8.4 %
D. melanogaster    10736               72.5 %    8.5 %      3.4 %      1.9 %      1.6 %      12.1 %
C. elegans         14177               55.2 %    12.0 %     4.5 %      2.7 %      1.6 %      24.0 %
A. thaliana        11601               35.0 %    12.5 %     7.0 %      4.4 %      3.6 %      37.4 %
[Figure: bar chart with vertical axis from 0 to 0.7, comparing E. coli, Synechocystis, Saccharomyces c., C. elegans, Drosophila m., and human across functional categories: cell growth, cell division and DNA synthesis; protein synthesis; transcription; energy; metabolism; protein destination; transport facilitation; intracellular transport; cellular communication/signal transduction; cell rescue, defense, cell death, ageing; ionic homeostasis; cellular biogenesis; classification not yet clear-cut; unclassified]
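Cytogenetic map of chromosome 4S
[Figure: cytogenetic map marking the NOR, knob, and centromere (cen), with segment sizes of 3 Mb, 2 Mb, 0.5 Mb, 0.5 Mb, and 2 Mb. Credit: Paul Fransz]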
Complete genomic sequencing
reduces the genetics of an
organism to a closed, finite
system
FRUITFULL Gene Function
The AGL8 gene was
renamed FRUITFULL (ful1)
Genetic Redundancy
ap1 cal ful triple mutants have
flowers replaced by shoots
• apetala1 cauliflower
double mutants have
proliferating floral
meristems resembling
cauliflowers
The state of Arabidopsis research
200??
• Complete annotated sequence available
• Time to clone a gene has decreased from months
or years to weeks in some cases
• People are beginning to look at global features of
Arabidopsis
• Gene trap insertion in “every” gene
• Insertion site sequences known, linked to physical
and genetic map
Analysis of not the first, or the second,
but subsequent genomes
• The information from the first few genomes
will enable huge cost and time savings
• A major emphasis will be to determine the
function of genes
What are the genes and what do
they do???
• Computational analysis
• Functional analysis
– Microarrays
– Transposons
– Various other methods
• Comparative analysis
Comparative Genomics
What can we learn from
comparative analysis
• Evolutionary relationships
• Better annotation of genes, particularly of
beginning and ends of genes
• Detection of conserved regulatory regions
• Functional evidence
Benefits of having a model genome
reference sequence with conserved local
gene order to your plant of interest
• Requirements for sequence accuracy
decrease for most of the genome
– you can fill in with high accuracy where needed
• The reference genome can be used as a
scaffold allowing the anchoring of clones
(allowing partial sequence coverage to infer
complete clone coverage)
Co-linearity among cereal genomes
What types of comparisons are useful?
• Arabidopsis to very closely related species
– Annotate the Arabidopsis sequence
• Arabidopsis to related crop plants (soybean, tomato, Medicago
truncatula)
– Determine the degree of locally conserved gene order between these crops
and Arabidopsis
– Determine how the Arabidopsis sequence can be used in the analysis of
these species
• Arabidopsis to distant plants (rice for instance)
– Gene discovery
– Systems analysis
– Gene order conservation???
• Arabidopsis to animals
– How plants and animals differ in carrying out basic biological processes
– How plants and animals organize and manage gene expression
Mammalian Comparative
Genomics
• Canine vs. Human Genome
• Sequence canine ESTs
• In collaboration with Elaine Ostrander (FHCRC)
map to the dog genome
• Map computationally to the human genome
• Use to better annotate the human sequence
• Starting material for microarrays
• Use in gene discovery (behavior and cancer)
myosin, light polypeptide 4, alkali
How will genomics affect the
way we do biological research
Rate at which genes can be
identified
• Cloning - weeks to years
• Database searches - seconds to minutes
What are the areas where genome
technology will impact us
• Diagnostics
• Forensics
• Understanding of diseases such as cancer at
the molecular level
• Treatments for diseases customized to the
individual
Genomic Information allows us to
look at the entire gene content of
an organism simultaneously
> 9 of the 10 Leading Causes of Mortality Have
Genetic Components
• 1. Heart disease (29.5% of deaths in ‘00)
• 2. Cancer (22.9%)
• 3. Cerebrovascular diseases (6.9%)
• 4. Chronic lower respiratory dis. (5.1%)
? 5. Injury (3.9%)
• 6. Diabetes (2.9%)
• 7. Pneumonia/Influenza (2.8%)
• 8. Alzheimer disease (2.0%)
• 9. Kidney disease (1.6%)
• 10. Septicemia (1.3%)
Genomic Health Care
• About conditions partly:
–Caused by mutation(s) in gene(s)
• e.g., breast cancer, colon cancer, autism,
atherosclerosis, inflammatory bowel
disease, diabetes, Alzheimer disease,
mood disorders, etc., etc.
–Prevented by mutation(s) in gene(s)
• e.g., HIV (CCR5), ?atherosclerosis,
?cancers, ?diabetes , etc., etc.
Genomic Health Care
• Will change health care by...
– Creating a fundamental
understanding of the biology of many
diseases (and disabilities), even many
“non-genetic” ones
– Helping to redefine illnesses by
etiology rather than by
symptomatology
Genomic Health Care
• Knowledge of individual genetic
predispositions will allow:
– Individualized screening
– Individualized behavior changes
– Presymptomatic medical therapies, e.g.,
antihypertensive agents before
hypertension develops, anti-mood
disorder agents before mood disorder
occurs
Crystal Ball - 2010
• Predictive genetic tests for 10 - 25 conditions
• Intervention to reduce risk for many of them
• Gene therapy for a few conditions
• Primary care providers begin to practice genetic
medicine
• Preimplantation diagnosis widely available,
limits fiercely debated
• Effective legislative solutions to genetic
discrimination & privacy in place in US
• Access remains inequitable, especially in
developing world
Crystal Ball - 2020
• Gene-based designer drugs for diabetes,
hypertension, etc. coming on the market
• Cancer therapy precisely targets molecular
fingerprint of tumor
• Pharmacogenomic approach is standard
approach for many drugs
• Mental illness diagnosis transformed, new
therapies arriving, societal views shifting
• Homologous recombination technology
suggests germline gene therapy could be safe
Crystal Ball - 2030
• Genes involved in aging fully cataloged
• Clinical trials underway to extend life span
• Full computer model of human cells replaces
many laboratory experiments
• Complete genomic sequencing of an individual
is routine, costs less than $100
• Major anti-technology movements active in
US, elsewhere
• Worldwide inequities remain
Genomics
• May also change society…
– Genetic stratification, e.g., in
employment or marriage
– Genetic engineering against (and for)
diseases and characteristics
– Cloning
– Increased opportunity for “private
eugenics”
Genomics
• If we are all mutants, what is the
definition of normal?
Conclusions
• Genomics will be the knowledge base or
infrastructure for virtually all biology and
medicine of the 21st century
• In silico biology will be a driving force in
research and medicine
• Treatments for diseases will be radically
improved by our understanding of complex
diseases
Collaborators and Funding
Rob Martienssen
Pablo Rabinowicz
Lincoln Stein
Rod Wing and the CUGI Group
Susan McCouch
Steve Tanksley
Mike Bevan
Our ESSA-MIPS Collaborators
Rick Wilson
Marco Marra
Elaine Mardis
John McPherson
Bob Waterston
The WUGSC
Daphne Preuss
The AGI
Special thanks to NHGRI
for some of the slides used
Doug Cook
NSF, USDA, DOE
NIH (NHGRI) and NCI
Monsanto, Westvaco,
David Luke III
“It is now conceivable that our
children's children will know the
term cancer only as a constellation
of stars.”
– President Clinton at the White
House, June 26, 2000 announcing
completion of the human genome
draft sequence