The Human Genome Project: The end is in sight

Automation for Genomics Discovery
at the Oklahoma Genome Center
Bruce A. Roe
Department of Chemistry and Biochemistry,
University of Oklahoma, Norman, OK 73019
C T
A G
Working Innovation into the Drug Discovery Pipeline
June 3, 2004
Houston Marriott Medical Center
Central Dogma of Molecular Biology
Each Chromosome Contains Hundreds of Genes
[Figure: Chromosome (DNA) → Gene → (transcribe) → RNA → (process/transport) → mRNA → (translate) → Protein; RNA → Stable RNAs]
What is a GENOME?
For humans, it is the complete set of 23
chromosome pairs that we inherit from
our parents.
The human genome contains all the
information needed to make a human.
Most bacteria have only a single
chromosome, which represents the genome
and contains all the information needed to
make that bacterium.
Human Genome Project Goals 1998-2003
• Achieve ~5-fold coverage of at least 90% of the genome in
a “working draft” based on mapped clones and finish one-third
of the 3 billion base pair human genomic DNA
sequence by the end of 2000
• Finish the complete human genome sequence by the end
of April 2003, marking the 50th anniversary of the
discovery of the double helix structure of DNA by Watson
and Crick
• Make the sequence totally and freely accessible
• Reduce the cost of DNA sequencing to 25 cents/base over
this 5 year period by developing new technologies
• Study human genome sequence variation by creating a
Single Nucleotide Polymorphism (SNP) map with at least
100,000 markers
How Far Have We Come as of June 2004?
• Over 99% of the ~3.15 billion bases in the human genome
had been sequenced to finished quality as of April 2003.
All the data are available in public databases.
• Ten human chromosomes (7,9,10,13,14,19,20,21,22,Y) have
been annotated and published and the remaining 14 are in the
final phases of annotation.
• There are fewer than 400 gaps in the sequence of the 24
chromosomes (22 numbered chromosome pairs plus X and Y)
• The cost of completed genomic DNA sequencing is slightly
less than 8 cents/finished base with the development of
improved automation.
• Three quality-checking exercises were held, in which two groups
checked the quality of another group's sequence, both in silico and by re-sequencing.
http://www.ncbi.nlm.nih.gov/genome/seq/HsHome.shtml
How do we sequence DNA?
The process is similar to taking many copies of a
newspaper, shredding them, then trying to put together
a copy of the original newspaper
This is accomplished by breaking many copies of
the DNA into small pieces and determining the order
of the four bases in each of these small pieces
Then, we overlap the small sequenced pieces to
obtain the sequence of the original, larger DNA
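The overlap step can be sketched on a toy scale. This is a minimal greedy sketch, not the Phred/Phrap/Consed software named later in this talk; the 30-base string is the start of the chromosome 22 sequence shown later, while the read length, the read start positions and the names `overlap`, `greedy_assemble` and `min_len` are illustrative assumptions.

```python
def overlap(a, b, min_len=5):
    """Length of the longest suffix of a that is also a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates
    # Drop reads that are wholly contained in another read.
    reads = [r for r in reads if not any(r != o and r in o for o in reads)]
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left: the remaining pieces are separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads


# Toy "genome" and four shuffled 12-base reads that overlap by 6 bases.
genome = "AGCCACACAGTGTCCACCGGATGGTTGATT"
reads = [genome[start:start + 12] for start in (12, 0, 18, 6)]
contigs = greedy_assemble(reads)
print(contigs)
```

Real genomic DNA contains repeats and sequencing errors, so production assemblers rely on base quality values and mapped clones rather than exact string matches.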
Sequence Pipeline at the University of Oklahoma
Genome Center, OU-ACGT
DNA
→ DNA shearing (Hydroshear™)
→ Colony picking (QPixII™)
→ Growing subclones (HiGro™)
→ Subclone isolation I (Mini-Staccato™)
→ Subclone isolation II (VPrep™)
→ Thermocycling (ABI 9700)
→ Sequencing (ABI 3700)
→ Data assembly and analysis
→ Closure
→ GenBank
Supporting steps: AMS-90 for PCR product analysis, liquid handling, primer synthesis
Hydroshear
• GeneMachines, Inc., San Carlos, CA
• Precision-drilled ruby orifice
• 500 µl syringe pump
• Pump retraction speed range 0 – 40
• A 100 to 300 µl sample sheared at a retraction speed
setting of 10 produces 1-4 Kbp DNA fragments
Genetix QPixII Colony Picker
Digitizes colonies and picks in batches of 96 into 384-well plates
Pins are sterilized after each set of 96 colonies is picked
Cell Growth in 384 well plates in a HiGro
• Capacity: 48 shallow 384-well plates or 24 deep-well plates.
• Cells are grown in TB medium supplemented with salts and antibiotic.
• Cells are shaken at 520 rpm for 22 hours at 37°C.
• After 3.5 hours, oxygen is added at 0.5 ft³/min for 0.5 second every 30
seconds.
Zymark SciClone with Twister II
[Figure labels: 384-tip pipettor; 4 built-in shakers; robotic 384-well plate loader and stacker]
Subclone Isolation I (Mini-Staccato)
• This Zymark robot has a 384-cannula array, four built-in shakers, three
attached storage racks, built-in barcoding and a Twister II robotic arm.
• This automation has allowed us to perform the DNA isolation completely
unattended from as many as 80 384-well plates of bacterial cells per day.
Subclone Isolation I (Mini-Staccato)
The initial lysis solution (NaOH and SDS) is added to each of four 384
well plates containing bacterial cells that were loaded onto the built-in
shakers incorporated into the SciClone workspace deck.
Subclone Isolation I (Mini-Staccato)
The second solution, TE-RNase A, is added to each of the 384 well
plates and again shaken on the four auto-centering magnetic shakers
on the SciClone workspace deck.
Subclone Isolation I (Mini-Staccato)
Once all three lysis solutions are added and the plates are shaken after
each addition, the plates are transferred from the SciClone workspace
deck to a storage rack by the Twister II robotic arm.
Fluorescent DNA Sequencing
• Dye terminator-labeled nested fragment set of DNA copies from a template
with unknown sequence in a single reaction tube
• Reaction products are applied to a single gel lane or capillary and
electrophoresed to separate the nested fragment set
• The separated fragments pass a laser and a detector
• The sequence information is fed into a computer
[Figure: fragments ending in C, C, G, A, A, C, G, T passing the laser and detector; example read ACGTACACGTTCGG]
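How a nested fragment set encodes the sequence can be illustrated with a small simulation. This is a sketch under stated assumptions, not instrument software: each simulated copy stops at a random position, the dye is simply the letter of the last base, and the names `nested_fragments`, `read_fragments`, `copies` and `p_terminate` are hypothetical. The example strand is the read shown on this slide.

```python
import random


def nested_fragments(new_strand, copies=5000, p_terminate=0.15, seed=0):
    """Simulate dye-terminator extension of many copies of one strand.

    At each base a copy stops with probability p_terminate; the fragment is
    recorded as (length, dye), where the dye identifies the final base.
    Copies that never terminate carry no dye and are ignored.
    """
    rng = random.Random(seed)
    fragments = set()
    for _ in range(copies):
        for length, base in enumerate(new_strand, start=1):
            if rng.random() < p_terminate:
                fragments.add((length, base))
                break
    return fragments


def read_fragments(fragments):
    """Electrophoresis orders fragments by size; read the dyes shortest first."""
    return "".join(dye for _, dye in sorted(fragments))


strand = "ACGTACACGTTCGG"
print(read_fragments(nested_fragments(strand)))
```

With enough copies every length is represented, so sorting by size and reading the terminal dye recovers the strand one base at a time.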
Subclone Isolation and Sequencing Reaction
Pipetting (Velocity 11 VPrep)
• Liquid handling station with 384-channel pipettor head
• Four movable shelves on either side of the pipettor head
• Used for subclone isolation, sequencing reaction set-up and,
as shown here, the ethanol-acetate precipitation clean-up step.
Thermocycling (ABI 9700)
Subclone sequencing conditions
• 95°C for 2:00
• 60 cycles of: 95°C for 0:30, 50°C for 0:20, 60°C for 4:00
• 4°C hold (∞)
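The programmed length of this run follows from simple arithmetic, sketched below. The calculation counts only the hold times listed on the slide; ramp times between temperatures and the final 4°C hold are ignored, and the names `to_seconds` and `program_seconds` are illustrative.

```python
def to_seconds(mmss):
    """Convert an 'M:SS' hold time to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


INITIAL_DENATURATION = ("95C", "2:00")
CYCLE_STEPS = [("95C", "0:30"), ("50C", "0:20"), ("60C", "4:00")]
CYCLES = 60


def program_seconds(initial, steps, cycles):
    """Total programmed hold time, excluding ramps and the final hold."""
    per_cycle = sum(to_seconds(hold) for _, hold in steps)
    return to_seconds(initial[1]) + cycles * per_cycle


total = program_seconds(INITIAL_DENATURATION, CYCLE_STEPS, CYCLES)
hours, remainder = divmod(total, 3600)
print(f"{hours} h {remainder // 60} min of programmed hold time")
```

The 4:00 extension at 60°C dominates: it accounts for 4 of the 4.8 minutes in each cycle.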
Capillary Electrophoresis DNA Sequencing
• Our present capacity is fourteen 96-capillary ABI 3700 capillary electrophoresis-based
DNA sequencing instruments that are capable of analyzing two
384-well thermocycle plates or eight 96-well thermocycle plates per day.
• The DNA sequencing data are transferred to the Sun computer workgroup
for base calling (Phred), assembly (Phrap) and analysis (Consed).
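As a rough capacity check, the plate counts above can be turned into reads per day. This sketch assumes the plate throughput is per instrument and that every well yields one read; both are assumptions for illustration, and `reads_per_day` is a hypothetical helper.

```python
INSTRUMENTS = 14          # ABI 3700 instruments listed above
PLATES_96_PER_DAY = 8     # assumed to be per instrument
PLATES_384_PER_DAY = 2    # assumed to be per instrument


def reads_per_day(instruments, plates_per_day, wells_per_plate):
    """Upper bound on sequencing reads per day, one read per well."""
    return instruments * plates_per_day * wells_per_plate


via_96 = reads_per_day(INSTRUMENTS, PLATES_96_PER_DAY, 96)
via_384 = reads_per_day(INSTRUMENTS, PLATES_384_PER_DAY, 384)
print(via_96, via_384)
```

Both plate formats give the same figure because two 384-well plates hold as many samples as eight 96-well plates.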
Primer synthesis (Mermade IV) for
PCR-based closure and finishing
• Standard phosphoramidite chemistry in an argon-filled reaction chamber.
• 192 primers are synthesized at the 2.5 nmole scale, twice each day.
• 2.5 nanomole synthesis (50 cents/oligo) typically is used for either PCR or
DNA sequencing primers, but can be scaled to 10 nanomole.
Data assembly and Analysis
• Sun V880 server with 32 GB RAM running Solaris 8 OS and 3
TB of data stored on RAID-5 arrays
with autoloader tape backup
• Software: Phred/Phrap/Consed, Exgap
• Also: 12 workstations, each with 1 GB RAM
[Figure: Sanger, Keio, Wash U, OU]
Human Chromosome 22
Sequence Features
• 39 % of the sequence is occupied by genes, including
their introns and 5’ and 3’ untranslated regions.
• 3 % of the complete sequence encodes the protein
products of these genes.
• 42 % of the sequence is composed of repetitive
sequences, compared to 46 % for the entire genome.
• Only slightly over half of the genes predicted for
human chromosome 22 can be experimentally
validated.*
* Shoemaker DD., et al. Experimental annotation of the human
genome using microarray technology. Nature. 409, 922-7 (2001).
An Individual’s Genome
Differs from the DNA of:
• Siblings by 1 to 2 million bases, ~99.98% identical, with
coding regions 99.99999% identical
• Unrelated humans by 6 million bases, ~99.8% identical
overall, with coding regions 99.9999% identical
• Chimpanzees by about 100 million base pairs ~98%
identical
• Baboons by about 300 million base pairs ~92% identical
• Mice by about 2.8 billion bases, but coding regions are
~90% identical
• Leaf spinach by about 2.9 billion bases, but coding
regions are ~40% identical
Differences between individuals
AGCCACACAGTGTCCACCGGATGGTTGATTTTGAAGCAGAGTT
AGCTTGTCACCTGCCTCCCTTTCCCGGGACAACAGAAGCTGAC
CTCTTTGNTCTCTTGCGCAGATGATGAGTCTCCGGGGCTCTAT
GGGTTTCTGAATGTCATCGTCCACTCAGCCACTGGATTTAAGC
AGAGTTCAAGTAAGTACTGGTTTGGGGAGNAGGGTTGCAGCGG
CNGAGCCAGGGTCTCCACCCAGGAAGGACTNATCGGGCAGGGT
GTGGGGAAACAGGGAGGTTGTTCAGATGACCACGGGACACCTT
TGACCCTGGCCGCTGTGGAGTGTTTGTGCTGGTTGATGCCTTC
TGGGTGTGGAATTGTTTTTCCCGGAGTGGCCTCTGCCCTCTCC
CCTAGCCTGTCTCAGATCCTGGGAGCTGGTGAGCTGCCCCCTG
CAGGTGGATCGAGTAATTGCAGGGGTTTGGCAAGGACTTTGAC
AGACATCCCCAGGGGTGCCCGGGAGTGTGGGGTCCNAGCCAG
The yellow underlined sequence is the first exon of
the BCR gene involved in leukemia. Only 5 bases
(N) differ in non-gene regions.
Human Chromosome 22
Single Nucleotide Polymorphisms*
Number of overlaps: 335
Size of overlaps: 13,203,147 bp
Number of SNPs: 11,116 (~1/1000 bp)
Number of substitutions: 9,123 (82%)
Number of ins/del: 1,193 (18%)
Only 48 of the 11,116 SNPs were in coding
regions, ~10-fold lower than in non-coding regions.
* E. Dawson, et al. A SNP Resource For Human Chromosome 22: Extracting Dense
Clusters of SNPs from the Genomic Sequence. Genome Research, 11, 170-178 (2001).
“We each are like a different symphony orchestra”
“All playing the same instruments slightly differently”
Good news and Bad news
• Good news <40,000 genes (counting dark space?)
• Bad news
• 2-4 times as many proteins as other
species due to extensive alternative
splicing in humans.
• We only know the function of about
half the predicted genes.
• Likely > 1 million different gene
products based on alternative splicing
and post-translational modifications.
Where we stand now
• We essentially have the ‘dictionary’ with all the words
(genes) spelled correctly, but only slightly more than
half of the words (genes) have definitions.
• Slightly over half of the 936 genes predicted for human
chromosome 22 have been experimentally validated.
• 223 have a known function and expression
• 172 have no known function but evidence for expression
• 182 have no known function and no evidence for expression
• 228 pseudogenes
• Through comparative genomic sequencing we can
annotate the human genome based on evolutionarily
conserved gene sequences and use model systems to
study gene expression.
If a genomic region is conserved in evolutionarily
distant organisms, it most likely has been
maintained by selective pressure
over evolutionary time because it performs
a necessary function.
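Conservation of this kind is what a percent identity plot (PIP) displays. The sketch below is a simplification: it scores fixed windows of two pre-aligned sequences, whereas real PIP software scores the gap-free segments of a computed alignment. The two 20-base sequences and the names `percent_identity` and `windowed_identity` are invented for illustration.

```python
def percent_identity(a, b):
    """Percent of aligned, non-gap positions at which two sequences agree."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not compared:
        return 0.0
    return 100.0 * sum(x == y for x, y in compared) / len(compared)


def windowed_identity(a, b, window=10, step=10):
    """Percent identity for successive windows along an alignment."""
    return [
        (start, percent_identity(a[start:start + window], b[start:start + window]))
        for start in range(0, len(a) - window + 1, step)
    ]


# Hypothetical aligned sequences: a conserved block followed by a diverged one.
human = "ATGGCCAAGTTGACCGATCA"
other = "ATGGCCAAGTAGTCAGTTCA"
for start, identity in windowed_identity(human, other):
    print(start, identity)
```

Windows that stay highly identical across distant species are the candidates for functional, selectively maintained sequence.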
Chimpanzee and Baboon
Genomic Sequencing
• Medically important model eukaryotic organisms
• The chimpanzee is our nearest evolutionary
relative with a genome that has ~98 %
sequence identity with the human genome
• The baboon genome has ~92 % sequence
identity with the human genome
PIP plot of a region of human chr22 compared to syntenic regions of baboon and mouse
[Figure annotations: human-specific repeat regions; questionable gene present in primates but not in rodents; 34 Kbp deletion in baboon]
[Figure annotation: exons in one copy of a zebrafish duplicated gene with 75% homology to human but greatly diverged, <50% homology, in the other copy]
A complementary approach is to determine if
the predicted protein coding conserved
elements are functional by investigating their
expression profiles during development.
Whole mount in situ hybridization using
zebra fish as the model organism
“Small people that swim in the water and
breathe through gills…” Han Wang, OU
Zebrafish as a model system
• Have a short time, ~3 months, to reproductive maturity.
• Can be easily bred in the lab in large numbers.
• Are small in size: an adult is just a few centimeters long.
• Have an ~5 day embryonic development period from
fertilized egg to swimming fish.
• The embryos are transparent, making it easy to see internal
organs during development.
• Are well established as a resource for genetic studies.
• The Sanger Institute is completing the genome sequence,
which presently is ~50% complete and publicly available.
• More than 90 % of the predicted human genes have a zebra
fish ortholog.
Whole mount in situ hybridization
[Figure: an mRNA bound by a DIG-labeled ssDNA or RNA probe (digoxigenin-labeled uridine), which is bound by an alkaline phosphatase-conjugated anti-DIG antibody and developed with BCIP* + NBT**, with a wash after each binding step]
1. Add digoxigenin-labeled probe complementary to
RNA of interest
2. Add alkaline phosphatase-conjugated antibody that
binds to digoxigenin
3. Add BCIP + NBT, which turn into a
dark purple dye when
dephosphorylated by the
alkaline phosphatase,
thereby coloring the cell
*BCIP = 5-bromo-4-chloro-3-indolyl phosphate
**NBT = nitro blue tetrazolium
Exon-specific ssDNA primers
Mermade synthesis of unique exon
specific primers of the gene of interest
PCR off zebra fish genomic DNA
Followed by unidirectional amplification with either
forward or reverse (nested) primers in the presence of
DIG-labeled dUTP
ssDNA (sense and antisense probes)
These steps now have been automated in a 96 well format
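Which single primer gives which probe can be made concrete with a short sketch. It assumes the forward primer matches the sense (mRNA-like) strand, so that extending only the reverse primer produces the antisense strand that can hybridize to the mRNA. The amplicon, the primer positions and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def unidirectional_product(amplicon_sense, primer):
    """Strand made by extending a single (possibly nested) primer.

    Returns (strand_name, product), where the product runs from the primer
    to the end of the strand that the primer matches.
    """
    antisense = reverse_complement(amplicon_sense)
    for name, strand in (("sense", amplicon_sense), ("antisense", antisense)):
        start = strand.find(primer)
        if start != -1:
            return name, strand[start:]
    raise ValueError("primer does not match either strand of the amplicon")


# Hypothetical double-stranded PCR product, written as its sense strand.
amplicon = "ATGGCTAGCTTGACCGATCGTAGC"
forward_nested = amplicon[2:10]
reverse_nested = reverse_complement(amplicon)[3:11]

print(unidirectional_product(amplicon, forward_nested))
print(unidirectional_product(amplicon, reverse_nested))
```

The sense product has the same sequence as the mRNA and so serves as the negative control, which matches the later slides where only the antisense probes give a signal.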
Ethidium bromide stained 1% agarose gel of ds PCR products
amplified off genomic DNA and the subsequently unidirectionally
amplified single stranded DNA probes
[Gel: size markers (1078, 603 and 310), followed by PCR, F and R lanes for each of five samples]
• These studies clearly demonstrate that, contrary to popular belief,
single stranded DNA contains regions that fold into enough double
stranded secondary structure for ethidium bromide to bind.
• However, agarose gel electrophoresis is labor intensive (slab gel
preparation and loading), electrophoresis is time consuming, and
detection typically requires the use of carcinogenic ethidium bromide.
AMS-90 for ssPCR primer, dsPCR and single
strand unidirectional exon amplification
PCR and Unidirectional Single Primer Amplification on the AMS-90
[Figure: AMS-90 lanes for four samples, each showing the ds PCR product followed by the single strand unidirectional products (F and R); size scale in bases: 7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15]
Both double and single stranded DNA rapidly can be
resolved, detected and archived on the AMS-90
Custom MerMade Synthesized 20-mer DNA Primers
Rapidly Analyzed on the AMS-90
[Figure: AMS-90 lanes with a size scale in bases (7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15); 20-mer concentrations of 2.0, 1.0, 0.5, 0.25, 0.12 and 0.06 µg/µl (decreasing), then 0.06, 0.12, 0.25, 0.5, 1.0 and 2.0 µg/µl (increasing)]
Rapid analysis of single stranded oligonucleotides: 30 seconds/lane run time
vs over an hour/sample via capillary electrophoresis
AMS-90 vs Ethidium Bromide Stained
Agarose Gels or Capillary Electrophoresis
• Both can be used to resolve and view both
double stranded and single stranded DNAs
• However, analysis on the AMS-90:
• requires minimal human interaction,
• requires no separate photography,
• requires much less technician time,
• eliminates the use of carcinogenic
ethidium bromide,
• is less error prone and
• takes much less time.
Human hypothetical protein-KIAA0819
One gene with 11 exons on Hu Chr 22
This one gene is split into 2 genes
in zebra fish
• ZF1 - Genomic location:307,280-316,461 bp on
Sanger Institute chromosome fragment ctg14067
• With the first 4 exons
• ZF2 - Genomic location:107,344-119,287 on Sanger
Institute chromosome fragment ctg11065
• With the remaining 7 exons
Note: 4 + 7 = 11
A multiPIP analysis of the predicted genes from human, rat, mouse, fugu
and zebra fish (ZF1 and ZF2) with homology to cDNA probe KIAA0819
[Figure: multiPIP plot; percent identity axis from 50% to 100%]
Orthologous duplicated copies of a single
copy human KIAA0819 gene in zebra fish
[Figure: the single human KIAA0819 gene shown with its two zebra fish orthologs, ZF1 and ZF2]
Whole mount in situ hybridization of
ssDNA probes for the ZF1 gene
[Figure: embryos at 24, 48 and 120 hpf with antisense probe, sense probe and no probe]
Only antisense probe hybridization to the Otic Placode
Expression of ZF1 Gene in the Otic Placode
Five sensory patches develop from the embryonic ear: three cristae, each associated
with a semicircular canal, and two maculae, each associated with an otolith.
Whole mount in situ hybridization of a ssDNA
probe unique to the ZF2 gene at 24 and 48 hpf
[Figure: antisense and sense probe embryos at 24 hpf and 48 hpf; labeled structures: forebrain, hindbrain, otic placode, pectoral fin]
Only antisense probe hybridization to the hindbrain, forebrain,
Otic Placode and pectoral fin
ZF2, 48 hpf
[Figure labels: hindbrain, otic placodes, pectoral fin]
Expression of ZF2 is seen in the edge of the otic placode with no
defined sensory patches, and in the budding pectoral fin.
Expression analysis shows functional
divergence after duplication in ZF1 and ZF2
• ZF1 is expressed only in the Otic Placode seen at
24-120 hpf
• ZF2 is expressed in the hindbrain, otic placode and
the pectoral fin, with the expression in the otic
placode differing from that of ZF1
• It is highly likely that the one gene in humans is
expressed in the developing ear and brain and is involved
in early limb development
Whole mount in situ hybridization of a ssDNA probe for
Human Gene: NM_032775-ENSG00000185214
On Hu Chr 22 at positions 19,120,360 - 19,174,676
(no expression confirming ESTs)
[Figure: antisense and sense probe embryos at 120 hpf and 160 hpf; labeled structures: otic placode, swim bladder]
Only antisense probe hybridization to the Otic Placode and swim bladder
Summary of in situ hybridization studies:
Gene
Antisense probe
Dj508I15.c22.5
Phf5a-like gene
KIAA0819-ZF1
KIAA0819-ZF2
Brain
Brain
Otic placode
Hind brain, Otic placode,
and pectoral fin
Otic placode and
swim bladder
Hind brain,
Hind brain and
Branchial arches,
pectoral fin
Heart, and pectoral fin
Notochord, liver
Notochord
Hind brain, and
Otic placode
NM_032775
DGCR8
AP000553.6
Sense probe
ESTs
+
+
+
+
-
3 out of 7 predicted genes but with no previous evidence for expression
Conclusions:
• It now is clear that there are large conserved sequence
regions shared by evolutionarily distant organisms ranging
from humans to fish. If these regions are conserved, the
function of the encoded genes also likely is conserved.
• The zebra fish is an ideal system in which to investigate
protein expression profiles for genes that are human
orthologs.
• All aspects of this work have been and will continue to be
improved by automation.
What’s next for our Genome Center?
• Participate in sequencing the mouse, chimp, baboon,
lemur, bovine, dog, cat, chicken and zebra fish
genomes concentrating on:
• Regions of high biological interest and
• Regions orthologous to human chromosome 22
• Sequence the Medicago truncatula (alfalfa) genome
using a mapped BAC-based approach concentrating
on coding regions
• Continued sequencing of selected pathogenic bacteria
• Investigate the function of the predicted genes with
unknown function in the zebrafish system, first by
whole mount in situ hybridization and then by expression knock-down
experiments with morpholino oligos, once robust,
automated methods have been developed.
Laboratory Organization
Bruce Roe, PI
Support Teams
Informatics
Production
DNA Synthesis
Jim White
Steve Kenton
Hongshing Lai
Sean Qian***
Phoebe Loh*
Rose Morales-Diaz*
Sulan Qi
Mounir Elharam*
Bart Ford*
Steve Shaull**
Doug White
Work-study Undergraduate students**
Reagents &
Equip. Maint.
Mounir Elharam*
Doug White
Clayton Powell**
Administration
KayLynn Hale
Dixie Wishnuck
Tami Womack
Mary Catherine Williams
Research Teams
Doris Kupfer
Julia Kim*
Sun So
Graham Wiley**
Limei Yang
Angie Prescott*
Audra Wendt**
Mandi Aycock**
Fu Ying
Liping Zhou
Ruihua Shi****
Junjie Wu****
Trang Do
Anh Do
Lily Fu
Yang Ye**
Tessa Manning**
Ziyun Yao***
Steve Shaull*
Youngju Yoon****
Stephan Deschamps***
Shelly Oommen****
Christopher Lau****
ShaoPing Lin***
Honggui Jia
Hongming Wu
Baifang Qin
Peng Zhang
Axin Hua***
Weihong Xu****
Yanhong Li
Funding from the NHGRI, Noble Foundation, DOE, NSF
Collaborators at Sanger, CWRU, CHOP, Keio, UIUC and Riken
Fares Najar***
Chunmei Qu
Keqin Wang
Shuling Li
Lin Song****
Ying Ni
Huarong Jiang
Jami Milam****
Sara Downard**
Ging Sobhraksha**
Phoebe Loh*
Sulan Qi
Bart Ford*
* Previous undergraduate res. student
** Present undergraduate res. student
*** Previous graduate student
**** Present graduate student
The ACGT Team
Peggy and Charles Stephenson Center