Automation for Genomics Discovery at the Oklahoma Genome Center
Bruce A. Roe
Department of Chemistry and Biochemistry, University of Oklahoma, Norman, OK 73019
Working Innovation into the Drug Discovery Pipeline, June 3, 2004, Houston Marriott Medical Center

Central Dogma of Molecular Biology
Each chromosome contains hundreds of genes.
[Figure: chromosome, gene (DNA), transcribe to RNA, process/transport to mRNA, translate to protein; stable RNAs]

What is a GENOME?
For humans, it is the complete set of 23 chromosome pairs that we inherited from our parents. The human genome contains all the information needed to make a human.
Most bacteria have only a single chromosome that represents their genome and contains all the information needed to make that bacterium.

Human Genome Project Goals 1998-2003
• Achieve ~5-fold coverage of at least 90% of the genome in a “working draft” based on mapped clones, and finish one-third of the 3 billion base pair human genomic DNA sequence by the end of 2000
• Finish the complete human genome sequence by the end of April 2003, marking the 50th anniversary of the discovery of the double helix structure of DNA by Watson and Crick
• Make the sequence totally and freely accessible
• Reduce the cost of DNA sequencing to 25 cents/base over this 5-year period by developing new technologies
• Study human genome sequence variation by creating a Single Nucleotide Polymorphism (SNP) map with at least 100,000 markers

How Far Have We Come as of June 2004?
• Over 99% of the ~3.15 billion bases in the human genome had been sequenced to finished quality as of April 2003. All the data are available in the public databases.
• Ten human chromosomes (7, 9, 10, 13, 14, 19, 20, 21, 22, Y) have been annotated and published, and the remaining 14 are in the final phases of annotation.
• There are fewer than 400 gaps in the sequence of the 24 chromosomes (22 numbered chromosomes plus X and Y).
• The cost of completed genomic DNA sequencing is slightly less than 8 cents/finished base, owing to the development of improved automation.
• Three quality-checking exercises have been held, in which two groups checked the quality of another group's sequence both in silico and by re-sequencing.
http://www.ncbi.nlm.nih.gov/genome/seq/HsHome.shtml

How do we sequence DNA?
The process is similar to taking many copies of a newspaper, shredding them, and then trying to put together a copy of the original newspaper.
This is accomplished by breaking many copies of the DNA into small pieces and determining the order of the four bases in each of these small pieces.
Then we overlap the small sequenced pieces to obtain the sequence of the original, larger DNA.

Sequence Pipeline at the University of Oklahoma Genome Center, OU-ACGT
Pipeline stages, from DNA to GenBank:
• DNA shearing (Hydroshear™)
• Colony picking (QPixII™)
• Growing subclones (HiGro™)
• Subclone isolation I (Mini-Staccato™)
• Subclone isolation II (VPrep™)
• Thermocycling (ABI 9700)
• Sequencing (ABI 3700)
• Data assembly and analysis
• Closure
Supporting automation:
• AMS-90 for PCR product analysis
• Liquid handling
• Primer synthesis

Hydroshear
• GeneMachines, Inc., San Carlos, CA
• Precision-drilled ruby orifice
• 500 µl syringe pump
• Pump retraction speed range 0-40
• A 100 to 300 µl sample sheared at a retraction speed setting of 10 produces 1-4 Kbp DNA fragments

Genetix QPixII Colony Picker
• Digitizes colonies and picks in batches of 96 into 384-well plates
• Pins are sterilized after each set of 96 colonies is picked

Cell Growth in 384 well plates in a HiGro
• Capacity: 48 shallow 384-well plates or 24 deep-well plates.
• Cells are grown in TB medium supplemented with salts and antibiotic.
• Cells are shaken at 520 rpm for 22 hours at 37°C.
• After 3.5 hours, oxygen is added at 0.5 ft³/min for 0.5 second every 30 seconds.

Zymark SciClone with Twister II
• 384-tip pipettor
• 4 built-in shakers
• Robotic 384-well plate loader and stacker

Subclone Isolation I (Mini-Staccato)
• This Zymark robot has a 384-cannula array, four built-in shakers, three attached storage racks, built-in barcoding and a Twister II robotic arm.
• This automation has allowed us to perform the DNA isolation completely unattended from as many as 80 384-well plates of bacterial cells per day.

The initial lysis solution (NaOH and SDS) is added to each of four 384-well plates containing bacterial cells that were loaded onto the built-in shakers incorporated into the SciClone workspace deck.

The second solution, TE-RNase A, is added to each of the 384-well plates, which are again shaken on the four auto-centering magnetic shakers on the SciClone workspace deck.

Once all three lysis solutions have been added, with the plates shaken after each addition, the plates are transferred from the SciClone workspace deck to a storage rack by the Twister II robotic arm.
Fluorescent DNA Sequencing
[Figure labels: laser, detector; example read ACGTACACGTTCGG]
• A dye terminator-labeled nested fragment set of DNA copies is made from a template with unknown sequence in a single reaction tube
• Reaction products are applied to a single gel lane or capillary and electrophoresed to separate the nested fragment set
• The sequence information is fed into a computer

Subclone Isolation and Sequencing Reaction Pipetting (Velocity 11 VPrep)
• Liquid handling station with 384-channel pipettor head
• Four movable shelves on either side of the pipettor head
• Used for subclone isolation, sequencing reaction set-up and, as shown here, the ethanol-acetate precipitation clean-up step

Thermocycling (ABI 9700)
Subclone sequencing conditions:
• 95°C for 2:00
• 60 cycles of 95°C for 0:30, 50°C for 0:20 and 60°C for 4:00
• 4°C hold (∞)

Capillary Electrophoresis DNA Sequencing
• Our present capacity is fourteen 96-capillary ABI 3700 capillary electrophoresis-based DNA sequencing instruments, each capable of analyzing two 384-well thermocycle plates or eight 96-well thermocycle plates per day.
• The DNA sequencing data are transferred to the Sun computer workgroup for base calling (Phred), assembly (Phrap) and analysis (Consed).

Primer synthesis (Mermade IV) for PCR-based closure and finishing
• Standard phosphoramidite chemistry in an argon-filled reaction chamber.
• 192 primers synthesized at 2.5 nmole scale, twice each day.
• 2.5 nanomole synthesis (50 cents/oligo) typically is used for either PCR or DNA sequencing primers, but can be scaled to 10 nanomole.
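
The thermocycling, sequencing and primer synthesis figures quoted above can be turned into rough daily numbers. The sketch below is only back-of-the-envelope arithmetic, not part of the Center's software. It assumes that temperature ramp times can be ignored (none are given), that the "two 384-well plates per day" figure applies to each of the fourteen instruments, and that every well yields one read. All function and constant names are hypothetical.

```python
# Back-of-the-envelope throughput arithmetic from the figures on the slides.
# Assumptions (not stated in the source): ramp times are ignored, the plate
# capacity is per instrument, and each well gives one sequence read.

INITIAL_DENATURE_S = 2 * 60            # 95 C for 2:00
CYCLE_STEPS_S = [30, 20, 4 * 60]       # 95 C 0:30, 50 C 0:20, 60 C 4:00
CYCLES = 60


def thermocycle_minutes():
    """Programmed hold time of one cycle-sequencing run, in minutes."""
    total_s = INITIAL_DENATURE_S + CYCLES * sum(CYCLE_STEPS_S)
    return total_s / 60


def reads_per_day(instruments=14, plates_384_per_instrument=2):
    """Upper bound on sequencing reads per day across all ABI 3700s."""
    return instruments * plates_384_per_instrument * 384


def primers_per_day(primers_per_run=192, runs_per_day=2, cost_per_oligo=0.50):
    """Number of primers synthesized per day and their cost in dollars."""
    count = primers_per_run * runs_per_day
    return count, count * cost_per_oligo


if __name__ == "__main__":
    print(f"Thermocycling: {thermocycle_minutes():.0f} min per run")
    print(f"Sequencing:    {reads_per_day()} reads per day")
    count, cost = primers_per_day()
    print(f"Primers:       {count} per day, ${cost:.2f}")
```

Under these assumptions one thermocycler run takes a little under five hours of programmed time, which is consistent with each plate being cycled once per day before it is loaded onto a sequencer.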
Data assembly and Analysis
• Sun V880 server with 32 GB RAM running Solaris 8 OS, and 3 TB of data stored on RAID-5 arrays with autoloader tape backup
• Phred/Phrap/Consed, Exgap
• Also: 12 workstations, each with 1 GB RAM

Sanger, Keio, Wash U, OU

Human Chromosome 22 Sequence Features
• 39% of the sequence is occupied by genes, including their introns and 5’ and 3’ non-translated regions.
• 3% of the complete sequence encodes the protein products of these genes.
• 42% of the sequence is composed of repetitive sequences, compared to 46% for the entire genome.
• Only slightly over half of the genes predicted for human chromosome 22 can be experimentally validated.*
* Shoemaker DD, et al. Experimental annotation of the human genome using microarray technology. Nature 409, 922-7 (2001).

An Individual’s Genome Differs from the DNA of:
• Siblings by 1 to 2 million bases, ~99.98% identical, with coding regions 99.99999% identical
• Unrelated humans by 6 million bases, ~99.8% identical overall, with coding regions 99.9999% identical
• Chimpanzees by about 100 million base pairs, ~98% identical
• Baboons by about 300 million base pairs, ~92% identical
• Mice by about 2.8 billion bases, but coding regions are ~90% identical
• Leaf spinach by about 2.9 billion bases, but coding regions are ~40% identical

Differences between individuals
AGCCACACAGTGTCCACCGGATGGTTGATTTTGAAGCAGAGTT
AGCTTGTCACCTGCCTCCCTTTCCCGGGACAACAGAAGCTGAC
CTCTTTGNTCTCTTGCGCAGATGATGAGTCTCCGGGGCTCTAT
GGGTTTCTGAATGTCATCGTCCACTCAGCCACTGGATTTAAGC
AGAGTTCAAGTAAGTACTGGTTTGGGGAGNAGGGTTGCAGCGG
CNGAGCCAGGGTCTCCACCCAGGAAGGACTNATCGGGCAGGGT
GTGGGGAAACAGGGAGGTTGTTCAGATGACCACGGGACACCTT
TGACCCTGGCCGCTGTGGAGTGTTTGTGCTGGTTGATGCCTTC
TGGGTGTGGAATTGTTTTTCCCGGAGTGGCCTCTGCCCTCTCC
CCTAGCCTGTCTCAGATCCTGGGAGCTGGTGAGCTGCCCCCTG
CAGGTGGATCGAGTAATTGCAGGGGTTTGGCAAGGACTTTGAC
AGACATCCCCAGGGGTGCCCGGGAGTGTGGGGTCCNAGCCAG
The yellow underlined sequence is the first exon of the BCR gene involved in leukemia.
Only 5 bases (N) differ in non-gene regions.

Human Chromosome 22 Single Nucleotide Polymorphisms*
Number of overlaps         335
Size of overlaps           13,203,147 bp
Number of SNPs             11,116 (~1/1000 bp)
Number of substitutions    9,123 (82%)
Number of ins/del          1,993 (18%)
Only 48 of the 11,116 SNPs were in coding regions, ~10-fold lower than in non-coding regions.
* E. Dawson, et al. A SNP Resource For Human Chromosome 22: Extracting Dense Clusters of SNPs from the Genomic Sequence. Genome Research 11, 170-178 (2001).

“We each are like a different symphony orchestra”
“All playing the same instruments slightly differently”

Good news and Bad news
• Good news
  • <40,000 genes (counting dark space?)
• Bad news
  • 2-4 times as many proteins as other species due to extensive alternative splicing in humans.
  • We only know the function of about half the predicted genes.
  • Likely > 1 million different gene products based on alternative splicing and post-translational modifications.

Where we stand now
• We essentially have the ‘dictionary’ with all the words (genes) spelled correctly, but only slightly more than half of the words (genes) have definitions.
• Slightly over half of the 936 genes predicted for human chromosome 22 have been experimentally validated.
  • 223 have a known function and expression
  • 172 have no known function but evidence for expression
  • 182 have no known function and no evidence for expression
  • 228 pseudogenes
• Through comparative genomic sequencing we can annotate the human genome based on evolutionarily conserved gene sequences and use model systems to study gene expression.

If a genomic region is conserved in evolutionarily distant organisms, it is present because the region has been maintained through selective pressure over evolutionary time, likely because it performs a necessary function.
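
The chromosome 22 SNP figures quoted earlier in this section can be checked with a few lines of arithmetic. The sketch below uses only the counts on the SNP slide. It assumes that every SNP is either a substitution or an insertion/deletion, so the ins/del count is taken as the remainder of the total. The names are hypothetical.

```python
# Consistency check on the chromosome 22 SNP counts from the slide.
# Assumption: ins/del = total SNPs - substitutions (no other category).

OVERLAP_BP = 13_203_147    # total size of the clone overlaps examined
TOTAL_SNPS = 11_116
SUBSTITUTIONS = 9_123
CODING_SNPS = 48


def snp_summary():
    """Return SNP density and category percentages as a dict."""
    indels = TOTAL_SNPS - SUBSTITUTIONS
    return {
        "bp_per_snp": OVERLAP_BP / TOTAL_SNPS,
        "substitution_pct": 100 * SUBSTITUTIONS / TOTAL_SNPS,
        "indel_count": indels,
        "indel_pct": 100 * indels / TOTAL_SNPS,
        "coding_pct": 100 * CODING_SNPS / TOTAL_SNPS,
    }


if __name__ == "__main__":
    for name, value in snp_summary().items():
        print(f"{name}: {value:.2f}")
```

The density comes out at roughly one SNP per 1,200 bp, in line with the "~1/1000 bp" on the slide, and fewer than half a percent of the SNPs fall in coding regions.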
Chimpanzee and Baboon Genomic Sequencing
• Medically important model eukaryotic organisms
• The chimpanzee is our nearest evolutionary relative, with a genome that has ~98% sequence identity with the human genome
• The baboon genome has ~92% sequence identity with the human genome

PIP Plot of a region of human chr22 compared to syntenic regions of baboon and mouse
[Figure annotations: human-specific repeat regions; questionable gene present in primates but not in rodents; 34 Kbp deletion in baboon]

Exons in one copy of a zebrafish duplicated gene with 75% homology to human but greatly diverged, <50% homology, in the other copy

A complementary approach is to determine if the predicted protein coding conserved elements are functional by investigating their expression profiles during development.

Whole mount in situ hybridization using zebra fish as the model organism
Small people that swim in the water and breathe through gills…
Han Wang, OU

Zebrafish as a model system
• Have a short, ~3 month, time to reproductive maturity.
• Can be easily bred in the lab in large numbers.
• Are small in size - an adult is just a few centimeters long.
• Have an ~5 day embryonic development period from fertilized egg to a swimming fish.
• The embryos are transparent, making it easy to see internal organs during development.
• Are well established as a resource for genetic studies.
• The Sanger Institute is completing the genome sequence, which presently is ~50% complete and publicly available.
• More than 90% of the predicted human genes have a zebra fish ortholog.

Whole mount in situ hybridization
[Figure labels: mRNA; DIG-labeled ssDNA or RNA probe (digoxigenin-labeled uridine); alkaline phosphatase-conjugated anti-DIG antibody; BCIP* + NBT**; wash steps]
1. Add digoxigenin-labeled probe complementary to RNA of interest
2. Add alkaline phosphatase-conjugated antibody that binds to digoxigenin
3. Add BCIP + NBT, which turns into a dark purple dye when dephosphorylated by the alkaline phosphatase, thereby coloring the cell
*BCIP = 5-bromo-4-chloro-indoxyl phosphate
**NBT = nitro-blue-tetrazolium

Exon-specific ssDNA primers
• Mermade synthesis of unique exon-specific primers of the gene of interest
• PCR off zebra fish genomic DNA
• Followed by unidirectional amplification with either forward or reverse (nested) primers in the presence of DIG-labeled dUTP
• ssDNA (sense and antisense probes)
These steps now have been automated in a 96-well format.

Ethidium bromide stained 1% agarose gel of dsPCR off genomic DNA and subsequently unidirectionally amplified single stranded DNA probes
[Gel lanes: size markers (1078, 603, 310), then five sets of PCR, F, R]
• These studies clearly demonstrate that, contrary to popular belief, single stranded DNA contains regions that fold into sufficient double stranded secondary structure for ethidium bromide to bind.
• However, agarose gel electrophoresis is labor intensive (slab gel preparation and loading), electrophoresis is time consuming, and detection typically requires the use of carcinogenic ethidium bromide.

AMS-90 for ssPCR primer, dsPCR and single strand unidirectional exon amplification

PCR and Unidirectional Single Primer Amplification on the AMS-90
[Figure: four sample sets, each with a ds PCR product lane and single strand uni-directional product lanes (F, R); size ladder in bases: 7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15]
Both double and single stranded DNA can be rapidly resolved, detected and archived on the AMS-90.

Custom MerMade Synthesized 20-mer DNA Primers Rapidly Analyzed on the AMS-90
[Figure: size ladder in bases: 7000, 4900, 2900, 1900, 1100, 700, 500, 300, 100, 15; lanes in µg/µl: 2.0, 1.0, 0.5, 0.25, 0.12, 0.06 (decreasing 20-mer concentration), then 0.06, 0.12, 0.25, 0.5, 1.0, 2.0 (increasing 20-mer concentration)]
Rapid analysis of single stranded oligonucleotides: 30 seconds/lane run time vs. over an hour/sample via capillary electrophoresis.

AMS-90 vs Ethidium Bromide Stained Agarose Gels or Capillary Electrophoresis
• Both can be used to resolve and view both double stranded and single stranded DNAs
• However, analysis on the AMS-90:
  • requires minimal human interaction,
  • requires no separate photography,
  • requires much less technician time,
  • eliminates the use of carcinogenic ethidium bromide,
  • is less error prone and
  • takes much less time.
Human hypothetical protein KIAA0819
One gene with 11 exons on Hu Chr 22
This one gene is split into 2 genes in zebra fish
• ZF1 - Genomic location: 307,280-316,461 bp on Sanger Institute chromosome fragment ctg14067
  • With the first 4 exons
• ZF2 - Genomic location: 107,344-119,287 bp on Sanger Institute chromosome fragment ctg11065
  • With the remaining 7 exons
Note: 4 + 7 = 11

A multiPIP analysis of the predicted genes from human, rat, mouse, fugu and zebra fish (ZF1 and ZF2) with homology to cDNA probe KIAA0819
[Figure: percent identity axis, 50% to 100%]

Orthologous duplicated copies of a single copy human KIAA0819 gene in zebra fish
[Figure: single human kiaa0819 gene; two zebra fish kiaa0819 gene orthologs, ZF1 and ZF2]

Whole mount in situ hybridization of ssDNA probes for the ZF1 gene
[Figure panels: antisense probe, sense probe, no probe; 24 hpf, 48 hpf, 120 hpf]
Only antisense probe hybridization to the Otic Placode

Expression of ZF1 Gene in the Otic Placode
Five sensory patches develop from the embryonic ear: three cristae associated with the semicircular canals and two maculae associated with the otoliths.

Whole mount in situ hybridization of a ssDNA probe unique to the ZF2 gene at 24 and 48 hpf
[Figure panels: antisense probe and sense probe at 24 hpf and 48 hpf; labels: hindbrain, forebrain, otic placode, pectoral fin]
Only antisense probe hybridization to the hindbrain, forebrain, Otic Placode and pectoral fin

ZF2, 48 hpf
[Figure labels: hindbrain, otic placodes, pectoral fin]
Expression of ZF2 is seen in the edge of the otic placode with no defined sensory patches, and in the budding pectoral fin.
Expression analysis shows functional divergence after duplication in ZF1 and ZF2
• ZF1 is expressed only in the otic placode, seen at 24-120 hpf
• ZF2 is expressed in the hindbrain, otic placode and the pectoral fin, with the expression in the otic placode differing from that of ZF1
• It is highly likely that the one gene in humans is expressed in the developing ear and brain and is involved in early limb development

Whole mount in situ hybridization of a ssDNA probe for Human Gene: NM_032775-ENSG00000185214
On Hu Chr 22 at positions 19,120,360 - 19,174,676 (no expression-confirming ESTs)
[Figure panels: antisense probe and sense probe at 120 hpf and 160 hpf; labels: otic placode, swim bladder]
Only antisense probe hybridization to the Otic Placode and swim bladder

Summary of in situ hybridization studies:
[Table columns: Gene; Antisense probe; Sense probe; ESTs]
Dj508I15.c22.5: Brain
Phf5a-like gene: Brain
KIAA0819-ZF1: Otic placode
KIAA0819-ZF2: Hind brain, Otic placode, and pectoral fin
NM_032775: Otic placode and swim bladder
DGCR8, AP000553.6: Hind brain, Hind brain and Branchial arches, pectoral fin Heart, and pectoral fin Notochord, liver Notochord Hind brain, and Otic placode
ESTs: + + + + -
3 out of 7 are predicted genes with no previous evidence for expression

Conclusions:
• It now is clear that there are large conserved sequence regions among evolutionarily distant organisms ranging from humans to fish. If these regions are conserved, the function of the encoded genes also likely is conserved.
• The zebra fish is an ideal system in which to investigate protein expression profiles for genes that are human orthologs.
• All aspects of this work have been and will continue to be improved by automation.

What’s next for our Genome Center?
• Participate in sequencing the mouse, chimp, baboon, lemur, bovine, dog, cat, chicken and zebra fish genomes, concentrating on:
  • Regions of high biological interest and
  • Regions orthologous to human chromosome 22
• Sequence the Medicago truncatula (a model legume closely related to alfalfa) genome using a mapped BAC-based approach concentrating on coding regions
• Continue sequencing selected pathogenic bacteria
• Investigate the function of the predicted genes with unknown function in the zebrafish system, first by whole mount in situ hybridization and then by expression knock-down experiments with morpholino oligos, once robust, automated methods have been developed.

Laboratory Organization
Bruce Roe, PI

Support Teams
Informatics, Production, DNA Synthesis: Jim White, Steve Kenton, Hongshing Lai, Sean Qian***, Phoebe Loh*, Rose Morales-Diaz*, Sulan Qi, Mounir Elharam*, Bart Ford*, Steve Shaull**, Doug White, Work-study Undergraduate students**
Reagents & Equip. Maint.: Mounir Elharam*, Doug White, Clayton Powell**
Administration: KayLynn Hale, Dixie Wishnuck, Tami Womack, Mary Catherine Williams

Research Teams
Doris Kupfer, Julia Kim*, Sun So, Graham Wiley**, Limei Yang, Angie Prescott*, Audra Wendt**, Mandi Aycock**, Fu Ying, Liping Zhou, Ruihua Shi****, Junjie Wu****, Trang Do, Anh Do, Lily Fu, Yang Ye**, Tessa Manning**, Ziyun Yao***, Steve Shaull*, Youngju Yoon****, Stephan Deschamps***, Shelly Oommen****, Christopher Lau****, ShaoPing Lin***, Honggui Jia, Hongming Wu, Baifang Qin, Peng Zhang, Axin Hua***, Weihong Xu****, Yanhong Li, Fares Najar***, Chunmei Qu, Keqin Wang, Shuling Li, Lin Song****, Ying Ni, Huarong Jiang, Jami Milam****, Sara Downard**, Ging Sobhraksha**

* Previous undergraduate res. student
** Present undergraduate res. student
*** Previous graduate student
**** Present graduate student

Funding from the NHGRI, Noble Foundation, DOE, NSF
Collaborators at Sanger, CWRU, CHOP, Keio, UIUC and Riken

The ACGT Team
Peggy and Charles Stephenson Center