Todd D. Taylor, Ph.D.
Genome Annotation and Comparative Analysis Team
Computational and Experimental Systems Biology Group
RIKEN Genomic Sciences Center
taylor@gsc.riken.jp

Bioinformatics and Comparative Genome Analysis Course
Institut Pasteur Tunis - Tunisia
April 2, 2007

Human
• Chromosome 21 (Nature, May 2000): 17 of 33.5 Mb
• Chromosome 18p (Nature, September 2005): 16 Mb
• Chromosome 11q (Nature, March 2006): 81 Mb
~4-5% contribution to the Human Genome Project

Chimpanzee
• Chromosome 22q (Nature, May 2004): 33.5 Mb (syntenic to human chr21)
• Chromosome Y (Nature Genetics, January 2006)

• Development of novel methods for gene and promoter prediction
• Identifying genes missed by other high-throughput methods
• Identification of unique regulatory mechanisms

Looking for similarities
• Compare with distant species, like mouse
• Regions that are conserved may be important
Looking for differences
• Compare with close species, like primates
• Regions that are different may be important
Of course, there are exceptions to every rule!

[Figure: phylogeny from Amniota down to Homo. Clade labels: Amniota (amniotes), Mammalia, Eutheria (placentalia), Primates, Anthropoidea, Catarrhini, Hominoidea, Hominidae. Taxa: Homo, Pan, Gorilla, Pongo, gibbons, Old World monkeys, New World monkeys, prosimians, Lagomorpha, rodents, Metatheria, Prototheria, Sauropsida (Reptilia + Aves). Date labels: ~350 MYa, ~250 MYa, 5 MYa. Traits listed: heterodonty; mammary glands; homoeothermic; hair; placentation (in most), amnion, internal fertilization; sweat and sebaceous glands; anucleate red blood cells.]

34% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006

95% maps to identical sequence in human genome
Hiram Clawson and Kate Rosenbloom (UCSC). 09 June 2006

Nobrega, et al.
Science 302, 413 (2003)

• Size
• Intelligence
• Language
• Ageing
• Disease susceptibility
• Cancer
• Schizophrenia
• Autism
• Triplet expansion diseases
• AIDS
• Hepatitis

Newton, April 2002 issue
Science 295, 131-134 (2002)

• 1.23% substitution
• Number of simple repetitive sequences
• Insertion of Alu and L1 elements
• Unique sequences
• Local duplications
• Translocations
• Inversions
• Fewer CpG islands predicted in chimp

• Compare with small 'representative' human chromosome (21)
• Clone-based sequencing strategy
• Map chimp BAC-end sequences to human chr. 21
• Screen libraries for additional clones to fill gap regions
• 3 gaps, over 99% coverage

[Figure: sequence identity (scale 85% to 100%) between chimp Chr22 q-arm and human Chr21 q-arm, shown with 5 Mb and 1 Mb scale bars]

[Figure: two panels, HSA21q and PTR22q. Left axis: base changes or insertion size per bp (0.00 to 0.25). Right axis: insertion frequency per bp (0.0000 to 0.0050). X axis: position (Mb), 0 to 32. Series: base change, insertion size (bp), insertion frequency.]

Chimpanzee Sequencing & Analysis Consortium.
Nature (2005) 437:69-87

Overall: 1.44%
SINE/Alu: 1.81%    LINE/L1: 1.38%
CpG islands: 2.26%    Simple repeats: 4.06%

                      Base change    Insertion frequency
Base change           1.000          -
Insertion frequency   0.907          1.000
Insertion size        0.051          0.013

                                      HSA21q        PTR22q
Size (bp)                             33,102,702    32,799,845
# of sequence gaps                    14            22
Estimated total size of clone gaps    73,108        74,311
# of clone gaps                       3             2
Base content: G+C%                    40.94%        41.01%
CG dinucleotide                       361,259       358,450
CpG islands                           950           885

                   HSA21q                              PTR22q
Repeats            bp            #        ID#          bp            #        ID#
SINEs              3,647,427     15,574   15,131       3,614,185     15,481   9,551
  Young Alus *1    21,798        75       75           3,122         12       12
LINEs              5,848,427     13,758   8,731        5,737,082     13,671   6,223
  Young L1s *2     92,171        59       52           78,653        64       53
LTRs               3,612,930     9,975    7,269        3,551,044     9,838    5,324
DNA elements       949,215       4,169    3,363        943,348       4,187    2,887
RNAs               8,625         98       97           8,672         99       98
Satellite          17,246        23       20           14,773        20       17
Others             30,452        41       38           34,852        49       42
Total              14,114,322    43,638   34,649       13,903,956    43,345   24,142
                   (42.6%)                             (42.4%)
*1 AluYa5, AluYa8, AluYb8 and AluYb9
*2 L1HS and L1PA2

Family      Subfamily      HS21    PTR22
LINE/L1     L1HS           11      2
LTR/ERV1    HERVIP10FH     14      5
            MER41A-int     10      2
            MER4A1-int     5       0
            MER83B-int     11      0
            MER87          32      12
SINE/Alu    AluYa5         23      3
            AluYb8         37      2
            AluYb9         7       1
DNA/MER2    Tigger3        42      67
LTR/ERV1    LTR49-int      11      23
LTR/MaLR    MLT1E-int      0       5

Human-specific characteristics have been acquired during the 5 million years since the divergence between Pan and Homo.

[Figure: Phylogeny of Hominidae along a time axis: Pongo (Orangutan), Gorilla, Pan (Chimpanzee), Homo (Human); 5 to 6 MYa label. Photo labels: Orangutan, Gorilla, Chimpanzee, Human(?)]
[Figure: phylogeny along a time axis with labels Pongo (Orangutan), Outgroup, Gorilla, Pan (Chimpanzee), Homo and LCA]

[Figure: alignment of Homo, Pan, Gorilla and Orangutan sequences together with the LCA sequence (LCA: The Last Common Ancestor)]
ACGTGTTTGAAATATTACTGATTGTAA
ACGAGTTTGAAATATTATTGATTGTAA
ACGTGTTTGAATCATTATTGATTGTAA
ACGTGTTTAAATTATTATTGGTTGCAA
ACGTGTTTGAAATATTATTGATTGTAA

[Figure: HSA21q panel. Left axis: base changes or insertion size per bp (0.00 to 0.25). Right axis: insertion frequency per bp (0.0000 to 0.0050). X axis: 0 to 32. Species labels: Human, Chimpanzee, Gorilla, Orangutan.]

IN/DEL examination based on 10,292,002 finished sequences (RIKEN)

                                   total    PCR primers designable    good amplification for both*
insertion to the human sequence    267      158                       139
insertion to the chimp sequence    222      147                       128
Total                              489      305                       267
* positive amplification found for both chimp and human template DNA

Example 1: Deletion in Human Lineage
Example 2: Insertion in Human Lineage
Example 3: Deletion in Chimp Lineage
Example 4: Allelic Deletion in Chimp Lineage
[Figures: PCR gel images for each example with lanes grouped as Pt, Hs, Gg, Pp. Band-size labels as extracted: 1900, 4200, 2900, 980, 106, 106, 117, 2400, 4200, 1300, 129, 1200, 154.]

284 genes
• 223 known
• 19 novel CDS
• 25 novel transcripts
• 12 putative
• 5 predicted
85 pseudogenes

• We lacked information for 6 genes located in sequencing gaps
• 6 hsa21 genes are absent from the ptr22 sequence (H2BFS, 5 KAP genes from the 21q22.1 cluster)
• 4 hsa21 genes appear to be pseudogenes in chimp
• 3 ptr22 pseudogenes are absent from the hsa21 sequence
• 1 hsa21 pseudogene has a complete ORF in ptr22

• 83% of genes have at least one amino acid replacement
• 10% of the potential ptr22 proteins are predicted to have a different length
  – Amino acid insertion or deletion
  – Different start codon
  – Different stop codon
  – Other, more complex rearrangement

Shorter in chimp: ADAMTS5
Longer in chimp: C21orf30
• 17 bp deletion in chimpanzee
• Human and chimpanzee splice sites are different
• Splice-site diversity

[Figure: sequence identity; genes labeled FLJ32835, C21orf9, C21orf71, TCP10L, C21orf96]
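The outgroup logic behind the alignment and the four insertion/deletion examples above can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline used for the published analysis. It assumes that the first four aligned sequences on the slide are Homo, Pan, Gorilla and Orangutan in that order (the extracted slide does not state the row order), and the function name `assign_lineage` is hypothetical. A human-chimp difference is assigned to the human lineage when the outgroups agree with chimp, and to the chimp lineage when they agree with human.

```python
from typing import List, Sequence, Tuple


def assign_lineage(human: str, chimp: str,
                   outgroups: Sequence[str]) -> List[Tuple[int, str]]:
    """Assign each human-chimp difference to a lineage by simple parsimony.

    Returns (0-based column, lineage) pairs, where lineage is "human",
    "chimp" or "ambiguous".
    """
    if any(len(s) != len(human) for s in (chimp, *outgroups)):
        raise ValueError("all sequences must be aligned to the same length")
    calls = []
    for i, (h, c) in enumerate(zip(human, chimp)):
        if h == c:
            continue  # no human-chimp difference in this column
        out = {s[i] for s in outgroups}  # bases seen in the outgroups
        if c in out and h not in out:
            calls.append((i, "human"))   # chimp keeps the ancestral base
        elif h in out and c not in out:
            calls.append((i, "chimp"))   # human keeps the ancestral base
        else:
            calls.append((i, "ambiguous"))
    return calls


# Row order is an assumption: Homo, Pan, Gorilla, Orangutan.
human = "ACGTGTTTGAAATATTACTGATTGTAA"
chimp = "ACGAGTTTGAAATATTATTGATTGTAA"
gorilla = "ACGTGTTTGAATCATTATTGATTGTAA"
orangutan = "ACGTGTTTAAATTATTATTGGTTGCAA"

calls = assign_lineage(human, chimp, [gorilla, orangutan])
print(calls)  # [(3, 'chimp'), (17, 'human')]
```

Under that row-order assumption, the substitution at column 3 falls on the chimp lineage and the one at column 17 on the human lineage. Columns where gorilla or orangutan alone differ are ignored because human and chimp agree there.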
[Figure: x axis "The human chr21 genes ordered according to their chromosomal position"; y axis 0.0 to 2.5; position scale 0 to 34]

Human-specific replacements / Chimp-specific replacements

1. KIAA0184  2. COL6A2  3. HUNK  4. AGPAT3  5. DSCR3  6. PWP2H  7. STCH  8. SLC5A3  9. CHAF1B  10. SIM2
11. KCNE2  12. APP  13. C21orf98  14. C21orf61  15. IFNAR1  16. UBASH3A  17. TMPRSS3  18. DSCR1  19. C21orf7  20. ADARB1
21. TSGA2  22. IFNAR2  23. C21orf63  24. KCNE1  25. C21orf2  26. C21orf55  27. ATP5A  28. CLDN8  29. C21orf56  30. DNMTA1

1. BACE2  2. TIAM1  3. BACH1  4. FAM3B  5. C21orf33  6. ADAMTS1  7. C21orf103  8. ITGB2  9. HLCS  10. DNMT3L
11. IFNGR2  12. PPIA3L  13. C21orf59  14. MRPL39  15. CLDN17  16. KRTAP11-1  17. CCT8  18. DSCR2  19. TFF2  20. BTG3
21. HSF2BP  22. C21orf115

[Figure: histogram of gene frequency (0% to 35%) versus Ka / Ks (0 to >1.1)]
Chimpanzee Sequencing & Analysis Consortium. Nature (2005) 437:69-87

Correlate phenotype with genotype
Using Affymetrix arrays it could be shown that the amount of transcript per gene varies in a species-specific manner (Enard et al. 2001).
-> What DNA sequence differences are responsible for the observed differences in transcript levels?
[Figure: gene structure with transcription start site (TSS), 5' UTR, 3' UTR, promoter and enhancer]
• Transcriptional control
• RNA stability

[Figure: chromosome 21 genes by category: annotated genes, detected genes, upregulated (in human), downregulated (in human)]
237 genes annotated for chromosome 21
189 represented on the Affymetrix A-E arrays

189 annotated genes represented on the Affymetrix A-E arrays (Hellmann, Pääbo)
[Figure: expression in brain and liver, higher in chimp versus higher in human; genes labeled IFNAR2, IFNGR2, ETS2, ITSN, C21orf97, DSCR1, LSS, TTC3, CXADR]

• Identifying cis-regulatory elements in the human genome is a major challenge of the post-genomic era
  – Promoters and enhancers that regulate gene expression in normal and diseased cells and tissues
• Inter-species sequence comparisons have emerged as a major technique for identifying human regulatory elements
  – Particularly those to the sequenced mouse, chicken and fish genomes
• A significant fraction of empirically defined human regulatory modules are:
  – Too weakly conserved in other mammalian genomes, such as the mouse, to distinguish them from nonfunctional DNA
  – Completely undetectable in nonmammalian genomes
• Identification of such significantly divergent functional sequences will require complementary methods in order to complete the functional annotation of the human genome
• Deep intra-primate sequence comparison is a novel alternative to the commonly used distant species comparisons
• Non-coding sequences with primate-specific conservation include three regulatory elements
Nature (2003) 424:788-793

Conjoined gene: a fused transcript formed by combining the exons of two or more distinct genes (child genes)
[Figure: child gene A, child gene B and conjoined gene A-B, drawn as exons and introns]
• Transcript A-B combines at least one exon (complete or partial overlap) from both Gene A & Gene B
  – Usually only supported by a few mRNA/EST sequences, and rarely by a CCDS
• Currently, about 32 known cases found by searching NCBI Entrez (including 8 from chr 11 recently submitted by our group)

Chr1 SRP9-EPHX1 fusion (1 EST evidence: DA417873)
Alternate splicing and novel exons observed in fused mRNA
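The conjoined-gene criterion above (a transcript that shares at least one exon, by complete or partial overlap, with each of two child genes) can be written as a small interval test. The sketch below is only an illustration of that definition: the function names and all coordinates are hypothetical, exons are assumed to be 1-based inclusive intervals on the same chromosome and strand, and nothing here reflects the actual search procedure used by the group.

```python
from typing import List, Tuple

# (start, end), 1-based inclusive; same chromosome and strand assumed
Interval = Tuple[int, int]


def overlaps(a: Interval, b: Interval) -> bool:
    """True when two closed intervals share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]


def shares_exon(transcript_exons: List[Interval],
                gene_exons: List[Interval]) -> bool:
    """True when any transcript exon overlaps any exon of the gene."""
    return any(overlaps(t, g) for t in transcript_exons for g in gene_exons)


def is_conjoined(transcript_exons: List[Interval],
                 gene_a_exons: List[Interval],
                 gene_b_exons: List[Interval]) -> bool:
    """Candidate conjoined transcript: exon overlap with both child genes."""
    return (shares_exon(transcript_exons, gene_a_exons)
            and shares_exon(transcript_exons, gene_b_exons))


# Hypothetical toy coordinates, not real annotation.
gene_a = [(100, 200), (300, 400)]
gene_b = [(900, 1000), (1100, 1200)]
fused_mrna = [(100, 200), (350, 400), (950, 1000)]   # exons from A and B
gene_a_only = [(100, 200), (300, 400)]

print(is_conjoined(fused_mrna, gene_a, gene_b))    # True
print(is_conjoined(gene_a_only, gene_a, gene_b))   # False
```

A real screen would also need to handle strand, exclude transcripts explained by a single overlapping gene annotation, and weigh the number of supporting mRNA/EST sequences. Those steps are left out here.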
27% of conjoined genes conserved in Chimpanzee
6.5% of conjoined genes conserved in Mouse

                                                            Number
Number of mRNAs examined                                    456 (326 conjoined genes)
At least one exon* from both child genes conserved in:
  Chimpanzee mRNAs                                          125 (69 conjoined genes)
  Mouse mRNAs                                               30 (15 conjoined genes)
  Both Chimpanzee and Mouse mRNAs                           25 (11 conjoined genes)
* Exons considered were part of conjoined gene mRNAs

• RIKEN
  • Yoshiyuki Sakaki
  • Tulika P. Srivastava
  • Vineet K. Sharma
  • Asao Fujiyama
  • Masahira Hattori
  • Atsushi Toyoda
  • Yoko Kuroki
  • Yasushi Totoki
  • Hideki Noguchi
  • Hidemi Watanabe
  • Takehiko Itoh (MRI)
• Chimpanzee Chr 22 Sequencing Consortium
  • Chinese National Human Genome Center at Shanghai, China
  • KRIBB Genome Research Center, Daejeon, Korea
  • National Yang Ming University Genome Research Center, Taipei, Taiwan
  • National Institute of Genetics, Mishima, Japan
  • RIKEN Genomic Sciences Center, Yokohama, Japan
  • GBF, Dept. of Genome Analysis, Braunschweig, Germany
  • Institute for Molecular Biotechnology, Jena, Germany
  • Max-Planck Institute for Molecular Genetics, Berlin, Germany