Genome Projects and Gene Hunting

Wen-chang Lin
Institute of Biomedical Sciences, Academia Sinica
Taipei, Taiwan R.O.C.
E-mail: wenlin@ibms.sinica.edu.tw
Http://www.ibms.sinica.edu.tw/~wenlin

The Human Genome Project is an ambitious effort to understand the hereditary instructions that make each of us unique. The goal of this effort is to find the location of the 100,000 or so human genes and to read the entire genetic script, all 3 billion bits of information, by the year 2005.

What is the Human Genome Project?

The Human Genome Project (HGP) is an international research program designed to construct detailed genetic and physical maps of the human genome, to determine the complete nucleotide sequence of human DNA, to localize the estimated 50,000-100,000 genes within the human genome, and to perform similar analyses on the genomes of several other organisms used extensively in research laboratories as model systems.

The scientific products of the HGP will comprise a resource of detailed information about the structure, organization and function of human DNA, information that constitutes the basic set of inherited "instructions" for the development and functioning of a human being.

Successfully accomplishing these ambitious goals will demand the development of a variety of new technologies. It will also necessitate advanced means of making the information widely available to scientists, physicians, and others in order that the results may be rapidly used for the public good. Improved technology for biomedical research will thus be another important product of the HGP.

From the inception of the HGP, it was clearly recognized that acquisition and use of such genetic knowledge would have momentous implications for both individuals and society and would pose a number of policy choices for public and professional deliberation.
Analysis of the ethical, legal, and social implications of genetic knowledge, and the development of policy options for public consideration, are therefore yet another major component of the human genome research effort.

Specific Goals (Phase I 1993-1998)

Genetic Map
• Complete the 2-5 cM map by 1995
• Develop technology for rapid genotyping
• Develop markers that are easier to use
• Develop new mapping technologies

Physical Map
• Complete an STS map of the human genome at a resolution of 100 kb

DNA Sequencing
• Develop efficient approaches to sequencing one- to several-megabase regions of DNA of high biological interest.
• Develop technology for high throughput sequencing, focusing on systems integration of all steps from template preparation to data analysis.
• Build up sequencing capacity to a collective rate of 50 Mb per year by the end of the period. This rate should result in an aggregate of 80 Mb of DNA sequence completed by the end of FY 1998.

Gene Identification
• Develop efficient methods of identifying genes and for placement of known genes on physical maps or sequenced DNA.

Technology Development
• Substantially expand support of innovative technological developments as well as improvements in current technology for DNA sequencing and to meet the needs of the Human Genome Project as a whole.

Model Organisms
• Finish an STS map of the mouse at 300 kb resolution
• Finish the sequence of the E. coli and S. cerevisiae genomes by 1998 or earlier
• Continue sequencing C. elegans and Drosophila genomes, with the aim of bringing C. elegans to near completion by 1998
• Sequence selected segments of mouse DNA side by side with corresponding human DNA in areas of high biological interest

Informatics
• Continue to create, develop and operate databases and database tools for easy access to data, including effective tools and standards for data exchange and links among databases
• Consolidate, distribute and continue to develop effective software for large-scale genome projects
• Continue to develop tools for comparing and interpreting genome information

Ethical, Legal and Social Implications (ELSI)
• Continue to identify and define issues and develop policy options to address them
• Develop and disseminate policy options regarding genetic testing services with widespread potential use
• Foster greater acceptance of human genetic variation
• Enhance and expand public and professional education that is sensitive to sociocultural and psychological issues

Training
• Continue to encourage training of scientists in interdisciplinary sciences related to genome research

Technology Transfer
• Encourage and enhance technology transfer both into and out of centers of genome research

Outreach
• Cooperate with those who would set up distribution centers for genome materials.
• Share all information and materials within 6 months of their development. This should be accomplished by submission to public databases or repositories, or both, where appropriate.

Specific Goals (Phase II 1998-2003)

Goal 1--The Human DNA Sequence
a) Finish the complete human genome sequence by the end of 2003.
b) Finish one-third of the human DNA sequence by the end of 2001.
c) Achieve coverage of at least 90% of the genome in a working draft based on mapped clones by the end of 2001.
d) Make the sequence totally and freely accessible.
Goal 2--Sequencing Technology
a) Continue to increase the throughput and reduce the cost of current sequencing technology.
b) Support research on novel technologies that can lead to significant improvements in sequencing technology.
c) Develop effective methods for the advanced development and introduction of new sequencing technologies into the sequencing process.

Goal 3--Human Genome Sequence Variation
a) Develop technologies for rapid, large-scale identification or scoring, or both, of SNPs and other DNA sequence variants.
b) Identify common variants in the coding regions of the majority of identified genes during this 5-year period.
c) Create an SNP map of at least 100,000 markers.
d) Develop the intellectual foundations for studies of sequence variation.
e) Create public resources of DNA samples and cell lines.

Goal 4--Technology for Functional Genomics
a) Develop cDNA resources.
b) Support research on methods for studying functions of non-protein-coding sequences.
c) Develop technology for comprehensive analysis of gene expression.
d) Improve methods for genome-wide mutagenesis.
e) Develop technology for global protein analysis.

Goal 5--Comparative Genomics
a) Complete the sequence of the C. elegans genome in 1998.
b) Complete the sequence of the Drosophila genome by 2002.
c) The mouse genome.
   1) Develop physical and genetic mapping resources.
   2) Develop additional cDNA resources.
   3) Complete the sequence of the mouse genome by 2005.
d) Identify other model organisms that can make major contributions to the understanding of the human genome and support appropriate genomic studies.

Goal 6--Ethical, Legal, and Social Implications (ELSI)

U.S. Human Genome Project Funding ($ Millions)

FY     DOE    NIH*    U.S. Total
1988   10.7   17.2    27.9
1989   18.5   28.2    46.7
1990   27.2   59.5    86.7
1991   47.4   87.4    134.8
1992   59.4   104.8   164.2
1993   63.0   106.1   169.1
1994   63.3   127.0   190.3
1995   68.7   153.8   222.5
1996   73.9   169.3   243.2
1997   77.9   188.9   266.8
1998   85.5   217.7   303.2  (NT$9,780,500,000)
1999   89.8   225.7   315.5

Current Progress (Mar. 24, 2000)

Finished sequence: 561,973 kb (17.5% of genome)
Draft sequence: 2,020,129 kb (62.9% of genome)

Breakdown by Chromosome

Chr     Effective size (kb)   Sequence done (kb)   Percent finished   Number of contigs   Longest contig (kb)
1       263000                26571                10.1%              154                 928
2       255000                23193                9.1%               109                 695
3       214000                10417                4.9%               59                  746
4       203000                12521                6.2%               99                  393
5       194000                15679                8.1%               94                  739
6       183000                45668                25.0%              305                 3926
7       171000                81476                47.6%              298                 2094
8       155000                8730                 5.6%               42                  1902
9       145000                4839                 3.3%               30                  1010
10      144000                6091                 4.2%               36                  469
11      144000                8398                 5.8%               63                  817
12      143000                24509                17.1%              99                  1526
13      98000                 2143                 2.2%               7                   1416
14      93000                 29775                32.0%              106                 1450
15      89000                 2196                 2.5%               17                  297
16      98000                 19372                19.8%              118                 512
17      92000                 28861                31.4%              129                 1101
18      85000                 3734                 4.4%               20                  349
19      67000                 15021                22.4%              144                 1008
20      72000                 25825                35.9%              137                 1187
21      39000                 25851                66.3%              72                  7223
22      34491                 33620                97.5%              12                  23051
X       164000                65513                39.9%              347                 949
Y       35000                 6934                 19.8%              27                  1104
total   3180491               528043               16.6%              2532

Chromosome 22

The completed sequence covers 33.4 Mb of 22q with 11 gaps and has been estimated to be accurate to less than 1 error in 50,000 bases, by internal and external checking exercises. The largest contiguous segment stretches over 23 Mb. From our gap-size estimates, we calculate that we have completed 33,464 kb of a total region spanning 34,491 kb and that therefore the sequence is complete to 97% coverage of 22q.

545 genes; 134 pseudogenes.

http://www.ornl.gov/hgmis/

3,000 ~ 4,000 genes
http://www.ncbi.nlm.nih.gov/disease/

Completed Genomes

Organism                                  Genome Size (Mb)   Estimated Genes
Caenorhabditis elegans                    100
Saccharomyces cerevisiae                  12.1               6034
Escherichia coli                          4.6                4288
Bacillus subtilis                         4.2                ~4000
Synechocystis sp.                         3.6                3168
*Archaeoglobus fulgidus                   2.2                2471
*Pyrobaculum aerophilum                   2.2                N.A.
Haemophilus influenzae                    1.8                1740
*Methanobacterium thermoautotrophicum     1.8                1855
Helicobacter pylori                       1.7                1590
*Methanococcus jannaschii                 1.7                1692
*Aquifex aeolicus                         1.5                1508
Borrelia burgdorferi                      1.3                863
Treponema pallidum                        1.1                1234
Mycoplasma pneumoniae                     0.8                677
*Mycoplasma genitalium                    0.6                470
*Mycoplasma genitalium                    0.6                470
Treponema pallidum                        1.14
Chlamydia trachomatis                     1.05
Plasmodium falciparum Chr2                1
Rickettsia prowazekii                     1.1
Helicobacter pylori                       1.64
Leishmania major Chr1                     .27

The TIGR Microbial Database provides links to world-wide genome sequencing projects completed and underway, including the completed TIGR genomes: Archaeoglobus fulgidus, Borrelia burgdorferi, Deinococcus radiodurans, Haemophilus influenzae, Helicobacter pylori, Methanococcus jannaschii, Mycobacterium tuberculosis, Mycoplasma genitalium, Thermotoga maritima, and Treponema pallidum.

In the last few decades, advances in molecular biology and the equipment available for research in this field have allowed the increasingly rapid sequencing of large portions of the genomes of several species. In fact, to date, several bacterial genomes, as well as those of some simple eukaryotes (e.g., Saccharomyces cerevisiae, or baker's yeast) have been sequenced in full. The Human Genome Project, designed to sequence all 24 of the human chromosomes, is also progressing. Popular sequence databases, such as GenBank and EMBL, have been growing at exponential rates. This deluge of information has necessitated the careful storage, organization and indexing of sequence information. Information science has been applied to biology to produce the field called Bioinformatics.

The most pressing tasks in bioinformatics involve the analysis of sequence information.
Computational Biology is the name given to this process, and it involves the following:
• Finding the genes in the DNA sequences of various organisms.
• Developing methods to predict the structure and/or function of newly discovered proteins and structural RNA sequences.
• Clustering protein sequences into families of related sequences and the development of protein models.
• Aligning similar proteins and generating phylogenetic trees to examine evolutionary relationships.

Simple Mathematics:

Human Genome: 3 x 10^9 bp
Human Genes (5% of the genome): 100,000 genes
In a given cell type at a certain stage, it is estimated that around 20% of the genes are transcribed or expressed: 20,000 genes

[Figure: Automatic sequencer]

The Growth of GenBank sequence database in the past 10 years.

Release   Year   Base pairs       Entries
58        1988   24,690,876       21,248
62        1989   37,183,950       31,229
66        1990   51,306,092       41,057
70        1991   77,337,678       58,952
74        1992   120,242,234      97,084
80        1993   163,802,597      150,744
86        1994   230,485,928      237,775
92        1995   425,860,958      620,765
98        1996   730,552,938      1,114,581
104       1997   1,258,290,513    1,891,953
110       1998   2,162,067,871    3,043,729
115       1999   4,653,932,745    5,354,511

Gene Expression Studies

GenBank Overview

What is GenBank?

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences (Nucleic Acids Research 1998 Jan 1;26(1):1-7). There are approximately 2,162,000,000 bases in 3,044,000 sequence records as of December 1998. As an example, you may view the record for the neurofibromatosis gene. The complete release notes for the current version of GenBank are available. A new release is made every two months. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.
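The growth figures in the GenBank release table can be turned into a rough doubling time. The sketch below is only an illustration: it assumes smooth exponential growth between the first and last rows of the table, and the function name doubling_time_years is a hypothetical helper, not part of any GenBank tool.

```python
import math

# (year, base pairs) pairs copied from the GenBank release table in these notes.
GENBANK_BASES = [
    (1988, 24_690_876), (1989, 37_183_950), (1990, 51_306_092),
    (1991, 77_337_678), (1992, 120_242_234), (1993, 163_802_597),
    (1994, 230_485_928), (1995, 425_860_958), (1996, 730_552_938),
    (1997, 1_258_290_513), (1998, 2_162_067_871), (1999, 4_653_932_745),
]


def doubling_time_years(first, last):
    """Average doubling time, assuming exponential growth between two rows."""
    (year0, bases0), (year1, bases1) = first, last
    # Continuous growth rate per year: ln(N1 / N0) / elapsed years.
    growth_rate = math.log(bases1 / bases0) / (year1 - year0)
    return math.log(2) / growth_rate


overall = doubling_time_years(GENBANK_BASES[0], GENBANK_BASES[-1])
print(f"Average doubling time 1988-1999: {overall:.2f} years")
```

Under this assumption the database doubled roughly every year and a half over the period shown, which is consistent with the exponential growth mentioned earlier.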
Submissions to GenBank

Many journals require submission of sequence information to a database prior to publication so that an accession number may appear in the paper. NCBI has a WWW form, called BankIt, for convenient and quick submission of sequence data. The beta-test version of Sequin, NCBI's new stand-alone submission software for MAC, PC, and UNIX platforms, is available by FTP. When using Sequin, the output files for direct submission should be sent to GenBank by electronic mail. Alternatively, the data files may be copied to a floppy disk and mailed to NCBI. Authorin, an older stand-alone program for MACs and PCs, can still be used to format your submission, although submitters are encouraged to switch to either BankIt or Sequin.

Searching GenBank

Text and Similarity searching
• Entrez Browser: GenBank (nucleotides and proteins), PubMed (MEDLINE), 3D structures, genomes, and taxonomy databases.
• BLAST Sequence Similarity Searching: Nucleotide or protein query sequences against the specified database using the BLAST suite of algorithms.
• dbEST: Searching dbEST (Database of Expressed Sequence Tags).
GenBank nr database:

>gi|216185|dbj|D00635|ABCADHCC Acetobacter polyoxogenes genes for alcohol dehydrogenase, cytochrome c, complete cds
GAATTCCGAACTATCCGTTTCATTGCTTATGCGACAGCATGTTCACTTTTTAGTGAGGCTGAACACTAAA
ATGTCAGGAGACGAGCGTGCTAGCCTCAGTATGTTGCCATGAAACGGACCACCTGCTTTGTCTTTCCTGC
CTGAAGCCGGTTTCTGTCTGGCCGGAAAAGAAGCGCTAGCGCGTTTTTTTGCCGGATACATTCAGAAAGC
TGCTCCGGGCAGAAAGTTGCAGCGGCGGCATCCTGAATTCGAAACCGTTAGTTTTCTGAGGACATCACAT
ATGATTTCTGCCGTTTTCGGAAAAAGACGTTCTCTGAGCAGAACGCTTACAGCCGGAACGATATGTGCGG
CTCTCATCTCCGGGTATGCCACCATGGCATCCGCAGATGACGGGCAGGGCGCCACGGGGGAAGCGATCAT
CCATGCCGATGATCACCCCGGTAACTGGATGACCTATGGCCGCACCTATTCTGACCAGCGCTACAGCCCG
CTGGATCAGATCAACCGTTCCAATGTCGGTAACCTGAAGCTGGCCTGGTATCTGGACCTTGATACCAACC
GTGGCCAGGAAGGCACGCCCCTGGTTATTGATGGCGTCATGTACGCCACCACCAACTGGAGCATGATGAA
AGCCGTCGACGCCGCAACCGGCAAGCTGCTGTGGTCCTATGACCCGCGCGTGCCCGGCAACATTGCCGAC
AAGGGCTGCTGTGACACGGTCAACCGTGGCGCGGCATACTGGAATGGCAAGGTCTATTTCGGCACGTTCG
ACGGTCGCCTGATCGCGCTGGACGCCAAGACCGGCAAGCTGGTCTGGAGCGTCAACACCATTCCGCCCGA
AGCGGAACTGGGCAAGCAGCGTTCCTATACGGTTGACGGCGCGCCCCGTATCGCCAAGGGCCGCGTGA>>

FASTA format

Medline searches:
Academia Sinica Library (local)
Http://igm.nlm.nih.gov/

Given COX-1 and COX-2 can a putative COX-3 be identified?
• Text search for COX-3 (and suitable alternative forms)
• Acquire human COX-1 and COX-2 sequences
• Search for sequence similarities in a full-length sequence database
• Search for sequence similarities in an EST database
• Merge the results of the full-length and EST searches
• ESTs virtually identical to COX-1 and COX-2
• ESTs similar, but not identical to COX-1/-2
• May provide tissue localization information
• Search ESTs back against full-length databases
• Strong similarities with other genes indicate close relationship of COX family to another gene family probably with a different function
• Is it highly similar to COX-1, COX-2 or both? Is it only weakly similar? If so, might it be more similar to something else, a putative COX-3?
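Sequences acquired for a workflow like this usually arrive in the FASTA format shown earlier: a ">" definition line followed by lines of sequence. The following is a minimal sketch of reading such records; parse_fasta and parse_ncbi_defline are hypothetical helper names, and the definition-line parser assumes the five-field gi|number|db|accession|locus layout seen in the example record, which not every FASTA file follows.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def parse_ncbi_defline(header):
    """Split a 'gi|...|db|accession|locus description' style definition line."""
    identifier, _, description = header.partition(" ")
    fields = identifier.split("|")
    return {
        "gi": fields[1],
        "db": fields[2],
        "accession": fields[3],
        "locus": fields[4],
        "description": description,
    }


# The first sequence line of the example record, split in two to show joining.
example = """>gi|216185|dbj|D00635|ABCADHCC Acetobacter polyoxogenes genes
GAATTCCGAACTATCCGTTTCATTGCTTATGCGACA
GCATGTTCACTTTTTAGTGAGGCTGAACACTAAA
"""

records = parse_fasta(example)
info = parse_ncbi_defline(records[0][0])
print(info["accession"], len(records[0][1]))
```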
In silico cloning:

In order to perform an electronic cDNA library screen, the EST sequences retrieved in this way can be used as queries in a BLASTN search of dbEST to identify overlapping ESTs. This procedure can be reiterated with the newly identified ESTs until no additional hits are found. The ESTs isolated can be assembled into sequence contigs using computer software.

[Figure: a query sequence extended by overlapping ESTs (EST 1, EST 2, EST 3)]

  1 mdltkmgmiq lqnpshptgl lckanqmrla gtlcdvvimv dsqefhahrt vlactskmfe
 61 ilfhrnsqhy tldflspktf qqileyayta tlqakaedld dllyaaeile ieyleeqclk
121 mletiqasdd ndteatmadg gaeeeedrka rylknifisk hsseesgyas vagqslpgpm
181 vdqspsvsts fglsamsptk aavdslmtig qsllqgtlqp pagpeeptla gggrhpgvae
241 vktemmqvde vpsqdspgaa essisggmgd kveergkegp gtptrssvit sarelhygre
301 esaeqvpppa eagqaptgrp ehpapppekh lgiysvlpnh kadavlsmps svtsglhvqp
361 alavsmdfst yggllpqgfi qrelfsklge lavgmksesr tigeqcsvcg velpdneave
421 qhrklhsgmk tygcelcgkr fldslrlrmh llahsagaka fvcdqcgaqf skedalethr
481 qthtgtdmav fcllcgkrfq aqsalqqhme vhagvrsyic secnrtfpsh talkrhlrsh
541 tgdhpyecef cgscfrdest lkshkrihtg ekpyecngcd kkfslkhqle thyrvhtgek
601 pfecklchqr srdysamikh lrthngaspy qcticteycp slssmqkhmk ghkpeeippd
661 wriektylyl cyv

Sequence Alignment and Similarity Search:

One goal of sequence alignment is to enable the researcher to determine whether two sequences display sufficient similarity to justify the inference of homology. Similarity is an observable quantity that might be expressed as, say, percent identity or some other suitable measure. Homology, on the other hand, refers to a conclusion drawn from these data that two genes share a common evolutionary history.

While it is presumed that homologous sequences have diverged from a common ancestral sequence through iterative changes, we do not actually know what the ancestral sequence was (barring the possibility that DNA could be recovered from a fossil); all we have to observe are the sequences from extant organisms.
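Percent identity, mentioned above as one observable measure of similarity, can be computed directly from two already-aligned sequences. The sketch below is an illustration under stated assumptions: the function name blast_style_identity is hypothetical, gaps are written as "-", and the denominator is the full alignment length including gap columns, which mirrors how the "Identities = 313/320" style lines in BLAST reports are written. Other tools divide by different lengths, so values are not always comparable.

```python
def blast_style_identity(aligned_a, aligned_b, gap="-"):
    """Return (identities, alignment_length, percent) for two aligned strings.

    Both inputs must already be aligned, i.e. have the same length, with gap
    characters inserted where one sequence has no counterpart in the other.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    identities = sum(
        1
        for a, b in zip(aligned_a, aligned_b)
        if a == b and a != gap  # a gap facing a gap is not an identity
    )
    length = len(aligned_a)
    percent = 100.0 * identities / length if length else 0.0
    return identities, length, percent


# Toy alignment: one mismatch (G/C) and one gap column out of six columns.
identities, length, percent = blast_style_identity("ACGT-A", "ACCTGA")
print(f"Identities = {identities}/{length} ({percent:.0f}%)")
```

A high percent identity is still only an observation; whether it justifies inferring homology depends on alignment length and statistical significance as well.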
In a residue-by-residue alignment it is often apparent that certain regions of a protein, or perhaps specific amino acids, are more highly conserved than others. This information may be suggestive of which residues are most crucial for maintaining a protein's structure or function.

hum pLZF p    1 MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFE 60
hum TZFP p    1 MSLPPIRLPSPYGSDRLVQLAARLRPA--LCDTLITVGSQEFPAHSLVLAGVSQQLG 55
                : I:L P L: A ::R A LCD :I V SQEF AH VLA S:

hum pLZF p   61 ILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLDDLLYAAEILEIEYLEEQCLK 120
hum TZFP p   56 ----RRGQWALGEGISPSTFAQLLNFVYGESVELQPGELRPLQEAARALGVQSLEEACWR 111
                R Q : :SP TF Q:L : Y ::: : :L L AA L :: LEE C :

hum pLZF p  121 MLETIQASDDNDTEATMADGGAEEEEDRKARYLKNIFISKHSSEESGYASVAGQSLPGPM 180
hum TZFP p  112 ARGDRAKKPDP--------G-----------------LKKHQEEPEKPSRNPERELGDPG 146
                D G : KH E : : L P

hum pLZF p  181 VDQSP-SVSTSFGLSAMSPTKAAVDSLMTIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVA 239
hum TZFP p  147 EKQKPEQVSRTGGR-----------------EQEMLHKHSPPRG--RPEMAG-------- 179
                Q P VS : G : : PP G P :AG

hum pLZF p  240 EVKTEMMQVDEVPSQDSPGAAESSISGGMGDKVEERGKEGPGTPTRSSVITSARELHYGR 299
hum TZFP p  180 --ATQEAQQEQTRSK------EKRLQAPVG----QRGADG-----KHGVLTWLRENPGGS 222
                T: Q :: S: E : :G :RG :G : V:T RE G

hum pLZF p  300 EESAEQVPPPAEAGQAPTGRPEHPAPP-PEKHLGIYSVLPNHKADAVLSMPSSVTSGLHV 358
hum TZFP p  223 EESLRKLPGPLP----PAGSLQTSVTPRPSWAEAPWLVGGQPALWSILLMPP-------- 270
                EES ::P P P:G : P P V : ::L MP

hum pLZF p  359 QPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKSESRTIGEQCSVCGVELPDNEA 418
hum TZFP p  271 RYGIPFYHSTPTTGAWQEVWREQRIPLSLNAPKGLWSQNQ---L-ASSSPTPGSLP---- 322
                : : T G QR S : : : : :S : LP

hum pLZF p  419 VEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALET 478
hum TZFP p  323 ----------------------------------------------QGPAQLS-PGEMEE 335
                Q AQ S :E

hum pLZF p  479 HRQTHTGTDMAVFCLLCGKRFQAQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLR 538
hum TZFP p  336 SDQGHTG---------------ALATCAGHEDKAG------CPPRPHPPPAPPARSR--- 371
                Q HTG A :: H : C : P: A R

hum pLZF p  539 SHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCDKKFSLKHQLETHYRVHTG 598
hum TZFP p  372 ----------------------------------PYACSVCGKRFSLKHQMETHYRVHTG 397
                PY C C K:FSLKHQ:ETHYRVHTG

hum pLZF p  599 EKPFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIP 658
hum TZFP p  398 EKPFSCSLCPQRSRDFSAMTKHLRTH-GAAPYRCSLCGAGCPSLASMQAHMRGHSPSQLP 456
                EKPF C LC QRSRD:SAM KHLRTH GA:PY:C::C CPSL:SMQ HM:GH P ::P

hum pLZF p  659 PDWRIEKTYLY------------LCYV 673
hum TZFP p  457 PGWTIRSTFLYSSSRPSRPSTSPCCPSSSTT 487
                P W I T:LY C

Sequence Alignment and Similarity Search:

Database similarity searching allows us to determine which of the hundreds of thousands of sequences present in the database are potentially related to a particular sequence of interest. In database searching, the basic operation is to sequentially align a query sequence to each subject sequence in the database. The results are reported as a ranked hit list followed by a series of individual sequence alignments, plus various scores and statistics.

Current sequence databases are already immense and have continued to increase at an exponential rate, making straightforward application of dynamic programming methods impractical for database searching. One solution is to use massively parallel computers.

There are several frequently used programs available on the Internet:
• FastA
• BLITZ
• BLAST
• Smith-Waterman based system (GenWeb of NHRI)

Blast Family of Programs

The BLAST family of programs allows all combinations of DNA or protein query sequences with searches against DNA or protein databases:

blastp compares an amino acid query sequence against a protein sequence database.
blastn compares a nucleotide query sequence against a nucleotide sequence database.
blastx compares the six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
tblastn compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
tblastx compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database.

The default matrix for all protein-protein comparisons is BLOSUM62.

Databases available for BLAST search

Protein Sequence Databases
nr          All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month       All new or revised GenBank CDS translation+PDB+SwissProt+PIR+PRF released in the last 30 days.
swissprot   The last major release of the SWISS-PROT protein sequence database (no updates)
yeast       Yeast (Saccharomyces cerevisiae) protein sequences.
E. coli     E. coli genomic CDS translations
pdb         Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank

Nucleotide Sequence Databases
nr          All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences)
month       All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest       Non-redundant Database of GenBank+EMBL+DDBJ EST Divisions
dbsts       Non-redundant Database of GenBank+EMBL+DDBJ STS Divisions
yeast       Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences
E. coli     E. coli genomic nucleotide sequences organism

CLUSTAL W

One of the most widely used multiple sequence alignment programs. Based on the idea of progressive alignment, this program takes an input set of sequences and calculates a series of pairwise alignments, comparing each sequence to every other sequence, one at a time.
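The translated BLAST programs listed above (blastx, tblastn, tblastx) rely on six-frame conceptual translation: three reading frames on the given strand and three on its reverse complement. The sketch below shows the idea only; it is not BLAST's implementation. It assumes the standard genetic code, unambiguous A/C/G/T input, and hypothetical names (six_frame_translations, the "+1".."-3" frame labels); codons containing other letters are shown as "X".

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Reverse complement of an uppercase A/C/G/T string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from the first base; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Map frame labels '+1'..'+3' and '-1'..'-3' to conceptual translations."""
    dna = dna.upper()
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translations("ATGGCCTAA")
for label, protein in frames.items():
    print(label, protein)
```

Each of the six translated strings can then be compared against protein sequences, which is why these searches cost more than a plain blastn or blastp run.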
Human PLZF

Hit   Motif   Part #1           Part #2                        Part #3
406   ZINC1   406  1/1  C       409  14/14  CGVELPDNEAVEQH     426  1/1  H
434   ZINC1   434  1/1  C       437  14/14  CGKRFLDSLRLRMH     454  1/1  H
463   ZINC1   463  1/1  C       466  14/14  CGAQFSKEDALETH     483  1/1  H
492   ZINC1   492  1/1  C       495  14/14  CGKRFQAQSALQQH     512  1/1  H
520   ZINC1   520  1/1  C       523  14/14  CNRTFPSHTALKRH     540  1/1  H
548   ZINC1   548  1/1  C       551  14/14  CGSCFRDESTLKSH     568  1/1  H
576   ZINC1   576  1/1  C       579  14/14  CDKKFSLKHQLETH     596  1/1  H
604   ZINC1   604  1/1  C       607  14/14  CHQRSRDYSAMIKH     624  1/1  H
632   ZINC1   632  1/1  C       635  14/14  CTEYCPSLSSMQKH     652  1/1  H

C2H2 zinc finger motif

BLOCK

ID   ZINC_FINGER_C2H2; BLOCK
AC   BL00028; distance from previous block=(7,2235)
DE   Zinc finger, C2H2 type, domain proteins.
BL   CHP motif; width=29; seqs=135; 99.5%=1594; strength=1246
ADR1_YEAST ( 106) CEVCTRAFARQEHLKRHYRSHTNEKPYPC 10
AEF1_DROME ( 214) CNVCDKTFRQSSTLTNHLKIHTGEKPYNC 10
AZF1_YEAST ( 623) CDYCGKRFTQGGNLRTHERLHTGEKPYSC 10
BASO_HUMAN ( 358) CTACEKTFYDKGTLKIHYNAVHLKIKHKC 39
BRC1_DROME ( 669) CNICKRVYSSLNSLRNHKSIYHRNLKQPK 37
BRC2_DROME ( 471) CAICERVYCSRNSLMTHIYTYHKSRPGEM 27
BRC3_DROME ( 467) GSLAAAVYSLHSHAHGHVLGHATSPPRPG 87
BRLA_EMENI ( 324) EPGCNGRFKRQEHLKRHMKSHSKEKPHVC 22
BTEB_RAT   ( 147) YSGCGKVYGKSSHLKAHYRVHTGERPFPC 11
CF23_DROME ( 368) CPDCPKTFKTPGTLAMHRKIHTGEAEREA 24
CF2_DROME  ( 403) CSYCGKSFTQSNTLKQHTRIHTGEKPFRC 11

Prosite

ID   ZINC_FINGER_C2H2; PATTERN.
AC   PS00028;
DT   APR-1990 (CREATED); JUN-1994 (DATA UPDATE); NOV-1995 (INFO UPDATE).
DE   Zinc finger, C2H2 type, domain.
PA   C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.

Phylogenetic Analysis:

Phylogenetics is the study of evolutionary relationships. Phylogenetic analysis is the means of inferring or estimating these relationships.
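As an aside on the Prosite entry PS00028 quoted above, its pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H can be rewritten as a regular expression and scanned against a protein sequence. The converter below is a minimal sketch that handles only the syntax elements appearing in that pattern plus {..} exclusions; prosite_to_regex is a hypothetical helper, and anchors such as < and > are not supported. The test fragment is residues 401-430 of the human PLZF sequence listed earlier in these notes.

```python
import re

# One pattern element: x, a residue, a [set] or an {exclusion}, optional (n) or (n,m).
ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        match = ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            token = "."                        # any amino acid
        elif residue.startswith("{"):
            token = "[^" + residue[1:-1] + "]"  # any amino acid except these
        else:
            token = residue                     # single residue or [allowed set]
        if low is not None:
            token += "{" + low + ("," + high if high else "") + "}"
        parts.append(token)
    return "".join(parts)


C2H2 = prosite_to_regex("C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H.")

# Human PLZF residues 401-430, so string index 0 corresponds to residue 401.
fragment = "TIGEQCSVCGVELPDNEAVEQHRKLHSGMK"
hit = re.search(C2H2, fragment)
print(C2H2)
print(hit.group(), "starts at residue", 401 + hit.start())
```

The match begins at residue 406, the position reported for the first ZINC1 hit in the table above.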
The evolutionary history inferred from phylogenetic analysis is usually depicted as branching (treelike) diagrams, which represent a sort of pedigree of the inherited relationships among molecules ("gene trees"), organisms, or both.

The four steps in phylogenetic analysis of DNA sequences are alignment, determining the substitution model, tree building, and tree evaluation.

While other scientific analyses generally have empirical bases, phylogenetic analyses do not. The physical events yielding a phylogeny happened in the past, and can only be inferred or estimated.

The three major tree-building criteria are distance, maximum parsimony, and maximum likelihood.

Over 130 packages are available for various platforms.

[Figure: tree display styles: Radial, Slanted Cladogram, Phylogram]

http://www2.ebi.ac.uk/clustalw/

Ortholog: Homologous genes that have diverged from each other after speciation events (e.g., human beta- and chimp beta-globin)
Paralog: Homologous genes that have diverged from each other after gene duplication events (e.g., human beta- and gamma-globin)
Xenolog: Homologous genes that have diverged from each other after lateral gene transfer events (e.g., antibiotic resistance genes in bacteria)
Homolog: Genes that are descended from a common ancestor (e.g., all globins)

COG0568 K DNA-dependent RNA polymerase sigma70/sigma32 subunits

EST: Expressed Sequence Tags

dbEST is a division of GenBank that contains sequence data and other information on "single-pass" cDNA sequences, or Expressed Sequence Tags, from a number of organisms. There are 1,775,721 entries in human EST and 918,414 entries in mouse EST, for a total of 3,643,273 sequence entries in dbEST (Feb. 18, 2000).

EST projects have their roots in the early 1980s, when it was recognized that short stretches of DNA sequences from cDNAs could be used to identify genes. The Institute for Genomic Research (TIGR) was established to generate EST data on a massive scale.
The largest projects conducted entirely in the public domain include an effort funded by Merck and Co., which has deposited more than 500,000 human ESTs into dbEST. A hallmark of these endeavours, carried out by a collaboration between Washington University Genome Sequencing Center and members of the IMAGE (Integrated Molecular Analysis of Gene Expression) consortium, has been the rapid deposition of the sequences into the public domain and the concomitant availability of the sequence-tagged clones.

dbEST release 021800

Summary by Organism - February 18, 2000
Number of public entries: 3,643,273

Homo sapiens (human)                   1,775,721
Mus musculus + domesticus (mouse)      918,414
Rattus sp. (rat)                       134,685
Caenorhabditis elegans (nematode)      101,232
Drosophila melanogaster (fruit fly)    86,121
Danio rerio (zebrafish)                61,893
Lycopersicon esculentum (tomato)       53,603
Zea mays (maize)                       51,883
Glycine max (soybean)                  50,656
Oryza sativa (rice)                    47,939
Arabidopsis thaliana (thale cress)     45,757

Search: AA927876

dbEST Id: 1659486

IDENTIFIERS
EST name:       om18b09.s1
GenBank Acc:    AA927876
GenBank gi:     3076620

CLONE INFO
Clone Id:       IMAGE:1541369 (3')
Source:         NCI
Insert length:  1074
DNA type:       cDNA

PRIMERS
Sequencing:     -40m13 fwd. ET from Amersham

SEQUENCE
TTTGACGGGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTAC
TGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGG
CTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGTNNTCTGG
GGCGGAATTTGCTAGGCCGCCGTAGCAGCTGTGCCAGGTCAGAAGCCGAGCCGGNCCGCT
TTTCGTTCTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCAGTGGTGGAGG
AAGAAGGACAACAGGGAGAGGTCGAGG
Quality:        High quality sequence stops at base: 318
Entry Created:  Apr 17 1998
Last Updated:   Jun 10 1998

COMMENTS
This clone is available royalty-free through LLNL; contact the IMAGE Consortium (info@image.llnl.gov) for further information.

LIBRARY
dbEST lib id:   1042
Lib Name:       Soares_NFL_T_GBC_S1
Organism:       Homo sapiens
Organ:          pooled
Lab host:       DH10B
Vector:         pT7T3D-Pac (Pharmacia) with a modified polylinker
R. Site 1:      Not I
R. Site 2:      Eco RI
Description:    Equal amounts of plasmid DNA from three normalized libraries (fetal lung NbHL19W, testis NHT, and B-cell NCI_CGAP_GCB1) were mixed, and ss circles were made in vitro. Following HAP purification, this DNA was used as tracer in a subtractive hybridization reaction. The driver was PCR-amplified cDNAs from pools of 5,000 clones made from the same 3 libraries. The pools consisted of I.M.A.G.E. clones 297480-302087, 682632-687239, 726408-728711, and 729096-731399. Subtraction by Bento Soares and M. Fatima Bonaldo.

Simple Mathematics:

Summary by Organism - February 18, 2000
Homo sapiens (human) ESTs: 1,775,721
Human genes: 100,000 genes
More than 10 fold coverage!!

Clustering is the process of finding subsets of sequences which belong together within a larger set. This is done by converting discrete similarity scores to boolean links between sequences. That is, two sequences are considered linked if their similarity exceeds a threshold.

UniGene clustering proceeds in several stages, with each stage adding less reliable data to the results of the preceding stage. This staged clustering affords greater control than a more egalitarian treatment of all links between sequences.

UniGene (Feb. 19, 2000)              TIGR Gene Indices (Jul. 3, 1999)
Unigene_HUMAN: 92,571 clusters       HGI: 299,412 clusters
Unigene_MOUSE: 75,963 clusters       MGI: 104,927 clusters
Unigene_RAT:   28,680 clusters       RGI: 35,875 clusters

THCs, "Tentative Human Consensus" sequences, are assemblies of human ESTs.

How does TIGR's Human Gene Index compare with UniGene?

The HGI assemblies (and all of TIGR's Gene Index assemblies) are made by first clustering the EST sequences and then assembling these clusters into consensus sequences, or THCs (TCs for non-human data).
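The linking rule described above (two sequences are linked when their similarity exceeds a threshold) turns clustering into finding connected groups of linked sequences. The sketch below illustrates that idea with a small union-find structure. It is a generic single-linkage illustration, not the actual UniGene or TIGR pipeline, and the names cluster_by_threshold, est1..est4 and the scores are made up for the example.

```python
def cluster_by_threshold(ids, scores, threshold):
    """Group ids into clusters; a pair is linked if its score exceeds threshold.

    `scores` maps (id_a, id_b) pairs to a similarity score. Clusters are the
    connected components of the resulting boolean link graph.
    """
    parent = {seq_id: seq_id for seq_id in ids}

    def find(item):
        # Follow parent pointers to the cluster representative,
        # shortening the path along the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    for (a, b), score in scores.items():
        if score > threshold:  # boolean link
            parent[find(a)] = find(b)

    clusters = {}
    for seq_id in ids:
        clusters.setdefault(find(seq_id), []).append(seq_id)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical ESTs and pairwise similarity scores (fraction identity in overlap).
ests = ["est1", "est2", "est3", "est4"]
pair_scores = {
    ("est1", "est2"): 0.98,
    ("est2", "est3"): 0.97,
    ("est3", "est4"): 0.40,
}
clusters = cluster_by_threshold(ests, pair_scores, threshold=0.95)
print(clusters)
```

Note that est1 and est3 end up in the same cluster without a direct link between them, which is one reason a single cluster can contain sequences that a stricter assembly step would later separate.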
EST sequences are compared and clustered together if they meet the following criteria:
• a minimum 40 base pair match
• greater than 95% similarity in the overlap region
• a maximum unmatched overhang of 20 base pairs

These clusters are then assembled into consensus sequences using TIGR's in-house assembly program.

UniGene links ESTs in a cluster if the sequences have a 50 base pair overlap in the 3' untranslated region (UTR) with 100% identity. These clusters are not run through the more stringent assembly process and consensus sequences are not made. For this reason you will often find several TIGR THCs contained within one UniGene cluster.

UniGene Human Release Statistics

Statistics for UniGene build uploaded on: Sat Feb 19 2000
UniGene Build #108

Sequences Included in UniGene
=============================
Known genes are from GenBank 114 (1-Dec-1999)
ESTs are from dbEST through 13-Feb-2000

    30044  mRNAs + gene CDSs
   938584  EST, 3' reads
   347845  EST, 5' reads
+  157255  EST, other/unknown
---------
  1473728  total sequences in clusters

Final Number of Clusters (sets)
===============================
    92571  sets total
    10797  sets contain at least one known gene
    91523  sets contain at least one EST
     9749  sets contain both genes and ESTs

HGI Release 4.5 - Nov. 15, 1999

Total sequences
          in THCs      singletons   total
ESTs      1,066,183    241,110      1,307,293
HTs       5,949        1,165        7,114
Totals    1,072,132    242,275      1,314,407

Total unique sequences
THCs              84,837
singleton ESTs    241,110
singleton HTs     1,165
Total             327,112

AA927876 as query (318 bps)
Database: Unigene_HUMAN
58,791 sequences; 43,055,747 total letters

Sequences producing significant alignments:                            Score (bits)   E Value
gnl|UG|Hs#S971963 ak43b04.s1 Homo sapiens cDNA, 3' end /clone=IM...    599            e-171
gnl|UG|Hs#S510257 70F12 Homo sapiens cDNA /clone=(not-directiona...    36             0.17

gnl|UG|Hs#S971963 ak43b04.s1 Homo sapiens cDNA, 3' end /clone=IMAGE:1408687 /clone_end=3' /gb=AA868505 /ug=Hs.99430 /len=627
Length = 627
Score = 599 bits (302), Expect = e-171
Identities = 321/327 (98%), Positives = 321/327 (98%)

Hs.99430 Homo sapiens

EXPRESSION INFORMATION
cDNA sources: Blood, Ovary, Testis

EST SEQUENCES (8)
AI150041   cDNA clone IMAGE:1751830   Testis   3' read
AA927876   cDNA clone IMAGE:1541369            3' read
AI223414   cDNA clone IMAGE:1838461   Testis   3' read
AI150330   cDNA clone IMAGE:1751988   Testis   3' read
AA868505   cDNA clone IMAGE:1408687   Testis   3' read
AA476210   cDNA clone IMAGE:771312    Ovary    3' read
AA456628   cDNA clone IMAGE:809583    Ovary    3' read
AI361709   cDNA clone IMAGE:2021901   Blood    3' read
Insert sizes listed: 1.1 kb, 1.1 kb, 1.0 kb, 0.6 kb

Hs.434 Homo sapiens Human heregulin-beta1 gene, complete cds

MAPPING INFORMATION
Chromosome: 8
Gene Map 98: stSG4083, Chr.8, D8S1820-D8S505
Gene Map 98: WI-18803, Chr.8, D8S1820-D8S505
Gene Map 98: SHGC-12780, Chr.8, D8S1820-D8S505

EXPRESSION INFORMATION
cDNA sources: Brain, Breast, Liver, Testis

AA927876 as query (318 bps)
Database: HGI-HUMAN
234,459 sequences; 111,134,950 total letters

Sequences producing significant alignments:   Score (bits)   E Value
lcl|THC226049                                 581            e-165
lcl|R47793                                    40             0.027
                                              34             1.7

lcl|THC226049
Length = 436
Score = 581 bits (293), Expect = e-165
Identities = 313/320 (97%), Positives = 313/320 (97%), Gaps = 1/320 (0%)

>THC226049
TGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTACTGTTAACCTTCGCTCCTGCACACTTCA
GTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGGCTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCC
TGGGGCgTTcTGGGGCGGAATTTGCTAGGCCGCCGTAGCAGCGGTGCCAGGTCAGAAGCCGAGCCGGCyCGCTTTTCGTT
CTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCaGgTGGTGGAGGAAGAAGGACAACAGGGAGAGGTCGAG
GGCCGAGACGGCTCGAGGGAGGAGTAGAGGAAGGTGGAGCGGATGGTCCATCCGGGCGGGAGTTGGCTGGGCGAGTGACC
GCGCATGTGCCGCTGCATGGAGGGCAAGCTGTTACA

1=================================THC226049================================436
----------------------------1--------------------------->
--------------------------------------2-------------------------------------->

#   EST Id         GB#        ATCC#   left   right   library
-------------------------------------------------------------------------------
1   F zw35g01.s1   AA476210           1      317     ovary tumor NbHOT, Soares
2   F zx75d08.s1   AA456628           1      436     ovary tumor NbHOT, Soares

Sequence source codes: F = WashU/Merck

There are no hits for THC226049.

In silico cloning: How to start?

TBLASTN

emb|AJ003623|HSJ003623 H.sapiens DNA for EST MPIpl10-4B1
Length = 556
Score = 46.9 bits (109), Expect = 1e-04
Identities = 29/83 (34%), Positives = 42/83 (49%), Gaps = 8/83 (9%)

Query: 23  RLRPALCDTLITVGSQEFPAHSLVLAGVSQQLGRRGQWALGEG--------ISPSTFAQL 74
           RL+ LCD L+ VG Q+F AH VLA S+ E P F +
Sbjct: 164 RLKGQLCDVLLIVGDQKFRAHKNVLAASSEYFQSLFTNKENESQTVFQLDFCEPDAFDNV 343

Query: 75  LNFVYGESVELQPGELRPLQEAARALGVQSL 105
           LN++Y S+ ++ L +QE +LG+ L
Sbjct: 344 LNYIYSSSLFVEKSSLAAVQELGYSLGISFL 436

Experimental results:

TTGANNNCCTTTGAANNNCCNNTTNNTCATAGATCTCTCGAGTTTTTTTTTTTTTTTTTTTCTGAAGGGAGGGCACAGGAAAC
TCTTTATTATGGTGATGAGATCGACAATCTCCCCTACTGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGG
CTCGCTGGCCTGGGCTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGCGCTTCTGGGGCGGAAT
TTGCTAGGCCGCCGTAGCAGCGGTGCCAGGTCAGAAGCCGAGCCGGCCCGCTTTTCGTTCTTTAATTGGACTCTTGGCTAAGA
CGCTACCGACACCCCGTCAGGTGGTGGAGGAAGAAGGACAACAGGGAGAGGTCGAGGGCCGAGACGGCCTCGAGGAGGAGTAG
AGGAAGGTGGAGCGGATGGTCCATCCGGGCGGGAGTTGGCTGGGCGAGTGACCGCGCATGTGCGCCTGCATGGAGGCCAGGCT
GGGACAGCCGGCCCCGCACAGGGAGCAGCGGTACGGAGCGGCCCCGTGTGTCCGCAGGTGCTTGGTCATGGCCGAGAAGTCCC
GGGAGCGCTGAGGACAAAGGCTACAGGAGAAGGGCTTCTCTCCTGTGTGGACTCGGTAGTGCGTCTCCATCTGATGCTTGAGT GAAAACCTCTTTCACAGACAGAGCACGCATAGGGGCCCAGACCGAGCANGGTCGACGCGGCCCGCGAAATTCGGATCCCCGGG GCCTTCATGGGCCATATGACCCCCCAAGCTAGCGTAAATCTGGGAACATCGTATGGGTAAAGCCNTNANAGAATCTCTTTTTT TTTGGGTTTGGGGNGGGGGTNATCTTTCATTNATCGAATTAGANTAGTTATNTNCCATTAATCCATTGNANNGGNNTTTAAAC ATTCCCTTGAAGGGATTCCNAAACCCTTTTACCNCAATTTTGGGTCCCGTCCAAACCCAGGTTGACAAGNGGGTTTTTGGAAA TTNTTTNCCCNTNATTCAATTTTTCCT Yeast two-hybrid experiment; Differential Display; Library screening; etc. BLASTN search to GenBank Cosmid from chromosome 19; it is a novel gene. BLASTN search to dbEST; Unigene; TIGR-HGI cDNA and genomic DNA alignment and matrix analysis: Gene prediction programs: http://CCR-081.mit.edu/GENSCAN.html GRAIL 2 10138 - 11018 12608 - 12748 13530 - 13923 + x x GENSCAN 10138 - 11018 11268 - 11341 11450 - 11518 11644 - 11808 11989 - 12144 12360 - 12454 12608 - 12748 + + + + + x x FGENES 1880 - 1908 5061 - 5175 5900 - 6049 8317 - 8544 10357 - 11018 11268 - 11341 11450 - 11518 11644 - 11864 polyA: 12521 x x x + + + + + + (Start) ATGTCCCTGCCCCCCATAAGACTGCCCAGCCCCTATGGCTCTGATCGGCTGGTACAGCTAGCAGCCAGGCTCCGGCCAGCACTCTGTGATACTCTGATCACCGTAGGGAGCCAGGAGTTC M S L P P I R L P S P Y G S D R L V Q L A A R L R P A L C D T L I T V G S Q E F> CCCGCCCACAGCCTGGTGCTAGCAGGTGTCAGCCAGCAGCTGGGCCGCAGGGGCCAGTGGGCTCTGGGAGAAGGCATCAGCCCTTCTACCTTTGCCCAGCTCCTGAACTTTGTGTATGGG P A H S L V L A G V S Q Q L G R R G Q W A L G E G I S P S T F A Q L L N F V Y G> GAGAGTGTAGAGCTGCAGCCTGGAGAGCTAAGGCCCCTTCAGGAGGCGGCCAGGGCCTTGGGAGTGCAGTCCCTGGAAGAGGCATGCTGGAGGGCTCGAGGGGACAGGGCTAAAAAGCCA E S V E L Q P G E L R P L Q E A A R A L G V Q S L E E A C W R A R G D R A K K P> GATCCAGGCCTGAAGAAACATCAGGAGGAGCCAGAGAAACCCTCAAGGAATCCTGAGAGAGAACTGGGGGACCCTGGAGAGAAGCAGAAACCAGAACAGGTTTCTAGAACTGGTGGGAGA D P G L K K H Q E E P E K P S R N P E R E L G D P G E K Q K P E Q V S R T G G R> 
GAACAGGAGATGTTGCACAAGCACTCGCCACCAAGAGGCAGACCCGAGATGGCAGGAGCAACGCAGGAGGCTCAGCAGGAACAGACCAGGTCAAAGGAGAAACGCCTCCAAGCCCCTGTT E Q E M L H K H S P P R G R P E M A G A T Q E A Q Q E Q T R S K E K R L Q A P V> GGCCAAAGGGGAGCAGATGGGAAGCATGGAGTGCTCACGTGGTTGAGGGAAAATCCAGGGGGCTCTGAGGAAAGTCTGCGCAAGCTCCCTGGCCCCCTTCCCCCAGCAGGCTCCCTGCAA G Q R G A D G K H G V L T W L R E N P G G S E E S L R K L P G P L P P A G S L Q> ACCAGCGTCACCCCTAGGCCCTCGTGGGCTGAGGCCCCTTGGTTGGTGGGGGGCCAGCCTGCCCTGTGGAGCATCCTGCTGATGCCGCCCAGATATGGCATTCCCTTCTACCATAGCACC T S V T P R P S W A E A P W L V G G Q P A L W S I L L M P P R Y G I P F Y H S T> CCCACCACTGGAGCCTGGCAGGAGGTCTGGCGGGAACAGAGGATCCCACTGTCCCTAAATGCCCCCAAAGGGCTCTGGAGCCAGAACCAGTTGGCCTCCTCCAGCCCTACCCCAGGTTCC P T T G A W Q E V W R E Q R I P L S L N A P K G L W S Q N Q L A S S S P T P G S> CTCCCCCAGGGCCCCGCACAGCTCAGCCCTGGGGAGATGGAAGAGTCTGATCAGGGGCACACAGGCGCACTTGCAACCTGTGCGGGTCATGAGGACAAGGCAGGCTGCCCACCTCGCCCG L P Q G P A Q L S P G E M E E S D Q G H T G A L A T C A G H E D K A G C P P R P> CACCCTCCCCCGGCCCCTCCTGCTCGGTCTCGGCCCTATGCGTGCTCTGTCTGTGGAAAGAGGTTTTCACTCAAGCATCAGATGGAGACGCACTACCGAGTCCACACAGGAGAGAAGCCC H P P P A P P A R S R P Y A C S V C G K R F S L K H Q M E T H Y R V H T G E K P> TTCTCCTGTAGCCTTTGTCCTCAGCGCTCCCGGGACTTCTCGGCCATGACCAAGCACCTGCGGACACACGGGGCCGCTCCGTACCGCTGCTCCCTGTGCGGGGCCGGCTGTCCCAGCCTG F S C S L C P Q R S R D F S A M T K H L R T H G A A P Y R C S L C G A G C P S L> GCCTCCATGCAGGCGCACATGCGCGGTCACTCGCCCAGCCAACTCCCGCCCGGATGGACCATCCGCTCCACCTTCCTCTACTCCTCCTCGAGGCCGTCTCGGCCCTCGACCTCTCCCTGT A S M Q A H M R G H S P S Q L P P G W T I R S T F L Y S S S R P S R P S T S P C> TGTCCTTCTTCCTCCACCACCTGACGGGGTGTCGGTAGCGTCTTAGCCAAGAGTCCAATTAAAGAACGAAAAGCGGGCCGGCTCGGCTTCTGACCTGGCACCGCTGCTACGGCGGCCTAG C P S S S T T * hum TZF p hum pLZF p mus pLZF p 1 MSLPPIRLPSPYGSDRLVQLAARLRPALCDTLITVGSQEFPAHSLVLAGVSQQLG----RRGQWALGEGISPSTFAQLLNFVYGESVELQPGELR 91 1 MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFEILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLD 100 1 
MDLTKMGMIQLQNPSHPTGLLCKANQMRLAGTLCDVVIMVDSQEFHAHRTVLACTSKMFEILFHRNSQHYTLDFLSPKTFQQILEYAYTATLQAKAEDLD 100 M : :: PS RL :LCD :I V SQEF AH VLA S: R Q : :SP TF Q:L : Y ::: : :L hum TZF p hum pLZF p mus pLZF p 92 PLQEAARALGVQSLEEACW------RARGD---RAKKPDPG----------------LKKHQEEPEKPSRNPERELGDPGEKQKP--------------- 151 101 DLLYAAEILEIEYLEEQCLKMLETIQASDDNDTEATMADGGAEEEEDRKARYLKNIFISKHSSEESGYASVAGQSLPGPMVDQSPSVSTSFGLSAMSPTK 200 101 DLLYAAEILEIEYLEEQCLKILETIQASDDNDTEATMADGGGEEEDDRKARYLKNIFISKHSSEESGYASVAGQSLPGPMVDQSPSVSTSFGLSAMSPTK 200 L AA L :: LEE C :A D A D G : KH E : : L P Q P hum TZF p hum pLZF p mus pLZF p 152 EQVSRTGGREQEMLH-KHSPPRG--RPEMAG-----ATQEAQQEQTRSKEKRLQ-AP------VG--------QRGADG-----KHGVLTWLRENPGGSE 223 201 AAVDSLMTIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVAEVKTEMMQVDEVPSQDSPGAAESSISGGMGDKVEERGKEGPGTPTRSSVITSARELHYGRE 300 201 AAVDSLMSIGQSLLQGTLQPPAGPEEPTLAGGGRHPGVAEVKMEMMQVDEAPCQDSPGAAESSISGGMGDKFEERSKEGPGTPTRRSVITSARELHYGRE 300 V Q :L: PP G P :AG E : E : E Q :P : :R :G : V:T RE G E hum TZF p hum pLZF p mus pLZF p 224 ESLRKLPGPLP----PAGSLQTSVTP--RP--SWAEAP----WLVGGQP-ALWSILLMPPRYGIPFYHST-----PTTGAWQEVWR-----------EQR 294 301 ESAEQVPPPAEAGQAPTGRPEHPAPPPEKHLGIYSVLPNHKADAVLSMPSSVTSGLHVQPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKSESR 400 301 ESGEQLSPPVEAGQGPPGRQEPLAPPVEKHLGIYSVLPNHKADAVLSMPSSVTSGLHVQPALAVSMDFSTYGGLLPQGFIQRELFSKLGELAVGMKAESR 400 ES :: P P G : P : : P V P :: S L : P : ST P :E: E R hum TZF p hum pLZF p mus pLZF p 295 ----------IPLSLN--------APKGLWSQ----------N-----Q--LASSSPTPGSLP-QGPAQLSP-GEMEESDQGHTGALAT-----CAG--- 349 401 TIGEQCSVCGVELPDNEAVEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALETHRQTHTGTDMAVFCLLCGKRFQ 500 401 PLGEQCSVCGVELPDNEAVEQHRKLHSGMKTYGCELCGKRFLDSLRLRMHLLAHSAGAKAFVCDQCGAQFSKEDALETHRQTHTGTDMAVFCLLCGKRFQ 500 : L N G: : LA S: : : Q AQ S :E Q HTG: : C hum TZF p hum pLZF p mus pLZF p 350 --------HEDKAG--------CP---P---------RPHPPPAPPARS------R----------------PYACSVCGKRFSLKHQMETHYRVHTGEK 399 501 
AQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLRSHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCDKKFSLKHQLETHYRVHTGEK 600 501 AQSALQQHMEVHAGVRSYICSECNRTFPSHTALKRHLRSHTGDHPYECEFCGSCFRDESTLKSHKRIHTGEKPYECNGCGKKFSLKHQLETHYRVHTGEK 600 E :AG C P R H P R PY C C K:FSLKHQ:ETHYRVHTGEK hum TZF p hum pLZF p mus pLZF p 400 PFSCSLCPQRSRDFSAMTKHLRTH-GAAPYRCSLCGAGCPSLASMQAHMRGHSPSQLPPGWTIRSTFLYSSSRPSRPSTSPCCPSSSTT 487 601 PFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIPPDWRIEKTYLYLCY-V 673 601 PFECKLCHQRSRDYSAMIKHLRTHNGASPYQCTICTEYCPSLSSMQKHMKGHKPEEIPPDWRIEKTYLYLCYV 673 PF C LC QRSRD:SAM KHLRTH GA:PY:C::C CPSL:SMQ HM:GH P ::PP W I T:LY : Hs.99430 Homo sapiens EXPRESSION INFORMATION cDNA sources: Blood, Ovary, Testis EST SEQUENCES (8) AI150041 cDNA clone IMAGE:1751830 AA927876 cDNA clone IMAGE:1541369 AI223414 cDNA clone IMAGE:1838461 AI150330 cDNA clone IMAGE:1751988 AA868505 cDNA clone IMAGE:1408687 AA476210 cDNA clone IMAGE:771312 AA456628 cDNA clone IMAGE:809583 AI361709 cDNA clone IMAGE:2021901 Testis 3' 3' Testis 3' Testis 3' Testis 3' Ovary 3' Ovary 3' Blood 3' Northern Blotting read read read read read read read read 1.1 1.1 1.0 0.6 kb kb kb kb LOCUS AF130255 1960 bp mRNA PRI 22-FEB-1999 DEFINITION Homo sapiens testis zinc finger protein (TZFP) mRNA, complete cds. ACCESSION AF130255 KEYWORDS . SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 1960) AUTHORS Tang,Tang K., Lai,Chun-Hung, Tang,Chieh-Ju C., Huang,Chang-Jen and Lin,Wen-chang. TITLE Identification and gene structure of a novel human PLZF related transcription factor gene, TZFP JOURNAL Unpublished REFERENCE 2 (bases 1 to 1960) AUTHORS Tang,T. K., Tang,C.-J. C. and Lin,W.-c. TITLE Direct Submission JOURNAL Submitted (22-FEB-1999) Institute of Biomedical Sciences, Academia Sinica, No. 128, Sec. 
2, Academia Road, Taipei, Taiwan 11529, TAIWAN

Search: AA927876
dbEST Id: 1659486

IDENTIFIERS
EST name: om18b09.s1
GenBank Acc: AA927876
GenBank gi: 3076620

CLONE INFO
Clone Id: IMAGE:1541369 (3')
Source: NCI
Insert length: 1074
DNA type: cDNA

PRIMERS
Sequencing: -40m13 fwd. ET from Amersham

SEQUENCE
TTTGACGGGAGGGCACAGGAAACTCTTTATTATGGTGATGAGATCGACAATCTCCCCTAC
TGTTAACCTTCGCTCCTGCACACTTCAGTGTCCTCACTCTGTAGGGCTCGCTGGCCTGGG
CTTCTGCGACCCGCGATCGTCCAGGAGAGGGCACTCGGCGCCCTTCCTGGGGTNNTCTGG
GGCGGAATTTGCTAGGCCGCCGTAGCAGCTGTGCCAGGTCAGAAGCCGAGCCGGNCCGCT
TTTCGTTCTTTAATTGGACTCTTGGCTAAGACGCTACCGACACCCCGTCAGTGGTGGAGG
AAGAAGGACAACAGGGAGAGGTCGAGG
Quality: High quality sequence stops at base: 318
Entry Created: Apr 17 1998
Last Updated: Jun 10 1998

COMMENTS
This clone is available royalty-free through LLNL; contact the IMAGE Consortium (info@image.llnl.gov) for further information.

LIBRARY
dbEST lib id: 1042
Lib Name: Soares_NFL_T_GBC_S1
Organism: Homo sapiens
Organ: pooled
Lab host: DH10B
Vector: pT7T3D-Pac (Pharmacia) with a modified polylinker
R. Site 1: Not I
R. Site 2: Eco RI
Description: Equal amounts of plasmid DNA from three normalized libraries (fetal lung NbHL19W, testis NHT, and B-cell NCI_CGAP_GCB1) were mixed, and ss circles were made in vitro. Following HAP purification, this DNA was used as tracer in a subtractive hybridization reaction. The driver was PCR-amplified cDNAs from pools of 5,000 clones made from the same 3 libraries. The pools consisted of I.M.A.G.E. clones 297480-302087, 682632-687239, 726408-728711, and 729096-731399. Subtraction by Bento Soares and M. Fatima Bonaldo.

Human cDNA Library Details:
470 different libraries so far, covering more than 40 tissues
Q&A
CGAP
Stomach
202.NCI_CGAP_Gas1 gastric tumor
203.NCI_CGAP_Gas4 gastric tumor
Testis
204.Barstead HPL-RB5 testis
205.Soares testis NHT
206.Life Tech.
testis (10426-013)
Thymus
207.NCI_CGAP_Thym1 thymoma
Thyroid
208.NCI_CGAP_Thy1 invasive thyroid tumor
Uterus
209.NCI_CGAP_Ut1 uterine tumor
210.NCI_CGAP_Ut2 uterine tumor
211.NCI_CGAP_Ut3 uterine tumor
212.NCI_CGAP_Ut4 uterine tumor
213.Soares pregnant uterus NbHPU

CGAP: Cancer Genome Anatomy Project

Why CGAP?
In the last two decades we have learned that genetic changes lie at the root of all cancers. In response, the Cancer Genome Anatomy Project (CGAP) will unite the newest technologies, along with those that are both cost-effective and capable of high throughput, to identify all the genes responsible for the establishment and growth of cancer.

Project Goals
To achieve a comprehensive molecular characterization of normal, precancerous, and malignant cells.

[Figure: expression fingerprints of Normal Cells and Cancer Cells]
Comparing the fingerprints of a normal versus a cancer cell will highlight genes that by their suspicious absence or presence (such as Gene H) deserve further scientific scrutiny to determine whether such suspects play a role in cancer, or can be exploited in a test for early detection.

Identifying the genetic differences among normal cells, precancerous cells, and cancer cells will contribute to our understanding of cancer, as it
- fosters the discovery of genes that directly cause cancer
- provides us with a way to identify early precancerous cells and thus enhances our methods for early detection
- improves our ability to match patients with appropriate treatment

[Figure: time line from Pre-cancer to Malignant Tumor]
The research results displayed in this graph demonstrate that for patients suffering from the cancer neuroblastoma, the presence or absence of a specific set of genes found on Chromosome 1 strongly correlates with patient outcome. Therefore, in the future this characteristic of the tumor can be used to identify those patients who would benefit from more aggressive treatment, and those best served by the current treatment protocol.
Laser Capture Microdissection (LCM)

1999: CGAP sequences: 473,746; CGAP genes: 20,665
2000: CGAP sequences: 925,746; CGAP genes: 79,844

Approaches:
- Sequencing of Expressed Sequence Tags (ESTs)
- Serial Analysis of Gene Expression
- Differential Display Approaches
- Hybridization Analysis

Digital Differential Display
The foundation of DDD is UniGene. UniGene employs a conservative method to assign all the human EST sequences that meet minimal standards of quality to distinct "clusters", each representing a unique human expressed gene. DDD takes advantage of UniGene by comparing the number of times sequences from different libraries were assigned to a particular UniGene cluster. This has the advantage that DDD will only report on sequences that we are confident represent bona fide human expressed genes.
There will of course be many differences in the number of sequences from each library that are assigned to a particular UniGene cluster, but only some of these differences are likely to reflect biological reality. Therefore DDD employs a statistical method of comparison, the Fisher Exact Test, to identify only those differences that are likely to be real. One important factor in determining statistical relevance is the absolute number of sequences in each library that have been successfully assigned to a UniGene cluster. In many cases there are not enough sequences in dbEST libraries to meet the threshold of significance employed in the Fisher Exact Test. Since DDD will only yield a report if there are differences that exceed this threshold, it is expected that many comparisons will yield nothing.

[DDD output, annotated:]
- the fraction of sequences within the pool
- a visual aid that reflects the numerical values
- statistically significant pairwise comparisons

THREE PRINCIPLES UNDERLIE THE SAGE TECHNOLOGY:
One short oligonucleotide sequence from a defined location within a transcript ("tag") allows accurate quantitation.
Tag size (10-14 bp) is optimal for high throughput while maintaining accurate gene identification and quantitation.
The combined power of serial and parallel processing increases data throughput by orders of magnitude when compared to conventional approaches.

Homolog: Genes that are descended from a common ancestor (e.g., all globins)
Ortholog: Homologous genes that have diverged from each other after speciation events (e.g., human beta- and chimp beta-globin)
Paralog: Homologous genes that have diverged from each other after gene duplication events (e.g., human beta- and gamma-globin)
Xenolog: Homologous genes that have diverged from each other after lateral gene transfer events (e.g., antibiotic resistance genes in bacteria)

Dec. 11, 1998: C. elegans: Sequence to Biology
- Jonathan Hodgkin, H. Robert Horvitz, Barbara R. Jasny, Judith Kimble
This special issue of Science celebrates a landmark in biology: determination of the essentially complete DNA sequence of an animal genome. The animal is a small invertebrate, the nematode (or roundworm) Caenorhabditis elegans, and the sequence consists of about 97 million base pairs of DNA, approximately one-thirtieth the number in the human genome. Nonetheless, the information content is enormous: eight times that of the budding yeast Saccharomyces cerevisiae, the only other eukaryote with a sequenced genome.

Genome Sequence of the Nematode C. elegans: A Platform for Investigating Biology
The C. elegans Sequencing Consortium
97 Mb
257 YACs (20% only in YAC)
2527 cosmids
113 fosmids
44 PCR
19,099 predicted genes
18,891 proteins (16,260 reviewed)
EST: 67,815 ESTs from 40,379 clones, representing 7432 genes

A multicellular organism genome
Genefinder program: ** trans-splicing **
40% of predicted genes have EST matches.
16,260/19,099 genes have been interactively reviewed.
Average of one gene per 5 kb.
Average of five introns per gene.
27% of genome resides in exons.
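The summary figures for the C. elegans genome can be cross-checked with simple arithmetic. The short sketch below only recomputes ratios from the numbers quoted on the slide; the variable names are made up for illustration, and reading the slide's "7432 genes" as the number of predicted genes represented by ESTs is an assumption.

```python
# Figures quoted on the slide for the C. elegans genome.
GENOME_BP = 97_000_000      # about 97 Mb
PREDICTED_GENES = 19_099
REVIEWED_GENES = 16_260
GENES_WITH_ESTS = 7_432     # assumption: genes represented by ESTs
EXON_FRACTION = 0.27        # 27% of the genome resides in exons


def genome_summary():
    """Recompute the derived ratios from the quoted figures."""
    return {
        # Should be close to the quoted "one gene per 5 kb".
        "bp_per_gene": GENOME_BP / PREDICTED_GENES,
        # Fraction of predicted genes that were interactively reviewed.
        "fraction_reviewed": REVIEWED_GENES / PREDICTED_GENES,
        # Should be close to the quoted "40% have EST matches".
        "fraction_with_ests": GENES_WITH_ESTS / PREDICTED_GENES,
        # Average amount of exonic sequence per predicted gene.
        "exon_bp_per_gene": EXON_FRACTION * GENOME_BP / PREDICTED_GENES,
    }


summary = genome_summary()
for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```

Under these assumptions the gene density and the EST-match fraction come out close to the rounded values on the slide, which suggests the quoted numbers are mutually consistent.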
Pfam protein family search: Intracellular communication; Transcriptional regulation

Table 1. The 20 most common protein domains in C. elegans (41). RRM, RNA recognition motif; RBD, RNA binding domain; RNP, ribonuclear protein motif; UDP, uridine 5'-diphosphate.
------------------------------------------------------------------
Number  Description
------------------------------------------------------------------
650     7 TM chemoreceptor
410     Eukaryotic protein kinase domain
240     Zinc finger, C4 type (two domains)
170     Collagen
140     7 TM receptor (rhodopsin family)
130     Zinc finger, C2H2 type
120     Lectin C-type domain short and long forms
100     RNA recognition motif (RRM, RBD, or RNP domain)
 90     Zinc finger, C3HC4 type (RING finger)
 90     Protein-tyrosine phosphatase
 90     Ankyrin repeat
 90     WD domain, G-beta repeats
 80     Homeobox domain
 80     Neurotransmitter-gated ion channel
 80     Cytochrome P450
 80     Helicases conserved C-terminal domain
 80     Alcohol/other dehydrogenases, short-chain type
 70     UDP-glucoronosyl and UDP-glucosyl transferases
 70     EGF-like domain
 70     Immunoglobulin superfamily

Worming secrets from the C. elegans genome: Dec 11, 1998, Science
Washington University Genome Sequencing Center; Sanger Centre
An 8-year effort; Sydney Brenner started it all.
By 1992, they were sequencing a million bases per year.
~$200 M
High-throughput sequencing.
Human genome project.
"We will be doing a lot of jumping back and forth between species" - F. Collins

Ping-Pong homology search
In silico cloning: To perform an electronic cDNA library screen, the EST sequences retrieved in this way can be used as queries in a BLASTN search of dbEST to identify overlapping ESTs. This procedure can be repeated with the newly identified ESTs until no additional hits are found. The ESTs isolated can then be assembled into sequence contigs using assembly software.
[Figure: overlapping EST 1, EST 2, EST 3]
There are many sequencing-related errors in dbEST.
C. elegans a.a.
sequences Human EST sequences Comparative Gene Identification Query= (597 letters) Sequences producing significant alignments: lcl|THC200240 lcl|THC151579 lcl|AA099787 (bits) 224 181 127 lcl|THC200240 Length = 764 Score = 224 bits (565), Expect = 4e-58 Identities = 106/187 (56%), Positives = 136/187 (72%) Value 4e-58 3e-45 8e-29 Query: 248 SGMKKNKYGNIEDLVVHLNFVCPKGIIQKQCQVPRMSSGPDIHQIILGSEGTLGVVSEVT 307 SGMKKN YGNIEDLVVH+ V P+GII+K CQ PRMS+GPDIH I+GSEGTLGV++E T Sbjct: 3 SGMKKNIYGNIEDLVVHIKXVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 182 lcl|THC151579 Length = 698 Score = 181 bits (455), Expect = 3e-45 Identities = 81/142 (57%), Positives = 106/142 (74%) Query: 446 LGMNHGVLGESFETSVPWDKVLSLCRNVKELMKREAKAQGVTHPVLANCRVTQVYDAGAC 505 L + + VLGESFETS PWD+V+ LCRNVKE + RE K +GV + CRVTQ YDAGAC Sbjct: 41 LALEYXVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 220 THC200240 sp|O00116|ADAS_HUMAN ALKYLDIHYDROXYACETONEPHOSPHATE SYNTHASE PRECURSOR (ALKYL-DHAP SYNTHASE) (ALKYLGLYCERONE-PHOSPHATE SYNTHASE) Length = 658 446-248=198 Score = 124 bits (309), Expect = 5e-29 517-319=198 Identities = 59/60 (98%), Positives = 59/60 (98%) 248 Query: 1 SGMKKNIYGNIEDLVVHIKXVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 60 SGMKKNIYGNIEDLVVHIK VTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT Sbjct: 319 SGMKKNIYGNIEDLVVHIKMVTPRGIIEKSCQGPRMSTGPDIHHFIMGSEGTLGVITEAT 378 THC151579 sp|O00116|ADAS_HUMAN ALKYLDIHYDROXYACETONEPHOSPHATE SYNTHASE PRECURSOR (ALKYL-DHAP SYNTHASE) (ALKYLGLYCERONE-PHOSPHATE SYNTHASE) Length = 658 Score = 127 bits (315), Expect = 1e-29 Identities = 59/60 (98%), Positives = 59/60 (98%) 446 Query: 1 LALEYXVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 60 LALEY VLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC Sbjct: 517 LALEYYVLGESFETSAPWDRVVDLCRNVKERITRECKEKGVQFAPFSTCRVTQTYDAGAC 576 U58746 [THC195737--------------------------------------------MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRNPVISPTGYIF --------] 
DREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFTKRTQFSAIESTPSRTGAVAT [THC195737-------------------PRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNMKGDKSTSLPSFWIPELNPTAVATKLEKPSS ----------------------------------------------------] KVLCPVSGKPIKLKELLEVKFTPMPGTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVV [THC195737----------------------] EKLIKGDGIDPINGEPMSEDDIIELQRGGTGYSATNETKAKLIRPQLELQ* (44%/59%) Translation of U58746 1 MTRHGKNCTAGAVYTYHEKKKDTAASGYGTQNIRLSRDAVKDFDCCCLSLQPCHD 1 MTRHGKNSTAASVYTYHERRRDAKASGYGTLHARLGADSIKEFHCCSLTLQPCRN *******.** .******...*. ****** . ** *..*.* **.*.****. 55 55 Translation of U58746 56 PVVTPDGYLYEREAILEYILHQKKEIARQMKAYEKQRGTRREEQKELQRAASQDH 110 56 PVISPTGYIFDREAILENILAQKKAYAKKLKEYEKQVAEESAAAKIAEGQAETFT 110 **..* **...****** ** *** *...* **** * . * Translation of U58746 111 VRGFLEKESAIVSRPLNPFTAKALSGTSPD-----------DVQPGPSVGPPSKD 154 111 KRTQFSAIESTPSRTGAVATPRPEVGSLKRQGGVMSTEIAAKVKAHGEEGVMSNM 165 * . ** * . *. *. * * Translation of U58746 155 K-DK--VLPSFWIPSLTPEAKATKLEKPSRTVTCPMSGKPLRMSDLTPVHFTPLD 206 166 KGDKSTSLPSFWIPELNPTAVATKLEKPSSKVLCPVSGKPIKLKELLEVKFTPMP 220 * ** ******* *.* * ******** * **.****... .* *.***. Translation of U58746 207 SSVDRVGLITRSER-YVCAVTRDSLSNATPCAVLRPSGAVVTLECVEKLIRKDMV 260 221 ------GTETAAHRKFLCPVTRDELTNTTRCAYLKKSKSVVKYDVVEKLIKGDGI 269 * * . * ..* **** *.*.* ** *. * .** . *****. * . Translation of U58746 261 DPVTGDKLTDRDIIVLQRGGTGFAGSGVKLQAEKSRPVMQA 301 270 DPINGEPMSEDDIIELQRGGTGYSAT-NETKAKLIRPQLELQ 310 **..*. ... *** *******.. . .* ** .. 
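Percent-identity figures such as those reported for the alignments here can be recomputed from a pair of aligned strings. The sketch below is a minimal illustration, assuming that gaps are written as "-" and that identity is counted over ungapped columns only; the slides do not state which denominator their percentages use, and BLAST itself divides by the full alignment length, so the numbers may differ slightly. The function name is hypothetical. The example strings are the first 20 residues of the two sequences in the first alignment block above.

```python
def alignment_stats(seq_a, seq_b, gap="-"):
    """Count identical columns in a pairwise alignment.

    Both strings must already be aligned (same length, gaps as `gap`).
    Columns where either sequence has a gap are counted separately and
    excluded from the identity denominator.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")

    aligned = identical = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            gaps += 1
            continue
        aligned += 1
        if a == b:
            identical += 1

    percent = 100.0 * identical / aligned if aligned else 0.0
    return {
        "aligned": aligned,
        "identical": identical,
        "gaps": gaps,
        "percent_identity": percent,
    }


# First 20 aligned residues of the two sequences shown above.
stats = alignment_stats("MTRHGKNCTAGAVYTYHEKK", "MTRHGKNSTAASVYTYHERR")
print(stats)
```

Counting "Positives" (similar residues) as BLAST does would additionally need a substitution matrix such as BLOSUM62, which is left out of this sketch.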
U50199 [THC171302-MVFGENQDLIRTHFQKEADKVRAMKTNWGLFTRTRMIAQSDYDFIVTYQQAENEAERSTVLSVFKEK ------------------------------------------------------------------AVYAFVHLMSQISKDDYVRYTLTLIDDMLREDVTRTIIFEDVAVLLKRSPFSFFMGLLHRQDQYIVH ------------------------------------------------------------------ITFSILTKMAVFGNIKLSGDELDYCMGSLKEAMNRGTNNDYIVTAVRCMQTLFRFDPYRVSFVNING ------------------------------------------------------------------YDSLTHALYSTRKCGFQIQYQIIFCMWLLTFNGHAAEVALSGNLIQTISGILGNCQKEKVIRIVVST -----------------] [THC177150-------------------------------------------LRNLITSNQDVYMKKQAALQMIQNRIPTKLDHLENRKFTDVDLVEDMVYLQTELKKVVQVLTSFDEY ------------------------------------------------------------------ENELRQGSLHWSPAHKCEVFWNENAHRLNDNRQELLKLLVAMLEKSNDPLVLCVAAHDIGEFVRYYP ------------------------------------------------] RGKLKVEQLGGKEAMMRLLTVKDPNVRYHALLAAQKLMINNWKDLGLEI Human gene: 483 aa gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha is... gi|2895576 (AF041337) vacuolar proton pump subunit SFD beta iso... gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; code... gi|1086810 (U41109) similar to S. cerevisiae vacular H(+)-ATPas... gnl|PID|e351278 (Z99532) hypothetical protein [Schizosaccharomy... sp|P41807|VM13_YEAST VACUOLAR ATP SYNTHASE 54 KD SUBUNIT (V-ATP... 927 885 468 335 185 123 0.0 0.0 e-131 5e-91 5e-46 2e-27 gi|1213557 (U50199) coded for by C. elegans cDNA yk89e9.5; coded for by C. elegans cDNA cm7g5; coded for by C. elegans cDNA cm14b9; coded for by C. elegans cDNA yk52g5.5; coded for by C. elegans cDNA yk76e5.5; coded for by C. elegans cDNA yk131f11.5; c... 
Length = 470 Score = 468 bits (1192), Expect = e-131 Identities = 243/477 (50%), Positives = 314/477 (64%), Gaps = 20/477 (4%) gi|2895578 (AF041338) vacuolar proton pump subunit SFD alpha isoform [Bos taurus] Length = 483 Score = 927 bits (2369), Expect = 0.0 Identities = 460/483 (95%), Positives = 465/483 (96%) Query: 1 Sbjct: 1 Query: 61 Sbjct: 61 MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISAEDCEFIQRFEMKRSPE 60 MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMIS+EDCEFIQRFEMKRSPE MTKMDIRGAVDAAVPTNIIAAKAAEVRANKVNWQSYLQGQMISSEDCEFIQRFEMKRSPE 60 EKQEMLQTEGSQCAKTFINLMTHICKEQTVQYILTMVDDMLQENHQRVSIFFDYARCSKN 120 EKQEMLQTEGSQ AKTFINLMTHI KEQTVQYILT+VDD LQENHQRVSIFFDYA+ SKN EKQEMLQTEGSQRAKTFINLMTHISKEQTVQYILTLVDDTLQENHQRVSIFFDYAKRSKN 120 Query: 121 TAWPYFLPILNRQDPFTVHMAARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180 TAW YFLP+LNRQD FTVHM ARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS Sbjct: 121 TAWSYFLPMLNRQDLFTVHMTARIIAKLAAWGKELMEGSDLNYYFNWIKTQLSSQKLRGS 180 Query: 181 GVAVETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240 GV ETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ Sbjct: 181 GVTAETGTVSSSDSSQYVQCVAGCLQLMLRVNEYRFAWVEADGVNCIMGVLSNKCGFQLQ 240 Query: 241 YQMIFSIWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSTERE 300 YQMIFS+WLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKS ERE Sbjct: 241 YQMIFSVWLLAFSPQMCEHLRRYNIIPVLSDILQESVKEKVTRIILAAFRNFLEKSVERE 300 Query: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK Sbjct: 301 TRQEYALAMIQCKVLKQLENLEQQKYDDEDISEDIKFLLEKLGESVQDLSSFDEYSSELK 360 Query: 361 SGRLEWSPVHKSEKFWRENAVRLNEKNYELLKILTKLLEVSDDPQXLAVAAHDVGXYVRX 420 SGRLEWSPVHKSEKFWREN RLNEKNYELLKILTKLLEVSDDPQ LAVAAHDVG YVR Sbjct: 361 SGRLEWSPVHKSEKFWRENPARLNEKNYELLKILTKLLEVSDDPQVLAVAAHDVGEYVRH 420 Query: 421 YPRGKRVIEQXGGKQLVMNHMHHEXQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTXA 480 YPRGKRVIEQ GGKQLVMNHMHHE QQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQT A Sbjct: 421 
YPRGKRVIEQLGGKQLVMNHMHHEDQQVRYNALLAVQKLMVHNWEYLGKQLQSEQPQTAA 480

Query: 481 ARS 483
           ARS
Sbjct: 481 ARS 483

U64857

[AA134689----------------------------------------------
MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGYSYCGETAAYAF
--------------------------]
KQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDHKINEELRATRHFDLMDRRDEES
[THC196496------------------------------------
EHSIEMQLPFIAKVMGSKRYTIVPVLVGSLPGSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERF
-----------------------------------------------------------------
SFSPYDRHSSIPIYEQITNMDKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRIS
-----------------------------------]
NNHTHEFRFLHYTQSNKVRSSVDSSVSYASGVLFVHPN

                        1 MSNR---VVCREASHAGSWYTASGPQLNAQLEGWLSQVQSTKRPARAIIAPHAGY  52
Translation of U64857   1 MSLNGFGEHTRSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGY  55
                          ** .* ********.* * ** ** . ***.*.*****

                       53 TYCGSCAAHAYKQVDPSITRRIFILGPSHHVPLSRCALSSVDIYRTPLYDLRIDQ  107
Translation of U64857  56 SYCGETAAYAFKQVVSSAVERVFILGPSHVVALNGCAITTCSKYRTPLGDLIVDH  110
                          .*** .** *.*** * *.******* * * **... ***** ** .*.

                      108 KIYGELWKTGMFERMSLQTDEDEHSIEMHLPYTAKAMESHKDEFTIIPVLVGALS  162
Translation of U64857 111 KINEELRATRHFDLMDRRDEESEHSIEMQLPFIAKVMGSKR--YTIVPVLVGSLP  163
                          ** ** * *. * . .* ******.**. ** * *.. .**.*****.*

                      163 ESKEQEFGKLFSKYLADPSNLFVVSSDFCHWGQRFRYSYYD-ESQGEIYRSIEHL  216
Translation of U64857 164 GSRQQTYGNIFAHYMEDPRNLFVISSDFCHWGERFSFSPYDRHSSIPIYEQITNM  218
                          *..* .* .*..*. ** ****.********.** .* ** * ** * ..

                      217 DKMGMSIIEQLDPVSFSNYLKKYHNTICGRHPIGVLLNAITELQK-NGMNMSFSF  270
Translation of U64857 219 DKQGMSAIETLNPAAFNDYLKKTQNTICGRNPILIMLQAAEHFRISNNHTHEFRF  273
                          ** *** ** * * .* **** .******.** ..*.* . *. . * *

                      271 LNYAQSSQCRNWQDSSVSYAAGALTVH  297
Translation of U64857 274 LHYTQSNKVRSSVDSSVSYASGVLFVHPN  302
                          *.*.** . * *******.* * **

BLASTP (Jan. 10, 1999)

                                                                    Score (bits)  E value
gi|1465834 (U64857) No definition line found [Caenorhabditis el...           300  1e-80
sp|Q10212|YAY4_SCHPO HYPOTHETICAL 34.8 KD PROTEIN C4H3.04C IN C...           215  3e-55
sp|P47085|YJX8_YEAST HYPOTHETICAL 38.5 KD PROTEIN IN SUI2-TDH2 ...           195  3e-49
gi|2425141 (AF020286) similar to C. elegans CEESS08F encoded by...           155  4e-37
gnl|PID|d1031681 (AP000006) 294aa long hypothetical protein [Py...            87  1e-16
gi|2983422 (AE000712) hypothetical protein [Aquifex aeolicus]                 85  7e-16
gi|2621080 (AE000796) conserved protein [Methanobacterium therm...            79  4e-14
gnl|PID|e283857 (Y08257) orf c05005 [Sulfolobus solfataricus]                 78  9e-14
sp|Q57846|Y403_METJA HYPOTHETICAL PROTEIN MJ0403 >gi|2129073|pi...            77  2e-13
gi|2983762 (AE000735) hypothetical protein [Aquifex aeolicus]                 68  1e-10

gi|1465834 (U64857) No definition line found [Caenorhabditis elegans]
           Length = 302

 Score = 300 bits (759), Expect = 1e-80
 Identities = 153/292 (52%), Positives = 198/292 (67%), Gaps = 4/292 (1%)

Query: 8  REASHAGSWYTASGPQLNAQLEGWLSQVQSTKRPARAIIAPHAGYTYCGSCAAHAYKQVD 67
          R ASHAGSWY A+   L+ QL  WL         ARA+I+PHAGY+YCG  AA+A+KQV
Sbjct: 11 RSASHAGSWYNANQRDLDRQLTKWLDNAGPRIGTARALISPHAGYSYCGETAAYAFKQVV 70

Z36238

[THC132858-------------------]
MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGTTSSQRVHTM
LTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKFTLQKTEWDSIDLERLNLA
[THC85433-----------------------------------------
LDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDMTIPRKRKGFTSQHEKGLEKFYEAVSTA
--------------------------------------------] {AA938998*****************
FMRHVNLQVVKCVIVASRGFVKDAFMQHLIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEV
*******} [THC200182---------------------------------------------------
LETPQVALRLADTKAQGEVKALNQFLELMSTEPDRAFYGFNHVNRANQELAIETLLVADSLFRA
-----------------------------------------------]
QDIETRRKYVRLVESVREQNGKVHIFSSMHVSGEQLAQLTGCAAILRFPMPDLDDEPMDEN

                        1 MKLVRKNIEKDNAGQVTLVPEEPEDMWHTYNLVQVGDSLRASTIRKVQTESSTGS  55
Translation of Z36238   1 MKQFKRGIERDGTGFVVLMAEEAEDMWHIYNLIRIGDIIKASTIRKVVSETSTGT  55
                          ** ...**.*..* * *. ** ***** ***...** ..******* .*.***.

                       56 VGSNRVRTTLTLCVEAIDFDSQACQLRVKGTNIQENEYVKMGAYHTIELEPNRQF  110
Translation of Z36238  56 TSSQRVHTMLTVSVESIDFDPGAQELHLKGRNIEENDIVKLGAYHTIDLEPNRKF  110
                          *.**.* **..**.**** * .*..** **.**. **.******.*****.*

                      111 TLAKKQWDSVVLERIEQACDPAWSADVAAVVMQEGLAHICLVTPSMTLTRAKVEV  165
Translation of Z36238 111 TLQKTEWDSIDLERLNLALDPAQAADVAAVVLHEGLANVCLITPAMTLTRAKIDM  165
                          ** * .***. ***. * *** .*******..****..**.**.*******...

                      166 NIPRKRKGNCSQHDRALERFYEQVVQAIQRHIHFDVVKCILVASPGFVREQFCDY  220
Translation of Z36238 166 TIPRKRKGFTSQHEKGLEKFYEAVSTAFMRHVNLQVVKCVIVASRGFVKDAFMQH  220
                          .******* .***.. **.*** * * **.. ****..*** ***.. *

                      221 MFQQAVKTDNKLLLGNRSKFLQVHASSGHKYSLKEALCDPTVLARLSDTKAAGEV  275
Translation of Z36238 221 LIAHADANGKKFTTEQRAKFMLTHSSSGFKHALKEVLETPQVALRLADTKAQGEV  275
                          . .* . * .*.**. *.*** * .*** * * * **.**** ***

                      276 KALDDSYKMLQHEPDRAFYGLKQVEKANEAMAIDTLLISDELFRHQDVATRSRYV  330
Translation of Z36238 276 KALNQFLELMSTEPDRAFYGFNHVNRANQELAIETLLVADSLFRAQDIETRRKYV  330
                          *** .. ******** .* .**. .**.***..* *** **. ** .**

                      331 RLVDSVKENAGTVRIFSSLHVSGEQLSQLTGVAAILRFPVPELSDQEGDS-SSEE  384
Translation of Z36238 331 RLVESVREQNGKVHIFSSMHVSGEQLAQLTGCAAILRFPMPDLDDEPMDEN  381
                          ***.**.*. * *.****.*******.**** *******.*.* *. *

                      385 D  385
Translation of Z36238 382    381

BLASTP (Jan. 10, 1999)

                                                                    Score (bits)  E value
sp|P48612|PELO_DROME PELOTA PROTEIN >gi|973224 (U27197) pelota ...           520  e-147
sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHRO...           446  e-125
gi|3941543 (AF069497) pelota [Arabidopsis thaliana]                          385  e-106
pir||S45456 DOM34 protein - yeast (Saccharomyces cerevisiae) >g...           236  2e-61
sp|P33309|DO34_YEAST DOM34 PROTEIN >gi|295608 (L11277) DOM34 [S...           212  2e-54
gnl|PID|e304505 (Z86109) unknown [Saccharomyces pastorianus]                 199  3e-50
gi|2622770 (AE000923) cell division protein [Methanobacterium t...           155  4e-37
gnl|PID|d1031529 (AP000006) 356aa long hypothetical protein [Py...           146  3e-34
sp|Q57638|Y174_METJA HYPOTHETICAL PROTEIN MJ0174 >gi|2127805|pi...           145  6e-34
gi|2649765 (AE001046) cell division protein pelota (pelA) [Arch...           116  3e-25

sp|P50444|YNU6_CAEEL HYPOTHETICAL 42.9 KD PROTEIN R74.6 IN CHROMOSOME III
>gi|3879163|gnl|PID|e1348805 (Z36238) Similar to the DOM34 protein of saccharomyces cerevisiae (Swiss Prot accession number P33309) [Caenorhabditis elegans]
           Length = 381

 Score = 446 bits (1136), Expect = e-125
 Identities = 215/371 (57%), Positives = 282/371 (75%)

[Figure: scatter plot of C. elegans protein length (axis 0 to 1200) against CGI protein length (axis 0 to 1000)]

[Figure: scatter plot of match area length (axis 0 to 800) against CGI protein length (axis 0 to 1000)]

[Figure: scatter plot, "Protein similarity between CGI and C. elegans" (axis 30 to 100), against CGI protein length (axis 0 to 1000)]

C. elegans from WormPept: 18,452 entries
HGI searches (5 days for TBLASTN analysis)

  *Families        3,934
  *Known Gene      7,954
  *New Contig      3,456
  *Undetermined    2,070
  <100 aa          1,038

83% between Human & C. elegans
11% C. elegans specific
*150 full length genes so far, more expected following gap closure and 5' RACE.

C. elegans from WormPept: 18,452 entries
MGI searches (5 days for TBLASTN analysis)

  *Families        5,602
  *Known Gene      4,151
  *New Contig      5,805
  *Undetermined    1,856
  <100 aa          1,038

84% between Mouse & C. elegans
10% C. elegans specific

http://www.ibms.sinica.edu.tw/~wenlin/
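
The percentages on the two WormPept summary slides appear to follow directly from the category counts. A minimal sketch of that arithmetic is given here, assuming that the "between Human/Mouse & C. elegans" share is the sum of the Families, Known Gene, and New Contig categories and that the "C. elegans specific" share is the Undetermined category. The slides do not state this derivation, so the grouping is an assumption, and the names `SEARCHES`, `MATCHED`, and `summarize` are hypothetical.

```python
# Category counts copied from the HGI and MGI summary slides.
WORMPEPT_ENTRIES = 18_452

SEARCHES = {
    "HGI": {"Families": 3_934, "Known Gene": 7_954, "New Contig": 3_456,
            "Undetermined": 2_070, "<100 aa": 1_038},
    "MGI": {"Families": 5_602, "Known Gene": 4_151, "New Contig": 5_805,
            "Undetermined": 1_856, "<100 aa": 1_038},
}

# Assumption: these three categories together form the "matched" share.
MATCHED = ("Families", "Known Gene", "New Contig")


def summarize(counts, total=WORMPEPT_ENTRIES):
    """Return whole-number percentages for one search summary."""
    # The five categories should partition the WormPept entries.
    if sum(counts.values()) != total:
        raise ValueError("category counts do not add up to the total")
    matched = sum(counts[name] for name in MATCHED)
    return {
        "matched_pct": round(100 * matched / total),
        # Assumption: "C. elegans specific" corresponds to "Undetermined".
        "specific_pct": round(100 * counts["Undetermined"] / total),
        "short_pct": round(100 * counts["<100 aa"] / total),
    }


if __name__ == "__main__":
    for name, counts in SEARCHES.items():
        print(name, summarize(counts))
```

Under this reading, both sets of counts add up to the 18,452 WormPept entries, and the rounded shares agree with the 83%/11% and 84%/10% figures on the slides.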