Bioinformatics: Understanding the data in databases
UBC Bioinformatics Centre
MedGen 505, January 9, 2003

Francis Ouellette
Director, UBC Bioinformatics Centre
Vancouver, BC, Canada
francis@cmmt.ubc.ca

http://vanbug.org
• Monthly bioinformatics seminar
• The second Thursday of every month
• Attended by academics, industry and government types
• Talk followed by beer and pizza
• Tonight @ 6:00 at the Chan Centre at the BCRI (CMMT)
• Nat Goodman, a senior research scientist from the ISB, aka “the IT guy”

Copyright 2002 UBC Bioinformatics Centre

http://bioinformatics.ubc.ca

Bioinformatics is about understanding how life works. It is a hypothesis-driven science.

In bioinformatics, we use software tools and biological databases to ask questions.

At the UBC Bioinformatics Centre (UBiC) we bring together scientists who share the vision of making advances in computational biology, and we work with bench scientists to validate the hypotheses we generate.

Structure
• Director
• Associate Director
• 6 adjunct faculty
• 4 more to be recruited
• Another recruitment already in progress

• Director of Operation and Strategy
• Chief Software Development
• Chief Bioinformatics
• Chief Systems
• Chief Training and Support
• Chief Web Development

UBiC: the vision
[Diagram labels: Large Scale Bioinformatics (BLAST, IDB, PeGASys); Basic Research (Gene Identification, Comparative Genomics, Algorithm development); Support & Training (CBW, WWW, Workshops)]

The UBC Bioinformatics Centre:

Ouellette Lab projects
• Core facility: training and support
• GeneComber: an ab initio gene finding algorithm
• IDB: the Integral DataBase system
• PeGASys: Parallel genome annotation system
• GeMS: Genomic Mutational Signature Sequences

http://bioinformatics.ca

Canadian Bioinformatics Workshop Series
[Diagram labels: Bioinformatics, Genomics, Proteomics, Developing the Tools, Intro Programming]

Bioinformatics is about bringing biological themes together with the help of computer tools and biological databases. Computational biology can lead us to new insights or directions.

BLAST Result
Basic Local Alignment Search Tool

PubMed Text Neighboring
Example titles: "Genetic Analysis of Cancer in Families" and "The Genetic Predisposition to Cancer"
• Common terms could indicate similar subject matter
• Statistical method
• Weights based on term frequencies within document and within the database as a whole
• Some terms are better than others

Microarray analysis: Science Jan 1 1999: 83-87
The Transcriptional Program in the Response of Human Fibroblasts to Serum
Vishwanath R. Iyer, Michael B. Eisen, Douglas T. Ross, Greg Schuler, Troy Moore, Jeffrey C. F. Lee, Jeffrey M. Trent, Louis M. Staudt, James Hudson Jr., Mark S. Boguski, Deval Lashkari, Dari Shalon, David Botstein, Patrick O. Brown
[Figure 1 and Figure 4 of the paper]

VAST Result
Vector Alignment Search Tool
Ferredoxin
• Halobacterium marismortui
• Chlorella fusca

Computational Biology Analysis
Q Gln NH2-C-CH2-CH2O
R Arg NH2-C-NH-CH2-CH2-CH2+NH2

Structural Interactions
Other interactions occurring within this structure (blue). In this case, Glutaminyl-tRNA Synthetase interacting with AMP.
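The term-weighting idea from the PubMed Text Neighboring slide above (weights based on term frequencies within a document and within the database as a whole) can be sketched in a few lines of Python. This is a generic TF-IDF plus cosine-similarity illustration, not the algorithm PubMed actually uses; the function names and the smoothed weighting formula are assumptions made for this example, and the three titles are the ones that appear on the slides.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a title and return its alphabetic terms."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Weight each term by its count in a document and its rarity in the collection."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed inverse document frequency (an assumption of this sketch):
        # terms found in every document get weight 0, rare terms get more.
        vectors.append({
            term: count * math.log((1 + n_docs) / (1 + doc_freq[term]))
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse term-weight dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


titles = [
    "Genetic Analysis of Cancer in Families",
    "The Genetic Predisposition to Cancer",
    "The Transcriptional Program in the Response of Human Fibroblasts to Serum",
]
vectors = tfidf_vectors(titles)
cancer_pair = cosine(vectors[0], vectors[1])
unrelated_pair = cosine(vectors[0], vectors[2])
print(f"cancer titles: {cancer_pair:.3f}, cancer vs. microarray: {unrelated_pair:.3f}")
```

With only three titles the numbers mean little, but the two cancer titles score higher with each other than with the microarray title because they share the rarer terms "genetic" and "cancer", which is the "some terms are better than others" point on the slide.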
Positional Cloning
Family Studies > Chromosome Interval > Large-Insert Clones > Candidate Genes > Disease Mutation
Steps: Genetic Mapping, Physical Mapping, Transcript Mapping, Gene Sequencing
Normal:  ATG GTC TCA CTG CAA CCG TGT   Met Val Ser Leu Gln Pro Cys
Mutant:  ATG GTC TCA CTG TAA CCG TGT   Met Val Ser Leu STOP
Disease mutation (*): CAA (Gln) becomes TAA (STOP)

Positional Candidate Cloning
Family Studies > Chromosome Interval > Candidate Genes > Disease Mutation
Steps: Genetic Mapping, Computer Search, Gene Sequencing
Normal:  ATG GTC TCA CTG CAA CCG TGT   Met Val Ser Leu Gln Pro Cys
Mutant:  ATG GTC TCA CTG TAA CCG TGT   Met Val Ser Leu STOP
Disease mutation (*): CAA (Gln) becomes TAA (STOP)

What does it mean to do CB?
• Like to work with sequences, structures, expression arrays, interaction of molecules and genetic maps
• Like the whole systems approach
• Like the IT component, and the power it provides for crunching through lots of data
• Like clear answers
• Like to do Science

Doing CB means to be …
• Database user
• Tool user
• Database developer
• Tool developer
• Training, practicing or developing
• Doing bioinformatics experiments

Bioinformatics experiments: Sequence BLAST search
Reagents (know your reagents):
• Sequence
• Databases
Method (know your methods):
• P-P          BLASTP
• N-P          BLASTX
• P-N          TBLASTN
• N-N          BLASTN
• N (P) – N (P)  TBLASTX
Interpretation (do your controls):
• Alignment
• Similarity
• Hypothesis testing

Nature 409:452

Part 1. The Databases
1. GenBank: The Nucleotide Sequence Database
2. PubMed: The Bibliographic Database
3. Macromolecular Structure Databases
4. The Taxonomy Project
5. The Single Nucleotide Polymorphism Database
6. The Gene Expression Omnibus (GEO)
7. Online Mendelian Inheritance in Man (OMIM)
8. The NCBI BookShelf: Searchable Biomedical Books
9. PubMed Central (PMC)
10. The SKY/CGH Database
Part 2. Data Flow and Processing
11. Sequin: A Sequence Submission and Editing Tool
12. The Processing of Biological Sequence Data at NCBI
13. Genome Assembly and Annotation Process
Part 3. Querying and Linking the Data
14. The Entrez Search and Retrieval System
15. The BLAST Sequence Analysis Tool
16. LinkOut: Linking to External Resources from Entrez
17. The Reference Sequence (RefSeq) Project
18. LocusLink: A Directory of Genes
19. Using the Map Viewer to Explore Genomes
20. UniGene: A Unified View of the Transcriptome
21. The Clusters of Orthologous Groups (COGs)
Part 4. User Support
22. User Services: Helping You Find Your Way
23. Exercises: Using Map Viewer
Glossary

The challenge of the information space: Jan 2002
Nucleotide records                 14,976,310
Nucleotides                        15,849,921,438
Protein sequences                  1,793,850
3D structures                      16,500
Interactions                       6,181
Expression data points             >20,000,000
Human UniGene Clusters             96,109
Maps and Complete Genomes          1,600
Different taxonomy Nodes           229,799
Human dbSNP                        4,116,188
Human RefGenes records             17,984
bp in Human Contigs > 500 kb       1,154,596,000
PubMed records                     11,692,207
OMIM records                       13,346

The challenge of the information space: Jan 2003
Nucleotide records                 22,318,883
Nucleotides                        28,507,990,166
Protein sequences                  2,955,588
3D structures                      19,392
Interactions & complexes           7,119
Expression data points             >40,000,000
Human UniGene Clusters             115,523
Maps and Complete Genomes          2,698
Different taxonomy Nodes           278,402
Human dbSNP                        4,892,258
Human RefSeq records               20,008
bp in Human Contigs > 500 kb       1,451,804
PubMed records                     12,319,105
OMIM records                       14,116

Databases
• Organized array of information
• A place where you put things in and (if all is well) can get them out again
• Resource for other databases and tools
• Simplify the information space by specialization
• Bonus: allows you to make discoveries

Databases
Layers: Information system, Query system, Storage system, Data
[Diagram examples: a list you look at, a catalogue, indexed files, SQL, grep; boxes, bookshelves, PC binary files, Unix text files; GenBank flat file, PDB file, interaction record, title of a book, book; the UBC library, Google, Entrez, SRS]

“... the more closely and elegantly a model follows a real phenomenon, the more useful it is in predicting or understanding the natural phenomenon it mimics.”
Ostell, Wheelan & Kans on the “NCBI data model”, from “Bioinformatics, a Practical Guide to the Analysis of Genes and Proteins”, Baxevanis and Ouellette, Eds. 2001

Using the NCBI data model
[Diagram labels: MEDLINE / PubMed (online journals, full text); GenBank (nucleotide sequences, accession numbers); Protein sequences; Genomes (map, links); Structures (MMDB structure:function, VAST); BIND interaction:function; Expression data; SNP data; CMMT]

Primary Data
• DNA sequences
• RNA sequences
• Protein sequences
  – In most cases protein sequences are interpreted sequences.
• 3D structures
• Expression data
• Polymorphism data
• Interaction data

Databases: some examples
• Primary (archival)
  – DDBJ/EMBL/GenBank
  – TrEMBL
  – UniProt
  – PDB
  – MEDLINE
  – BIND
• Secondary (curated)
  – LocusLink
  – RefSeq
  – Taxon
  – Swiss-Prot
  – PROSITE
  – OMIM
  – SGD
  – FlyBase
  – GO

What is GenBank?
GenBank is the NIH genetic sequence database of all publicly available DNA and derived protein sequences, with annotations describing the biological information these records contain.
http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
Benson et al., 2002, Nucleic Acids Res. 29:12-17

[Diagram: the three collaborating databases exchange data; each receives submissions and updates.]
• GenBank: NCBI, NIH; retrieval through Entrez
• EMBL: EBI; retrieval through SRS
• DDBJ: CIB, NIG; retrieval through getentry

GenBank Flat File (GBFF)

LOCUS       MUSNGH       1803 bp    mRNA            ROD       29-AUG-1997
DEFINITION  Mouse neuroblastoma and rat glioma hybridoma cell line NG108-15
            cell TA20 mRNA, complete cds.
ACCESSION   D25291
NID         g1850791
KEYWORDS    neurite extension activity; growth arrest; TA20.
SOURCE      Murinae gen. sp. mouse neuroblastma-rat glioma hybridoma
            cell_line:NG108-15 cDNA to mRNA.
  ORGANISM  Murinae gen. sp.
            Eukaryotae; mitochondrial eukaryotes; Metazoa; Chordata;
            Vertebrata; Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae;
            Murinae.
REFERENCE   1  (sites)
  AUTHORS   Tohda,C., Nagai,S., Tohda,M. and Nomura,Y.
  TITLE     A novel factor, TA20, involved in neuronal differentiation: cDNA
            cloning and expression
  JOURNAL   Neurosci. Res. 23 (1), 21-27 (1995)
  MEDLINE   96064354
REFERENCE   3  (bases 1 to 1803)
  AUTHORS   Tohda,C.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-1993) to the DDBJ/EMBL/GenBank databases.
            Chihiro Tohda, Toyama Medical and Pharmaceutical University,
            Research Institute for Wakan-yaku, Analytical Research Center for
            Ethnomedicines; 2630 Sugitani, Toyama, Toyama 930-01, Japan
            (E-mail:CHIHIRO@ms.toyama-mpu.ac.jp, Tel:+81-764-34-2281(ex.2841),
            Fax:+81-764-34-5057)
COMMENT     On Feb 26, 1997 this sequence version replaced gi:793764.
FEATURES             Location/Qualifiers
     source          1..1803
                     /organism="Murinae gen. sp."
                     /note="source origin of sequence, either mouse or rat, has
                     not been identified"
                     /db_xref="taxon:39108"
                     /cell_line="NG108-15"
                     /cell_type="mouse neuroblastma-rat glioma hybridoma"
     misc_signal     156..163
                     /note="AP-2 binding site"
     GC_signal       647..655
                     /note="Sp1 binding site"
     TATA_signal     694..701
     gene            748..1311
                     /gene="TA20"
     CDS             748..1311
                     /gene="TA20"
                     /function="neurite extensiion activity and growth arrest
                     effect"
                     /codon_start=1
                     /db_xref="PID:d1005516"
                     /db_xref="PID:g793765"
                     /translation="MMKLWVPSRSLPNSPNHYRSFLSHTLHIRYNNSLFISNTHLSRR
                     KLRVTNPIYTRKRSLNIFYLLIPSCRTRLILWIIYIYRNLKHWSTSTVRSHSHSIYRL
                     RPSMRTNIILRCHSYYKPPISHPIYWNNPSRMNLRGLLSRQSHLDPILRFPLHLTIYY
                     RGPSNRSPPLPPRNRIKQPNRIKLRCR"
     polyA_site      1803
BASE COUNT      507 a    458 c    311 g    527 t
ORIGIN
        1 tcagtttttt tttttttttt tttttttttt tttttttttt tttttttttg ttgattcatg
       61 tccgtttaca tttggtaagt tcacaggcct cagtcaacac aattggactg ctcaggaaat
      121 cctccttggt gaccgcagta tacttggcct atgaacccaa gccacctatg gctaggtagg
      181 agaagctcaa ctgtagggct gactttggaa gagaatgcac atggctgtat cgacatttca
      241 catggtggac ctctggccag agtcagcagg ccgagggttc tcttccgggc tgctccctca
      301 ctgcttgact ctgcgtcagt gcgtccatac tgtgggcgga cgttattgct atttgccttc
      361 cattctgtac ggcattgcct ccatttagct ggagagggac agagcctggt tctctagggc
      421 gtttccattg gggcctggtg acaatccaaa agatgagggc tccaaacacc agaatcagaa
      481 ggcccagcgt atttgtaaaa acaccttctg gtgggaatga atggtacagg ggcgtttcag
      541 gacaaagaac agcttttctg tcactcccat gagaaccgtc gcaatcactg ttccgaagag
      601 gaggagtcca gaatacacgt gtatgggcat gacgattgcc cggagagagg cggagcccat
      661 ggaagcagaa agacgaaaaa cacacccatt atttaaaatt attaaccact cattcattga
      721 cctacctgcc ccatccaaca tttcatcatg atgaaacttt gggtcccttc taggagtctg
      781 cctaatagtc caaatcatta caggtctttt cttagccata cactacacat cagatacaat
      841 aacagccttt tcatcagtaa cacacatttg tcgagacgta aattacgggt gactaatccg
      901 atatatacac gcaaacggag cctcaatatt ttttatttgc ttattccttc atgtcggacg
      961 aggcttatat tatggatcat atacatttat agaaacctga aacattggag tacttctact
     1021 gttcgcagtc atagccacag catttatagg ctacgtcctt ccatgaggac aaatatcatt
     1081 ctgaggtgcc acagttatta caaacctcct atcagccatc ccatatattg gaacaaccct
     1141 agtcgaatga atttgagggg gcttctcagt agacaaagcc accttgaccc gattcttcgc
     1201 tttccacttc atcttaccat ttattatcgc ggccctagca atcgttcacc tcctcttcct
     1261 ccacgaaaca ggatcaaaca acccaacagg attaaactca gatgcagata aaattccatt
     1321 tcacccctac tatacatcaa agatatccta ggtatcctaa tcatattctt aattctcata
     1381 accctagtat tatttttccc agacatacta ggagacccag acaactacat accagctaat
     1441 ccactaaaca ccccacccca tattaaaccc gaatgatatt tcctatttgc atacgccatt
     1501 ctacgctcaa tccccaataa actaggaggt gtcctagcct taatcttatc tatcctaatt
     1561 ttagccctaa tacctttcct tcatacctca aagcaacgaa gcctaatatt ccgcccaatc
     1621 acacaaattt tgtactgaat cctagtagcc aacctactta tcttaacctg aattgggggc
     1681 caaccagtag acacccattt attatcattg gccaactagc ctccatctca tacttctcaa
     1741 tcatcttaat tcttatacca atctcaggaa ttatcgaaga caaaatacta aaattatatc
     1801 cat
//

Sections of the flat file:
• Header: title, taxonomy, citation
• Features (AA seq)
• DNA sequence

Abstract Syntax Notation (ASN.1)

FASTA

>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4
MSEYQPSLFALNPMGFSPLDGSKSTNENVSASTSTAKPMVGQLIFDKFIKTEEDPI
IKQDTPSNLDFDFALPQTATAPDAKTVLPIPELDDAVVESFFSSSTDSTPMFEYEN
LEDNSKEWTSLFDNDIPVTTDDVSLADKAIESTEEVSLVPSNLEVSTTSFLPTPVL
EDAKLTQTRKVKKPNSVVKKSHHVGKDDESRLDHLGVVAYNRKQRSIPLSPIVPES
SDPAALKRARNTEAARRSRARKLQRMKQLEDKVEELLSKNYHLENEVARLKKLVGE
R

Graphical Representation

[Diagram labels: ASN.1, FASTA, MMDB, EMBL, Swiss-Prot, GenBank, GenPept, Graphical]

Outline
• GenBank dissection
  – identifiers
  – divisions
  – format/structure
  – features
  – file conversions

Organismal Divisions    Used in which database?
BCT   Bacterial                 DDBJ - GenBank
FUN   Fungal                    EMBL
HUM   Homo sapiens              DDBJ - EMBL
INV   Invertebrate              all
MAM   Other mammalian           all
ORG   Organelle                 EMBL
PHG   Phage                     all
PLN   Plant                     all
PRI   Primate (also see HUM)    all (not same data in all)
PRO   Prokaryotic               EMBL
ROD   Rodent                    all
SYN   Synthetic and chimeric    all
VRL   Viral                     all
VRT   Other vertebrate          all

Functional Divisions
PAT   Patent
EST   Expressed Sequence Tags
STS   Sequence Tagged Site
GSS   Genome Survey Sequence
HTG   High Throughput Genome (unfinished)
HTC   High throughput cDNA (unfinished)
Organismal divisions: BCT, FUN, INV, MAM, PHG, PLN, PRI, ROD, SYN, VRL, VRT

Guiding Principles
In GenBank, records are grouped for various reasons: understanding this is key to using and fully taking advantage of this database.

LOCUS, Accession, NID and protein_id
LOCUS: Unique string of 10 letters and numbers in the database. Not maintained amongst databases, and is therefore a poor sequence identifier.
ACCESSION: A unique identifier for that record and a citable entity; does not change when the record is updated. A good record identifier, ideal for citation in publications.
VERSION: New system where the accession and version play the same function as the accession and gi number.
Nucleotide gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
PID: Protein Identifier: g, e or d prefix to gi number. Can have one or two on one CDS.
Protein gi: GenInfo identifier (gi), a unique integer which will change every time the sequence changes.
protein_id: Identifier which has the same structure and function as the nucleotide Accession.version numbers, but a slightly different format.

LOCUS, Accession, gi and PID

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001

LOCUS:       HSU40282
ACCESSION:   U40282
VERSION:     U40282.1
GI:          3150001
PID:         g3150002
Protein gi:  3150002
protein_id:  AAC16892.1

     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"

Sample GenBank mRNA Record

LOCUS       HSU40282     1789 bp    mRNA            PRI       21-MAY-1998
DEFINITION  Homo sapiens integrin-linked kinase (ILK) mRNA, complete cds.
ACCESSION   U40282
VERSION     U40282.1  GI:3150001
KEYWORDS    .
SOURCE      human.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 1789)
  AUTHORS   Hannigan,G.E., Leung-Hagesteijn,C., Fitz-Gibbon,L.,
            Coppolino,M.G., Radeva,G., Filmus,J., Bell,J.C. and Dedhar,S.
  TITLE     Regulation of cell adhesion and anchorage-dependent growth by a
            new beta 1-integrin-linked protein kinase
  JOURNAL   Nature 379 (6560), 91-96 (1996)
  MEDLINE   96135142
REFERENCE   2  (bases 1 to 1789)
  AUTHORS   Dedhar,S. and Hannigan,G.E.
  TITLE     Direct Submission
  JOURNAL   Submitted (07-NOV-1995) Shoukat Dedhar, Cancer Biology Research,
            Sunnybrook Health Science Centre and University of Toronto, 2075
            Bayview Avenue, North York, Ont. M4N 3M5, Canada

Sample GenBank Record

FEATURES             Location/Qualifiers
     source          1..1789
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /chromosome="11"
                     /map="11p15"
                     /cell_line="HeLa"
     gene            1..1789
                     /gene="ILK"
     CDS             157..1515
                     /gene="ILK"
                     /note="protein serine/threonine kinase"
                     /codon_start=1
                     /product="integrin-linked kinase"
                     /protein_id="AAC16892.1"
                     /db_xref="PID:g3150002"
                     /db_xref="GI:3150002"
                     /translation="MDDIFTQCREGNAVAVRLWLDNTENDLNQGDDHGFSPLHWACRE
                     . . .
                     DK"
BASE COUNT      443 a    488 c    480 g    378 t
ORIGIN
        1 gaattcatct gtcgactgct accacgggag ttccccggag aaggatcctg cagcccgagt
      < ...>
     1681 ggcgggctca gagctttgtc acttgccaca tggtgtcttc caacatggga gggatcagcc
     1741 ccgcctgtca caataaagtt tattatgaaa aaaaaaaaaa aaaaaaaaa
//

EST: Expressed Sequence Tag
Expressed Sequence Tags are short (300-500 bp) single reads from mRNA (cDNA) which are produced in large numbers. They represent a snapshot of what is expressed in a given tissue and developmental stage.
Also see:
http://www.ncbi.nlm.nih.gov/dbEST/
http://www.ncbi.nlm.nih.gov/UniGene/

LOCUS       AA675481      524 bp    mRNA            EST       28-NOV-1997
DEFINITION  vr72d07.s1 Knowles Solter mouse 2 cell Mus musculus cDNA clone
            IMAGE:1134253 5' similar to TR:G992993 G992993 MYOSIN LIGHT CHAIN
            KINASE. ;, mRNA sequence.
ACCESSION   AA675481
VERSION     AA675481.1  GI:2652718
KEYWORDS    EST.
SOURCE      house mouse
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
            Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Mus.
...
COMMENT     Contact: Marra M/Mouse EST Project
            WashU-HHMI Mouse EST Project
            Washington University School of MedicineP
            4444 Forest Park Parkway, Box 8501, St. Louis, MO 63108
            Tel: 314 286 1800
            Fax: 314 286 1810
            Email: mouseest@watson.wustl.edu
            This clone is available royalty-free through LLNL ; contact the
            IMAGE Consortium (info@image.llnl.gov) for further information.
            MGI:615525
            Possible reversed clone: similarity on wrong strand
            High quality sequence stop: 469.
FEATURES             Location/Qualifiers
     source          1..524
                     /organism="Mus musculus"
                     /strain="B6D2 F1/J"
                     /note="Organ: embryo; Vector: pBluescribe (modified);
                     Site_1: MluI; Site_2: SalI; Cloned unidirectionally from
                     mRNA prepared from 13,500 2-cell stage embryos. Primer:
                     SalI(dT): 5'-CGGTCGACCGTCGACCGTTTTTTTTTTTTTTT-3'. cDNAs
                     were cloned into the MluI/SalI sites of a modified
                     pBluescribe vector using commercial linkers (NEB). Average
                     insert size: 1.2 kb."
                     /db_xref="taxon:10090"
                     /clone="1134253"
                     /clone_lib="Knowles Solter mouse 2 cell"
                     /tissue_type="embryo"
                     /dev_stage="2-cell"
                     /lab_host="DH10B"
BASE COUNT      168 a    111 c    115 g    130 t
ORIGIN
        1 ctcagttgta gacagtgagc cagtcagatt tactgttaaa gtaacaggag aacccaagcc
       61 ggaaattaca tggtggtttg aaggagaaat actgcaggat ggagaagact atcagtacat
      121 cgaaagaggt gaaacttact gcctgtattt accggaaacc ttcccagaag atggaggaga
      181 gtacatgtgt aaggcagtca acaataaagg ctcagcagcg agcacctgca ttcttaccat
      241 tgaaatggat gactactagg cttccctctg tccttgggac tctctctctc gctgcatctc
      301 tgtggagggg ccaaaaagga gaccagaggt gccactataa ctgacttaat ctttccccaa
      361 atcttcctct taagaacttc tcatgcatat caggttcatt accatgctgt gcaaagtcaa
      421 agcatagctg acagaaaagg gaaataaatg tacccattct gtcagaacta agacagaagc
      481 ttcgtattta tagaactaag acttaacata tacagtttgc atga
//

STS
Sequence Tagged Sites are operationally unique sequences, each identified by the pair of primers used in a PCR assay; the assay generates a mapping reagent that maps to a single position within the genome.
Also see:
http://www.ncbi.nlm.nih.gov/dbSTS/
http://www.ncbi.nlm.nih.gov/genemap/

GSS: Genome Survey Sequences
Genome Survey Sequences are similar in nature to ESTs, except that their sequences are genomic in origin, rather than cDNA (mRNA). The GSS division contains:
• random "single pass read" genome survey sequences
• single pass reads from cosmid/BAC/YAC ends (these could be chromosome specific, but need not be)
• exon trapped genomic sequences
• Alu PCR sequences
Also see:
http://www.ncbi.nlm.nih.gov/dbGSS/

LOCUS       FR0029137     445 bp    DNA             GSS       30-JUN-1998
DEFINITION  Fugu rubripes GSS sequence, clone 037G16aE9, genomic survey
            sequence.
ACCESSION   AL031006
VERSION     AL031006.1  GI:3286795
KEYWORDS    GSS; genome survey sequence.
SOURCE      Fugu rubripes.
  ORGANISM  Fugu rubripes
            Eukaryota; Metazoa; Chordata; Vertebrata; Actinopterygii;
            Neopterygii; Teleostei; Euteleostei; Acanthopterygii; Percomorpha;
            Tetraodontiformes; Tetraodontoidei; Tetraodontidae; Fugu.
REFERENCE   1  (bases 1 to 445)
  AUTHORS   Elgar,G., Clark,M., Smith,S., Meek,S., Warner,S., Umrania,Y.,
            Williams,G. and Brenner,S.
  TITLE     Direct Submission
  JOURNAL   Submitted (09-JUN-1998) MRC Human Genome Mapping Project Resource
            Centre, Hinxton, Cambridge, CB10 1SB, UK. Email:
            biohelp@hgmp.mrc.ac.uk
COMMENT     Vector: pBluescript II KS V_type: phagemid PRIMER: KS DESCR: One
            pass dye-terminator sequencing of cosmid cloned genomic sequence.

Genome Survey Sequences

FEATURES             Location/Qualifiers
     source          1..445
                     /organism="Fugu rubripes"
                     /db_xref="taxon:31033"
                     /clone_lib="cosmid 037G16"
                     /clone="037G16aE9"
BASE COUNT      124 a     96 c     97 g    126 t      2 others
ORIGIN
        1 atcctgcagt gaggcagaac agggnctgtt tccatttttt gtctgtcagt ttaaacagtg
       61 gtcggccgta aaagtcctcc gaaaacccac aaagcctttg cctatcgttc caaatcttac
      121 atgggtaagt gcaaacattt aactcaagat aagtgccttt gagataacaa aacctctttt
      181 ttcaagagag tcttggaagc gtacacacct acagcgtagc tgtttttacc tcagatgaat
      241 gtctttggna tgagggaggg aaccagatac ctggtgaaaa cccatgcaga cttgcggaga
      301 gcactgtgaa accctctggt actgagccct gaaacttcat gttgtgaggc aacagtgctt
      361 accaaaagtt tatcctgcaa ctgctattta acttctgtta gcctctgttt tggagaccac
      421 atgagttaaa tacggtttgt tgaaa
//

HTG: High Throughput Genome
High Throughput Genome Sequence records come from unfinished genome sequencing efforts. Unfinished records have gaps in the nucleotide sequence, low accuracy, and no annotations.
Also see:
http://www.ncbi.nlm.nih.gov/HTGS/
Ouellette and Boguski (1997) Genome Res. 7:952-955

HTGS in GenBank
Phase     Accession    gi         Division
phase 1   AC000003     1556454    HTG
phase 2   AC000003     2182283    HTG
phase 3   AC000003     2204282    PRI

HTGS in GenBank
• Unfinished record
  – Sequencing will be unfinished
  – Phase 1 or phase 2
  – HTG division
  – KEYWORDS: HTG; HTGS_PHASE1 or 2
• Finished record
  – Sequencing will be finished
  – Phase 3
  – Organismal division it belongs to: PRI, INV or PLN
  – KEYWORDS: HTG

HTGS: phase 1

LOCUS       HSAC000003  120000 bp    DNA             HTG       20-SEP-1996
DEFINITION  *** SEQUENCING IN PROGRESS *** Chromosome 17 genomic sequence;
            HTGS phase 1, 6 unordered pieces.
ACCESSION   AC000003
KEYWORDS    HTG; HTGS_PHASE1.
...
COMMENT     *** *** *** WARNING: Phase 1 High Throughput Genome Sequence *** *** ***
            * This sequence is unfinished. It consists of 6 contigs for
            * which the order is not known; their order in this record is
            * arbitrary. In some cases, the exact lengths of the gaps
            * between the contigs are also unknown; these gaps are presented
            * as runs of N as a convenience only. When sequencing is complete,
            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    22526: contig of 22526 bp in length
            *    22527    23035: gap of unknown length
            *    23036    33919: contig of 10884 bp in length
            *    33920    34427: gap of unknown length
            *    34428    61877: contig of 27450 bp in length
...
//

HTGS Phase 1

            * the sequence data presented in this record will be replaced
            * by a single finished sequence with the same accession number.
            *        1    33214: contig of 33214 bp in length
            *    33215    33250: gap of unknown length
            *    33251    35134: contig of 1884 bp in length
...
The run of n below (positions 33215 to 33250) is the gap of unknown length:
    33061 ggagagcttc agggagactc tgcggaatag caggttgtaa tcttccggtt cgatagtcga
    33121 taaatgtctg gtttaccttc agccgaaacg cgggagaaat ccagcctgcg tactccacag
    33181 cgagcaattc atgggcaaaa gtgccgccgc cacgnnnnnn nnnnnnnnnn nnnnnnnnnn
    33241 nnnnnnnnnn tagttcatca ccttctggtg gaagccacat tttctctttc ctttctttcc
    33301 ctgtctaccc tccctcttcc ccttcctccc caaatctatc agtaaagacc accttgctgt
    33361 gggcagctag ctgaaagaga ccatctgcct taggaatagc ctacactaga ttcaaactac
    33421 aaagaagcag gttgggggaa agaggaagtg aggatttcaa gtcaagaaag catcctgcct

HTGS phase 3

LOCUS       AC000003    122228 bp    DNA             PRI       07-OCT-1997
DEFINITION  Homo sapiens chromosome 17, clone 104H12, complete sequence.
ACCESSION   AC000003
NID         g2204282
KEYWORDS    HTG.
...
COMMENT     The Staden databases, finishing information, and all
            chromatographic files used in the assembly of this clone are
            available from our anonymous ftp site. All repeats were identified
            using RepeatMasker: Smit, A.F.A. & Green, P. (1996-1997)
            http://ftp.genome.washington.edu/RM/RepeatMasker.html.
FEATURES             Location/Qualifiers
     source          1..122228
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /clone="104H12"
                     /clone_lib="Research Genetics/Cal Tech CITB978SK-B
                     (plates 1-194)"
                     /chromosome="17"
     repeat_region   261..370
                     /rpt_family="MLT1B"

Locus Link

http://nar.oupjournals.org/content/vol31/issue1/

Genome Projects: discussion point
• Whole genome assembly
• “Bermuda agreement”
• HTG
• Finished
• What is it to be “finished”: 1:10,000 error rate?
• How useful is an unfinished genome?
• Reference genomes
• TPA and RefSeq

In Closing ...
• Able to recognize various data formats, and know what their primary use is.
• Know, understand and utilize all types of sequence identifiers.
• Know and understand various feature types present in the GenBank flat files.
• Know and understand the various GenBank divisions.

Resources
• WWW:
  – http://www.ncbi.nlm.nih.gov
  – http://www.ddbj.nig.ac.jp/
  – http://www.ebi.ac.uk/
  – http://www.ncbi.nlm.nih.gov/Web/Genbank/index.html
  – http://www.expasy.ch/sprot/
  – http://www.rcsb.org/pdb/index.html
  – http://www.ncbi.nlm.nih.gov/Omim/
  – http://genome-www.stanford.edu/Saccharomyces/
  – http://nar.oupjournals.org/content/vol30/issue1/
  – http://nar.oupjournals.org/content/vol31/issue1/
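As a way to practice the identifier conventions covered in the GenBank slides (ACCESSION, Accession.version, gi, and the FASTA definition line), here is a minimal Python sketch. The function names are hypothetical, the VERSION parser assumes the "VERSION accession.version GI:number" layout shown in the sample records above, and the FASTA parser handles only the gi|number|db|accession|name layout of the GCN4 example; neither is a general-purpose GenBank parser.

```python
import re


def split_accession_version(identifier):
    """Split 'U40282.1' into ('U40282', 1); version is None when absent."""
    accession, dot, version = identifier.partition(".")
    return accession, int(version) if dot else None


def parse_version_line(line):
    """Parse a flat-file line such as 'VERSION     U40282.1  GI:3150001'."""
    match = re.match(r"VERSION\s+(\S+)\s+GI:(\d+)", line)
    if match is None:
        raise ValueError(f"not a VERSION line: {line!r}")
    accession, version = split_accession_version(match.group(1))
    return {"accession": accession, "version": version, "gi": int(match.group(2))}


def parse_fasta_defline(defline):
    """Parse '>gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4'."""
    ids, _, description = defline.lstrip(">").partition(" ")
    fields = ids.split("|")
    # Only the gi|<number>|<db>|<accession>|<name> layout is handled here.
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"unsupported definition line: {defline!r}")
    return {
        "gi": int(fields[1]),
        "database": fields[2],
        "accession": fields[3],
        "name": fields[4] if len(fields) > 4 else "",
        "description": description.strip(),
    }


ilk = parse_version_line("VERSION     U40282.1  GI:3150001")
gcn4 = parse_fasta_defline(
    ">gi|121066|sp|P03069|GCN4_YEAST GENERAL CONTROL PROTEIN GCN4"
)
print(ilk)
print(gcn4)
```

The accession stays stable across updates while the version and gi change with the sequence, so keeping them as separate fields makes it easy to cite the accession and still detect that a record has been revised.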