A Field Guide to GenBank and NCBI Molecular Biology Resources
slightly modified from
Peter Cooper ftp://ftp.ncbi.nih.gov/pub/cooper/FieldGuide/
Eric Sayers ftp://ftp.ncbi.nih.gov/pub/sayers/Field_Guide/U_Penn/

NCBI Resources
• About NCBI
• NCBI Sequence Databases
  – Primary Database: GenBank
  – Derivative Databases: RefSeq
• Entrez Databases and Text Searching
• BLAST Services
• Genomic Resources

The National Center for Biotechnology Information (NCBI)
• Created as a part of NLM in 1988
  – Establish public databases
  – Perform research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)

NCBI Home Page
http://www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages

About NCBI

Some NCBI Statistics...

Growth of GenBank
[Figure: growth of GenBank, 1982 to 2002. Axes: Base Pairs of DNA (millions), 0 to 30,000; Sequences (millions), 0 to 23.]

Users per day
[Figure: users per day, 1997 to 2001, on a scale of 0 to 250,000; Christmas Day is marked.]

Molecular Databases
• Primary Databases
  – Original submissions by experimentalists
  – Database staff organize but don’t add additional information
    • Example: GenBank
• Derivative Databases
  – Human curated
    • compilation and correction of data
    • Example: SWISS-PROT, NCBI RefSeq mRNA
  – Computationally Derived
    • Example: UniGene
  – Combinations
    • Example: NCBI Genome Assembly

What is GenBank?
NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• GenBank Data
  – Direct submissions: individual records (BankIt, Sequin)
  – Batch submissions via email (EST, GSS, STS)
  – ftp accounts established for sequencing centers
• Data shared amongst three collaborating databases:
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory Database (EMBL)

The International Nucleotide Sequence Database Collaboration
[Diagram: the three databases exchange data with one another.]
• GenBank (NCBI, NIH): submissions and updates via Sequin, BankIt, and ftp; retrieval through Entrez
• DDBJ (CIB, NIG): submissions and updates; retrieval through getentry
• EMBL (EBI): submissions and updates; retrieval through SRS

GenBank: NCBI’s Primary Sequence Database
Release 133, December 2002
  Records       22,318,883
  Nucleotides   28,507,990,166
  Species       110,000+
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/
>90 Gigabytes of data

Entrez Nucleotide
[Pie chart: 23,464,770 records]
  GenBank   71%
  DDBJ      19%
  EMBL       9%
  RefSeq     1%

Primary vs. Derivative Databases
[Diagram: labs and sequencing centers submit sequences to GenBank. Curators produce RefSeq, algorithms produce UniGene, and Genome Assembly draws on both.]

Traditional GenBank Divisions
• Direct Submissions (Sequin and BankIt)
• Accurate
• Well characterized
  BCT   Bacterial and Archaeal
  INV   Invertebrate
  MAM   Mammalian (ex. ROD and PRI)
  PHG   Phage
  PLN   Plant and Fungal
  PRI   Primate
  ROD   Rodent
  SYN   Synthetic (cloning vectors)
  VRL   Viral
  VRT   Other Vertebrate

A Traditional GenBank Record
[Screenshot of a record with these parts labeled:]
• Locus Field
• Molecule Type
• Definition Line
• Accession Number
• Version
• GI (GenInfo)
• Keywords
• Taxonomy
• Modification Date
• GenBank Division

Bulk Sequence Divisions of GenBank
• Batch Submissions (email and ftp)
• Inaccurate
• Poorly Characterized
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  GSS   Genome Survey Sequence
  HTG   High Throughput Genomic
  HTC   High Throughput cDNA

Organization of GenBank
[Pie chart: 23,087,196 records]
  EST (bulk)                       67%
  GSS (bulk)                       19%
  Traditional (11 divisions)        8%
  PAT (1 patent division)           4%
  STS, HTG, HTC (bulk)              2%
The five bulk divisions are EST, GSS, STS, HTG, and HTC.

What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents

Organisms Represented in UniGene

Genome Sequencing
[Diagram labels: Whole BAC insert (or genome); shredding; cloning; isolating; sequencing; GSS division or trace archive; assembly; Draft Sequence (HTG division)]

Working Draft Sequence
[Diagram label: gaps]

HTG Division: High Throughput Genome
  Phase     Division   Accession
  phase 1   HTG        AC109609.1
  phase 2   HTG        AC109609.6
  phase 3   ROD        AC109609.10

NCBI’s Third Party Annotation (TPA) Database (NEW)
• NCBI now accepts the submission of new annotations of existing GenBank sequences
• Facilitates the annotation of genomes by experts

A Sample TPA record

RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  – draft human genome
  – mouse genome
• Chromosome records
  – Microbial
  – viral
  – organelle

The RefSeq Accession Numbers
mRNAs and Proteins
  NM_123456   Curated mRNA
  NP_123456   Curated Protein
  NR_123456   Curated non-coding RNA
  (curated records: human, mouse, rat, fruit fly, zebrafish, Arabidopsis)
  XM_123456   Predicted Transcript (human, mouse)
  XP_123456   Predicted Protein (human, mouse)
  XR_123456   Predicted non-coding RNA
Gene Records
  NG_123456   Reference Genomic Sequence (human)
Assemblies
  NT_123456   Contig (Mouse and Human)
  NW_123456   Supercontig (Mouse)
  NC_123456   Chromosome (Microbial, Viral, Arabidopsis)
  NR_123456   Interim Identifier for Microbial Chromosomes

Curated RefSeq Records: NM_, NP_

Entrez: Linking and Neighboring

The Entrez Databases

The (ever) Expanding Entrez System
[Diagram: Entrez at the center, linked to the following databases]
• Journals
• UniGene
• Books
• PubMed Central
• SNP
• PubMed
• UniSTS
• Nucleotide
• Protein
• PopSet
• ProbeSet
• Genome
• Structure
• Taxonomy
• CDD
• 3D Domains
• OMIM

Entrez Nucleotides
Query: glucose 6 phosphate dehydrogenase

Document Summaries:
glucose 6 phosphate dehydrogenase[All Fields] = 748 hits

Entrez Nucleotides: Limits
Query: glucose 6 phosphate dehydrogenase
Fields: Accession, All Fields, Author Name, EC/RN Number, Feature key, Filter, Gene Name, Issue, Journal Name, Keyword, Modification Date, Organism, Page Number, Primary Accession, Properties, Protein Name, Publication Date, SeqID String, Sequence Length, Substance Name, Text Word

Entrez Nucleotides: Preview/Index

Adding Terms: Preview/Index
[Screenshot: field menu with the same fields as listed under Limits]

Plant G6PD mRNAs

Display: Formats, Links, and Neighbors
• Formats: Summary, Brief, ASN.1, FASTA, XML, GenBank, GI list
• Links and neighbors: LinkOut, Nucleotide Neighbors, Genome Links, ProbeSet Links, OMIM Links, PopSet Links, Protein Links, PubMed Links, SNP Links, Structure Links, Taxonomy Links

>gi|603218|gb|U18238.1|MSU18238 Medicago sativa glucose-6-phosphate dehyd
CCACCAGATATAATTAAGTAGATCAGAGTAGAAGAAGATGGGAACAAATGAATGGCATGTAGAAAGAAGA
GATAGCATAGGTACTGAATCTCCTGTAGCAAGAGAGGTACTTGAAACTGGCACACTCTCTATTGTTGTGC
TTGGTGCTTCTGGTGATCTTGCCAAGAAGAAGACTTTTCCTGCACTTTTTCACTTATATAAACAGGAATT
GTTGCCACCTGATGAAGTTCACATTTTTGGCTATGCAAGGTCAAAGATCTCCGATGATGAATTGAGAAAC
AAATTGCGTAGCTATCTTGTTCCAGAGAAAGGTGCTTCTCCTAAACAGTTAGATGATGTATCAAAGTTTT
TACAATTGGTTAAATATGTAAGTGGCCCTTATGATTCTGAAGATGGATTTCGCTTGTTGGATAAAGAGAT
TTCAGAGCATGAATATTTGAAAAATAGTAAAGAGGGTTCATCTCGGAGGCTTTTCTATCTTGCACTTCCT
CCTTCAGTGTATCCATCCGTTTGCAAGATGATCAAAACTTGTTGCATGAATAAATCTGATCTTGGTGGAT
GGACACGCGTTGTTGTTGAGAAACCCTTTGGTAGGGATCTAGAATCTGCAGAAGAACTCAGTACTCAGAT
TGGAGAGTTATTTGAAGAACCACAGATTTATCGTATTGATCACTATTTAGGAAAGGAACTAGTGCAAAAC
ATGTTAGTACTTCGTTTTGCAAATCGGTTCTTCTTGCCTCTGTGGAACCACAACCACATTGACAATGTGC
AGATAGTATTTAGAGAGGATTTTGGAACTGATGGTCGTGGTGGATATTTTGACCAATATGGAATTATCCG
AGATATCATTCCAAACCATCTGTTGCAGGTTCTTTGCTTGATTGCTATGGAAAAACCCGTTTCTCTCAAG
CCTGAGCACATTCGAGATGAGAAAGTGAAGGTTCTTGAATCAGTACTCCCTATTAGAGATGATGAAGTTG
TTCTTGGACAATATGAAGGCTATACAGATGACCCAACTGTACCGGACGATTCAAACACCCCGACTTTTGC
AACTACTATTCTGCGGATACACAATGAAAGATGGGAAGGTGTTCCTTTCATTGTGAAAGCAGGGAAGGCC
CTAAATTCTAGGAAGGCAGAGATTCGGGTTCAATTCAAGGATGTTCCTGGTGACATTTTCAGGAGTAAAA
AGCAAGGGAGAAACGAGTTTGTTATCCGCCTACAACCTTCAGAAGCTATTTACATGAAGCTTACGGTCAA
GCAACCTGGACTGGAAATGTCTGCAGTTCAAAGTGAACTAGACTTGTCATATGGGCAACGATATCAAGGG
ATAACCATTCCAGAGGCTTATGAGCGTCTAATTCTCGACACAATTAGAGGTGATCAACAACATTTTGTTC
GCAGAGACGAATTAAAGGCATCATGGCAAATATTCACACCACTTTTACACAAAATTGATAGAGGGGAGTT
GAAGCCGGTTCCTTACAACCCGGGAAGTAGAGGTCCTGCAGAAGCAGATGAGTTATTAGAAAAAGCTGGA
TATGTTCAAACACCCGGTTATATATGGATTCCTCCTACCTTATAGAGTGACCAAATTTCATAATAAAACA
AGGATTAGGATTATCAGGAGCTTATAAATAAGTCTTCAATAAGCTTGTGAAATTTTCGTTATAATCTCTC
TCATTTTGGGGTGTATATCAAGCATTTAAGCGCGTGTTTGACACAGTTTGTGTAATAGATTTGGCTCTGA
ATGAAAATAAACGGGAATTGTTTCTTTTTGTTTTA

FASTA definition line
>gi|603218|gb|U18238.1|MSU18238
• gi number: 603218
• Database identifier: gb
• Accession number: U18238.1
• Locus name: MSU18238

Database identifiers
  gb    GenBank
  emb   EMBL
  dbj   DDBJ
  sp    SWISS-PROT
  pdb   Protein Databank
  pir   PIR
  prf   PRF
  ref   RefSeq

Entrez Genome
Organism Pages

The Map Viewer: a common platform for integrated display

The Map Viewer

Entrez PubMed

Online Books

Entrez Specialized Databases
• Taxonomy: Searchable taxonomic tree having nodes for all species with records in an Entrez database
• OMIM: Online Mendelian Inheritance in Man: A database of genetically linked human diseases
• ProbeSet: Expression data (GEO) and microarray datasets

Entrez Taxonomy

Entrez OMIM

Entrez ProbeSet

Trace Archive

Entrez Structure

Structure Summary
• Cn3D viewer
• Related Structures
• Conserved Domains

Cn3D: Displaying Structures

Structural Alignment
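
Worked example: reading a FASTA definition line

The FASTA definition line shown earlier (>gi|603218|gb|U18238.1|MSU18238) packs a gi number, a database identifier, an accession with its version, and a locus name into pipe-separated fields. The sketch below shows one way to split such a line in Python. It is an illustration only, not an NCBI tool: the names `parse_defline`, `Defline`, and `DB_TAGS` are hypothetical, and the code assumes exactly the five-field layout shown on the slide. Deflines from other sources or databases may be laid out differently.

```python
from dataclasses import dataclass
from typing import Optional

# Database identifier tags as listed on the slide.
DB_TAGS = {
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SWISS-PROT",
    "pdb": "Protein Databank",
    "pir": "PIR",
    "prf": "PRF",
    "ref": "RefSeq",
}


@dataclass
class Defline:
    """Hypothetical container for the parts of a definition line."""
    gi: int
    db_tag: str
    database: str
    accession: str
    version: Optional[int]
    locus: str
    description: str


def parse_defline(line: str) -> Defline:
    """Split a '>gi|number|db|accession.version|locus description' line."""
    if not line.startswith(">"):
        raise ValueError("a FASTA definition line starts with '>'")
    # Everything up to the first space is the identifier block.
    identifiers, _, description = line[1:].strip().partition(" ")
    fields = identifiers.split("|")
    # Assumption: the five-field layout shown on the slide.
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError(f"unexpected identifier layout: {identifiers!r}")
    _, gi, db_tag, accession_version, locus = fields
    # 'U18238.1' -> accession 'U18238', version 1.
    accession, _, version = accession_version.partition(".")
    return Defline(
        gi=int(gi),
        db_tag=db_tag,
        database=DB_TAGS.get(db_tag, "unknown"),
        accession=accession,
        version=int(version) if version else None,
        locus=locus,
        description=description,
    )


if __name__ == "__main__":
    example = (
        ">gi|603218|gb|U18238.1|MSU18238 "
        "Medicago sativa glucose-6-phosphate dehyd"
    )
    print(parse_defline(example))
```

Keeping the tag table separate from the parser means a tag that is not on the slide is reported as "unknown" rather than causing an error.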