NCBI FieldGuide

NCBI Molecular Biology Resources
A Field Guide, Part 1
February 14, 2006
University of Tennessee, Memphis - Health Sciences Center

The National Center for Biotechnology Information
Bethesda
Created in 1988 as a part of the National Library of Medicine at NIH
– Establish public databases
– Research in computational biology
– Develop software tools for sequence analysis
– Disseminate biomedical information

NCBI Web Traffic
[Chart: users per day, 1998 through 2005, axis from 100,000 to 600,000; annotation: Christmas and New Year's Day]
[Pie chart labels: World Internet Users; US Internet Users; U.S. (.com, .net, .org, .gov, .us); 40%; Japan 6%; Italy 4%; Canada 3%; Germany 3%; United Kingdom 3%; Netherlands 2%; Spain 2%; Brazil 2%; Sweden 1%; Switzerland 1%; Belgium 1%; Other 14%]

Literature Databases

A part of the NCBI Bookshelf
Part 1. The Databases
Part 2. Data Flow and Processing
Part 3. Querying and Linking the Data
Part 4. User Support

The (ever expanding) Entrez System
[Diagram of linked databases: NLM Catalog; PubChem (Compounds, BioAssays, Substances); PubMed; OMIM; PubMed Central; Journals; 3D Domains; Books; Structure; Taxonomy; CDD/CDART; Protein; Genome; UniSTS; HomoloGene; SNP; UniGene; Gene; GEO/GDS; GenSat; PopSet; Nucleotide; GenomeProjects; Cancer Chromosomes]

What is Entrez?
● A system of 29 linked databases
● A tool for finding biologically linked data
● A text search and retrieval engine
● A virtual workspace for manipulating large datasets

● Each record is assigned a UID.
  – A "unique integer identifier" for internal tracking
● All Molecular Database entries are organized by organism (Taxonomy Database).
● Each record is indexed by data fields.
  – [author], [title], [organism], and many others
● Each record is given a Document Summary.
  – A summary of the record's content (DocSum)
● Each record is assigned links to biologically related UIDs.
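The record organization listed above (a UID, indexed data fields, a DocSum, and links to related UIDs) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the `Record` class, the helper names `field_term`, `build_query` and `search`, and the two sample records are hypothetical and are not NCBI's schema or software. Only the bracketed field tags such as [organism] and [title] come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Record:
    """Toy stand-in for an Entrez record (hypothetical, not NCBI's schema)."""
    uid: int                                        # unique integer identifier
    fields: Dict[str, str]                          # indexed data fields
    docsum: str                                     # Document Summary (DocSum)
    links: List[int] = field(default_factory=list)  # UIDs of related records


def field_term(value: str, field_name: str) -> str:
    """Format one field-qualified term, e.g. zebrafish[organism]."""
    return f"{value}[{field_name}]"


def build_query(terms, operator: str = "AND") -> str:
    """Join (value, field name) pairs into one field-qualified query string."""
    return f" {operator} ".join(field_term(v, f) for v, f in terms)


def search(records, value: str, field_name: str) -> List[int]:
    """Return the UIDs of records whose indexed field contains the value."""
    needle = value.lower()
    return [r.uid for r in records
            if needle in r.fields.get(field_name, "").lower()]


# Made-up records for illustration only; the UIDs are not real identifiers.
records = [
    Record(1, {"organism": "Danio rerio", "title": "zebrafish kinase"},
           "Zebrafish kinase mRNA", links=[2]),
    Record(2, {"organism": "Homo sapiens", "title": "human kinase"},
           "Human kinase mRNA", links=[1]),
]

print(build_query([("zebrafish", "organism"), ("kinase", "title")]))
print(search(records, "danio", "organism"))
```

The point of the sketch is the separation of concerns: the UID identifies a record, the fields make it searchable, the DocSum is what a result list shows, and the links lead to related records.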
Entrez Databases

Examples of Database Integration at NCBI
[Diagram labels: Word weight; PubMed; Phylogeny; Taxonomy; 3-D Structure; MMDB (3D structure); VAST; Genomes; Nucleotide sequences; BLASTn; Protein sequences; BLASTp]

Following Links
Links: Follow links to related data in the same database or in others!
Hard Links: Curated links based on biology, for example:
• nucleotide → taxonomy (based on organism identifier)
• protein → domain relatives (based on domain assignment)
• domains → pubmed (based on supporting literature)
Soft Links: Pre-computed analyses, for example:
• nucleotide → related sequences (BLAST neighbors)
• protein → conserved domains (CDD/RPS-BLAST search)
• gene → map viewer (map position of annotated gene)

[Slide: zebrafish]

Types of Molecular Databases
• Primary Databases
  – Raw and redundant data, submitted, "owned" and updated by experimentalists
    • Examples: GenBank, SNP, GEO, PubChem Substance & BioAssay
• Derivative Databases
  – Human-curated (compilation and curation of data)
    • Examples: GEO Datasets, Structure & Literature databases
  – Computationally derived
    • Examples: UniGene, HomoloGene, PubChem Compound
  – Combination
    • Examples: RefSeq, Gene, Genome Assembly, Conserved Domain and Structure databases

Primary vs. Derivative Sequence Databases
[Diagram labels: Labs; Sequencing Centers; Curators; Algorithms; sequence fragments TATAGCCG, AGCTCCGATA, CCGATGACAA]
• GenBank: updated ONLY by submitters
• RefSeq, Genome Assembly, UniGene: updated continually by NCBI

1º Sequence Database
GenBank
• Nucleotide only sequence database
• Archival in nature
• Each record is assigned a stable accession number
• Submission of GenBank Data to NCBI
  – Direct submissions of individual records via Web (BankIt, Sequin)
  – Batch submissions of bulk sequences via Email (EST, GSS, STS)
  – FTP accounts for Sequencing Centers
• Three collaborating databases and other sources of data

The International Sequence Database Collaboration
• GenBank (NCBI, NIH): submissions, updates; retrieval through Entrez
• DDBJ (CIB, NIG): submissions, updates; retrieval through getentry
• EMBL (EBI): submissions, updates; retrieval through SRS

GenBank
Release 151, December 2005
    52,016,762        Records
    56,037,734,462    Nucleotides
    >140,000          Species
216 Gigabytes, 890 files
• full release every two months
• incremental and cumulative updates daily
• available only through internet
ftp://ftp.ncbi.nih.gov/genbank/

GenBank Divisions
Traditional
    PRI  (29)   Primate
    ROD  (23)   Rodent
    PLN  (17)   Plant and Fungal
    BCT  (13)   Bacterial/Archaeal
    VRT  (10)   Other Vertebrate
    INV  (9)    Invertebrate
    VRL  (5)    Viral
    MAM  (2)    Mammalian
    PHG  (1)    Phage
    SYN  (1)    Synthetic
    UNA  (1)    Unannotated
• Direct submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized
• Organized by taxonomy
Bulk
    EST  (464)  Expressed Sequence Tag
    GSS  (164)  Genome Survey Sequence
    HTG  (69)   High Throughput Genomic
    PAT  (19)   Patent sequences
    STS  (14)   Sequence Tagged Site
    HTC  (10)   High Throughput cDNA
    ENV  (3)    Environmental Samples
• From sequencing projects
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized
• Organized by sequence type

Derivative Sequence Database
• The curated "best representative" sequences
• Standardized nomenclature and record structure
• Added annotation (references, sequence features)

RELEASE 15 IS NOW AVAILABLE ON THE FTP SITE!

RefSeq Curation Processes
Curated genomic DNA (NC, NT, NW)
Scanning....
Curated
Model mRNA (XM)
Model protein (XP)
(XR)
Curated mRNA (NM)
(NR)
Protein (NP)

Curated RefSeq Records

RefSeq Nucleotide
    LOCUS       ADSS   1368 bp   mRNA   linear   PRI 27-AUG-2002
    DEFINITION  Homo sapiens adenylosuccinate synthase (ADSS), mRNA.
    ACCESSION   NM_001126
    VERSION     NM_001126.1  GI:4557270

RefSeq Protein
    LOCUS       ADSS   455 aa   linear   PRI 27-AUG-2002
    DEFINITION  adenylosuccinate synthase; Adenylosuccinate synthetase
                (Ade(-)H-complementing) Homo sapiens.
    ACCESSION   NP_001117
    VERSION     NP_001117.1  GI:4557271
    DBSOURCE    REFSEQ: accession NM_001126.1
    COMMENT     REVIEWED REFSEQ: This record has been curated by NCBI staff.
                The reference sequence was derived from X66503.1.
                Summary: Adenylosuccinate synthetase catalyzes the first
                committed step in the conversion of IMP to AMP.

X records: Genome Annotation & Inferred or Predicted
vs.
N records: Provisional, Reviewed or Validated

Third Party Annotation (TPA) Database
NCBI now accepts the submission of new annotations of existing GenBank sequences.
• Submissions must be published in a peer-reviewed journal.
• Facilitates the annotation of sequences by experts.
Examples of sequences appropriate for TPA are:
– Annotation of features on gene and/or mRNA sequences
– Assembled "full length" genes and/or mRNAs
What should not be submitted to TPA?
– Synthetic constructs (such as cloning vectors) that use well-characterized, publicly available genes, promoters, or terminators
– Updates or changes to existing sequence data
– Sequence annotations without experimental evidence

• "Best representative" (reference) sequences
• Standardized nomenclature and record structure
• Added annotation (references, sequence features)
Mapping Genome Data on an Assembly:
    Genome Sequence (RefSeq: NC, NT, NW)
    Transcript regions & ORFs (RefSeq: NM/NP, XM/XP)
    Markers (STS)
    Polymorphisms (SNP)
    ESTs/Exons (UniGene)

Complete Genomes (as of January 2006)
• Organelles:
  – Mitochondria (806)
  – Plastids (50)
  – Plasmids (850)
  – Nucleomorphs (3)
• Viruses (2260)
• Archaebacteria (25)
• Eubacteria (269)
• Eukaryotes (19 complete / 83 assemblies)

New! Genome Projects

Simple Genomes
• Full chromosomal sequences are provided
• Genes are annotated
• The annotation can be shown graphically and linked to sequence records

RefSeq Chromosomes: NC_
[Callouts on the record: Genome sequence; Annotation of Gene, CDS, and other features; mutL]

    LOCUS       NC_000913   4639221 bp   DNA   circular   BCT 30-JUL-2003
    DEFINITION  Escherichia coli K12, complete genome.
    ACCESSION   NC_000913
    VERSION     NC_000913.1  GI:16127994
    KEYWORDS    .
    SOURCE      Escherichia coli K12.
      ORGANISM  Escherichia coli K12
                Bacteria; Proteobacteria; Gammaproteobacteria;
                Enterobacteriales; Enterobacteriaceae; Escherichia.
    REFERENCE   1 (bases 1 to 4639221)
      AUTHORS   Blattner,F.R., Plunkett,G. III, Bloch,C.A., Perna,N.T.,
                Burland,V., Riley,M., Collado-Vides,J., Glasner,J.D.,
                Rode,C.K., Mayhew,G.F., Gregor,J., Davis,N.W.,
                Kirkpatrick,H.A., Goeden,M.A., Rose,D.J., Mau,R. and Shao,Y.
      TITLE     The complete genome sequence of Escherichia coli K12.
      JOURNAL   Science 277 (5331), 1453-1474 (1997)
      MEDLINE   97426617
      PUBMED    9278503
    REFERENCE   2 (bases 1 to 4639221)
      AUTHORS   Blattner,F.R.
      TITLE     Direct submission
      JOURNAL   Submitted (16-JAN-1997) Guy Plunkett III, Laboratory of
                Genetics, University of Wisconsin, 445 Henry Mall, Madison,
                WI 53706, USA. E-mail ecoli@genetics.wisc.edu
                Phone: 608-262-2543 Fax:
         gene   3954631..3956478
                /gene="mutL"
                /locus_tag="b4170"
                /note="synonym: mut-25"
         CDS    3954631..3956478
                /gene="mutL"
                /locus_tag="b4170"
                /function="methyl-directed mismatch repair"
                /codon_start=1
                /transl_table=11
                /product="MutL"
                /protein_id="NP_418591.1"
                /db_xref="GI:16131992"
                /translation="MPIQVLPPQLANQIAAGEVVERPASVVKELVENSLDAGATRIDI
                DIERGGAKLIRIRDNGCGIKKDELALALARHATSKIASLDDLEAIISLGFRGEALASI
                SSVSRLTLTSRTAEQQEAWQAYAEGRDMNVTVKPAAHPVGTTLEVLDLFYNTPARRKF
                LRTEKTEFNHIDEIIRRIALARFDVTINLSHNGKIVRQYRAVPEGGQKERRLGAICGT
                AFLEQALAIEWQHGDLTLRGWVADPNHTTPALAEIQYCYVNGRMMRDRLINHAIRQAC
                EDKLGADQQPAFVLYLEIDPHQVDVNVHPAKHEVRFHQSRLVHDFIYQGVLSVLQQQL
                ETPLPLDDEPQPAPRSIPENRVAAGRNHFAEPAAREPVAPRYTPAPASGSRPAAPWPN
                AQPGYQKQQGEVYRQLLQTPAPMQKLKAPEPQEPALAANSQSFGRVLTIVHSDCALLE
                RDGNISLLSLPVAERWLRQAQLTPGEAPVCAQPLLIPLRLKVSAEEKSALEKAQSALA
                ELGIDFQSDAQHVTIRAVPLPLRQQNLQILIPELIGYLAKQSVFEPGNIAQWIARNLM
                SEHAQWSMAQAITLLADVERLCPQLVKTPPGGLLQSVDLHPAIKALKDE"
    BASE COUNT  978672 a 1011074 c 997153 g 974742 t
    ORIGIN
            1 cgtcttcatt gtcagacagc agaatttgta cgcgctgttc ggcttgttgt aatttggcct
           [sequence lines 61 through 961 are shown on the slide, overlaid by the annotation]

Complex Genomes
• Sequences are provided complete or we help assemble
• Heavy annotation: genes, transcript regions & ORFs, sequence variations & markers, clones, ESTs, etc.
• The annotation can be shown graphically and linked to other databases using the MapViewer
New! A database for retrieval and analysis of karyotype data: Cancer Chromosomes

RefSeq Records (Contig: NT_ & Chromosome: NC_)
1: NT_034400. Homo sapiens chro...[gi:51458694] Links
Click here to see all features and the sequence of this contig record.
[Callouts on the record: Annotation of Gene, mRNA, CDS, and other features; Ordering of draft sequences]

    LOCUS       NT_034400   1065823 bp   DNA   linear   CON 19-AUG-2004
    DEFINITION  Homo sapiens chromosome 1 genomic contig.
    ACCESSION   NT_034400
    VERSION     NT_034400.3  GI:51458694
    KEYWORDS    .
    SOURCE      Homo sapiens
      ORGANISM  Homo sapiens
                Eukaryota; Metazoa; Chordata; Craniata; Vertebrata;
                Euteleostomi; Mammalia; Eutheria; Primates; Catarrhini;
                Hominidae; Homo.
    REFERENCE   1 (bases 1 to 1065823)
      AUTHORS   International Human Genome Sequencing Consortium.
      TITLE     The DNA sequence of Homo sapiens
      JOURNAL   Unpublished (2003)
    COMMENT     GENOME ANNOTATION REFSEQ: Features on this sequence have
                been produced for build 35 version 1 of the NCBI's genome
                annotation [see documentation].
                On Aug 19, 2004 this sequence version replaced gi:27478327.
                The DNA sequence is part of the third release of the
                finished human reference genome. It was assembled from
                individual clone sequences by the Human Genome Sequencing
                Consortium in consultation with NCBI staff.
                COMPLETENESS: not full length.
         gene   complement(2548206..2591802)
                /gene="ADSS"
                /db_xref="LocusID:159"
                /db_xref="MIM:103060"
         mRNA   complement(join(2548206..2549349,2550998..2551147,
                2555692..2555789,2557339..2557463,2558471..2558625,
                2559881..2560007,2562526..2562607,2563644..2563751,
                2564012..2564078,2572236..2572286,2576516..2576584,
                2577357..2577459,2591326..2591802))
                /gene="ADSS"
                /product="adenylosuccinate synthase"
                /note="Derived by automated computational analysis using gene
                prediction method: BLAST. Supporting evidence includes
                similarity to: 3 mRNAs"
                /transcript_id="XM_049992.8"
                /db_xref="GI:22045950"
                /db_xref="LocusID:159"
                /db_xref="MIM:103060"
    CONTIG      join(AL139152.7:1..55543,AL596177.4:1998..91084,
                AL356378.17:1999..202955,AL391904.14:2001..68222,
                AL590667.7:2001..175494,AL359207.7:2001..112707,
                AL365260.11:2001..114412,complement(AL445591.10:1..138092),
                BX537254.7:2001..121309)
    //

Higher Genome MapViews

A new database for localization of proteins in Mouse Brains: New!
GenSat
A new database for information on Expression Reagents: Probe

Gene Expression Databases
• Submit and update data
• Query the database:
  • gene identifiers
  • field information
  • sequence
• Browse datasets
• Download data

Entrez GEO
    GPL   Platform descriptions (submitted by manufacturer*)
    GSM   Raw/processed spot intensities from a single slide/chip (submitted by experimentalists)
    GSE   Grouping of slide/chip data, "a single experiment" (submitted by experimentalists)
Entrez GEO Datasets
    GDS   Grouping of experiments (curated by NCBI)

GDS177: CMV infection of HFF cells

UniGene Collections (as of January 2006)
• SEVERAL organisms
• Expression oriented

A Cluster of ESTs:
[Figure labels: Arabidopsis serine protease; query; 5' EST hits; 3' EST hits]

New! EST-based Expression Profiles

New! Probe: Expression probes
Pr196507.1 Links
Ribonucleic acid probe (riboprobe) Prnp for Mus musculus gene prion protein (Prnp). Has been used in the GENSAT project for in situ hybridization.

New! Probe: siRNAs & shRNAs
Pr186482.1, Pr001034449.1 Links
• Small interfering RNA (siRNA) probe for Mus musculus gene prion protein (Prnp). Has been used for RNA interference (RNAi).
• Small hairpin RNA (shRNA) probe V2MM_66187 for Mus musculus gene prion protein (Prnp). Developed for RNA interference (RNAi). Reagent is available from Open Biosystems.

Sequences & Structures

Linking Protein Sequence, Structure and Function
    Protein            Protein sequences
    Conserved Domain   CDD: Conserved functional domains in proteins represented by a PSSM (RPS-BLAST, CDART)
    Structure          MMDB: Experimentally-derived 3D structure records from PDB
    3D Domain          Compact structural domains of protein folds

Sequence-based Neighbors: Domain Neighbors
Conserved Domains: conserved sequence elements that perform common functions
Curation of protein multiple sequence alignments with known similar function by conversion to Position-Specific Scoring Matrices

                         10        20        30        40        50        60
                 ....*....|....*....|....*....|....*....|....*....|....*....|
    consensus  1 KWEIPREDLTLGKKLGEGAFGEVYKGTLKGkgd---nkSIDVAVKTLKEDASEeqIKEFL 57
    1FGI A     1 aWEIPRESLRLEVKLGQGCFGEVWMGTWNG--------TTRVAIKTLKPGTMS--PEAFL 311
    1BYG A     1 RWELPRDRLVLgkPLGEGAFGQVYLAEAIglgkdkpnrvTKVAVKMLKSDAtedkLSLDI 74
    gi 125135  1 GWALNMKELKLlqTIGKGEFGDVMLGDYRg---------NKVAVKCIKNDAt---AQAFL 62
    gi 125702  1 KYEIPRTDLTLkhKLGGGQYGEVYEGVWKky-------sLTVAVKTLKEDTm--eVEEFL 284
    gi 1174437 1 KWEIPRSELTIlrKLGRGNFGEVFYGKWRn--------sIDVAVKTLREGTm--sTAAFL 325

"Reverse-Position Specific" Sequence Comparisons (RPS-BLAST), a.k.a. "Conserved Domain Database" (CDD) Search

Sequence-based Neighbors: Domain Relatives
"Conserved Domain Architecture Retrieval Tool" (CDART)
Modular Architecture of Domains
• Cartoon descriptions of protein domain organization on the primary sequence
• Allows for comparison with other proteins with the same Domain

NCBI Conserved Domain Summary
CDART: Conserved Domain Architecture Retrieval Tool

Entrez Structure: Molecular Modeling Database
• Derived from experimentally determined PDB records
• Data is added to PDB records including:
  – Addition of explicit chemical bonding information
  – Validation and indexing of sequence
  – Inclusion of Taxonomy, Citation, and other information
  – Conversion to ASN.1 data description language
• Searching the Structure Databases:
  • Keyword search by Entrez
  • Sequence search by BLAST or BLink
  • Domain search by CDD/RPS-BLAST
  • Structure search by VAST

Structure Summary Page
[Callouts: Structure-based Neighbors; link to get the Cn3D viewer; Sequence-based Neighbors: Conserved Domains (CDD/RPS-BLAST); Structure-based Domains: (3D Domains)]

New! PubChem: Compound, Substance, BioAssay
Entrez PubChem
    PC Compound    Derived database of known chemicals from PC Substance records
    PC Substance   Primary database of chemical samples
    PC BioAssay    Primary database of bioactivity screens of samples in PC Substance
[Slide: zidovudine]

The Gene Summary Database
Summary pages of curated information about genetic loci for organisms in the RefSeq project.
►Graphics
►Gene information
►Bibliography (PubMed links)
►General gene information
►NCBI Reference Sequences
►Related sequences
►Additional Links

GENE SYMBOL & name [Organism]
glucose-6-phosphate dehydrogenase
    H.sapiens & B.taurus    G6PD
    R.norvegicus            G6pdx
    M.musculus              G6pd1, G6pdx
    D.melanogaster          Zw
    A.thaliana              At5g35970
    S.pombe                 SPAC3C7.13c, SPAC9.01, SPCC794.01c
    B.anthracis             BA_3932
    H.pylori                HP1101
    E.coli, Salmonella, Shigella, Yersinia, Neisseria….    zwf

Default Display
G6PD glucose-6-phosphate dehydrogenase [Homo sapiens]
►Graphics: Transcripts and products; Genomic context
►Gene information: Gene type; Gene name; Gene description; RefSeq status; Organism; Lineage; Gene Aliases; Summary; General protein information
►Bibliography (PubMed links): GeneRifs
►General gene information: Gene Ontology; Homology (Mouse, Rat, Human); Phenotypes; Sequence Tagged Sites; Pathways
►NCBI Reference Sequences: mRNA sequence; Source sequence; Product; Conserved Domains
►Related sequences (genomic, mRNA & protein)
►Additional Links

Gene Default Table Display
G6PD glucose-6-phosphate dehydrogenase [Homo sapiens]

SNP: GeneView

FTP Downloads

Help for Programmers
NCBI Toolbox: In-house source code useful for incorporating NCBI-like functionality into your own programs. Three main parts: Data Model, Data Encoding and Programming Libraries.
• Examples: BLAST, Cn3D, Sequin, data format conversion scripts
http://www.ncbi.nlm.nih.gov/IEB/ToolBox/index.cgi
E-Utilities: Guidelines for Entrez "URL calls" used to access data. Designed for use in scripts.
• Examples: ESearch, EPost, ESummary, EFetch and ELink
http://www.ncbi.nih.gov/entrez/query/static/eutils_help.html
Caution: Overuse may result in blocked IPs!

To come in Part 2:
• Searching Records with Entrez
• Searching Sequences with BLAST
• Searching Structures with VAST
• An Integrated Example

Intermission
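
For readers who want to try the Entrez "URL calls" named under Help for Programmers, the sketch below shows how ESearch and EFetch request URLs might be assembled in a script. It only builds the URLs and sends nothing over the network. The base URL and the helper names `eutils_url`, `esearch_url` and `efetch_url` are assumptions that do not appear in the slides; the parameters (`db`, `term`, `retmax`, `id`, `rettype`, `retmode`) follow the E-utilities documentation as I understand it and should be checked against the current help pages before use.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumption: present-day E-utilities base URL (the slides only give a help page).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def eutils_url(tool: str, **params) -> str:
    """Build the URL for one E-utility, e.g. tool='esearch'."""
    return f"{EUTILS_BASE}{tool}.fcgi?{urlencode(params)}"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """ESearch: text query against one Entrez database, returns UIDs."""
    return eutils_url("esearch", db=db, term=term, retmax=retmax)


def efetch_url(db: str, ids, rettype: str = "gb", retmode: str = "text") -> str:
    """EFetch: retrieve full records for a list of identifiers."""
    return eutils_url("efetch", db=db, id=",".join(ids),
                      rettype=rettype, retmode=retmode)


# Field tags [title] and [organism] are the ones named in the Entrez slides;
# NM_001126 is the RefSeq accession shown in the Curated RefSeq Records slide.
search_url = esearch_url(
    "nucleotide",
    "glucose-6-phosphate dehydrogenase[title] AND Homo sapiens[organism]",
    retmax=5,
)
fetch_url = efetch_url("nucleotide", ["NM_001126"])

print(search_url)
print(fetch_url)
```

Given the caution about blocked IPs, a real script would also space out its requests, for example by sleeping between calls, and would follow whatever usage limits NCBI currently publishes.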