Overview of Biological Databases
(Lecture for CS498-CXZ Algorithms in Bioinformatics)
Sept. 6, 2005

ChengXiang Zhai
Department of Computer Science
University of Illinois, Urbana-Champaign

Most slides are taken from the NCBI field guide at the web site http://www.ncbi.nlm.nih.gov/

The Central Dogma & Biological Data
• Original DNA sequences (genomes)
• Expressed DNA sequences (= mRNA sequences = cDNA sequences)
• Expressed Sequence Tags (ESTs)
• Protein sequences
  – Inferred
  – Direct sequencing
• Protein structures
  – Experiments
  – Models (homologues)
• Literature information

Entrez Integrates Most of Them!
[Figure: Entrez at the center, linked to Cancer Chromosomes, Gene, UniGene, UniSTS, HomoloGene, SNP, Genome, PopSet, Nucleotide, GEO, Books, MeSH, PubMed, OMIM, Taxonomy, GEO Datasets, Protein, PMC, Journals, Domains, Structure, 3D Domains]

Outline
• NCBI & Entrez
• Major Biological Databases
• Using Entrez

Some background about Entrez…

The National Center for Biotechnology Information
Bethesda, MD
Created in 1988 as a part of the National Library of Medicine at NIH
  – Establish public databases
  – Research in computational biology
  – Develop software tools for sequence analysis
  – Disseminate biomedical information
Web Access: http://www.ncbi.nlm.nih.gov

Number of Users and Hits Per Day
[Figure: number of users per day, 1997 to 2003, on a scale of 0 to 450,000; dips are marked at Christmas & New Year's Days]
Currently averaging 10,000,000 to 50,000,000 hits per day!

Major Biological Databases

Entrez: Database Integration
[Figure: links among the Entrez databases: PMC, Cancer Chromosome, OMIM, Books, PubMed Abstracts (word weight; related articles), PubChem, 3D domain, CDD, Phylogeny, Taxonomy, VAST, Gene, HomoloGene, 3-D Structure (neighbors: related structures), Genome, Genome Project, UniGene, GEO, SNP, Protein Sequences and Nucleotide Sequences (BLAST neighbors: related sequences; hard link), BLink, Domains]

Types of Databases
• Primary Databases
  – Original submissions by experimentalists
  – Content controlled by the submitter
  – Examples: GenBank, SNP, GEO
• Derivative Databases
  – Built from primary data
  – Content controlled by third party (NCBI)
  – Examples: RefSeq, RefSNP, GEO Datasets, UniGene, TPA, NCBI Protein, Structure, Conserved Domain

Primary vs. Derivative Sequence Databases
[Figure: labs and sequencing centers submit sequences (TATAGCCG, AGCTCCGATA, CCGATGACAA) to GenBank; curators, genome assembly, and algorithms produce RefSeq and UniGene]
• GenBank: updated ONLY by submitters
• RefSeq: updated continually by NCBI

Entrez Nucleotides
  Source                               Records
  Primary: GenBank / EMBL / DDBJ       57,172,944
  Derivative: RefSeq                    1,278,742
  Derivative: Third Party Annotation        4,653
  Derivative: PDB                           5,973
  Total                                58,462,312

Entrez Protein: Derivative Databases
  Source                               Records
  GenPept                               3,515,141
  RefSeq                                1,802,523
  Third Party Annotation                    4,217
  Swiss-Prot                              189,324
  PIR                                     222,232
  PRF                                      12,079
  PDB                                      68,621
  Total                                 5,814,137
  BLAST nr total                        2,726,372

Database 1: GenBank
NCBI's Primary Sequence Database

What is GenBank?
• Nucleotide-only sequence database
• Archival in nature
  – Historical
  – Reflective of submitter point of view (subjective)
  – Redundant
• GenBank Data
  – Direct submissions (traditional records)
  – Batch submissions (EST, GSS, STS)
  – ftp accounts (genome data)
• Three collaborating databases
  – GenBank
  – DNA Database of Japan (DDBJ)
  – European Molecular Biology Laboratory (EMBL) Database

International Sequence Database Collaboration
[Figure: GenBank (NCBI, NIH; searched through Entrez), EMBL (EBI; searched through SRS), and DDBJ (CIB, NIG; searched through getentry) each receive submissions and updates]

GenBank Divisions

"Organismal" divisions
  PRI  (28)   Primate
  ROD  (15)   Rodent
  PLN  (13)   Plant and Fungal
  BCT  (11)   Bacterial/Archaeal
  INV  (7)    Invertebrate
  VRT  (7)    Other Vertebrate
  VRL  (4)    Viral
  MAM  (2)    Mammalian
  PHG  (1)    Phage
  SYN  (1)    Synthetic
  UNA  (1)    Unannotated
• Organized by taxonomy (sort of)
• Direct submissions (Sequin/BankIt)
• Accurate (~1 error per 10,000 bp)
• Well characterized

"Functional" divisions
  EST  (377)  Expressed Sequence Tag
  GSS  (138)  Genome Survey Sequence
  HTG  (63)   High Throughput Genomic
  PAT  (17)   Patent
  STS  (9)    Sequence Tagged Site
  CON  (1)    Contigs, virtual
• Organized by sequence type
• Batch submissions (ftp/email)
• Inaccurate
• Poorly characterized

GenBank Functional (Bulk) Divisions
• Expressed Sequence Tag (EST)
  – 1st pass single read cDNA
• Genome Survey Sequence (GSS)
  – 1st pass single read gDNA
• High Throughput Genomic (HTG)
  – incomplete sequences of genomic clones
• Sequence Tagged Site (STS)
  – PCR-based mapping reagents
• Whole Genome Shotgun (WGS)

EST Division: Expressed Sequence Tags
[Figure: 30,000 genes in the nucleus; RNA gene products; make cDNA library (80-100,000 unique cDNA clones in library); isolate unique clones; sequence once from each end (5' and 3')]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGG
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAA
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCN
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG
>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTA
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTT
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTT
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAG
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

ESTs in Entrez
  Total       28 million records
  Human       6.0 million
  Mouse       4.3 million
  Rat         0.7 million
  Zebrafish   0.6 million
  Wheat       0.6 million
  Barley      0.3 million
  Maize       0.4 million

GSS, WGS, HTG
[Figure: whole BAC insert (or genome); shred; isolate clones; sequence (GSS division or trace archive); assembly; draft sequence (HTG division); whole genome shotgun assemblies (traditional division)]

HTG Example: Honeybee Draft Sequences
LOCUS       AC141845   147720 bp   DNA   linear   HTG 19-MAR-2004
DEFINITION  Apis mellifera clone CH224-4A2, WORKING DRAFT SEQUENCE, 14
            unordered pieces.
ACCESSION   AC141845
VERSION     AC141845.1  GI:29124029
KEYWORDS    HTG; HTGS_PHASE1; HTGS_DRAFT.
• Unfinished sequences of BACs
• Gaps and unordered pieces
• Finished sequences (Phase 3) move to traditional GenBank division

[Figure: GenBank growth by year, '83 to '06; sequence records (millions) and total base pairs (billions)]
Release 148: 45.2 million records, 49.4 billion nucleotides
Average doubling time ≈ 14 months

File Formats of the Sequence Databases
Each sequence is represented by a text record called a flat file.
• GenBank/GenPept (useful for scientists)
• FASTA (the simplest format)
• ASN.1 & XML (useful for programmers)

A Traditional GenBank Record

LOCUS       AY182241   1931 bp   mRNA   linear   PLN 04-MAY-2004
DEFINITION  Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
            complete cds.
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
KEYWORDS    .
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
REFERENCE   1  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Cloning and functional expression of an (E,E)-alpha-farnesene
            synthase cDNA from peel tissue of apple fruit
  JOURNAL   Planta 219, 84-94 (2004)
REFERENCE   2  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (18-NOV-2002) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
REFERENCE   3  (bases 1 to 1931)
  AUTHORS   Pechous,S.W. and Whitaker,B.D.
  TITLE     Direct Submission
  JOURNAL   Submitted (25-JUN-2003) PSI-Produce Quality and Safety Lab,
            USDA-ARS, 10300 Baltimore Ave. Bldg. 002, Rm. 205, Beltsville, MD
            20705, USA
  REMARK    Sequence update by submitter
COMMENT     On Jun 26, 2003 this sequence version replaced gi:27804758.
FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
                     LSLLFQPLVN"
ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
      241 agctgtctga gaagttaata gaagaagtta agatttatat atctgctgaa acaatggatt
//

The Flatfile Format
• Header
• Feature Table
• Sequence

The Header
Keywords: LOCUS, DEFINITION, ACCESSION, VERSION, KEYWORDS, SOURCE, ORGANISM, REFERENCE (AUTHORS, TITLE, JOURNAL, REMARK), COMMENT

Header: Locus Line
LOCUS       AY182241   1931 bp   mRNA   linear   PLN 04-MAY-2004
• Locus name: AY182241
• Length: 1931 bp
• Molecule type: mRNA
• Division: PLN
• Modification date: 04-MAY-2004

Header: Database Identifiers
ACCESSION   AY182241
VERSION     AY182241.2  GI:32265057
• Accession (AY182241)
  – Stable
  – Reportable
  – Universal
• Version (AY182241.2)
  – Tracks changes in sequence
• GI number (GI:32265057)
  – NCBI internal use

Header: Organism
SOURCE      Malus x domestica (cultivated apple)
  ORGANISM  Malus x domestica
            Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
            Spermatophyta; Magnoliophyta; eudicotyledons; core eudicots;
            rosids; eurosids I; Rosales; Rosaceae; Maloideae; Malus.
• NCBI-controlled taxonomy
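The fields called out on the locus-line and identifier slides can be pulled out of a record with very little code. The sketch below is illustrative only: it assumes the eight whitespace-separated tokens seen in the two LOCUS lines shown in this lecture (the official flat-file layout is column-based, and other records may differ), and the names `LocusLine`, `parse_locus_line`, and `split_version` are hypothetical, not part of any NCBI tool.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class LocusLine:
    """Fields named on the "Header: Locus Line" slide (plus topology)."""
    name: str
    length_bp: int
    molecule_type: str
    topology: str
    division: str
    modification_date: str


def parse_locus_line(line: str) -> LocusLine:
    """Split a LOCUS line on whitespace.

    Assumption: the line has exactly the eight tokens seen in the lecture's
    examples, e.g. "LOCUS AY182241 1931 bp mRNA linear PLN 04-MAY-2004".
    """
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError(f"unexpected LOCUS line layout: {line!r}")
    _, name, length, _, mol_type, topology, division, date = tokens
    return LocusLine(name, int(length), mol_type, topology, division, date)


def split_version(version: str) -> Tuple[str, int]:
    """Split "accession.version" (e.g. "AY182241.2") into its two parts."""
    accession, _, number = version.partition(".")
    if not accession or not number.isdigit():
        raise ValueError(f"not an accession.version identifier: {version!r}")
    return accession, int(number)


if __name__ == "__main__":
    locus = parse_locus_line(
        "LOCUS       AY182241   1931 bp   mRNA   linear   PLN 04-MAY-2004"
    )
    print(locus.name, locus.length_bp, locus.division)
    print(split_version("AY182241.2"))
```

Keeping the accession and the version number separate mirrors the slide's distinction: the accession stays stable, while the version number changes when the sequence does.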
The Feature Table

FEATURES             Location/Qualifiers
     source          1..1931
                     /organism="Malus x domestica"
                     /mol_type="mRNA"
                     /cultivar="'Law Rome'"
                     /db_xref="taxon:3750"
                     /tissue_type="peel"
     gene            1..1931
                     /gene="AFS1"
     CDS             54..1784
                     /gene="AFS1"
                     /note="terpene synthase"
                     /codon_start=1
                     /product="(E,E)-alpha-farnesene synthase"
                     /protein_id="AAO22848.2"
                     /db_xref="GI:32265058"
                     /translation="MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWK
                     NDFLDQSLISKYDGDEYRKLSEKLIEEVKIYISAETMDLVAKLELIDSVRKLGLANLF
                     EKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQHGYKVSQDIFGRFMDEKGTLE
                     NHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSNLSRDVVHS
                     LELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWW
                     ANLGIADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGS
                     EEELKHFTNAVDRWDSRETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLT
                     KVWADFCKALLVEAEWYNKSHIPTLEEYLRNGCISSSVSVLLVHSFFSITHEGTKEMA
                     DFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIVCYMREVNASEETARKNIK
                     GMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEKGPRTHI
• CDS 54..1784: coding sequence, from start (atg) at base 54 to stop (tag) at base 1784
• /translation: implied protein
• /protein_id and /db_xref="GI:…": GenPept identifiers

The Sequence: 99.99% Accurate

ORIGIN
        1 ttcttgtatc ccaaacatct cgagcttctt gtacaccaaa ttaggtattc actatggaat
       61 tcagagttca cttgcaagct gataatgagc agaaaatttt tcaaaaccag atgaaacccg
      121 aacctgaagc ctcttacttg attaatcaaa gacggtctgc aaattacaag ccaaatattt
      181 ggaagaacga tttcctagat caatctctta tcagcaaata cgatggagat gagtatcgga
     1741 ggacccacat cctgtcttta ctattccaac ctcttgtaaa ctagtactca tatagtttga
     1801 aataaatagc agcaaaagtt tgcggttcag ttcgtcatgg ataaattaat ctttacagtt
     1861 tgtaacgttg ttgccaaaga ttatgaataa aaagttgtag tttgtcgttt aaaaaaaaaa
     1921 aaaaaaaaaa a
//

GenPept: FASTA format

>gi|32265058|gb|AAO22848.2| (E,E)-alpha-farnesene synthase [Malus x domestica]
MEFRVHLQADNEQKIFQNQMKPEPEASYLINQRRSANYKPNIWKNDFLDQSLISKYDGDEYRKLSEKLIE
EVKIYISAETMDLVAKLELIDSVRKLGLANLFEKEIKEALDSIAAIESDNLGTRDDLYGTALHFKILRQH
GYKVSQDIFGRFMDEKGTLENHHFAHLKGMLELFEASNLGFEGEDILDEAKASLTLALRDSGHICYPDSN
LSRDVVHSLELPSHRRVQWFDVKWQINAYEKDICRVNATLLELAKLNFNVVQAQLQKNLREASRWWANLG
IADNLKFARDRLVECFACAVGVAFEPEHSSFRICLTKVINLVLIIDDVYDIYGSEEELKHFTNAVDRWDS
RETEQLPECMKMCFQVLYNTTCEIAREIEEENGWNQVLPQLTKVWADFCKALLVEAEWYNKSHIPTLEEY
LRNGCISSSVSVLLVHSFFSITHEGTKEMADFLHKNEDLLYNISLIVRLNNDLGTSAAEQERGDSPSSIV
CYMREVNASEETARKNIKGMIDNAWKKVNGKCFTTNQVPFLSSFMNNATNMARVAHSLYKDGDGFGDQEK
GPRTHILSLLFQPLVN
>gi|32265070|gb|AAP75563.1| putative doublecortin domain-containing protein
MAKTGAEDHREALSQSSLSLLTEAMEVLQQSSPEGTLDGNTVNPIYKYILNDLPREFMSSQAKAVIKTTD
DYLQSQFGPNRLVHSAAVSEGSGLQDCSTHQTASDHSHDEISDLDSYKSNSKNNSCSISASKRNRPVSAP
VGQLRVAEFSSLKFQSARNWQKLSQRHKLQPRVIKVTAYKNGSRTVFARVTAPTITLLLEECTEKLNLNM
AARRVFLADGKEALEPEDIPHEADVYVSTGEPFLNPFKKIKDHLLLIKKVTWTMNGLMLPTDIKRRKTKP
VLSIRMKKLTERTSVRILFFKNGMGQDGHEITVGKETMKKVLDTCTIRMNLNLPARYFYDLYGRKIEDIS
KGKH

Abstract Syntax Notation: ASN.1

Seq-entry ::= set {
  class nuc-prot ,
  descr {
    title "Malus x domestica (E,E)-alpha-farnesene synthase (AFS1) mRNA,
 complete cds." ,
    source {
      org {
        taxname "Malus x domestica" ,
        common "cultivated apple" ,
        db {
          {
            db "taxon" ,
            tag id 3750 } } ,
        orgname {
          name binomial {
            genus "Malus" ,
            species "x domestica" } ,
          mod {
            {
              subtype cultivar ,

[Figure: ASN.1 at the center, with GenBank, GenPept, FASTA Nucleotide, and FASTA Protein views]

Database 2: RefSeq
NCBI's Derivative Sequence Database

What is RefSeq?
• Curated transcripts and proteins (NM_, NP_)
  – reviewed
  – human, mouse, rat, fruit fly, zebrafish, arabidopsis
  – microbial genomes (proteins), and more
• Model transcripts and proteins (XM_, XP_)
• Assembled genomic regions (contigs) (NT_, NW_)
  – human genome
  – mouse genome
  – rat genome
• Chromosome records (NC_)
  – human genome
  – microbial
  – organelle
Entrez query: srcdb_refseq[Properties]
FTP: ftp://ftp.ncbi.nih.gov/refseq/release/

RefSeq Benefits
• non-redundancy
• explicitly linked nucleotide and protein sequences
• updates to reflect current sequence data and biology
• data validation
• format consistency
• distinct accession series
• stewardship by NCBI staff and collaborators

RefSeq Curation Processes
• Curated genomic DNA (NC, NT, NW)
• Scanning....
• Model: mRNA (XM), protein (XP), non-coding RNA (XR)
• Curated: mRNA (NM), non-coding RNA (NR), protein (NP)

RefSeq Accession Numbers

mRNAs and Proteins
  NM_123456   Curated mRNA
  NP_123456   Curated Protein
  NR_123456   Curated non-coding RNA
  XM_123456   Predicted mRNA
  XP_123456   Predicted Protein
  XR_123456   Predicted non-coding RNA
Gene Records
  NG_123456   Reference Genomic Sequence
Chromosome
  NC_123455   Microbial replicons, organelle, viral genomes, human chromosomes
Assemblies
  NT_123456   Contig
  NW_123456   WGS Supercontig

From GenBank to RefSeq

NM_000121: Sequence Revision History

Database 3: UniGene
NCBI's Derivative EST Database

UniGene: Clustering Expressed Sequences
• Records are clusters of mRNAs and ESTs that ideally represent single genes
• Records are created automatically by a modified BLAST algorithm
• UniGene provides a means to identify an EST or unannotated mRNA

UniGene: Gene-oriented clusters of expressed sequences
• Automatic clustering using MegaBlast
• Each cluster represents a unique gene
• Informed by genome hits
• Information on tissue types and map locations
• Useful for gene discovery and selection of mapping reagents

A Cluster of ESTs
[Figure: a query sequence with its 5' EST hits and 3' EST hits]

UniGene Collections

Example UniGene Cluster

Histogram of cluster sizes for UniGene Hs Build 177 (now at Build #186)

UniGene Cluster Hs.95351: selected protein similarities

UniGene Cluster Hs.95351: gene expression

UniGene Cluster Hs.95351: expression

UniGene Cluster Hs.95351: seqs
• Download sequences web page
• ftp://ftp.ncbi.nih.gov/repository/UniGene/Homo_sapiens/

Database 4: MMDB
NCBI's derivative protein structure database

Indexing into MMDB

MMDB: Molecular Modeling Data Base
Structure
• Import only experimentally determined structures
• Convert to ASN.1
• Create "backbone" model (Cα, P only)
• Verify sequences
• Create single-conformer model
• Add secondary structure

      id 1 ,
      name "helix 1" ,
      type helix ,
      location subgraph residues interval {
        {
          molecule-id 1 ,
          from 49 ,
          to 61 } } } ,

• Add chemical bonds

    inter-residue-bonds {
      {
        atom-id-1 {
          molecule-id 1 ,
          residue-id 1 ,
          atom-id 1 } ,
        atom-id-2 {
          molecule-id 1 ,
          residue-id 2 ,
          atom-id 9 } } ,

Structure Summary
[Figure: structure summary page with links to the Cn3D viewer, Structure Neighbors, Conserved Domains, and 3D Domain Neighbors]

Cn3D 4.1: C-Src

Cn3D 4.1: Structural Alignment
[Figure: conserved ATP binding site; Src kinase (H. sapiens) aligned with casein kinase (S. pombe)]

Cn3D: Simple Homology Modeling
[Figure: human and swordtail sequences; NCBI CD: Tyrosine Kinase; using Cn3D to model domains]

Submitting a PDB File to VAST
• Choose the file format
• Remove all lines except ATOM
This is the best way to convert PDB files to MMDB format for viewing with Cn3D!

Database 5: GEO
NCBI's Gene Expression Omnibus

Submitted by manufacturer*
• GPL: platform descriptions
Submitted by experimentalists
• GSM (GEO SaMple): raw/processed spot intensities from a single slide/chip; experimental conditions
• GSE (GEO SEries): grouping of slide/chip data, "a single experiment"; set of related samples
  (searched through Entrez GEO)
Curated by NCBI
• GDS: grouping of experiments
  (searched through Entrez GEO Datasets)

What's a DataSet?
Supplied by submitter
  Platform (GPL)   array definition
  Sample   (GSM)   hyb. measurements
  Series   (GSE)   related Samples
Assembled by GEO staff
  DataSet  (GDS)
• A collection of experimentally related samples processed using the same platform.
• Samples within DataSets are organized into subgroups based on experimental variables.
• DataSets form the basis of GEO's query, analysis and data display tools.
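The four GEO record types above can be told apart by their accession prefix alone. The following is a minimal sketch of that lookup, assuming accessions consist of one of the four prefixes followed by digits; the function name `classify_geo_accession`, the table `GEO_PREFIXES`, and the example accession numbers are hypothetical and chosen only for illustration.

```python
import re
from typing import NamedTuple


class GeoRecordType(NamedTuple):
    name: str          # record type as named on the slide
    supplied_by: str   # who provides the record, per the slide


# Prefix table taken from the "What's a DataSet?" slide.
GEO_PREFIXES = {
    "GPL": GeoRecordType("Platform", "submitter"),
    "GSM": GeoRecordType("Sample", "submitter"),
    "GSE": GeoRecordType("Series", "submitter"),
    "GDS": GeoRecordType("DataSet", "GEO staff"),
}

# Assumption: an accession is a known prefix followed only by digits.
_ACCESSION = re.compile(r"(GPL|GSM|GSE|GDS)(\d+)")


def classify_geo_accession(accession: str) -> GeoRecordType:
    """Return the record type for a GEO accession such as "GDS999"."""
    match = _ACCESSION.fullmatch(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a recognized GEO accession: {accession!r}")
    return GEO_PREFIXES[match.group(1)]


if __name__ == "__main__":
    # Made-up accession numbers, used only to exercise the lookup.
    for acc in ("GPL96", "GSM1234", "GSE5678", "GDS999"):
        record = classify_geo_accession(acc)
        print(acc, record.name, record.supplied_by)
```

A lookup like this is useful when a script receives mixed accession lists and has to decide whether a record is submitter-supplied or a curated DataSet before querying it.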
Gene Expression Omnibus: Dataset browser

GEO Dataset Browser

GEO Dataset Report

GEO Profiles

Database 6: CDD
NCBI's Derivative Conserved Domain Database

Entrez CDD: Conserved Domain Database
• Multiple sequence alignments
• Position-specific scoring matrices (PSSM)
• Sources: SMART, PFAM, COGs, KOGs, and NCBI curated domains (structure-informed alignments)

CDD

>gi|45549418|gb|AAS67634.1| ATP7A [Solenodon paradoxus]
IVYQPHLITVEEIKKQIKAVGFPAFIKKQPKYLKLGAIDIERLKNIPVKSSEGSQQMSPS
STNDSKVTLTIDGMHCNSCVSNIESALSTLHYVSSIVVSLQNKSAIIKYNANSVTPEIL
KKAIEAISPGQYRVSITSEVESTSNSPSSSSQKAPLNVVSQPLTQVTVININGMTCNS
CVQSIEGVMSKKAGVKSIQVSLANRNGTVEYDPLLTSPEILRE

CDD
[Figure: search results with colored bars for CD, Pfam, and COG hits]
Click on a colored bar to align your sequence to the CD.

Conserved Domain Database: cd00371.1, HMA

CDD
CDART: Conserved Domain Architecture Retrieval Tool

Database 7: NCBI Genome Map
Viewing Complex Genomes

NCBI Map Viewer
• Map Viewer Home Page
  – Shows all supported organisms
  – Provides links to genomic BLAST
• Genome Overview Page
  – Provides links to individual chromosomes
  – Shows hits on a genome graphically
• Chromosome Viewing Page
  – Allows interactive views of annotation details
  – Provides numerous maps unique to each genome

The Map Viewer

Genome BLAST

Map Viewer: Human MLH1
[Figure: customizable view showing EST hits, transcripts, models, NCBI assembly, gene annotations, mapped variations, and the Maps and Options control]

MLH1 Synteny: Mammalian Genomes

Many Other NCBI Databases…

Other Specialized Databases
• Gene Symbol Database (HUGO Gene Nomenclature)
• KEGG (Kyoto Encyclopedia of Genes and Genomes) Pathway
• EPD (Eukaryotic Promoter Database)
• Transcription Factor Database (TRANSFAC)
• Many organism-specific databases (e.g., FlyBase, BeeBase)
• …

Access Databases through Entrez

Accessing the Data in Entrez
• Web Tools
  – Batch Entrez: upload a file of GI or accession numbers to retrieve sequences
  – Batch Citation Matcher: send citation information to Entrez and retrieve PubMed IDs for linking, citation display or other applications
  – Advanced Entrez Searching: advanced searching techniques for Web Entrez
  – My NCBI: includes automatic e-mailing of search updates and filters for search results; requires a username and password to access stored searches
• Programming Tools
  – E-Utilities: run Entrez queries and download data from your own scripts over the Web
  – Linking to Entrez: link to specific Entrez pages from your own web pages or applications
  – Entrez Client/Server: C language library for embedding Entrez calls into your programs

Entrez: Web Access
• Default search: against all databases in Entrez
• Interface: Global Entrez
• Target database: adjustable using the pull-down menu

NCBI Toolbox

Toolbox Sources
  ftp> open ftp.ncbi.nih.gov
  .
  .
  ftp> cd toolbox
  ftp> cd ncbi_tools
  ftp://ftp.ncbi.nih.gov/toolbox/ncbi_tools

  /************************************************************************
  *
  * asn2ff.c
  * convert an ASN.1 entry to flat file format, using the FFPrintArray.
  *
  **************************************************************************/
  #include <accentr.h>
  #include "asn2ff.h"
  #include "asn2ffp.h"
  #include "ffprint.h"
  #include <subutil.h>
  #include <objall.h>
  #include <objcode.h>
  #include <lsqfetch.h>
  #include <explore.h>

  #ifdef ENABLE_ID1
  #include <accid1.h>
  #endif

  FILE *fpl;

  Args myargs[] = {
    {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL},
    {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL},
    {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL},
    {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL},
    {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL},

Challenges in Bioinformatics
[Figure: the Entrez database integration diagram, shown behind the question below]
How can we help biologists manage and exploit all of this rapidly growing, heterogeneous, and inaccurate information both efficiently and effectively?
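As a concrete illustration of the E-Utilities entry under Programming Tools, the sketch below builds the request URLs a script would send. The lecture itself gives no endpoints; the base URL, the `esearch.fcgi` and `efetch.fcgi` program names, and the `db`, `term`, `id`, `rettype`, `retmode`, and `retmax` parameters are taken from NCBI's public E-utilities interface as it exists today and should be checked against the current documentation. The helper names `esearch_url`, `efetch_url`, and `fetch_text` are hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the present-day public E-utilities base URL (not from the slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL: run an Entrez query and get back matching IDs."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}esearch.fcgi?{query}"


def efetch_url(db: str, record_id: str, rettype: str = "gb") -> str:
    """Build an EFetch URL: download one record as plain text.

    rettype "gb" requests the GenBank flat file and "fasta" the FASTA format.
    """
    query = urlencode(
        {"db": db, "id": record_id, "rettype": rettype, "retmode": "text"}
    )
    return f"{EUTILS_BASE}efetch.fcgi?{query}"


def fetch_text(url: str, timeout: float = 30.0) -> str:
    """Perform the HTTP request (needs network access; not called below)."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # The apple AFS1 mRNA record used as the running example in this lecture.
    print(efetch_url("nuccore", "AY182241.2", rettype="gb"))
    # The RefSeq filter shown on the "What is RefSeq?" slide.
    print(esearch_url("nuccore", "srcdb_refseq[Properties]", retmax=5))
```

Separating URL construction from the network call keeps the query logic easy to inspect without touching NCBI's servers; a real script should also respect NCBI's request-rate guidelines before calling `fetch_text` in a loop.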