NCBI
• Created as a part of NLM in 1988
  - Establish public databases
  - Research in computational biology
  - Develop software tools for sequence analysis
  - Disseminate biomedical information
• Tools: BLAST (1990), Entrez (1992)
• GenBank (1992)
• Free MEDLINE (PubMed, 1997)
• Human genome (2001)

NCBI Home Page
www.ncbi.nlm.nih.gov
To learn more, visit the “Site Map” and “About NCBI” web pages

Entrez: An Integrated Database Search and Retrieval System

The (ever) Expanding Entrez System
[Figure: Entrez at the center, surrounded by its databases: UniGene, PubMed, Nucleotide, Protein, Journals, Structure, CDD, Genome, SNP, PopSet, OMIM, 3D Domains, Taxonomy, UniSTS, ProbeSet, Books]

Literature Databases
• PubMed
• Books
• PubMed Central
• Journals
• Online Mendelian Inheritance in Man (OMIM)

Molecular Sequence Databases
Sequence Databases
• Nucleotide (GenBank)
• Taxonomy
• PopSet
• Protein
Marker Databases
• Single Nucleotide Polymorphisms (SNPs, dbSNP)
• Sequence Tagged Sites (STSs, dbSTS)
• Expressed Sequence Tags (ESTs, dbEST)
• UniGene

Molecular Databases
Primary Databases
• Original submissions by experimentalists
• Database staff organize but don’t add additional information
• Example: GenBank
Derivative Databases
• Human curated: compilation and correction of data
  - Example: SWISS-PROT, NCBI RefSeq mRNA
• Computationally derived
  - Example: UniGene
• Combinations
  - Example: NCBI Genome Assembly

[Figure: diagram with the labels Labs, GenBank, Curators, RefSeq, Algorithms, UniGene, Genome Assembly and example sequence fragments]

The International Nucleotide Sequence Database Collaboration
• GenBank: NIH, NCBI (Entrez)
• DDBJ: NIG, CIB (Get Entry)
• EMBL: EMBL, EBI (SRS)

Entrez Nucleotide
• GenBank 71%
• DDBJ 19%
• EMBL 9%
• RefSeq 1%
• PDB 0.01%

What is GenBank?
NCBI’s Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature

GenBank Data
• Direct submissions: individual records (BankIt, Sequin)
• Batch submissions via email (EST, GSS, STS)
• ftp accounts established for sequencing centers
• Data shared amongst three collaborating databases:
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory Database (EMBL)

The Old Way
[Image] From Fran Lewitter, Whitehead Institute

GenBank: NCBI’s Primary Sequence Database
Release 136, June 2003
• 25,592,865 records (18,197,119 in June 2002)
• 32,528,249,295 nucleotides (22,616,937,182 in June 2002)
• 110,000+ species
• 121 gigabytes of data
• Full release every two months
• Incremental and cumulative updates daily
• Available only through the internet:
  ftp://ftp.ncbi.nih.gov/genbank/
  ftp://genbank.sdsc.edu/pub
  ftp://bio-mirror.net/biomirror/genbank/

GenBank Divisions
Traditional Divisions
  BCT   Bacterial/Archaeal
  INV   Invertebrate
  MAM   Mammalian (excl. ROD/PRI)
  PHG   Phage
  PLN   Plant/Fungal
  PRI   Primate
  ROD   Rodent
  SYN   Synthetic (cloning vectors)
  VRL   Viral
  VRT   Other Vertebrate
Bulk Sequence Divisions
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  GSS   Genome Survey Sequence
  HTGS  High Throughput Genomic Sequence
  HTC   High Throughput cDNA

A Traditional GenBank Record
[Figure: annotated GenBank record with the following callouts]
• Locus Field
• Molecule Type
• Definition Line
• GI (GenInfo)
• Keywords
• Taxonomy
• Submission Field
• Modification Date
• GenBank Division
• Feature Table
• GenPept Record
• Genomic DNA Sequence

Bulk Sequence Divisions
• Batch submission, e-mail, or ftp
• Inaccurate
• Poorly characterized
  EST   Expressed Sequence Tag
  STS   Sequence Tagged Site
  HTGS  High Throughput Genomic Sequence

EST Division: Expressed Sequence Tags
[Figure: how ESTs are made]
• Nucleus: 30,000 genes
• RNA gene products
• Make cDNA library: 80-100,000 unique cDNA clones in library
  - isolate unique clones
  - sequence once from each end (5’ and 3’)

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTT
TCTGGCCTGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAG
AATGGAAAGTCAAATTTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAG
TTGACTTACTGAAGAATGGAGAGAGAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAG
CAAGGACTGGTCTTTCTATCTCTTGTACTACACTGAATTCACCCCCACTGAAAAAGATGAGT
ATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCCAAGTTNAGTTTAAGTGGGNA
TCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTTTTGGATTGGGA
TGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACC
ATGCCTTACTTTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTC
ATTCATTATAACAAATTTCCAATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATT
CTAAGCAGAGTATGTAAATTGGAAGTTAACTTATGCACGCTTAACTATCTTAACAAGCTTT
GAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGATGTTGATGTTGGATAAGAGAATT
CTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

What is UniGene?
A gene-oriented view of sequence entries
• MegaBlast-based automated sequence clustering
• Nonredundant set of gene-oriented clusters
• Each cluster represents a unique gene
• Provides information on tissue-specific expression and map locations
• Includes well-characterized genes and novel ESTs
• Useful for gene discovery and selection of mapping reagents

EST hits to Homo sapiens muscle creatine kinase mRNA
[Figure: alignment view with the labels Query Sequence (muscle creatine kinase mRNA), 3’ EST Hits, 5’ EST Hits]

UniGene Entry for H. sapiens Muscle Creatine Kinase

STS Division: Sequence Tagged Sites
• Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
• PCR with STS primers gives one product per genome
• Basis of Radiation Hybrid Mapping
  - UniGene
  - Genome Assembly
• Related resource: Electronic PCR

UniSTS: Database of Mapped Markers

HTG Division: High Throughput Genome
• phase 1 (HTG): unfinished, may be unordered, with gaps. Acc = AC109609.1
• phase 2 (HTG): unfinished, oriented, ordered, may have gaps. Acc = AC109609.6
• phase 3 (ROD): finished, no gaps. Acc = AC109609.10
• Same accession number, different versions
• 40,000 to > 50,000 bp

RefSeq: NCBI’s Derivative Sequence Database
• Curated transcripts and proteins
  - reviewed
  - human, mouse, rat, fruit fly, zebrafish, arabidopsis
• Human model transcripts and proteins
• Assembled Genomic Regions (contigs)
  - draft human genome
  - mouse genome
• Chromosome records
  - microbial
  - viral
  - organelle

Reference Sequences
  Chromosome:     NC_000000
  Gene:           NG_000000
  mRNA:           NM_000000
  Protein:        NP_000000
  RNA:            NR_000000
  Contig:         NT_000000, NW_000000
  Model mRNA:     XM_000000
  Model RNA:      XR_000000
  Model protein:  XP_000000
Legend: Curated, Automated

RefSeq Chromosomes: NC_
LOCUS       NC_002695   5498450 bp   DNA   circular   BCT   02-OCT-2001
DEFINITION  Escherichia coli O157:H7, complete genome.
ACCESSION   NC_002695
VERSION     NC_002695.1  GI:15829254
KEYWORDS    .
SOURCE      Escherichia coli O157:H7.
  ORGANISM  Escherichia coli O157:H7
            Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae;
            Escherichia.
REFERENCE   1  (sites)
  AUTHORS   Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S.,
            Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T.,
            Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T.,
            Sasakawa,C. and Shinagawa,H.
  TITLE     Complete nucleotide sequence of the prophage VT2-Sakai carrying
            the verotoxin 2 genes of the enterohemorrhagic Escherichia coli
            O157:H7 derived from the Sakai outbreak
  JOURNAL   Genes Genet. Syst. 74 (5), 227-239 (1999)
  MEDLINE   20198780
  PUBMED    10734605

RefSeq Contig: NT_, NW_

Curated RefSeq Records: NM_, NP_

Alignment Generated Transcripts: XM_, XP_

REFSEQ: Summary

BLAST
a starting point for most bioinformatics related problems…

BLAST

One BLAST, many flavors

BLAST databases

Example: BLASTing protein sequence

BLAST output

BLAST output formatting

BLAST output

BLAST output
low complexity filter

BLAST
• Scores we get from BLAST have an underlying distribution.
• E-value: the number of alignments with a given score, or a better score, that are expected to occur by chance when comparing two random sequences

BLAST
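
The E-value definition above can be made concrete with a small sketch. The slides do not give a formula; the code below assumes the standard Karlin-Altschul model for local alignment scores, E = K * m * n * exp(-lambda * S), where S is the raw score and m and n are the query and database lengths. The function names, the parameter values (lambda = 0.267, K = 0.041) and the lengths are illustrative assumptions, not values taken from the slides or from a particular BLAST run, and real BLAST uses edge-corrected "effective" lengths that are ignored here.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score S' = (lambda * S - ln K) / ln 2 (Karlin-Altschul)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expected_hits(raw_score: float, query_len: int, db_len: int,
                  lam: float, k: float) -> float:
    """Expected number of chance alignments scoring >= raw_score.

    E = K * m * n * exp(-lambda * S); m and n are used as given,
    without the effective-length correction applied by real BLAST.
    """
    return k * query_len * db_len * math.exp(-lam * raw_score)


# Illustrative (assumed) statistical parameters and search-space sizes.
LAM, K = 0.267, 0.041
QUERY_LEN, DB_LEN = 250, 50_000_000

for score in (40, 80, 120):
    e_value = expected_hits(score, QUERY_LEN, DB_LEN, LAM, K)
    bits = bit_score(score, LAM, K)
    print(f"raw score {score:>3}  bits {bits:6.1f}  E-value {e_value:.3g}")
```

Two properties are worth noticing: the E-value falls exponentially as the score rises, and it grows linearly with the size of the database, so the same alignment becomes less significant when searched against a larger collection.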