NCBI Molecular Biology Resources
A Field Guide
NCBI, Nov. 6, 2001

NCBI Resources
• About NCBI
• NCBI Sequence Databases
  - Primary Database: GenBank
  - Derivative Databases: RefSeq
• Entrez Databases and Text Searching
• BLAST Services
• Genomic Resources

The National Center for Biotechnology Information (NCBI)
Created as a part of the National Library of Medicine in 1988
• Establish public databases
• Research in computational biology
• Develop software tools for sequence analysis
• Disseminate biomedical information

Tools: BLAST (1990), Entrez (1992)
GenBank (1992)
Free MEDLINE (PubMed, 1997)
Other databases: dbEST, dbGSS, dbSTS, MMDB, OMIM, UniGene, GeneMap, Taxonomy, CGAP, SAGE, LocusLink, RefSeq

NCBI Molecular Databases
Primary Databases
• Original submissions by experimentalists
• Database staff organize but don't add additional information
• Example: GenBank
Derivative Databases
• Human curated
  - compilation and correction of data
  - Example: SWISS-PROT, NCBI RefSeq mRNA
• Computationally derived
  - Example: UniGene
• Combinations
  - Example: NCBI Genome Assembly

What is GenBank?
NCBI's Primary Sequence Database
• Nucleotide only sequence database
• Archival in nature
• GenBank Data
  - Direct submissions: individual records (BankIt, Sequin)
  - Batch submissions via email (EST, GSS, STS)
  - ftp accounts: sequencing centers
• Data shared nightly among three collaborating databases
  - GenBank
  - DNA Database of Japan (DDBJ)
  - European Molecular Biology Laboratory Database (EMBL) at EBI

[Figure: the three collaborating databases exchange data. GenBank (NCBI, NIH; retrieval via Entrez), DDBJ (CIB, NIG; retrieval via getentry), and EMBL (EBI; retrieval via SRS) each receive submissions and updates.]

GenBank Release 126, October 2001
Records       13,602,262
Nucleotides   14,396,883,064
Species       80,000+

ftp://ncbi.nlm.nih.gov/genbank/ or ftp://genbank.sdsc.edu/pub/
• full release every two months
• incremental and cumulative updates daily
• available only through internet

GenBank on FTP site
ftp> open ftp.ncbi.nlm.nih.gov
. .
ftp> cd genbank

Release 125: 243 files; 55.23 Gigabytes uncompressed

GenBank Divisions
Bulk Sequence Divisions
PAT   Patent
EST   Expressed Sequence Tags (133 files)
STS   Sequence Tagged Site
GSS   Genome Survey Sequence (41 files)
HTG   High Throughput Genome (25 files)
HTC   High Throughput cDNA
CON   Contig
Traditional Divisions
BCT INV MAM PHG PLN PRI ROD SYN UNA VRL VRT

EST Division: Expressed Sequence Tags

[Figure: EST generation. About 30,000 genes in the nucleus give 80-100,000 RNA gene products; make a cDNA library with 80-100,000 unique cDNA clones; isolate unique clones and sequence once from each end (5' and 3').]

>IMAGE:275615 5' mRNA sequence
GACAGCATTCGGGCCGAGATGTCTCGCTCCGTGGCCTTAGCTGTGCTCGCGCTACTCTCTCTTTCTGGCC
TGGAGGTATCCAGCGTACTCCAAAGATTCAGGTTTACTCACGTCATCCAGCAGAGAATGGAAAGTCAAAT
TTCCTGAATTGCTATGTGTCTGGGTTTCATCCATCCGACATTGAAGTTGACTTACTGAAGAATGGAGAGA
GAATTGAAAAAGTGGAGCATTCAGACTTGTCTTTCAGCAAGGACTGGTCTTTCTATCTCTTGTACTACAC
TGAATTCACCCCCACTGAAAAAGATGAGTATGCCTGCCGTGTTGAACCATGTNGACTTTGTCACAGNCCC
AAGTTNAGTTTAAGTGGGNATCGAGACATGTAAGGCAGGCATCATGGGAGGTTTTGAAGNATGCCGCNTT
TTGGATTGGGATGAATTCCAAATTTCTGGTTTGCTTGNTTTTTTAATATTGGATATGCTTTTG

>IMAGE:275615 3', mRNA sequence
NNTCAAGTTTTATGATTTATTTAACTTGTGGAACAAAAATAAACCAGATTAACCACAACCATGCCTTACT
TTATCAAATGTATAAGANGTAAATATGAATCTTATATGACAAAATGTTTCATTCATTATAACAAATTTCC
AATAATCCTGTCAATNATATTTCTAAATTTTCCCCCAAATTCTAAGCAGAGTATGTAAATTGGAAGTTAA
CTTATGCACGCTTAACTATCTTAACAAGCTTTGAGTGCAAGAGATTGANGAGTTCAAATCTGACCAAGAT
GTTGATGTTGGATAAGAGAATTCTCTGCTCCCCACCTCTANGTTGCCAGCCCTC

STS Division: Sequence Tagged Sites
• Segment of gene, EST, mRNA or genomic DNA of known position (microsatellite)
• PCR with STS primers gives unique product (one per genome)
• Basis of: Radiation Hybrid Mapping, UniGene, Genome Assembly
• Related resource: Electronic PCR
  http://www.ncbi.nlm.nih.gov/genome/sts/epcr.cgi

RH mapping using STSs
[Figure: STS markers A, B, C, D along a human chromosome; hybrid cells retain different chromosome fragments; PCR results score each marker as + or - in each hybrid cell.]

ePCR Results
Hexokinase 1 EST
SHGC-35892
dbSTS id: 44155, GenBank Accession: G29974
Organism: Homo sapiens
Primer1: CATACGACACGGCTCACAAA
Primer2: CTGTTTGTCTCGTGGGGG
STS location: 30..160
Chromosome: 10
Expected amplicon size: 129, Observed amplicon size: 130
Primers match in forward orientation
Query sequence:
  1 TTTTTGAATT GGTACAAAGT TTACTAGGTC ATACGACACG GCTCACAAAG CGGTGGGAAA
 61 TTCCAGTGAT GGCATTGTTT GTTGGTTGGT TCCTTTTATC CAAATGGAGA CAAGACACAT
121 TTCCGCAGAC GTGTCCACCT CCCCCCACGA GACAAACAGA ATGCAAGACT GTCACACGCG
181 GCTAGGACTG GTTCCACGGA CACACGATTT TGTGGCATTG ACACACCACG ATGCGATGCC
241 AGGCCACAGT GGGTGCCAGG AGGGGAGGAA GCAGCTAATG CTATGCCCAC ACTCGCCTTC
301 AGCATGTGCC CCGGGAGGAG GCCCGGCAGT GTCTGCTGGT GATAATACAT TTCACACGGG
361 GAGGGGGAAC CAAGGATGAG CTTTGGAGGC CAGAAGGCTG TCAGGTGGTG TG

Genome Sequencing
[Figure: labels are whole BAC insert (or genome), sonication, cloning, isolating, sequencing (GSS division), assembly, Draft Sequence (HTG division).]

GSS Division: Genome Survey Sequences
•Genomic equivalent of ESTs
•BAC and other first pass surveys
•BAC end sequences
•Whole Genome Shotgun (some)
•RAPDs and other anonymous loci
[Figure: Genomic Clone (BAC) with SP6 end and T7 end]

HTG Division: High Throughput Genome Records
phase 1   Acc = AC008701   gi = 6601005   HTG
phase 2   Acc = AC008701   gi = 6671909   HTG
phase 3   Acc = AC008701   gi = 7328720   PRI
40,000 to > 350,000 bp

The GenBank Record

A Simple GenBank Record
LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000
DEFINITION  Limulus polyphemus myosin III mRNA, complete cds.
ACCESSION   AF062069
VERSION     AF062069.2  GI:7144484
KEYWORDS    .
SOURCE      Atlantic horseshoe crab.
  ORGANISM  Limulus polyphemus
            Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata;
            Xiphosura; Limulidae; Limulus.
REFERENCE   1  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     A myosin III from Limulus eyes is a clock-regulated
            phosphoprotein
  JOURNAL   J. Neurosci. (1998) In press
REFERENCE   2  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (29-APR-1998) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
REFERENCE   3  (bases 1 to 3808)
  AUTHORS   Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R.,
            Greenberg,R.M. and Smith,W.C.
  TITLE     Direct Submission
  JOURNAL   Submitted (02-MAR-2000) Whitney Laboratory, University of
            Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
  REMARK    Sequence update by submitter
COMMENT     On Mar 2, 2000 this sequence version replaced gi:3132700.

GenBank Record, cont.
FEATURES             Location/Qualifiers
     source          1..3808
                     /organism="Limulus polyphemus"
                     /db_xref="taxon:6850"
                     /tissue_type="lateral eye"
     CDS             258..3302
                     /note="N-terminal protein kinase domain; C-terminal myosin
                     heavy chain head; substrate for PKA"
                     /codon_start=1
                     /product="myosin III"
                     /protein_id="AAC16332.2"
                     /db_xref="GI:7144485"
                     /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDKQA
                     NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWLGI
                     EFLEEGTAADLLATHRRFGIHLKEDLIALIIKEVVRAVQYLHENSIIHRDIRAANIMF
                     SKEGYVKLIDFGLSASVKNTNGKAQSSVGSPYWMAPEVISCDCLQEPYNYTCDVWSIG
                     ITAIELADTVPSLSDIHALRAMFRINRNPPPSVKRETRWSETLKDFISECLVKNPEYR
                     PCIQEIPQHPFLAQVEGKEDQLRSELVDILKKNPGEKLRNKPYNVTFKNGHLKTISGQ
BASE COUNT     1201 a    689 c    782 g   1136 t
ORIGIN
        1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt
     3781 aagatacagt aactagggaa aaaaaaaa
//

Sequence and Database Identifiers
Locus, accession, gi, version
• LOCUS line
  - Locus Name: AF062069
  - Sequence length: 3808 bp
  - mol-type: mRNA (= cDNA); other values include rRNA, snRNA, DNA
  - GB Division: INV
  - Modification Date: 02-MAR-2000
• DEFINITION: DEF line (Title)
• ACCESSION: Accession Number (AF062069)
• VERSION: Accession.version (AF062069.2) and gi number (GI:7144484)

Keywords, Source-organism
• KEYWORDS: legacy field (here "."); exceptions: EST, GSS, HTG
• SOURCE: accepted common name (Atlantic horseshoe crab.)
• ORGANISM: scientific name (Limulus polyphemus), followed by the taxonomic lineage according to GenBank
  (Eukaryota; Metazoa; Arthropoda; Chelicerata; Merostomata; Xiphosura; Limulidae; Limulus.)

Citation
• REFERENCE 1: Article
  Battelle,B.-A., Andrews,A.W., Calman,B.G., Sellers,J.R., Greenberg,R.M. and Smith,W.C.
  A myosin III from Limulus eyes is a clock-regulated phosphoprotein. J. Neurosci. (1998) In press
• REFERENCE 2: Submitter Block
  Direct Submission, Submitted (29-APR-1998) Whitney Laboratory, University of Florida, 9505 Ocean Shore Blvd., St. Augustine, FL 32086, USA
• REFERENCE 3: Update history
  Direct Submission, Submitted (02-MAR-2000), same authors and address. REMARK: Sequence update by submitter
• COMMENT: Previous version
  On Mar 2, 2000 this sequence version replaced gi:3132700.
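The identifier fields annotated above can be pulled apart mechanically. The following is a minimal sketch, not an NCBI tool: the helper names `parse_locus` and `parse_version` are hypothetical, and the whitespace-splitting rule is an assumption that fits the simple AF062069 record shown here (a LOCUS line with an extra token, such as "circular", would need more care).

```python
import re
from typing import NamedTuple, Tuple


class Locus(NamedTuple):
    name: str        # Locus Name
    length_bp: int   # sequence length in base pairs
    mol_type: str    # e.g. mRNA, DNA
    division: str    # GenBank division code, e.g. INV
    date: str        # modification date


def parse_locus(line: str) -> Locus:
    """Split a simple LOCUS line into its whitespace-separated fields."""
    parts = line.split()
    # Assumed layout: LOCUS <name> <length> bp <mol-type> <division> <date>
    if len(parts) != 7 or parts[0] != "LOCUS" or parts[3] != "bp":
        raise ValueError(f"unexpected LOCUS layout: {line!r}")
    return Locus(parts[1], int(parts[2]), parts[4], parts[5], parts[6])


def parse_version(line: str) -> Tuple[str, int, int]:
    """Return (accession, version, gi) from a VERSION line."""
    match = re.fullmatch(
        r"VERSION\s+([A-Z]+_?\d+)\.(\d+)\s+GI:(\d+)", line.strip()
    )
    if match is None:
        raise ValueError(f"unexpected VERSION layout: {line!r}")
    return match.group(1), int(match.group(2)), int(match.group(3))


locus = parse_locus(
    "LOCUS       AF062069     3808 bp    mRNA            INV       02-MAR-2000"
)
accession, version, gi = parse_version("VERSION     AF062069.2  GI:7144484")
print(locus.name, locus.length_bp, locus.division)
print(accession, version, gi)
```

Keeping the accession and the version as separate values mirrors the record itself: the accession stays fixed across updates, while the version suffix and the gi number change when the sequence is replaced.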
Feature Table FEATURES source CDS Coding Sequence " Location/Qualifiers 1..3808 /organism="Limulus polyphemus" Biosource /db_xref="taxon:6850" /tissue_type="lateral eye" 258..3302 /note="N-terminal protein kinase domain; C-terminal myosin heavy chain head; substrate Reading Frame for PKA" /codon_start=1 /product="myosin III" GenPept Protein Identifiers /protein_id="AAC16332.2" /db_xref="GI:7144485" /translation="MEYKCISEHLPFETLPDPGDRFEVQELVGTGTYATVYSAIDK NKKVALKIIGHIAENLLDIETEYRIYKAVNGIQFFPEFRGAFFKRGERESDNEVWL Sequence Indicates beginning of sequence data BASE COUNT 1201 a 689 c 782 g 1136 t ORIGIN 1 tcgacatctg tggtcgcttt ttttagtaat aaaaaattgt attatgacgt cctatctgtt <sequence omitted> 3721 accaatgtta taatatgaaa tgaaataaag cagtcatggt agcagtggct gtttgaaata 3781 aagatacagt aactagggaa aaaaaaaa // End of record NCBI Derivative Sequence Databases: RefSeq NCBI Reference Sequences mRNAs and Proteins NM_123456 NP_123456 XM_123456 XP_123456 Curated mRNA Curated Protein Predicted Transcript Predicted Protein Assemblies NT_123456 Contig (Mouse and Human Genomes) NC_123455 Chromosome (Microbial Genomes) NCBI Gene Records NG_123456 Reference Genomic Sequence Curated RefSeq Records: NM_, NP_ LOCUS NM_000492 6159 bp mRNA PRI 26-JUL-1999 DEFINITION Homo sapiens cystic fibrosis transmembrane conductance regulator(CFTR) mRNA. was derived from M28668.1, REFSEQ: This reference sequence RefSeq Nucleotide ACCESSION M55131.1.NM_000492 On Feb 17, 2000 this sequence version replaced gi:4502784. Summary: Cystic fibrosis transmembrane conductance regulator is LOCUS 1480 aacassete sub-family C.PRI 26-JUL-1999 member 7 NP_000483 of the ATP-binding The protein DEFINITION fibrosis channel transmembrane conductance regulator.of functionscystic as a chloride and controls the regulation ACCESSION NP_000483 other transport pathways. 
Mutations in this gene cause the PID autosomalg4502785 Protein recessive disorder, cystic fibrosis (CF)RefSeq and congenital VERSION GI:4502785 bilateralNP_000483.1 aplasia of the vas deferens (CBAVD). Alternative splice DBSOURCE accession NM_000492.1 variants REFSEQ: have been described, many of which result from mutations in the CFTR gene. COMPLETENESS: full length. COMMENT REFSEQ: This reference sequence was derived from M55131. PROVISIONAL RefSeq: This is a provisional reference sequence record that has not yet been subject to human review. The final curated reference sequence record may be somewhat different from this one. Reviewed Alignment Generated Transcripts: XM_, XP_ LOCUS DEFINITION ACCESSION VERSION XM_004980 6128 bp mRNA PRI 16-NOV-2000 Homo sapiens cystic fibrosis transmembrane conductance regulator, ATP-binding cassette (sub-family C, member 7) (CFTR), mRNA. mismatch XM_004980 XM_004980.3 GI:13631444 NCBI RefSeq Human Contig: NT_ mRNA complement(join(1255889..1257642,1258986..1259091, LOCUS NT_007935 1888399 bp DNA CON 16-NOV-2000 1259690..1259862,1271619..1271708,1281957..1282112, DEFINITION Homo sapiens chromosome 7 working draft sequence segment, 1296780..1297028,1309837..1309937,1312742..1312969, CONTIG join(AC073042.3:1155..2680,gap(100),AC074390.2:119526..151445, complete sequence. 1313881..1314031,1317797..1317876,1320768..1321018, 1321687..1321724,1329492..1329620,1331893..1332616, gap(100),AC074390.2:1..5245,gap(100), ACCESSION NT_007935 1334111..1334197,1336717..1336811,1364895..1365086, complement(AC074390.2:17705..23645),gap(100), VERSION NT_007935.1 GI:11422165 1375727..1375909,1382442..1382534,1384204..1384450, AC074390.2:97658..119425,AC073042.3:106479..121155, KEYWORDS HTG. 1387877..1388002,1389139..1389302,1390185..1390274, AC074390.2:164226..165036,AC073042.3:70628..79503,gap(100), 1393436..1393651,1415408..1415516,1420187..1420297, SOURCE human. 
1444403..1444587)) AC073042.3:4627..6382,gap(100),AC073042.3:2781..4526,gap(100), ORGANISM Homo sapiens /partial complement(AC073042.3:183627..209083),gap(100), Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; /gene="CFTR" AC073042.3:79604..88622,gap(100),AC073042.3:139234..160437, Euteleostomi;Mammalia; Primates;conductance Catarrhini; /product="cystic Eutheria; fibrosis transmembrane regulator, ATP-binding cassette (sub-family C, member 7)" gap(100),complement(AC073042.3:6483..8319),gap(100), Hominidae; Homo. REFERENCE 1complement(AC073042.3:39354..45372),gap(100), (bases 1/transcript_id="XM_004980.1" to 1888399) /db_xref="LocusID:1080" complement(AC073042.3:21461..24064),gap(100), AUTHORS International Human Genome Project collaborators. /db_xref="MIM:602421" AC074390.2:156347..160294,gap(100), by automated computational analysis TITLE Toward the /note="derived complete sequence of the human genome Reordering draftusing sequence gene prediction method: Acembly. Supporting evidence complement(AC074390.2:5346..10750),gap(100), JOURNAL Unpublished includes similarity to: 9 proteins, 1 mRNAs See details in complement(AC074390.2:153911..156246),gap(100), COMMENT GENOME ANNOTATION NCBI contigs are derived from AceView" REFSEQ: complement(AC074390.2:23746..32402),gap(100), assembled genomic sequence data. They may include both gene complement(1255889..1444587) /gene="CFTR" complement(AC074390.2:151546..153810),gap(100), draft and finished sequence. /note="CF; MRP7; ABC35; ABCC7" complement(AC074390.2:57277..75275),gap(100), COMPLETENESS: not full length. /db_xref="LocusID:1080" complement(AC074390.2:75376..97557),gap(100), Map View of RefSeqs NT_ XM_ NCBI NM_ RefSeq Genome Records: NG_ NCBI RefSeq Chromosomes: NC_ LOCUS DEFINITION ACCESSION VERSION KEYWORDS SOURCE ORGANISM REFERENCE AUTHORS JOURNAL MEDLINE PUBMED NCBI TITLE NC_002695 5498450 bp DNA circular BCT 02-OCT-2001 Escherichia coli O157:H7, complete genome. NC_002695 NC_002695.1 GI:15829254 . 
Escherichia coli O157:H7. Escherichia coli O157:H7 Bacteria; Proteobacteria; gamma subdivision; Enterobacteriaceae; Escherichia. 1 (sites) Makino,K., Yokoyama,K., Kubota,Y., Yutsudo,C.H., Kimura,S., Kurokawa,K., Ishii,K., Hattori,M., Tatsuno,I., Abe,H., Iida,T., Yamamoto,K., Ohnishi,M., Hayashi,T., Yasunaga,T., Honda,T., Sasakawa,C. and Shinagawa,H. Complete nucleotide sequence of the prophage VT2-Sakai carrying the verotoxin 2 genes of the enterohemorrhagic Escherichia coli O157:H7 derived from the Sakai outbreak Genes Genet. Syst. 74 (5), 227-239 (1999) 20198780 10734605 Other NCBI Derivative Databases UniGene - LocusLink - gene oriented expressed sequence clusters central resource and interface for known genes NCBI NCBI Homepage NCBI Mendelian Inheritance in Man Entrez NCBI Similarity Searching NCBI Homepage Using Entrez An integrated database search and retrieval system NCBI Entrez: Neighboring and Hard Links Word weight PubMed abstracts 33-D -D Structure Structure Taxonomy Phylogeny Genomes BLAST Nucleotide sequences VAST (MMDB) Protein sequences BLAST WWW Entrez GenBank, EMBL, DDBJ RefSeq, PDB •All of MEDLINE plus others •Abstracts •Links to online Journals GenBank, DDBJ, EMBL translations PDB, PIR, SWISS-PROT, PRF, RefSeq NCBI’s MMDB - derived from PDB Reference Genomes: Graphical views, assembled sequence and mapping data NCBI Database Searching with Entrez Using limits and field restriction to find mouse GAPD Linking and neighboring with mouse GAPD NCBI Entrez Nucleotides Mouse NCBI Document Summaries: Mouse[All Fields] 3 million records Chicken not mouse !? 
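The "Chicken not mouse" result illustrates why an All Fields search is noisy: any record that mentions the word matches. A minimal sketch of composing field-restricted terms follows. The helper names `field_term` and `combine` are hypothetical, and the field names ("Organism", "Title Word") are taken from the Limits field list in this deck; the exact tags accepted by a given Entrez release are an assumption.

```python
def field_term(term: str, field: str) -> str:
    """Return a field-restricted search term such as mouse[Organism]."""
    if not term or not field:
        raise ValueError("term and field must be non-empty")
    return f"{term}[{field}]"


def combine(*terms: str, op: str = "AND") -> str:
    """Join search terms with a Boolean operator (AND, OR, NOT)."""
    if op not in {"AND", "OR", "NOT"}:
        raise ValueError(f"unsupported operator: {op}")
    return f" {op} ".join(terms)


# Unrestricted: matches the word "mouse" anywhere in a record.
broad = field_term("mouse", "All Fields")

# Restricted: organism must be mouse, title must mention the enzyme.
narrow = combine(
    field_term("mouse", "Organism"),
    field_term("glyceraldehyde", "Title Word"),
    field_term("dehydrogenase", "Title Word"),
)

print(broad)
print(narrow)
```

Building the query from small pieces makes it easy to add or drop one restriction at a time, which is the same workflow the Preview/Index and Search History screens support interactively.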
NCBI Entrez Nucleotides: Limits: Preview/Index Mouse NCBI Entrez Nucleotides: Limits NCBI Accession All Fields Author Name EC/RN Number Mouse Feature key Field Restriction Filter Gene Name Issue Journal Name Keyword Exclude unwanted categories of sequences Modification Date Organism Page Number Gene Location Molecule Primary Accession Genomic DNA/RNA Genomic DNA/RNA Properties Mitochondrion mRNA Protein Name Chloroplast rRNA Publication Date SeqID String Only From Sequence Length RefSeq Substance Name GenBank Text Word EMBL Title Word DDBJ Uid Entrez Nucleotides: Limits: Organism Mouse NCBI Document Summaries: Mouse[Organism] 2,976,070[All Fields] -2,921,009[Organism] 55,061 NCBI Exclude Bulk Sequences, mRNA NCBI Adding Terms: Preview/Index Title Word Uid Volume Search History 3 phosphate dehydrogenase NCBI Accession All Fields Author Name EC/RN Number Feature key Filter Gene Name Issue Journal Name Keyword Modification Date Organism Page Number Primary Accession Properties glyceraldehyde Protein Name Publication Date SeqID String Sequence Length Substance Name Text Word Mouse GAPD Records NCBI Displaying Mouse GAPD Records NCBI Summary Brief GenBank ASN.1 Formats FASTA GI list LinkOut PubMed Links Protein Links Links and neighbors (related records) Nucleotide Neighbors PopSet Links Structure Links Genome Links Taxonomy Links OMIM Links Entrez GenBank / GenPept NCBI GenPept FASTA Format >gi|193425|gb|M60978.1|MUSGAPDS Mus musculus testis-specific isoform of glycerald GGCAGCCAGGCCATGAGATCTTAGGCCATGTCGAGACGTGACGTGGTCCTTACCAATGTTACTGTTGTCC AGCTACGGCGGGACCGATGCCCATGCCCATGCCCATGCCCATGTCCATGCCCATGCCCTGTGATCAGACC ACCTCCACCCAAGCTTGAGGATCCACCACCCACGGTTGAAGAACAGCCACCGCCACCGCCGCCGCCACCT CCACCTCCACCACCACCTCCTCCTCCTCCTCCACCCCAGATAGAGCCAGACAAGTTTGAAGAGGCTCCCC CTCCCCCTCCCCCTCCTCCTCCTCCTCCCCCTCCCCCTCCTCCACCACTCCAAAAGCCAGCTAGAGAGCT GACAGTGGGTATCAATGGATTTGGACGCATTGGTCGTCTGGTGCTGCGAGTCTGCATGGAGAAGGGCATT > AGGGTGGTAGCAGTGAATGACCCATTCATTGATCCAGAATACATGGTTTACATGTTCAAATATGACTCCA 
CACATGGTAGATACAAAGGAAACGTGGAACATAAGAATGGACAACTAGTTGTGGACAACCTTGAGATCAA CACGTACCAGTGCAAAGACCCTAAAGAAATCCCCTGGAGCTCTATAGGGAATCCCTACGTGGTGGAGTGT gi number Locus Name ACAGGCGTCTATCTGTCCATCGAGGCAGCTTCGGCACATATTTCATCTGGTGCCAGGCGTGTGGTGGTCA CTGCACCCTCCCCCGATGCACCCATGTTTGTCATGGGAGTGAACGAGAAGGACTATAACCCTGGCTCTAT Database Identifiers GACCATTGTCAGCAATGCATCCTGTACCACCAACTGCCTGGCTCCTCTCGCCAAGGTTATTCATGAAAAC Accession number gb GenBank TTCGGGATCGTGGAAGGGCTAATGACCACAGTCCATTCCTACACAGCCACTCAGAAGACAGTGGATGGGC CATCAAAGAAGGACTGGCGAGGTGGCCGCGGCGCTCACCAAAACATCATCCCATCGTCCACTGGGGCTGC emb EMBL CAAGGCTGTAGGCAAAGTCATCCCAGAGCTCAAAGGGAAGCTAACAGGAATGGCATTCCGGGTGCCAACC dbj DDBJ CCAAACGTGTCAGTTGTGGACCTGACCTGCCGCCTGGCCAAGCCTGCTTCTTACTCGGCTATCACGGAGG sp SWISS-PROT CTGTGAAAGCTGCAGCCAAGGGACCTTTGGCTGGCATCCTTGCTTACACAGAGGACCAGGTGGTCTCCAC GGACTTTAACGGCAATCCCCATTCTTCCATCTTTGATGCTAAGGCTGGAATTGCCCTCAATGACAACTTC pdb Protein Databank GTGAAGCTTGTTGCCTGGTACGACAACGAATATGGCTACAGTAACCGAGTGGTCGACCTCCTCCGCTACA pir PIR TGTTTAGCCGAGAGAAGTAACACAAAAGGCCCCTCCTTGCTCCCCTGCGCACCTCGCGTTCCTGACTTCG prf PRF GCTTCCACTCAAAGGCGCCGCCACCGGGTCAACAATGAAATAAAAACGAGAATGCGC FASTA Definition Line >gi|193425|gb|M60978.1|MUSGAPDS RefSeq NCBI ref Abstract Syntax Notation: ASN.1 Seq-entry ::= set { level 1 , class nuc-prot , descr { title "Mus musculus testis-specific isoform of glyceraldehyde 3-phosphate dehydrogenase (Gapd-S) mRNA, and translated products" , update-date std { year 1994 , month 11 , day 9 } , source { org { taxname "Mus musculus" , common "house mouse" , db { { db "taxon" , tag id 10090 } } , GenPept GenBank ASN.1 FASTA Nucleotide NCBI FASTA Protein NCBI Toolbox /***************************************************************************** * * asn2ff.c * convert an ASN.1 entry to flat file format, using the FFPrintArrayPtrs. 
* *****************************************************************************/ #include <accentr.h> #include "asn2ff.h" #include "asn2ffp.h" #include "ffprint.h" #include <subutil.h> #include <objall.h> #include <objcode.h> #include <lsqfetch.h> #include <explore.h> Toolbox Sources FILE *fpl; ftp://ncbi.nlm.nih.gov/toolbox/ncbi_tools Args myargs[] = { {"Filename for asn.1 input","stdin",NULL,NULL,TRUE,'a',ARG_FILE_IN,0.0,0,NULL}, {"Input is a Seq-entry","F", NULL ,NULL ,TRUE,'e',ARG_BOOLEAN,0.0,0,NULL}, {"Input asnfile in binary mode","F",NULL,NULL,TRUE,'b',ARG_BOOLEAN,0.0,0,NULL}, {"Output Filename","stdout", NULL,NULL,TRUE,'o',ARG_FILE_OUT,0.0,0,NULL}, {"Show Sequence?","T", NULL ,NULL ,TRUE,'h',ARG_BOOLEAN,0.0,0,NULL}, NCBI ftp> open ncbi.nlm.nih.gov . . #ifdef ENABLE_ID1 ftp> cd toolbox #include <accid1.h> ftp> cd ncbi_tools #endif Protein Neighbors-Structure Links Related Proteins Cn3D GAPD Structure Structure Links NCBI Advanced Neighbors: BLink NCBI BLink NCBI PubMed Link NCBI Online Books NCBI Entrez Structures Molecular Modeling Database (MMDB) and Cn3D NCBI MMDB: Molecular Modeling Data Base Derived from experimentally determined PDB records Value added to PDB records including: • • • • Structure neighbors determined by Vector Alignment Search Tool (VAST) NCBI Addition of explicit chemical graph information Validation Inclusion of Taxonomy, Citation, and other information Conversion to parseable ASN.1 data description language Searching MMDB NCBI 1CET Structure Summary BLAST neighbors VAST neighbors NCBI Cn3D viewer Cn3D : Displaying Structures NCBI Chloroquine Structure Neighbors NCBI Structural Alignments Chloroquine NADH NCBI Why do we need similarity searching? Identification and annotation •Incomplete or no annotations (GenBank) •Incorrectly annotated sequences but it ain’t necessarily so! 
Evolutionary relationships
• homologous molecules may have similar functions

Basic Local Alignment Search Tool
• Widely used similarity search tool
• Heuristic approach based on Smith Waterman algorithm
• Finds best local alignments
• Provides statistical significance
• All combinations (DNA/Protein) of query and database
  - DNA vs DNA
  - DNA translation vs Protein
  - Protein vs Protein
  - Protein vs DNA translation
  - DNA translation vs DNA translation
• www, email server, standalone, and network clients

Local Alignment Statistics
High scores of local alignments between two random sequences follow the Extreme Value Distribution.
For ungapped alignments, the expected number with score S or greater is

  $E = K m n e^{-\lambda S}$   or   $E = m n 2^{-S'}$

  m, n = lengths of the two sequences being compared
  K = scale for search space
  $\lambda$ = scale for scoring system
  S' = bit score = $(\lambda S - \ln K)/\ln 2$

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

Scoring Systems
•Nucleic acids
  - identity matrix
•Proteins
  - Position Independent Matrices
    · PAM Matrices (Percent Accepted Mutation)
        Implicit model of evolution
        Higher PAM numbers are all calculated from PAM1
        PAM250 widely used
    · BLOSUM Matrices (BLOck SUbstitution Matrices)
        Empirically determined from alignment of conserved blocks
        Each includes information up to a certain level of identity
        BLOSUM62 widely used
  - Position Specific Score Matrices (PSSM)
        PSI and RPS BLAST

BLOSUM62
A   4
R  -1  5
N  -2  0  6
D  -2 -2  1  6
C   0 -3 -3 -3  9
Q  -1  1  0  0 -3  5
E  -1  0  0  2 -4  2  5
G   0 -2  0 -1 -3 -2 -2  6
H  -2  0  1 -1 -3  0  0 -2  8
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4
X   0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2  0  0 -2 -1 -1 -1
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  X
• Common amino acids have low weights
• Rare amino acids have high weights
• Negative for less likely substitutions
• Positive for more likely substitutions

Position Specific Substitution Rates
[Figure: a typical serine compared with an active site serine.]

Position Specific Score Matrix (PSSM)
[Table: PSSM rows for alignment positions 206-224 (residues D G V I S S C N G D S G G P L N C Q A), one score column per amino acid (A R N D C Q E G H I L K M F P S T W Y V). Serine is scored differently in two positions; one of them is the active site nucleophile.]

Gapped Alignments
•Gapping provides more biologically realistic alignments
•Statistical behavior not completely understood for gapped alignments
•Gapped BLAST parameters must be found by simulations for each matrix
•Affine gap costs = -(a+bk)
   a = gap open penalty
   b = gap extend penalty
   k = gap length
   A gap of length 1 receives the score -(a+b)

Intermission
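The statistics and gap-cost formulas from the slides before the break can be checked numerically. This is a minimal sketch under stated assumptions: it takes the ungapped expectation in the standard form E = K·m·n·e^(-λS) with bit score S' = (λS - ln K)/ln 2, the function names are hypothetical, and the values chosen for λ, K, m, n, a and b are illustrative placeholders rather than parameters reported in this deck.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score S to a bit score S'."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def expect_from_raw(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """E = K * m * n * exp(-lambda * S) for ungapped local alignments."""
    return k * m * n * math.exp(-lam * raw_score)


def expect_from_bits(bits: float, m: int, n: int) -> float:
    """E = m * n * 2**(-S'), the same expectation expressed in bits."""
    return m * n * 2.0 ** (-bits)


def affine_gap_score(length: int, gap_open: int, gap_extend: int) -> int:
    """Score of a gap of the given length under affine costs: -(a + b*k)."""
    if length < 1:
        raise ValueError("gap length must be at least 1")
    return -(gap_open + gap_extend * length)


# Illustrative placeholder parameters (assumptions, not from the slides).
LAMBDA, K = 0.318, 0.134
QUERY_LEN, DB_LEN = 300, 10**9

raw = 100
bits = bit_score(raw, LAMBDA, K)
print(f"bit score: {bits:.1f}")
print(f"E from raw score: {expect_from_raw(raw, QUERY_LEN, DB_LEN, LAMBDA, K):.3g}")
print(f"E from bit score: {expect_from_bits(bits, QUERY_LEN, DB_LEN):.3g}")
print("gap of length 1:", affine_gap_score(1, gap_open=11, gap_extend=1))
print("gap of length 5:", affine_gap_score(5, gap_open=11, gap_extend=1))
```

The two E-value routines agree by construction, which is the point of the bit score: once K and λ are folded into S', scores from different scoring systems can be compared using only the search space size m·n.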