Practical 1 Discussion

Features of major databases (PubMed and NCBI Protein Db)

Anatomy of PubMed Db

Epub ahead of print and journal impact factor
How to get the impact factor of any journal:
1) Direct source – Web of Science database
2) Indirect sources, e.g. blogs and other websites (do a Google search)
Adapted from: http://admin-apps.isiknowledge.com/JCR/JCR?RQ=LIST_SUMMARY_JOURNAL

Anatomy of a PubMed record

Demo on downloading articles

Anatomy of a Protein Db

Accession numbers and GenInfo Identifiers
Format: gi|numeric identifier|source|alphanumeric identifier
Human p53 RefSeq mRNA record as an example: gi|120407067|ref|NM_000546.3

Field          Meaning
120407067      GI (GenInfo Identifier)
ref            Source (RefSeq database)
NM_000546      Accession
NM_000546.3    Version

Other popular sources:
dbj – DDBJ (DNA Data Bank of Japan database)
emb – The European Molecular Biology Laboratory (EMBL) database
prf – Protein Research Foundation database
sp – SwissProt
gb – GenBank
pir – Protein Information Resource

Why do we need accession number and GI for one record?
1) What is the difference between accession and GI?
2) Why do we need these two when both seem to be accession numbers?

Why do we need accession number and GI for one record? (cont.)

Sequence       ACCESSION    VERSION        GI
Sequence_v1    NM_000546    NM_000546.1    4507636
   (sequence update)
Sequence_v2    NM_000546    NM_000546.2    8400737
   (sequence update)
Sequence_v3    NM_000546    NM_000546.3    120407067

Q1) Which revision will NCBI show if you search by the accession only, without the version number?

Accession numbers
- The unique identifier for a sequence record.
- An accession number applies to the complete record.
- Accession numbers do not change, even if information in the record is changed at the author's request.
- Sometimes, however, an original accession number might become secondary to a newer accession number, if the authors make a new submission that combines previous sequences, or if for some reason a new submission supersedes an earlier record.

GenInfo Identifiers
- GenInfo Identifier (GI): sequence identification number
- If a sequence changes in any way, a new GI number will be assigned
- A separate GI number is also assigned to each protein translation within a nucleotide sequence record
- A new GI is assigned if the protein translation changes in any way
- GI sequence identifiers run parallel to the new accession.version system of sequence identifiers

Version
- A nucleotide sequence identification number that represents a single, specific sequence in the GenBank database.
- If there is any change to the sequence data (even a single base), the version number will be increased, e.g., U12345.1 → U12345.2, but the accession portion will remain stable.
- The accession.version system of sequence identifiers runs parallel to the GI number system, i.e., when any change is made to a sequence, it receives a new GI number AND an increase to its version number.
- A Sequence Revision History tool (http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi) is available to track the various GI numbers, version numbers, and update dates for sequences that appeared in a specific GenBank record.

Anatomy of a Protein Db record

Fasta Sequence

Fasta Format
• Text-based format for representing nucleic acid sequences or peptide sequences (single-letter codes).
• Easy to manipulate, and easy for programs to parse.
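The gi|number|source|accession.version layout from the accession and GI slides above can be split apart mechanically. The sketch below is only an illustration: the names `SeqId`, `SOURCES` and `parse_seq_id` are made up for this example, the source tags are the ones listed on the slide, and it assumes the identifier always carries a version number.

```python
from typing import NamedTuple

# Source tags as listed on the slide, mapped to database names.
SOURCES = {
    "ref": "RefSeq",
    "gb": "GenBank",
    "emb": "EMBL",
    "dbj": "DDBJ",
    "sp": "SwissProt",
    "pir": "Protein Information Resource",
    "prf": "Protein Research Foundation",
}


class SeqId(NamedTuple):
    """Hypothetical container for the four parts of a gi-style identifier."""
    gi: int
    source: str
    accession: str
    version: int


def parse_seq_id(identifier: str) -> SeqId:
    """Split 'gi|<number>|<source>|<accession.version>' into its parts."""
    fields = identifier.strip().lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {identifier!r}")
    gi, tag, acc_ver = fields[1], fields[2], fields[3]
    # The accession is the stable part; the number after the dot is the version.
    accession, dot, version = acc_ver.partition(".")
    if not dot or not version.isdigit():
        raise ValueError(f"missing version in {acc_ver!r}")
    # Unknown source tags are passed through unchanged.
    return SeqId(int(gi), SOURCES.get(tag, tag), accession, int(version))


p53 = parse_seq_id("gi|120407067|ref|NM_000546.3")
print(p53)
```

Here the accession (NM_000546) stays the same across updates, while the GI and the version are the two fields that change together when the sequence changes.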
Description line/row:    >SEQUENCE_1
Sequence data line(s):   MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
                         LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
                         IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
                         MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
Description line/row:    >SEQUENCE_2
Sequence data line(s):   SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
                         ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Fasta Format (cont.)
• Begins with a single-line description, followed by lines of sequence data.
• Description line
  – Distinguished from the sequence data by a greater-than (">") symbol.
  – The word following the ">" symbol in the same row is the identifier of the sequence.
  – There should be no space between the ">" and the first letter of the identifier.
  – Keep the identifier short and clear; some older programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53
• Sequence line(s)
  – Ensure that the sequence data starts in the row following the description row (be careful of the word-wrap feature).
  – The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
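The rules above (a ">" description line, the identifier as its first word, sequence lines running until the next ">") are enough to write a small reader. The following is a minimal sketch, not a replacement for an established library parser; `parse_fasta` and `EXAMPLE` are names invented for this illustration, and the example text is a shortened copy of the two sequences shown on the slide.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} for FASTA-formatted text lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A ">" line starts a new record; the first word is the identifier.
            words = line[1:].split()
            if not words:
                raise ValueError("description line has no identifier")
            current = words[0]
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data found before any description line")
        else:
            # Sequence data may span several lines; join them into one string.
            records[current] += line
    return records


# Shortened copy of the slide example (first lines of each sequence only).
EXAMPLE = """>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
"""

records = parse_fasta(EXAMPLE.splitlines())
for name, seq in records.items():
    print(name, len(seq))
```

Keying the result by identifier is a design choice made for brevity; a file with repeated identifiers would need a list of records instead.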
Amino acids & Nucleotides

IUPAC One Letter Amino Acid Code

Code   Amino acid                 Mnemonic
A      Alanine
B      Asx (see note)
C      Cysteine
D      Aspartic acid              Aspar(D)ic acid
E      Glutamic acid
F      Phenylalanine              (F)enylalanine
G      Glycine
H      Histidine
I      Isoleucine
J      (see note)
K      Lysine
L      Leucine
M      Methionine
N      Asparagine                 Asparagi(N)e
O      Pyrrolysine (22nd, Pyl)    Pyrr(O)lysine
P      Proline
Q      Glutamine                  (Q)lutamine
R      Arginine                   (R)ginine
S      Serine
T      Threonine
U      Selenocysteine (21st, Sec)
V      Valine
W      Tryptophan                 T(W)ptophan
X      (see note)
Y      Tyrosine                   T(Y)rosine
Z      Glx (see note)

Note
Amino acid                           Three letter code   Single letter code
Asparagine or aspartic acid          Asx                 B
Glutamine or glutamic acid           Glx                 Z
Leucine or isoleucine                Xle                 J
Unspecified or unknown amino acid    Xaa                 X

IUPAC Nucleotide Code
The standard IUPAC nucleotide code is used to describe ambiguous sites in a given DNA sequence motif, where a single character may represent more than one nucleotide.
[Table: IUPAC nucleotide code; source: http://www.yeastract.com/help/help_iupac.php]

Advice
• We highly recommend that you memorize the amino acid codes and their structures.
• Memorizing the codes, and in particular the structures, will be very useful for this module and other modules, and especially for research.
• It is not compulsory to memorize these for this module.
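Returning to the IUPAC nucleotide code: because one character can stand for several nucleotides, an ambiguous motif can be turned into a regular expression and searched for in a DNA sequence. The sketch below uses the standard IUPAC ambiguity codes; the function name `motif_to_regex` and the motif "TATAWR" are illustrative choices, not taken from the slides.

```python
import re

# Standard IUPAC nucleotide codes: each letter maps to the bases it may represent.
IUPAC_NUCLEOTIDES = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif: str) -> str:
    """Translate an IUPAC motif into a regular-expression string."""
    parts = []
    for code in motif.upper():
        if code not in IUPAC_NUCLEOTIDES:
            raise ValueError(f"unknown IUPAC nucleotide code: {code!r}")
        bases = IUPAC_NUCLEOTIDES[code]
        # Unambiguous codes stay literal; ambiguous ones become a character class.
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return "".join(parts)


# Hypothetical motif: W = A or T, R = A or G.
pattern = re.compile(motif_to_regex("TATAWR"))
print(pattern.pattern)
print(bool(pattern.search("GGTATAAAGG")))
```

A motif such as "TATAWR" therefore matches four concrete sequences (TATAAA, TATAAG, TATATA, TATATG) without listing them one by one.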
Features of major databases (Gene Db)

Anatomy of Gene Db

Anatomy of a Gene Db record

A section of a Gene Db record: Reference Sequences
– mRNA accession number
– Protein accession number

Nucleic Acid Databases
Entrez nucleotide database (nt)
• GenBank
• DDBJ
• EMBL
• RefSeq_genomic

Amino Acid Databases
1) Sequence repositories
• GenPept (redundant; translation of GenBank; minimal annotation)
• Entrez Protein (redundant or NR)
  – translated DDBJ/EMBL/GenBank (i.e. GenPept)
  – Swiss-Prot, PIR, RefSeq_protein and PDB
• RefSeq (non-redundant; reference sequences; minimal manual curation; limited species)
2) Universal curated databases
• PIR-PSD (non-redundant; focus on protein family classification)
• Swiss-Prot (non-redundant; manually annotated)
• TrEMBL (non-redundant; extensively computer-annotated)
3) Next generation of protein sequence databases
• UniProtKB (Swiss-Prot, TrEMBL and PIR-PSD integrated; less redundant than UniProt NREF)
• UniParc (like Entrez Protein but more comprehensive)
• UniProt NREF (like RefSeq but more comprehensive and rich with annotation)
Read more: http://www.ebi.ac.uk/panda/pdf/apweiler_bairoch_2004.pdf

The RefSeq Project
• Designed to reduce duplication by selecting one representative sequence for each locus, except when there are naturally occurring paralogs and splice variants.
• Info from:
  – Predictions from genomic sequence
  – Analysis of GenBank records
  – Collaborating databases
• Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.”
http://www.ncbi.nlm.nih.gov/RefSeq/index.html

GenBank versus RefSeq
http://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP

Choice of databases for genomic/proteomic data
[Figure: genome architecture, with labels Enhancer, Promoter, Gene and E, E, I, U, U]

Databases to house genomic/proteomic data

Database          Contents
Nucleotide        All of the above in multiple records
RefSeq_genome     Reference ones only
Protein           All real/reliably predicted proteins in multiple records
RefSeq_Protein    Reference proteins only
Gene              Gene record with all related information included (mRNA, protein, promoter, enhancer)

Database searching can help answer questions like
• What is the sequence of human IL-10?
• What is the gene coding for human IL-10?
• Is the function of human IL-10 known? What is it?
• Are there any variants of human IL-10?
• Who sequenced this gene?
• What are the differences between IL-10 in human and in other species?
• Which species are known to have IL-10?
• Is the structure of IL-10 known?
• What are the structural and functional domains of IL-10?
• Are there any motifs in the sequence that explain its properties?
• What is the upstream region of IL-10 containing transcriptional regulation sites?
IL10 = X?
Take home messages for databases
• Bioinformatics = databases + tools
• General databases versus specialized databases
• Databases come and go (especially the small ones)
• Database redundancy – many databases for the same topic (use the most comprehensive one; failing that, use all of them for full coverage)
• Database accuracy – published ones are more reliable; nevertheless, they are still prone to errors; it is always good to spend some time assessing the reliability of your data of interest by cross-referencing to the literature or other databases
• Fortunately, most databases are cross-referenced
• Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each one; it becomes easy after some practice
• Finding databases relevant to you
  – NAR Database catalogue
  – PubMed
  – Google
• 2 main methods for searching databases (each with its own pros and cons)
  – 1. Keyword search (covered today)
  – 2. Sequence search (day 2)