Chapter 2 Biological Databases

Databases (DB)
- A database is a computerized archive used to store and organize data in such a way that information can be retrieved easily via a variety of search criteria.
- Record: also called an entry.
- Query: retrieves the whole data record.

Types of DB
- Flat file format: entries are separated by a delimiter.
- Database management systems: software programs for organizing, searching, and accessing data.
- Relational DB: use a set of tables to organize data. Each table is also called a relation. Structured Query Language (SQL) is the DB programming language.
- Object-oriented DB (OO DB): describe complex hierarchical relationships between data items. Data are stored as objects, and a search navigates through the objects with the aid of the pointers linking different objects. Programming languages like C++ are used to create object-oriented databases.

Biological DB
- Primary DB: contain original biological data, such as raw sequences (seqs). Examples: NCBI (GenBank), EMBL, DDBJ, PDB.
- Secondary DB: contain processed or manually curated data. Examples: protein DBs (SWISS-PROT, TrEMBL, PIR) and protein family classification DBs (Pfam, Blocks, DALI).
- Specialized DB: cater to a particular research interest. Examples: FlyBase, WormBase, AceDB, TAIR, GenBank EST, and microarray DBs, e.g. ArrayExpress at EBI.

Interconnection between Biological Databases
The main obstacle is format incompatibility among the three types of database structures: flat files, relational, and object oriented.
- Common Object Request Broker Architecture (CORBA): databases communicate over a network through an "interface broker" without having to understand each other's database structure.
- eXtensible Markup Language (XML): each biological record is broken down into small, basic components that are labeled with a hierarchical nesting of tags.
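To make the XML idea concrete, here is a minimal Python sketch that breaks one record into nested, tagged components and then reads a single field back. It is only an illustration: the tag names (entry, accession, protein, name, organism, sequence) are invented for this example and do not follow any real database schema. The values are taken from the DUT_ECOLI entry (P06968) that appears later in these notes.

```python
import xml.etree.ElementTree as ET


def build_record(accession, name, organism, sequence):
    """Wrap the parts of one sequence record in nested, labeled tags.

    The tag names are hypothetical, chosen only for this example.
    """
    entry = ET.Element("entry")
    ET.SubElement(entry, "accession").text = accession
    protein = ET.SubElement(entry, "protein")
    ET.SubElement(protein, "name").text = name
    ET.SubElement(protein, "organism").text = organism
    seq = ET.SubElement(entry, "sequence", length=str(len(sequence)))
    seq.text = sequence
    return entry


record = build_record(
    "P06968",
    "dUTPase",
    "Escherichia coli",
    # Only the first line of the sequence, to keep the example short.
    "MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD",
)

# Serialize to text, the form in which a record would be exchanged.
xml_text = ET.tostring(record, encoding="unicode")

# A receiving program needs only the tag names, not the sender's
# internal storage format, to pull out a field.
parsed = ET.fromstring(xml_text)
organism = parsed.find("protein/organism").text

print(xml_text)
print(organism)
```

Because every component carries its own label, the receiving side can ignore tags it does not know and still find the ones it needs.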
The XML structure significantly improves the distribution and exchange of complex sequence annotations between databases.

PITFALLS OF BIOLOGICAL DATABASES
- Reliability: errors can be passed on to other databases, causing propagation of errors.
- Redundancy: caused by submission of identical or overlapping sequences by the same or different authors, revision of annotations, and dumping of expressed sequence tag (EST) data.
- NCBI has now created a nonredundant database called RefSeq.
- The SWISS-PROT database also has minimal redundancy for protein sequences.
- Sequence-cluster databases such as UniGene.
- Erroneous annotations: Gene Ontology provides a controlled vocabulary for annotation.

INFORMATION RETRIEVAL FROM BIOLOGICAL DATABASES

PubMed/Medline
- To find out about a protein by its name, go to http://www.ncbi.nlm.nih.gov/entrez and search for dUTPase.
- To save the results: File > Save as.
- Search PubMed using author names (case insensitive).

Searching PubMed using fields
- Field tags: TI = title, AB = abstract, AD = laboratory address, AU = authors, SO = journal abbreviation.
- Example: common names such as Down can be used in different contexts, such as titles (Down syndrome) or an address (Down address). Compare Down[AU], Down[AB], and Down[AD].
- Boolean operators: AND, OR, NOT (in capital letters).
- For more about the fields, read the Entrez Gene web page: http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene
- Using fields to find experts near you.

Exercise: Try to search for articles related to microRNA and cancer, using the abstract and laboratory address fields.

Searching PubMed using limits
- For example, find all the review articles about dUTPase.

A few more tips about PubMed
- add a quoted query, such as "down syndrome"
- add initials to last names, for example "Abergel C"
- write down the PubMed Identifier (the PMID number)
- use other names to search, such as dUTP pyrophosphatase
- try the Related Articles option

Retrieving Protein Sequences
- http://www.expasy.org/sprot
- Retrieving the protein sequence performing the dUTPase function in E.
  coli

FASTA format
- Number of amino acids, molecular weight

>sp|P06968|DUT_ECOLI Deoxyuridine 5'-triphosphate nucleotidohydrolase (EC 3.6.1.23) (dUTPase) (dUTP pyrophosphatase) - Escherichia coli.
MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD
PSLAAMMLPRSGLGHKHGIVLGNLVGLIDSDYQGQLMISVWNRGQDSFTIQPGERIAQMI
FVPVVQAEFNLVEDFDATDRGEGGFGHSGRQ

Advanced search with field restriction
- gene definition, gene name, and organism

Retrieving a list of protein sequences
- Deselect the TrEMBL and TrEMBL-New boxes (computer-annotated sequences).
- Perform this query on SRS against Swiss-Prot.
- Gene names differ between species, even when the genes have the same function:

  Species           Gene name
  E. coli           dut
  Yeast             dut1
  Vaccinia virus    F2L
  Herpes virus      UL50

Retrieving DNA sequences
- Retrieving the DNA sequence related to E. coli dUTPase (P06968).
- Look under the Cross-references section: EMBL, GenBank, DDBJ, and CodingSequence.
- The GenBank entry consists of many parts, such as the LOCUS, REFERENCE, FEATURES, and ORIGIN (sequence) sections.

[Figure: FEATURES section of the GenBank entry; labels shown: promoter, -10, RBS, start, CDS]

Repeat unit
  repeat_unit     831..838
                  /note="inverted repeat A"
  repeat_unit     844..851
                  /note="inverted repeat A'"
  inverted repeat: ccgcaaac gtttgcgg

Using BLAST to compare my protein sequence to other protein sequences

Making a Multiple Sequence Alignment (MSA) with ClustalW
- http://pir.georgetown.edu
- Evolutionary distance (DUT_CANAL, DUT_BPT5) = 0.33884 + 0.33684 = 0.67568

Appendix

>sp|P06968|DUT_ECOLI Deoxyuridine 5'-triphosphate nucleotidohydrolase (EC 3.6.1.23) (dUTPase) (dUTP pyrophosphatase) - Escherichia coli.
MKKIDVKILDPRVGKEFPLPTYATSGSAGLDLRACLNDAVELAPGDTTLVPTGLAIHIAD
PSLAAMMLPRSGLGHKHGIVLGNLVGLIDSDYQGQLMISVWNRGQDSFTIQPGERIAQMI
FVPVVQAEFNLVEDFDATDRGEGGFGHSGRQ

For MSA: DUT_AQUAE, DUT_BPT5, DUT_BRAJA, DUT_BUCAI, DUT_CANAL

>DUT_AQUAE
MSKVILKIKRLPHAQDLPLPSYATPHSSGLDLRAAIEKPLKIKPFERVLIPTGLILEIPE
GYEGQVRPRSGLAWKKGLTVLNAPGTIDADYRGEVKVILVNLGNEEVVIERGERIAQLVI
APVQRVEVVEVEEVSQTQRGEGGFGSTGTK
>DUT_BPT5
MIKIKLTHPDCMPKIGSEDAAGMDLRAFFGTNPAADLRAIAPGKSLMIDTGVAVEIPRGW
FGLVVPRSSLGKRHLMIANTAGVIDSDYRGTIKMNLYNYGSEMQTLENFERLCQLVVLPH
YSTHNFKIVDELEETIRGEGGFGSSGSK
>DUT_BRAJA
MSTKVTVELQRLPHAEGLPLPAYQTAEAAGLDLMAAVPEDAPLTLASGRYALVPTGLAIA
LPPGHEAQVRPRSGLAAKHGVTVLNSPGTIDADYRGEIKVILINHGAAAFVIKRGERIAQ
MVIAPVVQAALVPVATLSATDRGAGGFGSTGR
>DUT_BUCAI
MSNIEIKILDSRMKNNFSLPSYATLGSSGLDLRACLDETVKLKAHKTILIPTGIAIYIAN
PNITALILPRSGLGHKKGIVLGNLVGLIDSDYQGQLMISLWNRSDQDFYVNPHDRVAQII
FVPIIRPCFLLVKNFNETSRSKKGFGHSGVSGVI
>DUT_CANAL
MTSEDQSLKKQKLESTQSLKVYLRSPKGKVPTKGSALAAGYDLYSAEAATIPAHGQGLVS
TDISIIVPIGTYGRVAPRSGLAVKHGISTGAGVIDADYRGEVKVVLFNHSEKDFEIKEGD
RIAQLVLEQIVNADIKEISLEELDNTERGEGGFGSTGKN

Asia University Bioinformatics: assignment

Name: ________________________ Class: _____________________

Point your browser to the NCBI home page.

1. What are the GenBank common names of the following species?
   a. Acer rubrum         common name: _____________
   b. Orycteropus afer    common name: _____________
   Hint: point your browser to the NCBI home page and use Taxonomy to search.
2. Use PubMed to find out how many references are related to the protein "mitochondrial cytochrome b".
   Number of items? __________ items
3. How many items are written by the author Yang on mitochondrial cytochrome b?
   Hint: search PubMed using fields.
   __________ items
4. Retrieve the following three protein sequences in FASTA format:
   a. horse pancreatic ribonuclease
   b. minke whale pancreatic ribonuclease
   c. kangaroo pancreatic ribonuclease
   [Hint: Use "pancreatic ribonuclease" as your keyword to search on the SWISS-PROT home page.
   You may find more than one sequence for each species; use the search results from SWISS-PROT, not from TrEMBL.]

   Fill in the details for the protein, pancreatic ribonuclease, in the table below.

   Species                     Horse       Minke whale   Kangaroo
   Primary accession ID        ________    ________      ________
   Length (amino acids)        ________    ________      ________
   Molecular weight            ________    ________      ________
   Number of DISULFID bonds    ________    ________      ________

5. Use the three sequences from question 4 to determine which two of these species are most closely related by doing a multiple sequence alignment (MSA).
   [Hint: Point your browser to the PIR web site (http://pir.georgetown.edu) and do the ClustalW alignment. Then, from the TREE VIEW result, determine which two species are most closely related. A smaller evolutionary distance implies they are closer.]
   A. horse and minke whale are most closely related
   B. horse and kangaroo are most closely related
   C. kangaroo and minke whale are most closely related
   Answer: A or B or C
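The hint for question 5 can be sketched in Python as follows. This is an illustration only, assuming the tree view reports one branch length per sequence measured from a shared internal node, which is the simplest shape for three sequences. The labels seq_A, seq_B, seq_C and all of their branch lengths are made-up placeholders, not results for the ribonuclease sequences. The distance between two sequences is taken as the sum of the branch lengths on the path joining them, as in the DUT_CANAL / DUT_BPT5 example in the notes.

```python
from itertools import combinations


def pairwise_distances(branch_lengths):
    """Distance between two leaves = sum of their two branch lengths.

    Assumes every leaf hangs directly off one shared internal node.
    """
    return {
        (a, b): branch_lengths[a] + branch_lengths[b]
        for a, b in combinations(sorted(branch_lengths), 2)
    }


def closest_pair(branch_lengths):
    """Return the pair of leaves separated by the smallest distance."""
    distances = pairwise_distances(branch_lengths)
    return min(distances, key=distances.get)


# Placeholder branch lengths, invented for this example.
example = {"seq_A": 0.10, "seq_B": 0.15, "seq_C": 0.40}

distances = pairwise_distances(example)
for pair, d in sorted(distances.items(), key=lambda item: item[1]):
    print(pair, round(d, 5))

print("closest:", closest_pair(example))
```

With real data, replace the placeholder dictionary with the branch lengths read from the TREE VIEW output; the pair with the smallest sum is the most closely related.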