Practical 1b Retrieving Biological Information from the Internet

Introduction

During the last practical, you explored a number of biological databases and acquired some database searching skills. As your needs for biological information change, for instance when your research interests change direction, you should ideally be able to find suitable databases on your own. Options for finding these databases include, but are not limited to:
(1) searching the Internet with a general-purpose search engine such as Google, using well-chosen keywords,
(2) searching PubMed for publications about databases, and
(3) identifying catalogues of databases.

The journal Nucleic Acids Research produces a special annual Database Issue, which is an example of a database catalogue. In this practical, you will inspect the 2011 Database Issue. Please note that not all biological databases are described in each Database Issue. In addition, using the database identification and searching skills you have acquired, you will learn how to gather information from appropriate databases in order to address various research needs.

Objectives in General

By the end of this practical you will:
- Carry out a brief survey of the databases catalogued by NAR
- Learn how to form an opinion on the completeness of certain databases
- Know how to independently identify useful databases for your research needs
- Learn how to answer research questions by making use of existing information in the literature and public databases

Practical Exercise

The Nucleic Acids Research (NAR) Database Issue: Identifying useful databases

1 Using either PubMed or Google, find the January 2011 Database Issue of Nucleic Acids Research. Alternatively, go to www.nar.oxfordjournals.org, then click on the “2011 Database Issue” link on the right-hand side.
2 Check that you are at the correct URL: http://nar.oxfordjournals.org/content/39/suppl_1

3 The Nucleic Acids Research (NAR) Database Issue is a special supplementary issue of the Nucleic Acids Research journal that is published yearly. It features descriptions of new as well as existing databases containing molecular biology data. Scroll through the list of articles. In which volume of the Nucleic Acids Research journal is the 2011 Database Issue published?

4 At the first article in this volume, “The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection”, Michael Y. Galperin and Guy R. Cochrane, pp D1-D6, click on the “Database Summaries” link. If you are interested in reading the Galperin and Cochrane (2011) paper, it is available in the Miscellaneous folder of the Student Workbin on the IVLE. Please use that copy rather than downloading the paper directly from the NAR website, as direct downloads incur unnecessary charges to the university.

5 Read the description of the Database Summaries page. Is it a comprehensive collection of all available biological databases?

6 After clicking on the “Database Summaries” link, click on the “Complete Category/Summary Paper List” link for a detailed listing (see below).

7 The Category/Summary Paper List is a catalogue of databases published in the Database Issue. From the list of databases at NAR, locate the databases that you used in the last practical. Are they all there? Do any of them appear more than once? Are there any differences between them? If yes, please describe some of the differences.
(Hint: read the database descriptions.)
(Example: “PDBe”, “PDBSum”, “PDB” under the “Protein Structure” section)
(Example: “NCBI Protein Database” under “Protein Sequence Database”)
(Example: “UniProt”, “UniParc”, “UniRef”, “SwissProt” under “Protein Sequence Database”)

8 Give a few examples of biological databases for the following data types.
Read the database summary for information about the data type.
- Protein sequence
- Nucleic acid sequence
- Protein-protein interactions
- Pathways
- Genome information
- Taxonomy information

9 Open a new webpage and go to p53.bii.a-star.edu.sg. What kind of information is shown there? Is this database in the NAR database list? What does this tell you about the comprehensiveness of the NAR database collection?

Problem Scenario

Having acquired some knowledge about online biological databases and basic database searching skills, this week you will learn how to find information on specific organisms.

The NCBI Coffee Break (http://www.ncbi.nlm.nih.gov/books/NBK2345/) is an interesting e-book containing articles that report recent biomedical discoveries and highlight the NCBI databases and tools used in the research process. Your supervisor has read the e-book and found the first two articles, on the Neanderthal man (“Neanderthal man lives on in some of us”) and the woolly mammoth (“From Africa to the Arctic”), particularly interesting and relevant to your training in bioinformatics. He has therefore assigned you to collect as much information as you can on these two extinct organisms. Your supervisor has decided that you should be independent enough to carry out the task, so he has instructed you to select appropriate databases and find the relevant information on your own.

Task

After going through the practical exercises last week, you will have acquired useful database identification and searching skills. Using these skills, find information on the Neanderthal man and the woolly mammoth, then write a short summary of these organisms (with proper citations and references to specific databases).

To find information on a specific organism, what kind of database should you search in?
(Hint: You can look through the NAR database catalogue to get the answer.)
E.g., go to the “NCBI Taxonomy Browser” and search for “Neanderthal man” and “woolly mammoth”.

Hint: Good starting points are the UniProt Taxonomy Browser (http://www.uniprot.org/taxonomy/) and the NCBI Taxonomy Browser (http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html/). In addition to providing information about specific organisms, these databases provide many cross-links to other relevant databases, including protein sequence, nucleotide sequence and genomic sequence databases.

Can you find and list other taxonomy databases from the NAR database catalogue? Do all the taxonomy databases listed contain information on all taxa (organisms) in the Tree of Life?

Ideally, for each organism, you should at least be able to answer the following questions:
– What is the scientific name of the organism?
– What is the full taxonomic lineage of the organism?
– Is the organism a prokaryote or a eukaryote?
– What is the TaxonID for the organism?
– Has the genome of the organism been sequenced?
– List the accession number(s) for the genome sequence of each organism. Observe the format of the accession numbers of the genome records. Do they share a common format? Which database are these records from?
– List the sequencing center for the full genome sequences.
– List the title and PubMed ID (if available) of the paper(s) reporting the genome sequencing results.
– List the name, protein sequence accession number and GeneID of some protein-coding genes present in each organism. Do both organisms share the same protein-coding genes?
– List the names and GeneIDs of some non-protein-coding genes in each organism. What are the functions of these genes?
– Provide a short description of the features of the organism. (Hint: You can search in Google or PubMed.)

Advanced Section

Your supervisor is happy with your progress in this exercise.
He has asked you to apply the skills and knowledge you have learnt in this practical to find out as much as you can about the organism Dictyostelium discoideum, which he intends to use in his experiments.

Ideally, you should at least be able to answer the following questions:
– What is the full taxonomic lineage of Dictyostelium discoideum?
– Is Dictyostelium discoideum a prokaryote or a eukaryote?
– What is the TaxonID for Dictyostelium discoideum?
– What is the common name of Dictyostelium discoideum?
– Has the genome of Dictyostelium discoideum been sequenced?
– List the accession number(s) for the full genome sequence(s).
– List the sequencing center(s) for the full genome sequences.
– List the title and PubMed ID (if available) of the paper(s) reporting the genome sequencing results.
– How many chromosomes does Dictyostelium discoideum have?
– List the names and GeneIDs of some of the genes present in the organism.
– Are there specialized database(s) that contain information specific to Dictyostelium discoideum?
– Provide a short description of the features of the organism. (Use PubMed or the descriptions in specialized databases.)
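
Optional: scripting a taxonomy lookup

The taxonomy searches in this practical are meant to be done in the web browser, but the same lookup can also be scripted. The following is a minimal Python sketch, not part of the original practical. It assumes NCBI's E-utilities ESearch service with the "taxonomy" database and JSON output. The function names build_taxonomy_search_url and fetch_taxon_ids are hypothetical helpers introduced only for this illustration. Whether a common name such as "woolly mammoth" returns a match depends on how NCBI indexes its names, so treat the results as a starting point and confirm them in the Taxonomy Browser.

```python
import json
import urllib.parse
import urllib.request

# Assumption: NCBI E-utilities ESearch endpoint (not mentioned in the practical).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_taxonomy_search_url(organism_name):
    """Return an ESearch URL that searches NCBI Taxonomy for organism_name."""
    params = {"db": "taxonomy", "term": organism_name, "retmode": "json"}
    # urlencode escapes spaces and special characters in the search term.
    return EUTILS_ESEARCH + "?" + urllib.parse.urlencode(params)


def fetch_taxon_ids(organism_name):
    """Query NCBI (needs network access) and return the matching TaxonIDs.

    Hypothetical helper: it reads the "idlist" field of the ESearch JSON
    reply, which holds the identifiers of the matching taxonomy records.
    """
    url = build_taxonomy_search_url(organism_name)
    with urllib.request.urlopen(url) as response:
        reply = json.load(response)
    return reply["esearchresult"]["idlist"]


if __name__ == "__main__":
    # Only builds and prints the URLs; no network request is made here.
    for name in ("Neanderthal man", "woolly mammoth", "Dictyostelium discoideum"):
        print(name, "->", build_taxonomy_search_url(name))
```

Pasting one of the printed URLs into a browser shows the raw search reply. Calling fetch_taxon_ids("Dictyostelium discoideum") performs the request from Python and returns a list of TaxonIDs as strings, which you can then check against the record shown in the NCBI Taxonomy Browser.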