Objectives in General - BioInformatics Centre

Practical 1a Exploring basic bioinformatics resources

Introduction - Bioinformatics and the Internet

Bioinformatics is a marriage of biology and information technology. We can think of it as an emerging science that deals with computer-based storage and analysis of biological data. It is rapidly becoming a discipline that impacts on all areas of the biomedical sciences, and bioinformatics skills are an increasingly important part of every biologist’s toolkit.

Simplistically, practical bioinformatics can be broken down into two major components:

- Biological databases that store large volumes of biological data in structured formats, together with the ability to query them intelligently to gain insight into biological problems and to develop and maintain new databases and information resources.

- Software (here generalized to “tools”) to query and analyze these data, and the design and development of the algorithms and applications that biologists need.

As far as the biologist is concerned, many, but not all, of these databases and tools can be found on the Internet - mostly through web interfaces or stand-alone software applications or packages that can be downloaded and installed on your computer.

Importantly, the bioinformatics state of the art changes rapidly; today’s gold standard might be tomorrow’s dinosaur (if you’ll excuse the mixed metaphor). Therefore, the modern biologist should know how to find and assess relevant data and tools, so that they can always equip themselves with the best bioinformatics resources.

Biological Databases

There are hundreds of biological databases accessible via the Internet. They store many different types of biological data, including but not limited to:

- Sequences: nucleotides, amino acids

- Structure

- Literature

- Diseases

- Biomolecular interactions; protein-protein, protein-ligand

- Metabolic and signaling pathways

- Function descriptions

- Immunogenicity

- Toxicity

Knowing how to use databases allows you to retrieve information far more rapidly than you could with any literature search. In addition, the data are generally formatted to enable easy analysis.

Databases can be general, or they can be specialized. For instance, Entrez Gene stores gene information for all fully-sequenced genomes. SGD (the Saccharomyces Genome Database) stores equivalent information, but only for baker’s yeast, Saccharomyces cerevisiae. Other specialist databases may focus only on a particular molecule type (e.g. protein kinases), or even on a particular molecule (e.g. p53).

There are often multiple databases for a single class of biological entity or concept. These tend to have their own distinguishing features – for instance the level of detail stored for the data, the query and export options, or data visualization options. Also, new databases are created as life sciences research becomes more sophisticated, and “new” data types emerge. Therefore, in order to navigate through this ever-evolving database landscape, what you need is a set of skills that will serve you for most databases you may encounter at any particular stage of your biological career.

Objectives in General

The main objective of this practical is to get you to begin exploring a number of biological databases.

By the end of this practical you will:

- Know multiple approaches to find biological data on the internet

- Know how to access subscribed journals in your institution using the library proxy server

- Know how to formulate simple as well as advanced queries of selected databases (including wildcard, Boolean and field-specific queries)

- Know how to assess the quality of a search in terms of concepts including specificity, true positives (relevant hits) and false positives (irrelevant hits)

- Gain some familiarity with the contents of selected databases

- Learn to take note of database identifiers

- Use database cross references to link to other databases

- Understand how biological information and knowledge are stored in databases which you can access online

- Learn how to source information found in online biological databases

Problem Scenario

In this practical, we will focus on tryptophanyl-tRNA synthetase as an example.

You have just joined a bioinformatics lab in UBD for your honours year project. For your project, you will be working on tryptophanyl-tRNA synthetases. As the first step of your research, you have been asked by your supervisor to find out all you can about tryptophanyl-tRNA synthetases by looking up databases available online.

Your supervisor understands that, being new to bioinformatics, you may not have any experience in database searching. He has therefore provided you with a list of relevant online databases, as well as the kind of data they contain, as a starting point:

– NCBI (National Center for Biotechnology Information) – molecular biology resource that provides the public with several useful bioinformatics analysis tools as well as biological databases, including:

o PubMed – literature

o RefSeq – reference (representative) sequences

o Gene – locus-centred gene information

o Protein – protein sequences

o OMIM – Online Mendelian Inheritance in Man – genetic disorders

URL: http://www.ncbi.nlm.nih.gov/

– UniProt – protein sequences and functional information

URL: http://www.uniprot.org/

– ENZYME – enzyme information

URL: http://au.expasy.org/enzyme/

– KEGG Pathways – biomolecular pathway diagrams

URL: http://www.genome.jp/kegg/kegg2.html

– PDB – three-dimensional biological structures

URL: http://www.rcsb.org/pdb/

Searching Literature Databases

Practical Exercise

1 From the list of databases provided, identify one that stores scientific literature.

2 Perform a simple text query for articles on tryptophanyl-tRNA synthetase against the literature database. Enter the search term Tryptophanyl-tRNA synthetase.

How many abstracts are returned?

How many of these are review papers?

3 Look at the 13th entry and notice the information presented.

Who are the authors?

What is the title?

What is the journal; which volume, issue (if any) and page numbers?

What is the PubMedID (PMID)?

How is this article related to tryptophanyl-tRNA synthetase? (Hint: Read the abstract by clicking on the hyperlinked journal title)

4 In which year was the oldest article among the returned results published?

(Hint: You may need to select the option to show 200 items per page and sort by publication date: Display Settings → Items per page set to “200” and Sort by “Pub Date”)

5 Click on the “Review” link under the “Filter your results” panel on the right of the page for review articles that match your search term.

What do you think is the difference between articles displayed when you click “All” and those displayed when you click “Review”?

6 From your search results, are there any articles that appear to be irrelevant to your search term? (These are known as false positives, as opposed to true positives, which refer to hits that are relevant.)

7 How would you make your search term more specific, with fewer false positives? (Hint: Refer to the PubMed tutorial on Boolean Logic and Phrase Searching at http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/. Also, see Appendix A)

8 Look for useful review article(s) that provide you with a comprehensive overview of tryptophanyl-tRNA synthetases and their functions.

9 For any one of the review articles that you have found, try downloading the full-text PDF by clicking on the link button at the upper right hand side. Are you able to download the full-text PDF?

(ATTENTION: For the next step, DO NOT proceed to download the PDF file. This will be demonstrated just ONCE on the projection system. Imagine all students downloading the same article!)

10 If you are unable to access the full-text PDF, modify the URL in the address bar on the journal’s page by adding “libproxy1.nus.edu.sg” immediately after the host name, as follows, and press Enter.

E.g. http://www.sciencedirect.com.libproxy1.nus.edu.sg/science?_ob=ArticleURL&_udi=B6VRJ-47DTDSV-2&_user=10&_rdoc=1&_fmt=&_orig=search&_sort=d&view=c&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=a0de8904fbea9d07501047a630d118ad

What does this do?

11 While searching for research or review articles, you would want to store and organize these bibliographic references and cite them in your own report or document. For this purpose, software tools for publishing and managing bibliographies such as EndNote and WizFolio have been developed and made available to students and researchers.

For this course, we strongly encourage students to use WizFolio, a web-based bibliographic management tool, to organize and cite references for their presentations and miniproject reports.

Create a WizFolio account at http://www.wizfolio.com by clicking on the “Sign Up Now” button, if you haven’t done so. For help in using WizFolio, visit http://help.wizfolio.com/.

Now, try adding the article above (in 10) to your WizFolio account. There are two main ways of doing so.

i) Add directly from PubMed by using WizAdd (you may not be able to do this in the lab because you do not have the administrator rights to install WizAdd; you can try this at home on your PC/laptop). In this approach, once you have added the article metadata as a record in WizFolio, you can find the article PDF by clicking on the “Locate PDF” function and then, once you have downloaded the PDF, use the “Upload Item File” feature to attach the PDF to the record.

ii) Download the PDF on your own using the “libproxy” shortcut. Upload the downloaded PDF to WizFolio using the “Add” → “Upload File(s)” option on the WizFolio webpage. For you to try this step, the PDF (Kisselev_1993_Biochimie.pdf) has been downloaded for you and is available in the student workbin, “Miscellaneous” folder. With this approach, the relevant metadata of the article will be extracted from PubMed by using the information in the PDF and will be appended to the record created in WizFolio for the uploaded PDF.

12 How would you further restrict your search to include only articles published from 2000 to the present? (Hint: Click on the “Limits” or “Advanced Search” link at the top of the page above the search box) How specific is your search now?

How many hits are returned now?

NOTE: For more tips on searching PubMed, please refer to the PubMed tutorial at http://www.nlm.nih.gov/bsd/disted/pubmedtutorial/.
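The refinements asked for in steps 7 and 12 (phrase searching, restricting to reviews, restricting by publication date) can also be written directly into the search box as a single query string. The sketch below only assembles such a string; it does not contact NCBI. It assumes PubMed's field tags [Title/Abstract], [pt] (publication type) and [dp] (date of publication), with 3000 as an open-ended upper year, and it assumes the E-utilities esearch address, none of which appear in this practical. The function names are made up for illustration.

```python
from urllib.parse import urlencode

# Assumed address of NCBI's E-utilities search service (not part of this
# practical); shown only to illustrate where a query string could be sent.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(phrase, reviews_only=False, from_year=None):
    """Assemble a PubMed query string from a phrase and optional limits."""
    # Double quotes ask for the exact phrase rather than separate words.
    parts = [f'"{phrase}"[Title/Abstract]']
    if reviews_only:
        # Publication-type tag: keep review articles only.
        parts.append("Review[pt]")
    if from_year is not None:
        # Date range; 3000 stands in for "up to the present".
        parts.append(f"{from_year}:3000[dp]")
    # Boolean operators are written in uppercase, as Appendix A recommends.
    return " AND ".join(parts)


def esearch_url(query, retmax=200):
    """Return the URL that would run the query (nothing is fetched here)."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return f"{ESEARCH_URL}?{urlencode(params)}"


query = build_query("tryptophanyl-tRNA synthetase",
                    reviews_only=True, from_year=2000)
print(query)
print(esearch_url(query))
```

Pasting the printed query string into the PubMed search box and comparing the hit count with your earlier simple text query is one way to see how each restriction changes the result set.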

Searching Protein Databases - RefSeq

Besides PubMed, NCBI also contains other biological databases, including protein sequence databases such as RefSeq and NCBI Protein.

1 Choose the Protein database from the “Search” dropdown list (see screenshot).

2 Do a simple text query on Tryptophanyl-tRNA synthetase.

3 Scroll through the first page of entries. Notice the database description that is in the format “alphanumeric identifier GI: numeric identifier”, e.g. NP_000001.1 GI: 1234567

4 Click on the first entry to view the full record.

What is the accession number of the record?

What is the version number of the record?

What is the GI number of the record?

What is the difference between

a. accession number and GI number?

b. version number and GI number?

Which organism is this Tryptophanyl-tRNA synthetase from?

What is the taxonomic lineage for this organism?

What is the taxonomy identifier (Taxon ID)?

How many amino acids (aa) constitute this protein?

What is the amino acid sequence? Obtain the sequence in FASTA format, as this will prepare your sequence for input into bioinformatics tools for analysis; the majority of them accept input sequences in FASTA format.

To get the sequence in FASTA format, click on the arrow button beside “Display Settings”, then select “FASTA” and click “Apply”.

Notice the specific features of the FASTA format. Describe them.

Go back to the full record by changing the display back to “GenPept (full)”.

Write down the accession number of the nucleotide sequence record from which this protein sequence is translated. (Hint: Look for the field DBSOURCE)

Which database does this link to?

What is the UniProt ID for the same amino acid sequence?

Which database does this link to?

Is it a protein or nucleotide sequence database?

What is the GeneID for this sequence?

Which database does this link to?

Is it a protein or nucleotide sequence database?

What is the EC number of tryptophanyl-tRNA synthetase?

Click on the hyperlinked EC number to view the corresponding ENZYME record.

Which database does this link to?

Write down the equation of the reaction catalysed by tryptophanyl-tRNA synthetase.

5 Go back to the original protein record, click on the hyperlinked UniProt accession number to view the corresponding UniProt record.

List 3 database cross-references in the UniProt record that are not present in the RefSeq record.
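Once you have a sequence in FASTA format (step 4), it can be read by a program with very little code. The following is a minimal sketch of a FASTA reader, assuming only the basic layout of a header line beginning with “>” followed by one or more sequence lines. The example record is hypothetical: the identifier is the placeholder used in step 3 and the residues are made up. The helper names are likewise invented for illustration.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:]
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def split_accession(header):
    """Split the first word of a header into (accession, version)."""
    identifier = header.split()[0]
    accession, _, version = identifier.partition(".")
    return accession, version


# Hypothetical record: placeholder identifier and made-up residues.
EXAMPLE = """>NP_000001.1 hypothetical example protein [Example organism]
MSTNEQRWAV
LKDGH
"""

for header, sequence in parse_fasta(EXAMPLE):
    accession, version = split_accession(header)
    print(accession, version, len(sequence))
```

Try it on the FASTA text you obtained from the Protein database and compare the reported length with the number of amino acids stated in the full record.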

Task

After going through the practical exercise, you would have acquired some basic database searching skills. Using these skills, explore the other databases introduced in the practical to source information on human tryptophanyl-tRNA synthetase, then write a short summary (with proper references to specific databases) on human tryptophanyl-tRNA synthetase.

Ideally, you should at least be able to answer the following questions:

– How many forms of human tryptophanyl-tRNA synthetase exist? How do they differ?

– Why do you think there is more than one form of synthetase?

For each form of human tryptophanyl-tRNA synthetase,

– What other names does it have?

– Which protein family does it belong to?

– What is the GeneID?

– Which chromosome is it on?

– How many transcript variants does it have? How do they differ?

– What is the identifier for the mRNA sequence? How long is the mRNA sequence?

– What are the identifier(s) for the product (protein)? How long are the protein sequence(s)?

– Which conserved domains are present in the protein sequences?

– What are the functions of tryptophanyl-tRNA synthetases? What chemical reaction does it catalyze?

– What are the KEGG pathways that it is involved in? Draw a diagram to show that you understand the big picture of the role of this very important enzyme.

– List all the codons associated with WARS.

– What would happen if a mutation caused the tRNA synthetase to recognise the wrong anticodon?

– What clinical uses or applications does it potentially have?

– What are the Protein Data Bank (PDB) identifier(s) for the 3D protein structure? List some examples.

– WARS is closely related to YARS (tyrosyl-tRNA synthetase). How did these synthetases evolve, and to what extent did their evolution determine the Genetic Code?

Advanced Work (optional)

For tryptophanyl-tRNA synthetase across different species, there are several aliases, including WARS, IFP53, GAMMA-2, TrpRS and trpS. Search each of these aliases against GenPept, RefSeq, GenBank and PDB and compare the number of hits obtained. Scan through your search results and answer the following questions:

– Are most of the hits true positives, i.e. are they relevant?

– Is each search term comprehensive? Does it retrieve all isoforms of tryptophanyl-tRNA synthetase across all species?

– Is it a good idea to carry out database searches using only one alias for a particular gene/protein? Why?
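One way to make the comparison above concrete is to judge each hit by hand as relevant or not, then compute the fraction of hits that are true positives for every alias (this fraction is often called precision). The sketch below assumes you have recorded your own judgements; the alias names and True/False values are placeholders, not real search results.

```python
def precision(labels):
    """Fraction of retrieved hits judged relevant (true positives / all hits)."""
    if not labels:
        return None  # no hits, so the fraction is undefined
    true_positives = sum(1 for relevant in labels if relevant)
    return true_positives / len(labels)


# Hypothetical relevance judgements (True = relevant hit), one list per alias.
judged_hits = {
    "alias_A": [True, True, False, True],
    "alias_B": [False, False, True],
    "alias_C": [],
}

for alias, labels in judged_hits.items():
    print(alias, len(labels), precision(labels))
```

A high fraction tells you that few hits are false positives, but it says nothing about relevant records the alias failed to retrieve, which is why the question of comprehensiveness is asked separately.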

Appendix A – Boolean Queries

The following refers to how Boolean operators are used to query NCBI databases using NCBI’s Entrez search engine. Given two search terms, e.g. p53 and cancer, you can combine them with the different Boolean operators AND, OR, or NOT to get different query results, as follows.

AND: both terms appear in the same record

For example: PubMed search with p53 AND cancer – finds all PubMed records where both words appear in the same document. The search does not find records with only p53, or only cancer.

OR: either term appears in the record

For example: PubMed search with p53 OR cancer – finds all PubMed records containing either word, including records where both words appear.

NOT: search term 1 BUT NOT search term 2 appears in the record.

For example: PubMed search with p53 NOT cancer – finds all PubMed records containing p53 but not cancer.

[Venn diagram: p53 – 12 records total; cancer – 15 records total; both terms – 5 records]

p53 AND cancer: returns how many records?

p53 OR cancer: returns how many records?

p53 NOT cancer: returns how many records?

Is there a possibility of getting false positives (irrelevant hits) when using the search terms p53 and cancer? Why?
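The three operators behave like set operations on the records matched by each term. The sketch below illustrates this with Python sets, using hypothetical record identifiers (not real PMIDs) and counts that deliberately differ from those in the exercise above. The function name combine is made up for illustration.

```python
def combine(first, second, operator):
    """Combine two sets of record identifiers with a Boolean operator."""
    if operator == "AND":
        return first & second   # records matched by both terms
    if operator == "OR":
        return first | second   # records matched by either term
    if operator == "NOT":
        return first - second   # records matched by the first term only
    raise ValueError(f"unknown operator: {operator}")


# Hypothetical record identifiers matched by each search term.
p53_records = {101, 102, 103, 104}
cancer_records = {103, 104, 105, 106, 107}

for operator in ("AND", "OR", "NOT"):
    hits = combine(p53_records, cancer_records, operator)
    print(f"p53 {operator} cancer:", sorted(hits))

# Records matched by both terms are counted once in an OR search.
both = combine(p53_records, cancer_records, "AND")
either = combine(p53_records, cancer_records, "OR")
assert len(either) == len(p53_records) + len(cancer_records) - len(both)
```

The same counting rule applies to the record totals in the exercise: the size of the OR result is the sum of the two totals minus the records they share.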

The Entrez search rules and syntax for using Boolean operators

Always use Boolean operators AND, OR, NOT in UPPERCASE (e.g., promoters OR response elements).

This is because most search engines only accept them in upper case. However, some engines accept them in both upper and lower case. To avoid miscommunication with the search engine, we highly recommend that you always use Boolean operators in CAPS.

Entrez processes all Boolean operators in a left-to-right sequence. The order in which Entrez processes a search statement can be changed by enclosing individual concepts in parentheses. The terms inside the parentheses are processed first as a unit and then incorporated into the overall strategy.

For example, the search statement p53 AND (MDM2 OR bax) is processed by Entrez as follows:

a First, all records that contain the word MDM2 or the word bax are identified, and duplicate records picked up by both words are removed.

b Then those records are searched for the term p53.

c Only the records from step a that also contain p53 are returned.
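The effect of parentheses can be illustrated with a toy evaluator that applies operators strictly from left to right, as described above. This is only a sketch under that stated rule, not Entrez's actual implementation; the record identifiers are hypothetical and the function name is made up. A parenthesised group is handled by evaluating it first and storing its result under its own name.

```python
def evaluate_left_to_right(tokens, index):
    """Evaluate [term, OP, term, OP, term, ...] strictly from left to right."""
    result = set(index[tokens[0]])
    # Pair each operator with the term that follows it.
    for operator, term in zip(tokens[1::2], tokens[2::2]):
        other = index[term]
        if operator == "AND":
            result &= other
        elif operator == "OR":
            result |= other
        elif operator == "NOT":
            result -= other
        else:
            raise ValueError(f"unknown operator: {operator}")
    return result


# Hypothetical record identifiers matched by each search term.
index = {"p53": {1, 2, 3}, "MDM2": {2, 4}, "bax": {3, 5}}

# Without parentheses: (p53 AND MDM2) is formed first, then OR bax.
flat = evaluate_left_to_right(["p53", "AND", "MDM2", "OR", "bax"], index)

# With parentheses: the group is processed first as a unit (step a),
# then combined with p53 (steps b and c).
index["(MDM2 OR bax)"] = evaluate_left_to_right(["MDM2", "OR", "bax"], index)
grouped = evaluate_left_to_right(["p53", "AND", "(MDM2 OR bax)"], index)

print(sorted(flat))     # includes bax records that do not mention p53
print(sorted(grouped))  # every record returned also matches p53
```

In this toy data the unparenthesised query returns a record that never mentions p53, which is exactly the kind of false positive that parentheses help to avoid.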

Hints: Click on the Details button to see how Entrez translated and executed your search strategy. Notice that Entrez uses an extended vocabulary when it translates some of your query terms. For instance, tumor gets translated into tumor, tumour or neoplasms.

The use of parentheses can change your search results significantly. See Writing Advanced Search Statements for more information on using Boolean Operators and Entrez Search Field Qualifiers: http://www.ncbi.nlm.nih.gov/entrez/query/static/help/Summary_Matrices.html#Search_Fields_and_Qualifiers

NCBI Minicourses

These are problem-based, hands-on practical courses relevant to this practical. Students are encouraged to try them out at home.

1 Entrez Quick Start (http://www.ncbi.nlm.nih.gov/Class/minicourses/entrez.html)

2 Entrez Gene Quick Start (http://www.ncbi.nlm.nih.gov/Class/minicourses/entrezgene.html)

3 GenBank Quick Start (http://www.ncbi.nlm.nih.gov/Class/minicourses/quickgenbank.html)

Additional Readings

1 PubMed Online Training (http://www.nlm.nih.gov/bsd/disted/pubmed.html)

2 Entrez Help (http://www.ncbi.nlm.nih.gov/bookshelf/br.fcgi?book=helpentrez&part=EntrezHelp#EntrezHelp.Entrez__the_Life_Sci)

3 Entrez Tutorial (http://www.ncbi.nlm.nih.gov/Entrez/tutor.html)

4 UniProt Tutorial (http://www.uniprot.org/help/text-search)

Gene: A searchable database of genes, focusing on genomes that have been completely sequenced and that have an active research community to contribute gene-specific data. Information includes nomenclature, chromosomal localization, gene products and their attributes (e.g., protein interactions), associated markers, phenotypes, interactions, and links to citations, sequences, variation details, maps, expression reports, homologs, protein domain content, and external databases.

Nucleotide Database: A collection of nucleotide sequences from several sources, including GenBank, RefSeq, the Third Party Annotation (TPA) database, and PDB. Searching the Nucleotide Database will yield available results from each of its component databases.

Information on different databases in NCBI can be found from: http://www.ncbi.nlm.nih.gov/guide/all/
