Practical 2
Exploring bioinformatics software, tools and techniques

Introduction

During the first two practical sessions, you explored a number of biological databases and acquired some database searching skills. Options for finding these databases include, but are not limited to, (1) searching the Internet with a general-purpose search engine such as Google, using well-chosen keywords, (2) searching PubMed for publications about databases, and (3) consulting catalogs of databases, such as the Nucleic Acids Research annual Database Issue.

In this practical, you will explore a catalog of available bioinformatics tools and apply some of these tools to extract useful information from biological data.

Most tools offer a number of options to select when setting up your analysis. The recommended options are usually selected as the default settings. However, they may not be the best options for the problem you are addressing, so users of bioinformatics tools should explore the available options carefully before performing any analysis. For the purpose of today’s exercise, we will stick to the default settings.

Some tools are available at multiple servers, and the functionality of the software may differ slightly from one server to another. For this practical, we will focus on software hosted at NUS.

Objectives in General

By the end of this practical you will be:
- familiar with the scope of bioinformatics tools/software and know where to search for tools/software relevant to your research area
- familiar with the application of bioinformatics tools to address biological problems

Problem Scenario

Your supervisor in the bioinformatics lab is currently working on p53 and he needs your help with the bioinformatics component of his research. With the skills acquired from the earlier practical, you have successfully aided your supervisor in surveying the existing p53 databases.
Now he requires your expertise in analyzing the nucleotide/protein sequence of p53 using currently available bioinformatics tools and techniques, so that he can infer relevant biological information from the primary sequence. For this purpose, your supervisor is introducing you to the Bioinformatics Links Directory at the University of British Columbia (http://www.bioinformatics.ca/links_directory/), which provides a list of available bioinformatics tools. You are required to browse through the resources listed to identify appropriate bioinformatics databases and tools that address your supervisor’s research needs.

Practical Exercise

The Bioinformatics Links Directory: Scope of bioinformatics software

New bioinformatics tools are developed on a regular basis. Some of these are catalogued at the Bioinformatics Links Directory. You will now browse through the resources listed there and try to identify tools for a particular analysis. Don’t spend too much time on this; just try to get an appreciation of the diversity of tools available.

1 Go to http://www.bioinformatics.ca/links_directory/. List any three categories of resources that you find particularly interesting.
2 For each of those three categories, identify two tools that look like they might do the same thing. For example, find two tools that allow you to perform a sequence comparison (e.g., multiple sequence alignment). Restrict this to software, i.e. do not list any databases (yes, the Bioinformatics Links Directory also catalogues databases because of its collaboration with NAR).
3 Are you able to find software suites that comprise a wide range of bioinformatics tools for protein and/or DNA sequence analysis?
(Hint: “Do-it-all Tools for Proteins” under the protein category, and “DNA and Genomic analysis” under the DNA category.)

Note: From your previous experience in surveying the databases from the Nucleic Acids Research annual Database Issue, you should recognize that the resources provided on the Bioinformatics Links Directory may not be comprehensive. If you cannot find what you are looking for, try searching PubMed and Google.

Preparing your input file for analysis

1 Using the database query skills that you have acquired, find the human p53 RefSeq protein sequence record (hint: search the NCBI Gene database for p53). Click on NP_000537.3, then change the view from GenBank format to FASTA format. The FASTA format has a one-line header, starting with the > character, that contains a description of the sequence record, followed by the sequence itself in one-letter code.
2 Save the sequence to a text file by clicking Send, selecting Complete record, File and Format: FASTA, then clicking Create file.
3 Open the file in Notepad or WordPad by right-clicking it and selecting Notepad or WordPad as the preferred application.
4 You will notice that the description line is too long. Most bioinformatics tools prefer shorter names, so where possible you should rename the descriptor to something more familiar to you, and note the original descriptor somewhere in case you need the accession number or other information. In this case, rename the description on the first line to “wt human p53 protein” and re-save your file. Make sure that the descriptive information (in this case “wt human p53 protein”) is all on the same line as the > symbol, and that the sequence starts on the next line.

Task

In this practical, you have been introduced to the Bioinformatics Links Directory, which contains an extensive list of bioinformatics tools, and the Emboss suite, which contains a variety of bioinformatics tools for protein and DNA sequence analysis.
Using the bioinformatics tools listed in the Bioinformatics Links Directory and/or the Emboss suite, analyze the p53 mRNA and protein sequences (use the default program settings and parameters) to retrieve the following information. (Hint: for help and information, always refer to the user manuals accompanying the bioinformatics software and tools.)

- PROSITE motifs present in a protein sequence (e.g., wt p53)
  Protein sequence patterns (or motifs) are a useful means of inferring the function(s) of proteins. The Emboss program patmatmotifs takes a protein sequence and compares it to the PROSITE motif database to scan for motifs present in the sequence. To access patmatmotifs, go to Protein Motifs, then patmatmotifs, in Wemboss. Use patmatmotifs to scan the wild-type p53 protein sequence with motifs from the PROSITE database. Paste your result below.

- Antigenic sites in a protein sequence
  Antigenic sites in a protein are those regions that can be recognized by antibodies. The Emboss program antigenic, found under Protein Motifs, predicts potential antigenic sites within a protein sequence. Use the wild-type p53 protein sequence as the query (default parameters) to predict antigenic sites.

- DNA binding residues in a protein sequence
  BindN takes an amino acid sequence as input and predicts potential DNA- or RNA-binding residues using support vector machines (SVMs). To access BindN, go to the “Protein” category in the Bioinformatics Links Directory and click on “Sequence Features”. Use the wild-type p53 protein sequence as the query (default parameters) to predict DNA binding residues in the sequence.

- Binding and interaction partners of a protein sequence
  3D-partner is a tool that predicts interacting partners and binding models of a query protein sequence through the analysis of structural complexes. To access 3D-partner, go to the “Protein” category in the Bioinformatics Links Directory and click on “Interactions, Pathways, Enzymes”.
  Use the wild-type p53 protein sequence as the query (default parameters) to predict its interaction partners.
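
Optional: scripting the header rename

The renaming step in “Preparing your input file for analysis” can also be done with a short script instead of a text editor. The following is a minimal sketch in Python, not part of the practical itself. The function name rename_fasta_header is hypothetical, and the header and sequence in the demo are illustrative placeholders, not the real NP_000537.3 record. It assumes a file containing a single FASTA record.

```python
from typing import Tuple


def rename_fasta_header(fasta_text: str, new_description: str) -> Tuple[str, str]:
    """Replace the header of a single-record FASTA text.

    Returns (new_text, original_header) so the original descriptor,
    including the accession number, can be noted somewhere safe.
    """
    lines = fasta_text.strip().splitlines()
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not start with a FASTA header line ('>')")
    original_header = lines[0][1:].strip()
    # The new description stays on the same line as '>';
    # the sequence starts on the next line.
    sequence_lines = [ln.strip() for ln in lines[1:] if ln.strip()]
    new_text = "\n".join([">" + new_description] + sequence_lines) + "\n"
    return new_text, original_header


if __name__ == "__main__":
    # Placeholder record for illustration only (not the real p53 sequence).
    example = ">NP_000537.3 example long description line\nACDEFGHIKL\nMNPQRSTVWY\n"
    renamed, original = rename_fasta_header(example, "wt human p53 protein")
    print("Original descriptor:", original)
    print(renamed, end="")
```

To use this on the file saved from NCBI, read its contents into a string, pass it to the function, and write the returned text to a new file so that the downloaded original is kept unchanged.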
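
Optional: how a PROSITE pattern scan works

To give a feel for what a motif scan such as patmatmotifs does, the sketch below translates a PROSITE-style pattern into a regular expression and reports every match in a sequence. It only illustrates the idea and makes no claim about how patmatmotifs is implemented. The function names prosite_to_regex and find_motifs are hypothetical, the test sequence is made up, and only the basic pattern syntax is handled (residues, x, [..] sets, {..} exclusions, repeat counts, and the < and > anchors). The example pattern N-{P}-[ST]-{P} is the PROSITE N-glycosylation site pattern.

```python
import re
from typing import List, Tuple


def prosite_to_regex(pattern: str) -> str:
    """Translate a basic PROSITE pattern, e.g. 'N-{P}-[ST]-{P}', into a regex."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):  # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        if element.endswith(")") and "(" in element:
            # Repeat count such as x(2) or x(2,4)
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":  # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element  # a single residue or an [ABC] set
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


def find_motifs(sequence: str, pattern: str) -> List[Tuple[int, str]]:
    """Return (1-based start position, matched text) for every match."""
    # The lookahead lets overlapping matches be reported as well.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence.upper())]


if __name__ == "__main__":
    toy_sequence = "MKNGSAPNPSANITQ"  # made-up peptide, not p53
    for position, match in find_motifs(toy_sequence, "N-{P}-[ST]-{P}"):
        print(position, match)
```

Short patterns like this one match by chance in many proteins, which is one reason to read the PROSITE documentation for each reported motif before drawing biological conclusions.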