Bioinformatics Workbook

Aims and Objectives of Workbook: The workbook aims to introduce students to some of the more easily accessible resources available on the internet for sequence analysis and manipulation, in order to facilitate a practical understanding of bioinformatics (BI).

Summary: The field of BI deals with the use of computers to solve biological problems, most often those that deal with genetics. It is one of the newest, most dynamic and fastest-expanding fields in biology, due to the massive amounts of information being generated by the Human Genome Project (HGP) and other genome projects. As a result, there is a requirement for computerised databases to store, organise and index all the data and for specialised tools to view and analyse it.

Growing Importance of BI: Biology is undergoing a rapid transformation from a lab-based practical discipline to one in which information science plays a central and critical role. Emerging technologies are of growing importance; a selection of the most important is summarised below.

Database-mining: This is the process by which the structure/function of an unknown gene/protein is inferred from similar sequences identified in information already stored in databases, most often from well-characterised model organisms.

Evolutionary Biology: BI offers the potential for investigating and even reconstructing the evolutionary relationships between different organisms. This is based on the premise that closely related organisms have sequences that are more similar than those of distantly related organisms. In addition, a time of divergence from a shared sequence can be estimated. Of particular interest are model organisms that have homologs of human disease genes. These often give insight into the molecular basis of a disease to be further investigated. Note: NCBI's COGs database has been designed to simplify evolutionary studies.
Protein Homology Modelling: This uses protein structures that have already been solved experimentally to predict the 3-D structure of another protein that has a similar amino acid sequence.

Genome Mapping: A computerised and easily navigated map of any given genome, in which sequence information is correctly oriented, is critical to allow the easy localisation of specific gene/nucleotide sequences. Note: NCBI's Map Viewer is a tool that allows a user to view (when available) an organism's complete genome; integrated maps for each chromosome; and/or sequence data for a genomic region of interest.

Basic Bioinformatics Fact File

Internet Based Resources: There are many powerful BI tools freely available on the internet, some of which tend to be complicated and quite difficult for beginners to use. The National Center for Biotechnology Information (NCBI) has produced many of the easiest to use BI resources and the URL of its home page is: http://www.ncbi.nlm.nih.gov

Note: The internet is incredibly dynamic and the diverse range of available BI software/resources continues to grow; however, resources intended to be used should be checked periodically for ‘webrot’ to ensure that the URL has not changed and that the resource is still available and free to use. The exercises and associated instructions were developed in March 2004, so may well have to be modified depending on future developments.

NCBI was established in 1988 when the growing importance and central position of computerised information processing methods in biological research was recognised.
The organisation has been a pioneer of BI systems, and more information is available at NCBI outreach and education: http://www.ncbi.nlm.nih.gov/About/outreach/courses.html

NCBI Education Resources: These are covered in easy to follow tutorials at URL: http://www.ncbi.nih.gov/Education/. Key summaries of the tools and critical concepts from these resources that are used in the suggested BI workbook are presented below.

Databases: These are large, organised bodies of data. They have associated software that allows authorised users to update (add new entries or modify existing entries), query (search), and retrieve data stored within the system. NCBI is the home of GenBank, one of the largest genome database repositories in the world. A GenBank entry typically contains information such as a contact name; the input sequence, with a description of the type of molecule; the scientific name of the source organism from which it was isolated; and, often, literature citations associated with the sequence.

Search and Retrieval of Database Information: Entrez is the search and retrieval system that links many NCBI databases. In addition to allowing access to and retrieval of specific information from a single database, Entrez also allows access to integrated information from many NCBI databases.

Suggested Student BI Workbook: The workbook uses Aequorea victoria Green Fluorescent Protein (GFP) sequence data in a series of exercises that are designed to lead a student through the use of databases, the associated manipulation and analysis software, and other useful BI resources. Note: A foundation in basic genetic principles is required in advance in order to solve the problems to which the BI tools are applied.
Background Information: Also see: Green Fluorescence Protein Reference Fact File

Key references:
Prasher et al. (1992) cloned and sequenced genomic and cDNA GFP clones. A. victoria GFP sequence, NCBI accession number: M62654
Li et al., Lampyris noctiluca luciferase gene. Direct Submission. NCBI accession number: AY447204
Gurskaya et al. (2003) A colourless green fluorescent protein homologue from the nonfluorescent hydromedusa Aequorea coerulescens and its fluorescent mutants. Biochem. J. 373 (Pt 2), 403-408. Accession number: AY151052
Ohmiya et al. (1995) Cloning, expression and sequence analysis of cDNA for the luciferases from the Japanese fireflies, Pyrocoelia miyako and Hotaria parvula. Photochem. Photobiol. 62 (2), 309-313. Accession number: L39928

Format of Workbook: The workbook is organised around 4 practical exercises that use databases and software to solve problems associated with GFP sequence data:
1: Use of databases (a-c) to confirm the identity of clones, to search for similar nucleotide and translated protein sequences, and to align sequences
2: Identification of restriction enzyme recognition sites in a DNA sequence
3: Information and literature searches
4: Molecular visualisation of GFP

Exercise 1 (a-c): Use of databases (i) to confirm the identity of an 'unknown' sequence and (ii) to search for nucleotide and translated protein sequences in databases and align sequences.
There are a number of BLAST (Basic Local Alignment Search Tool) programs that allow easy sequence comparison between query sequences and those contained in a variety of databases. Both nucleotide and protein databases can be searched with either DNA or amino acid query sequences, and programs such as BLASTX and TBLASTN can be used to cross-compare different types of query and database sequences. BLAST returns sequence alignments, with each alignment being assigned a score that reflects the degree of similarity.
The higher the score, the greater the degree of similarity. Functional and evolutionary information can be inferred from well-designed queries and alignments. Three BLAST tutorial guides are available at URL: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html. Beginners should start with the QUERY tutorial; more experienced users could use the BLAST tutorial and advanced users the PSI-BLAST tutorial.

Exercise 2: Identification of restriction enzyme recognition sites in a DNA sequence
Generation of accurate restriction maps of cloned DNA sequences is fundamental to recombinant DNA technology applications. The BI exercise uses the WEBCUTTER software, but there are a number of free alternatives available. Note: This exercise could be linked to a laboratory-based practical in which students are asked to predict restriction maps of cloned sequences and then to test them by undertaking the digests in the lab. The Restriction Digestion and Agarose Gel Electrophoresis practical session could be easily adapted for such a purpose.

Exercise 3: Information and literature searches
The Internet is a rich source of information and, with the right information skills, it is a highly effective tool for supporting teaching and learning. However, effective searching and critical evaluation of the information found requires key information skills, without which the overwhelming volume of information available, much of which is not peer reviewed, can be frustrating and confusing. The RDN Virtual Training Suite [URL: http://www.vts.rdn.ac.uk/] is a set of online and interactive tutorials designed to help students, lecturers and researchers improve their internet information-literacy and IT skills. Specifically, the tutorials cover (1) Types of resources available on the internet, (2) How to search for information on the internet efficiently and effectively, (3) How to evaluate the material found on the internet, and (4) How to cite material found on the internet.
PubMed is a bibliographic database composed of literature primarily from the life sciences. It contains links to full-text articles at participating publishers' Web sites, as well as links to other third party sites such as libraries and sequencing centres, and provides access and links to the integrated molecular biology databases maintained by NCBI. URL for a basic PubMed tutorial: http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html.

Exercise 4: Molecular visualisation of GFP
The ‘holy grail’ of BI is to turn a nucleotide sequence or amino acid primary sequence into a 3D protein structure. This is not yet feasible due to the paucity of resolved 3D protein structures from which reliable predictive modelling programmes can be developed and validated. However, the 3D crystal structure of GFP has been resolved and its tertiary structure can be explored using the molecular visualisation freeware programme RasMol. The URL at which this software and associated information can be downloaded is: http://www.umass.edu/microbio/rasmol/. Protein Explorer (a derivative of RasMol) and a one hour tutorial are also available at: http://molvis.sdsc.edu/protexpl/qtour.htm. Instructions are given in the workbook for simple introductory exploration of a GFP pdb file with RasMol version 2.6.

Summary: The field of bioinformatics (BI) deals with the use of computers to solve biological problems, most often those that deal with genetics. It is one of the newest and most dynamic fields in biology and is expanding at an incredible rate, due to the massive amounts of information being generated by the Human Genome Project (HGP) and other genome projects. As a result, there is an absolute requirement for computerised databases to store, organise and index all the data and for specialised software tools to view, analyse and manipulate it.
Consequently, biology is undergoing a rapid transformation from a lab-based practical discipline to one in which information science plays a central and critical role.

Learning Outcomes: After directed learning and completion of the workbook you should be able to:
Discuss the central importance of BI in modern biotechnology
Use and discuss the main applications of a range of BI databases and informatics resources on the internet
Use BI resources practically, with reference to the specific example of GFP, to help solve biological problems, plan laboratory-based practicals and process data
Demonstrate developed and improved IT skills

Background
The Green Fluorescent Protein (GFP) was first isolated from the bioluminescent jellyfish Aequorea victoria. The biological functions of bioluminescence are not clear, but several classes of marine organisms display this trait. Wild type (native) A. victoria GFP is just 238 amino acids long and its crystal structure has revealed a highly compact “β-can” tertiary structure that confers exceptional stability. The fluorophore forms post-translationally, following cyclisation and oxidation of the three residues at codons 65-67 (Ser-dehydroTyr-Gly: SYG). Both genomic and cDNA clones were isolated and sequenced in 1992 and GFP has been expressed in a wide range of cells and transgenic animals and plants. Mutation of the fluorophore and other regions of the protein has produced “improved” GFP isoforms including (i) (Tyr66>His, Tyr145>Phe), which emits blue instead of green light; (ii) (Ser65>Thr, Phe64>Leu), which is significantly brighter than the wild type; and (iii) “optimised” forms, which now exist for most model systems.

Exercise 1: Use of databases
1a: Confirming the identity of an 'unknown' sequence
You are working as a technician in a biotechnology lab supervised by Prof. Strictus, whose research group investigates the structure, functions and applications of luminescent and fluorescent proteins.
She has received two clones of DNA sequences that she will use in her research; one is a GFP clone from A. victoria and the second is a luciferase clone from Lampyris noctiluca (firefly). You have been given some sequence data from each clone [sequence 01 and sequence 02] and the next step is to confirm their identity by searching a DNA database, as described below. Note: Both sequences are 200bp in length, are from the coding strand of exons, and will be made available to you electronically:

Sequence 01:
5’
cattacctgtccacacaatctgccctttccaaagatcccaacgaaaagagagatcacatgatccttcttgagtttgt
aacagctgctgggattacacatggcatggatgaactatacaaataaatgtccagacttccaattgacactaaagtgt
ccgaacaattactaaaatctcagggttcctggttaaattcaggctg
3’

Sequence 02:
5’
aacagtttgtccataggagttgcaccaacaaatgatatttacaatgaacgtgaattatacaacagtttgtccataaag
aaattacctataattcagaaaattgttattctggattctcgagaggattatatggggaaacaatctatgtactcgttcatt
gaatctcatttacctgcaggttttaatgaatatgattctac
3’

The aim of the first exercise is to search a database for any other DNA sequences that are similar (homologous) to these regions of the genes. This should confirm the identity of the clones and also identify other, similar gene sequences. A commonly used method, a blastn search, is outlined below.
1. Select a sequence and copy it.
2. The National Center for Biotechnology Information (NCBI) is the home of GenBank, one of the largest genome database repositories in the world. Go to the NCBI homepage at: http://www.ncbi.nlm.nih.gov/
3. Click on the word BLAST at the top of the page. A BLAST search will identify DNA sequences in the database that are similar to the query sequence entered as input data.
4. Under the heading Nucleotide Blast, click on the link for a standard nucleotide-nucleotide BLAST [blastn] search.
5. Paste your sequence in the box headed ‘Search'. Do not change any of the default settings. Click on the BLAST! button.
If there are a lot of queries and the server is busy, you may have to wait several seconds [or even minutes] for the server to return the search results.
6. Click on the FORMAT results button to retrieve your results, which include:
A histogram indicating the size of the region of homology shared between the query sequence and the database entries retrieved
A list of DNA sequences that are similar to the query sequence
A sequence alignment showing the specific nucleotide positions shared between the sequences.

By moving the cursor over the histogram or scrolling down to view the significant alignments you should see that this search retrieves different entries, which quickly confirms the identity of each sequence. It also retrieves matches to sequences from several other species in addition to cloned, mutant and engineered forms of each sequence. Spend some time fully exploring the data produced.

Scroll down past the histogram and the list of DNA sequences to the sequence alignment section, which provides:
A brief description of the nucleotide sequences from the database
The number of identical nucleotides within the matching region (e.g. identities 90/100, 90%)
A schematic diagram aligning matching nucleotides between our query sequence and the sequence in the database.

Click on the database entry code for any similar sequence. This will retrieve the complete database entry including publication/reference information, position of genes in the database entry, position of the coding sequences for the gene, the predicted polypeptide sequence of the gene product and notes about the gene or DNA sequence. The first and closest alignment confirms the identity of each sequence.
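The identities figure quoted above (e.g. identities 90/100, 90%) is the number of matching positions divided by the length of the aligned region. The following is a minimal sketch in Python of that arithmetic only, assuming two sequences that have already been aligned to equal length with '-' marking gaps. The function name and the toy alignment are illustrative assumptions and are not BLAST output.

```python
def percent_identity(aligned_a, aligned_b):
    """Return (identical positions, alignment length, percent identity)."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    # A position counts as identical only if both letters match and
    # neither is a gap character.
    matches = sum(
        1
        for a, b in zip(aligned_a.lower(), aligned_b.lower())
        if a == b and a != "-"
    )
    length = len(aligned_a)
    return matches, length, 100.0 * matches / length


# Hypothetical toy alignment: one gap and one mismatch in 20 positions
query = "cattacctgt-cacacaatc"
subject = "cattacctgtccacagaatc"
matches, length, pct = percent_identity(query, subject)
print(f"Identities = {matches}/{length} ({pct:.0f}%)")  # Identities = 18/20 (90%)
```

Counting the gap as a non-identical position within the alignment length is an assumption of this sketch; it is worth checking against the figures BLAST reports for your own alignments.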
Record the following information for each query sequence:

Sequence 01:
Gene:__________________________
Species of origin:_________________
Accession Number:_______________
Literature citation for sequence:_____
_______________________________
_______________________________

Sequence 02:
Gene:________________________
Species of origin:_______________
Accession Number:_____________
Literature citation for sequence:____
_____________________________
_____________________________

Note that several sequences from non L. noctiluca species have been retrieved that bear similarity to the L. noctiluca luciferase query sequence; however, far fewer non A. victoria species have similarity to the GFP query sequence. Also note the large number of cloned, fusion, mutant and engineered forms retrieved by the GFP search. Can you explain these data?

Name the non L. noctiluca species that bears the greatest similarity to the luciferase query sequence (i.e. has the highest bit score, over 200, and should be red in the histogram)
A)
Also list the following:
i) The number of nucleotides in the query sequence that are identical to the database entry
ii) The percentage identity
iii) The accession code for the full sequence of the database entry
A: i) ii) iii)

Name the non A. victoria species that bears the greatest similarity to the GFP query sequence (i.e. has the highest bit score, over 200, and should be red in the histogram)
B)
Also list the following:
i) The number of nucleotides in the query sequence that are identical to the database entry
ii) The percentage identity
iii) The accession code for the full sequence of the database entry
A: i) ii) iii)

Exercise 1b: Aligning similar sequences
We can also manipulate DNA sequences using bioinformatics. Two sequences may be fully aligned over their entire length by using the bl2seq programme. The aim of this exercise is to compare the full length GFP sequence from A.
victoria with the species you identified from exercise 1a as having a sequence with the highest similarity [and the full length luciferase sequence from L. noctiluca with the species you identified in exercise 1a as having a sequence with the highest similarity].
1: Return to the basic blast search page at NCBI. [URL: http://www.ncbi.nlm.nih.gov/BLAST/]
2: Under the heading Special, click on the link for Align two sequences (bl2seq)
3: For sequence 1, enter the accession code for the complete sequence of A. victoria GFP [which you recorded for 1a] and for sequence 2 enter the accession code for the complete sequence from the most similar non A. victoria species [repeat this with L. noctiluca luciferase and non L. noctiluca species].
4: Press align

An alignment diagram, which summarises areas of greatest similarity as a diagonal line, is returned along with the full detailed alignment of both nucleotide and translated amino acid sequences. Take some time to explore these data fully. Can you explain these data?

Exercise 1c: Searching for Protein Sequences in a Database
DNA sequences can also be compared at the protein level. blastx is a program that translates DNA sequences into protein sequences and then searches databases for protein sequences similar to the predicted protein sequence. The aim of this exercise is to run a blastx search on the L. noctiluca and A. victoria DNA sequences. The method is outlined below.
1: Return to the basic blast search page at NCBI. [URL: http://www.ncbi.nlm.nih.gov/BLAST/]
2: Under the heading Translated BLAST Searches, click on the link for Nucleotide query-protein db [blastx]
3: As before, paste each DNA sequence in turn in the 'search' box, click on the BLAST! button to perform the search and then on the FORMAT button. Again, you may have to wait several seconds for the server to return the search results.

For the non A. victoria species [repeat this with L.
noctiluca species] you identified in exercise 1a as having the greatest similarity, record the percentage identity exhibited after comparison of the translated protein sequences generated by BLASTX.

GFP Species: Identity:
Luciferase Species: Identity:

Compare the percentage identities generated by blastx (translated protein sequence) and the basic blast search (exercise 1a, DNA sequence similarity). Can you explain why similarity scores change after translation?

Exercise 2: Identification of Restriction Enzyme Recognition Sites in a DNA sequence
Below is given a second sequence from the A. victoria GFP clone. This time it is a sequence surrounding the fluorophore region. The fluorophore codons (tcttatggt) are set off by spaces in the last line.

5'
agtaaaggagaagaacttttcactggagttgtcccaattcttgttgaattagatggtgatgttaatgggcacaaatt
ctctgtcagtggagagggtgaaggtgatgcaacatacggaaaacttacccttaaatttatttgcactactggaaa
gctacctgttccatggccaacacttgtcactactttc tcttatggt gt
3'

Prof Strictus wants you to develop an easy test, based on restriction digestion analysis, for wild type (native) GFP clones. There is a range of GFP derivatives, including several that have been mutated in the fluorophore region to give proteins with enhanced or differently coloured fluorescence. Therefore, a restriction enzyme that cuts at the wild type GFP fluorophore sequence must be identified in order to develop the test. To identify candidate restriction enzymes for use in a test, a restriction enzyme map of the DNA sequence needs to be generated. A commonly used program is Webcutter.
1: Go to the Webcutter homepage at: http://www.firstmarket.com/cutter/cut2.html
2: Scroll down to the middle of the page and paste the DNA sequence in the large box provided. Enter a working title for the sequence (e.g. your initials_GFP.seq)
3: Leave the default settings as they are and click the analyse sequence box.
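The kind of search a restriction mapping program performs can be pictured as a scan of the sequence for each enzyme's recognition site. The following is a minimal sketch in Python under stated assumptions: the enzyme table is a small illustrative subset chosen for this example (Webcutter itself checks a far larger set of enzymes), the function name is hypothetical, and only exact, non-degenerate sites are handled. Because every site listed is palindromic, scanning one strand is enough. The fragment is the last line of the sequence given above.

```python
# Illustrative subset only; all of these recognition sequences are palindromic.
ENZYMES = {
    "BamHI": "GGATCC",
    "EcoRI": "GAATTC",
    "HindIII": "AAGCTT",
    "NcoI": "CCATGG",
    "PstI": "CTGCAG",
}


def find_sites(sequence, enzymes=ENZYMES):
    """Map enzyme name -> 1-based start positions of its recognition site."""
    # Remove whitespace left over from copy and paste, then normalise case.
    seq = "".join(sequence.split()).upper()
    hits = {}
    for name, site in enzymes.items():
        positions = []
        start = seq.find(site)
        while start != -1:
            positions.append(start + 1)
            start = seq.find(site, start + 1)  # step by 1 to allow overlaps
        if positions:
            hits[name] = positions
    return hits


fragment = "gctacctgttccatggccaacacttgtcactactttc tcttatggt gt"
for name, positions in find_sites(fragment).items():
    print(name, positions)  # NcoI [11]
```

The positions reported are where each recognition site starts, not the exact cut position within the site, and none of the enzymes in this small table recognises the fluorophore codons themselves; the full Webcutter output is still needed for the exercise.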
Locate the fluorophore region on the restriction map and list all restriction enzymes that could be used in a test digest of the GFP clones in order to detect the wild type fluorophore sequence.

Exercise 3: Literature searches
The Internet is a rich source of information and, with the right information skills, it is a highly effective tool. However, effective searching and critical evaluation of the information found is essential, as much of the material to be found on the internet is of poor quality and often factually inaccurate. Without good information skills, the volume of information available can be overwhelming and performing a literature search can become frustrating and confusing. The RDN Virtual Training Suite [URL: http://www.vts.rdn.ac.uk/] contains a set of online and interactive tutorials designed to help you improve your internet information literacy and IT skills. Specifically, the tutorials cover (1) Types of resources available on the internet, (2) How to search for information on the internet efficiently and effectively, (3) How to evaluate the material found on the internet, and (4) How to cite material found on the internet. You are strongly recommended to take a tutorial.

It is also possible to search the literature for specific key words through NCBI. Return to the NCBI home page and click on PubMed. Type in key words associated with GFP to obtain key references regarding engineered forms of GFP, including those with enhanced fluorescence or different coloured fluorescence. Using these resources, retrieve those references and find out whether the engineered forms have had the nucleotide sequence at the fluorophore region altered. Carefully record whether each clone has had its fluorophore sequence altered and, if it has, record how and whether it would still be digested by each of the restriction enzymes you identified in exercise 2.
Using the data produced and information retrieved from exercises 2 and 3, write a short report for Prof Strictus on the feasibility of developing a test for the identification of wild type GFP clones using restriction digestion analysis.

Exercise 4: Molecular visualisation of GFP
The ‘holy grail’ of BI is to be able to deduce the 3D structure of a protein from its linear nucleotide sequence or amino acid primary sequence. It is hoped that, from structural analysis, functional information could also be deduced. Two analytical methods have emerged:
(i) Pattern recognition techniques, which are built on the assumption that similar traits can be identified in related proteins with similar sequences. That is, similarities are detected between unknown sequences and known sequences for which the structure and function are known, and the possible structure is inferred from this. [Note: You have already had direct experience of this in exercise 1].
(ii) Predictive modelling. This is a very futuristic method, which tries to predict the protein 3D structure directly from the amino acid sequence, whether the structures have been resolved or not. It is very ambitious, given that the primary structure cannot reliably predict secondary structure, i.e. it is very difficult to predict the folding of a protein. This method is not yet feasible due to the lack of resolved 3D protein structures from which reliable predictive modelling programmes can be developed and validated.

However, the 3D crystal structure of GFP has been resolved and its tertiary structure can be explored using the molecular visualisation programme RasMol. Introductory instructions for use are given below, but once you are familiar with the basic commands you should spend time experimenting with, for example, different display modes and ‘what if?’ scenarios.
1. Open up the RasMol programme [you will be told where RasMol will be located]
2.
Under file in the menu bar, choose open and select the gfp pdb file [you will be told where this file is located]
3. The default RasMol settings are a basic wireframe model of the GFP molecule displayed on a black screen. Spend some time exploring the 3D structure of GFP in this format. Note that by clicking on any area of the GFP molecule and keeping the mouse button depressed while moving the cursor, you may rotate the molecule at will.
4. The following instructions will result in the display of a GFP molecule in which (i) the beta ribbons forming the beta-can scaffold of the protein are coloured green and (ii) the amino acids forming the fluorophore region are displayed as ball and stick models and coloured red to increase contrast.
In the RasMol command line box type in ‘select all’ and press return.
Now go to the RasMol display screen and, under display in the menu bar, choose ribbons. The molecule of GFP should now be displayed in this format, coloured grey.
Return to the RasMol command line box, type in ‘color green’ [note American spellings are required] and press return.
On the RasMol display screen you should now have a GFP molecule which is coloured green with the beta ribbons forming the beta-can structure clearly shown. Spend some time exploring this structure.
Describe this structure: how many beta ribbons are there? Are there any helical sections? Where are these?
Return to the RasMol command line box, type in ‘select 65-67’ and press return.
Type in ‘color red’ and press return.
Go to the RasMol display screen and, under display in the menu bar, choose ball and stick. On the display screen the molecule of GFP should now have the fluorophore highlighted in red.
Describe the appearance of the fluorophore. Can you relate its structure to its function, i.e. explain how you think the structure of the amino acids allows the GFP protein to fluoresce?

Bibliography and Further Reading:
Introduction to Bioinformatics.
Cell and Molecular Biology in Action Series. Attwood and Parry-Smith (1999). Longman.
Altschul, S. F., Madden, T. L., Schäffer, A. A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D. J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25: 3389-3402.
Webcutter 1.0, copyright 1995, Max Heiman.
Prasher, D. C., Eckenrode, V. K., Ward, W. W., Prendergast, F. G. and Cormier, M. J. (1992) Primary structure of the Aequorea victoria green-fluorescent protein. Gene 111(2): 229-233.