Animal Biotechnology (CH365)

Bioinformatics Workbook
Aims and Objectives of Workbook:
The workbook aims to introduce students to some of the more easily accessible resources
available on the internet for sequence analysis and manipulation, in order to facilitate a
practical understanding of BioInformatics (BI).
Summary:
The field of BI deals with the use of computers to solve biological problems, most often
those that deal with genetics. It is one of the newest, most dynamic and fastest-expanding
fields in biology, due to the massive amounts of information being generated by the
Human Genome Project (HGP) and other genome projects. As a result, there is a
requirement for computerised databases to store, organise and index all the data and for
specialised tools to view and analyse it.
Growing Importance of BI:
Biology is undergoing a rapid transformation from a lab-based practical discipline to one in
which information science plays a central and critical role. Several emerging technologies
are of growing importance; a selection of the most important is summarised below:
Database-mining: this is the process by which the structure/function of an unknown
gene/protein is inferred from similar sequences identified in information already stored in
databases, most often from well-characterised model organisms.
Evolutionary Biology: BI offers the potential for investigating and even reconstructing the
evolutionary relationships between different organisms. This is based on the premise that
closely related organisms have sequences that are more similar than distantly related
organisms. In addition, a time of divergence from a shared sequence can be estimated. Of
particular interest are model organisms that have homologs of human disease genes.
These often give insight into the molecular basis of a disease to be further investigated.
Note: NCBI's COGs database has been designed to simplify evolutionary studies.
Protein Homology Modelling: This uses protein structures that have already been solved
experimentally to predict the 3-D structure of another protein that has a similar amino acid
sequence.
Genome Mapping: A computerised and easily navigated map of any given genome, in
which sequence information is correctly oriented, is critical to allow the easy localisation of
specific gene/nucleotide sequences. Note: NCBI's Map Viewer is a tool that allows a user
to view (when available) an organism's complete genome; integrated maps for each
chromosome; and/or sequence data for a genomic region of interest.
Basic Bioinformatics Fact File
Internet Based Resources:
There are many powerful BI tools freely available on the internet, some of which tend to be
complicated and quite difficult for beginners to use. The National Center for
Biotechnology Information (NCBI) has produced many of the easiest to use BI
resources and the URL of its home page is: http://www.ncbi.nlm.nih.gov
Note: The internet is incredibly dynamic and the diverse range of available BI
software/resources continues to grow. However, any resource that is to be used should be
checked periodically for ‘webrot’ to ensure that its URL has not changed and that the
resource is still available and free to use. The exercises and associated instructions
were developed in March 2004, so may well have to be modified depending on future
developments.
NCBI was established in 1988, when the growing importance and central position of
computerised information processing methods in biological research was recognised. The
organisation has been a pioneer of BI systems and remains of central importance. More
information is available at NCBI outreach and education:
http://www.ncbi.nlm.nih.gov/About/outreach/courses.html
NCBI Education Resources:
These are covered in easy to follow tutorials at URL: http://www.ncbi.nih.gov/Education/
Key summaries of the tools and critical concepts from these resources that are used in the
suggested BI workbook are presented below:
Databases: are large, organised bodies of data. They have associated software that
allows authorised users to update (add new entries or modify existing entries), query
(search), and retrieve data stored within the system. NCBI is the home of GenBank, one
of the largest genome database repositories in the world. A GenBank entry typically contains information
such as a contact name; the input sequence, with a description of the type of molecule; the
scientific name of the source organism from which it was isolated; and, often, literature
citations associated with the sequence.
Search and Retrieval of Database Information: Entrez is the search and retrieval
system that links many NCBI databases. In addition to allowing access to and retrieval of
specific information from a single database, Entrez also allows access of integrated
information from many NCBI databases.
Suggested Student BI Workbook:
The workbook uses Aequorea victoria Green Fluorescent Protein (GFP) sequence data in
a series of exercises that are designed to lead a student through the use of databases,
use of associated manipulation and analysis software and other useful BI resources.
Note: Students need a foundation in basic genetic principles before attempting the problems
that the BI tools are used to solve.
Background Information:
Also see: Green Fluorescence Protein Reference Fact File
Key references:
Prasher et al. (1992) cloned and sequenced genomic and cDNA GFP clones. A. victoria
GFP sequence. NCBI accession number: M62654
Li et al., Lampyris noctiluca luciferase gene. Direct Submission. NCBI accession number:
AY447204
Gurskaya et al. (2003) A colourless green fluorescent protein homologue from the nonfluorescent hydromedusa Aequorea coerulescens and its fluorescent mutants. Biochem. J.
373 (Pt 2), 403-408. NCBI accession number: AY151052
Ohmiya et al. (1995) Cloning, expression and sequence analysis of cDNA for the
luciferases from the Japanese fireflies, Pyrocoelia miyako and Hotaria parvula.
Photochem. Photobiol. 62 (2), 309-313. NCBI accession number: L39928
Format of Workbook:
The workbook is organised around four practical exercises that use databases and software
to solve problems associated with GFP sequence data:
1: Use of databases (a-c) to confirm the identity of clones, search for
similar nucleotide and translated protein sequences, and align
sequences
2: Identification of restriction enzyme recognition sites in a DNA sequence
3: Information and literature searches
4: Molecular visualisation of GFP
Exercise 1 (a-c): Use of databases (i) to confirm the identity of an 'unknown' sequence, (ii)
to search for nucleotide and translated protein sequences in databases and (iii) to align
sequences.
There are a number of BLAST (Basic Local Alignment Search Tool) programs that allow
easy sequence comparison between query sequences and those contained in a variety of
databases. There are a number of both nucleotide and protein databases that can be
searched with either DNA or amino acid query sequences and programs such as BLASTX
and TBLASTN can be used to cross-compare different types of query and database
sequences. BLAST returns sequence alignments with each alignment being assigned a
score that reflects the degree of similarity. The higher the score, the greater the degree of
similarity. Functional and evolutionary information can be inferred from well-designed
queries and alignments.
Three BLAST tutorial information guides are available at URL:
http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html.
Beginners should start with the QUERY tutorial, more experienced users could use the BLAST
tutorial and advanced users the PSI-BLAST tutorial.
Exercise 2: Identification of restriction enzyme recognition sites in a DNA sequence
Generation of accurate restriction maps of cloned DNA sequences is fundamental to
recombinant DNA technology applications. The BI exercise uses the WEBCUTTER
software, but there are a number of free alternatives available.
Note: This exercise could be linked to a laboratory-based practical where students could
be asked to predict restriction maps of cloned sequences and then to test them out by
undertaking the digests in the lab. The Restriction Digestion and Agarose Gel
Electrophoresis practical session could be easily adapted for such a purpose.
Exercise 3: Information and literature searches
The Internet is a rich source of information and, with the right information skills, it is a
highly effective tool for supporting teaching and learning. However, effective searching
and critical evaluation of the information found requires key information skills, without
which the overwhelming volume of information available, much of which is not peer
reviewed, can be frustrating and confusing.
The RDN Virtual Training Suite [URL: http://www.vts.rdn.ac.uk/] is a set of online and
interactive tutorials designed to help students, lecturers and researchers improve their
internet information-literacy and IT skills. Specifically, the tutorials cover (1) Types of
resources available on the internet, (2) How to search for information on the internet
efficiently and effectively, (3) How to evaluate the material found on the internet, and (4)
How to cite material found on the internet.
PubMed is a bibliographic database composed of literature primarily from the life
sciences. It contains links to full-text articles at participating publishers' Web sites, as well
as links to other third party sites such as libraries and sequencing centres and provides
access and links to the integrated molecular biology databases maintained by NCBI. URL
for a basic PubMed tutorial: http://www.nlm.nih.gov/bsd/pubmed_tutorial/m1001.html.
Exercise 4: Molecular visualisation of GFP
The ‘holy grail’ of BI is to turn a nucleotide sequence or amino acid primary sequence into
a 3D protein structure. This is not yet feasible due to the paucity of resolved 3D protein
structures from which reliable predictive modelling programmes can be developed and
validated. However, the 3D crystal structure of GFP has been resolved and its tertiary
structure can be explored using the Molecular Visualisation Freeware RasMol programme.
The URL at which this software and associated information can be downloaded is:
http://www.umass.edu/microbio/rasmol/. Protein explorer (a derivative of RasMol) and a
one hour tutorial are also available at: http://molvis.sdsc.edu/protexpl/qtour.htm.
Instructions are given in the workbook for simple introductory exploration of a GFP pdb file
with RasMol version 2.6.
Summary:
The field of bioinformatics (BI) deals with the use of computers to solve biological
problems, most often those that deal with genetics. It is one of the newest, and most
dynamic fields in biology and is expanding at an incredible rate, due to the massive
amounts of information being generated by the Human Genome Project (HGP) and other
genome projects. As a result, there is an absolute requirement for computerised
databases to store, organise and index all the data and for specialised software tools to
view, analyse and manipulate it. Consequently, biology is undergoing a rapid
transformation from a lab-based practical discipline to one in which information science
plays a central and critical role.
Learning Outcomes:
After directed learning and completion of the workbook you should be able to:
• Discuss the central importance of BI in modern biotechnology
• Use and discuss the main applications of a range of BI databases and informatics
resources on the internet
• Use BI resources practically, with reference to the specific example of GFP, to
help solve biological problems, plan laboratory-based practicals and process data
• Demonstrate developed and improved IT skills
Background
The Green Fluorescent Protein (GFP) was first isolated from the bioluminescent jellyfish
Aequorea victoria. The biological functions of bioluminescence are not clear, but several
classes of marine organisms display this trait.
Wild type (native) A. victoria GFP is just 238 amino acids long and its crystal structure has
revealed a highly compact “β-can” tertiary structure that confers exceptional stability. The
fluorophore forms post-translationally, following cyclisation and oxidation of the three
residues encoded by codons 65-67 (Ser - dehydroTyr - Gly: SYG).
Both genomic and cDNA clones were isolated and sequenced in 1992 and GFP has been
expressed in a wide range of cells and transgenic animals and plants. Mutation of the
fluorophore and other regions of the protein has produced “improved” GFP isoforms
including (i) (Tyr66>His, Tyr145>Phe), which emits blue instead of green light; (ii)
(Ser65>Thr, Phe64>Leu), which is significantly brighter than the wild type; and (iii)
“optimised” forms which now exist for most model systems.
Exercise 1: Use of databases
1a: Confirming the identity of an 'unknown' sequence
You are working as a technician in a biotechnology lab led by Prof. Strictus, who
supervises a research group investigating the structure, functions and applications of
luminescent and fluorescent proteins. She has received two clones of DNA sequences that
she will use in her research; one is a GFP clone from A. victoria and the second is a
luciferase clone from Lampyris noctiluca (firefly). You have been given some sequence
data from each clone [sequence 01 and sequence 02] and the next step is to confirm their
identity by searching a DNA database, as described below.
Note: Both sequences are 200bp in length and are from the coding strand of exons and
will be made available to you electronically:
Sequence 01:
5’
cattacctgtccacacaatctgccctttccaaagatcccaacgaaaagagagatcacatgatccttcttgagtttgt
aacagctgctgggattacacatggcatggatgaactatacaaataaatgtccagacttccaattgacactaaagtgt
ccgaacaattactaaaatctcagggttcctggttaaattcaggctg
3'
Sequence 02:
5’
aacagtttgtccataggagttgcaccaacaaatgatatttacaatgaacgtgaattatacaacagtttgtccataaag
aaattacctataattcagaaaattgttattctggattctcgagaggattatatggggaaacaatctatgtactcgttcatt
gaatctcatttacctgcaggttttaatgaatatgattctac
3’
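Before a sequence is pasted into a web form, it helps to strip out the 5'/3' end labels, spaces and line breaks that come with the electronic copy. The short Python sketch below shows one way this might be done. The function name clean_sequence is hypothetical and is not part of any NCBI tool, and the sketch assumes that only the sequence block itself is passed in (not the 'Sequence 01:' heading).

```python
import re


def clean_sequence(raw: str) -> str:
    """Return only the nucleotide letters (a, c, g, t, n), in lower case."""
    # Lower-case first so one pattern covers both cases, then drop everything
    # that is not a nucleotide letter: digits, quote marks, spaces, line breaks.
    return re.sub(r"[^acgtn]", "", raw.lower())


# The first 30 bases of Sequence 01, with the kind of labels and breaks
# that often survive a copy and paste.
raw_block = """5'
cattacctgt ccacacaatc
tgccctttcc
3'"""

sequence = clean_sequence(raw_block)
print(len(sequence), sequence)
```

Checking the length of the cleaned sequence against the expected length is a quick way to spot a truncated copy.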
The aim of the first exercise is to search a database for any other DNA sequences that are
similar (homologous) to these regions of the genes. This should confirm the identity of the
clones and also identify other, similar gene sequences. A commonly used method, a
blastn search, is outlined below.
1. Select a sequence and copy it.
2. The National Center for Biotechnology Information (NCBI) is the home of GenBank,
one of the largest genome database repositories in the world.
Go to the NCBI homepage at: http://www.ncbi.nlm.nih.gov/
3. Click on the word BLAST at the top of the page.
A BLAST search will identify DNA sequences in the database that are similar to the
query sequence entered as input data.
4. Under the heading Nucleotide Blast, click on the link for a standard nucleotide-nucleotide BLAST [blastn] search.
5. Paste your sequence in the box headed ‘Search'. Do not change any of the default
settings. Click on the BLAST! button. If there are a lot of queries and the server is busy,
you may have to wait several seconds [or even minutes] for the server to return the
search results.
6. Click on the FORMAT results button to retrieve your results, which include:
• A histogram indicating the size of the region of homology shared between the
query sequence and database entries retrieved
• A list of DNA sequences that are similar to the query sequence
• A sequence alignment showing the specific nucleotide positions that are shared
between the sequences.
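For readers who are curious about what the web form does behind the scenes, the search in steps 3 to 5 can also be expressed as a single web request. The Python sketch below only builds such a request and does not send it. The endpoint address and the parameter names (CMD, PROGRAM, DATABASE, QUERY) are assumptions based on NCBI's published BLAST URL API and should be checked against the current NCBI documentation; the function name blast_submit_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the NCBI BLAST URL API; verify before use.
BLAST_ENDPOINT = "https://blast.ncbi.nlm.nih.gov/Blast.cgi"


def blast_submit_url(query: str, program: str = "blastn", database: str = "nt") -> str:
    """Build (but do not send) a request that submits a query sequence."""
    params = {
        "CMD": "Put",          # submit a new search
        "PROGRAM": program,    # blastn compares nucleotide with nucleotide
        "DATABASE": database,  # assumed name of the nucleotide collection
        "QUERY": query,
    }
    return BLAST_ENDPOINT + "?" + urlencode(params)


url = blast_submit_url("cattacctgtccacacaatctgccctttcc")
print(url)
```

As with the web form, a submitted search is queued on the server, so results are collected in a second step once the search has finished.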
By moving the cursor over the histogram or scrolling down to view the significant
alignments you should see that this search retrieves different entries, which quickly
confirms the identity of each sequence. It also retrieves matches to sequences from
several other species in addition to cloned, mutant and engineered forms of each
sequence. Spend some time and fully explore the data produced.
Scroll down past the histogram and the list of DNA sequences to the sequence alignment
section, which provides:
• A brief description of the nucleotide sequences from the database
• The number of identical nucleotides within the matching region (e.g. identities
90/100, 90%)
• A schematic diagram aligning matching nucleotides between our query sequence
and the sequence in the database.
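The 'identities' figure reported for each alignment is simply the number of aligned positions at which both sequences carry the same nucleotide, divided by the length of the alignment. The Python sketch below illustrates the arithmetic on a made-up pair of aligned sequences; the function name percent_identity is hypothetical, and gaps are assumed to be written as '-'.

```python
def percent_identity(aligned_query: str, aligned_subject: str):
    """Return (matches, alignment length, percentage identity)."""
    if not aligned_query or len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must be non-empty and of equal length")
    pairs = zip(aligned_query.lower(), aligned_subject.lower())
    # A position counts as identical only if both letters match and
    # neither is a gap character.
    matches = sum(1 for q, s in pairs if q == s and q != "-")
    length = len(aligned_query)
    return matches, length, 100.0 * matches / length


# Made-up example: ten aligned positions with a single mismatch at the end.
query = "acgtacgtac"
subject = "acgtacgtaa"
matches, length, pct = percent_identity(query, subject)
print(f"Identities = {matches}/{length} ({pct:.0f}%)")  # Identities = 9/10 (90%)
```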
Click on the database entry code for any similar sequence. This will retrieve the complete
database entry including publication/reference information, position of genes in the
database entry, position of the coding sequences for the gene, the predicted polypeptide
sequence of the gene product and notes about the gene or DNA sequence.
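A complete database entry can also be requested directly by its accession number. The Python sketch below builds (but does not send) such a request for the A. victoria GFP entry M62654 listed in the key references. The endpoint and the parameter names (db, id, rettype, retmode) are assumptions based on NCBI's Entrez E-utilities and should be checked against the current NCBI documentation; the function name genbank_record_url is hypothetical.

```python
from urllib.parse import urlencode

# Assumed address of the Entrez 'efetch' utility; verify before use.
EFETCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def genbank_record_url(accession: str) -> str:
    """Build a request for the full GenBank flat-file entry of one accession."""
    params = {
        "db": "nuccore",    # nucleotide database
        "id": accession,
        "rettype": "gb",    # GenBank flat-file format
        "retmode": "text",
    }
    return EFETCH_ENDPOINT + "?" + urlencode(params)


print(genbank_record_url("M62654"))
```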
The first and closest alignment confirms the identity of each sequence.
Record the following information for each query sequence:
Sequence 01:
Gene:__________________________
Species of origin:_________________
Accession Number:_______________
Literature citation for sequence:_____
_______________________________
_______________________________
Sequence 02:
Gene:________________________
Species of origin:_______________
Accession Number:_____________
Literature citation for sequence:____
_____________________________
_____________________________
Note that several sequences have been recalled from non L. noctiluca
species that bear similarity to the L. noctiluca luciferase query sequence; however, there
are far fewer non A. victoria species that have similarity to the GFP query sequence. Also
note the large number of cloned, fusion, mutant and engineered forms recalled from the
GFP search.
Can you explain this data?
Name the non L. noctiluca species that bears the greatest similarity to the luciferase query
sequence (i.e. has the highest bit score, over 200, and should be red in the histogram)
A)
Also list the following:
i) The number of nucleotides in the query sequence that are identical to the
database entry
ii) The percentage identity
iii) The accession code for the full sequence of the database entry
A: i)
ii)
iii)
Name the non A. victoria species that bears the greatest similarity to the GFP query
sequence (i.e. has the highest bit score, over 200, and should be red in the histogram)
B)
Also list the following:
i) The number of nucleotides in the query sequence that are identical to the
database entry
ii) The percentage identity
iii) The accession code for the full sequence of the database entry
B: i)
ii)
iii)
Exercise 1b: Aligning similar sequences
We can also manipulate DNA sequences using bioinformatics. Two sequences may be
fully aligned over their entire length by using the bl2seq programme. The aim of this
exercise is to compare the full length GFP sequence from A. victoria with the sequence from
the species you identified in exercise 1a as having the highest similarity [and the full
length luciferase sequence from L. noctiluca with the sequence from the species you identified
in exercise 1a as having the highest similarity].
1: Return to the basic blast search page at NCBI.
[URL: http://www.ncbi.nlm.nih.gov/BLAST/]
2: Under the heading Special, click on the link for Align two sequences (bl2seq)
3: For sequence 1, enter the accession code for the complete sequence of A. victoria GFP
[which you recorded for 1a] and for sequence 2 enter the accession code for the complete
sequence from the most similar non A. victoria species [repeat this with L.
noctiluca luciferase and the most similar non L. noctiluca species].
4: Press align
An alignment diagram, which summarises areas of greatest similarity as a diagonal line, is
returned along with the full detailed alignment of both nucleotide and translated amino acid
sequences. Take some time to explore this data fully.
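The diagonal line in the alignment diagram can be understood with a simple 'dot plot': one sequence is laid along each axis and a mark is placed wherever a short word of bases is the same in both. Runs of matching words line up along a diagonal. The Python sketch below is an illustration of that idea only, not the method bl2seq itself uses; the function names and the two short test sequences are made up.

```python
def dot_plot_hits(seq1: str, seq2: str, word: int = 4):
    """Return (i, j) pairs where the word starting at seq1[i] equals the word at seq2[j]."""
    seq1, seq2 = seq1.lower(), seq2.lower()
    hits = []
    for i in range(len(seq1) - word + 1):
        for j in range(len(seq2) - word + 1):
            if seq1[i:i + word] == seq2[j:j + word]:
                hits.append((i, j))
    return hits


def render(seq1: str, seq2: str, hits) -> str:
    """Draw the hits as text: seq1 runs across the page, seq2 runs down it."""
    marked = set(hits)
    return "\n".join(
        "".join("*" if (i, j) in marked else "." for i in range(len(seq1)))
        for j in range(len(seq2))
    )


# Two made-up sequences that share a stretch of ten bases.
seq_a = "ttacgtacggat"
seq_b = "acgtacggatcc"
hits = dot_plot_hits(seq_a, seq_b)
print(render(seq_a, seq_b, hits))
```

A break or shift in the diagonal indicates a region where one sequence has an insertion, a deletion or no similarity to the other.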
Can you explain this data?
Exercise 1c: Searching for Protein Sequences in a Database
DNA sequences can also be compared at the protein level. blastx is a program that
translates DNA sequences into protein sequences and then searches databases for
protein sequences similar to the predicted protein sequence.
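The translation step can be illustrated with a few lines of Python. A DNA sequence can be read in six ways (three reading frames on each strand), and blastx translates the query in all six before searching. The sketch below uses the standard genetic code; the function names are hypothetical and the code is an illustration of the principle rather than the blastx implementation.

```python
from itertools import product

# Standard genetic code, listed in the conventional t, c, a, g order.
BASES = "tcag"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna: str) -> str:
    """Translate from the first base; '*' marks a stop, 'X' an unknown codon."""
    dna = dna.lower()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def reverse_complement(dna: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return dna.lower().translate(str.maketrans("acgt", "tgca"))[::-1]


def six_frames(dna: str) -> dict:
    """Translate all three reading frames on both strands."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


# The three fluorophore codons of wild type GFP (Ser-Tyr-Gly).
print(translate("tcttatggt"))  # SYG
for name, protein in six_frames("atgagtaaaggagaa").items():
    print(name, protein)
```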
The aim of this exercise is to run a blastx search on the L. noctiluca and A. victoria DNA
sequences. The method is outlined below.
1: Return to the basic blast search page at NCBI.
[URL: http://www.ncbi.nlm.nih.gov/BLAST/]
2: Under the heading Translated BLAST Searches, click on the link for Nucleotide
query-protein db [blastx]
3: As before, paste each DNA sequence in turn in the 'search' box, click on the
BLAST! button to perform the search and then click on the FORMAT button. Again, you may
have to wait several seconds for the server to return the search results.
For the non A. victoria species [and, in turn, the non L. noctiluca species] you identified in
exercise 1a as having the greatest similarity, record the percentage identity exhibited after
comparison of the translated protein sequences generated by BLASTX.
GFP
Species:
Identity:
Luciferase
Species:
Identity:
Compare the percentage identities generated by blastx (translated protein sequence) and
the basic blast search (exercise 1a, DNA sequence similarity).
Can you explain why similarity scores change after translation?
Exercise 2: Identification of Restriction Enzyme Recognition Sites in a DNA
sequence
Below is given a second sequence from the A. victoria GFP clone. This time it is a
sequence surrounding the fluorophore region.
5'
agtaaaggagaagaacttttcactggagttgtcccaattcttgttgaattagatggtgatgttaatgggcacaaatt
ctctgtcagtggagagggtgaaggtgatgcaacatacggaaaacttacccttaaatttatttgcactactggaaa
gctacctgttccatggccaacacttgtcactactttc tcttatggt gt
3'
[fluorophore: the codons tcttatggt, set apart by spaces in the last line of the sequence]
Prof Strictus wants you to develop an easy test, based on restriction digestion analysis, for
wild type (native) GFP clones. There is a range of GFP derivatives, including several that
have been mutated in the fluorophore region to give proteins with enhanced fluorescence
or differently coloured fluorescence. Therefore, a restriction enzyme that cuts at the
wild type GFP fluorophore sequence must be identified in order to develop the test.
In order to identify candidate restriction enzymes for use in a test, a restriction enzyme
map of the DNA sequence needs to be generated. A commonly used program is Webcutter.
1: Go to the Webcutter homepage at: http://www.firstmarket.com/cutter/cut2.html
2: Scroll down to the middle of the page and paste the DNA sequence in the large
box provided. Enter a working title for the sequence (e.g. your initials_GFP.seq)
3: Leave the default settings as they are and click the analyse sequence box.
Locate the fluorophore region on the restriction map and list all restriction enzymes that
could be used in a test digest of the GFP clones in order to detect the wild type
fluorophore sequence.
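At its simplest, a restriction mapping program scans the DNA sequence for each enzyme's recognition site and reports where it occurs. The Python sketch below illustrates the idea with four well-known enzymes and a made-up test sequence. It is not how Webcutter is implemented, it covers far fewer enzymes, and the function name find_sites is hypothetical. The four sites used are palindromic, so only one strand needs to be searched.

```python
# Recognition sites of four common enzymes (all palindromic).
ENZYMES = {
    "EcoRI": "gaattc",
    "BamHI": "ggatcc",
    "HindIII": "aagctt",
    "NcoI": "ccatgg",
}


def find_sites(dna: str, enzymes: dict = ENZYMES) -> dict:
    """Map each enzyme that cuts to the 1-based start positions of its sites."""
    dna = dna.lower()
    found = {}
    for name, site in enzymes.items():
        positions = []
        start = dna.find(site)
        while start != -1:
            positions.append(start + 1)         # report 1-based positions
            start = dna.find(site, start + 1)   # allow overlapping sites
        if positions:
            found[name] = positions
    return found


# Made-up test sequence containing one EcoRI site and one NcoI site.
print(find_sites("aaagaattcttccatggaa"))  # {'EcoRI': [4], 'NcoI': [12]}
```

Note that the positions reported here mark the start of each recognition site, not the exact position at which the enzyme cuts.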
Exercise 3: Literature searches
The Internet is a rich source of information and, with the right information skills, it is a
highly effective tool. However, effective searching and critical evaluation of the
information found is essential, as much of the material to be found on the internet is of
poor quality and often factually inaccurate. Without good information skills, the volume of
information available can be overwhelming and performing a literature search can become
frustrating and confusing.
The RDN Virtual Training Suite [URL: http://www.vts.rdn.ac.uk/] contains a set of online
and interactive tutorials designed to help you improve your internet information literacy and
IT skills. Specifically, the tutorials cover (1) Types of resources available on the internet,
(2) How to search for information on the internet efficiently and effectively, (3) How to
evaluate the material found on the internet, and (4) How to cite material found on the
internet. You are highly recommended to take a tutorial.
It is also possible to search PubMed for references using specific key words. Return to the
NCBI home page and click on PubMed. Type in key words associated with GFP to obtain
key references regarding engineered forms of GFP, including those with enhanced
fluorescence or different coloured fluorescence.
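Key words give better results when they are combined deliberately: a phrase in quotation marks is searched as a whole, AND narrows a search and OR widens it. The Python sketch below shows how such a search term might be assembled and turned into a request. The [Title/Abstract] field tag, the endpoint and the parameter names (db, term, retmax) are assumptions based on PubMed and the NCBI Entrez E-utilities and should be checked against current NCBI documentation; the function names are hypothetical and no request is actually sent.

```python
from urllib.parse import urlencode

# Assumed address of the Entrez 'esearch' utility; verify before use.
ESEARCH_ENDPOINT = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(phrase: str, any_of: list) -> str:
    """Require the exact phrase AND at least one of the alternative words."""
    alternatives = " OR ".join(f"{word}[Title/Abstract]" for word in any_of)
    return f'"{phrase}"[Title/Abstract] AND ({alternatives})'


def pubmed_search_url(term: str, max_results: int = 20) -> str:
    """Build (but do not send) a PubMed search request for the given term."""
    params = {"db": "pubmed", "term": term, "retmax": max_results}
    return ESEARCH_ENDPOINT + "?" + urlencode(params)


term = build_term("green fluorescent protein", ["mutant", "variant", "enhanced"])
print(term)
print(pubmed_search_url(term))
```

The same term can be typed directly into the PubMed search box.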
Using these resources, retrieve those references and find out whether the engineered
forms have had the nucleotide sequence at the fluorophore region altered. Carefully record
if each clone has had its fluorophore sequence altered and, if it has, record how and
whether it would still be digested by each of the restriction enzymes you identified in
exercise 2.
Using the data produced and information retrieved from exercises 2 and 3 write a
short report for Prof Strictus on the feasibility of developing a test for the
identification of wild type GFP clones using restriction digestion analysis.
Exercise 4: Molecular visualisation of GFP
The ‘holy grail’ of BI is to be able to deduce the 3D structure of a protein from its linear
nucleotide sequence or amino acid primary sequence. It is hoped that, from structural
analysis, functional information could also be deduced. Two analytical methods have
emerged:
(i) Pattern recognition. These techniques are built on the assumption that similar
traits can be identified in related proteins with similar sequences. That is, similarities are
detected between unknown sequences and known sequences whose structure and
function are known, and the possible structure is inferred from these. [Note: You have already
had direct experience of this in exercise 1].
(ii) Predictive modelling. This is a very futuristic method, which tries to predict the protein
3D structure directly from the amino acid sequence, whether the structure has been
resolved or not. It is very ambitious, given that the primary structure cannot reliably predict
secondary structure, i.e. it is very difficult to predict the folding of a protein. This method is
not yet feasible due to the lack of resolved 3D protein structures from which reliable
predictive modelling programmes can be developed and validated.
However, the 3D crystal structure of GFP has been resolved and its tertiary structure can
be explored using the molecular visualisation programme RasMol. Introductory
instructions for use are given below, but once you are familiar with the basic commands
you should spend time to experiment with, for example, different display modes and ‘what
if?’ scenarios.
1. Open up the Rasmol programme [you will be told where Rasmol will be located]
2. Under file in the menu bar, choose open and select the gfp pdb file [you will be
told where this file is located]
3. The default Rasmol settings are a basic wireframe model of the GFP molecule
displayed on a black screen. Spend some time exploring the 3D structure of GFP in
this format. Note that, by clicking on any area of the GFP molecule with the cursor and
keeping the mouse button depressed while moving the mouse, you may rotate the
molecule at will.
4. The following instructions will result in the display of a GFP molecule in which (i) the
beta ribbons forming the beta–can scaffold of the protein are coloured green, (ii) the
amino acids forming the fluorophore region are displayed as ball and stick models
and coloured red to increase contrast.
From the Rasmol command line box type in ‘select all’ and press return.
Now go to the Rasmol display screen and under display in the menu bar choose ribbons
The molecule of GFP should now be displayed in this format, coloured grey.
Return to the Rasmol command line box and type in ‘color green’ [note American spellings
are required] and press return. On the Rasmol display screen you should now have a GFP
molecule which is coloured green with the beta ribbons forming the beta-can structure
clearly shown. Spend some time exploring this structure.
Describe this structure: how many beta ribbons are there? Are there any helical
sections? Where are these?
Return to the Rasmol command line box and type in ‘select 65-67’ and press return.
Type in ‘color red’ and press return.
Go to the Rasmol display screen and, under display in the menu bar, choose ball and
stick.
On the display screen the molecule of GFP should now have the fluorophore highlighted in
red.
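Once the commands are familiar, the same view can be reproduced quickly by collecting them into a script. The Python sketch below simply assembles the command lines as text. The commands ‘wireframe off’, ‘wireframe 40’ and ‘spacefill 120’ are assumptions standing in for the ribbons and ball and stick menu choices and should be checked against the RasMol manual for the version in use; the function name gfp_view_script is hypothetical.

```python
def gfp_view_script(first_residue: int = 65, last_residue: int = 67) -> str:
    """Return RasMol command lines for green ribbons with a red fluorophore."""
    lines = [
        "select all",
        "wireframe off",   # assumed: hide the default wireframe display
        "ribbons",         # show the backbone as ribbons
        "color green",
        f"select {first_residue}-{last_residue}",
        "wireframe 40",    # assumed values for a ball and stick display
        "spacefill 120",
        "color red",
    ]
    return "\n".join(lines) + "\n"


print(gfp_view_script())
```

The printed lines can be typed into the command line box one at a time, after the gfp pdb file has been opened.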
Describe the appearance of the fluorophore. Can you relate its structure to its function,
i.e. explain how you think the structure of the amino acids allows the GFP protein to
fluoresce?
Bibliography and Further Reading:
Introduction to Bioinformatics. Cell and Molecular Biology in Action Series. Attwood and
Parry-Smith (1999). Longman.
Altschul, S. F. et al., Gapped BLAST and PSI-BLAST: a new generation of protein
database search programs. Nucleic Acids Res. 25:3389-3402 (1997).
Webcutter 1.0, copyright 1995, Max Heiman.
Prasher, DC., Eckenrode, VK., Ward, WW., Prendergast, FG., and Cormier, MJ (1992)
Primary structure of the Aequorea victoria green-fluorescent protein. Gene. 111(2): 229-233.