Bioinformatics Lab Assignment - the laboratory for genomics and

Bioinformatics Lab Assignment
What is bioinformatics?
Bioinformatics is a subdiscipline of Computational Biology. Computational Biology is an
interdisciplinary field that combines concepts and techniques from biology, physics, computer science,
linguistics, mathematics (probability and statistics), and organic chemistry. Its focus is to make predictions
about biological systems and to analyze biological data in an effort to provide more insight into how living
organisms function. For example, Computational Biology could be used to predict whether two proteins interact.
If the prediction is accurate, Computational Biology could then be used to analyze biological data
obtained from a wet lab experiment involving the proteins to understand how these proteins contribute to the
physiology of the organism. Computational Biology can be further broken down into Molecular Modeling and
Bioinformatics. (I will only discuss bioinformatics from here forward.)
Bioinformatics is the branch of Computational Biology that focuses on using the concepts and techniques from
several different fields to study and analyze DNA, RNA, and protein sequence information. The ultimate goal
is to learn how an organism's genome relates to its biology. This field provides the tools necessary for
biologists to understand how genetic information stored in an organism's DNA relates to the RNA and protein
sequences that are produced. Tools such as biological databases serve the purpose of storing and organizing
millions of sequences from hundreds of different organisms. Other tools allow scientists to compare sequences
for phylogenetic analysis or for identifying the function of a protein sequence. The goal of biologists is to
understand how living things work, and bioinformatics provides scientists with the tools to connect an
organism's genome to its biological functions.
The application of bioinformatics provides an opportunity to change the face of society. For example,
bioinformatics gives scientists the opportunity to analyze the genetic component of diseases, which can
lead to better treatments, cures, and even preventative tests for patients. Knowing
which genes are expressed by a particular organism can help scientists identify proteins that are important for a
particular biological pathway or are major players in the development of certain diseases. With a better
understanding of disease mechanisms, and using bioinformatics tools to identify and validate new drug targets
(proteins), more specific medicines that act on the cause of the disease, not its symptoms, can be developed.
In the future, doctors will be able to harness the power of “personalized medicine”, which will allow them to
analyze a patient's genetic profile and prescribe the best available drug therapy for that patient. Also, knowing
specific details of the genetic mechanisms of disease will help scientists develop diagnostic tests to measure a
person's susceptibility to different diseases in an effort to prevent the development of those diseases. Other applications,
such as using microbes to clean up toxic waste sites, improving the nutritional quality of plants, and
improving the health of livestock, can become a reality as bioinformatics is applied.
Using Bioinformatics
Bioinformatics has three major emphases:
1. Manage and organize biological information using databases.
2. Process, analyze, and visualize biological data and present it to communicate scientific findings.
3. Share biological information with the public using the Internet.
From these three responsibilities spring many different types of bioinformatics tools that scientists can use.
They range from searching the scientific literature and searching for genes or proteins to analyzing DNA
sequence data, analyzing protein sequence data, predicting the 3D structures of proteins, and more. Below is an
abbreviated list of the types of tools that can be used and the analyses that can currently be performed with
bioinformatics.
1. Public biological databases – Collections of biological data such as nucleic acid sequences, amino acid
sequences, published literature, biochemical pathways, etc., that can be searched by the public.
2. Sequence alignment tools – Tools that attempt to identify as many matching residues as possible, in order, between
two sequences (nucleic acid or amino acid). This is very useful for determining how closely related two sequences might
be.
3. Sequence searching – Tools used to compare a query sequence to millions of sequences in a database in an
effort to find highly similar sequences.
4. Gene prediction – Software that predicts open reading frames, genes, exon splice sites, promoter binding sites,
etc. from long continuous strings of nucleotides.
5. Multiple Sequence Alignment – Tools that align several sequences at one time. Useful for identifying
conserved functional domains in a group of related sequences and extracting information about a gene
family.
6. Phylogenetic Analysis – Studying the evolutionary relatedness of a group of sequences or organisms.
7. Protein Sequence Analysis – Calculating the isoelectric point, molecular weight, and peptide mass fingerprints, and
predicting secondary structure features and post-translational modification sites.
8. Protein structure prediction – Predicting the 3D structure of a protein to give insight into how the
protein may function, given the tight relationship between protein structure and function.
9. Whole genome analysis – Navigating through the genome and annotating the genome.
Biology and Evolution in Bioinformatics
The Central Dogma of Molecular Biology needs to be understood before getting involved in bioinformatics.
The central dogma states that DNA directs its own replication and its transcription into complementary strands
of RNA. Protein sequences are then created by translating the sequence of nucleotides in RNA into the
corresponding sequence of amino acids.
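
As a rough illustration of transcription and translation, here is a minimal Python sketch. It assumes the input is the coding (sense) strand of the DNA, read from its first base, and it uses the standard genetic code. The function names `transcribe` and `translate` are made up for this example and do not come from any particular tool.

```python
# Standard genetic code, written in the conventional T, C, A, G ordering.
# Each letter of AMINO_ACIDS is the amino acid for one codon; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA for a DNA coding strand (assumption: T is simply replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(coding_strand):
    """Translate DNA codon by codon from position 0, stopping at the first stop codon."""
    dna = coding_strand.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(transcribe("ATGGCCTGGTAA"))  # AUGGCCUGGUAA
print(translate("ATGGCCTGGTAA"))   # MAW
```

The sketch ignores details a real tool must handle, such as ambiguous bases, alternative genetic codes, and choosing the correct reading frame.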
Without understanding this principle, we could not understand the relationship between the nucleotides in
genes and the nucleotides in RNA or the amino acids in protein sequences. Simple bioinformatics tools can be
created to translate a gene into a protein sequence because we understand that codons correspond to specific
amino acids. Bioinformaticians are able to create software to predict genes because they understand the
relationship between genes and start and stop codons.
Another important principle that is used in bioinformatics is evolutionary theory. The theory of evolution is the
scientific model that describes the descent of all living organisms from a common ancestor. In this model,
changes in the genetic material within a species can bring about speciation, which results in two species that
are highly similar but not capable of interbreeding. Because two organisms share a common ancestor, many of
their genes are highly similar. Knowing this, genes can be classified as being homologous or non-homologous.
Homologous genes are two genes that are derived from a common ancestral gene. These genes can be
found within the same species or in different species and may share the same function.
Homologous genes are then subclassified into two groups, orthologous genes and paralogous genes.
Orthologous genes are derived from a common ancestor, are found in different species, and typically have
the same biochemical function and the same 3D structure. Paralogous genes are also derived from a common
ancestral gene, but are found within the same species and typically have different biochemical functions.
These concepts greatly aid scientists in identifying the possible functions of genes in organisms. Homologous
genes, whether they are orthologous or paralogous, are often highly similar in their sequence content. This
observation is important to bioinformatics because programs can be built that determine the similarity of
two sequences in an effort to identify the function of unknown genes. For example, if the function of
gene A from some organism is known, I can compare that gene to all the genes from a second organism whose
genes are all unknown. If I find that an unknown gene in the second organism is highly similar to gene A from the
first organism, then there is a strong possibility that the two genes share the same function. By knowing the
probable function of a gene, scientists have a better idea of how to design their experiments to ascertain the true
biochemistry of the unknown sequence.
Databases
One of the major emphases in bioinformatics is the efficient use of databases to store biological data. A
database is a collection of information that is organized and stored in an electronic format
that can be searched by a computer. A biological database is a collection of biological
information in electronic format. Biological databases are specifically designed so that biological information can be easily
stored, queried, updated, and retrieved.
There are essentially two types of databases: primary databases and secondary databases. Primary databases
hold primary sequence information, whereas secondary databases store the results of analyzing primary sequence
information. Primary databases contain three types of biological data: DNA, RNA, and amino acid sequence
information. All other databases contain information that arose from using bioinformatics analysis
tools to analyze this primary sequence information. For example, there are secondary databases that contain
information on protein domains, protein families, biochemical pathways, and taxonomy.
Any information about a sequence (protein or DNA) is stored in its very own database record. A database
record is an entry in the database that contains the sequence of a protein or gene and any annotations that are
related to it. An annotation is information about the protein or DNA sequence that is stored with the database
record. Annotations are pieces of information such as the sequence's function, references to scientific journals
describing its biochemistry, links to other databases that contain information related to the protein sequence,
descriptions of the 3D structure of the protein, etc. These records are very useful to scientists because they
provide important information that can help with their own research. For example, if a scientist is
interested in the function of adenine deaminase in Deinococcus radiodurans, they can use a database search to
find information about that protein and any proteins that are related to it or that interact with it.
There are different types of searches that can be performed to find a database record. One can search a
database using a simple text query, just as you might do at yahoo.com or google.com. The search engine
takes the entry from the query box and searches the database for any records that match the
user's entry within some degree of similarity. These matches are then returned to the user.
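
To make the idea of a database record and a text query concrete, here is a small Python sketch. It is only a toy model under stated assumptions: the `Record` fields, the accession strings, and the placeholder sequences are all hypothetical and are not taken from any real database, and real search engines do much more (indexing, ranking, approximate matching).

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A toy database record: a sequence plus a few annotations (hypothetical layout)."""
    accession: str
    description: str
    organism: str
    sequence: str
    annotations: dict = field(default_factory=dict)


def text_search(records, *terms):
    """Return the records whose text fields contain every search term (case-insensitive)."""
    hits = []
    for record in records:
        text = " ".join(
            [record.accession, record.description, record.organism]
            + [str(value) for value in record.annotations.values()]
        ).lower()
        if all(term.lower() in text for term in terms):
            hits.append(record)
    return hits


# Made-up records for illustration only; accessions and sequences are placeholders.
TOY_DATABASE = [
    Record("TOY0001", "Adenine deaminase", "Deinococcus radiodurans", "PLACEHOLDER"),
    Record("TOY0002", "Adenine deaminase", "Escherichia coli", "PLACEHOLDER"),
    Record("TOY0003", "DNA polymerase I", "Deinococcus radiodurans", "PLACEHOLDER"),
]

print(len(text_search(TOY_DATABASE, "adenine deaminase")))  # 2
print(len(text_search(TOY_DATABASE, "adenine deaminase", "Deinococcus radiodurans")))  # 1
```

Adding a second term narrows the result set, which is the same effect as refining a text query with an organism name.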
A second method for finding information is to perform a sequence similarity search. A sequence similarity
search is a technique that compares a query sequence to a database full of sequences to find sequences
similar to the query sequence. This is the most frequently used tool in bioinformatics because it helps scientists
identify the functions of proteins and protein domains. This is possible because many sequences
that are highly similar share similar biochemical functions and 3D structures. If the sequences are similar
enough to be classified as orthologous sequences (sequences that are found in different organisms and have the
same function), scientists can transfer what they know about one protein or gene to another.
Example – Text Search against a Biological Database
In this example you will perform a simple text search on the Expert Protein Analysis System (ExPASy) website
at us.expasy.org. This website is dedicated to providing the scientific community with well curated protein
databases and protein sequence analysis tools. Each record in this database contains a protein sequence, the
name of the protein, the organism the protein came from, plus more information and links to related information
in other databases. The text search tool that we are going to use is called the Sequence Retrieval System (SRS)
at us.expasy.org/srs5bin/cgi-bin/wgetz. Follow the steps below to perform the text query search.
1. Open a browser to http://us.expasy.org/srs5/
2. Click the START button.
3. On the new page, click the Continue button.
4. To the right of the drop box with “All Text” selected, you'll find a query box. Type “adenine
deaminase” in the query box.
5. Below the 4 query boxes, you'll find 4 drop boxes. Find the drop box labeled "Use View" with
“Short Description” selected. Change this selection to “Long Description”.
6. Press Do Query.
In seconds you should see a list of 49 entries (or records) that were found with our text search using just the
name of our protein. There are several columns in this list that I want to define below. (If we hadn't changed
“Short Description” to “Long Description”, we would have received a smaller number of columns.)
• RootLIBs – Contains the hyperlink to the database record for this particular sequence.
• acc – abbreviation for accession number. The accession number is a unique identifier assigned to a sequence
(like a social security number) and should remain linked with it forever.
• gen – abbreviation for gene name. The name(s) used to represent a gene. Because there can be more than one
name assigned to a gene, the curators at ExPASy make a distinction between the one which they believe
should be used as the official gene name and the other names, which are listed as "Synonyms".
• des – abbreviation for description. This contains general descriptive information about the stored sequence.
• org – abbreviation for organism. This contains the name of the organism that the protein was sequenced
from.
• sl – abbreviation for sequence length. This is a count of the number of amino acids that are in the protein.
We could search through this list along the org column for Deinococcus radiodurans, or we can refine
our search a little bit so that we receive fewer entries. Let's refine our search.
1. Click Query Form.
2. Type “adenine deaminase” in the first query box.
3. Type “Deinococcus radiodurans” in the second query box.
4. Make sure you will be viewing “Long Description”.
5. Click Do Query.
Now you can see that we have only 1 entry to choose from. Let's look at this database record. Click on the link
SWISS_PROT:ADEC_DEIRA to look at the database record.
The database record is divided into several sections. Not all database records will have all the same sections.
The basic sections that all database records should have are Entry Information, Name and Origin of the
Protein, References, and Sequence. Some of the extra sections that you should observe are the Comments,
Database Cross-References, Keywords, Features, and Additional Information From iProClass. Below I will
describe the purpose of each section in this database record. (Note that a majority of the headers in these
database records, e.g. “Entry name”, are clickable and will return a short description of the term.)
• Entry Information – database related information
• Name and Origin of the Protein – name of the protein, the organism from which it originated, and taxonomy
information
• References – Scientific journals that describe the biochemistry of the protein
• Comments – May contain the function of the protein and other miscellaneous information.
• Copyright – some legal stuff that is important only if you are a commercial entity
• Cross-references – links to many other databases across the Internet that have information related to this
protein, such as genetic sequence, 3D structure, protein domains, protein families, or other protein databases.
• Keywords – This provides information that can be used to generate indexes of the sequence entries based
on functional, structural, or other categories.
• Features – The FT (Feature Table) lines provide a precise but simple means for the annotation of the
sequence data. This entry doesn't have any.
• Sequence – The amino acid sequence of the protein.
• Additional Information from iProClass – iProClass is another annotated database. This is a link to all the
related information stored about this protein in that database.
The purpose of these database records is to store information about a particular gene or protein which can be retrieved to
aid scientists in their own research. For example, you can find the catalytic activity under the “Comments”
section of this database record. You can identify the lineage of this organism under the “Name and Origin of
the Protein” section.
Sequence Comparisons (Pairwise Sequence Alignments)
Sequence data is the most abundant type of biological data available electronically. Pairwise sequence
comparison (or pairwise sequence alignment) is the most essential technique in bioinformatics. This gives us
the ability to perform sequence similarity searches, build evolutionary trees, and identify characteristic features
of protein families. It is the primary means of linking biological function to the genome and for transferring
known information from one genome to another. This gives us a tool to determine if sequences are similar
enough to transfer information from one sequence to the next.
A pairwise sequence alignment algorithm is an algorithm that attempts to identify as many matching amino
acids (or nucleotides) as possible between two different protein (or DNA) sequences. These algorithms compare two
sequences by looking for a series of identical characters or character patterns that are in the same order in both
sequences. Identifying matching characters between two sequences results in an alignment that
is scored based on the number of matching and mismatching characters. Sometimes, alignments include gaps in
one sequence or the other in order to achieve a higher number of matching characters. If this occurs, the scoring
is also based on the number of gaps inserted in the alignment. The better the alignment, the higher the number
of matches and the lower the number of gaps and mismatches found in the alignment. For example, here is a
good alignment, a mediocre alignment, and a poor alignment between different amino acid sequences:
Example 1
seq 1: LARPGVLGLAEMMNYPGALGGDAGVWDILNAGRRSGKRLDGHDAGLGGRELLAYAAAGLE
       LARPGVLGLAEMMNYPG LGGDAGVWDILNAGRRSGKRLDGH AGLGGRELLAYAAAGL
seq 2: LARPGVLGLAEMMNYPGVLGGDAGVWDILNAGRRSGKRLDGHAAGLGGRELLAYAAAGLH
Example 2
seq 3: FLAPGFIDGHIHIGSNLLTPASFAAAVLPHGTTAVVAEPHEIVNVLGPAGLNWMLGAGPT
       +L PG IDGH+HI S+L++PA FA  VL  GTTAV+A+PHEI NV G AGL +ML A
seq 4: YLLPGLIDGHVHIESSLVSPAQFARLVLARGTTAVIADPHEIANVCGLAGLRYMLDATRD
Example 3
seq 5: D----------------VATFDPPAHWPTLQ-----MFPDQIVSGRAAPG-------SGD
                        VAT         +Q      F ++I               +
seq 6: QQTIQCKKLTEEDLLLKVATTKETVRCNVIQKQEIGTFTERITKEIPVENGLLQWQKANC
In the above examples, the two sequences being compared are on the top and bottom, while the middle line of
characters highlights the matching amino acids between the sequences. Before I comment on the alignments, let's
try to understand what the symbols are. Whenever both sequences have an identical amino acid at a position, the
amino acid symbol will appear in the middle line between both sequences. When there is a mismatch, an empty
space will appear between both lines. You'll notice a '+'. This represents a "similar" match. Some amino acids
are often substituted for each other without changing the function of the protein. Because tyrosine
(Y) and phenylalanine (F) are both aromatic amino acids, this would be considered a similar match in a pairwise
alignment and would receive a '+'. Last are the gaps, shown as '-'. Because the goal is to optimize the number of identical
matching amino acids, it is sometimes beneficial to include gaps in one of the two sequences. Gaps model
insertion or deletion events that have occurred while both sequences evolved from a common ancestor (assuming
that there was one).
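
To show how matches, mismatches, and gaps can be combined into a single score, here is a minimal Python sketch of a classic dynamic programming approach to global alignment (often called Needleman-Wunsch). It is an illustration only: the scoring values (+1 match, -1 mismatch, -2 per gap position) are arbitrary assumptions, and real protein alignment tools usually use a substitution matrix and more elaborate gap penalties.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Assumed scoring scheme (illustrative only): a fixed reward for an identical
    pair, a fixed penalty for a mismatched pair, and a fixed penalty per gap position.
    """
    # previous[j] holds the best score for aligning the processed prefix of a with b[:j].
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b is all gaps
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            current.append(max(
                previous[j - 1] + pair,  # align a[i-1] with b[j-1]
                previous[j] + gap,       # gap in b
                current[j - 1] + gap,    # gap in a
            ))
        previous = current
    return previous[-1]


print(global_alignment_score("GATTACA", "GATTACA"))  # 7: seven matches
print(global_alignment_score("AC", "AGC"))           # 0: two matches and one gap
```

The function reports only the best score. Recovering the alignment itself requires keeping the full table and tracing back through it.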
So what about these example alignments? Example 1 is a very good alignment. Alignments that have a high
degree of similarity, as in this example, give the scientist confidence that the two sequences have the
same function and structure. Example 2 is a mediocre alignment because it has several mismatching amino
acids. Example 3 is a terrible alignment. When performing sequence alignments, the goal is to align two
sequences so that they ideally have many matching amino acids and a minimal number of gaps. How can we
quantitatively evaluate the quality of these alignments to determine whether an alignment is scientifically interesting?
There are a few different types of scores that are used to determine if alignments are significant or not:
percent identity, percent similarity, alignment score, and expectation value (E-value).
1. Percent Identity – The number of identical matches in the alignment divided by the length of the alignment,
times 100. Columns in the alignment that include gaps are not scored in the calculation.
   1. Example 2: percent identity equals 63%
2. Percent Similarity – The sum of the identical matches and similar matches divided by the length of the
alignment, times 100. (Note that DNA pairwise alignments don't have this score.)
   1. Example 2: percent similarity equals 77%
3. Alignment score – An algorithmically computed score based on the number of matches, substitutions,
insertions, and deletions (gaps) within an alignment. Scores for matches and substitutions are obtained from
a scoring matrix (a 20x20 matrix that contains scores for matching any combination of amino acids), and gaps
are scored with gap penalties. Higher scores denote better alignments.
   1. Example 2: alignment score equals 77.8 bits
4. Expectation value (E-value) – Can be thought of as a prediction of the number of false positives one might
receive from a sequence similarity search. This value is the number of alignments with a given score or better
that one would expect to find by chance. The lower the E-value, the fewer false positives
are expected and the more confident you can be that the alignment did not arise by chance.
   1. Example 2: E-value equals 8e-14 = 8x10^-14
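
As a worked illustration of the percent identity calculation, here is a short Python sketch applied to the two sequences of Example 1. It makes one assumption that should be stated plainly: the denominator is the total number of alignment columns, and a column containing a gap never counts as a match. Different tools handle gap columns and the denominator differently, so treat this as one possible convention rather than the definitive formula.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of alignment columns in which both sequences carry the same residue.

    Assumption: the denominator is the number of alignment columns, and columns
    that contain a gap character never count as identical.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("alignment is empty")
    identical = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != gap
    )
    return 100.0 * identical / len(aligned_a)


# The two sequences from Example 1, which align without gaps.
SEQ1 = "LARPGVLGLAEMMNYPGALGGDAGVWDILNAGRRSGKRLDGHDAGLGGRELLAYAAAGLE"
SEQ2 = "LARPGVLGLAEMMNYPGVLGGDAGVWDILNAGRRSGKRLDGHAAGLGGRELLAYAAAGLH"

print(percent_identity(SEQ1, SEQ2))  # 95.0 (57 identical columns out of 60)
```

Percent similarity would be computed the same way, but it needs a rule (normally a scoring matrix) for deciding which non-identical pairs count as "similar".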
Sequence Similarity Searching
Sequence similarity search algorithms are among the most popular tools that use pairwise sequence comparison.
These algorithms are used to search through huge databases that contain millions of
sequences in an effort to find sequences similar to a query sequence. After a scientist enters a protein or DNA
sequence as the input to a sequence similarity search algorithm, it compares that sequence to a database of
sequences and then returns a list of sequences that were found to be similar to the query sequence. This is an
alternative to the text query search method.
One of the most popular similarity search algorithms is called BLAST. BLAST stands for “basic local
alignment search tool”, and we will be using this tool later in an example to find sequences from the database on
the ExPASy website. BLAST is the tool implemented at the ExPASy site that enables users to search
its massive database.
One of the reasons why this tool is popular is that it gives scientists a way to identify some key features of
the protein or gene that they are studying. As was mentioned earlier in this lab, orthologous sequences may
share biochemical functions and some important 3D structures, and are highly similar in their sequence content. So
scientists, in an effort to identify protein domains in their unknown protein sequence, its function, or
maybe its structure, can use the BLAST algorithm to search huge databases such as the one at the ExPASy site
to find database records that contain insightful information.
So what is the big picture? We have a database full of annotated protein sequences, meaning each sequence
also contains other related information such as its function, important functional domains, literature describing
its biochemistry, its 3D structure, and more. We also have an unknown protein sequence. We can use a sequence
similarity searching tool to find a highly similar sequence from some protein database. If the scores are good
enough (E-value, alignment score, percent identity, percent similarity), we can transfer what is known about the
annotated sequence to our sequence. It is important to understand that just because both sequences are similar
doesn't mean that they will, without a doubt, have the same function. This is simply a hypothesis. We are
saying that, until a wet lab experiment is performed to show that our unknown protein functions similarly to
this well annotated protein, we will, for the time being, assume that it does.
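
The search idea can be sketched with a toy example in Python. This is not BLAST: it simply ranks database sequences by how many short words (here, 3-letter substrings) they share with the query, which is only a crude stand-in for real alignment scoring and statistics. The function names and the tiny "database", built from sequences shown in the alignment examples earlier in this lab, are assumptions made for illustration.

```python
def kmers(sequence, k=3):
    """Return the set of all substrings of length k in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def rank_by_shared_kmers(query, database, k=3):
    """Rank database entries by the number of k-letter words shared with the query.

    Returns (name, shared_word_count) pairs, best first; entries sharing no
    words are dropped. A toy heuristic, not a real alignment score.
    """
    query_words = kmers(query, k)
    hits = []
    for name, sequence in database.items():
        shared = len(query_words & kmers(sequence, k))
        if shared > 0:
            hits.append((name, shared))
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Toy database built from sequences in the pairwise alignment examples.
TOY_DATABASE = {
    "seq 2": "LARPGVLGLAEMMNYPGVLGGDAGVWDILNAGRRSGKRLDGHAAGLGGRELLAYAAAGLH",
    "seq 4": "YLLPGLIDGHVHIESSLVSPAQFARLVLARGTTAVIADPHEIANVCGLAGLRYMLDATRD",
    "seq 6": "QQTIQCKKLTEEDLLLKVATTKETVRCNVIQKQEIGTFTERITKEIPVENGLLQWQKANC",
}
QUERY = "LARPGVLGLAEMMNYPGALGGDAGVWDILNAGRRSGKRLDGHDAGLGGRELLAYAAAGLE"  # seq 1

for name, shared in rank_by_shared_kmers(QUERY, TOY_DATABASE):
    print(name, shared)
```

As expected, the near-identical seq 2 comes out on top. A real tool would follow up each candidate with a proper alignment and report a score and an E-value.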
Example – Sequence Similarity Search
In this very simple example, let us assume that we have the amino acid sequence of some unknown protein from
a brand new organism, call it Thisainticus realicans (pronounced This-ain't-icus real-icans). We would like
to find out the possible function of this protein, find other organisms that might contain similar proteins,
or find some literature that discusses the biochemistry of a protein that is similar to our unknown. How would we
discover this information? We could perform a sequence similarity search in an effort to find some sequence
in the database that is highly similar to our unknown sequence. Based on its alignment score, percent identity
score, percent similarity score, and expectation value, we would like to determine whether our unknown protein
is similar enough to a protein found in the similarity search that we can transfer what is known about that
sequence onto our unknown sequence.
Follow the steps below to perform a sequence similarity search (or BLAST search) against the ExPASy protein
sequence database with an unknown protein sequence.
1. Open a browser to http://annogen.ouhsc.edu/module/lab/
2. Click the link Example – Sequence Similarity Search. This should reveal an amino acid sequence.
3. Open a second browser to http://us.expasy.org/tools/blast/
4. Copy and paste the amino acid sequence from the first browser to the sequence query box on the
ExPASy blast site.
5. Find the drop box labeled “Output format” to the right of the query box. Change it from “HTML” to
“NiceBlast”.
6. Now scroll all the way to the bottom of the screen. You should see a check box labeled “Filter the
sequence for low-complexity regions”. Uncheck this check box.
7. Click Run BLAST.
We have just given BLAST an unknown protein sequence from Thisainticus realicans to search with. It will
take this sequence, search the huge database at ExPASy, and return to us any sequences that are
similar to our query sequence. In a few seconds you should see a page which is called our “hit list”. These are
the best matching sequences found in the database.
How does one know which sequences to consider from this list? First, you only want to consider those sequences
that have favorable scores. Matches that have high alignment scores, high percent identity scores, and low
E-values should be considered first. Since the results list is sorted by alignment score, you can simply
look at the beginning of the list for your highly similar sequences. A high alignment score and percent identity
score are a good indication of how similar the sequences are. A low E-value communicates that an alignment
score of this value is highly unlikely to occur by random chance.
In our example we see that our unknown sequence aligned very well to over 30 different sequences that are all
adenine deaminases. The best alignment was to the sequence from Deinococcus radiodurans. Because this hit has the highest
alignment score of 1077 and the lowest E-value of 0.0, we will assume that our unknown sequence has a function very
close to that of adenine deaminase of Deinococcus radiodurans.
For more detail, we can look closer at the pairwise sequence alignment between our unknown sequence and any
hit in our list by clicking on its E-value hyperlink. By clicking on the E-value of our first entry, we learn that
we have a percent identity of 91% and a positives (similarity) score of 92%. Because these values are so
high, we can be very confident that both sequences have the same function. We can only be truly sure of
the function of our unknown protein if we perform wet lab experiments to verify its function. The
conclusion that we are arriving at is better thought of as a hypothesis about the function of our protein that
needs to be verified with wet lab experiments.
So what is the probable function of our protein? We can look at the database record of adenine deaminase from
D. radiodurans by clicking on the accession link. Here we can get more information about this protein.
Example – Open Reading Frame Prediction
This is a tool that predicts open reading frames given a long enough DNA sequence. An open reading frame is a
sequence of nucleotides in a DNA molecule that could code for a protein. The program, ORF finder, tries to find
start and stop codons within the sequence. It then determines whether there are enough nucleotides between the
start and stop codon to consider that portion of the DNA sequence a potential open reading frame. The tool
then reports the potential open reading frames to the user. Like many tools, it is not perfect. Prediction
tools often make incorrect predictions and miss some correct ones. So it is up to the bioinformatician or
biologist to evaluate the predicted open reading frames.
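
The search that such a tool performs can be sketched in a few lines of Python. This is a simplified illustration and not how the NCBI ORF finder is actually implemented: it assumes ATG is the only start codon, uses the three standard stop codons, scans all three frames of both strands, and measures length from the start codon through the stop codon. The function names are made up for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G, reversed)."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_length=100):
    """Find ATG...stop stretches in all six reading frames.

    Returns tuples (strand, frame, start, end, length). Coordinates are 0-based
    on the scanned strand, end is exclusive, and length includes the stop codon.
    Assumption: ATG is the only start codon considered.
    """
    dna = dna.upper()
    orfs = []
    for strand, sequence in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None
            for i in range(frame, len(sequence) - 2, 3):
                codon = sequence[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first start codon opens a candidate
                elif start is not None and codon in STOP_CODONS:
                    end = i + 3
                    if end - start >= min_length:  # keep only long enough candidates
                        orfs.append((strand, frame + 1, start, end, end - start))
                    start = None                   # look for the next start codon
    return orfs


# Tiny made-up sequence: two filler bases, ATG, three GCC codons, a TAA stop, one filler base.
print(find_orfs("CCATGGCCGCCGCCTAAG", min_length=9))  # [('+', 3, 2, 17, 15)]
```

Raising `min_length` filters out short candidates, which is the same role the minimum frame length setting plays in an ORF prediction tool.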
Follow the steps below to predict the open reading frame from a DNA sequence.
1. Go to http://microgen.ouhsc.edu/nsu-lab/
2. Click the Example – Open Reading Frame Prediction hyperlink.
3. Open a second browser.
4. Go to http://www.ncbi.nlm.nih.gov/gorf/gorf.html
5. Copy the DNA sequence into the input box at the ORF Finder site.
6. Press find.
Below is a description of the buttons and drop boxes that you see on the screen.
• “Graphical Display” – highlights the predicted open reading frames from a DNA sequence.
• “View” button – Provides you with a new web browser with the sequence information from accepted open
reading frames based on the option in the "View option" drop box.
• “View option” drop box – Provides you with different format options for viewing the sequences of your
accepted predicted open reading frames.
• “Redraw” button – After changing the value in the "Minimum Frame Length" drop box, this will redraw
the graphical display.
• “Minimum Frame length” drop box – Defines the minimum number of nucleotides that must be present
between a start codon and a stop codon for a region of the DNA to be considered a possible open reading
frame.
• “Six Frames/ORF finder” button – Changes the graphical display to show all possible open reading
frames, in all the different frame shifts.
• “Frame” column – indicates on which strand and in which frame the open reading frame is found.
• “from” column – The position in the nucleic acid sequence entry where the predicted reading frame in
question begins.
• “to” column – The position in the nucleic acid sequence entry where the predicted reading frame in question
ends.
• “Length” column – The length of the predicted open reading frame.
Currently the graphical display is reporting all the potential open reading frames that are at least 100 nucleotides
long. We actually want to look at open reading frames that are at least 300 nucleotides long.
1. Change the open reading frame requirement from 100 to 300 by changing the value in the “Minimum
Frame length” drop box.
2. Press the Redraw button.
We now need to retrieve a protein sequence that ORF finder has predicted from our DNA sequence. To the far
right of the website you will see a column labeled “Frame” with aqua colored squares below it. Each row
corresponds to one of the open reading frames predicted from our DNA sequence. We want to look at the
longest predicted sequence, which happens to be the very first row.
1. Click the aqua colored square corresponding to the longest predicted open reading frame. This will
bring up the translation of this sequence. The predicted open reading frame that we
selected corresponds to the pink colored region on the graphical display.
2. Click the “Accept” button that just appeared on the screen. The translation screen should disappear
and the region that was once pink should now be green.
3. Find the drop box labeled “1 GenBank” and change it to “3 Fasta protein”.
4. Click the View button.
This is the protein sequence predicted by ORF finder in fasta format. Fasta format is simply a way of
displaying a protein or DNA sequence. It consists of a header line that begins with a greater than sign, “>”,
usually followed on the same line by a brief description of the sequence. The lines below the header line
contain the sequence itself, either nucleotides or amino acids. This format makes it easier to copy and paste
sequences into other windows or query boxes. The very next step would be to copy this sequence into ExPASy's
blast search window to search for similar sequences, just as we did for the previous example.
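To make the fasta layout concrete, here is a minimal Python sketch that splits fasta text into header and sequence pairs. It is only an illustration: the function name parse_fasta and the example record are made up for this handout and are not part of ORF finder or ExPASy.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from fasta-formatted text."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined back into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Made-up record, used only to show the layout.
example = """>example_orf hypothetical protein (made-up record)
MKTAYIAKQR
QISFVKSHFS
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

The sequence may be wrapped over any number of lines; only the “>” line marks where a new record starts.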
Assignment
1. Background to assignment.
For this assignment you will be assigned the task of assembling the data from a genome sequencing
project and annotating it. You will be working with a small viral genome, Thisainticus realicans. This is
a highly simplified simulation of a sequencing and annotation project, which will serve to give you
an idea of this process. During this assignment you will be the bioinformatician who receives reads from
the DNA sequencing lab. You will have to decontaminate the reads, assemble them into contigs, predict open
reading frames from the assembled sequences, identify potential protein sequences, and annotate those
protein sequences.
Shotgun Sequencing to Gene discovery
Here is a little background information to prepare you for the assignment ahead. I am going to
introduce the method of Shotgun Sequencing for sequencing the genome of an organism. I will then
briefly describe the responsibilities of the bioinformatician in assembling the sequence data during
sequencing projects. Every sequencing project should be followed by an accurate annotation of the
genome, which will also be discussed.
Shotgun sequencing is a popular method for sequencing the genome of any organism and has been
employed in many genome sequencing projects. The ultimate goal of any sequencing project is to
accurately identify the exact order of every nucleotide found in the genome of that organism. With
current sequencing technology, only about 500 to 800 base pairs can be identified at a time on any given
DNA molecule. This causes a major problem when even a small bacterial genome can contain a few million
to several million base pairs. Because of this limitation, the entire genome has to be sequenced by
breaking it up into lots of smaller DNA fragments and putting their sequences back together. The sequencing
lab breaks up the genome and inserts each DNA fragment into its own individual plasmid (a process called
cloning). This huge collection of genome-sequence-containing plasmids is kept and
organized into a clone library. Sequencing technicians choose different plasmids from this library
and sequence the genomic DNA inserted in them. The sequences
produced from these inserts are called reads. These reads are then delivered to the
bioinformatician for analysis.
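The idea of breaking a genome into reads can be simulated in a few lines of Python. This is a toy sketch under loud assumptions: the genome is a random string, every read is a perfect copy of one strand, and the function name shotgun_reads is invented here. Real reads contain errors and come from both strands.

```python
import random


def shotgun_reads(genome, n_reads, min_len=500, max_len=800, seed=0):
    """Sample n_reads random substrings ("reads") from a genome string."""
    rng = random.Random(seed)
    reads = []
    for _ in range(n_reads):
        # Read length mimics the 500 to 800 base pair limit described above.
        length = min(rng.randint(min_len, max_len), len(genome))
        start = rng.randint(0, len(genome) - length)
        reads.append(genome[start:start + length])
    return reads


# Toy 5000-nucleotide "genome" made of random letters (not real data).
rng = random.Random(1)
toy_genome = "".join(rng.choice("ACGT") for _ in range(5000))

reads = shotgun_reads(toy_genome, n_reads=40)
coverage = sum(len(r) for r in reads) / len(toy_genome)
print(f"{len(reads)} reads, average coverage about {coverage:.1f}x")
```

Because the start positions are random, some regions are covered by many reads and others by none, which is why a project needs many more reads than the genome length alone would suggest.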
In any sequencing project, the goal is to obtain an accurate sequence for the organism. It is very important
to make sure that only sequence information from the genome of the organism being sequenced is identified and
stored. As with most lab procedures, there are opportunities to contaminate the data. Most of the
time, when the genomic inserts are sequenced from the plasmids, some of the plasmid DNA is
sequenced as well. The reads that the bioinformatician receives may or may not be contaminated with
sequence information from the plasmid. The very first responsibility of the bioinformatician is to check the
reads for quality to make sure that no plasmid sequence information is mixed in with the organism's
genome sequence information. If there is, this extra sequence information must be removed before moving to
the next step.
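As a rough illustration of what removing plasmid sequence means, the Python sketch below trims plasmid sequence from the two ends of a read. It assumes that we know the plasmid sequence on either side of the insert site and that the contamination matches it exactly. The function name trim_vector and the flanking sequences are made up; real screening tools tolerate sequencing errors and do much more.

```python
def trim_vector(read, left_flank, right_flank, min_match=8):
    """Remove plasmid sequence from the ends of a read (exact matches only).

    left_flank: plasmid sequence immediately before the insert site.
    right_flank: plasmid sequence immediately after the insert site.
    """
    # Find the longest end of left_flank that the read starts with.
    for k in range(min(len(left_flank), len(read)), min_match - 1, -1):
        if read.startswith(left_flank[-k:]):
            read = read[k:]
            break
    # Find the longest beginning of right_flank that the read ends with.
    for k in range(min(len(right_flank), len(read)), min_match - 1, -1):
        if read.endswith(right_flank[:k]):
            read = read[:-k]
            break
    return read


# Made-up plasmid flanks and insert, for illustration only.
LEFT = "GGATCCTCTAGA"
RIGHT = "AAGCTTGGCACT"
insert = "ATGGCTAGCTAGGCTTACG"
contaminated = LEFT[-10:] + insert + RIGHT[:9]

print(trim_vector(contaminated, LEFT, RIGHT) == insert)
```

The min_match setting keeps the sketch from trimming a read just because its first few letters happen to resemble the plasmid by chance.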
After the bioinformatician is confident that he or she has successfully removed any contamination from
the reads, the reads can be assembled into contigs. Contigs are reads that have been joined into longer
pieces. The ultimate goal of the assembly process is to join all the reads into a contiguous (“contig”
comes from contiguous) DNA sequence that is an exact replica of the genome of the organism. In the
initial stages of the sequencing project, the assembly process produces lots of small contigs that have
many gaps between them because there is not enough sequence data. As the technicians sequence more
genome inserts obtained from their clone library and deliver the new reads to the bioinformatician, the
new reads are incorporated into the assembled contigs. Eventually the contigs will grow longer and the
gaps between the contigs will shorten and disappear, connecting the contigs. The sequencing is complete
when all the gaps are “closed”.
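The joining of overlapping reads into contigs can be sketched with a simple greedy strategy: repeatedly merge the two sequences that share the longest end-to-start overlap. This is a teaching sketch, not how any particular assembler works. It assumes error-free reads that all come from the same strand, and the names overlap and greedy_assemble are invented for this example.

```python
def overlap(a, b, min_len):
    """Length of the longest end of a that equals the start of b (0 if none)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=4):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # nothing left to join: the remaining gaps stay open
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for n, c in enumerate(contigs) if n not in (best_i, best_j)]
        contigs.append(merged)


# Three made-up reads that tile a short made-up sequence.
toy_reads = ["ATGGCGTACG", "GTACGTTAGCC", "TAGCCTAGGATCCA"]
print(greedy_assemble(toy_reads))
```

If two groups of reads share no overlap, the function returns more than one contig, which corresponds to a gap that more sequencing would have to close.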
After finally sequencing the genome, you may feel as if your work is complete because you have
accomplished such a great task. However, the next important phase is to annotate the genome's sequence
information. Annotation is the process of taking the raw sequence data from any sequencing project and
adding analysis and interpretation to it in order to relate the genome to the biology of the
organism. In other words, we need to locate the genes, identify the proteins they encode and their
functions, and identify regulatory sequences. It's nice to have the sequence information of the genome,
but if we do not know what genes it contains and what proteins are produced, the genome is of little use to
us.
One of the techniques for annotating a genome involves the practice of gene prediction. Simply stated, a
program tries to identify potential open reading frames, lengths of sequence that could encode a
protein. After identifying these potential open reading frames, the annotator can then “blast” each
sequence against a database in an effort to find a sequence that is highly similar. If the annotator
receives a successful hit, they can transfer what is known about the database sequence to the query sequence,
because sequences that are highly similar tend to share similar biochemical functions and 3D
structures.
2. Download your reads.
The sequencers have deposited the reads into a folder on the network. You can download the sequences
from the website http://microgen.ouhsc.edu/nsu-lab/. Click on the reads hyperlink. These are the read
sequences that you will need to assemble.
3. Assemble your reads
Normally before assembling your reads, you would first decontaminate them by trying to identify any
foreign sequence material that might have entered the sequence data. We will skip this step because of
time constraints and go straight to assembly.
There is a nice and simple online assembly program called CAP3 at http://pbil.univ-lyon1.fr/cap3.php.
Go to this website and copy and paste ALL the sequences into the input box. In a few minutes you will
receive your assembled contigs. The Contigs link will contain all the contigs in fasta format.
The Single sequences link will tell you which of the reads that you entered didn't overlap any other reads
in your group. The Assembly details link will give you details as to which read overlapped which read and
in what orientation they overlapped. Take a moment and look at the assembly details just to see how
your reads were assembled.
4. Predict open reading frames
Now that you have assembled your genome, you must annotate it. For this exercise we will first have
to predict where the genes are located. Open up another browser to
http://www.ncbi.nlm.nih.gov/gorf/gorf.html. From the CAP3 website, copy and paste the largest contig
from the Contigs link into ORF finder and press the find button. This will search for open reading
frames in your assembled sequence.
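To see what such a search involves, here is a minimal Python sketch that scans all six reading frames for stretches running from an ATG start codon to the next in-frame stop codon. It is an assumption-laden illustration, not ORF finder's actual algorithm: the function find_orfs is invented here, it only accepts ATG as a start, and it reports reverse-strand positions on the reverse complement rather than on the original sequence.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the opposite strand, read in its own 5' to 3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_len=300):
    """Return (frame, start, end, length) tuples for ATG...stop stretches.

    frame is 1..3 for the given strand and -1..-3 for the reverse
    complement; start and end are 1-based positions on the strand searched.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon opens a candidate ORF
                elif start is not None and codon in STOPS:
                    length = i + 3 - start  # includes the stop codon
                    if length >= min_len:
                        orfs.append((sign * (offset + 1), start + 1, i + 3, length))
                    start = None
    # Longest ORFs first.
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


# Made-up 25-nucleotide sequence with one short ORF in frame 3.
toy = "CC" + "ATG" + "GCT" * 5 + "TAA" + "GG"
print(find_orfs(toy, min_len=12))
```

In this sketch the length counts the stop codon, so a 21-nucleotide ORF would encode 21 / 3 - 1 = 6 amino acids. The min_len argument plays the same role as the “Minimum Frame length” drop box.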
Questions
Once you have identified potential open reading frames in your contig using ORF finder, you have to try to
determine whether the ORFs could possibly code for functional proteins. To do this, choose one of the open reading
frames predicted by ORF finder, translate it using the View button, and then blast the amino acid sequence
(this is your unknown sequence) against ExPASy. (hint – use one of the two larger predicted open reading
frames.) Answer the questions below.
1. How long (in amino acids) is the sequence that you are blasting against ExPASy?
2. Out of the first 20 or so sequences from your hit list, what is the most common protein that your
unknown protein sequence matches?
3. Which 5 results (or blast hits) for each query sequence have the highest alignment scores? Give their
accession numbers and protein names. What are their percent identity scores?
4. From these top 5 hits, with which sequence is our unknown sequence most likely to share a similar
function, and what organism does it come from? How did you determine that these two sequences would
most likely share a common function?
5. Identify one or two different scientific papers from the database record of the blast hit that you
identified to be most similar to your unknown sequence.
6. Can you identify some functional domains that our unknown sequence could possibly have? (hint –
does the blast hit that it matches have any functional domains? If so, what?)
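For question 3, it may help to see how a percent identity score can be computed from two aligned sequences. The Python sketch below counts identical positions and divides by the alignment length. This denominator is an assumption, since different tools count gap positions differently, and the function name and example sequences are made up.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent of alignment columns where both sequences have the same residue.

    Both inputs must be rows of one alignment (equal length, '-' for gaps).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Ignore columns where both rows are gaps; they carry no information.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == "-" and b == "-")]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b)
    return 100.0 * matches / len(columns)


# Made-up aligned protein fragments: 5 identical columns out of 7.
print(round(percent_identity("MKT-AYI", "MKTLAYV"), 1))
```

A high percent identity over a short alignment is weaker evidence than the same value over a long one, so it is worth looking at the alignment length as well as the score.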