BCB 444/544 Fall 06 Sep 14
Lab 4
p. 1
BCB 444/544
Lab 5 (Sept 27)
Function Prediction
Due Mon Oct 1 by 5 PM - email to jdsander@iastate.edu
Objectives
1. Predict genes in a DNA sequence (ORF Finder @ NCBI, GeneMark).
2. Use web-based tools to predict protein family/function and understand how they work.
3. Refresh skills we have developed in previous labs.
Introduction
In this lab you will use sequence-based approaches to predict genes and protein function.
Gene prediction works best with an ensemble of techniques including ORF scanning, intron
identification, transcription factor binding site identification, and codon bias. There are multiple
approaches to the function prediction problem, including sequence alignments, motif
identification, weight matrices, and Hidden Markov Models. After completing this lab you
should understand each of these methods and be familiar with online tools that
implement them. We assume knowledge of activities from previous labs, such as
sequence retrieval and sequence alignment via BLAST.
Exercises
Gene Prediction
Part I – ORF Finder
We’ll start our gene prediction journey with the simple task of scanning a genome for open
reading frames. An open reading frame, or ORF, is the DNA sequence located between a start
codon and a stop codon. Open reading frames represent potential genes. Since each codon
consists of three bases, each ORF must have a length divisible by three. By the same token,
there exist three possible reading frames for a strand of DNA. A fourth reading frame would be
in line with the first reading frame. Since DNA is double stranded, we have a total of 6 reading
frames (3 on each strand).
The three possible reading frames of a sequence beginning ATGGTA start at:
ATG, TGG, GGT.
Note: GTA is in the same reading frame as
ATG
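To make the idea of six reading frames concrete, here is a minimal Python sketch of an ORF scan. It is only an illustration and not how ORF Finder itself is implemented. The function names (reverse_complement, find_orfs) and the min_length parameter are our own assumptions, and the sketch only treats ATG as a start codon.

```python
# Illustrative six-frame ORF scan; not the algorithm used by NCBI ORF Finder.
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orfs(dna, min_length=100):
    """Return (frame, start, end, length) tuples for ATG...stop ORFs.

    frame is +1..+3 on the given strand and -1..-3 on its reverse complement.
    start and end are 0-based offsets on the strand being read, and the
    reported length includes the stop codon.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None  # first ATG seen since the last stop codon
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    length = i + 3 - start
                    if length >= min_length:
                        orfs.append((sign * (offset + 1), start, i + 3, length))
                    start = None
    return orfs
```

For example, find_orfs("CCATGAAATTTTAGGG", min_length=6) reports a single 12-base ORF in frame +3, because the ATG sits two bases into the sequence.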
Let’s use ORF Finder at NCBI http://www.ncbi.nlm.nih.gov/gorf/gorf.html to locate all the
possible open reading frames. We’re going to locate all the open reading frames in the Human
Immunodeficiency Virus (HIV). Its GenBank accession number is NC_001802. We can either
enter the accession number or we could find the sequence ourselves and enter it directly. For
now enter the accession and click the “OrfFind” button. The default settings show us all open
reading frames greater than 100 bases in length.
1.)
What is the longest open reading frame detected in HIV?
2.)
Assuming this is a valid ORF with no introns, how long would the protein
produced by this ORF be?
Change the minimum displayed ORF length by changing the drop down box from 100 to 50 and
then clicking the “Redraw” button.
3.)
Do you think most of these new ORFs are translated into proteins? Why or why
not?
4.)
Take note of the frames that these genes are found in. What do you think the +
and the – stand for?
Let’s assume we think the longest ORF is a real gene. Click either on the highlighted ORF in the
2D chromosomal picture or on the colored box next to the ORF in the ORF list. We can either
accept this ORF as an actual gene or select the “Alternative Initiation Codons” button. Although
we learned that open reading frames start with ATG, biology has another rule that takes
precedence over all other rules: “There is an exception to every rule except this one.”
5.)
Select the “Alternative Initiation Codons” button. What earlier codon does the gene
now start with?
Part II – GeneMark
ORFs go a long way toward helping us identify potential genes. However, not all ORFs
have the appropriate transcription factor binding sites needed to drive transcription. Also, prediction
gets much more challenging when we consider alternative start codons, introns, and alternative
splicing. Thankfully there are plenty more tricks we can pull out of our bag.
Nearly all eukaryotic introns begin with GT and end with AG; this is known as the GT-AG rule.
We can also take advantage of something known as codon bias. There are 20 amino
acids and 64 possible codons (4^3). This means that many amino acids are encoded by
multiple codons. However, each species uses some of these synonymous codons more often than
others because of its specific cellular machinery. If we know how often a given triplet is used in known genes for
that organism, we can assign a probability to each codon. We can then combine these
probabilities over the entire ORF to assess whether it is likely to be a gene in this organism. We
can also expand our search to scan for transcription factor binding sites upstream of the ORF.
By combining all of these methods and several other techniques we can do a much better job of
predicting genes than simply looking at all ORFs.
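The codon bias idea can be sketched in a few lines of Python. This is a minimal illustration and not the model GeneMark actually uses. The frequency table below is hypothetical (real tables are built by counting codons in known genes of the organism), and the function name codon_log_odds is our own.

```python
import math

# Hypothetical codon frequencies, for illustration only.
TOY_CODON_FREQ = {"GCC": 0.04, "GCG": 0.01, "CTG": 0.04, "CTA": 0.007}
BACKGROUND = 1 / 64  # assumption: without bias, all 64 codons are equally likely


def codon_log_odds(orf, codon_freq, background=BACKGROUND):
    """Sum log2(frequency / background) over the in-frame codons of an ORF.

    Positive totals mean the ORF prefers codons that known genes prefer.
    Codons missing from the table are treated as neutral (they add 0).
    """
    score = 0.0
    for i in range(0, len(orf) - 2, 3):
        freq = codon_freq.get(orf[i:i + 3].upper())
        if freq:
            score += math.log2(freq / background)
    return score
```

With this toy table, codon_log_odds("GCCCTG", TOY_CODON_FREQ) is positive while codon_log_odds("GCGCTA", TOY_CODON_FREQ) is negative, even though both sequences encode the same two amino acids (Ala-Leu).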
There are many programs out there that take advantage of different combinations of this
information. One such program, GeneSeqer, was developed by Volker Brendel, an Iowa State
professor, and is accessible through his site at http://www.plantgdb.org/ . Feel free to go through
their tutorial after you have completed the lab.
In this lab, we will be using the GeneMark program.
Go to
http://www.ncbi.nlm.nih.gov and click on the “Genomic Biology” link on the left side of the
screen. Here we can access whole genome sequences. On the right side of the screen under
Organism-Specific, select the G for Genome Resources next to Human. There should be a figure
with the human chromosomes in the upper left-hand corner. Click on chromosome 1 to go to a
map of that chromosome. Click on the “Download/View Sequence/Evidence” link in the upper
right corner. The chromosome is very large and is broken into several files. We just want
~50,000 bases of the chromosome for this lab. Change the range values to display the sequence
from 50000 to 100000, then click the “Change Region/Strand” button. A new view showing the
desired region should appear below. Click display and select FASTA format. Now copy this
shortened FASTA sequence, go to http://opal.biology.gatech.edu/GeneMark/genemark24.cgi ,
and paste in the copied sequence. Under running options, change the species to H. sapiens.
Under output options, check “Generate PDF graphics (screen)”. Click the “Start GeneMark”
button, then click on “View PDF Graphical Output”.
6.)
In which regions is the most probable gene?
7.)
In what reading frame is the most probable gene?
Function Prediction
Part III - BLAST - Alignments
Last lab we used sequence alignments to find proteins with sequence composition similar
to our own. It is a common assumption that proteins of similar sequence share similar function.
Protein sequence determines protein structure, and the combination of structure and sequence
determines function. Your first instinct may be to infer directly that sequence determines function,
and this is largely true. So why not drop “sequence implies structure” and skip directly to
“sequence implies function”? It is worded this way because multiple diverse sequences may result
in the same structure. In this case, proteins with vastly different sequence compositions may
share similar functions. We state that sequence and structure together imply function because structure
itself is not enough to produce an actively interacting protein. The interacting regions must have
the proper biochemical properties as well; therefore the sequence at these interacting regions is
important.
We will start today’s lab with the basic assumption that sequence implies function. Let’s return to
NCBI and retrieve the protein sequence NP_031939. Now BLAST the protein sequence against
all Fungi.
8.)
How many sequences do you find?
Many of these sequences will be unnamed or hypothetical, meaning their function or actual
existence as a protein is unknown. Several hits with good E-values have a known function.
9.)
Do you think this protein shares similar function with the other proteins? Why?
(Hint: look at the picture of the aligned regions near the top of the page.)
10.)
Looking at the best match, what positions in the query sequence align to the target
sequence?
At the top of the screen, click on the “Show Conserved Domains” link in the upper right corner.
11.)
Do these domains appear to correlate with the conserved region identified in the
BLAST search?
Now repeat the BLAST search; however, this time BLAST against the entire NR database (not
just Fungi). There should be many hits this time. Look at which regions of the query sequence
match this time.
12.)
Based solely on the description of the protein hits from this BLAST, what would
you predict the function of this entire protein to be?
13.)
How does this contradict what we know to be the case? (Hint: Look at the
answers to questions 10 and 11)
14.)
What potential problems could you see in trying to identify protein function
strictly by sequence alignment in the following scenario? A second protein
exists with sequence related to only the first 200 residues of this protein, and the BLAST
results give this protein as the top hit.
Now imagine this incorrect label propagating to several other related proteins. A user could blast
a sequence and get a number of hits as matches, all of which have incorrect labels stemming
from one original error. Unfortunately this scenario does happen. We must be careful when
assessing function via alignments.
Part IV – PROSITE – Motifs & Weight Matrices
In lecture we have already been introduced to protein motifs: recurring patterns of amino acids
that denote a characteristic such as a structure or function. Identifying protein function by
scanning for known protein motifs can prove more reliable than sequence alignments. Go to the
PROSITE website at http://ca.expasy.org/prosite/. Read the PROSITE User Manual Introduction
under Documents. Under the Tools for PROSITE section, submit the NP_031939 sequence we
explored above with BLAST.
15.)
How many hits were found for this sequence?
16.)
What conditions are met for the profile identified?
17.)
What is better evidence of function – a profile or a pattern? What is the difference
between the two?
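To see how the two kinds of motif description behave, here is a minimal Python sketch. It is not PROSITE's own scanning code. The first part translates the common PROSITE pattern syntax into a regular expression, and the second slides a small weight matrix along a sequence. All function names are our own, and the weights used in the example are hypothetical. The pattern N-{P}-[ST]-{P} is PROSITE's N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression.

    Covers the common syntax: x (any residue), [..] (allowed residues),
    {..} (forbidden residues), (n) or (n,m) repeats, and < / > anchors.
    Rarer forms, such as an anchor inside square brackets, are not handled.
    """
    regex = pattern.rstrip(".").replace("-", "")
    regex = regex.replace("{", "[^").replace("}", "]")  # forbidden residues
    regex = regex.replace("(", "{").replace(")", "}")    # repeat counts
    regex = regex.replace("x", ".")                      # any residue
    return regex.replace("<", "^").replace(">", "$")     # sequence ends


def pattern_hits(seq, pattern):
    """Return 0-based start positions (overlaps allowed) where pattern matches."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(seq.upper())]


def best_profile_hit(seq, weights):
    """Slide a weight matrix along seq and return (best_score, start).

    weights holds one dict per motif position mapping residue -> score;
    residues not listed score 0. Returns None if seq is shorter than the motif.
    """
    width = len(weights)
    best = None
    for start in range(len(seq) - width + 1):
        window = seq[start:start + width]
        score = sum(col.get(res, 0.0) for col, res in zip(weights, window))
        if best is None or score > best[0]:
            best = (score, start)
    return best
```

A pattern either matches or it does not: pattern_hits("AANGSAA", "N-{P}-[ST]-{P}.") returns [2], while the same call on "AANPSAA" returns an empty list. A weight matrix instead gives every window a score, so a near miss still receives a (lower) score that can be compared against a threshold.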
Part V – SUPERFAMILY – Hidden Markov Models (HMM)
Read “A concrete example” of an HMM at http://en.wikipedia.org/wiki/Hidden_Markov_Models .
Then go to the SUPERFAMILY website at http://supfam.org/SUPERFAMILY/hmm.html . This
website uses pre-trained HMMs to determine what family a protein falls into. Again paste the
NP_031939 FASTA sequence into the sequence window. Change the notification to Browser
and click submit. When “YOUR RUN IS COMPLETE” appears, click “here” to see the output.
18.)
What domains does SUPERFAMILY identify?
There are alternative programs that predict superfamily membership based on HMMs. Each program may
give different results depending on how it was trained.
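To give a feel for how an HMM scores a sequence, here is a minimal Python sketch of the forward algorithm, which sums the probability of an observed sequence over every possible path of hidden states. The two-state model and all of its numbers are made up for illustration. The profile HMMs used by real family-assignment tools are much larger, with match, insert, and delete states for each alignment column.

```python
def forward_probability(observations, start, trans, emit):
    """Return P(observations | model), summed over all hidden state paths.

    observations must be non-empty; start, trans, and emit are dicts of
    start, transition, and emission probabilities keyed by state name.
    """
    # alpha[s] = probability of the observations so far, ending in state s
    alpha = {s: start[s] * emit[s][observations[0]] for s in start}
    for symbol in observations[1:]:
        alpha = {
            s: sum(alpha[prev] * trans[prev][s] for prev in alpha) * emit[s][symbol]
            for s in start
        }
    return sum(alpha.values())


# Toy two-state model with made-up numbers: a "domain" state that favours
# hydrophobic residues (H) and a "background" state that favours polar ones (P).
START = {"domain": 0.5, "background": 0.5}
TRANS = {
    "domain": {"domain": 0.9, "background": 0.1},
    "background": {"domain": 0.1, "background": 0.9},
}
EMIT = {
    "domain": {"H": 0.8, "P": 0.2},
    "background": {"H": 0.3, "P": 0.7},
}
```

Calling forward_probability("HH", START, TRANS, EMIT) gives 0.3525 (up to floating-point rounding). Two programs trained on different example sequences would end up with different transition and emission tables, which is one reason their results can differ.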
Part VI - JAFA – Metaserver
Using one method of function prediction may be adequate. However, it is often better to use
multiple methods at once and combine the results. JAFA (Joined Assembly of Function
Annotations) is one such metaserver. Go to the JAFA metaserver at http://jafa.burnham.org/
and use the NP_031939 FASTA sequence again.
19.)
Do you receive any new information?
20.)
Based on all the function prediction results what can you say about our mystery
protein? (What do we know and what do we not know?)
Due Mon Oct 1 by 5 PM - email to jdsander@iastate.edu