Dabbling in bioinformatics:

Bio/CS – 251
April 12, 2006
Laboratory 10:
Hunting for genes in a DNA sequence
Today, we examine web-based tools for finding genes hidden in eukaryotic genomes. We will locate
and identify genes in stretches of DNA sequence from Aspergillus nidulans, a common bread mold
fungus and the favorite organism studied by Dr. James.
This laboratory will rely on methods and websites that are presented in Chapter 5, pp. 158-172 in
Bioinformatics for Dummies.
Objective: Obtain the following DNA sequence, and ORF it! (in other words, find all of the
Open Reading Frames, or ORFs, and determine what proteins they encode)
Go to the GenBank entry tool at http://www.ncbi.nlm.nih.gov/entrez/
In the Search pull-down menu, select Gene and press Go. A new page will appear. Along
the left column, point your browser to “Genomic Biology”. Then, along the right column of the new
page, under “Genome Resources”, find the line labeled “Aspergillus”. On that line click the button
labeled with a G. A new page, “Aspergillus Genome Resources” will appear. Along the right column,
open the link to “A. nidulans Database at the Broad Institute”. This will direct you to
the A. nidulans website, which is constructed and maintained by the Broad Institute at
Massachusetts Institute of Technology (MIT).
a. The A. nidulans genome sequence is ~30 million base pairs in length, or about 1/100th the size
of the human genome, and codes for roughly 8500 genes. It is broken up into manageable-sized
chunks called contigs. A contig is one contiguous stretch of DNA assembled from a number of
smaller, overlapping sequences. Today you will identify and study all of the genes encoded in a
small, 15,000 bp sub-region of Contig #26. To obtain this chunk of DNA sequence, do the following:
b. Check that your browser is pointed to the Aspergillus nidulans database:
http://www.broad.mit.edu/annotation/fungi/aspergillus_nidulans/
c. Point your browser to “Browse Regions”.
d. In the box labeled “Supercontig number”, enter 1.26
In the box labeled “Start”, enter 357000
In the box labeled “Stop”, enter 372000
Click on the hotlink button labeled “DNA Sequence”
Copy/paste this 15 kb sequence here, and convert it to 10 pt courier font:
For the following exercises, follow along in pp. 158-163 in BFD.
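If the text copied in step d carries a header line, position numbers, or spaces, it helps to reduce it to bare bases before pasting it into any tool. The following is a minimal Python sketch of that cleanup. The function name clean_sequence and the sample text are made up for illustration, and the sketch assumes that any header line begins with ">" (FASTA style); the Broad Institute page may format its output differently.

```python
def clean_sequence(raw_text):
    """Return only the sequence letters from pasted text, in upper case."""
    bases = []
    for line in raw_text.splitlines():
        if line.startswith(">"):  # assumed FASTA-style header line: skip it
            continue
        # Keep letters only; this drops position numbers, spaces, and tabs.
        bases.extend(ch for ch in line if ch.isalpha())
    return "".join(bases).upper()


# Made-up example text, not real A. nidulans sequence.
pasted = """>made-up header line
1 acgtacgtac gtacgtacgt
21 ttgacc"""
sequence = clean_sequence(pasted)
print(len(sequence), sequence)
```

For the real region, the printed length should be close to 15,000; a much smaller or larger number suggests that the copy/paste went wrong.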
e. Use ORF Finder to locate all of the potential Open Reading Frames (ORFs) in this 15 kb
stretch of DNA. ORF Finder will predict ORFs, i.e., long stretches of DNA that could
potentially contain a protein-coding portion of a gene. ORF Finder is a graphical analysis tool
that finds all open reading frames of a selectable minimum size, usually > 100 nucleotides.
(1) To access ORF Finder, go to NCBI: http://www.ncbi.nlm.nih.gov
(2) Under HOTSPOTS in the right column, choose ORF Finder:
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
(3) Copy/paste the 15 kb sequence into the ORF Finder box (just the sequence!), and click on
the OrfFind button.
Six parallel horizontal bars will appear on the screen. Each will contain a number of
blue boxes. Each blue box represents one potential ORF, i.e., one potential exon. You will see that a
blizzard of potential ORFs can be found in this region. Only a small number of them represent
genuine protein-coding regions. Your assignment is to find out which ones belong to real genes!
Begin by pasting a copy of the screen below. Also, keep this page open for further investigations.
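The kind of scan that fills a single bar with boxes can be sketched in a few lines of Python. This is a simplified illustration, not ORF Finder's actual algorithm: it assumes that an ORF runs from an ATG codon to the first in-frame stop codon of the standard genetic code, and the function name find_orfs and its parameters are made up for this handout.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # standard genetic code (assumption)


def find_orfs(seq, offset=0, min_len=100):
    """Scan one reading frame of seq, starting at position offset.

    Returns (start, end) pairs, 0-based and end-exclusive, for every
    ATG...stop stretch that is at least min_len nucleotides long.
    """
    seq = seq.upper()
    orfs = []
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i  # open a candidate ORF at the first ATG
        elif start is not None and codon in STOP_CODONS:
            end = i + 3  # include the stop codon in the reported span
            if end - start >= min_len:
                orfs.append((start, end))
            start = None  # resume looking for the next ATG
    return orfs


# Made-up toy sequence; min_len is lowered so the short ORF is reported.
demo = "CCATGAAATTTTAGGG"
print(find_orfs(demo, offset=2, min_len=12))  # [(2, 14)]
```

With the default min_len of 100, the toy sequence yields no ORFs at all, which is how a minimum-size setting thins out the display.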
f. Before you test the ORFs to see which ones are real, answer the following questions:
Q1: What do the six bars represent? Explain why there are six bars, and explain how the
three parallel bars at the top differ from the three parallel bars at the bottom.
Q2: Could a single gene be contained in multiple, adjacent or overlapping ORFs? In other
words, is the protein-coding region of a gene necessarily contained in a single ORF, or could the
protein-coding region be broken up into more than one ORF? Why or why not?
g. Sometimes it can help to determine which potential ORFs are real by comparing your output
from ORF Finder with another gene-finding tool, called GeneMark.
Leave your ORF Finder window open, showing the ORF map of the 15 kb region we are
studying.
Open a new window, and point your browser to http://opal.biology.gatech.edu/GeneMark/.
Then, choose the link corresponding to “Gene Prediction in Eukaryotes” associated with the
rat icon. Click on the link labeled GeneMark-E and GeneMark.hmm-E. If you follow the directions on
pp. 162-163, and choose C. elegans as the species most closely related to A. nidulans, then click on
the Start GeneMark.hmm button, you will receive a PDF output that shows the position of all
probable genes in the 15 kb sequence.
Unfortunately, this tool, like ORF Finder, also predicts many more functional ORFs than really
exist. Careful examination of the output may help to narrow the field. However, another way to make
sense of your ORFs is to return to ORF Finder and use the associated blastp feature to BLAST a
subset of the ORFs.
h. Return to the ORF Finder window, and proceed as follows:
Find the FOUR (4) bona fide genes in this 15,000 bp region. Find all
of the ORFs (exons) corresponding to these four genes, as follows:
(1) Based on the assumption that the longer the ORF, the more likely it is to represent a bona
fide gene, use Blastp to BLAST the largest eight (8) ORFs, for starters. For each of these 8 BLAST
searches, do the following:
(a) First, click on the blue ORF that you intend to BLAST. Use an organized approach:
Begin at the left, and work to the right. When you click on the desired ORF, the screen will refresh
and the highlighted ORF will become purple. Also, the DNA sequence of the ORF, and a
corresponding translation, will be displayed. For each ORF that you search, paste the DNA + protein
sequences into a MSWord file, and label it clearly for identification purposes.
(b) Second, BLAST the ORF. For each Blastp search, ask for a graphical output and
specify 10 descriptions + 10 alignments. Obtain the output, and then paste the output below the
sequence + translation from (a) above. Use 10 point Courier font throughout.
During this effort, you will need to use your judgement to assess the quality of
the Blastp hits that are produced, and decide if the hits are significant or if they are
meaningless. In any event, for the time being save the output of these searches. Clues for
making good judgements include the following:
1. e-value: is the e-value < 10^-15?
2. Does the ORF contain a putative conserved domain? If so, what is it? List it
or copy in a description of the conserved functional domain (a conserved domain is a protein
region that is the same or very similar in many proteins, because it provides a function that is
common to many proteins)
If the answer to these two questions is YES, then you have probably hit a bona
fide gene.
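The two clues above amount to a simple yes/no test, which can be written out as a short Python sketch. The function name is_probable_gene and the example values are made up for illustration. The cutoff is the one given in clue 1; the test is a rule of thumb to support your judgement, not a replacement for reading the alignments.

```python
E_VALUE_CUTOFF = 1e-15  # threshold from clue 1


def is_probable_gene(e_value, conserved_domains):
    """Return True when both clues are satisfied.

    e_value: e-value of the best Blastp hit for the ORF.
    conserved_domains: list of conserved-domain names reported for the ORF
    (an empty list if none were reported).
    """
    return e_value < E_VALUE_CUTOFF and len(conserved_domains) > 0


# Made-up example values, not real search results.
print(is_probable_gene(3e-42, ["example kinase domain"]))  # True
print(is_probable_gene(2e-4, []))                          # False
```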
Each time you begin work with a new ORF, start a new page in your MSWord document.
i. After you have identified each of the four different genes, go back and BLAST the
appropriate smaller ORFs that are adjacent to each identified gene on either side, to learn if the gene
spans more than one ORF.
j. Completing the assignment:
To complete this assignment to identify the four real genes, you will probably need to
BLAST 17-18 total ORFs from this 15,000 nt sequence.
Please submit the following to complete this assignment:
(1) Sequence + translation of each ORF that belongs to a real gene.
(2) Blastp outputs for each real-gene ORF that include 10 descriptions + one keystone
alignment to an orthologous gene whose function is well-described and well-understood. In other
words, don’t necessarily choose an alignment because it has the lowest (i.e., best) e-value; an alignment to a
“hypothetical protein” is uninformative. If your 8th-best alignment is the first one to list a protein with a
real name (e.g., cyclic AMP-dependent protein kinase), and this alignment’s e-value is similar to those of
the 7 better matches, then use this identification for your Aspergillus nidulans ORF(s).
(3) A schematic diagram depicting the order of the four genes and the distances separating
each one.
(4) In addition, the schematic diagram must show the relative position and the reading frame
of each ORF belonging to a gene. If multiple ORFs (exons) belong to the same gene, this must be
clearly described and diagrammed.
(5) Finally, after we’ve made you go through all of this labor, we’re going to teach you
another tool that you can use to check the veracity of your work.
-- Go to the A. nidulans database
(http://www.broad.mit.edu/annotation/fungi/aspergillus_nidulans/)
-- As before, point your browser to “Browse Regions”.
- In the box labeled “Supercontig number”, enter 1.26
- In the box labeled “Start”, enter 357000
- In the box labeled “Stop”, enter 372000
-- This time, instead of obtaining the 15kb DNA sequence, obtain the Feature Map for this
region.
-- WOW! Print out the Feature Map and answer the following questions:
A. What do the blue boxes represent?
B. What do the green boxes represent?
C. Click on each blue box corresponding to a gene that you identified by using ORF Finder.
This will reveal more annotation about each sequence.
Answer the following question: does your repetitive blastp searching agree with the
automated gene-finding annotation with regard to gene identity and gene structure?
Be sure to document similarities and differences between your manual efforts and the
autocalling software that Feature Map uses.
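The rule for choosing a keystone alignment in submission item (2) can also be sketched in Python. This is one possible reading of the rule, not an official procedure: the function name pick_keystone is made up, the hits are assumed to be (description, e-value) pairs listed best first, and "similar" is assumed to mean within 10 orders of magnitude of the best e-value (the max_log10_gap parameter, an arbitrary choice that you should adjust).

```python
import math

# Descriptions treated as uninformative; extend this tuple as needed.
UNINFORMATIVE = ("hypothetical protein",)


def _log10(e_value):
    # BLAST reports very small e-values as 0.0; floor them to avoid log10(0).
    return math.log10(max(e_value, 1e-300))


def pick_keystone(hits, max_log10_gap=10.0):
    """Return the first informatively named hit, or None.

    hits: list of (description, e_value) pairs, best hit first.
    A named hit is accepted only if its e-value is within max_log10_gap
    orders of magnitude of the best hit (assumed meaning of "similar").
    """
    if not hits:
        return None
    best = _log10(hits[0][1])
    for description, e_value in hits:
        if any(term in description.lower() for term in UNINFORMATIVE):
            continue  # skip uninformative names
        if _log10(e_value) - best <= max_log10_gap:
            return description, e_value
        return None  # the first named hit is much weaker than the top hit
    return None


# Made-up example hits, not real search results.
hits = [
    ("hypothetical protein (example 1)", 1e-80),
    ("hypothetical protein (example 2)", 3e-79),
    ("cyclic AMP-dependent protein kinase", 5e-78),
]
print(pick_keystone(hits))
```

If the function returns None, no well-described ortholog was found among the hits supplied, and the ORF deserves a closer look by hand.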