PLSC 411/611 - Genomics
Gene Discovery and Annotation
Data Analysis Assignment
The purpose of this assignment is for you to better understand the gene identification and
annotation process. You will use a number of WWW sites for this project. As you visit these
sites, you should try to learn as much as possible about them. This will certainly help the
learning process.
What to turn in. Each group will turn in a written report. As you go through the
assignment, you will see a series of 24 questions. Your report will consist of answers to these
questions in addition to a Table that is described below. Some of these questions will require a
short answer; others will require full lengthy explanations. All answers must be in the form of
complete sentences. Preface each answer with the question itself. It will make grading easier.
Every group will be analyzing a BAC clone from soybean. These are Phase 3 BAC
clones. What does Phase 3 mean? The following table from NCBI provides a good explanation.
Phase 0: One-to-few pass reads of a single clone (not contigs).
Phase 1: Unfinished, may be unordered, unoriented contigs, with gaps.
Phase 2: Unfinished, ordered, oriented contigs, with or without gaps.
Phase 3: Finished, no gaps (with or without annotations).
In the table “Unfinished” refers to the status of the sequencing of the BAC. Note that only Phase
3 BACs are considered to be finished. For hierarchical shotgun sequencing, all BACs are
sequenced to Phase 3.
Okay, the first step is to download your data.
1. Go to the Project page at the class WWW site. The address is:
2. Find your group, and download the file containing BAC DNA sequence.
There are a couple of things to note. First, this is a .txt (or flat file) without any hidden
information as you would have with a word processing file. A good software program to use is
called TextPad. I would recommend that you download the program and use it whenever you
are working with this file. Here is the link:
Next, it is important to become familiar with the data format. All DNA sequence files are in
what is called “FASTA” format. The first line begins with the “>” symbol followed by
descriptor information. The second and subsequent lines contain the DNA (or protein) sequence.
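If you want to check your file outside of a text editor, the FASTA layout described above can be read with a few lines of Python. This is only a minimal sketch: the function name read_fasta and the example record are made up for illustration, and the sequence shown is not real BAC data.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new ">" line closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header = line[1:].strip()
            chunks = []
        elif header is not None:
            # Sequence lines are joined into one continuous string.
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical example record (not a real soybean BAC).
example = """>hypothetical_BAC_clone example descriptor
ACGTACGTAC
GTTTAACC
"""

for header, sequence in read_fasta(example):
    print(header, len(sequence))
```

Reading only the second element of each pair gives you the bare sequence without the descriptor line.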
Now, you will start the analysis of the DNA. You will use the WWW version of the
FGENESH software. FGENESH is commercially available gene discovery software that uses
a hidden Markov model (HMM) approach. It is considered to be the fastest and most reliable of all of
the gene prediction software currently available. The software is sold by a company called
Softberry, Inc. Use the following link to connect to the Softberry WWW site.
3. Running the FGENESH analysis.
To run FGENESH, you need to click on the “Run Programs Online” drop-down menu in
the upper right of the Softberry homepage and then select the “GENE FINDING in Eukaryota” tab.
On this page select the “FGENESH” link. This will load a page where you will insert your data.
As we discussed in class, gene prediction is a procedure based upon extrinsic (previously
published data) and intrinsic (gene structure features) factors. FGENESH uses intrinsic data that
are found in your sequence. The procedures used to discover the genes in the sequence data are
based on a “training set”. A training set is a group of genes from which the software sets specific
parameters that will be used to predict the presence of a gene. Because different species have
slightly different parameters, it is important to select the appropriate training set.
4. Comparing the results of different training sets.
The first part of the assignment is to create a table that reports the following information
for your BAC clone.
Table 1. Gene prediction for soybean BAC clone XXXXXXXX based on different
species training sets

Training set                          Number of          Number in    Number in
                                      predicted genes    + chain      - chain
Dicots (Arabidopsis)
Glycine max
Monocot plants (Corn, Rice, Wheat,
To do this, paste your BAC sequence into the “Paste nucleotide sequence here:” box. Then select
the appropriate training set from those in the “Organism” section. The headings in the
table indicate the information that you need to fill in.
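If you save the FGENESH output as text, the three numbers for Table 1 can be tallied with a short script. The sketch below assumes that each feature row starts with the gene number followed by the strand symbol (+ or -), as in the “G” and “Str” columns. The function name count_genes_by_strand and the sample rows are invented for illustration only. Check the tally against the summary that FGENESH prints itself.

```python
def count_genes_by_strand(output_text):
    """Count predicted genes on each strand from FGENESH-style feature rows.

    Assumption: a feature row begins with an integer gene number and then
    a strand symbol, e.g. "1 +  1 CDSf  310 - 520 ...".
    """
    strands = {}
    for line in output_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0].isdigit() and fields[1] in ("+", "-"):
            # Every row of one gene carries the same strand, so keep one entry.
            strands[int(fields[0])] = fields[1]
    plus = sum(1 for strand in strands.values() if strand == "+")
    minus = sum(1 for strand in strands.values() if strand == "-")
    return {"genes": len(strands), "plus": plus, "minus": minus}


# Invented sample rows (not real FGENESH results).
sample_output = """
   G Str   Feature   Start        End    Score           ORF           Len

   1 +      TSS        120               -4.10
   1 +    1 CDSf       310 -       520   18.20       310 -       520    211
   1 +    2 CDSl       700 -       900   12.50       701 -       898    198
   1 +      PolA      1050                1.05

   2 -      PolA      1500                1.05
   2 -    1 CDSo      1600 -      2100   30.00      1600 -      2100    501
   2 -      TSS       2400               -6.00
"""

print(count_genes_by_strand(sample_output))
```

Run it once per training set and copy the three values into the matching row of the table.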
5. Comparing the FGENESH output generated by the “Dicot plants (Arabidopsis)” and
“Glycine max” training sets.
You will now compare, in detail, the output from the “Dicot plants (Arabidopsis)” and
“Glycine max” training sets. To do this you need to copy the complete output from these
analyses. I found this method to be a good way to store and view the output.
a. Open MS Word, create a new document, set all margins at “0.3”, and set the orientation
to “Landscape”.
b. Create a table that consists of two columns and one row.
c. Copy the output of the “Dicot” training set in the left column, and copy the “Glycine
max” training set output in the right column.
d. Change the font to 8 point, Courier New.
This formatting will allow you to view the data side-by-side. You will need to do that for the
Here are the first questions that you need to consider and answer for the report.
1. What do the abbreviations “G”, “Str”, “Feature”, “Start”, “End”, “Orf”, “Len”, “TSS”,
“CDSf”, “CDSi”, “CDSl”, and “PolA” represent?
2. Which training set generated the most gene models?
3. What differences did you observe between the two gene models that are built from the
same region of the BAC clone? You should also consider orientation, the position of the
transcription start and polyadenylation sites of the models when answering this question.
What is unique about the models supported by only one of the two training sets?
4. What percentage of exons predicted by one training set were also predicted by the other
training set? You need to calculate this for both training sets. (For example, you can
state it like this: X% of the exons predicted by the Glycine max training set were also
predicted by the Arabidopsis training set.)
5. What percentage of the complete genes predicted by one training set was also predicted
by the other training set? You need to calculate this for both training sets.
6. What percentage of the complete genes predicted by one training set had the same
structure predicted by the other training set? You need to calculate this for both training
sets.
Note: I found using MS Excel was a good way to visualize the structure information that is in the
FGENESH output. The structure information is found at the top of the output.
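The percentage calculations in questions 4 through 6 can also be done in Python instead of a spreadsheet. The sketch below treats an exon as “also predicted” only when strand, start, and end are identical in both outputs. That definition is an assumption, and your group may decide that overlapping exons should count as well. The function name percent_shared and all coordinates are made up for illustration.

```python
def percent_shared(exons_a, exons_b):
    """Percent of exons in exons_a that have an identical entry in exons_b.

    Each exon is a (strand, start, end) tuple. Exact coordinate identity
    is an assumed definition of "also predicted".
    """
    if not exons_a:
        return 0.0
    set_b = set(exons_b)
    shared = sum(1 for exon in exons_a if exon in set_b)
    return 100.0 * shared / len(exons_a)


# Invented exon coordinates for the two training sets.
glycine = [("+", 310, 520), ("+", 700, 900), ("-", 1600, 2100), ("-", 2300, 2450)]
arabidopsis = [("+", 310, 520), ("+", 700, 905), ("-", 1600, 2100)]

# The calculation is not symmetric, so it is run in both directions.
print(percent_shared(glycine, arabidopsis))      # 2 of 4 exons match
print(percent_shared(arabidopsis, glycine))      # 2 of 3 exons match
```

The two directions give different percentages because the denominators differ, which is why the questions ask for the calculation from the point of view of each training set.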
6. Now you need to determine if these gene models are really genes.
In class we talked about extrinsic data that can help predict and/or confirm gene models. Do you
remember one source of extrinsic data? Talk among your group before proceeding. Use the
class notes if you have to.
ESTs represent expressed genes. If a predicted gene model is homologous to an EST,
then it is confirmation that the gene model is most likely correct. So, where is a good source of
EST information to perform this analysis? Also, what analytic tool would you use?
PlantGDB makes available a tremendous amount of EST data. They have collated
this into PlantGDB EST Assemblies. These assemblies are a collection of Plant Unique
Transcripts (PUT) that are derived by analyzing available EST sequence data and collapsing
overlapping sequences into a single PUT. We discussed these in class. Those ESTs that cannot
be overlapped with another to form a PUT are called singletons. For any one species, the
collection of PUTs and singletons is called the Transcript Assembly. (I know the terminology is
a bit confusing, but you will get used to it.) As you are probably now aware, the BAC clone you
are working with is from soybean. Therefore, it is appropriate that you use the Soybean Gene
The second question regards the analytical tool to use. We discussed the Blast algorithms
in class. Do you remember these? Blast lets you compare a sequence with a database and determine
if the database contains a homolog. It will also put a statistical measure (E-value) on the similarity
between the query and the hit. You will now perform a Blast analysis on the soybean PUT
Assemblies at PlantGDB to test the validity of some of the gene models predicted by the two
training sets.
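Once you have Blast hits, you will need a rule for deciding which ones count as support. The sketch below shows one way to apply an E-value cutoff. It assumes the hits are available in the standard 12-column tab-delimited Blast layout (query, subject, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score). The web server may display results differently, so this is an illustration only. The cutoff of 1e-10, the function names, and the sample hits are all made up.

```python
def parse_tabular_hits(text):
    """Read tab-delimited Blast hits (12-column layout) into dictionaries."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        hits.append({
            "query": cols[0],
            "subject": cols[1],
            "percent_identity": float(cols[2]),
            "alignment_length": int(cols[3]),
            "evalue": float(cols[10]),
            "bit_score": float(cols[11]),
        })
    return hits


def supported(hits, max_evalue=1e-10):
    """Keep hits whose E-value is at or below the cutoff (an assumed value)."""
    return [hit for hit in hits if hit["evalue"] <= max_evalue]


# Invented hits for one gene model (not real search results).
example_hits = (
    "model1\tPUT-hypothetical-1\t98.5\t200\t3\t0\t1\t200\t15\t614\t1e-110\t395\n"
    "model1\tPUT-hypothetical-2\t35.0\t60\t39\t2\t10\t69\t100\t279\t0.5\t30.1\n"
)

for hit in supported(parse_tabular_hits(example_hits)):
    print(hit["subject"], hit["evalue"])
```

A smaller E-value means the match is less likely to have occurred by chance, so lowering max_evalue makes the rule stricter. Whatever cutoff your group chooses, state it in your report.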
First, go to the “PlantGDB EST Assemblies - Overview” page at PlantGDB. The URL is:
The EST Assemblies are built from individual EST
sequences. Sequences that overlap are built into a longer sequence. These assemblies were the
earliest representation of gene sequences for many species. Read the details on the Overview
page to find out more about this data collection. Then select “BLAST Server” under the
“Related Links” section.
Now you are to perform a Blast search with five protein sequences that are predicted by
your FGENESH HMM analysis. Pick at least one unique gene from each training set. You will
perform the Blast analysis against the soybean Plant TAs.
a. First, you need to get the amino acid sequence. You can copy this from the bottom of the
FGENESH output. Only take the amino acid sequence, don’t take the header information.
For some reason, the header layout as exported by FGENESH is not compatible with the
Blast software.
b. On the “Plant GDB BLAST” page type “Glycine max” into the space under the
“Individual Species”. Click on the term “Glycine Max” and check the “PUT” box.
c. Now paste the sequence into the box that says “Paste query sequence(s) in FASTA
d. Next, select tblastn as the “Program”.
e. And finally, click on the “Run BLAST” button.
For your report answer these questions.
7. What criteria are you using to determine if the gene model is supported by PUT EST
assembly data?
8. Of the five models you analyzed, how many were supported by PUT EST data?
9. If the model is not supported by the PUT EST data, is it necessarily true that the model is
incorrect? Please explain your answer.
10. Is protein data the only form the query might take? Is another form better? Why? Why
do you think the protein sequence rather than the nucleotide sequence is used?
11. Perform the analysis with three corresponding nucleotide sequences. (Note: You need to
use the blastn algorithm for this analysis.) Do you observe any differences with your
results? If so, why do you think this is the case?
12. Which sequence would you believe is more accurate, the PUT derived from the ESTs or
the BAC sequencing? Why?
13. Of those that were supported, can they be annotated?
14. What name would you give to the genes that can be annotated?
7. Are the soybean BAC genes found in model plant species?
Next, it is important to determine if your gene is also present in model species. The two
model plant species for which we have full genomic information are Arabidopsis and rice. To
determine if the soybean genes you just analyzed have a homolog in another species, you will
use two other WWW sites. The first is The Arabidopsis Information Resource (TAIR), and the second is the Rice Genome Annotation Project.
At each of these sites, here is what you are to do: run a blastp analysis to determine if a
homolog to the five soybean BAC genes is present in the genome of the model plant species
Arabidopsis and rice. I will let you figure out how to set up the blast search, but remember to
select the appropriate program, the appropriate database, and turn off the filter. For this
assignment, please answer these questions.
15. Were homologs discovered for the five soybean genes in these two other species?
16. What criteria did you use to determine that these were indeed homologs?
17. Use the information here to annotate the soybean genes.
18. Are the annotations the same as what you used from the TIGR Soybean Gene Index?
19. If not, why might they have been different?
8. Are the soybean BAC genes found in other plant species?
The last aspect of the assignment is to determine if the five gene models represent
conserved genes present in other plant or non-plant species. The home for all gene and protein
sequences is GenBank. This information is stored at the National Center for Biotechnology
Information (NCBI). The URL for NCBI is:
If you have not visited this site before, I suggest that you spend some time here to see what is available.
Again you are going to analyze the five BAC models you have worked with so far. Here
are detailed instructions on how to perform Blast analysis at NCBI.
a. Go to the NCBI homepage and click on the “BLAST” link.
b. You will be performing a blastp analysis. From this page, select the “protein blast”
link. This will link you to a page where you will select the blastp algorithm.
(You should study the Blast page for future reference.)
c. Copy your protein sequence (again without the header) into the “Enter accession
number, gi, or FASTA sequence” box.
d. We will want to search against all plant species present in GenBank. In the
“Organism” box enter “Viridiplantae (taxid:33090)”.
e. To view the Blast parameters, click on “Algorithm parameters”.
f. Now hit the “BLAST” button. This will start the analysis and produce your
Please answer these final questions for each gene.
20. What is the range of E-values for the top 50 hits?
21. What species does this analysis suggest have homologs to your soybean query?
22. Study the annotation for these hits. Now compare the annotation here with your
annotation based on the TIGR TCs and the model species (Arabidopsis and rice) analysis.
23. Based on all of the searches you have performed, provide a name for each of the five
genes you are working with here.
24. What evidence was most instructive for you during the naming process?
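For questions 20 and 21, a short script can summarize a list of hits. The sketch below assumes that each hit description ends with the organism name in square brackets, as in “protein name [Glycine max]”, and that you have copied the descriptions and E-values out of the results yourself. The function names and every value shown are invented for illustration.

```python
import re
from collections import Counter


def species_counts(titles):
    """Count hits per species, reading the last [bracketed] name in each title."""
    counts = Counter()
    for title in titles:
        names = re.findall(r"\[([^\[\]]+)\]", title)
        if names:
            counts[names[-1]] += 1
    return counts


def evalue_range(evalues):
    """Return the smallest and largest E-value in the list."""
    return min(evalues), max(evalues)


# Invented hit descriptions and E-values (not real search results).
titles = [
    "hypothetical protein A [Glycine max]",
    "hypothetical protein B [Medicago truncatula]",
    "hypothetical protein C [Glycine max]",
]
evalues = [1e-120, 3e-45, 2e-80]

print(species_counts(titles))
print(evalue_range(evalues))
```

Titles without a bracketed name are simply skipped, so compare the total count with the number of hits you pasted in.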
