D. grimshawi Contig 15 GEP Report

Contig 15 of Drosophila grimshawi
Samantha Kubeck
December 16, 2010
Abstract:
Contig 15 of D. grimshawi is 39,223 base pairs long and has a repeat content of 16.45%.
Most of the repeats are LINE elements (13.87% of the contig), followed distantly by LTR
elements (0.84%). The contig has five full genes (CG31999, yellow-h, rho-5, CG4038, and Ank)
and one partial gene, CG5262. Overall, these genes matched up well with their D. melanogaster
orthologs. However, three of the six genes are not on chromosome 4 (the dot chromosome) in the
D. melanogaster genome. This may suggest a recombination event, since those genes are not found
anywhere else in the D. grimshawi genome and have vital functions. No UTRs could be found on
the D. grimshawi contig. A CLUSTAL alignment of yellow-h across the Drosophila species
showed high conservation. This was further supported by low Ka/Ks values, which suggest
purifying selection.
Genes:
CG5262: CG5262 is a protein-coding gene whose transcript produces a transmembrane
protein that transports amino acids (www.ebi.ac.uk, pfam.janelia.org). The gene has four total
exons, as does the D. melanogaster gene. The gene is incomplete at the 5’ end: exon 1 is
completely missing and exon 2 is partial. The presence of four total exons in the D. grimshawi
gene was verified by another student who has the entire gene.
Figure 1: CG5262 on contig 15 of D. grimshawi
Interestingly, CG5262 is on chromosome 4 in D. grimshawi but on chromosome 3L in D.
melanogaster. A FlyBase search for CG5262 across the entire D. grimshawi genome showed that it
is only present on chromosome 4, which may suggest that a rearrangement (for example, a
transposition or translocation) moved the gene at some point after D. grimshawi and D.
melanogaster diverged.
Figure 2: CG5262 of D. melanogaster on chromosome 3L, rather than on chromosome 4 as in D. grimshawi (flybase.org).
The exon-by-exon BLAST showed that D. melanogaster and D. grimshawi both have four exons
that line up well, but not perfectly. However, the exon boundaries could not be adjusted
because there were no nearby splice sites that would allow it.
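The search for nearby splice sites can be illustrated with a minimal sketch. It assumes the canonical GT donor and AG acceptor dinucleotides on the forward strand; the function name, the window size, and the example sequence are hypothetical and are not taken from contig 15.

```python
def candidate_splice_sites(seq, boundary, window=10):
    """Return 0-based start positions of GT (donor) and AG (acceptor)
    dinucleotides that begin within `window` bases of `boundary`."""
    seq = seq.upper()
    start = max(0, boundary - window)
    end = min(len(seq) - 1, boundary + window)
    donors = [i for i in range(start, end) if seq[i:i + 2] == "GT"]
    acceptors = [i for i in range(start, end) if seq[i:i + 2] == "AG"]
    return donors, acceptors


# Hypothetical sequence for illustration only (not from contig 15).
example = "CCCAAACCCGTAAGTCCCCCCCCTTTCAGCCC"
donors, acceptors = candidate_splice_sites(example, boundary=10, window=5)
print("candidate donors:", donors)
print("candidate acceptors:", acceptors)
```

If both lists are empty for a reasonable window, the predicted exon boundary cannot be moved without breaking the splice signal, which is the situation described above.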
Figure 3: Blast2Seq alignment of CG5262 D. grimshawi (query) against D. melanogaster (subject). Notice that D.
grimshawi begins on amino acid 136 of D. melanogaster, verifying that the 5’ end is cut off from the contig
(blast.ncbi.nlm.nih.gov).
(in bp)    Dgri    Dmel
5’ UTR     N/A     121
Exon1      N/A     51
Exon2      144     494
Exon3      245     221
Exon4      764     759
3’ UTR     N/A     396
Table 1: Exon-by-exon alignment of CG5262. Notice that exon 2 is much shorter in Dgri because it is cut off.
CG31999: CG31999 is a protein-coding gene whose transcript produces proteins with
the following properties: EGF-type aspartate/asparagine hydroxylation site, EGF-like calcium
binding site, growth factor receptor region, cell adhesion, extracellular matrix, and calcium ion
binding (www.ebi.ac.uk, pfam.janelia.org, geneontology.org). The D. grimshawi gene has 14
exons whereas the D. melanogaster ortholog has 13 exons. No UTRs were identified in Dgri.
Figure 4: CG31999 on contig15 of D. grimshawi.
Figure 5: CG31999 in D. melanogaster.
When performing the exon-by-exon alignments, there were a few differences between the two
orthologs. A tblastn search of contig 15 with exon 1 of D. melanogaster CG31999 produced
a hit, but it was at position 16602, which is an intronic region of yellow-h.
Figure 6: Tblastn (D. melanogaster is the query, D. grimshawi is the subject) showing the D. melanogaster exon 1
appears at 16602 on contig 15, which is in an intronic region of yellow-h (blast.ncbi.nlm.nih.gov).
A FlyBase search was conducted to see if the D. grimshawi exon 1 was conserved in
other Drosophila species, but no results were found. This suggests that exon 1 of D. grimshawi
should be omitted from the gene, but removing it truncates exon 2 because the
closest methionine is 25 amino acids into exon 2. This is not acceptable since exon 2 aligned
perfectly to the D. melanogaster exon 2 prior to modification. Thus, exon 1 is kept in the
annotation.
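The cost of dropping exon 1 can be expressed as the offset of the first internal methionine in the exon 2 translation. The sketch below is illustrative only: the peptide string is a hypothetical placeholder built to have its first methionine 25 residues in, and it is not the actual CG31999 sequence.

```python
def first_methionine(protein):
    """Return the 0-based index of the first 'M' in a protein string, or -1."""
    return protein.upper().find("M")


# Hypothetical exon 2 translation: 25 residues followed by a methionine.
exon2_translation = "A" * 25 + "MKLV"
offset = first_methionine(exon2_translation)
print(f"{offset} residues would be lost if translation started at the first internal M")
```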
Figure 7: Exon 2 of CG31999 in D. grimshawi showing the internal methionine that shortens the transcript when
exon 1 is removed.
Another difference between the two orthologs of CG31999 is that exons 2 and 3 in D. grimshawi
correspond to exon 2 in D. melanogaster. This was found by performing BLAST alignments with
exons 2 and 3 of D. grimshawi against exon 2 of D. melanogaster.
Figure 8: Blast alignments of exon 2 and 3 in D. grimshawi and exon 2 in D. melanogaster. The first alignment is of
Dgri exon 2 (query) vs. Dmel exon 2 (subject). Both exons start on the same amino acid but Dmel extends
approximately 75 amino acids further than Dgri. The second alignment is of Dgri exon 3 (query) vs. Dmel exon 2
(subject). Dmel begins 92 amino acids before Dgri but they end at the same region. The third alignment is of Dgri
exon 2 and 3 spliced together (query) vs. Dmel exon 2 (subject). This shows a nearly complete transcript
(blast.ncbi.nlm.nih.gov).
The BLAST alignments indicate that exons 2 and 3 in D. grimshawi should be merged to fit the D.
melanogaster transcript, but when this was attempted, a stop codon was found between the two
exons. Thus, they cannot be merged.
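The check that blocks the merge can be sketched as an in-frame scan for stop codons across the joined sequence. This is a minimal illustration that assumes the standard stop codons (TAA, TAG, TGA) and a known reading frame; the three sequence fragments are hypothetical and are not the real CG31999 exons.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(seq, frame=0):
    """Return 0-based positions of stop codons read in the given frame."""
    seq = seq.upper()
    return [i for i in range(frame, len(seq) - 2, 3) if seq[i:i + 3] in STOP_CODONS]


# Hypothetical fragments for illustration only.
exon_a = "ATGGCT"
intervening = "GTATGACCAG"
exon_b = "GGTCTT"
merged = exon_a + intervening + exon_b

stops = in_frame_stops(merged)
print("in-frame stop codons at:", stops)
```

A non-empty result means that reading straight through from one exon into the next would terminate translation early, so the two exons have to stay separate in the annotation.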
Figure 9: Exons 2 and 3 of CG31999 merged. Shows an internal stop codon preventing this change.
(in bp)    Dmel    Dgri
5’ UTR     120     N/A
Exon1      73      22
Exon2      495     273+246=519
Exon3      330     327
Exon4      222     192
Exon5      83      83
Exon6      58      58
Exon7      294     255
Exon8      138     138
Exon9      445     472
Exon10     133     133
Exon11     89      89
Exon12     231     231
Exon13     159     163
3’ UTR     259     N/A
Table 2: Exon-by-exon alignment of CG31999. Note that I considered exon 2 and 3 of Dgri as exon 2 to properly
align the table.
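The comparison in Table 2 can be summarized with a short script. The exon lengths are copied from the table, with Dgri exons 2 and 3 summed as in the table; the variable names are only illustrative.

```python
# Coding exon lengths in bp from Table 2 (UTRs omitted).
dmel = [73, 495, 330, 222, 83, 58, 294, 138, 445, 133, 89, 231, 159]
dgri = [22, 273 + 246, 327, 192, 83, 58, 255, 138, 472, 133, 89, 231, 163]

# Per-exon difference, Dgri minus Dmel.
for number, (m, g) in enumerate(zip(dmel, dgri), start=1):
    print(f"Exon{number}: Dmel={m} Dgri={g} diff={g - m:+d}")

# Exons whose lengths are identical in both species.
identical = [n for n, (m, g) in enumerate(zip(dmel, dgri), start=1) if m == g]
print("identical lengths:", identical)
print("total coding length:", sum(dmel), "bp (Dmel) vs", sum(dgri), "bp (Dgri)")
```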
Yellow-h: Yellow-h produces a protein in the same family as major royal jelly, which is
produced by honeybees. Although the Drosophila species do not produce actual royal jelly, their
protein product is similar in structure and contains a six-bladed beta propeller (www.ebi.ac.uk,
pfam.janelia.org, geneontology.org). Both the D. grimshawi and the D. melanogaster yellow-h
have three exons, which align well. No UTRs were found. However, a rearrangement occurred
between CG31999 and yellow-h which put them on the same strand in Dgri, whereas they are
on opposite strands in Dmel.
Figure 10: CG31999 and yellow-h on opposite strands in D. melanogaster
Figure 11: CG31999 and yellow-h on the same strand in D. grimshawi.
Despite the rearrangement, the genes still are highly conserved. The exon boundaries are not
exact, but they could not be modified because there are no splice sites that allow it.
Figure 12: BLAST alignments of Dgri yellow-h (query) against Dmel yellow-h (subject) (blast.ncbi.nlm.nih.gov).
(in bp)    Dmel    Dgri
5’ UTR     81      N/A
Exon1      236     215
Exon2      667     658
Exon3      486     495
3’ UTR     32      N/A
Table 3: Exon-by-exon alignment of yellow-h.
Rho-5: Rho-5 produces a protein in the rhomboid family with signal transduction and
serine endopeptidase-like activity. It also contains an S54 domain and is an integral membrane
protein (ebi.ac.uk, pfam.janelia.org, geneontology.org). It has seven exons in Dgri versus six in
Dmel, because exons 1 and 2 in Dgri correspond to exon 1 in Dmel. No UTRs
were found. This gene is on chromosome 2L in Dmel but on the dot chromosome in
Dgri.
Figure 13: BLAST alignments of rho-5 exons 1 and 2 in Dgri against rho-5 exon 1 in Dmel. The first alignment
shows that exon 1 in Dgri (query) and exon 1 in Dmel (subject) start at approximately the same region but it ends
early in Dgri. The second alignment shows that exon 2 in Dgri (query) starts approximately 600 base pairs into exon
1 in Dmel (blast.ncbi.nlm.nih.gov).
These results suggest that exons 1 and 2 in Dgri should be combined into one exon to match Dmel.
However, the intronic region between the two exons contains a stop codon,
which prevents them from being combined.
Figure 14: In rho-5, exons 1 and 2 cannot combine because a stop codon exists between them.
In addition, there is a large transposable element separating the two exons. It is hypothesized that the
transposable element inserted itself within what was originally the first exon in Dgri, and that this
insertion created the two separate exons seen in Dgri.
Figure 15: A large R1 transposable element separating exon 1 and 2 in rho-5 of Dgri.
Initially, no counterpart of Dmel exon 6 was annotated in Dgri because the region was marked as
a repetitive (TA)n sequence. However, a gene predictor detected a coding region where exon 6 should
have been, and BLAST alignments confirmed that it matches exon 6 in Dmel.
Figure 16: BLAST alignment of hypothesized exon 6 of rho-5 in Dgri (query) vs. exon 6 of rho-5 in Dmel (subject)
(blast.ncbi.nlm.nih.gov).
Upon closer inspection, the repeat annotation does not look entirely legitimate, so it was
discarded and the hypothesized rho-5 exon 6 of Dgri was placed there instead.
Figure 17: Repetitive region where exon 6 of rho-5 in Dgri should be placed.
After these changes were made, rho-5 in Dgri aligned very well with rho-5 in Dmel.
Figure 18: BLAST alignment of rho-5 in Dgri (query) vs. Dmel (subject). Note the break in the transcript comes
from the inability to combine exons 1 and 2 in Dgri (blast.ncbi.nlm.nih.gov).
(in bp)    Dmel    Dgri
5’ UTR     244     N/A
Exon1      2871    1259+1139=2398
Exon2      342     342
Exon3      255     255
Exon4      258     258
Exon5      285     300
Exon6      276     243
3’ UTR     45      N/A
Table 4: Exon-by-exon alignment of rho-5. Note that exons 1 and 2 of Dgri are combined because they correspond to
exon 1 in Dmel.
CG4038: CG4038 produces a protein involved in rRNA processing, rRNA pseudouridine
synthesis, rRNA pseudouridylation guide activity, snoRNA binding, and has the following
structures: small nuclear ribonucleoprotein complex, small nucleolar ribonucleoprotein complex,
H/ACA ribonucleoprotein complex, Gar1/Naf1 binding region, and a translation
elongation/initiation factor site (ebi.ac.uk, pfam.janelia.org, geneontology.org). This gene is
located on chromosome 2R in Dmel as opposed to the dot chromosome in Dgri. CG4038 has
three exons in both Dgri and Dmel, but exon 1 was difficult to align because of a long (G)n
segment. Upon visual inspection, another exon prediction appeared to match better,
so it was used.
Figure 19: Screenshot showing similarities between exon 1 prediction of CG4038 in Dgri (bottom) vs. exon 1 of
CG4038 in Dmel (top).
The other exons aligned, but are much shorter than the Dmel transcript. The exons could not be
modified though because there are no nearby splice sites that allow it.
Figure 20: BLAST alignment for CG4038 of Dgri (query) vs. Dmel (subject). Note the coverage is only 50%
because the gene is small and any changes make a large impact in query coverage (blast.ncbi.nlm.nih.gov).
(in bp)    Dmel    Dgri
5’ UTR     72      N/A
Exon1      190     40
Exon2      381     483
Exon3      138     104
3’ UTR     190     N/A
Table 5: Exon-by-exon alignment of CG4038.
Ank: Ank produces a protein involved in cytoskeletal anchoring at the plasma
membrane, signal transduction, cytoskeletal protein binding, and has the following properties:
ankyrin repeats, ZU5 domain, DEATH domain, fusome domain, spectrosome domain, and a
structural constituent of cytoskeleton (ebi.ac.uk, pfam.janelia.org, geneontology.org). There are
nine exons in both the Dgri and Dmel transcripts, but exon 1 did not align at all. A tblastn search did not
produce any results, but a FlyBase search across all orthologs showed conservation of exon 1 in
D. mojavensis and D. virilis. This is notable since those two species are the most closely related
to Dgri.
Figure 21: FlyBase search of exon 1 in Ank for Dgri (flybase.org).
Figure 22: Species tree of all Drosophila species showing that D. mojavensis and D. virilis are most closely related to
Dgri (flybase.org).
Because of this conservation, exon 1 in Dgri is kept in the annotation. All the other exons align
well except for exon 9 which had gaps in the exonic sequence and thus could not be altered.
Overall, the gene aligns well due to its large size. There are six splice variants of Ank in Dmel,
but all the variation is due to differences in UTRs. Unfortunately, no UTRs could be found in
Dgri so at this point, only one variant of Ank is present in Dgri. The alignment for this is shown
below.
Figure 23: BLAST alignment of Ank for Dgri (query) vs. Dmel (subject) (blast.ncbi.nlm.nih.gov).
(in bp)    Dmel    Dgri
Exon1      121     141
Exon2      108     108
Exon3      1114    1114
Exon4      104     104
Exon5      942     948
Exon6      325     328
Exon7      94      97
Exon8      225     225
Exon9      1629    1669
Table 6: Exon-by-exon alignment of Ank. Note that the UTRs were omitted because they vary between the different
splice variants of Dmel.
Rearrangement of Chromosomes: As stated previously, the genes on contig 15 are not
all on the dot chromosome in Dmel. CG5262 is located on chromosome 3L in Dmel, rho-5 is
located on chromosome 2L in Dmel, and CG4038 is located on chromosome 2R in Dmel.
Figure 24: Contig 15 (top) showing the gene rearrangements from Dmel (bottom) (flybase.org)
These rearrangements could either be recombination events that moved the
genes between chromosomes, or the copies on contig 15 could be pseudogenes. A FlyBase search was
conducted for each of the genes that are not located on the dot chromosome in Dmel, and no
additional copies were found, indicating that the genes present on contig 15 are one of a kind in Dgri. Since
these genes have vital cell functions (such as amino acid transport and cytoskeletal
attachment to the plasma membrane), it is reasonable to assume that they are not pseudogenes and
that the altered arrangement is due to recombination events during speciation.
Repeats: Contig 15 had three large LINE elements, five LTR elements, three DNA elements,
three unclassified elements, and two simple repeats, which comprised 13.87, 0.84, 0.51, 0.84, and
0.39 percent of the contig, respectively. This gave a total repeat content of 16.45% for contig 15.
There was a mixture of transposable elements and simple repeats throughout the contig, but
nothing notable except for a very large LINE/LOA element spanning positions
5788-10485 (4697 bp). Although this is a very large transposable element, many LINE elements
in Dmel are of comparable size. The Doc family, for
example, has transposable elements around the 4700 bp range. Thus, this annotation is plausible and
can be kept in its current state.
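The repeat figures can be cross-checked with a few lines of arithmetic. The values below come from the text above, taking the unclassified fraction as 0.84 percent; the dictionary keys and variable names are only illustrative.

```python
# Percent of contig 15 covered by each repeat class, as reported above.
repeat_percent = {
    "LINE": 13.87,
    "LTR": 0.84,
    "DNA": 0.51,
    "Unclassified": 0.84,
    "Simple repeats": 0.39,
}
total = round(sum(repeat_percent.values()), 2)
print(f"total repeat content: {total}%")

# Size of the large LINE/LOA element relative to the whole contig.
contig_length = 39223
line_start, line_end = 5788, 10485
span = line_end - line_start  # 4697 bp as reported (4698 if both ends are counted)
fraction = round(100 * span / contig_length, 2)
print(f"LINE/LOA element: {span} bp, about {fraction}% of the contig")
```

The class percentages add up to the reported 16.45%, and this single LINE/LOA element accounts for most of the LINE fraction.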
Figure 25: Large LINE/LOA element on contig 15.
Figure 26: Dmel LINE-like transposable element sizes highlighting the Doc family.
Figure 27: Repeat report for contig 15.
CLUSTAL Alignment and Ka/Ks values for Yellow-h: The CLUSTAL alignment was done on
yellow-h, the gene whose protein product is related to major royal jelly produced by honeybees.
Yellow-h from contig 15 was aligned against all of the Drosophila orthologs, and the alignment
showed high conservation at the amino acid level.
Figure 28: CLUSTAL alignment of yellow-h for Dper, Dpseudo, Dgri, Dgri contig 15, Dmoj, Dvir, Dwill, Dere,
Dyak, Dmel, Dsim, and Dana.
Species             Ka/Ks
D. ananassae        0.0073
D. erecta           0.0046
D. melanogaster     0.0051
D. mojavensis       0.0921
D. persimilis       0.0241
D. pseudoobscura    0.0219
D. simulans         0.0251
D. virilis          0.0976
D. willistoni       0.0246
D. yakuba           0.0050
Figure 29: Ka/Ks analysis of each Drosophila species based on the CLUSTAL alignment.
Figure 30: Bootstrapped cladogram with corresponding Ka/Ks values. Green stars indicate agreement with the
Drosophila species tree, red stars indicate a change from the species tree.
Figure 31: Drosophila species tree (flybase.org).
When viewing the bootstrapped cladogram against the Drosophila species tree (Figure 22),
one can see yellow-h is highly conserved between the Drosophila species. This is further
supported by the very low Ka/Ks values (all well below 1), indicating purifying selection. The only change that
occurred was that Dgri moved from the clade containing Dmoj and Dvir to the clade containing
Dwill, Dana, etc. The lowest Ka/Ks value was obtained for Dere (0.0046), so by this measure the Dere
yellow-h is the most similar to that of Dgri, although Ka/Ks reflects the constraint on yellow-h
rather than the species relationship.
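The interpretation of the Ka/Ks table can be sketched in a few lines. The values are copied from the table above, keyed by the species abbreviations used in this report. The threshold of 1 is the conventional dividing line (below 1 suggests purifying selection, above 1 suggests positive selection) and is not a value produced by this analysis.

```python
# Ka/Ks values for yellow-h, copied from the table above.
ka_ks = {
    "Dana": 0.0073,
    "Dere": 0.0046,
    "Dmel": 0.0051,
    "Dmoj": 0.0921,
    "Dper": 0.0241,
    "Dpseudo": 0.0219,
    "Dsim": 0.0251,
    "Dvir": 0.0976,
    "Dwill": 0.0246,
    "Dyak": 0.0050,
}

# Ka/Ks < 1 is conventionally read as purifying selection.
all_purifying = all(value < 1 for value in ka_ks.values())
lowest = min(ka_ks, key=ka_ks.get)
highest = max(ka_ks, key=ka_ks.get)

print("all comparisons below 1:", all_purifying)
print("lowest Ka/Ks:", lowest, ka_ks[lowest])
print("highest Ka/Ks:", highest, ka_ks[highest])
```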
Conclusion: Contig 15 is very gene rich, with six genes within approximately 39,000 base
pairs. These genes play key roles in the cell such as amino acid transport, EGF-like calcium
binding, cytoskeletal attachment to the plasma membrane, serine protease activity, etc. The
repeat content of the contig is within a reasonable range for Drosophila (16.45% as
compared to approximately 25% for the rest of the Dgri genome). Finally, the CLUSTAL
alignment and Ka/Ks values show high conservation of the yellow-h gene across the Drosophila
species.
References:
BLAST: Basic Local Alignment Search Tool. Web. 15 Dec. 2010.
<http://blast.ncbi.nlm.nih.gov>.
European Bioinformatics Institute | Homepage | EBI. Web. 15 Dec. 2010.
<http://www.ebi.ac.uk/>.
FlyBase Homepage. Web. 15 Dec. 2010. <http://flybase.org/>.
The Gene Ontology. Web. 15 Dec. 2010. <http://www.geneontology.org/>.
Protein Family Search Database. Howard Hughes Medical Institute. Web. 14 Dec. 2010.
<http://pfam.janelia.org>.