Contig 15 of Drosophila grimshawi
Samantha Kubeck
December 16, 2010

Abstract:
Contig 15 of D. grimshawi is 39,223 base pairs long and has a repeat content of 16.45%. Most of the repeats are LINE elements (13.87%), followed by LTRs (0.84%). The contig has five full genes (CG31999, yellow-h, rho-5, CG4038, and Ank) and one partial gene, CG5262. Overall, these genes matched up well with the D. melanogaster orthologs. However, three of the six genes do not appear on chromosome 4 (the dot chromosome) in the D. melanogaster genome. This may suggest a recombination event, since those genes are not found anywhere else in the D. grimshawi genome and have vital functions. No UTRs could be found on the D. grimshawi contig. A CLUSTAL alignment of yellow-h across all Drosophila species showed high conservation. This was further bolstered by low Ka/Ks values, which suggest purifying selection.

Genes:

CG5262:
CG5262 is a protein-coding gene whose transcript produces a transmembrane protein that transports amino acids (www.ebi.ac.uk, pfam.janelia.org). The gene has four exons, as does the D. melanogaster gene. The gene is incomplete at the 5’ end: exon 1 is completely missing and exon 2 is partial. The presence of four exons in the D. grimshawi gene was verified by another student who has the entire gene.

Figure 1: CG5262 on contig 15 of D. grimshawi.

Interestingly, CG5262 is on chromosome 4 in D. grimshawi but on chromosome 3L in D. melanogaster. A FlyBase search of CG5262 across the entire D. grimshawi genome showed that it is only present on chromosome 4, which may suggest that a rearrangement such as a translocation occurred at some point during speciation between D. grimshawi and D. melanogaster.

Figure 2: CG5262 of D. melanogaster on chromosome 3L instead of chromosome 4 as in D. grimshawi (flybase.org).

The exon-by-exon BLAST showed that D. melanogaster and D. grimshawi both have four exons that line up well, but not perfectly.
However, the exon boundaries could not be modified because there were no nearby splice sites that would allow it.

Figure 3: Blast2Seq alignment of CG5262 D. grimshawi (query) against D. melanogaster (subject). Notice that D. grimshawi begins on amino acid 136 of D. melanogaster, verifying that the 5’ end is cut off from the contig (blast.ncbi.nlm.nih.gov).

(in bp)   Dgri   Dmel
5’ UTR    N/A    121
Exon 1    N/A    51
Exon 2    144    494
Exon 3    245    221
Exon 4    764    759
3’ UTR    N/A    396

Table 1: Exon-by-exon alignment of CG5262. Notice that exon 2 is much shorter in Dgri because it is cut off.

CG31999:
CG31999 is a protein-coding gene whose transcript produces proteins with the following properties: EGF-type aspartate/asparagine hydroxylation site, EGF-like calcium binding site, growth factor receptor region, cell adhesion, extracellular matrix, and calcium ion binding (www.ebi.ac.uk, pfam.janelia.org, geneontology.org). The D. grimshawi gene has 14 exons whereas the D. melanogaster ortholog has 13 exons. No UTRs were identified in Dgri.

Figure 4: CG31999 on contig 15 of D. grimshawi.

Figure 5: CG31999 on contig 15 of D. melanogaster.

The exon-by-exon alignments revealed a few differences between the two orthologs. A tblastn of D. melanogaster CG31999 exon 1 against contig 15 produced a hit, but it was at 16602 bp, which is in an intronic region of yellow-h.

Figure 6: Tblastn (D. melanogaster is the query, D. grimshawi is the subject) showing that D. melanogaster exon 1 appears at 16602 on contig 15, which is in an intronic region of yellow-h (blast.ncbi.nlm.nih.gov).

A FlyBase search was conducted to see if the D. grimshawi exon 1 was conserved in other Drosophila species, but no results were found. This indicates that exon 1 of D. grimshawi should be omitted from the gene, but removing it truncates exon 2, because the closest methionine is 25 amino acids into exon 2. This is not acceptable since exon 2 aligned perfectly to the D.
melanogaster exon 2 prior to modification. Thus, exon 1 is kept in the annotation.

Figure 7: Exon 2 of CG31999 in D. grimshawi showing the internal methionine that shortens the transcript when exon 1 is removed.

Another difference between the two orthologs of CG31999 is that exons 2 and 3 in D. grimshawi correspond to exon 2 in D. melanogaster. This was found by performing BLAST alignments of exons 2 and 3 of D. grimshawi against exon 2 of D. melanogaster.

Figure 8: BLAST alignments of exons 2 and 3 in D. grimshawi and exon 2 in D. melanogaster. The first alignment is of Dgri exon 2 (query) vs. Dmel exon 2 (subject). Both exons start on the same amino acid, but Dmel extends approximately 75 amino acids further than Dgri. The second alignment is of Dgri exon 3 (query) vs. Dmel exon 2 (subject). Dmel begins 92 amino acids before Dgri, but they end at the same region. The third alignment is of Dgri exons 2 and 3 spliced together (query) vs. Dmel exon 2 (subject). This shows a nearly complete transcript (blast.ncbi.nlm.nih.gov).

The BLAST alignments indicate that exons 2 and 3 in D. grimshawi should be merged to fit the D. melanogaster transcript, but when this was attempted, a stop codon was found between the two exons. Thus, they cannot be merged.

Figure 9: Exons 2 and 3 of CG31999 merged, showing the internal stop codon that prevents this change.

(in bp)   Dmel   Dgri
5’ UTR    120    N/A
Exon 1    73     22
Exon 2    495    273+246=519
Exon 3    330    327
Exon 4    222    192
Exon 5    83     83
Exon 6    58     58
Exon 7    294    255
Exon 8    138    138
Exon 9    445    472
Exon 10   133    133
Exon 11   89     89
Exon 12   231    231
Exon 13   159    163
3’ UTR    259    N/A

Table 2: Exon-by-exon alignment of CG31999. Note that I considered exons 2 and 3 of Dgri as exon 2 to properly align the table.

Yellow-h:
Yellow-h produces a protein in the same family as major royal jelly, which is produced by honeybees.
Although the Drosophila species do not produce actual royal jelly, their protein product is similar in structure and contains a six-bladed beta propeller (www.ebi.ac.uk, pfam.janelia.org, geneontology.org). Both the D. grimshawi and the D. melanogaster yellow-h have three exons, which align well. No UTRs were found. However, a rearrangement occurred between CG31999 and yellow-h, which put them on the same strand in Dgri when they were originally on opposite strands in Dmel.

Figure 10: CG31999 and yellow-h on opposite strands in D. melanogaster.

Figure 11: CG31999 and yellow-h on the same strand in D. grimshawi.

Despite the rearrangement, the genes are still highly conserved. The exon boundaries are not exact, but they could not be modified because there are no splice sites that allow it.

Figure 12: BLAST alignments of Dgri yellow-h (query) against Dmel yellow-h (subject) (blast.ncbi.nlm.nih.gov).

(in bp)   Dmel   Dgri
5’ UTR    81     N/A
Exon 1    236    215
Exon 2    667    658
Exon 3    486    495
3’ UTR    32     N/A

Table 3: Exon-by-exon alignment of yellow-h.

Rho-5:
Rho-5 produces a protein in the rhomboid family with signal transduction and serine endopeptidase-like activity. It also contains an S54 domain and is an integral membrane protein (ebi.ac.uk, pfam.janelia.org, geneontology.org). It has seven exons in Dgri vs. six in Dmel, because exons 1 and 2 in Dgri correspond to exon 1 in Dmel. No UTRs were found, and this gene is found on chromosome 2L in Dmel instead of the dot chromosome as in Dgri.

Figure 13: BLAST alignments of rho-5 exons 1 and 2 in Dgri against rho-5 exon 1 in Dmel. The first alignment shows that exon 1 in Dgri (query) and exon 1 in Dmel (subject) start at approximately the same region, but the exon ends early in Dgri. The second alignment shows that exon 2 in Dgri (query) starts approximately 600 base pairs into exon 1 in Dmel (blast.ncbi.nlm.nih.gov).

These results suggest that exons 1 and 2 in Dgri should be combined into one exon to match Dmel.
However, when they are combined, a stop codon in the intronic region between the two exons prevents this action.

Figure 14: In rho-5, exons 1 and 2 cannot be combined because a stop codon exists between them.

In addition, there is a large transposable element separating the two exons. It is hypothesized that the transposable element inserted itself within what was originally the first exon in Dgri. This divergent event created the two separate exons seen in Dgri.

Figure 15: A large R1 transposable element separating exons 1 and 2 in rho-5 of Dgri.

Also, the counterpart of Dmel exon 6 was originally not annotated in Dgri because a repetitive (TA)n region was annotated there. However, a gene predictor detected a coding region where exon 6 should have been, and BLAST alignments verified that it matches exon 6 in Dmel.

Figure 16: BLAST alignment of hypothesized exon 6 of rho-5 in Dgri (query) vs. exon 6 of rho-5 in Dmel (subject) (blast.ncbi.nlm.nih.gov).

Upon closer inspection, the repeat region does not look entirely legitimate, so the repeat annotation was discarded and the hypothesized rho-5 exon 6 of Dgri was placed in its stead.

Figure 17: Repetitive region where exon 6 of rho-5 in Dgri should be placed.

After these changes were made, rho-5 in Dgri aligned very well with rho-5 in Dmel.

Figure 18: BLAST alignment of rho-5 in Dgri (query) vs. Dmel (subject). Note that the break in the transcript comes from the inability to combine exons 1 and 2 in Dgri (blast.ncbi.nlm.nih.gov).

(in bp)   Dmel   Dgri
5’ UTR    244    N/A
Exon 1    2871   1259+1139=2398
Exon 2    342    342
Exon 3    255    255
Exon 4    258    258
Exon 5    285    300
Exon 6    276    243
3’ UTR    45     N/A

Table 4: Exon-by-exon alignment of rho-5. Note that exons 1 and 2 for Dgri are combined because they correspond to exon 1 in Dmel.
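The merge test described for rho-5 exons 1 and 2 (and earlier for CG31999 exons 2 and 3) amounts to reading the two exons and the sequence between them as one open reading frame and looking for an in-frame stop codon. The following is a minimal sketch of that check in Python, not the tool used for this annotation. The function names and the short toy sequences are hypothetical and are not taken from contig 15.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_in_frame_stop(seq, frame=0):
    """Return the 0-based index of the first in-frame stop codon, or None."""
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return None


def stop_blocking_merge(exon_a, spacer, exon_b):
    """Read exon_a + spacer + exon_b as one exon in exon_a's frame.

    Returns the index of the first in-frame stop codon that starts
    before exon_b begins, or None if the merged sequence reads through.
    """
    merged = exon_a + spacer + exon_b
    stop = first_in_frame_stop(merged)
    if stop is not None and stop < len(exon_a) + len(spacer):
        return stop
    return None


# Toy sequences (hypothetical, for illustration only).
exon_a = "ATGGCTGCA"            # ATG GCT GCA
spacer_with_stop = "GTATAGCCA"  # GTA TAG CCA, contains an in-frame TAG
spacer_open = "GTACAGCCA"       # GTA CAG CCA, no in-frame stop
exon_b = "GGTCTGAAA"            # GGT CTG AAA

blocked_at = stop_blocking_merge(exon_a, spacer_with_stop, exon_b)
open_merge = stop_blocking_merge(exon_a, spacer_open, exon_b)
print("stop codon blocks merge at index:", blocked_at)
print("merge with open spacer blocked at:", open_merge)
```

With real data, the three arguments would be the two annotated exon sequences and the intervening sequence from the contig, all on the coding strand. The sketch assumes exon_a starts in frame 0 and does not handle frame shifts introduced by a spacer whose length is not a multiple of three.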
CG4038:
CG4038 produces a protein involved in rRNA processing, rRNA pseudouridine synthesis, rRNA pseudouridylation guide activity, and snoRNA binding, and it has the following structures: small nuclear ribonucleoprotein complex, small nucleolar ribonucleoprotein complex, H/ACA ribonucleoprotein complex, Gar1/Naf1 binding region, and a translation elongation/initiation factor site (ebi.ac.uk, pfam.janelia.org, geneontology.org). This gene is located on chromosome 2L in Dmel as opposed to the dot chromosome in Dgri. CG4038 has three exons in both Dgri and Dmel, but exon 1 was difficult to align because of a long (G)n segment. However, upon visual inspection, another exon prediction appeared to match up better, so it was used.

Figure 19: Screenshot showing similarities between exon 1 prediction of CG4038 in Dgri (bottom) vs. exon 1 of CG4038 in Dmel (top).

The other exons aligned, but are much shorter than the Dmel transcript. The exons could not be modified, though, because there are no nearby splice sites that allow it.

Figure 20: BLAST alignment for CG4038 of Dgri (query) vs. Dmel (subject). Note that the coverage is only 50% because the gene is small and any changes have a large impact on query coverage (blast.ncbi.nlm.nih.gov).

(in bp)   Dmel   Dgri
5’ UTR    72     N/A
Exon 1    190    40
Exon 2    381    483
Exon 3    138    104
3’ UTR    190    N/A

Table 5: Exon-by-exon alignment of CG4038.

Ank:
Ank produces a protein involved in cytoskeletal anchoring at the plasma membrane, signal transduction, and cytoskeletal protein binding, and it has the following properties: ankyrin repeats, ZU5 domain, DEATH domain, fusome domain, spectrosome domain, and a structural constituent of cytoskeleton (ebi.ac.uk, pfam.janelia.org, geneontology.org). There are nine exons in both the Dgri and Dmel transcripts, but exon 1 did not align at all. A tblastn did not produce any results, but a FlyBase search across all orthologs showed conservation of exon 1 in D. mojavensis and D. virilis.
This is interesting since those two species are most closely related to Dgri.

Figure 21: FlyBase search of exon 1 in Ank for Dgri (flybase.org).

Figure 22: Species tree of all Drosophila species showing that D. mojavensis and D. virilis are most closely related to Dgri (flybase.org).

Because of this conservation, exon 1 in Dgri is kept in the annotation. All the other exons align well except for exon 9, which had gaps in the exonic sequence and thus could not be altered. Overall, the gene aligns well due to its large size. There are six splice variants of Ank in Dmel, but all the variation is due to differences in UTRs. Unfortunately, no UTRs could be found in Dgri, so at this point only one variant of Ank is present in Dgri. The alignment for this variant is shown in Figure 23.

Figure 23: BLAST alignment of Ank for Dgri (query) vs. Dmel (subject) (blast.ncbi.nlm.nih.gov).

(in bp)   Dmel   Dgri
Exon 1    121    141
Exon 2    108    108
Exon 3    1114   1114
Exon 4    104    104
Exon 5    942    948
Exon 6    325    328
Exon 7    94     97
Exon 8    225    225
Exon 9    1629   1669

Table 6: Exon-by-exon alignment of Ank. Note that the UTRs were omitted because they vary between the different splice variants of Dmel.

Rearrangement of Chromosomes:
As stated previously, the genes on contig 15 are not all on the dot chromosome in Dmel. CG5262 is located on chromosome 3L in Dmel, rho-5 is located on chromosome 2L in Dmel, and CG4038 is located on chromosome 2R in Dmel.

Figure 24: Contig 15 (top) showing the gene rearrangements from Dmel (bottom) (flybase.org).

These rearrangements could either be recombination events that changed the position of the genes between the chromosomes, or the genes could be pseudogenes. A FlyBase search was conducted for each of the genes that are not located on the dot chromosome in Dmel, and there were no other results, indicating that the copies present on contig 15 are the only ones in Dgri.
Since these genes have vital cell functions (such as amino acid transport and cytoskeletal attachment to the plasma membrane), it can be assumed that these are not pseudogenes and that the altered arrangement is due to recombination events during speciation.

Repeats:
Contig 15 had three large LINE elements, five LTR elements, three DNA elements, three unclassified elements, and two simple repeats, which comprised 13.87, 0.84, 0.51, 0.84, and 0.39 percent of the contig, respectively. This gave a total repeat content of 16.45% for contig 15. There was a mixture of transposable elements and simple repeats throughout the contig, but nothing of note except for a very large LINE/LOA element spanning positions 5788-10485 (4697 bp). This is a very large transposable element, but many LINE elements in Dmel are of comparable size. The Doc family, for example, has transposable elements of around 4700 bp. Thus, this annotation is valid and can be kept in its current state.

Figure 25: Large LINE/LOA element on contig 15.

Figure 26: Dmel LINE-like transposable element sizes, highlighting the Doc family.

Figure 27: Repeat report for contig 15.

CLUSTAL Alignment and Ka/Ks values for Yellow-h:
The CLUSTAL alignment was done on yellow-h, the gene whose protein product is related to major royal jelly produced by honeybees. Yellow-h was aligned against all of the Drosophila orthologs, and the alignment showed high conservation between the amino acid sequences.

Figure 28 (panels 1-3): CLUSTAL alignment of yellow-h for Dper, Dpseudo, Dgri, Dgri contig 15, Dmoj, Dvir, Dwill, Dere, Dyak, Dmel, Dsim, and Dana.

Species            Ka/Ks
D. ananassae       0.0073
D. erecta          0.0046
D. melanogaster    0.0051
D. mojavensis      0.0921
D. persimilis      0.0241
D. pseudoobscura   0.0219
D. simulans        0.0251
D. virilis         0.0976
D. willistoni      0.0246
D. yakuba          0.0050

Figure 29: Ka/Ks analysis of each Drosophila species based on the CLUSTAL alignment.
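As a reading aid for the Ka/Ks values listed in Figure 29, the sketch below applies the usual interpretation of the ratio: values well below 1 point to purifying selection, values near 1 to neutral evolution, and values above 1 to positive selection. It only classifies the reported numbers and does not recompute Ka/Ks from the alignment. The function name and the tolerance around 1 are assumptions made for illustration.

```python
# Ka/Ks values for yellow-h as reported in Figure 29.
ka_ks = {
    "D. ananassae": 0.0073,
    "D. erecta": 0.0046,
    "D. melanogaster": 0.0051,
    "D. mojavensis": 0.0921,
    "D. persimilis": 0.0241,
    "D. pseudoobscura": 0.0219,
    "D. simulans": 0.0251,
    "D. virilis": 0.0976,
    "D. willistoni": 0.0246,
    "D. yakuba": 0.0050,
}


def selection_regime(ratio, tolerance=0.05):
    """Classify a Ka/Ks ratio; the tolerance around 1 is an arbitrary choice."""
    if ratio < 1 - tolerance:
        return "purifying"
    if ratio > 1 + tolerance:
        return "positive"
    return "approximately neutral"


regimes = {species: selection_regime(value) for species, value in ka_ks.items()}
lowest = min(ka_ks, key=ka_ks.get)
highest = max(ka_ks, key=ka_ks.get)

for species in sorted(ka_ks, key=ka_ks.get):
    print(f"{species:<18} {ka_ks[species]:.4f}  {regimes[species]}")
print("lowest ratio: ", lowest)
print("highest ratio:", highest)
```

Every value in the table falls far below 1, so each comparison is classified as purifying selection under this rule.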
Figure 30: Bootstrapped cladogram with corresponding Ka/Ks values. Green stars indicate agreement with the Drosophila species tree; red stars indicate a change from the species tree.

Figure 31: Drosophila species tree (flybase.org).

When viewing the bootstrapped cladogram against the Drosophila species tree (Figure 22), one can see that yellow-h is highly conserved between the Drosophila species. This is further bolstered by the very low Ka/Ks values, indicating purifying selection. The only change was that Dgri moved from the clade containing Dmoj and Dvir to the clade containing Dwill, Dana, etc. Dere has the lowest Ka/Ks value (0.0046), so by this measure its yellow-h is the most similar to that of Dgri.

Conclusion:
Contig 15 is very gene rich, with six genes within approximately 39,000 base pairs. These genes play key roles in the cell, such as amino acid transport, EGF-like calcium binding, cytoskeletal attachment to the plasma membrane, serine protease activity, etc. The repeat content within the contig is an appropriate value for the Drosophila species (16.45% as compared to approximately 25% for the rest of the Dgri genome). Finally, the CLUSTAL alignment and Ka/Ks values show high conservation of the yellow-h gene between all Drosophila species.

References:
BLAST: Basic Local Alignment Search Tool. Web. 15 Dec. 2010. <http://blast.ncbi.nlm.nih.gov>.
European Bioinformatics Institute | Homepage | EBI. Web. 15 Dec. 2010. <http://www.ebi.ac.uk/>.
FlyBase Homepage. Web. 15 Dec. 2010. <http://flybase.org/>.
The Gene Ontology. Web. 15 Dec. 2010. <http://www.geneontology.org/>.
Protein Family Search Database. Howard Hughes Medical Institute. Web. 14 Dec. 2010. <http://pfam.janelia.org>.