Tutorial: Finding the gene structure of orthologs When blasting an incomplete or in-progress genome sequence, it can be difficult to find hits that are the length of the full gene ortholog you are looking for. Often, you will find just one small hit in an entire database. In order to analyze the hits you get, it’s important to understand what part of the gene’s structure they come from. Once you know which part of the primary structure your hit corresponds to, you can begin to study the secondary structure, function and regulation of this gene. In this tutorial, I will explain how you can determine which exon of a gene your hit corresponds to, as well as how to get the most gene structure information out of the sequence data you have. The basic steps to determining which exons you have are as follows: - Download an ortholog sequence from a genome browser - Blast this against your database - Copy the QUERY sequence you used to get this hit - Find the detailed, full sequence of the otholog gene you used as your query within the genome browser. Make sure this page indicates which sequences refer to which exon - Search for a portion of your query sequence on the page which contains the full ortholog sequence - Because you took this query sequence straight from the genome browser, you should have an exact match. Determine which exon this match falls under. You now know which exon you have in your sequence data! I will begin with an example of a hit I found when blasting grape’s Nhx1 gene against the blueberry EST data. I first blasted a file from NCB1 listed as the complete cdna of the grape Nhx1. This gave me one large (~1200 bp) hit against the blueberry EST database. In order to figure out which part of the Nhx1 gene I had found, I went into the grape genome browser. I found six candidate genes for Nhx1 in the browser and blasted all six of these against the 1200 bp hit mentioned above. It’s important to cover all your bases and check every candidate gene, because you don’t know which copy of the gene you might have in your sequence. The first candidate gene listed gave me one hit when blasted against the 1200 bp’s from the EST database: I copied the first segment of the query sequence: CAGGCATGCTTTT Then, I went into the grape genome browser and looked at the detailed description of this first candidate gene I had obtained my query sequence from. The page gives you the transcript structure As well as the full sequence, formatted to distinguish different structural elements of the gene: Because my query sequence (CAGGCATGCTTTT) came directly from this genome browser, it should have an exact match on this page. Thus, I can utilize the find function of my web browser, and simply search the page for my query sequence. When I did this I found one match: This segment corresponds to exon 10. Thus, we know that we have the equivalent of grape’s exon 10 within our 1200 bp EST hit. Expanding on your data: I found exon 10 by investigating just one hit, received by blasting the Nhx1 grape gene against the EST database. It is possible to expand your hits, potentially uncovering more exons within your sequence data. One way to do that is to simply blast against a different database. For example, when I blast the same Nhx1 gene against the 454 data, I get a different hit than the one I get against the EST data. Another option is to use a different query in your blast. You can change sequences or perhaps blast an amino acid sequence using a tBLASTn query. Each hit you get gives you another opportunity to go into a genome browser and determine which exon your hit aligns with. Slowly, you can begin to piece together exons and gain a sense of the overall structure of your ortholog.