Molecular Biology-2015

Functional bioinformatics - Finding Genes

Sequencing has become so easy that, in recent years, we have obtained the complete genome sequences of numerous prokaryotes, eukaryotes, and viruses. These sequences are of little use unless we can determine their functions; this is the aim of the field of functional genomics. Among other things, functional genomics involves searching for and identifying coding sequences: the genes. One of the bioinformatics methods used to this end is the search for open reading frames (ORFs). An ORF typically starts with a translation initiation codon (AUG) and ends with a translation termination codon (UAG, UGA, or UAA). Genes that code for proteins necessarily contain an ORF. However, not all genes code for proteins, and thus not all genes possess a valid ORF.

Gene search in viral genomes:

In contrast to many genomes, viral genomes are relatively small and simple, which makes them quite easy to sequence. Sequencing them allows rapid identification of viruses, evolutionary studies, and the identification of new viruses. Given their simplicity, finding genes within these sequences should be a relatively simple task. In the following exercise you will search for potential ORFs in sequences obtained from two different RNA segments of a viral genome.

1. Go to the NCBI site and click on the link "Open reading frame finder (ORF finder)" in the menu "Resource List (A-Z)".
2. Copy and paste into the query box the sequence Viral1 from the text file "viral genome sequence" available on this course's web page.
3. Click on OrfFind to submit your request. A new page similar to the one below will be loaded. This page shows all the open reading frames found in the six possible reading frames. The position and length, in bases, of each ORF are presented graphically and in text form.
4. In the graphical representation, click on the longest ORF.
   A new page will open like the one shown below. The annotated nucleotide and amino acid sequences of the chosen ORF are shown.
5. Click on "Accept" to obtain the nucleotide sequence of the chosen ORF in FASTA format. A new page will load showing the chosen ORF in green.
6. Click on the drop-down menu "View", which offers view options. Choose “FASTA nucleotide” and then click on “View”. Save this sequence. Does this sequence represent the mRNA, the DNA coding sequence, or the DNA noncoding sequence?
7. To obtain the protein sequence, go back and this time choose the option “FASTA protein”. Save this sequence.
8. Now, to determine the possible function of this ORF, we will search the protein database with the translated nucleotide sequence. To do so we will use the search engine “Blastx”. Go to the “Blast” homepage and choose the option “Blastx”. Copy and paste your nucleotide sequence, in FASTA format, into the query box. Click on “Blast”.
9. Obtain the record for the gene with the best match. Obtain the following information from the record:
   - The definition
   - The organism this gene comes from
   - The name of the protein product (search for “product=” under the heading “protein”)
   - The gene’s name (search for “gene=” under the heading “CDS”)
10. Repeat steps 1-9 with the second viral sequence, "viral2".

Finding SNPs:

Viruses are amongst the fastest evolving organisms. For instance, the mutation rate of the influenza virus is so high that new vaccines often have to be developed each year. The viral3 sequence, in the viral sequence document, represents the same gene as the viral1 sequence but was isolated in a different year. Use the bioinformatics skills you have acquired to obtain the following information about the viral3 sequence.

What is the percentage identity at the nucleotide level between the viral1 and viral3 sequences?
What is the percentage identity at the protein level between the two sequences? Indicate the number of conserved, semi-conserved and non-conserved amino acid changes that have occurred.

Identifying putative function of unknown genes

Sequencing projects often yield sequences of unknown function. To try to identify the possible function of these sequences, they are searched for conserved protein domains. This often yields hints as to what the protein encoded by a given sequence might do.

1. Obtain the longest protein sequence from the unknown sequence in the document "unknown human sequence" from this course's web site.
2. On the NCBI home page, locate the link for "Conserved Domain Search Service (CD Search)" in the menu "Resource List (A-Z)".
3. Copy and paste the protein sequence you obtained, in FASTA format, into the query box. Click submit.
4. A new page will be loaded similar to the one below:
5. A graphical and textual presentation of potential domains is displayed. From the textual representation, obtain the names of the first three domains of known function. For instance, in the above example these would be:
   - RT_nLTR
   - Exo_endo_phos super family
   - DUF1725 super family
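
Supplementary sketch: six-frame ORF search

The ORF search performed in the viral genome exercise can also be illustrated in code. The Python sketch below is not how ORF finder is implemented. It simply scans the three forward and three reverse reading frames for a start codon followed by an in-frame stop codon, assuming the standard start and stop codons named in the introduction. The function names (reverse_complement, orfs_in_frame, find_orfs) and the min_length parameter are hypothetical and were introduced only for this illustration.

```python
# Minimal six-frame ORF scan (illustrative only; not the ORF finder algorithm).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}
START_CODON = "ATG"


def reverse_complement(dna):
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_frame(dna, frame):
    """Yield (start, end) indices of ATG...stop ORFs in one reading frame.

    `start` is 0-based, `end` is exclusive and includes the stop codon.
    """
    start = None
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if start is None and codon == START_CODON:
            start = i  # first start codon since the previous stop
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3
            start = None


def find_orfs(sequence, min_length=30):
    """Return ORFs from all six frames, longest first.

    Each ORF is (strand, frame, start, end, orf_sequence). Coordinates on
    the "-" strand refer to the reverse-complemented sequence.
    min_length is a hypothetical cutoff in bases (assumption).
    """
    dna = sequence.upper().replace("U", "T")  # accept RNA or DNA input
    results = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            for start, end in orfs_in_frame(seq, frame):
                if end - start >= min_length:
                    results.append((strand, frame + 1, start, end, seq[start:end]))
    return sorted(results, key=lambda orf: orf[3] - orf[2], reverse=True)
```

Calling find_orfs on a pasted sequence and taking the first element of the result corresponds to clicking on the longest ORF in step 4 of the viral genome exercise. Incomplete ORFs that run off the end of the sequence without a stop codon are ignored in this sketch.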
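
Supplementary sketch: percentage identity and amino acid changes

The questions in the "Finding SNPs" section can be checked by hand or with a few lines of code once the two sequences have been aligned with an alignment tool. The Python sketch below assumes two already-aligned sequences of equal length in which gaps are written as "-". The function names are hypothetical. The residue groups are the "strong" and "weak" conservation groups used by Clustal-style alignment output, taken here as one possible definition of conserved and semi-conserved changes (an assumption; use the definition given in your course if it differs). Percentage identity is computed over columns without gaps, whereas some tools divide by the full alignment length instead.

```python
# Percent identity and substitution classes for two pre-aligned sequences.
# Assumption: Clustal-style "strong" and "weak" residue groups define
# conserved and semi-conserved changes.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def _aligned_pairs(aligned_a, aligned_b):
    """Return the residue pairs of all columns that contain no gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    return [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
            if a != "-" and b != "-"]


def percent_identity(aligned_a, aligned_b):
    """Percentage of gap-free columns in which both residues are identical."""
    pairs = _aligned_pairs(aligned_a, aligned_b)
    if not pairs:
        return 0.0
    matches = sum(a == b for a, b in pairs)
    return 100.0 * matches / len(pairs)


def _share_group(a, b, groups):
    """True if residues a and b appear together in at least one group."""
    return any(a in group and b in group for group in groups)


def classify_substitutions(aligned_a, aligned_b):
    """Count conserved, semi-conserved and non-conserved amino acid changes."""
    counts = {"conserved": 0, "semi-conserved": 0, "non-conserved": 0}
    for a, b in _aligned_pairs(aligned_a, aligned_b):
        if a == b:
            continue  # identical residues are not changes
        if _share_group(a, b, STRONG_GROUPS):
            counts["conserved"] += 1
        elif _share_group(a, b, WEAK_GROUPS):
            counts["semi-conserved"] += 1
        else:
            counts["non-conserved"] += 1
    return counts
```

percent_identity works for both nucleotide and protein alignments, while classify_substitutions is only meaningful for protein alignments.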