Finding Genes

Molecular Biology-2015
Functional bioinformatics - Finding Genes
Sequencing has become so easy that, in recent years, complete genome sequences have been
obtained for numerous prokaryotes, eukaryotes, and viruses. These sequences are of little
use unless we can determine what they do; this is the aim of the field of functional genomics.
Among other things, functional genomics involves searching for and identifying coding
sequences, that is, the genes. One of the bioinformatics methods used to this end is the
search for open reading frames (ORFs). These typically start with a translation initiation
codon (AUG) and end with a translation termination codon (UAG, UGA, or UAA).
Genes that code for proteins necessarily contain an ORF. However, one must consider
that not all genes code for proteins, and thus not all genes possess a valid ORF.
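The ORF definition above can be sketched in a few lines of Python. This is a minimal illustration, not the algorithm used by the NCBI ORF finder: it assumes a DNA-alphabet input (ATG for AUG, and TAG, TGA, TAA for the stop codons), treats an ORF as running from an ATG to the first in-frame stop codon, and ignores alternative start codons. The function names and the min_codons parameter are hypothetical.

```python
# Minimal ORF scan over all six reading frames of a DNA sequence.
# Assumption: an ORF runs from an ATG to the first in-frame stop codon.

STOP_CODONS = {"TAG", "TGA", "TAA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def orfs_in_strand(dna, min_codons=1):
    """Yield (frame, start, end) for each ORF found on one strand.

    start and end are 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon.
    """
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    yield frame, start, i + 3
                start = None  # look for the next ORF in this frame


def find_orfs(dna, min_codons=1):
    """Return ORFs on both strands as (strand, frame, start, end, sequence)."""
    dna = dna.upper().replace("U", "T")  # accept RNA input as well
    found = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame, start, end in orfs_in_strand(seq, min_codons):
            found.append((strand, frame + 1, start, end, seq[start:end]))
    return found


def longest_orf(dna):
    """Return the longest ORF tuple, or None if no ORF is found."""
    orfs = find_orfs(dna)
    return max(orfs, key=lambda orf: orf[3] - orf[2]) if orfs else None
```

For the toy sequence "CCATGAAATAGGG", find_orfs reports a single ORF, "ATGAAATAG", in frame 3 of the forward strand. Coordinates on the "-" strand refer to the reverse-complemented sequence.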
Gene search in viral genomes:
In contrast to many genomes, viral genomes are relatively small and simple, which makes
them quite easy to sequence. Sequencing them allows rapid identification of known viruses,
evolutionary studies, and the discovery of new viruses. Given this simplicity, finding genes
within these sequences should be a relatively simple task. In the following exercise
you will search for potential ORFs in sequences obtained from two different
RNA segments of a viral genome.
1. Go to the NCBI site and click on the link "Open reading frame finder (ORF finder)"
in the menu "Resource List (A-Z)".
2. Copy and paste the sequence Viral1, from the text file "viral genome sequence"
available on this course's web page, into the query box.
3. Click on OrfFind to submit your request. A new page similar to the one below will be
loaded.
This page shows the open reading frames found in all six possible reading frames.
The position and length, in bases, of each ORF are presented graphically and in
text form.
4. In the graphical representation, click on the longest ORF. A new page will open like
the one shown below. The annotated nucleotide and amino acid sequences of the
chosen ORF are shown.
5. Click on "Accept" to obtain the nucleotide sequence in FASTA format of the chosen
ORF. A new page will load showing the chosen ORF in green.
6. Click on the drop-down menu "View", which offers view options. Choose “FASTA
nucleotide” and then click on “View”. Save this sequence. Does this sequence
represent the mRNA, the DNA coding sequence, or the DNA non-coding sequence?
7. To obtain the protein sequence, go back and this time choose the option “FASTA
protein”. Save this sequence.
8. Now, to determine the possible function of this ORF, we will search the
protein database with the translated nucleotide sequence. To do so we will use the
search engine “Blastx”. Go to the “Blast” homepage and choose the option “Blastx”.
Copy and paste your nucleotide sequence, in FASTA format, into the query box. Click on
“Blast”.
9. Obtain the record for the gene with the best match. Obtain the following information
from the record:
• The definition
• The organism this gene comes from
• The name of the protein product (search for “product=” under the heading
“protein”)
• The gene’s name (search for “gene=” under the heading “CDS”)
10. Repeat steps 1-9 with the second viral sequence, "viral2".
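Steps 6 and 7 rely on the relationship between a coding DNA sequence, its mRNA, and the protein it encodes. The following sketch illustrates that relationship and the FASTA layout. It assumes the standard genetic code and an input that is already in frame; the function names and the record name "toy_orf" are hypothetical, and the output is not meant to reproduce what the NCBI pages return.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG codon ordering
# (first base varies slowest, third base fastest). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_dna):
    """Return the mRNA that corresponds to a coding (sense) DNA sequence."""
    return coding_dna.upper().replace("T", "U")


def translate(coding_dna):
    """Translate an in-frame coding sequence up to the first stop codon.

    The stop codon is not included; unrecognised codons become "X".
    """
    dna = coding_dna.upper().replace("U", "T")
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE.get(dna[i:i + 3], "X")
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def to_fasta(name, sequence, width=60):
    """Format a sequence as a FASTA record with lines of at most `width`."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


if __name__ == "__main__":
    orf = "ATGAAATAG"  # toy ORF, not taken from the course files
    print(to_fasta("toy_orf", orf), end="")
    print(to_fasta("toy_orf_protein", translate(orf)), end="")
```

Comparing the output of transcribe with the input sequence is one way to think about the question asked in step 6.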
Finding SNPs:
Viruses are among the fastest-evolving organisms. For instance, the mutation rate of the
influenza virus is so high that new vaccines often have to be developed
each year. The viral3 sequence, in the viral sequence document, represents the same gene
as the viral1 sequence, but was isolated in a different year. Use the bioinformatics skills
you have acquired to obtain the following information about the viral3
sequence.
What is the percentage identity at the nucleotide level between the viral1 and viral3
sequences?
What is the percentage identity at the protein level between the two sequences?
Indicate the number of conserved, semi-conserved and non-conserved amino acid
changes that have occurred.
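Once two sequences have been aligned, the quantities asked for above can be computed as in the sketch below. It rests on several assumptions that may differ from the convention expected in this course: the two sequences are already aligned to the same length with "-" for gaps, percent identity is taken over all alignment columns, and "conserved" and "semi-conserved" are mapped to the strong and weak amino acid groups used by ClustalW for its ":" and "." symbols. The function names are hypothetical.

```python
# Amino acid groups used by ClustalW to mark strongly (":") and weakly (".")
# conserved columns. Mapping them to "conserved" and "semi-conserved"
# is an assumption made for this sketch.
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK",
                 "MILV", "MILF", "HY", "FYW"]
WEAK_GROUPS = ["CSA", "ATV", "SAG", "STNK", "STPA", "SGND",
               "SNDEQK", "NDEQHK", "NEQHRK", "FVLIM", "HFY"]


def _check_aligned(seq_a, seq_b):
    """Raise ValueError unless both sequences are non-empty and equal length."""
    if not seq_a or len(seq_a) != len(seq_b):
        raise ValueError("sequences must be non-empty and aligned to the same length")


def percent_identity(seq_a, seq_b):
    """Percentage of alignment columns holding the same residue in both sequences."""
    _check_aligned(seq_a, seq_b)
    pairs = zip(seq_a.upper(), seq_b.upper())
    matches = sum(1 for x, y in pairs if x == y and x != "-")
    return 100.0 * matches / len(seq_a)


def _share_group(x, y, groups):
    """True if both residues belong to at least one common group."""
    return any(x in group and y in group for group in groups)


def classify_changes(prot_a, prot_b):
    """Count conserved, semi-conserved and non-conserved substitutions.

    Identical columns and columns containing a gap are skipped.
    """
    _check_aligned(prot_a, prot_b)
    counts = {"conserved": 0, "semi-conserved": 0, "non-conserved": 0}
    for x, y in zip(prot_a.upper(), prot_b.upper()):
        if x == y or "-" in (x, y):
            continue
        if _share_group(x, y, STRONG_GROUPS):
            counts["conserved"] += 1
        elif _share_group(x, y, WEAK_GROUPS):
            counts["semi-conserved"] += 1
        else:
            counts["non-conserved"] += 1
    return counts
```

Alignment tools differ in how they count gaps when reporting identity, so a value computed this way may not match the figure reported by a given program exactly.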
Identifying the putative function of unknown genes:
Sequencing projects in many circumstances yield sequences of unknown function. In
order to try and identify the possible function of these sequences, searches are done for
conserved protein domains. This often yields hints as to what the protein encoded by a
given sequence might do.
1. Obtain the longest protein sequence from the unknown sequence in the document
"unknown human sequence" from this course's web site.
2. On the NCBI home page, locate the link for "Conserved Domain Search Service (CD
Search)" from the menu "Resource List (A-Z)".
3. Copy and paste the protein sequence you obtained, in FASTA format, into the query box.
Click on "Submit".
4. A new page will be loaded similar to the one below:
5. A graphical and textual presentation of potential domains is displayed. From the
textual representation, obtain the names of the first three domains of known function.
For instance, in the above example these would be:
• RT_nLTR
• Exo_endo_phos super family
• DUF1725 super family