Identification of Orthologous Gene Sequences

Background
Orthology and paralogy are important concepts in understanding how genomes evolve and how
biological novelty arises. Orthology arises from the speciation process, and orthologous gene
sequences in different species are the product of vertical descent from a single gene in the most
recent common ancestor of the two species in question (Fitch 1970). Paralogous gene
sequences are the result of a gene duplication event (Fitch 1970). It is important for biologists
to be able to recognize the difference between these two classes of gene sequences, especially
when one wishes to re-construct an accurate phylogeny or determine the forces of natural
selection.
There are numerous methods that attempt to distinguish orthologous from paralogous gene
sequences. Many are based on a simple approach called reciprocal best hit BLAST (RBHB)
searches. These methods use BLAST searches between sequence data (DNA or amino acid)
from different species. The searches are done reciprocally: in a two-species RBHB search,
each species is used in turn as the query and as the subject. If a sequence from one species
and a sequence from the other are each other’s ‘best’ hit in the reciprocal searches, the two
can be considered orthologs. Inparanoid initially uses this approach.
However, it also creates ‘clusters’, or groups, of genes that have high similarity based on BLAST
scores. It then uses bootstrapping (a statistical approach that is based on re-sampling the data)
to give a confidence score to each putative ortholog pair. A low confidence score might indicate
that you cannot find an ortholog for that particular set of genes in your data set.
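The reciprocal best hit idea described above can be sketched in a few lines of Python. This is a minimal illustration only, not Inparanoid's own algorithm: the input tuples (query, subject, bit score), the function names, and the identifiers are all hypothetical, and the clustering and bootstrapping steps are not reproduced.

```python
def best_hits(hits):
    """Return {query: subject}, keeping the highest-scoring subject per query.

    `hits` is an iterable of (query, subject, bit_score) tuples, for example
    parsed from tabular BLAST output (an assumed, simplified input format).
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each sequence is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data with made-up identifiers and scores.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0),
          ("A2", "B2", 180.0), ("A3", "B1", 120.0)]
b_vs_a = [("B1", "A1", 248.0), ("B2", "A2", 181.0), ("B2", "A1", 88.0)]

pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy example A3 finds B1 as its best hit, but B1 prefers A1, so A3 is left without a putative ortholog.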
Statement of Module Goals
By the end of this exercise you will be able to:
-Utilize BLAST searches and a clustering algorithm to identify orthologous gene sequences
-Align orthologous sequences from two species
-Prepare these alignments for further analysis
-Estimate pairwise levels of sequence divergence between the pairs of orthologous sequences.
V & C Core Competencies
Ability to apply the process of science
Ability to use quantitative reasoning
Ability to tap into the interdisciplinary nature of science
GCAT-SEEK Sequencing requirements
Data that can be used in this module include:
-Assembled transcriptomes (RNA-SEQ) from any sequencing platform
-Annotated gene sequences from genomic assemblies (any platform)
-Ideally, one FASTA file will contain the open reading frames of the DNA sequences and a second
FASTA file will contain the corresponding translated amino acid sequences. If
you have transcriptome data that might contain UTRs you can use a program like ORFPredictor
(http://proteomics.ysu.edu/tools/OrfPredictor.html), which will give you the coding
sequence and the corresponding translated amino acid sequence. Numerous other
methods can also be used to identify and extract coding sequences from transcriptomic or
genomic data.
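If you already have clean coding sequences and only need the matching amino acid file, the translation step can be sketched as follows. This is a minimal sketch that assumes the standard genetic code and that every sequence starts in frame 1; it does not find ORFs or trim UTRs, so it is not a substitute for a tool such as ORFPredictor. The function name is hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(cds):
    """Translate a coding sequence in frame 1; unrecognized codons become 'X'."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    protein = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
    return protein.rstrip("*")  # drop a terminal stop codon, if present


print(translate("ATGAAACTGGTTTAA"))
```

Keeping the identifiers identical between the coding sequence file and the amino acid file makes it easier to match the two later in the protocol.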
Protocols
Please check with your system administrator before installing anything on a shared computer.
For this module you will need to have the following programs installed:
BLAST, ClustalW, Ka_Ks_calculator
On a Debian or Ubuntu system this can be done by typing
$sudo apt-get install blast2
$sudo apt-get install clustalw
Ka_Ks_calculator is found on the following website (http://code.google.com/p/kaks-calculator/).
It is possible to install this program on a Linux machine; however, compiling it can be
difficult. A more laborious but effective alternative is to install the
program on a Mac or PC, which is easily done from the download. You will have to move
your files over to that machine to use the program, but at least it will work this way.
You will also need to have the following Perl scripts:
inparanoid.pl – this can be obtained from the following website: http://software.sbc.su.se/cgi-bin/request.cgi?project=inparanoid
pal2nal.pl – provided here, but information can be found here: http://www.bork.embl.de/pal2nal/
align2pairs.pl – provided here
Once you have the inparanoid package, extract the files to a folder. I find it easier to do the initial
analyses in this folder and then copy the necessary files to another folder for further analysis.
In the extracted Inparanoid folder find the folder inparanoid_4.1/
The inparanoid.pl script should be in here, along with numerous other files. Now would be a
good time to read the README file. This file contains some basic information on the method
and useful parameters. There are a number of parameters you can change in the inparanoid.pl
file. We will leave the default parameters, but it might be useful to explore some of these
changes later. For instance, you could use an ‘outgroup’ species to help define your orthologous
sequences. You could also use different protein substitution matrices, depending
on how closely related the species you are comparing are.
Copy the two files on which you want to perform the orthology analysis into this folder. The
initial analysis is done on protein sequences. The files you will need are:
Ssaxicola_pep.fas
Sgoodei_pep.fas
Type the following at the command prompt:
$perl inparanoid.pl Ssaxicola_pep.fas Sgoodei_pep.fas
You should not see any errors if the analysis is done correctly. A number of files will be
produced. Let’s explore them.
orthologs.Ssaxicola_pep.fas_Sgoodei_pep.fas.html
This is an html file that contains the results of the inparanoid clustering. All significant ortholog
groups are reported. For each group the member sequences and their confidence values
are given. How many different ortholog groups are found for this data set? The same
information is given in a text file: output.Ssaxicola_pep.fas_Sgoodei_pep.fas.html
An additional file that will be of use to us is: table.Ssaxicola_pep.fas_Sgoodei_pep.fas.html. This
information is also given in sql format (sqltable.Ssaxicola_pep.fas_Sgoodei_pep.fas.html) if you
are versed in sql. Open table.Ssaxicola_pep.fas_Sgoodei_pep.fas.html. Each line
contains the ortholog group (1st column), the score, the sequence identifier from file A, and the
sequence identifier from file B. The last two are each followed by a confidence score. We need to
extract the sequence identifiers for each ortholog pair so we can align them for further
analyses. This is probably most easily done in a spreadsheet program. Copy them into a file so
that it looks like this (tab delimited):
Seq_A_1	Seq_B_1
Seq_A_2	Seq_B_2
…
Save the file as IDs.txt. Please note that this is a very simple dataset. ‘Real world’ data
may contain ortholog groups with more than one sequence from a particular species.
This is especially likely if you happen to sequence genes from a gene family, have alternatively
spliced genes, or have incomplete genes (for example from transcriptomes) in your data
set. If this happens, you might want to filter your ortholog groups so that each contains only one
sequence from each species.
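As an alternative to the spreadsheet, the extraction and the one-sequence-per-species filter can be sketched in Python. The layout assumed here (tab-separated columns, with each species column holding "identifier confidence" repeated and separated by spaces) and the example header, identifiers, and scores are assumptions for illustration; check them against your own table file before relying on this.

```python
def one_to_one_pairs(table_lines):
    """Return (id_A, id_B) for groups with exactly one sequence per species.

    Assumed layout: group number, score, species A members, species B members,
    separated by tabs. Each members column alternates identifier and confidence.
    """
    pairs = []
    for line in table_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4 or not fields[0].strip().isdigit():
            continue  # skip a header line or anything malformed
        members_a = fields[2].split()[0::2]  # every other token is an identifier
        members_b = fields[3].split()[0::2]
        if len(members_a) == 1 and len(members_b) == 1:
            pairs.append((members_a[0], members_b[0]))
    return pairs


# Made-up example lines; group 2 has two species A members and is dropped.
example = [
    "OrtoID\tScore\tOrtoA\tOrtoB\n",
    "1\t512\tSsax_001 100%\tSgoo_014 100%\n",
    "2\t430\tSsax_007 100% Ssax_009 65%\tSgoo_002 100%\n",
    "3\t388\tSsax_020 100%\tSgoo_031 100%\n",
]

pairs = one_to_one_pairs(example)
ids_text = "".join(f"{a}\t{b}\n" for a, b in pairs)
print(ids_text, end="")

# To create the file used in the next step:
# with open("IDs.txt", "w") as handle:
#     handle.write(ids_text)
```

Dropping every group with multiple members is the simplest possible filter; you may prefer to keep the member with the highest confidence instead.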
Once you have the IDs.txt file, copy it to another folder. Also copy Ssaxicola_pep.fas,
Sgoodei_pep.fas, Ssaxicola_cds.fas, and Sgoodei_cds.fas to that folder. Next we are going to
create one file that contains all the protein sequences (protnr) and another that contains all the
coding sequences (nucnr). This can easily be done in a text editor or, if you want to
impress someone with your Unix skills, try:
$cat Ssaxicola_pep.fas Sgoodei_pep.fas > protnr
$cat Ssaxicola_cds.fas Sgoodei_cds.fas > nucnr
In order for our alignment script to work we need to format these files for a BLAST search. This
is why I had you do this analysis in another folder: it’s going to get crowded fast. To format
these for a BLAST search do the following:
$formatdb -i protnr -p T -o T
$formatdb -i nucnr -p F -o T
If everything goes well you should not have any errors. Now we can search the databases and
pull out sequences for each ortholog pair.
$perl align2pairs.pl
You should now have 49 files starting with ‘contig’ and 49 files starting with ‘pal2nal’. You can
alter the script to change the output file names. The contig files contain the aligned orthologs
based on amino acid sequence and the pal2nal files contain aligned nucleotide sequences
based on the amino acid alignments. The pal2nal_all.fasta contains all of the alignments in a
single FASTA file.
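The idea of building a nucleotide alignment from an amino acid alignment can be sketched as follows. This is a deliberately simplified illustration, not the pal2nal.pl code: it assumes the coding sequence starts in frame with the first residue, and it does not check that each codon actually encodes the aligned residue. The function name and sequences are made up.

```python
def thread_codons(aligned_protein, cds):
    """Lay the codons of an unaligned CDS under one row of a protein alignment.

    Each residue consumes one codon; each gap character '-' becomes '---'.
    """
    residues = sum(1 for ch in aligned_protein if ch != "-")
    if len(cds) < 3 * residues:
        raise ValueError("CDS is too short for the protein sequence")
    codons, pos = [], 0
    for ch in aligned_protein:
        if ch == "-":
            codons.append("---")
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Two rows of a toy protein alignment and their unaligned coding sequences.
row_a = thread_codons("MK-LV", "ATGAAACTGGTT")
row_b = thread_codons("MKALV", "ATGAAGGCTCTGGTA")
print(row_a)
print(row_b)
```

Because gaps are always inserted as whole codons, the reading frame is preserved, which is what the later Ka and Ks estimates depend on.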
An additional program, Ka_Ks_calculator (http://code.google.com/p/kaks-calculator/), can be
used to make pairwise comparisons for all of your alignments. This
program is a little more difficult to get running on a Linux machine, but it does have a nice GUI
that runs in Windows or Mac. To use it we need to convert either the pal2nal_all.fasta or
each individual pal2nal file to what is known as AXT format. This is easily done on the
pal2nal_all.fasta file by typing the following on the command line:
$grep -i -v Ssax pal2nal_all.fasta > Ssax_Sgoo.axt
This command copies every line of pal2nal_all.fasta that does not contain ‘Ssax’ to a new file
named ‘Ssax_Sgoo.axt’; in other words, it removes all lines that contain ‘Ssax’.
$perl -p -i.bak -e 's/\>lcl\|//g' Ssax_Sgoo.axt
This command is a perl one-liner that deletes “>lcl|” wherever it occurs in
Ssax_Sgoo.axt. You could also use find and replace in a text editor if you wanted to. The
-i.bak option creates a backup of the original file, just in case. The backup file is named
Ssax_Sgoo.axt.bak. Now type:
$perl -p -i.bac -e 's/Sg/\nSg/g' Ssax_Sgoo.axt
This perl one-liner replaces each instance of Sg (the start of the sequence identifier in this case)
with a newline followed by Sg, which puts a blank line between each pair of sequences. It
also puts one at the beginning of the file that you will have to remove manually. A backup file
named Ssax_Sgoo.axt.bac is also created.
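The same conversion can also be sketched in Python, which avoids the manual clean-up and copes with sequences wrapped over several lines. The sketch assumes that the FASTA file holds the two aligned sequences of each ortholog pair one after the other, and that each AXT block consists of a name line, the two aligned sequences, and a blank separator line. The function names and example records are hypothetical.

```python
def fasta_records(lines):
    """Parse FASTA lines into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def clean_id(header):
    """Keep the first word of a header and drop a leading 'lcl|' tag."""
    return header.split()[0].replace("lcl|", "")


def pairs_to_axt(records):
    """Format consecutive record pairs as blocks separated by blank lines."""
    if len(records) % 2:
        raise ValueError("expected an even number of sequences")
    blocks = []
    for (name1, seq1), (name2, seq2) in zip(records[0::2], records[1::2]):
        if len(seq1) != len(seq2):
            raise ValueError(f"unequal alignment lengths: {name1} / {name2}")
        blocks.append(f"{clean_id(name1)}-{clean_id(name2)}\n{seq1}\n{seq2}\n")
    return "\n".join(blocks)


# Made-up example: one aligned pair, wrapped over two lines per sequence.
example = """>lcl|Sgoo_014
ATGAAA---CTG
GTT
>lcl|Ssax_001
ATGAAGGCTCTG
GTA
""".splitlines()

axt = pairs_to_axt(fasta_records(example))
print(axt, end="")
```

The length check catches pairs that were not aligned together, which would otherwise only surface as an error or a strange value in the later analysis.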
The file Ssax_Sgoo.axt can now be opened in Ka_Ks_calculator. Open the Ka_Ks_calculator
GUI and import your .axt file. Select a simple model for analysis, let’s say YN (check the box),
and press Run. You should get something like this.
The analysis performed contains pairwise estimates of synonymous and non-synonymous
substitutions for all ortholog pairs. This serves a number of functions. One is quality control.
If you do have orthologs and you do have good alignments your Ks values should be below 0.1
for all comparisons. Values greater than this might indicate bad alignments or alignments of
non-orthologous genes. The analysis also gives us the ratio of non-synonymous to
synonymous substitutions (Ka/Ks). Most genes will have a value < 1, indicating purifying
selection. However, genes that have a value greater than 1 might be subject to positive
Darwinian selection. You can export this table and open it in a spreadsheet program for further
analysis.
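The screening rules described above can also be applied outside a spreadsheet. The following is a minimal sketch that assumes you have already read the pair name, Ka, and Ks from the exported table into tuples; the pair names and values shown are made up, and the 0.1 cutoff is the quality-control threshold suggested in the text.

```python
def screen_pairs(rows, ks_max=0.1):
    """Label each ortholog pair from its Ka and Ks estimates.

    `rows` is an iterable of (name, ka, ks) tuples.
    """
    report = {}
    for name, ka, ks in rows:
        if ks > ks_max:
            report[name] = "check alignment or orthology (high Ks)"
        elif ks == 0:
            report[name] = "Ka/Ks undefined (Ks = 0)"
        elif ka / ks > 1:
            report[name] = "candidate for positive selection"
        elif ka / ks < 1:
            report[name] = "consistent with purifying selection"
        else:
            report[name] = "Ka/Ks = 1"
    return report


# Made-up values for illustration only.
rows = [
    ("pair1", 0.004, 0.040),
    ("pair2", 0.030, 0.020),
    ("pair3", 0.050, 0.450),
    ("pair4", 0.000, 0.000),
]
report = screen_pairs(rows)
for name, label in report.items():
    print(name, label, sep="\t")
```

The high-Ks check is applied first, so a pair that fails quality control is never reported as a selection candidate.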
Assessment
Students will be asked to conduct the outlined analysis. Additional assessment could come in
the form of a lab report where students are asked to present results of the orthology analysis
and the pairwise sequence comparisons. Visual examples of pairwise alignments (good and
bad) could be presented as well as a table/graph of Ka/Ks values.
Timeline of Module
The module should be performed in a typical 2-3 hour lab session.
Discussion Topics
Along the way students should recognize that not all genes from each data set will form
significant orthologous pairs. Why is this? This particular data set is from two transcriptomes, so it
is possible that not all genes are present, or that an ortholog has been sequenced in only one
of the two data sets. What would students expect to see if whole genome data sets were utilized?
How would students deal with instances where multiple paralogs were identified? Students
should also explore the alignments that are automatically produced by pal2nal. Are they adequate?
Is there another method that could improve them (muscle, MAFFT)? Does this method still require some
manual checks along the way? Lastly, students should try to interpret the patterns of selection.
Lecture Topics
Students should be introduced to the basic concepts of molecular evolution and genomics.
Pre-lab lecture topics should include the basic principles of how genes and genomes evolve
(mutation, recombination, natural selection, genome/gene duplication). Ideally there would also
be a previous introduction of transcriptome/genome assembly and annotation and the basic
principles behind searching large sequence databases (e.g. BLAST).
References
Berglund AC, Sjolund E, Ostlund G and Sonnhammer ELL (2008) InParanoid 6: eukaryotic
ortholog clusters with inparalogs. Nucleic Acids Res. 36:D263-266
Fitch W (1970) Distinguishing homologous from analogous proteins. Syst. Zool. 19(2):99-113
O'Brien KP, Remm M and Sonnhammer ELL (2005) Inparanoid: a comprehensive
database of eukaryotic orthologs. Nucleic Acids Res. 33:D476-D480
Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O and
Sonnhammer ELL (2009) InParanoid 7: new algorithms and tools for eukaryotic orthology
analysis. Nucleic Acids Res. 38:D196-D203
Remm M, Storm CEV and Sonnhammer ELL (2001) Automatic clustering of orthologs and inparalogs from pairwise species comparisons. J. Mol. Biol. 314:1041-1052
Suyama M, Torrents D and Bork P (2006) PAL2NAL: robust conversion of protein sequence
alignments into the corresponding codon alignments. Nucleic Acids Res. 34:W609-W612
Zhang Z, Li J, Zhao XQ, Wang J, Wong GK and Yu J (2006) KaKs Calculator: calculating Ka and Ks
through model selection and model averaging. Genomics Proteomics Bioinformatics
4:259-263
http://www.bork.embl.de/pal2nal/
http://code.google.com/p/kaks-calculator/