Identification of Orthologous Gene Sequences Background Orthology and paralogy are important concepts in understanding how genomes evolve and how biological novelty arises. Orthology arises from the speciation process, and orthologous gene sequences in different species are the product of vertical descent from a single gene in the most recent common ancestor of the two species in question (Fitch 1970). Paralogous gene sequences are the result of a gene duplication event (Fitch 1970). It is important for biologists to be able to recognize the difference between these two classes of gene sequences, especially when one wishes to re-construct an accurate phylogeny or determine the forces of natural selection. There are numerous methods that attempt to identify orthologous from paralogous gene sequences. Many are based on a simple approach called reciprocal best hit BLAST (RBHB) searches. These methods use BLAST searches between sequence data (DNA or amino acid) from different species. BLAST searches are done reciprocally, meaning in a two species RBHB search each species is used as the query and subject during the search process. If you have one pair of sequences from each of the two species that are each other’s ‘best’ hit from the reciprocal search they can be considered orthologs. Inparanoid initially uses this approach. However, it creates ‘clusters’ or groups of genes that have high similarity based on BLAST scores. It then uses bootstrapping (a statistical approach that is based on re-sampling the data) to give confidence scores to the putative ortholog pair. A low confidence score maight indicate that you cannot find an ortholog for that particular set of genes in your data set. Statement of Module Goals By the end of this exercise you will be able to: -Utilize BLAST searches and a clustering algorithm to identify orthologous gene sequences -Align orthologous sequences from two species -Prepare these alignments for further analysis -Estimate pairwise levels of sequence divergence between the pairs of orthologous sequences. V & C Core Competencies Ability to apply the process of science Ability to use quantitative reasoning Ability to tap into the interdisciplinary nature of science GCAT-SEEK Sequencing requirements Data that can be used in this module include: -Assembled transcriptomes (RNA-SEQ) from any sequencing platform -Annotated gene sequences from genomic assemblies (any platform) -The ideal set of data will have the open reading frame for the DNA sequences in one FASTA file while the other FASTA file will have the corresponding translated amino acid sequences. If you have transcriptome data that might contain UTRs you can use a program like ORFPredictor (http://proteomics.ysu.edu/tools/OrfPredictor.html) that will give you the coding sequence and the corresponding translated amino acid sequence. There are numerous methodologies that can be used to identify and extract coding sequences from transcriptomic or genomic data. Protocols Please check with your system administrator before installing anything on a shared computer. For this module you will need to have the following programs installed: BLAST, ClustalW, Ka_Ks_calculator This can be done by typing $sudo apt-get install blast2 $sudo apt-get install clustallw Ka_Ks_calculator is found on the following website (http://code.google.com/p/kaks-calculator/). It is possible to install this program onto a Linux machine, however there are difficulties in compiling the program. A suggestions, while more laborious but effective, is to install the program on a Mac or PC. This is done easily from the download. You will have to move over any files to use this program, but at least it will work this way. You will also need to have the following Perl scripts: inparanoid.pl – this can be obtained from the following website: http://software.sbc.su.se/cgibin/request.cgi?project=inparanoid pal2nal.pl – provided here, but information can be found here: http://www.bork.embl.de/pal2nal/ align2pairs.pl – provided here Once you get the inparanoid package extract the files to a folder. I find it easier to do the initial analyses in this folder. Then copy the necessary files to another folder for further analysis. In the extracted Inparanoid folder find the folder inparanoid_4.1/ The inparanoid.pl script should be in here, along with numerous other files. Now would be a good time to read the README file. This file contains some basic information on the method and useful parameters. There are a number of parameters you can change in the inparanoid.pl file. We will leave the default parameters, but it might be useful to explore some of these changes later. For instance you could use an ‘outgroup’ species to help define your orthologous sequences. You could also use different protein substitution matrices, which might be dependent on how closely related your species that you are comparing are. Copy the two files you want to perform the orthology analysis in to this folder. This initial analysis is initially done on protein sequences. The files you will need are: Ssaxicola_pep.fas Sgoodei_pep.fas Type the following at the command prompt: $perl inparanoid.pl Ssaxicola_pep.fas Sgoodei_pep.fas You should not see any errors if the analysis is done correctly. A number of files will be produced. Let’s explore them. orthologs.Ssaxicola_pep.fas_Sgoodei_pep.fas.html This is an html file that contains the results of the inparanoid clustering. All significant ortholog groups are reported. For each set of orthologs the sequences are given and the confidence for each group is given. How many different ortholog groups are found for this data set? The same information is given in a text file: output.Ssaxicola_pep.fas_Sgoodei_pep.fas.html An additional file that will of use to us is: table.Ssaxicola_pep.fas_Sgoodei_pep.fas.html. This information is also given in sql format (sqltable.Ssaxicola_pep.fas_Sgoodei_pep.fas.html) if you are versed in sql. Open table table.Ssaxicola_pep.fas_Sgoodei_pep.fas.html. Each line contains the ortollog group (1st column), the score, the sequence identifier from file A, and the sequence identifier from file B. The last two are followed by confidence scores. We need to extract the sequence identifiers for each set of sequences so we can align them for further analyses. This is probably most easily done in a spreadsheet program. Copy them into a file so that it looks like this (tab delimited): Seq_A_1 Seq_B_1 Seq_A_2 Seq_B_2 … And save the file as IDs.txt. Please note that this is a very simple dataset. ‘Real world’ data may contain ortholog groups that contain more than one sequence from a particular species. This is especially true if you happen to sequence genes from a gene family, have alternatively spliced genes, or if you have incomplete genes (for example from transcriptomes) in your data set. If you do get this you might want to filter your orthologs so that there is only one sequence from each species found. Once you have the IDs.txt file copy it to another folder. Also copy Ssaxicola_pep.fas, Sgoodei_pep.fas Ssaxicola_cds.fas, and Sgoodei_cds.fas to that folder. What we are going to be doing next is creating a larger file that contains all the protein sequences (protnr) and all the coding sequence in another (nucnr). This can easily be done in a text editor or if you want to impress someone with your Unix skills try: $cat Ssaxicola_pep.fas Sgoodei_pep.fas > protnr $cat Ssaxicola_cds.fas Sgoodei_cds.fas > nucnr In order for our alignment script to work we need to format these files for a BLAST search. This is why I had you do this analysis in another folder, it’s going to get crowded fast. To format these for a BLAST search do the following: $formatdb –i protnr p T –o T $formatdb –i nucnr p F –o T If everything goes well you should not have any errors. Now we can search the databases and pull out sequences for each ortholog pair. $perl align2pairs.pl You should now have 49 files starting with ‘contig’ and 49 files starting with ‘pal2nal’. You can alter the script to change the output file names. The contig files contain the aligned orthologs based on amino acid sequence and the pal2nal files contain aligned nucleotide sequences based on the amino acid alignments. The pal2nal_all.fasta contains all of the alignments in a single FASTA file. There is an additional program that can be used to make pairwise comparisons for all of your alignments, it’s called Ka_Ks_calculator (http://code.google.com/p/kaks-calculator/). This program is a little more difficult to get running on a Linux machine, but it does have a nice GUI that runs in Windows or Mac. To use this we need to convert either the pal2nal_all.fasta or each individual pal2nal file to what is known as AXT format. This is easily done on the pal2nal_all.fasta file by typing the following on the command line: $grep -i -v Ssax pal2nal_all.fasta > Ssax_Sgoo.axt This command removes all lines from the file pal2nal_all.fasta that contain ‘Ssax’ and copies everything to a new file named ‘Ssax_Sgoo.axt’. $perl –p –i.bak –e ‘s/\>lcl\|//g’ Ssax_Sgoo.axt This command is a perl one liner that deletes “>lcl|” from each instance it occurs in Ssax_Sgoo.axt. You could also use find and replace in a text editor if you wanted to. The – i.bak command creates a backup of the original file, just in case. The backup file is named Ssax_Sgoo.axt.bak. Now type: $perl –p –i.bac –e ‘s/Sg/\nSg/g’ Ssax_Sgoo.axt This perl one liner takes each instance of Sg (the sequence identifier in this case) and replaces it with a carriage return and a Sg. Basically putting a return between each pair of sequences. It also puts one at the beginning of the file that you will have to remove manually. A backup file named Ssax_Sgoo.axt.bac is also created. The file Ssax_Sgoo.axt can now be opened in Ka_Ks_calculator. Open the Ka_Ks_calculator GUI and import your .axt file. Select a simple model for analysis, let’s say YN (check the box), and press Run. You should get something like this. The analysis preformed contains pairwise estimated of synonymous and non-synonymous substitutions for all ortholog pairs. This serves a number of functions. One is for quality control. If you do have othrologs and you do have good alignments your Ks values should be below 0.1 for all comparisons. Values greater than this might indicate bad alignments or alignments of non-orthologous genes. This also tells us information about the ratio of non-synonymous to synonymous substitutions (Ka/Ka). Most genes will have a values <1, indicating purifying selection. However genes that have a value greater than 1 might be subject to positive Darwinian selection. You can export this table and open it in a spreadsheet program for further analysis. Assessment Students will be asked to conduct the outlined analysis. Additional assessment could come in the form of a lab report where students are asked to present results of the orhtology analysis and the pairwise sequence comparisons. Visual examples of pairwise alignments (good and bad) could be presented as well as a table/graph of Ka/Ks values. Timeline of Module The module should be performed in a typical 2-3 hour lab session. Discussion Topics Along the way students should recognize that not all genes from each data set will form significant orthologous pairs. Why is this? This particular data set is from two transcriptomes, it is possible that not all genes are present, or that orthologs have been sequenced in one data set or the other. What would students expect to see if whole genome data sets were utilized? How would students deal with instances where multiple paralogs were identified? Students should also explore the alignments that automatically produced by pal2nal. Are they adequate? Is there another method to improve this (muscle, MAFFT)? Does this method still require some manual checks along the way? Lastly, students should try to interpret the patterns of selection. Lecture Topics Students should be introduced into the basic concepts of molecular evolution and genomics. Pre-lab lecture topics should include the basic principles of how genes and genomes evolve (mutation, recombination, natural selection, genome/gene duplication). Ideally there would also be a previous introduction of transcriptome/genome assembly and annotation and the basic principles behind searching large sequence databases (e.g. BLAST). References Berglund AC, Sjolund E, Ostlund G and Sonnhammer ELL (2008) InParanoid 6: eukaryotic ortholog clusters with inparalogs Nucleic Acids Res. 36:D263-266 Fitch W. (1970). Distinguishing homologous from analogous proteins. Syst Zool 19 (2): 99–113 O'Brien Kevin P, Remm Maido and Sonnhammer Erik L.L (2005) Inparanoid: A Comprehensive Database of Eukaryotic Orthologs Nucleic Acids Rresearch 33:D476-D480 Ostlund G, Schmitt T, Forslund K, Kostler T, Messina DN, Roopra S, Frings O and Sonnhammer ELL (2009) InParanoid 7: new algorithms and tools for eukaryotic orthology analysis Nucleic Acids Res. 38:D196-D203 Remm M, Storm CEV, and Sonnhammer ELL (2001) Automatic clustering of orthologs and inparalogs from pairwise species comparisons J. Mol. Biol. 314:1041-1052 Suyama M, Torrents D, and Bork P (2006) PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 34, W609-W612. Zhang Z, Li J, Zhao XQ, Wang J, Wong GK, Yu J: KaKs Calculator: Calculating Ka and Ks through model selection and model averaging. Genomics Proteomics Bioinformatics 2006 , 4:259-263 http://www.bork.embl.de/pal2nal/ http://code.google.com/p/kaks-calculator/