Alice S Weston
Computational Genomics
March 15, 2005

Detecting Orthologs Using Molecular Phenotypes in Human and Mouse

Introduction

One way to better understand how we evolved as a species is to determine how closely our proteins are related to those of other organisms; this also helps to categorize the biological function of core mammalian proteins. Two evolutionarily related sequences sometimes show little or no similarity at the nucleotide level, having accumulated major changes since their divergence millions of years ago (Figure 1). The canonical methods for defining orthologous genes, such as sequence alignment, identify many significant gene pairs, but duplication events in the genome make it difficult to discriminate orthologs from paralogs in some cases, so new techniques are needed.

One research group recently published a method for identifying orthologs based on microarray expression data [1]. Their approach uses the idea that genes with a similar function tend to have similar mRNA expression patterns, termed “molecular phenotypes”, even though the sequences themselves may show no similarity. Here, I use a similar method to determine how often the top two mouse alignments for a human gene are consistent with their molecular phenotypes, where the best alignment is generally expected to have the highest co-expression (see Methods).

I suspect that the top-scoring homolog according to BLAST will not always be the most closely related gene in terms of its observed co-expression, and therefore its function, in the cell. For the purposes of this study, the most important cases are those where the mouse gene with the most sequence similarity does not produce the best co-expression wheel. The working hypothesis is that orthologous genes will be co-expressed with a more similar set of gene partners than a pair of non-orthologous genes that are merely similar at the sequence level.
[1] Barak A. Cohen, Yitzhak Pilpel, Robi D. Mitra, and George M. Church. (2002) Discrimination between Paralogs using Microarray Analysis: Application to the Yap1p and Yap2p Transcriptional Networks. Molecular Biology of the Cell. 13, 1608–1614.

Methods

Using extensive DNA microarray data gathered from previous experiments on mouse and human genes, Spearman correlations were computed for each gene pair: approximately 10.9 million pairs of mouse genes and 10.6 million pairs of human genes. The Fisher z-transform was used because it takes into account the number of experiments the data are based on and is not biased by missing data. For each human gene, a neighborhood of the top 100 co-expressed genes was constructed, referred to here as a co-expression neighborhood. Each neighbor is associated with its best gene match in mouse from BLAST, and these mouse genes serve as the neighbors for the top two mouse homologs of the human gene at the center of the network (Figure 2).

I then asked: are there any cases where the second-best hit from BLAST has better co-expression with the central human gene than the best hit? Such cases would show that molecular phenotypes can be used to discriminate orthologs from paralogs.

I started to address this question using a series of Perl programs that work together to generate various statistics about a co-expression neighborhood. More specifically, each neighbor in the two mouse wheels (neighborhoods) is assigned a rank number that indicates how significantly the neighbor was co-expressed with the central mouse gene. If both mouse wheels have data for neighbor i, then the pair with the lower rank is preferred in the model, where the gene pairs are given by (best mouse gene, neighbor_i) and (second-best mouse gene, neighbor_i). To determine which mouse neighborhood has the smallest rankings overall, I use a scoring system that increments the score of a neighborhood by 1 unit each time it holds the preferred pair.
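The preferred-pair scoring rule described above can be sketched in a few lines. This is a minimal illustration in Python rather than the original Perl programs, which are not shown in this report. The function names, the dictionary layout, and the example ranks are all hypothetical, and the handling of tied ranks (neither neighborhood scores) is an assumption, since the text does not say how ties are resolved.

```python
def score_neighborhoods(best_ranks, second_ranks):
    """Count preferred pairs for the best and second-best mouse genes.

    Each argument maps a neighbor gene ID to its co-expression rank with
    the corresponding mouse gene (lower rank = more significant).
    Only neighbors with data in both wheels are compared.
    """
    shared = best_ranks.keys() & second_ranks.keys()
    best_score = 0
    second_score = 0
    for neighbor in shared:
        if best_ranks[neighbor] < second_ranks[neighbor]:
            best_score += 1      # (best mouse gene, neighbor) is preferred
        elif second_ranks[neighbor] < best_ranks[neighbor]:
            second_score += 1    # (second-best mouse gene, neighbor) is preferred
        # Assumption: tied ranks add nothing to either score.
    return best_score, second_score, len(shared)


def percent_better(score, n_shared):
    """Percentage of shared neighbors for which a neighborhood was preferred."""
    if n_shared == 0:
        return None  # no neighbor has data in both wheels
    return 100.0 * score / n_shared


# Invented ranks for illustration only; they are not taken from the data set.
best = {20912: 12, 22258: 340, 19062: 55, 22083: 8}
second = {20912: 40, 22258: 25, 19062: 90, 54387: 3}

best_score, second_score, n_shared = score_neighborhoods(best, second)
print(best_score, second_score, n_shared)              # 2 1 3
print(f"{percent_better(best_score, n_shared):.1f}%")  # 66.7%
```

Neighbors that appear in only one wheel (22083 and 54387 in this toy example) are skipped, which mirrors the condition that both mouse wheels must have data for neighbor i.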
These scores are used to compute, for example, the percentage of neighbors that are ranked lower for the best mouse gene than for the second-best mouse gene. The top mouse genes are also compared using the raw ranking numbers.

As an additional step, a similar analysis was performed on neighborhoods of the top 10 co-expressed genes in order to reduce the number of neighbors with large expectancy values or those whose co-expression occurs by chance. For these neighborhoods, I compared the average ranks from the best and second-best matches by computing a value δ defined as:

$$\delta = \frac{1}{n}\sum_{i=1}^{n} X_i - \frac{1}{m}\sum_{j=1}^{m} Y_j, \qquad n, m \le 10$$

Here X_i is the rank of the i-th neighbor with respect to the mouse gene that has the best BLAST score, and Y_j is the rank of the j-th neighbor with respect to the gene with the second-best score. This value measures which mouse gene is producing the better neighborhood: because lower ranks are better, positive δ values indicate that the second-best match is prevailing.

Results and Discussion

The results of this study show that most of the best matching mouse genes in terms of sequence similarity produce better co-expression neighborhoods than the second-best mouse gene. This should be expected. If the sequence of a gene implies its function in the cell, then it makes intuitive sense that gene pairs with the most sequence similarity will be co-expressed. This relationship between sequence and function is supported by Figure 3, which shows the percentage of neighbors that ranked lower for the best matching gene. The average for these data was 62%, which means that the best match gene generally has the better co-expression neighborhood. For about 116 genes, the best match ranked 100% of its neighbors better than the second-best match, whereas there were no cases where the second-best match ranked all of its neighbors better than the best match.
However, there are rare cases where the second-best gene is better 80% of the time or more; these neighborhoods are worth exploring further. Consider one human gene from this region, LocusLink ID 6772, whose sequence is shown in Figure 4 along with those of its top two matches in mouse. For this gene, the second-best match ranks better than the best match for 89% of its neighbors, a striking result. Based on the molecular phenotypes of these genes, where the second-best match is more strongly co-expressed with other mouse genes that are similar to those in the human neighborhood, it is possible that this is the true ortholog.

The results in Figure 5 show that the best matching genes tend to have neighbors whose ranks are concentrated below 100, though not exclusively, since there are some tall red bars at higher ranks (~400). It is surprising that about 450 genes have neighbors that rank ~421, while almost no neighbors have ranks around this number; why do so many genes have this rank? A more interesting result is that the ranks associated with second-best matches are fairly spread out, with a slight bias toward lower ranks. There appears to be a denser area from 1–50, and these genes are probably promising examples of cases where co-expression can detect orthologs missed by sequence analysis.

From the δ values plotted in Figure 6, we can see that most of the best matching mouse genes are indeed producing better neighborhoods, as indicated by the number of negative δ values.

Figures

[Diagram labels: BLAST hit #2, true ortholog; BLAST hit #1, paralog; active site; coding region; legend symbol = mutation]
FIG 1. Detecting orthologous genes using only sequence alignment methods can miss certain orthologs that have accumulated silent mutations not affecting the functionality of the gene or its protein product.

[Network diagram labels: 6814, 7375, 10270h, 51763, 9646, 20912, 22258, 19062, 54194m, 22083 (BLAST hit #1); 8888, 9092, 20912, 22258, 54387, 19062, 20227, 56399m, 22083, 54387, 20227 (BLAST hit #2)]
FIG 2.
A real example network for the human gene 10270, showing its two top sequence matches in mouse and their respective neighborhoods. Dashed lines connect the human neighbors to the related mouse gene with the highest rank.

[Histogram: "Percentage of Neighbor Genes that Ranked Better in Top Mouse Gene". x-axis: Percentage of Neighbor Genes (0 to 100); y-axis: Number of Top Mouse Genes (0 to 140); annotation: μ ≈ 62%]
FIG 3. Histogram shows the percentage of neighbors with a lower rank for the best match gene than for the second-best gene and how many best match genes had this percentage.

Human Gene 6772
MSQWYELQQLDSKFLEQVHQLYDDSFPMEIRQYLAQWLE
KQDWEHAANDVSFATIRFHDLLSQLDDQYSRFSLENNFL
LQHNIRKSKRNLQDNFQEDPIQMSMIIYSCLKEERKILE
NAQRFNQAQSGNIQSTVMLDKQKELDSKVRNVKDKVMCI
EHEIKSLEDLQDEYDFKCKTLQNREHETNGVAKSDQKQE
QLLLKKMYLMLDNKRKEVVHKIIELLNVTELTQNALIND
ELVEWKRRQQSACIGGPPNACLDQLQNWFTIVAESLQQV
RQQLKKLEELEQKYTYEHDPITKNKQVLWDRTFSLFQQL
IQSSFVVERQPCMPTHPQRPLVLKTGVQFTVKLRLLVKL
QELNYNLKVKVLFDKDVNERNTVKGFRKFNILGTHTKVM
NMEESTNGSLAAEFRHLQLKEQKNAGTRTNEGPLIVTEE
LHSLSFETQLCQPGLVIDLETTSLPVVVISNVSQLPSGW
ASILWYNMLVAEPRNLSFFLTPPCARWAQLSEVLSWQFS
SVTKRGLNVDQLNMLGEKLLGPNASPDGLIPWTRFCKEN
INDKNFPFWLWIESILELIKKHLLPLWNDGCIMGFISKE
RERALLKDQQPGTFLLRFSESSREGAITFTWVERSQNGG
EPDFHAVEPYTKKELSAVTFPDIIRNYKVMAAENIPENP
LKYLYPNIDKDHAFGKYYSRPKEAPEPMELDGPKGTGYI
KTELISVSEVHPSRLQTTDNLLPMSPEEFDEVSRIVGSV
EFDSMMNTV

Mouse Gene 20846 – best
MSQWFELQQLDSKFLEQVHQLYDDSFPMEIRQYLAQ
WLEKQDWEHAAYDVSFATIRFHDLLSQLDDQYSRFS
LENNFLLQHNIRKSKRNLQDNFQEDPVQMSMIIYNC
LKEERKILENAQRFNQAQEGNIQNTVMLDKQKELDS
KVRNVKDQVMCIEQEIKTLEELQDEYDFKCKTSQNR
EGEANGVAKSDQKQEQLLLHKMFLMLDNKRKEIIHK
IRELLNSIELTQNTLINDELVEWKRRQQSACIGGPP
NACLDQLQSWFTIVAETLQQIRQQLKKLEELEQKFT
YEPDPITKNKQVLSDRTFLLFQQLIRSSFVVERQPC
MPTHPQRPLVLKTGVQFTVKLRLLVKLQELNYNLKV
KVSFDKDVNEKNTVKGFRKFNILGTHTKVMNMEEST
NGSLAAEFRHLQLKEQKNAGNRTNEGPLIVTEELHS
LSFETQLCQPGLVIDLETTSLPVVVISNVSQLPSGW
ASILWYNMLVTEPRNLSFFLNPPCAWWSQLSEVLSW
QFSSVTKRGLNADQLSMLGEKLLGPNAGPDGLIPWT
RFCKENINDKNFSFWPWIDTILELIKKHLLCLWNDG
CIMGFISKERERALLKDQQPGTFLLRFSESSREGAI
TFTWVERSQNGGEPDFHAVEPYTKKELSAVTFPDII
RNYKVMAAENIPENPLKYLYPNIDKDHAFGKYYSRP
KEAPEPMELDDPKRTGYIKTELISVSEVHPSRLQTT
DNLLPMSPEEFDEMSRIVGPEFDSVMSTV

Mouse Gene 20848 – second best
MAQWNQLQQLDTRYLEQLHQLYSDSFPMELRQFLAP
WIESQDWAYAASKESHATLVFHNLLGEIDQQYSRFL
QESNVLYQHNLRRIKQFLQSRYLEKPMEIARIVARC
LWEESRLLQTAATAAQQGGQANHPTAAVVTEKQQML
EQHLQDVRKRVQDLEQKMKVVENLQDDFDFNYKTLK
SQGDMQDLNGNNQSVTRQKMQQLEQMLTALDQMRRS
IVSELAGLLSAMEYVQKTLTDEELADWKRRQQIACI
GGPPNICLDRLENWITSLAESQLQTRQQIKKLEELQ
QKVSYKGDPIVQHRPMLEERIVELFRNLMKSAFVVE
RQPCMPMHPDRPLVIKTGVQFTTKVRLLVKFPELNY
QLKIKVCIDKDSGDVAALRGSRKFNILGTNTKVMNM
EESNNGSLSAEFKHLTLREQRCGNGGRANCDASLIV
TEELHLITFETEVYHQGLKIDLETHSLPVVVISNIC
QMPNAWASILWYNMLTNNPKNVNFFTKPPIGTWDQV
AEVLSWQFSSTTKRGLSIEQLTTLAEKLLGPGVNYS
GCQITWAKFCKENMAGKGFSFWVWLDNIIDLVKKYI
LALWNEGYIMGFISKERERAILSTKPPGTFLLRFSE
SSKEGGVTFTWVEKDISGKTQIQSVEPYTKQQLNNM
SFAEIIMGYKIMDATNILVSPLVYLYPDIPKEEAFG
KYCRPESQEHPEADRGSAAPYLKTKFICVTPFIDAV
WK

FIG 4. Amino acid sequences for a human gene and its top two matches in mouse from BLAST. The second-best gene (right) had 89% of its neighbors rank better than the best match (left).

[Histogram: "Comparison of Neighbor Gene Ranks for Top Two Mouse Genes". x-axis: Rank (1 to 497); y-axis: Number of Genes (0 to 600); series: Mouse Gene 1, Mouse Gene 2]
FIG 5. Histogram shows the raw rankings of neighbors for the best and second-best matches. Note that while the second-best matches have rankings of almost equal spread (slight left bias), the best matches have rankings concentrated below 100.

[Plot: "Delta Values for 10-gene Neighborhoods". y-axis: delta (-500 to 500)]
FIG 6. The delta values are shown for a subset of human genes with 10 co-expression genes in their neighborhoods.
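
The δ statistic plotted in Figure 6 is defined in Methods as the mean neighbor rank for the best BLAST match minus the mean neighbor rank for the second-best match. A minimal Python sketch of that calculation follows. The function name `delta`, the `max_neighbors` parameter, and the example ranks are hypothetical and chosen only for illustration; this is not the original Perl implementation.

```python
def delta(best_ranks, second_ranks, max_neighbors=10):
    """Mean rank for the best match minus mean rank for the second-best match.

    best_ranks   -- ranks X_1..X_n of neighbors for the best BLAST hit
    second_ranks -- ranks Y_1..Y_m of neighbors for the second-best BLAST hit
    Lower ranks are better, so a positive result favors the second-best hit.
    """
    if not best_ranks or not second_ranks:
        raise ValueError("both neighborhoods need at least one ranked neighbor")
    if len(best_ranks) > max_neighbors or len(second_ranks) > max_neighbors:
        raise ValueError("n and m must not exceed max_neighbors")
    mean_best = sum(best_ranks) / len(best_ranks)
    mean_second = sum(second_ranks) / len(second_ranks)
    return mean_best - mean_second


# Invented ranks for illustration only.
print(delta([1, 2, 3], [100, 200, 300]))  # -198.0: best match has the better neighborhood
print(delta([5, 20, 35], [10, 10]))       # 10.0: second-best match prevails
```

The two lists may have different lengths (n and m), because a neighbor can be missing from one of the two mouse wheels.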