Alice S Weston

Computational Genomics
March 15, 2005
Detecting Orthologs Using Molecular Phenotypes in Human and Mouse
Introduction
One way that we can better understand how we evolved as a species is to determine the
relatedness of our proteins to those of other organisms; moreover, this will help to categorize the
biological function of mammalian core proteins. It is sometimes found that two evolutionarily
related sequences will show little or no similarity at the nucleotide level, having accumulated
major changes since their divergence millions of years ago (Figure 1). While the canonical
methods for defining orthologous structures, such as performing a sequence alignment, are able
to identify many significant gene pairs, duplication events in the genome have made it difficult to
discriminate orthologs from paralogs in some cases, so new techniques are needed. One
research group recently published a method for identifying orthologs based upon microarray
expression data¹. Their approach rests on the idea that genes with a similar function tend to have
similar mRNA expression patterns, termed “molecular phenotypes”, even though the sequences
themselves may show no similarity.
Here, I use a similar method to determine how often the top two mouse alignments for a
human gene are consistent with their molecular phenotypes, where the best alignment is expected
to have the highest co-expression, in general (see Methods). It is suspected that the top-scoring
homolog according to BLAST will not always be the most closely related gene in terms of its
observed co-expression, and therefore function, in the cell. For our purposes, the cases where the
mouse gene with most sequence similarity does not produce the best co-expression wheel will be
most important. The working hypothesis is that orthologous genes will be co-expressed with a more
similar set of gene partners than a pair of non-orthologous genes that are merely similar at the
sequence level.
Methods
Using extensive DNA microarray data gathered from previous experiments on mouse and
human genes, the Spearman correlations were computed for each gene pair—approximately 10.9
million pairs of mouse genes and 10.6 million pairs of human genes. The Fisher z-transform was
used because it takes into account the number of experiments the data is based on and is not
biased by missing data. For each human gene, a neighborhood of the top 100 co-expressed genes
was constructed, which is referred to as a co-expression neighborhood. Each neighbor is
associated with its best gene match in mouse from BLAST, and these mouse genes serve as the
neighbors for the top two mouse homologs of the human gene at the center of the neighborhood
(Figure 2). I then asked: are there any cases where the second-best hit from BLAST has a better
co-expression with the central human gene than the best hit? This would show that molecular
phenotypes can be used to discriminate orthologs from paralogs.
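The correlation and neighborhood steps described above can be sketched as follows. This is a minimal illustration in Python, not the Perl code used in the study. The function names, the pairwise handling of missing values (`None`), and the scaling of the Fisher z-transform by sqrt(n - 3) are all assumptions made for the example; the text only states that the transform accounts for the number of experiments.

```python
import math

def average_ranks(values):
    """Rank values 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation over experiments where both genes have data.

    Missing measurements are given as None and dropped pairwise (assumption).
    Returns (r, n), with r = None when the correlation is undefined.
    """
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    if n < 4:
        return None, n
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return None, n
    return cov / math.sqrt(vx * vy), n

def fisher_z_score(r, n):
    """Fisher z-transform of r, scaled by sqrt(n - 3) (assumed standardization)."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite for |r| = 1
    return math.atanh(r) * math.sqrt(n - 3)

def neighborhood(center, expression, size=100):
    """Return the `size` genes most strongly co-expressed with `center`.

    `expression` maps a gene ID to its list of measurements (hypothetical layout).
    """
    scored = []
    for gene, profile in expression.items():
        if gene == center:
            continue
        r, n = spearman(expression[center], profile)
        if r is not None:
            scored.append((fisher_z_score(r, n), gene))
    scored.sort(reverse=True)
    return [gene for _, gene in scored[:size]]
```

With this scaling, the same correlation supported by more experiments receives a larger score, which is one way the number of experiments could enter the ranking.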
¹ Barak A. Cohen, Yitzhak Pilpel, Robi D. Mitra, and George M. Church. (2002) Discrimination between Paralogs
using Microarray Analysis: Application to the Yap1p and Yap2p Transcriptional Networks. Molecular Biology of
the Cell. 13, 1608–1614.
I started to address this question using a series of Perl programs. These programs work
together to generate various statistics about a co-expression neighborhood. More specifically,
each neighbor in the two mouse wheels is assigned a rank number that indicates how
significantly the neighbor was co-expressed with the central mouse gene, with lower rank
numbers indicating stronger co-expression. If both mouse wheels have data for neighbor i, then
the pair with the lower rank number is preferred in the model, where the two gene pairs are
(best mouse gene, neighbor i) and (second-best mouse gene, neighbor i).
To determine which mouse neighborhood has the smallest rankings overall, I use a scoring
system that adds 1 to the score of a neighborhood each time it contains the preferred pair. These
scores are used to compute, for example, the percentage of neighbors that are ranked lower for
the best mouse gene than for the second-best mouse gene. The top two mouse genes are also
compared using the raw rank numbers.
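The preferred-pair scoring can be illustrated with a short Python sketch. It is not the original Perl implementation; the function name, the dictionary layout, and the choice to award no point on a tie are assumptions made for the example, and the neighbor IDs are invented.

```python
def score_neighborhoods(best_ranks, second_ranks):
    """Count preferred pairs for the best and second-best mouse genes.

    best_ranks and second_ranks map a neighbor ID to its rank in that mouse
    gene's wheel (lower rank = stronger co-expression). Only neighbors with
    data in both wheels are compared; ties score for neither side (assumption).
    """
    shared = sorted(set(best_ranks) & set(second_ranks))
    best_score = sum(1 for g in shared if best_ranks[g] < second_ranks[g])
    second_score = sum(1 for g in shared if second_ranks[g] < best_ranks[g])
    pct_best = 100.0 * best_score / len(shared) if shared else None
    return {
        "compared": len(shared),
        "best": best_score,
        "second": second_score,
        "pct_best": pct_best,
    }

# Invented toy ranks: n4 and n5 are skipped because only one wheel has data.
best = {"n1": 3, "n2": 40, "n3": 7, "n4": 120}
second = {"n1": 15, "n2": 22, "n3": 90, "n5": 4}
result = score_neighborhoods(best, second)
print(result["best"], result["second"])  # prints "2 1"
```

In this toy case the best match is preferred for two of the three shared neighbors, so its percentage is about 67%.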
As an additional step, a similar analysis was performed on neighborhoods of the top 10
co-expressed genes in order to reduce the number of neighbors with large expectancy values or
those whose co-expression occurs by chance. For these neighborhoods, I compared the average
ranks from the best and second-best matches by computing a value δ defined as:
$$\delta = \frac{1}{n}\sum_{i=1}^{n} X_i \;-\; \frac{1}{m}\sum_{j=1}^{m} Y_j, \qquad n, m \le 10$$
Here X_i is the rank of the i-th neighbor of the mouse gene that has the best BLAST score, and
Y_j is the rank of the j-th neighbor of the gene with the second-best score. This value is used to
measure which mouse gene is producing the better neighborhood, where positive δ values indicate
that the second-best match is prevailing, because its neighbors have the lower average rank.
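As a worked illustration of δ, the following Python sketch computes the difference of mean ranks for two small neighborhoods. The function name and the rank values are invented for the example.

```python
def delta(best_ranks, second_ranks):
    """Mean rank of the best match's neighbors minus that of the second-best.

    Each argument is a list of 1 to 10 neighbor ranks. A negative result means
    the best BLAST match has the lower (better) average rank.
    """
    n, m = len(best_ranks), len(second_ranks)
    if not (1 <= n <= 10 and 1 <= m <= 10):
        raise ValueError("each neighborhood needs between 1 and 10 ranks")
    return sum(best_ranks) / n - sum(second_ranks) / m

# Invented ranks: the second-best match has much lower ranks here.
print(delta([100, 200], [10, 20, 30]))  # prints 130.0
```

Here the mean ranks are 150 and 20, so δ = 130 > 0 and the second-best match produces the better neighborhood. Swapping the two arguments gives δ = -130, the sign expected when the best match prevails.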
Results and Discussion
Based on the results of this study, most of the best-matching mouse genes in terms of
sequence similarity produce better co-expression neighborhoods than the second-best mouse
gene. This should be expected. If the sequence of a gene implies its function in the cell, then it
makes intuitive sense that the gene pairs with the most sequence similarity will also be the most
strongly co-expressed. This relationship between sequence and function is supported by
Figure 3, which shows the percentage of neighbors that ranked lower for the best-matching
gene. The average for this data was 62%, which means that the best-match gene generally
produces a better co-expression neighborhood. While about 116 best-match genes ranked 100%
of their neighbors better than the second-best match did, there were no cases where the
second-best match ranked all of its neighbors better than the best match.
However, there are rare cases where the second-best gene is better 80% of the time or
more; these neighborhoods are worth exploring further. Consider one human gene from this area
with the LocusLink ID 6772, whose sequence is shown in Figure 4 along with those of its top two
genes in mouse. The second-best match for this gene ranks better than the best match for 89% of
its neighbors, a striking result. Based on the molecular phenotypes of these genes, where the
second-best match is more strongly co-expressed with other mouse genes that are similar to those
in the human neighborhood, it is possible that this is the true ortholog.
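Picking out such neighborhoods amounts to a simple filter over the per-gene scores. The Python sketch below is illustrative only: the function name, the input layout, the 80% default threshold taken from the text, and the gene labels and counts are all assumptions or invented values, not data from the study.

```python
def candidate_orthologs(scores, threshold=80.0):
    """List human genes whose second-best mouse match wins often enough.

    `scores` maps a human gene ID to (best_score, second_score), the counts of
    preferred pairs for the best and second-best mouse genes. Returns
    (gene, percentage) pairs, highest percentage first.
    """
    hits = []
    for gene, (best_score, second_score) in scores.items():
        total = best_score + second_score
        if total == 0:
            continue  # no comparable neighbors for this gene
        pct_second = 100.0 * second_score / total
        if pct_second >= threshold:
            hits.append((gene, pct_second))
    hits.sort(key=lambda item: item[1], reverse=True)
    return hits

# Invented counts for four hypothetical human genes.
toy = {"hA": (62, 38), "hB": (11, 89), "hC": (20, 80), "hD": (100, 0)}
print(candidate_orthologs(toy))  # prints [('hB', 89.0), ('hC', 80.0)]
```

Genes returned by such a filter would be the ones to inspect by hand, as done for the example above.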
The results in Figure 5 show that the best matching genes tend to have neighbors whose
ranks are concentrated below 100, though not exclusively since there are some tall red bars in the
higher ranks (~400). It is surprising that about 450 genes can have neighbors that rank ~421,
while almost no neighbors have ranks around this number; why do so many genes have this
rank? A more interesting result is that the ranks associated with second-best matches are fairly
spread out, with a slight bias toward lower ranks. There appears to be a denser area from 1–50,
and these genes are probably promising examples of cases where co-expression can detect
orthologs missed by sequence analysis. From the δ values plotted in Figure 6, we can see that
most of the best matching mouse genes are actually producing better neighborhoods, which is
indicated by the number of negative δ values.
Figures
[Figure 1 diagram labels: BLAST hit #1 (paralog); BLAST hit #2 (true ortholog); active site; coding region; legend symbol = mutation]
FIG 1. Detecting orthologous genes using only sequence alignment methods can miss certain orthologs
that have accumulated silent mutations not affecting the functionality of the gene or its protein product.
[Figure 2 network diagram. Recoverable labels: human gene 10270h; mouse genes 54194m and 56399m; BLAST hit #1; BLAST hit #2; individual neighbor gene IDs omitted]
FIG 2. A real example network for the human gene 10270, showing its two top sequence matches in mouse
and their respective neighborhoods. Dashed lines connect the human neighbors to the related mouse gene
with the highest rank.
[Figure 3 histogram. Title: Percentage of Neighbor Genes that Ranked Better in Top Mouse Gene; x-axis: Percentage of Neighbor Genes (0 to 100); y-axis: Number of Top Mouse Genes (0 to 140)]
FIG 3. Histogram shows the percentage of neighbors with a lower rank for the best match gene than for the
second-best gene and how many best match genes had this percentage. μ ≈ 62%
Human Gene 6772
MSQWYELQQLDSKFLEQVHQLYDDSFPMEIRQYLAQWLE
KQDWEHAANDVSFATIRFHDLLSQLDDQYSRFSLENNFL
LQHNIRKSKRNLQDNFQEDPIQMSMIIYSCLKEERKILE
NAQRFNQAQSGNIQSTVMLDKQKELDSKVRNVKDKVMCI
EHEIKSLEDLQDEYDFKCKTLQNREHETNGVAKSDQKQE
QLLLKKMYLMLDNKRKEVVHKIIELLNVTELTQNALIND
ELVEWKRRQQSACIGGPPNACLDQLQNWFTIVAESLQQV
RQQLKKLEELEQKYTYEHDPITKNKQVLWDRTFSLFQQL
IQSSFVVERQPCMPTHPQRPLVLKTGVQFTVKLRLLVKL
QELNYNLKVKVLFDKDVNERNTVKGFRKFNILGTHTKVM
NMEESTNGSLAAEFRHLQLKEQKNAGTRTNEGPLIVTEE
LHSLSFETQLCQPGLVIDLETTSLPVVVISNVSQLPSGW
ASILWYNMLVAEPRNLSFFLTPPCARWAQLSEVLSWQFS
SVTKRGLNVDQLNMLGEKLLGPNASPDGLIPWTRFCKEN
INDKNFPFWLWIESILELIKKHLLPLWNDGCIMGFISKE
RERALLKDQQPGTFLLRFSESSREGAITFTWVERSQNGG
EPDFHAVEPYTKKELSAVTFPDIIRNYKVMAAENIPENP
LKYLYPNIDKDHAFGKYYSRPKEAPEPMELDGPKGTGYI
KTELISVSEVHPSRLQTTDNLLPMSPEEFDEVSRIVGSV
EFDSMMNTV
Mouse Gene 20846 – best
MSQWFELQQLDSKFLEQVHQLYDDSFPMEIRQYLAQ
WLEKQDWEHAAYDVSFATIRFHDLLSQLDDQYSRFS
LENNFLLQHNIRKSKRNLQDNFQEDPVQMSMIIYNC
LKEERKILENAQRFNQAQEGNIQNTVMLDKQKELDS
KVRNVKDQVMCIEQEIKTLEELQDEYDFKCKTSQNR
EGEANGVAKSDQKQEQLLLHKMFLMLDNKRKEIIHK
IRELLNSIELTQNTLINDELVEWKRRQQSACIGGPP
NACLDQLQSWFTIVAETLQQIRQQLKKLEELEQKFT
YEPDPITKNKQVLSDRTFLLFQQLIRSSFVVERQPC
MPTHPQRPLVLKTGVQFTVKLRLLVKLQELNYNLKV
KVSFDKDVNEKNTVKGFRKFNILGTHTKVMNMEEST
NGSLAAEFRHLQLKEQKNAGNRTNEGPLIVTEELHS
LSFETQLCQPGLVIDLETTSLPVVVISNVSQLPSGW
ASILWYNMLVTEPRNLSFFLNPPCAWWSQLSEVLSW
QFSSVTKRGLNADQLSMLGEKLLGPNAGPDGLIPWT
RFCKENINDKNFSFWPWIDTILELIKKHLLCLWNDG
CIMGFISKERERALLKDQQPGTFLLRFSESSREGAI
TFTWVERSQNGGEPDFHAVEPYTKKELSAVTFPDII
RNYKVMAAENIPENPLKYLYPNIDKDHAFGKYYSRP
KEAPEPMELDDPKRTGYIKTELISVSEVHPSRLQTT
DNLLPMSPEEFDEMSRIVGPEFDSVMSTV
Mouse Gene 20848 – second best
MAQWNQLQQLDTRYLEQLHQLYSDSFPMELRQFLAP
WIESQDWAYAASKESHATLVFHNLLGEIDQQYSRFL
QESNVLYQHNLRRIKQFLQSRYLEKPMEIARIVARC
LWEESRLLQTAATAAQQGGQANHPTAAVVTEKQQML
EQHLQDVRKRVQDLEQKMKVVENLQDDFDFNYKTLK
SQGDMQDLNGNNQSVTRQKMQQLEQMLTALDQMRRS
IVSELAGLLSAMEYVQKTLTDEELADWKRRQQIACI
GGPPNICLDRLENWITSLAESQLQTRQQIKKLEELQ
QKVSYKGDPIVQHRPMLEERIVELFRNLMKSAFVVE
RQPCMPMHPDRPLVIKTGVQFTTKVRLLVKFPELNY
QLKIKVCIDKDSGDVAALRGSRKFNILGTNTKVMNM
EESNNGSLSAEFKHLTLREQRCGNGGRANCDASLIV
TEELHLITFETEVYHQGLKIDLETHSLPVVVISNIC
QMPNAWASILWYNMLTNNPKNVNFFTKPPIGTWDQV
AEVLSWQFSSTTKRGLSIEQLTTLAEKLLGPGVNYS
GCQITWAKFCKENMAGKGFSFWVWLDNIIDLVKKYI
LALWNEGYIMGFISKERERAILSTKPPGTFLLRFSE
SSKEGGVTFTWVEKDISGKTQIQSVEPYTKQQLNNM
SFAEIIMGYKIMDATNILVSPLVYLYPDIPKEEAFG
KYCRPESQEHPEADRGSAAPYLKTKFICVTPFIDAV
WK
FIG 4. Amino acid sequences for a human gene and its top two matches in mouse from BLAST. The
second-best gene (right) had 89% of its neighbors rank better than the best match (left).
[Figure 5 histogram. Title: Comparison of Neighbor Gene Ranks for Top Two Mouse Genes; x-axis: Rank; y-axis: Number of Genes (0 to 600); series: Mouse Gene 1, Mouse Gene 2]
FIG 5. Histogram shows the raw rankings of neighbors for the best and second-best matches. Note that
while the second-best matches have rankings of almost equal spread (slight left bias), the best matches have
rankings concentrated below 100.
[Figure 6 plot. Title: Delta Values for 10-gene Neighborhoods; y-axis: delta (-500 to 500)]
FIG 6. The delta values are shown for a subset of human genes with 10 co-expressed genes in their
neighborhoods.