An evolutionary proteomics approach to identify short linear motifs in intrinsically disordered regions
MOSES, Alan Research Proposal Requested: $137775
I. Overview
It is increasingly appreciated that many disease associated proteins contain regions of intrinsic disorder. However, relatively little is understood about the functions of these regions and it is not currently possible to predict the impact of mutations in these regions. We propose to analyze the function of disordered regions using a new systematic computational approach. We will then focus on specific examples to experimentally test the mechanisms of function of the identified disordered regions.
This proposal represents a new approach to attack a difficult problem in protein biochemistry: the function of intrinsically disordered proteins.
II. Background and Motivation
As many as 50% of human proteins are thought to contain intrinsically disordered regions [1,2], including many important disease associated proteins such as p53 [3], BRCA1 [4] and CFTR [5].
Although important biological functions have been described for specific examples of disordered regions, little is known in general about the sequence-function relationship for most residues in these regions [6-10]. One established model is that disordered regions are important for protein regulation [1], and contain short linear motifs (short peptide sequences important for protein interaction [11]).
We propose to apply this model systematically using the ‘comparative genomics’ paradigm, which exploits the observation that functional sequences are preferentially preserved over evolution [12-14].
Because the short linear motifs within disordered regions are important for function, they are expected to be preferentially conserved relative to flanking residues. We and others have successfully exploited such evolutionary conservation to identify thousands of conserved regions in non-coding DNA [15,16], and these have been demonstrated to have important functions in transcriptional regulation [17].
Evolutionary methods have also been widely applied to protein sequences to detect remote homologues and to identify critical functional residues and motifs [18-21]. Here we propose to develop an evolutionary method to identify short conserved segments within disordered regions of proteins, as a means to identify the short linear motifs important for function. This application of ‘comparative genomics’ is (to our knowledge) novel. We will therefore devote considerable attention to confirming the effectiveness of the method in this new context. If successful, the methods we develop will represent a substantial contribution and will be generally applicable beyond the scope of this project.
Why an evolutionary approach for disordered regions?
For the vast majority of intrinsically disordered regions, specific biological functions remain unknown. Computational methods can readily identify the disordered regions based on amino acid sequence composition [22] (see [23] for review) and in principle, function could be predicted by previously developed computational methods to recognize short linear motifs [24-28] (reviewed in [29,30]). Unfortunately, these motifs, typically only 3-8 amino acids long, are not statistically significant when search algorithms are applied at the proteome scale [31]; recent approaches have therefore resorted to relying on external functional data [32,33].
More general computational methods to predict interaction sites have also been developed [34-39] (reviewed in [40,41]), but these methods show little predictive power when structural information is not available [40]. They are therefore not applicable to the majority of disordered regions. For example, ANCHOR [37] was developed to identify binding sites in intrinsically disordered regions, and distinguishes disordered binding sites from globular proteins with high accuracy [37]. Nevertheless, its predictions of binding sites are not specific enough to identify short functional motifs (see below).
Evolutionary approaches can predict functional residues with great power [19,42], and given the extensive sequence databases becoming available for closely related species, their application to
disordered regions is timely. An evolutionary approach will also yield hypotheses about the regulatory function of intrinsically disordered regions. Although some disordered regions become ordered upon binding, many are thought to remain in predominantly disordered conformations in vivo [43] and their functions often take advantage of this structural flexibility [44]. In addition, disordered regions may act as scaffolds for signaling proteins [45,46], switches for regulation of protein stability or may fine-tune biochemical activity [47-49]. These functions will influence the composition and evolution of the short linear motifs within the disordered region.
We will analyze the functions of the disordered domains based on the short linear motifs they contain.
Focusing on phosphorylation sites, we will confirm the functions of conserved motifs using site-specific mutagenesis and in vitro kinase assays. We will also explore the mechanisms of function of three selected disordered regions using NMR. We and others have previously applied NMR to explore the structural features of disordered regions in several proteins, including CFTR, Sic1, I-2, spinophilin and others [47-49] (reviewed in [50]). The combination of detailed mechanistic studies with the genomewide unbiased analysis of disordered regions will allow us to generalize beyond the small number of examples that can be characterized in detail.
III. Specific aims and research plan
Aim 1: Develop and test evolutionary proteomics methods to identify short linear motifs
Aim 2: Genome-wide identification of conserved segments in yeast and human disordered regions
Aim 3: Experimentally analyze disordered regions containing predicted short linear motifs
Please see Appendix 1 for figures, Appendix 2 for a schematic outline of the proposal and Appendix 3 for information about completion of specific tasks.
Specific Aim 1. Develop evolutionary proteomics methods to identify short linear motifs
Natural selection is expected to remove mutations in functionally important sequences, leading to slower evolution in short linear motifs. Indeed, we and others have recently shown that characterized short linear motifs are conserved relative to the flanking amino acid sequences [51-55] and this conservation of short linear motifs has already been used to search for examples of known motifs [20,21].
To exploit evolutionary conservation to systematically identify functional elements within disordered regions (without relying on previously defined motif patterns), we propose to use a state-of-the-art probabilistic model known as a 'phyloHMM' [56] to identify short stretches of amino acid sequence that are evolving more slowly than the immediately flanking sequences. Briefly, our phyloHMM follows previous work [56] by assuming that each column in a multiple sequence alignment (Figure 1a) can be classified into a ‘conserved’ state or a ‘background’ state; it then reports for each residue the posterior probability that it falls in the conserved state (Figure 1b). PhyloHMMs explicitly account for the similarity of sequences related by a phylogenetic tree and therefore can extract the maximum signal from the multiple alignments. Because insertions and deletions are prevalent in disordered regions, we will modify the standard probabilistic models underlying the phyloHMM [57,58] to include an insertion/deletion process (in addition to the standard substitution process). These phyloHMMs will be implemented by Alex Nguyen Ba, a PhD student in the Moses lab.
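As a minimal sketch of the two-state idea, the Python below computes the per-column posterior probability of the conserved state with the forward-backward algorithm. The function name, the transition probabilities, the prior and the per-column likelihoods are all illustrative assumptions; in an actual phyloHMM the column likelihoods would be computed from a substitution (and gap) model on the phylogenetic tree, which is not shown here.

```python
def posterior_conserved(lik_cons, lik_bg, p_stay_cons=0.9, p_stay_bg=0.95, prior_cons=0.1):
    """Posterior probability of the 'conserved' state for each alignment column.

    lik_cons[i], lik_bg[i]: likelihood of column i under each state
    (hypothetical inputs; a real phyloHMM would evaluate them on the tree).
    The transition and prior values are arbitrary illustrative defaults.
    """
    n = len(lik_cons)
    emit = list(zip(lik_cons, lik_bg))  # state 0 = conserved, state 1 = background
    trans = [[p_stay_cons, 1 - p_stay_cons],
             [1 - p_stay_bg, p_stay_bg]]
    init = [prior_cons, 1 - prior_cons]

    # Forward pass, rescaled at every column to avoid numerical underflow.
    fwd, prev = [], None
    for i in range(n):
        if i == 0:
            cur = [init[s] * emit[0][s] for s in (0, 1)]
        else:
            cur = [emit[i][s] * sum(prev[r] * trans[r][s] for r in (0, 1))
                   for s in (0, 1)]
        z = sum(cur)
        prev = [c / z for c in cur]
        fwd.append(prev)

    # Backward pass, also rescaled; the scale factors cancel below.
    bwd = [[1.0, 1.0] for _ in range(n)]
    for i in range(n - 2, -1, -1):
        cur = [sum(trans[s][r] * emit[i + 1][r] * bwd[i + 1][r] for r in (0, 1))
               for s in (0, 1)]
        z = sum(cur)
        bwd[i] = [c / z for c in cur]

    # Combine and normalize per column to obtain the posterior.
    post = []
    for i in range(n):
        a = fwd[i][0] * bwd[i][0]
        b = fwd[i][1] * bwd[i][1]
        post.append(a / (a + b))
    return post
```

Columns whose likelihood is higher under the conserved state than under the background state receive a higher posterior, and the transition probabilities control how readily short conserved runs are called.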
Confirming the phyloHMM methodology: We will perform two important tests of the methodology to confirm that it can be applied to real data. First, due to the rapid evolution of intrinsically disordered regions, truly conserved short linear motifs may be aligned incorrectly in the multiple sequence alignments.
To address this possibility, we will simulate the evolution of proteins in which we can insert
disordered regions containing ‘artificial’ conserved short linear motifs. Because we know the short linear motifs in these simulated proteins are conserved, we can test whether the aligner can correctly align them. Second, we will confirm that when short linear motifs are actually conserved, the phyloHMM detects them (“true positives”) and not other stretches of amino acids (“false positives”). We will apply the phyloHMM to the alignments containing the ‘artificial motifs’ and compute ROC curves showing true positive and false-positive rates as we vary the assumptions about the evolution of the short linear motifs and the surrounding sequences. To confirm that the simulations are providing a realistic measure of predictive power, we will also curate from the literature a set of experimentally characterized short linear motifs that are found in disordered regions and identify the ones that appear conserved in the alignments. We will then test whether our phyloHMM can identify these bona fide short linear motifs.
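The ROC computation described above can be sketched as follows. This is an illustrative Python example, not the evaluation code used for the proposal: the function name is hypothetical, `scores` stands for any per-residue score (for instance a posterior probability), and `labels` marks residues that lie inside a planted ‘artificial’ motif.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs, one per threshold.

    scores: per-residue scores, higher meaning 'more likely conserved'.
    labels: True where the residue lies in a planted motif (assumed known
            because the sequences are simulated).
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    # Sweep the threshold from the highest observed score downwards.
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and not l)
        points.append((fp / n_neg, tp / n_pos))
    return points
```

Repeating this for each simulated evolutionary scenario gives one curve per set of assumptions, which is what allows the effect of those assumptions on detection power to be compared.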
Preliminary results: In our simulations, for typical motif evolutionary rates, e.g., 25% of background, we find that ~95% of the artificial motifs are correctly aligned (Figure 2a). Furthermore, using the phyloHMM we can identify ~50% of these conserved ‘artificial’ motifs (true positive rate is ~50%, Figure 2b), with false positive rates of less than 1 per 2500 amino acids (Figure 2c). We have also performed extensive searches of the literature to identify 337 experimentally characterized short linear motifs in budding yeast disordered regions. Of these, 112 were found in >90% of the species, such that we consider them “conserved”. The phyloHMM identified 74 of 112 (66%), indicating that the true positive rate observed in the simulation is realistic. By comparison, we also applied ANCHOR [37] to simulated disordered proteins. Even when we planted no artificial short linear motifs in these sequences, it predicted 47% (±9%) of amino acids to be within protein binding sites. Figure 1b, Figure 6 and Figure 10 show direct comparisons of the phyloHMM and ANCHOR. Taken together, these results indicate that using the phyloHMM it is possible to accurately identify short conserved segments with the characteristics of short linear motifs in simulated unstructured regions.
Beyond the phyloHMM: While phyloHMMs represent a novel and potentially powerful approach to identify functional elements in disordered regions, they are limited to those conserved segments that are preserved at the same location in all species, and aligned correctly by the multiple alignment algorithm.
However, this is not the case for many bona fide short linear motifs (e.g., Figure 3a ).
Of the 337 bona fide motifs, 225 were not conserved. Therefore, the “conserved” segments in disordered regions will represent only a subset of the functional elements in these regions. While the phyloHMM can identify this subset with great power, it is limited to this ‘low-hanging fruit’ of alignment conservation.
To address this, we will also develop novel methods to identify conserved sequences that are not aligned in all species, either due to alignment errors or evolutionary turnover in some species. We refer to this type of conservation as “alignment-free conservation”. To detect it, we will consider matches to a consensus motif appearing and disappearing over evolution according to a birth-death process with rates specified by a “background” amino acid substitution process that has no specific selection to retain matches to the consensus. If motif matches are retained over evolution beyond what is expected based on this background model, we can infer that selection has acted to preserve them. To apply this approach when we do not know the consensus in advance, we will scan along the alignment and test each short sequence as a potential motif.
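To make the alignment-free idea concrete, the sketch below counts how many orthologous sequences retain at least one match to a consensus anywhere in the region, without using an alignment, and then asks how surprising that count is. Note the simplification: a plain binomial tail stands in for the birth-death background model described above, and the per-species background match probability `p_bg` is an assumed input rather than something derived from a substitution process. All names are hypothetical.

```python
import re
from math import comb

def count_species_with_match(pattern, ortholog_seqs):
    """Number of unaligned orthologous sequences containing the consensus."""
    regex = re.compile(pattern)
    return sum(1 for seq in ortholog_seqs if regex.search(seq))

def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p); a stand-in for the background model."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Toy example: an S/T-P-x-R/K consensus written as a regular expression.
orthologs = ["MSTPARK", "MTPEKLL", "AASPLRQ", "GGGGGGG"]
retained = count_species_with_match(r"[ST]P.[RK]", orthologs)
# p_bg = 0.5 is an arbitrary illustrative value, not an estimate.
p_value = binomial_tail(retained, len(orthologs), 0.5)
```

A small tail probability would suggest that matches are retained more often than the background alone predicts; a faithful implementation would replace `binomial_tail` with the distribution implied by the birth-death process on the tree.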
Preliminary results: We have implemented (to our knowledge) the first algorithm that searches for “alignment-free conservation” of matches to consensus motifs (Figure 3b). We compute the distribution of the number of motif matches in orthologous sequences based on the extant sequence in a reference species (e.g., budding yeast) under the background model with no selection pressure to retain
them. We then search for proteins that deviate significantly from this model. For known phosphorylation consensus sequences, we can find significant enrichment of substrates for multiple kinases (Figure 4b), indicating that there is a signal of conservation for bona fide motifs. We have also identified additional proteins for which we can reject the background model and which are very likely to represent novel substrates. For example, at least 4 proteins in the DNA damage signaling pathway contain highly conserved clusters of non-aligned matches to the Mec1 consensus motif (Figure 4a). Since Mec1 directly targets 4 other proteins in this pathway, we believe our novel predictions are very promising (Figure 4a). Indeed, proteins with evidence for alignment-free conservation of the Mec1 consensus are significantly enriched for “response to DNA damage stimulus” proteins (P-value = 2.1 × 10^-9).
Anticipated problems and solutions: To our knowledge phyloHMMs have not yet been employed for analysis of disordered regions, but they have been successfully applied to DNA sequences, leading to the widely used phastcons 'conservation tracks' at the UCSC genome browser [66]. Since protein evolution is much more heterogeneous than DNA evolution, if the method employed for DNA sequences (i.e., searching for the most highly conserved regions [56]) were applied to proteins, it would identify slowly evolving regions (or structural domains) rather than short linear motifs.
To address this potential problem, we use a ‘local’ rate of evolution against which we compare the expected pattern of evolution. For the phyloHMM, the background rate is estimated using a 20 amino acid sliding window across the alignment, and the conserved rate is estimated by taking the maximum likelihood estimate of the rate at that position, up to 1/3 of the background rate. For our alignment-free method, we use a maximum likelihood estimate of the local evolutionary distance in a window surrounding the consensus sequence we are testing (Figure 3b). For the phyloHMM we also filter out known protein domains (using Pfam [61]).
Another important technical challenge is the insertions and deletions found within alignments of disordered regions. Because insertions and deletions violate the assumption that each column in an alignment is independent, standard probabilistic phylogenetic models treat substitutions only. We are developing a new model for “gaps” based on a compromise between computational feasibility and biological realism. To do so, we divide the multiple sequence alignment into “blocks” of constant gap size (illustrated as black vertical lines around grey shaded areas in Figure 1a,c). While each block does not necessarily include only one insertion or deletion event, this is a much better approximation than treating the columns independently. We can then compute the probability of each block based on the size of the gap and its distribution on the phylogenetic tree (Figure 1d, eq. 1). We will assume the gap process and substitution process are independent, so that the likelihood is simply the product of the two (Figure 1d, eq. 2). This model allows relatively simple maximum-likelihood parameter estimation.
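One way to picture the block decomposition is the short Python sketch below. It rests on an assumption about what a “block” is: here a block is taken to be a maximal run of consecutive columns in which the same set of sequences is gapped. The function name is hypothetical, and the block probabilities themselves (Figure 1d, eq. 1 and eq. 2) are not attempted.

```python
def gap_blocks(alignment):
    """Split alignment columns into maximal runs sharing one gap pattern.

    alignment: list of equal-length aligned strings, '-' marks a gap.
    Returns (start, end, pattern) tuples with end exclusive; pattern[i] is
    True if sequence i is gapped throughout the block.
    """
    n_cols = len(alignment[0])
    blocks = []
    start = 0
    prev = tuple(seq[0] == "-" for seq in alignment)
    for col in range(1, n_cols):
        cur = tuple(seq[col] == "-" for seq in alignment)
        if cur != prev:
            # The gap pattern changed, so the previous block ends here.
            blocks.append((start, col, prev))
            start, prev = col, cur
    blocks.append((start, n_cols, prev))
    return blocks

# Toy three-species alignment with two indel blocks.
toy = ["ACD--EF",
       "ACDGHEF",
       "A--GHEF"]
blocks = gap_blocks(toy)
```

Under the independence assumption stated above, a per-block gap probability would then simply be multiplied by the substitution likelihood of the ungapped residues in the same block.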
Specific Aim 2. Genome-wide identification of conserved segments in disordered regions
Once we have confirmed that the phyloHMM method can be applied to identify conserved segments in disordered regions, we will systematically apply it to alignments of all proteins from budding yeast (and related fungi [59]) and human (and related vertebrates [60]). We will start with analysis of the budding yeast proteome. Proteins in budding yeast and its relatives usually contain only a single exon, and a single transcript. This makes gene identification and assignment of 1 to 1 relationships between protein coding sequences from multiple species relatively straightforward. We will use the high-quality genome annotations and syntenic orthologs available for budding yeast from YGOB [59]. In vertebrates, gene structures are complex and alternative transcript numbers are large. This makes the bioinformatics analysis more time-consuming. A graduate student will be recruited to the Moses Lab to extend and apply these methods to the human proteome, building on the foundations of our yeast work.
We will then identify the disordered regions and protein domains in the human and yeast proteins (using DISOPRED [23] and Pfam [61]) and collect the short conserved segments that fall within disordered regions and do not match any known protein domains. To our knowledge, this analysis will represent a totally novel approach to identify (and quantify) the functional content of disordered regions. We will first test whether these conserved segments contain previously characterized instances of short linear motifs (curated from the literature), and whether they match previously identified consensus patterns.
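The filtering step described here amounts to simple interval logic, sketched below in Python. The coordinates are assumed to be half-open residue intervals for a single protein, and the inputs are assumed to have been parsed already from the disorder and domain predictions; the function names are hypothetical and no DISOPRED or Pfam API is used.

```python
def overlaps(a, b):
    """True if half-open intervals a = (start, end) and b overlap."""
    return a[0] < b[1] and b[0] < a[1]

def segments_in_disorder(segments, disordered, domains):
    """Keep conserved segments lying inside a disordered region and outside domains.

    segments, disordered, domains: lists of (start, end) intervals on one protein.
    """
    kept = []
    for seg in segments:
        inside = any(d[0] <= seg[0] and seg[1] <= d[1] for d in disordered)
        in_domain = any(overlaps(seg, dom) for dom in domains)
        if inside and not in_domain:
            kept.append(seg)
    return kept
```

Requiring a segment to lie entirely within a disordered region is a conservative choice; a looser rule, such as a minimum fractional overlap, would be an equally reasonable alternative.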
Preliminary results: Applying our phyloHMM to alignments of the entire yeast proteome yields >7000 short conserved segments. Remarkably, of the 305 characterized short linear motifs found in disordered regions, 31% are among the predictions of the phyloHMM, strongly suggesting that many of the novel predictions are likely to represent bona fide functional sequences. Of the remaining sequences, based on our simulations we expect only ~200 to be false positives. This provides (to our knowledge) the first unbiased characterization of the amount of functional amino acid sequence within intrinsically disordered regions. At the residue level, this indicates that at least ~5% of the amino acids in intrinsically disordered regions are under specific evolutionary constraints and are therefore very likely to have biological functions.
Among these predictions, we have identified many conserved segments that match previously known consensus patterns. For example, the FG-motif is found in disordered regions of nuclear pore complex (NPC) proteins [67]. Of the 30 components of the NPC, 13 are known to contain FG repeats, which are thought to be biologically important for the nuclear import and export of proteins [67]. To test whether the phyloHMM approach can identify these, we searched the conserved segments for those that matched the FG-motif. We found only 59 proteins that contained conserved FG-motifs, including 12 of the 13 NPC components. Since there are 3438 proteins for which alignments are available, this represents a highly significant enrichment: 12/59 vs 13/3438, P-value = 7.21 × 10^-16. In another test of our method, we applied a similar statistical analysis to the set of conserved elements containing the canonical phosphorylation site consensus sequence (S/T-P-x-R/K) of the Cdc28 kinase. Of 695 proteins tested in a high-throughput in vitro kinase assay [68], our phyloHMM identifies 40 proteins containing a short conserved sequence which matches the Cdc28 consensus sequence. Of those, 32 (80%) were found to be positive in that assay, which is a highly significant enrichment (32/40 vs 185/695, P-value = 1.4 × 10^-11). Of the 8 remaining proteins that contain conserved consensus sites, but were not identified as positives in the assay, one is known to be phosphorylated in vivo (Cdc15p [69]) and two others are known substrates of the Pho85p kinase (Rim15p [70]) and the Fus3p kinase (Fus2p [71]), both of which can phosphorylate the canonical Cdc28p consensus sequence. Furthermore, we have been able to experimentally verify (see Specific Aim 3) new examples of other known short linear motifs, such as phosphorylation sites for Cbk1 (Figure XX) and the KEN box (Figure 5d), using the phyloHMM method. These results are very encouraging, and indicate that evolutionary conservation in disordered regions is a strong predictor of biological function. These preliminary data indicate that our phyloHMM can identify thousands of short conserved segments in the disordered regions of the budding yeast proteome, and when these conserved segments match known short linear motifs they are likely to perform the predicted biological function.
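Enrichment comparisons of the form "k/n vs K/N" are commonly assessed with a one-sided hypergeometric test, sketched below in Python. The choice of test is an assumption, since the statistical test behind the P-values above is not stated, so the number this sketch produces need not match the reported values. The function name is hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which belong to the category of interest."""
    total = comb(N, n)
    favorable = sum(comb(K, i) * comb(N - K, n - i)
                    for i in range(k, min(K, n) + 1))
    # Exact integer arithmetic, converted to a float only at the end.
    return favorable / total

# FG-motif example from the text: 12 of 59 proteins vs 13 of 3438 overall.
p_fg = hypergeom_tail(12, 59, 13, 3438)
```

Working with exact integers avoids the loss of precision that can affect very small tail probabilities when they are computed in floating point.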
Identification of novel short linear motifs: Interestingly, despite the strong statistical results regarding previously known short linear motifs, we note that most of the identified conserved segments do not match any known short linear motif [11]. We hypothesize that these conserved segments represent examples of previously unrecognized short linear motifs. We will therefore attempt to discover these
motifs by identifying families of conserved segments with similar amino acid sequences. We will use graph-based clustering methods to identify groups of conserved segments based on sequence similarity.
These “clusters” may represent known and novel short linear motifs. To test whether they are associated with biological function, we will compare the number of proteins in the cluster with a given functional annotation to the number of proteins with that annotation expected by chance [63-65]. Statistical overrepresentation (or enrichment) of a particular function leads us to propose that function for the novel motif. These computational studies will be performed by a PhD student in the Moses lab in the first years of the project and as above, we will first apply this analysis to budding yeast, and then to vertebrate proteins. We consider this analysis a particularly exciting aspect of our proposal -- it will allow us to assign putative functions to completely novel short linear motifs.
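As a rough illustration of graph-based clustering by sequence similarity, the Python sketch below links equal-length segments whose fractional identity reaches a threshold and reports the connected components of the resulting graph. This is plain single-linkage grouping chosen for brevity, not any particular published clustering tool, and the identity measure, threshold and function name are assumptions.

```python
def cluster_segments(segments, min_identity=0.75):
    """Group segments into connected components of a similarity graph."""
    def identity(a, b):
        # Only equal-length, non-empty segments are compared in this sketch.
        if len(a) != len(b) or not a:
            return 0.0
        return sum(x == y for x, y in zip(a, b)) / len(a)

    parent = list(range(len(segments)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if identity(segments[i], segments[j]) >= min_identity:
                parent[find(i)] = find(j)

    clusters = {}
    for i, seg in enumerate(segments):
        clusters.setdefault(find(i), []).append(seg)
    return sorted(clusters.values(), key=len, reverse=True)

# Toy input: three FxFP-like peptides, two KEN-like peptides, one singleton.
toy_clusters = cluster_segments(["FAFP", "FSFP", "FTFP", "KENA", "KENL", "DSFQ"])
```

Each resulting cluster can then be passed to an enrichment test by comparing the annotations of its member proteins with the proteome-wide background.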
Preliminary results: To associate the thousands of novel conserved segments with functions, we have performed a graph-based clustering of conserved segments identified in budding yeast alignments using
MCODE [62]. In addition to identifying many known motifs ( Figure 5 ), the cluster analysis revealed hundreds of new consensus sequences that were not previously recognized as short linear motifs.
Within these clusters, we have 30 unknown consensus sequences with 20 or more conserved examples in the yeast proteome. Several of these new putative consensus sequences are statistically associated with biological functions and are therefore very likely to represent novel, previously unrecognized short linear motifs. These include a previously unreported DSF motif that is associated with amino acid permeases (6/8 vs. 36/3438, P-value = 2.4 × 10^-11), an NPY motif associated with vesicle and nuclear membrane proteins (7/12 vs. 419/3438, P-value = 1.7 × 10^-5) and an FxFP motif statistically enriched in proteins that physically interact with Cbk1 (Figure 6). Preliminary experimental data (see Specific Aim 3 below) indicate that this motif is a bona fide interaction motif for Cbk1.
Anticipated problems and solutions: It may not be possible to associate functions with all novel consensus sequences identified in the cluster analysis. First, many of the new consensus sequences may be statistical artifacts of the clustering procedure. To rule this out, we will repeat the cluster analysis using multiple different algorithms and parameter settings. Clusters that appear consistently are more likely to represent bona fide patterns in the data. We will also repeat the clustering on motifs identified in simulated disordered proteins, and consider only clusters that exceed what is observed in the simulations.
Second, because the cluster analysis of short conserved segments is unbiased, some identified motifs will have functions that have not been studied in the lab and we will not have hypotheses to direct our tests of these motifs. We will therefore focus on motifs that do show a statistical association with some known function. Thus, we are once again limited to the ‘low-hanging fruit’ of novel consensus sequences that do show statistical enrichments. Encouragingly, in our preliminary data we have already identified several good candidates ( Figure 6 ) and therefore we will have more than enough to analyze within the period of this proposal.
Specific Aim 3. Experimental analysis of predicted short linear motifs and disordered regions
Because application of the phyloHMM to disordered regions is novel, to demonstrate the power of the methodology we will confirm several predictions using site-specific mutagenesis, in vitro kinase assays and fluorescence microscopy. We will then turn to more detailed NMR analysis to test hypotheses about the mechanisms of regulation mediated by the disordered regions.
3a) Experimental confirmation of computational predictions.
To show that the evolutionarily conserved segments in disordered regions actually represent bona fide short linear motifs, we will test
specific examples experimentally in budding yeast, including (i) novel examples of previously known consensus sequences and (ii) novel consensus sequences identified in our cluster analysis. These experiments will provide empirical support for the evolutionary proteomics approach to identifying functional elements within intrinsically disordered regions. Indeed, our preliminary data below indicates that when tested, the short conserved segments identified by the phyloHMM are likely to function as short linear motifs.
(i) New examples of known motifs.
To demonstrate that our methods can predict novel phosphorylation sites for known protein kinases we will perform in vitro kinase assays. We will focus on substrates of the NDR/LATS family protein kinase Cbk1 in budding yeast. This family of kinases is important for determination of cell-growth and development and is conserved between yeast and humans [72], but few direct substrates have been identified. We will search the short conserved segments and identify matches to the Cbk1 consensus. The disordered regions that contain these consensus sites will be tested in in vitro kinase assays with Cbk1 purified from yeast cells in collaboration with Eric Weiss’ lab at
Northwestern University ( Figure 7 ). To confirm that the in vitro phosphorylation is due to the Cbk1 kinase, we will repeat these experiments in cells without the Cbk1 kinase activity ( Figure 7 ). We aim to identify and test 5 novel Cbk1 substrates in collaboration with the Weiss lab in the second year of the proposal. While this will not be an exhaustive enumeration of substrates, it will be a large enough number to demonstrate that our methods can identify new substrates, and we will not be particularly wedded to any individual protein if we find that it is difficult to purify or work with experimentally.
Preliminary data: We identified conserved matches to the Cbk1 consensus [80] in the N-terminus of Sec3p, which is predicted to be disordered. In collaboration with Eric Weiss’ group at Northwestern University, we made alanine mutations in the predicted phosphorylation sites (Figure 7d) and subjected the Sec3p N-terminus to an in vitro phosphorylation assay. Preliminary results (Figure 7e) indicate that the N-terminus of Sec3p is a very good substrate for Cbk1 in vitro and that phosphorylation of this protein is also important in vivo (Figure 7f). We have also identified additional candidate Cbk1 substrates: Fir1 and Tao3, which were reported to physically interact with Cbk1 [89,90], and Mpt5, which is an RNA-binding protein (Figure 7g-i) like Ssd1, a known Cbk1 substrate (refs).
In our preliminary cluster analysis of conserved segments in disordered regions (Figure 5), we identified a cluster corresponding to the KEN-box degradation signal recognized by the APC^Cdc20. Only 10 proteins had a short conserved segment matching the KEN sequence. Eight of those contained an experimentally verified KEN degradation signal [81-83], were characterized targets of the APC^Cdc20 [84,85] or were cyclins, one of which contains a verified KEN sequence [83]. The two remaining conserved segments matching the KEN signal (in Spt21p and Sgd1p) have not been associated with the
APC or known to show cell-cycle regulated degradation. Spt21p is a protein involved in regulating histone transcription and its transcription is cell-cycle regulated [86,87]. Furthermore, over-expression of Spt21p has been shown to be toxic [88], suggesting a requirement for tight control on protein levels.
We therefore decided to test if the identified KEN sequence in Spt21p was truly a degradation signal.
We first tested if protein abundance was cell-cycle regulated and found that Spt21p protein levels coincide with Clb2p protein levels ( Figure 8b ) indicating that, as at the level of mRNA, Spt21p protein levels vary over the cell cycle. Given the toxicity of over expression, we reasoned that if the KEN sequence is a biologically relevant degradation signal, then over expression of a KEN-mutant form of
Spt21p would be more toxic than a wt form. We therefore mutated the KEN-box to alanines and observed that growth was more severely affected by over-expression of the mutant than of the wt (Figure 8d).
This phenotypic effect is consistent with the hypothesis that the evolutionarily conserved sequence is
important. Finally, to test the stability of the Spt21p KEN-mutant protein, we assayed protein levels of wt Spt21p and mutant by over-expressing the protein with the GAL promoter and then shutting off both transcription and translation. Indeed, we observed that the mutant protein levels remained high, while the wt degraded over time ( Figure 8e ).
(ii) Test functions of novel motifs.
Our cluster analysis will identify a large number of new consensus sequences that have not been previously recognized. In Specific Aim 2 we will associate these with putative functions through statistical overrepresentation of functional annotation in the cluster [63-65].
To test these putative functions, we will make site-specific mutations in the novel motifs. For example, proteins containing conserved FxFP motifs were associated with physical interactions with Cbk1. We will therefore test whether mutations in this short peptide can disrupt interactions with this kinase.
Similarly, a previously unreported DSF motif was associated with amino acid permeases. We hypothesize that this motif will be important for targeting of the permeases to the correct localization.
We will therefore make N-terminal GFP fusion proteins and follow their localization using fluorescence microscopy before and after mutagenesis of this motif. We will also test whether mutations in this motif impair permease function by measuring cell growth in auxotrophic strains. These experiments will be performed by a technician in the Moses Lab throughout the project.
Preliminary data: One of the novel motifs identified in our cluster analysis was an FxFP motif that was statistically enriched in proteins that were found to interact with Cbk1 in high-throughput studies
( Figure 6d “x” [89,90]). We noted that this motif resembled the docking motif that has been reported for MAPKs, so we decided to test whether this motif represented a novel docking site for the Cbk1 kinase. To test this, in collaboration with Brian Yeh in Eric Weiss’ lab, we expressed short peptide fragments containing the conserved segment and tested whether they could bind the Cbk1 kinase domain in an amylose resin pull-down assay ( Figure 6e ). Amazingly, all 6 of these peptides showed binding in this assay, with 4/6 showing strong binding. This confirms that this short motif can mediate direct interactions with the Cbk1 kinase domain. These preliminary data suggest that this motif is a novel docking site for this kinase and support the idea that the novel patterns identified through cluster analysis represent bona fide short linear motifs.
3b) Testing models for the mechanisms of regulation. Several models have been proposed for the mechanistic function of disordered regions. Perhaps most familiar is the function of disordered proteins as scaffolds for signaling and protein complex formation ( Figure 9a [44]). Scaffolds are an important mechanism to ensure specificity in cell signaling, as they bring otherwise potentially promiscuous enzymes into close physical proximity. For example, MAPKs form canonical three-kinase signaling cascades, where each kinase has great specificity for the next. However, this specificity is not always encoded in the direct interactions between kinases, but rather by scaffold proteins that physically link each kinase to the next (refs). This is one important mechanism by which kinases can be reused in multiple signaling pathways while avoiding “cross-talk” (refs).
A second well-characterized mechanism for disordered regions is the “multisite regulation” model, where multiple regulatory sites in disordered regions are important for ultrasensitive “switch-like” responses ( Figure 9b [49,73]). The paradigmatic example of this mechanism is the cell-cycle regulator Sic1p, which undergoes multi-site phosphorylation to ensure switch-like onset of S phase.
Recently, we have demonstrated that although the N-terminus of Sic1p is disordered, it adopts a very compact three-dimensional conformation and interacts with its binding partner (Cdc4) in many rapidly alternating states ( Figure 10 [49]). Thus, in the case of Sic1, the flexibility of the disordered protein
seems critical to allow all of the sites to interact with Cdc4, thus providing the mechanistic basis for the switch-like regulation [74].
Finally, disordered regions can integrate signals from multiple binding partners ( Figure 9c ,
[47]). Regulatory proteins and internal binding domains can compete for binding sites within a protein to activate or inhibit the biochemical function. For example, CFTR, the protein mutated in cystic fibrosis
[47], contains a large disordered region known as the R-region that is important for controlling the activity of the CFTR channel [47]. The R-region contains multiple binding sites for interaction partners
(such as NBD1) whose binding propensity is modulated by phosphorylation. Depending on the propensity of these binding sites for their interacting proteins, they will be either highly bound or largely free. These multiple binding sites in the disordered region therefore integrate multiple signals to control activation of the channel. Because of their intrinsic structural flexibility, disordered proteins can have many more binding partners than ordered proteins and therefore are well-suited to these types of functions [44].
All of these models make predictions about the organization and evolution of the functional sequences within the disordered regions. Based on the patterns of conserved segments within the disordered regions identified by the phyloHMM (Specific Aim 1) we will analyze the mechanism of function by (i) integrating high-throughput functional data with our phyloHMM predictions to test these models systematically and (ii) testing examples of specific proposed mechanisms using NMR studies. It is important to note that any particular disordered region can fall into more than one of the classes defined above.
(i) Statistical tests of models for disordered region function.
The scaffold model predicts that disordered regions will contain multiple short conserved protein binding motifs of various types, and that these will occur in proteins that physically interact with many proteins with related biological functions. For example, Pbs2 contains 2 conserved proline-rich binding sites (one for Sho1 and one for
Nbp2, two SH3 domain proteins, refs) and a conserved docking site that binds to Ssk2/22 (refs). These interactions confer specificity to one of the yeast MAPK signaling pathways (refs). To test for scaffolds statistically, we will randomly permute the large-scale protein interaction data for each protein and test for an excess of proteins with many conserved binding motifs associated with a large number of protein interactions.
The multi-site regulation model predicts that proteins will have a large number of conserved motifs that match the same consensus sequence, and interact with only a single regulator that recognizes that consensus. For example, the N-terminus of Sic1p contains 9 weak binding sites for
Cdc4 [73,76] that contribute to switch-like degradation of this protein. Most of these weak sites are highly conserved, and therefore readily identified by the phyloHMM ( Figure 1b ). To test for multi-site regulation statistically, we will compare the number of proteins with a large number of the same conserved motifs and a small number of protein interactions to that expected if the protein motifs were distributed randomly amongst the disordered regions.
Finally, the integrator model predicts that disordered regions will contain conserved segments with the potential to interact with multiple (possibly intra-molecular) regulators, or overlapping or closely spaced sites for multiple (possibly intra-molecular) regulators. These shared or overlapping interaction points allow multiple regulators to combinatorially influence the conformation or availability of the disordered region. To test for integrators statistically, we will count the proteins with multiple different conserved motifs within close proximity that interact with multiple different corresponding regulators and compare this to the number expected in permuted data. We will also test for the preponderance of conserved motifs that match the specificity of potential intra-molecular binding domains.
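The permutation tests described above could be sketched as follows, here for the scaffold model. This is an illustration under stated assumptions rather than the final analysis: interaction counts are shuffled across proteins to generate the null distribution, and the thresholds for "many" motifs and "many" interactions (min_motifs, min_interactions) are hypothetical placeholders, as are the function name and the toy data.

```python
import random


def scaffold_permutation_test(motif_counts, interaction_counts,
                              min_motifs=3, min_interactions=10,
                              n_permutations=1000, seed=0):
    """Permutation test for an excess of motif-rich, highly connected proteins.

    motif_counts       -- dict: protein -> number of conserved motifs
    interaction_counts -- dict: protein -> number of reported interactions
    Thresholds are illustrative assumptions, not values from the proposal.

    Returns (observed_count, empirical_p_value).
    """
    proteins = sorted(motif_counts)
    motifs = [motif_counts[p] for p in proteins]
    degrees = [interaction_counts[p] for p in proteins]

    def count_hits(degree_list):
        # Proteins that pass both the motif and the interaction threshold.
        return sum(
            1 for m, d in zip(motifs, degree_list)
            if m >= min_motifs and d >= min_interactions
        )

    observed = count_hits(degrees)
    rng = random.Random(seed)
    shuffled = degrees[:]
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Break the link between motif content and interaction count.
        rng.shuffle(shuffled)
        if count_hits(shuffled) >= observed:
            at_least_as_extreme += 1

    # Add-one correction keeps the empirical p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_permutations + 1)
    return observed, p_value


if __name__ == "__main__":
    # Toy data: 5 motif-rich hubs and 15 motif-poor, weakly connected proteins.
    motifs = {f"P{i}": (4 if i < 5 else 0) for i in range(20)}
    degrees = {f"P{i}": (20 if i < 5 else 1) for i in range(20)}
    print(scaffold_permutation_test(motifs, degrees))
```

The same scheme could be adapted to the multi-site and integrator models by changing what count_hits counts and which quantity is shuffled, for example redistributing motifs amongst disordered regions instead of shuffling interaction counts.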
For each statistical test above we will identify whether the observed distribution provides support for that model. In addition, we will identify the examples of unstructured
regions that show the properties expected under each of the models. A graduate student in the Moses
Lab will work on this analysis throughout the duration of the project.
(ii) NMR studies of key examples. The statistical analysis will identify candidate regulatory mechanisms for many disordered regions and we will choose three to be tested further in NMR studies. We will determine which proteins are of most interest based on the short conserved segments within the disordered regions, and analysis of other functional data regarding each protein. In particular we will choose proteins where regulatory interactions with short linear motifs in the disordered regions have been established or can be confidently inferred, and can be reconstituted in vitro.
We can distinguish between the models of disordered region function using NMR as follows. For static complexes expected for scaffolds, we should observe discrete binding events giving rise to chemical shift perturbations upon interaction. For dynamic complexes expected in multi-site regulation, we should observe significant resonance broadening. Finally, for integrators, the dynamics of the complexes should be influenced by the effects of the different binding partners or modifications.
In all cases we will include the entire unstructured region, as the context of the multiple binding sites is critical to understanding the mechanism of interaction.
Previously, we have performed extensive NMR studies on the intrinsically disordered N-terminus of
Sic1p [49,74,76]. In those studies we used purified Cln2-Cdc28 to test the effects of phosphorylation, and found that both phosphorylated and unphosphorylated forms are clearly disordered [76]. Our studies revealed that this protein interacts with Cdc4 through a large number of alternating conformations ( Figure 10 ), and that this regulation is controlled by multi-site phosphorylation of the
Cdc4 binding sites by Cdk1 [76]. Interestingly, even though Sic1p is intrinsically disordered, we have found that it is fairly compact [76,79]. We hypothesize that compactness is important for the electrostatic component of the interaction with Cdc4, and we will test whether additional disordered proteins also are more compact than expected, particularly when they appear consistent with the multisite regulation model ( Figure 9b ).
We propose to study the mechanism of function of the disordered N-terminus of Pds1p, which is critical for degradation of this protein by the APC Cdc20 [77] and is regulated by the kinases Cdk1 and
Chk1 [77,78]. Pds1p (known as securin in mammals) is a conserved regulator of anaphase entry, which prevents cell cycle progression in the presence of spindle checkpoint activation [91]. It is targeted by the APC for degradation at the onset of anaphase, and the N-terminus of Pds1p contains a D-box
( Figure 11a ) that is important for regulation by APC [91]. The APC is a large, multi-subunit ubiquitin ligase that regulates the cell cycle using two activating subunits, Cdc20 and Cdh1/Hct1. Cdc20 recruits substrates to the APC by binding directly to degradation motifs (D-box and KEN box). Though the APC is too large for NMR studies, the Cdc20 subunit (which itself can bind Pds1p) is a good candidate to investigate the role of the short linear motifs identified on the N-terminus of Pds1p (see below). It has recently been shown that degradation of Pds1p is controlled by phosphorylation [77], and the N-terminus of Pds1p also contains two phosphorylation sites for Cdk1, only one of which is conserved over evolution ( Figure 11a ), indicating that the multisite regulation model developed for Sic1p [49] is unlikely to apply. The N-terminus of Pds1p also contains a phosphorylation site for Chk1 ( Figure 11a ), which prevents cell-cycle progression in the presence of DNA damage by preventing APC-dependent degradation of Pds1 [78]. Therefore, we hypothesize that this disordered region integrates two signals in an ‘OR’ logic: prevent Pds1 degradation due to spindle checkpoint activation or due to DNA-damage
( Figure 11e ). To test this, we will measure the dynamics of the interaction of the N-terminus of Pds1 with Cdc20 when either, neither or both of the Cdk1 and Chk1 sites are phosphorylated. Similarly, we will perform these experiments when these signals have been altered by site-specific mutagenesis.
We have also identified the disordered N-terminus of Mps1 (a highly conserved mitotic kinase) as a second candidate for NMR analysis. This protein is a known target of the APC, and contains a D-box in both yeast and human (refs) but these appear at very different locations in the protein ( Figure
12c ). In this case we are interested in how the interaction of the disordered region with the folded binding partner changes when the short motif has changed location from one end of the disordered region to the other. To test this we will make Mps1 N-terminus constructs that contain KEN-boxes at each location, as well as constructs that have no KEN-boxes and characterize these using NMR.
To confirm the disorder of the putative regions (as these have been identified as disordered by
DISOPRED [23]) we will use NMR spectroscopy to measure the 1H-15N correlation spectra, which indicate disorder when the amide proton chemical shift dispersion is narrow and line widths of peaks are sharp. To test the effects of post-translational modification on the disordered regions, we will reconstitute the modifications in vitro, and then measure the spectra of the unmodified and modified forms. To determine which regions of the disordered protein interact with the regulator, we will use titration experiments in which we will follow the changes in the intensity of the peaks. The transferred cross-saturation (TCS) can also be used to map sites of interaction. We will repeat these experiments after site-specific mutagenesis of the short conserved segments to understand their impact on the regulatory interaction. Once the experimental setup has been developed, these experiments will be performed by a postdoctoral fellow in the Forman-Kay lab over the second two years of the project.
Preliminary data: The N-terminus of Pds1p contains a highly conserved KEN-box and we have made site-specific mutations in it to test its function ( Figure 11a ). This construct will also be used to test the importance of the KEN-box for interaction with the APC in vitro. We have also begun experiments on the N-terminus of Mps1. We noticed that it also contains a KEN box, but that this motif is not conserved in its location and therefore could not be detected by the phyloHMM. Nevertheless, we sought to confirm whether the putative KEN-box is important for regulation of Mps1 stability. We have performed site-specific mutagenesis experiments, and our preliminary results indicate that mutating the motif leads to stabilization of the protein in a shut-off experiment ( Figure 12a,b ). Fascinatingly, a KEN-box is found in a different location in this disordered region in Drosophila and vertebrates, but not in mammals ( Figure
12c ).
Anticipated problems and solutions: Not all types of interactions involving disordered regions will be amenable to in vitro analysis using NMR. In particular, scaffold proteins that have large numbers of binding partners will not be feasible to study. We will therefore focus on proteins where the interaction can be reconstituted in vitro. Similarly, disordered proteins can be difficult to work with experimentally, and therefore we have designed the research proposal so that we are not specifically tied to any single protein for our experiments. For example, if we cannot purify the Pds1 N-terminus, or obtain reliable
NMR data, we will move on to the Mps1 N-terminus, or other interesting candidates identified from the bioinformatics analysis.
IV. Significance
This project will systematically identify regulatory sequences and interaction sites within intrinsically disordered regions (Specific Aim 1). This represents a new approach to cracking the “code” that controls protein regulation. Furthermore, we will test whether the current models of disordered region function can explain regulatory functions and characterize several examples in mechanistic detail
(Specific Aim 2). For example, we aim to understand how disordered regions can integrate information, and how they can retain their function when short linear motifs change location over evolution. Our
studies will provide insight into the functions of disordered regions, which are found in many important disease genes.