Method S1 - Morris Lab

Supplemental Methods
On the choice and optimality of the sequence-based benchmark method
Assessing the predictive value of target site accessibility required us to establish a
benchmark for the accuracy of methods that do not consider target site accessibility. We
have made some simplifying assumptions when designing this benchmark, and as such it
may not represent the current “best possible sequence-based model”. However, we have
designed it to be fair and it has some advantages over other choices. First consider two
facts: target site accessibility is assessed using the mRNA sequence, so the best possible
sequence-based model may simply be one that most accurately predicts target site
accessibility; also, there is no consensus opinion on what the best possible sequence
model is, so even if we were able to identify and use the current best model, later
improvements to this model may invalidate our comparison. However, by making two
simplifying assumptions and evaluating models using AUROC, the baseline performance
represented by the #TS model is as good as or better than that of a whole class of
sequence-based models. The first assumption is that the consensus sequence motif summarizes the
RBP’s sequence binding preferences, i.e., that the RBP can only bind target sites
matching the motif and binds them with equal affinity. This approximation, though
drastic, is commonly made and we could find no reason to believe that it biased the
comparison in favor of target site accessibility. Even when we allowed the #TS model to
optimize its consensus sequence, the #ATS model still performed better using the learned
motif for eight of nine RBPs (see Fig. 4, main text). The second assumption is that the
contribution of a target site to the likelihood that an RBP will bind a transcript is
independent of its position in the transcript. When this assumption is relaxed and
we scan only the 3’ UTR for target sites, we obtain similar results (Suppl. Fig. 1 online).
The advantage of making these two assumptions is that they ensure that the #TS model
ranks transcripts in the same order (and thus has the same AUROC) as any
sequence-based model in which adding another target site increases the likelihood that an
RBP will bind a transcript. Models of this type represent a large number of other sensible
sequence-based models. In contrast, #ATS represents only one possible way of combining the
accessibilities of multiple target sites, so its AUROC is a lower bound on the best
possible performance of a model based on target site accessibility (an alternative
would be, e.g., the probability that at least one target site will be accessible [1]).
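The rank-invariance argument can be checked numerically. The following is a minimal sketch, not the code used in the study: the toy counts, the labels, the per-site binding probability of 0.3, and the model names other than #TS are all hypothetical and chosen only for illustration. It computes AUROC as the fraction of (bound, unbound) transcript pairs that are ranked correctly, with ties counted as one half, and compares several scores that each increase strictly with the number of target sites.

```python
import math
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUROC as the probability that a random positive outscores a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: number of motif matches per transcript, and
# whether the transcript is bound (1) or not (0).
ts_counts = [0, 1, 1, 2, 3, 0, 4, 2]
bound = [0, 0, 1, 1, 1, 0, 1, 0]

# Illustrative sequence-based models. Each score increases strictly with
# the number of target sites; the value 0.3 is an arbitrary assumption.
models = {
    "#TS": lambda n: float(n),
    "at_least_one_site_bound": lambda n: 1.0 - (1.0 - 0.3) ** n,
    "log_count": lambda n: math.log1p(n),
}

aurocs = {
    name: auroc([score(n) for n in ts_counts], bound)
    for name, score in models.items()
}

# Contrast: a presence/absence score is non-decreasing but not strictly
# increasing, so it introduces ties and is outside the class covered by #TS.
presence_auroc = auroc([1.0 if n > 0 else 0.0 for n in ts_counts], bound)

for name, value in aurocs.items():
    print(f"{name}: {value:.3f}")
print(f"presence_only: {presence_auroc:.3f}")
```

On this toy data all three strictly increasing scores give the same AUROC, because AUROC depends only on the ordering of transcripts, whereas the presence/absence score gives a different value. This is why the equivalence is restricted to models in which each additional target site increases the predicted likelihood of binding.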
Supplementary References
1. Hackermuller, J., Meisner, N.C., Auer, M., Jaritz, M. & Stadler, P.F. The effect of
RNA secondary structures on RNA-ligand binding and the modifier RNA
mechanism: a quantitative model. Gene 345, 3-12 (2005).
2. Hogan, D.J., Riordan, D.P., Gerber, A.P., Herschlag, D. & Brown, P.O. Diverse
RNA-binding proteins interact with functionally related sets of RNAs, suggesting
an extensive regulatory system. PLoS Biol 6, e255 (2008).
3. Gerber, A.P., Luschnig, S., Krasnow, M.A., Brown, P.O. & Herschlag, D.
Genome-wide identification of mRNAs associated with the translational regulator
PUMILIO in Drosophila melanogaster. Proc Natl Acad Sci U S A 103, 4487-4492
(2006).
4. Tadros, W. et al. SMAUG is a major regulator of maternal mRNA destabilization
in Drosophila and its translation is activated by the PAN GU kinase. Dev Cell 12,
143-155 (2007).