Supplementary Material for the study
Prediction of Binding Sites of Peptide Recognition Domains: An Application on Grb2 and SAP SH2 Domains
by William A. McLaughlin, Tingjun Hou, and Wei Wang

The supplementary material consists of the following files:

1. this file: lists the files available in the supplementary material and provides a summary of the comparison between the analyses done using phosphorylated and unphosphorylated peptides
2. KnownBindingPeptideSequences.txt: lists the peptide sequences known to bind the Grb2 and SAP SH2 domains used in the study
3. Grb2_KnownOnlyHMM.hmm: the hidden Markov model for searching for potential binding sites of the Grb2 SH2 domain, created using the sequences of the known binding peptides
4. SAP_KnownOnlyHMM.hmm: the hidden Markov model for searching for potential binding sites of the SAP SH2 domain, created using the sequences of the known binding peptides
5. Grb2_BindingClusterSequences_unphos.txt: the peptide sequences in the binding cluster generated by clustering on sequence and energy characteristics with peptides in the unphosphorylated state
6. Grb2_BindingClusterHMM_unphos.hmm: the hidden Markov model created using the sequences in the binding cluster generated with peptides in the unphosphorylated state. It is the primary HMM to be used to search for new binding candidates of the Grb2 SH2 domain.
7. SAP_BindingClusterSequences_unphos.txt: the peptide sequences in the binding cluster generated by clustering on sequence and energy characteristics with peptides in the unphosphorylated state
8. SAP_BindingClusterHMM_unphos.hmm: the hidden Markov model created using the sequences in the binding cluster generated with peptides in the unphosphorylated state. It is the primary HMM to be used to search for new binding candidates of the SAP SH2 domain.
9. Grb2_SearchResults_unphos.txt: the top 100 peptide sequences retrieved by a database scan using the binding cluster HMM (Grb2_BindingClusterHMM_unphos.hmm) across the set of human protein tyrosine positions in SWISS-PROT. For each retrieved tyrosine site, the corresponding ranks are given for comparable searches using the SCANSITE tool and the known only HMM (Grb2_KnownOnlyHMM.hmm).
10. Grb2_SearchResults_phos.txt: a list similar to the one in Grb2_SearchResults_unphos.txt, but created by analyzing peptides in the phosphorylated state. The list is given for comparison purposes and is not considered a viable list of candidates.
11. SAP_SearchResults_unphos.txt: the top 100 hits retrieved by a database scan using the SAP binding cluster HMM (SAP_BindingClusterHMM_unphos.hmm) across the set of human protein sequences in SWISS-PROT. For each retrieved site, the corresponding rank is given for a comparable search using the known only HMM (SAP_KnownOnlyHMM.hmm).
12. SAP_SearchResults_phos.txt: a list similar to the one in SAP_SearchResults_unphos.txt, but created by analyzing peptides in the phosphorylated state. The list is given for comparison purposes and is not considered a viable list of candidates.
13. Figure_S1.tif: plots comparing the distributions of known binding peptides and candidate peptides for peptides in the phosphorylated and unphosphorylated states. Details of the comparison are below.
14. Table S1.doc: clustering results for the peptides in the phosphorylated state. Details are given below.
15. Figure_S2.tif: search results for the known only HMM, binding cluster HMM, and control cluster HMM for the Grb2 and SAP domain analyses using peptides in the phosphorylated state. Details of the comparison are below.

A summary of the calculations done using phosphorylated peptides

The calculations for peptides in the phosphorylated state are reported here and compared with those for peptides in the unphosphorylated state.
For the SAP dataset, there were 11 known binding sequences (Table S1) and 1799 candidate peptide sequences. The mean energy of the known binding peptides was compared to that of the candidate peptides using a t-test. The p-value associated with the t-test was 6.41 * 10^-5 for peptides in the phosphorylated state and 2.31 * 10^-9 for peptides with the phosphate removed. Figure S1 (Panels A and B) shows the distributions of the known binding peptides and of 100 of the candidate peptides.

For the Grb2 dataset, there were 15 known binding sequences (Table S1) and 1400 peptides in the candidate set. The p-value associated with the t-test comparing the mean energies was 6.41 * 10^-5 for phosphorylated peptides and 2.31 * 10^-9 for the calculations done using unphosphorylated peptides. Figure S1 (Panels C and D) shows the distributions of the known binding peptides and of peptides randomly drawn from the background set of sequences. Note that the known binding peptides were better separated from the background peptides in the unphosphorylated state (Panel D) than in the phosphorylated state (Panel C).

Clustering was done on the peptides using sequence only, energy only, and sequence and energy simultaneously. Clustering results for the peptides in the phosphorylated state are given in Table S1. For both datasets, clustering on sequence and energy together produced the highest overlap score of the known binders in a given cluster. However, the overlap score was lower than that achieved using unphosphorylated peptides (compare to Table 1 of the main text). For the SAP domain dataset, four clusters were generated using sequence and energy. The second cluster contained the majority of the known binding peptides and was assigned as the binding cluster. For the Grb2 dataset, five clusters were generated using both sequence and energy. Cluster five was assigned as the binding cluster.
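The assignment of a binding cluster described above can be illustrated with a minimal sketch. It assumes that the binding cluster is simply the cluster holding the largest number of known binding peptides. The function name, the peptide labels, and the toy cluster labels are hypothetical, and the fraction it returns is only a stand-in for illustration, not the overlap score reported in Table S1.

```python
from collections import Counter


def assign_binding_cluster(cluster_of, known_binders):
    """Pick the cluster that holds the most known binding peptides.

    cluster_of    -- dict mapping peptide identifier -> cluster label
    known_binders -- iterable of identifiers of known binding peptides
    Returns (cluster label, fraction of clustered known binders in it).
    """
    known = [p for p in known_binders if p in cluster_of]
    if not known:
        raise ValueError("no known binding peptide was clustered")
    counts = Counter(cluster_of[p] for p in known)
    # most_common(1) gives the label with the highest count; on a tie,
    # the label encountered first wins.
    label, n_in_cluster = counts.most_common(1)[0]
    return label, n_in_cluster / len(known)


# Hypothetical toy data: six peptides in four clusters, four known binders.
cluster_of = {
    "PEP_A": 1, "PEP_B": 2, "PEP_C": 2,
    "PEP_D": 2, "PEP_E": 3, "PEP_F": 4,
}
known_binders = ["PEP_B", "PEP_C", "PEP_D", "PEP_E"]

label, fraction = assign_binding_cluster(cluster_of, known_binders)
print(label, fraction)
```

With the toy data, three of the four known binders fall in cluster 2, so that cluster is selected.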
The sequences in these binding clusters were used to create hidden Markov models. Search results are plotted in Figure S2 and are comparable to Figure 3 of the main text. The results for the SAP domain analysis are shown in Panel A and those for Grb2 in Panel B. For the SAP analysis, the p-value associated with the separation of the mean log percentile ranks of the known binding peptides between the control HMM and the binding cluster HMM was 2 * 10^-2, indicating no significant difference between the control HMM and the binding cluster HMM at a significance level of 0.01. For the Grb2 analysis, the p-value for the same comparison was 1.78 * 10^-4, which is less significant than the value of 1.62 * 10^-4 achieved in the unphosphorylated peptide analysis.

The top ranked peptides found by the binding cluster HMM created using peptides in the phosphorylated state are given in the file SAP_SearchResults_phos.txt for the SAP analysis and in the file Grb2_SearchResults_phos.txt for the Grb2 analysis. They are presented for comparison to the lists generated using the unphosphorylated peptides.
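The comparison of mean log percentile ranks between two HMMs can be sketched as follows. This is only an illustration under stated assumptions: the percentile rank is taken as 100 * rank / number of scanned sites, the logarithm is base 10, and Welch's unequal-variance form of the t statistic is used. The text above does not specify any of these choices, and all ranks below are hypothetical.

```python
import math
from statistics import mean, variance


def log_percentile_ranks(ranks, n_sites):
    """log10 of the percentile rank (100 * rank / n_sites) of each 1-based rank."""
    return [math.log10(100.0 * r / n_sites) for r in ranks]


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


# Hypothetical ranks of five known binders among 10000 scanned sites.
n_sites = 10000
binding_ranks = [1, 3, 5, 10, 20]            # binding cluster HMM (assumed)
control_ranks = [100, 500, 1000, 2500, 5000]  # control HMM (assumed)

binding_lpr = log_percentile_ranks(binding_ranks, n_sites)
control_lpr = log_percentile_ranks(control_ranks, n_sites)
t_stat, dof = welch_t(binding_lpr, control_lpr)
print(t_stat, dof)
```

A negative t statistic here means the known binders rank nearer the top under the binding cluster HMM than under the control HMM. Turning the statistic into a p-value requires the t distribution, for example through `scipy.stats.ttest_ind(binding_lpr, control_lpr, equal_var=False)` if SciPy is available.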