BCB430Y1 Project Proposal – Mass Spectrometry
Bimal Ramdoyal
Student ID: 996407552

What is Mass Spectrometry?

Mass spectrometry (MS) is a technique that measures the mass-to-charge ratio of charged particles. Consider a molecule X introduced into the mass spectrometer. Once inside, it is ionized by collision with a beam of electrons. The resulting ions travel at varying speeds and are separated by their mass-to-charge ratio, and the selected ions strike the detector at the end of the instrument. We use this technique to filter out a desired set of molecules from a larger mixture; calculating the mass-to-charge ratio of certain fragments gives us hints about the abundance of molecule X.

Scoring the data

The data is stored in the ProHits [1] database. Currently we use two methods to score the data, namely SAINT and COMPASS. To use either scoring method, the data has to be extracted from the database and converted into input files compatible with that method.

In the case of SAINT, to find the probability of an interaction between prey protein i and bait j, separate distributions are created, one for each bait-prey pair. The spectral counts for each bait-prey pair are modelled with a two-component mixture of true and false interactions. The parameters of the true and false distributions and the prior probability πT of a true interaction are inferred from the spectral counts of all interactions involving prey i and bait j. The figure below shows the two mixture components and the resulting threshold:

Figure 1.1 Probability model in SAINT [2]

The false-interaction component is the left-most curve, with the lowest spectral counts, while the true-interaction component is the right-most curve. If the prior probability of a true interaction is πT, then the prior probability of a false interaction is (1 - πT). The spectral count Xij of prey i in the purification of bait j is modelled as a Poisson distribution, with mean λij if the interaction is true and mean κij if it is false. The spectral counts are used to calculate the weighted likelihood of a true interaction, Tij = πT P(Xij | true), and of a false interaction, Fij = (1 - πT) P(Xij | false), which are then combined into the posterior probability of a true interaction P(true | Xij):

P(true | Xij) = πT P(Xij | true) / (πT P(Xij | true) + (1 - πT) P(Xij | false))
P(true | Xij) = Tij / (Tij + Fij)

These posterior probabilities are the SAINT probabilities, and they are used to estimate the Bayesian false discovery rate (FDR). [2]
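To make the posterior calculation concrete, here is a minimal Python sketch of the formula above. It assumes the mixture parameters (the Poisson means λij and κij and the prior πT) have already been estimated, which SAINT does internally from the full dataset; the function names and the example numbers at the bottom are illustrative only and are not part of SAINT itself.

from math import exp, factorial

def poisson_pmf(k, mu):
    # P(X = k) for a Poisson distribution with mean mu
    return exp(-mu) * mu ** k / factorial(k)

def saint_posterior(x_ij, lambda_ij, kappa_ij, pi_t):
    # T_ij = pi_T * P(X_ij | true);  F_ij = (1 - pi_T) * P(X_ij | false)
    t_ij = pi_t * poisson_pmf(x_ij, lambda_ij)
    f_ij = (1 - pi_t) * poisson_pmf(x_ij, kappa_ij)
    # P(true | X_ij) = T_ij / (T_ij + F_ij)
    return t_ij / (t_ij + f_ij)

# Illustrative numbers only: spectral count 12, lambda_ij = 10, kappa_ij = 2, pi_T = 0.1
print(saint_posterior(12, 10.0, 2.0, 0.1))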
COMPASS is composed of an automated MS/MS data-processing component and a protein function/annotation component, which together form a platform for analyzing proteomic data. For every interaction detected by the mass spectrometer, COMPASS assigns a confidence score, and these scores make up datasets of interacting proteins. COMPASS uses LC-MS/MS datasets of interacting proteins and measures the frequency, abundance, and reproducibility of interactions across them to calculate the scores. COMPASS creates a matrix in which the rows are the unique proteins (i) identified across all the experiments and the columns represent each bait (j) used in those experiments. Each cell Xi,j of the matrix is the total spectral count (TSC) of interacting protein i in the purification of bait j. COMPASS has several scoring metrics [4], namely the S-, D-, and WD-scores, each with a corresponding threshold (ST, DT, WDT).

The S-score, D-score, and WD-score were all developed empirically based on their ability to discriminate known interactors from known background proteins, with the S-score being the first of the three. Both the D- and WD-scores are based on the S-score, sharing the same fundamental formulation, but include additional terms that add increasing resolving power. [5] Once the matrix is built, the scores for each interactor and each bait can be calculated [3] using the formulas summarized below:

Figure 1.2 Scoring formulas, taken from the COMPASS tutorial: http://falcon.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi

First, the Z-score is constructed by mean-centring and scaling, as in the conventional Z statistic, where the mean and standard deviation are estimated from the data for each prey. The D-score is based on the spectral count adjusted by a scaling factor that reflects the reproducibility of prey detection over replicate purifications of the same bait. If Xij is the spectral count between prey i and bait j, then

Dij = ((k / fi)^pij · Xij)^(1/2)

where k is the total number of baits profiled in the experiment, fi is the number of experiments in which prey i was detected, and pij is the number of replicate experiments of bait j in which prey i was detected. After computing the scores, a threshold DT is selected from simulated data so that 95% of the simulated scores fall below it [2]; in general, for each score, the value above which 5% of the random data lies defines that score's threshold. The simulated distribution can also be used to assign a p-value to proteins that pass the score thresholds. Thus, a protein that passes a score threshold and has a high enough TSC (reflected in the p-value) is very likely to be a real interactor. [6]

Proposal

1. Getting the data

To score the data, it must first be fetched from the ProHits database and converted into the appropriate input format for SAINT or COMPASS. A Perl-based interface will be created that allows the user to select a list of experiments, identified by experiment ID, from a sorted list, and to generate a tab-delimited text file of experiment IDs and a T/C (test/control) flag. This file is then passed as input to another Perl script (export_saint.pl or export_COMPASS.pl). The script export_saint.pl creates three input files (bait.dat, prey.dat, inter.dat) in a format that can be run by SAINT. If the scoring method chosen is COMPASS, then one of the output files from export_saint.pl (inter.dat) has to be converted (using an existing home-brewed R script) into an m × n matrix, after which the scores can be computed by an additional existing script implementing the COMPASS method. Currently the ProHits user interface can already generate the three input files required by SAINT; our task is to use the new interface to feed the analysis-set file into a slightly modified export_saint.pl script to generate the three required data files. (Available at: http://code.google.com/p/prohits/source/browse/trunk/Prohits/script/export_saint.pl)
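As an illustration of the conversion step, here is a minimal Python sketch that builds the prey × bait spectral-count matrix from inter.dat and computes the D-score defined earlier. It assumes inter.dat is tab-delimited with one line per purification-bait-prey triple (purification ID, bait, prey, spectral count), as in the SAINT interaction file; the function names are placeholders, and in the actual pipeline this conversion will be done by the existing R and COMPASS scripts.

from collections import defaultdict
from math import sqrt

def read_inter(path):
    # Assumed format: purification<TAB>bait<TAB>prey<TAB>spectral count
    rows = []
    with open(path) as fh:
        for line in fh:
            ip, bait, prey, count = line.rstrip("\n").split("\t")
            rows.append((ip, bait, prey, int(count)))
    return rows

def build_matrix(rows):
    # tsc[prey][bait] = total spectral count X_ij, summed over replicate purifications
    tsc = defaultdict(lambda: defaultdict(int))
    # replicates[prey][bait] = purifications of bait j in which prey i was detected
    replicates = defaultdict(lambda: defaultdict(set))
    for ip, bait, prey, count in rows:
        tsc[prey][bait] += count
        replicates[prey][bait].add(ip)
    return tsc, replicates

def d_scores(tsc, replicates):
    # D_ij = ((k / f_i)^p_ij * X_ij)^(1/2), as defined above
    baits = {bait for prey in tsc for bait in tsc[prey]}
    k = len(baits)                              # total number of baits profiled
    scores = {}
    for prey, by_bait in tsc.items():
        f_i = len(by_bait)                      # number of baits/experiments in which prey i was detected
        for bait, x_ij in by_bait.items():
            p_ij = len(replicates[prey][bait])  # replicates of bait j containing prey i
            scores[(prey, bait)] = sqrt((k / f_i) ** p_ij * x_ij)
    return scores

rows = read_inter("inter.dat")
tsc, replicates = build_matrix(rows)
print(d_scores(tsc, replicates))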
2. Interface

A user interface, titled Scoring Module, will be built to allow users to create a new analysis set, which is a tab-delimited text file consisting of the experiment ID and the T/C flag. At first, the user will be presented with a simple interface where they can select their name from a drop-down list and choose to display only their own experiments (to reduce the time spent querying the database and the amount of data on screen). The experiments can be sorted by experiment ID, experiment name, user ID, or date/time. Once the user selects a list of experiments, another interface will allow them to set the T/C flag and generate the analysis set. After submitting the form, the analysis set will be written to a text file. Here is an example of an analysis set, with the first column containing the experiment IDs and the second column the T/C flag:

101    T
407    C
408    C
634    T

Using the interface, the user will also select a scoring method, and the input file will be passed to the appropriate script based on that choice. Then the three files discussed above (bait.dat, prey.dat, and inter.dat) will be created and either run through SAINT or converted and run through COMPASS (server side). While the files are being analysed, the user can simply close the browser window; after the job has completed, an email with links to download the output files will be sent to the user (based on the name chosen on the first page). The same interface will have a section that allows the user to compare the output of both methods (saved in the html folder and available through download links) and to adjust parameters.

3. Database and code metrics

Running queries that involve many experiments can be intensive, so a view-based approach will be used to reduce the time needed to query data from multiple tables. A table called "tScoringModule" and a view called "vScoringModule" will be created to hold the data needed for the Scoring Module interface; the recorded data will be indexed and will hold the outputs from running SAINT and COMPASS. SAINT normalizes its output values differently from COMPASS, so an algorithm will be designed to normalize the COMPASS output and make comparison on the interface easier (one possible approach is sketched after this section). Once the results from SAINT and COMPASS are both normalized, a table ("smResults") will be created in the ProHits database to store analysis sets and results. Another interface will be created where the user can compare the two scored result sets (saved in table "smResults") by choosing their own scoring methods and associated options.
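The exact normalization algorithm is still to be designed. As one possible starting point, the Python sketch below simply rescales COMPASS scores linearly onto [0, 1] so that they can be displayed next to SAINT probabilities; this is an assumption made for illustration, not the final method, and the function name and example scores are placeholders.

def normalize_compass_scores(scores):
    # Linearly rescale COMPASS scores (e.g. D- or WD-scores) to the range [0, 1]
    # so they can sit next to SAINT probabilities on the comparison page.
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}

# Example with made-up scores:
print(normalize_compass_scores({("preyA", "bait1"): 4.2, ("preyB", "bait1"): 12.7}))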
4. Documentation

Results will be documented, and a tutorial manual will be provided to help new users become comfortable with the user interfaces.

Goals:

The main goals of this project are to create a set of user interfaces that allow users to select experiments from a sortable list, choose their preferred scoring method, run the scripts through the interface itself, and download the output files. Furthermore, the interface should let users compare scoring methods, filter results by category, inspect performance, and analyse different sets of data using different metrics, so that it can also serve as a benchmarking tool.

References:

[1] ProHits database: http://www.nature.com/nbt/journal/v28/n10/extref/nbt1010-1015-S1.pdf

[2] SAINT: probabilistic scoring of affinity purification–mass spectrometry data. Hyungwon Choi, Brett Larsen, Zhen-Yuan Lin, Ashton Breitkreutz, Dattatreya Mellacheruvu, Damian Fermin, Zhaohui S. Qin, Mike Tyers, Anne-Claude Gingras, Alexey I. Nesvizhskii. http://www.nature.com/nmeth/journal/v8/n1/full/nmeth.1541.html
Analysis and validation of proteomic data generated by tandem mass spectrometry. Alexey I. Nesvizhskii, Olga Vitek, Ruedi Aebersold. http://www.nature.com/nmeth/journal/v4/n10/abs/nmeth1088.html

[3] Protein–protein interactions: Interactome under construction. Laura Bonetta. http://www.nature.com/nature/journal/v468/n7325/full/468851a.html
Mass spectrometry-based proteomics. Aebersold R, Mann M. http://www.ncbi.nlm.nih.gov/pubmed/12634793

[4], [5], [6] Defining the Human Deubiquitinating Enzyme Interaction Landscape. Mathew E. Sowa, Eric J. Bennett, Steven P. Gygi, J. Wade Harper. http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2716422/

COMPASS Tutorial: http://falcon.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi
Current Scoring Module interface v1.0: http://tin.emililab.edu/Prohits/analyst/scoring_module/scoring_module.cgi
Current version of export_saint.pl: http://code.google.com/p/prohits/source/browse/trunk/Prohits/script/export_saint.pl?spec=svn37&r=37