BCB430Y1_Proposal3

BCB430Y1
Project Proposal – Mass spectrometry
Bimal Ramdoyal
Student ID: 996407552
What is Mass Spectrometry?
Mass spectrometry (MS) is a technique that measures the mass-to-charge ratio of charged particles.
Consider a molecule X introduced into the mass spectrometer. Once inside, it is ionized by colliding with a
beam of electrons. The resulting ions travel at varying speeds, and the selected ones hit the detector at the
end of the instrument. We use this technique to filter a desired set of molecules out of a collection of
molecules. Calculating the mass-to-charge ratio of certain fragments gives us hints about the abundance of
molecule X.
Scoring the data
The data is stored in the ProHits [1] database. Currently we use two methods to score the data, namely
SAINT and COMPASS. To use either scoring method, the data has to be extracted from the
database and converted into input files that are compatible with that method.
To find the probability of an interaction between a prey protein i and a bait j, SAINT
creates a separate distribution for every bait-prey pair. The spectral counts for each bait-prey pair
are modeled as a mixture of two components, true and false interactions. The
parameters of the true and false distributions and the prior probability πT of true interactions are inferred
from the spectral counts for all interactions that involve prey i and bait j. In the graph below we can see
a threshold:
Figure 1.1 Probability Model in SAINT [2]
The false interaction is represented by the left-most curve, which has the lower spectral counts,
while the true interaction is represented by the right-most curve. If the prior probability of a true
interaction is πT then the prior probability of a false interaction is (1 - πT).
The spectral counts of prey i in purifications of bait j are assumed to follow a
Poisson distribution representing either a true interaction (with mean λij) or a false interaction (with
mean κij). The spectral counts are used to calculate the weighted likelihood of a true interaction,
Tij = πT P(Xij | true), and of a false interaction, Fij = (1 - πT) P(Xij | false), which are then used to
calculate the posterior probability of a true interaction, P(true | Xij).
P(true | Xij) = πT P(Xij| true) / (πT P(Xij| true) + (1 - πT) P(Xij| false))
P(true | Xij) = Tij / (Tij + Fij)
These posterior probabilities are reported as the SAINT probabilities and are used to estimate the Bayesian
false discovery rate (FDR). [2]
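The posterior calculation above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the means and the prior are supplied by hand here, whereas SAINT infers them from the data, and the function names are hypothetical. The FDR helper uses the usual Bayesian estimate, the average of (1 - probability) over the interactions selected at a given threshold, which may differ in detail from what SAINT implements.

```python
import math


def poisson_pmf(x, mean):
    """P(X = x) for a Poisson distribution, computed in log space for stability."""
    return math.exp(-mean + x * math.log(mean) - math.lgamma(x + 1))


def posterior_true(x, true_mean, false_mean, pi_t):
    """P(true | Xij) = Tij / (Tij + Fij) for one observed spectral count x.

    true_mean, false_mean and pi_t are assumed to be known here;
    SAINT itself estimates them from the spectral counts.
    """
    t = pi_t * poisson_pmf(x, true_mean)            # Tij
    f = (1.0 - pi_t) * poisson_pmf(x, false_mean)   # Fij
    return t / (t + f)


def bayesian_fdr(probabilities, threshold):
    """Estimated FDR among interactions whose probability is >= threshold."""
    selected = [p for p in probabilities if p >= threshold]
    if not selected:
        return 0.0
    return sum(1.0 - p for p in selected) / len(selected)
```

With a true mean of 20, a false mean of 2 and a prior of 0.1, a spectral count of 20 gives a posterior close to 1, while a count of 1 gives a posterior close to 0.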
COMPASS is composed of an automated MS/MS data processing component and a protein
function/annotation component, which together form a platform for analyzing proteomic data. For every
interaction detected by the mass spectrometer, COMPASS assigns a confidence score. These scores
make up datasets of interacting proteins. COMPASS uses 2 LC-MS/MS datasets of interacting proteins
and measures the frequency, abundance and reproducibility of interactions to calculate the score.
COMPASS creates a matrix in which the rows are the unique proteins (i) identified across all the
experiments and the columns are the baits (j) used for those experiments. Each cell in the matrix, Xi,j,
holds the Total Spectral Count (TSC) for interacting protein i with bait j. COMPASS has several scoring
metrics [4], namely the S-score, D-score and WD-score, with thresholds ST, DT and WDT. All three
were developed empirically based on their ability to discriminate known interactors from
known background proteins, with the S-score being the first metric developed. Both the D- and WD-scores are
based on the S-score and share the same fundamental formulation, but have additional terms that add increasing
resolving power. [5]
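As a rough illustration of the matrix described above, the following Python sketch builds a prey-by-bait table of total spectral counts from a list of (prey, bait, count) records. The record layout and the function name are assumptions made for this example, as is the choice to sum counts when the same prey-bait pair appears more than once.

```python
from collections import defaultdict


def build_tsc_matrix(records):
    """Build a prey x bait matrix of total spectral counts.

    records: iterable of (prey, bait, spectral_count) tuples
    (a hypothetical layout chosen for this sketch).
    Returns (preys, baits, matrix) with matrix[i][j] = count for
    preys[i] with baits[j], or 0 if the pair was never observed.
    """
    totals = defaultdict(int)
    for prey, bait, count in records:
        totals[(prey, bait)] += count  # assumption: repeated pairs are summed

    preys = sorted({prey for prey, _ in totals})
    baits = sorted({bait for _, bait in totals})
    matrix = [[totals.get((prey, bait), 0) for bait in baits] for prey in preys]
    return preys, baits, matrix
```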
Once the matrix is built, we can calculate the scores for each interactor for each bait [3] using the
following formulas for each score:
Figure 1.2 taken from: COMPASS website: http://falcon.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi
First, the Z-score is constructed by mean centering and scaling the counts as in the conventional Z statistic,
where the mean and standard deviation are estimated from the data for each prey. The D-score is based on the
spectral count adjusted by a scaling factor that reflects the reproducibility of prey detection over
replicate purifications of the same bait. If Xij is the spectral count between prey i and bait j, then
Dij = ((k / fi)^pij · Xij)^(1/2), where k is the total number of baits profiled in the experiment, fi is the
number of experiments in which prey i was detected and pij is the number of replicate experiments of bait j in
which prey i was detected. After computing the scores, a threshold DT is selected from simulation data
so that 95% of the simulated data falls below the chosen threshold. [2]
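The two scores described above can be sketched as follows. This is an illustration, not the COMPASS implementation: the function names are hypothetical, and the Z-score here uses the population standard deviation of one prey's row, which is an assumption since the text does not say which estimator is used.

```python
import math
import statistics


def z_scores(prey_row):
    """Z-scores for one prey's spectral counts across all baits."""
    mean = statistics.mean(prey_row)
    sd = statistics.pstdev(prey_row)  # assumption: population standard deviation
    if sd == 0:
        return [0.0 for _ in prey_row]
    return [(x - mean) / sd for x in prey_row]


def d_score(x_ij, k, f_i, p_ij):
    """D-score: square root of (k / f_i) raised to p_ij, times x_ij.

    x_ij: spectral count of prey i with bait j
    k:    total number of baits profiled
    f_i:  number of experiments in which prey i was detected
    p_ij: number of replicates of bait j in which prey i was detected
    """
    return math.sqrt((k / f_i) ** p_ij * x_ij)
```

For example, a prey seen with 2 of 4 baits, detected in 1 replicate with a spectral count of 8, has a D-score of sqrt(2 · 8) = 4.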
Finally, for each score, the value above which 5% of the random data lies is taken as that
score's threshold. The distribution of the random data can then be used to assign a p-value to proteins
passing the score thresholds. Thus, we can argue that a protein that passes a score threshold and has a
high enough TSC (reflected in the p-value) is very likely to be a real interactor. [6]
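The thresholding step can be illustrated with a short Python sketch. It assumes the random (simulated) scores are already available as a plain list; how COMPASS generates them is not covered here, and the function names are hypothetical.

```python
def score_threshold(simulated_scores, percent_below=95):
    """Smallest simulated score with at least percent_below % of the data at or below it."""
    ordered = sorted(simulated_scores)
    # Integer ceiling of percent_below * n / 100, converted to a 0-based index.
    index = (percent_below * len(ordered) + 99) // 100 - 1
    return ordered[max(index, 0)]


def empirical_p_value(score, simulated_scores):
    """Fraction of simulated scores that are at least as large as the observed score."""
    at_least = sum(1 for s in simulated_scores if s >= score)
    return at_least / len(simulated_scores)
```

A protein whose score exceeds the value returned by score_threshold lies in the top 5% of the random data, and empirical_p_value gives the corresponding tail fraction for any observed score.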
Proposal
1. Getting the data
To score the data, the data must first be fetched from the ProHits database and converted into the
input format required by SAINT or COMPASS. A Perl-based interface will be created that will allow the
user to select a list of experiments by experiment ID from a sorted list, and generate a tab-delimited
text file of experiment IDs and a T/C flag. This file is then passed as input to another Perl
script (export_saint.pl or export_COMPASS.pl). The script export_saint.pl creates 3 input files (bait.dat,
prey.dat, inter.dat) in a format that SAINT can run. If the scoring method chosen is
COMPASS, then one of the output files from export_saint.pl (inter.dat) has to be converted (using an
existing homebrewed R script) into an m × n matrix, and the scores can then be computed by an
additional existing script implementing the COMPASS method.
Currently, the 3 input files required for SAINT can be generated through the ProHits user
interface, but our task is to use the new interface to pass the analysis set file to a slightly modified
export_saint.pl script, which generates the 3 required data files.
(Available on: http://code.google.com/p/prohits/source/browse/trunk/Prohits/script/export_saint.pl)
2. Interface
A user interface, titled Scoring Module, will be built to allow users to create a new analysis set, which is a
tab-delimited text file consisting of experiment IDs and T/C flags. At first, the user will be
presented with a simple interface where they can select their name from a drop-down list and choose
to display only their own experiments (to save time querying the database and to display less data on
the screen). The experiments can be sorted by experiment ID, experiment name, user ID, or date and time.
Once the user has selected a list of experiments, another interface will allow them to set the T/C flag
for each experiment and generate the analysis set.
After submitting the form, the analysis set will be created in a text file. Here is an example of the
analysis set with the first column containing the experiment IDs and the second column the T/C flag:
101	T
407	C
408	C
634	T
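A minimal Python sketch of writing and reading such an analysis set is shown below. The proposal plans a Perl-based interface, so this only illustrates the file format; the function names and the validation rules are assumptions made for the example.

```python
import csv
import io

VALID_FLAGS = {"T", "C"}


def write_analysis_set(rows, handle):
    """Write (experiment_id, flag) pairs as tab-delimited lines."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    for experiment_id, flag in rows:
        if flag not in VALID_FLAGS:
            raise ValueError(f"unknown flag {flag!r} for experiment {experiment_id}")
        writer.writerow([experiment_id, flag])


def read_analysis_set(handle):
    """Parse a tab-delimited analysis set back into (experiment_id, flag) pairs."""
    rows = []
    for record in csv.reader(handle, delimiter="\t"):
        if not record:
            continue  # skip blank lines
        experiment_id, flag = record
        if flag not in VALID_FLAGS:
            raise ValueError(f"unknown flag {flag!r} for experiment {experiment_id}")
        rows.append((int(experiment_id), flag))
    return rows


# Example: round-trip the analysis set through an in-memory file.
buffer = io.StringIO()
write_analysis_set([(101, "T"), (407, "C"), (408, "C"), (634, "T")], buffer)
parsed = read_analysis_set(io.StringIO(buffer.getvalue()))
```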
Using the interface, the user will also be able to select a scoring method and pass this input
file to the appropriate script for that method. Then the 3 files discussed above
(bait.dat, prey.dat and inter.dat) will be created and will either be run by SAINT or be converted to be
run by COMPASS (server side).
While the 3 files are being analysed, the user can simply close the browser window. After the
job has completed, an email will be sent to the user (identified by the name they chose on the first
page) with links to download the output files.
The same interface will have a section that will allow the user to compare the output of both methods
(saved into the html folder and available through download links) and to adjust parameters.
3. Database and code metrics
Running queries that involve many experiments can be intensive, so a view-based approach will be used
to reduce the time needed to query the data from multiple tables. A table called “tScoringModule” and
a view called “vScoringModule” will be created that will hold the data needed for the Scoring Module
interface.
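The table-plus-view idea can be sketched as follows, using Python's built-in sqlite3 module with an in-memory database as a stand-in for the ProHits database. Only the names tScoringModule and vScoringModule come from this proposal; the columns and the helper function names are hypothetical and chosen to match the sort fields mentioned in the Interface section.

```python
import sqlite3


def create_scoring_module_view(conn):
    """Create the table and the view (columns are assumptions for this sketch)."""
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS tScoringModule (
            experiment_id   INTEGER PRIMARY KEY,
            experiment_name TEXT NOT NULL,
            user_id         TEXT NOT NULL,
            created_at      TEXT NOT NULL
        );
        CREATE VIEW IF NOT EXISTS vScoringModule AS
            SELECT experiment_id, experiment_name, user_id, created_at
            FROM tScoringModule;
        """
    )


def experiments_for_user(conn, user_id):
    """Return one user's experiments from the view, sorted by experiment ID."""
    cursor = conn.execute(
        "SELECT experiment_id, experiment_name FROM vScoringModule "
        "WHERE user_id = ? ORDER BY experiment_id",
        (user_id,),
    )
    return cursor.fetchall()
```

The interface would then query vScoringModule only, so the underlying tables can change without touching the interface code.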
The recorded data will be indexed and will hold the outputs of the SAINT and COMPASS runs. SAINT
normalizes its output values differently from COMPASS. An algorithm will be designed to
normalize the output from COMPASS so that the two are easier to compare on the interface.
Once the results from both SAINT and COMPASS are normalized, a table (“smResults”) will be created
to store analysis sets and results in the ProHits database. Another interface will be created where the
user can compare the two scored sets of results (saved in table “smResults”) by choosing
their own scoring methods and associated options.
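The normalization algorithm has not been designed yet. As one possible starting point only, the sketch below rescales a list of COMPASS scores to the range 0 to 1 with min-max scaling, so that they sit on the same range as SAINT probabilities. The choice of min-max scaling and the function name are assumptions, not the method this proposal commits to.

```python
def min_max_normalize(scores):
    """Rescale scores linearly so the smallest maps to 0.0 and the largest to 1.0."""
    if not scores:
        return []
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores are identical, so there is no spread to rescale.
        return [0.0 for _ in scores]
    return [(s - lowest) / (highest - lowest) for s in scores]
```

Min-max scaling preserves the ranking of interactions but not their probabilistic meaning, so a rescaled COMPASS score of 0.9 should not be read as a 90% probability.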
4. Documentation
The results will be documented, and a tutorial manual will be provided to help new users get comfortable
with the user interfaces.
Goals:
The main goals of this project are to create a set of user interfaces that will allow users to select
experiments from a sortable list, choose their preferred scoring methods, run the scripts through the
interface itself and download the output files. Furthermore, the interface should enable users to compare
scoring methods, filter the results by category, see how each method performs, and analyze
different sets of data using different metrics, so that it can be used as a benchmarking tool.
References:
[1]. ProHits Database: http://www.nature.com/nbt/journal/v28/n10/extref/nbt1010-1015-S1.pdf
[2]. SAINT: probabilistic scoring of affinity purification–mass spectrometry data
Hyungwon Choi, Brett Larsen, Zhen-Yuan Lin, Ashton Breitkreutz, Dattatreya Mellacheruvu, Damian Fermin,
Zhaohui S Qin, Mike Tyers, Anne-Claude Gingras, Alexey I Nesvizhskii
http://www.nature.com/nmeth/journal/v8/n1/full/nmeth.1541.html
Analysis and validation of proteomic data generated by tandem mass spectrometry, Alexey I Nesvizhskii, Olga
Vitek & Ruedi Aebersold
http://www.nature.com/nmeth/journal/v4/n10/abs/nmeth1088.html
[3]. Protein–protein interactions: Interactome under construction, Laura Bonetta,
http://www.nature.com/nature/journal/v468/n7325/full/468851a.html
Mass spectrometry-based proteomics
Aebersold R, Mann M. http://www.ncbi.nlm.nih.gov/pubmed/12634793
[4], [5], [6]. Defining the Human Deubiquitinating Enzyme Interaction Landscape
Mathew E. Sowa, Eric J. Bennett, Steven P. Gygi and J. Wade Harper
http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2716422/
COMPASS Tutorial: http://falcon.hms.harvard.edu/ipmsmsdbs/cgi-bin/tutorial.cgi
Current Scoring Module interface v1.0:
http://tin.emililab.edu/Prohits/analyst/scoring_module/scoring_module.cgi
link to current version of export_saint.pl:
http://code.google.com/p/prohits/source/browse/trunk/Prohits/script/export_saint.pl?spec=svn37&r=37