Mass Spectrometry

Introduction to high-throughput analysis of proteins and metabolites by Mass Spectrometry
The basic principle
Brief introduction of techniques
Computational issues
Background
High-throughput profiling of biological samples
Red line: central dogma
Blue line: interaction
Metabolites
(Picture edited from http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/)
DNA: genotype, copy number, epigenetics …
RNA: expression levels, alternative splicing, microRNA …
Protein: concentration, modification, interaction …
Metabolite: concentration, modification, interaction …
Why Mass Spectrometry
The question:
A biological system contains tens of thousands of species of proteins and
metabolites. How can we identify and quantify them from a sample?
[Figure: plot with an axis from 0 to 14 across control samples 1 to 5 and disease samples 1 to 5]
Which protein is this?
Does it change significantly between control/disease samples?
Background
In a complex network, even if we know the entire structure, the network
behavior is hard to predict. Direct profiling gives us snapshots of the status of the
system.
(Picture from KEGG PATHWAY)
Why Mass Spectrometry
Proteins/metabolites could be separated according to their
properties:
mass/size
hydrophilicity/hydrophobicity
binding to specific ligands
charge
…………
Using
Chromatography
Electrophoresis
…………
http://en.wikibooks.org/
Why Mass Spectrometry
Problems with these separation techniques:
Reproducibility
Identification / Quantification
Inability to separate tens of thousands of species
Mass Spectrometry:
Highly accurate, highly reproducible measurements
Theoretical values easy to obtain → identification
Can study protein modifications (small ligands attached)
Measurements based on mass/charge ratio (m/z)
Mass Spectrometry --- getting ions from solution to gas phase
Picture provided by Prof. Junmin Peng (St. Jude)
Matrix-assisted laser desorption/ionization (MALDI)
Electrospray ionization (ESI)
Mass Spectrometry --- finding m/z
Time-of-flight:
When a charged particle is accelerated in an electric field, its time of
flight is
$$t = k \sqrt{\frac{m}{z}}$$
k: a constant related to instrument characteristics
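As a small numeric illustration of the relation above, the sketch below inverts t = k·sqrt(m/z) to recover m/z from a flight time. The function names, the one-point calibration, and the numbers are assumptions made for illustration; a real instrument is calibrated with several reference ions.

```python
import math

def flight_time(mz, k):
    """Time of flight for an ion with mass-to-charge ratio mz: t = k * sqrt(m/z)."""
    return k * math.sqrt(mz)

def mz_from_time(t, k):
    """Invert the relation: m/z = (t / k) ** 2."""
    return (t / k) ** 2

def calibrate_k(t_known, mz_known):
    """Estimate k from one calibrant ion of known m/z (hypothetical one-point calibration)."""
    return t_known / math.sqrt(mz_known)

# Hypothetical numbers: a calibrant at m/z 1000 arrives after 20 time units.
k = calibrate_k(20.0, 1000.0)
# An ion that takes twice as long has four times the m/z.
print(mz_from_time(40.0, k))
```

Because t grows with the square root of m/z, doubling the flight time corresponds to a four-fold increase in m/z.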
Mass Spectrometry --- finding m/z
Quadrupole:
A radio-frequency voltage is applied to opposing pairs of poles.
Only ions with a specific m/z can pass to the detector at each
frequency.
Mass Spectrometry --- finding m/z
Fourier transform MS.
Ions are detected not by hitting a detector, but by passing by a
detecting plate. Ions are detected simultaneously.
Very high resolution.
m/z is determined from the frequency of the ion in the cyclotron:
$$f \propto \frac{z}{m}$$
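To put numbers on the proportionality above, the sketch below uses the standard cyclotron relation f = qB / (2πm). The magnetic field strength B and the physical constants do not appear on the slide and are added here as assumptions for illustration.

```python
import math

ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs
DALTON = 1.66053906660e-27           # kilograms

def cyclotron_frequency(mz, b_tesla):
    """Cyclotron frequency in Hz for an ion with the given m/z (daltons per charge)."""
    return ELEMENTARY_CHARGE * b_tesla / (2.0 * math.pi * mz * DALTON)

def mz_from_frequency(freq_hz, b_tesla):
    """Invert the relation to recover m/z from a measured frequency."""
    return ELEMENTARY_CHARGE * b_tesla / (2.0 * math.pi * freq_hz * DALTON)

# Hypothetical 7 tesla magnet: halving m/z doubles the frequency.
f_1000 = cyclotron_frequency(1000.0, 7.0)
f_500 = cyclotron_frequency(500.0, 7.0)
print(f_1000, f_500)
```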
Why is simple MS not enough
A biological sample consists of tens of thousands of species
of molecules. The resolution is not enough for clear
separation.
Biological interactions between the molecules may interfere
with ionization.
The solution:
Multi-dimensional separation: combining MS with
protein breakage by enzymatic digestion and collision
decomposition
electrophoresis
chromatography
Tandem Mass Spec (MS/MS) for protein identification
Picture provided by Prof. Junmin Peng (St. Jude)
2D gel MS/MS
[Figure labels: control samples, treatment samples, differential spots, in-gel digestion, MS/MS protein identification]
2D gel differential protein finding → in-gel digestion → MS/MS protein identification
Int J Biol Sci 2007; 3:27-39
LC/MS
Liquid chromatography retention time
Take “slices” in retention time, send to MS
Mass-to-charge ratio (m/z)
LC/MS-MS
Picture provided by Prof. Junmin Peng (Emory)
LC/MS-MS
Here is an example of an LC/MS spectrum.
The second MS serves the purpose of protein identification.
Matching the sequence found by the second MS falls into the
realm of sequence comparison and database search.
Peak quantification is done by the first MS.
(a) Original spectrum; (b) square root-transformed spectrum to show smaller peaks;
(c) A portion of the spectrum showing details.
Between proteomics and metabolomics
Proteomics uses LC/MS-MS. The second MS is for protein
identification.
Metabolomics uses LC/MS. Sometimes a second MS is used,
but data interpretation for metabolite identification is much
harder.
What concerns statisticians:
(1) The shared LC/MS part:
In metabolomics: quantification, identification
In proteomics: quantification
(2) The second MS:
Protein identification: sequence modeling/comparison
Protein quantification: merging values from different peptides from
the same protein.
Some computational issues in LC/MS-MS
Modeling peaks.
Noise reduction & peak detection
Multiple peaks from one molecule caused by
(1) isotopes
(2) multiple charge states
Retention time correction.
Peak alignment.
Peak quantification, especially with overlapping peaks
caused by m/z sharing (mostly in metabolomics)
From peptides to proteins.
General workflow for LC/MS
Modeling peaks
In high-resolution LC/MS data, every peak is a thin slice, so there is no need
to model the MS dimension.
Modeling the LC dimension is important for quantification.
Models have been developed for traditional LC data, which can be applied here.
Most empirical peak shape models were derived from the Gaussian model.
Changes were made to account for asymmetry in the peak shape.
Modeling peaks
Asymmetric peak. “Asymmetry factor”: b/a measured at 0.1h (10% of the peak height h)
(Data Analysis and Signal Processing in Chromatography. A. Felinger)
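A minimal sketch of how the asymmetry factor could be read off a sampled peak. It assumes a is the front half-width and b the back half-width at the chosen fraction of the apex height, found by linear interpolation between samples; the function name and the toy data are hypothetical.

```python
def asymmetry_factor(t, y, fraction=0.1):
    """Return b/a, the back/front half-widths at `fraction` of the apex height."""
    apex = max(range(len(y)), key=lambda i: y[i])
    level = fraction * y[apex]

    def crossing(step):
        # Walk away from the apex until the trace drops below the level.
        i = apex
        while 0 <= i + step < len(y) and y[i + step] >= level:
            i += step
        j = i + step
        if not 0 <= j < len(y):
            raise ValueError("peak does not fall to the requested height")
        # Linear interpolation between the last point above and first point below.
        w = (y[i] - level) / (y[i] - y[j])
        return t[i] + w * (t[j] - t[i])

    a = t[apex] - crossing(-1)   # front half-width
    b = crossing(+1) - t[apex]   # back half-width
    return b / a

# Toy tailing peak: the back side is twice as wide as the front side.
t = [0, 1, 2, 3, 4, 5, 6]
y = [0, 5, 10, 7.5, 5, 2.5, 0]
print(asymmetry_factor(t, y))
```

A value above 1 indicates tailing, a value below 1 indicates fronting.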
Modeling peaks
The bi-Gaussian model:
The area under peak is:
(Data Analysis and Signal Processing in Chromatography. A. Felinger)
Modeling peaks
Generalized exponential function
(Data Analysis and Signal Processing in Chromatography. A. Felinger)
Modeling peaks
Log-normal function.
(Data Analysis and Signal Processing in Chromatography. A. Felinger)
Noise reduction
Reviewed by Katajamaa & Oresic (2007) J Chr. A 1158:318
Noise reduction
Signal-to-noise (S/N) ratio
Where to make the cut? Should it be a straight line or
a smoother?
http://www.appliedbiomics.com/Service/Promotions/promotions.html
Noise reduction & peak detection
Use filters to separate peaks from noise, in conjunction with a hard
cutoff.
Anal Chem. 2006 Feb 1;78(3):779-87.
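A minimal sketch of filter-based peak detection with a hard cutoff. It correlates the trace with the negative second derivative of a Gaussian and keeps local maxima above the cutoff. This is an illustration of the general idea, not the cited paper's implementation; the kernel choice, parameter names and defaults are assumptions.

```python
import math

def second_derivative_kernel(sigma, half_width):
    """Negative second derivative of a Gaussian, sampled at integer offsets."""
    return [(1.0 - (k / sigma) ** 2) * math.exp(-0.5 * (k / sigma) ** 2)
            for k in range(-half_width, half_width + 1)]

def apply_filter(signal, sigma=2.0):
    """Correlate the signal with the kernel; samples beyond the edges count as zero."""
    half = int(4 * sigma)
    kernel = second_derivative_kernel(sigma, half)
    n = len(signal)
    filtered = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < n:
                acc += w * signal[idx]
        filtered.append(acc)
    return filtered

def detect_peaks(signal, sigma=2.0, cutoff=1.0):
    """Indices that are local maxima of the filtered trace and exceed the hard cutoff."""
    f = apply_filter(signal, sigma)
    return [i for i in range(1, len(f) - 1)
            if f[i] > cutoff and f[i] >= f[i - 1] and f[i] > f[i + 1]]

# Synthetic trace: one Gaussian-shaped peak centered at index 30.
trace = [10.0 * math.exp(-0.5 * ((i - 30) / 2.0) ** 2) for i in range(60)]
print(detect_peaks(trace))
```

The kernel sums to roughly zero, so a flat baseline is suppressed while features of about the kernel's width are amplified. The cutoff is then applied to the filtered trace rather than to the raw intensities.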
Retention time correction
Retention times in the LC dimension fluctuate from run to run.
Identify “reliable” peaks in both samples, then use non-linear curve
fitting to adjust the retention time.
Anal Chem. 2006 Feb 1;78(3):779-87.
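A minimal sketch of the retention time adjustment, assuming the “reliable” peaks have already been matched between a sample and a reference. The shift is estimated as a smooth function of retention time by Gaussian-kernel-weighted averaging, which stands in here for whatever non-linear fit is preferred. All names and the bandwidth are hypothetical.

```python
import math

def fit_shift_curve(anchor_rt, anchor_shift, bandwidth):
    """Return a function rt -> estimated shift (kernel-weighted average of anchor shifts)."""
    def shift_at(rt):
        weights = [math.exp(-0.5 * ((rt - a) / bandwidth) ** 2) for a in anchor_rt]
        return sum(w * s for w, s in zip(weights, anchor_shift)) / sum(weights)
    return shift_at

def correct_retention_times(sample_rt, anchors_sample, anchors_reference, bandwidth=30.0):
    """Shift every retention time in the sample onto the reference time scale."""
    # Observed deviation of each reliable peak from its reference position.
    shifts = [s - r for s, r in zip(anchors_sample, anchors_reference)]
    shift_at = fit_shift_curve(anchors_sample, shifts, bandwidth)
    return [rt - shift_at(rt) for rt in sample_rt]

# Hypothetical run in which every peak elutes 5 seconds late.
print(correct_retention_times([155.0, 250.0],
                              anchors_sample=[105.0, 205.0, 305.0],
                              anchors_reference=[100.0, 200.0, 300.0]))
```

The bandwidth plays the role of the smoothing span. A peak far from every anchor gets weights that underflow to zero, so in practice the anchors should cover the whole retention time range.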
Multiple peaks from one molecule
Caused by multiple charge states (z = 1, 2, 3, …) and by
different numbers of carbon isotopes present in the molecule.
Example: m = 1000 (all C12)
single charge: 1000, 1001, 1002, 1003
2 charges: 500, 500.5, 501, 501.5
3 charges: 333.33, 333.67, 334
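The m/z values in the example can be reproduced with a few lines. The sketch follows the slide's simplification m/z = (m + n) / z, where n is the number of heavier isotopes and each adds about 1 to the mass. It ignores the mass of the charge carrier (for example a proton), which real spectra include. Function names are hypothetical.

```python
def mz_ladder(mono_mass, charges=(1, 2, 3), n_isotopes=4, isotope_spacing=1.0):
    """m/z of the first few isotope peaks at each charge state, rounded to 2 decimals."""
    return {z: [round((mono_mass + n * isotope_spacing) / z, 2) for n in range(n_isotopes)]
            for z in charges}

def infer_charge(isotope_gap):
    """Isotope peaks of a z-charged ion are about 1/z apart in m/z."""
    return round(1.0 / isotope_gap)

for z, peaks in mz_ladder(1000.0).items():
    print(z, peaks)
print(infer_charge(0.5))
```

The spacing between neighboring isotope peaks (1, 0.5, 0.33, …) is what lets the charge state be read off the spectrum.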
Multiple peaks from one molecule
Peak alignment
Reviewed by Katajamaa & Oresic (2007) J Chr. A 1158:318
Peak alignment
Dynamic programming.
BMC Bioinformatics 2007, 8:419
Peak alignment
First align m/z dimension by binning.
Use kernel density estimation to find “meta-peaks”.
Anal Chem. 2006 Feb 1;78(3):779-87.
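A minimal sketch of the kernel density step, assuming the peaks from all samples that fall in one m/z bin have already been collected. Local maxima of a Gaussian kernel density estimate over their retention times are taken as “meta-peak” centers. The bandwidth, grid step and function names are assumptions for illustration, not the cited paper's settings.

```python
import math

def kernel_density(points, grid, bandwidth):
    """Gaussian kernel density estimate of `points`, evaluated on `grid`."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((g - p) / bandwidth) ** 2) for p in points)
            for g in grid]

def meta_peaks(retention_times, bandwidth=5.0, step=0.5):
    """Retention times at which the density has a local maximum."""
    lo = min(retention_times) - 3 * bandwidth
    hi = max(retention_times) + 3 * bandwidth
    n = int((hi - lo) / step) + 1
    grid = [lo + i * step for i in range(n)]
    dens = kernel_density(retention_times, grid, bandwidth)
    return [grid[i] for i in range(1, n - 1)
            if dens[i] > dens[i - 1] and dens[i] >= dens[i + 1]]

# Peaks from several hypothetical samples in one m/z bin: two groups near 100 s and 200 s.
print(meta_peaks([100.0, 101.0, 99.0, 100.5, 200.0, 201.0, 199.5]))
```

Peaks from different samples that fall under the same density mode are then treated as the same feature.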
Dealing with overlapping peaks
(1) Matched filter.
(2) Some traditional methods.
(Data Analysis and Signal Processing in Chromatography. A. Felinger)
Dealing with overlapping peaks
(3) Statistical modeling using the EM algorithm
[Figure panels: bi-Gaussian mixture; Gaussian mixture]
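Modeling asymmetric peaks.
Bi-Gaussian model:
$$g(t) = \begin{cases} \dfrac{\delta}{\sqrt{2\pi}} \, e^{-\frac{(t-\alpha)^2}{2\sigma_1^2}}, & t < \alpha \\[2ex] \dfrac{\delta}{\sqrt{2\pi}} \, e^{-\frac{(t-\alpha)^2}{2\sigma_2^2}}, & t \ge \alpha \end{cases}$$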
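The model above is straightforward to evaluate. The sketch below also uses the closed-form area δ(σ1 + σ2)/2, which follows from integrating each half-Gaussian separately. Function and parameter names are chosen here for illustration.

```python
import math

def bi_gaussian(t, alpha, sigma1, sigma2, delta):
    """Bi-Gaussian peak: sigma1 applies left of the summit alpha, sigma2 to the right."""
    sigma = sigma1 if t < alpha else sigma2
    return delta / math.sqrt(2.0 * math.pi) * math.exp(-((t - alpha) ** 2) / (2.0 * sigma ** 2))

def bi_gaussian_area(sigma1, sigma2, delta):
    """Area under the peak: each half contributes delta * sigma / 2."""
    return delta * (sigma1 + sigma2) / 2.0

# Numerical check of the area with a simple Riemann sum (hypothetical parameters).
step = 0.01
numeric = sum(bi_gaussian(i * step, 10.0, 1.0, 2.0, 3.0) * step for i in range(4000))
print(numeric, bi_gaussian_area(1.0, 2.0, 3.0))
```

With σ1 = σ2 the model reduces to an ordinary Gaussian peak of area δσ.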
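Quantities used in peak location estimation:
$$A(\tau) = \log\left[\int_{-\infty}^{\tau} g(t)\,dt\right] - \log\left[\int_{\tau}^{\infty} g(t)\,dt\right]$$
$$B(\tau) = \frac{1}{3}\log\left(\int_{-\infty}^{\tau} g(t)\,(t-\tau)^2\,dt\right) - \frac{1}{3}\log\left(\int_{\tau}^{\infty} g(t)\,(t-\tau)^2\,dt\right)$$
At the summit,
$$A(\alpha) = \log\left(\frac{\delta\sigma_1}{2}\right) - \log\left(\frac{\delta\sigma_2}{2}\right) = \frac{1}{3}\log\left(\frac{\delta\sigma_1^3}{2}\right) - \frac{1}{3}\log\left(\frac{\delta\sigma_2^3}{2}\right) = B(\alpha)$$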
Features sharing m/z value?
Smoothing (sensible bandwidth?).
Find peaks from the smoother.
Features sharing m/z value?
A modified EM algorithm that allows for missing intensities.
Rather than samples, we observe point estimates of density, with
some points missing.
Assumption: the intensity observations are missing at random.
(???!!!)
Features sharing m/z value – an EM-like algorithm
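$$q_{ij} = \frac{z_{ij}}{\sum_k z_{ik}}, \quad \forall i, j$$
Remove component j if $Q_j$ < a threshold, where
$$Q_j = \frac{\sum_i z_{ij}}{\sum_i \sum_k z_{ik}}$$
$$\hat{A}(\tau) = \log\left[\sum_{t_i < \tau} x_i \Delta t_i\right] - \log\left[\sum_{t_i \ge \tau} x_i \Delta t_i\right]$$
$$\hat{B}(\tau) = \frac{1}{3}\log\left[\sum_{t_i < \tau} x_i (t_i - \tau)^2 \Delta t_i\right] - \frac{1}{3}\log\left[\sum_{t_i \ge \tau} x_i (t_i - \tau)^2 \Delta t_i\right]$$
$$\hat{\alpha} = \arg\min_{\tau} \left| \hat{A}(\tau) - \hat{B}(\tau) \right|$$
$$\sigma_1 = \sqrt{\frac{\sum_{t_i < \alpha} (t_i - \alpha)^2 x_i \Delta t_i}{\sum_{t_i < \alpha} x_i \Delta t_i}}, \qquad \sigma_2 = \sqrt{\frac{\sum_{t_i \ge \alpha} (t_i - \alpha)^2 x_i \Delta t_i}{\sum_{t_i \ge \alpha} x_i \Delta t_i}}$$
$$\delta = \exp\left( \frac{\sum_i z_i^2 \log(x_i / z_i)}{\sum_i z_i^2} \right)$$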
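A minimal single-component sketch of the estimators above: locate the summit where the discrete A and B agree, then estimate σ1, σ2 and δ. Two points are assumptions of this sketch rather than statements from the slide: z_i is read as the fitted curve at t_i with unit scale (δ = 1), and Δt_i is taken as the spacing to the next observation. The allocation among several components (q_ij, Q_j) is not shown.

```python
import math

def unit_bi_gaussian(t, alpha, s1, s2):
    """Bi-Gaussian curve with delta = 1."""
    s = s1 if t < alpha else s2
    return math.exp(-((t - alpha) ** 2) / (2.0 * s * s)) / math.sqrt(2.0 * math.pi)

def fit_bi_gaussian(t, x):
    """Estimate (alpha, sigma1, sigma2, delta) of one peak from sorted points (t_i, x_i)."""
    n = len(t)
    # Assumed Delta t_i: spacing to the next point; the last point reuses the previous one.
    dt = [t[i + 1] - t[i] for i in range(n - 1)]
    dt.append(dt[-1])

    best = None
    for tau in t:  # candidate summit locations
        l0 = sum(x[i] * dt[i] for i in range(n) if t[i] < tau)
        r0 = sum(x[i] * dt[i] for i in range(n) if t[i] >= tau)
        l2 = sum(x[i] * (t[i] - tau) ** 2 * dt[i] for i in range(n) if t[i] < tau)
        r2 = sum(x[i] * (t[i] - tau) ** 2 * dt[i] for i in range(n) if t[i] >= tau)
        if min(l0, r0, l2, r2) <= 0.0:
            continue  # the logarithms are undefined at the edges of the trace
        a_hat = math.log(l0) - math.log(r0)
        b_hat = (math.log(l2) - math.log(r2)) / 3.0
        gap = abs(a_hat - b_hat)
        if best is None or gap < best[0]:
            best = (gap, tau, l0, r0, l2, r2)
    if best is None:
        raise ValueError("not enough positive intensities to locate a summit")

    _, alpha, l0, r0, l2, r2 = best
    sigma1 = math.sqrt(l2 / l0)
    sigma2 = math.sqrt(r2 / r0)

    # Scale: weighted average of log(x_i / z_i), with weights z_i squared.
    num = den = 0.0
    for ti, xi in zip(t, x):
        zi = unit_bi_gaussian(ti, alpha, sigma1, sigma2)
        if xi > 0.0 and zi > 0.0:
            num += zi * zi * math.log(xi / zi)
            den += zi * zi
    delta = math.exp(num / den)
    return alpha, sigma1, sigma2, delta

# Noise-free synthetic peak with alpha = 8, sigma1 = 1, sigma2 = 2, delta = 5.
ts = [0.04 * i for i in range(501)]
xs = [5.0 * unit_bi_gaussian(ti, 8.0, 1.0, 2.0) for ti in ts]
print(fit_bi_gaussian(ts, xs))
```

Because the sums split the data at t_i < τ versus t_i ≥ τ, the estimated summit carries a small bias that shrinks with the sampling interval.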
Uncertainty of the number of components?
Select a set of smoother window sizes.
Using each window size, run the smoother and the EM-like algorithm to fit
the data, and find the corresponding BIC value:
$$N \times \log\left[ \frac{1}{N} \sum_i \Big( x_i - \sum_j z_{ij} \Big)^2 \right] + 4 \times J \times \log(N)$$
Choose the result with the minimum BIC value.
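A minimal sketch of the selection step. It reads N as the number of observed points, J as the number of fitted components (four parameters each), and z_ij as the fitted value of component j at point i. These readings, and the function names, are assumptions.

```python
import math

def bic(x, fitted_components):
    """BIC of a fit; fitted_components[j][i] is the fitted value z_ij of component j at point i."""
    n = len(x)
    j = len(fitted_components)
    rss = sum((x[i] - sum(comp[i] for comp in fitted_components)) ** 2 for i in range(n))
    # A perfect fit (rss == 0) would make the logarithm undefined.
    return n * math.log(rss / n) + 4 * j * math.log(n)

def choose_fit(x, candidate_fits):
    """Pick the candidate (one per smoother window size) with the smallest BIC."""
    return min(candidate_fits, key=lambda comps: bic(x, comps))

# Toy data: a two-component fit matches the trace far better than a one-component fit.
x = [1.0, 4.0, 1.0, 1.0, 4.0, 1.0]
one = [[2.0] * 6]
two = [[1.0, 3.9, 1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0, 4.1, 1.0]]
print(bic(x, one), bic(x, two))
```

The penalty term 4 × J × log(N) grows with the number of components, so an extra component is kept only if it reduces the residual sum of squares enough to pay for its four parameters.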
On real data
An example of the overall strategy in LC/MS metabolomics
Anal Chem. 2006 Feb 1;78(3):779-87.
How much information is lost from the original
data?
Difficulty – size of the data.
Example: a piece of proteomics MS1 data
Beyond LC/MS-MS
In a complex biological sample (cell, tissue, serum, … ), there
are several thousand proteins – tens of thousands of peptides
after digestion; signal from less-abundant species may be
suppressed.
Solution:
Must reduce complexity to identify and quantify proteins.
Incorporate biochemical separation techniques:
Separate proteins in multiple dimensions. Sacrifice speed.
LC-MS/MS
LC/LC-MS/MS
2D gel-MS/MS
2D gel/LC-MS/MS
……
Analyze a subset of proteins. Sacrifice coverage.
Affinity column separation – LC-MS/MS
Beyond LC/MS-MS
Nature. 452:571.
Fig. 1
Right: LC/LC/LC-MS/MS