Introduction to high-throughput analysis of proteins and metabolites by Mass Spectrometry

- The basic principle
- Brief introduction of techniques
- Computational issues

Background

High-throughput profiling of biological samples

[Figure labels: "Red line: central dogma"; "Blue line: interaction"; "Metabolites"]
(Picture edited from http://www.ncbi.nlm.nih.gov/Class/MLACourse/Modules/MolBioReview/)

- DNA: genotype, copy number, epigenetics ...
- RNA: expression levels, alternative splicing, microRNA ...
- Protein: concentration, modification, interaction ...
- Metabolite: concentration, modification, interaction ...

Why Mass Spectrometry

The question: a biological system contains tens of thousands of species of proteins and metabolites. How can we identify and quantify them from a sample?

[Figure: measured level (axis 0 to 14) in control samples 1-5 and disease samples 1-5]

- Which protein is this?
- Does it change significantly between control and disease samples?

Background

In a complex network, even if we know the entire structure, the network behavior is hard to predict. Direct profiling gives us snapshots of the status of the system.
(Picture from KEGG PATHWAY)

Why Mass Spectrometry

Proteins/metabolites can be separated according to their properties:
- mass/size
- hydrophilicity/hydrophobicity
- binding to specific ligands
- charge
- ...

using:
- Chromatography
- Electrophoresis
- ...

(http://en.wikibooks.org/)

Why Mass Spectrometry

Problems with these separation techniques:
- Reproducibility
- Identification / quantification
- Inability to separate tens of thousands of species

Mass spectrometry:
- Highly accurate, highly reproducible measurements
- Theoretical values are easy to obtain, which enables identification
- Can study protein modifications (small ligands attached)
- Measurements are based on the mass/charge ratio (m/z)

Mass Spectrometry: getting ions from solution to gas phase

- Matrix-assisted laser desorption ionization (MALDI)
- Electrospray ionization (ESI)

(Picture provided by Prof. Junmin Peng (St. Jude))

Mass Spectrometry: finding m/z

Time-of-flight: when a charged particle is placed in an electric field, its time of flight is

$$t = k\sqrt{m/z}$$

where k is a constant related to instrument characteristics.

Mass Spectrometry: finding m/z

Quadrupole: a radio-frequency voltage is applied to opposing pairs of poles. Only ions with a specific m/z can pass to the detector at each frequency.

Mass Spectrometry: finding m/z

Fourier transform MS:
- Ions are detected not by hitting a detector, but by passing by a detecting plate.
- Ions are detected simultaneously.
- Very high resolution.
- m/z is determined from the frequency of the ion in the cyclotron: $f \propto z/m$.

Why is simple MS not enough

- A biological sample consists of tens of thousands of species of molecules.
- The resolution is not enough for clear separation.
- Biological interactions between the molecules may interfere with ionization.

The solution: multi-dimensional separation, combining MS with
- protein breakage by enzymatic digestion and collision decomposition
- electrophoresis
- chromatography

Tandem Mass Spec (MS/MS) for protein identification

(Picture provided by Prof. Junmin Peng (St. Jude))

2D gel MS/MS

[Figure labels: control samples; treatment samples; differential spots; in-gel digestion; MS/MS protein identification]

Workflow: 2D gel, then differential protein finding, then in-gel digestion, then MS/MS protein identification.
(Int J Biol Sci 2007; 3:27-39)

LC/MS

[Figure axes: liquid chromatography retention time; mass-to-charge ratio (m/z)]

Take “slices” in retention time and send each slice to MS.

LC/MS-MS

(Picture provided by Prof. Junmin Peng (Emory))

LC/MS-MS

Here is an example of an LC/MS spectrum. The second MS serves the purpose of protein identification. Matching the sequence found by the second MS falls into the realm of sequence comparison and database search. Peak quantification is done by the first MS.

[Figure panels: (a) original spectrum; (b) square root-transformed spectrum to show smaller peaks; (c) a portion of the spectrum showing details]
Between proteomics and metabolomics

- Proteomics uses LC/MS-MS. The second MS is for protein identification.
- Metabolomics uses LC/MS. Sometimes a second MS is used, but data interpretation for metabolite identification is much harder.

What concerns statisticians:
(1) The shared LC/MS part:
    - In metabolomics: quantification, identification
    - In proteomics: quantification
(2) The second MS:
    - Protein identification: sequence modeling/comparison
    - Protein quantification: merging values from different peptides from the same protein

Some computational issues in LC/MS-MS

- Modeling peaks
- Noise reduction and peak detection
- Multiple peaks from one molecule, caused by (1) isotopes and (2) multiple charge states
- Retention time correction
- Peak alignment
- Peak quantification, especially with overlapping peaks caused by m/z sharing (mostly in metabolomics)
- From peptides to proteins

General workflow for LC/MS

Modeling peaks

In high-resolution LC/MS data, every peak is a thin slice, so there is no need to model the MS dimension.
Modeling the LC dimension is important for quantification. Models developed for traditional LC data can be applied here.
Most empirical peak shape models were derived from the Gaussian model, with changes made to account for asymmetry in the peak shape.

Modeling peaks

Asymmetric peak. “Asymmetry factor”: b/a, measured at 0.1h (10% of the peak height h).
(Data Analysis and Signal Processing in Chromatography. A. Felinger)

Modeling peaks

The bi-Gaussian model: [equation not recovered]
The area under the peak: [equation not recovered]
(Data Analysis and Signal Processing in Chromatography. A. Felinger)

Modeling peaks

Generalized exponential function: [equation not recovered]
(Data Analysis and Signal Processing in Chromatography. A. Felinger)

Modeling peaks

Log-normal function: [equation not recovered]
(Data Analysis and Signal Processing in Chromatography. A. Felinger)

Noise reduction

Reviewed by Katajamaa & Oresic (2007) J Chr. A 1158:318

Noise reduction

Signal-to-noise (S/N) ratio. Where should the cut be made? Should the cutoff be a straight line or a smoothed curve?
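For the bi-Gaussian peak model named in the peak-modeling slides above, the following is a minimal sketch, not the formulation from Felinger's book. It assumes a parameterization with summit location `alpha`, left and right standard deviations `sigma1` and `sigma2`, and a scale `delta`; all function names are illustrative. Under that assumption each half of the peak contributes `delta * sigma / 2` to the area, and the asymmetry factor b/a equals `sigma2 / sigma1` at any fixed fraction of the peak height.

```python
import math


def bi_gaussian(t, alpha, sigma1, sigma2, delta):
    """Peak height at time t: sigma1 applies left of the summit, sigma2 right of it."""
    sigma = sigma1 if t < alpha else sigma2
    return delta / math.sqrt(2.0 * math.pi) * math.exp(-((t - alpha) ** 2) / (2.0 * sigma ** 2))


def bi_gaussian_area(sigma1, sigma2, delta):
    """Closed-form area: two half-Gaussians contributing delta * sigma / 2 each."""
    return delta * (sigma1 + sigma2) / 2.0


def asymmetry_factor(sigma1, sigma2):
    """b/a at a fixed fraction of the peak height; both half-widths scale with sigma."""
    return sigma2 / sigma1


def numeric_area(alpha, sigma1, sigma2, delta, lo, hi, steps=20000):
    """Trapezoid-rule integral of the peak on [lo, hi], used to check the closed form."""
    h = (hi - lo) / steps
    total = 0.5 * (bi_gaussian(lo, alpha, sigma1, sigma2, delta)
                   + bi_gaussian(hi, alpha, sigma1, sigma2, delta))
    for i in range(1, steps):
        total += bi_gaussian(lo + i * h, alpha, sigma1, sigma2, delta)
    return total * h


# Hypothetical peak: summit at 10, tailing to the right (sigma2 > sigma1).
closed = bi_gaussian_area(1.0, 2.0, 100.0)
approx = numeric_area(10.0, 1.0, 2.0, 100.0, lo=-10.0, hi=40.0)
print(closed, approx)
```

The numerical integral is there only as a sanity check on the closed form; in practice the closed form is what makes this model convenient for quantification.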
(http://www.appliedbiomics.com/Service/Promotions/promotions.html)

Noise reduction & peak detection

Using filters to separate peaks from noise, in conjunction with a hard cutoff.
(Anal Chem. 2006 Feb 1;78(3):779-87.)

Retention time correction

With every run, the LC dimension data has some fluctuation.
Identify “reliable” peaks in both samples, then use non-linear curve fitting to adjust the retention time.
(Anal Chem. 2006 Feb 1;78(3):779-87.)

Multiple peaks from one molecule

Caused by multiple charge states (z = 1, 2, 3, ...) and by different numbers of carbon isotopes present in the molecule.

Example: m = 1000 (all C12)

    Charge state     m/z values
    single charge    1000     1001     1002    1003
    2 charges        500      500.5    501     501.5
    3 charges        333.33   333.67   334

Multiple peaks from one molecule

Peak alignment

Reviewed by Katajamaa & Oresic (2007) J Chr. A 1158:318

Peak alignment

Dynamic programming.
(BMC Bioinformatics 2007, 8:419)

Peak alignment

First align the m/z dimension by binning. Use kernel density estimation to find “meta-peaks”.
(Anal Chem. 2006 Feb 1;78(3):779-87.)

Dealing with overlapping peaks

(1) Matched filter.
(2) Some traditional methods.
(Data Analysis and Signal Processing in Chromatography. A. Felinger)

Dealing with overlapping peaks

(3) Statistical modeling using the EM algorithm.

[Figure panels: Bi-Gaussian mixture; Gaussian mixture]

Modeling asymmetric peaks

Bi-Gaussian model:

$$
g(t) =
\begin{cases}
\dfrac{\delta}{\sqrt{2\pi}}\, e^{-\frac{(t-\alpha)^2}{2\sigma_1^2}}, & t < \alpha \\[6pt]
\dfrac{\delta}{\sqrt{2\pi}}\, e^{-\frac{(t-\alpha)^2}{2\sigma_2^2}}, & t \ge \alpha
\end{cases}
$$

Quantities used in peak location estimation:

$$A(t) = \log\left[\int_{-\infty}^{t} g(\tau)\,d\tau\right] - \log\left[\int_{t}^{\infty} g(\tau)\,d\tau\right]$$

$$B(t) = \frac{1}{3}\log\left[\int_{-\infty}^{t} g(\tau)\,(\tau - t)^2\,d\tau\right] - \frac{1}{3}\log\left[\int_{t}^{\infty} g(\tau)\,(\tau - t)^2\,d\tau\right]$$

At the summit,

$$A(\alpha) = \log\left(\frac{\delta\sigma_1}{2}\right) - \log\left(\frac{\delta\sigma_2}{2}\right) = \frac{1}{3}\log\left(\frac{\delta\sigma_1^3}{2}\right) - \frac{1}{3}\log\left(\frac{\delta\sigma_2^3}{2}\right) = B(\alpha)$$

Features sharing m/z value?

Smoothing (what is a sensible bandwidth?), then find peaks from the smoother.

Features sharing m/z value?

A modified EM algorithm that allows for missing intensities.
Rather than samples, we observe point estimates of the density, with some points missing.
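The summit condition A(α) = B(α) from the asymmetric-peak slide above suggests a simple grid search for a single, noise-free peak. The sketch below is illustrative only and is not the deck's EM-like algorithm. It assumes the integrals are replaced by sums over observed points (t_i, x_i) with cell widths Δt_i, that candidate summits are taken at midpoints between consecutive scans (my choice, to keep the left and right sums balanced), and that σ1 and σ2 follow from the ratio of the second-moment sum to the zeroth-moment sum on each side. The function names are hypothetical.

```python
import math


def bi_gaussian(t, alpha, sigma1, sigma2, delta):
    """Bi-Gaussian peak, used here only to simulate a noise-free test trace."""
    sigma = sigma1 if t < alpha else sigma2
    return delta / math.sqrt(2.0 * math.pi) * math.exp(-((t - alpha) ** 2) / (2.0 * sigma ** 2))


def cell_widths(times):
    """Delta t_i: half the distance between the neighbours of t_i (one full gap at the ends)."""
    n = len(times)
    widths = []
    for i in range(n):
        if i == 0:
            widths.append(times[1] - times[0])
        elif i == n - 1:
            widths.append(times[-1] - times[-2])
        else:
            widths.append((times[i + 1] - times[i - 1]) / 2.0)
    return widths


def fit_single_peak(times, intensities):
    """Return (alpha, sigma1, sigma2) by minimising |A(t) - B(t)| over candidate summits."""
    dts = cell_widths(times)
    best = None  # (|A - B|, t, left0, right0, left2, right2)
    for k in range(1, len(times)):
        t = 0.5 * (times[k - 1] + times[k])  # candidate summit between two scans
        left0 = right0 = left2 = right2 = 0.0
        for i in range(len(times)):
            mass = intensities[i] * dts[i]
            moment = mass * (times[i] - t) ** 2
            if i < k:  # points with t_i < t
                left0 += mass
                left2 += moment
            else:      # points with t_i >= t
                right0 += mass
                right2 += moment
        if min(left0, right0, left2, right2) <= 0.0:
            continue  # a logarithm would be undefined
        a = math.log(left0) - math.log(right0)
        b = (math.log(left2) - math.log(right2)) / 3.0
        if best is None or abs(a - b) < best[0]:
            best = (abs(a - b), t, left0, right0, left2, right2)
    if best is None:
        raise ValueError("no candidate summit with positive intensity on both sides")
    _, alpha, left0, right0, left2, right2 = best
    return alpha, math.sqrt(left2 / left0), math.sqrt(right2 / right0)


# Simulated peak with hypothetical parameters: alpha = 10, sigma1 = 1, sigma2 = 2.
times = [0.05 * i for i in range(501)]
trace = [bi_gaussian(t, 10.0, 1.0, 2.0, 100.0) for t in times]
alpha_hat, sigma1_hat, sigma2_hat = fit_single_peak(times, trace)
print(alpha_hat, sigma1_hat, sigma2_hat)
```

On real data the trace is noisy and peaks overlap, which is why the deck moves on to a mixture model; this sketch only shows that the A(t) and B(t) quantities are enough to locate one clean asymmetric peak.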
Assumption: the intensity observations are missing at random. (This assumption is questionable in practice.)

Features sharing m/z value: an EM-like algorithm

Here x_i is the intensity observed at retention time t_i, and Δt_i is the spacing around t_i.

$$q_{ij} = \frac{z_{ij}}{\sum_k z_{ik}}, \quad \forall\, i, j$$

$$Q_j = \frac{\sum_i z_{ij}}{\sum_i \sum_k z_{ik}}$$

Remove component j if Q_j < a threshold.

$$\hat{A}(t) = \log\left[\sum_{t_i < t} x_i\,\Delta t_i\right] - \log\left[\sum_{t_i \ge t} x_i\,\Delta t_i\right]$$

$$\hat{B}(t) = \frac{1}{3}\log\left[\sum_{t_i < t} x_i\,(t_i - t)^2\,\Delta t_i\right] - \frac{1}{3}\log\left[\sum_{t_i \ge t} x_i\,(t_i - t)^2\,\Delta t_i\right]$$

$$\hat{\alpha} = \arg\min_t \left|\hat{A}(t) - \hat{B}(t)\right|$$

$$\sigma_1 = \sqrt{\frac{\sum_{t_i < \alpha} (t_i - \alpha)^2\, x_i\,\Delta t_i}{\sum_{t_i < \alpha} x_i\,\Delta t_i}}, \qquad
\sigma_2 = \sqrt{\frac{\sum_{t_i \ge \alpha} (t_i - \alpha)^2\, x_i\,\Delta t_i}{\sum_{t_i \ge \alpha} x_i\,\Delta t_i}}$$

$$\delta = \exp\left(\frac{\sum_i z_i^2 \times \log(x_i / z_i)}{\sum_i z_i^2}\right)$$

Uncertainty of the number of components?

- Select a set of smoother window sizes.
- For each window size, run the smoother and the EM-like algorithm to fit the data, and find the corresponding BIC value:

$$N \times \log\left[\frac{\sum_i \left(x_i - \sum_j z_{ij}\right)^2}{N}\right] + 4 \times J \times \log(N)$$

- Choose the result with the minimum BIC value.

On real data

An example of the overall strategy in LC/MS metabolomics
(Anal Chem. 2006 Feb 1;78(3):779-87.)

How much information is lost from the original data?

Difficulty: the size of the data.
Example: a piece of proteomics MS1 data

Beyond LC/MS-MS

In a complex biological sample (cell, tissue, serum, ...), there are several thousand proteins, which become tens of thousands of peptides after digestion; signal from less-abundant species may be suppressed.

Solution: complexity must be reduced in order to identify and quantify proteins. Incorporate biochemical separation techniques:
- Separate proteins in multiple dimensions (sacrifices speed): LC-MS/MS, LC/LC-MS/MS, ..., 2D gel-MS/MS, 2D gel/LC-MS/MS
- Analyze a subset of proteins (sacrifices coverage): affinity column separation followed by LC-MS/MS

Beyond LC/MS-MS

Nature 452:571, Fig. 1. Right: LC/LC/LC-MS/MS
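Returning to the model-selection step described earlier in this part (several smoother window sizes, each fit scored by BIC), the criterion can be sketched as follows. This is a minimal illustration under stated assumptions: each component is passed as a list of fitted values z_ij at the N observed points, the factor 4 counts four parameters per component, and the window sizes and toy numbers below are hypothetical.

```python
import math


def bic(observed, components):
    """N * log(RSS / N) + 4 * J * log(N), with J components and N observed points."""
    n = len(observed)
    fitted = [sum(comp[i] for comp in components) for i in range(n)]
    rss = sum((x - f) ** 2 for x, f in zip(observed, fitted))
    if rss == 0.0:
        return -math.inf  # a perfect fit makes the log term unbounded below
    return n * math.log(rss / n) + 4 * len(components) * math.log(n)


def select_by_bic(observed, fits):
    """Return the key (for example a window size) whose fit has the smallest BIC."""
    return min(fits, key=lambda window: bic(observed, fits[window]))


# Toy data: both fits leave the same residual, so the one-component fit should win.
observed = [1.0, 2.0, 3.0, 4.0]
fits = {
    3: [[1.0, 2.0, 3.0, 3.0]],                          # J = 1
    5: [[0.5, 1.0, 1.5, 1.5], [0.5, 1.0, 1.5, 1.5]],    # J = 2
}
best_window = select_by_bic(observed, fits)
print(best_window, bic(observed, fits[3]), bic(observed, fits[5]))
```

The penalty term grows with the number of components J, so an extra component is kept only when it reduces the residual sum of squares enough to pay for its four parameters.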