Proteomic informatics
Lawrence Hunter, Ph.D.
Director, Computational Bioscience Program
University of Colorado School of Medicine
Larry.Hunter@uchsc.edu
http://compbio.uchsc.edu/Hunter

Proteomics
• The characterization of the complete complement of proteins in a sample
– Which proteins are present?
– In what isoforms (e.g. cleavage products)?
– With what covalent post-translational modifications?
– In what concentrations (quantification)?
• Today, largely addressed via mass spectrometry
[NB: many slides in this lecture are from Speed & Schutz, http://www.ima.umn.edu/talks/workshops/9-29-10-3.2003/speed/speed.ppt ]

Mass Spectrometry
• Many different types of instruments (e.g. MALDI/TOF)
– Produce lists of specific masses (or mass/charge ratios) with “intensities”
– Can dynamically manipulate samples, e.g. by breaking certain bonds.
• Long used in molecular studies
– Mostly of small molecules
– Identification of purified proteins
• Now applied to complex mixtures of proteins

What is a Mass Spectrometer?
“An analytical device that determines the molecular weight of chemical compounds by separating molecular ions according to their mass-to-charge ratio (m/z)”
[Figure: ionisation, then separation by m/z, then detection. Example sample: molecular weight 600 Da (abundance 50 %), 400 Da (abundance 20 %), 300 Da (abundance 30 %); the spectrum shows peaks at m/z 301, 401 and 601.]

Many different instruments
• Different MS instruments are sensitive (or not) to different types of ions
From Wysocki, et al., 2005

Use of Proteomic Mass Spectra
Two broad approaches:
1. Direct use: discrimination among tissue states based on raw (unanalyzed) spectra
2. Analysis: protein identification, characterization and quantification from mass spectra

Predictive models based on raw spectra (or peak lists)
• Typical discrimination problem
• At least one impressive success: ovarian cancers detected from peripheral blood samples
• But…
– Major challenges in reproducibility (lots of reasons a peak might appear incorrectly to be correlated)
– Black box: no identifications or interpretations
– Big challenges in modeling: why are these peaks discriminative?

General challenges
• The curse of dimensionality
– Many possibly interacting factors. Number of interactions goes up with factorial of the number of peaks.
– Kernel methods escape the curse by requiring a similarity function (which must take interactions into account)
• Statistical challenges
– Too many peaks, and not enough samples
– Uncontrolled variability, observational (post hoc) studies

Mass Spec for protein identification
[Figure: 2D gel]
• Start with a purified protein sample.
– Mass of an intact protein alone is often ambiguous
– Cleavage of a protein at known sites (e.g. by trypsin, which cleaves after K/R residues) generates a set of peptide fragments
– “Protein fingerprinting” – a particular set of fragment masses often identifies the protein

Fingerprinting
• Start from a set of all possible proteins in an organism (called “the database”)
• Theoretical digest of each protein to produce a collection of fragments
– E.g. SwissProt’s PeptideCutter http://us.expasy.org/tools/peptidecutter/
• Compare predicted fragments for each protein versus observed masses
– Some fragments missing, others unexpected (e.g. PTMs)
– Ad hoc scoring or probabilistic calculations.
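
A minimal sketch of the theoretical digest step, not PeptideCutter's implementation. It assumes the common simplified trypsin rule (cleave after K or R unless the next residue is P), no missed cleavages, no modifications, and singly protonated peptides. The function names `tryptic_digest`, `mh_plus` and `fingerprint` are made up for this illustration; the residue masses are standard monoisotopic values, and the test sequence is the example protein used in the peptide mass fingerprinting slide of this lecture.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # H2O added to the summed residues of a free peptide
PROTON = 1.00728   # one proton for a singly charged [M+H]+ ion


def tryptic_digest(sequence):
    """Split after K or R, except when the next residue is P (assumed rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def mh_plus(peptide):
    """Monoisotopic m/z of the singly protonated peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER + PROTON


def fingerprint(sequence):
    """Sorted list of theoretical [M+H]+ values for a protein's tryptic peptides."""
    return sorted(mh_plus(p) for p in tryptic_digest(sequence))


protein = "MALKCGIRGGSRPFLRATSKASRSDD"
print(tryptic_digest(protein))
# ['MALK', 'CGIR', 'GGSRPFLR', 'ATSK', 'ASR', 'SDD']
print([int(m) for m in fingerprint(protein)])  # truncated to nominal m/z
# [333, 336, 406, 448, 462, 889]
```

Truncating to integers is only for comparison with nominal peak labels; a real comparison would use a mass tolerance suited to the instrument.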
MS for protein identification
Peptide mass fingerprinting
[Figure: workflow 2D-GEL, EXCISE, DIGEST, MS; labels: Sample, Proteins; spectrum axis: m/z]
Example: peaks at m/z 333, 336, 406, 448, 462, 889
The only protein in the database that would produce these peaks is MALK|CGIR|GGSRPFLR|ATSK|ASR|SDD
• The exact protein needs to be in the database
• Works only with single protein fragmentations

Challenges
• Which proteins does a genome specify?
– Gene/coding region prediction from genomic data
– Splice variants
– Coding polymorphisms
• Some peptides are in multiple proteins
• Some peptides are missing in the data
– Post-translational modifications change mass
– Stochastic loss of peptides
• PTMs can cause false positive matches
– Combinations of PTMs make this problem much worse

Issues with tryptic fragments
• Genome level
– Coding sequence identification problem generally
– Families of similar genes share fragments
• Transcription/translation
– Alternative splicing leads to multiple products sharing fragments
• Post-translational modifications
– Covalent modifications add mass
– Cleavages reduce mass

The simple matching task
• Given a list of tryptic masses, what is the most likely protein to have generated it?
– Compare the observed peptides with the ones that would be expected if the protein were present.
• Challenges:
– Fragments from true proteins will be missing due to chance or modifications
– Fragments will spuriously match proteins that are not present, due to shared fragments or modifications

Existing approaches
• Widely used programs:
– Mascot (http://www.matrixscience.com)
– Sequest (http://fields.scripps.edu/sequest)
• Good at finding matches to unmodified peptides from a database of sequences
• Not so good at
– Unambiguously identifying a protein (get a list of hits, first not always correct)
– Identifying post-translational modification(s), especially multiple ones

How they work
• Score each protein in the database against the set of peptides from the assay
• Score functions
– Some differences between Sequest/Mascot, but...
– Positive score for matching fragments, penalty for missing
– Penalty for size of matching protein. (Longer proteins have more random matches)
– Treat all proteins as independent
• Random selection in case of ties (not generally consistent)
• Reports each protein in family as separate

PTMs
• Each PTM has a characteristic mass change (e.g. ~80 Da for phosphate)
• More than 30 biologically significant PTMs
– (de)protonation changes mass by 1 Da
• Even just ±2 phosphates causes large increase in false positive matches

Possible approaches to PTMs
• Post-processing:
– Look only for PTM'd fragments of proteins already identified by unmodified fragments
• Bayesian: Define a prior probability for each possible combination of PTMs
– Global score function for each PTM combination
– Per-protein predictions of likelihood of particular PTMs.

Approaches to Protein family issues
• Current programs either report all ties or randomly select one.
– Even post-processing to group family members in reports would be better
• Better still
– Link family members in the identification process
– Represent and report ambiguities
• How to handle probabilities for groups of related proteins?
– Multiple family members can be present in a sample...
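
A toy sketch of the scoring ideas above, not the actual Mascot or Sequest score functions. It assumes a score of the form (matched fragments) minus a penalty per missing fragment minus a penalty proportional to the number of theoretical fragments, and it reports tied best hits together instead of picking one at random. All names, penalty weights, masses and the three-entry database are hypothetical; the two `famA` entries stand for family members whose tryptic masses happen to be identical (for example, differing only by an I/L substitution).

```python
import math


def score_protein(observed, theoretical, tol=0.5, miss_penalty=0.5, size_penalty=0.1):
    """Reward matched theoretical fragments; penalise missing ones and protein size."""
    matched = sum(
        any(abs(obs - theo) <= tol for obs in observed) for theo in theoretical
    )
    missing = len(theoretical) - matched
    return matched - miss_penalty * missing - size_penalty * len(theoretical)


def rank_proteins(observed, database, **weights):
    """Return (names tied for the best score, all scores)."""
    scores = {
        name: score_protein(observed, masses, **weights)
        for name, masses in database.items()
    }
    best = max(scores.values())
    ties = sorted(n for n, s in scores.items() if math.isclose(s, best))
    return ties, scores


# Hypothetical database: protein name -> theoretical tryptic [M+H]+ masses.
database = {
    "famA_1": [333.19, 406.23, 448.23, 462.27],
    "famA_2": [333.19, 406.23, 448.23, 462.27],
    "other": [333.19, 500.30, 610.35, 720.40, 830.45, 940.50],
}
observed = [333.2, 406.2, 448.2, 462.3]

ties, scores = rank_proteins(observed, database)
print(ties)  # both family members are reported as one ambiguous group
```

Returning the whole tie group keeps the ambiguity visible in the report, which is the post-processing improvement suggested for protein families.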
Going high throughput
• Recently, various successful approaches to identifying a complex mixture of proteins.
• MudPIT
– Separate mixture into fractions (e.g. by liquid chromatography or strong cation exchange)
– Identify many proteins in the mixture simultaneously using fingerprinting
• Tandem (MS/MS)
– Use CID to fragment peptides, and send them to a second MS. Pattern of fragments allows ...

MudPIT Approaches
• Approach as multiple purified mass spec runs
• How to tell how many proteins?
– Just allow multiple high probability matches...
• How to “assign” a peptide to a protein?
– Can one peptide match more than one protein?
– Related to the protein family problem...
– Use quantitative information?
• Complex instrument & experimental design

Fractionation
• Take a complex mixture and divide it into smaller sets of proteins with known qualities
– SCX: separate by charge
– LC: separate by hydrophobicity
• Can make restricted databases based on proteins that could have been in a particular fraction
• Test to make sure that identified proteins are compatible with fraction

Tandem MS (MS/MS)
To gain structural information about the detected masses:
[Figure: ... one product is selected, collision with a gas, second MS separation & detection; spectrum axis: m/z]
– different molecules of the same substance can split in different ways.
– in each molecule, only the pieces that retain one of the charges will be observed and present in the spectrum; the others are discarded.
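
A rough sketch of which fragment masses the second MS stage could observe for one peptide. It assumes a single backbone break per molecule and a single proton on the observed piece, using the conventional b ion (N-terminal piece) and y ion (C-terminal piece) naming. The function name `fragment_ions` is invented for this example, and the peptide VFGLLDEDK is illustrative only; residue masses are standard monoisotopic values.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # the C-terminal piece keeps the peptide's extra H2O
PROTON = 1.00728   # charge carrier on the observed piece


def fragment_ions(peptide):
    """Singly charged b and y ion m/z values for every backbone break."""
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    n = len(masses)
    ions = {}
    for i in range(1, n):
        ions[f"b{i}"] = sum(masses[:i]) + PROTON               # first i residues
        ions[f"y{i}"] = sum(masses[n - i:]) + WATER + PROTON   # last i residues
    return ions


ions = fragment_ions("VFGLLDEDK")
for name in ("b2", "b3", "y2", "y3"):
    print(name, round(ions[name], 2))
```

Which of the two pieces is actually seen depends on where the charge ends up, so a real spectrum typically contains only a subset of these ions, with varying intensities.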
MS/MS advantages
• Sequences make identification more reliable (although still depends on database to restrict search)
• Able to directly detect certain kinds of PTMs
• With some instruments, direct de-novo sequencing becoming possible (without database)
• Still new, many potential improvements to informatic approaches

Example MS/MS spectrum
[Figure: MS/MS spectrum of the tryptic fragment Val-Phe-Gly-Lxx-Lxx-Asp-Glu-Asp-Lys with the b ion series (b2 to b8) and y ion series (y2 to y8) marked. Axes: m/z (150 to 950) versus Relative Abundance (0 to 100). Labelled peaks include a2 219.0, b2 247.0, y2 262.1, b3 304.0, y3 391.1, b4 417.2, y4 506.2, b5 530.2, y5 619.2, b6 645.3, y6 732.2, b7 774.4, y7 789.3, b8 889.4, y8 936.4.]

Interpretation of MS/MS data
• SEQUEST:
– generate a predicted spectrum for each potential peptide using a simple fragmentation model (all b and y ions have the same intensity; possible losses from b and y have a lower intensity)
– compute a "cross-correlation" score and find the best-matching peptide
– since this operation is very time-consuming, a simpler preliminary score is used to find the 500 peptides in the database that are most likely to be the correct identification

MS/MS Challenges
• Theoretical spectra do not include intensity information (much, yet)
• Many random matches with low scores
– Poor separability of low scoring real hits from high scoring false ones
• Changes in instruments lead to new theoretical spectra

Quantification
• Would like to know the amount of each protein, not just its identity
• Peak intensity is not directly correlated with amount of original protein
– Theoretical and empirical attempts to map from peak intensity (or area under a peak) to original concentration
• Isotopic ratio approaches: use isotopes (taken up in food, say) to quantify abundance ratio between two samples.
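
A small sketch of the isotopic ratio idea, under stated assumptions: each peptide of a protein is seen as a pair of peaks, one from the unlabelled ("light") sample and one from the isotope-labelled ("heavy") sample, and the protein-level ratio is summarised as the median of the per-peptide ratios on a log scale. The median is a choice made for this illustration, not something prescribed in the lecture, and the intensities below are made up.

```python
import math
from statistics import median


def log2_ratios(pairs):
    """log2(heavy / light) for each peptide pair with two positive intensities."""
    return [
        math.log2(heavy / light)
        for light, heavy in pairs
        if light > 0 and heavy > 0
    ]


def protein_ratio(pairs):
    """Median heavy/light abundance ratio across a protein's peptides."""
    logs = log2_ratios(pairs)
    if not logs:
        raise ValueError("no usable light/heavy peak pairs")
    return 2 ** median(logs)


# Hypothetical (light, heavy) peak intensities for three peptides of one protein.
pairs = [(1000.0, 2000.0), (500.0, 1100.0), (800.0, 1500.0)]
print(protein_ratio(pairs))
```

Because both members of a pair are the same peptide measured in the same run, their ratio sidesteps the problem that raw peak intensity does not map directly to concentration.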
Existing software
• Sequest most widely used, but commercial and associated with an instrument manufacturer
• Mascot is the main alternative (has a web service) http://www.matrixscience.com/
• Open source alternatives improving rapidly: X!Hunter and X!Tandem. Organized through The Global Proteome Machine Organization, http://www.thegpm.org