ProReP - Protein Results Parser v3.0©
A Tool for Handling Tandem Mass Spectrometer Protein Database Search Results

Capstone Presentation
Kiran Annaiah (M.S. Bioinformatics)

Advisors
Dr. Randy Arnold
Dr. Haixu Tang

Outline
• Background
• Data generation from a mass spec experiment
• Mascot search engine
• Why parse Mascot results?
• Parser features
• Results
• Conclusions
• Acknowledgements

Background
High-throughput “shotgun” proteomics
• Identify, characterize and quantify all expressed proteins in a mixture simultaneously.
Mass spectrometry
• Peptide mass fingerprinting
• Collision Induced Dissociation (CID) spectra from MS/MS analysis
The LC/MS/MS approach is used to identify protein components in a complex mixture.
Tandem mass spectra help in inferring the amino acid sequences of peptides.

Peptide Mass Fingerprinting vs. MS/MS protein identification
[Figure: James S. Eddes et al., 2002, Proteomics]

Database Searching
[Figure: fragment ion ladder for the peptide LMGSEPIK on an m/z axis, with b1 to b7 ions counted from the NH2 terminus and y1 to y7 ions from the CO2 terminus. The spectra are passed to the database searching software MASCOT®, which searches a database (SwissProt). Results: proteins found, e.g. Hemoglobin, beta chain; Actin.]
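The b- and y-ion ladder in the Database Searching figure can be illustrated with a short calculation. This is only a sketch in Python, not code from ProReP or Mascot. It assumes singly charged ions and monoisotopic residue masses (the slides do not say which mass type is used), and the function name fragment_ions is hypothetical.

```python
# Monoisotopic residue masses (Da) for the residues of the example peptide.
# Only the amino acids in LMGSEPIK are listed; a full table would cover all 20.
RESIDUE_MASS = {
    "L": 113.08406, "M": 131.04049, "G": 57.02146, "S": 87.03203,
    "E": 129.04259, "P": 97.05276, "I": 113.08406, "K": 128.09496,
}
PROTON = 1.00728   # mass of a proton (Da)
WATER = 18.01056   # mass of H2O (Da)


def fragment_ions(peptide):
    """Return singly charged b- and y-ion m/z values, keyed by ion index.

    b_i covers the first i residues (N-terminal side);
    y_i covers the last i residues (C-terminal side) plus water.
    """
    masses = [RESIDUE_MASS[aa] for aa in peptide]
    b_ions, y_ions = {}, {}
    for i in range(1, len(peptide)):
        b_ions[i] = sum(masses[:i]) + PROTON
        y_ions[i] = sum(masses[-i:]) + WATER + PROTON
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("LMGSEPIK")
for i in sorted(b_ions):
    print(f"b{i} {b_ions[i]:9.4f}    y{i} {y_ions[i]:9.4f}")
```

A search engine compares calculated values of this kind against the experimental fragment masses; the scoring itself is not shown here.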
[Figure, continued: schematic protein sequences from the database; the proteins found also include Amylase. Example peptide matches:]

Pept.  Mass     Score  Sequence
1      738.84   41     HLDNLK
2      912.01   61     VHLTDAEK
3      915.06   56     AAVNGLWGK
4      1090.24  41     VINAFNDGLK
5      1122.33  62     VVAGVASALAHK
6      1218.42  70     LVINAFNDGLK
…      …

Mascot Search Engine
Uses mass spectrometry data to identify proteins from primary sequence databases
MS/MS ion search
– Enzyme cleavage rules are applied to sequences in the protein databases
– Experimental mass values are compared with calculated fragment ion mass values
– A scoring algorithm identifies the closest match or matches
– Probability-based MOWSE scoring algorithm
Databases
– MSDB – non-identical protein sequence DB
– NCBInr
– SwissProt
– dbEST – “single-pass” cDNA sequences or ESTs

A Typical Experiment
Analysis of Liver / Brain Tissue
Digest with Trypsin
Liquid Chromatography
LC eluting sample electrosprayed into Mass Spec
MS-MS on intense peak of a parent ion
Raw data converted to a DTA file
Mascot Search
Generates HTML file
[Figure: base peak chromatogram of CH_whole_RG_071503_V06, relative abundance vs. time (min), RT 0.00 - 169.99, NL 4.27E9, + c NSI full MS, m/z 250.00-1500.00]
[Figure: mass spectrum of CH_whole_RG_071503_V06, scan #3291, RT 72.31, AV 1, NL 9.71E8, + c NSI full MS, m/z 250.00-1500.00, relative abundance vs. m/z. Annotation: APAAIGAYSQAVLVDR from 14.5 kDa translational inhibitor protein]

Mascot output – HTML file
(avg. size 5 MB)

Motivation
• Mass spectrometry generates an enormous amount of data
• Mascot returns, on average, hundreds of proteins matching the mass spectral data
• Analyzing Mascot results manually is time consuming
• Different ways of looking at the data are needed
• Comparison of various data sets (experiments)
• No tools were available in the public domain to analyze Mascot results

Protein Results Parser v3.0 Features
• Single file parsing
• Sequence coverage - with single file parsing
• Two-file comparison
• Multiple files
– Compare
– Combine
• Tool was developed using Perl/Tk
• Windows application

Single File Parsing
Screened HTML result (smaller file size)

Sequence Coverage

Two File Comparison

Results – Comparison of Two Experiments

Combine and Compare Feature
[Figure: two treatments (Drug A, Drug B); protein digest; fractions (SCX); triplicates (LC/MS/MS); 15 data files per treatment; the files of each treatment are combined, then the two combined sets are compared]

Multiple File Comparison

Results – Multiple file comparison (sequential display)

Results – Multiple file comparison (tabular display)

Combine – Merging of multiple experiments

Results – Combining multiple experiments

Conclusion
• Decreased data analysis and processing time.
• Search results are reduced automatically using user-specified criteria.
• Removal of low-scoring peptide matches greatly improves the accuracy of data interpretation.
• A single result file can be processed multiple times, using a different set of parsing criteria each time, without the need to repeat the database search.
• The ability to compare two or more result files in an automated fashion makes determination of sample similarity a nearly effortless endeavor.

Acknowledgements
• Dr. Randy Arnold – Manager and Research Scientist (Proteomics Research and Development Facility – Dept. of Chemistry)
• Dr. Haixu Tang – Asst. Prof., School of Informatics
• Abhijit Mahabal – Grad Student, CS Dept.
• Kranthi Varala – Grad Student, Bioinformatics
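The screening and comparison steps summarized in the Conclusion can be sketched roughly as follows. This is an illustration in Python, not the Perl/Tk code of ProReP. It assumes the peptide hits have already been extracted from the Mascot HTML into (protein, peptide, score) tuples. The function names screen and compare, the score cutoff of 40, and all data values are hypothetical.

```python
from collections import defaultdict


def screen(hits, min_score):
    """Keep peptide hits scoring at least min_score, grouped by protein."""
    kept = defaultdict(list)
    for protein, peptide, score in hits:
        if score >= min_score:
            kept[protein].append((peptide, score))
    return dict(kept)


def compare(screened_a, screened_b):
    """Split protein names into shared, only-in-A and only-in-B sets."""
    a, b = set(screened_a), set(screened_b)
    return {"shared": a & b, "only_a": a - b, "only_b": b - a}


# Made-up hits for illustration only; not taken from any real search.
experiment_a = [
    ("protein_1", "PEPTIDEK", 62),
    ("protein_1", "SAMPLER", 18),
    ("protein_2", "ANTHERK", 45),
]
experiment_b = [
    ("protein_1", "PEPTIDEK", 55),
    ("protein_3", "TESTPEPK", 71),
    ("protein_2", "ANTHERK", 12),
]

screened_a = screen(experiment_a, min_score=40)
screened_b = screen(experiment_b, min_score=40)
result = compare(screened_a, screened_b)
print(result)
```

Because screening works on the already parsed hits, the same result file can be screened again with a different cutoff without repeating the database search.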