Lab 3: Data analysis of a complex HPLC/MS/MS data set

One of the primary uses for HPLC/MS/MS systems is the analysis of complex samples. In biology and biochemistry, the protein composition of a cellular extract presents an extremely complex problem. With thousands of proteins in each cell at varying concentrations, much research is being done to find effective ways to sample deep into the proteome.

Proteins can be studied directly by mass spectrometry, but the information we can gather is limited. In electrospray sources, a typical protein can reach charge states in the double digits. To simplify the ion being studied at any given time, it is useful to break the protein up into smaller pieces, called peptides. This is done by digestion with a protease, and one frequently used protease is trypsin. A protein is a long chain composed of the 20 amino acids. Trypsin cleaves this chain after every lysine and arginine (except those immediately before proline). A trypsin digestion generates what we call tryptic peptides.

A complex proteome takes months to analyze and is therefore too complicated for a single lab experiment. Here we will use a standard software package to “search” our data and, using those results, select a couple of cases for further analysis.

I. Generating Mascot Search Results

Mascot is a software package that takes protein sequences, chops them into peptide sequences based on cleavage rules, and then compares the masses of the theoretical fragments of those peptides to the observed masses in an MS/MS spectrum. As we will see later, this is something we can do manually for a single MS/MS spectrum, but it would completely consume us if we attempted to manually assign the 1.2 million MS/MS spectra in complex data sets.

Here we will be studying MPS1, a single human protein. You will find the .RAW file and the .mgf file for this data in C:\5181_lab3. First, open the raw file by double clicking on it. Notice its complexity as a chromatogram.
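The trypsin cleavage rule from the introduction (cut after every K and R unless the next residue is P) is the kind of rule a search engine applies to each protein sequence before comparing masses. The following is a minimal Python sketch of that rule, not Mascot's actual implementation. The function names and the toy sequence are made up for illustration; the toy sequence is not the MPS1 sequence.

```python
def cleavage_sites(sequence):
    """Return the indices where trypsin would cut (after K or R, not before P)."""
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def tryptic_digest(sequence, missed_cleavages=0):
    """List tryptic peptides, allowing up to `missed_cleavages` skipped sites."""
    bounds = [0] + cleavage_sites(sequence) + [len(sequence)]
    peptides = []
    for start in range(len(bounds) - 1):
        for skipped in range(missed_cleavages + 1):
            end = start + 1 + skipped
            if end >= len(bounds):
                break
            peptides.append(sequence[bounds[start]:bounds[end]])
    return peptides


# Hypothetical toy sequence, chosen only to exercise the rule.
toy_protein = "MAGKPLTRDFETLKVDFLSKEE"
print(tryptic_digest(toy_protein))
print(tryptic_digest(toy_protein, missed_cleavages=1))
```

Note that a peptide such as DFETLKVDFLSK contains an internal lysine, so in this sketch it only appears when at least one missed cleavage is allowed.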
Assigning a chromatogram this complex is not something we want to do manually. The raw file represents a single HPLC run coupled to Orbitrap MS and LTQ MS/MS scans. The instrument is set to select the five most intense peaks from each MS scan and isolate them in succession in the LTQ for MS/MS fragmentation. There is also a Dynamic Exclusion setting, under which a peak is temporarily excluded from selection for MS/MS: if a peak is sequenced twice within 30 s, that mass is excluded for 3.5 min.

On the desktop is a shortcut to a webpage. Double click on it. It will bring up a window like the one shown in Figure 1. You have linked to our mass spec webserver at bluemoon.colorado.edu/mascot. There are several things we want to set up correctly before pressing “Start Search…”

1. Your name – This will help you identify your searches later.
2. Search title – This is perhaps even more useful for finding your search at a later date.
3. Enzyme – There are many proteases one could use to cut up the protein. We have used trypsin, so it should be the selection here.
4. Allowed Missed Cleavages – Sometimes the proteolysis is incomplete. This sets how many missed cleavage sites are allowed in a peptide assignment.
5. Fixed Modifications – Common practice before trypsin digestion is to reduce the disulfide bonds and alkylate the resulting free cysteines. We use iodoacetamide, which creates the Carbamidomethyl modification on cysteines. This amounts to adding ~57 daltons to each cysteine residue we observe. (This may be important later.)
6. Variable Modifications – Unlike the fixed modifications, we don’t know whether or not these have occurred. In this case, we are curious about possible phosphorylation of the hydroxyl-containing residues (serine, threonine, and tyrosine). Please select PhosphoNL (STY) on this list.
7. Monoisotopic vs. Average – This depends on how the data were collected. The data you will be searching are monoisotopic.
8. Data File – Here is where you put the mgf file.
Select Browse and find it in the 5181_lab3 directory.
9. Start Search – And off we go.

After some clicking and whirring, the software will kick out a search results page. The top of this page looks like figure 2. Before looking at the data, we need to format it. There are three things we need to look at before looking at peptide assignments. In 1, we want to select “peptide summary”, and in 2, we want the ion score cutoff to be 20. This will remove most of the false positive hits (MS/MS spectra that are assigned to peptides incorrectly). Now click the “format as” button. When the page reloads, the first entry below this window will be the one highlighted in 3. Click on it to find the coverage. (Question 1)

Now we are ready to look at peptides. As you scroll down you will find a giant pile of assignments. How can we use this? Now is the time to select two distinct cases. The first is a peptide with high scoring hits (mowse score greater than 50) in both its phosphorylated and unphosphorylated forms. The second is any peptide with a mowse score higher than 100 and a charge state of +2. (There are a few, and some have many hits over 100. In those cases, select the highest scoring hit to study.) I selected DFETLKVDFLSK as my phos/unphos case and TPSSNTLDDYMSCFR as my over-100 case, so get your own! (One set per person in each group.) Pick yours now. (Question 2)

Figure 3 shows how we can extract the relevant information from the hit data. For the further analysis, we need the observed m/z, the observed MW (of the peptide), and the scan number it originated from (this identifies the exact MS/MS scan that yielded this peptide hit). Once this is done, we can move away from Mascot and look at our data manually in Qual Browser.

** From here, I will take you through a sample analysis. Once you understand what I am doing, you should do this analysis for your selected peptides.

II.
Determining the % Phosphorylation of a Peptide

First we will look at the DFETLKVDFLSK peptide. To do this, we must isolate the chromatographic peak that represents this peptide by creating an extracted ion chromatogram (XIC). (Question 3) In Qual Browser, select the chromatogram window by pushing in the pushpin in its upper right corner (it will turn green). Right clicking brings up a menu with several selections. Select the “Ranges…” option and the Chromatogram Ranges window is displayed (see figure 4). The fields we will use are:

1. Time range – This text box allows us to narrow the displayed time range.
2. Plots display – Here we can select one or multiple plots to be displayed within the chromatogram window. Each plot has a distinct set of properties in the Plot properties box below.
3. Scan filter – A single data set contains both MS scans and MS/MS scans. The scan filter allows you to display only certain scan types.
4. Plot type – This determines how the plotting algorithm calculates intensity at every time point. The two commonly used selections are Base Peak and TIC. Base Peak defines the intensity as the height of the most intense peak in the mass spectrum at that point. TIC sums all the intensity to give a Total Ion Chromatogram. Base Peak is more useful for picking out the large chromatographic peaks, whereas the TIC is useful for finding regions containing many different ions of lower intensity.
5. Range(s) – This text box is only active in Base Peak mode and allows us to specify a mass range from which the algorithm must select the base peak. For a very narrow range, this generates an eXtracted Ion Chromatogram (XIC).

From the calculated m/z values for the different charge states, we can generate XICs that will be stacked for easy comparison. The inset of figure 5 shows what this looks like.
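The m/z values entered in the Range(s) box come from the peptide's monoisotopic mass and the charge state: m/z = (M + z × 1.00728) / z. The following Python sketch works through that arithmetic for DFETLKVDFLSK in its unphosphorylated and phosphorylated forms. The residue masses are standard monoisotopic values, and the function names, the choice of charge states +1 to +3, and the 0.01 m/z half-width are assumptions made for illustration.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056     # added once per peptide (free N- and C-termini)
PROTON = 1.00728     # mass of each charge-carrying proton
PHOSPHO = 79.96633   # HPO3 added by one phosphorylation


def peptide_mass(sequence, extra_mass=0.0):
    """Neutral monoisotopic mass of a peptide plus any modification mass."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + extra_mass


def mz(mass, charge):
    """m/z of a peptide of neutral mass `mass` carrying `charge` protons."""
    return (mass + charge * PROTON) / charge


def xic_windows(sequence, charges=(1, 2, 3), extra_mass=0.0, tol=0.01):
    """Lower and upper m/z bounds for an XIC at each charge state."""
    mass = peptide_mass(sequence, extra_mass)
    return {z: (mz(mass, z) - tol, mz(mass, z) + tol) for z in charges}


for label, extra in (("unphosphorylated", 0.0), ("phosphorylated", PHOSPHO)):
    print(label)
    windows = xic_windows("DFETLKVDFLSK", extra_mass=extra)
    for z, (low, high) in windows.items():
        print(f"  +{z}: {low:.4f} - {high:.4f}")
```

Always compare these calculated values against the observed m/z reported by Mascot before building the XIC; a mismatch usually means a modification was left out of the calculation.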
The mass accuracy of the Orbitrap scans is very high, so we use a very tight mass window around our calculated m/z values (± 0.01 m/z). This yields the plot shown in the chromatogram of figure 5. An interesting feature of this plot is that peaks from two or more charge states stack at the same retention time, which makes it easy to pick out the peak most likely to represent our peptide. Figure 5 shows the XIC for all charge states of unphosphorylated DFETLKVDFLSK. For further confirmation, we can check the scan numbers included in these peaks against the scan number of our positive assignment. Typically, the scan number attributed to a hit is found early in the appropriate peak. (Question 4)

With our XIC ready, we can now integrate and label our peaks so we can read off and interpret those values (this process is shown graphically in figure 6). First we select Peak Detection → Toggle Detection in All Plots… This will integrate all the displayed chromatographic peaks. Second, we select Display Options… from the pull down menu. There are several tabs here, but the one we want is Labels. In the Labels tab, we can check area and height to display them on the plot. Click OK and the chromatogram is now prepared for reading off the data. Record the area and height for each charge state of each peak that represents the peptide. We can now repeat this process for the phosphorylated peptide. (Questions 5–8) A sample data sheet from Excel is shown for the DFETLKVDFLSK peptide in figure 7.

III. Manual validation of a peptide hit

Now to our high scoring peptide. As shown in figure 8, we first generate an XIC for the peptide, but this time we remove the scan filter. This will show us all scans on a single plot, both MS and MS/MS. From the scan number we found in the Mascot search, we know where the MS/MS was taken. The most recent MS scan before the assigned MS/MS scan contains the precursor peak that was selected and gave the hit.
In the case of my peptide, TPSSNTLDDYMSCFR, the m/z for the +2 ion is 897.38. You can see that it is not nearly the most intense peak in the MS scan, but it is not MS intensity that makes a good assignment; it is MS/MS quality. This MS scan produced 5 MS/MS scans (as they all do in this experiment). This is shown in figure 10, where I have blown up the time region where the peak was assigned. The 5 MS/MS spectra are of varying quality, some with many fragments and some with very few. A high scoring assignment, however, will always look something like the top MS/MS in figure 10, which is the spectrum that got the good assignment for TPSSNTLDDYMSCFR.

Our next step is to make 2 copies of the MS/MS spectrum (simply print with the mass spectrum selected and choose “selected cell only” and “one page”; also make sure that they print in landscape rather than portrait). From this point, our analysis moves away from the computer and to a table with our spectra, a pen or pencil, and a calculator.

De Novo Sequencing (Question 9)

The first of the two annotation methods we will look at is the way things were done in the early days of peptide mass spectrometry. It is still useful, however, for validation and for identifying high quality spectra that went unassigned. The method comes from an understanding of how the peptide backbone fragments. As we move along the chain, there are three places the peptide can be cleaved:

[Diagram: peptide backbone with the three cleavage sites marked. Fragments retaining the N-terminus are labeled a, b, and c; the complementary fragments retaining the C-terminus are labeled x, y, and z.]

The designation abcxyz will be useful in the next section, but for now it is enough to know that the most favorable place for cleavage is the peptide bond (recall that a peptide bond is formed when the carboxylic acid of one amino acid reacts with the amine of another to form an amide). If a peptide cleaves at successive peptide bonds, the peaks in the fragmentation spectrum will be spaced according to the masses of the amino acid residues. This is demonstrated in figure 11.
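Reading a sequence from peak spacings amounts to looking up each mass difference in a table of residue masses. The following Python sketch shows that lookup. The residue masses are standard monoisotopic values; the function names, the 0.5 Da tolerance (a guess at what suits low resolution ion trap MS/MS), and the three peak values are assumptions chosen only to illustrate the idea.

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def residues_for_spacing(delta, tol=0.5):
    """Residues whose mass matches a peak-to-peak spacing within `tol` Da."""
    return sorted(aa for aa, mass in RESIDUE_MASS.items()
                  if abs(mass - delta) <= tol)


def read_tag(peaks, tol=0.5):
    """Candidate residues for each gap between neighboring peaks."""
    ordered = sorted(peaks)
    return [residues_for_spacing(high - low, tol)
            for low, high in zip(ordered, ordered[1:])]


# Hypothetical singly charged fragment peaks, lowest to highest m/z.
example_peaks = [500.0, 613.1, 714.1]
for candidates in read_tag(example_peaks):
    print("/".join(candidates) if candidates else "no match")
```

Leucine and isoleucine have identical masses, so a spacing of about 113 always returns both. Modified residues need their own entries; carbamidomethylated cysteine, for example, appears as a spacing of about 160 rather than 103.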
While some experience can be useful in this process, it is best to start with the most intense peaks above the parent m/z (the m/z of the ion that was fragmented). This allows us to consider only singly charged ions (all doubly charged fragments will fall below the parent m/z). I began by finding the difference between the most intense peak, 1093, and the next intense peak higher in mass, 1206. The difference between these is 113, which corresponds to either leucine or isoleucine. From there I move on to the next peak and find the difference, 101, which is threonine. I repeat this several times and obtain a TAG sequence.

When we already have a peptide assignment, this TAG sequence allows us to get our bearings. I found I/LTNSS as my initial TAG. In the sequence TPSSNTLDDYMSCFR, I see this sequence in reverse towards the beginning of the peptide. Now my job becomes easier. If the major peaks from 1595 to 1093 represent those amino acids, then the next lower peak should be less by the mass of D, 115. I subtract 115 from 1093 and get 978. I can repeat this process to follow the tag all the way down to 482. I have now assigned this spectrum, with fair certainty, to the peptide sequence TPSSNTLDDYMSCFR.

While this technique was fairly straightforward for this high scoring case, it can be quite difficult in lower scoring cases. Also, for unidentified cases, it can sometimes seem like you are flouncing through a field of corn. Anyhow, on to the second method.

Annotation of the spectra according to specific ions (Question 10)

In this method we will calculate the masses of a specific set of ions and then annotate the spectrum. In the diagram of potential cleavages above, the vertical line represents the cut and the horizontal line points toward the fragment that carries the observed charge. The most well behaved cleavage and charging yields y ions.
A strong y-ion series is typically what leads to high scoring spectra, so we will see if we can find the y ions in our spectrum. We calculate the masses of the y ions by starting with the first amino acid and assigning it the parent mass, MH+ (this is your calculated peptide mass + 1). We then move down the amino acid sequence, subtracting the mass of each preceding amino acid in turn. In the end you will have a table like the one in figure 12. To finish off the table, start at the lowest mass and number the ions going back up. We should be able to assign these y ions to peaks in the spectrum based on these calculated masses. I have shown how this is done in figure 12.

Summary

In this lab we learned how to deal with a real, complex data set. We now know how to work with chromatograms and mass spectra of peptides in detail. There is much more to learn in this field, but for now, this will give you a good introduction to how biochemists use mass spectrometry to deal with the complexities of cellular life.
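As a cross-check on the hand-calculated y-ion table from the annotation section, the same subtraction can be sketched in Python. This follows the procedure described there: assign MH+ to the first residue, subtract each preceding residue mass in turn, then number the ions from the bottom up. The residue masses are standard monoisotopic values, and the function name and the fixed_mods argument are assumptions made for illustration. The cysteine carries the carbamidomethyl fixed modification (+57.02146).

```python
# Standard monoisotopic residue masses in Da.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728
CARBAMIDOMETHYL = 57.02146  # fixed modification on cysteine


def y_ion_table(sequence, fixed_mods=None):
    """Rows of (residue, y-ion number, singly charged y-ion m/z)."""
    fixed_mods = fixed_mods or {}
    masses = [RESIDUE_MASS[aa] + fixed_mods.get(aa, 0.0) for aa in sequence]
    current = sum(masses) + WATER + PROTON  # MH+ goes with the first residue
    rows = []
    for index, residue in enumerate(sequence):
        rows.append((residue, len(sequence) - index, current))
        current -= masses[index]  # drop this residue to get the next y ion
    return rows


table = y_ion_table("TPSSNTLDDYMSCFR", {"C": CARBAMIDOMETHYL})
for residue, number, mass in table:
    print(f"{residue}  y{number:<2d}  {mass:9.3f}")
```

With these inputs, y8 and y9 come out near 1093.4 and 1206.5, and y3 near 482.2, which agree with the peaks read off in the de novo walkthrough.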