Contents

1 Introduction
1.1 Protein Sequence Digestion
1.2 Mass Spectrometry
1.3 Protein Analysis using Mass Spectrometry
1.4 Modifications
2 Applications
2.1 Use Case Diagram
2.2 Web-based Protein Digesters
3 Modelling and Implementation
3.1 Class Diagram
3.2 Modifications
3.3 Cleaving Enzymes
3.4 Storage
3.5 Mass Range
3.6 Database Searching
3.7 Sequence Diagram
4 Results and Discussion
4.1 Output Examples
4.2 Mass Distribution Analysis
4.3 Modifications and Missed Cleavages
4.5 Comparing Data Types
4.6 Database Analysis
4.7 Database Search
5 Conclusions
6 Glossary
7 Appendix
8 References

1 INTRODUCTION

Since the completion of genome sequencing of several organisms including the human genome, attention has been directed from genome to proteome analysis. The term proteome was first introduced by Wilkins et al. in 1995 [1] and denotes the entire set of proteins expressed by a genome at a given time. Proteins represent the functional aspect of gene activities in living cells. Proteome analysis, or proteomics, is concerned with protein identification, determination of the function or functional networks of proteins and construction of databases storing the acquired knowledge.
Considerable progress has been made in the separation and identification of proteins, two-dimensional gel electrophoresis and mass spectrometry being key techniques [2]. Today the most commonly used methods for identification of proteins are peptide mass fingerprinting and MS/MS fragmentation. Both methods are based on enzymatic or chemical digestion of a purified protein, mass spectrometric measurement of the resulting peptides and comparison with theoretical masses derived from in silico digestion of the protein sequences in a database [3]. An essential ingredient for high-throughput analyses is the development of computer software that is able to quickly and efficiently analyze and interpret the huge amounts of information emerging from proteome analysis. This work is concerned with the implementation of a protein sequence digesting tool that models the proteolytic digestion of a protein or protein database, computing the theoretically resulting peptides and their corresponding masses given a cleaving enzyme, a maximum number of missed cleavages and both fixed and variable modifications. First the biological background of protein sequence digestion and proteolytically active enzymes will be described. The basic principles of mass spectrometry will then be explained, followed by a short description of its applications in proteome analysis. The following section describes different types of protein modifications. In chapter 2 different protein digesters available on the web are compared before going into detail with the modeling and implementation of ProtDigest in chapter 3. The results of this work are demonstrated and discussed in chapter 4, followed by the conclusions.

1.1 Protein Sequence Digestion

Cleavage of protein sequences is a process frequently encountered in vivo that is also used in vitro for protein identification and characterization by peptide mass fingerprinting.
The proteolytic enzymes that can hydrolyze peptide bonds in amino acid sequences, and therefore generate peptides or individual amino acids, are called proteases. Exoproteases remove exactly one residue either from the amino-terminus (aminopeptidase) or from the carboxy-terminus (carboxypeptidase), resulting in a single amino acid and the shortened protein sequence. Endoproteases (or proteinases) cleave at the C- or N-terminal side of specific amino acids independent of the position in the sequence. They are classified with respect to the cleaving mechanism. Serine-proteases for example have a serine residue in their catalytic center that can perform a nucleophilic attack on the C-atom of a peptide bond. Other classes are metallo-, cysteine- and aspartic-proteases. Dipeptidases hydrolyze peptide bonds between dipeptides.

1.1.1 Cleavage specificity

Most proteases have a preference for a certain amino acid composition at their cleavage site. The composition is dependent on the catalytic center of the protease which interacts with the polypeptide chain. Trypsin for example recognizes the basic amino acids lysine and arginine and cleaves carboxy-terminally (K or R in position P1 in figure 1.1). Cleavage is restricted if there is proline in position P1’. Trypsin of higher specificity additionally does not cleave after K in CKY, DKD, CKH, CKD, KKR nor after R in RRH, RRR, CRK, DRD, RRF, KRR. Cleavage rules can be even more complex. Caspase 2 for example requires the following composition:

1.1.2 Protein Cleavage in vivo

Proteolytic cleavage is involved in several important processes in vivo including the following examples:

Activation of proenzymes (zymogens): Some proteins are expressed as inactive precursors that are activated by proteolytic cleavage (e.g. trypsinogen → trypsin, prothrombin → thrombin). This is an efficient mechanism of enzyme activity regulation, e.g. preventing digestive proteases from attacking gastric cells.
Digestion of dietary proteins: Dietary proteins are gradually degraded to individual amino acids in order to be of use to the organism. This process is catalyzed by digestive proteases. When food reaches the stomach, pepsinogen is secreted by the gastric mucosa. Hydrochloric acid, also produced by the gastric mucosa, is necessary for the proteolytic activation of pepsinogen to pepsin and to maintain the optimum acidity (pH 1-3) for pepsin function. Further degradation of the peptides is catalyzed by trypsin, chymotrypsin and other proteases continuing in the intestines.

Degradation of cellular proteins: Malfunctioning proteins or cellular proteins that are no longer of use to the cell are marked with ubiquitin and then degraded in the proteasome. Proteases are also found in the lysosome. The caspase family is a family of cysteine-proteases implicated in programmed cell death (apoptosis).

Active protease | Functions in | Class | Cleavage specificity
Pepsin | Food digestion | Aspartic-protease | Broad specificity
Trypsin | Food digestion | Serine-protease | After Lys and Arg, not before Pro
Chymotrypsin | Food digestion | Serine-protease | After Tyr, Trp, Phe, also Leu and Met
Elastase | Hydrolysis of elastin (structural protein) | Serine-protease | Mainly after Ala, also Val and Leu
Thrombin | Blood clotting | Serine-protease | After Arg and Lys
Caspase-3 | Apoptosis | Cysteine-protease | Between Asp and Gly

Table 1.1: A selection of proteases

1.1.3 Protein Cleavage in vitro

Protein digestion has also become an important technique in vitro for the identification and characterization of proteins using mass spectrometry. The most often used enzyme is trypsin. Protein digestion can also be performed by proteolytic chemicals such as cyanogen bromide (CNBr), which cleaves after methionine. Proteolytic degradation is often performed overnight as it can take several hours depending on the reaction conditions and on the protease employed.
A higher enzyme-to-substrate ratio speeds up this process but has the disadvantage of increasing the number of autoproteolysis products if digestion is performed in solution. The use of immobilized enzymes allows using an excess of enzyme without increasing autoproteolysis and can therefore achieve degradation within minutes [4].

1.2 Mass Spectrometry

Mass spectrometry (MS) allows the determination of the molecular weight of biomolecules. It has become an important tool in proteome analysis and is preferred to chromatographic, electrophoretic or ultracentrifugation methods because of its precision. The accuracy achieved by MS is frequently better than 0.01% of the calculated mass whereas the relative error of the other methods mentioned ranges between 10 and 100% on average [11].

[Fig. 1.2: Schematic illustration of a mass spectrometer. Components shown: sample inlet, ion source, mass analyzer, ion detector, data system, data output, vacuum pumps]

Mass spectrometers basically consist of an inlet for sample introduction (often a gas chromatograph), an ion source, a mass analyzer, an ion detector and finally a data system to process the output data and produce a spectrum (see figure 1.2). The ion source produces gas-phase ions, the mass analyzer separates the ionized analytes according to their mass-to-charge ratio (m/z-ratio) and the ion detector counts the number of ions for each m/z value [11].

1.2.1 Ion Sources

There are different types of ion sources producing either analyte anions (negative ion mode) or cations (positive ion mode). Generating ions is convenient as they can be efficiently detected and navigated using electric or magnetic fields. The most commonly used ionization methods to generate protein or peptide ions are matrix-assisted laser desorption/ionization (MALDI) [5] and electrospray ionization (ESI) [6].
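As a small worked illustration of the mass-to-charge ratio in positive ion mode, the sketch below assumes that an analyte of neutral mass M becomes an ion by picking up z protons, so that m/z = (M + z · 1.007276) / z. The function name mzPositive and the constant name are hypothetical and serve only this example; other adducts or negative ion mode would need a different mass shift.

```cpp
#include <cassert>
#include <cmath>
#include <stdexcept>

// Mass of a proton in Da (assumed charge carrier in positive ion mode).
const double kProtonMass = 1.007276;

// Hypothetical helper: m/z of an analyte with the given neutral mass (Da)
// that carries `charge` additional protons.
double mzPositive(double neutral_mass, int charge) {
    if (charge < 1) {
        throw std::invalid_argument("charge must be at least 1");
    }
    return (neutral_mass + charge * kProtonMass) / charge;
}
```

For a peptide of 1000 Da this gives roughly 1001.007 for a singly charged ion and roughly 501.007 for a doubly charged ion, which is why the same analyte can appear at several m/z values when it carries different numbers of charges.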
MALDI consists of two steps: First, the analyte is mixed with a molar excess of small organic molecules, the matrix, which strongly absorb the laser wavelength. After drying the mixture the molecules to be analyzed are completely isolated from one another in the matrix. The second step of the MALDI process involves desorption of portions of the solid sample because of rapid heating and expansion into the gas phase initiated by pulses of laser light. This process results in ionized analyte molecules because of proton transfer in the gas phase. The usual charge in MALDI is +1 (see figure 1.3) [11].

ESI: The sample is ionized at atmospheric pressure. Highly charged droplets disperse from a capillary in an electric field, evaporate and are drawn into the vacuum of the analyzer. ESI generates multiply charged ions. ESI is a soft ionization method that allows for detection of non-covalent protein complexes as there is little or no fragmentation of polymer molecules during ionization [7].

Other ion sources are electron ionization (EI), chemical ionization (CI), fast atom bombardment (FAB), field desorption (FD), plasma desorption (PD), laser desorption (LD), thermospray (TSP) and atmospheric pressure chemical ionization (APCI) [11].

1.2.2 Mass Analyzers

There are also different types of mass analyzers which vary in three main characteristics: resolution, transmission and mass limit. A high resolution is desirable for a high selectivity, i.e. to be able to distinguish between two molecules of low mass difference. The transmission is the ratio of ions detected to ions generated and is therefore a measure of sensitivity. The highest m/z-ratio that can be measured determines the mass limit. MALDI is mostly coupled to time-of-flight (TOF) analyzers. TOF analyzers do not have an upper mass limit. The accuracy and the speed of MALDI-TOF MS have made it the most common instrument for protein identification.
ESI is mostly combined with TOF, quadrupolar or ion trap analyzers which are also attractive for their relatively low cost compared with magnetic sectors or Fourier transform-MS (FTMS) [11].

1.2.3 Tandem Mass Spectrometry

A tandem mass spectrometer has two analyzers separated by a collision cell. Here the sample ions collide with an inert gas which results in their fragmentation (collision-induced dissociation (CID) [11]). Often used combinations of analyzers are for example quadrupole-quadrupole or quadrupole-TOF. The principle of MS/MS is shown in figure 1.4. A parent ion or precursor ion of a certain mass is selected from the first analyzer (MS1) and then fragmented in the collision cell resulting in a spectrum of daughter ions produced by the second analyzer (MS2).

1.3 Protein Analysis using Mass Spectrometry

Because of its speed and sensitivity, MS has emerged as a key technique for the structural analysis and identification of proteins. It can provide information about posttranslational modifications as well as protein interactions and can also be used for relative protein quantification [22].

1.3.1 Protein Identification

There are two main techniques taking a “bottom-up” approach to protein identification using MS and subsequent sequence database searching:
1) peptide mass fingerprinting (PMF)
2) MS/MS identification
An emerging technique is “top-down” MS, a term introduced by McLafferty and coworkers, also making use of database searching [13-15].

PMF: PMF is the analysis method of choice for rapid identification of proteins [8]. The protein to be analyzed first needs to be separated from its mixture. One- or two-dimensional polyacrylamide gel electrophoresis is a common method for protein separation. After excision and decoloration of the gel bands or spots, reduction and alkylation are often performed in order to prevent oxidation of Cys residues (Disulfide bonds are separated by reduction with thiols such as dithiothreitol (DTT).
To prevent reformation, cysteine residues are alkylated, e.g. with iodoacetic acid, forming S-carboxymethyl derivatives). The purified protein is then proteolytically digested in situ (in-gel), generating smaller peptides [12]. The peptides are extracted, their masses measured by MS, mostly MALDI-TOF MS because of its speed and simplicity, and then compared with theoretically calculated masses resulting from the application of the used enzyme's cleavage rules to the protein sequences in a database (see figure 1.5).

Identification therefore requires the protein to be present in the database. Another requirement is that the peptides detected originate from the same protein, which can be disturbed by the presence of contaminants, e.g. hair, skin or artifacts of sample handling. Unambiguous results are not always achieved by PMF as the protein may be heavily modified (see section 1.4), yielding experimental masses differing from the calculated ones. As “protein identification correlates directly to the number of detected peptide signals” [9], PMF may provide ambiguous information if only few peptides are detected. Performing MS/MS is a way of increasing the level of confidence in such results or obtaining an identification if none at all was achieved by PMF.

MS/MS protein identification: This technique is more complex and time consuming than the PMF approach, but it is capable of high quality identification and also of identifying different proteins in one sample [10]. Sample preparation is the same as in PMF, but the peptides derived from the digestion are subjected to tandem MS resulting in peptide fragmentation spectra. These spectra contain different ion series (a, b, c; x, y, z) [11] named by the site of fragmentation (Fig. 1.5). Sequence information can be gained from the mass differences between peaks of the same series being characteristic of an amino acid (Fig. 1.6). Each MS/MS spectrum can potentially identify one peptide.
When multiple spectra point to different peptides derived from the same protein, this gives rise to a high confidence in the identification.

Top-Down MS: Top-down proteomics is based on tandem MS [14]. A complete mixture of intact proteins is introduced to the mass spectrometer producing intact protein ions. Ions of a specific mass are isolated and fragmented and then subjected to the second analyzer. The intact mass and the fragmentation data are then compared to a sequence database. This relatively new method can be used for the identification and localization of post-translational modifications and is mostly performed with Fourier transform MS. Software for the interpretation of top-down data is available at https://prosightptm.scs.uiuc.edu/ (ProSight PTM) [15].

1.3.2 Protein Characterization

Beside the identification of unknown proteins MS can be employed for identification and localization of post-translational modifications (PTMs), protein quantification and detection of non-covalent complexes and protein interactions. As already mentioned top-down approaches provide information about PTMs. ESI MS is soft enough to allow for non-covalent complexes and protein interactions to stay intact therefore being able to assist in higher structure elucidation [21]. Relative quantification of proteins (e.g. in order to compare tumor cells with normal cells) is often based on stable isotope labeling (see section 1.4.2).

1.4 Modifications

Protein modifications can be divided into post-translational modifications and artificial modifications which are again subdivided into accidental modifications and deliberate modifications.
1. Post-translational modifications
2. Artificial modifications
a) Deliberate modifications
b) Accidental modifications
As shown in figure 1.7, modifications can be position-specific, occurring only at the amino- or carboxy-terminus of a peptide, or non-position-specific, occurring at a residue independent of its position in the amino acid sequence.
Amino-terminal and carboxy-terminal modifications, respectively, can either be dependent or independent of the terminal residue. Protein-N-terminal modifications are only attached to the first, protein-C-terminal modifications to the last amino acid of the complete protein sequence and can therefore only be found in the terminal protein fragments produced by protein digestion. Non-position-specific modifications modify the side chains of specific amino acids; acidic, basic and hydroxy-group or sulfur containing residues being the most susceptible sites for modification because of their high reactivity.

[Fig. 1.7: Locations of modification sites]

1.4.1 Post-translational modifications:

“The analysis of posttranslational modifications is an important task of protein chemistry in proteome research. (...) It is assumed that modifications such as phosphorylation or glycosylation exist on every second protein and that they are important for the protein function.” (Sickmann et al. [23]).

Most proteins are covalently modified after their translation at the ribosome. Posttranslational modifications (PTMs) are essential determinants of protein function and can have stabilizing effects on protein structure. They play a role in enzyme regulation, protein targeting and several more important processes in vivo and are therefore of great interest to proteomics. Because of the resulting mass difference PTMs are a considerable challenge to protein identification using sequence database searching.

PTMs include glycosylation, acylation, methylation, phosphorylation, sulfation, prenylation, and formation of selenoproteins. One specific residue in a protein is usually subject to one type of modification, although it has been demonstrated that residues can be alternatively modified. Murine estrogen receptor beta for example can carry an N-acetyl-glucosamine or a phosphoryl-group at Ser16 [24]. Examples: 1.
The hydroxyl-groups of Ser, Thr or Tyr can be covalently phosphorylated, a process catalyzed by a specific category of enzymes called kinases. Reversible phosphorylation is of particular interest because of its important role in enzyme activity regulation. In many cases activity or inactivity of an enzyme is controlled by the absence or presence of one or more phosphoryl-groups inducing a conformational change in the structure of the protein. There are specialized databases containing information about phosphorylation sites in protein sequences (e.g. http://phospho.elm.eu.org/). As phosphorylation brings forth negatively charged peptides (Fig.1.8), it cannot be detected using the standard positive ion mode. Phosphorylation causes a mass increase of 79 Da in the negative ion mode.
2. The attachment of saccharides, glycosylation, can either be N- or O-linked (at Asn or Ser/Thr, respectively). N-glycosylation takes place in the endoplasmic reticulum during mRNA translation at the ribosome and is signaled by a certain sequence of amino acids: Asn-X-Ser/Thr. O-glycosylation is performed in the Golgi apparatus. N-linked oligosaccharides can be of complex, branched structure while O-linked oligosaccharides are generally shorter, often containing only one to four sugar molecules.

1.4.2 Artificial modifications:

Artificial modifications can either be deliberately induced or accidental products of sample preparation and handling. Deliberate labeling of proteins can for example be used for relative quantification of individual proteins within a mixture. There are covalently bound as well as non-covalently bound modifications. Because of the harsher preparation conditions of MALDI-TOF, non-covalent complexes are more stable under ESI conditions. Most non-covalent modifications are not readily detected by MALDI-TOF [21].

a) Examples for deliberate modifications: 1.
Reductive alkylation is performed to prevent peptides from forming disulfide bonds (Fig.1.9) with other peptides, yielding masses misleading for peptide mass mapping. As already mentioned in section 1.3.1, disulfide bridges are deliberately reduced with reducing agents such as dithiothreitol (DTT) or tris(2-carboxyethyl)phosphine (TCEP). Subsequent treatment with alkylating substances such as iodoacetic acid, iodoacetamide or 4-vinylpyridine prevents reoxidation by modifying the cysteine residues.
2. Another deliberate modification is the use of isotope-coded affinity tags (ICATs) to be able to distinguish between two sister peptides of the same protein in protein mixtures representing different cell states. One mixture is treated with light, the other with heavy (deuterated) ICATs, resulting in masses differing by ~8 Da. The ratio of the peak intensities of two sister peptides in the joint MS spectrum determines the relative quantification of their parent proteins [25].

b) Examples for accidental modifications:
1. Proteins can be covalently modified by reaction with unpolymerized monomers of acrylamide in polyacrylamide gels during electrophoresis [16]. Especially the reactivity of the SH-group in Cys has been shown to be very high towards alkylation, forming cysteinyl-S-propionamide adducts. At alkaline pH Cys can even be modified when engaged in disulfide bridges [17]. The addition of one molecule of acrylamide results in an increase of ~71 Da in the molecular weight of a peptide/protein.
2. Another modification commonly encountered in polyacrylamide gel-separated proteins is oxidation, mostly at Met residues forming methionine sulfoxide (Fig.1.10). Residual ammonium persulfate, which is used to induce gel polymerization, acts as a very reactive oxidant [18]. Oxidation adds ~16 Da to the molecular mass of the protein.
3. The eluant can be another source of accidental modifications.
A standard solvent used for eluting protein spots from polyacrylamide gels contains formic acid which has been shown to massively formylate Ser and Thr residues (up to ten formylation products observed [19]). Formic acid is also used during cyanogen bromide digestion of proteins resulting in the same modification of Ser and Thr [20]. Formylation causes an increase of ~28 Da.
4. Even the dye used for staining a gel can affect the m/z-ratio observed in MS analysis. Although only non-covalently bound by hydrophobic interactions Coomassie can even be detected by MALDI-TOF MS [21]. It has been observed that up to ten molecules of Coomassie can aggregate with relatively small polypeptides.

2 Applications

The variety of proteomics tools distributed on the internet is very broad. A compilation of tools for protein identification and characterization, primary, secondary and tertiary structure prediction, sequence similarity searches and alignments, prediction of post-translational modifications, and translation of DNA sequences into amino acid sequences can be found on the ExPASy server (www.expasy.org/tools).

2.1 Use Case Diagram

[Fig. 2.1: Use cases of in silico protein digestion in proteomics. Elements shown: protein digestion (single sequence digestion, database digestion), PMF identification tool with experimental PMF data, MS/MS fragmentation of each peptide, MS/MS ion search with experimental MS/MS data; purposes named: identification and characterization, localization of PTMs, verification of PMF identification]

The usual approach to protein identification is a PMF experiment as described in chapter 1. The experimental data is analyzed and interpreted by a PMF identification tool such as Mascot [36] which performs matching and scoring in a digested database. The outcome can be further analyzed manually using a single sequence protein digester.
If an identification is not obtained or remains doubtful, MS/MS data can be used to gain further information. For an MS/MS ion search the peptides resulting from a database digestion need to be fragmented (see Figure 1.5).

2.2 Web-based Protein Digesters

Different tools for theoretical protein sequence digestion are compared in table 2.1. MS-Digest is the most complex one, offering the most additional features. PeptideCutter has a very sophisticated model of cleavage prediction.

Table 2.1: Comparison of four protein digesters available on the internet.

PeptideMass (ExPASy), http://www.expasy.org/tools/peptide-mass.html
Input: Sequence in one-letter-code; or SwissProt [37] ID or accession number (AC)
Output: HTML table with sequences, masses, artificial modifications, missed cleavages; Mw and pI of protein; text file containing masses
Modifications: Cys derivates and Met oxidation; inclusion of known PTMs for SwissProt sequences possible
Enzymes: Choice of 16 standard enzymes taking into account positions P2 to P1' (see Fig. 1.1); max. 5 missed cleavages
Other parameters: Either MH+, M or (MH)- masses; either monoisotopic or average; optional minimum mass for output peptides
Additional features: For SwissProt sequences: inclusion of splicing variants, protein isoforms and database conflicts possible

PeptideCutter (ExPASy), http://www.expasy.org/tools/peptidecutter
Input: Sequence in one-letter-code; or SwissProt ID or AC
Output: HTML map or table of cleavage sites; table of sequences and masses
Modifications: No modifications incorporated
Enzymes: 34 enzymes taking into account P4 to P2'; select as many as you want; no missed cleavages
Other parameters: None noteworthy
Additional features: Sophisticated model for trypsin and chymotrypsin incorporating cleavage probability

MS-Digest (ProteinProspector), http://prospector.ucsf.edu
Input: Sequence without X, B or Z; or database ID, several DBs included, e.g. SwissProt, NCBI [38]
Output: HTML, XML output, can be saved to file; both monoisotopic and average masses; protein Mw and pI; several options for output (-> other parameters and additional features)
Modifications: Cys derivates; state of N-/C-terminus; list of considered (= variable) modifications; user specified modifications
Enzymes: 29 enzymes taking into account P1 and P1'; no upper limit for number of missed cleavages; user specified enzymes
Other parameters: mass range; minimum peptide length; amino acids present in output peptides
Additional features: Calculation of ChemScore [26], Bull Breese indices [27] and HPLC indices [28]; incorporation of user specified amino acids (elemental position) -> modifications

PeptideSort (GCG), http://menu.hgmp.mrc.ac.uk/people/gcg/gcghelp/html/unixpeptidesort.html
Input: Command line program; sequence from file or PIR [39]/SwissProt ID; only one sequence
Output: Text file containing peptide masses, positions on sequence, amino acid compositions, pIs of all peptides and the protein
Modifications: No modifications incorporated
Enzymes: 22 enzymes taking into account P1 and P1'; either one or all enzymes
Other parameters: Mincuts, Maxcuts; if all enzymes are selected those that do not cut at least mincuts or at most maxcuts times are ignored
Additional features: HPLC retention [29]; extinction coefficient [30]

3 Modeling and Implementation

ProtDigest was implemented in C++ and documented with the help of Doxygen [www.doxygen.org]. It was debugged using the GNU debugger [www.gnu.org], and valgrind [http://valgrind.kde.org/] was used to find and fix memory leaks. The diagrams shown in this chapter were created with Microsoft Visio [www.microsoft.com/office/visio].

3.1 Class Diagram

The class diagram is shown in Fig. 3.1. The class Sequence_Set stores the digest of one or more protein sequences. Each protein to be digested is an instance of Sequence stored in the vector seqs.
After the call of doCleave(), the vector peptides contains all peptides without variable modifications which are instances of Peptide, while variably modified peptides are instances of ModPeptide stored in mod_peptides. The cleaving enzyme employed is an instance of Enzyme. Furthermore a Sequence_Set has vectors for fixed and variable modifications, fixmod and varmod, and min_mass and max_mass for the mass range considered. If monoisotopic is true, the monoisotopic masses of the amino acids are stored in aa_masses and used for the calculation of the peptide masses. Otherwise the average masses are stored and used.

Fig. 3.1: Class Diagram of ProtDigest

Modification
Attributes: -name : string = ""; -residues : string = ""; -mod_pos : int = 0; -mono_masses : vector<mass_t> = null; -avg_masses : vector<mass_t> = null
Operations: +setName(); +setResidues(); +setModPos(); +setMonoMasses(); +setAvgMasses(); +getName(); +getResidues(); +getResidue(); +getModPos(); +getMonoMasses(); +getMonoMass(); +getAvgMasses(); +getAvgMass()

Sequence_Set
Attributes: -seqnumber : int = 0; -seqs : vector<Sequence*> = null; -fixmod : vector<Modification*> = null; -varmod : vector<Modification*> = null; -max_missed_cleave : int = 0; -enzyme : Enzyme = null; -min_mass : mass_t = 0; -max_mass : mass_t = 0; -peptides : vector<Peptide*> = null; -mod_peptides : vector<ModPeptide*> = null; -monoisotopic : bool = true; -aa_masses : AminoAcidMassesFloat
Operations: +setSeqnumber(); +addSeq(); +setFixmod(); +setVarmod(); +setMaxMissedCleave(); +setEnzyme(); +setMinMass(); +setMaxMass(); +setMonoisotopic(); +getSeqnumber(); +getSeqs(); +getFixmod(); +getVarmod(); +getEnzyme(); +getMinMass(); +getMaxMass(); +getMonoisotopic(); +getPeptides(); +getModPeptides(); +doCleave(); +modifyAminoAcidMasses(); -addMissedCleave(); +appendTermMods(); +createModPeptides(); -modPeptideRec(); -getModstring(); +doSearch(); +getSearchMinMax(); +interpolSearch(); +interpolSearchMod(); +outputIntoFile(); +massesIntoFile(); +outputToScreen()

AminoAcidMassesFloat
Attributes: -masses : mass_t; -modified : bool = false
Operations: +setMass(); +getMass() : mass_t; +getModified() : bool

ModPeptide
Attributes: -peptide : Peptide = null; -modifications : vector<Modification*> = null; -mass : mass_t = 0
Operations: +setPeptide(); +addPepMod(); +setModifications(); +setMass(); +getPeptide() : Peptide; +getModifications() : vector<Modification*>; +getModificationAt() : Modification; +getMass() : mass_t

Enzyme
Attributes: -name : string = null; -cut_at : string = null; -cut_cterm : bool = 1; -no_cut : string = null
Operations: +getName(); +isCutCterm(); +getCutAt(); +getNoCut()

Sequence
Attributes: -seq : string = ""; -name : string = ""; -counter : int = 0; -incremented : bool = false
Operations: +setSeq(); +setName(); +setCounter(); +setIncremented(); +incrementCounter(); +getSeq(); +getName(); +getCounter(); +getIncremented()

Peptide
Attributes: -begin : int = null; -end : int = null; -mass : mass_t = 0; -missedCleavage : int = 0; -protein : Sequence = null
Operations: +setPepLoc(); +setMissedCleave(); +setMass(); +setProtein(); +getMass(); +getMissedCleave(); +getBegin(); +getEnd(); +getProtein()

3.2 Modifications

All modifications incorporated are listed in the appendix in section 6.2. ProtDigest reads the file 'modifications.txt' and stores the modifications as a vector of instances of Modification. New modifications may be added to 'modifications.txt' (see documentation). Modifications are divided into fixed and variable modifications. Fixed modifications are present at every modifiable residue while variable modifications may be present or not. Peptides that are variably modified are instances of ModPeptide while peptides without variable modifications are instances of Peptide. Every ModPeptide object is derived from a peptide without variable modifications, with the additional information of its modifications and its new mass. Peptide objects do not store any modifications as they simply carry all fixed modifications that 'fit'.

3.2.1 Fixed modifications:

For those modifications that are independent of the position of the amino acid, the unmodified amino acid mass is simply replaced by the modified mass.
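The replacement of residue masses by fixed modifications can be sketched as follows. This is a minimal illustration and not the ProtDigest code: the function names, the small mass table (monoisotopic residue masses for five amino acids only) and the use of carbamidomethylation of Cys (+57.02146 Da, the product of alkylation with iodoacetamide) as the fixed modification are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>

// Monoisotopic mass of water in Da, added once per peptide for the termini.
const double kWaterMass = 18.01056;

// Hypothetical mass table: monoisotopic residue masses in Da (subset only).
std::map<char, double> defaultResidueMasses() {
    std::map<char, double> masses;
    masses['A'] = 71.03711;
    masses['C'] = 103.00919;
    masses['D'] = 115.02694;
    masses['G'] = 57.02146;
    masses['K'] = 128.09496;
    return masses;
}

// A position-independent fixed modification overwrites the table entry,
// so every occurrence of the residue is counted with the modified mass.
void applyFixedModification(std::map<char, double>& masses,
                            char residue, double delta_mass) {
    masses.at(residue) += delta_mass;
}

// Peptide mass = sum of residue masses + one water.
// Throws std::out_of_range for residues missing from the table.
double peptideMass(const std::string& peptide,
                   const std::map<char, double>& masses) {
    double mass = kWaterMass;
    for (char residue : peptide) {
        mass += masses.at(residue);
    }
    return mass;
}
```

Because the table is changed before any peptide mass is computed, the number of peptides stays the same and no per-peptide bookkeeping is needed; a peptide such as ACCK simply comes out 2 × 57.02146 Da heavier.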
N- and C-terminal modifications are added after all peptides have been created; should any conflicts arise, priority is given to those modifications that are not position-specific. The molecular weight of the modification is simply added to the precalculated mass of the unmodified peptide. In this model a residue can only be modified once, although multiple modifications are possible in reality in very few cases, e.g. modification of both amino groups of an N-terminal lysine is imaginable. This exception would unreasonably complicate the model and is therefore ruled out. Fixed modifications therefore do not increase the number of peptides.

3.2.2 Variable modifications:
Variable modifications are more complicated than fixed ones. If a peptide contains two potential modification sites, it can be modified once, twice or not at all. If it is modified once, there are two possible positions for the modification, yielding two peptides of the same molecular weight. On the one hand, the position of the modification is not important for protein identification using peptide mass fingerprinting (PMF), as this approach is based on the molecular weight of the intact digest peptides. On the other hand, if MS/MS identification is intended, the positions must be taken into account because different permutations give rise to differing MS/MS fragmentation spectra. As ProtDigest only models digestion and not fragmentation, only the distinct combinations of modifications are needed, not their positional permutations. For variable modifications the increase in the number of peptides depends on the position of the modification and on the frequency of the residue that is modified. Fig. 3.2 demonstrates the recursive computation of variable modifications with an example peptide.
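The recursion described above can be sketched as follows. This is a minimal illustration under stated assumptions, not ProtDigest's actual code: VarMod and ModResult are hypothetical stand-ins, the two functions are free functions that only borrow the names createModPeptides and modPeptideRec from the class diagram, and the mass shifts passed in by the caller are assumed values. Each residue is either left unmodified or receives one fitting modification, and positional permutations of an already stored combination are discarded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for ProtDigest's Modification class.
struct VarMod {
    std::string name;   // e.g. "Acetyl (K)"
    char residue;       // residue the modification applies to
    double delta;       // mass shift relative to the unmodified residue (assumed value)
};

// Hypothetical result type: number of recursion leaves and the stored combinations.
struct ModResult {
    std::size_t leaves = 0;                              // peptides reached by the recursion
    std::map<std::vector<std::string>, double> stored;   // distinct modification set -> mass
};

// Visit residue `pos`: leave it unmodified or apply one fitting modification,
// then recurse into the remaining residues.
void modPeptideRec(const std::string& pep, const std::vector<VarMod>& mods,
                   std::size_t pos, std::vector<std::string>& applied,
                   double mass, ModResult& out) {
    if (pos == pep.size()) {
        ++out.leaves;
        if (applied.empty()) return;         // the unmodified peptide is not a ModPeptide
        std::vector<std::string> key = applied;
        std::sort(key.begin(), key.end());   // ignore positions: same set, same key
        out.stored.emplace(key, mass);       // a positional permutation is not stored again
        return;
    }
    modPeptideRec(pep, mods, pos + 1, applied, mass, out);   // residue stays unmodified
    for (const VarMod& m : mods) {
        if (m.residue != pep[pos]) continue;
        applied.push_back(m.name);
        modPeptideRec(pep, mods, pos + 1, applied, mass + m.delta, out);
        applied.pop_back();
    }
}

ModResult createModPeptides(const std::string& pep, double unmodifiedMass,
                            const std::vector<VarMod>& mods) {
    assert(!pep.empty());
    ModResult out;
    std::vector<std::string> applied;
    modPeptideRec(pep, mods, 0, applied, unmodifiedMass, out);
    return out;
}
```

Keying the stored peptides by the sorted list of modification names is one simple way to drop permutations; it also means that no modification positions have to be kept.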
[Recursion tree: starting from A, the peptide AKDK is extended one residue at a time; at every modifiable residue the tree branches into 'with' and 'without' each fitting modification. The leaves are marked as wanted peptides or waste peptides; a second panel shows the result without intermediate steps. Legend: K* = Acetyl (K), D* = Methyl ester (D), D° = Sodiated (D); AKDK is the unmodified peptide.]

Fig. 3.2: Recursive computation given peptide AKDK and variable modifications Acetyl (K), Methyl ester (D) and Sodiated (D)

3.3 Cleaving Enzymes

All enzymes incorporated are listed in the appendix in section 7.1. ProtDigest reads the file 'enzymes.txt' and stores the enzymes as a vector of instances of Enzyme. As with modifications, new enzymes may be added to 'enzymes.txt' (see documentation). Although cleavage specificity can be considerably more complex (as described in chapter 1.1.1), the model applied here only takes into account positions P1 and P1' (Fig. 1.1) for the sake of execution speed.

3.3.1 Missed Cleavages

There is no upper limit for the maximum number of missed cleavages that can be specified. Missed cleavage sites as well as variable modifications increase the number of peptides. For a fixed number of missed cleavages the resulting number of peptides is predictable if the number of peptides without missed cleavage sites is known. Missed cleavages alone cause an approximately linear increase in the number of peptides:

\[ p_u = (u + 1)\, p_0 - n \sum_{i=0}^{u} i \]

u = maximum number of missed cleavages allowed
p_u = number of peptides if u missed cleavages are allowed
p_0 = number of peptides without missed cleavages
n = number of sequences digested

The number of peptides without missed cleavages depends on the number of sequences and their lengths and on the relative frequency of cleavage sites.
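A minimal sketch of the cleavage model and of the peptide count is given below. It assumes an enzyme that cuts C-terminally of P1 unless a blocking residue occupies P1'; EnzymeRule, cleave and expectedPeptides are hypothetical names rather than ProtDigest's own, and the closed form only holds if every sequence yields more than u fragments.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for ProtDigest's Enzyme (C-terminal cutters only).
struct EnzymeRule {
    std::string cut_at;   // residues in position P1 after which the bond is cleaved
    std::string no_cut;   // residues in position P1' that block the cleavage
};

// Returns inclusive (begin, end) positions of all peptides with at most
// maxMissed missed cleavages; the amino acid sequence itself is not copied.
std::vector<std::pair<std::size_t, std::size_t>>
cleave(const std::string& seq, const EnzymeRule& e, std::size_t maxMissed) {
    assert(!seq.empty());
    // Step 1: start positions of the fragments without missed cleavages.
    std::vector<std::size_t> starts{0};
    for (std::size_t i = 0; i + 1 < seq.size(); ++i) {
        bool p1 = e.cut_at.find(seq[i]) != std::string::npos;
        bool blocked = e.no_cut.find(seq[i + 1]) != std::string::npos;
        if (p1 && !blocked) starts.push_back(i + 1);
    }
    // Step 2: join each fragment with up to maxMissed following fragments.
    std::vector<std::pair<std::size_t, std::size_t>> peptides;
    for (std::size_t missed = 0; missed <= maxMissed; ++missed) {
        for (std::size_t f = 0; f + missed < starts.size(); ++f) {
            std::size_t last = f + missed;
            std::size_t end = (last + 1 < starts.size()) ? starts[last + 1] - 1
                                                         : seq.size() - 1;
            peptides.emplace_back(starts[f], end);
        }
    }
    return peptides;
}

// Closed form of p_u, using sum_{i=0}^{u} i = u (u + 1) / 2.
std::size_t expectedPeptides(std::size_t p0, std::size_t n, std::size_t u) {
    return (u + 1) * p0 - n * (u * (u + 1) / 2);
}
```

With the rule {"KR", "P"}, the made-up sequence AKRPGKA is cut into AK, RPGK and A, because the bond between R and P is blocked.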
3.4 Storage

Effort was put into keeping the amount of stored data low, which is why instances of Peptide only store the beginning and ending positions on the protein sequence instead of the complete amino acid sequence. Fixed modifications are not stored either, as they are applied to every possible residue and can be retraced. For the same reason, peptides with the same modifications but in different positions are not stored, which makes it unnecessary to store the positions of modifications, as was originally intended at an earlier stage of the implementation. The question of which data type should be used for the mass values was solved by a simple typedef that allows switching between doubles and floats. Using integers was also considered because of their presumed speed and easier handling in comparison to floating point numbers, e.g. when comparing mass values, but this was not realized as the difference in speed was found to be negligible.

3.5 Mass Range

All peptides that exceed the specified maximum mass are deleted before the computation of the variably modified peptides. On the one hand this is done in order to save time and space. On the other hand most mass spectrometric analyzers have an upper mass limit, which makes the computation of peptides that cannot be analyzed useless.

3.6 Database Searching

The database search implemented is based on counting the number of peptides that match the masses in an experimental MALDI spectrum for a given mass tolerance in parts per million (ppm). In order to quickly find the peptides, which are stored in C++ vectors as shown in the class diagram and which are still in memory (on-the-fly search), an interpolation search is performed in the sorted vectors of Peptides and ModPeptides. The next step would be to assign a score to the peptides found in order to be able to discriminate between random and significant matches and to rank candidate proteins accordingly.
Unfortunately, there was no time left to implement a scoring method. An issue that arose with database searching was the question of what should be done with proteins containing an 'X', which represents any amino acid. Two options are implemented:
Option 1: Treat 'X' as any other amino acid, using an averaged amino acid mass of 111 Da.
Option 2: Treat 'X' as a cleavage site and delete all peptides containing an 'X' after cleavage.
The first option risks mapping experimental masses to peptides containing an 'X', and as no amino acid has a molecular weight of 111 Da such a match can only be wrong. With the second option, fragments containing an 'X' are omitted and a wrong cleavage might be performed, but not cleaving at 'X' would add to the loss of information, since the whole peptide around the 'X' would have to be discarded.

3.7 Sequence Diagram

When the program is started, the files 'enzymes.txt' and 'modifications.txt' are read and stored as Enzyme and Modification objects. After receiving the input parameters for the digestion (see section 4.1.3 or documentation for examples), a user-specified file is parsed and an instance of Sequence_Set is created with the data received. The protein sequences are stored as Sequence objects (n = number of proteins) and an instance of AminoAcidMassesFloat is created from which the masses of the amino acids can be obtained. The function Sequence_Set::doCleave(bool x_cut, bool sort_by_mass) performs the cleavage of all sequences, using either option 2 (x_cut = true) or option 1 (x_cut = false) described in section 3.6 and optionally sorting the peptides by increasing mass (sort_by_mass = true). First the amino acid masses are modified according to the fixed position-independent modifications. Then p_u Peptide objects, including those with missed cleavages, are created and stored in the vector peptides (see section 3.3.1). Peptides outside the specified mass range are deleted. The function Sequence_Set::appendTermMods(..) further modifies the peptides with N- and C-terminal modifications.
If variable modifications have been specified by the user, x instances of ModPeptide are created by calling the function Sequence_Set::createModPeptides(..), which recursively computes the variably modified peptides and stores them in the vector mod_peptides. Output is either written to a file or displayed on the screen (see section 4.1). A search with experimental PMF data may now be performed on the digestion data still stored in memory. Sequence_Set::doSearch(int error,...) maps the search masses to the peptides contained in peptides and to the modified peptides in mod_peptides, taking into account a user-specified error tolerance of error ppm. The top 50 hits are written to a file, as can be seen in the appendix (section 7.3). All objects created are deleted before the program ends.

[Sequence diagram, order of calls: Begin program; readEnzymes() creates the Enzyme objects; readModFile() creates the Modification objects; input for digestion; readFasta() creates the Sequence_Set, the Sequence objects and AminoAcidMassesFloat. DIGESTION: doCleave() with modifyAminoAcidMasses(), creation of the Peptide objects, addMissedCleave(), applyMassRange(), appendTermMods(), createModPeptides() creating the ModPeptide objects, outputIntoFile(). DB SEARCH: input for search, readMassList(), doSearch() with interpolSearch() and interpolModSearch(), resultsToFile(); End program.]

Fig. 3.3: Sequence Diagram

4 Results and Discussion

The diagrams contained in this chapter were developed using Microsoft Excel [www.microsoft.com/office/excel]. The plots in section 4.2 were produced with Matlab [www.mathworks.com].

4.1 Output Examples

Output can either be written to a file by using the function Sequence_Set::outputIntoFile(string filename, int precision) or displayed on the screen by using Sequence_Set::outputToScreen(int precision), or both if both functions are called. A third function, Sequence_Set::massesIntoFile(string filename, int precision), writes all peptide masses into a file as a plain list. This was used, for example, to import the mass values into Matlab.
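How a plain mass list with a given precision might be written can be sketched as below. The functions writeMasses and massesToString are hypothetical and are not ProtDigest's massesIntoFile(); the sketch also assumes that precision means the number of significant digits, which is consistent with values such as 4701.22 and 951.3821 in the screen output of section 4.1.1 but is not confirmed by the text.

```cpp
#include <cassert>
#include <iomanip>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

typedef double mass_t;   // a typedef of this kind selects float or double (section 3.4)

// Writes one mass per line. `precision` is interpreted as the number of
// significant digits (assumption inferred from the example output).
void writeMasses(std::ostream& out, const std::vector<mass_t>& masses, int precision) {
    assert(precision > 0);
    out << std::setprecision(precision);
    for (mass_t m : masses) out << m << '\n';
}

// Convenience wrapper that returns the listing as a string.
std::string massesToString(const std::vector<mass_t>& masses, int precision) {
    std::ostringstream buffer;
    writeMasses(buffer, masses, precision);
    return buffer.str();
}
```

Because writeMasses takes any std::ostream, the same function could write to a std::ofstream for file output or to std::cout for screen output.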
4.1.1 Screen Output

To give an example, the following output was obtained by digesting trypsin taken from the SwissProt database with trypsin as the cleaving enzyme (autoproteolysis). The maximum missed cleavage parameter was set to 0, propionamide (Cys) is a fixed modification and phosphorylation (Tyr) is a variable modification.

  >sp|P00761|TRYP_PIG Trypsin precursor (EC 3.4.21.4) - Sus scrofa (Pig).
  FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGE
  HNIDVLEGNEQFIAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATVSLPRSCAAAG
  TECLISGWGNTKSSGSSYPSLLQCLKPVLSDSSCKSSYPGQITGNMICVGFLEGGKDSCQ
  GDSGGPVVCNGQLQGIVSWGYGCAQKNKPGVYTKVCNYVNWIQQTIAAN

  0-7       0   951.3821   FPTDDDDK
  8-50      0   4701.22    IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK
                4781.187   Phospho (Y)
                4861.153   Phospho (Y) Phospho (Y)
                4941.119   Phospho (Y) Phospho (Y) Phospho (Y)
  51-52     0   261.1437   SR
  53-56     0   514.3227   IQVR
  57-75     0   2096.054   LGEHNIDVLEGNEQFIAAK
  76-95     0   2282.173   IITHPNFNGNTLDNDIMLIK
  96-105    0   1044.556   LSSPATLNSR
  106-113   0   841.5021   VATVSLPR
  114-131   0   1909.866   SCAAAGTECLISGWGNTK
  132-154   0   2527.23    SSGSSYPSLLQCLKPVLSDSSCK
                2607.196   Phospho (Y)
  155-175   0   2228.061   SSYPGQITGNMICVGFLEGGK
                2308.027   Phospho (Y)
  176-205   0   3225.428   DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK
                3305.394   Phospho (Y)
  206-213   0   905.4971   NKPGVYTK
                985.4634   Phospho (Y)
  214-228   0   1806.872   VCNYVNWIQQTIAAN
                1886.839   Phospho (Y)

The first column contains the beginning and ending positions of the peptide on the protein sequence (inclusive). Column two is the number of missed cleavages. Next is the calculated molecular weight including fixed modifications. The last column contains the amino acid sequence. If a peptide has variable modifications, the sequence is not repeated and only the modified mass and the modifications are listed. For another example see section 7.3.1.

4.1.2 File Output

If more than one sequence is digested, for example a database, it is more appropriate to write the output to a file that can be parsed afterwards. The parameters used are listed at the top.
The total number of sequences, peptides and residues, the average protein and peptide length and the amino acid composition are also listed. For an example see section 7.3.2.

4.1.3 Digestion and variable modification of an example peptide

The following data shows the 'digestion' of the peptide used as an example in section 3.2.2 (Fig. 3.2) in order to demonstrate the process of computing variable modifications and to show the setup and navigation of ProtDigest (see documentation for more detailed instructions). '*' denotes a ModPeptide being found, '°' a ModPeptide being deleted. For the list of modifications see section 7.2.

  > Enter name of fastafile containing sequence(s) to be cleaved: modpep
  Cleaving substances:
  [1]Trypsin [2]Arg-C [3]Asp-N [4]Chymotrypsin(FYW) [5]Chymotrypsin(FYWML) [6]Formic_acid
  [7]Lys-C [8]Lys-C/P [9]PepsinA [10]CNBr [11]Tryp-CNBr [12]TrypChymo
  [13]Trypsin/P [14]V8-DE [15]V8-E [16]no cut
  Choose cleaving substance by entering number: 16
  Enter number of maximum missed cleavages: 0
  Modifications: To see every possible modification press 1, to see a selection press 2
  to not show any press 3: 3
  Enter fixed modifications (example: 1,5,20 or 0 for none): 0
  Enter variable modifications: 1,19,43
  For monoisotopic masses press 1, for average masses press 2: 1
  Enter minimum mass displayed (in Da): 1
  Enter maximum mass displayed: 100000
  How should 'X' in amino acid sequence be treated?
  [1] as cleavage site + kick out peptides containing 'X'
  [2] use averaged amino acid mass + treat as normal amino acid
  2
  Sort peptides by masses? n
  reading file......
  cleaving...
  0 peptides not within specified mass range.
  Number of residues: 4
  Variably modifying 1 peptides...
  *******°**°**°*° <-- '*' denotes a ModPeptide found, '°' found ModPeptide not stored
  Number of Sequences = 1
  Number of Peptides without variable modifications = 1
  Number of variably modified Peptides = 8
  >mod_peptide example
  AKDK
  0-3   0   460.265   AKDK
            558.301   Acetyl (K) Methyl ester (D) Acetyl (K)
            516.291   Acetyl (K) Methyl ester (D)
            566.268   Acetyl (K) Sodiated (D) Acetyl (K)
            524.257   Acetyl (K) Sodiated (D)
            544.286   Acetyl (K) Acetyl (K)
            502.275   Acetyl (K)
            474.28    Methyl ester (D)
            482.246   Sodiated (D)
  Search? N

As demonstrated, the recursion computes twelve peptides, of which four are not stored: three are positional permutations of an already stored peptide and would only yield further peptides of the same molecular weight, and one is the unmodified peptide.

4.2 Mass Distribution Analysis

In order to analyze the distribution of peptide masses and lengths as a function of the cleaving substance, the SwissProt Saccharomycetes database was digested using different enzymes.

4.2.1 Peptide Length Distribution

The peptide length distributions in Fig. 4.1 were obtained by digesting with trypsin, chymotrypsin and cyanogen bromide (CNBr), respectively, without missed cleavages. Chymotrypsin cleaves after F, W, Y, L and M unless followed by P, while trypsin cleaves after K and R unless followed by P. CNBr cleaves after M, which explains the relatively high abundance of peptides of length one, as most proteins start with an M, which is encoded by the start codon AUG. For trypsin and chymotrypsin the number of peptides noticeably decreases with increasing peptide length. Chymotrypsin in particular tends to produce high numbers of short peptides. According to the amino acid composition of the digested database in table 4.2, approximately 20% of the peptide bonds are cleavage sites for chymotrypsin, which explains the high abundance of short peptides. Trypsin cleaves approximately 11% of the bonds, whereas for CNBr only 2% of the bonds are cleavable.
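These percentages can be reproduced from the Saccharomycetes amino acid frequencies listed in section 4.6 by summing the frequencies of the residues after which each substance cleaves. The sums below neglect the 'unless followed by P' restriction, so for trypsin and chymotrypsin they are slight overestimates:

```latex
\begin{align*}
\text{Chymotrypsin (F, W, Y, L, M):}\quad & 4.4748 + 1.0684 + 3.4025 + 9.4554 + 2.0794 = 20.4805\,\% \\
\text{Trypsin (K, R):}\quad & 7.1977 + 4.3729 = 11.5706\,\% \\
\text{CNBr (M):}\quad & 2.0794\,\%
\end{align*}
```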
[Plot 'Peptide length distribution': number of peptides (0 to 140000) over peptide length (1 to 20) for chymotrypsin, trypsin and CNBr.]

Fig. 4.1: Peptide length distributions computed from digestion of SwissProt Saccharomycetes

4.2.2 Peptide Mass Distribution

Large numbers of short, i.e. lightweight, peptides raise the number of random matches and therefore do not contribute to identifying a protein. Often they are sorted out before a database search is performed. Longer peptides of higher molecular weight are more characteristic of the protein they are derived from and thus are more significant. Table 4.1 compares four enzymes in regard to the number of peptides within a mass range of 500 to 3000 Da, which is characteristic of PMF analysis. Tryptic digestion with one missed cleavage has the highest coverage of peptides within the mass range, explaining its predominant use in PMF experiments.

  Enzyme         Missed cleavages   <500 Da [%]   >3000 Da [%]   Within range [%]
  Trypsin        0                  38.86         5.09           56.05
  Trypsin        1                  23.03         12.08          76.65
  CNBr           0                  16.73         51.02          32.25
  CNBr           1                  9.08          67.53          23.39
  Chymotrypsin   0                  57.46         0.62           41.92
  Chymotrypsin   1                  36.42         1.79           61.79
  Formic acid    0                  23.50         19.46          57.04
  Formic acid    1                  13.78         34.27          51.95

Table 4.1: Relative number of peptides [%] within the specified mass range

Fig. 4.2 shows the mass distribution resulting from digestion with trypsin and CNBr, respectively, with two missed cleavages allowed. Methionine accounts for the large green peak at ~149 Da. The large blue peaks at ~146 and ~174 Da are single lysines and arginines produced when one tryptic cleavage site is directly followed by another. As already observed from the peptide length distribution, the number of peptides resulting from tryptic digestion strongly decreases with increasing mass. CNBr produces relatively few peptides almost equally covering the mass range shown.
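The peak positions quoted above follow from the way a peptide mass is composed: the sum of the residue masses plus one water molecule for the two termini. The sketch below uses standard monoisotopic residue masses, which are an assumption here and are not taken from ProtDigest's aa_masses; residueMass and peptideMass are hypothetical helper names, and fixed modifications are not applied.

```cpp
#include <cassert>
#include <map>
#include <string>

// Standard monoisotopic residue masses in Da (assumed values, not ProtDigest's table).
double residueMass(char aa) {
    static const std::map<char, double> masses = {
        {'G', 57.02146},  {'A', 71.03711},  {'S', 87.03203},  {'P', 97.05276},
        {'V', 99.06841},  {'T', 101.04768}, {'C', 103.00919}, {'L', 113.08406},
        {'I', 113.08406}, {'N', 114.04293}, {'D', 115.02694}, {'Q', 128.05858},
        {'K', 128.09496}, {'E', 129.04259}, {'M', 131.04049}, {'H', 137.05891},
        {'F', 147.06841}, {'R', 156.10111}, {'Y', 163.06333}, {'W', 186.07931}};
    auto it = masses.find(aa);
    assert(it != masses.end());   // sketch only: B, X and Z are not handled
    return it->second;
}

// Peptide mass = sum of the residue masses + one water (H and OH at the termini).
double peptideMass(const std::string& peptide) {
    const double water = 18.01056;
    double mass = water;
    for (char aa : peptide) mass += residueMass(aa);
    return mass;
}
```

With these values a single lysine weighs about 146.1 Da, a single arginine about 174.1 Da and a single methionine about 149.1 Da, and the example peptide AKDK from section 4.1.3 comes out at about 460.265 Da.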
A periodicity of 14 to 15 Da is clearly visible, which has also been observed in [38] when plotting the number of atomic compositions of peptides over the molecular weight. The oscillation is visible up to a mass of 1300 Da and seems to be independent of the enzyme and database chosen. This was concluded from digesting the SwissProt Arabidopsis database with chymotrypsin and formic acid, which gave rise to the same periodicity (data shown in section 7.4). As every peptide is a composition of amino acids, certain mass values in the lower mass range are more likely to appear than others, simply based on the number of possible amino acid combinations.

Fig. 4.2: Distribution of monoisotopic masses

4.2.3 Peptide Mass Clustering

If a smaller mass range is extracted (Fig. 4.3), an interesting characteristic of the peptide mass distribution can be seen: the clustering of peptide masses. This is another result of the atomic composition of proteins, as they are solely made up of C, H, N, O and S, which all have near-integer masses. The integer mass of an atom is called its nominal mass. The formation of clusters is a result of the limited number of combinations of the five atom types for a given nominal mass. Depending on the atomic composition of the modification, even modified peptides follow this rule.

Fig. 4.3: Mass clustering of peptides between 995 and 1005 Da (derived from tryptic digestion of Saccharomycetes)

The relation of the centroid of a cluster to the nominal mass is:

centroid mass = nominal mass * 1.000478

This relation was derived from plotting the monoisotopic masses over the nominal masses and calculating the slope using linear regression (data not shown). It is similar to the relation obtained by Wool and Smilansky in [32]. In Fig. 4.4 the differences between monoisotopic and nominal masses are plotted over the nominal masses in a range of 0 to 5000 Da.
A linear regression yields the following equation:

y = 0.00047811 * x - 0.000017786

x = nominal mass
y = mass difference between monoisotopic and nominal mass

For masses larger than ~2091 Da the centroid mass is therefore more than 1 Da away from its nominal mass, i.e. the centroid is found in the next integer mass interval.

Fig. 4.4: Difference between monoisotopic and nominal masses

The phenomenon of peptide mass clustering can, for example, be used for calibration of mass spectra [32].

4.3 Modifications and Missed Cleavages

As already described, missed cleavage sites and variable modifications can significantly increase the number of peptides, while there is no great computational expense associated with fixed modifications. An example is shown in figure 4.5, which visualizes the data in table 4.2 derived from the digestion of human titin (SwissProt Accession Number Q8WZ42), which has a sequence length of 34350 amino acids. The sequence was digested with trypsin allowing 0, 1 and 2 missed cleavages, respectively. No upper limit for the number of variable modifications per peptide was specified. Three different combinations of variable modifications were chosen: (a) a position-independent modification, (b) two modifications independent of the amino acid, modifying the C- and N-terminus of all peptides, and (c) the combination of (a) with (b).
  Missed      Type of        Number of unmodified   Total number of peptides   Total number of peptides
  cleavages   modification   peptides               (without permutations)     computed by recursion
  0           (a)            4197                   5196                       5421
  0           (b)            4197                   16313                      16313
  0           (c)            4197                   19991                      20833
  1           (a)            8393                   11390                      12599
  1           (b)            8393                   33097                      33097
  1           (c)            8393                   44449                      49083
  2           (a)            12588                  18581                      22535
  2           (b)            12588                  49877                      49877
  2           (c)            12588                  72895                      88209

Table 4.2: The influence of missed cleavages and variable modifications on the number of peptides when there is no upper limit for the number of modifications per peptide. (a) Phosphorylation (Y), (b) Carbamyl (N-term) and Methyl ester (C-term), (c) Carbamyl (N-term) and Methyl ester (C-term) and Phosphorylation (Y)

Fig. 4.5: The influence of missed cleavages and variable modifications on the number of peptides

As shown by these data, missed cleavages cause a linear increase in the number of unmodified peptides. For modification set (b) the number of peptides grows linearly with the number of unmodified peptides, by a factor approaching 4: each of the two modifications can be applied only once to each peptide, resulting in at most four times as many peptides if combined (unmodified, only N-terminally modified, only C-terminally modified, both N-terminally and C-terminally modified). Phosphorylation of tyrosine residues alone does not lead to a drastically high number of peptides, but with an increasing number of missed cleavages, and therefore increasing peptide length, the increase in the number of peptides is more than linear. The relative abundance of tyrosine in this example protein is only 2.9%, which is the reason for the relatively low increase. The combination of (a) with (b) yields the largest increase in the number of peptides computed, demonstrating the need for a fixed maximum number of modifications allowed per peptide. If none is used, the computation of variably modified peptides can eventually exceed the available memory.
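The growth behaviour described above can be written down as upper bounds, sketched here under the assumptions that each residue carries at most one modification and that no modified peptide is removed afterwards. The symbols are introduced for illustration only: t_j is the number of tyrosines in unmodified peptide j, p_u the number of unmodified peptides, N_distinct the total without permutations and N_recursion the total computed by recursion, both including the unmodified peptides.

```latex
\begin{gather*}
\text{(a)}\quad N_{\text{distinct}} \le \sum_{j=1}^{p_u} (t_j + 1), \qquad
N_{\text{recursion}} \le \sum_{j=1}^{p_u} 2^{t_j} \\
\text{(b)}\quad N_{\text{distinct}} = N_{\text{recursion}} \le 4\, p_u \\
\text{(c)}\quad N_{\text{distinct}} \le 4 \sum_{j=1}^{p_u} (t_j + 1), \qquad
N_{\text{recursion}} \le 4 \sum_{j=1}^{p_u} 2^{t_j}
\end{gather*}
```

For set (b) without missed cleavages the bound is 4 * 4197 = 16788, compared with the 16313 peptides listed in table 4.2. The exponential term 2^{t_j} in the recursion counts is what makes longer peptides, and hence missed cleavages, so expensive.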
4.5 Comparing Data Types

To find out whether there is a significant rounding error when using floats in comparison to doubles, the molecular weight of titin was computed first using doubles and then using floats: The double value computed was 3813839.61068 Da whereas switching to floats yielded a mass of 3812936 Da. This is a difference of almost 4 Da strongly suggesting the use of doubles.

4.6 Database Analysis

Three databases, Saccharomycetes, Arabidopsis and Drosophila, were analyzed for their amino acid composition in order to see how often 'X' appears. The data is shown in table 4.2:

  Amino Acid [%]            Saccharomycetes   Arabidopsis thaliana   Drosophila
  A                         5.7775            6.3047                 7.5214
  B                         0.0001            0.0000                 0.0000
  C                         1.2716            1.8247                 1.8119
  D                         5.8365            5.4591                 5.2087
  E                         6.4850            6.7806                 6.1630
  F                         4.4748            4.2826                 3.7728
  G                         5.2803            6.3886                 6.4347
  H                         2.1166            2.2755                 2.6619
  I                         6.5216            5.3325                 5.1338
  K                         7.1977            6.4048                 5.6657
  L                         9.4554            9.4885                 9.0767
  M                         2.0794            2.4505                 2.4699
  N                         5.9645            4.4057                 4.7881
  P                         4.3593            4.8021                 5.2059
  Q                         3.9073            3.4700                 5.1609
  R                         4.3729            5.3983                 5.2936
  S                         8.7707            8.9881                 7.9858
  T                         5.8923            5.1106                 5.5651
  V                         5.7648            6.7118                 5.9770
  W                         1.0684            1.2639                 1.0445
  X                         0.0008            0.0000                 0.0039
  Y                         3.4025            2.8577                 3.0547
  Z                         0.0000            0.0000                 0.0000
  Number of sequences       5904              2913                   2593
  Number of residues        2847397           11319693               1344529
  Average protein length    482.283           429.929                518.523

Table 4.2: Composition analysis of three example databases

In the Arabidopsis database X does not appear at all, the Saccharomycetes database contains rare occurrences of X, and the Drosophila database shows the highest relative abundance with 0.0039% Xs. The number is still very small, which suggests simply deleting all peptides containing Xs, as the loss of information will probably be low. With this option, which is not yet implemented in ProtDigest, no possibly wrong peptides pointing to an erroneous protein identification would be stored.
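A composition analysis of this kind amounts to counting residue letters. The following is a minimal sketch under the assumption that the sequences have already been read into strings; Composition and analyse are hypothetical names and not part of ProtDigest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the statistics of one database.
struct Composition {
    std::array<std::size_t, 26> counts{};   // index 0 = 'A', ..., 25 = 'Z'
    std::size_t residues = 0;
    std::size_t sequences = 0;

    // Relative frequency of one residue letter in percent.
    double percent(char aa) const {
        assert(aa >= 'A' && aa <= 'Z');
        return residues == 0 ? 0.0 : 100.0 * counts[aa - 'A'] / residues;
    }
    double averageLength() const {
        return sequences == 0 ? 0.0 : static_cast<double>(residues) / sequences;
    }
};

Composition analyse(const std::vector<std::string>& sequences) {
    Composition c;
    for (const std::string& s : sequences) {
        ++c.sequences;
        for (char aa : s) {
            if (aa < 'A' || aa > 'Z') continue;   // skip anything that is not a residue letter
            ++c.counts[aa - 'A'];
            ++c.residues;
        }
    }
    return c;
}
```

Calling percent('X') on the result gives the value of interest here, the share of unknown residues in the database.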
4.7 Database Search

The MIPS Arabidopsis database [http://mips.gsf.de] containing 26639 sequences was searched with 184 peak lists. The search was performed with one missed cleavage allowed and without any modifications, once with a peptide mass tolerance of 100 ppm (parts per million) and once with 50 ppm. The results are shown in Fig. 4.6. The x-axis requires further explanation: Let the number of matching peptide masses be called the score of a protein. Position 1 means that the correct protein had the highest score found, i.e. no other protein had more matching peptide masses. So if all sequences had the same score, say 1, the correct one is in position 1. Position 2 means that the correct protein had the second-best score, again not considering the number of proteins having the same or a better score. If a protein was 'not found', this means that it was not listed in the top 50 hits, which can be due to the fact that more than 50 proteins had the same top score. To add more meaning to the results, for all proteins in position 1 it was checked whether another protein had the same (top) score.

Fig. 4.6: Database search results.

These results demonstrate the high capability of PMF to identify proteins. Reducing the error tolerance from 100 to 50 ppm decreased the number of random matches. Without considering any modifications, in 89% of the 184 peak lists the correct protein was ranked first, 72% being the unique top scorer. Even without applying a scoring system such as ChemScore [34] or the MOWSE score [35], PMF at least allows the number of candidates to be reduced significantly by simply counting the number of matches.

5 Conclusions

When developing a model for protein digestion, the complex biological background had to be simplified in order to be capable of digesting complete databases within a reasonable amount of time.
The enzyme model employed relies on the assumption that cleavage only depends on the two residues adjacent to the cleaved peptide bond. In contrast, other enzyme models take a much more sophisticated approach and even assign a probability to the cleavage site, such as the one used by PeptideCutter, a tool specialized in the prediction of potential cleavage sites [33]. The modification model could be more complex as well, for example by taking into account known signal sequences such as N-X-S/T for N-glycosylation. For the reason given above, this was not realized. In silico protein digestion has also been shown to have some computational limitations concerning memory capacity, due to variable modifications. A maximum number of variable modifications per peptide is indispensable, and it is generally advised to use variable modifications sparingly. A suggestion to avoid runtime and memory problems when digesting complete databases is to first search the database with as few variable modifications as possible and afterwards perform a digestion with more modifications on a subset of candidate proteins. A refinement for ProtDigest could be to individually allow some variable modifications to be present more often than others, for example allowing four oxidations per peptide, but only one methylation. In any case, it is questionable whether using a high number of variable modifications is reasonable when a database search is performed. They not only cause an increase in runtime and memory, but also raise the level of random matches, simply because there are more mass values the experimental MS data can map to. Here the need for a scoring method able to discriminate between significant and random matches becomes visible. There are several different approaches, such as ChemScore [34], which assigns a score based on the chemical properties of a peptide, or the MOWSE score [35], which is used by Mascot [36], probably the most widely used protein identification tool.
The MOWSE score assigns a statistical weight to each individual peptide match based on the probability of a peptide belonging to a protein of a certain molecular weight, which is empirically determined during the database preprocessing. The scoring method used by ProFound [40] is based on the a posteriori probabilities (Bayes probabilities) of the experimental masses belonging to a certain protein. ProtDigest only includes the first part of PMF protein identification, the peptide matching stage. A scoring stage has not yet been implemented. Another improvement to its functionality could be the ability to further fragment the peptides to produce MS/MS ion series and thereby also incorporate the usage of MS/MS data.

6 Glossary

Amino Acids: [figure]

Peptide Bond: A peptide bond is a chemical bond formed between two molecules when the carboxyl group of one molecule reacts with the amino group of the other molecule, releasing a molecule of water (H2O). This is a dehydration synthesis reaction, and usually occurs between amino acids. The resulting C-N bond is called a peptide bond, and the resulting molecule is called an amide. Polypeptides and proteins are chains of amino acids held together by peptide bonds. The C-N bond has a partial double bond character (with the nitrogen atom attaining a partial positive charge and the oxygen atom a partial negative charge) and the molecule can normally not rotate around this bond. The whole arrangement of the four C, O, N, H atoms as well as the two attached carbons in a peptide bond is planar [41].

Protein Structure: Proteins are amino acid chains that fold into unique 3-dimensional structures. The shape into which a protein naturally folds is known as its native state, which is determined by its sequence of amino acids.
Biochemists refer to four distinct aspects of a protein's structure:
Primary structure: the amino acid sequence
Secondary structure: highly patterned sub-structures (alpha helix and beta sheet) or segments of chain that assume no stable shape. Secondary structures are locally defined, meaning that there can be many different secondary motifs present in one single protein molecule
Tertiary structure: the overall shape of a single protein molecule; the spatial relationship of the secondary structural motifs to one another
Quaternary structure: the shape or structure that results from the union of more than one protein molecule, usually called protein subunits in this context, which function as part of the larger assembly or protein complex [41].

Monoisotopic and Average Mass: Isotopes are atoms of a chemical element whose nuclei have the same atomic number, but different atomic weights. The atomic number corresponds to the number of protons in an atom. Thus, isotopes of a particular element contain the same number of protons. The difference in atomic weights results from differences in the number of neutrons in the atomic nuclei [41]. The monoisotopic mass is calculated using the mass of the most abundant natural isotope of each constituent element. An average mass is calculated using the weighted average of all its natural isotopes.
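As a worked example of the two definitions, the masses of a water molecule (H2O) can be computed from standard atomic masses. The atomic values used below are common reference values and are an assumption here, as they are not listed in this document:

```latex
\begin{align*}
m_{\text{mono}}(\mathrm{H_2O}) &= 2 \cdot 1.007825 + 15.994915 = 18.010565\ \text{Da} \\
m_{\text{avg}}(\mathrm{H_2O}) &= 2 \cdot 1.00794 + 15.9994 = 18.01528\ \text{Da}
\end{align*}
```

The difference of about 0.005 Da for a single water molecule grows with the number of atoms, which is why the two kinds of mass differ noticeably for peptides.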
7 Appendix

7.1 Enzymes

  No.    Name                  Cleaves   Restriction   C-/N-term
  [1]    Trypsin               KR        P             Cterm
  [2]    Arg-C                 R         P             Cterm
  [3]    Asp-N                 DB                      Nterm
  [4]    Chymotrypsin(FYW)     FYW       P             Cterm
  [5]    Chymotrypsin(FYWML)   FYWML     P             Cterm
  [6]    Formic_acid           D                       Cterm
  [7]    Lys-C                 K         P             Cterm
  [8]    Lys-C/P               K                       Cterm
  [9]    PepsinA               FL                      Cterm
  [10]   CNBr                  M                       Cterm
  [11]   Tryp-CNBr             KRM       P             Cterm
  [12]   TrypChymo             FYWKR     P             Cterm
  [13]   Trypsin/P             KR                      Cterm
  [14]   V8-DE                 BDEZ      P             Cterm
  [15]   V8-E                  EZ        P             Cterm
  [16]   no cut                -         -             -

7.2 Modifications

  No.    Name                     Residue   Pos     Monoisotopic Mass   Average Mass
  [1]    Acetyl (K)               K         any     170.106             170.211
  [2]    Acetyl (N-term)          any       Nterm   43.0184             43.045
  [3]    Amide (C-term)           any       Cterm   16.0187             16.022
  [4]    Biotinylated (K)         K         any     354.173             354.467
  [5]    Biotinylated (Nterm)     any       Nterm   227.085             227.301
  [6]    Carbamidomethyl (C)      C         any     160.031             160.191
  [7]    Carbamyl (K)             K         any     171.101             171.199
  [8]    Carbamyl (N-term)        any       Nterm   44.0136             44.033
  [9]    Carboxymethyl (C)        C         any     161.015             161.176
  [10]   Deamidation (N)          N         any     115.027             115.089
  [11]   Deamidation (Q)          Q         any     129.043             129.116
  [12]   Gigi_ICATd0 (C)          C         any     589.26              589.764
  [13]   Gigi_ICATd8 (C)          C         any     597.311             597.814
  [14]   HSe (C-term M)           M         Cterm   -12.9901            -13.08
  [15]   HSe lactone (Cterm M)    M         Cterm   -31.0006            -31.096
  [16]   ICAT_heavy               C         any     553.284             553.761
  [17]   ICAT_light               C         any     545.234             545.711
  [18]   Methyl ester (Cterm)     any       Cterm   31.0184             31.034
  [19]   Methyl ester (D)         D         any     129.043             129.116
  [20]   Methyl ester (E)         E         any     143.058             143.142
  [21]   N-Acetyl (Protein)       prot      Nterm   43.0184             43.045
  [22]   N-Formyl (Protein)       prot      Nterm   29.0028             29.018
  [23]   NIPCAM (C)               C         any     202.078             202.271
  [24]   O18 (C-term)             any       Cterm   19.007              19.007
  [25]   Oxidation (H)            H         any     153.054             153.14
  [26]   Oxidation (W)            W         any     202.074             202.212
  [27]   Oxidation (M)            M         any     147.035             147.191
  [28]   PEO Biotin (C)           C         any     517.203             517.658
  [29]   Phospho (T)              T         any     181.014             181.085
  [30]   Phospho (S)              S         any     166.998             167.058
  [31]   Phospho (Y)              Y         any     243.03              243.156
  [32]   PhosphoNL (S)            S         any     69.0215             69.063
  [33]   PhosphoNL (T)            T         any     83.0371             83.09
  [34]   Propionamide (C)         C         any     174.046             174.218
  [35]   Pyridyl (K)              K         any     247.132             247.297
  [36]   Pyridyl (Nterm)          any       Nterm   120.045             120.131
  [37]   Pyro-cmC (Nterm camC)    C         Nterm   -16.0187            -16.022
  [38]   Pyro-glu (Nterm E)       E         Nterm   -17.0027            -17.007
  [39]   Pyro-glu (Nterm Q)       Q         Nterm   -16.0187            -16.022
  [40]   SMA (K)                  K         any     255.158             255.317
  [41]   SMA (Nterm)              any       Nterm   128.071             128.151

  No.    Name                                     Residue
  [42]   Sodiated (Cterm)                         any
  [43]   Sodiated (D)                             D
  [44]   Sodiated (E)                             E
  [45]   S-pyridylethyl (C)                       C
  [46]   Sulphone (M)                             M
  [47]   Citrullination                           R
  [48]   Methylation (C)                          C
  [49]   Methylation (K)                          K
  [50]   Methylation (R)                          R
  [51]   Methylation (H)                          H
  [52]   Methylation (N)                          N
  [53]   Methylation (Q)                          Q
  [54]   Methylation (Nterm A)                    A
  [55]   Hydroxylation (P)                        P
  [56]   Hydroxylation (K)                        K
  [57]   Hydroxylation (D)                        D
  [58]   Hydroxylation (N)                        N
  [59]   di-methylation (C)                       C
  [60]   di-methylation (K)                       K
  [61]   di-methylation (R)                       R
  [62]   di-methylation (H)                       H
  [63]   di-methylation (D)                       D
  [64]   di-methylation (E)                       E
  [65]   di-methylation (N)                       N
  [66]   di-methylation (Q)                       Q
  [67]   di-methylation (Nterm A)                 A
  [68]   tri-methylation (C)                      C
  [69]   tri-methylation (K)                      K
  [70]   tri-methylation (R)                      R
  [71]   tri-methylation (H)                      H
  [72]   tri-methylation (D)                      D
  [73]   tri-methylation (E)                      E
  [74]   tri-methylation (N)                      N
  [75]   tri-methylation (Q)                      Q
  [76]   tri-methylation (Nterm A)                A
  [77]   Gamma-carboxylation (D)                  D
  [78]   Gamma-carboxylation (E)                  E
  [79]   Beta-methylthiolation                    D
  [80]   Sulfation                                Y
  [81]   Phosphorylation (H)                      H
  [82]   Phosphorylation (C)                      C
  [83]   Phosphorylation (D)                      D
  [84]   C-Mannosylation                          W
  [85]   Glycation (N)                            N
  [86]   Glycation (T)                            T
  [87]   Glycation (K)                            K
  [88]   Glycation (Nterm)                        any
  [89]   Lipoyl                                   K
  [90]   O-GlcNac (S)                             S
  [91]   O-GlcNac (T)                             T
  [92]   O-GlcNac (N)                             N
  [93]   Farnesylation                            C
  [94]   Myristoylation (res)                     K
  [95]   Myristoylation (Nterm)                   G
  [96]   Pyridoxal phosphate                      K
  [97]   Palmitoylation (C)                       C
  [98]   Palmitoylation (S)                       S
  [99]   Palmitoylation (T)                       T
  [100]  Palmitoylation (K)                       K
  [101]  Geranyl-geranyl                          C
  [102]  Phosphopantetheine                       S
  [103]  Flavin adenine dinucleotide (FAD) (C)    C
  [104]  Flavin adenine dinucleotide (FAD) (H)    H
  [100]  N-acyl diglyceride cys (tripalmitate)    C

  Pos ([42] onwards): Cterm any any any any any any any any any any any
Nterm any any any any any any any any any any any any Nterm any any any any any any any any Nterm any any any any any any any any any any any Nterm any any any any any any Nterm any any any any any any any any any Nterm 38.9847 137.009 151.025 208.067 163.03 157.085 117.025 142.111 170.117 151.075 128.059 142.074 15.0235 113.048 144.09 131.022 130.038 131.04 156.126 184.132 165.09 143.058 157.074 142.074 156.09 29.0391 145.056 170.142 198.148 179.106 157.074 171.09 156.09 170.106 43.0548 159.017 173.032 161.015 243.02 217.025 182.976 194.993 348.132 276.096 263.101 290.148 163.061 316.128 290.111 304.127 317.122 307.197 338.293 211.206 357.109 341.239 325.262 339.277 366.325 375.26 426.11 886.151 920.2 789.734 38.989 137.071 151.098 208.278 163.191 157.173 117.166 142.201 170.215 151.168 128.131 142.158 15.035 113.116 144.173 131.088 130.103 131.193 156.228 184.242 165.195 143.143 157.17 142.158 156.185 29.062 145.219 170.254 198.268 179.221 157.169 171.195 156.184 170.211 43.088 159.099 173.125 161.176 243.234 217.121 183.119 195.069 348.355 276.246 263.247 290.316 163.15 316.476 290.272 304.299 317.298 307.494 338.533 211.367 357.303 341.551 325.49 339.517 366.586 375.612 426.401 886.68 920.682 790.324 7.3 Output examples 7.3.1 Comparison with PeptideMass To give a short demonstration of the correctness of ProtDigest, the output of tryptic auto proteolysis with possible methionine sulfoxide was compared with the results of PeptideMass: >sp|P00761|TRYP_PIG Trypsin precursor (EC 3.4.21.4) - Sus scrofa (Pig). 
0-7      0   951.38213   FPTDDDDK
8-50     0   4488.1089   IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK
53-56    0   514.32272   IQVR
57-75    0   2096.05378  LGEHNIDVLEGNEQFIAAK
76-95    0   2282.17287  IITHPNFNGNTLDNDIMLIK
             2298.16779  Oxidation (M)
96-105   0   1044.55636  LSSPATLNSR
106-113  0   841.50213   VATVSLPR
114-131  0   1767.79198  SCAAAGTECLISGWGNTK
132-154  0   2385.15558  SSGSSYPSLLQCLKPVLSDSSCK
155-175  0   2157.02343  SSYPGQITGNMICVGFLEGGK
             2173.01835  Oxidation (M)
176-205  0   3012.31639  DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK
206-213  0   905.49705   NKPGVYTK
214-228  0   1735.83518  VCNYVNWIQQTIAAN

7.3.2 File output

ENZYME:Trypsin
MAXMISSEDCLEAVAGE:1
MASSES:monoisotopic
MASSRANGE:1-50000
FIXMOD: Propionamide (C)
VARMOD: Oxidation (M)
Number of sequences = 2913
Number of peptides = 328870
Number of residues = 1127636
Average protein length = 387.1047
Average peptide length = 6.85764
Amino acid composition:
A  77864   6.905065
B  0       0
C  19699   1.746929
D  60082   5.328138
E  71415   6.333161
F  49887   4.424034
G  78176   6.932734
H  24755   2.195301
I  63335   5.616617
K  70226   6.227719
L  107257  9.511669
M  28530   2.530072
N  48521   4.302896
P  53531   4.747188
Q  40300   3.573848
R  58850   5.218883
S  91874   8.147487
T  58953   5.228017
V  78069   6.923245
W  13662   1.211561
X  3       0.0002660433
Y  32647   2.895172
Z  0       0

>sp|Q9LU15|AHP4_ARATH Histidine-containing phosphotransfer protein 4 - Arabidopsis thaliana (Mouse-ear cress).
MTNIGKCMQGYLDEQFMELEELQDDANPNFVEEVSALYFKDSARLINNIDQALERGSFDFNRLDSYMHQFKGSSTSIGASKVK
AECTTFREYCRAGNAEGCLRTFQQLKKEHSTLRKKLEHYFQASQ
0-5      0   662.342   MTNIGK
0-5      0   678.337   Oxidation (M)
6-39     0   4114.82   CMQGYLDEQFMELEELQDDANPNFVEEVSALYFK
6-39     0   4146.81   Oxidation (M) Oxidation (M)
6-39     0   4130.81   Oxidation (M)
40-43    0   447.208   DSAR
... ... ...
109-115  1   869.472   KEHSTLR
110-116  1   869.472   EHSTLRK
116-117  1   274.2     KK
117-126  1   1249.61   KLEHYFQASQ

>sp|Q8L9T7|AHP5_ARATH Histidine-containing phosphotransfer protein 5 - Arabidopsis thaliana (Mouse-ear cress).
MNTIVVAQLQRQFQDYIVSLYQQGFLDNQFSELRKLQDEGTPDFVAEVVSLFFDDCSKLINTMSISLERPDNVDFKQVDSGVH
QLKGSSSSVGARRVKNVCISFKECCDVQNREGCLRCLQQVDYEYKMLKTKLQDLFNLEKQILQAGGTIPQVDIN
0-10     0   1271.7    MNTIVVAQLQR
0-10     0   1287.7    Oxidation (M)
11-33    0   2837.37   QFQDYIVSLYQQGFLDNQFSELR
34-34    0   146.106   K
35-57    0   2631.2    LQDEGTPDFVAEVVSLFFDDCSK
... ...

7.4 Additional results

7.4.2 Mass Distributions

Fig.6.1: Oscillation of peptide mass distribution (chymotrypsin)
Fig.6.2: Oscillation of peptide mass distribution (formic acid)

8 References

[1] Electrophoresis. 1995 Jul.
    Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium.
    Wasinger VC, Cordwell SJ, Cerpa-Poljak A, Yan JX, Gooley AA, Wilkins MR, Duncan MW, Harris R, Williams KL, Humphery-Smith I.
[2] Electrophoresis. 1998 Aug;19(11):1941-9.
    Towards an automated approach for protein identification in proteome projects.
    Traini M, Gooley AA, Ou K, Wilkins MR, Tonella L, Sanchez JC, Hochstrasser DF, Williams KL.
[3] Electrophoresis. 1998 May;19(6):893-900.
    Database searching using mass spectrometry data.
[4] Rapid Communications in Mass Spectrometry, Volume 17, Issue 10, Pages 1044-1050.
    On-column digestion of proteins in aqueous-organic solvents.
    Gordon W. Slysz, David C. Schriemer.
[5] Int J Mass Spectrom & Ion Proc 1987.
    Matrix-assisted ultraviolet Laser desorption of non-volatile compounds.
    Karas M, Bachmann D, Bahr U, Hillenkamp F.
[6] Science 1989, 246:64-71.
    Electrospray Ionization for Mass Spectrometry of Large Biomolecules.
    Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM.
[7] Mass Spectrometry Reviews, Volume 23, Issue 5, Pages 368-389.
    Investigation of intact protein complexes by mass spectrometry.
    Albert J. R. Heck, Robert H. H. van den Heuvel.
[8] Pept Res. 1994 May-Jun;7(3):115-24.
    Protein identification by peptide mass fingerprinting.
    Cottrell JS.
[9] Rapid Commun Mass Spectrom. 2003;17(16):1825-34.
    Matrix-assisted laser desorption/ionization directed nano-electrospray ionization tandem mass spectrometric analysis for protein identification.
    Kast J, Parker CE, van der Drift K, Dial JM, Milgram SL, Wilm M, Howell M, Borchers CH.
[10] Analyst. 1996 Jul;121(7):65R-76R.
    Future prospects for the analysis of complex biological systems using micro-column liquid chromatography-electrospray tandem mass spectrometry.
    Yates JR 3rd, McCormack AL, Link AJ, Schieltz D, Eng J, Hays L.
[11] Mass Spectrometry: Principles and Applications.
    Edmond De Hoffmann, Vincent Stroobant.
[12] Anal Biochem. 1997 Aug 1;250(2):153-6.
    Identification of proteins by matrix-assisted laser desorption ionization-mass spectrometry following in-gel digestion in low-salt, nonvolatile buffer and simplified peptide recovery.
    Fountoulakis M, Langen H.
[13] Anal Chem. 2003 Aug 15;75(16):4081-6.
    Web and database software for identification of intact proteins using "top down" mass spectrometry.
    Taylor GK, Kim YB, Forbes AJ, Meng F, McCarthy R, Kelleher NL.
[14] J. Mass Spectrom. 2002;37:663-675.
    Top down protein characterization via tandem mass spectrometry.
    Gavin E. Reid and Scott A. McLuckey.
[15] Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W340-5.
    ProSight PTM: an integrated environment for protein identification and characterization by top-down mass spectrometry.
    Richard D. LeDuc, Gregory K. Taylor, Yong-Bin Kim, Thomas E. Januszyk, Lee H. Bynum, Joseph V. Sola, John S. Garavelli, and Neil L. Kelleher.
[16] Anal Biochem. 1994 Oct;222(1):44-8.
    Acrylamide in Polyacrylamide Gels Can Modify Proteins during Electrophoresis.
    Bonaventura C., Bonaventura J., Stevens R. and Millington D.
[17] Rapid Commun Mass Spectrom. 1999;13(18):1818-27.
    Probing the reactivity of S-S bridges to acrylamide in some proteins under high pH conditions by matrix-assisted laser desorption/ionisation.
    Bordini E, Hamdan M, Righetti PG.
[18] Mass Spectrometry Reviews, Volume 20, Issue 3, Pages 121-141.
    Monitoring 2-D gel-induced modifications of proteins by MALDI-TOF mass spectrometry.
    Mahmoud Hamdan, Marina Galvani, Pier Giorgio Righetti.
[19] Electrophoresis 2001, 22, 1633-1644.
    Protein alkylation by acrylamide, its N-substituted derivatives and cross-linkers and its relevance to proteomics: A matrix assisted laser desorption/ionization-time of flight-mass spectrometry study.
    Ellenia Bordini, Marina Galvani, Pier Giorgio Righetti.
[20] Anal Biochem. 1990 Apr;186(1):116-20.
    Formylated peptides from cyanogen bromide digests identified by fast atom bombardment mass spectrometry.
    Goodlett DR, Armstrong FB, Creech RJ, van Breemen RB.
[21] Rapid Commun. Mass Spectrom. 13, 1143-1151 (1999).
    Investigation of Some Covalent and Noncovalent Complexes by Matrix-assisted Laser Desorption/Ionization Time-of-flight and Electrospray Mass Spectrometry.
    Ellenia Bordini and Mahmoud Hamdan.
[22] Proteomics. 2003 Nov;3(11):2208-20.
    Approaches for the quantification of protein concentration ratios.
    Moritz B, Meyer HE.
[23] IUBMB Life. 2002 Aug;54(2):51-7.
    Identification of modified proteins by mass spectrometry.
    Sickmann A, Mreyen M, Meyer HE.
[24] J Biol Chem. 2001 Mar 30;276(13):10570-5. Epub 2001 Jan 09.
    Alternative O-glycosylation/O-phosphorylation of serine-16 in murine estrogen receptor beta: post-translational regulation of turnover and transactivation activity.
    Cheng X, Hart GW.
[25] Nat Biotechnol. 1999 Oct;17(10):994-9.
    Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
    Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R.
[26] J Am Soc Mass Spectrom. 2002 Jan;13(1):22-39.
    Scoring methods in MALDI peptide mass fingerprinting: ChemScore, and the ChemApplex program.
    Parker KC.
[27] Arch. Biochem. Biophys. 161, 665-670 (1974).
    Surface Tension of Amino Acid Solutions: A Hydrophobicity Scale of the Amino Acid Residues.
    Bull, Henry B. and Breese, Keith.
[28] Anal. Biochem. 124, 201-208 (1982).
    The Isolation of Peptides by High-Performance Liquid Chromatography Using Predicted Elution Positions.
    Browne, C. A., Bennett, H. P. J. and Solomon, S.
[29] Proc. Natl. Acad. Sci. USA 77; 1632 (1980).
    Meek.
[30] Anal. Biochem. 182; 319-326 (1989).
    Gill, S.C. and von Hippel, P.H.
[31] Electrophoresis. 1999 Dec;20(18):3527-34.
    Modeling peptide mass fingerprinting data using the atomic composition of peptides.
    Gay S, Binz PA, Hochstrasser DF, Appel RD.
[32] Proteomics. 2002 Oct;2(10):1365-73.
    Precalibration of matrix-assisted laser desorption/ionization-time of flight spectra for peptide mass fingerprinting.
    Wool A, Smilansky Z.
[33] http://www.expasy.org/tools/peptidecutter
[34] J Am Soc Mass Spectrom. 2002 Jan;13(1):22-39.
    Scoring methods in MALDI peptide mass fingerprinting: ChemScore, and the ChemApplex program.
    Parker KC.
[35] Current Biology (1993), vol 3, 327-332.
    Rapid Identification of Proteins by Peptide-Mass Fingerprinting.
    D.J.C. Pappin, P. Hojrup and A.J. Bleasby.
[36] http://www.matrixscience.com
[37] http://www.expasy.org
[38] http://www.ncbi.nlm.nih.gov
[39] http://pir.georgetown.edu/
[40] http://irserver.rockefeller.edu/profound_bin/WebProFound.exe
[41] http://www.wikipedia.org

Figures:

Fig.1.3: Principle of MALDI
    From Script_10_proteomics.pdf, lecture 'algorithmic bioinformatics' 02/03 by Prof. Reinert
    http://www.inf.fu-berlin.de/inst/ag-bio/file.php?p=ROOT/Teaching/Lectures/WS0304/101,algbio_v.lecture.htm
Fig.1.6: (a) MS/MS spectrum
    From http://arthritis-research.com/content/2/5/407/figure/F4
    (b) Ion series
    From http://www.matrixscience.com