1 INTRODUCTION

Contents
1 Introduction
1.1 Protein Sequence Digestion
1.2 Mass Spectrometry
1.3 Protein Analysis using Mass Spectrometry
1.4 Modifications
2 Applications
2.1 Use Case Diagram
2.2 Web-based Protein Digesters
3 Modeling and Implementation
3.1 Class Diagram
3.2 Modifications
3.3 Cleaving Enzymes
3.4 Storage
3.5 Mass Range
3.6 Database Searching
3.7 Sequence Diagram
4 Results and Discussion
4.1 Output Examples
4.2 Mass Distribution Analysis
4.3 Modifications and Missed Cleavages
4.5 Comparing Data Types
4.6 Database Analysis
4.7 Database Search
5 Conclusions
6 Glossary
7 Appendix
8 References
1 INTRODUCTION
Since the completion of genome sequencing of several organisms including the human
genome, attention has been directed from genome to proteome analysis. The term proteome
was first introduced by Wilkins et al. in 1995 [1] and denotes the entire set of proteins
expressed by a genome at a given time. Proteins represent the functional aspect of gene
activities in living cells. Proteome analysis, or proteomics, is concerned with protein
identification, determination of the function or functional networks of proteins and
construction of databases storing the acquired knowledge.
A lot of progress has been made in separation and identification of proteins, two-dimensional
gel electrophoresis and mass spectrometry being key techniques [2]. Today the most
commonly used methods for identification of proteins are peptide mass fingerprinting and
MS/MS fragmentation. Both methods are based on enzymatic or chemical digestion of a
purified protein, mass spectrometric measurement of the resulting peptides and comparison with
theoretical masses derived from in silico digestion of the protein sequences in a database [3].
An essential ingredient for high throughput analyses is the development of computer software
that is able to quickly and efficiently analyze and interpret the huge amounts of information
emerging from proteome analysis.
This work is concerned with the implementation of a protein sequence digesting tool that
models the proteolytic digestion of a protein or protein database computing the theoretically
resulting peptides and their corresponding masses given a cleaving enzyme, a maximum
number of missed cleavages and both fixed and variable modifications.
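The role of the maximum number of missed cleavages can be illustrated with a short sketch. It assumes that the fragments of a complete digest are already available in sequence order; the function name withMissedCleavages is hypothetical and not taken from ProtDigest.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given the fragments of a complete digest in sequence
// order, returns every peptide that contains at most `maxMissed` missed
// cleavage sites, i.e. every run of up to maxMissed + 1 adjacent fragments.
std::vector<std::string> withMissedCleavages(
    const std::vector<std::string>& fragments, std::size_t maxMissed) {
    std::vector<std::string> result;
    for (std::size_t i = 0; i < fragments.size(); ++i) {
        std::string joined;
        for (std::size_t j = i; j < fragments.size() && j - i <= maxMissed; ++j) {
            joined += fragments[j];    // extend the peptide by one fragment
            result.push_back(joined);  // j - i missed cleavages so far
        }
    }
    return result;
}
```

For the fragments AK, RPGR and LK and one allowed missed cleavage, this yields AK, AKRPGR, RPGR, RPGRLK and LK.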
First the biological background of protein sequence digestion and proteolytically active
enzymes will be described. The basic principles of mass spectrometry will then be explained
followed by a short description of its applications in proteome analysis. The next section
describes different types of protein modifications. In chapter 2 different protein digesters
available on the web are compared before going into detail with the modeling and
implementation of ProtDigest in chapter 3. The results of this work are demonstrated and
discussed in chapter 4 followed by the conclusions.
1.1 Protein Sequence Digestion
Cleavage of protein sequences is a process frequently encountered in vivo that is also used in
vitro for protein identification and characterization by peptide mass fingerprinting.
Proteolytic enzymes that hydrolyze peptide bonds in amino acid sequences, and thereby
generate peptides or individual amino acids, are called proteases.
Exoproteases remove exactly one residue either from the amino-terminus (aminopeptidase)
or from the carboxy-terminus (carboxypeptidase) resulting in a single amino acid and the
shortened protein sequence.
Endoproteases (or proteinases) cleave at the C- or N-terminal side of specific amino acids
independent of the position in the sequence. They are classified with respect to the cleaving
mechanism. Serine-proteases for example have a serine residue in their catalytic center that
can perform a nucleophilic attack on the carbonyl C-atom of a peptide bond. Other classes are
metallo-, cysteine- and aspartic-proteases.
Dipeptidases hydrolyze the peptide bond within dipeptides.
1.1.1 Cleavage specificity
Most proteases have a preference for a certain amino acid composition at their cleavage site.
The composition is dependent on the catalytic center of the protease which interacts with the
polypeptide chain.
Trypsin for example recognizes the basic amino acids lysine and arginine and cleaves
carboxy-terminally (K or R in position P1 in figure 1.1). Cleavage is restricted if there is
proline in position P1’. Trypsin of higher specificity additionally does not cleave after K in
CKY, DKD, CKH, CKD, KKR nor after R in RRH, RRR, CRK, DRD, RRF, KRR.
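The basic trypsin rule can be written down in a few lines. The following is a minimal sketch under the assumption that only the simple rule applies (cleave after K or R unless proline follows in P1'); the additional exceptions of the higher-specificity model are not covered, and the function name trypticDigest is hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Complete tryptic digest of a protein given in one-letter code.
// Rule applied: cleave C-terminally of K or R (position P1),
// but not if the following residue (position P1') is P.
std::vector<std::string> trypticDigest(const std::string& protein) {
    std::vector<std::string> peptides;
    std::size_t start = 0;  // index of the first residue of the current peptide
    for (std::size_t i = 0; i < protein.size(); ++i) {
        const bool basic = protein[i] == 'K' || protein[i] == 'R';
        const bool lastResidue = (i + 1 == protein.size());
        const bool prolineNext = !lastResidue && protein[i + 1] == 'P';
        if (basic && !lastResidue && !prolineNext) {
            peptides.push_back(protein.substr(start, i - start + 1));
            start = i + 1;
        }
    }
    if (start < protein.size()) {
        peptides.push_back(protein.substr(start));  // C-terminal peptide
    }
    return peptides;
}
```

Applied to the sequence AKRPGRLK, the sketch returns AK, RPGR and LK: the bond after the first R is left intact because proline follows it.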
Cleavage rules can be even more complex. Caspase 2 for example requires the following
composition:
1.1.2 Protein Cleavage in vivo
Proteolytic cleavage is involved in several important processes in vivo including the
following examples:
• Activation of proenzymes (zymogens): Some proteins are expressed as inactive
precursors that are activated by proteolytic cleavage (e.g. trypsinogen → trypsin,
prothrombin → thrombin). This is an efficient mechanism of enzyme activity
regulation, e.g. preventing digestive proteases from attacking gastric cells.
• Digestion of dietary proteins: Dietary proteins are gradually degraded to individual
amino acids in order to be of use to the organism. This process is catalyzed by
digestive proteases. When food reaches the stomach, pepsinogen is secreted by the
gastric mucosa. Hydrochloric acid, also produced by the gastric mucosa, is necessary
for the proteolytic activation of pepsinogen to pepsin and to maintain the optimum
acidity (pH 1-3) for pepsin function. Further degradation of the peptides is catalyzed
by trypsin, chymotrypsin and other proteases and continues in the intestines.
• Degradation of cellular proteins: Malfunctioning proteins or cellular proteins that
are no longer of use to the cell are marked with ubiquitin and then degraded in the
proteasome. Proteases are also found in the lysosome. The caspase family is a family
of cysteine-proteases implicated in programmed cell death (apoptosis).
Active protease  Functions in                                Class              Cleavage specificity
Pepsin           Food digestion                              Aspartic-protease  Broad specificity
Trypsin          Food digestion                              Serine-protease    After Lys and Arg, not before Pro
Chymotrypsin     Food digestion                              Serine-protease    After Tyr, Trp, Phe, also Leu and Met
Elastase         Hydrolysis of elastin (structural protein)  Serine-protease    Mainly after Ala, also Val and Leu
Thrombin         Blood clotting                              Serine-protease    After Arg and Lys
Caspase-3        Apoptosis                                   Cysteine-protease  Between Asp and Gly
Table 1.1: A selection of proteases
1.1.3 Protein Cleavage in vitro
Protein digestion has also become an important technique in vitro for the identification and
characterization of proteins using mass spectrometry.
The most frequently used enzyme is trypsin. Protein digestion can also be performed by
proteolytic chemicals such as cyanogen bromide (CNBr) which cleaves after methionine.
Proteolytic degradation is often performed overnight as it can take several hours depending
on the reaction conditions and on the protease employed. A higher enzyme-to-substrate ratio
speeds up this process but has the disadvantage of increasing the number of autoproteolysis
products if digestion is performed in solution. The use of immobilized enzymes allows an
excess of enzyme to be used without increasing autoproteolysis and can therefore
achieve degradation within minutes [4].
1.2 Mass Spectrometry
Mass spectrometry (MS) allows the determination of the molecular weight of biomolecules.
It has become an important tool in proteome analysis and is preferred to chromatographic,
electrophoretic or ultracentrifugation methods because of its precision. The accuracy
achieved by MS is frequently better than 0.01% of the calculated mass whereas the relative
error of the other methods mentioned ranges between 10 and 100% on average [11].
Fig. 1.2: Schematic illustration of a mass spectrometer (components shown: sample inlet, ion
source, mass analyzer, ion detector, data system, data output and vacuum pumps)
Mass spectrometers basically consist of an inlet for sample introduction (often a gas
chromatograph), an ion source, a mass analyzer, an ion detector and finally a data system to
process the output data and produce a spectrum (see figure 1.2). The ion source produces
gas-phase ions, the mass analyzer separates the ionized analytes according to their
mass-to-charge ratio (m/z-ratio) and the ion detector counts the number of ions for each m/z
value [11].
1.2.1 Ion Sources
There are different types of ion sources producing either analyte anions (negative ion mode)
or cations (positive ion mode). Generating ions is convenient as they can be efficiently
detected and navigated using electric or magnetic fields. The most commonly used ionization
methods to generate protein or peptide ions are matrix-assisted laser desorption/ionization
(MALDI) [5] and electrospray ionization (ESI) [6].
MALDI consists of two steps:
First, the analyte is mixed with a molar excess of small organic molecules, the matrix, which
strongly absorb the laser wavelength. After drying the mixture the molecules to be analyzed
are completely isolated from one another in the matrix.
The second step of the MALDI process involves desorption of portions of the solid sample
because of rapid heating and expansion into the gas phase initiated by pulses of laser light.
This process results in ionized analyte molecules because of proton transfer in the gas phase.
The usual charge in MALDI is +1 (see figure 1.3) [11].
ESI:
The sample is ionized at atmospheric pressure. Highly charged droplets disperse from a
capillary in an electric field, evaporate and are drawn into the vacuum of the analyzer. ESI
generates multiply charged ions. ESI is a soft ionization method that allows for detection of
non-covalent protein complexes as there is only little or no fragmentation of polymer
molecules during ionization [7].
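The difference between the singly charged ions typical of MALDI and the multiply charged ions of ESI shows up directly in the measured m/z value. The following sketch assumes positive ion mode with protons as the only charge carriers; the function name massToCharge and the constant name are illustrative.

```cpp
#include <stdexcept>

// Mass of a proton in Da, the charge carrier assumed in positive ion mode.
constexpr double kProtonMass = 1.007276;

// m/z of an [M + zH]^z+ ion for a neutral molecule of mass `neutralMass` (Da).
double massToCharge(double neutralMass, int charge) {
    if (charge <= 0) {
        throw std::invalid_argument("charge must be a positive integer");
    }
    return (neutralMass + charge * kProtonMass) / charge;
}
```

A peptide with a neutral mass of 1000 Da would thus appear at m/z of about 1001.007 when singly charged and at about 501.007 when doubly charged.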
Other ion sources are electron ionization (EI), chemical ionization (CI), fast atom
bombardment (FAB), field desorption (FD), plasma desorption (PD), laser desorption (LD),
thermospray (TSP) and atmospheric pressure chemical ionization (APCI) [11].
1.2.2 Mass Analyzers
There are also different types of mass analyzers which vary in three main characteristics:
resolution, transmission and mass limit. A high resolution is desirable for a high selectivity
i.e. to be able to distinguish between two molecules of low mass difference. The transmission
is the ratio of ions generated and ions detected and is therefore a measure of sensitivity. The
highest m/z-ratio that can be measured determines the mass limit.
MALDI is mostly coupled to time-of-flight (TOF) analyzers. TOF analyzers do not have an
upper mass limit. The accuracy and the speed of MALDI-TOF MS have made it the most
common instrument for protein identification.
ESI is mostly combined with TOF, quadrupolar or ion trap analyzers which are also attractive
for their relatively low cost compared with magnetic sectors or Fourier transform-MS (FTMS) [11].
1.2.3 Tandem Mass Spectrometry
A tandem mass spectrometer has two analyzers separated by a collision cell. Here the sample
ions collide with an inert gas which results in their fragmentation (collision-induced
dissociation (CID) [11]). Frequently used combinations of analyzers are for example
quadrupole-quadrupole or quadrupole-TOF.
The principle of MS/MS is shown in figure 1.4. A parent ion or precursor ion of a certain
mass is selected from the first analyzer (MS1) and then fragmented in the collision cell
resulting in a spectrum of daughter ions produced by the second analyzer (MS2).
1.3 Protein Analysis using Mass Spectrometry
Because of its speed and sensitivity, MS has emerged as a key technique for the structural
analysis and identification of proteins. It can provide information about posttranslational
modifications as well as protein interactions and can also be used for relative protein
quantification [22].
1.3.1 Protein Identification
There are two main techniques taking a “bottom-up” approach to protein identification using
MS and subsequent sequence database searching:
1) peptide mass fingerprinting (PMF)
2) MS/MS identification
An emerging technique is “top-down” MS, a term introduced by McLafferty and coworkers,
also making use of database searching [13-15].
PMF:
PMF is the analysis method of choice for rapid identification of proteins [8]. The protein to
be analyzed first needs to be separated from its mixture. One- or two-dimensional
polyacrylamide gel electrophoresis is a common method for protein separation. After
excision and decoloration of the gel bands or spots, reduction and alkylation are often
performed in order to prevent oxidation of Cys residues: disulfide bonds are cleaved by
reduction with thiols such as dithiothreitol (DTT), and to prevent their reformation the
cysteine residues are alkylated, e.g. with iodoacetic acid forming S-carboxymethyl
derivatives. The purified
protein is then proteolytically digested in situ (in-gel) generating smaller peptides [12]. The
peptides are extracted, their masses measured by MS, mostly MALDI-TOF MS because of its
speed and simplicity, and then compared with theoretically calculated masses resulting from
the application of the used enzyme's cleavage rules to the protein sequences in a database
(see figure 1.5).
Identification therefore requires the protein to be present in the database. Another
requirement is that the peptides detected originate from the same protein; this can be
compromised by the presence of contaminants, e.g. hair, skin or artifacts of sample handling.
Unambiguous results are not always achieved by PMF as the protein may be heavily
modified (see section 1.4) yielding experimental masses differing from the calculated ones.
As “protein identification correlates directly to the number of detected peptide signals” [9]
PMF may provide ambiguous information if only few peptides are detected.
Performing MS/MS is a way of increasing the level of confidence in such results or obtaining
an identification if none at all was achieved by PMF.
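The comparison step of PMF can be sketched as a tolerance search of experimental masses in a sorted list of theoretical masses. This is only an illustration of the principle, not the scoring used by any particular identification tool; the function name countMatches and the use of a relative tolerance in ppm are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Counts the experimental peptide masses that lie within `tolerancePpm`
// (parts per million, relative to the experimental mass) of at least one
// theoretical mass. `theoretical` must be sorted in ascending order.
std::size_t countMatches(const std::vector<double>& experimental,
                         const std::vector<double>& theoretical,
                         double tolerancePpm) {
    std::size_t matches = 0;
    for (double mass : experimental) {
        const double window = mass * tolerancePpm * 1e-6;
        // First theoretical mass that is not below the lower bound.
        auto it = std::lower_bound(theoretical.begin(), theoretical.end(),
                                   mass - window);
        if (it != theoretical.end() && *it <= mass + window) {
            ++matches;
        }
    }
    return matches;
}
```

The more experimental masses of a spectrum match the in silico digest of one database protein, the more likely that protein is the correct identification, which is why few detected peptides lead to ambiguous results.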
MS/MS protein identification:
This technique is more complex and time consuming than the PMF approach, but it is
capable of high quality identification and also of identifying different proteins in one sample
[10].
Sample preparation is the same as in PMF, but the peptides derived from the digestion are
subjected to tandem MS resulting in peptide fragmentation spectra. These spectra contain
different ion series (a, b, c; x, y, z)[11] named by the site of fragmentation (Fig. 1.5).
Sequence information can be gained from the mass differences between peaks of the same
series being characteristic of an amino acid (Fig. 1.6).
Each MS/MS spectrum can potentially identify one peptide. When multiple spectra point to
different peptides derived from the same protein, this gives rise to a high confidence in the
identification.
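How sequence information follows from mass differences within one ion series can be sketched as follows. The example assumes a list of singly charged peaks that all belong to the same series and are sorted by ascending m/z; the residue masses are standard monoisotopic values, and the function name readLadder is hypothetical. Leucine and isoleucine have identical masses and are both reported as L; whether the result reads from the N- or the C-terminus depends on the ion series.

```cpp
#include <cmath>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Translates the mass differences between consecutive peaks of one ion
// series into amino acid one-letter codes. Differences that match no
// residue within `toleranceDa` are reported as '?'.
std::string readLadder(const std::vector<double>& peaks, double toleranceDa) {
    // Monoisotopic residue masses in Da (L also stands for the isobaric I).
    static const std::vector<std::pair<char, double>> residues = {
        {'G', 57.02146},  {'A', 71.03711},  {'S', 87.03203},  {'P', 97.05276},
        {'V', 99.06841},  {'T', 101.04768}, {'C', 103.00919}, {'L', 113.08406},
        {'N', 114.04293}, {'D', 115.02694}, {'Q', 128.05858}, {'K', 128.09496},
        {'E', 129.04259}, {'M', 131.04049}, {'H', 137.05891}, {'F', 147.06841},
        {'R', 156.10111}, {'Y', 163.06333}, {'W', 186.07931}};
    std::string sequence;
    for (std::size_t i = 1; i < peaks.size(); ++i) {
        const double diff = peaks[i] - peaks[i - 1];
        char best = '?';
        double bestError = toleranceDa;
        for (const auto& residue : residues) {
            const double error = std::fabs(diff - residue.second);
            if (error <= bestError) {  // keep the closest residue in tolerance
                best = residue.first;
                bestError = error;
            }
        }
        sequence += best;
    }
    return sequence;
}
```

For the peaks 175.119, 246.156, 303.178 and 416.262 and a tolerance of 0.01 Da, the differences of 71.037, 57.022 and 113.084 Da are read as A, G and L.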
Top-Down MS:
Top-down proteomics is based on tandem MS [14]. A complete mixture of intact proteins is
introduced to the mass spectrometer producing intact protein ions. Ions of a specific mass are
isolated and fragmented and then subjected to the second analyzer. The intact mass and the
fragmentation data are then compared to a sequence database. This relatively new method
can be used for the identification and localization of post-translational modifications and is
mostly performed with Fourier transform MS. Software for the interpretation of top-down
data is available at https://prosightptm.scs.uiuc.edu/ (ProSight PTM) [15].
1.3.2 Protein Characterization
Besides the identification of unknown proteins MS can be employed for identification and
localization of post-translational modifications (PTMs), protein quantification and detection
of non-covalent complexes and protein interactions.
As already mentioned top-down approaches provide information about PTMs.
ESI MS is soft enough to allow for non-covalent complexes and protein interactions to stay
intact therefore being able to assist in higher structure elucidation [21].
Relative quantification of proteins (e. g. in order to compare tumor cells with normal cells) is
often based on stable isotope labeling (see section 1.4.2).
1.4 Modifications
Protein modifications can be divided into post-translational modifications and artificial
modifications which are again subdivided into accidental modifications and deliberate
modifications.
1. Post-translational modifications
2. Artificial modifications
a) Deliberate modifications
b) Accidental modifications
As shown in figure 1.7, modifications can be position-specific, occurring only at the amino-
or carboxy-terminus of a peptide, or non-position-specific, occurring at a residue independent
of its position in the amino acid sequence.
Amino-terminal and carboxy-terminal modifications, respectively, can either be dependent or
independent of the terminal residue. Protein-N-terminal modifications are only attached to
the first, protein-C-terminal modifications to the last amino acid of the complete protein
sequence and can therefore only be found in the terminal protein fragments produced by
protein digestion.
Non-position-specific modifications modify the side chains of specific amino acids; acidic,
basic and hydroxy-group or sulfur containing residues are the most susceptible sites for
modification because of their high reactivity.
Fig 1.7: Locations of modification sites
1.4.1 Post-translational modifications:
“The analysis of posttranslational modifications is an important task of protein chemistry in
proteome research. (...) It is assumed that modifications such as phosphorylation or
glycosylation exist on every second protein and that they are important for the protein
function.” (Sickmann et al. [23]).
Most proteins are covalently modified after their translation at the ribosome.
Post-translational modifications (PTMs) are essential determinants of protein function and can
have stabilizing effects on protein structure. They play a role in enzyme regulation, protein
targeting and several more important processes in vivo and are therefore of great interest to
proteomics. Because of the resulting mass difference PTMs are a considerable challenge to
protein identification using sequence database searching.
PTMs include glycosylation, acylation, methylation, phosphorylation, sulfation, prenylation,
and formation of selenoproteins.
One specific residue in a protein is usually subject to one type of modification, although it has
been demonstrated that residues can be alternatively modified. Murine estrogen receptor beta
for example can carry an N-acetyl-glucosamine or a phosphoryl-group at Ser16 [24].
Examples:
1. The hydroxyl-groups of Ser, Thr or Tyr can be covalently phosphorylated, a process
catalyzed by a specific category of enzymes called kinases. Reversible phosphorylation is of
particular interest because of its important role in enzyme activity regulation. In many cases
activity or inactivity of an enzyme is controlled by the absence or presence of one or more
phosphoryl-groups inducing a conformational change in the structure of the protein. There
are specialized databases containing information about phosphorylation sites in protein
sequences (e.g. http://phospho.elm.eu.org/).
As phosphorylation brings forth negatively charged peptides (Fig. 1.8), it cannot be detected
using the standard positive ion mode. Phosphorylation causes a mass increase of 79 Da in the
negative ion mode.
2. The attachment of saccharides, glycosylation, can either be N- or O-linked (at Asn or
Ser/Thr, respectively). N-glycosylation takes place in the endoplasmic reticulum during
mRNA translation at the ribosome and is signaled by a certain sequence of amino acids:
Asn-X-Ser/Thr. O-glycosylation is performed in the Golgi apparatus. N-linked oligosaccharides
can be of complex, branched structure while O-linked oligosaccharides are generally shorter,
often containing only one to four sugar molecules.
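Potential N-glycosylation sites can be located by scanning a sequence for the signal motif Asn-X-Ser/Thr. The sketch below additionally excludes proline in position X, which is the commonly used form of the rule but is not stated in the text above; the function name findSequons is hypothetical. A match only indicates a potential site, not that the residue is actually glycosylated.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Returns the 0-based positions of all Asn residues that are part of an
// Asn-X-Ser/Thr motif, where X is any residue except Pro (assumption).
std::vector<std::size_t> findSequons(const std::string& seq) {
    std::vector<std::size_t> sites;
    for (std::size_t i = 0; i + 2 < seq.size(); ++i) {
        const bool hydroxyResidue = seq[i + 2] == 'S' || seq[i + 2] == 'T';
        if (seq[i] == 'N' && seq[i + 1] != 'P' && hydroxyResidue) {
            sites.push_back(i);
        }
    }
    return sites;
}
```

In the sequence ANGSNPTNKT the sketch reports positions 1 and 7; the Asn at position 4 is skipped because proline follows it.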
1.4.2 Artificial modifications:
Artificial modifications can either be deliberately induced or accidental products of sample
preparation and handling. Deliberate labeling of proteins can for example be used for relative
quantification of individual proteins within a mixture.
There are covalently bound as well as non-covalently bound modifications. Non-covalent
complexes are more stable under ESI conditions than under the harsher preparation conditions
of MALDI-TOF, and most non-covalent modifications are not readily detected by MALDI-TOF [21].
a) Examples for deliberate modifications:
1. Reductive alkylation is performed to prevent peptides from forming disulfide bonds
(Fig. 1.9) with other peptides, which would yield masses misleading for peptide mass mapping.
As already mentioned in section 1.3.1, disulfide bridges are deliberately reduced with
reducing agents such as dithiothreitol (DTT) or tris(2-carboxyethyl)phosphine (TCEP).
Subsequent treatment with alkylating substances such as iodoacetic acid, iodoacetamide or
4-vinylpyridine prevents reoxidation by modifying the cysteine residues.
2. Another deliberate modification is the use of isotope-coded affinity tags (ICATs) to be
able to distinguish between two sister peptides of the same protein in protein mixtures
representing different cell states. One mixture is treated with light, the other with heavy
ICATs (deuterated) resulting in masses differing by ~8Da. The ratio of the peak intensities of
two sister peptides in the joint MS spectrum determines the relative quantification of their
parent proteins [25].
b) Examples for accidental modifications:
1. Proteins can be covalently modified by reaction with unpolymerized monomers of
acrylamide in polyacrylamide gels during electrophoresis [16]. The SH-group of Cys in
particular has been shown to be highly reactive towards alkylation, forming
cysteinyl-S-propionamide adducts. At alkaline pH Cys can even be modified when engaged in
disulfide bridges [17]. The addition of one molecule of acrylamide results in an increase of
~71 Da in the molecular weight of a peptide/protein.
2. Another modification commonly encountered in polyacrylamide gel-separated proteins is
oxidation, mostly at Met residues forming methionine sulfoxide (Fig. 1.10). Residual ammonium
persulfate which is used to induce gel polymerization acts as a very reactive oxidant [18].
Oxidation adds ~16 Da to the molecular mass of the protein.
3. The eluant can be another source of accidental modifications. A standard solvent used for
eluting protein spots from polyacrylamide gels contains formic acid which has been shown to
massively formylate Ser and Thr residues (up to ten formylation products observed [19]).
Formic acid is also used during cyanogen bromide digestion of proteins resulting in the same
modification of Ser and Thr [20]. Formylation causes an increase of ~28Da.
4. Even the dye used for staining a gel can affect the m/z-ratio observed in MS analysis.
Although only non-covalently bound by hydrophobic interactions Coomassie can even be
detected by MALDI-TOF MS [21]. It has been observed that up to ten molecules of
Coomassie can aggregate with relatively small polypeptides.
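The mass increases quoted above for accidental covalent modifications can be used to interpret a deviation between an experimental and a theoretical peptide mass. The following sketch uses the nominal values from the text (71, 16 and 28 Da) and considers only a single modification per peptide; the type and function names are hypothetical.

```cpp
#include <cmath>
#include <string>
#include <vector>

// A named mass increase in Da, using the nominal values quoted in the text.
struct MassShift {
    std::string name;
    double deltaDa;
};

// Returns the names of all listed modifications whose mass increase explains
// the difference between `observed` and `theoretical` within `toleranceDa`.
std::vector<std::string> explainShift(double observed, double theoretical,
                                      double toleranceDa) {
    static const std::vector<MassShift> shifts = {
        {"acrylamide adduct", 71.0},
        {"formylation", 28.0},
        {"oxidation", 16.0}};
    std::vector<std::string> candidates;
    const double diff = observed - theoretical;
    for (const MassShift& shift : shifts) {
        if (std::fabs(diff - shift.deltaDa) <= toleranceDa) {
            candidates.push_back(shift.name);
        }
    }
    return candidates;
}
```

An observed mass of 1016.1 Da against a theoretical mass of 1000.0 Da would be attributed to oxidation at a tolerance of 0.5 Da. Multiple modifications on one peptide, such as the repeated formylation mentioned above, would require testing combinations of shifts.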
2 Applications
The variety of proteomics tools distributed on the internet is very broad. A compilation of
tools for protein identification and characterization, primary, secondary and tertiary structure
prediction, sequence similarity searches and alignments, prediction of post-translational
modifications, and translation of DNA sequences into amino acid sequences can be found on
the ExPASy server (www.expasy.org/tools).
2.1 Use Case Diagram
Fig. 2.1: Use cases of in silico protein digestion in proteomics. Elements shown in the
diagram: the use case Protein Digestion with single sequence digestion and database digestion;
the PMF identification tool (input: experimental PMF data; purpose: identification and
characterization); the use case MS/MS Fragmentation with MS/MS fragmentation of each peptide
and MS/MS ion search (input: experimental MS/MS data; purpose: identification and
characterization, localize PTMs, verify PMF identification). The use cases are connected by
«uses» and «extends» relations.
The usual approach to protein identification is a PMF experiment as described in chapter 1.
The experimental data is analyzed and interpreted by a PMF identification tool such as
Mascot [36] which performs matching and scoring in a digested database. The outcome can
be further analyzed manually using a single sequence protein digester. If an identification is
not obtained or leaves doubts behind, MS/MS data can be used to gain further information.
For an MS/MS ion search the peptides resulting from a database digestion need to be
fragmented (see Figure 1.5).
2.2 Web-based Protein Digesters
Different tools for theoretical protein sequence digestion are compared in table 2.1.
MS-Digest is the most complex one, offering the most additional features. PeptideCutter has a
very sophisticated model of cleavage prediction.
PeptideMass (ExPASy), http://www.expasy.org/tools/peptide-mass.html
  Input: Sequence in one-letter-code; or SwissProt [37] ID or accession number (AC)
  Output: HTML table with sequences, masses, artificial modifications, missed cleavages;
    Mw and pI of protein; text file containing masses
  Modifications: Cys derivatives and Met oxidation; inclusion of known PTMs for SwissProt
    sequences possible
  Enzymes: Choice of 16 standard enzymes taking into account positions P2 to P1' (see
    Fig. 1.1); max. 5 missed cleavages
  Other parameters: Either MH+, M or (MH)- masses; either monoisotopic or average; optional
    minimum mass for output peptides
  Additional features: For SwissProt sequences: inclusion of splicing variants, protein
    isoforms and database conflicts possible

PeptideCutter (ExPASy), http://www.expasy.org/tools/peptidecutter
  Input: Sequence in one-letter-code; or SwissProt ID or AC
  Output: HTML map or table of cleavage sites; table of sequences and masses
  Modifications: No modifications incorporated
  Enzymes: 34 enzymes taking into account P4 to P2'; select as many as you want; no missed
    cleavages
  Other parameters: None noteworthy
  Additional features: Sophisticated model for trypsin and chymotrypsin incorporating
    cleavage probability

MS-Digest (ProteinProspector), http://prospector.ucsf.edu
  Input: Sequence without X, B or Z; or database ID, several DBs included, e.g. SwissProt,
    NCBI [38]
  Output: HTML, XML output, can be saved to file; both monoisotopic and average masses;
    protein Mw and pI; several options for output (see other parameters and additional
    features)
  Modifications: Cys derivatives; state of N-/C-terminus; list of considered (= variable)
    modifications; user specified modifications
  Enzymes: 29 enzymes taking into account P1 and P1'; no upper limit for number of missed
    cleavages; user specified enzymes
  Other parameters: mass range; minimum peptide length; amino acids present in output
    peptides
  Additional features: Calculation of ChemScore [26], Bull Breeze indices [27] and HPLC
    indices [28]; incorporation of user specified amino acids (elemental composition), which
    can serve as modifications

PeptideSort (GCG), http://menu.hgmp.mrc.ac.uk/people/gcg/gcghelp/html/unixpeptidesort.html
  Input: Command line program; sequence from file or PIR [39]/SwissProt ID; only one
    sequence
  Output: Text file containing peptide masses, positions on sequence, amino acid
    compositions, pIs of all peptides and the protein
  Modifications: No modifications incorporated
  Enzymes: 22 enzymes taking into account P1 and P1'; either one or all enzymes
  Other parameters: Mincuts, Maxcuts; if all enzymes are selected those that do not cut at
    least mincuts or at most maxcuts times are ignored
  Additional features: HPLC retention [29]; extinction coefficient [30]

Table 2.1: Comparison of four protein digesters available on the internet.
3 Modeling and Implementation
ProtDigest was implemented in C++ and documented with the help of Doxygen
[www.doxygen.org]. It was debugged using the GNU debugger [www.gnu.org], and valgrind
[http://valgrind.kde.org/] was used to find and fix memory leaks. The diagrams shown in this
chapter were created with Microsoft Visio [www.microsoft.com/office/visio].
3.1 Class Diagram
The class diagram is shown in Fig. 3.1. The class Sequence_Set stores the digest of one or
more protein sequences. Each protein to be digested is an instance of Sequence stored in the
vector seqs. After the call of doCleave(), the vector peptides contains all peptides
without variable modifications which are instances of Peptide, while variably modified
peptides are instances of ModPeptide stored in mod_peptides. The cleaving enzyme
employed is an instance of Enzyme. Furthermore a Sequence_Set has vectors for fixed and
variable modifications, fixmod and varmod, and min_mass and max_mass for the mass range
considered. If monoisotopic is true, the monoisotopic masses of the amino acids are stored in
aa_masses and used for the calculation of the peptide masses. Otherwise the average masses
are stored and used.
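The relationship between Peptide and ModPeptide described above can be illustrated with a strongly reduced sketch. It is not the ProtDigest source code: public fields replace the getters and setters, mass_t is assumed to be a double, peptide positions are assumed to be 0-based and inclusive, a modification carries a single mass, and the helpers peptideString() and makeModPeptide() are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using mass_t = double;  // assumption: the actual definition is not shown

struct Sequence {
    std::string name;
    std::string seq;  // protein sequence in one-letter code
};

// A peptide stores only its location in the parent protein, not a copy
// of the sequence.
struct Peptide {
    int begin = 0;  // index of the first residue (assumed 0-based)
    int end = 0;    // index of the last residue (assumed inclusive)
    mass_t mass = 0;
    int missedCleavage = 0;
    const Sequence* protein = nullptr;
};

struct Modification {
    std::string name;
    std::string residues;  // residues that can carry the modification
    mass_t monoMass = 0;   // simplification: one mass per modification
};

// A variably modified peptide refers to the peptide it is derived from
// and adds its modifications and its new mass.
struct ModPeptide {
    const Peptide* peptide = nullptr;
    std::vector<const Modification*> modifications;
    mass_t mass = 0;
};

// Hypothetical helper: extracts the peptide sequence from the parent protein.
std::string peptideString(const Peptide& p) {
    const std::size_t first = static_cast<std::size_t>(p.begin);
    const std::size_t length = static_cast<std::size_t>(p.end - p.begin + 1);
    return p.protein->seq.substr(first, length);
}

// Hypothetical helper: derives a ModPeptide carrying one modification.
ModPeptide makeModPeptide(const Peptide& base, const Modification& mod) {
    ModPeptide result;
    result.peptide = &base;
    result.modifications.push_back(&mod);
    result.mass = base.mass + mod.monoMass;
    return result;
}
```

Storing positions instead of substrings keeps the memory needed per peptide small, which matters when a whole database is digested.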
Fig. 3.1: Class Diagram of ProtDigest. The diagram contains the following classes with their
attributes and operations:

Sequence_Set
  -seqnumber : int = 0
  -seqs : vector<Sequence*> = null
  -fixmod : vector<Modification*> = null
  -varmod : vector<Modification*> = null
  -max_missed_cleave : int = 0
  -enzyme : Enzyme = null
  -min_mass : mass_t = 0
  -max_mass : mass_t = 0
  -peptides : vector<Peptide*> = null
  -mod_peptides : vector<ModPeptide*> = null
  -monoisotopic : bool = true
  -aa_masses : AminoAcidMassesFloat
  Operations: +setSeqnumber(), +addSeq(), +setFixmod(), +setVarmod(),
    +setMaxMissedCleave(), +setEnzyme(), +setMinMass(), +setMaxMass(), +setMonoisotopic(),
    +getSeqnumber(), +getSeqs(), +getFixmod(), +getVarmod(), +getEnzyme(), +getMinMass(),
    +getMaxMass(), +getMonoisotopic(), +getPeptides(), +getModPeptides(), +doCleave(),
    +modifyAminoAcidMasses(), -addMissedCleave(), +appendTermMods(), +createModPeptides(),
    -modPeptideRec(), -getModstring(), +doSearch(), +getSearchMinMax(), +interpolSearch(),
    +interpolSearchMod(), +outputIntoFile(), +massesIntoFile(), +outputToScreen()

Sequence
  -seq : string = ""
  -name : string = ""
  -counter : int = 0
  -incremented : bool = false
  Operations: +setSeq(), +setName(), +setCounter(), +setIncremented(), +incrementCounter(),
    +getSeq(), +getName(), +getCounter(), +getIncremented()

Peptide
  -begin : int = null
  -end : int = null
  -mass : mass_t = 0
  -missedCleavage : int = 0
  -protein : Sequence = null
  Operations: +setPepLoc(), +setMissedCleave(), +setMass(), +setProtein(), +getMass(),
    +getMissedCleave(), +getBegin(), +getEnd(), +getProtein()

ModPeptide
  -peptide : Peptide = null
  -modifications : vector<Modification*> = null
  -mass : mass_t = 0
  Operations: +setPeptide(), +addPepMod(), +setModifications(), +setMass(),
    +getPeptide() : Peptide, +getModifications() : vector<ModPeptide*>,
    +getModificationAt() : Modification, +getMass() : mass_t

Modification
  -name : string = ""
  -residues : string = ""
  -mod_pos : int = 0
  -mono_masses : vector<mass_t> = null
  -avg_masses : vector<mass_t> = null
  Operations: +setName(), +setResidues(), +setModPos(), +setMonoMasses(), +setAvgMasses(),
    +getName(), +getResidues(), +getResidue(), +getModPos(), +getMonoMasses(),
    +getMonoMass(), +getAvgMasses(), +getAvgMass()

Enzyme
  -name : string = null
  -cut_at : string = null
  -cut_cterm : bool = 1
  -no_cut : string = null
  Operations: +getName(), +isCutCterm(), +getCutAt(), +getNoCut()

AminoAcidMassesFloat
  -masses : mass_t
  -modified : bool = false
  Operations: +setMass(), +getMass() : mass_t, +getModified() : bool
3.2 Modifications
All modifications incorporated are listed in the appendix in section 6.2. ProtDigest reads the
file 'modifications.txt' and stores the modifications as a vector of instances of Modification.
New modifications may be added to 'modifications.txt' (see documentation).
Modifications are divided into fixed and variable modifications. Fixed modifications are
present at every modifiable residue while variable modifications may be present or not.
Peptides that carry variable modifications are instances of ModPeptide, while peptides without
variable modifications are instances of Peptide. Every ModPeptide object is derived from a
Peptide and additionally stores its modifications and its new mass.
Peptide objects do not store any modifications, as they simply carry all fixed modifications
that apply to them.
3.2.1 Fixed modifications:
For those modifications that are independent of the position of the amino acid, the
unmodified amino acid mass is simply replaced by the modified mass.
N- and C-terminal modifications are added when all peptides have already been ‘created’,
giving priority to those modifications that are not position-specific if there should arise any
conflicts. The molecular weight of the modification is simply added to the precalculated mass
of the unmodified peptide. In this model a residue can be modified only once, although
multiple modifications of one residue are possible in reality in a few cases, e.g. modification
of both amino groups of an N-terminal lysine. Such exceptions would unreasonably
complicate the model and are therefore ruled out.
Fixed modifications therefore do not increase the number of peptides.
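The replacement of residue masses by fixed modifications can be sketched as follows. This is a minimal illustration, not the ProtDigest code: the function names, the partial mass table and the use of double for mass_t are assumptions. The monoisotopic residue masses and the mass of water are standard values, and the peptide mass is taken as the sum of the residue masses plus one water, which reproduces the masses of SR and IQVR listed in section 4.1.1.

```cpp
#include <cassert>
#include <map>
#include <string>

typedef double mass_t;  // assumption: the typedef from section 3.4 set to double

const mass_t WATER_MONO = 18.01056;  // monoisotopic mass of H2O

// Hypothetical, deliberately incomplete table of monoisotopic residue masses.
std::map<char, mass_t> makeResidueMasses() {
    std::map<char, mass_t> m;
    m['S'] = 87.03203;
    m['R'] = 156.10111;
    m['I'] = 113.08406;
    m['Q'] = 128.05858;
    m['V'] = 99.06841;
    m['C'] = 103.00919;
    return m;
}

// A position-independent fixed modification replaces the residue mass in the table.
void applyFixedMod(std::map<char, mass_t>& masses, char residue, mass_t modified_mass) {
    masses[residue] = modified_mass;
}

// Peptide mass = sum of (possibly modified) residue masses + one water.
mass_t peptideMass(const std::string& seq, const std::map<char, mass_t>& masses) {
    mass_t sum = WATER_MONO;
    for (std::size_t i = 0; i < seq.size(); ++i) {
        std::map<char, mass_t>::const_iterator it = masses.find(seq[i]);
        assert(it != masses.end());  // residue must be present in the table
        sum += it->second;
    }
    return sum;
}
```

Because the table is changed once before cleavage, every peptide computed afterwards carries the fixed modification without any extra per-peptide work, which is why the number of peptides stays the same.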
3.2.2 Variable modifications:
Variable modifications are more complicated than fixed ones. If a peptide contains two potential
modification sites it can either be modified once, twice or not at all. If it is modified once
there are two possible positions for the modification yielding two peptides of the same
molecular weight. On the one hand the position of the modification is not important for
protein identification using peptide mass fingerprinting (PMF) as this approach is based on
the molecular weight of the intact digest peptides. On the other hand, if MS/MS identification
is intended, the positions must be taken into account because different permutations give rise
to differing MS/MS fragmentation spectra. As ProtDigest only models digestion and not
fragmentation, only the distinct combinations of modifications are needed, not their
positional permutations.
For variable modifications the increase in the number of peptides is dependent on the position
of the modification and on the frequency of the residue that is modified. In Fig. 3.2 the
recursive computation of variable modifications with an example peptide is demonstrated.
[Figure 3.2: recursion tree for the example peptide AKDK. Each K branches into "with Acetyl" and "without"; D branches into "with ME" (methyl ester), "sodiated" and "without". The twelve leaves are AK*D*K*, AK*D*K, AK*D°K*, AK*D°K, AK*DK*, AK*DK, AKD*K*, AKD*K, AKD°K*, AKD°K, AKDK* and AKDK (the unmodified peptide). The legend distinguishes wanted peptides, waste peptides and intermediate steps.
K* = Acetyl (K), D* = Methyl ester (D), D° = Sodiated (D)]
Fig.3.2: Recursive computation given peptide AKDK and variable modifications Acetyl
(K), Methyl ester (D) and Sodiated (D)
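The recursion in Fig. 3.2 can be sketched as follows. This is an illustration under assumptions, not the actual implementation: the names VarMod, ModSet, ModResult, enumerateRec and enumerateVariableMods are hypothetical, and permutations are dropped here by collecting the modification names in a position-independent multiset.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

struct VarMod {
    std::string name;  // e.g. "Acetyl (K)"
    char residue;      // residue the modification applies to
};

typedef std::multiset<std::string> ModSet;  // modifications regardless of position

struct ModResult {
    int leaves;               // peptides produced by the recursion
    std::set<ModSet> stored;  // distinct, non-empty modification combinations
};

// Walks the peptide residue by residue; at each residue it follows the branch
// "without" and one branch per applicable variable modification.
void enumerateRec(const std::string& pep, std::size_t pos,
                  const std::vector<VarMod>& mods, ModSet& current, ModResult& out) {
    assert(pos <= pep.size());
    if (pos == pep.size()) {  // leaf: every residue has been decided
        ++out.leaves;
        if (!current.empty()) out.stored.insert(current);  // the set drops permutations
        return;
    }
    enumerateRec(pep, pos + 1, mods, current, out);  // branch "without"
    for (std::size_t i = 0; i < mods.size(); ++i) {
        if (mods[i].residue != pep[pos]) continue;
        ModSet::iterator it = current.insert(mods[i].name);  // branch "with"
        enumerateRec(pep, pos + 1, mods, current, out);
        current.erase(it);                                   // undo before next branch
    }
}

ModResult enumerateVariableMods(const std::string& pep, const std::vector<VarMod>& mods) {
    ModResult out;
    out.leaves = 0;
    ModSet current;
    enumerateRec(pep, 0, mods, current, out);
    return out;
}
```

For AKDK with Acetyl (K), Methyl ester (D) and Sodiated (D) this sketch visits 2 x 3 x 2 = 12 leaves and keeps 8 distinct modified peptides, which agrees with the example run in section 4.1.3.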
3.3 Cleaving Enzymes
All incorporated enzymes are listed in the appendix in section 7.1. ProtDigest reads the file
'enzymes.txt' and stores the enzymes as a vector of instances of Enzyme. As with
modifications, new enzymes may be added to 'enzymes.txt' (see documentation).
Although cleavage specificity can be considerably more complex (as described in chapter 1.1.1), the
model applied here only takes into account positions P1 and P1' (Fig. 1.1) for the sake of
execution speed.
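A cleavage rule that looks only at P1 and P1' can be sketched as follows. The function name cleaveCterm is hypothetical, and the sketch covers only enzymes that cut C-terminally of P1; the parameters cut_at and no_cut mirror the attribute names of the Enzyme class in Fig. 3.1.

```cpp
#include <string>
#include <utility>
#include <vector>

// Returns the inclusive (begin, end) positions of all peptides without missed
// cleavages. A bond is cut after position i if seq[i] (P1) is listed in cut_at
// and seq[i + 1] (P1') is not listed in no_cut.
std::vector<std::pair<int, int> > cleaveCterm(const std::string& seq,
                                              const std::string& cut_at,
                                              const std::string& no_cut) {
    std::vector<std::pair<int, int> > peptides;
    const int n = static_cast<int>(seq.size());
    int begin = 0;
    for (int i = 0; i < n; ++i) {
        const bool last = (i == n - 1);
        const bool p1_matches = cut_at.find(seq[i]) != std::string::npos;
        const bool p1prime_blocks =
            !last && no_cut.find(seq[i + 1]) != std::string::npos;
        if (last || (p1_matches && !p1prime_blocks)) {
            peptides.push_back(std::make_pair(begin, i));  // close current peptide
            begin = i + 1;
        }
    }
    return peptides;
}
```

With cut_at = "KR" and no_cut = "P" (trypsin), the sequence start FPTDDDDKIVGG is split after the lysine at position 7, giving the peptides 0-7 and 8-11.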
3.3.1 Missed Cleavages
There is no upper limit for missed cleavages. Missed cleavage sites as well as variable
modifications increase the number of peptides. For a fixed number of missed cleavages the
resulting number of peptides is predictable, if the number of peptides without missed
cleavage sites is known.
Missed cleavages alone cause an approximately linear increase in the number of peptides:
$$p_u = (u + 1) \cdot p_0 - n \cdot \sum_{i=0}^{u} i$$
u = maximum number of missed cleavages allowed
p_u = number of peptides if u missed cleavages are allowed
p_0 = number of peptides without missed cleavages
n = number of sequences digested
The number of peptides without missed cleavages depends on the number of sequences and
their lengths and on the relative frequency of cleavage sites.
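The count can be checked with a small sketch. Both function names are hypothetical. The closed form assumes that every sequence yields more than u peptides without missed cleavages; the second function counts directly and does not need that assumption.

```cpp
#include <cassert>
#include <vector>

// Closed form: p_u = (u + 1) * p_0 - n * sum_{i=0}^{u} i
long peptideCount(long p0, long n, long u) {
    assert(p0 >= 0 && n >= 0 && u >= 0);
    return (u + 1) * p0 - n * (u * (u + 1) / 2);
}

// Direct count: a sequence with f fragments yields f - i peptides that
// span exactly i missed cleavage sites (if f > i).
long peptideCountByEnumeration(const std::vector<long>& fragments_per_sequence, long u) {
    long total = 0;
    for (std::size_t s = 0; s < fragments_per_sequence.size(); ++s)
        for (long i = 0; i <= u; ++i)
            if (fragments_per_sequence[s] > i) total += fragments_per_sequence[s] - i;
    return total;
}
```

For a single sequence with 4197 peptides (the titin digest in section 4.3), the formula gives 8393 peptides for one and 12588 peptides for two missed cleavages, as listed in the table there.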
3.4 Storage
Care was taken to keep the amount of stored data low, which is why instances of Peptide only
store the beginning and ending positions on the protein sequence instead of the complete
amino acid sequence. Fixed modifications are not stored either, as they are applied to
every possible residue and can be retraced. For the same reason peptides with the same
modifications in different positions are not stored, which makes it unnecessary to store the
positions of modifications, as was intended at an earlier stage of the
implementation.
The question of which data type to use for the mass values was solved by a simple
typedef that allows switching between doubles and floats.
Using integers was also considered because of their presumed speed and easier handling in
comparison to floating point numbers, e.g. when comparing mass values, but this was not
implemented as the difference in speed was found to be negligible.
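Storing positions instead of sequences can be sketched as follows. The struct PeptideRef and the function peptideSequence are hypothetical stand-ins for the Peptide class, which also stores its mass and number of missed cleavages; the sketch only shows that the amino acid sequence can be recovered on demand from the parent protein.

```cpp
#include <cassert>
#include <string>

struct PeptideRef {
    int begin;                   // first residue, 0-based, inclusive
    int end;                     // last residue, inclusive
    const std::string* protein;  // points to the parent sequence, no copy
};

// Recovers the amino acid sequence of a peptide from its parent protein.
std::string peptideSequence(const PeptideRef& p) {
    assert(p.protein != 0 && p.begin >= 0 && p.begin <= p.end);
    assert(static_cast<std::size_t>(p.end) < p.protein->size());
    return p.protein->substr(p.begin, p.end - p.begin + 1);
}
```

The cost per peptide is then two integers and a pointer, independent of the peptide length.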
3.5 Mass Range
All peptides that exceed the specified maximum mass are deleted before the computation of
the variably modified peptides. On the one hand this saves time and space.
On the other hand most mass analyzers have an upper mass limit, which makes the
computation of peptides that cannot be analyzed useless.
3.6 Database Searching
The database search implemented is based on counting the number of peptides that match the
masses in an experimental MALDI spectrum for a given mass tolerance in parts per million
(ppm). In order to quickly find the peptides - which are stored in C++ vectors as shown in the
class diagram and which are still in the memory (on-the-fly search) - an interpolation search
in the sorted vectors of Peptides and ModPeptides is performed.
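An interpolation search with a ppm window can be sketched as follows. This is not the ProtDigest code: the function names are hypothetical, plain mass values stand in for the Peptide and ModPeptide objects, and the window is taken as the search mass plus or minus search mass x ppm / 10^6.

```cpp
#include <cassert>
#include <vector>

// Returns the index of some mass inside [lo_mass, hi_mass], or -1 if there is
// none. 'masses' must be sorted in increasing order.
long interpolationFind(const std::vector<double>& masses, double lo_mass, double hi_mass) {
    long lo = 0;
    long hi = static_cast<long>(masses.size()) - 1;
    const double target = 0.5 * (lo_mass + hi_mass);
    while (lo <= hi) {
        if (masses[hi] < lo_mass || masses[lo] > hi_mass) return -1;  // window outside range
        long pos = lo;
        if (masses[hi] > masses[lo]) {
            // Estimate the position assuming roughly evenly spaced masses.
            double frac = (target - masses[lo]) / (masses[hi] - masses[lo]);
            if (frac < 0.0) frac = 0.0;
            if (frac > 1.0) frac = 1.0;
            pos = lo + static_cast<long>(frac * static_cast<double>(hi - lo));
        }
        if (masses[pos] < lo_mass)      lo = pos + 1;  // everything up to pos is too light
        else if (masses[pos] > hi_mass) hi = pos - 1;  // everything from pos on is too heavy
        else return pos;
    }
    return -1;
}

// Counts all masses within +/- ppm of 'search_mass'.
int countMatches(const std::vector<double>& masses, double search_mass, double ppm) {
    assert(ppm >= 0.0);
    const double tol = search_mass * ppm / 1e6;
    const double lo_mass = search_mass - tol;
    const double hi_mass = search_mass + tol;
    const long hit = interpolationFind(masses, lo_mass, hi_mass);
    if (hit < 0) return 0;
    int count = 1;
    for (long i = hit - 1; i >= 0 && masses[i] >= lo_mass; --i) ++count;
    for (long i = hit + 1; i < static_cast<long>(masses.size()) && masses[i] <= hi_mass; ++i) ++count;
    return count;
}
```

Compared with binary search, the position estimate pays off when the masses are spread fairly evenly; for strongly clustered masses the number of steps can grow, so this is a design choice that depends on the data.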
The next step would be to assign a score to the peptides found in order to be able to
discriminate between random and significant matches and to rank candidate proteins
accordingly. Unfortunately, there was no time left to realize the implementation of a scoring
method.
An issue that arose with database searching was the question of how to handle
proteins containing an 'X', which represents any amino acid. Two options are
implemented:
Option 1: Treat 'X' like any other amino acid, using an averaged amino acid mass of 111 Da.
Option 2: Treat 'X' as a cleavage site and delete all peptides containing an 'X' after cleavage.
The first option risks mapping experimental masses to peptides containing an 'X', and as no
amino acid has a molecular weight of 111 Da such a match can only be wrong. With the second
option, fragments containing an 'X' are omitted and a wrong cleavage might be performed, but
not cleaving at 'X' would discard the complete surrounding peptide and lose more information.
3.7 Sequence Diagram
When the program is started the files 'enzymes.txt' and 'modifications.txt' are read and stored
as Enzyme and Modification objects. After receiving the input parameters for the digestion
(see section 4.1.3 or documentation for examples) a user specified file is parsed and an
instance of Sequence_Set is created with the data received. The protein sequences are stored
as Sequence objects (n = number of proteins) and an instance of AminoAcidMassesFloat is
created from which the masses for the amino acids can be obtained. The function
Sequence_Set::doCleave(bool x_cut, bool sort_by_mass) performs the cleavage of all
sequences, either using option 2 (x_cut = true) or 1 (x_cut = false) described in section 3.6
and optionally sorting the peptides by increasing mass (sort_by_mass = true). First the amino
acid masses are modified according to fixed position-independent modifications. Then
pu Peptide objects including those with missed cleavages are created and stored in the vector
peptides (see section 3.3). Peptides outside the specified mass range are deleted. The function
Sequence_Set::appendTermMods(..) further modifies the peptides with N- and C-terminal
modifications. If variable modifications have been specified by the user, x instances of
ModPeptide are created by calling the function Sequence_Set::createModPeptides(..) which
recursively computes the variably modified peptides and stores them in the vector
mod_peptides. Output is either written to a file or displayed on the screen (see section 4.1).
Now a search with experimental PMF data may be performed in the digestion data still stored
in memory. Sequence_Set::doSearch(int error,...) maps the search masses to peptides
contained in peptides and modified peptides in mod_peptides taking into account a user
specified error tolerance of error ppm. The top 50 hits are written to a file as can be seen in
section 6.1.3. All objects created are deleted before the program ends.
[Figure 3.3: sequence diagram. Calls in order of appearance:
Begin program: readEnzymes() creates Enzyme objects; readModFile() creates Modification objects.
Digestion (after input for digestion): readFasta() creates Sequence_Set and Sequence objects; AminoAcidMassesFloat is created; doCleave(), modifyAminoAcidMasses(), creation of Peptide objects, addMissedCleave(), applyMassRange(), appendTermMods(); createModPeptides() creates ModPeptide objects; outputIntoFile().
DB search (after input for search): readMassList(), doSearch(), interpolSearch(), interpolModSearch(), resultsToFile().
End program.]
Fig.3.3: Sequence Diagram
4 Results and Discussion
The diagrams contained in this chapter were developed using Microsoft Excel
[www.microsoft.com/office/excel]. The plots in section 4.2 were produced with Matlab
[www.mathworks.com].
4.1 Output Examples
Output can either be written to a file by using the function
Sequence_Set::outputIntoFile(string filename, int precision) or displayed on the screen by
using Sequence_Set::outputToScreen(int precision), or both if both functions are called. A
third function, Sequence_Set::massesIntoFile(string filename, int precision), writes all
peptide masses into a file, simply listing them. This was used, for example, to import the mass
values into Matlab.
4.1.1 Screen Output
To give an example, the following output was obtained by digesting trypsin taken from the
SwissProt database with trypsin as cleaving enzyme (autoproteolysis). The maximum number of
missed cleavages was set to 0, propionamide (Cys) is a fixed modification and
phosphorylation (Tyr) is a variable modification.
>sp|P00761|TRYP_PIG Trypsin precursor (EC 3.4.21.4) - Sus scrofa (Pig).
FPTDDDDKIVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYKSRIQVRLGE
HNIDVLEGNEQFIAAKIITHPNFNGNTLDNDIMLIKLSSPATLNSRVATVSLPRSCAAAG
TECLISGWGNTKSSGSSYPSLLQCLKPVLSDSSCKSSYPGQITGNMICVGFLEGGKDSCQ
GDSGGPVVCNGQLQGIVSWGYGCAQKNKPGVYTKVCNYVNWIQQTIAAN
0-7      0  951.3821  FPTDDDDK
8-50     0  4701.22   IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK
            4781.187  Phospho (Y)
            4861.153  Phospho (Y) Phospho (Y)
            4941.119  Phospho (Y) Phospho (Y) Phospho (Y)
51-52    0  261.1437  SR
53-56    0  514.3227  IQVR
57-75    0  2096.054  LGEHNIDVLEGNEQFIAAK
76-95    0  2282.173  IITHPNFNGNTLDNDIMLIK
96-105   0  1044.556  LSSPATLNSR
106-113  0  841.5021  VATVSLPR
114-131  0  1909.866  SCAAAGTECLISGWGNTK
132-154  0  2527.23   SSGSSYPSLLQCLKPVLSDSSCK
            2607.196  Phospho (Y)
155-175  0  2228.061  SSYPGQITGNMICVGFLEGGK
            2308.027  Phospho (Y)
176-205  0  3225.428  DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK
            3305.394  Phospho (Y)
206-213  0  905.4971  NKPGVYTK
            985.4634  Phospho (Y)
214-228  0  1806.872  VCNYVNWIQQTIAAN
            1886.839  Phospho (Y)
The first column contains the beginning and ending positions of the peptide on the protein
sequence (inclusively). Column two is the number of missed cleavages. Next is the calculated
molecular weight including fixed modifications. The last column contains the amino acid
sequence. If a peptide has variable modifications, the sequence is not repeated and only the
modifications and the modified mass are listed.
For another example see section 7.3.1.
4.1.2 File Output
If more than one sequence is digested, for example a database, it is more appropriate to write
the output to a file that can be parsed afterwards. The parameters used are listed at the top.
The total number of sequences, peptides and residues, the average protein and peptide length
and the amino acid composition are also listed. For an example see section 7.3.2.
4.1.3 Digestion and variable modification of an example peptide
The following data shows the 'digestion' of the peptide used as an example in section 3.2.2
(Fig. 3.2) in order to demonstrate the process of computing variable modifications and to
show the setup and navigation of ProtDigest (see documentation for more detailed instructions).
'*' denotes a ModPeptide being found, '°' a ModPeptide being deleted. For the list of
modifications see section 7.2.
>
Enter name of fastafile containing sequence(s) to be cleaved: modpep
Cleaving substances:
[1]Trypsin
[2]Arg-C
[3]Asp-N
[4]Chymotrypsin(FYW)
[5]Chymotrypsin(FYWML)
[6]Formic_acid
[7]Lys-C
[8]Lys-C/P
[9]PepsinA
[10]CNBr
[11]Tryp-CNBr
[12]TrypChymo
[13]Trypsin/P
[14]V8-DE
[15]V8-E
[16]no cut
Choose cleaving substance by entering number: 16
Enter number of maximum missed cleavages: 0
Modifications:
To see every possible modification press 1, to see a selection press 2 to not show
any press 3: 3
Enter fixed modifications (example: 1,5,20 or 0 for none): 0
Enter variable modifications: 1,19,43
For monoisotopic masses press 1, for average masses press 2: 1
Enter minimum mass displayed (in Da): 1
Enter maximum mass displayed: 100000
How should 'X' in amino acid sequence be treated?
[1] as cleavage site + kick out peptides containing 'X'
[2] use averaged amino acid mass + treat as normal amino acid
2
Sort peptides by masses? n
reading file......
cleaving...
0 peptides not within specified mass range.
Number of residues: 4
Variably modifying 1 peptides...
*******°**°**°*°
<-- '*' denotes a ModPeptide found, '°' found ModPeptide not stored
Number of Sequences = 1
Number of Peptides without variable modifications = 1
Number of variably modified Peptides = 8
>mod_peptide example
AKDK
0-3  0  460.265  AKDK
        558.301  Acetyl (K) Methyl ester (D) Acetyl (K)
        516.291  Acetyl (K) Methyl ester (D)
        566.268  Acetyl (K) Sodiated (D) Acetyl (K)
        524.257  Acetyl (K) Sodiated (D)
        544.286  Acetyl (K) Acetyl (K)
        502.275  Acetyl (K)
        474.28   Methyl ester (D)
        482.246  Sodiated (D)
Search? N
As demonstrated, twelve variably modified peptides are computed, of which four are not
stored: three are positional permutations that would give rise to multiple peptides with the
same molecular weight, and one is the unmodified peptide.
4.2 Mass Distribution Analysis
In order to analyze the distribution of peptide masses and lengths in dependence of the
cleaving substance, the SwissProt Saccharomycetes database was digested using different
enzymes.
4.2.1 Peptide Length Distribution
The peptide length distributions in Fig. 4.1 were observed by digesting with trypsin,
chymotrypsin and cyanogen bromide, respectively, without missed cleavages.
Chymotrypsin cleaves after F,W,Y,L and M unless followed by P while trypsin cleaves after
K and R unless followed by P. CNBr cleaves after M which explains the relatively high
abundance of peptides of length one as most proteins start with an M which is encoded by the
start codon AUG. For trypsin and chymotrypsin the number of peptides noticeably decreases
with increasing peptide length. Especially chymotrypsin tends to produce high numbers of
short peptides.
Taking a look at the amino acid composition of the digested database in table 4.2 (section 4.6),
approximately 20% of the peptide bonds are cleavage sites for chymotrypsin, which explains the
high abundance of short peptides. Trypsin cleaves approximately 11% of the bonds, whereas
for CNBr only 2% of the bonds are cleavable.
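These percentages follow from summing the relative frequencies of the residues each enzyme cleaves after. A minimal sketch, assuming the Saccharomycetes column of the composition table in section 4.6 and ignoring the restriction by a following proline (the function name cleavableFraction is hypothetical):

```cpp
#include <cassert>
#include <map>
#include <string>

// Sums the relative frequencies [%] of all residues an enzyme cleaves after.
// The result is an upper estimate because the P1' restriction is ignored.
double cleavableFraction(const std::map<char, double>& composition,
                         const std::string& cut_at) {
    double sum = 0.0;
    for (std::size_t i = 0; i < cut_at.size(); ++i) {
        std::map<char, double>::const_iterator it = composition.find(cut_at[i]);
        assert(it != composition.end());  // residue must be in the composition table
        sum += it->second;
    }
    return sum;
}
```

With F = 4.4748, Y = 3.4025, W = 1.0684, M = 2.0794, L = 9.4554, K = 7.1977 and R = 4.3729 this gives about 20.5% for chymotrypsin (FYWML), 11.6% for trypsin (KR) and 2.1% for CNBr (M).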
[Figure 4.1: plot of the number of peptides (y-axis, 0 to 140000) against peptide length (x-axis, 1 to 20) for chymotrypsin, trypsin and CNBr]
Fig. 4.1: Peptide length distributions computed from digestion of SwissProt Saccharomycetes
4.2.2 Peptide Mass distribution
Large numbers of short i. e. lightweight peptides raise the number of random matches and
therefore do not contribute to identifying a protein. Often they are sorted out before a
database search is performed. Longer peptides of higher molecular weight are more
characteristic of the protein they are derived from and thus are more significant. Table 4.1
compares four enzymes in regard to the number of peptides within a mass range of 500 to
3000 Da which is characteristic of PMF analysis. Tryptic digestion with one missed cleavage
has the highest coverage of peptides within the mass range explaining its predominant use in
PMF experiments.
Missed      Trypsin                 CNBr                    Chymotrypsin            Formic acid
cleavages   <500   >3000  Within    <500   >3000  Within    <500   >3000  Within    <500   >3000  Within
                          range                   range                   range                   range
0           38.86  5.09   56.05     16.73  51.02  32.25     57.46  0.62   41.92     23.50  19.46  57.04
1           23.03  12.08  76.65     9.08   67.53  23.39     36.42  1.79   61.79     13.78  34.27  51.95
Table 4.1: relative number of peptides[%] within the specified mass range
Fig.4.2 shows the mass distribution resulting from digestion with trypsin and CNBr,
respectively, with two missed cleavages allowed. Methionine accounts for the large green
peak at ~149 Da. The large blue peaks at ~146 and ~174Da are single lysines and arginines
produced when one tryptic cleavage site is directly followed by another. As already observed
from the peptide length distribution the number of peptides resulting from tryptic digestion
strongly decreases with increasing mass. CNBr produces relatively few peptides almost
equally covering the mass range shown.
A periodicity of 14 to 15 Da is clearly visible which has also been observed in [38] when
plotting the number of atomic compositions of peptides over the molecular weight. The
oscillation is visible up to a mass of 1300Da and seems to be independent of the enzyme and
database chosen which was concluded from digesting the SwissProt arabidopsis database
with chymotrypsin and formic acid giving rise to the same periodicity (data shown in section
7.4). As every peptide is a composition of amino acids certain mass values in the lower mass
range are more probable to appear than others simply based on the number of amino acid
combinations possible.
Fig. 4.2: Distribution of monoisotopic masses
4.2.3 Peptide Mass Clustering
If a smaller mass range is extracted (Fig 4.3) an interesting characteristic of peptide mass
distribution can be seen, the clustering of peptide masses. This is another result of the atomic
composition of proteins as they are solely made up of C, H, N, O and S which all have near
to integer masses. The integer mass of an atom is called its nominal mass. The formation of
clusters is a result of the limited number of combinations of the five atom types for a given
nominal mass. Depending on the atomic composition of the modification even modified
peptides keep to this rule.
Fig 4.3: Mass clustering of peptides between 995 and 1005Da (derived from tryptic
digestion of Saccharomycetes)
The relation of the centroid of a cluster to the nominal mass is:
centroid mass = nominal mass * 1.000478
This relation was derived from plotting the monoisotopic masses over the nominal masses
and calculating the slope using linear regression (data not shown). It is similar to the relation
obtained by Wool and Smilansky in [32].
In Fig. 4.4 the differences between monoisotopic and nominal masses are plotted over the
nominal masses in a range of 0 to 5000Da. A linear regression yields the following equation:
y = 0.00047811 * x – 0.000017786
x = nominal mass
y = mass difference between monoisotopic and nominal mass
For masses larger than ~2091Da the centroid mass is therefore more than 1 Da away from its
nominal mass i. e. the centroid is found in the next integer mass interval.
Fig.4.4: Difference between monoisotopic and nominal masses
The phenomenon of peptide mass clustering can for example be used for calibration of mass
spectra [32].
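The regression can be turned into a small helper, sketched here with the coefficients quoted above. The function names are hypothetical and the sketch assumes that the linear relation holds over the whole range of 0 to 5000 Da.

```cpp
const double SLOPE = 0.00047811;        // slope of the regression in the text
const double INTERCEPT = -0.000017786;  // intercept of the regression in the text

// Expected difference between monoisotopic and nominal mass (y in the text).
double massExcess(double nominal_mass) {
    return SLOPE * nominal_mass + INTERCEPT;
}

// Expected centroid of the mass cluster belonging to a nominal mass.
double centroidMass(double nominal_mass) {
    return nominal_mass + massExcess(nominal_mass);
}

// Nominal mass at which the excess reaches one full Dalton.
double oneDaltonThreshold() {
    return (1.0 - INTERCEPT) / SLOPE;
}
```

For a nominal mass of 1000 Da the expected excess is about 0.478 Da, and the excess reaches 1 Da at a nominal mass of roughly 2091.6 Da, in line with the value of ~2091 Da stated above.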
4.3 Modifications and Missed Cleavages
As already described missed cleavage sites and variable modifications can significantly
increase the number of peptides while there is no great computational expense associated
with fixed modifications.
An example is shown in figure 4.5 which visualizes the data in table 4.2 derived from the
digestion of human titin (SwissProt Accession Number Q8WZ42) which has a sequence
length of 34350 amino acids. The sequence was digested with trypsin allowing 0, 1 and 2
missed cleavages, respectively. No upper limit for the number of variable modifications per
peptide was specified. Three different combinations of variable modifications were chosen:
(a) a position-independent modification, (b) two modifications independent of the amino acid
modifying the C- and N-terminus of all peptides and (c) the combination of (a) with (b).
Missed      Type of        Number of    Total number of     Total number of
cleavages   modification   unmodified   peptides (without   peptides computed
                           peptides     permutations)       by recursion
0           (a)            4197         5196                5421
0           (b)            4197         16313               16313
0           (c)            4197         19991               20833
1           (a)            8393         11390               12599
1           (b)            8393         33097               33097
1           (c)            8393         44449               49083
2           (a)            12588        18581               22535
2           (b)            12588        49877               49877
2           (c)            12588        72895               88209
Table 4.2: The influence of missed cleavages and variable modifications on the
number of peptides when there is no upper limit for the number of modifications per
peptide
(a) Phosphorylation (Y),
(b) Carbamyl (N-term) and Methyl ester (C-term),
(c) Carbamyl (N-term) and Methyl ester (C-term) and Phosphorylation(Y)
Fig. 4.5: The influence of missed cleavages and variable modifications on the
number of peptides
As shown by these data, missed cleavages cause a linear increase in the number of
unmodified peptides. For modification set (b) the total number of peptides grows linearly
with the number of unmodified peptides, with a factor approaching 4: each modification can
be applied only once to each peptide, so combining both yields at most four times as many
peptides (unmodified, only N-terminally modified, only C-terminally modified, both
N-terminally and C-terminally modified).
Phosphorylation of tyrosine residues alone does not lead to a drastically high number of
peptides, but with increasing number of missed cleavages and therefore increasing peptide
length the increase in the number of peptides is more than linear. The relative abundance of
tyrosine in this example protein is only 2.9% which is the reason for the relatively low
increase.
Combination of (a) with (b) yields the largest increase in the number of peptides being
computed demonstrating the need for a fixed maximum number of modifications allowed per
peptide. If none is used the computation of variably modified peptides can eventually
overstrain memory capacity.
4.5 Comparing Data Types
To find out whether there is a significant rounding error when using floats in comparison to
doubles, the molecular weight of titin was computed first using doubles and then using floats:
The double value computed was 3813839.61068 Da whereas switching to floats yielded a
mass of 3812936 Da. This is a difference of almost 4 Da strongly suggesting the use of
doubles.
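The effect of the accumulator type can be reproduced with a small sketch. The template function accumulateMass is hypothetical, and the example input (34350 identical residues of 111.1 Da, i.e. a chain of titin's length built from one artificial residue mass) is an assumption chosen only to make the rounding visible. The size of the deviation depends on the actual masses and on the order of summation.

```cpp
#include <vector>

// Sums the values in the given order using the accumulator type T
// (float or double), mimicking a switch of the mass typedef.
template <typename T>
T accumulateMass(const std::vector<double>& residue_masses) {
    T sum = 0;
    for (std::size_t i = 0; i < residue_masses.size(); ++i)
        sum += static_cast<T>(residue_masses[i]);
    return sum;
}
```

A float carries about seven significant decimal digits, so once the running sum exceeds a few million Da each addition is rounded to a multiple of 0.25 Da. With a double accumulator the rounding per addition is many orders of magnitude smaller.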
4.6 Database Analysis
Three databases, Saccharomycetes, Arabidopsis and Drosophila, were analyzed for their amino
acid composition in order to see how often 'X' appears. The data is shown in table 4.2:
Amino Acid [%]           Saccharomycetes   Arabidopsis thaliana   Drosophila
A                        5.7775            6.3047                 7.5214
B                        0.0001            0.0000                 0.0000
C                        1.2716            1.8247                 1.8119
D                        5.8365            5.4591                 5.2087
E                        6.4850            6.7806                 6.1630
F                        4.4748            4.2826                 3.7728
G                        5.2803            6.3886                 6.4347
H                        2.1166            2.2755                 2.6619
I                        6.5216            5.3325                 5.1338
K                        7.1977            6.4048                 5.6657
L                        9.4554            9.4885                 9.0767
M                        2.0794            2.4505                 2.4699
N                        5.9645            4.4057                 4.7881
P                        4.3593            4.8021                 5.2059
Q                        3.9073            3.4700                 5.1609
R                        4.3729            5.3983                 5.2936
S                        8.7707            8.9881                 7.9858
T                        5.8923            5.1106                 5.5651
V                        5.7648            6.7118                 5.9770
W                        1.0684            1.2639                 1.0445
X                        0.0008            0.0000                 0.0039
Y                        3.4025            2.8577                 3.0547
Z                        0.0000            0.0000                 0.0000
Number of sequences      5904              2913                   2593
Number of residues       2847397           11319693               1344529
Average protein length   482.283           429.929                518.523
Table 4.2: Composition analysis of
three example databases
In the Arabidopsis database X does not appear at all, the Saccharomycetes database contains
rare occurrences of X and the Drosophila database shows the highest relative abundance
with 0.0039% X. The number is still very small, which suggests simply deleting all
peptides containing X, as the loss of information will probably be small. With this
option, which is not yet implemented in ProtDigest, no possibly wrong peptides pointing to
erroneous protein identification would be stored.
4.7 Database Search
The mips arabidopsis database [http://mips.gsf.de] containing 26639 sequences was searched
with 184 peak lists. The search was performed with one missed cleavage allowed and without
any modifications. The search was performed once with a peptide mass tolerance of 100 ppm
(parts per million) and once with 50 ppm. The results are shown in Fig. 4.6. The x-axis
requires further explanation: Let the number of matching peptide masses be called the score
of a protein. Position 1 means that the correct protein had the highest score found i.e. no other
protein had more matching peptide masses. So if all sequences had the same score, say 1, the
correct one is in position 1. Position 2 means that the correct protein had the second-best
score, again not considering the number of proteins having the same or a better score. If a
protein was 'not found' this means that it was not listed in the top 50 hits which can be due to
the fact that more than 50 proteins had the same top score. To add more meaning to the
results, for all proteins in position 1 it was checked whether another protein had the same
(top-) score.
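The position defined above can be sketched as follows. The function names are hypothetical; the score of a protein is its number of matching peptide masses, and the position is one plus the number of distinct scores that are strictly better, so proteins with equal scores share a position.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Position of protein 'correct': 1 + number of distinct strictly better scores.
int rankPosition(const std::vector<int>& scores, std::size_t correct) {
    assert(correct < scores.size());
    std::set<int> better;
    for (std::size_t i = 0; i < scores.size(); ++i)
        if (scores[i] > scores[correct]) better.insert(scores[i]);
    return static_cast<int>(better.size()) + 1;
}

// True if no other protein shares the score of protein 'correct'.
bool hasUniqueScore(const std::vector<int>& scores, std::size_t correct) {
    assert(correct < scores.size());
    for (std::size_t i = 0; i < scores.size(); ++i)
        if (i != correct && scores[i] == scores[correct]) return false;
    return true;
}
```

If all proteins have the same score, every protein is in position 1 but none is a unique top scorer, which is why the second check adds information to the position alone.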
Fig. 4.6: Database search results.
These results demonstrate the high capability of PMF to identify proteins. Reducing the error
tolerance from 100 to 50 ppm decreased the number of random matches. Without considering
any modifications, in 89% of the 184 peak lists the correct protein was ranked first, 72%
being the unique top scorer. Even without applying a scoring system as for example
ChemScore [34] or the MOWSE score[35], PMF allows for at least significantly reducing the
number of candidates by simply counting the number of matches.
5 Conclusions
When developing a model for protein digestion the complex biological background had to be
simplified in order to be capable of digesting complete databases within reasonable amount
of time. The enzyme model employed relies on the assumption that cleavage only depends on
the two residues adjacent to the cleaved peptide bond. In contrast, other enzyme models take
a much more sophisticated approach even assigning a probability to the cleavage site such as
the one used by PeptideCutter, a tool specialized on the prediction of potential cleavage sites
[33].
The modification model could also be more complex, for example by taking into account
known signal sequences such as N-X-S/T for N-glycosylation. For the reason given above, this
was not implemented.
In silico protein digestion has also been shown to have some computational limitations
concerning memory capacity, due to variable modifications. A maximum number of modifications
per peptide is indispensable and it is generally advised to use variable modifications
sparingly. A suggestion to avoid runtime and memory problems when digesting complete
databases is to first search the database with
as few variable modifications as possible and afterwards perform a digestion with more
modifications on a subset of candidate proteins.
A refinement for ProtDigest could be to allow some variable modifications
to be present more often than others, for example allowing four oxidations per peptide but
only one methylation.
In any case, it is questionable whether using a high number of variable modifications is
reasonable when a database search is performed. They not only cause an increase in
runtime and memory, but also raise the level of random matches, simply because there are
more mass values the experimental MS data can map to.
Here the need for a scoring method being able to discriminate between significant and
random matches becomes visible. There are several different approaches such as ChemScore
[34] which assigns a score based on the chemical properties of a peptide, or the MOWSE
score [35] that is used by Mascot [36], probably the most often used protein identification
tool. The MOWSE score assigns a statistical weight to each individual peptide match based
on the probability of a peptide belonging to a protein of a certain molecular weight which is
empirically determined during the database preprocessing. The scoring method used by
ProFound [40] is based on the a posteriori probabilities (Bayes probabilities) of the
experimental masses belonging to a certain protein.
ProtDigest only includes the first part of PMF protein identification, the peptide matching
stage. A scoring stage has not yet been implemented.
Another improvement to its functionality could be the ability to further fragment the peptides
to produce MS/MS ion series and therefore also incorporate the usage of MS/MS data.
6 Glossary
Amino Acids:
Peptide Bond:
A peptide bond is a chemical bond formed between two molecules when the carboxyl group
of one molecule reacts with the amino group of the other molecule, releasing a molecule of
water (H2O). This is a dehydration synthesis reaction, and usually occurs between amino
acids.
The resulting C-N bond is called a peptide bond, and the resulting molecule is called an
amide. Polypeptides and proteins are chains of amino acids held together by peptide bonds.
The C-N bond has a partial double bond character (with the Nitrogen atom attaining a partial
positive charge and the oxygen atom a partial negative charge) and the molecule can
normally not rotate around this bond. The whole arrangement of the four C,O,N,H atoms as
well as the two attached carbons in a peptide bond is planar [41].
Protein Structure:
Proteins are amino acid chains that fold into unique 3-dimensional structures. The shape into
which a protein naturally folds is known as its native state, which is determined by its
sequence of amino acids. Biochemists refer to four distinct aspects of a protein's structure:
Primary structure: the amino acid sequence
Secondary structure: highly patterned sub-structures--alpha helix and beta sheet--or segments
of chain that assume no stable shape. Secondary structures are locally defined, meaning that
there can be many different secondary motifs present in one single protein molecule
Tertiary structure: the overall shape of a single protein molecule; the spatial relationship of
the secondary structural motifs to one another
Quaternary structure: the shape or structure that results from the union of more than one
protein molecule, usually called protein subunits in this context, which function as
part of the larger assembly or protein complex [41].
Monoisotopic and Average Mass:
Isotopes are atoms of a chemical element whose nuclei have the same atomic number, but
different atomic weights. The atomic number corresponds to the number of protons in an
atom. Thus, isotopes of a particular element contain the same number of protons. The
difference in atomic weights results from differences in the number of neutrons in the atomic
nuclei [41]. The monoisotopic mass is calculated using the mass of the most abundant natural
isotope of each constituent element. An average mass is calculated using the weighted
average of all its natural isotopes.
7 Appendix
7.1 Enzymes
      Name                  Cleaves  Restriction  C-/N-term
[1]   Trypsin               KR       P            Cterm
[2]   Arg-C                 R        P            Cterm
[3]   Asp-N                 DB                    Nterm
[4]   Chymotrypsin(FYW)     FYW      P            Cterm
[5]   Chymotrypsin(FYWML)   FYWML    P            Cterm
[6]   Formic_acid           D                     Cterm
[7]   Lys-C                 K        P            Cterm
[8]   Lys-C/P               K                     Cterm
[9]   PepsinA               FL                    Cterm
[10]  CNBr                  M                     Cterm
[11]  Tryp-CNBr             KRM      P            Cterm
[12]  TrypChymo             FYWKR    P            Cterm
[13]  Trypsin/P             KR                    Cterm
[14]  V8-DE                 BDEZ     P            Cterm
[15]  V8-E                  EZ       P            Cterm
[16]  no cut                -        -            -
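How the three rule columns of the enzyme list (cleaved residues, restricting residue, cleavage side) could drive a digestion is sketched below in Python. This is an illustration, not the ProtDigest implementation: the function name digest and the dictionary layout are hypothetical, only three enzymes are included, and the restriction values assume the usual convention that Trypsin does not cleave before proline while Trypsin/P does.

```python
from typing import Dict, List, Tuple

# name -> (cleaved residues, restricting residues, cleavage side)
# Assumption: Trypsin is restricted by P, Trypsin/P and Asp-N are not.
ENZYMES: Dict[str, Tuple[str, str, str]] = {
    "Trypsin": ("KR", "P", "Cterm"),
    "Trypsin/P": ("KR", "", "Cterm"),
    "Asp-N": ("DB", "", "Nterm"),
}


def digest(seq: str, enzyme: str) -> List[Tuple[int, int, str]]:
    """Fully cleave seq; return (start, end, peptide) with 0-based inclusive ends."""
    cleaves, restrict, side = ENZYMES[enzyme]
    cuts = [0]  # each entry is the index where a new peptide starts
    for i, aa in enumerate(seq):
        if aa not in cleaves:
            continue
        if side == "Cterm":
            # Cut after residue i unless the next residue restricts cleavage.
            if i + 1 < len(seq) and seq[i + 1] not in restrict:
                cuts.append(i + 1)
        elif i > 0:
            # N-terminal cleavage: cut before residue i.
            cuts.append(i)
    cuts.append(len(seq))
    return [(s, e - 1, seq[s:e]) for s, e in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    prefix = "MNTIVVAQLQRQFQDYIVSLYQQGFLDNQFSELRKLQDEGTPDFVAEVVSLFFDDCSK"
    for start, end, pep in digest(prefix, "Trypsin"):
        print(f"{start}-{end}\t{pep}")
```

Applied to the first 58 residues of AHP5_ARATH, the sketch yields the positions 0-10, 11-33, 34-34 and 35-57 that also appear in the file output of Section 7.3.2. The Lys-Pro bond in SSGSSYPSLLQCLKPVLSDSSCK is left intact under the Trypsin rule, as in Section 7.3.1.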
7.2 Modifications
       Name                                   Residue  Pos    Monoisotopic Mass  Average Mass
[1]    Acetyl (K)                             K        any    170.106            170.211
[2]    Acetyl (N-term)                        any      Nterm  43.0184            43.045
[3]    Amide (C-term)                         any      Cterm  16.0187            16.022
[4]    Biotinylated (K)                       K        any    354.173            354.467
[5]    Biotinylated (Nterm)                   any      Nterm  227.085            227.301
[6]    Carbamidomethyl (C)                    C        any    160.031            160.191
[7]    Carbamyl (K)                           K        any    171.101            171.199
[8]    Carbamyl (N-term)                      any      Nterm  44.0136            44.033
[9]    Carboxymethyl (C)                      C        any    161.015            161.176
[10]   Deamidation (N)                        N        any    115.027            115.089
[11]   Deamidation (Q)                        Q        any    129.043            129.116
[12]   Gigi_ICATd0 (C)                        C        any    589.26             589.764
[13]   Gigi_ICATd8 (C)                        C        any    597.311            597.814
[14]   HSe (C-term M)                         M        Cterm  -12.9901           -13.08
[15]   HSe lactone (Cterm M)                  M        Cterm  -31.0006           -31.096
[16]   ICAT_heavy                             C        any    553.284            553.761
[17]   ICAT_light                             C        any    545.234            545.711
[18]   Methyl ester (Cterm)                   any      Cterm  31.0184            31.034
[19]   Methyl ester (D)                       D        any    129.043            129.116
[20]   Methyl ester (E)                       E        any    143.058            143.142
[21]   N-Acetyl (Protein)                     prot     Nterm  43.0184            43.045
[22]   N-Formyl (Protein)                     prot     Nterm  29.0028            29.018
[23]   NIPCAM (C)                             C        any    202.078            202.271
[24]   O18 (C-term)                           any      Cterm  19.007             19.007
[25]   Oxidation (H)                          H        any    153.054            153.14
[26]   Oxidation (W)                          W        any    202.074            202.212
[27]   Oxidation (M)                          M        any    147.035            147.191
[28]   PEO Biotin (C)                         C        any    517.203            517.658
[29]   Phospho (T)                            T        any    181.014            181.085
[30]   Phospho (S)                            S        any    166.998            167.058
[31]   Phospho (Y)                            Y        any    243.03             243.156
[32]   PhosphoNL (S)                          S        any    69.0215            69.063
[33]   PhosphoNL (T)                          T        any    83.0371            83.09
[34]   Propionamide (C)                       C        any    174.046            174.218
[35]   Pyridyl (K)                            K        any    247.132            247.297
[36]   Pyridyl (Nterm)                        any      Nterm  120.045            120.131
[37]   Pyro-cmC (Nterm camC)                  C        Nterm  -16.0187           -16.022
[38]   Pyro-glu (Nterm E)                     E        Nterm  -17.0027           -17.007
[39]   Pyro-glu (Nterm Q)                     Q        Nterm  -16.0187           -16.022
[40]   SMA (K)                                K        any    255.158            255.317
[41]   SMA (Nterm)                            any      Nterm  128.071            128.151
[42]   Sodiated (Cterm)                       any      Cterm  38.9847            38.989
[43]   Sodiated (D)                           D        any    137.009            137.071
[44]   Sodiated (E)                           E        any    151.025            151.098
[45]   S-pyridylethyl (C)                     C        any    208.067            208.278
[46]   Sulphone (M)                           M        any    163.03             163.191
[47]   Citrullination                         R        any    157.085            157.173
[48]   Methylation (C)                        C        any    117.025            117.166
[49]   Methylation (K)                        K        any    142.111            142.201
[50]   Methylation (R)                        R        any    170.117            170.215
[51]   Methylation (H)                        H        any    151.075            151.168
[52]   Methylation (N)                        N        any    128.059            128.131
[53]   Methylation (Q)                        Q        any    142.074            142.158
[54]   Methylation (Nterm A)                  A        Nterm  15.0235            15.035
[55]   Hydroxylation (P)                      P        any    113.048            113.116
[56]   Hydroxylation (K)                      K        any    144.09             144.173
[57]   Hydroxylation (D)                      D        any    131.022            131.088
[58]   Hydroxylation (N)                      N        any    130.038            130.103
[59]   di-methylation (C)                     C        any    131.04             131.193
[60]   di-methylation (K)                     K        any    156.126            156.228
[61]   di-methylation (R)                     R        any    184.132            184.242
[62]   di-methylation (H)                     H        any    165.09             165.195
[63]   di-methylation (D)                     D        any    143.058            143.143
[64]   di-methylation (E)                     E        any    157.074            157.17
[65]   di-methylation (N)                     N        any    142.074            142.158
[66]   di-methylation (Q)                     Q        any    156.09             156.185
[67]   di-methylation (Nterm A)               A        Nterm  29.0391            29.062
[68]   tri-methylation (C)                    C        any    145.056            145.219
[69]   tri-methylation (K)                    K        any    170.142            170.254
[70]   tri-methylation (R)                    R        any    198.148            198.268
[71]   tri-methylation (H)                    H        any    179.106            179.221
[72]   tri-methylation (D)                    D        any    157.074            157.169
[73]   tri-methylation (E)                    E        any    171.09             171.195
[74]   tri-methylation (N)                    N        any    156.09             156.184
[75]   tri-methylation (Q)                    Q        any    170.106            170.211
[76]   tri-methylation (Nterm A)              A        Nterm  43.0548            43.088
[77]   Gamma-carboxylation (D)                D        any    159.017            159.099
[78]   Gamma-carboxylation (E)                E        any    173.032            173.125
[79]   Beta-methylthiolation                  D        any    161.015            161.176
[80]   Sulfation                              Y        any    243.02             243.234
[81]   Phosphorylation (H)                    H        any    217.025            217.121
[82]   Phosphorylation (C)                    C        any    182.976            183.119
[83]   Phosphorylation (D)                    D        any    194.993            195.069
[84]   C-Mannosylation                        W        any    348.132            348.355
[85]   Glycation (N)                          N        any    276.096            276.246
[86]   Glycation (T)                          T        any    263.101            263.247
[87]   Glycation (K)                          K        any    290.148            290.316
[88]   Glycation (Nterm)                      any      Nterm  163.061            163.15
[89]   Lipoyl                                 K        any    316.128            316.476
[90]   O-GlcNac (S)                           S        any    290.111            290.272
[91]   O-GlcNac (T)                           T        any    304.127            304.299
[92]   O-GlcNac (N)                           N        any    317.122            317.298
[93]   Farnesylation                          C        any    307.197            307.494
[94]   Myristoylation (res)                   K        any    338.293            338.533
[95]   Myristoylation (Nterm)                 G        Nterm  211.206            211.367
[96]   Pyridoxal phosphate                    K        any    357.109            357.303
[97]   Palmitoylation (C)                     C        any    341.239            341.551
[98]   Palmitoylation (S)                     S        any    325.262            325.49
[99]   Palmitoylation (T)                     T        any    339.277            339.517
[100]  Palmitoylation (K)                     K        any    366.325            366.586
[101]  Geranyl-geranyl                        C        any    375.26             375.612
[102]  Phosphopantetheine                     S        any    426.11             426.401
[103]  Flavin adenine dinucleotide (FAD) (C)  C        any    886.151            886.68
[104]  Flavin adenine dinucleotide (FAD) (H)  H        any    920.2              920.682
[105]  N-acyl diglyceride cys (tripalmitate)  C        Nterm  789.734            790.324
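One way to represent an entry of the modification list and to find the positions in a peptide where it may be placed is sketched below in Python. The class and function names are hypothetical, only four entries are copied from the list, and entries with residue "prot" are deliberately not handled because they depend on where the peptide lies within the protein.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Modification:
    name: str
    residue: str  # one-letter code, "any" or "prot"
    pos: str      # "any", "Nterm" or "Cterm"
    mono: float   # monoisotopic mass as listed
    avg: float    # average mass as listed


# A few entries copied from the modification list.
MODS = [
    Modification("Oxidation (M)", "M", "any", 147.035, 147.191),
    Modification("Acetyl (N-term)", "any", "Nterm", 43.0184, 43.045),
    Modification("Pyro-glu (Nterm Q)", "Q", "Nterm", -16.0187, -16.022),
    Modification("Amide (C-term)", "any", "Cterm", 16.0187, 16.022),
]


def candidate_sites(peptide: str, mod: Modification) -> List[int]:
    """0-based indices in the peptide where mod could be placed."""
    if not peptide:
        return []
    if mod.pos == "Nterm":
        indices = [0]
    elif mod.pos == "Cterm":
        indices = [len(peptide) - 1]
    else:
        indices = list(range(len(peptide)))
    if mod.residue == "any":
        return indices
    # "prot" never equals a residue letter, so such entries yield no sites here.
    return [i for i in indices if peptide[i] == mod.residue]


if __name__ == "__main__":
    for mod in MODS:
        print(mod.name, candidate_sites("QFQDYIVSLYQQGFLDNQFSELR", mod))
```

For residue-specific entries the listed masses are consistent with the mass of the modified residue rather than a mass difference: Oxidation (M) is listed as 147.035, which is the methionine residue mass of about 131.040 plus one oxygen.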
7.3 Output examples
7.3.1 Comparison with PeptideMass
As a short demonstration of the correctness of ProtDigest, its output for tryptic
autoproteolysis, allowing for methionine sulfoxide, was compared with the results of
PeptideMass:
>sp|P00761|TRYP_PIG Trypsin precursor (EC 3.4.21.4) - Sus scrofa (Pig).
0-7      0  951.38213   FPTDDDDK
8-50     0  4488.1089   IVGGYTCAANSIPYQVSLNSGSHFCGGSLINSQWVVSAAHCYK
53-56    0  514.32272   IQVR
57-75    0  2096.05378  LGEHNIDVLEGNEQFIAAK
76-95    0  2282.17287  IITHPNFNGNTLDNDIMLIK
            2298.16779  Oxidation (M)
96-105   0  1044.55636  LSSPATLNSR
106-113  0  841.50213   VATVSLPR
114-131  0  1767.79198  SCAAAGTECLISGWGNTK
132-154  0  2385.15558  SSGSSYPSLLQCLKPVLSDSSCK
155-175  0  2157.02343  SSYPGQITGNMICVGFLEGGK
            2173.01835  Oxidation (M)
176-205  0  3012.31639  DSCQGDSGGPVVCNGQLQGIVSWGYGCAQK
206-213  0  905.49705   NKPGVYTK
214-228  0  1735.83518  VCNYVNWIQQTIAAN
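The Oxidation (M) rows in this listing differ from the unmodified peptide by 15.99492 Da (2298.16779 - 2282.17287). A minimal Python sketch of how such variable-modification masses could be enumerated is given below. The function name and its parameters are hypothetical, and grouping the variants by the number of oxidized methionines instead of by site is an assumption of the sketch.

```python
from typing import List, Tuple

# Mass difference between the modified and unmodified rows of peptide 76-95.
OXIDATION_DELTA = 15.99492


def oxidation_variants(peptide: str, base_mass: float,
                       residue: str = "M",
                       delta: float = OXIDATION_DELTA) -> List[Tuple[int, float]]:
    """Return (number of modified residues, mass) for 0..n modified sites."""
    n_sites = peptide.count(residue)
    return [(k, round(base_mass + k * delta, 5)) for k in range(n_sites + 1)]


if __name__ == "__main__":
    for k, mass in oxidation_variants("IITHPNFNGNTLDNDIMLIK", 2282.17287):
        print(k, mass)
```

A peptide with one methionine gives two masses, one with two methionines gives three, and a peptide without methionine is returned unchanged.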
7.3.2 File output
ENZYME:Trypsin
MAXMISSEDCLEAVAGE:1
MASSES:monoisotopic
MASSRANGE:1-50000
FIXMOD:
Propionamide (C)
VARMOD:
Oxidation (M)
Number of sequences = 2913
Number of peptides = 328870
Number of residues = 1127636
Average protein length = 387.1047
Average peptide length = 6.85764
Amino acid composition:
A  77864   6.905065
B  0       0
C  19699   1.746929
D  60082   5.328138
E  71415   6.333161
F  49887   4.424034
G  78176   6.932734
H  24755   2.195301
I  63335   5.616617
K  70226   6.227719
L  107257  9.511669
M  28530   2.530072
N  48521   4.302896
P  53531   4.747188
Q  40300   3.573848
R  58850   5.218883
S  91874   8.147487
T  58953   5.228017
V  78069   6.923245
W  13662   1.211561
X  3       0.0002660433
Y  32647   2.895172
Z  0       0
>sp|Q9LU15|AHP4_ARATH Histidine-containing phosphotransfer protein 4 - Arabidopsis thaliana (Mouse-ear cress).
MTNIGKCMQGYLDEQFMELEELQDDANPNFVEEVSALYFKDSARLINNIDQALERGSFDFNRLDSYMHQFKGSSTSIGASKVK
AECTTFREYCRAGNAEGCLRTFQQLKKEHSTLRKKLEHYFQASQ
0-5      0  662.342  MTNIGK
0-5      0  678.337  Oxidation (M)
6-39     0  4114.82  CMQGYLDEQFMELEELQDDANPNFVEEVSALYFK
6-39     0  4146.81  Oxidation (M) Oxidation (M)
6-39     0  4130.81  Oxidation (M)
40-43    0  447.208  DSAR
...
...
...
109-115  1  869.472  KEHSTLR
110-116  1  869.472  EHSTLRK
116-117  1  274.2    KK
117-126  1  1249.61  KLEHYFQASQ
>sp|Q8L9T7|AHP5_ARATH Histidine-containing phosphotransfer protein 5 - Arabidopsis thaliana (Mouse-ear cress).
MNTIVVAQLQRQFQDYIVSLYQQGFLDNQFSELRKLQDEGTPDFVAEVVSLFFDDCSKLINTMSISLERPDNVDFKQVDSGVH
QLKGSSSSVGARRVKNVCISFKECCDVQNREGCLRCLQQVDYEYKMLKTKLQDLFNLEKQILQAGGTIPQVDIN
0-10     0  1271.7   MNTIVVAQLQR
0-10     0  1287.7   Oxidation (M)
11-33    0  2837.37  QFQDYIVSLYQQGFLDNQFSELR
34-34    0  146.106  K
35-57    0  2631.2   LQDEGTPDFVAEVVSLFFDDCSK
...
...
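The second column of the peptide rows gives the number of missed cleavages. How peptides with missed cleavages can be derived from the fully cleaved pieces is sketched below in Python; the function name is hypothetical and the sketch is not the ProtDigest code. The input pieces are the C-terminal tryptic fragments of AHP4_ARATH, with positions counted from 0 as in the output above.

```python
from typing import List, Tuple

Piece = Tuple[int, int, str]          # (start, end, sequence), end inclusive
Peptide = Tuple[int, int, int, str]   # (start, end, missed cleavages, sequence)


def with_missed_cleavages(pieces: List[Piece], max_missed: int) -> List[Peptide]:
    """Join up to max_missed + 1 adjacent fully cleaved pieces."""
    out: List[Peptide] = []
    for i in range(len(pieces)):
        for missed in range(max_missed + 1):
            j = i + missed
            if j >= len(pieces):
                break
            seq = "".join(p[2] for p in pieces[i:j + 1])
            out.append((pieces[i][0], pieces[j][1], missed, seq))
    return out


if __name__ == "__main__":
    tail = [(109, 109, "K"), (110, 115, "EHSTLR"), (116, 116, "K"),
            (117, 117, "K"), (118, 126, "LEHYFQASQ")]
    for start, end, missed, seq in with_missed_cleavages(tail, 1):
        print(f"{start}-{end}\t{missed}\t{seq}")
```

With MAXMISSEDCLEAVAGE set to 1, the joined pieces reproduce the rows 109-115 KEHSTLR, 110-116 EHSTLRK, 116-117 KK and 117-126 KLEHYFQASQ shown above.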
7.4 Additional results
7.4.2 Mass Distributions
Fig.6.1: Oscillation of peptide mass distribution (chymotrypsin)
Fig.6.2: Oscillation of peptide mass distribution (formic acid)
8 References
[1] Electrophoresis. 1995 Jul
Progress with gene-product mapping of the Mollicutes: Mycoplasma genitalium.
Wasinger VC, Cordwell SJ, Cerpa-Poljak A, Yan JX, Gooley AA, Wilkins MR, Duncan MW, Harris R,
Williams KL, Humphery-Smith I.
[2] Electrophoresis. 1998 Aug;19(11):1941-9.
Towards an automated approach for protein identification in proteome projects.
Traini M, Gooley AA, Ou K, Wilkins MR, Tonella L, Sanchez JC, Hochstrasser DF, Williams KL.
[3] Electrophoresis. 1998 May;19(6):893-900.
Database searching using mass spectrometry data.
[4] Rapid Communications in Mass Spectrometry Volume 17, Issue 10, Pages 1044-1050
On-column digestion of proteins in aqueous-organic solvents
Gordon W. Slysz, David C. Schriemer
[5] Int J Mass Spectrom & Ion Proc 1987
Matrix-assisted ultraviolet Laser desorption of non-volatile compounds.
Karas M, Bachmann D, Bahr U, Hillenkamp F:
[6] Science 1989, 246:64-71.
Electrospray Ionization for Mass Spectrometry of Large Biomolecules.
Fenn JB, Mann M, Meng CK, Wong SF, Whitehouse CM:
[7] Mass Spectrometry Reviews Volume 23, Issue 5, Pages 368-389
Investigation of intact protein complexes by mass spectrometry
Albert J. R. Heck, Robert H. H. van den Heuvel
[8] Pept Res. 1994 May-Jun;7(3):115-24
Protein identification by peptide mass fingerprinting.
Cottrell JS.
[9] Rapid Commun Mass Spectrom. 2003;17(16):1825-34.
Matrix-assisted laser desorption/ionization directed nano-electrospray ionization tandem mass spectrometric
analysis for protein identification.
Kast J, Parker CE, van der Drift K, Dial JM, Milgram SL, Wilm M, Howell M, Borchers CH.
[10] Analyst. 1996 Jul;121(7):65R-76R.
Future prospects for the analysis of complex biological systems using micro-column liquid chromatography-electrospray tandem mass spectrometry.
Yates JR 3rd, McCormack AL, Link AJ, Schieltz D, Eng J, Hays L.
[11] Mass Spectrometry : Principles and Applications
Edmond De Hoffmann, Vincent Stroobant
[12] Anal Biochem. 1997 Aug 1;250(2):153-6.
Identification of proteins by matrix-assisted laser desorption ionization-mass spectrometry following in-gel
digestion in low-salt, nonvolatile buffer and simplified peptide recovery.
Fountoulakis M, Langen H.
[13] Anal Chem. 2003 Aug 15;75(16):4081-6.
Web and database software for identification of intact proteins using "top down" mass spectrometry.
Taylor GK, Kim YB, Forbes AJ, Meng F, McCarthy R, Kelleher NL.
[14] J. Mass Spectrom. 2002; 37: 663-675
Top down protein characterization via tandem mass spectrometry
Gavin E. Reid and Scott A. McLuckey
[15] Nucleic Acids Res. 2004 Jul 1;32(Web Server issue):W340-5.
ProSight PTM: an integrated environment for protein identification and characterization by top-down mass
spectrometry
Richard D. LeDuc, Gregory K. Taylor, Yong-Bin Kim, Thomas E. Januszyk, Lee H. Bynum, Joseph V.
Sola, John S. Garavelli, and Neil L. Kelleher
[16] Anal Biochem. 1994 Oct;222(1):44-8.
Acrylamide in Polyacrylamide Gels Can Modify Proteins during Electrophoresis
Bonaventura C., Bonaventura J., Stevens R. and Millington D.
[17] Rapid Commun Mass Spectrom. 1999;13(18):1818-27.
Probing the reactivity of S-S bridges to acrylamide in some proteins under high pH conditions by matrix-assisted laser desorption/ionisation.
Bordini E, Hamdan M, Righetti PG.
[18] Mass Spectrometry Reviews Volume 20, Issue 3, Pages 121-141
Monitoring 2-D gel-induced modifications of proteins by MALDI-TOF mass spectrometry
Mahmoud Hamdan, Marina Galvani, Pier Giorgio Righetti
[19] Electrophoresis 2001, 22, 1633-1644
Protein alkylation by acrylamide, its N-substituted derivatives and cross-linkers and its relevance to proteomics:
A matrix assisted laser desorption/ionization-time of flight-mass spectrometry study
Ellenia Bordini, Marina Galvani, Pier Giorgio Righetti
[20] Anal Biochem. 1990 Apr;186(1):116-20.
Formylated peptides from cyanogen bromide digests identified by fast atom bombardment mass spectrometry.
Goodlett DR, Armstrong FB, Creech RJ, van Breemen RB.
[21] Rapid Commun. Mass Spectrom. 13, 1143-1151 (1999)
Investigation of Some Covalent and Noncovalent Complexes by Matrix-assisted Laser Desorption/Ionization
Time-of-flight and Electrospray Mass Spectrometry
Ellenia Bordini and Mahmoud Hamdan
[22] Proteomics. 2003 Nov;3(11):2208-20.
Approaches for the quantification of protein concentration ratios.
Moritz B, Meyer HE.
[23] IUBMB Life. 2002 Aug;54(2):51-7.
Identification of modified proteins by mass spectrometry.
Sickmann A, Mreyen M, Meyer HE.
[24] J Biol Chem. 2001 Mar 30;276(13):10570-5. Epub 2001 Jan 09.
Alternative O-glycosylation/O-phosphorylation of serine-16 in murine estrogen receptor beta: post-translational
regulation of turnover and transactivation activity.
Cheng X, Hart GW.
[25] Nat Biotechnol. 1999 Oct;17(10):994-9.
Quantitative analysis of complex protein mixtures using isotope-coded affinity tags.
Gygi SP, Rist B, Gerber SA, Turecek F, Gelb MH, Aebersold R
[26] J Am Soc Mass Spectrom. 2002 Jan;13(1):22-39.
Scoring methods in MALDI peptide mass fingerprinting: ChemScore, and the ChemApplex program.
Parker KC.
[27] Arch. Biochem. Biophys, 161, 665-670
Surface Tension of Amino Acid Solutions: A Hydrophobicity Scale of the Amino Acid Residues.
Bull, Henry B. and Breese, Keith (1974)
[28] Anal. Biochem., 124, 201-208
The Isolation of Peptides by High-Performance Liquid Chromatography Using Predicted Elution Positions,
Browne, C. A., Bennett, H. P. J. and Solomon, S. (1982)
[29] Proc. Natl. Acad. Sci. USA 77; 1632 (1980)
Meek
[30] Anal. Biochem. 182; 319-326 (1989)
Gill, S.C. and von Hippel, P.H.
[31] Electrophoresis. 1999 Dec;20(18):3527-34.
Modeling peptide mass fingerprinting data using the atomic composition of peptides.
Gay S, Binz PA, Hochstrasser DF, Appel RD.
[32] Proteomics. 2002 Oct;2(10):1365-73.
Precalibration of matrix-assisted laser desorption/ionization-time of flight spectra for peptide mass
fingerprinting.
Wool A, Smilansky Z.
[33] http://www.expasy.org/tools/peptidecutter
[34] J Am Soc Mass Spectrom. 2002 Jan;13(1):22-39.
Scoring methods in MALDI peptide mass fingerprinting: ChemScore, and the ChemApplex program.
Parker KC.
[35] Current Biology (1993), vol 3, 327-332.
'Rapid Identification of Proteins by Peptide-Mass Fingerprinting'.
D.J.C. Pappin, P. Hojrup and A.J. Bleasby
[36] http://www.matrixscience.com
[37] http://www.expasy.org
[38] http://www.ncbi.nlm.nih.gov
[39] http://pir.georgetown.edu/
[40] http://irserver.rockefeller.edu/profound_bin/WebProFound.exe
[41] http://www.wikipedia.org
Figures:
Fig.1.3: Principle of MALDI
From Script_10_proteomics.pdf from lecture ‘algorithmic bioinformatics’ 02/03 by Prof. Reinert
http://www.inf.fu-berlin.de/inst/ag-bio/file.php?p=ROOT/Teaching/Lectures/WS0304/101,algbio_v.lecture.htm
Fig.1.6: (a) MS/MS spectrum
From http://arthritis-research.com/content/2/5/407/figure/F4
(b) Ion series
From http://www.matrixscience.com