LowHweeMeng_FYP

SIM UNIVERSITY
SCHOOL OF SCIENCE AND TECHNOLOGY
DEVELOPMENT OF A COMPUTATIONAL
MODEL FOR CALPAIN CLEAVAGE SITES
PREDICTION
STUDENT : LOW HWEE MENG (Z0704443)
SUPERVISOR : WEE JIN KIAT, LAWRENCE
PROJECT CODE: JUL2010/BME/039
A project report submitted to SIM University in partial fulfillment of the
requirements for the degree of Bachelor of Biomedical Engineering
May 2011
Page | 1
Acknowledgement
I would like to express my gratitude and thanks to the following people who have
made this Capstone Project possible:
• Dr Wee Jin Kiat, Lawrence, for his supervision, support and advice over the course
of the Capstone Project.
• My parents, loved ones and friends for their patience and support.
• My team leader, Mr. Thoreau Hervé and fellow colleagues at the Genome Institute
of Singapore, Genome Technology and Biology department for their patience and
understanding.
Table of Contents
Acknowledgement
Table of Contents
List of figures
List of tables
Abstract
Part 1
Chapter 1: Calpain
1.1. Calpain discovery and biology
1.2. Calpain superfamily and structure
1.3. Calpain and disease implication
1.3.1. Role of calpain in apoptosis
1.3.2. Role of calpain in neural degeneration
1.4. Challenges in deciphering protease cleavage
1.5. Project objectives
Chapter 2: Computational approaches to data classification
2.1 Introduction to Support Vector Machines (SVM)
2.2 Current perspective in calpain cleavage prediction
2.2.1 Sequential determinants of calpain cleavage
2.2.2 Group-based Prediction System-Calpain Cleavage Detector (GPS-CCD)
2.2.3 CaMPDB: a resource for calpain modulatory proteolysis
2.3 Summary
Chapter 3: Calpain dataset
3.1 Dataset collection
3.2 Data extraction and cleaning
3.3 Summary
Chapter 4: Prediction of Calpain Substrate Cleavage
4.1 Introduction
4.2 Materials and Methods
4.2.1 Calpain datasets
4.2.2 Symmetrical subsequence extraction
4.2.3 Asymmetrical subsequence extraction
4.2.4 Training and test dataset
4.2.5 Vector encoding schemes
4.2.5.1 Simple binary encoding
4.2.5.2 Bayes Feature Extraction (BFE) encoding
4.2.6 SVM implementation
4.2.7 SVM optimization
4.2.8 SVM training and testing
4.2.9 Linear sequence analysis of primary calpain dataset
4.2.9.1 Relative position-specific amino acid propensity
4.2.9.2 Sequence logo representation of calpain cleavage events
4.3 Results and discussion
4.3.1 Performance metrics of SVM prediction
4.3.2 Relative position-specific amino acid propensity
4.3.3 Sequence logo representation of calpain cleavage events
Chapter 5: Prediction of Receptor Tyrosine Kinases (RTKs) Family Proteins
5.1 Introduction to Receptor Tyrosine Kinases (RTKs)
5.2 Prediction of calpain cleavage of RTKs
5.3 Summary
Chapter 6: Conclusion
6.1 Summary of project report
6.2 Recommendations and future direction
Part 2
Chapter 7: Critical reviews and reflections
REFERENCES
Appendix A
Appendix B
Appendix C
Appendix D
Appendix E
List of figures
Figure 1-1: Schematic structures of calpain superfamily members across various organisms
Figure 1-2: Domain structure of the human calpain family
Figure 1-3: Crystallographic structure of human m-calpain
Figure 1-4: Schematic representation of calpain activation in various neurodegenerative diseases
Figure 2-1: Illustration of SVM concepts
Figure 3-1: A summary of the calpain dataset construction process
Figure 4-1: Symmetrical subsequence segments extracted for SVM training and testing
Figure 4-2: Asymmetrical subsequence segments extracted for SVM training and testing
Figure 4-3: A schematic representation of datasets used for SVM training and testing
Figure 4-4: Flowchart of SVM workflow
Figure 4-5: Graphical representation of the trends in SVM classifiers performance in terms of
A) accuracy and B) AROC scores for various subsequence windows
Figure 4-6: Heatmaps of position-specific amino acid intensities of A) positive examples,
B) negative examples and C) calculated propensity Px
Figure 4-7: Sequence logo representation of experimentally-verified calpain cleavage events
Figure 5-1: Construction of the 40-mer moving window in EGFR (P00533)
List of tables
Table 1-1: Members of the calpain family, encoding genes and associated polypeptides
Table 4-1: Summary of SVM prediction performance of classifiers trained using various
subsequences and encoding strategies
Table 4-2: Comparison of position-specific amino acid prevalence in the generated 40-mer
sequence logo versus findings by Tompa et al.
Table 5-1: Schematic maps of predicted calpain cleavage sites on the receptor tyrosine
kinase (RTK) family subset
Abstract
Calpains constitute an important family of calcium-dependent cysteine proteases
widely expressed in mammals and conserved across eukaryotes. Distinguished by
limited proteolysis of protein substrates at neutral pH, calpains modulate key
biological processes such as apoptosis, cytoskeletal organization and neuroendocrine
pathways. Aberrations of calpain function are implicated in cancers and
neurodegeneration. Despite numerous efforts to unravel calpain regulatory roles, the
precise mechanisms of substrate recognition and calpain-dependent cleavage have not
been fully established. Recently developed calpain cleavage site prediction
methods have achieved varying degrees of success and yielded interesting observations on
amino acid sequence conservation and the asymmetrical contributions of amino acids to
calpain substrate recognition. In this study, a set of 341 unique calpain substrate cleavage
sites was obtained from available databases and literature searches and analyzed. To
determine unique sequence features in calpain substrates, linear sequence analysis was
conducted via sequence logo and heatmap generation as well as derivation of amino acid
propensity. The analysis agreed with previous sequential studies and also revealed
significant propensity for alanine, tryptophan, methionine, proline and serine residues
within the P4-P4’ window and downstream regions of cleavage sites. Next, to
investigate the efficacy of developing a support vector machine (SVM)-based method
for calpain cleavage site prediction, a series of SVM classifiers designed to
encapsulate the cleavage sites with various extracted subsequences (symmetrical
P4P4’, P8P8’, P12P12’, P16P16’ and P20P20’ and asymmetrical P4P12’ and P12P4’), together
with a combined approach of simple binary and bi-profile Bayes Feature Extraction
(BFE) encoding, was implemented and evaluated. The SVM method achieved an accuracy
ranging from 71% to 86% and AROC scores ranging from 0.788 to 0.927 on independent
test sets, with significant improvement in overall performance with BFE encoding and
longer subsequence windows. Application of our best performing prediction model to a
subset of receptor tyrosine kinases (RTKs) revealed potential calpain regulation and
involvement in the apoptosis cascade as effectors of survival and growth signals. This
study has presented an SVM-based approach for calpain substrate cleavage site prediction,
highlighting its potential to complement experimental efforts to elucidate calpain
cleavage mechanisms and degradome. The content of this project has been accepted for
poster presentation at the 19th Annual International Conference on Intelligent Systems
for Molecular Biology and 10th European Conference on Computational Biology (ISMB/ECCB),
Vienna, 2011.
PART 1
Chapter 1: Calpain
1.1 Calpain discovery and biology
Proteases play an important role in the regulation of biological functions in the body.
Calpains (EC 3.4.22.52/53) constitute an important family of intracellular, calcium
(Ca2+) dependent, non-lysosomal cysteine proteases which exhibit limited proteolytic
activities at neutral pH, in contrast to complete digestion. Calpains and their numerous
homologues form a major protease family widely expressed in mammals and in
organisms such as plants, bacteria, yeast and fungi, and appear to be conserved across
eukaryotes. Limited proteolysis by calpains alters substrate structure, leading to
regulation of biochemical activities and cellular functions, which has earned calpains
the description of “intracellular modulators”.
Calpains are involved in important biological processes such as programmed cell death
(apoptosis), cytoskeletal organization and neuroendocrine secretory pathways.
Numerous calpain substrates are cytoskeletal and secretory pathway proteins, so their
cleavage affects cell structure, shape and cellular interactions. Cytoskeletal
degradation may disrupt secretory pathway dynamics, causing accumulation of large
intracellular protein aggregates from proteolytic end-products.
Calpain involvement in cytoskeletal protein proteolysis has been associated with the
pathology of neuronal diseases such as Huntington’s, Alzheimer’s and Parkinson’s
disease (Evans, et al., 2007).
Initial discoveries of calpain were reported in the 1960s, from calcium-dependent
proteolytic events detected in rat brain (Guroff, et al., 1964), and later in skeletal
muscles (Ishiura, et al., 1978). These events were attributed to “calcium-activated neutral
proteases” or CANP due to their calcium requirements and activity at neutral pH. In the
same study, Ishiura, et al. also purified the CANP molecule to homogeneity.
The first study on cDNA cloning of the calpain catalytic subunit gave structural
evidence of a chimeric molecule consisting of a cysteine protease, similar to papain
originating from papaya, and a calmodulin-like Ca2+-binding region (calmodulin being
a calcium-regulated signaling protein), leading to its initial nomenclature of “calpain”
(Ohno, et al., 1984). A nomenclature review of calcium-dependent proteinases unified
CANP and “calpain” under the single name calpain (Suzuki, 1991). Calpain is classified
under the papain superfamily, which includes Clan CA, family C1 and C2, forming three
distinct families, namely bleomycin-hydrolase (BLH)-type, papain-type and calpain-type
(Berti and Storer, 1995).
1.2 Calpain superfamily and structure
The calpain system comprises three molecules: two calcium (Ca2+) dependent
proteases, μ-calpain (calpain 1) and m-calpain (calpain 2), and calpastatin, a highly
specific inhibitor of both μ- and m-calpains. Being the best characterized members of
the superfamily, μ- and m-calpains are referred to as “classical” calpains, with μ and
m referring to the micromolar and millimolar Ca2+ requirements in vitro for protease
activity respectively. Both μ- and m-calpain consist of two distinct subunits, an 80-kDa
large catalytic subunit and a 28-kDa regulatory subunit, together forming a
heterodimer. The large subunits (μCL in μ-calpain and mCL in m-calpain) are
non-identical but share 55-65% amino acid sequence homology (Goll, et al., 2003).
Numerous genomic studies in the past two decades have led to the discovery of
hundreds of calpain-related homologues in various organisms, contributing to a
superfamily of versatile functions. In humans, fifteen genes have been discovered to
encode calpain-like protease domains, generating a diverse range of homologues with
varying functional domain combinations. Figure 1-1 depicts the schematic
representation of calpain superfamily members and homologues.
Figure 1-1: Schematic structures of calpain superfamily members across various organisms.
(Adapted from CaMPDB- Calpain for Modulatory Proteolysis Database)
Deciphering calpain superfamily domain structure is essential in understanding the calpain
structure-function relationship. Calpains can be classified into two general groups: typical
and atypical calpains (Figure 1-2). Typical calpains (1, 2, 8, 9, 11, 12 and 14) consist of
four well-established domain structures: domain I (autolytic activation); domain II
(cysteine catalytic site, constituting active sites IIa and IIb); domain III (C2-like
Ca2+-binding sites) and domain IV (calmodulin-like Ca2+-binding sites, resembling the
penta-EF hand family of polypeptides). An exception is calpain 3 (skeletal muscle-specific
calpain, p94), which possesses three additional characterizing regions, NS, IS1 and IS2
(Strobl, et al., 2000; Hosfield, et al., 2001).
Atypical calpains (5, 6, 7, 10, 13, and 15) are monomeric calpains, lacking the
calmodulin-like penta-EF hand sequences in domain IV. Instead, calpains 5, 6 and 10
possess a C. elegans TRA-3-like T-domain (Dear, et al., 1997; Horikawa, et al., 2000).
Calpain 7 possesses a large N-terminal domain, together with a PalB homologous
C-terminal domain resembling the PalB protease originating from A. nidulans (Franz, et
al., 1999). Calpain 15 was observed to be a vertebrate homolog of the D. melanogaster
small optic lobe gene (SOL), with high homology at the catalytic and C-terminal domains
(Kamei, et al., 1998).
Figure 1-2: Domain structure of the human calpain family.
(Adapted from Evans, et al., 2007)
As discussed earlier, classical calpains possess large catalytic subunits, μCL and mCL,
encoded by the CAPN1 and CAPN2 genes respectively. The calpain small subunit
encoded by the CAPN4 gene (calpain 4) consists of two domains, V and VI, and is common
to both calpain 1 and 2 (Franco and Huttenlocher, 2005). Figure 1-3 shows the
crystallographic structure of human m-calpain made up of large subunit mCL domains
and small regulatory subunit domains. Other typical calpains share a similar large
subunit domain structure with the classical calpains; however, they do not form
heterodimers with the small subunit.
Figure 1-3: Crystallographic structure of human m-calpain.
Domains dI, dIIa, dIIb, dIII, dIV, dV and dVI are labeled in different colors. I-II is the
α-helix linking domains I and IIa. The linker domain is represented by a red line running
from the gap between dIII and dIV to the bottom right of the diagram labeled III-IV. The
active-site residues cysteine (Cys-105), histidine (His-262), asparagine (Asn-286) and
tryptophan (Trp-288) are highlighted in gray at the top of domain IIb.
(Adapted from Reverter, et al., 2001)
Despite differences in Ca2+ requirements, the activation mechanism for calpain 1 and
2 is similar: binding of multiple Ca2+ ions disrupts the salt bridges that hold
the cysteine catalytic site (active sites IIa and IIb) in an open conformation, allowing
it to close and initiate proteolytic activity (Bozoky, et al., 2005). Regulation of calpain
activity after substrate cleavage occurs through autolysis, with intermolecular cleavage
of domains I and V resulting in dissociation of the subunits. A summary of the diverse
members of the calpain family, their encoding genes and associated polypeptides is shown
in Table 1-1.
Table 1-1: Members of the calpain family, encoding genes and associated polypeptides.
(Adapted from Goll, et al., 2003)
Understanding calpain substrate recognition, specificity and role in the regulatory
modulation of biological processes is crucial, as it can give valuable information for the
identification of novel calpain substrates and regulatory pathways, a key driver for
in-depth studies on calpain.
1.3 Calpain and disease implication
1.3.1 Role of calpain in apoptosis
Apoptosis is an essential physiological process, critical in development and tissue
homeostasis. Defective apoptotic processes are implicated in various
diseases. Up- or down-regulation of apoptosis may lead to atrophy or to uncontrolled cell
proliferation, which results in cancer. Regulation of apoptosis involves a series of signal
molecules, receptors, gene regulating proteins and enzymes. Calpain’s role in the
caspase-cascade signaling system in apoptosis regulation was reviewed by Fan, et al.
(2005), who reported co-involvement of other molecules such as the inhibitor of apoptosis
protein (IAP) and Bcl-2 family proteins.
Studies have shown that calpains act as both positive and negative regulators in
apoptosis. Chua, et al. (2000) reported negative regulation via consequential inactivation
of caspase-7 and -9 through calpain cleavage. Nakagawa and Yuan (2000) suggested
positive apoptotic regulation through m-calpain cleavage of procaspase-12, forming an
active caspase which cleaves the Bcl-xl loop region, converting an antiapoptotic molecule
into a proapoptotic molecule. Elucidation of calpain’s role in apoptosis is difficult due to
the number of proteolytic enzymes involved in apoptotic pathways, and the presence of
substrates shared with caspases, e.g. fodrin and ADP-ribosyltransferase/PARP.
1.3.2 Role of calpain in neural degeneration
Dysfunctions in calcium homeostasis may lead to pathological activation of calpain in
several neurodegenerative diseases. Calpain activation via calcium dysregulation leads to
the cleavage of several neuronal substrates involved in neuronal structure and function,
inhibiting neuronal survival mechanisms and contributing to acute and chronic
neurodegenerative diseases such as cerebral ischemia, Alzheimer’s disease, Parkinson’s
disease and Huntington’s disease. A comprehensive review of the mechanisms behind calcium
dysregulation, calpain-mediated signaling and calpain involvement in neurodegeneration
was reported by Vosler, et al. (2008) and is summarized in Figure 1-4.
Figure 1-4: Schematic representation of calpain activation in various neurodegenerative diseases.
Ischemia, traumatic brain injury, and epilepsy cause an acute increase in glutamate release resulting in increased intracellular calcium. Chronic
neurodegenerative diseases AD, ALS, and PRE result in increased NMDA receptor activation, while calcium dysregulation in HD and PD is
attributed to mitochondrial dysfunction. In MS, pathologic calpain activation is initiated by T-cells and propagated by other immune cells such as
macrophages and microglia. (Adapted from Vosler, et al., 2008)
1.4 Challenges in deciphering protease cleavage
In vitro characterization of proteases and substrates involves several biochemical
steps in which proteases and protein substrates of interest are purified from biological
sources such as cultured cells and tissues, or from in vitro protein expression studies.
Purification of protein substrates and proteases to high purity and homogeneity is
challenging: because proteins are sensitive to pH and temperature, maintaining native
enzymatic activity and structure requires suitable non-denaturing purification
conditions.
Purified protein substrates are incubated with proteases and the cleavage products are
analyzed through combinations of gel electrophoresis, reverse-phase high-performance
liquid chromatography (RP-HPLC), N-terminal sequencing or mass
spectrometry. Alternative approaches may also combine genetics and
proteomics. For example, site-directed mutagenesis of the gene encoding a target protein
substrate in an animal model can alter the amino acids at a known cleavage site to
generate a non-cleavable site. This constitutes a form of gene knock-out study in which
the subsequent protein activity and function are examined.
Despite the wide array of biochemical analyses of proteolytic activity in vitro,
knowledge of the in vivo activities and relevant substrates of proteases remains limited.
The key reason is that proteases generally do not function individually in vivo, but in
cascades and regulatory circuits, often in the presence of other proteins acting as
substrates, activators, inhibitors and proteases. To overcome this, it is necessary to
examine proteolysis on a system-wide scale, covering the collection of proteases
expressed in a cell (protease degradome), all the substrates of a protease
(substrate degradome) and their state of cleavage in the complex biological
environment.
An example of a system-wide degradomic study was reported by Overall, et al.
(2004), who developed a dedicated and complete human protease and
inhibitor microarray, CLIP-CHIP, designed to identify the expression levels of
all 715 human proteases, inactive homologs and inhibitors in cells and tissues. The
same study also described ICDC (inactive catalytic domain capture), a novel
yeast two-hybrid system that discovers protease substrates by capturing them with a
mutated, inactive catalytic domain.
Together, system-wide studies and high-throughput proteomics have increased the
rate of novel substrate discovery, but not without limitations.
Identification of the proteolytic cleavage products of substrates in their biological
environment may still prove to be a great challenge, highlighting the necessity to
develop complementary tools to aid in the analysis of protease degradomes.
1.5 Project objectives
In recent years, the discovery of calpain substrates and the growth of
available protein sequence data have led to the creation of useful databases, predictive
algorithms and tools for research applications. Among the recent efforts are CutDB, a
proteolytic events database aimed at documenting in vivo and in vitro proteolytic events
for natural proteins; CaMPDB, a dedicated resource for calpain modulatory proteolysis and
prediction tools; and GPS-CCD, a specialized web-tool developed for the prediction
of calpain cleavage sites. These studies are discussed in more detail in the
later sections of this report.
With increased research detailing the mechanisms of calpain substrate cleavage and
accumulation of data on calpain substrates, the development of computational
prediction methods for calpain substrates is becoming increasingly achievable.
Calpain cleavage prediction models can screen a wide
range of novel substrates for potential proteolytic activity in silico, efficiently
assessing their involvement in calpain modulation before tedious experimental
procedures are undertaken to verify positive cleavage or interactions. In addition,
calpain’s involvement in cancers and neurodegenerative diseases makes it a potentially
important pharmacological drug target for inhibition and therapies. To achieve this,
a deeper understanding of calpain-mediated proteolysis, substrate cleavage site
recognition and specificity is needed. These potential benefits lead to the
main objective of this project: to develop an accurate computational model for the
prediction of calpain cleavage sites, to serve as a complementary tool to experimental
procedures in the understanding of calpain proteolytic modulation.
Chapter 2: Computational approaches to data classification
2.1. Introduction to Support Vector Machine (SVM)
Support vector machines (SVMs) are supervised learning algorithms used for statistical
classification and regression analysis. SVM classifiers are trained on positive and
negative training examples. The SVM training algorithm uses the training information to
build a model that predicts which of the two categories a new data point falls into. An
SVM represents the data points as vectors mapped into a high-dimensional space and
separates them using four concepts:
1) a separating hyperplane that divides the distinct classes of data
points, 2) a maximum-margin hyperplane that maximises the margin between the two
categories, 3) a soft margin with user-specifiable parameters that control the stringency
of classification of anomalous data points and 4) a kernel function that projects
data from a low-dimensional space to a higher dimension to improve the classification of
data that are not linearly separable (Noble, 2006). The constructed SVM model can then be
used to classify new examples by mapping them into the same space.
Figure 2-1 illustrates the concepts of SVM.
Figure 2-1: Illustration of SVM concepts. A) Two linearly-separable classes, A and B;
represented in two-dimensional space. B) Demonstration of vector mapping from
two-dimensional input space to feature space at higher dimensions using kernel
functions for non-linearly separable data.
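The four concepts above can be written compactly. The following is the standard textbook soft-margin formulation, given here as a sketch rather than as the exact formulation used later in this report. The symbols are assumptions introduced for illustration: x_i are the encoded training vectors, y_i in {-1, +1} their class labels, w and b the hyperplane parameters, xi_i the slack variables of the soft margin, C the user-specified penalty, phi the feature map and K the kernel.

```latex
\begin{aligned}
\min_{\mathbf{w},\,b,\,\boldsymbol{\xi}}\quad
  & \tfrac{1}{2}\lVert \mathbf{w}\rVert^{2} + C\sum_{i=1}^{n}\xi_i \\
\text{subject to}\quad
  & y_i\bigl(\mathbf{w}\cdot\phi(\mathbf{x}_i) + b\bigr) \ge 1 - \xi_i,
    \qquad \xi_i \ge 0,\quad i = 1,\dots,n \\[4pt]
f(\mathbf{x}) &= \operatorname{sign}\Bigl(\sum_{i=1}^{n}\alpha_i\,y_i\,
    K(\mathbf{x}_i,\mathbf{x}) + b\Bigr),
    \qquad K(\mathbf{x},\mathbf{x}') = \phi(\mathbf{x})\cdot\phi(\mathbf{x}')
\end{aligned}
```

Minimising the norm of w maximises the margin, C sets how heavily margin violations are penalised, and the decision function f depends on the data only through K, so the feature map never has to be computed explicitly. A commonly used kernel is the radial basis function, K(x, x') = exp(-gamma * ||x - x'||^2), with gamma a further user-chosen parameter.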
2.2. Current perspective in calpain cleavage prediction
Several studies have been done to examine the precise and specific recognition of
cleavage sites of calpains to better understand the mechanisms of modulatory
proteolytic processing. To date, amino acid sequence specificity of cleavage by calpain
has not been established, although some preference with regards to amino acid residues
in the vicinity of calpain cleavage has been reported.
2.2.1. Sequential determinants of calpain cleavage
In a bid to determine the relationship between structural information and specificity of
substrate recognition by calpain, Tompa, et al. (2004) examined the amino acid
preferences of calpain 1 and 2. Forty-nine calpain substrates with 106 sequentially
identified cleavage sites were collected from the literature and analyzed for amino acid
preference surrounding the scissile bond. A position-specific preference matrix was
constructed from amino acid occurrence in positions P4-P7’ and normalized to the average
frequency of the same amino acid in the entire Swiss-Prot and TrEMBL databases.
Preferred residues were reported to be leucine, threonine and valine in the P2 position
and lysine, tyrosine and arginine in the P1 position. These preferences coincide with
earlier comparative specificity and kinetic studies involving naturally occurring peptides
and synthetic fluorogenic substrates with calpain 1 and 2 (Sasaki, et al., 1984), and with
the interference with calpain activity caused by site-directed mutagenic substitution of
the amino acid at the P2 position (Val1175) of the αII spectrin (fodrin) cleavage site
(Stabach, et al., 1997).
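The construction of a position-specific preference matrix can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure of Tompa, et al.: the function names, the two toy windows and the uniform background frequencies are all hypothetical, and a real analysis would use background frequencies taken from a sequence database.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def window(seq, site, up=4, down=7):
    """Return the residues P{up}..P1 and P1'..P{down}' around a scissile bond.

    `site` is the 1-based position of the P1 residue, so the bond lies between
    seq[site - 1] and seq[site]. Returns None if the window leaves the sequence.
    """
    start, end = site - up, site + down
    if start < 0 or end > len(seq):
        return None
    return seq[start:end]


def preference_matrix(windows, background):
    """For each window position, divide the observed frequency of every
    residue seen there by that residue's background frequency."""
    n = len(windows)
    width = len(windows[0])
    matrix = []
    for pos in range(width):
        counts = Counter(w[pos] for w in windows)
        matrix.append({aa: (c / n) / background[aa] for aa, c in counts.items()})
    return matrix


# Hypothetical placeholder: a uniform background, NOT database-derived values.
uniform_background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

# Two made-up P4-P7' windows, used only to exercise the functions.
toy_windows = [window("MAGTLKSPEDQRA", 6), "AALKSAAAAAA"]
toy_matrix = preference_matrix(toy_windows, uniform_background)
```

With `up=4`, the P2 residue sits at index 2 of each window and the P1 residue at index 3. A ratio above 1 at a given position indicates a residue that occurs there more often than the background would suggest.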
The influence of higher-order structural elements on calpain cleavage was reported by
Sakai, et al. (1987) from proteolysis of calf thymus histone by calpain 2, notably the
non-cleavage of known susceptible bonds in peptide fragments generated from
degradation of the intact histones. The contributions of calmodulin (CaM)-binding motifs
and the vicinity of PEST (Pro, Glu (Asp) and Ser/Thr) regions to calpain substrate
recognition have been much debated. Wang, et al. (1989) highlighted the occurrence of
CaM-binding motifs in calpain substrates and noted that cleavage site recognition often
occurs adjacent to a PEST region. Molinari, et al. (1995), however, showed that lower PEST
scores generated by mutation of domains surrounding the CaM-binding regions of
Ca2+-ATPase had no influence on its susceptibility to calpain. These findings, together
with an overriding amino acid preference, may be the reason behind the wide array of
calpain substrates and the lack of strong sequence specificity and homology between
reported cleavage sites of different calpain substrates.
2.2.2. Group-based Prediction System-Calpain Cleavage Detector (GPS-CCD)
To address the lack of specialized predictors for calpain substrate cleavage sites,
GPS-CCD (Group-based Prediction System-Calpain Cleavage Detector) was developed by
Liu, et al. (2010) as a web-tool for calpain cleavage site prediction. GPS-CCD1.0
was based on the previously developed GPS2.0 algorithm of Xue, et al. (2008),
which rests on the hypothesis that short peptides sharing similar biochemical properties
and 3D structures can be evaluated for similarity using suitable amino acid
substitution matrices. This led to the development of a novel Matrix Mutation (MaM)
approach (Xue, et al., 2008; Ren, et al., 2008; Ren, et al., 2009), which was employed
in the final GPS-CCD1.0. Prediction of a putative calpain cleavage peptide is
accomplished via similarity scoring from pairwise comparison to experimentally verified
cleavage bonds. The first reported GPS-CCD1.0 was developed with 265
experimentally verified calpain cleavage sites from 102 proteins obtained from data
mining of the literature, inclusive of notable contributions by Tompa, et al.
(2004).
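The idea of scoring a candidate peptide by pairwise comparison to known cleavage peptides can be illustrated with a short Python sketch. This is not the published GPS-CCD algorithm: the function names are hypothetical, the averaging step is an assumption made for illustration, and the identity scoring function is a stand-in for a real amino acid substitution matrix.

```python
def identity_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup:
    1 for identical residues, 0 otherwise."""
    return 1 if x == y else 0


def peptide_similarity(a, b, score=identity_score):
    """Sum the per-position substitution scores of two equal-length peptides."""
    if len(a) != len(b):
        raise ValueError("peptides must have equal length")
    return sum(score(x, y) for x, y in zip(a, b))


def candidate_score(candidate, known_sites, score=identity_score):
    """Average similarity of a candidate peptide to every known cleavage peptide."""
    total = sum(peptide_similarity(candidate, k, score) for k in known_sites)
    return total / len(known_sites)


# Made-up peptides, used only to exercise the functions.
example = candidate_score("LKSA", ["LKSA", "LKTA"])  # (4 + 3) / 2 = 3.5
```

A candidate would then be called a putative cleavage site when its score exceeds a chosen threshold. Swapping `identity_score` for a lookup into a real substitution matrix changes the scores but not the structure of the comparison.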
Performance evaluation was achieved through leave-one-out validation and 4-, 6-, 8-,
10-fold cross-validations, with best accuracy of 89.80%, sensitivity of 66.42% and
specificity of 89.86%. The current accessible version of GPS-CCD1.0 (dated 26th
February 2011) reported 368 experimentally-verified calpain cleavage sites in 130
proteins. Performance validation of the system achieved best accuracy of 89.98%,
sensitivity of 60.87% and specificity of 90.07%.
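The three reported metrics follow directly from the counts of true and false positives and negatives. The sketch below uses hypothetical counts, not GPS-CCD data, to show how accuracy can sit close to specificity while sensitivity is much lower when negative examples greatly outnumber positive ones; whether this explains the figures above is an assumption, since the class balance is not stated here.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": tp / (tp + fn),  # fraction of true sites recovered
        "specificity": tn / (tn + fp),  # fraction of non-sites rejected
    }


# Hypothetical counts: 100 true sites against 10,000 non-sites.
metrics = classification_metrics(tp=60, fp=1000, tn=9000, fn=40)
```

With these counts the sensitivity is 0.60 and the specificity 0.90, while the accuracy is 9060/10100, roughly 0.897, because it is dominated by the much larger negative class.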
2.2.3. CaMPDB: a resource for calpain modulatory proteolysis
To encapsulate the abundant existing information on calpain, its substrates and its
specific inhibitor, calpastatin, duVerle, et al. (2010) developed CaMPDB, a resource for
calpain modulatory proteolysis. A total of 267 cleavage sites were collected from 104
known calpain substrates reported in the literature. Extensive enhancement of the calpain
database led to the development of three calpain cleavage site prediction tools based
on PSSM, linear and radial basis function (RBF) SVM algorithms. Performance of
the prediction methods was evaluated using Area under the ROC Curve (AUC) with
10x10-fold cross-validation. Maximal values of 69.1%, 77.3% and 80.1% were
reported for the PSSM method (window length, L = 2x30), SVM linear (L = 2x7) and
SVM RBF (L = 2x10) respectively, with the best prediction performances of the SVM-based
methods achieved within ten amino acids of the cleavage sites. This coincided with
the highly specific and firm binding of calpastatin to the calpain protease domain over
approximately twenty amino acids (Tompa, et al., 2004). The significant increase in
prediction performance with the RBF kernel over the linear kernel suggested strong
non-linear correlations between amino acid positions and cleavage. Window length
variation analysis centered about the cleavage sites revealed asymmetry in the
performance of linear and RBF kernel SVM predictors, with statistically improved
performance on the right side of the cleavage site.
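The AUC used to compare these predictors has a simple interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A minimal Python sketch of this rank-based definition follows; the function name and the scores are hypothetical and are not taken from CaMPDB.

```python
def auc(pos_scores, neg_scores):
    """Area under the ROC curve as the fraction of positive/negative pairs
    in which the positive scores higher; ties count as half a win."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up classifier scores for three cleavage sites and three non-sites.
example_auc = auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2])  # 8 of 9 pairs ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation. Because it depends only on the ranking of scores, it is not affected by the choice of decision threshold.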
2.3. Summary
We have briefly introduced the concepts behind SVM and discussed the current
perspective on calpain cleavage site prediction, available methodologies and prediction
tools. These tools, amongst other calpain studies, provide us with an information base to
develop our calpain cleavage site prediction model. For this project, we have chosen to
implement SVM for the development of the calpain cleavage site prediction model due
to its accuracy and performance in the classification of biological data, such as the
prediction of protein fold and interactions (Ding and Dubchak, 2001; Zhang, et al., 2003)
and caspase cleavage (Wee, et al., 2006), and its versatility via available kernel
functions to aid in the classification of non-linear data. Calpain substrate data
collection, processing, SVM implementation and application of the developed SVM prediction
model on a protein family subset will be discussed in the subsequent chapters.
Chapter 3: Calpain dataset
3.1 Dataset collection
For the development of any computational prediction model, it is critical to utilize
accurate and reliable data. Data integrity is of utmost importance in the development of
accurate prediction models. Two distinct problems greatly hinder the development of
computational prediction models: data quality and quantity. Inaccuracy in primary data
during modeling may result in an end model that produces fallacious results whereas
sparse data may affect predictive patterns, leading to significantly less robust
prediction models. To successfully develop the cleavage prediction model for calpain
substrates, it is critical for data to be collected from experimentally-verified calpain
cleaved proteins. To construct the primary calpain dataset, efforts were taken to extract
calpain cleavage information from currently available databases: CutDB, CaMPDB,
GPS-CCD1.0 and through literature searches.
CutDB is a concerted effort by Igarashi, et al. (2007) to document proteolytic events
for natural proteins in vivo or in vitro, organized with three key attributes: protease,
protein substrate and cleavage site information. At publication, the database consisted
of a total of 3,070 proteolytic events for 470 different proteases, with information
captured from publicly available databases, MEROPS, Human Protein Reference
Database (HPRD) and publications. The extraction of all calpain-mediated proteolytic
events deposited in CutDB was achieved via a keyword search using “calpain” carried
out under “Protease definition”. A total of 449 hits were obtained for both calpain 1
and 2 mediated proteolytic events. Events without concise protein sequence or
cleavage information were omitted to obtain a total of 286 calpain cleavage events.
Supplementary calpain substrate data reported by GPS-CCD1.0 were obtained and
referenced for the compilation of the primary calpain dataset. From a total of 265
experimentally-verified calpain cleavage site entries, 200 cleavage events previously
absent in CutDB were obtained. From CaMPDB, a total of 104 calpain substrates,
labeled “SB”, with 267 cleavage sites were reported. From these, 54 calpain substrates
with 120 cleavage events previously absent in CutDB and GPS-CCD1.0 were collected.
To ensure that the calpain data collected encompassed all recent publications, a
comprehensive search was conducted on journal articles available in PubMed. Several
permutations of keywords related to calpain substrate cleavage such as “calpain”,
“cleavage” and “substrates” were used as search entries for the period from 1st Jan
2009 through 31st Dec 2010, selected to overlap a minority of the existing collected
information. Abstracts of the search output were screened for indication of experimental
verification of calpain cleavage events (e.g. in vitro enzymatic assays and cleavage
sites) and suitable publications with available full text were reviewed for exact
cleavage information. Although some journal articles may have been
omitted due to the absence of these keywords, this is assumed to have minimal impact on the
final dataset. This process resulted in the identification of 7 previously unreported
substrates contributing 20 cleavage sites.
3.2 Data extraction and cleaning
Extraction and cleaning of all collected calpain data was done in four major steps.
Firstly, entries labeled “putative”, “predicted” or “inferred from homology”
were omitted. Secondly, to eliminate typographical errors and ensure consistency of
amino acid residues surrounding the reported scissile bond, the protein sequence and
cleavage site information were cross-referenced to the Uniprot database (Uniprot
Consortium, 2010) through the reported Uniprot ID or keyword searches via substrate
name. Ambiguous entries identified were verified with the original publication where
necessary. For example, for vimentin (entry SB: 37 in CaMPDB), 10 cleavage sites were
erroneously reported with a single amino acid residue shift to the left, because the
authors excluded the initiator methionine (“M”) from their amino acid count. All
10 cleavage sites were corrected with reference to the original publication and the
canonical vimentin sequence deposited in Uniprot.
Next, the full protein sequences of all verified calpain substrates were obtained from
Uniprot for dataset construction. For each reported calpain cleavage site, peptide
sequences of twenty amino acid residues up and downstream of the reported scissile
bond were extracted, resulting in a set of 40-mer calpain substrate sequences, each centered
on its reported cleavage site.
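The window extraction described above can be sketched in Python as follows. This is a minimal illustration, not the script used in the project: the function name `extract_window`, the 1-based `p1_position` convention and the padding of positions that fall outside the protein with "X" are all assumptions, since the report does not state how sites near the termini were handled.

```python
def extract_window(sequence, p1_position, flank=20, pad="X"):
    """Return the residues P{flank}..P1 followed by P1'..P{flank}'.

    p1_position is the 1-based index of the P1 residue, i.e. the residue
    immediately N-terminal to the scissile bond (assumed convention).
    Positions outside the protein are filled with `pad` (assumption).
    """
    if not 1 <= p1_position < len(sequence):
        raise ValueError("scissile bond must lie inside the sequence")
    # Residues to the left of the bond: P{flank} ... P1
    left = sequence[max(0, p1_position - flank):p1_position]
    # Residues to the right of the bond: P1' ... P{flank}'
    right = sequence[p1_position:p1_position + flank]
    return left.rjust(flank, pad) + right.ljust(flank, pad)


# Usage on a made-up 60-residue sequence (not a real substrate)
toy_sequence = "ACDEFGHIKLMNPQRSTVWY" * 3
window = extract_window(toy_sequence, 30)
print(len(window))  # 40
```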
Lastly, streamlining of the extracted data was done by the removal of redundant
sequences (100% identity) contributed by high inter-species protein similarity.
Duplicate entries occurring due to protein isoforms were reviewed and condensed
where applicable. Figure 3-1 summarizes the calpain data collection process.
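As a rough sketch of the redundancy removal step, 40-mer windows that are 100% identical can be collapsed while keeping the first entry seen. The record layout (accession, window) and the function name are hypothetical, and the manual review of isoform duplicates described above is not reproduced here.

```python
def remove_redundant(records):
    """Keep the first record for each distinct 40-mer window.

    records is a list of (accession, window) tuples (hypothetical layout).
    Only exact (100% identity) duplicates of the window are removed.
    """
    seen = set()
    unique = []
    for accession, window in records:
        if window not in seen:
            seen.add(window)
            unique.append((accession, window))
    return unique


# Usage with made-up accessions and shortened windows
records = [("P00001", "AAAACCCC"), ("Q00002", "AAAACCCC"), ("P00003", "GGGGTTTT")]
print(remove_redundant(records))
```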
Figure 3-1: A summary of the calpain dataset construction process.
3.3 Summary
A total of 341 unique 40-mer (P20P20’) polypeptide sequences from 130 protein
substrates were collected to constitute the final “cleaned” calpain dataset for the
development of the calpain cleavage prediction model. The final primary datasets of
calpain substrates and their cleavage site information are documented in Appendix A
(Table A-1 and A-2).
Chapter 4: Prediction of Calpain Substrate Cleavage
4.1 Introduction
In the calpain cleavage prediction studies discussed earlier, superior performance of
SVM-based prediction methods was reported within ten amino acids flanking the
cleavage sites. Window length variation centered about the cleavage sites hinted at
asymmetry in linear and RBF kernel SVM classifier performance, with statistically
improved performance on the right of the cleavage site. Analysis of calpain inhibition
by calpastatin, a highly specific inhibitor of calpain, suggested an approximate twenty
amino acid binding specificity of the protease domain. These findings on
calpain substrate cleavage provided the impetus to investigate the influence of adjacent
amino acid sequences on calpain substrate cleavage with respect to 1) effects of
varying window length on calpain cleavage site prediction, 2) asymmetrical
contributions of amino acids to calpain substrate binding and cleavage and 3) amino
acid occurrences through linear sequence analysis of the primary calpain dataset.
4.2 Materials and Methods
4.2.1 Calpain datasets
In Chapter 3, we obtained a calpain dataset containing 341 unique calpain
cleavage sites from 130 substrates. Due to the absence of experimentally determined
calpain non-cleavage sites, random positions were extracted from experimentally-verified
calpain substrates. One random non-cleavage site was generated for every
reported cleavage site on the same substrate, resulting in an equal
number of non-cleavage sites and experimentally-verified calpain cleavage sites. For
each random non-cleavage site, a 40-mer peptide sequence was extracted in the same
manner as described earlier. Together, a primary calpain dataset containing 682 entries
of 40-mer peptide sequences, each centered on a reported cleavage site (341 positive
examples) or a non-cleavage site (341 negative examples), was constructed and
designated as the P20P20’ dataset.
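A minimal sketch of the negative-site sampling is given below. It assumes that random positions are drawn without replacement and that positions coinciding with a reported cleavage site are excluded; the report only states that the positions were random, so both choices, as well as the function name, are assumptions.

```python
import random


def sample_negative_sites(seq_length, cleavage_p1_positions, rng=None):
    """Draw one random non-cleavage P1 position per reported cleavage site.

    Positions are 1-based P1 indices, so valid bonds are 1 .. seq_length - 1.
    Reported cleavage positions are excluded and no position is drawn twice
    (both are assumptions, not stated in the report).
    """
    rng = rng or random.Random()
    known = set(cleavage_p1_positions)
    candidates = [p for p in range(1, seq_length) if p not in known]
    if len(candidates) < len(known):
        raise ValueError("substrate too short to sample enough negatives")
    return sorted(rng.sample(candidates, len(known)))


# Usage: a hypothetical 100-residue substrate with three reported sites
print(sample_negative_sites(100, [10, 50, 70], random.Random(1)))
```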
4.2.2 Symmetrical subsequence extraction
To investigate the influence of adjacent amino acid sequences on calpain substrate
cleavage, we constructed four additional symmetrical datasets containing the reported
cleavage site flanked by four, eight, twelve and sixteen amino acid residues on either
side, forming varying window lengths, P4P4’, P8P8’, P12P12’ and P16P16’ (see Figure 4-1).
Figure 4-1: Symmetrical subsequence segments extracted for SVM training and testing. For
Human CDK5R2 (Uniprot: Q13319), an extracted sequence window of 40 amino acids is
centered on the octapeptide cleavage site, QQRNRENL (underlined). Amino acids to the left
of the scissile bond (indicated by the inverted triangle) are labeled P1 (N) to P20 (K). Amino
acids to the right of the scissile bond are labeled P1’ (R) to P20’. Curly brackets show the
symmetrical subsequences extracted for SVM implementation, P4P4’, P8P8’, P12P12’, P16P16’
and P20P20’ respectively.
4.2.3 Asymmetrical subsequence extraction
To investigate the hypothesis of asymmetrical contributions of flanking amino acids to
calpain substrate binding and cleavage, we further constructed two asymmetrical
datasets encapsulating the scissile bond, extended by four amino acids on one side and
twelve on the other, to generate the P4P12’ and P12P4’ subsequences respectively (see Figure 4-2).
Figure 4-2: Asymmetrical subsequence segments extracted for SVM training and testing.
Similar to the previous figure, for Human CDK5R2 (Uniprot: Q13319), an extracted sequence
window of 40 amino acids is centered on the octapeptide cleavage site, QQRNRENL
(underlined). Curly brackets show the asymmetrical subsequences extracted for SVM
implementation, P4P12’ and P12P4’ respectively.
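Both the symmetrical and asymmetrical subsequences can be obtained by slicing the parent P20P20’ window around its midpoint. The helper below is an illustrative sketch; the name `subwindow` and its argument order are assumptions.

```python
def subwindow(parent, n_left, n_right, flank=20):
    """Slice P{n_left} .. P{n_right}' out of a P{flank}P{flank}' window.

    The scissile bond lies between index flank - 1 (P1) and index flank (P1').
    """
    if len(parent) != 2 * flank:
        raise ValueError("parent window must contain 2 * flank residues")
    if not (0 < n_left <= flank and 0 < n_right <= flank):
        raise ValueError("requested window exceeds the parent window")
    return parent[flank - n_left:flank + n_right]


# Usage on a toy parent window: 20 'A' residues left, 20 'C' residues right
parent = "A" * 20 + "C" * 20
print(subwindow(parent, 4, 4))    # P4P4'   -> AAAACCCC
print(subwindow(parent, 4, 12))   # P4P12'  -> 4 'A' followed by 12 'C'
print(subwindow(parent, 12, 4))   # P12P4'  -> 12 'A' followed by 4 'C'
```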
4.2.4 Training and test dataset
After extraction of the symmetrical and asymmetrical subsequences, the primary calpain
dataset was randomly divided into training and test datasets, and this division was
maintained throughout the subsequent sections of the project.
The training dataset contained 582 sequences (291 positive and 291 negative examples)
and was used for the optimization of SVM parameters and training of the
final SVM classifier for prediction of unseen test examples.
The test dataset contained 100 sequences (50 positive and 50 negative examples)
and was used for the performance evaluation of the final
classifier. Figure 4-3 shows the segregation of the various symmetrical and asymmetrical
datasets.
Figure 4-3: A schematic representation of datasets used for SVM training and testing. The
primary 40-mer (P20P20’) dataset consists of non-redundant calpain cleavage sites (positive
examples) and an equal number of non-cleavage sites (negative examples). The P20P20’ dataset
constitutes the parent sequence for the derivation of the symmetrical P4P4’, P8P8’, P12P12’ and
P16P16’ and asymmetrical P4P12’ and P12P4’ subsequences respectively.
4.2.5 Vector encoding schemes
To encapsulate the extracted sequence information in an SVM-compatible format for
training and testing, the sequences were transformed into input vectors using two encoding
schemes: simple binary encoding and bi-profile Bayes Feature Extraction (BFE) encoding.
4.2.5.1 Simple binary encoding
In simple binary encoding, sequences were transformed into n-dimensional vectors
using an orthonormal encoding scheme, with each amino acid represented by a
20-dimensional vector composed of either zero or one as elements.
For example, alanine was represented as [1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0] and
cysteine as [0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]. For the P20P20’ dataset, each
sequence was represented by an 800-dimensional vector. Symmetrical sequences in the
P4P4’, P8P8’, P12P12’ and P16P16’ datasets were represented by 160, 320, 480 and
640-dimensional vectors respectively. Asymmetrical P4P12’ and P12P4’ subsequences were
both represented by 320-dimensional vectors.
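The binary encoding can be sketched as below. The residue ordering is assumed to be alphabetical by one-letter code, which agrees with the alanine and cysteine examples above but is otherwise an assumption; encoding unknown residues as an all-zero block is likewise an assumption.

```python
# Assumed residue order: alphabetical by one-letter code (A first, C second)
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def binary_encode(peptide):
    """Encode a peptide as a flat list of 20 zeros/ones per residue."""
    vector = []
    for residue in peptide:
        one_hot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:
            one_hot[index] = 1  # unknown residues stay all-zero (assumption)
        vector.extend(one_hot)
    return vector


# Usage: an 8-mer (P4P4') gives 160 dimensions, a 40-mer gives 800
print(len(binary_encode("QQRNRENL")))  # 160
print(len(binary_encode("A" * 40)))    # 800
```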
4.2.5.2 Bayes Feature Extraction (BFE) encoding
Key concepts of bi-profile vector encoding were reported by Shao, et al. (2009) in their
novel approach to the computational identification of post-translational protein
methylation sites through bi-profile Bayes Feature Extraction combined with support
vector machines. In BFE, feature vectors are encoded in a bi-profile manner containing
attributes from positive and negative position-specific profiles. Profiles were generated
by calculating the frequency of occurrence of each amino acid at each
position of the extracted peptide sequence in the experimentally-verified calpain
cleavage sites (positive) and randomly generated calpain non-cleavage sites (negative)
respectively. For BFE, a 40-mer input peptide is encoded by an 80-dimensional
(40 x 2) feature vector containing residue information from both positive and negative
spaces.
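A minimal sketch of the bi-profile encoding follows. Each residue of the input peptide is replaced by its frequency at that position in the positive profile, and the same is repeated for the negative profile, giving a vector of twice the peptide length. The function names are hypothetical, and assigning 0.0 to residues never observed at a position is an assumption.

```python
def position_profile(peptides):
    """Frequency of each amino acid at each position of equal-length peptides."""
    length = len(peptides[0])
    profile = []
    for position in range(length):
        counts = {}
        for peptide in peptides:
            counts[peptide[position]] = counts.get(peptide[position], 0) + 1
        profile.append({aa: c / len(peptides) for aa, c in counts.items()})
    return profile


def bfe_encode(peptide, positive_profile, negative_profile):
    """Bi-profile vector: positive-profile values, then negative-profile values."""
    pos_part = [positive_profile[i].get(aa, 0.0) for i, aa in enumerate(peptide)]
    neg_part = [negative_profile[i].get(aa, 0.0) for i, aa in enumerate(peptide)]
    return pos_part + neg_part


# Usage with toy 2-mers; a 40-mer would yield an 80-dimensional vector
positive = position_profile(["AC", "AD"])
negative = position_profile(["CC", "DC"])
print(bfe_encode("AC", positive, negative))  # [1.0, 0.5, 0.0, 1.0]
```

In such a scheme it would seem prudent to build the profiles from training examples only, so that test peptides do not influence their own encoding; whether this was done in the project is not stated.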
4.2.6 SVM implementation
For SVM implementation, we employed the freely downloadable LIBSVM package
developed by Chang and Lin (2001). SVM is based on the structural risk minimization
principle from statistical learning theory. A set of positive and negative examples can
be represented by feature vectors $\mathbf{x}_i$ (i = 1, 2,…, N) with corresponding class labels
$y_i \in \{+1, -1\}$. SVM classifier training involves the mapping of input examples onto a
high dimensional space, aided by the use of a kernel function, followed by the
definition of a separating hyperplane that differentiates the two classes with maximal
margin and minimal error. The resulting decision function for predictions of unseen
examples is given as:
$$f(\mathbf{x}) = \operatorname{sgn}\left(\sum_{i=1}^{N} \alpha_i y_i K(\mathbf{x}_i, \mathbf{x}) + b\right)$$
where $K(\mathbf{x}_i, \mathbf{x})$ represents the kernel function and the parameters $\alpha_i$ are determined by
maximizing the following:
$$\sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(\mathbf{x}_i, \mathbf{x}_j)$$
under the conditions
$$\sum_{i=1}^{N} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C.$$
Variable C serves as the regularization parameter controlling the balance of margin and
classification error. Based on previous findings of non-linearity between amino acid
positions and cleavage and the superior performance of RBF kernel-based SVM classifiers
in CaMPDB, we have chosen to implement the RBF kernel given by:
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$
4.2.7 SVM optimization
Implementation of the RBF kernel-based SVM classifiers necessitates the optimization
of two parameters: γ, the RBF kernel capacity determinant, and the regularization
parameter, C. To optimize the SVM parameters γ and C, 10-fold cross-validation was
applied on each of the training datasets via grid search, with SVM parameters stepped
through combinations of 0.001, 0.01, 0.1, 1, 10 and 100 for both γ and C. During
10-fold cross-validation, the input training dataset was divided into 10 subsets, where 9
subsets were used for the training of the classifier followed by testing with the
remaining subset. The process was repeated 10 times, each with a different subset for
testing, ensuring all subsets were used for both training and testing. The 10-fold
cross-validation accuracy scores generated were collected and tabulated in grid search tables
for each individual subsequence dataset in both BFE and simple binary encoding.
Figure 4-4 shows the general workflow for SVM development. Grid search
optimization tables generated are documented in Appendix B.
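The grid search with 10-fold cross-validation can be outlined as in the sketch below. It is library-agnostic: `train_and_score` is a hypothetical callback standing in for training an RBF-kernel SVM (for example with LIBSVM) on the training folds and returning accuracy on the held-out fold. Averaging per-fold accuracies, as done here, is an assumption and may differ slightly from the pooled cross-validation accuracy that a given tool reports.

```python
import itertools
import random

# Grid values taken from the text, used for both C and gamma
GRID = [0.001, 0.01, 0.1, 1, 10, 100]


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and deal them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def grid_search(features, labels, train_and_score, k=10):
    """Return the best (C, gamma) pair and the full table of CV accuracies.

    train_and_score(train_x, train_y, test_x, test_y, C, gamma) -> accuracy
    is a hypothetical callback supplied by the caller.
    """
    folds = k_fold_indices(len(features), k)
    table = {}
    for C, gamma in itertools.product(GRID, GRID):
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(features)) if i not in held]
            scores.append(train_and_score(
                [features[i] for i in train], [labels[i] for i in train],
                [features[i] for i in held_out], [labels[i] for i in held_out],
                C, gamma))
        table[(C, gamma)] = sum(scores) / k
    best = max(table, key=table.get)
    return best, table
```

With the six grid values for each parameter, 36 (C, γ) pairs are evaluated per dataset, each requiring 10 training runs.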
Figure 4-4: Flowchart of SVM workflow. (a) Primary dataset, (b) Training dataset, (c) Test
dataset, (d) 10-fold cross-validation, (e) Obtaining final accuracy (each C and γ pair
following grid-search method), (f) Collection of training and validation to obtain optimal C
and γ values, (g) Retraining of SVM model with optimized C and γ values before proceeding
to testing with the designated test dataset (c).
4.2.8 SVM training and testing
From the grid search optimization of individual subsequence datasets, the optimal values
of γ and C obtained were used for training the SVM classifiers. The final trained SVM
classifiers were used to predict the test datasets. SVM performance and effectiveness in
predicting calpain cleavage sites were measured by the calculation of the following
quantitative variables:
i. TP, true positives – the number of correctly classified cleavage sites
ii. FP, false positives – the number of incorrectly classified non-cleavage sites
iii. TN, true negatives – the number of correctly classified non-cleavage sites
iv. FN, false negatives – the number of incorrectly classified cleavage sites
From the variables above, statistical metrics of Sensitivity (Sn) and Specificity (Sp)
were computed to evaluate the ability of the prediction model to correctly classify
calpain cleavage and non-cleavage sites respectively:
$$Sn = \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}$$
Overall prediction model performance was assessed by computing Accuracy (ACC):
$$ACC = \frac{TP + TN}{TP + FP + TN + FN}$$
One major drawback of the above metrics is that a threshold must be chosen to
distinguish between predicted positives and negatives. During comparison of two
prediction methods, differences in sensitivity and specificity may be a result of the
threshold parameters chosen; in practice, the two methods may be identical should a
threshold adjustment be made on one of the methods.
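The threshold-dependent metrics Sn, Sp and ACC follow directly from the four counts TP, FP, TN and FN. The sketch below uses their standard definitions; the function name and the example counts are illustrative only.

```python
def performance(tp, fp, tn, fn):
    """Sensitivity, specificity and accuracy from confusion-matrix counts."""
    return {
        "Sn": tp / (tp + fn),                    # cleavage sites correctly found
        "Sp": tn / (tn + fp),                    # non-cleavage sites correctly rejected
        "ACC": (tp + tn) / (tp + fp + tn + fn),  # overall fraction correct
    }


# Usage with illustrative counts for a test set of 50 positives and 50 negatives
print(performance(tp=35, fp=8, tn=42, fn=15))
```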
To avoid these instances, calculation of the area under the receiver operating
characteristic curve (AROC) was also applied as a non-parametric measure of
predictive performance. The ROC curve is constructed by using different values of the
threshold to plot the true positive proportion (TPP) against the false-positive proportion
(FPP), given by:
$$TPP = \frac{TP}{TP + FN}, \qquad FPP = \frac{FP}{FP + TN}$$
To generate the predictive scores of test datasets for AROC calculation, SVMlight
(Vapnik, 1995; Joachims, 1999) was implemented and trained with all subsequence
datasets with optimized γ and C previously obtained from grid search optimization in
LIBSVM. Prediction results from each test dataset were checked for consistency with
LIBSVM classification results and used as input to ROC analysis, a web-based
calculator for ROC curves (Eng, 2006). An AROC value close to 0 is indicative of negative
correlation and a value of 0.5 indicates no correlation. AROC values greater than 0.7 indicate
useful prediction performance and values above 0.85 indicate good prediction performance.
The performance metrics generated, Sensitivity (Sn), Specificity (Sp), Accuracy (ACC) and
AROC, were tabulated. Combined ROC curves are documented under
Appendix C.
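For illustration, the area under the empirical ROC curve can also be computed directly from prediction scores using its rank-based (Mann-Whitney) interpretation: the proportion of positive/negative pairs in which the positive example scores higher, with ties counted as one half. This is a sketch of that equivalent calculation, not the web-based calculator used in the project.

```python
def aroc(positive_scores, negative_scores):
    """Area under the empirical ROC curve from raw prediction scores."""
    wins = 0.0
    for pos in positive_scores:
        for neg in negative_scores:
            if pos > neg:
                wins += 1.0   # positive ranked above negative
            elif pos == neg:
                wins += 0.5   # ties count as half
    return wins / (len(positive_scores) * len(negative_scores))


# Usage with made-up decision values
print(aroc([0.9, 0.4], [0.5, 0.1]))  # 0.75
```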
4.2.9 Linear sequence analysis of primary calpain dataset
4.2.9.1 Relative position-specific amino acid propensity
The relative position-specific amino acid propensity, Px, of an amino acid is a
quantitative indicator of the probability of the amino acid occurring at a specific
location on a protein sequence. Individual position-specific amino acid intensities in
the positive and negative datasets were derived for the primary P20P20’ dataset
containing 40-mer sequences by:
(1) Position-specific amino acid intensities for positive dataset: Number of amino
acid X at position I in the positive dataset/Total number of sequences in
positive dataset.
(2) Position-specific amino acid intensities for negative dataset: Number of amino
acid X at position I in the negative dataset/Total number of sequences in
negative dataset.
Propensity, defined as the ratio of the frequency of occurrence of an amino acid in
the experimentally-verified calpain cleaved substrate sequence population (positive
examples) to the frequency of occurrence of the same amino acid in the random
non-cleaved substrate sequence population (negative examples) at a specific position,
was derived by:
(3) Relative position-specific amino acid propensity, Px = (1)/(2)
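Steps (1) to (3) can be sketched as follows. The function names are hypothetical, and returning None where the negative intensity is zero is an assumption, since the report does not state how undefined ratios were treated.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def intensity_matrix(peptides):
    """Steps (1)/(2): fraction of sequences with amino acid X at position I."""
    length = len(peptides[0])
    total = len(peptides)
    return {
        aa: [sum(p[i] == aa for p in peptides) / total for i in range(length)]
        for aa in AMINO_ACIDS
    }


def propensity_matrix(positive, negative):
    """Step (3): Px = positive intensity / negative intensity per position.

    Positions with zero negative intensity give None (assumption).
    """
    return {
        aa: [p / q if q > 0 else None for p, q in zip(positive[aa], negative[aa])]
        for aa in AMINO_ACIDS
    }


# Usage with toy 2-mers; the project used 40-mers, giving 20 x 40 matrices
pos = intensity_matrix(["LA", "LS"])
neg = intensity_matrix(["LA", "GA"])
px = propensity_matrix(pos, neg)
print(px["L"][0])  # 1.0 / 0.5 = 2.0
```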
For visualization of positive, negative and calculated propensity, three heatmaps were
generated using the values calculated above. Respective 20 x 40 matrices were
constructed for heatmap generation using R programming (R Development Core
Team, 2010).
4.2.9.2 Sequence logo representation of calpain cleavage events
Sequence logos were developed with the aim of displaying and analyzing patterns in
sequence conservation (Schneider and Stephens, 1990). To further visualize
position-specific amino acid occurrence and patterns in sequence conservation surrounding
reported experimentally-verified calpain cleavage sites, the positive P20P20’ dataset was
used as input for the generation of a sequence logo through multiple sequence
alignment using WebLogo, developed by Crooks, et al. (2004).
Protein logos generated from input sequences enable graphical representation of
patterns and description of sequence similarity to reveal significant features of the
alignment, such as amino acid conservation that could be of importance in substrate
recognition by calpain, which can be difficult to visualize in linear sequence data.
4.3 Results and discussion
4.3.1 Performance metrics of SVM prediction
Table 4-1: Summary of SVM prediction performance of classifiers trained using various
subsequences and encoding strategies.
Table 4-1 summarizes the optimized γ and C values and the performance metrics of
Sensitivity (Sn), Specificity (Sp), Accuracy (ACC) and AROC generated for all
final SVM classifiers using the various subsequence datasets.
Simple binary and BFE encoded schemes for symmetrical subsequences are
represented by SVM-P4P4’ to SVM-P20P20’ and Bayes-SVM-P4P4’ to Bayes-SVM-P20P20’
respectively. Asymmetrical encoded schemes are represented by SVM-P4P12’,
SVM-P12P4’ and Bayes-SVM-P4P12’, Bayes-SVM-P12P4’ respectively.
For symmetrical simple binary encoded schemes, maximal performance was
observed for the SVM-P4P4’ classifier at an accuracy of 77%, with sensitivity of 70%,
specificity of 84% and AROC of 0.832. Overall performance of symmetrical simple
binary encoded schemes was fairly consistent, with accuracy ranging from 71 to 77%
and sensitivity and specificity between 62 to 74% and 76 to 84% respectively.
Analysis of AROC values indicated useful prediction performance, with values in the
range of 0.789 to 0.834.
The use of BFE schemes significantly improved performance across all symmetrical
subsequence windows. Performance metrics, accuracy and AROC scores for each
subsequence window were consistently higher than those obtained from classifiers
trained with simple binary encoded schemes. The best BFE classifier (Bayes-SVM-P20P20’)
achieved an accuracy of 85%, sensitivity of 86%, specificity of 84% and AROC
of 0.927. A graphical representation of the trends in SVM performance (accuracy and
AROC scores) across the various subsequence windows is shown in Figure 4-5.
Figure 4-5: Graphical representation of the trends in SVM classifiers performance in terms of
A) accuracy and B) AROC scores for various subsequence windows.
Interestingly, two differing trends relating to subsequence window lengths were
observed between the prediction performance of simple binary and BFE encoded
schemes. For BFE encoded schemes, a gradual increase in accuracy and AROC was
observed as the window length of the peptide subsequence increased. In contrast,
performance in the simple binary encoded scheme was highest for the classifier
trained with the shortest subsequence, SVM-P4P4’, decreased until SVM-P12P12’ and
increased slightly as the subsequence window length was extended
to 32 and 40-mer (SVM-P16P16’ and SVM-P20P20’). This observation is
unexpected, as longer subsequence windows generally allow the encapsulation of
more information or features surrounding the P1-P1’ scissile bond, which aids in
prediction.
Similar observations were noted for the simple binary and BFE encoded schemes for
asymmetrical subsequences, represented by SVM-P4P12’ and SVM-P12P4’ and designed
to investigate the hypothesis of asymmetrical contributions of flanking amino acids to
calpain substrate binding and cleavage previously reported by duVerle, et al. (2010).
For the asymmetrical simple binary encoded schemes, a slight improvement in accuracy
was observed with the SVM-P4P12’ classifier at 75%, compared to the SVM-P12P4’
classifier which achieved 72%. The right-primed classifier SVM-P4P12’ also indicated
better performance in the differentiation of calpain non-cleavage sites, with a specificity
of 82% compared to 72% obtained in SVM-P12P4’, although with a slightly decreased
sensitivity of 68%. However, a marginally higher AROC was observed for the
SVM-P12P4’ classifier (AROC 0.795) when compared to the SVM-P4P12’ classifier (AROC
0.788).
Consistent with the symmetrical subsequence trained SVM classifiers, an overall
improvement in prediction performance was evident when BFE was employed on
asymmetrical subsequences. Bayes-SVM-P4P12’ obtained an accuracy of 82% (sensitivity of
78%, specificity of 86%) and AROC of 0.866, and Bayes-SVM-P12P4’ obtained an accuracy of
81% (sensitivity of 84%, specificity of 78%) and AROC of 0.905. The observations
discussed for the asymmetrical simple binary encoded schemes were also apparent.
Based on our findings, there is no strong indication that preferential extension of the
amino acid sequence on either side of experimentally-verified calpain cleavage
sites contributes to substrate cleavage. However, the plausibility of asymmetrical amino
acid contributions to calpain substrate recognition and cleavage should not be ruled
out without further investigation. Further directions will be discussed in later sections
of this report.
In the comparison of our SVM implementation to previous studies, GPS-CCD1.0 reported a
best prediction performance of accuracy 89.98%, sensitivity of 60.87% and specificity of
90.07%, and CaMPDB reported maximal AROC values of 69.1% for PSSM, 77.3% for the
SVM linear classifier and 80.1% for the SVM RBF classifier. From these, we can infer
comparable, if not superior, performance in calpain substrate cleavage prediction
using BFE encoded schemes with symmetrical subsequences relative to existing methods.
The predictive performance of each method, however, may not be directly comparable. Reported
prediction performance for both GPS-CCD1.0 and CaMPDB was generated from
cross-validation, in the absence of independent out-of-sample testing. Although the
authors of GPS-CCD1.0 reported the prediction of calpain cleavage sites in several
proteins such as caspase-14 (9 sites) and dog interleukin-1 alpha (6 sites), previously
experimentally identified to be cleaved by calpains but without exact cleavage sites
reported, these predictions are not indicative of prediction accuracy due to the absence
of experimental verification. In addition, the quantity and accuracy of calpain
substrate data used (265 and 368 entries in CaMPDB and GPS-CCD1.0 respectively)
in both studies may also affect prediction performance, as no detailed procedure was
documented for data cleaning and verification in either study.
4.3.2 Relative position-specific amino acid propensity
Figure 4-6 shows the heatmaps generated using amino acid intensities in the 40-mer
positive and negative examples and the calculated propensity, Px. From the heatmap of the
positive dataset, enrichment of several amino acids around the calpain cleavage site
was observed, especially at positions P2 to P3’. Leucine (L) was enriched at positions P2
(0.317) and P2’ (0.114). Serine (S) was found to occur frequently at positions P1 and P1’,
at 0.129 and 0.194 respectively. Positions P3’ and P4’ showed elevated proline (P)
occurrence, at 0.220 and 0.123 respectively. Amino acids of different properties, alanine (A,
hydrophobic), glutamic acid (E, acidic), glycine and serine (G and S, polar), were found
to occur at moderate levels throughout the positive dataset.
From the heatmap of the negative dataset, enrichment of leucine (L), glutamic acid (E),
threonine (T), alanine (A), glycine (G) and lysine (K) was observed throughout the
length of the randomly generated 40-mer dataset. Cysteine (C), histidine (H),
methionine (M) and tryptophan (W) residues were the least frequently occurring amino
acids in both the 40-mer positive and negative datasets.
Calculated propensity Px, given by the ratio of position-specific amino acid intensities
between the positive and negative datasets, allows visualization of amino acid
differentiation at the respective positions. High Px values indicate a higher
likelihood of an amino acid occurring at that position in the positive examples than in the
negative examples, and vice versa for small Px values. From the heatmap and calculated average Px
values, significant propensity of alanine, methionine, proline, serine and tryptophan
residues was observed in some regions surrounding the cleavage site, in particular the
P4-P4’ segments and downstream regions. Leucine enrichment was distinct at position
P2 with a Px of 3.00, due to its significantly higher intensity in the positive dataset (0.317)
despite its high occurrence in the negative dataset (0.106). Amino acid intensity matrices and
the calculated and average Px values of the 40-mer sequences are documented in Appendix D.
Figure 4-6: Heatmaps of position-specific amino acid intensities of A) positive
examples, B) negative examples and C) calculated propensity Px. Vertical axis
contains the range of twenty amino acids, while the horizontal axis represents each
residue position of the 40-mer input sequences. Increasing color intensities in each
heatmap (blue for positive examples and propensity, Px, and red for negative
examples respectively) indicate position-specific amino acid enrichment.
4.3.3 Sequence logo representation of calpain cleavage events
Figure 4-7: Sequence logo representation of experimentally-verified calpain cleavage
events. A) Logo of 40-mer sequences (P20P20’) centered on the experimentally-verified
calpain cleavage sites. B) Expanded view of sequence logo showing P8P8’ subsequence
segment.
Each logo generated consists of one stack of letters for each position of the input
sequence. Sequence conservation is indicated by the overall height of each stack,
measured in bits. The relative frequency of the corresponding amino acids is indicated by the
height of the symbols within the stack. Amino acids are colored according to
their chemical properties: polar amino acids (G, S, T, Y, C, Q, N) are labeled green, basic
(K, R, H) blue, acidic (D and E) red and hydrophobic (A, V, L, I, P, W, F, M)
black. Position-specific sequence conservation, Rseq, is defined as the
difference between the maximum possible entropy (Smax) and the entropy of the
observed symbol distribution (Sobs):
$$R_{seq} = S_{max} - S_{obs} = \log_2 N - \left(-\sum_{n=1}^{N} p_n \log_2 p_n\right)$$
where $p_n$ is the observed position-specific frequency of symbol n and N is the number of
sequence-specific symbols, equivalent to 20 for proteins. Maximum sequence conservation per site is
given by $\log_2 20$, approximately 4.32 bits for protein sequences.
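As an illustration of the Rseq definition, the conservation of a single alignment column can be computed as below. This sketch omits the small-sample correction that WebLogo applies, so values for small datasets may differ from those shown in the logo; the function name is hypothetical.

```python
import math


def sequence_conservation(column, n_symbols=20):
    """Rseq = log2(N) - Sobs for one alignment column, in bits.

    column is a string holding the residue observed at one position in
    every input sequence. No small-sample correction is applied (assumption).
    """
    counts = {}
    for symbol in column:
        counts[symbol] = counts.get(symbol, 0) + 1
    total = len(column)
    # Shannon entropy of the observed symbol distribution
    s_obs = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return math.log2(n_symbols) - s_obs


# Usage: a fully conserved column reaches the maximum of log2(20) bits,
# while a column containing all 20 residues equally often carries none
print(round(sequence_conservation("L" * 10), 2))               # 4.32
print(round(sequence_conservation("ACDEFGHIKLMNPQRSTVWY"), 2))
```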
From Figure 4-7, there is no strong evidence of sequence conservation throughout the
40-mer input, with a wide range of amino acid residues occurring around the reported
calpain cleavage site. Position-specific amino acid conservation was observed to fall
below 0.5 bits, with the exception of the pentapeptide P2-P3’, with maxima at
approximately 0.75 bits. A comparison of amino acid prevalence from the sequence
logo generated to findings by Tompa et al. (2004) is compiled in Table 4-2.
Table 4-2: Comparison of position-specific amino acid prevalence in the generated 40-mer
sequence logo versus findings by Tompa, et al. (2004).
At P2, leucine was the most prevalent amino acid, with valine and threonine at lower
levels, consistent with the Tompa study. Slight differences were observed at other
positions in our dataset: position P1 showed occurrence of serine and glycine versus
lysine, tyrosine and arginine; position P1’ showed serine, alanine and leucine versus
serine, threonine and alanine; and positions P2’ (leucine, glutamic acid), P3’ (proline,
lysine, alanine) and P4’ (proline, serine, glutamic acid) differed from the significant proline
prevalence reported in the Tompa study. A key contributor to this difference could
be the discovery of a more diverse range of calpain substrates since that study
was conducted, in which a limited collection of 106 cleavage sites from 49 calpain substrates
was analyzed. A further correlation with the previous study by Wang et al. discussed earlier,
which reported the influence of PEST regions on calpain cleavage site recognition, can be
observed from the low-level conservation of PEST sequence motifs between P8-P1 and
P3’-P8’.
These observations of diverse amino acid preference and cleavage by calpain are not
unexpected, with such variability arising from calpain’s ability to proteolyze a wide
array of substrates in vivo and in vitro that are involved in various cellular processes. The
possible evolution of substrate binding sites in calpains for recognition of a wide range of
amino acid sequences, in contrast to strong binding to highly specific and conserved
amino acid residues surrounding the cleavage site, distinguishes calpains from other
cysteine proteases such as caspases, which exhibit specificity for substrate cleavage
after an aspartic acid residue (D) at P1. These factors have made the
elucidation of calpain substrate cleavage mechanisms difficult to this day.
Chapter 5: Prediction of Receptor Tyrosine Kinases (RTKs) Family
Proteins
5.1 Introduction to Receptor Tyrosine Kinases (RTKs)
Protein kinases are key enzymes with numerous biological regulatory roles. They modify
protein function through the catalytic transfer of phosphate groups
from ATP (adenosine triphosphate) molecules to specific amino acids on proteins
(phosphorylation). Phosphorylation is an important form of post-translational protein
modification which results in functional changes in the target protein with regard to
enzyme activity, location and protein association. Based on amino acid specificity,
protein kinases are classified into protein serine/threonine or tyrosine kinases. More
importantly, protein kinases regulate a wide variety of cellular functions including
cytoskeletal rearrangements and differentiation, cellular growth and apoptosis and
signal transduction.
A kinome study to catalogue the protein kinases in the human genome by Manning, et
al. (2002), discovered more than 500 genes encoding protein kinases, approximately
2% of all human genes. Of these, 385 were identified as serine/threonine-specific, 90 as
tyrosine-specific and 43 as tyrosine kinase-like proteins. Among protein tyrosine
kinases, a large portion belong to the receptor tyrosine kinases (RTKs). The RTK
family of proteins comprises approximately 20 classes, including epidermal growth
factor receptor (EGF), hepatocyte growth factor receptor (HGF), leukocyte receptor
tyrosine kinase (LTK), RET proto-oncogene receptor (RET) and vascular endothelial
growth factor receptors (VEGF), amongst many others.
Hubbard and Miller (2007) reviewed RTKs as single-pass, type I transmembrane
receptors, important agents of signal transduction pathways. RTKs are generally
activated through ligand-induced oligomerization, often dimerization, bringing the
cytoplasmic tyrosine kinase domains together in close proximity, facilitating
autophosphorylation in trans of tyrosine residues in the kinase activation loop or
juxtamembrane region, inducing conformational changes that stabilize the active state
of the kinase. These phosphorylated tyrosine residues serve as binding sites for
downstream signaling or adapter proteins, and initiate subsequent cellular responses
through various signal transduction pathways. Because RTKs are essential components of
cellular signaling pathways and act as growth factor receptors, mutations and
structural aberrations in RTKs are often implicated in the onset and progression of
cancers.
Wee, et al. (2009) tested the hypothesis of RTK protein family regulation via caspase
proteolysis due to their common implication in apoptosis. Caspases are recognized as
the main group of enzymes involved in apoptosis, with sequential activation of a
hierarchy of caspases after death receptor stimulation in apoptotic cells. Given the
overlapping substrate specificities and the evidence of caspase regulation by calpains
discussed earlier, there is increased interest in examining the possibility of calpain
involvement in RTK regulation, either through direct proteolytic modulation or as a
factor in the apoptosis cascade.
5.2 Prediction of calpain cleavage of RTKs
To examine the efficacy of calpain cleavage prediction on a protein family, we
applied the best performing SVM classifier (Bayes-SVM-P20P20’) to predict potential
cleavage sites on a subset of the RTK family: EGF receptors (EGFR and Erbb2), HGF
receptor (MET), LTK receptor (ALK) and RET receptor (RET).
Full protein sequences for EGFR (P00533), Erbb2 (P04626), MET (P08581), ALK
(Q9UM73) and RET (P07949) were collected from the Uniprot database. 40-mer
subsequences were extracted with a moving window advanced one amino acid at a time,
so that each residue was interrogated as a potential calpain cleavage site (P1). The
exceptions were residues 1-19 and the last 20 residues of each full protein sequence,
for which a complete P20-P20’ window cannot be constructed (see Figure 5-1). The
extracted subsequences were encoded with the BFE scheme described in earlier sections
and scored using Bayes-SVM-P20P20’ classifiers implemented in both LIBSVM and SVMlight.
Figure 5-1: Construction of the 40-mer moving window in EGFR (P00533).
Labels 1 and 1210 refer to the first and last residues of the EGFR protein. The red asterisks
between P-A (20th residue) and S-T (1190th residue) mark the first and last
interrogated cleavage sites. The light blue, red and green boxes highlight the 1st, 2nd
and 3rd extracted 40-mer subsequences; the purple and blue boxes indicate the 1170th and
1171st subsequences.
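The moving-window construction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming a window of 20 residues on each side of the scissile bond (P20-P20’) and 1-based P1 numbering as in Figure 5-1; the function name is hypothetical and a placeholder string stands in for the real EGFR sequence.

```python
def extract_windows(sequence, flank=20):
    """Yield (p1_position, window) for every interrogable cleavage site.

    p1_position is 1-based. Each window holds `flank` residues on each side of
    the candidate scissile bond, i.e. P{flank}..P1 followed by P1'..P{flank}'.
    """
    n = len(sequence)
    for p1 in range(flank, n - flank + 1):
        start = p1 - flank  # 0-based index of the P{flank} residue
        yield p1, sequence[start:start + 2 * flank]


# Placeholder sequence with the same length as EGFR (1210 residues);
# it is NOT the real protein sequence.
placeholder = "A" * 1210
windows = list(extract_windows(placeholder))
```

For a 1210-residue protein this yields 1171 windows, with P1 running from residue 20 to residue 1190, which matches the counts given in the caption of Figure 5-1.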
Table 5-1 shows the schematic maps of predicted calpain cleavage sites on the RTK
family subset, with prediction scores ≥ 1.0 in SVMlight. All members were predicted to
possess calpain cleavage sites distributed, in most cases, throughout the extra- and
intracellular regions. All selected kinases, with the exception of ALK, had predicted
calpain cleavage sites in the tyrosine kinase domain. These domains serve as
important mediators of signal transduction for RTKs, and structural alterations may
lead to aberrations in downstream signal transduction. EGFR, Erbb2 and MET were
predicted to possess calpain cleavage sites proximal to the membrane on the
cytoplasmic side of the receptor, suggesting the formation of an intracellular fragment
and a membrane-bound region. This may affect downstream signaling, particularly
normal RTK signaling pathways, through competition for ligand binding between
intact receptors and the membrane-bound cleavage products.
Given the numerous possible permutations of proteolytic fragments generated
by extracellular, intracellular and kinase domain cleavage, their downstream
functional roles, such as anti- or pro-apoptotic activity, merit
further experimental investigation.
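As an illustration of how the reported sites could be selected, the sketch below pairs each interrogated P1 position with its prediction score and keeps those at or above the 1.0 cut-off used for Table 5-1. It assumes the scores are available as plain text with one decision value per line, in the same order as the windows; the helper names and the toy numbers are hypothetical and are not results from this study.

```python
def parse_scores(text):
    """Parse one decision value per line, ignoring blank lines."""
    return [float(line) for line in text.splitlines() if line.strip()]


def select_sites(p1_positions, scores, threshold=1.0):
    """Return (P1 position, score) pairs whose score meets the threshold."""
    if len(p1_positions) != len(scores):
        raise ValueError("positions and scores must have the same length")
    return [(p, s) for p, s in zip(p1_positions, scores) if s >= threshold]


# Toy values for illustration only; these are not results from the study.
toy_output = "0.12\n1.35\n-0.80\n1.00\n"
toy_positions = [20, 21, 22, 23]
predicted = select_sites(toy_positions, parse_scores(toy_output))
```

Raising or lowering the threshold trades the number of candidate sites against the confidence of each call, so the cut-off is best treated as a tunable parameter.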
5.3 Summary
In summary, calpain cleavage sites were predicted on a protein subset of the RTK
family, with results suggesting possible calpain regulation of RTK
activity. Considering calpain’s involvement in pro- and anti-apoptotic regulation via
cleavage of caspases, it may also be hypothesized that calpain is a factor in
caspase-mediated RTK regulation, RTK signaling and the production of pro-apoptotic
intracellular fragments. Validating these hypotheses requires further in-depth biochemical
and structural studies of calpain-mediated RTK cleavage.
Table 5-1: Schematic maps of predicted calpain cleavage sites on the receptor tyrosine kinase (RTK) family subset. P1 positions of
predicted cleavage sites on each protein of the RTK family subset are listed. Grey sections indicate cleavage sites located within the
extracellular domain, green within the transmembrane domain, light blue within the intracellular domain
and darker blue within the kinase domain.
Chapter 6: Conclusion
6.1 Summary of project report
Calpains constitute an important family of intracellular, Ca2+-dependent, non-lysosomal
cysteine proteases which exhibit limited proteolysis of their substrates at
neutral pH. Through cleavage of a diverse range of substrates, calpains are known to
modulate a wide range of biological processes such as apoptosis, cytoskeletal
organization and neuroendocrine secretory pathways. With calpains implicated in
diseases such as cancers and neurodegenerative diseases, they are
clinically important targets for inhibition and therapy development, which highlights the
need to characterize the calpain degradome.
To date, the mechanisms of substrate recognition and cleavage by calpain have not
been fully established. However, as more research effort is aimed at
unraveling calpain modulatory mechanisms, increasing amounts of data on calpain
substrates are becoming available. This increases the feasibility of developing calpain
cleavage prediction models to screen novel substrates for potential cleavage activities
in silico, allowing protein substrates to be assessed efficiently prior to tedious
experimental procedures. Recent computational methods for calpain cleavage
site prediction have been successful to some extent and, in the process, have revealed
interesting observations on calpain cleavage mechanisms with regard to amino acid
sequence conservation and the asymmetrical contributions of amino acids to calpain
recognition of substrates. These promising results provided the impetus to implement
an SVM-based method to investigate the feasibility of developing a calpain substrate
cleavage prediction tool to complement experimental procedures.
In our study, a total of 341 unique calpain substrate cleavage sites from 130
experimentally-verified substrates were obtained from available databases and
literature to form the foundation for SVM prediction model development. To widen
our scope of investigation, we combined linear sequence analysis of amino
acid conservation and propensity, symmetrical and asymmetrical subsequence
extraction, and simple binary and bi-profile BFE encoding strategies.
Sequence analysis via sequence logo and heatmap generation, as well as
derivation of amino acid propensity, agreed with the earlier sequence study
by Tompa, et al. (2004) and also revealed significant propensity for alanine, tryptophan,
methionine, proline and serine residues within the P4-P4’ window and downstream
regions of cleavage sites. Our best SVM-based calpain cleavage site prediction model
(Bayes-SVM-P20P20’) achieved an accuracy of 85%, sensitivity of 86%, specificity of
84% and AROC of 0.927, comparable to existing published methods.
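For readers unfamiliar with these performance measures, the sketch below shows how accuracy, sensitivity, specificity and the area under the ROC curve (AROC) can be derived from true labels and classifier scores. It is an illustrative implementation and not the tooling used in this project: labels are assumed to be +1 (cleavage site) and -1 (non-site), the decision threshold of 0 is an assumption, and the AROC is computed with the rank-based (Mann-Whitney) formulation. The toy labels and scores are invented.

```python
def confusion_metrics(labels, scores, threshold=0.0):
    """Accuracy, sensitivity and specificity for +1/-1 labels."""
    pairs = list(zip(labels, scores))
    tp = sum(1 for y, s in pairs if y == 1 and s >= threshold)
    fn = sum(1 for y, s in pairs if y == 1 and s < threshold)
    tn = sum(1 for y, s in pairs if y == -1 and s < threshold)
    fp = sum(1 for y, s in pairs if y == -1 and s >= threshold)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
    }


def aroc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative; ties count as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == -1]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented toy data for illustration only.
toy_labels = [1, 1, 1, -1, -1, -1]
toy_scores = [1.2, 0.4, -0.3, 0.1, -0.8, -1.5]
metrics = confusion_metrics(toy_labels, toy_scores)
area = aroc(toy_labels, toy_scores)
```

Unlike accuracy, sensitivity and specificity, the AROC does not depend on a single threshold, which is why it is reported alongside them.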
To explore the efficacy of our SVM-based prediction model in elucidating the calpain
degradome, we applied our best performing prediction model to a subset of the RTK
family to predict potential calpain substrates. All tested members were predicted to
possess calpain cleavage sites distributed throughout extra- and intracellular regions.
RTKs belong to a class of membrane receptors with critical roles in cellular signaling
pathways and growth. Mutations and structural aberrations in RTKs are often
implicated in the onset and progression of cancers. Prediction results suggested possible
regulation of RTK activity by calpains. With overlapping substrate specificities and
co-involvement in apoptosis between calpains and caspases, calpain is likely to be
a factor in caspase-mediated RTK regulation leading to termination or
impairment of RTK signaling and initiation of apoptosis.
6.2 Recommendations and future direction
Our study indicates that developing a computational model for the in silico
prediction of calpain substrates, as a complementary tool to experimental procedures
for understanding the role of calpain proteolytic modulation, is highly feasible: the
developed prediction models showed promising performance when compared with
existing published methods. A logical next question is how prediction accuracy
may be improved. A wide array of approaches may be tested to improve the
performance of calpain prediction; however, most approaches are extremely time-
and computationally intensive and may prove challenging within the short timeframe of
the Capstone project.
Firstly, the calpain substrate database of cleavage events may be expanded
through deeper investigation of recently published articles. Expert curation can
also be employed to enhance the database, such as the classification of substrates by
the calpain implicated in cleavage, e.g. calpain 1 or 2, which may be helpful in
determining similarities or differences between substrate populations. Studies on
sequence similarity may be used to investigate the occurrence of highly conserved or
repetitive sequences. In our study, only redundant entries that were
100% identical were removed. Development of prediction models with streamlined
calpain datasets (e.g. with removal of 95% similar sequences) may be favorable for
training of the SVM classifier, possibly removing noise sequences which may affect
prediction. Also, the calpain substrate database and developed models may be
implemented on dedicated servers and made publicly available for academic usage.
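The redundancy reduction suggested above could be prototyped as a simple greedy filter. The sketch below is an assumption-laden illustration rather than an established pipeline: it measures similarity as position-wise identity between equal-length subsequences (dedicated clustering tools use alignment-based measures instead), and the function names and toy peptides are hypothetical.

```python
def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def remove_redundant(windows, max_identity=0.95):
    """Greedily keep sequences that are less than `max_identity` identical
    to every sequence already kept. Input order decides which copy survives."""
    kept = []
    for w in windows:
        if all(identity(w, k) < max_identity for k in kept):
            kept.append(w)
    return kept


# Toy 20-residue peptides for illustration only.
a = "A" * 20
b = "A" * 19 + "C"        # 95% identical to a
c = "A" * 10 + "C" * 10   # 50% identical to a
filtered_95 = remove_redundant([a, b, c, a], max_identity=0.95)
filtered_100 = remove_redundant([a, b, c, a], max_identity=1.0)
```

With `max_identity=1.0` the filter only drops exact duplicates, which corresponds to the 100% identity criterion used in this study; lowering it to 0.95 also removes near-duplicates.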
As discussed earlier, proteases do not function alone in biological systems but in
cascades and regulatory circuits, together with various other proteins. In addition,
the majority of current prediction tools and methods are developed from linear
peptide sequences or peptide libraries, so predicted cleavage sites only indicate
consensus sequences recognized by calpains and may not be cleaved in reality. Factors
influencing substrate recognition, binding and cleavage by calpains in vitro and in
vivo should be taken into consideration. Structural information such as protein folding
and secondary structure (α-helix and β-sheets), hydrophobicity, post-translational
modification information and solvent accessibility studies may aid in providing clues
to structure-function relationships. A more in-depth and comprehensive investigation of
the asymmetrical contributions of flanking amino acid residues to calpain cleavage may
also be achieved by creating asymmetrical moving windows to examine
contributions at the level of single amino acid residues and how they affect prediction
performance.
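One way to set up such an asymmetrical window scan is to cut every P20-P20’ subsequence down to each combination of upstream and downstream lengths and to train one classifier per combination. The sketch below only covers the window bookkeeping; the function names are hypothetical, and the toy string uses upper-case letters for the non-prime side and lower-case letters for the prime side purely to make the slicing visible.

```python
def asymmetric_window(window, upstream, downstream, flank=20):
    """Return residues P{upstream}..P1 + P1'..P{downstream}' from a
    symmetric window holding `flank` residues on each side of the bond."""
    if len(window) != 2 * flank:
        raise ValueError("window must contain 2 * flank residues")
    if not (0 <= upstream <= flank and 0 <= downstream <= flank):
        raise ValueError("requested lengths must lie between 0 and flank")
    return window[flank - upstream:flank + downstream]


def window_grid(max_flank=20, step=1):
    """All (upstream, downstream) length combinations to be evaluated."""
    lengths = range(1, max_flank + 1, step)
    return [(u, d) for u in lengths for d in lengths]


# Toy window: upper case = P20..P1, lower case = P1'..P20' (not real residues).
toy_window = "ABCDEFGHIJKLMNOPQRST" + "abcdefghijklmnopqrst"
p4_p2prime = asymmetric_window(toy_window, upstream=4, downstream=2)
grid = window_grid()
```

Stepping one residue at a time over 20 positions on each side gives 400 window shapes, which illustrates why the text describes such scans as time- and computationally intensive; a coarser `step` reduces the grid.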
In summary, with continual advancement in experimental identification of calpain
substrates together with improved methodologies on information usage and
exploration of suitable features for data analysis, the capabilities of machine learning
techniques such as SVM can be improved and maximized.
Part 2
Chapter 7: Critical reviews and reflections
As a working adult, I inevitably faced time constraints in balancing work with the
heavy workload of the project and ongoing modules. During
the Capstone project, time management proved critical to ensuring completion
within the deadline. A sense of work prioritization and frequent consultation with my
project supervisor allowed me to pinpoint critical tasks to be completed in due time
and keep within the scope of the initial objectives set during project commencement,
even with new ideas arising from the abundant articles related to calpain research.
During the skills review and project plan drafting at the proposal stage, assessing my
personal strengths and weaknesses helped in judging which areas and tasks
required additional time and effort. Previous experience and knowledge in proteomics
provided a much-needed head start in understanding journal articles and
publications on protein studies, as well as resources such as protein databases and tools.
The initial stage of the project emphasized the collection of
experimentally-verified calpain substrate cleavage sites and the verification of the
collected information to ensure data integrity and consistency. This procedure proved
tedious and complicated because calpain information varied between available
sources and the collected information had to be manually curated against Uniprot to
ensure data accuracy. The Gantt chart detailing the project plan is documented in
Appendix E.
A key weakness identified in early project planning was the lack of strong knowledge
in programming languages. This weakness led to time spent on experimentation with
Microsoft Excel for data processing. Through trial and error and with guidance from my
project supervisor, valuable knowledge was gained in the Microsoft Excel functions and
data manipulation techniques used widely in this project for data cleaning,
subsequence extraction, random negative example extraction, and simple binary and
Bayes feature encoding.
During results analysis of the various generated SVM prediction models, another
hurdle was encountered. Firstly, to generate linear sequential analysis of our collected
calpain dataset, heatmaps were required for clearer graphical representations.
Although it was possible to generate heatmaps using Microsoft Excel, gradient color
palettes were limited for classification of the calculated amino acid intensities. To
resolve this problem, knowledge in the R programming software environment for
statistical computing and graphics had to be acquired from scratch and employed.
Another setback encountered during results analysis was the computation of AROC
values as a measure of prediction performance. Although a LIBSVM tool for the
analysis of ROC curve was available in Python language interface, it required the
installation of Gnuplot, a command-line driven graphing utility, and Microsoft Visual
Studio. Trials with this setup failed to generate the required ROC curve. With the
guidance of my project supervisor, I was able to switch swiftly to investigating the
implementation of SVMlight to generate prediction scores as input to online tools for
ROC generation developed at Johns Hopkins University, avoiding the time lost in
figuring out an unfamiliar programming language.
In summary, undertaking the Capstone project has been a very rewarding
experience, both academically and on a personal level. Valuable knowledge was
gained in the current and growing repertoire of bioinformatics, especially in machine
learning applications, together with the development of personal skills such as time
management, methodological and critical thinking, and attention to detail.
REFERENCES
Berti, P.J., and Storer, A.C. (1995). Alignment/phylogeny of the papain superfamily
of cysteine proteases. J. Mol. Biol. 246, 273-283.
Bozoky, Z., Alexa, A., Tompa, P., and Friedrich, P. (2005). Multiple interactions of
the ‘transducer’ govern its function in calpain activation by Ca2+. Biochem. J. 388,
741–744.
Chang, C.C., and Lin, C.J. (2001). LIBSVM: a library for support vector machines.
Retrieved from: http://www.csie.ntu.edu.tw/~cjlin/libsvm
Chua, B.T., Guo, K., and Li, P. (2000). Direct cleavage by the calcium-activated
protease calpain can lead to inactivation of caspases. J Biol Chem. 275, 5131–
5135.
Crooks, G.E., Hon, G., Chandonia, J.M., and Brenner, S.E. (2004). WebLogo: a
sequence logo generator. Genome Res. 14(6), 1188-1190.
Dear, N., Matena, K., Vingron, M., and Boehm, T. (1997). A new subfamily of
vertebrate calpains lacking a calmodulin-like domain: implications for calpain
regulation and evolution. Genomics. 45, 175–184.
Ding, C.H.Q., and Dubchak, I. (2001). Multi-class protein fold recognition using
support vector machines and neural networks. Bioinformatics. 17, 349-358.
duVerle, D., Takigawa, I., Ono, Y., Sorimachi, H., and Mamitsuka, H. (2010).
CaMPDB: a resource for calpain and modulatory proteolysis. Genome Inform. 22,
202-213.
Eng, J. (n.d.). ROC analysis: web-based calculator for ROC curves. Retrieved June 1,
2010, from http://www.jrocfit.org
Evans, J. S., and Turner, M. D. (2007). Emerging functions of the calpain superfamily
of cysteine proteases in neuroendocrine secretory pathways. J. Neurochem. 103,
849-859.
Fan, T.J., Han, L.H., Cong, R.S. and Liang, J. (2005). Caspase family proteases and
apoptosis. Acta Biochim Biophys Sin (Shanghai). 37(11), 719-727.
Franco, S. J., and Huttenlocher, A. (2005). Regulating cell migration: calpains make
the cut. J. Cell Sci. 118, 3829–3838.
Franz, T., Vingron, M., Boehm, T., and Dear, T. N. (1999). Capn7: a highly divergent
vertebrate calpain with a novel C-terminal domain. Mamm. Genome. 10, 318–321.
Goll, D. E., Thompson, V. F., Li, H., Wei, W., and Cong, J. (2003). The calpain
system. Physiol. Rev. 83, 731–801.
Guroff, G. (1964). A neutral calcium-activated proteinase from the
soluble fraction of rat brain. J. Biol. Chem. 239, 149.
Horikawa, Y., Oda, N., Cox, N.J. et al. (2000). Genetic variation in the gene encoding
calpain-10 is associated with type 2 diabetes mellitus. Nat. Genet. 26, 163–175.
Hosfield, C. M., Moldoveanu, T., Davies, P. L., Elce, J. S., and Jia, Z. (2001). Calpain
mutants with increased Ca2+ sensitivity and implications for the role of the
C(2)-like domain. J. Biol. Chem. 276, 7404–7407.
Hubbard, S.R., and Miller, W.T. (2007). Receptor tyrosine kinases: mechanisms of
activation and signaling. Curr Opin Cell Biol. 19(2), 117-123.
Igarashi, Y., Eroshkin, A., Gramatikova, S., Gramatikoff, K., Zhang, Y., Smith, J.W.,
Osterman, A.L., and Godzik, A. (2007). CutDB: a proteolytic event database.
Nucleic Acids Research. 35, D546–D549.
Ishiura, S., Murofushi, H., Suzuki, K., and Imahori, K. (1978). Studies of a
calcium-activated neutral protease from chicken skeletal muscle. I. Purification and
characterization. J. Biochem. 84, 225-230.
Joachims, T. (1999). Making large-scale SVM learning practical. In Advances in
Kernel Methods - Support Vector Learning, B. Schölkopf, C. Burges and A.
Smola (eds.). MIT Press.
Kamei, M., Webb, G. C., Young, I. G., and Campbell, H. D. (1998). SOLH, a human
homologue of the Drosophila melanogaster small optic lobes gene is a member of
the calpain and zinc-finger gene families and maps to human chromosome 16p13.3
near CATM (cataract with microphthalmia). Genomics. 51, 197–206.
Liu, Z., Gao, X., Cao, J., Ma, Q., Ren, J., and Xue, Y. (2010). GPS-CCD: A novel
computational program for the prediction of calpain cleavage sites. Retrieved
December 30, 2010, from: http://ccd.biocuckoo.org/
Manning, G., Whyte, D.B., Martinez, R., Hunter, T., and Sudarsanam, S. (2002). The
protein kinase complement of the human genome. Science. 298, 1912–1934.
Molinari, M., Anagli, J., and Carafoli, E. (1995). PEST sequences do not influence
substrate susceptibility to calpain proteolysis. J Biol Chem. 270(5), 2032-2035.
Nakagawa, T., and Yuan, J. (2000). Cross-talk between two cysteine protease
families. Activation of caspase-12 by calpain in apoptosis. J Cell Biol. 150, 887–
894.
Noble, W. S. (2006). What is a support vector machine? Nature Biotechnology. 24,
1565-1567.
Ohno, S., Emori, Y., Imajoh, S., Kawasaki, H., Kisaragi, M., and Suzuki, K. (1984).
Evolutionary origin of a calcium-dependent protease by fusion of genes for a thiol
protease and a calcium binding protein? Nature. 312, 566-570.
Overall, C.M., Tam, E.M., Kappelhoff, R., Connor, A., Ewart, T., Morrison, C.J.,
Puente, X., López-Otín, C., and Seth, A. (2004). Protease degradomics: mass
spectrometry discovery of protease substrates and the CLIP-CHIP, a dedicated
DNA microarray of all human proteases and inhibitors. Biol Chem. 385(6), 493-504.
R Development Core Team. (2010). R: A language and environment for statistical
computing. Retrieved June 1, 2010, R Foundation for Statistical Computing,
Vienna, Austria, from: http://www.R-project.org/
Ren, J., Gao, X., Jin, C., Zhu, M., Wang, X., Shaw, A., Wen, L., Yao, X., and Xue, Y.
(2009). Systematic study of protein sumoylation: Development of a site-specific
predictor of SUMOsp 2.0. Proteomics. 9(12), 3409-3412.
Ren, J., Wen, L., Gao, X., Jin, C., Xue, Y., and Yao, X. (2008). CSS-Palm 2.0: an
updated software for palmitoylation sites prediction. Protein Eng Des Sel. 21(11),
639-644.
Reverter, D., Sorimachi, H., and Bode, W. (2001). The structure of calcium free
human m-calpain. Implications for calcium activation and function. Trends
Cardiovasc Med. 11, 222–229.
Sakai, K., Akanuma, H., Imahori, K., and Kawashima, S. (1987). A unique specificity
of a calcium activated neutral protease indicated in histone hydrolysis. J Biochem.
101(4), 911-918.
Sasaki, T., Kikuchi, T., Yumoto, N., Yoshimura, N., and Murachi, T. (1984).
Comparative specificity and kinetic studies on porcine calpain I and calpain II with
naturally occurring peptides and synthetic fluorogenic substrates. J Biol Chem.
259(20), 12489-12494.
Schneider, T.D., and Stephens, R.M. (1990). Sequence logos: A new way to display
consensus sequences. Nucleic Acids Res. 18, 6097–6100.
Shao, J., Xu, D., Tsai, S.N., Wang, Y., and Ngai, S.M. (2009). Computational
Identification of Protein Methylation Sites through Bi-Profile Bayes Feature
Extraction. PLoS One. 4, e4920.
Stabach, P.R., Cianci, C.D., Glantz, S.B., Zhang, Z., and Morrow, J.S. (1997).
Site-directed mutagenesis of alpha II spectrin at codon 1175 modulates its mu-calpain
susceptibility. Biochemistry. 36(1), 57-65.
Strobl, S., Fernandez-Catalan, C., Braun, M. et al. (2000). The crystal structure of
calcium-free human m-calpain suggests an electrostatic switch mechanism for
activation by calcium. Proc. Natl Acad. Sci. USA. 97, 588–592.
Suzuki, K. (1991). Nomenclature of calcium dependent proteinase. Biomed. Biochim.
Acta. 50, 483-484.
Tompa, P., Buzder-Lantos, P., Tantos, A., Farkas, A., Szilagyi, A., Banoczi, Z.,
Hudecz, F., and Friedrich, P. (2004). On the sequential determinants of calpain
cleavage. J. Biol. Chem. 279, 20775–20785.
Uniprot Consortium. (2010). The Universal Protein Resource in 2010. Nucleic Acids
Res. 38, D142–D148.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer
Verlag.
Vosler, P. S., Brennan, C. S., and Chen, J. (2008). Calpain-Mediated Signaling
Mechanisms in Neuronal Injury and Neurodegeneration. Mol Neurobiol. 38, 78–
100.
Wang, K.K., Villalobo, A., and Roufogalis, B.D. (1989). Calmodulin-binding proteins
as calpain substrates. Biochem J. 262(3), 693-706.
Wee, L.J., Tan, T.W., and Ranganathan, S. (2006). SVM-based prediction of caspase
substrate cleavage sites. BMC Bioinformatics. 7 (Suppl 5), S14.
Wee, L.J., Tong, J.C., Tan, T.W., and Ranganathan, S. (2009). A multi-factor model for
caspase degradome prediction. BMC Genomics. 10 (Suppl 3), S6.
Xue, Y., Ren, J., Gao, X., Jin, C., Wen, L., and Yao X. (2008). GPS 2.0, a tool to
predict kinase-specific phosphorylation sites in hierarchy. Mol Cell Proteomics.
7(9), 1598-1608.
Zhang, S.W., Pan, Q., Zhang, H.C., Zhang, Y.L., and Wang, H.Y. (2003).
Classification of protein quaternary structure with support vector machine.
Bioinformatics. 19, 2390-2396.
APPENDIX A
Table A-1: Dataset of calpain substrate cleavage sites (for cross-validation and SVM training).
Uniprot ID
P1 Position1
Cleavage Site2
ACAN
P13608
1229
EDLS-VLPS
ACAN
P13608
1287
EDLS-VLPS
ACAN
P13608
1346
EDLS-VLPS
ACAN
P13608
474
APGA-AEVP
ACAN
P13608
719
PGVA-AVPI
ACAN
P13608
365
FGVG-GEED
ACAN
P13608
1307
EDLG-VLPS
ACAN
P16112
1411
EDLS-GLPS
ACAN
P16112
954
GDLS-GLPS
ACAN
P16112
1681
PDLS-GQPS
ACAN
P16112
1452
TDLS-GLPS
ACAN
P16112
1431
GDLS-GVPS
ACAN
P16112
973
GDLS-GLPS
P68133/P62736/P60709
39, 39, 37
IVGR-PRHQ
ACTN1
P12814
243
TYVS-SFYH
ACTN1
P12814
246
SSFY-HAFS
ACTN4
O43707
265
SSFY-HAFS
Agt
P01015
28
DRVY-IHPF
Aifm1
Q9JM53
103
GLGL-SPEE
Aifm1
Q9Z0X1
102
MGLG-LSPE
Aifm1
Q9Z0X1
118
SATE-GGSV
AMPD1
P23109
85
VNLS-IPLS
AMPD1
P23109
97
TKLS-HIDE
AMPH
P49418
377
SPMS-QTLP
AMPH
P49418
392
TDLV-QPAS
AMPH
P49418
454
DLGM-DTRA
AMPH
P49418
333
PEIS-VTTP
AMPH
P49418
478
AAVG-TLVS
AMPH
P49418
593
PIQD-PQPT
AMPH
P49418
531
EELE-ATVP
AMPH
P49418
527
QPEA-EELE
AMPH
P49418
609
ADQL-ASAR
Ankrd2
Q9WV06
77
EEKR-LGVQ
ANXA1
P04083
26
QTVK-SSKG
Calpain Substrate
ACTA1/ACTA2/ACTB
ATG5
Q9H1Y0
193
YQTT-TERP
Q01814-1/6
1124, 1079
RELR-RGQI
ATXN3
P54252
200
AQLK-EQRV
ATXN3
P54252
60
DYRT-FLQQ
BAX
Q07812
33
FIQD-RAGR
BAX
Q07812
28
LLLQ-GFIQ
BCL2
P10415
73
SPLQ-TPAA
BCL2L1
Q07817
42
EGTE-SEME
BID
P55957
54
WEGY-DELQ
ATP2B2
Page | 66
Uniprot ID
P1 Position1
Cleavage Site2
P55957
70
SRLG-RIEA
Camk4
P08414/Q16566
199, 203
TVCG-TPGY
Camk4
P08414
23
STEN-LVPD
CANP B
Q9VT65
74
HAQN-ASYA
CAPN1
P07384
27
RELG-LGRH
Capn3
P16259
296
NMDN-SLLR
Capn3
P16259
322
PVQY-ETRM
CASP14
P31944
152
VMVI-KDSP
CASP-7
P55210
36
PSLF-SKKK
CASP-7
P55210
45
NVTM-RSIK
CASP-7
P55210
47
TMRS-IKTT
CASP9
P55211-1/2
330
DQLD-AISS
CASP9
P55211-1
143
GALE-SLRG
CASP9
P55211-1
120
KPEV-LRPE
CASP9
P55211-1/2
115
RPEI-RKPE
CASP9
Calpain Substrate
BID
P55211-2
120
KPEV-LRPE
CDK5R1
Q15078
98
LSTF-AQPP
CDK5R2
Q13319
100
QQRN-RENL
CDK5R2
Q13319
108
LRKG-RDPP
CDK5R2
Q13319
105
ENLL-RKGR
CDKN2D
P55273
64
LKQG-ASPN
CDKN2D
P55273
29
RLLH-RELV
CDKN2D
P55273
113
PIHL-AVQE
CDKN2D
P55273
127
SFLA-AESD
CDKN2D
P55273
47
TALQ-VMMF
CDKN2D
P55273
25
QEVR-RLLH
CRYBA1
P11843
22
AQTN-PMPG
Ctnnb1
Q02248
28
WQQQ-SYLD
Ctnnb1
Q02248
29
QQQS-YLDS
Ctnnb1
Q02248
30
QQSY-LDSG
CTTN
Q14247-1
358
ENLA-KEKE
CTTN
Q14247-1
351
SNIR-ANFE
CTTN
Q14247-1
336
AYQK-TVPV
DMD
P11532
690
TVTT-REQI
DMD
P11532
1992
MPLE-ISYV
EGFR
P00533
1030
PSTS-RTPL
EGFR
P00533
1086
DDTF-LPVP
EGFR
P00533
1151
NSTF-DSPA
EGFR
P00533
683
RRLL-QERE
EGFR
P00533
733
LWIP-EGEK
EGFR
P00533
1185
KPNG-IFKG
F2R
P25116
32
PESK-ATNA
F2R
P25116
76
SINK-SSPL
F2RL1
P55085
58
VETV-FSVD
Page | 67
Uniprot ID
P1 Position1
Cleavage Site2
F2RL1
P55085
71
VLTG-KLTT
FADK 1
Q05397
745
YQVS-GYPG
FCGR2A
P12318
263
EPPG-RQMI
FCGR2A
P12318
268
QMIA-IRKR
FCGR2A
P12318
255
DPVK-AAQF
FLNA
P21333
1761
APQY-TYAQ
Fos
P12841
90
PSQT-RAPH
GAD2
Q05329
69
AAAR-KAAC
Gap43
P07936
40
KIQA-SFRG
Gcg
P06883
67
KYLD-SRRA
Gcg
P06883
69
LDSR-RAQD
Gcg
P06883
77
FVQW-LMNT
Gcg
P06883
74
AQDF-VQWL
GJA8
P55917
300
SPLS-AKPF
Gnrh1
P07490
28
HWSY-GLRP
Gnrh1
P07490
29
WSYG-LRPG
Grin2a
Q00959
1278
NALQ-FQKN
Grin2a
Q00959
1329
GSLF-SVPS
Grm1
P23385
936
LTKS-YQGS
GRM1
Q13255
936
LTKS-YQGS
HIST1H2BC
P62808
106
LPGE-LAKH
HIST1H2BC
P62808
42
SVYV-YKVL
HIST1H2BC
P62808
81
ASRL-AHYN
HIST1H2BC
P62808
64
GIMN-SFVN
HIST1H2BC
P62808
20
KAVT-KAQK
HIST1H2BC
P62808
101
AVRL-LLPG
HIST1H2BC
P62808
46
YKVL-KQVH
HIST1H2BC
P62808
40
SYSV-YVYK
HIST1H2BC
P62808
96
REIQ-TAVR
HIST1H2BC
P62808
105
LLPG-ELAK
HIST1H2BC
P62808
53
HPDT-GISS
HTT
P42858
534
SHSS-SQVS
HTT
P42858
467
SALT-ASVK
IGFBP-2
P18065
202
TEQH-RQMG
IGFBP-3
P17936
175
HPLH-SKII
IGFBP4
P22692
143
QKHF-AKIR
IGFBP4
P22692
107
AEIE-AIQE
IGFBP4
P22692
23
LGDE-AIHC
IGFBP4
P22692
159
MKVN-GAPR
IGFBP5
P24593
161
KKLT-QSKF
IGFBP5
P24593
172
AENT-AHPR
IGFBP5
P24593
22
QSLG-SFVH
IL1A
P01583
118
PFSF-LSNV
INS
P01317
29
VNQH-LCGS
Calpain Substrate
Page | 68
Uniprot ID
P1 Position1
Cleavage Site2
INS
P01317
40
EALY-LVCG
INS
P01317
37
HLVE-ALYL
ITGB1
P05556-5
777
KWDT-QENP
ITGB1
P05556-5
772
EKMN-AKWD
ITGB1
P05556
771
KEKM-NAKW
ITGB1
P05556-5
767
AKFE-KEKM
ITGB1
P05556-5
778
WDTQ-ENPI
ITGB1
P05556-5
771
KEKM-NAKW
ITGB1
P05556
778
WDTG-ENPI
ITGB3
P05106
767
KWDT-ANNP
ITGB3
P05106
768
WDTA-NNPL
ITGB3
P05106
761
EERA-RAKW
ITGB7
P26010
770
QLNW-KQDS
ITGB7
P26010
774
KQDS-NPLY
ITGB7
P26010
769
QQLN-WKQD
ITGB7
P26010
766
KEQQ-QLNW
ITGB7
P26010
760
EYSR-FEKE
ITGB7
P26010
765
EKEQ-QQLN
ITGB7
P26010
773
WKQD-SNPL
Jun
P17325
90
HITT-TPTP
Jun
P17325
62
DLLT-SPDV
Jun
P17325
164
ASLH-SEPP
Jun
P17325
42
TLNL-ADPV
KRT18
P05783
78
GGIQ-NEKE
KRT18
P05783
253
ADIR-AQYD
KRT18
P05783
186
HGLR-KVID
KRT18
P05783
80
IQNE-KETM
KRT18
P05783
236
LTVE-VDAP
KRT18
P05783
305
TELR-RTVQ
KRT18
P05783
64
GGLA-TGIA
KRT18
P05783
290
AEVG-AAET
KRT18
P05783
137
EDLR-AQIF
KRT18
P05783
59
GGMG-SGGL
KRT18
P05783
284
VVTT-QSAE
KRT8
P05787
444
SSFG-SGAG
KRT8
P05787
79
LVLE-VDPN
KRT8
P05787
77
SPLV-LEVD
KRT8
P05787
75
LLSP-LVLE
KRT8
P05787
440
YSLG-SSFG
KRT8
P05787
236
RELQ-SQIS
Marcks
P26645
127
SSTS-SPKA
MARP2/ANKRD2
Q9GZV1
103
LDLR-REII
MBP
P02687
93
KNIV-TPRT
MBP
P02687
96
VTPR-TPPP
Calpain Substrate
Page | 69
Uniprot ID
P1 Position1
Cleavage Site2
MBP | P02686-1/2/3/4/5 | 152 | LATA-STMD
MBP | P02686-1 | 204 | AHYG-SLPQ
MBP | P02686-3/4 | 50 | GGDR-GAPK
MBP | P02686-3/4 | 97 | AHYG-SLPQ
MBP | P02686-1/2 | 157, 24 | TMDH-ARHG
MBP | P02686-1/2/3/4/5 | 161, 28 | ARHG-FLPR
MBP | P02686-1/3/4/5/6 | 279, 172, 161, 146, 135 | KGVD-AQGT
MBP | P02686-1/3/4/5 | 213, 106, 106, 80, 80 | SHGR-TQDE
MBP | P02686-1/3/5 | 231, 124, 98 | VTPR-TPPP
MBP | P02686-1/3/5 | 241, 134, 108 | GKGR-GLSL
MBP | P02686-1/3/5 | 265, 158, 132 | GGRA-SDYK
MBP | P02686-1/3/5 | 244, 137, 111 | RGLS-LSRF
MBP | P02686-4/6 | 147, 121 | GGRA-SDYK
MBP | P02686-4/6 | 124, 98 | VTPR-TPPP
MIP | Q6RZ07 | 239 | ILKG-TRPS
MIP | P30301 | 237 | LSVL-KGAK
MIP | P30301 | 238 | SVLK-GAKP
MIP | Q6RZ07 | 238 | SILK-GTRP
MIP_RAT | P09011 | 236 | SILK-GARP
Mtap2 | P15146-3 | 99 | QVVT-AEAV
MYO5A | Q02440 | 1140 | LPLR-MEEP
MYOC | Q99972 | 226 | PASR-ILKE
NEFM | O77788 | 467 | EDEK-SEME
NF2 | P35240-1/2 | 298 | LILQ-LCIG
NF2 | P35240-1/2/6 | 294, 294, 252 | RVNK-LILQ
NFKBIA | P25963 | 50 | KELQ-EIRL
PARP1 | P18493 | 502 | GKSG-AAPS
PARP1 | P18493 | 384 | AAVH-SGPP
PARP1 | P18493 | 658 | KKLT-VNPG
PDE1A | P54750 | 126 | HAVQ-AGIF
PDE1A | P14100 | 126 | HVVQ-AGIF
Pdyn | NA | 207 | GFLR-RIRP
Pdyn | NA | 214 | PKLK-WDNQ
PHKG | P00518 | 303 | SPRG-KFKV
Plasmepsin-1 | P39898 | 123 | PHLG-NAGD
Plasmepsin-2 | P46925 | 124 | NYLG-SSND
PLCB1 | P10894 | 880 | QALH-SQPA
Ppp3ca | P63329 | 392 | AAAR-KEVI
Ppp3ca | P63329 | 424 | LTLK-GLTP
PPP3CA | Q08209 | 501 | SINK-ALTS
Prkca | P05696 | 309 | EKAK-LGPA
Prkca | P05696 | 316 | AGNK-VISP
Prkca | P05696 | 324 | SEDR-KQPS
Prkcb | P68403 | 311 | AKIG-QGTK
Prkcg | P63319 | 338 | KRCF-FGAS
Prkcg | P63319 | 321 | GPSS-SPIP
PTBP1 | P26599 | 165 | LALA-ASAA
PTBP1 | P26599 | 163 | GNLA-LAAS
PTPRN | Q16849 | 659 | SVSS-QFSD
PTRF | Q6NZI2 | 370 | PDVH-ALLE
RB1 | P06400 | 810 | SPLK-SPYK
RCAN1 | P53805-2 | 133 | DLLY-AISK
RGP51 | Q6QUW1 | 58 | QQLS-SSGI
RYR1 | P11716 | 1400 | AMMT-QPPA
RYR1 | P11716 | 2843 | RKIS-QTAQ
SAG | P08168 | 377 | NFVF-EEFA
SAG | P08168 | 380 | FEEF-ARQN
SLC6A3 | Q01959 | 71 | DFLL-SVIG
SLC6A3 | Q01959 | 43 | VQLT-SSTL
Slc6a5 | P58295 | 164 | VVLG-TDGI
Slc6a5 | P58295 | 156 | WVNM-SQST
Slc6a9 | Q63322/P28572-1 | 26 | QNLT-RGNW
Slc6a9 | Q63322/P28572-1 | 26 | QNLT-RGNW
Slc6a9 | Q63322/P28572-2 | 31 | QNLT-RGNW
Slc6a9 | Q63323/P28572-1 | 31 | QNLT-RGNW
SLC8A3 | P57103 | 512 | PRAV-LASP
SLC8A3 | P57103 | 510 | PLPR-AVLA
SLC8A3 | P57103 | 370 | NILK-KHAA
SLC8A3 | P57103 | 504 | AIFN-SLPL
SMN1 | Q16637 | 193 | WNSF-LPPP
SMN1 | Q16637 | 192 | PWNS-FLPP
SNCA | P37840 | 57 | TVAE-KTKE
SNCA | P37840 | 83 | KTVE-GAGS
SNCA | P37840 | 75 | TGVT-AVAQ
SNCA | P37840 | 74 | VTGV-TAVA
SNCA | P37840 | 114 | GILE-DMPV
SPTAN1 | Q13813 | 1230 | QLLG-SAHE
SPTAN1 | Q13813 | 1176 | QEVY-GMMP
SPTB | P11277 | 2058 | EKST-ASWA
SPTB | P11277 | 2061 | TASW-AERF
SPTBN1 | Q01082 | 1440 | EELQ-SQAQ
SPTBN1 | Q01082 | 1467 | QTKF-MELL
SPTBN1 | Q01082 | 1482 | HNLL-ASKE
SPTBN1 | Q01082 | 1447 | QALS-QEGK
SPTBN1 | Q01082-1/3 | 2066, 2053 | EKSA-ATWD
TH | P17289 | 30 | EAIM-SPRF
TH | P17289 | 22 | SELD-AKQA
Tln1 | P26039 | 433 | TVLQ-QQYN
Tnnt2 | P50752 | 81 | KPSR-LFMP
TNNT3 | P02641 | 65 | PKLT-APKI
TOP1 | P11387 | 183 | DKDK-KVPE
TOP1 | P11387 | 158 | ADYK-PKKI
TP53 | P04637 | 20 | ETFS-DLWK
TPM1 | P58772 | 256 | IDDL-EDEL
TPM1 | P58772 | 223 | KYEE-EIKV
TPM1 | P58772 | 183 | EERA-ELSE
TPM1 | P58772 | 241 | RAEF-AERS
TPM1 | P58772 | 205 | NNLK-SLEA
TPM1 | P58772 | 27 | QAEA-DKKA
TPM1 | P58772 | 204 | TNNL-KSLE
tra-2 | P34709 | 1088 | ATKQ-MFES
TTN | Q9Y6L9,Q8WZ42-4 | 8651 | QRLS-QTEP
TTN | Q9Y6L9,Q8WZ42-4 | 8506 | IHQK-GDEA
TTN | Q9Y6L9,Q8WZ42-4 | 8563 | MLKK-TPIL
TTN | Q9Y6L9,Q8WZ42-4 | 8652 | RLSQ-TEPV
Ttn | A2ASS6 | 9828 | IPDS-RVPI
Vim | P20152 | 53 | RSLY-SSSP
Vim | P20152 | 92 | ADAI-NTEF
Vim | P20152 | 71 | VRLR-SSVP
Vim | P20152 | 33 | YVTT-STRT
Vim | P20152 | 64 | YVTR-SSAV
Vim | P20152 | 38 | TRTY-SLGS
Vim | P20152 | 41 | YSLG-SALR
Vim | P20152 | 266 | PDLT-AALR
VWF | P04275 | 1913 | TLLK-SHRV
VWF | P04275 | 763 | RSKR-SLSC
1 Position of the P1 amino acid in the protein sequence as reported in Uniprot.
2 Cleavage sites are reported as octapeptides in the order: P4-P3-P2-P1-P1’-P2’-P3’-P4’.
Cleavage sites containing exact sequence information but originating from multiple
isoforms (if any) are demarcated by commas.
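The octapeptide notation in the footnotes can be unpacked mechanically. The Python sketch below is illustrative only and is not taken from the report: the function name `parse_cleavage_site` is hypothetical, and it assumes that the hyphen marks the scissile bond between P1 and P1’ and that the P1 position is a 1-based index into the Uniprot sequence.

```python
# Labels follow the order given in the table footnote: P4 ... P1, then P1' ... P4'.
SITE_LABELS = ["P4", "P3", "P2", "P1", "P1'", "P2'", "P3'", "P4'"]

def parse_cleavage_site(site, p1_position):
    """Map an octapeptide such as 'KWDT-QENP' to (label, residue, position) tuples.

    Assumption: the hyphen separates P1 from P1', and p1_position is the
    1-based sequence position of the P1 residue.
    """
    left, right = site.split("-")
    if len(left) != 4 or len(right) != 4:
        raise ValueError("expected four residues on each side of the hyphen")
    first_position = p1_position - 3  # P4 lies three residues upstream of P1
    residues = left + right
    return [
        (label, residue, first_position + offset)
        for offset, (label, residue) in enumerate(zip(SITE_LABELS, residues))
    ]

# Example using the ITGB1 row (P1 position 777, site KWDT-QENP).
for label, residue, position in parse_cleavage_site("KWDT-QENP", 777):
    print(label, residue, position)
```

For rows that list several P1 positions (multiple isoforms), the same function could be called once per position.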
Table A-2: Dataset of calpain substrate cleavage sites (for independent out-of-sample testing).
Calpain Substrate | Uniprot ID | P1 Position1 | Cleavage Site2
30K, Calpain regulatory subunit | O42134 | 87 | GFGL-DTCR
ACAN | P13608 | 1249 | EDLS-VLPS
ACAN | P16112 | 365 | FGVG-GEED
ACAN | P16112 | 1353 | GDLS-GLPS
ACAN | P16112 | 1472 | EDLS-GLPS
ACAN | P16112 | 709 | PGVA-AVPV
Ap2b1 | P62944 | 677 | PATF-APSP
AP2B1 | Q3ZB97 | 691 | PATF-APSP
ATP2B2 | Q01814-1/6 | 1135, 1090 | RGLN-RIQT
ATXN3 | P54252 | 260 | MQGS-SRNI
Bcl2l1 | Q64373 | 60 | WHLA-DSPA
CANP B | Q9VT65 | 224 | PENQ-NMFW
Capn3 | P16259 | 591 | ISVD-RPVK
Capn3 | P16259 | 274 | NMTY-GTSP
COPB1 | P53618 | 528 | SALS-SSRP
Ctnnb1 | Q02248 | 95 | QRVR-AAMF
Cttn | Q60598 | 358 | ENLA-KERE
CTTN | Q14247-1 | 346 | VTSK-TSNI
EGFR | P00533 | 1059 | QSCP-IKED
EZR | P15311 | 467 | HLVM-TAPP
F2RL1 | P55085 | 59 | ETVF-SVDE
F2RL1 | P55085 | 45 | VDGT-SHVT
FLNC | Q14315 | 2626 | SSYS-SIPK
Gcg | P06883 | 79 | QWLM-NTKR
Gfap | P03995 | 56 | GALN-AGFK
Gfap | P03995 | 29 | RQLG-TMPR
GJA8 | P55917 | 290 | PLTE-VGMV
INS | P01317 | 32 | HLCG-SHLV
INS | P01317 | 50 | GFFY-TPKA
ITGB1 | P05556 | 777 | KWDT-GENP
ITGB2 | P05107 | 744 | EKLK-SQWN
ITGB7 | P26010 | 778 | NPLY-KSAI
KRT18 | P05783 | 286 | TTQS-AEVG
KRT18 | P05783 | 30 | RPVS-SAAS
KRT18 | P05783 | 285 | VTTQ-SAEV
KRT8 | P05787 | 73 | QSLL-SPLV
KRT8 | P05787 | 72 | NQSL-LSPL
LCP1 | P13796 | 109 | TSEQ-SSVG
MBP | P02687 | 68 | THYG-SLPQ
MBP | P02686-1/5/6 | 183, 50 | GGDR-GAPK
MBP S | P02688 | 114 | VHFF-KNIV
NEFM | O77788 | 516 | SPVK-ATAP
PARP1 | P18493 | 480 | THLL-SPWG
Prkcb | P68403 | 320 | PEEK-TANT
PTPRN | Q16849 | 608 | RQQD-KERL
PTRF | Q6NZI2 | 30 | AGAQ-AAEE
SNCA | P37840 | 73 | VVTG-VTAV
TP53 | P04637 | 25 | LWKL-LPEN
TPM1 | P58772 | 208 | KSLE-AQAE
Vim | P20152 | 21 | SGTS-SRPS
1 Position of the P1 amino acid in the protein sequence as reported in Uniprot.
2 Cleavage sites are reported as octapeptides in the order: P4-P3-P2-P1-P1’-P2’-P3’-P4’.
Cleavage sites containing exact sequence information but originating from multiple
isoforms (if any) are demarcated by commas.
APPENDIX B
B-1: Grid search optimization tables obtained for simple binary encoded symmetrical
subsequence windows (P4P4’, P8P8’, P12P12’, P16P16’ and P20P20’) and asymmetrical
subsequence windows (P4P12’ and P12P4’).
Optimal γ and C values are highlighted in blue.
Binary P4P4’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 66.15 | 65.98 | 61.34 | 51.89 | 51.89 | 51.72
0.01 | 66.15 | 65.98 | 61.34 | 51.89 | 51.89 | 51.72
0.1 | 66.15 | 65.98 | 61.34 | 51.89 | 51.89 | 52.75
1 | 66.15 | 68.04 | 69.93 | 55.15 | 53.44 | 53.44
10 | 68.73 | 68.56 | 69.42 | 55.33 | 53.44 | 53.44
100 | 69.07 | 68.21 | 69.42 | 55.33 | 53.44 | 53.44

Binary P8P8’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 67.18 | 65.98 | 53.44 | 52.06 | 51.20 | 51.20
0.01 | 67.18 | 65.98 | 53.44 | 52.06 | 51.20 | 51.20
0.1 | 67.18 | 65.98 | 53.44 | 52.06 | 52.06 | 52.06
1 | 67.18 | 70.27 | 72.16 | 53.95 | 52.92 | 52.92
10 | 71.48 | 69.07 | 72.68 | 53.95 | 52.92 | 52.92
100 | 68.90 | 63.75 | 72.68 | 53.95 | 52.92 | 52.92

Binary P12P12’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 63.40 | 61.86 | 52.23 | 50.86 | 50.69 | 50.69
0.01 | 63.40 | 61.86 | 52.23 | 51.37 | 50.69 | 50.69
0.1 | 63.40 | 61.86 | 52.23 | 51.37 | 51.37 | 51.37
1 | 63.40 | 71.99 | 70.62 | 52.92 | 52.41 | 52.41
10 | 71.48 | 68.38 | 72.34 | 53.44 | 52.41 | 52.41
100 | 68.04 | 66.15 | 72.34 | 53.44 | 52.41 | 52.41

Binary P16P16’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 57.39 | 55.67 | 51.20 | 50.69 | 50.69 | 50.69
0.01 | 57.39 | 55.67 | 51.20 | 50.69 | 50.69 | 50.69
0.1 | 57.39 | 55.67 | 51.20 | 50.86 | 50.86 | 50.86
1 | 57.39 | 70.62 | 61.51 | 52.23 | 51.03 | 51.03
10 | 69.93 | 66.49 | 65.46 | 52.58 | 51.03 | 51.03
100 | 65.46 | 66.49 | 65.46 | 52.58 | 51.03 | 51.03

Binary P20P20’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 55.67 | 54.12 | 51.55 | 51.03 | 51.03 | 52.23
0.01 | 55.67 | 54.12 | 51.55 | 51.03 | 51.03 | 52.23
0.1 | 55.67 | 54.12 | 51.55 | 51.03 | 51.03 | 52.23
1 | 55.67 | 69.24 | 51.19 | 52.23 | 51.03 | 52.23
10 | 67.18 | 65.81 | 56.87 | 52.23 | 51.03 | 52.23
100 | 62.89 | 65.81 | 56.87 | 52.23 | 51.03 | 52.23

Binary P4P12’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 66.67 | 67.01 | 54.30 | 51.37 | 50.69 | 50.69
0.01 | 66.67 | 67.01 | 54.30 | 51.37 | 50.69 | 50.69
0.1 | 66.67 | 67.01 | 54.30 | 51.37 | 51.37 | 51.37
1 | 66.67 | 69.07 | 69.59 | 53.44 | 52.41 | 52.41
10 | 68.56 | 68.90 | 69.24 | 53.61 | 52.41 | 52.41
100 | 69.42 | 64.26 | 69.24 | 53.61 | 52.41 | 52.41

Binary P12P4’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 62.71 | 60.31 | 53.44 | 51.72 | 51.72 | 51.72
0.01 | 62.71 | 60.31 | 53.44 | 51.72 | 51.72 | 51.72
0.1 | 62.71 | 60.31 | 53.44 | 51.72 | 52.58 | 52.58
1 | 62.71 | 71.13 | 70.79 | 54.12 | 53.09 | 53.09
10 | 71.31 | 68.04 | 71.82 | 54.12 | 53.09 | 53.09
100 | 67.53 | 64.43 | 71.82 | 54.12 | 53.09 | 53.09
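The window names used for these tables (for example P4P4’ or P4P12’) can be read as the number of residues taken upstream and downstream of the scissile bond. The Python sketch below illustrates one way such windows might be cut from a sequence and binary encoded. It is not the report's own code: the function names, the padding character 'X', the residue ordering and the 20-bit one-hot layout are all assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this ordering is an assumption

def extract_window(sequence, p1_position, n_upstream, n_downstream, pad="X"):
    """Return residues Pn..P1 followed by P1'..Pm' around a 1-based P1 position.

    Positions that fall outside the sequence are filled with `pad`
    (a hypothetical choice; how the report treats termini is not shown here).
    """
    p1_index = p1_position - 1           # convert to 0-based indexing
    start = p1_index - n_upstream + 1    # index of the Pn residue
    stop = p1_index + n_downstream       # index of the Pm' residue (inclusive)
    return "".join(
        sequence[i] if 0 <= i < len(sequence) else pad
        for i in range(start, stop + 1)
    )

def binary_encode(window):
    """One-hot encode each residue as 20 bits; pad or unknown residues become all zeros."""
    bits = []
    for residue in window:
        bits.extend(1 if residue == aa else 0 for aa in AMINO_ACIDS)
    return bits

toy_sequence = "ACDEFGHIKLMNPQRSTVWY"  # toy sequence, not a real substrate
symmetric = extract_window(toy_sequence, 10, 4, 4)     # P4P4' window
asymmetric = extract_window(toy_sequence, 10, 4, 12)   # P4P12' window
print(symmetric, asymmetric, len(binary_encode(symmetric)))
```

Under these assumptions a P4P4’ window yields 8 x 20 = 160 features, and each additional residue in the window adds 20 more.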
B-2: Grid search optimization tables obtained for Bayes Feature Extraction (BFE)
encoded symmetrical subsequence windows (P4P4’, P8P8’, P12P12’, P16P16’ and P20P20’)
and asymmetrical subsequence windows (P4P12’ and P12P4’).
Optimal γ and C values are highlighted in blue.
Bayes P4P4’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 63.92 | 69.93 | 52.58 | 51.89 | 51.89 | 51.72
0.01 | 63.92 | 69.93 | 52.58 | 51.89 | 51.89 | 51.72
0.1 | 69.07 | 71.13 | 52.58 | 51.89 | 52.75 | 52.75
1 | 74.57 | 71.99 | 55.50 | 53.61 | 53.44 | 53.44
10 | 73.88 | 67.87 | 57.04 | 53.61 | 53.44 | 53.44
100 | 69.59 | 67.70 | 57.04 | 53.61 | 53.44 | 53.44

Bayes P8P8’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 66.49 | 70.96 | 52.58 | 51.20 | 51.20 | 51.20
0.01 | 66.49 | 70.96 | 52.58 | 51.20 | 51.20 | 51.20
0.1 | 70.10 | 70.96 | 52.58 | 52.06 | 52.06 | 52.06
1 | 76.63 | 73.02 | 54.30 | 52.92 | 52.92 | 52.92
10 | 76.46 | 72.34 | 54.30 | 52.92 | 52.92 | 52.92
100 | 71.48 | 72.34 | 54.30 | 52.92 | 52.92 | 52.92

Bayes P12P12’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 67.01 | 53.78 | 51.72 | 50.69 | 50.69 | 50.69
0.01 | 67.01 | 53.78 | 51.72 | 50.69 | 50.69 | 50.69
0.1 | 70.27 | 53.78 | 51.72 | 51.37 | 51.37 | 51.37
1 | 79.21 | 71.82 | 53.26 | 52.41 | 52.41 | 52.41
10 | 78.87 | 72.16 | 53.44 | 52.41 | 52.41 | 52.41
100 | 75.26 | 72.16 | 53.44 | 52.41 | 52.41 | 52.41

Bayes P16P16’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 66.84 | 51.20 | 50.69 | 50.69 | 50.69 | 50.69
0.01 | 66.84 | 51.20 | 50.69 | 50.69 | 50.69 | 50.69
0.1 | 70.45 | 51.20 | 50.86 | 50.86 | 50.86 | 50.86
1 | 80.58 | 62.71 | 51.89 | 51.03 | 51.03 | 51.03
10 | 78.01 | 67.70 | 51.89 | 51.03 | 51.03 | 51.03
100 | 78.01 | 67.70 | 51.89 | 51.03 | 51.03 | 51.03

Bayes P20P20’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 68.04 | 51.37 | 50.17 | 49.66 | 52.34 | 49.31
0.01 | 68.04 | 51.37 | 50.17 | 49.66 | 52.34 | 49.31
0.1 | 70.27 | 51.37 | 50.17 | 49.66 | 52.34 | 49.31
1 | 79.55 | 56.53 | 51.37 | 49.66 | 52.34 | 49.31
10 | 78.52 | 57.73 | 52.23 | 50.17 | 52.34 | 49.31
100 | 78.01 | 57.73 | 52.23 | 50.17 | 52.34 | 49.31

Bayes P4P12’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 64.78 | 70.45 | 51.89 | 50.69 | 50.69 | 50.69
0.01 | 64.78 | 70.45 | 51.89 | 50.69 | 50.69 | 50.69
0.1 | 69.42 | 70.45 | 51.89 | 51.37 | 51.37 | 51.37
1 | 78.18 | 73.02 | 53.44 | 52.41 | 52.41 | 52.41
10 | 78.18 | 71.99 | 53.61 | 52.41 | 52.41 | 52.41
100 | 74.91 | 71.99 | 53.61 | 52.41 | 52.41 | 52.41

Bayes P12P4’
C / γ | 0.001 | 0.01 | 0.1 | 1 | 10 | 100
0.001 | 66.49 | 70.27 | 51.72 | 51.72 | 51.72 | 51.72
0.01 | 66.49 | 70.27 | 51.72 | 51.72 | 51.72 | 51.72
0.1 | 70.45 | 70.27 | 51.72 | 52.58 | 52.58 | 52.58
1 | 76.29 | 72.68 | 54.81 | 53.09 | 53.09 | 53.09
10 | 76.80 | 73.02 | 54.98 | 53.09 | 53.09 | 53.09
100 | 72.34 | 73.02 | 54.98 | 53.09 | 53.09 | 53.09
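The colour highlighting of the optimal cells does not survive in plain text, so the following Python sketch shows how the best cell of one grid could be located programmatically. It assumes that the optimum is simply the highest value in the table; the name `best_cell` is hypothetical, the values are copied from the BFE P16P16’ grid, and the row and column axes are left unnamed because the orientation of C and γ is not stated in the table itself.

```python
AXIS_VALUES = [0.001, 0.01, 0.1, 1, 10, 100]  # values shared by the row and column axes

# One inner list per table row, copied from the BFE P16P16' grid.
BAYES_P16P16 = [
    [66.84, 51.20, 50.69, 50.69, 50.69, 50.69],
    [66.84, 51.20, 50.69, 50.69, 50.69, 50.69],
    [70.45, 51.20, 50.86, 50.86, 50.86, 50.86],
    [80.58, 62.71, 51.89, 51.03, 51.03, 51.03],
    [78.01, 67.70, 51.89, 51.03, 51.03, 51.03],
    [78.01, 67.70, 51.89, 51.03, 51.03, 51.03],
]

def best_cell(table, axis_values=AXIS_VALUES):
    """Return (score, row_value, column_value) for the highest-scoring cell.

    Ties are resolved in favour of the first cell met in row-major order.
    """
    best = None
    for r, row in enumerate(table):
        for c, score in enumerate(row):
            if best is None or score > best[0]:
                best = (score, axis_values[r], axis_values[c])
    return best

score, row_value, column_value = best_cell(BAYES_P16P16)
print(score, row_value, column_value)
```

The same function can be applied to any of the other grids by substituting its rows.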
APPENDIX C
Figure C-1: Combined ROC curves and AROC scores generated for simple binary
encoded symmetrical and asymmetrical subsequence windows.
Figure C-2: Combined ROC curves and AROC scores generated for BFE encoded
symmetrical and asymmetrical subsequence windows.
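Assuming that AROC denotes the area under the ROC curve, the score can be computed directly from classifier outputs without plotting the curve. The Python sketch below uses the rank-based (Mann-Whitney) formulation; the function name and the toy labels and scores are hypothetical and are not results from this project.

```python
def area_under_roc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the fraction of (positive, negative) pairs in which the positive
    example receives the higher score; ties count as half a win.
    """
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Toy data: 1 = cleavage site, 0 = non-cleavage site (hypothetical scores).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.6, 0.1]
print(area_under_roc(toy_labels, toy_scores))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.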
APPENDIX D
For Tables D-1 to D-3, maximum intensity scores for each residue position are highlighted in yellow.
Table D-1: Amino acid intensities generated for the 40-mer positive dataset.
Table D-2: Amino acid intensities generated for the 40-mer negative dataset.
Table D-3: Calculated amino acid propensity, P_X.
Table D-4: Average P_X of each amino acid is calculated by averaging the P_X values of
the particular amino acid across all residue positions on the 40-mer peptides from the
experimentally-verified calpain cleavage sites (positive dataset), randomly generated
calpain non-cleavage sites (negative dataset) and calculated propensity P_X.
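The averaging described in the caption of Table D-4 amounts to a per-residue mean over positions. The Python sketch below illustrates only that step: the function name is hypothetical and the input values are toy numbers, not entries from Tables D-1 to D-4.

```python
def average_across_positions(per_position_values):
    """Average each amino acid's values over all residue positions.

    `per_position_values` maps a one-letter amino acid code to a list holding
    one value per residue position (40 entries for a 40-mer peptide).
    """
    return {
        amino_acid: sum(values) / len(values)
        for amino_acid, values in per_position_values.items()
    }

# Toy input with three positions instead of forty (hypothetical values).
toy_values = {"L": [1.5, 0.5, 1.0], "P": [0.5, 1.0, 0.75]}
print(average_across_positions(toy_values))
```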
APPENDIX E
Figure E-1: Gantt chart of the BME499 Capstone project plan. Emphasis has been placed on the critical stages of calpain substrate dataset collection,
implementation and testing of the SVM methodology, and report writing.