Spyros Charonis, 3rd Year PhD Candidate, University of Manchester

Spyros Charonis
3rd Year PhD Candidate
University of Manchester
Supervisors:
Dr Robin Curtis (Chemical Engineering)
Dr Jim Warwicker (Life Sciences)
Background
• Topic
– Computational predictive models of protein solubility and
aggregation
• Motivation
– Protein aggregation is a major bottleneck in bioprocessing
pipelines
– Developing algorithms & methods to predict aggregation-prone
sequences can aid manufacturing processes of therapeutics
Sequence-level Predictors
KR-ratio in E. coli
Genome-wide study of E. coli protein solubilities
(Niwa et al., 2009) was used to test KR-ratio as a
binary classifier (SOL/INS separation)
Solubilities of the ORF protein products were
measured experimentally (a bimodal
distribution was observed)
Niwa et al., 2009
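The KR-ratio test described on this slide can be sketched as follows. This is a minimal illustration, not the author's implementation: it assumes that the KR-ratio is the count of lysine (K) divided by the count of arginine (R), that a higher ratio points towards SOL, and that the cut-off is 1.0. The function names `kr_ratio` and `classify` and the threshold are hypothetical.

```python
from typing import Optional


def kr_ratio(sequence: str) -> Optional[float]:
    """Return count(K) / count(R), or None when the sequence has no arginine.

    Assumption: the KR-ratio is the lysine-to-arginine count ratio.
    """
    seq = sequence.upper()
    k_count = seq.count("K")
    r_count = seq.count("R")
    if r_count == 0:
        return None  # ratio is undefined without arginine
    return k_count / r_count


def classify(sequence: str, threshold: float = 1.0) -> Optional[str]:
    """Label a sequence SOL or INS from its KR-ratio.

    Assumptions: a ratio at or above the (hypothetical) threshold is
    labelled SOL, anything below is labelled INS.
    """
    ratio = kr_ratio(sequence)
    if ratio is None:
        return None
    return "SOL" if ratio >= threshold else "INS"
```

In practice the threshold would have to be chosen from the measured solubility data, for example at the point that best separates the two modes of the bimodal distribution.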
KR/DE-ratio in Human SA
Serum albumin is the most abundant protein in
human blood plasma
Real Sequence
A. Cumulative percentage plot of KR-ratio & DE-ratio
for 47 human serum albumin sequences
Random Sequence
B. Probability distributions plotted against KR/DE-ratios for binomial distributions
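One way to build the random-sequence reference in panel B is sketched below. It assumes a simple null model that the slide does not spell out: each of the n basic positions (K or R) is lysine with probability `p_lys`, so the number of lysines follows a binomial distribution. The names `binomial_kr_distribution` and `ratio_from_counts` are hypothetical, as is the default `p_lys = 0.5`.

```python
from math import comb


def binomial_kr_distribution(n_basic: int, p_lys: float = 0.5):
    """Return (k, probability) pairs for k lysines among n_basic K+R positions.

    Assumption: each basic position is independently K with probability p_lys.
    """
    return [
        (k, comb(n_basic, k) * p_lys**k * (1 - p_lys) ** (n_basic - k))
        for k in range(n_basic + 1)
    ]


def ratio_from_counts(k: int, n_basic: int) -> float:
    """Convert a lysine count into a K/R ratio (infinite when there is no R)."""
    r = n_basic - k
    return k / r if r else float("inf")
```

Plotting the probabilities against `ratio_from_counts(k, n_basic)` gives the expected spread of ratios for random sequences, which can then be compared with the observed cumulative plot in panel A. The same sketch applies to the DE-ratio by substituting D and E.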
KR/DE-ratio in Human Mb
Myoglobin is an abundant oxygen-binding protein in human muscle tissue
Real Sequence
A. Cumulative percentage plot of KR-ratio & DE-ratio
for 45 human myoglobin sequences
Random Sequence
B. Probability distributions plotted against KR/DE-ratios for binomial distributions
Sequence-level binary classification
• Extend this type of analysis and identify significant sequence
disparities between highly soluble and insoluble proteins
• Several sequence properties considered
– KR-ratio, DE-ratio, length, residue composition, etc.
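The sequence properties listed above could be collected per protein roughly as follows. This is a sketch under stated assumptions: the charged-residue terms are expressed as fractions of sequence length, the absolute charge counts only K, R, D and E (histidine and termini are ignored), and the function name `sequence_properties` and the dictionary keys are hypothetical.

```python
from collections import Counter


def sequence_properties(sequence: str) -> dict:
    """Compute a few simple sequence-level properties for one protein."""
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    counts = Counter(seq)
    positive = counts["K"] + counts["R"]  # basic residues
    negative = counts["D"] + counts["E"]  # acidic residues
    return {
        "length": n,
        "K+R": positive / n,
        "D+E": negative / n,
        "K+R+D+E": (positive + negative) / n,
        # Assumption: absolute net charge per residue from K, R, D, E only
        "abs_charge": abs(positive - negative) / n,
    }
```

Returning a plain dictionary keeps each property addressable by name, which makes it straightforward to compare the same property across the SOL and INS datasets.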
METHODS
• Statistical comparison of SOL/INS datasets to observe interesting
sequence disparities
• Difference between SOL/INS z-scores calculated for each sequence
property and visualized in a “heat map” matrix
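The z-score comparison could be sketched as below. The slides do not give the exact formula, so this assumes that each property is standardized against the pooled SOL+INS values and that one heat-matrix cell is the mean SOL z-score minus the mean INS z-score. The names `zscore_difference` and `heat_matrix_row` are hypothetical.

```python
from statistics import mean, pstdev


def zscore_difference(sol_values, ins_values) -> float:
    """Mean z-score of SOL minus mean z-score of INS for one property.

    Assumption: z-scores use the mean and population standard deviation
    of the pooled SOL+INS values.
    """
    sol_values, ins_values = list(sol_values), list(ins_values)
    pooled = sol_values + ins_values
    mu, sigma = mean(pooled), pstdev(pooled)
    if sigma == 0:
        return 0.0  # property is constant, so there is no difference
    z_sol = mean((v - mu) / sigma for v in sol_values)
    z_ins = mean((v - mu) / sigma for v in ins_values)
    return z_sol - z_ins


def heat_matrix_row(sol_props, ins_props) -> dict:
    """Build one heat-matrix row from lists of per-protein property dicts."""
    return {
        key: zscore_difference(
            [p[key] for p in sol_props], [p[key] for p in ins_props]
        )
        for key in sol_props[0]
    }
```

One row per dataset, stacked, gives a matrix whose columns are properties; large positive or negative cells mark the properties on which SOL and INS proteins differ most.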
SOL vs INS comparison
E. coli dataset (Niwa et al., 2009)
Plot: sequence-level parameters (x-axis: parameter; y-axis: normalized value, 0 to 1.2), INS vs SOL
Sequence “heat matrix”
E. coli dataset (Niwa et al., 2009)
Salient Differences
• Sequence length
– Longer proteins are less soluble
• Electrostatic properties (KR-ratio, D+E, K+R+D+E, Abs-charge)
– Net surface charge increases solubility
• Sequence entropy
– Lower sequence entropy implies higher solubility? This merits
further investigation
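The slides do not define sequence entropy. A common choice, assumed here, is the Shannon entropy of the amino-acid composition in bits. The function name `sequence_entropy` is hypothetical.

```python
from collections import Counter
from math import log2


def sequence_entropy(sequence: str) -> float:
    """Shannon entropy (bits) of the residue composition of a sequence.

    Assumption: entropy is computed over single-residue frequencies,
    ignoring residue order.
    """
    seq = sequence.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    return -sum((c / n) * log2(c / n) for c in Counter(seq).values())
```

Under this definition a sequence made of a single residue type has entropy 0, and a sequence that uses all 20 standard amino acids equally reaches the maximum of log2(20), about 4.32 bits.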
SOL vs INS comparison
SOLP dataset (Magnan et al., 2009)
Plot: sequence-level parameters (x-axis: parameter; y-axis: normalized value, 0 to 1.2), INS vs SOL
Comparing Solubility Datasets
• E. coli dataset (Niwa et al., 2009) vs SOLP dataset (Magnan et al., 2009)
• The salient differences observed in the E. coli dataset are not present
in SOLP
• This supports data-driven methodologies (e.g. Niwa et al.) based on
actual measurements of solubility over reliance on annotations from
primary databases (SOLP)
• SOL/INS lines diverge more on the E. coli plot than on the SOLP plot,
with better “separation” on the heat matrix as well
The Search for Datasets
• Several protein datasets have been collected but solubility
data is not always available
• Use imperfect but useful proxies, e.g. protein abundance
levels & mRNA expression levels
• Large-scale quantitative proteomics studies (MaxQB,
PeptideAtlas) are useful resources but usually do not directly
measure protein solubilities
– Studies measuring solubility on a proteomic scale are quite rare
Results/Conclusions
• KR-ratio is interesting and easy to use as a protein engineering tool
– Easy to swap ARG for LYS
• Net charge is well known to affect solubility
– A significant change in overall charge can alter protein purification
properties
• Sequence entropy differences between SOL/INS are intriguing
– Entropy is not readily modifiable
• More datasets need to be considered …
Heat matrix across datasets
Challenges & Final Year Work
• Lack of extensive high-quality solubility datasets
– Limited data are available to test our predictors, so expression/abundance
level studies are used as well
• CAUTION: consider growth phase of organisms (bacteria, yeast,
fungi) when protein abundance data are quantified
• Use heat matrix with even more datasets to get clues as to which
predictors to look at more closely
• Incorporate 3D analysis (e.g. surface charge, polarity) and perhaps
aromatics (FWY frequencies)
– Sequence pipeline to cross-reference sequence IDs (e.g. UniProtKB database) with structures
(PDB database)
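The aromatic (FWY) frequency mentioned above is simple to compute at the sequence level. The sketch below assumes it is defined as the fraction of residues that are phenylalanine, tryptophan or tyrosine. The function name `aromatic_fraction` is hypothetical.

```python
def aromatic_fraction(sequence: str) -> float:
    """Fraction of residues that are aromatic (F, W or Y).

    Assumption: frequency is the FWY count divided by sequence length.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(seq.count(residue) for residue in "FWY") / len(seq)
```

A value computed this way could be added as one more column of the heat matrix alongside the electrostatic properties.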
Structural Analysis - SOL/INS datasets
PyMOL rendering of Fab fragment (PDB 7FAB) with surface charge & polarity
Acknowledgements
• EPSRC CDT (Funding body)
• UCL Centre for Innovative Manufacturing
• Curtis-Warwicker group
• Dr Robin Curtis
• Dr Jim Warwicker
• Colleagues @ MIB
Thank you!
Questions?