Spyros Charonis
3rd Year PhD Candidate
University of Manchester
Supervisors:
Dr Robin Curtis (Chemical Engineering)
Dr Jim Warwicker (Life Sciences)

Background
• Topic
– Computational predictive models of protein solubility and aggregation
• Motivation
– Protein aggregation is a major bottleneck in bioprocessing pipelines
– Developing algorithms & methods to predict aggregation-prone sequences can aid manufacturing processes of therapeutics

Sequence-level Predictors

KR-ratio in E. coli
• Genome-wide study of E. coli protein solubilities (Niwa et al., 2009) was used to test KR-ratio as a binary classifier (SOL/INS separation)
• Solubilities of ORF protein products were measured experimentally (bimodal distribution observed)
[Figure: Niwa et al., 2009]

KR/DE-ratio in Human SA
Serum albumin is the most abundant protein in human blood plasma
A. Real Sequence: cumulative percentage plot of KR-ratio & DE-ratio for 47 human serum albumin sequences
B. Random Sequence: probability distributions plotted against KR/DE-ratios for binomial distributions

KR/DE-ratio in Human Mb
Myoglobin is also abundant in human blood plasma
A. Real Sequence: cumulative percentage plot of KR-ratio & DE-ratio for 45 human myoglobin sequences
B. Random Sequence: probability distributions plotted against KR/DE-ratios for binomial distributions

Sequence-level binary classification
• Extend this type of analysis and identify significant sequence disparities between highly soluble and insoluble proteins
• Several sequence properties considered
– KR-ratio, DE-ratio, length, residue composition, etc.
METHODS
• Statistical comparison of SOL/INS datasets to observe interesting sequence disparities
• Difference between SOL/INS z-scores calculated for each sequence property and visualized in a “heat map” matrix

SOL vs INS comparison
E. coli dataset (Niwa et al., 2009)
[Figure: Sequence-level Parameters; normalized value (0 to 1.2) per parameter, INS vs SOL]

Sequence “heat matrix”
E. coli dataset (Niwa et al., 2009)
Salient Differences
• Sequence length - Longer proteins are less soluble
• Electrostatic properties (KR-ratio, D+E, K+R+D+E, Abs-charge) - Net surface charge increases solubility
• Sequence entropy - Less sequence entropy implies higher solubility? This merits further investigation

SOL vs INS comparison
SOLP dataset (Magnan et al., 2009)
[Figure: Sequence-level Parameters; normalized value (0 to 1.2) per parameter, INS vs SOL]

Comparing Solubility Datasets
• E. coli dataset (Niwa et al., 2009) vs SOLP dataset (Magnan et al., 2009)
– The salient differences observed in the E. coli dataset are not present in SOLP
– This supports data-driven methodologies (e.g. Niwa et al.) based on actual measurements of solubility over relying on annotations from primary databases (SOLP)
– SOL/INS lines on E. coli plot are more divergent than on SOLP plot; better “separation” on heat matrix as well

The Search for Datasets
• Several protein datasets have been collected but solubility data is not always available
• Use imperfect but useful proxies, e.g. protein abundance levels & mRNA expression levels
• Large-scale quantitative proteomics studies (MaxQB, PeptideAtlas) are useful resources but usually do not directly measure protein solubilities
– Studies measuring solubility on a proteomic scale are quite rare

Results/Conclusions
• KR-ratio is interesting, easy to use as protein engineering tool
– Easy to swap ARG for LYS
• Net charge is well known to affect solubility
– Significant change in overall charge will alter protein purification properties
• Sequence entropy differences between SOL/INS intriguing
– Entropy is not something readily modifiable
• More datasets need to be considered …

Heat matrix across datasets

Challenges & Final Year Work
• Lack of extensive high-quality solubility datasets
– Not a huge amount of data to test our predictors, but we settle for expression/abundance level studies as well
• CAUTION: consider growth phase of organisms (bacteria, yeast, fungi) when protein abundance data are quantified
• Use heat matrix with even more datasets to get clues as to which predictors to look at more closely
• Incorporate 3D analysis (e.g. surface charge, polarity) and perhaps aromatics? (FWY frequencies)
– Sequence pipeline to cross-reference sequence IDs (e.g. UniProtKB database) with structures (PDB database)

Structural Analysis - SOL/INS datasets
PyMOL rendering of Fab fragment (PDB 7FAB) with surface charge & polarity

Acknowledgements
• EPSRC CDT (Funding body)
• UCL Centre for Innovative Manufacturing
• Curtis-Warwicker group
• Dr Robin Curtis
• Dr Jim Warwicker
• Colleagues @ MIB

Thank you! Questions?
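
Backup - sketch of the sequence-level comparison
The METHODS bullets (per-property SOL/INS z-score differences feeding a heat matrix) could be prototyped roughly as below. This is a minimal sketch and not the pipeline used for the slides. It assumes that KR-ratio means the count of K divided by the count of R, that DE-ratio means D divided by E, that abs-charge is |K+R-D-E| per residue, and that z-scores are taken over the pooled SOL+INS set. The function names and the toy sequences are hypothetical and carry no experimental meaning.

```python
from statistics import mean, pstdev

def ratio(seq, num, den):
    """Count ratio of residue `num` to residue `den`; None if `den` is absent."""
    n, d = seq.count(num), seq.count(den)
    return n / d if d else None

def properties(seq):
    """Sequence-level properties (definitions are assumptions, see lead-in)."""
    seq = seq.upper()
    pos = seq.count("K") + seq.count("R")   # positively charged residues
    neg = seq.count("D") + seq.count("E")   # negatively charged residues
    return {
        "length": len(seq),
        "KR_ratio": ratio(seq, "K", "R"),
        "DE_ratio": ratio(seq, "D", "E"),
        "abs_charge": abs(pos - neg) / len(seq),
    }

def zscore_difference(sol, ins):
    """Mean SOL z-score minus mean INS z-score for each property.

    z-scores use the mean and standard deviation of the pooled SOL+INS values;
    undefined ratios (None) are skipped for that property.
    """
    sol_props = [properties(s) for s in sol]
    ins_props = [properties(s) for s in ins]
    diffs = {}
    for key in sol_props[0]:
        sol_vals = [p[key] for p in sol_props if p[key] is not None]
        ins_vals = [p[key] for p in ins_props if p[key] is not None]
        if not sol_vals or not ins_vals:
            diffs[key] = 0.0                # nothing to compare
            continue
        pooled = sol_vals + ins_vals
        mu, sigma = mean(pooled), pstdev(pooled)
        if sigma == 0:
            diffs[key] = 0.0                # property does not vary
            continue
        z_sol = mean((v - mu) / sigma for v in sol_vals)
        z_ins = mean((v - mu) / sigma for v in ins_vals)
        diffs[key] = z_sol - z_ins
    return diffs

# Toy sequences for illustration only (not real proteins).
toy_sol = ["KKRDE", "KKKRDEE"]
toy_ins = ["KRRDDEAAAAAA", "KRRRDEAAAAAAAA"]

for name, diff in zscore_difference(toy_sol, toy_ins).items():
    print(f"{name:>10}: {diff:+.2f}")
```

Each row of the heat matrix would then be one dataset and each column one property, with the printed difference as the cell value; a positive value means the property is higher in SOL than in INS.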