Subject
Title: Specialization of BioLogical Models by Experimental Design Strategies

Thesis supervisor
Name: Olivier ROUX
Email: olivieroux@icloud.com
Telephone: +33 2 40 37 69 79
Research unit (with its accreditation label): IRCCyN
Host institution (university, school…): Ecole Centrale de Nantes
Number of theses currently supervised: 3

Co-supervisor (if any)
Name: Carito GUZIOLOWSKI
Email: Carito.guziolowski@irccyn.ecnantes.fr
Telephone: (+33) 2 40 37 69 78
Research unit (with its accreditation label): IRCCyN
Host institution (university, school…): Ecole Centrale de Nantes
Number of theses currently supervised: 2

Planned funding:

Summary of the thesis subject (3 to 5 lines)
High-throughput, massively parallel technologies allow us to observe different cellular components under a given condition. This information is of great value for generating and validating cell models. However, high-throughput measurements are in many cases taken over population aggregates, while the events being modeled occur at the level of individual cells. The aim of this thesis proposal is to provide a novel characterization of single-cell systems as a family of logic models subject to possible realizations. To achieve this, we aim to generate this family of cell-specific models from population phosphoproteomics data by using logic programming. The logic behaviors of this family will then be specialized by proposing efficient experimental strategies. This family of models may provide remarkable benefits to medical research by elucidating the mechanisms involved in cell dysfunction, which appear in a mutually exclusive fashion in specific cell populations.

Description of the thesis subject (2 pages maximum)
Supervision: Carito Guziolowski (60%) and Olivier Roux (40%)

This PhD subject addresses a problem directly relevant to health and wellbeing: developing biological models and confronting them with experimental data. Modeling the complex nature of a cell is a key challenge of this decade.
The recent development of high-throughput, massively parallel technologies allows us to observe different cellular components under a given condition. This information is of great value for generating and validating cell models. However, high-throughput measurements, such as microarrays or phosphoproteomics, are in many cases taken over population aggregates, while the signaling and transcriptional events that models represent occur at the level of individual cells. Over the past years, several studies have demonstrated high heterogeneity between individual cells. Currently, there is a need for methods that take this heterogeneity into account when modeling at the single-cell level and that integrate population-level high-throughput data sets into single-cell models. The goal of this PhD work is to generate cell-specific (logic) models by automatically proposing efficient experimental strategies that provide relevant measurements.

Modeling cell behavior at a large scale by integrating experimental data at the population level, in spite of its heterogeneity, noise and sparsity, is one of the main steps towards the design and control of in-cell biological systems. Manipulating these systems in silico can provide remarkable benefits to medical research through the understanding of the complex mechanisms involved in cell dysfunction.

Qualitative approaches, despite their simplicity, allow us to model large-scale biological systems, in contrast to quantitative methods. Among them, logic models are able to capture interesting and relevant behaviors in the cell, as several authors have shown in recent years (Mbodj et al., Mol. Biosyst., 2013; Morris et al., Methods Mol. Biol., 2013; Melas et al., Osteoarthr. Cartil., 2014). Due to factors such as the sparsity or the uncertainty of experimental measurements, the model is often non-identifiable.
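Non-identifiability can be illustrated on a toy example. The sketch below is not taken from our tools: the candidate gates, the node names A, B and C, and the observations are all invented for illustration. It counts, for each candidate Boolean gate, how many discretized observations it fails to reproduce, and returns every gate that reaches the best score.

```python
# Hypothetical candidate logic gates for a readout C driven by two stimuli A and B.
CANDIDATES = {
    "C = A": lambda a, b: a,
    "C = B": lambda a, b: b,
    "C = A AND B": lambda a, b: a & b,
    "C = A OR B": lambda a, b: a | b,
}

# Invented, already discretized observations: (A, B) -> measured C.
observations = {(0, 0): 0, (1, 1): 1}


def mismatches(gate, data):
    """Number of observations that the gate fails to reproduce."""
    return sum(gate(a, b) != c for (a, b), c in data.items())


def best_fitting(candidates, data):
    """Names of all gates that reach the minimal number of mismatches."""
    scores = {name: mismatches(gate, data) for name, gate in candidates.items()}
    best = min(scores.values())
    return sorted(name for name, score in scores.items() if score == best)


equivalent = best_fitting(CANDIDATES, observations)
print(len(equivalent), "gates fit the data equally well:", equivalent)
```

With only these two observations, all four gates explain the data perfectly, so the data cannot tell them apart. Adding a single well-chosen observation, for instance A=1 and B=0, would already rule out some of them.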
In contrast to quantitative modeling, logic models make it possible to tackle the model identification problem for large-scale biological systems. In fact, in a recent study, we showed that thousands of Boolean (early response) logic models fit several phosphoproteomic perturbation experiments similarly well. The aim of this PhD proposal is to explore the limits of logic model identifiability by proposing strategies to narrow down the number of logic models that fit the data equally well. The main goal is to select specific experiments that will increase the observability of the system, and to validate our results by showing that performing such experiments allows us to obtain precise model behaviors. Experiment selection will be achieved through exhaustive searches over the full space of admissible behaviors, using state-of-the-art solvers of constraint logic programs. After considering technological limitations (not all experimental designs are achievable for a particular biological system), we expect to reduce the variability among models and to exhaustively characterize their possible mutually exclusive behaviors. This reduced space of logical models may better represent cellular heterogeneity, which may have a crucial impact on cellular function.

PhD thesis objectives

We aim to study the variability of logical models learned from multiple phosphoproteomic perturbation experiments with two approaches. First, we will generalize the learning to include models with dynamic behaviors learned from time-series data. Indeed, the large variability of early response models could be reduced by imposing additional constraints on dynamic behaviors. Second, we will implement experimental design strategies that propose de novo experiments to reduce the variability of equivalent models, based on exhaustive optimization criteria.
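The idea behind the second approach can be sketched on a toy family of models. The example below is only an illustration under stated assumptions: the four models, the candidate experiments, and the pairwise-discrimination criterion are hypothetical, and the search is a brute-force enumeration in Python rather than the constraint logic programming approach envisaged in this proposal. Each candidate experiment is scored by the number of model pairs that predict different readouts for it, and the highest-scoring experiment is selected.

```python
from itertools import combinations

# Hypothetical family of Boolean models assumed to fit existing data equally well.
MODELS = {
    "A": lambda a, b: a,
    "B": lambda a, b: b,
    "A AND B": lambda a, b: a & b,
    "A OR B": lambda a, b: a | b,
}

# Hypothetical feasible experiments: stimulus combinations (A, B).
CANDIDATE_EXPERIMENTS = [(0, 0), (0, 1), (1, 0), (1, 1)]


def pairs_discriminated(experiment, models):
    """Number of model pairs that predict different readouts for the experiment."""
    a, b = experiment
    predictions = [model(a, b) for model in models.values()]
    return sum(p != q for p, q in combinations(predictions, 2))


def best_experiment(experiments, models):
    """Experiment separating the most model pairs (first one wins on ties)."""
    return max(experiments, key=lambda e: pairs_discriminated(e, models))


def surviving_models(experiment, outcome, models):
    """Models whose prediction agrees with the measured outcome."""
    a, b = experiment
    return {name: m for name, m in models.items() if m(a, b) == outcome}


chosen = best_experiment(CANDIDATE_EXPERIMENTS, MODELS)
remaining = surviving_models(chosen, 1, MODELS)  # assume the readout is measured as 1
print(chosen, sorted(remaining))
```

In this toy setting, the experiments in which both stimuli take the same value discriminate no pair of models, while an experiment with exactly one active stimulus halves the family whatever its outcome. Restricting `CANDIDATE_EXPERIMENTS` to the designs that are technologically achievable is one simple way to account for experimental limitations.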
Our preliminary results for the first approach show that a method based on constraint logic programming (Answer Set Programming, ASP) can learn a family of logical models with dynamic patterns from time-series data. Its solutions are similar to those of metaheuristic-based fuzzy logic or synchronous simulation methods, but they are obtained exhaustively and more efficiently. This is still preliminary work on a toy model, which needs to be improved and extended to larger case studies.

Concerning the second approach, we have previously proposed caspo (http://bioasp.github.io/caspo), a Python package based on ASP that learns logic models from data. This package allows for design analyses that take into account the exponential number of logic models displaying a high fit to data. However, iterating the design method of caspo performs poorly on real data, since search spaces become too large to explore and information about the system is not always available.

All in all, we have begun to explore both research paths, and our preliminary results show that methods inspired by logic programming can offer a solution to these problems. The main objective of this thesis is to extend, apply, and validate these methods on real case studies. In that context, one possibility for boosting our methods is to combine ASP optimization methods, inspired by artificial intelligence techniques, with local search metaheuristics. On the modeling side, the constraints imposed to tackle this problem can be enriched with abstract interpretation frameworks; in addition, multiple logic behaviors can be modeled with Probabilistic Boolean Network (PBN) approaches (Trairatphisan et al., Cell Commun. Signal., 2013). Our methods will be applied to infer logic models of signaling pathways from phosphoproteomics data.
We are particularly interested in this type of experiment because dysfunctional cell models are in many cases related to the deregulation of signaling pathways, and effects on these pathways are better measured by observing protein activation or deactivation than by observing gene expression. In particular, we will focus on three case studies based on published work: (a) the DREAM8 challenge time-series phosphoproteomic data of four breast cancer cell lines, (b) in silico generated perturbation data based on the HepG2 liver cancer cell line model, and (c) two phosphoproteomic datasets (ligand screening and combinatorial follow-up) of primary human hepatocytes