Thesis defense

Name: DHIFLI Wajdi
Thesis laboratory: LIMOS
Thesis supervisor: Prof. MEPHU NGUIFO Engelbert
Defense date: 11/12/2013

Jury members:

Reviewers:
1. Prof. Mohammed Javeed Zaki (Rensselaer Polytechnic Institute, USA)
2. Prof. Abdoulaye Baniré Diallo (University of Québec, Canada)
3. Prof. Jan Ramon (Katholieke Universiteit Leuven, Belgium)

Examiners:
1. DR. David W. Ritchie (INRIA, Nancy, France)
2. DR. Jean Sallantin (LIRMM, Montpellier, France)
3. DR. Jean-François Gibrat (INRA, Jouy-en-Josas, France)
4. Dr. Annegret Wagler (LIMOS, Clermont-Ferrand, France)

Thesis title: Topological and Domain Knowledge-based Subgraph Mining: Application on Protein 3D-Structures (original French title: Fouille de Sous-graphes Basée sur la Topologie et la Connaissance du Domaine: Application sur les Structures 3D de Protéines)

Abstract:
This thesis lies at the intersection of two rapidly growing research fields, namely data mining and bioinformatics. With the emergence of graph data in the last few years, many efforts have been devoted to mining frequent subgraphs from graph databases. Yet the number of discovered frequent subgraphs is usually exponential, mainly because of the combinatorial nature of graphs. Many frequent subgraphs are irrelevant because they are redundant or simply useless to the user. Moreover, their sheer number may hinder further exploration or even make it infeasible. Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities, since most discovered subgraphs differ only slightly in structure and may convey similar or even identical meanings. In this thesis, we propose two approaches for selecting representative subgraphs among the frequent ones in order to remove redundancy. Each of the proposed approaches addresses a specific type of redundancy.
The first approach focuses on semantic redundancy, where similarity between subgraphs is measured based on the similarity between their node labels, using prior domain knowledge. The second approach focuses on structural redundancy, where each subgraph is represented by a set of user-defined topological descriptors, and similarity between subgraphs is measured based on the distance between their corresponding topological descriptions.

The main application data of this thesis are protein 3D-structures. This choice is based on biological and computational reasons. From a biological perspective, proteins play crucial roles in almost every biological process and are responsible for a variety of physiological functions. From a computational perspective, we are interested in mining complex data. Proteins are a perfect example of such data, as they are complex structures composed of interconnected amino acids, which are themselves composed of interconnected atoms. Large numbers of protein structures are currently available in online databases, in computer-analyzable formats. Protein 3D-structures can be transformed into graphs where amino acids are the graph nodes and their connections are the graph edges. This makes it possible to study them with graph mining techniques. The biological importance of proteins, their complexity, and their availability in computer-analyzable formats make them well-suited application data for this thesis.
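
As a rough illustration of the protein-to-graph transformation mentioned above, the following minimal Python sketch builds a labeled graph from a list of amino acids. The abstract does not specify how "connections" between amino acids are defined, so this sketch assumes, purely for illustration, that two amino acids are connected when one representative 3D coordinate per residue lies within a distance cutoff. The function name `protein_to_graph`, the input layout, the 7.0 cutoff, and the toy coordinates are all hypothetical and are not taken from the thesis.

```python
from itertools import combinations
from math import dist


def protein_to_graph(residues, cutoff=7.0):
    """Build a labeled graph from a list of residues.

    residues: list of (amino_acid_label, (x, y, z)) tuples, one per amino acid.
    cutoff:   assumed distance threshold (illustrative value, not from the thesis).

    Returns (nodes, edges), where nodes maps a residue index to its label and
    edges is a set of index pairs (i, j) with i < j.
    """
    # One node per amino acid, labeled with the amino acid type.
    nodes = {i: label for i, (label, _) in enumerate(residues)}

    # Assumption: two amino acids are "connected" when their representative
    # coordinates are at most `cutoff` apart.
    edges = {
        (i, j)
        for (i, (_, a)), (j, (_, b)) in combinations(enumerate(residues), 2)
        if dist(a, b) <= cutoff
    }
    return nodes, edges


# Toy input with made-up coordinates: the first two residues are close,
# the third is far from both.
toy_residues = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (20.0, 0.0, 0.0)),
]
nodes, edges = protein_to_graph(toy_residues)
```

On this toy input, only the first two residues end up connected. Any other definition of a connection (for example, one based on chemical bonds) could be substituted in the edge test without changing the rest of the sketch; the resulting node labels and edges are the kind of input that graph mining techniques operate on.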