Thesis defense - EDSPI - Clermont-Fd

Thesis defense
Name: DHIFLI Wajdi
Thesis laboratory: LIMOS
Thesis advisor: Prof. MEPHU NGUIFO Engelbert
Defense date: 11/12/2013
Jury members:
Reviewers:
1. Prof. Mohammed Javeed Zaki (Rensselaer Polytechnic Institute, USA)
2. Prof. Abdoulaye Baniré Diallo (University of Québec, Canada)
3. Prof. Jan Ramon (Katholieke Universiteit Leuven, Belgium)

Examiners:
1. DR. David W. Ritchie (INRIA, Nancy, France)
2. DR. Jean Sallantin (LIRMM, Montpellier, France)
3. DR. Jean-François Gibrat (INRA, Jouy-en-Josas, France)
4. Dr. Annegret Wagler (LIMOS, Clermont-Ferrand, France)
Thesis title:
Fouille de Sous-graphes Basée sur la Topologie et la Connaissance du Domaine: Application sur
les Structures 3D de Protéines
(English title: Topological and Domain Knowledge-based Subgraph Mining: Application on
Protein 3D-Structures)
Abstract:
This thesis lies at the intersection of two rapidly growing research fields, namely data mining and
bioinformatics. With the emergence of graph data in the last few years, many efforts have been
devoted to mining frequent subgraphs from graph databases. Yet the number of discovered
frequent subgraphs is usually exponential, mainly because of the combinatorial nature of graphs.
Many frequent subgraphs are irrelevant because they are redundant or simply of no use to the user.
Moreover, their sheer number may hinder further exploration or even make it infeasible.
Redundancy in frequent subgraphs is mainly caused by structural and/or semantic similarities,
since most discovered subgraphs differ only slightly in structure and may convey similar or even
identical meanings.
In this thesis, we propose two approaches for selecting representative subgraphs among frequent
ones in order to remove redundancy. Each of the proposed approaches addresses a specific type
of redundancy. The first approach focuses on semantic redundancy, where similarity between
subgraphs is measured based on the similarity between their node labels, using prior domain
knowledge. The second approach focuses on structural redundancy, where subgraphs are
represented by a set of user-defined topological descriptors, and similarity between subgraphs is
measured based on the distance between their corresponding topological descriptions.
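As a rough illustration of the second (structural) approach, the sketch below describes each subgraph by a small vector of topological descriptors and keeps a subgraph as a representative only if it is far enough from every representative already kept. The abstract does not specify the descriptors, the distance, or the selection procedure, so everything here is an assumption made for illustration: the four descriptors, the Euclidean distance, the greedy selection, and the names `descriptors`, `distance`, `select_representatives`, and `threshold` are all hypothetical.

```python
from math import sqrt


def descriptors(edges):
    """Describe an undirected graph, given as a list of edges, by a small vector.

    The four descriptors (node count, edge count, average degree, density)
    are illustrative choices, not the ones used in the thesis.
    """
    nodes = {u for edge in edges for u in edge}
    n, m = len(nodes), len(edges)
    avg_degree = 2 * m / n if n else 0.0
    density = 2 * m / (n * (n - 1)) if n > 1 else 0.0
    return (n, m, avg_degree, density)


def distance(a, b):
    """Euclidean distance between two descriptor vectors (an assumption)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def select_representatives(subgraphs, threshold):
    """Greedy selection: keep a subgraph only if its descriptor vector is
    farther than `threshold` from every representative kept so far."""
    kept = []  # list of (name, descriptor vector)
    for name, edges in subgraphs.items():
        d = descriptors(edges)
        if all(distance(d, kept_d) > threshold for _, kept_d in kept):
            kept.append((name, d))
    return [name for name, _ in kept]


# Toy "frequent subgraphs"; node names are arbitrary.
subgraphs = {
    "path3": [("A", "B"), ("B", "C")],
    "path3_bis": [("X", "Y"), ("Y", "Z")],  # same topology as path3
    "triangle": [("A", "B"), ("B", "C"), ("A", "C")],
    "star4": [("H", "A"), ("H", "B"), ("H", "C")],
}

print(select_representatives(subgraphs, threshold=0.5))
# → ['path3', 'triangle', 'star4']
```

Here `path3_bis` is dropped because its description is identical to that of `path3`. The semantic approach could follow the same selection scheme with the descriptor distance swapped for a label-based similarity derived from domain knowledge.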
The main application data of this thesis are protein 3D-structures. This choice is based on
biological and computational reasons. From a biological perspective, proteins play crucial roles
in almost every biological process and are responsible for a variety of physiological functions.
From a computational perspective, we are interested in mining complex data. Proteins are a
perfect example of such data, as they are complex structures composed of interconnected
amino acids, which are themselves composed of interconnected atoms. Large numbers of protein
structures are currently available in online databases, in computer-analyzable formats. Protein
3D-structures can be transformed into graphs where amino acids are the graph nodes and their
connections are the graph edges. This enables the use of graph mining techniques to study them. The
biological importance of proteins, their complexity, and their availability in computer-analyzable
formats make them ideal application data for this thesis.
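The transformation of a protein 3D-structure into a graph can be sketched as follows. The abstract does not say how "connections" between amino acids are defined, so this sketch assumes one common convention: two residues are connected when their representative coordinates lie within a distance cutoff. The function name `protein_to_graph`, the 7.0 cutoff, and the toy coordinates are hypothetical and chosen only for illustration.

```python
from itertools import combinations
from math import dist  # Python 3.8+


def protein_to_graph(residues, cutoff=7.0):
    """Build a labeled graph from a list of (amino_acid_label, (x, y, z)).

    Nodes are residue indices labeled with the amino acid type; an edge links
    two residues whose coordinates are at most `cutoff` apart. The cutoff rule
    is an assumed convention, not necessarily the one used in the thesis.
    """
    nodes = {i: label for i, (label, _) in enumerate(residues)}
    edges = [
        (i, j)
        for (i, (_, p)), (j, (_, q)) in combinations(enumerate(residues), 2)
        if dist(p, q) <= cutoff
    ]
    return nodes, edges


# Toy coordinates (not taken from any real structure).
residues = [
    ("ALA", (0.0, 0.0, 0.0)),
    ("GLY", (3.8, 0.0, 0.0)),
    ("SER", (7.6, 0.0, 0.0)),
    ("LYS", (20.0, 0.0, 0.0)),
]

nodes, edges = protein_to_graph(residues)
print(nodes)  # → {0: 'ALA', 1: 'GLY', 2: 'SER', 3: 'LYS'}
print(edges)  # → [(0, 1), (1, 2)]
```

In this toy input, the fourth residue is farther than the cutoff from all others, so it becomes an isolated node. The resulting node labels and edge list are the kind of input that graph mining techniques operate on.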