THÈSE DE DOCTORAT DE L'UNIVERSITÉ PIERRE ET MARIE CURIE

Doctoral thesis of Université Pierre et Marie Curie (Paris 6)
Speciality: Biochemistry and Molecular Biology (doctoral school)
Presented by Mr Franck Rapaport to obtain the degree of Doctor of Université Pierre et Marie Curie.
Thesis subject: Introduction of a priori knowledge into the study of DNA microarrays (original title: Introduction de la connaissance a priori dans l'étude des puces à ADN).
Jury: Dr Gérard Biau, Dr Mark Van de Wiel, Dr Christophe Ambroise, Dr Stéphane Robin, Dr Emmanuel Barillot, Dr Jean-Philippe Vert.

Contents

Remerciements
Abstract
Résumé
1 Background
 1.1 Microarray analysis
  1.1.1 The cancerous disease
  1.1.2 CGH arrays
  1.1.3 Gene expression arrays
 1.2 The classification problem
  1.2.1 Unsupervised classification
  1.2.2 Supervised classification
 1.3 The curse of dimensionality
  1.3.1 Pre-processing methods
  1.3.2 Wrapper methods
 1.4 Contributions of this thesis
  1.4.1 Spectral analysis of gene expression profiles
  1.4.2 Supervised classification of aCGH data using fused L1 SVM
  1.4.3 Supervised classification of gene expression profiles using network-fused SVM
2 Spectral analysis
 2.1 Background
 2.2 Methods
  2.2.1 Overview of the method
  2.2.2 Spectral decomposition of gene expression profiles
  2.2.3 Deriving a metric for expression profiles
  2.2.4 Supervised learning and regression
 2.3 Data
 2.4 Results
  2.4.1 Unsupervised classification
  2.4.2 PCA analysis
  2.4.3 Supervised classification
  2.4.4 Interpretation of the SVM classifier
 2.5 Discussion
3 Fused SVM
 3.1 Introduction
 3.2 Methods
  3.2.1 ArrayCGH data
  3.2.2 Classification of arrayCGH data
  3.2.3 Linear supervised classification
  3.2.4 Fused lasso
  3.2.5 Fused SVM
  3.2.6 Implementation of the fused SVM
 3.3 Data
 3.4 Results
  3.4.1 Bladder tumors
  3.4.2 Melanoma tumors
 3.5 Discussion
4 Network-fused SVM
 4.1 Introduction
 4.2 Methods
  4.2.1 Usual linear supervised classification method
  4.2.2 Fusion and fused classification
  4.2.3 Network-fused classification
  4.2.4 Implementation
 4.3 Data
  4.3.1 Expression data sets
  4.3.2 Gene networks
 4.4 Results
  4.4.1 Performance
  4.4.2 Interpretation of the classifiers
 4.5 Discussion
Conclusion
Bibliography

List of Figures

1.1 Example of arrayCGH results
1.2 Example of arrays before and after a normalization
1.3 SVM in a separable case
1.4 SVM in a non-separable case
1.5 The hinge loss function
2.1 Decomposition of a gene expression profile
2.2 Example of Laplacian eigenvectors
2.3 Unsupervised classification results for the first method
2.4 PCA plot using the first method
2.5 Supervised classification results using the first method
2.6 Representation of the classifiers obtained with the first method
2.7 Glycolysis/gluconeogenesis pathways
2.8 Pyrimidine metabolism pathways
3.1 Bladder cancer dataset with grade classification
3.2 Bladder cancer dataset with stage classification
3.3 Uveal melanoma dataset
4.1 Performance on Van't Veer data set
4.2 Performance on Wang data set

List of Tables

4.1 Characteristics of the different networks
4.2 Performance on Van't Veer data set
4.3 Performance on Wang data set
4.4 Main categories of the Van't Veer classifiers
4.5 Main categories of the Wang classifiers

Remerciements

Let me apologise right away to all those I forget to thank. If you think you should be on this page but are not, it is because I was tired.
I warmly thank my two thesis advisors (even if only one of them is official, I can confirm that there are two of them), Jean-Philippe and Emmanuel. I learned an enormous amount from them, both scientifically and professionally in general. Thanks to them, for instance, I know that it is possible to finish a paper one hour before the deadline, even when errors are still being found in the scripts a few days earlier. On that subject, I am very grateful to them for not hitting me too hard.

I thank everyone who helped me in my research during these three years and a few months. Thanks to Marie Dutreix and above all (above all!) to Andrei Zinovyev for their contribution to the project. Thanks to the two Pierres, to Nicolas R and Philippe H, who did not grumble too much when I came to ask them silly questions about statistics (or other things!). Thanks to Séverine and Sabrinette, who always answered my questions about DNA microarrays with good humour. Thanks to Christina Leslie for saying nothing when I was working on my thesis instead of on what she was paying me for. A big thank you to Laurent and Anne for proofreading the introduction of my thesis; it badly needed it.

I thank Gérard Biau, Mark Van de Wiel, Christophe Ambroise and Stéphane Robin for doing me the honour of agreeing to be part of my jury. I thank them in particular for the pertinence of their remarks during the question session.

I also thank my various colleagues at the Mines de Paris and at the Institut Curie for their support and good spirits. I thank the whole oval office (Adil, Patrick, Stef, Séverine, Sabrinette, as well as the temporary members: Fanny, Perrine, Anne and Amélie) for their good jokes and the good atmosphere of my first two years of thesis. I thank Caroline and the members of CBio (Véro, Christian, Pierre, Martial, Misha and Brice) for all the friendship they showed me during my (rare) escapades to Fontainebleau. A very, very big thank you to the herd (Patrick, Fanny, Laurence, Gautier, Laurent, and the two latest additions, Fantine and Anne) for having brightened my last year. We can go back to Leysin to drink Grand Marnier whenever you like.

I also insist on thanking the systems office enormously. Thank you Gautier for your chocolate, your fondue and your 1980s songs. Thank you Laurent for your rap, your underwear showing and your terrible jokes. Thank you PCC for the way you eat mashed potatoes, your spitting on passers-by and your love of caipirinha. I especially thank Laurence Calzone for her friendship.

Of course, I thank my parents. For their emotional (and financial) support, but also because we must face the truth: if, during all those years, they had not kept me from playing video games so that I would work, I would not be putting the final touches to this thesis but mugging grannies to buy my fix of crack. I also thank my brother and my sister, always there to support me with their love and their scatological jokes. I also thank the rest of my family, and especially my grandparents, for all their affection. I want to tell my two grandmothers that, on the other side of the Atlantic, I miss their cooking.

Finally, I thank my friends, the real ones, the ones who were always there to listen to my whining: Flou, Patou, PE, Damien and Anais, thank you. On the other hand, I do not thank Facebook, which thoroughly ruined many of my working days.
Abstract

While gene expression arrays and array-based comparative genomic hybridization (arrayCGH) have become standard techniques for collecting quantitative measurements of the expression disorders and copy number aberrations associated with cancer, the experimental results remain difficult to analyse. Indeed, scientists are not only confronted with the curse of dimensionality, the high dimension of the data compared to the small number of samples, but they also struggle to relate numerical results to biological phenomena. A solution to these issues is to incorporate into the analysis process the "a priori" information that we have about the different biological relations underlying the data. Different methods have been designed for the analysis of microarray data, but they either did not use this information at all or only used it through heuristics.

In this thesis, we propose three new methods, built on solid mathematical bases, for introducing a priori knowledge into the analysis process. The first method is a dimension reduction technique that incorporates gene network knowledge into gene expression analysis. The approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph, resulting in an attenuation of the high-frequency components of the profiles with respect to the topology of the graph. This method can be used for unsupervised and supervised classification, and results in classifiers with more biological relevance than typical ones. We illustrate the method with the analysis of a set of expression profiles from irradiated and non-irradiated yeast strains.

The second method is a supervised classification method for arrayCGH profiles. The algorithm introduces two biological realities directly into the regularization terms of the classification problem: the strong interdependency of probes with neighbouring chromosomal positions, and the expected sparsity of a classifier that should focus on specific genomic aberrations in the arrayCGH profiles. This method is illustrated on three different classification problems spanning two different data sets.

The third method is a supervised classification method for gene expression profiles. The approach introduces gene network knowledge into the classification problem by adding a regularization term corresponding to the positive correlation between connected nodes of the graph associated with the gene network. This algorithm is tested on two different gene expression data sets with eight different gene networks of four different categories.

Résumé

While DNA microarrays, whether expression arrays or comparative genomic hybridization arrays, have become standard tools for producing quantitative measurements of the genetic disorders associated with cancer, their analysis remains a complicated task. Indeed, the different methods face two major problems: on the one hand, the very high dimension of the data relative to the small number of samples, and on the other hand, the difficulty of relating these numerical data to the underlying biological phenomena. One proposed solution is to incorporate into the numerical analysis our "a priori" knowledge of various biological relations, but the classification techniques, supervised or not, used so far either did not integrate this information or incorporated it into existing methods through heuristics.
In this thesis, we propose three new methods for microarray analysis, based on solid mathematical concepts, that integrate our a priori knowledge of the correlations underlying the problem. The first methodology we propose uses metabolic network data for the analysis of gene expression profiles. This approach is based on the spectral decomposition of the network through the Laplacian matrix of the associated graph. The microarray data are projected onto the basis of the space of functions over the genes formed by this spectral decomposition. Considering that biologically coherent functions should vary smoothly over the graph, i.e. that the expression values of two genes connected by an edge of the graph should be close, we can apply a filter that attenuates the high-frequency components of the expression profiles. We then apply standard unsupervised and supervised classification algorithms to obtain decision functions that are easier to interpret. These algorithms were applied to public data sets to discriminate the expression profiles of lightly irradiated and non-irradiated yeast. The interpretation of the classifiers suggests new directions for biological research.

The second proposed approach is a new supervised classification method for comparative genomic hybridization (arrayCGH) data. It is based on the usual classification problem, modified to integrate a double regularization constraint that translates two biological realities: the fact that two successive measurements on the same chromosome are very likely to belong to the same region of genomic alteration, and the low rate of such alterations. This method is applied to three cancer-related classification problems involving two different data sets. We obtain classification functions that are both more efficient and easier to interpret than those obtained with the usual supervised classification methods.

The last method is another way of introducing the correlation between connected genes of a network into the supervised classification of expression profiles. To this end, we added to the classical L1-regularized support vector machine problem a regularization term that translates our wish to assign, in the decision function, similar weights to two genes connected in the network. This approach is tested on two public cancer-related data sets with eight gene networks of four different types (metabolic, protein-protein interaction, influence and co-expression).

Chapter 1

Background

In this preliminary chapter, we discuss the different issues underlying this thesis. We start with a brief overview of cancer and of how the specificities of this disease led to the wide use of microarrays to monitor tumors. The following section is dedicated to microarray analysis techniques, with a particular focus on the supervised classification problem. We then see how the difficulties associated with this problem can be reduced by incorporating "a priori" knowledge into the analysis process, and review previous attempts to do so. Finally, the last section of this chapter summarizes our contributions to the problem and gives a quick overview of this thesis.
1.1 Microarray analysis for the study of the cancerous disease

This section aims to give nonspecialists a quick overview of the concepts underlying the use of microarrays for the study of cancer. The first subsection gives a precise definition of cancer and explains how it is related to mutations and abnormal gene behaviour. The two following subsections explain how specific types of abnormal gene behaviour can be monitored by two types of microarrays: gene expression arrays and comparative genomic hybridization arrays (also known as CGH arrays). These subsections include an overview of each technology for non-biologists, as well as the standard analysis processes and their specific issues.

1.1.1 The cancerous disease

It is generally admitted that cancerous cells are cells that, through mutations, have developed certain capacities allowing uncontrolled growth. [HW00] propose a list of these capacities:

• Self-sufficiency in growth signals: normal cells require specific signals from other cells before they can proliferate. These signals are transmitted into the cells by receptors that bind distinctive classes of signaling molecules. No type of normal cell is known to proliferate in the absence of such signals. Tumor cells generate many of their own growth signals, thereby reducing their dependency on stimulation from their environment.

• Insensitivity to growth-inhibitory (anti-growth) signals: within a normal tissue, multiple anti-proliferative signals operate. These signals, which block proliferation, include both soluble growth inhibitors and immobilized inhibitors embedded in the surfaces of nearby cells. Cancer cells must evade these anti-proliferative signals to prosper.

• Evasion of programmed cell death (apoptosis): the ability of a tumor cell population to expand in number is determined not only by the rate of cell proliferation but also by the rate of cell disappearance. Programmed cell death, apoptosis, represents a major source of this attrition. Observations indicate that the apoptotic program is present in latent form in virtually all cell types throughout the body. Once triggered by a variety of physiological signals, this program unfolds in a precise series of steps. Different examples have established the consensus that apoptosis is a major barrier to cancer that must be circumvented.

• Limitless replicative potential: many and perhaps all types of mammalian cells carry an intrinsic program that limits their multiplication. This program appears to operate independently of the cell-to-cell signaling pathways involved in the above capacities. It too must be disrupted for a clone of cells to expand to a size that constitutes a macroscopic tumor.

• Sustained angiogenesis: the oxygen and nutrients supplied by the vasculature are crucial for cell function and survival, obligating virtually all cells in a tissue to reside within a small distance of a capillary blood vessel. In order to progress to a large size, tumors must develop an angiogenic ability, that is, the ability to provoke blood vessel growth.

• Tissue invasion and metastasis: sooner or later during the development of most types of human cancer, primary tumor masses spawn pioneer cells that move out, invade adjacent tissues and travel to distant sites where they may succeed in founding new colonies. These distant settlements of tumor cells are called metastases.
The capability for invasion and metastasis enables cancer cells to escape the primary tumor mass and colonize new terrain in the body where, at least initially, nutrients and space are not limiting. Even if cells can be considered cancerous without this ability to metastasize, most cancerous cells will acquire it during their development.

Only a cell that has acquired each and every one of these capacities will be able to grow chaotically and without any constraint, and will therefore be considered cancerous. A malignant cell thus suffers from a perturbed functioning of its proteins, which causes all these capacities to be active.

An enabling characteristic for these capacities is the high genomic instability of cancer cells. Thanks to the efficiency of the cellular processes used to maintain genomic integrity, mutations are rare events in a normal cell. Cancer cells, however, have at some point escaped this protection and suffer from multiple mutations, so many, in fact, that at normal mutation rates they would be highly unlikely to accumulate within a human life span. Examples of these mutations include the hyperactivity of oncogenes, genes that activate chaotic cell proliferation, such as Myc or Abl, and the deletion of tumor-suppressor genes such as p53. This genomic instability can be seen as a seventh capacity of cancer. However, as it is more a prerequisite capability, allowing the other capabilities to be acquired, than a characteristic of uncontrolled cell growth itself, the authors did not include it in their list.

These characteristics can be acquired through large mutations, either aneuploidy of entire chromosome arms or gains or losses of smaller portions of chromosomes (from a few hundred to a few million base pairs), which can be seen with CGH arrays, or through other mechanisms such as local mutations (a change of base), translocations, viral insertions, etc. As these last changes cannot be seen with CGH arrays, the adequate analysis tool for them is gene expression profiling, which corresponds to an indirect analysis of the impacted protein production. In the two following sections, we discuss these two microarray techniques.

1.1.2 CGH arrays

During cell division, a cell must replicate its entire genome. Several types of chromosome alterations may occur during this process: regions of deoxyribonucleic acid (DNA) can be multiplied (resulting in a gain) or, on the contrary, deleted (resulting in a loss). Healthy cells maintain different mechanisms to correct and prevent this instability, but if one of these changes goes undetected or perturbs the correction mechanism, the cell may survive in this altered state. These changes may be responsible for one or several of the acquired capacities described in the previous section. Ewing's tumors, for example, are known to present characteristic gains in chromosomes 5, 8 or 12 [SMR+03].

CGH is a powerful molecular tool to analyze copy number changes (gains or losses) in the DNA content of a given subject, and especially in tumor cells. The method is based on hybridization, the formation of molecular links between a genetic sequence and its complementary sequence, of the DNA of interest (often tumor DNA) and of normal reference DNA to a human chromosome preparation.
Using fluorescence microscopy and quantitative image analysis, regional differences in gains or losses compared to control DNA can be detected and used to identify copy number aberrations (CNAs) in the genome. Figure 1.1 shows the result of a typical CGH array experiment.

Figure 1.1: Example of arrayCGH results (log2 scale). The picture depicts genomic events occurring on chromosome 18, among which a loss on the q arm. This image has been extracted from [BSBD+04].

Originally, this instability was measured with chromosomal CGH [ANM93], a technology that used entire chromosomes for hybridization purposes but whose resolution was quite low. Recent improvements in the resolution and sensitivity of CGH allowed the development of microarray-based CGH (also called arrayCGH or CGH array) [PSS+98], which uses probes, small portions of the genome arrayed on silicon, instead of entire chromosomes.

1.1.3 Gene expression arrays

Another interesting piece of information about a cancerous cell is the expression of specific genes. The expression of a gene is the quantity of corresponding messenger ribonucleic acid (mRNA) produced, the intermediary molecule between the DNA and the protein, which is correlated with the quantity of produced protein. A gene expression microarray, also called a DNA chip, is a collection of microscopic DNA spots arrayed on a solid surface, each of them mapping a particular transcribed region of the genome and known as a probe. These probes, usually tens of thousands of them, are used to measure the relative quantities of specific mRNAs produced by the studied cell. For this purpose, contact is made between the array and mRNA extracted from the sample. The intensity of the hybridized DNA fluorescence can then be optically measured and gives an estimate of the relative quantity of the mRNA of interest in the sample. In an error-free scenario, this intensity is proportional to the true number of transcripts present in the sample. There are two main types of DNA chips:

• Spotted microarrays: the probes are either long or short fragments of DNA (amplified by cloning or polymerase chain reaction (PCR)). This type of array is typically hybridized with complementary DNA (cDNA) from two samples to be compared, one of which is often a control tissue. The two samples are marked with two different fluorophores (red and green), mixed, and hybridized on the same microarray. A scanner then visualizes the fluorescence of each fluorophore. The relative intensities of the colors are used to identify up- and down-regulated genes. Absolute levels of gene expression cannot be determined, but relative differences among different genes can be estimated. This type of microarray is rarely used nowadays in cancer research.

• Oligonucleotide microarrays: the probes are designed to match parts of the sequences of known or predicted mRNAs. The probes are either 50- to 60-mers (on long oligonucleotide arrays) or 25- to 30-mers (on short oligonucleotide arrays). Companies such as Affymetrix or Agilent propose commercial microarrays that span the entire genome. Affymetrix microarrays give estimations of the absolute gene expression levels, so that the comparison of two conditions requires the use of two separate microarrays. By contrast, Agilent microarrays provide the same kind of information as spotted microarrays. Oligonucleotide microarrays often contain control probes designed to hybridize with known amounts of specific RNA transcripts called RNA spike-ins. These control probes are used to calibrate the expression level measurements.
Unfortunately, dealing with experiments that involve multiple microarrays requires a pre-processing of the gene expression profiles: normalization. Microarrays are subject to two types of variations: interesting variations and obscuring variations. Interesting variations are biological differences, such as large differences in the expression levels of specific genes between a diseased and a normal tissue sample. However, observed expression levels also include variations introduced during the experimental process. These variations may be related to differences in sample preparation, in the production of the arrays or in the processing of the arrays. Normalization is aimed at dealing with these obscuring variations (see, for example, figure 1.2).

Figure 1.2: The arrays on the right side depict the normalized values of the arrays on the left side. The un-normalized array presents a strong spatial bias, as the left area suffers from a much lower intensity distribution than the right area, which is characterized by the strong presence of blue spots. The normalized arrays clearly show much more homogeneous and unbiased values.

In gene expression profiles, these obscuring variations have different sources. A first source of variation is the dye bias (or pin tip bias in the case of spotted arrays): the relationship between gene expression and spot intensity may not be the same for different dyes (or spots), so that for a given concentration of mRNA the light intensity may differ. Another source of variation is related to spatial effects: due to a defective pin tip (portion of the microarray) or to a bad position of the array during hybridization, the spatial density of the signal may not be uniform.

Therefore, the first step of the analysis process is normalization. The user needs to make a hypothesis on the value distribution of the arrays and/or the value distribution of the genes, depending on the features to compare (entire arrays or the expression distribution of single genes). Per-array normalization techniques include global normalization, Lowess (sometimes referred to as Loess), MAS4 and MAS5 [mas]. Per-gene normalization methods include RMA [IBC+03], gc-RMA [WIG+04] and MAS7.
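As a concrete illustration of per-array normalization, the sketch below implements quantile normalization, which forces every array to share a common intensity distribution. It is given here only as a representative example of the idea, under the assumption that all arrays should have comparable value distributions; it is not one of the specific methods cited above.

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize a probes-by-arrays intensity matrix X.

    Every column (array) is mapped onto a common reference
    distribution: the mean of the sorted columns. Ties are broken
    arbitrarily, which is acceptable for an illustration.
    """
    order = np.argsort(X, axis=0)                # sorting permutation of each array
    ranks = np.argsort(order, axis=0)            # rank of each probe within its array
    reference = np.sort(X, axis=0).mean(axis=1)  # shared reference distribution
    return reference[ranks]

# Toy example: three arrays measuring four probes.
X = np.array([[5.0, 4.0, 3.0],
              [2.0, 1.0, 4.0],
              [3.0, 4.5, 6.0],
              [4.0, 2.0, 5.0]])
print(quantile_normalize(X))
```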
During this thesis, we aimed at increasing the efficiency and interpretability of microarray analysis by incorporating a priori knowledge into the process. We especially focused on the classification of microarray profiles. The following section therefore provides the background needed to understand these methods, by giving a brief overview of the usual classification techniques that we will refer to throughout the different chapters of this thesis.

1.2 The classification problem

The construction of a predictive model from microarray data is an important problem in computational biology. Typical applications include, for example, cancer diagnosis or prognosis, and discriminating between different treatments applied to micro-organisms. In this section we expose the generic unsupervised and supervised classification problems and, for each one, present and discuss a standard algorithm (respectively k-means and SVM). We especially focus on the supervised case, as it received more attention during this thesis.

The aim of classification is to build a function $f : \mathcal{X} \to \mathcal{Y}$ that is able to attribute to each sample $x \in \mathcal{X}$ the correct label $y \in \mathcal{Y}$. Supervised classification uses a training set of samples for which the labels are known to build the function $f$, while unsupervised classification does not.

1.2.1 Unsupervised classification

In this section, we expose the generic unsupervised classification problem, also known as partitional clustering, and present a standard algorithm used for clustering: the k-means algorithm. This algorithm was used for the work presented in chapter 2.

The general problem

Unsupervised classification, or partitional clustering, corresponds to the partitioning of a data set into subsets (or clusters) of samples that share a common trait, mathematically represented as proximity according to some defined distance measure $d$. If $m$ is the desired number of groups and $\mathcal{X}$ the sample space, a mathematical model of the problem is the search for the partition $X_1, \dots, X_m$ that minimises

$$\sum_{i=1}^{m} \frac{\sum_{x,y \in X_i} d(x,y)}{\sum_{x \in X_i,\, y \in \bar{X}_i} d(x,y)} \,, \qquad (1.1)$$

where, for all $i = 1, \dots, m$, $\bar{X}_i$ is the complement of $X_i$ in $\mathcal{X}$, i.e. the set of all elements $x$ of $\mathcal{X}$ such that $x$ is not in $X_i$. Each fraction is the quotient of the intra-group distances, i.e. the sum of all the distances between two different elements of one group, by the inter-group distances, i.e. the sum of all the distances between the elements of one group and the elements of the other groups. Minimizing this quotient gives groups that are as compact as possible while, at the same time, being as far apart from one another as possible.

Apart from the k-means method, which we will see in a little more detail in the next section, and its derivatives, clustering methods also include hierarchical clustering [War63] and graph-based methods such as Formal Concept Analysis [GW97].

The k-means algorithm

The k-means algorithm is one of the simplest partitional clustering algorithms and aims at assigning each point to the cluster whose center is the nearest. It is composed of the following steps [Mac67]:

• The user chooses a number k of groups.
• The algorithm randomly generates k points as the centers of random clusters.
• Each point is assigned to the closest center, according to a distance d.
• Each cluster center is recalculated as the mean of the data assigned to this cluster.
• The last two steps are repeated until the groups no longer vary or, if a maximal number of steps has been fixed by the user, until this maximal number of steps has been reached.

This algorithm is simple and fast but presents a big disadvantage: as the initial centers are chosen randomly, it may give different results with each run. Improvements of this method have been proposed [HKY99] to ensure that the results are stable, but these improved algorithms do not retain the simplicity and/or the speed of the initial approach.

One critical point of this approach is the choice of $d$. Indeed, depending on the chosen distance measure, the points will be attributed to different groups, and therefore different clusters will be formed. In chapter 2, we will present a specific distance measure that we feel is better adapted to expression array clustering than usual measures such as the Euclidean or L1-norm distances.
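A minimal NumPy sketch of the algorithm described above follows. It uses the Euclidean distance for d and draws the initial centers at random from the data, so, as noted above, different runs may return different clusterings; empty clusters are not handled.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Basic k-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(max_iter):
        # Assign each point to the closest center (Euclidean distance d).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # stop when the groups no longer vary
            break
        centers = new_centers
    return labels, centers
```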
1.2.2 Supervised classification

Supervised classification is a particular category of classification methods where a set of samples $X = (X_i \in \mathcal{X})_{i=1,\dots,n}$ for which the correct labels $Y = (Y_i \in \mathcal{Y})_{i=1,\dots,n}$ are known is used to build the classification function $f$. This set is known as the "training set".

In this thesis, we only focus on the case where the training patterns are represented by finite-dimensional vectors that must be classified into two predefined categories, i.e., we restrict ourselves to the case $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \{-1, +1\}$. This covers, for example, the case when one wants to predict a good ($Y = +1$) or bad ($Y = -1$) prognosis for a tumor from a vector of gene expression data or an arrayCGH profile. We note, however, that various extensions to more general training patterns and categories have been proposed (e.g., [Vap98, SS02]).

Supervised classification methods include linear methods, on which we will focus, but they are not limited to them. One example of another supervised classification method is the k-nearest neighbors algorithm (kNN) (see for example [MM01]), which is among the simplest of all supervised classification algorithms. The user empirically chooses a small positive integer k, and each new sample x is attributed to the class that is the most common amongst its k nearest neighbors. Nearness of the samples is usually decided according to a distance d which is supposed to provide a good partitioning of the space for the considered problem. The results therefore depend not only on the density of the training set but also on the choices of k and d.
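The sketch below is a minimal implementation of this rule, assuming the Euclidean distance for d; ties in the vote are broken arbitrarily.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Predict the label of sample x by majority vote among its k
    nearest training samples, using the Euclidean distance for d."""
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(dists)[:k]   # indices of the k closest training samples
    return Counter(y_train[nearest]).most_common(1)[0][0]
```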
Supervised classification methods also include artificial neural networks (ANNs) [Smi93]. An ANN is an adaptive system composed of interconnected groups of small and simple entities that mimic the behavior of biological neurons. An input layer of neurons takes the sample information and passes it to one or several interconnected hidden layers of neurons, which themselves transmit it to the output layer of neurons, which returns the estimation of the label. In most cases, ANNs change their internal weights based on the information that flows between the different layers during the learning phase. Even if ANNs are theoretically able to output a wide range of classification functions, their use is not straightforward, as they require a very complex tuning (choice of the neurons, choice of the model for the connections, choice of the correct algorithm and of the correct algorithm parameters) and may return an inadequate classification function.

Linear supervised classification

Linear supervised classification methods are a specific class of supervised classification methods that aim at finding a linear classification function, i.e. a function $f : x \mapsto w^\top x + b$ where $w \in \mathbb{R}^p$, $b \in \mathbb{R}$ and $w^\top$ is the transpose of the vector $w$. The vector $w$ can be seen as a vector orthogonal to a hyperplane $P$ that separates the whole space into different subspaces, and the membership of a sample $x$ in one of these subspaces defines its predicted class. In the case of binary classification ($\mathcal{Y} = \{-1, 1\}$), for example, the class will be $\mathrm{sign}(w^\top x + b)$.

We suppose that the variables $(X_i, Y_i)_{i=1,\dots,n}$ are independent and identically distributed samples of an unknown probability law $P$. Let $l : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. It quantifies the loss $l(f(x), y)$ incurred when a predictor predicts the scalar $f(x)$ for the pattern $x$ while the correct class is $y$. The best classifier with respect to $l$ is the one that minimizes the expected loss $R(f) = E_P\, l(f(X), Y)$, also known as the risk of the classifier $f$. Unfortunately the distribution $P$ is not known, so finding the classifier with the smallest risk is not possible in practice. Instead, the empirical risk minimization (ERM) paradigm [Vap98] proposes to find a classifier that minimizes the empirical risk $R_{emp}(f)$, defined as the average loss over the training pairs:

$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(X_i), Y_i) \,.$$

However, as the dimension $p$ of the sample space is usually very large, a training set of cardinality $n$ smaller than $p$ is not big enough to give an appropriate sampling of the whole space; the classification function may therefore be overfitted, which means that it may perform very well on the training set (due to the minimization of $l$) but very poorly on unseen examples. Moreover, if the search space is rich enough, an infinity of classification functions may minimize the average loss on the training set, so we have to define a criterion that helps us choose one of these classifiers. This issue is called the dimensionality curse. The standard solution is to reduce it by incorporating into the problem a constraint that shapes the profile of the classification function and gives a direction to the search.

In the case of binary classification ($\mathcal{Y} = \{-1, 1\}$), the prediction of the label of a new sample only depends on which side of the hyperplane $P$ the point is positioned on. All of the following algorithms and formulas can be extended to the multi-class problem by combining multiple binary classifiers. In the next sections, we use the classical geometrical approach to present the SVM algorithm. This approach will be extended to another formulation of the SVM in the last part of section 1.2.2, the latter formulation being the one used in the following chapters of this thesis.

The Support Vector Machine (SVM) in the separable case

In the case of a linearly separable data set, which means that it is possible to find a hyperplane $P$ such that the positively-labelled and negatively-labelled samples lie on either side of $P$, Vapnik and co-workers proposed to select the classifier with the largest margin $\gamma$, the distance from $P$ to the closest point of the learning set [BGV92b, CV95, Vap98]. This type of linear classification problem is called the hard-margin Support Vector Machine (hard-margin SVM) and defines the maximum margin classifier.

Figure 1.3: The support vector machine finds the hyperplane P that separates the positive examples (circles) from the negative examples (squares) with the maximum margin γ. The samples with label +1 are colored green while the samples with label −1 are colored blue.

The equation of the hyperplane $P$ is $w^\top x + b = 0$. Therefore, the distance from a sample $x$ to $P$ is given by $\frac{|w^\top x + b|}{\|w\|}$. If the sample set is linearly separable and the class $f(x) = \mathrm{sign}(w^\top x + b)$ attributed by the classification function to each sample $x_i$ of the training set is the correct label $y_i$, then for each couple $(x_i, y_i)$ the quantity $y_i(w^\top x_i + b)$ is positive, and the distance from a training sample $x_i$ to the hyperplane $P$ is $\frac{y_i(w^\top x_i + b)}{\|w\|}$, which gives the following formula for the margin $\gamma$:

$$\gamma = \min_{i=1,\dots,n} \frac{y_i (w^\top x_i + b)}{\|w\|} \,. \qquad (1.2)$$

As hyperplanes are defined up to a scaling constant (i.e. the equations $w^\top x + b = 0$ and $\alpha w^\top x + \alpha b = 0$ with $\alpha \in \mathbb{R}$ define the same hyperplane), we can add the following constraint to the definition of the hyperplane $P$:

$$\min_{i=1,\dots,n} y_i (w^\top x_i + b) = 1 \,. \qquad (1.3)$$
Using this constraint in (1.2) gives the simplified value $\gamma = \frac{1}{\|w\|}$ for the margin. The hard-margin SVM looks for the hyperplane with the largest margin, which can be formulated as the following optimization problem:

$$(w^*, b^*) = \operatorname{argmin}_{w \in \mathbb{R}^p,\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2 \qquad (1.4)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i (w^\top x_i + b) \ge 1 \,. \qquad (1.5)$$

As the objective function (the function to minimize) (1.4) is strictly convex and the constraints (1.5) are convex, this minimization problem is a convex problem with a unique solution [BV04b]. To solve this problem, we can use the method of Lagrange multipliers. The Lagrangian of the problem is given by:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i (w^\top x_i + b)\right) \,, \qquad (1.6)$$

where the $\alpha_i$ are called the Lagrange multipliers of the optimization problem and $\alpha$ is the vector of $\mathbb{R}^n$ whose components are the $\alpha_i$. Optimization theorems imply that the minimum of the objective function (1.4) under the constraints (1.5) is given by a saddle point $(w^*, b^*, \alpha^*)$ of the Lagrangian $L$: a minimum of $L$ with regard to $(w, b)$ and a maximum with regard to $\alpha$. The minimization of $L$ with regard to $(w, b)$ implies that the corresponding partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ are set to 0:

$$w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \,, \qquad (1.7)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0 \,. \qquad (1.8)$$

Substituting these formulas into (1.6) gives the dual formulation of the problem:

$$\alpha^* = \operatorname{argmax}_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \qquad (1.9)$$

under the constraints

$$\forall i = 1, \dots, n \quad \alpha_i \ge 0 \,, \qquad \sum_{i=1}^{n} \alpha_i y_i = 0 \,.$$

This problem is a quadratic program which can be solved using different methods such as interior point [Wri87], active set [BR85] or conjugate gradient [Saa96] methods. The Karush-Kuhn-Tucker conditions give the following property of the optimum:

$$\forall i = 1, \dots, n \quad \alpha_i^* \left( y_i (w^{*\top} x_i + b^*) - 1 \right) = 0 \,. \qquad (1.10)$$

Therefore, at the optimum, the only non-null coefficients of the linear combination correspond to learning samples $x_i$ such that $y_i (w^{*\top} x_i + b^*) = 1$. These points are positioned on the margin of the hyperplane $P$ and are the only ones that affect its position. They are called the support vectors of the classifier. Thus, the solution does not depend on the size of the sample space, or even on the number of training examples, but only on the number of these critical examples.

The optimal offset $b^*$ can be obtained from any support vector $x'$ labelled by $y'$, using the facts that $y'(w^{*\top} x' + b^*) - 1 = 0$ and that $y'^2 = 1$:

$$b^* = y' - w^{*\top} x' \,. \qquad (1.11)$$

However, to obtain a more accurate value, the offset is averaged over all the support vectors. At the optimal point, the decision function is given by:

$$f(x) = \mathrm{sign}\left( \sum_{i=1}^{n'} \alpha_i^* y_i'\, x_i'^\top x + b^* \right) \,, \qquad (1.12)$$

where $n'$ is the number of support vectors and $(x_1', \dots, x_{n'}')$ are the support vectors, with labels $(y_1', \dots, y_{n'}')$.
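To make the derivation concrete, here is a small sketch that solves the dual problem (1.9) numerically with the cvxpy modeling library (an illustrative choice of this presentation, not a solver prescribed by the text) and recovers $w^*$ from (1.7) and $b^*$ from (1.11). The tolerance used to detect support vectors and the small ridge added for numerical safety are arbitrary assumptions.

```python
import cvxpy as cp
import numpy as np

def hard_margin_svm(X, y):
    """Solve the dual (1.9) for a linearly separable data set.

    X: (n, p) array of samples; y: (n,) array of labels in {-1, +1}.
    """
    n = len(y)
    G = (X * y[:, None]) @ (X * y[:, None]).T     # G_ij = y_i y_j x_i . x_j
    G += 1e-8 * np.eye(n)                         # tiny ridge for numerical safety
    alpha = cp.Variable(n)
    objective = cp.Maximize(cp.sum(alpha) - 0.5 * cp.quad_form(alpha, G))
    constraints = [alpha >= 0, cp.sum(cp.multiply(y, alpha)) == 0]
    cp.Problem(objective, constraints).solve()
    a = alpha.value
    w = X.T @ (a * y)                             # w* = sum_i alpha_i y_i x_i, eq. (1.7)
    support = a > 1e-6                            # support vectors have alpha_i > 0
    b = np.mean(y[support] - X[support] @ w)      # average of (1.11) over support vectors
    return w, b
```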
SVM in the non-separable case

Unfortunately, in the general case, a linear hyperplane separating the data into the pre-determined classes may not exist. The previous algorithm can then not be applied, and we have to introduce slack variables $\xi_i$ for each training couple $(x_i, y_i)$ in order to relax the constraints:

$$y_i (w^\top x_i + b) + \xi_i \ge 1 \,. \qquad (1.13)$$

The slack variable $\xi_i$ corresponds to the distance between the sample and the subspace it should belong to, and is therefore a measurement of the classification error. Indeed, for all $i = 1, \dots, n$, $\xi_i = 0$ corresponds to a well-classified sample outside the margin, $0 < \xi_i < 1$ corresponds to a well-classified sample inside the margin, and $\xi_i > 1$ to a misclassified sample. Figure 1.4 illustrates this situation.

Figure 1.4: The support vector machine looks for the hyperplane P that separates the positive examples (circles) from the negative examples (squares) with the maximum margin γ. The samples with label +1 are colored green while the samples with label −1 are colored blue; the green rectangles and blue circles are therefore misclassified samples (ξ > 1), while a slack variable ξ between 0 and 1 corresponds to a well-labeled sample inside the margin.

The average of the slack variables (or their sum) corresponds to the amount of error that is tolerated, and should therefore be controlled. This is done by adding this quantity to the objective function of the SVM optimization problem:

$$(w^*, b^*, \xi^*) = \operatorname{argmin}_{w \in \mathbb{R}^p,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1.14)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i (w^\top x_i + b) \ge 1 - \xi_i \,, \qquad \xi_i \ge 0 \,.$$

The constant $C$ offers a trade-off between the number of errors and the regularization term: the bigger $C$ is, the more important it is to minimize $\sum_i \xi_i$, i.e. the error control, with regard to the maximization of the margin, expressed by the term $\|w\|^2$. This formulation of the problem is known as the soft-margin SVM, in opposition to the hard-margin approach presented in the previous section. We can see that by letting $C = \infty$ we retrieve the hard-margin problem. As $w^*$ is a linear combination of the samples, we can also formulate the problem as:

$$(w^*, b^*, \xi^*) = \operatorname{argmin}_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\left\| \sum_{i=1}^{n} \alpha_i x_i \right\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1.15)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i \left( \sum_{j=1}^{n} \alpha_j x_j^\top x_i + b \right) \ge 1 - \xi_i \,, \qquad \xi_i \ge 0 \,.$$

Extension of SVM to non-linear problems

When dealing with a non-linearly separable problem, linear classifiers may not be able to provide a satisfying classification function. A way to solve this problem is to introduce kernels: the SVM may be generalized to the non-linear case by applying a linear SVM to transformed data. A function $K$ is said to be symmetric if:

$$\forall (x, y), \quad K(x, y) = K(y, x) \,. \qquad (1.16)$$

A function $K$ is said to be positive semi-definite if:

$$\forall (x_1, \dots, x_n) \text{ and } \forall c_1, \dots, c_n \in \mathbb{R}, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\, c_i c_j \ge 0 \,. \qquad (1.17)$$

Moore-Aronszajn's theorem [Aro50] states:

Theorem 1. A symmetric positive semi-definite function $K(x, y)$ can be expressed as an inner product, which means that there exists a Hilbert space $\mathcal{H}$ equipped with a dot product $\langle \cdot, \cdot \rangle$ and an embedding $\phi$ from the sample space to $\mathcal{H}$ such that:

$$K(x, y) = \langle \phi(x), \phi(y) \rangle \,. \qquad (1.18)$$

In this new space, the SVM problem becomes:

$$(w^*, b^*, \xi^*) = \operatorname{argmin}_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) + C \sum_{i=1}^{n} \xi_i \qquad (1.19)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i \left( \sum_{j=1}^{n} \alpha_j K(x_j, x_i) + b \right) \ge 1 - \xi_i \,, \qquad \xi_i \ge 0 \,.$$

Therefore, we only need to know the function $K$; neither the non-linear embedding $\phi$, nor the space $\mathcal{H}$, nor the dot product $\langle \cdot, \cdot \rangle$ need to be explicitly known. This property, known as the kernel trick, is often used to extend the scope of SVMs to non-linearly separable problems [CST00, STV04, STC04]. Simple examples of kernels include:

$$K(x, y) = \langle x, y \rangle^d \,, \qquad (1.20)$$

$$K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma}} \,. \qquad (1.21)$$
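In practice, one rarely implements the kernelized optimization by hand. The sketch below uses scikit-learn's SVC on a toy problem where the class is determined by the radius, so that no separating hyperplane exists in the original space; its "rbf" kernel corresponds to the Gaussian kernel (1.21), with the bandwidth parametrized as gamma. The toy data and parameter values are arbitrary assumptions.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Toy problem: the label depends on the radius, so the two classes
# cannot be separated by any hyperplane of the original 2-d space.
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) > 1.0, 1, -1)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
gaussian = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)  # K(x,y) = exp(-gamma ||x-y||^2)
print("linear kernel accuracy:  ", linear.score(X, y))
print("Gaussian kernel accuracy:", gaussian.score(X, y))
```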
Another formulation of the SVM algorithm

The hinge loss function is defined as follows:

$$h : \mathbb{R} \to \mathbb{R}, \quad x \mapsto \max(0, 1 - x) \,. \qquad (1.22)$$

Figure 1.5: A representation of the hinge loss function.

The SVM can then be seen as an algorithm that finds the couple $(w^*, b^*)$ such that:

$$(w^*, b^*) = \operatorname{argmin}_{w, b} \sum_{i=1}^{n} h\left( y_i (w^\top x_i + b) \right) + \lambda \|w\|^2 \,. \qquad (1.23)$$

Indeed, (1.23) is equivalent to the minimization of the following form:

$$(w^*, b^*) = \operatorname{argmin}_{w, b} \lambda \|w\|^2 + \sum_{i=1}^{n} \xi_i$$

under the constraint $\forall i = 1, \dots, n \;\; \xi_i \ge h(y_i (w^\top x_i + b))$. Using the expression of the hinge loss function given in (1.22), this becomes:

$$(w^*, b^*) = \operatorname{argmin}_{w, b} \lambda \|w\|^2 + \sum_{i=1}^{n} \xi_i$$

under the constraints $\forall i = 1, \dots, n \;\; \xi_i \ge 0$ and $\xi_i \ge 1 - y_i (w^\top x_i + b)$, which is equivalent to (1.14) if we set $\lambda = \frac{1}{2C}$. This formulation is the one that will be adopted in the next chapters.

Formulation (1.23) depicts the SVM as a member of the family of algorithms that aim at finding the $w^*$ minimizing the form:

$$w^* = \operatorname{argmin}_{w} \sum_{i=1}^{n} l(y_i, w^\top x_i) + \lambda\, \Omega(w) \,. \qquad (1.24)$$

Besides the classical SVM depicted in the previous sections, this class of techniques includes the L1-SVM (taking the hinge loss for $l$ and the L1-norm for $\Omega$) and the Lasso (taking the squared error for $l$ and the L1-norm for $\Omega$). As we will see briefly in the last section of this chapter, and more extensively in the following chapters, the main goal of this thesis has been to build supervised classification techniques that incorporate a priori knowledge into $\Omega$ in order to address one major issue of machine learning: the curse of dimensionality.
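Formulation (1.23) also suggests a direct implementation by (sub)gradient descent on the regularized empirical risk. The following didactic sketch does exactly that for the hinge loss and the squared L2-norm; swapping in other choices of l and Ω yields other members of family (1.24). The step size and number of epochs are arbitrary assumptions, and this is a teaching sketch rather than a production solver.

```python
import numpy as np

def svm_subgradient(X, y, lam=0.1, lr=1e-3, epochs=500):
    """Minimize sum_i max(0, 1 - y_i (w.x_i + b)) + lam * ||w||^2
    by subgradient descent, i.e. objective (1.23)."""
    n, p = X.shape
    w, b = np.zeros(p), 0.0
    for _ in range(epochs):
        margins = y * (X @ w + b)
        active = margins < 1   # samples with non-zero hinge loss
        grad_w = -(X[active] * y[active, None]).sum(axis=0) + 2 * lam * w
        grad_b = -y[active].sum()
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b
```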
Therefore, it is preferred to take a satisfactory set of features, which may not be optimal but is still good enough for the classification. [GGNZ06] propose an extensive review of these techniques. [GST+ 99] propose to take as a criterion for ranking the features the following formula: µ+ − µ− i δi = i+ (1.25) σi + σi− − where i is the index of the feature, µ+ i and µi the mean of the feature values for all the samples of class +1 and −1 respectively and σi+ and σi− the standard deviation of the feature values for all the samples of class +1 and −1 respectively. The original method proposed by [GST+ 99] is to select an equal number of 18 CHAPTER 1. BACKGROUND features with positive and negative δi coefficients. [FCD+ 00] propose to take the absolute value of the coefficient δi and keep the top ranking features. Other criteria can be used for feature selection, but the most commonly used nowadays are based on the control of the false discovery rate (FDR) [BH95], the expected proportion of falsely rejected hypotheses. [YY06] [CTTC07] [Pav03] all proposed methods that controlled this FDR. However, [QXGY06] pinpointed the unstableness of gene selection techniques. In particular, they showed that some genes are selected much less frequently than other genes with the same pvalue and suggested that correlation between gene expression levels may perturb the testing procedures. A key idea of the binary Relief algorithm proposed by [KR92] is to estimate the quality of features according to how well they distinguish between samples that are near to each other. The associated algorithm remains simple: the algorithm randomly choses a sample, and, for each feature, look up if its value changes between the nearest neighboring sample of the same class, and the nearest neighboring sample of the other class. If the value changes, the value corresponding to the quality of the feature is upgraded, and if it does not, this value is downgraded. The process is repeated m times, m being a number predefined by the user. This algorithm as been updated to ReliefF [Kon94], which is more robust and able to deal with multiclass problem, then with RReliefF [SK97], able to deal with continuous class problems. The different versions of the Relief algorithm present the strong advantage of not being perturbed with correlation between the different features but unfortunately require extensive computation. Constructing features without any a priori knowledge Another way to reduce the search space is to construct features, i.e. build from each data sample x ∈ Rn another vector φ(x) ∈ Rp , with p < n, with the features of φ(x) no being a subset of the features of x. Principal Component Analysis (PCA) is also called Karhunen-Loève transform, Hotelling transform and Proper Orthogonal Method [Shl05]. Widely used in all forms of analysis, from image processing to finance including computational biology, PCA is a technique that extracts from a data set the most meaningful directions of variation. The dataset is then projected on the newly formed basis in order to filter out the noise and reveal hidden structure. The new basis is a linear combination of the sample vectors obtained taking the first few eigenvectors of the sample matrix. In most cases, only the eigenvectors corresponding to the highest eigenvalues are kept, and the projection on this new basis redefines the dataset and reduces its dimension. 
Other criteria can be used for feature selection, but the most commonly used nowadays are based on the control of the false discovery rate (FDR) [BH95], the expected proportion of falsely rejected hypotheses. [YY06], [CTTC07] and [Pav03] all proposed methods that control this FDR. However, [QXGY06] pinpointed the instability of gene selection techniques. In particular, they showed that some genes are selected much less frequently than other genes with the same p-value, and suggested that correlations between gene expression levels may perturb the testing procedures.

A key idea of the binary Relief algorithm proposed by [KR92] is to estimate the quality of features according to how well they distinguish between samples that are near each other. The associated algorithm remains simple: it randomly chooses a sample and, for each feature, looks up whether its value changes between the nearest neighboring sample of the same class and the nearest neighboring sample of the other class. If the value changes, the value corresponding to the quality of the feature is upgraded; if it does not, this value is downgraded. The process is repeated m times, m being a number predefined by the user. This algorithm has been updated to ReliefF [Kon94], which is more robust and able to deal with multiclass problems, and then to RReliefF [SK97], which is able to deal with continuous class problems. The different versions of the Relief algorithm present the strong advantage of not being perturbed by correlations between the different features, but unfortunately require extensive computation.

Constructing features without any a priori knowledge

Another way to reduce the search space is to construct features, i.e. to build from each data sample $x \in \mathbb{R}^n$ another vector $\phi(x) \in \mathbb{R}^p$, with $p < n$, the features of $\phi(x)$ not being a subset of the features of $x$.

Principal Component Analysis (PCA), also called the Karhunen-Loève transform, the Hotelling transform or the Proper Orthogonal Method [Shl05], is widely used in all fields of analysis, from image processing to finance, including computational biology. PCA is a technique that extracts from a data set the most meaningful directions of variation. The data set is then projected on the newly formed basis in order to filter out the noise and reveal hidden structure. The new basis is a linear combination of the sample vectors, obtained by taking the first few eigenvectors of the covariance matrix of the samples. In most cases, only the eigenvectors corresponding to the highest eigenvalues are kept, and the projection on this new basis redefines the data set and reduces its dimension. When the data are not centered to mean 0 and scaled to variance 1, the same principle applied directly to the data matrix corresponds to Singular Value Decomposition (SVD) [Han86]. Using kernel methods, [SSM99] developed an extension of PCA to non-linear spaces.

[HZHS07] propose to extract from each sample the information provided by each pair of highly synergetic variables, which, in the case of the gene expression profiles they consider, are genes. Their method is able to improve the results obtained with usual classification methods. However, we can also use what we know about the data in order to improve the variable selection.

Smoothing of comparative genomic hybridization data

In the case of arrayCGH data, constructing features with a priori knowledge translates into "smoothing" and "segmenting": as two successive spots on the same chromosome are prone to be subject to the same gain or loss, a CGH profile can be seen as a sequence of segments of a certain amplification value and of a certain length. Different approaches have been proposed to achieve this goal.

The most direct way to perform this segmentation is to attribute to each spot the value -1 if it is considered as belonging to a lost region, 0 if it is considered to lie in a normal region and +1 if it is in a gained region. [JFG+04] used thresholding for the attribution of the correct label to each spot. [HST+04] proposed to detect the delimitations, or "breakpoints", of each region using local-likelihood modeling: an iterative algorithm finds around each location the maximal possible likelihood in which the amplification value $\theta$ is constant, considering that the value $x$ collected on each spot equals the sum of the amplification and a noise term: $x = \theta + \epsilon$. The authors then used these regions, and the estimated value of $\theta$ attributed to each one, to estimate which chromosomes suffered from gains or losses in the whole sample set.

[OVLW04] proposed an efficient algorithm called "circular binary segmentation", based on the change-point detection problem. They tested their method on a breast cancer data set where the real amplification values were known and obtained better results than with a classical thresholding method. [HGO+07] suggested that, due to a variety of biological and experimental factors, the aCGH signal may differ from the discrete stepwise function that is often produced by segmentation algorithms. The authors proposed a two-step approach that first deals with outliers and then with the differently valued spots inside each segment. On their own test set, their algorithm finds a profile that is closer to reality than the one found with circular binary segmentation.

[TW07] proposed to consider the problem as a regression issue: for each sample $X$, we need to obtain a profile $Y$ that corresponds to its smoothed version. This can be transcribed as the following problem:

$$Y^* = \operatorname{argmin}_{Y \in \mathbb{R}^p} L(X, Y) \qquad (1.26)$$

under the constraints

$$\sum_{i \sim i+1} |Y_i - Y_{i+1}| \le \lambda \,, \qquad \|Y\|_1 \le \mu \,,$$

where $p$ is the number of spots, $Y_i$ is the $i$-th component of the vector $Y$, $i \sim i+1$ means that $i$ and $i+1$ are successive spots on the same chromosome, $L$ is the square loss $L : (X, Y) \mapsto \|X - Y\|^2$, $\|\cdot\|_1$ is the L1-norm $\|Y\|_1 = \sum_{i=1}^{p} |Y_i|$, and $\lambda$ and $\mu$ are two trade-off constants that adjust the importance of the constraints relative to the value of the loss. The choice of the two constraint terms is motivated by the fact that $Y$ should be smooth, which implies the first term, and that most of the spots should be subject to normal amplification, which means that $Y$ should be sparse and its L1-norm small. This approach is very similar to the one we develop in chapter 3 for the classification of aCGH profiles.
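Problem (1.26) is a convex program and can be prototyped directly with a generic solver. Below is a sketch using the cvxpy library (again an illustrative choice, not the authors' implementation); it treats the input as the profile of a single chromosome, so that all successive entries are neighbouring spots, and the values of λ and µ are arbitrary assumptions.

```python
import cvxpy as cp

def smooth_cgh_profile(x, lam=5.0, mu=50.0):
    """Solve (1.26): find the smooth, sparse profile y closest to x.

    x: measured log-ratio profile of one chromosome, so that
    successive entries are neighbouring spots.
    """
    y = cp.Variable(len(x))
    objective = cp.Minimize(cp.sum_squares(x - y))  # L(X, Y) = ||X - Y||^2
    constraints = [cp.norm1(cp.diff(y)) <= lam,     # total variation: piecewise-constant segments
                   cp.norm1(y) <= mu]               # sparsity: most spots at normal copy number
    cp.Problem(objective, constraints).solve()
    return y.value
```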
The choice of the two constraint terms is motivated by the fact that Y should be smooth, which explains the first term, and that most of the spots should be subject to normal amplification, which means that Y should be sparse and its L1-norm small. This approach is very similar to the one we develop in chapter 3 for the classification of aCGH profiles.

Extraction of modules for gene expression analysis

In the case of gene expression profiles, one category of a priori information that we can introduce into the classification analysis is gene network information, which can be used to perform dimension reduction. In this thesis, we take for the term "gene" the most common definition, given in [AJL+ 02], and call "gene" a specific portion of DNA that codes for a specific protein. "Gene network" is a generic term that designates a knowledge base describing relations between proteins and, by extension, the corresponding genes. These networks are particularly useful to analyse or predict the effects that the perturbation of one protein or gene may have. In this thesis, we will call "pathway" a part of the gene network that acts to ensure a single biological function, while a "module" or a "map" usually regroups several pathways in order to ensure several biological functions, in most cases related to one another.

Many gene networks can be represented as graphs. A graph consists of a set of vertices V and a set of edges E ⊂ V × V that correspond to relations between the vertices. It is called undirected if for every (u, v) ∈ E we also have (v, u) ∈ E, and directed otherwise. In the case of gene networks, the vertices are proteins, or the corresponding genes. Gene networks include metabolic networks [KGK+ 04, KGH+ 06, VDS+ 07], co-expression networks [YMH+ 07], influence networks [YMK+ 06] and protein-protein interaction networks (PPI networks) [MS03, MBS05, RTV+ 05, RVH+ 05, SWL+ 05]. Chapter 4 provides a more extensive description of these different networks.

A family of methodologies for incorporating this gene network knowledge into gene expression profile analysis takes as a preliminary step the extraction of modules, i.e. highly-connected groups of genes that should act as a single entity, from the gene network, and then analyses the profiles as a collection of underexpressed or overexpressed modules. [SZEK07] extract clusters from a metabolic network and then estimate the over- or under-expression of each module using a Haar-wavelet transformation applied to each connected pair of genes. Their method is quite analogous to the ones used in image analysis, in which an image is a grid-like network of colour values, and is found to be more powerful than classical t-test methods that use no network knowledge. [CLL+ 07] use protein-protein interaction networks and define modules, or "subnetworks", as gene sets that induce a single connected component in the network. They propose to take into account only the subnetworks that are considered significant relatively to the classification problem, and then to perform the analysis on the data set. The significance of a subnetwork relatively to the problem is calculated using the mutual information score. Applying their method on two different data sets, they are able to find 149 and 243 significant subnetworks respectively, and show that their network-based classification achieves higher prediction accuracy than classical classification methods.
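As a toy illustration of these notions, the following sketch represents a gene network as an undirected graph with the networkx library and checks whether a gene set induces a single connected component, in the spirit of the subnetworks of [CLL+ 07] (the gene names are hypothetical):

```python
import networkx as nx

# Toy undirected gene network: vertices are genes, edges are relations
G = nx.Graph([("YFG1", "YFG2"), ("YFG2", "YFG3"), ("YFG4", "YFG5")])

# A candidate subnetwork is a module only if it induces one connected component
candidate = ["YFG1", "YFG3"]
print(nx.is_connected(G.subgraph(candidate)))  # False: only linked through YFG2

# The connected components of the whole network are the coarsest modules
print([sorted(c) for c in nx.connected_components(G)])
# [['YFG1', 'YFG2', 'YFG3'], ['YFG4', 'YFG5']]
```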
However, their classification method is sensitive to perturbations of the network. We proposed in [RZD+ 07] a method that uses metabolic networks to smooth the gene expression profiles. This method is developed in chapter 2.

1.3.2 Wrapper methods

Another way to incorporate this a priori information is to modify the optimisation problem exposed in 1.14 in order to directly incorporate the reduction of the search space inside the problem, resulting in a wrapper method, built with or without a priori knowledge.

Without any a priori knowledge

[Tib96] proposed the lasso for regression and variable selection. This method is based on the generic supervised classification framework presented in 1.24 in the context of regression (i.e. the labels yi belong to R), with the loss function l being the squared error and the regularization term Ω being the L1-norm. This method performs regression and feature selection at the same time, as the use of the L1-norm forces the linear model to be sparse. It has been enhanced by [EHJT04] into the faster LARS-lasso algorithm. Another wrapper approach is the "1-norm support vector machine" (L1-SVM) [ZRHT03], which substitutes the L1-norm for the squared L2-norm in the classical SVM algorithm as presented in 1.23. Similarly to the lasso, the L1-norm forces the linear classifier to be sparse, resulting in a classification function involving a reduced number of genes, i.e. it performs classification and feature selection at the same time.

Using a priori knowledge

The adjunction of a kernel function as described in 1.19 provides another framework for the incorporation of a priori information. The supervised classification method described in [RZD+ 07] and in chapter 2 can, for example, also be written as the combination of an SVM and a kernel function corresponding to the filtering of the high-frequency components according to the metabolic network of reference:

$$ w^* = \underset{w}{\operatorname{argmin}} \sum_{i=1}^n L(w^\top x_i, y_i) \quad \text{under the constraint} \quad w^\top \rho(L) w \le \mu \,, \qquad (1.27) $$

with µ a constant trade-off parameter estimated through cross-validation, and ρ(L) a spectral variation of the Laplacian matrix L of the metabolic network (see chapter 2 for a more complete description of the algorithm). [LL07] subsequently proposed to add an L1-constraint to the method described in [RZD+ 07]:

$$ w^* = \underset{w}{\operatorname{argmin}} \sum_{i=1}^n L(w^\top x_i, y_i) \quad \text{under the constraints} \quad \|w\|_1 \le \lambda \,, \quad w^\top L w \le \mu \,, \qquad (1.28) $$

with λ and µ two constant trade-off parameters estimated through cross-validation, and L the weighted Laplacian matrix of the metabolic network. We propose two methods that modify the constraints for supervised classification in chapters 3 and 4 of this thesis.

1.4 Contributions of this thesis

In this section, we present the different contributions that we made to the field during this thesis. These contributions will be developed and further explained in the following chapters, but this section provides a short introduction to them.

1.4.1 Spectral analysis of gene expression profiles

The first technique that we developed integrates a priori gene network knowledge into gene expression data analysis. This approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph. This decomposition leads to the design of a new distance function which can be used for unsupervised classification and principal component analysis.
Spectral decomposition can also be used to apply a filter to the data in order to attenuate components of the expression profiles with respect to the topology of the graph, high-frequency variations corresponding to noise while low-frequency variations correspond to biological phenomena. In particular, we can use a low-pass filter or an exponential filter that reduces the high-frequency variations of the microarrays along the edges of the graph and only keeps smooth variations. Supervised classification techniques can then be applied to the smoothed samples in order to obtain a classifier that lends itself to an easier biological interpretation.

We applied this method to biological data extracted from a study that analyses the effect of low irradiation doses on yeast cells and tries to discriminate between a group of non-irradiated cells and a group of slightly irradiated cells. We used the KEGG metabolic network for the analysis. Even though we were not able to improve supervised classification performance, we were able to provide a better separation of the groups and a classifier that was more easily understandable and from which new pathways of interest for the problem were extracted. This work is presented extensively in chapter 2.

1.4.2 Supervised classification of aCGH data using fused L1 SVM

The second method that we developed is a supervised classification method specific to aCGH data. This approach extends the fused lasso [TSR+ 05], a regression method that uses two regularization terms in order to produce a sparse solution where successive features tend to have the same value. Our method replaces the squared-error (ridge regression) loss function by the hinge loss in order to produce a sparse linear classification function in which successive spots tend to have the same weight. This is appropriate for aCGH data, since two successive spots on the same chromosome are prone to be subject to the same alteration and therefore to have similar weights in the classifier. This method, called fused L1-SVM, has been tested on three different classification problems using two different data sets related to the cancerous disease. Our classification method performed well in every case. Moreover, it was able to produce easily interpretable solutions. This work is presented in chapter 3.

1.4.3 Supervised classification of gene expression profiles using network-fused SVM

The third method that we developed is a supervised classification method for gene expression profiles. This method adds a regularization term to the classical L1-SVM problem in order to constrain the classification function to attribute similar weights to features, i.e. expressions of genes, that are connected in the gene network. This method is appropriate for gene expression profiles, in which the expressions of specific genes are positively correlated, and for which this positive correlation information is stored in specific networks. As this mathematical problem is an extension of the fused L1-SVM, with the chain of dependencies between features replaced by a network, we called this method network-fused L1-SVM.

This method has been tested on two public breast cancer gene expression profile data sets with eight different networks belonging to four different types (metabolism, protein-protein interactions, co-expression and influence). For each data set, we were able to provide classifiers that performed better than classifiers that do not take the network interactions into account.
However, due to the sparse knowledge that we have of gene interactions, most of these classifiers performed worse than the classifier that takes into account all probes, even the ones that are not in the network. We also tried to extract known biological phenomena from these different classifiers and showed that they may be more biologically meaningful than usual classifiers. This work is presented in chapter 4.

Chapter 2

Spectral analysis of gene expression profiles using metabolic networks

This work has already been published in a slightly different form in BMC Bioinformatics, co-authored with Andrei Zinovyev, Marie Dutreix, Emmanuel Barillot and Jean-Philippe Vert [RZD+ 07].

2.1 Background

During the last decade microarrays have become the technology of choice for dissecting the genes responsible for a phenotype. By monitoring the activity of virtually all the genes of a sample in a single experiment, they offer a unique perspective for explaining the global genetic picture of a variant, whether a diseased individual or a sample subject to whatever stressing conditions. However, this strength is also their major weakness, and has led to the "gene list" syndrome. Following careful experimental design and data analysis, the result of a microarray experiment is often summarized as a list of genes that are differentially expressed between two conditions, or that allow samples to be classified according to their phenotypic features. Once this list of genes, typically a few hundred, has been obtained, its meaning still has to be deciphered, but the automated translation of the list into a biological interpretation is often challenging. The interpretation of the results in terms of biological functions and pathways involving several genes is of particular interest. Many databases and tools help verify a posteriori whether genes known to co-operate in some biological process are found in the list of genes selected. For example, Gene Ontology [Con00], Biocarta [bio], GenMAPP [gen] and KEGG [KGK+ 04] all allow a list of genes to be crossed with biological functions and genetic networks, including metabolic, signalling or other regulation pathways. Basic statistical analysis (e.g., [HDS+ 03, BKPT05]) can then determine whether a pathway is over-represented in the list, and whether it is over-activated or under-activated. However, one can argue that introducing information on the pathways at this point in the analysis process sacrifices some statistical power to the simplicity of the approach. For example, a small but coherent difference in the expression of all the genes in a pathway should be more significant than a larger difference occurring in unrelated genes. There is therefore a pressing need for methods integrating a priori pathway knowledge in the gene expression analysis process, and several attempts have been carried out in that direction so far.

Several authors have used a priori known gene networks to derive models and constraints for gene expression. For example, logical discrete formalism [TK01] can be used to analyse all the possible steady states of a biochemical reaction network described by positive and negative influences, and can determine whether the observed gene expression may be explained by a perturbation of the network.
If only the signs of the concentration differences between two steady states are considered, it is possible to solve the corresponding Laplace equation in sign algebra [RLS+ 05], giving qualitative predictions for the signs of the concentration differences measured by microarrays. Other approaches, such as the MetaReg formalism [GVTS04], have also been used to predict possible gene expression patterns from the network structure, although these approaches adhere less to the formal theory of biochemical reaction networks. Unfortunately, methods based on network models are rarely satisfactory, because detailed quantitative knowledge of the complete reaction network parameters is often lacking, or only fragments of the network structure are available. In these cases, more phenomenological approaches need to be used.

Pathway scoring methods try to detect perturbed "modules" or network pathways while ignoring the detailed network topology (for recent reviews see [COVP05, CDF05]). It is assumed that the genes inside a module are co-ordinately expressed, and thus that a perturbation is likely to affect many of them. With available databases containing tens of thousands of reactions and interactions (KEGG [KGK+ 04], TransPath [KPV+ 06], BioCyc [KOMK+ 05], Reactome [JTGV+ 05] and others), the problem is how to integrate the detailed graph of gene interactions (and not just crude characteristics such as the inter/intra-module connectivity) into the core microarray data analysis. Some promising results have been reported with regard to this problem. [VK03] developed a method for correlating interaction graphs and different types of quantitative data, and [RDML04] showed that explicitly taking the pathway distance between pairs of genes into account enhances the statistical scores when identifying activated pathways. The co-clustering of gene expression and gene networks has been reported [HZZL02], and a dimension reduction method, called "network component analysis" [LBY+ 03, GTL06], was proposed to construct linear models of gene regulation based on a priori known network information. The PATIKA project [BDA+ 04] proposed a score to quantify the compatibility of a pathway with given microarray data, and in [SYDM05] a network topology extracted from the literature was used jointly with microarray data to find significantly affected pathway regulators.

In this paper, we investigate a different approach for integrating gene network knowledge early in the gene expression analysis. By "gene network" we mean any graph with genes as vertices, and where edges between genes can represent various biological information. For example, an edge between two genes could represent the fact that their products interact physically (protein-protein interaction network), the presence of a genetic interaction such as a synthetic-lethal or suppressor interaction [KI05], or the fact that these genes code for enzymes that catalyse successive chemical reactions in a pathway (metabolic network, [VK03]). As an illustration we focus on the latter case in this article, although the method proposed is not limited to the metabolic network. Our approach is based on the biological hypothesis that genes close on the network are likely to have similar expression, and consequently that noisy measures of gene expression, such as those obtained by microarrays, can be denoised to some extent by extracting their "low-frequency" component on the gene network. In the case of the metabolic gene network of the yeast S.
cerevisiae considered in this study, this biological hypothesis is motivated by previous observations that genes coding for enzymes involved in a common process are often co-regulated, ensuring the presence of all the necessary proteins [vHGW+ 01, HZZL02, VK03, KVC04, KCF+ 06]. The approach is formally based on the spectral decomposition of the gene expression measurements with respect to the gene network seen as a graph, followed by an attenuation of the high-frequency components of the expression vectors with respect to the topology of the graph. We show how to derive unsupervised clustering and supervised classification algorithms for expression profiles, resulting in classifiers that can be easily interpreted in terms of pathways. We illustrate the relevance of our approach by analysing a gene expression dataset monitoring the transcriptional response of irradiated and non-irradiated yeast colonies [MBM+ 04]. We show that by filtering out 80% of the eigenmodes of the KEGG metabolic network in the gene expression profiles, we obtain an accurate and interpretable discriminative model that may lead to new biological insights.

2.2 Methods

In this section, we explain how a gene expression vector can be decomposed with respect to the eigenfunctions of a gene network, and how to derive unsupervised and supervised classification algorithms from this decomposition. Before describing the technical details of the method, we start with a brief non-technical overview of the approach.

Figure 2.1: Following the idea of Fourier decomposition (above), we can decompose a gene expression profile, here the first non-irradiated microarray sample from our data set, into two parts: the smooth component and the high-frequency component. We can then apply some filtering to attenuate or cancel the effect of the high-frequency component.

2.2.1 Overview of the method

In this section we briefly outline the main features of our approach. We propose a general mathematical formalism to include a priori knowledge of a gene network in the analysis of gene expression data. The method is independent of the nature of the network, although we focus on the gene metabolic network as an illustration in this paper. It is based on the hypothesis that genes close on the network are likely to be co-expressed, and consequently that a biologically relevant signal can be extracted from noisy gene expression measurements by removing the "high-frequency" components of the gene expression vector over the gene network. The extraction of the low-frequency component of a vector is a classical operation in signal processing (see, e.g., figure 2.1) that can be adapted to our problem using discrete Fourier transforms and spectral graph analysis. We show how this idea can be adapted to solve the problem of supervised classification of samples based on their gene expression microarray profiles. This is achieved by optimising a linear classifier such that the weights of genes linked together in the network tend to be similar, i.e., by forcing nearby genes to have similar contributions to the decision function. The resulting classifier can thereafter be easily interpreted by visual inspection of the weights over the gene network, or by subsequent extraction of clusters of genes on the network with similar contributions.

2.2.2 Spectral decomposition of gene expression profiles

We consider a finite set of genes V of cardinality |V| = n.
The available gene network is represented by an undirected graph G = (V, E) without self-loops or multiple edges, in which the set of vertices V is the set of genes and E ⊂ V × V is the list of edges. We will use the notation u ∼ v to indicate that two genes u and v are neighbours in the graph, that is, (u, v) ∈ E. For any gene u, we denote by d_u the degree of u in the graph, that is, its number of neighbours. Gene expression profiling gives a value of expression f(u) for each gene u, and is therefore represented by a function f : V → R. The Laplacian of the graph G is the n × n matrix [Chu97]:

$$ \forall u, v \in V, \quad L(u, v) = \begin{cases} d_u & \text{if } u = v \,, \\ -1 & \text{if } u \sim v \,, \\ 0 & \text{otherwise} \,. \end{cases} \qquad (2.1) $$

The Laplacian is a central concept in spectral graph theory [Moh97] and shares many properties with the Laplace operator on Riemannian manifolds. L is known to be symmetric, positive semidefinite and singular. We denote its eigenvalues by $0 = \lambda_1 \le \ldots \le \lambda_n$ and the corresponding eigenvectors by $e_1, \ldots, e_n$. The multiplicity of 0 as an eigenvalue is equal to the number of connected components of the graph, and the corresponding eigenvectors are constant on each connected component. The eigenbasis of L forms a Fourier basis, and a natural theory for Fourier analysis and spectral decomposition on graphs can thus be derived [Chu97]. Essentially, the eigenvectors with increasing eigenvalues tend to vary more abruptly on the graph, and the smoothest functions (constant on each connected component) are associated with the smallest (zero) eigenvalue (for a good example, see figure 2.2). In particular, the Fourier transform $\hat{f} \in \mathbb{R}^n$ of any expression profile f is defined by:

$$ \hat{f}_i = \sum_{u \in V} e_i(u) f(u) \,, \quad i = 1, \ldots, n \,. $$

The eigenvectors of L form an orthonormal basis, and the expression profile f can therefore be recovered from its Fourier transform $\hat{f}$ by the simple formula:

$$ f = \sum_{i=1}^n \hat{f}_i \, e_i \,. \qquad (2.2) $$

Like its continuous counterpart, the discrete Fourier transform can be used for smoothing or for extracting features. Here, our hypothesis is that analysing a gene expression profile through its Fourier transform with respect to an a priori given gene network is a practical way to decompose the expression profile into biologically interpretable information and to filter out the noise. In the next two sections we illustrate the potential applications of this approach by describing how it leads to a natural definition of a distance between expression profiles, and how this distance can be used for classification or regression purposes.

Figure 2.2: Four examples of Laplacian eigenvectors of the main component of KEGG. The colours correspond to the coefficients of the eigenvectors: positive coefficients are marked in red, negative coefficients in green, and the intensity of the colour reflects the absolute values of the coefficients. On the upper-left side is the eigenvector associated with the smallest eigenvalue, on the upper-right side the one associated with the second smallest eigenvalue, on the lower-left side the one associated with the third smallest eigenvalue, and on the lower-right side the one associated with the largest eigenvalue. The larger the eigenvalue, the less smooth the corresponding eigenvector.
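For concreteness, a minimal numerical sketch of this decomposition on a toy path graph of five genes (our own illustration, not part of the original study), assuming nothing beyond numpy:

```python
import numpy as np

# Path graph on n = 5 genes: 0 - 1 - 2 - 3 - 4
n = 5
A = np.zeros((n, n))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0

L = np.diag(A.sum(axis=1)) - A       # Laplacian of equation (2.1)
lam, E = np.linalg.eigh(L)           # 0 = lam[0] <= ... <= lam[n-1];
                                     # column E[:, i] is the eigenvector e_i

f = np.array([1.0, 1.2, 0.9, -1.1, -0.8])   # an expression profile on V
f_hat = E.T @ f                      # Fourier transform: f_hat_i = sum_u e_i(u) f(u)
assert np.allclose(E @ f_hat, f)     # inverse transform, equation (2.2)
```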
2.2.3 Deriving a metric for expression profiles

The definition of new metrics on expression profiles that incorporate the information encoded in the graph structure is a first possible application of the spectral decomposition. Following the classical methodology of Fourier analysis, we assume that the signal captured in the low-frequency component of the expression profiles contains the most biologically relevant information, particularly the general expression trends, whereas the high-frequency components are more likely to be measurement noise. For example, the low-frequency component of an expression vector on the gene metabolic network should reveal areas of positive and negative expression on the graph that are likely to correspond to the activation or inhibition of specific branches of the graph. We can translate this idea mathematically by considering the following class of transformations for expression profiles:

$$ \forall f \in \mathbb{R}^V, \quad S_\phi(f) = \sum_{i=1}^n \hat{f}_i \, \phi(\lambda_i) \, e_i \,, \qquad (2.3) $$

where φ : R → R is a non-increasing function that quantifies how each frequency is attenuated. For example, if we take φ(λ) = 1 for all λ, we get from (2.2) that the profile does not change, that is, S_φ(f) = f. However, if we take:

$$ \phi_{thres}(\lambda) = \begin{cases} 1 & \text{if } 0 \le \lambda \le \lambda_0 \,, \\ 0 & \text{if } \lambda > \lambda_0 \,, \end{cases} \qquad (2.4) $$

we produce a low-pass filter that removes from f all the frequencies above the threshold λ_0. Finally, a function of the form:

$$ \phi_{exp}(\lambda) = \exp(-\beta\lambda) \,, \qquad (2.5) $$

for some β > 0, keeps all the frequencies but strongly attenuates the high-frequency components. If S_φ(f) includes the biologically relevant part of the expression profile, we can compare two expression profiles f and g through their representations S_φ(f) and S_φ(g). This leads to the following metric between profiles:

$$ d_\phi(f, g)^2 = \| S_\phi(f) - S_\phi(g) \|^2 = \sum_{i=1}^n \left( \hat{f}_i - \hat{g}_i \right)^2 \phi(\lambda_i)^2 \,. $$

We note that this Euclidean metric over expression profiles is associated with the following inner product:

$$ \langle f, g \rangle_\phi = \sum_{i=1}^n \hat{f}_i \, \hat{g}_i \, \phi(\lambda_i)^2 = \sum_{i=1}^n f^\top e_i e_i^\top g \, \phi(\lambda_i)^2 = f^\top K_\phi \, g \,, \qquad (2.6) $$

where $K_\phi = \sum_{i=1}^n \phi(\lambda_i)^2 \, e_i e_i^\top$ is the positive semidefinite matrix obtained by modifying the eigenvalues of L through φ. For example, taking φ(λ) = exp(−βλ) leads to $K_\phi = \exp_M(-2\beta L)$, where $\exp_M$ denotes the matrix exponential. This observation shows that working with filtered expression profiles (2.3) is equivalent to defining a kernel (2.6) over the set of expression profiles, in the context of support vector machines and kernel methods [SS02, STV04]. This possibility is further explored in the next section.
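Continuing the toy sketch above, the two filters (2.4) and (2.5) and the kernel of equation (2.6) can be written as follows (β, λ_0 and the data are arbitrary toy choices of ours):

```python
import numpy as np

# Laplacian eigendecomposition of the path graph, as in the previous sketch
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
lam, E = np.linalg.eigh(np.diag(A.sum(axis=1)) - A)

def smooth(f, phi):
    """S_phi(f) of equation (2.3): attenuate each Fourier coefficient by phi(lambda_i)."""
    return E @ (phi(lam) * (E.T @ f))

phi_thres = lambda l, l0=1.0: (l <= l0).astype(float)   # low-pass filter (2.4)
phi_exp = lambda l, beta=0.5: np.exp(-beta * l)         # exponential filter (2.5)

# Kernel K_phi = sum_i phi(lambda_i)^2 e_i e_i^T of equation (2.6)
K = E @ np.diag(phi_exp(lam) ** 2) @ E.T
f, g = np.random.randn(5), np.random.randn(5)
inner = f @ K @ g                    # the inner product <f, g>_phi
```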
2.2.4 Supervised learning and regression

The construction of predictive models for a property or phenotype of interest from the gene expression profiles of the studied samples is a second possible application of the spectral decomposition of expression profiles on the gene network. Typical applications include predicting cancer diagnosis or prognosis from gene expression data, or discriminating between different treatments applied to micro-organisms. Most approaches presented so far build predictive models from the gene expression data alone, and then check whether the predictive model is biologically relevant by studying, for example, whether genes with high weights are located in similar pathways. However, the genes selected this way often carry no clear biological meaning. Here, we propose a method combining both steps in a single predictive model that is trained by forcing some form of biological relevance.

We use linear predictive models to predict a variable of interest y from an expression profile f. They are obtained by solving the following optimisation problem:

$$ \min_{w \in \mathbb{R}^n} \sum_{i=1}^p l(w^\top f_i, y_i) + C \|w\|^2 \,, \qquad (2.7) $$

where $(f_1, y_1), \ldots, (f_p, y_p)$ is a training set of profiles labelled with the variable y to be predicted, and l is a loss function that measures the cost of predicting $w^\top f_i$ instead of $y_i$. For example, the popular support vector machine [BGV92a, SS02, STV04] is a particular case of equation (2.7) in which y takes values in {−1, +1} and l(u, y) = max(0, 1 − yu) is the hinge loss function; ridge regression is obtained for y ∈ R by taking l(u, y) = (u − y)² [HTF01]. Here, we do not apply algorithms of the form (2.7) directly to the expression profiles f, but to their images S_φ(f). That is, we consider the problem:

$$ \min_{w \in \mathbb{R}^n} \sum_{i=1}^p l\left(w^\top S_\phi(f_i), y_i\right) + C \|w\|^2 \,. \qquad (2.8) $$

We claim that by solving (2.8) we find a linear predictor over the original expression profiles that tends to be smooth on the gene network. Indeed, for any $w \in \mathbb{R}^n$, let $v = K_\phi^{1/2} w$. We first observe that for any $f \in \mathbb{R}^n$:

$$ w^\top S_\phi(f) = w^\top \sum_{i=1}^n \hat{f}_i \, \phi(\lambda_i) \, e_i = f^\top \sum_{i=1}^n e_i \, \phi(\lambda_i) \, e_i^\top w = f^\top K_\phi^{1/2} w = f^\top v \,, $$

showing that the final predictor obtained by minimizing (2.8) is equal to $v^\top f$. Second, we note that:

$$ \|w\|^2 = w^\top w = v^\top K_\phi^{-1} v = \sum_{i=1}^n \frac{\hat{v}_i^2}{\phi(\lambda_i)^2} \,, $$

where the last equality remains valid when $K_\phi$ is not invertible, simply by excluding from the sum the terms i for which $\phi(\lambda_i) = 0$. This shows that (2.8) is equivalent to solving the following problem in the original space:

$$ \min_{v \in \mathbb{R}^n} \sum_{i=1}^p l(v^\top f_i, y_i) + C \sum_{i \,:\, \phi(\lambda_i) > 0} \frac{\hat{v}_i^2}{\phi(\lambda_i)^2} \,. \qquad (2.9) $$

Thus, the resulting algorithm amounts to finding a linear predictor v that minimises the loss function of interest l, regularised by a term that penalises the high-frequency components of v. This differs from the classical regularisation $\|v\|^2$ used in (2.7), which only focuses on the norm of v. As a result, the linear predictor v can be made smoother on the gene network by increasing the parameter C. This allows the prior knowledge to be directly included, because genes in similar pathways would be expected to contribute similarly to the predictive model. There are two consequences of this procedure. Firstly, if the true predictor really is smooth on the graph, the formulation (2.9) can help the algorithm focus on plausible models even with very little training data, resulting in a better estimation. As a result, we can expect better predictive performance. Secondly, by forcing the predictive model v to be smooth on the graph, biological interpretation of the model should be easier: one can inspect the areas of the graph in which the predictor is strongly positive or negative. Thus the model should be easier to interpret than models resulting from the direct optimisation of equation (2.7).
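A minimal sketch of problem (2.8) using a standard linear SVM on the filtered profiles (toy data; note that scikit-learn parametrises the trade-off slightly differently, with C weighting the loss rather than the penalty):

```python
import numpy as np
from sklearn.svm import SVC

# Laplacian eigendecomposition of the toy path graph, as in the previous sketches
A = np.zeros((5, 5))
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4)]:
    A[u, v] = A[v, u] = 1.0
lam, E = np.linalg.eigh(np.diag(A.sum(axis=1)) - A)
phi = (lam <= 1.0).astype(float)          # low-pass filter phi_thres

rng = np.random.default_rng(0)
X = rng.normal(size=(17, 5))              # toy expression profiles
y = np.array([1] * 6 + [-1] * 11)         # toy labels

X_smooth = X @ E @ np.diag(phi) @ E.T     # S_phi(f_i) for every sample
clf = SVC(kernel="linear", C=1.0).fit(X_smooth, y)   # solves problem (2.8)
w = clf.coef_.ravel()
v = E @ np.diag(phi) @ E.T @ w            # v = K_phi^{1/2} w, the predictor of (2.9)
```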
2.3 Data

We collected the expression data from a study analysing the effect of low irradiation doses on Saccharomyces cerevisiae strains [MBM+ 04]. The first group of extracted expression profiles was a set of twelve independent yeast cultures grown without radiation (not irradiated, NI). From this group, we excluded an outlier that the author of the article indicated to us. The second group was a set of six independent irradiated (I) cultures exposed to a dose of 15-20 mGy/h for 20h. This dose of irradiation produces no mutagenic effects, but induces transcriptional changes. We used the same normalization method as in the first study of this data (Splus LOWESS function, see [MBM+ 04] for details), then we attempted (1) to separate the NI samples from the I ones, and (2) to understand the difference between the two populations in terms of metabolic pathways.

The gene network model used to analyse the gene expression data was therefore built from the KEGG database of metabolic pathways [KGK+ 04]. The metabolic gene network is a graph in which the enzymes are vertices, and an edge between two enzymes indicates that the product of a reaction catalysed by the first enzyme is the substrate of a reaction catalysed by the second enzyme. We reconstructed this network from the KGML v0.3 version of KEGG, resulting in 4694 edges between 737 genes. We kept only the largest connected component (containing 713 genes) for further spectral analysis.

2.4 Results

2.4.1 Unsupervised classification

First, we tested the general effect of modifying the distances between expression profiles using the KEGG metabolic pathways as background information in an unsupervised setting. We calculated the pairwise distances between all 17 expression profiles after applying the transformations defined by the filters (2.4) and (2.5), over a wide range of parameters. We assessed whether the resulting distances were more coherent with a biological interpretation by calculating the ratio of intraclass distances over all pairwise distances, defined by:

$$ r = \frac{\sum_{u_1, v_1 \in V_1} d(u_1, v_1)^2 + \sum_{u_2, v_2 \in V_2} d(u_2, v_2)^2}{\sum_{u, v \in V} d(u, v)^2} \,, $$

where $V_1$ and $V_2$ are the two classes of points. We compared the results with those obtained by replacing KEGG with a random network, produced by keeping the same graph structure but randomly permuting the vertices, in order to assess the significance of the results. We generated 100 such networks to obtain an average result with a standard deviation. Figure 2.3 shows the result for the function φ_exp(λ) = exp(−βλ) with varying β (left), and for the function φ_thres(λ) = 1(λ < λ_0) with varying λ_0 (right).

Figure 2.3: Performance of the unsupervised classification after changing the metric with the function φ(λ) = exp(−βλ) for different values of β (left), or with the function φ(λ) = 1(λ < λ_0) with varying λ_0, that is, by keeping a variable number of smallest eigenvalues (right). The red curve is obtained with the KEGG network. The black curves show the result (mean and one standard deviation interval) obtained with a random network.

Figure 2.4: PCA plots of the initial expression profiles (a) and the transformed profiles using network topology (80% of the eigenvalues removed) (b). The green squares are non-irradiated samples and the red rhombuses are irradiated samples. Individual sample labels are shown together with GO and KEGG annotations associated with each principal component.

We observe that, apart from very small values of β, the change of metric with the φ_exp function performs worse than that of a random network. The second method (filtering out the high-frequency components of the gene expression vector), in which up to 80% of the eigenvectors are removed, performs significantly better than a random network. When only the top 3% of the smoothest eigenvectors are kept, the performance is similar to that of a random network, and when only the top 1% are kept, the performance is significantly worse. This explains the disappointing results obtained with the φ_exp function: by giving exponentially more weight to the small eigenvalues, the method focuses on those first few eigenvectors which, as shown by the second method, do not provide a geometry compatible with the separation of the samples into two classes.
From the second plot, we can infer that in this case at least 20% of the KEGG eigenvectors should be given sufficient weight to obtain a geometry compatible with the classification of the data.

2.4.2 PCA analysis

To further investigate the effect of filtering out the high frequencies of the expression profiles on their relative positions, we carried out a principal component analysis (PCA, [Jol96]) on the original expression vectors f and compared it with a PCA of the transformed vectors S_φ(f) obtained with the function φ_thres. Analysis of the initial sample distribution (figure 2.4) shows that the first principal component can partially separate irradiated from non-irradiated samples, with the exception of the two irradiated samples "I1" and "I2", which have larger projections onto the third principal component than onto the first one. The experimental protocol revealed that these two samples were affected by higher doses of radiation than the four other samples. Gene Ontology analysis of the genes that contribute most to the first principal component shows that the "pyruvate metabolism", "glucose metabolism", "carbohydrate metabolism" and "ergosterol biosynthesis" ontologies (here we list only independent ontologies) are over-represented (with p-values less than 10^-10). The second component is associated with the "trehalose biosynthesis" and "carboxylic acid metabolism" ontologies, and the third principal component is associated with the KEGG glycolysis pathway. The first three principal components collect 25%, 17% and 11% of the total dispersion.

The transformation (2.3) resulting from a step-like attenuation of eigenvalues φ_thres removing 80% of the largest eigenvalues significantly changes the global layout of the data (figure 2.4, right) but generally preserves the local neighbourhood relationships. The first three principal components collect 28%, 20% and 12% of the total dispersion, which is only slightly higher than in the PCA plot of the initial profiles. The general tendency is that the non-irradiated normal samples are more closely grouped, which explains the lower intraclass distance values shown in figure 2.3. The principal components can in this case be associated with gene ontologies with higher confidence (for the first component, the p-values are less than 10^-25). This is a direct consequence of the fact that the principal components are constrained to belong to a subspace of smooth functions on KEGG, giving coherence in terms of pathways to the genes contributing to the components. The first component gives "DNA-directed RNA polymerase activity", "RNA polymerase complex" and "protein kinase activity"; figure 2.6 shows that these are the most connected clusters of the whole KEGG network. The second component is associated with the "purine ribonucleotide metabolism", "RNA polymerase complex", "carboxylic acid metabolism" and "acetyl-CoA metabolism" ontologies and also with the "Glycolysis/Gluconeogenesis", "Citrate cycle (TCA cycle)" and "Reductive carboxylate cycle (CO2 fixation)" KEGG pathways. The third component is associated with the "prenyltransferase activity", "lyase activity" and "aspartate family amino acid metabolism" ontologies and with the "N-Glycan biosynthesis", "Glycerophospholipid metabolism", "Alanine and aspartate metabolism" and "Riboflavin metabolism" KEGG pathways. Thus, the PCA components of the transformed expression profiles are affected both by the network features and by the microarray data.
2.4.3 Supervised classification

We tested the performance of supervised classification after modifying the distances, with a support vector machine (SVM) trained to discriminate irradiated samples from non-irradiated samples. For each change of metric, we estimated the performance of the SVM from the total number of misclassifications and the total hinge loss using a "leave-one-out" (LOO) approach. This approach removes each sample in turn, trains a classifier on the remaining samples and then tests the resulting classifier on the removed sample. For each fold, the regularisation parameter was selected from the training set only, by minimizing the classification error estimated with an internal LOO experiment. The calculations were carried out using the svmpath package in the R computing environment.

Figure 2.5: Performance of the supervised classification when changing the metric with the function φ_exp(λ) = exp(−βλ) for different values of β (left picture), or with the function φ_thres(λ) = 1(λ < λ_0) for different values of λ_0 (i.e., keeping only a fraction of the smallest eigenvalues, right picture). The performance is estimated from the number of misclassifications in a leave-one-out error.

Figure 2.5 shows the classification results for the two high-frequency attenuation functions φ_exp and φ_thres with varying parameters. The baseline LOO error is 2 misclassifications for the SVM in the original Euclidean space. For the exponential variant (φ_exp(λ) = exp(−βλ)), we observe an irregular but certain degradation of performance for positive β, both for the hinge loss and for the misclassification number. This is consistent with the result shown in figure 2.3, in which the change of metric towards the first few eigenvectors does not give a geometry coherent with the classification of samples into irradiated and non-irradiated, resulting in a poorer performance in supervised classification as well. For the second variant, in which the expression profiles are projected onto the eigenvectors of the graph with the smallest eigenvalues, we observe that the performance remains as accurate as the baseline until up to 80% of the eigenvectors are discarded, with the hinge loss even exhibiting a slight minimum in this region. This is consistent with the classes being more clustered in this case than in the original Euclidean space. Overall these results show that classification accuracy can be kept high even when the classifier is constrained to exhibit a certain coherence with the graph structure.
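A sketch of this nested leave-one-out procedure (we use Python and scikit-learn for illustration, whereas the study itself relied on the svmpath package in R; the data are toy stand-ins for the filtered profiles):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, LeaveOneOut, cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_smooth = rng.normal(size=(17, 5))       # filtered profiles (toy stand-in)
y = np.array([1] * 6 + [-1] * 11)

# Inner LOO selects the regularisation parameter on the training set only;
# outer LOO estimates the misclassification rate of the selected model.
inner = GridSearchCV(
    SVC(kernel="linear"),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=LeaveOneOut(),
)
scores = cross_val_score(inner, X_smooth, y, cv=LeaveOneOut())
print(1.0 - scores.mean())                # leave-one-out error estimate
```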
Figure 2.6: Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of the high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules with coordinated responses, the activation of the low-frequency eigenmodes being determined by the microarray data. Positive coefficients are marked in red, negative coefficients in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some other parts of the network are annotated, including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase sub-units.

2.4.4 Interpretation of the SVM classifier

Figure 2.6 shows the global connection map of KEGG generated from the connection matrix by the Cytoscape software [SMO+ 03]. The coefficients of the decision function v of equation (2.9), for the classifier constructed either in the original Euclidean space or after filtering out the top 80% spectral components of the expression profiles, are shown in colour. We used a colour scale from green (negative weights) to red (positive weights) to provide an easy visualisation of the classifier's main features. Both classifiers give the same classification error, but the classifier constructed using the network structure can be more naturally interpreted, as the classifier variables are grouped according to their participation in the network modules. Although from a biological point of view very little can be learned from the classifier obtained in the original Euclidean space (figure 2.6, left), it is indeed possible to distinguish several features of interest in the classifier obtained in the second case (figure 2.6, right). First, oxidative phosphorylation is found among the pathways with the most positive weights, which is consistent with previous analyses showing that this pathway tends to be up-regulated after irradiation [MBM+ 04]. An important cluster involving the DNA and RNA polymerases is also found to bear weights slightly above average in these experiments. Several studies have previously reported the induction of genes involved in replication and repair after high doses of irradiation [MDM+ 01], but the detection of such an induction at the low irradiation doses used in the present biological experiments is rather interesting. The strongly negative landscape of weights in the protein kinases cluster has not been observed before and may open a new area of biological study. Most of the kinases are involved in signalling pathways, and therefore their low expression levels may have important biological consequences.

Figure 2.6 highlights the pathway named "Glycolysis/Gluconeogenesis" in KEGG; a more detailed view of this pathway is shown in figure 2.7. This pathway contains enzymes that are also used in many other KEGG pathways and is therefore situated in the middle and most entangled part of the global network. As already mentioned, this pathway is associated with the first and third principal components of the initial dataset. The pathway actually contains two alternative sub-pathways that are affected differentially. Over-expression of the gluconeogenesis pathway seems to be characteristic of irradiated samples, whereas glycolysis has a low level of expression in that case. This shift can be observed when changing from anaerobic to aerobic growth conditions (the so-called diauxic shift). The reconstruction of this shift from our data, with no prior input of this knowledge, strongly confirms the relevance of our analysis method. It also shows that analysing expression in terms of the global up- or down-regulation of entire pathways as defined, for example, by KEGG could be misleading, as many antagonistic processes take place within pathways. By representing KEGG as a large network instead of a set of pathways, our approach helps maintain the biochemical relationships between genes beyond the constraints of pathway boundaries. Once a classifier has been built using a priori knowledge of the network, the interpretation of the results (which genes contribute the most to the classification) can be performed through visualisation of known biochemical pathways, or through the extraction of gene clusters with similar contributions to the classifier.
Figure 2.7: The glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The two pathways are mutually exclusive in a cell, as clearly highlighted by our algorithm.

Importantly, these gene clusters result from a combined analysis of the gene network and the gene expression data, and not from a prior analysis of the gene network alone. Figure 2.8 shows the weights of the two classifiers on the genes involved in pyrimidine metabolism, another pathway of interest.

2.5 Discussion

Our algorithm constructs a classifier in which the predictor variables are grouped according to their neighbourhood relations in the network. We assume that genes close on the network are likely to contribute similarly to the prediction function. Our working hypothesis is that genes close on the network should have similar expression profiles. This hypothesis was validated in several studies demonstrating that co-expressed genes tend to have similar biological functions and vice versa (e.g., [SSKK03]). Our mathematical framework, based on spectral decomposition, helps to systematically exploit this experimental fact and include it in the data analysis. Nevertheless, one must understand that this tendency is only a trend, valid on average on a large scale. It is of course possible to find many local exceptions to it, for example when a signalling pathway or a metabolic cascade is influenced by the over- or under-expression of only one regulator, without systematically affecting the expression of the rest of the pathway participants. Thus, our technique is rather coarse-grained: it does not allow us to infer a precise network logic, but rather detects the average excitation of relatively big network modules.

In our example we use a metabolic network as gene network. Our hypothesis here is based on the fact that, for a smooth synthesis flow, all enzymes required for a metabolic cascade should be present in sufficient quantities, i.e., stably expressed. Conversely, various sensor and feedback mechanisms ensure that for inactive metabolic cascades the expression of the corresponding enzymes remains low. If this is true on average, then our technique will help to highlight active and inactive parts of the network. Several previous studies have highlighted the significant correlation that exists between gene expression and distance over the metabolic network, thus justifying our attempt [vHGW+ 01, HZZL02, VK03, KVC04, KCF+ 06]. For other network types, like transcriptional regulatory or signalling networks, more elaborate measures of "smoothness" are certainly needed to take into account the signs and directions of individual gene interactions.

Our working hypothesis motivates the filtering of the gene expression profiles in order to remove the noisy high-frequency modes of the network. Therefore, the variation of the weights of the classifier along the graph is of low frequency and should allow a grouping of variables, which is a very useful feature of the resulting classification function, as the function becomes meaningful for interpreting and suggesting the biological factors that cause the class separation. It allows classifications based on functions, pathways and network modules rather than on individual genes.
Figure 2.8: The pyrimidine metabolism pathways with the weights of the separator obtained with a Euclidean linear SVM (up) and with our modified algorithm (down).

Classification based on pathways and network modules should lead to a more robust behaviour of the classifier in independent tests, with equal if not better classification results. Our results on the dataset we analysed show only a slight improvement, although this may be due to its limited size. The two samples with different experimental settings are systematically misclassified by both the initial and our smoothed classifier, which means that they probably are members of a "third" class which should be treated differently. The introduction of network topology cannot resolve this issue, but it can help to understand which part of the network differentiates the outliers from the other members of the same class.

Interestingly, the constraint we impose on the smoothness of the classifier can also be justified mathematically in the context of regularisation for statistical estimation. Classification of microarray data is an extremely challenging problem because it usually involves a small number of samples in large dimension. Most statistical procedures developed in this context involve some form of complexity reduction, obtained by imposing constraints on the classifier. For example, perhaps the most widely-used complexity reduction method for microarray data is to impose that the classifier has only a small number of non-zero weights, which in practice amounts to selecting a small number of genes. Mathematically speaking, this means constraining the L0-norm of the classifier to be small (the L0-norm of a vector being its number of non-zero components). Alternatively, methods like the SVM constrain the L2-norm of the classifier to be small. Our method can therefore be seen as simply constraining a different norm of the classifier, for the purpose of regularisation. Of course, the choice of regularisation should be related to the problem at hand: it corresponds to our prior belief of what the optimal classifier, the one that would be discovered if enough samples were available, looks like. Performing feature selection implicitly corresponds to the assumption that the optimal classifier relies on a small number of genes, which can be a reasonable assumption in some cases. Our focus on the smoothness of the classifier on the gene network corresponds to a different implicit assumption, namely that the optimal classifier is likely to be smooth on the network. This is justified in many cases, because the classes of samples to be predicted generally correspond to differences in the regulation of one or several pathways. If this turns out not to be the case, reducing the effect of regularisation by decreasing the parameter C in (2.9) allows a non-smooth classifier to be learned as well.

An important remark to bear in mind when interpreting pictures such as figures 2.6 and 2.8 is that the colours represent the weights of the classifier, and not gene expression levels. There is of course a relationship between the classifier weights and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has this property.
This means in particular that the pictures provide virtually no information regarding the over- or under-expression of individual genes, which is the price to pay to obtain instead an interpretation in terms of more global pathways. Constraining the classifier to rely on just a few genes would have a similar effect of reducing the complexity of the problem, but would lead to a more difficult interpretation in terms of pathways.

An important advantage of our approach over other pathway-based clustering methods is that we consider the network modules that naturally appear from the spectral analysis, rather than a historically defined separation of the network into pathways. Thus, pathway cross-talk is taken into account, which is difficult to do with other approaches. It can however be noticed that the implicit decomposition into pathways that we obtain is biased by the very incomplete knowledge of the network: certain regions of the network are better understood than others, leading to a higher concentration of connections there. Another important feature of this approach is that we make no strong assumption on the nature of the graph, so the method can in principle be applied to a variety of other graphs, such as protein-protein interaction networks or co-expression networks. We leave this avenue open for future research.

On the other hand, like most approaches aiming at comparing expression data with gene networks such as KEGG, the scope of this work is limited by two important constraints. First, the gene network we use is only a convenient but rough approximation of complex biochemical processes; second, the transcriptional analysis of a sample cannot give any information regarding post-transcriptional regulation and modifications. Nevertheless, we believe that our basic assumption remains valid, namely that the expression of the genes belonging to the same metabolic pathway module is coordinately regulated. Our interpretation of the results supports this assumption.

Another important caveat is that we simplify the network description to an undirected graph of interactions. Although this would seem to be relevant for simplifying the description of, e.g., protein-protein interaction networks, in reality metabolic networks have a more complex nature. Similarly, gene regulation networks are influenced by the direction, sign and strength of the interactions. Although the incorporation of weights into the Laplacian (equation 2.1) is straightforward and allows the extension of the approach to weighted undirected graphs, the choice of the weights remains delicate, since the importance of an interaction may be difficult to quantify. Conversely, the directions and signs that accompany signalling or regulatory pathways are generally known, but their incorporation requires more work. It could nevertheless lead to important advances for the interpretation of microarray data in cancer studies, for example.

Conclusions

We have presented a general framework to analyse gene expression data when a gene network is known a priori. The approach involves the attenuation of the high-frequency content of the gene expression vectors with respect to the graph. We derived algorithms for unsupervised clustering and supervised classification, which enforce some level of smoothness of the classifier on the gene network.
This enforcement can be considered as a means of reducing the high dimension of the variable space using the available knowledge about the gene network. No prior decomposition of the gene network into modules or pathways is needed, and the method can in principle work with a variety of gene networks.

Acknowledgments

This work was supported by the grant ACI-IMPBIO-2004-47 of the French Ministry for Research and New Technologies and by the EC contract ESBIC-D (LSHG-CT-2005-518192). We thank Sabrina Carpentier and Severine Lair from the Service de Bioinformatique of the Institut Curie for the help they provided with the normalization of the microarray data.

Chapter 3

Fused SVM for arrayCGH classification

This work has already been accepted in a slightly different form at the International Conference on Intelligent Systems for Molecular Biology 2008 under the title "Classification of arrayCGH using a fused SVM", co-authored with Emmanuel Barillot and Jean-Philippe Vert.

3.1 Introduction

Genome integrity is essential to cell life and is ensured in normal cells by a series of checkpoints, which enable DNA repair or trigger cell death to prevent cells with abnormal genomes from appearing. The p53 protein is probably the most prominent protein known to play this role. When these checkpoints are bypassed, the genome may evolve and undergo alterations to a point where the cell can become premalignant, and further genome alterations lead to invasive cancers. This genome instability has been shown to be an enabling characteristic of cancer [HW00], and almost all cancers are associated with genome alterations. These alterations may be single mutations, translocations, or copy number variations (CNVs). A CNV can be a deletion or a gain of small or large DNA regions, an amplification, or an aneuploidy (change in chromosome number). Many cancers present recurrent CNVs of the genome, like for example monoploidy of chromosome 3 in uveal melanoma [SPdM+ 94], loss of chromosome 9 and amplification of the region of cyclin D1 (11q13) in bladder carcinomas [BBR+ 05], loss of 1p and gain of 17q in neuroblastoma [BLC+ 01, VRVB+ 02], EGFR amplification and deletions in 1p and 19q in gliomas [IML+ 07], or amplifications of 1q, 8q24, 11q13, 17q21-q23 and 20q13 in breast cancer [YWF+ 06]. Moreover, associations of specific alterations with clinical outcome have been described in many pathologies [LNM+ 97].

Recently, array-based comparative genomic hybridization (arrayCGH) has been developed as a technique allowing the rapid mapping of the CNVs of a tumor sample at a genomic scale [PSS+ 98]. The technique was first based on arrays using a few thousand large-insert clones (such as BACs, with a Mb-range resolution) to interrogate the genome, and was then improved with oligonucleotide-based arrays consisting of several hundred thousand features, taking the resolution down to a few kb [Ger05]. Many projects have since been launched to systematically detect genomic aberrations in cancer cells [vBN06, CWT+ 06, SMR+ 03]. The etiology of cancer and the advent of arrayCGH make it natural to envisage building classifiers for prognosis or diagnosis based on the genomic profiles of tumors. Building classifiers based on expression profiles is an active field of research, but little attention has been paid yet to genome-based classification.
[CWT+06] select a small subset of genes and apply a k-nearest-neighbor classifier to discriminate between estrogen-positive and estrogen-negative patients, between high-grade and low-grade patients, and between bad prognosis and good prognosis for breast cancer. [JFG+04] reduce the DNA copy number estimates to "gains" and "losses" at the chromosomal arm resolution before using a nearest centroid method for classifying breast tumors according to their grade. As underlined in [CWT+06], the classification accuracy reported in [JFG+04] is better than the one reported in [CWT+06], but the error still remains at a fairly high level, with as many as 24% of misclassified samples in the balanced problem. This may be related to the higher resolution of the arrays produced by [JFG+04]. Moreover, the approach used by [JFG+04] produces a classifier that is difficult to interpret, as it is unable to detect any deletion or amplification that occurs at the local level. [OBS+03] used a support vector machine (SVM) classifier using as variables all BAC ratios without any missing values. They were able to identify key CNAs.

The methods developed so far either ignore the particularities of arrayCGH and the inherent correlation structure of the data [OBS+03], or drastically reduce the complexity of the data at the risk of filtering out useful information [JFG+04, CWT+06]. In all cases, a reduction of the complexity of the data, or a control of the complexity of the estimated predictor, is needed to overcome the risk of overfitting the training data, given that the number of probes that form the profile is often several orders of magnitude larger than the number of samples available to train the classifier.

In this chapter we propose a new method for supervised classification, specifically designed for the processing of arrayCGH profiles. In order not to miss potentially relevant information that may be lost if the profiles are first processed and reduced to a small number of homogeneous regions, we estimate a linear classifier directly at the level of individual probes. Yet, in order to control the risk of overfitting, we define a prior on the linear classifier to be estimated. This prior encodes the hypotheses that (i) many regions of the genome should not contribute to the classification rule (sparsity of the classifier), and (ii) probes that contribute to the classifier should be grouped in regions on the chromosomes and be given the same weight within a region. This a priori information helps reduce the search space and produces a classification rule that is easier to interpret. The technique can be seen as an extension of the SVM in which the complexity of the classifier is controlled by a penalty function similar to the one used in the fused lasso method to enforce sparsity and similarity between successive features [TSR+05]. We therefore call the method a fused SVM. It produces a linear classifier that is piecewise constant on the chromosomes and only involves a small number of loci, without any a priori regularisation of the data. From a biological point of view, it avoids the prior choice of recurrent regions of alterations, but produces a posteriori a selection of discriminant regions which are then amenable to further investigation.
We test the fused SVM on several public datasets involving diagnosis and prognosis applications in bladder and uveal cancer, and compare it with a more classical method involving feature selection without prior information about the organization of probes on the genome. In a cross-validation setting, we show that the classification rules obtained with the fused SVM are systematically more accurate than the rules obtained with the classical method, and that they are also more easily interpretable.

3.2 Methods

In this section we present an algorithm for the supervised classification of arrayCGH data. This algorithm, which we call fused SVM, is motivated by the linear ordering of the features along the genome and the high dependency between the behaviours of neighbouring features. The algorithm itself estimates a linear predictor by borrowing ideas from recent methods in regression, in particular the fused lasso [TSR+05]. We start with a brief description of the arrayCGH technology and data, before presenting the fused SVM in the context of regularized linear classification algorithms.

3.2.1 ArrayCGH data

ArrayCGH is a microarray-based technology that allows the quantification of the DNA copy number of a sample at many positions along the genome in a single experiment. The array contains thousands to millions of spots, each of them consisting of the amplified or synthesized DNA of a particular region of the genome. The array is hybridized with the DNA extracted from a sample of interest, and in most cases with a (healthy) reference DNA. Both samples have first been labelled with two different fluorochromes, and the ratio of fluorescence of the two fluorochromes is expected to reveal the ratio of DNA copy number at each position of the genome. The log-ratio profiles can then be used to detect regions with abnormalities (log-ratio significantly different from 0), corresponding to gains (log-ratio significantly above 0) or losses (log-ratio significantly below 0). The typical density of arrayCGH ranges from 2400 BAC features in the pioneering efforts, corresponding to one approximately 100 kb probe every Mb [PSS+98], up to millions today, corresponding to one 25 to 70 bp oligonucleotide probe every few kb, or even tiling arrays [Ger05].

There are two principal ways to represent arrayCGH data: as a collection of log-ratios, or as a collection of statuses (lost, normal or gained, usually represented as -1, 0 and 1, corresponding to the sign of the log-ratio). The status representation has strong advantages over the log-ratio, as it reduces the complexity of the data, provides the scientist with a direct identification of abnormalities, and allows the straightforward detection of recurrent alterations. However, converting ratios into statuses is not always obvious and often implies a loss of information which can be detrimental to the study: for several reasons, such as heterogeneity of the sample or contamination with healthy tissue (which both result in cells with different copy numbers in the sample), the status may be difficult to infer from the data, whereas the use of the ratio values avoids this problem. Another problem is the limited subtlety of statuses.
In particular, if we want to use arrayCGH for discriminating between two subtypes of tumors, or between tumors with different future evolutions, all tumors may share the same important genomic alterations, easily captured by the status assignment, while the differences between the types of tumors may be characterized by more subtle signals that would disappear should we transform the log-ratio values into statuses. Therefore, we consider below an arrayCGH profile as a vector of log-ratios for all probes in the array.

3.2.2 Classification of arrayCGH data

While much effort has been devoted to the analysis of single arrayCGH profiles, or of populations of arrayCGH profiles in order to detect genomic alterations shared by the samples in the population, we focus on the supervised classification of arrayCGH. The typical problem we want to solve is, given two populations of arrayCGH data corresponding to two populations of samples, to design a classifier able to predict which population any new sample belongs to. This paradigm applies to diagnosis or prognosis applications, where the populations are respectively samples of different tumor types, or samples with different evolutions. Although we only focus here on binary classification, the techniques can easily be extended to problems involving more than two classes using, for example, a series of binary classifiers trained to discriminate each class against all others.

While accuracy is certainly the first quality we want a classifier to have in real diagnosis and prognosis applications, it is also important to be able to interpret it and understand what the classification is based on. Therefore we focus on linear classifiers, which associate a weight to each probe and produce a rule based on a linear combination of the probe log-ratios. The weight of a probe roughly corresponds to its contribution to the final classification rule, and therefore provides evidence about its importance as a marker to discriminate the populations. It should be pointed out, however, that when correlated features are present, the weight of a feature is not directly related to the individual correlation of the feature with the classification, hence some care should be taken in the interpretation of linear classifiers.

In most applications of arrayCGH classification, it can be expected that only a limited number of regions of the genome should contribute to the classification, because most parts of the genome may not differ between the populations. Moreover, the notion of discriminative regions suggests that a good classifier should detect these regions, and typically be piecewise constant over them. We show below how to introduce these prior hypotheses into the linear classification algorithm.

3.2.3 Linear supervised classification

Let us denote by p the number of probes hybridized on the arrayCGH. The result of an arrayCGH competitive hybridization is then a vector of p log-ratios, which we represent by a vector x in the vector space X = R^p of possible arrayCGH profiles. We assume that the samples to be hybridized can belong to two classes, which we represent by the labels −1 and +1. The classes typically correspond to the disease status or the prognosis of the samples. The aim of binary classification is to find a decision function that can predict the class y ∈ {−1, +1} of a data sample x ∈ X.
Supervised classification uses a database of samples x_1, ..., x_n ∈ X for which the labels y_1, ..., y_n ∈ {−1, +1} are known in order to construct the prediction function. We focus on linear decision functions, which are of the form f(x) = w⊤x, where w⊤ is the transpose of a vector w ∈ R^p. The class prediction for a profile x is then +1 if f(x) ≥ 0, and −1 otherwise. Training a linear classifier amounts to estimating a vector w ∈ R^p from prior knowledge and the observation of the labeled training set.

The training set can be used to assess whether a candidate vector w correctly predicts the labels of the training set; one may expect such a w to correctly predict the classes of unlabeled samples as well. This induction principle, sometimes referred to as empirical risk minimization, is however likely to fail in our situation, where the dimension of the samples (the number of probes) is typically larger than the number of training points. In such a case, many vectors w can perfectly explain the labels of the training set without capturing any biological information, and these vectors are likely to predict the classes of new samples poorly.

A well-known strategy to overcome this overfitting issue, in particular when the dimension of the data is large compared to the number of training points available, is to look for large-margin classifiers constrained by regularization [Vap98]. A large-margin classifier is a prediction function f(x) that not only tends to produce the correct sign (positive for labels +1, negative for labels −1), but also tends to produce large absolute values. This can be formalized by the notion of margin, defined as yf(x): large-margin classifiers try to predict the class of a sample with a large margin. Note that the prediction is correct if the margin is positive. The margin can be thought of as a measure of confidence in the prediction given by the sign of f, so a large margin is synonymous with high confidence. Training a large-margin classifier means estimating a function f that takes large margin values on the training set. However, just as for the sign of f, if p > n then it is possible to find vectors w that lead to arbitrarily large margins on all points of the training set. In order to control this overfitting, large-margin classifiers maximize the margin of the classifier on the training set under an additional constraint on the classifier f, typically that w is not too "large". In summary, large-margin classifiers find a trade-off between the objective of ensuring large margin values on the training set, on the one hand, and that of controlling the complexity of the classifier, on the other hand. The balance in this trade-off is typically controlled by a parameter of the algorithm.

More formally, large-margin classifiers typically require the definition of two ingredients:

• A loss function l(t) that is "small" when t ∈ R is "large". From the loss function one can deduce the empirical risk of a candidate vector w, given by the average of the loss function applied to the margins of w on the training set:

$$R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n} l\left(y_i w^\top x_i\right). \qquad (3.1)$$

The smaller the empirical risk, the better w fits the training set in the sense of having a large margin. Typical loss functions are the hinge loss l(t) = max(0, 1 − t) and the logit loss l(t) = log(1 + e^{−t}).

• A penalty function Ω(w) that measures how "large" or how "complex" w is.
Typical penalty functions are the L1 and L2 norms of w, defined respectively by

$$\|w\|_1 = \sum_{i=1}^{p} |w_i| \quad \text{and} \quad \|w\|_2 = \left(\sum_{i=1}^{p} w_i^2\right)^{1/2}.$$

Given a loss function l and a penalty function Ω, large-margin classifiers can then be trained on a given training set by solving the following constrained optimization problem:

$$\min_{w \in \mathbb{R}^p} R_{emp}(w) \quad \text{subject to} \quad \Omega(w) \le \mu, \qquad (3.2)$$

where µ is a parameter that controls the trade-off between fitting the data, i.e., minimizing R_emp(w), and monitoring the regularity of the classifier, i.e., monitoring Ω(w). Examples of large-margin classifiers include the support vector machine (SVM) and kernel logistic regression (KLR), obtained by combining respectively the hinge and logit losses with the L2 norm penalty function [CV95, BGV92b, Vap98], or the 1-norm SVM, obtained when the hinge loss is combined with the L1 penalty. The final classifier depends on both the loss function and the penalty function. In particular, the penalty function is useful to include prior knowledge or intuition about the classifier one expects. For example, the L1 penalty function is widely used because it tends to produce sparse vectors w, therefore performing an automatic selection of features. This property has been successfully used in the context of regression [Tib96], signal representation [CDS98], survival analysis [Tib97], logistic regression [GAL+07, KHCF04], and multinomial logistic regression [KCFH05], where one expects to estimate a sparse vector.

3.2.4 Fused lasso

Some authors have proposed to design specific penalty functions as a means to encode specific prior information about the expected form of the final classifier. In the context of regression applied to signal processing, when the data is a time series, [LF96] propose to encode the expected positive correlation between successive variables by choosing a regularisation term that forces successive variables of the classifier to have similar weights. More precisely, assuming that the variables w_1, w_2, ..., w_p are sorted in a natural order in which many pairs of successive values are expected to have the same weight, they propose the variable fusion penalty function:

$$\Omega_{fusion}(w) = \sum_{i=1}^{p-1} |w_i - w_{i+1}|. \qquad (3.3)$$

Plugging this penalty function into the general algorithm (3.2) enforces a solution w with many successive values equal to each other, that is, it tends to produce a piecewise constant weight vector. In order to combine this interesting property with a requirement of sparseness of the solution, [TSR+05] proposed to combine the lasso penalty and the variable fusion penalty in a single optimization problem with two constraints, namely:

$$\min_{w \in \mathbb{R}^p} R_{emp}(w) \quad \text{under the constraints} \quad \sum_{i=1}^{p-1} |w_i - w_{i+1}| \le \mu \quad \text{and} \quad \|w\|_1 \le \lambda, \qquad (3.4)$$

where λ and µ are two parameters that control the relative trade-offs between fitting the training data (small R_emp), enforcing sparsity of the solution (small λ) and enforcing the solution to be piecewise constant (small µ). When the empirical loss is the mean square error in regression, the resulting algorithm is called the fused lasso. This method was illustrated in [TSR+05] with examples taken from gene expression datasets and mass spectrometry. Later, [TW07] proposed a variant of the fused lasso for the purpose of signal smoothing, and illustrated it on the problem of discretising noisy CGH profiles.
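To make these ingredients concrete, the short Python sketch below (our own illustration; the function names are not from the original text) evaluates the empirical hinge risk (3.1), the lasso penalty, and the variable fusion penalty (3.3) for a candidate weight vector:

```python
import numpy as np

def empirical_hinge_risk(w, X, y):
    """Empirical risk (3.1) with the hinge loss l(t) = max(0, 1 - t).
    X is the (n, p) matrix of profiles, y the vector of +/-1 labels."""
    margins = y * (X @ w)                  # y_i * w^T x_i for each sample
    return np.mean(np.maximum(0.0, 1.0 - margins))

def lasso_penalty(w):
    """L1 penalty ||w||_1, which encourages sparse weight vectors."""
    return np.sum(np.abs(w))

def fusion_penalty(w):
    """Variable fusion penalty (3.3), which encourages weight vectors
    that are piecewise constant along the feature ordering."""
    return np.sum(np.abs(np.diff(w)))
```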
3.2.5 Fused SVM

Remembering from Section 3.2.2 that for arrayCGH data classification one typically expects the "true" classifier to be sparse and piecewise constant along the genome, we propose to extend the fused lasso to the context of classification and to adapt it to the chromosome structure for arrayCGH data classification. The extension of the fused lasso from regression to large-margin classification is obtained simply by combining the fused lasso constraints of (3.4) with a large-margin empirical risk. In what follows we focus on the empirical risk (3.1) obtained from the hinge loss, which leads to a simple implementation as a linear program (see Section 3.2.6 below). The extension to other convex loss functions, in particular the logit loss, results in convex optimization problems with linear constraints that can be solved with general convex optimization solvers [BV04a].

In the case of arrayCGH data, a minor modification of the variable fusion penalty (3.3) is necessary to take into account the organization of the genome into chromosomes. Indeed, two successive spots on the same chromosome are prone to be subject to the same amplification and are therefore likely to have similar weights in the classifier; this positive correlation, however, is not expected across different chromosomes. Therefore we restrict the pairs of successive features appearing in the fusion constraint (3.3) to consecutive probes on the same chromosome. We call the resulting algorithm a fused SVM; it can be formally written as the solution of the following problem:

$$\min_{w \in \mathbb{R}^p} \sum_{i=1}^{n} \max\left(0, 1 - y_i w^\top x_i\right) \quad \text{under the constraints} \quad \sum_{i \sim j} |w_i - w_j| \le \mu \quad \text{and} \quad \sum_{i=1}^{p} |w_i| \le \lambda, \qquad (3.5)$$

where i ∼ j if i and j are the indices of successive spots of the same chromosome. As with the fused lasso, this optimisation problem tends to produce classifiers w with similar weights for consecutive features, while remaining sparse. The algorithm depends on two parameters, λ and µ, which are typically chosen by cross-validation on the training set. Decreasing λ tends to increase the sparsity of w, while decreasing µ tends to enforce successive spots to have the same weight. This classification algorithm can be applied to CGH profiles, taking the log-ratios as features. Due to the effect of both regularisation terms, we obtain a sparse classification function that attributes similar weights to successive spots.

3.2.6 Implementation of the fused SVM

Introducing slack variables, the problem described in (3.5) is equivalent to the following linear program:

$$\min_{w, \alpha, \beta, \gamma} \sum_{i=1}^{n} \alpha_i$$

under the following constraints:

$$\alpha_i \ge 0 \quad \text{and} \quad \alpha_i \ge 1 - y_i w^\top x_i, \quad i = 1, \dots, n,$$
$$\sum_{i=1}^{p} \beta_i \le \lambda, \quad \text{with} \quad \beta_i \ge w_i \text{ and } \beta_i \ge -w_i, \quad i = 1, \dots, p,$$
$$\sum_{k=1}^{q} \gamma_k \le \mu, \quad \text{with} \quad \gamma_k \ge w_i - w_j \text{ and } \gamma_k \ge w_j - w_i \text{ for each of the } q \text{ pairs } i \sim j \text{ (indexed by } k). \qquad (3.6)$$

In our experiments, we implemented and solved this problem using Matlab and the SeDuMi 1.1R3 optimisation toolbox [Stu99].
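The thesis implementation relied on Matlab and SeDuMi; purely as an illustration, here is a minimal sketch of the same optimisation problem using Python and the cvxpy modelling library (a substitution of ours, not the original code). cvxpy introduces the slack variables α, β and γ of (3.6) internally, so the problem can be stated directly in the form (3.5):

```python
import numpy as np
import cvxpy as cp

def fused_svm(X, y, chrom, lam, mu):
    """Train a fused SVM (3.5). X: (n, p) log-ratio matrix, y: +/-1 labels,
    chrom: length-p array of chromosome labels, one per probe."""
    n, p = X.shape
    w = cp.Variable(p)
    # Empirical hinge risk; cvxpy adds the alpha slack variables internally.
    loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w)))
    # Indices i such that probes i and i+1 lie on the same chromosome.
    idx = np.where(chrom[:-1] == chrom[1:])[0]
    constraints = [
        cp.norm1(w) <= lam,                   # sparsity: sum_i |w_i| <= lambda
        cp.norm1(w[idx + 1] - w[idx]) <= mu,  # fusion: sum_{i~j} |w_i - w_j| <= mu
    ]
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value
```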
3.3 Data

We consider two publicly available arrayCGH datasets for cancer research, from which we derive three problems of diagnosis and prognosis to test our method.

The first dataset contains the arrayCGH profiles of 57 bladder tumor samples [SVR+06]. Each profile gives the relative quantity of DNA for 2215 spots. We removed the probes corresponding to the sex chromosomes, because the sex mismatch between some patients and the reference used makes the computation of copy number less reliable, giving us a final list of 2143 spots. We considered two types of tumor classification: either by grade, with 12 tumors of grade 1 and 45 tumors of higher grades (2 or 3), or by stage, with 16 tumors of stage Ta and 32 tumors of stage T2+. In the case of stage classification, 9 tumors with the intermediary stage T1 were excluded from the classification.

The second dataset contains the arrayCGH profiles of 78 melanoma tumors arrayed on 3750 spots [THH+08]. As for the bladder cancer dataset, we excluded the sex chromosomes from the analysis, resulting in a total of 3649 spots. 35 of these tumors led to the development of liver metastases within 24 months, while 43 did not. We therefore consider the problem of predicting, from an arrayCGH profile, whether or not the tumor will metastasize within 24 months.

In both datasets, we replaced the log-ratios of missing spots by 0. In order to assess the performance of a classification method, we performed a cross-validation for each of the three classification problems, following a leave-one-out procedure for the bladder dataset and a 10-fold procedure for the melanoma dataset. We measure the number of misclassified samples for different values of the parameters λ and µ.
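As a sketch of this model selection procedure (our own illustration, reusing the hypothetical fused_svm function above), a leave-one-out loop over a logarithmic grid of (λ, µ) values can be written as follows:

```python
import numpy as np

def loo_errors(X, y, chrom, lambdas, mus):
    """Count leave-one-out misclassifications for each (lambda, mu) pair."""
    n = X.shape[0]
    errors = np.zeros((len(lambdas), len(mus)), dtype=int)
    for a, lam in enumerate(lambdas):
        for b, mu in enumerate(mus):
            for i in range(n):
                train = np.arange(n) != i            # leave sample i out
                w = fused_svm(X[train], y[train], chrom, lam, mu)
                if np.sign(X[i] @ w) != y[i]:
                    errors[a, b] += 1
    return errors

# Logarithmic grids for the two regularization parameters.
lambdas = 2.0 ** np.arange(-4, 11)
mus = 2.0 ** np.arange(-4, 11)
```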
3.4 Results

In this section, we present the results obtained with the fused SVM on the datasets described in the previous section. As a baseline method, we consider an L1-SVM, which minimizes the mean empirical hinge loss subject to a constraint on the L1 norm of the classifier in (3.2). The L1-SVM performs automatic feature selection, with a regularization parameter λ controlling the amount of regularization. It has been shown to be a competitive classification method for high-dimensional data, such as gene expression data [ZRHT04]. In fact the L1-SVM is a particular case of our fused SVM, obtained when the µ parameter is chosen large enough to relax the variable fusion constraint (3.3), typically by taking µ > 2λ. Hence, by varying µ from a large value to 0, we can see the effect of the variable fusion penalty on the classical L1-SVM.

3.4.1 Bladder tumors

The upper plot of Figure 3.1 shows the estimated accuracy (by leave-one-out cross-validation, LOO) of the fused SVM as a function of the regularization parameters λ and µ, for the classification by grade of the bladder tumors. The lower left plot of Figure 3.1 represents the best linear classifier found by the L1-SVM (corresponding to λ = 256), while the lower right plot shows the linear classifier estimated from all samples by the fused SVM when λ and µ are set to the values that minimise the LOO error, namely λ = 32 and µ = 1. Similarly, Figure 3.2 shows the same results (LOO accuracy, L1-SVM and fused SVM classifiers) for the classification of bladder tumors according to their stage.

In both cases, when µ is large enough to make the variable fusion constraint inactive in (3.5), the classifier only finds a compromise between the empirical risk and the L1 norm of the classifier; in other words, we recover the classical L1-SVM with parameter λ. Graphically, the performance of the L1-SVM for varying λ can be seen on the upper side of each plot of the LOO accuracy in Figures 3.1 and 3.2. Interestingly, in both cases we observe that the best performance obtained when both λ and µ can be adjusted is much better than the best performance of the L1-SVM, when only λ can be adjusted. In the case of grade classification, the number of misclassified samples drops from 12 (21%) to 7 (12%), while in the case of stage classification it drops from 13 (28%) to 7 (15%). This suggests that the additional constraint, which translates our prior knowledge about the structure of the spot positions on the genome, is beneficial in terms of classifier accuracy.

[Figure 3.1] The upper panel represents the number of misclassified samples in a leave-one-out error loop on the bladder cancer dataset with the grade labelling, with its color scale, for different values of the parameters λ and µ, which vary logarithmically along the axes. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are ordered and represented as a blue line, annotated with the chromosome separations (red lines).

[Figure 3.2] The upper panel represents the number of misclassified samples in a leave-one-out error loop on the bladder cancer dataset with the stage labelling, with its color scale, for different values of the parameters λ and µ, which vary logarithmically along the axes. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are ordered and represented as a blue line, annotated with the chromosome separations (red lines).

As expected, there are also important differences in the visual aspects of the classifiers estimated by the L1-SVM and the fused SVM. The fused SVM produces sparse and piecewise constant classifiers, amenable to further investigation, while it is more difficult to isolate from the L1-SVM profiles the key features used in the classification, apart from a few strong peaks. As we can see from the shape of the fused SVM classifier in Figure 3.1, the grade classification function is characterised by non-null constant values over a few small chromosomal regions and numerous larger regions. Of these regions, a few are already known to be altered in bladder tumors, such as the gain on region 1q [CHTG05], and some have already been shown to be correlated with grade, such as chromosome 7 [WCK+91]. By contrast, the stage classifier is characterised by only a few regions, most of them involving large portions of chromosomes. They concern mainly chromosomes 4, 7, 8q, 11p, 14, 15, 17, 20, 21 and 22, with in particular a strong contribution from chromosomes 4, 7 and 20. These results on chromosomes 7, 8q, 11p and 20 are in good agreement with [BBR+05], who identified the most common alterations according to tumor stage on a set of 98 bladder tumors.

3.4.2 Melanoma tumors

Similarly to Figures 3.1 and 3.2, the three plots in Figure 3.3 show respectively the accuracy, estimated by 10-fold cross-validation, of the fused SVM as a function of the regularisation parameters λ and µ, the linear classifier estimated by the L1-SVM when λ is set to the value that minimizes the estimated error (λ = 4), and the linear classifier estimated by a fused SVM on all samples when λ and µ are set to the values that minimise the 10-fold error, namely λ = 64 and µ = 0.5. Similarly to the bladder study, the performance of the L1-SVM without the fusion constraint can be retrieved by looking at the upper part of the plot of Figure 3.3. The fused classifier offers a slightly improved performance compared to the standard L1-SVM (17 errors (22%) versus 19 errors (24%)), even though the improvement is more marginal than for the bladder tumors, and the misclassification rate remains fairly high.
As for the bladder datasets, the L1-SVM and fused SVM classifiers are markedly different. The L1-SVM classifier is based on only a few BACs concentrated on chromosome 8, with positive weights on the 8p arm and negative weights on the 8q arm. These features are biologically relevant and correspond to known genomic alterations (loss of 8p and gain of 8q in metastatic tumors). The presence of a strong signal concentrated on chromosome 8 for the prediction of metastasis is in this case correctly captured by the sparse L1-SVM, which explains its relatively good performance.

[Figure 3.3] The upper panel represents the number of misclassified samples in a ten-fold error loop on the melanoma dataset. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are ordered and represented as a blue line, annotated with the chromosome separations (red lines).

By contrast, the fused SVM classifier is characterised by many CNAs, most of them involving large regions of chromosomes. Interestingly, we retrieve the regions whose alterations were already reported as recurrent events of uveal melanoma: chromosomes 3, 1p, 6q, 8p, 8q, 16q. As expected, the contributions of 8p and 8q are of opposite sign, in agreement with the common alterations of these regions: loss of 8p and gain of 8q in metastatic tumors. Interestingly, the contribution of chromosome 3 is limited to a small region of 3p, and does not involve the whole chromosome, as the frequency of chromosome 3 monosomy would have suggested. Note that this is consistent with the works of [PFG+03] and [TPH+01], who delimited small 3p regions from patients with partial chromosome 3 deletions. On the other hand, we also observe that large portions of other chromosomes have been assigned significant positive or negative weights, such as chromosomes 1p, 2p, 4, 5, 9q, 11p, 12q, 13, 14, 20, 21. To our knowledge, they do not correspond to previous observations, and may therefore provide interesting starting points for further investigations.

3.5 Discussion

We have proposed a new method for the supervised classification of arrayCGH data. Thanks to the use of a particular regularization term that translates our prior assumptions into constraints on the classifier, we estimate a linear classifier that is based on a restricted number of spots and gives, as much as possible, equal weights to spots located near each other on a chromosome. Results on real datasets show that this classification method is able to discriminate between the different classes with a better performance than classical techniques that do not take into account the specificities of arrayCGH data. Moreover, the learned classifier is piecewise constant and therefore lends itself particularly well to further interpretation, highlighting in particular selected chromosomal regions with strongly positive or negative weights.

From the methodological point of view, the use of regularized large-scale classifiers is nowadays widespread, especially in the SVM form. Regularization is particularly important for "small n, large p" problems, i.e., when the number of samples is small compared to the number of dimensions. An alternative interpretation of such classifiers is that they correspond to maximum a posteriori classifiers in a Bayesian framework, where the prior over classifiers is encoded in the penalty function.
It is not surprising, then, that encoding prior knowledge in the penalty function is a mathematically sound strategy that can be strongly beneficial in terms of classifier accuracy, in particular when few training samples are available. The accuracy improvements we observe on all classification datasets confirm this intuition. Beyond the particular penalty function investigated in this chapter, we believe our results support the general idea that engineering relevant priors for a particular problem can have important effects on the quality of the estimated function, and they pave the way for further research on the engineering of such priors in combination with large-margin classifiers.

As for the implementation, we solved a linear program for each pair of values of the regularization parameters λ and µ, but it would be interesting to generalize recent work on path-following algorithms in order to follow the solution of the optimization problem as λ and µ vary [EHJT04].

Another interesting direction for future research concerns the combination of heterogeneous data, in particular of arrayCGH and gene expression data. Gene expression variations indeed contain information complementary to CNVs about the genetic aberrations of the malfunctioning cell [SVR+06], and their combination is therefore likely to both improve the accuracy of the classification methods and shed new light on biological phenomena that are characteristic of each class. A possible strategy to combine such datasets would be to train a large-margin classifier with an adequately designed regularization term.

Acknowledgement

We thank Jérôme Couturier, Sophie Piperno-Neumann and Simon Saule, and the uveal melanoma group from Institut Curie. We are also grateful to Philippe Hupé for his help in preparing the data. This project was partly funded by the ACI IMPBIO Kernelchip and the EC contract ESBIC-D (LSHG-CT-2005-518192). FR and EB are members of the team "Systems Biology of Cancer", Equipe labellisée par la Ligue Nationale Contre le Cancer.

Chapter 4

Enhancement of L1-classification of microarray data using gene network knowledge

4.1 Introduction

The construction of predictive models from gene expression data is an important problem in computational biology. Typical applications include, for example, cancer diagnosis or prognosis [vtVDvdV+02, WKZ+05, BMC+00, AED+00] and discriminating between different treatments applied to micro-organisms [NEGL+05]. Since the number of genes, which is the mathematical dimension of the space the microarrays evolve in, is far greater than the number of samples, this problem turns out to be quite complex. Indeed, even if classical methods have been found to be efficient in multiple cases [ARL+07, GST+99], their results are very unstable, and their strong dependence on the training set (gene and sample selection) has already been pointed out [EDKG+05]. Including information about gene interactions (e.g. regulation of expression, metabolic or signal transduction pathways) is an attractive idea for reducing the complexity of the problem: the incorporation of biological information into the mathematical methods should reduce the error rate and provide classification functions that are more easily interpretable and more stable. Several authors have elaborated sophisticated methods to integrate gene collaboration information.
Pathway scoring methods try to detect perturbed groups of genes, or "modules", while ignoring the detailed topology of the gene influences [COVP05, CDF05]. It is assumed that all the genes inside a module are coordinately expressed, and therefore that a perturbation should affect most of them. [SZEK07, CLL+07] proposed methods to extract such modules from gene networks. Unfortunately, these methods rely on an artificial separation of the collaboration map into subgroups, and may therefore be unable to build an efficient predictive model when the biological phenomenon only affects a small number of genes, for example a subgroup of the detected modules.

Another class of methods that aim at incorporating gene collaboration knowledge into the construction of a classifier uses dimension reduction techniques. These methods first project the data into a subspace that incorporates the network information, then perform supervised classification in that subspace. In a previous work [RZD+07], we used the spectral decomposition of the graph associated with a metabolic gene network to build this subspace. [HZHS07] developed another approach, which proposed to compute synergetic pairs of genes. Unfortunately, these approaches sacrifice statistical power for the benefit of simplicity. The main disadvantage of such "pipeline"-like methods, where each stage is distinct from the following one, is that each step assumes that the results of the previous steps are correct, so the pipeline accumulates their approximations and errors, diverging from optimality. Several studies have therefore shown that it is more efficient to include the knowledge directly in the analysis method rather than as a preliminary step [ITR+01, GdBLC03, CKR+04].

In this chapter, we describe a method that combines the benefits of both classes of algorithms by incorporating gene/gene interaction knowledge inside the analysis. This new method extends fused classification [TSR+05], a method for the classification of data whose features can be ordered in a meaningful way, to classify data where some features are positively correlated through relations more complex than a simple ordering. Different types of gene networks turn out to be good repositories for this collaboration information. First, we review the usual supervised classification methods and how [TSR+05] extended them to build the fused classification method. We then show how to build our new method, and examine how it performs on classical datasets.

4.2 Methods

In this section, we describe the usual supervised classification methods and how we developed the methods used to incorporate network knowledge into our analysis.

4.2.1 Usual linear supervised classification method

The aim of gene expression profile classification is to build a function f : R^n → Y that is able to attribute to each new expression profile x ∈ R^n, where n is the number of genes, a label y ∈ Y, a mathematical representation of a biological property of the sample. Depending on the study, this label could be sick or healthy, the treatment the sample has been subjected to, etc.

Supervised classification is a particular category of classification methods where a set of samples X = {X_i}_{i=1,...,p}, for which the correct labels Y = {Y_i}_{i=1,...,p} are already known, is used to build the classification function f. Linear supervised classification uses a linear vector, i.e.
a vector w ∈ R^n, to construct the function f, which then takes the form f : x ↦ w⊤x, where w⊤ is the transpose of w. Geometrically, w can be seen as the vector orthogonal to a hyperplane P that separates the whole space into two subspaces, and the subspace in which a sample X_i falls defines its predicted class (which will be sign(w⊤X_i) in the case of binary classification, where Y = {−1, 1}). This hyperplane will be a good separator not only if most of the samples fall in the correct subspace, but also if it provides a satisfactory geometrical partition of the sample space; [CV95], for example, proposed to maximise the margin around the hyperplane.

Let l : (X_i, Y_i) ↦ l(w⊤X_i, Y_i) be a loss function, a way to measure how far the sample X_i lies from the subspace it should belong to. A good classifier would be one that minimises the average of l over the training set. Unfortunately, as the dimension n of the space in which the samples evolve is very large compared to the number p of samples, such a classifier is likely to be over-fitted, meaning that it will perform relatively well on the training set but may perform poorly on unknown examples. Therefore, the minimization of l alone is not sufficient to find an efficient classifier, and we have to add another constraint on w, representing the knowledge we have of the shape we want the classifier to be constrained to. The linear predictive model for our problem is then obtained by solving the following optimisation problem:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraint} \quad r(w) \le \mu. \qquad (4.1)$$

This problem is formed of two parts: the minimisation of the loss function l, which measures the error between the predicted class for a specific sample X_i and the real class given by the label Y_i; and a regularisation term r(w), where µ is a parameter adjusted to build a trade-off between the efficiency of the classification and the minimisation of the regularisation term. The bigger µ is, the looser the constraint, and the less w is forced into the form represented by the regularisation term. Good examples of such classifiers include support vector machines (taking Y = {−1, 1} as the space of classes, the hinge loss l(w⊤X_i, Y_i) = max(0, 1 − Y_i w⊤X_i) and r(w) = ||w||_2) [SS02, STV04, STC00] or ridge regression (Y = R, l(w⊤X_i, Y_i) = (w⊤X_i − Y_i)^2 and r(w) = ||w||_2) [HTF01].

"Usual" classification problems often use a norm for the regularisation term, whether the Euclidean norm, defined by ||w||_2 = (Σ_{i=1}^{n} w_i^2)^{1/2}, or the L1-norm, defined by ||w||_1 = Σ_{i=1}^{n} |w_i|. The two regularisation forms force the classifier to comply with different constraints and therefore give the classifier different shapes.

The constraint induced by the L1-norm regularisation forces our classifier to be sparse, meaning that most components of w will be equal to zero (see for example the discussion in [Tib97]). It is therefore useful if we feel that our model should depend on a small number of genes. This regularisation term does not only help with classification performance: it can lead to the building of a sparse, and therefore easy to interpret, "signature", a specific set of genes that is characteristic of the property of interest [GC05].
When a sparse classifier is not appropriate, we may instead assume that the decision function should not be too complicated, resulting in a classifier with a small Euclidean norm, and thus apply the Euclidean regularisation. Moreover, in the case of support vector machines, since the Hessian matrix is then positive definite (instead of positive semi-definite), the problem is computationally more stable than with the L1-norm [Abe02]. By constraining the L1 or L2 norm to be small, we limit the space the classifier evolves in; in the L1 case this amounts to automatic feature selection. But, as we explain in the next section, the regularisation term can also be used to include prior knowledge that we have about the classifier.

4.2.2 Fusion and fused classification

For supervised classification problems whose features can be ordered in a meaningful way, [LF96] proposed a method called "fusion classification" to incorporate the information of correlation between successive features into the regularisation term. The problem can be written in the following form:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraint} \quad \sum_{i=2}^{n} |w_i - w_{i-1}| \le \mu. \qquad (4.2)$$

Because of the regularisation term, this method results in a classifier where successive features tend to have similar weights. In appropriate cases it shows great potential, as the features of the classifier will be partitioned into groups of successive features with similar weights, which reduces the effective dimension of the problem. The classification will therefore depend on whole groups of features and should be more robust to experimental or biological noise.

[TSR+05] extended the method into a classification technique called the "fused lasso". Their method is the minimisation of the following form:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraints} \quad \sum_{i=1}^{n} |w_i| \le \alpha \quad \text{and} \quad \sum_{i=2}^{n} |w_i - w_{i-1}| \le \mu. \qquad (4.3)$$

They use Y = R as the space of labels and the squared loss of ridge regression, l(w⊤X_i, Y_i) = (Y_i − w⊤X_i)^2, as the loss function; hence this method is a compromise between fusion classification and the traditional lasso, as it uses both regularisation terms (the one described in equation 4.2 and the usual L1-norm), ensuring that the classifier is sparse, due to the L1-norm regularisation, and that its successive features have similar weights, due to the fusion-like regularisation term. In the next section, we show how this technique can be extended to incorporate network knowledge into microarray classification.

4.2.3 Network-fused classification

The fused lasso only offers a way to regularise a feature i with regard to its relations with two other features, i − 1 and i + 1, which are chosen by the way the features are ordered. While this is an interesting method for some biological data, like protein mass spectrometry data [TSR+05] or CGH array data [TW07], it is not as pertinent for gene expression profiles, where the relations between features, i.e. gene expression levels, are far more complex. Indeed, relations between genes are often described as a graph (V, E), where V is the set of vertices (genes) and E the set of edges (pairs of genes whose expressions are correlated with each other). To establish a regularisation constraint that incorporates this network knowledge, we propose to build a classifier that tends to attribute similar values to connected nodes, corresponding to the following form, similar to the one seen in equation 4.3:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraints} \quad \sum_{i=1}^{n} |w_i| \le \lambda \quad \text{and} \quad \sum_{u \sim v} |w_u - w_v| \le \mu, \qquad (4.4)$$
where u ∼ v means (u, v) ∈ E, and where we denote by w_u and w_v the weights of w corresponding to genes u and v respectively. We will use Y = {−1, 1} as the label space and, for the loss function, l(u, y) = max(0, 1 − yu), the hinge loss used in the classical SVM. By analogy with the fused lasso, we obtain a classifier that tends to be sparse (due to the first constraint) and tends to attribute similar weights to connected nodes (due to the second constraint). Our classifier will therefore at the same time have the advantages of a standard classifier, i.e. sparseness, and incorporate network knowledge, i.e. positive correlation between connected genes, with the two parameters λ and µ helping to build a trade-off between these two constraints and the classification efficiency represented by the loss term.

As our classifier is the minimisation of a linear form under linear constraints, we are confronted with a linear program. Numerous methods have been proposed to solve this type of problem [Tod02]. If we used another convex loss function instead of the hinge loss, we would obtain a convex optimisation problem, which would also be solvable, using convex optimisation methods [BV04b].

4.2.4 Implementation

The problem described in equation 4.4 can be transformed into the following linear program:

$$\min_{w, \alpha, \beta, \gamma} \sum_{i=1}^{p} \alpha_i$$

under the following constraints:

$$\alpha_i \ge 0 \quad \text{and} \quad \alpha_i \ge 1 - Y_i w^\top X_i, \quad i = 1, \dots, p,$$
$$\sum_{i=1}^{n} \beta_i \le \lambda, \quad \text{with} \quad \beta_i \ge w_i \text{ and } \beta_i \ge -w_i, \quad i = 1, \dots, n,$$
$$\sum_{k=1}^{q} \gamma_k \le \mu, \quad \text{with} \quad \gamma_k \ge w_u - w_v \text{ and } \gamma_k \ge w_v - w_u \text{ for each edge } (u, v) \in E \text{ (indexed by } k), \qquad (4.5)$$

where, for every u ∈ V, w_u is the weight attributed in w to the node u, p is the number of samples, n = Card(V) the number of genes, and q = Card(E) the number of gene interactions. This problem was implemented and solved using Matlab and the SeDuMi 1.1R3 optimisation toolbox [Stu99].
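For illustration only (the original implementation used Matlab and SeDuMi), the network-fused SVM of (4.4) can be sketched in Python with the cvxpy modelling library, which generates the slack variables of (4.5) internally; the edge representation below is an assumption of ours:

```python
import numpy as np
import cvxpy as cp

def network_fused_svm(X, Y, edges, lam, mu):
    """Train the network-fused SVM (4.4). X: (p, n) expression matrix,
    Y: +/-1 labels, edges: (q, 2) integer array of gene index pairs."""
    p, n = X.shape
    w = cp.Variable(n)
    # Hinge loss over the p training samples.
    loss = cp.sum(cp.pos(1 - cp.multiply(Y, X @ w)))
    u, v = edges[:, 0], edges[:, 1]
    constraints = [
        cp.norm1(w) <= lam,            # sparsity constraint
        cp.norm1(w[u] - w[v]) <= mu,   # network fusion over the q edges
    ]
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value
```

Compared with the fused SVM of Chapter 3, the only change is that the fusion constraint runs over the edges of an arbitrary graph rather than over consecutive probes on a chromosome.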
4.3 Data

In this section we describe the different datasets used in our study.

4.3.1 Expression data sets

We collected our first expression data set from a study that aims at predicting "poor prognosis" in breast cancer patients by separating patients who developed distant metastases (51 samples) from patients who remained disease-free for a period of at least 5 years (46 samples) [vtVDvdV+02]. The original study separated this set of 97 patients into a training and a testing set, but we merged these two sample sets into one single data set in order to reduce the error bias that was underlined in previous studies [MKH05] and that is related to the selection of the samples forming each group. Each gene expression profile contains approximately 25,000 genes. We collected the data set as normalised in the original study and set each gene's mean and variance to 0 and 1, respectively.

We collected our second expression data set from a study that aims at discriminating patients positive for Estrogen Receptors (ER) (208 samples) from ER-negative patients (77 samples) among people who developed lymph-node-negative primary breast cancer [WKZ+05]. We merged the training and testing sets of the original study, obtaining a global set of 285 samples that we used to classify patients who suffered a relapse (107 samples) from the ones who remained disease-free (179 samples). The gene expression profiles were collected using Affymetrix U133A genechips, resulting in profiles of roughly 22,000 genes. We used a gcRMA-normalised version of this data set.

4.3.2 Gene networks

Our method requires a database of positive relations between our gene expression levels. This data can be represented as an undirected and finite graph G that satisfies two precise conditions: the nodes represent proteins or the corresponding genes, and an edge exists between two nodes if and only if there exists some type of positive correlation between the expression levels of the two corresponding genes. We will denote by V the set of vertices, of cardinality |V| = n, and by E ⊂ V × V the set of edges. Different repositories provide this kind of information.

One example of a repository for this kind of relation is metabolic networks. In metabolic networks, the vertices represent enzymes (or the corresponding genes), and an edge is formed between two vertices u and v if v catalyses a reaction where one of the reactants is a product of a reaction catalysed by u. The correlation between metabolic pathways and gene expression data has already been shown in several studies [GDS+05, MOSS+04]. One practical way to understand the existence of this correlation is to see that a cascade of reactions will be active if all of the corresponding enzymes are active, i.e. expressed. For the analysis of this data, we collected two metabolic networks. The first one was built from the KEGG database of metabolic pathways [KGK+04]; we reconstructed this network from the KGML v.0.6, resulting in 13275 edges between 1354 genes. The second metabolic network was extracted from Reactome [VDS+07], a database that contains several types of gene networks, resulting in a network composed of 23233 edges between 1224 genes.

Other interesting databases include protein-protein interaction networks (also known as protein interaction networks or PPI networks). In PPI networks, vertices represent proteins, and an edge is formed between protein u and protein v if these proteins are known to physically interact. There are three principal ways to construct PPI networks: automatic inference, yeast two-hybrid (Y2H) experiments, and literature analysis. The first two methods of construction have been shown to be quite complementary, while both are biologically relevant [RSA+07]. We collected three protein-protein interaction networks. The first one was built from Bioverse-core [MS03, MBS05], a manually-curated predicted interaction network. The second one was built from CCSB-HI1 [RVH+05], which was constructed using Y2H experiments. [RSA+07] assessed the quality of both these networks and showed their complementarity. We therefore joined both networks to construct a new one, which forms our third PPI network.
Influence networks can be seen as two graphs that span the same set of vertices V, of cardinality |V| = n, representing genes in each graph. In the first graph G_activ = (V, E_activ), (u, v) ∈ E_activ if the expression of gene u is positively correlated with the expression of gene v. In the second graph G_inhib = (V, E_inhib), (u, v) ∈ E_inhib if the expression of gene u is negatively correlated with the expression of gene v. We call the subnetwork formed by the graph G_activ the positive influence network. Positive influence networks can be used as a database of correlations. We extracted from the manually curated version of the ResNet pathway database [YMK+06] an expression influence network of 5148 edges between 2612 genes.

The last type of interesting database that we studied is co-expression networks. In these networks, an edge is formed between gene u and gene v if they are often found co-expressed in a set of gene expression profiles. This type of network can only be inferred, and it depends strongly on the type of gene expression profile data, which needs to be generic or large enough for the data to be reliable. [YMH+07] proposed a way to identify co-expression modules from gene expression datasets and built a co-expression relation database based on 105 different sets of expression profiles. From these sets, they only kept the relations they found significant enough, resulting in the construction of 105 different co-expression networks. We used their data to build two expression networks, the first one with relations that were significant in at least 10% of their data sets and the second one with relations that were significant in at least 20% of their data sets.

For every network, we only kept edges between two distinct genes. In each analysis, we only kept the genes that were present on the microarrays.
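A sketch of this filtering step (our own illustration, with hypothetical names): keeping only edges between two distinct genes that are both present on the array:

```python
def restrict_network(edges, array_genes):
    """Keep only the edges between two distinct genes measured on the array.

    edges: iterable of (gene_u, gene_v) identifier pairs.
    array_genes: set of gene identifiers present on the microarray.
    """
    kept = set()
    for u, v in edges:
        if u != v and u in array_genes and v in array_genes:
            # Store each undirected edge once, in a canonical order.
            kept.add((u, v) if u <= v else (v, u))
    return sorted(kept)
```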
Table 4.1 compares the complexity of each network.

Table 4.1: Characteristics of the different networks used: the number of vertices and edges of each network, in its complete form (excluding self-edges and negative edges) and restricted to the genes of each expression data set.

| Network | All genes: vertices | All genes: edges | VV gene set: vertices | VV gene set: edges | Wang gene set: vertices | Wang gene set: edges |
|---|---|---|---|---|---|---|
| KEGG | 1354 | 13275 | 1203 | 9879 | 1156 | 9782 |
| Reactome | 1224 | 23233 | 1171 | 20966 | 1159 | 21063 |
| Bioverse-core | 1263 | 2855 | 1161 | 2283 | 1216 | 2655 |
| CCSB-HI1 | 1549 | 2611 | 1278 | 1273 | 1265 | 1816 |
| Both PPI networks | 2673 | 5446 | 2353 | 3976 | 2344 | 4452 |
| Resnet | 2612 | 5148 | 2238 | 4187 | 2215 | 4369 |
| Coexp. network 1 | 2407 | 28836 | 2377 | 20995 | 2406 | 28835 |
| Coexp. network 2 | 1256 | 10518 | 1038 | 7801 | 1256 | 10518 |

4.4 Results

In this section we describe and comment on the results obtained using the different networks and datasets.

4.4.1 Performance

We performed classification of the Van't Veer dataset using all the previously described networks, for different values of the (λ, µ) parameters. The results are shown in Figure 4.1 and the performances of the best classifiers are shown in Table 4.2. The same was done for the Wang dataset: we obtained the results of Figure 4.2, with the performances shown in Table 4.3.

Table 4.2: Performance of the best classifiers for each network on the Van't Veer dataset. The 10f error is the number of misclassified samples in a ten-fold cross-validation.

| Network | Without network: 10f error | λ | With network: 10f error | λ | µ |
|---|---|---|---|---|---|
| KEGG | 29 | 2 | 27 | 2 | 64 |
| Reactome | 31 | 4 | 28 | 4 | 2 |
| Bioverse-Core | 28 | 4 | 25 | 16 | 0.25 |
| CCSB-HI1 | 27 | 4 | 24 | 2 | 1 |
| Both PPI networks | 24 | 2 | 23 | 2 | 4 |
| Resnet | 28 | 4 | 24 | 2 | 0.5 |
| Coexp. network 1 | 28 | 2 | 23 | 2 | 2 |
| Coexp. network 2 | 29 | 2 | 25 | 4 | 4 |
| No network | 25 | 2 | - | - | - |

Table 4.3: Performance of the best classifiers for each network on the Wang dataset. The 10f error is the number of misclassified samples in a ten-fold cross-validation.

| Network | Without network: 10f error | λ | With network: 10f error | λ | µ |
|---|---|---|---|---|---|
| KEGG | 107 | 0.0312 | 96 | 4096 | 8 |
| Reactome | 92 | 8 | 92 | 8 | 512 |
| Bioverse-Core | 92 | 16 | 88 | 16 | 32 |
| CCSB-HI1 | 106 | 2 | 96 | 16 | 1 |
| Both PPI networks | 107 | 0.0312 | 98 | 16 | 0.0625 |
| Resnet | 93 | 2 | 87 | 8 | 0.125 |
| Coexp. network 1 | 102 | 16 | 95 | 16 | 512 |
| Coexp. network 2 | 110 | 0.625 | 106 | 1 | 0.5 |
| No network | 88 | 2 | - | - | - |

[Figure 4.1: one panel per network (KEGG, Reactome, Bioverse-core, CCSB-HI1, both PPI networks, Resnet, co-expression networks 1 and 2)] This figure represents the number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Van't Veer dataset. λ and µ vary on the same logarithmic scale for every experiment. Bright blue corresponds to the parameter couples with the highest count of misclassified samples, while bright red corresponds to the lowest count.

[Figure 4.2: one panel per network, as in Figure 4.1] This figure represents the number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Wang dataset. λ and µ vary on the same logarithmic scale for every experiment. Bright blue corresponds to the parameter couples with the highest count of misclassified samples, while bright red corresponds to the lowest count.

As described by equation 4.4, the lower λ is, the sparser the solution, and the lower µ is, the less the solution varies along the edges of the network. Therefore the classifier with only L1-regularisation (corresponding to the L1-SVM) is obtained by taking an infinite µ, or at least a value larger than the graph penalty of the pure lasso solution, and the evolution of its performance can be seen by looking at the highest horizontal lines in Figures 4.1 and 4.2.

Measuring the performance using a ten-fold cross-validation for each parameter couple introduces some bias. A cleaner way to calculate the number of misclassified samples would have been to perform a nested cross-validation including parameter selection. However, this is much more time-consuming than our method and, as we perform classification over a larger set, simple cross-validation is enough to estimate the general trend of the classification error.

Looking at Tables 4.2 and 4.3, we can see that even without introducing any network-related constraint, the classification performance varies depending on the gene network used. This is due to the fact that we only keep the genes that are present in the network, which can be seen as a priori gene selection, and combine the probes that are related to the same gene. However, as every classification without any network constraint performs worse than the classification that keeps all the genes (with the exception of the classification using both PPI networks for the Van't Veer data set), genes that are primordial for the discrimination may be missing from the networks.

For the Van't Veer classification problem, we can see that in every case the incorporation of the network improves the performance and that, with the exception of the metabolic networks, the resulting classifiers are at least as efficient as the one obtained when keeping all the genes. This is not the case for the Wang dataset, for which the classifier with all the genes remains the best classifier, even if the introduction of Bioverse-core or Resnet achieves comparable results. An interesting phenomenon that we observe on the Wang dataset is that the network constituted of both PPI networks performs worse than each network taken separately.
[Figure 4.1: the number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Van’t Veer dataset, with one heat map per network (KEGG, Reactome, Bioverse-core, CCSB-HI1, both PPI networks, Resnet, Coexp. network 1, Coexp. network 2). λ and µ vary on the same logarithmic scale for every experiment. Bright blue corresponds to the parameter couples with the highest count of misclassified samples, bright red to the lowest. Heat maps not reproduced here.]

[Figure 4.2: the number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Wang dataset, with one heat map per network, using the same colour convention and the same logarithmic scale as Figure 4.1. Heat maps not reproduced here.]

Table 4.3: Performance of the best classifiers for each network on the Wang dataset. The 10f error is the number of misclassified samples in a ten-fold cross-validation.

                       Best classifier            Best classifier
                       without network            with network
Network                10f error   λ              10f error   λ       µ
KEGG                   107         0.0312         96          4096    8
Reactome               92          8              92          8       512
Bioverse-Core          92          16             88          16      32
CCSB-HI1               106         2              96          16      1
Both PPI networks      107         0.0312         98          16      0.0625
Resnet                 93          2              87          8       0.125
Coexp. network 1       102         16             95          16      512
Coexp. network 2       110         0.625          106         1       0.5
No network             88          2              -           -       -

Looking at Tables 4.2 and 4.3, we can see that, even without introducing any network-related constraint, the classification performance varies depending on the gene network used. This is because we only keep the genes that are present in the network, which can be seen as an a priori gene selection, and we combine the probes that relate to the same gene. However, as every classification without a network constraint performs worse than the classification that keeps all the genes (with the exception of the one using both PPI networks on the Van’t Veer dataset), genes that are essential for the discrimination may be missing from the networks.

For the Van’t Veer classification problem, the incorporation of the network improves the performance in every case and, with the exception of the metabolic networks, the resulting classifiers are at least as efficient as the one obtained when keeping all the genes. This is not the case for the Wang dataset, on which the classifier using all the genes remains the best overall, even if the introduction of Bioverse-core or Resnet achieves comparable results.

An interesting phenomenon on the Wang dataset is that the network made of both PPI networks combined performs worse than each network taken separately. One explanation is that our model is sensitive to gene selection, and that by combining the two networks we lose the benefit of selecting the genes that were present in only one of them. It may also be due to the fact that, even though the two networks are complementary sources of protein-protein interactions, they contain different types of interactions that may not be compatible with one another: protein interactions are dynamically organised [HBH+04], and these dynamics may contradict the positive correlation that we try to introduce, since hubs cannot physically interact with all their neighbours simultaneously. [KLXG06] proposes a high-quality network that incorporates this dynamic factor and might therefore be used to address this issue.

Table 4.4: Main categories found by DAVID analysis of the classifiers trained on the Van’t Veer dataset with the parameters of Table 4.2.

KEGG (120 genes, 89 terms): catalytic activity, metabolic processes, physiological process
Reactome (117 genes, 141 terms): metabolic processes
Bioverse-Core (120 genes, 52 terms): different protein domains including TGF-β signaling
CCSB-HI1 (130 genes, 6 terms): protein binding, alternative splicing
Both PPI networks (235 genes, 55 terms): protein binding, anti-apoptosis, JAK-STAT signaling, cell proliferation, ATP, IL-2 receptor
Resnet (224 genes, 30 terms): protein binding, alternative splicing, regulation, cell differentiation
Coexp. network 1 (238 genes, 35 terms): protein binding, alternative splicing, metal binding
Coexp. network 2 (104 genes, 13 terms): protein binding
No network (2447 genes, 46 terms): transcription, cellular processes, alternative splicing, protein binding, negative regulation of cell proliferation, metal binding

4.4.2 Interpretation of the classifiers

In order to interpret the classifiers, we extracted from each one the 10% of genes with the largest weights and performed an analysis with DAVID [DSH+03] to retrieve the categories that were significantly represented among those genes. The results are described in Tables 4.4 and 4.5; we only kept the DAVID categories whose p-value was below 10^-4.
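The extraction step can be sketched as follows; beta and gene_ids are hypothetical names for the trained weight vector and the matching gene identifiers, and the sketch is an illustration rather than our exact implementation:

    import numpy as np

    def top_weighted_genes(beta, gene_ids, fraction=0.10):
        """Return the `fraction` of genes with the largest absolute
        weights in a trained linear classifier; the resulting list is
        what we submit to DAVID for enrichment analysis."""
        beta = np.asarray(beta)
        n_top = max(1, int(round(fraction * len(beta))))
        order = np.argsort(-np.abs(beta))  # decreasing |weight|
        return [gene_ids[i] for i in order[:n_top]]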
The disparities in the number of terms can be explained by the nature of the different gene networks. More terms tend to be extracted from metabolic networks such as KEGG and Reactome, because metabolic functions are represented in the ontologies as very small groups of genes, often fewer than ten, so one lit-up metabolic pathway corresponds to many terms. On the other hand, experimentally built networks, such as CCSB-HI1 and both co-expression networks, tend to describe relations between genes that are not functionally linked, and therefore produce classifiers in which only a few terms are found. In both cases, however, Bioverse-core and the PPI network built by combining the two others tend to pinpoint interesting terms that are more precise than those found with the classifier built without any a priori knowledge.

Table 4.5: Main categories found by DAVID analysis of the classifiers trained on the Wang dataset with the parameters of Table 4.3.

KEGG (115 genes, 63 terms): metabolic processes, disease mutation, purine, pyrimidine, metal binding
Reactome (116 genes, 72 terms): metabolic processes, protein sequencing, disease mutation, DNA polymerase
Bioverse-Core (121 genes, 39 terms): protein binding, mutagenesis site, ATP
CCSB-HI1 (127 genes, 8 terms): protein binding, direct protein sequencing
Both PPI networks (234 genes, 120 terms): disease mutation, blood, immune response
Resnet (224 genes, 78 terms): protein binding, cell cycle, mutagenesis site
Coexp. network 1 (240 genes, 9 terms): protein binding, direct protein sequencing, disease mutation
Coexp. network 2 (125 genes, 7 terms): protein binding, alternative splicing, disease mutation, regulation of apoptosis
No network (2228 genes, 131 terms): phosphorylation, cell cycle, transcription, mutagenesis site, protein binding, negative regulation of biological processes

For both datasets, the ATP pathway comes out as interesting, as pointed out in [DB04], and the Van’t Veer classifiers also show the importance of TGF-β, IL-2 (also known as TCGF) and JAK-STAT, as already noted by different studies [DHT+97, BBM86, BK02]. [BBM86] also suggested the importance of blood natural killer cell activity in breast cancer patients, which could explain the importance of blood-related terms in the classifier obtained on the Wang dataset with the combination of both PPI networks.

4.5 Discussion

We developed a method for the supervised classification of gene expression profiles. The introduction of a new regularisation term that takes into account the correlation between linked nodes of a gene network helps build a linear classifier that is closer to a priori known biological facts. Results on public datasets, using different types of gene networks, show that incorporating a gene network improves the error rate compared to classifiers that ignore it and that, on some datasets, given the right gene network, it may also reduce the misclassification rate compared to classifiers that take all the genes into account. Moreover, given the right network, the obtained classifier may be easier to interpret than standard classifiers.

The fact that the network-constrained classification function does not increase performance in all cases may be explained by the substantial information shortage of biological networks, which still have to be completed by further inference or biological experiments. The difficulty of finding a standard approach for matching the value of one probe to the expression of a gene may also complicate the task and contribute to these disappointing results.

Another issue in this study is normalization. Standard normalization algorithms either work per array (like MAS5 [mas]) or may perturb the gene correlations that are essential to the incorporation of any gene network (like gcRMA [WIG+04]). New normalization algorithms, such as MAS6, may prove more suitable, but they are still being tested and their efficiency has not yet been established by enough studies.

As the dimension of microarrays tends to grow exponentially, the use of methods more specific to biological data than standard analysis processes seems more and more essential. In this context, the introduction of gene network knowledge into the classification of gene expression profiles is a step in the right direction.
Conclusion

The different contributions of this thesis show that the incorporation of a priori knowledge is a promising approach for reducing the mathematical complexity of microarray analysis and for producing biological results that are easier to interpret.

However, even if they improve on existing techniques, the three new methodologies presented here should be seen as preliminary studies, since microarrays still seem to be subject to unexplained signal variations. These obscuring, unexplained variations are one of the main problems of microarray studies, as even sophisticated normalization algorithms are unable to remove them. Indeed, the rigorous mathematical hypotheses that underlie these methods do not seem to model correctly a biological reality that is much more complex than we would like it to be; so much more complex, in fact, that we seem unable to characterise the bias induced by these pre-treatment methods.

Another difficulty that is unavoidable for computational biologists is the lack of unification. As the domain is still far from mature, few standards have yet been agreed upon and, in most cases, whether it be simple protocols, identifier databases, file formats or even the definition of terms as simple as “gene”, it remains difficult to merge several works. However, different initiatives have been launched over the past couple of years, and we can expect this obstacle to become less and less important in the near future. Indeed, as biological experiments and new inference techniques, such as the one described by [BBV07], should greatly help the completion and curation of gene networks and of biological databases in general, our knowledge of biological phenomena will become more precise and easier to model.

Ideas to improve our methods could be found in a sharper modelling of gene interactions, such as distinguishing and incorporating positive and negative correlations, or even the dynamic orderings suggested by the behaviour of the different hubs. The use of dynamic information, such as that provided by time-series experiments, instead of static profiles should also provide a more precise understanding of the phenomena.

Another way to circumvent the imprecision of microarray data is to interpret the data as a collection of expression values for groups of genes instead of a collection of expression values for single genes. As the expression of each group of genes, or “module”, would be computed from more values (i.e., the expression value of each gene of the module), it would improve the statistical power and the confidence of the interpretation. It would also facilitate the understanding of the underlying biological phenomena, as it is far easier to identify the function of a group of genes than the function of a single protein. Enrichment analysis methods, such as Gene Set Enrichment Analysis [STM+05] or Gene Set Analysis [ET06], use this idea to provide a framework for analysis. We could derive such a method from the ones developed during this thesis by simply extracting modules from the different classifiers produced by our analysis techniques. In particular, the network-fused SVM algorithm, by providing a piecewise-constant solution, should make this extraction easier, even if the complexity of the gene networks makes it more complicated than it sounds; a sketch of such an extraction is given below.
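As an illustration only, here is a minimal sketch of that extraction, assuming a trained weight vector beta and a gene network given as an edge list (all names are hypothetical): genes that are connected in the network and carry (near-)identical weights in the piecewise-constant solution are grouped into the same module by a union-find pass over the edges.

    def extract_modules(beta, edges, tol=1e-6):
        """Group genes into modules: connected genes whose weights are
        equal up to `tol` in the piecewise-constant classifier end up in
        the same module. `edges` is an iterable of (u, v) index pairs."""
        parent = list(range(len(beta)))

        def find(i):
            while parent[i] != i:
                parent[i] = parent[parent[i]]  # path compression
                i = parent[i]
            return i

        for u, v in edges:
            if abs(beta[u] - beta[v]) <= tol:  # same plateau of the solution
                parent[find(u)] = find(v)

        modules = {}
        for g in range(len(beta)):
            modules.setdefault(find(g), []).append(g)
        # keep only modules selected by the classifier (non-zero weight)
        return [m for m in modules.values() if abs(beta[m[0]]) > tol]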
The development of methods that provide denser information, such as complete human genome microarrays for gene expression profiling or high-density aCGH, constitutes a new challenge for analysis processes, as the explosion of the dimensionality of the data will make the search for explicative profiles even more difficult. In this context, the introduction of methods more specific than the standard analysis algorithms, such as the incorporation of the a priori knowledge that we worked on during this thesis, will be even more necessary.

Bibliography

[Abe02] Shigeo Abe. Analysis of support vector machines. Proceedings of the 2002 12th IEEE Workshop on Neural Networks for Signal Processing, 2002.

[AED+00] Ash A. Alizadeh, Michael B. Eisen, R. Eric Davis, Chi Ma, Izidore S. Lossos, Andreas Rosenwald, Jennifer C. Boldrick, Hajeer Sabet, Truc Tran, Xin Yu, John I. Powell, Liming Yang, Gerald E. Marti, Troy Moore, James Hudson, Lisheng Lu, David B. Lewis, Robert Tibshirani, Gavin Sherlock, Wing C. Chan, Timothy C. Greiner, Dennis D. Weisenburger, James O. Armitage, Roger Warnke, Ronald Levy, Wyndham Wilson, Michael R. Grever, John C. Byrd, David Botstein, Patrick O. Brown, and Louis M. Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403(6769):503–511, February 2000.

[AJL+02] Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff, Keith Roberts, and Peter Walter. Molecular Biology of the Cell. Garland Science, March 2002.

[ANM93] A. N. Mohamed, J. A. Macoska, A. Kallioniemi, O.-P. Kallioniemi, F. Waldman, V. Ratanatharathorn, and S. R. Wolman. Extrachromosomal gene amplification in acute myeloid leukemia; characterization by metaphase analysis, comparative genomic hybridization, and semi-quantitative PCR, 1993.

[ARL+07] A. Andersson, C. Ritz, D. Lindgren, P. Eden, C. Lassen, J. Heldrup, T. Olofsson, J. Rade, M. Fontes, A. Porwit-MacDonald, M. Behrendtz, M. Hoglund, B. Johansson, and T. Fioretos. Microarray-based classification of a consecutive series of 121 childhood acute leukemias: prediction of leukemic and genetic subtype as well as of minimal residual disease status. Leukemia, 21(6):1198–1203, April 2007.

[Aro50] N. Aronszajn. Theory of reproducing kernels. Transactions of the American Mathematical Society, 68:337–404, 1950.

[BBM86] B. G. Brenner, S. Benarrosh, and R. G. Margolese. Peripheral blood natural killer cell activity in human breast cancer patients and its modulation by T-cell growth factor and autologous plasma. Cancer, 58(4):895–902, Aug 1986.

[BBR+05] Ekaterini Blaveri, Jeremy L. Brewer, Ritu Roydasgupta, Jane Fridlyand, Sandy DeVries, Theresa Koppie, Sunanda Pejavar, Kshama Mehta, Peter Carroll, Jeff P. Simko, and Frederic M. Waldman. Bladder cancer stage and outcome by array-based comparative genomic hybridization. Clin Cancer Res, 11(19):7012–7022, 2005.

[BBV07] Kevin Bleakley, Gerard Biau, and Jean-Philippe Vert. Supervised reconstruction of biological networks with local models. Bioinformatics, 23(13):i57–i65, Jul 2007.

[BDA+04] O. Babur, E. Demir, A. Ayaz, U. Dogrusoz, and O. Sakarya. Pathway activity inference using microarray data. Technical report, Bilkent Center for Bioinformatics (BCBI), 2004.

[Bel57] Richard Ernest Bellman. Dynamic Programming. Dover Publications, Incorporated, 1957.

[BGV92a] B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the 5th annual ACM workshop on Computational Learning Theory, pages 144–152. ACM Press, 1992.

[BGV92b] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA, 1992. ACM.

[BH95] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of the Royal Statistical Society B, 57:289–300, 1995.
[bio] http://www.biocarta.com.

[BK02] V. Boudny and J. Kovarik. Jak/Stat signaling pathways and cancer. Janus kinases/signal transducers and activators of transcription. Neoplasma, 49(6):349–355, 2002.

[BKPT05] Thomas Breslin, Morten Krogh, Carsten Peterson, and Carl Troein. Signal transduction pathway profiling of individual tumor samples. BMC Bioinformatics, 6(1):163, 2005.

[BLC+01] N. Bown, M. Lastowska, S. Cotterill, S. O’Neill, C. Ellershaw, P. Roberts, I. Lewis, A. D. Pearson, U.K. Cancer Cytogenetics Group, and the U.K. Children’s Cancer Study Group. 17q gain in neuroblastoma predicts adverse clinical outcome. U.K. Cancer Cytogenetics Group and the U.K. Children’s Cancer Study Group. Med Pediatr Oncol, 36(1):14–19, Jan 2001.

[BMC+00] M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor, N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden, J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts, and V. Sondak. Molecular classification of cutaneous malignant melanoma by gene expression profiling. Nature, 406(6795):536–540, Aug 2000.

[BR85] Michael J. Best and Klaus Ritter. Linear Programming: Active Set Analysis and Computer Programs. Prentice Hall, 1985.

[BSBD+04] Michael T. Barrett, Alicia Scheffer, Amir Ben-Dor, Nick Sampas, Doron Lipson, Robert Kincaid, Peter Tsang, Bo Curry, Kristin Baird, Paul S. Meltzer, Zohar Yakhini, Laurakay Bruhn, and Stephen Laderman. Comparative genomic hybridization using oligonucleotide microarrays and total genomic DNA. Proc Natl Acad Sci U S A, 101(51):17765–17770, Dec 2004.

[BV04a] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

[BV04b] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, March 2004.

[CDF05] D. Cavalieri and C. De Filippo. Bioinformatic methods for integrating whole-genome expression results into cellular networks. Drug Discov Today, 10(10):727–34, 2005.

[CDS98] S. S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61, 1998.

[CHTG05] Timothy W. Corson, Annie Huang, Ming-Sound Tsao, and Brenda L. Gallie. KIF14 is a candidate oncogene in the 1q minimal region of genomic gain in multiple cancers. Oncogene, 24(30):4741–4753, May 2005.

[Chu97] F. R. K. Chung. Spectral graph theory, volume 92 of CBMS Regional Conference Series. American Mathematical Society, Providence, 1997.

[CKR+04] Markus W. Covert, Eric M. Knight, Jennifer L. Reed, Markus J. Herrgard, and Bernhard O. Palsson. Integrating high-throughput and computational data elucidates bacterial networks. Nature, 429(6987):92–96, May 2004.

[CLL+07] Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee, and Trey Ideker. Network-based classification of breast cancer metastasis. Mol Syst Biol, 3, October 2007.

[Con00] The Gene Ontology Consortium. Gene ontology: tool for the unification of biology. Nat. Genet., 25(1):25–29, May 2000.

[COVP05] Keira R. Curtis, Matej Oresic, and Antonio Vidal-Puig. Pathways to the analysis of microarray data. Trends in Biotechnology, 23(8):429–435, 2005.

[CST00] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines: And Other Kernel-Based Learning Methods. Cambridge University Press, 2000.
[CTTC07] James Chen, Chen-An Tsai, ShengLi Tzeng, and Chun-Houh Chen. Gene selection with multiple ordering criteria. BMC Bioinformatics, 8(1):74, 2007.

[CV95] Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

[CWT+06] S-F Chin, Y Wang, N P Thorne, A E Teschendorff, S E Pinder, M Vias, A Naderi, I Roberts, N L Barbosa-Morais, M J Garcia, N G Iyer, T Kranjac, J F R Robertson, S Aparicio, S Tavare, I Ellis, J D Brenton, and C Caldas. Using array-comparative genomic hybridization to define molecular portraits of primary breast cancers. Oncogene, 26(13):1959–1970, September 2006.

[DB04] Lawrence R. Dearth and Rainer K. Brachmann. ATP, cancer and p53. Cancer Biol Ther, 3(7):638–640, Jul 2004.

[DHT+97] D. Donovan, J. H. Harmey, D. Toomey, D. H. Osborne, H. P. Redmond, and D. J. Bouchier-Hayes. TGF beta-1 regulation of VEGF production by breast cancer cells. Ann Surg Oncol, 4(8):621–627, 1997.

[DSH+03] Glynn Dennis, Brad Sherman, Douglas Hosack, Jun Yang, Wei Gao, H. Lane, and Richard Lempicki. DAVID: Database for Annotation, Visualization, and Integrated Discovery. Genome Biology, 4(9):R60, 2003.

[EDKG+05] Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan Domany. Outcome signature genes in breast cancer: is there a unique set? Bioinformatics, 21(2):171–178, 2005.

[EHJT04] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Ann. Stat., 32(2):407–499, 2004.

[ET06] Bradley Efron and Rob Tibshirani. On testing the significance of sets of genes. Technical report, Annals of Applied Statistics, 2006.

[FCD+00] T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski, M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16(10):906–914, Oct 2000.

[GAL+07] Alexander Genkin, David D. Lewis, and David Madigan. Large-scale Bayesian logistic regression for text categorization. Technometrics, 49(3):291–304, August 2007.

[GC05] Debashis Ghosh and Arul M. Chinnaiyan. Classification and selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2005(2):147–154, 2005. doi:10.1155/JBB.2005.147.

[GdBLC03] Timothy S. Gardner, Diego di Bernardo, David Lorenz, and James J. Collins. Inferring genetic networks and identifying compound mode of action via expression profiling. Science, 301(5629):102–105, 2003.

[GDS+05] Anatole Ghazalpour, Sudheer Doss, Sonal Sheth, Leslie Ingram-Drake, Eric Schadt, Aldons Lusis, and Thomas Drake. Genomic analysis of metabolic pathway gene expression in mice. Genome Biology, 6(7):R59, 2005.

[gen] http://www.genmapp.com.

[Ger05] Diane Gershon. DNA microarrays: More than gene expression. Nature, 437(7062):1195–1198, October 2005.

[GGNZ06] Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lofti Zadeh, editors. Feature Extraction, Foundations and Applications. Springer, 2006.

[GST+99] T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A. Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286(5439):531–537, 1999.

[GTL06] S. J. Galbraith, L. M. Tran, and J. C. Liao. Transcriptome network component analysis with limited microarray data. Bioinformatics, 22(15):1886–94, 2006.
[GVTS04] I. Gat-Viks, A. Tanay, and R. Shamir. Modeling and analysis of heterogeneous regulation in biological networks. J Comput Biol, 11(6):1034–49, 2004.

[GW97] Bernhard Ganter and Rudolf Wille. Formal Concept Analysis: Mathematical Foundations. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997. Translator: C. Franzke.

[Han86] Per C. Hansen. The truncated SVD as a method for regularization. Technical report, Stanford, CA, USA, 1986.

[HBH+04] Jing-Dong J. Han, Nicolas Bertin, Tong Hao, Debra S. Goldberg, Gabriel F. Berriz, Lan V. Zhang, Denis Dupuy, Albertha J. M. Walhout, Michael E. Cusick, Frederick P. Roth, and Marc Vidal. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430(6995):88–93, July 2004.

[HDS+03] D. Hosack, G. Dennis Jr., B. T. Sherman, H. C. Lane, and R. A. Lempicki. Identifying biological themes within lists of genes with EASE. Genome Biology, R70:1–7, 2003.

[HGO+07] Jian Huang, Arief Gusnanto, Kathleen O’Sullivan, Johan Staaf, Ake Borg, and Yudi Pawitan. Robust smooth segmentation approach for array CGH data analysis. Bioinformatics, 23(18):2463–2469, 2007.

[HKY99] Laurie J. Heyer, Semyon Kruglyak, and Shibu Yooseph. Exploring expression data: Identification and analysis of coexpressed genes. Genome Res., 9(11):1106–1115, November 1999.

[HST+04] Philippe Hupe, Nicolas Stransky, Jean-Paul Thiery, Francois Radvanyi, and Emmanuel Barillot. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004.

[HTF01] T. Hastie, R. Tibshirani, and J. Friedman. The elements of statistical learning: data mining, inference, and prediction. Springer, 2001.

[HW00] D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell, 100(1):57–70, Jan 2000.

[HZHS07] Blaise Hanczar, Jean-Daniel Zucker, Corneliu Henegar, and Lorenza Saitta. Feature construction from synergic pairs to improve microarray-based classification. Bioinformatics, Oct 2007.

[HZZL02] D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering of biological networks and gene expression data. Bioinformatics, 2002.

[IBC+03] Rafael A. Irizarry, Benjamin M. Bolstad, Francois Collin, Leslie M. Cope, Bridget Hobbs, and Terence P. Speed. Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res, 31(4):e15, Feb 2003.

[IML+07] Ahmed Idbaih, Yannick Marie, Carlo Lucchesi, Gaelle Pierron, Elodie Manie, Virginie Raynal, Veronique Mosseri, Khe Hoang-Xuan, Michele Kujas, Isabel Brito, Karima Mokhtari, Marc Sanson, Emmanuel Barillot, Alain Aurias, Jean-Yves Delattre, and Olivier Delattre. BAC array CGH distinguishes mutually exclusive alterations that define clinicogenetic subtypes of gliomas. Int J Cancer, Dec 2007.

[ITR+01] Trey Ideker, Vesteinn Thorsson, Jeffrey A. Ranish, Rowan Christmas, Jeremy Buhler, Jimmy K. Eng, Roger Bumgarner, David R. Goodlett, Ruedi Aebersold, and Leroy Hood. Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292(5518):929–934, 2001.

[JFG+04] Chris Jones, Emily Ford, Cheryl Gillett, Ken Ryder, Samantha Merrett, Jorge S. Reis-Filho, Laura G. Fulford, Andrew Hanby, and Sunil R. Lakhani. Molecular cytogenetic identification of subgroups of grade III invasive ductal breast carcinomas with different clinical outcomes. Clin Cancer Res, 10(18):5988–5997, 2004.

[Jol96] I. T. Jolliffe. Principal component analysis. Springer-Verlag, New-York, 1996.

[JTGV+05] G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D’Eustachio, E. Schmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R. Wu, L. Matthews, S. Lewis, E. Birney, and L. Stein. Reactome: a knowledgebase of biological pathways. Nucleic Acids Res, 33(Database issue):D428–32, 2005.
[KCF+06] P. Kharchenko, L. Chen, Y. Freund, D. Vitkup, and G. M. Church. Identifying metabolic enzymes with multiple types of association evidence. BMC Bioinformatics, 7:177, 2006.

[KCFH05] Balaji Krishnapuram, Lawrence Carin, Mario A. T. Figueiredo, and Alexander J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(6):957–968, 2005.

[KGH+06] Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F. Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki Katayama, Michihiro Araki, and Mika Hirakawa. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res, 34(Database issue):D354–D357, Jan 2006.

[KGK+04] M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic Acids Res., 32(Database issue):D277–80, Jan 2004.

[KHCF04] Balaji Krishnapuram, Alexander J. Hartemink, Lawrence Carin, and Mario A. T. Figueiredo. A Bayesian approach to joint feature selection and classifier design. IEEE Trans Pattern Anal Mach Intell, 26(9):1105–11, Sep 2004.

[KI05] R. Kelley and T. Ideker. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol., 23(5):561–566, May 2005.

[KLXG06] Philip M. Kim, Long J. Lu, Yu Xia, and Mark B. Gerstein. Relating three-dimensional structures to protein networks provides evolutionary insights. Science, 314(5807):1938–1941, December 2006.

[KOMK+05] P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky, P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and N. Lopez-Bigas. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res, 33(19):6083–9, 2005.

[Kon94] Igor Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182, 1994.

[KPV+06] M. Krull, S. Pistor, N. Voss, A. Kel, I. Reuter, D. Kronenberg, H. Michael, K. Schwarzer, A. Potapov, C. Choi, O. Kel-Margoulis, and E. Wingender. TRANSPATH: an information resource for storing and visualizing signaling pathways and their pathological aberrations. Nucleic Acids Res, 34(Database issue):D546–51, 2006.

[KR92] Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the ninth international workshop on Machine learning, pages 249–256, San Francisco, CA, USA, 1992. Morgan Kaufmann Publishers Inc.

[KVC04] P. Kharchenko, D. Vitkup, and G. M. Church. Filling gaps in a metabolic network using expression information. Bioinformatics, 20 Suppl 1:I178–I185, Aug 2004.

[LBY+03] J. C. Liao, R. Boscolo, Y. L. Yang, L. M. Tran, C. Sabatti, and V. P. Roychowdhury. Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad Sci U S A, 100(26):15522–7, 2003.

[LF96] Stephanie R. Land and Jerome H. Friedman. Variable fusion: A new adaptive signal regression method. Technical report, 1996.

[LL07] Caiyan Li and Hongzhe Li. Network-constrained regularization and variable selection for analysis of genomic data. UPenn Biostatistics Working Papers, 23, 2007.

[LNM+97] M. Lastowska, E. Nacheva, A. McGuckin, A. Curtis, C. Grace, A. Pearson, and N. Bown. Comparative genomic hybridization study of primary neuroblastoma tumors. United Kingdom Children’s Cancer Study Group. Genes Chromosomes Cancer, 18(3):162–169, Mar 1997.
[Mac67] J. B. MacQueen. Some methods of classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pages 281–297, 1967.

[mas] http://www.affymetrix.com/support/developer/.

[MBM+04] G. Mercier, N. Berthault, J. Mary, J. Peyre, A. Antoniadis, J.-P. Comet, A. Cornuejols, C. Froidevaux, and M. Dutreix. Biological detection of low radiation doses by combining results of two microarray analysis methods. Nucleic Acids Res., 32(1):e12, 2004.

[MBS05] Jason McDermott, Roger Bumgarner, and Ram Samudrala. Functional annotation from predicted protein interaction networks. Bioinformatics, 21(15):3217–3226, Aug 2005.

[MDM+01] G. Mercier, Y. Denis, P. Marc, L. Picard, and M. Dutreix. Transcriptional induction of repair genes during slowing of replication in irradiated Saccharomyces cerevisiae. Mutat. Res., 487(3-4):157–172, Dec 2001.

[MKH05] Stefan Michiels, Serge Koscielny, and Catherine Hill. Prediction of cancer outcome with microarrays: a multiple random validation strategy. The Lancet, 365(9458):488–492, 2005.

[MM01] Songrit Maneewongvatana and David M. Mount. The analysis of a probabilistic approach to nearest neighbor searching. In WADS ’01: Proceedings of the 7th International Workshop on Algorithms and Data Structures, pages 276–286, London, UK, 2001. Springer-Verlag.

[Moh97] B. Mohar. Some applications of Laplace eigenvalues of graphs. In G. Hahn and G. Sabidussi, editors, Graph Symmetry: Algebraic Methods and Applications, volume 497 of NATO ASI Series C, pages 227–275. Kluwer, Dordrecht, 1997.

[MOSS+04] Fuminori Matsumoto, Takeshi Obayashi, Yuko Sasaki-Sekimoto, Hiroyuki Ohta, Ken-ichiro Takamiya, and Tatsuru Masuda. Gene expression profiling of the tetrapyrrole metabolic pathway in Arabidopsis with a mini-array system. Plant Physiol, 135(4):2379–2391, Aug 2004.

[MS03] Jason McDermott and Ram Samudrala. Bioverse: Functional, structural and contextual annotation of proteins and proteomes. Nucleic Acids Res, 31(13):3736–3737, Jul 2003.

[NEGL+05] Georges Natsoulis, Laurent El Ghaoui, Gert R. G. Lanckriet, Alexander M. Tolley, Fabrice Leroy, Shane Dunlea, Barrett P. Eynon, Cecelia I. Pearson, Stuart Tugendreich, and Kurt Jarnagin. Classification of a large microarray data set: Algorithm comparison and analysis of drug signatures. Genome Res., 15(5):724–736, 2005.

[OBS+03] Ronan C. O’Hagan, Cameron W. Brennan, Andrew Strahs, Xuegong Zhang, Karuppiah Kannan, Melissa Donovan, Craig Cauwels, Norman E. Sharpless, Wing Hung Wong, and Lynda Chin. Array comparative genome hybridization for tumor classification and gene discovery in mouse models of malignant melanoma. Cancer Res, 63(17):5352–5356, Sep 2003.

[OVLW04] Adam B. Olshen, E. S. Venkatraman, Robert Lucito, and Michael Wigler. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostat, 5(4):557–572, 2004.

[Pav03] Paul Pavlidis. Using ANOVA for gene selection from microarray studies of the nervous system. Methods, 31(4):282–289, Dec 2003.

[PFG+03] Paola Parrella, Vito M. Fazio, Antonietta P. Gallo, David Sidransky, and Shannath L. Merbs. Fine mapping of chromosome 3 in uveal melanoma: Identification of a minimal region of deletion on chromosomal arm 3p25.1-p25.2. Cancer Res, 63(23):8507–8510, December 2003.
[PSS+98] D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel, C. Collins, W. L. Kuo, C. Chen, Y. Zhai, S. H. Dairkee, B. M. Ljung, J. W. Gray, and D. G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet, 20(2):207–211, Oct 1998.

[QXGY06] Xing Qiu, Yuanhui Xiao, Alexander Gordon, and Andrei Yakovlev. Assessing stability of gene selection in microarray data analysis. BMC Bioinformatics, 7(1):50, 2006.

[RDML04] J. Rahnenfuhrer, F. S. Domingues, J. Maydt, and T. Lengauer. Calculating the statistical significance of changes in pathway activity from gene expression data. Statistical Applications in Genetics and Molecular Biology, 3(1):Article 16, 2004.

[RLS+05] O. Radulescu, S. Lagarrigue, A. Siegel, M. Le Borgne, and P. Veber. Topology and static response of interaction networks in molecular biology. J. R. Soc. Interface, published online, 2005.

[RSA+07] Fidel Ramirez, Andreas Schlicker, Yassen Assenov, Thomas Lengauer, and Mario Albrecht. Computational analysis of human protein interaction networks. Proteomics, 7(15):2541–2552, Aug 2007.

[RTV+05] Daniel R. Rhodes, Scott A. Tomlins, Sooryanarayana Varambally, Vasudeva Mahavisno, Terrence Barrette, Shanker Kalyana-Sundaram, Debashis Ghosh, Akhilesh Pandey, and Arul M. Chinnaiyan. Probabilistic model of the human protein-protein interaction network. Nat Biotech, 23(8):951–959, August 2005.

[RVH+05] Jean-Francois Rual, Kavitha Venkatesan, Tong Hao, Tomoko Hirozane-Kishikawa, Amelie Dricot, Ning Li, Gabriel F. Berriz, Francis D. Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou, Niels Klitgord, Christophe Simon, Mike Boxem, Stuart Milstein, Jennifer Rosenberg, Debra S. Goldberg, Lan V. Zhang, Sharyl L. Wong, Giovanni Franklin, Siming Li, Joanna S. Albala, Janghoo Lim, Carlene Fraughton, Estelle Llamosas, Sebiha Cevik, Camille Bex, Philippe Lamesch, Robert S. Sikorski, Jean Vandenhaute, Huda Y. Zoghbi, Alex Smolyar, Stephanie Bosak, Reynaldo Sequerra, Lynn Doucette-Stamm, Michael E. Cusick, David E. Hill, Frederick P. Roth, and Marc Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173–1178, October 2005.

[RZD+07] Franck Rapaport, Andrei Zinovyev, Marie Dutreix, Emmanuel Barillot, and Jean-Philippe Vert. Classification of microarray data using gene networks. BMC Bioinformatics, 8(1):35, 2007.

[Saa96] Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 1996.

[Shl05] Jonathon Shlens. A tutorial on principal component analysis. Technical report, 2005.

[SK97] M. Sikonja and I. Kononenko. An adaptation of Relief for attribute estimation in regression, 1997.

[Smi93] Murray Smith. Neural Networks for Statistical Modeling. John Wiley & Sons, Inc., New York, NY, USA, 1993.

[SMO+03] P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang, D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res, 13(11):2498–504, 2003.

[SMR+03] Danielle C. Shing, Dominic J. McMullan, Paul Roberts, Kim Smith, Suet-Feung Chin, James Nicholson, Roger M. Tillman, Pramila Ramani, Catherine Cullinane, and Nicholas Coleman. FUS/ERG gene fusions in Ewing’s tumors. Cancer Res, 63(15):4568–4576, August 2003.

[SPdM+94] M. R. Speicher, G. Prescher, S. du Manoir, A. Jauch, B. Horsthemke, N. Bornfeld, R. Becher, and T. Cremer. Chromosomal gains and losses in uveal melanomas detected by comparative genomic hybridization. Cancer Res., 54(14):3817–23, July 1994.
[SS02] B. Schölkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, Cambridge, MA, 2002.

[SSKK03] J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A gene-coexpression network for global discovery of conserved genetic modules. Science, 302(5643):249–55, 2003.

[SSM99] Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert Müller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327–352, 1999.

[STC00] John Shawe-Taylor and Nello Cristianini. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press, 2000.

[STC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

[STM+05] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci U S A, 102(43):15545–15550, Oct 2005.

[Stu99] J. F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and Software, 11–12:625–653, 1999. Special issue on Interior Point Methods (CD supplement with software).

[STV04] B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in Computational Biology. MIT Press, 2004.

[SVR+06] Nicolas Stransky, Celine Vallot, Fabien Reyal, Isabelle Bernard-Pierrot, Sixtina Gil Diez de Medina, Rick Segraves, Yann de Rycke, Paul Elvin, Andrew Cassidy, Carolyn Spraggon, Alexander Graham, Jennifer Southgate, Bernard Asselain, Yves Allory, Claude C. Abbou, Donna G. Albertson, Jean Paul Thiery, Dominique K. Chopin, Daniel Pinkel, and Francois Radvanyi. Regional copy number-independent deregulation of transcription in cancer. Nat Genet, 38(12):1386–1396, December 2006.

[SWL+05] Ulrich Stelzl, Uwe Worm, Maciej Lalowski, Christian Haenig, Felix H. Brembeck, Heike Goehler, Martin Stroedicke, Martina Zenkner, Anke Schoenherr, Susanne Koeppen, Jan Timm, Sascha Mintzlaff, Claudia Abraham, Nicole Bock, Silvia Kietzmann, Astrid Goedde, Engin Toksoz, Anja Droege, Sylvia Krobitsch, Bernhard Korn, Walter Birchmeier, Hans Lehrach, and Erich E. Wanker. A human protein-protein interaction network: A resource for annotating the proteome. Cell, 122:957–968, 2005.

[SYDM05] A. Y. Sivachenko, A. Yuriev, N. Daraselia, and I. Mazo. Identifying local gene expression patterns in biomolecular networks. Proceedings of 2005 IEEE Computational Systems Bioinformatics Conference, Stanford, California, 2005.

[SZEK07] Gunnar Schramm, Marc Zapatka, Roland Eils, and Rainer Koenig. Using gene expression data and network topology to detect substantial pathways, clusters and switches during oxygen deprivation of Escherichia coli. BMC Bioinformatics, 8:149, 2007.

[THH+08] Julien Trolet, Philippe Hupe, Isabelle Huon, Ingrid Lebigot, Pascale Mariani, Corine Plancher, Bernard Asselain, Laurence Desjardins, Olivier Delattre, Xavier Sastre-Garau, Jean-Paul Thiery, Simon Saule, Sophie Piperno-Neumann, Emmanuel Barillot, and Jerome Couturier. Genomic profiling and identification of high risk tumors in uveal melanoma by array-CGH analysis of primary tumors and liver metastases. Submitted to Cancer Res, 2008.
[Tib96] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc. B., 58(1):267–288, 1996.

[Tib97] R. Tibshirani. The lasso method for variable selection in the Cox model. Stat Med, 16(4):385–395, Feb 1997.

[TK01] R. Thomas and M. Kaufman. Multistationarity, the basis of cell differentiation and memory. II. Logical analysis of regulatory networks in terms of feedback circuits. Chaos, 11(1):180–195, 2001.

[Tod02] Michael J. Todd. The many facets of linear programming. Mathematical Programming, 91(3):417–436, February 2002.

[TPH+01] Frank Tschentscher, Gabriele Prescher, Douglas E. Horsman, Valerie A. White, Harald Rieder, Gerasimos Anastassiou, Harald Schilling, Norbert Bornfeld, Karl Ulrich Bartz-Schmidt, Bernhard Horsthemke, Dietmar R. Lohmann, and Michael Zeschnigk. Partial deletions of the long and short arm of chromosome 3 point to two tumor suppressor genes in uveal melanoma. Cancer Res, 61(8):3439–3442, April 2001.

[TSR+05] Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B, 67(1):91–108, 2005.

[TW07] Robert Tibshirani and Pei Wang. Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics, May 2007.

[Vap98] Vladimir Naumovich Vapnik. Statistical Learning Theory. Wiley, 1998.

[vBN06] Erik van Beers and Petra Nederlof. Array-CGH and breast cancer. Breast Cancer Research, 8(3):210, 2006.

[VDS+07] Imre Vastrik, Peter D’Eustachio, Esther Schmidt, Geeta Joshi-Tope, Gopal Gopinath, David Croft, Bernard de Bono, Marc Gillespie, Bijay Jassal, Suzanna Lewis, Lisa Matthews, Guanming Wu, Ewan Birney, and Lincoln Stein. Reactome: a knowledge base of biologic pathways and processes. Genome Biol, 8(3):R39, 2007.

[vHGW+01] J. van Helden, D. Gilbert, L. Wernisch, M. Schroeder, and S. J. Wodak. Application of regulatory sequence analysis and metabolic network analysis to the interpretation of gene expression data. In JOBIM ’00: Selected papers from the First International Conference on Computational Biology, Biology, Informatics, and Mathematics, pages 147–164, London, UK, 2001. Springer-Verlag.

[VK03] J. P. Vert and M. Kanehisa. Extracting active pathways from gene expression data. Bioinformatics, 19 Suppl 2:II238–II244, 2003.

[VRVB+02] Nadine Van Roy, Jo Vandesompele, Geert Berx, Katrien Staes, Mireille Van Gele, Els De Smet, Anne De Paepe, Genevieve Laureys, Pauline van der Drift, Rogier Versteeg, Frans Van Roy, and Frank Speleman. Localization of the 17q breakpoint of a constitutional 1;17 translocation in a patient with neuroblastoma within a 25-kb segment located between the ACCN1 and TLK2 genes and near the distal breakpoints of two microdeletions in neurofibromatosis type 1 patients. Genes, Chromosomes and Cancer, 35(2):113–120, 2002.

[vtVDvdV+02] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A. Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton, A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts, P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871):530–536, January 2002.

[War63] Joe H. Ward. Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association, 58:236–244, 1963.

[WCK+91] Frederic M. Waldman, Peter R. Carroll, Russell Kerschmann, Michael B. Cohen, Frederick G. Field, and Brian H. Mayall. Centromeric copy number of chromosome 7 is strongly correlated with tumor grade and labeling index in human bladder cancer. Cancer Res, 51(14):3807–3813, July 1991.
[WIG+04] Zhijin Wu, Rafael A. Irizarry, Robert Gentleman, Francisco Martinez-Murillo, and Forrest Spencer. A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association, 99:909–917, December 2004.

[WKZ+05] Yixin Wang, Jan G. M. Klijn, Yi Zhang, Anieta M. Sieuwerts, Maxime P. Look, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E. Meijer van Gelder, Jack Yu, Tim Jatkoe, Els M. J. J. Berns, David Atkins, and John A. Foekens. Gene-expression profiles to predict distant metastasis of lymph-node-negative primary breast cancer. Lancet, 365(9460):671–679, 2005.

[Wri87] Stephen J. Wright. Primal-Dual Interior-Point Methods. Society for Industrial Mathematics, 1987.

[YMH+07] Xifeng Yan, Michael R. Mehan, Yu Huang, Michael S. Waterman, Philip S. Yu, and Xianghong Jasmine Zhou. A graph-based approach to systematically reconstruct human transcriptional regulatory modules. Bioinformatics, 23(13):i577–i586, Jul 2007.

[YMK+06] Anton Yuryev, Zufar Mulyukov, Ekaterina Kotelnikova, Sergei Maslov, Sergei Egorov, Alexander Nikitin, Nikolai Daraselia, and Ilya Mazo. Automatic pathway building in biological association networks. BMC Bioinformatics, 7:171, 2006.

[YWF+06] Jun Yao, Stanislawa Weremowicz, Bin Feng, Robert C. Gentleman, Jeffrey R. Marks, Rebecca Gelman, Cameron Brennan, and Kornelia Polyak. Combined cDNA array comparative genomic hybridization and serial analysis of gene expression analysis of breast tumor progression. Cancer Res, 66(8):4065–4078, 2006.

[YY06] James J. Yang and Mark C. K. Yang. An improved procedure for gene selection from microarray experiments using false discovery rate criterion. BMC Bioinformatics, 7:15, 2006.

[ZRHT03] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines, 2003.

[ZRHT04] J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors, Adv. Neural. Inform. Process Syst., volume 16, Cambridge, MA, 2004. MIT Press.