DOCTORAL THESIS OF
UNIVERSITÉ PIERRE ET MARIE CURIE

Speciality:
Biochemistry and Molecular Biology
(Doctoral school)

Presented by
Mr Franck Rapaport

To obtain the degree of
DOCTOR of UNIVERSITÉ PIERRE ET MARIE CURIE

Thesis subject:
Introduction of a priori knowledge into the analysis of DNA microarrays

Before a jury composed of:
Dr Gérard Biau
Dr Mark Van de Wiel
Dr Christophe Ambroise
Dr Stéphane Robin
Dr Emmanuel Barillot
Dr Jean-Philippe Vert
Contents

Acknowledgements

Abstract

Résumé

1 Background
  1.1 Microarray analysis
    1.1.1 The cancerous disease
    1.1.2 CGH arrays
    1.1.3 Gene expression arrays
  1.2 The classification problem
    1.2.1 Unsupervised classification
    1.2.2 Supervised classification
  1.3 The curse of dimensionality
    1.3.1 Pre-processing methods
    1.3.2 Wrapper methods
  1.4 Contributions of this thesis
    1.4.1 Spectral analysis of gene expression profiles
    1.4.2 Supervised classification of aCGH data using fused L1 SVM
    1.4.3 Supervised classification of gene expression profiles using network-fused SVM

2 Spectral analysis
  2.1 Background
  2.2 Methods
    2.2.1 Overview of the method
    2.2.2 Spectral decomposition of gene expression profiles
    2.2.3 Deriving a metric for expression profiles
    2.2.4 Supervised learning and regression
  2.3 Data
  2.4 Results
    2.4.1 Unsupervised classification
    2.4.2 PCA analysis
    2.4.3 Supervised classification
    2.4.4 Interpretation of the SVM classifier
  2.5 Discussion

3 Fused SVM
  3.1 Introduction
  3.2 Methods
    3.2.1 ArrayCGH data
    3.2.2 Classification of arrayCGH data
    3.2.3 Linear supervised classification
    3.2.4 Fused lasso
    3.2.5 Fused SVM
    3.2.6 Implementation of the fused SVM
  3.3 Data
  3.4 Results
    3.4.1 Bladder tumors
    3.4.2 Melanoma tumors
  3.5 Discussion

4 Network-fused SVM
  4.1 Introduction
  4.2 Methods
    4.2.1 Usual linear supervised classification method
    4.2.2 Fusion and fused classification
    4.2.3 Network-fused classification
    4.2.4 Implementation
  4.3 Data
    4.3.1 Expression data sets
    4.3.2 Gene networks
  4.4 Results
    4.4.1 Performance
    4.4.2 Interpretation of the classifiers
  4.5 Discussion

Conclusion

Bibliography
List of Figures

1.1 Example of arrayCGH results
1.2 Example of arrays before and after a normalization
1.3 SVM in a separable case
1.4 SVM in a non-separable case
1.5 The hinge loss function
2.1 Decomposition of a gene expression profile
2.2 Example of Laplacian eigenvectors
2.3 Unsupervised classification results for the first method
2.4 PCA plot using the first method
2.5 Supervised classification results using the first method
2.6 Representation of the classifiers obtained using the first method
2.7 Glycolysis/gluconeogenesis pathways
2.8 Pyrimidine metabolism pathways
3.1 Bladder cancer dataset with grade classification
3.2 Bladder cancer dataset with stage classification
3.3 Uveal melanoma dataset
4.1 Performance on Van’t Veer data set
4.2 Performance on Wang data set
List of Tables

4.1 Characteristics of the different networks
4.2 Performance on Van’t Veer data set
4.3 Performance on Wang data set
4.4 Main categories of the Van’t Veer classifiers
4.5 Main categories of the Wang classifiers
Acknowledgements

I would like to apologize in advance to everyone I forget to thank. If you think you should be on this page but are not, it is because I was tired. I warmly thank my two thesis advisors (even though only one of them is official, I can confirm there are two of them), Jean-Philippe and Emmanuel. From working with them I learned an enormous amount, both scientifically and professionally in general. Thanks to them, I now know that it is possible to finish a paper one hour before the deadline, even when you are still finding errors in your scripts a few days earlier. On that subject, I am very grateful to them for not hitting me too hard.

I thank everyone who helped me in my research during these three years and a few months. Thanks to Marie Dutreix and above all (above all!) to Andrei Zinovyev for their contribution to the project. Thanks to the two Pierres, to Nicolas R and Philippe H, who did not complain too much when I came to ask them silly questions about statistics (or other things!). Thanks to Séverine and Sabrinette, who always answered my questions about DNA microarrays with good humour. Thanks to Christina Leslie for saying nothing when I was working on my thesis instead of working on what she was paying me for. A big thank you to Laurent and Anne for proofreading the introduction of my thesis; it badly needed it.

I thank Gérard Biau, Mark Van de Wiel, Christophe Ambroise and Stéphane Robin for doing me the honour of agreeing to be part of my jury. I thank them in particular for the pertinence of their remarks during the question session.

I also thank my various colleagues at the Mines de Paris and at the Institut Curie for their support and their good humour. I thank the whole oval office (Adil, Patrick, Stef, Séverine, Sabrinette, as well as the temporary members: Fanny, Perrine, Anne and Amélie) for their good jokes and the good atmosphere of my first two years of thesis. I thank Caroline and the members of CBio (Véro, Christian, Pierre, Martial, Misha and Brice) for all the friendship they showed me during my -rare- escapades to Fontainebleau. A very, very big thank you to the herd (Patrick, Fanny, Laurence, Gautier, Laurent, as well as the two latest additions, Fantine and Anne) for having brightened my last year. We can go back to Leysin to drink Grand Marnier whenever you like.

I also insist on thanking the systems office enormously. Thank you Gautier for your chocolate, your fondue and your 80s songs. Thank you Laurent for your rap, your visible underwear and your terrible jokes. Thank you PCC for the way you eat mashed potatoes, your spitting on passers-by and your love of caipirinha. I particularly thank Laurence Calzone for her friendship.

Of course I thank my parents. For their emotional (and financial) support, but also because we must face the truth: if, during all these years, they had not kept me from playing video games so that I would work, I would not be putting the final touches on this thesis but mugging old ladies to buy my dose of crack. I also thank my brother and my sister, always there to support me with their love and their scatological jokes. I also thank the rest of my family, especially my grandparents, for all their affection. I want to tell my two grandmothers that, from the other side of the Atlantic, I miss their cooking.

Finally, I thank my friends, the real ones, those who have always been there to listen to my whining: Flou, Patou, PE, Damien and Anais, thank you. On the other hand, I do not thank Facebook, which has thoroughly ruined many of my working days.
Abstract
While gene expression arrays and array-based comparative genomic hybridization (arrayCGH) have become standard techniques for collecting numerical measurements of the expression disorders and copy number aberrations related to the cancerous disease, experimental results are still difficult to analyse. Indeed, not only are scientists confronted with the curse of dimensionality, the high dimension of the data compared to the low count of samples, but they are also battling with the difficulty of relating numerical results to biological phenomena.
A solution to these issues is the incorporation into the analysis process of the “a priori” information that we have about the different biological relations that underlie our data. Different methods have been constructed for the analysis of microarray data, but they either used this information only through heuristics or did not use it at all. In this thesis, we propose three new methods, built on solid mathematical bases, for the introduction of a priori knowledge into the analysis process.
The first method is a dimension reduction technique used to incorporate
gene network knowledge into gene expression analysis. The approach is based
on the spectral decomposition of gene expression profiles with respect to the
eigenfunctions of the graph, resulting in an attenuation of the high-frequency
components of the profiles with respect to the topology of the graph. This
method can be used for unsupervised and supervised classification. It results in
classifiers with more biological relevance than typical classifiers. We illustrate
the method with the analysis of a set of expression profiles from irradiated and
non-irradiated yeast strains.
The second method is a supervised classification method for arrayCGH profiles. The algorithm introduces two biological realities directly into the regularization terms of the classification problem: the strong interdependency of probes at neighbouring chromosomal positions, and the expected sparsity of the classifier, which should focus on specific genomic aberrations in the arrayCGH profiles. This method is illustrated on three different classification problems, spanning two different data sets.
The third method is a supervised classification method for gene expression profiles. The approach introduces gene network knowledge into the classification problem by adding a regularization term corresponding to the positive correlation between connected nodes of the graph associated with the gene network. This algorithm is tested on two different gene expression profile sets with eight different gene networks of four different categories.
Résumé
While DNA microarrays, whether expression arrays or comparative genomic hybridization arrays, have become standard tools for producing numerical measurements of the genetic disorders associated with cancer, their analysis remains a complicated task. Indeed, the different methods face two major problems: on the one hand, the very high dimension of the data compared with the small number of samples, and on the other hand, the difficulty of establishing a correspondence between these numerical data and the underlying biological phenomena.

One proposed solution is to incorporate into the numerical analysis our “a priori” knowledge of different biological relations, but the classification techniques, supervised or not, used so far either did not integrate this information or incorporated it into existing methods through heuristics. In this thesis, we propose three new methods for DNA microarray analysis, based on solid mathematical concepts, that integrate our a priori knowledge of the correlations underlying the problem.

The first methodology we propose uses metabolic network data for the analysis of gene expression profiles. This approach is based on the spectral decomposition of the network using the Laplacian matrix of the associated graph. The microarray data are projected onto the basis of the space of functions over the genes formed by this spectral decomposition. Considering that biologically coherent functions should vary smoothly over the graph, that is, that the expressions of two genes connected by an edge of the graph should have close values, we can apply a filter to attenuate the high-frequency components of the expression profiles. We then apply standard unsupervised and supervised classification algorithms to obtain decision functions that are easier to interpret. These algorithms were applied to public data sets to discriminate between expression profiles of weakly irradiated and non-irradiated yeasts. The interpretation of the classifiers suggests new directions for biological research.

The second proposed approach is a new supervised classification method for comparative genomic hybridization (arrayCGH) data. This approach is based on the usual classification problem, modified to integrate a double regularization constraint that reflects two biological realities: the fact that two successive measurements on the same chromosome are very likely to belong to the same region of genomic alteration, and the low rate of these alterations. This method is applied to three cancer-related classification problems concerning two different data sets. We then obtain classification functions that are both more efficient and more easily interpretable than those obtained with the usual supervised classification methods.

The last method is another way of introducing the correlation between connected genes of a network into the supervised classification of expression profiles. To do so, we added to the classical L1-regularized support vector machine problem a regularization term that reflects our wish to assign similar weights in the decision function to two genes connected in the network. This approach is tested on two public cancer-related data sets with eight gene networks of four different types (metabolic, protein-protein interaction, influence and coexpression).
Chapter 1
Background
In this preliminary chapter, we discuss the different issues underlying this thesis. We start by giving a brief overview of cancer and of how the specificities of this disease led to the wide use of microarrays to monitor tumors. The following section is dedicated to microarray analysis techniques, with a particular focus on the supervised classification problem. We then see how the difficulties associated with this problem can be reduced by incorporating “a priori” knowledge into the analysis process, and review previous attempts to do so. Finally, the last section of this chapter summarizes our contributions to the problem and gives a quick overview of this thesis.
1.1 Microarray analysis for the study of the cancerous disease
This section aims at giving nonspecialists a quick overview of the concepts underlying the use of microarrays for the study of the cancerous disease. The first subsection gives a precise definition of cancer and explains how it is related to mutations and abnormal gene behaviour. The two following subsections explain how specific types of abnormal gene behaviour can be monitored by two types of microarrays: gene expression arrays and comparative genomic hybridization arrays (also known as CGH arrays). These subsections include an overview of each technology for non-biologists, as well as the standard analysis processes and the related specific issues.
1.1.1 The cancerous disease
It is generally accepted that cancerous cells are cells that, through mutations, have developed certain capacities that allow uncontrolled growth. [HW00] propose a list of these capacities:
• Self-sufficiency in growth signals: normal cells require specific signals from other cells before they can proliferate. These signals are transmitted into the cells by receptors that bind distinctive classes of signaling molecules. We do not know any type of normal cell that can proliferate in the absence of such signals. Tumor cells generate many of their own growth signals, thereby reducing their dependency on stimulation from their environment.
• Insensitivity to growth-inhibitory (anti-growth) signals: within a normal tissue, multiple anti-proliferative signals operate. These signals that block proliferation include both soluble growth inhibitors and immobilized inhibitors embedded on the surfaces of nearby cells. Cancer cells must evade these anti-proliferative signals if they are to prosper.
• Evasion of programmed cell death (apoptosis): the ability of a tumor cell population to expand in number is determined not only by the rate of cell proliferation but also by the rate of cell disappearance. Programmed cell death, apoptosis, represents a major source of this attrition. Observations indicate that the apoptotic program is present in latent form in virtually all cell types throughout the body. Once triggered by a variety of physiological signals, this program unfolds in a precise series of steps. Different examples have established the consensus that apoptosis is a major barrier to cancer that must be circumvented.
• Limitless replicative potential: many and perhaps all types of mammalian cells carry an intrinsic program that limits their multiplication. This program appears to operate independently of the cell-to-cell signaling pathways involved in the above capacities. It too must be disrupted in order for a clone of cells to expand to a size that constitutes a macroscopic tumor.
• Sustained angiogenesis: the oxygen and nutrients supplied by vasculature
are crucial for cell function and survival, obligating virtually all cells in
a tissue to reside within a small distance of a capillary blood vessel. In
order to progress to a large size, tumors must develop angiogenic ability,
which is the ability to provoke blood vessel growth.
• Tissue invasion and metastasis: sooner or later during the development of most types of human cancer, primary tumor masses spawn pioneer cells that move out, invade adjacent tissues and thence travel to distant sites where they may succeed in founding new colonies. These distant settlements of tumor cells are called metastases. The capability for invasion and metastasis enables cancer cells to escape the primary tumor mass and colonize new terrain in the body where, at least initially, nutrients and space are not limiting. Even if cells can be considered cancerous without this ability to metastasize, most cancerous cells will acquire it during their development.
Only a cell that has acquired each and every one of these capacities will be able to grow chaotically and without any constraint, and will therefore be considered cancerous. A malignant cell therefore suffers from a perturbed functioning of its proteins, which causes all these capacities to be active.

Figure 1.1: Example of arrayCGH results (log2 scale). This picture depicts genomic events occurring on chromosome 18, among which a loss occurring on the q arm. This image has been extracted from [BSBD+04].
An enabling characteristic for these capacities is the high genomic instability of cancer cells. Due to the efficiency of the cellular processes used to maintain genomic integrity, mutations are rare events in a normal cell. However, cancer cells have at some point escaped this protection process and suffer from multiple mutations, so many, in fact, that they would otherwise be highly unlikely to occur within a human time span. Examples of these mutations include hyperactivity of oncogenes, which are genes that activate chaotic cell proliferation, such as Myc or Abl, and deletion of tumor-suppressor genes such as p53. This genomic instability can be seen as a seventh capacity of cancer. However, as it is more a prerequisite capability, allowing the other capabilities to be acquired, than a characteristic of uncontrolled cell growth itself, the authors did not include it in the list.
These characteristics can be acquired through large mutations, either aneuploidy of entire chromosome arms or gain or loss of smaller portions of chromosomes (from a few hundred to a few million base pairs), which can be seen with CGH arrays, or through other mechanisms such as local mutations (a change of base), translocations, viral insertions, etc. As these last changes cannot be seen with CGH arrays, the adequate analysis tool is gene expression profiling, which corresponds to an indirect analysis of the impacted protein production. In the two following sections, we discuss these two microarray techniques.
1.1.2 CGH arrays
During cell division, a cell must replicate its entire genome. Several types of chromosome alterations may occur during this process. Regions of deoxyribonucleic acid (DNA) can be multiplied (resulting in a gain) or, on the contrary, deleted (resulting in a loss). Healthy cells maintain different mechanisms to correct and prevent this instability, but if one of these changes goes undetected or perturbs the correction mechanism, the cell may survive in this altered state. These changes may be responsible for one or several of the acquired capacities described in the previous section. Ewing’s tumors, for example, are known to present characteristic gains in chromosomes 5, 8 or 12 [SMR+03].
CGH is a powerful molecular tool for analyzing copy number changes (gains or losses) in the DNA content of a given subject, and especially in tumor cells. The method is based on hybridization, the formation of molecular links between a genetic sequence and its complementary sequence, of the DNA of interest (often tumor DNA) and of normal DNA to a human preparation. Using fluorescence microscopy and quantitative image analysis, regional gains or losses compared to control DNA can be detected and used to identify copy number aberrations (CNAs) in the genome. Figure 1.1 shows the result of a typical CGH array experiment.
Originally, this instability was measured with chromosomal CGH [ANM93], a technology that used entire chromosomes for hybridization purposes but whose resolution was quite low. Recent improvements in the resolution and sensitivity of CGH allowed the elaboration of microarray-based CGH (also called arrayCGH or CGH array) [PSS+98], which uses probes, small portions of the genome arrayed on silicon, instead of entire chromosomes.
1.1.3 Gene expression arrays
Another interesting piece of information about a cancerous cell is the expression of specific genes. The expression of a gene is the quantity of the corresponding messenger ribonucleic acid (mRNA) produced, the intermediary molecule between the DNA and the protein, which is correlated with the quantity of protein produced.

A gene expression microarray, which can also be called a DNA chip, is a collection of microscopic DNA spots, known as probes, arrayed on a solid surface, each one of them mapping a particular transcribed region of the genome. These probes, usually tens of thousands of them, are used to measure the relative quantities of specific mRNAs produced by the studied cell. For this purpose, contact is made between the array and mRNA extracted from the sample. The intensity of the hybridized DNA fluorescence can then be optically measured and gives an estimate of the relative quantity of the mRNA of interest in the sample. In an error-free scenario, this intensity is proportional to the true number of transcripts present in the sample.
There are two main types of DNA chips:
• Spotted microarrays: the probes are either long or short fragments of DNA (amplified by cloning or polymerase chain reaction (PCR)). This type of array is typically hybridized with complementary DNA (cDNA) from two samples to be compared, one of which is often a control tissue. These two samples are marked with two different fluorophores (red and green). They are mixed and hybridized on the same microarray. A scanner then visualizes the fluorescence of each fluorophore. Relative intensities of the colors are used to identify up- and down-regulated genes. Absolute levels of gene expression cannot be determined, but relative differences among different genes can be estimated. This type of microarray is rarely used nowadays in cancer research.
• Oligonucleotide microarrays: the probes are designed to match part of the sequence of known or predicted mRNAs. The probes are either 50- to 60-mers (on long oligonucleotide arrays) or 25- to 30-mers (on short oligonucleotide arrays). Companies such as Affymetrix or Agilent propose commercial microarrays that span the entire genome. Affymetrix microarrays give estimations of the absolute value of gene expression levels; therefore, the comparison of two conditions requires the use of two separate microarrays. In contrast, Agilent microarrays provide the same kind of information as spotted microarrays. Oligonucleotide microarrays often contain control probes designed to hybridize with known amounts of specific RNA transcripts called RNA spike-ins. These control probes are used to calibrate the expression level measurements.
Unfortunately, dealing with experiments that involve multiple microarrays requires a pre-processing of the gene expression profiles: the normalization.

Microarrays are subject to two types of variation: interesting variations and obscuring variations. Interesting variations are biological differences, such as large differences in the expression levels of specific genes between a diseased and a normal tissue sample. However, observed expression levels also include variations introduced during the experimental process. These variations may be related to differences in sample preparation, in the production of the arrays or in the processing of the arrays. Normalization is aimed at dealing with these obscuring variations (see, for example, Figure 1.2).
In gene expression profiles, these obscuring variations have different sources. A first source of variation is the dye bias (or pin-tip bias in the case of spotted arrays): the relationship between gene expression and spot intensity may not be the same for different dyes (or spots), and therefore, for a given concentration of mRNA, the light intensity may differ. Another source of variation is related to spatial effects: due to a defective pin tip (portion of the microarray) or to a bad position of the array during hybridization, the spatial density of the signal may not be uniform.
Therefore, the first step of the analysis process will be the normalization. The user needs to make a hypothesis on the value distribution of the arrays and/or the value distribution of the genes, depending on the features to compare (entire arrays or the expression distribution of single genes). Per-array normalization techniques include global normalization, Lowess (sometimes referred to as Loess), MAS4 and MAS5 [mas]. Per-gene normalization methods include RMA [IBC+03], gc-RMA [WIG+04] and MAS7.
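To make the idea of per-array normalization concrete, here is a minimal sketch of one of the simplest possible schemes, a global median scaling; this is our own toy illustration (function name included), not an implementation of any of the methods cited above, which model dye and spatial biases much more finely:

    import numpy as np

    def global_median_normalize(arrays):
        """Toy per-array normalization: rescale each array so that all
        arrays end up sharing the same (global) median intensity."""
        arrays = np.asarray(arrays, dtype=float)   # shape (n_arrays, n_probes)
        target = np.median(arrays)                 # common reference level
        medians = np.median(arrays, axis=1, keepdims=True)
        return arrays * (target / medians)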
During this thesis, we aimed at increasing the efficiency and interpretability of microarray analysis by incorporating a priori knowledge into the process. We especially focused on the classification of microarray profiles. The following section therefore provides the background needed to understand these methods, giving a brief overview of the usual classification techniques that we will refer to in the different chapters of this thesis.

Figure 1.2: The arrays on the right side depict the normalized values of the arrays on the left side. The un-normalized arrays present a strong spatial bias, as the left area suffers from a much lower intensity distribution than the right area, which is characterized by the strong presence of blue spots. The normalized arrays clearly show much more homogeneous and unbiased values.
1.2 The classification problem
The construction of a predictive model from microarray data is an important problem in computational biology. Typical applications include, for example, cancer diagnosis or prognosis, and discriminating between different treatments applied to micro-organisms.

In this section we present the generic unsupervised and supervised classification problems and, for each one, present and discuss a standard algorithm (respectively k-means and SVM). We especially focus on the supervised case, as it received more attention during this thesis.

The aim of classification is to build a function $f : \mathcal{X} \to \mathcal{Y}$ that attributes to each sample $x \in \mathcal{X}$ the correct label $y \in \mathcal{Y}$. Supervised classification uses a training set of samples for which the labels are known to build the function $f$, while unsupervised classification does not.
1.2.1 Unsupervised classification
In this section, we present the generic unsupervised classification problem, also known as partitional clustering, and present a standard algorithm used for clustering: the k-means algorithm. This algorithm was used for the work presented in chapter 2.
The general problem

Unsupervised classification, or partitional clustering, corresponds to the partitioning of a data set into subsets (or clusters) of samples that share a common trait, mathematically represented as proximity according to some defined distance measure $d$. If $m$ is the desired number of groups and $\mathcal{X}$ the sample space, a mathematical model of the problem is the search for the partitioning $X_1, \dots, X_m$ that minimizes:

$$\sum_{i=1}^{m} \frac{\sum_{x, y \in X_i} d(x, y)}{\sum_{x \in X_i,\, y \in \bar{X}_i} d(x, y)}, \qquad (1.1)$$

where, for all $i = 1, \dots, m$, $\bar{X}_i$ is the complement of $X_i$ in $\mathcal{X}$, i.e. the set of all elements $x$ of $\mathcal{X}$ such that $x$ is not in $X_i$.

This fraction represents the quotient of the intra-group distances, i.e. the sum of all the distances between two different elements of one group, by the inter-group distances, i.e. the sum of all the distances between the elements of one group and the elements of another group. Minimizing this quotient will give groups that are as compact as possible while, at the same time, being as far apart as possible from one another.
Apart from the k-means method, which we will see in a little more detail in the next section, and its derivatives, clustering methods also include hierarchical clustering [War63] and graph-based methods such as Formal Concept Analysis [GW97].
The k-means algorithm

The k-means algorithm is one of the simplest partitional clustering algorithms; it aims at assigning each point to the cluster whose center is the nearest. It is composed of the following steps [Mac67] (a minimal sketch in code is given after the list):
• The user chooses a number k of groups.
• The algorithm randomly generates k points as the centers of the initial clusters.
• Each point is assigned to the closest center, according to a distance d.
• Each cluster center is recalculated as the mean of the data assigned to this cluster.
• The two last steps are repeated until the groups do not vary or, if a maximal number of steps has been fixed by the user, until this maximal number of steps has been reached.
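The following toy sketch illustrates these steps in Python with NumPy, assuming the Euclidean distance for d and, for simplicity, that no cluster ever becomes empty; it picks k samples as initial centers, a common variant of the random initialization:

    import numpy as np

    def kmeans(X, k, max_iter=100, seed=0):
        # X is an array of shape (n_samples, n_features); d is Euclidean.
        rng = np.random.default_rng(seed)
        # Initialization: pick k samples as the initial cluster centers.
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(max_iter):
            # Assign each point to the closest center.
            dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
            labels = dists.argmin(axis=1)
            # Recompute each center as the mean of the points assigned to it.
            new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
            # Stop when the groups (hence the centers) no longer vary.
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        return labels, centers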
This algorithm is simple and fast but has a big disadvantage: as the initial centers are attributed randomly, it may give different results on each run. Authors [HKY99] proposed improvements to this method in order to ensure that the results are stable, but these improved algorithms do not retain the simplicity and/or the speed of the initial approach.

One critical point of this approach is the choice of $d$. Indeed, depending on the chosen distance measure, the points will be attributed to different groups, and therefore different clusters will be formed. In chapter 2, we will present a specific distance measure that we feel is better adapted to expression array clustering than usual measures such as the Euclidean or L1-norm distances.
1.2.2 Supervised classification
Supervised classification is a particular category of classification methods where a set of samples $X = (X_i \in \mathcal{X})_{i \in 1 \dots n}$ for which the correct labels $Y = (Y_i \in \mathcal{Y})_{i \in 1 \dots n}$ are known is used to build the classification function $f$. This set is known as the “training set”.

In this thesis, we will only focus on the case where the training patterns are represented by finite-dimensional vectors that must be classified into two pre-defined categories, i.e., we restrict ourselves to the case $\mathcal{X} = \mathbb{R}^p$ and $\mathcal{Y} = \{-1, +1\}$. This covers, for example, the case when one wants to predict a good ($Y = +1$) or bad ($Y = -1$) prognosis for a tumor from a vector of gene expression data or an arrayCGH profile. We note, however, that various extensions to more general training patterns and categories have been proposed (e.g., [Vap98, SS02]).

Supervised classification methods include, but are not limited to, the linear methods that we will focus on.
Another example of a supervised classification method is k-Nearest Neighbors (kNN) (see for example [MM01]), which is among the simplest of all supervised classification algorithms. The user empirically decides on a small positive integer k, and each new sample x is attributed to the class that is most common amongst its k nearest neighbors. Nearness of the samples is usually decided according to a distance d which is supposed to provide a good partitioning of the space for the considered problem. The results therefore depend not only on the density of the training set but also on the choice of k and d.
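As an illustration, a minimal kNN predictor with the Euclidean distance for d could look as follows (a toy sketch of ours, not tied to any particular implementation from the literature):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3):
        # Distances from the new sample x to every training sample (distance d).
        dists = np.linalg.norm(X_train - x, axis=1)
        # Indices of the k nearest neighbors.
        nearest = np.argsort(dists)[:k]
        # Attribute the most common label among these neighbors.
        labels, counts = np.unique(y_train[nearest], return_counts=True)
        return labels[counts.argmax()]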
Supervised classification methods also include artificial neural networks (ANNs) [Smi93]. An ANN is an adaptive system composed of interconnected groups of small and simple entities that mimic the behavior of biological neurons. An input layer of neurons takes the sample information and passes it to one or several interconnected hidden layers of neurons which, themselves, transmit it to the output layer of neurons, which returns the estimation of the label. In most cases, ANNs change their internal weights based on the information that flows between the different layers during the learning phase. Even if ANNs are theoretically able to output a wide range of classification functions, their use is not straightforward, as they require very complex tuning (choice of the neurons, choice of the model for the connections, choice of the correct algorithm and of the correct algorithm parameters) or may return an inadequate classification function.
Linear supervised classification

Linear supervised classification methods are a specific class of supervised classification methods that aim at finding a linear classification function, i.e. a function $f : x \mapsto w^\top x + b$ where $w \in \mathbb{R}^p$, $b \in \mathbb{R}$ and $w^\top$ is the transpose of the vector $w$. $w$ can be seen as a vector orthogonal to a hyperplane $P$ that separates the whole space into different subspaces, and the membership of a sample $x$ in one of these subspaces defines its predicted class. In the case of binary classification ($\mathcal{Y} = \{-1, 1\}$), for example, the class will be $\mathrm{sign}(w^\top x + b)$.
We suppose that the variables $(X_i, Y_i)_{i=1,\dots,n}$ are independent and identically distributed samples of an unknown probability law $P$. Let $l : \mathbb{R} \times \mathcal{Y} \to \mathbb{R}$ be a loss function. It quantifies the loss $l(f(x), y)$ incurred when a predictor predicts a scalar $f(x)$ for the pattern $x$ while the correct class is $y$. The best classifier with respect to $l$ is the one that minimizes the expected loss $R(f) = \mathbb{E}_P\, l(f(X), Y)$. $R(f)$ is also known as the risk of the classifier $f$. Unfortunately, the distribution $P$ is not known, so finding the classifier with the smallest risk is not possible in practice. Instead, the empirical risk minimization (ERM) paradigm [Vap98] proposes to find a classifier that minimizes the empirical risk $R_{emp}(f)$, defined as the average loss over the training pairs:

$$R_{emp}(f) = \frac{1}{n} \sum_{i=1}^{n} l(f(X_i), Y_i).$$
However, as the dimension $p$ of the sample space is usually very large, a training set of cardinality $n$ (with $n$ smaller than $p$) is not big enough to give an appropriate sampling of the whole space; the classification function may therefore be overfitted, which means that it may perform very well on the training set (due to the minimization of $l$) but very poorly on unseen examples. Moreover, if the search space is rich enough, an infinity of classification functions may minimize the average loss $l$ on the training set. We therefore have to define a criterion that helps us choose one of these classifiers. This issue is called the curse of dimensionality.

The standard solution is to reduce it by incorporating into the problem a constraint that shapes the profile of the classification function and gives a direction to the search. In the case of binary classification ($\mathcal{Y} = \{-1, 1\}$), the prediction of the label of a new sample only depends on the side of the hyperplane $P$ on which the point is positioned. All of the following algorithms and formulas can be extended to the multi-class problem by combining multiple binary classifiers. In the next sections, we will use the classical geometrical approach to present the SVM algorithm. This approach will be extended to another formulation of the SVM in the last part of section 1.2.2, the latter representation being the one used in the following chapters of this thesis.

Figure 1.3: Support vector machine finds the hyperplane $P$ that separates the positive examples (circles) from the negative examples (squares) with the maximum margin $\gamma$. The samples with label $+1$ are colored green while the samples with label $-1$ are colored blue.
The Support Vector Machine (SVM) in the separable case

In the case of a linearly separable data set, which means that it is possible to find a hyperplane $P$ such that the positively-labelled and negatively-labelled samples lie on either side of $P$, Vapnik and co-workers proposed to select the classifier with the largest margin $\gamma$ (the distance from $P$ to the closest point of the learning set) [BGV92b, CV95, Vap98]. This type of linear classification problem is called the hard-margin Support Vector Machine (hard-margin SVM) and defines the maximum margin classifier.
The equation of the hyperplane $P$ is given by $w^\top x + b = 0$. Therefore, the distance from a sample $x$ to $P$ is given by $\frac{|w^\top x + b|}{\|w\|}$. If the sample set is linearly separable, the class $f(x) = \mathrm{sign}(w^\top x + b)$ attributed to each sample $x_i$ of the training set by the classification function is the correct label $y_i$. From that, we can deduce that for each couple $(x_i, y_i)$, $y_i(w^\top x_i + b) > 0$ and that the distance from a training sample $x_i$ to the hyperplane $P$ is given by $\frac{y_i(w^\top x_i + b)}{\|w\|}$, which gives us the following formula for the margin $\gamma$:

$$\gamma = \min_{i=1,\dots,n} \frac{y_i(w^\top x_i + b)}{\|w\|}. \qquad (1.2)$$
As hyperplanes are defined up to a scaling constant (i.e. the equations $w^\top x + b = 0$ and $\alpha w^\top x + \alpha b = 0$ with $\alpha \in \mathbb{R}$ define the same hyperplane), we can add the following constraint to the definition of the hyperplane $P$:

$$\min_{i=1,\dots,n} y_i(w^\top x_i + b) = 1. \qquad (1.3)$$

Using this constraint with (1.2) gives us a simplified value for the margin, $\gamma = \frac{1}{\|w\|}$. The hard-margin SVM looks for the hyperplane with the largest margin, which can be formulated as the following optimization problem:

$$(w^*, b^*) = \operatorname*{argmin}_{w \in \mathbb{R}^p,\, b \in \mathbb{R}} \frac{1}{2}\|w\|^2 \qquad (1.4)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i(w^\top x_i + b) \geq 1. \qquad (1.5)$$
As the objective function (the function to minimize) (1.4) is strictly convex and the constraints (1.5) are convex, this minimization problem is a convex problem with a unique solution [BV04b]. To solve this problem, we can use the method of Lagrange multipliers. The Lagrangian of the problem is given by:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 + \sum_{i=1}^{n} \alpha_i \left(1 - y_i(w^\top x_i + b)\right), \qquad (1.6)$$

where the $\alpha_i$ are called the Lagrange multipliers of the optimization problem and $\alpha$ is the vector of $\mathbb{R}^n$ whose components are the $\alpha_i$.
Optimization theorems imply that the minimum of the objective function (1.4) under the constraints (1.5) is given by a saddle point $(w^*, b^*, \alpha^*)$ of the Lagrangian $L$, a minimum of $L$ with regard to $(w, b)$ and a maximum with regard to $\alpha$. The minimization of $L$ with regard to $(w, b)$ implies that the corresponding partial derivatives $\frac{\partial L}{\partial w}$ and $\frac{\partial L}{\partial b}$ are set to 0:

$$w - \sum_{i=1}^{n} \alpha_i y_i x_i = 0 \qquad (1.7)$$

$$\sum_{i=1}^{n} \alpha_i y_i = 0. \qquad (1.8)$$
Substituting these formulas into (1.6) gives us the dual formulation of the problem:

$$\alpha^* = \operatorname*{argmax}_{\alpha \in \mathbb{R}^n} \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^\top x_j \qquad (1.9)$$

under the constraints

$$\forall i = 1, \dots, n \quad \alpha_i \geq 0 \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0.$$

This problem is a quadratic program, which can be solved using different methods such as interior point [Wri87], active set [BR85] or conjugate gradient [Saa96].
The Karush-Kuhn-Tucker conditions give us the following property of the optimum:

$$\forall i = 1, \dots, n \quad \alpha_i^*\left(y_i(w^{*\top} x_i + b^*) - 1\right) = 0. \qquad (1.10)$$

Therefore, at the optimum, the only non-null linear combination coefficients correspond to learning samples $x_i$ such that $y_i(w^{*\top} x_i + b^*) = 1$. These points are positioned on the margin of the hyperplane $P$ and are the only ones that affect the position of $P$. They are called the support vectors of the classifier. Thus, the solution does not depend on the size of the sample space or even on the number of training examples, but only on the count of these critical examples.

The optimal offset $b^*$ can be obtained from any support vector $x'$ labelled by $y'$, using the fact that $y'(w^{*\top} x' + b^*) - 1 = 0$ and that $y'^2 = 1$:

$$b^* = y' - w^{*\top} x'. \qquad (1.11)$$
However, to obtain a more accurate value, we will average the offset over all the support vectors.

At the optimal point, the decision function is given by:

$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n'} \alpha_i^* y_i\, x_i'^\top x + b^*\right), \qquad (1.12)$$

where $n'$ is the number of support vectors and $(x'_1, \dots, x'_{n'})$ are the support vectors.
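For illustration, once the non-null dual coefficients and the corresponding support vectors are known, the decision function (1.12) can be evaluated directly; a minimal sketch (variable names are ours):

    import numpy as np

    def svm_decision(alpha, y_sv, X_sv, b, x):
        # alpha: dual coefficients of the support vectors; y_sv: their labels;
        # X_sv: the support vectors, one per row; b: the offset; x: a new sample.
        return np.sign(np.dot(alpha * y_sv, X_sv @ x) + b)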
SVM in the non-separable case

Unfortunately, in general, a linear hyperplane separating the data into the pre-determined classes may not exist. In this case, the previous algorithm cannot be applied and we have to introduce slack variables $\xi_i$ for each training couple $(x_i, y_i)$ in order to relax the constraints:

$$y_i(w^\top x_i + b) + \xi_i \geq 1. \qquad (1.13)$$

The slack variable $\xi_i$ corresponds to the distance between the sample and the subspace it should belong to, and is therefore a measurement of the classification error. Indeed, for all $i = 1, \dots, n$: $\xi_i = 0$ corresponds to a well-classified sample outside the margin, $0 < \xi_i < 1$ corresponds to a well-classified sample inside the margin, and $\xi_i > 1$ to a misclassified sample. Figure 1.4 illustrates this situation.
The average of the slack variables (or their sum) corresponds to the amount of error that is tolerated and should therefore be controlled. This is done by adding this quantity to the objective function of the SVM optimization problem:

$$(w^*, b^*, \xi^*) = \operatorname*{argmin}_{w \in \mathbb{R}^p,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1.14)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i(w^\top x_i + b) \geq 1 - \xi_i, \qquad \forall i = 1, \dots, n \quad \xi_i \geq 0.$$

Figure 1.4: Support vector machine looks for the hyperplane $P$ that separates the positive examples (circles) from the negative examples (squares) with the maximum margin $\gamma$. The samples with label $+1$ are colored green while the samples with label $-1$ are colored blue. The green squares and blue circles are therefore misclassified samples ($\xi > 1$), while a slack variable value between 0 and 1 implies a well-classified sample inside the margin.
The constant $C$ offers a trade-off between the number of errors and the regularization term: the bigger $C$ is, the more important the minimization of $\xi$, i.e. the error control, becomes with regard to the maximization of the margin, which is expressed by the term $\|w\|^2$. This formulation of the problem is known as the soft-margin SVM, in opposition to the hard-margin approach presented in the previous section. We can see that by setting $C = \infty$ we retrieve the hard-margin problem.

As $w^*$ is a linear combination of the samples, we can also formulate the problem as:

$$(w^*, b^*, \xi^*) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2}\left\|\sum_{i=1}^{n} \alpha_i x_i\right\|^2 + C \sum_{i=1}^{n} \xi_i \qquad (1.15)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i\left(\sum_{j=1}^{n} \alpha_j\, x_j^\top x_i + b\right) \geq 1 - \xi_i, \qquad \forall i = 1, \dots, n \quad \xi_i \geq 0.$$
Extension of SVM to non-linear problems

When dealing with a nonlinearly separable problem, linear classifiers may not be able to provide a satisfying classification function. A way to solve this problem is to introduce kernels: the SVM may be generalized to the non-linear case by applying a linear SVM to transformed data.

A function $K$ is said to be symmetric if:

$$\forall (x, y), \quad K(x, y) = K(y, x). \qquad (1.16)$$

A function $K$ is said to be positive semi-definite if:

$$\forall (x_1, \dots, x_n) \text{ and } \forall c_1, \dots, c_n \in \mathbb{R}, \quad \sum_{i=1}^{n} \sum_{j=1}^{n} K(x_i, x_j)\, c_i c_j \geq 0. \qquad (1.17)$$

Moore-Aronszajn’s theorem [Aro50] states:

Theorem 1. A symmetric positive semi-definite function $K(x, y)$ can be expressed as an inner product, which means that there exists a Hilbert space $\mathcal{H}$ equipped with a dot product $\langle \cdot, \cdot \rangle$ and an embedding $\phi$ from the sample space to $\mathcal{H}$ such that:

$$K(x, y) = \langle \phi(x), \phi(y) \rangle. \qquad (1.18)$$
In this new space, the SVM problem becomes:

$$(w^*, b^*, \xi^*) = \operatorname*{argmin}_{\alpha \in \mathbb{R}^n,\, b \in \mathbb{R},\, \xi \in \mathbb{R}^n} \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j K(x_i, x_j) + C \sum_{i=1}^{n} \xi_i \qquad (1.19)$$

under the constraints

$$\forall i = 1, \dots, n \quad y_i\left(\sum_{j=1}^{n} \alpha_j K(x_j, x_i) + b\right) \geq 1 - \xi_i, \qquad \forall i = 1, \dots, n \quad \xi_i \geq 0.$$

Therefore, we only need to know the function $K$: neither the non-linear embedding $\phi$, nor the space $\mathcal{H}$, nor the dot product $\langle \cdot, \cdot \rangle$ needs to be explicitly known. This property, known as the kernel trick, is often used to extend the scope of SVM to non-linearly separable problems [CST00, STV04, STC04].

Simple examples of kernels include:

$$K(x, y) = \langle x, y \rangle^d \qquad (1.20)$$

$$K(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma}}. \qquad (1.21)$$
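These two kernels translate directly into code; the following sketch implements formulas (1.20) and (1.21) exactly as written (in particular with $2\sigma$, not $2\sigma^2$, in the denominator of the exponent):

    import numpy as np

    def polynomial_kernel(x, y, d=2):
        # (1.20): K(x, y) = <x, y>^d
        return np.dot(x, y) ** d

    def gaussian_kernel(x, y, sigma=1.0):
        # (1.21): K(x, y) = exp(-||x - y||^2 / (2 * sigma))
        return np.exp(-np.linalg.norm(x - y) ** 2 / (2.0 * sigma))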
Another formulation of the SVM algorithm

The hinge loss function is defined as follows:

$$h : \mathbb{R} \to \mathbb{R}, \quad x \mapsto \max(0, 1 - x). \qquad (1.22)$$

The SVM can then be seen as an algorithm that finds the couple $(w^*, b^*)$ such that:

$$(w^*, b^*) = \operatorname*{argmin}_{w, b} \sum_{i=1}^{n} h\left(y_i(w^\top x_i + b)\right) + \lambda \|w\|^2. \qquad (1.23)$$
Indeed, (1.23) is equivalent to the minimization of the following form:

$$(w^*, b^*) = \operatorname*{argmin}_{w, b} \lambda\|w\|^2 + \sum_{i=1}^{n} \xi_i$$

under the constraint

$$\forall i = 1, \dots, n \quad \xi_i \geq h\left(y_i(w^\top x_i + b)\right).$$

Figure 1.5: A representation of the hinge loss function.

Using the formula for the hinge loss function $h$ given in (1.22), this becomes:

$$(w^*, b^*) = \operatorname*{argmin}_{w, b} \lambda\|w\|^2 + \sum_{i=1}^{n} \xi_i$$

under the constraints

$$\forall i = 1, \dots, n \quad \xi_i \geq 0, \qquad \forall i = 1, \dots, n \quad \xi_i \geq 1 - y_i(w^\top x_i + b),$$

which is equivalent to (1.14) if we set $\lambda = \frac{1}{2C}$. This formulation is the one that will be adopted in the next chapters.
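To make formulation (1.23) concrete, the unconstrained objective can also be minimized directly, for instance by batch subgradient descent; the following toy sketch (our own choice of optimizer, with a fixed step size and no convergence check) is only meant to illustrate the formulation:

    import numpy as np

    def svm_hinge_subgradient(X, y, lam=0.1, lr=0.01, epochs=200):
        # Minimize sum_i h(y_i (w.x_i + b)) + lam * ||w||^2, h the hinge loss.
        # X: array of shape (n, p); y: array of labels in {-1, +1}.
        n, p = X.shape
        w, b = np.zeros(p), 0.0
        for _ in range(epochs):
            margins = y * (X @ w + b)
            active = margins < 1          # samples with a non-zero hinge loss
            # Subgradient of the objective with respect to w and b.
            grad_w = 2 * lam * w - (y[active][:, None] * X[active]).sum(axis=0)
            grad_b = -y[active].sum()
            w -= lr * grad_w
            b -= lr * grad_b
        return w, b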
Formulation (1.23) depicts the SVM as a member of the family of algorithms that aim at finding the $w^*$ minimizing a form of the type:

$$w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{n} l(y_i, w^\top x_i) + \lambda\, \Omega(w). \qquad (1.24)$$

Besides the classical SVM presented in the previous sections, this class of techniques includes the L1-SVM (taking the hinge loss for $l$ and the L1-norm for $\Omega$) and the Lasso (taking the squared error for $l$ and the L1-norm for $\Omega$).

As we will see briefly in the last section of this chapter, and more extensively in the following chapters, the main goal of this thesis has been to build supervised classification techniques that incorporate a priori knowledge into $\Omega$ in order to address one issue of machine learning: the curse of dimensionality.
1.3 The curse of dimensionality
One important issue in machine learning, for which [Bel57] proposed the term “curse of dimensionality”, is the high dimension of the sample space in comparison to the low count of samples. Indeed, the volume increases exponentially as extra dimensions are added to the space, which means that the number of points needed for an efficient sampling also increases exponentially.

However, in microarray analysis, the sample space dimension is given by the number of probes, which varies between a few thousand and a few hundred thousand, while the number of samples varies between a dozen and a few hundred. The huge gap between the small count of samples and the astronomical amount of data that would be needed to provide an efficient sampling of the space suggests the use of methods that reduce the size of the search space.
In this section, we present a collection of these methods. We grouped them into “pre-processing methods”, which separate the reduction of the search space from the classification method, and “wrapper methods”, which modify the classification algorithm itself to reduce the size of the search space.
1.3.1 Pre-processing methods
As we have seen before, the linear classifier returned by a support vector machine
is a linear combination of the samples. Therefore, regularizing the samples and
reducing the sample space will force the classifier to restrict itself to this reduced
space.
Feature selection by filtering
A first method for reducing the search space is to apply feature selection, also known as feature reduction or attribute selection. It consists of selecting a subset of relevant features from the sample set. This method is also frequently called gene selection when applied to gene expression profiles. Simple feature selection (for example, only keeping the features that vary the most between the different groups) can be performed, but as some attributes may be redundant, obtaining the optimal set of features of a chosen cardinality requires exploring every subset of features of this cardinality. Therefore, it is preferred to take a satisfactory set of features, which may not be optimal but is still good enough for the classification. [GGNZ06] propose an extensive review of these techniques.
[GST+99] propose to rank the features with the following criterion:

$$\delta_i = \frac{\mu_i^+ - \mu_i^-}{\sigma_i^+ + \sigma_i^-} \qquad (1.25)$$

where $i$ is the index of the feature, $\mu_i^+$ and $\mu_i^-$ are the means of the feature values over all the samples of class $+1$ and $-1$ respectively, and $\sigma_i^+$ and $\sigma_i^-$ are the standard deviations of the feature values over all the samples of class $+1$ and $-1$ respectively. The original method proposed by [GST+99] is to select an equal number of features with positive and negative $\delta_i$ coefficients. [FCD+00] propose to take the absolute value of the coefficient $\delta_i$ and keep the top-ranking features.
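Criterion (1.25), together with the absolute-value variant of [FCD+00], takes only a few lines to compute; a minimal sketch (names are ours):

    import numpy as np

    def signal_to_noise_ranking(X, y):
        # X: (n_samples, n_features); y: labels in {-1, +1}.
        pos, neg = X[y == 1], X[y == -1]
        # delta_i = (mu_i^+ - mu_i^-) / (sigma_i^+ + sigma_i^-), as in (1.25).
        delta = (pos.mean(axis=0) - neg.mean(axis=0)) \
            / (pos.std(axis=0) + neg.std(axis=0))
        # Rank by |delta_i|, best features first, as proposed by [FCD+00].
        return np.argsort(-np.abs(delta))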
Other criteria can be used for feature selection, but the most commonly used nowadays are based on the control of the false discovery rate (FDR) [BH95], the expected proportion of falsely rejected hypotheses. [YY06], [CTTC07] and [Pav03] all proposed methods that control this FDR. However, [QXGY06] pinpointed the instability of gene selection techniques. In particular, they showed that some genes are selected much less frequently than other genes with the same p-value, and suggested that correlation between gene expression levels may perturb the testing procedures.
A key idea of the binary Relief algorithm proposed by [KR92] is to estimate the quality of features according to how well they distinguish between samples that are near to each other. The associated algorithm remains simple: it randomly chooses a sample and, for each feature, looks at whether its value changes between the nearest neighboring sample of the same class and the nearest neighboring sample of the other class. If the value changes, the value representing the quality of the feature is upgraded; if it does not, this value is downgraded. The process is repeated m times, m being a number pre-defined by the user. This algorithm has been updated to ReliefF [Kon94], which is more robust and able to deal with multiclass problems, and then to RReliefF [SK97], which is able to deal with continuous-class problems. The different versions of the Relief algorithm present the strong advantage of not being perturbed by correlation between the different features, but unfortunately require extensive computation.
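A toy sketch of the binary Relief loop described above, assuming continuous features and the L1 distance for finding neighbors (a simplification of the original algorithm, which also handles discrete features):

    import numpy as np

    def relief(X, y, m=100, seed=0):
        rng = np.random.default_rng(seed)
        n, p = X.shape
        weights = np.zeros(p)
        for _ in range(m):
            i = rng.integers(n)                    # randomly chosen sample
            dists = np.abs(X - X[i]).sum(axis=1)   # L1 distances to sample i
            dists[i] = np.inf                      # exclude the sample itself
            hit = np.where(y == y[i], dists, np.inf).argmin()   # nearest same class
            miss = np.where(y != y[i], dists, np.inf).argmin()  # nearest other class
            # Upgrade features that change on the miss, downgrade those
            # that change on the hit.
            weights += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
        return weights / m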
Constructing features without any a priori knowledge
Another way to reduce the search space is to construct features, i.e. to build from each data sample $x \in \mathbb{R}^n$ another vector $\phi(x) \in \mathbb{R}^p$, with $p < n$, where the features of $\phi(x)$ are not a subset of the features of $x$.
Principal Component Analysis (PCA) is also called the Karhunen-Loève transform, the Hotelling transform and the Proper Orthogonal Method [Shl05]. Widely used in all forms of analysis, from image processing to finance, including computational biology, PCA is a technique that extracts from a data set the most meaningful directions of variation. The dataset is then projected on the newly formed basis in order to filter out the noise and reveal hidden structure. The new basis is obtained by taking the first few eigenvectors of the sample covariance matrix, each of which is a linear combination of the sample vectors. In most cases, only the eigenvectors corresponding to the highest eigenvalues are kept, and the projection on this new basis redefines the dataset and reduces its dimension.

When applied to data that is not of variance 1 and mean 0, Principal Component Analysis corresponds to the Singular Value Decomposition (SVD) [Han86]. Using kernel methods, [SSM99] developed an extension of PCA to non-linear spaces.
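A minimal PCA sketch through the eigendecomposition of the sample covariance matrix (this assumes the number of features is small enough to form that matrix explicitly; on microarray data, where the dimension is large, one would rather compute an SVD of the centered data matrix):

    import numpy as np

    def pca_project(X, p):
        Xc = X - X.mean(axis=0)                 # center each feature
        cov = np.cov(Xc, rowvar=False)          # sample covariance matrix
        eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: the matrix is symmetric
        top = np.argsort(eigvals)[::-1][:p]     # keep the p largest eigenvalues
        return Xc @ eigvecs[:, top]             # project on the new basis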
[HZHS07] propose to extract from each sample the information provided by each pair of highly synergetic variables, which, in the case of the gene expression profiles they consider, are genes. Their method is able to improve the results obtained with usual classification methods.

However, we can also use what we know about the data in order to improve the variable selection.
Smoothing of comparative genomic hybridization data
In the case of arrayCGH data, constructing features with a priori knowledge translates into "smoothing" and "segmenting": as two successive spots on the same chromosome are likely to be subject to the same gain or loss, a CGH profile can be seen as a sequence of segments, each with a certain amplification value and a certain length. Different approaches have been proposed to achieve this goal.
The most direct way to perform this segmentation is to assign each spot the value -1 if it is considered as belonging to a lost region, 0 if it is considered to lie in a normal region and +1 if it lies in a gained region. [JFG+ 04] used thresholding to attribute the correct label to each spot.
[HST+ 04] proposed to detect the delimitations, or "breakpoints", of each region using local-likelihood modeling: an iterative algorithm finds around each location the maximal region in which the amplification value θ can be considered constant, assuming that the value x collected on each spot equals the sum of the amplification and a noise term: x = θ + ε. The authors then used these regions, and the estimated value of θ attributed to each one, to estimate, over the whole sample set, which chromosomes suffered from gain or loss.
[OVLW04] proposed an efficient algorithm called "circular binary segmentation", based on the change-point detection problem. They tested their method on a breast cancer data set where the real amplification values were known and obtained better results than with a classical thresholding method.
[HGO+ 07] suggested that, due to a variety of biological and experimental factors, the aCGH signal may differ from the discrete stepwise function that is often produced by segmentation algorithms. The authors proposed a two-step approach that first deals with outliers and then with the differently valued spots inside each segment. On their own test set, their algorithm finds a profile that is closer to reality than the one found with circular binary segmentation.
[TW07] proposed to consider the problem as a regression issue: for each sample X, we need to obtain a profile Y that corresponds to its smoothed profile. This can be transcribed as the following optimisation problem:
\[
Y = \operatorname*{argmin}_{Y \in \mathbb{R}^n} L(X, Y)
\quad \text{under the constraints} \quad
\sum_{i \sim i+1} |Y_i - Y_{i+1}| \le \lambda, \qquad \|Y\|_1 \le \mu, \tag{1.26}
\]
where $n$ is the number of spots, $Y_i$ is the $i$th component of the vector $Y$, $i \sim i+1$ means that $i$ and $i+1$ are successive spots on the same chromosome, $L$ is the square loss $L : (X, Y) \mapsto \|X - Y\|^2$, $\|\cdot\|_1$ is the L1-norm $\|Y\|_1 = \sum_{i=1}^n |Y_i|$, and $\lambda$ and $\mu$ are two trade-off constants that balance the importance of the constraints against the value of the loss.
The choice of the two constraint terms reflects the facts that Y should be smooth, which justifies the first constraint, and that most of the spots should be subject to normal amplification, which means that Y should be sparse and its L1 norm small. This approach is very similar to the one we develop in chapter 3 for the classification of aCGH profiles.
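As a hedged illustration, the smoothing problem (1.26) can be written almost verbatim with a generic convex solver. The sketch below assumes the cvxpy library and a profile whose spots all lie on a single chromosome; the function name is ours.

```python
import cvxpy as cp

def fused_smooth(x, lam, mu):
    """Smooth a CGH profile x as in (1.26): minimise the squared loss
    under a total-variation budget lam and an L1 budget mu."""
    y = cp.Variable(x.shape[0])
    objective = cp.Minimize(cp.sum_squares(x - y))
    constraints = [cp.norm1(cp.diff(y)) <= lam,   # sum of |y_i - y_{i+1}|
                   cp.norm1(y) <= mu]             # sparsity of the profile
    cp.Problem(objective, constraints).solve()
    return y.value
```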
Extraction of modules for gene expression analysis
In the case of gene expression profiles, one category of a priori information that we can introduce into the classification analysis is gene network information, which can be used to perform dimension reduction.
In this thesis, we take for the term "gene" the most common definition given in [AJL+ 02] and call "gene" a specific portion of DNA that codes for a specific protein.
"Gene network" is a generic term that indicates a knowledge base describing relations between proteins and, by extension, the corresponding genes. These networks are particularly useful to analyze or predict the effects that the perturbation of one protein or gene may have. In this thesis, we will call "pathway" a part of the gene network that acts to ensure a single biological function, while a "module" or a "map" usually regroups several pathways in order to ensure several biological functions, in most cases related to one another.
Many gene networks can be represented as graphs. A graph is constituted of a set of vertices V and a set of edges E ⊂ V × V that correspond to relations between the vertices. It is called undirected if, for all (u, v) ∈ E, we also have (v, u) ∈ E, and directed otherwise. In the case of gene networks, the vertices are proteins, or the corresponding genes. Gene networks include metabolic networks [KGK+ 04, KGH+ 06, VDS+ 07], co-expression networks [YMH+ 07], influence networks [YMK+ 06] and Protein-Protein Interaction networks (PPI networks) [MS03, MBS05, RTV+ 05, RVH+ 05, SWL+ 05].
Chapter 4 provides a more extensive description of these different networks.
A family of methodologies to incorporate this gene network knowledge into
the gene expression profile analysis takes as a preliminary step the extraction
of modules, or highly-connected groups of genes that should act as a unique
entity, from the gene network and then analyses the profiles as a collection of
underexpressed or overexpressed modules.
[SZEK07] extract clusters from a metabolic network and then estimate the over- or under-expression of each module using a Haar-wavelet transformation applied to each connected pair of genes. Their method is quite analogous to techniques used in image analysis, in which an image is a grid-like network of color values, and is found to be more powerful than classical t-test methods that use no network knowledge.
[CLL+ 07] use protein-protein interaction networks and define modules, or "subnetworks", as gene sets that induce a single connected component in the network. They propose to take into account only the subnetworks that are considered significant with respect to the classification problem and then to perform the analysis on the data set. The significance of a subnetwork with respect to the problem is calculated using the Mutual Information score. Applying their method on two different data sets, they are able to find 149 and 243 significant subnetworks respectively, and show that their network-based classification achieves higher prediction accuracy than classical classification methods. However, their classification method is sensitive to perturbations of the network.
We proposed in [RZD+ 07] a method that used metabolic networks to smooth
the gene expression profiles. This method is developed in chapter 2.
1.3.2 Wrapper methods
Another way to incorporate this a priori information is to modify the optimisation problem presented in 1.14 so as to directly incorporate the reduction of the search space into the problem itself, resulting in a wrapper method, built with or without a priori knowledge.
Without any a priori knowledge
[Tib96] proposed the Lasso for regression and variable selection. This method is based on the generic supervised classification framework presented in 1.24, in the context of regression (i.e. the labels yi belong to R), with the loss function l being the squared error and the regularization term Ω being the L1-norm. The method performs regression and feature selection at the same time, as the use of the L1-norm forces the linear model to be sparse. It has been enhanced by [EHJT04] into the faster LARS (least angle regression) algorithm.
Another wrapper approach is the "1-norm support vector machine" (L1-SVM) [ZRHT03], which substitutes the L1-norm for the squared L2-norm in the classical SVM algorithm presented in 1.23. Similarly to the Lasso, the L1-norm forces the linear classifier to be sparse, resulting in a classification function that relies on a reduced number of genes, i.e. it performs classification and feature selection at the same time.
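For illustration, an L1-penalised linear SVM is readily available in standard libraries. The sketch below uses scikit-learn, whose L1 variant is based on the squared hinge loss rather than the hinge loss of [ZRHT03]; it is an analogue of, not identical to, the original algorithm.

```python
import numpy as np
from sklearn.svm import LinearSVC

def l1_svm(X, y, C=1.0):
    """Fit a sparse linear classifier: the L1 penalty drives most gene
    weights to exactly zero, so classification and feature selection
    happen in a single step."""
    clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C)
    clf.fit(X, y)
    selected = np.flatnonzero(clf.coef_.ravel())  # indices of genes kept
    return clf, selected
```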
Using a priori knowledge
The adjunction of a kernel function, described in 1.19, provides another framework for the incorporation of a priori information. The supervised classification method described in [RZD+ 07] and in chapter 2 can, for example, also be written as the combination of an SVM and a kernel function corresponding to the filtering of the high-frequency components according to the metabolic network of reference:
\[
w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{n} L(w^\top x_i, y_i)
\quad \text{under the constraint} \quad
w^\top \rho(L)\, w \le \mu, \tag{1.27}
\]
with $\mu$ being a constant trade-off parameter estimated through cross-validation and $\rho(L)$ a spectral modification of the Laplacian matrix $L$ of the metabolic network (see chapter 2 for a more complete description of the algorithm).
[LL07] subsequently proposed to add an L1-constraint to the method described in [RZD+ 07]:
\[
w^* = \operatorname*{argmin}_{w} \sum_{i=1}^{n} L(w^\top x_i, y_i)
\quad \text{under the constraints} \quad
\|w\|_1 \le \lambda, \qquad w^\top L w \le \mu, \tag{1.28}
\]
with $\lambda$ and $\mu$ being two constant trade-off parameters estimated through cross-validation and $L$ being the weighted Laplacian matrix of the metabolic network.
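Problem (1.28) is a convex program and can be sketched directly with a generic solver. In the snippet below (our own illustration, assuming cvxpy), the generic loss L is instantiated as the hinge loss, and the quadratic constraint $w^\top L w \le \mu$ is rewritten through a square-root factorisation of the Laplacian.

```python
import cvxpy as cp
import numpy as np

def network_constrained_classifier(X, y, L, lam, mu):
    """Sketch of (1.28): hinge-loss minimisation under an L1 budget lam
    and a network-smoothness budget mu, L being the weighted Laplacian
    of the gene network and y taking values in {-1, +1}."""
    w = cp.Variable(X.shape[1])
    loss = cp.sum(cp.pos(1 - cp.multiply(y, X @ w)))   # total hinge loss
    # Factor L = M'M so that w'Lw = ||Mw||^2 is an explicit convex term.
    evals, evecs = np.linalg.eigh(L)
    M = np.diag(np.sqrt(np.clip(evals, 0, None))) @ evecs.T
    constraints = [cp.norm1(w) <= lam, cp.sum_squares(M @ w) <= mu]
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value
```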
We propose two methods that modify the constraints for supervised classification in chapters 3 and 4 of this thesis.
1.4 Contributions of this thesis
In this section, we present the different contributions that we made to the field during this thesis. These contributions will be developed and further explained in the following chapters; this section provides a short introduction to them.
1.4.1 Spectral analysis of gene expression profiles
The first technique that we developed integrates a priori gene network knowledge into gene expression data analysis. This approach is based on the spectral decomposition of gene expression profiles with respect to the eigenfunctions of the graph. This decomposition leads to the design of a new distance function which can be used for unsupervised classification and principal component analysis.
Spectral decomposition can also be used to apply a filter on the data in order to attenuate components of the expression profiles with respect to the topology of the graph, high-frequency variations corresponding to noise while low-frequency variations correspond to biological phenomena. In particular, we can use a low-pass filter or an exponential filter that will reduce the high-frequency variations of the microarrays along the edges of the graph, and only keep smooth variations. Supervised classification techniques can then be applied on the smoothed samples in order to obtain a classifier that will have an easier biological interpretation.
We applied this method on biological data extracted from a study that analyzes the effect of low irradiation on yeast cells and tries to discriminate between a group of non-irradiated cells and a group of slightly irradiated cells. We used the KEGG metabolic network for the analysis. Even though we were not able to improve supervised classification performance, we were able to provide a better separation of the groups and a classifier that was more easily understandable and from which new pathways of interest for the problem were extracted.
This work is presented extensively in chapter 2.
1.4.2 Supervised classification of aCGH data using fused L1-SVM
The second method that we developed is a supervised classification method specific to aCGH data. This approach extends the fused lasso [TSR+ 05], a regression method that uses two regularization terms in order to produce a sparse solution where successive features tend to have the same value. Our method replaces the ridge regression loss function by the hinge loss function in order to produce a sparse linear classification function in which successive spots tend to have the same weight. This is appropriate for aCGH data since two successive spots on the same chromosome are likely to be subject to the same alteration and therefore to have similar weights in the classifier.
This method, called fused L1-SVM, has been tested on three different classification problems using two different data sets related to the cancerous disease. Our classification method performed well in every case. Moreover, it was able to produce easily interpretable solutions.
This work is presented in chapter 3.
1.4.3 Supervised classification of gene expression profiles using network-fused SVM
The third method that we developed is a supervised classification method for gene expression profiles. This method adds a regularization term to the classical L1-SVM problem in order to constrain the classification function to attribute similar weights to features, i.e. expressions of genes, that are connected in the gene network. This method is appropriate for gene expression profiles, in which the expressions of specific genes are positively correlated, and for which the positive correlation information is stored in specific networks. As this mathematical problem is an extension of the fused L1-SVM, with the chain of dependencies between features being replaced by a network, we called this method the network-fused L1-SVM.
This method has been tested on two public breast cancer gene expression profile data sets with eight different networks belonging to four different types (metabolism, protein-protein interactions, co-expression and influence). For each data set, we were able to provide classifiers that performed better than classifiers that do not take the network interactions into account. However, due to our sparse knowledge of the gene interactions, most of these classifiers performed worse than the classifier that takes into account all probes, even the ones that are not in the network.
We also tried to extract known biological phenomena from these different classifiers and showed that they may be more biologically meaningful than usual classifiers.
This work is presented in chapter 4.
Chapter 2

Spectral analysis of gene expression profiles using metabolic networks
This work has already been published in a slightly different form in BMC Bioinformatics, co-authored with Andrei Zinovyev, Marie Dutreix, Emmanuel Barillot and Jean-Philippe Vert [RZD+ 07].
2.1 Background
During the last decade microarrays have become the technology of choice for dissecting the genes responsible for a phenotype. By monitoring the activity of virtually all the genes of a sample in a single experiment, they offer a unique perspective for explaining the global genetic picture of a variant, whether a diseased individual or a sample subject to particular stress conditions. However, this strength is also their major weakness, and has led to the "gene list" syndrome. Following careful experimental design and data analysis, the result of an experiment with microarrays is often summarized as a list of genes that are differentially expressed between two conditions, or that allow samples to be classified according to their phenotypic features. Once this list of genes, typically a few hundred, has been obtained, its meaning still has to be deciphered, but the automated translation of the list into biological interpretation is often challenging.
The interpretation of the results in terms of biological functions and pathways involving several genes is of particular interest. Many databases and tools
help verify a posteriori whether genes known to co-operate in some biological process are found in the list of genes selected. For example, Gene Ontology [Con00], Biocarta [bio], GenMAPP [gen] and KEGG [KGK+ 04] all allow
a list of genes to be crossed with biological functions and genetic networks,
including metabolic, signalling or other regulation pathways. Basic statistical
analysis (e.g., [HDS+ 03, BKPT05]) can then determine whether a pathway is
over-represented in the list, and whether it is over-activated or under-activated.
However, one can argue that introducing information on the pathway at this
point in the analysis process sacrifices some statistical power to the simplicity
of the approach. For example, a small but coherent difference in the expression
of all the genes in a pathway should be more significant than a larger difference
occurring in unrelated genes.
There is therefore a pressing need for methods integrating a priori pathway
knowledge in the gene expression analysis process, and several attempts have
been carried out in that direction so far. Several authors have used a priori
known gene networks to derive models and constraints for gene expression. For
example, logical discrete formalism [TK01] can be used to analyse all the possible steady states of a biochemical reaction network described by positive and
negative influences and can determine whether the observed gene expression
may be explained by a perturbation of the network. If only the signs of the concentration differences between two steady states are considered, it is possible
to solve the corresponding Laplace equation in sign algebra [RLS+ 05], giving
qualitative predictions for the signs of the concentration differences measured
by microarrays. Other approaches, such as the MetaReg formalism [GVTS04],
have also been used to predict possible gene expression patterns from the network structure, although these approaches adhere less to the formal theory of
biochemical reaction networks.
Unfortunately, methods based on network models are rarely satisfactory because detailed quantitative knowledge of the complete reaction network parameters is often lacking, or only fragments of the network structure are available. In these cases, more phenomenological approaches need to be used. Pathway scoring methods try to detect perturbed "modules" or network pathways while ignoring the detailed network topology (for recent reviews see [COVP05, CDF05]). It is assumed that the genes inside a module are co-ordinately expressed, and thus a perturbation is likely to affect many of them.
With available databases containing tens of thousands of reactions and interactions (KEGG [KGK+ 04], TransPath [KPV+ 06], BioCyc [KOMK+ 05], Reactome [JTGV+ 05] and others), the problem is how to integrate the detailed graph of gene interactions (and not just crude characteristics such as the inter/intra-module connectivity) into the core microarray data analysis. Some promising results have been reported with regard to this problem. [VK03] developed a method for correlating interaction graphs and different types of quantitative data, and [RDML04] showed that explicitly taking the pathway distance between pairs of genes into account enhances the statistical scores when identifying activated pathways. The co-clustering of gene expression and gene networks has been reported [HZZL02], and a dimension reduction method, called "Network component analysis" [LBY+ 03, GTL06], was proposed to construct linear models of gene regulation based on a priori known network information. The PATIKA project [BDA+ 04] proposed a score to quantify the compatibility of a pathway with given microarray data, and in [SYDM05] a network topology extracted from the literature was used jointly with microarray data to find significantly affected pathway regulators.
In this paper, we investigate a different approach for integrating gene network knowledge early in the gene expression analysis. By “gene network” we
mean any graph with genes as vertices, and where edges between genes can
represent various biological information. For example, an edge between two
genes could represent the fact that their products interact physically (proteinprotein interaction network), the presence of a genetic interaction such as a
synthetic-lethal or suppressor interaction [KI05], or the fact that these genes
code for enzymes that catalyse successive chemical reactions in a pathway
(metabolic network, [VK03]). As an illustration we focus on the latter case
in this article, although the method proposed is not limited to the metabolic
network. Our approach is based on the biological hypothesis that genes close
on the network are likely to have similar expression, and consequently that
noisy measures of gene expression, such as those obtained by microarrays, can
be denoised to some extent by extracting their “low-frequency” component on
the gene network. In the case of the metabolic gene network of the yeast S.
cerevisiae considered in this study, this biological hypothesis is motivated by
previous observations that genes coding for enzymes involved in a common
process are often co-regulated ensuring the presence of all the necessary proteins [vHGW+ 01, HZZL02, VK03, KVC04, KCF+ 06].
The approach is formally based on the spectral decomposition of the gene
expression measurements with respect to the gene network seen as a graph,
followed by an attenuation of the high-frequency components of the expression
vectors with respect to the topology of the graph. We show how to derive
unsupervised clustering and supervised classification algorithms for expression
profiles, resulting in classifiers that can be easily interpreted in terms of pathways. We illustrate the relevance of our approach by analysing a gene expression
dataset monitoring the transcriptional response of irradiated and non-irradiated
yeast colonies [MBM+ 04]. We show that by filtering out 80% of the eigenmodes of the KEGG metabolic network in the gene expression profiles, we obtain an accurate and interpretable discriminative model that may lead to new biological insights.
2.2 Methods
In this section, we explain how a gene expression vector can be decomposed with respect to the eigenfunctions of a gene network, and how to derive unsupervised and supervised classification algorithms from this decomposition. Before describing the technical details of the method, we start with a brief non-technical overview of the approach.
Figure 2.1: Following the idea of Fourier decomposition (above), we can decompose a gene expression profile, here the first non-irradiated microarray sample
from our data set, into two parts: the smooth component and the high-frequency
component. We can then apply some filtering to attenuate or cancel the effect
of the high-frequency component.
2.2.1 Overview of the method
In this section we briefly outline the main features of our approach. We propose
a general mathematical formalism to include a priori the knowledge of a gene
network for the analysis of gene expression data. The method is independent of
the nature of the network, although we focus on the gene metabolic network as
an illustration in this paper. It is based on the hypothesis that genes close on
the network are likely to be co-expressed, and consequently that a biologically
relevant signal can be extracted from noisy gene expression measurement by
removing the “high-frequency” components of the gene expression vector over
the gene network. The extraction of the low-frequency component of a vector
is a classical operation in signal processing (see, e.g., figure 2.1), that can be
adapted to our problem using discrete Fourier transforms and spectral graph
analysis.
We show how this idea can be adapted to solve the problem of supervised classification of samples based on their gene expression microarray profiles. This is achieved by optimising a linear classifier such that the weights of genes linked together in the network tend to be similar, i.e., by forcing nearby genes to have similar contributions to the decision function. The resulting classifier can thereafter be easily interpreted by visual inspection of the weights over the gene network, or by subsequent extraction of clusters of genes on the network with similar contributions.
2.2.2 Spectral decomposition of gene expression profiles
We consider a finite set of genes V of cardinality |V| = n. The available gene network is represented by an undirected graph G = (V, E) without loops or multiple edges, in which the set of vertices V is the set of genes and E ⊂ V × V is the list of edges. We will use the notation u ∼ v to indicate that two genes u and v are neighbours in the graph, that is, (u, v) ∈ E. For any gene u, we denote by $d_u$ the degree of u in the graph, that is, its number of neighbours. Gene expression profiling gives a value of expression f(u) for each gene u, and is therefore represented by a function f : V → R.
The Laplacian of the graph G is the n × n matrix [Chu97]:
\[
\forall u, v \in V, \quad L(u, v) =
\begin{cases}
d_u & \text{if } u = v, \\
-1 & \text{if } u \sim v, \\
0 & \text{otherwise.}
\end{cases} \tag{2.1}
\]
The Laplacian is a central concept in spectral graph theory [Moh97] and shares
many properties with the Laplace operator on Riemannian manifolds. L is
known to be symmetric positive semidefinite and singular. We denote its eigenvalues by $0 = \lambda_1 \le \cdots \le \lambda_n$ and the corresponding eigenvectors by $e_1, \ldots, e_n$. The multiplicity of 0 as an eigenvalue is equal to the number of connected components of the graph, and the corresponding eigenvectors are constant on each
connected component. The eigen-basis of L forms a Fourier basis and a natural
theory for Fourier analysis and spectral decomposition on graphs can thus be
derived [Chu97]. Essentially, the eigenvectors with increasing eigenvalues tend
to vary more abruptly on the graph, and the smoothest functions (constant on
each connected component) are associated with the smallest (zero) eigenvalue.
For a good example, see figure 2.2. In particular, the Fourier transform $\hat{f} \in \mathbb{R}^n$ of any expression profile $f$ is defined by:
\[
\hat{f}_i = \sum_{u \in V} e_i(u) f(u), \quad i = 1, \ldots, n.
\]
The eigenvectors of L form an orthonormal basis and the expression profile $f$ can therefore be recovered from its Fourier transform $\hat{f}$ by the simple formula:
\[
f = \sum_{i=1}^n \hat{f}_i e_i. \tag{2.2}
\]
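In computational terms, this decomposition requires only one symmetric eigendecomposition. The following Python sketch is our own illustration (the adjacency matrix is assumed dense and unweighted); it builds the Laplacian (2.1) and returns the Fourier coefficients of a profile.

```python
import numpy as np

def graph_fourier(adjacency, f):
    """Spectral decomposition of an expression profile f on a gene
    network: build the Laplacian (2.1), diagonalise it, and return
    the Fourier coefficients of f in the eigenvector basis."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    # eigh returns eigenvalues in increasing order: smooth modes first.
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    f_hat = eigenvectors.T @ f     # \hat{f}_i = sum_u e_i(u) f(u)
    return eigenvalues, eigenvectors, f_hat
```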
Like its continuous counterpart, the discrete Fourier transform can be used
for smoothing or for extracting features. Here, our hypothesis is that analysing
a gene expression profile from its Fourier transform with respect to an a priori
given gene network is a practical way to decompose the expression profile into
biologically interpretable information and filter out the noise. In the next two
sections we illustrate the potential applications of this approach by describing
how this leads to a natural definition for distances between expression profiles,
and how this distance can be used for classification or regression purposes.
2.2.3 Deriving a metric for expression profiles
The definition of new metrics on expression profiles that incorporate information encoded in the graph structure is a first possible application of the spectral decomposition.

Figure 2.2: Four examples of Laplacian eigenvectors of the main component of KEGG. The colours correspond to the coefficients of the eigenvectors: positive coefficients are marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the coefficients. On the upper-left side is the eigenvector associated with the smallest eigenvalue, on the upper-right side the one associated with the second smallest eigenvalue, on the lower-left side the one associated with the third smallest eigenvalue, while on the lower-right side is the one associated with the largest eigenvalue. The larger the eigenvalue, the less smooth the corresponding eigenvector.

Following the classical methodology in Fourier analysis, we assume that the signal captured in the low-frequency component of the expression profiles contains the most biologically relevant information, particularly the general expression trends, whereas the high-frequency components are more likely to be measurement noise. For example, the low-frequency component of an expression vector on the gene metabolic network should reveal areas of positive and negative expression on the graph that are likely to correspond to the activation or inhibition of specific branches of the graph. We can translate this idea mathematically by considering the following class of transformations for expression profiles:
\[
\forall f \in \mathbb{R}^V, \quad S_\phi(f) = \sum_{i=1}^n \hat{f}_i \phi(\lambda_i) e_i, \tag{2.3}
\]
where $\phi : \mathbb{R} \to \mathbb{R}$ is a non-increasing function that quantifies how each frequency is attenuated. For example, if we take $\phi(\lambda) = 1$ for all $\lambda$, we get from (2.2) that the profile does not change, that is, $S_\phi(f) = f$. However, if we take:
\[
\phi_{thres}(\lambda) =
\begin{cases}
1 & \text{if } 0 \le \lambda \le \lambda_0, \\
0 & \text{if } \lambda > \lambda_0,
\end{cases} \tag{2.4}
\]
we produce a low-pass filter that removes all the frequencies from $f$ above the threshold $\lambda_0$. Finally, a function of the form:
\[
\phi_{exp}(\lambda) = \exp(-\beta\lambda), \tag{2.5}
\]
for some $\beta > 0$, keeps all the frequencies but strongly attenuates the high-frequency components.
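Given the eigendecomposition of the Laplacian (as returned, for instance, by the sketch above), both filters amount to a simple reweighting of the Fourier coefficients before reconstruction. The following is an illustrative sketch, not production code.

```python
import numpy as np

def low_pass(eigenvalues, eigenvectors, f, lambda0):
    """Transformation (2.3) with the threshold filter (2.4): keep only
    the eigenmodes with eigenvalue at most lambda0."""
    f_hat = eigenvectors.T @ f
    phi = (eigenvalues <= lambda0).astype(float)
    return eigenvectors @ (phi * f_hat)              # S_phi(f)

def exp_smooth(eigenvalues, eigenvectors, f, beta):
    """Same transformation with the exponential filter (2.5)."""
    f_hat = eigenvectors.T @ f
    return eigenvectors @ (np.exp(-beta * eigenvalues) * f_hat)
```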
If Sφ (f ) includes the biologically relevant part of the expression profile, we
can compare two expression profiles f and g through their representations Sφ (f )
and Sφ (g). This leads to the following metric between the profiles:
\[
d_\phi(f, g)^2 = \|S_\phi(f) - S_\phi(g)\|^2
= \sum_{i=1}^n \left(\hat{f}_i - \hat{g}_i\right)^2 \phi(\lambda_i)^2.
\]
We note that this Euclidean metric over expression profiles is associated with
the following inner products:
\[
\langle f, g \rangle_\phi
= \sum_{i=1}^n \hat{f}_i \hat{g}_i \phi(\lambda_i)^2
= \sum_{i=1}^n f^\top e_i e_i^\top g \, \phi(\lambda_i)^2
= f^\top K_\phi g, \tag{2.6}
\]
where $K_\phi = \sum_{i=1}^n \phi(\lambda_i)^2 e_i e_i^\top$ is the positive semidefinite matrix obtained by modifying the eigenvalues of $L$ through $\phi$. For example, taking $\phi(\lambda) = \exp(-\beta\lambda)$ leads to $K_\phi = \exp_M(-\beta L)$, where $\exp_M$ denotes the matrix exponential. This
observation shows that working with filtered expression profiles (2.3) is equivalent to defining a kernel (2.6) over the set of expression profiles, in the context
of support vector machines and kernel methods [SS02, STV04]. This possibility
is further explored in the next section.
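For concreteness, the kernel $K_\phi$ can be assembled directly from the eigendecomposition of the Laplacian, for any choice of the filter $\phi$; the sketch below is our own generic construction, faithful to the definition in (2.6).

```python
import numpy as np

def spectral_kernel(adjacency, phi):
    """Kernel K_phi of (2.6): modify the Laplacian eigenvalues through
    the filter phi and rebuild the matrix,
    K_phi = sum_i phi(lambda_i)^2 e_i e_i'."""
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    weights = phi(eigenvalues) ** 2
    return eigenvectors @ np.diag(weights) @ eigenvectors.T

# Example: the exponential filter of (2.5), with an arbitrary beta.
# K = spectral_kernel(A, lambda lam: np.exp(-0.5 * lam))
```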
2.2.4 Supervised learning and regression
The construction of predictive models for a property or phenotype of interest
from the gene expression profiles of the studied samples is a second possible
application of the spectral decomposition of expression profiles on the gene
network. Typical applications include predicting cancer diagnosis or prognosis from gene expression data, or discriminating between different treatments
applied to micro-organisms. Most approaches presented so far build predictive
models from the gene expression alone, and then check whether the predictive
model is biologically relevant by studying, for example, whether genes with high
weights are located in similar pathways. However, the selected genes often carry no clear biological meaning. Here, we propose a method combining both steps in a single
predictive model that is trained by forcing some form of biological relevance.
We use linear predictive models to predict a variable of interest y from an
expression profile f . They are obtained by solving the following optimisation
problem:
\[
\min_{w \in \mathbb{R}^n} \sum_{i=1}^p l(w^\top f_i, y_i) + C\|w\|^2, \tag{2.7}
\]
where $(f_1, y_1), \ldots, (f_p, y_p)$ is a training set of profiles labelled with the variable $y$ to be predicted, and $l$ is a loss function that measures the cost of predicting $w^\top f_i$ instead of $y_i$. For example, the popular support vector machine [BGV92a, SS02, STV04] is a particular case of equation (2.7) in which $y$ can take values in $\{-1, +1\}$ and $l(u, y) = \max(0, 1 - yu)$ is the hinge loss function; ridge regression is obtained for $y \in \mathbb{R}$ by taking $l(u, y) = (u - y)^2$ [HTF01].
Here, we do not apply algorithms of the form (2.7) directly to the expression
profiles f , but to their images Sφ (f ). That is, we consider the problem:
\[
\min_{w \in \mathbb{R}^n} \sum_{i=1}^p l(w^\top S_\phi(f_i), y_i) + C\|w\|^2. \tag{2.8}
\]
We claim that by solving (2.8) we will find a linear predictor over the original expression profiles that tends to be smooth on the gene network. Indeed, for any $w \in \mathbb{R}^n$, let $v = K_\phi^{1/2} w$. We first observe that for any $f \in \mathbb{R}^n$:
\[
w^\top S_\phi(f)
= w^\top \sum_{i=1}^n \hat{f}_i \phi(\lambda_i) e_i
= f^\top \sum_{i=1}^n e_i \phi(\lambda_i) e_i^\top w
= f^\top K_\phi^{1/2} w
= f^\top v,
\]
showing that the final predictor obtained by minimizing (2.8) is equal to $v^\top f$. Second, we note that:
\[
\|w\|^2 = w^\top w = v^\top K_\phi^{-1} v = \sum_{i=1}^n \frac{\hat{v}_i^2}{\phi(\lambda_i)^2},
\]
where the last equality remains valid if $K_\phi$ is not invertible, simply by not including in the sum the terms $i$ for which $\phi(\lambda_i) = 0$. This shows that (2.8) is equivalent to solving the following problem in the original space:
\[
\min_{v \in \mathbb{R}^n} \sum_{i=1}^p l(v^\top f_i, y_i) + C \sum_{i : \phi(\lambda_i) > 0} \frac{\hat{v}_i^2}{\phi(\lambda_i)^2}. \tag{2.9}
\]
Thus, the resulting algorithm amounts to finding a linear predictor $v$ that minimises the loss function of interest $l$, regularised by a term that penalises the high-frequency components of $v$. This differs from the classical regularisation $\|v\|^2$ used in (2.7), which only focuses on the norm of $v$. As a result, the linear predictor $v$ can be made smoother on the gene network by increasing the parameter $C$. This allows the prior knowledge to be directly included, because genes in similar pathways would be expected to contribute similarly to the predictive model.
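To illustrate the procedure, the sketch below trains a linear SVM on threshold-filtered profiles and maps the learned weights back to the original genes, giving the predictor $v$ of (2.9). It relies on scikit-learn and a precomputed Laplacian eigendecomposition; all names and the choice of library are ours, and scikit-learn's regularisation parameter plays the role of, but is not identical to, the $C$ of (2.8).

```python
import numpy as np
from sklearn.svm import SVC

def smoothed_svm(F, y, eigenvalues, eigenvectors, lambda0, C=1.0):
    """Train the SVM of (2.8): filter each profile (rows of F) with the
    threshold filter (2.4), then fit a linear SVM on the result."""
    phi = (eigenvalues <= lambda0).astype(float)
    smoother = eigenvectors @ np.diag(phi) @ eigenvectors.T  # f -> S_phi(f)
    S = F @ smoother                       # smoother is symmetric
    clf = SVC(kernel="linear", C=C).fit(S, y)
    v = smoother @ clf.coef_.ravel()       # weights over the original genes
    return clf, v
```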
There are two consequences of this procedure. Firstly, if the true predictor really is smooth on the graph, the formulation (2.9) can help the algorithm focus on plausible models even with very little training data, resulting in a better estimation. As a result, we can expect better predictive performance. Secondly, by forcing the predictive model $v$ to be smooth on the graph, biological interpretation of the model should become easier: one can inspect the areas of the graph in which the predictor is strongly positive or negative. The model should thus be easier to interpret than models resulting from the direct optimisation of equation (2.7).
2.3 Data
We collected the expression data from a study analysing the effect of low irradiation doses on Saccharomyces cerevisiae strains [MBM+ 04]. The first group
of extracted expression profiles was a set of twelve independent yeast cultures
grown without radiation (not irradiated, NI). From this group, we excluded an
outlier that the author of the article indicated to us. The second group was a
set of six independent irradiated (I) cultures exposed to a dose of 15-20 mGy/h
for 20h. This dose of irradiation produces no mutagenic effects, but induces
transcriptional changes. We used the same normalization method as in the first
study of this data (Splus LOWESS function, see [MBM+ 04] for details), then we
attempted (1) to separate the NI samples from the I ones, and (2) to understand
the difference between the two populations in terms of metabolic pathways.
The gene network model used to analyse the gene expression data was therefore built from the KEGG database of metabolic pathways [KGK+ 04]. The metabolic gene network is a graph in which the enzymes are vertices, and an edge between two enzymes indicates that the product of a reaction catalysed by the first enzyme is the substrate of a reaction catalysed by the second enzyme. We reconstructed this network from the KGML v0.3 version of KEGG, resulting in 4694 edges between 737 genes. We kept only the largest connected component (containing 713 genes) for further spectral analysis.
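A sketch of this preprocessing step, assuming the networkx library and an edge list of gene pairs already parsed from the KGML files (the parsing itself is omitted):

```python
import networkx as nx

def largest_component(edges):
    """Build the metabolic gene network from a list of gene pairs and
    keep only its largest connected component for spectral analysis."""
    graph = nx.Graph(edges)
    biggest = max(nx.connected_components(graph), key=len)
    return graph.subgraph(biggest).copy()
```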
2.4 Results

2.4.1 Unsupervised classification
First, we tested the general effect of modifying the distances between expression
profiles using the KEGG metabolic pathways as background information in
an unsupervised setting. We calculated the pairwise distances between all 17
expression profiles after applying the transformations defined by the filters (2.4)
and (2.5), over a wide range of parameters. We assessed whether the resulting
distances were more coherent with a biological interpretation by calculating the
ratio of intraclass distances over all pairwise distances, defined by:
\[
r = \frac{\sum_{u_1, v_1 \in V_1} d(u_1, v_1)^2 + \sum_{u_2, v_2 \in V_2} d(u_2, v_2)^2}{\sum_{u, v \in V} d(u, v)^2},
\]
where $V_1$ and $V_2$ are the two classes of points. We compared the results with those obtained by replacing KEGG with a random network, produced by keeping the same graph structure but randomly permuting the vertices, in order to assess the significance of the results. We generated 100 such networks
to give an average result with a standard deviation. Figure 2.3 shows the result for the function $\phi_{exp}(\lambda) = \exp(-\beta\lambda)$ with varying $\beta$ (left), and for the function $\phi_{thres}(\lambda) = 1(\lambda < \lambda_0)$ with varying $\lambda_0$ (right). We observe that, apart from very
small values of $\beta$, the change of metric with the $\phi_{exp}$ function performs worse than that of a random network.

Figure 2.3: Performance of the unsupervised classification after changing the metric with the function $\phi(\lambda) = \exp(-\beta\lambda)$ for different values of $\beta$ (left), or with the function $\phi(\lambda) = 1(\lambda < \lambda_0)$ with varying $\lambda_0$, that is, by keeping a variable number of smallest eigenvalues (right). The red curve is obtained with the KEGG network. The black curves show the result (mean and one standard deviation interval) obtained with a random network.

Figure 2.4: PCA plots of the initial expression profiles (a) and the transformed profiles using network topology (80% of the eigenvalues removed) (b). The green squares are non-irradiated samples and the red rhombuses are irradiated samples. Individual sample labels are shown together with GO and KEGG annotations associated with each principal component.

The second method (filtering out the high-frequency components of the gene expression vector), in which up to 80% of the
eigenvectors are removed, performs significantly better than that of a random
network. When only the top 3% of the smoothest eigenvectors are kept, the
performance is similar to that of a random network, and when only the top 1%
is kept, the performance is significantly worse. This explains the disappointing
results obtained with the φexp function: by giving more weight to the small
eigenvalues exponentially, the method focuses on those first few eigenvectors
that, as shown by the second method, do not provide a geometry compatible
with the separation of samples into two classes. From the second plot, we can
infer that at least 20% of the KEGG eigenvectors should be given sufficient
weight to obtain a geometry compatible with the classification of the data in
this case.
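For reference, the intraclass ratio $r$ used throughout this section can be computed from any matrix of pairwise distances; the short sketch below is a straightforward transcription of its definition, with all names our own.

```python
import numpy as np

def intraclass_ratio(D, labels):
    """Ratio r of intraclass squared distances over all pairwise squared
    distances; D is the matrix of pairwise distances between samples."""
    D2 = D ** 2
    intra = sum(D2[np.ix_(labels == c, labels == c)].sum()
                for c in np.unique(labels))
    return intra / D2.sum()
```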
2.4.2 PCA analysis
We carried out a principal component analysis (PCA, [Jol96]) on the original
expression vectors f and compared this with a PCA of the transformed set of
vectors Sφ (f ) obtained with the function φthres to further investigate the effect
of filtering out the high frequencies of the expression profiles on their relative
positions.
Analysis of the initial sample distribution (figure 2.4) shows that the first principal component can partially separate irradiated from non-irradiated samples, with the exception of the two irradiated samples "I1" and "I2", as they have larger projections onto the third principal component than onto the first one.
The experimental protocol revealed that these two samples were affected by
higher doses of radiation than the four other samples.
Gene Ontology analysis of the genes that contribute most to the first principal component shows that the "pyruvate metabolism", "glucose metabolism", "carbohydrate metabolism" and "ergosterol biosynthesis" ontologies (here we list only independent ontologies) are over-represented (with p-values less than $10^{-10}$). The second component is associated with the "trehalose biosynthesis" and "carboxylic acid metabolism" ontologies, and the third principal component is associated with the KEGG glycolysis pathway. The first three principal components collect 25%, 17% and 11% of the total dispersion.
The transformation (2.3) resulting from a step-like attenuation of eigenvalues
φthres removing 80% of the largest eigenvalues significantly changes the global
layout of data (figure 2.4, right) but generally preserves the local neighbourhood
relationships. The first three principal components collect 28%, 20% and 12%
of the total dispersion, which is only slightly higher than the PCA plot of the
initial profiles. The general tendency is that the non-irradiated normal samples
are more closely grouped, which explains the lower intraclass distance values
shown in figure 2.3. The principal components can in this case be associated with gene ontologies with higher confidence (for the first component, the p-values are less than $10^{-25}$). This is a direct consequence of the fact that the principal components are constrained to belong to a subspace of smooth functions on KEGG, giving coherence in terms of pathways to the genes contributing to the components. The first component gives "DNA-directed RNA polymerase activity", "RNA polymerase complex" and "protein kinase activity". Figure 2.6
shows that these are the most connected clusters of the whole KEGG network.
The second component is associated with ”purine ribonucleotide metabolism”,
”RNA polymerase complex”, ”carboxylic acid metabolism” and ”acetyl-CoA
metabolism” ontologies and also with ”Glycolysis/Gluconeogenesis”, ”Citrate
cycle (TCA cycle)” and ”Reductive carboxylate cycle (CO2 fixation)” KEGG
pathways. The third component is associated with ”prenyltransferase activity”, ”lyase activity” and ”aspartate family amino acid metabolism” ontologies
and with ”N-Glycan biosynthesis”, ”Glycerophospholipid metabolism”, ”Alanine and aspartate metabolism” and ”riboflavin metabolism” KEGG pathways.
Thus, the PCA components of the transformed expression profiles are affected
both by network features and by the microarray data.
2.4.3 Supervised classification
We tested the performance of supervised classification after modifying the distances, using a support vector machine (SVM) trained to discriminate irradiated
samples from non-irradiated samples. For each change of metric, we estimated
the performance of the SVM from the total number of misclassifications and
the total hinge loss using a ”leave-one-out” (LOO) approach. This approach
removes each sample in turn, trains a classifier on the remaining samples and
then tests the resulting classifier on the removed sample. For each fold, the
regularisation parameter was selected from the training set only, by minimizing the classification error estimated with an internal LOO experiment. The calculations were carried out using the svmpath package in the R computing environment.

Figure 2.5: Performance of the supervised classification when changing the metric with the function $\phi_{exp}(\lambda) = \exp(-\beta\lambda)$ for different values of $\beta$ (left picture), or the function $\phi_{thres}(\lambda) = 1(\lambda < \lambda_0)$ for different values of $\lambda_0$ (i.e., keeping only a fraction of the smallest eigenvalues, right picture). The performance is estimated from the number of misclassifications in a leave-one-out error.
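The experiments above used the svmpath package in R; purely as an analogue, the same nested leave-one-out protocol can be sketched in Python with scikit-learn, where the internal LOO selects the regularisation parameter from a small grid rather than along the full regularisation path.

```python
from sklearn.model_selection import LeaveOneOut, GridSearchCV
from sklearn.svm import SVC

def loo_misclassifications(S, y, Cs=(0.01, 0.1, 1.0, 10.0)):
    """Outer LOO for error estimation; an inner LOO, run on each
    training fold only, selects the regularisation parameter C."""
    errors = 0
    for train, test in LeaveOneOut().split(S):
        search = GridSearchCV(SVC(kernel="linear"), {"C": list(Cs)},
                              cv=LeaveOneOut())
        search.fit(S[train], y[train])
        errors += int(search.predict(S[test])[0] != y[test][0])
    return errors
```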
Figure 2.5 shows the classification results for the two high-frequency attenuation functions $\phi_{exp}$ and $\phi_{thres}$ with varying parameters. The baseline LOO error is 2 misclassifications for the SVM in the original Euclidean space. For the exponential variant ($\phi_{exp}(\lambda) = \exp(-\beta\lambda)$), we observe an irregular but clear degradation in performance for positive $\beta$, for both the hinge loss and the misclassification number. This is consistent with the result shown in figure 2.3, in which the change of metric towards the first few eigenvectors does not give a geometry coherent with the classification of samples into irradiated and non-irradiated, resulting in a poorer performance in supervised classification as well. For the second variant, in which the expression profiles are projected onto the eigenvectors of the graph with the smallest eigenvalues, we observe that the performance remains as accurate as the baseline performance until up to 80% of the eigenvectors are discarded, with the hinge loss even exhibiting a slight minimum in this region. This is consistent with the classes being more clustered in this case than in the original Euclidean space. Overall these results show that classification accuracy can be kept high even when the classifier is constrained to exhibit a certain coherence with the graph structure.
Figure 2.6: Global connection map of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (left) and using high-frequency eigenvalue attenuation (80% of high-frequency eigenvalues have been removed) (right). Spectral filtering divided the whole network into modules having coordinated responses, with the activation of low-frequency eigenmodes being determined by the microarray data. Positive coefficients are marked in red, negative coefficients are in green, and the intensity of the colour reflects the absolute values of the coefficients. Rhombuses highlight proteins participating in the Glycolysis/Gluconeogenesis KEGG pathway. Some other parts of the network are annotated, including big highly connected clusters corresponding to protein kinases and DNA and RNA polymerase sub-units.
2.4.4 Interpretation of the SVM classifier
Figure 2.6 shows the global connection map of KEGG generated from the connection matrix by the Cytoscape software [SMO+ 03]. The coefficients of the decision function $v$ of equation (2.9) for the classifier constructed either in the original Euclidean space or after filtering out the 80% top spectral components of the expression profiles are shown in colour. We used a color scale from green (negative weights) to red (positive weights) to provide an easy visualisation of the classifier's main features. Both classifiers give the same classification error, but the classifier constructed using the network structure can be more naturally interpreted, as the classifier variables are grouped according to their participation in the network modules.
Although from a biological point of view, very little can be learned from the
classifier obtained in the original Euclidean space (figure 2.6, left), it is indeed
possible to distinguish several features of interest for the classifier obtained in the
second case (figure 2.6, right). First, oxidative phosphorylation is found among
the pathways with the most positive weights, which is consistent with previous
analyses showing that this pathway tends to be up-regulated after irradiation
[MBM+ 04]. An important cluster involving the DNA and RNA polymerases is
also found to bear weights slightly above average in these experiments. Several
studies have previously reported the induction of genes involved in replication
and repair after high doses of irradiation [MDM+ 01], but the detection of such an
induction at the low irradiation doses used in the present biological experiments
is rather interesting. The strongly negative landscape of weights in the protein
kinases cluster has not been seen before and may lead to a new area of biological
study. Most of the kinases are involved in signalling pathways, and therefore
their low expression levels may have important biological consequences.
Figure 2.6 shows a highlighted pathway named "Glycolysis/Gluconeogenesis" in KEGG. A more detailed view of this pathway is shown in figure 2.7. This
pathway contains enzymes that are also used in many other KEGG pathways
and is therefore situated in the middle and most entangled part of the global
network. As already mentioned, this pathway is associated with the first and the
third principal components of the initial dataset. The pathway actually contains
two alternative sub-pathways that are affected differentially. Over-expression in
the gluconeogenesis pathway seems to be characteristic of irradiated samples,
whereas glycolysis has a low level of expression in that case. This shift can
be observed by changing from anaerobic to aerobic growth conditions (called
diauxic shift). The reconstruction of this from our data with no prior input of
this knowledge strongly confirms the relevance of our analysis method. It also
shows that analysing expression in terms of the global up- or down-regulation of
entire pathways as defined, for example, by KEGG, could be misleading as there
are many antagonistic processes that take place within pathways. By representing KEGG as a large network instead of a set of pathways, our approach helps maintain the biochemical relationships between genes beyond the constraints of pathway limits.

Figure 2.7: The glycolysis/gluconeogenesis pathways of KEGG with mapped coefficients of the decision function obtained by applying a customary linear SVM (a) and using high-frequency eigenvalue attenuation (b). The pathways are mutually exclusive in a cell, as clearly highlighted by our algorithm.

Once a classifier has been built using a priori knowledge of the network, the interpretation of the results (which genes contribute
the most to the classification) can be performed through visualisation of known
biochemical pathways, or extraction of gene clusters with similar contribution
to the classifier. Importantly these gene clusters result from a combined analysis
of the gene network and the gene expression data, and not from a prior analysis
of the gene network alone.
Figure 2.8 shows the weights of the two classifiers on the genes involved in pyrimidine metabolism, which is another pathway of interest.
2.5 Discussion
Our algorithm constructs a classifier in which the predictor variables are grouped
according to their neighborhood relations in the network. We assume that the
genes close on the network are likely to contribute similarly to the prediction
function. Our working hypothesis is that the genes close on the network should
have similar expression profiles. This hypothesis was validated in several studies that demonstrate that co-expressed genes tend to have similar biological
functions and vice versa (e.g., [SSKK03]). Our mathematical framework based
on spectral decomposition helps to systematically exploit this experimental fact
and include it in data analysis.
Nevertheless, one must understand that this tendency is only a trend, valid on average at a large scale. It is of course possible to find many local exceptions to it, for example when a signaling pathway or a metabolic cascade is influenced by the over- or under-expression of only one regulator without systematically affecting the expression of the rest of the pathway participants. Thus, our technique is rather coarse-grained: it does not allow us to infer a precise network logic, but rather detects the average excitation of relatively big network modules.
In our example we use a metabolic network as the gene network. Our hypothesis here is based on the fact that, for a smooth synthesis flow, all enzymes required for a metabolic cascade should be present in sufficient quantities, i.e., stably expressed. Conversely, various sensor and feedback mechanisms ensure that for inactive metabolic cascades the expression of the corresponding enzymes remains low. If this is true on average, then our technique will help to highlight active and inactive parts of the network. Several previous studies have highlighted the significant correlation that exists between gene expression and distance over the metabolic network, thus justifying our attempt [vHGW+ 01, HZZL02, VK03, KVC04, KCF+ 06]. For other network types, like transcriptional regulatory or signalling networks, more elaborate measures of "smoothness" are certainly needed to take into account the signs and directions of individual gene interactions.
Our working hypothesis motivates the filtering of the gene expression profiles in order to remove the noisy high-frequency modes of the network. Therefore, the variations of the weights of the classifier along the graph are of low frequency and should allow the grouping of variables, which is a very useful feature of the resulting classification function, as the function becomes meaningful for interpreting and suggesting biological factors that cause the class separation.

Figure 2.8: The pyrimidine metabolism pathways of the separator obtained with a Euclidean linear SVM (top) and our modified algorithm (bottom).

It allows classifications based on functions, pathways and network modules rather than on
individual genes. Classification based on pathways and network modules should
lead to a more robust behaviour of the classifier in independent tests with equal
if not better classification results. Our results on the dataset we analysed show
only a slight improvement, although this may be due to its limited size. The two
samples with different experimental settings are systematically misclassified in
both the initial and our smoothed classifier, which means that they are probably members of a "third" class which should be treated differently. The introduction of network topology cannot resolve this issue, but can help to understand which part of the network differentiates the outliers from the other members of the same class.
Interestingly, the constraint we impose on the smoothness of the classifier
can also be justified mathematically in the context of regularisation for statistical estimation. Classification of microarray data is an extremely challenging
problem because it usually involves a small number of samples in large dimension. Most statistical procedures developed in this context involve some form of
complexity reduction by imposing some constraints on the classifier. For example, perhaps the most widely-used complexity reduction method for microarray
data is to impose that the classifier has only a small number of non-zero weights,
which in practice amounts to selecting a small number of genes. Mathematically
speaking, this means constraining the L0 norm of the classifier to be small (the
L0 norm of a vector being the number of non-zero components). Alternatively,
methods like SVM constrain the L2 norm of the classifier to be small. Our
method can therefore be seen as just constraining a different norm of the classifier, for the purpose of regularisation. Of course the choice of regularisation
should be related to the problem at hand: it corresponds to our prior belief of
what the optimal classifier, that would be discovered if enough samples were
available, looks like. Performing feature selection implicitly corresponds to the
assumption that the optimal classifier relies on a small number of genes, which
can be a reasonable assumption in some cases. Our focus on the smoothness of
the classifier on the gene network corresponds to a different implicit assumption,
namely, that the optimal classifier is likely to be smooth. This is justified in many
cases because the classes of samples to be predicted generally correspond to
differences in the regulation of one or several pathways. Of course if this turns
out not to be the case, reducing the effect of regularisation by decreasing the
parameter C in (2.9) allows a non-smooth classifier to be learned as well.
An important remark to bear in mind when interpreting pictures such as figures 2.6 and 2.8 is that the colors represent the weights of the classifier, and not gene
expression levels. There is of course a relationship between the classifier weights
and the typical expression levels of genes in irradiated and non-irradiated samples: irradiated samples tend to have expression profiles positively correlated
with the classifier, while non-irradiated samples tend to be negatively correlated. Roughly speaking, the classifier tries to find a smooth function that has
this property. This means in particular that the pictures provide virtually no
information regarding the over- or under-expression of individual genes, which
is the cost to pay to obtain instead an interpretation in terms of more global
2.5. DISCUSSION
45
pathways. Constraining the classifier to rely on just a few genes would have a
similar effect of reducing the complexity of the problem, but would lead to a
more difficult interpretation in terms of pathways.
An important advantage of our approach over other pathway-based clustering methods is that we consider the network modules that naturally appear from spectral analysis rather than a historically defined separation of the network into pathways. Thus, cross-talk between pathways is taken into account, which is difficult to do using other approaches. It can however be noticed that the implicit decomposition into pathways that we obtain is biased by the very incomplete knowledge of the network, and that certain regions of the network are better understood, leading to a higher concentration of connections.
Another important feature of this approach is that we make no strong assumption on the nature of the graph, and that the method can in principle be
applied with a variety of other graphs, such as protein-protein interaction networks or co-expression networks. We leave this avenue open for future research.
On the other hand, like most approaches aiming at comparing expression data with gene networks such as KEGG, the scope of this work is limited by two important constraints. First, the gene network we use is only a convenient but rough approximation for describing complex biochemical processes; second, the transcriptional analysis of a sample cannot give any information regarding post-transcriptional regulation and modifications. Nevertheless, we believe that our basic assumptions remain valid, in that we assume that the expressions of the genes belonging to the same metabolic pathway module are coordinately regulated. Our interpretation of the results supports this assumption.
Another important caveat is that we simplify the network description to an undirected graph of interactions. Although this seems a relevant simplification for, e.g., protein-protein interaction networks, metabolic networks have in reality a more complex nature. Similarly, gene regulation networks are influenced by the direction, sign and strength of each interaction. The incorporation of weights into the Laplacian (equation 2.1) is straightforward and extends the approach to weighted undirected graphs, but the choice of the weights remains delicate, since the importance of an interaction may be difficult to quantify. Conversely, the directions and signs that accompany signalling or regulatory pathways are generally known, but their incorporation requires more work. It could nevertheless lead to important advances for the interpretation of microarray data, for example in cancer studies.
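To make the weighted extension mentioned above concrete, the following is a minimal sketch in Python/NumPy (an illustration only, not part of the implementation used in this work), assuming a symmetric matrix W of non-negative interaction weights:

import numpy as np

def weighted_laplacian(W):
    """Graph Laplacian L = D - W for a weighted undirected graph.

    W is assumed symmetric with non-negative entries; W[i, j] > 0 encodes
    the (quantified) strength of the interaction between genes i and j."""
    W = np.asarray(W, dtype=float)
    D = np.diag(W.sum(axis=1))   # weighted degree matrix
    return D - W                 # positive semi-definite when W >= 0

For a binary adjacency matrix this reduces to the unweighted Laplacian of equation 2.1.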
Conclusions
We have presented a general framework to analyse gene expression data when
a gene network is known a priori. The approach involves the attenuation of the
high-frequency content of the gene expression vectors with respect to the graph.
We derived algorithms for unsupervised clustering and supervised classification,
which enforce some level of smoothness on the gene network for the classifier.
This enforcement can be considered as a means of reducing the high dimension of the variable space, using the available knowledge about the gene network. No prior decomposition of the gene network into modules or pathways is needed, and the method can in principle work with a variety of gene networks.
Acknowledgments
This work was supported by the grant ACI-IMPBIO-2004-47 of the French
Ministry for Research and New Technologies and by the EC contract ESBIC-D
(LSHG-CT-2005-518192). We thank Sabrina Carpentier and Severine Lair from
the Service de Bioinformatique of the Institut Curie for the help they provided
with the normalization of the microarray data.
Chapter 3

Fused SVM for arrayCGH classification
This work has already been accepted in a slightly different form at the International Conference on Intelligent Systems for Molecular Biology 2008, under the title “Classification of arrayCGH using a fused SVM”, co-authored with Emmanuel Barillot and Jean-Philippe Vert.
3.1 Introduction
Genome integrity is essential to cell life and is ensured in normal cells by a series of checkpoints, which enable DNA repair or trigger cell death to prevent cells with abnormal genomes from appearing. The p53 protein is probably the most prominent protein known to play this role. When these checkpoints are bypassed, the genome may evolve and undergo alterations to a point where the cell becomes premalignant, and further genome alterations lead to invasive cancers.
This genome instability has been shown to be an enabling characteristic of
cancer [HW00], and almost all cancers are associated with genome alterations.
These alterations may be single mutations, translocations, or copy number variations (CNVs). A CNV can be a deletion or a gain of small or large DNA
regions, an amplification, or an aneuploidy (change in chromosome number).
Many cancers present recurrent CNVs of the genome, like for example monoploidy of chromosome 3 in uveal melanoma [SPdM+ 94], loss of chromosome
9 and amplification of the region of cyclin D1 (11q13) in bladder carcinomas [BBR+ 05], loss of 1p and gain of 17q in neuroblastoma [BLC+ 01,VRVB+ 02],
EGFR amplification and deletion in 1p and 19q in gliomas [IML+ 07], or amplifications of 1q, 8q24, 11q13, 17q21-q23, and 20q13 in breast cancer [YWF+ 06].
Moreover, associations of specific alterations with clinical outcome have been described in many pathologies [LNM+ 97].
Recently array-based comparative genomic hybridization (arrayCGH) has
been developed as a technique allowing rapid mapping of CNVs of a tumor
sample at a genomic scale [PSS+ 98]. The technique was first based on arrays using a few thousand large-insert clones (such as BACs, with a Mb-range resolution) to interrogate the genome, and was then improved with oligonucleotide-based arrays consisting of several hundred thousand features, taking the resolution down to a few kb [Ger05]. Many projects have since been launched
to systematically detect genomic aberrations in cancer cells [vBN06, CWT+ 06,
SMR+ 03].
The etiology of cancer and the advent of arrayCGH make it natural to envisage building classifiers for prognosis or diagnosis based on the genomic profiles
of tumors. Building classifiers based on expression profiles is an active field
of research, but little attention has been paid yet to genome-based classification. [CWT+ 06] select a small subset of genes and apply a k-nearest neighbor
classifier to discriminate between estrogen-positive and estrogen-negative patients, between high-grade patients and low-grade patients and between bad
prognosis and good prognosis for breast cancer. [JFG+ 04] reduce the DNA copy
number estimates to “gains” and “losses” at the chromosomal arm resolution,
before using a nearest centroid method for classifying breast tumors according
to their grade. As underlined in [CWT+ 06], the classification accuracy reported in [JFG+ 04] is better than the one reported in [CWT+ 06], but the error rate remains fairly high, with as many as 24% of samples misclassified in the balanced problem. This may be related to the higher resolution of the arrays produced by [JFG+ 04]. Moreover, the approach used by [JFG+ 04] produces a classifier that is difficult to interpret, as it is unable to detect any deletion or amplification that occurs at a local level. [OBS+ 03] used a support vector machine (SVM) classifier, taking as variables all BAC ratios without missing values, and were able to identify key copy number alterations (CNAs).
The methods developed so far either ignore the particularities of arrayCGH
and the inherent correlation structure of the data [OBS+ 03], or drastically reduce the complexity of the data at the risk of filtering out useful information [JFG+ 04,CWT+ 06]. In all cases, a reduction of the complexity of the data
or a control of the complexity of the predictor estimated is needed to overcome
the risk of overfitting the training data, given that the number of probes that
form the profile is often several orders of magnitude larger than the number of
samples available to train the classifier.
In this chapter we propose a new method for supervised classification, specifically designed for the processing of arrayCGH profiles. In order not to miss
potentially relevant information that may be lost if the profiles are first processed and reduced to a small number of homogeneous regions, we estimate
directly a linear classifier at the level of individual probes. Yet, in order to
control the risk of overfitting, we define a prior on the linear classifier to be estimated. This prior encodes the hypothesis that (i) many regions of the genome
should not contribute to the classification rule (sparsity of the classifier), and
(ii) probes that contribute to the classifier should be grouped in regions on the
chromosomes, and be given the same weight within a region. This a priori information helps reduce the search space and produces a classification rule that
is easier to interpret. This technique can be seen as an extension of SVM where
the complexity of the classifier is controlled by a penalty function similar to the
one used in the fused lasso method to enforce sparsity and similarity between
successive features [TSR+ 05]. We therefore call the method a fused SVM. It
produces a linear classifier that is piecewise constant on the chromosomes, and
only involves a small number of loci without any a priori regularisation of the
data. From a biological point of view, it avoids the prior choice of recurrent regions of alterations, but produces a posteriori a selection of discriminant regions
which are then amenable to further investigations.
We test the fused SVM on several public datasets involving diagnosis and
prognosis applications in bladder and uveal cancer, and compare it with a more
classical method involving feature selection without prior information about the
organization of probes on the genome. In a cross-validation setting, we show
that the classification rules obtained with the fused SVM are systematically
more accurate than the rules obtained with the classical method, and that they
are also more easily interpretable.
3.2 Methods
In this section we present an algorithm for the supervised classification of arrayCGH data. This algorithm, which we call fused SVM, is motivated by the
linear ordering of the features along the genome and the strong dependency in behaviour of neighbouring features. The algorithm itself estimates a linear predictor by borrowing ideas from recent methods in regression, in particular the fused lasso [TSR+ 05]. We start with a brief description of the arrayCGH technology and data, before presenting the fused SVM in the context of regularized
linear classification algorithms.
3.2.1 ArrayCGH data
ArrayCGH is a microarray-based technology that allows the quantification of
the DNA copy number of a sample at many positions along the genome in a
single experiment. The array contains thousands to millions of spots, each of
them consisting of the amplified or synthesized DNA of a particular region of the
genome. The array is hybridized with the DNA extracted from a sample of interest, and in most cases with (healthy) reference DNA. Both samples have first
been labelled with two different fluorochromes, and the ratio of fluorescence of
both fluorochromes is expected to reveal the ratio of DNA copy number at each
position of the genome. The log-ratio profiles can then be used to detect the regions with abnormalities (log-ratio significantly different from 0), corresponding to gains (if the log-ratio is significantly above 0) or losses (if it is significantly below 0).
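As a toy illustration of this sign convention, the following Python/NumPy sketch computes a log-ratio profile from the two channel intensities and calls gains and losses; the fixed threshold here is an assumption standing in for a proper statistical test of departure from 0:

import numpy as np

def log_ratio_status(test_intensity, ref_intensity, threshold=0.3):
    # log2 ratio of the two fluorescence channels, one value per probe
    log_ratio = np.log2(np.asarray(test_intensity, float) /
                        np.asarray(ref_intensity, float))
    status = np.zeros(log_ratio.shape, dtype=int)
    status[log_ratio > threshold] = 1    # gain (log-ratio clearly above 0)
    status[log_ratio < -threshold] = -1  # loss (log-ratio clearly below 0)
    return log_ratio, status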
The typical density of arrayCGH ranges from 2400 BAC features in the
pioneering efforts, corresponding to one approximately 100 kb probe every Mb
[PSS+ 98], up to millions today, corresponding to one 25 to 70bp oligonucleotide
probe every few kb, or even tiling arrays [Ger05].
There are two principal ways to represent arrayCGH data: as a collection of log-ratios, or as a collection of statuses (lost, normal or gained, usually represented as −1, 0 and 1, corresponding to the sign of the log-ratio). The status representation has strong advantages over the log-ratio as it reduces the complexity of the data, provides the scientist with a direct identification of abnormalities and allows the straightforward detection of recurrent alterations. However, converting ratios into statuses is not always obvious and often implies a loss of information which can be detrimental to the study: for several reasons, such as heterogeneity of the sample or contamination with healthy tissue (which both result in cells with different copy numbers in the sample), the status may be difficult to infer from the data, whereas the use of the ratio values avoids this problem. Another problem is the limited subtlety of statuses. In particular, if we want to
use arrayCGH for discriminating between two subtypes of tumors or between
tumors with different future evolution, all tumors may share the same important genomic alterations that are easily captured by the status assignment while
differences between the types of tumors may be characterized by more subtle
signals that would disappear should we transform the log ratio values into statuses. Therefore, we consider below an arrayCGH profile as a vector of log-ratios
for all probes in the array.
3.2.2 Classification of arrayCGH data
While much effort has been devoted to the analysis of single arrayCGH profiles, or populations of arrayCGH profiles in order to detect genomic alterations
shared by the samples in the population, we focus on the supervised classification
of arrayCGH profiles. The typical problem we want to solve is, given two populations
of arrayCGH data corresponding to two populations of samples, to design a
classifier that is able to predict which population any new sample belongs to.
This paradigm can be applied to diagnosis or prognosis, where the populations correspond respectively to samples of different tumor types, or to samples with different evolutions. Although we only focus here on binary classification, the techniques
can be easily extended to problems involving more than two classes using, for
example, a series of binary classifiers trained to discriminate each class against
all others.
While accuracy is certainly the first quality we want the classifier to have
in real diagnosis and prognosis applications, it is also important to be able to
interpret it and understand what the classification is based on. Therefore we
focus on linear classifiers, which associate a weight to each probe and produce a
rule that is based on a linear combination of the probe log-ratios. The weight of a
probe roughly corresponds to its contribution in the final classification rule, and
therefore provides evidence about its importance as a marker to discriminate the
populations. It should be pointed out, however, that when correlated features
are present, the weight of a feature is not directly related to the individual
correlation of the feature with the classification, hence some care should be
taken in the interpretation of linear classifiers.
In most applications of arrayCGH classification, it can be expected that only
a limited number of regions on the genome should contribute to the classification, because most parts of the genome may not differ between populations.
Moreover, the notion of discriminative regions suggests that a good classifier
should detect these regions, and typically be piecewise constant over them. We
show below how to introduce these prior hypotheses into the linear classification
algorithm.
3.2.3 Linear supervised classification
Let us denote by p the number of probes hybridized on the arrayCGH. The result
of an arrayCGH competitive hybridization is then a vector of p log-ratios, which
we represent by a vector x in the vector space X = Rp of possible arrayCGH
profiles. We assume that the samples to be hybridized can belong to two classes,
which we represent by the labels −1 and +1. The classes typically correspond
to the disease status or the prognosis of the samples. The aim of binary classification is to find a decision function that can predict the class y ∈ {−1, +1}
of a data sample x ∈ X . Supervised classification uses a database of samples
x1 , ..., xn ∈ X for which the labels y1 , ..., yn ∈ {−1, +1} are known in order to
construct the prediction function. We focus on linear decision functions, which
are defined by functions of the form $f(x) = w^\top x$, where $w^\top$ is the transpose of a vector $w \in \mathbb{R}^p$. The class prediction for a profile x is then +1 if f(x) ≥ 0, and −1 otherwise. Training a linear classifier amounts to estimating a vector $w \in \mathbb{R}^p$ from prior knowledge and the observation of the labeled training set.
The training set can be used to assess whether a candidate vector w can
correctly predict the labels on the training set; one may expect such a w to
correctly predict the classes of unlabeled samples as well. This induction principle, sometimes referred to as empirical risk minimization, is however likely to
fail in our situation where the dimension of the samples (the number of probes)
is typically larger than the number of training points. In such a case, many
vectors w can indeed perfectly explain the labels of the training set, without
capturing any biological information. These vectors are likely to poorly predict
the classes of new samples. A well-known strategy to overcome this overfitting issue, in particular when the dimension of the data is large compared to
the number of training points available, is to look for large-margin classifiers
constrained by regularization [Vap98]. A large-margin classifier is a prediction
function f (x) that not only tends to produce the correct sign (positive for labels +1, negative for class −1), but also tends to produce large absolute values.
This can be formalized by the notion of margin, defined as yf (x): large-margin
classifiers try to predict the class of a sample with large margin. Note that the
prediction is correct if the margin is positive. The margin can be thought of
as a measure of confidence in the prediction given by the sign of f , so a large
margin is synonymous with a large confidence. Training a large-margin classifier
means estimating a function f that takes large margin values on the training
set. However, just like for the sign of f , if p > n then it is possible to find
vectors w that lead to arbitrarily large margins on all points of the training set.
In order to control this overfitting, large-margin classifiers try to maximize the
margin of the classifier on the training set under some additional constraint on
the classifier f , typically that w is not too “large”. In summary, large-margin
classifiers find a trade-off between the objective to ensure large margin values
on the training set, on the one hand, and that of controlling the complexity
of the classifier, on the other hand. The balance in this trade-off is typically
controlled by a parameter of the algorithm.
More formally, large-margin classifiers typically require the definition of two
ingredients:
• A loss function l(t) that is “small” when t ∈ R is “large”. From the loss function one can deduce the empirical risk of a candidate vector w, given by the average loss applied to the margins of w on the training set:

$$R_{emp}(w) = \frac{1}{n}\sum_{i=1}^{n} l\left(y_i w^\top x_i\right). \qquad (3.1)$$

The smaller the empirical risk, the better w fits the training set in the sense of having a large margin. Typical loss functions are the hinge loss $l(t) = \max(0, 1-t)$ and the logit loss $l(t) = \log(1 + e^{-t})$.
• A penalty function Ω(w) that measures how “large” or how “complex” w is. Typical penalty functions are the L1 and L2 norms of w, defined respectively by $\|w\|_1 = \sum_{i=1}^{p} |w_i|$ and $\|w\|_2 = \left(\sum_{i=1}^{p} w_i^2\right)^{1/2}$.
Given a loss function l and a penalty function Ω, large-margin classifiers can then be trained on a given training set by solving the following constrained optimization problem:

$$\min_{w \in \mathbb{R}^p} R_{emp}(w) \quad \text{subject to} \quad \Omega(w) \le \mu, \qquad (3.2)$$
where µ is a parameter that controls the trade-off between fitting the data, i.e., minimizing $R_{emp}(w)$, and controlling the regularity of the classifier, i.e., controlling Ω(w). Examples of large-margin classifiers include the support vector machine (SVM) and kernel logistic regression (KLR), obtained by combining respectively the hinge and logit losses with the L2 norm penalty [CV95, BGV92b, Vap98], or the 1-norm SVM, when the hinge loss is combined with the L1 penalty.
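For illustration, these ingredients are straightforward to compute; the following Python/NumPy sketch (with made-up inputs) evaluates the hinge and logit losses, the empirical risk (3.1), and the two norms:

import numpy as np

def hinge(t):
    return np.maximum(0.0, 1.0 - t)          # l(t) = max(0, 1 - t)

def logit(t):
    return np.log1p(np.exp(-t))              # l(t) = log(1 + exp(-t))

def empirical_risk(w, X, y, loss=hinge):
    margins = y * (X @ w)                    # y_i w'x_i for each sample
    return loss(margins).mean()              # equation (3.1)

w = np.array([0.5, -1.0, 0.0])
l1 = np.abs(w).sum()                         # ||w||_1 = 1.5
l2 = np.sqrt((w ** 2).sum())                 # ||w||_2 ~ 1.118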
The final classifier depends on both the loss function and the penalty function. In particular, the penalty function is useful to include prior knowledge or
intuition about the classifier one expects. For example, the L1 penalty function
is widely used because it tends to produce sparse vectors w, therefore performing an automatic selection of features. This property has been successfully
used in the context of regression [Tib96], signal representation [CDS98], survival
analysis [Tib97], logistic regression [GAL+ 07, KHCF04], or multinomial logistic
regression [KCFH05], where one expects to estimate a sparse vector.
3.2.4 Fused lasso
Some authors have proposed to design specific penalty functions as a means to
encode specific prior information about the expected form of the final classifier.
In the context of regression applied to signal processing, when the data is a
time series, [LF96] propose to encode the expected positive correlation between
successive variables by choosing a regularisation term that forces successive
variables of the classifier to have similar weights. More precisely, assuming that
the variables w1 , w2 , . . . , wp are sorted in a natural order where many pairs
of successive values are expected to have the same weight, they propose the
variable fusion penalty function:

$$\Omega_{fusion}(w) = \sum_{i=1}^{p-1} |w_i - w_{i+1}|. \qquad (3.3)$$
Plugging this penalty function in the general algorithm (3.2) enforces a solution
w with many successive values equal to each other, that is, tends to produce a
piecewise constant weight vector. In order to combine this interesting property
with a requirement of sparseness of the solution, [TSR+ 05] proposed to combine
the lasso penalty and the variable fusion penalty into a single optimization
problem with two constraints, namely:
min Remp (w)
w∈Rn
under the constraints
n−1
!
i=1
|wi − wi+1 | ≤ µ
'w'1 ≤ λ ,
(3.4)
where λ and µ are two parameters that control the relative trade-offs between
fitting the training data (small Remp ), enforcing sparsity of the solution (small
λ) and enforcing the solution to be piecewise constant (small µ). When the
empirical loss is the mean square error in regression, the resulting algorithm
is called fused lasso. This method was illustrated in [TSR+ 05] with examples
taken from gene expression datasets and mass spectrometry. Later, [TW07] proposed a variant of the fused lasso for the purpose of signal smoothing, and illustrated it on the problem of discretising noisy CGH profiles.
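As a small numerical illustration of the two penalties combined in (3.4) (Python/NumPy, toy values):

import numpy as np

w = np.array([0.0, 0.0, 1.0, 1.0, 1.0, 0.0])   # sparse and piecewise constant
fusion = np.abs(np.diff(w)).sum()   # sum_i |w_i - w_{i+1}| = 2 (two jumps)
lasso = np.abs(w).sum()             # ||w||_1 = 3

The fused lasso keeps both quantities small, which favours exactly this kind of sparse, piecewise constant weight vector.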
3.2.5 Fused SVM
Remembering from Section 3.2.2 that for arrayCGH data classification one typically expects the “true” classifier to be sparse and piecewise constant along the
genome, we propose to extend the fused lasso to the context of classification and to adapt it to the chromosome structure of arrayCGH data. The extension of the fused lasso from regression to large-margin classification is obtained simply by plugging the fused lasso penalty constraints into a large-margin empirical risk in (3.4). In what follows we focus on the empirical risk (3.1) obtained
from the hinge loss, which leads to a simple implementation as a linear program (see Section 3.2.6 below). The extension to other convex loss functions,
in particular the logit loss function, results in convex optimization problems
with linear constraints that can be solved with general convex optimization
solvers [BV04a].
In the case of arrayCGH data, a minor modification to the variable fusion
penalty (3.3) is necessary to take into account the structure of the genome in
chromosomes. Indeed, two successive spots on the same chromosome are prone
to be subject to the same amplification and are therefore likely to have similar
weights on the classifier; however, this positive correlation is not expected across
different chromosomes. Therefore we restrict the pairs of successive features
appearing in the fusion constraint (3.3) to be consecutive probes on the same
chromosome.
We call the resulting algorithm a fused SVM, which can be formally written
as the solution of the following problem:

$$\min_{w \in \mathbb{R}^p} \sum_{i=1}^{n} \max\left(0, 1 - y_i w^\top x_i\right) \quad \text{under the constraints} \quad \sum_{i \sim j} |w_i - w_j| \le \mu \quad \text{and} \quad \sum_{i=1}^{p} |w_i| \le \lambda, \qquad (3.5)$$
where i ∼ j if i and j are the indices of successive spots on the same chromosome. As with the fused lasso, this optimisation problem tends to produce classifiers w with similar weights for consecutive features, while maintaining sparsity. The algorithm depends on two parameters, λ and µ, which are typically chosen by cross-validation on the training set. Decreasing λ tends to increase the sparsity of w, while decreasing µ tends to force successive spots to have the same weight.
This classification algorithm can be applied to CGH profiles, taking the
ratios as features. Due to the effect of both regularisation terms, we obtain a
sparse classification function that attributes similar weights to successive spots.
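The neighbour relation i ∼ j is easy to materialise in practice; a minimal sketch (Python/NumPy, assuming an array `chrom` giving the chromosome of each probe, ordered by genomic position):

import numpy as np

def chromosome_pairs(chrom):
    """Index pairs (i, i+1) of consecutive probes on the same chromosome."""
    chrom = np.asarray(chrom)
    keep = chrom[:-1] == chrom[1:]      # drop pairs that straddle a boundary
    idx = np.flatnonzero(keep)
    return list(zip(idx, idx + 1))

# e.g. chromosome_pairs([1, 1, 1, 2, 2]) -> [(0, 1), (1, 2), (3, 4)]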
3.2.6 Implementation of the fused SVM
Introducing slack variables, the problem described in (3.5) is equivalent to the following linear program:

$$\begin{aligned}
\min_{w,\alpha,\beta,\gamma}\; & \sum_{i=1}^{n} \alpha_i \quad \text{under the following constraints:}\\
& \alpha_i \ge 0, \quad \alpha_i \ge 1 - y_i w^\top x_i, && i = 1, \dots, n,\\
& \sum_{i=1}^{p} \beta_i \le \lambda, \quad \beta_i \ge w_i, \quad \beta_i \ge -w_i, && i = 1, \dots, p,\\
& \sum_{k=1}^{q} \gamma_k \le \mu, \quad \gamma_k \ge w_i - w_j, \quad \gamma_k \ge w_j - w_i, && \text{for the $k$-th pair } i \sim j,
\end{aligned} \qquad (3.6)$$

where q denotes the number of pairs i ∼ j of consecutive spots on the same chromosome, and γ_k is the slack variable associated with the k-th such pair.
In our experiments, we implemented and solved this problem using Matlab and
the SeDuMi 1.1R3 optimisation toolbox [Stu99].
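Purely as an illustration of (3.6) — our experiments used Matlab and SeDuMi as stated above, not this code — here is a self-contained sketch using Python and SciPy's LP solver, with dense constraint matrices for clarity (a sparse encoding would be preferable at realistic array sizes):

import numpy as np
from scipy.optimize import linprog

def fused_svm(X, y, pairs, lam, mu):
    """Sketch of the fused SVM LP (3.6): X is n x p, y in {-1,+1}^n,
    pairs lists index couples (i, j) of consecutive probes (i ~ j)."""
    n, p = X.shape
    q = len(pairs)
    nv = p + n + p + q                     # variables z = [w, alpha, beta, gamma]
    c = np.zeros(nv); c[p:p + n] = 1.0     # minimise sum of slacks alpha_i
    rows, rhs = [], []
    for i in range(n):                     # alpha_i >= 1 - y_i w'x_i
        r = np.zeros(nv); r[:p] = -y[i] * X[i]; r[p + i] = -1.0
        rows.append(r); rhs.append(-1.0)
    r = np.zeros(nv); r[p + n:p + n + p] = 1.0
    rows.append(r); rhs.append(lam)        # sum beta_i <= lambda
    for i in range(p):                     # beta_i >= |w_i|
        r = np.zeros(nv); r[i] = 1.0; r[p + n + i] = -1.0
        rows.append(r); rhs.append(0.0)
        r = np.zeros(nv); r[i] = -1.0; r[p + n + i] = -1.0
        rows.append(r); rhs.append(0.0)
    r = np.zeros(nv); r[p + n + p:] = 1.0
    rows.append(r); rhs.append(mu)         # sum gamma_k <= mu
    for k, (i, j) in enumerate(pairs):     # gamma_k >= |w_i - w_j|
        r = np.zeros(nv); r[i] = 1.0; r[j] = -1.0; r[p + n + p + k] = -1.0
        rows.append(r); rhs.append(0.0)
        r = np.zeros(nv); r[i] = -1.0; r[j] = 1.0; r[p + n + p + k] = -1.0
        rows.append(r); rhs.append(0.0)
    bounds = [(None, None)] * p + [(0, None)] * (n + p + q)  # w free, slacks >= 0
    res = linprog(c, A_ub=np.array(rows), b_ub=np.array(rhs), bounds=bounds)
    return res.x[:p]                       # the weight vector w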
3.3 Data
We consider two publicly available arrayCGH datasets for cancer research, from
which we deduce three problems of diagnosis and prognosis to test our method.
The first dataset contains arrayCGH profiles of 57 bladder tumor samples
[SVR+ 06]. Each profile gives the relative quantity of DNA for 2215 spots.
We removed the probes corresponding to the sex chromosomes, because the sex mismatch between some patients and the reference used makes the computation of copy number less reliable, giving us a final list of 2143 spots. We considered
two types of tumor classification: either by grade, with 12 tumors of grade 1
and 45 tumors of higher grades (2 or 3) or by stage, with 16 tumors of stage Ta
and 32 tumors of stage T2+. In the case of stage classification, 9 tumors with
intermediary stage T1 were excluded from the classification.
The second dataset contains arrayCGH profiles for 78 melanoma tumors that
have been arrayed on 3750 spots [THH+ 08]. As for the bladder cancer dataset,
we excluded the sex chromosomes from the analysis, resulting in a total of 3649 spots. 35 of these tumors led to the development of liver metastases within
24 months, while 43 did not. We therefore consider the problem of predicting,
from an arrayCGH profile, whether or not the tumor will metastasize within 24
months.
In both datasets, we replaced the missing spot log-ratios by 0. In order to assess the performance of a classification method, we performed cross-validation for each of the three classification problems, following a leave-one-out
procedure for the bladder dataset and a 10-fold procedure for the melanoma
dataset. We measure the number of misclassified samples for different values of
parameters λ and µ.
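For illustration, this protocol can be written as a leave-one-out loop over a (λ, µ) grid; the following Python sketch reuses the hypothetical fused_svm function of Section 3.2.6, and the grids are assumed logarithmic as in our figures:

import numpy as np

def loo_errors(X, y, pairs, lambdas, mus):
    """Misclassification count for each (lambda, mu) in a leave-one-out loop."""
    n = len(y)
    errors = np.zeros((len(lambdas), len(mus)), dtype=int)
    for a, lam in enumerate(lambdas):
        for b, mu in enumerate(mus):
            for i in range(n):
                train = np.arange(n) != i            # hold out sample i
                w = fused_svm(X[train], y[train], pairs, lam, mu)
                pred = 1 if X[i] @ w >= 0 else -1    # sign rule of Section 3.2.3
                errors[a, b] += int(pred != y[i])
    return errors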
3.4 Results
In this section, we present the results obtained with the fused SVM on the
datasets described in the previous section. As a baseline method, we consider an L1-SVM, which minimizes the mean empirical hinge loss subject to a constraint on the L1 norm of the classifier in (3.2). The L1-SVM performs automatic
feature selection, and a regularization parameter λ controls the amount of regularization. It has been shown to be a competitive classification method for
high-dimensional data, such as gene expression data [ZRHT04]. In fact the L1 SVM is a particular case of our fused SVM, when the µ parameter is chosen
large enough to relax the variable fusion constraint (3.3), typically by taking
µ > 2λ. Hence by varying µ from a large value to 0, we can see the effect of the
variable fusion penalty on the classical L1 -SVM.
3.4.1 Bladder tumors
The upper plot of Figure 3.1 shows the estimated accuracy (by leave-one-out cross-validation, or LOO) of the fused SVM as a function of the regularization parameters
λ and µ, for the classification by grade of the bladder tumors. The lower left
plot of Figure 3.1 represents the best linear classifier found by the L1 -SVM
(corresponding to λ = 256), while the lower right plot shows the linear classifier
estimated from all samples by the fused SVM when λ and µ are set to values
that minimise the LOO error, namely λ = 32 and µ = 1. Similarly, Figure 3.2
shows the same results (LOO accuracy, L1 -SVM and fused SVM classifiers) for
the classification of bladder tumors according to their stage.
In both cases, when µ is large enough to make the variable fusion constraint inactive in (3.5), the classifier only finds a compromise between the empirical risk and the L1 norm of the classifier. In other words, we recover the classical L1-SVM
with parameter λ. Graphically, the performance of the L1 SVM for varying
λ can be seen on the upper side of each plot of the LOO accuracy in Figures
3.1 and 3.2. Interestingly, in both cases we observe that the best performance
obtained when both λ and µ can be adjusted is much better than the best
performance of the L1 SVM, when only λ can be adjusted. In the case of
grade classification, the number of misclassified samples drops from 12 (21%)
to 7 (12%), while in the case of stage classification it drops from 13 (28%) to
7 (15%). This suggests that the additional constraint that translates our prior
knowledge about the structure of the spot positions on the genome is beneficial
in terms of classifier accuracy.
Figure 3.1: The upper plot represents the number of misclassified samples in a leave-one-out error loop on the bladder cancer dataset with the grade labelling, with its color scale, for different values of the parameters λ and µ, which vary logarithmically along the axes. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are plotted in genome order as a blue line, with the chromosome separations marked (red lines).
Figure 3.2: The upper plot represents the number of misclassified samples in a leave-one-out error loop on the bladder cancer dataset with the stage labelling, with its color scale, for different values of the parameters λ and µ, which vary logarithmically along the axes. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are plotted in genome order as a blue line, with the chromosome separations marked (red lines).
As expected, there are also important differences in the visual aspects of
the classifiers estimated by the L1 -SVM and the fused SVM. The fused SVM
produces sparse and piecewise constant classifiers, amenable to further investigations, while it is more difficult to isolate from the L1 -SVM profiles the key
features used in the classification, apart from a few strong peaks.
As we can see by looking at the shape of the fused SVM classifier in Figure
3.1, the grade classification function is characterised by non-null constant values
over a few small chromosomal regions and numerous larger regions. Of these
regions, a few are already known as being altered in bladder tumors, such as the
gain on region 1q [CHTG05]. Moreover some of them have already been shown
to be correlated with grade, such as chromosome 7 [WCK+ 91].
On the contrary, the stage classifier is characterised by only a few regions, most of them involving large portions of chromosomes. They concern mainly chromosomes 4, 7, 8q, 11p, 14, 15, 17, 20, 21 and 22, with in particular a strong contribution from chromosomes 4, 7 and 20. These results on chromosomes 7, 8q, 11p and 20 are in good agreement with [BBR+ 05], who identified
the most common alterations according to tumor stage on a set of 98 bladder
tumors.
3.4.2 Melanoma tumors
Similarly to Figures 3.1 and 3.2, the three plots in Figure 3.3 show respectively
the accuracy, estimated by 10-fold cross-validation, of the fused SVM as a function of the regularisation parameters λ and µ, the linear classifier estimated
by the L1 -SVM when λ is set to the value that minimizes the estimated error
(λ = 4), and the linear classifier estimated by a fused SVM on all samples when
λ and µ are set to values that minimise the 10-fold error, namely λ = 64 and
µ = 0.5.
Similarly to the bladder study, the performance of the L1 -SVM without the
fusion constraint can be retrieved by looking at the upper part of the plot of
Figure 3.3. The fused classifier offers a slightly improved performance compared to the standard L1-SVM (17 errors (22%) versus 19 errors (24%)), even though the improvement is more modest than for the bladder tumors and the misclassification rate remains fairly high.
As for the bladder datasets, the L1 -SVM and fused SVM classifiers are
markedly different. The L1-SVM classifier is based on only a few BACs concentrated on chromosome 8, with positive weights on the 8p arm and negative weights on the 8q arm. These features are biologically relevant, and correspond to known genomic alterations (loss of 8p and gain of 8q in metastatic tumors).
The presence of a strong signal concentrated on chromosome 8 for the prediction of metastasis is in this case correctly captured by the sparse L1 -SVM, which
explains its relatively good performance.
On the contrary, the fused SVM classifier is characterised by many CNAs,
most of them involving large regions of chromosomes. Interestingly, we retrieve
the regions whose alteration was already reported as recurrent events in uveal
melanoma: chromosomes 3, 1p, 6q, 8p, 8q, 16q. As expected the contributions
Figure 3.3: The upper plot represents the number of misclassified samples in a ten-fold error loop on the melanoma dataset. The weights of the best classifier, for the classical L1-SVM (left) and for the fused SVM (right), are plotted in genome order as a blue line, with the chromosome separations marked (red lines).
of 8p and 8q are of opposite sign, in agreement with the common alterations of
these regions: loss of 8p and gain of 8q in metastatic tumors. Interestingly, the
contribution of chromosome 3 is limited to a small region of 3p, and does not
involve the whole chromosome as the frequency of chromosome 3 monosomy
would have suggested. Note that this is consistent with the works of [PFG+ 03] and [TPH+ 01], who delimited small 3p regions from patients with partial chromosome 3 deletions. On the other hand, we also observe that large portions of other
chromosomes have been assigned significant positive or negative weights, such
as chromosomes 1p, 2p, 4, 5, 9q, 11p, 12q, 13, 14, 20, 21. To our knowledge,
they do not correspond to previous observations, and may therefore provide
interesting starting points for further investigations.
3.5 Discussion
We have proposed a new method for the supervised classification of arrayCGH
data. Thanks to the use of a particular regularization term that translates our
prior assumptions into constraints on the classifier, we estimate a linear classifier
that is based on a restricted number of spots, and gives, as much as possible, equal weights to spots located near each other on a chromosome. Results on real data
sets show that this classification method is able to discriminate between the
different classes with a better performance than classical techniques that do not
take into account the specificities of arrayCGH data. Moreover, the learned
classifier is piecewise constant and therefore lends itself particularly well to
further interpretation, highlighting in particular selected chromosomal regions with strongly positive or negative weights.
From the methodological point of view, the use of regularized large-margin classifiers is nowadays widespread, especially in the SVM form. Regularization is particularly important for “small n, large p” problems, i.e., when the number of samples is small compared to the number of dimensions. An alternative interpretation of such classifiers is that they correspond to maximum a posteriori classifiers in a Bayesian framework, where the prior over classifiers is encoded in the penalty function. It is not surprising, then, that encoding
prior knowledge in the penalty function is a mathematically sound strategy
that can be strongly beneficial in terms of classifier accuracy, in particular when
few training samples are available. The accuracy improvements we observe on
all classification datasets confirm this intuition. Besides the particular penalty
function investigated in this paper, we believe our results support the general
idea that engineering relevant priors for a particular problem can have important
effects on the quality of the function estimated and paves the way for further
research on the engineering of such priors in combination with large-margin
classifiers. As for the implementation, we solved a linear program for each pair of values of the regularization parameters λ and µ, but it would be interesting to generalize recent work on path-following algorithms in order to follow the solution of the optimization problem as λ and µ vary [EHJT04].
Another interesting direction of future research concerns the combination of
heterogeneous data, in particular of arrayCGH and gene expression data. Gene expression variations indeed contain information complementary to CNVs about the genetic aberrations of the dysfunctional cell [SVR+ 06], and their combination is therefore likely both to improve the accuracy of the classification methods and to shed new light on the biological phenomena that are characteristic of each class. A possible strategy to combine such datasets would be to train a large-margin classifier with an adequately designed regularization term.
Acknowledgement
We thank Jérôme Couturier, Sophie Piperno-Neumann and Simon Saule, and
the uveal melanoma group from Institut Curie. We are also grateful to Philippe
Hupé for his help in preparing the data. This project was partly funded by
the ACI IMPBIO Kernelchip and the EC contract ESBIC-D (LSHG-CT-2005518192). FR and EB are members of the team “Systems Biology of Cancer”,
Equipe labellisée par la Ligue Nationale Contre le Cancer.
Chapter 4

Enhancement of L1-classification of microarray data using gene network knowledge
4.1 Introduction
The construction of predictive models from gene expression data is an important problem in computational biology. Typical applications include, for example, cancer diagnosis or prognosis [vtVDvdV+ 02, WKZ+ 05, BMC+ 00, AED+ 00]
and discriminating between different treatments applied to micro-organisms
[NEGL+ 05]. Since the number of genes, which is the dimension of the space in which the expression profiles evolve, is far greater than the number of samples, this problem turns out to be quite complex. Indeed, even if classical methods have been found to be efficient in multiple cases [ARL+ 07, GST+ 99], their results are very unstable, and their strong dependence on the training set (gene and sample selection) has already been pointed out [EDKG+ 05].
Including information about gene interactions (e.g. regulation of expression, metabolic or signal transduction pathways) is an attractive idea for reducing the complexity of the problem: incorporating biological information into the mathematical methods should reduce the error rate and provide classification functions that are more easily interpretable and less unstable. Several authors have elaborated sophisticated methods to integrate gene collaboration information. Pathway scoring methods try to detect perturbed groups of genes or “modules” while ignoring the detailed topology of the gene influences [COVP05, CDF05]. It is then assumed that all the genes inside a module are coordinately expressed, and therefore that a perturbation should affect most of them. [SZEK07, CLL+ 07] proposed methods to extract such modules
from gene networks. Unfortunately, these methods rely on an artificial separation of the collaboration map into subgroups and may therefore be unable to build an efficient predictive model when the biological phenomenon only affects a small number of genes, for example a subgroup of the detected modules.
Another class of methods that aim at incorporating gene collaboration knowledge into the construction of a classifier uses dimension reduction techniques. These methods first project the data into a subspace that incorporates network information, then perform supervised classification in this subspace. In a previous work [RZD+ 07], we used the spectral decomposition of the graph associated with a metabolic gene network to build this subspace. [HZHS07] developed another approach, based on computing synergetic pairs of genes. Unfortunately, these approaches sacrifice statistical power for the benefit of simplicity. The main disadvantage of such “pipeline”-like methods, where each stage is distinct from the next, is that each step assumes that the results of the previous steps are correct, while in fact it accumulates their approximations and errors, thus diverging from optimality. Several studies have indeed shown that it is more efficient to include the knowledge directly in the analysis method rather than as a preliminary step [ITR+ 01, GdBLC03, CKR+ 04].
In this chapter, we describe a method that combines the benefits of both classes of algorithms by incorporating gene/gene interaction knowledge directly into the analysis. This new method extends fused classification [TSR+ 05], a method for classifying data whose features can be ordered in a meaningful way, to data whose features are positively correlated through relations more complex than a simple ordering. Different types of gene networks turn out to be good repositories for this collaboration information.
First, we review the usual supervised classification methods, then show how [TSR+ 05] extended the problem to build the fused classification method. We then describe how to build our new method, and see how it performs on classical datasets.
4.2 Methods
In this section, we describe the usual supervised classification methods and show how we developed the methods used to incorporate network knowledge into our analysis.
4.2.1 Usual linear supervised classification method
The aim of gene expression profile classification is to build a function f : Rn → Y
that is able to attribute to each new expression profile x ∈ Rn , where n is the
number of genes, a label y ∈ Y, a mathematical representation of a biological
property of the sample. Depending on the study, this label could be sick or
healthy, the treatment the sample has been subjected to, etc.
Supervised classification is a particular category of classification methods
where a set of samples X = {Xi }i∈1,...,p , for which the correct labels Y =
{Yi }i∈1,...,p are already known, is used to build the classification function f .
Linear supervised classification uses a linear vector, i.e. some $w \in \mathbb{R}^n$, to construct the function f, which is then of the form $f : x \mapsto w^\top x$, where $w^\top$ is the transpose of w. Geometrically, w can be seen as the vector orthogonal to a hyperplane P that separates the space into two subspaces, and the subspace in which a sample Xi finds itself defines its predicted class (which is $\mathrm{sign}(w^\top X_i)$ in the case of binary classification, where Y = {−1, 1}). This hyperplane will be a good separator not only if most of the samples fall into the correct subspace, but also if it provides a satisfactory geometrical partition of the sample space. [CV95], for example, proposed to maximise the margin around the hyperplane.
Let $l : (X_i, Y_i) \mapsto l(w^\top X_i, Y_i)$ be a loss function, a way to measure how far the sample Xi lies from the subspace it should belong to. A good classifier is one that minimises the average of l on the training set.
Unfortunately, as the dimension n of the space in which the samples evolve is very large compared to the number p of samples, such a classifier is likely to be over-fitted, meaning that it will perform relatively well on the training set but may perform poorly on unseen examples. The minimisation of l alone is therefore not sufficient to determine an efficient classifier, and we have to add a constraint on w, representing the knowledge that we have of the shape we want the classifier to be constrained to. The linear predictive model for our problem is then obtained by solving the following optimisation
problem:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraint} \quad r(w) \le \mu. \qquad (4.1)$$
This problem is formed of two parts: the minimisation of the loss function l, which measures the error between the predicted class for a specific sample Xi and the real class given by the label Yi; and a regularisation term r(w), where µ is a parameter adjusted to build a trade-off between the efficiency of the classification and the minimisation of the regularisation term. The larger µ is, the looser the constraint and the less w is forced toward the form represented by the regularisation term.
Good examples of such classifiers include support vector machines (taking Y = {−1, 1} as the space of classes, the hinge loss $l(w^\top X_i, Y_i) = \max(0, 1 - Y_i w^\top X_i)$ and $r(w) = \|w\|_2$) [SS02, STV04, STC00] or ridge regression (Y = R, $l(w^\top X_i, Y_i) = (w^\top X_i - Y_i)^2$ and $r(w) = \|w\|_2$) [HTF01].
“Usual” classification problems often use a norm for the regularisation term, whether the Euclidean norm defined by $\|w\|_2 = \left(\sum_{i=1}^{n} w_i^2\right)^{1/2}$ or the L1-norm defined by $\|w\|_1 = \sum_{i=1}^{n} |w_i|$. The two regularisation forms force the classifier to comply with different constraints and therefore give the classifier different shapes.
The constraint induced by the L1-norm regularisation forces our classifier
to be sparse, meaning that most of the components of w will be equal to zero (see for example the discussion in [Tib97]). It is therefore useful if we believe that our model should depend on a small number of genes. This regularisation term does not only help with classification performance: it can lead to the construction of a sparse and therefore easily interpretable “signature”, a specific set of genes that are characteristic of the property of interest [GC05].
When a sparse classifier is not appropriate, we may assume that the decision function should not be too complicated, resulting in a classifier with a small Euclidean norm, and thus apply the Euclidean regularisation. Moreover, in the case of support vector machines, since the Hessian matrix is then positive definite (instead of positive semi-definite), the problem is computationally more stable than with the L1-norm [Abe02].
By constraining the L1 or L2 norm to be small, we limit the space in which the classifier evolves; in the L1 case this amounts to automatic feature selection. But, as we explain in the next section, the regularisation term can also be used to include prior knowledge that we have of the classifier.
4.2.2 Fusion and fused classification
For supervised classification problems with features that can be ordered in a
meaningful way, [LF96] proposed a method called “fusion classification” to incorporate the information of correlation between successive features into the
regularisation term. The problem can be written as the minimisation of the following form:

$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraint} \quad \sum_{i=2}^{n} |w_i - w_{i-1}| \le \mu. \qquad (4.2)$$
Because of the regularisation term, this method results in a classifier where successive features tend to have similar weights. In appropriate cases, it shows great potential, as the features of the classifier are partitioned into groups of successive features with similar weights, which reduces the effective dimension of the problem. The classification then depends on whole groups of features and should be more robust to experimental or biological noise.
[TSR+ 05] extended the method to propose a classification technique called
“fused lasso”. Their method is the minimisation of the following form:
$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraints} \quad \sum_{i=1}^{n} |w_i| \le \alpha \quad \text{and} \quad \sum_{i=2}^{n} |w_i - w_{i-1}| \le \mu. \qquad (4.3)$$
They use Y = R as the space of labels and the squared loss $l(w^\top X_i, Y_i) = (Y_i - w^\top X_i)^2$ as the loss function; hence this method is a compromise between fusion classification and the traditional lasso, as it uses both regularisation terms (the regularisation described in equation 4.2 and the usual L1-norm), ensuring that the classifier will be sparse, due to the L1-norm regularisation, and that its
successive features will have similar weights, due to the fusion-like regularisation
term.
In the next section, we will see how we can extend this technique to incorporate network knowledge into microarray classification.
4.2.3 Network-fused classification
The fused lasso only offers a way to regularise a feature i with regard to its relations with two other features, i − 1 and i + 1, which are chosen by the way the features are ordered. While this is an interesting method for some biological data, such as protein mass spectrometry data [TSR+ 05] or CGH array data [TW07], it is not as pertinent for gene expression profiles, where the relations between features, i.e. gene expression levels, are far more complex. Indeed, relations between genes are often described as a graph (V, E), where V is the set of vertices (genes) and E the set of edges (pairs of genes whose expressions are correlated with each other).
To establish a regularisation constraint that will incorporate this network
knowledge, we propose to build a classifier that will tend to attribute similar
values to connected nodes, corresponding to the following form, similar to the
one seen in equation 4.3:
$$\min_{w \in \mathbb{R}^n} \sum_{i=1}^{p} l\left(w^\top X_i, Y_i\right) \quad \text{under the constraints} \quad \sum_{i=1}^{n} |w_i| \le \lambda \quad \text{and} \quad \sum_{u \sim v} |w_u - w_v| \le \mu, \qquad (4.4)$$
where we denote by wu and wv the weights of w corresponding respectively to genes u and v. We use Y = {−1, 1} as the label space and, for the loss function, the hinge loss l(t, y) = max(0, 1 − yt), which is the loss function used in the classical SVM.
By analogy with the fused lasso, we obtain a classifier that tends to be sparse (due to the first constraint) and tends to attribute similar weights to connected nodes (due to the second constraint). Our classifier therefore at the same time has the advantages of a standard classifier, i.e. sparsity, and incorporates network knowledge, i.e. positive correlation between connected genes, with the two parameters λ and µ allowing a trade-off between these two constraints and the classification efficiency represented by the loss term.
As our problem is the minimisation of a linear form under linear constraints, we are confronted with a linear program. Numerous methods have been proposed to solve this type of problem [Tod02]. If we used another convex loss function instead of the hinge loss, we would obtain a convex optimisation problem, which would also be solvable, using convex optimisation methods [BV04b].
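As an illustration of this remark, the logit-loss variant of (4.4) can be written directly with a generic convex-optimisation modelling tool; the following sketch uses Python and the CVXPY library (an assumed stand-in for the solvers of [BV04b], not the implementation used in this work), where `edges` holds index pairs of connected genes:

import numpy as np
import scipy.sparse as sp
import cvxpy as cp

def network_fused_logit(X, y, edges, lam, mu):
    p, n = X.shape                 # p samples, n genes (chapter 4 convention)
    q = len(edges)
    # sparse incidence matrix: row k encodes w_u - w_v for the k-th edge
    rows = np.repeat(np.arange(q), 2)
    cols = np.array([idx for e in edges for idx in e])
    vals = np.tile([1.0, -1.0], q)
    D = sp.csr_matrix((vals, (rows, cols)), shape=(q, n))
    w = cp.Variable(n)
    loss = cp.sum(cp.logistic(-cp.multiply(y, X @ w)))  # logit loss on margins
    constraints = [cp.norm1(w) <= lam,        # sparsity constraint
                   cp.norm1(D @ w) <= mu]     # network fusion constraint
    cp.Problem(cp.Minimize(loss), constraints).solve()
    return w.value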
4.2.4 Implementation
The problem described in equation 4.4 can be transformed into the following linear program:

$$\begin{aligned}
\min_{w,\alpha,\beta,\gamma}\; & \sum_{i=1}^{p} \alpha_i \quad \text{under the following constraints:}\\
& \alpha_i \ge 0, \quad \alpha_i \ge 1 - w^\top X_i Y_i, && i = 1, \dots, p,\\
& \sum_{i=1}^{n} \beta_i \le \lambda, \quad \beta_i \ge w_i, \quad \beta_i \ge -w_i, && i = 1, \dots, n,\\
& \sum_{k=1}^{q} \gamma_k \le \mu, \quad \gamma_k \ge w_u - w_v, \quad \gamma_k \ge w_v - w_u, && \text{for the $k$-th edge } (u, v) \in E,
\end{aligned} \qquad (4.5)$$
where, for every u ∈ V, wu is the weight attributed by w to the node u, p is the number of samples, n = Card(V) the number of genes and q = Card(E) the number of gene interactions.
This problem was implemented and solved using Matlab and the SeDuMi
1.1R3 optimisation toolbox [Stu99].
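Since (4.5) has exactly the structure of the linear program of Section 3.2.6, with the fused pairs now given by the network edges rather than by chromosomal order, the earlier LP sketch can in principle be reused once the edges are mapped to probe indices; a minimal sketch (Python, hypothetical names):

def network_pairs(edges, genes):
    """Map gene-level edges to index pairs usable in the LP of Section 3.2.6.

    `edges` is a list of gene-identifier pairs (u, v) from the network and
    `genes` the ordered list of genes measured on the microarray."""
    index = {g: k for k, g in enumerate(genes)}
    pairs = []
    for u, v in edges:
        # keep only edges between two distinct genes present on the array
        if u != v and u in index and v in index:
            pairs.append((index[u], index[v]))
    return pairs

# e.g. w = fused_svm(X, y, network_pairs(edges, genes), lam, mu)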
4.3 Data
In this section we describe the different datasets that have been used for our
study.
4.3.1 Expression data sets
We collected our first expression data set from a study that aims at predicting “poor prognosis” in breast cancer patients by separating patients who developed distant metastases (51 samples) from patients who remained disease-free for a period of at least 5 years (46 samples) [vtVDvdV+ 02]. The original study separated this set of 97 patients into a training and a testing set, but we merged these two sample sets into one single data set in order to reduce the selection bias underlined in previous studies [MKH05], which is related to the choice of the samples forming each group. Each gene expression profile contains approximately 25,000 genes. We took the data set as normalised in the original study and set each gene’s mean and variance to 0 and 1, respectively.
We collected our second expression data set from a study that aims at discriminating patients positive for estrogen receptors (ER) (208 samples) from ER-negative patients (77 samples) among patients who developed lymph-node-negative primary breast cancer [WKZ+ 05]. We merged the training and testing sets of the original study, obtaining a global set of 285 samples that we used to separate patients who suffered a relapse (107 samples) from those who remained disease-free (179 samples). The gene expression profiles were collected using Affymetrix U133A genechips, resulting in profiles of roughly 22,000 genes. We used a gcRMA-normalised version of this data set.
4.3.2 Gene networks
Our method requires a database of positive relations between gene expression levels. These data can be represented as an undirected, finite graph G satisfying two precise conditions: the nodes represent proteins or the corresponding genes, and an edge exists between two nodes if and only if there exists a type of positive correlation between the expression levels of the two corresponding genes. We denote by V the set of vertices, of cardinality |V| = n, and by E ⊂ V × V the set of edges. Different repositories provide this kind of information.
One example of a repository for this kind of relation is metabolic networks. In metabolic networks, the vertices represent enzymes (or the corresponding genes) and an edge is formed between two vertices u and v if v catalyses a reaction in which one of the reactants is a product of a reaction catalysed by u. The correlation between metabolic pathways and gene expression data has already been shown in several studies [GDS+ 05, MOSS+ 04]. One practical way to understand this correlation is to see that a cascade of reactions will be active if all of the corresponding enzymes are active, i.e. expressed. For the analysis of this data, we collected two metabolic networks. The first one was built from the KEGG database of metabolic pathways [KGK+ 04]; we reconstructed this network from the KGML v.0.6, resulting in 13275 edges between 1354 genes. The second metabolic network was extracted from Reactome [VDS+ 07], a database that contains several types of gene networks, resulting in a network composed of 23233 edges between 1224 genes.
Another interesting class of databases is protein-protein interaction networks (also known as protein interaction networks or PPI networks). In PPI networks, vertices represent proteins and an edge is formed between protein u and protein v if these proteins are known to physically interact. There are three principal ways to construct PPI networks: automatic inference, yeast two-hybrid (Y2H) experiments and literature analysis. The first two construction methods have been shown to be quite complementary, while both being biologically relevant [RSA+ 07]. We collected three protein-protein interaction networks. The first one was built from Bioverse-core [MS03, MBS05], a manually-curated predicted interaction network. The second one was built from CCSB-HI1 [RVH+ 05], which was constructed using Y2H experiments. [RSA+ 07] assessed the quality of both these networks and showed their complementarity. Therefore we joined both networks to construct a new one, which formed
our third PPI network.

                       All genes          VV gene set        Wang gene set
Network                Vertices  Edges    Vertices  Edges    Vertices  Edges
KEGG                   1354      13275    1203      9879     1156      9782
Reactome               1224      23233    1171      20966    1159      21063
Bioverse-core          1263      2855     1161      2283     1216      2655
CCSB-HI1               1549      2611     1278      1273     1265      1816
Both PPI networks      2673      5446     2353      3976     2344      4452
Resnet                 2612      5148     2238      4187     2215      4369
Coexp. network 1       2407      28836    2377      20995    2406      28835
Coexp. network 2       1256      10518    1038      7801     1256      10518

Table 4.1: Characteristics of the different networks used. For each network, the table gives the number of vertices and edges, whether in its complete form (excluding self-loops and negative edges) or restricted to the genes present in each expression dataset.
Influence networks can be seen as two graphs spanning the same set of vertices V, where vertices represent genes. In the first graph, Gactiv = (V, Eactiv), an edge (u, v) belongs to Eactiv if the expression of gene u is positively correlated with the expression of gene v. In the second graph, Ginhib = (V, Einhib), (u, v) belongs to Einhib if the expression of gene u is negatively correlated with the expression of gene v. We will call the sub-network formed by the graph Gactiv the positive influence network; it can be used as a database of positive correlations. We extracted from the manually curated version of the ResNet pathway database [YMK+ 06] an expression influence network of 5148 edges between 2612 genes.
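The curated ResNet database encodes these activation and inhibition relations directly; purely to illustrate the definition above, the sketch below derives the two edge sets from expression data with a simple correlation cutoff, where the threshold value is an arbitrary assumption, not the ResNet curation criterion.

    import numpy as np

    def influence_graphs(expr, threshold=0.5):
        # expr: (samples, genes) expression matrix; the threshold is an
        # illustrative cutoff for positive/negative correlation.
        corr = np.corrcoef(expr, rowvar=False)
        m = corr.shape[0]
        activ = {(u, v) for u in range(m) for v in range(u) if corr[u, v] > threshold}
        inhib = {(u, v) for u in range(m) for v in range(u) if corr[u, v] < -threshold}
        return activ, inhib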
The last type of interesting database that we studied is co-expression networks. In these networks, an edge is formed between gene u and gene v if they are often found co-expressed in a set of gene expression profiles. This type of network can only be inferred, and it depends strongly on the underlying gene expression data, which need to be generic or large enough for the inferred relations to be reliable. [YMH+ 07] proposed a way to identify co-expression modules from gene expression datasets and built a co-expression relation database based on 105 different sets of expression profiles. From these sets, they only kept relations they found significant enough, resulting in the construction of 105 different co-expression networks. We used their data to build two expression networks: the first one with relations that were significant in at least 10% of their datasets, and the second one with relations that were significant in at least 20% of their datasets.
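A minimal sketch of this filtering step, assuming the per-dataset significant pairs are already available as edge sets; the 10% and 20% cutoffs from the text map to min_fraction values of 0.1 and 0.2.

    from collections import Counter

    def coexpression_network(per_dataset_edges, min_fraction):
        # per_dataset_edges: one set of significant gene pairs per
        # expression dataset (105 sets in [YMH+ 07]); keep a pair if it
        # appears in at least min_fraction of the datasets.
        counts = Counter(e for edges in per_dataset_edges for e in edges)
        cutoff = min_fraction * len(per_dataset_edges)
        return {e for e, c in counts.items() if c >= cutoff}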
For every network, we only kept edges between two distinct genes. In each
analysis, we only kept the genes that were present in the microarrays. Table 4.1
compares the complexity of each network.
                    Best classifier        Best classifier
                    without network        with network
Network             10f error   λ          10f error   λ      µ
KEGG                29          2          27          2      64
Reactome            31          4          28          4      2
Bioverse-Core       28          4          25          16     0.25
CCSB-HI1            27          4          24          2      1
Both PPI networks   24          2          23          2      4
Resnet              28          4          24          2      0.5
Coexp. network 1    28          2          23          2      2
Coexp. network 2    29          2          25          4      4
No network          25          2          -

Table 4.2: Performance of the best classifiers for each network on the Van’t Veer dataset. The 10f error is the number of misclassified samples in a ten-fold cross-validation.
4.4 Results

In this section we describe and comment on the results obtained using the different networks and datasets.
4.4.1 Performance

We performed classification of the Van’t Veer dataset using all the previously described networks, for different values of the (λ, µ) parameters. Results are shown in Figure 4.1 and the performances of the best classifiers are shown in Table 4.2. The same was done for the Wang dataset: we obtained the results of Figure 4.2, with performances shown in Table 4.3.
As described by Equation 4.4, the lower λ is, the sparser the solution, and the lower µ is, the less the solution varies along the edges of the network. Therefore the classifier with only L1 regularisation (corresponding to the L1-SVM) is obtained by taking an infinite µ, or at least a value larger than the graph penalty of the pure LASSO solution; the evolution of its performance can be seen by looking at the highest horizontal lines in Figures 4.1 and 4.2.
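To make the role of the two bounds concrete, here is a minimal sketch of such a constrained formulation using the cvxpy modelling library; the function name network_fused_svm, the hinge-loss objective and the exact form of the constraints are illustrative assumptions, not the solver used in this chapter.

    import cvxpy as cp
    import numpy as np

    def network_fused_svm(X, y, edges, lam, mu):
        # X: (n_samples, n_genes) expression matrix; y: labels in {-1, +1};
        # edges: list of (u, v) gene-index pairs taken from the gene network.
        n, p = X.shape
        w = cp.Variable(p)
        b = cp.Variable()
        hinge = cp.sum(cp.pos(1 - cp.multiply(y, X @ w + b)))   # hinge loss
        constraints = [cp.norm1(w) <= lam]                      # sparsity bound (lambda)
        if edges:
            u, v = zip(*edges)
            # fusion bound (mu): total variation of w along the network edges
            constraints.append(cp.norm1(w[list(u)] - w[list(v)]) <= mu)
        cp.Problem(cp.Minimize(hinge), constraints).solve()
        return w.value, b.value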
Measuring the performance using a ten-fold cross-validation for each parameter couple introduces some bias. A cleaner way to calculate the number of misclassified samples would have been to perform a nested cross-validation including parameter selection. However, this is much more time-consuming than our method, and, as we perform classification over a larger set, simple cross-validation is enough to estimate the general trend of the classification error.
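For concreteness, the simple (non-nested) scheme described above amounts to the following sketch, reusing the hypothetical network_fused_svm function from the previous snippet and scikit-learn's fold splitter:

    from sklearn.model_selection import KFold
    import numpy as np

    def tenfold_error(X, y, edges, lam, mu):
        # Number of misclassified samples over a ten-fold split for one
        # (lambda, mu) couple, as reported in Tables 4.2 and 4.3; picking
        # the best couple from these counts is the slightly biased scheme
        # discussed above, not a nested cross-validation.
        errors = 0
        for train, test in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
            w, b = network_fused_svm(X[train], y[train], edges, lam, mu)
            errors += int(np.sum(np.sign(X[test] @ w + b) != y[test]))
        return errors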
Looking at Tables 4.2 and 4.3, we can see that even without introducing any network-related constraint, the classification performance varies depending on the gene network used.
Figure 4.1: Number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Van’t Veer dataset, one panel per network (KEGG, Reactome, Bioverse-core, CCSB-HI1, both PPI networks, Resnet, co-expression networks 1 and 2). λ and µ vary on the same logarithmic scale for every experiment. Bright blue corresponds to the parameter couples with the highest count of misclassified samples, bright red to the lowest.
Figure 4.2: Number of misclassified samples in a ten-fold cross-validation for different values of the (λ, µ) parameters on the Wang dataset, one panel per network (KEGG, Reactome, Bioverse-core, CCSB-HI1, both PPI networks, Resnet, co-expression networks 1 and 2). λ and µ vary on the same logarithmic scale for every experiment. Bright blue corresponds to the parameter couples with the highest count of misclassified samples, bright red to the lowest.
                    Best classifier        Best classifier
                    without network        with network
Network             10f error   λ          10f error   λ       µ
KEGG                107         0.0312     96          4096    8
Reactome            92          8          92          8       512
Bioverse-Core       92          16         88          16      32
CCSB-HI1            106         2          96          16      1
Both PPI networks   107         0.0312     98          16      0.0625
Resnet              93          2          87          8       0.125
Coexp. network 1    102         16         95          16      512
Coexp. network 2    110         0.625      106         1       0.5
No network          88          2          -

Table 4.3: Performance of the best classifiers for each network on the Wang dataset. The 10f error is the number of misclassified samples in a ten-fold cross-validation.
This is because we only keep the genes that are present in the network, which can be seen as a priori gene selection, and because we combine the probes related to the same gene. However, as every classification without any network constraint performs worse than the classification that keeps all the genes (with the exception of the classification using both PPI networks on the Van’t Veer dataset), genes that are essential for the discrimination may be missing from the networks.
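The probe-combination step mentioned above could look like the following sketch; simple averaging of the probes mapped to one gene is our illustrative assumption, as the text does not specify the aggregation rule.

    import numpy as np

    def combine_probes(X, probe_to_gene):
        # Average the columns of X (samples x probes) that map to the same
        # gene; returns a samples x genes matrix and the ordered gene list.
        genes = sorted(set(probe_to_gene))
        cols = {g: [i for i, pg in enumerate(probe_to_gene) if pg == g] for g in genes}
        return np.column_stack([X[:, cols[g]].mean(axis=1) for g in genes]), genes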
For the Van’t Veer classification problem, we can see that in every case the incorporation of the network improves the performance and that, with the exception of the metabolic networks, the resulting classifiers are at least as accurate as the one obtained when keeping all the genes. This is however not the case for the Wang dataset, for which the classifier built on all the genes remains the best, even if the introduction of Bioverse-core or Resnet achieves comparable results.
An interesting phenomenon that we observe on the Wang dataset is that the network constituted of both PPI networks performs worse than each network taken separately. One explanation is that our model is sensitive to gene selection, and that by combining both networks we lose the advantage gained by selecting the genes that were present in only one network. It may also be due to the fact that, even if the two networks are complementary sources of protein-protein interactions, they contain different types of interactions, which may not be compatible with each other: protein interactions are dynamically organised [HBH+ 04], and these dynamics may contradict the positive correlation that we try to introduce, since hubs cannot physically interact with all their neighbours simultaneously. [KLXG06] proposes a high-quality network that incorporates this dynamic factor and may therefore be used to address this issue.
Network             Genes   N. of terms   Main categories
KEGG                120     89            catalytic activity, metabolic processes,
                                          physiological process
Reactome            117     141           metabolic processes
Bioverse-Core       120     52            different protein domains including
                                          TGF-β signaling
CCSB-HI1            130     6             protein binding, alternative splicing
Both PPI networks   235     55            protein binding, anti-apoptosis,
                                          JAK-STAT signaling, cell proliferation,
                                          ATP, IL-2 receptor
Resnet              224     30            protein binding, alternative splicing,
                                          regulation, cell differentiation
Coexp. network 1    238     35            protein binding, alternative splicing,
                                          metal binding
Coexp. network 2    104     13            protein binding
No network          2447    46            transcription, cellular processes,
                                          alternative splicing, protein binding,
                                          negative regulation of cell proliferation,
                                          metal binding

Table 4.4: Main categories found by performing a DAVID analysis with the classifiers trained on the Van’t Veer dataset, using the parameters described in Table 4.2.
4.4.2 Interpretation of the classifiers

In order to interpret the classifiers, we extracted from each one the 10% of genes with the largest weights and performed an analysis with DAVID [DSH+ 03] in order to retrieve the categories that were significantly represented among those genes. The results are described in Tables 4.4 and 4.5. We only kept the DAVID categories for which the p-value was below 10^-4.
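The gene lists fed to DAVID can be obtained with a short sketch of this kind; w is assumed to be the trained weight vector and gene_names the matching annotations, and taking the weight magnitude as importance is our reading of the selection rule.

    import numpy as np

    def top_decile_genes(w, gene_names):
        # The 10% of genes with the largest absolute weights, i.e. the
        # list that is then submitted to DAVID for enrichment analysis.
        k = max(1, len(w) // 10)
        idx = np.argsort(-np.abs(w))[:k]
        return [gene_names[i] for i in idx]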
The disparities in the number of terms can be explained by the nature of the different gene networks. More terms tend to be extracted from metabolic networks, such as KEGG and Reactome, because metabolic functions are represented in the ontologies as very small groups of genes, often fewer than ten, so a single highlighted metabolic pathway corresponds to many terms. On the other hand, experimentally built networks, such as CCSB-HI1 and both co-expression networks, tend to describe relations between genes that are not functionally linked and therefore produce classifiers for which only a few terms are found. However, we can see that in both cases Bioverse-core and the PPI network built by combining the two others tend to pinpoint interesting terms that are more precise than the ones found with the classifier built without any a priori knowledge. For both datasets, the ATP pathway is found relevant, as pointed out in [DB04], and the Van’t Veer classifier also shows the importance of TGF-β, IL-2 (also known as TCGF) and JAK-STAT, as already pointed out by several studies [DHT+ 97, BBM86, BK02].
Network             Genes   N. of terms   Main categories
KEGG                115     63            metabolic processes, disease mutation,
                                          purine, pyrimidine, metal binding
Reactome            116     72            metabolic processes, protein
                                          sequencing, disease mutation,
                                          DNA polymerase
Bioverse-Core       121     39            protein binding, mutagenesis site, ATP
CCSB-HI1            127     8             protein binding, direct protein
                                          sequencing
Both PPI networks   234     120           disease mutation, blood, immune
                                          response
Resnet              224     78            protein binding, cell cycle,
                                          mutagenesis site
Coexp. network 1    240     9             protein binding, direct protein
                                          sequencing, disease mutation
Coexp. network 2    125     7             protein binding, alternative splicing,
                                          disease mutation, regulation of apoptosis
No network          2228    131           phosphorylation, cell cycle,
                                          transcription, mutagenesis site,
                                          protein binding, negative regulation
                                          of biological processes

Table 4.5: Main categories found by performing a DAVID analysis with the classifiers trained on the Wang dataset, using the parameters described in Table 4.3.
[BBM86] also suggested the importance of blood natural killer cell activity in breast cancer patients, which could explain the importance of blood-related terms in the classifier obtained with the Wang dataset and the combination of both PPI networks.
4.5 Discussion
We developed a method for the supervised classification of gene expression profiles. The introduction of a new regularisation term that takes into account the correlation between linked nodes in the gene network helps build a linear classifier that is closer to a priori known biological facts. Results on public datasets, using different types of gene networks, show that the incorporation of gene networks improves the error rate compared to classifiers that ignore them and that, on some datasets, given the right gene network, it may reduce the misclassification rate compared to classifiers that take into account all the genes. Moreover, given the right network, the obtained classifier may be easier to interpret than standard classifiers.
The fact that the network-constrained classification function does not increase performance in all cases may be explained by the incompleteness of current biological networks, which still have to be completed through further inference or biological experiments. The difficulty of finding a standard approach to match the value of one probe with the expression of a gene may also complicate the task and contribute to these disappointing results.
Another issue in this study is the problem of normalization. Standard normalization algorithms either perform per-array normalization (like MAS5 [mas]) or may perturb the gene correlations that are essential to the incorporation of any gene network (like gcRMA [WIG+ 04]). New normalization algorithms, such as MAS6, may prove more efficient, but they are still being tested and their efficiency has not yet been established by enough studies.
As the dimension of microarrays tends to grow exponentially, the use of methods more specific to biological data than standard analysis processes seems more and more essential. In this context, the introduction of gene network knowledge into gene expression profile classification is a step in the right direction.
Conclusion
The different contributions of this thesis show that the incorporation of a priori knowledge is a promising approach to reduce the mathematical complexity of microarray analysis and to produce biological results that are easier to interpret. However, even if they improve on existing techniques, the three new methodologies presented here may be seen as preliminary studies, since microarrays still seem to be subject to unexplained signal variations.
These obscuring and unexplained variations are one of the main problems of microarray studies, as even sophisticated normalization algorithms are unable to remove them. Indeed, the rigorous mathematical hypotheses that underlie these methods do not seem to model correctly a biological reality that is much more complex than we would like it to be; so much more complex, in fact, that we seem unable to understand the bias induced by these pre-treatment methods.
Another unavoidable difficulty for computational biologists is the lack of unification. As the domain is still far from mature, few standards have yet been agreed upon, and in most cases, whether it be simple protocols, identifier databases, file formats or even definitions of simple terms such as "gene", it remains difficult to merge several works. However, in the past couple of years, different initiatives have been launched and we can expect this obstacle to become less and less important in the near future.
Indeed, as biological experiments and new inference techniques such as the one described by [BBV07] should greatly help the completion and curation of gene networks and biological databases in general, our knowledge of biological phenomena will become more precise and easier to model.
Ideas to improve our methods could be found in a sharper modelling of gene interactions, such as the distinction and incorporation of positive and negative correlations, or even of the dynamic orderings suggested by the behaviour of the different hubs. The use of dynamic information, such as that provided by time-series experiments, instead of static profiles, should also provide a more precise understanding of the phenomena.
Another way to circumvent the imprecision of microarray data is to interpret the data as a collection of expression values for groups of genes instead of a collection of expression values for single genes. As the expression of each group of genes, or "module", would be calculated from more values (i.e. the expression value of each gene of the module), it would improve the statistical power and the confidence of the interpretation. It would also facilitate the understanding of the underlying biological phenomena, as it is far easier to identify the function of a group of genes than the function of a single protein.
Enrichment analysis methods, such as Gene Set Enrichment Analysis [STM+ 05] or Gene Set Analysis [ET06], use this idea to provide a framework for analysis. We could derive such a method from the ones developed during this thesis by simply extracting modules from the different classifiers produced by our analysis techniques. In particular, the network-fused SVM algorithm, by providing a piecewise-constant solution, should make this extraction easier, even if the complexity of the gene networks makes it more complicated than it sounds.
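A sketch of what this module extraction could look like, under the assumption that genes carrying (numerically) equal non-zero weights and connected in the network form one module; the tolerance and the filtering rule are our own illustrative choices.

    import networkx as nx

    def extract_modules(w, edges, tol=1e-6):
        # Connect network edges whose endpoint weights are numerically
        # equal (exploiting the piecewise-constant structure of the
        # solution), then keep components with a non-zero common weight.
        g = nx.Graph()
        g.add_nodes_from(range(len(w)))
        g.add_edges_from((u, v) for u, v in edges if abs(w[u] - w[v]) < tol)
        return [sorted(c) for c in nx.connected_components(g)
                if len(c) > 1 and abs(w[next(iter(c))]) > tol]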
The development of methods that provide denser information, such as complete human genome microarrays for gene expression profiling or high-density aCGH, constitutes a new challenge for analysis processes, as the explosion of the dimensionality of the data will make the search for explanatory profiles even more difficult. In this context, the introduction of methods more specific than the standard analysis algorithms, such as the incorporation of a priori knowledge that we worked on during this thesis, will be even more necessary.
Bibliography
[Abe02]
Shigeo Abe. Analysis of support vector machines. Proceedings
of the 2002 12th IEEE Workshop on Neural Networks for Signal
Processing, 2002.
[AED+ 00]
Ash A. Alizadeh, Michael B. Eisen, R. Eric Davis, Chi Ma,
Izidore S. Lossos, Andreas Rosenwald, Jennifer C. Boldrick, Hajeer Sabet, Truc Tran, Xin Yu, John I. Powell, Liming Yang,
Gerald E. Marti, Troy Moore, James Hudson, Lisheng Lu,
David B. Lewis, Robert Tibshirani, Gavin Sherlock, Wing C.
Chan, Timothy C. Greiner, Dennis D. Weisenburger, James O.
Armitage, Roger Warnke, Ronald Levy, Wyndham Wilson,
Michael R. Grever, John C. Byrd, David Botstein, Patrick O.
Brown, and Louis M. Staudt. Distinct types of diffuse large b-cell lymphoma identified by gene expression profiling. Nature,
403(6769):503–511, February 2000.
[AJL+ 02]
Bruce Alberts, Alexander Johnson, Julian Lewis, Martin Raff,
Keith Roberts, and Peter Walter. Molecular Biology of the Cell.
Garland Science, March 2002.
[ANM93]
A. N. Mohamed, J. A. Macoska, A. Kallioniemi, O.-P. Kallioniemi, F. Waldman, V. Ratanatharathorn, and S. R. Wolman. Extrachromosomal gene amplification in acute myeloid leukemia; characterization by metaphase analysis, comparative genomic hybridization, and semi-quantitative PCR, 1993.
[ARL+ 07]
A Andersson, C Ritz, D Lindgren, P Eden, C Lassen, J Heldrup, T Olofsson, J Rade, M Fontes, A Porwit-MacDonald,
M Behrendtz, M Hoglund, B Johansson, and T Fioretos.
Microarray-based classification of a consecutive series of 121
childhood acute leukemias: prediction of leukemic and genetic
subtype as well as of minimal residual disease status. Leukemia,
21(6):1198–1203, April 2007.
[Aro50]
N. Aronszajn. Theory of reproducing kernels. Transactions of
the American Mathematical Society, 68:337–404, 1950.
[BBM86]
B. G. Brenner, S. Benarrosh, and R. G. Margolese. Peripheral blood natural killer cell activity in human breast cancer
patients and its modulation by t-cell growth factor and autologous plasma. Cancer, 58(4):895–902, Aug 1986.
[BBR+ 05]
Ekaterini Blaveri, Jeremy L. Brewer, Ritu Roydasgupta, Jane
Fridlyand, Sandy DeVries, Theresa Koppie, Sunanda Pejavar,
Kshama Mehta, Peter Carroll, Jeff P. Simko, and Frederic M.
Waldman. Bladder Cancer Stage and Outcome by ArrayBased Comparative Genomic Hybridization. Clin Cancer Res,
11(19):7012–7022, 2005.
[BBV07]
Kevin Bleakley, Gerard Biau, and Jean-Philippe Vert. Supervised reconstruction of biological networks with local models.
Bioinformatics, 23(13):i57–i65, Jul 2007.
[BDA+ 04]
O Babur, E Demir, A Ayaz, U Dogrusoz, and O Sakarya. Pathway activity inference using microarray data. Technical report,
Bilkent Center for Bioinformatics (BCBI), 2004.
[Bel57]
Richard Ernest Bellman. Dynamic Programming. Dover Publications, Incorporated, 1957.
[BGV92a]
B. E. Boser, I. M. Guyon, and V. N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the
5th annual ACM workshop on Computational Learning Theory,
pages 144–152. ACM Press, 1992.
[BGV92b]
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In
COLT ’92: Proceedings of the fifth annual workshop on Computational learning theory, pages 144–152, New York, NY, USA,
1992. ACM.
[BH95]
Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. Journal of Royal Statistical Society B, 57:289–300, 1995.
[bio]
http://www.biocarta.com.
[BK02]
V. Boudny and J. Kovarik. Jak/stat signaling pathways and
cancer. janus kinases/signal transducers and activators of transcription. Neoplasma, 49(6):349–355, 2002.
[BKPT05]
Thomas Breslin, Morten Krogh, Carsten Peterson, and Carl
Troein. Signal transduction pathway profiling of individual tumor samples. BMC Bioinformatics, 6(1):163, 2005.
[BLC+ 01]
N. Bown, M. Lastowska, S. Cotterill, S. O’Neill, C. Ellershaw,
P. Roberts, I. Lewis, A. D. Pearson, U.K. Cancer Cytogenetics
Group, and the U.K. Children’s Cancer Study Group. 17q gain
in neuroblastoma predicts adverse clinical outcome. u.k. cancer
cytogenetics group and the u.k. children’s cancer study group.
Med Pediatr Oncol, 36(1):14–19, Jan 2001.
[BMC+ 00]
M. Bittner, P. Meltzer, Y. Chen, Y. Jiang, E. Seftor, M. Hendrix, M. Radmacher, R. Simon, Z. Yakhini, A. Ben-Dor,
N. Sampas, E. Dougherty, E. Wang, F. Marincola, C. Gooden,
J. Lueders, A. Glatfelter, P. Pollock, J. Carpten, E. Gillanders, D. Leja, K. Dietrich, C. Beaudry, M. Berens, D. Alberts,
and V. Sondak. Molecular classification of cutaneous malignant
melanoma by gene expression profiling. Nature, 406(6795):536–
540, Aug 2000.
[BR85]
Michael J. Best and Klaus Ritter. Linear Programming: Active
Set Analysis and Computer Programs. Prentice Hall, 1985.
[BSBD+ 04]
Michael T Barrett, Alicia Scheffer, Amir Ben-Dor, Nick Sampas,
Doron Lipson, Robert Kincaid, Peter Tsang, Bo Curry, Kristin
Baird, Paul S Meltzer, Zohar Yakhini, Laurakay Bruhn, and
Stephen Laderman. Comparative genomic hybridization using
oligonucleotide microarrays and total genomic dna. Proc Natl
Acad Sci U S A, 101(51):17765–17770, Dec 2004.
[BV04a]
S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.
[BV04b]
Stephen Boyd and Lieven Vandenberghe. Convex Optimization.
Cambridge University Press, March 2004.
[CDF05]
D. Cavalieri and C. De Filippo. Bioinformatic methods for integrating whole-genome expression results into cellular networks.
Drug Discov Today, 10(10):727–34, 2005.
[CDS98]
S. S. Chen, D. L. Donoho, and M. Saunders. Atomic decomposition by basis pursuit. SIAM J. Sci. Comput., 20(1):33–61,
1998.
[CHTG05]
Timothy W Corson, Annie Huang, Ming-Sound Tsao, and
Brenda L Gallie. Kif14 is a candidate oncogene in the 1q minimal region of genomic gain in multiple cancers. Oncogene,
24(30):4741–4753, May 2005.
[Chu97]
F. R. K. Chung. Spectral graph theory, volume 92 of CBMS
Regional Conference Series. American Mathematical Society,
Providence, 1997.
[CKR+ 04]
Markus W Covert, Eric M Knight, Jennifer L Reed, Markus J
Herrgard, and Bernhard O Palsson. Integrating high-throughput and computational data elucidates bacterial networks. Nature, 429(6987):92–96, May 2004.
[CLL+ 07]
Han-Yu Chuang, Eunjung Lee, Yu-Tsueng Liu, Doheon Lee,
and Trey Ideker. Network-based classification of breast cancer
metastasis. Mol Syst Biol, 3:–, October 2007.
[Con00]
The Gene Ontology Consortium. Gene ontology: tool for the
unification of biology. The Gene Ontology Consortium. Nat.
Genet., 25(1):25–29, May 2000.
[COVP05]
Keira R. Curtis, Matej Oresic, and Antonio Vidal-Puig. Pathways to the analysis of microarray data. Trends in Biotechnology, 23(8):429–435, 2005.
[CST00]
Nello Cristianini and John Shawe-Taylor. An Introduction to
Support Vector Machines: And Other Kernel-Based Learning
Methods. Cambridge University Press, 2000.
[CTTC07]
James Chen, Chen-An Tsai, ShengLi Tzeng, and Chun-Houh
Chen. Gene selection with multiple ordering criteria. BMC
Bioinformatics, 8(1):74, 2007.
[CV95]
Corinna Cortes and Vladimir Vapnik. Support-vector networks.
Machine Learning, 20(3):273–297, 1995.
[CWT+ 06]
S-F Chin, Y Wang, N P Thorne, A E Teschendorff, S E Pinder,
M Vias, A Naderi, I Roberts, N L Barbosa-Morais, M J Garcia,
N G Iyer, T Kranjac, J F R Robertson, S Aparicio, S Tavare,
I Ellis, J D Brenton, and C Caldas. Using array-comparative
genomic hybridization to define molecular portraits of primary
breast cancers. Oncogene, 26(13):1959–1970, September 2006.
[DB04]
Lawrence R Dearth and Rainer K Brachmann. Atp, cancer and
p53. Cancer Biol Ther, 3(7):638–640, Jul 2004.
[DHT+ 97]
D. Donovan, J. H. Harmey, D. Toomey, D. H. Osborne, H. P.
Redmond, and D. J. Bouchier-Hayes. Tgf beta-1 regulation
of vegf production by breast cancer cells. Ann Surg Oncol,
4(8):621–627, 1997.
[DSH+ 03]
Glynn Dennis, Brad Sherman, Douglas Hosack, Jun Yang, Wei
Gao, H Lane, and Richard Lempicki. David: Database for annotation, visualization, and integrated discovery. Genome Biology,
4(9):R60, 2003.
[EDKG+ 05]
Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, and Eytan
Domany. Outcome signature genes in breast cancer: is there a
unique set? Bioinformatics, 21(2):171–178, 2005.
[EHJT04]
B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least
angle regression. Ann. Stat., 32(2):407–499, 2004.
[ET06]
Bradley Efron and Rob Tibshirani. On testing the significance
of sets of genes. Technical report, Annals of Applied Statistics,
2006.
[FCD+ 00]
T. S. Furey, N. Cristianini, N. Duffy, D. W. Bednarski,
M. Schummer, and D. Haussler. Support vector machine classification and validation of cancer tissue samples using microarray
expression data. Bioinformatics, 16(10):906–914, Oct 2000.
[GAL+ 07]
Alexander Genkin, David D. Lewis, and David Madigan. Large-scale bayesian logistic regression for text categorization.
Technometrics, 49(3):291–304, August 2007.
[GC05]
Debashis Ghosh and Arul M. Chinnaiyan. Classification and
selection of biomarkers in genomic data using lasso. Journal of Biomedicine and Biotechnology, 2005(2):147–154, 2005.
doi:10.1155/JBB.2005.147.
[GdBLC03]
Timothy S. Gardner, Diego di Bernardo, David Lorenz, and
James J. Collins. Inferring Genetic Networks and Identifying
Compound Mode of Action via Expression Profiling. Science,
301(5629):102–105, 2003.
[GDS+ 05]
Anatole Ghazalpour, Sudheer Doss, Sonal Sheth, Leslie IngramDrake, Eric Schadt, Aldons Lusis, and Thomas Drake. Genomic
analysis of metabolic pathway gene expression in mice. Genome
Biology, 6(7):R59, 2005.
[gen]
http://www.genmapp.com.
[Ger05]
Diane Gershon. Dna microarrays: More than gene expression.
Nature, 437(7062):1195–1198, October 2005.
[GGNZ06]
Isabelle Guyon, Steve Gunn, Masoud Nikravesh, and Lofti
Zadeh, editors. Feature Extraction, Foundations and Applications. Springer, 2006.
[GST+ 99]
T. R. Golub, D. K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J. P. Mesirov, H. Coller, M. L. Loh, J. R. Downing, M. A.
Caligiuri, C. D. Bloomfield, and E. S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by
Gene Expression Monitoring. Science, 286(5439):531–537, 1999.
[GTL06]
S. J. Galbraith, L. M. Tran, and J. C. Liao. Transcriptome network component analysis with limited microarray data. Bioinformatics, 22(15):1886–94, 2006.
[GVTS04]
I. Gat-Viks, A. Tanay, and R. Shamir. Modeling and analysis
of heterogeneous regulation in biological networks. J Comput
Biol, 11(6):1034–49, 2004.
[GW97]
Bernhard Ganter and Rudolf Wille. Formal Concept Analysis:
Mathematical Foundations. Springer-Verlag New York, Inc., Secaucus, NJ, USA, 1997. Translator-C. Franzke.
[Han86]
Per C Hansen. The truncated SVD as a method for regularization. Technical report, Stanford, CA, USA, 1986.
[HBH+ 04]
Jing-Dong J. Han, Nicolas Bertin, Tong Hao, Debra S. Goldberg, Gabriel F. Berriz, Lan V. Zhang, Denis Dupuy, Albertha J. M. Walhout, Michael E. Cusick, Frederick P. Roth,
and Marc Vidal. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature,
430(6995):88–93, July 2004.
[HDS+ 03]
D. Hosack, G. Dennis Jr., B.T. Sherman, H.C. Lane, and R.A.
Lempicki. Identifying biological themes within lists of genes
with EASE. Genome Biology, R70:1–7, 2003.
[HGO+ 07]
Jian Huang, Arief Gusnanto, Kathleen O’Sullivan, Johan Staaf,
Ake Borg, and Yudi Pawitan. Robust smooth segmentation approach for array CGH data analysis. Bioinformatics,
23(18):2463–2469, 2007.
[HKY99]
Laurie J. Heyer, Semyon Kruglyak, and Shibu Yooseph. Exploring expression data: Identification and analysis of coexpressed
genes. Genome Res., 9(11):1106–1115, November 1999.
[HST+ 04]
Philippe Hupe, Nicolas Stransky, Jean-Paul Thiery, Francois
Radvanyi, and Emmanuel Barillot. Analysis of array CGH data:
from signal ratio to gain and loss of DNA regions. Bioinformatics, 20(18):3413–3422, 2004.
[HTF01]
T. Hastie, R. Tibshirani, and J. Friedman. The elements
of statistical learning: data mining, inference, and prediction.
Springer, 2001.
[HW00]
D. Hanahan and R. A. Weinberg. The hallmarks of cancer. Cell,
100(1):57–70, Jan 2000.
[HZHS07]
Blaise Hanczar, Jean-Daniel Zucker, Corneliu Henegar, and
Lorenza Saitta. Feature construction from synergic pairs to
improve microarray-based classification. Bioinformatics, Oct
2007.
[HZZL02]
D. Hanisch, A. Zien, R. Zimmer, and T. Lengauer. Co-clustering
of biological networks and gene expression data. Bioinformatics,
2002.
[IBC+ 03]
Rafael A Irizarry, Benjamin M Bolstad, Francois Collin,
Leslie M Cope, Bridget Hobbs, and Terence P Speed. Summaries of affymetrix genechip probe level data. Nucleic Acids
Res, 31(4):e15, Feb 2003.
[IML+ 07]
Ahmed Idbaih, Yannick Marie, Carlo Lucchesi, Gaelle Pierron,
Elodie Manie, Virginie Raynal, Veronique Mosseri, Khe HoangXuan, Michele Kujas, Isabel Brito, Karima Mokhtari, Marc
Sanson, Emmanuel Barillot, Alain Aurias, Jean-Yves Delattre,
and Olivier Delattre. Bac array cgh distinguishes mutually exclusive alterations that define clinicogenetic subtypes of gliomas.
Int J Cancer, Dec 2007.
[ITR+ 01]
Trey Ideker, Vesteinn Thorsson, Jeffrey A. Ranish, Rowan
Christmas, Jeremy Buhler, Jimmy K. Eng, Roger Bumgarner, David R. Goodlett, Ruedi Aebersold, and Leroy Hood.
Integrated Genomic and Proteomic Analyses of a Systematically Perturbed Metabolic Network. Science, 292(5518):929–
934, 2001.
[JFG+ 04]
Chris Jones, Emily Ford, Cheryl Gillett, Ken Ryder, Samantha
Merrett, Jorge S. Reis-Filho, Laura G. Fulford, Andrew Hanby,
and Sunil R. Lakhani. Molecular Cytogenetic Identification of
Subgroups of Grade III Invasive Ductal Breast Carcinomas with
Different Clinical Outcomes. Clin Cancer Res, 10(18):5988–
5997, 2004.
[Jol96]
I.T. Jolliffe. Principal component analysis. Springer-Verlag,
New-York, 1996.
[JTGV+ 05]
G. Joshi-Tope, M. Gillespie, I. Vastrik, P. D’Eustachio,
E. Schmidt, B. de Bono, B. Jassal, G. R. Gopinath, G. R.
Wu, L. Matthews, S. Lewis, E. Birney, and L. Stein. Reactome: a knowledgebase of biological pathways. Nucleic Acids
Res, 33(Database issue):D428–32, 2005.
[KCF+ 06]
P. Kharchenko, L. Chen, Y. Freund, D. Vitkup, and G. M.
Church. Identifying metabolic enzymes with multiple types of
association evidence. BMC Bioinformatics, 7:177, 2006.
[KCFH05]
Balaji Krishnapuram, Lawrence Carin, Mario A.T. Figueiredo,
and Alexander J. Hartemink. Sparse multinomial logistic regression: Fast algorithms and generalization bounds. IEEE
Transactions on Pattern Analysis and Machine Intelligence,
27(6):957–968, 2005.
[KGH+ 06]
Minoru Kanehisa, Susumu Goto, Masahiro Hattori, Kiyoko F
Aoki-Kinoshita, Masumi Itoh, Shuichi Kawashima, Toshiaki
Katayama, Michihiro Araki, and Mika Hirakawa. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res, 34(Database issue):D354–D357, Jan 2006.
[KGK+ 04]
M. Kanehisa, S. Goto, S. Kawashima, Y. Okuno, and M. Hattori. The KEGG resource for deciphering the genome. Nucleic
Acids Res., 32(Database issue):D277–80, Jan 2004.
[KHCF04]
Balaji Krishnapuram, Alexander J Hartemink, Lawrence Carin,
and Mario A T Figueiredo. A bayesian approach to joint feature
selection and classifier design. IEEE Trans Pattern Anal Mach
Intell, 26(9):1105–11, Sep 2004.
[KI05]
R. Kelley and T. Ideker. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol., 23(5):561–
566, May 2005.
[KLXG06]
Philip M. Kim, Long J. Lu, Yu Xia, and Mark B. Gerstein. Relating three-dimensional structures to protein networks provides
evolutionary insights. Science, 314(5807):1938–1941, December
2006.
[KOMK+ 05]
P. D. Karp, C. A. Ouzounis, C. Moore-Kochlacs, L. Goldovsky,
P. Kaipa, D. Ahren, S. Tsoka, N. Darzentas, V. Kunin, and
N. Lopez-Bigas. Expansion of the BioCyc collection of pathway/genome databases to 160 genomes. Nucleic Acids Res,
33(19):6083–9, 2005.
[Kon94]
Igor Kononenko. Estimating attributes: Analysis and extensions of RELIEF. In European Conference on Machine Learning, pages 171–182, 1994.
[KPV+ 06]
M. Krull, S. Pistor, N. Voss, A. Kel, I. Reuter, D. Kronenberg, H. Michael, K. Schwarzer, A. Potapov, C. Choi, O. KelMargoulis, and E. Wingender. TRANSPATH: an information resource for storing and visualizing signaling pathways and
their pathological aberrations. Nucleic Acids Res, 34(Database
issue):D546–51, 2006.
[KR92]
Kenji Kira and Larry A. Rendell. A practical approach to feature selection. In ML92: Proceedings of the ninth international
workshop on Machine learning, pages 249–256, San Francisco,
CA, USA, 1992. Morgan Kaufmann Publishers Inc.
[KVC04]
P. Kharchenko, D. Vitkup, and G. M. Church. Filling gaps in a
metabolic network using expression information. Bioinformatics, 20 Suppl 1:I178–I185, Aug 2004.
[LBY+ 03]
J. C. Liao, R. Boscolo, Y. L. Yang, L. M. Tran, C. Sabatti, and
V. P. Roychowdhury. Network component analysis: reconstruction of regulatory signals in biological systems. Proc Natl Acad
Sci U S A, 100(26):15522–7, 2003.
[LF96]
Stephanie R. Land and Jerome H. Friedman. Variable fusion: A
new adaptive signal regression method. Technical Report, 1996.
[LL07]
Caiyan Li and HongZhe Li. Network-constrained regularization and variable selection for analysis of genomic data. UPenn
Biostatistics Working Papers, 23, 2007.
[LNM+ 97]
M. Lastowska, E. Nacheva, A. McGuckin, A. Curtis, C. Grace,
A. Pearson, and N. Bown. Comparative genomic hybridization study of primary neuroblastoma tumors. united kingdom
children’s cancer study group. Genes Chromosomes Cancer,
18(3):162–169, Mar 1997.
[Mac67]
J. B. MacQueen. Some methods of classification and analysis
of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathemtical Statistics and Probability, pages
281–297, 1967.
[mas]
http://www.affymetrix.com/support/developer/.
[MBM+ 04]
G. Mercier, N. Berthault, J. Mary, J. Peyre, A. Antoniadis, J.-P.
Comet, A. Cornuejols, C. Froidevaux, and M. Dutreix. Biological detection of low radiation doses by combining results of two
microarray analysis methods. Nucleic Acids Res., 32(1):e12,
2004.
[MBS05]
Jason McDermott, Roger Bumgarner, and Ram Samudrala.
Functional annotation from predicted protein interaction networks. Bioinformatics, 21(15):3217–3226, Aug 2005.
[MDM+ 01]
G. Mercier, Y. Denis, P. Marc, L. Picard, and M. Dutreix. Transcriptional induction of repair genes during slowing of replication in irradiated Saccharomyces cerevisiae. Mutat. Res., 487(3-4):157–172, Dec 2001.
[MKH05]
Stefan Michiels, Serge Koscielny, and Catherine Hill. Prediction of cancer outcome with microarrays: a multiple random
validation strategy. The Lancet, 365(9458):488–492, 2005.
[MM01]
Songrit Maneewongvatana and David M. Mount. The analysis
of a probabilistic approach to nearest neighbor searching. In
WADS ’01: Proceedings of the 7th International Workshop on
Algorithms and Data Structures, pages 276–286, London, UK,
2001. Springer-Verlag.
[Moh97]
B. Mohar. Some applications of Laplace eigenvalues of graphs.
In G. Hahn and G. Sabidussi, editors, Graph Symmetry: Algebraic Methods and Applications, volume 497 of NATO ASI
Series C, pages 227–275. Kluwer, Dordrecht, 1997.
[MOSS+ 04]
Fuminori Matsumoto, Takeshi Obayashi, Yuko Sasaki-Sekimoto, Hiroyuki Ohta, Ken-ichiro Takamiya, and Tatsuru
Masuda. Gene expression profiling of the tetrapyrrole metabolic
pathway in Arabidopsis with a mini-array system. Plant Physiol, 135(4):2379–2391, Aug 2004.
[MS03]
Jason McDermott and Ram Samudrala. Bioverse: Functional,
structural and contextual annotation of proteins and proteomes.
Nucleic Acids Res, 31(13):3736–3737, Jul 2003.
[NEGL+ 05]
Georges Natsoulis, Laurent El Ghaoui, Gert R.G. Lanckriet,
Alexander M. Tolley, Fabrice Leroy, Shane Dunlea, Barrett P.
Eynon, Cecelia I. Pearson, Stuart Tugendreich, and Kurt Jarnagin. Classification of a large microarray data set: Algorithm
comparison and analysis of drug signatures. Genome Res.,
15(5):724–736, 2005.
[OBS+ 03]
Ronan C O’Hagan, Cameron W Brennan, Andrew Strahs,
Xuegong Zhang, Karuppiah Kannan, Melissa Donovan, Craig
Cauwels, Norman E Sharpless, Wing Hung Wong, and Lynda
Chin. Array comparative genome hybridization for tumor classification and gene discovery in mouse models of malignant
melanoma. Cancer Res, 63(17):5352–5356, Sep 2003.
[OVLW04]
Adam B. Olshen, E. S. Venkatraman, Robert Lucito, and
Michael Wigler. Circular binary segmentation for the analysis
of array-based DNA copy number data. Biostat, 5(4):557–572,
2004.
[Pav03]
Paul Pavlidis. Using ANOVA for gene selection from microarray
studies of the nervous system. Methods, 31(4):282–289, Dec
2003.
[PFG+ 03]
Paola Parrella, Vito M. Fazio, Antonietta P. Gallo, David
Sidransky, and Shannath L. Merbs. Fine mapping of chromosome 3 in uveal melanoma: Identification of a minimal region of deletion on chromosomal arm 3p25.1-p25.2. Cancer Res,
63(23):8507–8510, December 2003.
[PSS+ 98]
D. Pinkel, R. Segraves, D. Sudar, S. Clark, I. Poole, D. Kowbel,
C. Collins, W. L. Kuo, C. Chen, Y. Zhai, S. H. Dairkee, B. M.
Ljung, J. W. Gray, and D. G. Albertson. High resolution analysis of DNA copy number variation using comparative genomic
hybridization to microarrays. Nat Genet, 20(2):207–211, Oct
1998.
[QXGY06]
Xing Qiu, Yuanhui Xiao, Alexander Gordon, and Andrei
Yakovlev. Assessing stability of gene selection in microarray
data analysis. BMC Bioinformatics, 7(1):50, 2006.
[RDML04]
J Rahnenfuhrer, FS Domingues, J Maydt, and T. Lengauer.
Calculating the statistical significance of changes in pathway
activity from gene expression data. Statistical Applications in
Genetics and Molecular Biology, 3(1):Article 16, 2004.
[RLS+ 05]
O Radulescu, S Lagarrigue, A Siegel, M Le Borgne, and P Veber. Topology and static response of interaction networks in
molecular biology. J.R.Soc.Interface, Published online, 2005.
[RSA+ 07]
Fidel Ramirez, Andreas Schlicker, Yassen Assenov, Thomas
Lengauer, and Mario Albrecht. Computational analysis of human protein interaction networks. Proteomics, 7(15):2541–2552,
Aug 2007.
[RTV+ 05]
Daniel R Rhodes, Scott A Tomlins, Sooryanarayana Varambally, Vasudeva Mahavisno, Terrence Barrette, Shanker
Kalyana-Sundaram, Debashis Ghosh, Akhilesh Pandey, and
Arul M Chinnaiyan. Probabilistic model of the human protein-protein interaction network. Nat Biotech, 23(8):951–959, August 2005.
[RVH+ 05]
Jean-Francois Rual, Kavitha Venkatesan, Tong Hao, Tomoko
Hirozane-Kishikawa, Amelie Dricot, Ning Li, Gabriel F. Berriz,
Francis D. Gibbons, Matija Dreze, Nono Ayivi-Guedehoussou,
Niels Klitgord, Christophe Simon, Mike Boxem, Stuart Milstein, Jennifer Rosenberg, Debra S. Goldberg, Lan V. Zhang,
Sharyl L. Wong, Giovanni Franklin, Siming Li, Joanna S. Albala, Janghoo Lim, Carlene Fraughton, Estelle Llamosas, Sebiha Cevik, Camille Bex, Philippe Lamesch, Robert S. Sikorski,
Jean Vandenhaute, Huda Y. Zoghbi, Alex Smolyar, Stephanie
Bosak, Reynaldo Sequerra, Lynn Doucette-Stamm, Michael E.
Cusick, David E. Hill, Frederick P. Roth, and Marc Vidal. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437(7062):1173–1178, October 2005.
[RZD+ 07]
Franck Rapaport, Andrei Zinovyev, Marie Dutreix, Emmanuel
Barillot, and Jean-Philippe Vert. Classification of microarray
data using gene networks. BMC Bioinformatics, 8(1):35, 2007.
[Saa96]
Yousef Saad. Iterative Methods for Sparse Linear Systems. Society for Industrial and Applied Mathematics, 1996.
[Shl05]
Jonathon Shlens. A tutorial on principal component analysis.
Technical report, 2005.
[SK97]
M. Sikonja and I. Kononenko. An adaptation of relief for attribute estimation in regression, 1997.
[Smi93]
Murray Smith. Neural Networks for Statistical Modeling. John
Wiley & Sons, Inc., New York, NY, USA, 1993.
[SMO+ 03]
P. Shannon, A. Markiel, O. Ozier, N. S. Baliga, J. T. Wang,
D. Ramage, N. Amin, B. Schwikowski, and T. Ideker. Cytoscape: a software environment for integrated models of
biomolecular interaction networks. Genome Res, 13(11):2498–
504, 2003.
[SMR+ 03]
Danielle C. Shing, Dominic J. McMullan, Paul Roberts, Kim
Smith, Suet-Feung Chin, James Nicholson, Roger M. Tillman, Pramila Ramani, Catherine Cullinane, and Nicholas Coleman. Fus/erg gene fusions in ewing’s tumors. Cancer Res,
63(15):4568–4576, August 2003.
[SPdM+ 94]
MR Speicher, G Prescher, S du Manoir, A Jauch, B Horsthemke, N Bornfeld, R Becher, and T Cremer. Chromosomal
gains and losses in uveal melanomas detected by comparative
genomic hybridization. Cancer Res., 54(14):3817–23, July 1994.
[SS02]
B. Schölkopf and A. J. Smola. Learning with Kernels: Support
Vector Machines, Regularization, Optimization, and Beyond.
MIT Press, Cambridge, MA, 2002.
[SSKK03]
J. M. Stuart, E. Segal, D. Koller, and S. K. Kim. A genecoexpression network for global discovery of conserved genetic
modules. Science, 302(5643):249–55, 2003.
[SSM99]
Bernhard Schölkopf, Alexander J. Smola, and Klaus-Robert
Müller. Kernel principal component analysis. Advances in kernel methods: support vector learning, pages 327–352, 1999.
[STC00]
John Shawe-Taylor and Nello Cristianini. An Introduction
to Support Vector Machines and Other Kernel-based Learning
Methods. Cambridge University Press, 2000.
[STC04]
John Shawe-Taylor and Nello Cristianini. Kernel Methods for
Pattern Analysis. Cambridge University Press, 2004.
[STM+ 05]
Aravind Subramanian, Pablo Tamayo, Vamsi K Mootha, Sayan
Mukherjee, Benjamin L Ebert, Michael A Gillette, Amanda
Paulovich, Scott L Pomeroy, Todd R Golub, Eric S Lander, and
Jill P Mesirov. Gene set enrichment analysis: a knowledge-based
approach for interpreting genome-wide expression profiles. Proc
Natl Acad Sci U S A, 102(43):15545–15550, Oct 2005.
[Stu99]
J.F. Sturm. Using SeDuMi 1.02, a MATLAB toolbox for optimization over symmetric cones. Optimization Methods and
Software, 11–12:625–653, 1999. Special issue on Interior Point
Methods (CD supplement with software).
[STV04]
B. Schölkopf, K. Tsuda, and J.-P. Vert. Kernel Methods in
Computational Biology. MIT Press, 2004.
[SVR+ 06]
Nicolas Stransky, Celine Vallot, Fabien Reyal, Isabelle BernardPierrot, Sixtina Gil Diez de Medina, Rick Segraves, Yann
de Rycke, Paul Elvin, Andrew Cassidy, Carolyn Spraggon,
Alexander Graham, Jennifer Southgate, Bernard Asselain, Yves
Allory, Claude C Abbou, Donna G Albertson, Jean Paul Thiery,
Dominique K Chopin, Daniel Pinkel, and Francois Radvanyi.
Regional copy number-independent deregulation of transcription in cancer. Nat Genet, 38(12):1386–1396, December 2006.
[SWL+ 05]
Ulrich Stelzl, Uwe Worm, Maciej Lalowski, Christian Haenig,
Felix H. Brembeck, Heike Goehler, Martin Stroedicke, Martina Zenkner, Anke Schoenherr, Susanne Koeppen, Jan Timm,
Sascha Mintzlaff, Claudia Abraham, Nicole Bock, Silvia Kietzmann, Astrid Goedde, Engin Toksoz, Anja Droege, Sylvia Krobitsch, Bernhard Korn, Walter Birchmeier, Hans Lehrach, and
Erich E. Wanker. A human protein-protein interaction network: A resource for annotating the proteome. Cell, 122:957–
968, 2005.
[SYDM05]
A.Y. Sivachenko, A. Yuriev, N. Daraselia, and I. Mazo. Identifying local gene expression patterns in biomolecular networks.
Proceedings of 2005 IEEE Computational Systems Bioinformatics Conference, Stanford, California, 2005.
[SZEK07]
Gunnar Schramm, Marc Zapatka, Roland Eils, and Rainer
Koenig. Using gene expression data and network topology to
detect substantial pathways, clusters and switches during oxygen deprivation of Escherichia coli. BMC Bioinformatics, 8:149,
2007.
[THH+ 08]
Julien Trolet, Philippe Hupe, Isabelle Huon, Ingrid Lebigot,
Pascale Mariani, Corine Plancher, Bernard Asselain, Laurence
Desjardins, Olivier Delattre, Xavier Sastre-Garau, Jean-Paul
Thiery, Simon Saule, Sophie Piperno-Neumann, Emmanuel
Barillot, and Jerome Couturier. Genomic profiling and identification of high risk tumors in uveal melanoma by array-cgh
analysis of primary tumors and liver metastases. submitted to
Cancer Res, 2008.
[Tib96]
R. Tibshirani. Regression shrinkage and selection via the lasso.
J. Royal. Statist. Soc. B., 58(1):267–288, 1996.
[Tib97]
R. Tibshirani. The lasso method for variable selection in the
Cox model. Stat Med, 16(4):385–395, Feb 1997.
[TK01]
R. Thomas and M. Kaufman. Multistationarity, the basis of cell
differentiation and memory. II. Logical analysis of regulatory
networks in terms of feedback circuits. Chaos, 11(1):180–195,
2001.
[Tod02]
Michael J. Todd. The many facets of linear programming. Mathematical Programming, 91(3):417–436, February 2002.
[TPH+ 01]
Frank Tschentscher, Gabriele Prescher, Douglas E. Horsman,
Valerie A. White, Harald Rieder, Gerasimos Anastassiou, Harald Schilling, Norbert Bornfeld, Karl Ulrich Bartz-Schmidt,
Bernhard Horsthemke, Dietmar R. Lohmann, and Michael
Zeschnigk. Partial deletions of the long and short arm of
chromosome 3 point to two tumor suppressor genes in uveal
melanoma. Cancer Res, 61(8):3439–3442, April 2001.
[TSR+ 05]
Robert Tibshirani, Michael Saunders, Saharon Rosset, Ji Zhu, and Keith Knight. Sparsity and smoothness via the fused lasso. Journal of the Royal Statistical Society Series B, 67(1):91–108, 2005. Available at http://ideas.repec.org/a/bla/jorssb/v67y2005i1p91-108.html.
[TW07]
Robert Tibshirani and Pei Wang. Spatial smoothing and hot
spot detection for CGH data using the fused lasso. Biostatistics,
May 2007.
[Vap98]
Vladimir Naumovich Vapnik. Statistical Learning Theory. Wiley, 1998.
[vBN06]
Erik van Beers and Petra Nederlof. Array-cgh and breast cancer.
Breast Cancer Research, 8(3):210, 2006.
[VDS+ 07]
Imre Vastrik, Peter D’Eustachio, Esther Schmidt, Geeta JoshiTope, Gopal Gopinath, David Croft, Bernard de Bono, Marc
Gillespie, Bijay Jassal, Suzanna Lewis, Lisa Matthews, Guanming Wu, Ewan Birney, and Lincoln Stein. Reactome: a knowledge base of biologic pathways and processes. Genome Biol,
8(3):R39, 2007.
[vHGW+ 01]
J. van Helden, D. Gilbert, L. Wernisch, M. Schroeder, and
S. J. Wodak. Application of regulatory sequence analysis and
metabolic network analysis to the interpretation of gene expression data. In JOBIM ’00: Selected papers from the First International Conference on Computational Biology, Biology, Informatics, and Mathematics, pages 147–164, London, UK, 2001.
Springer-Verlag.
[VK03]
J. P. Vert and M. Kanehisa. Extracting active pathways from
gene expression data. Bioinformatics, 19 Suppl 2:II238–II244,
2003.
[VRVB+ 02]
Nadine Van Roy, Jo Vandesompele, Geert Berx, Katrien Staes,
Mireille Van Gele, Els De Smet, Anne De Paepe, Genevieve
Laureys, Pauline van der Drift, Rogier Versteeg, Frans Van Roy,
and Frank Speleman. Localization of the 17q breakpoint of a
constitutional 1;17 translocation in a patient with neuroblastoma within a 25-kb segment located between the accn1 and
tlk2 genes and near the distal breakpoints of two microdeletions
in neurofibromatosis type 1 patients. Genes, Chromosomes and
Cancer, 35(2):113–120, 2002.
[vtVDvdV+ 02] L. J. van ’t Veer, H. Dai, M. J. van de Vijver, Y. D. He, A. A.
Hart, M. Mao, H. L. Peterse, K. van der Kooy, M. J. Marton,
A. T. Witteveen, G. J. Schreiber, R. M. Kerkhoven, C. Roberts,
P. S. Linsley, R. Bernards, and S. H. Friend. Gene expression profiling predicts clinical outcome of breast cancer. Nature,
415(6871):530–536, January 2002.
[War63]
Joe H. Ward. Hierarchical grouping to optimize an objective
function. Journal of American Statistical Association, 58:236–
244, 1963.
[WCK+ 91]
Frederic M. Waldman, Peter R. Carroll, Russell Kerschmann,
Michael B. Cohen, Frederick G. Field, and Brian H. Mayall.
Centromeric copy number of chromosome 7 is strongly correlated with tumor grade and labeling index in human bladder
cancer. Cancer Res, 51(14):3807–3813, July 1991.
[WIG+ 04]
Zhijin Wu, Rafael A. Irizarry, Robert Gentleman, Francisco
Martinez-Murillo, and Forrest Spencer. A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association, 99:909–917(9), December 2004.
[WKZ+ 05]
Yixin Wang, Jan G M Klijn, Yi Zhang, Anieta M Sieuwerts,
Maxime P Look, Fei Yang, Dmitri Talantov, Mieke Timmermans, Marion E Meijer van Gelder, Jack Yu, Tim Jatkoe, Els M
J J Berns, David Atkins, and John A Foekens. Gene-expression
profiles to predict distant metastasis of lymph-node-negative
primary breast cancer. Lancet, 365(9460):671–679, 2005.
[Wri87]
Stephen J. Wright. Primal-Dual Interior-Point Methods. Society for Industrial and Applied Mathematics, 1987.
[YMH+ 07]
Xifeng Yan, Michael R Mehan, Yu Huang, Michael S Waterman,
Philip S Yu, and Xianghong Jasmine Zhou. A graph-based
approach to systematically reconstruct human transcriptional
regulatory modules. Bioinformatics, 23(13):i577–i586, Jul 2007.
[YMK+ 06]
Anton Yuryev, Zufar Mulyukov, Ekaterina Kotelnikova, Sergei
Maslov, Sergei Egorov, Alexander Nikitin, Nikolai Daraselia,
and Ilya Mazo. Automatic pathway building in biological association networks. BMC Bioinformatics, 7:171, 2006.
[YWF+ 06]
Jun Yao, Stanislawa Weremowicz, Bin Feng, Robert C. Gentleman, Jeffrey R. Marks, Rebecca Gelman, Cameron Brennan,
and Kornelia Polyak. Combined cDNA Array Comparative Genomic Hybridization and Serial Analysis of Gene Expression
Analysis of Breast Tumor Progression. Cancer Res, 66(8):4065–
4078, 2006.
[YY06]
James J. Yang and Mark CK Yang. An improved procedure for
gene selection from microarray experiments using false discovery
rate criterion. BMC Bioinformatics, 7:15, 2006.
[ZRHT03]
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support vector machines, 2003.
[ZRHT04]
J. Zhu, S. Rosset, T. Hastie, and R. Tibshirani. 1-norm support
vector machines. In S. Thrun, L. Saul, and B. Schölkopf, editors,
Adv. Neural. Inform. Process Syst., volume 16, Cambridge, MA,
2004. MIT Press.