BIOINFORMATICS

Getting the most from molecular cancer data using unsupervised feature learning

James Skinner 1,∗ and Dr Richard Savage 2,3

1 Centre for Complexity Science, University of Warwick
2 Warwick Systems Biology Centre, University of Warwick
3 Warwick Medical School, University of Warwick

∗ To whom correspondence should be addressed.

Vol. 00 no. 00 2014, Pages 1–13

ABSTRACT

Motivation:

The prevalence of cancer and its implications for public health and well-being justify it as a large, active area of research.

Molecular data such as gene expression are being generated at an increasing rate, but have proven difficult to extract information from.

Here we investigate methods of extracting information from gene expression data using expression and survival data on 1981 breast cancer patients. Our goal is to discover the best way in which to extract a small number of features from the data which are informative with regard to survival prognosis. We perform unsupervised feature learning before supervised survival time prediction, using predictive accuracy as a measure of the quality of the information contained in the learned features.

Results:

We find that, given tumour expression data, we are able to predict patient survival time with a concordance index of 0.669. The recently data-mined metagenes are found to out-perform the literature-mined gene list used in the MammaPrint™ diagnostic test. We also find feature extraction on the raw expression data to outperform the metagenes at high dimensionality and under-perform at equal dimensionality. Results suggest the gene expression data have a linear, clustered structure, and that patient survival approximately satisfies the proportional hazards assumption made by the Cox model.

Contact: j.r.skinner@warwick.ac.uk

1 INTRODUCTION

One in four deaths in the United States is due to cancer, of which breast cancer is the second leading type (Siegel et al., 2014). With the large number of people affected (Table 1), any progress towards improving diagnosis, treatment or otherwise reducing the cost or discomfort incurred by cancer would be of great benefit.

With the advent of new, high-throughput technologies such as DNA sequencing for genomics and microarrays or RNA-seq for transcriptomics, large amounts of data are becoming available from which we may learn a great deal about a number of diseases. As the cost of these technologies continues to fall, the routine construction of such -omics datasets for medical patients will become increasingly feasible. This opens up the challenge of dealing with such data; whilst the data may contain information allowing the personalisation of treatment, this information may be difficult to extract. This is particularly relevant to cancer: not only is cancer a large class of different diseases, but the nature of cancer means the genome and transcriptome of a tumour is under strong mutational drift, obscuring useful prognostic information under considerable noise.

In this work, we investigate a dataset from METABRIC (Molecular Taxonomy of Breast Cancer International Consortium) of gene expression from 1981 breast tumours (Curtis et al., 2012), labelled with patient survival data, and apply a variety of supervised and unsupervised learning techniques to attempt to extract signatures from the molecular data. We also investigate restricting to lists of genes known to be relevant to cancer (van 't Veer et al., 2002; Cheng et al., 2013) and compare the quality of the information obtained with that extracted by unsupervised learning techniques. Before starting the investigation we split the data into a training set of 1320 patients, which we use to evaluate how the techniques perform and to select parameter values such as dimensionality, and a test set of 661 patients on which we finally evaluate the performance of our algorithms. The splitting was performed by assigning every third patient to the test data set.
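As a concrete illustration, the following is a minimal R sketch of this split, assuming the expression data are held in a samples-by-probes matrix `expr` with matching label vectors `surv_time` and `surv_status` (hypothetical names), and that the "every third patient" rule starts from the first patient:

```r
# Hypothetical objects: `expr` is a 1981 x 49576 matrix of probe intensities,
# `surv_time` and `surv_status` are the matching survival labels.
n <- nrow(expr)
test_idx  <- seq(1, n, by = 3)               # every third patient: 1, 4, 7, ... (661 samples)
train_idx <- setdiff(seq_len(n), test_idx)   # remaining 1320 samples

X_train <- expr[train_idx, ];  X_test <- expr[test_idx, ]
y_train <- data.frame(time = surv_time[train_idx], status = surv_status[train_idx])
y_test  <- data.frame(time = surv_time[test_idx],  status = surv_status[test_idx])
```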

The purpose of this work is to discover the best methods for extracting useful signatures from gene expression microarray data with regard to predicting survival times of cancer patients. The motivation behind accurate survival time prediction is twofold. Firstly, by predicting patient survival times we may influence treatment to minimise discomfort and optimise quality of life.

If a patient is old and their predicted survival time from cancer very long, then this would warrant a very different treatment from a young patient with a shorter predicted survival time.

The second motive for accurate survival time prediction is that we may use predictive accuracy as a measure of our ability to extract high quality, relevant information from the raw data. If we are able to restrict the data to a small set of genes or gene interactions and obtain a high degree of predictive accuracy of patient survival time, then it is likely that these genes are involved in the underlying biology driving the cancer. This would then identify promising research directions and help us direct our efforts in developing drugs targeting particular genes or pathways.

Table 1. Statistics of cancer prevalence in the US (Siegel et al., 2014).

            2014 projections            Probability (%) of developing invasive cancer 2008–2010, by age (years)
            New cases     Deaths        <49     50–59    60–69    >69     Lifetime
Breast      235,030       40,430        1.9     2.3      3.5      6.7     12.3
All sites   1,665,540     585,720       4.5     6.4      12.8     31.8    40.9

© Oxford University Press 2014.


2 METABRIC

We investigate a dataset of breast tumour gene expression microarray measurements from the METABRIC study (Curtis et al., 2012). The data consist of 1981 independent samples of 49,576 microarray probe intensities, along with a label of the patient survival time. This survival time may be censored or uncensored; that is, the time may be the known survival time, or may be the last time the patient was seen, after which their status is unknown.

The data are obtained from fresh-frozen primary invasive breast carcinomas, all from female patients. DNA and RNA are extracted from each specimen, and the expression data are generated through transcriptional profiling on the Illumina HT-12 v3 Expression BeadChip (Illumina, 2014), a particular microarray. The data are hosted at the European Genome-Phenome Archive (EGA, http://www.ebi.ac.uk/ega/).

Benign tumours and tumours with low cellularity were excluded from the set, and RNA/DNA mismatches were flagged. Apparently related individuals, identified through genotype analysis on the Affymetrix SNP 6.0 platform, were also removed. Clinical and demographic information are available in Tables S1–S3 in the supplementary information of Curtis et al. (2012).

Being microarray data, each variable corresponds to the quantity of mRNA (the product of gene transcription) from each sample which binds to a particular single-stranded DNA probe in the microarray. These probes have been designed to minimise the error in mapping probe activity to gene expression levels, so we may infer gene expression levels from probe activities. As is normal with microarray data, we have very many more features than samples. We are able to convert from probe activity to gene expression using the R package of Dunning et al. (2013).
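A hedged sketch of such a probe-to-gene mapping is given below; it assumes the relevant Bioconductor annotation package is illuminaHumanv3.db (the Illumina HT-12 v3 annotation associated with the work of Dunning et al.) and that the probe identifiers are held in a character vector `probe_ids`, both of which are assumptions:

```r
# Assumed packages and objects: illuminaHumanv3.db (Bioconductor annotation for the
# Illumina HT-12 v3 BeadChip) and a character vector `probe_ids` of probe identifiers.
library(AnnotationDbi)
library(illuminaHumanv3.db)

gene_symbols <- mapIds(illuminaHumanv3.db,
                       keys    = probe_ids,
                       column  = "SYMBOL",
                       keytype = "PROBEID")
head(gene_symbols)   # named vector: probe ID -> gene symbol (NA where no mapping exists)
```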

3 RESTRICTED GENE SETS

Previous work has been performed on identifying particular sets of genes with relevance to breast cancer. The set of 12 attractor metagenes (Cheng et al., 2013) has been mined computationally from a number of gene expression datasets and shows promising results, in that very similar metagenes are mined from ovarian, colon and breast cancer data. MammaPrint™ is a diagnostic test in which a breast tumour tissue sample is analysed for the activity of a set of 70 genes; these data are then used to provide information on the risk of the tumour metastasising (van 't Veer et al., 2002; Cardoso et al., 2007). The metagene and MammaPrint™ gene lists importantly differ in that the MammaPrint™ gene list was mined from the literature by hand, whilst the metagenes were discovered entirely computationally from data.

3.1 Attractor Metagenes

The set of attractor metagenes, henceforth referred to simply as "metagenes", is a set of linear combinations of gene expressions mined from cancer gene expression data (Cheng et al., 2013; Cheng and Wei-Yi, 2013). The method for extracting metagenes is an iterative unsupervised method, similar to k-means clustering in having two repeated steps: the consensus metagene is defined as the average of all genes in a cluster of co-expressed genes, then the members of the cluster are replaced with the genes most highly correlated with the consensus metagene. These steps are repeated until the clusters and their consensus metagenes have sufficiently converged. The 'heart' of each cluster is identified, consisting of the most strongly co-expressed genes, and all other genes are removed.
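The following is a minimal sketch of the iterative scheme just described, not the exact algorithm of Cheng et al. (which uses weighted association measures); `X` (a samples-by-genes matrix), `seed` (an initial set of gene column indices) and the cluster size `m` are hypothetical names:

```r
# Sketch of the attractor-style iteration described in the text:
# average the current cluster into a consensus metagene, then replace the cluster
# with the genes most highly correlated with it, until the cluster stops changing.
find_attractor <- function(X, seed, m = 50, max_iter = 100) {
  cluster <- seed
  for (iter in seq_len(max_iter)) {
    metagene    <- rowMeans(X[, cluster, drop = FALSE])   # consensus metagene (one value per sample)
    cors        <- cor(X, metagene)                       # correlation of every gene with it
    new_cluster <- order(cors, decreasing = TRUE)[1:m]    # keep the most co-expressed genes
    if (setequal(new_cluster, cluster)) break             # converged
    cluster <- new_cluster
  }
  list(genes    = cluster,
       metagene = rowMeans(X[, cluster, drop = FALSE]))
}
```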

Each consensus metagene is a linear combination of the expressions of all genes in its cluster. We attempted further feature extraction and cross-validation on the set of 12 metagenes relevant to cancer prediction identified by Cheng et al. (2013). Additionally, we also separated each metagene out into its component genes, retaining their coefficients, and worked with this set of 81 genes. For clarity, we will refer to these reduced-dimension datasets as the 'mspace' and 'gspace' metagenes respectively.

The metagenes identified map well to well-defined biological functions. Cheng et al. (2013) put particular emphasis on three of the metagenes, which they name the Mesenchymal Transition metagene, the Mitotic CIN metagene and the Lymphocyte-Specific metagene.

The Mesenchymal Transition metagene contains primarily genes associated with the ability of epithelial cells to become migratory. Whilst this process, known as Epithelial-Mesenchymal Transition (EMT), is important to many processes such as development, it is also involved in the metastasis of tumours (Kalluri and Weinberg, 2009). This metagene is expressed early in the development of breast cancer, and is highly prognostic (predictive of survival) in early-stage tumours.

The Mitotic CIN metagene consists mostly of kinetochore-associated genes, where the kinetochore is a protein structure involved in the separation of chromosomal material between the two daughter cells in cell replication. This attractor shows association with Chromosomal Instability (CIN) (Geigl et al., 2008), a cellular condition in which large portions of chromosomes, or entire chromosomes, are gained or lost at a high rate during cell replication. Expression of this metagene signifies uncontrolled cell division, and is thus highly prognostic in all cases. This metagene is found to be the most prognostic in breast cancer.

The Lymphocyte-Specific metagene contains mainly lymphocyte-specific genes. Lymphocytes are a class of white blood cell, and are able to invade and attack tumours in certain types of cancer (Shankaran et al., 2001). Expression of the Lymphocyte-Specific metagene has been found to be strongly protective in ER-negative breast cancers (that is, breast cancers in which the tumour cells do not require oestrogen to grow and divide). While the precise details are not fully understood, it appears that this is related to the ability of the tumour to recruit immune cells to fight the cancer.

3.2 MammaPrint

The MammaPrint™ 70-gene list was supplied using a gene naming convention, while the DREAM microarray dataset features relate to probes in a microarray. To work with the 70-gene list, a mapping was constructed between genes and probes using the R package of Dunning et al. (2013), written specifically for working with data generated by the microarray used (Illumina, 2014). Since it is often the case that multiple microarray probes correspond to the same gene transcript, mapping from the 70 genes gives 118 probes, where every gene mapped to at least one probe, and all probes were retained.


4 UNSUPERVISED FEATURE LEARNING

We perform a number of unsupervised learning techniques to reduce the dimensionality of the data. Unsupervised learning potentially plays two important but separate roles: improving predictive accuracy in regression, and providing insight into the structure of the data. It is normally the case that one of these goals is emphasised at the expense of the other. In addition, by reducing the dimensionality of the data we also reduce the time and memory requirements of any algorithm dealing with the data, potentially making the use of higher-complexity algorithms feasible.

Predictive accuracy in machine learning may often be improved by first reducing the dimensionality of a dataset by representing the data with some small set of learned features. There are a number of intuitive reasons why this works. Firstly, we may describe data in as high-dimensional a space as we like, but adding more dimensions beyond the intrinsic dimensionality of the data does not introduce any more information, and adding a very large number of uninformative dimensions may cause the inherent errors in any machine-learned function to add up to a significant deviation from the true underlying function. Secondly, to have statistical confidence in a regression, we require that the data be sufficiently dense in the space in which the regression is performed. As the dimensionality of a space increases, so does its volume, so the quantity of data required to achieve a given density grows exponentially with the dimensionality. Finally, if the data lie very close to some subspace, discovering this subspace and projecting the data onto it likely removes only noise introduced by the measurement process, improving the quality of the data.

By performing unsupervised feature learning and then following up with cross-validation, we are able to discover structure in the data. We may find that certain variables (or linear combinations of variables) are completely uninformative, having no relevance to the task at hand; sparse methods, in which very small coefficients are preferentially set to exactly zero, may be preferable for identifying such uninformative dimensions. We may also discover that the data lie very close to some lower-dimensional linear subspace or nonlinear manifold, telling us that the data are intrinsically lower dimensional and that the state of the system may be described by a smaller number of variables.

Given n data points represented by m > n features, the data are unable to span more than n dimensions. This tells us that there is some linear transformation to an n dimensional representation which loses none of the information in the data. Since we are working with gene expression data, m exceeds n by more than an order of magnitude, so we know that we are able to reduce the data to n dimensions without damaging predictive accuracy.

4.1 Principal Component Analysis

We begin the investigation by performing Principal Component Analysis (PCA) (Barber, 2012) on the data. PCA is attractive for a number of reasons. Since the principal components are orthogonal, we know that we can capture all directions of variability in the gene expression data with a number of principal components equal to the number of samples. Additionally, we are able to vary the dimensionality of the projected data without recalculating the principal components each time: we reduce dimensionality by projecting the data onto the top few principal components, where all principal components are calculated at once and only the ones we require are retained. This contrasts with other techniques, such as clustering, where we must recalculate the clusters for each target dimensionality, requiring far more computation.

Investigating the eigenvalue distribution of the covariance matrix, which is calculated as part of PCA, is informative. Typically, when plotting the eigenvalues ordered by the index of their principal component, we observe a steep, approximately linear decline in eigenvalue magnitude for the lower indices, joined by a 'knee' to a gentler, approximately linear decline for the higher indices. The interpretation is that the earlier principal components capture the signal in the data whilst the later components capture the noise, and the knee sits at the boundary between these two regimes. The location of the knee is therefore an indicator of the optimum number of principal components to retain, and the relative slopes of the signal and noise regimes indicate how much noise is in the data.

Ideally the eigenvalue plot would be a step function, with a small number of principal components equally capturing the full signal in the data, and the remaining components containing no useful information and being discarded. In reality, we see a soft transition from the 'signal' components into the 'noise' components, with the majority of the components in the noise regime when the data are very high dimensional. If there are a very large number of components with small eigenvalues, the small amounts of error inherent in each component may add up to contribute considerable variance to the data, so removing these dimensions may filter noise from the signal and improve predictive performance.
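A minimal R sketch of this step, assuming the training expression matrix `X_train` (samples in rows) from the earlier sketch and a knee judged by eye from the eigenvalue plot:

```r
pca <- prcomp(X_train, center = TRUE, scale. = FALSE)

eigenvalues <- pca$sdev^2                        # eigenvalues of the covariance matrix
plot(eigenvalues[1:100], type = "b",
     xlab = "Principal component", ylab = "Eigenvalue")   # look for the 'knee'

k <- 20                                          # dimensionality chosen near the knee
X_reduced <- pca$x[, 1:k]                        # data projected onto the top k components
```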

4.2 Sparse Principal Component Analysis

Expanding on PCA, we investigate Sparse PCA (SPCA) (Zou et al., 2004; Zou and Hastie, 2012), a variant of PCA which removes the component orthogonality constraint and produces loading vectors which are sparse (contain exactly zero entries). Sparsity is achieved by introducing a lasso regularisation on the loadings, the matrix of coefficients by which we multiply the data to project onto the principal components, giving a preference for loadings containing many zeroes.

SPCA has advantages in the interpretability of the results. In PCA, each component is a linear combination of every variable and is difficult to comprehend. By contrast, SPCA loadings may contain mostly zeroes with a small number of non-zero elements. If the variables in the dataset represent comprehensible physical phenomena, such as gene expression, then the sparse loadings describe small sets of these phenomena which vary together, and the magnitudes with which they do so, and thus may be informative about the problem being studied.

If SPCA produces loadings with an entire row of zeroes, the corresponding variable makes no contribution to the projected data. This is desirable since the variables often correspond to phenomena we can understand or control, and it means that the variable in question need not be recorded. The removal of variables is known as variable selection, and is often performed separately, but here we see it is built into SPCA.

SPCA is particularly relevant to microarray data such as we are dealing with here. The variables, corresponding to RNA probes, may contain redundancy from either multiple probes targeting the same gene transcript, or from multiple gene transcripts resulting from activation of a single underlying mechanism. This redundancy may be removed through sparse loadings ignoring one or more variables, and inspecting these loadings conveys which probes are uninformative. Zou et al. (2004) describe an SPCA algorithm specifically for the case of having many more features than samples, as is the case with gene expression data.
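A hedged sketch using the elasticnet R package of Zou and Hastie, assumed here to be the implementation referred to; the number of components and the per-component penalty values are illustrative only:

```r
library(elasticnet)

# Sparse PCA on a (samples x features) matrix: K sparse components, with an L1 penalty
# per component (illustrative values) that drives many loading entries to exactly zero.
k <- 10
sfit <- spca(X_train, K = k,
             para   = rep(1e-3, k),    # L1 penalties, one per component (assumed scale)
             type   = "predictor",
             sparse = "penalty")

sfit$loadings[1:5, 1:3]                # sparse loading matrix: many entries exactly zero
X_sparse <- X_train %*% sfit$loadings  # project the data onto the sparse components
```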

4.3 Independent Component Analysis

Independent Component Analysis (ICA) (Comon, 1994; Marchini et al., 2013) is a variant of PCA which relaxes the constraint that successive components must be orthogonal, and instead minimises the mutual information between them. This is equivalent to assuming that the data are a mixture of non-Gaussian signals with finite mean and variance, and these signals are reconstructed by maximising the non-Gaussianity of the components extracted. Here, a mixture means the signals are added together and a linear transformation applied. The intuition behind ICA is that, due to the Central Limit Theorem, a mixture of any finite-mean, finite-variance non-Gaussian signals will be closer to Gaussian distributed than each of the underlying signals, so we want to find the underlying signals that are minimally Gaussian.
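A minimal sketch with the fastICA package (the R implementation of Marchini et al. cited above); the number of components is illustrative:

```r
library(fastICA)

k <- 10
ica <- fastICA(X_train, n.comp = k)   # models X as a mixture S %*% A of non-Gaussian sources
X_ica <- ica$S                        # samples represented by k estimated independent components
```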

4.4 K-means Clustering

K-means clustering is a qualitatively different method of unsupervised feature learning, where the features learned are the centroids of clusters identified in the data. We use K-means to search for cluster centroids and use these as a new basis, projecting the data onto the set of centroid vectors. The intent is to exploit a different type of structure, clusters of features, in the microarray data. Since clustering has no criterion encouraging the angles between centroids to be large, numerical issues are not unexpected; centroid vectors which are nearly parallel will produce a poorly conditioned loadings matrix, as some columns of the loadings will be close to linear combinations of the others.
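One plausible reading of this construction, as a minimal R sketch: cluster the samples, take the k centroid vectors (which live in probe space) as a new basis, and project each sample onto them. Names and the number of clusters are hypothetical:

```r
set.seed(1)
k  <- 10
km <- kmeans(X_train, centers = k, nstart = 5)   # centroids live in the original feature space

centroids <- km$centers                          # k x p matrix of centroid vectors
X_kmeans  <- X_train %*% t(centroids)            # project each sample onto the centroid 'basis'
```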

4.5 Sparse Filtering

Sparse Filtering (SF) (Ngiam et al., 2011) is a modern unsupervised learning method optimising for sparsity, but differs from SPCA in that it is the transformed data, not the loadings, which are sparse. SF has three motivating concepts: population sparsity, lifetime sparsity and high dispersal. To satisfy population sparsity, each example should be sparse, meaning a single example is represented by only a small number of active (non-zero) features. Lifetime sparsity specifies a similar condition on features; a given feature should be active on only a small number of examples. For high dispersal, different features should have similar activity over all the samples; it should not be the case that some features are active very rarely but very strongly, with other features weakly active all the time.

SF works by first computing the feature matrix $F = XW^\top$, where $X$ is our matrix of data with samples in rows. An optimisation is performed to find a $W$ maximising population sparsity whilst high dispersal is enforced, which leads also to lifetime sparsity (Ngiam et al., 2011). Each column (feature) of $F$ is divided by its $L_2$ norm, causing each feature to be equally active across all samples. Each row (sample) is then divided by its $L_2$ norm, causing each sample to lie on the unit sphere. The $L_1$ norm of this matrix is then calculated, and it is this value which is minimised by modifying $W$. The gradients are analytical, allowing efficient minimisation with off-the-shelf optimisation tools. We use the MATLAB implementation by Ngiam et al. (https://github.com/jngiam/sparseFiltering).
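A minimal R sketch of this objective (the paper uses the MATLAB implementation linked above); the feature subset, feature count and use of a general-purpose optimiser in place of the analytical-gradient optimisation are all illustrative assumptions:

```r
# Sparse filtering objective: form X %*% W, normalise each column (feature) to unit L2 norm,
# then each row (sample) to unit L2 norm, and sum the absolute values of the result.
sparse_filtering_objective <- function(w, X, k) {
  W  <- matrix(w, nrow = ncol(X), ncol = k)
  Fm <- X %*% W
  Fm <- sweep(Fm, 2, sqrt(colSums(Fm^2)) + 1e-8, "/")  # high dispersal / lifetime sparsity
  Fm <- Fm / (sqrt(rowSums(Fm^2)) + 1e-8)              # each sample on the unit sphere
  sum(abs(Fm))                                         # L1 norm to be minimised
}

k    <- 10
Xs   <- X_train[, 1:500]                               # small illustrative subset of features
w0   <- rnorm(ncol(Xs) * k, sd = 0.01)
opt  <- optim(w0, sparse_filtering_objective, X = Xs, k = k,
              method = "L-BFGS-B", control = list(maxit = 100))
W    <- matrix(opt$par, nrow = ncol(Xs), ncol = k)
X_sf <- Xs %*% W                                       # learned sparse features
```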

4.6 Kernel Principal Component Analysis

All previously discussed methods of dimension reduction have been linear; that is, they may be performed by multiplying the matrix of data by another matrix. To venture into nonlinear methods of dimension reduction, we chose to perform Kernel Principal Component Analysis (KPCA) (Schölkopf et al., 1998; Karatzoglou et al., 2004; Barber, 2012), a generalisation of PCA using the kernel trick to efficiently perform PCA after a nonlinear mapping of the data into some higher-dimensional space. This results in nonlinear principal components in the original data space, allowing the data to be projected onto nonlinear manifolds instead of simple linear subspaces.
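A minimal sketch using the kernlab package of Karatzoglou et al. (cited above); the kernel choice and its parameter are illustrative:

```r
library(kernlab)

kp <- kpca(as.matrix(X_train),
           kernel   = "rbfdot",           # radial basis function kernel (illustrative)
           kpar     = list(sigma = 1e-4),
           features = 10)

X_kpca <- rotated(kp)                     # training data projected onto the nonlinear components
```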

4.7 Gaussian Process Latent Variable Model

For a second, modern nonlinear feature learning method, we apply a Gaussian Process Latent Variable Model (GP-LVM) to the data using the R implementation by Lawrence (http://staffwww.dcs.shef.ac.uk/people/N.Lawrence/vargplvm/). We keep the default parameters, which use the radial basis function kernel.


5 SURVIVAL PREDICTION ALGORITHMS

To thoroughly investigate the effect of feature reduction on prediction, we use four supervised survival prediction algorithms: a basic Cox model, a Cox model subject to elastic-net regularisation of the learned coefficients, a boosted aggregation of Cox models, and a Random Survival Forest.

We measure predictive performance by performing 5-fold cross-validation on the data and averaging the Concordance Index (CI) of the risk predictions of each fold.

The concordance index for survival data, as calculated using the R package of Schroeder et al. (2011), is defined as follows. Drawing a pair of random samples from the data, the CI is the probability that the sample with the higher predicted risk experiences their event first. Being a probability, this varies between 0 and 1; however, the effective range of the CI is the interval [0.5, 1], since a CI of 0.5 indicates performance equal to random guessing and a CI of 1 indicates perfect ability to predict the order in which individuals will fail. A CI below 0.5 shows that we are ordering individuals in a manner worse than random, meaning we are managing to extract information relevant to prediction but are then using it to order individuals incorrectly.
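For concreteness, a small self-contained sketch of Harrell's concordance index for right-censored data (the package of Schroeder et al. (2011) used in this work provides an equivalent, faster implementation); here `status` is 1 for an observed event and 0 for censoring:

```r
# A pair (i, j) is comparable when the earlier of the two times is an observed event;
# the pair is concordant when the individual who failed earlier has the higher predicted risk.
concordance_index <- function(time, status, risk) {
  concordant <- 0; comparable <- 0
  n <- length(time)
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      if (status[i] == 1 && time[i] < time[j]) {
        comparable <- comparable + 1
        if (risk[i] > risk[j])       concordant <- concordant + 1
        else if (risk[i] == risk[j]) concordant <- concordant + 0.5
      }
    }
  }
  concordant / comparable
}
```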

The patient survival time data are censored; often we only know the last time the patient was seen alive and we do not know their current status. In this case we are dealing with right-censored survival data, since we can put a lower bound on the survival time of a patient (the time at which the patient was last seen alive) but we do not know the current state of the patient, who may have become lost to follow-up (Kalbfleisch and Prentice, 2002).

The censored data still contain information relevant to survival prediction, but cannot be used to train traditional regression algorithms for prediction, so we instead use more specialised survival models which are able to learn from this censored data.

5.1 Cox Model

The Cox model, also known as the proportional hazards model, is a linear survival model (Kalbfleisch and Prentice, 2002), modelling the instantaneous rate at which some event with event time $T$ occurs at time $t$, given that the event has not yet occurred at time $t$. The function modelling this is known as the hazard function, taking the general form

$$\lambda(t \mid z) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t, z)}{\Delta t}$$

where the numerator is the probability that, given that the event has not occurred before time $t$, it occurs before time $t + \Delta t$ (Kalbfleisch and Prentice, 2002). This is a general definition applicable to all survival models, where $z$ is a vector of covariates applying to some individual. In the case of the Cox model, the hazard function takes the particular form

$$\lambda(t \mid z) = \lambda_0(t)\, e^{z^\top \beta}$$

where $\lambda_0(t)$ is the unspecified baseline hazard function. It is possible to estimate the effect parameters $\beta$ without considering $\lambda_0(t)$, given that the assumption of proportional hazards holds (Cox, 1972); that is, for two vectors of covariates $z$ and $z'$, the hazard ratio

$$\frac{\lambda(t \mid z)}{\lambda(t \mid z')} = \frac{\lambda_0(t)\, e^{z^\top \beta}}{\lambda_0(t)\, e^{z'^\top \beta}} = e^{z^\top \beta - z'^\top \beta}$$

has no $t$-dependence. We can test whether our data satisfy this proportional hazards assumption by checking that the effect parameters $\beta$ inferred from the data have no statistically significant dependence on time.

An important aspect of the Cox model is that $\beta$ may be inferred using the partial likelihood

$$L(\beta) = \prod_{i} \frac{e^{z_i^\top \beta}}{\sum_{j \in R_i} e^{z_j^\top \beta}}$$

where $R_i$ is the set of individuals who are at risk at the failure time of individual $i$, and $z_i$ is the vector of covariates for individual $i$. That is, $R_i$ is the set of all individuals whose failure times (or censoring times) are greater than or equal to that of individual $i$. This allows us to learn $\beta$ from right-censored data.
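A minimal sketch of fitting this model with the survival package; `train_df` is a hypothetical data frame holding the survival labels and the learned features:

```r
library(survival)

# `train_df` has columns `time`, `status` (1 = event, 0 = right-censored) and covariates.
cox_fit <- coxph(Surv(time, status) ~ ., data = train_df)

summary(cox_fit)                               # estimated beta and hazard ratios exp(beta)
risk <- predict(cox_fit, newdata = train_df,   # relative risk exp(z' beta) per patient
                type = "risk")
```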

5.2 Elastic net regularised Cox model

We use the R package glmnet to fit Cox models subject to regularisation of the parameters $\beta$ via the elastic net, a convex combination of lasso ($L_1$) and ridge ($L_2$) penalties. This means that, when maximising the partial likelihood of the Cox model, we also impose the constraint

$$\alpha \lVert \beta \rVert_1 + (1 - \alpha) \lVert \beta \rVert_2^2 \le c$$

where $\lVert \beta \rVert_1$ and $\lVert \beta \rVert_2$ are the $L_1$ and $L_2$ norms respectively. Taking the log of the partial likelihood and considering the Lagrangian allows us to reformulate the problem as

$$\hat{\beta} = \arg\max_{\beta} \left[ \frac{2}{n} \sum_{i=1}^{m} \left( z_i^\top \beta - \log \sum_{j \in R_i} e^{z_j^\top \beta} \right) - \lambda P_\alpha(\beta) \right]$$

where

$$P_\alpha(\beta) = \alpha \lVert \beta \rVert_1 + \frac{1 - \alpha}{2} \lVert \beta \rVert_2^2$$

is the elastic net penalty. Choosing $\alpha \in [0, 1]$ varies the penalty between the lasso penalty ($\alpha = 1$), which selects a small number of non-zero coefficients, and ridge regression ($\alpha = 0$), which shrinks all components towards zero. As $\alpha$ is increased, we get increased sparsity but also increased magnitude of the non-zero components (Simon et al., 2011). We choose to use $\alpha = 1$, corresponding to pure lasso regularisation, since it is the sparsity property we are interested in.
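A minimal sketch with glmnet; `x` is a hypothetical matrix of learned features and the response is the classic two-column (time, status) form expected for family = "cox":

```r
library(glmnet)

y <- cbind(time = train_df$time, status = train_df$status)  # required column names for "cox"

fit <- glmnet(x, y, family = "cox", alpha = 1)       # alpha = 1: pure lasso penalty
cv  <- cv.glmnet(x, y, family = "cox", alpha = 1)    # choose lambda by cross-validation

coef(cv, s = "lambda.min")                           # sparse coefficient vector beta
risk <- predict(fit, newx = x, s = cv$lambda.min,
                type = "link")                       # linear predictor z' beta per patient
```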

5.3 Boosted Cox Model

Boosting (Ridgeway, 1999) is a machine learning method for taking multiple predictors which may perform only slightly better than random guessing, known as weak learners, and combining them into a single, much more accurate predictor called a strong learner. The classical boosting algorithm is AdaBoost (Freund and Schapire, 1996), which, at a high level, iteratively trains a single weak learner on some set of labelled data, assigns weights to the data points according to how accurately the weak learner was able to predict the labels, and trains another weak learner placing greater emphasis on correctly classifying the highly weighted points. This process is continued until some set number of weak learners have been trained or some other stopping criterion is met, and the final strong learner is constructed by taking a linear combination of the predictions of the weak learners.

AdaBoost is a classification algorithm, whereas here we perform a regression, and importantly our response variable is a survival time whose training labels are subject to right-censoring. To tackle this we use the gbm R package (Ridgeway with contributions from others, 2013) to construct a Generalised Boosted Regression model (GBM) in which the weak learners are Cox models.
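A minimal sketch with the gbm package (assuming, as above, that this is the boosting implementation meant); the hyperparameters are illustrative:

```r
library(gbm)
library(survival)

gbm_fit <- gbm(Surv(time, status) ~ .,
               data              = train_df,
               distribution      = "coxph",   # weak learners combined under the Cox likelihood
               n.trees           = 1000,
               interaction.depth = 2,
               shrinkage         = 0.01)

risk <- predict(gbm_fit, newdata = train_df, n.trees = 1000)  # log relative risk per patient
```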

5.4 Random Survival Forest

Random Forests (Breiman, 2001) are a model aggregation technique where, instead of using a single global model, we instead recursively partition the input space into cells until the behaviour of the data in each cell can be described by a simple local model applying only to the cell. For classification, this simple model is often a class label, and for regression a real valued constant. Nonlinear behaviour can be achieved even when the local models are linear.

Constructing a Random Forest, we first bootstrap our dataset and use this data to grow a tree. Starting from the root node representing all of input space, we select a random subset of all the variables, then partition the space on one or more of these variables. We perform this recursively to grow a binary tree until we have a tree of sufficient size, likely constraining that each cell contain some minimum amount of data. We then fit a simple model to the data in each cell separately, and continue to bootstrap our original dataset again to grow more trees. When performing predictions, we take the mean prediction of every tree.

Random Survival Forests (RSFs) (Ishwaran et al., 2008) are Random Forests applied to survival data. When partitioning, we select a single variable and value on which to partition such that the difference in survival between daughter cells is maximised, pushing dissimilar cases apart. We recurse on this operation until no more cells may be partitioned due to our minimum data requirement, or a maximum tree size is reached. The model fit to each cell is the Nelson-Aalen estimator of the cumulative hazard rate, an alternative representation of the survival hazard function.

The Nelson-Aalen estimator takes the form

$$\hat{H}(t) = \sum_{t_i < t} \frac{d_i}{\lvert R_i \rvert}$$

where each $t_i$ is a time at which an individual is removed, either through censoring or failure, $d_i$ is the number of failure events occurring at time $t_i$, and $\lvert R_i \rvert$ is the number of individuals still at risk at time $t_i$. This model does not rely on the proportional hazards assumption. To implement RSFs we use the randomSurvivalForest R package (Ishwaran and Kogalur, 2013, 2007).
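The paper uses the randomSurvivalForest package; the minimal sketch below substitutes its successor, randomForestSRC (a substitution, not the package used in the paper), with hypothetical data frame names:

```r
library(randomForestSRC)
library(survival)

rsf_fit <- rfsrc(Surv(time, status) ~ ., data = train_df, ntree = 1000)

pred <- predict(rsf_fit, newdata = test_df)
head(pred$predicted)        # ensemble mortality: higher values indicate higher predicted risk
```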


6 RESULTS

Before running any algorithms on our data, we first split the 1981 samples into a training set and test set of 1320 and 661 samples respectively. We do not touch the test set until all the results in Section 6.1 and Section 6.2 have been computed.

6.1 Unsupervised learning results

Investigating the eigenvalues of the covariance matrix (Figure 1) may tell us about the structure of the data. We can see that nearly all of the variance of the data is contained within the first 100 or so principal components, and beyond that the remaining eigenvalues are very similar in value. A possible reason for this is that the remaining directions of variance are due to noise in the microarray measurements, or to the stochasticity inherent in mRNA production in the cell, and thus the data naturally occupy a lower-dimensional linear subspace.

We investigate the eigenvalues of the covariance matrix when performing PCA on the raw data, the gspace and mspace metagenes, and the MammaPrint™ data. From the eigenvalue distribution plots in Figure 1, we can see clear knees in the METABRIC, gspace and MammaPrint™ data, indicating optimum dimensionalities of around 20, 10 and 10 respectively. The transition between components containing signal and noise is less clear for the mspace data, which is likely an indicator that the mspace representation is already highly compressed and truncation of any components will result in a loss of useful information. The other possibility which would give a similar eigenvalue distribution is that the magnitude of the noise is on the same scale as the signal, but this is unlikely given that the metagenes have been extracted from the raw data, which show a good separation in scale between signal and noise. The largest eigenvalue for the mspace metagenes is only a factor of 44.9 larger than the smallest eigenvalue, whilst for MammaPrint™ the ratio is 1013.7, and for the raw data and gspace metagenes it is infinite since they have zero eigenvalues.

Performing PCA on the METABRIC dataset produces a large number of coefficients close to zero. Since we are in the regime of


Fig. 1: Eigenvalue distributions of the covariance matrices of the METABRIC, mspace, gspace and MammaPrint™ datasets, calculated during PCA. Looking at the distribution for the raw METABRIC data, we can see that the data vary only a very small amount in most of the possible directions, possibly indicating directions containing only noise and no useful information. This gives us confidence that we can improve the quality of the information by extracting the relevant signal from the background noise. Note that the mspace plot y-axis does not begin at zero; the magnitudes of the largest and smallest mspace eigenvalues are not drastically different, which indicates a possibly efficient compressed representation of the data, since the data fully span the space.

having far more features than samples, a large number of near-zero coefficients, which would become zero in the limit of infinite data, may accumulate to contribute significant error, and for this reason we chose to perform Sparse PCA (SPCA) (Zou et al., 2004; Zou and Hastie, 2012). A histogram of the coefficients of the loadings (the matrix by which we multiply the data to arrive at the lower-dimensional approximation) for both PCA and SPCA is given in Figure 2, where it can be seen that moving from PCA to SPCA sets a larger number of the coefficients effectively to zero.

To choose an appropriate kernel for KPCA, model selection was performed by iterating through each kernel and determining performance on a grid of hyperparameters through cross-validation with the Cox model. The kernel and hyperparameter set chosen were those giving the best performance. This resulted in selection of the linear kernel, $K(\mathbf{x}, \mathbf{y}) = \mathbf{x}^\top \mathbf{y}$, which causes KPCA to become equivalent to simple PCA.


Fig. 2: Distribution of the coefficients in the loadings produced by PCA and SPCA on the MammaPrint™ dataset. It can be seen that SPCA produces a large number of loadings which are extremely close to zero, indicating the loadings are sparse, whilst the PCA coefficients are approximately Gaussian distributed.

Although not an exciting result, this gives evidence that there is no significant nonlinear structure in the data of a complexity low enough for the quantity of data available to support. Since KPCA and PCA are now equivalent, KPCA is omitted from further study.

Looking at the condition number of the matrix of gspace metagene data before and after feature extraction yields interesting results. We see in Figure 3 that the condition number of the data jumps by many orders of magnitude when increasing the dimensionality from 12 to 13, for all of the feature learning algorithms used. Since the condition number measures how close to singular a matrix is, a large condition number tells us that we have a large separation of scales in the data, which we may interpret as the data approximately spanning only a linear subspace of the full space. This is to be expected, as the gspace metagenes have been obtained by expanding each mspace metagene into its component genes, thus the gspace metagenes are still intrinsically 12 dimensional.

The condition number is a measure of how close to singular a matrix is; the identity matrix has condition number 1 and a rank-deficient matrix has condition number infinity. A large condition number on a matrix of data indicates that the data do not fully span the space in which they are being represented, so it contains information on the structure of the data and hints at the optimum dimensionality to select in feature learning. For certain algorithms such as PCA, the addition of more dimensions past the intrinsic dimensionality of the data only adds 'junk' dimensions, which may negatively affect the performance of prediction algorithms.
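A small sketch of this diagnostic; `Z_reduced` stands for any hypothetical reduced feature matrix, and base R's kappa() estimates the same quantity:

```r
# Ratio of the largest to the smallest singular value of the data matrix:
# values near 1 mean the data fully span the space; very large values mean near-singularity.
condition_number <- function(Z) {
  d <- svd(Z)$d
  max(d) / min(d)
}

condition_number(Z_reduced)   # e.g. a reduced gspace feature matrix (hypothetical)
kappa(Z_reduced)              # base R estimate of the same quantity
```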


Fig. 3: Condition number of the gspace metagene data as it is reduced in dimensionality by the five algorithms given in the legend (PCA, ICA, SPCA, K-means and Sparse Filtering), along with the condition number of the unreduced matrix of data. There is a large jump moving from 12 to 13 components, which is due to the data being intrinsically 12 dimensional: when represented in more than 12 dimensions the data no longer span the space, producing an ill-conditioned matrix.

6.2 Supervised learning results

6.2.1 Predictive accuracies

Having used the MammaPrint™ and metagene gene lists to extract reduced datasets from the METABRIC data, we use unsupervised learning techniques to further reduce dimensionality and then use supervised learning with 5-fold cross-validation to measure predictive accuracy, which we use as a measure of the quality of the information extracted from the METABRIC dataset. When reducing dimensionality, we start at two dimensions due to difficulties with the predictive algorithms used accepting one-dimensional data. The supervised algorithms used are those described in Section 5: the Cox model, the elastic-net regularised Cox model, the boosted Cox model and the Random Survival Forest. These are labelled in the plots below as 'cox', 'glmnet', 'gbm' and 'rsf' respectively.
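A minimal sketch of one cell of this experiment (PCA features followed by a Cox model, scored with the concordance sketch from Section 5); the fold assignment and component count are illustrative, and `concordance_index` is the hand-rolled sketch defined earlier:

```r
library(survival)

cv_concordance <- function(X, time, status, k = 10, folds = 5) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(X)))
  sapply(seq_len(folds), function(f) {
    tr  <- fold_id != f
    pca <- prcomp(X[tr, ], center = TRUE)
    Ztr <- as.data.frame(pca$x[, 1:k])
    Zte <- as.data.frame(scale(X[!tr, ], center = pca$center, scale = FALSE) %*%
                           pca$rotation[, 1:k])
    fit  <- coxph(Surv(time[tr], status[tr]) ~ ., data = Ztr)
    risk <- predict(fit, newdata = Zte, type = "risk")
    concordance_index(time[!tr], status[!tr], risk)   # sketch defined in Section 5
  })
}

mean(cv_concordance(X_train, y_train$time, y_train$status, k = 20))
```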

The raw METABRIC data are sufficiently high dimensional to make the use of many algorithms infeasible. We perform only PCA on the raw data, varying the number of principal components retained and plotting the concordance indices achieved with each prediction algorithm in Figure 4. It can be seen that the best performance is achieved by all algorithms in the range of 14 to 25 principal components, which is consistent with the location of the 'knee' in the eigenvalue distribution of the METABRIC data in Figure 1.

By cross-validating each prediction algorithm after reducing each dataset with all feature learning algorithms (ignoring KPCA since


Fig. 4: Concordance indices achieved on the raw METABRIC data after reduction to the first few principal components by PCA. The optimum number of components to use appears to be in the range of 14–25.

selection of the linear kernel reduced it to regular PCA), we produce

Figure 5 . There are a number of aspects to look out for in this plot.

We can see that, although glmnet often fails, the four prediction algorithms show qualitatively similar behaviour on the same data, with peaks and plateaus around the same number of components.

When glmnet performs well it shows a similar behaviour to the regular, un-regularized cox model. Such plots are useful in determining the optimum number of dimensions to use in a machine learning task.

The Cox model often seems to be the best performing algorithm, though it is often matched or slightly overtaken by glmnet, and it displays occasional instabilities in the form of downward spikes in concordance. The model-aggregating algorithms rsf and gbm show improved stability at the cost of reduced performance. The instabilities in the Cox model are likely rare enough, and its performance lead great enough, to justify the Cox model as the best algorithm to use in this case.

We can see that the mspace metagenes show the best performance with a small number of components. It is worth noting that the x-axes have different scales in each case, which can be gauged graphically by comparing grid square widths. Where the MammaPrint™ data show a performance peak half-way across the plot, at around 20 components, for both ICA and GP-LVM, this is many more components than the peaks in the gspace and mspace data seen for many of the algorithm combinations. Although we have plenty of data to justify 20 features, we would still like to reduce dimensionality as much as possible, for two reasons: firstly, we are more likely to generalise well to new data in lower dimension, and secondly, for interpretability we would like to study a small number of components.

The performance drop seen with some algorithms on the gspace metagenes when using more than 12 components is to be expected. Since the data are intrinsically 12-dimensional, having been extracted from their corresponding mspace metagenes via a linear transformation, the matrix of data becomes ill-conditioned when reduced to a dimensionality greater than 12 for every algorithm used (Figure 3). SPCA shows the nice property that performance does not continue to degrade as the data become increasingly ill-conditioned.

As a visualisation of the concordance data, we plot concordances for the Cox model in Figure 6, along with the predictive accuracy achieved when training a Cox model on the unreduced dataset. In every case, predictive accuracy may be improved through dimension reduction to some optimum dimensionality. This indicates that the restricted gene datasets still contain noisy and irrelevant structure which we are able to filter out, 'purifying' the relevant information in the dataset.

The best performance is achieved on the mspace metagene data, closely followed by the gspace data, with ICA and SPCA attaining the greatest predictive accuracy on these datasets (note that, in Figure 6, these overlap entirely except in gspace with more than 12 components). While the top mspace and gspace performance is comparable, the mspace data perform better with a smaller number of components.

It is not surprising that feature extraction yields better results on the MammaPrint™ data, since this is simply a set of genes, whilst the metagenes have already been feature-selected from microarray data. It is still interesting to note that this difference in predictive improvement shows the feature learning algorithms struggling to find a more compressed representation of the metagene structure, but easily compressing the MammaPrint™ structure into a lower dimension, indicating that the metagenes are already an efficient compressed representation.

6.2.2 Proportional Hazards Assumption

Here we test whether the data meet the proportional hazards assumption of the Cox model. Using the R survival package (Therneau, 2014), we can use cox.zph to obtain p-values for whether the variables satisfy proportional hazards. It is computationally infeasible to train a Cox model on the raw dataset, so we use PCA to reduce the raw data to 100 dimensions, fit a Cox model to this and test the validity of the proportional hazards assumption. Doing this we get a p-value of $3.42 \times 10^{-7}$ for the hypothesis that the set of all variables satisfies proportional hazards, giving strong evidence that this assumption is not met. The smallest individual p-value is $2.40 \times 10^{-5}$, corresponding to the second principal component $z_2$; this is the covariate which most strongly violates proportional hazards.
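A minimal sketch of this test; `pc_df` is a hypothetical data frame of the first 100 principal components alongside the survival labels:

```r
library(survival)

ph_fit <- coxph(Surv(time, status) ~ ., data = pc_df)

zph <- cox.zph(ph_fit)
zph             # per-covariate and global p-values for time dependence of beta
plot(zph[2])    # scaled Schoenfeld residuals against time for the second covariate
```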

The Schoenfeld residuals for $z_2$ are defined as $z_2$ minus the expected value of $z_2$ given the failure time for each data point. Schoenfeld residuals are time-independent in principle, so we should see no time dependence when plotting them against time (Grambsch and Therneau, 1994), given that proportional hazards is satisfied. As shown in Figure 7a, there is statistically significant time dependence, since a horizontal line does not lie within the confidence intervals. For comparison, the principal component with the greatest p-value, $z_{58}$, is shown in Figure 7b, where no significant time dependence can be seen.


Fig. 5: Predictive accuracies achieved in cross-validation on the training set of data, for the MammaPrint™, mspace and gspace datasets. We look at all combinations of restricted-gene dataset, survival model and feature learning algorithm whilst varying the dimensionality of the learned feature set. We can see a number of peaks where the data are represented near their optimum dimensionality, and the peaks appear to be around the same location for a given dataset regardless of the feature learning or predictive algorithm used. Such a plot can be useful in deciding the target dimensionality to use in feature learning.


Fig. 6: Concordance indices achieved by the Cox model on (a) the MammaPrint™ gene list, (b) the mspace metagenes and (c) the gspace metagenes, after reduction in dimension by each of the feature learning algorithms used here (PCA, SPCA, ICA, K-means, Sparse Filtering and GP-LVM). As a benchmark, the horizontal dotted lines show the Cox performance on the raw datasets. We can see each dataset has an optimum dimensionality and that we improve performance over the raw data in each case.

Although the time dependence seen in Figure 7a is statistically significant, the actual deviation of the trend in the Schoenfeld residuals from zero is small in comparison to the variance of the residuals, so the data do not drastically violate the proportional hazards assumption. This may explain why the Cox models, both regularised (Friedman et al., 2010) and unregularised (Therneau, 2014), are the algorithms attaining the greatest predictive accuracy.

6.3 Test Set Performance

We now evaluate algorithmic performance on the test set of 661 patients so far held out from the study. The entire training set of 1320 patients is used to train the unsupervised learning methods, and the learned mapping is then applied to the test set. Using the prediction algorithms trained on the training data, survival times for the test set are predicted, producing a single concordance index for each combination of learning algorithms. The ordered concordances achieved on the test set are presented in Table 2.

We selected the number of features to use in each case using Figure 5 to identify peaks in predictive performance, which are indicative of the region in which the optimum dimensionality lies. These performance peaks appear to lie around the same point for each dataset, where the predictive or feature learning algorithm used does not seem to affect the optimum dimensionality. For this reason we chose to use dimensionalities of 18, 10 and 7 for the MammaPrint™ gene list and the mspace and gspace metagenes respectively.

As a benchmark we included results from working with the raw METABRIC data. Due to the size of the dataset leading to high time and memory requirements, we were only able to perform PCA, K-means and Sparse Filtering on these data. For the dimensionality we chose 22, corresponding to the near end of the knee in Figure 1 and the peak in prediction in Figure 4, as well as 10 to enable comparison with the gspace and mspace metagenes.

Since we have been working with the training data for some time, making decisions, such as the number of dimensions to use, based on the performances achieved, it is to be expected that we have 'over-fit' the training data somewhat, and thus the concordances achieved on the training set are likely to be optimistic with regard to performance on unseen data, such as the test set.

All concordance indices achieved on the test data are given in Table 2. It appears that the best predictions can be made when reducing the dimensionality of the raw data instead of going through one of the restricted gene lists. This was unexpected given the results in cross-validation, and emphasises how careful we must be when drawing conclusions from a training set of data. Note that in high dimension it is still possible to over-fit to the specifics of the study and the preprocessing methods used to generate the data, giving good apparent performance on the test data yet poor generalisation to new data.

Looking now at the raw data reduced to 10 components, we move drastically down the table, under-performing the metagenes and performing comparably with the MammaPrint™ gene list. This tells us two things about the metagenes: firstly, that they are an extremely efficient representation of the information we are concerned with, fitting more information into 12 dimensions than we have been able to achieve here; secondly, that there was still a lot of useful information in the raw data that has been lost in transforming to the metagenes.

Our best result comes from K-means clustering, which may indicate a clustered structure to the data. This is consistent with the effectiveness of the metagenes, having come from a clustering algorithm very similar to K-means clustering.

On the metagenes we achieve the best prediction after reducing from 12 to 10 dimensions using SPCA. We should remain sceptical of this result since we have no error bars on performance and the following six best algorithm combinations involve metagenes on which we perform no feature learning at all. Looking at Table 2, we see that we can make comparable or improved predictions after reducing the dimension from 118 to just 18, indicating a large amount of redundancy in the dataset.


Table 2. All concordance indices achieved on the test set after training the feature learning and prediction algorithms on the training set and applying the same transformation/prediction to the test set. We vary the restricted gene list (MammaPrint™, mspace and gspace metagenes), feature extraction algorithm, number of components used and prediction algorithm.

Ranking | Concordance | Dataset | Feature extraction | Prediction | Components

1 0.6691

raw kmeans cox

2 0.6663

mspace spca

3 0.6657

gspace none gbm cox

22

10

7

4 0.6657

gspace none

5 0.6657

mspace none

6 0.6657

mspace none

7 0.6653

gspace none rsf cox rsf glm

8 0.6652

mspace none glm 10

9 0.6640

mspace kmeans cox 10

10 0.6640

mspace kmeans rsf 10

7

10

10

7

11 0.6637

mspace ica

12 0.6636

mspace ica

13 0.6636

mspace ica

14 0.6635

mspace spca

15 0.6635

mspace spca

16 0.6632

mspace spca

17 0.6623

mspace sf

18 0.6623

mspace sf glm 10 cox rsf cox rsf cox rsf

10

10

10

10 glm 10

10

10

19 0.6618

mspace pca

20 0.6618

mspace pca cox rsf

10

10

21 0.6608

mspace kmeans glm 10

22 0.6606

raw pca

23 0.6600

mspace sf glm 22 glm 10

24 0.6592

mspace pca glm 10

25 0.6588

gspace kmeans glm 7

26 0.6584

raw

27 0.6584

raw pca pca

28 0.6581

mspace pca

29 0.6580

gspace none

30 0.6580

mspace none cox rsf gbm

22

22 gbm 10 gbm 7

10

31 0.6564

raw kmeans rsf

32 0.6534

gspace ica cox

33 0.6534

gspace ica

34 0.6534

gspace spca rsf cox

35 0.6534

gspace spca

36 0.6530

raw kmeans rsf glm

7

22

22

7

7


37 0.6519

gspace

38 0.6516

gspace

39 0.6516

gspace

40 0.6509

gspace

41 0.6492

gspace

42 0.6489

raw

43 0.6488

raw

44 0.6488

raw

45 0.6476

raw

46 0.6471

raw ica pca pca spca spca pca sf glm cox rsf glm gbm

7

7

7

7

7 gbm 22 cox 10 sf sf rsf kmeans rsf

10 glm 10

10

47 0.6458

gspace pca

48 0.6454

mamma pca

49 0.6450

mamma none

50 0.6438

mamma pca glm 7 glm 18 gbm 18 cox 18

51 0.6438

mamma pca

52 0.6437

raw kmeans rsf glm

18

10

53 0.6425

gspace pca

54 0.6424

mamma none gbm glm

7

18

55 0.6401

mamma pca

56 0.6400

raw kmeans gbm gbm

18

22

57 0.6391

mamma ica cox 18

58 0.6391

mamma ica

59 0.6391

mamma spca rsf 18 cox 18

60 0.6391

mamma spca

61 0.6386

gspace kmeans rsf cox

18

7

62 0.6386

gspace kmeans rsf

63 0.6382

mamma spca glm

7

18

64 0.6380

raw kmeans cox 10

65 0.6363

mspace sf gbm 10

66 0.6359

mamma ica glm 18

67 0.6343

mamma sf

68 0.6343

mamma sf

69 0.6334

gspace

70 0.6334

gspace sf sf

71 0.6328

raw

72 0.6325

gspace pca sf cox 18 rsf 18 cox rsf

7

7 glm 10 glm 7

73 0.6320

mamma sf

74 0.6306

raw

75 0.6305

raw pca pca glm 18 gbm 10 cox 10

76 0.6305

raw

77 0.6297

raw pca sf rsf 10 glm 22

78 0.6290

raw kmeans gbm 10

79 0.6282

mamma kmeans cox 18

80 0.6282

mamma kmeans glm 18

81 0.6282

mamma kmeans rsf 18

82 0.6274

raw sf gbm 10

83 0.6255

mamma spca

84 0.6245

mamma sf

85 0.6223

raw

86 0.6223

raw sf sf gbm 18 gbm 18 cox 22 rsf 22

87 0.6223

raw sf

88 0.6140

mamma none gbm 22 cox 18

89 0.6140

mamma none rsf 18

90 0.6135

mamma kmeans gbm 18

91 0.6081

gspace

92 0.6066

gspace sf kmeans gbm gbm

93 0.6060

mspace kmeans gbm 10

7

7

94 0.5409

mamma ica

95 0.5186

gspace gplvm

96 0.5124

mamma gplvm

97 0.5124

mamma gplvm

98 0.5123

mamma gplvm

99 0.5044

gspace gplvm

100 0.5030

gspace

101 0.5030

gspace gplvm gplvm

102 0.5003

mspace gplvm

103 0.5003

mspace gplvm

104 0.4998

mspace gplvm

105 0.4996

mamma gplvm

106 0.4946

mspace gplvm

107 0.4815

gspace

108 NA mspace ica ica gbm 18 gbm 7 cox 18 rsf 18 glm 18 glm 7 cox rsf cox 10

7

7 rsf 10 glm 10 gbm 18 gbm 10 gbm 7 gbm 10

The ordered concordance indices are plotted in Figure 8, where the points have been shaped and coloured according to the feature learning algorithm and gene list used. The first thing we see is that prediction fails in a number of cases; namely, when using GP-LVM (purple shapes) or, cross-referencing with Table 2, when mixing ICA with gbm, though this may simply be a coincidence of gbm failing each time it is paired with ICA.

The performance of the MammaPrint™ gene list is much below that of the metagenes, and feature learning only damages the performance further. This indicates that there is less, or lower quality, information in the MammaPrint™ gene list than in the metagenes, at least with regard to predicting survival time, despite the mspace metagenes being of much lower dimension.

7 DISCUSSION AND FURTHER WORK

Since our feature learning algorithms are unsupervised, they are unaware of patient survival times, and any information they extract comes from patterns in the underlying data. It therefore seems reasonable to suggest that our unsupervised learning results generalise from breast cancer expression data to gene expression data more broadly, although it should be kept in mind that we only have data from female patients and from breast carcinomas, both of which will influence gene expression. Noting this, it could be illuminating to discover whether the apparent lack of nonlinear structure is a common feature of expression data, an artefact of the data considered here, or simply a consequence of the nonlinearity being obscured by a high degree of measurement noise.

An interesting avenue for further study would be to investigate how the learned features relate to gene expression, and to carry out a bioinformatics study of the genes found to be important to survival prediction. In this regard, it is convenient that the top-performing algorithm combination includes SPCA for feature learning (Table 2), as the sparsity of its loadings aids interpretability.
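As a concrete starting point for such a study, the sketch below shows how sparse loadings could be inspected using the elasticnet package cited in the references; the matrix name expr (patients by probes) and the choice of five components with 25 non-zero loadings each are illustrative assumptions, not the settings used in our pipeline.

library(elasticnet)

# Fit sparse PCA, asking for 5 components with 25 non-zero loadings each
spca_fit <- spca(expr, K = 5, para = rep(25, 5),
                 type = "predictor", sparse = "varnum")

# For each sparse component, list the probes with non-zero loadings;
# these are the candidates for a follow-up bioinformatics study
nonzero <- spca_fit$loadings != 0
probe_sets <- lapply(seq_len(ncol(nonzero)),
                     function(k) colnames(expr)[nonzero[, k]])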


Fig. 7: Illustration of a violation (a) and observance (b) of the proportional hazards assumption, shown as scaled Schoenfeld residuals plotted against time (panel (a): P = 2.4e−05, assumption not satisfied; panel (b): P = 0.997, assumption satisfied). To show that the proportional hazards assumption holds, we want no statistically significant time dependence in the Schoenfeld residuals, meaning we would be able to fit a horizontal line within the confidence intervals (dotted lines).
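A diagnostic of the kind shown in Fig. 7 can be produced with the weighted-residual test of Grambsch and Therneau (1994), as implemented in the survival package (Therneau, 2014). The sketch below assumes a data frame feature_df of learned features and separate vectors surv_time and surv_event; these names are illustrative.

library(survival)

# Fit a Cox model on the learned features
fit <- coxph(Surv(surv_time, surv_event) ~ ., data = feature_df)

# Score test for time dependence of each coefficient;
# small p-values indicate a violation, as in Fig. 7a
ph <- cox.zph(fit)
print(ph)

# Scaled Schoenfeld residuals against time, one panel per covariate
plot(ph)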

Fig. 8: Ordered concordance indexes achieved on the training set of data, where the algorithm (pca, ica, spca, kmeans, gplvm, none) and gene list (mamma, gspace, mspace, raw) used are indicated by the shape and colour of the points respectively.


Improvements can be made over the method of dimension selection used here, namely visually locating peaks in predictive performance when cross-validating on a training set. In particular, the Akaike Information Criterion (AIC) (Akaike, 1974) would provide a quantitative measure with which to select dimensionality.
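A minimal sketch of such an AIC-based selection is given below, assuming a matrix pcs of principal component scores, survival vectors surv_time and surv_event, and an illustrative candidate range of 2 to 30 components; the AIC here is computed from the Cox partial likelihood.

library(survival)

dims <- 2:30  # candidate dimensionalities (illustrative range)

aic_by_dim <- sapply(dims, function(d) {
  fit <- coxph(Surv(surv_time, surv_event) ~ .,
               data = as.data.frame(pcs[, 1:d]))
  AIC(fit)  # based on the Cox partial likelihood
})

best_d <- dims[which.min(aic_by_dim)]  # dimensionality minimising AIC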

Application of supervised feature learning would be an informative next stage. This would allow us to extract features that we know to be relevant to the predictive task, as opposed to learning features describing potentially irrelevant structure in the data distribution.

Continuing along the route of improving survival prediction, it may be worthwhile to investigate survival models other than the Cox model, since we know that the proportional hazards assumption is not met. There exist extensions to the Cox model allowing time-dependent covariates (Kalbfleisch and Prentice, 2002), which would likely be appropriate. Applying state-of-the-art methods of feature learning (Lopez-Paz et al., 2014) may also be interesting.
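As a hedged illustration of the kind of extension meant here, the survival package allows a covariate effect to vary with time through its tt() mechanism; the covariate name feature1 and the log(t) interaction below are illustrative choices rather than part of our pipeline.

library(survival)

# Cox model in which the effect of feature1 is allowed to drift with time
fit_tt <- coxph(Surv(surv_time, surv_event) ~ feature1 + tt(feature1),
                data = feature_df,
                tt = function(x, t, ...) x * log(t))

summary(fit_tt)  # the tt(feature1) term estimates the time-varying part of the effect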

We may be able to improve the learning of features through the introduction of prior knowledge. We know that the underlying model from which the data are generated is a Genetic Regulatory Network (GRN) with a high degree of structure, being composed of many hierarchically organised network motifs (Alon, 2006). Thus a network representation of the data is likely to be a highly appropriate model to use, and network approximations exploiting the observed structure should efficiently reduce dimensionality with minimal information loss.

8 CONCLUSION

We find we are able to take gene expression data from a breast tumour and predict patient survival time with a concordance index of at least 0.669. That is, given two random patients for whom we predict survival times, the probability that the patient with the higher predicted risk (and thus the lower predicted survival time) experiences the shorter survival time is 0.669.
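For reference, a concordance index of this kind can be computed from predicted risks with the survcomp package (Schroeder et al., 2011); the vector names risk, surv_time and surv_event below are illustrative stand-ins for the test-set predictions and outcomes.

library(survcomp)

# Concordance between predicted risk and observed (possibly censored) survival
ci <- concordance.index(x = risk,
                        surv.time  = surv_time,
                        surv.event = surv_event,
                        method = "noether")

ci$c.index  # probability that the higher-risk patient of a random pair dies first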

We have good evidence to suggest that the breast cancer gene expression microarray data contain little nonlinear structure, finding linear dimension reduction algorithms to perform best at maximising the ability of supervised learning algorithms to predict survival time. In addition, model selection for KPCA results in the linear kernel being chosen, reducing it to regular PCA. We may, however, be observing highly clustered structure, since we get very good results not only with K-means clustering but also with a gene list obtained using a clustering-like algorithm.

The proportional hazards assumption made by the Cox model, a popular survival model which we use here, is violated by the data. Although the violation is statistically significant, it is small, which is consistent with our observation that the Cox model performs well in prediction.

Looking at the MammaPrint™ and metagene restricted gene lists, we find that the metagenes outperform the MammaPrint™ gene list in predicting survival times. This is interesting, as the MammaPrint™ gene list is currently used in clinical trials (Cardoso et al., 2007) and, to the knowledge of the authors, the metagenes are not. It is worth noting that these gene lists have been constructed in different ways and for different purposes: the MammaPrint™ gene list has been hand-mined from the literature to assess the risk that a breast tumour will metastasise to other parts of the body, whilst the metagenes have been obtained purely from unsupervised mining of cancer expression data, with no predictive goal taken into account. An investigation of the effectiveness of the metagenes in predicting the risk of metastasis could prove valuable.

We are able to outperform both the MammaPrint™ and metagene gene lists by simply projecting the raw data onto the first 22 principal components or K-means cluster centroids. Whilst this is still of much higher dimension than the restricted gene lists, the greater predictive accuracy on the test set indicates that the PCA- and K-means-reduced raw data contain more, or higher quality, information than the restricted gene sets, at least of a structure that our learning algorithms are able to exploit. We also find that the metagenes form a highly efficient compressed representation of the raw data, performing well in prediction (Table 2), being difficult to improve on through feature extraction (Figure 6), and having a relatively flat distribution of covariance matrix eigenvalues (Figure 1).
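A minimal sketch of these two simple reductions is given below, assuming an expression matrix expr (patients by probes); the 22 components match the dimensionality quoted above, and distances to the centroids are used as one plausible encoding of the K-means reduction, which is an assumption for illustration rather than a description of our exact pipeline.

# PCA reduction: scores of each patient on the first 22 principal components
pca_fit   <- prcomp(expr, center = TRUE, scale. = TRUE)
pca_feats <- pca_fit$x[, 1:22]

# K-means reduction: represent each patient by its distances to 22 centroids
set.seed(1)
km_fit <- kmeans(expr, centers = 22, nstart = 10)

centroid_dist <- function(X, centers) {
  # Euclidean distance from each row of X to each centroid (rows of centers)
  t(apply(X, 1, function(x) sqrt(colSums((t(centers) - x)^2))))
}
km_feats <- centroid_dist(expr, km_fit$centers)  # patients x 22 matrix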

REFERENCES

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6), 716–723.

Alon, U. (2006). An Introduction to Systems Biology: Design Principles of Biological Circuits (Chapman & Hall/CRC Mathematical & Computational Biology). Chapman and Hall/CRC, 1st edition.

Barber, D. (2012). Bayesian Reasoning and Machine Learning. Cambridge University Press.

Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

Cardoso, F., Piccart-Gebhart, M., van ’t Veer, L., and Rutgers, E. (2007). The MINDACT trial: the first prospective clinical validation of a genomic tool. Mol Oncol, 1(3), 246–251.

Cheng, W.-Y. (2013). cafr: Attractor Metagenes Finding Algorithm. R package version 0.312.

Cheng, W.-Y., Yang, T.-H. O., and Anastassiou, D. (2013). Biomolecular events in cancer revealed by attractor metagenes. PLoS Comput Biol, 9(2), e1002920.

Comon, P. (1994). Independent component analysis, a new concept? Signal Process, 36(3), 287–314.

Cox, D. R. (1972). Regression models and life-tables. Journal of the Royal Statistical Society. Series B (Methodological), 34(2), 187–220.

Curtis, C., Shah, S. P., Chin, S.-F., Turashvili, G., Rueda, O. M., Dunning, M. J., Speed, D., Lynch, A. G., Samarajiwa, S., Yuan, Y., et al. (2012). The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups. Nature, 486(7403), 346–352.

Dunning, M., Lynch, A., and Eldridge, M. (2013). illuminaHumanv3.db: Illumina HumanHT12v3 annotation data (chip illuminaHumanv3). R package version 1.18.0.

Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm.

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1), 1–22.

Geigl, J. B., Obenauf, A. C., Schwarzbraun, T., and Speicher, M. R. (2008). Defining chromosomal instability. Trends in Genetics, 24(2), 64–69.

Grambsch, P. M. and Therneau, T. M. (1994). Proportional hazards tests and diagnostics based on weighted residuals. Biometrika, 81(3), 515–526.

Illumina (2014). HumanHT-12 v3 Expression BeadChip datasheet. http://res.illumina.com/documents/products/datasheets/datasheet_humanht_12.pdf

Ishwaran, H. and Kogalur, U. (2007). Random survival forests for R. R News, 7(2), 25–31.

Ishwaran, H. and Kogalur, U. (2013). Random Survival Forests. R package version 3.6.4.

Ishwaran, H., Kogalur, U., Blackstone, E., and Lauer, M. (2008). Random survival forests. Ann. Appl. Statist., 2(3), 841–860.

Kalbfleisch, J. D. and Prentice, R. L. (2002). The Statistical Analysis of Failure Time Data. Wiley Series in Probability and Statistics. Wiley-Interscience, 2nd edition.

Kalluri, R. and Weinberg, R. A. (2009). The basics of epithelial-mesenchymal transition. The Journal of Clinical Investigation, 119(6), 1420–1428.

Karatzoglou, A., Smola, A., Hornik, K., and Zeileis, A. (2004). kernlab – an S4 package for kernel methods in R. Journal of Statistical Software, 11(9), 1–20.

Lawrence, N. D. (2004). Gaussian process latent variable models for visualisation of high dimensional data. In S. Thrun, L. Saul, and B. Schölkopf, editors, Advances in Neural Information Processing Systems 16, pages 329–336. MIT Press.

Lopez-Paz, D., Sra, S., Smola, A., Ghahramani, Z., and Schölkopf, B. (2014). Randomized nonlinear component analysis. In E. P. Xing and T. Jebara, editors, Proceedings of the 31st International Conference on Machine Learning, pages 1359–1367.

Marchini, J. L., Heaton, C., and Ripley, B. D. (2013). fastICA: FastICA Algorithms to perform ICA and Projection Pursuit. R package version 1.2-0.

Ngiam, J., Chen, Z., Bhaskar, S. A., Koh, P. W., and Ng, A. Y. (2011). Sparse filtering. In J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 1125–1133. Curran Associates, Inc.

Ridgeway, G. (1999). The state of boosting.

Ridgeway, G., with contributions from others (2013). gbm: Generalized Boosted Regression Models. R package version 2.1.

Schölkopf, B., Smola, A., and Müller, K.-R. (1998). Nonlinear component analysis as a kernel eigenvalue problem. Neural Comput., 10(5), 1299–1319.

Schroeder, M. S., Culhane, A. C., Quackenbush, J., and Haibe-Kains, B. (2011). survcomp: an R/Bioconductor package for performance assessment and comparison of survival models. Bioinformatics, 27(22), 3206–3208.

Shankaran, V., Ikeda, H., Bruce, A. T., White, J. M., Swanson, P. E., Old, L. J., and Schreiber, R. D. (2001). IFNγ and lymphocytes prevent primary tumour development and shape tumour immunogenicity. Nature, 410(6832), 1107–1111.

Siegel, R., Ma, J., Zou, Z., and Jemal, A. (2014). Cancer statistics, 2014. CA: A Cancer Journal for Clinicians, 64(1), 9–29.

Simon, N., Friedman, J. H., Hastie, T., and Tibshirani, R. (2011). Regularization paths for Cox's proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), 1–13.

Therneau, T. M. (2014). A Package for Survival Analysis in S. R package version 2.37-7.

van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A., Mao, M., Peterse, H. L., van der Kooy, K., Marton, M. J., Witteveen, A. T., Schreiber, G. J., Kerkhoven, R. M., Roberts, C., Linsley, P. S., Bernards, R., and Friend, S. H. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415(6871), 530–536.

Zou, H. and Hastie, T. (2012). elasticnet: Elastic-Net for Sparse Estimation and Sparse PCA. R package version 1.1.

Zou, H., Hastie, T., and Tibshirani, R. (2006). Sparse principal component analysis. Journal of Computational and Graphical Statistics, 15(2), 265–286.
