PCR

advertisement
Partial Cox Regression Analysis
1. Literature Survey
1.1 Survival Analysis
Survival analysis is the branch of applied statistics dealing with the analysis of data on
times of events in individual life-histories (human or otherwise). A more modern and
broader title is generalized event history analysis. To begin with, the event in question
was often the failure of a medical treatment designed to keep cancer patients in remission
and the emergence and growth of survival analysis was directly connected to the large
amount of resources put into cancer research. This area of medical statistics brought a
whole new class of problems to the fore, especially the problem of how to deal with
censored data. At first many ad hoc techniques were used to get around these problems
but slowly a unifying idea emerged. This is to see such data as the result of a dynamic
process in time, each further day of observation producing some new pieces of
data. Tractable statistical models are based on modeling events continuously in time,
conditioning on past events; and new statistical ideas such as partial likelihood are also
based on this dynamic time structure.
Survival analysis examines and models the time it takes for events to occur. The
prototypical such event is death, from which the name ‘survival analysis’ and much of its
terminology derives, but the ambit of application of survival analysis is much broader.
Essentially the same methods are employed in a variety of disciplines under various
rubrics – for example, ‘event-history analysis’ in sociology. Survival analysis focuses on
the distribution of survival times. Although there are well known methods for estimating
unconditional survival distributions, most interesting survival modeling examines the
relationship between survival and one or more predictors, usually termed covariates in
the survival-analysis literature. The Cox proportional-hazards regression model
(introduced in a seminal paper by Cox, 1972), a broadly applicable and the most widely
used method of survival analysis. The survival library in R and S-PLUS also contains all
of the other commonly employed tools of survival analysis.
There are many texts on survival analysis: Cox and Oakes (1984) is a classic source.
Allison (1995) presents a highly readable introduction to the subject based on the SAS
statistical package, but nevertheless of general interest. Extensive documentation for the
survival library may be found in Therneau (1999).
1.2 Basic Concepts and Notation
Let T represent survival time. We regard T as a random variable with cumulative
distribution function P (t) = Pr(T ≤ t) and probability density function p(t)=dP (t)/dt. The
more optimistic survival function S(t) is the complement of the distribution function,
S(t) = Pr(T> t) = 1− P (t). A fourth representation of the distribution of survival times is
the hazard function, which assesses the instantaneous risk of demise at time t, conditional
on survival to that time:
Modeling of survival data usually employs the hazard function or the log hazard. For
example, assuming a constant hazard, h(t) = ν, implies an exponential distribution of
survival times, with density function p(t) = νe−νt . Other common hazard models include
which leads to the Gompertz distribution of survival times, and
which leads to the Weibull distribution of survival times. In both the Gompertz and
Weibull distributions, the hazard can either increase or decrease with time; moreover, in
both instances, setting ρ =0 yields the exponential model. A nearly universal feature of
survival data is censoring, the most common form of which is right-censoring: Here, the
period of observation expires, or an individual is removed from the study, before the
event occurs – for example, some individuals may still be alive at the end of a clinical
trial, or may drop out of the study for various reasons other than death prior to its
termination. An observation is left-censored if its initial time at risk is unknown. Indeed,
the same observation may be both right and left-censored, a circumstance termed
interval-censoring. Censoring complicates the likelihood function, and hence the
estimation, of survival models.
Moreover, conditional on the value of any covariates in a survival model and on an
individual’s survival to a particular time, censoring must be independent of the future
value of the hazard for the individual. If this condition is not met, then estimates of the
survival distribution can be seriously biased. For example, if individuals tend to drop out
of a clinical trial shortly before they die, and therefore their deaths go unobserved,
survival time will be over-estimated. Censoring that meets this requirement is
noninformative. A common instance of noninformative censoring occurs when a study
terminates at a predetermined date.
1.3 The Cox Proportional-Hazards Model
As mentioned, survival analysis typically examines the relationship of the survival
distribution to covariates. Most commonly, this examination entails the specification of a
linear-like model for the log hazard. For example, a parametric model based on the
exponential distribution may be written as
or, equivalently,
that is, as a linear model for the log-hazard or as a multiplicative model for the hazard.
Here, i is a subscript for observation, and the x’s are the covariates. The constant α in this
model represents a kind of log-baseline hazard, since log hi(t)= α [or hi(t)= eα] when all
of the x’s are zero.
The Cox model, in contrast, leaves the baseline hazard function α(t) = log h0(t)
unspecified:
or, again equivalently,
This model is semi-parametric because while the baseline hazard can take any form, the
covariates enter the model linearly. Consider, now, two observations i and i′ that differ in
their x-values, with the corresponding linear predictors
and
The hazard ratio for these two observations,
is independent of time t. Consequently, the Cox model is a proportional-hazards model.
Remarkably, even though the baseline hazard is unspecified, the Cox model can still be
estimated by the method of partial likelihood, developed by Cox (1972) in the same paper
in which he introduced the Cox model. Although the resulting estimates are not as
efficient as maximum-likelihood estimates for a correctly specified parametric hazard
regression model, not having to make arbitrary, and possibly incorrect, assumptions about
the form of the baseline hazard is a compensating virtue of Cox’s specification. Having
fit the model, it is possible to extract an estimate of the baseline hazard.
1.4 The coxph Function
The coxph function is used to fit proportional hazards regression model.
coxph(formula, data=parent.frame(), weights, subset,
na.action, init, control, method=c("efron","breslow","exact"),
singular.ok=TRUE, robust=FALSE,
model=FALSE, x=FALSE, y=TRUE,...
)
Example:
# Create a simple data set for a time-dependent model
test2 <- list(start=c(1, 2, 5, 2, 1, 7, 3, 4, 8, 8),
stop =c(2, 3, 6, 7, 8, 9, 9, 9,14,17),
event=c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0),
x =c(1, 0, 0, 1, 0, 1, 1, 1, 0, 0) )
summary( coxph( Surv(start, stop, event) ~ x, test2))
Result:
Call:
coxph(formula = Surv(start, stop, event) ~ x, data = test2)
n= 10
coef exp(coef) se(coef)
z p
x -0.0211
0.98 0.795 -0.0265 0.98
exp(coef) exp(-coef) lower .95 upper .95
x
0.98
1.02 0.206
4.65
Rsquare= 0 (max possible= 0.84 )
Likelihood ratio test= 0 on 1 df, p=0.979
Wald test = 0 on 1 df, p=0.979
Score (logrank) test = 0 on 1 df, p=0.979
1.5 DNA Microarray Technology
The introduction of DNA microarray technology is a technical advance in biomedical
research. DNA microarrays rely on the hybridization properties of nucleic acids to
monitor DNA or RNA abundance on a genomic scale in different types of cells.
Hybridization refers to the annealing of two nucleic acid strands following the basepairing rules. Nucleic acid strands in a duplex can be separated, or denatured, by heating
to destroy the hydrogen bonds.
DNA microarray technology permits simultaneous measurements of expression levels for
thousands of genes, which offers the possibility of a powerful, genome-wide approach to
the genetic basis of different types of tumors. The genome-wide expression profiles can
be used for molecular classification of cancers, for studying varying levels of drug
responses in the area of pharmacogenomics and for predicting different patients' clinical
outcomes. The problem of cancer class prediction using the gene expression data, which
can be formulated as predicting binary or multi-category outcomes, has been studied
extensively and has been demonstrated great promise in recent years. However, there has
been less development in relating gene expression profiles to censored survival
phenotypes such as time to cancer recurrence or death due to cancer. Due to large
variability in time to cancer recurrence among cancer patients, studying possibly
censored survival phenotypes can be more informative than treating the phenotypes as
binary or categorical variables.
From a statistical perspective, one challenge to studying time to event outcome results
from right censoring during patient followup, since some patients may still be event-free.
These patients are termed right-censored, and for these patients, we only know that the
time to event is greater than the time of last followup. An additional challenge is in the
microarray gene expression data itself. Microarray gene expression data is often
measured with great deal of background, irrelevant readings, and the sample size of
tissues or patients is usually very small compared to the number of genes measured by
expression arrays. In addition, there is a potential of high collinearity among many genes.
Censoring of patients proves difficult when compared to binary or continuous
phenotypes. A frequent approach to relating gene expression profiles to survival
phenotypes is to first group tumor samples into several clusters based on gene expression
patterns across many genes, and then to use the Kaplan-Meier (KM) curve or the log-rank
test to indicate whether there is a difference in survival time among different tumor
groups. Another approach is to cluster genes first based on their expression across
different samples, and use the sample averages of the gene expression levels in a Cox
model for survival outcome. Both approaches suffer the drawback that the phenotype
information is completely ignored in the clustering step and therefore may
result in loss of efficiency. Additionally, results will potentially be sensitive to the
clustering algorithm and distance metrics employed, as well as the number of clusters
chosen.
Partial least squares (PLS) (Wold, 1966) is a method of constructing linear regression
equations by constructing new explanatory variables or factors or components using
linear combinations of the original variables. The methods can be effectively applied to
the settings where the number of explanatory variables is very large (Wold, 1966;
Garthwaite, 1994). Different from the principal components (PC) analysis, this method
makes use of the response variable in constructing the latent components. The method
identifes linear combinations of the original variables as predictors and uses these linear
components in the standard regression analysis. Nyuyen and Rock (2002) applied the
standard PLS methods of Wold (1966) directly to survival data and used the resulted PLS
components in the Cox model for predicting survival time. The approach did not really
generalize the linear PLS to censored survival data, but applied it directly. However, such
direct application of the Wold algorithm to survival data is questionable and indeed does
not seem reasonable since it treats both time to event and time to censoring as the same.
Park et al. (2002) reformulated the Cox model as a Poisson regression and applied the
formulation of PLS for the generalized linear models to derive the PLS components.
However, such reformulation introduces many additional nuisance parameters and when
the number of covariates is large, the algorithm may fail to converge. In addition, Park et
al. (2002) did not evaluate how well the model predicts the survival of a future patient.
A different extension of PLS to the censored survival data in the framework of the Cox
model by providing a parallel algorithm for constructing the latent components is the
Partial Cox Regression Model (PCR).
2. Partial Cox Regression Method
2.1 Introduction
The Partial Cox Regression (PCR) method involves constructing predictive components
by repeated least square fitting of residuals and Cox regression fitting. These components
can then be used in the Cox model for building a useful predictive model for survival.
The key difference from the standard principal components Cox regression analysis is
that in constructing the predictive components, the PCR method utilizes the observed
survival/censoring information.
The Partial Cox Regression method can be very useful in building a parsimonious
predictive model that can accurately predict the survival of future patients based on the
gene expression profile and survival times of previous patients.
2.2 Method for Constructing the Partial Cox Regression Model
Suppose that we have a sample size of n from which to estimate the relationship
p
between the survival time and the gene expression levels X1
T    ( i 1) j ˆ ( i 1) jV( i 1) j
j1
,
… , XP of p genes. Due to censoring, for i = 1, … , n, the ith datum in the sample is
denoted by (ti , δi , xi1 , xi2 , … xip), where, δi is the censoring indicator and ti is the
survival time if δi = 1 or censoring time if δi = 0, and {xi1 , xi2 , … , xip}’ is the
vector of the gene expression level of p genes for the ith sample. Let xj = {x1j , x2j ,
…, xnj}’ be the vector of gene expression levels of the jth gene over n samples. The
aim is to build the following Cox regression model for the hazard of cancer
recurrence or death at time t
λ (t) = λo (t) exp (β1T1 + β2T2 + … + βkTk)
= λo (t) exp (f(X)),
(1)
where each component Tk and the risk score function f(X) is a linear combination of
X = {X1 , X2 , … , Xp}. In this model, λo(t) is an unspecified baseline hazard function.
Following the idea of PLS and adopting the principle that when considering the
relationship between the hazard and some specified X variable, other X variables are
not allowed to influence the relationship directly but are only allowed to influence it
through the components Tk, develop the following procedure to determine the
components sequentially. To construct the first component, first define
V1j = Xj - x . j
(2)
where x . j = 1/n∑ni=1 xij . The vector of the sample values of V1j is v1j = {v11j, ..., vn1j}
= xj - x .j and therefore the sample mean of V1j is zero. Then for each gene j, fit the
following Cox model
λ(t) = λo(t) exp(β1j V1j)
based on the sample value of V1j and obtain the maximum partial likelihood estimate
(MPLE) of β1j, denoted by βˆ1j. Then, each βˆ1jV1j provides an estimate of the log
relative hazard in the hazard function. To reconcile these estimates, set T1 equal to the
weighted average, so
p
T1   1 j ˆ1 jV1 j
(3)
j1
where ω1j is a weight with Σω1j = 1. It is easy to see that the sample mean of T1 is
also zero. Note that T1 is a special type of compound covariates advocated by Tukey
(1993) in a clinical trial setting when there are many covariates.
Note that the X variables potentially contain further useful information for predicting
the risk of recurrence or survival. So, one should not stop at the T1 step. The
information in Xj that is not in T1 may be estimated by residuals from a regression of
V1j (equivalently, Xj) on T1, and denote the residual as V2j , which can be written as
V2 j  V1j -
V1j T1
T1T1
(4)
T1
Similarly, the contribution of the residual information in V1j to the variability in the
risk of recurrence or death after adjusting T1 can be estimated by performing the
following Cox regression analysis,
λ(t) = λo(t) exp(β1T1 + β2jV2j)
and denote the MPLE of β2j as βˆ2j . Then each βˆ2j V2j provides an estimate of the log
relative hazard after adjusting for T1. Define the next component T2 as
p
T2    2 j ˆ 2 jV2 j
(5)
j1
where ω2j s are the weights. Note here the sample mean of T2 is zero.
This procedure extends iteratively in a natural way to give component T2, … , TK,
where the maximum value of K is the sample size n. Specifically, suppose that
T i (i ≥ 1) has just been constructed for variables Vij , j = 1, …, p. To obtain Ti+1, first,
Vij is regressed against Ti and denote the residual as V (i+1)j , which can be written as
V(i  j) j  Vij -
VijTi
TiTi
Ti
Then, perform the following Cox regression analysis for each j,
(6)
λ(t) = λo(t) exp (β1T1 + β2T2 + … , βiTi + β (i+1)j V(i+1)j)
and denote the MPLE of β(i+1)j as β^ (i+1)j . Then, construct Ti+1 as
p
T(i  j)    (i 1) j ˆ(i 1) jV(i 1) j
(7)
j1
where ω(i+1)j s are the weights. We call the above procedure the partial Cox regression
(PCR) and the components T1, T2, … constructed the PCR components.
It is easy to verify that these components are mutually uncorrelated with sample mean
of zero. In addition, it is also easy to verify that the procedure gives precisely the PLS
in the case of linear model and when the simple linear regression is used in place of
the Cox regression. In constructing the PCR weights, let ωij  var(Vij), i.e., variables
with large variance are given larger weights. This weight was also suggested in the
PLS literatures (e.g., Garthwaite, 1994). An alternative is to use equal eights of 1/p,
which aims to spread the load among the X variables in making predictions.
Here, only the variance weights are used. After the components T1, T2, … , Tk are
determined, model (1) is used for estimating the hazard function by the standard
partial likelihood method. After an estimate of the regression model (1) has been
determined, equations (3)-(7) can be used to write the risk score function f(X) in
model (1) in terms of original variables X, rather than the components T1, …, Tk , i.e.,
f(X) = Σ pj=1 β*j V1j = Σ pj=1 β*j (Xj - ‾x. j ),
for some coefficients β*j . This can then be used for estimating the hazard function for
future samples on the basis of their X values. Note that the sample mean of the score
function f(X) is zero. Finally, by examining the coefficients of X values in the final
model, one can rank the gene effects by the absolute values of the coefficients.
3. Method Implementation in R
Partial Cox regression function - PCRf
This function output the k PCR components for each individual in the
data set (comp, matrix n by k) and for each component, the
coefficients (coef, matrix m by k)
Program automatically centers and standardizes predictors.
Arguments:
x:
time:
n*m predictor matrix, with m predictors.
survival time for n observations.
surv.st: censoring status for n observations.
k:
number of output PCR components.
Program output k PCR components in comp each PCR's corresponding
coefficients output are in coef
PCRf <- function(x, time, surv.st, k)
{
x <- as.matrix(x)
nm <- dim(x)
n <- nm[1]
m <- nm[2]
ti <- matrix(0, n, k)
beta <- matrix(0, m, k)
V2 <- matrix(0, n, m)
one <- rep(1, n)
para <- matrix(0, m, k)
coef <- matrix(0, m, k)
meanx <- drop(one %*% x)/n
x <- scale(x, meanx, FALSE)
V1 <- x
space1 <- NULL
for(i in 1:k)
{
normx <- sqrt(drop(one %*% (V1^2)))
for(j in 1:m)
{
fit <- coxph(Surv(time, surv.st) ~
cbind(space1, V1[, j]))
sdf <- fit$coefficients[i]
names(sdf) <- NULL
beta[j, i] <- (sdf * normx[j])/sum(normx)
ti[, i] <- ti[, i] + beta[j, i] * V1[, j]
}
if(i == 1)
coef[, i] <- beta[, i]
else coef[, i] <- beta[, i] - coef[, 1:(i - 1)]
%*% t(t(beta[, i]) %*% para[, 1:(i - 1)])
space1 <- cbind(space1, ti[, i])
for(j in 1:m)
{
fit <- lm(V1[, j] ~ ti[, i] - 1)
V2[, j] <- fit$residuals
para[j, i] <- fit$coefficients
}
names(para) <- NULL
V1 <- V2
cat("loop=", i, "\n")
}
list(comp = ti, coef = coef)
}
4. Results
Data used is the diffuse large B-cell lymphoma (DLBCL), the most common subtype of
non-Hodgkin's lymphoma, can be found on the site, http://www.broad.mit.edu/cgibin/cancer/datasets.cgi.
The PCR components were built based on the gene expression data of the genes. The
p-values for the first ten PCR components for different number of genes are shown below
which shows that only the first 3 or 4 PCR components are strongly associated with the
survival in the univariate Cox regression analysis, indicating that the PCR components
result in dimension reduction. These components are significantly associated with the risk
of death at the 0.05 level; all other PCR components are not significant.
The p-values for univariate Cox regression analysis for the first 10 PCR components:
No.
1
2
3
4
5
6
7
8
9
10
PCR
(no. of genes = 1836)
0
3E-10
2E-02
0.233
0.223
0.280
0.829
0.824
0.552
0.889
PCR
( no. of genes = 506)
0
2E-11
2E-06
4E-05
0.066
0.233
0.199
0.516
0.514
0.994
The components which are significantly related to the risk of death at the p = 0.05 level in
univariate Cox regression analysis are used to build the final predictive model.
The mean of the risk scores of the patients in the training set, which is zero, is used as the
cutoff point to divide the patients into high and low risk groups. Based on the estimated
coefficients, the risk scores for patients in the testing data set is estimated and divided
these patients into two risk groups.
The median survival times are 10 years and 2 years respectively for the low and high-risk
groups defined by the PCR model (p = 0.0033). Following figure shows the survival
curves for two groups of patients in the testing set defined by having positive or negative
risk scores.
Figure 1. Results from analysis of DLBCL dataset
References:
Hongzhe Li, Jiang Gui (2004). Partial Cox Regression Analysis for High-Dimensional
Microarray Gene Expression Data.
Akritas MG (1994): Nearest neighbor estimation of a bivariate distribution under random
censoring. Annals of Statistics, 22:1299-1327.
Garthwaite PH (1994): An interpretation of partial least squares. Journal of the American
Statistical Association, 89, 122-127.
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh
ML, Downing JR, Caligiuri MA, Bloomfield CD, ,Lander ES: Molecular Classification
of Cancer (1999): Class Discovery and Class Prediction by Gene Expression Monitoring.
Science , 286:531-537.
Heagerty PJ, Lumley T, Pepe M (2000): Time dependent ROC curves for censored
survival data and a diagnostic marker. Biometrics, 56:337-344.
Nyuyen D, Rocke DM (2002): Partial Least Squares Proportional Hazard Regression for
Application to DNA Microarray Data. Bioinformatics, 18, 1625-1632.
Park PJ, Tian L, Kohane IS (2002): Linking Expression Data with Patient Survival Times
Using Partial Least Squares. Bioinformatics, 18: S120-127.
Rosenwald A, Wright G, Chan W, Connors JM, Campo E, Fisher R, Gascoyne RD,
Muller-Hermelink K, Smeland EB, Staut LM (2002): The Use of Molecular Proing to
Predict Survival After Themotheropy for Diffuse Large-B-Cell Lymphoma. The New
England Journal of Medicine, 346:1937-1947.
Tukey JW (1993). Tighening the clinical trial. Controled Clinical Trials, 14: 266-285.
Wold H (1966): Estimation of principal components and related models by iterative least
squares. In P.R. Krishnaiaah (Ed.). Multivariate Analysis. 391-420. New: Academic
Press.
URL
http://repositories.cdlib.org/cbmb/partialcox
www.accessexcellence.org/RC/VL/GG/microArray.html
Download