Powerpoint

advertisement
How to analyze your organism’s chance of survival?
Data: Time between two distinct events, repeated among many
subjects/objects/organisms. The first event is predefined while the
second is typically some specific kind of transition.
The time between these two events will be called the transition time
Graphic representation
Transition time
(or survival time)
70
61
90+
time
22
It can happen that subjects exit the study for other reasons than the
event of interest. This is called censored data. Transition time is
more than what we have registered, but we don’t know by how
much.

Medical analysis – diagnosis to death by that disease:
 End of the study
 The patient wants to leave the study
 Death by other causes

Plant survival study – start of experiment to death by environmental factors:
 End of study
 Experimenter accidentally dropped the pot on the ground

Start of larval stage until transition to adult:
 End of the study
 Death of the larva

Matriculation to master’s degree of students:
 End of study
 Student dropped out
 Death of student (hopefully not!)
If we disregard censored data, we can seriously underestimate the
transition time!






Subject: Whatsoever the start and end events belong to.
Transition: The end event.
Transition time: Time between the two events of interest.
Censoring: When a subject leave the study in a different
way than the specified transition.
Age: Time from the start event to the present time for
subjects which have not been censored and which have not
undergone transition.
Treatment: Same as regression/dependent
variable/explanation variable in ordinary regression.
Finding the effect of a treatment on survival is typically the
goal of survival analysis.
The transition time will typically
vary, so we need a statistical
distribution for it.
Distribution f(t) describes the
probability-density of the
transition times (t). A sharp peak
around a value means most
survival times are found around
that value.
A histogram of actual transition
times will start to look like this
distribution when we have much
data.
Distribution of age at death for the
population of U.S.A. 2003, as derived
from the histogram of ages.
Survival curve, S(t) : Describes
the probability of not going
through a transition before a
given age, t.
A histogram of the age of subjects
at a given time, will look like this
curve.
In more mathematical terms, it’s
the cumulative sum (integral) of
probabilities (densities) for all
times larger than t.
Survival curve for U.S.A. 2003,
from the histogram of ages.

S (t ) 
 f (t )dt  1  F (t )
t
conversely
f (t )   S ' (t )
Hazard rate, h(t): The
chance of going through a
transition the next time
interval, given that the
subject has not done so
earlier.
Hazard rate for U.S.A. 2003, derived from
the histogram of ages.
P (T  t  t | T  t ) f (t )

  log(S (t ))'
t  0
t
S (t )
conversely
h(t )  lim
t
S (t )  e  H (t ) , where H (t )   h(u )du
0
Concept
Graph
Transition time distribution, f(t):
Survival curve, S(t) :

S (t ) 

f (t )dt  1  F (t )
t
Hazard rate, h(t):
h(t ) 
If we have an
expression for
one of these
concepts, the
other two can
be derived.
f (t )
d
  log(S (t ))
S (t )
dt
Just different
ways of looking
at the same
thing.

Form: f(t)=e-t

Usage:




Unstable elementary particles
Radioactive isotopes
Time between phone calls
Life time of a particular copy of DNA for
microbial organisms? PS: Conditioned on the state
f(t)=e-t
of the organism itself and it’s environment.

Special quality - memoryless:
f(t-t0 | t>t0)=f(t)
S(t)=e-t
If the survival
probability drops
to 50% in t=5, it
will drop to 25% in
t=10 and to 12.5%
in t=15.
t0
h(t)=
Constant
hazard.
Reasonable,
since the
distribution is
memoryless.

Assume microbial survival is
conditionally exponential distribution.

Contribution from genetics and
environment spreads out the death rate,
, according to the gamma distribution,
(a,b).

Result: f(t)= (a/b) (1+t/b)-(a+1)
f(t) for a=1, b=1
(Pareto distribution)
S(t)=(1+t/b)-a
h(t)=a/(b+t)
Dropping hazard rate.
Reasonable:
•If old age => good genes
and/or good env.
•If young, overrepresentation of bad genes
and/or bad env.
Cartoon model of aging:
the uniform distribution:
f(t)=I(0<t<a)/a
 a=maximal age.
 All outcomes below that
are equally probable.

S(t)=(1-t/a)I(0<t<a)
h(t)=1/(a-t)
f(t)
Hazard rate increases
inversely proportional
to the distance to a.
The closer to the
maximum attainable
age, the more risk
there is of dying.

Often observed in
engineering: a hazard rate
that it higher for small and
large times than for
moderate times.

Can be reasonable for
complex biological
organisms also. For
instance humans.

Possible to start with
modeling hazard rates in
order to make a transition
time distribution.
Estimated
h(t) from
census data
2003, U.S.A.
h(t )   log(S (t ))' 
 Increasing hazard

survival-curve
bends downwards
on the log-scale.
h' (t )   log(S (t ))''
h(t)
S(t)

Example: Uniform transition time (cartoon of aging)

Decreasing hazard

survival-curve
bends upwards on
the log-scale.
h(t)

S(t)
Example: Pareto distribution (Varying genes/env.
Possibly also a model for vulnerability in early life.)
Kaplan-Meier is a parameter-free
way of estimating the survival
curve.
Example: Survival plot for plant
experiment – all plants
 Similar to histograms, in that it simply
summarizes the data.
 Performed by first noting for which
times, tj , there are transitions in the
data.
 The number of transitions, mj , and
the number of subjects “at risk”, yj ,
(both subjects which will transit later
and subjects that are later censored),
is then used.
mj
ˆ
S
(
t
)

(
1

)
 Technical:

yj
j|t j t
R code:
survfit(Surv(t.event,censoring.status))
Use “plot”, to see the resulting curve.
Divide your dataset into subgroups
You can get a confidence interval for with different treatments and you get
this estimated curve.
a feel for the difference between
these treatments. (In this plant study, the
treatment is day length, “dlen”):
R code:
survfit(Surv(t.event,censoring.status)dlen,
conf.type=“plain”, conf.int=0.95)
R code: survfit(Surv(t.event,censoring.status),
conf.type=“plain”, conf.int=0.95)
PS: Note that using confidence intervals to say
whether there is a difference means invoking a
large number of dependent tests. Not ok.
Does model comparison between grouping the
data according to treatment and not grouping
the data.
Compares the observed number of events to that
expected if the groups have equal transition
time distribution.
Gives a p-value for the zero-hypothesis (no effect
of different treatments).
R-code: survdiff(Surv(t.event,censoring)~treatment)
Strength
• No model assumptions =>
can be used for almost any
dataset.
• Allows categorical
explanation variables.
• Model-independent
confidence interval and
model comparison, too.
Weakness
• Model assumptions =>
stronger inference.
• No inference for transition time
distribution or hazard rate.
• No way to stepwise add and test more
explanation variables (treatments).
• Explanation variables can only be
added by subdividing the dataset into
smaller and smaller pieces.
• No way to incorporate continuous
explanation variables.
Addresses (almost all) the weaknesses of the KaplanMeier approach.
 Does so by a single model assumption: proportional
hazard.





Separates time dependency from variable dependency *in
the hazard rate*.
Hazard ratio: The hazard rate for one choice of
explanation variables divided by the hazard rate of
another choice.
Allows for continuous explanation variables and additive
effects of different categorical variables.
R-code: coxph(Surv(t.event,censoring)~var1+var2+var3)
Proportional hazard regression:
h(t|x)=h0(t)ex
or
lh(t|x)=lh0(t)+x where lh=log(h).
Assume one explanation variable, x: What
happens if we change it from x to x+1?
 Log-hazard rate changes by a additive factor .
lh(t|x=1)=
lh(t|x=0)+0.69
h(t|x=1)=
2h(t|x=0)
 The hazard changes by a multiplicative factor e.
 Log-survival curve also changes by a multiplicative
factor e. (The latter can be compared to results from the
Kaplan-Meier estimator.)
 Survival curve changes from S(t)
to S(t)exp().
(Ex: S(t|x=1)=S(t|x=0)2. Not so easy to see in a plot.)
log(S(t|x=1))=
2log(S(t|x=0))

Gives a (partial) likelihood, so more model
comparison techniques available.
 Likelihood-ratio test
 AIC/BIC
 Wald test (comparison between estimate and standard error, implemented in R).


Stepwise adding/subtracting extra variables
possible, either likelihood-based methods or by
Wald test.
Full model exploration by information criteria also
possible (though prohibitively costly when the number of explanation variables is
high).

Implemented in R: cox.zph(…)

What makes the hazard non-proportional can be
viewer using the Kaplan-Meier-estimated logsurvival-curve. (Or plot(cox.zph(…)))
For this plant experiment, the
survival-curves for short day
length and long day length seem
to part company at around day
30-40.
Doesn’t necessarily invalidate the
Cox-regression but makes the
Kaplan-Meier estimate for log-survival-curve
hazard ratio an average effect.
Download