Regression Trees for Censored Data
Author(s): Mark Robert Segal
Source: Biometrics, Vol. 44, No. 1 (Mar., 1988), pp. 35-47
Published by: International Biometric Society
Stable URL: http://www.jstor.org/stable/2531894
BIOMETRICS 44, 35-47
March 1988

Regression Trees for Censored Data

Mark Robert Segal
Channing Laboratory, Harvard Medical School,
180 Longwood Avenue, Boston, Massachusetts 02115, U.S.A.
SUMMARY

The regression-tree methodology is extended to right-censored response variables by replacing the conventional splitting rules with rules based on the Tarone-Ware or Harrington-Fleming classes of two-sample statistics. New pruning strategies for determining desirable tree size are also devised. Properties of this approach are developed and comparisons with existing procedures, in terms of practical problems, are discussed. Illustrative, real-world performances of the technique are presented.
1. Introduction
In recent years considerable research effort has been dedicated toward devising regression techniques free from some of the restrictive classical assumptions. Frequently, the tradeoff for this freedom is that these methods require large data sets and are computationally intensive. Two broad categories of such approaches can be summarized as smoothing (Stone, 1977; Friedman and Stuetzle, 1981; Breiman and Friedman, 1985) and trees (Breiman et al., 1984). Another area of current statistical endeavor is the analysis of survival data wherein the response variable is subject to censoring. The proportional hazards model (Cox, 1972) has afforded a widely used and flexible technique. However, again constraining assumptions have become an issue and attempts to overcome these have led to the adaptation of smoothing and tree methodologies to the survival setting. Specifically, Hastie and Tibshirani (1986) extend the Cox model by replacing the linear modeling of covariates with an additive (sum of smooth functions) model.
This paper proposes modifications to the conventional regression tree methodology with the primary motivation of facilitating an extension to censored data. However, the suggested changes used to achieve this end have merit in their own right, and so comparisons with the existing techniques (in the uncensored setting) are also mentioned. The basic alteration is the replacement of the goodness-of-split criteria, which had been geared toward optimizing within-node homogeneity, with measures of between-node separation. The measures used are two-sample statistics belonging to the Tarone-Ware or Harrington-Fleming classes. Their introduction further necessitates changing the pruning algorithm used to determine desirable tree size.
Existing methods for tree construction in the survival data setting utilize different splitting and pruning approaches. Gordon and Olshen (1985) use distance measures (Wasserstein metrics) between Kaplan-Meier curves and certain point masses. The motivation is by way of analogy with the least squares criterion used in the uncensored setting and the method allows for an immediate inheritance of the CART (see below) pruning algorithm. Ciampi et al. (1987) and Davis (unpublished Ph.D. thesis, Harvard University, 1988) pursue splitting based on likelihood-ratio tests. Work on comparing the performance and properties of the various tree techniques is in progress.
Key words: Censoring; Pruning; Regression tree; Splitting rule; Tarone-Ware class.
The next section provides a brief overview of current regression tree (or recursive partitioning) methodology. Section 3 deals with the new splitting criteria and, in particular, addresses the censored data issue. Section 4 indicates how the new pruning strategies work. The fifth section demonstrates the performance of the new technique on a real-world example, and the sixth discusses properties and potential of the new approach with reference to problems encountered in practice.

Repeated allusion is made to the definitive reference by Breiman et al. (1984), which is referred to as CART, and in which stand-alone section numbers should be sought. This monograph makes explicit the advantages of tree methods from an applied perspective. The importance of tree techniques in biomedical settings has also been emphasized by Goldman et al. (1982). The extraction of clinically meaningful strata that have distinct survival prospects is an endpoint commonly sought by medical investigators. Further, the tree structure makes for ready interpretability and easy classification of new patients.
2. Constructing Regression Trees
A simplified description of regression trees is presented in this section, so that the subsequent reformulations can be understood. For a more detailed understanding the CART monograph is recommended. Attention here is restricted to the familiar regression setting: there are p predictor variables X1, ..., Xp and a continuous and uncensored response Y. No comment is made here with regard to issues such as the treatment of missing values (§5.3.2) or variable importance (§5.3.4), for which carryover from the standard methods to the new methods is immediate. For a compilation of the advantages afforded by using tree procedures see Segal (unpublished Ph.D. thesis, Department of Statistics, Stanford University, 1986) or CART §2.7.
In order to construct a regression tree four components are required. These are:

1. A set of (binary) questions of the form "Is x ∈ A?" where x is a case and A is a subset of the predictor space. The answer to such a question induces a partition, or split, of the predictor space. That is, the cases for which the answer is yes are associated with region A and those for which the answer is no are associated with the complement of A. The subsamples so formed are called nodes.
2. A goodness-of-split criterion φ(s, t) that can be evaluated for any split s of any node t. The criterion is used to assess the worth of the competing splits, where (in CART) worth pertains to within-node homogeneity.
3. A means for determining the appropriate tree size.
4. Statistical summaries for the terminal nodes of the selected tree. These can be as simple as the node average or as involved as a Kaplan-Meier survival curve, depending on context.
What follows is an elaboration of these aspects.
2.1 Candidate Binary Splits
The plethora of possible splits in 1 above, resulting from not placing any restrictions on the region A, is reduced to a computationally feasible number by constraining that:

(a) Each split depends on the value of only a single predictor variable. [Note: This restriction can be loosened; the software (CART, California Statistical Software, 1984) permits splits on linear combinations of predictors.]
(b) For ordered predictors Xj, only splits resulting from questions of the form "Is Xj < c?" for c ∈ R¹ are considered.
(c) For categorical predictors all possible splits into disjoint subsets of the categories are allowed.
It may appear that any reduction in the number of splits resulting from the above constraints is worthless. Certainly (a) restricts us to examining predictors univariately and (b) restricts us to dividing R¹ into two semi-infinite intervals as opposed to the multitudes of other possible break-ups. However, we are still contending with an uncountably infinite number of partitions as c ranges over R¹. The point is that the random variable Xj takes on only a finite number of values in the sample at hand: at most n for the n cases. Hence, we have to examine only those values of c that result in a case switching "sides," from the right semi-infinite interval to the left. So there are at most n - 1 splits given by "Is Xj < ci?" where the ci are taken, by convention, halfway between consecutive distinct observed values of Xj.
The tree itself is grown as follows. For each node (the initial or root node comprises the entire sample):

1. Examine every allowable split on each predictor variable.
2. Select and execute (create two new daughter nodes) the best of these splits.

Steps 1 and 2 are then reapplied to each of the daughter nodes, and so on.
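The enumeration of candidate cutoffs just described can be written in a few lines. The following is an illustrative sketch, not part of the software discussed in the paper:

```python
def candidate_split_points(x):
    """Candidate cutoffs c for "Is Xj < c?" on an ordered predictor:
    by convention, midpoints between consecutive distinct observed
    values, so at most n - 1 splits need to be examined."""
    distinct = sorted(set(x))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]
```

For example, a predictor observed at values 3, 1, 2, 2 yields the two candidate cutoffs 1.5 and 2.5, regardless of how many cases share a value.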
2.2 Goodness-of-Split Criterion
"Best"in 2 above is assessedin terms of the goodness-of-splitcriterion.Two such criteria
areespousedin CARTand are availablein the associatedsoftware.These are LeastSquares
(??8.3, 8.4) and Least Absolute Deviations (?8.11). Both afforda comparisonbased on a
subadditive"between/within"decomposition,where between alludes to the homogeneity
or loss measureappliedto the parentnode. For point of referencethe definitionof the least
squarescriterionis presentedhere.The obviouschangesgive riseto leastabsolutedeviations
(or any other between/withincriterionsuch as is used in ?6.4).
Let t designate a node of the tree. That is, t contains a subsample {(x_n, y_n)}. Let N(t) be the total number of cases in t and let

    ybar(t) = (1/N(t)) Σ_{x_n ∈ t} y_n

be the node response average. Then the within-node sum of squares is given by

    SS(t) = Σ_{x_n ∈ t} [y_n - ybar(t)]²

Now suppose a split s partitions t into left and right daughter nodes tL and tR. The least squares criterion is

    φ(s, t) = SS(t) - SS(tL) - SS(tR)

and the best split s* is the split such that

    φ(s*, t) = max_{s ∈ Ω} φ(s, t)

where Ω is the set of all permissible splits.

A least squares regression tree is constructed by recursively splitting nodes to maximize the above φ function. The criterion is such that we create smaller and smaller nodes of progressively increasing homogeneity. The subadditivity mentioned above is equivalent to the nonnegativeness of φ; i.e., SS(t) ≥ SS(tL) + SS(tR). The same is true if we work with
least absolute deviations. It is this inequality that is responsible for the increasing (strictly nondecreasing) homogeneity.
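The least squares criterion and its subadditivity can be illustrated concretely. The sketch below is an illustrative aid (not the CART software); it computes SS(t) and the resulting φ(s, t), which is guaranteed nonnegative:

```python
def ss(y):
    """Within-node sum of squares SS(t) about the node average."""
    ybar = sum(y) / len(y)
    return sum((v - ybar) ** 2 for v in y)

def ls_split_criterion(y_left, y_right):
    """phi(s, t) = SS(t) - SS(tL) - SS(tR) for a split of node t into
    daughters holding y_left and y_right. Subadditivity,
    SS(t) >= SS(tL) + SS(tR), makes this nonnegative."""
    return ss(list(y_left) + list(y_right)) - ss(y_left) - ss(y_right)
```

For instance, splitting responses {1, 1, 3, 3} into {1, 1} and {3, 3} gives φ = 4 - 0 - 0 = 4, the largest achievable value for these data.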
2.3 Determining Desirable Tree Size
It remains unresolved as to what constitutes an appropriately sized tree. Originally [the AID program, Morgan and Sonquist (1963)] this was determined by use of stopping rules: if a node became too small or the improvement φ(s*, t) resulting from the best split was not sufficient (to surmount some preset threshold), then the node was declared terminal. This proved unsatisfactory on account of the rigidity of the thresholds. In some instances overfitting (as a consequence of setting thresholds too small) via too large a tree would occur. In others, underfitting would result from rejection of a split (owing to its improvement not exceeding the cutoff) precluding the emergence of subsequent worthwhile splits. There is an analogy here to stepwise versus subset regression in terms of failure to capture important interactions. The problem was redressed by (1) initially growing a very large tree; (2) iteratively pruning this tree all the way back up (to the root node), thereby creating a nested sequence of trees; (3) selecting the best tree from this sequence using test-sample or cross-validation estimates of error.
This procedure is detailed in CART (Chap. 3). The means for performing the pruning in step (2) is called "minimal cost-complexity pruning"; see §3.3. This paper presents some alternatives to this when no within-node cost is available [for least squares the within-node cost is just SS(·)]. Such alternatives are necessary because the minimal cost-complexity algorithm relies crucially on within-node cost. It is clear how the above procedure averts the deficiencies of stopping rules. Any potentially consequential splits have the opportunity to emerge, as the initial large tree can be grown so big as to possess terminal nodes that are pure: contain only one category in the classification context or only one response value for regression. The usage of cross-validation or test-sample estimates is intended to ensure that a realistically sized tree is produced in relation to how much noise/sampling variability is present. This aspect is discussed more fully in Section 4.
3. Two-Sample Statistic Splitting
3.1 Motivation
Instead of gearing our splitting criteria to optimizing within-node homogeneity, we could reward splits that result in large between-node separation. The magnitude of any two-sample statistic affords such a goodness-of-split measure. Such a change constitutes more than just a rephrasing of the problem. Whilst it is (empirically) the case that splitting based on two-sample t statistics with unpooled variance estimates (Welch statistics) gives results strongly resembling those obtained from least squares splitting, there is no algebraic equivalence and problems can be contrived where results are dissimilar.

The fact that in all the cases analysed, splitting using Welch statistics and splitting using least squares gave comparable results supports the usage of two-sample statistics: given that the two techniques produce analogous results and least squares gives worthwhile answers, the new approach must be doing something reasonable. But why replace a proven method with one that is harder to motivate and offers no computational savings? The answer lies in the advantages provided by using two-sample rank statistics. These include all the conventional desiderata of ranks plus some additional benefits:
1. Invariance under monotone transformation of the response Y. The regression trees created by using least squares or least absolute deviations possessed such invariance only with respect to monotone transformations of the (ordered) predictors. This means, for instance, that the optimal split is the same regardless of whether we use X1 or X̃1 = g(X1) for some monotone g. If the optimal split on X1 is X1 < c then the optimal split using X̃1 will be X̃1 < g(c) (§2.7). However, it is only through the use of two-sample rank statistics that best splits (predictor and cutoff) are preserved under monotone transformations h of Y to Ỹ = h(Y). This is clearly a worthwhile property when there is no natural response scale in which to work (see Gordon and Olshen, 1978; Anderson, 1966). The issue is especially pertinent in the context of censored regression; the censored regression setting is further emphasized by Prentice (1978).

2. Insensitivity to outliers in the response space. The use of least squares, and to a lesser extent least absolute deviations, is subject to the familiar sensitivity to extreme Y observations. This, in the regression tree setting, is not necessarily a drawback, since such outliers will be isolated into their own (single case) terminal nodes. Still, the influence on overall tree topology can be distorting and the interpretation of splits leading to the isolation of the outlier can be problematic. Friedman (1979) regards the presence of outliers as weakening the least squares procedure by wasting splits.

3. Computational feasibility. The actual computational implementation for evaluating the multitude of competing splits is detailed in Segal's unpublished Ph.D. thesis. The easiest case is the uncensored setting, where using the Wilcoxon procedure is no more involved than using any two-sample linear rank statistic. The updating algorithm devised makes for an O(n) procedure that is as simple as the O(n) least squares algorithm. The story is not so simple when it comes to dealing with splitting based on members of the Tarone-Ware or Harrington-Fleming classes. Nevertheless, efficient algorithms can be developed with the right organization.

4. Extension to censored response. The principal motivation for changing the splitting criterion was to enable tree techniques to be used for survival data. Instead of using two-sample rank statistics for uncensored values as goodness-of-split criteria, analogues of such statistics that account for censoring are used, as described below.
3.2 Censored Data Rank Statistics
Before the merits or otherwise of the two-sample statistics used for censored response (Gehan, Prentice, Mantel-Haenszel, Tarone-Ware) are discussed, their form and computational implementation are described. The Tarone-Ware class of statistics derives from a sequence of 2 x 2 tables. In the survival analytic context of censored response, this sequence arises from constructing a 2 x 2 table for each distinct, uncensored response:

                    Dead          Alive                        Total
    Population 1    a_i           n_1i - a_i                   n_1i
    Population 2    m_i - a_i     (n_i - n_1i) - (m_i - a_i)   n_i - n_1i
    Total           m_i           n_i - m_i                    n_i

and the statistics have the following form:

    TW = [ Σ_{i=1}^{k} w_i (a_i - E_0(A_i)) ] / [ Σ_{i=1}^{k} w_i² var_0(A_i) ]^{1/2}

where A_i is the random variable corresponding to the number of deaths in population 1 for the ith table; the w_i are constants used to weight the respective tables; the sum is over all tables, i.e., all distinct uncensored observations; the null hypothesis is that the death rates for the two populations are equal; for fixed margins the null expectations and variances are hypergeometric:

    E_0(A_i) = m_i n_1i / n_i

    var_0(A_i) = [ m_i (n_i - m_i) / (n_i - 1) ] (n_1i / n_i) (1 - n_1i / n_i)
Standard specifications for the weights w_i include the following:

1. w_i = 1 gives the Mantel-Haenszel or log-rank (Peto and Peto, 1972) statistic.
2. w_i = n_i gives the Gehan (1965) statistic.
3. w_i = n_i^{1/2} gives a statistic advocated by Tarone and Ware (1977).
4. w_i = S*_i gives Prentice's (1978) generalization of the Wilcoxon, where S*_i = Π_{j=1}^{i} n_j/(n_j + 1) is almost the Kaplan-Meier survival estimate at the ith uncensored failure time.
The Gehan statistic is subject to domination by a small number of early failures (Prentice and Marek, 1979) and hence should be used selectively. Another possibility is to use the Harrington and Fleming (1982) class, which has weights w_i = S_i^p for some fixed power p, with S now being exactly the Kaplan-Meier survival estimate. For this to be computationally feasible it would be necessary to restrict all splits to a particular p value; trying any sort of optimization on p for individual nodes would be too involved. In practice, at least for the data sets and simulations examined, the trees emerging from using differing split statistics (apart from the Gehan) are surprisingly similar. Such a finding applied in the uncensored setting when competing members of the class of linear rank statistics were used and is also reported in CART in the classification-tree context.
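The Tarone-Ware form above can be computed directly from the sequence of 2 x 2 tables. The following is a minimal illustrative sketch (not the paper's O(n) updating algorithm); it writes out the hypergeometric moments explicitly, and its weight argument only covers weights that depend on the risk-set size n_i:

```python
def tarone_ware(times, events, group, w=lambda n: 1.0):
    """Two-sample statistic from the Tarone-Ware class, computed from
    one 2 x 2 table per distinct uncensored time.

    times  : failure/censoring times
    events : 1 = death observed, 0 = censored
    group  : 0/1 population label
    w      : weight as a function of the risk-set size n_i
             (w = 1 log-rank; w = n_i Gehan; w = sqrt(n_i) Tarone-Ware).
             Prentice and Harrington-Fleming weights depend on the
             survival estimate and are not covered by this signature."""
    death_times = sorted({t for t, e in zip(times, events) if e})
    num = den = 0.0
    for ti in death_times:
        at_risk = [(g, t, e) for g, t, e in zip(group, times, events) if t >= ti]
        n = len(at_risk)                                    # n_i
        n1 = sum(1 for g, _, _ in at_risk if g == 1)        # n_1i
        m = sum(1 for _, t, e in at_risk if e and t == ti)  # m_i, deaths at ti
        a = sum(1 for g, t, e in at_risk if g == 1 and e and t == ti)  # a_i
        e0 = m * n1 / n                                     # E_0(A_i)
        v0 = (m * (n - m) / (n - 1)) * (n1 / n) * (1 - n1 / n) if n > 1 else 0.0
        wi = w(n)
        num += wi * (a - e0)
        den += wi * wi * v0                                 # w_i^2 var_0(A_i)
    return num / den ** 0.5 if den > 0 else 0.0
```

Passing w=lambda n: n yields the Gehan weighting. A large |TW| signals strong between-node separation, which is exactly what the splitting rule rewards; the sign merely records which daughter fares better.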
The implementation of a splitting algorithm using the Tarone-Ware class is indicated by the following pseudo-code:
Splitting Algorithm
For each node
    Initialize BestStat = BestPredictor = BestSplitPoint = 0
    Sort response values and collect ties (compute m_i)
    Compute risk sets n_i and the Kaplan-Meier for the node
    Loop over all p predictors: X1, X2, ..., Xp
        Sort the node (response and censoring indicator) with respect to ordered Xcurrent
        Loop over all potential split points i (i.e., step along the Xcurrent axis)
            By cycling over the ordered response values compute a_i and n_1i
            Using prespecified weights w_i compute the two-sample statistic: TwoSam
            If |TwoSam| > BestStat Then
                BestStat <- |TwoSam|
                BestPredictor <- Xcurrent
                BestSplitPoint <- i
            End If
        End
    End
Important but simple issues such as not splitting on tied predictor values and efficient means for sorting and ranking have not been highlighted, for clarity. There is an additional user-specified parameter that serves to regulate the degree of censoring permitted in any given node. Thus, only those splits that result in daughter nodes having a ratio of uncensored to censored observations greater than the preset threshold are examined. Actual choices for this threshold should be determined in an exploratory, problem-specific manner.
4. Revised Pruning Strategies
An important difference between least squares (or least absolute deviations) splitting as outlined in Section 2 and any two-sample statistic splitting rule (Section 3) is that the former provides a within-node estimate of error, namely SS(t), the within-node sum of squares. Such is not the case for two-sample statistic splits, which afford only a measure of goodness of split. These measures cannot be decomposed to attribute a within-node error. This is consequential, since the within-node errors form a key component of the pruning algorithm developed in CART (Chap. 3). The algorithm, therefore, does not carry over to the present situation and, inasmuch as tree size is a fundamental issue, alternate approaches must be sought. The following section describes an attempt to inherit the CART algorithm and, on account of the limitations of this attack, the succeeding section details an alternative approach to pruning. This latter technique, which retains the bottom-up tactic but sacrifices cross-validation, is believed to work well.
4.1 Inheriting the CART Algorithm
In order to circumvent the problems posed by the absence of within-node loss, an attempt to revert back to the original splitting criteria was made. Specifically, least absolute deviations splitting was tried. Using least squares was not entertained because of the unstable nature of the mean when estimated from survival curves. The intention was to account for the censoring by using medians based on the Kaplan-Meier survival curve estimated for each node and then consider absolute deviations about these. Thus, the goodness-of-split criterion for a split at s of node t into tL, tR would be

    φ(s, t) = Σ_{x_n ∈ t} |y_n - ν(t)| - Σ_{x_n ∈ tL} |y_n - ν(tL)| - Σ_{x_n ∈ tR} |y_n - ν(tR)|

where ν(·) is the Kaplan-Meier median. The best split of a node t would again be that which maximized φ.
However, there were a variety of difficulties associated with this approach that rendered it useless:

(a) Computationally this method was very slow.
(b) The actual splits obtained using this criterion on simulated data with known structure were not convincing. The method did not uncover the important variables or split points.
(c) The hope behind resurrecting least absolute deviations splitting was the inheritance of the pruning algorithm used in standard CART. However, even this did not materialize. An unstated necessity for the minimal cost-complexity algorithm (§3.3) to work is that the splitting criterion be subadditive, i.e., φ(s, t) ≥ 0 for all s, t. This is easily seen to hold in the uncensored case, where ν(t) above can be taken to be any sample median for the node t. But this does not hold for censored y's and ν(·) a Kaplan-Meier median.
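For reference, the Kaplan-Meier median ν(·) appearing in the criterion above is the smallest uncensored time at which the product-limit estimate falls to one-half or below. The sketch below is illustrative only; note that under heavy censoring the curve may never reach one-half, one manifestation of the instabilities just described:

```python
def km_median(times, events):
    """Kaplan-Meier median: smallest uncensored time at which the
    product-limit survival estimate drops to 0.5 or below.
    Returns None when the curve never reaches 0.5 (heavy censoring)."""
    data = sorted(zip(times, events))
    s = 1.0
    at_risk = len(data)
    for t in sorted(set(times)):
        deaths = sum(1 for tt, ee in data if tt == t and ee)
        if deaths:
            s *= (at_risk - deaths) / at_risk   # product-limit step
            if s <= 0.5:
                return t
        at_risk -= sum(1 for tt, _ in data if tt == t)  # leave the risk set
    return None
```

With no censoring this reduces to an ordinary sample median of the failure times; with all observations censored it returns None, so a criterion built on ν(·) is not even well defined for every node.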
4.2 Bottom-Up Approaches
In view of the above failures, what was needed was an altogether different tack. It was decided to preserve the concept of initially growing a very large tree and subsequently pruning this. What was sacrificed was the selection of a particular tree from the generated sequence by cross-validation. Further, the minimal cost-complexity pruning algorithm itself was (by necessity) replaced with some new pruning schemata.

The loss of cross-validation as a selection mechanism was not tragic. While the method had performed well, its usage had several recognized flaws. The more detracting of these include: (i) inaccuracies and instabilities of the cross-validation estimates (§8.7) and
(ii) failure of the tree selected as optimal to preclude noisy splits (§8.6). Of course, criticism (ii) can be levelled at any technique, but is cited here on account of such noisy splits emerging even in a highly structured situation. Indeed, the authors of CART equally promote user selection of the right-sized tree (§§3.4.3, 6.2). This should be done in an exploratory fashion and aided by the incorporation of subject-matter knowledge.

But for such user selection, the user must be provided with a tree sequence and hopefully one that contains good candidate trees. It was to this end that the new pruning algorithms were created. Before expounding on these it is important to reiterate what is being acquired from the CART approach: protection against the deficiencies of stopping rules as highlighted in Section 2.
After several strategies were tried, the following emerged as the preferred pruning algorithm:

Initially grow a very large tree.
From the bottom, step up this tree, assigning to each internal node the maximum split statistic contained in the subtree of which the node under consideration is the root.
Collect all these maxima and place them in increasing order.
The first pruned tree of the sequence corresponds to locating the highest node in the tree possessing the smallest maximum and removing all its descendants.
The second tree of the sequence is then obtained by reapplying this process to the first tree, and so on, until all that remains is the root node.
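The steps above can be sketched directly. The Node class and function names below are illustrative, not taken from the paper's software; the sketch assumes the split statistics have already been recorded at each internal node during growing:

```python
class Node:
    """Binary tree node; split_stat is the two-sample statistic of the
    split made at this node (None for a terminal node)."""
    def __init__(self, split_stat=None, left=None, right=None):
        self.split_stat = split_stat
        self.left, self.right = left, right

    def is_terminal(self):
        return self.left is None

def subtree_max(node):
    """Maximum split statistic contained in the subtree rooted here."""
    if node.is_terminal():
        return float("-inf")
    return max(node.split_stat,
               subtree_max(node.left), subtree_max(node.right))

def prune_once(root):
    """One step of the pruning sequence: make terminal the highest
    (shallowest) internal node whose subtree maximum is smallest,
    removing all its descendants. Returns False once only the root
    node remains, i.e., there is nothing left to prune."""
    internal = []
    def walk(n, depth):
        if not n.is_terminal():
            internal.append((subtree_max(n), depth, n))
            walk(n.left, depth + 1)
            walk(n.right, depth + 1)
    walk(root, 0)
    if not internal:
        return False
    # smallest subtree maximum; ties resolved in favor of the higher node
    _, _, victim = min(internal, key=lambda rec: (rec[0], rec[1]))
    victim.split_stat = None
    victim.left = victim.right = None
    return True
```

Repeated calls to prune_once generate the nested tree sequence; recording the number of terminal nodes and the pruned subtree maximum at each call reproduces the kind of output shown in Table 1.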
This procedure is illustrated in conjunction with the example in the next section. The associated output is also displayed. Essentially, each internal node is linked with the maximum split statistic contained in the subtree for which the node is the root. The pruning sequence is then determined by the order of these maxima. Selecting a tree from the sequence provided can be done by plotting maximal subtree split statistics against tree size and picking the tree corresponding to the characteristic "kink" in the curve; see §3.4.3 or Friedman (Technical Report 12, Department of Statistics, Stanford University, 1985).

In terms of computation time, the construction of the tree sequence is very much a secondary concern relative to the initial growing of the large tree. The building process requires the evaluation of many splits at each node, whereas for this particular pruning method there is one very rapid ascent of the tree (to ascertain the maximum subtree split statistic for each node only simple comparisons, as opposed to calculations, are required), followed by subtree removal, which entails simple looping to update quantities such as the number of terminal nodes. Thus, no computational burden results from the pruning algorithm.
5. Stanford Heart Transplant Data
One example for illustrating the performance of any regression technique where the response is subject to censoring is the Stanford Heart Transplant data. The parametric attacks of Miller (1976), Buckley and James (1979), and Koul, Susarla, and Van Ryzin (1981) are compared with regard to this data set by Miller and Halpern (1982). The more recent "nonparametric" treatments of Doksum and Yandell (1982), Hastie and Tibshirani (1986), and Owen (Technical Report 25, Department of Statistics, Stanford University, 1987) plus the celebrated proportional hazards model of Cox (1972) have all been tested on the Stanford data set. A fully parametric analysis on a subset of the data is presented by Aitkin, Laird, and Francis (1983).
A brief data description is now given. The response Y is log10 survival time, where the survival time is the time (in days) until death due to rejection of the transplanted heart. There are p = 2 predictors: X1, the age of the recipient, and X2, a tissue mismatch score measuring recipient and donor tissue compatibility. One hundred fifty-seven cases were analysed, there being a 35% censoring rate.
What has consistently emerged from the plethora of analyses is that age is the more significant predictor. For instance, Miller and Halpern (1982) exclude mismatch from additional examination, having found it to be insignificant in multiple regression analyses. However, Tanner and Wong (1983) find that patients possessing high mismatch and older ages are characterized by a distinctive hazard function. Further, the nonparametric approaches have revealed a cutoff value of roughly 50 years, in that the subpopulations so defined (≤50 and >50) have distinct survival characteristics.
Regression trees, using two-sample statistic splitting, were used to analyse the data. In particular, the Mantel-Haenszel statistic was used in conjunction with subtree maximal statistic pruning to produce both the tree schematic in Figure 1 and the tree sequence in Table 1. The values below the square terminal nodes in Figure 1 are Kaplan-Meier medians for that node. It is worth recording that neither the initial large tree nor the pruned tree sequence was substantially altered by using other splitting statistics from the Tarone-Ware class. In fact, the key first split was identical in the cases examined. What is immediately evident from the tree diagram is the confirmation of the previous findings. First, age clearly emerges as the more consequential predictor (but see CART §5.3.4 for an automated means for predictor ranking that overcomes possible masking; this is not an issue here since there are only two predictors). Second, the cutoff at around 50 is reflected by the value of the first split point.

However, the analysis can proceed further. A natural first summary for a terminal node when we have censored response is the estimated Kaplan-Meier survival curve S. The program also provides the user with the possibility of extracting certain derived quantities,
[Figure 1 appears here: tree schematic with splits on age and tissue mismatch (e.g., age < 41, mismatch < 1.66); Kaplan-Meier medians are shown below the square terminal nodes.]

Figure 1. Tree diagram for Stanford heart transplant data.
Table 1
Pruning by subtree maximal statistics, Stanford heart transplant data

    Tree      Terminal    Subtree maximal
    number    nodes       statistic
    1         14           .00
    2         13           .60
    3         12           .99
    4         11          1.17
    5         10          1.25
    6          9          1.79
    7          6          2.43
    8          4          2.48
    9          2          2.79
    10         1          4.91
such as the Kaplan-Meier median S⁻¹(.5), as node summaries or predictions. Figure 2 features superimposed survival curves for each of the 4 terminal nodes. The curve corresponding to node 3 in the tree schematic of Figure 1 lies appreciably below the curves for nodes 4 and 6. Node 3 contains the >50 age group. The survival prospects are noticeably worse than the bulk of the ≤50 group, contained in nodes 4 and 6, as would be expected. But the survival characteristics for node 7 resemble those for node 3. Node 7 contains patients who are middle-aged (41-50), as opposed to young, and who also have high tissue mismatch scores. Thus, it is not surprising that they possess equally poor survival prospects. It is the extraction of precisely such local interactions that makes the tree techniques so powerful. Whilst caution must be exercised in interpreting survival curves based on only 8 cases, Tanner and Wong (1983) identified a very similar local interaction.
[Figure 2 appears here: superimposed Kaplan-Meier curves for terminal nodes 3, 4, 6, and 7, plotted against survival time.]

Figure 2. Kaplan-Meier curves for terminal nodes.
6. Discussion
Kalbfleisch and Prentice (1980) present a detailed analysis of mouse leukemia data to demonstrate a variety of difficulties that arise in practical problems. The particular issues, their conventional resolution, and the way in which tree-structured procedures cope will form the basis of the discussion. For an actual analysis of the data using the techniques espoused here see Segal (unpublished Ph.D. thesis, Department of Statistics, Stanford University, 1986).

Kalbfleisch and Prentice (1980) formulate the practical problems encountered in terms of the following five questions, each of which is discussed in turn: (1) How should classes be formed? (2) How to deal with the multiple comparisons issue? (3) How should potential differences between Mantel-Haenszel and Gehan tests be reconciled? (4) Are sample sizes adequate for asymptotics? (5) How to cope with cases possessing missing values?
The firstquestionpertainsto Kalbfleisch.andPrentice'sstratificationof the data. This is
achieved by examining each predictorindividually. For continuous predictors,division
pointsarepickedarbitrarilymodulo some guidelines.Theseguidelinesincludethe following
recommendations:(i) three or four classeswill often provide adequateresolutionwithout
unduly compromisingefficiency;(ii) roughlyequal sample sizes among classes should be
pursued,though not at the expense of natural division points or the creation of highly
unequalcensoringrates.
However, there are problems associated with such rules of thumb. The consideration of predictors one at a time precludes the emergence of any interactions. The first recommendation is revealed to be inappropriate and the second to be internally inconsistent with respect to the mouse leukemia data in Segal's thesis. The regression tree approach has, as a central aim, the formation of meaningful classes. These are determined by the data themselves and hence are not subject to the vagaries of assumptions or guidelines. Interactions are readily recognized and no problems arise in dealing with variables of continuous, mixed, or discrete type. This contrasts with other nonparametric procedures based on smoothing that have difficulty dealing with mixed predictors.
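The data-driven class formation for a single continuous predictor can be sketched as follows. This is an illustrative Python sketch, not the paper's actual implementation: all names are hypothetical, and the log-rank (Mantel-Haenszel) statistic is used as the splitting criterion, though any member of the classes discussed below would serve.

```python
import numpy as np

def logrank_stat(time, event, group):
    """Squared log-rank (Mantel-Haenszel) statistic, 1 df, comparing
    group == 1 versus group == 0 with right-censored data."""
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):          # distinct death times
        risk = time >= t                            # risk set at t
        n_t = risk.sum()
        n1 = (risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()      # deaths at t
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        num += d1 - d * n1 / n_t                    # observed minus expected
        if n_t > 1:
            var += d * (n1 / n_t) * (1 - n1 / n_t) * (n_t - d) / (n_t - 1)
    return num * num / var if var > 0 else 0.0

def best_split(x, time, event):
    """Examine every achievable cutpoint of a continuous predictor x and
    return the one maximizing the two-sample statistic between the
    resulting child nodes, together with the statistic's value."""
    best_cut, best_stat = None, -np.inf
    for c in np.unique(x)[:-1]:
        stat = logrank_stat(time, event, (x > c).astype(int))
        if stat > best_stat:
            best_cut, best_stat = c, stat
    return best_cut, best_stat
```

The cutpoint, and hence the classes, fall where the data themselves indicate the sharpest survival difference, rather than at guideline-driven division points.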
The process of class formation is a precursor to testing for differences between classes. If a large number of classes are created and a large number of (nonindependent) tests performed, then the significance levels that can be attached are subject to the familiar degradation: the multiple comparisons problem. Of course, this can be accounted for by the conventional (Scheffé, Tukey, Bonferroni) methods. The problem does not exist in conjunction with regression trees, simply because no testing is performed. Whether this constitutes an asset or a liability is a contentious point. Still, more can be said on the matter. Kalbfleisch and Prentice (1980) assert that class formation on the basis of the observed mortality itself would invalidate the corresponding tests. This is true to a certain extent, yet it does not imply that, in the regression tree setting where classes are formed on the basis of the observed mortality data, testing is not possible. What is necessitated is that the "corresponding" tests be made to conform to the process by which the classes are constructed. Thus, for instance, in using a regression tree derived from Mantel-Haenszel splitting to perform a test on the significance of any given split (and hence of different survival curves for the two associated classes), what is required is the null distribution of the maximum of the relevant number of Mantel-Haenszel statistics. Clearly this distribution is hard to get a handle on, but the simulation studies described in Segal's thesis can be used instead.
With regard to the third question on reconciling differences, should they arise, between the Mantel-Haenszel and Gehan tests for equality of survival curves from two classes, Kalbfleisch and Prentice (1980) advise using some intermediary weighting of the tests, i.e., some member of the Tarone-Ware class. This is completely concordant with the splitting
strategy used in the construction of regression trees, whereby any member of the Tarone-Ware or Harrington-Fleming classes can be used as the splitting criterion.
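The members of these classes differ only in the weight w(t) attached to each death time. A hypothetical sketch of such a weighted two-sample statistic follows, with w(t) = 1 for Mantel-Haenszel (log-rank), n(t) for Gehan, sqrt(n(t)) for Tarone-Ware, and S(t-)^rho for Harrington-Fleming, where n(t) is the pooled risk-set size and S the pooled Kaplan-Meier estimate:

```python
import numpy as np

def weighted_two_sample(time, event, group, weight="logrank", rho=0.0):
    """Standardized weighted two-sample statistic for right-censored data.
    weight selects w(t): 'logrank' -> 1 (Mantel-Haenszel), 'gehan' -> n(t),
    'tarone-ware' -> sqrt(n(t)), 'fleming-harrington' -> S(t-)**rho."""
    surv = 1.0                                   # pooled KM estimate S(t-)
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):        # distinct death times
        risk = time >= t
        n_t = risk.sum()
        n1 = (risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()
        w = {"logrank": 1.0,
             "gehan": float(n_t),
             "tarone-ware": float(np.sqrt(n_t)),
             "fleming-harrington": surv ** rho}[weight]
        num += w * (d1 - d * n1 / n_t)
        if n_t > 1:
            var += w * w * d * (n1 / n_t) * (1 - n1 / n_t) * (n_t - d) / (n_t - 1)
        surv *= 1.0 - d / n_t                    # step the KM curve past t
    return num / np.sqrt(var) if var > 0 else 0.0
```

Since the same routine covers the whole family, an intermediary weighting is available as a splitting criterion simply by changing the weight argument, which is the concordance noted above.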
Again, since no formal testing is undertaken in the regression tree approach, the fourth question concerning asymptotics is somewhat moot. The recommendation of Kalbfleisch and Prentice, for situations where sample sizes are perceived to be too small to warrant recourse to asymptotic results, is to perform testing by simulating the actual null distribution of the statistic. This is precisely the strategy proposed in conjunction with regression tree methods. If censoring is independent of the predictors, an alternative simulation tactic would be to develop a permutation test. This would involve permuting the response values and corresponding censoring indicators over the cases and then recomputing the tree.
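Such a permutation test might be sketched as follows. This is a simplified illustration under stated assumptions: only the maximal split statistic at a single node is recomputed after each permutation, rather than the full tree, and all function names are hypothetical.

```python
import numpy as np

def logrank(time, event, group):
    """Squared log-rank statistic for a binary split (1 df)."""
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):
        risk = time >= t
        n_t, n1 = risk.sum(), (risk & group).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & group).sum()
        num += d1 - d * n1 / n_t
        if n_t > 1:
            var += d * (n1 / n_t) * (1 - n1 / n_t) * (n_t - d) / (n_t - 1)
    return num * num / var if var > 0 else 0.0

def max_split_stat(x, time, event):
    """Maximal log-rank statistic over all cutpoints of x at this node."""
    return max(logrank(time, event, x > c) for c in np.unique(x)[:-1])

def permutation_test(x, time, event, n_perm=199, seed=0):
    """Permute the (survival time, censoring indicator) pairs over the
    cases, recomputing the maximal split statistic each time, to obtain
    its null distribution and hence a p-value for the observed split."""
    rng = np.random.default_rng(seed)
    observed = max_split_stat(x, time, event)
    exceed = 0
    for _ in range(n_perm):
        idx = rng.permutation(len(time))       # breaks any x-response link
        if max_split_stat(x, time[idx], event[idx]) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)
```

Because the response and censoring indicator are permuted as a pair, the null distribution respects the observed censoring pattern, which is why independence of the censoring from the predictors is required.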
The only asymptotic issue to be examined in the context of tree schemata is consistency; see CART (Chap. 12) and the sequence of papers by Gordon and Olshen (1978, 1980, 1984). Under regularity conditions that do not depend on the particular splitting criteria or pruning algorithm used, consistency results are obtained for both the classification and regression problems. The regularity conditions include a growth condition on the amount of mass in each member of the partition and the requirement that the diameter of every member go to 0 in probability. For censored response data, identifiability issues arise as indicated in Gordon and Olshen (1985), yet since there is no reliance on the particular tree construction methods used, the consistency results carry over immediately to the two-sample statistic schemes developed here.
The final question concerns how to cope with missing values. These can constitute a consequential portion of the data in medical/biological studies and hence efficient information extraction from (as opposed to the discarding of) such cases is an important issue. The manner in which this is achieved by tree schemata is detailed in CART (Chap. 5). For an illuminating illustration of how distorted results can occur using conventional regression practices in the presence of missing data, and how tree methods overcome such problems, see Bloch and Segal (Technical Report 108, Department of Statistics, Stanford University, 1985).
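The CART device in question is the surrogate split: candidate splits on the other predictors are ranked by how well they mimic the primary split, and the best one routes cases whose primary predictor is missing. A minimal sketch, with hypothetical names and a plain agreement measure in place of CART's predictive-association criterion:

```python
import numpy as np

def surrogate_split(goes_left, X, candidates):
    """Rank candidate (column, cutpoint) surrogate splits by how often
    they agree with the primary split's left/right assignment, allowing
    a surrogate to be used with its directions reversed; the winner is
    used to route cases whose primary predictor is missing."""
    best, best_agree = None, 0.0
    for j, cut in candidates:
        agree = np.mean((X[:, j] <= cut) == goes_left)
        agree = max(agree, 1.0 - agree)      # reversed surrogate allowed
        if agree > best_agree:
            best, best_agree = (j, cut), agree
    return best, best_agree
```

In this way a case missing the primary predictor is still assigned to a child node on the basis of its observed predictors, rather than being discarded.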
The overall conclusion, then, is that the regression tree methodology developed deals very well with the five problems posed as being of practical consequence.
ACKNOWLEDGEMENTS
The author wishes to thank Professor Jerome Friedman and the two referees for many helpful comments.
RÉSUMÉ
The regression tree methodology is extended to right-censored response variables by replacing the usual splitting rules with rules based on the Tarone-Ware or Harrington-Fleming classes of two-sample statistics. New pruning strategies for determining the appropriate tree size are also given. Properties of this approach are developed, and comparisons with existing procedures are discussed in terms of practical problems. An illustration of the actual performance of the technique is presented.
REFERENCES
Aitkin, M., Laird, N., and Francis, B. (1983). A reanalysis of the Stanford Heart Transplant Data. Journal of the American Statistical Association 78, 264-274.
Anderson, T. W. (1966). Some nonparametric multivariate procedures based on statistically equivalent blocks. In Multivariate Analysis, P. R. Krishnaiah (ed.), 5-27. Orlando, Florida: Academic Press.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580-598.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Belmont, California: Wadsworth.
Buckley, J. and James, I. R. (1979). Linear regression with censored data. Biometrika 66, 429-436.
Ciampi, A., Chang, C.-H., Hogg, S., and McKinney, S. (1987). Recursive partition: A versatile method for exploratory data analysis in biostatistics. In Proceedings from Joshi Festschrift, G. Umphrey (ed.), 23-50. Amsterdam: North-Holland.
Cox, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society, Series B 34, 187-202.
Doksum, K. A. and Yandell, B. S. (1982). Properties of regression estimates based on censored survival data. In Festschrift for Erich L. Lehmann, P. J. Bickel, K. A. Doksum, and J. L. Hodges, Jr. (eds), 140-156. Berkeley: University of California Press.
Friedman, J. H. (1979). A tree-structured approach to nonparametric multiple regression. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt (eds), 5-22. Berlin: Springer-Verlag.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817-823.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 203-223.
Goldman, L., Weinberg, M., Weisberg, M., Olshen, R., et al. (1982). A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain. New England Journal of Medicine 307, 588-596.
Gordon, L. and Olshen, R. A. (1978). Asymptotically efficient solutions to the classification problem. Annals of Statistics 6, 515-533.
Gordon, L. and Olshen, R. A. (1980). Consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 10, 611-627.
Gordon, L. and Olshen, R. A. (1984). Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 15, 147-163.
Gordon, L. and Olshen, R. A. (1985). Tree-structured survival analysis. Cancer Treatment Reports 69, 1065-1069.
Harrington, D. P. and Fleming, T. R. (1982). A class of rank test procedures for censored survival data. Biometrika 69, 553-566.
Hastie, T. J. and Tibshirani, R. J. (1986). Generalized additive models. Statistical Science 1, 297-318.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.
Koul, H., Susarla, V., and Van Ryzin, J. (1981). Regression analysis with randomly right-censored data. Annals of Statistics 9, 1276-1288.
Miller, R. G., Jr. (1976). Least squares regression with censored data. Biometrika 63, 449-464.
Miller, R. G., Jr. and Halpern, J. (1982). Regression with censored data. Biometrika 69, 521-531.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58, 415-434.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank-invariant test procedures. Journal of the Royal Statistical Society, Series A 135, 185-198.
Prentice, R. L. (1978). Linear rank tests with right-censored data. Biometrika 65, 167-179.
Prentice, R. L. and Marek, P. (1979). A qualitative discrepancy between censored data rank tests. Biometrics 35, 861-867.
Stone, C. J. (1977). Consistent nonparametric regression (with Discussion). Annals of Statistics 5, 595-645.
Tanner, M. A. and Wong, W. H. (1983). Discussion of: A reanalysis of the Stanford Heart Data by Aitkin, Laird, and Francis. Journal of the American Statistical Association 78, 286-287.
Tarone, R. E. and Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika 64, 156-160.
Received September 1986; revised June 1987.