Regression Trees for Censored Data

Mark Robert Segal
Channing Laboratory, Harvard Medical School, 180 Longwood Avenue, Boston, Massachusetts 02115, U.S.A.

Biometrics 44, 35-47, March 1988

SUMMARY

The regression-tree methodology is extended to right-censored response variables by replacing the conventional splitting rules with rules based on the Tarone-Ware or Harrington-Fleming classes of two-sample statistics. New pruning strategies for determining desirable tree size are also devised. Properties of this approach are developed and comparisons with existing procedures, in terms of practical problems, are discussed. Illustrative, real-world performances of the technique are presented.

1. Introduction

In recent years considerable research effort has been dedicated toward devising regression techniques free from some of the restrictive classical assumptions. Frequently, the tradeoff for this freedom is that these methods require large data sets and are computationally intensive. Two broad categories of such approaches can be summarized as smoothing (Stone, 1977; Friedman and Stuetzle, 1981; Breiman and Friedman, 1985) and trees (Breiman et al., 1984). Another area of current statistical endeavor is the analysis of survival data, wherein the response variable is subject to censoring. The proportional hazards model (Cox, 1972) has afforded a widely used and flexible technique. However, again constraining assumptions have become an issue, and attempts to overcome these have led to the adaptation of smoothing and tree methodologies to the survival setting. Specifically, Hastie and Tibshirani (1986) extend the Cox model by replacing the linear modeling of covariates with an additive (sum of smooth functions) model.
This paper proposes modifications to the conventional regression tree methodology with the primary motivation of facilitating an extension to censored data. However, the suggested changes used to achieve this end have merit in their own right, and so comparisons with the existing techniques (in the uncensored setting) are also mentioned. The basic alteration is the replacement of the goodness-of-split criteria, which had been geared toward optimizing within-node homogeneity, with measures of between-node separation. The measures used are two-sample statistics belonging to the Tarone-Ware or Harrington-Fleming classes. Their introduction further necessitates changing the pruning algorithm used to determine desirable tree size.

Existing methods for tree construction in the survival data setting utilize different splitting and pruning approaches. Gordon and Olshen (1985) use distance measures (Wasserstein metrics) between Kaplan-Meier curves and certain point masses. The motivation is by way of analogy with the least squares criterion used in the uncensored setting, and the method allows for an immediate inheritance of the CART (see below) pruning algorithm. Ciampi et al. (1987) and Davis (unpublished Ph.D. thesis, Harvard University, 1988) pursue splitting based on likelihood-ratio tests. Work on comparing the performance and properties of the various tree techniques is in progress.

Key words: Censoring; Pruning; Regression tree; Splitting rule; Tarone-Ware class.

The next section provides a brief overview of current regression tree (or recursive partitioning) methodology. Section 3 deals with the new splitting criteria and, in particular, addresses the censored data issue. Section 4 indicates how the new pruning strategies work. The fifth section demonstrates the performance of the new technique on a real-world example, and the sixth discusses properties and potential of the new approach with reference to problems encountered in practice.

Repeated allusion is made to the definitive reference by Breiman et al. (1984), which is referred to as CART, and in which stand-alone section numbers should be sought. This monograph makes explicit the advantages of tree methods from an applied perspective. The importance of tree techniques in biomedical settings has also been emphasized by Goldman et al. (1982). The extraction of clinically meaningful strata that have distinct survival prospects is an endpoint commonly sought by medical investigators. Further, the tree structure makes for ready interpretability and easy classification of new patients.

2. Constructing Regression Trees

A simplified description of regression trees is presented in this section, so that the subsequent reformulations can be understood. For a more detailed understanding the CART monograph is recommended. Attention here is restricted to the familiar regression setting: there are p predictor variables X1, ..., Xp and a continuous and uncensored response Y. No comment is made here with regard to issues such as the treatment of missing values (§5.3.2) or variable importance (§5.3.4), for which carryover from the standard methods to the new methods is immediate. For a compilation of the advantages afforded by using tree procedures see Segal (unpublished Ph.D. thesis, Department of Statistics, Stanford University, 1986) or CART §2.7.

In order to construct a regression tree four components are required. These are:

1. A set of (binary) questions of the form "Is x ∈ A?", where x denotes a case and A is a subset of the predictor space. The answer to such a question induces a partition, or split, of the predictor space.
That is, the cases for which the answer is yes are associated with region A and those for which the answer is no are associated with the complement of A. The subsamples so formed are called nodes.
2. A goodness-of-split criterion φ(s, t) that can be evaluated for any split s of any node t. The criterion is used to assess the worth of the competing splits, where (in CART) worth pertains to within-node homogeneity.
3. A means for determining the appropriate tree size.
4. Statistical summaries for the terminal nodes of the selected tree. These can be as simple as the node average or as involved as a Kaplan-Meier survival curve, depending on context.

What follows is an elaboration of these aspects.

2.1 Candidate Binary Splits

The plethora of possible splits in 1 above, resulting from not placing any restrictions on the region A, is reduced to a computationally feasible number by constraining that:

(a) Each split depends on the value of only a single predictor variable. [Note: This restriction can be loosened; the software (CART, California Statistical Software, 1984) permits splits on linear combinations of predictors.]
(b) For ordered predictors Xj, only splits resulting from questions of the form "Is Xj < c?" for c ∈ R¹ are considered.
(c) For categorical predictors all possible splits into disjoint subsets of the categories are allowed.

It may appear that any reduction in the number of splits resulting from the above constraints is worthless. Certainly (a) restricts us to examining predictors univariately and (b) restricts us to dividing R¹ into two semi-infinite intervals as opposed to the multitude of other possible break-ups. However, we are still contending with an uncountably infinite number of partitions as c ranges over R¹. The point is that the random variable Xj takes on only a finite number of values in the sample at hand, at most n for the n cases. Hence, we have to examine only those values of c that result in a case switching "sides," from the right semi-infinite interval to the left. So there are at most n - 1 splits given by {"Is Xj < c_i?"}, where the c_i are taken, by convention, halfway between consecutive distinct observed values of Xj.

The tree itself is grown as follows. For each node (the initial or root node comprises the entire sample):

1. Examine every allowable split on each predictor variable.
2. Select and execute (create two new daughter nodes) the best of these splits.

Steps 1 and 2 are then reapplied to each of the daughter nodes, and so on.

2.2 Goodness-of-Split Criterion

"Best" in 2 above is assessed in terms of the goodness-of-split criterion. Two such criteria are espoused in CART and are available in the associated software. These are Least Squares (§§8.3, 8.4) and Least Absolute Deviations (§8.11). Both afford a comparison based on a subadditive "between/within" decomposition, where between alludes to the homogeneity or loss measure applied to the parent node. For point of reference the definition of the least squares criterion is presented here. The obvious changes give rise to least absolute deviations (or any other between/within criterion such as is used in §6.4).

Let t designate a node of the tree. That is, t contains a subsample {(x_n, y_n)}. Let N(t) be the total number of cases in t and let

    ȳ(t) = (1/N(t)) Σ_{x_n ∈ t} y_n

be the node response average. Then the within-node sum of squares is given by

    SS(t) = Σ_{x_n ∈ t} [y_n - ȳ(t)]².

Now suppose a split s partitions t into left and right daughter nodes t_L and t_R. The least squares criterion is

    φ(s, t) = SS(t) - SS(t_L) - SS(t_R)

and the best split s* is the split such that

    φ(s*, t) = max_{s ∈ Q} φ(s, t),

where Q is the set of all permissible splits.
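As a concrete illustration of this split search, the following minimal Python sketch (not the paper's software; the function name and array interface are assumptions made here for illustration) evaluates φ(s, t) over all candidate cutpoints of a single ordered predictor, using running sums so that the pass over cutpoints is O(n) after sorting.

import numpy as np

def best_ls_split(x, y):
    """Return (best cutpoint, best phi) for one ordered predictor.

    phi(s, t) = SS(t) - SS(t_L) - SS(t_R), maximized over cutpoints taken
    halfway between consecutive distinct observed values of x.
    Illustrative sketch only; assumes no missing values.
    """
    x, y = np.asarray(x, float), np.asarray(y, float)
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    n = len(y_sorted)
    total_sum, total_sq = y_sorted.sum(), np.sum(y_sorted ** 2)
    ss_parent = total_sq - total_sum ** 2 / n

    best_phi, best_cut = -np.inf, None
    left_sum, left_sq = 0.0, 0.0
    # Running sums allow each candidate split to be scored in O(1).
    for i in range(n - 1):
        left_sum += y_sorted[i]
        left_sq += y_sorted[i] ** 2
        if x_sorted[i] == x_sorted[i + 1]:
            continue  # no case switches sides at this value
        n_l, n_r = i + 1, n - i - 1
        ss_left = left_sq - left_sum ** 2 / n_l
        ss_right = (total_sq - left_sq) - (total_sum - left_sum) ** 2 / n_r
        phi = ss_parent - ss_left - ss_right
        if phi > best_phi:
            best_phi = phi
            best_cut = 0.5 * (x_sorted[i] + x_sorted[i + 1])
    return best_cut, best_phi

Applying such a search to every predictor at a node and taking the overall maximum yields the split executed in step 2 of the growing procedure.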
A least squares regression tree is constructed by recursively splitting nodes to maximize the φ function defined above. The criterion is such that we create smaller and smaller nodes of progressively increasing homogeneity. The subadditivity mentioned above is equivalent to the nonnegativity of φ; i.e., SS(t) ≥ SS(t_L) + SS(t_R). The same is true if we work with least absolute deviations. It is this inequality that is responsible for the increasing (strictly nondecreasing) homogeneity.

2.3 Determining Desirable Tree Size

It remains unresolved as to what constitutes an appropriately sized tree. Originally [in the AID program, Morgan and Sonquist (1963)] this was determined by use of stopping rules: if a node became too small, or the improvement φ(s*, t) resulting from the best split was not sufficient to surmount some preset threshold, then the node was declared terminal. This proved unsatisfactory on account of the rigidity of the thresholds. In some instances overfitting (as a consequence of setting thresholds too small) via too large a tree would occur. In others, underfitting would result from rejection of a split (owing to its improvement not exceeding the cutoff), precluding the emergence of subsequent worthwhile splits. There is an analogy here to stepwise versus subset regression in terms of failure to capture important interactions. The problem was redressed by (1) initially growing a very large tree; (2) iteratively pruning this tree all the way back up (to the root node), thereby creating a nested sequence of trees; (3) selecting the best tree from this sequence using test-sample or cross-validation estimates of error.

This procedure is detailed in CART (Chap. 3). The means for performing the pruning in step (2) is called "minimal cost-complexity pruning"; see §3.3. This paper presents some alternatives to this when no within-node cost is available [for least squares the within-node cost is just SS(·)]. Such alternatives are necessary because the minimal cost-complexity algorithm relies crucially on within-node cost. It is clear how the above procedure averts the deficiencies of stopping rules. Any potentially consequential splits have the opportunity to emerge, as the initial large tree can be grown so big as to possess terminal nodes that are pure (contain only one category in the classification context, or only one response value for regression). The usage of cross-validation or test-sample estimates is intended to ensure that a realistically sized tree is produced in relation to how much noise/sampling variability is present. This aspect is discussed more fully in Section 4.

3. Two-Sample Statistic Splitting

3.1 Motivation

Instead of gearing our splitting criteria to optimizing within-node homogeneity, we could reward splits that result in large between-node separation. The magnitude of any two-sample statistic affords such a goodness-of-split measure. Such a change constitutes more than just a rephrasing of the problem. Whilst it is (empirically) the case that splitting based on two-sample t statistics with unpooled variance estimates (Welch statistics) gives results strongly resembling those obtained from least squares splitting, there is no algebraic equivalence, and problems can be contrived where results are dissimilar.
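For comparison with the least squares sketch above, the Welch statistic mentioned here can serve as the goodness-of-split measure in the same search. The brief sketch below is an illustration only, not the paper's implementation, and assumes each daughter node contains at least two cases.

import numpy as np

def welch_split_statistic(y_left, y_right):
    """|Welch t| between candidate daughter nodes; larger means greater separation.

    Illustrative sketch: in a between-node splitting rule, the split chosen for a
    node is the one whose daughters maximize this magnitude, in place of phi(s, t).
    """
    y_left, y_right = np.asarray(y_left, float), np.asarray(y_right, float)
    n_l, n_r = len(y_left), len(y_right)
    se2 = y_left.var(ddof=1) / n_l + y_right.var(ddof=1) / n_r
    return abs(y_left.mean() - y_right.mean()) / np.sqrt(se2)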
The fact that, in all the cases analysed, splitting using Welch statistics and splitting using least squares gave comparable results supports the usage of two-sample statistics: given that the two techniques produce analogous results and least squares gives worthwhile answers, the new approach must be doing something reasonable. But why replace a proven method with one that is harder to motivate and offers no computational savings? The answer lies in the advantages provided by using two-sample rank statistics. These include all the conventional desiderata of ranks plus some additional benefits:

1. Invariance under monotone transformation of the response Y. The regression trees created by using least squares or least absolute deviations possessed such invariance only with respect to monotone transformations of the (ordered) predictors. This means, for instance, that the optimal split is the same regardless of whether we use X1 or X̃1 = g(X1) for some monotone g. If the optimal split on X1 is X1 < c, then the optimal split using X̃1 will be X̃1 < g(c) (§2.7). However, it is only through the use of two-sample rank statistics that best splits (predictor and cutoff) are preserved under monotone transformations h of Y to Ỹ = h(Y). This is clearly a worthwhile property when there is no natural response scale in which to work (see Gordon and Olshen, 1978; Anderson, 1966). The issue is especially pertinent in the context of censored regression; the censored regression setting is further emphasized by Prentice (1978). (A small numerical illustration of this invariance follows this list.)

2. Insensitivity to outliers in the response space. The use of least squares, and to a lesser extent least absolute deviations, is subject to the familiar sensitivity to extreme Y observations. This, in the regression tree setting, is not necessarily a drawback, since such outliers will be isolated into their own (single case) terminal nodes. Still, the influence on overall tree topology can be distorting, and the interpretation of splits leading to the isolation of the outlier can be problematic. Friedman (1979) regards the presence of outliers as weakening the least squares procedure by wasting splits.

3. Computational feasibility. The actual computational implementation for evaluating the multitude of competing splits is detailed in Segal's unpublished Ph.D. thesis. The easiest case is the uncensored setting, where using the Wilcoxon procedure is no more involved than using any two-sample linear rank statistic. The updating algorithm devised makes for an O(n) procedure that is as simple as the O(n) least squares algorithm. The story is not so simple when it comes to dealing with splitting based on members of the Tarone-Ware or Harrington-Fleming classes. Nevertheless, efficient algorithms can be developed with the right organization.

4. Extension to censored response. The principal motivation for changing the splitting criterion was to enable tree techniques to be used for survival data. Instead of using two-sample rank statistics for uncensored values as goodness-of-split criteria, analogues of such statistics that account for censoring are used, as described below.
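The invariance claimed in point 1 is easy to check numerically. The snippet below is an illustrative check only; scipy's ranksums is assumed here as a convenient two-sample linear rank statistic and is not the paper's software.

import numpy as np
from scipy.stats import ranksums

rng = np.random.default_rng(0)
y_left = rng.normal(0.0, 1.0, size=20)   # responses in one daughter node
y_right = rng.normal(1.0, 1.0, size=20)  # responses in the other daughter node

# A rank-based goodness of split is unchanged by a monotone transform of Y,
# because the ranks themselves are unchanged...
stat_raw, _ = ranksums(y_left, y_right)
stat_exp, _ = ranksums(np.exp(y_left), np.exp(y_right))
print(np.isclose(stat_raw, stat_exp))  # True

# ...whereas the least squares criterion phi(s, t) generally is not.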
3.2 Censored Data Rank Statistics

Before the merits or otherwise of the two-sample statistics used for censored response (Gehan, Prentice, Mantel-Haenszel, Tarone-Ware) are discussed, their form and computational implementation are described. The Tarone-Ware class of statistics derives from a sequence of 2 × 2 tables. In the survival analytic context of censored response, this sequence arises from constructing a 2 × 2 table for each distinct uncensored response:

                    Dead            Alive                    At risk
    Population 1    a_i             n_{i1} - a_i             n_{i1}
    Population 2    m_{i1} - a_i    n_{i2} - (m_{i1} - a_i)  n_{i2}
    Total           m_{i1}          n_i - m_{i1}             n_i

and the statistics have the following form:

    TW = Σ_{i=1}^{k} w_i [a_i - E_0(A_i)] / {Σ_{i=1}^{k} w_i² var_0(A_i)}^{1/2},

where A_i is the random variable corresponding to the number of deaths in population 1 for the ith table; w_i are constants used to weight the respective tables; the sum is over all k tables, i.e., all distinct uncensored observations; and the null hypothesis is that the death rates for the two populations are equal. For fixed margins the null expectations and variances are hypergeometric:

    E_0(A_i) = m_{i1} n_{i1} / n_i,
    var_0(A_i) = [m_{i1}(n_i - m_{i1}) / (n_i - 1)] (n_{i1}/n_i)(1 - n_{i1}/n_i),

where, for the ith table, n_i is the total number at risk, n_{i1} is the number at risk in population 1, and m_{i1} is the total number of deaths.

Standard specifications for the weights w_i include the following:

1. w_i = 1 gives the Mantel-Haenszel or log-rank (Peto and Peto, 1972) statistic.
2. w_i = n_i gives the Gehan (1965) statistic.
3. w_i = n_i^{1/2} gives a statistic advocated by Tarone and Ware (1977).
4. w_i = S_i* gives Prentice's (1978) generalization of the Wilcoxon, where S_i* = Π_{j≤i} n_j/(n_j + 1) is almost the Kaplan-Meier survival estimate at the ith uncensored failure time.

The Gehan statistic is subject to domination by a small number of early failures (Prentice and Marek, 1979) and hence should be used selectively. Another possibility is to use the Harrington and Fleming (1982) class, which has weights w_i = Ŝ_i^p for some fixed power p, with Ŝ now being exactly the Kaplan-Meier survival estimate. For this to be computationally feasible it would be necessary to restrict all splits to a particular p value; trying any sort of optimization on p for individual nodes would be too involved. In practice, at least for the data sets and simulations examined, the trees emerging from using differing split statistics (apart from the Gehan) are surprisingly similar. Such a finding applied in the uncensored setting when competing members of the class of linear rank statistics were used, and is also reported in CART in the classification-tree context.

The implementation of a splitting algorithm using the Tarone-Ware class is indicated by the following pseudo-code:

Splitting Algorithm

For each node
    Initialize BestStat = BestPredictor = BestSplitPoint = 0
    Sort response values and collect ties (compute m_{i1})
    Compute risk sets n_i and the Kaplan-Meier estimate for the node
    Loop over all p predictors: X1, X2, ..., Xp
        Sort the node (response and censoring indicator) with respect to the ordered X_current
        Loop over all potential split points i (i.e., step along the X_current axis)
            By cycling over the ordered response values compute a_i and n_{i1}
            Using prespecified weights w_i compute the two-sample statistic: TwoSam
            If |TwoSam| > BestStat Then
                BestStat ← |TwoSam|
                BestPredictor ← X_current
                BestSplitPoint ← i
            End If
        End
    End

Important but simple issues, such as not splitting on tied predictor values and efficient means for sorting and ranking, have not been highlighted, for clarity. There is an additional user-specified parameter that serves to regulate the degree of censoring permitted in any given node. Thus, only those splits that result in daughter nodes having a ratio of uncensored to censored observations greater than the preset threshold are examined. Actual choices for this threshold should be determined in an exploratory, problem-specific manner.
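To complement the pseudo-code, here is a minimal runnable sketch of a Tarone-Ware class statistic for two groups of possibly right-censored responses. It is an illustration under the assumptions stated in the comments, not the program used in the paper, and the weight argument covers the first three standard choices listed above.

import numpy as np

def tarone_ware_statistic(time, event, group, weight="logrank"):
    """Two-sample statistic from the Tarone-Ware class (illustrative sketch).

    time  : observed times
    event : 1 for an uncensored failure, 0 for a censored observation
    group : 1 if the case falls in population 1, 0 otherwise
    weight: 'logrank' (w_i = 1), 'gehan' (w_i = n_i) or 'tarone-ware' (w_i = sqrt(n_i))

    Ties among failure times are collected, as in the pseudo-code, and the
    hypergeometric moments are used for each 2 x 2 table.
    """
    time, event, group = map(np.asarray, (time, event, group))
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):          # distinct uncensored responses
        at_risk = time >= t
        n_i = at_risk.sum()                        # total at risk
        n_i1 = (at_risk & (group == 1)).sum()      # at risk in population 1
        m_i1 = ((time == t) & (event == 1)).sum()  # total deaths at t (ties collected)
        a_i = ((time == t) & (event == 1) & (group == 1)).sum()
        if n_i < 2:
            continue
        w_i = {"logrank": 1.0, "gehan": float(n_i),
               "tarone-ware": np.sqrt(n_i)}[weight]
        e_i = m_i1 * n_i1 / n_i
        v_i = (m_i1 * (n_i - m_i1) / (n_i - 1)) * (n_i1 / n_i) * (1 - n_i1 / n_i)
        num += w_i * (a_i - e_i)
        var += w_i ** 2 * v_i
    return num / np.sqrt(var) if var > 0 else 0.0

Evaluated for the two daughter nodes induced by each candidate cutpoint, the magnitude of this statistic plays the role of the goodness-of-split measure in the split search described in Section 2.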
4. Revised Pruning Strategies

An important difference between least squares (or least absolute deviations) splitting as outlined in Section 2 and any two-sample statistic splitting rule (Section 3) is that the former provides a within-node estimate of error, namely SS(t), the within-node sum of squares. Such is not the case for two-sample statistic splits, which afford only a measure of goodness of split. These measures cannot be decomposed to attribute a within-node error. This is consequential, since the within-node errors form a key component of the pruning algorithm developed in CART (Chap. 3). The algorithm, therefore, does not carry over to the present situation and, inasmuch as tree size is a fundamental issue, alternate approaches must be sought. The following section describes an attempt to inherit the CART algorithm and, on account of the limitations of this attack, the succeeding section details an alternative approach to pruning. This latter technique, which retains the bottom-up tactic but sacrifices cross-validation, is believed to work well.

4.1 Inheriting the CART Algorithm

In order to circumvent the problems posed by the absence of within-node loss, an attempt to revert back to the original splitting criteria was made. Specifically, least absolute deviations splitting was tried. Using least squares was not entertained because of the unstable nature of the mean when estimated from survival curves. The intention was to account for the censoring by using medians based on the Kaplan-Meier survival curve estimated for each node and then consider absolute deviations about these. Thus, the goodness-of-split criterion for a split s of node t into t_L ∪ t_R would be

    φ(s, t) = Σ_{x_n ∈ t} |y_n - ν(t)| - Σ_{x_n ∈ t_L} |y_n - ν(t_L)| - Σ_{x_n ∈ t_R} |y_n - ν(t_R)|,

where ν(·) is the Kaplan-Meier median. The best split of a node t would again be that which maximized φ. However, there were a variety of difficulties associated with this approach that rendered it useless:

(a) Computationally this method was very slow.
(b) The actual splits obtained using this criterion on simulated data with known structure were not convincing. The method did not uncover the important variables or split points.
(c) The hope behind resurrecting least absolute deviations splitting was the inheritance of the pruning algorithm used in standard CART. However, even this did not materialize. An unstated necessity for the minimal cost-complexity algorithm (§3.3) to work is that the splitting criterion be subadditive, i.e., φ(s, t) ≥ 0 for all s, t. This is easily seen to hold in the uncensored case, where ν(t) above can be taken to be any sample median for the node t. But this does not hold for censored y's and ν(·) a Kaplan-Meier median.

4.2 Bottom-Up Approaches

In view of the above failures, what was needed was an altogether different tack. It was decided to preserve the concept of initially growing a very large tree and subsequently pruning this. What was sacrificed was the selection of a particular tree from the generated sequence by cross-validation. Further, the minimal cost-complexity pruning algorithm itself was (by necessity) replaced with some new pruning schemata.

The loss of cross-validation as a selection mechanism was not tragic. While the method had performed well, its usage had several recognized flaws.
The more detracting of these include: (i) inaccuracies and instabilities of the cross-validation estimates (§8.7) and (ii) failure of the tree selected as optimal to preclude noisy splits (§8.6). Of course, criticism (ii) can be levelled at any technique, but is cited here on account of such noisy splits emerging even in a highly structured situation. Indeed, the authors of CART equally promote user selection of the right-sized tree (§§3.4.3, 6.2). This should be done in an exploratory fashion and aided by the incorporation of subject-matter knowledge. But for such user selection, the user must be provided with a tree sequence, and hopefully one that contains good candidate trees. It was to this end that the new pruning algorithms were created. Before expounding on these it is important to reiterate what is being acquired from the CART approach: protection against the deficiencies of stopping rules as highlighted in Section 2.

After several strategies were tried, the following emerged as the preferred pruning algorithm:

    Initially grow a very large tree.
    From the bottom, step up this tree, assigning to each internal node the maximum split statistic contained in the subtree of which the node under consideration is the root.
    Collect all these maxima and place them in increasing order.
    The first pruned tree of the sequence corresponds to locating the highest node in the tree possessing the smallest maximum and removing all its descendants.
    The second tree of the sequence is then obtained by reapplying this process to the first tree, and so on, until all that remains is the root node.

This procedure is illustrated in conjunction with the example in the next section. The associated output is also displayed. Essentially, each internal node is linked with the maximum split statistic contained in the subtree for which the node is the root. The pruning sequence is then determined by the order of these maxima. Selecting a tree from the sequence provided can be done by plotting maximal subtree split statistics against tree size and picking the tree corresponding to the characteristic "kink" in the curve; see §3.4.3 or Friedman (Technical Report 12, Department of Statistics, Stanford University, 1985).

In terms of computation time, the construction of the tree sequence is very much a secondary concern relative to the initial growing of the large tree. The building process requires the evaluation of many splits at each node, whereas for this particular pruning method there is one very rapid ascent of the tree (to ascertain the maximum subtree split statistic for each node, only simple comparisons, as opposed to calculations, are required), followed by subtree removal, which entails simple looping to update quantities such as the number of terminal nodes. Thus, no computational burden results from the pruning algorithm.
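The following is a rough sketch of this bottom-up pruning under an assumed tree representation; the Node class and function names are inventions for illustration, not the paper's code. Subtree maxima are obtained by a recursive pass, and at each step the highest internal node with the smallest subtree maximum has its descendants removed.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    split_stat: float = 0.0            # |two-sample statistic| of the split made at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    @property
    def is_internal(self) -> bool:
        return self.left is not None and self.right is not None

def subtree_max(node: Node) -> float:
    """Maximum split statistic contained in the subtree rooted at node."""
    if not node.is_internal:
        return float("-inf")
    return max(node.split_stat, subtree_max(node.left), subtree_max(node.right))

def prune_sequence(root: Node) -> List[int]:
    """Number of terminal nodes for each tree in the pruned sequence.

    At each step the highest internal node possessing the smallest subtree
    maximum has its descendants removed (it becomes terminal).
    """
    def internal_nodes(node, depth=0):
        if node.is_internal:
            yield depth, node
            yield from internal_nodes(node.left, depth + 1)
            yield from internal_nodes(node.right, depth + 1)

    def n_terminal(node):
        return 1 if not node.is_internal else n_terminal(node.left) + n_terminal(node.right)

    sizes = [n_terminal(root)]
    while root.is_internal:
        # smallest subtree maximum, ties broken in favour of the highest (shallowest) node
        depth, weakest = min(internal_nodes(root),
                             key=lambda dn: (subtree_max(dn[1]), dn[0]))
        weakest.left = weakest.right = None
        sizes.append(n_terminal(root))
    return sizes

Each pass through the loop produces the next tree in the nested sequence, so the returned list of terminal-node counts corresponds to a table of the kind shown for the example in the next section.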
5. Stanford Heart Transplant Data

One example for illustrating the performance of any regression technique where the response is subject to censoring is the Stanford Heart Transplant data. The parametric attacks of Miller (1976), Buckley and James (1979), and Koul, Susarla, and Van Ryzin (1981) are compared with regard to this data set by Miller and Halpern (1982). The more recent "nonparametric" treatments of Doksum and Yandell (1982), Hastie and Tibshirani (1986), and Owen (Technical Report 25, Department of Statistics, Stanford University, 1987), plus the celebrated proportional hazards model of Cox (1972), have all been tested on the Stanford data set. A fully parametric analysis on a subset of the data is presented by Aitkin, Laird, and Francis (1983).

A brief data description is now given. The response Y is log10 survival time, where the survival time is the time (in days) until death due to rejection of the transplanted heart. There are p = 2 predictors: X1, the age of the recipient, and X2, a tissue mismatch score measuring recipient and donor tissue compatibility. One hundred fifty-seven cases were analysed, there being a 35% censoring rate.

What has consistently emerged from the plethora of analyses is that age is the more significant predictor. For instance, Miller and Halpern (1982) exclude mismatch from additional examination, having found it to be insignificant in multiple regression analyses. However, Tanner and Wong (1983) find that patients possessing high mismatch and older ages are characterized by a distinctive hazard function. Further, the nonparametric approaches have revealed a cutoff value of roughly 50 years, in that the subpopulations so defined (≤50 and >50) have distinct survival characteristics.

Regression trees, using two-sample statistic splitting, were used to analyse the data. In particular, the Mantel-Haenszel statistic was used in conjunction with subtree maximal statistic pruning to produce both the tree schematic in Figure 1 and the tree sequence in Table 1. The values below the square terminal nodes in Figure 1 are Kaplan-Meier medians for that node. It is worth recording that neither the initial large tree nor the pruned tree sequence was substantially altered by using other splitting statistics from the Tarone-Ware class. In fact, the key first split was identical in the cases examined. What is immediately evident from the tree diagram is the confirmation of the previous findings. First, age clearly emerges as the more consequential predictor (but see CART §5.3.4 for an automated means of predictor ranking that overcomes possible masking; this is not an issue here since there are only two predictors). Second, the cutoff at around 50 is reflected by the value of the first split point.

[Figure 1. Tree diagram for Stanford heart transplant data; legible split labels include a cutpoint of 41 on age and 1.66 on mismatch.]

Table 1
Pruning by subtree maximal statistics, Stanford heart transplant data

    Tree number    Terminal nodes    Subtree maximal statistic
    1              14                .00
    2              13                .60
    3              12                .99
    4              11                1.17
    5              10                1.25
    6              9                 1.79
    7              6                 2.43
    8              4                 2.48
    9              2                 2.79
    10             1                 4.91

However, the analysis can proceed further. A natural first summary for a terminal node when we have censored response is the estimated Kaplan-Meier survival curve Ŝ. The program also provides the user with the possibility of extracting certain derived quantities, such as the Kaplan-Meier median Ŝ^{-1}(.5), as node summaries or predictions. Figure 2 features superimposed survival curves for each of the 4 terminal nodes. The curve (Figure 2) corresponding to node 3 of the tree schematic lies appreciably below the curves for nodes 4 and 6. Node 3 contains the >50 age group. The survival prospects are noticeably worse than for the bulk of the ≤50 group, contained in nodes 4 and 6, as would be expected. But the survival characteristics for node 7 resemble those for node 3. Node 7 contains patients who are middle-aged (41-50), as opposed to young, and who also have high tissue mismatch scores. Thus, it is not surprising that they possess equally poor survival prospects. It is the extraction of precisely such local interactions that makes the tree techniques so powerful. Whilst caution must be exercised in interpreting survival curves based on only 8 cases, Tanner and Wong (1983) identified a very similar local interaction.

[Figure 2. Kaplan-Meier curves for terminal nodes; horizontal axis SurvTime (0.0 to 3.5), curves marked for nodes 4, 6, and 7.]
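Because the terminal-node summaries here are Kaplan-Meier curves and their medians, a minimal sketch of how such node summaries might be computed is given below. The estimator is standard, but the function names and conventions (e.g., returning NaN when the median is undefined) are assumptions of this illustration, not the paper's program.

import numpy as np

def kaplan_meier(time, event):
    """Kaplan-Meier survival estimate for the cases in one node.

    Returns the distinct uncensored times and the estimated survival just
    after each of them. Illustrative sketch.
    """
    time, event = np.asarray(time, float), np.asarray(event, int)
    fail_times = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in fail_times:
        n_at_risk = np.sum(time >= t)
        d = np.sum((time == t) & (event == 1))
        s *= 1.0 - d / n_at_risk
        surv.append(s)
    return fail_times, np.array(surv)

def km_median(time, event):
    """Smallest observed failure time at which the estimate drops to 0.5 or below."""
    fail_times, surv = kaplan_meier(time, event)
    below = np.nonzero(surv <= 0.5)[0]
    return fail_times[below[0]] if below.size else np.nan  # median may be undefined

Summaries of this form underlie the curves plotted in Figure 2 and the medians reported below the terminal nodes of Figure 1.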
6. Discussion

Kalbfleisch and Prentice (1980) present a detailed analysis of mouse leukemia data to demonstrate a variety of difficulties that arise in practical problems. The particular issues, their conventional resolution, and the way in which tree structured procedures cope will form the basis of the discussion. For an actual analysis of the data using the techniques espoused here see Segal (unpublished Ph.D. thesis, Department of Statistics, Stanford University, 1986).

Kalbfleisch and Prentice (1980) formulate the practical problems encountered in terms of the following five questions, each of which is discussed in turn: (1) How should classes be formed? (2) How to deal with the multiple comparisons issue? (3) How should potential differences between Mantel-Haenszel and Gehan tests be reconciled? (4) Are sample sizes adequate for asymptotics? (5) How to cope with cases possessing missing values?

The first question pertains to Kalbfleisch and Prentice's stratification of the data. This is achieved by examining each predictor individually. For continuous predictors, division points are picked arbitrarily modulo some guidelines. These guidelines include the following recommendations: (i) three or four classes will often provide adequate resolution without unduly compromising efficiency; (ii) roughly equal sample sizes among classes should be pursued, though not at the expense of natural division points or the creation of highly unequal censoring rates.

However, there are problems associated with such rules of thumb. The consideration of predictors one at a time precludes the emergence of any interactions. The first recommendation is revealed to be inappropriate, and the second to be internally inconsistent with respect to the mouse leukemia data, in Segal's thesis. The regression tree approach has, as a central aim, the formation of meaningful classes. These are determined by the data themselves and hence are not subject to the vagaries of assumptions or guidelines. Interactions are readily recognized and no problems arise in dealing with variables of continuous, mixed, or discrete type. This contrasts with other nonparametric procedures based on smoothing that have difficulty dealing with mixed predictors.

The process of class formation is a precursor to testing for differences between classes. If a large number of classes are created and a large number of (nonindependent) tests performed, then the significance levels that can be attached are subject to the familiar degradation, namely the multiple comparisons problem. Of course, this can be accounted for by the conventional (Scheffé, Tukey, Bonferroni) methods. The problem does not exist in conjunction with regression trees, simply because no testing is performed. Whether this constitutes an asset or a liability is a contentious point. Still, more can be said on the matter. Kalbfleisch and Prentice (1980) assert that class formation on the basis of the observed mortality itself would invalidate the corresponding tests.
This is true to a certain extent, yet it does not imply that, in the regression tree setting where classes are formed on the basis of the observed mortality data, testing is not possible. What is necessitated is that the "corresponding" tests be made to conform to the process by which the classes are constructed. Thus, for instance, in using a regression tree derived from Mantel-Haenszel splitting to perform a test on the significance of any given split (and hence of different survival curves for the two associated classes), what is required is the null distribution of the maximum of the relevant number of Mantel-Haenszel statistics. Clearly this distribution is hard to get a handle on, but the simulation studies described in Segal's thesis can be used instead.

With regard to the third question, on reconciling differences, should they arise, between the Mantel-Haenszel and Gehan tests for equality of survival curves from two classes, Kalbfleisch and Prentice (1980) advise using some intermediary weighting of the tests, i.e., some member of the Tarone-Ware class. This is completely concordant with the splitting strategy used in the construction of regression trees, whereby any member of the Tarone-Ware or Harrington-Fleming classes can be used as the splitting criterion.

Again, since no formal testing is undertaken in the regression tree approach, the fourth question concerning asymptotics is somewhat moot. The recommendation of Kalbfleisch and Prentice, for situations where sample sizes are perceived to be too small to warrant recourse to asymptotic results, is to perform testing by simulating the actual null distribution of the statistic. This is precisely the strategy proposed in conjunction with regression tree methods. If censoring is independent of the predictors, an alternative simulation tactic would be to develop a permutation test. This would involve permuting the response values and corresponding censoring indicators over the cases and then recomputing the tree; a sketch of this tactic is given at the end of this section.

The only asymptotic issue to be examined in the context of tree schemata is consistency; see CART (Chap. 12) and the sequence of papers by Gordon and Olshen (1978, 1980, 1984). Under regularity conditions that do not depend on the particular splitting criteria or pruning algorithm used, consistency results are obtained for both the classification and regression problems. The regularity conditions include a growth condition on the amount of mass in each member of the partition and the requirement that the diameter of every member go to 0 in probability. For censored response data, identifiability issues arise, as indicated in Gordon and Olshen (1985), yet since there is no reliance on the particular tree construction methods used, the consistency results carry over immediately to the two-sample statistic schemes developed here.

The final question concerns how to cope with missing values. These can constitute a consequential portion of the data in medical/biological studies, and hence efficient information extraction from (as opposed to the discarding of) such cases is an important issue. The manner in which this is achieved by tree schemata is detailed in CART (Chap. 5). For an illuminating illustration of how distorted results can occur using conventional regression practices in the presence of missing data, and how tree methods overcome such problems, see Bloch and Segal (Technical Report 108, Department of Statistics, Stanford University, 1985).

The overall conclusion, then, is that the regression tree methodology developed deals very well with the five problems posed as being of practical consequence.
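As an indication of how the permutation tactic mentioned above might be organized, here is a rough sketch. It is an assumption-laden illustration rather than part of the paper: it reuses the hypothetical tarone_ware_statistic sketched in Section 3.2, assumes the predictors are held in a numeric array X, and, for brevity, permutes (response, censoring indicator) pairs and recomputes only the root-node split statistic rather than the whole tree.

import numpy as np

def root_split_statistic(X, time, event, weight="logrank"):
    """Maximal |Tarone-Ware| statistic over all candidate root-node splits.

    Relies on the illustrative tarone_ware_statistic sketched in Section 3.2;
    X is an (n cases) x (p predictors) numeric array.
    """
    best = 0.0
    for j in range(X.shape[1]):
        for c in np.unique(X[:, j])[:-1]:
            group = (X[:, j] <= c).astype(int)
            best = max(best, abs(tarone_ware_statistic(time, event, group, weight)))
    return best

def permutation_p_value(X, time, event, n_perm=199, seed=0):
    """Monte Carlo p-value for the root split under joint permutation of
    (response, censoring indicator) pairs over the cases."""
    rng = np.random.default_rng(seed)
    observed = root_split_statistic(X, time, event)
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(time))
        if root_split_statistic(X, time[perm], event[perm]) >= observed:
            count += 1
    return (count + 1) / (n_perm + 1)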
ACKNOWLEDGEMENTS

The author wishes to thank Professor Jerome Friedman and the two referees for many helpful comments.

RÉSUMÉ

The regression-tree methodology is extended to right-censored response variables by replacing the usual splitting rules with rules based on the Tarone-Ware or Harrington-Fleming classes of two-sample statistics. New pruning strategies for determining the appropriate tree size are also given. The properties of this approach are developed, and comparisons with existing procedures are discussed in terms of practical problems. An illustration of the real-world performance of the technique is presented.

REFERENCES

Aitkin, M., Laird, N., and Francis, B. (1983). A reanalysis of the Stanford Heart Transplant Data. Journal of the American Statistical Association 78, 264-274.
Anderson, T. W. (1966). Some nonparametric multivariate procedures based on statistically equivalent blocks. In Multivariate Analysis, P. R. Krishnaiah (ed.), 5-27. Orlando, Florida: Academic Press.
Breiman, L. and Friedman, J. H. (1985). Estimating optimal transformations for multiple regression and correlation. Journal of the American Statistical Association 80, 580-598.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and Regression Trees. Belmont, California: Wadsworth.
Buckley, J. and James, I. R. (1979). Linear regression with censored data. Biometrika 66, 429-436.
Ciampi, A., Chang, C.-H., Hogg, S., and McKinney, S. (1987). Recursive partition: A versatile method for exploratory data analysis in biostatistics. In Proceedings from Joshi Festschrift, G. Umphrey (ed.), 23-50. Amsterdam: North-Holland.
Cox, D. R. (1972). Regression models and life tables. Journal of the Royal Statistical Society, Series B 34, 187-202.
Doksum, K. A. and Yandell, B. S. (1982). Properties of regression estimates based on censored survival data. In Festschrift for Erich L. Lehmann, P. J. Bickel, K. A. Doksum, and J. L. Hodges, Jr. (eds), 140-156. Berkeley: University of California Press.
Friedman, J. H. (1979). A tree-structured approach to nonparametric multiple regression. In Smoothing Techniques for Curve Estimation, T. Gasser and M. Rosenblatt (eds), 5-22. Berlin: Springer-Verlag.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression. Journal of the American Statistical Association 76, 817-823.
Gehan, E. A. (1965). A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 203-223.
Goldman, L., Weinberg, M., Weisberg, M., Olshen, R., et al. (1982). A computer-derived protocol to aid in the diagnosis of emergency room patients with acute chest pain. New England Journal of Medicine 307, 588-596.
Gordon, L. and Olshen, R. A. (1978). Asymptotically efficient solutions to the classification problem. Annals of Statistics 6, 515-533.
Gordon, L. and Olshen, R. A. (1980). Consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 10, 611-627.
Gordon, L. and Olshen, R. A. (1984). Almost surely consistent nonparametric regression from recursive partitioning schemes. Journal of Multivariate Analysis 15, 147-163.
Gordon, L. and Olshen, R. A. (1985). Tree-structured survival analysis. Cancer Treatment Reports 69, 1065-1069.
Harrington, D. P. and Fleming, T. R. (1982). A class of rank test procedures for censored survival data. Biometrika 69, 553-566.
Hastie, T. J. and Tibshirani, R. J. (1986). Generalized additive models. Statistical Science 1, 297-318.
Kalbfleisch, J. D. and Prentice, R. L. (1980). The Statistical Analysis of Failure Time Data. New York: Wiley.
Koul, H., Susarla, V., and Van Ryzin, J. (1981). Regression analysis with randomly right-censored data. Annals of Statistics 9, 1276-1288.
Miller, R. G., Jr. (1976). Least squares regression with censored data. Biometrika 63, 449-464.
Miller, R. G., Jr. and Halpern, J. (1982). Regression with censored data. Biometrika 69, 521-531.
Morgan, J. N. and Sonquist, J. A. (1963). Problems in the analysis of survey data, and a proposal. Journal of the American Statistical Association 58, 415-434.
Peto, R. and Peto, J. (1972). Asymptotically efficient rank-invariant test procedures. Journal of the Royal Statistical Society, Series A 135, 185-198.
Prentice, R. L. (1978). Linear rank tests with right-censored data. Biometrika 65, 167-179.
Prentice, R. L. and Marek, P. (1979). A qualitative discrepancy between censored data rank tests. Biometrics 35, 861-867.
Stone, C. J. (1977). Consistent nonparametric regression (with Discussion). Annals of Statistics 5, 595-645.
Tanner, M. A. and Wong, W. H. (1983). Discussion of: A reanalysis of the Stanford Heart Data, by Aitkin, Laird, and Francis. Journal of the American Statistical Association 78, 286-287.
Tarone, R. E. and Ware, J. (1977). On distribution-free tests for equality of survival distributions. Biometrika 64, 156-160.

Received September 1986; revised June 1987.