/* This program uses the TREEDISC
   macro in SAS to apply a CHAID
   algorithm to the BPD data. This
   code is stored in the file
   chaidbpd.sas */

/* First set some graphics options */

/* To print postscript files in UNIX */
/*
goptions cback=white ctext=black
         targetdevice=ps300 rotate=landscape;
*/

/* To print postscript files from Windows */
goptions cback=white ctext=black
         device=WIN target=ps
         rotate=landscape ;

/* Establish formats for classification variables */
PROC FORMAT;
  VALUE SEX  0 = 'FEMALE'
             1 = 'MALE';
  VALUE SURV 0 = 'DEAD'
             1 = 'ALIVE';
  VALUE RDS  0 = 'NONE'
             1 = 'SLIGHT'
             2 = 'MODERATE'
             3 = 'SUBSTANTIAL'
             4 = 'SEVERE'
             5 = 'VERY SEVERE';
  VALUE BPD  1 = 'YES'
             2 = 'NO';

DATA SET1;
  INFILE 'c:\courses\st557\sas\bpd.dat';
  INPUT SEX 1 YOB 2-3 APGAR 4-5 GEST 6-8
        BWT 9-12 AGSYM 13-14 AGVEN 15-17
        INTUB 18-21 VENTL 22-25
        LOWO2 26-30 MEDO2 31-34
        HIO2 35-38 SURV 39 DSURV 40-44
        RDS 45 BPDHI 46-47;
  IF BPDHI>20 THEN BPD=1; ELSE BPD=2;
  LABEL YOB   = YEAR OF BIRTH
        APGAR = APGAR SCORE
        GEST  = GESTATIONAL AGE (WEEKSx10)
        BWT   = BIRTH WEIGHT (GRAMS)
        AGSYM = AGE AT ONSET OF RESPIRATORY SYMPTOMS (HRSx10)
        AGVEN = AGE WHEN VENTILATORY ASSISTANCE BEGAN (HRS)
        INTUB = DURATION OF ENDOTRACHEAL INTUBATION (HRS)
        VENTL = DURATION OF ASSISTED VENTILATION (HRS)
        LOWO2 = EXPOSURE TO 22-39% OXYGEN (HRS)
        MEDO2 = EXPOSURE TO 40-79% OXYGEN (HRS)
        HIO2  = EXPOSURE TO 80-100% OXYGEN (HRS)
        SURV  = SURVIVAL AS OF 5/1/1975
        DSURV = DURATION OF SURVIVAL AS OF 5/1/75 (HRS)
        RDS   = SEVERITY OF RDS
        BPDHI = BPD ASSESSMENT
        BPD   = BPD INDICATOR;
  /* Retain only babies who lived at
     least 72 hours */
  IF(DSURV > 72);
run;

/* Load in the xmacros file */
%inc 'c:\courses\st557\sas\xmacro.sas';

/* Load in the TREEDISC macro */
%inc 'c:\courses\st557\sas\treedisc.sas';

/* Compute a tree for predicting BPD incidence
   from exposure to elevated levels of oxygen,
   use of ventilation, and other explanatory
   variables */
%treedisc(data=set1, depvar=bpd,
          nominal=rds:,
          ordinal=ventl: lowo2: medo2: hio2:,
          outtree=trd, options=noformat,
          trace=long);

/* Draw the tree on one page */
%treedisc(intree=trd, draw=graphics);

/* Draw a larger tree on several pages */
goptions cback=white ctext=black
         device=WIN target=ps rotate=portrait;
%treedisc(intree=trd, draw=graphics, pos=120 120);


All infants                               BPD:78   no BPD:170
  Med O2 < 157 hrs                        BPD:21   no BPD:152
    High O2 ≤ 35 hrs                      BPD:2    no BPD:78
    36 ≤ High O2 ≤ 159                    BPD:7    no BPD:64
    High O2 > 160 hrs                     BPD:12   no BPD:10
  157 ≤ Med O2 ≤ 450                      BPD:22   no BPD:18
    Low O2 < 170 hrs                      BPD:5    no BPD:14
      High O2 ≤ 90 hrs                    BPD:2    no BPD:14
      High O2 > 90 hrs                    BPD:3    no BPD:0
    Low O2 > 170 hrs                      BPD:17   no BPD:4
  Med O2 > 450 hrs                        BPD:35   no BPD:0


Summary:

- Keep exposure to the medium O2 levels below 450 hours
- When exposure to the medium O2 levels is between 157 and 450 hours, keep
    - exposure to low O2 levels below 170 hours
    - exposure to high O2 levels below 90 hours
- When exposure to the medium level of oxygen is below 157 hours, keep exposure to high O2 levels below 160 hours


The Algorithm: Classification and Regression Trees (CART)

- Breiman, et al. (1984), Classification and Regression Trees, Wadsworth.
- CART software sold by Salford Systems, CA. Available in the SAS data mining package.
- Chambers and Hastie (1992), Statistical Models in S, (Chapter 9), Wadsworth.
- tree( ) and related functions in Splus.


Binary Splits

            All cases
           /         \
   X7 ≤ 18.3         X7 > 18.3

- Build a larger tree than you really need and prune it back
    - cross validation
    - validation sample
- Can have
    - continuous variables
    - nominal (categorical) variables
- Missing data?
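
The idea of a binary split can be illustrated with a minimal sketch. It lists the candidate two-way splits for one continuous variable and for one nominal variable. The function names are hypothetical, and placing cut points midway between consecutive distinct values is an assumption made for illustration, not something stated on the slides.

```python
from itertools import combinations


def candidate_cut_points(values):
    """Midpoints between consecutive distinct values (assumed convention)."""
    distinct = sorted(set(values))
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


def binary_split(values, cut):
    """Send each case left if value <= cut, right otherwise; return case indices."""
    left = [i for i, v in enumerate(values) if v <= cut]
    right = [i for i, v in enumerate(values) if v > cut]
    return left, right


def nominal_splits(levels):
    """List every two-way grouping of the levels once.

    The first level is always kept in the left group so that a grouping
    and its mirror image are not both listed.
    """
    levels = sorted(set(levels))
    if not levels:
        return []
    first, rest = levels[0], levels[1:]
    splits = []
    for r in range(len(rest)):  # r == len(rest) would leave the right group empty
        for extra in combinations(rest, r):
            left = {first, *extra}
            right = set(levels) - left
            splits.append((left, right))
    return splits


# Made-up values for a variable X7, chosen so one cut falls near 18.3
x7 = [12.0, 18.1, 18.5, 25.0]
for cut in candidate_cut_points(x7):
    left, right = binary_split(x7, cut)
    print(f"X7 <= {cut:.2f}: {len(left)} cases left, {len(right)} cases right")
```

A continuous variable with m distinct values gives m - 1 candidate cuts under this convention, while a nominal variable with L levels gives 2^(L-1) - 1 groupings, which is why nominal variables with many levels are expensive to search.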
Determination of a split:

                Deviance
               /        \
    DevianceLeft        DevianceRight

Choose
  (i) an explanatory variable
  (ii) boundary (or cut point)
that maximizes the change in deviance

\[ \Delta\text{deviance} = \text{deviance} - (\text{deviance}_{\text{Left}} + \text{deviance}_{\text{Right}}) \]


For a discrete response (at the k-th node):

\[ \text{Deviance} = -2 n_k \sum_{i=1}^{I} P_{ik} \log(P_{ik}) \]

where $P_{ik}$ is the proportion of cases at the k-th node in the i-th response category.

For a continuous response (at the k-th node):

\[ \text{Deviance} = \sum_{j=1}^{n_k} (y_{jk} - \bar{y}_k)^2 \]


Stop splitting if

- node deviance is less than some fraction of root node deviance (say < 1%).
- number of cases at the node is too small (say < 10 cases).

Use mindev and minsize in tree.control( )

Prune the tree:

- Crossvalidation
- Validation sample


# Use the tree function in Splus to
# build a classification tree. This
# file is stored as bpdtree.ssc

# Use the bpdsp3.dat data file which
# does not have missing values. The tree
# function in Splus cannot handle missing
# values. The APGAR score variable was
# not included because it contained too
# many missing values. The tree is pruned
# with the prune function in Splus

# Enter the data as a matrix

treec.dat <- matrix(scan("bpdsp3.dat"),
                    ncol=17,byrow=T)

# Select the columns to be used in the
# analysis and define row and column labels

ID<-treec.dat[,1]
treec.dat <- treec.dat[, c(2:12,15,17)]
dimnames(treec.dat)<-list(ID,c("Sex","Year",
        "Gage","Bweight","AgeSYM","AgeVEN",
        "Intub","Ventl","LowO2",
        "MedO2","HiO2","RDS","BPD"))
treec.dat<-data.frame(treec.dat)

# Compute the classification tree and put
# it in the file trees.out

ftree <- formula(BPD~Sex+Year+Gage+Bweight+
            AgeSYM+AgeVEN+Intub+Ventl+
            LowO2+MedO2+HiO2+RDS)

trees.out<-tree(ftree,treec.dat)

# Print a summary description of the tree

summary(trees.out)

# Print a description of what happens at
# each node of the tree

print(trees.out)

# Display the tree. Unix users should
# first use the motif( ) function to open
# a graphics window.

plot(trees.out)
text(trees.out)

# Use crossvalidation provided by the
# cv.tree function to determine how to
# prune the tree

k <- seq(.05,2.0, length=20)
trees.cv <- cv.tree(trees.out, k=k)
plot(trees.cv, type='b')

# The resulting plot suggests that the tree
# should be pruned at about 6 nodes using k=.5

# Display the pruned tree

treep.out <- prune.tree(tree = trees.out, k=.5)
summary(treep.out)
plot(treep.out)
text(treep.out)


Description of the tree created for the BPD data. Each line provides the information for one node in the tree. The information is presented in the following order:

[1] node number
[2] decision rule for going left at this node
[3] Number of sample cases that reached this node
[4] value of the deviance
[5] average value for the response: in this case it is the proportion of babies with BPD at this node
[6] An * denotes a terminal node

node), split, n, deviance, yval
      * denotes terminal node

  1) root 242 52.5000 0.31820
    2) MedO2<183 182 22.2900 0.14290
      4) HiO2<159.5 159 11.9400 0.08176
        8) LowO2<527.5 146 6.6640 0.04795
         16) Ventl<146 106 0.0000 0.00000 *
         17) Ventl>146 40 5.7750 0.17500
           34) MedO2<158 35 3.5430 0.11430
             68) Bweight<2239.5 28 0.9643 0.03571
              136) Year<69.5 5 0.8000 0.20000 *
              137) Year>69.5 23 0.0000 0.00000 *
             69) Bweight>2239.5 7 1.7140 0.42860 *
           35) MedO2>158 5 1.2000 0.60000 *
        9) LowO2>527.5 13 3.2310 0.46150
         18) Gage<305 6 0.0000 0.00000 *
         19) Gage>305 7 0.8571 0.85710 *
      5) HiO2>159.5 23 5.6520 0.56520
       10) MedO2<71.5 17 4.1180 0.41180
         20) Gage<342.5 10 2.4000 0.60000
           40) Year<67.5 5 0.8000 0.80000 *
           41) Year>67.5 5 1.2000 0.40000 *
         21) Gage>342.5 7 0.8571 0.14290 *
       11) MedO2>71.5 6 0.0000 1.00000 *
    3) MedO2>183 60 7.6500 0.85000
      6) Ventl<220.5 13 3.2310 0.46150
       12) HiO2<26.5 8 0.8750 0.12500 *
       13) HiO2>26.5 5 0.0000 1.00000 *
      7) Ventl>220.5 47 1.9150 0.95740
       14) Intub<430.5 14 1.7140 0.85710
         28) Gage<330 5 1.2000 0.60000 *
         29) Gage>330 9 0.0000 1.00000 *
       15) Intub>430.5 33 0.0000 1.00000 *

[Figure: plot of the unpruned tree trees.out, with each split labeled by its rule and each terminal node labeled by its proportion of babies with BPD.]

[Figure: cross-validation plot from cv.tree: deviance (vertical axis, about 28 to 34) against tree size (horizontal axis, 5 to 15), with the values of k (0.05 to 1.30) along the top axis.]

[Figure: plot of the pruned tree treep.out. Splits shown: MedO2<183, HiO2<159.5, Ventl<220.5, HiO2<26.5, LowO2<527.5, MedO2<71.5, Gage<342.5, Ventl<146, Gage<305, MedO2<158, Bweight<2239.5.]


Summary

- Do not exceed 160 hours of exposure to the high oxygen levels
- Do not exceed 183 hours of exposure to the medium oxygen levels
- Do not exceed 528 hours of exposure to the lower oxygen levels
- Do not ventilate for more than 146 hours
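
As a rough check on the deviance column of the tree listing, the sketch below codes both deviance formulas and the change in deviance for the root split on MedO2<183. The deviances in the listing appear to follow the continuous-response (sum of squares) formula applied to a 0/1 BPD indicator, which reduces to n p (1 - p). The function names are hypothetical, and the BPD counts (77 of 242, 26 of 182, 51 of 60) are not stated in the source: they are inferred here by multiplying n by yval for nodes 1, 2 and 3.

```python
import math


def ss_deviance(y):
    """Continuous-response deviance: sum of squared deviations from the node mean."""
    mean = sum(y) / len(y)
    return sum((v - mean) ** 2 for v in y)


def multinomial_deviance(counts):
    """Discrete-response deviance: -2 * n_k * sum_i P_ik * log(P_ik).

    Empty categories are skipped, since P * log(P) tends to 0 as P tends to 0.
    """
    n = sum(counts)
    return -2.0 * n * sum((c / n) * math.log(c / n) for c in counts if c > 0)


def change_in_deviance(parent, left, right, deviance):
    """deviance(parent) - (deviance(left) + deviance(right))."""
    return deviance(parent) - (deviance(left) + deviance(right))


def node(n, n_bpd):
    """Build a 0/1 response vector for a node (counts are inferred, see above)."""
    return [1] * n_bpd + [0] * (n - n_bpd)


root = node(242, 77)    # node 1: n = 242, yval = 0.31820
left = node(182, 26)    # node 2: MedO2<183, n = 182, yval = 0.14290
right = node(60, 51)    # node 3: MedO2>183, n = 60, yval = 0.85000

print(round(ss_deviance(root), 4))    # 52.5, as in the listing
print(round(ss_deviance(left), 2))    # 22.29
print(round(ss_deviance(right), 2))   # 7.65
print(round(change_in_deviance(root, left, right, ss_deviance), 2))  # 22.56
```

Passing count vectors such as [77, 165] together with multinomial_deviance to change_in_deviance gives the same comparison under the discrete-response formula; those values are on a different scale and will not match the listing.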