/* This program uses the TREEDISC
   macro in SAS to apply a CHAID
   algorithm to the BPD data. This
   code is stored in the file
   chaidbpd.sas */

/* First set some graphics options */

/* To print postscript files in UNIX */
/*
goptions cback=white ctext=black
   targetdevice=ps300 rotate=landscape;
*/

/* To print postscript files from Windows */
goptions cback=white ctext=black
   device=WIN target=ps
   rotate=landscape ;

/* Establish formats for
   classification variables */
PROC FORMAT;
  VALUE SEX  0 = 'FEMALE'
             1 = 'MALE';
  VALUE SURV 0 = 'DEAD'
             1 = 'ALIVE';
  VALUE RDS  0 = 'NONE'
             1 = 'SLIGHT'
             2 = 'MODERATE'
             3 = 'SUBSTANTIAL'
             4 = 'SEVERE'
             5 = 'VERY SEVERE';
  VALUE BPD  1 = 'YES'
             2 = 'NO';

DATA SET1;
  INFILE 'c:\courses\st557\sas\bpd.dat';
  INPUT SEX 1 YOB 2-3 APGAR 4-5 GEST 6-8
        BWT 9-12 AGSYM 13-14 AGVEN 15-17
        INTUB 18-21 VENTL 22-25
        LOWO2 26-30 MEDO2 31-34
        HIO2 35-38 SURV 39 DSURV 40-44
        RDS 45 BPDHI 46-47;
  IF BPDHI>20 THEN BPD=1; ELSE BPD=2;
  LABEL YOB   = 'YEAR OF BIRTH'
        APGAR = 'APGAR SCORE'
        GEST  = 'GESTATIONAL AGE (WEEKSx10)'
        BWT   = 'BIRTH WEIGHT (GRAMS)'
        AGSYM = 'AGE AT ONSET OF RESPIRATORY SYMPTOMS (HRSx10)'
        AGVEN = 'AGE WHEN VENTILATORY ASSISTANCE BEGAN (HRS)'
        INTUB = 'DURATION OF ENDOTRACHEAL INTUBATION (HRS)'
        VENTL = 'DURATION OF ASSISTED VENTILATION (HRS)'
        LOWO2 = 'EXPOSURE TO 22-39% OXYGEN (HRS)'
        MEDO2 = 'EXPOSURE TO 40-79% OXYGEN (HRS)'
        HIO2  = 'EXPOSURE TO 80-100% OXYGEN (HRS)'
        SURV  = 'SURVIVAL AS OF 5/1/1975'
        DSURV = 'DURATION OF SURVIVAL AS OF 5/1/75 (HRS)'
        RDS   = 'SEVERITY OF RDS'
        BPDHI = 'BPD ASSESSMENT'
        BPD   = 'BPD INDICATOR';
  /* Retain only babies who lived at
     least 72 hours */
  IF(DSURV > 72);
run;

/* Load in the xmacros file */
%inc 'c:\courses\st557\sas\xmacro.sas';

/* Load in the TREEDISC macro */
%inc 'c:\courses\st557\sas\treedisc.sas';

/* Compute a tree for predicting BPD
   incidence from exposure to elevated
   levels of oxygen, use of ventilation,
   and other explanatory variables */
%treedisc(data=set1, depvar=bpd, nominal=rds:,
          ordinal=ventl: lowo2: medo2: hio2:,
          outtree=trd, options=noformat,
          trace=long);

/* Draw the tree on one page */
%treedisc(intree=trd, draw=graphics);

/* Draw a larger tree on several
   pages */
goptions cback=white ctext=black
   device=WIN target=ps rotate=portrait;
%treedisc(intree=trd,
          draw=graphics, pos=120 120);

All infants (BPD: 78, no BPD: 170)
    Med O2 < 157 hrs (BPD: 21, no BPD: 152)
        High O2 ≤ 35 hrs (BPD: 2, no BPD: 78)
        36 ≤ High O2 ≤ 159 (BPD: 7, no BPD: 64)
        High O2 > 160 hrs (BPD: 12, no BPD: 10)
    157 ≤ Med O2 ≤ 450 (BPD: 22, no BPD: 18)
        Low O2 < 170 hrs (BPD: 5, no BPD: 14)
            High O2 ≤ 90 hrs (BPD: 2, no BPD: 14)
            High O2 > 90 hrs (BPD: 3, no BPD: 0)
        Low O2 > 170 hrs (BPD: 17, no BPD: 4)
    Med O2 > 450 hrs (BPD: 35, no BPD: 0)

Summary:

  - Keep exposure to the medium O2 levels below 450 hours
  - When exposure to the medium O2 levels is between 157 and 450 hours, keep
      - exposure to low O2 levels below 170 hours
      - exposure to high O2 levels below 90 hours
  - When exposure to the medium level of oxygen is below 157 hours,
    keep exposure to high O2 levels below 160 hours
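
To see how sharply the BPD rate differs between terminal nodes, the counts in the CHAID tree can be tabulated directly. The following is a minimal Python sketch, used here only for illustration; the names `LEAVES`, `bpd_rate`, and `ranked_leaves` are hypothetical and are not part of the TREEDISC macro or the course code. The counts are copied from the tree; the dictionary keys are informal labels.

```python
# (BPD, no BPD) counts for each terminal node, copied from the CHAID tree.
# The keys are informal labels chosen for this sketch, not TREEDISC output.
LEAVES = {
    "Med O2 < 157, High O2 <= 35": (2, 78),
    "Med O2 < 157, 36 <= High O2 <= 159": (7, 64),
    "Med O2 < 157, High O2 > 160": (12, 10),
    "157 <= Med O2 <= 450, Low O2 < 170, High O2 <= 90": (2, 14),
    "157 <= Med O2 <= 450, Low O2 < 170, High O2 > 90": (3, 0),
    "157 <= Med O2 <= 450, Low O2 > 170": (17, 4),
    "Med O2 > 450": (35, 0),
}


def bpd_rate(bpd, no_bpd):
    """Proportion of infants with BPD at a node."""
    return bpd / (bpd + no_bpd)


def ranked_leaves(leaves=LEAVES):
    """Terminal-node labels sorted from highest to lowest BPD rate."""
    return sorted(leaves, key=lambda name: bpd_rate(*leaves[name]), reverse=True)


# The terminal nodes should add back up to the root node counts.
TOTAL_BPD = sum(bpd for bpd, _ in LEAVES.values())
TOTAL_NO_BPD = sum(no_bpd for _, no_bpd in LEAVES.values())

if __name__ == "__main__":
    print("root check:", TOTAL_BPD, "BPD,", TOTAL_NO_BPD, "no BPD")
    for name in ranked_leaves():
        bpd, no_bpd = LEAVES[name]
        print(f"{name:50s} {bpd:3d} of {bpd + no_bpd:3d}  rate {bpd_rate(bpd, no_bpd):.3f}")
```

The terminal-node counts sum to the 78 BPD and 170 no-BPD infants at the root, which is a quick consistency check on the tree as transcribed.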

Classification and Regression Trees
(CART)

  Breiman, et al. (1984),
  Classification and Regression
  Trees, Wadsworth.
    - CART software sold by Salford Systems, CA.
    - Available in the SAS data mining package

  Chambers and Hastie (1992), Statistical Models in S, (Chapter 9),
  Wadsworth.
    - tree( ) and related functions in Splus.

The Algorithm:

  Binary Splits

               All cases
              /         \
      X7 ≤ 18.3       X7 > 18.3

  Build a larger tree than you really
  need and prune it back
    - cross validation
    - validation sample

  Can have
    - continuous variables
    - nominal (categorical) variables

  Missing data?

Determination of a split:

                 Deviance
                /        \
     DevianceLeft        DevianceRight

Choose
  (i) an explanatory variable
  (ii) boundary (or cut point)
that maximizes the change in deviance

$$\Delta\,\text{deviance} = \text{deviance} - (\text{deviance}_{\text{Left}} + \text{deviance}_{\text{Right}})$$

For a discrete response (at the k-th
node):

$$\text{Deviance} = -2\, n_k \sum_{i=1}^{I} P_{ik} \log(P_{ik})$$

where $P_{ik}$ is the proportion of cases at
the k-th node in the i-th response category.

For a continuous response (at the k-th
node):

$$\text{Deviance} = \sum_{j=1}^{n_k} (y_{jk} - \bar{y}_k)^2$$
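
The two deviance formulas and the search for a cut point can be sketched in a few lines. This is a minimal Python illustration, not the Splus or CART implementation: the function names are hypothetical, candidate cut points are taken (as an assumption) to be midpoints between adjacent observed values, and only a single explanatory variable with a continuous-response deviance is searched.

```python
import math


def discrete_deviance(counts):
    """-2 * n_k * sum_i P_ik * log(P_ik), from the category counts at one node."""
    n = sum(counts)
    # Empty categories contribute nothing (P log P -> 0 as P -> 0).
    return -2.0 * n * sum((c / n) * math.log(c / n) for c in counts if c > 0)


def continuous_deviance(values):
    """Sum of squared deviations from the node mean."""
    mean = sum(values) / len(values)
    return sum((y - mean) ** 2 for y in values)


def best_cut(x, y):
    """Return (cut, change) maximizing deviance - (devianceLeft + devianceRight).

    Assumption for this sketch: candidate cuts are midpoints between
    adjacent distinct values of x, and cases with x < cut go left.
    """
    parent = continuous_deviance(y)
    best = (None, 0.0)
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        cut = (lo + hi) / 2.0
        left = [yi for xi, yi in zip(x, y) if xi < cut]
        right = [yi for xi, yi in zip(x, y) if xi >= cut]
        change = parent - (continuous_deviance(left) + continuous_deviance(right))
        if change > best[1]:
            best = (cut, change)
    return best


if __name__ == "__main__":
    # Toy data: the response switches from 0 to 1 between x = 2 and x = 3.
    print(best_cut([1, 2, 3, 4], [0, 0, 1, 1]))
    # A node with 77 ones and 165 zeros, treated as a continuous response.
    print(continuous_deviance([1] * 77 + [0] * 165))
```

In a full tree-growing routine the same search would be repeated over every explanatory variable, and the variable and cut point with the largest change in deviance would define the split.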

Stop splitting if
  - node deviance is less than some fraction
    of root node deviance (say < 1%).
  - number of cases at the node is too
    small (say < 10 cases).
Use mindev and minsize in
tree.control( )

Prune the tree:
  - Cross-validation
  - Validation sample

# Use the tree function in Splus to
# build a classification tree. This
# file is stored as bpdtree.ssc

# Use the bpdsp3.dat data file which
# does not have missing values. The tree
# function in Splus cannot handle missing
# values. The APGAR score variable was
# not included because it contained too
# many missing values. The tree is pruned
# with the prune function in Splus

# Enter the data as a matrix

treec.dat <- matrix(scan("bpdsp3.dat"),
                    ncol=17,byrow=T)

# Select the columns to be used in the
# analysis and define row and column labels

ID<-treec.dat[,1]
treec.dat <- treec.dat[, c(2:12,15,17)]
dimnames(treec.dat)<-list(ID,c("Sex","Year",
          "Gage","Bweight","AgeSYM","AgeVEN",
          "Intub","Ventl","LowO2",
          "MedO2","HiO2","RDS","BPD"))
treec.dat<-data.frame(treec.dat)

# Compute the classification tree and put
# it in the file trees.out

ftree <- formula(BPD~Sex+Year+Gage+Bweight+
                 AgeSYM+AgeVEN+Intub+Ventl+
                 LowO2+MedO2+HiO2+RDS)
trees.out<-tree(ftree,treec.dat)

# Print a summary description of the tree

summary(trees.out)

# Print a description of what happens at
# each node of the tree

print(trees.out)

# Display the tree. Unix users should
# first use the motif( ) function to open
# a graphics window.

plot(trees.out)
text(trees.out)

# Use crossvalidation provided by the
# cv.tree function to determine how to
# prune the tree

k <- seq(.05,2.0, length=20)
trees.cv <- cv.tree(trees.out, k=k)
plot(trees.cv, type='b')

# The resulting plot suggests that the tree
# should be pruned at about 6 nodes using k=.5
# Display the pruned tree

treep.out <- prune.tree(tree = trees.out, k=.5)
summary(treep.out)
plot(treep.out)
text(treep.out)

Description of the tree created for
the BPD data. Each line provides the
information for one node in the tree.
The information is presented in the
following order:

  [1] node number
  [2] decision rule for going left
      at this node
  [3] Number of sample cases that
      reached this node
  [4] value of the deviance
  [5] average value for the response:
      in this case it is the
      proportion of babies with
      BPD at this node
  [6] An * denotes a terminal node

node), split, n, deviance, yval
      * denotes terminal node

  1) root 242 52.5000 0.31820
    2) MedO2<183 182 22.2900 0.14290
      4) HiO2<159.5 159 11.9400 0.08176
        8) LowO2<527.5 146  6.6640 0.04795
         16) Ventl<146 106  0.0000 0.00000 *
         17) Ventl>146 40  5.7750 0.17500
           34) MedO2<158 35  3.5430 0.11430
             68) Bweight<2239.5 28  0.9643 0.03571
              136) Year<69.5 5  0.8000 0.20000 *
              137) Year>69.5 23  0.0000 0.00000 *
             69) Bweight>2239.5 7  1.7140 0.42860 *
           35) MedO2>158 5  1.2000 0.60000 *
        9) LowO2>527.5 13  3.2310 0.46150
         18) Gage<305 6  0.0000 0.00000 *
         19) Gage>305 7  0.8571 0.85710 *
      5) HiO2>159.5 23  5.6520 0.56520
       10) MedO2<71.5 17  4.1180 0.41180
         20) Gage<342.5 10  2.4000 0.60000
           40) Year<67.5 5  0.8000 0.80000 *
           41) Year>67.5 5  1.2000 0.40000 *
         21) Gage>342.5 7  0.8571 0.14290 *
       11) MedO2>71.5 6  0.0000 1.00000 *
    3) MedO2>183 60  7.6500 0.85000
      6) Ventl<220.5 13  3.2310 0.46150
       12) HiO2<26.5 8  0.8750 0.12500 *
       13) HiO2>26.5 5  0.0000 1.00000 *
      7) Ventl>220.5 47  1.9150 0.95740
       14) Intub<430.5 14  1.7140 0.85710
         28) Gage<330 5  1.2000 0.60000 *
         29) Gage>330 9  0.0000 1.00000 *
       15) Intub>430.5 33  0.0000 1.00000 *
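
The deviance column can be checked against the n and yval columns. For a response coded 0/1 with mean p over n cases, the sum of squared deviations from the mean equals n p (1 - p), so the printed values appear to follow the continuous-response deviance applied to a 0/1 BPD indicator. The Python sketch below is illustrative only: the function names are hypothetical, and only a few rows of the printed tree are copied in.

```python
# (node, n, deviance, yval) rows copied from the printed tree.
ROWS = [
    (1, 242, 52.5000, 0.31820),
    (2, 182, 22.2900, 0.14290),
    (3, 60, 7.6500, 0.85000),
    (4, 159, 11.9400, 0.08176),
    (5, 23, 5.6520, 0.56520),
    (7, 47, 1.9150, 0.95740),
]


def binary_sum_of_squares(n, p):
    """Sum of squared deviations for n values coded 0/1 with mean p."""
    return n * p * (1.0 - p)


def max_discrepancy(rows=ROWS):
    """Largest gap between n*p*(1-p) and the printed deviance."""
    return max(abs(binary_sum_of_squares(n, p) - dev) for _, n, dev, p in rows)


def bpd_cases(n, p):
    """Number of BPD cases implied by a node's n and yval."""
    return round(n * p)


if __name__ == "__main__":
    for node, n, dev, p in ROWS:
        print(node, n, dev, round(binary_sum_of_squares(n, p), 4), bpd_cases(n, p))
    print("largest discrepancy:", round(max_discrepancy(), 4))
```

The small discrepancies come from the rounding of yval in the printout. The implied case counts also add up across a split: the two children of the root account for all of the BPD cases at the root.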

[Figure: unpruned tree from plot(trees.out) and text(trees.out); the root split is MedO2<183, followed by HiO2<159.5 on the left branch and Ventl<220.5 on the right branch.]

[Figure: cross-validation plot from plot(trees.cv, type='b'); deviance (vertical axis, about 28 to 34) against tree size (horizontal axis, 5 to 15), with the corresponding values of k (0.05 to 1.30) along the top axis.]

[Figure: pruned tree from plot(treep.out) and text(treep.out); the root split is MedO2<183, followed by HiO2<159.5 and Ventl<220.5, with twelve terminal nodes labeled by the proportion of babies with BPD.]

Summary

  - Do not exceed 160 hours of exposure to the high oxygen levels
  - Do not exceed 183 hours of exposure to the medium oxygen
    levels
  - Do not exceed 528 hours of exposure to the lower oxygen levels
  - Do not ventilate for more than
    146 hours
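
Pruning with a cost-complexity parameter k can be illustrated on the node deviances printed for the full tree. The Python sketch below is not the Splus prune.tree code; it assumes that pruning chooses the subtree minimizing (total terminal-node deviance + k × number of terminal nodes), and it relies on the numbering convention visible in the printout, where the children of node i are nodes 2i and 2i+1. The names `DEVIANCE` and `prune` are hypothetical.

```python
# node number -> deviance, copied from the printed tree.
DEVIANCE = {
    1: 52.5, 2: 22.29, 3: 7.65,
    4: 11.94, 5: 5.652, 6: 3.231, 7: 1.915,
    8: 6.664, 9: 3.231, 10: 4.118, 11: 0.0,
    12: 0.875, 13: 0.0, 14: 1.714, 15: 0.0,
    16: 0.0, 17: 5.775, 18: 0.0, 19: 0.8571,
    20: 2.4, 21: 0.8571, 28: 1.2, 29: 0.0,
    34: 3.543, 35: 1.2, 40: 0.8, 41: 1.2,
    68: 0.9643, 69: 1.714, 136: 0.8, 137: 0.0,
}


def prune(node, k, dev=DEVIANCE):
    """Return (cost, leaves) for the best subtree rooted at `node`.

    cost = sum of terminal-node deviances + k * (number of terminal nodes).
    A split is kept only if it gives a strictly lower cost than
    collapsing the node into a single terminal node.
    """
    as_leaf = (dev[node] + k, [node])
    left, right = 2 * node, 2 * node + 1
    if left not in dev:  # already a terminal node in the full tree
        return as_leaf
    left_cost, left_leaves = prune(left, k, dev)
    right_cost, right_leaves = prune(right, k, dev)
    if left_cost + right_cost < as_leaf[0]:
        return (left_cost + right_cost, left_leaves + right_leaves)
    return as_leaf


if __name__ == "__main__":
    cost, leaves = prune(1, 0.5)
    print("terminal nodes kept with k = 0.5:", sorted(leaves))
    print("number of terminal nodes:", len(leaves))
```

Under these assumptions, k = 0.5 collapses the splits below nodes 7, 20, and 68 and keeps twelve terminal nodes, while k = 0 keeps all sixteen terminal nodes of the full tree.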