Keith Kohrs
STA 705
Project
The standard Greenwood formula discussed in class gives the variance of the estimated survival function
for data that may be right censored. However, we do not have an analogous formula for the variance of
the MLE of the survival function when the data may be left, right, or interval censored. Interval-censored
observations in particular can occur frequently in a trial: the event, possibly death, may occur between
scheduled visits to the physician, so that only an interval containing the event time is known. Since
there is no “Greenwood formula” in this case, we can instead bootstrap the survival function at a given
time point to obtain a variance estimate. Because this approach can handle any form of censored data, it
is the aim of my project.
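For reference, the two variance estimates being contrasted can be written out as follows. The notation is not from the original text and is introduced here only for illustration: t_i are the distinct event times, d_i the number of events at t_i, n_i the number at risk just before t_i, and \hat S^*_b(t) the MLE computed from the b-th bootstrap sample. Whether the program divides by B - 1 or by B is not stated, so the divisor below is an assumption.

```latex
% Greenwood's formula (right-censored data, Kaplan-Meier estimate)
\widehat{\mathrm{Var}}\bigl[\hat S(t)\bigr]
  = \hat S(t)^2 \sum_{i:\, t_i \le t} \frac{d_i}{n_i\,(n_i - d_i)}

% Bootstrap variance estimate at a fixed time t (any censoring pattern)
\widehat{\mathrm{Var}}_{\mathrm{boot}}\bigl[\hat S(t)\bigr]
  = \frac{1}{B-1} \sum_{b=1}^{B} \bigl(\hat S^{*}_{b}(t) - \bar S^{*}(t)\bigr)^2,
\qquad
\bar S^{*}(t) = \frac{1}{B} \sum_{b=1}^{B} \hat S^{*}_{b}(t)
```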
The EM algorithm is employed to obtain the MLE of the survival function. This is done with the EMICM
function from the Icens package in R, which was chosen for the bootstrap because of its speed. Each
bootstrap sample is drawn by re-sampling n times, with replacement, from the original n data points (in
this case intervals). The program accepts input in various forms and bootstraps the MLE B times for a
given vector of times t. It returns the B bootstrap values of the MLE at each time point, together with
the variance estimate at each point.
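The procedure described above can be sketched in a few lines. This is a minimal illustration in Python, not the project's R code: it replaces EMICM with a plain self-consistency (EM) iteration, treats every observation as a closed interval, and evaluates S(t) by counting only the mass of support intervals that end at or before t. All function names (innermost_intervals, npmle, survival, bootstrap_variance) are hypothetical, and the defaults for B, the tolerance, and the seed are arbitrary choices.

```python
import random
import statistics

INF = float("inf")

# Dataset A from the text, as closed intervals [left, right].
DATA_A = [(82, 82), (77, INF), (77, 79), (82, 82), (82, 82),
          (73, 73), (83, 83), (79, 79), (84, INF), (79, 79)]


def innermost_intervals(data):
    """Candidate support: [l, r] with l a left endpoint, r the nearest right
    endpoint >= l, and no other left endpoint in (l, r]."""
    lefts = sorted({l for l, _ in data})
    rights = sorted({r for _, r in data})
    support = []
    for l in lefts:
        r = min(x for x in rights if x >= l)
        if not any(l < l2 <= r for l2 in lefts):
            support.append((l, r))
    return support


def npmle(data, tol=1e-8, max_iter=1000):
    """Self-consistency (EM) iteration for the probability masses."""
    support = innermost_intervals(data)
    n, m = len(data), len(support)
    # alpha[i][j] is True when support interval j lies inside observation i.
    alpha = [[l <= q and p <= r for (q, p) in support] for (l, r) in data]
    mass = [1.0 / m] * m
    for _ in range(max_iter):
        new = [0.0] * m
        for row in alpha:
            denom = sum(pj for a, pj in zip(row, mass) if a)
            for j, a in enumerate(row):
                if a:
                    new[j] += mass[j] / (denom * n)
        converged = max(abs(a - b) for a, b in zip(new, mass)) < tol
        mass = new
        if converged:
            break
    return support, mass


def survival(support, mass, t):
    """S(t) = 1 - (mass of support intervals ending at or before t)."""
    return 1.0 - sum(p for (_, r), p in zip(support, mass) if r <= t)


def bootstrap_variance(data, times, B=200, seed=0):
    """Resample n intervals with replacement B times; refit; take variances."""
    rng = random.Random(seed)
    n = len(data)
    reps = []
    for _ in range(B):
        sample = [data[rng.randrange(n)] for _ in range(n)]
        support, mass = npmle(sample)
        reps.append([survival(support, mass, t) for t in times])
    variances = [statistics.variance(col) for col in zip(*reps)]
    return reps, variances


reps, variances = bootstrap_variance(DATA_A, times=[79, 82], B=200, seed=705)
print(variances)
```

Because the estimator and the convention for S(t) inside an interval differ from EMICM, the numbers printed by this sketch should not be expected to match the variance estimates reported in the text.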
To test the function, data were randomly generated with a helper function named getArray. One dataset
of 10 observations and two datasets of 15 observations were tested. The program's default is
B = 500,000. The points at which the variance was estimated were the quartiles of the MLE. The datasets
are given below in interval format and will be referred to as A, B, and C, respectively.
Dataset A (n = 10)
Left    Right
82      82
77      Inf
77      79
82      82
82      82
73      73
83      83
79      79
84      Inf
79      79

Dataset B (n = 15)
Left    Right
66      66
78      Inf
81      81
82      87
77      77
82      82
80      82
79      Inf
77      77
88      88
80      80
65      65
81      91
72      75
71      87

Dataset C (n = 15)
Left    Right
87      Inf
80      Inf
77      81
68      Inf
87      87
76      76
84      84
76      76
76      Inf
-Inf    84
81      81
87      Inf
75      83
89      89
69      Inf
For dataset A, we get the following for the MLE of the survival function:
> EMalg(input=obsv10)
     [,1]    [,2]   [,3]    [,4]  [,5]    [,6]   [,7]    [,8] [,9]
[1,]  0.9  0.7875  0.675  0.5625  0.45  0.3375  0.225  0.1125    0
[2,] 73.0 77.0000 79.000 79.0000 82.00 82.0000 82.000 83.0000   84
[3,] 73.0 79.0000 79.000 79.0000 82.00 82.0000 82.000 83.0000  Inf
The survival function was therefore bootstrapped at t = 79 and t = 82. The program was run twice with
B = 10,000 and once with B = 500,000. For t = 79, the variance estimate of the survival function is
0.02683396. For t = 82, the variance estimate is 0.01937614.
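One way to read the quartiles off the output for dataset A is sketched below in Python. It assumes that the first row of the matrix holds the survival estimate after each interval and that a quantile is taken as the right endpoint of the first interval at which the estimate drops to or below 1 - p. Both are assumptions about the output, and quantile_time is a hypothetical helper rather than part of the project code. Under this reading the three quartiles fall at 79, 82, and 82, which agrees with the two bootstrap points used for dataset A.

```python
INF = float("inf")

# Rows of the EMalg output for dataset A, copied from the text.
surv = [0.9, 0.7875, 0.675, 0.5625, 0.45, 0.3375, 0.225, 0.1125, 0.0]
left = [73, 77, 79, 79, 82, 82, 82, 83, 84]
right = [73, 79, 79, 79, 82, 82, 82, 83, INF]


def quantile_time(surv, right, p):
    """Right endpoint of the first interval where the estimate is <= 1 - p."""
    for s, r in zip(surv, right):
        if s <= 1 - p:
            return r
    return None


quartiles = [quantile_time(surv, right, p) for p in (0.25, 0.5, 0.75)]
print(quartiles)  # [79, 82, 82]
```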
For dataset B, we get the following for the MLE of the survival function:
> EMalg(input=data15)
           [,1]       [,2]      [,3]       [,4]       [,5]       [,6]       [,7]       [,8]       [,9] [,10]
[1,]  0.9333333  0.8666667  0.793567  0.7204673  0.6473675  0.5057313  0.3640951  0.3640951  0.1091223     0
[2,] 65.0000000 66.0000000 72.000000 77.0000000 77.0000000 80.0000000 81.0000000 81.0000000 82.0000000    88
[3,] 65.0000000 66.0000000 75.000000 77.0000000 77.0000000 80.0000000 81.0000000 82.0000000 82.0000000    88
The function was bootstrapped at t = 76, 80, and 82. The program was run once with B = 500,000. The
variance estimates are, respectively, 0.01812016, 0.02913296, and 0.01987020.
For dataset C, we get the following for the MLE of the survival function:
> EMalg(input=obsv15)
           [,1]       [,2]       [,3]       [,4]      [,5]       [,6] [,7]
[1,]  0.8908568  0.7817137  0.6302856  0.4788575  0.383086  0.2553907    0
[2,] 76.0000000 76.0000000 80.0000000 81.0000000 84.000000 87.0000000   89
[3,] 76.0000000 76.0000000 81.0000000 81.0000000 84.000000 87.0000000   89
The function was bootstrapped at t = 76, 80, and 82. The program was run eleven times, with B = 5000
(nine times), B = 10,000 (once), and B = 500,000 (once). The variance estimates are 0.02273754,
0.02273754, and 0.02130653, respectively.
All of the loops were run within R. With more time for the project (or had I started after the last
class, when we discussed the source code), I would move all of the loops to C and would also modify the
EMICM code, since only part of it is needed for the bootstrap. Calls to mainProj() with B = 1000 or 5000
completed in minutes, but with B = 500,000 the function took in excess of twelve hours to complete. This
is too long for practical use and could certainly be improved by moving the loop portion of the code to
C or C++. The output from EMICM could also be improved by returning the jumps and the jump points
instead of the intervals and probability jumps.
Most of the difficulty in this project arose from the format of the EMICM output. Searching through
intervals was somewhat more difficult than simply considering jump points, and it introduced extra data
that had to be carried around. Changing the EMICM source code would help make this code practical to
use.
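The difference between the two formats can be illustrated with a short Python sketch that collapses interval-style output into jump points and then evaluates the step function by binary search. It uses the output printed for dataset A and assumes that each interval's mass is placed at its right endpoint, which is one possible convention and not necessarily the one used in the project. The helpers to_jump_points and survival_at are hypothetical names.

```python
from bisect import bisect_right

INF = float("inf")

# Survival values and interval right endpoints from the dataset A output.
surv = [0.9, 0.7875, 0.675, 0.5625, 0.45, 0.3375, 0.225, 0.1125, 0.0]
right = [73, 79, 79, 79, 82, 82, 82, 83, INF]


def to_jump_points(surv, right):
    """Keep one (time, survival) pair per distinct right endpoint."""
    steps = {}
    for s, r in zip(surv, right):
        steps[r] = s  # later columns overwrite earlier ones at tied endpoints
    times = sorted(steps)
    return times, [steps[t] for t in times]


def survival_at(times, values, t):
    """Value of the right-continuous step function at time t."""
    k = bisect_right(times, t)  # number of jump points at or before t
    return 1.0 if k == 0 else values[k - 1]


times, values = to_jump_points(surv, right)
print(times)                          # [73, 79, 82, 83, inf]
print(survival_at(times, values, 79))  # 0.5625
print(survival_at(times, values, 82))  # 0.225
```

With the jump points stored this way, evaluating the estimate at each bootstrap time needs only the two lists and a single lookup, with no need to carry both interval endpoints through the loop.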