Keith Kohrs
STA 705 Project

The standard Greenwood formula we have discussed in class applies to survival data that may be right censored. However, we do not have a formula for the variance of the MLE of the survival function when the data may be left, right, or interval censored. Interval-censored data points in particular could occur frequently in a trial: the event, possibly death, could occur between scheduled visits to the physician, so that only an interval containing the event time is known. Since there is no “Greenwood formula” in this case, we can instead bootstrap the survival function at a point to get a variance estimate. Because this approach can handle any form of censored data, it is the aim of my project.

The EM algorithm is employed to get the MLE of the survival function. This is accomplished using the EMICM function from the Icens package in R. EMICM was chosen for the bootstrap because of its speed. The bootstrap is performed by resampling n times, with replacement, from the original n data points (in this case intervals). The program takes input in various forms and bootstraps the MLE B times for a given vector of times t. It returns the B bootstrap values of the MLE at each time point, along with the variance estimate at each point.

To test the function, data were randomly generated with a helper function named getArray. One dataset of 10 observations and two datasets of 15 observations were tested. The default for the program is B = 500,000. The points chosen for the bootstrap variance estimates were the quartiles of the MLE. The datasets are given below in interval format. These will be referred to as A, B, and C, respectively.

Dataset A (n = 10)
    Left    82   77   77   82   82   73   83   79   84   79
    Right   82  Inf   79   82   82   73   83   79  Inf   79

Dataset B (n = 15)
    Left    66   78   81   82   77   82   80   79   77   88   80   65   81   72   71
    Right   66  Inf   81   87   77   82   82  Inf   77   88   80   65   91   75   87

Dataset C (n = 15)
    Left    87   80   77   68   87   76   84   76   76 -Inf   81   87   75   89   69
    Right  Inf  Inf   81  Inf   87   76   84   76  Inf   84   81  Inf   83   89  Inf

For dataset A, we get the following for the MLE of the survival function:

    > EMalg(input=obsv10)
         [,1]    [,2]   [,3]    [,4]  [,5]    [,6]   [,7]    [,8] [,9]
    [1,]  0.9  0.7875  0.675  0.5625  0.45  0.3375  0.225  0.1125    0
    [2,] 73.0 77.0000 79.000 79.0000 82.00 82.0000 82.000 83.0000   84
    [3,] 73.0 79.0000 79.000 79.0000 82.00 82.0000 82.000 83.0000  Inf

So the function was bootstrapped at t = 79 and 82. The program was run with B = 10,000 twice, and B = 500,000 once. For t = 79, the variance estimate of the survival function is 0.02683396. For t = 82, the variance estimate is 0.01937614.

For dataset B, we get the following for the MLE of the survival function:

    > EMalg(input=data15)
               [,1]       [,2]      [,3]       [,4]       [,5]       [,6]       [,7]       [,8]       [,9] [,10]
    [1,]  0.9333333  0.8666667  0.793567  0.7204673  0.6473675  0.5057313  0.3640951  0.3640951  0.1091223     0
    [2,] 65.0000000 66.0000000 72.000000 77.0000000 77.0000000 80.0000000 81.0000000 81.0000000 82.0000000    88
    [3,] 65.0000000 66.0000000 75.000000 77.0000000 77.0000000 80.0000000 81.0000000 82.0000000 82.0000000    88

The function was bootstrapped at t = 76, 80, and 82. The program was run once with B = 500,000. The variance estimates are, respectively, 0.01812016, 0.02913296, and 0.01987020.

For dataset C, we get the following for the MLE of the survival function:

    > EMalg(input=obsv15)
               [,1]       [,2]       [,3]       [,4]      [,5]       [,6] [,7]
    [1,]  0.8908568  0.7817137  0.6302856  0.4788575  0.383086  0.2553907    0
    [2,] 76.0000000 76.0000000 80.0000000 81.0000000 84.000000 87.0000000   89
    [3,] 76.0000000 76.0000000 81.0000000 81.0000000 84.000000 87.0000000   89

The function was bootstrapped at t = 76, 80, and 82. The program was run eleven times, with B values of 5,000 (nine times), 10,000, and 500,000.
The variance estimates are 0.02273754, 0.02273754, and 0.02130653, respectively.

The loops were all run within R. If there had been more time for the project (or perhaps if I had started after the last class, when we discussed the source code), I would move all of the loops to C and modify the EMICM code as well, since only some of it is necessary for the bootstrap. Calls to mainProj() with B = 1,000 or 5,000 completed in minutes, but with B = 500,000 the function took in excess of twelve hours to complete. This is too long for practical use, and could certainly be improved by moving the loop portion of the code to C or C++. The output from EMICM could also be improved by returning the jumps and the jump points, instead of the intervals and probability jumps.

Most of the difficulty in this project arose from dealing with the format of the EMICM output. Searching through intervals was somewhat more difficult than just considering jump points, and it introduced extra data that needed to be carried around. Changing the EMICM source code would help make this code practical to use.
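
The bootstrap procedure described in this report can be illustrated with a short base-R sketch. This is not the project's mainProj() or EMalg() code, and it does not call EMICM: a plain self-consistency EM over the Turnbull innermost intervals stands in for it so that the example runs without the Icens package. The function names (turnbull_intervals, npmle_em, surv_at, boot_surv_var) and the small default B are hypothetical. The sketch also assumes closed intervals and evaluates S(t) by counting an interval's mass as "event occurred" only once t reaches the interval's right endpoint; the project's own convention may differ.

```r
# Sketch only: a plain self-consistency EM stands in for Icens::EMICM.

# Turnbull innermost intervals for closed observation intervals [left, right].
turnbull_intervals <- function(left, right) {
  pts <- c(left, right)
  tag <- c(rep(0L, length(left)), rep(1L, length(right)))  # 0 = left end, 1 = right end
  ord <- order(pts, tag)   # at ties, left ends sort before right ends
  pts <- pts[ord]
  tag <- tag[ord]
  # an innermost interval is a left end directly followed by a right end
  k <- which(tag[-length(tag)] == 0L & tag[-1L] == 1L)
  list(lo = pts[k], hi = pts[k + 1L])
}

# NPMLE of the probability mass on each innermost interval.
npmle_em <- function(left, right, tol = 1e-8, maxiter = 5000) {
  iv <- turnbull_intervals(left, right)
  n <- length(left)
  m <- length(iv$lo)
  # alpha[i, j] = 1 when innermost interval j lies inside observation i
  alpha <- (outer(left, iv$lo, "<=") & outer(right, iv$hi, ">=")) * 1
  p <- rep(1 / m, m)
  for (iter in seq_len(maxiter)) {
    w <- alpha * rep(p, each = n)  # E-step: unnormalised membership weights
    w <- w / rowSums(w)            # each observation's weights sum to 1
    p_new <- colMeans(w)           # M-step: average weight per interval
    done <- max(abs(p_new - p)) < tol
    p <- p_new
    if (done) break
  }
  list(lo = iv$lo, hi = iv$hi, p = p)
}

# Survival estimate at each time in `times` (assumed convention: an
# interval's mass counts as an event once t >= its right endpoint).
surv_at <- function(fit, times) {
  vapply(times, function(t) 1 - sum(fit$p[fit$hi <= t]), numeric(1))
}

# Resample the n intervals with replacement B times, refit, and take the
# sample variance of the B survival estimates at each time point.
boot_surv_var <- function(left, right, times, B = 200, seed = 1) {
  set.seed(seed)
  n <- length(left)
  est <- matrix(NA_real_, nrow = B, ncol = length(times))
  for (b in seq_len(B)) {
    idx <- sample.int(n, n, replace = TRUE)
    est[b, ] <- surv_at(npmle_em(left[idx], right[idx]), times)
  }
  list(estimates = est, variance = apply(est, 2, var))
}

# Dataset A from this report, with a deliberately small B.
left_a  <- c(82, 77, 77, 82, 82, 73, 83, 79, 84, 79)
right_a <- c(82, Inf, 79, 82, 82, 73, 83, 79, Inf, 79)
res <- boot_surv_var(left_a, right_a, times = c(79, 82), B = 200)
print(res$variance)
```

Because B is small and the EM and the S(t) convention are stand-ins, the printed variances should not be expected to reproduce the values reported above. The sketch is only meant to show the shape of the computation: resample intervals, refit the MLE, evaluate at t, and take the variance across replicates.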