ESTIMATION THROUGH BOOTSTRAPPING

T.A.S. Vijayaraghavan

Although confidence interval estimates are widely used in making inferences about population parameters, the estimating procedure rests on assumptions that do not always hold. If the population is substantially non-normal, particularly when the sample size n is small, the confidence interval estimate for the mean may not be precise. We will consider an alternative estimation approach called bootstrapping.

Bradley Efron is usually credited not only with the term "bootstrapping" in this context but also with the central assertion that the relative frequency distribution of the repeated sample statistics is an estimate of the sampling distribution. He published his insights in 1979 in an article in The Annals of Statistics. The word "bootstrap" suggests pulling oneself up by one's bootstraps, which is almost equivalent to achieving the impossible.

Much of the rationale behind Efron's path-breaking approach is that the distributional assumptions of multinormality required for parametric inferences are not always met, nor do we always know the sampling distribution of several other statistics of interest (e.g. skewness, kurtosis, difference between medians, eigenvalues). Under such conditions parametric inferences are difficult to justify, and it is sometimes better to draw inferences about population parameters strictly from the sample at hand than to make unrealistic assumptions about the population of interest.

The bootstrap concept is simple. Statisticians would have come up with the idea long ago, were it not for the extensive computations it requires. The best-known technique of this kind is bootstrapping, which involves resampling the sample data with replacement many, many times in order to generate an empirical estimate of the entire sampling distribution of a statistic.
In other words, bootstrapping treats the sample as if it were the population and applies Monte Carlo sampling to generate an empirical estimate of the statistic's sampling distribution. The typical requirements for bootstrapping are considerable computational power, enough to take 500 to 1,000 samples; appropriate software or a computer program; and a sample size of at least 30 observations.

Still, bootstrapping is not a panacea. It is good for confidence intervals and bias estimation but not for point estimation. In addition, the original sample must be a good representation of the population of interest (covering the full range of population values) and must have been drawn as a simple random sample (although work with more complex samples is underway). If these conditions are not met, uncertainty will remain.

Briefly, bootstrapping estimation procedures involve an initial sample and then repeated resampling from the initial sample. What makes bootstrapping estimation useful is that the procedures are based on the initial sample and make no assumptions regarding the shape of the underlying population. In addition, the procedures do not require knowledge of any population parameters.

The steps in bootstrapping estimation of the mean are as follows:

1. Draw a random sample of size n without replacement from a population of size N.
2. Resample the initial sample by selecting n observations with replacement from the n observations in the initial sample.
3. Compute the statistic of interest (here the sample mean, $\bar{X}$) from this resample.
4. Repeat steps 2 and 3 m different times (where m is typically selected as a value between 100 and 1000, depending on the speed of the computer being used).
5. Form the resampling distribution of the statistic of interest (i.e. the distribution of the sample mean obtained from the m resamples) using a stem-and-leaf display or an ordered array.
6. To form a $100(1-\alpha)\%$ bootstrap confidence interval for the population mean $\mu$, use the stem-and-leaf display or ordered array for the resampling distribution and find the value that cuts off the smallest $(\alpha/2)100\%$ and the value that cuts off the largest $(\alpha/2)100\%$ of the statistic. These values provide the lower and upper limits for the bootstrap confidence interval estimate of the unknown parameter.

Consider the following example of annual cooking oil consumption (in gallons) of single-family homes, where the marketing manager is interested in estimating the average annual consumption.

Table 1. Annual oil consumption (gallons) for the sample of 35 homes

1150.25   872.37  1459.56   941.96  1013.27  1352.67  1126.57
1252.01   767.37  1402.59   983.45  1184.17   373.91  1598.57
1069.32  1365.11  1046.35  1047.40  1598.66  1108.94   942.71
1110.50  1064.46  1343.29  1326.19  1577.77  1050.86  1018.23
1617.73  1074.86   330.00   851.60   996.92  1300.76   975.86

For these data we may use a statistical software package to obtain the sample average $\bar{X} = 1122.7$ gallons and the sample standard deviation $s = 295.72$ gallons. If the manager would like to have 95% confidence that the interval obtained includes the population average amount of oil consumed per year, then using $\bar{X} = 1122.7$, $s = 295.72$, $n = 35$ and $t_{34} = 2.0322$ (treating n = 35 as a small sample), we have

$$\bar{X} \pm t_{n-1}\,\frac{s}{\sqrt{n}} = 1122.7 \pm 2.0322\left(\frac{295.72}{\sqrt{35}}\right) = 1122.7 \pm 101.58$$

$$1021.12 \le \mu \le 1224.28$$

(assuming, of course, a normal population).

We would conclude with 95% confidence that the average amount of cooking oil consumed per year is between 1021.12 and 1224.28 gallons. The 95% confidence interval states that we are 95% sure that the sample we have selected is one for which the population mean $\mu$ lies within the interval. This 95% confidence actually means that if all possible samples of size 35 were selected (something that would never be done in practice), 95% of the intervals developed would include the true population mean.
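The arithmetic of the traditional interval above can be checked with a short script. What follows is only a minimal sketch in Python and not part of the original analysis: the names `consumption`, `t_interval` and `T_34` are illustrative assumptions, and the critical value 2.0322 is taken from the text rather than computed.

```python
import math
import statistics

# Sample of 35 annual oil consumption values (gallons), as listed in the text.
consumption = [
    1150.25, 872.37, 1459.56, 941.96, 1013.27, 1352.67, 1126.57,
    1252.01, 767.37, 1402.59, 983.45, 1184.17, 373.91, 1598.57,
    1069.32, 1365.11, 1046.35, 1047.40, 1598.66, 1108.94, 942.71,
    1110.50, 1064.46, 1343.29, 1326.19, 1577.77, 1050.86, 1018.23,
    1617.73, 1074.86, 330.00, 851.60, 996.92, 1300.76, 975.86,
]


def t_interval(sample, t_crit):
    """Return (mean, lower, upper) for mean +/- t_crit * s / sqrt(n)."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
    half_width = t_crit * s / math.sqrt(n)
    return mean, mean - half_width, mean + half_width


# Critical value of t with 34 degrees of freedom, copied from the text.
T_34 = 2.0322

mean, lower, upper = t_interval(consumption, T_34)
print(f"mean = {mean:.2f}, 95% interval = ({lower:.2f}, {upper:.2f})")
```

The limits printed should agree with the interval reported above up to rounding, since the text rounds the sample mean to 1122.7 before computing the limits.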
If one wants to treat the case as a sufficiently large sample without the normal population assumption, the Central Limit Theorem can be invoked, and one would obtain the interval as

$$\bar{X} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}} = 1122.7 \pm 1.96\left(\frac{295.72}{\sqrt{35}}\right) = 1122.7 \pm 97.97$$

$$1024.73 \le \mu \le 1220.67$$

which is narrower than the interval obtained using the t distribution.

Both estimates rest on assumptions: either that the underlying population of oil consumption is approximately normally distributed, or that the sample size is large enough for the Central Limit Theorem to apply. Neither assumption is necessary for the bootstrap procedure.

Let us now obtain the bootstrap estimate for this example. Following the six-step procedure just described, the sample data of Table 1 above were used to obtain a resample of 35 observations selected with replacement. The first resample is listed in Table 2. The mean of this resample is 1,003.26 gallons.

Table 2. First resample of 35 observations (gallons)

1326.19   330.00   941.96   373.91  1110.50  1150.25  1013.27
 767.37  1050.86   373.91  1184.17   996.92  1050.86  1074.86
1069.32  1013.27  1047.40  1459.56  1326.19   330.00   330.00
1018.23   975.86  1110.50  1365.11  1126.57  1064.46   373.91
1402.59  1110.50  1343.29  1598.66  1018.23  1343.29   941.96

Notice that in this first resample some values from the original sample, such as 330.00, are repeated, and others, such as 1,352.67, do not occur. If this resampling process is performed m = 200 times, the resampling distribution containing 200 resample means can be developed. Table 3 presents the ordered array of the resampling distribution. To form a 95% bootstrap confidence interval estimate of $\mu$, the smallest 2.5% and the largest 2.5% of the resample means need to be identified (step 6).
Table 3. Ordered array of the 200 resample means (gallons), read across each row

 988.50   994.49  1003.26  1004.25  1006.53  1014.38  1018.62
1030.21  1030.73  1032.04  1032.77  1034.17  1035.21  1041.45
1042.23  1046.16  1050.01  1050.97  1052.73  1053.00  1054.58
1055.08  1055.39  1055.76  1059.57  1059.88  1060.51  1061.70
1061.79  1062.46  1062.86  1063.31  1066.91  1068.84  1070.53
1071.00  1071.38  1074.99  1077.18  1077.63  1078.62  1079.41
1079.46  1081.45  1083.40  1084.25  1085.21  1085.98  1086.22
1086.84  1089.45  1089.95  1092.37  1092.53  1094.48  1095.25
1095.35  1095.40  1095.88  1096.26  1097.59  1100.06  1100.19
1100.20  1101.59  1103.41  1104.07  1104.82  1105.19  1106.67
1106.84  1111.14  1112.46  1112.47  1112.53  1113.58  1115.09
1115.71  1116.46  1116.51  1117.84  1118.67  1119.17  1119.71
1119.98  1120.31  1121.01  1121.46  1121.68  1121.92  1122.40
1123.34  1123.73  1124.38  1124.98  1125.63  1126.01  1127.00
1127.15  1127.30  1127.55  1127.86  1128.06  1129.03  1129.66
1129.84  1132.30  1132.78  1133.39  1134.03  1134.43  1135.07
1136.67  1136.90  1138.24  1138.52  1139.53  1139.81  1140.60
1142.72  1142.73  1143.03  1143.32  1143.35  1143.70  1144.37
1145.72  1146.21  1149.77  1150.87  1152.88  1153.13  1153.25
1153.35  1156.35  1156.43  1158.49  1158.62  1159.14  1159.45
1159.55  1161.01  1161.16  1161.50  1162.12  1163.15  1163.24
1164.06  1164.39  1165.45  1167.87  1168.50  1170.06  1171.11
1171.21  1172.03  1172.19  1172.90  1173.00  1173.37  1174.25
1176.32  1176.73  1178.39  1178.63  1182.16  1182.37  1183.12
1183.99  1184.27  1184.68  1185.80  1185.97  1186.88  1187.01
1187.27  1188.03  1188.25  1188.53  1189.39  1190.20  1191.10
1191.28  1191.86  1192.90  1193.21  1196.66  1199.30  1206.23
1206.28  1209.61  1216.85  1225.20  1226.46  1226.67  1227.02
1230.11  1233.43  1233.97  1251.88

When 200 resample means are obtained, the fifth smallest value (i.e. 200 × 0.025 = 5) cuts off the lowest 2.5%, while the fifth largest value cuts off the highest 2.5%. From Table 3, we obtain 1006.53 gallons as the fifth smallest value and 1227.02 gallons as the fifth largest.
Therefore, the 95% bootstrap confidence interval for the population average amount of cooking oil consumed is 1006.53 to 1227.02 gallons. This estimate is fairly close to the traditional confidence interval estimate of 1021.12 to 1224.28 gallons obtained earlier. However, the bootstrap estimate requires less stringent assumptions than the traditional confidence interval estimate does.
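The resampling steps (steps 2 to 6 of the procedure described earlier) can be sketched in a few lines of Python. This is a hedged illustration under stated assumptions, not the program used to produce the tables in this note: the function name `bootstrap_percentile_ci`, its parameters and the seed value are all hypothetical, and because the resamples are drawn at random the limits it prints will differ somewhat from 1006.53 and 1227.02.

```python
import random
import statistics

# Sample of 35 annual oil consumption values (gallons), as listed in the text.
consumption = [
    1150.25, 872.37, 1459.56, 941.96, 1013.27, 1352.67, 1126.57,
    1252.01, 767.37, 1402.59, 983.45, 1184.17, 373.91, 1598.57,
    1069.32, 1365.11, 1046.35, 1047.40, 1598.66, 1108.94, 942.71,
    1110.50, 1064.46, 1343.29, 1326.19, 1577.77, 1050.86, 1018.23,
    1617.73, 1074.86, 330.00, 851.60, 996.92, 1300.76, 975.86,
]


def bootstrap_percentile_ci(sample, m=200, alpha=0.05, seed=0):
    """Return (lower, upper, ordered_means) for a percentile bootstrap interval.

    The seed is an arbitrary choice made only so that repeated runs agree.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Steps 2-4: draw m resamples of size n with replacement, keep each mean.
    # Step 5: sort the resample means to form the ordered array.
    means = sorted(statistics.mean(rng.choices(sample, k=n)) for _ in range(m))
    # Step 6: the k-th smallest and k-th largest means are the limits,
    # where k = m * alpha / 2 (k = 5 for m = 200 and alpha = 0.05).
    k = max(1, round(m * alpha / 2))
    return means[k - 1], means[-k], means


lower, upper, means = bootstrap_percentile_ci(consumption)
print(f"95% bootstrap interval: {lower:.2f} to {upper:.2f} gallons")
```

Increasing `m` in this sketch generally makes the two limits less sensitive to the particular random draws, at the cost of more computation.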