ESTIMATION THROUGH BOOTSTRAPPING
T.A.S. Vijayaraghavan
Although confidence interval estimates are widely used in making inferences
about population parameters, the estimating procedure rests on assumptions
that do not always hold. If the population is substantially non-normal,
particularly when the sample size n is small, the confidence interval estimate for the
mean may not be reliable. We will consider an alternative estimation approach called
bootstrapping. Bradley Efron is usually credited not only with the term
"bootstrapping" in this context but also with the central assertion that the
relative frequency distribution of the repeated sample statistics is an estimate of the
sampling distribution. He published his insights in 1979 in an article in the Annals of
Statistics. The word "bootstrap" suggests pulling oneself up by one's
bootstraps, which is almost equivalent to achieving the impossible.
Much of the rationale behind Efron's path-breaking approach is that
the distributional assumptions of multinormality required for parametric inferences are
not always met, nor do we always know the sampling distribution of several other
statistics of interest (e.g. skewness, kurtosis, difference between medians, eigenvalues).
Under such conditions parametric inferences are difficult to justify, and it is sometimes
better to draw inferences about population parameters strictly from the sample at hand
rather than making unrealistic assumptions about the population of interest. The
bootstrap concept is simple. Statisticians would have come up with the idea long
ago, were it not for the extensive computation it requires. The best-known technique
of this kind is bootstrapping, which involves resampling the sample data with
replacement many, many times in order to generate an empirical estimate of the entire
sampling distribution of a statistic. In other words, bootstrapping treats the sample as if
it were the population and applies Monte Carlo sampling to generate an empirical
estimate of the statistic's sampling distribution.
The typical requirements for bootstrapping are considerable computational power,
enough to take 500 to 1000 resamples; appropriate software or a computer program; and a
sample size of at least 30 observations.
Still, bootstrapping is not a panacea. It is good for confidence
intervals and bias estimation but not for point estimation. In addition, the original sample
must represent the population of interest well (covering the full range of
population values) and must have been drawn as a simple random sample (although work
with more complex samples is underway). If these conditions are not met, uncertainty
will remain.
Briefly, bootstrapping estimation procedures involve an initial sample and then
repeated resampling from the initial sample. What makes bootstrapping estimation
useful is that the procedures are based on the initial sample and make no assumptions
regarding the shape of the underlying population. In addition, the procedures do not
require knowledge of any population parameters.
The steps in bootstrapping estimation of the mean are as follows:
1. Draw a random sample of size n without replacement from a population of size N.
2. Resample the initial sample by selecting n observations with replacement from the n
observations in the initial sample.
3. Compute X̄, the statistic of interest, from this resample.
4. Repeat steps 2 and 3 m different times (where m is typically selected as a value
between 100 and 1000, depending on the speed of the computer being used).
5. Form the resampling distribution of the statistic of interest (i.e. the distribution of
the sample means obtained from the m resamples) using a stem-and-leaf display or an
ordered array.
6. To form a 100(1 - α)% bootstrap confidence interval for the population mean µ, use
the stem-and-leaf display or ordered array for the resampling distribution and find
the value that cuts off the smallest (α/2)100% and the value that cuts off the largest
(α/2)100% of the statistic. These values provide the lower and upper limits for the
bootstrap confidence interval estimate of the unknown parameter.
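Steps 2 through 6 can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name bootstrap_mean_ci, its parameters, the fixed seed, and the small demonstration sample are all assumptions made for this example.

```python
import random
import statistics


def bootstrap_mean_ci(sample, m=200, alpha=0.05, seed=None):
    """Percentile bootstrap interval for the mean (steps 2 to 6).

    The name and signature are illustrative, not taken from any library.
    """
    rng = random.Random(seed)
    n = len(sample)
    # Steps 2-4: draw m resamples of size n with replacement and
    # record the mean of each one.
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(m)]
    # Step 5: the ordered array of resample means.
    means.sort()
    # Step 6: count off (alpha/2) * m values from each end.
    k = max(1, round(m * alpha / 2))
    return means[k - 1], means[m - k]


# Hypothetical demonstration data, not taken from the article.
demo_sample = [12.1, 9.8, 11.4, 10.2, 13.5, 8.9, 10.7, 11.9, 9.4, 12.8]
lower, upper = bootstrap_mean_ci(demo_sample, m=1000, seed=1)
print(f"95% bootstrap interval for the mean: {lower:.2f} to {upper:.2f}")
```

Passing a seed only makes the run repeatable; a different seed gives slightly different limits, which is expected for a Monte Carlo procedure.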
Consider the following example of annual cooking oil consumption (in gallons) of single
family homes where the marketing manager is interested in estimating the average
annual consumption.
Table 1. Annual cooking oil consumption (gallons) for the sample of 35 single-family homes
1150.25   872.37  1459.56   941.96  1013.27
1352.67  1126.57  1252.01   767.37  1402.59
 983.45  1184.17   373.91  1598.57  1069.32
1365.11  1046.35  1047.40  1598.66  1108.94
 942.71  1110.50  1064.46  1343.29  1326.19
1577.77  1050.86  1018.23  1617.73  1074.86
 330.00   851.60   996.92  1300.76   975.86
For these data we may use a statistical software package to obtain the sample average
X̄ = 1,122.7 gallons and the sample standard deviation s = 295.72 gallons.
If the manager would like to have 95% confidence that the interval obtained includes
the population average amount of oil consumed per year, then using X̄ = 1122.7, s = 295.72,
n = 35, and t_34 = 2.0322 (treating n = 35 as a small sample!), we have
X̄ ± t_(n-1) s/√n = 1122.7 ± 2.0322 (295.72/√35)
= 1122.7 ± 101.58
1021.12 ≤ µ ≤ 1224.28
(assuming, of course, a normal population here)
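The arithmetic of the t-based interval can be checked with a short script. This sketch takes the summary statistics and the critical value 2.0322 exactly as quoted in the text; the variable names are chosen for this example only.

```python
import math

# Summary statistics and critical value as quoted in the text.
x_bar = 1122.7      # sample mean (gallons)
s = 295.72          # sample standard deviation (gallons)
n = 35              # sample size
t_crit = 2.0322     # t value with n - 1 = 34 degrees of freedom, 95% confidence

standard_error = s / math.sqrt(n)
margin = t_crit * standard_error
t_lower = x_bar - margin
t_upper = x_bar + margin

print(f"standard error  = {standard_error:.2f}")
print(f"margin of error = {margin:.2f}")
print(f"{t_lower:.2f} <= mu <= {t_upper:.2f}")
```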
We would conclude with 95% confidence that the average amount of cooking oil
consumed per year is between 1021.12 and 1224.28 gallons. That is, we are 95%
confident that the sample we have selected is one for which the interval contains
the population mean µ. This 95% confidence actually
means that if all possible samples of size 35 were selected (something that would never
be done in practice), 95% of the intervals developed would include the true population
mean somewhere within the interval.
If one wants to treat the case as a sufficiently large sample without the normal
population assumption, the Central Limit Theorem can be invoked, and one would
obtain the interval as
X̄ ± z_(α/2) s/√n = 1122.7 ± 1.96 (295.72/√35)
= 1122.7 ± 97.97
1024.73 ≤ µ ≤ 1220.67 (which is narrower than the one obtained using the t distribution)
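The large-sample interval can be reproduced the same way. In this sketch the critical value is obtained from the standard library's NormalDist rather than typed in as 1.96, so the limits agree with the text only to rounding; the variable names are assumptions of the example.

```python
import math
from statistics import NormalDist

x_bar, s, n = 1122.7, 295.72, 35   # summary statistics quoted in the text
alpha = 0.05                        # for a 95% interval

# Upper alpha/2 point of the standard normal distribution (about 1.96).
z_crit = NormalDist().inv_cdf(1 - alpha / 2)
margin = z_crit * s / math.sqrt(n)
z_lower, z_upper = x_bar - margin, x_bar + margin

print(f"z = {z_crit:.4f}, margin of error = {margin:.2f}")
print(f"{z_lower:.2f} <= mu <= {z_upper:.2f}")
```

Because the z value is smaller than the t value for 34 degrees of freedom, the margin of error shrinks, which is why this interval is the narrower of the two.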
The estimates obtained above relied either on the assumption that the underlying population of
oil consumption was approximately normally distributed or on the central
limit theorem with a sufficiently large sample size, assumptions that are not
necessary for the bootstrap procedure. Let us now obtain the bootstrap estimate for
this example. Following the six-step procedure just described, the sample data of
Table 1 above were used to obtain a resample of 35 observations selected
with replacement. The first resample is listed in Table 2. The mean of this resample is
1,003.26 gallons.
Table 2. First resample of 35 observations (gallons), selected with replacement from the initial sample
1326.19   330.00   941.96   373.91  1110.50
1150.25  1013.27   767.37  1050.86   373.91
1184.17   996.92  1050.86  1074.86  1069.32
1013.27  1047.40  1459.56  1326.19   330.00
 330.00  1018.23   975.86  1110.50  1365.11
1126.57  1064.46   373.91  1402.59  1110.50
1343.29  1598.66  1018.23  1343.29   941.96
Notice that in this first resample some values from the original sample, such as 330.00,
are repeated and others, such as 1,352.67, do not occur. If this resampling process is
performed m = 200 times, the resampling distribution containing 200 resample means
can be developed. Table 3 presents the ordered array of the resampling distribution.
To form a 95% bootstrap confidence interval estimate of µ, the smallest 2.5% and the
largest 2.5% of the resample means need to be identified (step 6).
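The repeated and omitted values in a resample are easy to see in code. The sketch below draws one resample from the 35 observations of the initial sample and lists which values came up more than once and which never came up. The seed is arbitrary, so the resample it produces is not the one listed in Table 2.

```python
import random
from collections import Counter

# The 35 observations of the initial sample (gallons).
oil = [
    1150.25, 872.37, 1459.56, 941.96, 1013.27, 1352.67, 1126.57,
    1252.01, 767.37, 1402.59, 983.45, 1184.17, 373.91, 1598.57,
    1069.32, 1365.11, 1046.35, 1047.40, 1598.66, 1108.94, 942.71,
    1110.50, 1064.46, 1343.29, 1326.19, 1577.77, 1050.86, 1018.23,
    1617.73, 1074.86, 330.00, 851.60, 996.92, 1300.76, 975.86,
]

rng = random.Random(2024)                 # arbitrary seed, for repeatability
resample = rng.choices(oil, k=len(oil))   # n draws with replacement

counts = Counter(resample)
repeated = sorted(v for v, c in counts.items() if c > 1)
missing = sorted(set(oil) - set(resample))

print("resample mean:", round(sum(resample) / len(resample), 2))
print("values drawn more than once:", repeated)
print("values never drawn:", missing)
```

Repeating the last few lines m times and keeping only the resample means yields the resampling distribution described above.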
Table 3. Ordered array of the 200 resample means (gallons), in ascending order across each row
 988.50   994.49  1003.26  1004.25  1006.53  1014.38  1018.62
1030.21  1030.73  1032.04  1032.77  1034.17  1035.21  1041.45
1042.23  1046.16  1050.01  1050.97  1052.73  1053.00  1054.58
1055.08  1055.39  1055.76  1059.57  1059.88  1060.51  1061.70
1061.79  1062.46  1062.86  1063.31  1066.91  1068.84  1070.53
1071.00  1071.38  1074.99  1077.18  1077.63  1078.62  1079.41
1079.46  1081.45  1083.40  1084.25  1085.21  1085.98  1086.22
1086.84  1089.45  1089.95  1092.37  1092.53  1094.48  1095.25
1095.35  1095.40  1095.88  1096.26  1097.59  1100.06  1100.19
1100.20  1101.59  1103.41  1104.07  1104.82  1105.19  1106.67
1106.84  1111.14  1112.46  1112.47  1112.53  1113.58  1115.09
1115.71  1116.46  1116.51  1117.84  1118.67  1119.17  1119.71
1119.98  1120.31  1121.01  1121.46  1121.68  1121.92  1122.40
1123.34  1123.73  1124.38  1124.98  1125.63  1126.01  1127.00
1127.15  1127.30  1127.55  1127.86  1128.06  1129.03  1129.66
1129.84  1132.30  1132.78  1133.39  1134.03  1134.43  1135.07
1136.67  1136.90  1138.24  1138.52  1139.53  1139.81  1140.60
1142.72  1142.73  1143.03  1143.32  1143.35  1143.70  1144.37
1145.72  1146.21  1149.77  1150.87  1152.88  1153.13  1153.25
1153.35  1156.35  1156.43  1158.49  1158.62  1159.14  1159.45
1159.55  1161.01  1161.16  1161.50  1162.12  1163.15  1163.24
1164.06  1164.39  1165.45  1167.87  1168.50  1170.06  1171.11
1171.21  1172.03  1172.19  1172.90  1173.00  1173.37  1174.25
1176.32  1176.73  1178.39  1178.63  1182.16  1182.37  1183.12
1183.99  1184.27  1184.68  1185.80  1185.97  1186.88  1187.01
1187.27  1188.03  1188.25  1188.53  1189.39  1190.20  1191.10
1191.28  1191.86  1192.90  1193.21  1196.66  1199.30  1206.23
1206.28  1209.61  1216.85  1225.20  1226.46  1226.67  1227.02
1230.11  1233.43  1233.97  1251.88
When 200 resample means are obtained, the fifth (i.e. 200 × 0.025) smallest value will cut
off the lowest 2.5%, while the fifth largest value will cut off the highest 2.5%. From
Table 3, we obtain 1006.53 gallons as the fifth smallest value and 1227.02 gallons as the
fifth largest. Therefore, the 95% bootstrap confidence interval for the population
average amount of cooking oil consumed is 1006.53 to 1227.02 gallons. This estimate is
fairly close to the traditional confidence interval estimate of 1021.12 to 1224.28 gallons
obtained earlier. However, the bootstrap estimate requires less stringent assumptions
than does the traditional confidence interval estimate.
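The counting rule used in step 6 translates into two list positions. The sketch below is one way to write it, assuming the resample means are already sorted in ascending order; the function name percentile_limits is hypothetical, and the array 1, 2, ..., 200 is a placeholder chosen so that the positions are easy to verify by eye.

```python
def percentile_limits(ordered_means, alpha=0.05):
    """Return the values cutting off the lowest and highest alpha/2 share.

    `ordered_means` must already be sorted in ascending order. The function
    name is illustrative, not from any library.
    """
    m = len(ordered_means)
    k = round(m * alpha / 2)          # e.g. 200 * 0.025 = 5
    if k < 1:
        raise ValueError("too few resample means for this alpha")
    # The k-th smallest value sits at index k - 1,
    # the k-th largest at index m - k.
    return ordered_means[k - 1], ordered_means[m - k]


# Placeholder array 1, 2, ..., 200 so the positions are easy to check.
positions = list(range(1, 201))
print(percentile_limits(positions))   # (5, 196)
```

With m = 200 and a 95% interval, the function picks the 5th and 196th entries of the ordered array, that is, the fifth smallest and fifth largest resample means.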