Uploaded by nelsonwantsstudy

STPM Math T coursework

INTRODUCTION
In statistics, sampling plays a very important role in allowing one to make
inferences about a population and to estimate its parameters from the sample
collected. When the sample is drawn randomly and without bias, its properties
correspond closely to those of the population. As a result, there is no need to
collect data for the whole population, which is time-consuming, costly and difficult
to manage. Hence, the sample data can be used to estimate a population parameter of
interest. One way of estimating a population parameter is to use a confidence
interval obtained from a random sample. However, several factors can affect the
precision of a confidence interval, namely the sample size, the confidence level and
the underlying population distribution. Therefore, in this paper, the effect of
sample size and underlying population distribution on the coverage probabilities of
confidence intervals for the population mean, based on normal and Chi-squared
distributions, will be investigated.
The width of a confidence interval is affected by sample size: when the sample size
increases while the confidence level is held constant, the width of the confidence
interval decreases. This indicates that the precision of the confidence interval
increases. Put simply, a larger sample gives more information about the population,
so the estimate of the mean is more reliable and the interval is tighter.
Similarly, from the graph of the underlying population distribution, we can observe
the shape of the distribution and infer certain properties, such as the standard
deviation of the data. The more spread out the population distribution, the greater
the standard deviation. Consequently, the confidence interval will be wider, as
there is less consistency in the data.
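The relationship between sample size, spread and interval width described above can be checked numerically. The following is a minimal sketch in Python (not part of the Excel workflow used in this paper), assuming the usual normal-based half-width z·σ/√n with σ treated as known; the function name margin_of_error is illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(sigma, n, confidence=0.95):
    """Half-width of a normal-based confidence interval for the mean."""
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * sigma / sqrt(n)


if __name__ == "__main__":
    # The two variances and three sample sizes used in this paper.
    for variance in (2, 4):
        for n in (10, 30, 50):
            half_width = margin_of_error(sqrt(variance), n)
            print(f"variance={variance}, n={n}: half-width = {half_width:.4f}")
```

For a fixed variance the half-width shrinks as n grows, and for a fixed n it grows with the variance.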
METHODOLOGY
In carrying out this activity, Microsoft Excel is used to generate 500 observations
for each population distribution. For steps that are repeated many times, only one
example is shown in order to keep the paper neat and easy to read. For a normally
distributed population, the 500 observations can be generated using the function
=NORMINV(RAND(), mean, standard deviation). After that, to produce a line graph of
the population distribution, all observations are arranged in one column, labelled
the x-axis, while another column holds the probability density of each value on the
x-axis. The density is calculated using the function
=NORMDIST(A, mean, standard deviation, FALSE), where A is the cell reference of an
observation and “FALSE” is used because the value required is not the cumulative
probability. Note that both functions take the standard deviation, which is the
square root of the variance, as their third argument. Then, on the Insert tab, XY
Scatter is chosen, then Scatter, and Scatter with Smooth Lines is selected to produce
the line graph of the population distribution. This is repeated for the two normal
distributions, one with mean=10, variance=2 and the other with mean=10, variance=4.
On the other hand, for the Chi-squared distribution, the function used to generate
the observations is =CHIINV(RAND(), v), where v is the number of degrees of freedom.
As with the normal distribution, two columns, an x-axis column and a y-axis column,
are generated. In the y-axis column, the formula used for each observation is
=CHIDIST(A, v). Note that CHIDIST returns the right-tailed probability rather than
the density; the density itself is given by =CHISQ.DIST(A, v, FALSE). Lastly, the
line graph of the Chi-squared distribution is plotted in the same way as for the
normal distribution. This is done for 1, 10 and 30 degrees of freedom.
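For readers without Excel, the generation step can be sketched in Python. This is only an illustrative equivalent of NORMINV(RAND(), ...) and CHIINV(RAND(), v), not the procedure actually used in this paper; the function names and the seed parameter are assumptions introduced here, and the Chi-squared draw relies on the fact that a Chi-squared variable with v degrees of freedom is a gamma variable with shape v/2 and scale 2.

```python
import random


def normal_observations(mean, variance, size=500, seed=None):
    """Draw `size` values from a normal distribution with the given variance."""
    rng = random.Random(seed)
    sd = variance ** 0.5  # gauss() expects the standard deviation
    return [rng.gauss(mean, sd) for _ in range(size)]


def chi_squared_observations(df, size=500, seed=None):
    """Draw `size` values from a Chi-squared distribution with `df` degrees of freedom."""
    rng = random.Random(seed)
    # Chi-squared(df) is the same as Gamma(shape=df/2, scale=2).
    return [rng.gammavariate(df / 2, 2) for _ in range(size)]


if __name__ == "__main__":
    normal_2 = normal_observations(10, 2, seed=1)
    chi_10 = chi_squared_observations(10, seed=1)
    print(sum(normal_2) / len(normal_2), sum(chi_10) / len(chi_10))
```

Passing a seed makes a run repeatable, which Excel's RAND() does not allow.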
Next, from the 500 observations generated, 10 are selected at random. This can be
done using the function =INDEX(B3:B10, RANDBETWEEN(1, ROWS(B3:B10)), 1), where
B3:B10 is the range of cells containing the observations generated. After the 10
random values of x are selected, they are used to calculate the sample mean using
=AVERAGE(B3:B12), where B3:B12 is the range containing the sample data. The standard
deviation is also calculated, using =STDEV.P(B3:B12). Then, the alpha value is set to
0.05 and the sample size is n=10, so the sampling error can be calculated using
=CONFIDENCE.NORM(alpha value, standard deviation, sample size).
The lower boundary is the sample mean minus the sampling error, while the upper
boundary is the sample mean plus the sampling error. Hence, the confidence interval
at the 95% confidence level is obtained. This is repeated 200 times, so that 200
confidence intervals at the 95% level are generated from 200 samples of size n=10
for each of the 5 distributions.
The procedure is then repeated using n=30 and n=50.
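The interval calculation above can be sketched in Python as follows. This is a hedged illustration rather than the spreadsheet itself: confidence_interval is a hypothetical helper name, pstdev is used as the counterpart of STDEV.P, and the margin z·s/√n is the quantity that CONFIDENCE.NORM returns.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, pstdev


def confidence_interval(sample, alpha=0.05):
    """Normal-based (1 - alpha) confidence interval for the mean of `sample`."""
    n = len(sample)
    centre = mean(sample)
    sd = pstdev(sample)  # population formula, the counterpart of STDEV.P
    z = NormalDist().inv_cdf(1 - alpha / 2)
    margin = z * sd / sqrt(n)  # the counterpart of CONFIDENCE.NORM
    return centre - margin, centre + margin


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    observations = [rng.gauss(10, sqrt(2)) for _ in range(500)]
    # Draw 10 values with replacement, as INDEX with RANDBETWEEN does.
    sample = [rng.choice(observations) for _ in range(10)]
    print(confidence_interval(sample))
```

Because the normal critical value is used with an estimated standard deviation, intervals from small samples tend to be slightly too narrow, which is one reason coverage can fall below 95% at n=10.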
Then, the proportion of confidence intervals that contain the population mean is
determined by dividing the number of confidence intervals containing the population
mean by the number of intervals, which is 200 as there are 200 outcomes. Whether an
interval contains the population mean is determined using the function
=COUNTIFS(G5, “<=population mean”, H5, “>=population mean”), where G5 and H5 are the
lower boundary and upper boundary respectively. An outcome of 1 is shown if the
population mean lies within the interval, while an outcome of 0 means that it does
not. The coverage probability of the confidence interval for the population mean is
then calculated using the formula:
\[ \text{coverage probability} = \frac{\text{number of confidence intervals containing the population mean}}{200} \]
The coverage probability is calculated for each of the distributions with n=10, n=30
and n=50.
Lastly, the coverage probabilities of the confidence intervals for each population
distribution with different sample sizes are observed and discussed.
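The counting step described in this section can be sketched as a small simulation. This is a minimal Python illustration under the same assumptions as the spreadsheet (sampling with replacement from the generated observations, population standard deviation formula, normal critical value); coverage_probability and its parameters are hypothetical names, and the numbers it prints are not the results reported in this paper.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, pstdev


def coverage_probability(population, true_mean, n, repetitions=200,
                         alpha=0.05, seed=None):
    """Fraction of normal-based intervals that contain `true_mean`."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(repetitions):
        sample = [rng.choice(population) for _ in range(n)]
        centre = mean(sample)
        margin = z * pstdev(sample) / sqrt(n)
        # Score 1 when lower <= true_mean <= upper, otherwise 0.
        if centre - margin <= true_mean <= centre + margin:
            hits += 1
    return hits / repetitions


if __name__ == "__main__":
    rng = random.Random(1)
    population = [rng.gauss(10, sqrt(2)) for _ in range(500)]
    for n in (10, 30, 50):
        print(n, coverage_probability(population, 10, n, seed=n))
```

Each call returns a multiple of 1/200, matching the way the proportion is formed from 200 outcomes of 0 or 1.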
RESULTS AND DISCUSSION
The coverage probability of confidence interval for population mean is shown in the
table below:
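Population distribution               Sample size, n   Standard error   Sampling error   Confidence interval
Normal (mean=10, variance=2)          10               0.0197           0.0386           (0.8764, 0.9537)
Normal (mean=10, variance=2)          30               0.0192           0.0376           (0.8824, 0.9576)
Normal (mean=10, variance=2)          50               0.0180           0.0353           (0.8946, 0.9654)
Normal (mean=10, variance=4)          10               0.0180           0.0353           (0.8946, 0.9654)
Normal (mean=10, variance=4)          30               0.0139           0.0272           (0.9328, 0.9872)
Normal (mean=10, variance=4)          50               0.0202           0.0396           (0.8703, 0.9497)
Chi-squared (1 degree of freedom)     10               0.0226           0.0443           (0.8408, 0.9292)
Chi-squared (1 degree of freedom)     30               0.0174           0.0341           (0.9008, 0.9692)
Chi-squared (1 degree of freedom)     50               0.0256           0.0502           (0.7948, 0.8952)
Chi-squared (10 degrees of freedom)   10               0.0249           0.0482           (0.8062, 0.9038)
Chi-squared (10 degrees of freedom)   30               0.0180           0.0353           (0.8946, 0.9654)
Chi-squared (10 degrees of freedom)   50               0.0192           0.0376           (0.8824, 0.9576)
Chi-squared (30 degrees of freedom)   10               0.0234           0.0459           (0.8292, 0.9208)
Chi-squared (30 degrees of freedom)   30               0.0230           0.0451           (0.8350, 0.9250)
Chi-squared (30 degrees of freedom)   50               0.0226           0.0443           (0.8408, 0.9292)
Formula for standard error:
\[ \text{standard error} = \sqrt{\frac{\hat{p}(1-\hat{p})}{N}} \]
where \hat{p} is the coverage probability and N=200 is the number of confidence intervals.
Formula for sampling error/margin of error:
\[ \text{sampling error} = z_{0.025} \times \text{standard error} = 1.96 \times \text{standard error} \]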
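The tabulated values are consistent with the usual large-sample interval for a proportion, which can be checked with a short Python sketch. The assumptions here are that the standard error is sqrt(p(1-p)/200), that the sampling error is the 95% normal critical value times the standard error, and that the point estimate p is the midpoint of the tabulated interval; proportion_interval is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist


def proportion_interval(p_hat, trials=200, alpha=0.05):
    """Standard error, sampling error and interval for an estimated proportion."""
    standard_error = sqrt(p_hat * (1 - p_hat) / trials)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    margin = z * standard_error
    return standard_error, margin, (p_hat - margin, p_hat + margin)


if __name__ == "__main__":
    # Assumption: 0.915 is taken as the midpoint of the first tabulated
    # interval, (0.8764, 0.9537).
    se, margin, (lower, upper) = proportion_interval(0.915)
    print(se, margin, lower, upper)
```

With p = 0.915 the sketch gives a standard error and sampling error that agree with the first row of the table (0.0197 and 0.0386) to about three decimal places; small differences in the last digit come from rounding.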
From the table, it is shown that, in general, an increase in sample size causes the
confidence interval of the coverage probability to become narrower, as the standard
error and sampling error become smaller. Occasionally, however, the confidence
interval becomes wider when the sample size increases. This might be due to the
higher variability of the population distribution, which means the population
distribution is more spread out. For example, for the normal distribution with
mean=10 and variance=4, there is less consistency in the data generated. As a result,
there are more extreme values, which can affect the width of the confidence interval,
causing it to be wider even if the sample size increases. To put it simply, less
consistent data make it harder to make predictions, so one will be less confident in
the estimate made. Similarly, for the Chi-squared distribution with 1 degree of
freedom, we can see from Diagram 1 that the graph is positively skewed: the range of
the data is very large, from very small values to very large values, with most of
the data clustered at one extreme while the other extreme has little data. The
presence of extreme values causes the differences between data to be large. As a
result, the confidence interval of the coverage probability is wider, as the range of
the data is large.
Apart from that, based on the table, more than half of the confidence intervals of
the coverage probability (8 of the 15) contain 0.95, the nominal confidence level
used for the estimation of the population mean. This agrees with the meaning of a
95% confidence level: in the long run, about 95 out of every 100 samples yield a
confidence interval containing the population mean.
CONCLUSION
In conclusion, an increase in sample size generally causes the confidence interval
of the coverage probability to become narrower, owing to the smaller standard error,
while an underlying distribution that is wider, with higher variability, causes the
confidence interval of the coverage probability to become wider as the standard
error increases. In addition, it is also shown that at a large sample size, the
coverage probability of the confidence interval for the population mean is at least
0.845, as most of the confidence intervals formed from each distribution contain the
population mean.