1/24/07 252onea (Open this document in 'Outline' view!)

ECONOMICS 252 COURSE OUTLINE

A. Parameter Estimation

1. Review of the Normal Distribution
See 251greatD, 251distrex2, 251distrex3, 251distrex4.

2. Point and Interval Estimation
Properties of Estimators. Let $\hat{\theta}$ be an estimator for $\theta$.
i. Unbiasedness: $E(\hat{\theta}) = \theta$.
ii. Consistency (as the sample size gets larger, the estimate gets better).
iii. Efficiency ($\hat{\theta}$ has a small variance). Define BLUE (best linear unbiased estimator).
iv. Maximum Likelihood ($\hat{\theta}$ is the value of $\theta$ that is most likely to have produced the observed data).

3. A Confidence Interval for the Mean when the Population Variance is Known.
a. A Two-Sided Confidence Interval.
An interval of this type is used in two situations: (i) where the population variance, $\sigma^2$, is in fact known and the sample size is relatively large; or (ii) where the variance is not known and the sample variance, $s^2$, is used to replace $\sigma^2$, but the degrees of freedom are so large that the appropriate value of $t^{(n-1)}$ is not very different from $z$. The first of these situations is not very realistic, but it serves as a good introduction to confidence intervals. The formula for this type of confidence interval for the mean is $\mu = \bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$.
Note: If $n > .05N$, use $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$ ($n$ is sample size and $N$ is population size). See 252oneaex1.
Don't use this method unless you know the population variance.
b. A One-Sided Confidence Interval.
There are two types of one-sided confidence interval for the mean: (i) an upper bound and (ii) a lower bound. They have the form $\mu \le \bar{x} + z_{\alpha}\,\sigma_{\bar{x}}$ and $\mu \ge \bar{x} - z_{\alpha}\,\sigma_{\bar{x}}$. An example is in 252oneaex1a.

4. A Confidence Interval for the Mean when the Population Variance is not Known.
"The variance is not known" implies no previous knowledge or assumption about the value of $\sigma^2$. Knowing $s^2$ is having a guess as to what the variance is; it is not the same as knowing the variance.
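Looking back at the known-variance interval of section 3, the two-sided formula can be sketched numerically as follows. This is a minimal illustration, assuming $\sigma$ is genuinely known; the function name `z_interval` and the numbers are hypothetical and are not taken from 252oneaex1.

```python
from math import sqrt
from statistics import NormalDist


def z_interval(xbar, sigma, n, conf=0.95, N=None):
    """Two-sided interval xbar +/- z_{alpha/2} * sigma_xbar for a known sigma.

    N is the population size; when given and n > .05 N, the finite
    population correction sqrt((N - n) / (N - 1)) is applied.
    """
    alpha = 1 - conf
    z = NormalDist().inv_cdf(1 - alpha / 2)   # critical value z_{alpha/2}
    se = sigma / sqrt(n)                      # sigma_xbar = sigma / sqrt(n)
    if N is not None and n > 0.05 * N:
        se *= sqrt((N - n) / (N - 1))         # finite population correction
    return xbar - z * se, xbar + z * se


# Hypothetical numbers chosen only for illustration.
low, high = z_interval(xbar=100.0, sigma=15.0, n=36)
print(round(low, 2), round(high, 2))
```

Passing a population size such as `N=200` triggers the correction and gives a slightly narrower interval. A one-sided bound would use $z_{\alpha}$ in place of $z_{\alpha/2}$.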
If the population distribution is normal or approximately normal, the formula for a two-sided confidence interval for the mean is $\mu = \bar{x} \pm t_{\alpha/2}^{(n-1)}\,s_{\bar{x}}$, where $s_{\bar{x}} = \frac{s}{\sqrt{n}}$.
Note: If $n > .05N$, use $s_{\bar{x}} = \frac{s}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}}$. See 252oneaex2 and 252oneaex3.
Note: This is the more common case. If you do not know the population variance and the sample size is not very large, using $z$ instead of $t$ is a very bad idea.

5. Deciding on Sample Size when working with a Mean
The formula usually suggested is $n = \frac{z^2\sigma^2}{e^2}$, where $e$ is the largest error we are willing to accept and where, if $\sigma$ is not known, it can be approximated by $\frac{x_{.001} - x_{.999}}{6}$.

6. A Confidence Interval for a Proportion.
(a. Small Samples.
Table 16 gives confidence intervals for proportions. These tables are of use when the conditions under which one can use the normal distribution do not hold. For example, if $n = 10$ and $\bar{p} = .5$, and we wish to find a 95% confidence interval, we can look at the horizontal axis of the upper table. There we can find $\bar{p} = .5$ and look up to find the upper and lower curves for $n = 10$. The vertical line at $\bar{p} = .5$ intersects these curves. The lower curve meets the vertical line at about $p = .175$. (Read up the vertical axis.) The upper curve meets the vertical line at about $p = .825$, so that our 95% confidence interval is about $.175 \le p \le .825$.)
b. Large Samples.
More usually, using the normal approximation to the binomial distribution, using $p$ for the population probability of success and $q$ for the population probability of failure, and letting $\bar{p}$ and $\bar{q}$ be the corresponding sample quantities, we can write $p = \bar{p} \pm z_{\alpha/2}\,s_{\bar{p}}$, where $s_{\bar{p}} = \sqrt{\frac{\bar{p}\,\bar{q}}{n}}$ and $\bar{q} = 1 - \bar{p}$. An example is in 251 proport.
c. Deciding on Sample Size.
The usually suggested formula is $n = \frac{pq\,z^2}{e^2}$, but since $p$ is usually unknown, a conservative choice is to set $p = 0.5$. This is the formula everyone forgets that we covered.

7. A Confidence Interval for a Variance.
This method is only appropriate when the population distribution is normal or approximately normal. For small samples use
$\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$,
but if the degrees of freedom are too large for the chi-square table use the following interval for the standard deviation:
$\frac{s\sqrt{2DF}}{z_{\alpha/2} + \sqrt{2DF}} \le \sigma \le \frac{s\sqrt{2DF}}{-z_{\alpha/2} + \sqrt{2DF}}$.
An example is in 252oneaex4.

(8. Appendix: A Confidence Interval for a Median.
In a situation where the population distribution is not normal, it is often more appropriate to find the median than the mean. The process of finding a confidence interval for a median is based on one simple fact: the probability that a single number picked at random from a population is above (or below) the median is 50%. Similarly, the probability that any two numbers picked at random from a population are both above (or both below) the median is 25%. This comes from the multiplication rule: If $A$ is the event that the first number is above the median, and $B$ is the event that the second number is above the median, then $P(A \cap B) = P(A)\,P(B)$ if $A$ and $B$ are independent events.

If the probability of both numbers being above the median is 25%, and the probability of both numbers being below the median is 25%, then the probability that both numbers are on the same side of the median is 50%. This is due to the addition rule: Let event $C$ be "both numbers are above the median," and event $D$ be "both numbers are below the median." Then event $C \cup D$ is "both numbers are on the same side of the median." The addition rule says that if $C$ and $D$ are mutually exclusive, $P(C \cup D) = P(C) + P(D)$. Finally, if the probability that both numbers are on the same side of the median is 50%, then the probability that the two numbers are on opposite sides of the median is also 50%. This means that, since any two numbers picked from the sample have a 50% chance of bracketing the median, these two numbers constitute a 50% confidence interval.

Note that, since $p$, the probability that any one number is above the median, is 0.5, and $q$, the probability that any one number is below the median, is also 0.5, we have a problem that resembles finding the distribution of the number of heads on two tosses of a fair coin.
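The multiplication and addition rules used in this argument can be checked by listing the four equally likely outcomes for two independent draws. This is only a small sketch; it assumes that no draw falls exactly on the median, and the variable names are illustrative.

```python
from itertools import product

# Each draw is independently above or below the median with probability 0.5.
# Assumption: ties with the median are ignored (continuous population).
sides = ("above", "below")
p_side = 0.5

p_same = 0.0
p_opposite = 0.0
for first, second in product(sides, repeat=2):
    p = p_side * p_side      # multiplication rule for independent draws
    if first == second:
        p_same += p          # addition rule: these outcomes are mutually exclusive
    else:
        p_opposite += p

print(p_same, p_opposite)    # 0.5 0.5
```

The two draws bracket the median exactly when they fall on opposite sides, which is why they form a 50% confidence interval.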
If we call a head a success, the distribution of heads on two tosses is described by the binomial distribution with $n$ (the number of tries) set at 2, and $p$ (the probability of success on one try) set at 0.5. For convenience, we will use $q$ (the probability of failure on one try) for the probability that one number is below the median or of getting a tail on one toss of a fair coin. It is always true that $q = 1 - p$. The formula for the binomial distribution is $P(x) = C_x^n\,p^x q^{n-x}$, where $x$ is the number of successes. For the probability of two successes (heads) in 2 tries, we find that $P(2) = C_2^2 (.5)^2 (.5)^0 = 1(.25)(1) = .25$. We find the probability of two heads or two tails in two tries by noting that the probability of two failures (tails) is $P(0) = C_0^2 (.5)^0 (.5)^2 = .25$. Thus the probability of two heads or two tails is $P(2) + P(0) = .25 + .25 = .50$. This is the same as the probability of two randomly picked numbers both being on the same side of the median.

To take this a bit further, let us assume that we take a sample of $n$ numbers from a population and then take two numbers at equal distances from the ends of the sample (for example, the fourth lowest and the fourth highest of a sample of 20 numbers). We will find that it is relatively easy to figure out the probability that these numbers bracket the median, and this will be our confidence level.

This process requires some new thinking because: (i) we find our confidence interval without using a point estimate, as we did in every previously studied method for constructing a confidence interval; and (ii) we find the interval first and then figure out its confidence level, instead of starting with a confidence level and then figuring out the interval. This process serves as an introduction to the field of nonparametric statistics, which is largely made up of methods that construct intervals and tests without assuming that the parent distribution (the distribution of the population from which the sample is drawn) is normal.
In the case of finding a median, the process to be explained would be unnecessary if the parent population were normal, because in a normal population the mean and median are identical. Therefore, if the parent population is normal, we could use a method for finding a confidence interval for the mean in place of a method for finding a confidence interval for a median.

Assume that we pick a sample of four from a population, and that this sample, when put in ascending order, is $20, 25, 29, 30$. If we use two numbers at equal distances from the ends as our confidence interval, we can use $20 \le \eta \le 30$ or $25 \le \eta \le 29$ ($\eta$ is our symbol for a population median). The first of these intervals ($20 \le \eta \le 30$) is wrong only if all four numbers in the sample are below the median or all four numbers are above the median. The probability that all four are above the median is the same as the probability of four heads in four tosses, $P(4) = C_4^4 (.5)^4 (.5)^0 = .0625$. The probability that all four numbers are below the median is the same as the probability of four tails on four tosses, $P(0) = C_0^4 (.5)^0 (.5)^4 = .0625$.

We can find the probability of all four being above the median from a cumulative binomial table by noting that, for $n = 4$, $P(x \ge 4) = P(x = 4) = 1 - P(x \le 3)$. The binomial table will tell us that, for $p = .5$, $P(x \le 4) = 1$ and $P(x \le 3) = .9375$, so $P(x \ge 4) = 1 - .9375 = .0625$. Since the probability that all four numbers are below the median, $P(x \le 0)$, is the same as the probability that all four numbers are above the median, the probability that the two numbers do not bracket the median (the probability that we are wrong, or the significance level) is $\alpha = 2P(x \le 0) = 2(.0625) = .1250$. The confidence level is thus $1 - \alpha = 1 - 2P(x \le 0) = 1 - 2(.0625) = .8750$.

Now try picking the confidence interval $25 \le \eta \le 29$ by choosing the numbers $x_2$ and $x_3$, that is, the second from the bottom and the second from the top in the ordered sample $20, 25, 29, 30$.
This interval is invalid if (i) the lowest three or more numbers in the sample are below the median (equivalent to three or more tails when a coin is tossed four times), or (ii) the highest three or more numbers in the sample are above the median (equivalent to three or more heads). The probability of the first of these events is (for $n = 4$ and $p = .5$) $P(x \le 1)$, and the probability of the second event is $P(x \ge 3)$. But, using the binomial table, we find that $P(x \ge 3) = 1 - P(x \le 2) = P(x \le 1) = .3125$. So the probability that the interval does not bracket the median is $\alpha = 2P(x \le 1) = 2(.3125) = .6250$, and the confidence level is $1 - \alpha = 1 - 2P(x \le 1) = 1 - 2(.3125) = .3750$.

Generalize this to a situation where we take a random sample of $n$ items from a population and put the numbers in ascending order so that $x_1 \le x_2 \le x_3 \le \dots \le x_{n-1} \le x_n$. Now pick $x_k$ and $x_{n-k+1}$, the numbers that are the $k$th from the bottom and the $k$th from the top, respectively. This interval is invalid if (i) all the numbers in the interval and all the numbers below the interval are below the median, or (ii) all the numbers in the interval and all the numbers above the interval are above the median. The probability of the first event is $P(x \le k-1)$ and the probability of the second event is $P(x \ge n-k+1) = P(x \le k-1)$; the equality is due to the symmetry of the binomial distribution for $p = .5$. So $\alpha = 2P(x \le k-1)$, and the confidence level is $1 - \alpha = 1 - 2P(x \le k-1)$. For example, if we take a sample of 100 items, put them in order, and then use the interval $x_{38} \le \eta \le x_{63}$, that is, the 38th number from the bottom and the 38th number from the top, the confidence level (from the binomial table for $n = 100$ and $p = .5$) is $1 - \alpha = 1 - 2P(x \le 37) = 1 - 2(.0060) = .9880$.

There will be some situations in which we cannot find $P(x \le k-1)$ on the cumulative binomial table. Then we must use a normal approximation to the binomial distribution, that is (using a continuity correction), find the normal probability
$P(x \le k-1) \approx P\left(z \le \frac{k - 1 + \frac{1}{2} - np}{\sqrt{npq}}\right) = P\left(z \le \frac{k - .5 - .5n}{.5\sqrt{n}}\right)$.
(In the last part of this equality, .5 was substituted for both $p$ and $q$.) This takes us back to a more conventional formulation for the confidence interval, because we can choose $k$ so that $-z_{\alpha/2} = \frac{k - .5 - .5n}{.5\sqrt{n}}$. If we solve this equation for $k$, we find that $k = \frac{n + 1 - z_{\alpha/2}\sqrt{n}}{2}$. Thus if we want a 95% confidence interval for the median, and we take a sample of $n = 150$, we pick $k = \frac{150 + 1 - 1.96\sqrt{150}}{2} = 63.4975$. Rounding $k$ down to 63, our interval will then be $x_{63} \le \eta \le x_{88}$.)

© 2002 R. E. Bove
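The median-interval procedure of the appendix can be sketched in a few lines. This is a rough illustration under the appendix's own assumptions ($p = q = .5$ and independent draws); the function names `binom_cdf`, `median_ci_level`, and `k_from_normal_approx` are hypothetical, and rounding $k$ down is a choice made here to keep the interval conservative.

```python
from math import comb, floor, sqrt
from statistics import NormalDist


def binom_cdf(k, n, p=0.5):
    """P(x <= k) for a binomial(n, p) count; returns 0 when k is negative."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(0, k + 1))


def median_ci_level(n, k):
    """Exact confidence level 1 - 2 P(x <= k - 1) of the interval [x_k, x_{n-k+1}]."""
    return 1 - 2 * binom_cdf(k - 1, n)


def k_from_normal_approx(n, conf=0.95):
    """k = (n + 1 - z_{alpha/2} sqrt(n)) / 2, rounded down (conservative choice)."""
    z = NormalDist().inv_cdf(1 - (1 - conf) / 2)
    return floor((n + 1 - z * sqrt(n)) / 2)


print(median_ci_level(4, 1))       # 0.875, the interval 20 to 30
print(median_ci_level(4, 2))       # 0.375, the interval 25 to 29
print(k_from_normal_approx(150))   # 63, so the interval runs from x_63 to x_88
```

For $n = 100$ and $k = 38$, `median_ci_level` gives a value close to the .9880 read from the binomial table. Computing the exact level this way is also an option when the table does not cover the sample size at hand.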