Chapter 5: Common Distributions

In this chapter we examine four distributions that will be encountered frequently later in the course.

5.1 The Normal Distribution

The normal distribution is the most widely used distribution in statistics. Continuous data such as mass, length, etc. can often be modelled using a normal distribution. The normal distribution has two parameters: the mean (μ) and the variance (σ²). If a random variable X has a normal distribution, we write X ~ N[μ, σ²]. A normal distribution with μ = 0 and σ = 1 is referred to as the standard normal distribution (a random variable with this distribution is usually denoted Z).

Important result: If X is a random variable distributed as N[μ, σ²], then

    (X − μ)/σ ~ N[0, 1].

The process of subtracting the mean and dividing by the standard deviation is referred to as standardisation: if X ~ N[μ, σ²] (general normal), then Z = (X − μ)/σ ~ N[0, 1] (standard normal).

[Figure: the densities of N[0, 1] and N[0, 9], drawn in R with dnorm(x) and dnorm(x, sd = 3).]

Example: The fully grown lengths (in mm) of a certain insect can be regarded as having the following normal distribution: X ~ N[64, 16]. What is the probability that an insect has length less than 59 mm?

Applying the standardisation formula,

    z = (59 − 64)/4 = −1.25.

Thus, P(X < 59) = P(Z < −1.25) = P(Z > 1.25) = 1 − Φ(1.25) = 1 − 0.8944 = 0.1056.

5.1.1 Percentage points

Definition: Consider a random variable X with some distribution. The (upper) 100α% point is the value of x such that P(X > x) = α.

For the standard normal distribution, we denote the (upper) 100α% point by z_α, i.e. P(Z > z_α) = α.

In statistical tables (e.g. Lindley and Scott), there is a separate percentage point table covering the most-used values of α. In Lindley and Scott, P represents 100α and x(P) represents the value of z_α. Extract:

    P = 100α     10%      5%       2%       1%       0.1%
    α            0.1      0.05     0.02     0.01     0.001
    x(P) = z_α   1.2816   1.6449   2.0537   2.3263   3.0902

For example, the 10% point for the standard normal is z_0.1 = 1.2816.

Example 1: Let X ~ N[50, 16]. Find the value of x such that P(X > x) = 0.05, i.e. find the (upper) 5% point.

If X ~ N[50, 16], then (X − 50)/4 ~ N[0, 1]. The 5% point for the standard normal is z_0.05 = 1.6449, so the 5% point for a N[50, 16] distribution can be obtained by solving

    (x − 50)/4 = 1.6449.

So, the 5% point is x = 50 + 1.6449 × 4 = 56.5796.

Example 2: Let Z ~ N[0, 1]. Find the value of z such that P(Z < z) = 0.01 (i.e. find the lower 1% point).

The upper 1% point for a standard normal is z_0.01 = 2.3263, so P(Z > 2.3263) = 0.01. By symmetry, we must also have P(Z < −2.3263) = 0.01. So, the lower 1% point is −2.3263.
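The figure above was drawn with R's dnorm function; the calculations in this section can be checked in the same way with pnorm (the distribution function Φ) and qnorm (its inverse). A minimal sketch, using the numbers from the examples above:

    # P(X < 59) for X ~ N[64, 16], i.e. sd = 4 (the insect example)
    pnorm(59, mean = 64, sd = 4)             # 0.1056
    pnorm(-1.25)                             # the same value, after standardising

    # Upper 5% point of N[50, 16]
    qnorm(0.05, mean = 50, sd = 4, lower.tail = FALSE)   # 56.5796
    50 + qnorm(0.95) * 4                                 # equivalent, via z_0.05

    # Lower 1% point of the standard normal
    qnorm(0.01)                              # -2.3263

Note that R works with the standard deviation σ rather than the variance σ², so N[64, 16] is entered with sd = 4.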
5.2 The chi-squared distribution

5.2.1 Introduction

The chi-squared (χ²) distribution has a single parameter called the degrees of freedom; this can be any positive integer. The χ² distribution with n degrees of freedom is denoted χ²_n.

Probability density function: If X ~ χ²_n, then the p.d.f. of X (for x > 0) is given by

    f(x) = 1/(2^{n/2} Γ(n/2)) x^{n/2 − 1} e^{−x/2}.

For x ≤ 0, f(x) = 0.

This density is written in terms of the gamma function. Some of the key properties of this function are:

    Γ(x) = (x − 1) Γ(x − 1);   Γ(1/2) = √π;   Γ(x) = (x − 1)! if x is a natural number.

The degrees of freedom, n, define the shape of the χ² density. For n < 3, the density has a mode at zero. For n ≥ 3, the mode moves further away from zero as n increases. The shapes of some specific densities are shown below.

[Figure: graphs of several chi-squared densities, for n = 1, 2, 4 and 8.]

5.2.2 Finding probabilities

Probabilities associated with the χ² distribution can be looked up in probability tables. Lindley and Scott list the degrees of freedom (which they denote ν) along the top of each column. Then, for each value x listed, the values in the table are the probability that X < x. Extracts:

    ν = 3                  ν = 7
    x     P(X < x)         x      P(X < x)
    0.0   0.0000           1.0    0.0052
    0.5   0.0811           2.0    0.0402
    1.0   0.1987           3.0    0.1150
    1.5   0.3177           4.0    0.2202
    2.0   0.4276           5.0    0.3400
    2.5   0.5247           6.0    0.4603
    3.0   0.6084           7.0    0.5711
    3.5   0.6792           8.0    0.6674
    4.0   0.7385           9.0    0.7473
    etc.                   10.0   0.8114

Example 1: If X ~ χ²_3, then P(X < 2.5) = 0.5247.

Example 2: Suppose X ~ χ²_7. Find P(X > 10).

From tables we find P(X < 10) = 0.8114, so P(X > 10) = 1 − 0.8114 = 0.1886.

5.2.3 Percentage points

The 100α% point for the χ²_n distribution is denoted χ²_{n,α}. Therefore, if X ~ χ²_n, then P(X > χ²_{n,α}) = α. The percentage points of the χ² distribution are in a separate table in Lindley and Scott. In this table, the degrees of freedom are listed going down the rows and P is 100α. Extract:

    P        99      95      10      5       1
    ν = 1    0.000   0.004   2.706   3.841   6.635
    ν = 2    0.020   0.103   4.606   5.991   9.210
    ν = 3    0.115   0.352   6.251   7.815   11.34
    ν = 4    0.297   0.711   7.779   9.488   13.28
    ν = 5    0.554   1.145   9.236   11.07   15.09
    ν = 6    0.872   1.635   10.64   12.59   16.81
    ν = 7    1.239   2.167   12.02   14.07   18.48
    ν = 8    1.646   2.733   13.36   15.51   20.09

For example, χ²_{5,0.1} = 9.236, so if X ~ χ²_5 then P(X > 9.236) = 0.1.

The chi-squared distribution is not symmetric (unlike the normal distribution). So, if we want a lower percentage point (i.e. a value of x such that P(X < x) = α), we can't simply negate the corresponding upper percentage point. Instead we need to find χ²_{n,1−α}.

Example 1: Let X ~ χ²_8. Find the lower 1% point (i.e. the value of x such that P(X < x) = 0.01).

The lower 1% point is denoted χ²_{8,0.99}; from the table above, its value is 1.646.

Example 2: Suppose X ~ χ²_10. Find the value of t for which P(X > t) = 0.1321.

Here, t would be the 13.21% point for the distribution. But 0.1321 is a non-standard value of α, so we need to use the distribution function table to find t. P(X > t) = 0.1321 implies P(X < t) = 1 − 0.1321 = 0.8679. Going through the distribution table we find that t = 15.
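The same look-ups can be checked in R with pchisq and qchisq; a minimal sketch using the examples above:

    # P(X > 10) for X ~ chi-squared with 7 degrees of freedom
    pchisq(10, df = 7, lower.tail = FALSE)    # 0.1886

    # Upper 10% point of chi-squared(5)
    qchisq(0.1, df = 5, lower.tail = FALSE)   # 9.236

    # Lower 1% point of chi-squared(8), i.e. the upper 99% point
    qchisq(0.01, df = 8)                      # 1.646

    # Non-standard point: t such that P(X > t) = 0.1321 for chi-squared(10)
    qchisq(0.8679, df = 10)                   # approximately 15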
5.3 The Student t-distribution

5.3.1 Introduction

Definition: Suppose that we have two independent random variables Y and Z such that Y ~ N[0, 1] and Z ~ χ²_n. Then the random variable X defined by

    X = Y / √(Z/n)

has a t-distribution with n degrees of freedom, denoted t_n.

The t-distribution is symmetric about zero and its general shape is like the bell shape of a normal distribution. However, the tails of the t-distribution can approach zero much more slowly than those of the normal distribution, i.e. the t-distribution is more heavy-tailed than the normal. The degrees of freedom define how heavy-tailed the t-distribution is.

Note: The t-distribution with n = 1 is sometimes referred to as the Cauchy distribution. This is so heavy-tailed that its mean and variance do not exist! (This is because the integrals specifying the mean and variance are not absolutely convergent.)

Important note: The density of a t-distribution converges to that of the standard normal as n → ∞. The diagram below shows how the t-distribution varies for different degrees of freedom.

[Figure: the densities of t_2, t_5 and t_20 compared with the standard normal density.]

5.3.2 Probabilities

Probabilities associated with the t-distribution can be looked up in tables. In Lindley and Scott, the degrees of freedom are again denoted by ν and are listed along the top of the columns. Then, for each value t listed, the values in the table are the probability that X < t.

Example 1: Let X ~ t_3. Then P(X < 2.5) = 0.9561.

Example 2: Let X ~ t_12. Find P(X > 2.5).

P(X > 2.5) = 1 − P(X < 2.5) = 1 − 0.986 = 0.014.

5.3.3 Percentage points

The 100α% point for the t_n distribution is denoted by t_{n,α}. If X ~ t_n, then P(X > t_{n,α}) = α. Percentage points for the t-distribution are tabulated separately. The degrees of freedom for the distribution are listed down the rows and P = 100α.

Example 1: Find the 5% point for t_6.

Directly from tables, this is seen to be t_{6,0.05} = 1.943. (Thus P(X > 1.943) = 0.05.)

As the t-distribution is symmetric, finding lower percentage points is simple.

Example 2: Let X ~ t_10. Find the value of t such that P(X < t) = 0.01 (i.e. find the lower 1% point).

The upper 1% point is t_{10,0.01} = 2.764. But P(X > 2.764) = 0.01 implies P(X < −2.764) = 0.01. So, the lower 1% point, t, is −2.764.

Note: To find non-standard percentage points (such as the 12.5% point, for example), we need to use the t-distribution function table.
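In R, the corresponding functions are pt and qt; a minimal sketch of the examples above:

    # P(X > 2.5) for X ~ t with 12 degrees of freedom
    pt(2.5, df = 12, lower.tail = FALSE)    # 0.014

    # Upper 5% point of t_6
    qt(0.05, df = 6, lower.tail = FALSE)    # 1.943

    # Lower 1% point of t_10: by symmetry, minus the upper 1% point
    qt(0.01, df = 10)                       # -2.764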
5.4 The (Fisher's) F-distribution

5.4.1 Introduction

Definition: Consider two independent random variables Y and Z such that nY ~ χ²_n and mZ ~ χ²_m. The random variable X defined by

    X = Y/Z

is then said to have an F-distribution with n and m degrees of freedom, denoted F_{n,m}.

The F-distribution therefore has two parameters, both of which are degrees of freedom. The order of the degrees of freedom is important! The F_{n,m} distribution is not the same as the F_{m,n} distribution.

Note: The density of the F-distribution is only defined for positive values of x.

The values of the two degrees of freedom define the shape of the distribution. Plots of the F-distribution for various values of n and m are shown below.

[Figure: graphs of several F densities, with n = m = 2, 4, 8 and 20.]

[Figure: graphs of several more F densities, with (n, m) = (2, 4), (4, 2), (5, 10) and (10, 20).]

Lindley and Scott do not have tables for looking up probabilities associated with the F-distribution.

5.4.2 Percentage points

Separate tables giving the 10, 5, 2.5, 1, 0.5 and 0.1 percentage points for F-distributions with different combinations of degrees of freedom can be found in Lindley and Scott. We will denote the (upper) 100α% point for the F_{n,m} distribution by F_{n,m,α}. If X ~ F_{n,m}, then P(X > F_{n,m,α}) = α.

In the table of the 100α percentage points for the F-distribution, the first degrees of freedom is denoted ν₁ and listed along the columns; the second degrees of freedom is denoted ν₂ and listed down the rows. Extract (1% points of the F-distribution):

             ν₁ = 1   ν₁ = 2   ν₁ = 3   ν₁ = 4   ν₁ = 5
    ν₂ = 1   4052     4999     5403     5625     5764
    ν₂ = 2   98.50    99.00    99.17    99.25    99.30
    ν₂ = 3   34.12    30.82    29.46    28.71    28.24
    ν₂ = 4   21.20    18.00    16.69    15.98    15.52
    ν₂ = 5   16.26    13.27    12.06    11.39    10.97

For example, the (upper) 1% point for an F_{5,3} distribution is 28.24. We write F_{5,3,0.01} = 28.24.

Example: Find the 5% point for both the F_{5,10} and the F_{10,5} distributions.

From the 5% points table: F_{5,10,0.05} = 3.326 and F_{10,5,0.05} = 4.735. Notice that these are not the same.

The tables in Lindley and Scott give the upper percentage points only, i.e. they give the values of x such that P(X > x) = α for small values of α. Since the F-distribution is not symmetric, to find lower percentage points we cannot simply use the negative of the corresponding upper percentage point: P(X < −x) ≠ P(X > x). The density is in fact not even defined for x < 0.

5.4.3 Finding lower percentage points

Result: Suppose that X = Y/Z ~ F_{n,m}. Then 1/X = Z/Y ~ F_{m,n}.

Proof: Y/Z ~ F_{n,m} if nY ~ χ²_n and mZ ~ χ²_m. But, by the definition of the F-distribution, this means that Z/Y ~ F_{m,n}, as required.

We can use this result to find lower percentage points for F-distributions.

Important result: The lower 100α percentage point for the F_{n,m} distribution is the reciprocal of the upper 100α percentage point of the F_{m,n} distribution.

Proof: If X ~ F_{n,m} and x represents the lower 100α percentage point of this distribution, then P(X < x) = α. But

    P(X < x) = P(1/X > 1/x).

As 1/X ~ F_{m,n}, 1/x is (by definition) the upper 100α percentage point of the F_{m,n} distribution. So,

    x = 1 / F_{m,n,α}.

Example 1: Let X ~ F_{5,10}. Suppose we wish to find x such that P(X < x) = 0.05, i.e. we want to find the lower 5% point of the F_{5,10} distribution.

The lower 5% point of the F_{5,10} distribution is the reciprocal of the upper 5% point of the F_{10,5} distribution. So,

    x = 1 / F_{10,5,0.05} = 1/4.735 = 0.2112.

Example 2: Suppose X ~ F_{4,7}. Find the upper and lower 10% points.

The upper 10% point can be found directly from tables: F_{4,7,0.1} = 2.961. The lower 10% point is the reciprocal of the upper 10% point of the F_{7,4} distribution:

    Lower 10% point = F_{4,7,0.9} = 1 / F_{7,4,0.1} = 1/3.979 = 0.2513.

Exercise: Suppose X ~ F_{2,4}. Find the upper and lower 1% points.
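In R, percentage points of the F-distribution are given by qf, which also provides a check of the reciprocal result above; a minimal sketch using the worked examples (not the exercise):

    # Upper 5% points of F(5,10) and F(10,5): note that they differ
    qf(0.05, df1 = 5, df2 = 10, lower.tail = FALSE)      # 3.326
    qf(0.05, df1 = 10, df2 = 5, lower.tail = FALSE)      # 4.735

    # Lower 5% point of F(5,10) via the reciprocal result...
    1 / qf(0.05, df1 = 10, df2 = 5, lower.tail = FALSE)  # 0.2112

    # ...which agrees with asking for the lower tail directly
    qf(0.05, df1 = 5, df2 = 10)                          # 0.2112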
5.5 Some additional facts about distributions

1) If X_1, ..., X_n are independent with X_i ~ N[μ_i, σ_i²], i = 1, ..., n, then

    a_0 + Σ_{i=1}^n a_i X_i ~ N[a_0 + Σ_{i=1}^n a_i μ_i, Σ_{i=1}^n a_i² σ_i²];

2) If X_1, ..., X_n are i.i.d. as N[0, 1], then
   (a) X_i² ~ χ²_1, for i = 1, 2, ..., n;
   (b) Σ_{i=1}^n X_i² ~ χ²_n;

3) If X_1, ..., X_n are independent with X_i ~ χ²_{k_i}, i = 1, ..., n, then

    Σ_{i=1}^n X_i ~ χ²_k, where k = k_1 + ... + k_n;

4) If X ~ t_n, then X² ~ F_{1,n}.

These results are not proved in this course.

Chapter 6: Sampling Distributions

6.1 Parameters

The purpose of many statistical investigations is to learn about the distribution of some random variable X. Many aspects of X's distribution may be of interest, but attention often focuses on one or two particular population characteristics.

Example 1: A bakery needs to decide how many loaves of fresh bread it should put out on its shelves each day. If it puts out too many, it will lose money as stale bread will not sell; if it puts out too few, it will lose potential sales. Therefore, to help the bakery make its order, interest might focus on the mean number of loaves, μ, usually sold on a particular day.

Example 2: Suppose that a company has the job of packing a certain breakfast cereal into boxes, so that each box contains approximately 500 g of cereal. The weight of cereal in each box varies around 500 g due to the variability of the cereal product. The company wants to check that the amount going into each box doesn't vary too much about 500 g: weights greater than 500 g will lose the company money, and weights less than 500 g could lead to customer dissatisfaction. In this case, attention may focus on the variability of weights in the boxes as described by σ, the standard deviation of the weights.

Example 3: When testing a new drug, a doctor might not be interested so much in the number of people cured by the drug, but rather the proportion, π, of people who are cured by the drug.

We call μ, σ and π population parameters. To learn about such parameters, we can observe a random sample of n observations, x_1, ..., x_n, and then use these data to calculate estimates of the parameter(s) of interest. For example, a sample mean could be used to estimate μ.

Definition: Any quantity computed from values in a sample is called a (sample) statistic.

Example: All the numerical summaries introduced in Chapter 2 are statistics, as they are all calculated from values in the random sample. This includes statistics such as the sample mean (which utilises all the observations in its calculation) and the sample median (which only takes account of the middle observations).

It is important to realise that there is a difference between population parameters and sample statistics. A population parameter is a characteristic of the distribution of the random variable; it is typically unknown and cannot be observed. By contrast, a statistic is a characteristic of the sample and can be observed. For example, the population mean μ has some fixed (but unknown) value. On the other hand, the sample mean, X̄, can be observed and therefore can be known for a particular sample. The observed value of X̄, however, can vary from sample to sample (as different samples will give different values of x_1, ..., x_n). The value of a statistic, therefore, is subject to sampling variability.

Definition: As a statistic is a function of the random variables X_1, ..., X_n, it is itself a random variable. The distribution of a statistic is called its sampling distribution.

The sampling distribution of a statistic describes the long-run behaviour of the statistic's values when many different samples, each of size n, are obtained and the value of the statistic is computed for each sample.

6.2 The sampling distribution of the sample mean

To investigate the sampling distribution of X̄, we will consider several experiments.

Experiment 1: We generate 500 random samples (each of size n) from N[100, 400]. For each of these 500 samples we calculate x̄, so we have a random sample of 500 observations from the sampling distribution of X̄. This was repeated for n = 5, 20, 50.

[Figure: histograms of the 500 sample means for n = 5, 20 and 50.]

Observations: In each case the distribution seems roughly normal, and it is clear that each of these histograms is centred roughly at 100 (the mean of the normal distribution from which the samples were generated). We can also see that as the sample size n increases, the variability in the sampling distributions decreases (look carefully at the scales on the horizontal axes). These points can also be seen if we look at some statistics relating to each histogram above:

    Sample size   Mean     Standard deviation
    n = 5         100.07   8.17
    n = 20        99.83    4.40
    n = 50        100.05   2.81
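Experiments like this are straightforward to reproduce by simulation; a minimal R sketch of Experiment 1 (the seed is an arbitrary choice, and note that N[100, 400] has standard deviation √400 = 20):

    set.seed(1)

    # 500 sample means from N[100, 400] for a given sample size n
    sample_means <- function(n, reps = 500) {
      replicate(reps, mean(rnorm(n, mean = 100, sd = 20)))
    }

    for (n in c(5, 20, 50)) {
      xbar <- sample_means(n)
      cat("n =", n, " mean =", round(mean(xbar), 2),
          " sd =", round(sd(xbar), 2), "\n")   # sd should be near 20/sqrt(n)
    }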
We will now carry out a similar set of experiments to see what the sampling distribution of X̄ is like when we are not sampling from the normal distribution.

Experiment 2: We generate 500 random samples (each of size n) from a uniform U[0, 1] distribution. Again, for each of these 500 samples we calculate x̄, so we have a random sample of 500 observations from the sampling distribution of X̄. This was repeated for n = 5, 10, 20, 50.

Note: If X ~ U[0, 1], then E[X] = 0.5 and Var[X] = 1/12 (so s.d. = 0.289).

[Figure: histograms of the 500 sample means for n = 5, 10, 20 and 50.]

Observations: The shapes of the histograms of the sample means look increasingly like normal distributions as n increases, despite the data being sampled from a uniform distribution. The histograms in each case seem to centre on 0.5 (the mean of the U[0, 1] distribution). Also, the variability of the sampling distributions decreases as the sample size becomes larger. The mean and standard deviation of the data in the four situations above are given below:

    Sample size   Mean    Standard deviation
    n = 5         0.491   0.133
    n = 10        0.504   0.095
    n = 20        0.502   0.068
    n = 50        0.499   0.042

Important result: For an independent random sample X_1, ..., X_n from a distribution with mean μ and variance σ², the sampling distribution of X̄ has the following properties:

1. E[X̄] = μ.
2. Var[X̄] = σ²/n. The standard deviation of X̄ (often called the standard error) is therefore σ/√n.
3. If each X_i ~ N[μ, σ²], then X̄ ~ N[μ, σ²/n], regardless of the size of n.
4. If X_1, ..., X_n are not normally distributed, then when n is large (say at least 30) the distribution of X̄ is approximately N[μ, σ²/n].

Proof:

1. E[X̄] = E[(1/n) Σ_{i=1}^n X_i] = (1/n) Σ_{i=1}^n E[X_i] = (1/n)(nμ) = μ (as required).
2. Because we are assuming that the random variables are independent, we can also write Var[X̄] = Var[(1/n) Σ_{i=1}^n X_i] = (1/n²) Σ_{i=1}^n Var[X_i] = (1/n²)(nσ²) = σ²/n (as required).
3. A linear combination of normally distributed random variables also has a normal distribution; the mean and variance are as given above.
4. Not proved here.

Note: Part (4) of the above result is the Central Limit Theorem, an extremely powerful and useful result in statistics.

Example 1: X_1, ..., X_20 are independently and identically distributed N[30, 5]. Find the sampling distribution of X̄.

Here n = 20 and so X̄ ~ N[30, 5/20] = N[30, 0.25].

Example 2: X_1, ..., X_40 are i.i.d. Po(10) random variables. What, approximately, is the sampling distribution of X̄?

The sample size can be considered large enough for the Central Limit Theorem to be applied, so the sampling distribution can be considered approximately normal. A Po(10) distribution has mean and variance equal to 10, therefore X̄ ~ N[10, 10/40] = N[10, 0.25] (roughly).
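The standard error formula σ/√n can be checked against Experiment 2: for U[0, 1], σ = √(1/12) ≈ 0.289, so the predicted standard errors for n = 5, 10, 20 and 50 are 0.129, 0.091, 0.065 and 0.041, close to the observed values in the table above. A minimal R sketch of this check:

    sigma <- sqrt(1/12)   # s.d. of U[0, 1]

    for (n in c(5, 10, 20, 50)) {
      xbar <- replicate(500, mean(runif(n)))
      cat("n =", n,
          " observed sd =", round(sd(xbar), 3),
          " theoretical sd =", round(sigma / sqrt(n), 3), "\n")
    }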
6.3 Sampling distribution of the sample proportion

In many statistical investigations we are interested in learning about the proportion of individuals, or objects, in a population that possess a specified property. For example, we might be interested in what proportion of patients are alive 5 years after diagnosis of a particular cancer, or we might be interested in the proportion of UK adults who would like a ban on blood sports. Denote the true population proportion of interest by π. Note that π is a population parameter.

To learn about π, we could observe a random sample in which each of the n observations is either a "success" or a "failure". The sample proportion, p, is given by:

    p = (number of successes) / n.

The sample proportion is clearly a sample statistic, and it makes sense to use p to learn about π. We are therefore interested in the sampling distribution of p. To investigate it, we will look at two experiments in which we generate random samples of observed values of p.

Experiment 1: Suppose that we generate 500 samples of size n, where each sampled value is either a success (with probability π = 0.25) or a failure (with probability 1 − π = 0.75). We then calculate the observed proportion of "successes" in each of the 500 samples. We do this for n = 5, 10, 25 and 50.

[Figure: histograms of the 500 sample proportions for each sample size.]

Observations: For a sample of size 5, the possible values of p are 0, 0.2, 0.4, 0.6, 0.8 and 1. The sampling distribution of p gives the probability of each of these six values. The histogram for the case n = 5 is positively skewed. As n increases, the histograms become more and more symmetrical, and in fact when n = 50 the histogram clearly resembles a normal curve centred on 0.25. In addition, increasing the sample size decreases the range of observed values of p.

Experiment 2: Once again we generate 500 samples, but this time with sample sizes n = 10, 25, 50 and 100, and we take the true proportion of successes, π, to be 0.07. So once again each observation in each sample is either a success (S) with probability 0.07 or a failure (F) with probability 0.93.

[Figure: histograms of the 500 sample proportions for n = 10, 25, 50 and 100.]

Observations: When n = 10, the possible values of p are 0, 0.1, 0.2, ..., 1. The histogram for the 500 samples is very positively skewed, and no value greater than 0.4 was observed for p. [Notice how in the previous experiment, the density of p was not very skewed when n = 10.] As n increases to 25 and 50, the histograms still look positively skewed. However, when the sample size reaches 100, the histogram is beginning to look slightly more normal. Therefore we note that in this experiment we need larger sample sizes than in Experiment 1 before the sampling distribution of p looks approximately normal. We also note that increasing the sample size again results in a narrowing of the range of observed values of p.

Thus, to summarise the observations from this experiment: the densities are roughly centred about π = 0.07; the variance of p decreases as n increases; and as the sample size increases, the density of p becomes approximately normal, although it tends to normality much more slowly than when we had π = 0.25. Therefore, it appears that the rate at which the sampling distribution of p tends to normality depends not only on the sample size n, but also on the value of π.
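Both experiments can be reproduced in a few lines of R: the number of successes in a sample of size n is a single draw from a Bi[n, π] distribution, so rbinom generates observed sample proportions directly. A minimal sketch (the particular sample sizes shown are an arbitrary choice):

    # 500 simulated sample proportions for sample size n and true proportion pi0
    sample_props <- function(n, pi0, reps = 500) {
      rbinom(reps, size = n, prob = pi0) / n
    }

    p1 <- sample_props(n = 50, pi0 = 0.25)    # Experiment 1, largest sample size
    p2 <- sample_props(n = 100, pi0 = 0.07)   # Experiment 2, largest sample size
    hist(p1, main = "Sample proportion: pi = 0.25, n = 50")
    hist(p2, main = "Sample proportion: pi = 0.07, n = 100")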
Important result: If p is the sample proportion of successes in a random sample of size n, where π is the true proportion of successes, then the following results hold:

1. The expected value of p is π.
2. The standard error (i.e. s.d.) of p is √(π(1 − π)/n).
3. When n is sufficiently large, the sampling distribution of p is approximately normal.

Note: The further the value of π is from 0.5, the larger the value of n must be in order for the normal approximation of the sampling distribution of p to be accurate.

Rule of thumb: If both nπ ≥ 5 and n(1 − π) ≥ 5, then we may use the normal approximation for p.

Proof: Let X = total number of successes in the sample. Then X ~ Bi[n, π] and so:

    E[X] = nπ,   Var[X] = nπ(1 − π),   sd[X] = √(nπ(1 − π)).

But, by definition, the sample proportion is p = X/n, and so

    E[p] = E[X/n] = (1/n) E[X] = (1/n) nπ = π;
    Var[p] = Var[X/n] = (1/n²) Var[X] = (1/n²) nπ(1 − π) = π(1 − π)/n.

Taking square roots, we get the required standard error for p. The proof of the normality approximation is simply an application of the Central Limit Theorem, so that for large n,

    p = X/n ~ N[π, π(1 − π)/n], approximately.

Example 1: Suppose that the proportion of women who believe that they are underpaid is 0.55.

a) If we had a random sample of size 10, could we assume that the sampling distribution of p is approximately normal?
b) For a random sample of 400, what are the mean value and standard deviation of p?
c) In a sample of size 400, what is the probability that we observe the proportion of women who believe they are underpaid to be greater than 0.6?

a) π = 0.55 and n = 10, so nπ = 5.5 and n(1 − π) = 4.5. As these are not both ≥ 5, we cannot assume that the distribution of p is normal with a sample size of only 10.

b) n = 400, so: E[p] = π = 0.55; Var[p] = π(1 − π)/n = (0.55 × 0.45)/400 = 0.000619; sd[p] = 0.0249.

c) For n = 400, nπ = 220 and n(1 − π) = 180, so p's distribution can be considered approximately normal: p ~ N[0.55, 0.000619]. Therefore:

    P(p > 0.6) = P(Z > (0.6 − 0.55)/0.0249) = P(Z > 2.008) = 1 − Φ(2.008) = 1 − 0.9778 = 0.0222, approximately.

Example 2: Suppose that the true proportion of individuals with a particular disease is 0.02. What minimum sample size would be needed before p's distribution can be assumed to be approximately normal?

For approximate normality we need nπ ≥ 5 and n(1 − π) ≥ 5. Now,

    n(0.02) ≥ 5 requires n ≥ 250, and n(0.98) ≥ 5 requires n ≥ 5.102.

Therefore, to assume approximate normality for p, we would need a sample size of at least 250.

Exercise: 90% of the population are right-handed. In a sample of 200 people, what is the probability that the sample proportion who are right-handed is less than 0.86?
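As a check on Example 1(c), the approximate normal distribution for p can be used directly in R; the same approach applies to the exercise above. A minimal sketch:

    pi0 <- 0.55
    n   <- 400
    se  <- sqrt(pi0 * (1 - pi0) / n)   # standard error of p, approx 0.0249

    # P(p > 0.6) under the approximate normal distribution of p
    pnorm(0.6, mean = pi0, sd = se, lower.tail = FALSE)   # approx 0.022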
6.4 Sampling distribution of the sample variance

When we want to learn about the variance, σ², of a population, it is natural to first look towards the sample variance, S². We are therefore interested in the sampling distribution of S². In general, the sampling distribution of S² does not follow any fixed rules, so here we will only look at the case where X_1, ..., X_n are i.i.d. N[μ, σ²].

Important result: If X_1, ..., X_n are i.i.d. N[μ, σ²], where μ is unknown, then

    (n − 1) S² / σ² ~ χ²_{n−1}.

Proof: The proof will not be given in this course.

Experiment: To demonstrate that this result does in fact hold in practice, 500 samples were generated from N[100, 400] for various sample sizes n, and the value of (n − 1)S²/σ² was calculated for each of the 500 samples. Histograms of these samples then demonstrate what the sampling distribution of (n − 1)S²/σ² looks like in each case.

[Figure: histograms of the 500 simulated values of (n − 1)S²/σ² for n = 3, 5, 10 and 20.]

Observations: In the case n = 3, the histogram of the sample of 500 observations of (n − 1)S²/σ² is heavily positively skewed and resembles a χ²_2 distribution. The histograms for the other cases, where n = 5, 10 and 20, also resemble chi-squared distributions (the respective degrees of freedom should be 4, 9 and 19).
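This experiment is also easy to repeat by simulation; R's var function computes the sample variance S². A minimal sketch (the seed is arbitrary; recall that a χ²_{n−1} random variable has mean n − 1, which gives a quick numerical check):

    set.seed(2)

    # 500 simulated values of (n - 1) S^2 / sigma^2 from N[100, 400]
    sim_stat <- function(n, reps = 500) {
      replicate(reps, (n - 1) * var(rnorm(n, mean = 100, sd = 20)) / 400)
    }

    for (n in c(3, 5, 10, 20)) {
      x <- sim_stat(n)
      cat("n =", n, " mean of statistic =", round(mean(x), 2),
          " (chi-squared mean =", n - 1, ")\n")
    }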