advertisement

Chapter: Sampling Distributions ǲStatistics mean never having to say you are certainǳ Ȇunknown ǲHumility is to make a right estimate of one's selfǤǳ Ȇ Charles Spurgeon Estimation The main aim of Statistics is to make inferences about unknown population parameters, that is Ȃ estimating. Sampling error We carefully collect data from a sample and then use this information based on a sample to estimate the value of the unknown population parameter. Of course different samples would give different estimates. This difference is known as sampling error. Thus every conclusion made about population would contain some level of uncertainty. We want to know some important truth which in practice is unachievable. What to do? What can save the situation is that we can quantify this sampling error! We can well predict how exactly sample differences would vary. For instance that ǲalmost surely be within 5 pounds around ͳ ǳ (average weight is from 171 to 181 pounds) ǲ proportion of broken details must almost surely be within r0.3% of 5%ǳ (the proportion is between 4.7 and 5.3). Sampling Distributions Sampling distributions is the probability distribution of some estimator (statistics) among different samples. It shows how estimates could vary among different samples. We want to estimate some population parameter which is a fixed quantity Ȃ a constant. For that we use special functions called Ȃ estimators or statistics. Estimators are random variables that vary depending on the particular sample chosen. The probability distribution showing how exactly estimator varies is called a sampling distribution. Why a sample statistics is a variable? It is very easy Ȃ because different samples give different values! Consider a big example to understand exactly what a sampling distribution is. ǲof workersǳ Suppose one department of a company consists of six workers. Their work experiences in years are: 2, 4, 6, 6, 7, 8 years. A manager wants to form a small team out of this department of 4 workers to work on a project. Since he is quite busy he decided to choose them randomly. Work experience matters in regard to how the job is done, therefore manager is interested: What could be the mean work experience in a group? Authors: Chernenko Elizaveta &Ovcharova Maria. Please send your questions and commentaries to: [email protected] and [email protected] Solution: First of all the population here is the total number of workers in the department, namely 6. The population mean of number of years of experience is: 2+4+6+6+7+8 6 = 5.5 = P Group of 4 employees could be regarded as a random sample of size 4, chosen from a population of size 6. Fifteen possible samples could be selected (this is the amount of combinations ܥ46 ). Table 1 shows all possible samples with mean years of experience calculated on each sample. Samples (in years) 1. 2,4,6,6 2. 2,4,6,7 3. 2,4,6,8 4. 2,4,6,7 5. 2,4,6,8 6. 2,4,7,8 7. 2,6,6,7 8. 2,6,6,8 Sample means 4.5 4.75 5.0 4.75 5.0 5.25 5.25 5.5 Samples (in years) 9. 2,6,7,8 10.2,6,7,8 11.4,6,6,7 12.4,6,6,8 13.4,6,7,8 14.4,6,7,8 15.6,6,7,8 Sample means 5.75 5.75 5.75 6.00 6.25 6.25 6.75 Since it was random sampling, each sample of 15 has an equal chance of being selected, namely 1/15. Using this we can determine the probability of every mean work experience achieved. For instance, from the table we can see that three of fifteen possible samples have means 5.75 years. Thus the probability that the team for the project would have an average work experience of 5.75 years is 3/15, etc. This collection of values of sample means with associated probabilities would constitute the probability distribution of sample mean, or right the sampling distribution of the sample mean. X P(X) 4.5 1/15 4.75 2/15 5 2/15 5.25 5.5 2/15 1/15 5.75 3/15 6 1/15 6.25 2/15 6.75 1/15 Notice, that while the numbers of experience for six workers vary from 2 to 8, the values of mean experience vary from 4.5 to 6.75. That is values of sample means have much smaller range, and are concentrated in the center. This is a simple discrete case example. However if we plot the means on a histogram, we will see that even it approximately resembles a normal distribution. The bigger is the sample size, the closer is the distribution of statistics to normal. 1. Possible values for mean work experience in a team Recall: Probability distribution is the collection of values with associated probabilities So, distinguish among three notions: x x x Sampling distribution Ȃ distribution not of X, but of some statistics among different samples Population distribution Ȃ the true distribution of all possible values of X in the world. From population we take samples. Sample distribution Ȃ you may find this notion too. It is just the way a particular sample taken is distributed. This you usually draw with histogram to show the range of all values in the sample, with associated relative frequencies in this case (not probabilities) Examples of parameters are: population mean P, population proportion p ഥ, sample proportion ො Examples of estimators (statistics) are: sample mean X Background. Sampling (method) is the set of random variables: X1, X2ǡǥǡn Sample is the realization of all random variables: particular numbers x1, x2ǡǥǡn The same note: sample mean ഥ X is a variable, but in problems you will meet the sentences like ǲͷͲǳǤȂ a particular value calculated on that sample. Also we always assume: All Xi have the same distribution All Xi are independent Variability of a Sampling Distribution The variability of a sampling distribution is measured by its variance or its standard deviation. The standard deviation of this statistic is also called the standard error. CLT: Central Limit Theorem Why normal distribution is so important? Because it can be used to describe the results of sampling procedures we have achieved. Sampling distributions are most often normal! This is the essence of CLT. The central limit theorem states that: the sum of independent identically distributed random variables is approximately normally distributed, if the quantity of these variables is large enough. ഥ and pො , statistics, most frequently used in our course. It is applicable for X ഥ: For sample mean X x x x It is the sum of Xi Xi are independent since the sample is random Xi are identically distributed because each is taken from the same underlying population For sample proportion pො : pො = m/n m has binomial distribution. However, binomial is the sum of Bernoulli random variables. m =6yi where yiaBernoulli (recall chapter on discrete random variable) x x x Thus the proportion is the sum of Yi Yi are independent Yi are all Bernoulli, thus are identically distributed When clt starts working? -­‐as n approaches infinity -­‐ the closer is the underlying population to normal, the earlier clt starts working. How large is "large enough"? It could be decided differently, but in our course we use 30. Requirements for accuracy. The more closely the sampling distribution needs to resemble a normal distribution, the more sample observations will be required. The more closely the original population resembles a normal distribution, the fewer sample points will be required. Shortly the idea of CLT is described as follows: if the sample is big, everything is going to be normal! Sampling distributions for most commonly used statistics 1. Estimating Parameter P. ഥ is an estimator for Population mean is a constant and is denoted by P. Sample mean ܆ ഥ could vary among different samples? population mean P. How sample mean X Sample mean is a random variable based on a sample. It is simply the average of sample ഥ observations, and is denoted by X ഥ = ܺ1+ ܺ2+ڮ+ܺ݊ = 6Xi X ݊ n ഥ, namely its mean and variance. ǯ alculate the parameters of the random variable X ഥ: xExpectation (mean) of X ഥ) = E(6Xi ) Plug the formula into the brackets: E(X n One over n is just a constant, so we put it out of brackets and from the chapter of random variables we know that the expectation of sum is the sum of expectations, so: 6Xi E( n 1 1 n n ) = E(X1+X2 ΪǥΪn) = (E(X1) + E(X2ȌΪǥΪȋn)) Since each random variable Xi has the same distribution and mean P, we know each expectation: 1 n 1 nP n n (E(X1) + E(X2ȌΪǥΪȋn)) = (P + P ΪǥΪP) = = P ഥ: xSimilarly Variance of X 6X ഥ) = Var ( i ) = 1 Var (X1+X2 ΪǥΪn) = 1 (VarX1 + VarX2 Ϊǥ+ VarXn) = Var (X n2 n2 n 1 1 nV2 V2 n n n n = 2 (V2 + V2 ΪǥΪV2) = 2 nV2 = = 2 WeǯǤ ഥ has the following distribution: Thus X V ഥ X a N (P, ) 2 n Once more, recall that from sample to sample, different values would be achieved for the sample mean. Thus this quantity can be regarded as a random variable, which has a probability distribution. 2. Estimating Parameter p. ෝ is an estimator Population proportion is a constant and is denoted by p. Sample proportion ܘ for population proportion p. m It equals pො = Ȃ the number of successes m out of total number (sample size) n. n Consider closer this expression. Sample size n is a constant, while m is a random variable, which is distributed binomially, ǯǡǤ Indeed, m is number of people or objects, proportion of which we are looking for (for instance, the number of defective details of the number of respondents who support strike). And each of this unit has a chance p to be in the required category of interest. Thus m a Bi(n,p) Expectation of binomial random variable equals np, while variance equals np(1-­‐p). Now let us calculate the characteristics of the random variable pො , namely its mean and variance. xExpectation (mean) of pො : m 1 1 np n n n n E(pො ) = E( ) = E(m) = np = = p xVariance of pො : m 1 1 Var(pො ) = Var( ) = n 2 Var(m) = n 2 np(1-­‐p) = n np (1െp) p(1െp) n2 n = p(1െp) n Standard deviation is as usually the square root of variance. Vpො = ට ǯǤ Thus pො has the following distribution: (1െ) Ƹ N( p, ݊ ) Comparison of parameters Statistics allows us to compare several parameters in order to conclude which is higher. For instance, we might be interested in the following questions. Which of two financial firms pays a higher mean starting salary? Whether the proportion of defective smartphones is higher for Samsung than for Apple, or is the same for both brands? What can be said about the difference in the mean welfare of single-­‐ parent families versus two-­‐parent families? 3. Estimating parameter P1-­‐P2 Difference between two means As we just shown sample mean has the following distribution by CLT: V ഥ X a N (P, n ) 2 If we consider two independent populations, each of them would have: 2 2 1 2 V ഥ ഥ2a N (P2,V2 ) X1 a N (P1,n 1 ) and X n ഥ1 -­‐ X ഥ2 is also a normal random variable, since it can be represented as the difference (which is a X particular case of sum) of two independent normal random variables. Find the expectation and variance of this random variable: ഥ1 -­‐ X ഥ2) = E(X ഥ1) Ȃ E(X ഥ2) = P1 -­‐ P2 xE(X 2 2 1 2 ഥ1 -­‐ X ഥ2) = VarX ഥ1 + Var(-­‐܆ ഥ2) = VarX ഥ1 + (-­‐1)2VarX ഥ2 = VarX ഥ1 + VarX ഥ2 = V1 + V2 xVar(X n n Standard deviation as usual equals to the square root of variance V2 V2 1 2 VXഥ1 Ϋ Xഥ2 = ටn 1 + n 2 Note(!) that the standard deviation of the difference is higher than that of a single variable. It is evident because now two estimators (that both could vary) are involved Thus, the difference of sample means has the following distribution: 2 2 ഥ1-­‐X ഥ2 a N( P1 -­‐P2, V1 + V2 ) X n n *notice plus here! Example ǲTest ǳ 17 years old student Masha is quite good at writing tests on Staistics, but her results varies a lot. Her classmate Peter offered her a bet on a bar of chocolate that at the end of next semester his average test mark for this subject would be higher than hersǤǯͶǤͷͷ ͲǤǡǯͶǤ͵ͲǤͶǤThere are going to be 50 tests in the next semester. What is the probability that Masha will lose the bet, that is, that her average mark will be less than the average mark of Peter? Solution: 2 2 ഥ1-­‐X ഥ2 a N( P1 -­‐P2, V1 + V2 ) by CLT X n n The mean of the differences is: P1 -­‐P2 =4.5 Ȃ 4.3 = 0.2 The standard deviation is: (0.7 ) 2 (0.4) 2 = 0.114. 50 50 The chance that Masha will lose is equal to the probability: 0 0.2 P(X1X2) = P(X1 -­‐ X2 0) = P(z ) = P(z Ȃ1.75) = 0.04 0.114 (From Table of the Normal Distribution, the area to the left of Ȃ1.75 is 0.0400.) VഥX1 Ϋ Xഥ2 = Thus there is only a 4% chance that Masha will lose. 4. Estimating parameter p1-­‐p2 Single sample proportion has the distribution Ƹ N( p, (1െ) ݊ ) Exactly the same logic applies to the difference of two sample proportions 1 ήሺ1െ 1 ሻ ήሺ1െ ሻ + 2 ݊ 2 ) ݊1 2 Ƹ1 െ Ƹ2 ~ܰ(1 െ 2 , ට Sample AP practice problems 2015 6d 6. Corn tortillas are made at a large facility (ɩɪɢɫɩɨɫɨɛɥɟɧɢɟ) that produces 100,000 tortillas per day on each of its two production lines. The distribution of the diameters of the tortillas produced on production line A is approximately normal with mean 5.9 inches, and the distribution of the diameters of the tortillas produced on production line B is approximately normal with mean 6.1 inches. The figure below shows the distributions of diameters for the two production lines. The tortillas produced at the factory are advertised as having a diameter of 6 inches. For the purpose of quality control, a sample of 200 tortillas is selected and the diameters are measured. From the sample of 200 tortillas, the manager of the facility wants to estimate the mean diameter, in inches, of the 200,000 tortillas produced on a given day. Two sampling methods have been proposed. Method 1: Take a random sample of 200 tortillas from the 200,000 tortillas produced on a given day. Measure the diameter of each selected tortilla. Method 2: Randomly select one of the two production lines on a given day. Take a random sample of 200 tortillas from the 100,000 tortillas produced by the selected production line. Measure the diameter of each selected tortilla. Each day, the distribution of the 200,000 tortillas made that day has mean diameter 6 inches with standard deviation 0.11 inch. G )RUVDPSOHVRIVL]HWDNHQIURPRQHGD\¶VSURGXFWLRQGHVFULEHWKHVDPSOLQJGLVWULEXWLRQRIWKH sample mean diameter for samples that are obtained using Method 1. (e) Suppose that one of the two sampling methods will be selected and used every day for one year (365 days). The sample mean of the 200 diameters will be recorded each day. Which of the two methods will result in less variability in the distribution of the 365 sample means? Explain. Solution: (d) The sampling distribution of the sample mean diameter for samples obtained using Method 1would be 0.11 approximately normal with mean 6 inches and standard deviation 200 | 0.0078 inches ξ (e) Method 1 would result in less variability in the sample means (plural!) over 365 days. With Method 2 roughly half of the sample means will be around 5.9 inches and the other half will be around 6.1 inches while with Method 1the sample means will all be very close to 6.0 inches, as indicated by very small standard deviation in part (d) (0.0078 inch) 2014 3b 3. Schools in a certain state receive funding based on the number of students who attend the school. To determine the number of students who attend a school, one school day is selected at random and the number of students in attendance that day is counted and used for funding purposes. The daily number of absences at High School A in the state is approximately normally distributed with mean of 120 students and standard deviation of 10.5 students. (a) If more than 140 students are absent on the day the attendance count is taken for funding purposes, the school will lose some of its state funding in the subsequent year. Approximately the probability that High School A will lose some state funding is about 0.0287. E 7KHSULQFLSDOV¶DVVRFLDWLRQLQWKHVWDWHVXJJHVWVWKDWLQVWHDGRIFKRRVLQJRQHGD\DWUDQGRPWKHVWDWHVKRXOG choose 3 days at random. With the suggested plan, High School A would lose some of its state funding in the subsequent year if the mean number of students absent for the 3 days is greater than 140. Would High School A be more likely, less likely, or equally likely to lose funding using the suggested plan compared to the plan described in part (a)? Justify your choice. Solution: (b) High School A would be less likely to lose state funding. With a random sample of 3 days, the distribution of the sample mean number of students absent would have less variability than that of a single day. With less variability, the distribution of the sample mean would concentrate more narrowly around the mean of 120 students, resulting in a smaller probability that the mean number of students absent would exceed 140. In particular, the standard deviation of the sample mean number of absences is: V 10.5 = 3 = 6.062. ݊ ξ ξ So, the probability that High School A loses funding using the suggested plan would be as determined as: ഥ !140) = P(z ! 140െ120 ) = P(z ! 3.3) = 0.0005 | 0, which is much less than a probability of 0.0287 obtained P(X 6.062 for the plan described in part (a). 2007 3 3. A company recently stocked a new lake in a central city park with 2,000 fish of various sizes. The distribution of the length of these fishes is approximately normal. (a) A company claims that the mean length of the fish is 8 inches. If the claim is true, which of the following would be more likely? x a random sample of 15 fishes having a mean length that is greater than 10 inches Chapter on confidence intervals. Introduction. The basic purpose of statistics is to get useful knowledge about unknown from the mess of known data. For example, we might be interested in the true values of: ђ -­‐ ƚŚĞ ŵĞĂŶ ƐĂůĂƌǇ ŝŶ ZƵƐƐŝĂ ŝŶ ϮϬϭϱ͕ ʋ ʹ the proportion of Saint-­‐Petersburg citizens in favor of the anti-­‐ŐĂǇůĂǁ͕Žƌʍ-­‐ riskiness of some investment project. Usually, we cannot calculate those parameters as we do not have the data about the whole population. However, we can estimate their values based on the sample data. One way to do that are point estimates such as sample mean ݔҧ , sample proportion p, sample standard deviation s. This approach has a significant drawback: our point estimates almost never coincide with the true values of parameters. To increase ŝŶǀĞƐƚŝŐĂƚŽƌ͛Ɛ confidence in estimates the other approach is used ʹ interval estimates. Although we are almost never able to ͞ĐĂƚĐŚ͟ƚŚĞǀĂůƵĞŽĨƚƌƵĞƉĂƌĂŵĞƚĞƌǁŝƚŚĂƉŽŝŶƚĞƐƚŝŵĂƚĞ͕ǁĞĐĂŶĐŽŶƐƚƌƵĐƚĂŶŝŶƚĞƌǀĂůǁŚŝĐŚǁŝůůŝŶĐůƵĚĞƚŚĞ ƉĂƌĂŵĞƚĞƌ͞ĂůŵŽƐƚĨŽƌƐƵƌĞ͘͟ Estimates c Point estimates onfidence Confidence intervals confidence intervals ! You might compare these two methods with the two approaches to fishing. intervals Trying to catch a fish with a spear is similar to calculating point estimate in hope it will be equal to the parameter. Contrary, using a net is similar to constructing an interval, which would contain the parameter in significant number of trials. We call the intervals constructed a confidence intervals since those might contain the true parameter with a specified level of confidence. ! Note that there is always a chance the confidence interval WILL NOT contain the true value of parameter! Confidence intervals for population ƉĂƌĂŵĞƚĞƌƐ;ђĂŶĚʋͿ 1. Confidence intervals for the population mean ђ 1.1. dŚĞĐĂƐĞŽĨŬŶŽǁŶʍ Let us suppose we have a big sample of values of some variable X. According to the Central Limit Theorem (CLT) sample mean ݔҧ ŝƐĂƉƉƌŽǆŝŵĂƚĞůǇŶŽƌŵĂůůǇĚŝƐƚƌŝďƵƚĞĚ͘ƐǁĞ͛ǀĞůĞĂƌŶĞĚĨƌŽŵƚŚĞƉƌĞǀŝŽƵƐchapter, it has the following parameters: ݔҧ ~ܰ(ߤ, ߪ ξ݊ ). How do we construct a confidence interval? STEP 1. As for any Normal variable we can reduce it to the Standard Normal variable z: ݔҧ െߤ ߪ ൗ ݊ ξ = (ܰ~ݖ0,1) STEP 2. If we express ݔҧ from the equation (1) we get: ݔҧ = ߤ + ݖή (1) ߪ ξ݊ Z is a random variable taking positive as well as negative values. Thus, the first summand shows that ݔҧ on ĂǀĞƌĂŐĞĞƋƵĂůƐђ͕ĂŶĚƚŚĞƐĞĐŽŶĚŽŶĞƌĞƉƌĞsents a deviation of ݔҧ ĨƌŽŵђǁŚŝĐŚĚĞƉĞŶĚƐŽŶƌĂŶĚŽŵĐŽŵƉŽŶĞŶƚ z. Z follows the known probability distribution. So, the equation specifies the range of the deviations of ݔҧ around ђǁŚŝĐŚŵŝŐŚƚŚĂƉƉĞŶ with given probability reflected in the corresponding value of zcritical. Assuming that zcritical is non-­‐negative we can rewrite the expression as follows: ݔҧ = ߤ േ ݈ܽܿ݅ݐ݅ݎܿݖή ߪ ξ݊ (2) STEP 3. ,ŽǁĞǀĞƌ͕ǁĞĂƌĞƵƐƵĂůůǇŝŶƚĞƌĞƐƚĞĚŝŶƚŚĞƉůĂƵƐŝďůĞǀĂůƵĞƐŽĨђ͕ŶŽƚݔҧ . ExpƌĞƐƐŝŶŐђĨƌŽŵƚŚĞĞƋƵĂƚŝŽŶ (2) we get: ߤ = ݔҧ േ ݈ܽܿ݅ݐ݅ݎܿݖή ߪ ξ݊ dŚƵƐ͕ǁĞ͛ǀĞĂƌƌŝǀĞĚƚŽƚŚĞĨŝƌƐƚĨŽƌŵƵůĂŽĨĐŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂů͘/t provides a convenient way to estimate the ďŽƵŶĚĂƌŝĞƐǁŝƚŚŝŶǁŚŝĐŚƚŚĞƚƌƵĞǀĂůƵĞŽĨђůŝĞƐǁŝƚŚthe certain level of confidence. The component ݈ܽܿ݅ݐ݅ݎܿݖή ߪ is called the margin of error. ξ݊ What is the margin of error and how to find zcritical? Let͛ƐŐŽďĂĐŬƚŽƚŚĞĞƋƵĂƚŝŽŶ;ϮͿ͗ݔҧ = ߤ േ ݈ܽܿ݅ݐ݅ݎܿݖή ߪ ξ݊ As you can see the value of the random variable ݔҧ has two components: true value ђ and the random component േ ݈ܽܿ݅ݐ݅ݎܿݖή ߪ . Since we know the pdf of z we can calculate the interval of possible values ݔҧ can take ξ݊ with given probability. Example 1. Consider an example. Assume the true mean ŚĞŝŐŚƚ ŽĨ /& ƐƚƵĚĞŶƚƐ ŝƐ ђсϭϳϱ Đŵ ǁŝƚŚ ƐƚĂŶĚĂƌĚ ĚĞǀŝĂƚŝŽŶʍсϱĐŵ͘ƌĂŶĚŽŵƐĂŵƉůĞŽĨ100 students is taken and their heights are measured to calculate ݔҧ -­‐ the sample mean height. ݔҧ can take different values depending on the heights of random set of students, who get into the sample. The range of values ݔҧ can take is: ݔҧ = 175 േ ݈ܽܿ݅ݐ݅ݎܿݖή 5 ξ100 Suppose, we want the interval to include values ݔҧ takes with probability 95%. What value of zcritical should be used to provide an interval including 95% of all possible values of ݔҧ ? ƐǁĞ͛ǀĞƐƚĂƚĞĚĞĂƌůŝĞƌݔҧ is a Normal random variable. The picture below displays its probability distribution. In order to find the interval which includes the most probable 95% of ݔҧ values (the smallest interval) we should take the central 95% of values, leaving 5% of unlikely values at the two equal tails of the distribution. Due to the symmetry area of each tail equals 5% 2 = 2,5%. Thus, zcritical=z0,025 such that P(z>z0,025)=2,5%=0,025. From the table we get: z0,025=1,96. Then, ݔҧ = 175 േ 1,96 ή 5 ξ100 | 175 r 1 So, we get that with 95% probability the sample mean will take value between 174 and 176 centimeters. In general we denote the probability of interest by (1-­‐ɲͿ% and call it the confidence level (95% in the given example). ɲ ŝƐ ƚŚĞ ƉƌŽďĂďŝůŝƚǇ ŽĨ ͞ůĞĨƚ ďĞŚŝŶĚ͟ or improbable values of ݔҧ and is called the significance level. Therefore in general case ݖcritical = ߙݖ. 2 dŚĞŐĞŶĞƌĂůĨŽƌŵƵůĂŽĨĐŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůĨŽƌђǁŝƚŚŬŶŽǁŶʍŝƐ͗ ߪ ߤ = ݔҧ േ ߙݖή 2 ξ݊ (3) Is there life without the Central Limit Theorem? tĞ͛ǀĞ ƐƚĂƌƚĞĚ ŽƵƌ ĞǆƉůĂŶĂƚŝŽŶ ďǇ ƐƚĂƚŝŶŐ ƚŚĂƚ ǁŝƚŚ ůĂƌŐĞ ĞŶŽƵŐŚ ƐĂŵƉůĞ ƚŚĞ >d ŝŶƐƵƌĞƐ ƚŚĂƚ͗ ݔҧ ~ܰ(ߤ, ߪ ). ξ݊ However, what happens if the sample is small, say, less than 30 observations? Does that mean that construction of confidence interval is impossible? The answer depends on the distribution of X. Note that if X is itself normally distributed sample mean ܺത, being a sum of independent normal variables is also normal. Thus, we can apply the same logic and construct the confidence interval using the same formula. Still, for small samples it is necessary that the assumption about the normality of X is held. If you are asked to find the confidence interval based on the small sample, and nothing is said about the distribution of X, before interval estimation you should state the assumption, that the random variable of interest (for example height) is normal. 1.2 ŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůĨŽƌђǁŝƚŚƵŶŬŶŽǁŶʍ ^ŽĨĂƌǁĞ͛ǀĞƌĞĂĐŚĞĚthe formula: ߤ = ݔҧ േ ܼߙ ή 2 ߪ to estimate the unknown population mean. To calculate such ξ݊ ĂŶŝŶƚĞƌǀĂůƚŚĞǀĂůƵĞŽĨʍŶĞĞĚƐƚŽďĞŬŶŽǁŶ͘,ŽǁĞǀĞƌ͕ǁŚĞŶƉŽƉƵůĂƚŝŽŶŵĞĂŶŝƐƵŶŬŶŽǁŶ;we are trying to estimate it by the interval!) how is it possible to know the true value of population standard deviation? Indeed, ŝƚŝƐƵƐƵĂůůǇŶŽƚƚŚĞĐĂƐĞ͘ůůǁĞŚĂǀĞŝƐĂƐĂŵƉůĞ͊KĨĐŽƵƌƐĞ͕ǁĞĐĂŶƌĞƉůĂĐĞʍǁŝƚŚŝƚƐƐĂŵƉůĞĞƐƚŝŵĂƚĞƐ͘ ݔҧ െߤ ƐǁĞ͛ǀĞƐŚŽǁŶĞĂrlier in this section, ߪ ൗ ݊ ξ ~ܰ(0,1). How the probability distribution will change if ʍŝƐƌĞƉůĂĐĞd with s? Obviously, when a constant is replaced with a random variable the variance of the whole expression will rise. ݔҧ െߤ Pdf of a random variable ݏ ൗ ݊ ξ ǁŝůůŚĂǀĞďŝŐŐĞƌ͞ƚĂŝůƐ͟ŝŶĚŝĐĂƚŝŶŐŚŝŐŚĞƌƉƌŽďĂďŝůŝƚǇŽĨĚĞǀŝĂƚŝŽŶƐfrom mean. What is more, the smaller is the sample ʹ the higher is the probability of large deviations ŽĨƐĨƌŽŵʍ, reflected in even fatter tails. It happens because small samples give less precise values of standard deviation s, which becomes more dispersed. However, all the other properties of pdf function stay the same: centered at 0 having bell-­‐shaped form. Hi! You need me to pass AP! and Here the new distribution is to be introduced ʹ the Student or t-­‐distribution. Its pdf is defined by the so-­‐called degrees of freedom (df) calculated as number of observation minus 1, df=n-­‐1. The corresponding number is conventionally indicated in brackets after the t letter: t(n-­‐1). dŚĞĐůŽƐĞƌŝƐƐƚŽʍ;ǁŚŝĐŚŝŶŐĞŶĞƌĂůŚĂƉƉĞŶƐǁŝƚŚŚŝŐŚĞƌŶͿ͕ƚŚĞ closer is t-­‐distribution to z. As is shown on the picture above t-­‐distributions with lower degrees of freedom are more dispersed around 0 (v on the picture denotes value of df). On the other hand, with larger df t-­‐distribution becomes closer to N(0,1). Indeed, when n approaching infinity, pdf of t approaches pdf of z. Thus, we get the following formula: ݔҧ െ ߤ ~ ݊(ݐെ 1) ݏ ൗ ݊ ξ Applying the same algorithm as in 1.1 we arrive to the new formula of confidence interval: ݏ ߤ = ݔҧ േ ݊( ߙݐെ 1) ή 2 ξ݊ Note that ݊( ߙݐെ 1) is a number (not multiplication of t by (n-­‐1))! It is a critical value of t-­‐distribution with n-­‐1 2 ߙ 2 degrees of freedom such that P(t> ݊( ߙݐെ 1))= . 2 Note that when n is large enough t distribution can be approximated by z. Thus, for large samples it is acceptable to apply the formula: ߤ = ݔҧ േ ߙݖή ξ݊ ݏ. However, it would be an approximation, while the general rule is 2 to use z-­‐ĚŝƐƚƌŝďƵƚŝŽŶǁŚĞŶʍŝƐŬŶŽǁŶ͕ĂŶĚƚ-­‐distribution ʹ otherwise. Again, the formulas given here are applicable when sample is large enough so that we can apply CLT. For small samples to ensure that ݔҧ ~ܰ we need to make sure that X is normal. IĨŝƚ͛ƐŶŽƚŐŝǀĞŶŝŶĂƉƌŽďůĞŵʹ you should state the corresponding assumption. So, how do you choose the right formula? 2. ŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůĨŽƌƉŽƉƵůĂƚŝŽŶƉƌŽƉŽƌƚŝŽŶʋ Analogically, by CLT for large n sample proportion p has normal distribution. Ɛ ǁĞ͛ǀĞ ƐŚŽǁŶ ŝŶ ƚŚĞ ƉƌĞǀŝŽƵƐ ߨ(1െߨ) ) ݊ chapter ߨ(ܰ~, ට Thus, = ݖ െߨ ߨ(1െߨ) ݊ and ߨ = േ ܼߙ ή ට ߨ (1െߨ ) ට ݊ 2 ߨሺ1െߨሻ is the margin of error. Since we do not know tŚĞƚƌƵĞǀĂůƵĞŽĨƉƌŽƉŽƌƚŝŽŶʋǁĞĐĂŶŶŽƚĚŝƌĞĐƚůǇ ݊ Now ߙݖή ට 2 calculate the margin of error. However, as the sample is initially assumed to be large enough we can still ĐŽŶƐƚƌƵĐƚĂŶĂƉƉƌŽǆŝŵĂƚĞĐŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůďǇƌĞƉůĂĐŝŶŐʋŝŶŵĂƌŐŝŶŽĨĞƌƌŽƌǁŝƚŚŝƚƐƐĂŵƉůĞĞƐƚŝŵate p. Thus͕ǁĞŐĞƚƚŚĞĐŽŶǀĞŶƚŝŽŶĂůĨŽƌŵƵůĂĨŽƌƚŚĞĐŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůĨŽƌʋ͗ (1െ) ݊ ߨ = േ ܼߙ ή ට 2 (4) ! Note that THIS IS THE ONLY FORMULA used for (2-­‐sided) ŝŶƚĞƌǀĂůĞƐƚŝŵĂƚŝŽŶŽĨʋ͘ It is only used when sample is large enough and uses only normal distribution (no t-­‐distribution for proportions!). Confidence intervals for the difference of parameters 3. Confidence interval for difference in population means ࣆ െ ࣆ 3.1. Known population standard deviations , As you should have learned from the previous chapter, difference of two sample means is normally distributed ߪ2 ߪ2 1 2 given that sample sizes are large enough. Since ܧሺݔ തതത1 െ തതതሻ ݔ2 = ߤ1 െ ߤ2 and ܸሺݔ തതത1 െ തതതሻ ݔ2 = ݊1 + ݊2 we have: ݔ1 െ തതത~ܰ(ߤ തതത ݔ2 1െ ߪ2 ߤ2 , ට݊1 1 തݔതതതെݔ 1 തതതതെ(ߤ 2 1 െߤ 2 ) Thus, + ߪ22 ) ݊2 = (ܰ~ݖ0,1) ߪ2 ߪ2 ඨ 1+ 2 ݊1 ݊2 Hence, the following confidence interval is constructed: ߪ12 ߪ22 ߤ1 െ ߤ2 = തതത ݔ1 െ തതത ݔ2 േ ߙݖή ඨ + ݊1 ݊2 2 3.1. Unknown population standard deviations, assuming ് Of course the previous formula can only be used given you know the true values of standard deviations ߪ1 and ߪ2 of ܺ1 and ܺ2 . In most real life problems this is not the case. All we usually have are the samples, and thus, the sample statistics calculated on them. When we replace ߪ1 and ߪ2 with sample standard deviations ݏ1 and ݏ2 the distribution is no more normal. To account for the increased variability in possible values of തതത ݔ1 െ തതത ݔ2 Student distribution is applied. തݔതതതെݔ 1 തതതതെ(ߤ 2 1 െߤ 2 ) 2 2 ݊1 ݊2 ~ݐሺ݇ െ 1ሻ, k=min{n1,n2} ݏ ݏ ඨ 1+ 2 Thus, we come to the following formula of confidence interval: ݏ2 ݏ2 1 2 ߤ1 െ ߤ2 = തതത ݔ1 െ തതത ݔ2 േ ݇( ߙݐെ 1) ή ට݊1 + ݊2 , k=min{n1,n2} 2 Why is the minimum of n1 and n2 is used in calculating the degrees of freedom? The answer is that variability of ƚŚĞǁŚŽůĞĐŽŶƐƚƌƵĐƚŝŽŶĚĞƉĞŶĚƐŽŶƚŚĞ͞ǁŽƌƐƚ͕͟ƚŚĞůĞĂƐƚƐƚĂďůĞĐŽŵƉŽŶĞŶƚŝŶŝƚ͘>Ğƚ͛ƐƐƵƉƉŽƐĞŶ1=1000, n2=25. That means calculations of s1 are highly precise, it takes values close to the true standard deviation ߪ1 and has small variance. Contrary, s2, being calculated on the small sample is highly volatile. Hence, however precise s1 is, variability in possible values of s2 will make the overall expression less stable, leading to small degrees of freedom ʹ depending on n2. Hence, degrees of freedom are always calculated based on the size of the smallest sample. ! Note that your graphic calculator uses another and somewhat more sophisticated calculation of degrees of freedom. It results in a bit higher and usually fractional (not integer) value of degrees of freedom. If you use the results from the calculator in solving a problem you should indicate df used by it. If the samples are not large enough (in the AP course sample is supposed to be small when n < 30) you can still apply the same formula given that both ܺ1 and ܺ2 are normal random variables. 3.2. Unknown population standard deviations, equal variances assumption: = In many situations it is reasonable to assume that ߪ1 = ߪ2 = ߪ. What happens then? ߪ12 ߪ22 1 1 ඨ ߤ1 െ ߤ2 = തതത ݔ1 െ തതത ݔ2 േ ݖή + = തതത ݔ1 െ തതത ݔ2 േ ߙݖή ߪ ή ඨ + ݊1 ݊2 ݊1 ݊2 2 ߙ 2 The above formula can be applied when ߪ is known. tŚĞŶ ŝƚ͛Ɛ ƵŶŬŶŽǁŶ ʍ ĐĂŶ ďĞ ĞǀĂůƵĂƚĞĚ ďĂƐĞĚ Žn the joint sample of ܺ1 and ܺ2 . Of course, since population standard deviations for both ܺ1 and ܺ2 are the same ;ʍͿ you can also estimate it based on either ܺ1 sample (use ݏ1 ) or ܺ2 sample (use ݏ2 ). However, the preciseness of the both estimators will be limited to corresponding sample size (n1 or n2). To make the estimator more precise it is useful to merge the two samples. The generally used estimator for ߪ is the so called pooled standard deviation (for the explanation of formula address the Appendix): ݏ12 ή ሺ݊1 െ 1ሻ + ݏ22 ή ሺ݊2 െ 1ሻ ߪො = = ݏඨ ݊1 + ݊2 െ 2 When we replace ߪ with ݏ normal distribution changes to Student distribution. Since the preciseness of ݏ is limited to the size of joint sample ݊1 + ݊2 and its calculation involves two estimates of unknown population parameters Student distribution has (݊1 + ݊2 െ 2) degrees of freedom. Thus, we arrive to the formula for confidence interval with the same and unknown standard deviation ߪ: 1 1 ߤ1 െ ߤ2 = തതത ݔ1 െ തതത ݔ2 േ ݊( ߙݐ1 + ݊2 െ 2) ή ݏή ඨ + ݊1 ݊2 2 3.3 Confidence interval for the mean difference (for paired/matched samples) Sometimes the difference is to be calculated based on so-­‐called matched or paired samples. It means that although two sets of observations are involved (say, ܺ1 and ܺ2 ), both are in fact taken from the same population of X. For example you may want to compare the mean difference in blood pressure of 1st year ICEF students before and after they take their winter exam in Stats (population is presented by all ICEF freshmen). For that purpose one sample of 1st year students can be taken and their average measures of blood pressure before and after the തതത1 and ܺ തതത2 ) are used to construct the interval for mean difference (we denote the true mean difference by exam (ܺ ѐͿ͘/ŶƚŚĂƚĐĂƐĞŶŽƚŽŶůǇƚŚĞƉŽƉƵůĂƚŝŽŶŝƐƚŚĞƐĂŵĞ;ϭst year students), but the samples contain the same set of students. Why we compare a sample with itself? Because we want to see is there effect of some treatment which is not applied to sample objects at the first stage (we observe ܺ1 ) and is applied at the second stage (ܺ2 ). In the given example treatment is the exam and we want to see does it constitute a stress for students. In order ƚŽĐŚĞĐŬƚŚĂƚǁĞůŽŽŬǁŚĞƚŚĞƌƚŚĞĐŽŶĨŝĚĞŶĐĞŝŶƚĞƌǀĂůĨŽƌѐ͕ĞƐƚŝŵĂƚĞĚďǇ݀ҧ = തതതതതതതതതതതതത (ܺ1 െ ܺ2 ) contains zero or not. If it does, then, ܺ1݅ െ ܺ2݅ (difference in blood pressure of student i) on average is close to zero and the interval of most plausible values of mean difference includes the case of no effect (zero). Therefore, we conclude that the treatment has no statistically measurable effect. Contrary, if the mean difference interval only contains positive (or only negative) values we may assert that the exam is a stressful event for students since it significantly changes their physiological condition. ,ŽǁĞǀĞƌ͕ ƐŽŵĞƚŝŵĞƐ ŝƚ͛Ɛ ŝŵƉŽƐƐŝďůĞ ƚŽ ŐĞƚ ĞǆĂĐƚůǇ ƚŚĞ ƐĂŵĞ ƐĂŵƉůĞƐ͕ ĂŶĚ ƚŚĞ ƚǁŽ ƐĂŵƉůĞƐ ĨƌŽŵ ƚŚĞ ƐĂŵĞ population are taken. For example we might have a sample of twins, who has almost the same DNA. But in each pair of twins the first one was brought up by the mother only (sample 1), while the second was living with the father (sample 2). In the given example we might be interested in effects of gender of a single parent on psychological characteristics of a child. Say, we may want to compare mean difference in feminity scores shown up by the twins. In this example population is the same (all twins, one of whom was brought up by mother, and the other ʹ by father). Samples are different, but can be easily paired, so that we may compare the scores of twins in each pair. The strategy for constructing the confidence interval is the same. First, we create one sample from the two ʹ we calculate differences in scores for each pair i: di=X1i-­‐X2i. Then, we apply the same strategy as for constructing confidence interval ĨŽƌ ƐŝŶŐůĞ ŵĞĂŶ ђ͘ tĞ ĐĂůĐƵůĂƚĞ ƐĂŵƉůĞ ŵĞĂŶ ݀ҧ and sample standard deviation ݀ݏ and apply t-­‐ĚŝƐƚƌŝďƵƚŝŽŶƚŽĞƐƚŝŵĂƚĞѐ͗ ο= dത േ t Ƚ (n െ 1) ή 2 sd ξn As usually, if sample of pairs is small (n<30), we should check (or at least assume) that d is normally distributed. Which formula to choose? 4. Confidence interval for the difference in populations proportions െ . It was previously shown that difference in population proportions is normally distributed given that both samples are large enough: ߨ 1 ήሺ1െߨ 1 ሻ ߨ ήሺ1െߨ ሻ + 2 ݊ 2 ) ݊1 2 1 െ 2 ~ܰ(ߨ1 െ ߨ2 , ට Thus, we have: 1 െ 2 െ(ߨ 1 െߨ 2 ) ߨ ήሺ1െߨ ሻ ߨ ήሺ1െߨ ሻ ට 1 ݊ 1 + 2 ݊ 2 1 2 = (ܰ~ݖ0,1) Hence, the following confidence interval is constructed: ߨ1 െ ߨ2 = 1 െ 2 േ ߙݖή ඨ 2 ߨ1 ή ሺ1 െ ߨ1 ሻ ߨ2 ή ሺ1 െ ߨ2 ሻ + ݊1 ݊2 ! Note that THIS IS THE ONLY FORMULA used for (2-­‐sided) interval estimation of ߨ1 െ ߨ2 . It is only used when samples are large enough and uses only normal distribution (no t-­‐distribution for proportions!). Appendix The simplest idea for estimating ɐ 2 is to sum the squared deviations of X1 and X 2 and then estimate average deviation. Note that X1 and X 2 have different means, so for each variable deviations are calculated with respect to different centers. sp2 = n2 1 σni=1 (xi1 െ xന1 )2 + σi=1 (xi2 െ ധധധ) x2 2 Ԣ Ŷ͛ ŝƐ ƚŚĞ ŽǀĞƌĂůů ŶƵŵďĞƌ ŽĨ ŽďƐĞƌǀĂƚŝŽŶƐ ŝŶ ƚŚĞ ƚǁŽ samples necessary to calculate the average squared deviation. Since the standard deviation in each sample is calculated by the formula: s 2 = σni=1 (x i െxധ )2 nെ1 we can express the sums of squared deviations in terms of s1 and s2 : sp2 = s12 ή ሺn1 െ 1ሻ + s22 ή ሺn2 െ 1ሻ Ԣ Since s12 and s22 are unbiased estimators for ɐ2 it is quite intuitive that nԢ = n1 + n2 െ 2 (if we divide simply by (n1 + n2 ) sp2 will underestimate ɐ2 . ! Note that squared pooled standard deviation ݏ2 is also approximately equal to the average of ݏ12 and ݏ22 weighted by the corresponding sample sizes. Sample A P practice problems A P 2010 3 A humane society wanted to estimate with 95 percent confidence the proportion of households in its county that own at least one dog. (a) Interpret the 95 percent confidence level in this context. The humane society selected a random sample of households in its county and used the sample to estimate the proportion of all households that own at least one dog. The conditions for calculating a 95 percent confidence interval for the proportion of households in this county that own at least one dog ZHUHFKHFNHGDQGYHULILHGDQGWKHUHVXOWLQJFRQILGHQFHLQWHUYDOZDV (b) A national pet products association claimed that 39 percent of all American households owned at leaVWRQHGRJ'RHVWKHKXPDQHVRFLHW\¶VLQWHUYDOHVWLPDWHSURYLGHHYLGHQFHWKDWWKHSURSRUWLRQRIGRJ owners in its county is different from the claimed national proportion? Explain. F +RZPDQ\KRXVHKROGVZHUHVHOHFWHGLQWKHKXPDQHVRFLHW\¶VVDPSOH"6KRZhow you obtained your answer. Solution: (a) The 95 percent confidence level means that if one were to repeatedly take random samples of the same size from the population and construct a 95 percent confidence interval from each sample, then in the long run 95 percent of those intervals would succeed in capturing the actual value of the population proportion of households in the county that own at least one dog. (b) No. The 95 percent confidence interval 0.417 r 0.119 is the interval (0.298, 0.536). This interval includes the value 0.39 as a plausible value for the population proportion of households in the county that own at least one dog. Therefore, the confidence interval does not provide evidence that the proportion of dog owners in this county is different from the claimed national proportion. (c) The sample proportion is 0.417, and the margin of error is 0.119. Determining the sample size requires solving the equation 0.119 = 1.96ට 0.119 = 1.96 0.119 = 0.417 (1െ 0.417) ݊ 0.493 ξ݊ 0.966 ξ݊ ξ݊ = 0.966/0.119 = 8.12 n = 65.95 So the humane society must have selected 66 households for its sample.