Sampling Distribution and Confidence Intervals (annotated)

8 - Sampling Distributions and Confidence Intervals for Single Population Parameters

Sampling Distribution of the Sample Mean (X̄)
Sampling Distribution of the Sample Proportion (p̂)
Confidence Interval for the Population Mean (μ)
Confidence Interval for the Population Proportion (p)

Introduction:
When we take a sample of size n from a population and calculate summary statistics like the sample mean (X̄), the sample median (Med), the sample variance (s²), the sample standard deviation (s), or the sample proportion (p̂), we must realize that these quantities will _________________________________________________________________ and hence are themselves ________________________________________.

Any random variable in statistics has a probability distribution. We have been talking about three common probability distributions in statistics. When X = # of "successes" in n independent trials we used the binomial distribution to talk about X probabilistically, when X = # of occurrences in a fixed time/space unit we used a Poisson distribution, and finally when X was continuous and had an approximately bell-shaped distribution we used the normal distribution to calculate probabilities and quantiles associated with X.

Because the summary statistics discussed above are random variables, they also have a probability distribution that determines the likelihood of certain values of these statistics being obtained. The distribution of a summary statistic, e.g. the sample mean (X̄), is called the ______________________________________. In this handout we explore the sampling distributions of the sample mean (X̄) and the sample proportion (p̂).

Sampling Distribution of X̄
The sample mean (X̄) is a random quantity that varies from sample to sample. The probability distribution the sample mean follows is called the sampling distribution of X̄.
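The idea that X̄ varies from sample to sample can be seen in a small simulation. The sketch below is only an illustration: the skewed population (exponential with mean 1), the sample size n = 30, and the 5,000 repetitions are all assumed values chosen for the demonstration, not part of the handout.

```python
import random
import statistics

def simulate_sample_means(population, n, reps, seed=0):
    """Draw `reps` random samples of size n (with replacement) and return their means."""
    rng = random.Random(seed)
    return [statistics.fmean(rng.choices(population, k=n)) for _ in range(reps)]

# Hypothetical skewed population: 10,000 values from an exponential distribution (mean 1)
pop_rng = random.Random(1)
population = [pop_rng.expovariate(1.0) for _ in range(10_000)]

n = 30
means = simulate_sample_means(population, n=n, reps=5_000)

pop_mean = statistics.fmean(population)
pop_sd = statistics.pstdev(population)

# Compare the simulated sampling distribution with the population it came from
print("population mean:          ", round(pop_mean, 3))
print("mean of the sample means: ", round(statistics.fmean(means), 3))
print("SD of the sample means:   ", round(statistics.stdev(means), 3))
print("population SD / sqrt(n):  ", round(pop_sd / n ** 0.5, 3))
```

A histogram of `means` would show the shape of the simulated sampling distribution; the printed summaries compare its center and spread with those of the population.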
The sampling distribution demo I showed in class is found at the following web address: http://www.ruf.rice.edu/~lane/stat_sim/sampling_dist/

The Central Limit Theorem (CLT) tells us about the sampling distribution of the sample mean (X̄). There is also a version (which we will see later) that tells us about the sampling distribution of the sample proportion (p̂).

The CLT for X̄ says the following:
1.
2.
3. The sampling distribution will be ___________ if either of the conditions below is met:
   ____________ or if ____________

We now consider applications of the central limit theorem (CLT).

Applications to Decision Making

Example 1: Cholesterol levels of adult males (50-60 yrs. old)
The mean blood cholesterol level of adult males (50-60 yrs. old) is 200 mg/dl with a standard deviation of 20 mg/dl. Assume also that blood cholesterol levels are approximately normally distributed in this population.
a) What is the probability that a sample of size n = 25 would yield a sample mean greater than 225 mg/dl?
b) Give a range of values within which we would expect the sample mean to fall approximately 95% of the time.
c) Suppose we took a sample of adult males between the ages of 50 and 60 who are also strict vegetarians and obtained a sample mean of X̄ = 188 mg/dl. Does this provide evidence that the subpopulation of vegetarians has a lower mean cholesterol level than the larger population of men in this age group? Explain.

Example 2: S/R Ratio
The objectives of a study by Skjelbo et al. (1996) were to examine (a) the relationship between chloroguanide metabolism and efficacy in malaria prophylaxis and (b) the mephenytoin metabolism and its relationship to chloroguanide metabolism among Tanzanians. From information provided by urine specimens from the n = 216 subjects, the investigators computed the ratio of unchanged S-mephenytoin to R-mephenytoin (S/R ratio). Is there evidence that the S/R ratio of vaccinated Tanzanians is greater than .275?
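For Example 1 above, parts (a) and (b) come down to a normal calculation with the standard error σ/√n in place of σ. The following is a minimal sketch using Python's standard-library `NormalDist`; the variable names are illustrative. Part (c) is left out because the handout does not state the size of the vegetarian sample.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 200.0, 20.0, 25        # values given in Example 1
se = sigma / sqrt(n)                  # standard error of the sample mean: 20 / 5 = 4

xbar_dist = NormalDist(mu, se)        # sampling distribution of the sample mean

# (a) probability that the sample mean exceeds 225 mg/dl
p_above_225 = 1 - xbar_dist.cdf(225)

# (b) range expected to contain the sample mean about 95% of the time
z = NormalDist().inv_cdf(0.975)       # roughly 1.96
low, high = mu - z * se, mu + z * se

print(f"P(sample mean > 225) = {p_above_225:.2e}")
print(f"central 95% range: {low:.1f} to {high:.1f} mg/dl")
```

A sample mean of 225 mg/dl sits (225 - 200)/4 = 6.25 standard errors above the population mean, which is why the probability in part (a) is so small.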
Confidence Intervals for the Population Mean (μ)

Example: Suppose we are trying to estimate the mean birth weight of infants born to women who smoke during pregnancy. A sample of n = 73 women who smoked during pregnancy was obtained, and the birth weights of their babies yielded a sample mean of X̄ = 6.08 lbs. This is called a _____________________ for the population mean (μ) because it yields a single value for this unknown quantity. A better estimate might be 6.08 lbs. give or take _____ lbs., i.e. ______ up to _______. This is called an __________________________ as it gives a range or interval of plausible values for the population mean.

How do we know if this is a good interval estimate? What properties should a good interval estimate have?
1)
2)

The central limit theorem states that if our sample size (n) is sufficiently large, then

$$\bar{X} \sim N\!\left(\mu, \frac{\sigma}{\sqrt{n}}\right), \quad \text{which also says by standardizing that} \quad Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

Here the second argument of N is the standard deviation, so σ/√n is the standard error of X̄. This means that when we collect our data the probability our observed sample mean will fall within two standard errors of the mean is approximately .95 or a 95% chance, or more precisely

$$P(-2 < Z < 2) = P\!\left(-2 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 2\right) = P\!\left(-2\frac{\sigma}{\sqrt{n}} < \bar{X} - \mu < 2\frac{\sigma}{\sqrt{n}}\right) = P\!\left(\mu - 2\frac{\sigma}{\sqrt{n}} < \bar{X} < \mu + 2\frac{\sigma}{\sqrt{n}}\right) = .9544$$

To make this 95% exactly, we simply use 1.96 in place of 2.00 in the expression above, because P(-1.96 < Z < 1.96) = .9500. For 99% confidence we use ________ and for 90% we use ________ in place of 1.96.

Starting with the statement

$$P\!\left(-1.96 < \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = .9500$$

we can perform similar algebraic manipulations to those above to isolate the population mean in the middle of the inequality instead. By doing this we will obtain an interval that has an approximate 95% chance of covering the true population mean (μ).

This says that the interval from

$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} \quad \text{up to} \quad \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$

has a 95% chance of covering the true population mean μ. This interval is simply the sample mean plus or minus roughly two standard errors.
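The normal probabilities and multipliers used above can be checked numerically. This is a small sketch using the standard-library `NormalDist`; the helper name `z_star` is made up for illustration. Note that printed normal tables round intermediate values, so the exact probability within two standard errors differs from the table-based .9544 in the fourth decimal place.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# Probability that Z falls within 2 standard errors of its mean
p_within_2 = Z.cdf(2) - Z.cdf(-2)

def z_star(confidence):
    """Multiplier z such that P(-z < Z < z) equals the given confidence level."""
    tail = (1 - confidence) / 2        # area left in each tail
    return Z.inv_cdf(1 - tail)

print(f"P(-2 < Z < 2) = {p_within_2:.4f}")
for level in (0.90, 0.95, 0.99):
    print(f"{level:.0%} confidence: z = {z_star(level):.3f}")
```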
However, this interval cannot be calculated in practice! WHY?

A "simple fix" to this would be to replace ____ by the estimated standard deviation from our data _____. The problem with our "simple fix" is that the distribution of

$$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$

is not a standard normal, i.e. N(0,1)!!!

FACT: If the population we are sampling from is approximately normal, then $\frac{\bar{X} - \mu}{s/\sqrt{n}}$ has a t-distribution with degrees of freedom df = n - 1.

What does a t-distribution look like?

Facts about the t-distribution:
1)
2)
3)

Examples: Using the t-table to find confidence intervals
a) n = 20 and 95% confidence   t =
b) n = 20 and 99% confidence   t =
c) n = 50 and 90% confidence   t =
d) n = 10 and 95% confidence   t =

The basic form of most confidence intervals is:

(estimate) ± (table value) × (SE of estimate)

where (table value) × (SE of estimate) is the MARGIN OF ERROR.

General Form for a Confidence Interval for the Mean
For the population mean we have

$$\bar{X} \pm (t\text{-table value}) \times SE(\bar{X}) \quad \text{or} \quad \bar{X} \pm t\,\frac{s}{\sqrt{n}}$$

The appropriate columns in Table A.4 (t-distribution table) for the different confidence intervals are as follows:
90% Confidence: look in the .05 column (if n is "large" we can use 1.645)
95% Confidence: look in the .025 column (if n is "large" we can use 1.960)
99% Confidence: look in the .005 column (if n is "large" we can use 2.576)

Example: Suppose we are trying to estimate the mean birth weight of infants born to women who smoke during pregnancy. A sample of n = 73 women from Baltimore who smoked during pregnancy was obtained, and the birth weights of their babies yielded a sample mean of X̄ = 6.08 lbs. with a sample standard deviation of s = 1.45 lbs. Use this information to find a 95% CI for the mean birth weight of infants born to mothers who smoked during pregnancy, assuming that birth weights for this population are normally distributed.

Suppose a sample of n = 113 Baltimore mothers who did not smoke during pregnancy was obtained, yielding a sample mean birth weight of X̄ = 6.71 lbs. with a standard deviation of s = 1.66 lbs.
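For the smoking-mothers sample above (n = 73, X̄ = 6.08, s = 1.45), the interval computation can be sketched as follows. The function name is illustrative, and the table value is passed in rather than computed because the Python standard library has no t-distribution. The sketch uses the large-n value 1.960 mentioned above as an assumption; the Table A.4 value for the matching degrees of freedom would be slightly larger and would give a slightly wider interval.

```python
from math import sqrt

def mean_confidence_interval(xbar, s, n, table_value):
    """Return (lower, upper) for: estimate +/- (table value) * (SE of estimate)."""
    se = s / sqrt(n)                   # estimated standard error of the sample mean
    margin = table_value * se          # margin of error
    return xbar - margin, xbar + margin

# Smoking mothers: n = 73, sample mean 6.08 lbs, sample SD 1.45 lbs.
# 1.960 is the large-n approximation to the t-table value (an assumption here).
low, high = mean_confidence_interval(xbar=6.08, s=1.45, n=73, table_value=1.960)
print(f"approximate 95% CI: {low:.2f} to {high:.2f} lbs")
```

The same function applies to any other sample once its mean, standard deviation, size, and table value are supplied.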
a) Find a 95% confidence interval for the mean birth weight of infants born to nonsmoking mothers.
b) Does this interval, in conjunction with the interval obtained for mothers who smoked during pregnancy, provide evidence that infants born to smoking mothers have a lower mean birth weight?

Sampling Distribution of the Sample Proportion (p̂)
As with the sample mean (X̄), the sample proportion (p̂) is also random, as it too varies from sample to sample. The sampling distribution of p̂ has the following properties:
1. The mean of the sampling distribution is the population proportion (p).
2. The standard deviation of the sampling distribution, or the standard error of p̂, is given by

$$SE(\hat{p}) = \sqrt{\frac{p(1 - p)}{n}}$$

where p = population proportion (unknown) and n = sample size.
3. The sampling distribution is approximately normal provided n is "sufficiently large": np ≥ 5 and n(1 - p) ≥ 5.
* Note: some recommend using 10 in place of 5.
Note: When estimating proportions, large sample sizes are generally used (e.g. n > 100).

APPLICATIONS TO DECISION MAKING

Example: New Method for Treating a Certain Illness/Disease
Suppose the current treatment method for a certain disease has a 70% success rate. A new method has been proposed that will hopefully have a higher success rate. The new method is administered to a sample of n = 50 patients and 40 have successful treatment. Can we conclude on the basis of this result that the new method has a higher success rate?

CONFIDENCE INTERVALS FOR THE POPULATION PROPORTION (p)

Motivating Example: In a study conducted to investigate the non-clinical factors associated with the method of surgical treatment received for early-stage breast cancer, some patients underwent a modified radical mastectomy while others had a partial mastectomy accompanied by radiation therapy. We are interested in determining whether the age of the patient affects the type of treatment she receives.
In particular, we want to know whether the proportions of women under 55 are identical in the two treatment groups. A sample of n = 658 women who underwent a partial mastectomy and subsequent radiation therapy contains 292 women under 55, which is a sample percentage of 44%. A better estimate might be 44% give or take 4%, i.e. estimating that the actual percentage of women who receive this form of treatment under the age of 55 is between 40% and 48%. This is called an "interval estimate", as it gives a range or interval of plausible values for the population proportion/percentage. As with the population mean discussed earlier, we wish this interval to be narrow enough to provide useful information about this unknown percentage, yet have a high probability or chance of covering the actual percentage of women under 55 amongst those opting for this course of treatment for early-stage breast cancer.

The central limit theorem for proportions states that if our sample size (n) is sufficiently large, then

$$\hat{p} \sim N\!\left(p, \sqrt{\frac{p(1 - p)}{n}}\right).$$

This means that when we take our sample and find our sample proportion, p̂, the probability our observed sample proportion will fall within approximately two standard errors of the population proportion is roughly 95%, or more precisely

$$P\!\left(p - 1.96\sqrt{\frac{p(1 - p)}{n}} < \hat{p} < p + 1.96\sqrt{\frac{p(1 - p)}{n}}\right) = .9500$$

Recall: P(-1.96 < Z < 1.96) = .9500

Starting with this statement we can perform some algebraic manipulations to isolate the population proportion, p, in the middle of the inequality above. By doing this we will see that the resulting interval will have a 95% chance of covering the true population proportion (p).

After a Wonderfully Simple Mathematical Derivation:

This says that the interval from

$$\hat{p} - 1.96\sqrt{\frac{p(1 - p)}{n}} \quad \text{up to} \quad \hat{p} + 1.96\sqrt{\frac{p(1 - p)}{n}}$$

has a 95% chance of covering the true population proportion p. This interval is simply the sample proportion plus or minus roughly two standard errors, i.e. $\hat{p} \pm 1.96 \times SE(\hat{p})$.

However, this interval cannot be calculated in practice! WHY?

A simple fix is to replace ______ by our sample-based estimate ________. Provided the sample size is sufficiently large, the resulting interval will still have an approximate 95% chance of covering the true population proportion. This gives what we should technically call the estimated standard error of the proportion, but when we say "standard error of the proportion" it is assumed this estimated version is the one we are talking about, because in reality the population proportion p is NOT known. If p were known we would not be conducting a study in the first place!

General Form for a CI for the Population Proportion (p)

estimate ± (table value) × (estimated standard error of estimate)

$$\hat{p} \pm (\text{normal table value}) \times \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} \quad \text{or} \quad \hat{p} \pm z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

$$\text{Margin of Error} = z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Normal Table Values:
95% Confidence: we use z = 1.96
90% Confidence: we use z = 1.645
99% Confidence: we use z = 2.576

Again we see the confidence interval has the basic form:

ESTIMATE ± (TABLE VALUE) × (STANDARD ERROR OF THE ESTIMATE)

where (TABLE VALUE) × (STANDARD ERROR OF THE ESTIMATE) is the MARGIN OF ERROR. In other words, we take our estimate plus or minus a certain number of standard errors to obtain the confidence interval, i.e. plus or minus the margin of error.

Example: Early-Stage Breast Cancer Treatment Method and Age (cont'd)
In a sample of n = 658 women who underwent a partial mastectomy and subsequent radiation therapy, 292 women were under 55, which is a sample percentage of 44.4%. Find a 95% CI for the true proportion of women under 55 in this population.

In a sample of n = 1580 women who received a modified radical mastectomy, 397 women were under 55, which is a sample percentage of 25.1%. Find a 95% CI for the true proportion of women under 55 in this population.

Do these intervals suggest that the proportion of women under the age of 55 differs significantly for these two courses of treatment of early-stage breast cancer?
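The general form above translates directly into a few lines of code. The sketch below applies it to the two samples in the example, using z = 1.96 for 95% confidence; the function and variable names are illustrative rather than anything prescribed by the handout.

```python
from math import sqrt

def proportion_confidence_interval(successes, n, z=1.96):
    """Return (lower, upper) for p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    p_hat = successes / n                      # sample proportion
    se = sqrt(p_hat * (1 - p_hat) / n)         # estimated standard error of p_hat
    margin = z * se                            # margin of error
    return p_hat - margin, p_hat + margin

# Partial mastectomy with radiation: 292 of 658 women were under 55
partial = proportion_confidence_interval(292, 658)
# Modified radical mastectomy: 397 of 1580 women were under 55
radical = proportion_confidence_interval(397, 1580)

print(f"partial mastectomy: {partial[0]:.3f} to {partial[1]:.3f}")
print(f"radical mastectomy: {radical[0]:.3f} to {radical[1]:.3f}")
```

Comparing the two printed intervals, and in particular whether they overlap, is one way to approach the closing question of the example.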
One-Sided Confidence Intervals

One-Sided CI's for the Population Mean (μ)

Lower Bound for μ:

$$\bar{X} - t\,\frac{s}{\sqrt{n}}$$

Upper Bound for μ:

$$\bar{X} + t\,\frac{s}{\sqrt{n}}$$

where t comes from the t-distribution with df = n - 1. The appropriate columns in Table A.4 for the different confidence intervals are as follows:
90% Confidence: look in the .10 column
95% Confidence: look in the .05 column
99% Confidence: look in the .01 column

One-Sided CI's for the Population Proportion (p)

Lower Bound for p:

$$\hat{p} - z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

Upper Bound for p:

$$\hat{p} + z\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$$

where z comes from the standard normal distribution. The appropriate values for the different confidence intervals are as follows:
90% Confidence: use z = 1.282
95% Confidence: use z = 1.645
99% Confidence: use z = 2.326
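As a rough illustration of the one-sided bounds above, the sketch below reuses the partial-mastectomy sample from the earlier example (292 of 658 women under 55) with the 95% one-sided value z = 1.645. Applying a one-sided bound to that sample is an assumption made for illustration; the handout does not pose this as an exercise, and the function name is made up.

```python
from math import sqrt

def one_sided_bounds(estimate, se, table_value):
    """Return (lower bound, upper bound); each is its own one-sided interval."""
    return estimate - table_value * se, estimate + table_value * se

n, under_55 = 658, 292
p_hat = under_55 / n
se = sqrt(p_hat * (1 - p_hat) / n)     # estimated standard error of p_hat

lower, upper = one_sided_bounds(p_hat, se, table_value=1.645)
print(f"95% lower bound for p: {lower:.3f}")
print(f"95% upper bound for p: {upper:.3f}")
```

Each bound is read on its own: with 95% confidence p is at least the lower bound, or, as a separate statement, with 95% confidence p is at most the upper bound. The pair together is not a 95% two-sided interval.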
