Biostatistics Unit 6: Confidence Intervals

1 Statistical inference
• Statistical inference is the procedure by which we reach a conclusion about a population on the basis of the information contained in a sample drawn from that population.
• Estimation uses the data in the sample to calculate a statistic that approximates the corresponding parameter of the population from which the sample was drawn.

2 Types of estimates
• A point estimate is a single numerical value used to estimate the corresponding population parameter.
• An interval estimate consists of two numerical values defining a range that, with a specified degree of confidence, we believe includes the parameter being estimated.

3 Estimator
• An estimator is a rule or formula that tells how to compute the estimate.
• An estimator is unbiased if its expected value equals the population parameter it estimates.

4 Table of unbiased estimators

5 Sampled and target populations
• The sampled population is the population from which we actually draw the sample.
• The target population is the population about which we wish to make an inference.
(continued)

6 Sampled and target populations
• These two populations may or may not be the same.
• When they are the same, statistical inference procedures can be used to draw conclusions about the target population.
• If the sampled and target populations are different, conclusions can be made about the target population only on the basis of nonstatistical considerations.

7 Random and nonrandom samples
The strict validity of statistical procedures depends on the assumption of random samples.
8 Confidence intervals to be studied
A) Confidence interval for a population mean
B) Confidence interval for the difference of two population means
C) Confidence interval for a population proportion
D) Confidence interval for the difference of two population proportions
E) Confidence interval for the variance of a normally distributed population
F) Confidence interval for the ratio of variances of two normally distributed populations

9 A) Confidence interval for a population mean
Estimating the mean
• Estimating the mean of a normally distributed population entails drawing a sample of size $n$ and computing $\bar{x}$, which is used as a point estimate of $\mu$.
• It is more meaningful to estimate $\mu$ by an interval that communicates information regarding the probable magnitude of $\mu$.

10 Sampling distributions and estimation
Interval estimates are based on sampling distributions. When the sample mean is used as an estimator of a population mean and the population is normally distributed, the sample mean is normally distributed with mean $\mu_{\bar{x}}$ equal to the population mean $\mu$ and variance $\sigma_{\bar{x}}^2 = \sigma^2/n$.

11 The 95% confidence interval
• About 95% of the values of $\bar{x}$ making up the distribution lie within two standard deviations of the mean.
• The exact multiplier is 1.96.
• The interval is marked by the two points $\mu - 1.96\sigma_{\bar{x}}$ and $\mu + 1.96\sigma_{\bar{x}}$, so that 95% of the values are in the interval $\mu \pm 1.96\sigma_{\bar{x}}$.

12 The 95% confidence interval
• Since $\mu$ and $\mu_{\bar{x}}$ are unknown, the location of the distribution is uncertain.
• We can use $\bar{x}$ as a point estimate of $\mu$.
• If intervals of the form $\bar{x} \pm 1.96\sigma_{\bar{x}}$ are constructed from repeated samples, about 95% of these intervals would contain $\mu$.

13 Example
Suppose a researcher, interested in obtaining an estimate of the average level of some enzyme in a certain human population, takes a sample of 10 individuals, determines the level of the enzyme in each, and computes a sample mean of $\bar{x} = 22$. Suppose further it is known that the variable of interest is approximately normally distributed with a variance of 45.
We wish to estimate $\mu$.

14 Solution
$\bar{x} \pm 1.96\sigma_{\bar{x}}$
$22 \pm 1.96\sqrt{45/10} = 22 \pm 1.96(2.1213) = 22 \pm 4.16$
(17.84, 26.16)

15 Components of an interval estimate
• The interval estimate of $\mu$ is centered on the point estimate of $\mu$.
• 95% of the area under the standard normal curve lies within 1.96 standard deviations of the mean.
• The z score of 1.96 used in this case is called the reliability coefficient.

16 General expression for an interval estimate
estimator $\pm$ (reliability coefficient) $\times$ (standard error)

17 Table of reliability coefficients for confidence intervals
Confidence level    z
90%                 1.645
95%                 1.96
99%                 2.575

18 Interpretation of confidence intervals
The interval estimate for $\mu$ is expressed as:
$\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$
If $\alpha = .05$, we can say that, in repeated sampling, 95% of the intervals constructed this way will include $\mu$. This is based on the probability of occurrence of different values of $\bar{x}$.
(continued)

19 Interpretation of confidence intervals
The area under the curve of $\bar{x}$ that lies outside the interval is called $\alpha$. The area inside the interval is called $1-\alpha$.

20 Probabilistic interpretation of the interval
In repeated sampling from a normally distributed population with a known standard deviation, $100(1-\alpha)$ percent of all intervals of the form $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ will, in the long run, include the population mean, $\mu$.
(continued)

21 Probabilistic interpretation of the interval
The quantity $1-\alpha$ is called the confidence coefficient or confidence level, and the interval $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$ is called the confidence interval for $\mu$.

22 Practical interpretation of the interval
When sampling is from a normally distributed population with known standard deviation, we are $100(1-\alpha)$ percent confident that the single computed interval, $\bar{x} \pm z_{1-\alpha/2}\,\sigma_{\bar{x}}$, contains the population mean, $\mu$.

23 Precision
• The precision of an estimate indicates how far the interval extends on either side of the point estimate.
• Precision is found by multiplying the reliability coefficient by the standard error of the mean.
• This quantity is also called the margin of error.

24 Exercise 6.2.2
We wish to estimate the mean serum indirect bilirubin level of 4-day-old infants. The mean for a sample of 16 infants was found to be 5.98 mg/dl.
Assuming bilirubin levels in 4-day-old infants are approximately normally distributed with a standard deviation of 3.5 mg/dl, find:
A) The 90% confidence interval for $\mu$
B) The 95% confidence interval for $\mu$
C) The 99% confidence interval for $\mu$

25 Solution
(1) Given
$\bar{x} = 5.98$, $\sigma = 3.5$, $n = 16$

26 (2) Sketch

27 Solution
(3) Calculations
Standard error: $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 3.5/\sqrt{16} = .875$
A) 90% interval ($z = 1.645$)
$5.98 \pm 1.645(.875) = 5.98 \pm 1.439375$
(4.5406, 7.4194)

28 Solution
B) 95% interval ($z = 1.96$)
$5.98 \pm 1.96(.875)$
(4.265, 7.695)

29 Solution
C) 99% interval ($z = 2.575$)
$5.98 \pm 2.575(.875)$
(3.7269, 8.2331)

30 Solution
(4) Results
A higher confidence level gives a wider interval. There is less chance that the interval misses $\mu$, but the estimate is less precise. Calculator answers can differ slightly in the last digits because the calculator carries more decimal places of the reliability coefficient than the rounded table values used here.

31 The t distribution
In most real-life situations the variance of the population is unknown. We know that the z score,
$z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$,
is normally distributed if the population is normally distributed and is approximately normally distributed when the sample is large. However, it cannot be used because $\sigma$ is unknown.

32 Estimation of the standard deviation
The sample standard deviation,
$s = \sqrt{\sum (x_i - \bar{x})^2/(n-1)}$,
can be used to replace $\sigma$. If $n \ge 30$, then $s$ is a good approximation of $\sigma$. An alternative procedure is used when the samples are small. It is based on Student's t distribution.

33 Student's t distribution
Student's t distribution is used as an alternative to z with small samples. It uses the following formula:
$t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$

34 Student's t distribution
Student's t distribution was developed in 1908 by W. S. Gosset (1876-1937), who worked for the Guinness Brewery.

35 Properties of the t distribution
1. Mean = 0.
2. It is symmetrical about the mean.
3. Variance is greater than 1 but approaches 1 as the sample gets large. For df > 2, the variance = df/(df-2), or $(n-1)/(n-3)$ when df = $n-1$.
(continued)

36 Properties of the t distribution
4. The range is $-\infty$ to $+\infty$.
5. t is really a family of distributions, with a different distribution for each value of $n-1$, the divisor used in computing $s^2$.
6. Compared with the normal distribution, t is less peaked and has higher tails.
7. The t distribution approaches the normal distribution as $n-1$ approaches infinity.

38 Confidence interval for a mean using t
General relationship
estimator $\pm$ (reliability coefficient) $\times$ (standard error)
The reliability coefficient is obtained from the t distribution.

39 Confidence interval
When sampling is from a normal distribution whose standard deviation, $\sigma$, is unknown, the $100(1-\alpha)$ percent confidence interval for the population mean, $\mu$, is given by:
$\bar{x} \pm t_{1-\alpha/2}\,\dfrac{s}{\sqrt{n}}$

40 Deciding between z and t
• When constructing a confidence interval for a population mean, we must decide whether to use z or t.
• Which one to use depends on the size of the sample, whether the population is normally distributed, and whether the population variance is known.
• There are various flowcharts and decision keys that can be used to help decide. Mine appears below.

41 Key for deciding between z and t in confidence interval construction
1. Population normally distributed ................ 2
   Population not normally distributed ............ 5
2. Sample size is large (30 or higher) ............ 3
   Sample size is small (less than 30) ............ 4
3. Population variance is known ................... use z
   Population variance not known .................. use t (or z)
4. Population variance is known ................... use z
   Population variance is not known ............... use t
5. Sample size is large ........................... 6
   Sample size is small ........................... 7
6. Population variance is known ................... use z
   Population variance not known (central limit theorem applies) ... use z
7. Must use a nonparametric method

42 Example
In a study of preeclampsia, Kaminski and Rechberger found the mean systolic blood pressure of 10 healthy, nonpregnant women to be 119 with a standard deviation of 2.1.
(continued)

43 Example
(Preeclampsia: development of hypertension, albuminuria, or edema between the 20th week of pregnancy and the first week postpartum. Eclampsia: coma and/or convulsive seizures in the same time period, without other etiology.)

44 Example
a. What is the estimated standard error of the mean?
b. Construct the 99% confidence interval for the mean of the population from which the 10 subjects may be presumed to be a random sample.
c. What is the precision of the estimate?
d. What assumptions are necessary for the validity of the confidence interval you constructed?

45 Solution
(1) Given
$n = 10$, $\bar{x} = 119$, $s = 2.1$

46 (2) Sketch of t distribution

47 Reading the t table
For $n - 1 = 9$ degrees of freedom and a 99% interval, $t_{.995} = 3.2498$.

49 (3) Calculations
$s_{\bar{x}} = s/\sqrt{n} = 2.1/\sqrt{10} = .6640783086$
$119 \pm 3.2498(.66407\ldots)$
(116.84, 121.16)

50 Solution
Precision = 3.2498 (.66407...) = 2.158121687
Assumptions
• The population is normally distributed.
• The 10 subjects represent a random sample from this population.

51 B) Confidence interval for the difference of two population means
Introduction
From each of two populations an independent random sample is drawn. Sample means, $\bar{x}_1$ and $\bar{x}_2$, are calculated.
(continued)

52 B) Confidence interval for the difference of two population means
Introduction
The difference is $\bar{x}_1 - \bar{x}_2$, which is an unbiased estimator of the difference between the two population means, $\mu_1 - \mu_2$. The variance of the estimator is
$\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}$

53 Conditions for use
Assuming the populations are normally distributed, there are three situations in which we would determine the $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$.
(continued)

54 Conditions for use
a) where the population variances are known (use z)
b) where the population variances are unknown but equal (use t)
c) where the population variances are unknown but unequal (use t').
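The single-mean intervals worked earlier in this unit (Exercise 6.2.2 with z, and the preeclampsia example with t) all follow the same pattern: estimator ± (reliability coefficient) × (standard error). The following is a minimal Python sketch of that pattern, not part of the original slides. The function name `mean_ci` is hypothetical, and the reliability coefficients are assumed to be read from a table and passed in by the caller, as they are on the slides.

```python
import math


def mean_ci(xbar, sd, n, coef):
    """Return (lower, upper, margin) for xbar +/- coef * sd / sqrt(n).

    coef is the reliability coefficient (z or t). It is supplied by the
    caller from a table; this sketch does not look it up.
    """
    se = sd / math.sqrt(n)      # standard error of the mean
    margin = coef * se          # precision, also called margin of error
    return xbar - margin, xbar + margin, margin


# Exercise 6.2.2: sigma known, z coefficients as used on the slides
for level, z in [("90%", 1.645), ("95%", 1.96), ("99%", 2.575)]:
    lo, hi, _ = mean_ci(5.98, 3.5, 16, z)
    print(f"{level}: ({lo:.4f}, {hi:.4f})")

# Preeclampsia example: sigma unknown, t = 3.2498 (9 df) as used on the slides
lo, hi, margin = mean_ci(119, 2.1, 10, 3.2498)
print(f"99% t interval: ({lo:.2f}, {hi:.2f}), precision {margin:.4f}")
```

Passing the coefficient in, rather than computing it, keeps the sketch within the standard library and mirrors the table-lookup step on the slides.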
55 Population variances are known
When the population variances are known, the $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$ is given by
$(\bar{x}_1 - \bar{x}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

56 Example 6.4.1
A research team is interested in the difference between serum uric acid levels in patients with and without Down's syndrome. In a large hospital for the treatment of the mentally retarded, a sample of 12 individuals with Down's syndrome yielded a mean of $\bar{x}_1 = 4.5$ mg/100 ml. In a general hospital a sample of 15 normal individuals of the same age and sex was found to have a mean value of $\bar{x}_2 = 3.4$ mg/100 ml. If it is reasonable to assume that the two populations of values are normally distributed with variances equal to 1 and 1.5, find the 95% confidence interval for $\mu_1 - \mu_2$.

57 Solution
(1) Given
$n_1 = 12$, $\bar{x}_1 = 4.5$, $\sigma_1^2 = 1$
$n_2 = 15$, $\bar{x}_2 = 3.4$, $\sigma_2^2 = 1.5$

58 Solution
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is $\bar{x}_1 - \bar{x}_2 = 4.5 - 3.4 = 1.1$.

59 Solution
The standard error is
$\sqrt{\dfrac{1}{12} + \dfrac{1.5}{15}} = .4282$

60 Solution
The 95% confidence interval is
$1.1 \pm 1.96(.4282)$
(.26, 1.94)

61 Population variances unknown but equal
If it can be assumed that the population variances are equal, then each sample variance is a point estimate of the same quantity. Therefore, we can combine the sample variances to form a pooled estimate.

62 Weighted averages
The pooled estimate of the common variance is a weighted average: each sample variance is weighted by its degrees of freedom.
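The known-variance calculation in Example 6.4.1 above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `diff_means_ci_known_var` is hypothetical, and the z value of 1.96 is passed in as on the slides rather than computed.

```python
import math


def diff_means_ci_known_var(xbar1, xbar2, var1, var2, n1, n2, z):
    """CI for mu1 - mu2 when both population variances are known.

    Returns (lower, upper, standard_error).
    """
    diff = xbar1 - xbar2                      # point estimate
    se = math.sqrt(var1 / n1 + var2 / n2)     # standard error of the difference
    return diff - z * se, diff + z * se, se


# Example 6.4.1: serum uric acid levels
lo, hi, se = diff_means_ci_known_var(4.5, 3.4, 1.0, 1.5, 12, 15, 1.96)
print(f"SE = {se:.4f}, 95% CI = ({lo:.2f}, {hi:.2f})")
```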
63 Pooled estimate of the variance
The pooled estimate of the variance comes from the formula:
$s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$

64 Standard error of the estimate
The standard error of the estimate is
$\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$

65 Confidence interval
The $100(1-\alpha)$ percent confidence interval for $\mu_1 - \mu_2$ is:
$(\bar{x}_1 - \bar{x}_2) \pm t_{1-\alpha/2}\sqrt{\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}}$
with $n_1 + n_2 - 2$ degrees of freedom.

66 Example
(1) Given
$n_1 = 13$, $\bar{x}_1 = 21.0$, $s_1 = 4.9$
$n_2 = 17$, $\bar{x}_2 = 12.1$, $s_2 = 5.6$
$\alpha = .05$

67 Example
(2) Calculations
The point estimate for $\mu_1 - \mu_2$ is $\bar{x}_1 - \bar{x}_2 = 21.0 - 12.1 = 8.9$.

68 Example
The pooled estimate of the variance is
$s_p^2 = \dfrac{12(4.9)^2 + 16(5.6)^2}{28} = 28.21$

69 Example
The standard error is
$\sqrt{\dfrac{28.21}{13} + \dfrac{28.21}{17}} = 1.9569$

70 Example
The 95% confidence interval, with $t_{.975} = 2.0484$ for 28 degrees of freedom, is
$8.9 \pm 2.0484(1.9569)$
$8.9 \pm 4.0085$
(4.9, 12.9)

71 Population variances unknown and not equal
With unequal variances, the quantity used to construct the interval does not follow the t distribution. A substitute reliability factor called t' has been proposed.

72 C) Confidence interval for a population proportion
To begin, a sample is drawn from the population of interest and the sample proportion, $\hat{p}$, is calculated. This sample proportion is used as the point estimator of the population proportion, $p$. The confidence interval is defined by the general formula:
estimator $\pm$ (reliability coefficient) $\times$ (standard error)

73 Distribution
When $n$ is large, the reliability coefficient will be z from the standard normal distribution. Since $p$, the population proportion, is unknown, we use $\hat{p}$ as an estimate. The estimate of $\sigma_{\hat{p}}$, the standard error, is given by:
$\sqrt{\hat{p}(1-\hat{p})/n}$

74 Confidence interval
The $100(1-\alpha)$ percent confidence interval for $p$ is given by:
$\hat{p} \pm z_{1-\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$

75 Probabilistic interpretation
We say that we are 95% confident that the population proportion, $p$, lies between the calculated limits since, in repeated sampling, about 95% of the intervals constructed this way would contain $p$.

76 Practical interpretation
In a specific example, we would expect, with 95% confidence, to find the population proportion between the two boundaries.
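The large-sample interval for a proportion described above can be sketched in Python as follows. This is an illustration only, not part of the original slides: the function name `proportion_ci` is hypothetical, the z value is passed in from a table, and the inputs are the figures from Example 6.5.2 (x = 615 of n = 1229), which follows.

```python
import math


def proportion_ci(successes, n, z):
    """Large-sample CI for a population proportion.

    Computes p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n).
    Assumes n is large enough for the normal approximation.
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # estimated standard error
    return p_hat - z * se, p_hat + z * se


# Figures from Example 6.5.2
lo, hi = proportion_ci(615, 1229, 1.96)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```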
77 Example 6.5.2
A research study obtained data regarding sexual behavior from a sample of unmarried men and women between the ages of 20 and 44 residing in geographic areas characterized by high rates of sexually transmitted diseases and admission to drug programs. Fifty percent of 1229 respondents reported that they never used a condom. Construct a 95 percent confidence interval for the population proportion never using a condom.

78 Solution
(1) Given
$n = 1229$, $\hat{p} = .50$ (for the TI-83, x = 615)

79 Solution
(2) Calculation
$.50 \pm 1.96\sqrt{(.50)(.50)/1229} = .50 \pm 1.96(.0143) = .50 \pm .028$
(.472, .528)

80 D) Confidence interval for the difference of two population proportions
When studying the difference between two population proportions, the difference between the two sample proportions, $\hat{p}_1 - \hat{p}_2$, can be used as an unbiased point estimator for the difference between the two population proportions, $p_1 - p_2$. This is used with the general formula:
estimator $\pm$ (reliability coefficient) $\times$ (standard error)

81 Distribution
When the central limit theorem applies, the normal distribution is used to obtain confidence intervals. The standard error is estimated by the formula:
$\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

82 Confidence interval
The $100(1-\alpha)$ percent confidence interval for $p_1 - p_2$ is given by:
$(\hat{p}_1 - \hat{p}_2) \pm z_{1-\alpha/2}\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$

83 Probabilistic interpretation
We say that we are 95% confident that the difference between the two population proportions, $p_1 - p_2$, lies between the calculated limits since, in repeated sampling, about 95% of the intervals constructed this way would contain $p_1 - p_2$.

84 Practical interpretation
In a specific example, we would expect, with 95% confidence, to find the difference between the two population proportions between the two limits.

85 Example 6.6.1
A study of teenage suicide included a sample of 96 boys and 123 girls between the ages of 12 and 16 years selected scientifically from admissions records to a private psychiatric hospital. Suicide attempts were reported by 18 of the boys and 60 of the girls. We assume that the girls constitute a simple random sample from a population of similar girls and likewise for the boys.
Construct a 99 percent confidence interval for the difference between the two proportions.

86 Solution
(1) Given
$n_1 = 123$, $\hat{p}_1 = 60/123 = .4878$ (girls)
$n_2 = 96$, $\hat{p}_2 = 18/96 = .1875$ (boys)

87 Solution
(2) Calculation
$(.4878 - .1875) \pm 2.575\sqrt{\dfrac{(.4878)(.5122)}{123} + \dfrac{(.1875)(.8125)}{96}}$
$= .3003 \pm 2.575(.06015) = .3003 \pm .1549$
(.1454, .4552)

88 Determining the sample size for estimating means
It is important to have a sample of the correct size, and to have a method for determining that size when estimating a population mean or a population proportion. This matters especially in business or commercial situations where money is involved. A sample that is too large wastes money. One that is too small may give imprecise results.

89 Objectives
The width of the confidence interval is determined by the magnitude of the margin of error, which is given by:
d = (reliability coefficient) $\times$ (standard error)
The total width of the interval is twice this amount.

90 Reducing the margin of error
In the standard error, $\sigma/\sqrt{n}$, the value of $\sigma$ is a constant. If the reliability coefficient is fixed, the only way to reduce the margin of error is to increase the sample size. The required sample size depends on the size of $\sigma$, the degree of reliability, and the desired interval width.

91 Margin of error
$d = z\,\dfrac{\sigma}{\sqrt{n}}$

92 Sample size for a large population
d = (reliability coefficient) $\times$ (standard error)
Solving for $n$ gives
$n = \dfrac{z^2\sigma^2}{d^2}$

93 Estimating $\sigma^2$
Generally the variance of the population under study is unknown. As a result, $\sigma$ has to be estimated. The most common sources of estimates for $\sigma$ are:
1. A pilot sample drawn from the population, whose standard deviation is used as an estimate of $\sigma$.
2. Estimates of $\sigma$ from previous or similar studies.
3. In a normally distributed population, the range is usually about 6 standard deviations, so $\sigma$ is estimated by R/6.

94 Determination of the sample size for estimating proportions
The manner of finding sample sizes for estimating a population proportion is basically the same as for estimating a mean.
The general formula is:
d = (reliability coefficient) $\times$ (standard error) $= z\sqrt{p(1-p)/n}$

95 Sample size
Assuming proper random sampling and an approximately normal distribution, the sample size is
$n = \dfrac{z^2\,p(1-p)}{d^2}$

96 Estimating the population proportion
It is necessary to estimate the population proportion, $p$, to use in the determination of the sample size.
1. If an upper limit is suspected or presumed, it could be used to represent $p$.
2. A pilot sample could be drawn and used to obtain an estimate for $p$.
3. With no better estimate, one may use $p = .5$, which gives the maximum value of $n$.

97 E) Confidence interval for the variance of a normally distributed population
Measures of dispersion
$\sigma^2 = \sum (x_i - \mu)^2/N$
$S^2 = \sum (x_i - \mu)^2/(N-1)$
(continued)

98 E) Confidence interval for the variance of a normally distributed population
Measures of dispersion
$E(s^2) = \sigma^2$ when sampling is with replacement
$E(s^2) = S^2$ when sampling is without replacement.

99 Large population size
When $N$ is large, $N$ and $N-1$ are approximately equal, so $\sigma^2$ and $S^2$ will be approximately equal. These results justify using $s^2$ to estimate the population variance.

100 Interval estimate of a population variance
• The value of $s^2$ is used as a point estimator of the population variance, $\sigma^2$.
• Confidence intervals for $\sigma^2$ are based on the sampling distribution of $(n-1)s^2/\sigma^2$.
• If samples of size $n$ are drawn from a normally distributed population, this quantity has a distribution known as the chi-square distribution with $n-1$ degrees of freedom.
• The assumption that the sample is drawn from a normally distributed population is crucial.

101 The chi-square distribution
The chi-square distribution is not symmetrical. For small degrees of freedom, its shape changes markedly with the degrees of freedom. The distribution does not take negative values.

102 Microsoft Excel Demonstration
Note how the shape of the curve changes depending on the degrees of freedom. With 1 degree of freedom, the curve is hyperbolic. [Here follows the Excel Worksheet.]
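The statement earlier in this section that $E(s^2)$ equals $\sigma^2$ (divisor $N$) under sampling with replacement, and $S^2$ (divisor $N-1$) under sampling without replacement, can be checked by enumerating every possible sample from a very small population. The Python sketch below is an illustration under stated assumptions: the four-value population and the sample size of 2 are hypothetical and chosen only to keep the enumeration short.

```python
from itertools import permutations, product
from statistics import mean, variance

# Hypothetical tiny population, N = 4 (not from the slides)
population = [2.0, 4.0, 6.0, 8.0]
N, n = len(population), 2
mu = mean(population)

sum_sq = sum((x - mu) ** 2 for x in population)
sigma2 = sum_sq / N          # population variance with divisor N
S2 = sum_sq / (N - 1)        # population variance with divisor N - 1

# Sample variance s^2 (divisor n - 1) for every ordered sample of size n
with_repl = [variance(s) for s in product(population, repeat=n)]
without_repl = [variance(s) for s in permutations(population, n)]

print("with replacement:    mean of s^2 =", mean(with_repl), " sigma^2 =", sigma2)
print("without replacement: mean of s^2 =", mean(without_repl), " S^2 =", S2)
```

Exhaustive enumeration is used instead of random simulation so that the averages match the expected values exactly, up to floating-point rounding.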
103 Microsoft Excel Demonstration

104 Reading the $\chi^2$ table

105 Finding $\chi^2$ values

109 Confidence interval on the $\chi^2$ distribution
The $100(1-\alpha)$ percent confidence interval for the distribution of $(n-1)s^2/\sigma^2$ is the two-tailed region of the $\chi^2$ distribution between $\chi^2_{\alpha/2}$ and $\chi^2_{1-\alpha/2}$. This interval is given by
$\chi^2_{\alpha/2} < \dfrac{(n-1)s^2}{\sigma^2} < \chi^2_{1-\alpha/2}$

110 Confidence interval for $\sigma^2$
From the sampling distribution of $(n-1)s^2/\sigma^2$, the confidence interval for $\sigma^2$ is derived. The formula is:
$\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}} < \sigma^2 < \dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}$

111 Confidence interval for $\sigma$
To get the $100(1-\alpha)$ percent confidence interval for $\sigma$, the population standard deviation, the square root of each term is taken. The result is the formula below.
$\sqrt{\dfrac{(n-1)s^2}{\chi^2_{1-\alpha/2}}} < \sigma < \sqrt{\dfrac{(n-1)s^2}{\chi^2_{\alpha/2}}}$

112 Example 6.9.1
In a study on cholesterol levels a sample of 12 men and women was chosen. The plasma cholesterol levels (mmol/L) of the subjects were as follows: 6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, and 5.8. We assume that these 12 subjects constitute a simple random sample of a population of similar subjects. We wish to estimate the variance of the plasma cholesterol levels with a 95 percent confidence interval.

113 Solution
(1) Given
6.0 6.4 7.0 5.8 6.0 5.8 5.9 6.7 6.1 6.5 6.3 5.8
Estimate the variance with a 95% confidence interval.

114 Solution
(2) Calculations
Value of $s$ = .3918680978
Values of $\chi^2$ from the table (11 degrees of freedom):
$\chi^2_{.975} = 21.920$
$\chi^2_{.025} = 3.816$

115 Calculations
$\dfrac{11(.3918680978)^2}{21.920} < \sigma^2 < \dfrac{11(.3918680978)^2}{3.816}$
$.0771 < \sigma^2 < .4427$

116 F) Confidence interval for the ratio of variances of two normally distributed populations
A way to compare the variances of two normally distributed populations is to use the variance ratio, $\sigma_1^2/\sigma_2^2$, which is estimated from samples by $s_1^2/s_2^2$. The variance ratio is used, among other things, as the test statistic for analysis of variance (ANOVA). If the two population variances are equal, then V.R. = 1.

117 Sampling distribution
The sampling distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used. Since the population variances are usually not known, the sample variances are used. The assumptions are that $s_1^2$ and $s_2^2$ are computed from independent samples of size $n_1$ and $n_2$, respectively, drawn from two normally distributed populations.
(continued)

118 Sampling distribution
If the assumptions are met, $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ follows a distribution known as the F distribution, which has two values for degrees of freedom.

119 Degrees of freedom
• The F distribution uses two values for degrees of freedom.
• The numerator degrees of freedom is $n_1 - 1$, which is used in calculating $s_1^2$.
• The denominator degrees of freedom is $n_2 - 1$, which is used in calculating $s_2^2$.

120 The F distribution
• The F distribution is not symmetrical.
• The distribution does not take negative values.
• Because it uses two values of degrees of freedom, there is a separate table for each confidence level.

121 F distribution tables

122 Reading F tables
F tables come in versions based on the cumulative probability $1-\alpha/2$ (for example $F_{.95}$, $F_{.975}$, and $F_{.995}$, the tables shown in this unit), each giving one-tailed values. For two-tailed intervals, the lower boundary, $F_{\alpha/2}$, must be calculated as the reciprocal of the upper-tail value with the degrees of freedom interchanged:
$F_{\alpha/2;\,df_1,\,df_2} = 1/F_{1-\alpha/2;\,df_2,\,df_1}$

124 Two-tail F distribution boundaries

125 The $F_{.95}$ table

126 The $F_{.975}$ table

127 The $F_{.995}$ table

128 Confidence interval for $\sigma_1^2/\sigma_2^2$
The distribution of $(s_1^2/\sigma_1^2)/(s_2^2/\sigma_2^2)$ is used to establish the $100(1-\alpha)$ percent confidence interval for $\sigma_1^2/\sigma_2^2$. The starting point is
$F_{\alpha/2} < \dfrac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} < F_{1-\alpha/2}$
(continued)

129 Confidence interval for $\sigma_1^2/\sigma_2^2$
From this relation, it can be shown that the $100(1-\alpha)$ percent confidence interval for $\sigma_1^2/\sigma_2^2$ is
$\dfrac{s_1^2/s_2^2}{F_{1-\alpha/2}} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{s_1^2/s_2^2}{F_{\alpha/2}}$

130 Example 6.10.1
Among 11 patients in a certain study, the standard deviation of the property of interest was 5.8. In another group of 4 patients, the standard deviation was 3.4. We wish to construct a 95 percent confidence interval for the ratio of the variances of these two populations.

131 Solution
(1) Given
$n_1 = 11$, $n_2 = 4$, $\alpha = .05$
$s_1^2 = (5.8)^2 = 33.64$
$s_2^2 = (3.4)^2 = 11.56$
$F_{.975;\,10,\,3} = 14.42$
$F_{.025;\,10,\,3} = 1/F_{.975;\,3,\,10} = 1/4.83 = .20704$

133 Solution
(2) Calculations
Calculation of the 95% confidence interval for $\sigma_1^2/\sigma_2^2$:
$\dfrac{33.64/11.56}{14.42} < \dfrac{\sigma_1^2}{\sigma_2^2} < \dfrac{33.64/11.56}{.20704}$
$.2018 < \sigma_1^2/\sigma_2^2 < 14.0554$
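The two variance intervals in sections E and F (Examples 6.9.1 and 6.10.1) can be sketched in Python as below. This is an illustrative sketch, not the method used on the slides: the function names `variance_ci` and `variance_ratio_ci` are hypothetical, and the chi-square and F critical values are assumed to be read from tables and passed in, exactly as given in the examples. Because the lower F value is computed here as 1/4.83 without intermediate rounding, the upper limit of the ratio interval may differ from the slide's figure in the last decimal place.

```python
from statistics import variance


def variance_ci(s2, n, chi2_lower, chi2_upper):
    """CI for sigma^2: (n-1)s^2/chi2_upper < sigma^2 < (n-1)s^2/chi2_lower."""
    numerator = (n - 1) * s2
    return numerator / chi2_upper, numerator / chi2_lower


def variance_ratio_ci(s2_1, s2_2, f_lower, f_upper):
    """CI for sigma1^2/sigma2^2: ratio/f_upper < true ratio < ratio/f_lower."""
    ratio = s2_1 / s2_2
    return ratio / f_upper, ratio / f_lower


# Example 6.9.1: plasma cholesterol levels, chi-square values for 11 df
levels = [6.0, 6.4, 7.0, 5.8, 6.0, 5.8, 5.9, 6.7, 6.1, 6.5, 6.3, 5.8]
s2 = variance(levels)                       # sample variance, divisor n - 1
var_lo, var_hi = variance_ci(s2, len(levels), 3.816, 21.920)
print(f"s = {s2 ** 0.5:.4f}")
print(f"95% CI for variance: ({var_lo:.4f}, {var_hi:.4f})")
print(f"95% CI for std dev:  ({var_lo ** 0.5:.4f}, {var_hi ** 0.5:.4f})")

# Example 6.10.1: F values for (10, 3) degrees of freedom
f_upper = 14.42
f_lower = 1 / 4.83                          # reciprocal of F(.975; 3, 10)
ratio_lo, ratio_hi = variance_ratio_ci(5.8 ** 2, 3.4 ** 2, f_lower, f_upper)
print(f"95% CI for variance ratio: ({ratio_lo:.4f}, {ratio_hi:.4f})")
```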