CHAPTER: FIVE ESTIMATION Statistics Descriptive Organising, summarising & describing data Inferential Prediction Generalising Relationships Significance Estimation – describing population parameters in terms of sample statistic. Parameter is a numerical descriptive measure of a population data. Statistic is a numerical descriptive measure of a sample data. The sample from a population is used to provide the estimates of the population parameter. For each sample statistics we use a corresponding population parameter. Sample statistic (Sample mean) S (Sample SD) S2 (Sample Variance) p (Sample Proportion) Corresponding population parameter (Population mean) (Population SD) 2 (Population Variance) P or (Population proportion) Types of Estimation • 1. Point Estimation A single numerical value used to estimate the corresponding population parameter _ x is an estimator of the population mean μ s is an estimator of the population standard deviation σ p is an estimator of the population proportion π r is an estimator of the population correlation coefficient ρ ˆ is an estimator of population OR • OR • RR̂ is an estimator of population RR • • • • Properties of a good estimates a) Unbiasedness: A sample statistic whose mean is equal to the population parameter it estimates is unbiased. • The sample mean and median are unbiased estimators of the population mean μ. 6 b) Minimum variance: An estimate which has a minimum standard error is a good estimator. • For symmetrical distribution the mean has a minimum standard error • If the distribution is skewed the median has a minimum standard error. 7 2. Interval Estimation Interval estimation - is an estimate of parameter in terms of an interval or range of value with which it is likely to lie. • Point estimation does not give any indication on how far away the parameter lies. • A more useful method of estimation is to compute an interval which has a high probability of containing the parameter. Interval estimation is better than point estimation because it considers the variation within the samples. The interval with in which population parameter is assumed to lie is confidence interval. The upper and the lower values of the interval are called confidence limit. The population parameter lies in interval and it is not 100% sure but we have some confidence that the population parameter may lie on this interval. 1. Confidence Interval for a one population Mean Confidence interval – the estimate of a parameter as a whole range of possible values Margin of error – the point to which point estimate is accurate – when we generalize to a whole population based on a single value, some accuracy is lost. • A confidence interval extends either side of the sample mean by a multiple of the standard error. • is most common to calculate a 95% confidence interval, this extends 1.96 standard errors (SE) either side of the mean. Level of Confidence – is a measure of how certain the results are. Example – level of confidence 95% means 95% of the time the sample we pick is sufficiently representative of the whole population to allow us to make generalization. Possibility of Error – is the risk that the sample, on which the estimation ha been based, was misleading and more different from the general population than expected. Example – if the level of confidence is 95%, the risk is 5%. The possibility that the confidence interval will indeed include the parameter is called the degree of confidence or confidence level. • α is to be chosen by the researcher, most common values of α are 0.05, 0.01, 0.001 and 0.1. Case I: if is known and population is normal If we have X-value in N (,), then we can determine Z-value by in N (0,1) x Z So the sampling distribution will also be normally distributed with mean x and standard deviation x (/n) then, in N (0,1) Z / n The, the (1-) 100% confidence interval is: XZ/2/n • 100(1-α)% CI for μ when σ is known (sampling from normal population or large sample) precision of the estimate (margin of error) Estimator XZ/2/n Reliability coefficient Standard Error Example • A physical therapist wished to estimate, with 99% confidence, the mean maximal strength of a particular muscle in a certain group of individuals. He assume that strength scores are approximately normally distributed with a variance of 144. A sample of 15 subjects who participated in the experiment yielded a mean of 84.3. Estimate the maximum strength of a particular muscle for the total population. Solution • α = 0.01 ⇒ Zα/ 2 = 2.575 • x = 84.3, n =15, σ = 12 • 84.3 ± 2.58(12/15) ⇒ 84.3 ± 8.0 ⇒ (76.3, 92.3) • ⇒ We are 99% confident that the population mean is between 76.3 and 92.3. Case II: Unknown variance (small sample size n ≤ 30) We use here Students’ t-distribution for determination of confidence interval The, the (1-) 100% confidence interval is: X t/2(n-1)S/n Example • A study of hypoxemia during the immediate postoperative period reported the fractions of ideal weight for 11 patients who became severely hypoxemic during transfer to the recovery room. The mean is 1.51 and the standard deviation is 0.33. Estimate the 95% C.I. for the population mean fraction of ideal weight, where the population consists of hypoxemic patients similar to those in the study (The data is normally distributed, use α=0.05). Example tα/2 n‐1= t 0.025,10 = 2.2281 1.51 ± 2.2281 0 . 33 11 1.51 ± 0.221 (1.289, 1.731 ) We are 95% sure that the population mean μ lies between 1.289 and 1.731 2. C.I. for the difference between two population means (normally distributed) i) Known variance (2 independent samples) • A 100(1‐α)% C.I. for μ1 ‐ μ2 is 1 2 n1 n2 2 ( X 1 X 2) Z / 2 2 ii) Unknown variances a) Equal variances (2 independent samples) • A 100(1‐α)% C.I. for μ1 ‐ μ2 is 2 ( X 1 X 2) t / 2 , n1 Where S p n2 2 sp s p n1 n2 ( n11) S12 ( n2 1) S22 ) n1n2 2 2 2 1 ≤ 2 then we assume that the 2 2 S * If 0.5 ≤ S population variances are equal. Example A sample of 10 twelve years old boys and a sample of 10 twelve years old girls yielded mean height of 59.8 inches (boys), and 58.5 inches (girls) with S1=3 and S2= 4 inches respectively. Assuming the population is normality distributed, find the 95% CI for the difference in means of height between girls and boys at this age. Solution 2 1 2 2 S S (9) 2 (16) 2 = = 0.56 ⇒ 0.5<0.56<2 • We assume that the population variance are equal. S p 9 ( 9 ) 9 (16) 18 = 3.5 • α = 0.05 ⇒ α/2 = 0.025 ⇒ t0.025,18 = 2.101 (59.8 ‐ 58.5)± 2.101 (3.5) 10 2 2 (3.5) 10 ⇒ (‐2, 4.6) • We are 95% sure that μ1 ‐ μ2 is between ‐2 and 4.6. b) Unequal variances (2 independent samples) A 100(1‐α)% C.I. for μ1 – μ2 is: 2 ( X 1 X 2) t ' / 2 , f 2 s 1 s2 n1 n2 where the degree of freedom f is given by: f 2 S1 n 1 2 S1 n1 n11 2 2 S2 n2 2 S 2 n2 n21 2 2 Example The serum progesterone levels for 29 women with ectopic pregnancies and 20 women with early intrauterine pregnancies are obtained. The data are normally distributed with mean 5.6 ng/ml and standard deviation of 3.6 for the ectopic pregnancies and mean 30.9 ng/ml and standard deviation of 6.9 for the early intrauterine pregnancies. Calculate a 95% confidence interval for μ1- μ2. Solution S12 2 = S2 (3.6) 2 2 (6.9) = 0.3 ⇒0.3<0.5 2 f 3.6 29 2 3.6 2 29 291 2 6.9 20 2 6.9 20 201 2 2 26 • α = 0.05 ⇒ t α/2, 26 = 2.025 • The 95% C.I. for μ1‐μ2 is then (5.6 30.9) 2.056 3.6 29 2 6 . 9 2 20 ⇒−25.3±3.5⇒(−28.8,−21.8) At 95% level of confidence the difference in serum progesterone level for women having ectopic and intrauterine pregnancy lies b/n -28.8 and -21.8 c) Paired sample • Convert the two paired samples into a single sample of differences. dx = X1i – X2i, i = 1, 2, ..., n. A 100(1‐α)% C.I. for μ1 ‐ μ2 is: d t / 2 , n 1 Sd n Example • A study on the effect of low‐calorie intake on abnormal pulmonary physiology in patients with chronic hypercapneic respiratory failure. • Measurement of patients’ arterial oxygen tension before and after the weight loss program Patient 1 2 3 4 5 6 7 8 AOT Before 70 59 53 54 44 58 64 43 mm Hg After 82 66 65 62 74 77 68 59 12 7 12 8 30 19 4 16 Difference mean SD 13.3 8.2 • A 90% C.I. α = 0.1 ⇒ α/2 = 0.05 and • t 0.05,7 = 1.8895 13.5 ± (1.895) 8.2 = 13.5 ± 5.5 ⇒ (8,19) 8 • ⇒ We can be 90% sure that the interval from 8 to 19 mm Hg contains the actual mean increase in arterial oxygen tension for patients after weight reduction program. C.I. for a population proportion (large sample size) A 100(1- α)% C.I. for π is: p± Z 2α p(1 - p) n Example: A study on dental health practice of 300 adults interviewed, 123 said that they regularly had a dental check-up twice a year. What is the 95% C.I. for ? P = 123/300 = 0.41 a point estimator of . = 0.05 Z0.025 = 1.96 0.41±1.96 (0.41)(0.59) 300 (0.36, 0.46) • At 95% level of confidence, the proportion of adult population who regularly had a dental check-up twice a year is (0.36, 0.46). 4. C.I. for the difference between two population proportions (large sample size) A 100(1‐α)% C.I. for π1 ‐ π2 is: ( P1 P 2) Z P1(1 P1) 2 n1 P 2(1 P2 ) n2 Example • Two hundred patients suffering from a certain disease were randomly divided into two equal groups. Of the first group, who received the standard treatment, 78 recovered within three days. Out of the other 100, who were treated by a new method, 90 recovered within three days. The physician wished to estimate the true difference in the proportions who would recovered within three days. Solution • The estimate of the difference in the population proportions is • P1 - P2 = 0 78 – 0.90 = ‐0.12 • The 95% C.I. is: (0.78 0.90) 1.96 0.78( 0.22) 100 0.90( 0.10) 100 -0.12 ± 0.10 ⇒ ( − 0 .22 ,− 0 .02 ) • we are 95% sure difference between that the is –0.22 and –0.02. Note that the negative signs merely reflect the fact that better results were obtained by using the new treatment. Sample Size Estimation What is sample size? • Sample size:- the number of study population required to study an estimate in a population • Researchers always ask themselves “How big sample do I need?” – Too large sample size: too expensive and time consuming – Too small sample size: it has inadequate precision to show a good estimate or show difference 42 Sample size • Sample size is dependent on the type of design, – Descriptive vs analytic – Case control vs Cohort – Analytic cross sectional vs Case control • It is also dependent on the type of major variable used (categorical vs continuous variable) 43 Sample size depends on 1. Estimated variability 2. The precision (margin of error) 3. The sampling method (clustering, design effect) 4. Size of population 5. Feasibility (cost) 6. Confidence level (Z value of certainty) 44 Why is it important to consider sample size? • In studies concerned with estimating some characteristic of a population (e.g. the prevalence of asthmatic children), sample size calculations are important to ensure that estimates are obtained with required precision or confidence. • For example, a prevalence of 10% from a sample of size 20 would have a 95% confidence interval of 1% to 31%, which is not very precise or informative. • On the other hand, a prevalence of 10% from a sample of size 400 would have a 95% confidence interval of 7% to 13%, which may be considered sufficiently accurate. • Sample size calculations help to avoid this situation. Therefore, to obtain the optimum sample size, decision on the following is important. 1. How large error can be tolerated during estimation (d) 2. Confidence limit that the tolerated error will not exceed the determined one (1-) 3. Our advanced guess of population variance or proportion. Where do we get this knowledge? • Previous published studies • Pilot studies • If information is lacking, there is no good way to calculate the sample size! Based on the parameter to be estimated (whether population mean or proportion), we have two sets of formula. 1. Estimation of population mean () • The maximum error made in estimating population parameter is given by: d = Z/2/n this formula for calculating the margin of error • solving for n, we square both sides d2 = (Z/2)2 2 n n = (Z/2)2 2 d2 Example: Populations of cancer patient have a survival standard deviation of 43.3 months. If one wants to conduct a sample survey on these populations, how large sample is needed so that 95% of the means of these samples of size will be with in 6 months of the population mean? The population size is 480 patients. Solution: n = (1.96)2 (43.3)2 36 = 200 2. Estimation of population proportion (P or ) Assuming simple random sample and normality of the distribution of population leads to he following formula: n = (Z/2)2 (1- ) d2 Where = an advanced guess for population proportion of the most important variable q = 1- d = precision required in % (0.01-0.05) • If there is no information on the advanced guess of population proportion, then it is usually taken as 0.5. Example: What sample size do we require to achieve a 95% confidence interval of width ± 5% ( that is to be within 5% of the true value) ? In a study some years ago that found approximately 30% were smokers. Solution: n = (1.96)2 (0.3 x0.7) (0.05)2 = 323 Common sample size calculations for two population Comparing 2 independent groups- means Comparing 2 related groups- means Comparing 2 independent groups- proportions Comparing 2 related groups- proportions 57 Sample size estimation for tests between two independent sample proportions Formula: 58 Sample size estimation for tests between two independent sample proportions cont… Where as N= the sample size estimate Zcv=Z critical value for alpha (.05 alpha has a Zcv of 1.96) Z power=Z value for 1-beta (.80 power has a Z of 0.842) P1=expected proportion for sample 1 P2=expected proportion for sample 2 59 Sample size estimation for tests between two independent sample proportions cont… Proportion Example Alpha=.05 Power=.80 P1=.70 P2=.80 p= .75 60 Sample size estimation for tests between two independent sample means where N= the sample size estimate Zcv=Z critical value for alpha (.05 alpha has a Zcv of 1.96) Zpower=Z value for 1-beta (.80 power has a Z of 0.842) s=standard deviation D=the expected difference between the two means. 61 Sample size estimation for tests between two independent sample means cont… Mean Example Alpha=.05 Power=.80 D=10 S=20 62 Sample Size Adjustments • If sampling is from a finite population of size N, then n’ = n 1+(n/N) • The initial sample size approached in the study may need to be increased in accordance with the expected response rate, loss to follow up, lack of compliance, and any other predicted reasons for loss of subjects • Design effect for complex cluster sampling common values multiply n by 2, 3, …5. Sample Size Adjustments • Separate sample size calculation should be done for each important outcome & then use the maximum estimate • Allowing for response rates & other losses to the sample – The expected response rate – Loss to follow up – Lack of compliance – Other losses Failure to Achieve Required Sample Size • Patient refusal to consent • Bad time of the study (heavy working time for participants) • Adverse media publicity • Weak recruiting staff • Lack of genuine commitment to the project • Lack of staffing in wards or units • Too many projects attempting to recruit the same subjects Possible Solutions • Pilot studies • Have a plan to regularly monitor recruitment or create recruitment targets • Ask for extension in time and/or funding • Review your staffs commitment to other ongoing trials or other distracters • Regular visits to field sites