This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this site. Copyright 2009, The Johns Hopkins University and John McGready. All rights reserved. Use of these materials permitted only in accordance with license rights granted. Materials provided “AS IS”; no representations or warranties provided. User assumes all responsibility for use, and all liability related thereto, and must independently review all materials for accuracy and efficacy. May contain materials owned by others. User is responsible for obtaining permissions for use from third parties as needed.

Comparing Means among Two (or More) Independent Populations
John McGready
Johns Hopkins University

Lecture Topics
- CIs for mean difference between two independent populations
- Two sample t-test
- Non-parametric alternative, Mann Whitney (FYI, optional)
- Comparing means amongst more than two independent populations: ANOVA

Section A
Two Sample t-test: The Resulting Confidence Interval

Comparing Two Independent Groups
“A Low Carbohydrate as Compared with a Low Fat Diet in Severe Obesity”*
- 132 severely obese subjects randomized to one of two diet groups
- Subjects followed for a six month period
At the end of study period
- “Subjects on the low-carbohydrate diet lost more weight than those on a low fat diet (95% confidence interval for the difference in weight loss between groups, -1.6 to -6.2 kg; p < .01)”
Source: * Samaha, F., et al. A low-carbohydrate as compared with a low-fat diet in severe obesity, New England Journal of Medicine, 348: 21.

Comparing Two Independent Groups: Diet Types Study
Scientific question
- Is weight change associated with diet type?
Diet Group | Low-Carb | Low-Fat
Number of subjects (n) | 64 | 68
Mean weight change (kg), post-diet less pre-diet | -5.7 | -1.8
Standard deviation of weight changes (kg) | 8.6 | 3.9

Diet Type and Weight Change
95% CIs for weight change by diet group (sample mean ± 2 standard errors)
- Low Carb: $-5.7 \pm 2 \times \frac{8.6}{\sqrt{64}} \approx (-7.85, -3.55)$ kg
- Low Fat: $-1.8 \pm 2 \times \frac{3.9}{\sqrt{68}} \approx (-2.75, -0.85)$ kg

Comparing Two Independent Groups: Diet Types Study
In statistical terms, is there a non-zero difference in the average weight change for the subjects on the low-fat diet as compared to subjects on the low-carbohydrate diet?
- The 95% CIs for each diet group's mean weight change do not overlap, but how do you quantify the difference?
The comparison of interest is not “paired”
- There are different subjects in each diet group
For each subject a change in weight (after-diet weight minus before-diet weight) was computed
- However, the authors compared the changes in weight between two independent groups!

Comparing Two Independent Groups
How do we calculate
- Confidence interval for difference?
- p-value to determine if the difference in two groups is “significant?”
Since we have large samples (both greater than 60) we know the sampling distributions of the sample means in both groups are approximately normal
It turns out that the difference of two quantities, each of which is (approximately) normally distributed, is also (approximately) normally distributed

Sampling Distribution: Difference in Sample Means
So, the big news is . . .
- The sampling distribution of the difference of two sample means, each based on large samples, approximates a normal distribution
- This sampling distribution is centered at the true mean difference, µ1 - µ2

Simulated Sampling Dist’n of Sample Mean Weight Loss
[Figure: simulated sampling distribution of sample mean weight change, low carbohydrate diet group]
[Figure: simulated sampling distribution of sample mean weight change, low fat diet group]
[Figure: side by side boxplots]

95% Confidence Interval for Difference in Means
Our most general formula: (best estimate) ± 2 × SE(best estimate)
The best estimate of a population mean difference based on sample means: $\bar{x}_1 - \bar{x}_2$
Here, $\bar{x}_1$ may represent the sample mean weight loss for the 64 subjects on the low carbohydrate diet, and $\bar{x}_2$ the mean weight loss for the 68 subjects on the low fat diet

95% CI for Difference in Means: Diet Types Study
So, the formula for the 95% CI for µ1 - µ2 is: $(\bar{x}_1 - \bar{x}_2) \pm 2\,SE(\bar{x}_1 - \bar{x}_2)$
where $SE(\bar{x}_1 - \bar{x}_2)$ = standard error of the difference of two sample means

Two Independent (Unpaired) Groups
The standard error of the difference for two independent samples is calculated differently than we did for paired designs
- With the paired design we reduced data on two samples to one set of differences between two groups
Statisticians have developed formulas for the standard error of the difference
These formulas depend on sample sizes in both groups and standard deviations in both groups
- The $SE(\bar{x}_1 - \bar{x}_2)$ is greater than either $SE(\bar{x}_1)$ or $SE(\bar{x}_2)$
- Why do you think this is?

Principle
Variation from independent sources can be added: $SE(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$
- Why do you think this is additive?
Of course, we don’t know σ1 and σ2, so we estimate with s1 and s2 to get an estimated standard error: $\widehat{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Comparing Two Independent Groups: Diet Types Study
Recall the data from the weight change/diet type study
Diet Group | Low-Carb | Low-Fat
Number of subjects (n) | 64 | 68
Mean weight change (kg), post-diet less pre-diet | -5.7 | -1.8
Standard deviation of weight changes (kg) | 8.6 | 3.9

95% CI for Difference in Means: Diet Types Study
So in this example, the estimated 95% CI for the true mean difference in weight change between the low-carbohydrate and low-fat diet groups is:
$(-5.7 - (-1.8)) \pm 2\sqrt{\frac{8.6^2}{64} + \frac{3.9^2}{68}} \approx -3.9 \pm 2(1.17) \approx (-6.2, -1.6)$ kg

From Article
“Subjects on the low-carbohydrate diet lost more weight than those on a low fat diet (95% confidence interval for the difference in weight loss between groups, -1.6 to -6.2 kg; p < .01)”
So those on the low carb diet lost more on average by 3.9 kg: after accounting for sampling variability this excess average loss over the low-fat diet group could be as small as 1.6 kg or as large as 6.2 kg
- This confidence interval does not include 0, suggesting a real population level association between type of diet (low-carb or low-fat) and weight loss

Section B
Two Sample t-test: Getting a p-value

Hypothesis Test to Compare Two Independent Groups
Two sample (unpaired) t-test
Is the (mean) weight change equal in the two diet groups?
- Ho: µ1 = µ2
- HA: µ1 ≠ µ2
In other words, is the expected difference in weight change zero?
- Ho: µ1 - µ2 = 0
- HA: µ1 - µ2 ≠ 0

Hypothesis Test to Compare Two Independent Groups
Recall, general “recipe” for hypothesis testing . . .
1. Start by assuming Ho true
2. Measure distance of sample result from µo (here again it's 0)
3.
Compare test statistic (distance) to appropriate distribution to get p-value

Diet Type and Weight Loss Study
In the diet types and weight loss study, recall: $\bar{x}_1 - \bar{x}_2 = -3.9$ kg and $\widehat{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{8.6^2}{64} + \frac{3.9^2}{68}} \approx 1.17$ kg
So in this study: $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\widehat{SE}(\bar{x}_1 - \bar{x}_2)} = \frac{-3.9}{1.17} \approx -3.3$
- So this study result was 3.3 standard errors below the null mean of 0 (i.e., 3.3 standard errors from the mean difference in weight loss expected if the null was true)

How Are p-values Calculated?
Is a result 3.3 standard errors below 0 unusual?
- It depends on what kind of distribution we are dealing with
The p-value is the probability of getting a test statistic as extreme as (or more extreme than) what you observed (-3.3) by chance if Ho was true
The p-value comes from the sampling distribution of the difference in two sample means
What is the sampling distribution of the difference in sample means?
- If both groups are large (more than 60 subjects) then this distribution is approximately normal
- This sampling distribution will be centered at the true difference
- Under the null hypothesis, this true difference is 0

Diet/Weight Loss Sample
To compute a p-value, we would need to compute the probability of being 3.3 or more standard errors away from 0 on a standard normal curve

How to Use Stata to Perform a 2-Sample T-Test
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2, unequal

Summary: Weight Loss Example
Statistical method
- “We randomly assigned 132 severely obese patients . . .
to a carbohydrate restricted (low-carbohydrate) diet or a calorie- and fat-restricted diet”
- “For comparison of continuous variables between the two groups, we calculated the change from baseline to six months in each subject, and compared the mean changes in the two diet groups using an unpaired t-test”
Result
- “Subjects on the low-carbohydrate diet lost more weight than those on a low fat diet (95% confidence interval for the difference in weight loss between groups, -1.6 to -6.2 kg; p < .01)”

Section C
Two Sample t-test, Approach with Smaller Samples

Sampling Distribution
What is the sampling distribution of the difference in sample means?
- If either (or both) sample sizes are less than 60, a t-distribution is used with n1 + n2 - 2 degrees of freedom: that is, the total sample size from both groups minus two

Two Sample t-test
Example
- In a randomized design, 23 patients with hyperlipidemia were randomized to take either Treatment A or Treatment B for 12 weeks
- 12 patients assigned to Treatment A
- 11 patients assigned to Treatment B
- LDL cholesterol levels (mmol/L) measured on each subject at baseline, and 12 weeks after start of study
- The 12-week change in LDL cholesterol was computed for each subject

Two Sample t-test
Summary of results:
Treatment Group | A | B
Number of subjects (n) | 12 | 11
Mean LDL change (mmol/L), post-trt less pre-trt | -1.41 | -0.32
Standard deviation of LDL changes (mmol/L) | 0.55 | 0.65

Two Sample t-test
Scientific question
- Is there a difference in LDL change between the two treatment groups?
Methods of inference
- Confidence interval for the difference in mean LDL cholesterol change between the two groups
- Statistical hypothesis test

95% Confidence Interval for Difference in Means
The general formula (large samples): $(\bar{x}_1 - \bar{x}_2) \pm 2\,\widehat{SE}(\bar{x}_1 - \bar{x}_2)$
The general formula (“smaller” samples): $(\bar{x}_1 - \bar{x}_2) \pm t_{.95,\,n_1+n_2-2}\,\widehat{SE}(\bar{x}_1 - \bar{x}_2)$

Two Sample t-test
Sample mean difference and estimated standard error:
$\bar{x}_1 - \bar{x}_2 = -1.41 - (-0.32) = -1.09$ mmol/L
$\widehat{SE}(\bar{x}_1 - \bar{x}_2) = \sqrt{\frac{0.55^2}{12} + \frac{0.65^2}{11}} \approx 0.25$ mmol/L
Treatment Group | A | B
Number of subjects (n) | 12 | 11
Mean LDL change (mmol/L), post-trt less pre-trt | -1.41 | -0.32
Standard deviation of LDL changes (mmol/L) | 0.55 | 0.65

95% CI for Difference in Means: Hyperlipidemia Ex
How many standard errors to add and subtract?
- Since sample sizes are small we will have to add and subtract slightly more than two standard errors
The number we need to add and subtract for 95% confidence comes from a t-distribution with 12 + 11 - 2 = 21 degrees of freedom
- From the t-table this value is 2.08
So, the 95% CI for the true mean difference in change in LDL cholesterol, drug A to drug B, is:
$-1.09 \pm 2.08 \times 0.25 \approx (-1.61, -0.57)$ mmol/L

Hypothesis Test to Compare Two Independent Groups
Two-sample (unpaired) t-test: getting a p-value
Is the change in LDL cholesterol the same in the two treatment groups?
- Ho: µ1 = µ2 → Ho: µ1 - µ2 = 0
- HA: µ1 ≠ µ2 → HA: µ1 - µ2 ≠ 0

Hypothesis Test to Compare Two Independent Groups
Recall, general “recipe” for hypothesis testing . . .
1. Start by assuming Ho true
2. Measure distance of sample result from µo (here again it's 0)
3. Compare test statistic (distance) to appropriate distribution to get p-value

Hyperlipidemia Example
In the hyperlipidemia study, recall: $\bar{x}_1 - \bar{x}_2 = -1.09$ mmol/L and $\widehat{SE}(\bar{x}_1 - \bar{x}_2) \approx 0.25$ mmol/L
So in this study: $t = \frac{-1.09 - 0}{0.25} \approx -4.4$
- So this study result was 4.4 standard errors below the null mean of 0 (i.e., 4.4 standard errors from the mean difference in cholesterol change between the two treatments expected if the null was true)

How Are p-values Calculated?
Is a result 4.4 standard errors below 0 unusual?
- It depends on what kind of distribution we are dealing with
The p-value is the probability of getting a test statistic (distance) as extreme as or more extreme than what you observed (-4.4) by chance if Ho was true
The p-value comes from the sampling distribution of the difference in two sample means
What is the sampling distribution of the difference in sample means?
- t-distribution with 12 + 11 - 2 = 21 degrees of freedom

Hyperlipidemia Example
To compute a p-value, we would need to compute the probability of being 4.4 or more standard errors away from 0 on a t-distribution with 21 degrees of freedom

Using Stata
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2, unequal

Summary: Hyperlipidemia Example
Statistical method
- Twenty-three patients with hyperlipidemia were randomly assigned to one of two treatment groups: Treatment A or Treatment B
- 12 patients were assigned to receive Treatment A
- 11 patients were assigned to receive Treatment B
- Baseline LDL cholesterol measurements were taken on each subject, and LDL was again measured after 12 weeks of treatment
- The change in LDL cholesterol was computed for each subject
- The mean LDL changes in the two treatment groups were compared using an unpaired t-test and a 95% confidence interval was constructed for the difference in mean LDL changes
Result
- Patients on treatment A showed a decrease in LDL cholesterol of 1.41 mmol/L and subjects on treatment B showed a decrease of .32 mmol/L (a difference of 1.09 mmol/L, 95% CI .57 to 1.61 mmol/L)
- The difference in LDL changes was statistically significant (p < .001)

Section D
Two Sample t-test, Two Choices

FYI: Equal Variances Assumption
The “traditional” t-test assumes equal variances in the two groups
- This can be formally tested with another hypothesis test!
- But why not just compare observed values of s1 to s2?
There is a slight modification to allow for unequal variances: this modification adjusts the degrees of freedom for the test, using a slightly different SE computation (the formula I give you)
If you want to be truly “safe” (desert island choice of t-test)
- It is more conservative to use the test that allows for unequal variances
It makes little to no difference in large samples

FYI: Equal Variances Assumption
Actually, the following occurs:
- If underlying population level standard deviations are equal: both approaches give valid confidence intervals, but intervals from the approach assuming unequal standard deviations are slightly wider (and p-values slightly larger)
- If underlying population level standard deviations are not equal: the approach assuming equal variances does not give valid confidence intervals and can severely under-cover the goal of 95%

Unequal SD Approach: Diet Type/Weight Loss Example
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2, unequal

Equal SD Approach: Diet Type/Weight Loss Example
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2

Unequal SD Approach: LDL/Treatment Example
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2, unequal

Equal SD Approach: LDL/Treatment Example
Command syntax:
- ttesti n1 mean1 sd1 n2 mean2 sd2

Section E
The Unpaired t-test: More Examples

Example 1: CE Costs in Maryland
Random sample of 500 Carotid Endarterectomy (CE) procedures performed in State of Maryland, 1995
Some results:
Patient Sex | Males | Females
Mean Charges (U.S. $) | 6,615 | 7,088
SD (U.S. $) | 4,220 | 4,908
N | 271 | 229

Example 1: Boxplots!
We actually have the luxury of individual level data here

Example 1
95% CIs for 1995 CE costs by patient sex
- Females: $6,440 to $7,736
- Males: $6,103 to $7,127

Example 1
Two sample t-test, unequal standard deviations assumption

Example 1: Summary
In a study conducted to assess determinants of CE procedure costs in Maryland, a random sample of 500 CE patients from 1995 was analyzed
This consisted of 229 females with average costs of $7,088 (95% CI: 6,440 to 7,736), and 271 males with average costs of $6,615 (95% CI: 6,103 to 7,127)
While the females in the sample had average costs $473 greater than the males in the sample, this difference in average costs is not statistically significant (p = .25)
- The 95% CI for the female to male average cost differential is -$339 to $1,285

Example 2
The following data is taken from a 1990 study comparing (random samples of) adolescents with bulimia to adolescents without bulimia; both groups had similar body composition and levels of physical activity*
The following table shows summary data on daily calorie intake by bulimia status
Group | Bulimia | No Bulimia
Mean Daily Caloric Intake (kcal/kg) | 22.1 | 29.7
SD (kcal/kg) | 4.6 | 6.5
N | 23 | 15
Source: * Example based on data taken from Pagano, M., Gauvreau, K. (2000). Principles of biostatistics, 2nd ed. Duxbury Press (based on research by Gwirtsman, et al. (1989). Decreased calorie intake. American Journal of Clinical Nutrition, 49).
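The interval and test statistic used throughout these examples need only the group means, standard deviations, and sample sizes. The following is a minimal Python sketch, not the Stata procedure from the slides: the function name `two_sample_summary` and its `multiplier` argument are illustrative assumptions, the default multiplier 1.96 is the usual normal-curve value (the slides round it to 2), and the p-value uses the standard normal curve, so it is only appropriate when both samples are large.

```python
from math import sqrt
from statistics import NormalDist


def two_sample_summary(mean1, sd1, n1, mean2, sd2, n2, multiplier=1.96):
    """Unequal-variance comparison of two independent means from summary stats.

    `multiplier` is the number of standard errors used for the interval:
    about 2 for large samples, or a t-table value for small samples.
    (Hypothetical helper for illustration; not part of Stata or the lecture.)
    """
    diff = mean1 - mean2
    se = sqrt(sd1**2 / n1 + sd2**2 / n2)  # estimated SE of the difference
    ci = (diff - multiplier * se, diff + multiplier * se)
    distance = diff / se                  # standard errors away from the null value 0
    # Two-sided p-value from the standard normal curve (large samples only)
    p_normal = 2 * (1 - NormalDist().cdf(abs(distance)))
    return diff, se, ci, distance, p_normal


# Example 1: CE charges, females versus males
ce = two_sample_summary(7088, 4908, 229, 6615, 4220, 271)

# Hyperlipidemia example: small samples, so use the t-table multiplier 2.08 (21 df)
ldl = two_sample_summary(-1.41, 0.55, 12, -0.32, 0.65, 11, multiplier=2.08)

for name, (diff, se, ci, distance, _) in {"CE costs": ce, "LDL change": ldl}.items():
    print(f"{name}: diff={diff:.2f}, SE={se:.2f}, "
          f"95% CI=({ci[0]:.2f}, {ci[1]:.2f}), distance={distance:.2f}")
print(f"CE costs, normal-curve p-value: {ce[4]:.2f}")
```

For the CE data this gives a difference of $473 and a p-value near .25, in line with the summary above; the interval differs from the reported one by a dollar or two because Stata uses a t-based multiplier. For the LDL data the unrounded standard error gives a distance slightly smaller than the 4.4 obtained with the rounded value 0.25, and the normal-curve p-value should be ignored there because the samples are small.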
Example 2
Abstract from article:

Example 2: Boxplots
Again, luxury of individual level data:

Example 2
95% CIs for average daily calorie intake by bulimia status
- Bulimia:
- No bulimia:

Example 2 in Stata
Two sample t-test, unequal standard deviations assumption:

Summary
From the article:

Section F (Optional)
Non-Parametric Analogue to the Two Sample t-test

Alternative to the Two Sample T-Test
“Non-parametric” refers to a class of tests that do not assume anything about the distribution of the data
Nonparametric test for comparing two groups
- Mann-Whitney Rank Sum Test (Wilcoxon Rank Sum Test)
- Also called Mann-Whitney-Wilcoxon (a mouthful)
Tries to answer the following question:
- Are the two population distributions different?

Advantages
Does not assume populations being compared are normally distributed
- The two-sample t-test requires that assumption with very small sample sizes
Uses only ranks
Not sensitive to outliers

Disadvantage of the Nonparametric Test
Nonparametric methods are often less sensitive (powerful) for finding true differences because they throw away information (they use only ranks)
Need full data set, not just summary statistics
Results do not include any confidence intervals quantifying the range of possibility for the true difference between populations

Example: Health Education Study
Evaluate an intervention to educate high school students about health and lifestyle over a two-month period
10 students randomized to “intervention” or “control” group
x = post-test score minus pre-test score is the outcome to compare between the intervention and control groups

Example: Health Education Study
x = post-test minus pre-test score for both groups
Intervention (I) | Control (C)
5 | 6
0 | -5
7 | -6
2 | 1
19 | 4
- Only five individuals in each sample!!!
- We want to compare the control and intervention groups to assess whether the “improvements” (post minus pre) in scores are different, taking random sampling error into account

Example: Health Education Study
With such a small sample size, we need to be sure score improvements are normally distributed if we want to use the t-test (BIG assumption)
Possible approach:
- Mann-Whitney-Wilcoxon non-parametric test!

Example: Health Education Study
First step: rank the pooled data (ignore groupings)
x | -6 | -5 | 0 | 1 | 2 | 4 | 5 | 6 | 7 | 19
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10

Example: Health Education Study
Second step: “reattach” group status
x | -6 | -5 | 0 | 1 | 2 | 4 | 5 | 6 | 7 | 19
Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Group | C | C | I | C | I | C | I | C | I | I

Example: Health Education Study
Find the average rank in each of the two groups
- Intervention group average rank: $\frac{3 + 5 + 7 + 9 + 10}{5} = 6.8$
- Control group average rank: $\frac{1 + 2 + 4 + 6 + 8}{5} = 4.2$

Example: Health Education Study
Statisticians have developed formulas and tables to determine the probability of observing such an extreme discrepancy in ranks (6.8 vs. 4.2) by chance alone
- This is the p-value
In the health education study, the p-value was .17
- The interpretation is that the Mann-Whitney test did not show any significant difference in test score “improvement” between the intervention and control group (p = .17)

Notes
The two-sample t-test would give a different answer (p = .14)
Different statistical procedures can give different p-values
If the largest observation, 19, were changed to another value that is still the largest, the p-value based on the Mann-Whitney test would not change but the p-value from the two-sample t-test would change

Notes
The t-test or the nonparametric test?
- Statisticians will not always agree, but there are some guidelines
- Use the non-parametric test if the sample size is small and you have no reason to believe the data is “well behaved” (normally distributed)
- Use the non-parametric test if only “ranks” are available

Using Stata to Perform Mann-Whitney-Wilcoxon
Data, as entered

Using Stata to Perform Mann-Whitney-Wilcoxon
“ranksum” command
- Syntax: ranksum varname, by(group_var)

Using Stata to Perform t-test
“ttest” command, without the “i” on the end, when the data is already in Stata
- Syntax: ttest varname, by(group_var)

Summary: Educational Intervention Example
Statistical methods
- 10 high school students were randomized to receive either a two-month health and lifestyle education program or no program
- Each student was administered a test regarding health and lifestyle issues prior to randomization and again after the two-month period
- Differences in the two test scores (after minus before) were computed for each student
- Mean and median test score changes were computed for each of the two study groups
- A Mann-Whitney rank sum test was used to determine if there was a statistically significant difference in test score change between the intervention and control groups at the end of the two-month study period
Result
- Participants randomized to the educational intervention scored a median five points higher on the test given at the end of the two-month study period, as compared to the test administered prior to the intervention
- Participants randomized to receive no educational intervention scored a median one point higher on the test given at the end of the two-month study period
- The difference in test score improvements between the intervention and control groups was not statistically significant (p = .17)

Section G
Comparing Means between
More than Two Independent Populations

Motivating Example
Suppose you are interested in the relationship between smoking and mid-expiratory flow (FEF), a measure of pulmonary health
Suppose you recruit study subjects and classify them into one of six smoking categories
- Nonsmokers (NS)
- Passive smokers (PS)
- Non-inhaling smokers (NI)
- Light smokers (LS)
- Moderate smokers (MS)
- Heavy smokers (HS)

Motivating Example
You are interested in whether differences exist in mean FEF amongst the six groups
Main outcome variable is mid-expiratory flow (FEF) in liters per second

Motivating Example
One strategy is to perform lots of two-sample t-tests (one for each possible two-group comparison)
In this example, there would be 15 comparisons you would need to do!
- NS to PS, NS to NI, and so on . . .

Motivating Example
It would be nice to have one “catch-all” test
- Something which would tell you whether there were any differences amongst the six groups
- If so, you could then do group to group comparisons to look for specific group differences

Extension of the Two-Sample t-Test
Analysis of variance (One-Way ANOVA)
- The t-test compares means in two populations
- ANOVA compares means amongst more than two populations with one test
The p-value from ANOVA helps answer the question
- “Are there any differences in the means among the populations?”

Extension of the Two-Sample t-Test
General idea behind ANOVA, comparing means for k groups (k > 2):
- Ho: µ1 = µ2 = . . . = µk
- HA: At least one mean different

Example
Smoking and FEF (Forced Mid-Expiratory Flow Rate)*
- A sample of over 3,000 persons was classified into one of six smoking categorizations based on responses to smoking related questions
Source: * White, J.R., Froeb, H.F. (1980). Small-airways dysfunction in non-smokers chronically exposed to tobacco smoke, New England Journal of Medicine 302: 13.
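The count of 15 two-group comparisons quoted for the six smoking categories is the number of ways to choose 2 groups out of 6. A small Python sketch that simply lists them (the group abbreviations are those used in the slides; nothing else is assumed):

```python
from itertools import combinations

groups = ["NS", "PS", "NI", "LS", "MS", "HS"]

# Every unordered pair of groups corresponds to one two-sample t-test
pairs = list(combinations(groups, 2))

print(len(pairs))  # 6 choose 2 = 15
for first, second in pairs:
    print(f"{first} to {second}")
```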
Example 1
- Nonsmokers (NS)
- Passive smokers (PS)
- Non-inhaling smokers (NI)
- Light smokers (LS)
- Moderate smokers (MS)
- Heavy smokers (HS)

Example 1
Smoking and FEF
- From each smoking group, a random sample of 200 men was drawn (except for the non-inhalers, as there were only 50 male non-inhalers in the entire sample of 3,000)
- FEF measurements were taken on each of the subjects

Example 1: Table
Data summary
Group | Mean FEF (L/s) | SD FEF (L/s) | n
NS | 3.78 | 0.79 | 200
PS | 3.30 | 0.77 | 200
NI | 3.32 | 0.86 | 50
LS | 3.23 | 0.78 | 200
MS | 2.73 | 0.81 | 200
HS | 2.59 | 0.82 | 200
Based on a one-way analysis of variance, there are statistically significant differences in FEF levels among the six smoking groups (p < .001)

What’s the Rationale behind Analysis of Variance?
The variation in the sample means between groups is compared to the variation within a group
If the between group variation is a lot bigger than the within group variation, that suggests there are some differences among the populations

Analysis of Variance

Summary: Smoking and FEF
Statistical methods
- 200 men were randomly selected from each of five smoking classification groups (non-smokers, passive smokers, light smokers, moderate smokers, and heavy smokers), as well as 50 men classified as non-inhaling smokers, for a study designed to analyze the relationship between smoking and respiratory function
- Analysis of variance was used to test for any differences in FEF levels amongst the six groups of men
- Individual group comparisons were performed with a series of two sample t-tests, and 95% confidence intervals were constructed for the mean difference in FEF between each combination of groups

Summary:
Smoking and FEF
Results
- Analysis of variance showed statistically significant (p < .001) differences in FEF between the six groups of smokers
- Non-smokers had the highest mean FEF value, 3.78 L/s, and this was statistically significantly larger than the means of the five other smoking classification groups
- The mean FEF value for non-smokers was 1.19 L/s higher than the mean FEF for heavy smokers (95% CI 1.03 to 1.35 L/s), the largest mean difference between any two smoking groups
- Confidence intervals for all smoking group FEF comparisons are in Table 1

Example 2
FEV1 and three medical centers*
- Data was collected on 63 patients with coronary artery disease at 3 different medical centers (Johns Hopkins, Rancho Los Amigos Medical Center, St. Louis University School of Medicine)
- The purpose of the study was to investigate effects of carbon monoxide exposure on these patients
- Prior to analyzing CO effects data, researchers wished to compare the respiratory health of these patients across the three medical centers
Source: * Pagano, M., Gauvreau, K. (2000). Principles of biostatistics. Duxbury Press.

Example 2
Snippet of data in Stata

Boxplots
FEV1 values by center

Example 2
ANOVA with Stata
- Syntax: oneway outcome_var group_var

Example 2
FEV1 and 3 medical centers
95% CIs for FEV1 by medical center
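The between-group versus within-group rationale for ANOVA can be sketched directly from the smoking and FEF summary table (group means, SDs, and sample sizes). This is an illustrative Python sketch using the standard one-way ANOVA sums of squares, not the Stata `oneway` output from the slides; the function name `anova_from_summary` is hypothetical, and the sketch stops at the F statistic because converting it to a p-value requires an F distribution, which the Python standard library does not provide.

```python
# (group, mean FEF in L/s, SD of FEF in L/s, n) from the smoking and FEF table
fef = [
    ("NS", 3.78, 0.79, 200),
    ("PS", 3.30, 0.77, 200),
    ("NI", 3.32, 0.86, 50),
    ("LS", 3.23, 0.78, 200),
    ("MS", 2.73, 0.81, 200),
    ("HS", 2.59, 0.82, 200),
]


def anova_from_summary(groups):
    """One-way ANOVA F statistic computed from per-group summary statistics."""
    k = len(groups)
    total_n = sum(n for _, _, _, n in groups)
    grand_mean = sum(n * mean for _, mean, _, n in groups) / total_n

    # Between-group variation: how far each group mean sits from the grand mean
    ss_between = sum(n * (mean - grand_mean) ** 2 for _, mean, _, n in groups)
    # Within-group variation: pooled spread of individuals around their own group mean
    ss_within = sum((n - 1) * sd**2 for _, _, sd, n in groups)

    df_between, df_within = k - 1, total_n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


f_stat, df_between, df_within = anova_from_summary(fef)
print(f"F({df_between}, {df_within}) = {f_stat:.1f}")
```

An F statistic far above 1 means the group means vary much more than within-group variation alone would suggest, which is consistent with the reported p < .001.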