Introduction to Biostatistics for Clinical Researchers University of Kansas Department of Biostatistics & University of Kansas Medical Center Department of Internal Medicine Schedule Friday, December 10 in 1023 Orr-Major Friday, December 17 in B018 School of Nursing Possibility of a 5th lecture, TBD All lectures will be held from 8:30a - 10:30a Materials PowerPoint files can be downloaded from the Department of Biostatistics website at http://biostatistics.kumc.edu A link to the recorded lectures will be posted in the same location An Introduction to Hypothesis Testing: The Paired t-Test Topics Comparing two groups: the paired-data situation Hypothesis testing: the Null and Alternative hypotheses Relationships between confidence intervals and hypothesis testing when comparing means P-values: definitions, calculations, and more The Paired t-test: the Confidence Interval Component Two Group Designs For a continuous endpoint: Are the population means different? Subjects could be randomized to one of two treatments (randomized parallel-group design) Compare the mean responses from each treatment Also referred to as independent groups Subjects could each be given both treatments with the ordering of treatments randomized (paired design) Compare the mean difference to zero (or some other interesting value) Pre-post data Matched case-control Example: Pre- versus Post- Data Why pair the observations?
Decrease variability in response Each subject acts as her own control (reduced sample sizes) Good way to get preliminary data/estimates to develop further research Example: Pre- versus Post- Data Ten non-pregnant, pre-menopausal women 16-49 years old who were beginning a regimen of oral contraceptive (OC) use had their blood pressures measured prior to starting OC use and three months after consistent OC use The goal of this small study was to see what, if any, changes in average blood pressure were associated with OC use in such women The table below shows the resulting pre- and post-OC use systolic BP measurements for the 10 women in the study Blood Pressure and OC Use

Subject | Before OC | After OC | Δ = After - Before
1       | 115       | 128      | 13
2       | 112       | 115      | 3
3       | 107       | 106      | -1
4       | 119       | 128      | 9
5       | 115       | 122      | 7
6       | 138       | 145      | 7
7       | 126       | 132      | 6
8       | 105       | 109      | 4
9       | 104       | 102      | -2
10      | 115       | 117      | 2
Average | 115.6     | 120.4    | 4.8

Blood Pressure and OC Use The sample average of the differences is x̄_diff = 4.8 Also note: x̄_diff = x̄_after - x̄_before The sample standard deviation of the differences is s_diff = 4.6 The standard deviation of the differences follows the formula s_diff = √( Σ_{i=1}^{n} (x_i,diff - x̄_diff)² / (n - 1) ), where each x_i,diff represents an individual difference and x̄_diff is the mean difference Note on Paired Data Designs The BP information is essentially reduced from two samples (prior to and after OC use) into one piece of information: the difference in BP between the two samples The response is "within-subject" This is standard protocol for comparing paired samples with a continuous outcome measure The Confidence Interval Approach Suppose we want to draw a conclusion about a population parameter: In a population of women who use OC, is the average change in blood pressure (after - before) zero?
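The per-subject differences, their mean, and their standard deviation can be reproduced with Python's standard library (a minimal sketch of the table's arithmetic; variable names are mine):

```python
from statistics import mean, stdev

# Pre/post systolic BP (mmHg) for the 10 women in the OC example
before = [115, 112, 107, 119, 115, 138, 126, 105, 104, 115]
after  = [128, 115, 106, 128, 122, 145, 132, 109, 102, 117]

# Per-subject differences (after - before): the paired design reduces
# two samples to one set of within-subject differences
diffs = [a - b for a, b in zip(after, before)]

x_bar_diff = mean(diffs)   # sample mean of the differences
s_diff = stdev(diffs)      # sample SD, with n - 1 in the denominator
```

Running this recovers x̄_diff = 4.8 and s_diff ≈ 4.6, the values used throughout the example.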
The CI approach allows us to create a range of plausible values for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample The Confidence Interval Approach A 95% CI for μΔ in BP in the population of women taking OC is x̄_diff ± t_0.95,9 × SE(x̄_diff) = x̄_diff ± t_0.95,9 × s_diff/√10 = 4.8 ± 2.26 × 4.6/√10 = (1.5, 8.1) mmHg Note The number 0 is NOT in the confidence interval (1.5-8.1) This suggests there is a non-zero change in BP over time The phrase "statistically significant" change is used to indicate a non-zero mean change Note The BP change could be due to factors other than OC Change in weather over the pre- and post- period Changes in personal stress A control group of comparable women who were not taking OC would strengthen this study This is an example of a pilot study: a small study done just to generate some evidence of a possible association This can be followed up with a larger, more scientifically rigorous study The Paired t-test: the Hypothesis Testing Component The Hypothesis Testing Approach Suppose we want to draw a conclusion about a population parameter: In a population of women who use OC, is the average change in blood pressure (after - before) zero?
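The confidence interval just computed can be checked with SciPy, where `t.ppf` supplies the 2.26 multiplier from the t distribution with 9 degrees of freedom (a sketch; variable names are mine):

```python
from math import sqrt
from scipy import stats

n = 10
x_bar_diff = 4.8      # sample mean difference (mmHg), from the table
s_diff = 4.6          # sample SD of the differences (mmHg)

se = s_diff / sqrt(n)                    # standard error of the mean difference
t_mult = stats.t.ppf(0.975, df=n - 1)    # ~2.26: 95% two-sided t multiplier, 9 df
ci = (x_bar_diff - t_mult * se, x_bar_diff + t_mult * se)
```

The interval comes out to about (1.5, 8.1) mmHg, matching the slide.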
The hypothesis testing approach allows us to choose between two competing possibilities for the average change (μΔ) in blood pressure using data from a single, imperfect, paired sample Hypothesis Testing Two mutually exclusive, collectively exhaustive possibilities for “truth” about mean change, μΔ Null hypothesis: HO: μΔ = 0 (what we wish to ‘nullify’) Alternative hypothesis: HA : μΔ ≠ 0 (what we wish to show evidence in favor of) We use our data as ‘evidence’ in favor of or against the null hypothesis (and alternative hypothesis, as a result) Hypothesis Testing Null: Typically represents the hypothesis that there is no association or difference HO: μΔ = 0 There is no association between OC use and blood pressure Alternative: The very general complement to the null HA: μΔ ≠ 0 There is an association between OC use and blood pressure Hypothesis Testing Our result will allow us to either reject or fail to reject HO We start by assuming HO is true, and ask: How likely, given HO is true, is the result we got from our sample? In other words, what are the chances of obtaining the sample data we actually observed (“evidence”) if the truth is that there is no association between blood pressure and OC use? 
Hypothesis Testing HO, in combination with other information about our population and the size of our sample, sets up (via the CLT) a theoretical probability distribution of sample means computed from all possible samples of size n = 10 where µ∆ = 0 [figure: population distribution of BP change, spread σ∆, centered at µ∆ = 0] Hypothesis Testing The same setup also gives us [figure: sampling distribution of the sample mean, spread SE(x̄_diff), centered at µ∆ = 0] Hypothesis Testing Theoretically, if the null hypothesis were true we would be more likely to observe values of the sample mean "close" to zero Hypothesis Testing Theoretically, if the null hypothesis were true it would be unlikely that we should observe values of the sample mean "far" from zero Hypothesis Testing We observed a sample mean of x̄_diff = 4.8 mmHg: is it far enough from zero for us to conclude in favor of HA? [figure: sampling distribution with the central region favoring HO and the two tails favoring HA] Hypothesis Testing We need some measure of how probable the result from our sample is given the null hypothesis The sampling distribution of the sample mean allows us to evaluate how unusual our sample statistic is by computing a probability corresponding to the observed results: the p-value If p is small, the observed result is unlikely to have arisen from the hypothesized distribution by chance alone. In other words, either 1. The null hypothesis is actually true and, just by chance, we got a sample that gave us an unlikely result; or 2. The null hypothesis is actually false, and we got a sample with evidence of such Hypothesis Testing 1.
The null hypothesis is actually false, and we got a sample with evidence of such If we are using a random sample, we can be assured that this is the case (95% confident, in fact) Hypothesis Testing To compute a p-value, we need to find our value of x̄_diff on the sampling distribution and figure out how "unusual" it is Recall: x̄_diff = 4.8 mmHg [figure: sampling distribution, spread SE(x̄_diff), centered at µ∆ = 0] Hypothesis Testing Problem: What is σ∆? SE(x̄_diff) = σ∆/√n, but σ∆ is unknown Hypothesis Testing Solution: the Student's t distribution, using the estimated standard error SE(x̄_diff) = s/√n Hypothesis Testing Where is x̄_diff = 4.8 mmHg located on the sampling distribution (t9)? SE(x̄_diff) = 4.6/√10 ≈ 1.45, so x̄_diff = 4.8 sits well above µ∆ = 0 Hypothesis Testing The p-value is the probability of getting a sample result as (or more) extreme than what we observed, given the null hypothesis is true: values at or beyond ±4.8 around µ∆ = 0 Hypothesis Testing The p-value is the area under the curve corresponding to values of the sample mean more extreme than 4.8: P(|x̄_diff| ≥ 4.8) Hypothesis Testing Strictly for convenience, we standardize our distribution We center it at zero by subtracting the mean We adjust the variability to correspond to s = 1 by dividing every observation by SE(x̄_diff): t = (x̄_diff - µ_O)/(s/√n) = 4.8/1.45 ≈ 3.3 Hypothesis Testing The p-value is the area under the curve corresponding to values of the sample mean more extreme than 4.8: P(|x̄_diff| ≥ 4.8) = P(|t| ≥ 3.3), easily found in any t table Hypothesis Testing Note: this t is called a test statistic (and is analogous to a z-score) It represents the distance of the observation from the hypothesized mean in standard errors In this case, our mean (4.8) is 3.3 SE away from the hypothesized mean (0) Based on this, what do you think the p-value will look like? Is a result 3.3 SE above its mean unusual?
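The standardization above, and the two-tailed p-value it leads to, can be sketched in Python; the `2 * stats.t.sf(...)` call computes the same two-tailed area as Excel's `=TDIST(3.3, 9, 2)` (variable names are mine):

```python
from math import sqrt
from scipy import stats

n = 10
x_bar_diff = 4.8   # observed mean difference (mmHg)
s_diff = 4.6       # sample SD of the differences (mmHg)

se = s_diff / sqrt(n)        # ~1.45 mmHg
t_stat = x_bar_diff / se     # distance from the null mean (0) in SEs, ~3.3

# Two-tailed p-value from a t distribution with n - 1 = 9 df
p = 2 * stats.t.sf(t_stat, df=n - 1)
```

This reproduces t ≈ 3.3 and p ≈ 0.009, the values discussed next.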
Hypothesis Testing t = 3.3 on the sampling distribution (t9) The p-value The p-value is the probability of getting a sample result as (or more) extreme than what we observed, given the null hypothesis is true The p-value We can look this up in a t-table . . . or we can let Excel or another statistical package do it for us: =TDIST(3.3,9,2) Interpreting the p-value P = 0.0092: if the true before OC/after OC blood pressure difference is zero among all women taking OCs, then the chance of seeing a mean difference as extreme or more extreme than 4.8 in a sample of 10 women is 0.0092 We now need to use the p-value to make a decision: either reject or fail to reject HO We need to decide if our sample result is unlikely enough to have occurred by chance if the null was true Using the p-value to Make a Decision Establishing a cutoff In general, to make a decision about what p-value constitutes "unusual" results, there needs to be a cutoff such that all p-values less than the cutoff result in rejection of the null The standard (but arbitrary) cutoff is p = 0.05 This cutoff is referred to as the significance level of the test and is usually represented by α For example, α = 0.05 Using the p-value to Make a Decision Establishing a cutoff Frequently, the result of a hypothesis test with p < 0.05 is called statistically significant At the α = 0.05 level, we have a statistically significant blood pressure difference in the BP/OC example Example: BP/OC Statistical method The changes in blood pressures after oral contraceptive use were calculated for 10 women A paired t-test was used to determine if there was a statistically significant change in blood pressure, and a 95% confidence interval was calculated for the mean blood pressure change (after - before) Result Blood pressure measurements increased on average 4.8 mmHg with standard deviation 4.6 mmHg The
95% confidence interval for the mean change was 1.5-8.1 mmHg The blood pressure measurements after OC use were statistically significantly higher than before OC use (p = 0.009) Example: BP/OC Discussion A limitation of this study is that there was no comparison group of women who did not use oral contraceptives We do not know if blood pressures may have risen without oral contraceptive usage Example: Clinical Agreement Two different physicians assessed the number of palpable lymph nodes in 65 randomly selected male sexual contacts of men with AIDS or AIDS-related conditions¹

   | Doctor 1 | Doctor 2 | Difference
x̄ | 7.91     | 5.16     | -2.75
s  | 4.35     | 3.93     | 2.83

¹Example based on data taken from Rosner, B. (2005). Fundamentals of Biostatistics (6th ed.), Duxbury Press

95% Confidence Interval A 95% CI for the difference in mean number of lymph nodes (Doctor 2 compared to Doctor 1): x̄_diff ± 1.99 × SE(x̄_diff) = -2.75 ± 1.99 × 2.83/√65 = (-3.45, -2.05) Getting a p-value Hypotheses: HO: µdiff = 0 HA: µdiff ≠ 0 1. Assume the null is true 2. Compute the distance in SEs between x̄_diff and the hypothesized value (zero): t = (x̄_diff - µ_O)/(s/√n) = -2.75/(2.83/√65) ≈ -7.8 3. The sample result is 7.8 SEs below 0; is this unusual? Getting a p-value Sample result is 7.8 SEs below 0; is this unusual? See where this result falls on the sampling distribution (t64) The p-value corresponds to P(|t| > 7.8); without looking it up we know p < 0.001 Example: Oat Bran and LDL Cholesterol Cereal and cholesterol: 14 males with high cholesterol given oat bran cereal as part of their diet for two weeks, and corn flakes cereal as part of their diet for two weeks¹

mmol/dL | Corn Flakes | Oat Bran | Difference
x̄      | 4.44        | 4.08     | 0.36
s       | 1.0         | 1.1      | 0.40

¹Example based on data taken from Pagano, M. (2000). Principles of Biostatistics (2nd ed.), Duxbury Press

95% Confidence Interval A 95% confidence interval for the difference in mean LDL (corn flakes versus oat bran): x̄_diff ± t_0.95,13 × SE(x̄_diff) = 0.36 ± 2.16 × 0.40/√14 = (0.13, 0.59) Getting a p-value Hypotheses: HO: µdiff = 0 HA: µdiff ≠ 0 1.
Assume the null is true 2. Compute the distance in SEs between x̄_diff and the hypothesized value (zero): t = (x̄_diff - µ_O)/(s/√n) = 0.36/(0.40/√14) ≈ 3.3 3. The sample result is 3.3 SEs above 0; is this unusual? Getting a p-value Sample result is 3.3 SEs above 0; is this unusual? See where this result falls on the sampling distribution (t13) The p-value corresponds to P(|t| > 3.3) Using a table or software package, we can find p = 0.005 Note on Direction of Comparison Whether we choose to examine the difference (oat - corn) or (corn - oat) makes no difference to our results, only to the appropriate interpretation of estimates (including confidence intervals) The sign of the mean will change The limits of the CI will reverse and signs will change Summary Designate hypotheses The alternative is usually what we are interested in supporting The null is usually what we wish to nullify ("no association/no change") Collect data Compute the difference in outcome for each paired set of observations Compute x̄_diff, the sample mean of the paired differences Compute s, the sample standard deviation of the differences Compute the 95% (or other %) CI for the true mean difference: x̄_diff ± t_1-α,n-1 × SE(x̄_diff) Summary To get the p-value Assume HO is true (sets up the sampling distribution) Measure the distance of the sample result from µ_O: t = (x̄_diff - µ_O)/(s/√n) Summary Compare the test statistic (distance) to the appropriate distribution to get the p-value Summary Paired t-test scenarios Blood pressure/OC use example Degree of clinical agreement (each patient received two assessments) Diet example (each man received two different diets in random order) Twin study Matched case-control Suppose we wish to compare levels of a certain biomarker in patients with versus without a disease More about P-values P-values P-values are probabilities Have to be between 0 and 1 Small p-values mean that the sample results are unlikely when the null is true The p-value is the probability of obtaining a result as extreme or more extreme than what
was actually observed, by chance alone, assuming the null is true P-values The p-value is not the probability that the null hypothesis is true It alone imparts no information about the scientific/substantive content in the result of a study From the previous example, the researchers found a statistically significant (p = 0.005) difference in average LDL cholesterol levels in men who had been on a diet including corn flakes versus the same men on a diet including oat bran cereal Which diet showed lower average LDL levels? How much was the difference? Does it mean anything nutritionally? P-values If the p-value is small, then either 1. a rare event occurred and HO is true; or 2. HO is false Type I Error Claim HA is true when in fact HO is true The probability of making a Type I error is called the significance level of a test (α) Note on p and α If p < α the result is called statistically significant This cutoff is the significance (or alpha) level of the test It is the probability of falsely rejecting HO The idea is to keep the chances of making an error when HO is true low and only reject if the sample evidence is highly against HO Note on p and α

Decision          | Truth: HO        | Truth: HA
Reject HO         | Type I Error (α) | Power (1 - β)
Fail to Reject HO | Correct (1 - α)  | Type II Error (β)

One- or Two-sided?
A two-sided p-value corresponds to results as or more extreme than what was observed in either direction We know a test is two-sided by observing the alternative hypothesis, HA HA containing ≠ indicates we are interested in either an increase or decrease in (greater or smaller) value than the hypothesized value A one-sided p-value corresponds to results as or more extreme than what was observed in a single direction of interest The direction of interest is stated explicitly in the alternative hypothesis, HA HA containing > indicates we are interested in an increase or greater value than the hypothesized value HA containing < indicates we are interested in a decrease or smaller value than the hypothesized value Null and Alternative Sampling Distributions [figures: null and alternative sampling distributions side by side; assuming H0 is true, the area α beyond the critical value zα leads to rejecting H0 (concluding a difference) and the remaining area 1 - α to failing to reject; assuming H1 is true, the area 1 - β beyond zα is the power and the area β is the Type II error rate] One- or Two-sided? In some cases, a one-sided alternative may not make scientific sense In the absence of pre-existing information for the evaluation of the relationship between BP and OC, wouldn't either result be interesting and useful (i.e., negative or positive association)?
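The meaning of α as a long-run false-rejection rate can be illustrated with a small simulation (my own sketch, not from the lecture; it assumes normally distributed differences, borrows the 4.6 mmHg SD from the BP/OC example, and uses SciPy's one-sample t-test):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
alpha, n_sims, n = 0.05, 2000, 10

# Draw paired differences from a population where HO really is true (mean 0)
rejections = 0
for _ in range(n_sims):
    sample = rng.normal(loc=0.0, scale=4.6, size=n)
    if stats.ttest_1samp(sample, popmean=0.0).pvalue < alpha:
        rejections += 1

type_i_rate = rejections / n_sims  # long-run false-rejection rate, near alpha
```

Across many simulated studies under a true null, about 5% reject at α = 0.05: those rejections are the Type I errors in the table above.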
In other cases, a one-sided alternative makes scientific sense We are not interested in whether the new treatment is worse than the old, just whether it's statistically significantly better However, for reasons already shown (and because of the sanctity of ".05"), one-sided p-values are viewed with suspicion Connection: Hypothesis Testing and CIs The confidence interval gives plausible values for the population parameter "Data, take me to the truth" Hypothesis testing postulates two choices for the population parameter "Here are two mutually exclusive possibilities for the truth; data, help me choose one" 95% Confidence Interval If zero is not in the 95% confidence interval, then we would reject HO: µ = 0 at the 5% level of significance (α = 0.05) Why? With a confidence interval, we start at the sample mean and go approximately two standard errors in either direction
95% Confidence Interval If zero is not in the 95% CI, then this must mean x̄_diff is more than ~2 SE away from zero (either above or below) Hence, the distance (t) will be either > 2 or < -2, and the resulting p-value will be < 0.05 95% Confidence Interval and p-value In the BP/OC example, the 95% CI tells us that p < 0.05, but it doesn't tell us that p = 0.009 The confidence interval and p-value are complementary However, you can't get the exact p-value from just looking at a confidence interval, and you can't get a sense of the scientific/substantive significance of your study results by looking at a p-value More on the p-value Statistical significance does not imply or prove causation Example: in the BP/OC case, there could be other factors at play that could explain the change in blood pressure A significant p-value is only ruling out random sampling (chance) as the explanation We would need a randomized comparison group to better establish causality Self-selected would be okay, but not ideal More on the p-value Statistical significance is not the same as scientific significance Hypothetical example: blood pressure and oral contraceptives Suppose: n = 100,000; x̄_diff = 0.03 mmHg; s = 4.6 mmHg; p = 0.04 A big n can sometimes produce a small p-value, even in the absence of a meaningful relationship The magnitude of the effect is small (not scientifically interesting): noise It is very important to always report a confidence interval 95% CI: 0.002-0.058 mmHg More on the p-value Lack of statistical significance is not the same as lack of scientific significance Must evaluate results in the context of the study and sample size A small n can sometimes produce a non-significant result even though the magnitude of the association at the population level is real and important; your study just may not be big enough to detect it Underpowered, small studies make not rejecting
hard to interpret Sometimes small studies are designed without power in mind just to generate preliminary data Comparing Means among Two (or More) Independent Populations Topics CIs for the mean difference between two independent populations Two-sample t-test Non-parametric alternative Comparing means of more than two independent populations Comparing Two Independent Groups "A Low Carbohydrate as Compared with a Low Fat Diet in Severe Obesity"¹ 132 severely obese subjects randomized to one of two diet groups Subjects followed for six months At the end of the study period: "Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI for the difference in weight loss between groups, -1.6 to -6.2 kg; p < 0.01)" ¹Samaha, F., et al. A low-carbohydrate as compared with a low-fat diet in severe obesity, NEJM 348:21. Comparing Two Independent Groups Is weight change associated with diet type?

Diet Group                                | Low-Carb | Low-Fat
Number of subjects (n)                    | 64       | 68
Mean weight change (kg) (post - pre)      | -5.7     | -1.8
Standard deviation of weight changes (kg) | 8.6      | 3.9

Diet Type and Weight Change 95% CIs for weight change by diet group: Low-carb: -5.7 ± 1.96 × 8.6/√64 = (-7.807, -3.593) kg Low-fat: -1.8 ± 1.96 × 3.9/√68 = (-2.728, -0.873) kg Comparing Two Independent Groups In statistical terms, is there a non-zero difference in the average weight change for the subjects on the low-fat diet as compared to subjects on the low-carbohydrate diet? The 95% CIs for each diet group's mean weight change do not overlap, but how do you quantify the difference? The comparison of interest is not "paired": there are different subjects in each diet group For each subject, a change in weight (post - pre) was computed However, the authors compared the changes in weight between two independent groups Comparing Two Independent Groups How do we calculate a CI for the difference? A p-value to determine if the difference in the two groups is significant?
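A sketch of the calculation developed over the next sections: the standard error of a difference in independent sample means combines the two groups' variances, and the 95% CI follows the usual estimate ± 1.96 × SE recipe (variable names are mine):

```python
from math import sqrt

# Diet study summary statistics
n1, x1, s1 = 64, -5.7, 8.6   # low-carb: n, mean weight change (kg), SD
n2, x2, s2 = 68, -1.8, 3.9   # low-fat

# Variation from independent sources adds on the variance scale
se_diff = sqrt(s1**2 / n1 + s2**2 / n2)
diff = x1 - x2
ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)
```

This gives a difference of -3.9 kg with SE ≈ 1.17 and a 95% CI of about (-6.2, -1.6) kg, the numbers the article reports.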
Since we have large samples (both greater than 60) we know the sampling distributions of the sample means in both groups are approximately normal It turns out that the difference of quantities that are approximately normally distributed is also approximately normally distributed Sampling Distribution of the Difference in Sample Means The sampling distribution of the difference of two sample means, each based on large samples, approximates a normal distribution This sampling distribution is centered at the true mean difference, μ1 - μ2 Simulated Sampling Distribution The simulated sampling distribution of sample mean weight change for the low-carbohydrate diet group is shown Simulated Sampling Distribution The simulated sampling distribution of sample mean weight change for the low-fat diet group is shown Simulated Sampling Distribution The simulated sampling distribution of the difference in sample means for the two groups is shown Simulated Sampling Distribution Side-by-side boxplots 95% CI for the Difference in Means Our most general formula is: best estimate from sample ± multiplier × SE(best estimate from sample) The best estimate of a population mean difference based on sample means: x̄1 - x̄2 Here, x̄1 may represent the sample mean weight loss for the 64 subjects on the low-carb diet, and x̄2 the mean weight loss for the 68 subjects on the low-fat diet 95% CI for the Difference in Means So, x̄1 - x̄2 = -5.7 - (-1.8) = -3.9; this makes the 95% CI for μ1 - μ2: -3.9 ± 1.96 × SE(x̄1 - x̄2), where SE(x̄1 - x̄2) is the standard deviation of the sampling distribution (i.e., the standard error of the difference of two sample means) Two Independent Groups The standard error of the difference for two independent samples is calculated differently than that for the paired design With the paired design, we reduced data on two samples to one set of differences Statisticians have developed formulas for the standard error of the difference; they depend on the sample sizes in both groups and standard
deviations in both groups Aside: SE(x̄1 - x̄2) is greater than either SE(x̄1) or SE(x̄2); any ideas why? Principle Variation from independent sources can be added: SE(x̄1 - x̄2) = √(σ1²/n1 + σ2²/n2) We don't know σ1 or σ2, so we estimate them using s1 and s2 to get an estimated standard error: SE(x̄1 - x̄2) = √(s1²/n1 + s2²/n2) Comparing Two Independent Groups Recall from the weight change/diet type study:

Diet Group                                | Low-Carb | Low-Fat
Number of subjects (n)                    | 64       | 68
Mean weight change (kg) (post - pre)      | -5.7     | -1.8
Standard deviation of weight changes (kg) | 8.6      | 3.9

SE(x̄1 - x̄2) = √(s1²/n1 + s2²/n2) = √(8.6²/64 + 3.9²/68) ≈ 1.17 95% CI for Difference in Means In this example, the approximate 95% confidence interval for the true mean difference in weight change between the low-carb and low-fat diet groups is: -3.9 ± 1.96 × 1.17 = (-6.2, -1.6) kg From Article "Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)" Those on the low-carb diet lost more on average by 3.9 kg; after accounting for sampling variability this excess average loss over the low-fat diet group could be as small as 1.6 kg or as large as 6.2 kg This CI does not include zero, suggesting a real population-level association between type of diet and weight loss Two-sample t-test: Getting a p-value Hypothesis Test to Compare Two Independent Groups Two-sample t-test Is the (mean) weight change equal in the two diet groups? HO: μ1 = μ2 HA: μ1 ≠ μ2 In other words, is the expected difference in weight change zero? HO: μ1 - μ2 = 0 HA: μ1 - μ2 ≠ 0 Hypothesis Test to Compare Two Independent Groups Recall, the general "recipe" for hypothesis testing is: 1. Assume HO is true 2. Measure the distance of the sample result from the hypothesized result, μO (in most cases it's 0) 3.
Compare the test statistic (distance) to the appropriate distribution to get the p-value t = (observed difference - null difference)/SE(observed difference), i.e., t = (x̄1 - x̄2 - 0)/SE(x̄1 - x̄2) = (x̄1 - x̄2)/√(s1²/n1 + s2²/n2) Diet Type and Weight Loss Study Recall: x̄1 - x̄2 = -3.9 and SE(x̄1 - x̄2) = 1.17 For this study: t = -3.9/1.17 ≈ -3.33 This study result was 3.33 standard errors below the hypothesized mean of 0; is this result unusual? How are p-values calculated? The p-value is the probability of getting a result as extreme or more extreme than what you observed if the null hypothesis were true It comes from the sampling distribution of the difference in two sample means What does this sampling distribution look like? If both groups are large, it is approximately normal It is centered at the true difference Under the null, the true difference is 0 Diet/Weight Loss To compute the p-value, we would need to compute the probability of being 3.33 or more SEs away from 0 Diet/Weight Loss In Excel, use the "TTEST" function to test whether two independent samples are significantly different If you've calculated the test statistic, t (-3.33, in this example), you can use the "TDIST" function to compute the p-value For the diet example, p = 0.0013 Summary: Weight Loss Example Statistical Methods "We randomly assigned 132 severely obese patients . . . to a carbohydrate-restricted (low-carbohydrate) diet or a calorie- and fat-restricted diet" "For comparison of continuous variables between the two groups, we calculated the change from baseline to six months in each subject and compared the mean changes in the two diet groups using an unpaired (two-sample) t-test" Result "Subjects on the low-carbohydrate diet lost more weight than those on a low-fat diet (95% CI: -1.6 to -6.2 kg; p < 0.01)" Sampling Distribution Detail What exactly is the sampling distribution of the difference in sample means?
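SciPy can reproduce the unpaired test for the diet example from the summary statistics alone; `ttest_ind_from_stats` with `equal_var=False` is the unequal-variance (Welch) version the FYI slides mention, and it lands close to the lecture's Excel values (a sketch, not the lecture's own computation):

```python
from scipy import stats

# Unpaired t-test from the diet study's summary statistics
res = stats.ttest_ind_from_stats(mean1=-5.7, std1=8.6, nobs1=64,
                                 mean2=-1.8, std2=3.9, nobs2=68,
                                 equal_var=False)   # Welch version
```

This gives t ≈ -3.32 and a p-value near 0.001, in line with the p = 0.0013 quoted on the slide.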
A Student's t distribution is used with n1 + n2 - 2 degrees of freedom (total sample size minus two) Two-Sample t-test In a randomized design, 23 patients with hyperlipidemia were randomized to either treatment A or treatment B for 12 weeks 12 to A 11 to B LDL cholesterol levels (mmol/L) were measured on each subject at baseline and 12 weeks The 12-week change in LDL cholesterol was computed for each subject

Treatment Group                   | A     | B
N                                 | 12    | 11
Mean LDL change                   | -1.41 | -0.32
Standard deviation of LDL changes | 0.55  | 0.65

Two-Sample t-test Is there a difference in LDL change between the two treatment groups? Methods of inference CI for the difference in mean LDL cholesterol change between the two groups Statistical hypothesis test 95% CI for Difference in Means (x̄1 - x̄2) ± t_1-α,n1+n2-2 × SE(x̄1 - x̄2) = (-1.41 - (-0.32)) ± t_1-α,21 × √(0.55²/12 + 0.65²/11) = -1.09 ± t_1-α,21 × 0.25 95% CI for Difference in Means How many standard errors to add and subtract (i.e., what is the correct multiplier)? The number we need comes from a t with 12 + 11 - 2 = 21 degrees of freedom From a t table or Excel, this value is 2.08 The 95% CI for the true mean difference in change in LDL cholesterol, drug A to drug B, is: -1.09 ± 2.08 × 0.25 = (-1.61, -0.57) Hypothesis Test to Compare Two Independent Groups Two-sample (unpaired) t-test: getting a p-value Is the change in LDL cholesterol the same in the two treatment groups? HO: μ1 = μ2 (i.e., μ1 - μ2 = 0) HA: μ1 ≠ μ2 (i.e., μ1 - μ2 ≠ 0) Hypothesis Test to Compare Two Independent Groups Recall the general "recipe" for hypothesis testing: 1. Assume HO is true 2. Measure the distance of the sample result from the hypothesized result (here, it's 0) 3.
Compare the test statistic (distance) to the appropriate distribution to get the p-value t = (observed difference - null difference)/SE(observed difference), i.e., t = (x̄1 - x̄2 - 0)/√(s1²/n1 + s2²/n2) Hyperlipidemia Study In the hyperlipidemia study, recall: x̄1 - x̄2 = -1.09 and SE(x̄1 - x̄2) = 0.25 In this study: t = -1.09/0.25 ≈ -4.4 This study result was 4.4 standard errors below the null mean of 0 How are p-values Calculated? Is a result 4.4 standard errors below 0 unusual? It depends on what kind of distribution we are dealing with The p-value is the probability of getting a result as extreme or more extreme than what was observed (-4.4) by chance, if the null hypothesis were true The p-value comes from the sampling distribution of the difference in two sample means What is the sampling distribution of the difference in sample means? A t distribution with 12 + 11 - 2 = 21 degrees of freedom Hyperlipidemia Example To compute a p-value, we need to compute the probability of being 4.4 or more SEs away from 0 on the t with 21 degrees of freedom: p = 0.0003 Summary: Hyperlipidemia Example Statistical Methods Twenty-three patients with hyperlipidemia were randomly assigned to one of two treatment groups: A or B 12 patients were assigned to receive A 11 patients were assigned to receive B Baseline LDL cholesterol measurements were taken on each subject and LDL was again measured after 12 weeks of treatment The change in LDL cholesterol was computed for each subject The mean LDL changes in the two treatment groups were compared using an unpaired t-test and a 95% confidence interval was constructed for the difference in mean LDL changes Summary: Hyperlipidemia Example Result Patients on A showed a decrease in LDL cholesterol of 1.41 mmol/L and subjects on treatment B showed a decrease of 0.32 mmol/L (a difference of 1.09 mmol/L, 95% CI: 0.57 to 1.61 mmol/L) The difference in LDL changes was statistically significant (p < 0.001) FYI: Equal Variances Assumption The "traditional" t-test assumes equal
variances in the two groups
This can be formally tested using another hypothesis test, but why not just compare the observed values of s1 and s2?
There is a slight modification to allow for unequal variances (Welch's t-test); this modification adjusts the degrees of freedom for the test, using a slightly different SE computation
If you want to be truly 'safe', it is more conservative to use the test that allows for unequal variances
Makes little to no difference in large samples

FYI: Equal Variances Assumption
If the underlying population-level standard deviations are equal, both approaches give valid confidence intervals, but the intervals assuming unequal standard deviations are slightly wider (and the p-values slightly larger)
If the underlying population-level standard deviations are unequal, the approach assuming equal variances does not give valid confidence intervals and can severely under-cover the nominal 95% level

Non-Parametric Analogue to the Two-Sample t-test
"Non-parametric" refers to a class of tests that do not assume anything about the distribution of the data
Nonparametric test for comparing two groups: the Mann-Whitney Rank-Sum test (Wilcoxon Rank-Sum test), also called the Wilcoxon-Mann-Whitney test
It attempts to answer: "Are the two population distributions different?"
Advantages: does not assume the populations being compared are normally distributed, uses only ranks, and is not sensitive to outliers

Alternative to the Two-Sample t-test
Disadvantages:
often less sensitive (less powerful) for finding true differences, because it throws away information (using only ranks rather than the raw data)
needs the full data set, not just summary statistics
results do not include any CI quantifying the range of possibilities for the true difference between populations

Health Education Study
Evaluate an intervention to educate high school students about health and lifestyle over a two-month period
10 students randomized to an intervention or control group
X = post-test score - pre-test score
Compare
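Returning for a moment to the equal-variances discussion above: the unequal-variance modification replaces the pooled degrees of freedom with the Welch-Satterthwaite approximation. A minimal sketch using the hyperlipidemia summary statistics (the resulting df is my own computation, not a number given on the slides):

```python
# Welch-Satterthwaite approximate degrees of freedom for the
# unequal-variance ("Welch") version of the two-sample t-test
n1, s1 = 12, 0.55   # treatment A: n and SD of LDL changes
n2, s2 = 11, 0.65   # treatment B: n and SD of LDL changes

v1, v2 = s1**2 / n1, s2**2 / n2
df_welch = (v1 + v2)**2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"pooled-test df: {n1 + n2 - 2}")   # 21
print(f"Welch df: {df_welch:.1f}")        # 19.7
```

The Welch df is a bit smaller than the pooled 21, giving a slightly wider interval, which is why the unequal-variance approach is the more conservative choice.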
between the two groups

Health Education Study
Only five individuals in each sample
We want to compare the control and intervention groups to assess whether the score 'improvements' differ, taking random sampling error into account

Intervention: 5, 0, 7, 2, 19
Control: -5, -6, 1, 4, 6

With such a small sample size, we need to be sure score improvements are normally distributed if we want to use the t-test (a BIG assumption)
Possible approach: the Wilcoxon-Mann-Whitney test

Health Education Study
Step 1: rank the pooled data, ignoring groups
Pooled, sorted: -6, -5, 0, 1, 2, 4, 5, 6, 7, 19 → ranks 1 through 10
Step 2: reattach group status

Intervention data: 5, 0, 7, 2, 19 → ranks 7, 3, 9, 5, 10
Control data: -5, -6, 1, 4, 6 → ranks 2, 1, 4, 6, 8

Step 3: find the average rank in each of the two groups
Intervention: (3 + 5 + 7 + 9 + 10) / 5 = 6.8
Control: (1 + 2 + 4 + 6 + 8) / 5 = 4.2

Health Education Study
Statisticians have developed formulas and tables to determine the probability of observing such an extreme discrepancy in average ranks (6.8 versus 4.2) by chance alone (the p-value)
The p-value here is 0.17
The interpretation is that the Mann-Whitney test did not show a significant difference in test score 'improvement' between the intervention and control groups (p = 0.17)
The two-sample t-test would give a different answer (p = 0.14); different statistical methods give different p-values
If the largest observation were made even more extreme, the Mann-Whitney p-value would not change (its rank stays the same), but the t-test p-value would

Notes
The t-test or the nonparametric test?
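Before weighing that choice, the three ranking steps of the health education example can be verified in a few lines of Python (a minimal sketch; these data contain no ties, so simple integer ranks suffice):

```python
# Steps 1-3 of the Wilcoxon-Mann-Whitney procedure for the
# health education study, in plain Python
intervention = [5, 0, 7, 2, 19]
control = [-5, -6, 1, 4, 6]

# Step 1: rank the pooled data, ignoring groups (rank 1 = smallest);
# no tied values here, so each observation gets a distinct integer rank
pooled = sorted(intervention + control)
rank = {value: pooled.index(value) + 1 for value in pooled}

# Step 2: reattach group status
ranks_int = [rank[x] for x in intervention]   # [7, 3, 9, 5, 10]
ranks_ctl = [rank[x] for x in control]        # [2, 1, 4, 6, 8]

# Step 3: average rank in each group
avg_int = sum(ranks_int) / len(ranks_int)
avg_ctl = sum(ranks_ctl) / len(ranks_ctl)

print(avg_int, avg_ctl)   # 6.8 4.2
```

From the rank sums, the tabled (or software) Mann-Whitney p-value of 0.17 reported on the slide follows.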
Statisticians will not always agree, but there are some guidelines
Use the nonparametric test if:
the sample size is small and you have no reason to believe the data are 'well-behaved' (normally distributed)
only ranks are available

Summary: Educational Intervention Example
Statistical Methods
10 high school students were randomized to either receive a two-month health and lifestyle education program or no program
Each student was administered a test regarding health and lifestyle issues prior to randomization and again after the two-month period
The difference between the two test scores was computed for each student
Mean and median test score changes were computed for each of the two study groups
A Mann-Whitney rank-sum test was used to determine whether there was a statistically significant difference in test score change between the intervention and control groups at the end of the two-month study period

Summary: Educational Intervention Example
Results
Participants randomized to the educational intervention scored a median of five points higher on the test given at the end of the two-month study period, compared with the test administered prior to the intervention
Participants randomized to receive no educational intervention scored a median of one point higher on the test given at the end of the two-month study period
The difference in test score improvements between the intervention and control groups was not statistically significant (p = 0.17)

Next Lecture
Friday, December 17 in B018 SON from 8:30a - 10:30a
Topics include
― ANOVA
― Linear Regression
― Chi-square test
― Survival Analysis
― Design of Experiments

References and Citations
Lectures modified from notes provided by John McGready and the Johns Hopkins Bloomberg School of Public Health, accessible on the World Wide Web: http://ocw.jhsph.edu/courses/introbiostats/schedule.cfm