OCW Epidemiology and Biostatistics, 2010
Steven A. Cohen, DrPH, MPH
Tufts University School of Medicine
November 2, 2010
HYPOTHESIS TESTING, THE ROLE OF CHANCE, COMMON STATISTICAL TESTS
Learning objectives for this session:
1) Be able to perform the key steps to hypothesis testing
2) Understand the concept and use of the alpha value
3) Understand in general terms how the p value is derived
4) Correctly interpret the p value
5) Distinguish between type I and type II error
6) Understand the relationships among sample size, power, type I and type II error
7) Have a general notion of the concept of power as it relates to study conclusions
8) Know how a two-sample t-test is applied
9) Know how to interpret the results of a two-sample t-test
10) Know when it is appropriate to use a chi-square test
11) Know how to interpret the results of a chi-square test
12) Define the following terms: hypothesis testing, null hypothesis, alternate hypothesis, type I
error, type II error, p-value, power, chi square test, t-test
Outside preparation:
Pagano & Gauvreau
Chapter 10, pages 232-246, 249-258
Chapter 11, pages 265-284
Chapter 15, pages 342-349
Note
The following two lectures are designed to orient students toward interpretation of the medical
literature, so that they have a more complete and structured understanding of how statistical analyses are used in
biomedical research. The way in which this material is presented is specifically designed for the
medical student who has minimal background in statistics, and as such, emphasizes interpretation,
rather than the theory behind the statistics. Those who have had formal coursework in statistics before
this course are encouraged to consult the instructors of the course if they wish to learn more of the
mathematical and conceptual details of the information presented in these lectures.
HYPOTHESIS TESTING
A researcher wishes to know if a new vaccine will reduce the likelihood of getting HIV in a sample of
adolescents. In statistics, we can frame this research question into two hypotheses and conduct what is
called hypothesis testing.
Null hypothesis, H0: The proportion of people who contract HIV will be the same for adolescents who
have received the new vaccine and those who have not received the vaccine.
Alternate hypothesis, HA: The proportion of people who contract HIV will be different for
adolescents who have received the new vaccine and those who have not received the vaccine.
Note that the alternate hypothesis states that the proportions in each group who contract HIV will be
different. This allows for the possibility that the new vaccine could reduce or possibly increase the
likelihood of contracting HIV. Under this scenario, the alternate hypothesis is called two-tailed.
One-tailed alternate hypotheses are rarely used in the medical literature; in this example, a one-tailed hypothesis would suggest that those
who were vaccinated would have a lower risk of contracting HIV than those who were not. It should be noted that the FDA
does not permit one-tailed hypotheses in most clinical trials.
In statistics, we assume the null hypothesis is true, unless there is substantial and convincing
evidence to reject the null hypothesis. Hypothesis testing basically seeks to determine whether there
is enough evidence to reject the null hypothesis in favor of the alternate, but the default is always the
null hypothesis, i.e., no effect or no difference.
Next, we move on to how we write and symbolize the null and alternate hypotheses in statistics. The
null hypothesis is typically denoted as H0, pronounced “H-naught,” whereas the alternate hypothesis is
denoted HA or Ha. Let Pvaccine = the proportion of those who were vaccinated who contract HIV and
Pnovaccine = the proportion of those who were not vaccinated who contract HIV. If we want to test the
above hypotheses pertaining to the proportion of vaccinated adolescents who contract HIV versus the
proportion of unvaccinated adolescents who contract HIV, we denote them as follows:
Scenario 1
H0: Pvaccine = Pnovaccine
HA: Pvaccine ≠ Pnovaccine
Scenario 2
H0: Pvaccine - Pnovaccine = 0
HA: Pvaccine - Pnovaccine ≠ 0
Scenario 3
H0: Pvaccine / Pnovaccine = 1
HA: Pvaccine / Pnovaccine ≠ 1
Each of these hypothesis formulation scenarios is equivalent to the others. If we wish to test a
difference between mean fasting blood glucose levels comparing those on the drug Reducagluce versus
those on placebo, we denote the null and alternate hypotheses as follows: Let μ represent the
population mean blood glucose level; therefore μReducagluce = the population mean blood glucose level
of those given Reducagluce and μplacebo = the population mean blood glucose level of those on placebo.
Scenario 1
H0: μReducagluce = μplacebo
HA: μReducagluce ≠ μplacebo
Scenario 2
H0: μReducagluce - μplacebo = 0
HA: μReducagluce - μplacebo ≠ 0
The hypothesis is framed in terms of the population parameter, not the sample statistic, given
that we wish to draw inferences about the larger population, not just the people in the study. For
instance, the researcher examining the role of a new HIV vaccine in adolescents finds that, of the
adolescents who are vaccinated, 6% contract HIV, compared to 9% of the adolescents who are not
vaccinated. We know there is a difference between the sample groups (in practice, there
will almost always be some difference). The goal of research is not to make inferences about a sample,
but to make inferences about the population. In this way we use the sample statistics, with appropriate
uncertainty, to make inferences about population differences and/or associations.
Research studies do not always state the hypothesis being tested explicitly. It is good practice in
reading articles to first identify the central hypothesis or hypotheses being tested in order to understand
the article more easily.
THE ROLE OF CHANCE
A major role of statistics in biomedical studies is to quantify the role of chance in producing the
associations or differences between groups observed. In the example above, if we find that adolescents
who are administered the vaccine have a lower risk of contracting HIV (6% vs. 9%, a relative risk of
about 0.67) compared to adolescents who were not administered the vaccine, is that difference between groups due to a real
effect of the vaccine or is it due to chance? In other words, under what circumstances do we reject the
null hypothesis? The following sections discussing Type I and II error and the p-value will help clarify
how hypothesis testing is conducted in biomedical research.
Alpha Value and Type I Error
The alpha value (α) represents a threshold beyond which the null hypothesis would be rejected in
hypothesis testing. The alpha value must be established by the investigator before the study begins.
In theory, the alpha value could be anything, but in practice it is often set to 0.05 or 0.10. If we subtract
alpha from 1, we get the confidence levels used in confidence intervals. For example, an alpha value
of 0.05 corresponds to a 1 – 0.05, or 0.95, or 95% confidence interval. Similarly, an alpha of 0.10
corresponds to a 90% confidence interval.
The alpha value is also called the type I error rate. Type I error occurs when the null hypothesis is
rejected, when in reality, the null hypothesis should not have been rejected. In other words, we
observed a difference in the sample when, in the population, there is no difference.
In the figure below, we see an example of determining the rejection region of a hypothesis test as it pertains to determining
differences between means under the null hypothesis μReducagluce - μ placebo = 0 for an alpha of 0.05. The total area under the
normal curve is equal to 1. Since we make the assumption that the data from which the means are obtained are normally
distributed, we can use certain facts about the Gaussian (or normal) distribution. First, we know that 95% of the observations of a
Gaussian distribution fall within ±1.96 standard deviations or z-score units of the mean, or in this case, the difference in
means. Next, we distribute alpha of 0.05 between the two extremes of the distribution, under the notion that the mean
glucose level for those on the drug could be higher or lower than those on placebo. If, when we run the test (a t-test,
discussed below), we find that the test statistic falls between -1.96 and 1.96, we do not reject the null hypothesis. If,
however, the test statistic is above 1.96 or below -1.96, this falls in the rejection region and we reject the null hypothesis. A
similar procedure is used for population proportions, odds ratios, and other population parameters.
[Figure: standard normal curve of z-scores; the central 95% of observations lies between -1.96 and +1.96, with a rejection region of α/2 (2.5% of observations) in each tail.]
Figure 1: Illustration of α using an example of the sampling distribution for a difference between means
We reject the null hypothesis when the p-value is smaller than the pre-established alpha value.
Therefore, if the alpha value is 0.05 and the p-value obtained is 0.31, we do not reject the null
hypothesis. If the alpha value is 0.05 and the p-value is 0.001, we reject the null hypothesis in favor of
the alternate hypothesis; these results suggest that the observed difference is not due to chance but
instead reflects a real difference.
The confidence interval and the p-value are related to each other. For the same alpha, when the
confidence interval excludes the null value, this means that the corresponding p-value for the
hypothesis test will be less than alpha, and vice versa.
If, for example, you wish to determine whether the difference in means between the drug and placebo groups is
significantly different from 0, there are two ways to do this: construct a confidence interval around the mean
difference and determine whether or not it includes the null value of 0, or conduct a hypothesis test (for example, a t-test, below) and
determine if the test statistic has a corresponding p-value below the alpha value. Below is a schematic diagram of this
concept.
Figure 2: Schematic diagram of the relationship between confidence intervals and p-values.
When the p-value is below the established alpha level, the corresponding confidence interval does not contain
the null value of the parameter estimate
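To make this duality concrete, here is a minimal Python sketch, assuming scipy and numpy are available; the glucose values are simulated, hypothetical data. The 95% confidence interval for the difference in means excludes 0 exactly when the pooled two-sample t-test p-value falls below 0.05.

    # Sketch: duality between a 95% CI for a difference in means and a
    # two-sided t-test at alpha = 0.05. The data below are simulated.
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    drug = rng.normal(125, 15, size=50)      # hypothetical glucose values, drug group
    placebo = rng.normal(133, 15, size=50)   # hypothetical glucose values, placebo group

    n1, n2 = len(drug), len(placebo)
    diff = drug.mean() - placebo.mean()
    sp2 = ((n1 - 1) * drug.var(ddof=1) + (n2 - 1) * placebo.var(ddof=1)) / (n1 + n2 - 2)
    se = np.sqrt(sp2 * (1 / n1 + 1 / n2))    # pooled standard error of the difference
    df = n1 + n2 - 2

    tcrit = stats.t.ppf(0.975, df)
    ci = (diff - tcrit * se, diff + tcrit * se)
    t_stat, p = stats.ttest_ind(drug, placebo)  # pooled two-sample t-test

    print(ci)   # when this interval excludes 0 ...
    print(p)    # ... this p-value is below 0.05, and vice versa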
Beta, Statistical Power, and Type II Error
Another important concept in statistics related to alpha (type I error) is beta (β, the type II error
rate). Type II error occurs when the null hypothesis is not rejected, when in fact it should have
been. This type of error occurs when a study is underpowered, meaning that the sample size is not
sufficiently large to yield a standard error small enough to detect a real difference or
association.
Power, defined as 1 – β, or 1 minus the type II error rate, is the probability that a statistical test will
result in the rejection of the null hypothesis, when in fact, the null hypothesis should be rejected.
A power analysis is often used in advance of a study to determine the minimum sample size required to
detect an expected effect size with a particular level of confidence. In a high-powered
study, it is easier to detect smaller differences between groups than in
lower powered studies. Power increases with increasing sample size and increasing expected effect
size for a fixed alpha level. As you learned in the previous lecture, studies with small sample sizes
may be underpowered, which may result in Type II error, as well as wide confidence intervals.
Alpha and beta are essentially independent of each other, and can take on any value that the researcher wants. However,
the alpha level is typically set at 5% or 0.05, while beta is often set at 10, 15, or 20%, depending upon the study and sample
size available.
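To illustrate how alpha, power, expected effect size, and sample size interact, the following Python sketch applies the standard normal-approximation formula for the required sample size per group when comparing two means; the sigma and delta values are hypothetical assumptions, and scipy is assumed available.

    # Sketch: sample size per group to compare two means, via the normal
    # approximation n = 2 * ((z_{1-alpha/2} + z_{1-beta}) * sigma / delta)^2
    from scipy.stats import norm

    alpha, beta = 0.05, 0.20        # type I error 5%; type II error 20% (power = 80%)
    sigma = 15.0                    # assumed common standard deviation (hypothetical)
    delta = 8.0                     # smallest mean difference worth detecting (hypothetical)

    z_alpha = norm.ppf(1 - alpha / 2)   # ~1.96
    z_beta = norm.ppf(1 - beta)         # ~0.84

    n_per_group = 2 * ((z_alpha + z_beta) * sigma / delta) ** 2
    print(round(n_per_group))           # ~55 per group under these assumptions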
The two types of error involved in hypothesis testing are shown below.
                                    Reality or truth about the population parameter
Decision based on sample            H0 true                     HA true
and hypothesis testing
  Reject H0                         Type I error (α)            Correct decision
  Do not reject H0                  Correct decision            Type II error (β)
Figure 3: Relationship between two types of error in hypothesis testing, the decision to reject or not reject
the null hypothesis, and the truth about the unknown population parameter of study
There is a common analogy used to help explain the concepts of type I and type II error: the fire and fire alarm. In this
example, the state of “no fire” is analogous to the reality that the null hypothesis is true, and the state of “fire” is when, in
reality, there is a real difference between groups.
Pulling the fire alarm is analogous to rejecting the null hypothesis, and not pulling the fire alarm is analogous to not
rejecting the null hypothesis. If there is no fire and we do not pull the fire alarm, we are taking the correct action. This
would be analogous to not rejecting the null hypothesis when, in fact, the null hypothesis is true. Pulling the fire alarm
when there is a fire is also a correct action, and this is analogous to rejecting the null hypothesis when it should be rejected.
Pulling the fire alarm when there is no fire is analogous to rejecting the null hypothesis when the null hypothesis is true, a
type I error. Not pulling the fire alarm when there actually is a fire is akin to not rejecting the null hypothesis when the
alternate hypothesis is true, a type II error.
COMMON STATISTICAL TESTS
Putting it All Together
Thus far, we have discussed confidence intervals (previous class), Type I and II error, the p-value, and
power in the context of hypothesis testing in biomedical studies. A logical question that follows would
be how hypothesis testing is actually carried out in biomedical research.
Basically, there are two methods for hypothesis testing, each yielding identical results. The first is the
method described here, which is through statistical testing. Using statistical tests, such as those that
are described below (e.g. chi square tests, t-tests), the first step is to establish the Type I error rate, or α.
Next, we conduct the test, usually involving some sort of calculation resulting in a test statistic. Using
tables, computation, or computer programs, we find the p-value associated with that test statistic. We
then compare the p-value to the pre-established α. If the p-value is less than α, we reject the null
hypothesis and infer that there is a statistically significant difference or association between
groups, depending upon the test. If the p-value is greater than α, we do not reject the null hypothesis
and infer that the difference or association observed between groups is likely due to chance.
The second method, which produces identical results in terms of whether or not the null hypothesis is
rejected, is through the use of confidence intervals. This method involves first establishing a “null
value” for the confidence intervals. For differences between group means, this null value is usually 0;
for relative risks and odds ratios, the null value is usually 1. We also establish the α value and
confidence limits accordingly. In other words, when α is set to 0.05, this corresponds to 95%, or 1 – α,
confidence limits. Next, we calculate a point estimate, typically a relative risk, odds ratio, mean, or
any of a number of different statistics. Confidence intervals are then constructed around the point
estimate. If the confidence interval contains the pre-established null value, we do not reject the null
hypothesis, and infer that the association or difference observed is likely due to chance. If the
confidence interval excludes the null value, we reject the null hypothesis and infer that the difference
or association is likely not due to chance, but instead due to a real difference or association between
groups. A schematic diagram comparing statistical testing to the method of confidence intervals is
shown in Figure 4.
Figure 4: Schematic diagram illustrating how statistical testing and confidence interval methods are used
to test hypotheses in biomedical research
The “Table 1”
Most articles you will encounter in half-decent biomedical journals will contain a table that lists
descriptive statistics (e.g. means, standard deviations, frequencies, etc...) by treatment or control
group(s). The goal of presenting such tables is generally to compare groups with respect to their
demographic characteristics. As discussed previously in the course, randomization will almost always
ensure that the groups are similar to each other with respect to such characteristics, but in non-randomized and in some randomized studies, there remain important differences between, for example,
the treatment and control groups, that, if not properly accounted for, may bias the results of the study.
An example of what such a table might look like in a biomedical journal article is displayed in Table 1.
This table is from a cohort study assessing the use of anti-retroviral therapy (ART) in HIV-positive
mothers in reducing transmission of HIV to their infants. In this table, you will note that there are four
columns: one listing the demographic characteristic, one listing the averages or frequencies for the
treatment group, the next listing the averages or frequencies for the control group, and another listing
the p-value for the comparison of the two groups.
Table 1: Baseline characteristics of a sample of mother-infant pairs from HIV-positive mothers
Characteristic                        Treatment Group    Control Group    p-value
Mean age (years)                      26.2               25.9             0.21
Mean birthweight of infant (kg)       3.1                3.1              0.76
Married (%)                           93.5               94.1             0.04
Male sex of infant (%)                51.1               50.8             0.19
Commonly, two statistical tests are used to test the hypotheses that the two groups are similar with
respect to a variety of demographic and related characteristics: the chi-square test and the two-sample
independent t-test, which in this text will be referred to as the “t-test”.
The Chi-Square Test
A chi-square (χ2) test is used in hypothesis testing where the outcome variable and the exposure
variable are both binary (or dichotomous). The chi-square test can also be used when the outcome and
exposure variables are multinomial, meaning that they are categorical variables with more than two
possible values (e.g. an exposure variable of never smoker, current smoker, former smoker). For the
explanation and examples in this text, we will focus exclusively on binary exposure and outcome
variables, however.
The chi square test is used to calculate p-values associated with relative risks and odds ratios in
biomedical studies. It can also be used to determine the p-value in a comparison of percentages or
proportions among two or more groups, which is essentially an adaptation of relative risk. Although
the distributions that underlie the chi-square test differ considerably in appearance and critical values
from the t-distribution family used in the t-test, the procedure for calculating the chi-square statistic is
analogous to the calculation procedure for a t-test.
The first step is to start with the null hypothesis that the risk (or odds or proportions) is the same in
both exposure groups. Then, construct a 2 x 2 contingency table showing how many exposed did or
did not get the outcome and how many unexposed did or did not get the outcome of interest. Based on
these data, we then calculate an expected value of the numbers in each cell if the null hypothesis were
true (i.e. no difference between groups). Next, we compare the expected numbers in each cell of the
contingency table to the actual observed value. We then calculate the chi square statistic based on the
differences between the observed and expected cell counts. If the differences are small, the chi-square
statistic will be small, whereas if the differences are larger, the chi-square statistic will be large.
Small chi-square statistics suggest that the differences observed are likely due to chance, but large
chi-square statistics suggest that the observed result is not likely due to chance, but instead due to a real
association.
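The observed-versus-expected calculation can be sketched in a few lines of Python; the 2 x 2 counts below are hypothetical, and scipy and numpy are assumed available.

    # Sketch: chi-square statistic for a hypothetical 2x2 table, by hand and via scipy
    import numpy as np
    from scipy.stats import chi2, chi2_contingency

    observed = np.array([[30, 70],    # exposed:   outcome yes / outcome no (hypothetical)
                         [45, 55]])   # unexposed: outcome yes / outcome no

    # Expected cell counts under H0: (row total x column total) / grand total
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals @ col_totals / observed.sum()

    chi_sq = ((observed - expected) ** 2 / expected).sum()
    df = (2 - 1) * (2 - 1)            # (exposure levels - 1) x (outcome levels - 1) = 1
    p = chi2.sf(chi_sq, df)
    print(chi_sq, p)                  # 4.8 here, which exceeds the critical value ...
    print(chi2.ppf(0.95, df))         # ... of ~3.84 for alpha = 0.05 and 1 d.f.

    stat, pval, dof, exp = chi2_contingency(observed, correction=False)
    print(stat, pval)                 # same statistic from scipy directly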
In order to determine whether or not to reject the null hypothesis, the critical value of the chi-square
test must be known. The critical value for the chi-square test depends upon the number of degrees of
freedom, which is derived from the numbers of exposure and outcome levels. The number of degrees
of freedom is just the product of the number of possible exposure levels minus one and the number of
possible outcome levels minus one [d.f. = (# exposure levels – 1) x (# outcome levels – 1)]. For
example, the critical value for a chi-square test with binary exposure and outcome variables is
approximately 3.84. Since chi-square values cannot be negative, the entire rejection region of the chi-square distribution is to the right of the critical value, even for a two-sided hypothesis, as depicted in
the figure below.
[Figure: chi-square distribution (probability vs. chi-square value); the critical value for 1 d.f. is 3.84, with small χ2 values to its left and the rejection region for H0 (large χ2 values) to its right.]
Figure 5: Schematic diagram of the chi-square test for one degree of freedom
Chi square tests were used to obtain the p-values in Table 1 above for the percent “married” and
percent “male sex of infant” characteristics. For male sex of infant, the p-value was 0.19; this means
that we do not reject the null hypothesis that the proportion of male infants is the same in the control and
treatment groups. On the other hand, the p-value associated with the chi square test for percent
married was 0.04. If we assume that the α value (type I error rate) was 0.05, we would reject the null
hypothesis that the proportion of married women is the same in the treatment and control groups.
The chi-square distribution is actually a family of distributions. As the number of degrees of freedom increases (i.e. from
increasing the number of levels of the exposure and/or outcome variables), the distributions shift to the right,
meaning that the critical values increase. This is depicted schematically in the figure below. The figure shows the
distributions of three chi-square distributions, for 2, 4, and 6 degrees of freedom. Note how the critical value for a
theoretical value of alpha increases, or shifts to the right, as the number of degrees of freedom increases.
[Figure: chi-square distributions for df = 2, df = 4, and df = 6, plotted against chi-square value.]
Figure 6: Diagram of chi-square distributions with three different degrees of freedom and denotations of
critical values for a theorized value of alpha
For small sample sizes, generally less than 30, a Fisher’s Exact test is used in place of the chi-square test for 2x2 table
comparisons. The Fisher’s Exact test is non-parametric and does not require large sample sizes, as the chi-square test does.
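A brief Python sketch of the same idea, again with hypothetical counts:

    # Sketch: Fisher's exact test for a small, hypothetical 2x2 table
    from scipy.stats import fisher_exact

    table = [[3, 7],    # exposed:   outcome yes / outcome no (hypothetical small counts)
             [9, 1]]    # unexposed: outcome yes / outcome no
    odds_ratio, p = fisher_exact(table)
    print(odds_ratio, p)   # exact p-value; no large-sample assumption required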
The Two-Sample t-Test
The two-sample t-test is used to compare the means of two groups. This implies that the outcome has
to be a continuous variable (so that we can derive the means) and the predictor has to be a 2-level
categorical variable (e.g. treatment vs. placebo; drug A vs. drug B; male vs. female). The
conventional minimal sample size for this test is 15 subjects per group, totaling 30.
For instance, in a randomized clinical trial, we wish to measure the effect of the drug Reducagluce by
treating one group of adults with the drug, and the other with placebo. After 9 months of treatment, the
mean fasting blood glucose level in the treatment group was 125 mg/dl, while the mean fasting blood
glucose level in the placebo group was 133 mg/dl. The mean difference is 8 mg/dl. However, is this
difference statistically significant? That’s when we should use a t-test.
The null hypothesis, H0, is Mean glucoseReducagluce = Mean glucoseplacebo. Alternately, we can also
write Mean glucoseReducagluce – Mean glucoseplacebo = 0, which is closer to the concept of the t-test.
The t-test standardizes the difference between groups into a t-value, which will be contrasted with one
of many theoretical t-distributions chosen based on the sample size. If the observed t-value is more
extreme than the threshold associated with an alpha of 0.05, we reject the null hypothesis. Otherwise,
we do not reject the null hypothesis; note that failing to reject the null does not prove that it is true.
Below is a slightly simplified version of the formula for a t-test. There are several variations of this
formula, most of which have to do with whether or not the standard deviations are assumed equal in the two
sample groups.
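t = (sample mean difference – expected difference) / SE(mean difference)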
The “expected difference” is usually 0. The standard error is a function of the sample standard
deviations divided by a function of the sample size for each group. The numerator tells you how far the
calculated mean difference is from the hypothesized value under the null hypothesis. The denominator
contains the standard error, which basically standardizes the value obtained in the numerator and puts
that value in standard error units, from which you can obtain the t-value and corresponding p-value.
A graphical depiction of the procedure for the t-test is shown below. If we were to calculate the t-value for the 8 mg/dl difference in blood glucose levels and found that the t-score is close to the
hypothesized value of 0, we would conclude that chance alone is a better explanation for the observed
difference between the two groups and would not reject the null hypothesis. If, however, the t-value is
relatively far from the hypothesized value of 0, we would conclude that chance alone or random error
does not explain the difference between the groups and reject the null hypothesis in favor of the
alternate hypothesis.
Several factors will increase the t-score for a given null hypothesis:
1. Increasing the sample mean difference
2. Decreasing the standard error, which generally involves (a.) a reduction in standard deviation
or (b.) an increase in sample size
In order to definitively determine whether the observed difference is more likely due to chance or due to
a real difference, a table of probabilities would need to be consulted. These are easily found in most
statistics textbooks and on the internet.
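The lookup can also be done in code; a short sketch using scipy, with a hypothetical t-value and degrees of freedom:

    # Sketch: converting a t-value into a two-sided p-value without a printed table
    from scipy.stats import t

    t_value, df = 2.3, 48             # hypothetical test statistic and degrees of freedom
    p = 2 * t.sf(abs(t_value), df)    # total area in both tails beyond +/- t_value
    print(p)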
[Figure: t-distribution centered at the null value μHo = 0 (t-value in SE units), with rejection regions of α/2 in each tail; a t-value near 0 is too close to the null value to reject H0, while a t-value far enough from the null value falls in a tail and we reject H0.]
Figure 7: Schematic diagram of t-test procedure
There are several variations of the t-test you might encounter when reading the medical literature or
conducting research. The following table summarizes the differences.
Two-sample t-test
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: 2
- Groups are: Not related. A subject in group 1 is not associated with any subject in group 2.
- Important statistics: t-statistics
- Results: Allows us to tell if the mean difference between the two groups is significant.
- Non-parametric analogue: Wilcoxon Rank Sum test (also called Mann-Whitney U test)

Paired-sample t-test
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: 2
- Groups are: Related. Subjects are either paired with themselves (pre-post study, left/right eyes, etc.) or subjects are related in some nature (twins, couples, etc.).
- Important statistics: t-statistics
- Results: Allows us to tell if the mean difference between the two groups is significant.
- Non-parametric analogue: Wilcoxon Signed Rank test

ANOVA (Analysis of variance)
- Outcome: Continuous
- Exposure: Categorical groups
- Number of groups: More than 2
- Groups are: Not related. A subject in group 1 is not associated with any subject in groups 2, 3, 4, … k.
- Important statistics: F-statistics
- Results: Allows us to tell if any one of the means is different from the others.
- Non-parametric analogue: Kruskal-Wallis test
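For reference, the following Python sketch shows how each test in the table, and its non-parametric analogue, is typically invoked with scipy; the data are small hypothetical samples.

    # Sketch: the tests from the table above, on hypothetical data
    from scipy import stats

    a = [5.1, 4.9, 6.0, 5.5, 5.8]   # hypothetical group 1
    b = [4.2, 4.8, 5.0, 4.4, 4.9]   # hypothetical group 2 (same length, for the paired test)
    c = [3.9, 4.1, 4.6, 4.0, 4.3]   # hypothetical group 3 (for ANOVA)

    print(stats.ttest_ind(a, b))     # two-sample t-test: unrelated groups
    print(stats.ttest_rel(a, b))     # paired-sample t-test: related pairs
    print(stats.f_oneway(a, b, c))   # one-way ANOVA: more than 2 groups (F-statistic)

    print(stats.mannwhitneyu(a, b))  # Wilcoxon Rank Sum / Mann-Whitney U
    print(stats.wilcoxon(a, b))      # Wilcoxon Signed Rank (paired differences)
    print(stats.kruskal(a, b, c))    # Kruskal-Wallis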
NOTE: The information on the following three pages is optional and intended to provide a small amount of theoretical
background and conceptual examples for the t-test. You are not required to know the material in this section.
On the t-distribution
Another feature of the t-test is the use of the t-distribution. Let’s consider the formula for a 95% confidence interval: sample mean ± 1.96 × SE.
The number “±1.96” is the range of the z-score (number of standard deviations away from zero) corresponding to covering the
central 95% of the area of a normal bell-shaped curve, leaving 2.5% of the area at each of the two ends. The “68-95-99.7” rule is based on the
normal distribution, also known as the Gaussian distribution. The Gaussian distribution is characterized only by its mean and
standard deviation; it does not consider sample size. Researchers soon found that sample means do not
necessarily form a perfect bell curve when the sample size is very small! To describe this process, a new distribution that
changes with sample size was introduced; it is known as the “t-distribution”. The following figure shows the difference
between the normal curve (black solid line) and a t-distribution with a sample size of 10 (red dotted line). The t-distribution
curve is flatter and it extends farther away from the mean, 0. Note that there are many t-distributions (one for each sample
size!), but when the sample size reaches 30 or so, the t-distribution curve will become very close to the normal distribution
curve.
The t-distribution compensates for the problem of small sample size by increasing the multiplier above 1.96. The
following figure describes the details. When using the Gaussian distribution, the 95% CI is always associated with a z-score of
1.96 (indicated by the horizontal dotted line), while if we use the t-distribution, the t-value for calculating the 95% CI starts
much higher, at around 2.1, and as the sample size increases, the critical value decreases and eventually becomes very
close to 1.96.
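This convergence is easy to verify in code; a short scipy sketch:

    # Sketch: the 95% CI multiplier from the t-distribution approaches 1.96 as n grows
    from scipy.stats import t

    for n in [5, 10, 30, 100, 1000]:
        print(n, round(t.ppf(0.975, df=n - 1), 3))
    # n = 10 gives ~2.262; by n = 1000 the multiplier is ~1.962, essentially the z of 1.96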
On the assumptions for the t-test
In order to conduct a t-test, there is the assumption that the underlying distribution must be approximately normal or
Gaussian. The Central Limit Theorem helps in this regard for large sample sizes, and allows the researcher to use
parametric methods, such as the t-test, even when the underlying distribution is somewhat non-normal, because the sample
means follow a normal distribution. However, in small sample sizes where the underlying distribution is non-normal, the
Central Limit Theorem does not apply easily. Therefore, there is a family of statistical tests used in hypothesis testing
known as non-parametric tests. These tests are essentially analogous to parametric tests, such as t-tests, paired t-tests, and
ANOVA, but before conducting the tests, the observed values are converted to ranks and the tests are based on these ranks,
not the actual observed values. The non-parametric analogues for the independent samples t-test, the paired t-test, and
ANOVA are the Wilcoxon Rank Sum test (also called the Mann-Whitney test), the Wilcoxon Signed Rank test, and the
Kruskal-Wallis test, respectively.
T-tests: Additional Examples and Explanation
The syllabus contains an explanation of the assumptions and uses of the two-sample t-test in biomedical studies. This
section will provide some supplementary examples of t-tests to help you understand the concept of a t-test and the methods
by which we use the t-test for hypothesis testing. These examples will walk you through conducting t-tests for two
biomedical studies, while focusing on the key concepts involved in hypothesis testing, and omitting the complex
mathematical details.
Please note that you are not responsible for knowing the detailed formula for the two-sample t-test presented below.
Recall that the steps for the t-test are as follows:
1. Establish α, the type I error rate. Typically α is 0.05, but can be any value, such as 0.01 or 0.10.
2. Conduct the t-test to obtain a t-score. The formula for a two-sample t-test is:
   t = (sample mean difference – ΔμHo) / SE(mean difference),
   where ΔμHo is the hypothesized population mean difference under the null hypothesis, H0, often 0, and SE(mean
   difference) is the standard error of the difference in means, which generally involves pooling the standard
   deviations and sample sizes. Knowledge of the details of those formulas for the pooled standard deviation and
   sample size is not required for this course.
3. Obtain degrees of freedom, which is just the total sample size, n, minus 2.
4. For the t-score and corresponding degrees of freedom, obtain a p-value either from a table or computer program.
5. If p < α, reject the null hypothesis, and if p > α, do not reject the null hypothesis.
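As a bridge to the worked example that follows, here is a sketch of these five steps in Python, using the Part A numbers below (the common SD of 16 is the simplified pooled value used in the example); scipy is assumed available.

    # Sketch: the five t-test steps in code, using the Part A numbers below
    from math import sqrt
    from scipy.stats import t

    alpha = 0.05                                   # step 1: establish alpha
    mean_trt, mean_pla = 207.0, 205.0              # baseline TC means (mg/dL)
    sd, n1, n2 = 16.0, 225, 240                    # simplified common SD; group sizes

    se = sd * sqrt(1 / n1 + 1 / n2)                # standard error of the mean difference
    t_score = ((mean_trt - mean_pla) - 0) / se     # step 2: hypothesized difference is 0
    df = n1 + n2 - 2                               # step 3: degrees of freedom = 463
    p = 2 * t.sf(abs(t_score), df)                 # step 4: two-sided p-value

    print(round(t_score, 2), round(p, 2))          # ~1.35 and ~0.18
    print("reject H0" if p < alpha else "do not reject H0")   # step 5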
Example: A randomized controlled trial of a new statin, Cholesteroff, was conducted to determine if the new drug reduces
total cholesterol in young adults. After randomization, each patient was assigned either a daily regimen of Cholesteroff or
placebo. We can assume that all bias was removed in the study, and that everybody followed their regimen perfectly.
Part A: Each person had their total cholesterol (TC) taken at baseline. The placebo group, n = 225, had a starting TC of 205
with a standard deviation of 14 mg/dL. The treatment group, n = 240, had a starting TC of 207, with a standard deviation
of 17 mg/dL. Were the groups comparable (the same) at baseline in terms of total cholesterol?
In this case, the null hypothesis, H0, is that the two groups have the same mean TC, μplacebo = μtrt.
The alternate hypothesis, HA, is that there is a difference in TC between the two groups, μplacebo ≠ μtrt
To test these hypotheses, we conduct a two-sample t-test using the steps outlined above.
1. First, we establish α. Let’s use 0.05.
2. Second, we calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = (sample mean difference – ΔμHo) / (SD × √(1/n1 + 1/n2))
     = ((207 – 205) – 0) / (16 × √(1/225 + 1/240)) = 1.35
3. Calculate degrees of freedom, d.f. = 225 + 240 – 2 = 463.
4. Obtain the corresponding p-value for a t-score of 1.35 and 463 degrees of freedom, which is 0.18.
5. Since p > α, we do not reject the null hypothesis. There is insufficient evidence to conclude that the difference we
   observe in mean TC between the placebo and treatment groups at baseline was not due to chance alone. In other
   words, we cannot rule out chance as the cause of the difference between the two groups.
Part B: The subjects are followed over a period of 4 months to determine whether or not mean TC was lower in the
Cholesteroff group compared to the placebo group.
We again establish null and alternate hypotheses.
H0: There is no difference in mean cholesterol at 4 months between those on Cholesteroff and those on placebo.
μplacebo = μtrt
HA: Mean total cholesterol at 4 months will be different comparing those on Cholesteroff and those on placebo.
μplacebo ≠ μtrt
Note: we are comparing those on Cholesteroff at 4 months and those on placebo at 4 months.
The results show that at 4 months after baseline, those on placebo had a mean TC of 203 with a standard deviation of 16
mg/dL. Those on Cholesteroff had a mean TC of 194, with a standard deviation of 20 mg/dL at 4 months.
We again conduct a two-sample t-test.
1. First, we again establish α. Let’s use 0.05.
2. Second, we calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = (sample mean difference – ΔμHo) / (SD × √(1/n1 + 1/n2))
     = ((194 – 203) – 0) / (18 × √(1/225 + 1/240)) = -5.39
3. Degrees of freedom = 463, same as above.
4. We then obtain the corresponding p-value for a t-score of -5.39 with 463 degrees of freedom, which is less than 0.0001.
5. Since p < α, we reject the null hypothesis. There is convincing evidence to suggest that the difference we
   observed in TC comparing the placebo to the treatment groups is not likely due to chance alone.
Let’s say that another team of researchers conducted the same study and got the same results, but, since they were poor
researchers, they were only able to recruit a sample size of 16 in each of the study arms.
We can conduct the test in Part B above using the new information.
1. First, we establish α. Let’s use 0.05 again.
2. Second, calculate the t-score.
   t = (sample mean difference – ΔμHo) / SE(mean difference) = ((194 – 203) – 0) / (18 × √(1/16 + 1/16)) = -1.41
3. Degrees of freedom = 30, which is 16 + 16 – 2.
4. We then obtain the corresponding p-value for a t-score of -1.41 with 30 degrees of freedom, which is 0.17.
5. Since p > α, we do not reject the null hypothesis. There is insufficient evidence to rule out chance. This may be
   because we do not have enough power to achieve statistical significance.
Dr. Kenneth Chui contributed to the writing of this set of course notes.