Lecture 7 - The Department of Statistics and Applied Probability, NUS

Hypothesis Testing
Testing your beliefs
Ultimately in most research, the aim is to investigate whether the data
supports a particular hypothesis, or whether there is evidence to reject this
hypothesis.
For example:
How do we know that male students tend to weigh more than
female students?
How do we know that smoking is associated with increased risk of lung
cancer?
How do we know that possessing a particular genetic profile means
someone is more likely to be obese than others with a different genetic
profile?
Data exploration and Statistical analysis
1. Data checking, identifying problems and characteristics
2. Understanding chance and uncertainty
3. How will the data for one attribute behave, in a theoretical framework?
4. The theoretical framework assumes complete information; we need to address uncertainties in real data
5. Testing your beliefs: do the data support what you think is true?

Data → Data exploration (categorical / numerical outcomes) → Model each outcome with a theoretical distribution → Estimation of parameters, quantifying uncertainty → Hypothesis testing → Parametric tests (t-tests, ANOVA, tests of proportions)
Hypothesis Testing
• Null hypothesis
A statement of status quo, or of no changes
• Alternative hypothesis
Hypothesis which the researcher wishes to investigate
• Commonly, the alternative hypothesis is first formulated,
and the null hypothesis is the negation of the alternative
hypothesis.
Pregnancy Test Kit
A woman buys a pregnancy test kit, and is interested to find
out whether she is pregnant.
The null hypothesis in this case (status quo), is that she is
not pregnant.
The alternative hypothesis (hypothesis of interest), is that
she is pregnant.
Test kit may show:
+ve: indicating there is evidence to suggest pregnancy
–ve: indicating lack of evidence to suggest pregnancy
Pregnancy Test Kit
The test kit may either be accurate, or inaccurate.
                         Actually pregnant          Actually not pregnant
Test kit shows +ve       Correct +ve diagnosis      Incorrect +ve diagnosis
Test kit shows –ve       Incorrect –ve diagnosis    Correct –ve diagnosis
Pregnancy Test Kit
Types of Errors

Type 1 Error (p-value): False +ve conclusion (+ve when the woman is in fact not pregnant)
Type 2 Error: False –ve conclusion (–ve when the woman is in fact pregnant)
Power (Sensitivity): True +ve conclusion (+ve when the woman is in fact pregnant)
Specificity: True –ve conclusion (–ve when the woman is in fact not pregnant)
P-values
• Probability of observing a false positive result; the threshold used is also known as the significance level of the test.
• If the p-value is small, we are more confident that the null hypothesis can be rejected.
• At a threshold of 0.05, we expect on average 1 false positive out of every 20 tests of a true null hypothesis.
• So if we perform a large study with 1 million variables, on average we expect about 50,000 variables to display p-values of < 0.05!
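To make this concrete, here is a minimal simulation sketch (Python with numpy; it assumes every null hypothesis is true, in which case p-values are uniformly distributed):

```python
import numpy as np

rng = np.random.default_rng(42)
n_tests, alpha = 1_000_000, 0.05

# Under a true null hypothesis, p-values are uniform on (0, 1)
p_values = rng.uniform(0, 1, n_tests)

false_positives = np.sum(p_values < alpha)
print(false_positives)  # roughly n_tests * alpha = 50,000
```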
Statistical tests for comparing means
One of the most common statistical tests in biomedical sciences is the comparison of averages.

Let’s revisit Example 1 from the previous lecture.

Example 1:
The Science Faculty is interested to compare the weights of male and female students in NUS.

So here we have the mean weights for male and female students. Can we compare these values, after accounting for the uncertainties in the estimation, and quantify the statistical evidence for observing a difference?
Recall we:
- Randomly sample 200 male students and 200 female students and
measure their weight.
- Calculate the mean weight of these 200 male students, and use this
quantity to estimate the mean weight of all the male students in NUS.
- Similarly calculate the mean weight of these 200 female students and use
this to estimate the mean weight of all the female students in NUS.
Test statistic
A numerical quantification of the amount of evidence against the null
hypothesis.
Usually takes the form of:

Test statistic = (Observed summary value − Hypothesized summary value) / (Standard error of the observed summary value)
Degrees of freedom
Number of independent observations that are allowed to take any values.
Degrees of freedom (df) = Number of independent observations (usually the sample size) − Number of estimated parameters
For example: In an assessment of whether the mean weight of 120 male
students in the Science faculty exceeds 70kg, we actually have 120
independent observations.
However, we need to estimate the mean weight of these 120 students, thus
the degrees of freedom remaining = 120 – 1 = 119.
Another way to think about it: upon collecting the weights of 120 students, IF I know the average weight, then only the weights of 119 students are free to ‘vary’, as the weight of the last person must take the specific value that yields the average weight I know of.
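A small numerical sketch of this intuition (Python with numpy; the weights are simulated placeholders, not real data):

```python
import numpy as np

rng = np.random.default_rng(7)
weights = rng.normal(70, 10, 120)   # 120 simulated weights (kg)
known_mean = weights.mean()

# With the mean known, only 119 weights are free to vary;
# the last one is forced to a specific value:
determined_last = 120 * known_mean - weights[:119].sum()
print(np.isclose(determined_last, weights[119]))  # True -> df = 120 - 1 = 119
```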
t-tests
1-Sample t-test
Useful if we are keen to compare a collection of values against a
hypothesized mean value.
For example:
Previous surveys from the 1980s found that the mean weight of male
students in the Science Faculty was 63kg. The lecturer of ST1232 believes
that this figure is likely to be an under-estimate for male students in 2010.
Design an experiment to test this hypothesis.
- Randomly sample 200 male students from the Science Faculty, and
measure their weights.
- Calculate the mean weight of these 200 students and the corresponding
standard error of the mean, and see whether this is significantly higher than
63kg.
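As a sketch of how this test could be run outside SPSS (Python with scipy.stats; the weights array is a simulated stand-in for the 200 measurements, and the one-sided `alternative` argument requires a recent scipy):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
weights = rng.normal(66, 9, 200)   # stand-in for the 200 sampled weights (kg)

# H0: mean weight = 63  vs  H1: mean weight > 63 (one-sided)
t_stat, p_value = stats.ttest_1samp(weights, popmean=63,
                                    alternative='greater')
print(t_stat, p_value)
```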
Identifying the hypotheses
In hypothesis testing, it is extremely important to identify the hypotheses that
are being tested, since this affects the calculation of the statistical evidence.
Null hypothesis, H0: Mean weight, μ = 63
Alternative hypothesis, H1: Mean weight, μ > 63

But there are two other possible alternative hypotheses, each giving a different estimate of the statistical evidence against the null hypothesis.

To see whether the weight of current students differs from 63kg:
H1: Mean weight, μ ≠ 63

To see whether current students are lighter than 63kg:
H1: Mean weight, μ < 63
One-sided versus two-sided tests
Two-sided alternative hypothesis:
Unbiased test, without assuming prior knowledge or expectation of the
direction of the effect size.
So the difference could be greater or smaller than the test value.
Remember the test statistic: the numerator (the value estimated from the data minus the value under the null hypothesis) measures how different the data are from the null hypothesis, while the denominator measures the uncertainty of the estimation.
Example of a two-sided alternative hypothesis: there is a difference between the heights of men and women.
One-sided alternative hypothesis:
Biased test, where the direction of the effect, if genuine, is known.
Examples: (i) men are taller than women; (ii) the weight loss pill successfully
reduces weight.
One-sided versus two-sided tests
[Figure: distributions of the test statistic, with shaded rejection regions for one-sided and two-sided tests]
Interpreting statistical evidence:
P-values are assessed by the probability found in the shaded areas: the
shaded areas represent the probability of obtaining a test statistic at least as
extreme as from the observed data, under the null hypothesis.
SPSS calculates the two-tailed p-value by default; we need to work a little harder to get the one-tailed p-value.
Converting a two-tailed p-value to a one-tailed p-value:
- If the observed effect (or test statistic) is in the same direction as the alternative hypothesis, halve the p-value.
- If the observed effect (or test statistic) is in the opposite direction to the alternative hypothesis, one-tailed p-value = 1 − 0.5 × (two-tailed p-value)
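These two rules are simple enough to capture in a small helper (a Python sketch; the function name and arguments are ours, not an SPSS facility):

```python
def one_tailed_p(two_tailed_p, observed_effect, h1_sign):
    """Convert a two-tailed p-value to a one-tailed p-value.

    observed_effect: the observed effect (or test statistic)
    h1_sign: +1 if H1 predicts a positive effect, -1 if negative
    """
    if observed_effect * h1_sign > 0:      # same direction as H1
        return two_tailed_p / 2
    return 1 - two_tailed_p / 2            # opposite direction to H1

print(one_tailed_p(0.03, -3.5, -1))  # 0.015
print(one_tailed_p(0.03, +3.5, -1))  # 0.985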
Example 1:
A pharmaceutical company is interested in testing a new weight-loss treatment, and recruited 120 volunteers to take part in a research trial. The weights of the participants are measured before and after taking the weight-loss treatment for the prescribed duration.
- Null hypothesis here is that the weight-loss treatment has no effect (or
average difference in weight = 0)
- Alternative hypothesis here is that the weight-loss treatment is effective (or
average difference in weight < 0)
Suppose the evidence obtained yields a two-sided p-value of 0.03, and the
average difference (after – before) is -3.5kg, what is the correct evidence for
the trial?
One-tailed p-value = 0.03 / 2 = 0.015 (since the direction of the observed effect is in the same direction as the alternative hypothesis)
If however, the average difference (after – before) is 3.5kg, the one-tailed p-value will instead be 1 – 0.03 / 2 = 0.985. This intuitively makes sense!
Back to t-tests
1-Sample t-test
Useful if we are keen to compare a collection of values against a
hypothesized mean value.
Null hypothesis: Mean = hypothesized value
ASSUMPTIONS
- Data is normally distributed (theoretical assumption)
- Data is symmetrically distributed (practical application)
- Observations made are all independent
Two independent samples t-test
For comparing the means between two groups.
Null hypothesis: Mean of group 1 = Mean of group 2
Or effectively: Difference in means = 0
Example: Comparing the weights between male and female students in
Science faculty.
ASSUMPTIONS
- Data is normally (symmetrically) distributed within each group
- Observations made within each group are all independent
- Observations are also independent across the groups
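A minimal sketch in Python (scipy.stats), with simulated weights standing in for the two samples; equal_var=False gives the Welch version that does not assume equal variances:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
male = rng.normal(68, 10, 200)     # simulated male weights (kg)
female = rng.normal(58, 8, 200)    # simulated female weights (kg)

# H0: mean(male) = mean(female), i.e. difference in means = 0
t_stat, p_value = stats.ttest_ind(male, female, equal_var=False)
print(t_stat, p_value)
```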
Paired-sample t-test
For comparing the difference within each pair of observations
Null hypothesis: No difference between the observations in each pairing
Or effectively: Difference within each pairing = 0
Example: Comparing the efficacy of a diet treatment, thus comparing the
weight of an individual before and after the treatment.
ASSUMPTIONS
- Difference within each pairing follows a Normal (symmetric) distribution
- Independence between pairs of observations
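A sketch of the paired version (Python, scipy.stats; data simulated). It also verifies numerically that the paired t-test is equivalent to a 1-sample t-test on the within-pair differences, which foreshadows the comparison below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
before = rng.normal(80, 12, 120)             # weight before treatment (kg)
after = before - rng.normal(3.5, 2.0, 120)   # weight after treatment (kg)

# H0: mean within-pair difference = 0
t_stat, p_value = stats.ttest_rel(after, before)

# Equivalent: a 1-sample t-test of the differences against 0
t2, p2 = stats.ttest_1samp(after - before, popmean=0)
print(np.isclose(t_stat, t2), np.isclose(p_value, p2))  # True True
```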
Two independent samples vs
paired-sample t-test
Is there any difference between these two? Mathematically they seem similar.

The main difference is in the calculation of the denominators:
- The two-independent-samples t-test takes into account the uncertainty of two separate sets of outcomes.
- The paired-sample t-test first calculates the difference within each pairing, then calculates the standard error of the resulting differences.
More than 2 groups
Suppose we are interested in comparing the means between three groups,
what can we do?
(A) Perform 3 sets of two-independent samples t-tests (between groups 1
and 2; groups 2 and 3; groups 1 and 3)
(B) Perform a ‘global’ test, checking whether the means are all the same.
Analysis of Variance
(ANOVA)
- A statistical test for comparing the means of multiple independent groups
- Test the null hypothesis that the means of ALL the groups are identical
Assumptions
- Data within each group is normally (symmetrically) distributed
- Independent observations within each group
- Independent observations between the groups
ANOVA
Analysis of variance (ANOVA) is an interesting name: we are effectively analysing the variance to decide whether there are any differences in the means!

Null hypothesis: Means of all the groups are identical
Alternative hypothesis: At least one of the groups has a different mean
So, observing a significant p-value in this instance means that at least one
group is different, but we don’t actually know which group that is!
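A minimal one-way ANOVA sketch (Python, scipy.stats.f_oneway; the three groups are simulated placeholders):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
group1 = rng.normal(60, 5, 40)
group2 = rng.normal(62, 5, 40)
group3 = rng.normal(65, 5, 40)

# H0: all group means are identical
# H1: at least one group has a different mean
f_stat, p_value = stats.f_oneway(group1, group2, group3)
print(f_stat, p_value)
```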
Practical Example
• Previous studies suggest that restricting caloric intake can increase life expectancy.
• Perform an experiment with mice, each randomly assigned to one of six diet treatments.
• Measure the time of death for each mouse (in months).
Experimental Design and Visualising the Data: [figures omitted]
Practical Example
Research Questions
• Is there any difference in life expectancy across the different diet treatments?
• If there is, which diet treatment contributes to this difference?
• Which diet treatment significantly increases life expectancy?
Practical Example
ANOVA
Q: Is there any difference in life expectancy across the
different diet treatments?
Consider the following null hypothesis:
The mean life expectancies of the six groups are all identical
Alternative hypothesis:
At least one group has a different mean life expectancy
Significant differences! But which treatment?
Multiple Comparisons
• We can compare every possible pair of treatments. DANGER!
• More tests → more chances of making a false judgement.
• Remember, a p-value threshold of 0.05 → 1 out of 20 judgements may be false.
• There are 15 possible pairings for the 6 treatment groups → very likely to make a false judgement!
Bonferroni Correction
• Make it harder to define a result as significant.
• By lowering the p-value threshold. But to lower by how
much?
Solution
Divide the threshold by the total number of tests performed. Thus instead of a critical threshold of 0.05, we now use a critical threshold of 0.05 / 15 ≈ 0.0033.
Bonferroni Correction
• A much preferred approach is to calculate the “Bonferroni
corrected p-value” instead
- Different p-value thresholds may be used, and difficult
to decide what the Bonferroni-corrected thresholds are.
- By calculating the Bonferroni-corrected p-value, it can
be up to the researcher / author / editor / reviewer to
decide what the global p-value threshold should be.
Bonferroni-corrected p-value
Multiply the obtained p-values by the number of tests performed.
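A minimal sketch of the calculation (plain Python; the cap at 1 is needed since a probability cannot exceed 1):

```python
def bonferroni_correct(p_values):
    """Bonferroni-corrected p-values: multiply each p-value by the
    number of tests performed, capping the result at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

print(bonferroni_correct([0.003, 0.02, 0.40]))  # [0.009, 0.06, 1.0]
```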
Post-Hoc Analysis

Multiple Comparisons
Dependent Variable: LIFETIME
Bonferroni

(I) GROUP  (J) GROUP  Mean Diff. (I-J)  Std. Error  Sig.   95% CI (Lower, Upper)
lopro      NN85            6.9945*      1.25652     .000   (  3.2803,  10.7086)
lopro      NP             12.2837*      1.30637     .000   (  8.4222,  16.1452)
lopro      NR40           -5.4310*      1.24086     .000   ( -9.0988,  -1.7631)
lopro      NR50           -2.6115       1.19355     .440   ( -6.1395,    .9166)
lopro      RR50           -3.2000       1.26207     .175   ( -6.9306,    .5306)
NN85       lopro          -6.9945*      1.25652     .000   (-10.7086,  -3.2803)
NN85       NP              5.2892*      1.30101     .001   (  1.4435,   9.1348)
NN85       NR40          -12.4254*      1.23521     .000   (-16.0766,  -8.7743)
NN85       NR50           -9.6060*      1.18768     .000   (-13.1166,  -6.0953)
NN85       RR50          -10.1945*      1.25652     .000   (-13.9086,  -6.4803)
NP         lopro         -12.2837*      1.30637     .000   (-16.1452,  -8.4222)
NP         NN85           -5.2892*      1.30101     .001   ( -9.1348,  -1.4435)
NP         NR40          -17.7146*      1.28588     .000   (-21.5156, -13.9137)
NP         NR50          -14.8951*      1.24030     .000   (-18.5613, -11.2289)
NP         RR50          -15.4837*      1.30637     .000   (-19.3452, -11.6222)
NR40       lopro           5.4310*      1.24086     .000   (  1.7631,   9.0988)
NR40       NN85           12.4254*      1.23521     .000   (  8.7743,  16.0766)
NR40       NP             17.7146*      1.28588     .000   ( 13.9137,  21.5156)
NR40       NR50            2.8195       1.17110     .249   (  -.6422,   6.2811)
NR40       RR50            2.2310       1.24086    1.000   ( -1.4369,   5.8988)
NR50       lopro           2.6115       1.19355     .440   (  -.9166,   6.1395)
NR50       NN85            9.6060*      1.18768     .000   (  6.0953,  13.1166)
NR50       NP             14.8951*      1.24030     .000   ( 11.2289,  18.5613)
NR50       NR40           -2.8195       1.17110     .249   ( -6.2811,    .6422)
NR50       RR50            -.5885       1.19355    1.000   ( -4.1166,   2.9395)
RR50       lopro           3.2000       1.26207     .175   (  -.5306,   6.9306)
RR50       NN85           10.1945*      1.25652     .000   (  6.4803,  13.9086)
RR50       NP             15.4837*      1.30637     .000   ( 11.6222,  19.3452)
RR50       NR40           -2.2310       1.24086    1.000   ( -5.8988,   1.4369)
RR50       NR50             .5885       1.19355    1.000   ( -2.9395,   4.1166)

*. The mean difference is significant at the .05 level.
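As a rough illustration of how a table like the one above could be approximated outside SPSS, here is a Python sketch that runs all 15 pairwise t-tests with Bonferroni-corrected p-values. Note this is a simplification: SPSS's Bonferroni post-hoc uses the pooled error term from the ANOVA, so the numbers will not match exactly, and the group means below are hypothetical stand-ins for the mice data:

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
# Hypothetical stand-in for the lifetimes (months) of the six diet groups
groups = {name: rng.normal(mu, 6, 50)
          for name, mu in [('lopro', 40), ('NN85', 33), ('NP', 28),
                           ('NR40', 45), ('NR50', 42), ('RR50', 43)]}

pairs = list(combinations(groups, 2))   # 15 pairs for 6 groups
m = len(pairs)
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b])
    print(a, b, min(1.0, p * m))        # Bonferroni-corrected p-value
```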
Questions
• So why don’t we perform the post-hoc analyses all the
time then?
• p-values: is there really a difference between 0.049 and 0.051?
• p-values and effect sizes: which is more informative, p-values or confidence intervals?
• Power (sensitivity) and specificity, can we attempt to
maximise both?
t-tests in SPSS
Consider the mathematics.xls dataset again.
1. The average mark of the mathematics exam before starting the omega 3 trial is 70. Is there any evidence that the marks after the omega 3 trial are higher than 70?
2. It is traditionally believed that male students tend to outperform female students in mathematics. Based on the marks before the start of the trial, is there any evidence in support of this hypothesis?
3. Is there any evidence that consuming omega 3 improves the performance
in the mathematics exam?
4. Is there any difference in the marks before the trial between the three
schools? If there is, which school exhibited the best performance?
5. Is there any difference in the omega 3 consumption between male and
female students?
Average marks after = 70?
It should be relatively straightforward to see that we need to perform a one-sample test of the mean value.
However, before we can decide on the use of a one-sample t-test, we need
to check whether the data is indeed symmetrically distributed (and to go
through the usual data exploratory procedures).
Recall there are at least two ways of doing this in SPSS:
1. Qualitative assessment using a histogram.
2. Quantitative assessment with a Shapiro-Wilk’s Test.
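A sketch of the quantitative check in Python (scipy.stats.shapiro; marks_after is a simulated stand-in for the column in mathematics.xls):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
marks_after = rng.normal(74, 9, 100)   # stand-in for the post-trial marks

w_stat, p_value = stats.shapiro(marks_after)
# A large p-value gives no evidence against normality, so the
# one-sample t-test assumption is reasonable
print(w_stat, p_value)
```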
Shapiro-Wilk’s test assesses whether the data is normally distributed.
H0: Data is normally distributed
H1: Data is not normally distributed

H0: Mean marks = 70
H1: Mean marks > 70 (1-tailed test)

Actual 2-tailed p-value = 3.28 × 10⁻⁵
1-tailed p-value = 1.64 × 10⁻⁵

Conclusion: There exists significant evidence that the marks after consuming omega 3 are greater than 70 (p-value = 1.64 × 10⁻⁵).
Males better than females?
It should be immediately clear that there are 2 groups (or “populations”)
here: one for male students, and one for female students.
Thus we can use the 2-independent samples t-test to compare the mean
marks for the two groups. However, as before, we need to assess whether
there is any violation of the normality or symmetrical distribution assumption.
Important: As there are two groups, we need to assess whether both groups
satisfy this assumption!
Assessing assumptions for 2-independent
samples t-test
H0: Mean marks for males = Mean marks for females
H1: Mean marks for males > Mean marks for females (1-tailed test)
Conclusion: There is no evidence (p-value = 0.77) that males perform better than females in the mathematics exam before the start of the omega 3 trial.
Which row to interpret?
Levene’s test for equality of variances:
H0: Variance for group 1 = Variance for group 2
H1: Variance for group 1 ≠ Variance for group 2

SPSS reports the 2-tailed p-value, which we need to convert to a 1-tailed p-value. As the mean difference = −0.935, which is in the opposite direction to H1:
1-tailed p-value = 1 − 0.5 × 0.460 = 0.77

However, regardless of what Levene’s test shows here, always go ahead and interpret the second row, which does not assume equal variances.
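The same workflow sketched in Python (scipy.stats; the marks are simulated stand-ins). Levene's test is shown for completeness, but following the advice above we interpret the Welch (unequal variances) result either way:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
male = rng.normal(68, 10, 60)      # stand-in for male pre-trial marks
female = rng.normal(69, 10, 60)    # stand-in for female pre-trial marks

# Levene's test: H0 equal variances vs H1 unequal variances
_, p_levene = stats.levene(male, female)

# Welch t-test, one-sided H1: males score higher than females
t_stat, p_one = stats.ttest_ind(male, female, equal_var=False,
                                alternative='greater')
print(p_levene, t_stat, p_one)
```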
Evidence that omega 3 improves performance?
Most appropriate analysis is to compare the marks after against the marks
before for the same individual.
Here the appropriate test is the paired-sample t-test.
H0: There is no difference between the marks before and after within each individual (or Difference = 0)
H1: The mark after taking omega 3 is higher than the mark before (or Difference > 0)
Actual 2-tailed p-value = 4.06 × 10⁻¹⁴
1-tailed p-value = 2.03 × 10⁻¹⁴, since the mean difference is in the same direction as the alternative hypothesis.

Conclusion: There is overwhelming evidence (p-value = 2.03 × 10⁻¹⁴) that the exam marks after the omega 3 trial are higher than the marks before the trial.
Is there any difference in the performance
across the 3 schools?
There are three groups whose average marks we want to compare. The use of an ANOVA should immediately come to mind for this purpose.
H0: There is no difference in the mean marks between the three schools.
H1: At least one school has a different mean mark when compared to the
rest of the schools.
Conclusion: There is no significant evidence (p-value = 0.063) of a difference in marks between the three schools (or at best, marginal evidence of a difference between schools 1 and 2).
Is there any difference in omega 3
consumption between males and females?
Feeling confident, we decided we can jump straight to performing the definitive
analysis without exploring the data.
So the appropriate test here is a 2-independent sample t-test.
Is this correct?
With histograms like these, there really isn’t a need to
perform the Shapiro-Wilk tests!
Parametric tests
It’s clear that any test that assumes the data to be symmetrically distributed will not be applicable for comparing omega 3 consumption, since this outcome is extremely right-skewed.
There is thus a need for statistical tests that do not explicitly require such assumptions on the distribution of the variable – these are known as non-parametric tests.
So far, the tests we have seen are considered parametric tests – they require assumptions on the distribution of the data to be satisfied before they can be correctly used.
REMEMBER! The computer does not recognise when a test is
inappropriate, you have to know!
Relationship between p-values and
confidence intervals
Let’s have a quick recap of all the valid analyses. Look at the p-values and the confidence intervals. Notice a trend?

- Significant at the 0.05 threshold ↔ 0 is not in the 95% CI
- Not significant at the 0.05 threshold ↔ 0 is in the 95% CI

Actually, even at the numerical EDA stage, the 95% confidence intervals about the means are already very informative for indicating whether there will be any differences between the two means in a formal comparison.
Relationship between p-values and
confidence intervals
Thus, if the test value under the null hypothesis falls within the 95% CI of the estimated quantity → the p-value from the hypothesis test will be > 0.05.
If the test value under the null hypothesis does not fall within the 95% CI → the p-value from the hypothesis test will be < 0.05.

Conversely,
If the p-value from the hypothesis test is < 0.05 → the 95% CI will not contain the test value under H0.
If the p-value from the hypothesis test is > 0.05 → the 95% CI is certain to contain the test value under H0.
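A quick numerical illustration of this duality (a Python sketch with simulated data; the agreement is exact for the t-test because the CI and the test are built from the same t distribution):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(71, 8, 100)
n, mean = len(x), x.mean()
se = x.std(ddof=1) / np.sqrt(n)

# 95% confidence interval for the mean
t_crit = stats.t.ppf(0.975, df=n - 1)
ci_low, ci_high = mean - t_crit * se, mean + t_crit * se

# Two-sided test of H0: mean = 70
_, p_value = stats.ttest_1samp(x, popmean=70)

# p < 0.05 exactly when 70 lies outside the 95% CI
print(p_value < 0.05, not (ci_low <= 70 <= ci_high))  # these agree
```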
Students should be able to
• understand the concept of the null and alternative hypotheses
• know what a test statistic is, and understand that it is always
calculated assuming the null hypothesis is true
• understand what is meant by power, sensitivity, specificity and
type I and type II errors
• understand and interpret a p-value from a hypothesis test
• know which statistical tests should be used
• know the assumptions for these statistical tests
• understand the relationship between a p-value and CI
• perform the appropriate analyses in SPSS and RExcel