Objectives 7.2: Inference for comparing means of two populations where the samples are independent
- Two-sample t significance test (we give three examples)
- Two-sample t confidence interval
- http://onlinestatbook.com/2/tests_of_means/difference_means.html

Standard errors
- We have learnt that standard errors are crucial in constructing both confidence intervals and statistical tests. Do not get mixed up between the standard error of an estimator and the standard deviation of the sample.
- The precision of the sample mean as an estimate of the population mean (roughly, the average distance between the estimate and the population mean) is measured by its standard error, which is

    s.e. = s / sqrt(n) = (amount of variation in the sample) / (square root of the sample size)

  - You can imagine that the unknown population mean should be in some proximity to the known sample mean. The proximity is measured by the standard error.
  - The sample mean tends to get closer (in proximity/precision) to the population mean as you increase the sample size.
  - As we continue with the course, the standard errors will become more complex, but the underlying ideas are the same.

Comparisons everywhere!
- You will see comparisons being made all over the place. Just look at some of the products you have at home:
- "Dentex floss picks are clinically proven to remove more plaque than regular floss."
- What does this mean – how on earth do they prove this?
- This is an example where the results are proved statistically. It is done via clinical trials, by collecting data:
  - Aim: to see if it is possible to show that, on average, the amount of plaque removed using floss picks is more than the average amount of plaque removed using regular floss. They state their hypotheses as H0: µFP - µF ≤ 0 against HA: µFP - µF > 0, where µFP = mean plaque removed using a floss pick and µF = mean plaque removed using regular floss.
- A simple random sample of individuals is chosen and the amount of plaque removed using both regular floss and floss picks is measured (we will discuss the grouping a little later on). It is found that, based on 100 individuals, the average amount removed using floss picks is 3mg whereas the average amount removed using regular floss is 2.8mg.
  - Does this automatically prove that floss picks are better? No it doesn't. The product's makers cannot use this as their statistical proof, because this difference could be explained by random chance (variation between samples).
  - This is when they need to know the reliability of the estimator (measured by the standard error) and use a statistical test to calculate the chance of observing a difference of 3mg - 2.8mg by random chance. The probability of such a difference under the null, that there is no difference, is calculated. This is the p-value.
  - If the p-value is large (over a pre-determined significance level, say 5%), then we cannot reject the null. This means that there isn't any evidence in the collected data that floss picks are working.
- On the other hand, if the p-value is less than the pre-determined significance level, then we can say there is evidence to suggest that floss picks are removing more plaque than regular floss. In other words, because the p-value is small, the data does not appear to be consistent with the null hypothesis (the chance of obtaining a sample which behaves in this way, when globally there is no difference between the floss pick and regular floss, is small). Therefore we conclude that the null is incorrect and we choose the alternative hypothesis.
- It would appear that Dentex found a statistically significant difference and thus can make their claims. Their p-value is small. (A rough sketch of the calculation behind such a p-value follows.)
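The lecture only quotes the two sample means (3mg and 2.8mg) and the group size (100 individuals), so the standard deviations used below are made up. The sketch (Python, using scipy) is purely illustrative: it shows how a one-sided p-value of this kind is computed from summary statistics, using the large-sample normal approximation.

  # Minimal sketch of the one-sided test H0: mu_FP - mu_F <= 0 vs HA: mu_FP - mu_F > 0.
  # The standard deviations below are HYPOTHETICAL -- the slides only report the means.
  from math import sqrt
  from scipy.stats import norm

  mean_fp, mean_f = 3.0, 2.8      # sample means (mg) from the slides
  sd_fp, sd_f = 0.6, 0.7          # hypothetical sample standard deviations (mg)
  n_fp = n_f = 100                # 100 individuals per group

  se = sqrt(sd_fp**2 / n_fp + sd_f**2 / n_f)   # standard error of the difference
  z = (mean_fp - mean_f) / se                  # z-transform under the null of no difference
  p_value = norm.sf(z)                         # area to the right of z (one-sided p-value)

  print(f"difference = {mean_fp - mean_f:.2f} mg, se = {se:.3f}, z = {z:.2f}, p = {p_value:.4f}")

The p-value depends entirely on the standard deviations and sample sizes, which is why the 3mg vs 2.8mg comparison on its own proves nothing.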
The nitty gritty of what is done
- The above conclusions require a standard error, which needs to be calculated from the data. However, the standard error and the exact method that is used to do the test depend on how the data was collected.

Designing the floss study
- There are two ways the data could have been collected:
- Either one simple random sample of individuals is taken (a hundred in the example discussed). Each individual (probably on separate days, to ensure independence) is asked to use a floss pick and regular floss, and the amount of plaque removed under each treatment is measured. This is an example of a matched pair study, where the same individual is used in both treatments. In this case a matched paired t-test is done – which we covered in the previous lectures.
  - The advantage of this design is that it avoids confounding, because the same individual is used for both treatments.
  - The disadvantage is that it takes time and effort, because we need to do it over several days.
- Alternatively, a simple random sample is taken and randomly split into two groups. Some are asked to use regular floss and the others are asked to use floss picks. The individuals in both groups are completely independent of each other and there isn't any matching.
  - The advantage is that it is quick to do this experiment.
  - The disadvantage is larger standard errors.

Independent samples inference
- The purpose of most studies is to compare the effects of different treatments or conditions.
- Using matching to design an experiment is a very useful way to make comparisons between populations, since it tends to reduce confounding factors.
- If we have reason to believe that there is matching between subjects, then we should use a matched paired t-test. However, in many situations it is impossible to have any matching between the samples.
- Suppose we want to see whether a drug works. In this case we need a SRS of patients to give the drug and another SRS to give the placebo.
- In situations like these the samples are completely independent of each other – there isn't any matching. We then need to use an independent sample t-test, which we describe below.
- Often the subjects are observed separately under the different conditions, resulting in samples that are independent. That is, the subjects of each sample are obtained and observed separately from, and without any regard to, the subjects of the other samples.
- As in the matched pairs design, subjects should be randomized – assigned to the samples at random – when the study is an experiment.
- By the end of the class you should be able to identify which test to apply given the situation.
  - You should look to see if there is any matching in the data; if there is matching, never do an independent sample t-test (this will give the wrong standard errors and can lead to unreliable results).
  - If the samples appear to be completely independent of each other, use an independent sample t-test. (A short code sketch contrasting the two tests follows.)
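To make the distinction concrete, here is a minimal sketch (Python, using scipy) contrasting the two procedures. The data are made up purely for illustration; the point is only that paired data go through a paired t-test, while two unrelated groups go through an independent (unpooled/Welch) two-sample t-test.

  # Hypothetical data, for illustration only.
  from scipy import stats

  # Matched pairs: the SAME five subjects measured under both treatments.
  floss_pick    = [3.1, 2.7, 3.4, 2.9, 3.2]
  regular_floss = [2.8, 2.6, 3.0, 2.7, 3.1]
  t_paired, p_paired = stats.ttest_rel(floss_pick, regular_floss)   # paired t-test

  # Independent samples: two DIFFERENT groups of subjects, no matching.
  group_pick  = [3.1, 2.7, 3.4, 2.9, 3.2]
  group_floss = [2.9, 2.5, 3.1, 2.6, 3.0]
  t_ind, p_ind = stats.ttest_ind(group_pick, group_floss, equal_var=False)  # unpooled two-sample t-test

  print(f"paired:      t = {t_paired:.2f}, two-sided p = {p_paired:.3f}")
  print(f"independent: t = {t_ind:.2f}, two-sided p = {p_ind:.3f}")

Note that equal_var=False corresponds to unchecking "pooled variance" in Statcrunch: the two sample variances are kept separate in the standard error.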
Example 1: Heights
- To motivate the independent sample t-test, we consider a problem that we already know the answer to: in general, do male students tend to be taller than female students?
- In terms of a hypothesis test, we want to see if there is evidence to support H0: µM - µF ≤ 0 against HA: µM - µF > 0.
  - A matched design is possible, by randomly sampling male and female student siblings. But it is usually not feasible to obtain this data. In addition, we would exclude the sub-population of people with same-sex or no siblings.
  - Instead, a random sample of students was drawn and an independent sample t-test is done.
  - Statcrunch instructions: Stat -> T-stat -> Two Sample -> With data. Then place the relevant columns in each box and uncheck the box that says pooled variance. You have the option of doing a test (one or two sided) or constructing a confidence interval.
- In this sample there were 27 males and 37 females; there is clearly no matching. The difference in sample means is 0.45 feet. In this data set, the boxplot shows that female heights tend to be less than male heights.
- To test H0: µM - µF ≤ 0 against HA: µM - µF > 0 we do the test in Statcrunch. We see that the p-value is less than 0.01% (we do the test at the 5% level), which means there is strong evidence to suggest that males are on average taller than females. The t-value is

    t = (0.466 - 0) / 0.064 = 7.27

- We can use the same output to construct a 99% confidence interval for the mean difference. The only difference is that the degrees of freedom is unusual – it is 48.29. However, we do exactly the same as before: we either look up tables (provided by me in the exam paper) or use software such as Statcrunch. The output gives

    [0.466 ± 2.68 × 0.064] = [0.29, 0.64]

  Thus with 99% confidence we believe the mean difference lies between 0.29 and 0.64 feet.

Example 2: Diets
- We want to know whether there is any difference between two different diets. 20 randomly sampled people are randomly placed into two groups of 10. The first group goes on Diet 1 and the second group on Diet 2. The weight loss for each group is given below.
- Superficially there appears to be a matching in the data. Don't be fooled by this: the people in both groups were randomly allocated (we can see this from how the data was collected) and there is no matching between, say, 2.9 and 3.5. Thus we need to use an independent sample t-procedure.
- As we have no reason to believe one diet is better than another, our hypothesis of interest is H0: µ1 - µ2 = 0 against HA: µ1 - µ2 ≠ 0.
- The boxplot gives the impression that there are differences between the groups, but are these statistically significant, or can they be explained by sampling variation?
- The 95% confidence interval is [-2.23, 0.598]. This tells us that with 95% confidence the mean difference between the diets is somewhere in this interval.
- If we test the hypothesis H0: µ1 - µ2 = 0 against HA: µ1 - µ2 ≠ 0, the t-transform is

    t = (-0.82 - 0) / 0.6692 = -1.22

- Using Statcrunch we see that the smaller tail area is the area to the LEFT of -1.22, which is 12%. Thus the p-value for the two-sided test is 24%.
- From the data there is no evidence to suggest there is any difference between the means of the diets. In other words, there is no evidence to suggest a difference in weight loss between the two diets. (The sketch below reproduces these numbers from the quoted difference and standard error.)
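The p-value and confidence interval for the diet example can be reproduced from the reported difference in means (-0.82) and its standard error (0.6692). The Welch degrees of freedom come from the Statcrunch output, which is not reproduced here; the value of 17 below is an assumption used only to show the pattern of the calculation, and with it the results agree with the quoted 24% p-value and interval [-2.23, 0.598] up to rounding.

  # Sketch: two-sided p-value and 95% CI from summary numbers (diet example).
  from scipy.stats import t

  diff = -0.82      # difference in sample means (Diet 1 - Diet 2), from the slides
  se   = 0.6692     # standard error of the difference, from the slides
  df   = 17         # ASSUMED Welch degrees of freedom (read it off the software output in practice)

  t_stat = (diff - 0) / se                 # t-transform under H0: mu1 - mu2 = 0
  p_two  = 2 * t.cdf(-abs(t_stat), df)     # two-sided p-value: twice the smaller tail area
  t_star = t.ppf(0.975, df)                # critical value for a 95% interval
  ci     = (diff - t_star * se, diff + t_star * se)

  print(f"t = {t_stat:.2f}, two-sided p = {p_two:.2f}, 95% CI = ({ci[0]:.2f}, {ci[1]:.2f})")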
Example 3: Does calcium interact with iron absorption?
- It is believed that too much calcium in a diet can reduce the absorption of iron. To test this, 20 randomly sampled people were put into two groups of 10. One group was given a calcium-high diet and their iron absorption recorded. The other group was given a calcium-low diet and their iron absorption recorded. The difference from their previous level is given below (this is why you see some negative numbers).
- The data and summary statistics are given below. We observe that those in the calcium-low group absorb more iron; is this statistically significant?
- The hypothesis of interest is H0: µCH - µCL ≥ 0 against HA: µCH - µCL < 0.
- The hypothesis given in the output above is the opposite of what we want. However, from this output we immediately see that the p-value for H0: µCH - µCL ≥ 0 against HA: µCH - µCL < 0 is the area to the LEFT of -3.19, which is 1 - 0.9974 = 0.26%.
- As this p-value is less than 5%, there is evidence to reject the null and conclude that high calcium decreases iron absorption (compared with low calcium).
- The 95% confidence interval for the mean difference is [-1.991 ± 0.623 × 2.1].

Example 4: Calf treatments
- Comparing the weights of calves under different treatments.
- We start by seeing if there is evidence to suggest there is a difference between treatments A and B. This means we are testing H0: µA - µB = 0 against HA: µA - µB ≠ 0.
- We use the independent sample t-test, as both samples are completely independent of each other (the calves were randomly allocated to each group). We have to be a bit wary, as the sample size is small, so using the t-distribution may not give completely reliable p-values.
- Note: To analyze the calf data in Statcrunch you need to split each group into its own column. To do this go to Data -> Arrange -> Split -> select the column you want to analyze (for example Wt 8) and select the group you want (for example TRT).

Treatment A vs D
- You may have thought the conclusions of the previous test were quite clear, since the sample means of 138.9 and 139.54 are quite close – but the most important factor is that the standard error is quite large. The closeness of the sample means together with the large standard error meant that it is quite easy to explain this difference by random chance, and there is no evidence to suggest there is a true difference in the populations.
- From the summary statistics, the difference between treatments A and D appears quite large (7.7); can this difference be explained by random chance? We test the hypothesis H0: µA - µD = 0 against HA: µA - µD ≠ 0.
- The mean difference may be -7.7, but the p-value is 34%. This tells us there is over a 1/3 chance of observing a difference of 7.7 in the sample means when there is in fact no difference in the population means. This is quite large – over the 5% significance level – so there is no evidence to reject the null.
- We now construct a 95% confidence interval. To do this we use Statcrunch to find the critical value of a t-distribution with 19.09 df. The 95% confidence interval for the difference in mean weights for the treatments is

    [-7.7 - 2.09 × 8.09, -7.7 + 2.09 × 8.09] = [-24, 9.2]

- This is an interval where we believe the mean difference should lie – and it explains why we were not able to reject the null, despite 7.7 being subjectively large. The reason this interval is wide is that the standard error is large, due to the small sample size and the large standard deviation of calf weights. (A short sketch reproducing this interval follows.)
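The interval and p-value above can be reproduced from the quoted numbers: the difference in sample means (-7.7), its standard error (8.09) and the Welch degrees of freedom (19.09). A minimal sketch:

  # Sketch: 95% CI and two-sided p-value for treatment A vs D, from the quoted summary numbers.
  from scipy.stats import t

  diff = -7.7     # difference in sample mean weights (A - D)
  se   = 8.09     # standard error of the difference
  df   = 19.09    # Welch degrees of freedom (from Statcrunch)

  t_star = t.ppf(0.975, df)                       # critical value, roughly 2.09
  ci = (diff - t_star * se, diff + t_star * se)   # roughly [-24, 9.2]

  t_stat = diff / se                              # t-transform under H0: mu_A - mu_D = 0
  p_two  = 2 * t.cdf(-abs(t_stat), df)            # roughly 34%

  print(f"t* = {t_star:.2f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f}), t = {t_stat:.2f}, p = {p_two:.2f}")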
The idea: the difference in sample means
- We illustrate the idea with the female and male height example.
- From sample to sample, the difference in sample means X̄M - X̄F will vary.
- If the sample sizes are large enough, X̄M - X̄F will have a normal distribution (thanks to the central limit theorem).
- The normal distribution will be centered about the true mean difference µM - µF (population male mean minus population female mean), but will have a more complicated standard error:

    sqrt( σM²/27 + σF²/34 )

  where σM = standard deviation of male heights and σF = standard deviation of female heights.
- Therefore, just like in the one-sample case, in order to do the test we simply take the z-transform under the null that the mean male and female heights are the same (µM - µF = 0):

    z = (5.91 - 5.46) / sqrt( σM²/27 + σF²/34 )

  - At this point we encounter a problem: we do not know the population standard deviations σM and σF.
  - But we see from the summary statistics that we do have estimates of them. Thus we can replace the true population standard deviations by their estimates and obtain the transform

      t = (5.91 - 5.46) / sqrt( 0.27²/27 + 0.21²/34 )

The distribution of this ratio?
- Having exchanged the unknown true standard deviations for their estimators (calculated from the data), it seems reasonable to suppose that extra variability has been added to this ratio, and we need to correct for it by changing from the normal distribution to another distribution. Previously, in the one-sample case, the new distribution which took account of this extra variability was the t-distribution.
- In the two-sample case, the ratio

    t = (X̄M - X̄F) / sqrt( sM²/27 + sF²/34 )

  has approximately a t-distribution with a very strange number of degrees of freedom:

    df = ( s1²/n1 + s2²/n2 )² / [ (1/(n1-1)) (s1²/n1)² + (1/(n2-1)) (s2²/n2)² ]

  This is why using software is important – you don't want to calculate this stuff by hand! (A short code sketch at the end of this section does it for the heights example.)
- We are testing H0: µM - µF = 0 against HA: µM - µF > 0 and have the t-transform

    t = (5.91 - 5.46) / sqrt( 0.27²/27 + 0.21²/34 ) = 7.11,

  which we know has 48.045 degrees of freedom. Going to Statcrunch -> Stat -> Calculators -> T, we see that the area to the right of 7.11 for a t-distribution with 48.045 degrees of freedom is tiny. So at both the 5% and the 1% significance level we would reject the null. This means there is plenty of evidence to reject the null and conclude that the mean height of males is greater than that of females.
- Remember: if the sample sizes are both over 15, and the data are not too skewed, using the t-distribution is reasonable.
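Putting the pieces together for the heights example, the sketch below computes the standard error, the t-transform and the Welch degrees of freedom directly from the summary statistics quoted above (means 5.91 and 5.46, standard deviations 0.27 and 0.21, sample sizes 27 and 34). Because those summary values are rounded, the results agree with the slide values (t ≈ 7.11, df ≈ 48) only approximately.

  # Sketch: unpooled (Welch) two-sample t statistic and degrees of freedom, heights example.
  from math import sqrt
  from scipy.stats import t

  mean_m, sd_m, n_m = 5.91, 0.27, 27   # male summary statistics (feet)
  mean_f, sd_f, n_f = 5.46, 0.21, 34   # female summary statistics (feet)

  se = sqrt(sd_m**2 / n_m + sd_f**2 / n_f)   # standard error of the difference
  t_stat = (mean_m - mean_f) / se            # t-transform under H0: mu_M - mu_F = 0

  # The "strange" degrees of freedom formula from the slide
  num = (sd_m**2 / n_m + sd_f**2 / n_f) ** 2
  den = (sd_m**2 / n_m) ** 2 / (n_m - 1) + (sd_f**2 / n_f) ** 2 / (n_f - 1)
  df = num / den

  p_value = t.sf(t_stat, df)                 # area to the right of t (HA: mu_M - mu_F > 0)
  print(f"se = {se:.4f}, t = {t_stat:.2f}, df = {df:.1f}, one-sided p = {p_value:.2g}")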
Summary of analysis: significant effect
- Remember: significance means the evidence in the data is sufficient to reject the null hypothesis (at our stated level α).
- Only data, and the statistics we calculate from the data, can be statistically "significant". We can say that the sample means are "significantly different" or that the observed effect is "significant". But the conclusion about the population means is simply "they are different."
- The observed effect of 0.46 between male and female heights is significant, so we conclude that the true effect µM - µF is greater than zero.
- Having made this conclusion, or even if we have not, we can always estimate the difference using the confidence interval [0.33, 0.58].

Standard errors
- In the one-sample case the standard error is

    s (standard deviation of the population) / sqrt( n (sample size) ) = sqrt( s²/n )

- In the independent two-sample case the standard error is

    sqrt( s1²/n + s2²/m )

  where s1² = variance of population one, s2² = variance of population two, and n, m are the two sample sizes.
- These two standard errors are for different situations, but the ideas are the same. Remember that a smaller standard error leads to more reliable estimators. Therefore, if we are designing the experiment and want to decrease the standard error, we observe that:
  - For the one-sample case, we can decrease the standard error by increasing the sample size (it is usually impossible to decrease the standard deviation).
  - For the two-sample case, we can decrease the standard error by increasing the size of both samples (again, it is usually impossible to decrease the standard deviations of the populations).

Choosing the sample size
- We now consider how to distribute the sample sizes in the case that the standard deviations for both samples are about the same. In this case the standard error is

    sqrt( s²/n + s²/m ) = s × sqrt( 1/n + 1/m )

- Remember the standard deviation is fixed; we cannot change this value.
- Suppose that we only have enough funds to include 200 subjects in our experiment. How should we distribute them amongst the two groups? (A small sketch comparing the two splits follows this list.)
  - It makes no sense to have one subject in group 1 and 199 in group 2. For example, if we are comparing male and female heights, this would be using one male height to estimate the mean height of males and 199 female heights to estimate the mean height of females. Clearly this is wrong, and we can understand why from the standard error, which is

      s × sqrt( 1/1 + 1/199 ) = 1.002s

  - On the other hand, if we distribute them evenly, 100 and 100, the standard error is a lot smaller:

      s × sqrt( 1/100 + 1/100 ) = 0.141s
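A small sketch of this comparison, with the common standard deviation s factored out. The splits are the ones used above; any other way of dividing the 200 subjects can be tried the same way.

  # Sketch: two-sample standard error (in units of the common sd s) for different splits of 200 subjects.
  from math import sqrt

  def se_factor(n, m):
      """Standard error of the difference in means, divided by the common sd s."""
      return sqrt(1 / n + 1 / m)

  print(f"split   1/199: se = {se_factor(1, 199):.4f} * s")   # the slide rounds this to 1.002s
  print(f"split 100/100: se = {se_factor(100, 100):.4f} * s") # about 0.141s

The even split gives a standard error roughly seven times smaller for the same total cost, which is why balanced group sizes are preferred when the standard deviations are similar.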
Which type of test? One sample, paired samples or two independent samples?
For each scenario, decide which procedure applies:
- Comparing the vitamin content of bread immediately after baking vs. 3 days later (the same loaves are measured on day one and 3 days later).
- Is blood pressure altered by use of an oral contraceptive? Compare a group of women not using an oral contraceptive with a group taking it.
- Comparing the vitamin content of bread immediately after baking vs. 3 days later (tests made on independent loaves).
- Review insurance records for the dollar amount paid after fire damage in houses equipped with a fire extinguisher vs. houses without one. Was there a difference in the average dollar amount paid?
- Average fuel efficiency for 2005 vehicles is 21 miles per gallon. Is average fuel efficiency higher in the new generation "green vehicles"?

Cautions about the two-sample t-test or interval
- Using the correct standard error and degrees of freedom is critical.
- As in the one-sample t-test, the method assumes simple random samples.
- Likewise, it also assumes the populations have normal distributions.
- Skewness and outliers can make the methods inaccurate (that is, having a confidence/significance level other than what it is supposed to be). The larger the sample sizes, the less this is a problem. It is also less of a problem if the populations have similar skewness and the two samples are close to the same size.
- "Significant effect" merely means we have sufficient evidence to say the two true means are different. It does not explain why they are different or how meaningful/important the difference is.
- A confidence interval is needed to determine how big the effect is.

Summary: distribution of two sample means
- In order to do statistical inference, we must know a few things about the sampling distribution of our statistic.
- The sampling distribution of x̄1 - x̄2 has standard deviation

    sqrt( σ1²/n1 + σ2²/n2 )

  (Mathematically, the variance of the difference is the sum of the variances of the two sample means.)
- This is estimated by the standard error

    SE = sqrt( s1²/n1 + s2²/n2 )

- If the sample sizes are both over 15, and the data are not too skewed, using the t-distribution is reasonable.
- The two-sample t statistic is then

    t = ( (x̄1 - x̄2) - (µ1 - µ2) ) / sqrt( s1²/n1 + s2²/n2 )

- This statistic has an approximate t-distribution on which we will base our inferences. But the degrees of freedom is complicated…

Two-sample t confidence interval
- Recall that we have two independent samples and we use the difference between the sample averages (x̄1 - x̄2) to estimate (µ1 - µ2). This estimate has standard error

    SE = sqrt( s1²/n1 + s2²/n2 )

- The margin of error for a confidence interval for µ1 - µ2 is

    m = t* × sqrt( s1²/n1 + s2²/n2 ) = t* × SE

  where t* is found using the computer.
- The confidence interval is then computed as (x̄1 - x̄2) ± m.
- The interpretation of "confidence" is the same as before: it is the proportion of possible samples for which the method leads to a true statement about the parameters.

Two-sample t significance test
- The null hypothesis is that both population means µ1 and µ2 are equal, and thus their difference is equal to zero: H0: µ1 = µ2 ⇔ H0: µ1 - µ2 = 0.
- Either a one-sided or a two-sided alternative hypothesis can be tested.
- Using the value (µ1 - µ2) = 0 given in H0, the test statistic becomes

    t = ( (x̄1 - x̄2) - 0 ) / sqrt( s1²/n1 + s2²/n2 )

- To find the P-value, we look up the appropriate probability of the t-distribution using the df given by Statcrunch or me.

Summary for testing μ1 = μ2 with independent samples
- The hypotheses are identified before collecting/observing data.
- To test the null hypothesis H0: µ1 = µ2, use

    t = (x̄1 - x̄2) / sqrt( s1²/n1 + s2²/n2 )

- The P-value is obtained from the t-distribution (or t-table) with the unpooled degrees of freedom (computed by software).
- For a one-sided test with Ha: µ1 < µ2, P-value = area to the left of t.
- For a one-sided test with Ha: µ1 > µ2, P-value = area to the right of t.
- For a two-sided test with Ha: µ1 ≠ µ2, P-value = twice the value for a one-sided test.
- If P-value < α then H0 is rejected and Ha is accepted. Otherwise, H0 is not rejected, even if the evidence seems to prefer Ha.
- Report the P-value as well as your conclusion.
- You must decide what α you will use before the study, or else it is meaningless.

Summary for making a confidence interval for μ1 − μ2 with independent samples
- The single-value estimate is x̄1 - x̄2. This has standard error

    sqrt( s1²/n1 + s2²/n2 )

- The margin of error for an interval with confidence level C is

    m = t* × sqrt( s1²/n1 + s2²/n2 )

  where t* is the critical value for level C.
- The confidence interval is then (x̄1 - x̄2) ± m.
- You must decide what C you will use before the study, or else it is meaningless.
- For both hypothesis tests and confidence intervals, the key is to use the correct standard error (which depends on what is being estimated and how the data are obtained).

Statistics in the media
- Look at this article and the data it describes: http://www.economist.com/news/science-and-technology/21676754-curious-result-hints-possibility-dementia-caused-fungal
- What is the data that Dr. Carrasco has?
- If we did an independent sample t-test to see whether those with Alzheimer's had more fungal cells than those who did not have Alzheimer's, what would be the p-value (give a rough estimate)?

Accompanying problems associated with this chapter
- Quiz 14
- Homework 7 (Questions 5, 6 and 7)