Chapter 6 Inferential Methods for a Single Numerical Variable Section 6.1: Comparing Dependent Samples, i.e. Test on Differences Example 6.1 (Example 2.5, Rosner) A common symptom of otitis media (ear infection) in young children is the prolonged presence of fluid in the middle ear. The hypothesis has been proposed that babies who are breast-fed for at least 1 month may build up some immunity against the effects of the disease. A small study of 24 pairs of babies is set up, where the babies are matched on a one-to-one basis according to age, sex, socioeconomic status, and type of medications taken. One member of the matched pair is a breast-fed baby and the other was bottle-fed. Research Question: Is there a (statistically) significant difference in the average duration length of ear infection between the breast-fed and bottle-fed babies? Chapter 2: Identified which did better Chapter 6: Measuring the amount of difference When we analyzed this data in Chapter 2, we used the binomial distribution which allowed us to have only two outcomes. With this analysis, we simply identified which baby did better (i.e. the breast-fed or bottle-fed) for each pair. Here, we will measure the amount of difference for each pair and use this in our analysis. In particular, we can consider the size of the difference which is useful because a difference of -2 is VERY different than a difference of -24. 1 Summary Statistics for the Differences… Does the observed data tend to support or refute the research question of differences existing? Discuss. Identifying and removing the outliers… Boxplot 2 Descriptive statistics without the 3 significant outliers… Discuss any differences here from what was observed with all 24 observations. Recall, the research question, “Is there a (statistically) significant difference in the average duration length of ear infection between the breast-fed and bottle-fed babies?” The average for our data (without the outliers) is -6.3. This value will certainly change over repeated samples and to answer this question, we would like to know the average stays away from 0 (i.e. the no difference situation). 3 Tinkerplots Investigation Getting the appropriate data for taking repeated samples. Original data with Average Difference = -6.3 Adjusted data with Average Difference = 0 Note: Adjusted Data = Original Data + 6.3333 Setting up “spinner” or hat in Tinkerplots using adjusted data (to simulate no difference) Taking 100 repeated samples produces the following graph. The statistic of interest here is the average; thus, on each sample the average difference is obtained. Research Question: Is there a (statistically) significant difference in the average duration length of ear infection between the breast-fed and bottle-fed babies? The following graph would be used to obtain the p-value (for a two-side test). The number of dots that exceed -6.3 or positive 6.3 would be included in the calculation of the p-value. Here the p-value would be given by 6/100 = 0.06. 4 Formal Statistical Test in Excel Research Question: Is there a (statistical) significant difference in the average of ear infection between the breast-fed and bottle-fed babies? Completing the test in Excel… Determine p-value and make the statistical decision. 5 The Decision Rule: If the p-value is less than the error rate, then the data is said to support the research question. P-Value: ___________________ Statistical Decision: _________________ Conclusion – Writing a final statement in laymen’s terms Confidence Interval Concept of 95% confidence interval… 6 Wiki Entry for Confidence Interval for Mean Identify the parts of this interval and obtain the endpoints of the 95% confidence interval. Lower Endpoint: Upper Endpoint: Sketch this interval on the line below. _______________________________________________ Interpret the meaning of this 95% confidence interval. Getting this done in Excel 7 Example 6.1.2 For this example, consider the Vitamin Intake dataset on our course website. This purpose of this investigation is to compare the actual intake of calories (and certain vitamins/minerals) against their daily recommended intake. Answer the following research question using the data from the course website. Research Question: Is there enough statistical evidence to suggest, on average, there is a difference between the actual energy intake against their daily recommended intake? The Decision Rule: If the p-value is less than the error rate, then the data is said to support the research question. P-Value: ___________________ Statistical Decision: Data provides enough evidence for research question Data does not provide enough evidence for research question Conclusion – Writing a final statement in laymen’s terms Construct an approximate 95% confidence interval using the following formula. 𝐿𝑜𝑤𝑒𝑟 𝐸𝑛𝑑𝑝𝑜𝑖𝑛𝑡: 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝐷𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 − (2 ∗ 𝑈𝑝𝑝𝑒𝑟 𝐸𝑛𝑑𝑝𝑜𝑖𝑛𝑡: 𝐴𝑣𝑒𝑟𝑎𝑔𝑒 𝐷𝑖𝑓𝑓𝑒𝑟𝑒𝑛𝑐𝑒 + (2 ∗ 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 √𝐶𝑜𝑢𝑛𝑡 𝑆𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝐷𝑒𝑣𝑖𝑎𝑡𝑖𝑜𝑛 √𝐶𝑜𝑢𝑛𝑡 ) ) Provide, in context, an interpretation of this 95% confidence interval. 8 Section 6.2 Comparing a Single Numerical Variable Across Two Groups Consider the following study of students enrolled in an introductory statistics course. This data can be found in the StudentData.xlsx file. Again, simple summaries in Excel can be obtained using the PivotTable feature. To use the PivotTable feature, highlight all the data and select Insert > PivotTable > PivotChart as is shown here. Suppose the goal is to compare the number of text messages sent across the year in school. For example, do Freshmen send more/less/same number of text messages compared to Seniors. An initial set-up is given here. 9 The default summaries given by PivotTable is a count, here we’d like to compare the average, so right click on Count, select Value Field Settings and then select Average. 10 Consider the following changes to the default output given be Excel Deselected Blank and Other from the Year in School drop down arrow. Change the bar graph to a line graph, which more concisely shows the trends in the averages over Year in School. One additional fix to this graph would be to put the Year in School in the correct order. Excel’s default is to sort alphabetical, which is not necessarily appropriate for this situation as the Year in School variable has a relevant order, i.e. Freshmen, Sophomores, Juniors, and Seniors. To change the order of the categories, click on the appropriate label for the row to be moved, right click and select Move. In this menu, you can move this row up or down to rearrange the order. 11 The following table and graph showing the trend in average number of text messages sent for the various levels in Year in School. Questions 1. What trends are prevalent in this table and graph? 2. Your friend makes the following statement, “The reason the average text message sent is highest for sophomores is because there were a lot of sophomores in the data and the reason seniors is lowest is because there were very few seniors in the data.” What, if anything, is wrong with this statement? 3. Why might a statistician also want to consider the amount of standard deviation in the number of text messages sent? That is, what is gained by considering the standard deviations in the number of text messages? 12 Example 6.2 For this example, consider a comparison in the amount of time spent on Facebook between Senior and Freshmen. This subset of this data is given on the FRSR worksheet in Excel. Research Question: Is there a difference in the average amount of time spent on Facebook between Freshmen and Seniors? Next, getting our analyses started for comparing Facebook time / day across Freshmen and Seniors. 13 Summaries in Excel, averages, standard deviations, and counts are standard summaries… Graphical Comparisons 14 Getting the p-value in Excel… Research Question: Is there a difference in the average amount of time spent on Facebook between Freshmen and Seniors? Obtain P-value: P-value: __________________ Decision: The Decision Rule: If the p-value is less than the error rate, then the data is said to support the research question. Data provides enough evidence for the research question Data does not provide enough evidence for the research question Conclusion: Write a conclusion using everyday language for this research question. 15 Using a 95% Confidence Interval to measure the amount of difference between the two averages. JMP software output for 95% Confidence Interval Visual Depiction of 95% Confidence Interval: Interpret the 95% confidence interval using everyday language. 16 Repeated Sampling Concept, again…. Conceptual view of 95% confidence interval 17