Notes

Chapter 6
Inferential Methods for a Single Numerical Variable
Section 6.1: Comparing Dependent Samples, i.e. Test on Differences
Example 6.1 (Example 2.5, Rosner) A common symptom of otitis media (ear infection) in young children
is the prolonged presence of fluid in the middle ear. The hypothesis has been proposed that babies who
are breast-fed for at least 1 month may build up some immunity against the effects of the disease. A
small study of 24 pairs of babies is set up, where the babies are matched on a one-to-one basis
according to age, sex, socioeconomic status, and type of medications taken. One member of the
matched pair is a breast-fed baby and the other is bottle-fed.
Research Question: Is there a (statistically) significant difference in the average duration of ear
infection between the breast-fed and bottle-fed babies?
Chapter 2: Identified which did better
Chapter 6: Measuring the amount of difference
When we analyzed these data in Chapter 2, we used the binomial distribution, which allows only two
outcomes. With that analysis, we simply identified which baby did better (i.e. the breast-fed or
bottle-fed) for each pair.
Here, we will measure the amount of difference for each pair and use this in our analysis. In particular,
we can consider the size of the difference which is useful because a difference of -2 is VERY different
than a difference of -24.
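To see what the sizes of the differences add, the sketch below contrasts the two views on a made-up set of paired differences. The values are hypothetical, not the study data, and the helper name `sign_test_p_value` is introduced here only for illustration. The sign count feeds a two-sided binomial calculation of the kind used in Chapter 2, while the mean uses the magnitudes.

```python
from math import comb
from statistics import mean

# Hypothetical paired differences in duration; NOT the study data.
differences = [-2, -24, -5, 3, -10, -1, 4, -8]

def sign_test_p_value(diffs):
    """Two-sided binomial p-value for the count of negative differences,
    assuming each sign is equally likely when there is no real difference."""
    nonzero = [d for d in diffs if d != 0]   # ties carry no sign information
    n = len(nonzero)
    negatives = sum(d < 0 for d in nonzero)
    extreme = max(negatives, n - negatives)
    tail = sum(comb(n, k) for k in range(extreme, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print("negative differences:", sum(d < 0 for d in differences), "of", len(differences))
print("sign-test p-value:", round(sign_test_p_value(differences), 4))
print("mean difference:", mean(differences))
```

Note that replacing -24 with -2 would leave the sign-test result unchanged but would move the mean difference noticeably, which is the extra information this chapter uses.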
Summary Statistics for the Differences…
Does the observed data tend to support or refute the research question of differences existing? Discuss.
Identifying and removing the outliers…

Boxplot
Descriptive statistics without the 3 significant outliers…
Discuss any differences here from what was observed with all 24 observations.
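The outlier screening step can be sketched in code. The notes do not say which rule was used to flag the 3 outliers, so the common boxplot convention (values more than 1.5 × IQR beyond the quartiles) is assumed here; the data and the helper name `split_outliers` are hypothetical.

```python
from statistics import mean, quantiles

# Hypothetical differences with one extreme value; NOT the study data.
differences = [-2, -3, -5, -6, -7, -8, -9, -10, 40]

def split_outliers(values, k=1.5):
    """Separate values lying more than k * IQR beyond the quartiles.
    The 1.5 * IQR rule is an assumption, not taken from the notes."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    kept = [v for v in values if low <= v <= high]
    flagged = [v for v in values if v < low or v > high]
    return kept, flagged

kept, flagged = split_outliers(differences)
print("flagged as outliers:", flagged)
print("mean with all values:", round(mean(differences), 2))
print("mean without outliers:", round(mean(kept), 2))
```

Comparing the two printed means shows how strongly a few extreme values can pull the average, which is the comparison the discussion prompt asks for.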
Recall the research question, “Is there a (statistically) significant difference in the average duration
of ear infection between the breast-fed and bottle-fed babies?” The average for our data
(without the outliers) is -6.3. This value will certainly change over repeated samples, so to answer this
question we would like to know whether the average stays away from 0 (i.e. the no difference situation).
Tinkerplots Investigation
Getting the appropriate data for taking repeated samples.
Original data with Average Difference = -6.3
Adjusted data with Average Difference = 0
Note: Adjusted Data = Original Data + 6.3333
Setting up “spinner” or hat in Tinkerplots using adjusted data (to simulate no difference)
Taking 100 repeated samples produces the following graph. The statistic of interest here is the average;
thus, on each sample the average difference is obtained.
Research Question: Is there a (statistically) significant difference in the average duration of ear
infection between the breast-fed and bottle-fed babies?
The following graph would be used to obtain the p-value (for a two-sided test).
The number of dots at or below -6.3 or at or above +6.3 would be included in the calculation of the p-value.
Here the p-value would be given by 6/100 = 0.06.
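The Tinkerplots procedure above (shift the data so its average is 0, draw repeated samples, count the sample averages at least as extreme as the observed one) can be sketched as follows. The data are hypothetical, and drawing with replacement from the adjusted values is an assumption about how the spinner or hat is configured.

```python
import random
from statistics import mean

# Hypothetical differences; NOT the study data.
differences = [-12, -9, -8, -7, -6, -5, -4, -3, -2, 1, 3, 6]

def resampling_p_value(diffs, n_samples=100, seed=1):
    """Two-sided p-value from resampling data shifted to have mean 0."""
    rng = random.Random(seed)                  # fixed seed so the run is repeatable
    observed = mean(diffs)
    adjusted = [d - observed for d in diffs]   # the "no difference" data
    extreme = 0
    for _ in range(n_samples):
        sample = rng.choices(adjusted, k=len(adjusted))  # draw with replacement
        if abs(mean(sample)) >= abs(observed):
            extreme += 1
    return extreme / n_samples

print("observed mean difference:", round(mean(differences), 2))
print("simulated p-value:", resampling_p_value(differences))
```

With only 100 repeated samples the p-value moves in steps of 0.01 and varies from run to run; raising `n_samples` gives a steadier estimate.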
Formal Statistical Test in Excel
Research Question: Is there a (statistically) significant difference in the average duration of ear
infection between the breast-fed and bottle-fed babies?
Completing the test in Excel…
Determine p-value and make the statistical decision.
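For readers who want to check the Excel result by hand, here is a minimal sketch that assumes the formal test is a paired t-test, i.e. a one-sample t-test on the differences (the notes do not name the test explicitly). The data are hypothetical, and the function names are introduced only for illustration. The p-value is obtained by numerically integrating the t density with Simpson's rule, so it is an approximation.

```python
import math
import statistics

def t_density(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p_value(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule over [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_density(0, df) + t_density(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    central_area = 2 * (h / 3) * total     # area between -|t| and +|t|
    return max(0.0, 1 - central_area)

def paired_t_test(differences):
    """t statistic and two-sided p-value for H0: mean difference = 0."""
    n = len(differences)
    std_error = statistics.stdev(differences) / math.sqrt(n)
    t_stat = statistics.mean(differences) / std_error
    return t_stat, two_sided_p_value(t_stat, n - 1)

# Hypothetical paired differences; NOT the study data.
differences = [-12, -9, -8, -7, -6, -5, -4, -3, -2, 1, 3, 6]
t_stat, p_value = paired_t_test(differences)
print(f"t = {t_stat:.3f}, two-sided p-value = {p_value:.4f}")
```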
The Decision Rule: If the p-value is less than the error rate, then the data is said to support the research
question.
P-Value: ___________________
Statistical Decision: _________________
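The decision rule is a single comparison and can be written as a small check. The notes do not state the error rate, so the conventional 0.05 is an assumption here, and the function name is hypothetical. The example applies the rule to the simulated p-value of 6/100 from the Tinkerplots investigation.

```python
def statistical_decision(p_value, error_rate=0.05):
    """Apply the decision rule: the data support the research question
    only when the p-value is less than the error rate (0.05 is assumed)."""
    if p_value < error_rate:
        return "Data provides enough evidence for the research question"
    return "Data does not provide enough evidence for the research question"

# p-value from the Tinkerplots simulation: 6 of 100 samples.
print(statistical_decision(6 / 100))
```

Because 0.06 is not less than 0.05, the simulated p-value falls just short of supporting the research question at that error rate; a different error rate could change the decision.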
Conclusion – Writing a final statement in layman’s terms
Confidence Interval
Concept of 95% confidence interval…
Wiki Entry for Confidence Interval for Mean
Identify the parts of this interval and obtain the endpoints of the 95% confidence interval.

Lower Endpoint:

Upper Endpoint:
Sketch this interval on the line below.
_______________________________________________
Interpret the meaning of this 95% confidence interval.
Getting this done in Excel
Example 6.1.2 For this example, consider the Vitamin Intake dataset on our course website. The
purpose of this investigation is to compare the actual intake of calories (and certain vitamins/minerals)
against the daily recommended intake.
Answer the following research question using the data from the course website.
Research Question: Is there enough statistical evidence to suggest that, on average, there is a difference
between the actual energy intake and the daily recommended intake?
The Decision Rule: If the p-value is less than the error rate, then the data is said to support the
research question.
P-Value: ___________________
Statistical Decision:
• Data provides enough evidence for research question
• Data does not provide enough evidence for research question
Conclusion – Writing a final statement in layman’s terms
Construct an approximate 95% confidence interval using the following formula.
Lower Endpoint: $\text{Average Difference} - \left(2 \cdot \dfrac{\text{Standard Deviation}}{\sqrt{\text{Count}}}\right)$
Upper Endpoint: $\text{Average Difference} + \left(2 \cdot \dfrac{\text{Standard Deviation}}{\sqrt{\text{Count}}}\right)$
Provide, in context, an interpretation of this 95% confidence interval.
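The approximate interval formula above translates directly into a few lines of code. This is a sketch with hypothetical differences (actual minus recommended intake), not the course dataset; it assumes the sample standard deviation (Excel's STDEV.S) is the one intended.

```python
from math import sqrt
from statistics import mean, stdev

def approx_ci_95(differences):
    """Approximate 95% CI: average +/- 2 * (standard deviation / sqrt(count))."""
    margin = 2 * stdev(differences) / sqrt(len(differences))
    center = mean(differences)
    return center - margin, center + margin

# Hypothetical differences (actual minus recommended intake); NOT the course data.
differences = [-310, -120, 45, -260, -90, -400, 150, -210, -75, -180]
lower, upper = approx_ci_95(differences)
print(f"approximate 95% CI: ({lower:.1f}, {upper:.1f})")
```

If the resulting interval excludes 0, that agrees with a finding of a difference between actual and recommended intake; if it contains 0, no difference remains plausible.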
Section 6.2: Comparing a Single Numerical Variable Across Two Groups
Consider the following study of students enrolled in an introductory statistics course. This data
can be found in the StudentData.xlsx file.
Again, simple summaries in Excel can be obtained using the PivotTable feature. To use the
PivotTable feature, highlight all the data and select Insert > PivotTable > PivotChart as shown
here.
Suppose the goal is to compare the number of text messages sent across the year in school.
For example, do Freshmen send more, fewer, or the same number of text messages as
Seniors? An initial set-up is given here.
The default summary given by PivotTable is a count. Here we’d like to compare the averages,
so right-click on Count, select Value Field Settings, and then select Average.
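Outside Excel, the same summary (an average per category instead of a count) can be sketched as below. The rows are hypothetical, not the contents of StudentData.xlsx, and `average_by_group` is a name introduced only for this illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows of (year in school, text messages sent); NOT StudentData.xlsx.
rows = [
    ("Freshman", 120), ("Senior", 40), ("Sophomore", 200),
    ("Freshman", 80), ("Junior", 95), ("Senior", 60), ("Sophomore", 150),
]

def average_by_group(records):
    """Mimic a PivotTable whose value field is set to Average."""
    groups = defaultdict(list)
    for label, value in records:
        groups[label].append(value)
    return {label: mean(values) for label, values in groups.items()}

for label, avg in average_by_group(rows).items():
    print(label, avg)
```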
Consider the following changes to the default output given by Excel:
• Deselect Blank and Other from the Year in School drop-down arrow.
• Change the bar graph to a line graph, which more concisely shows the trends in the
averages over Year in School.
One additional fix to this graph would be to put the Year in School in the correct order. Excel’s
default is to sort alphabetically, which is not appropriate for this situation because the Year
in School variable has a natural order, i.e. Freshmen, Sophomores, Juniors, and Seniors.
To change the order of the categories, click on the appropriate label for the row to be moved,
right click and select Move. In this menu, you can move this row up or down to rearrange the
order.
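The same reordering idea, sketched in code: rather than accepting alphabetical order, list the categories in their natural order and arrange the table to match. The averages below are hypothetical placeholders, and `YEAR_ORDER` and `in_school_order` are illustrative names.

```python
# Hypothetical averages keyed by year in school; NOT the course results.
averages = {"Freshman": 100, "Junior": 95, "Senior": 50, "Sophomore": 175}

# Natural ordering of the categories, as described in the notes.
YEAR_ORDER = ["Freshman", "Sophomore", "Junior", "Senior"]

def in_school_order(table, order=YEAR_ORDER):
    """Return (label, value) pairs in the natural order, skipping absent labels."""
    return [(label, table[label]) for label in order if label in table]

print("alphabetical:", sorted(averages.items()))
print("school order:", in_school_order(averages))
```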
The following table and graph show the trend in the average number of text messages sent for
the various levels of Year in School.
Questions
1. What trends are prevalent in this table and graph?
2. Your friend makes the following statement, “The reason the average number of text messages
sent is highest for sophomores is because there were a lot of sophomores in the data and the
reason it is lowest for seniors is because there were very few seniors in the data.” What, if
anything, is wrong with this statement?
3. Why might a statistician also want to consider the standard deviation of the
number of text messages sent? That is, what is gained by considering the standard
deviations in the number of text messages?
Example 6.2 For this example, consider a comparison of the amount of time spent on Facebook
between Seniors and Freshmen. This subset of the data is given on the FRSR worksheet in Excel.
Research Question: Is there a difference in the average amount of time spent on Facebook
between Freshmen and Seniors?
Next, getting our analyses started for comparing Facebook time / day across Freshmen and
Seniors.
Summaries in Excel, averages, standard deviations, and counts are standard summaries…
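The three standard summaries named above (average, standard deviation, count) can be computed per group as in this sketch. The values are hypothetical (minutes per day is an assumed unit), not the FRSR worksheet data, and the sample standard deviation is assumed.

```python
from statistics import mean, stdev

# Hypothetical Facebook minutes per day; NOT the FRSR worksheet data.
facebook_time = {
    "Freshmen": [60, 45, 90, 30, 120, 75],
    "Seniors": [30, 20, 60, 45, 15, 40],
}

def summarize(values):
    """Average, sample standard deviation, and count for one group."""
    return {"average": mean(values), "std_dev": stdev(values), "count": len(values)}

for group, values in facebook_time.items():
    s = summarize(values)
    print(f"{group}: average={s['average']:.1f}, sd={s['std_dev']:.1f}, n={s['count']}")
```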
Graphical Comparisons
Getting the p-value in Excel…
Research Question: Is there a difference in the average amount of time spent on Facebook
between Freshmen and Seniors?
Obtain P-value:
P-value: __________________
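As a cross-check on the Excel p-value, the repeated-sampling idea from Section 6.1 extends to two groups. The sketch below is a permutation test, which is a different technique from whatever test Excel runs here (presumably a two-sample t-test; the notes do not say): it shuffles the group labels many times and counts how often the shuffled difference in averages is at least as large as the observed one. The data and function name are hypothetical.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_shuffles=2000, seed=1):
    """Two-sided p-value for a difference in means by shuffling group labels."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)                   # reassign group labels at random
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    return extreme / n_shuffles

# Hypothetical Facebook minutes per day; NOT the FRSR worksheet data.
freshmen = [60, 45, 90, 30, 120, 75]
seniors = [30, 20, 60, 45, 15, 40]
print("permutation p-value:", permutation_p_value(freshmen, seniors))
```

The result will not match a t-test p-value exactly, but the two should usually lead to the same decision when the groups are reasonably sized.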
Decision:
The Decision Rule: If the p-value is less than the error rate, then the data is said to support the
research question.


Data provides enough evidence for the research question
Data does not provide enough evidence for the research question
Conclusion: Write a conclusion using everyday language for this research question.
Using a 95% Confidence Interval to measure the amount of difference between the two
averages.
JMP software output for 95% Confidence Interval
Visual Depiction of 95% Confidence Interval:
Interpret the 95% confidence interval using everyday language.
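A rough version of this interval can be computed by hand. The sketch below reuses the multiplier of 2 from the approximate formula in Section 6.1 and combines the two groups' variances without pooling; both choices are assumptions, so the endpoints will differ somewhat from the JMP output, which uses a t-based multiplier. The data are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def approx_ci_95_difference(group_a, group_b):
    """Approximate 95% CI for mean(group_a) - mean(group_b), using multiplier 2."""
    diff = mean(group_a) - mean(group_b)
    std_error = sqrt(variance(group_a) / len(group_a)
                     + variance(group_b) / len(group_b))
    return diff - 2 * std_error, diff + 2 * std_error

# Hypothetical Facebook minutes per day; NOT the FRSR worksheet data.
freshmen = [60, 45, 90, 30, 120, 75]
seniors = [30, 20, 60, 45, 15, 40]
lower, upper = approx_ci_95_difference(freshmen, seniors)
print(f"approximate 95% CI for the difference: ({lower:.1f}, {upper:.1f})")
```

An interval that lies entirely above or below 0 points to a difference between the two averages; one that straddles 0 does not rule out equal averages.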
Repeated Sampling Concept, again…
Conceptual view of 95% confidence interval
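The repeated-sampling meaning of "95% confidence" can be checked by simulation: draw many samples from a population whose true average is known, build an interval from each, and see what share of the intervals contain the true average. This sketch assumes a normal population and the approximate average ± 2 standard errors interval; all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import mean, stdev

def coverage_rate(true_mean=0.0, true_sd=10.0, n=30, n_repeats=1000, seed=1):
    """Share of repeated samples whose approximate 95% CI contains true_mean."""
    rng = random.Random(seed)                 # fixed seed for repeatability
    hits = 0
    for _ in range(n_repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        center = mean(sample)
        margin = 2 * stdev(sample) / sqrt(n)
        if center - margin <= true_mean <= center + margin:
            hits += 1
    return hits / n_repeats

print("share of intervals covering the true mean:", coverage_rate())
```

The share should land near 0.95, which is what the confidence level describes: a property of the procedure over many samples, not a probability statement about any single interval.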