Algebra 1 Summer Institute 2014 The Cholesterol

advertisement
Algebra 1 Summer Institute 2014
The Cholesterol-Diet Emanation
Summary
Goals
The activities in this session
look at numerical data.
Participants will describe
the graphical
representations used and
interpret those descriptions
in context. Sampling
distributions and
simulations will be done as
well as deciding exactly
what it is that one wants to
know
Participant Handouts
 Examine a data analysis
problem from different
perspectives
 Formulate appropriate
questions that can be
answered by data
analysis
 Use appropriate
graphical displays to
make informed decisions
in the presence of
variability
 Compare different
analyses of a single
scenario
 Infer whether the results
occurred by chance
variation or as a result of
other factors.
1. Dietary Change and
Cholesterol
2. What can you know?
3. Data-based dietary
decisions
4. Simulating and Counting
Successes
Materials
Technology
Source
Estimated Time
Paper
Rulers
Colored Pencils
LCD Projector
Facilitator Laptop
Excel
Navigating through
Data Analysis in
Grades 9-12
120 minutes
Mathematics Standards
Common Core State Standards for Mathematics
FAMS.7.SP.1: Use random sampling to draw inferences about a population.
1.1: Understand that statistics can be used to gain information about a population by
examining a sample of the population; generalizations about a population from a
sample are valid only if the sample is representative of that population.
Understand that random sampling tends to produce representative samples and
support valid inferences.
1.2: Use data from a random sample to draw inferences about a population with an
unknown characteristic of interest. Generate multiple samples (or simulated
samples) of the same size to gauge the variation in estimates or predictions. For
example, estimate the mean word length in a book by randomly sampling words
from the book; predict the winner of a school election based upon randomly
sampled survey data. Gauge how far off the estimate or prediction might be.
1
Algebra 1 Summer Institute 2014
Standards for Mathematical Practice
1. Make sense of problems and persevere in solving them
2. Reason abstractly and quantitatively
3. Construct viable arguments and critique the reasoning of others
4. Model with mathematics
5. Use tools appropriately
Instructional Plan
This activity looks at numerical data, which are quantitative and are also known as
measurement data. It centers on a study of diet and cholesterol. The activity focuses on
(1) analyzing the data in light of a question of interest and (2) moving from representing
the data graphically to drawing a conclusion about the results. This is also an
observational study.
High cholesterol is a contributor to heart disease. Table 1 in the participants’ handout
“Dietary Change and Cholesterol” lists data from a study investigating the effect of
dietary change on cholesterol levels. Twenty-four hospital employees voluntarily
switched from a “standard American diet” to a vegetarian diet for one month. The data
show their cholesterol levels both before and after the dietary change, in milligrams of
cholesterol per deciliter of blood (mg/dL). Suppose for the activity that it is desirable to
decrease the level of cholesterol in the blood. Thus assume that the purpose of the switch
to the new vegetarian diet is to decrease that level.
1. Once the scenario has been described, ask the participants: what question (or
questions) does the study described have the potential to answer?
Possible questions are: is the diet effective, for what patients, if any, will the diet
make a difference? Can a doctor predict, with some degree of certainty, the
change in a patient’s cholesterol after following the diet? Could you as a patient
predict your own change in cholesterol after being on the diet?
2. A possible way to proceed at this point is to divide the participants in as many
groups as questions were suggested and ask each group to focus on that question,
plan analysis that answer them, and carry the plan. At the end, each group can
make a presentation of their findings.
For example, participants can be divided in 4 groups and each group will focus on
one of the following questions:
i.
Does the diet make a difference?
ii. What are the chances that my cholesterol level will drop? By how much, if
at all, can I expect it to drop?
2
Algebra 1 Summer Institute 2014
iii.
iv.
How does my current cholesterol level affect what might happen if I use
the diet? Can a doctor predict the change in cholesterol level for a patient
who follows the diet?
Does the diet work, and if so, for whom?
What follows are some comments intended to guide discussions about their work.
Does the diet make a difference?
Using a scatter plot with “before” and “after” (before, after) coordinates is one approach
to this question. If the diet made no difference all the points will fall on the y = x line.
Plotting this line can provide a benchmark for answering this question. If the diet made a
difference and the cholesterol levels improved, the points would fall below the y = x line.
The actual data shows a decrease for most of the volunteers, although other questions,
such as to what degree the levels improved, remain unanswered.
Another approach is to treat the before and after data as two independent samples. In this
approach, side-by-side boxplots or histograms can be used for comparisons. There is a
great deal of overlap and variability among the data in both sets. The overlap makes any
improvements in level difficult to determine from these analyses.
What are the chances that my cholesterol level will drop? By how much, if at all, can I
expect it to drop?
The first question to ask would be: how many individuals’ levels decrease? As well as:
what was the largest decrease that any one experienced? What was the smallest?
A variable of interest here could be the change in levels, which can be found by
subtracting the after values minus before. This change would be a negative. The variable
could be called “Improvement”. This variable could be represented in a box plot or dot
plot.
Finding the mean change does not give any sense of the variability around it. Are the
changes small or large? Are they clustered? The graphs can help answer these questions.
How does my current cholesterol level affect what might happen if I use the diet? Can a
doctor predict the change in cholesterol level for a patient who follows the diet?
The previous questions did not consider both the relationship between cholesterol levels
before and after the diet and how the initial cholesterol level may affect any changes in
levels. A possible approach is to fit a line to the data and to base conclusions only on the
slope of the resulting equation. The scatter plot of before and after seems to indicate a
linear relationship. The equation of the line could be written as: After = 37 + 0.7(Before).
In this approach it is useful to plot the line y = x. This line allows identification of
improved cholesterol levels. Comparing the two lines (regression and y =x) reveals that
3
Algebra 1 Summer Institute 2014
the average amount of improvement resulting from the diet is related to the initial
cholesterol level before the diet. People with higher initial levels tend to have larger
decreases in levels, on average, than do people with lower initial levels.
Does the diet work, and if so, for whom?
If we use the regression line, we could predict the level of cholesterol for a person after a
diet. However, the scatterplot showing before and after levels would tell us there is a
range of plausible outcomes rather than a single value. There is still variation in the
“after” cholesterol levels that diet does not account for.
In fact, for certain initial cholesterol levels, the range of likely “after diet” cholesterol
levels may include the no improvement line (y = x). The proportion of points above the
no improvement line for similar initial cholesterol levels give and estimate of the
probability of increased cholesterol with the vegetarian diet for patients at the starting
point. Thus, as doctors, we could predict the patient’s level and a range in which that
level might reasonably fall, and we could estimate the patient’s chances for any
improvement at all.
We could represent this last analysis on a scatterplot showing improvement versus
cholesterol before the diet. Here we could concentrate on the values for which
improvement > 0. We could draw the line improvement = 0 in the scatterplot.
This approach emphasizes not only the amount of improvement but also the probability
of improvement that depends on the initial level of cholesterol. The distance above the 0
line gives the amount of improvement (distance measured in term of the difference in the
y-values). The proportion of points above the line for any given range of before values
estimate the probability of improvement for a patient starting in that range.
Simulating and Counting Successes
Suppose and individual’s cholesterol levels were measured several times. Those
measurements might not be all identical. If this is so, could the improvement seen in the
data from the study represent just the natural variability among measurements of
individual cholesterol levels? That is, could it be that the diet actually had no effect?
If the diet had no effect, one could expect that the increases and decreases occurred
purely randomly. This activity gives participants an opportunity to consider a number of
additional issues related to this question in connection with the scenario:
 Suppose that the probability of cholesterol levels decreasing when people went on
the vegetarian diet really were 50%. Under this condition, what would be the
probability that the cholesterol levels of at least 21 of the 24 patients would
decrease in a study lie that in the scenario Dietary Change and Cholesterol.
 What would your answer suggest about the diet?
4
Algebra 1 Summer Institute 2014
The original question: “Does the diet make a difference?” has no yet really been
answered. The data shows improvement, but is the improvement enough to make the diet
a desirable treatment? The important question is: “are the observed improvements
possibly due to chance variation when no mean improvement exists, or are they actual
mean improvements? The basic idea is that variability is inherent in the population and in
the sampling process. Inference is the process of using the sample to make conclusions
about the population being considered in the presence of this variation.
A possible simulation model examines whether improvement occurred more often than
might be expected if the diet were not effective. Another model examines whether
improvements were of greater size than might be expected if the diet were not effective.
The two models differ in the choice of sample statistics. In the first one “number of
improved volunteers” is used; the second uses “mean improvement”. Both models can be
represented with simulations, and the underlying randomization processes in the two
simulation models are identical. Only the first model is discussed in this activity.
Binomial Inference
The question “will I improve under the new diet?” leads to an analysis of success-failure
outcomes. A decrease in cholesterol level constitutes success. An increase constitutes
failure. Thus the statistic of interest can be thought of as a particular quantity – “number
who improve”. If the diet actually had no effect, we would expect about half of the
outcomes to be success and half to be failures, since each outcome would be due only to
chance variation, not to the effect of the diet. Hence, if the diet produced no
improvement, we would expect the probability of improvement to be simple p = .5,
which is called the null hypothesis. So in a set of 24 volunteers, we would expect 12 to
improve. In the case that the diet really did not work, then, what would the chance be that
as many as 21 would improve?
The activity “Simulating and Counting Successes” enables participants to investigate this
question by carrying out a simple simulation. The variable to observe is “Number who
improve”. The activity page for the activity summarizes the elements that are essential in
a simulation. The activity uses a coin. If using dice, even numbers can represent success
and odd numbers represent failures.
After participants have tried the simulation enough times to understand the connection
between its structure and the study’s features and to know how to use the simulation’s
results to estimate probability of at least 21 successes, they can switch to Excel to
generate enough trials (300 or more) to give stable distributions.
After doing the simulation, what is the estimated probability of having 21 or more
successes? If, for example, the number of successes is 0 when the diet has no effect and p
= .5, then there is strong evidence that 21 successes did not occur as a result of chance
variation and that the diet did have an effect on the cholesterol level.
Drawing conclusions from sample mean improvement
5
Algebra 1 Summer Institute 2014
The second model offers another approach. Rather than counting subjects that improve,
we could consider the actual amount of improvement. If the diet were not effective, we
would assume that the mean improvement would be 0. In this case the null hypothesis is
that the mean improvement is 0.
To test this new null hypothesis we need a simulation that uses the numbers in the
original data to generate a sampling distribution of the sample mean improvement under
the null hypothesis. Under the assumption that the diet was not effective, we might think
of the cholesterol levels in the table as assigned to “before” and “after” randomly.
For example, the first volunteer had cholesterol levels of 195 and 146. Over the course of
a month, the levels can fluctuate. Could it be that the levels of this volunteer were due to
that fluctuation? Could the measurements, through normal variation, have been reversed?
We would still consider those levels a pair since they came from the same volunteer;
however, could we think of the value for before and after as due to chance variation over
time, rather than to dietary change for that individual?
These assumptions allow us to create a simulation that randomizes the assignment of
each person’s levels to “before” and “after” and lets us compute the improvement for
each person and then record the sample man improvement for the set of 24 pairs. The
simulation must be repeated enough times to generate a simulated sampling distribution
of sample mean under the null hypothesis of no change in cholesterol as a result of the
diet. Then we can locate the observed sample mean improvement (19.5 mean from the
first table) within the simulated sampling distribution.
Participants could try the following simulation. For example, we might toss a coin 24
times to carry out one round of our randomization, with heads meaning that we will label
the first number for a given patient “before” and tails meaning we will label the first
number as “after”. Then the string of 24 coin tosses HTTHHTH.... will allow us to either
switch the levels for that patient or leave them the same. In short, if we get heads for the
a patient, the “before” and “after” levels stay the same, but if we get tails, then we switch
the order of the levels.
In practice, multiplying each actual improvement value by a randomly generated 1 or -1,
and then computing the mean of the resulting list may accomplish this simulation. For
example, if the string is HTTHT, it would be replaced by 1, -1, -1, 1, -1. Such
calculations are simple in Excel.
6
Download