Language for describing variability

ML2-INSTR_Language_Describing_Variability.doc
Data Literacy Project
Vocabulary for describing variability in a data set
1. What is the RANGE of the data? Use the minimum and maximum values on the
scale to describe the range.
“During the last 100 years, ice-out on Swan Lake has ranged from day 85 (March 25)
[the minimum] to about day 127 (May 6) [the maximum].”
“Swan Lake ice-out has had a range of about 42 days.”
“Swan Lake ice-out spreads over 42 days.”
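For teachers who want to check a range with a few lines of Python, here is a minimal sketch. The day numbers are hypothetical values picked only so that the minimum and maximum match the example sentences; they are not the real Swan Lake record, and the function name describe_range is our own.

```python
# Hypothetical ice-out days (day of the year), chosen only for illustration.
# These are not the real Swan Lake observations.
ice_out_days = [112, 98, 127, 105, 85, 119, 110, 101, 116]


def describe_range(values):
    """Return the minimum, the maximum, and the range (maximum minus minimum)."""
    lowest = min(values)
    highest = max(values)
    return lowest, highest, highest - lowest


lowest, highest, spread = describe_range(ice_out_days)
print(f"Ice-out has ranged from day {lowest} to day {highest}.")
print(f"Ice-out has had a range of {spread} days.")
```

The two printed sentences follow the same pattern as the student sentences above: name the minimum and maximum first, then state the range as a single number.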
2. What is the SHAPE of the DISTRIBUTION?
Normal: a single, largely symmetrical hump in the middle (See Swan Lake ice-out
above). In a normal distribution, the median and mean are both pretty close to the
middle of the hump. The hump can be tight or wide or somewhere in between.
Students might like to say the data are clumped around a central value (i.e. mean or
median), and the clump is symmetrical.
Skewed (to the left or to the right): The data are heavily stacked to one side, so the
distribution is asymmetrical. In a skewed distribution, the mean and median can be quite
different. (You can test this out with the Maine bear hunt data set.) “Skewed to the right”
means the “tail” of the data is on the right.
“The ages of bears killed in Maine’s bear hunt in 2008 and 2009 are strongly skewed towards
younger bears.”
“The distribution is skewed to the right. Most bears killed were less than 5 years old.”
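One way to see how far the mean and median can drift apart in a skewed data set is to compute both. The sketch below uses hypothetical bear ages (not the Maine bear hunt data), and the function name compare_centers is our own. Comparing the mean with the median is only a rough hint about the direction of the tail, so students should still look at the plot.

```python
from statistics import mean, median

# Hypothetical bear ages in years, chosen only for illustration.
# These are not the Maine bear hunt data.
bear_ages = [1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20]


def compare_centers(values):
    """Return (mean, median, hint), where hint is a rough guess at the skew."""
    average = mean(values)
    middle = median(values)
    if average > middle:
        hint = "tail on the right"
    elif average < middle:
        hint = "tail on the left"
    else:
        hint = "no skew suggested"
    return average, middle, hint


average, middle, hint = compare_centers(bear_ages)
print(f"mean = {average:.1f} years, median = {middle} years ({hint})")
```

In this made-up sample a few old bears pull the mean above the median, which is the pattern to expect when the tail is on the right.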
Evenly scattered (or spread out): the data are spread out evenly, with no strong
“humps”. There may be gaps in the distribution between data points.
Bi-modal: Two obvious humps in the distribution shape. In a bi-modal data set, the
mean is often not very descriptive of the data because it usually falls somewhere in the
dip between the humps.
“The distribution is bi-modal, with peaks at around 50 minutes and around 80 minutes.”
“The distribution peaks in two places, at 50 minutes and around 80 minutes. It’s bi-modal.”
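To show why the mean is often not very descriptive of bi-modal data, the sketch below draws a rough text plot and marks where the mean lands. The durations are hypothetical values with humps near 50 and 80 minutes, and the 10-minute bin width and the name bin_counts are our own choices.

```python
from collections import Counter
from statistics import mean

BIN_WIDTH = 10  # minutes per bin; an arbitrary choice for this sketch

# Hypothetical durations in minutes with humps near 50 and near 80.
durations = [48, 50, 51, 49, 52, 50, 53, 47, 66,
             78, 80, 81, 79, 82, 80, 83, 77, 84]


def bin_counts(values, width=BIN_WIDTH):
    """Count how many values fall in each bin, keyed by the bin's lower edge."""
    return Counter((v // width) * width for v in values)


counts = bin_counts(durations)
average = mean(durations)
mean_bin = int(average // BIN_WIDTH) * BIN_WIDTH

# Print one row per bin, with one '#' per data point.
for start in sorted(counts):
    marker = "  <- mean falls here" if start == mean_bin else ""
    print(f"{start:3d}-{start + BIN_WIDTH - 1:3d} | {'#' * counts[start]}{marker}")

print(f"mean = {average:.1f} minutes")
```

In this sample the mean lands in the nearly empty bin between the two humps, so it does not describe a typical value.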
3. Where is the CENTER of the DATA?
Mean: The average of all of the data points. For normal distributions, mean is usually a good
“central value”, around which most of the data points are symmetrically clumped. For
scattered data sets, the mean is still a reasonable measure of center, but because the data are
evenly spread out and not clumped around the middle, there is more variability in the data.
That is, more points are far away from the mean.
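The idea that more points are far away from the mean can be checked with a simple calculation. The sketch below compares two hypothetical data sets that share the same mean, one clumped and one evenly scattered. The measure used, the average distance of the points from the mean, is our own choice for illustration and is not a term used elsewhere in this handout.

```python
from statistics import mean

# Two hypothetical data sets with the same mean (50), for illustration only.
clumped = [48, 49, 50, 50, 50, 51, 52]
scattered = [20, 30, 40, 50, 60, 70, 80]


def average_distance_from_mean(values):
    """Return the average of how far each point lies from the mean."""
    center = mean(values)
    return mean(abs(v - center) for v in values)


for name, data in [("clumped", clumped), ("scattered", scattered)]:
    print(f"{name}: mean = {mean(data)}, "
          f"average distance from mean = {average_distance_from_mean(data):.1f}")
```

Both sets have the same center, but the scattered set has a much larger average distance from the mean, which is what greater variability looks like in numbers.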
Median: The middle of the data points — half of the values are greater than and half are less
than the median. Median is a more useful description of the center of the data when the
distribution is skewed.
Mode: If the distribution is bi-modal, it’s useful to give both modes as two distinct centers of
the data. The data are clumped around two different “centers” or clusters.
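When a distribution has two humps, one way to report both centers is to look for the bins that stand taller than their neighbors. The sketch below is a rough illustration using hypothetical durations. The name find_peak_bins and the 10-minute bin width are our own assumptions, and with real, noisy data this simple rule can flag small bumps that are not meaningful, so students still need to make a call from the plot.

```python
from collections import Counter

# Hypothetical durations in minutes with humps near 50 and near 80.
durations = [48, 50, 51, 49, 52, 50, 53, 47, 66,
             78, 80, 81, 79, 82, 80, 83, 77, 84]


def find_peak_bins(values, width=10):
    """Return the lower edges of bins that are taller than both neighbors."""
    counts = Counter((v // width) * width for v in values)
    peaks = []
    # Walk every bin from the lowest to the highest, including empty ones.
    for start in range(min(counts), max(counts) + width, width):
        here = counts.get(start, 0)
        left = counts.get(start - width, 0)
        right = counts.get(start + width, 0)
        if here > left and here > right:
            peaks.append(start)
    return peaks


peaks = find_peak_bins(durations)
print(f"Peak bins start at: {peaks}")
```

Each peak bin gives students a rough location for one of the two clusters, which they can then describe in a sentence such as “peaks at around 50 minutes and around 80 minutes.”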
__________________________
TEACHING NOTES:
If a distribution shape doesn’t fit with any of the descriptions provided, encourage students to
develop their own descriptors. And send a copy of the frequency plot and their description to
us — we can suggest language.
A useful way to ask students to describe the shape of a distribution might be: “What call are
you going to make on the shape of this distribution?” Sometimes a distribution is a little bit
bi-modal (or skewed), but not really meaningfully so. Help them reason it through in the context
of the data. (In the Old Faithful data set above, clearly the mean is not a typical value.)
(See Pfannkuch et al., 2010, Telling Data Stories: Essential Dialogues for Comparative
Reasoning. J. of Statistics Education 18:1.)
Summary list of vocabulary useful for describing VARIABILITY:
Range
Spread (same as range)
Minimum value
Maximum value
Spread out
Narrow
Center
Mean
Median
Mode
Clustered
Not clustered
Greater variability in the data (not tightly
clustered around a center)
Less variability in the data (tightly
clustered around a center)
Distribution shape
Normal
Skewed
Scattered
Bi-modal
Clumped
Hump
Spread out
Symmetrical
Asymmetrical
Tight
Wide
Peak
Tall
Longer/shorter than