ML2-INSTR_Language_Describing_Variability.doc
Data Literacy Project

Vocabulary for describing variability in a data set

1. What is the RANGE of the data?

Use the minimum and maximum values on the scale to describe the range.
“During the last 100 years, ice-out on Swan Lake has ranged from day 85 (March 25) [the minimum] to about day 127 (May 6) [the maximum].”
“Swan Lake ice-out has had a range of about 42 days.”
“Swan Lake ice-out spreads over 42 days.”

2. What is the SHAPE of the DISTRIBUTION?

Normal: a single, largely symmetrical hump in the middle (see Swan Lake ice-out above). In a normal distribution, the median and mean are both close to the middle of the hump. The hump can be tight, wide, or somewhere in between. Students might like to say the data are clumped around a central value (i.e., the mean or median), and the clump is symmetrical.

Skewed (can be to the left or right): The data are heavily stacked to one side, or asymmetrical. In a skewed distribution, the mean and median can be quite different. (You can test this out with the Maine bear hunt data set.) (“Skewed to the right” means the “tail” of the data is on the right.)
“The ages of bears killed in Maine’s bear hunt in 2008 and 2009 are strongly skewed towards younger bears.”
“The distribution is skewed to the right. Most bears killed were less than 5 years old.”

Evenly scattered (or spread out): The data are spread out evenly, with no strong “humps”. There may be gaps in the distribution between data points.

Bi-modal: Two obvious humps in the distribution shape. In a bi-modal data set, the mean is often not very descriptive of the data because it usually falls somewhere in the dip between the humps.
“The distribution is bi-modal, with peaks at around 50 minutes and around 80 minutes.”
“The distribution peaks in two places, at 50 minutes and around 80 minutes. It’s bi-modal.”

3. Where is the CENTER of the DATA?

Mean: The average of all of the data points.
For normal distributions, the mean is usually a good “central value,” around which most of the data points are symmetrically clumped. For scattered data sets, the mean is still a reasonable measure of center, but because the data are evenly spread out and not clumped around the middle, there is more variability in the data. That is, more points are far away from the mean.

Median: The middle of the data points: half of the values are greater than the median and half are less than it. The median is a more useful description of the center of the data when the distribution is skewed.

Mode: If the distribution is bi-modal, it’s useful to give both modes as two distinct centers of the data. The data are clumped around two different “centers” or clusters.

__________________________

TEACHING NOTES:

If a distribution shape doesn’t fit any of the descriptions provided, encourage students to develop their own descriptors. Also send a copy of the frequency plot and their description to us; we can suggest language.

A useful way to ask students to describe the shape of a distribution might be: “What call are you going to make on the shape of this distribution?” Sometimes a distribution is a little bit bi-modal (or skewed), but not meaningfully so. Help students reason it through in the context of the data. (In the Old Faithful data set above, clearly the mean is not a typical value.)

(See Pfannkuch et al., 2010, Telling Data Stories: Essential Dialogues for Comparative Reasoning. J. of Statistics Education 18:1.)
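For classes that use a little Python, the points above about the mean, median and modes can be checked directly. The sketch below is only an illustration: both data sets are made up for this example (they are not the Maine bear hunt or Old Faithful data), and the variable names are our own. It assumes Python 3.8 or later, which provides `statistics.multimode`.

```python
from statistics import mean, median, multimode

# Hypothetical ages (years), skewed to the right: most values are small
# and a few large values form the tail. NOT the Maine bear hunt data.
skewed_ages = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 12, 17]

# Hypothetical waiting times (minutes) with two humps, near 50 and near 80.
# NOT the Old Faithful data.
bimodal_waits = [48, 50, 50, 50, 52, 54, 78, 80, 80, 80, 82, 84]

# In the skewed set, the tail pulls the mean above the median.
skewed_mean = mean(skewed_ages)
skewed_median = median(skewed_ages)

# In the bi-modal set, the mean lands in the dip between the two humps.
bimodal_mean = mean(bimodal_waits)
bimodal_peaks = multimode(bimodal_waits)

# Which values sit within 10 minutes of the mean? (None do.)
near_mean = [w for w in bimodal_waits if abs(w - bimodal_mean) <= 10]

print("skewed: mean =", skewed_mean, "median =", skewed_median)
print("bi-modal: mean =", round(bimodal_mean, 2), "peaks =", bimodal_peaks)
print("values within 10 of the bi-modal mean:", near_mean)
```

With these made-up numbers, the skewed set has a mean of 5 but a median of 3, and the bi-modal set has a mean of about 65.67 even though no value falls within 10 minutes of it. Students could swap in their own data and describe what changes.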
Summary list of vocabulary useful for describing VARIABILITY:

Range
Spread (same as range)
Minimum value
Maximum value
Spread out
Narrow
Center
Mean
Median
Mode
Clustered
Not clustered
Greater variability in the data (not tightly clustered around a center)
Less variability in the data (tightly clustered around a center)
Distribution shape
Normal
Skewed
Scattered
Bi-modal
Clumped
Hump
Spread out
Symmetrical
Asymmetrical
Tight
Wide
Peak
Tall
Longer/shorter than
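The range and spread vocabulary can also be made concrete with a short Python sketch. This is an illustration under stated assumptions, not part of the original activity: the helper names `value_range` and `average_distance_from_mean` are hypothetical, only the Swan Lake endpoints (day 85 and day 127) come from this handout, and the "tight" and "wide" data sets are made up so that they share the same mean.

```python
from statistics import mean

def value_range(values):
    """Return (minimum, maximum, range) for a list of numbers."""
    lowest, highest = min(values), max(values)
    return lowest, highest, highest - lowest

def average_distance_from_mean(values):
    """Average of how far each value sits from the mean (hypothetical helper)."""
    center = mean(values)
    return mean(abs(v - center) for v in values)

# Endpoints quoted in this handout for Swan Lake ice-out (day of year).
swan_lake_endpoints = [85, 127]

# Made-up data sets with the same mean (106) but different spread.
tight = [104, 105, 106, 106, 107, 108]
wide = [86, 96, 104, 108, 116, 126]

print("Swan Lake (min, max, range):", value_range(swan_lake_endpoints))
for name, data in [("tight", tight), ("wide", wide)]:
    print(name, "range:", value_range(data)[2],
          "average distance from mean:",
          round(average_distance_from_mean(data), 2))
```

The "wide" set has both the larger range and the larger average distance from the mean, which is what "greater variability" describes. The "tight" set is clustered closely around its center.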