Name: Date: Ms. D`Amato Block: Chapter 5: Describing Distributions

advertisement
Name:
Ms. D’Amato
Date:
Block:
Chapter 5: Describing Distributions Numerically
Finding the Center: The Median

When we think of a typical value, we usually look for the

For a unimodal, symmetric distribution, it’s easy to find the center:
See left graph below.

As a measure of center, the
(the average of the minimum and maximum
values) is very sensitive to skewed distributions and outliers. See right graph above.

The
of the distribution.
is a more reasonable choice for center than the midrange.
o Is the value with exactly half the values below it and half above it.
o It is the
data value (once the data values have been ordered) that
divides the histogram into two
o It has the same units as the data.
areas.
Finding the median by hand:
When n is odd…
…the median is the middle value. Counting from
the ends, we find this value in the
𝑛+1
2
position.
When n is even…
…there are two middle values. The median is the
average of the two values in positions
𝑛
2
and
𝑛
2
+
1.
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7
Spread: Home on the Range

Always report a measure of
describing a distribution numerically.

The
along with a measure of center when
of the data is the difference between the maximum and minimum values:
Range = max – min
What is the range of the following data values?
34, 25, 16, 73, 41

A disadvantage of the range (like the midrange) is that a single
can make it very large and, thus, not representative of the data overall.
Spread: The Interquartile Range

The
lets us ignore extreme data values and
concentrate on the middle of the data. To find the IQR, we first need to know what quartiles
are…
o Quartiles divide the data into four equal sections.

The
the median.
quartile is the median of the half of the data below

The
median.
quartile is the median of the half of the data above the
o The difference between the quartiles is the IQR, so
IQR = upper quartile – lower quartile

The lower and upper quartiles are the

The IQR contains the middle
and
percentiles of the data, so…
of the values of the distribution.
Finding quartiles by hand:
When n is odd…
When n is even…
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8
14.1, 3.2, 25.3, 2.8, -17.5, 13.9, 45.8, 35.7
Lower Quartile = ________
Lower Quartile = ________
Upper Quartile = ________
Upper Quartile = ________
IQR = ________
IQR = ________
5-Number Summary

The 5-number summary of a distribution reports its
,
, and
(maximum and minimum).
o Example: Let’s look at the five-number summary for the number of flat tires on Route 7.
Max
47 flat tires
Q3
22
Median
19
Q1
17
Min
13
o A
(quantitative) variable.
is a graphical display of the 5-number summary of a
o Boxplots are useful when comparing groups.
Making Boxplots
1. Draw a single vertical axis spanning the range of the data. Draw short horizontal lines at the
lower and upper quartiles and at the median. Then connect them with vertical lines to form a
box.
2. Erect “fences” around the main part of the data.
 The upper fence is 1.5 IQRs above the upper quartile.
 The lower fence is 1.5 IQRs below the lower quartile.
 Note: the fences only help with constructing the boxplot and should not appear in the
final display.
and
Upper fence = Q3 + 1.5 IQRs = 22 + 1.5 x 5 = 29.5 flat tires
Lower fence = Q1 – 1.5 IQRs = 17 – 1.5 x 5 = 9.5 flat tires
3. Use the fences to grow “whiskers.”
 Draw lines from the ends of the box up and down to the most extreme data values
found within the fences.
 If a data value falls outside one of the fences, we do not connect it with a whisker.
4. Add the outliers by displaying any data values beyond the fences with special symbols.
 We often use a different symbol for “far outliers” that are farther than 3 IQRs from the
quartiles.



The center of a boxplot is a box that shows the middle half of the data, between the quartiles.
If the median is roughly centered between the quartiles, then the middle half of the data is
roughly symmetric.
If it is not centered, the distribution is skewed. The whiskers show skewness as well if they are
not the same length.

By turning a boxplot and putting it on the same scale, we can compare the boxplot and
histogram of flat tires and see how each represents the distribution.
Comparing Groups with Boxplots



When placed side-by-side, we can see which group has a higher median, a greater IQR, where
the central 50% of the data are located, and which has the greater overall range.
Can get an idea of the symmetry form whether the medians are centered within their boxes and
whether the whiskers extend roughly the same distance on either side of the boxes.
The following set of boxplots compares the effectiveness of various coffee containers:
What does this graphical display tell you?
Summarizing Symmetric Distributions

Medians do a good job of identifying the center of skewed distributions.


When we have symmetric data, the
is a good measure of center.
We find the mean by adding up all of the data values and dividing by n, the number of data
values we have.
The distribution of pulse rates for 52 adults is generally symmetric, with a mean of 72.7 beats
per minute (bpm) and a median of 73 bpm:

The Formula for Averaging (Say It in Greek)

The formula for mean is given by:
y

Total  y

n
n
The formula says that to find the mean, we add up the numbers and divide by n.
Mean or Median?

Regardless of the shape of the distribution, the
of the data would balance:

In symmetric distributions, the mean and median are approximately the
in
value, so either measure of center may be used.
For skewed data, though, it’s better to report the
than the mean as a
measure of center.
The median splits a histogram so that the
of the bars on either side of the
median are equal. The mean
the histogram, taking into account both the
size of the bars and their distance from the center, but as a result, it may not have equal
numbers of data values on either side.
When the data is symmetric the mean = median



is the point at which a histogram


When the data is skewed to the left (tail longer on left) the mean is to the left of the median
(mean < median)
When the data is skewed to the right the mean (tail longer on right) the mean is to the right of
the median (mean > median)
What About Spread? The Standard Deviation

A more powerful measure of spread than the IQR is the
which takes into account how far

A

The
, notated by
(almost) averaging them:
data value is from the mean.
is the distance that a data value is from the mean.
o Since adding all deviations together would total zero, we square each deviation and find
an average of sorts for the deviations.
, is found by summing the squared deviations and
s2 

 y  y 
2
n 1
The variance will play a role later in our study, but it is problematic as a measure of spread—it
is measured in

,
units!
The
,
, is just the square root of the variance and is
measured in the same units as the original data.
s
 y  y 
2
n 1
Finding the standard deviation by hand:
Steps:
1. Find the mean, 𝑦̅. Round to the nearest tenth if necessary.
2. Next, find the deviations by taking 𝑦̅ from each value: (𝑦 − 𝑦̅)
3. Square the deviation: (𝑦 − 𝑦̅)2
4. Add these numbers up and divide by n – 1. This gives you the variance.
5. Take the square root to find the standard deviation.
Let’s look at the batch of values is 4, 3, 10, 12, 8, 9, and 3.
1. Find the mean.
2-3. Find the deviations and squared deviation.
Observations
y
Deviations
(𝑦 − 𝑦̅)
Squared Deviations
(𝑦 − 𝑦̅)2
4. Add the squared deviations up and divide by n – 1.
5. Take the square root of your answer for #4. This is your standard deviation.
Which y value(s) lie within one standard deviation of the mean?
Thinking About Variation

Since Statistics is about variation,
Statistics.

Measures of spread help us talk about what we don’t know.
is an important fundamental concept of

When the data values are tightly clustered around the center of the distribution, the IQR and
standard deviation will be

.
When the data values are scattered far from the center, the IQR and standard deviation will be
.
Shape, Center, and Spread

When telling about a quantitative variable, always report the shape of its distribution, along
with a center and a spread.
o If the shape is
, report the
o If the shape is
, report the
and possibly the median and IQR as well.
and
.
and
What About Outliers?


If there are any clear outliers and you are reporting the mean and standard deviation, report
them with the outliers present and with the outliers removed. The differences may be quite
revealing.
Note: The median and IQR are not likely to be affected by the outliers.
What Can Go Wrong?



Don’t forget to do a reality check — don’t let technology do your thinking for you.
Don’t forget to sort the values before finding the median or percentiles.
Don’t compute numerical summaries of a categorical variable.





Watch out for multiple modes — multiple modes might indicate multiple groups in your data.
Be aware of slightly different methods — different statistics packages and calculators may give
you different answers for the same data.
Beware of outliers.
Make a picture (make a picture, make a picture).
Be careful when comparing groups that have very different spreads.
o Consider these side-by-side boxplots of cotinine levels:
What Have We Learned?



We can now summarize distributions of quantitative variables numerically.
o The 5-number summary displays the min, Q1, median, Q3, and max.
o Measures of center include the mean and median.
o Measures of spread include the range, IQR, and standard deviation.
We know which measures to use for symmetric distributions and skewed distributions.
We can also display distributions with boxplots.
o While histograms better show the shape of the distribution, boxplots reveal the center,
middle 50%, and any outliers in the distribution.
o Boxplots are useful for comparing groups.
Download