Numerical descriptors BPS chapter 2 © 2006 W.H. Freeman and Company

Objectives (BPS chapter 2)
Describing distributions with numbers

Measure of center: mean and median

Measure of spread: quartiles and standard deviation

The five-number summary and boxplots

IQR and outliers

Choosing among summary statistics

Using technology

Organizing a statistical problem
Measure of center: the mean
The mean or arithmetic average
To calculate the average, or mean, add
all values, then divide by the number of
individuals. It is the “center of mass.”
Sum of heights is 1598.3
Divided by 25 women = 63.9 inches
woman (i)   height (x)       woman (i)   height (x)
i = 1       x1 = 58.2        i = 14      x14 = 64.0
i = 2       x2 = 59.5        i = 15      x15 = 64.5
i = 3       x3 = 60.7        i = 16      x16 = 64.1
i = 4       x4 = 60.9        i = 17      x17 = 64.8
i = 5       x5 = 61.9        i = 18      x18 = 65.2
i = 6       x6 = 61.9        i = 19      x19 = 65.7
i = 7       x7 = 62.2        i = 20      x20 = 66.2
i = 8       x8 = 62.2        i = 21      x21 = 66.7
i = 9       x9 = 62.4        i = 22      x22 = 67.1
i = 10      x10 = 62.9       i = 23      x23 = 67.8
i = 11      x11 = 63.9       i = 24      x24 = 68.9
i = 12      x12 = 63.1       i = 25      x25 = 69.6
i = 13      x13 = 63.9       n = 25
Mathematical notation:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\sum_{i=1}^{n} x_i = 1598.3, \qquad \bar{x} = \frac{1598.3}{25} = 63.9$$
Learn right away how to get the mean using your calculators.
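As a minimal sketch of the same calculation in software (Python is our choice here; the slides themselves only assume a calculator), the mean of the 25 heights can be computed directly from the definition. The variable names are illustrative.

```python
# Heights (inches) of the 25 women, in the order listed in the table.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]

n = len(heights)        # number of individuals
total = sum(heights)    # sum of all values
mean = total / n        # x-bar = (1/n) * sum of x_i

print(f"n = {n}, sum = {total:.1f}, mean = {mean:.1f} inches")
# prints: n = 25, sum = 1598.3, mean = 63.9 inches
```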
Your numerical summary must be meaningful
Height of 25 women in a class
The distribution of women’s height appears coherent and symmetrical. The mean is a good numerical summary.
[Figure: histogram of plant heights. Horizontal axis: height in centimeters; vertical axis: number of plants.]
Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?
Height of plants by color
[Figure: the plant heights shown separately by color (red, pink, blue).]
Means annotated on the plots: $\bar{x} = 69.3$, $\bar{x} = 69.6$, $\bar{x} = 63.9$, $\bar{x} = 70.5$, $\bar{x} = 78.3$.
A single numerical summary here would not make sense.
Measure of center: the median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
Example data, sorted (n = 25):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
1. Sort observations from smallest to largest.
n = number of observations
2. If n is odd, the median is observation (n+1)/2 down the list.
n = 25: (n+1)/2 = 26/2 = 13, so Median = 3.4
3. If n is even, the median is the mean of the two center observations.
n = 24: n/2 = 12, so Median = (3.3 + 3.4)/2 = 3.35
Example data, sorted (n = 24):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6
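The median rule for odd and even n can be sketched in a few lines of Python. This is an illustrative implementation of the steps as stated, not a prescribed method; the function name `median` and the variable names are our own.

```python
def median(values):
    """Sort, then take the middle observation (n odd) or the mean of
    the two center observations (n even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # observation (n+1)/2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # mean of observations n/2 and n/2 + 1


data_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
           3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
data_24 = data_25[:-1]  # the same data without the largest value (6.1)

print(median(data_25))            # n = 25: the 13th observation
print(round(median(data_24), 2))  # n = 24: mean of the 12th and 13th observations
```

With the example data this gives 3.4 for n = 25 and 3.35 for n = 24, matching the worked values.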
Comparing the mean and the median
The mean and the median are the same only if the distribution is
symmetrical. The median is a measure of center that is resistant to skew
and outliers. The mean is not.
[Figure: three density curves with the positions of the mean and the median marked. Panels: mean and median for a symmetric distribution; mean and median for skewed distributions (left skew and right skew).]
Mean and median of a distribution with outliers
[Figure: two histograms; vertical axis: percent of people dying. Without the outliers, $\bar{x} = 3.4$; with the outliers, $\bar{x} = 4.2$.]
The mean is pulled to the right a lot by the outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
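The resistance of the median can be illustrated with a short Python sketch. The data behind the figures on this slide are not given, so this example reuses the 25 observations from the median example and adds two hypothetical extreme values (12.0 and 14.0, invented purely for illustration). It is not meant to reproduce the slide's means of 3.4 and 4.2.

```python
from statistics import mean, median

# The 25 observations from the median example.
data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

# HYPOTHETICAL outliers, not from the source; added only to show the effect.
hypothetical_outliers = [12.0, 14.0]
with_outliers = data + hypothetical_outliers

mean_shift = mean(with_outliers) - mean(data)
median_shift = median(with_outliers) - median(data)

print(f"mean:   {mean(data):.2f} -> {mean(with_outliers):.2f}")
print(f"median: {median(data):.2f} -> {median(with_outliers):.2f}")
print(f"mean moved {mean_shift:.2f}, median moved {median_shift:.2f}")
```

Under these assumptions the mean moves by roughly 0.7 while the median moves by only 0.2, which is the pattern the slide describes.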
Impact of skewed data
Mean and median of a symmetric distribution
Disease X: $\bar{x} = 3.4$, $M = 3.4$
Mean and median are the same.
... and a right-skewed distribution
Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$
The mean is pulled toward the skew.
Measure of spread: quartiles
The first quartile, Q1, is the value in
the sample that has 25% of the data
at or below it.
The third quartile, Q3, is the value in
the sample that has 75% of the data
at or below it.
Example data, sorted (n = 25):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
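The quartile values above can be reproduced with a short Python sketch, assuming the median-of-halves convention: Q1 is the median of the observations below the position of the overall median, and Q3 is the median of those above it. That convention is our assumption; it matches the values 2.2 and 4.35, but statistical software often uses other interpolation rules and may report slightly different quartiles. The function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, M, Q3) using the median-of-halves convention.

    When n is odd, the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # below the median position
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, m, q3 = quartiles(data)
print(f"Q1 = {q1:.2f}, M = {m:.2f}, Q3 = {q3:.2f}")
```

Adding the smallest and largest observations (`min(data)` and `max(data)`) to these three numbers gives the five-number summary.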
Center and spread in boxplots
[Figure: the 25 sorted observations for Disease X shown next to the corresponding boxplot. Vertical axis: years until death.]
“Five-number summary”
Smallest = min = 0.6
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
Largest = max = 6.1
Boxplots for skewed data
Comparing boxplots for a normal
and a right-skewed distribution
[Figure: side-by-side boxplots for Disease X and multiple myeloma. Vertical axis: years until death.]
Boxplots remain true to the data and clearly depict symmetry or skewness.
IQR and outliers
The interquartile range (IQR) is the distance between the first and
third quartiles (the length of the box in the boxplot).
IQR = Q3 − Q1
An outlier is an individual value that falls outside the overall pattern.

How far outside the overall pattern does a value have to fall to be
considered an outlier?

Low outlier: any value < Q1 − 1.5 × IQR

High outlier: any value > Q3 + 1.5 × IQR
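The 1.5 × IQR rule can be sketched in Python as follows. The quartiles Q1 = 2.2 and Q3 = 4.35 are the values from the quartile example; the function names and the extra value 9.0 are hypothetical and serve only to show a flagged observation.

```python
def outlier_fences(q1, q3, k=1.5):
    """Return the (low, high) cutoffs Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def find_outliers(values, q1, q3, k=1.5):
    """Values strictly below the low cutoff or strictly above the high cutoff."""
    low, high = outlier_fences(q1, q3, k)
    return [x for x in values if x < low or x > high]


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
q1, q3 = 2.2, 4.35   # quartiles from the example above

low, high = outlier_fences(q1, q3)
print(f"IQR = {q3 - q1:.2f}, low cutoff = {low:.3f}, high cutoff = {high:.3f}")
print(find_outliers(data, q1, q3))          # [] : no observation is flagged
print(find_outliers(data + [9.0], q1, q3))  # [9.0] : the hypothetical value is flagged
```

With IQR = 2.15 the cutoffs are about −1.025 and 7.575, so none of the 25 observations counts as an outlier under this rule.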
Measure of spread: standard deviation
The standard deviation is used to describe the variation around the mean.
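1) First calculate the variance $s^2$:
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2) Then take the square root to get the standard deviation $s$:
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
[Figure: distribution with the mean $\bar{x}$ and the interval mean ± 1 s.d. marked.]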
Calculations …
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees of freedom (df) = (n − 1) = 13
Women’s height (inches)
i      x_i    mean    (x_i − mean)   (x_i − mean)²
1      59     63.4    −4.4           19.0
2      60     63.4    −3.4           11.3
3      61     63.4    −2.4            5.6
4      62     63.4    −1.4            1.8
5      62     63.4    −1.4            1.8
6      63     63.4    −0.4            0.1
7      63     63.4    −0.4            0.1
8      63     63.4    −0.4            0.1
9      64     63.4     0.6            0.4
10     64     63.4     0.6            0.4
11     65     63.4     1.6            2.7
12     66     63.4     2.6            7.0
13     67     63.4     3.6           13.3
14     68     63.4     4.6           21.6
Sum                    0.0           85.2
Mean   63.4
s² = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
We’ll never calculate these by hand, so make sure you know how
to get the standard deviation using your calculator.
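For readers using software rather than a calculator, the worked example can be checked with a short Python sketch. It follows the two steps literally (variance first, then the square root) and then cross-checks against `statistics.stdev` from the Python standard library, which also divides by n − 1. The code keeps the mean unrounded, which appears to be what the table's squared deviations are based on (for example, 19.0 rather than 4.4² = 19.36).

```python
import math
import statistics

# Women's heights (inches) from the worked example, n = 14.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
mean = sum(heights) / n                               # about 63.4
sum_sq_dev = sum((x - mean) ** 2 for x in heights)    # sum of squared deviations
variance = sum_sq_dev / (n - 1)                       # step 1: s^2, with df = n - 1
sd = math.sqrt(variance)                              # step 2: s

print(f"mean = {mean:.1f}, sum of squared deviations = {sum_sq_dev:.1f}")
print(f"variance = {variance:.2f}, standard deviation = {sd:.2f}")

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(sd, statistics.stdev(heights))
```

The printed values (63.4, 85.2, 6.55, and 2.56) agree with the table and the hand calculation.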
Software output for summary statistics:
Excel: from the menu, choose Tools/Data Analysis/Descriptive Statistics. This gives common statistics of your sample data.
Minitab: [screenshot of summary-statistics output]
Choosing among summary statistics

Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers.
Plot the mean and use the standard deviation for error bars.
Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: height of 30 women, shown both as a boxplot and as mean ± s.d. Vertical axis: height in inches.]
What should you use? When and why?
Arithmetic mean or median?

Middletown is considering imposing an income tax on citizens. City hall
wants a numerical summary of its citizens’ incomes to estimate the total tax
base.

Mean: Although income is likely to be right-skewed, the city government
wants to know about the total tax base, and the total equals the mean
income times the number of citizens.

In a study of standard of living of typical families in Middletown, a sociologist
makes a numerical summary of family income in that city.

Median: The sociologist is interested in a “typical” family and wants to
lessen the impact of extreme incomes.
Organizing a statistical problem

State: What is the practical question, in the context of a real-world
setting?

Formulate: What specific statistical operations does this problem call
for?

Solve: Make the graphs and carry out the calculations needed for this
problem.

Conclude: Give your practical conclusion in the context of the real-world
setting.