Topic (4) SUMMARIZING QUANTITATIVE DATA A) Frequency Distributions For Samples

Topic (4) Summarizing Data – Quantitative Variables
Frequency Tables and Histograms
We can’t always list every possible value of a quantitative
variable, and datasets can get too large to display in full, so we
summarize the data. We create groupings (intervals, bins, classes)
and assign each observation to a grouping based on the value of its
quantitative variable.
1) How many groupings or intervals (classes)? We want
approximately 5-10 observations per group (on average)
in equal-width intervals (groups):
$c = \dfrac{\text{number of observations}}{8} = \dfrac{n}{8}$
e.g. for n = 48, use anywhere from 6 to 10 intervals
2) How big is each interval (bin)? Intervals should be equal-width,
so choose a starting value slightly below the min value in
the dataset and an ending value slightly above the max
value in the dataset:
$\text{size of each class} = \dfrac{\text{ending value} - \text{starting value}}{c}$
e.g. the Shannon-Wiener Index (SWI) ranges from 0.0 to
2.2685 and n = 48. We’ll use a range from 0 to 2.4, which
divides nicely with c = 6. So interval width = 2.4/6 = 0.4.
3) Construct each class or grouping:
FREQUENCY TABLE
Grouping    Absolute Frequency    Relative Frequency
0-0.4               27            27/48 = 56.25%
>0.4-0.8             6             6/48 = 12.50%
>0.8-1.2             3             3/48 =  6.25%
>1.2-1.6             4             4/48 =  8.33%
>1.6-2.0             6             6/48 = 12.50%
>2.0-2.4             2             2/48 =  4.17%
TOTAL               48                    100%
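As a rough illustration of steps 1) to 3), the Python sketch below recomputes the class count, the class width, and the relative frequencies for the SWI example. The counts are copied from the frequency table; the variable names are illustrative assumptions and not part of the original notes.

```python
n = 48                      # number of SWI observations
start, end = 0.0, 2.4       # chosen to bracket min = 0.0 and max = 2.2685

# Step 1: rule-of-thumb number of classes, c = n / 8.
c = round(n / 8)

# Step 2: equal class width and the resulting class boundaries.
width = (end - start) / c
edges = [round(start + k * width, 10) for k in range(c + 1)]

# Step 3: absolute frequencies (taken from the table) and relative frequencies.
counts = [27, 6, 3, 4, 6, 2]
relative = [100 * f / n for f in counts]

for k, f in enumerate(counts):
    print(f"{edges[k]:.1f}-{edges[k + 1]:.1f}  {f:3d}  {relative[k]:6.2f}%")
```

The relative frequencies are simply each count divided by n, so they must add up to 100%.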
Histogram (a graphical display of the frequency table) –
display either the absolute or relative frequency
[Figure: histogram of SWI. Vertical axis: Percent, 0 to 60; horizontal axis: SWI, with bars centered at 0.2, 0.6, 1.0, 1.4, 1.8, and 2.2.]
Stem-and-Leaf Plots: every observation is explicitly
displayed in the graphic
e.g. SWI for n = 48 locations
[Figure: SAS stem-and-leaf plot of SWI, with stems running from 22 (top) down to 0 (bottom), a Leaf column, and a count (#) column.]
Multiply Stem.Leaf by 10**-1
To construct a stem-and-leaf plot:
1) find the minimum and maximum values in the dataset
2) decide which digits in a value are significant (“stem”)
and which are less important (“leaves”) and which really
do not provide much information (this part of the value is
ignored or truncated out)
e.g. SWI: min = 0 and max = 2.2685
stem = X.X_ = 2.2_
leaf = _._X = _._7
Observations with the same stem are plotted one next to
the other in increasing leaf order within the stem.
Note the order of the vertical axis: SAS plots stem-and-leaf
plots vertically (not horizontally) and in a mirror
image of the X-axis if you were to rotate the plot to the
horizontal position. Compare the shape to the histogram
shown earlier.
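The construction steps can be sketched in Python as follows. This is a minimal illustration, not the SAS procedure: the function name `stem_and_leaf` and its `leaf_unit` parameter are hypothetical, values are assumed to be non-negative, and the example data are the twelve fish lengths that appear later in these notes.

```python
from collections import defaultdict

def stem_and_leaf(values, leaf_unit=1):
    """Group non-negative values into stems and sorted leaves.

    Each value is truncated to a whole number of leaf units; the last
    digit becomes the leaf and the remaining digits become the stem.
    """
    rows = defaultdict(list)
    for v in values:
        # Guard against floating-point noise, then truncate the ignored digits.
        units = int(round(v / leaf_unit, 9))
        stem, leaf = divmod(units, 10)
        rows[stem].append(leaf)
    lo, hi = min(rows), max(rows)
    # Keep empty stems so gaps stay visible; list the largest stem first.
    return {stem: sorted(rows.get(stem, [])) for stem in range(hi, lo - 1, -1)}

lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]
plot = stem_and_leaf(lengths)
for stem, leaves in plot.items():
    print(f"{stem:2d} | {''.join(str(leaf) for leaf in leaves)}")
```

Listing the largest stem first mirrors the vertical layout described above, and the empty stem for the 30s shows the gap between the small and large fish.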
Box Plots: plots using the “5-number summary”:
{minimum, first quartile, median, third quartile,
maximum}
Order the data for a particular variable from low to high
in value as is done in the stem-and-leaf plot.
The first quartile (also called the lower quartile or 25th
percentile) is that value where 25% of the observations
fall below it and the remainder are its value or higher.
E.g. in the SWI, the first quartile is the 12th smallest of
the 48 ordered values: 0.10.
The median is the middle value or 50th percentile where
half of the data have values less than the median and 50%
have values the same or higher. When the number of
observations is even, the median is the average of the two
middle values.
E.g. in the SWI, the median is the average of the 24th and
25th ordered values 0.30 and 0.34 or median = 0.32.
The third quartile (also called the upper quartile or 75th
percentile) is that value where 75% of the observations
fall below it and the remaining 25% are its value or
higher.
E.g. in the SWI, the third quartile is the 36th smallest of
the 48 ordered values: 1.18.
On a vertical axis showing the range of possible values,
plot a rectangle whose length extends from the first to the
third quartiles and which has a waist (horizontal line) at
the median. Extend lines vertically from the box to the minimum
and the maximum of the data.
[Figure: the SAS stem-and-leaf plot of SWI (stems 22 down to 0; multiply Stem.Leaf by 10**-1) shown beside its box plot. The box spans the first to the third quartile, the line *-----* marks the median, + marks the mean, and vertical lines extend to the minimum and the maximum.]
These can be made fancier by adding the ability to
identify outliers and extreme observations as well.
Boxplots are especially useful for comparing several
different datasets simultaneously
EXAMPLE Summaries of daily high temperatures for
each month at Oberlin, OH
Such graphics allow us to see how the distribution of the
data changes in time, specifically how the monthly
medians and variability of the data change throughout the
year.
The histogram, i.e. the frequency distribution, or stem-and-leaf
plot plays an important role in statistical analysis.
As a consequence we spend a lot of time and effort
describing these distributions. The descriptions include:
Shape of the distribution (skew, modality, symmetry,
gaps, and outlying or other unusual data points)
Symmetric: each half of the histogram is a mirror image
of the other half. The frequency distribution is said to
have equal-length tails.
[Figure: histogram of a symmetric distribution of X. Vertical axis: frequency, 0 to 14; horizontal axis: X, from 32.5 to 72.5 in steps of 2.5. Std. Dev = 9.09, Mean = 52.1, N = 99.]
Skew: the tails are not equal in length; the side of the
longer tail determines the direction of skew. Positive skew
is a long tail toward large values; negative skew is a long
tail toward small values.
[Figures: two skewed histograms. The first plots CONCENTR on a horizontal axis from 0.0 to 25.0 (vertical axis 0 to 6); the second has a horizontal axis from -90 to -10.]
Gaps, outlying or unusual observations: intervals with
no observations or values that are not in the pattern with
the rest of the data.
E.g. number of fish in a tow
Stem Leaf                  #  Boxplot
   7 5                     1     0
   7 4                     1     0
   6 8                     1     0
   6
   5
   5
   4
   4 4                     1     |
   3 566                   3     |
   3 02                    2     |
   2 68                    2     |
   2 113                   3  +-----+
   1 5566                  4  |  +  |
   1 2234                  4  |     |
   0 5555567788           10  *-----*
   0 0000112223333444     16  +-----+
     ----+----+----+----+
Multiply Stem.Leaf by 10**+3
Modality: the mode is the most frequent value in a
dataset when only a few different values are listed. When
the data have many different values (e.g. SWI), the mode
is usually said to be the interval with the most
observations. Data can be unimodal (one mode) or multimodal (one primary mode with secondary modes).
[Figure: histogram illustrating modality. Vertical axis: frequency, 0 to 50; horizontal axis from 40.0 to 95.0 in steps of 5.0.]
Three examples of unimodal, symmetric distributions.
What distinguishes the three distributions?
Special name for distributions which follow a symmetric,
unimodal shape with equal sized tails and with a specific
curve between the mode and the tails: NORMAL
DISTRIBUTION
B) Frequency Distributions for Populations
The population size N is much larger than the sample size n
(N >>> n). If we used the rule of thumb for the number of bars
needed, we’d get an extremely large number:
The tops of the bars approach a smooth line – this is
called the density curve of the population
[Figure: three histograms, for N = 40, N = 400, and N = 40000.]
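To see how quickly the bar count grows, the short Python sketch below applies the n/8 rule of thumb from Section A to the three sizes shown in the figure. The function name is a hypothetical helper, and the use of 8 observations per bar is carried over from the sample rule as an assumption.

```python
def bars_by_rule_of_thumb(size, per_bar=8):
    """Number of bars if each bar holds about `per_bar` observations."""
    return size // per_bar

for N in (40, 400, 40000):
    print(f"N = {N:>5}: about {bars_by_rule_of_thumb(N)} bars")
```

With thousands of narrow bars, the outline of the histogram is hard to tell apart from a smooth curve, which is the idea behind the density curve.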
Topic (5) SUMMARIZING DATA – SPREAD OR VARIABILITY
C) Summary Measures for Samples
1) Measures of Center
a) Median (50th percentile)
Important Point #1: The median is said to be robust
because it is resistant to outliers
Important Point #2: The sample median divides the total
area under the bars in a histogram in half.
Important Point #3: Populations also have medians
called the population median (M). This number divides
the area under the curve describing the population
frequency distribution in halves.
[Figure: SAS stem-and-leaf plot of SWI for the n = 48 locations, repeated from Section A; stems 22 down to 0, multiply Stem.Leaf by 10**-1.]
b) Arithmetic Mean
Defn: The MEAN of a data set is the average value. That is, it is
the value obtained by adding all of the numbers together and
dividing the result by the number of values in the sum (see
symbols later). The SAMPLE MEAN is denoted $\bar{x}$
(pronounced “x-bar”). The POPULATION MEAN is denoted
μ (pronounced “mu”).
EXAMPLE
The fish lengths for a study in the Tennessee River are:
48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44
A dot plot of these data:
                                      •         •
•••    •                              • • •   • •   •
|_________|_________|_________|_________|_________|__
25        30        35        40        45        50
                     Length (cm)
If each point has the same weight, where should the pivot point
be to balance the x-axis (i.e. keep it horizontal)?
Ans:
To calculate the sample mean: Sum the data values and
divide the result by n.
$\bar{x} = \dfrac{48+45+49+51+44+49+46+28.5+26+25.5+25+44}{12} = \dfrac{481}{12} = 40.08$
We say that the fish caught in the study averaged 40.08
cm in length.
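The same calculation can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative, and the median is included purely for comparison with the measure of center introduced earlier.

```python
from statistics import median

lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]  # cm

total = sum(lengths)            # numerator of the sample mean
x_bar = total / len(lengths)    # sample mean
m = median(lengths)             # average of the 6th and 7th ordered values

print(f"sum = {total}, mean = {x_bar:.2f} cm, median = {m} cm")
```

Here the mean (about 40.08 cm) sits below the median (44.5 cm) because the four small fish pull the average toward them.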
Important Point #1: If one were able to observe the value
of every single element in a population (say, every single
fish in the Tennessee River in 1978), then it would be
possible to calculate the population mean μ. Since we
can’t do that, we use the sample mean $\bar{x}$ as an estimate
of the population mean μ.
Important Point #2: Is the mean robust?
Ans:
NOTATION:
X      denotes the NAME of the variable, e.g. LENGTH
x      denotes a value for the named variable, e.g. 48 cm
i      a subscript which denotes the index number for the observation,
       e.g. fish IDs run from 1 to 12
$x_i$  denotes the value for the ith observation (that is, the ith
       observed value), e.g. $x_1 = 48$, $x_2 = 45$, etc.
Σ      denotes the operation “SUM”
So, we can write
$\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n} = \dfrac{x_1 + x_2 + \dots + x_n}{n}$
For frequency distributions, the relationship of the mean
to the median depends on the shape of the distribution:
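Skewed to the right (positive): [figure marking the mean and the median]
Skewed to the left (negative): [figure marking the mean and the median]
Symmetric and unimodal: [figure marking the mean and the median]
Uniform: [figure marking the mean and the median]
Symmetric and bimodal: [figure marking the mean and the median]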
Question: So, which measure of center do you use when?
2) Measures of Spread
How do we capture variability in a set of values using a
single summary statistic?
[Figure: dot plots of three datasets, X, Y, and Z, each on an axis from 0 to 7.]
Note how these datasets differ in their minimum and maximum
values and in how the values vary within each distribution.
a) Range of a Variable
Defn: Range = Maximum value – Minimum Value
e.g. Tenn. River fish lengths: range = 26 cm (51 - 25)
Question: is the range a robust measure for variability??
b) Standard Deviation
The difference $x_i - \bar{x}$ is called the deviation of the ith value
from the sample mean.
EXAMPLE: fish lengths ($\bar{x}$ = 40.08)
[Dot plot of the 12 fish lengths on an axis from 25 to 50 cm, as shown earlier.]
$x_i - \bar{x}$ = deviation (with $\bar{x}$ rounded to 40.1)
25 - 40.1 = -15.1
25.5 - 40.1 = -14.6
26 - 40.1 = -14.1
28.5 - 40.1 = -11.6
44 - 40.1 = 3.9
44 - 40.1 = 3.9
45 - 40.1 = 4.9
46 - 40.1 = 5.9
48 - 40.1 = 7.9
49 - 40.1 = 8.9
49 - 40.1 = 8.9
51 - 40.1 = 10.9
Question: Might these deviations be useful information to
describe the variability in a set of data?
The standard deviation is a measure of the average
deviation of values in a set of data.
FACT: for any set of data, the deviations always sum to 0!
So to be useful, we do the following:
1) calculate the deviations, $x_i - \bar{x}$, i = 1,…,n
2) square each deviation, $(x_i - \bar{x})^2$, i = 1,…,n
3) sum up the squares, $\sum_{i=1}^{n} (x_i - \bar{x})^2$
4) divide by (n-1) {NOT n}: $\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = s^2$
5) take the square root: $\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = s$
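The five steps translate almost line for line into Python. The sketch below uses the fish lengths and the unrounded mean, so the sum of squares comes out as about 1203.417 rather than the 1203.42 obtained with the mean rounded to 40.1; the variable names are illustrative assumptions.

```python
import math
import statistics

lengths = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]  # cm
n = len(lengths)
x_bar = sum(lengths) / n

deviations = [x - x_bar for x in lengths]   # step 1: deviations from the mean
squares = [d ** 2 for d in deviations]      # step 2: squared deviations
sum_sq = sum(squares)                       # step 3: sum of squares
variance = sum_sq / (n - 1)                 # step 4: divide by n - 1 (not n)
s = math.sqrt(variance)                     # step 5: square root

print(f"sum of deviations = {sum(deviations):.10f}")
print(f"s^2 = {variance:.2f} cm^2, s = {s:.2f} cm")

# Cross-check against the standard library's sample standard deviation.
assert math.isclose(s, statistics.stdev(lengths))
```

`statistics.stdev` also divides by n - 1, which is why it agrees with the step-by-step result.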
s , the sample standard deviation, can be thought of as the
typical or average deviation of an observation from the
sample mean.
EXAMPLE: fish lengths
[Dot plot of the fish lengths, as shown earlier.]
Deviations    (Deviations)^2
  -15.1          228.01
  -14.6          213.16
  -14.1          198.81
  -11.6          134.56
    3.9           15.21
    3.9           15.21
    4.9           24.01
    5.9           34.81
    7.9           62.41
    8.9           79.21
    8.9           79.21
   10.9          118.81
  -----         -------
Σ = 0.0*        1203.42
* The rounded deviations listed here sum to -0.2; with the unrounded mean the sum is exactly 0.
Divide by (n-1): $\dfrac{1203.42}{12 - 1} = 109.40\ \text{cm}^2 = s^2$
Take the square root: $\sqrt{109.402\ \text{cm}^2} = 10.46\ \text{cm} = s$
Interpretation?
Defn: the SAMPLE STANDARD DEVIATION is defined by the equation
$\sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = s$.
The SAMPLE VARIANCE is s2.
The POPULATION VARIANCE is denoted σ2.
The POPULATION STANDARD DEVIATION is
denoted by σ .
Question: How is it used?
1. s is the sample estimate of the population standard
   deviation σ (note that σ is almost always unknown!)
2. large values of s (or σ) imply large variability in a
   data set (but it depends on the scale as well)
   a) good for comparing two or more datasets
      when the data have the same units of
      measurement
EXAMPLE Based on a sample of 50 acres on randomly
selected farms in Maryland, the 1998 corn yield averaged
125 bushels per acre with a standard deviation (s.d.) of 40
bushels. The next year, a drought year, had an average
yield of $\bar{x}$ = 83 bushels per acre and s = 25 bushels.
Let’s assume that the frequency distributions of the
number of bushels per acre for fields in each of these 2
years look unimodal and symmetric (i.e. “normal”).
Important and Useful Point: the range and the s.d. of a
set of data that are approximately normally distributed are
related: range = max − min ≈ 6s. So knowing $\bar{x}$ and s
and that the data are “normal” in shape, we can graph and
compare the two years’ yields:
|____________________________________________|
0                                           250
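A small Python sketch of this comparison follows. It assumes, on top of range ≈ 6s, that a symmetric distribution is centered on its mean, so the data run from roughly the mean minus 3s to the mean plus 3s; the function name `approx_range` is hypothetical.

```python
def approx_range(mean, sd):
    """Approximate (min, max) for roughly normal data: mean -/+ 3 sd."""
    return mean - 3 * sd, mean + 3 * sd

yields = {"1998": (125, 40), "drought year": (83, 25)}  # (mean, s) in bushels/acre

for year, (mean, sd) in yields.items():
    low, high = approx_range(mean, sd)
    print(f"{year}: about {low} to {high} bushels per acre (range = {high - low})")
```

Both intervals fit on the 0 to 250 axis above, and the drought year is both lower on average and less spread out.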
c) Coefficient of Variation
$CV = \dfrac{s}{\bar{x}} \times 100\%$.
Note that CV is unitless and is often used to compare
different variables measured on different scales.
EXAMPLE: Tennessee River fish study of the effects of DDT
Fish lengths: $CV = \dfrac{10.46}{40.08} \times 100\% = 26.09\%$
Fish weights: $CV = \dfrac{407.76}{1000.33} \times 100\% = 40.76\%$
DDT concentration: $CV = \dfrac{6.98}{7.21} \times 100\% = 96.87\%$
Question: which random variable (Length, Weight,
DDTconc.) is the most variable?
Question: Suppose I had measured the fish lengths in
inches. Would the CV be the same?
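One way to explore these questions is to compute the CV directly from raw data. The Python sketch below uses the twelve fish lengths and the twelve fish weights listed in these notes; the function name `cv_percent` and the centimeter-to-inch conversion (1 in = 2.54 cm) are the only additions.

```python
import math
from statistics import mean, stdev

def cv_percent(values):
    """Coefficient of variation: sample s.d. divided by the mean, in percent."""
    return stdev(values) / mean(values) * 100

lengths_cm = [48, 45, 49, 51, 44, 49, 46, 28.5, 26, 25.5, 25, 44]
weights = [441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763]

print(f"CV of lengths: {cv_percent(lengths_cm):.2f}%")
print(f"CV of weights: {cv_percent(weights):.2f}%")

# Changing units rescales s and the mean by the same factor.
lengths_in = [x / 2.54 for x in lengths_cm]
print(math.isclose(cv_percent(lengths_cm), cv_percent(lengths_in)))
```

Because the unit conversion multiplies both the standard deviation and the mean by the same constant, the ratio is unchanged.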
d) The Interquartile Range
Defn: Recall the LOWER QUARTILE (Q1) of a dataset is
the 25th percentile of the observations and that the UPPER
QUARTILE (Q3) is the 75th percentile of the
observations.
The INTERQUARTILE RANGE (IQR) is the range of
the middle 50% of the dataset.
IQR = Q3 – Q1 .
EXAMPLE n=12 Fish weights
441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763
Median: $m = \dfrac{917 + 986}{2} = 951.5$
Q1: $\dfrac{544 + 778}{2} = 661$
Q3: $\dfrac{1266 + 1398}{2} = 1332$
IQR: 1332 - 661 = 671
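The fish-weight example can be reproduced with the short Python sketch below. It takes each quartile as the median of the lower or upper half of the ordered data, which matches the arithmetic in this example; other quartile conventions exist and can give slightly different values. The function name `quartiles` is hypothetical, and at least two observations are assumed.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]                 # smallest half of the observations
    upper = data[len(data) - half:]     # largest half of the observations
    return median(lower), median(data), median(upper)

weights = [441, 532, 544, 778, 897, 917, 986, 1023, 1266, 1398, 1459, 1763]
q1, m, q3 = quartiles(weights)
iqr = q3 - q1
print(f"Q1 = {q1}, median = {m}, Q3 = {q3}, IQR = {iqr}")
```

For an odd number of observations this version leaves the middle value out of both halves, which is one common choice among several.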
Question: Is the IQR resistant to outliers?