Chapter 2 - Pegasus @ UCF

Chapter 2: Methods for
Describing Sets of Data
(Pages 19-98)
Homework: 14ab, 36, 43, 45, 51,
56, 64abc, 71, 79, 85, 89, 96
Section 2.1: Numerical Measures of Central
Tendency (center):
• Why are we interested in the central tendency
of a set of measurements?
The central tendency of a set of measurements is the
tendency of the data to cluster (or center) about certain
numerical values. Since it is very important to both
descriptive and inferential statistics, there are many
numerical measures such as mean, median, and mode
available to estimate the central tendency of a set of
measurements. One cannot say which one is the best
measure of the central tendency of a set of data because
data sets have very different characteristics.
• Which one is the best numerical measure for
the central tendency of a set of data?
The most popular measure of central
tendency is the mean (or the arithmetic mean). We
use the Greek letter $\mu$ to stand for the population
mean and $\bar{x}$ to stand for the sample
mean. The mode is a useful numerical measure of
central tendency if one wants to know the
measurement that occurs most frequently in the
data set. The median is a good measure of
central tendency if there are several extremely
large (or extremely small) measurements in the
data.
• Example 2.1 (Basic):
The following data
give the weekly expenditures (in dollars) on
nonalcoholic beverages for 45 households
randomly selected from the 1996 Diary Survey.
 6.5  10.9  12.3   9.0  10.4   8.2   9.2   5.4   4.7
 5.6   8.0  16.5  15.1   0.9   9.8   0.7   3.3   4.9
 7.2  12.7   1.3   4.6   5.4   2.5   9.0   0.9  13.5
10.1  10.3   3.1   2.2   1.6  12.7   2.2  10.6  10.5
 2.4   7.1   1.4  10.1  15.9   7.1   1.3   4.6   2.7
Use the parts of the SAS output shown in the next three tables to find
the sample size, mean, median, and mode of the
weekly expenditures.
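The same four numbers can be cross-checked outside SAS. The following is a minimal sketch using Python's standard `statistics` module; the variable names are illustrative and not part of the course material. Several values tie for most frequent, and the SAS output lists 0.9, the smallest of them, so the sketch takes the minimum of the tied values to match.

```python
import statistics

# Weekly expenditures (dollars) for the 45 households in Example 2.1.
expense = [
    6.5, 10.9, 12.3, 9.0, 10.4, 8.2, 9.2, 5.4, 4.7,
    5.6, 8.0, 16.5, 15.1, 0.9, 9.8, 0.7, 3.3, 4.9,
    7.2, 12.7, 1.3, 4.6, 5.4, 2.5, 9.0, 0.9, 13.5,
    10.1, 10.3, 3.1, 2.2, 1.6, 12.7, 2.2, 10.6, 10.5,
    2.4, 7.1, 1.4, 10.1, 15.9, 7.1, 1.3, 4.6, 2.7,
]

n = len(expense)                      # sample size
mean = statistics.mean(expense)       # arithmetic mean
median = statistics.median(expense)   # middle value of the sorted data

# multimode returns every value that ties for the highest frequency.
modes = statistics.multimode(expense)
mode = min(modes)                     # assumption: report the smallest tie, as the SAS output does

print(f"N = {n}, mean = {mean:.6f}, median = {median}, mode = {mode}")
```

Because nine different values each occur twice, the mode says little about the center of this particular data set.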
Results for Example 2.1
Variable=EXPENSE
Moments
N          45          Sum Wgts   45
Mean       6.986667    Sum        314.4
Std Dev    4.468811    Variance   19.97027
Skewness   0.31744     Kurtosis   -0.88551
USS        3075.3      CSS        878.692
CV         63.96199    Std Mean   0.666171
T:Mean=0   10.4878     Pr>|T|     0.0001
Num ^= 0   45          Num > 0    45
M(Sign)    22.5        Pr>=|M|    0.0001
Sgn Rank   517.5       Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   16.5        99%        16.5
75% Q3     10.3        95%        15.1
50% Med    7.1         90%        12.7
25% Q1     2.7         10%        1.3
Range      15.8
Q3-Q1      7.6
Mode       0.9
Extremes
Lowest     Obs         Highest    Obs
0.7        (27)        12.7       (45)
0.9        (34)        13.5       (22)
0.9        (14)        15.1       (26)
1.3        (39)        15.9       (24)
1.3        (20)        16.5       (41)
Example 2.2 (Intermediate): Between 1879 and 1882, Michelson conducted an
experiment to determine the velocity of light.
Table 2.1 presents Michelson's
determinations minus 299000, in km/sec.
Table 2.1 Velocity of Light
870 890 850 1000 960 830 880 880 890 910
870 840 740 980 940 790 880 910 810 920
810 780 900 930 960 810 880 850 810 890
740 810 1070 650 940 880 860 870 820 860
810 760 930 760 880 880 720 840 800 880
940 810 850 810 800 830 720 840 770 720
950 790 950 1000 850 800 620 850 760 840
800 810 980 1000 860 790 860 840 740 850
810 820 980 960 900 760 970 840 750 850
870 850 880 960 840 800 950 840 760 780
Result From Example 2.2
Variable=SPEED
N          100
Mean       852.2       Sum        85220
Std Dev    78.96528    Variance   6235.515
Skewness   -0.01125    Kurtosis   0.347244
USS        73241800    CSS        617316
CV         9.26605     Std Mean   7.896528
T:Mean=0   107.9209    Pr>|T|     0.0001
Num ^= 0   100         Num > 0    100
M(Sign)    50          Pr>=|M|    0.0001
Sgn Rank   2525        Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   1070        99%        1035
75% Q3     895         95%        980
50% Med    850         90%        960
25% Q1     805         10%        760
0% Min     620         5%         730
                       1%         635
Range      450
Q3-Q1      90
Mode       810
Extremes
Lowest     Obs         Highest    Obs
620        (67)        980        (83)
650        (34)        1000       (4)
720        (60)        1000       (64)
720        (57)        1000       (74)
720        (47)        1070       (33)
• The data set is skewed to the right if there are several
extremely large measurements (see Figure 2.2). In
this case the mean is greater than the median because
the extremely large values have a stronger impact
on the mean.
• The data set is skewed to the left if there are several
extremely small measurements (see Figure 2.3).
In this case the mean is smaller than the median because
the extremely small values likewise have a stronger impact
on the mean.
• Data sets are well behaved if they are
symmetric (see Figure 2.1). Symmetric data sets
possess several good properties that will be discussed
in later chapters.
Figure 2.1 Symmetric Distribution (mean, median, and mode overlap)
Figure 2.2 Skewed to the Right (mean > median)
Figure 2.3 Skewed to the Left (mean < median)
Section 2.2: Numerical Measures of Variability
• Why are we interested in numerical
measures of the variability of a set of
measurements?
The variability of a set of measurements is the
"spread" of the data. A measure of variability is as important
as a measure of central tendency, because many
significantly different data sets can have the same
mean, median, and mode. We introduce three numerical
measures of variability: the range, the variance, and the standard
deviation.
• Why is the range sometimes not a good
numerical measure of the variability of a
set of data?
The variability of two sets of data can be very
different even if they have a similar range, because
the range depends only on the largest and smallest
measurements, and one extremely large
measurement (or one extremely small
measurement) can alter the range significantly.
We use the symbols $s$ and $s^2$ to stand for the
sample standard deviation and the sample
variance, respectively, and the Greek
symbols $\sigma$ and $\sigma^2$ to stand for the population
standard deviation and the population
variance, respectively. Both the standard
deviation and the variance are good measures
of the variability of a set of measurements.
• Is there any set of measurements that can
be completely explained by the sample
mean and the sample standard deviation?
Yes. A set of measurements can be
described completely by the sample mean
and the sample standard deviation if its
relative frequency distribution is
similar to Figure 2.1.
Example 2.3 (Basic): Find the variance, the
standard deviation and the range from SAS
output in Example 2.1.
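As a consistency check, the variance and standard deviation in the Example 2.1 output can be reproduced from other entries of the same table. The sketch below assumes that CSS is the corrected sum of squares $\sum (x_i - \bar{x})^2$, which is how that SAS label is usually read; the range is the maximum minus the minimum, both taken from the Extremes table.

```latex
s^2 = \frac{\mathrm{CSS}}{n-1} = \frac{878.692}{44} \approx 19.9703, \qquad
s = \sqrt{s^2} \approx 4.4688, \qquad
\text{range} = 16.5 - 0.7 = 15.8
```

These agree with the Variance, Std Dev, and Range entries printed by SAS.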
Example 2.4 (Intermediate):
a) Find the variance, the standard deviation
and the range from SAS output in Example
2.2.
b) Find the variance, the standard deviation,
and the range without the three extreme values.
c) Which measure is most affected by the
deletion of the extreme values?
d) Compare the mean, the median, and the
mode before and after the deletion of the extreme values.
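The before-and-after comparison can also be reproduced directly from Table 2.1. The following is a minimal sketch with Python's `statistics` module; it assumes the three extreme values are 620, 650, and 1070 (the two smallest determinations and the largest), and the helper name `summarize` is illustrative only.

```python
import statistics

# Michelson's determinations minus 299000 (km/sec), from Table 2.1.
speed = [
    870, 890, 850, 1000, 960, 830, 880, 880, 890, 910,
    870, 840, 740, 980, 940, 790, 880, 910, 810, 920,
    810, 780, 900, 930, 960, 810, 880, 850, 810, 890,
    740, 810, 1070, 650, 940, 880, 860, 870, 820, 860,
    810, 760, 930, 760, 880, 880, 720, 840, 800, 880,
    940, 810, 850, 810, 800, 830, 720, 840, 770, 720,
    950, 790, 950, 1000, 850, 800, 620, 850, 760, 840,
    800, 810, 980, 1000, 860, 790, 860, 840, 740, 850,
    810, 820, 980, 960, 900, 760, 970, 840, 750, 850,
    870, 850, 880, 960, 840, 800, 950, 840, 760, 780,
]


def summarize(values):
    """Return the sample size, mean, median, standard deviation, and range."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std_dev": statistics.stdev(values),   # sample standard deviation (divides by n - 1)
        "range": max(values) - min(values),
    }


# Assumption: these are the three extreme values referred to in part (b).
extremes = {620, 650, 1070}
trimmed = [v for v in speed if v not in extremes]

before = summarize(speed)
after = summarize(trimmed)
for name in before:
    print(f"{name:8s} {before[name]:10.3f} {after[name]:10.3f}")
```

Under that assumption the range falls from 450 to 280 while the median stays at 850, which suggests the range is the measure most sensitive to extreme values.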
Result From Example 2.4
(Without Extreme values)
Variable=SPEED
N          97
Mean       854.433     Sum        82880
Std Dev    70.31135    Variance   4943.686
Skewness   0.206141    Kurtosis   -0.57312
USS        71290000    CSS        474593.8
CV         8.229007    Std Mean   7.139036
T:Mean=0   119.6847    Pr>|T|     0.0001
Num ^= 0   97          Num > 0    97
M(Sign)    48.5        Pr>=|M|    0.0001
Sgn Rank   2376.5      Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   1000        99%        1000
75% Q3     890         95%        980
50% Med    850         90%        960
25% Q1     810         10%        760
0% Min     720         5%         740
                       1%         720
Range      280
Q3-Q1      80
Mode       810
Section 2.3: Interpreting the Standard Deviation
The standard deviation provides a measure of the
variability of a sample: the sample with the larger
standard deviation has the higher variability.
The standard deviation also provides information
to answer questions such as "How many
measurements are within 2 standard deviations of
the mean?" for any specific data set. We need to
understand the following two rules in order to
answer such questions.
Chebyshev's Rule:
For any set of measurements, at least $1 - 1/k^2$
of the measurements will fall within k standard
deviations of the mean, for any number k greater
than 1.
(a) At least 3/4 of the measurements will fall within
the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a sample
and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(b) At least 8/9 of the measurements will fall within
the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a sample
and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
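The bound in Chebyshev's Rule is simple enough to tabulate. The following is a minimal sketch; the function name `chebyshev_lower_bound` is illustrative and not taken from the course material.

```python
def chebyshev_lower_bound(k: float) -> float:
    """Smallest fraction of measurements guaranteed within k standard deviations."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k greater than 1")
    return 1 - 1 / k**2


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.4f}")
```

For k = 2, 3, and 4 this gives 0.7500, 0.8889, and 0.9375, that is, 3/4, 8/9, and 15/16 of the measurements.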
The Empirical Rule:
The empirical rule is a rule of thumb that applies
only to samples or populations with frequency
distributions that are mound-shaped, i.e. the
frequency distributions are similar to a bell.
(a) Approximately 68% of the measurements will
fall within the interval $(\bar{x} - s, \bar{x} + s)$ for a sample
and $(\mu - \sigma, \mu + \sigma)$ for a population.
(b) Approximately 95% of the measurements will
fall within the interval $(\bar{x} - 2s, \bar{x} + 2s)$ for a
sample and $(\mu - 2\sigma, \mu + 2\sigma)$ for a population.
(c) Approximately 99.7% of the measurements
will fall within the interval $(\bar{x} - 3s, \bar{x} + 3s)$ for a
sample and $(\mu - 3\sigma, \mu + 3\sigma)$ for a population.
Example 2.5 (Basic):
For any set of data, what can be said about
the percentage of measurements contained
in each of the following intervals?
(a) $\mu - 2\sigma$ to $\mu + 2\sigma$.
(b) $\mu - 3\sigma$ to $\mu + 3\sigma$.
(c) $\mu - 4\sigma$ to $\mu + 4\sigma$.
Example 2.6 (Intermediate): The mean and
standard deviation of the heights of a group of one hundred
NBA players are 70.25 inches and 3.25
inches, respectively.
(a) How many players in this group are taller
than 76.75 inches, based on the Empirical
Rule?
(b) Can we answer part (a) based on
Chebyshev's Rule?
(c) What assumption is required in order to
apply the Empirical Rule?
Section 2.4: Numerical Measures of Relative
Standing
• Can you say that you did poorly on an exam if
you got 70 points?
You might have done poorly, or you might have done a fair job on
this exam. You could even have the top score if all other
students got less than 60 points on an extremely
difficult exam. Your performance should be
judged by your relative standing instead of the
numerical score. Descriptive measures of the
relationship of a measurement to the rest of the
data are called measures of relative standing.
Example 2.7 (Basic): Based on the SAS output for
Example 2.1, find the following percentiles:
(a) 10th percentile
(b) 25th percentile
(c) 50th percentile
(d) 55th percentile
(e) 90th percentile
Note:
1. The median is the 50th percentile of a quantitative data
set.
2. The upper quartile is the 75th percentile and the lower
quartile is the 25th percentile of a quantitative data
set.
• Quantile: Let q be any number between 0 and 1. The qth
quantile, denoted by Q(q), is a number such that a fraction
q of the measurements fall below it and a fraction (1-q)
of the measurements fall above it.
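The SAS tables in this chapter are labeled Quantiles(Def=5). The sketch below assumes that this refers to the "empirical distribution function with averaging" rule: write n*q = j + g with j a whole number; if g = 0 the quantile is the average of the jth and (j+1)th ordered values, otherwise it is the (j+1)th ordered value. The function name and the small tolerance are illustrative choices, and the data are the expenditures from Example 2.1.

```python
import math


def quantile_def5(data, q):
    """q-th quantile under the assumed 'Def=5' rule (EDF with averaging)."""
    xs = sorted(data)
    n = len(xs)
    position = n * q
    j = math.floor(position + 1e-9)   # tolerance guards against floating-point noise
    g = position - j
    if j <= 0:
        return xs[0]
    if j >= n:
        return xs[-1]
    if g < 1e-9:                      # n*q is a whole number: average two neighbors
        return (xs[j - 1] + xs[j]) / 2
    return xs[j]                      # otherwise take the (j+1)-th ordered value


# Weekly expenditures (dollars) from Example 2.1.
expense = [
    6.5, 10.9, 12.3, 9.0, 10.4, 8.2, 9.2, 5.4, 4.7,
    5.6, 8.0, 16.5, 15.1, 0.9, 9.8, 0.7, 3.3, 4.9,
    7.2, 12.7, 1.3, 4.6, 5.4, 2.5, 9.0, 0.9, 13.5,
    10.1, 10.3, 3.1, 2.2, 1.6, 12.7, 2.2, 10.6, 10.5,
    2.4, 7.1, 1.4, 10.1, 15.9, 7.1, 1.3, 4.6, 2.7,
]

for q in (0.10, 0.25, 0.50, 0.55, 0.90):
    print(f"Q({q:.2f}) = {quantile_def5(expense, q)}")
```

Under this assumed rule the 10th, 25th, 50th, and 90th percentiles match the values printed in the Example 2.1 output (1.3, 2.7, 7.1, and 12.7), and the 55th percentile, which SAS does not print, comes out as 7.2.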
• Sample Z Score:
Suppose x is a measurement from a sample with
mean $\bar{x}$ and standard deviation $s$. The sample Z
score of x is
$Z = \frac{x - \bar{x}}{s}$.
• Population Z Score:
Suppose x is a measurement from a population
with mean $\mu$ and standard deviation $\sigma$. The
population Z score of x is
$Z = \frac{x - \mu}{\sigma}$.
Example 2.8: The following data give the yearly
contributions (in dollars) to a local church by 35
households randomly selected from the 1996
Interview Survey.
30 50 27 25 100 300 100 75 200
76 25 15 60 240 100 130 15 200
18 10 25 50 125 200 400 500 300
34 87 24 25 140 275 250 150
(a) Find the mean and median of this set of data.
(b) Find the standard deviation and range.
(c) Compute the Z score for 200.
(d) How many measurements fall within two
standard deviations of the mean?
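The four parts of this example can be worked through in a few lines. The following is a minimal sketch using Python's `statistics` module with the 35 contributions listed above; the variable names are illustrative.

```python
import statistics

# Yearly contributions (dollars) for the 35 households in Example 2.8.
dollars = [
    30, 50, 27, 25, 100, 300, 100, 75, 200,
    76, 25, 15, 60, 240, 100, 130, 15, 200,
    18, 10, 25, 50, 125, 200, 400, 500, 300,
    34, 87, 24, 25, 140, 275, 250, 150,
]

mean = statistics.mean(dollars)
median = statistics.median(dollars)
s = statistics.stdev(dollars)              # sample standard deviation
data_range = max(dollars) - min(dollars)

z_200 = (200 - mean) / s                   # sample Z score of the measurement 200

# Count the measurements inside (mean - 2s, mean + 2s).
lower, upper = mean - 2 * s, mean + 2 * s
within_two_sd = sum(lower <= x <= upper for x in dollars)

print(f"mean = {mean:.4f}, median = {median}")
print(f"std dev = {s:.4f}, range = {data_range}")
print(f"Z(200) = {z_200:.3f}")
print(f"{within_two_sd} of {len(dollars)} measurements lie within 2 standard deviations")
```

Here 33 of the 35 measurements (about 94%) fall within two standard deviations of the mean, comfortably above the 75% that Chebyshev's Rule guarantees.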
Univariate Procedure
Variable=DOLLARS
N          35          Sum Wgts   35
Mean       125.1714    Sum        4381
Std Dev    120.8157    Variance   14596.44
Skewness   1.374005    Kurtosis   1.620988
USS        1044655     CSS        496279
CV         96.52021    Std Mean   20.42159
T:Mean=0   6.129369    Pr>|T|     0.0001
Num ^= 0   35          Num > 0    35
M(Sign)    17.5        Pr>=|M|    0.0001
Sgn Rank   315         Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   500         99%        500
75% Q3     200         95%        400
50% Med    87          90%        300
25% Q1     25          10%        18
0% Min     10          5%         15
                       1%         10
Range      490
Q3-Q1      175
Mode       25
Extremes
Lowest     Obs         Highest    Obs
10         (20)        275        (33)
15         (17)        300        (6)
15         (12)        300        (27)
18         (19)        400        (25)
24         (30)        500        (26)
Section 2.5: Graphic Methods for Describing
Data (Bar Chart, Pie Chart, and Histogram)
• Why do we need to use graphic methods to
describe data?
The mean and standard deviation alone cannot characterize the
wide variety of distributions that data can have. We can
easily find examples in which several significantly different
data sets have the same mean and standard deviation.
• Can we find several different data sets with the
same mean and standard deviation?
The three data sets in Figure 2.4 all have the same mean, median,
standard deviation, and variance. However, they are very
different.
Figure 2.4 Dot plots of data sets A, B, and C on a common axis running from 82 to 122
We will not cover bar charts, pie charts, or histograms
this semester. First, bar charts and pie charts pose
several perception problems, as indicated by the well-known
book "The Elements of Graphing Data" (William
S. Cleveland, 1995). Second, we focus on
quantitative data this semester, but both pie charts and
bar charts are graphical tools for qualitative data. Third,
there is more information encoded in a well-designed stem-and-leaf display than in a histogram.
• Box plots and stem-and-leaf displays are the
graphical methods discussed in this course.
Section 2.6: Stem-and-Leaf Display
Figure 2.5 shows a stem-and-leaf display of the
ozone data (Tukey 1977). It is a hybrid between a
data table and a histogram since it shows numerical
values as numerals but its profile is very much like a
histogram (see Figure 2.6).
One can follow these steps to construct a
stem-and-leaf display by hand.
1. Define the stem and leaf to be used.
2. Write the stems in a column, arranged either from the smallest
stem at the top to the largest stem at the bottom, or in the
reverse order.
3. If the leaves consist of more than one digit, drop the
digits after the first digit.
4. Record the leaf for each measurement in the row
corresponding to its stem.
5. Find the median and highlight the leaf corresponding to
the median.
6. Count the number of leaves in the row with the median
and put the count, in parentheses, in the depth column.
7. Count the number of leaves for each row from the top
row toward the median row (not including it) and put the
cumulative counts in the depth column.
8. Count the number of leaves for each row from the
bottom row toward the median row (not including it) and put the cumulative
counts in the depth column.
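The steps above can be sketched in code. The following is a minimal illustration rather than a definitive implementation: it assumes non-negative whole-number measurements, takes the last digit as the leaf and the remaining digits as the stem (so step 3 never applies), and lists the smallest stem first. The function name `stem_and_leaf` is illustrative; the data are the 48 weights from Example 2.9.

```python
from collections import defaultdict


def stem_and_leaf(data):
    """Return (depth, stem, leaves) rows, smallest stem first."""
    values = sorted(int(v) for v in data)
    leaves = defaultdict(list)
    for v in values:
        leaves[v // 10].append(str(v % 10))            # steps 1 and 4

    stems = list(range(min(leaves), max(leaves) + 1))   # step 2, keeping empty stems
    n = len(values)
    # Stems holding the middle value(s); they coincide when one row holds the median.
    mid_lo = values[(n - 1) // 2] // 10
    mid_hi = values[n // 2] // 10

    rows, seen = [], 0
    for stem in stems:
        count = len(leaves[stem])
        seen += count
        if stem == mid_lo == mid_hi:
            depth = f"({count})"                        # step 6: median row
        elif stem <= mid_lo:
            depth = str(seen)                           # step 7: cumulative from the top
        else:
            depth = str(n - seen + count)               # step 8: cumulative from the bottom
        rows.append((depth, stem, "".join(leaves[stem])))
    return rows


# Weights of 48 male students, from Example 2.9.
weights = [
    123, 128, 130, 135, 140, 142, 145, 151, 155, 155, 155, 156,
    156, 156, 160, 160, 163, 165, 165, 170, 170, 170, 170, 173,
    174, 175, 175, 180, 182, 185, 185, 185, 185, 186, 190, 190,
    191, 195, 195, 198, 200, 205, 206, 208, 215, 220, 220, 230,
]

rows = stem_and_leaf(weights)
for depth, stem, leaf in rows:
    print(f"{depth:>5}  {stem:>3} | {leaf}")
```

Here a stem of 12 stands for the 120s. When the two middle values fall in different rows, this sketch prints no parenthesized row and simply counts in from both ends, which is one possible convention.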
Figure 2.5 Stem-and-Leaf
Depth  Stem  Leaf
    3    17  034
    5    16  99
    8    15  025
   12    14  1236
   16    13  1346
   23    12  2244455
   30    11  1334899
   36    10  013338
   43     9  1244899
   59     8  0000002235667779
 (11)     7  11111122355
   55     6  0114444668889
   42     5  1222259
   35     4  023677779
   26     3  11223788888888
   12     2  3444467888
    2     1  44
Figure 2.6 Stem-and-Leaf Display with 90 Degree Rotation (the display in Figure 2.5 turned on its side)
Univariate Procedure
Variable=OZONE
N          125         Sum Wgts   125
Mean       79.288      Sum        9911
Std Dev    39.90954    Variance   1592.771
Skewness   0.510449    Kurtosis   -0.49653
USS        983327      CSS        197503.6
CV         50.3349     Std Mean   3.569618
T:Mean=0   22.2119     Pr>|T|     0.0001
Num ^= 0   125         Num > 0    125
M(Sign)    62.5        Pr>=|M|    0.0001
Sgn Rank   3937.5      Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   174         99%        173
75% Q3     103         95%        152
50% Med    72          90%        136
25% Q1     47          10%        31
0% Min     14          5%         24
                       1%         14
Range      160
Q3-Q1      56
Mode       38
Advantages of the stem-and-leaf display:
• Both the numerical values and the graphical shape
can be seen on a stem-and-leaf display.
• It is very easy to locate an individual measurement
on a stem-and-leaf display.
• You can sort a relatively small data set by hand
using a stem-and-leaf display.
• You can read off information such as the
median, mode, range, maximum, minimum, upper
quartile, lower quartile, and inner quartile range
from a stem-and-leaf display.
• We can assess the symmetry
of a set of measurements from
the stem-and-leaf display. A set of
measurements is symmetric if its relative
frequency distribution looks similar to
Figure 2.1. The relative frequency
distribution of the ozone data can be seen from
the rotated stem-and-leaf display (Figure
2.6). The ozone data are skewed to the right
because there are more observations with
small values than observations with large
values.
Example 2.9: The following table contains 48
measurements of the weights of a group of male
students in STA 3023 last year.
Table 2.1
123 128 130 135 140 142 145 151 155 155 155 156
156 156 160 160 163 165 165 170 170 170 170 173
174 175 175 180 182 185 185 185 185 186 190 190
191 195 195 198 200 205 206 208 215 220 220 230
a) Construct a stem-and-leaf display for the data in Table 2.1.
b) Are the data symmetric?
c) Find the mean, the median, the range, the standard
deviation, the lower quartile, and the upper quartile from
the SAS output.
Depth  Stem  Leaves
    2   120  3,8
    4   130  0,5
    7   140  0,2,5
   14   150  1,5,5,5,6,6,6
   19   160  0,0,3,5,5
  (8)   170  0,0,0,0,3,4,5,5
   21   180  0,2,5,5,5,5,6
   14   190  0,0,1,5,5,8
    8   200  0,5,6,8
    4   210  5
    3   220  0,0
    1   230  0
Figure 2.7 Stem-and-Leaf Display with 90 Degree Rotation (the display above turned on its side)
SAS Output for Example 2.9
Variable=WEIGHT
N          48          Sum Wgts   48
Mean       174.3333    Sum        8368
Std Dev    25.41932    Variance   646.1418
Skewness   0.070001    Kurtosis   -0.43366
USS        1489190     CSS        30368.67
CV         14.58087    Std Mean   3.668963
T:Mean=0   47.5157     Pr>|T|     0.0001
Num ^= 0   48          Num > 0    48
M(Sign)    24          Pr>=|M|    0.0001
Sgn Rank   588         Pr>=|S|    0.0001
Quantiles(Def=5)
100% Max   230         99%        230
75% Q3     190.5       95%        220
50% Med    173.5       90%        208
25% Q1     156         10%        140
0% Min     123         5%         130
                       1%         123
Range      107
Q3-Q1      34.5
Mode       170
Section 2.7: Box Plots
• Inner Quartile Range (IQR): The upper quartile
minus the lower quartile.
• Step: 1.5*IQR
• Upper Inner Fence: Upper quartile plus one step.
• Lower Inner Fence: Lower quartile minus one step.
• Upper Outer Fence: Upper quartile plus two steps.
• Lower Outer Fence: Lower quartile minus two
steps.
• Outside Value: Any measurement that is greater
than the upper inner fence or less than the lower inner
fence.
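The definitions above translate directly into arithmetic. The following is a minimal sketch; the function names are illustrative, and the quartiles used (25 and 200) are the Q1 and Q3 values from the SAS output for Example 2.8.

```python
def box_plot_fences(q1, q3):
    """Compute the IQR, the step, and the inner and outer fences from the quartiles."""
    iqr = q3 - q1
    step = 1.5 * iqr
    return {
        "iqr": iqr,
        "step": step,
        "lower_inner": q1 - step,
        "upper_inner": q3 + step,
        "lower_outer": q1 - 2 * step,
        "upper_outer": q3 + 2 * step,
    }


def is_outside(x, fences):
    """True if x lies beyond either inner fence, i.e. x is an outside value."""
    return x < fences["lower_inner"] or x > fences["upper_inner"]


# Quartiles taken from the SAS output for Example 2.8 (yearly contributions).
fences = box_plot_fences(q1=25, q3=200)
print(fences)
print(is_outside(500, fences), is_outside(400, fences))
```

With these quartiles the IQR is 175, the step is 262.5, and the inner fences are -237.5 and 462.5, so the largest contribution (500) is an outside value while 400 is not.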
Elements of a Box Plot:
• A rectangle (the box) is drawn with its ends at the lower
and upper quartiles (the hinges). The median of the data is shown
in the box, usually by a line through the box.
• The points at a distance of 1.5*IQR from each hinge mark
the inner fences of the data set. Horizontal lines (the whiskers) are
drawn from each hinge to the most extreme
measurement inside the inner fence.
• A second pair of fences, the outer fences, lies at a
distance of 3*IQR from the hinges. One symbol
(usually "*" in SAS) is used to represent measurements
falling between the inner and outer fences. Another
symbol (usually "0" in SAS) is used to represent
measurements beyond the outer fences.
Interpretation of Box Plots
• The median shows the central tendency of the data.
• The length of the box (IQR) provides a measure of the
variability of the middle 50% of the data.
• The individual outside values give the viewer an
opportunity to consider the presence of outliers, that is,
observations that seem unusually, or even implausibly,
large or small. Outside values are not necessarily
outliers, but any outlier will almost certainly appear as
an outside value.
• The box plot allows a partial assessment of symmetry.
The box plot is symmetric about its median if the data are
symmetric. If one whisker is clearly longer, the data are
probably skewed in the direction of the longer whisker.
Example 2.10:
Based on the box plot for the data in
Example 2.1 (Figure 2.8), answer the following:
(a) Are the data symmetric?
(b) Are there any outside values?
(c) Find the upper quartile, the median, the
lower quartile, the minimum value, and the
maximum value.
Figure 2.8 Box Plot for Data in Example 2.1 (horizontal axis: weekly expenditure in dollars)
Example 2.11:
Based on the box plot for the data in Example 2.2 (Figure 2.9),
answer the following:
a. Are the data symmetric?
b. Are there any outside values?
c. Find the upper quartile, the median, the lower
quartile, the minimum value, and the maximum value.
d. Compute the inner quartile range and the step.
Figure 2.9 Velocity of the Light (horizontal axis: speed of light)
Example 2.12:
Based on the box plot for the data in Example 2.8 (Figure 2.10),
answer the following:
(a) Are the data symmetric?
(b) Are there any outside values?
(c) Find the upper quartile, the median, the lower
quartile, the minimum value, and the maximum value.
(d) Compute the inner quartile range and the step.
Figure 2.10 Box Plot for Data in Example 2.8 (horizontal axis: yearly contributions)
Quick Review:
• Mean, Median, and Mode
• Range, Standard Deviation, and Variance
• Upper Quartile, Lower Quartile, and IQR
• Chebyshev's Rule and Empirical Rule
• Z-Score
• Symmetry and Skewness
• Mound-Shaped distribution
• Box-Plot and Stem-and-Leaf Display