MATH 2620 B

Chapter 2 – Descriptive Statistics
2.1 Frequency Distributions
Types of Data
Qualitative Data – nonnumerically valued data.
Quantitative Data – numerically valued data.
Frequency and Relative Frequency Distributions:
Example 1
A nursery school offers programs for 4-year-olds ranging
from a 1-day-a-week program to a 5-day-a-week program. To
help in planning, the school's director surveyed parents
regarding the type of program they prefer. The following
data, which represent the preferred number of days per
week, were obtained.
2  3  1  2  2  3  3  1  2  2  3  4  2
1  4  4  2  3  5  1  3  2  5  2  2  5
Construct the frequency and relative frequency distributions
and answer the following.
(a) What percentage of parents did not prefer the
2-day-a-week program?
(b) What percentage of parents prefer the 4-day-a-week
or 5-day-a-week program?
Solution
The data set contains four 1s, ten 2s, six 3s, three 4s, and
three 5s, for 26 data items in total. For example, the relative
frequency of the data value 1 is (4/26)(100) = 15.4%.
Number of Days   Frequency   Relative Frequency (%)
1                4           15.4%
2                10          38.5%
3                6           23.1%
4                3           11.5%
5                3           11.5%
Total:           26          100%
a. Since 38.5% of parents prefer the 2-day-a-week
program, the percentage of parents that do not prefer the
2-day-a-week program is 100 - 38.5 = 61.5%.
b. The percentage of parents that prefer the 4-day-a-week
program or the 5-day-a-week program is 11.5 + 11.5 =
23%.
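The tally above can be checked with a short script. This is only a sketch: the helper name `relative_frequency_table` is made up for illustration and is not from the textbook.

```python
from collections import Counter

# Preferred number of days per week, as listed in Example 1.
days = [2, 3, 1, 2, 2, 3, 3, 1, 2, 2, 3, 4, 2,
        1, 4, 4, 2, 3, 5, 1, 3, 2, 5, 2, 2, 5]

def relative_frequency_table(data):
    """Return {value: (frequency, relative frequency in %)}, sorted by value."""
    counts = Counter(data)
    n = len(data)
    return {v: (counts[v], 100 * counts[v] / n) for v in sorted(counts)}

table = relative_frequency_table(days)
for value, (freq, pct) in table.items():
    print(f"{value}  {freq:>2}  {pct:5.1f}%")

# (a) parents who did not prefer the 2-day-a-week program
not_two = 100 - table[2][1]
# (b) parents who prefer the 4-day-a-week or 5-day-a-week program
four_or_five = table[4][1] + table[5][1]
print(f"(a) {not_two:.1f}%  (b) {four_or_five:.1f}%")
```

Because the script adds unrounded percentages, part (b) prints 23.1% (exactly 6/26) rather than the 23% obtained by adding the two rounded table entries.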
Grouped-Data Table
Example 1 We are given the mathematics
achievement test scores for a sample of 50
sixth-grade students at Maple Elementary School.
75  49  84  55  61  77  63  84  41  67
48  61  85  69  72  51  98  79  65  57
46  51  79  88  64  61  54  75  77  63
65  57  85  89  67  68  53  65  71  71
71  49  83  55  60  54  71  50  63  77
Test Scores   Frequency   Relative Frequency (%)
40-49         5           10
50-59         10          20
60-69         15          30
70-79         12          24
80-89         7           14
90-99         1           2
Total         50          100%
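The grouping step can be sketched in code as follows. The function `grouped_frequencies` and its parameters are illustrative names, and the sketch assumes whole-number scores and equal-width classes starting at 40.

```python
# Test scores for the 50 sixth-grade students in the grouped-data example.
scores = [75, 49, 84, 55, 61, 77, 63, 84, 41, 67,
          48, 61, 85, 69, 72, 51, 98, 79, 65, 57,
          46, 51, 79, 88, 64, 61, 54, 75, 77, 63,
          65, 57, 85, 89, 67, 68, 53, 65, 71, 71,
          71, 49, 83, 55, 60, 54, 71, 50, 63, 77]

def grouped_frequencies(data, start, width, num_classes):
    """Count the values in each class start..start+width-1, and so on."""
    counts = [0] * num_classes
    for x in data:
        index = (x - start) // width  # which class the value belongs to
        if not 0 <= index < num_classes:
            raise ValueError(f"{x} falls outside the classes")
        counts[index] += 1
    return counts

freqs = grouped_frequencies(scores, start=40, width=10, num_classes=6)
n = len(scores)
for i, f in enumerate(freqs):
    lower = 40 + 10 * i
    print(f"{lower}-{lower + 9}  {f:>2}  {100 * f / n:.0f}%")
```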
Cumulative Frequency and Cumulative Percent
Frequency Distributions:
Freq. Distribution          Cumulative Freq. Distribution
Test Scores   Frequency     Test Scores   Cumulative Freq.
40-49         5             ≤ 49          5
50-59         10            ≤ 59          15
60-69         15            ≤ 69          30
70-79         12            ≤ 79          42
80-89         7             ≤ 89          49
90-99         1             ≤ 99          50
Cumulative Relative Frequency Distribution
Freq. Distribution          Cumulative Relative Freq. Distribution
Test Scores   Frequency     Test Scores   Cumulative Relative Freq. (%)
40-49         5             ≤ 49          10
50-59         10            ≤ 59          30
60-69         15            ≤ 69          60
70-79         12            ≤ 79          84
80-89         7             ≤ 89          98
90-99         1             ≤ 99          100
Questions: What are we looking for when we look
at data?
a. The shape of the distribution of the data.
b. The symmetry or skew of the data.
c. The center of the data.
d. The spread of the data.
Graphical Displays:
Histogram -- A histogram is a graphical
representation of quantitative data that can help
answer the questions above.
Example 2 Draw the histogram for the data in
Example 1.
1. Guidelines for making a histogram:
a. Choose between 5 and 20 classes (intervals for a
histogram). A histogram is sensitive to the
number of classes, so you may want to try
several possibilities in practice. Rule of thumb:
about $\sqrt{n}$ classes for a histogram, where n is
the number of data values.
b. All class widths must be the same.
c. The lower limit of the smallest class must not
exceed the smallest data value. The upper limit
of the largest class must be at least the
largest value.
d. Each item goes into one and only one class; that is,
the classes are non-overlapping.
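As a rough stand-in for a drawn histogram, the class frequencies of the test-score data can be printed as rows of asterisks. This is only a text sketch (a real histogram uses adjacent vertical bars), and `text_histogram` is a hypothetical helper name.

```python
classes = ["40-49", "50-59", "60-69", "70-79", "80-89", "90-99"]
freqs = [5, 10, 15, 12, 7, 1]  # frequencies from the grouped-data table

def text_histogram(labels, counts, symbol="*"):
    """Return one text row per class; the row length shows the frequency."""
    return [f"{label} | {symbol * count} ({count})"
            for label, count in zip(labels, counts)]

lines = text_histogram(classes, freqs)
print("\n".join(lines))
```

Even in this crude form the shape is visible: one peak in the 60-69 class and a tail toward the high scores.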
Homework: 15-23(odd), 29, 31, 32 (pp. 43-45)
2.2 Pie Charts and Bar Graphs
A Histogram is designed for use with quantitative
data. Two methods for displaying qualitative data are
Pie Charts and Bar Graphs.
Example 3. (Reference Example 4 on page 49)
Display the relative frequency distribution of the
data using
a. a pie chart
b. a bar graph
Stem-and-Leaf Diagram
We are given the mathematics achievement test
scores for a sample of 50 sixth-grade students at
Maple Elementary School. Draw a stem-and-leaf
display for this data.
75  49  84  55  61  77  63  84  41  67
48  61  85  69  72  51  98  79  65  57
46  51  79  88  64  61  54  75  77  63
65  57  85  89  67  68  53  65  71  71
71  49  83  55  60  54  71  50  63  77
Solution
Leaves in the order read from the data:
Stem| Leaves
4| 8 6 9 9 1
5| 1 7 5 5 1 4 4 3 0 7
6| 5 1 9 1 4 7 0 1 8 3 5 5 3 7 3
7| 5 1 9 2 7 1 9 5 7 1 1 7
8| 4 5 5 3 8 9 4
9| 8
Leaves sorted in increasing order:
Stem| Leaves
4| 1 6 8 9 9
5| 0 1 1 3 4 4 5 5 7 7
6| 0 1 1 1 3 3 3 4 5 5 5 7 7 8 9
7| 1 1 1 1 2 5 5 7 7 7 9 9
8| 3 4 4 5 5 8 9
9| 8
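An ordered stem-and-leaf display can also be generated mechanically. The sketch below assumes two-digit scores, so the stem is the tens digit and the leaf is the units digit; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict

scores = [75, 49, 84, 55, 61, 77, 63, 84, 41, 67,
          48, 61, 85, 69, 72, 51, 98, 79, 65, 57,
          46, 51, 79, 88, 64, 61, 54, 75, 77, 63,
          65, 57, 85, 89, 67, 68, 53, 65, 71, 71,
          71, 49, 83, 55, 60, 54, 71, 50, 63, 77]

def stem_and_leaf(data):
    """Map each stem (tens digit) to its leaves (units digits) in sorted order."""
    stems = defaultdict(list)
    for x in sorted(data):  # sorting first keeps the leaves ordered
        stems[x // 10].append(x % 10)
    return dict(stems)

diagram = stem_and_leaf(scores)
for stem in sorted(diagram):
    print(f"{stem}| {' '.join(str(leaf) for leaf in diagram[stem])}")
```

The number of leaves on each stem equals the class frequency in the grouped-data table (5, 10, 15, 12, 7, 1), which is a quick check on a hand-drawn display.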
Homework: 15, 17, 20, 21 (pp.54-55)
2.3 Measures of Center
1. Mean (Also called the Arithmetic Mean)
The mean of a data set is the sum of the
observations divided by the number of
observations.
If the data values are $x_1, x_2, x_3, \ldots, x_n$, then
Mean = $\frac{x_1 + x_2 + x_3 + \cdots + x_n}{n}$
Two Notations for the mean:
a. Sample mean: $\bar{x}$ (read as x-bar)
b. Population Mean: $\mu$ ("Mu")
Thus $\bar{x} = \frac{\sum x}{n}$ where n = number of items in
the sample data, and
$\mu = \frac{\sum x}{N}$ where N = size of the population.
Note: $\Sigma$ (called sigma) is a Greek symbol
that signifies summation.
Example 1: Find the mean for this sample
data: 2, 3, 6, 7, 7, 8, 9, 9, 9, 10
Solution:
$\bar{x} = \frac{\sum x}{n} = \frac{2 + 3 + 6 + 7 + 7 + 8 + 9 + 9 + 9 + 10}{10} = \frac{70}{10} = 7$
Example 2: A sample of five families in
Harrold, Iowa showed the following annual
family incomes:
$17,500, $23,000, $24,000, $26,000,
$320,000
Find the mean for this data.
Soln. $\bar{x} = \frac{\sum x}{n} = \frac{17500 + 23000 + 24000 + 26000 + 320000}{5}$
= 410500/5 = $82,100
Extreme Value/Outlier: a data value that is
much larger or much smaller than most
of the other data values.
Note: In the presence of extreme value(s),
the mean provides a poor description of the
center of the data set.
2. Median
The median is the middle value of the
data when the data have been arranged in
ascending (or descending) order. If the number
of values is even, the median is the mean of the
two middle values.
Example 3: Find the median for the data set
1 and data set 2.
Data Set 1: 7, 2, 8, 5, 9, 4, 7, 8, 6
Data Set 2: 7, 2, 8, 5, 9, 4, 8, 8
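Solution: Sorted, data set 1 is 2, 4, 5, 6, 7, 7, 8, 8, 9;
the middle (5th) value gives a median of 7.
Sorted, data set 2 is 2, 4, 5, 7, 8, 8, 8, 9; the two
middle values are 7 and 8, so the median is
(7 + 8)/2 = 7.5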
Example 4: Find the median for the data set
given in Example 2.
Solution: Median = $24,000
Note: The median is not affected by extreme
values. Thus in the presence of extreme
values, the median may be a better indicator of
the center.
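The contrast between the mean and the median can be checked with Python's standard `statistics` module. A small sketch using the data from Examples 2 and 3:

```python
from statistics import mean, median

# Annual family incomes from Example 2; $320,000 is the extreme value.
incomes = [17_500, 23_000, 24_000, 26_000, 320_000]

print(mean(incomes))       # mean with the extreme value included
print(median(incomes))     # median of the same data
print(mean(incomes[:-1]))  # mean of the four typical incomes only

# Example 3: odd and even numbers of observations.
data_set_1 = [7, 2, 8, 5, 9, 4, 7, 8, 6]
data_set_2 = [7, 2, 8, 5, 9, 4, 8, 8]
print(median(data_set_1), median(data_set_2))
```

Dropping the single extreme income moves the mean from $82,100 to $22,625, while the median of the full data set stays at $24,000.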
3. Mode
The most frequently occurring data value in
a set of data is called the mode. That is, the
mode is the value that occurs with greatest
frequency.
Example 5: Find the mode for the given data:
2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8
Solution: Mode = 8
Example 6: Find the mode for the given data:
2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8, 2
Solution:
Modes = 2 and 8 (each occurs four times)
Note: Such a distribution is called bimodal.
Example 7: Find the mode for the given data:
2, 3, 8, 7, 9
Solution: Mode is undefined, since every value occurs exactly once.
Note: The mode is seldom used in practice, except to
answer the special kind of question it is
designed to answer, for example:
a. What is the most watched TV show?
b. What is the best selling automobile?
c. What is the most common cause of death?
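The three mode examples can be reproduced with a frequency count. The sketch below follows the convention used in these notes that the mode is undefined when no value repeats; the function name `modes` is illustrative, and the data set is assumed to be non-empty.

```python
from collections import Counter

def modes(data):
    """Return all modes in increasing order, or [] if no value repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs once: mode undefined
    return sorted(v for v, c in counts.items() if c == highest)

print(modes([2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8]))     # Example 5
print(modes([2, 3, 3, 2, 2, 8, 7, 8, 7, 9, 8, 8, 2]))  # Example 6
print(modes([2, 3, 8, 7, 9]))                          # Example 7
```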
Example 8: 10 out of the 11 data values in a
data set are 11, 13, 15, 9, 4, 12, 10, 7, 8, and 15.
If the mean for the data is 10, what is the
missing item?
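Soln. The 11 values must sum to 11 × 10 = 110. The ten
known values sum to 104, so the missing item is
110 - 104 = 6.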
Homework: 1, 2, 3, 5, 14, 15-23(odd), 29, and 30
pp. 64-66
2.4 Measures of Variation
Range = Largest Value – Smallest Value
Example 9: Given the three data sets below,
find the range, mean, and median.
Data Set 1:
99, 91, 84, 84, 80, 80, 80, 76, 76, 69, 61
Data Set 2:
99, 80, 80, 80, 80, 80, 80, 80, 80, 80, 61
Data Set 3:
99, 99, 99, 99, 99, 80, 61, 61, 61, 61, 61
Soln: For all of these data sets,
Range = 99 – 61 = 38 and
Mean = Median = 80
Note: The range is based on only two of the
items in the data set and thus is influenced too
much by extreme values.
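A quick check that the three data sets share the same range, mean, and median even though they are spread out very differently. The dictionary layout and the name `data_range` are just one way to organize the sketch.

```python
from statistics import mean, median

data_sets = {
    "Data Set 1": [99, 91, 84, 84, 80, 80, 80, 76, 76, 69, 61],
    "Data Set 2": [99, 80, 80, 80, 80, 80, 80, 80, 80, 80, 61],
    "Data Set 3": [99, 99, 99, 99, 99, 80, 61, 61, 61, 61, 61],
}

def data_range(data):
    """Range = largest value minus smallest value."""
    return max(data) - min(data)

summary = {name: (data_range(d), mean(d), median(d))
           for name, d in data_sets.items()}
for name, (rng, avg, med) in summary.items():
    print(f"{name}: range = {rng}, mean = {avg}, median = {med}")
```

Since none of these three numbers can tell the data sets apart, a measure that uses every observation is needed to describe variation.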
Variance:
Given the data 46, 54, 42, 46, 32. The mean ($\mu$)
for this data is 44.
$X$       $X - \mu$    $(X - \mu)^2$
46      2          4
54      10         100
42      -2         4
46      2          4
32      -12        144
Total   0          256
Variance = average squared deviation from the
mean
= 256/5 = 51.2
Population Variance
$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$ where N is the size of the
population.
Sample Variance
$s^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$
where n is the size of the sample.
Easier Computational Formula for Variance
$\sigma^2 = \frac{\sum X^2 - N\mu^2}{N}$
$s^2 = \frac{\sum X^2 - n\bar{x}^2}{n - 1}$
Standard Deviation = $\sqrt{\text{Variance}}$
So, Sample Standard Deviation = $s = \sqrt{s^2}$
Population Standard Deviation = $\sigma = \sqrt{\sigma^2}$
Example 10: Given the sample data below, find
the sample standard deviation.
9, 11, 16, 14, 12, 12, 10, 9, 9
Solution:
Sum of the x- values = 102
Sum of the squares of x-values = 1204
Sample mean = 102/9 = 11.33
Sample variance = (1204 - 9(102/9)^2)/8 = (1204 - 1156)/8 = 6
Sample s.d. = $\sqrt{6}$ = 2.45
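The sample variance in Example 10 can be computed from both the definitional and the computational formula, and checked against the standard library. A sketch; the variable names are illustrative. Carrying the mean at full precision matters here: rounding it to 11.33 before squaring shifts the variance from 6 to about 6.08.

```python
from math import sqrt
from statistics import variance

sample = [9, 11, 16, 14, 12, 12, 10, 9, 9]
n = len(sample)
sum_x = sum(sample)                  # sum of the x-values
sum_x2 = sum(x * x for x in sample)  # sum of the squares of the x-values
x_bar = sum_x / n                    # sample mean, not rounded

# Definitional formula: sum of squared deviations divided by n - 1.
s2_def = sum((x - x_bar) ** 2 for x in sample) / (n - 1)
# Computational formula: (sum of squares - n * mean^2) / (n - 1).
s2_comp = (sum_x2 - n * x_bar ** 2) / (n - 1)
# Same formula with the mean rounded to two decimals first.
s2_rounded = (sum_x2 - n * 11.33 ** 2) / (n - 1)

print(sum_x, sum_x2)
print(round(s2_def, 4), round(s2_comp, 4), variance(sample))
print(round(s2_rounded, 2))    # effect of rounding the mean too early
print(round(sqrt(s2_def), 2))  # sample standard deviation
```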
Some Uses of Mean and Standard Deviation
Data: $x_1, x_2, x_3, \ldots, x_n$
Z-score = $\frac{x - \bar{x}}{s}$
where s = sample s.d.
Z-score for any data item is referred to as its
standardized value. It can be interpreted as a
measure of the relative location of an item in the
data.
Example 11: If the Z-score of a data item is 2,
the data value is 2 standard deviations above
(larger than) the sample mean.
CHEBYSHEV’S THEOREM (P 77)
For any data set, at least
75% of the items must lie within two
standard deviations of the mean;
89% of the items must lie within three
standard deviations of the mean;
94% of the items must lie within four
standard deviations of the mean.
Example 12: Midterm scores for 100 students in
a college statistics course had a mean of 70 and
s.d. of 5. Using Chebyshev's theorem,
(a) at least how many students scored between 60
and 80?
(b) at least how many students scored between 50
and 90?
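The percentages quoted in Chebyshev's theorem come from the general bound 1 - 1/k^2 for k > 1 standard deviations (75%, 88.9%, and 93.75% for k = 2, 3, 4). The sketch below applies that bound to Example 12; the function name is illustrative, and the count is rounded up because a number of students must be whole.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum fraction of any data set within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2

n, mean_score, sd = 100, 70, 5
min_counts = []
for low, high in [(60, 80), (50, 90)]:
    k = (high - mean_score) / sd  # half-width of the interval in s.d. units
    bound = chebyshev_lower_bound(k)
    min_counts.append(math.ceil(n * bound))
    print(f"{low} to {high}: k = {k:g}, at least {bound:.2%} of the scores")

print(min_counts)
```

So at least 75 students scored between 60 and 80, and at least 94 scored between 50 and 90, whatever the shape of the distribution.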
The Empirical Rule:
For data having an approximately bell-shaped
distribution,
Approximately 68% of the data fall within
1 standard deviation of the mean;
Approximately 95% of the data fall within
2 standard deviations of the mean;
Approximately 99.7% of the data fall within
3 standard deviations of the mean.
Example 13: In a class with 50 students, the
mean score on a test was 60 while the standard
deviation was 12. It is given that the scores are
normally distributed. How many students
a. scored between 48 and 72?
b. scored between 36 and 84?
c. scored between 24 and 96?
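Example 13 can be worked by matching each interval to 1, 2, or 3 standard deviations and applying the empirical-rule percentage. A sketch; `expected_count` is a hypothetical helper that only handles intervals centered on the mean.

```python
EMPIRICAL_RULE = {1: 0.68, 2: 0.95, 3: 0.997}  # k standard deviations -> fraction

def expected_count(n, mean, sd, low, high):
    """Approximate number of values in [low, high] for bell-shaped data."""
    k_low = (mean - low) / sd
    k_high = (high - mean) / sd
    if k_low != k_high or k_high not in EMPIRICAL_RULE:
        raise ValueError("interval must be mean ± 1, 2, or 3 standard deviations")
    return n * EMPIRICAL_RULE[k_high]

intervals = [(48, 72), (36, 84), (24, 96)]
expected_counts = [expected_count(50, 60, 12, lo, hi) for lo, hi in intervals]
for (lo, hi), count in zip(intervals, expected_counts):
    print(f"between {lo} and {hi}: about {count:.2f} students")
```

Rounded to whole students, the answers are about 34, about 48 (47.5 before rounding), and about 50 (essentially the whole class).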
Detecting Outliers:
Sometimes a set of data has one or more
items with unusually large or unusually
small values. Extreme values such as these
are called Outliers. Experienced
statisticians take steps to identify outliers
and then review each one carefully. An
outlier may have been an item for which the
value has been incorrectly recorded. If so,
the value can be corrected before proceeding
with the analysis. An outlier may also be an
item that was incorrectly included in the
data set; if so, it can be removed. Finally, an
outlier may just be an unusual item that has
been correctly recorded and does belong in
the data set. In such cases, the item should
remain in the data set.
Use of Z-score to identify outliers:
RULE: An item with z-score > 3 or z-score < -3
will be treated as an outlier.
Example 14: Given the data set below, identify
outliers, if any, in the data.
46, 54, 42, 46, 32
Soln. Note that $\bar{x}$ = 44 and s.d. = 8
$x$     $\bar{x}$    $x - \bar{x}$    z-score $= (x - \bar{x})/s$
46    44     2         0.25
54    44     10        1.25
42    44     -2        -0.25
46    44     2         0.25
32    44     -12       -1.50
There are no outliers in this data.
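The z-score table and the outlier rule can be reproduced in a few lines. A sketch using the sample standard deviation from the standard library:

```python
from statistics import mean, stdev

data = [46, 54, 42, 46, 32]
x_bar = mean(data)  # sample mean
s = stdev(data)     # sample standard deviation (divides by n - 1)

z_scores = [(x - x_bar) / s for x in data]
# Rule from the notes: flag an item if its z-score is above 3 or below -3.
outliers = [x for x, z in zip(data, z_scores) if abs(z) > 3]

for x, z in zip(data, z_scores):
    print(f"{x}: z = {z:+.2f}")
print("outliers:", outliers)
```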
Homework: 2, 4, 6, 9, 13, 15, 16, 17, 19, 21, 25,
26, 27, 28, 29, 30 pp. 80-83.
2.5 Measures of Position
Percentile: A percentile is a numerical measure
that also locates values of interest in the data set.
A percentile provides information regarding how
the data items are spread over the interval from
the lowest value to the highest value.
Defn. The pth percentile of a data set is a value
such that at least p percent of the items take on
this value or less and at least (100 – p) percent of
the items take on this value or more.
Step 1: Sort the data in ascending order, that is,
from the smallest to the largest.
Step 2: Find i = (p/100)n where n is the
number of data values.
Step 3: If i is not an integer, then
pth percentile = $x_{\mathrm{INT}(i + 1)}$
If i is an integer, then
pth percentile = $\frac{x_i + x_{i+1}}{2}$
Here $x_k$ denotes the kth value in the sorted data
and INT drops the fractional part.
Example 16: Given the data below, find the 50th
and 90th percentiles.
26, 4, 5, 20, 6, 12, 15, 15, 15, 8, 9, 10, 14,
18, 16, 17
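Soln: Step 1: Sort the data in ascending order,
that is, from the smallest to the largest.
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16,
17, 18, 20, 26
Step 2: n = 16, so for p = 90, i = (90/100)(16) = 14.4,
and for p = 50, i = (50/100)(16) = 8.
Step 3: 14.4 is not an integer, so
90th percentile = 15th value = 20;
8 is an integer, so
50th percentile = (8th value + 9th value)/2 = (14 + 15)/2 = 14.5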
Note: The median and the 50th percentile are the
same.
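The three-step procedure can be sketched as a small function. This follows the location rule used in these notes (other textbooks and software use slightly different percentile rules), and it assumes 0 < p < 100.

```python
import math

def percentile(data, p):
    """p-th percentile using the location rule i = (p/100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    values = sorted(data)  # Step 1: sort in ascending order
    n = len(values)
    i = p * n / 100        # Step 2: location index
    if i != int(i):
        # Step 3, i not an integer: take the value in position INT(i + 1).
        # Lists are 0-based, so position floor(i) + 1 is index floor(i).
        return values[math.floor(i)]
    i = int(i)
    # Step 3, i an integer: average the values in positions i and i + 1.
    return (values[i - 1] + values[i]) / 2

data = [26, 4, 5, 20, 6, 12, 15, 15, 15, 8, 9, 10, 14, 18, 16, 17]
print(percentile(data, 90), percentile(data, 50))
```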
Quartiles
It is often desired to divide a data set into four
parts with each part containing one-fourth of the
data.
$Q_1$ = First Quartile = 25th percentile
$Q_2$ = Second Quartile = 50th percentile
$Q_3$ = Third Quartile = 75th percentile
Example 17: For the data given in Example 16,
find the first, second, and third quartiles.
Soln.
$Q_1$ = 8.5, $Q_2$ = 14.5, $Q_3$ = 16.5
The Interquartile Range (IQR)
IQR = $Q_3 - Q_1$
Note: The IQR gives the range of the middle
50% of the observations.
The Five-Number Summary
The five-number summary of a data set consists
of the minimum, maximum, and quartiles
written in increasing order: Min, $Q_1$, $Q_2$, $Q_3$, and
Max.
Example 18: Reference Example 16. Find the
five-number summary.
Soln. The data is
4, 5, 6, 8, 9, 10, 12, 14, 15, 15, 15, 16,
17, 18, 20, 26
Minimum = 4, $Q_1$ = 8.5, $Q_2$ = 14.5, $Q_3$ = 16.5,
and Maximum = 26.
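The quartiles, the IQR, and the five-number summary all follow from the percentile rule. The sketch below repeats a compact version of that rule so it runs on its own; the function names are illustrative.

```python
import math

def percentile(data, p):
    """p-th percentile by the location rule i = (p/100) * n, for 0 < p < 100."""
    values = sorted(data)
    i = p * len(values) / 100
    if i != int(i):
        return values[math.floor(i)]  # position INT(i + 1), 0-based index
    i = int(i)
    return (values[i - 1] + values[i]) / 2

def five_number_summary(data):
    """Return (Min, Q1, Q2, Q3, Max)."""
    q1, q2, q3 = (percentile(data, p) for p in (25, 50, 75))
    return min(data), q1, q2, q3, max(data)

data = [26, 4, 5, 20, 6, 12, 15, 15, 15, 8, 9, 10, 14, 18, 16, 17]
summary = five_number_summary(data)
iqr = summary[3] - summary[1]  # IQR = Q3 - Q1
print(summary, iqr)
```

For this data the IQR is 16.5 - 8.5 = 8, so the middle half of the observations spans 8 units.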
Boxplot (P 89)
A boxplot is based on the five-number summary
and can be used to provide a graphical display of
the center and variation of a data set.
Notes:
1. There are two ways of identifying
outliers – using z-scores, and upper and
lower fences. These methods do not
necessarily identify the same items as
outliers.
2. An advantage of using boxplots for
analysis of data is that we need very few
numerical calculations. Just arrange data
in the ascending order and compute the
five-number summary. You do not have
to compute the mean and the standard
deviation.
The Shape of Distributions
(Ref. Page 63 in the textbook)
Homework: 1, 3, 5, 7, 8, 15, 16, 21 (pp. 93-95)