MAT 155: Describing, Exploring, and Comparing Data
0201-NotesCh2-3.doc
Page 1 of 8
Notes for Chapter 2 Summarizing and Graphing Data
Chapter 3 Describing, Exploring, and Comparing Data
Frequency Distribution, Graphic Representation,
Measures of Center, Variation, & Standing
In these chapters, we will study (1) visual representation of data, (2) measures of center and
variation, and (3) relative standings and exploratory analysis. These three areas will include (1)
frequency distribution, relative frequency distribution, cumulative frequency distribution,
histogram, frequency polygon, stem-and-leaf plots, and scatter plots; (2) arithmetic mean,
median, mode, midrange, weighted mean, range, standard deviation, coefficient of variation,
empirical rule, and Chebyshev’s Theorem; and (3) z-scores, quartiles, percentiles, outliers, and
box plots (5-number summary).
Visual Representation of Data
A frequency distribution is one convenient way to represent a large amount of data in a small
amount of space using two columns: (1) categories or classes and (2) frequency. There are
some general guidelines that we should use when constructing a frequency distribution.
First, determine the number of classes, k, by using the “2 to the k rule.” Find the smallest
integer k so that $2^k \ge n$, where n is the total number of observations or data values. For
example, if n = 50 data values, we would find k = 6 classes. [$2^4 = 16$, $2^5 = 32$, and $2^6 = 64$] Of
course, we have some freedom that allows us to choose the number of classes different from the
k-value when actually constructing the frequency distribution. We may choose a different k-value to make the distribution more appealing.
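The “2 to the k rule” can be sketched in a few lines of Python. The function name `num_classes` is only illustrative and is not part of these notes.

```python
def num_classes(n: int) -> int:
    """Return the smallest integer k such that 2**k >= n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    k = 0
    while 2 ** k < n:
        k += 1
    return k


print(num_classes(50))  # 6, because 2**5 = 32 < 50 <= 64 = 2**6
```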
NOTE: Classes should be mutually exclusive and collectively exhaustive. This would ensure
that each data value would fit into only one class, and every value would belong to a class. Also,
we should try to have at least 5 and not more than 15 classes. Thus, we will try to satisfy the
inequality $5 \le k \le 15$. We should avoid, if possible, open-ended classes.
Second, determine the class interval or class width. Two guidelines that may be used to
determine the class interval, i, are
(1) $i = \dfrac{\text{largest data value} - \text{smallest data value}}{\text{number of classes}}$
(2) $i = \dfrac{\text{largest data value} - \text{smallest data value}}{1 + 3.322(\log n)}$
Suppose the smallest and largest values of the 50 values from above are 12 and 88, respectively.
By (1), $i = \dfrac{88 - 12}{6} \approx 12.667$, and by (2), $i = \dfrac{88 - 12}{1 + 3.322(\log 50)} \approx 11.439$.
Again, we have some freedom to choose the class width (interval) to be a whole number if we
wish. Depending on our choice for i, we may have to change the number of classes from 6.
NOTE: The class intervals should be equal.
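As a rough check of the two class-width guidelines, the sketch below reproduces the arithmetic for the 50 values ranging from 12 to 88. It assumes that "log" means the base-10 logarithm; the function names are illustrative.

```python
import math


def class_width_by_classes(smallest, largest, k):
    """Guideline (1): range divided by the number of classes."""
    return (largest - smallest) / k


def class_width_by_log(smallest, largest, n):
    """Guideline (2): range divided by 1 + 3.322 * log10(n)."""
    return (largest - smallest) / (1 + 3.322 * math.log10(n))


w1 = class_width_by_classes(12, 88, 6)
w2 = class_width_by_log(12, 88, 50)
print(round(w1, 3), round(w2, 3))  # 12.667 11.439
```

Either result would then be adjusted to a convenient whole number such as 12 or 15.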
We will set up our classes so that the lower limit (left value) of the class is included in that class,
and the upper limit (right value) of the class is not included in that class. Returning to the 50
data values ranging from 12 to 88, let us set up the classes. If we choose i = 12 and start the first
class with a lower limit of 12, we would need 7 classes in order to include the largest value of 88.
If we choose i = 15 and start with 10 as the lower limit of the first class, we would need only 6
classes to include the value of 88. NOTE: Some people recommend that the lower limit of the
first class be a whole number multiple of the smallest data value. However, this is not essential,
and we will use that only when it is convenient. Based on the information presented above, we
may choose either of the class setups below.
Table A                   Table B
Classes: k=7, i=12        Classes: k=6, i=15
12-24                     10-25
24-36                     25-40
36-48                     40-55
48-60                     55-70
60-72                     70-85
72-84                     85-100
84-96
Once we set up the classes, we count and record the number of values in each class. In Table A,
we record, in the frequency column, the number of values so that $12 \le \text{value} < 24$, $24 \le \text{value} < 36$, etc.
Table C. 50 Data Values
57  43  88  20  78  73  46  41  73  72
25  12  21  70  25  78  22  26  23  87
79  17  13  16  69  24  73  75  48  42
19  42  81  54  16  40  70  37  64  17
74  61  24  39  81  19  64  20  85  46
Using the guidelines, Table A, and the data in Table C above, we get the frequency distribution
in Table D below.
Table D. Frequency Distribution
Classes               Frequency
12-24                 13
24-36                 5
36-48                 9
48-60                 3
60-72                 6
72-84                 11
84-96                 3
Sum of freq. = n =    50
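Counting the Table C values into the Table A classes can be sketched as follows. The helper name `frequency_distribution` is illustrative; each class includes its lower limit and excludes its upper limit, as described above.

```python
data = [
    57, 43, 88, 20, 78, 73, 46, 41, 73, 72,
    25, 12, 21, 70, 25, 78, 22, 26, 23, 87,
    79, 17, 13, 16, 69, 24, 73, 75, 48, 42,
    19, 42, 81, 54, 16, 40, 70, 37, 64, 17,
    74, 61, 24, 39, 81, 19, 64, 20, 85, 46,
]


def frequency_distribution(values, start, width, k):
    """Count values into k classes [start + j*width, start + (j+1)*width)."""
    counts = [0] * k
    for v in values:
        idx = (v - start) // width
        if not 0 <= idx < k:
            raise ValueError(f"{v} does not fit into any class")
        counts[idx] += 1
    return counts


freq = frequency_distribution(data, start=12, width=12, k=7)
for j, f in enumerate(freq):
    lower = 12 + 12 * j
    print(f"{lower}-{lower + 12}: {f}")
```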
The relative frequency distribution is constructed from the frequency distribution by dividing
each frequency by the sum of the frequencies. For example 13/50 = 0.26, 5/50 = 0.10, etc.
Table E below is the relative frequency distribution constructed from Table D.
Table E. Relative Frequency Distribution
Classes     Frequency     Relative Frequency
12-24       13            0.26
24-36       5             0.10
36-48       9             0.18
48-60       3             0.06
60-72       6             0.12
72-84       11            0.22
84-96       3             0.06
Total =     50            1.00
From Table E, we see that about 26% and 22% of the data values are in the intervals [12,24) and
[72,84), respectively.
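The relative frequencies are simply each frequency divided by n. A minimal sketch using the Table D frequencies:

```python
freq = [13, 5, 9, 3, 6, 11, 3]  # frequencies from Table D
n = sum(freq)

# Divide each class frequency by the total number of values.
rel_freq = [f / n for f in freq]
print([round(r, 2) for r in rel_freq])
# [0.26, 0.1, 0.18, 0.06, 0.12, 0.22, 0.06]
```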
In addition to the relative frequency distribution, we will discuss the less than cumulative
frequency distribution (LCF).
The LCF (Table F) shows the accumulated frequency that is less than the upper limit value in
the respective class.
Table F. Less Than Cumulative Frequency Distribution
Classes     Frequency     Less than Cumulative Frequency (<cf)
12-24       13            13
24-36       5             18
36-48       9             27
48-60       3             30
60-72       6             36
72-84       11            47
84-96       3             50
Total =     50            --
We see that 13 values are smaller than 24. The 13 in the first class plus 5 in the second class
give 18 values less than 36. [The rest of the values in the <cf column are obtained in the same way: 18 + 9
= 27, 27 + 3 = 30, 30 + 6 = 36, 36 + 11 = 47, and 47 + 3 = 50.]
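The running totals in the <cf column can be reproduced with `itertools.accumulate` from the Python standard library, as in this short sketch:

```python
from itertools import accumulate

freq = [13, 5, 9, 3, 6, 11, 3]  # frequencies from Table D

# Running total of the frequencies gives the less-than cumulative frequencies.
less_than_cf = list(accumulate(freq))
print(less_than_cf)  # [13, 18, 27, 30, 36, 47, 50]
```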
The histogram is constructed by using the class limits on the horizontal axis and the frequencies
on the vertical axis. The histogram below on the left was constructed using Statdisk; the one on the right
was constructed using Excel.
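[Figure: Histogram of the frequency distribution. Vertical axis: Frequency (ticks at 0, 5, 10, 15). Horizontal axis: Classes (12-24, 24-36, 36-48, 48-60, 60-72, 72-84, 84-96).]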
The stem-and-leaf plot is a good representation for raw data. All values are shown in a concise
form as shown by the Minitab output of the following data: 42, 45, 51, 61, 69, 76, 78, 78, 72, 62,
51, and 44.
Table H. Current worksheet: Cities.mtw
Character Stem-and-Leaf Display
Stem-and-leaf of Atlanta   N = 12
Leaf Unit = 1.0
  3    4  245
  5    5  11
 (3)   6  129
  4    7  2688
[Callout: Interval is 10. Stem increases by 10: 40, 50, 60, 70.]
The stem-and-leaf indicates that there are 12 data values, and each leaf represents 1 unit. As
we read the first row, we see there are 3 values in the 40’s. These are 42, 44, and 45. There
are 2 (5-3=2) values in the 50’s: 51 and 51. There are (3) values in the 60’s: 61, 62, and 69.
Finally, there are 4 values in the 70’s: 72, 76, 78, and 78.
The first column accumulates from the top down until we reach (3) [Don’t be concerned about
the meaning of this value.] Then the accumulation starts at the bottom and works upward.
Adding the (3), the number above it, and the number below it gives us the n = 12. [5 + 3 + 4 = 12]
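A basic stem-and-leaf grouping of the same 12 values can be sketched in Python. This is not Minitab's algorithm and it omits the cumulative first column; it only splits each value into a tens digit (stem) and a units digit (leaf). The name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit values by stem (tens digit) with sorted leaves (units digit)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        stems[stem].append(leaf)
    return dict(stems)


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
plot = stem_and_leaf(atlanta)
for stem in sorted(plot):
    print(stem, "".join(str(leaf) for leaf in plot[stem]))
```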
Measures of Center and Variation
We will first discuss the population mean and the sample mean. When talking about the
population and a sample, we refer to a parameter and a statistic, respectively. Notations for the
population and sample means are $\mu$ (mu) and $\bar{X}$ (X-bar), respectively. Notice in the formulas
that N (upper case) represents the total number of observations in the population, and n (lower
case) represents the total number of observations in the sample.
Arithmetic Mean for Population and Sample
Type of Data     Population                             Sample
Raw              $\mu = \dfrac{\sum X}{N}$              $\bar{X} = \dfrac{\sum X}{n}$
Grouped          $\mu = \dfrac{\sum fX}{\sum f}$        $\bar{X} = \dfrac{\sum fX}{\sum f}$
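The two mean formulas can be sketched as below. For the grouped case, the sketch assumes that X is the class midpoint (a common convention that the table itself does not state), and it uses the Table D classes and frequencies. The raw example reuses the 12 stem-and-leaf values.

```python
def mean_raw(values):
    """Arithmetic mean of raw data: sum of X divided by the number of values."""
    return sum(values) / len(values)


def mean_grouped(midpoints, freqs):
    """Mean of grouped data: sum of f*X divided by sum of f (X = class midpoint, assumed)."""
    return sum(f * x for f, x in zip(freqs, midpoints)) / sum(freqs)


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(mean_raw(atlanta))  # 60.75

midpoints = [18, 30, 42, 54, 66, 78, 90]  # midpoints of 12-24, 24-36, ..., 84-96
freqs = [13, 5, 9, 3, 6, 11, 3]
print(mean_grouped(midpoints, freqs))
```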
The arithmetic mean (1) is calculated for interval-level and ratio-level data, (2) includes all data
values, (3) is unique for a set of data, (4) is useful in comparing two or more groups of data, and
(5) is affected by extremely large or extremely small values.
The median is a measure of center that requires little or no calculation for raw data. To find the
median for raw data, we use the following procedure. (1) Order the data from smallest value to
largest value or vice-versa. (2) If the number of data values is odd, choose the value in the
middle so that the same number of values are to the left as are to the right of the middle value.
(3) If the number of data values is even, choose the two values in the middle so that the same
number of values are to the left as are to the right of the two middle values. (4) Calculate the
average of those two values.
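The procedure for raw data translates directly into code. A minimal sketch (the function name is illustrative):

```python
def median_raw(values):
    """Median of raw data: middle value, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]  # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two middle values


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(median_raw(atlanta))  # 61.5, the average of 61 and 62
```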
To find the median for grouped data, we use the following procedure. (1) In the frequency
distribution, form the less than cumulative frequency (<CF) column. (2) Find one-half the sum of
the frequencies, n/2. (3) Find the largest number in the <CF column that is not larger than n/2.
(4) Circle the row (Class, frequency, <CF) below the number in Step 3. This row contains the
median. (5) Subtract the number found in Step 3 (CF) from n/2, divide by the frequency (f)
circled in Step 4, and multiply by the class interval (i). (6) Add the answer from Step 5 to the
lower class limit (L) of the row circled in Step 4. This represents the median for the grouped data. The
following formula summarizes the six-step procedure given above.
Median for grouped data: $\text{Median} = L + \dfrac{\frac{n}{2} - CF}{f}\,(i)$
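The grouped-median steps can be sketched as follows, here applied to the Table D distribution (n = 50, class interval 12). The function name and argument layout are illustrative.

```python
def median_grouped(lower_limits, freqs, width):
    """Median = L + ((n/2 - CF) / f) * i for the class that contains the median."""
    n = sum(freqs)
    half = n / 2
    cf_before = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if cf_before + f > half:
            # This is the "circled" row: the first class whose <CF exceeds n/2.
            return lower + (half - cf_before) / f * width
        cf_before += f
    raise ValueError("frequencies must sum to a positive number")


lower_limits = [12, 24, 36, 48, 60, 72, 84]
freqs = [13, 5, 9, 3, 6, 11, 3]
m = median_grouped(lower_limits, freqs, width=12)
print(round(m, 2))  # 45.33, from 36 + ((25 - 18) / 9) * 12
```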
The mode is a measure of center that identifies the data value that appears most frequently.
There will be no mode if all data values appear the same number of times. There will be more
than one mode if two or more data values appear with the same frequency and more frequently
than other data value(s). To find the mode for raw data, simply find the value(s) that appear
most frequently. To find the mode for grouped data, find the midpoint(s) of the class(es) that
has (have) the largest frequencies. The class containing the mode is called the modal class.
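A sketch of both mode rules follows. The function names are illustrative, and returning an empty list for "no mode" is a choice made for this example only.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s); empty list if every value is equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # all values appear the same number of times: no mode
    return sorted(v for v, c in counts.items() if c == top)


def modal_midpoints(lower_limits, freqs, width):
    """Midpoint(s) of the class(es) with the largest frequency (grouped data)."""
    top = max(freqs)
    return [lower + width / 2 for lower, f in zip(lower_limits, freqs) if f == top]


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(modes(atlanta))  # [51, 78]: two modes
print(modal_midpoints([12, 24, 36, 48, 60, 72, 84], [13, 5, 9, 3, 6, 11, 3], 12))  # [18.0]
```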
The midrange is midway between the largest value and the smallest value of the data.
$\text{midrange} = \dfrac{\text{largest} + \text{smallest}}{2}$
The weighted mean may be calculated by using the following three-step procedure: (1) multiply
each value by a weight for that value, (2) sum those products, and (3) divide that sum by the sum
of the weights. The following formula expresses the above procedure:
Weighted Mean: $\bar{X}_w = \dfrac{\sum wX}{\sum w} = \dfrac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n}$,
where w represents the weight and X represents the data value.
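The three-step weighted mean procedure can be sketched as below. The scores and weights are hypothetical values chosen only to illustrate the arithmetic.

```python
def weighted_mean(values, weights):
    """Sum of w*X divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


scores = [90, 80, 70]  # hypothetical data values
weights = [3, 2, 1]    # hypothetical weights
print(round(weighted_mean(scores, weights), 2))  # 83.33, from (270 + 160 + 70) / 6
```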
Skewness tells us something about the shape of a frequency distribution. A symmetric
distribution is one whose graph is symmetric with respect to a vertical line that passes through
the mean, median, and mode. If a distribution is skewed to the right, the graph is elongated (or
stretched) to the right side. If a distribution is skewed to the left, the graph is elongated (or
stretched) to the left side. Remember that extremely large values will pull the mean to the right,
thus skewing the graph (distribution) to the right. Similarly, extremely small values will pull the
mean to the left, thus skewing the graph (distribution) to the left. To calculate the coefficient of
skewness by hand, we use Pearson’s index (coefficient) of skewness formula:
$I = sk = \dfrac{3(\text{mean} - \text{median})}{s}$
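Pearson's index can be computed with Python's standard `statistics` module, as in this sketch. It assumes the data are a sample, so s is the sample standard deviation (divisor n - 1).

```python
import statistics


def pearson_skewness(values):
    """Pearson's index of skewness: 3 * (mean - median) / s."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)  # sample standard deviation (n - 1 divisor)
    return 3 * (mean - median) / s


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
sk = pearson_skewness(atlanta)
print(round(sk, 3))  # slightly negative: the mean (60.75) is just below the median (61.5)
```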
We will discuss variation (dispersion) for two reasons. First, variation (dispersion) can be used
to indicate the presence or absence of reliability. Second, variation (dispersion) can be used to
compare the spread of two or more distributions.
One measure of variation (dispersion) is the range. The range is the difference between the
largest and smallest data values. The calculation of the range is the simplest of the measures of
variation (dispersion). A disadvantage of using the range is that it involves only two of the data
values.
Range: $\text{Range} = \text{Largest Value} - \text{Smallest Value}$ (D1)
We calculate the variance of data so that we can find the standard deviation. For population
data, the variance is the arithmetic mean of the squared deviations from the mean. For sample
data, divide the sum of the squared deviations by n-1. We may use the following procedure to
calculate the variance for ungrouped data. (1) Calculate the arithmetic mean. (2) Find the
difference between each data value and the mean. (3) Square each of the differences found in
Step 2. (4) Sum the squares from Step 3. (5) If the data is from a population, divide the sum in
Step 4 by N, the total number of data values. (6) If the data is from a sample, divide the sum in
Step 4 by n-1, where n is the total number of data values. The above steps are summarized in the
two formulas below. In the sample calculation, the denominator of n-1 is used instead of n to
help correct for the error created by the smaller number of data values in the sample compared to
the population.
The table below shows the Conceptual Formulas and Calculation Formulas (for raw or
ungrouped data) used to find the variance of data. The standard deviation can be used to
compare the dispersion of two or more populations or samples. Also, if the data values are
measured in the same units and the means are close together, a small standard deviation may be
used to indicate that the mean is a reliable measure of central tendency. For population data, the
standard deviation is the square root of the population variance. For sample data, the standard
deviation is the square root of the sample variance. We may use the following procedure to
calculate the standard deviation for ungrouped data. (1) Calculate the variance. (2) Find the
square root of the variance from Step 1. The above steps are summarized in the formulas below.
Formulas to Calculate the Variance of Raw Data
                Population                                                 Sample
Conceptual      $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$ (D3)              $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$ (D4)
Calculation     $\sigma^2 = \dfrac{N \sum X^2 - (\sum X)^2}{N^2}$ (D5)     $s^2 = \dfrac{n \sum X^2 - (\sum X)^2}{n(n - 1)}$ (D6)
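The sketch below checks that the conceptual and calculation formulas give the same sample variance, using the 12 stem-and-leaf values as an example. The function names are illustrative.

```python
import math


def sample_variance_conceptual(values):
    """(D4): sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_calculation(values):
    """(D6): (n * sum(X^2) - (sum X)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
s2_a = sample_variance_conceptual(atlanta)
s2_b = sample_variance_calculation(atlanta)
print(round(s2_a, 3), round(s2_b, 3), round(math.sqrt(s2_b), 3))
```

The last printed number is the sample standard deviation, the square root of the variance. The population versions (D3) and (D5) differ only in the divisors N and N squared.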
Formulas to Calculate the Variance and Standard Deviation of Grouped Data
Population
Variance: $\sigma^2 = \dfrac{N \sum fX^2 - (\sum fX)^2}{N^2}$ (D7)
Standard Deviation: $\sigma = \sqrt{\dfrac{N \sum fX^2 - (\sum fX)^2}{N^2}}$ (D9)
Sample
Variance: $s^2 = \dfrac{n \sum fX^2 - (\sum fX)^2}{n(n - 1)}$ (D8)
Standard Deviation: $s = \sqrt{\dfrac{n \sum fX^2 - (\sum fX)^2}{n(n - 1)}}$ (D10)
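The grouped sample formulas (D8) and (D10) can be sketched as follows for the Table D distribution. As an assumption of this example, X is taken to be the class midpoint.

```python
import math


def grouped_sample_variance(midpoints, freqs):
    """(D8): (n * sum(f*X^2) - (sum(f*X))^2) / (n * (n - 1)), with n = sum of f."""
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, midpoints))
    return (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))


midpoints = [18, 30, 42, 54, 66, 78, 90]  # assumed: midpoints of the Table D classes
freqs = [13, 5, 9, 3, 6, 11, 3]

s2 = grouped_sample_variance(midpoints, freqs)
s = math.sqrt(s2)  # (D10): standard deviation is the square root of the variance
print(round(s2, 2), round(s, 2))
```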
For grouped data, the range is the difference between the upper limit of the largest class
(interval) and lower limit of the smallest class (interval).
Range for grouped data: $\text{Range} = \text{Upper Limit of Largest Interval} - \text{Lower Limit of Smallest Interval}$ (D1G)
Relative Dispersion. If the units of measure are different or the means are not close together,
the standard deviation cannot be used to compare dispersions of data sets. Therefore, we use the
coefficient of variation that measures the dispersion relative to the mean by dividing the
standard deviation by the mean and multiplying by 100 to form a percent. The coefficient of
variation is calculated by using the following formula:
$CV = \dfrac{s}{\bar{X}}\,(100\%)$ (D12)
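A short sketch of the coefficient of variation, again using the 12 stem-and-leaf values as sample data:

```python
import statistics


def coefficient_of_variation(values):
    """CV = (s / X-bar) * 100, using the sample standard deviation."""
    return statistics.stdev(values) / statistics.mean(values) * 100


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
cv = coefficient_of_variation(atlanta)
print(f"{cv:.1f}%")  # roughly 22.7%
```

Because the CV is a percent of the mean, it can be compared across data sets measured in different units.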
Empirical Rule. The Empirical Rule applies only to distributions that are symmetrical and
bell-shaped. For such distributions, the Empirical Rule states that about 68% of the data values
are within plus or minus one standard deviation of the mean; about 95% within plus or minus
two standard deviations of the mean; and about 99.7% within plus or minus three standard
deviations of the mean.
Chebyshev’s Theorem allows us to determine the minimum proportion of data values within a
specific number (larger than one) of standard deviations of the mean for any set of data values.
This minimum proportion is calculated by using the formula
$1 - \dfrac{1}{k^2}$ (D11)
where k > 1 is the number of standard deviations on either side of the mean. Using this formula, we
see that at least 75% of the data values are between two standard deviations below the mean and
two standard deviations above the mean. Similarly, there would be at least 88.9% within three
standard deviations of the mean. There would be at least 55.6% within 1.5 standard deviations
of the mean.
For example, k = 1.5 yields $1 - \dfrac{1}{1.5^2} = 1 - \dfrac{4}{9} = \dfrac{5}{9} \approx 0.556 = 55.6\%$.
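The three Chebyshev percentages quoted above can be verified with a one-line function (the name is illustrative):

```python
def chebyshev_min_proportion(k):
    """Minimum proportion of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


for k in (1.5, 2, 3):
    print(k, f"{chebyshev_min_proportion(k):.1%}")
# 1.5 55.6%
# 2 75.0%
# 3 88.9%
```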
The z-score, or standard score, is the number of standard deviations that x is from the mean.
z-score for sample: $z = \dfrac{x - \bar{x}}{s}$
z-score for population: $z = \dfrac{x - \mu}{\sigma}$
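Both z-score formulas have the same form, so one small function covers them. The numbers in the example are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


# Hypothetical example: a value of 85 when the mean is 70 and the standard deviation is 10.
print(z_score(85, 70, 10))  # 1.5
```

A negative result would mean that x lies below the mean.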
Quartiles, Deciles, Percentiles. Earlier we discussed measures of center. One of those
measures was the median. We found the median to be the middle value of ungrouped data, and
we used a formula to find the median for grouped data. Now we will calculate quartiles, deciles,
and percentiles as measures of relative standing. For ungrouped data, the following formula may be
used to find the location, L, of the kth percentile, $P_k$:
$L = \left(\dfrac{k}{100}\right) n$ (D13)
If L is a whole number, $P_k$ is midway between the Lth value and the (L+1)st value of the sorted data. If L is
not a whole number, round L up to the next whole number; $P_k$ is the value at that position. To find the location of the first
quartile, simply find the location of the 25th percentile; to find the location of the second decile,
simply find the location of the 20th percentile.
$\text{Percentile of value } x = \dfrac{\text{number of values less than } x}{\text{total number of values}} \times 100$
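The location rule and the percentile-of-a-value formula can be sketched as below. The sketch assumes that k is an integer strictly between 0 and 100, and that a non-whole L is rounded up to the next position. Other textbooks and software use different percentile conventions, so results may differ.

```python
def percentile(values, k):
    """kth percentile (integer k, 0 < k < 100) using the location L = (k / 100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(k * n, 100)  # L = whole + remainder / 100
    if remainder == 0:
        # L is a whole number: average the Lth and (L+1)st values (1-based positions).
        return (ordered[whole - 1] + ordered[whole]) / 2
    # L is not whole: round up, so take the value at 1-based position whole + 1.
    return ordered[whole]


def percentile_of_value(values, x):
    """Percent of the values that are less than x."""
    return sum(1 for v in values if v < x) / len(values) * 100


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(percentile(atlanta, 25))  # 48.0: first quartile, L = 3
print(percentile(atlanta, 20))  # 45: second decile, L = 2.4 rounds up to 3
print(round(percentile_of_value(atlanta, 61), 1))  # 41.7
```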
Box Plots. A box plot is a graphical display of five values: smallest and largest data values, the
median, and the first and third quartiles. To draw a box plot, (1) identify the smallest and largest
data values, (2) calculate the first, second, and third quartiles, (3) draw a rectangle with the first
quartile at the left end, the third quartile at the right end, and the second quartile (median) as a
vertical line segment in the rectangle, (4) draw line segments from the left end to the smallest
value and from the right end to the largest value. As an example, consider the following:
smallest value is 50, first quartile is 70, second quartile (median) is 90, third quartile is 115, and
the largest value is 150. The box plot representing these values is shown below.
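[Figure: Box plot with line segments extending from 50 to 150, a rectangle from 70 to 115, and a vertical median line at 90.]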
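The five values behind a box plot can be computed as in this sketch. It assumes the percentile location rule L = (k/100)n with a non-whole L rounded up; the data are the 12 stem-and-leaf values used earlier, not the 50-to-150 example.

```python
def percentile(values, k):
    """kth percentile (integer k, 0 < k < 100) by the location rule L = (k / 100) * n."""
    ordered = sorted(values)
    whole, remainder = divmod(k * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2  # midway between Lth and (L+1)st
    return ordered[whole]  # L rounded up to the next position


def five_number_summary(values):
    """Smallest value, first quartile, median, third quartile, largest value."""
    return (
        min(values),
        percentile(values, 25),
        percentile(values, 50),
        percentile(values, 75),
        max(values),
    )


atlanta = [42, 45, 51, 61, 69, 76, 78, 78, 72, 62, 51, 44]
print(five_number_summary(atlanta))  # (42, 48.0, 61.5, 74.0, 78)
```

The rectangle of the box plot would run from the second to the fourth number, with the median line at the third.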
(Copyrighted by Claude S. Moore 2004-2008)