Statistics Notes 3

Advanced Math 45
Statistics #3: Grouped Data
4. Calculations Using Grouped Data
1. Organisation of Ungrouped Data
2. Measures of Central Tendency
3. Measures of Spread
1. Organisation of Ungrouped Data
If you have a large quantity of raw data it is often impractical to consider each
individual piece of data so it is convenient to group the data into appropriate
intervals. Usually there should be no fewer than 10 intervals and no more than 25
intervals. When fixing the boundaries it is customary to use one decimal place more
than in the raw data – this ensures that any one piece of data can only fall into one
interval.
Example: Twenty-five light bulbs were selected from a production lot of 1000 bulbs
and tested for hours of bulb life.
803  880  871  883  939
896  976  843  927  890
878  912  895  942  868
886  955  875  918  881
861  843  882  843  901
Put this data into a grouped frequency table.
One possible answer is:
Class Boundaries    Tally      Frequency
799.5 – 819.5       1          1
819.5 – 839.5                  0
839.5 – 859.5       111        3
859.5 – 879.5       1111 1     6
879.5 – 899.5       1111 11    7
899.5 – 919.5       111        3
919.5 – 939.5       11         2
939.5 – 959.5       11         2
959.5 – 979.5       1          1
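The tallying step can be sketched in Python. This is only an illustration, not part of the original notes: the function name `class_frequencies` is hypothetical, the class boundaries (starting at 799.5 with width 20) are taken from the table, and only the first eight bulb lives from the example are used to keep the output short.

```python
def class_frequencies(data, lower, width, n_classes):
    """Count how many values fall in each class of equal width, starting at `lower`."""
    counts = [0] * n_classes
    for x in data:
        k = int((x - lower) // width)  # index of the class containing x
        if 0 <= k < n_classes:
            counts[k] += 1
    return counts

# First eight bulb lives from the example (a subset, chosen for brevity).
sample = [803, 880, 871, 883, 939, 896, 976, 843]
counts = class_frequencies(sample, lower=799.5, width=20, n_classes=9)
print(counts)  # [1, 0, 1, 1, 3, 0, 1, 0, 1]
```

Because the boundaries carry one more decimal place than the data, no value can sit exactly on a boundary, so each value is counted in exactly one class.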
This type of data can be represented in histograms and frequency polygons.
A histogram is a graph in which classes are marked on the horizontal axis and the
frequencies / relative frequencies / percentages are marked on the vertical axis. The
frequencies / relative frequencies / percentages are represented by the heights of
the bars, and the bars are drawn adjacent to each other.
A frequency polygon is a graph formed by joining the midpoints of the tops of
successive bars in a histogram with straight lines.
(NB: The bars are not always shown)
A cumulative frequency graph is another useful tool for displaying grouped data.
It gives the total number of values that fall below the upper boundary of each class.
Class Boundaries    Frequency    Cumulative Frequency
799.5 – 819.5       1            1
819.5 – 839.5       0            1
839.5 – 859.5       3            4
859.5 – 879.5       6            10
879.5 – 899.5       7            17
899.5 – 919.5       3            20
919.5 – 939.5       2            22
939.5 – 959.5       2            24
959.5 – 979.5       1            25
Note that a cumulative frequency curve is called an ogive.
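The cumulative frequency column is just a running total of the frequency column. A minimal Python sketch, using the frequencies from the table (the variable names are illustrative):

```python
from itertools import accumulate

frequencies = [1, 0, 3, 6, 7, 3, 2, 2, 1]
# Upper boundary of each class: 819.5, 839.5, ..., 979.5
upper_boundaries = [819.5 + 20 * k for k in range(len(frequencies))]

cumulative = list(accumulate(frequencies))  # running total of the frequencies
for upper, cf in zip(upper_boundaries, cumulative):
    print(f"below {upper}: {cf}")

print(cumulative)  # [1, 1, 4, 10, 17, 20, 22, 24, 25]
```

Plotting each cumulative frequency against its upper class boundary gives the points of the ogive.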
A cumulative frequency diagram can be used to find the median of a distribution,
quartiles and percentiles.
In our data we have 25 pieces of data, so the median (the middle piece of data) will be
the 13th piece of data.
The quartiles and the median separate the data into four sections. The lower
quartile will be ¼ of the way through the data (¼ × 25 = 6.25th piece of data), while
the upper quartile is ¾ of the way through the data (¾ × 25 = 18.75th piece of data).
From the graph:
Median ≈ 885
Lower Quartile: Q1 ≈ 865
Upper Quartile: Q3 ≈ 910
The Interquartile range (IQR) = Q3 – Q1 = 910 – 865 = 45 and can be used to
examine how the middle half of the data is spread out.
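Reading the graph "across, then down" can be imitated numerically. The sketch below assumes the ogive is drawn with straight lines between the plotted points (linear interpolation); the helper name `value_at` is hypothetical. The notes read values from a drawn curve, so the results differ from the graph readings by a few hours.

```python
from itertools import accumulate

boundaries = [799.5 + 20 * k for k in range(10)]   # 799.5, 819.5, ..., 979.5
frequencies = [1, 0, 3, 6, 7, 3, 2, 2, 1]
cumulative = [0] + list(accumulate(frequencies))   # cumulative count at each boundary

def value_at(position):
    """Bulb life at a given cumulative count, by straight-line interpolation."""
    for k in range(len(frequencies)):
        lo_cf, hi_cf = cumulative[k], cumulative[k + 1]
        if hi_cf > lo_cf and lo_cf <= position <= hi_cf:
            fraction = (position - lo_cf) / (hi_cf - lo_cf)
            return boundaries[k] + fraction * (boundaries[k + 1] - boundaries[k])
    raise ValueError("position must lie between 0 and the total frequency")

n = cumulative[-1]           # 25 pieces of data
q1 = value_at(0.25 * n)      # 6.25th piece of data
median = value_at(13)        # 13th piece of data
q3 = value_at(0.75 * n)      # 18.75th piece of data
print(round(q1, 1), round(median, 1), round(q3, 1))  # 867.0 888.1 911.2
print(round(q3 - q1, 1))     # interquartile range from the interpolated values
```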
The median and quartiles can be used to create a box and whisker plot as shown
below.
The median and the lower and upper quartiles are all special percentiles. Percentiles
divide a set of elements arranged in order of magnitude into 100 equal parts; hence
the median is the 50th percentile, the lower quartile the 25th percentile and the upper
quartile the 75th percentile.
If we wanted the 10th percentile, we would find 10% of our total (in this case 25, so
2.5) and use the cumulative frequency diagram to identify the value below which 10%
of the population lies.
P10 = 853
i.e. 10% of light bulbs have a life length of less than 853 hours.
Or if we wanted to find the percentile for a given value, we do the same but in
reverse order. To find the percentile rank of a bulb life length of 900 hours, draw a
line up from 900 to the ogive and then across to read the cumulative frequency.
Finally convert into a percentage.
From the graph we can see that 17 bulbs have a life length of 900 hrs or less.
17/25 × 100 = 68
so this is the 68th percentile,
P68 = 900
which means that 68% of the bulbs have a life length of less than or equal to 900 hours.
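The reverse reading ("up, then across") can be sketched the same way. This assumes straight-line interpolation between the plotted ogive points, and the function name `percentile_rank` is hypothetical; it is an illustration rather than the method used in the notes, which read the value from the drawn graph.

```python
from itertools import accumulate

boundaries = [799.5 + 20 * k for k in range(10)]   # 799.5, 819.5, ..., 979.5
frequencies = [1, 0, 3, 6, 7, 3, 2, 2, 1]
cumulative = [0] + list(accumulate(frequencies))   # cumulative count at each boundary

def percentile_rank(value):
    """Percentage of the data at or below `value`, interpolated along the ogive."""
    total = cumulative[-1]
    if value <= boundaries[0]:
        return 0.0
    if value >= boundaries[-1]:
        return 100.0
    for k in range(len(frequencies)):
        if boundaries[k] <= value <= boundaries[k + 1]:
            fraction = (value - boundaries[k]) / (boundaries[k + 1] - boundaries[k])
            count = cumulative[k] + fraction * frequencies[k]
            return 100 * count / total

rank = percentile_rank(900)
print(round(rank))  # 68
```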
Assignment #5 Qu 1-3
2. Measures of Central Tendency
Sometimes data is grouped together so that individual pieces of data are unknown, or
there is so much data that it is impracticable to deal with individual pieces, so
the data is put into suitable groups.
Example: The table below shows daily commute times for all 25 workers in a
company.
Daily Commute Times (mins)    Number of Employees
0 to less than 10             4
10 to less than 20            9
20 to less than 30            6
30 to less than 40            4
40 to less than 50            2
The Mode – we can only identify the modal group, that is the group with the most
employees i.e. the highest frequency.
Modal Group: 10 to less than 20 mins.
The Median – the middle piece of the ordered data. In this case there are 25 pieces
of data so we are looking for the 13th piece of data.
Daily Commute Times (mins)    Number of Employees    Cumulative Number of Employees
0 to less than 10             4                      4
10 to less than 20            9                      13
20 to less than 30            6                      19
30 to less than 40            4                      23
40 to less than 50            2                      25
We can easily see that the median falls in the group “10 to less than 20”.
If you need an estimate of an actual value for the median you need to use the
following formula:
$$\text{Median} = L + \left(\frac{\frac{n}{2} - CF}{f}\right) i$$
where
L = the lower limit of the class containing the median
n = the total number of frequencies
f = the frequency of the median class
CF = the cumulative number of frequencies in the classes preceding the class containing the median
i = the width of the class containing the median
Hence an estimate of the median is:
$$\text{Median} = 10 + \left(\frac{\frac{25}{2} - 4}{9}\right)(10) = 19.44$$
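The estimate Median = L + ((n/2 − CF) / f) × i can be checked with a short Python sketch. The function name `grouped_median` is hypothetical, and the code assumes all classes have the same width, as in the commute-time table.

```python
def grouped_median(lower_limits, frequencies, width):
    """Estimate the median of grouped data: L + ((n/2 - CF) / f) * i."""
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    cf = 0  # cumulative frequency of the classes before the current one
    for lower, f in zip(lower_limits, frequencies):
        if cf + f >= n / 2:          # this class contains the median
            return lower + (n / 2 - cf) / f * width
        cf += f

median = grouped_median([0, 10, 20, 30, 40], [4, 9, 6, 4, 2], width=10)
print(round(median, 2))  # 19.44
```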
The Mean (x̄) – again, because we don’t have the actual data, we use the mid-point
of each group in order to find an estimate for the mean.
The mid-point is the average of the limits of the group,
i.e. the mid-point of the first group would be (0 + 10)/2 = 5, etc.
Formula for the mean becomes:
$$\mu = \frac{\sum mf}{\sum f}$$
Daily Commute Times (mins)    Number of Employees (f)    Mid Point (m)    mf
0 to less than 10             4                          5                20
10 to less than 20            9                          15               135
20 to less than 30            6                          25               150
30 to less than 40            4                          35               140
40 to less than 50            2                          45               90
                              ∑f = 25                                     ∑mf = 535
$$\mu = \frac{\sum mf}{\sum f} = \frac{535}{25} = 21.4 \text{ minutes}$$
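The same estimate of the mean can be reproduced in a few lines of Python. This is a minimal sketch; the variable names are illustrative and the class limits are those of the commute-time table.

```python
limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [4, 9, 6, 4, 2]

# Mid-point of each group: the average of its two limits.
midpoints = [(lo + hi) / 2 for lo, hi in limits]

sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))
n = sum(frequencies)
mean = sum_mf / n
print(sum_mf, n, mean)  # 535.0 25 21.4
```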
Assignment #5 Qu 4&5
3. Measures of Spread
Range: This is not normally calculated from grouped data.
Standard Deviation: The formulae become
$$\sigma = \sqrt{\frac{\sum f(m-\mu)^2}{n}} \quad \text{and} \quad \sigma = \sqrt{\frac{\sum fm^2}{n} - \mu^2}$$
Using the Daily Commute Times example from above, (𝜇 = 21.4)
Daily Commute Times (mins)    Number of Employees (f)    Mid Point (m)    (m − μ)    (m − μ)²    f(m − μ)²
0 to less than 10             4                          5                -16.4      268.96      1075.84
10 to less than 20            9                          15               -6.4       40.96       368.64
20 to less than 30            6                          25               3.6        12.96       77.76
30 to less than 40            4                          35               13.6       184.96      739.84
40 to less than 50            2                          45               23.6       556.96      1113.92
                              n = ∑f = 25                                                        ∑f(m − μ)² = 3376
$$\sigma = \sqrt{\frac{\sum f(m-\mu)^2}{n}} = \sqrt{\frac{3376}{25}} = 11.62$$
Using the alternate formula
Daily Commute Times (mins)    Number of Employees (f)    Mid Point (m)    m²      fm²
0 to less than 10             4                          5                25      100
10 to less than 20            9                          15               225     2025
20 to less than 30            6                          25               625     3750
30 to less than 40            4                          35               1225    4900
40 to less than 50            2                          45               2025    4050
                              n = ∑f = 25                                         ∑fm² = 14825
$$\sigma = \sqrt{\frac{\sum fm^2}{n} - \mu^2} = \sqrt{\frac{14825}{25} - 21.4^2} = 11.62$$
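Both forms of the standard deviation formula can be checked side by side. The Python sketch below uses the mid-points and frequencies from the commute-time example; the names `sigma_def` and `sigma_alt` are illustrative labels for the two formulae.

```python
from math import sqrt

midpoints = [5, 15, 25, 35, 45]
frequencies = [4, 9, 6, 4, 2]

n = sum(frequencies)
mu = sum(m * f for m, f in zip(midpoints, frequencies)) / n  # 21.4

# First formula: square root of the frequency-weighted mean squared deviation.
sigma_def = sqrt(sum(f * (m - mu) ** 2 for m, f in zip(midpoints, frequencies)) / n)

# Alternate formula: square root of (mean of the squares minus square of the mean).
sigma_alt = sqrt(sum(f * m ** 2 for m, f in zip(midpoints, frequencies)) / n - mu ** 2)

print(round(sigma_def, 2), round(sigma_alt, 2))  # 11.62 11.62
```

The alternate form avoids computing each deviation (m − μ), which is why it is often quicker by hand.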
Assignment #5 Qu 6