bs chapter 3 - WordPress.com

DESCRIPTIVE STATISTICS
CHAPTER TWO
Content
2.1 Data Organization and Frequency Distribution
2.2 Types of Graph
2.3 Summary Statistics (Data Description)
• Measures of Central Tendency
• Measures of Variation
• Measures of Position
Objectives
At the end of this chapter, you should be able to:
• Organize data using frequency distributions.
• Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
• Represent data using Pareto charts, time series graphs, and pie graphs.
• Draw and interpret a stem and leaf plot.
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe data using measures of variation, such as the range, variance, and standard deviation.
• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
2.1 Data Organization & Frequency Distribution
A. The raw data
– Data that have just been collected from any source and have not yet been organized
Example of the Raw Data
The Slimline Beverage Company makes and sells a line of dietetic soft drink products. These products are sold in bottles and cans. In addition, soft drink syrups are sold to restaurants, theaters, and other outlets that mix small amounts of the syrup with carbonated water and sell the result in cups. The sales manager wants to see how the new Fizzy Cola syrup is selling, so the raw sales data on gallons of syrup sold were gathered, as shown in the table below.
Raw data: Gallons of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month

Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold
PP         95             RN         95             GH        135.5           IT        135.5
SM        100.75          SG        100.75          RI        115.25          NI        115.25
PT        126             AD        126             OS        128.75          GC        128.75
PU        114             RO        114             US        113.25          AS        113.25
MS        134             EY        134             PO        132             NC        132
FK        116.75          YO        116.75          OR        105             YA        105
LZ         97.5           OU         97.5           FT        118.25          TN        118.25
FE        102.25          US        102.25          WO        121.75          HB        121.75
AN        110             LT        110             OF        109.25          IE        109.25
RJ        125             EA        125             RT        136             NF        136
OO        144             AT        144             KH        124             GU        124
UY        112             RI        112             EI         91             XN         91
TT         82.5           NS         82.5
B. The data array
An arrangement of data items in either ascending (lowest to highest) or descending (highest to lowest) order.
Example of Data Array
Array data: Gallons of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month (ascending order, read down each column)

 82.5     105       116.75    128.75
 82.5     109.25    116.75    128.75
 91       109.25    118.25    132
 91       110       118.25    132
 95       110       121.75    134
 95       112       121.75    134
 97.5     112       124       135.5
 97.5     113.25    124       135.5
100.75    113.25    125       136
100.75    114       125       136
102.25    114       126       144
102.25    115.25    126       144
105       115.25

The lowest data: 82.5
The highest data: 144
Range: 144 - 82.5 = 61.5
Advantages
- We can see the range of the data
- We can determine the data distribution
- An array can show the presence of large concentrations of items at particular values
- An array can reveal outliers (data values that differ from the rest of the data by being much larger or much smaller)
Disadvantages
- The array is still a rather awkward data organization tool,
especially when the number of data items is large.
- There’s often a need to arrange the data into a more
compact form for analysis and communication purposes.
C. Frequency distribution (frequency table)
Groups data items into classes and then records the number of items that appear in each class.
The purpose
To organize the data items into a compact form without obscuring essential facts
How to do it (general)
1. Determine the number of classes that will be used to group the data.
a. Minimum – 5, maximum – 20
b. The actual number depends on factors such as:
i. The number of observations being grouped
ii. The purpose of the distribution
iii. The arbitrary preferences of the analyst
c. Use classes that give you a good view of the data pattern and enable you to gain insights into the information that is there
d. All data items from the smallest to the largest must be included
e. Each item must be assigned to one and only one class
2. Determine the width (class interval) of these classes
a. The width should be equal
b. Width = range / number of classes
c. Whenever possible an open-ended class interval (one with an unspecified
upper or lower class limit) should be avoided
3. Determine the number of observations / frequency in each class
Types of Frequency Distribution
• Categorical Frequency Distribution
– Used for data that can be placed in specific categories, such as nominal or ordinal level data
• Ungrouped Frequency Distribution
– Used for numerical data
– The range of data is small
• Grouped Frequency Distribution
– Used for numerical data too
– The range of the data is large
Example : Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood
type. The data set is
A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A
Construct a frequency distribution for the data.
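A categorical frequency distribution for the blood type data can be tallied as in the following minimal sketch. The variable names and the added percent column are illustrative choices, not part of the original example.

```python
from collections import Counter

# The 25 blood types from the example, in the order listed.
blood_types = [
    "A", "O", "B", "A", "AB",
    "B", "O", "B", "O", "A",
    "B", "B", "O", "O", "O",
    "AB", "AB", "A", "O", "B",
    "O", "B", "O", "AB", "A",
]

freq = Counter(blood_types)   # tally: class -> frequency
n = len(blood_types)          # total number of observations

print("Class  Frequency  Percent")
for blood_type in ["A", "B", "O", "AB"]:
    f = freq[blood_type]
    print(f"{blood_type:<5}  {f:>9}  {100 * f / n:>6.0f}%")
print(f"Total  {n:>9}  {100:>6}%")
```

Each observation falls into exactly one class, so the frequencies add up to the 25 inductees.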
Constructing an ungrouped & Grouped Frequency Distribution
STEP 1 Determine the classes.
- Find the highest and lowest value.
- Find the range.
- Select the number of classes desired.
- Find the width by dividing the range by the number of
classes and rounding up.
- Select a starting point (usually the lowest value or any
convenient number less than the lowest value); add the
width to get the lower limits.
- Find the upper class limits.
- Find the boundaries.
STEP 2 Tally the data.
STEP 3 Find the numerical frequencies from the tallies.
STEP 4 Find the cumulative frequencies.
• The lower class limit represents the smallest data value that can be included in the class.
• The upper class limit represents the largest value that can be included in the class.
• The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.
• Rule of Thumb: Class limits should have the same decimal place value as the data, but the class boundaries have one additional place value and end in a 5.
• The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
• The class midpoint is found by adding the lower and upper boundaries (or limits) and dividing by 2.
Class Rules
• There should be between 5 and 20 classes.
• The classes must be mutually exclusive.
• The classes must be continuous.
• The classes must be exhaustive.
• The classes must be equal width.
Example : Ungrouped Frequency Distribution
The data shown here represent the number of miles
per gallon that 30 selected four-wheel-drive sports
utility vehicles obtained in city driving. Construct a
frequency distribution.
12  16  15  12  19  17  18  16  14  13
12  12  12  15  16  14  16  15  12  18
16  17  16  15  16  18  15  16  15  14
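Because the range of these data is small (12 to 19), each single value can serve as its own class. The sketch below tallies the frequencies and cumulative frequencies, following STEP 2 to STEP 4 of the procedure. The variable names are illustrative.

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles in the example.
mpg = [
    12, 16, 15, 12, 19, 17, 18, 16, 14, 13,
    12, 12, 12, 15, 16, 14, 16, 15, 12, 18,
    16, 17, 16, 15, 16, 18, 15, 16, 15, 14,
]

counts = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))   # one class per value: 12, 13, ..., 19
freq = [counts[c] for c in classes]             # frequency of each class
cum_freq = list(accumulate(freq))               # running total of the frequencies

print("Class  Frequency  Cumulative frequency")
for c, f, cf in zip(classes, freq, cum_freq):
    print(f"{c:>5}  {f:>9}  {cf:>20}")
```

The last cumulative frequency equals the number of observations, which is a quick check that no value was missed.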
Example : Grouped Frequency Distribution
These data represent the record high temperatures for each of
the 50 states. Construct a grouped frequency distribution for
the data using 7 classes.
112  110  107  116  120  100  118  112  108  113
127  117  114  110  120  120  116  115  121  117
134  118  118  113  105  118  122  117  120  110
105  114  118  119  118  110  114  122  111  112
109  105  106  104  114  112  109  110  111  114
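The construction steps can be sketched in code for this example. This is a minimal illustration that assumes whole-number data, starts the first class at the lowest value, and assumes range / number of classes is not already a whole number (otherwise the highest value would need an extra class). The variable names are illustrative.

```python
import math
from itertools import accumulate

# Record high temperatures for the 50 states, from the example.
temps = [
    112, 110, 107, 116, 120, 100, 118, 112, 108, 113,
    127, 117, 114, 110, 120, 120, 116, 115, 121, 117,
    134, 118, 118, 113, 105, 118, 122, 117, 120, 110,
    105, 114, 118, 119, 118, 110, 114, 122, 111, 112,
    109, 105, 106, 104, 114, 112, 109, 110, 111, 114,
]

num_classes = 7
lowest, highest = min(temps), max(temps)
data_range = highest - lowest                    # 134 - 100 = 34
width = math.ceil(data_range / num_classes)      # 34 / 7 = 4.86, rounded up to 5

# Class limits: lower limits are lowest, lowest + width, ...; upper limit is one less
# than the next lower limit because the data are whole numbers.
limits = [(lowest + k * width, lowest + (k + 1) * width - 1) for k in range(num_classes)]

freq = [sum(1 for t in temps if lo <= t <= hi) for lo, hi in limits]
cum_freq = list(accumulate(freq))

for (lo, hi), f, cf in zip(limits, freq, cum_freq):
    # Boundaries carry one extra decimal place and end in 5.
    print(f"{lo}-{hi}  boundaries {lo - 0.5}-{hi + 0.5}  f = {f:>2}  cf = {cf:>2}")
```

With these choices the classes are 100-104, 105-109, and so on up to 130-134, and every one of the 50 values falls in exactly one class.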
Why Construct Frequency Distributions?
• To organize the data in a meaningful, intelligible way.
• To enable the reader to make comparisons among different data sets.
• To facilitate computational procedures for measures of average and spread.
• To enable the reader to determine the nature or shape of the distribution.
• To enable the researcher to draw charts and graphs for the presentation of data.
2.2 Types of Graph
• The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.
• Graphs are useful in getting the audience’s attention in a publication or a presentation.
A. Histogram, Frequency Polygon, Ogive
• Histogram
– A graph that displays the data by using vertical bars of various heights
to represent the frequencies
• Frequency Polygon
– A graph that displays the data by using lines that connect points
plotted for the frequencies at the midpoints of the classes. The
frequencies represent the heights of the midpoints.
• Ogive (Cumulative Frequency Graph)
– A graph that represents the cumulative frequencies for the
classes in a frequency distribution
Procedure to Construct a Histogram, Frequency Polygon, and Ogive
• STEP 1 Draw and label the x and y axes.
• STEP 2 Choose a suitable scale for the frequencies or cumulative frequencies, and label it on the y axis.
• STEP 3 Represent the class boundaries for the histogram or ogive, or the midpoint for the frequency polygon, on the x axis.
• STEP 4 Plot the points and then draw the bars or lines.
Example
These data represent the record high temperatures for each of
the 50 states. Construct a grouped frequency distribution for
the data using 7 classes. Then, construct a histogram,
frequency polygon and ogive for these data.
112  110  107  116  120  100  118  112  108  113
127  117  114  110  120  120  116  115  121  117
134  118  118  113  105  118  122  117  120  110
105  114  118  119  118  110  114  122  111  112
109  105  106  104  114  112  109  110  111  114
Distribution Shapes
B. Pareto Chart
Used to represent a frequency distribution for a categorical variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
Example
Twenty-five army inductees were given a blood test
to determine their blood type. The data set is
A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A
Construct a Pareto chart for the data.
C. Time Series Graph
• Represents data that occur over a specified period of time
• STEP 1 Draw and label the x and y axes.
• STEP 2 Label the x axis for years and the y axis for the number of theaters.
• STEP 3 Plot each point according to the table.
• STEP 4 Draw line segments connecting adjacent points. Do not try to fit a smooth curve through the data points.
• We look for a trend or pattern that occurs over the time period (ascending, descending) and the slope or steepness of the line (increase, decrease)
Two time series graphs can be drawn together for comparison (compound time series graph)
Example
In 1958, there were more than 4000 outdoor drive-in theaters. The
number of these theaters has changed over the years. Draw a time
series graph for the data and summarize the findings.
Year     1988  1990  1992  1994  1996  1998  2000
Number   1497   910   870   859   826   750   637
D. Pie Chart
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sectors.
Percentages or proportions can be used.
The variable is nominal or categorical.
Example
Twenty-five army inductees were given a blood test
to determine their blood type. The data set is
A   O   B   A   AB
B   O   B   O   A
B   B   O   O   O
AB  AB  A   O   B
O   B   O   AB  A
Construct a pie chart for the data.
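Each wedge of the pie takes the same share of the 360 degrees of the circle as its class takes of the total frequency. The sketch below computes the percentage and the central angle for each blood type. The frequencies are tallied from the 25 values above, and the variable names are illustrative.

```python
# Frequencies tallied from the 25 blood types in the example.
freq = {"A": 5, "B": 7, "O": 9, "AB": 4}
n = sum(freq.values())   # 25 inductees

percent = {}
degrees = {}
for blood_type, f in freq.items():
    percent[blood_type] = 100 * f / n   # share of the whole, in percent
    degrees[blood_type] = 360 * f / n   # central angle of the wedge, in degrees

for blood_type in freq:
    print(f"{blood_type:<2}  {percent[blood_type]:>5.1f}%  {degrees[blood_type]:>6.1f} degrees")
```

The percentages should total 100 and the angles should total 360, which is a useful check before drawing the chart.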
Stem-and-Leaf Plots
• A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.
• It has the advantage over a grouped frequency distribution of retaining the actual data while showing them in graphic form.
[Figure: layout of a stem-and-leaf plot, with a Stem column and a Leaf column]
Example
An insurance company researcher conducted a survey on the
number of car thefts in a large city for a period of 30 days
last summer. The raw data are shown below.
Construct a stem and leaf plot.
52  58  75  79  57  65  62  77  56  59
51  53  51  66  55  68  63  78  50  53
67  65  69  66  69  57  73  72  75  55
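For two-digit data such as these, the tens digit can serve as the stem and the units digit as the leaf. The following sketch builds the plot that way. The choice of stem and the variable names are assumptions made for illustration.

```python
from collections import defaultdict

# Number of car thefts per day over 30 days, from the example.
thefts = [
    52, 58, 75, 79, 57, 65, 62, 77, 56, 59,
    51, 53, 51, 66, 55, 68, 63, 78, 50, 53,
    67, 65, 69, 66, 69, 57, 73, 72, 75, 55,
]

plot = defaultdict(list)
for value in sorted(thefts):             # sorting first keeps the leaves in order
    plot[value // 10].append(value % 10)  # stem = tens digit, leaf = units digit

print("Stem | Leaf")
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem:>4} | {leaves}")
```

Every original value can be read back from the plot, which is the advantage over a grouped frequency distribution mentioned above.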
Conclusions (2.1 & 2.2)
• Data can be organized in some meaningful way using frequency distributions. Once the frequency distribution is constructed, the representation of the data by graphs is a simple task.
2.3 Summary Statistics (Data Description)
• Statistical methods can be used to summarize data.
• Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.
• Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.
• Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.
• The most common measures of position are percentiles, deciles, and quartiles.
• The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of data is typically used to confirm conjectures about the data.
Measures of Central Tendency
Mean
the sum of the values divided by the total number of values.
Population Mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, N = population size
Sample Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, n = sample size
Arithmetic Mean – Individual Data
Example 1
• Calculate the arithmetic mean for the following: 3, 5, 8, 12, 15
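As a worked illustration of Example 1, treating the five values as a sample (an assumption, since the example does not say), the sum of the values is divided by the number of values:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3 + 5 + 8 + 12 + 15}{5} = \frac{43}{5} = 8.6$$

The mean, 8.6, is not one of the data values, which is allowed.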
The Arithmetic Mean – Ungrouped Frequency Distribution
Example 2
• Number of defects in a sample of 50 products
No. of defects   No. of products
0                 5
1                 7
2                15
3                13
4                 6
5                 4
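For an ungrouped frequency distribution, each value is weighted by its frequency, so the mean is the sum of (frequency × value) divided by the total frequency. A minimal sketch for Example 2, with illustrative variable names:

```python
# Example 2: number of defects and how many products showed that many defects.
defects = [0, 1, 2, 3, 4, 5]
products = [5, 7, 15, 13, 6, 4]

n = sum(products)                                        # total number of products
sum_fx = sum(f * x for f, x in zip(products, defects))   # sum of frequency x value
mean = sum_fx / n

print(f"n = {n}, sum of f*x = {sum_fx}, mean = {mean}")
```

Here the products carry 120 defects in total across 50 items, giving a mean of 2.4 defects per product.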
The Arithmetic Mean – Grouped Frequency Distribution
Example 3
• A radar speed recorder was set up on a stretch of road to which a legal speed limit applied. The results are summarized in the table below:
Speed (mph)   No. of cars observed
15 – 20         5
20 – 25        39
25 – 30       112
30 – 35       295
35 – 40       242
40 – 45        89
45 – 50         8
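For grouped data the individual values are not known, so each class is represented by its midpoint and the mean is estimated as the sum of (frequency × midpoint) divided by the total frequency. A minimal sketch for Example 3, with illustrative variable names:

```python
# Example 3: speed classes (mph) and number of cars observed in each class.
classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
cars = [5, 39, 112, 295, 242, 89, 8]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # 17.5, 22.5, ..., 47.5
n = sum(cars)                                                   # total number of cars
sum_fx = sum(f * x for f, x in zip(cars, midpoints))            # sum of frequency x midpoint
mean = sum_fx / n

print(f"n = {n}, sum of f*x = {sum_fx}, mean = {mean:.2f} mph")
```

The result is an estimate, because it assumes the speeds in each class are centred on the class midpoint.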
Mean
• One computes the mean by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as variance.
• The mean for the data set is unique, and not necessarily one of the data values.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.
Median
the middle number of n ordered data (smallest to largest)
If n is odd: $\text{Median} = x_{\frac{n+1}{2}}$
If n is even: $\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$
Median
• The median is used when one must find the center or middle value of
a data set.
• The median is used when one must determine whether the data
values fall into the upper half or lower half of the distribution.
• The median is used to find the average of an open-ended distribution.
• The median is affected less than the mean by extremely high or
extremely low values.
The Median – Individual Data
Example 4
• The following data relate to the marks obtained by 15 students in a course
• Progress test 1: marks obtained
30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72
• In the case of an even number of observations, there is no definite middle item
• The median is then taken to be the average of the two middle items
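The median rule for odd and even n can be sketched as follows for the marks in Example 4. The function name is illustrative; Python's standard `statistics.median` applies the same rule and is used only as a cross-check.

```python
import statistics

marks = [30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72]

def median_of(values):
    """Middle value of the ordered data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                # position (n + 1) / 2, 1-based
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions n/2 and n/2 + 1

median = median_of(marks)
print(sorted(marks))
print("median =", median)

# Cross-check against the standard library.
assert median == statistics.median(marks)
```

With 15 marks, n is odd, so the median is the 8th value of the ordered data.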
The Median – Locating the Median Graphically
• Example 5
• Given below is the frequency distribution of marks obtained by 50 students in a certain college
Marks     No. of Students
10 – 20    3
20 – 30    7
30 – 40   10
40 – 50   20
50 – 60    7
60 – 70    3
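The value read from the graph can be checked by interpolating within the median class, using the grouped-data median formula given in this chapter. For Example 5, n = 50, so n/2 = 25. The cumulative frequencies are 3, 10, 20, 40, 47, 50, so the median class is 40 – 50, the first class whose cumulative frequency reaches 25. Hence L = 40, C = 10, F = 20 and f = 20:

$$\text{Median} = L + C\left(\frac{\frac{n}{2} - F}{f}\right) = 40 + 10\left(\frac{25 - 20}{20}\right) = 42.5$$

This interpolation assumes the marks are spread evenly within the median class.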
The Median – Ungrouped Frequency Distribution
• Example 6
• Tests for defects are carried out in a textile factory on a lot comprising 400 pieces of cloth. The results of the tests are tabulated below
No. of faults per piece   No. of pieces
0                          92
1                         142
2                          96
3                          46
4                          18
5                           6
6                           0
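With an ungrouped frequency distribution, the median can be located from the cumulative frequencies without listing all 400 observations. The sketch below does this for Example 6. The helper name `value_at` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Example 6: faults per piece and number of pieces with that many faults.
faults = [0, 1, 2, 3, 4, 5, 6]
pieces = [92, 142, 96, 46, 18, 6, 0]

n = sum(pieces)                        # 400 pieces
cum_freq = list(accumulate(pieces))    # 92, 234, 330, 376, 394, 400, 400

def value_at(position):
    """Value of the observation at a 1-based position in the ordered data."""
    # The first class whose cumulative frequency reaches the position contains it.
    return faults[bisect_left(cum_freq, position)]

if n % 2 == 1:
    median = value_at((n + 1) // 2)
else:
    median = (value_at(n // 2) + value_at(n // 2 + 1)) / 2

print(f"n = {n}, median = {median} fault(s) per piece")
```

Since n = 400 is even, the median is the average of the 200th and 201st ordered values, both of which fall in the class with 1 fault.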
Mode
the most commonly occurring value in a data series
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal, such as religious
preference, gender, or political affiliation.
• The mode is not always unique. A data set can have more than one
mode, or the mode may not exist for a data set.
The Mode – Individual Data
• Example 7
• Determine the mode from the following data:
• Marks obtained by 10 students
10, 27, 24, 12, 27, 27, 20, 18, 15, 20
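The mode can be found by counting how often each value occurs. A minimal sketch for Example 7 follows. It uses `statistics.multimode` (available from Python 3.8), which returns every value tied for the highest count and so also covers data sets with more than one mode.

```python
import statistics
from collections import Counter

marks = [10, 27, 24, 12, 27, 27, 20, 18, 15, 20]

counts = Counter(marks)               # how often each mark occurs
modes = statistics.multimode(marks)   # all values sharing the highest frequency

print("counts:", dict(counts))
print("mode(s):", modes)
```

Here 27 occurs three times, more often than any other mark, so the data have a single mode.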
The Mode – Grouped Frequency Distribution
• Example 8
• A client company of your firm is a horticultural shop selling a wide variety of products to its customers. The analysis of weekly sales of plants throughout the year is summarized in the following frequency distribution
Weekly sales of plants ($)   No. of weeks
1255 – 1280                   9
1280 – 1305                  19
1305 – 1330                  10
1330 – 1355                   8
1355 – 1380                   6
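For grouped data, the mode can be estimated by interpolating within the modal class (the class with the highest frequency), using the chapter's formula Mode = L + C × Δ1 / (Δ1 + Δ2), where Δ1 and Δ2 are the differences between the modal class frequency and the frequencies of the classes before and after it. A minimal sketch for Example 8 follows. Treating a missing neighbouring class as frequency 0 is an assumption of this sketch.

```python
# Example 8: lower class limits ($), common class width, and number of weeks per class.
lower_limits = [1255, 1280, 1305, 1330, 1355]
width = 25
weeks = [9, 19, 10, 8, 6]

m = weeks.index(max(weeks))                        # index of the modal class
before = weeks[m - 1] if m > 0 else 0              # frequency of the class before it
after = weeks[m + 1] if m < len(weeks) - 1 else 0  # frequency of the class after it

delta1 = weeks[m] - before   # 19 - 9 = 10
delta2 = weeks[m] - after    # 19 - 10 = 9

mode = lower_limits[m] + width * delta1 / (delta1 + delta2)
print(f"modal class starts at {lower_limits[m]}, estimated mode = {mode:.2f}")
```

The modal class is 1280 – 1305, and the estimate lies slightly above its centre because the class before it has a lower frequency than the class after it.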
Midrange
is a rough estimate of the middle and also a very rough estimate of the average; it can be affected by one extremely high or low value.
$MR = \frac{\text{lowest value} + \text{highest value}}{2}$
Types of Distribution
Symmetric
Positively skewed or right-skewed
Negatively skewed or left-skewed
Measures of Variation / Dispersion
• Used when a measure of central tendency alone is not informative or not needed (e.g., two data sets have the same mean)
• A measure that gauges the variability that exists in a data set
• To form a judgment about how well the average value depicts the data
• To learn the extent of the scatter so that steps may be taken to control the existing variation
Range
is the difference between the highest value and the lowest value in a data set.
The symbol R is used for the range.
R = highest value - lowest value
Variance
is the average of the squares of the distance each value is from the mean.
Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, N = population size
Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$, n = sample size
Standard Deviation
is the square root of the variance
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample standard deviation: $s = \sqrt{s^2}$
Variance
• The variance is the average of the squared deviations from the arithmetic mean
• Calculation of variance
• The following data relate to the marks obtained by 15 students in an Accounting examination
• 50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70
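The variance calculation for these marks can be sketched as below. The example does not say whether the 15 students are a population or a sample, so both versions are computed. The variable names are illustrative.

```python
import math

marks = [50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70]

n = len(marks)
mean = sum(marks) / n                            # 900 / 15 = 60
sum_sq_dev = sum((x - mean) ** 2 for x in marks)  # sum of squared deviations from the mean

pop_var = sum_sq_dev / n            # divide by N if the marks are the whole population
sample_var = sum_sq_dev / (n - 1)   # divide by n - 1 if the marks are a sample

pop_sd = math.sqrt(pop_var)
sample_sd = math.sqrt(sample_var)

print(f"mean = {mean}, sum of squared deviations = {sum_sq_dev}")
print(f"population variance = {pop_var:.2f}, standard deviation = {pop_sd:.2f}")
print(f"sample variance = {sample_var:.2f}, standard deviation = {sample_sd:.2f}")
```

Dividing by n - 1 instead of N makes the sample variance slightly larger than the population variance for the same data.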
Standard Deviation
• Calculation of standard deviation – grouped frequency distribution
• The following data relate to the sales of electronic calculators in the South of England
Sales per week (thousand)   No. of weeks
4 – 6                        2
6 – 8                        5
8 – 10                      12
10 – 12                      9
12 – 14                      3
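For grouped data, the variance is computed from the class midpoints weighted by the class frequencies, using the computational form (sum of f·x² minus (sum of f·x)² / n) given in this chapter. The sketch below applies it to the calculator sales. Whether the weeks are a sample or a population is not stated, so both divisors are shown.

```python
import math

# Sales per week (thousand): class limits and number of weeks in each class.
classes = [(4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]
weeks = [2, 5, 12, 9, 3]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # 5, 7, 9, 11, 13
n = sum(weeks)                                                  # total number of weeks
sum_fx = sum(f * x for f, x in zip(weeks, midpoints))           # sum of f * x
sum_fx2 = sum(f * x ** 2 for f, x in zip(weeks, midpoints))     # sum of f * x^2

sum_sq_dev = sum_fx2 - sum_fx ** 2 / n   # equals the sum of f * (x - mean)^2

sample_var = sum_sq_dev / (n - 1)
pop_var = sum_sq_dev / n
sample_sd = math.sqrt(sample_var)
pop_sd = math.sqrt(pop_var)

print(f"n = {n}, mean = {sum_fx / n:.3f}")
print(f"sample standard deviation = {sample_sd:.4f}")
print(f"population standard deviation = {pop_sd:.4f}")
```

As with the grouped mean, the result is an estimate because every observation in a class is represented by the class midpoint.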
Variance & Standard deviation
• Variances and standard deviations can be used to determine the spread
of the data. If the variance or standard deviation is large, the data are
more dispersed. The information is useful in comparing two or more data
sets to determine which is more variable.
• The measures of variance and standard deviation are used to determine
the consistency of a variable.
• The variance and standard deviation are used to determine the number
of data values that fall within a specified interval in a distribution.
• The variance and standard deviation are used quite often in inferential
statistics.
• Measures of Position
Describing the position of the data value
Quartile
$Q_i = L + C\left(\frac{\frac{in}{4} - F}{f}\right)$, for i = 1, 2, 3
where:
L = lower limit of the interval containing $Q_i$
C = width of the interval containing $Q_i$
F = cumulative frequency before the class containing $Q_i$
f = frequency of the class containing $Q_i$
Quartile Deviation - Individual Data
• The following are the marks of 9 students in a certain examination.
Student No   Marks
1            20
2            28
3            40
4            12
5            30
6            15
7            50
8            45
9            60
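One way to work this example is sketched below. It assumes the common convention that the i-th quartile sits at position i(n + 1)/4 in the ordered data, with linear interpolation between neighbouring values, and that the quartile deviation is half the distance between the third and first quartiles, (Q3 - Q1)/2. Other textbooks use slightly different position rules, so results may differ by convention. The function name is illustrative.

```python
marks = [20, 28, 40, 12, 30, 15, 50, 45, 60]
ordered = sorted(marks)   # 12, 15, 20, 28, 30, 40, 45, 50, 60

def quartile(data, i):
    """i-th quartile of ordered data (n >= 3) at 1-based position i(n + 1)/4."""
    position = i * (len(data) + 1) / 4
    k = int(position)            # whole part: 1-based index of the lower neighbour
    fraction = position - k      # how far to move towards the next value
    if k >= len(data):
        return data[-1]
    return data[k - 1] + fraction * (data[k] - data[k - 1])

q1 = quartile(ordered, 1)   # position 2.5
q2 = quartile(ordered, 2)   # position 5, the median
q3 = quartile(ordered, 3)   # position 7.5
quartile_deviation = (q3 - q1) / 2

print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, quartile deviation = {quartile_deviation}")
```

Under this convention Q1 falls halfway between the 2nd and 3rd ordered marks and Q3 halfway between the 7th and 8th.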
Quartile Deviation
Example – Grouped Frequency Distribution
• The following grouped frequency table describes the weight of 95 packages selected for a QC test.
Weight (grams)   No. of Packages
450 – 452        11
452 – 454        26
454 – 456        34
456 – 458        24
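The grouped quartile formula, Q_i = L + C((in/4 - F)/f), can be applied to this table as in the sketch below. The function name is hypothetical, and the quartile deviation is taken to be (Q3 - Q1)/2, a standard definition that the slides do not spell out.

```python
# Weight classes (grams): lower limits, common class width, and number of packages.
lower_limits = [450, 452, 454, 456]
width = 2
packages = [11, 26, 34, 24]

def grouped_quartile(lowers, class_width, freqs, i):
    """i-th quartile (i = 1, 2, 3) by interpolation within the class containing it."""
    n = sum(freqs)
    target = i * n / 4            # in / 4
    cum_before = 0                # F: cumulative frequency before the current class
    for lower, f in zip(lowers, freqs):
        if cum_before + f >= target:
            return lower + class_width * (target - cum_before) / f
        cum_before += f
    raise ValueError("i must be 1, 2 or 3")

q1 = grouped_quartile(lower_limits, width, packages, 1)
q3 = grouped_quartile(lower_limits, width, packages, 3)
quartile_deviation = (q3 - q1) / 2

print(f"Q1 = {q1:.3f} g, Q3 = {q3:.3f} g, quartile deviation = {quartile_deviation:.3f} g")
```

For Q1 the target position is 95/4 = 23.75, which falls in the 452 – 454 class (F = 11, f = 26); for Q3 it is 71.25, which falls in the 456 – 458 class (F = 71, f = 24).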
The measures of central tendency, variation, and position for grouped data

Measures of central tendency

Mean: $\bar{x} = \frac{\sum f_i x_i}{N}$
where:
$f_i$ = frequency of class i
$x_i$ = midpoint of class i
N = total frequency

Median: $\text{Median} = L + C\left(\frac{\frac{n}{2} - F}{f}\right)$
where:
L = lower limit of the interval containing the median
C = width of the interval containing the median
F = cumulative frequency before the median class
f = frequency of the median class

Mode: $\text{Mode} = L + C\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
where:
L = lower limit of the interval containing the mode
C = width of the interval containing the mode
$\Delta_1$ = frequency of the modal class - frequency of the class before the modal class
$\Delta_2$ = frequency of the modal class - frequency of the class after the modal class

Measures of variation

Population variance: $\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{N}}{N}$
where:
$f_i$ = frequency
$x_i$ = midpoint
N = population size
$\mu$ = mean

Sample variance: $s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum f_i x_i^2 - \frac{\left(\sum f_i x_i\right)^2}{n}}{n - 1}$
where:
$f_i$ = frequency
$x_i$ = midpoint
n = sample size
$\bar{x}$ = mean

In each formula the sums run over the classes of the frequency distribution.
Conclusions
• By combining all of the techniques discussed in this chapter, the student is now able to collect, organize, summarize and present data.