Uploaded by Aiza Aparejo

Module4 - Data Management (Gayas&Garcia) PART1

MODULE 4: Data Management
MODULE 4
Data Management
Catalina B. Gayas & Emmeline R. Garcia
Table of Contents
Lesson 1 Data Collection, Organization, and Interpretation
Basic Terminology in Statistics
Data Collection and Sampling Techniques
Frequency Distribution and Graphs for Numerical Data
Lesson 2 Measures of Central Tendency
Mean (Raw and Grouped Data)
The Weighted Mean
Median (Raw and Grouped Data)
Mode (Raw and Grouped Data)
Types of Distribution
Lesson 3 Measures of Variation
Range (Raw and Grouped Data)
Mean Absolute Deviation (Raw and Grouped Data)
Variance and Standard Deviation (Raw and Grouped Data)
Lesson 4 Measures of Relative Position
Standard Score
Percentiles, Deciles, and Quartiles
Lesson 5 Normal Distribution
The Standard Normal Distribution
Applications of Normal Distribution
Lesson 6 Correlation Coefficients and Linear Regression
Correlation Analysis
Linear Regression
Leyte Normal University | Mathematics Unit
Overview
Statistics is used in all aspects of human endeavor: to describe data; to determine significant
relationships between and among variables; to determine significant differences in a variable of
interest between or among groups; and to make forecasts and predictions. The concepts in
Statistics were already discussed in your K to 12 Curriculum. Hence, this module focuses on
applying these concepts in real settings that you can relate to. The module aims to make you
appreciate the importance of Statistics and, at the same time, have fun doing the exercises and
activities.
This module covers the following topics: Data Collection, Organization, and Interpretation;
Measures of Central Tendency; Measures of Dispersion; Measures of Relative Position; Normal
Distribution; and Correlation Coefficients and Linear Regression. Computer applications will be
used in this module, especially Microsoft Excel and statistical analysis software such as SPSS,
for data analysis.
Objectives
At the end of this module, you should be able to:
1. demonstrate knowledge of basic statistical terms;
2. use statistical methods to summarize and organize data;
3. solve problems applying normal distribution;
4. apply linear regression and correlation in analyzing data; and
5. interpret computer outputs in data analysis.
LESSON 1: Data Collection, Organization, and Interpretation
Introduction
Statistics is defined as the science of collecting, organizing, summarizing, presenting, and
interpreting data. There are three main reasons why students study statistics: (1) to read and
understand the various statistical studies published in print or broadcast media; (2) to conduct
research in their own field, since statistical procedures are basic to research; and (3) to become
better consumers and citizens by using the knowledge gained from studying statistics.
Basic Terminology in Statistics
In studying statistics, it is important to understand the basic terms used in the subject. The
following terms are defined for this purpose.
A variable is a characteristic or attribute that can assume different values. Examples of
variables are sex, nationality, score, and height. Data are the measurements or observations that
the variables can assume. A data set is a collection of data values, and each particular value in
the set is called a datum.
There are two branches of statistics. The branch that involves the collection, organization,
summarization, and presentation of data is called descriptive statistics. The branch that
generalizes from a sample (a representative part of a population) to a population (the totality of
all observations or entities of any sort), performs estimation and hypothesis testing, determines
relationships among variables, and makes predictions is called inferential statistics.
Variables can be classified as quantitative or qualitative. A quantitative variable takes numerical
values that can be ordered or ranked. IQ, scores, weight, and temperature are examples of
quantitative variables. A qualitative variable takes values that are categories rather than
numerical measurements. Quantitative variables are further classified as discrete or continuous.
A discrete variable assumes values that can be counted. A continuous variable, on the other hand,
can assume an unlimited number of values between any two specific values; it is measured rather
than counted. The number of deaths in a certain locality due to the CoViD-19 pandemic is an example
of a discrete variable, while the height of a person is an example of a continuous variable. Why is
height considered a continuous variable? What are other examples of continuous variables? How
about discrete variables?
Variables are also classified according to four levels of measurement: nominal, ordinal, interval,
and ratio. The nominal scale is the simplest scale of measurement; it classifies data into mutually
exclusive categories and uses numbers as labels only. Sex, occupation, religious affiliation, and
marital status are examples of nominal data. The ordinal scale uses numbers for labelling, and the
numbers can be ranked. However, the differences between ranks are not equal. Socio-economic
status, Latin honors, and academic rank are examples of ordinal data. The interval scale possesses
the characteristics of the ordinal scale (label and rank), and equal differences between ranks
exist. However, interval data have no true zero value. Scores in an examination, temperature, and
Intelligence Quotient (IQ) are examples of interval data. The ratio scale is the highest level of
measurement. It possesses all the characteristics of the interval scale (label, rank, equal
differences between ranks), and a true zero value exists. Distance travelled, height, weight, and
age are examples of ratio data.
Variables are also classified according to their functions, especially in experimental studies:
the independent or explanatory variable, the dependent or outcome variable, and the confounding
variable. The independent variable is the variable manipulated by the researcher, while the
dependent variable is the variable affected or influenced by the manipulated variable. The
confounding variable, on the other hand, is a variable other than the independent variable that
also influences the dependent variable. For example, a researcher is interested in finding out the
effect of learning delivery modes (pure online, pure printed module, mixture of online and printed
module) on the performance (test score) of the students in GE104. The delivery mode is the
independent variable; the performance is the dependent variable. The performance can also be
affected by the learning ability of the students. Thus, learning ability is a confounding variable.
Data Collection and Sampling Techniques
Data can be collected in different ways. The method used depends on the source of the data as well
as the type of data to be collected. Data can be collected through surveys (telephone,
questionnaire, or interview), tests, observation, and experimentation. How each method is carried
out, and the advantages of one over another, are not part of this lesson because they are
discussed exhaustively in your research course.
Data are collected from a representative part of a population called a sample. The process of
selecting samples is called sampling. There are two types of sampling: non-probability and
probability sampling. In non-probability sampling, not every member of the population is given an
equal chance of being chosen; hence the samples are not truly representative of the population. If
the objective of the study is to make a generalization, non-probability sampling is discouraged.
Convenience or Accidental Sampling, Purposive or Judgemental Sampling, and Quota Sampling are the
most common non-probability sampling techniques.
Probability sampling on the other hand gives equal chance to each member of the population to be
selected as a representative. There are four techniques under this type of sampling. They are as
follows: simple random sampling, systematic random sampling, stratified random sampling and
cluster random sampling.
Simple Random Sampling is a technique used when the population is homogeneous with respect to the
characteristic of interest to the researcher and the population size is known (Petilos, 2012).
The sample can be selected either by the lottery method or by using random numbers.
Systematic Random Sampling is a technique that obtains the desired sample size by selecting every
kth subject. To select the sample, the researcher numbers the members of the population
consecutively and then determines the value of k by dividing the total number of cases (the
population) by the desired number of samples. For example, suppose the total population (N) is
1,000 and the sample size (n) is 100. The value of k is then 10, so the researcher selects every
10th subject in the population. The starting number, between 1 and 10, is chosen by simple random
sampling. Suppose the starting number is 6; the researcher then takes the subjects numbered 6, 16,
26, and so on until the desired number of samples is completed.
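The selection procedure described above can be sketched in Python. This is only an illustration under stated assumptions: the function name `systematic_sample` is hypothetical, the members are assumed to be numbered from 1 to N, and N is assumed to be an exact multiple of n.

```python
import random

def systematic_sample(population_size, sample_size, start=None):
    """Return the numbers of the subjects chosen by systematic random sampling.

    Assumes members are numbered 1..population_size and that
    population_size is an exact multiple of sample_size.
    """
    k = population_size // sample_size      # sampling interval, k = N / n
    if start is None:
        start = random.randint(1, k)        # random starting number from 1 to k
    # Take every k-th subject, beginning at the starting number.
    return list(range(start, population_size + 1, k))

# Worked example from the text: N = 1,000, n = 100, starting number 6.
chosen = systematic_sample(1000, 100, start=6)
print(chosen[:3], len(chosen))   # first three selected numbers and the sample size
```

Passing `start` explicitly reproduces the example in the text; leaving it out lets the starting number be drawn at random.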
Stratified Random Sampling is a technique in which the population is grouped into subgroups called
strata according to common characteristic/s determined by the researcher. Subjects are then
selected from each stratum in proportion to the size of that subgroup. For example, suppose the
population consists of all freshman students across the three colleges (A, B, and C) in
University X. The total freshman population of the three colleges is 1,400, divided as follows:
NA = 350, NB = 500, and NC = 550, and the researcher wishes to take a total of 350 respondents. He
then selects the desired sample from each stratum, using either simple random sampling or
systematic random sampling, based on the following computation:
College   N       Computation                n
A         350     (350/1,400)(350) = 87.5    88
B         500     (500/1,400)(350) = 125     125
C         550     (550/1,400)(350) = 137.5   138
Total     1,400                              351
[Note: Because of rounding in A and C (87.5 becomes 88 and 137.5 becomes 138), the total is 351
instead of 350, so the researcher has to decide from which subgroup to reduce the sample by 1.]
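The proportional allocation in this example can be checked with a short Python sketch. The function names `proportional_allocation` and `round_half_up` are hypothetical, chosen only for this illustration. A small helper is used for rounding because Python's built-in `round` rounds halves to the nearest even number rather than always upward.

```python
def proportional_allocation(strata_sizes, total_sample):
    """Return the unrounded share of the sample for each stratum."""
    population = sum(strata_sizes.values())
    # Multiply before dividing so that shares such as 137.5 come out exactly.
    return {name: size * total_sample / population
            for name, size in strata_sizes.items()}

def round_half_up(x):
    """Round a non-negative number to the nearest whole number, .5 going up."""
    return int(x + 0.5)

strata = {"A": 350, "B": 500, "C": 550}      # freshman population per college
shares = proportional_allocation(strata, 350)
rounded = {name: round_half_up(share) for name, share in shares.items()}

print(shares)
print(rounded, sum(rounded.values()))        # the rounded total exceeds 350 by 1
```

The rounded total shows why the researcher still has to remove one respondent from one of the strata.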
Cluster Random Sampling is a technique used when the population is very large or the respondents
reside in a large geographic area, so that it is impossible for the researcher to obtain a list of
all members of the population. The members of each cluster are heterogeneous. Unlike stratified
random sampling, where the subjects are selected individually, in this technique the cluster/s are
selected randomly and all members of the selected cluster/s represent the population. For example,
a researcher wishes to determine the type of fertilizer (pure synthetic, pure organic, or a
combination of synthetic and organic) used by rice farmers in the municipality of Town Q. Assuming
that no list of rice farmers (categorized as small-scale, medium-scale, and large-scale rice
producers) is available, the researcher can get a copy of the map of Town Q and determine the
number of barangays located outside downtown and along the seashore. Each of these barangays is
considered a cluster. Suppose 43 barangays belong to this group; there are then 43 clusters to
choose from. The researcher decides how many of these barangays will be included and then randomly
selects the cluster/s. The rice farmers in the selected cluster/s represent the group from Town Q.
Frequency Distribution and Graphs for Numerical Data
Once the researcher has already collected the data, the next thing to do is to organize. There are
three ways of presenting data: tabular, graphical and textual. The following discussion focuses on
how to organize raw data and subsequently represent those using graphs.
Example 1.1
Below are scores of 50 students in Statistics examination.
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Looking at the array of scores, it would be difficult for the reader to tell the characteristics of
the group. Thus, a frequency distribution needs to be prepared. A frequency distribution is an
organization of raw data into classes/groups and their frequencies. It is a tabular way of
organizing raw data. The following are the steps in preparing a frequency distribution.
Step 1. Determine the number of classes.
• Find the highest value (HV) and lowest value (LV).
• Find the range (R) by subtracting the lowest value from the highest value.
• Estimate the number of classes by taking the square root of n; call this k.
The actual number of classes may be greater than the estimated one.
Step 2. Determine the class size (c) of the interval: $c = \frac{R}{k}$, rounded up to the next
whole number.
Step 3. Determine the lower and upper limits of the lowest class interval. The lower limit should
be divisible by the class size.
Step 4. Determine the upper class, that is, the highest class interval.
Step 5. Tally the scores in their respective classes.
Step 6. Summarize the tallies.
Illustration: Using the array of raw scores given above, we have:
1. Determine the number of classes.
$$R = HV - LV = 98 - 46 = 52$$
(This tells us the gap between the highest and lowest scores in the given data set.)
$$k = \sqrt{50} \approx 7.07, \quad \text{so } k = 7$$
2. Determine the class size.
$$c = \frac{R}{k} = \frac{52}{7} \approx 7.43, \quad \text{so } c = 8$$
3. Determine the lower and upper limits of the lowest class interval.
Since the lowest value in the given data set is 46, which is not divisible by the class size of 8,
we look for the largest number below 46 that is divisible by 8. That number is 40. So the lower
limit of the lowest class interval is 40 and its upper limit is 47, because the lower limit of the
next class interval is 48 (the lower limit of the preceding class plus the class size c). It
follows that the upper limit of this next class interval is 55, so the second class interval is
48 - 55. Following the same procedure, you can find the remaining class intervals.
4. Determine the upper class. The highest class interval should contain the highest value of the
given data set. Since our highest value, 98, is not divisible by the class size of 8, the lower
limit of the highest class interval should be the largest number below 98 that is divisible by 8.
That number is 96. Thus, the highest class interval is 96 - 103.
5. List down the class intervals and tally the scores in their respective classes.
Class Limits   Class Boundaries   Tallies             Frequency
96 - 103       95.5 - 103.5       /                   1
88 - 95        87.5 - 95.5        /////               5
80 - 87        79.5 - 87.5        /////-/////-////    14
72 - 79        71.5 - 79.5        /////-/////-///     13
64 - 71        63.5 - 71.5        /////-///           8
56 - 63        55.5 - 63.5        ////                4
48 - 55        47.5 - 55.5        ///                 3
40 - 47        39.5 - 47.5        //                  2
REMARKS:
• In this illustration the actual number of classes which is 8 is greater than the estimated value of k which is
7.
• The second column shows the boundary of each class interval in which the actual lower and upper limits are
indicated. These are called true limits or class boundaries.
• The true upper limit of the preceding class is also the true lower limit of the succeeding class. This shows
the continuity of the data.
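The steps for preparing a frequency distribution can be sketched in Python as follows. This is an illustration rather than a prescribed method: the function name `frequency_distribution` is hypothetical, the data are assumed to be whole numbers, and the class size is rounded up, as in the illustration where 7.43 becomes 8.

```python
import math

# Scores of the 50 students in Example 1.1.
scores = [
    63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
    68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
    77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
    94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
    56, 69, 83, 83, 71, 48, 90, 52, 75, 84,
]

def frequency_distribution(data):
    """Return (lower limit, upper limit, frequency) per class, lowest class first."""
    value_range = max(data) - min(data)      # R = HV - LV
    k = round(math.sqrt(len(data)))          # estimated number of classes
    c = math.ceil(value_range / k)           # class size, rounded up
    lower = (min(data) // c) * c             # largest multiple of c not above LV
    table = []
    while lower <= max(data):                # stop once the highest value is covered
        upper = lower + c - 1                # upper class limit
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
        lower += c
    return table

for lower, upper, freq in frequency_distribution(scores):
    print(f"{lower:>3} - {upper:<3}  {freq}")
```

The sketch lists the classes from lowest to highest, the reverse of the order used in the table above.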
Using the same data set as presented in the frequency distribution above, we can prepare graphs. In
this module, we will discuss only the histogram, the frequency polygon, and the ogive. These are
the graphs most commonly used in research.
A histogram displays the data using continuous bars (vertical or horizontal). It is a bar graph in
which the bars are drawn with no space between them, which implies that the data presented are
continuous. The heights/lengths of the bars show the frequencies of the respective classes. The
frequency polygon, on the other hand, displays the data using lines that connect the points plotted
for the frequencies of the classes. This graph is also used when the data are continuous. Both
graphs use the midpoints of the classes along the variable axis.
The ogive is a graph that shows the cumulative frequencies of the classes in the given
distribution. It can be constructed for either the less than cumulative frequency or the greater
than cumulative frequency.
The following are the steps in constructing these graphs manually. The same graphs can be
constructed using either Excel or Minitab; the specific steps are illustrated in the book of
Bluman.
Example 1.2
Before constructing the different graphs, we need to add more information in our
frequency distribution as shown below.
Class Interval   f    X      <cf   >cf   rf
95.5 - 103.5     1    99.5   50    1     2.0
87.5 - 95.5      5    91.5   49    6     10.0
79.5 - 87.5      14   83.5   44    20    28.0
71.5 - 79.5      13   75.5   30    33    26.0
63.5 - 71.5      8    67.5   17    41    16.0
55.5 - 63.5      4    59.5   9     45    8.0
47.5 - 55.5      3    51.5   5     48    6.0
39.5 - 47.5      2    43.5   2     50    4.0
                 N = 50                  100.0
REMARKS:
• The midpoint of each class is obtained using the formula: $X = \frac{LL + UL}{2}$.
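The additional columns of this table (X, <cf, >cf and rf) can be computed from the class limits and frequencies. The sketch below is only an illustration; the function name `extend_table` and the dictionary keys are hypothetical, and rf is assumed to mean the relative frequency of each class expressed in percent.

```python
# Class limits and frequencies from the frequency distribution, lowest class first.
classes = [(40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
           (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1)]

def extend_table(classes):
    """Add the midpoint, <cf, >cf and relative frequency (%) to each class."""
    n = sum(f for _, _, f in classes)
    rows, less_cf = [], 0
    for lower, upper, f in classes:
        less_cf += f                       # cases in this class and all lower classes
        greater_cf = n - less_cf + f       # cases in this class and all higher classes
        rows.append({
            "class": (lower, upper),
            "f": f,
            "X": (lower + upper) / 2,      # midpoint = (LL + UL) / 2
            "<cf": less_cf,
            ">cf": greater_cf,
            "rf": 100 * f / n,             # relative frequency in percent
        })
    return rows

rows = extend_table(classes)
for row in rows:
    print(row)
```

The relative frequencies always add up to 100, which is a quick check on the rf column.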
Steps in Constructing a Histogram
Step 1. Construct two perpendicular axes (vertical and horizontal).
Step 2. Label the vertical axis as the frequency axis and the horizontal axis as the variable axis.
(In our illustration below, the variable is a score.)
Step 3. Lay off segments along the vertical axis (y-axis) to correspond to the frequencies. (The
segments must be equal in length.)
Step 4. Lay off segments along the horizontal axis (x-axis) to correspond to the different class
intervals of the variable. The first line segment should be moved a little to the right if the
lowest value of the variable is not zero.
Step 5. Mark all midpoints of the intervals and label these using the class midpoints.
Step 6. Draw rectangles or bars whose heights correspond to the frequency counts and whose widths
correspond to the class size. (Shade or color your bars.)
Adapted from: Resource Materials in Basic Statistics (Petilos, p. 9)
Figure 1.1. Histogram (vertical axis: Frequency; horizontal axis: Score)
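A rough text version of the histogram can be produced with a few lines of Python. This is only a sketch with a hypothetical function name, `text_histogram`; it draws horizontal bars using the class boundaries and frequencies of the distribution, and it does not replace a properly drawn graph.

```python
# Class boundaries and frequencies from the frequency distribution.
classes = [("39.5 - 47.5", 2), ("47.5 - 55.5", 3), ("55.5 - 63.5", 4),
           ("63.5 - 71.5", 8), ("71.5 - 79.5", 13), ("79.5 - 87.5", 14),
           ("87.5 - 95.5", 5), ("95.5 - 103.5", 1)]

def text_histogram(classes, symbol="#"):
    """Return one line per class; the length of each bar equals the frequency."""
    width = max(len(label) for label, _ in classes)   # align the class labels
    return [f"{label:<{width}} | {symbol * freq} ({freq})"
            for label, freq in classes]

lines = text_histogram(classes)
for line in lines:
    print(line)
```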
Steps in Constructing a Frequency Polygon
Step 1. Construct two perpendicular axes (vertical and horizontal).
Step 2. Label the vertical axis as the frequency axis and the horizontal axis as the variable axis.
(In our illustration above, the variable is a score.)
Step 3. Lay off segments along the vertical axis (y-axis) to correspond to the frequencies. (The
segments must be equal in length.)
Step 4. Lay off segments along the horizontal axis (x-axis) to correspond to the different class
intervals of the variable. The first line segment should be moved a little to the right if the
lowest value of the variable is not zero.
Step 5. For each class interval, treat the class midpoint and the corresponding frequency as an
ordered pair and plot it in the plane determined by the coordinate axes.
Step 6. Join the plotted points using line segments from left to right. To close the polygon,
extend one class interval on both sides by connecting the endpoints of the graph to the midpoints
of the extended segments along the x-axis.
Adapted from: Resource Materials in Basic Statistics (Petilos, p. 10)
Figure 1.2. Frequency Polygon (vertical axis: Frequency, scaled from 0 to 16; horizontal axis: Score)
Steps in Constructing an Ogive
Step 1. Construct two perpendicular axes (vertical and horizontal).
Step 2. Label the vertical axis as the cumulative frequency axis and the horizontal axis as the
variable axis. (In our example, the variable is a score.)
Step 3. Lay off equal segments along the vertical axis (y-axis) to correspond to the cumulative
frequencies. Use an appropriate scale to represent the cumulative frequencies. (Depending on the
numbers in the cumulative frequencies, the scale can be by 2's, 4's, 5's, etc.)
Step 4. Lay off equal segments along the horizontal axis (x-axis) to correspond to the true upper
limits for the less than ogive and the true lower limits for the greater than ogive.
Step 5. Plot the cumulative frequencies against the corresponding class boundaries.
Step 6. Join the plotted points using line segments from left to right.
Reference: Bluman, pp. 54-55
REMARKS:
• The ogive is used to determine the percentage or the number of cases found below or above a
particular boundary.
• If the ogives (for >cf and <cf) are graphed on the same coordinate plane, a line can be drawn
from the point of intersection of the two graphs to the variable axis; the point where it meets
the axis represents the median of the data set.
Figure 1.3. Less than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
Figure 1.4. Greater than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)
Stem and Leaf Plot
Another method of organizing data, which combines sorting and graphing, is called the stem and
leaf plot. It is a data plot that uses the leading digit as the stem and the trailing digit as
the leaf.
Steps in Constructing a Stem and Leaf Plot
Step 1. List down the leading digits of the data set, called the stems. Arrange them in a column
either from lowest to highest or vice versa.
Step 2. Starting from the first to the last entry of the data set, carefully record the trailing
digits (leaves) in their corresponding stems.
Step 3. Arrange the trailing digits in each row in order. If there are no data values in a class,
the stem number is written and the leaf row is left blank.
Reference: Bluman, pp. 81-82
Example 1.3
Let us illustrate the above procedure using the scores of the 50 students in the Statistics
examination. The data are reproduced as follows:
63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84
Steps:
1. List the stems and record the leaves in the order in which they are encountered:
Stem (Leading Digit)   Leaf (Trailing Digit)
9                      48220
8                      831123627113744
7                      76599717678800045
6                      3897601
5                      622
4                      687
2. Rearranging the trailing digits (leaves), we have:
Stem   Leaf
9      02248
8      111122333446778
7      00014556677778899
6      0136789
5      226
4      678
REMARKS:
• The figure shows that the distribution peaks in the center and there are no gaps in the data.
• The highest score is 98 and the lowest is 46.
• Most scores are 70 and above.
What other information can you draw from the figure above?
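The construction of the stem and leaf plot can also be sketched in Python. The function name `stem_and_leaf` is hypothetical, and the sketch assumes whole-number data in which the last digit is the leaf and the remaining leading digit(s) form the stem.

```python
from collections import defaultdict

# Scores of the 50 students in the Statistics examination.
scores = [
    63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
    68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
    77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
    94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
    56, 69, 83, 83, 71, 48, 90, 52, 75, 84,
]

def stem_and_leaf(data):
    """Return {stem: sorted list of leaves} for whole-number data."""
    plot = defaultdict(list)
    for value in data:
        stem, leaf = divmod(value, 10)     # leading digit(s) and trailing digit
        plot[stem].append(leaf)
    # Keep stems that have no data so that gaps remain visible (Step 3).
    return {stem: sorted(plot.get(stem, []))
            for stem in range(min(plot), max(plot) + 1)}

plot = stem_and_leaf(scores)
for stem in sorted(plot, reverse=True):    # highest stem first, as in the example
    print(stem, "|", "".join(str(leaf) for leaf in plot[stem]))
```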
Exercises 1.1
A. Determine the area of statistics (descriptive or inferential) illustrated by the following
statements.
1. A recent study showed that eating garlic could lower blood pressure.
2. The teacher - pupil ratio in public schools has increased from 1:40 in 2015 to 1:50 in
2019.
3. It is predicted that the average number of automobiles each household owns will
increase next year.
4. A study revealed that Lagundi is more effective in curing cough than a similar
product.
5. Consumers generally prefer Colgate to any other toothpaste.
B. In each statement below identify the variable/s and classify it/them according to the level of
measurement (nominal, ordinal, interval, ratio)
1. Marital status of faculty members in a university.
2. Time it takes a student to travel from home to school.
3. Scores in the College Admission Test of freshman students in University Q.
4. Socio-economic status of the residents in a barangay (poor, average, above-average).
5. Ages of freshman college students of Leyte Normal University.
C. Classify each variable as discrete or continuous
1. Number of CoViD-19 cases in Eastern Visayas.
2. Weights of backpacks of college students inside a Science laboratory room.
3. Number of new mono bloc chairs inside the university social hall.
4. Blood pressures of patients seeking admission in a hospital.
5. Number of boxes of disposable surgical masks sold in one pharmacy in three days.
D. A study is to be conducted to determine the level of language proficiency and numeracy skills
among the 700 Education and 300 Management graduating students at University Q. The
researcher wants a sample of 300, obtained by selecting representatives from the two programs.
1. What is the population of the study?
2. What is the sample in the study?
3. What are the variables of the study? What is the level of measurement of each
variable?
4. What sampling technique is used in this study?
E. An insurance company researcher conducted a survey on the number of car thefts in a
large city for a period of 30 days last summer. The raw data are shown below. Construct a
grouped frequency distribution, frequency polygon, histogram and ogives (Show all
necessary solutions).
52 58 75 79 57 65 62 77 56 51
59 53 51 66 55 68 63 78 50 53
67 65 69 66 69 57 73 72 75 55
LESSON 2: Measures of Central Tendency
Introduction
Statistics is the science of collecting, organizing, summarizing, presenting, and interpreting
data. There are two branches of statistics. The branch that involves the collection, organization,
summarization, and presentation of data is called descriptive statistics, while the branch that
involves interpreting data and drawing conclusions is called inferential statistics. Descriptive
statistics include the measures of central tendency, measures of position, and measures of
variability.
There are three measures of central tendency or measures of central location, namely: the mean,
median and the mode. The measure of central tendency is a single value that describes a whole set
of data by identifying the central position within the given data set. It is sometimes called
the measure of central location or summary statistics.
Mean (Raw and Grouped Data)
The data gathered in their original form is called raw or ungrouped data, while the data that have
been organized into a frequency distribution is called grouped data.
For raw data, the mean is defined as the arithmetic average of a data set. It is equal to the sum
of the measurements divided by the number of cases (n). It is the measure used when the data set
has no extreme values and the data are either interval or ratio. Among the three measures of
central tendency, the mean is the most reliable and is amenable to further mathematical
manipulation, which makes it useful for inferential statistics.
Formula: $$\text{mean} = \frac{\sum x}{n}$$
The Greek capital letter sigma ($\Sigma$) is used to denote a sum. Thus, the formula above means
the sum of the values of x divided by the total number of cases. For data collected from a
population, the symbol used for the mean is the Greek letter $\mu$ (read as: mu), which is called
a parameter. For data collected from a sample, the symbol used for the mean is $\bar{x}$ (read as:
x bar), which is called a statistic. The total number of cases is denoted by N for a parameter and
by n for a statistic. Thus, the working formulas for the mean of a population and of a sample are:
$$\mu = \frac{\sum x}{N} \qquad \bar{x} = \frac{\sum x}{n}$$
Example 2.1. Compute the average of the scores of 15 students in a Math quiz.
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Solution:
Using a calculator, we have:
$$\bar{x} = \frac{23 + 25 + 34 + \dots + 24}{15} = \frac{412}{15} \approx 27.5$$
This implies that the average score of the 15 students in the Math quiz is 27.5.
Note: Rounding Rule for the Mean. The mean should be rounded to one more decimal place than
occurs in the raw data.
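The computation in Example 2.1 can be verified with a short Python sketch. The variable names are illustrative only; `statistics.mean` from the Python standard library is used as a cross-check on the manual computation.

```python
from statistics import mean

# Scores of the 15 students in the Math quiz (Example 2.1).
quiz_scores = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]

total = sum(quiz_scores)          # sum of the measurements
n = len(quiz_scores)              # number of cases
x_bar = total / n                 # sample mean = sum of x divided by n

# Round to one more decimal place than the raw data, per the rounding rule.
print(total, n, round(x_bar, 1))
print(mean(quiz_scores))          # the library function gives the unrounded mean
```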
For grouped data, the mean is obtained by using the formula below:
$$\bar{x} = \frac{\sum fX}{N}$$
where: $\bar{x}$ = average or mean
f = class frequency
X = midpoint of each class
N = total number of cases
Example 2.2. Using the data in Example 1.1, we find the mean of grouped data. (Scores of
50 students in Statistics examination)
Class Interval   f    X      fX
96 - 103         1    99.5   99.5
88 - 95          5    91.5   457.5
80 - 87          14   83.5   1169.0
72 - 79          13   75.5   981.5
64 - 71          8    67.5   540.0
56 - 63          4    59.5   238.0
48 - 55          3    51.5   154.5
40 - 47          2    43.5   87.0
                 N = 50      ΣfX = 3,727.0
By substitution, we have:
$$\bar{x} = \frac{\sum fX}{N} = \frac{3727}{50} = 74.54$$
Therefore, the average score of the 50 students in the Statistics examination is 74.54.
Note: We rounded off the computed mean to the nearest hundredth because the class boundaries are
actually 0.5 below and above the given limits. That is, the true lower limit of each class
interval is 0.5 below the apparent lower limit, and the true upper limit is 0.5 above the apparent
upper limit. For example, the class interval 40 - 47, with a lower limit of 40 and an upper limit
of 47, has a true lower limit of 39.5 (0.5 lower than the apparent lower limit) and a true upper
limit of 47.5 (0.5 higher than the apparent upper limit).
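The grouped mean of Example 2.2 can be reproduced with the following Python sketch. The function name `grouped_mean` is hypothetical; each class is assumed to be given as its lower limit, upper limit and frequency.

```python
# Class limits and frequencies for the scores of the 50 students.
classes = [(40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
           (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1)]

def grouped_mean(classes):
    """Return the mean of grouped data: (sum of f times X) divided by N."""
    n = sum(f for _, _, f in classes)                    # total number of cases
    sum_fx = sum(f * (lower + upper) / 2                 # f times the class midpoint
                 for lower, upper, f in classes)
    return sum_fx / n

print(round(grouped_mean(classes), 2))
```

The grouped mean differs from the mean of the raw scores because every score in a class is treated as if it were equal to the class midpoint.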
There is another method of finding the mean of grouped data by using the assumed
deviation. However, the discussion of this method will not be included in this module.
The Weighted Mean
When the values or observations do not all carry equal weight, the weighted mean is used. The
weighted mean is computed using the formula below:
$$\bar{X} = \frac{\sum wX}{\sum w} = \frac{w_1X_1 + w_2X_2 + \dots + w_nX_n}{w_1 + w_2 + \dots + w_n}$$
where: $w_1, w_2, \dots, w_n$ are the weights and $X_1, X_2, \dots, X_n$ are the values or
observations
$\sum wX$ = sum of the products of each value and its respective weight
$\sum w$ = sum of the weights
Example 2.3
Find the grade weighted average of a student in his five subjects as shown in the table below:
Subject          Grade (X)   No. of Units (w)   wX
Mathematics      1.5         3                  4.5
English          1.7         3                  5.1
PE               1.3         2                  2.6
Physics          1.6         5                  8.0
Social Science   1.5         3                  4.5
                             Σw = 16            ΣwX = 24.7
By substituting into the formula, we find the Grade Weighted Average (GWA) of the student:
$$\bar{X} = \frac{\sum wX}{\sum w} = \frac{24.7}{16} \approx 1.54$$
Thus, the grade weighted average of the student is 1.54.
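The weighted mean in Example 2.3 can be checked with a brief Python sketch. The function name `weighted_mean` is hypothetical; the grades are the values X and the numbers of units are the weights w.

```python
# (subject, grade X, number of units w) from Example 2.3.
grades = [("Mathematics", 1.5, 3), ("English", 1.7, 3), ("PE", 1.3, 2),
          ("Physics", 1.6, 5), ("Social Science", 1.5, 3)]

def weighted_mean(values, weights):
    """Return the sum of w times X divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

gwa = weighted_mean([grade for _, grade, _ in grades],
                    [units for _, _, units in grades])
print(round(gwa, 2))
```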
Median (Raw and Grouped Data)
The median is the middlemost value of the measurements when they are arranged from lowest to
highest. It is used when the data are at least ordinal. The median is not affected by extreme
values or outliers. The median is reliable, but it is less stable than the mean.
For raw data or ungrouped data, the median is obtained by getting the middlemost value after the
data set is arranged from lowest to highest. It is the value that divides the data set into two equal
parts.
Example 2.4
Using the data set in Example 2.1, we have:
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24
Solution: Arrange the scores from lowest to highest.
Using a stem and leaf plot, we have:
Stem   Leaf
3      42407
2      3524646654
Rearranging the leaves in the plot above, we have:
Stem   Leaf
3      02447
2      2344455666
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Thus, the median of the given data set is 26, the 8th score in the ordered list. This implies that
there are seven cases below this score and seven cases above it. Example 2.4 is a data set with an
odd number of cases (n = 15). How do we find the median when the number of cases is even? By
definition, the median is still the middlemost value.
Example 2.5
22 23 24 24 24 25 25 26 26 26 30 32 34 34 35 37
To get the median when the number of cases is even, we take the average of the
$\left(\frac{n}{2}\right)$th case and the $\left(\frac{n}{2}+1\right)$th case. Thus, with n = 16,
we have:
$$Md = \frac{\left(\frac{n}{2}\right)\text{th case} + \left(\frac{n}{2}+1\right)\text{th case}}{2} = \frac{26 + 26}{2} = 26$$
This implies that the value of 26 divides the cases into two equal parts. The median is neither the
8th nor the 9th case itself but the value midway between the 8th and 9th cases, which here is 26.
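Both cases, odd and even, can be handled by one short Python function. The sketch below uses a hypothetical function name, `median`, and is checked against Examples 2.4 and 2.5.

```python
def median(data):
    """Return the middlemost value of the data after sorting."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd number of cases: the single middle case.
        return ordered[middle]
    # Even number of cases: average of the (n/2)th and (n/2 + 1)th cases.
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_cases = [23, 25, 34, 32, 22, 24, 26, 24, 34, 30, 26, 26, 37, 25, 24]
even_cases = [22, 23, 24, 24, 24, 25, 25, 26, 26, 26, 30, 32, 34, 34, 35, 37]

print(median(odd_cases))    # Example 2.4, n = 15
print(median(even_cases))   # Example 2.5, n = 16
```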
For grouped data, the median is obtained using the formula below:
$$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c)$$
where: LL = true lower limit of the median class
N = total number of cases
cf = cumulative frequency below the median class
f = frequency of the median class
c = class size (width of the class interval)
Example 2.6
Using the data in Example 1.1 we find the median of grouped data. (Scores of 50 students
in Statistics examination)
Class Interval | f | <cf
96 – 103 | 1 | 50
88 – 95 | 5 | 49
80 – 87 | 14 | 44
72 – 79 (median class) | 13 (f) | 30
64 – 71 | 8 | 17 (cf)
56 – 63 | 4 | 9
48 – 55 | 3 | 5
40 – 47 | 2 | 2
N = 50
Note that 50% of 50 cases is 25. This means that we find a number or value such that 50% of the
total number of cases is below and above it. Using the formula above we have:
$$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c) = 71.5 + \left(\frac{\frac{50}{2} - 17}{13}\right)(8)$$

$$= 71.5 + \left(\frac{25 - 17}{13}\right)(8) = 71.5 + (0.6154)(8) = 71.5 + 4.92$$

$$Md = 76.42$$
This implies that 76.42 is the middlemost value of the given data set. This means that there are 25
cases found below and above this value.
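The grouped-data computation can be checked with a short Python sketch. The function name `grouped_median` and the representation of each class as a pair (true lower limit, frequency) are assumptions made for this illustration; the frequencies are those of Example 2.6.

```python
def grouped_median(classes, width):
    """Median of grouped data.

    classes: list of (true_lower_limit, frequency), lowest class first.
    width:   class size c.
    """
    n = sum(f for _, f in classes)
    half = n / 2
    cf = 0  # cumulative frequency below the class being examined
    for lower, f in classes:
        if f > 0 and cf + f >= half:
            # this is the median class: apply Md = LL + ((N/2 - cf) / f) * c
            return lower + (half - cf) / f * width
        cf += f
    raise ValueError("the distribution has no cases")


exam = [(39.5, 2), (47.5, 3), (55.5, 4), (63.5, 8),
        (71.5, 13), (79.5, 14), (87.5, 5), (95.5, 1)]

md = grouped_median(exam, 8)
print(round(md, 2))
```

The loop stops at the class 72 – 79, where the cumulative frequency first reaches N/2 = 25, and the result rounds to 76.42.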
Mode (Raw and Grouped Data)
The mode is the most frequent value in a given data set. The mode is used when you want
a quick estimate of the typical value in a given data set. The mode is the most unstable
measure of central tendency, especially if there are only a few cases. A given data set can have more
than one mode. A data set with two modes is called bimodal.
Example 2.7
Using the data set in Example 2.1, we notice that there are two values (24 and 26) that have the
same highest frequency of 3.
22 23 24 24 24 25 25 26 26 26 30 32 34 34 37
Therefore, the modes of the given distribution are 24 and 26. This is an example of a
bimodal distribution.
Example 2.8
Find the mode of the following data: 12, 34, 12, 71, 48, 93, 71 .
By inspection, the numbers 12 and 71 each occur twice, while every other number occurs once.
Therefore, the modes of the distribution as listed are 12 and 71. If the final 71 were omitted, 12 alone
would occur most often, and the distribution would be unimodal.
Example 2.9
Find the mode of the following data set:
12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5
By inspection, each number in the list occurs twice. There is no number that occurs more
often than the others. Therefore, there is no mode.
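Finding the mode by inspection can be sketched in Python with a frequency count. The function name `modes` is our own, and the rule "no mode when every value is equally frequent" follows the convention of Example 2.9; other references may treat that case differently.

```python
from collections import Counter


def modes(values):
    """Return the list of modes; an empty list means there is no mode."""
    counts = Counter(values)               # frequency of each distinct value
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []                          # every value is equally frequent
    return sorted(v for v, c in counts.items() if c == highest)


example_2_7 = [22, 23, 24, 24, 24, 25, 25, 26, 26, 26, 30, 32, 34, 34, 37]
example_2_9 = [12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5]

print(modes(example_2_7))   # two modes: bimodal
print(modes(example_2_9))   # empty list: no mode
```

The first call returns the two modes 24 and 26, and the second returns an empty list because each value occurs exactly twice.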
For grouped data, the mode is obtained by using the formula below:
$$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c)$$

where: LL = true lower limit or lower boundary of the modal class;
$d_1$ = absolute difference between the frequencies of the modal class
and the lower class interval (interval just below it);
$d_2$ = absolute difference between the frequencies of the modal class
and the higher class interval (interval just above it);
c = the class size
Example 2.10
Using the data in Example 1.1 we find the mode of grouped data. (Scores of 50 students
in Statistics examination)
Class Interval | f | Remark
96 – 103 | 1 |
88 – 95 | 5 | interval just above the modal class
80 – 87 | 14 | modal class
72 – 79 | 13 | interval just below the modal class
64 – 71 | 8 |
56 – 63 | 4 |
48 – 55 | 3 |
40 – 47 | 2 |
Using the formula below, we obtain the mode of the given data set:
$$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)(c) = 79.5 + \left(\frac{14 - 13}{(14 - 13) + (14 - 5)}\right)(8)$$

$$Mo = 79.5 + \left(\frac{1}{1 + 9}\right)(8) = 79.5 + (0.10)(8) = 79.5 + 0.80$$

$$Mo = 80.30$$
Therefore, the mode of the given data set is 80.3.
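The substitution in Example 2.10 can be reproduced with a small Python sketch. The function name `grouped_mode` and its parameter names are hypothetical; the inputs are the true lower limit of the modal class (79.5), the frequency of the modal class (14), the frequencies of the intervals just below (13) and just above (5), and the class size (8).

```python
def grouped_mode(lower_limit, f_modal, f_below, f_above, width):
    """Mode of grouped data: Mo = LL + (d1 / (d1 + d2)) * c."""
    d1 = abs(f_modal - f_below)   # difference with the interval just below
    d2 = abs(f_modal - f_above)   # difference with the interval just above
    return lower_limit + d1 / (d1 + d2) * width


mo = grouped_mode(79.5, 14, 13, 5, 8)
print(round(mo, 2))
```

Here d1 = 1 and d2 = 9, so the mode lies one tenth of the way into the modal class, which rounds to 80.3.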
In summary, the given data set has the following values of the measures of central
tendency: Mean = 74.54 Median = 76.42 Mode = 80.30
What is the characteristic of our illustrative distribution? Why?
Types of Distribution
The characteristic of the distribution can be determined by the shape of its graph (histogram
or frequency polygon). According to Bluman, the symmetric, positively skewed, and negatively
skewed shapes are the most important shapes of graphs that describe a distribution. Skewness refers to
the degree of departure of the distribution from symmetry. When the data values
are evenly distributed on both sides of the mean and the distribution is unimodal, it is
called a symmetric distribution. Further, the mean, median, and mode have equal values and are at
the center of the distribution. In symbol, $\bar{x} = Md = Mo$.
A positively skewed or right-skewed distribution is unimodal, and the majority of the data values
cluster at the lower end of the distribution and to the left of the mean. Moreover, in a
positively skewed distribution, the mode is less than the median and the median is less than
the mean. In symbol, $Mo < Md < \bar{x}$.
A negatively skewed or left-skewed distribution is observed when the majority of the data
values cluster at the upper end of the distribution and to the right of the mean. Furthermore,
in a negatively skewed distribution, the mode is greater than the median and the median is greater than
the mean. In symbol, $\bar{x} < Md < Mo$.
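The three orderings of the mean, median, and mode can be sketched as a simple check in Python. The function name `describe_shape`, the tolerance `tol`, and the sample numbers are illustrative assumptions, not values from the module, and the comparison is only a rule of thumb for unimodal distributions.

```python
def describe_shape(mean, median, mode, tol=1e-9):
    """Classify a unimodal distribution from its three averages."""
    if abs(mean - median) <= tol and abs(median - mode) <= tol:
        return "symmetric"              # mean = Md = Mo
    if mode < median < mean:
        return "positively skewed"      # Mo < Md < mean
    if mean < median < mode:
        return "negatively skewed"      # mean < Md < Mo
    return "undetermined"               # the orderings above do not apply


# hypothetical values chosen only to exercise each branch
print(describe_shape(42, 42, 42))
print(describe_shape(10, 8, 7))
print(describe_shape(7, 8, 10))
```

The three calls return "symmetric", "positively skewed", and "negatively skewed" respectively.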
The following graphs are illustrations of the three types of distribution according to its
skewness (MathBits.com).
[Figures: Symmetric Distribution; Positively Skewed Distribution; Negatively Skewed Distribution]
Summary of Measures of Central Tendency

Measure | Common Name | When to Use | Advantage | Disadvantage
Mean | Arithmetic Average | • There are no extreme values • When the data is at least interval | • Most stable, i.e., less variable from sample to sample • Amenable to further mathematical manipulation, which makes it useful in inferential statistics | • Affected by extreme scores or values
Median | Middle Score/Value | • The distribution is skewed • When the data is at least ordinal or rank | • Easy to compute • Not affected by extreme scores or values | • Less stable from sample to sample
Mode | Typical Score/Value | • When a quick estimate of the typical score or value is to be determined | • Easy to compute | • The most unstable measure, especially when the number of cases is small
Adapted from: Resource Materials in Basic Statistics (Petilos, p.14)
Exercises 2.1
A. Using Exercise 13.1 on page 811 of the book Mathematical Excursions by Aufmann,
answer numbers 4 to 9 and 11.
B. Using the same exercise, find the mean, median and mode of the data set of number
14 on page 812.
C. Problem Solving.
1. The mean age of eight college freshman students is 19.25, and six of the ages are
19, 18, 20, 19, 20, and 18. What are the ages of the remaining two students, who are twin
siblings? What is the mode (age) of the eight students?
2. Find the mean of 20, 30, 40, 50 and 60.
a. Add 5 to each value and find the mean.
b. Subtract 5 from each value and find the mean.
c. Multiply each value by 5 and find the mean.
d. Divide each value by 5 and find the mean.
e. Make a general statement about each situation.
LESSON 3: Measures of Variation
Introduction
In the preceding lesson you learned the three measures of central tendency, namely the mean, median,
and mode. To describe a data set well, however, one needs more than these measures, because
one may wrongly conclude that two or more data sets do not differ when their averages are
equal. In this lesson, we discuss the measures of variation, also called measures of spread or
dispersion. Four measures of variability are covered, for both ungrouped and grouped data: the
range, mean absolute deviation, variance, and standard deviation.
Range (Raw and Grouped Data)
The range is simply the gap or difference between the highest and lowest value/observation of the
data set. In formula: R = HV – LV.
If R = 0, it implies that all values in a data set are equal. Thus, there is no variability of the data.
Example 3.1
Ages of female faculty members from three departments.

Statistical Measure | Data Set A | Data Set B | Data Set C | Implication/Impression
 | 37 | 40 | 39 |
 | 38 | 41 | 40 |
 | 42 | 42 | 42 |
 | 45 | 43 | 43 |
 | 48 | 44 | 46 |
Mean | 42 | 42 | 42 | Equal distribution
Range | 11 | 4 | 7 | Distribution A is more spread. Why?
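The mean and range of the three data sets in Example 3.1 can be verified with a few lines of Python. The dictionary `data_sets` and the helper name `value_range` are our own labels for this sketch.

```python
data_sets = {
    "A": [37, 38, 42, 45, 48],
    "B": [40, 41, 42, 43, 44],
    "C": [39, 40, 42, 43, 46],
}


def value_range(values):
    """Range R = HV - LV (highest value minus lowest value)."""
    return max(values) - min(values)


means = {name: sum(v) / len(v) for name, v in data_sets.items()}
ranges = {name: value_range(v) for name, v in data_sets.items()}

print(means)    # all three means are equal
print(ranges)   # the ranges differ
```

Although all three means equal 42, the ranges are 11, 4, and 7, which is why the mean alone does not describe how the ages are spread.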
According to Petilos in his Resource Materials in Basic Statistics, the range of grouped data is equal to
the difference between the true upper limit of the highest class interval and the true lower limit of the
lowest class interval. If the apparent limits are used, the range is equal to the upper limit of the
highest class interval minus the lower limit of the lowest class interval, plus 1. In
formula (using true limits):

$$R = (UL)_H - (LL)_L$$
Example 3.2
Scores of 50 students in Statistics examination
Class Interval | f
96 – 103 | 1
88 – 95 | 5
80 – 87 | 14
72 – 79 | 13
64 – 71 | 8
56 – 63 | 4
48 – 55 | 3
40 – 47 | 2
N = 50
Using the data set as presented in the distribution above, the range is:
R = 103.5 – 39.5 = 64 (using true limits)
R = 103 – 40 + 1 = 63 + 1 = 64 (using apparent limit)
Mean Absolute Deviation (Raw and Grouped Data)
The mean absolute deviation (MAD) of a data set is defined as the average distance between each
data value and the mean. It helps to describe how “spread out” the values in a data set are
(https://www.khanacademy.org/math). The MAD for raw data is computed using the following
formula:
$$MAD = \frac{\sum \lvert X - \bar{x} \rvert}{N}$$

where: X = score or value
$\bar{x}$ = mean score or mean value
N = total number of cases
Using the data set of Example 3.1 and computing for the MAD of each distribution, we
have:
Example 3.3
Ages of female faculty members from three departments

Statistical Measure | Data Set A | Data Set B | Data Set C | Implication/Impression
 | 37 | 40 | 39 |
 | 38 | 41 | 40 |
 | 42 | 42 | 42 |
 | 45 | 43 | 43 |
 | 48 | 44 | 46 |
Mean | 42 | 42 | 42 | Equal distribution
Range | 11 | 4 | 7 | Distribution A is more spread. Why?
MAD | 3.6 | 1.2 | 2.0 | Distribution B is least variable compared to the other two data sets. Why?
Substituting into the formula, we find the MAD of Data Set A as follows:

$$MAD = \frac{\sum \lvert X - \bar{x} \rvert}{N} = \frac{\lvert 37 - 42 \rvert + \lvert 38 - 42 \rvert + \lvert 42 - 42 \rvert + \lvert 45 - 42 \rvert + \lvert 48 - 42 \rvert}{5} = \frac{5 + 4 + 0 + 3 + 6}{5} = \frac{18}{5}$$

$$MAD = 3.6$$

Following the same procedure, we find the MAD of the remaining two distributions, as reflected in
the table above.
It can be deduced from the table of Example 3.3 that the ages in Data Set A deviate from
the mean by an average of 3.6, while the ages in Data Set B deviate from the mean by
an average of 1.2. This implies that Data Set B is less spread than Data Set A. The smaller the
value of the MAD, the less spread out the distribution is.
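The MAD of each data set can be recomputed with a short Python sketch. The function name `mean_absolute_deviation` is our own choice; the data are the three sets of ages from Example 3.3.

```python
def mean_absolute_deviation(values):
    """MAD = sum of |X - mean| divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


data_sets = {
    "A": [37, 38, 42, 45, 48],
    "B": [40, 41, 42, 43, 44],
    "C": [39, 40, 42, 43, 46],
}

mads = {name: mean_absolute_deviation(v) for name, v in data_sets.items()}
for name, value in mads.items():
    print(name, round(value, 1))
```

The rounded results are 3.6, 1.2, and 2.0 for Data Sets A, B, and C, matching the MAD row of the table.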
For grouped data the MAD is obtained using the formula below:

$$MAD = \frac{\sum f \lvert X - \bar{x} \rvert}{N}$$

where: X = class mark or midpoint of each class
f = frequency of each class
$\bar{x}$ = mean score or mean value
N = total number of cases
Example 3.4
Using the data in Example 1.1 we find the mean absolute deviation of grouped data.
Scores of 50 students in Statistics examination

Class Interval | f | X | $\lvert \bar{x} - X \rvert$ | $f \lvert \bar{x} - X \rvert$
96 – 103 | 1 | 99.5 | 24.96 | 24.96
88 – 95 | 5 | 91.5 | 16.96 | 84.80
80 – 87 | 14 | 83.5 | 8.96 | 125.44
72 – 79 | 13 | 75.5 | 0.96 | 12.48
64 – 71 | 8 | 67.5 | 7.04 | 56.32
56 – 63 | 4 | 59.5 | 15.04 | 60.16
48 – 55 | 3 | 51.5 | 23.04 | 69.12
40 – 47 | 2 | 43.5 | 31.04 | 62.08
N = 50
$\sum f \lvert \bar{x} - X \rvert = 495.36$
Recall: $\bar{x} = 74.54$ (from Example 2.2)
Thus,

$$MAD = \frac{\sum f \lvert X - \bar{x} \rvert}{N} = \frac{495.36}{50} = 9.9072$$

$$MAD = 9.91$$
This implies that the 50 scores deviate from the mean of 74.54 by an average of 9.91 units.
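The grouped-data MAD can be checked with the following Python sketch. The list `exam` of (class mark, frequency) pairs is our own representation of the frequency table in Example 3.4.

```python
# (class mark X, frequency f) for each class interval, lowest class first
exam = [(43.5, 2), (51.5, 3), (59.5, 4), (67.5, 8),
        (75.5, 13), (83.5, 14), (91.5, 5), (99.5, 1)]

n = sum(f for _, f in exam)                        # total number of cases
mean = sum(f * x for x, f in exam) / n             # grouped mean
mad = sum(f * abs(x - mean) for x, f in exam) / n  # grouped MAD

print(n, round(mean, 2), round(mad, 2))
```

The sketch recovers N = 50, a mean of 74.54, and a MAD that rounds to 9.91.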
Variance and Standard Deviation (Raw and Grouped Data)
The last two measures of dispersion or measures of variation to be included in this module are the
variance and standard deviation. Bluman, in his book Elementary Statistics, defines the variance as the
average of the squares of the distance of each score or value from the mean, while the standard
deviation is the square root of the variance. The standard deviation describes how spread out a group
of numbers is from the mean (https://www.investopedia.com).
The population variance and standard deviation are calculated using the following respective
formulas:
Variance ($\sigma^2$, read as "sigma squared"):

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Standard Deviation ($\sigma$ = square root of the variance):

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

where: $\sigma^2$ = population variance
$\sigma$ = population standard deviation
X = the item or observation
$\mu$ = population mean
N = total number of cases
Example 3.5
The following data are ages of 10 teachers in one Elementary School:
27, 34, 30, 29, 28, 30, 34, 35, 38, 29.
Find the variance and standard deviation of this population data.
Solution: To compute for the variance, we present the data as shown in the table below:

Age (X) | $X - \mu$ | $(X - \mu)^2$
27 | -4.4 | 19.36
34 | 2.6 | 6.76
30 | -1.4 | 1.96
29 | -2.4 | 5.76
28 | -3.4 | 11.56
30 | -1.4 | 1.96
34 | 2.6 | 6.76
35 | 3.6 | 12.96
38 | 6.6 | 43.56
29 | -2.4 | 5.76
$\mu = 31.4$
$\sum (X - \mu)^2 = 116.4$
Substituting into the formula, the variance is

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{116.4}{10} = 11.64$$
When the variance is zero (0), it indicates that all of the data values are the same. Thus, there is no
variation. Since the variance is an average of squared deviations, it follows that all non-zero variances
are positive. A small variance indicates that the data points tend to be very close to the mean, and to
each other. A high variance indicates that the data points are very spread out from the mean, and
from one another (MathBits.com).
What does a population variance of 11.64 mean? Since the value of 11.64 is far from zero,
this implies that the observations are more spread from one another and from the mean.
From the above value of the population variance, it follows that the population standard deviation,
which is the square root of the variance, is:

$$\sigma = \sqrt{11.64} = 3.41$$

We recall that the standard deviation measures how concentrated the data are around the mean;
the more concentrated, the smaller the standard deviation
(https://www.dummies.com/education/math/statistics). What is the implication of the above value
in relation to the mean of the given data set?
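The population variance and standard deviation of Example 3.5 can be verified with a short Python sketch. The variable names are our own, and the ages are taken as tabulated in the solution (the ten values in the Age (X) column). The standard library functions `statistics.pvariance` and `statistics.pstdev` are used only as a cross-check.

```python
import math
import statistics

ages = [27, 34, 30, 29, 28, 30, 34, 35, 38, 29]

n = len(ages)
mu = sum(ages) / n                               # population mean
variance = sum((x - mu) ** 2 for x in ages) / n  # average squared deviation
sigma = math.sqrt(variance)                      # population standard deviation

print(round(mu, 1), round(variance, 2), round(sigma, 2))

# cross-check against the standard library's population formulas
print(round(statistics.pvariance(ages), 2), round(statistics.pstdev(ages), 2))
```

Both computations give a mean of 31.4, a variance of 11.64, and a standard deviation that rounds to 3.41.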
Example 3.6
Using the data set of Example 3.3, determine the variance and standard deviation of each subset
of data. Compare your results. The table is reproduced below.
Ages of female faculty members from three departments

Statistical Measure | Data Set A | Data Set B | Data Set C | Implication/Impression
 | 37 | 40 | 39 |
 | 38 | 41 | 40 |
 | 42 | 42 | 42 |
 | 45 | 43 | 43 |
 | 48 | 44 | 46 |
Mean | 42 | 42 | 42 | Equal distribution
Range | 11 | 4 | 7 | Distribution A is more spread. Why?
MAD | 3.6 | 1.2 | 2.0 | Distribution B is least variable compared to the other two data sets. Why?
Variance | | | |
Standard Deviation | | | |
Computing Sample Variance and Standard Deviation
The table below shows the different notations used for the variance and standard deviation.

Notation | Statistical Measure
$\sigma^2$ | Variance of a population
$\sigma$ | Standard deviation of a population
$s^2$ | Variance of a sample
$s$ | Standard deviation of a sample
If the data set is taken from a sample, the variance and standard deviation are obtained using the
following computational formula (Bluman, p.137)
Sample Variance:

$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}$$

Sample Standard Deviation (square root of the variance):

$$s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}}$$

where: $s^2$ = sample variance
X = individual observation
n = sample size
Example 3.7
Find the sample variance and standard deviation for the daily production rate of fiberglass boats of
a certain manufacturer. If the company production manager feels that a standard deviation of more
than three boats a day is unacceptable, should the manager be concerned about the plant
production rate? Why?
17 21 18 27 17 21 20 22 18 23
Solution:

X | $X^2$
17 | 289
21 | 441
18 | 324
27 | 729
17 | 289
21 | 441
20 | 400
22 | 484
18 | 324
23 | 529
$\sum X = 204$ | $\sum X^2 = 4{,}250$
$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} = \frac{(10)(4250) - (204)^2}{10(10 - 1)}$$

$$s^2 = \frac{42500 - 41616}{90} = \frac{884}{90} = 9.82$$

From the above value of the sample variance, it follows that the sample standard deviation, which is
the square root of the variance, is:

$$s = \sqrt{9.82} = 3.13$$
REMARK:
The obtained sample standard deviation of 3.13 is slightly more than three boats a day, which is the
limit the production manager considers acceptable. Thus, the manager has reason to be concerned
about the variability of the plant's daily production rate.
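The computational formula for the sample variance can be sketched in Python as shown below. The variable names are our own; the data are the ten daily production figures of Example 3.7, and `statistics.variance` (which uses the n − 1 divisor) serves only as a cross-check.

```python
import math
import statistics

boats = [17, 21, 18, 27, 17, 21, 20, 22, 18, 23]

n = len(boats)
sum_x = sum(boats)                    # sum of X
sum_x2 = sum(x * x for x in boats)    # sum of X squared

# computational (shortcut) formula for the sample variance
s2 = (n * sum_x2 - sum_x ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(sum_x, sum_x2, round(s2, 2), round(s, 2))
print(round(statistics.variance(boats), 2))   # cross-check
```

The sums are 204 and 4,250, and the variance and standard deviation round to 9.82 and 3.13, as in the solution.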
Computing Sample Variance and Standard Deviation from Grouped Data
For grouped data we find the variance and standard deviation using the following computational
formula (Bluman, p.139)
Variance:

$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)}$$

Standard Deviation:

$$s = \sqrt{\frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)}}$$

where: f = class frequency
X = class mark
n = total number of observations
Example 3.8
Using the data in Example 1.1 we find the variance and standard deviation of grouped data. The
table is reproduced below:
Scores of 50 students in Statistics examination

Class Interval | f | X | fX | $fX^2$
96 – 103 | 1 | 99.5 | 99.5 | 9900.25
88 – 95 | 5 | 91.5 | 457.5 | 41861.25
80 – 87 | 14 | 83.5 | 1169.0 | 97611.50
72 – 79 | 13 | 75.5 | 981.5 | 74103.25
64 – 71 | 8 | 67.5 | 540.0 | 36450.00
56 – 63 | 4 | 59.5 | 238.0 | 14161.00
48 – 55 | 3 | 51.5 | 154.5 | 7956.75
40 – 47 | 2 | 43.5 | 87.0 | 3784.50
N = 50 | | | $\sum fX = 3727$ | $\sum fX^2 = 285{,}828.50$
Substituting into the above computational or shortcut formula, we obtain the sample variance as
follows:

$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)} = \frac{(50)(285828.50) - (3727)^2}{(50)(50 - 1)}$$

$$s^2 = \frac{14291425 - 13890529}{(50)(49)} = \frac{400896}{2450} = 163.63$$
With the above sample variance value of 163.63, it follows that the sample standard deviation (s),
which is the square root of the variance, is 12.79. This implies that the scores of the 50 students deviate
from the mean, on the average, by a distance of 12.79 units.
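The grouped-data computation can be reproduced with the Python sketch below. The list `exam` of (class mark, frequency) pairs is our own representation of the table in Example 3.8.

```python
import math

# (class mark X, frequency f) for each class interval, lowest class first
exam = [(43.5, 2), (51.5, 3), (59.5, 4), (67.5, 8),
        (75.5, 13), (83.5, 14), (91.5, 5), (99.5, 1)]

n = sum(f for _, f in exam)                 # total number of observations
sum_fx = sum(f * x for x, f in exam)        # sum of fX
sum_fx2 = sum(f * x * x for x, f in exam)   # sum of fX squared

# computational (shortcut) formula for grouped data
s2 = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
s = math.sqrt(s2)

print(sum_fx, sum_fx2, round(s2, 2), round(s, 2))
```

The totals agree with the table (3727 and 285,828.50), and the variance and standard deviation round to 163.63 and 12.79.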
There is another method of computing the sample variance and sample standard deviation by using
the Coded Deviation. However, its discussion is not included in this module.
Exercises 3.1
A. Using Exercise 13.2 on page 823 of the book Mathematical Excursions by Aufmann,
answer numbers 4 to 8 and 12.
B. Using the same exercise, answer number 20 on page 824 on the ages of the female and
male Academy Award-winning actors. Answer questions a, b, and c found at the end of the
exercise.
C. Critical Thinking
Using exercise no. 26 on page 825, perform the suggested activity and answer
the question found at the end.