
STATS-NOTES

Module 1: A Review on Descriptive Statistics
INTRODUCTION
Business Statistics has its own language, just like any other area of study (Black, 2010). And because it has its own language, people need to learn how to communicate on common ground. In our day-to-day activities we encounter Statistics without actually knowing it, and we often experience its many practical and relevant uses and applications. The role of statistics in business is to evaluate all naturally collected data and information to determine what they say about the company's operations and strategy (Bianca, 2019). In this module, let us try to navigate the essential points of Descriptive Statistics, which many of you have taken up in your high school days.
LEARNING OBJECTIVES:
After going through this module, the students will be able to:
1. Explain the importance and relevance of studying Statistics to Business.
2. Differentiate and discuss population from sample, qualitative data from quantitative data, parameter from statistic, and discrete from continuous variables.
3. Apply the different measures of central tendency, dispersion, and relative position in describing and analyzing data sets.
4. Summarize and interpret data using various types of data presentation and data organization, including the use of Microsoft Excel.
5. Solve and explain various application problems.
6. Manifest patience, integrity, self-discipline, excellence, and critical thinking in working with the different application problems.

Lesson 1: Meaning of Statistics and its Basic Components

Example: Determine what key terms are being referred to in the following situation.
A study is made to determine the amount of time spent studying Statistics daily. It involves all freshmen college students of XYZ University. There are a total of 1000 freshmen students, 150 of whom are chosen randomly. From the 150 randomly selected freshmen college students, it turns out that the average time spent daily studying Statistics is 1 hour.

Answers:

The population is all freshmen college students of XYZ University.

The sample is the 150 students randomly selected from the population of 1000 freshmen college students.

The parameter is the average (mean) amount of time spent daily in studying
Statistics. (Note: If all the 1000 students are being interviewed, then there are
1000 different answers/responses. Get the average of all those 1000 responses,
and that defines your parameter).

The statistic is the average time spent by the 150 selected freshmen college
students in studying Statistics, and that is 1 hour.

The variable could be the amount of time spent daily by one freshman college student in studying Statistics. (Say, for example, we can assign a variable T = the amount of time spent daily by one freshman student of XYZ University in studying Statistics.)

The data are the raw scores in terms of the amount of time spent daily by the freshmen students (these can be in minutes, hours, or even seconds).
What is Statistics?
~ Methods for processing and analyzing numbers
~ Methods for helping reduce the uncertainty inherent in decision making
~ A branch of science taking and transforming numbers into useful information for decision makers (source: Business Statistics: A First Course, 2009, p. 4)
~ According to Webster's Third New International Dictionary, Statistics is a science that deals with the collection, analysis, interpretation, and presentation of numerical data (Black, 2010).
~ A branch of science that helps transform numerical data into useful information and allows you to understand the risks associated with making a business decision (Berenson et al., 2012).
Example: In each of the following items, determine the correct data type (quantitative or qualitative). If the given data is quantitative, indicate further if it is continuous or discrete.
a. the number of pairs of shoes you own
b. the type of car you drive
c. the distance from your home to the nearest grocery store
d. the number of classes you take per school year
e. the type of calculator you use
f. weights of sumo wrestlers
g. number of correct answers on a quiz
h. IQ scores
Answers:

a. quantitative – discrete
b. qualitative (or categorical)
c. quantitative – continuous
d. quantitative – discrete
e. qualitative (or categorical)
f. quantitative – continuous
g. quantitative – discrete
h. quantitative – continuous
Source: Holmes et al., 2018, pp. 10-11
LEVELS OF MEASUREMENT
Kindly study the four different types of measurement scales represented in the diagram below.
a. Nominal – the lowest level of data measurement. It categorizes, labels, or classifies objects or people's responses so that all of those in a single category are coded numerically.
Examples:
~ employee ID number (e.g., 16754; 18722; 98123)
~ gender (1 - Male; 2 - Female)
~ hospital bed number (e.g., bed #12; bed #23; bed #18)
~ marital status (1 - single; 2 - married; 3 - separated; 4 - widow/widower)
b. Ordinal – used to rank or order objects or characteristics; a level higher than the nominal type of measurement.
Examples:
~ degree of medicine effectiveness: 1 - effective; 2 - slightly/moderately effective; 3 - very effective
~ teaching performance: 1 - excellent; 2 - very satisfactory; 3 - satisfactory; 4 - fair; 5 - poor
c. Interval – next to the highest level of data.
~ Data are always numerical.
~ Distances between numbers have meaning.
~ There is no absolute zero or true zero (zero does not necessarily mean the absence of something, as in the case of a temperature of zero or an IQ of zero).
~ Addition and subtraction of numerical data can be performed.
Examples:
~ temperature in degrees Fahrenheit
~ IQ level
~ age bracket
d. Ratio – the highest among the scales of measurement.
~ Usually based on a standard unit.
~ Has an absolute zero, which means that the zero point is fixed and indicates the absence of that particular property or characteristic.
Examples: height; time; weight; number of cars owned; amount of one's investment; score in an achievement test.
Notice that in the ratio scale it is possible to perform the four fundamental operations of addition, subtraction, multiplication, and division. If, for example, student A scored 100 in a Statistics quiz and student B scored 25, then we can say that student A scored four times as much as student B, or that student A scored 75 points more than student B.
Answers (to the questionnaire example on levels of measurement):
~ Question 1 is a time measurement with an absolute zero and is therefore ratio-level measurement. A person who has been out of the hospital for two weeks has been out twice as long as someone who has been out of the hospital for one week.
~ Question 2 yields nominal data because the patient is asked only to categorize the type of unit he or she was in. This question does not require a hierarchy or ranking of the type of unit.
~ Questions 3, 4, and 5 are likely to result in ordinal-level data. Suppose a number is assigned to the descriptors in each of these three questions. For question 3, "very important" might be assigned a 4, "somewhat important" a 3, "not very important" a 2, and "not at all important" a 1. Certainly, the higher the number, the more important the hospital's location, so these responses can be ranked by selection. However, the increases in importance from 1 to 2 to 3 to 4 are not necessarily equal. This same logic applies to the numeric values assigned in questions 4 and 5.
Source: Black, 2010, pp. 9-10
LESSON 2 – THE MEASURES OF CENTRAL TENDENCY
Statistics, as a field of science, involves the processes of collection, organization or presentation, analysis, and interpretation of data. Generally, statistics is divided into two main divisions. The first division refers to the collection, organization or presentation, and analysis of data and is called descriptive statistics; its main goal is to provide a description of the data set. The second division is concerned with interpreting and drawing conclusions or generalizations from the analysis of sample data and is called inferential statistics. In providing descriptions or summaries for a sample data set, the measures of central tendency, dispersion, and relative position are applied.
THE MEASURES OF CENTRAL TENDENCY
A Measure of Central Tendency is a statistic obtained from a set of observations or scores that represents the data set. It is often useful to find the single numerical value located at the center of the distribution of the data set. Central tendency may also be described as the tendency of the observations or scores to cluster about a single point.
Example: Jonecis is planning to build a business that sells different brands of computer printers. He conducted a survey on the prices of different brands of printers in the market and made projections on the possible selling prices for the items once he engages in such a business. The data below present the projected selling prices for various brands of printers.
Brand            A            B            C            D            E
Price in Pesos   Php 20,990   Php 14,990   Php 16,484   Php 15,799   Php 21,984
The central tendency for the selling prices of the different brands of printers is the average selling price of the five items, that is, a "center" or "central" value about which the different amounts tend to cluster. The Measures of Central Tendency comprise three commonly used measures that locate the center or central value of a given data set.
Example: The mean of the preceding data set on the projected selling prices of the various brands of computer printers can be computed as follows:
Mean = (20,990 + 14,990 + 16,484 + 15,799 + 21,984) / 5 = 90,247 / 5 = Php 18,049.40
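A quick way to verify this arithmetic is with a short Python snippet (illustrative only; a hand calculation or a calculator works equally well):

```python
# Projected selling prices (in Php) of printer brands A to E
prices = [20990, 14990, 16484, 15799, 21984]

# Arithmetic mean = sum of the values divided by the number of values
mean_price = sum(prices) / len(prices)
print(mean_price)  # 18049.4
```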
Example: JM is a college instructor handling classes in Economics. He recently conducted a study involving his classes in the subject. He is interested in determining the average academic performance of his entire class, which is composed of two sections. He randomly selects the same number of students from his two classes to compose the sample and records their respective prelim grades, shown as follows.
Student        1     2     3     4     5     6     7     8     9     10
Prelim Grade   2.2   1.9   1.7   2.0   1.8   1.5   1.4   2.1   1.6   2.3

The mean prelim grade of the ten sampled students is (2.2 + 1.9 + 1.7 + 2.0 + 1.8 + 1.5 + 1.4 + 2.1 + 1.6 + 2.3) / 10 = 18.5 / 10 = 1.85.
THE MODE
The mode is the value in a data set that appears most frequently. Unlike the
arithmetic mean, the mode is not easily affected by the occurrence of any extreme values.
THE MEDIAN
The median is the middlemost value in an ordered array of data. The median is unaffected by extreme values in a set of data; hence, whenever an extreme value is present, it is more appropriate to use the median rather than the mean to describe the data set.
To find the median of a given data set, organize the data into an array, that is, arrange the scores in increasing numerical value, and apply the following formulas:
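The formulas referred to here are the usual textbook rules for an ordered array of $n$ values: the median is the middle value when $n$ is odd and the average of the two middle values when $n$ is even, that is,

$\tilde{x} = x_{(n+1)/2}$ if $n$ is odd, and $\tilde{x} = \dfrac{x_{n/2} + x_{(n/2)+1}}{2}$ if $n$ is even.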
THE MEASURES OF DISPERSION
The measures of dispersion determine the amount of variation or spread in the data. These measures are helpful in detecting inconsistencies or insufficiencies of values in the data set. Consider, for instance, the coffee dispenser machines that supply coffee to customers, shown in the table below.
Example: The mode of the ordered sequence of the number of customers of the Beauty Salon over a 30-day period is 3 since, in the sequence
1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17,
the score 3 has the highest frequency of occurrence.
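A short Python check of this example (illustrative sketch; `statistics.mode` simply returns the most frequent value):

```python
from statistics import mode

customers = [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7,
             8, 8, 8, 8, 10, 10, 11, 12, 12, 13, 17]

print(mode(customers))  # 3, the value with the highest frequency (it occurs 5 times)
```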
THE WEIGHTED MEAN
The weighted mean is a variation of the arithmetic mean wherein the individual scores are given assigned weights that normally emphasize that some scores count more heavily than others. The weighted mean value can be obtained by using the following formula:
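In the usual notation, the weighted mean of scores $x_1, x_2, \ldots, x_n$ with corresponding weights $w_1, w_2, \ldots, w_n$ is

$\bar{x}_w = \dfrac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

A minimal Python sketch of the computation, using hypothetical prelim grades and 3-unit weights (placeholders, not the actual values from JM's grade table):

```python
# Hypothetical prelim grades and credit units (illustration only)
grades = [1.5, 2.0, 1.75, 2.25, 1.25, 2.0, 1.5]   # one grade per course
units  = [3, 3, 3, 3, 3, 3, 3]                    # each course is worth 3 units (21 units in all)

# Weighted mean: sum of (weight x score) divided by the sum of the weights
gpa = sum(w * x for w, x in zip(units, grades)) / sum(units)
print(round(gpa, 2))  # 1.75
```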
The mean amount of coffee (in oz) dispensed per cup by machine dispenser 1 is 8 oz. However, the amount of coffee dispensed per cup is very inconsistent: some cups seem to overflow while others receive much less coffee. This suggests that the first machine dispenser needs some calibration. On the other hand, machine dispenser 2 is very consistent in the amount of coffee dispensed, with only very small deviations among the values. This indicates that the second machine dispenser needs no further calibration, for it works well in serving the customers.
The situations discussed above suggest that the mean value alone is not enough to describe a data set, for it lacks information about the deviation or spread of the data values. Thus, there is a need to introduce further descriptive measures that characterize the spread or deviation of the data values, namely the range and the standard deviation.
Example: The table below shows JM's grades during the prelim term of the first semester.
Solution: Each subject is worth 3 units. Hence, the sum of all the credit units is 21 for all courses. Thus, JM's grade point average (GPA) during the prelim term of the first semester is given by the weighted mean of his grades, with the credit units as the weights.
THE RANGE
The range is the difference between the largest and smallest observations in a set of data, given by the formula:
Range = Maximum value - Minimum value
Example: Find the range of the amount of coffee in oz dispensed in 5 cups from machine
dispenser 2.
Determine the median measure of the data.
Solution: The ordered sequence of the number of customers over the 30-day period is
1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17.
Since there are 30 values, the median is the average of the 15th and 16th ordered scores: (5 + 7) / 2 = 6.
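A quick Python check (illustrative only):

```python
from statistics import median

customers = [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7,
             8, 8, 8, 8, 10, 10, 11, 12, 12, 13, 17]

# With 30 ordered values, the median is the average of the 15th and 16th scores
print(median(customers))  # (5 + 7) / 2 = 6
```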
Solution: The largest amount of coffee dispensed in a cup is 10.50 oz, while the smallest amount is 5.99 oz. The range between the two extreme values is 10.50 - 5.99 = 4.51 oz.
Although the range measure is easy to compute, it can be very sensitive to extreme values and provides no information about the spread of the values between these two scores.
THE STANDARD DEVIATION
The standard deviation is the measure of variation that takes into account how all the values in the data set are distributed. This measure evaluates how the values fluctuate about the mean and is less sensitive to extreme values than the range.
Which of the two companies has received a more or less consistent job satisfaction rating by employees?
Solution: The mean job satisfaction rating for each of the companies is 5.83.
Example: Suppose the median annual travel expenses of personnel in a certain academic institution is Php 100,000. If the 85th percentile for the annual travel expenses of personnel is Php 110,000, find the percentage of personnel whose annual travel expenses were:
i. greater than Php 100,000
ii. less than Php 110,000
iii. between Php 100,000 and Php 110,000

Although most statistical computations normally involve a sample instead of a population, the two sets of formulas for the sample and population measures of dispersion are defined and provided as follows.
Standard Deviation for Samples and Populations
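In the standard notation used in most business statistics texts, the two formulas are

$s = \sqrt{\dfrac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n - 1}}$ for a sample of size $n$ with sample mean $\bar{x}$, and $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$ for a population of size $N$ with population mean $\mu$.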
Example 1: Consider the examination scores of the sample of students randomly selected from the Stat class: 15, 17, 19, 23, 25. Find the standard deviation of the sample.
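Here the sample mean is (15 + 17 + 19 + 23 + 25) / 5 = 19.8, the sum of squared deviations about the mean is 68.8, and s = sqrt(68.8 / 4) ≈ 4.15. A short Python check (illustrative only):

```python
from statistics import stdev, pstdev

scores = [15, 17, 19, 23, 25]

print(round(stdev(scores), 2))   # sample standard deviation (divides by n - 1): 4.15
print(round(pstdev(scores), 2))  # population standard deviation (divides by N): 3.71
```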
Lesson 3: The Measures of Relative Position
Suppose a student is taking examinations in Mathematics in the Modern World (MMW) and Art Appreciation (AA). The student got a score of 45 in MMW and a score of 50 in Art Appreciation. The mean score of all students taking the MMW exam in the class is 40, with a standard deviation of 5. On the other hand, the mean score of all students taking Art Appreciation is 45, with a standard deviation of 8. In which of the two subjects did the student perform better?
Comparing the performance of the student in the two subjects cannot be done right away, since the two sets of examination scores have different amounts of variability. Hence, for a comparison to be possible, the student's scores in MMW and Art Appreciation must be transformed.
Solution:
i. Since the median is the same as the 50th percentile, 50% of the personnel incurred travel expenses greater than Php 100,000.
ii. Since Php 110,000 is the 85th percentile, 85% of the personnel in the academic institution incurred travel expenses less than Php 110,000.
iii. Using parts (i) and (ii), 85% - 50% = 35% of the personnel incurred travel expenses between Php 100,000 and Php 110,000.
The Percentile of a Data Value
Example: In a licensure examination for accountants given to 1,500 students, Yurie scored 550, which is higher than the scores of 900 of the students who took the examination. What is the percentile of Yurie's examination score?
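Using the usual percentile-of-a-data-value formula (the one consistent with the answer given later in this lesson), Percentile = (number of values below x / total number of values) x 100, Yurie's percentile is (900 / 1,500) x 100 = 60.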
THE Z-SCORE
The z-score of a given data value x is the number of standard deviations the value lies above or below the mean. The transformation of an x-score to a z-score is defined by the equation given below. The z-score, or standard score, has a mean of 0 and a standard deviation of 1.
Example: Comparing the performance of the student in MMW and Art Appreciation using the scores provided above, we have
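In standard form, the z-score transformation is $z = \dfrac{x - \mu}{\sigma}$ (or $z = \dfrac{x - \bar{x}}{s}$ when sample statistics are used). Applied to the scores above:

$z_{MMW} = \dfrac{45 - 40}{5} = 1.00 \qquad z_{AA} = \dfrac{50 - 45}{8} = 0.625$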
Example 2: A study is conducted to provide evidence on which of two companies has received a more or less consistent employee job satisfaction rating, based on survey results that used a scale of 1 (strongly dissatisfied) to 7 (strongly satisfied). The ratings are given below.
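The comparison itself is a standard-deviation comparison: with equal means, the company whose ratings have the smaller standard deviation is the more consistent one. A hypothetical Python sketch (the two rating lists below are placeholders with the same mean, not the survey's actual values):

```python
from statistics import mean, pstdev

# Hypothetical employee job-satisfaction ratings on a 1-7 scale (placeholders only)
company_a = [5, 6, 7, 6, 5, 7, 6, 5, 6, 5]
company_b = [3, 7, 6, 7, 4, 7, 7, 5, 7, 5]

for name, ratings in [("Company A", company_a), ("Company B", company_b)]:
    print(name, "mean =", round(mean(ratings), 2), "std dev =", round(pstdev(ratings), 2))

# The company with the smaller standard deviation has the more consistent ratings.
```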
The results indicate that the student scored 1.00 standard deviation above the mean in his MMW course and 0.625 standard deviation above the mean in his Art Appreciation course. The resulting z-scores suggest that the student performed better in MMW than in Art Appreciation.
The Percentiles
Therefore, Yurie's score in the licensure examination for accountants is at the 60th percentile.
Example: Consider the data below, which represent the number of dining rooms occupied in a beach resort over a 15-day period. Find the first, second, and third quartiles of the data.
Solution: Listing the data values in an array, we have 55, 60, 65, 70, 72, 75, 78, 80, 84, 88, 89, 91, 95, 99, 100.
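Using the usual textbook position rule (the k-th quartile is located at position k(n + 1)/4 in the ordered array), with n = 15 the first, second, and third quartiles are the 4th, 8th, and 12th ordered values: Q1 = 70, Q2 = 80, and Q3 = 91.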
The Use of Stem-and-Leaf Diagrams in Organizing Data
The stem-and-leaf display is a tabular way of organizing data. It is formed by splitting each data value into two parts: the "tens" part forms the stem, while the units digit forms the leaf. The stem can also be extended to hundreds, thousands, and so on as the leading unit, depending on the values involved in a given data set.
Example: The following table shows the ages of customers who own a motorcycle. Construct a stem-and-leaf display for the data.
LESSON 1: CONSTRUCTING THE FREQUENCY DISTRIBUTION
A grouped frequency distribution is useful whenever the range of the data set is quite large. Hence, the data must be grouped into classes, whether the data are categorical or interval/ratio. The following shows the procedure for constructing a frequency distribution.
A CATEGORICAL FREQUENCY DISTRIBUTION
The categorical frequency distribution is utilized to organize nominal or ordinal type of
data. For instance, we can employ categorical frequency distribution for variables such as
gender, marital status, socio-economic status, political affiliation and so on.
Example: Twenty business statistics students were given an academic performance
evaluation by their instructor. The data set is shown as follows:
Solution: A stem-and-leaf diagram can be constructed by writing all the stems in a column in ascending order and indicating the corresponding leaves to the right of the vertical line, as illustrated in the figure below.
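A small Python sketch of this construction (the ages below are placeholders, not the module's actual table of customer ages):

```python
from collections import defaultdict

# Hypothetical customer ages (placeholders only)
ages = [21, 23, 27, 32, 34, 34, 38, 41, 45, 47, 52, 53, 58, 61, 66]

# Group each units digit (leaf) under its tens digit (stem)
stems = defaultdict(list)
for age in sorted(ages):
    stems[age // 10].append(age % 10)

# Print the stems in ascending order with their leaves to the right of the bar
for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```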
Learning Check
Construct the frequency distribution for the data on Job Satisfaction by rank and file
employees of a certain company.
Module 2: Organization and Presentation of Data
Thus, the boundary score for 75% of the observations is 91.
SUMMARIZING DATA USING A BOX AND WHISKER PLOT
The Box-and-Whisker plot is a graphical way of providing a visual summary of a set of data that involves the median, the quartiles, and the extreme values, which together characterize the distribution. The following figure illustrates the components of a Box-and-Whisker plot.
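A minimal matplotlib sketch (an illustration assuming the matplotlib library, using the dining-room data from the quartile example above) that produces such a plot:

```python
import matplotlib.pyplot as plt

# Number of dining rooms occupied over the 15-day period (from the quartile example)
rooms = [55, 60, 65, 70, 72, 75, 78, 80, 84, 88, 89, 91, 95, 99, 100]

# The box spans Q1 to Q3, the line inside the box marks the median,
# and the whiskers extend toward the extreme values
plt.boxplot(rooms, vert=False)
plt.xlabel("Dining rooms occupied")
plt.title("Box-and-Whisker Plot")
plt.show()
```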
INTRODUCTION
In conducting business research or an assessment, one must gather data for the variable(s) under investigation. In order to describe situations, draw conclusions, or make inferences about the occurrence of events, one must organize the gathered data in a meaningful manner. Once the data are organized, the next step is to present the data so that those who will benefit directly or indirectly from reading the study or assessment can understand it. The most commonly used procedure for presenting data is through graphs and charts. Each of these graphs and charts has its specific function depending on the nature of the variables being investigated.
Module 2 discusses how to organize data by constructing frequency distributions and how to present data by constructing graphs and charts.
B. THE FREQUENCY DISTRIBUTION FOR NUMERICAL DATA
Data in their original form and structure are called raw data.
Example: The following are raw data depicting the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university.
After going through this module, the students will be able to:
1. Discuss and explain the methods of organizing and presenting data.
2. Organize data into a frequency distribution using Excel Data Analysis.
3. Represent a frequency distribution graphically using histograms, frequency polygons, and cumulative frequency polygons (ogives).
4. Plot data using bar graphs (multiple bar graphs), pie charts, time series graphs, and scatter plots.
5. Analyze and interpret the graphs/charts in the context of the variable(s) under investigation.
6. Show volunteerism and innovativeness in organizing and presenting data concerning real-life business application problems.
When these scores are arranged in either ascending or descending order of magnitude, the arrangement is called an array. It is usually helpful to put the raw data into an array because it makes it easy to identify the extreme values and the values around which the scores cluster most. When the data are placed into a system in which they are organized, they take on the nature of grouped data.
Definition: The procedure of organizing data into groups yields what is called a Frequency Distribution Table (FDT).
Example: The following presents a frequency distribution table of the exam scores of fifteen Business students.

Scores     Frequency
20 – 29        5
30 – 39        4
40 – 49        3
50 – 59        2
60 – 69        1
Total         15
Components of a Frequency Distribution Table
The following are the components of a Frequency Distribution Table
I. Class Interval
These are the numbers defining the class. A class interval consists of the end numbers, called the class limits, namely the lower limit and the upper limit.
II. Class Frequency (f)
This component shows the number of observations falling in the class.
III. Class Boundaries
These are the so-called "true class limits." They are classified as:
- Lower Class Boundary (LCB) – the middle value between the lower class limit of the class and the upper class limit of the preceding class.
- Upper Class Boundary (UCB) – the middle value between the upper class limit of the class and the lower class limit of the next class.
IV. Class Size
The difference between two consecutive upper limits or two consecutive lower limits.
V. Class Mark (CM)
This component is the midpoint or the middle value of a class interval.
VI. Cumulative frequency (CF)
This component shows the accumulated frequencies of successive classes.
There are two types of Cumulative Frequencies.
A. Greater than CF (> CF) – shows the number of observations greater than the lower class
boundary (LCB).
B. Less than CF (< CF) - shows the number of observations less than the upper class
boundary (UCB).
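For instance, applying these components to the frequency distribution table of the fifteen exam scores shown earlier: the class size is 10; the class 20 – 29 has class boundaries 19.5 – 29.5 and class mark 24.5; the less-than cumulative frequencies are 5, 9, 12, 14, 15; and the greater-than cumulative frequencies are 15, 10, 6, 3, 1.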
Example: Construct a Frequency Distribution Table for the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university.
Solution:
1. Using Sturges' approximation formula, K = 1 + 3.322 log n, where K is the approximate number of classes and n is the number of cases, the approximate number of class intervals for the data set is
   K = 1 + 3.322 log(60) = 1 + 3.322(1.77815125) = 6.91, or about 7.
2. The range R is given by
   R = Maximum value - Minimum value = 59 - 21 = 38.
3. The approximate class size C is
   C = R / K = 38 / 7 = 5.43, rounded up to 6.
4. The lowest class interval (or the first class) is 21 – 26.
5. Adding the class size C = 6 to the class limits, beginning with the lowest class interval, we obtain the other class intervals:
   21 – 26, 27 – 32, 33 – 38, 39 – 44, 45 – 50, 51 – 56, 57 – 62
6. Tally the scores/observations falling in each class.
Tally of Scores and the Frequency Distribution Table
In constructing a Frequency Distribution Table, attention must be given to selecting the number of class intervals or groupings. There are no exact rules for determining this number of class intervals; however, one suggestion in the literature is to use Sturges' rule, such as the one specified in Step 1.
1. Determine the number of classes. For a first approximation, it is suggested to use Sturges' approximation formula, K = 1 + 3.322 log n, where K is the approximate number of classes and n is the number of cases.
2. Determine the range R, where R = Maximum value - Minimum value.
3. Determine the approximate class size C using the formula C = R / K. It is usually convenient to round C off to a whole number.
4. Determine the lowest class interval (or the first class). This class should include the minimum value in the data set. For uniformity, let us agree that for our purposes the lower limit of the first class interval should start at the minimum value.
5. Determine all class limits by adding the class size C to the limits of the previous class.
6. Tally the scores/observations falling in each class.
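These six steps can be sketched in a few lines of Python (illustrative only; the data list below is a placeholder rather than the module's actual 60 observations):

```python
import math

# Placeholder raw data (the module's 60 IQ-test counts appear in its own table)
data = [21, 25, 28, 33, 37, 41, 44, 48, 52, 55, 59, 30, 35, 46, 51, 57, 23, 39, 50, 56]

# Step 1: approximate number of classes (Sturges' rule)
k = round(1 + 3.322 * math.log10(len(data)))

# Step 2: range
r = max(data) - min(data)

# Step 3: class size, rounded up to a convenient whole number
c = math.ceil(r / k)

# Steps 4-6: build class limits starting at the minimum value and tally the observations
lower = min(data)
while lower <= max(data):
    upper = lower + c - 1
    freq = sum(lower <= x <= upper for x in data)
    print(f"{lower} - {upper}: {freq}")
    lower = upper + 1
```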
Thus, the complete Frequency Distribution Table is as follows:

Note: The Data Analysis tool in Excel can be used to generate the frequency distribution once the class intervals are already set. Just use the Histogram function under the Data Analysis window as shown below. For the input range, enter the range of cells occupied by the data defined by the variable number of students taking the IQ test; for the bin range, enter the range of upper class limits of the class intervals. Then check Labels, Cumulative Percentage, and Chart Output, and click OK.

Excel Output
Bin Range    Frequency    Cumulative %
26                6          10.00%
32               11          28.33%
38                4          35.00%
44                6          45.00%
50                9          60.00%
56               17          88.33%
62                7         100.00%
More              0         100.00%
Learning Check 1
The following data show the ages of customers who own a motorcycle. Organize these data into a Frequency Distribution Table.

Learning Check 2
The following data show the ages of customers who own a motorcycle. Using Excel, construct the following statistical graphs for the data set:
A. Histogram
B. Frequency Polygon
C. Cumulative Frequency Polygon (Ogive)
Provide appropriate labels for each of the statistical graphs. Then provide a brief analysis and interpretation of each of the resulting graphs.
LESSON 2: GRAPHS ASSOCIATED WITH THE FREQUENCY DISTRIBUTION
Whenever the data set of any business variable contains a quite large number of numerical values, making descriptions or drawing conclusions from an array or a stem-and-leaf plot of the values is relatively difficult. In such cases we need graphs or charts. There are various graphs or charts that can be used to visualize, characterize, and show the numerical data values defined by the variables under investigation. These include the histogram, the frequency polygon, and the cumulative frequency polygon (ogive).
A. HISTOGRAM
A histogram is a statistical graph in which the class intervals are plotted on the horizontal axis and the class frequencies on the vertical axis. The heights of the bars represent the class frequencies, and the bars are drawn adjacent to each other. A histogram shows the frequency of each class interval but does not necessarily reveal the individual values of the actual observations.
B. FREQUENCY POLYGON
A frequency polygon is a statistical graph that displays the data set values using points connected by line segments. The frequencies of the scores or data values are represented by the heights of the points plotted at the midpoints (class marks) of the class intervals.
C. CUMULATIVE FREQUENCY POLYGON OR OGIVE
A cumulative frequency polygon, or ogive, is a statistical graph that shows the cumulative frequencies for the class intervals in a frequency distribution. The cumulative frequencies of the distribution are marked on the vertical axis, while the upper class boundaries (UCB) are marked on the horizontal axis.
Example: Consider below the complete frequency distribution for the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university.
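An illustrative matplotlib sketch (an assumption of these notes rather than the module's Excel output) that draws the three graphs from the class marks, upper class boundaries, and frequencies of the distribution constructed in Lesson 1:

```python
import matplotlib.pyplot as plt

# Frequencies, class marks, and upper class boundaries of the classes 21-26, ..., 57-62
freq = [6, 11, 4, 6, 9, 17, 7]
class_marks = [23.5, 29.5, 35.5, 41.5, 47.5, 53.5, 59.5]
ucb = [26.5, 32.5, 38.5, 44.5, 50.5, 56.5, 62.5]
cum_freq = [sum(freq[:i + 1]) for i in range(len(freq))]  # <CF: 6, 17, 21, 27, 36, 53, 60

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# A. Histogram: adjacent bars, one per class interval
axes[0].bar(class_marks, freq, width=6, edgecolor="black")
axes[0].set(title="Histogram", xlabel="Number of students", ylabel="Frequency")

# B. Frequency polygon: points at the class marks joined by line segments
axes[1].plot(class_marks, freq, marker="o")
axes[1].set(title="Frequency Polygon", xlabel="Class mark", ylabel="Frequency")

# C. Ogive: cumulative frequency plotted against the upper class boundaries
axes[2].plot(ucb, cum_freq, marker="o")
axes[2].set(title="Cumulative Frequency Polygon (Ogive)",
            xlabel="Upper class boundary", ylabel="Cumulative frequency")

plt.tight_layout()
plt.show()
```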