
Statistics for Management

Module-1: Statistics
Learning Objective:
●● To get introduced to the types of statistics and their application in different phases
●● To develop an understanding of data representation techniques
●● To understand the MS Excel applications of numerical measures of central tendency and dispersion
Learning Outcome:
At the end of the course, the learners will be able to –
●● Arrange and describe statistical information using numerical and graphical procedures
●● Use the tool MS Excel for answering business problems based on numerical measures

Croxton and Cowden define statistics as "the science of collection, presentation, analysis, and interpretation of numerical data from logical analysis".
1.1.1 Statistical Thinking and Analysis
Data is a collection of any number of related observations. We can collect the number of telephones installed in a given day by several workers, or the number of telephones installed per day over a period of several days by one worker, and call the results our data. A collection of data is called a data set, and a single observation is called a data point.
Statistics is not restricted to only information about the State, but it also extends
to almost every realm of the business. Statistics is about scientific methods to gather,
organize, summarize and analyze data. More important still is to draw valid conclusions
and make effective decisions based on such analysis. To a large degree, company
performance depends on the preciseness and accuracy of the forecast. Statistics
is an indispensable instrument for manufacturing control and market research.
Statistical tools are extensively used in business for time and motion study, consumer
behaviour study, investment decisions, credit ratings, performance measurements and
compensations, inventory management, accounting, quality control, distribution channel
design, etc.
For managers, therefore, understanding statistical concepts and knowing how to use statistical tools is essential. With the increase in company size and in market uncertainty due to increased competition, the need for statistical knowledge and statistical analysis of various business circumstances has grown greatly. Earlier, when businesses were small and without much complexity, a single person, usually the owner or manager of the firm, used to take all decisions regarding the business. Example: a manager used to decide from where the necessary raw materials and other factors of production were to be acquired, how much output will
be produced, where it will be sold, etc. This type of decision making was usually based
on experience and expectations of this single individual and as such had no scientific
basis.
1.1.2 Limitations and Applications of Statistics
Statistical techniques, because of their flexibility, have become popular and are used in numerous fields. But statistics is not a cure-all technique and has a few limitations. It cannot be applied to all kinds of situations and cannot be made to answer all queries. The major limitations are:
1. Statistics deals only with those problems which can be expressed in quantitative terms and are amenable to mathematical and numerical analysis. It is not suitable for qualitative data such as customer loyalty, employee integrity, emotional bonding, motivation, etc.
2. Statistics deals only with the collection of data, and no importance is attached to an individual item.
3. Statistical results are only an approximation and not mathematically exact. There is always a possibility of random error.
4. Statistics, if used wrongly, can lead to misleading conclusions, and should therefore be used only after a complete understanding of the process and the conceptual base.
5. Statistical laws are not exact laws and are liable to be misused.
6. The greatest limitation is that statistical data can be used properly only by a professional. Only a person having thorough knowledge of statistical methods and proper training can draw valid conclusions.
7. If statistical data are not uniform and homogeneous, the study of the problem is not possible. Homogeneity of data is essential for a proper study.
8. Statistical methods are not the only methods for studying a problem. There are other methods as well, and a problem can be studied in various ways.
1.1.3 Types of Statistical Methods: Descriptive & Inferential
The study of statistics can be categorized into two main branches: descriptive statistics and inferential statistics.
Descriptive statistics is used to summarize and graph the data for a chosen group. This method helps to understand a particular collection of observations. Descriptive statistics describes the specific data set at hand; there is no uncertainty in the summary numbers, because you simply describe the individuals or items that were actually measured.
Descriptive statistics give information that describes the data in some manner. For
example, suppose a pet shop sells cats, dogs, birds and fish. If 100 pets are sold, and
35 out of the 100 were dogs, then one description of the data on the pets sold would be
that 35% were dogs.
Inferential statistics are techniques that allow us to use samples to generalize about the populations from which the samples were drawn. Hence, it is crucial that the sample represents the population accurately; the method of achieving this is called sampling. Since inferential statistics aim at drawing conclusions from a sample and generalizing them to a population, we need to be sure that our sample accurately represents the population. This requirement affects our process. At a broad level, we must do the following:
●● Define the population we are studying.
●● Draw a representative sample from that population.
●● Use analyses that incorporate the sampling error.
1.1.4 Importance and Scope of Statistics
●● Condensation: Statistics compresses a mass of figures into small, meaningful information, for example average sales, the BSE index, the growth rate, etc. It is impossible to get a precise idea about the profitability of a business from a mere record of income and expenditure transactions. Information such as Return On Investment (ROI), Earnings Per Share (EPS) and profit margins, however, can be easily remembered, understood and thus used in decision-making.
●● Forecast: Statistics helps in forecasting by analyzing trends, which is essential for planning and decision-making. Predictions based on gut feeling or hunch can be harmful for the business. For example, to decide the refining capacity for a petrochemical plant, it is required to predict the demand for the petrochemical product mix, the supply of crude oil, the cost of crude, substitution products, etc., for the next 10 to 20 years, before committing an investment.
●● Testing of hypotheses: Hypotheses are statements about population parameters based on past knowledge or information. They must be checked for validity in the light of current information. Inductive inference about the population based on sample estimates involves an element of risk; however, sampling keeps decision-making costs low. Statistics provides a quantitative basis for testing our beliefs about the population.
●● Relationship between facts: Statistical methods are used to investigate the cause-and-effect relationship between two or more facts. The relationships between demand and supply, or money supply and the price level, can be best understood with the help of statistical methods.
●● Expectation: Statistics provides the basic building block for framing suitable policies. For example, how much raw material should be imported, how much capacity should be installed, or how much manpower should be recruited depends upon the expected value of the outcome of our present decisions.
1.1.5 Population and Sample
Sample
A sample consists of one or more observations drawn from the population. The sample is the group of people who actually took part in your research: the people who were questioned (for example, in a qualitative study) or who actually completed the survey (for example, in a quantitative study). Individuals who could have been research participants but did not actually take part are not considered part of the sample.
A sample data set contains a part, or a subset, of a population. The size of a
sample is always less than the size of the population from which it is taken. [Utilizes the
count n - 1 in formulas.]
Population
A population includes all of the elements from a set of data. Population is the
broader group of people that you expect to generalize your study results to. Your
sample is just going to be a part of the population. The size of your sample will depend
on your exact population.
A population data set contains all members of a specified group (the entire list of
possible data values). [Utilizes the count n in formulas.]
Example: The population may be "all people living in India".
For example, Mr. Tom wants to do a statistical analysis of the students' final examination scores in his math class for the past year. Should he consider his data to be a population data set or a sample data set?
Mr. Tom is only working with the scores from his own class. There is no reason for him to generalize his results to all management students in the school. He has all of the data pertaining to his investigation, so it is a population data set.
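The bracketed notes above point at the practical difference: a population data set divides by the count n, while a sample data set divides by n − 1. A minimal Python sketch of that distinction, using the standard library statistics module and invented scores, is shown below.

```python
import statistics

scores = [65, 72, 78, 81, 90]   # hypothetical examination scores

# Treated as a population data set: formulas use the count n
print(statistics.pvariance(scores), statistics.pstdev(scores))

# Treated as a sample data set: formulas use the count n - 1
print(statistics.variance(scores), statistics.stdev(scores))
```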
1.2.1 Importance of Graphical Representation of Data
Data obtained from the field needs to be processed and analysed. The processing consists mainly of recording, labelling, classifying and tabulating the collected data so that it is consistent with the report. The data may be viewed either in tabular form or via charts. Effective use of the data collected depends primarily on how it is organized, presented and summarized.

One of the most convincing and appealing ways in which statistical results may be represented is through graphs and diagrams. Diagrams and graphs are used extensively for the following reasons:
●● Diagrams and graphs are attractive to the eye.
●● They have a stronger memorizing effect.
●● They facilitate easy comparison of data from one period to another.
●● Diagrams and graphs give a bird's eye view of the entire data and therefore convey meaning very quickly.
1.2.2 Bar Chart
In a bar diagram, only the length of the bar is taken into account, not the width. In other words, a bar is a thick line whose width is shown merely for visibility; since only the length carries information, it is called a one-dimensional diagram.
Simple Bar Diagram
It represents only one variable. Since the bars are of the same width and vary only in length (height), comparative study becomes very easy. Simple bar diagrams are very popular in practice. A bar chart can be either vertical or horizontal; for example, sales, production or population figures for various years may be shown by simple bar charts.

Illustration - 1

The following table gives the birth rate per thousand of different countries over a certain period of time.
Country | India | Germany | U.K. | New Zealand | Sweden | China
Birth rate | 33 | 16 | 20 | 30 | 15 | 40
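As a hedged illustration (not part of the original text), the simple bar diagram for this birth-rate table can be drawn in Python, assuming matplotlib is available:

```python
import matplotlib.pyplot as plt

countries = ["India", "Germany", "U.K.", "New Zealand", "Sweden", "China"]
birth_rate = [33, 16, 20, 30, 15, 40]

plt.bar(countries, birth_rate)            # one bar per country, equal widths
plt.ylabel("Birth rate per thousand")
plt.title("Birth rate of different countries")
plt.show()
```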
Comparing the sizes of the bars, China's birth rate is the highest, followed by India, while Germany and Sweden are nearly equal at the lowest positions.

Sub-divided Bar Diagram

In a sub-divided bar diagram, each bar representing the magnitude of a given value is further subdivided into various components. Each component occupies a part of the bar proportional to its share in the total.

Illustration - 1

Present the following data in a sub-divided bar diagram.

Year/Faculty | Science | Humanities | Commerce
2014-2015 | 240 | 560 | 220
2015-2016 | 280 | 610 | 280
Multiple Bar Diagram
In a multiple bar diagram, two or more sets of related data are represented and the
components are shown as separate adjoining bars. The height of each bar represents
the actual value of the component. The components are shown by different shades or
colours.
Illustration 1 - Construct a suitable bar diagram for the following data of number of
students in two different colleges in different faculties.
College | Arts | Science | Commerce | Total
A | 1200 | 800 | 600 | 2600
B | 700 | 500 | 600 | 1800
Percentage Bar Diagram

In a percentage bar diagram, the length of the entire bar is kept equal to 100. The various segments of each bar represent percentages of the aggregate.

Illustration - 1

Year | Men | Women | Children
1995 | 45% | 35% | 20%
1996 | 44% | 34% | 22%
1997 | 48% | 36% | 16%
1.2.3 Pie Chart
A pie chart or circle chart is a circular statistical graphic that is divided into slices to illustrate numerical proportion. In a pie chart, the arc length of each slice is proportional to the quantity it represents. While it is named for its resemblance to a pie which has been sliced, there are variations in the way it can be presented. In a pie chart, categories of data are represented by wedges in the circle, proportional in size to the percentage of individuals in each category.
Pie charts are very widely used in the business world and the mass media. Pie
charts are generally used to show percentage or proportional data and usually the
percentage represented by each category is provided next to the corresponding slice of
pie. Pie charts are good for displaying data for around six categories or fewer.
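A short, hedged sketch of a pie chart in Python (matplotlib assumed available): the 35 dogs out of 100 pets come from the earlier example, while the split of the remaining 65 pets among cats, birds and fish is invented for illustration.

```python
import matplotlib.pyplot as plt

labels = ["Dogs", "Cats", "Birds", "Fish"]
sales = [35, 30, 20, 15]                  # percentages of 100 pets sold

plt.pie(sales, labels=labels, autopct="%1.0f%%")   # slice size proportional to share
plt.title("Pets sold")
plt.show()
```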
1.2.4 Histogram
A histogram is a graphical data display using bars of different heights. It is similar to a bar chart, but a histogram groups numerical data into ranges (classes). The height of each bar shows how many values fall within each range.
A histogram can be used when:
●● The data is numerical
●● The shape of the data's distribution is to be viewed, especially when determining whether the output of a process is distributed approximately normally
●● Analyzing whether a process can meet the customer's requirements
●● Analyzing what the output from a supplier's process looks like
●● Seeing whether a process change has occurred from one time period to another
●● Determining whether the outputs of two or more processes are different
●● You wish to communicate the distribution of data quickly and easily to others
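A minimal histogram sketch in Python (matplotlib and numpy assumed available); the 200 ages are randomly generated purely for illustration, and the bins give class intervals of width 10:

```python
import matplotlib.pyplot as plt
import numpy as np

np.random.seed(0)
ages = np.random.normal(45, 12, 200)      # hypothetical ages of 200 persons

plt.hist(ages, bins=range(10, 90, 10), edgecolor="black")   # classes 10-20, 20-30, ...
plt.xlabel("Age in years")
plt.ylabel("Number of persons")
plt.show()
```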
1.2.5 Frequency Polygon
A frequency polygon is obtained by plotting the frequencies against the mid-points of the class intervals and joining the points so obtained by line segments. Compared with a histogram, in a frequency polygon points replace the bars (rectangles). Also, when several distributions are to be compared on the same graph paper, frequency polygons are better than histograms.
Illustration 1
Draw a histogram and frequency polygon from the following data
Age in Years | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70 | 70-80
Number of Persons | 3 | 16 | 22 | 35 | 24 | 15 | 2
1.2.6 Ogives
When frequencies are added progressively, they are called cumulative frequencies. The curve obtained by plotting cumulative frequencies is called a cumulative frequency curve or an ogive (pronounced "ojive").
To construct an ogive: (i) add up the progressive totals of frequencies, class by class, to get the cumulative frequencies; (ii) plot classes on the horizontal (x) axis and cumulative frequencies on the vertical (y) axis.
Less than Ogive: To plot a less-than ogive, the data is arranged in ascending order of magnitude and the frequencies are cumulated from the top, i.e. by adding. Cumulative frequencies are plotted against the upper class limits. The ogive under this method gives a rising curve.
Greater than Ogive: To plot a greater-than ogive, the data is arranged in ascending order of magnitude and the frequencies are cumulated from the bottom, or subtracted from the total from the top. Cumulative frequencies are plotted against the lower class limits. The ogive under this method gives a falling curve.
Uses: Certain values like median, quartiles, quartile deviation, co-efficient of
skewness etc. can be located using ogives. Ogives are helpful in the comparison of the
two distributions.
Illustration 1 –
Draw less than and more than ogive curves for the following frequency distribution
and obtain median graphically. Verify the result.
CI | 0-20 | 20-40 | 40-60 | 60-80 | 80-100 | 100-120 | 120-140 | 140-160
f | 5 | 12 | 18 | 25 | 15 | 12 | 8 | 5

Solution:

Size (less than) | lcf | Size (more than) | mcf
20 | 5 | 0 | 100
40 | 17 | 20 | 95
60 | 35 | 40 | 83
80 | 60 | 60 | 65
100 | 75 | 80 | 40
120 | 87 | 100 | 25
140 | 95 | 120 | 13
160 | 100 | 140 | 5

The two ogives intersect at the median. Verification by formula: N/2 = 50 lies in the class 60-80, so Median = 60 + ((50 − 35)/25) × 20 = 72.
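A hedged Python sketch (numpy assumed available) that reproduces the cumulative frequencies above and reads the median off the less-than ogive by linear interpolation:

```python
import numpy as np

upper_limits = np.array([20, 40, 60, 80, 100, 120, 140, 160])
freq = np.array([5, 12, 18, 25, 15, 12, 8, 5])

lcf = np.cumsum(freq)                                # less-than cumulative frequencies
mcf = freq.sum() - np.concatenate(([0], lcf[:-1]))   # more-than cumulative frequencies

# Median: the value on the less-than ogive where the cumulative frequency is N/2 = 50
median = np.interp(freq.sum() / 2, lcf, upper_limits)
print(lcf, mcf, median)                              # median comes out as 72.0
```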
1.2.7 Pareto Chart
A Pareto Chart is a graph showing the frequency of the defects and their
cumulative effect. Pareto charts are helpful in identifying the defects that should be
prioritized to achieve the greatest overall change.
The Pareto principle (also known as the 80/20 rule, the law of the vital few, or the
principle of factor sparsity) states that, for many events, roughly 80% of the effects
come from 20% of the causes.
An example of a Pareto chart –
When to use a Pareto chart
●● A Pareto chart should be used when analyzing data about the frequency of problems or causes in a process.
●● A Pareto chart should be used when there are many problems or causes and you want to focus on the most significant.
●● It should be used when analyzing broad causes by looking at their specific components.
●● It should be used while communicating with others about the data.
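A hedged sketch of constructing a Pareto chart in Python (matplotlib assumed available); the defect categories and counts are invented for illustration, and the second axis carries the cumulative percentage line:

```python
import matplotlib.pyplot as plt

defects = {"Scratches": 45, "Dents": 25, "Cracks": 15, "Stains": 10, "Other": 5}
labels = sorted(defects, key=defects.get, reverse=True)     # largest cause first
counts = [defects[k] for k in labels]
cum_pct = [100 * sum(counts[:i + 1]) / sum(counts) for i in range(len(counts))]

fig, ax1 = plt.subplots()
ax1.bar(labels, counts)                               # frequency of each defect
ax2 = ax1.twinx()
ax2.plot(labels, cum_pct, marker="o", color="red")    # cumulative percentage (80/20 view)
ax2.set_ylim(0, 110)
plt.show()
```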
1.2.8 Stem-and-leaf display
A stem-and-leaf display or stem-and-leaf plot is a device for presenting quantitative data in a graphical format, similar to a histogram, to assist in visualizing the shape of a distribution. They are useful tools in exploratory data analysis.

A stem-and-leaf plot is a special table where each data value is split into a "stem" (the first digit or digits) and a "leaf", which is usually the last digit. The "stem" values are listed down, and the "leaf" values go right (or left) from the stem values. The "stem" is used to group the scores and each "leaf" shows the individual scores within each group.

For example, Tom got his friends to do a long jump and got these results:

2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0

The stem-and-leaf plot for the same will be:

Stem | Leaf
2 | 3 5 5 7 8
3 | 2 6 6
4 | 5
5 | 0

What the stem and leaf here mean (Stem "2", Leaf "3" means 2.3):
●● In this case each leaf is a decimal
●● It is OK to repeat a leaf value
●● 5.0 has a leaf of "0"
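A small Python sketch (standard library only) that builds the same stem-and-leaf display from Tom's long-jump results; the stems are the whole-number parts and the leaves the first decimal digit:

```python
from collections import defaultdict

data = [2.3, 2.5, 2.5, 2.7, 2.8, 3.2, 3.6, 3.6, 4.5, 5.0]

plot = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(round(value * 10), 10)   # e.g. 2.3 -> stem 2, leaf 3
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
# 2 | 3 5 5 7 8
# 3 | 2 6 6
# 4 | 5
# 5 | 0
```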
1.2.9 Cross tabulations
Cross tabulation is a method by which the relationship between multiple variables is quantitatively analysed. Cross tabulations, also known as contingency tables or cross tabs, help explain the connection between the variables. They also show how the relationship varies from one group of a variable to another.
A cross-tabulation (or crosstab) is, for reference, a two- (or more) dimensional table
which records the number (frequency) of respondents having the specific characteristics
described in the table cells. Cross-tabulation tables offer a wealth of information on the
variables’ relationship. Cross-tabulation analysis goes by several names in the research
world including crosstab, contingency table, chi-square and data tabulation.
Importance of Cross Tabulation
●● Clean and Useable Data: Cross tabulation makes it simple to interpret data. The clarity offered by cross tabulation helps deliver clean data that can be used to improve decisions throughout an organization.
●● Easy to Understand: No advanced statistical degree is needed to interpret a cross tabulation. The results are easy to read and explain, which makes it useful in any type of presentation.
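As a hedged illustration of a cross tabulation in Python (pandas assumed available); the survey responses below are invented purely to show how a contingency table of frequencies is produced:

```python
import pandas as pd

df = pd.DataFrame({
    "gender":  ["M", "F", "F", "M", "F", "M", "F", "M"],
    "product": ["A", "A", "B", "B", "A", "A", "B", "B"],
})

# Frequency of respondents in each gender/product cell, with row and column totals
table = pd.crosstab(df["gender"], df["product"], margins=True)
print(table)
```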
1.2.10 Scatter plot and Trend line
A scatter diagram is the most fundamental graph plotted to show the relationship between two variables. It is a simple way to represent a bivariate distribution, i.e. the distribution of two random variables. The two variables are plotted against the X and Y axes, so every data pair (xi, yi) is represented by a point on the graph, x being the abscissa and y the ordinate of the point. From a scatter diagram we can find whether there is any relationship between x and y and, if yes, what type of relationship. A scatter diagram thus indicates the nature and strength of the correlation. The pattern of points obtained by plotting the observed pairs is known as a scatter diagram.
It gives us two types of information:
●● Whether the variables are related or not.
●● If so, what kind of relationship or estimating equation describes the relationship.
If the dots cluster around a line, the correlation is called linear correlation. If the dots cluster around a curve, the correlation is called non-linear or curvilinear correlation.
A scatter diagram is drawn to visualize the relationship between two variables. The values of the independent (more important) variable are plotted on the X-axis, while the values of the dependent variable are plotted on the Y-axis. On the graph, dots are plotted to represent the different pairs of data; when dots are plotted for all the pairs, we get a scatter diagram. The way the dots scatter gives an indication of the kind of relationship which exists between the two variables. While drawing a scatter diagram, it is not necessary to start the axes at the zero values of the X and Y variables; the minimum values of the variables considered may be taken as the origin.
●●
When there is a positive correlation between the variables, the dots on the
scatter diagram run from left hand bottom to the right hand upper corner. In
case of perfect positive correlation all the dots will lie on a straight line.
●●
When a negative correlation exists between the variables, dots on the scatter
diagram run from the upper left hand corner to the bottom right hand corner. In
case of perfect negative correlation, all the dots lie on a straight line.
Example: Figures on advertisement expenditure (X) and sales (Y) of a firm for the last ten years are given below. Draw a scatter diagram.

Advertisement cost (in '000) | 40 | 65 | 60 | 90 | 85 | 75 | 35 | 90 | 34 | 76
Sales (in lakh ₹) | 45 | 56 | 58 | 82 | 65 | 70 | 64 | 85 | 50 | 85

Solution: (The scatter diagram plots each pair of values as a point, with advertisement cost on the X-axis and sales on the Y-axis.)
A scatter diagram gives two very useful types of information. First, we can observe patterns between variables that indicate whether the variables are related. Secondly, if the variables are related, we can get an idea of what kind of relationship (linear or non-linear) would describe it. Correlation examines the first question, of determining whether an association exists between the two variables and, if it does, to what extent. Regression examines the second question, of establishing an appropriate relation between the variables.
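A hedged Python sketch (matplotlib and numpy assumed available) that draws the scatter diagram for the advertisement/sales example above and adds a straight trend line fitted by least squares:

```python
import matplotlib.pyplot as plt
import numpy as np

advert = [40, 65, 60, 90, 85, 75, 35, 90, 34, 76]   # advertisement cost ('000)
sales = [45, 56, 58, 82, 65, 70, 64, 85, 50, 85]    # sales (lakh)

plt.scatter(advert, sales)                           # one dot per (X, Y) pair

slope, intercept = np.polyfit(advert, sales, 1)      # degree-1 (linear) trend line
xs = np.linspace(min(advert), max(advert), 100)
plt.plot(xs, slope * xs + intercept)
plt.xlabel("Advertisement cost ('000)")
plt.ylabel("Sales (lakh)")
plt.show()
```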
1.3.1 Arithmetic mean - intro and application
The mean is the average of the numbers. It is easy to calculate: add up all the numbers, then divide by how many numbers there are. In other words, it is the sum divided by the count.
Arithmetic mean is defined as the value obtained by dividing the total of the values of all items in the series by their number. In other words, it is the sum of the given observations divided by the number of observations: add the values of all items together and divide this sum by the number of observations.
Symbolically: X̄ = (x1 + x2 + x3 + … + xn)/n
Properties of Arithmetic Mean
1. The sum of the deviations of all the values of x from their arithmetic mean is zero.
2. The product of the arithmetic mean and the number of items gives the total of all items.
3. The combined arithmetic mean can be found when the means and sizes of different groups are given.
Demerits of Arithmetic Mean
1. Arithmetic mean is affected by extreme values.
2. Arithmetic mean cannot be determined by inspection and cannot be located graphically.
3. Arithmetic mean cannot be obtained if a single observation is lost or missing.
4. Arithmetic mean cannot be calculated when open-end class intervals are present in the data.
Arithmetic Mean for Ungrouped Data
Individual Series
1. Direct Method
The following steps are involved in calculating the arithmetic mean of an individual series using the direct method:
- Add up the values of all the items in the series.
- Divide the sum of the values by the number of items. The result is the arithmetic mean.
The following formula is used: X̄ = Ʃx/N
Where X̄ = Arithmetic mean, Ʃx = Sum of the values, N = Number of items.
Illustration 1 – Values (x): 125, 128, 132, 135, 140, 148, 155, 157, 159, 161
Calculate the arithmetic mean.
Solution –
Total number of terms, N = 10
Ʃx = 125 + 128 + 132 + 135 + 140 + 148 + 155 + 157 + 159 + 161 = 1440
X̄ = Ʃx/N = 1440/10 = 144
2. Short-cut Method or Indirect method
The following steps are involved in calculating the arithmetic mean of an individual series using the short-cut or indirect method:
1. Assume one of the values in the series as an average. It is called the working mean or assumed average.
2. Find out the deviation of each value from the assumed average.
3. Add up the deviations.
4. Apply the following formula: X̄ = A + Ʃd/N
where X̄ = Arithmetic mean, A = Assumed average, Ʃd = Sum of the deviations, N = Number of items.
Illustration - 1: Calculate the arithmetic average of the data given below using the short-cut method.

Roll No | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
Marks | 43 | 48 | 65 | 57 | 31 | 60 | 37 | 48 | 78 | 59

Solution – (taking the assumed average A = 60)

Roll No | Marks Obtained | d = X − 60
1 | 43 | −17
2 | 48 | −12
3 | 65 | 5
4 | 57 | −3
5 | 31 | −29
6 | 60 | 0
7 | 37 | −23
8 | 48 | −12
9 | 78 | 18
10 | 59 | −1
 | | Ʃd = −74

X̄ = A + Ʃd/N = 60 + (−74/10) = 52.6 marks

Combined Arithmetic Mean
When the arithmetic means and the numbers of items of two or more related groups are known, the combined mean of the entire group can be calculated by the following formula:

X̄ = (n1X̄1 + n2X̄2) / (n1 + n2)

Where n1 = number of items of the first group, n2 = number of items of the second group, X̄1 = A.M. of the first group, X̄2 = A.M. of the second group.

Example – From the following data ascertain the combined mean of a factory consisting of two branches, namely branch A and branch B. In branch A the number of workers is 500 and their average salary is ₹300. In branch B the number of workers is 1,000 and their average salary is ₹250.

Solution:
Let the number of workers in branch A be n1 = 500
Let the number of workers in branch B be n2 = 1000
Average salary X̄1 = 300
Average salary X̄2 = 250

X̄ = (n1X̄1 + n2X̄2) / (n1 + n2)
= (500 × 300 + 1000 × 250) / (500 + 1000)
= (1,50,000 + 2,50,000) / 1500
= 266.66
Weighted Arithmetic Mean
Sometimes, some observations are relatively more important than others. Weights are assigned to such observations on the basis of their relative importance. In the weighted arithmetic mean, the value of each item is multiplied by its weight, the products are summed, and the total is divided by the sum of the weights.
Symbolically: X̄w = Ʃwx / Ʃw
Example – Calculate simple and weighted average from the following data –
Month | Jan | Feb | March | April | May | June
Price | 42.5 | 51.25 | 50 | 52 | 44.25 | 54
No. of tonnes | 25 | 30 | 40 | 50 | 10 | 45

Solution:

Month | Price per tonne (x) | Tonnes purchased (w) | wx
Jan | 42.5 | 25 | 1062.5
Feb | 51.25 | 30 | 1537.5
March | 50 | 40 | 2000
April | 52 | 50 | 2600
May | 44.25 | 10 | 442.5
June | 54 | 45 | 2430
N = 6 | Ʃx = 294 | Ʃw = 200 | Ʃwx = 10027.5
Simple AM
X = Ʃx/n = 294/6 = 49
Weighted AM
Xw = Ʃwx/Ʃw = 10027.5/200 = 50.137
The correct average price paid is ₹50.14 and not ₹49, i.e., the weighted arithmetic mean is more accurate than the simple arithmetic mean.
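A minimal Python check of the same calculation (no external libraries), using the prices and tonnages from the table:

```python
prices = [42.5, 51.25, 50, 52, 44.25, 54]     # price per tonne (x)
tonnes = [25, 30, 40, 50, 10, 45]             # tonnes purchased (w)

simple_mean = sum(prices) / len(prices)                                   # 49.0
weighted_mean = sum(w * x for w, x in zip(tonnes, prices)) / sum(tonnes)  # 50.1375

print(simple_mean, weighted_mean)
```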
1.3.2 Median - Intro and Application
Median is defined as the value of the item dividing the series into two equal halves, where one half contains all values less than (or equal to) it and the other half contains all values greater than (or equal to) it. It is also defined as the "central value" of the variable. To find the median, the values of the items must be arranged in order of their size or magnitude.
Median is a positional average. The term position refers to the place of a value in the series; the place of the median is such that an equal number of items lie on either side of it, which is why it is also called a locative average.
Merits of Median
Following are the advantages of median:
●● It is rigidly defined.
●● It is easy to calculate and understand.
●● It can be located graphically.
●● It is not affected by extreme values like the arithmetic mean.
●● It can be found by mere inspection.
●● It can be used for qualitative studies.
●● Even if the extreme values are unknown, the median can be calculated if one knows the number of items.
Demerits of Median
Following are the disadvantages of median:
●● In the case of individual observations, the values have to be arranged in order of their size to locate the median. Such an arrangement of data is a tedious task if the number of items is large.
●● If the median is multiplied by the number of items, the total value of all the items cannot be obtained, as it can in the case of the arithmetic average.
●● It is not suitable for complex algebraic or mathematical treatment.
●● It is more affected by sampling fluctuations.
Application of Median
Example – Determine the median from the following –
25, 15, 23, 40, 27, 25, 23, 25, 20

Solution – Arranging the figures in ascending order:

S.No | Value or Size
1 | 15
2 | 20
3 | 23
4 | 23
5 | 25
6 | 25
7 | 25
8 | 27
9 | 40

Median = size of the (N + 1)/2 th item = (9 + 1)/2 = 5th term = 25
1.3.3 Mode - Intro and Application
The word "mode" is derived from the French word "la mode", which means fashion. So it can be regarded as the most fashionable item in the series or the group.
Croxton and Cowden regard mode as "the most typical of a series of values". As a result it can sum up the characteristics of a group more satisfactorily than the arithmetic mean or the median.
Mode is defined as the value of the variable occurring most frequently in a distribution. In other words, it is the most frequent size of item in a series.
Merits of Mode
The following are the merits of mode:
●● The most important advantage of mode is that it is usually an actual value.
●● In the case of a discrete series, mode can be easily located by inspection.
●● Mode is not affected by extreme values.
●● It is easy to understand, and this average is used by people in their everyday speech.
●● Mode can be determined even if extreme values are not given.
Demerits of Mode
The following are the demerits of mode:
●● It is not based on all the observations of the data.
●● In a number of cases there will be more than one mode in the series.
●● If mode is multiplied by the number of items, the product will not be equal to the total value of the items.
●● It will not truly represent the group if there are a small number of items of the same size in a large group of items of different sizes.
●● It is not suitable for further mathematical treatment.
Applications of Mode
Mode in Ungrouped Data
Individual Series
The mode of this series can be obtained by mere inspection. The number which
occurs most often is the mode.
Illustration - 1
Locate mode in the data 7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9
Solution:
On inspection, it is observed that the number 9 has the maximum frequency, being repeated 4 times, more than any other number. Therefore mode (Z) = 9.
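A short Python sketch (standard library only) that locates the mode of the same data by counting frequencies:

```python
from collections import Counter

data = [7, 12, 8, 5, 9, 6, 10, 9, 4, 9, 9]

mode, frequency = Counter(data).most_common(1)[0]
print(mode, frequency)    # 9 occurs 4 times, so the mode is 9
```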
Discrete Series
In a discrete series, the mode is calculated by preparing a grouping table and an analysis table.
●● Grouping Table: It consists of six columns. The 1st column contains the frequencies, the 2nd and 3rd columns group the frequencies in twos, and the 4th, 5th and 6th columns group the frequencies in threes.
●● Analysis Table: It consists of two columns, namely tally bars and frequency.
Steps in Calculating Mode in Discrete Series
The following steps are involved in calculating mode in a discrete series:
●● Group the frequencies in twos.
●● Leave the first frequency and group the other frequencies in twos.
●● Group the frequencies in threes.
●● Leave the frequency of the first size and add the frequencies of the other sizes in threes.
●● Leave the frequencies of the first two sizes and add the frequencies of the other sizes in threes.
●● Prepare an analysis table to find the size occurring the maximum number of times. That particular size is the mode.
Continuous Series
The following steps are involved in calculating mode in continuous series.
Find out the modal class. The modal class can be found easily by inspection: the group containing the maximum frequency is the modal group. Where two or more classes appear to be the modal class, it can be decided by the grouping process and by preparing an analysis table, as was discussed in question number 2.102.
The actual value of mode is calculated by applying the following formula.
Mo = l + [(fm − f1) / (2fm − f1 − f2)] × i
where l is the lower limit of the modal class, fm the frequency of the modal class, f1 and f2 the frequencies of the classes preceding and succeeding it, and i the class interval.
1.3.4 Partition values - Quartiles and Percentiles
A percentile is the value below which a percentage of data falls.
Example: You are the fourth tallest person in a group of 20
80% of people are shorter than you:
That means you are at the 80th percentile.
If your height is 1.65m then “1.65m” is the 80th percentile height in that group.
Quartiles are the values that split data into quarters. Quartiles are values that divide
a (part of a) data table into four groups containing an approximately equal number of
observations. The total of 100% is split into four equal parts: 25%, 50%, 75% and 100%.
The quartiles also divide the data into divisions of 25%, so:
●● Quartile 1 (Q1) can be called the 25th percentile
●● Quartile 2 (Q2) can be called the 50th percentile
●● Quartile 3 (Q3) can be called the 75th percentile
Example:
For 1, 3, 3, 4, 5, 6, 6, 7, 8, 8:
●● The 25th percentile = 3
●● The 50th percentile = 5.5
●● The 75th percentile = 7
The percentiles and quartiles are computed as follows:
1. The f-value of each value in the data table is computed as fi = (i − 1)/(n − 1), where i is the index of the value and n is the number of values.
2. The first quartile is computed by interpolating between the f-values immediately below and above 0.25, to arrive at the value corresponding to the f-value 0.25.
3. The third quartile is computed by interpolating between the f-values immediately below and above 0.75, to arrive at the value corresponding to the f-value 0.75.
4. Any other percentile is similarly calculated by interpolating between the appropriate values.
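A hedged numpy sketch for the quartile example above; note that numpy's default interpolation convention differs slightly from the simple textbook rule for small data sets, so Q1 and Q3 do not match the values quoted earlier exactly:

```python
import numpy as np

data = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]

q1, q2, q3 = np.percentile(data, [25, 50, 75])
print(q1, q2, q3)    # 3.25, 5.5, 6.75 with numpy's default (linear) interpolation
```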
1.3.5 Measures of Dispersion - Range - intro and Application
A measure of dispersion or variation in any data shows the extent to which the numerical values tend to spread about an average. If the differences between items are small, the average represents and describes the data adequately. For large differences it is proper to supplement the information by calculating a measure of dispersion in addition to an average. It is useful for the following purposes:
●● To compare the current results with past results.
●● To compare two or more sets of observations.
●● To suggest methods to control variation in the data.
A study of variations helps us in knowing the extent of uniformity or consistency in
any data. Uniformity in production is an essential requirement in industry. Quality control
methods are based on the laws of dispersion.
Absolute and Relative Measures of Dispersion
The measures of dispersion can be either ‘absolute’ or “relative”. Absolute
measures of dispersion are expressed in the same units in which the original data
are expressed. For example, if the series is expressed as Marks of the students in a
particular subject; the absolute dispersion will provide the value in Marks. The only
difficulty is that if two or more series are expressed in different units, the series cannot
be compared on the basis of dispersion.
‘Relative’ or ‘Coefficient’ of dispersion is the ratio or the percentage of a measure
of absolute dispersion to an appropriate average. The basic advantage of this measure
is that two or more series can be compared with each other despite the fact they are
expressed in different units.
A precise measure of dispersion is one that gives the magnitude of the variation
in a series, i.e. it measures in numerical terms, the extent of the scatter of the values
around the average.
When dispersion is measured in terms of the original units of a series, it is absolute dispersion or variability. It is difficult to compare absolute values of dispersion in different series, especially when the series are in different units or have different sets of values. A good measure of dispersion should have properties similar to those described for a good measure of central tendency.

Measures of Dispersion (absolute) | Relative Variability
The Range | Relative Range
The Quartile Deviation | Relative Quartile Deviation
The Mean Deviation | Relative Mean Deviation
The Median Deviation | –
The Standard Deviation | Coefficient of Variation
Graphical Method | –
Range
Definition: The ‘Range’ of the data is the difference between the largest value of
data and smallest value of data.
This is an absolute measure of variability. However, if we have to compare two sets of data, 'Range' may not give a true picture. In such a case a relative measure of range, called the coefficient of range, is used. This is given by:
Formulae: Range = L-S
Where L = Largest value and S = Smallest value
In individual observations and discrete series, L and S are easily identified. In
continuous series, the following two methods are used as follows:
Method 1: L - Upper boundary of the highest class.
S - Lower boundary of the lowest class.
Method 2: L - Mid value of the highest class.
S - Mid Value of the lowest class.
Example: Find the range of the set of observations 10, 5, 8, 11, 12, 9.
Solution: L = 12, S = 5
Range = L − S = 12 − 5 = 7
Coefficient of range = (L − S)/(L + S) = (12 − 5)/(12 + 5) = 7/17 = 0.4118
Interquartile Range and Deviations
Inter-quartile range and quartile deviation are described in the following subsections.

Inter-quartile Range

Inter-quartile range is the difference between the upper quartile (third quartile) and the lower quartile (first quartile). Thus, Inter-Quartile Range = Q3 − Q1
Quartile deviation
Quartile deviation is half the difference between the upper quartile and the lower quartile.
Thus, Quartile Deviation = QD = (Q3 − Q1)/2
Quartile deviation (QD) also gives the average deviation of the upper and lower quartiles from the median. Its relative measure is the coefficient of quartile deviation:
Coefficient of QD = (Q3 − Q1)/(Q3 + Q1)
Example: Weekly wages of labourers are given below. Calculate Q.D. and the coefficient of Q.D.

Weekly wages | 100 | 200 | 400 | 500 | 600 | Total
No. of weeks | 5 | 8 | 21 | 12 | 6 | 52
Solution:
Weekly wages | No. of weeks | Cumulative Frequency
100 | 5 | 5
200 | 8 | 13
400 | 21 | 34
500 | 12 | 46
600 | 6 | 52
 | N = 52 |
Q1 = size of the (N + 1)/4 th item = (52 + 1)/4 = 13.25th item
Q1 = 13th value + 0.25 (14th value − 13th value)
= 200 + 0.25 (400 − 200)
= 200 + 0.25 × 200
= 200 + 50
= 250

Q3 = size of the 3(N + 1)/4 th item = 3 × 13.25 = 39.75th item
Q3 = 39th value + 0.75 (40th value − 39th value)
= 500 + 0.75 (500 − 500)
= 500 + 0.75 × 0
= 500

Q.D. = (Q3 − Q1)/2 = (500 − 250)/2 = 250/2 = 125

Coefficient of Q.D. = (Q3 − Q1)/(Q3 + Q1) = (500 − 250)/(500 + 250) = 250/750 = 0.333
1.3.7 Standard Deviation and Variance
Variance is defined as the average of the squared deviations of the data points from their mean.
When the data constitute a sample, the variance is denoted by σx² and the averaging is done by dividing the sum of the squared deviations from the mean by n − 1. When the observations constitute the population, the variance is denoted by σ² and we divide by N for the average.
Different formulas for calculating variance:

Sample Variance: Var(x) = σx² = Σ(xi − x̄)² / (n − 1), summed over i = 1 to n

Population Variance: Var(x) = σ² = Σ(xi − µ)² / N

Where,
x̄ = Sample mean
n = Sample size
µ = Population mean
N = Population size
xi for i = 1, 2, ..., n are the observation values

The population variance can be expanded as:

σ² = Σ(xi − µ)² / N = [Σxi² − 2µΣxi + µ²Σ1] / N = Σxi²/N − µ²

i.e., Var(x) = E(X²) − [E(X)]²
Standard deviation
Definition: Standard deviation is the root mean square deviation of the values from their arithmetic mean. S.D. is denoted by the symbol σ (read sigma). The standard deviation (SD) of a set of data is the positive square root of the variance of the set. It is also referred to as the Root Mean Square (RMS) value of the deviations of the data points. The SD of a sample is the square root of the sample variance, i.e. equal to σx, and the standard deviation of a population is the square root of the variance of the population, denoted by σ.
The properties of standard deviation are:
●● It is the most important and widely used measure of variability.
●● It is based on all the observations.
●● Further mathematical treatment is possible.
●● It is affected least by sampling fluctuations.
●● It is affected by extreme values and gives more importance to the values that are away from the mean.
●● The main limitation is that we cannot compare the variability of different data sets given in different units.
Formula for Calculating S.D.

For the set of values x1, x2, ..., xn:

σ = √[ Σx²/n − (Σx/n)² ]

If an assumed value A is taken for the mean and d = X − A, then

σ = √[ Σd²/n − (Σd/n)² ]

For a frequency distribution (using step deviations),

σ = √[ Σfd²/N − (Σfd/N)² ] × C

where d = (X − A)/C, C is the class interval and N is the total frequency.
Application of Standard Deviation

Example: Find the standard deviation for the following data:

Class Interval | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70
Frequency | 6 | 14 | 10 | 8 | 1 | 3 | 8

Solution: Direct Method

Class Interval | Class Mark mi | Frequency fi | fi × mi | di = mi − X̄ | di² | fi × di²
0-10 | 5 | 6 | 30 | −25 | 625 | 3750
10-20 | 15 | 14 | 210 | −15 | 225 | 3150
20-30 | 25 | 10 | 250 | −5 | 25 | 250
30-40 | 35 | 8 | 280 | 5 | 25 | 200
40-50 | 45 | 1 | 45 | 15 | 225 | 225
50-60 | 55 | 3 | 165 | 25 | 625 | 1875
60-70 | 65 | 8 | 520 | 35 | 1225 | 9800
Total | | Σfi = 50 | 1500 | | | 19250

Mean = 1500/50 = 30
SD = √(19250/50) = √385 = 19.62
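A hedged numpy check of the grouped-data calculation above, using the class marks and frequencies from the table:

```python
import numpy as np

mid = np.array([5, 15, 25, 35, 45, 55, 65])     # class marks
freq = np.array([6, 14, 10, 8, 1, 3, 8])        # frequencies

mean = (freq * mid).sum() / freq.sum()                      # 30.0
variance = (freq * (mid - mean) ** 2).sum() / freq.sum()    # 385.0
print(mean, variance ** 0.5)                                # SD = 19.62
```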
Combined Standard Deviation
Standard Deviation of Combined Means
The mean and S.D. of two groups are given in the following table
Group | Mean | S.D. | Size
I | X̄1 | σ1 | n1
II | X̄2 | σ2 | n2
Let X̄ and σ be the mean and S.D. of the combined group of (n1 + n2) items. Then X̄ and σ are determined by the formulae:

X̄ = (n1X̄1 + n2X̄2) / (n1 + n2)

σ² = [n1σ1² + n2σ2² + n1d1² + n2d2²] / (n1 + n2)

or σ = √{ [n1σ1² + n2σ2² + n1d1² + n2d2²] / (n1 + n2) }

where d1 = X̄1 − X̄ and d2 = X̄2 − X̄.

These results can be extended to three samples as follows:

X̄ = (n1X̄1 + n2X̄2 + n3X̄3) / (n1 + n2 + n3)

σ² = [n1σ1² + n2σ2² + n3σ3² + n1d1² + n2d2² + n3d3²] / (n1 + n2 + n3)
1.3.8 Relative measure of dispersion - Coefficient of variation
It is defined as the ratio of SD and mean, multiplied by 100.
CV = (σ/μ) × 100
This is also called relative variability. A smaller value of CV indicates greater stability (consistency) and lesser variability.
Example: Two batsmen A and B made the following scores in the preliminary round of World Cup Series of cricket matches.
A: 14, 13, 26, 53, 17, 29, 79, 36, 84 and 49
B: 37, 22, 56, 52, 28, 30, 37, 48, 20 and 40
Whom will you select for the final? Justify your answer.
Solution: We will first calculate the mean, standard deviation and Karl Pearson's coefficient of variation. We will select the player based on the average score as well as consistency: we want a player who has not only been scoring a high average but also doing so consistently, so that the probability of his playing a good innings in the final is high.
For Player 'A' (Using Direct Method)

Score xi | Deviation (xi − µ) | (xi − µ)² | xi²
14 | −26 | 676 | 196
13 | −27 | 729 | 169
26 | −14 | 196 | 676
53 | 13 | 169 | 2809
17 | −23 | 529 | 289
29 | −11 | 121 | 841
79 | 39 | 1521 | 6241
36 | −4 | 16 | 1296
84 | 44 | 1936 | 7056
49 | 9 | 81 | 2401
Σxi = 400 | Σ(xi − µ) = 0 | Σ(xi − µ)² = 5974 | Σxi² = 21974

Now,
Mean = µ = Σxi / N = 400/10 = 40
Variance = Var(x) = Σ(xi − µ)² / N = 5974/10 = 597.4
Standard Deviation = σ = √Var(x) = √597.4 = 24.44
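The text works the figures only for player A. As a hedged completion using nothing beyond the same population-variance formula, the sketch below (numpy assumed available) computes the mean, standard deviation and coefficient of variation for both players so the two can be compared:

```python
import numpy as np

A = [14, 13, 26, 53, 17, 29, 79, 36, 84, 49]
B = [37, 22, 56, 52, 28, 30, 37, 48, 20, 40]

for name, scores in (("A", A), ("B", B)):
    mean = np.mean(scores)
    sd = np.std(scores)            # population formula, dividing by N as in the text
    cv = sd / mean * 100
    print(name, round(mean, 2), round(sd, 2), round(cv, 1))

# A: mean 40.0, SD 24.44, CV 61.1   B: mean 37.0, SD 11.66, CV 31.5
# B has the smaller coefficient of variation, i.e. the more consistent scores.
```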
Key Terms
●● Sample: A sample consists of one or more observations drawn from the population. The sample is the group of people who actually took part in your research.
●● Population: A population includes all of the elements from a set of data. The population is the broader group of people to whom you expect to generalize your study results.
●● Frequency Polygon: The frequencies are plotted against the mid-points of the class intervals, and the points thus obtained are joined by line segments.
●● Bar Diagram: Only the length of the bar is taken into account, not the width; a bar is a thick line whose width is shown merely for visibility, so it is called a one-dimensional diagram.
●● Simple Bar Diagram: It represents only one variable. Since the bars are of the same width and vary only in length (height), comparative study becomes very easy. Simple bar diagrams are very popular in practice.
●● Percentage Bar Diagram: The length of the entire bar is kept equal to 100, and the various segments of each bar represent percentages of the aggregate.
●● Range: The 'Range' of the data is the difference between the largest value and the smallest value of the data.
Check your progress
1. A frequency polygon is constructed by plotting the frequency of the class interval and the
a) Lower limit of the class
b) Upper limit of the class
c) Any value of the class
d) Middle limit of the class
2. Numerical methods and graphical methods are specialized procedures used in
a) Social Statistics
b) Descriptive Statistics
c) Education Statistics
d) Business Statistics
3. A histogram consists of a set of
a) Adjacent triangles
b) Adjacent rectangles
c) Non adjacent rectangles
d) Adjacent squares
4. Component bar charts are used when data is divided into
a) Circles
b) Squares
c) Parts
d) Groups
5. A circle in which sectors represent various quantities is called
a) Histogram
b) Pie chart
c) Frequency Polygon
d) Ogive
Questions and Exercises
1. What do you mean by statistics?
2. What are the various types of bar diagrams?
3. What are the merits of mean, median and mode?
4. What do you understand by standard deviation and combined standard deviation?
5. Find the standard deviation for the following data:

Class Interval | 0-10 | 10-20 | 20-30 | 30-40 | 40-50 | 50-60 | 60-70
Frequency | 8 | 14 | 10 | 6 | 4 | 8 | 3

Check your progress
1. d) Middle limit of the class
2. b) Descriptive Statistics
3. b) Adjacent rectangles
4. c) Parts
5. b) Pie chart
Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Bibliography
1. Srivastava V. K. et al. – Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick – Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnick, Frank S., Dennis McLeavey, Richard Mojena – Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S. – Operations Research, Vikas Publishing.
7. Gould F. J. – Introduction to Management Science, Prentice Hall, Englewood Cliffs, N.J.
8. Naray J. K., Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian, Quantitative Techniques, Pearson Education.
11. Vohra N. D., Quantitative Techniques in Management, TMH.
12. Stevenson W. D., Introduction to Management Science, TMH.
Module-2: Probability Theory
Learning Objective:
●● To get familiarized with business problems associated with the concept of probability and probability distributions
●● To understand the MS Excel applications of Binomial, Poisson and Normal probabilities
Learning Outcome:
At the end of the course, the learners will be able to –
●● Compute Binomial, Poisson and Normal probabilities through MS Excel
●● Understand various theorems and principles of probability
Prof. Boddington defined statistics as "the science of estimates and probabilities".
2.1.1 Probability – Introduction
A probability is the quantitative measure of risk. Statistician I.J. Good suggests, “The
theory of probability is much older than the human species, since the assessment of
uncertainty incorporates the idea of learning from experience, which most creatures do.”
Probability and sampling are inseparable parts of statistics. Before we discuss
probability and sampling distributions, we must be familiar with some common terms
used in theory of probability. Although these terms are commonly used in business, they
have precise technical meaning.
Random Experiment: In theory of probability, a process or activity that results in
outcomes under study is called experiment, for example, sampling from a production lot.
Random experiment is an experiment whose outcome is not predictable in advance. There
is a chance or risk (sometimes also called as uncertainty) associated with each outcome.
Sample Space: It is a set of all possible outcomes of an experiment. It is usually
represented as S.
Example: If the random experiment is rolling of a die, the sample space is a set, S
= {1, 2, 3, 4, 5, 6}.
Similarly, if the random experiment is tossing of three coins, the sample space is, S
= {HHH, HHT, HTH, THH, HTT, THT, TTH, TTT} with total of 8 possible outcomes. (H is
heads, and T is Tails showing up.)
If we select a random sample of 2 items from a production lot and check them for
defect, the sample space will be S = {DD, DS, DR, RS, RR, SS} where D stands for
defective, S stands for serviceable and R stands for re-workable.
●●
Event: One or more possible outcomes that belong to certain category of our
interest are called as event. A sub set E of the sample space S is an event. In
other words, an event is a favorable outcome.
Event space: It is a set of all possible events. It is usually represented as E. Note
that usually in probability and statistics; we are interested in number of elements in
sample space and number of elements in event space.
●●
Union of events: If E and F are two events, then another event defined to include
all outcomes that are either in E or in F or in both is called as a union of events E
and F. It is denoted as E U F.
●●
Intersection of events: If E and F are two events, then another event defined to
include all outcomes that are in both E and F is called as an intersection of events
E and F. It is denoted as E∩ F.
●●
Mutually exclusive events: The events E and F are said to be mutually exclusive
events if they have no outcome of the experiment common to them. In other
words, events E and F are said to be mutually exclusive events if E∩ F = φ, where
φ is a null or empty set.
●●
Collectively exhaustive events: The events are collectively exhaustive if their
union is the sample space.
●●
Complement of event: Complement of an event E is an event which consists of
all outcomes that are not in the E. It is denoted as EC. Thus, E ∩ EC = φ and E U
EC = S
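A small Python sketch (standard library only) that expresses these event-algebra terms as set operations on the three-coin sample space used above; the event F is an extra event invented for the illustration:

```python
S = {"HHH", "HHT", "HTH", "THH", "HTT", "THT", "TTH", "TTT"}   # sample space

E = {s for s in S if s.count("H") >= 2}    # event: at least two heads
F = {s for s in S if s[0] == "H"}          # event: first coin shows heads

print(E | F)          # union of events E and F
print(E & F)          # intersection of events E and F
print(S - E)          # complement of E
print(E & (S - E))    # empty set: E and its complement are mutually exclusive
```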
Notes
er
s
ity
O
nl
in
e
●●
2.1.2 Types of Events
ni
v
A probability event can be defined as a set of outcomes of an experiment. In other
words, an event in probability is the subset of the respective sample space. A random
experiment ‘s entire potential set of outcomes is the sample space or the individual
space of that encounter. The probability of an occurrence happening is called chance.
The likelihood of any event happening lies between 0 and 1.
U
For example –
The sample space for the tossing of three coins simultaneously is given by:
ity
S = {(T, T, T), (T, T, H), (T, H, T), (T, H, H), (H, T, T), (H, T, H), (H, H, T), (H, H, H)}
Suppose, if we want to find only the outcomes which have at least two heads; then
the set of all such possibilities can be given as:
E = { (H , T , H) , (H , H ,T) , (H , H ,H) , (T , H , H)}
m
Thus, an event is a subset of the sample space, i.e., E is a subset of S.
)A
There could be a lot of events associated with a given sample space. For any
event to occur, the outcome of the experiment must be an element of the set of event E.
By event it is meant one or more than one outcomes.
Example Events:
Getting a Tail when tossing a coin is an event
●●
Rolling a “5” is an event.
(c
●●
An event can include several outcomes:
●●
Choosing a “King” from a deck of cards (any of the 4 Kings) is also an event
Amity Directorate of Distance & Online Education
34
●●
Notes
Rolling an “even number” (2, 4 or 6) is an event
Events can be:
O
nl
in
e
Statistics Management
●●
Independent (each event is not affected by other events),
●●
Dependent (also called “Conditional”, where an event is affected by other events)
●●
Mutually Exclusive (events can’t happen at the same time)
2.1.3 Algebra of Events
er
s
Complementary Events
ity
Events are the outcomes of an experiment. The likelihood of an event occurring is the ratio of the number of favourable outcomes to the total number of possible outcomes. Two events may happen together, or only one of them may happen. The algebra of events defines a new event by performing certain operations over two given events. The operations are union, intersection, complement and difference of two events. As events are subsets of the sample space, these operations are performed as set operations.
Complementary Events
For an event A, there is a complementary event B such that B represents the set of outcomes which are not in A. For example, if two coins are tossed together, the sample space is {HT, TH, HH, TT}. Let A be the event of getting exactly one head; then A = {HT, TH}. The complementary event of A is B = {HH, TT}.
Events with AND
AND stands for the intersection of two sets. An event is the intersection of two events if it contains the members present in both events. For example, if a pair of dice is rolled, the sample space has 36 members. Suppose A is the event of both dice showing the same number and B is the event that the sum is 6.
A = {(1,1), (2,2), (3,3), (4,4), (5,5), (6,6)}
B = {(3,3), (1,5), (5,1), (2,4), (4,2)}
A AND B = {(3,3)}
Events with OR
OR stands for the union of two sets. An event is called the union of two events if it contains the members present in either of the sets. For example, if two coins are tossed together the sample space is S = {HT, TH, TT, HH}. Let event A be the event of getting exactly one head and event B the event of getting two heads.
A = {HT, TH}
B = {HH}
Union of A and B, A OR B = {HT, TH, HH}
Events with BUT NOT
For two events A and B, "A but not B" is the event having all the elements of A but excluding the elements of B. This can also be represented as A − B. Suppose there is an experiment of choosing 4 cards from a deck of 52 cards. The event A is that all cards drawn are red, and event B is that all cards drawn are kings. Then the event "A but not B" will have all red cards excluding the two red kings.
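Because events are simply subsets of the sample space, these operations can be tried out directly in code. The short Python sketch below is an added illustration (the variable names are ours, not part of the original text); it uses the two-coin sample space to show union, intersection, complement and difference.

# Events as Python sets over the two-coin sample space
S = {"HH", "HT", "TH", "TT"}   # sample space
A = {"HT", "TH"}               # exactly one head
B = {"HH"}                     # two heads

print(A | B)    # union (A OR B)         -> {'HT', 'TH', 'HH'}
print(A & B)    # intersection (A AND B) -> set(), so A and B are mutually exclusive
print(S - A)    # complement of A        -> {'HH', 'TT'}
print(A - B)    # A but not B            -> {'HT', 'TH'}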
2.1.4 Addition Rule of Probability
If one task can be done in n1 ways and another task can be done in n2 ways, and if these tasks cannot be done at the same time, then there are (n1 + n2) ways of doing one of these tasks (either one task or the other). When a logical OR is used in deciding the outcomes of the experiment and the events are mutually exclusive, the 'Sum Rule' is applicable.
The addition rule of probability states that:
1. If 'A' and 'B' are any two events, then the probability of the occurrence of either 'A' or 'B' is given by:
P(A U B) = P(A) + P(B) – P(A ∩ B)
2. If 'A' and 'B' are two mutually exclusive events, then the probability of the occurrence of either A or B is given by:
P(A U B) = P(A) + P(B)
Example: An urn contains 10 balls of which 5 are white, 3 black and 2 red. If we
select one ball randomly, how many ways are there that the ball is either white or red?
Solution:
Answer is 5 + 2 = 7.
Example: In a triangular series the probability of the Indian team winning its match with Zimbabwe is 0.7 and that with Australia is 0.4. If the probability of India winning both matches is 0.3, what is the probability that India will win at least one match so that it can enter the final?
Solution:
Given that the probability of the Indian team winning the match with Zimbabwe is P(A) = 0.7, with Australia is P(B) = 0.4, and with both is P(A ∩ B) = 0.3.
Therefore, the probability that India will win at least one match is
P(A U B) = P(A) + P(B) - P(A ∩ B)
= 0.7 + 0.4 - 0.3
= 0.8
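As a quick check, the addition rule can be coded in a few lines. The sketch below is an added illustration (the function name is ours); it reproduces the cricket example above.

def prob_a_or_b(p_a, p_b, p_a_and_b=0.0):
    # Addition rule: P(A U B) = P(A) + P(B) - P(A ∩ B);
    # for mutually exclusive events pass p_a_and_b = 0.
    return p_a + p_b - p_a_and_b

print(prob_a_or_b(0.7, 0.4, 0.3))   # probability India wins at least one match -> 0.8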
2.1.5 Multiplication Rule of Probability
Suppose that a procedure can be broken down into a sequence of two tasks. If there are n1 ways to do the first task and n2 ways to do the second task after the first task has been done, then there are (n1 × n2) ways to do the procedure. In general, if r experiments are to be performed such that the first experiment can result in n1 outcomes, and having completed the first experiment the second can result in n2 outcomes, the third in n3 outcomes, and so on, then there is a total of n1 × n2 × n3 × … × nr possible outcomes of the r experiments.
Multiplicative rule is stated as:
If ‘A’ and ‘B’ are two independent events then the probability of occurrence of ‘A’
and ‘B’ is given by:
P (A∩B) = P (A) P (B)
It must be remembered that when the logical AND is used to indicate successive
experiments then, the ‘Product Rule’ is applicable.
Example: How many outcomes are there if we toss a coin and then throw a dice?
Answer is 2 × 6 = 12.
Example: It has been found that 80% of all tourists who visit India visit Delhi, 70%
of them visit Mumbai and 60% of them visit both.
1. What is the probability that a tourist will visit at least one city?
2. Also, find the probability that he will visit neither city.
Solution:
Let D indicate visit to Delhi and M denote visit to Mumbai.
Given, P (D) = 0.8, P (M) = 0.7 and P (D ∩M) = 0.6
1. The probability that a tourist will visit at least one city is
P(D U M) = P(D) + P(M) - P(D ∩ M) = 0.8 + 0.7 - 0.6 = 0.9
2. P(D′ ∩ M′) = 1 - P(D U M) = 1 - 0.9 = 0.1
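The same two results can be verified with a few lines of Python (an added sketch, not part of the original text; the variable names are ours):

p_d, p_m, p_both = 0.8, 0.7, 0.6
p_at_least_one = p_d + p_m - p_both    # P(D U M)
p_neither = 1 - p_at_least_one         # P(D' ∩ M')
print(p_at_least_one, p_neither)       # approximately 0.9 and 0.1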
2.1.6 Conditional, Joint and Marginal Probability
As a measure of uncertainty, probability depends on the information available. If
we know occurrence of say event F, probability of event E happening may be different
as compared to original probability of E when we had no knowledge of the event
F happening. Probability that E occurs given that F has occurred is the conditional
probability and is denoted by P(E|F). If event F occurs, then our sample space is reduced
to the event space of F. Also now for event E to occur, we must have both events E and
F occur simultaneously. Hence probability that event E occurs, given that event F has
occurred, is equal to the probability of EF (that is E ∩ F) relative to the probability of F.
Thus,
P(E | F) = P(EF) / P(F)
Another variation of the conditional probability rule is
P(EF) = P(E | F) × P(F)
Conditional probability satisfies all the properties and axioms of probabilities. Now
onwards, we would write (E ∩ F) as EF, which is a common convention.
Conditional probability is the probability that an event will occur given that another
event has already occurred. If A and B are two events, then the conditional probability of
A given B is written as P (A/B) and read as “the probability of A given that B has already
occurred.”
Example: The probability that a new product will be successful if a competitor
does not launch a similar product is 0.67. The probability that a new product will be
successful in the presence of a competitor’s new product is 0.42. The probability that
the competitor will launch a new product is 0.35. What is the probability that the product will be successful?
Solution: Let S denote that the product is successful, L denote competitor will
launch a product and LC denotes competitor will not launch the product. Now, from
given data,
P(S|LC) = 0.67, P(S|L) = 0.42, P(L) = 0.35
Hence, P(LC) = 1 − P(L) = 1 − 0.35 = 0.65
Now, using the total probability rule, the probability that the product will be successful, P(S), is
P(S) = P(S|L)P(L) + P(S|LC)P(LC)
= 0.42 × 0.35 + 0.67 × 0.65 = 0.5825
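The calculation above is an application of the total probability rule, and it is easy to script. The following is an added sketch (the variable names are ours):

# P(S) = P(S|L)P(L) + P(S|L^c)P(L^c)
p_launch = 0.35
p_success_given_launch = 0.42
p_success_given_no_launch = 0.67

p_success = (p_success_given_launch * p_launch
             + p_success_given_no_launch * (1 - p_launch))
print(round(p_success, 4))   # 0.5825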
2.1.7 Baye’s Theorem
Consider two events, E and F. Whatever the events may be, we can always say that the probability of E is equal to the probability of the intersection of E and F, plus the probability of the intersection of E and the complement of F. That is,
Bayes' Formula
Let E and F be events. Then
E = (E ∩ F) U (E ∩ FC)
since any outcome in E must be either in both E and F, or in E but not in F. Because (E ∩ F) and (E ∩ FC) are mutually exclusive (the former must be in F and the latter must not be in F), we have by Axiom 3,
P(E) = P(E ∩ F) + P(E ∩ FC) = P(E|F) × P(F) + P(E|FC) × P(FC)
= P(E|F) × P(F) + P(E|FC) × [1 − P(F)]
Suppose now that E has occurred and we are interested in determining the
probability of Fi has occurred, then using above equations, we have following
proposition.
P(Fi | E) = P(EFi) / P(E) = [P(E | Fi) × P(Fi)] / [Σk=1..n P(E | Fk) × P(Fk)]   for all i = 1, 2, ..., n
This equation is known as Bayes' formula. If we think of the events Fi as possible 'hypotheses' about some subject matter, say the market shares of competitors, then Bayes' formula tells us how these should be modified by the new evidence of the experiment, say a market survey.
Example: A bin contains 3 different types of lamps. The probability that a type 1
lamp will give over 100 hours of use is 0.7, with the corresponding probabilities for type
2 and 3 lamps being 0.4 and 0.3 respectively. Suppose that 20 per cent of the lamps in
the bin are of type 1, 30 per cent are of type 2 and 50 per cent are of type 3. What is the
probability that a randomly selected lamp will last more than 100 hours? Given that a
selected lamp lasted more than 100 hours, what are the conditional probabilities that it
is of type 1, type 2 and type 3?
Solution: Let type 1, type 2 and type 3 lamps be denoted by T1, T2 and T3
respectively. Also, we denote S if a lamp lasts more than 100 hours and SC if it does
not. Now, as per given data,
P(S|T1) = 0.7, P(S|T2) = 0.4, P(S|T3) = 0.3
P(T1) = 0.2, P(T2) = 0.3, P(T3) = 0.5
(a) Using the total probability rule,
P(S) = P(S|T1)P(T1) + P(S|T2)P(T2) + P(S|T3)P(T3)
= 0.7 × 0.2 + 0.4 × 0.3 + 0.3 × 0.5
= 0.41
(b) Now, using Bayes' formula,
P(T1 | S) = P(S | T1) P(T1) / P(S) = (0.7 × 0.2) / 0.41 = 0.341
P(T2 | S) = P(S | T2) P(T2) / P(S) = (0.4 × 0.3) / 0.41 = 0.293
P(T3 | S) = P(S | T3) P(T3) / P(S) = (0.3 × 0.5) / 0.41 = 0.366
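A minimal Python sketch of the same calculation is given below (added for illustration; the lists and variable names are ours). It computes P(S) by the total probability rule and then the posterior probability of each lamp type.

priors = [0.2, 0.3, 0.5]          # P(T1), P(T2), P(T3)
likelihoods = [0.7, 0.4, 0.3]     # P(S|T1), P(S|T2), P(S|T3)

p_s = sum(p * l for p, l in zip(priors, likelihoods))            # P(S) = 0.41
posteriors = [p * l / p_s for p, l in zip(priors, likelihoods)]  # Bayes' formula
print(round(p_s, 2), [round(x, 3) for x in posteriors])          # 0.41 [0.341, 0.293, 0.366]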
2.2.1 Random Variables - Introduction
In many practical situations, the random variable of interest follows a specific pattern.
Random variables are often classified according to the probability mass function in
case of discrete, and probability density function in case of continuous random variable.
When the distributions are entirely known, all the statistical calculations are possible. In
practice, however, the distributions may not be known fully. But the random variable can often be approximated by one of the known types of standard random variables by
examining the processes that make it random. These standard distributions are also
called ‘probability models’ or sample distributions. Various characteristics of distribution
like mean, variance, moments, etc. can be calculated using known closed formulae. We
will study some of the common types of probability distributions. The normal distribution is
the backbone of statistical inference and hence we will study it in more detail.
There are broadly four theoretical distributions which are generally applied in practice. They are:
1. Bernoulli distribution
2. Binomial distribution
3. Poisson distribution
4. Normal distribution
2.2.2 Mean/ Expected Value of Random Variable
In probability theory, the expected value of a random variable is a generalization
of the weighted average and intuitively is the arithmetic mean of a large number of
independent realizations of that variable. The expected value is also known as the
expectation, mathematical expectation, mean, average, or first moment.
A Random Variable is a set of possible values from a random experiment. The
mean of a discrete random variable X is a weighted average of the possible values
that the random variable can take. Unlike the sample mean of a group of observations,
which gives each observation equal weight, the mean of a random variable weights
each outcome xi according to its probability, pi. The common symbol for the mean (also
known as the expected value of X) is μ.
It is defined as –
μX = x1p1 + x2p2 + … + xkpk = ∑ xipi
The formula changes slightly according to what kinds of events are happening. For
most simple events, either the Expected Value formula of a Binomial Random Variable
or the Expected Value formula for Multiple Events is used.
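For a discrete random variable, the weighted-average definition above translates directly into code. The following is a small added sketch (the die example and function name are ours):

def expected_value(values, probs):
    # E[X] = sum of x_i * p_i over all outcomes
    return sum(x * p for x, p in zip(values, probs))

# Example: a fair six-sided die
faces = [1, 2, 3, 4, 5, 6]
probs = [1/6] * 6
print(expected_value(faces, probs))   # 3.5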
2.2.3 Variance and Standard Deviation of Random Variable
The variance is a numerical description of the spread, or the dispersion, of the
random variable. That is, the variance of a random variable X is a measure of how
spread out the values of X are, given how likely each value is to be observed.
Variance: Var(X)
The Variance is:
Var(X) = Σx²p − μ²
To calculate the Variance:
●● square each value and multiply by its probability
●● sum them up and we get Σx²p
●● then subtract the square of the Expected Value, μ²
Standard Deviation: σ
The Standard Deviation is the square root of the Variance:
σ = √Var(X)
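Continuing the same kind of sketch (added for illustration), the variance and standard deviation of a discrete random variable follow exactly the steps listed above:

import math

def variance(values, probs):
    mu = sum(x * p for x, p in zip(values, probs))                   # expected value
    return sum(x * x * p for x, p in zip(values, probs)) - mu ** 2   # Σx²p − μ²

faces, probs = [1, 2, 3, 4, 5, 6], [1/6] * 6
var = variance(faces, probs)
print(round(var, 4), round(math.sqrt(var), 4))   # 2.9167 1.7078 for a fair die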
2.2.4 Binomial Distribution - Introduction
We often conduct many trials which are independent and identical. Suppose we perform n independent Bernoulli trials, each of which results in a success with probability p and a failure with probability (1 – p). If the random variable X represents the number of successes that occur in the n trials (the order of successes not being important), then X is said to be a Binomial random variable with parameters (n, p).
Note that Bernoulli random variable is a Binomial random variable with parameter
(1, p) i.e. n = 1. The probability mass function of a binomial random variable with
parameters (n, p) is given by,
P(X = i) = nCi p^i (1 – p)^(n – i)   for i = 0, 1, 2, ..., n
The expected value and variance of a Binomial random variable are,
μ = E[X] = np
Var[X] = np(1 – p)
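A small sketch of the binomial p.m.f., mean and variance in Python, added here for illustration (it uses only the standard library; the function name is ours):

from math import comb

def binom_pmf(i, n, p):
    # P(X = i) = C(n, i) * p^i * (1 - p)^(n - i)
    return comb(n, i) * p**i * (1 - p)**(n - i)

n, p = 5, 1/3
print(sum(binom_pmf(i, n, p) for i in range(n + 1)))   # probabilities sum to 1
print(n * p, n * p * (1 - p))                          # mean np and variance np(1 - p)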
2.2.5 Binomial Distribution - Application
When to use the binomial distribution is an important decision. The binomial distribution can be used when the following conditions are satisfied:
●● Trials are finite (and not very large), performed repeatedly 'n' times.
●● Each trial (random experiment) should be a Bernoulli trial, one that results in either success or failure.
●● The probability of success in any trial is 'p' and is constant for each trial.
●● All the trials are independent.
These trials are usually experiments of selection 'with replacement'. In cases where the population is very large, drawing a small sample from it does not change the probability of success significantly. Hence, we could still consider each draw as a Bernoulli trial.
Following are some real-life examples of applications of the binomial distribution:
●● Number of defective bulbs in a lot of n items produced by a machine.
●● Number of female births out of n births in a hospital.
●● Number of correct answers in a multiple-choice test.
●● Number of seeds germinated in a row of n planted seeds.
●● Number of recaptured fish in a sample of n fish.
●● Number of missiles hitting the targets out of n fired.
Example: Suppose that the probability that a light in a classroom will be burnt
out is 1/3. The classroom has in all five lights and it is unusable if the number of lights
burning is less than two. What is the probability that the class room is unusable on a
random occasion?
Solution: This is a case of the binomial distribution with n = 5 and p = 1/3.
The classroom is unusable if the number of burnouts is 4 or 5, that is, i = 4 or 5. Noting that,
P(X = i) = nCi p^i (1 − p)^(n − i)
the probability that the classroom is unusable on a random occasion is,
P(X = 4) + P(X = 5) = 5C4 (1/3)^4 (2/3)^1 + 5C5 (1/3)^5 (2/3)^0 = 0.0412 + 0.00412 = 0.04532
Example: It is observed that 80% of T.V. viewers watch the Aap Ki Adalat programme. What is the probability that at least 80% of the viewers in a random sample of 5 watch this programme?
Solution: This is a case of the binomial distribution with n = 5 and p = 0.8. At least 80% of 5 viewers means i = 4 or 5.
The probability that at least 80% of the viewers in a random sample of 5 watch this programme is
P(X ≥ 4) = P(X = 4) + P(X = 5) = 5C4 (0.8)^4 (0.2)^1 + 5C5 (0.8)^5 (0.2)^0 = 0.4096 + 0.3277 = 0.7373
We must remember that a cumulative binomial probability refers to the probability
that the binomial random variable falls within a specified range (e.g., is greater than or
equal to a stated lower limit and less than or equal to a stated upper limit).
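Both worked binomial examples above can be checked with the same p.m.f. This is an added verification sketch (the function name is ours):

from math import comb

def pmf(i, n, p):
    return comb(n, i) * p**i * (1 - p)**(n - i)

# Classroom lights: n = 5, p = 1/3, unusable when 4 or 5 lights are burnt out
print(round(pmf(4, 5, 1/3) + pmf(5, 5, 1/3), 4))   # about 0.0453

# TV viewers: n = 5, p = 0.8, at least 4 out of 5 watch the programme
print(round(pmf(4, 5, 0.8) + pmf(5, 5, 0.8), 4))   # about 0.7373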
2.2.6 Poisson Distribution-Introduction
A random variable X, taking one of the values 0, 1, 2, …, is said to be a Poisson random variable with parameter λ if, for some λ > 0,
P(X = i) = e^(−λ) λ^i / i!   for i = 0, 1, 2, …
P(X = i) is the probability mass function (p.m.f.) of the Poisson random variable. Its expected value and variance are,
μ = E[X] = λ
Var(X) = λ
The Poisson random variable has a wide range of applications. It can also be used as an approximation for a binomial random variable with parameters (n, p) if n is large and p is small enough to make the product np of moderate size. In this case, we call np = λ the average rate. Some common examples where a Poisson random variable can be used to define the probability distribution are:
1. Number of accidents per day on an expressway.
2. Number of earthquakes occurring over a fixed time span.
3. Number of misprints on a page.
4. Number of arrivals of calls at a telephone exchange per minute.
5. Number of interrupts per second on a server.
2.2.7 Poisson Distribution-Application
Procedure for Using Cumulative Poisson Probabilities Table
The Poisson p.m.f. for a given λ and i can be easily calculated using scientific calculators. But when calculating cumulative probabilities, i.e., the 'c.d.f.', manual calculations become too tedious. In such cases, we can use the Cumulative Poisson Probabilities Table. The table is referred to as follows:
●● Look for the given value of λ, i.e., the average rate, in the first column of the table.
●● In the first row, look for the value of i, the number of successes.
●● Locate the cell in the column of the i value and the row of the λ value. The value contained in this cell is the cumulative Poisson probability.
Example: The average number of accidents on an expressway is five per week. Find the probability that exactly two accidents take place in a given week. Also find the probability that at most two accidents take place in the next week.
Solution:
Method I
Using the binomial distribution with parameters (n = 10, p = 0.1) we get,
P{X ≤ 1} = p(0) + p(1) = 10C0 (0.1)^0 (0.9)^10 + 10C1 (0.1)^1 (0.9)^9 = 0.7361
Or, using the Cumulative Binomial Probabilities Table, we can read for n = 10, p = 0.1 and i = 1 the cumulative probability as 0.7361.
Method II
Using the Poisson distribution (as an approximation to the Binomial distribution) with parameter λ = 10 × 0.1 = 1 we get,
P{X ≤ 1} = p(0) + p(1) = [e^−1 (1)^0]/0! + [e^−1 (1)^1]/1! = e^−1 + e^−1 = 0.7358
Or, using the Cumulative Poisson Probabilities Table, we can read for λ = 1 and i = 1 the cumulative probability as 0.7358.
Note that the Poisson distribution gives a reasonably good approximation.
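The comparison in Methods I and II can be reproduced with a few lines of code. This is an added sketch using only the Python standard library:

from math import comb, exp, factorial

n, p = 10, 0.1
lam = n * p    # Poisson rate λ = np = 1

binom = sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(2))   # exact P{X <= 1}
poisson = sum(exp(-lam) * lam**i / factorial(i) for i in range(2))    # Poisson approximation
print(round(binom, 4), round(poisson, 4))   # 0.7361 0.7358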
Example: Average time for updating a passbook by a bank clerk is 15 seconds.
Someone arrives just ahead of you. Find the probability that you will have to wait
for your turn,
1. more than 1 minute,
2. less than ½ minute.
Solution:
Now, λ = 60/15 = 4 passbooks per minute.
1. P{X > 1} = 1 – F(1) = e^−4 = 0.0183
2. P{X < 0.5} = F(0.5) = 1 − e^−2 = 1 − 0.1353 = 0.8647
2.2.8 Normal Distribution - Introduction including Empirical Rule
The normal random variable and its distribution are commonly used in many business and engineering problems. Many other distributions like the binomial, Poisson, beta, chi-square, Student's t, exponential, etc., can also be approximated by the normal distribution under specific conditions (usually when the sample size is large).
If random variable is affected by many independent causes, and the effect of each
cause is not significantly large as compared to other effects, then the random variable
will closely follow the normal distribution, e.g., weights of coffee filled in packs, lengths
of nails manufactured on a machine, hardness of ball bearing surface, diameters of
shafts produced on lathe, effectiveness of training programme on the employees’
productivity, etc., are examples of normally distributed random variables.
Further, many sampling statistics, e.g., sample means X bar, are normally
distributed.
Empirical Rule
The empirical rule, also referred to as the three-sigma rule, is a statistical rule which states that for a normal distribution, almost all observed data will fall within three standard deviations (denoted by σ) of the mean or average (denoted by µ).
The empirical rule states that for a normal distribution, nearly all of the data will fall
within three standard deviations of the mean. The empirical rule can be broken down
into three parts:
●● 68% of data falls within the first standard deviation from the mean.
●● 95% falls within two standard deviations.
●● 99.7% falls within three standard deviations.
The Empirical Rule is often used in statistics for forecasting, especially when obtaining the right data is difficult or impossible. The rule can give you a rough estimate of what your data collection might look like if you were able to survey the entire population.
A random variable X is a normal random variable with parameters μ and σ if the
probability density function (p.d.f.) of X is given by
f(x) = (1 / (σ √(2π))) e^(−(x − µ)² / (2σ²)),   where −∞ < x < ∞
Properties of Normal Distribution
This distribution is a bell-shaped curve that is symmetric about μ. It gives a theoretical basis to the observation that, in practice, many random phenomena obey, approximately, a normal probability distribution. The mean of a normal random variable is E(X) = μ and its variance is Var(X) = σ². If X is normally distributed with parameters μ and σ, then the random variable aX + b is also normally distributed with parameters (aμ + b) and (aσ).
1. It is perfectly symmetric about the mean μ.
2. For a normal distribution, mean = median = mode.
3. It is uni-modal (one mode), with skewness = 0 and excess kurtosis = 0.
4. The normal distribution is a limiting form of the binomial distribution when the number of trials n is large, and neither the probability p nor (1 − p) is very small.
5. The normal distribution is a limiting case of the Poisson distribution when the mean μ = λ is very large.
6. While working on probabilities of the normal distribution we usually use normal distribution (more often standard normal distribution) tables.
While reading these tables, the relevant properties are:
(a) The probability that a normally distributed random variable with mean μ and
variance σ² lies between two specified values a and b is P (a < X < b) = area
under the curve P(x) between the specified values X = a and X = b.
(b) Total area under the curve P (x) is equal to 1 in which 0.5 lies on either side of
the mean.
2.2.9 Standard Normal Distribution
Calculating the cumulative density of the normal distribution involves integration. Further, tabulation has the problem that we would need tables for every possible value of μ and σ² (which is not feasible). Hence, we transform the normal random variable to another random variable known as the Standard Normal Random Variable. For this, we use the transformation
z = (X − µ) / σ
z is a normally distributed random variable with parameters, μ = 0 and σ = 1. Any
normal random variable can be transformed to standard normal random variable z. We
can get cumulative distribution function as,
F(a) = ∫−∞..a f(x) dx = ∫−∞..a (1/√(2π)) e^(−z²/2) dz
This has been calculated for various values of ‘a’ and tabulated. Also, we know that,
F(–a) = 1 – F(a), and also P(a < Z < b) = F(b) – F(a)
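Instead of tables, the standard normal c.d.f. can also be evaluated directly; Python's standard library provides statistics.NormalDist for this. A small added sketch (the chosen z values are only examples):

from statistics import NormalDist

z = NormalDist(mu=0, sigma=1)          # standard normal random variable
print(round(z.cdf(1.96), 4))           # F(1.96) is about 0.975
print(round(z.cdf(1) - z.cdf(-1), 4))  # P(-1 < Z < 1) is about 0.6827 (empirical rule)
print(round(1 - z.cdf(2.575), 4))      # upper tail beyond 2.575 is about 0.005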
Example: Tea is filled in packs of 200 gm by a machine with a variance of 0.25 gm². Packs weighing less than 200 gm would be rejected by customers and are not legally
acceptable. Therefore, marketing and legal department requests production manager to
set the machine to fill slightly more quantity in each pack. However, finance department
objects to this since it would lead to financial loss due to overfilling the packs. The
general manager wants to know the 99% confidence interval, when the machine is set
at 200gms, so that he can take a decision. Find confidence interval. What is your advice
to the production manager?
Solution:
Let weight of the tea in a pack is a random variable X.
We know that the mean μ = 200 gm and the variance σ² = 0.25 gm², i.e., σ = 0.5 gm.
First, we find the value of z for 99% confidence. The Standard Normal Distribution curve is symmetric about the mean. Hence, corresponding to 99% confidence, half the area under the curve
= 0.99/2
= 0.495
The value of z corresponding to probability 0.495 is 2.575. Thus, the 99% confidence interval in terms of the variable z is ±2.575, which in terms of the variable x is 200 ± 1.2875, or (198.71 to 201.29).
Note that x = σz + μ = 0.5 × (±2.575) + 200 = 200 ± 1.2875
Hence, we can advise the production manager to set his machine to fill tea with a mean weight of 201.2875 gm, or say 201.29 gm. In that case we have 99% confidence of meeting the legal requirement and at the same time keep the cost of excess filling of the tea to a minimum.
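The 99% interval can also be obtained without tables. This is an added sketch; inv_cdf returns the z value corresponding to a cumulative probability:

from statistics import NormalDist

mu, sigma = 200, 0.5
z99 = NormalDist().inv_cdf(0.995)              # two-sided 99% leaves 0.5% in each tail
lower, upper = mu - z99 * sigma, mu + z99 * sigma
print(round(z99, 3), round(lower, 2), round(upper, 2))   # about 2.576, 198.71, 201.29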
Key Terms
●●
Probability: The probability of a given event is an expression of the likelihood or chance of occurrence of that event. A probability is a number which ranges from zero to one.
●●
Continuous Probability Distributions: Continuous random variables are those
that take on any value including fractions and decimals. Continuous random
variables give rise to continuous probability distributions. Continuous is the
opposite of discrete.
●●
Random Experiment: In theory of probability, a process or activity that results
in outcomes under study is called experiment, for example, sampling from a
production lot.
●●
Sample: A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe. The word sample literally means a small universe.
●●
Sampling: Sampling is defined as the selection of some part of an aggregate
or totality on the basis of which a judgment or inference about the aggregate or
totality is made. Sampling is the process of learning about the population on the
basis of a sample drawn from it.
●●
Stratified random sampling: Stratified random sampling requires the separation
of defined target population into different groups called strata and the selection of
sample from each stratum.
●●
Cluster sampling: Cluster sampling is a probability sampling method in which
the sampling units are divided into mutually exclusive and collectively exhaustive
subpopulation called clusters.
●●
Hypothesis testing: Hypothesis testing refers to the formal procedures used by
statisticians to accept or reject statistical hypotheses. It is an assumption about a
population parameter. This assumption may or may not be true.
Check your progress
1. In probability theories, events which can never occur together are classified as
a. Collectively exclusive events
b. Mutually exhaustive events
c. Mutually exclusive events
d. Collectively exhaustive events
2. The value which is used to measure the distance between the mean and a random variable x in terms of standard deviation is called
a. Z-value
b. Variance
c. Probability of x
d. Density function of x
3. ________ test is applied when samples are less than 30.
a. T
b. Z
c. Rank
d. None of these
4. Under the non-random sampling method, samples are selected on the basis of
a. Stages
b. Strategy
c. Originality
d. Convenience
5. The probability of a second event given that a first event has occurred is classified as
a. Series probability
b. Conditional probability
c. Joint probability
d. Dependent probability
Questions and Exercises
1. What is probability? What do you mean by probability distributions?
2. What is normal distribution? What are the merits of normal distribution?
3. What is Hypothesis Testing?
4. What do you mean by t-test and z-test?
5. Explain Poisson Distribution and its Application.
Check your progress
1. c) Mutually exclusive events
2. a) Z value
3. a) T test
4. d) Convenience
5. b) Conditional probability
Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Bibliography
1. Srivastava V. K. et al. – Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick – Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann – Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnick, Frank S., Dennis McLeavey, Richard Mojena – Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S. – Operations Research, Vikas Publishing Co.
7. Gould F. J. – Introduction to Management Science, Englewood Cliffs, NJ: Prentice Hall.
8. Naray J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy – Operations Research, Prentice Hall of India.
10. Tulasian – Quantitative Techniques, Pearson Education.
11. Vohra N. D. – Quantitative Techniques in Management, TMH.
12. Stevenson W. D. – Introduction to Management Science, TMH.
Module-3: Sampling, Sampling Distribution
and Estimation
Learning Objective:
●● To understand the basic concepts of sampling distribution and estimation techniques
●● To get familiarized with MS Excel for confidence interval construction
Learning Outcome:
At the end of the course, the learners will be able to –
●● Use sampling methods and estimation techniques in order to answer business queries
●● Understand the purpose and need of sampling.
3.1.1 Sampling - Introduction
Sampling is an important concept which is practiced in every activity. Sampling
involves selecting a relatively small number of elements from a large defined group
of elements and expecting that the information gathered from the small group will
allow judgments to be made about the large group. The basic idea of sampling is that
by selecting some of the elements in a population, the conclusion about the entire
population is drawn. Sampling is used when conducting census is impossible or
unreasonable.
Meaning of Sampling
Sampling is defined as the selection of some part of an aggregate or totality on
the basis of which a judgment or inference about the aggregate or totality is made.
Sampling is the process of learning about the population on the basis of a sample
drawn from it.
Purpose of Sampling
There are several reasons for sampling. They are explained below:
1.
Lower cost: The cost of conducting a study based on a sample is much lesser
than the cost of conducting the census study.
2.
Greater accuracy of results: It is generally argued that the quality of a study
is often better with sampling data than with a census. Research findings also
substantiate this opinion.
3.
Greater speed of data collection: Speed of execution of data collection is
higher with the sample. It also reduces the time between the recognition of a
need for information and the availability of that information.
4.
Availability of population element: Some situations require sampling. When
the breaking strength of materials is to be tested, it has to be destroyed. A
census method cannot be resorted to, as it would mean the complete destruction of all materials. Sampling is the only process possible if the population is infinite.
Features of Sampling Method
The sampling technique has the following good features of value and significance:
1. Economy: Sampling technique brings about cost control of a research project as it
requires much less physical resources as well as time than the census technique.
2.
Reliability: In sampling technique, if due diligence is exercised in the choice of
sample unit and if the research topic is homogenous then the sample survey
can have almost the same reliability as that of census survey.
3.
Detailed Study: An intensive and detailed study of sample units can be done
since their number is fairly small. Also multiple approaches can be applied to a
sample for an intensive analysis.
4.
Scientific Base: As mentioned earlier this technique is of scientific nature as
the underlined theory is based on principle of statistics.
5.
Greater Suitability in most Situations: It has a wide applicability in most
situations as the examination of few sample units normally suffices.
6.
Accuracy: The accuracy is determined by the extent to which bias is eliminated
from the sampling. When the sample elements are drawn properly some
sample elements underestimates the population values being studied and
others overestimate them.
Essentials of Sampling
In order to reach a clear conclusion, the sampling should possess the following
essentials:
1. It must be representative: The sample selected should possess the similar
characteristics of the original universe from which it has been drawn.
2.
Homogeneity: Selected samples from the universe should have similar nature
and should not have any difference when compared with the universe.
3.
Adequate Samples: In order to have a more reliable and representative result,
a good number of items are to be included in the sample.
4.
Optimization: All efforts should be made to get maximum results both in terms
of cost as well as efficiency. If the size of the sample is larger, there is better
efficiency and at the same time the cost is more. A proper size of sample is
maintained in order to have optimized results in terms of cost and efficiency.
3.1.2 Types of Sampling
The sampling design can be broadly grouped on two basis viz., representation
and element selection. Representation refers to the selection of members on a
probability or by other means. Element selection refers to the manner in which the
elements are selected individually and directly from the population. If each element is
drawn individually from the population at large, it is an unrestricted sample. Restricted
sampling is where additional controls are imposed, in other words it covers all other
forms of sampling.
The classification of sampling design on the basis of representation and element
selection is -
Probability Sampling
Probability sampling is where each sampling unit in the defined target population
has a known non-zero probability of being selected in the sample. The actual probability
of selection for each sampling unit may or may not be equal depending on the type
of probability sampling design used. Specific rules for selecting members from the
operational population are made to ensure unbiased selection of the sampling units and
proper sample representation of the defined target population. The results obtained by
using probability sampling designs can be generalized to the target population within a
specified margin of error.
Probability samples are characterised by the fact that, the sampling units are
selected by chance. In such a case, each member of the population has a known,
non- zero probability of being selected. However, it may not be true that all samples
would have the same probability of selection, but it is possible to say the probability
of selecting any particular sample of a given size. It is possible that one can calculate
the probability that any given population element would be included in the sample. This
requires a precise definition of the target population as well as the sampling frame.
Probability sampling techniques differ in terms of sampling efficiency which is a
concept that refers to trade off between sampling cost and precision. Precision refers to
the level of uncertainty about the characteristics being measured. Precision is inversely
related to sampling errors but directly related to cost. The greater the precision, the
greater the cost and there should be a trade-off between sampling cost and precision.
The researcher is required to design the most efficient sampling design in order to
increase the efficiency of the sampling.
The different types of probability sampling designs are discussed below:
Simple Random Sampling
The following are the implications of random sampling:
●●
It provides each element in the population an equal probability chance of
being chosen in the sample, with all choices being independent of one
another and
●●
It offers each possible sample combination an equal probability opportunity of
being selected.
In the unrestricted probability sampling design every element in the population
has a known, equal non-zero chance of being selected as a subject. For example, if
10 employees (n = 10) are to be selected from 30 employees (N = 30), the researcher
can write the name of each employee in a piece of paper and select them on a random
basis. Each employee will have an equal known probability of selection for a sample.
The same is expressed in terms of the following formula:
Probability of selection = Size of sample / Size of population
Each employee would have a 10/30 or .333 chance of being randomly selected
in a drawn sample. When the defined target population consists of a larger number
of sampling units, a more sophisticated method can be used to randomly draw the
necessary sample. A table of random numbers can be used for this purpose. The table
of random numbers contains a list of randomly generated numbers. The numbers
can be randomly generated through the computer programs also. Using the random
numbers the sample can be selected.
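In practice, the "table of random numbers" step is usually replaced by a pseudo-random number generator. The minimal sketch below is an added illustration of the employee example (the employee list and names are hypothetical):

import random

population = [f"Employee-{i}" for i in range(1, 31)]   # N = 30
sample = random.sample(population, k=10)               # n = 10, drawn without replacement
print(sample)
print(10 / 30)   # probability of selection for any one employee, about 0.333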
Advantages and Disadvantages
The simple random sampling technique can be easily understood and the survey
result can be generalized to the defined target population with a pre specified margin
of error. It also enables the researcher to gain unbiased estimates of the population’s
characteristics. The method guarantees that every sampling unit of the population
has a known and equal chance of being selected, irrespective of the actual size of the
sample resulting in a valid representation of the defined target population.
The major drawback of the simple random sampling is the difficulty of obtaining
complete, current and accurate listing of the target population elements. Simple
random sampling process requires all sampling units to be identified which would be
cumbersome and expensive in case of a large population. Hence, this method is most
suitable for a small population.
Systematic Random Sampling
The systematic random sampling design is similar to simple random sampling but
requires that the defined target population be ordered in some way. It involves
drawing every nth element in the population starting with a randomly chosen element
between 1 and n. In other words, individual sampling units are selected according to their
position using a skip interval. The skip interval is determined by dividing the sample size
into population size. For example, if the researcher wants a sample of 100 to be drawn
from a defined target population of 1000, the skip interval would be 10(1000/100). Once
the skip interval is calculated, the researcher would randomly select a starting point and
take every 10th until the entire target population is proceeded through. The steps to be
followed in a systematic sampling method are enumerated below:
Total number of elements in the population should be identified
●●
The sampling ratio is to be calculated ( n = total population size divided by
size of the desired sample)
●●
A sample can be drawn by choosing every nth entry
Two important considerations in using systematic random sampling are:
1. It is important that the natural order of the defined target population list be unrelated to the characteristic being studied.
2. The skip interval should not correspond to a systematic change in the target
population.
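A minimal sketch of the skip-interval idea is given below (added for illustration; the sampling frame here is just a numbered list):

import random

population = list(range(1, 1001))      # defined target population, N = 1000
n = 100                                # desired sample size
k = len(population) // n               # skip interval = 1000/100 = 10
start = random.randint(0, k - 1)       # random starting point within the first interval
sample = population[start::k]          # every kth element thereafter
print(len(sample), sample[:5])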
Advantages and Disadvantages
(c
The major advantage is its simplicity and flexibility. In case of systematic sampling
there is no need to number the entries in a large personnel file before drawing a
Amity Directorate of Distance & Online Education
52
Statistics Management
O
nl
in
e
sample. The availability of lists and shorter time required to draw a sample compared
to random sampling makes systematic sampling an attractive, economical method for
researchers.
Notes
The greatest weakness of systematic random sampling is the potential for the
hidden patterns in the data that are not found by the researcher. This could result in
a sample not truly representative of the target population. Another difficulty is that the
researcher must know exactly how many sampling units make up the defined target
population. In situations where the target population is extremely large or unknown,
identifying the true number of units is difficult and the estimates may not be accurate.
Stratified Random Sampling
Stratified random sampling requires the separation of defined target population into
different groups called strata and the selection of sample from each stratum. Stratified
random sampling is very useful when the divisions of target population are skewed
or when extremes are present in the probability distribution of the target population
elements of interest. The goal in stratification is to minimize the variability within each
stratum and maximize the difference between strata. The ideal stratification would be
based on the primary variable under study. Researchers often have several important
variables about which they want to draw conclusions.
ni
v
A reasonable approach is to identify some basis for stratification that correlates
well with other major variables. It might be a single variable like age, income etc. or
a compound variable like on the basis of income and gender. Stratification leads to
segmenting the population into smaller, more homogeneous sets of elements. In order
to ensure that the sample maintains the required precision in terms of representing
the total population, representative samples must be drawn from each of the smaller
population groups.
●●
U
There are three reasons as to why a researcher chooses a stratified random
sample:
●●
To provide adequate data for analyzing various sub populations
●●
To enable different research methods and procedures to be used in different
strata.
ity
To increase the sample’s statistical efficiency
(c
)A
m
Cluster Sampling
Cluster sampling is a probability sampling method in which the sampling units
are divided into mutually exclusive and collectively exhaustive subpopulation called
clusters. Each cluster is assumed to be the representative of the heterogeneity of
the target population. Groups of elements that would have heterogeneity among the
members within each group are chosen for study in cluster sampling. Several groups
with intragroup heterogeneity and intergroup homogeneity are found. A random
sampling of the clusters or groups is done and information is gathered from each of
the members in the randomly chosen clusters. Cluster sampling offers more of
heterogeneity within groups and more homogeneity among the groups.
Amity Directorate of Distance & Online Education
53
Statistics Management
Notes
O
nl
in
e
Single Stage and Multistage Cluster Sampling
In single stage cluster sampling, the population is divided into convenient clusters
and required number of clusters are randomly chosen as sample subjects. Each
element in each of the randomly chosen cluster is investigated in the study. Cluster
sampling can also be done in several stages which is known as multistage cluster
sampling. For example: To study the banking behaviour of customers in a national
survey, cluster sampling can be used to select the urban, semi-urban and rural
geographical locations of the study. At the next stage, particular areas in each of the
location would be chosen. At the third stage, the banks within each area would be
chosen.
er
s
Advantages and Disadvantages of Cluster Sampling
ity
Thus multi-stage sampling involves a probability sampling of the primary sampling
units; from each of the primary units, a probability sampling of the secondary sampling
units is drawn; a third level of probability sampling is done from each of these
secondary units, and so on until the final stage of breakdown for the sample units are
arrived at, where every member of the unit will be a sample.
ni
v
The cluster sampling method is widely used due to its overall cost-effectiveness
and feasibility of implementation. In many situations the only reliable sampling unit
frame available to researchers and representative of the defined target population,
is one that describes and lists clusters. The list of geographical regions, telephone
exchanges, or blocks of residential dwelling can normally be easily compiled than
the list of all the individual sampling units making up the target population. Clustering
method is a cost efficient way of sampling and collecting raw data from a defined target
population.
ity
U
One major drawback of clustering method is the tendency of the cluster to be
homogeneous. The greater the homogeneity of the cluster, the less precise will be the
sample estimate in representing the target population parameters. The conditions of
intra- cluster heterogeneity and inter-cluster homogeneity are often not met. For these
reasons this method is not practiced often.
Area Sampling
)A
m
Area sampling is a form of cluster sampling in which the clusters are formed by
geographic designations. For example, state, district, city, town etc., Area sampling is
a form of cluster sampling in which any geographic unit with identifiable boundaries
can be used. Area sampling is less expensive than most other probability designs and
is not dependent on population frame. A city map showing blocks of the city would be
adequate information to allow a researcher to take a sample of the blocks and obtain
data from the residents therein.
Sequential/Multiphase Sampling
(c
This is also called Double Sampling. Double sampling is opted when further
information is needed from a subset of groups from which some information has already
been collected for the same study. It is called as double sampling because initially a
sample is used in the study to collect some preliminary information of interest and later
a sub-sample of this primary sample is used to examine the matter in more detail The
Amity Directorate of Distance & Online Education
54
Statistics Management
Sampling with Probability Proportional to Size
O
nl
in
e
process includes collecting data from a sample using a previously defined technique.
Based on this information, a sub sample is selected for further study. It is more
convenient and economical to collect some information by sampling and then use this
information as the basis for selecting a sub sample for further study.
Notes
Non-probability Sampling
er
s
ity
When the case of cluster sampling units does not have exactly or approximately
the same number of elements, it is better for the researcher to adopt a random
selection process, where the probability of inclusion of each cluster in the sample
tends to be proportional to the size of the cluster. For this, the number of elements
in each cluster has to be listed, irrespective of the method used for ordering it. Then
the researcher should systematically pick the required number of elements from the
cumulative totals. The actual numbers thus chosen would not however reflect the
individual elements, but would indicate as to which cluster and how many from them are
to be chosen by using simple random sampling or systematic sampling. The outcome
of such sampling is equivalent to that of simple random sample. This method is also
less cumbersome and is also relatively less expensive.
ni
v
In non probability sampling method, the elements in the population do not have any
probabilities attached to being chosen as sample subjects. This means that the findings
of the study cannot be generalized to the population. However, at times the researcher
may be less concerned about generalizability and the purpose may be just to obtain
some preliminary information in a quick and inexpensive way. Sometimes when the
population size is unknown, then non probability sampling would be the only way to
obtain data. Some non-probability sampling techniques may be more dependable than
others and could often lead to important information with regard to the population.
U
Convenience Sampling
(c
)A
m
ity
Non-probability samples that are unrestricted are called convenient sampling.
Convenience sampling refers to the collection of information from members of
population who are conveniently available to provide it. Researchers or field workers
have the freedom to choose as samples whomever they find, thus it is named as
convenience. It is mostly used during the exploratory phase of a research project
and it is the best way of getting some basic information quickly and efficiently. The
assumption is that the target population is homogeneous and the individuals selected
as samples are similar to the overall defined target population with regard to the
characteristics being studied. However, in reality there is no way to accurately assess
the representativeness of the sample. Due to the self selection and voluntary nature of
participation in data collection process the researcher should give due consideration to
the non-response error.
Advantages and Disadvantages
Convenient sampling allows a large number of respondents to be interviewed
in a relatively short time. This is one of the main reasons for using convenient
sampling in the early stages of research. However the major drawback is that the
Amity Directorate of Distance & Online Education
55
Statistics Management
Notes
O
nl
in
e
use of convenience samples in the development phases of constructs and scale
measurements can have a serious negative impact on the overall reliability and validity
of those measures and instruments used to collect raw data. Another major drawback is
that the raw data and results are not generalizable to the defined target population with
any measure of precision. It is not possible to measure the representativeness of the
sample, because sampling error estimates cannot be accurately determined.
Judgment Sampling
ity
Judgment sampling is a non-probability sampling method in which participants
are selected according to an experienced individual’s belief that they will meet the
requirements of the study. The researcher selects sample members who conform to
some criterion. It is appropriate in the early stages of an exploratory study and involves
the choice of subjects who are most advantageously placed or in the best position to
provide the information required. This is used when a limited number or category of
people have the information that are being sought. The underlying assumption is that
the researcher’s belief that the opinions of a group of perceived experts on the topic of
interest are representative of the entire target population.
er
s
Advantages and Disadvantages
ni
v
If the judgment of the researcher or expert is correct then the sample generated
from the judgment sampling will be much better than one generated by convenience
sampling. However, as in the case of all non-probability sampling methods, the
representativeness of the sample cannot be measured. The raw data and information
collected through judgment sampling provides only a preliminary insight
Quota Sampling
ity
U
The quota sampling method involves the selection of prospective participants
according to pre specified quotas regarding either the demographic characteristics
(gender, age, education, income, occupation etc.,) specific attitudes (satisfied, neutral,
dissatisfied) or specific behaviours (regular, occasional, rare user of product). The
purpose of quota sampling is to provide an assurance that pre specified subgroups
of the defined target population are represented on pertinent sampling factors that
are determined by the researcher. It ensures that certain groups are adequately
represented in the study through the assignment of the quota.
Advantages and Disadvantages
)A
m
The greatest advantage of quota sampling is that the sample generated contains
specific subgroups in the proportion desired by researchers. In those research projects
that require interviews the use of quotas ensures that the appropriate subgroups are
identified and included in the survey. The quota sampling method may eliminate or
reduce selection bias.
(c
An inherent limitation of quota sampling is that the success of the study will be
dependent on subjective decisions made by the researchers. As a non-probability
method, it is incapable of measuring true representativeness of the sample or accuracy
of the estimate obtained. Therefore, attempts to generalize the data results beyond
those respondents who were sampled and interviewed become very questionable and
may misrepresent the given target population.
Amity Directorate of Distance & Online Education
56
Statistics Management
O
nl
in
e
Snowball Sampling
Notes
Advantages and Disadvantages
ity
Snowball sampling is a non-probability sampling method in which a set of
respondents are chosen who help the researcher to identify additional respondents to
be included in the study. This method of sampling is also called as referral sampling
because one respondent refers other potential respondents. This method involves
probability and non-probability methods. The initial respondents are chosen by a
random method and the subsequent respondents are chosen by non-probability
methods. Snowball sampling is typically used in research situations where the defined
target population is very small and unique and compiling a complete list of sampling
units is a nearly impossible task. This technique is widely used in academic research.
While the traditional probability and other non-probability sampling methods would
normally require an extreme search effort to qualify a sufficient number of prospective
respondents, the snowball method would yield better result at a much lower cost. The
researcher has to identify and interview one qualified respondent and then solicit his
help to identify other respondents with similar characteristics.
ni
v
er
s
Snowball sampling enables to identify and select prospective respondents who
are small in number, hard to reach and uniquely defined target population. It is most
useful in qualitative research practices. Reduced sample size and costs are the primary
advantage of this sampling method. The major drawback is that the chance of bias is
higher. If there is a significant difference between people who are identified through
snowball sampling and others who are not then, it may give rise to problems. The
results cannot be generalized to members of larger defined target population.
3.1.3 Types of Sampling & Non Sampling Errors and Precautions
U
A sampling error represents a statistical error occurring when an analyst does not
select a sample that represents the entire population of data and the results found
in the sample do not represent the results that would be obtained from the entire
population.
Regardless of the fact that the sample is not representative of the population or
skewed in any way, a sampling error is a difference in sampled value versus true
population value.
●● Even randomized samples carry some sampling error, because a sample provides only an estimate of the population from which it is drawn.
●● Sampling error can be reduced by increasing the sample size and by ensuring that the sample adequately represents the entire population. For example, ABC Company provides a subscription-based service that allows consumers to pay a monthly fee to stream videos and other programming over the web.
●● A non-sampling error is a statistical term referring to an error resulting from data collection, which causes the data to differ from the true values. A non-sampling error is different from a sampling error.
●● A non-sampling error refers to either random or systematic errors, and these errors can be challenging to spot in a survey, sample, or census.
●● Systematic non-sampling errors are worse than random non-sampling errors because systematic errors may result in the study, survey or census having to be scrapped.
●● The higher the number of errors, the less reliable the information.
●● When non-sampling errors occur, the rate of bias in a study or survey goes up.
3.1.4 Central Limit Theorem
In the study of probability theory, the central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution, also known as a "bell curve", as the sample size becomes larger, provided that all samples are identical in size, regardless of the shape of the population distribution.
It is a statistical theory stating that, given a sufficiently large sample size from a population with a finite variance, the mean of all samples drawn from the same population will be approximately equal to the population mean. Furthermore, the sample means will follow an approximately normal distribution, with variance approximately equal to the variance of the population divided by the sample size.
The central limit theorem (CLT) states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
●● Sample sizes equal to or greater than 30 are considered sufficient for the theorem to hold.
●● A key aspect of the theorem is that the average of the sample means and standard deviations will approximately equal the population mean and standard deviation.
●● A sufficiently large sample size can therefore predict the characteristics of a population quite accurately.
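As a rough illustration of how the theorem behaves, the following minimal sketch (Python with NumPy; the exponential population, the sample size of 30 and the number of repeated samples are assumptions made purely for demonstration and are not part of the original text) draws repeated samples from a deliberately skewed population and compares the spread of the sample means with the σ/√n predicted by the theorem.

import numpy as np

rng = np.random.default_rng(42)

# A deliberately non-normal (right-skewed) population: exponential with mean 5
population = rng.exponential(scale=5.0, size=100_000)

sample_size = 30      # n >= 30, following the rule of thumb above
n_samples = 2_000     # number of repeated samples

# Compute the mean of each of the 2,000 samples of size 30
sample_means = np.array([
    rng.choice(population, size=sample_size, replace=False).mean()
    for _ in range(n_samples)
])

print("Population mean:             ", round(population.mean(), 3))
print("Mean of the sample means:    ", round(sample_means.mean(), 3))
print("SD of the sample means:      ", round(sample_means.std(ddof=1), 3))
print("sigma / sqrt(n) from theory: ", round(population.std() / np.sqrt(sample_size), 3))

Even though the population is skewed, the sample means cluster around the population mean and their standard deviation is close to σ/√n, which is exactly what the theorem predicts.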
3.1.5 Sampling Distribution of the Mean
A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe; the word sample literally means small universe. For example, suppose the microchips produced in a factory are to be tested. The aggregate of all such items is the universe, but it is not possible to test every item. So in such a case, a part of the universe is taken and then tested, and this quantity extracted for testing is known as the sample.
If we take a certain number of samples and for each sample compute various statistical measures such as the mean, standard deviation, etc., we find that each sample may give its own value for the statistic under consideration. All such values of a particular statistic, say the mean, together with their relative frequencies constitute the sampling distribution of that statistic (the mean, the standard deviation, and so on).
3.1.6 Sampling Distribution of Proportion
Sampling distribution of the sample proportion refers to the concept that if repeated random samples of a given size n are taken from a population of values for a categorical variable, where the proportion in the category of interest is p, then the mean of all sample proportions (p-hat) is the population proportion (p).
As regards the spread of all sample proportions, the theory dictates the behaviour much more precisely than simply saying that there is less spread for larger samples. The standard deviation of all sample proportions is inversely related to the square root of the sample size n, as shown below.

The standard deviation of all sample proportions (p̂) is exactly √[ p(1 − p) / n ].
Given that the sample size n appears in the denominator inside the square root, the standard deviation decreases as the sample size increases. The distribution of p̂ will be reasonably close to normal as long as the sample size n is sufficiently large; the usual convention is that np and n(1 − p) should each be at least 10.

p̂ is approximately normally distributed with a mean of μp̂ = p and a standard deviation of σp̂ = √[ p(1 − p) / n ], as long as np > 10 and n(1 − p) > 10.
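A short sketch of these calculations (Python; the proportion p = 0.30 and the sample size n = 200 are illustrative assumptions, not figures from the text):

import math

p = 0.30    # assumed population proportion (illustrative)
n = 200     # assumed sample size (illustrative)

# Standard deviation of the sampling distribution of p-hat
sd_p_hat = math.sqrt(p * (1 - p) / n)

# Normality check: both n*p and n*(1 - p) should be at least 10
np_check = n * p
nq_check = n * (1 - p)

print("SD of p-hat:", round(sd_p_hat, 4))
print("n*p =", np_check, "and n*(1-p) =", nq_check)
print("Normal approximation reasonable:", np_check >= 10 and nq_check >= 10)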
3.1.7 Estimation – Introduction
Let x be a random variable with probability density function (or probability mass function) f(x; θ1, θ2, ..., θk), where θ1, θ2, ..., θk are the k parameters of the population.

Given a random sample x1, x2, ..., xn from this population, we may be interested in estimating one or more of the k parameters θ1, θ2, ..., θk. To be specific, let x be a normal variate so that its probability density function can be written as N(x; μ, σ). We may be interested in estimating μ or σ or both on the basis of a random sample obtained from this population.
It should be noted here that there can be several estimators of a parameter, e.g., we can have any of the sample mean, median, mode, geometric mean, harmonic mean, etc., as an estimator of the population mean μ. Similarly, either

s = √[ Σ(xi − x̄)² / n ]   or   s = √[ Σ(xi − x̄)² / (n − 1) ]

can be used as an estimator of the population standard deviation σ. This method of estimation, where a single statistic such as the mean, median, standard deviation, etc. is used as an estimator of a population parameter, is known as Point Estimation.
3.1.8 Types of Estimation
Statisticians use sample statistics to estimate population parameters. For example,
sample means are used to estimate population means; sample proportions, to estimate
population proportions.
An estimate of a population parameter may be expressed in two ways:
●● Point estimate. A point estimate of a population parameter is a single value of a statistic. For example, the sample mean x̄ is a point estimate of the population mean μ. Similarly, the sample proportion p̂ is a point estimate of the population proportion P.
●● Interval estimate. An interval estimate is defined by two numbers, between which a population parameter is said to lie. For example, a < x̄ < b is an interval estimate of the population mean μ. It indicates that the population mean is greater than a but less than b.

A population parameter is denoted by θ, which is an unknown constant. The available information is in the form of a random sample x1, x2, ..., xn of size n drawn from the population. We formulate a function of the sample observations x1, x2, ..., xn; the estimator of θ is denoted by θ̂. Different random samples provide different values of the statistic θ̂, so θ̂ is a random variable with its own sampling probability distribution.

The range of values used to estimate a population parameter is known as an interval estimate, or estimate by a confidence interval, and is defined by two numbers between which the population parameter is expected to lie. The purpose of an interval estimate is to provide information about how close the point estimate is to the true parameter.
3.1.9 Using z Statistic for Estimating Population Mean
The estimation of a population mean given a random sample is a very common task. If the population standard deviation (σ) is known, the construction of a confidence interval for the population mean (μ) is based on the normally distributed sampling distribution of the sample means.

The 100(1 − α)% confidence interval for μ is given by

CI: x̄ ± z*α/2 × σx̄,  where σx̄ = σ / √n
The value of z*α/2 corresponds to the critical value and is obtained from the standard normal table or computed with the qnorm() function in R. The critical value is a quantity related to the desired level of confidence. Typical values of z*α/2 are 1.64, 1.96, and 2.58, corresponding to confidence levels of 90%, 95% and 99%. This critical value is multiplied by the standard error σx̄ to widen or narrow the margin of error.
The standard error (σx̄) is given by the ratio of the standard deviation of the population (σ) to the square root of the sample size n. It describes the degree to which the computed sample statistic may be expected to differ from one sample to another. The product of the critical value and the standard error is called the margin of error. It is the quantity that is subtracted from and added to the value of x̄ to obtain the confidence interval for μ.
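As a minimal sketch of the interval described above (Python with SciPy; the sample mean, σ and n are assumed values chosen only for illustration), the critical value can be taken from the standard normal distribution in the same way the qnorm() function is used in R:

import math
from scipy.stats import norm

x_bar = 52.0    # sample mean (assumed)
sigma = 4.5     # known population standard deviation (assumed)
n = 36          # sample size (assumed)
conf = 0.95     # confidence level

z_crit = norm.ppf(1 - (1 - conf) / 2)   # about 1.96 for 95% confidence
std_err = sigma / math.sqrt(n)          # sigma_x-bar, the standard error
margin = z_crit * std_err               # margin of error

print(f"95% CI for mu: ({x_bar - margin:.3f}, {x_bar + margin:.3f})")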
3.1.10 Confidence Interval for Estimating Population Mean When
Population SD is Unknown
A confidence interval gives an estimated range of values which is likely to include
an unknown population parameter, the estimated range being calculated from a given
set of sample data. The common notation for the parameter in question is θ. Often, this parameter is the population mean μ, which is estimated through the sample mean x̄.
The level C of a confidence interval gives the probability that the interval produced
by the method employed includes the true value of the parameter θ.
In many situations the value of σ is unknown, so it is estimated with the sample standard deviation s; and/or the sample size is small (less than 30) and it is unclear whether the data came from a normal distribution. (In the latter case, the Central Limit Theorem cannot be used.) In either situation, the z*-value from the standard normal (Z) distribution can no longer be used as the critical value. A larger critical value is needed, because estimating σ from the sample introduces extra uncertainty.
The formula for a confidence interval for one population mean in this case is

X̄ ± t*n−1 × s / √n

where t*n−1 is the critical t-value from the t-distribution with n − 1 degrees of freedom (where n is the sample size).
Estimating population mean using t Statistic
The t statistic is also used in the statistical examination of two population means: a two-sample t-test examines whether two samples are different and is commonly used when the variances of two normal distributions are unknown and when an experiment uses a small sample size.

Formula: t = (x̄ − μ) / (s / √n)

where x̄ is the sample mean, μ is the specified value to be tested, s is the sample standard deviation and n is the size of the sample. The significance of the computed t-value is then looked up in the t-distribution table.
When the standard deviation of the sample is substituted for the standard deviation
of the population, the statistic does not have a normal distribution; it has what is called
the t-distribution. Because there is a different t-distribution for each sample size, it is
not practical to list a separate area of the curve table for each one. Instead, critical
t-values for common alpha levels (0.10, 0.05, 0.01, and so forth) are usually given in
a single table for a range of sample sizes. For very large samples, the t-distribution
approximates the standard normal (z) distribution. In practice, it is best to use
t-distributions any time the population standard deviation is not known.
Values in the t-table are not actually listed by sample size but by degrees
of freedom (df). The number of degrees of freedom for a problem involving the
t-distribution for sample size n is simply n – 1 for a one-sample mean problem.
Uses of T Test
Among the most frequently used t-tests are:
●● A one-sample location test of whether the mean of a normally distributed population has a value specified in a null hypothesis.
●● A two-sample location test of the null hypothesis that the means of two normally distributed populations are equal.
All such tests are usually called Student’s t-tests, though strictly speaking that
name should only be used if the variances of the two populations are also assumed
to be equal; the form of the test used when this assumption is dropped is sometimes
called Welch’s t-test. These tests are often referred to as “unpaired” or “independent
samples” t-tests, as they are typically applied when the statistical units underlying the
two samples being compared are non-overlapping.
A test of the null hypothesis that the difference between two responses measured on the same statistical unit has a mean value of zero. For example, suppose we measure the size of a cancer patient's tumour before and after a treatment. If the treatment is effective, we expect the tumour size for many of the patients to be smaller following the treatment. This is often referred to as the "paired" or "repeated measures" t-test. A t-test is also used to test whether the slope of a regression line differs significantly from 0.
3.1.12 Confidence Interval Estimation for Population Proportion
The confidence interval (CI) for a population proportion can be used to show the
statistical probability that a characteristic is likely to occur within the population.
For example, if we wish to estimate the proportion of people with diabetes in a population, we consider a diagnosis of diabetes as a "success" (i.e., an individual who has the outcome of interest), and we consider lack of a diagnosis of diabetes as a "failure." In this example, x represents the number of people with a diagnosis of diabetes in the sample. The sample proportion is p̂ (called "p-hat"), and it is computed by taking the ratio of the number of successes in the sample to the sample size, that is

p̂ = x / n
Where x is the number of successes in the sample and n is the size of the sample
The formula for the confidence interval for a population proportion follows the same format as that for an estimate of a population mean. From the sampling distribution of the proportion, the standard deviation was found to be

σp′ = √[ p′(1 − p′) / n ]

The confidence interval for a population proportion, therefore, becomes

p = p′ ± [ Z(α/2) √( p′(1 − p′) / n ) ]

Z(α/2) is set according to our desired degree of confidence and √( p′(1 − p′) / n ) is the standard deviation of the sampling distribution.
The sample proportions p′ and q′ are estimates of the unknown population
proportions p and q. The estimated proportions p′ and q′ are used because p and q are
not known.
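A brief sketch of this interval (Python with SciPy; the counts x = 64 and n = 400 are assumed for illustration and merely echo the diabetes example above):

import math
from scipy.stats import norm

x = 64      # number of "successes" in the sample (assumed)
n = 400     # sample size (assumed)
conf = 0.95

p_prime = x / n
z = norm.ppf(1 - (1 - conf) / 2)                  # Z(alpha/2)
se = math.sqrt(p_prime * (1 - p_prime) / n)       # SD of the sampling distribution

print(f"p' = {p_prime:.3f}")
print(f"95% CI: ({p_prime - z * se:.3f}, {p_prime + z * se:.3f})")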
Key Terms
●● Sample: A sample is that part of the universe which we select for the purpose of investigation. A sample exhibits the characteristics of the universe. The word sample literally means small universe.
●● Sampling: Sampling is defined as the selection of some part of an aggregate or totality on the basis of which a judgement or inference about the aggregate or totality is made. Sampling is the process of learning about the population on the basis of a sample drawn from it.
●● Stratified random sampling: Stratified random sampling requires the separation of the defined target population into different groups called strata and the selection of a sample from each stratum.
●● Cluster sampling: Cluster sampling is a probability sampling method in which the sampling units are divided into mutually exclusive and collectively exhaustive subpopulations called clusters.
●● Confidence interval: A confidence interval (CI) for a population proportion can be used to show the statistical probability that a characteristic is likely to occur within the population.
●● Point estimate: A point estimate of a population parameter is a single value of a statistic.
●● Interval estimate: An interval estimate is defined by two numbers, between which a population parameter is said to lie.
Check your progress
1. _____ states that the distribution of sample means approximates a normal distribution as the sample size gets larger.
   a) Probability
   b) Central Limit Theorem
   c) Z test
   d) Sampling Theorem
2. ____ error is a statistical term referring to an error resulting from data collection, which causes the data to differ from the true values.
   a) Sampling
   b) Non-sampling
   c) Probability
   d) Central
3. Sampling method in which a set of respondents is chosen who help the researcher to identify additional respondents to be included in the study is?
   a) Quota Sampling
   b) Judgment Sampling
   c) Snowball Sampling
   d) Convenience Sampling
4. Value used to measure distance between the mean and random variable x in terms of standard deviation is
   a) Z-value
   b) Variance
   c) Probability of x
   d) Density function of x
5. ____ test is applied when samples are less than 30.
   a) T
   b) Z
   c) Rank
   d) None of these
Questions and Exercises
1. What is sampling? Explain the features of sampling.
2. Differentiate between sampling and non-sampling errors.
3. Explain any five types of sampling techniques.
4. What do you mean by t-test and z-test?
5. Explain confidence interval estimation for population proportion.

Check your progress:
1. b) Central Limit Theorem
2. b) Non-sampling
3. c) Snowball Sampling
4. a) Z-value
5. a) T
Further Readings
4. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
5. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
6. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Bibliography
13. Srivastava V. K. et al. – Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
14. Richard I. Levin and Charles A. Kirkpatrick – Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
15. Prem S. Mann – Introductory Statistics, 7th Edition, Wiley India, 2016.
16. Budnik, Frank S., Dennis McLeavey, Richard Mojena – Principles of Operations Research, AITBS, New Delhi.
17. Sharma J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
18. Kalavathy S. – Operations Research, Vikas Publishing Co.
19. Gould F. J. – Introduction to Management Science, Englewood Cliffs, N.J., Prentice Hall.
20. Naray J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
21. Taha Hamdy – Operations Research, Prentice Hall of India.
22. Tulasian – Quantitative Techniques, Pearson Education.
23. Vohra N. D. – Quantitative Techniques in Management, TMH.
24. Stevenson W. D. – Introduction to Management Science, TMH.
Module-4: Concepts of Hypothesis Testing
Learning Objective:
●● To get introduced to the concept of hypothesis testing and to learn parametric and non-parametric tests
Learning Outcome:
At the end of the course, the learners will be able to –
●● Perform Test of Hypothesis as well as calculate confidence interval for a population parameter for single sample and two sample cases.
4.1.1 Hypothesis Testing - Introduction
Hypothesis test is a method of making decisions using data from a scientific study.
In statistics, a result is called statistically significant if it has been predicted as unlikely
to have occurred by chance alone, according to a pre-determined threshold probability,
the significance level. The phrase “test of significance” was coined by statistician
Ronald Fisher. These tests are used in determining what outcomes of a study would
lead to a rejection of the null hypothesis for a pre-specified level of significance;
this can help to decide whether results contain enough information to cast doubt on
conventional wisdom, given that conventional wisdom has been used to establish
the null hypothesis. The critical region of a hypothesis test is the set of all outcomes
which cause the null hypothesis to be rejected in favor of the alternative hypothesis.
Statistical hypothesis testing is sometimes called confirmatory data analysis, in contrast to exploratory data analysis, which may not have pre-specified hypotheses. Statistical hypothesis testing is a key technique of frequentist inference.
Characteristics of Hypothesis

The important characteristics of a hypothesis are as follows:
Hypothesis must be conceptually clear
The concepts used in the hypothesis should be clearly defined, operationally
if possible. Such definitions should be commonly accepted and easily communicable
among the research scholars.
Hypothesis should have empirical referents
The variables contained in the hypothesis should be empirical realities. In case
these are not empirical realities then it will not be possible to make the observations.
Being handicapped by the data collection, it may not be possible to test the hypothesis.
Watch for words like ought, should, bad.
Hypothesis must be specific
The hypothesis should not only be specific to a place and situation but should also be narrowed down with respect to its operation. Let there be no global
use of concepts whereby the researcher uses such a broad concept that it becomes all-inclusive and is unable to tell anything. For example, somebody may try to propose a relationship between urbanization and family size. Yes, urbanization does influence the decline in family size, but urbanization is such a comprehensive variable that it hides the operation of many other factors that emerge as part of the urbanization process. These factors could be the rise in education levels, women's levels of education, women's empowerment, the emergence of dual-earner families, decline in patriarchy, accessibility to health services, the role of mass media, and more. Therefore the global use of the word `urbanization' may not tell much. Hence it is suggested that the hypothesis should be specific.
Hypothesis should be related to available techniques of research
Hypothesis may have empirical reality; still we are looking for tools and techniques
that could be used for the collection of data. If the techniques are not there then the
researcher is handicapped. Therefore, either the techniques are already available or the
researcher is in a position to develop suitable techniques for the study.
Hypothesis should be related to a body of theory
A hypothesis has to be supported by theoretical argumentation. For this purpose the researcher may develop a theoretical framework which could help in the generation of relevant hypotheses. For the development of a framework the researcher depends on the existing body of knowledge. In such an effort a connection between the study in hand and the existing body of knowledge can be established. That is how the study can benefit from existing knowledge and, later on, through testing the hypothesis, contribute to the reservoir of knowledge.
Hypothesis testing procedure
Hypothesis testing refers to the formal procedures used by statisticians to accept
or reject statistical hypotheses. It is an assumption about a population parameter. This
assumption may or may not be true.
The best way to determine whether a statistical hypothesis is true would be to
examine the entire population. Since that is often impractical, researchers typically
examine a random sample from the population. If sample data are not consistent with
the statistical hypothesis, the hypothesis is rejected.
In doing so, one has to take the help of certain assumptions or hypothetical values about the characteristics of the population, if some such information is available. Such a hypothesis about the population is termed a statistical hypothesis, and the hypothesis is tested on the basis of sample values. The procedure enables one to decide on a certain hypothesis and test its significance. "A claim or hypothesis about the population parameters is known as the Null Hypothesis and is written as H0."
This hypothesis is then tested with available evidence and a decision is made
whether to accept this hypothesis or reject it. If this hypothesis is rejected, then we
accept the alternate hypothesis. This hypothesis is written as H1. For testing hypothesis
or tests of significance we use both parametric tests and non-parametric or distribution-free tests. Parametric tests assume certain properties of the population from which we draw samples. Such assumptions may be about population parameters, sample size,
etc. In case of non-parametric tests, we do not make such assumptions. Here we
assume only nominal or ordinal data.
4.1.2 Developing Null and Alternate Hypothesis
Null Hypothesis
It is used for testing the hypothesis formulated by the researcher. Researchers treat evidence that supports a hypothesis differently from evidence that opposes it: they give negative evidence more importance than positive evidence, because negative evidence tarnishes the hypothesis and shows that the predictions made by the hypothesis are wrong. The null hypothesis simply states that there is no relationship between the variables, or that the relationship between the variables is "zero"; it is denoted symbolically as H0. For example: H0 = There is no relationship between the level of job commitment and the level of efficiency.
Or: H0 = The relationship between the level of job commitment and the level of efficiency is zero; or: the two variables are independent of each other. The null hypothesis does not take into consideration the direction of association
(i.e. H0 is non-directional), which may be a second step in testing the hypothesis. First we look at whether or not there is an association; then we go for the direction of association and the strength of association. Experts recommend that we test our hypothesis indirectly by testing the null hypothesis. If our hypothesis has any credibility, then the research data should reject the null hypothesis. Rejection of the null hypothesis leads to the acceptance of the alternative hypothesis.
Alternative Hypothesis
The alternative (to the null) hypothesis simply states that there is a relationship between the variables under study. In our example it could be: there is a relationship between the level of job commitment and the level of efficiency. If not only is there an association between the two variables under study but the relationship is also perfect, this is indicated by the number "1". The alternative hypothesis is symbolically denoted as H1. It can be written like this: H1: There is a relationship between the level of job commitment of the officers and their level of efficiency.
4.1.3 Type I Error and Type II Error
A statistically significant result cannot prove that a research hypothesis is correct
(as this implies 100% certainty). Because a p-value is based on probabilities, there is
always a chance of making an incorrect conclusion regarding accepting or rejecting the
null hypothesis (H0).
Anytime we make a decision using statistics there are four possible outcomes, with
two representing correct decisions and two representing errors.
Type 1 error
A Type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis. This means that you report that your findings are significant when in fact they have occurred by chance.
●● The probability of making a Type I error is represented by your alpha level (α), which is the p-value below which you reject the null hypothesis. A p-value of 0.05 indicates that you are willing to accept a 5% chance of being wrong when you reject the null hypothesis.
●● The risk of committing a Type I error can be reduced by using a lower value for p. For example, a p-value of 0.01 would mean there is a 1% chance of committing a Type I error.
●● However, using a lower value for alpha means that you will be less likely to detect a true difference if one really exists (thus risking a Type II error).
Type 2 error
A type II error is also known as a false negative and occurs when a researcher fails
to reject a null hypothesis which is really false. Here a researcher concludes there is not
a significant effect, when actually there really is.
The probability of making a type II error is called Beta (β), and this is related to the
power of the statistical test (power = 1- β). The risk of committing a type II error can be
decreased by ensuring that the test has enough power.
4.1.4 Level of Significance and Critical Region
Level of Significance
●● The level of significance, often referred to as alpha or α, is a measure of the strength of the evidence that must be present in your sample before the null hypothesis is rejected and it is concluded that the effect is statistically significant. The researcher decides the level of significance before performing the experiment.
●● The significance level is the probability of rejecting the null hypothesis when it is true. For example, a significance level of 0.06 indicates a 6% risk of concluding that a difference exists when there is no actual difference. Lower significance levels indicate that stronger evidence is required before the null hypothesis is rejected.
●● Significance levels are used during hypothesis testing to help determine which hypothesis the data support, by comparing the p-value with the significance level. If the p-value is less than the significance level, the null hypothesis can be rejected and it can be concluded that the effect is statistically significant. In other words, the evidence in the sample is strong enough to reject the null hypothesis at the population level.
Critical Region
A critical region, also known as the Region of Rejection, is a set of test statistic
values for which the null hypothesis is rejected. That is to say, if the test statistics
observed are in the critical region then we reject the null hypothesis and accept the
alternative hypothesis. The critical region defines how far away our sample statistic
must be from the null hypothesis value before we can say it is unusual enough to reject
the null hypothesis.
The “best” critical region is one where the likelihood of making a Type I or Type II
error is minimised. In other words, the uniformly most powerful rejection region is the
region where the smallest chance of making a Type I or II error is present. It is also the
region that provides the largest (or equally greatest) power function for a UMP test.
4.1.5 Standard Error
A statistic's standard error is the standard deviation of its sampling distribution, or an estimate of that standard deviation. If the statistic is the sample mean, it is called the standard error of the mean. It is defined as

SE = σ / √n

where SE is the standard error of the sample mean, n is the sample size and σ is the standard deviation of the population (or its sample estimate).
The standard error increases when the standard deviation (the spread of the population) increases. The standard error decreases when the sample size increases – as the sample size gets closer to the true size of the population, the sample means cluster more and more around the true population mean.
The standard error tells us how accurately the mean of any given sample from the population is likely to represent the true population mean. As the standard error increases, i.e. the sample means become more spread out, it becomes more likely that any given sample mean is an inaccurate representation of the true population mean.
4.1.6 Confidence Interval
A Confidence Interval is a range of values within which the true value lies. It is a type of estimate computed from the statistics of the observed data. It proposes a range of plausible values for an unknown parameter (for example, the mean). The interval has an associated confidence level, which gives the probability that the true parameter is in the proposed range.
●● Given observations and a confidence level, a valid confidence interval has a probability of containing the true underlying parameter. The level of confidence can be chosen by the investigator. In general terms, a confidence interval for an unknown parameter is based on the sampling distribution of a corresponding estimator. The confidence level here represents the frequency (i.e. the proportion) of possible confidence intervals that contain the true value of the unknown population parameter.
●● In other words, if confidence intervals are constructed using a given confidence level from an infinite number of independent sample statistics, the proportion of those intervals that contain the true value of the parameter will be equal to the confidence level.
For example, if the confidence level is 90% then in a hypothetical indefinite data
collection, in 90% of the samples the interval estimate will contain the population
parameter. The confidence level is designated before examining the data. Most
commonly, a 95% confidence level is used. However, confidence levels of 90% and
99% are also often used in analysis.
Factors affecting the width of the confidence interval include the size of the sample,
the confidence level, and the variability in the sample. A larger sample will tend to
produce a better estimate of the population parameter, when all other factors are equal.
A higher confidence level will tend to produce a broader confidence interval.
4.2.1 For Single Population Mean Using t-statistic
When σ is not known, we use its estimate computed from the given sample. Here, the nature of the sampling distribution of X̄ would depend upon the sample size n. There are the following two possibilities:

If the parent population is normal and n < 30 (popularly known as the small sample case), use the t-test. The unbiased estimate of σ in this case is given by

s = √[ Σ(xi − x̄)² / (n − 1) ]

If n ≥ 30 (large sample case), use the standard normal test. The estimate of σ in this case can be taken as

s = √[ Σ(xi − x̄)² / n ]

since the difference between n and n − 1 is negligible for large values of n. Note that the parent population may or may not be normal in this case.
Application
Statisticians use tα to represent the t statistic that has a cumulative probability of
(1 - α). For example, suppose we were interested in the t statistic having a cumulative
probability of 0.95. In this example, α would be equal to (1 - 0.95) or 0.05. We would
refer to the t statistic as t0.05
Of course, the value of t0.05 depends on the number of degrees of freedom. For
example, with 2 degrees of freedom, t0.05 is equal to 2.92; but with 20 degrees of
freedom, t0.05 is equal to 1.725.
Example:
ABC Corporation manufactures light bulbs. The CEO claims that an average ABC light bulb lasts 300 days. A researcher randomly selects 15 bulbs for testing. The sampled bulbs last an average of 290 days, with a standard deviation of 50 days. If the CEO's claim were true, what is the probability that 15 randomly selected bulbs would have an average life of no more than 290 days?
Note: The solution follows the traditional approach and requires the computation of the t statistic from the data presented in the problem description. A t-distribution calculator (or table) is then used to find the probability.
Solution:
Computing the t statistic, based on the following equation:
t = (x̄ − μ) / (s / √n)
t = (290 − 300) / (50 / √15)
t = −10 / 12.909945 = −0.7745966
where x̄ is the sample mean, μ is the population mean, s is the standard deviation of the sample, and n is the sample size.
●● The degrees of freedom are equal to 15 − 1 = 14.
●● The t statistic is equal to −0.7745966.
The calculator displays the cumulative probability: 0.226. Hence, if the true bulb life were 300 days, there is a 22.6% chance that the average bulb life for 15 randomly selected bulbs would be less than or equal to 290 days.
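The numbers in this solution can be checked with a few lines of Python (SciPy assumed; the values are exactly those given in the example):

import math
from scipy import stats

mu_0, x_bar, s, n = 300, 290, 50, 15          # values from the example

t_stat = (x_bar - mu_0) / (s / math.sqrt(n))  # t statistic
df = n - 1                                    # degrees of freedom
p_lower = stats.t.cdf(t_stat, df=df)          # P(average life <= 290 days)

print(f"t = {t_stat:.4f}, df = {df}")              # t = -0.7746, df = 14
print(f"Cumulative probability = {p_lower:.3f}")   # about 0.226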
4.2.2 For Single Population Mean Using z-statistic
A z-test is a statistical test that is used to determine if means of population differ
when the variances are known and the sample size is large. It is assumed that the
test statistics have a normal distribution, and nuisance parameters such as standard
deviation should be known in order to perform an accurate z-test.
It is useful to standardize the values of a normal distribution by converting them into z-scores because:
(a) It allows the researchers to calculate the probability of a score occurring within a
standard normal distribution;
(b) It enables the comparison of two scores that are from different samples (which may
have different means and standard deviations).
●● A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large.
●● It can be used to test hypotheses in which the z-test statistic follows a normal distribution.
●● A z-statistic, or z-score, is a number representing the result of the z-test.
●● Z-tests are closely related to t-tests, but t-tests are best performed when an experiment has a small sample size.
●● Also, t-tests assume the standard deviation is unknown, while z-tests assume it is known.

Application
The conditions for a z-test are:
●● The distribution of the population is normal.
●● The sample size is large (n > 30).
If at least one of these conditions is satisfied, then

Z = (x̄ − µ) / (σ / √n)

where x̄ is the sample mean, µ is the population mean, σ is the population standard deviation and n is the sample size.
Example:
The mean length of the lumber is supposed to be 8.5 feet. A builder wants to check
whether the shipment of lumber she receives has a mean length different from 8.5 feet.
If the builder observes that the sample mean of 61 pieces of lumber is 8.3 feet with a
sample standard deviation of 1.2 feet. What will she conclude? Is 8.3 very different from
8.5?
Solution:
Whether the value is different or not depends on the standard deviation of x̄. Thus,

Z = (x̄ − µ) / (σ / √n) = (8.3 − 8.5) / (1.2 / √61) = −1.3
Thus, the question is whether −1.3 is very far away from zero, since zero corresponds to the case when x̄ is equal to μ0. If it is far away, the null hypothesis is unlikely to be valid and we reject it; otherwise the null hypothesis cannot be rejected.
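The lumber example can be reproduced with a short Python sketch (SciPy assumed; as in the solution above, the sample standard deviation of 1.2 is treated as σ):

import math
from scipy.stats import norm

mu_0, x_bar, sd, n = 8.5, 8.3, 1.2, 61       # values from the example

z = (x_bar - mu_0) / (sd / math.sqrt(n))     # z statistic
p_two_sided = 2 * norm.cdf(-abs(z))          # two-tailed p-value

print(f"z = {z:.2f}")                        # about -1.30
print(f"two-sided p-value = {p_two_sided:.3f}")
print("Reject H0 at the 5% level?", p_two_sided < 0.05)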
4.2.3 Hypothesis Testing for Population Proportion.
Using independent samples means that there is no relationship between the
groups. The values in one sample have no association with the values in the other
sample. These populations are not related, and the samples are independent. We look
at the difference of the independent means.
U
As with comparing two population proportions, when we compare two population
means from independent populations, the interest is in the difference of the two means.
In other words, if μ1 is the population mean from population 1 and μ2 is the population
mean from population 2, then the difference is μ1−μ2.
It is important to be able to distinguish between an independent sample and a
dependent sample.
Independent sample
The samples from two populations are independent if the samples selected from
one of the populations have no relationship with the samples selected from the other
population.
Dependent sample
The samples are dependent if each measurement in one sample is matched or
paired with a particular measurement in the other sample. Another way to consider this
is how many measurements are taken off of each subject. If only one measurement,
then independent; if two measurements, then paired. Exceptions are in familial
situations such as in a study of spouses or twins. In such cases, the data is almost
always treated as paired data.
Example - Compare the time that males and females spend watching TV.
a. We randomly select 15 men and 15 women and compare the average time they spend watching TV. Is this an independent sample or a paired sample?
b. We randomly select 15 couples and compare the time the husbands and wives spend watching TV. Is this an independent sample or a paired sample?
Answers: a. Independent sample; b. Paired sample.
Application
The null hypothesis to be tested is H0: π = π0 against Ha: π ≠ π0 for a two tailed test
and π > or < π0 for a one tailed test. The test statistic is
zcal = (p − π0) / √[ π0(1 − π0) / n ]
Example :
A wholesaler of apples claims that only 4% of the apples supplied by him are defective. A random sample of 600 apples contained 36 defective apples. Test the claim of the wholesaler.
Solution.
We have to test H0: π ≤ 0.04 against Ha: π > 0.04.
It is given that p = 36/600 = 0.06 and n = 600.

zcal = (0.06 − 0.04) / √(0.04 × 0.96 / 600) = 2.5
This value is highly significant in comparison to 1.645; therefore, H0 is rejected at the 5% level of significance.
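The wholesaler example can be verified with a short sketch (Python with SciPy; the figures are those given in the example):

import math
from scipy.stats import norm

pi_0 = 0.04          # claimed proportion of defectives
x, n = 36, 600       # observed defectives and sample size
p_hat = x / n        # 0.06

z_cal = (p_hat - pi_0) / math.sqrt(pi_0 * (1 - pi_0) / n)
p_value = 1 - norm.cdf(z_cal)                # one-tailed, since Ha: pi > 0.04

print(f"z_cal = {z_cal:.2f}")                # 2.5
print(f"one-tailed p-value = {p_value:.4f}")
print("Reject H0 at the 5% level?", z_cal > 1.645)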
Example:
470 tails were obtained in 1,000 throws of an unbiased coin. Can the difference
between the proportion of tails in sample and their proportion in population be regarded
as due to fluctuations of sampling?
Solution:
We have to test H0: π = 0.5 against Ha: π ≠ 0.5.
It is given that p = 470/1000 = 0.47 and n = 1000.

zcal = (0.47 − 0.5) / √(0.5 × 0.5 / 1000) = −1.90, so |zcal| = 1.90.

Since this value is less than 1.96, the coin can be regarded as fair and thus the difference between the sample and population proportions of tails is only due to fluctuations of sampling.
4.3.1 Inference about the Difference Between two Population Means

a. When the population standard deviation is known
This test is applicable when the random sample X1 , X2 , ...... Xn is drawn from a
normal population.
We can write H0 : µ = µ0 (specified) against Ha : µ ≠ µ0 (two tailed test)
The test statistic is (X̄ − µ) / (σ/√n) ~ N(0, 1). Let the value of this statistic calculated from the sample be denoted as

zcal = (X̄ − µ0) / (σ/√n)

The decision rule would be: reject H0 at the 5% (say) level of significance if |zcal| > 1.96. Otherwise, there is no evidence against H0 at the 5% level of significance.
Example –
A company claims that the average mileage of its bikes is 40 km/l. A random sample of 20 bikes of the company showed an average mileage of 42 km/l. Test the claim of the manufacturer on the assumption that the mileage of a bike is normally distributed with a standard deviation of 2 km/l.
Here, we have to test H0: µ = 40 against Ha: µ ≠ 40.

zcal = (X̄ − µ) / (σ/√n) = (42 − 40) / (2/√20) = 4.47

Since zcal > 1.96, H0 is rejected at the 5% level of significance.
b. When the population standard deviation is unknown
When σ is not known, we use its estimate computed from the given sample. Here, the nature of the sampling distribution of X̄ would depend upon the sample size n. There are the following two possibilities:
If the parent population is normal and n < 30 (popularly known as the small sample case), use the t-test. Also, like the normal test, the hypothesis may be one-tailed or two-tailed. If n ≥ 30 (large sample case), use the standard normal test, since the difference between n and n − 1 is negligible for large values of n. Note that the parent population may or may not be normal in this case.
Example:
Daily sales figures of 40 shopkeepers showed that their average sales and
standard deviation were Rs 528 and Rs 600 respectively. Is the assertion that daily
sales on the average is Rs 400, contradicted at 5% level of significance by the sample?
Solution:
Since n > 30, the standard normal test is applicable. It is given that n = 40, X̄ = 528 and S = 600.
We have to test H0: µ = 400 against Ha: µ ≠ 400.

zcal = (528 − 400) / (600/√40) = 1.35
Since this value is less than 1.96, there is no evidence against H0 at 5% level of
significance. Hence, the given assertion is not contradicted by the sample.
4.3.2 Inference about the Difference Between two Population
Proportions
A test of two population proportions is very similar to a test of two means, except
that the parameter of interest is now “p” instead of “µ”.
With a one-sample proportion test, p̂ = x/n is used as the point estimate of p, and it is expected that p̂ will be close to p. With a test of two proportions, we have two p̂'s, and we expect that (p̂1 − p̂2) will be close to (p1 − p2). The test statistic accounts for both samples.
With a one-sample proportion test, the test statistic is

z = (p̂ − p) / √[ p(1 − p) / n ]

●● and it has an approximately standard normal distribution.
●● For a two-sample proportion test, we would expect the test statistic to take the same general form, based on (p̂1 − p̂2).
However, the null hypothesis will be that p1 = p2. Because H0 is assumed to be true, the test assumes that p1 = p2, and we can then take this common value to equal p, a common population proportion. We must compute a pooled estimate of p (it is unknown) using our sample data.
Application
When we have a categorical variable of interest measured in two populations, it is
quite often that we are interested in comparing the proportions of a certain category for
the two populations.
Men and Women were asked about what they would do if they received a $100 bill
by mail, addressed to their neighbor, but wrongly delivered to them. Would they return
it to their neighbour? Of the 69 males sampled, 52 said “yes” and of the 131 females
sampled, 120 said “yes.”
Does the data indicate that the proportions that said “yes” are different for male and
female?
If the proportion of males who said "yes, they would return it" is denoted as p1 and the proportion of females who said "yes, they would return it" is denoted as p2, then the null hypothesis is p1 = p2, i.e.

p1 − p2 = 0 or p1/p2 = 1
It is required to develop a confidence interval or perform a hypothesis test for one
of these expressions.
Thus,
Men:   n1 = 69,  p̂1 = 52/69
Women: n2 = 131, p̂2 = 120/131
Using the formula

(p̂1 − p̂2) ± zα/2 √[ p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 ]

(52/69 − 120/131) ± 1.96 √[ (52/69)(1 − 52/69)/69 + (120/131)(1 − 120/131)/131 ]

= −0.1624 ± 1.96(0.05725)

= −0.1624 ± 0.1122, or (−0.2746, −0.0502)
We are 95% confident that the difference of population proportions of men who
said “yes” and women who said “yes” is between -0.2746 and -0.0502.
Based on both ends of the interval being negative, it seems like the proportion of
females who would return it is higher than the proportion of males who would return it.
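The interval above can be reproduced with a short sketch (Python with SciPy; the counts come straight from the example):

import math
from scipy.stats import norm

x1, n1 = 52, 69       # males who said "yes"
x2, n2 = 120, 131     # females who said "yes"

p1, p2 = x1 / n1, x2 / n2
z = norm.ppf(0.975)   # 1.96 for a 95% interval

se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
diff = p1 - p2

print(f"p1 - p2 = {diff:.4f}, SE = {se:.5f}")
print(f"95% CI: ({diff - z * se:.4f}, {diff + z * se:.4f})")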
4.3.3 Independent Samples and Matched Samples
Matched samples, also called matched pairs, paired samples or dependent samples, are paired such that all characteristics except the one under review are shared by the participants. A "participant" is a member of the sample, and can be a person, object or thing. Matched pairs are widely used to assign one person to a treatment group and another to a control group; this method, called matching, is used in the design of matched pairs. The "pairs" need not be different persons: they can be the same individuals measured at different times. For example:
●● The same study participants are measured before and after an intervention.
●● The same study participants are measured twice for two different interventions.
An independent sample is the opposite of a matched sample and deals with unrelated groups. Although matched pairs are intentionally selected, independent samples are typically selected at random (through simple random sampling or a similar technique).
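To make the distinction concrete, the sketch below (Python with SciPy; the simulated measurements are assumptions used only for illustration) runs an independent-samples t-test on two unrelated groups and a paired t-test on repeated measurements of the same subjects:

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Independent samples: two unrelated groups (e.g. 15 men and 15 women)
group_a = rng.normal(loc=2.0, scale=0.5, size=15)
group_b = rng.normal(loc=2.4, scale=0.5, size=15)
t_ind, p_ind = stats.ttest_ind(group_a, group_b)

# Matched (paired) samples: the same 15 subjects measured twice
before = rng.normal(loc=2.0, scale=0.5, size=15)
after = before + rng.normal(loc=0.3, scale=0.2, size=15)
t_rel, p_rel = stats.ttest_rel(before, after)

print(f"Independent-samples t-test: t = {t_ind:.2f}, p = {p_ind:.4f}")
print(f"Paired-samples t-test:      t = {t_rel:.2f}, p = {p_rel:.4f}")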
4.3.4 Inference about the Ratio of two Population Variances
One of the essential steps before a test that compares two population means is checking the equal-variances assumption, if you want to use the pooled variance. Many people use this test as a guide to see if there are any clear violations, much like using a rule of thumb.
An F-test is used to test if the variances of two populations are equal. This test can
be a two-tailed test or a one-tailed test.
The two-tailed version tests the null hypothesis of equal variances against the alternative that they are not equal. The one-tailed version tests in only one direction, that is, that the variance of the first population is either greater than or less than (but not both) the variance of the second population. The choice is determined by the problem. If we are testing a new process, for example, we might only be interested in knowing whether the new process is less variable than the old process.
Application:
To compare the variances of two quantitative variables, the hypotheses of interest are:

Null:          H0: σ1² / σ2² = 1
Alternatives:  Hα: σ1² / σ2² ≠ 1,   Hα: σ1² / σ2² > 1,   or   Hα: σ1² / σ2² < 1
Example:
Suppose randomly 7 women are selected from a population of women, and 12
men from a population of men. The table below shows the standard deviation in each
sample and in each population. Compute the f statistic.
Population    Population standard deviation    Sample standard deviation
Women         30                               35
Men           50                               45

Solution:
The f statistic can be computed from the population and sample standard deviations, using the following equation:

f = [ s1² / σ1² ] / [ s2² / σ2² ]

where σ1 is the standard deviation of population 1, s1 is the standard deviation of the sample drawn from population 1, σ2 is the standard deviation of population 2, and s2 is the standard deviation of the sample drawn from population 2.
f = (35² / 30²) / (45² / 50²) = (1225 / 900) / (2025 / 2500) = 1.361 / 0.81 = 1.68
For this calculation, the numerator degrees of freedom v1 are 7 - 1 or 6; and the
denominator degrees of freedom v2 are 12 - 1 or 11. On the other hand, if the men’s
data appears in the numerator, we can calculate an f statistic as follows:
f = (45² / 50²) / (35² / 30²) = (2025 / 2500) / (1225 / 900) = 0.81 / 1.361 = 0.595
For this calculation, the numerator degrees of freedom v1 are 12 – 1 or 11; and
the denominator degrees of freedom v2 are 7 – 1 or 6. When you are trying to find the
cumulative probability associated with an f statistic, you need to know v1 and v2.
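The same f statistic and its cumulative probability can be obtained in a few lines (Python with SciPy; the standard deviations and sample sizes are those given in the example):

from scipy.stats import f as f_dist

# Women: population SD 30, sample SD 35, n = 7
# Men:   population SD 50, sample SD 45, n = 12
s1, sigma1, n1 = 35, 30, 7
s2, sigma2, n2 = 45, 50, 12

f_stat = (s1**2 / sigma1**2) / (s2**2 / sigma2**2)
v1, v2 = n1 - 1, n2 - 1                        # degrees of freedom

print(f"f = {f_stat:.2f} with v1 = {v1}, v2 = {v2}")            # 1.68, 6, 11
print(f"cumulative probability = {f_dist.cdf(f_stat, v1, v2):.3f}")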
Assumptions
Several assumptions are made for the test. Your population must be approximately
normally distributed (i.e. fit the shape of a bell curve) in order to use the test. Plus, the
samples must be independent events. In addition, you’ll want to bear in mind a few
important points:
●● The larger variance should always go in the numerator (the top number) to force the test into a right-tailed test. Right-tailed tests are easier to calculate.
●● For two-tailed tests, divide alpha by 2 before finding the right critical value.
●● If you are given standard deviations, they must be squared to get the variances.
●● If your degrees of freedom aren't listed in the F table, use the larger critical value. This helps to avoid the possibility of Type I errors.
4.4.1 Analysis of Variance
Variance is defined as the average of squared deviation of data points from their
mean.
U
When the data constitute a sample, the variance is denoted byσ2x and averaging
is done by dividing the sum of the squared deviation from the mean by ‘n – 1’. When
observations constitute the population, the variance is denoted by σ2 and we divide by
N for the average
Different formulas for calculating variance:
Sample Variance:      Var(X) = σ²x = Σi=1..n (xi − X̄)² / (n − 1)
Population Variance:  Var(X) = σ² = Σ (xi − µ)² / N

Where,
xi for i = 1, 2, ..., n are the observed values
X̄ = sample mean
n = sample size
µ = population mean
N = population size
Population variance is

Var(x) = σ² = Σi=1..n (xi − µ)² / N
       = Σ (xi² − 2µxi + µ²) / N
       = [ Σ xi² − 2µ Σ xi + µ² Σ 1 ] / N
       = ( Σ xi² / N ) − µ²

Var(x) = E(X²) − [E(X)]²
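A small sketch of these formulas (Python with NumPy; the data values are assumed purely for illustration), including the shortcut Var(x) = E(X²) − [E(X)]²:

import numpy as np

x = np.array([4.0, 7.0, 6.0, 5.0, 8.0, 6.0])   # assumed data

sample_var = x.var(ddof=1)       # divides by n - 1 (sample variance)
population_var = x.var(ddof=0)   # divides by N (population variance)

# Shortcut form E(X^2) - [E(X)]^2, which matches the population variance
shortcut = np.mean(x**2) - np.mean(x)**2

print("Sample variance     :", round(sample_var, 4))
print("Population variance :", round(population_var, 4))
print("E(X^2) - [E(X)]^2   :", round(shortcut, 4))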
4.5.1 Chi Square Test
It is the test that uses the chi-square statistic to test the fit between a theoretical
frequency distribution and a frequency distribution of observed data for which each
observation may fall into one of several classes.
Formula of the Chi-square test:

χ² = Σ (O − E)² / E

If χ²cal < χ²table (the table value of χ² for the given degrees of freedom and α), accept H0.

Conditions of Chi-square Test
A chi-square test can be used when the data satisfies four conditions:
●● There must be two observed sets of data, or one observed set of data and one expected set of data (generally, there are n rows and c columns of data).
●● The two sets of data must be based on the same sample size.
●● Each cell in the data must contain an observed or expected count of five or larger.
●● The different cells in a row or column must represent categorical variables (male, female; less than 25 years of age, 25 to 40 years of age, older than 40 years of age; etc.).
The chi-square distribution is a continuous distribution with only positive values; it is skewed to the right, with a long tail to the right. It has the following applications:

Application areas of Chi-square Test
●● To test whether the differences among various sample proportions are significant or can be attributed to chance.
●● To test the independence of two variables in a contingency table.
●● To use it as a test of goodness of fit.
Example 1:
The operations manager of a company that manufactures tires wants to determine
whether there are any differences in the quality of work among the three daily shifts.
She randomly selects 496 tires and carefully inspects them. Each tire is either classified
as perfect, satisfactory, or defective, and the shift that produced it is also recorded. The
two categorical variables of interest are shift and condition of the tire produced. The
data can be summarized by the accompanying two-way table. Does the data provide
sufficient evidence at the 5% significance level to infer that there are differences in
quality among the three shifts?
            Perfect    Satisfactory    Defective    Total
Shift 1     106        124             1            231
Shift 2     67         85              1            153
Shift 3     37         72              3            112
Total       210        281             5            496

Solution:

The observed counts, with the expected counts shown in parentheses, are:

            Perfect          Satisfactory      Defective     Total
Shift 1     106 (97.80)      124 (130.87)      1 (2.33)      231
Shift 2     67 (64.78)       85 (86.68)        1 (1.54)      153
Shift 3     37 (47.42)       72 (63.45)        3 (1.13)      112
Total       210              281               5             496

Chi-Sq = 8.647, DF = 4, P-Value = 0.071
There are 3 cells with expected counts less than 5.0.
In the above example, there are no significant results at a 5% significance level
since the p-value (0.071) is greater than 0.05. Even if we did have a significant result,
we still could not trust the result, because there are 3 (33.3% of) cells with expected
counts < 5.0
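Example 1 can also be reproduced with SciPy's chi-square test of independence (Python; the observed counts are those in the table above):

import numpy as np
from scipy.stats import chi2_contingency

# Rows: Shift 1, 2, 3; columns: Perfect, Satisfactory, Defective
observed = np.array([
    [106, 124, 1],
    [ 67,  85, 1],
    [ 37,  72, 3],
])

chi2, p_value, dof, expected = chi2_contingency(observed)

print(f"Chi-Sq = {chi2:.3f}, DF = {dof}, P-Value = {p_value:.3f}")
print("Expected counts:")
print(np.round(expected, 2))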
Example 2
A food services manager for a baseball park wants to know if there is a relationship
between gender (male or female) and the preferred condiment on a hot dog. The
following table summarizes the results. Test the hypothesis with a significance level of
10%.
         Ketchup    Mustard    Relish    Total
Male     15         23         10        48
Female   25         19         8         52
Total    40         42         18        100
Solution:
The hypotheses are:
●● H0: Gender and condiments are independent
●● Ha: Gender and condiments are not independent

The observed counts, with expected counts in parentheses, are:

         Ketchup       Mustard        Relish       Total
Male     15 (19.2)     23 (20.16)     10 (8.64)    48
Female   25 (20.8)     19 (21.84)     8 (9.36)     52
Total    40            42             18           100
None of the expected counts in the table are less than 5. Therefore, we can
proceed with the Chi-square test. The test statistic is
χ²* = (15 − 19.2)²/19.2 + (23 − 20.16)²/20.16 + (10 − 8.64)²/8.64
      + (25 − 20.8)²/20.8 + (19 − 21.84)²/21.84 + (8 − 9.36)²/9.36 = 2.95
The p-value is found by P(χ2>χ2*)=P(χ2>2.95) with (3-1)(2-1) =2 degrees of
freedom. Using a table or software, we find the p-value to be 0.2288. With a p-value
greater than 10%, we can conclude that there is not enough evidence in the data to
suggest that gender and preferred condiment are related.
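For this smaller table the whole calculation can also be carried out step by step. The following sketch, assuming NumPy and SciPy are available, computes the expected counts from the row and column totals, the test statistic, and the upper-tail p-value; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import chi2

# Observed counts: rows = gender (male, female), columns = ketchup, mustard, relish
observed = np.array([[15, 23, 10],
                     [25, 19,  8]], dtype=float)

row_totals = observed.sum(axis=1, keepdims=True)   # 48, 52
col_totals = observed.sum(axis=0, keepdims=True)   # 40, 42, 18
grand_total = observed.sum()                       # 100

expected = row_totals @ col_totals / grand_total   # e.g. 48 * 40 / 100 = 19.2
statistic = ((observed - expected) ** 2 / expected).sum()

dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (2-1)(3-1) = 2
p_value = chi2.sf(statistic, dof)                         # upper-tail probability

print(round(statistic, 2), round(p_value, 4))   # approximately 2.95 and 0.229
```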
Assumptions of Chi-square Test
The chi-squared test, when used with the standard approximation that a chi-squared distribution is applicable, has the following assumptions:
●● Simple random sample: The sample data is a random sampling from a fixed distribution or population where each member of the population has an equal probability of selection. Variants of the test have been developed for complex samples, such as where the data is weighted.
●● Sample size (whole table): A sample with a sufficiently large size is assumed. If a chi-squared test is conducted on a sample with a smaller size, then the test will yield an inaccurate inference. The researcher, by using the chi-squared test on small samples, might end up committing a Type II error.
●● Expected cell count: Adequate expected cell counts are required. Some require 5 or more, and others require 10 or more. A common rule is 5 or more in all cells of a 2-by-2 table, and 5 or more in 80% of cells in larger tables, but no cells with zero expected count. When this assumption is not met, Yates's correction is applied.
●● Independence: The observations are always assumed to be independent of each other. This means chi-squared cannot be used to test correlated data (like matched pairs or panel data). In those cases you might want to turn to McNemar's test.
Degrees of Freedom (d.f)
The degree of freedom, abbreviated as d.f, denotes the extent of independence
(freedom) enjoyed by a given set of observed frequencies. Degrees of freedom are
usually denoted by the letter ‘v’ of the Greek alphabet.
Suppose we are given a set of 'n' observed frequencies which are subjected to 'k' independent constraints (restrictions). Then
Degrees of Freedom = No. of frequencies – No. of independent constraints (v = n – k)
Key Terms
●● Hypothesis Test: A hypothesis test is a method of making decisions using data from a scientific study.
●● Type I error: A Type I error is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis.
●● Type II error: A Type II error is a false negative and occurs when a researcher fails to reject a null hypothesis which is really false.
●● Confidence Interval: A confidence interval is a range of values within which the true value is expected to lie. It is a type of estimate computed from the statistics of the observed data.
●● Z-Test: A z-test is a statistical test to determine whether two population means are different when the variances are known and the sample size is large.
●● p Value: The p-value is the probability of obtaining outcomes as extreme as the outcomes of a statistical hypothesis test, assuming the null hypothesis is correct.
●● Simple random sample: The sample data is a random sampling from a fixed distribution or population where each member of the population has an equal probability of selection.
●● Degrees of Freedom: The degree of freedom, abbreviated as d.f, denotes the extent of independence or freedom enjoyed by a given set of observed frequencies.
Check your progress:
1. A ____ is a range of values where the true value lies in.
   a) Confidence Interval
   b) Quartile range
   c) Sample
   d) Mean
2. A ____ is a statistical test to determine whether two population means are different when variances are known.
   a) T test
   b) Quartile
   c) z test
   d) Median
3. What denotes the extent of independence enjoyed by a given set of observed frequencies?
   a) Standard deviation
   b) Median
   c) Degree of freedom
   d) Hypothesis
4. Which test is used as a test of goodness of fit?
   a) Z test
   b) T test
   c) Chi square test
   d) Fitness test
5. A _____ is also known as a false positive and occurs when a researcher incorrectly rejects a true null hypothesis.
   a) Type I error
   b) Type II error
   c) T test error
   d) Probability error
Questions & Exercises
1. What do you understand by hypothesis? Explain its characteristics.
2. Explain the types of hypothesis and how to develop them.
3. What is the p-value approach to hypothesis testing?
4. Explain the Chi-square test and its assumptions.
5. How do you draw inferences about the difference between two population means?
Check your progress:
1. a) Confidence Interval
2. c) z test
3. c) Degree of freedom
4. c) Chi square test
5. a) Type I error
Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Bibliography
1. Srivastava V. K. et al. – Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick – Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnik, Frank S., Dennis McLeavey, Richard Mojena – Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S. – Operations Research, Vikas Pub Co.
7. Gould F. J. – Introduction to Management Science, Englewood Cliffs, N.J., Prentice Hall.
8. Naray J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian – Quantitative Techniques, Pearson Ed.
11. Vohra N. D. – Quantitative Techniques in Management, TMH.
12. Stevenson W. D. – Introduction to Management Science, TMH.
Module-5: Forecasting Techniques
Learning Objective:
●● To understand the measures of linear relationship between variables
●● To get familiarized with Time Series Analysis
Learning Outcome:
●● Understand and apply forecasting techniques for business decision making and to uncover relationships between variables to produce forecasts of the future values of strategic variables
L.R. Conner says – "If two or more quantities vary in sympathy so that the movement in one tends to be accompanied by corresponding movements in the others, then they are said to be correlated."
5.1.1 Measures of Linear Relationship: covariance & correlation – Intro
We often encounter situations where data appears as pairs of figures relating to two variables, for example, price and demand of a commodity, money supply and inflation, industrial growth and GDP, advertising expenditure and market share, etc.
Examples of correlation problems are found in the study of the relationship
between IQ and aggregate percentage marks obtained in mathematics examination
or blood pressure and metabolism. In these examples, both variables are observed
as they naturally occur, since neither variable can be fixed at predetermined levels.
Correlation and regression analysis show how to determine the nature and strength of
the relationship between the variables.
●● According to Croxton and Cowden, "When the relationship is of a quantitative nature, the appropriate statistical tool for discovering and measuring the relationship and expressing it in a brief formula is known as correlation."
●● A.M. Tuttle says, "Correlation is an analysis of the co-variation between two or more variables."
Correlation is a degree of linear association between two random variables. In
these two variables, we do not differentiate them as dependent and independent
variables. It may be the case that one is the cause and other is an effect i.e.
independent and dependent variables respectively. On the other hand, both may be
dependent variables on a third variable. In some cases there may not be any cause
effect relationship at all. Therefore, if we do not consider and study the underlying
economic or physical relationship, correlation may sometimes give absurd results.
5.1.2 Covariance and Correlation - Application in Real Life
For example, consider the case of global average temperature and the population of India: both have been increasing over the past 50 years but are obviously not causally related. Correlation is an analysis of the degree to which two or more variables fluctuate with reference to each other.
O
nl
in
e
Correlation is expressed by a coefficient ranging between –1 and +1. Positive (+ve)
sign indicates movement of the variables in the same direction. E.g. Variation of the
fertilizers used on a farm and yield, observes a positive relationship within technological
limits. Whereas negative (–ve) coefficient indicates movement of the variables in the
opposite directions, i.e. when one variable decreases, other increases. E.g. Variation of
price and demand of a commodity have inverse relationship. Absence of correlation is
indicated if the coefficient is close to zero. A value of the coefficient close to ±1 denotes a very strong linear relationship.
The study of correlation helps managers in the following ways:
●● To identify relationship of various factors and decision variables.
●● To estimate value of one variable for a given value of other if both are correlated.
●● To understand economic behaviour and market forces.
●● To reduce uncertainty in decision-making to a large extent.
In business, correlation analysis often helps managers take decisions by estimating the effects of changing the values of decision variables like promotion, advertising, price and production processes on objective parameters like costs, sales, market share, consumer satisfaction and competitive price. The decision becomes more objective by removing subjectivity to a certain extent. However, it must be understood that correlation analysis only tells us whether two or more variables in a data set fluctuate together or not. This need not be due to a cause-and-effect relationship. To know whether the fluctuations in one of the variables indeed affect the other, the relationship has to be established through a logical understanding of the business environment.
5.1.3 Types of Correlation
Correlation can be studied as positive and negative, simple and multiple, partial and total, and linear and non-linear. Further, the methods to study correlation are plotting graphs on the x–y axes or algebraic calculation of the coefficient of correlation. Graphs are usually scatter diagrams or line diagrams. The correlation coefficients have been defined in different ways; among these are Karl Pearson's correlation coefficient, Spearman's rank correlation coefficient and the coefficient of determination.
1. Positive or negative correlation: In positive correlation, both factors increase
or decrease together. Positive or direct Correlation refers to the movement of variables
in the same direction.
The correlation is said to be positive when the increase (decrease) in the value of
one variable is accompanied by an increase (decrease) in the value of other variable also.
Negative or inverse correlation refers to the movement of the variables in opposite
direction. Correlation is said to be negative, if an increase (decrease) in the value of
one variable is accompanied by a decrease (increase) in the value of other.
When we say a perfect correlation, the scatter diagram will show a linear (straight
line) plot with all points falling on straight line. If we take appropriate scale, the straight
line inclination can be adjusted to 45°, although it is not necessary as long as inclination
is not 0° or 90° where there is no correlation at all because value of one variable
changes without any change in the value of other variable.
In case of negative correlation, when one variable increases the other decreases and vice versa. If the scatter diagram shows the points distributed closely around an imaginary line, we say there is a high degree of correlation. On the other hand, if we can hardly see any unique imaginary line around which the observations are scattered, we say correlation does not exist. Even in case of the imaginary line being parallel to one of the axes, we say no correlation exists between the variables. If the imaginary line is a straight line, we say the correlation is linear.
2. Simple or multiple correlations: In simple correlation the variation is between
only two variables under study and the variation is hardly influenced by any external
factor. In other words, if one of the variables remains same, there won’t be any change
in other variable. For example, variation in sales against price change in case of a
price sensitive product under stable market conditions shows a negative correlation. In
multiple correlations, more than two variables affect one another. In such a case, we
need to study correlation between all the pairs that are affecting each other and study
extent to which they have the influence.
3. Partial or total correlation
In case of multiple correlation analysis there are two approaches to study the
correlation. In case of partial correlation, we study variation of two variables and
excluding the effects of other variables by keeping them under controlled condition. In
case of ‘total correlation’ study we allow all relevant variables to vary with respect to
each other and find the combined effect. With few variables, it is feasible to study ‘total
correlation’. As number of variables increase, it becomes impractical to study the ‘total
correlation’. For example, coefficient of correlation between yield of wheat and chemical
fertilizers excluding the effects of pesticides and manures is called partial correlation.
Total correlation is based upon all the variables.
4. Linear and nonlinear correlation:
When the amount of change in one variable tends to keep a constant ratio to the
amount of change in the other variable, then the correlation is said to be linear.
The distinction between linear and non-linear is based upon the consistency of the
ratio of change between the variables. The manager must be careful in analyzing the
correlation using coefficients because most of the coefficients are based on assumption
of linearity. Hence plotting a scatter diagram is good practice. In case of linear
correlation, the differential (derivative) of relationship is constant with the graph of the
data being a straight line.
In case of nonlinear correlation, the rate of variation changes as values increase or decrease. The nonlinear relationship could be approximated by a polynomial (parabolic, cubic, etc.), exponential, sinusoidal, etc. In such cases, using correlation coefficients based on the linear assumption will be misleading unless they are used over a very short data range. Using computers, we can analyse a nonlinear correlation to a certain extent, with some simplified assumptions.
5.1.4 Correlation of Grouped Data
Many times the observations are grouped into a ‘two way’ frequency distribution
table. These are called bivariate frequency distribution. It is a matrix where rows are
grouped for the X variable and columns are grouped for the Y variable. Each cell, say (i, j), represents the frequency or count that falls in both groups of a particular range of values of Xi and Yj. In this case the correlation coefficient is given by

$$r = \frac{\sum f_{xy}\, m_x m_y - \frac{1}{n}\left(\sum f_x m_x\right)\left(\sum f_y m_y\right)}{\sqrt{\sum f_x m_x^2 - \frac{\left(\sum f_x m_x\right)^2}{n}}\ \sqrt{\sum f_y m_y^2 - \frac{\left(\sum f_y m_y\right)^2}{n}}}$$
Where mX and mY are class marks of frequency distributions of X and Y variables,
fX and fY are marginal frequencies of X and Y and fXY are joint frequencies of X and Y
respectively.
Example: Calculate coefficient of correlation for the following data.
X / Y       0-500   500-1000   1000-1500   1500-2000   2000-2500   Total
0-200       12      6          -           -           -           18
200-400     2       18         4           2           1           27
400-600     -       4          7           3           -           14
600-800     -       1          -           2           1           4
800-1000    -       -          1           2           3           6
Total       14      29         12          9           5           69
Solution: Let the assumed mean for X be a = 1250 and the scaling factor g = 500, so that dx = (mx − a)/g. Therefore, we can calculate f × dx and f × dx² from the marginal distribution of X as:

X            Class mark mx   dx = (mx − a)/g   Frequency f   f × dx   f × dx²
0-500        250             -2                14            -28      56
500-1000     750             -1                29            -29      29
1000-1500    1250            0                 12            0        0
1500-2000    1750            1                 9             9        9
2000-2500    2250            2                 5             10       20
Total                                          69            -38      114
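The grouped-data formula above can also be checked numerically. The sketch below is one possible illustration in Python, assuming NumPy is available and using the bivariate table of the example with X along the columns and Y along the rows; the class marks are read off the class intervals, and the variable names are illustrative only.

```python
import numpy as np

# Joint frequencies f_xy: rows = Y classes (0-200, ..., 800-1000),
# columns = X classes (0-500, ..., 2000-2500)
f = np.array([
    [12,  6, 0, 0, 0],
    [ 2, 18, 4, 2, 1],
    [ 0,  4, 7, 3, 0],
    [ 0,  1, 0, 2, 1],
    [ 0,  0, 1, 2, 3],
], dtype=float)

mx = np.array([250, 750, 1250, 1750, 2250], dtype=float)  # class marks of X
my = np.array([100, 300, 500, 700, 900], dtype=float)     # class marks of Y

n = f.sum()
fx = f.sum(axis=0)   # marginal frequencies of X
fy = f.sum(axis=1)   # marginal frequencies of Y

sum_xy = (f * np.outer(my, mx)).sum()            # sum of f_xy * m_x * m_y
num = sum_xy - (fx @ mx) * (fy @ my) / n
den_x = (fx @ mx**2) - (fx @ mx) ** 2 / n
den_y = (fy @ my**2) - (fy @ my) ** 2 / n

r = num / np.sqrt(den_x * den_y)
print(round(r, 3))
```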
Definition: The correlation coefficient measures the degree of association between
two variables X and Y.
The coefficient is given as –

$$r = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})}{\sigma_x \sigma_y} \qquad (1)$$
Where r is the ‘Correlation Coefficient’ or ‘Product Moment Correlation Coefficient’
between X and Y. σ X and σ Y are the standard deviations of X and Y respectively. ‘n’ is
the number of the pairs of variables X and Y in the given data.
The expression $\frac{1}{n}\sum (X - \bar{X})(Y - \bar{Y})$ is known as the covariance between the variables X and Y. It is denoted as Cov(x, y). The Correlation Coefficient r is a dimensionless number whose value lies between
+1 and –1. Positive values of r indicate positive (or direct) correlation between the two
variables X and Y i.e. both X and Y increase or decrease together.
Negative values of r indicate negative (or inverse) correlation, thereby meaning that
an increase in one variable X or Y results in a decrease in the value of the other variable.
A zero correlation means that there is no association between the two variables.
The formula can be modified as

$$r = \frac{\frac{1}{n}\sum (X-\bar{X})(Y-\bar{Y})}{\sigma_x \sigma_y} = \frac{\frac{1}{n}\sum (XY - \bar{X}Y - X\bar{Y} + \bar{X}\bar{Y})}{\sigma_x \sigma_y} = \frac{\frac{\sum XY}{n} - \frac{\sum X}{n}\cdot\frac{\sum Y}{n}}{\sqrt{\frac{\sum X^2}{n} - \left(\frac{\sum X}{n}\right)^2}\ \sqrt{\frac{\sum Y^2}{n} - \left(\frac{\sum Y}{n}\right)^2}} \qquad (2)$$

$$r = \frac{E[XY] - E[X]\,E[Y]}{\sqrt{E[X^2] - (E[X])^2}\ \sqrt{E[Y^2] - (E[Y])^2}} \qquad (3)$$

Equations (2) and (3) are alternate forms of equation (1). They have the advantage that the deviation of each value from the mean need not be computed.
Example: The data of advertisement expenditure (X) and sales (Y) of a company for the past 10-year period is given below. Determine the correlation coefficient between these variables and comment on the correlation.
X    50    50    50    40    30    20    20    15    10    5
Y    700   650   600   500   450   400   300   250   210   200
Solution: We shall take U to be the deviation of X values from the assumed mean of 30 divided by 5. Similarly, V represents the deviation of Y values from the assumed mean of 400 divided by 10.

Sl.No.   X = xi   Y = yi   U = ui   V = vi   ui·vi   ui²   vi²
1        50       700      4        30       120     16    900
2        50       650      4        25       100     16    625
3        50       600      4        20       80      16    400
4        40       500      2        10       20      4     100
5        30       450      0        5        0       0     25
6        20       400      -2       0        0       4     0
7        20       300      -2       -10      20      4     100
8        15       250      -3       -15      45      9     225
9        10       210      -4       -19      76      16    361
10       5        200      -5       -20      100     25    400
Total                      -2       26       561     110   3136
Short cut procedure for calculation of correlation coefficient

$$r = \frac{\sum u_i v_i - \frac{1}{n}\sum u_i \sum v_i}{\sqrt{\sum u_i^2 - \frac{1}{n}\left(\sum u_i\right)^2}\ \sqrt{\sum v_i^2 - \frac{1}{n}\left(\sum v_i\right)^2}} = \frac{561 - \frac{(-2)(26)}{10}}{\sqrt{110 - \frac{4}{10}}\ \sqrt{3136 - \frac{676}{10}}} = \frac{566.2}{\sqrt{109.6 \times 3068.4}} \approx 0.976$$
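The same value can be verified directly from the raw data. The sketch below is a minimal illustration in Python, assuming NumPy; it applies the definitional formula and uses the built-in np.corrcoef only as a cross-check.

```python
import numpy as np

x = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)

# Product-moment correlation coefficient from the definitional formula
r = ((x - x.mean()) * (y - y.mean())).sum() / (
    np.sqrt(((x - x.mean()) ** 2).sum()) * np.sqrt(((y - y.mean()) ** 2).sum())
)
print(round(r, 3))                          # approximately 0.976
print(round(np.corrcoef(x, y)[0, 1], 3))    # same value from NumPy's built-in
```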
Interpretation of r
The correlation coefficient, r ranges from −1 to 1. A value of 1 implies that a linear
equation describes the relationship between X and Y perfectly, with all data points lying
on a line for which Y increases as X increases. A value of −1 implies that all data points
lie on a line for which Y decreases as X increases. A value of 0 implies that there is no
linear correlation between the variables.
More generally, note that (Xi − X) (Yi − Y) is positive if and only if Xi and Yi lie
on the same side of their respective means. Thus the correlation coefficient is positive
if Xi and Yi tend to be simultaneously greater than, or simultaneously less than, their
respective means.
The correlation coefficient is negative if Xi and Yi tend to lie on opposite sides
of their respective means.
●● The coefficient of correlation r lies between –1 and +1, inclusive of those values.
●● When r is positive, the variables x and y increase or decrease together.
●● r = +1 implies that there is a perfect positive correlation between variables x and y.
●● When r is negative, the variables x and y move in opposite directions.
●● When r = –1, there is a perfect negative correlation.
●● When r = 0, the two variables are uncorrelated.
5.1.5 Spearman Rank Correlation Method - Intro & Application
Quite often the data is available in the form of some ranking for different
variables. Also there are occasions where it is difficult to measure the cause-effect
variables. For example, while selecting a candidate, there are number of factors on
which the experts base their assessment. It is not possible to measure many of these
parameters in physical units e.g. sincerity, loyalty, integrity, tactfulness, initiative, etc.
Similar is the case during dance contests. However, in these cases the experts may
rank the candidates. It is then necessary to find out whether the two sets of ranks
are in agreement with each other. This is measured by Rank Correlation Coefficient.
The purpose of computing a correlation coefficient in such situations is to determine
the extent to which the two sets of ranking are in agreement. The coefficient that is
determined from these ranks is known as Spearman’s rank coefficient, rS
This is defined by the following formula:

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$$

where n = number of observation pairs and $d_i = X_i - Y_i$, with $X_i$ and $Y_i$ being the ranks assigned to the i-th item under variables X and Y respectively.
Rank Correlation when Ranks are given
Example: Ranks obtained by a set of ten students in a mathematics test (variable X) and a physics test (variable Y) are shown below. Determine the coefficient of rank correlation, rS.

Rank for Variable X   1   2   3   4   5   6   7   8    9   10
Rank for Variable Y   3   1   4   2   6   9   8   10   5   7
Solution: Computations of Spearman’s Rank Correlation as shown below:
Individual   Rank in Maths (X = xi)   Rank in Physics (Y = yi)   di = xi − yi   di²
1            1                        3                          -2             4
2            2                        1                          +1             1
3            3                        4                          -1             1
4            4                        2                          +2             4
5            5                        6                          -1             1
6            6                        9                          -3             9
7            7                        8                          -1             1
8            8                        10                         -2             4
9            9                        5                          +4             16
10           10                       7                          +3             9
Total                                                                           50

Now, n = 10 and $\sum_{i=1}^{n} d_i^2 = 50$. Using the formula,

$$r_s = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)} = 1 - \frac{6 \times 50}{10(100 - 1)} = 0.697$$
It can be said that there is a high degree of correlation between the performance in
mathematics and physics.
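A quick numerical check of this example is given below. The sketch assumes Python with SciPy; it applies the rank-correlation formula directly and then cross-checks with scipy.stats.spearmanr (which ranks its inputs internally, so passing ranks leaves them unchanged).

```python
from scipy.stats import spearmanr

# Ranks of ten students in mathematics (X) and physics (Y)
rank_x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
rank_y = [3, 1, 4, 2, 6, 9, 8, 10, 5, 7]

# Direct use of the rank-correlation formula
d_squared = sum((x - y) ** 2 for x, y in zip(rank_x, rank_y))
n = len(rank_x)
rs = 1 - 6 * d_squared / (n * (n ** 2 - 1))
print(round(rs, 3))                      # approximately 0.697

# Cross-check with SciPy
result = spearmanr(rank_x, rank_y)
print(round(result.correlation, 3))
```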
Rank Correlation when Ranks are not given
Example: Find the rank correlation coefficient for the following data.
X    75    88    95    70    60    80    81    50
Y    120   134   150   115   110   140   142   100
Solution: Let R1 and R2 denote the ranks in X and Y respectively.
X     Y     R1    R2    d = R1 − R2   d²
75    120   5     5     0             0
88    134   2     4     -2            4
95    150   1     1     0             0
70    115   6     6     0             0
60    110   7     7     0             0
80    140   4     3     1             1
81    142   3     2     1             1
50    100   8     8     0             0
Total                                 6

Coefficient of rank correlation:

$$\rho = 1 - \frac{6\sum d^2}{n(n^2 - 1)} = 1 - \frac{6 \times 6}{8(64 - 1)} = +0.93$$
In this method, the biggest item gets the first rank, the next biggest the second rank, and so on.
5.1.6 Regression Model
There is a need for a statistical model that will extract information from the given data to establish the regression relationship between the independent and dependent variables. The model should capture the systematic behaviour of the data. The non-systematic behaviour cannot be captured and is called error. The error is due to a random component that cannot be predicted, as well as to components not adequately considered in the statistical model. A good statistical model captures the entire systematic component, leaving only random errors.
In any model we attempt to capture everything which is systematic in data.
Random errors cannot be captured in any case. Assuming the random errors are
‘Normally distributed’ we can specify the confidence level and interval of random errors.
Thus, our estimates are more reliable.
If the variables in a bivariate distribution are correlated, the points in scatter
diagram approximately cluster around some curve. If the curve is straight line we call
it as linear regression. Otherwise, it is curvilinear regression. The equation of the curve
which is closest to the observations is called the ‘best fit’.
The best fit is calculated as per Legendre's principle of least sum of squares of deviations of the observed data points from the corresponding values on the 'best fit' curve. This is called the minimum squared error criterion. It may be noted that the deviation (error) can be measured in the X direction or the Y direction. Accordingly, we will get two 'best fit' curves. If we measure the deviation in the Y direction, i.e. for a given x value of a
data point (x, y) we measure the corresponding y value on the 'best fit' curve and then take the deviation in y, we call it the regression of Y on X. In the other case, if we measure deviations in the X direction, we call it the regression of X on Y.
Definition: According to Morris Myers Blair, regression is the measure of the average
relationship between two or more variables in terms of the original units of the data.
Applicability of Regression Analysis
Regression analysis is one of the most popular and commonly used statistical tools in business. The availability of computer packages has simplified its use. However, one must be careful before using this tool as it gives only a mathematical measure based on the available data. It does not check whether the cause-effect relationship really exists and, if it exists, which is the dependent and which is the independent variable.
Regression analysis is a branch of statistical theory which is widely used in all the scientific disciplines. It is a basic technique for measuring or estimating the relationship among economic variables that constitute the essence of economic theory and economic life. The uses of regression analysis are not confined to economic and business activities. Its applications are extended to almost all the natural, physical and social sciences.
Regression analysis helps in the following ways –
●● It provides a mathematical relationship between two or more variables. This mathematical relationship can then be used for further analysis and treatment of information using more complex techniques.
●● Since most business analysis and decisions are based on cause-effect relationships, regression analysis is a highly valuable tool to provide a mathematical model for this relationship.
●● The widest use of regression analysis is in analysis, estimation and forecasting.
●● Regression analysis is also used in establishing theories based on relationships of various parameters.
●● Some of the common examples are demand and supply, money supply and expenditure, inflation and interest rates, promotion expenditure and sales, productivity and profitability, health of workers and absenteeism, etc.
5.1.7 Estimating the Coefficient Using Least Square Method
Generally the method used to find the 'best' fit that a straight line of this kind can give is the least-square method. To use it efficiently, we first determine (writing $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ for deviations from the means)

$$\sum x_i^2 = \sum X_i^2 - n\bar{X}^2,\qquad \sum y_i^2 = \sum Y_i^2 - n\bar{Y}^2,\qquad \sum x_i y_i = \sum X_i Y_i - n\bar{X}\bar{Y}$$

$$b = \frac{\sum x_i y_i}{\sum x_i^2},\qquad a = \bar{Y} - b\bar{X}$$
These measures define a and b, which will give the best possible fit through the original X and Y points, and the value of r can then be worked out as under:

$$r = b\sqrt{\frac{\sum x_i^2}{\sum y_i^2}}$$
Thus, the regression analysis is a statistical method to deal with the formulation
of mathematical model depicting relationship amongst variables which can be used for
the purpose of prediction of the values of dependent variable, given the values of the
independent variable.
Alternatively, for fitting a regression equation of the type Y = a + bX to the given
values of X and Y variables, we can find the values of the two constants viz., a and b by
using the following two normal equations:
$$\sum Y_i = na + b\sum X_i$$
$$\sum X_i Y_i = a\sum X_i + b\sum X_i^2$$
Solving these equations gives the values of a and b. Once these values are obtained and have been put in the equation Y = a + bX, we say that we have fitted the regression equation of Y on X to the given data. In a similar fashion, we can develop the regression equation of X on Y, viz., X = a' + b'Y, presuming Y as the independent variable and X as the dependent variable.
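As an illustration of these normal equations, the sketch below fits Y = a + bX to the advertisement–sales data used earlier in this module. It assumes NumPy; np.polyfit appears only as a cross-check and is not the document's prescribed tool.

```python
import numpy as np

# Advertisement expenditure (X) and sales (Y) from the earlier example
X = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
Y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)

n = len(X)
Sxy = (X * Y).sum() - n * X.mean() * Y.mean()   # sum of x_i * y_i (deviations)
Sxx = (X ** 2).sum() - n * X.mean() ** 2        # sum of x_i^2

b = Sxy / Sxx                 # slope of the regression of Y on X
a = Y.mean() - b * X.mean()   # intercept

print(round(a, 2), round(b, 2))   # fitted equation Y = a + bX
print(np.polyfit(X, Y, 1))        # cross-check: returns [b, a]
```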
5.1.8 Assessing the Model
Method of Least Square parabolic trend
The mathematical form of a parabolic trend is given by Yt = a + bt + ct2 or Y =
a + bt + ct2 (dropping the subscript for convenience). Here a, b and c are constants
to be determined from the given data. Using the method of least squares, the normal
equations for the simultaneous solution of a, b, and c are:
$$\sum Y = na + b\sum t + c\sum t^2$$
$$\sum tY = a\sum t + b\sum t^2 + c\sum t^3$$
$$\sum t^2 Y = a\sum t^2 + b\sum t^3 + c\sum t^4$$

By selecting a suitable year of origin, i.e., defining X = t − origin such that ΣX = 0, the computation work can be considerably simplified. Also note that if ΣX = 0, then ΣX³ will also be equal to zero. Thus, the above equations can be rewritten as:

$$\sum Y = na + c\sum X^2 \qquad (i)$$
$$\sum XY = b\sum X^2 \qquad (ii)$$
$$\sum X^2 Y = a\sum X^2 + c\sum X^4 \qquad (iii)$$
From equation (ii), we get

$$b = \frac{\sum XY}{\sum X^2} \qquad (iv)$$

And from equation (iii), we get

$$c = \frac{n\sum X^2 Y - (\sum X^2)(\sum Y)}{n\sum X^4 - (\sum X^2)^2} \qquad (v)$$

Further, from equation (i), we get

$$a = \frac{\sum Y - c\sum X^2}{n} \qquad (vi)$$

Thus, equations (iv), (v) and (vi) can be used to determine the values of the constants a, b and c.
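A short numerical sketch of equations (iv)–(vi) is given below. It assumes Python with NumPy and, purely for illustration, reuses the production series that appears later in this module; the time origin is shifted so that ΣX = 0, as described above, and np.polyfit is used only as a cross-check.

```python
import numpy as np

# Illustrative series: the production figures used later in this module
years = np.arange(1998, 2006)
y = np.array([42, 44, 48, 42, 46, 50, 48, 52], dtype=float)

# Deviations from the mean year, so that sum(X) = 0 (and sum(X^3) = 0)
X = years - years.mean()

n = len(y)
b = (X * y).sum() / (X ** 2).sum()                                   # equation (iv)
c = (n * (X**2 * y).sum() - (X**2).sum() * y.sum()) / (
    n * (X**4).sum() - (X**2).sum() ** 2)                            # equation (v)
a = (y.sum() - c * (X**2).sum()) / n                                 # equation (vi)

print(round(a, 3), round(b, 3), round(c, 4))   # fitted trend Y = a + bX + cX^2
print(np.polyfit(X, y, 2))                     # cross-check: returns [c, b, a]
```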
5.1.9 Standard Error of Estimate
Standard Error of Estimate is the measure of variation around the computed
regression line.
The standard error of estimate (SE) of Y measures the variability of the observed values of Y around the regression line; it gives a measure of the scatter of the observations about the line of regression.
The Standard Error of Estimate of Y on X is:

$$SE_{xy} = \sqrt{\frac{\sum (Y - Y_e)^2}{n - 2}}$$

where
Y = observed value of y
Ye = estimated value from the estimating equation that corresponds to each Y value
e = the error term (Y − Ye)
n = number of observations in the sample.

The convenient formula is:

$$SE_{xy} = \sqrt{\frac{\sum Y^2 - a\sum Y - b\sum XY}{n - 2}}$$

where X = value of the independent variable, Y = value of the dependent variable, a = Y-intercept, b = slope of the estimating equation, and n = number of data points.
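Both forms of the standard error can be computed and compared as below. This is a minimal sketch assuming NumPy and the advertisement–sales data used earlier; it is meant only to show that the definitional and the convenient formulas agree for a least-squares fit.

```python
import numpy as np

X = np.array([50, 50, 50, 40, 30, 20, 20, 15, 10, 5], dtype=float)
Y = np.array([700, 650, 600, 500, 450, 400, 300, 250, 210, 200], dtype=float)

# Fit Y = a + bX by least squares (np.polyfit returns the slope first)
b, a = np.polyfit(X, Y, 1)

Y_e = a + b * X          # estimated values on the regression line
n = len(Y)

# Definitional form and the "convenient" computational form
se_def = np.sqrt(((Y - Y_e) ** 2).sum() / (n - 2))
se_conv = np.sqrt((np.sum(Y**2) - a * Y.sum() - b * np.sum(X * Y)) / (n - 2))

print(round(se_def, 2), round(se_conv, 2))   # the two forms agree
```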
Regression Coefficient of X on Y
The regression coefficient of X on Y is represented by the symbol bxy; it measures the change in X for a unit change in Y. Symbolically,

$$b_{xy} = r\,\frac{\sigma_x}{\sigma_y}$$

When the deviations are taken from the actual means of X and Y,

$$b_{xy} = \frac{\sum xy}{\sum y^2}$$

When the deviations are obtained from assumed means,

$$b_{xy} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_y^2 - \left(\sum d_y\right)^2}$$

where $d_x$ and $d_y$ are the deviations from the assumed means of X and Y.
Regression Coefficient of Y on X
The symbol byx measures the change in Y corresponding to a unit change in X. Symbolically,

$$b_{yx} = r\,\frac{\sigma_y}{\sigma_x}$$

In case the deviations are taken from the actual means, the following formula is used:

$$b_{yx} = \frac{\sum xy}{\sum x^2}$$

When the deviations are taken from assumed means,

$$b_{yx} = \frac{N\sum d_x d_y - \sum d_x \sum d_y}{N\sum d_x^2 - \left(\sum d_x\right)^2}$$
●● The regression coefficient is also called a slope coefficient because it determines the slope of the line, i.e., the change in the dependent variable for a unit change in the independent variable.
5.1.10 Regression Coefficient
The coefficients of regression are b_YX and b_XY. They have the following implications:
●● Slopes of the regression lines of Y on X and X on Y, viz. b_YX and b_XY, must have the same sign (because r² cannot be negative).
●● The correlation coefficient is the geometric mean of b_YX and b_XY.
●● If both slopes b_YX and b_XY are positive, the correlation coefficient r is positive. If both b_YX and b_XY are negative, the correlation coefficient r is negative.
●● Both regression lines intersect at the point (X̄, Ȳ).
ity
●●
As in the case of calculation of the correlation coefficient, we can directly write the formulae for the two regression coefficients for a bivariate frequency distribution as given below –

$$b_{yx} = \frac{N\sum\sum f_{ij} x_i y_j - \left(\sum f_i x_i\right)\left(\sum f_j y_j\right)}{N\sum f_i x_i^2 - \left(\sum f_i x_i\right)^2}$$

or, if we define $u_i = \dfrac{X_i - A}{h}$ and $v_j = \dfrac{Y_j - B}{k}$,

$$b_{yx} = \frac{k}{h}\left[\frac{N\sum\sum f_{ij} u_i v_j - \left(\sum f_i u_i\right)\left(\sum f_j v_j\right)}{N\sum f_i u_i^2 - \left(\sum f_i u_i\right)^2}\right]$$

Similarly,

$$b_{xy} = \frac{N\sum\sum f_{ij} x_i y_j - \left(\sum f_i x_i\right)\left(\sum f_j y_j\right)}{N\sum f_j y_j^2 - \left(\sum f_j y_j\right)^2}$$

or

$$b_{xy} = \frac{h}{k}\left[\frac{N\sum\sum f_{ij} u_i v_j - \left(\sum f_i u_i\right)\left(\sum f_j v_j\right)}{N\sum f_j v_j^2 - \left(\sum f_j v_j\right)^2}\right]$$
ity
or d =
(c
)A
m
5.2.1 Time Series
Time series analysis systematically identifies and isolates different kinds of timerelated patterns in the data. Four common relationship patterns are horizontal, trend,
seasonal and cyclic. The random component is superimposed on these patterns. There
is a procedure for decomposing the time series in these patterns. These are used for
forecasting. However, more accurate and statistically sound procedure is to identify
the patterns in time series using auto-correlations that was explained in previous
subsection. It is correlation between the values of same variable at different time lag.
When the time series represents completely random data, the auto correlation
for various time lags is close to zero with values fluctuating both on positive and
negative side. If auto correlation slowly drops to zero, and more than two or three
differ significantly from zero, it indicates presence of trend in the data. The trend can
be removed by taking difference between consecutive values and constructing a new
series. This is called numerical differentiation.
Definition
A time series is a collection of data obtained by observing a response variable at periodic points in time. If repeated observations on a variable produce a time series, the variable is called a time series variable. We use Yi to denote the value of the variable at time i.
Objectives of Time Series
Notes
O
nl
in
e
Statistics Management
The analysis of time series implies its decomposition into various factors that affect
the value of its variable in a given period. It is a quantitative and objective evaluation of
the effects of various factors on the activity under consideration.
There are two main objectives of the analysis of any time series data:
1. To study the past behaviour of data.
2. To make forecasts for the future.
The study of past behaviour is essential because it provides us the knowledge of the effects of various forces. This can facilitate the process of anticipating the future course of events and, thus, forecasting the value of the variable as well as planning for the future.
5.2.2 Variation in Time Series
Time Series Analysis – Secular Component
Secular trend or simply trend is the general tendency of the data to increase or
decrease or stagnate over a long period of time. Most of the business and economic
time series would reveal a tendency to increase or to decrease over a number of years.
For example, data regarding industrial production, agricultural production, population,
bank deposits, deficit financing, etc., show that, in general, these magnitudes have
been rising over a fairly long period. As opposed to this, a time series may also reveal
a declining trend, e.g., in the case of substitution of one commodity by another, the
demand of the substituted commodity would reveal a declining trend such as the
demand for cotton clothes, demand for coarse grains like bajra, jowar, etc. With
the improved medical facilities, the death rate is likely to show a declining trend, etc.
The change in trend, in either case, is attributable to the fundamental forces such as
changes in population, technology, composition of production, etc.
Time Series Analysis - Seasonal Component
Seasonal variations are cycles that occur over short periods of time, normally less than one year, e.g. monthly, weekly or daily. A time series, where the time interval between successive observations is less than or equal to one year, may show the effects of both the seasonal and cyclical variations. However, the seasonal variations are absent if the time interval between successive observations is greater than one year.
Causes of Seasonal Variations:
The main causes of seasonal variations are:
●● Climatic Conditions
●● Customs and Traditions
Climatic Conditions: The changes in climatic conditions affect the value of
time series variable and the resulting changes are known as seasonal variations. For
example, the sale of woolen garments is generally at its peak in the month of November
and December because of the beginning of winter season. Similarly, timely rainfall may
increase agricultural output, prices of agricultural commodities are lowest during their
harvesting season, etc., reflect the effect of climatic conditions on the value of time
series variable.
O
nl
in
e
Notes
Customs and Traditions: The customs and traditions of the people also give rise
to the seasonal variations in time series. For example, the purchase of clothing and
ornaments may be highest during the marriage season, sale of sweets during Diwali,
etc., are variations which are the results of customs and traditions of the people.
ity
Time Series Analysis - Cyclical Component
●● Cyclical variations are revealed by most of the economic and business time series and, therefore, are also termed as trade or business cycles. Any trade cycle has four phases which are respectively known as boom, recession, depression and recovery.
●● Various phases repeat themselves regularly one after another in the given sequence. The time interval between two identical phases is known as the period of cyclical variations. The period is always greater than one year. Normally, the period of cyclical variations lies between 3 to 10 years.
Objectives of Measuring Cyclical Variations
The main objectives of measuring cyclical variations are:
●● To analyse the behaviour of cyclical variations in the past.
●● To predict the effect of cyclical variations so as to provide guidelines for future business policies.
U
●●
Time Series Analysis - Random Component
(c
)A
m
ity
As the name suggests, these variations do not reveal any regular pattern of
the movements. These variations are caused by random factors such as strikes,
fire, floods, war, famines, etc. Random variations is that component of a time series
that cannot be explained in terms of any of the components discussed so far. This
component is obtained as a residue after the elimination of trend, seasonal and cyclical
components and hence is often termed as residual component. Random variations
are usually short-term variations but sometimes their effect may be so intense that the
value of trend may get permanently affected.
Numerical Application
Using the method of free hand, determine the trend of the following data:

Year                     1998   1999   2000   2001   2002   2003   2004   2005
Production (in tonnes)   42     44     48     42     46     50     48     52

Solution:
[Graph: the production figures plotted against year, with a freehand trend curve drawn through the points.]
Example 2 - Find trend values from the following data using three yearly moving averages and show the trend line on the graph.

Year    Price (`)      Year    Price (`)
1994    52             2000    75
1995    65             2001    70
1996    58             2002    64
1997    63             2003    78
1998    66             2004    80
1999    72             2005    73

Solution: Computation of trend values

Year    Price (`)    3-yearly moving total    3-yearly moving average
1994    52           –                        –
1995    65           175                      58.33
1996    58           186                      62.00
1997    63           187                      62.33
1998    66           201                      67.00
1999    72           213                      71.00
2000    75           217                      72.33
2001    70           209                      69.67
2002    64           212                      70.67
2003    78           222                      74.00
2004    80           231                      77.00
2005    73           –                        –

[Graph: the price series plotted against year with the three-yearly moving average trend line.]
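The moving-average computation above is easy to automate. The sketch below assumes Python with pandas; a centred window of three reproduces the 3-yearly moving totals and averages shown in the table.

```python
import pandas as pd

years = range(1994, 2006)
price = [52, 65, 58, 63, 66, 72, 75, 70, 64, 78, 80, 73]

s = pd.Series(price, index=years)

moving_total = s.rolling(window=3, center=True).sum()    # 3-yearly moving totals
moving_avg = s.rolling(window=3, center=True).mean()     # 3-yearly moving averages

trend = pd.DataFrame({"Price": s,
                      "3-yearly moving total": moving_total,
                      "3-yearly moving average": moving_avg.round(2)})
print(trend)   # the first and last years have no moving average, as in the table
```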
Key Terms
Correlation: Correlation is expressed by a coefficient ranging between –1 and +1.
Positive (+ve) sign indicates movement of the variables in the same direction.
●●
Positive correlation: The correlation is said to be positive when the increase
(decrease) in the value of one variable is accompanied by an increase (decrease)
in the value of other variable also.
●●
Negative correlation: Negative or inverse correlation refers to the movement of
the variables in opposite direction
●●
Linear correlation: When the amount of change in one variable tends to keep a
constant ratio to the amount of change in the other variable, then the correlation is
said to be linear.
Regression: Regression is a basic technique for measuring or estimating the
relationship among economic variables that constitute the essence of economic
theory and economic life.
●●
Time Series: A time series is a collection of data obtained by observing a
response variable at periodic points in time.
●●
Standard Error of Estimate: Standard Error of Estimate is the measure of
variation around the computed regression line.
Check your progress:
1. In ____ correlation, both factors increase or decrease together.
   a) Constant
   b) Positive
   c) Negative
   d) Probability
2. The correlation that refers to the movement of the variables in opposite direction is
   a) Constant
   b) Positive
   c) Negative
   d) Probability
3. A ____ is a collection of data obtained by observing a response variable at periodic points in time.
   a) Mean deviation
   b) Sample
   c) Time Series
   d) Hypothesis
4. The technique for estimating the relationship among economic variables that constitute the essence of economic theory is?
   a) Correlation
   b) Time Series
   c) Regression
   d) Standard deviation
5. In ____ the variation is between only two variables under study and the variation is hardly influenced by any external factor.
   a) Partial correlation
   b) Total correlation
   c) Standard correlation
   d) Multiple correlation
U
3.
Negative
ity
2.
c)
Questions and exercises
1. Explain the measures of linear relationship.
2. What is correlation? What are the various types of correlation?
3. Explain correlation in grouped data.
4. The data of advertisement expenditure (X) and sales (Y) of a company for the past 10-year period is given below. Determine the correlation coefficient between these variables and comment on the correlation.

X    50    50    50    40    30    20    20    15    10    5
Y    700   650   600   500   450   400   300   250   210   200

5. What do you understand by time series analysis? Explain its components.
Check your progress:
1. b) Positive
2. c) Negative
3. c) Time Series
4. c) Regression
5. d) Multiple correlation
Further Readings
1. Richard I. Levin, David S. Rubin, Sanjay Rastogi, Masood Husain Siddiqui, Statistics for Management, Pearson Education, 7th Edition, 2016.
2. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
3. Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani, An Introduction to Statistical Learning with Applications in R, Springer, 2016.
Bibliography
1. Srivastava V. K. et al. – Quantitative Techniques for Managerial Decision Making, Wiley Eastern Ltd.
2. Richard I. Levin and Charles A. Kirkpatrick – Quantitative Approaches to Management, McGraw Hill, Kogakusha Ltd.
3. Prem S. Mann, Introductory Statistics, 7th Edition, Wiley India, 2016.
4. Budnik, Frank S., Dennis McLeavey, Richard Mojena – Principles of Operations Research, AITBS, New Delhi.
5. Sharma J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
6. Kalavathy S. – Operations Research, Vikas Pub Co.
7. Gould F. J. – Introduction to Management Science, Englewood Cliffs, N.J., Prentice Hall.
8. Naray J. K. – Operations Research: Theory and Applications, Macmillan, New Delhi.
9. Taha Hamdy, Operations Research, Prentice Hall of India.
10. Tulasian – Quantitative Techniques, Pearson Ed.
11. Vohra N. D. – Quantitative Techniques in Management, TMH.
12. Stevenson W. D. – Introduction to Management Science, TMH.