Quantitative Techniques in Business

©Dr. Valerie P. Muehsam, 2006
Introduction to Statistics
In the business world, and in fact, in practically every aspect of daily living,
quantitative techniques are used to assist in decision making. Why? Unlike the
classroom, in the “real world” there is often not enough information available to
guarantee a correct decision. For instance, if advertisers would like to know
how many households in the United States with televisions are tuned to a particular
television show, at a particular date and time, it would be impossible to determine
without the complete cooperation of every household and an astonishing amount of time
and money. If a consumer protection agency wanted to determine the true proportion of
prescription drug users who also use herbal non-regulated over-the-counter supplements,
this information would most likely not be available. As a result of the inability to
determine characteristics of interest, the application of statistics and other quantitative
techniques has developed.
Statistics is defined as the process of collecting a sample, then organizing, analyzing
and interpreting data. The numeric values which represent the characteristics analyzed in
this process are also referred to as statistics. When information related to a particular
group is desired, and it is impossible or impractical to obtain this information, a sample
or subset of the group is obtained and the information of interest is determined for the
subset. For instance, if someone is interested in the average annual income of all the
students with majors in the College of Business Administration at Sam Houston State
University, the only way this information could be obtained is if the annual income of
every student in this population could be collected, recorded and analyzed without error.
Since this would take considerable time and money, and since the probability of
collecting the data necessary to determine the true annual salary of the students is small, a
sample of this population will be taken. The sample mean annual salary of the sample of
students will be determined and used to estimate the true mean annual salary of all the
students with majors in the College of Business Administration at Sam Houston State
University.
The study of statistics consists of two types: descriptive statistics and
inferential statistics. Descriptive statistics are characteristics, usually numeric, used to
describe a particular data set. An example of a descriptive statistic would be the average
final exam grade of ten students in an elementary statistics class. This average test score
is used to indicate a “typical value” for the exam grades of the ten students. Inferential
statistics, on the other hand, are similar to descriptive statistics in that each is calculated
from a sample, but the difference is the use of the statistic. In inferential statistics, the
statistic is used to make inference, or make decisions, about the entire population of
interest. In other words, we take a sample and calculate a statistic and use that statistic to
make inference about the actual value of the characteristic in the entire population.
For instance, there are many descriptive characteristics of a firm’s customers that
their management would like to know but this information may be difficult or impossible
to determine. Measurement of each and every customer of a large retail firm is nearly
impossible. Even if the information were gathered, it would be unlikely that it would be
timely.
Unfortunately, managers do not always know what mean (average) weekly
demand for a product will be or what proportion of television viewers will watch a
particular show. Since these parameters of interest are not known, and usually
impossible or impractical to determine, the parameters will be estimated using partial
information gathered from a sample.
For instance, if the desired parameter is the mean annual salary of the income
earning residents of a particular county, a sample of 200 of these residents could be
obtained, the annual salary of each resident (element) in the sample could be
determined, and the mean annual salary of the sample residents calculated. If the sample is drawn in
a random fashion from a frame, or list, of the entire population, and if we use correct
statistical techniques, the sample mean annual salary (a statistic) may be a good estimate
of the true mean annual salary (a parameter) of all the residents of this county.
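To make the idea concrete, here is a minimal Python sketch (with a purely synthetic, hypothetical frame of salaries; the numbers mean nothing) of drawing a simple random sample of n = 200 from a frame and using the sample mean to estimate the population mean:

import random

# Hypothetical frame: annual salaries for a population of N = 50,000 residents
# (synthetic values generated for illustration only).
random.seed(1)
frame = [round(random.lognormvariate(10.5, 0.5), 2) for _ in range(50_000)]

# Draw a simple random sample of n = 200 elements without replacement.
sample = random.sample(frame, 200)

# The sample mean (a statistic) is used to estimate the population mean (a parameter).
sample_mean = sum(sample) / len(sample)
population_mean = sum(frame) / len(frame)

print(f"sample mean estimate: {sample_mean:,.2f}")
print(f"true population mean: {population_mean:,.2f}")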
A population includes all the elements of interest. We use the term “element” to
represent each individual unit of a group in which we have interest. For instance,
elements may refer to people (i.e., customers), records (i.e., all loan accounts at a
particular bank), products (i.e., we are interested in the proportion defective) etc. The
notation used in statistics to represent the population size is “N”. In our example above,
the population of interest would be all the income earning residents of the county. Each
of these residents is an element in our population. If the population of the income
earning residents in the county was 50,000 then N = 50,000. The size of the population,
N, is often not known.
A sample is a subset of the population. The notation for the sample size is “n”.
In our previous example, the sample would be the 200 residents we sampled out of all the
income earning residents in the county. In this case n = 200.
A parameter is a characteristic, usually numeric, of the population. Populations
have many parameters but researchers are often interested in only one or two of these
characteristics. For instance, in our example above, the parameter of interest is the
population mean annual salary of all the income earning residents of the county. The
mean annual salary is but one of many other characteristics of this population that may be
of interest and could also be estimated. The proportion of these residents who support a
particular school bond issue and the mean age of the residents are two examples of other
parameters that may be of interest.
A statistic is a characteristic, usually numeric, of the sample. Samples, like
populations, also have many statistics that may be calculated. For each parameter of a
population, there is a corresponding statistic that may be calculated from a sample. An
important item to remember is that a statistic is a random variable, which means that
different samples may result in different values for the statistic. For instance, in the
example above, the statistic is the sample mean annual income of the 200 residents of the
county. This value is called the “sample mean” because it is calculated from the sample.
Although the sample mean is our “best guess” for the value of the population
mean it is one of many possible values that could be calculated from different samples of
size 200. In other words, there are many samples of 200 that could be collected from the
population of 50,000 residents. Unfortunately, even if we take a random sample of 200,
we could end up with the most affluent 200 residents in the county. The sample mean
calculated from this sample would not be representative of the population. The
possibility of collecting a sample like this cannot be ignored. We will, however, learn to
use statistical techniques that allow us to estimate the probability of getting a value for
the sample statistic that is not a good estimate of the population parameter.
The use of statistics to estimate parameters of interest is not guaranteed to be
successful. If the estimate is not “good” the result could be a faulty decision that, in turn,
could result in loss of time and/or revenue. We must not allow quantitative techniques to
make decisions for us; we must use these techniques only as a tool to assist us in decision
making.
Scale of Data Measurement
Before any statistical technique is employed, a researcher must determine the type
of data that is to be collected. In a general sense, there are two types of data: qualitative
data and quantitative data.
Qualitative data categorizes an element by a non-numeric attribute. For instance,
if we are interested in which political party a resident belongs to, we are categorizing the
resident using qualitative data: Democratic, Republican, Independent, etc. Qualitative
data is often the data we are interested in gathering in the social sciences and particularly
in business. For instance, much of what we want to know in business is related to
attitudes or behavior of consumers. The data is not numeric and therefore more difficult
to analyze. We often calculate the proportion of elements with a particular characteristic
(i.e., the proportion of residents who own their own home) but many techniques cannot
be used on this type of data.
There are two types of qualitative data: nominal data and ordinal data.
Nominal data is, in terms of structure, the lowest form of data. Nominal data is
qualitative data that has no natural order. Examples of nominal data include: gender;
political affiliation; type of car owned; product model; etc. Data comprised of “numbers”
can also be qualitative data. Zip codes, area codes, telephone numbers are examples of
data that are qualitative. In math terms, these data are not “real” numbers because they
do not represent numeric measures. One way to determine whether “numbers” are
numeric measures is to consider whether one might be interested in an average of these
“numbers”. If a number can be replaced with letters, words or symbols without losing
any information then this indicates that a “number” is NOT a numeric measure. Ordinal
data is qualitative data that has a natural order. Examples of ordinal data include:
military rank; size of clothing using S, M, L, XL; place in which a race was finished;
condition of a used appliance using POOR, AVERAGE, GOOD, EXCELLENT; etc.
While ordinal data has an order, the intervals between the rankings are not equal
intervals. Thus, while ordinal data has more structure than nominal data, math functions
on the data, such as differences, are not valid.
Quantitative data categorizes an element by a numeric measure. Quantitative
data are true numbers and, as a result, more quantitative techniques are available for use
with this data. Quantitative data can be divided into two types of data: interval data and
ratio data. Interval data is quantitative data that has no natural starting point or zero
level. Examples of interval data include Fahrenheit temperature and scores on IQ tests.
Each of these types of data is a numeric measure but neither has a natural starting point or
zero level. Zero degrees Fahrenheit is not the absence of temperature just as there is no
zero level for a test of intelligence. Interval data can be used for any technique that
requires quantitative data, however, we must realize that ratios have no meaning with this
type of data since there is no natural zero level. For example, 50 degrees Fahrenheit is
not twice as warm as 25 degrees Fahrenheit. Ratio data is quantitative data that has a
natural starting point or zero level. Most quantitative data falls into this scale of data
measurement. Examples of ratio scaled data include height, weight, rate of return, net
income, etc. Since there is a natural zero level, ratios have meaning.
Measures of Central Tendency
Once we have decided the type of data that we are going to collect, we must
determine the type of techniques that are appropriate for analyzing the data. The first
organizational technique we will most likely perform is to order the data from smallest
value to largest value. We order the data to get an idea about the range of the values
observed. Consider a particular example: if we have collected annual income figures
from 1,000 households, what might we be interested in knowing about this data? Perhaps
we would be interested in a typical annual income value for the data set. Typical values
are often referred to as Measures of Central Tendency. Measures of central tendency
are attempts to identify typical values which are representative of the 1,000 observations
collected. The three most common measures of central tendency are the mean, the
median and the mode. All three of these measures are referred to as “average” or
“typical” values although they are each different measures of typical.
The first, and most popular, measure of central tendency is the arithmetic mean,
hereafter referred to as simply the mean. The mean is calculated as the sum of the
observations divided by the number of observations. The sample mean is denoted x̄ and
the formula for calculating the sample mean is:

    x̄ = (Σx) / n

The population or true mean is denoted μ (the Greek script letter “mu”) and is calculated
the same way as the sample mean except that all elements in the population are measured.
The mean requires at least interval scaled data which means it is only valid for
true numeric measures. The mean is often referred to as the “gravitational center of the
data set” which is similar to the balancing point of the data. If equal weights were
placed on a scale representing a number line for each observation in a data set, the mean
would be the point at which the scale balances. Since each observation has an equal
weight, the magnitude of the values influences the mean. The mean, while certainly the
most commonly used measure of central tendency, is not always a good measure of
“typical.” For instance, data sets that include extreme values relative to the rest of the
data “pull” the mean in that direction. Extremely small values cause the mean to be
“small” and extremely large values cause the mean to be “large.” The result is that the
mean is not a “good” measure of typical and in fact, may be larger or smaller than all
values except the extreme one. When extreme values occur in a data set, we often use
another measure of typical referred to as the median. For instance, a typical income is
often best expressed as the median income rather than the mean income, since income
has a lower limit (zero) but no upper limit.
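As a quick illustration in Python (made-up income figures), a single extreme value pulls the mean well above most of the observations, while the median, discussed next, stays near the center:

# Hypothetical annual incomes (in dollars); the last value is an extreme outlier.
incomes = [32_000, 35_000, 38_000, 41_000, 44_000,
           47_000, 52_000, 55_000, 58_000, 950_000]

mean = sum(incomes) / len(incomes)          # pulled upward by the outlier

ordered = sorted(incomes)
n = len(ordered)
# Even number of observations: the median is the mean of the two center values.
median = (ordered[n // 2 - 1] + ordered[n // 2]) / 2

print(f"mean:   {mean:,.0f}")    # 135,200 -- larger than 9 of the 10 incomes
print(f"median: {median:,.0f}")  # 45,500 -- a more "typical" value here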
The median is the second most commonly used measure of central tendency and
is referred to as the positional average. The median is the center value in an ordered
data set. If the data set has an odd number of observations then the median is the value
found in the center of the distribution of ordered values. If the sample set has an even
number of values then the median is the mean of the two values surrounding the center of
the data set. The median is also P50, the fiftieth percentile. This means that 50% or half
of the values are smaller than the median and half of the values or 50% are greater than
the median. The procedure for finding the median is:
1. Order the data set from smallest to largest (or largest to smallest). NOTE:
this requires that the data can be ordered so the median cannot be found for
nominal data.
2. Find i, which is the location or position of the median. This position can be
calculated by using the following formula: i = (n + 1) / 2, where n is the size of
the sample.
3. If i is an integer then the median is the value found at the ith position in the
ordered data set. If i is not an integer, then the median is the mean of the two
values surrounding the ith position.
The median is often denoted as M or x̃.
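A minimal Python sketch of this three-step procedure, using small made-up data sets:

def median(values):
    """Median via the ordered-position procedure described above."""
    ordered = sorted(values)              # step 1: order the data
    n = len(ordered)
    i = (n + 1) / 2                       # step 2: position of the median
    if i == int(i):                       # step 3: i is an integer
        return ordered[int(i) - 1]        # (positions are 1-based, as in the text)
    lower, upper = int(i), int(i) + 1     # step 3: i is not an integer
    return (ordered[lower - 1] + ordered[upper - 1]) / 2

print(median([7, 1, 5, 3, 9]))        # n = 5, i = 3, median = 5
print(median([7, 1, 5, 3, 9, 11]))    # n = 6, i = 3.5, median = (5 + 7) / 2 = 6.0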
The last of the more common Measures of Central Tendency is called the mode.
The mode is the most commonly occurring value in a data set, in other words, the value
that occurs with the greatest frequency. The mode, unlike either the mean or the median,
does not have to be unique. A data set can have more than one mode or no mode at all.
A data set with: one mode is referred to as unimodal; two modes is referred to as
bimodal; and three or more modes is referred to as multimodal. There is no universal
notation for the mode and the mode is valid for any type of data.
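A short Python sketch of finding the mode(s) by counting frequencies (hypothetical data):

from collections import Counter

def modes(values):
    """Return every value that occurs with the greatest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                       # every value occurs once: no mode
    return [v for v, c in counts.items() if c == top]

print(modes([2, 4, 4, 5, 7]))           # [4]     -- unimodal
print(modes([2, 4, 4, 5, 5, 7]))        # [4, 5]  -- bimodal
print(modes([1, 2, 3]))                 # []      -- no mode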
Measures of Data Variation
Besides a measure of “typical,” what else might we want to know about a data
set? Do the measures of central tendency tell us all we need to “know” about the
observations we have collected? Certainly not; in fact, two data sets could have the same
mean and be completely different in terms of dispersion. Consider that we “know” the
mean depth of a lake where we plan our next office picnic. Suppose the mean depth of
the lake is 4 feet; is this all we need to know about the depth of this lake? No. We need
to know how much the values (depth) vary around 4 feet. The depth of the lake could
be 4 feet at every point and have a mean of 4 feet or the depth of the lake could vary
greatly around four feet and still have a mean of 4 feet. There could be places where the
depth is a few inches and other places where the depth is 10 feet. This information about
how the data are dispersed is very important (especially for those of us who cannot
swim). The study of statistics could appropriately be referred to as the study of
variability since many of the techniques employ the comparison of the variability of
typical values in different groups to determine whether or not these values are the same
or different between groups.
Measures of Data Variation (variability, dispersion, or
spread) are attempts to describe how spread out, or how much the values vary, in a
particular data set. All measures of data variation or dispersion require quantitative
data to calculate and are nonnegative. The measures of data variation are zero (if all the
values are equal) or positive. A “large” measure of spread indicates a more dispersed
data set while a “small” measure indicates a more tightly grouped data set.
The easiest measure of spread to calculate is the range. The range is the
difference between the largest or maximum value and the smallest or minimum value.
The notation and formula for the range is: R = H - L, where H is the largest or
maximum value and L is the smallest or minimum value. The range, while simple to
calculate, is only informative if it is “small.” “Small” and “large” are relative terms and
must be determined relative to the magnitude of the values measured. For instance, a
range of $3 for dinner could be characterized as “small” if we are eating at a five-star
restaurant in a pricey hotel in New York City where the dinner entrees range in price
from $12.00 to $35.00 but may be characterized as “large” if we’re eating at a local fast-food
restaurant. If the range is “small” it means that the two extreme values are very
close to each other, so the rest of the values must also be tightly grouped. If the range is
“large” we know that the extreme values are a long way from each other but we know
nothing about the distribution of the rest of the observations. Since the range only uses
two values in its calculation, we are provided with limited information.
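In code, the range is a one-line calculation (hypothetical dinner prices):

dinner_prices = [12.50, 14.00, 13.25, 15.50, 12.00]   # hypothetical prices
R = max(dinner_prices) - min(dinner_prices)            # R = H - L
print(R)                                                # 3.5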
Like our favorite measure of central tendency, the mean, we might like to come
up with a measure of variability that incorporates all the values in the data set as opposed
to using only the two values needed to calculate the range. We might be interested in
finding out, on the average, how much the values vary around a “typical value.” In an
effort to describe the variability of a data set we could measure the distance each value is
from the mean, our standard measure of “typical.” The distance a value is from the mean
is called the “deviation from the mean” and is found by subtracting the mean from a
particular value. This deviation from the mean can be negative (if the value is smaller
than the mean), positive (if the value is bigger than the mean), or zero (if the value is
equal to the mean). To calculate the average deviation from the mean, we could sum
the deviations from the mean for each value in the data set and divide by the number of
observations in our sample. Unfortunately, although a good idea intuitively, this value
will always be zero. Since the mean is the gravitational center of the data set, the
deviations from the mean sum to zero, and so the average deviation would be zero (0):

    Σ(x - x̄) / n = 0.
This occurs because the deviations from the
mean that are negative offset the deviations from the mean that are positive. We can
avoid this problem by using the absolute value or square of the deviations from the mean.
The Mean Absolute Deviation (MAD) is the sum of the absolute deviations
from the mean divided by the sample size:

    MAD = Σ|x - x̄| / n

The MAD is used in
financial analysis to determine the variability in stock prices from the expected price.
Unfortunately, while the MAD is the “best” measure of spread for descriptive purposes, it
is not useful for inferential statistics since the distribution of an absolute value function is
not smooth.
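A minimal Python sketch of the MAD calculation (hypothetical prices):

def mean_absolute_deviation(values):
    """MAD: the average absolute distance of the values from their mean."""
    m = sum(values) / len(values)
    return sum(abs(x - m) for x in values) / len(values)

prices = [22, 25, 19, 28, 26]           # hypothetical closing prices
print(mean_absolute_deviation(prices))  # mean = 24; deviations 2, 1, 5, 4, 2 -> MAD = 2.8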
The sample variance, denoted s², is the sum of the squared deviations from the
mean divided by the sample size less one (n - 1). Continuing our effort to find an average
deviation from the mean, we square the deviations from the mean to eliminate any
negative values so our numerator is not equal to zero, and then divide by the sample size
less one. Our denominator is made smaller (hence our variance is made larger) as an
adjustment to our estimate for the true population variance, denoted σ² (sigma squared),
since we calculate the sample variance, s², using the sample mean, x̄, instead of the true
population mean, μ (mu). The true measure of variability for the population should be
calculated according to each value’s distance from μ, the population mean. The
adjustment in the denominator makes our estimate larger than without the adjustment to
account for the estimate (x̄) used in the numerator. Since we would prefer to have a
“small” measure of variability because this indicates that the mean, x̄, is a good measure
of “typical” since most of the values are “close to” the mean, adjusting our estimate for
the variance to be larger is considered to be conservative. We are unsure of the true value
of the mean so we use the value of the sample mean to estimate the variability in the data.
The deviations from the mean are estimated using deviations from the sample mean. It is
said that we lose one degree of freedom (df) in the denominator for every estimate in the
numerator. All variances are of the form: sum of squares divided by degrees of
freedom.
The problem with the variance is that the value is in squared units. For instance,
if we are measuring the dollar amount spent on lunch, the variance will be in dollars
squared. Since squared units make interpretation difficult, we normally take the square
root of the variance to return to the original units of measurement. The positive square
root of the sample variance, s², is the sample standard deviation, s. The sample
standard deviation, s, is our estimate for the true population standard deviation,
denoted σ (sigma), which is the positive square root of the population variance, σ². The
definitional formula for the sample variance, s², is given below followed by an algebraic
manipulation which we call the computation formula. The computational formula is
easier and faster to calculate but intuitively the definitional formula makes more sense as
our estimate of the “average” (squared) deviation from the mean.
    s² = Σ(x - x̄)² / (n - 1) = [Σx² - (Σx)² / n] / (n - 1) = the sample variance

    s = √s² = the sample standard deviation
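The following Python sketch (hypothetical data) computes the sample variance with both the definitional and the computational formula, confirming they agree, and then takes the square root to obtain the sample standard deviation:

from math import sqrt

data = [4, 7, 6, 9, 4]                     # hypothetical sample, n = 5, mean = 6
n = len(data)
x_bar = sum(data) / n

# Definitional formula: sum of squared deviations from the mean over (n - 1).
s2_definitional = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Computational formula: [sum of x^2 - (sum of x)^2 / n] over (n - 1).
s2_computational = (sum(x ** 2 for x in data) - sum(data) ** 2 / n) / (n - 1)

s = sqrt(s2_definitional)                  # sample standard deviation

print(s2_definitional, s2_computational, round(s, 3))   # 4.5 4.5 2.121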
Although we rarely calculate parameters, the following formulae are given for the
population variance and the population standard deviation.
    σ² = Σ(x - μ)² / N = [Σx² - (Σx)² / N] / N = the population variance

    σ = √σ² = the population standard deviation.
Uses of the Standard Deviation
The standard deviation of a sample is an attempt to estimate the typical distance
that values in the data set differ from the mean. We use the standard deviation as the
step-size to estimate the percentage of values that lie within one, two, or three steps
of the mean. For example, Chebyshev’s Theorem, which applies to any distribution
regardless of its shape, states that within k standard deviations of the mean, at
least (1 - 1/k²) × 100% of the values will fall. Since Chebyshev’s Theorem applies to any
distribution regardless of shape, the information learned is less specific than we might
like. In other words, using the formula, we would discover that at least 75% of the
observations (in any distribution) lie within 2 standard deviations of the mean. This
means that 75%-100% of the values will fall within two standard deviations of the mean.
While some information is better than none, we would like to be more precise in our
estimate of this percentage. For certain known distributions, we can more precisely
estimate the percentage of values that lie within one, two or three standard deviations of
the mean.
The Empirical Rule, which only applies to a normal distribution, provides us
with much more information about this particular distribution than Chebyshev’s
Theorem. The Empirical Rule states that for any normal distribution, approximately 68%
of the values will fall within one standard deviation of the mean, approximately 95% of
the values will fall within two standard deviations of the mean, and approximately 99.7%
of the values will fall within three standard deviations of the mean. This much more
precise information is only true for data distributed normally. The normal distribution,
sometimes referred to as the Gaussian distribution after Carl Gauss, who discovered that
certain errors follow the normal distribution, is bell-shaped and symmetrical, and models the
behavior of many random variables. We will discuss the normal distribution as well as
its probability distribution later in the course.
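A small Python sketch comparing the two statements; the Empirical Rule percentages are the usual approximations for a normal distribution:

def chebyshev_lower_bound(k):
    """Minimum fraction of values within k standard deviations of the mean (any distribution, k > 1)."""
    return 1 - 1 / k ** 2

# Approximate fractions within k standard deviations for a normal distribution.
empirical_rule = {1: 0.68, 2: 0.95, 3: 0.997}

for k in (2, 3):
    print(f"k = {k}: Chebyshev guarantees at least {chebyshev_lower_bound(k):.1%}; "
          f"the Empirical Rule says about {empirical_rule[k]:.1%}")
# k = 2: Chebyshev guarantees at least 75.0%; the Empirical Rule says about 95.0%
# k = 3: Chebyshev guarantees at least 88.9%; the Empirical Rule says about 99.7%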
Measures of Position or Location
Measures of central tendency and measures of data variation are single values used to
describe an entire data set. Measures of position or location are measures of an individual
value and indicate the relative position of that value to the other values in the data set. A
commonly used measure of position is a percentile. Aptitude tests often provide an
individual’s percentile ranking to let them know how they did relative to others who took
the test. To determine what test score exceeds a certain percentage of test scores, we first
divide our data set into 100 equal parts and then count in to determine the location of the
value that corresponds to the percentile we are interested in.
The kth percentile, Pk, is that value which is equal to or greater than k% of the
observations and is less than or equal to the remaining (100 - k)% of the observations.
The procedure for calculating the kth percentile is:
1. Order the data from smallest to largest value.
2. Find nk/100, where n is the sample size and k is the percentile you are
calculating.
3. (a) if nk/100 is not an integer, then i, the position of the kth percentile, will be
the next larger integer. For example, if nk/100 = 4.5 then i = 5.
(b) if nk/100 is an integer, then i, the position of the kth percentile, will be
nk/100 + .5. For example, if nk/100 = 6 then i = 6.5.
4. (a) if i is an integer (3a above) then the kth percentile is the value found at the
ith position. For example, in 3a above, i = 5, so the kth percentile is the 5th
value in the ordered data set.
(b) if i is not an integer (3b above) then the kth percentile is the mean of the two
values surrounding the ith position. For example, in 3b above, i = 6.5, so
the kth percentile is the mean of the sixth and seventh values in the ordered
data set.
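Here is a minimal Python implementation of this four-step procedure (the test scores in the example are made up):

def percentile(values, k):
    """kth percentile using the position rule in steps 1-4 above."""
    ordered = sorted(values)                 # step 1
    n = len(ordered)
    pos = n * k / 100                        # step 2: nk / 100
    if pos != int(pos):
        i = int(pos) + 1                     # step 3(a): next larger integer
        return ordered[i - 1]                # step 4(a): value at position i
    i = pos + 0.5                            # step 3(b): nk/100 + .5
    lower = int(i - 0.5)                     # step 4(b): mean of the two surrounding values
    return (ordered[lower - 1] + ordered[lower]) / 2

scores = [55, 62, 64, 68, 71, 75, 78, 82, 88, 93]   # hypothetical test scores, n = 10
print(percentile(scores, 25))   # nk/100 = 2.5 -> i = 3 -> 64
print(percentile(scores, 50))   # nk/100 = 5   -> i = 5.5 -> (71 + 75) / 2 = 73.0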
Sometimes, instead of being interested in what data point has a certain percentage
above it or below it, researchers are interested in determining the value that is “typical”
for the “center” group of values. For example, suppose we are charged with the
responsibility of developing the curriculum for a kindergarten class. The students in a
class of kindergarteners could differ tremendously in terms of acquired knowledge.
Suppose, in an effort to develop the curriculum, we give each student in the class an
aptitude test to measure his/her abilities in basic knowledge. The scores may vary
greatly since some of the students may have attended preschool since they were very
young while others may not have attended at all. If we do not have the resources to have
multi-level curriculum, then we would develop a curriculum that was targeted at those “in
the middle” in terms of their aptitude scores. Since we are interested in targeting the
center of the distribution of aptitude scores, we will determine what constitutes the
“middle 50%” and gear our curriculum at those students.
Quartiles, which are just specific percentiles, allow us to divide our data into four
equal groups. The first or lower quartile, Q1, is equal to the 25th percentile, P25. The
second or mid-quartile, Q2, is equal to the 50th percentile, P50, which is also the median,
M. The third or upper quartile, Q3, is equal to the 75th percentile, P75. We use these
quartiles to help us determine characteristics of the middle 50% of our data. For
example, the Interquartile Range (IQR), is the range of the middle 50% of the data.
Like the range, the IQR is a measure of data variation or dispersion but instead of
indicating the range of all the data like the range does, the IQR indicates the range of only
the middle 50%. Like other Measures of Data Variation, the IQR requires quantitative
data to calculate. The formula for the IQR is: IQR = Q3 - Q1. To calculate the IQR, the
first and third quartiles are determined by finding the corresponding percentile, i.e.,
Q3=P75 and Q1=P25.
The Mid-Quartile Range, (MQR), is a statistic we calculate to determine a
“typical” value in the middle group of observations. The MQR is a Measure of Central
Tendency and is the mean of the extreme values of the middle 50% of the observations.
It is not the mean of all observations in the middle 50%, but instead we find the mean of
the first and third quartiles. The formula for the MQR is: MQR = (Q1 + Q3) / 2.
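Continuing with the same ten hypothetical scores used in the percentile sketch above, the quartiles, the IQR, and the MQR can be computed as follows:

scores = sorted([55, 62, 64, 68, 71, 75, 78, 82, 88, 93])   # hypothetical scores, n = 10

q1 = scores[2]                 # Q1 = P25: 10*25/100 = 2.5 -> position 3 -> 64
q3 = scores[7]                 # Q3 = P75: 10*75/100 = 7.5 -> position 8 -> 82

iqr = q3 - q1                  # IQR = Q3 - Q1 = 18, the spread of the middle 50%
mqr = (q1 + q3) / 2            # MQR = (Q1 + Q3) / 2 = 73.0, a "typical" middle value

print(q1, q3, iqr, mqr)        # 64 82 18 73.0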
Another measure of position or location is called the Z-score or Z value. The Z-score
for a particular value in a data set indicates the number of standard deviations
that value is from the mean. Z-scores can be negative (if the value is less than the
mean), positive (if the value is larger than the mean), or equal to zero (if the value is
equal to the mean). The Z-score for the mean is always zero. For example, a value with
a Z-score of 1.35 is 1.35 standard deviations above the mean. A value with a Z-score of
–2.12 is 2.12 standard deviations below the mean.
Z-values can be calculated, and a Standard Normal Table used, to determine
approximately what proportion of the values, for a normal distribution, are above or
below a particular value, or between two values in a distribution.
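A brief sketch of the Z-score calculation, using a made-up sample mean and standard deviation:

def z_score(x, mean, std_dev):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

# Suppose a sample of exam scores has mean 70 and standard deviation 8 (made-up values).
print(round(z_score(80.8, 70, 8), 2))   #  1.35 -> 1.35 standard deviations above the mean
print(round(z_score(53.04, 70, 8), 2))  # -2.12 -> 2.12 standard deviations below the mean
print(round(z_score(70.0, 70, 8), 2))   #  0.0  -> the mean itself always has Z = 0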
Frequency Distributions
Terminology:
Defn: The frequency, f, for a value or a class of values is the number of
times that value or class of values occurs in the data set.
We are simply counting how often a value or set of values occurs in the data set.
1. What is the minimum number of times a value or class of values occur(s) in a data
set? The minimum number of times a value or class of values can occur is zero
(0). What is the maximum number of times a value or class of values can occur in
the data set? The maximum number of times a value or class of values can occur
in the data set is n, or the total number of values in the data set.
0 ≤ f ≤ n
2. If we add the frequencies for each value or set of values it will sum to n.
Σf = n
Defn: The relative frequency, f/n, (how often the value occurs divided by the
total number of observations—gives you a proportion of times a value or class of
values occurs) for a value or a class of values is the proportion of time that a value
or class of values occurs in the data set.
1. What is the minimum proportion of time a value or class of values occur(s) in a
data set? The minimum proportion of time a value or class of values can occur is
zero (0). What is the maximum proportion of time a value or class of values can
occur in the data set? The maximum proportion of time a value or class of values
can occur in the data set is one (1).
0 ≤ f/n ≤ 1
2. If we add the relative frequencies for each value or set of values it will sum to one
(1).
Σ(f/n) = 1
Defn: The cumulative frequency, F, for a value or a class of values is the
number of times that value or any smaller value occurs in the data set.
We are simply keeping a running total.
1. Cumulative frequencies are non-decreasing (this means the values cannot
decrease—they can level off but they can’t go down).
2. The cumulative frequency for the last value or class of values is n.
3. We must have at least ordinal scaled data to find cumulative frequencies.
Defn: The cumulative relative frequency, F/n, for a value or a class of values
is the proportion of time that value or any smaller value occurs in the data set.
We are simply keeping a running total of relative frequencies or proportions.
1. Cumulative relative frequencies are non-decreasing.
2. The cumulative relative frequency for the last value or class of values is one (1).
3. We must have at least ordinal scaled data to find cumulative relative frequencies.
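To tie the four definitions together, a small Python sketch that builds a frequency table for an ordinal data set (hypothetical clothing sizes):

from collections import Counter

# Hypothetical ordinal data: clothing sizes sold, with natural order S < M < L < XL.
sizes_sold = ["M", "S", "L", "M", "XL", "M", "L", "S", "M", "L"]
order = ["S", "M", "L", "XL"]

counts = Counter(sizes_sold)
n = len(sizes_sold)

F = 0
print("value  f   f/n    F   F/n")
for value in order:
    f = counts[value]          # frequency
    F += f                     # cumulative frequency (running total)
    print(f"{value:<5} {f:>2}  {f/n:.2f}  {F:>3}  {F/n:.2f}")

# Check against the definitions: the frequencies sum to n,
# and the final cumulative relative frequency is 1.
assert sum(counts.values()) == n and F / n == 1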