©Dr. Valerie P. Muehsam, 2006

Quantitative Techniques in Business

Introduction to Statistics

In the business world, and in fact in practically every aspect of daily living, quantitative techniques are used to assist in decision making. Why? Unlike the classroom, in the “real world” there is often not enough information available to guarantee a correct decision. For instance, if advertisers would like to know how many households in the United States with televisions are tuned to a particular television show at a particular date and time, this would be impossible to determine without the complete cooperation of every household and an astonishing amount of time and money. If a consumer protection agency wanted to determine the true proportion of prescription drug users who also use herbal, non-regulated, over-the-counter supplements, this information would most likely not be available. As a result of the inability to determine characteristics of interest directly, the application of statistics and other quantitative techniques has developed.

Statistics is defined as the process of collecting a sample and organizing, analyzing and interpreting data. The numeric values which represent the characteristics analyzed in this process are also referred to as statistics. When information related to a particular group is desired, and it is impossible or impractical to obtain this information, a sample, or subset, of the group is obtained and the information of interest is determined for the subset. For instance, if someone is interested in the average annual income of all the students with majors in the College of Business Administration at Sam Houston State University, the only way this information could be obtained exactly is if the annual income of every student in this population could be collected, recorded and analyzed without error. Since this would take considerable time and money, and since the probability of collecting the data necessary to determine the true annual income of the students is small, a sample of this population will be taken. The mean annual income of the sample of students will be determined and used to estimate the true mean annual income of all the students with majors in the College of Business Administration at Sam Houston State University.

The study of statistics consists of two types: descriptive statistics and inferential statistics. Descriptive statistics are characteristics, usually numeric, used to describe a particular data set. An example of a descriptive statistic would be the average final exam grade of ten students in an elementary statistics class. This average test score is used to indicate a “typical value” for the exam grades of the ten students. Inferential statistics, on the other hand, are similar to descriptive statistics in that each is calculated from a sample; the difference is the use of the statistic. In inferential statistics, the statistic is used to make inferences, or decisions, about the entire population of interest. In other words, we take a sample, calculate a statistic, and use that statistic to make inferences about the actual value of the characteristic in the entire population. For instance, there are many descriptive characteristics of a firm’s customers that its management would like to know, but this information may be difficult or impossible to determine. Measurement of each and every customer of a large retail firm is nearly impossible. Even if the information were gathered, it is unlikely that it would be timely.
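As a small illustration of a descriptive statistic, the sketch below computes the average final exam grade for a class of ten students. The ten grades are hypothetical values chosen only for this example.

```python
import statistics

# Hypothetical final exam grades for ten students in an elementary statistics class
grades = [72, 85, 90, 68, 77, 93, 81, 88, 79, 84]

# The average grade is a descriptive statistic: it describes this particular
# data set and indicates a "typical value" for the ten exam grades.
average_grade = statistics.mean(grades)
print(f"Average final exam grade: {average_grade:.1f}")  # 81.7
```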
Unfortunately, managers do not always know what the mean (average) weekly demand for a product will be or what proportion of television viewers will watch a particular show. Since these parameters of interest are not known, and are usually impossible or impractical to determine, the parameters will be estimated using partial information gathered from a sample. For instance, if the desired parameter is the mean annual salary of the income earning residents of a particular county, a sample of 200 of these residents could be obtained, the annual salary of each resident (element) in the sample could be determined, and the mean annual salary of the sampled residents calculated. If the sample is drawn in a random fashion from a frame, or list, of the entire population, and if we use correct statistical techniques, the sample mean annual salary (a statistic) may be a good estimate of the true mean annual salary (a parameter) of all the residents of this county.

A population includes all the elements of interest. We use the term “element” to represent each individual unit of a group in which we have interest. For instance, elements may refer to people (e.g., customers), records (e.g., all loan accounts at a particular bank), products (e.g., when we are interested in the proportion defective), etc. The notation used in statistics to represent the population size is “N”. In our example above, the population of interest would be all the income earning residents of the county. Each of these residents is an element in our population. If the population of income earning residents in the county was 50,000, then N = 50,000. The size of the population, N, is often not known.

A sample is a subset of the population. The notation for the sample size is “n”. In our previous example, the sample would be the 200 residents we sampled out of all the income earning residents in the county. In this case n = 200.

A parameter is a characteristic, usually numeric, of the population. Populations have many parameters, but researchers are often interested in only one or two of these characteristics. For instance, in our example above, the parameter of interest is the population mean annual salary of all the income earning residents of the county. The mean annual salary is but one of many characteristics of this population that may be of interest and could be estimated. The proportion of these residents who support a particular school bond issue and the mean age of the residents are two examples of other parameters that may be of interest.

A statistic is a characteristic, usually numeric, of the sample. Samples, like populations, also have many statistics that may be calculated. For each parameter of a population, there is a corresponding statistic that may be calculated from a sample. An important item to remember is that a statistic is a random variable, which means that different samples may result in different values for the statistic. For instance, in the example above, the statistic is the sample mean annual income of the 200 sampled residents of the county. This value is called the “sample mean” because it is calculated from the sample. Although the sample mean is our “best guess” for the value of the population mean, it is one of many possible values that could be calculated from different samples of size 200. In other words, there are many samples of 200 that could be collected from the population of 50,000 residents.
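To see that a statistic is a random variable, the sketch below simulates the county example: it builds a hypothetical population of 50,000 incomes, draws several random samples of n = 200, and computes the sample mean of each. The incomes are invented for illustration; the point is only that each sample yields a somewhat different estimate of the population mean.

```python
import random
import statistics

random.seed(1)

# Hypothetical population: annual incomes for N = 50,000 income earning residents.
# A real study would draw from a frame (list) of actual residents instead.
N = 50_000
population = [max(0, random.gauss(48_000, 15_000)) for _ in range(N)]
mu = statistics.mean(population)           # the parameter (normally unknown)

n = 200
for trial in range(5):
    sample = random.sample(population, n)  # simple random sample without replacement
    x_bar = statistics.mean(sample)        # the statistic: one of many possible values
    print(f"sample {trial + 1}: x-bar = {x_bar:,.0f}")

print(f"population mean mu = {mu:,.0f}")
```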
Unfortunately, even if we take a random sample of 200, we could end up with the most affluent 200 residents in the county. The sample mean calculated from this sample would not be representative of the population. The possibility of collecting a sample like this cannot be ignored. We will, however, learn to use statistical techniques that allow us to estimate the probability of getting a value for the sample statistic that is not a good estimate of the population parameter. The use of statistics to estimate parameters of interest is not guaranteed to be successful. If the estimate is not “good,” the result could be a faulty decision that, in turn, could result in loss of time and/or revenue. We must not allow quantitative techniques to make decisions for us; we must use these techniques only as tools to assist us in decision making.

Scale of Data Measurement

Before any statistical technique is employed, a researcher must determine the type of data that is to be collected. In a general sense, there are two types of data: qualitative data and quantitative data. Qualitative data categorizes an element by a non-numeric attribute. For instance, if we are interested in which political party a resident belongs to, we are categorizing the resident using qualitative data: Democratic, Republican, Independent, etc. Qualitative data is often the data we are interested in gathering in the social sciences, and particularly in business. For instance, much of what we want to know in business is related to the attitudes or behavior of consumers. The data is not numeric and is therefore more difficult to analyze. We often calculate the proportion of elements with a particular characteristic (e.g., the proportion of residents who own their own home), but many techniques cannot be used on this type of data.

There are two types of qualitative data: nominal data and ordinal data. Nominal data is, in terms of structure, the lowest form of data. Nominal data is qualitative data that has no natural order. Examples of nominal data include gender, political affiliation, type of car owned, product model, etc. Data comprised of “numbers” can also be qualitative data. Zip codes, area codes and telephone numbers are examples of data that are qualitative. In math terms, these data are not “real” numbers because they do not represent numeric measures. One way to determine whether “numbers” are numeric measures is to consider whether one might be interested in an average of these “numbers”. If a number can be replaced with letters, words or symbols without losing any information, then this indicates that the “number” is NOT a numeric measure.

Ordinal data is qualitative data that has a natural order. Examples of ordinal data include military rank; size of clothing using S, M, L, XL; the place in which a race was finished; the condition of a used appliance using POOR, AVERAGE, GOOD, EXCELLENT; etc. While ordinal data has an order, the intervals between the rankings are not equal intervals. Thus, while ordinal data has more structure than nominal data, math functions on the data, such as differences, are not valid.
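Because ordinal data has an order but no meaningful arithmetic, it is often stored as labels together with an explicit ranking. The sketch below, a minimal illustration using made-up appliance ratings and a hypothetical rank mapping, shows that the labels can legitimately be sorted even though differences between them are not valid.

```python
# Ordinal data: the categories have a natural order, but the "distances"
# between adjacent categories are not equal, so arithmetic on ranks is invalid.
condition_rank = {"POOR": 1, "AVERAGE": 2, "GOOD": 3, "EXCELLENT": 4}

# Hypothetical conditions of five used appliances
ratings = ["GOOD", "POOR", "EXCELLENT", "AVERAGE", "GOOD"]

# Sorting by rank is legitimate for ordinal data...
print(sorted(ratings, key=condition_rank.get))
# ['POOR', 'AVERAGE', 'GOOD', 'GOOD', 'EXCELLENT']

# ...but a difference such as rank("GOOD") - rank("AVERAGE") has no meaning:
# the "step" from POOR to AVERAGE need not equal the step from GOOD to EXCELLENT.
```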
Quantitative data categorizes an element by a numeric measure. Quantitative data are true numbers and, as a result, more quantitative techniques are available for use with this data. Quantitative data can be divided into two types: interval data and ratio data. Interval data is quantitative data that has no natural starting point or zero level. Examples of interval data include Fahrenheit temperature and scores on IQ tests. Each of these is a numeric measure, but neither has a natural starting point or zero level. Zero degrees Fahrenheit is not the absence of temperature, just as there is no zero level for a test of intelligence. Interval data can be used for any technique that requires quantitative data; however, we must realize that ratios have no meaning with this type of data since there is no natural zero level. For example, 50 degrees Fahrenheit is not twice as warm as 25 degrees Fahrenheit. Ratio data is quantitative data that has a natural starting point or zero level. Most quantitative data falls into this scale of data measurement. Examples of ratio scaled data include height, weight, rate of return, net income, etc. Since there is a natural zero level, ratios have meaning.

Measures of Central Tendency

Once we have decided the type of data that we are going to collect, we must determine the techniques that are appropriate for analyzing the data. The first organizational technique we will most likely perform is to order the data from smallest value to largest value. We order the data to get an idea about the range of the values observed. Consider a particular example: if we have collected annual income figures from 1,000 households, what might we be interested in knowing about this data? Perhaps we would be interested in a typical annual income value for the data set. Typical values are often referred to as Measures of Central Tendency. Measures of central tendency are attempts to identify typical values which are representative of the 1,000 observations collected. The three most common measures of central tendency are the mean, the median and the mode. All three of these measures are referred to as “average” or “typical” values, although they are each different measures of “typical.”

The first, and most popular, measure of central tendency is the arithmetic mean, hereafter referred to as simply the mean. The mean is calculated as the sum of the observations divided by the number of observations. The sample mean is denoted $\bar{x}$ and the formula for calculating the sample mean is:

$$\bar{x} = \frac{\sum x}{n}$$

The population or true mean is denoted $\mu$ (the Greek letter “mu”) and is calculated the same way as the sample mean, except that all elements in the population are measured. The mean requires at least interval scaled data, which means it is only valid for true numeric measures. The mean is often referred to as the “gravitational center of the data set,” which is the balancing point of the data. If equal weights were placed on a scale representing a number line for each observation in a data set, the mean would be the point at which the scale balances. Since each observation has an equal weight, the magnitudes of the values influence the mean.

The mean, while certainly the most commonly used measure of central tendency, is not always a good measure of “typical.” For instance, data sets that include extreme values relative to the rest of the data “pull” the mean in that direction. Extremely small values cause the mean to be “small” and extremely large values cause the mean to be “large.” The result is that the mean is not a “good” measure of typical and, in fact, may be larger or smaller than all values except the extreme one. When extreme values occur in a data set, we often use another measure of typical referred to as the median.
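The sketch below, using a small invented data set, shows how a single extreme value pulls the mean: after the outlier is added, the mean is larger than every observation except the extreme one, while the median barely moves.

```python
import statistics

incomes = [28_000, 31_000, 33_000, 35_000, 38_000]
print(statistics.mean(incomes))    # 33000.0
print(statistics.median(incomes))  # 33000

# Add one extremely large value: the mean is "pulled" toward it.
incomes.append(1_000_000)
print(statistics.mean(incomes))    # about 194167, larger than all but the outlier
print(statistics.median(incomes))  # 34000.0, nearly unchanged
```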
For instance, a typical income is often best expressed as the median income rather than the mean income, since there is a lower limit (zero) but not an upper limit on income. The median is the second most commonly used measure of central tendency and is referred to as the positional average. The median is the center value in an ordered data set. If the data set has an odd number of observations, then the median is the value found in the center of the distribution of ordered values. If the data set has an even number of values, then the median is the mean of the two values surrounding the center of the data set. The median is also $P_{50}$, the fiftieth percentile. This means that 50%, or half, of the values are smaller than the median and 50% are greater than the median. The procedure for finding the median (implemented in the sketch after the mode, below) is:

1. Order the data set from smallest to largest (or largest to smallest). NOTE: this requires that the data can be ordered, so the median cannot be found for nominal data.

2. Find $i$, the location or position of the median, using the formula $i = \frac{n+1}{2}$, where $n$ is the size of the sample.

3. If $i$ is an integer, then the median is the value found at the $i$th position in the ordered data set. If $i$ is not an integer, then the median is the mean of the two values surrounding the $i$th position.

The median is often denoted as $M$ or $\tilde{x}$.

The last of the more common Measures of Central Tendency is called the mode. The mode is the most commonly occurring value in a data set; in other words, the value that occurs with the greatest frequency. The mode, unlike either the mean or the median, does not have to be unique. A data set can have more than one mode or no mode at all. A data set with one mode is referred to as unimodal; with two modes, bimodal; and with three or more modes, multimodal. There is no universal notation for the mode, and the mode is valid for any type of data.
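A minimal implementation of the median procedure described above, together with the mode, is sketched below on a small made-up data set. It mirrors the textbook steps (order the data, find i = (n + 1)/2) rather than calling statistics.median directly, and then checks that the two agree.

```python
import statistics

def median_by_position(data):
    """Median via the positional procedure: order the data, find i = (n + 1) / 2."""
    ordered = sorted(data)
    n = len(ordered)
    i = (n + 1) / 2
    if i.is_integer():
        return ordered[int(i) - 1]        # i-th value (positions start at 1)
    lo, hi = ordered[int(i) - 1], ordered[int(i)]
    return (lo + hi) / 2                  # mean of the two surrounding values

data = [7, 3, 9, 5, 11, 5]                # n = 6, so i = 3.5
print(median_by_position(data))           # 6.0: mean of the 3rd and 4th values (5, 7)
print(statistics.median(data))            # 6.0, agrees with the procedure

# The mode: statistics.multimode returns every most frequent value,
# so it also handles bimodal and multimodal data sets.
print(statistics.multimode(data))         # [5] -> unimodal
```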
Measures of Data Variation

Besides a measure of “typical,” what else might we want to know about a data set? Do the measures of central tendency tell us all we need to “know” about the observations we have collected? Certainly not; in fact, two data sets could have the same mean and be completely different in terms of dispersion. Suppose we “know” the mean depth of a lake where we plan our next office picnic. If the mean depth of the lake is 4 feet, is this all we need to know about the depth of this lake? No. We need to know how much the values (depths) vary around 4 feet. The depth of the lake could be 4 feet at every point and have a mean of 4 feet, or the depth could vary greatly around 4 feet and still have a mean of 4 feet. There could be places where the depth is a few inches and other places where the depth is 10 feet. This information about how the data are dispersed is very important (especially for those of us who cannot swim).

The study of statistics could appropriately be referred to as the study of variability, since many of the techniques employ the comparison of the variability of typical values in different groups to determine whether or not these values are the same or different between groups. Measures of Data Variation (variability, dispersion, or spread) are attempts to describe how spread out the values in a particular data set are. All measures of data variation require quantitative data to calculate and are nonnegative: they are zero (if all the values are equal) or positive. A “large” measure of spread indicates a more dispersed data set, while a “small” measure indicates a more tightly grouped data set.

The easiest measure of spread to calculate is the range. The range is the difference between the largest or maximum value and the smallest or minimum value. The notation and formula for the range is $R = H - L$, where $H$ is the largest or maximum value and $L$ is the smallest or minimum value. The range, while simple to calculate, is only informative if it is “small.” “Small” and “large” are relative terms and must be judged relative to the magnitude of the values measured. For instance, a range of $3 for dinner could be characterized as “small” if we are eating at a five-star restaurant in a pricey hotel in New York City where the dinner entrees range in price from $12.00 to $35.00, but may be characterized as “large” if we’re eating at a local fast-food restaurant. If the range is “small,” it means that the two extreme values are very close to each other, so the rest of the values must also be tightly grouped. If the range is “large,” we know that the extreme values are a long way from each other, but we know nothing about the distribution of the rest of the observations. Since the range only uses two values in its calculation, we are provided with limited information.

As with our favorite measure of central tendency, the mean, we might like a measure of variability that incorporates all the values in the data set, as opposed to using only the two values needed to calculate the range. We might be interested in finding out, on average, how much the values vary around a “typical value.” In an effort to describe the variability of a data set, we could measure the distance each value is from the mean, our standard measure of “typical.” The distance a value is from the mean is called the “deviation from the mean” and is found by subtracting the mean from a particular value. This deviation from the mean can be negative (if the value is smaller than the mean), positive (if the value is bigger than the mean), or zero (if the value is equal to the mean). To calculate the average deviation from the mean, we could sum the deviations from the mean for each value in the data set and divide by the number of observations in our sample. Unfortunately, although a good idea intuitively, this average will always be zero: since the mean is the gravitational center of the data set, the deviations from the mean sum to zero,

$$\sum (x - \bar{x}) = 0$$

This occurs because the deviations from the mean that are negative offset the deviations from the mean that are positive. We can avoid this problem by using the absolute value or the square of the deviations from the mean. The Mean Absolute Deviation (MAD) is the sum of the absolute deviations from the mean divided by the sample size:

$$\text{MAD} = \frac{\sum |x - \bar{x}|}{n}$$

The MAD is used in financial analysis to determine the variability in stock prices from the expected price. Unfortunately, while the MAD is the “best” measure of spread for descriptive purposes, it is not useful for inferential statistics since the distribution of an absolute value function is not smooth.
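The sketch below works through these ideas on a small invented sample: the range, the fact that deviations from the mean sum to zero, and the Mean Absolute Deviation.

```python
import statistics

data = [4, 8, 6, 5, 3, 7, 9]
n = len(data)
x_bar = statistics.mean(data)

# Range: difference between the maximum (H) and minimum (L) values.
R = max(data) - min(data)
print(f"R = {R}")                                      # 9 - 3 = 6

# Deviations from the mean always sum to zero (up to floating point rounding),
# because the mean is the gravitational center of the data set.
deviations = [x - x_bar for x in data]
print(f"sum of deviations = {sum(deviations):.10f}")   # 0.0000000000

# Mean Absolute Deviation: average distance of the values from the mean.
mad = sum(abs(d) for d in deviations) / n
print(f"MAD = {mad:.4f}")                              # 12/7, about 1.7143
```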
The sample variance, denoted $s^2$, is the sum of the squared deviations from the mean divided by the sample size less one ($n - 1$). Continuing our effort to find an average deviation from the mean, we square the deviations from the mean to eliminate any negative values, so our numerator is not equal to zero, and then divide by the sample size less one. Our denominator is made smaller (hence our variance is made larger) as an adjustment to our estimate of the true population variance, denoted $\sigma^2$ (sigma squared), since we calculate the sample variance, $s^2$, using the sample mean, $\bar{x}$, instead of the true population mean, $\mu$ (mu). The true measure of variability for the population should be calculated according to each value’s distance from $\mu$, the population mean. The adjustment in the denominator makes our estimate larger than it would be without the adjustment, to account for the estimate ($\bar{x}$) used in the numerator. Since we would prefer a “small” measure of variability, because this indicates that the mean, $\bar{x}$, is a good measure of “typical” (most of the values are “close to” the mean), adjusting our estimate of the variance to be larger is considered conservative. We are unsure of the true value of the mean, so we use the value of the sample mean to estimate the variability in the data. The deviations from the mean are estimated using deviations from the sample mean. It is said that we lose one degree of freedom (df) in the denominator for every estimate in the numerator. All variances are of the form: a sum of squares divided by degrees of freedom.

The problem with the variance is that the value is in squared units. For instance, if we are measuring the dollar amount spent on lunch, the variance will be in dollars squared. Since squared units make interpretation difficult, we normally take the square root of the variance to return to the original units of measurement. The positive square root of the sample variance, $s^2$, is the sample standard deviation, $s$. The sample standard deviation, $s$, is our estimate of the true population standard deviation, denoted $\sigma$ (sigma), which is the positive square root of the population variance, $\sigma^2$.

The definitional formula for the sample variance, $s^2$, is given below, followed by an algebraic manipulation which we call the computational formula. The computational formula is easier and faster to calculate, but intuitively the definitional formula makes more sense as our estimate of the “average” (squared) deviation from the mean.

$$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} = \text{the sample variance}$$

$$s = \sqrt{s^2} = \text{the sample standard deviation}$$

Although we rarely calculate parameters, the following formulae are given for the population variance and the population standard deviation:

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} = \text{the population variance}$$

$$\sigma = \sqrt{\sigma^2} = \text{the population standard deviation}$$
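A short sketch, on made-up values, verifying that the definitional and computational formulas for the sample variance agree, and that both match statistics.variance from the standard library (which also divides by n − 1).

```python
import math
import statistics

data = [12, 15, 11, 18, 14, 16]
n = len(data)
x_bar = sum(data) / n

# Definitional formula: "average" squared deviation from the mean,
# with n - 1 degrees of freedom in the denominator.
s2_def = sum((x - x_bar) ** 2 for x in data) / (n - 1)

# Computational formula: algebraically identical, easier by hand.
s2_comp = (sum(x ** 2 for x in data) - sum(data) ** 2 / n) / (n - 1)

s = math.sqrt(s2_def)                 # sample standard deviation, original units

print(s2_def, s2_comp)                # agree up to floating point rounding
print(statistics.variance(data))      # library value, also uses n - 1
print(s, statistics.stdev(data))      # positive square root of the variance
```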
Uses of the Standard Deviation

The standard deviation of a sample is an attempt to estimate the typical distance that values in the data set lie from the mean. We use the standard deviation as the step size to estimate the percentage of values that lie within one step, two steps, or three steps of the mean. For example, Chebyshev’s Theorem, which applies to any distribution regardless of its shape, states that at least $\left(1 - \frac{1}{k^2}\right) \times 100\%$ of the values will fall within $k$ standard deviations of the mean. Since Chebyshev’s Theorem applies to any distribution regardless of shape, the information learned is less specific than we might like. In other words, using the formula, we would discover that at least 75% of the observations (in any distribution) lie within 2 standard deviations of the mean. This means that 75%-100% of the values will fall within two standard deviations of the mean. While some information is better than none, we would like to be more precise in our estimate of this percentage.

For certain known distributions, we can more precisely estimate the percentage of values that lie within one, two or three standard deviations of the mean. The Empirical Rule, which applies only to a normal distribution, provides us with much more information about this particular distribution than Chebyshev’s Theorem. The Empirical Rule states that for any normal distribution, approximately 68% of the values will fall within one standard deviation of the mean, approximately 95% of the values will fall within two standard deviations of the mean, and approximately 99.7% of the values will fall within three standard deviations of the mean. This much more precise information is true only for data distributed normally. The normal distribution, sometimes referred to as the Gaussian distribution after Carl Friedrich Gauss, who used it to describe the distribution of certain measurement errors, is bell-shaped and symmetrical, and models the behavior of many random variables. We will discuss the normal distribution, as well as its probability distribution, later in the course.

Measures of Position or Location

Measures of central tendency and measures of data variation are single values used to describe an entire data set. Measures of position or location describe an individual value and indicate the position of that value relative to the other values in the data set. A commonly used measure of position is a percentile. Aptitude tests often provide an individual’s percentile ranking to let test takers know how they did relative to others who took the test. To determine what test score exceeds a certain percentage of test scores, we first divide our data set into 100 equal parts and then count in to determine the location of the value that corresponds to the percentile we are interested in. The $k$th percentile, $P_k$, is that value which is greater than or equal to $k\%$ of the observations and less than or equal to the remaining $(100 - k)\%$ of the observations. The procedure for calculating the $k$th percentile, implemented in the sketch below, is:

1. Order the data from smallest to largest value.

2. Find $\frac{nk}{100}$, where $n$ is the sample size and $k$ is the percentile you are calculating.

3. (a) If $\frac{nk}{100}$ is not an integer, then $i$, the position of the $k$th percentile, is the next larger integer. For example, if $\frac{nk}{100} = 4.5$, then $i = 5$. (b) If $\frac{nk}{100}$ is an integer, then $i$, the position of the $k$th percentile, is $\frac{nk}{100} + 0.5$. For example, if $\frac{nk}{100} = 6$, then $i = 6.5$.

4. (a) If $i$ is an integer (3a above), then the $k$th percentile is the value found at the $i$th position in the ordered data set. For example, in 3a above, $i = 5$, so the $k$th percentile is the 5th value in the ordered data set. (b) If $i$ is not an integer (3b above), then the $k$th percentile is the mean of the two values surrounding the $i$th position. For example, in 3b above, $i = 6.5$, so the $k$th percentile is the mean of the sixth and seventh values in the ordered data set.
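A minimal implementation of this percentile procedure, on invented test scores, follows the four steps literally. Note that other software (including Python's statistics.quantiles) uses slightly different interpolation rules, so results can differ for small samples.

```python
def percentile(data, k):
    """k-th percentile using the ordered-position procedure described above."""
    ordered = sorted(data)
    n = len(ordered)
    pos = n * k / 100
    if pos != int(pos):
        i = int(pos) + 1                  # not an integer -> next larger integer
        return ordered[i - 1]             # the i-th value (positions start at 1)
    # an integer -> i = pos + 0.5: mean of the two surrounding values
    return (ordered[int(pos) - 1] + ordered[int(pos)]) / 2

data = [22, 25, 28, 31, 35, 40, 41, 46, 52, 60]   # already ordered, n = 10

# nk/100 = 10 * 25 / 100 = 2.5 -> not an integer, so i = 3: the 3rd value.
print(percentile(data, 25))               # 28

# nk/100 = 10 * 50 / 100 = 5 -> an integer, so i = 5.5: mean of 5th and 6th values.
print(percentile(data, 50))               # 37.5, which is also the median
```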
Sometimes, instead of being interested in what data point has a certain percentage above or below it, researchers are interested in determining the value that is “typical” for the “center” group of values. For example, suppose we are charged with the responsibility of developing the curriculum for a kindergarten class. The students in a class of kindergarteners could differ tremendously in terms of acquired knowledge. Suppose, in an effort to develop the curriculum, we give each student in the class an aptitude test to measure his/her abilities in basic knowledge. The scores may vary greatly, since some of the students may have attended preschool from the time they were very young while others may not have attended at all. If we do not have the resources for a multi-level curriculum, then we would develop a curriculum targeted at those “in the middle” in terms of their aptitude scores. Since we are interested in targeting the center of the distribution of aptitude scores, we will determine what constitutes the “middle 50%” and gear our curriculum toward those students.

Quartiles, which are just specific percentiles, allow us to divide our data into four equal groups. The first or lower quartile, $Q_1$, is equal to the 25th percentile, $P_{25}$. The second or mid-quartile, $Q_2$, is equal to the 50th percentile, $P_{50}$, which is also the median, $M$. The third or upper quartile, $Q_3$, is equal to the 75th percentile, $P_{75}$. We use these quartiles to help us determine characteristics of the middle 50% of our data. For example, the Interquartile Range (IQR) is the range of the middle 50% of the data. Like the range, the IQR is a measure of data variation or dispersion, but instead of indicating the range of all the data as the range does, the IQR indicates the range of only the middle 50%. Like other Measures of Data Variation, the IQR requires quantitative data to calculate. The formula for the IQR is $\text{IQR} = Q_3 - Q_1$. To calculate the IQR, the first and third quartiles are determined by finding the corresponding percentiles, i.e., $Q_3 = P_{75}$ and $Q_1 = P_{25}$.

The Mid-Quartile Range (MQR) is a statistic we calculate to determine a “typical” value in the middle group of observations. The MQR is a Measure of Central Tendency and is the mean of the extreme values of the middle 50% of the observations. It is not the mean of all observations in the middle 50%; instead, we find the mean of the first and third quartiles. The formula for the MQR is:

$$\text{MQR} = \frac{Q_1 + Q_3}{2}$$

Another measure of position or location is called the Z-score or Z value. The Z-score for a particular value in a data set indicates the number of standard deviations that value is from the mean; for a sample, $z = \frac{x - \bar{x}}{s}$. Z-scores can be negative (if the value is less than the mean), positive (if the value is larger than the mean), or equal to zero (if the value is equal to the mean). The Z-score for the mean is always zero. For example, a value with a Z-score of 1.35 is 1.35 standard deviations above the mean. A value with a Z-score of –2.12 is 2.12 standard deviations below the mean. Z-values can be calculated, and a Standard Normal Table used, to determine approximately what proportion of the values, for a normal distribution, are above or below a particular value, or between two values in a distribution.
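The sketch below ties these measures together on a small invented data set. It uses statistics.quantiles for the quartiles; as noted earlier, its default interpolation differs slightly from the ordered-position procedure for small samples.

```python
import statistics

data = [22, 25, 28, 31, 35, 40, 41, 46, 52, 60]

# Quartiles are specific percentiles: Q1 = P25, Q2 = P50 (the median), Q3 = P75.
q1, q2, q3 = statistics.quantiles(data, n=4)

iqr = q3 - q1            # Interquartile Range: spread of the middle 50% of the data
mqr = (q1 + q3) / 2      # Mid-Quartile Range: mean of Q1 and Q3, a measure of center

print(f"Q1 = {q1}, median = {q2}, Q3 = {q3}, IQR = {iqr}, MQR = {mqr}")

# Z-score: how many standard deviations a value lies from the mean.
x_bar = statistics.mean(data)
s = statistics.stdev(data)
for x in (22, x_bar, 60):
    print(f"z-score of {x}: {(x - x_bar) / s:+.2f}")   # negative, zero, positive
```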
Frequency Distributions

Terminology:

Defn: The frequency, f, for a value or a class of values is the number of times that value or class of values occurs in the data set. We are simply counting how often a value or class of values occurs in the data set.

1. What is the minimum number of times a value or class of values can occur in a data set? Zero (0). What is the maximum number of times a value or class of values can occur in the data set? n, the total number of values in the data set. Thus $0 \le f \le n$.

2. If we add the frequencies for each value or class of values, they will sum to n: $\sum f = n$.

Defn: The relative frequency, f/n, for a value or a class of values is the proportion of the time that value or class of values occurs in the data set (how often the value occurs divided by the total number of observations, which gives a proportion).

1. What is the minimum proportion of the time a value or class of values can occur in a data set? Zero (0). What is the maximum proportion of the time a value or class of values can occur in the data set? One (1). Thus $0 \le f/n \le 1$.

2. If we add the relative frequencies for each value or class of values, they will sum to one: $\sum f/n = 1$.

Defn: The cumulative frequency, F, for a value or a class of values is the number of times that value or any smaller value occurs in the data set. We are simply keeping a running total.

1. Cumulative frequencies are non-decreasing (the values cannot decrease; they can level off, but they cannot go down).

2. The cumulative frequency for the last value or class of values is n.

3. We must have at least ordinal scaled data to find cumulative frequencies.

Defn: The cumulative relative frequency, F/n, for a value or a class of values is the proportion of the time that value or any smaller value occurs in the data set. We are simply keeping a running total of relative frequencies or proportions.

1. Cumulative relative frequencies are non-decreasing.

2. The cumulative relative frequency for the last value or class of values is one (1).

3. We must have at least ordinal scaled data to find cumulative relative frequencies.
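A short sketch that builds all four of these columns for a small made-up data set: frequency f, relative frequency f/n, cumulative frequency F, and cumulative relative frequency F/n.

```python
from collections import Counter

data = [1, 2, 2, 3, 3, 3, 4, 4, 5, 5]     # hypothetical values (at least ordinal)
n = len(data)

counts = Counter(data)                     # frequency f for each value
F = 0
print("value   f   f/n    F   F/n")
for value in sorted(counts):               # cumulative columns require ordered data
    f = counts[value]
    F += f                                 # running total: non-decreasing, ends at n
    print(f"{value:5} {f:3} {f/n:5.2f} {F:4} {F/n:5.2f}")

# Checks: frequencies sum to n, relative frequencies sum to 1.
assert sum(counts.values()) == n
assert abs(sum(f / n for f in counts.values()) - 1) < 1e-9
```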