Module 1: A Review on Descriptive Statistics

INTRODUCTION

Business Statistics has its own language, just like any other area of study (Black, 2010). Because it has its own language, people need to learn how to communicate on common ground. In our day-to-day activities, we encounter Statistics without actually knowing it, and we often experience its many practical and relevant uses and applications. The role of statistics in business is to evaluate all naturally collected data and information to determine what they say about the company's operations and strategy (Bianca, 2019). In this module, let us navigate the essential points of Descriptive Statistics, which many of you have taken up in your high school days.

LEARNING OBJECTIVES:

After going through this module, the students will be able to:
1. Explain the importance and relevance of studying Statistics in Business.
2. Differentiate and discuss population from sample, qualitative from quantitative data, parameter from statistic, and discrete from continuous variables.
3. Apply the different measures of central tendency, dispersion, and relative position in describing and analyzing data sets.
4. Summarize and interpret data using various types of data presentation and organization, including the use of Microsoft Excel.
5. Solve and explain various application problems.
6. Manifest patience, integrity, self-discipline, excellence, and critical thinking in working with the different application problems.

Lesson 1: Meaning of Statistics and its Basic Components

Example: Determine which key terms are being referred to in the following situation.

A study is made to determine the amount of time spent studying Statistics daily. It involves all freshman college students in XYZ University. There is a total of 1,000 freshman students, 150 of whom are chosen randomly. From the 150 randomly selected freshman college students, it turns out that the average time spent daily studying Statistics is 1 hour.

Answers:
- The population is all freshman college students of XYZ University.
- The sample is the 150 students randomly selected from the population of 1,000 freshman college students.
- The parameter is the average (mean) amount of time spent daily studying Statistics by the entire population. (Note: if all 1,000 students were interviewed, there would be 1,000 different responses; the average of all those 1,000 responses defines the parameter.)
- The statistic is the average time spent studying Statistics by the 150 selected freshman college students, and that is 1 hour.
- The variable could be the amount of time spent daily by one freshman college student studying Statistics. (For example, we can assign a variable T = the amount of time spent daily by one freshman student studying Statistics in XYZ University.)
- The data are the raw scores in terms of the amount of time spent daily by the freshman students (in minutes, hours, or even seconds).
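As an illustration of the distinction between a parameter (computed from the whole population) and a statistic (computed from the sample), the XYZ University situation can be simulated in Python. The individual study times below are made up purely for illustration; only the structure of the example (1,000 students, a random sample of 150) is taken from the module.

```python
import random
import statistics

random.seed(7)  # fixed seed so the illustration is reproducible

# Hypothetical population: daily study time (in hours) of all 1,000
# freshman students of XYZ University (values invented for illustration)
population = [round(random.uniform(0.25, 2.0), 2) for _ in range(1000)]

# Parameter: a numerical summary of the WHOLE population
parameter = statistics.mean(population)

# Sample: 150 students chosen randomly, as in the example
sample = random.sample(population, 150)

# Statistic: the same summary computed from the sample only;
# it estimates, but generally differs slightly from, the parameter
statistic = statistics.mean(sample)

print(round(parameter, 2), round(statistic, 2))
```

Running the sketch shows the sample mean (the statistic) landing close to, but not exactly on, the population mean (the parameter), which is the whole point of sampling.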
What is Statistics?

Statistics can be described as:
- methods for processing and analyzing numbers;
- methods for helping reduce the uncertainty inherent in decision making;
- a branch of science that transforms numbers into useful information for decision makers (source: Business Statistics: A First Course, 2009, p. 4).

According to Webster's Third New International Dictionary, Statistics is a science that deals with the collection, analysis, interpretation, and presentation of numerical data (Black, 2010). It is a branch of science that helps transform numerical data into useful information and allows you to understand the risks associated with making a business decision (Berenson et al., 2012).

Example: In each of the following items, determine the correct data type (quantitative or qualitative). If the given data is quantitative, indicate further whether it is continuous or discrete.
a. the number of pairs of shoes you own
b. the type of car you drive
c. the distance from your home to the nearest grocery store
d. the number of classes you take per school year
e. the type of calculator you use
f. weights of sumo wrestlers
g. number of correct answers on a quiz
h. IQ scores

Answers:
a. quantitative - discrete
b. qualitative (or categorical)
c. quantitative - continuous
d. quantitative - discrete
e. qualitative (or categorical)
f. quantitative - continuous
g. quantitative - discrete
h. quantitative - continuous
Source: Holmes et al., 2018, pp. 10-11

LEVELS OF MEASUREMENT

Kindly study the four different types of measurement scales below.

a. Nominal – the lowest level of data measurement. It categorizes, labels, or classifies objects or people's responses so that all of those in a single category are coded numerically.
Examples:
~ employee ID number (e.g., 16754; 18722; 98123)
~ gender (1 - Male; 2 - Female)
~ hospital bed number (e.g., bed #12; bed #23; bed #18)
~ marital status (1 - single; 2 - married; 3 - separated; 4 - widow/widower)
b. Ordinal – used to rank or order objects or characteristics; a higher level than the nominal type of measurement.
Examples:
~ degree of medicine effectiveness (1 - effective; 2 - slightly/moderately effective; 3 - very effective)
~ teaching performance (1 - excellent; 2 - very satisfactory; 3 - satisfactory; 4 - fair; 5 - poor)

c. Interval – next to the highest level of data. Interval data are always numerical, and the distances between numbers have meaning. There is no absolute or true zero, which means that zero does not necessarily mean the absence of something, as in the case of a zero temperature or a zero IQ. Addition and subtraction of numerical data can be performed.
Examples:
~ temperature in degrees Fahrenheit
~ IQ level
~ age bracket

d. Ratio – the highest among the scales of measurement, usually based on a standard unit. Ratio data have an absolute zero, which means that the characteristic of zero is fixed and indicates the absence of that particular property/characteristic.
Examples: height; time; weight; number of cars owned; amount of one's investment; score on an achievement test.

Notice that in the ratio scale it is possible to perform the four fundamental operations of addition, subtraction, multiplication, and division. If, for example, student A scored 100 on a Statistics quiz and student B scored 25, then we can say that student A scored four times as high as student B, or that student A scored 75 points more than student B.

Consider further the problem below, involving five questions from a hospital patient questionnaire (Black, 2010, pp. 9-10).

Answers: Question 1 is a time measurement with an absolute zero and is therefore ratio-level measurement. A person who has been out of the hospital for two weeks has been out twice as long as someone who has been out of the hospital for one week. Question 2 yields nominal data because the patient is asked only to categorize the type of unit he or she was in; this question does not require a hierarchy or ranking of the type of unit. Questions 3, 4, and 5 are likely to result in ordinal-level data. Suppose a number is assigned to the descriptors in each of these three questions. For question 3, "very important" might be assigned a 4, "somewhat important" a 3, "not very important" a 2, and "not at all important" a 1. Certainly, the higher the number, the more important the hospital's location. Thus, these responses can be ranked by selection. However, the increases in importance from 1 to 2 to 3 to 4 are not necessarily equal. This same logic applies to the numeric values assigned in questions 4 and 5. Source: Black, 2010, pp. 9-10.

LESSON 2 – THE MEASURES OF CENTRAL TENDENCY

Statistics, as a field of science, involves the processes of collection, organization or presentation, analysis, and interpretation of data. Generally, statistics is divided into two main divisions. The first refers to the collection, organization or presentation, and analysis of data and is called descriptive statistics; its main goal is to provide a description of the data set. The second division is concerned with interpreting and drawing conclusions or generalizations from the analysis of random sample data and is called inferential statistics.

In providing descriptions or summaries for a sample data set, the measures of central tendency, dispersion, and relative position are applied.

THE MEASURES OF CENTRAL TENDENCY

A Measure of Central Tendency is a statistic obtained from a set of observations or scores that represents the data set. It is often useful to find the single numerical value located at the center of the distribution of the data set. It is also defined as the tendency of the observations or scores to cluster about a single point.

Example: Jonecis is planning to build a business that sells different brands of computer printers.
He conducted a survey on the prices of different brands of printers in the market and made projections on the possible selling prices for the items once he engages in such a business. The data below present the projected selling prices for various brands of printers.

Brand:   A            B            C            D            E
Price:   Php 20,990   Php 14,990   Php 16,484   Php 15,799   Php 21,984

The central tendency of the selling prices for the different brands of printers is the average selling price of the five items, that is, a "center" or "central" value about which the different amounts cluster. The Measures of Central Tendency comprise three commonly used measures that locate the center or central value of a given data set: the mean, the median, and the mode.

THE MEAN

Example: The mean of the preceding data set on the projected selling prices of the various brands of computer printers can be computed as follows:

mean = (20,990 + 14,990 + 16,484 + 15,799 + 21,984) / 5 = 90,247 / 5 = Php 18,049.40

Example: JM is a college instructor handling classes in Economics. He is currently conducting a study involving his classes for the subject. He is interested in determining the average academic performance of his entire class, composed of two sections. He randomly selects the same number of students from his two classes to compose the sample and records their respective prelim grades, shown as follows:

Student:  1    2    3    4    5    6    7    8    9    10
Grade:   2.2  1.9  1.7  2.0  1.8  1.5  1.4  2.1  1.6  2.3

The mean prelim grade of the sample is (2.2 + 1.9 + 1.7 + 2.0 + 1.8 + 1.5 + 1.4 + 2.1 + 1.6 + 2.3) / 10 = 18.5 / 10 = 1.85.

THE WEIGHTED MEAN

The weighted mean is a variation of the arithmetic mean wherein each individual score has an assigned weight, normally emphasizing that one score counts more heavily than another. The weighted mean is obtained using the formula

weighted mean = Σ(wx) / Σw,

where the w's are the weights and the x's are the individual scores.

Example: Consider JM's grades during the prelim term of the first semester, weighted by the credit units of each subject. Each subject is worth 3 units; hence, the sum of all credit units is 21 for all courses. Thus, JM's grade point average (GPA) during the prelim term of the first semester is the weighted mean of his grades, with the credit units as the weights.

THE MEDIAN

The median is the middle-most value in an ordered array of data. The median is known to be unaffected by extreme values in a set of data. Hence, whenever an extreme value is present, it is proper to use the median rather than the mean to describe the data set. To find the median value of a given data set, one has to organize the data into an array, that is, arrange the scores in increasing numerical value; the median is then the middle value if the number of observations is odd, or the average of the two middle values if the number of observations is even.

THE MODE

The mode is the value in a data set that appears most frequently. Unlike the arithmetic mean, the mode is not easily affected by the occurrence of extreme values.

Example: The mode of the ordered sequence of the numbers of customers of the beauty salon over a 30-day period is 3, since in the sequence 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17, the score 3 has the highest frequency of occurrence.

THE MEASURES OF DISPERSION

The measures of dispersion determine the amount of variation or spread in the data. These measures are helpful in detecting inconsistencies or insufficiencies of values in the data set. Consider, for instance, the two coffee dispenser machines that supply coffee to customers, shown in the table below.

The mean amount of coffee (in oz) dispensed per cup by machine dispenser 1 is 8 oz. However, the amount of coffee dispensed per cup is very inconsistent: some cups seem to overflow, while others receive much less coffee. This suggests that the first machine dispenser needs some calibration. On the other hand, machine dispenser 2 is very consistent in the amount of coffee dispensed, having only very small deviations among the values. This indicates that the second machine dispenser needs no further calibration, for it works well in serving the customers.

The situation discussed above suggests that the mean value alone is not enough to describe a data set, for it lacks information that reflects the deviation or spread of the data values.
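The central-tendency measures discussed above can be checked with Python's built-in statistics module, using the beauty-salon data from the mode example. The three grades and unit weights in the weighted-mean part are hypothetical, for illustration only.

```python
import statistics

# Daily customer counts of the beauty salon over the 30-day period
customers = [1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5,
             7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, 17]

mean = statistics.mean(customers)      # arithmetic mean
median = statistics.median(customers)  # middle of the ordered array
mode = statistics.mode(customers)      # most frequent value

# Weighted mean: grades weighted by credit units (hypothetical grades
# and units for illustration; the module's GPA table is not reproduced)
grades = [1.5, 2.0, 1.75]
units = [3, 3, 3]
weighted_mean = sum(w * x for w, x in zip(units, grades)) / sum(units)

print(mean, median, mode, weighted_mean)  # 6.4 6.0 3 1.75
```

Note that the mode is 3 (as in the example), while the mean (6.4) and median (6.0) of the same data differ from each other, illustrating that the three measures describe the "center" in different ways.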
Thus, there is a need to introduce further descriptive measures that characterize the spread or deviation of the data values: the range and the standard deviation.

Example (continuing the beauty-salon data): Determine the median of the data. Solution: The ordered sequence of the numbers of customers over the 30-day period is 1, 1, 1, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 5, 5, 7, 7, 7, 7, 8, 8, 8, 8, 10, 10, 11, 12, 12, 13, and 17. Since there are n = 30 observations, the median is the average of the 15th and 16th values, (5 + 7) / 2 = 6.

THE RANGE

The range is the difference between the largest and smallest observations in a set of data:

Range = largest value − smallest value

Example: Find the range of the amount of coffee (in oz) dispensed in 5 cups from machine dispenser 2.

Solution: The largest amount of coffee dispensed in a cup is 10.50 oz, while the smallest amount is 5.99 oz. The range of the two extreme values is 10.50 − 5.99 = 4.51 oz.

Although the range is easy to compute, it can be very sensitive to extreme values and provides no information about the spread of the values between the two extreme scores.

THE STANDARD DEVIATION

The standard deviation is the measure of variation that takes into account how all the values in the data set are distributed. It evaluates how the values fluctuate about the mean and is less sensitive to extreme values. Although most statistical computations normally involve a sample instead of a population, the two sets of formulas, for the sample and for the population, are defined as follows:

sample standard deviation: s = √[ Σ(x − x̄)² / (n − 1) ]
population standard deviation: σ = √[ Σ(x − μ)² / N ]

Example 1: Consider the examination scores of the sample of students randomly selected in the Stat class: 15, 17, 19, 23, 25. Find the standard deviation of the sample.

Solution: The sample mean is x̄ = (15 + 17 + 19 + 23 + 25) / 5 = 19.8. The sum of squared deviations is Σ(x − x̄)² = 23.04 + 7.84 + 0.64 + 10.24 + 27.04 = 68.8, so the sample standard deviation is s = √(68.8 / 4) = √17.2 ≈ 4.15.

Example 2: A study is conducted to provide evidence on which companies have received more or less consistent employee job-satisfaction ratings, based on survey results that used a scale of 1 (strongly dissatisfied) to 7 (strongly satisfied). The ratings are given below. Which of the two companies has received a more or less consistent job-satisfaction rating from employees?

Solution: The mean job-satisfaction rating is 5.83 for each of the two companies, so the means cannot distinguish them; the company whose ratings have the smaller standard deviation received the more consistent ratings.

Lesson 3: The Measures of Relative Position

Suppose a student is taking examinations in Mathematics in the Modern World (MMW) and in Art Appreciation (AA). The student got a score of 45 in MMW and a score of 50 in Art Appreciation. The mean score of all students taking the MMW exam in the class is 40 with a standard deviation of 5, while the mean score of all students taking Art Appreciation is 45 with a standard deviation of 8. In which of the two subjects did the student perform better? Comparing the performance of the student in both subjects cannot be done right away, since the two sets of examination scores have different amounts of variability. Hence, for a comparison to be possible, a transformation must be made for the student's scores in MMW and Art Appreciation.

THE PERCENTILES

Example: Suppose the median annual travel expenses of personnel in a certain academic institution is Php 100,000, and the 85th percentile of the annual travel expenses is Php 110,000. Find the percentage of personnel whose annual travel expenses were:
i. greater than Php 100,000
ii. less than Php 110,000
iii. between Php 100,000 and Php 110,000

Solution:
i. Since the median is the same as the 50th percentile, 50% of the personnel incurred travel expenses greater than Php 100,000.
ii. Since Php 110,000 is the 85th percentile, 85% of the personnel in the academic institution incurred travel expenses less than Php 110,000.
iii. Using parts (i) and (ii), 85% − 50% = 35% of the personnel incurred travel expenses between Php 100,000 and Php 110,000.

THE PERCENTILE OF A DATA VALUE

Example: In a licensure examination for accountants given to 1,500 examinees, Yurie's score of 550 was higher than the scores of 900 examinees who took the examination. What is the percentile of Yurie's examination score?

THE Z-SCORE

The z-score of a given data value x is the number of standard deviations the score lies above or below the mean. The transformation of an x-score to a z-score is defined by the equation

z = (x − μ) / σ,

where μ is the mean and σ is the standard deviation of the distribution.
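A minimal Python sketch, again using the built-in statistics module, reproduces the dispersion computations on the Stat-class sample and applies the z-score transformation just defined. The z-score of the top score (z_top) and the quartiles of the dining-room data from the quartile example in Lesson 3 are extra illustrations, not solutions given in the module.

```python
import statistics

# Examination scores of the randomly selected Stat-class sample
scores = [15, 17, 19, 23, 25]

data_range = max(scores) - min(scores)  # largest minus smallest
s = statistics.stdev(scores)            # sample sd: divides by n - 1
sigma = statistics.pstdev(scores)       # population sd: divides by N

# z-score transformation as defined above: (x - mean) / sd
mean = statistics.mean(scores)
z_top = (25 - mean) / s  # standing of the top score within the sample

# Quartiles of the dining-room data used in the quartile example below.
# The default 'exclusive' method lands exactly on observations here,
# since n + 1 = 16 is divisible by 4.
rooms = [55, 60, 65, 70, 72, 75, 78, 80, 84, 88, 89, 91, 95, 99, 100]
q1, q2, q3 = statistics.quantiles(rooms, n=4)

print(data_range, round(s, 2), round(sigma, 2), (q1, q2, q3))
```

The sample standard deviation comes out to about 4.15, matching the hand computation √17.2, and the quartiles come out to 70, 80, and 91.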
The z-score, or standard score, has a mean of 0 and a standard deviation of 1.

Example: Comparing the performance of the student in MMW and in Art Appreciation using the scores provided above, we have

z(MMW) = (45 − 40) / 5 = 1.00 and z(AA) = (50 − 45) / 8 = 0.625.

The result indicates that the student scored 1.00 standard deviation above the mean in the MMW course and 0.625 standard deviation above the mean in the Art Appreciation course. The resulting z-scores suggest that the student performed better in MMW than in Art Appreciation.

THE PERCENTILES

Example: Since 900 of the 1,500 examinees scored lower than Yurie, his percentile is (900 / 1,500) × 100 = 60. Therefore, Yurie's score in the licensure examination for accountants is at the 60th percentile.

Example: Consider the data below, which represent the number of dining rooms occupied in a beach resort over a 15-day period. Find the first, second, and third quartiles of the data.

Solution: Listing the data values in an array, we have

55, 60, 65, 70, 72, 75, 78, 80, 84, 88, 89, 91, 95, 99, 100

The second quartile (the median) is the 8th value, Q2 = 80; the first quartile is the middle of the lower half, the 4th value, Q1 = 70; and the third quartile is the middle of the upper half, the 12th value, Q3 = 91.

THE USE OF STEM-AND-LEAF DIAGRAMS IN ORGANIZING DATA

The stem-and-leaf display is a tabular way of organizing data. It is formed by splitting the data values into two parts: the "tens" part forms the stem, while the units digit forms the leaf. The stem can also be extended to hundreds, thousands, and so on as the leading unit, depending on the values involved in a given data set.

Example: The following table shows the ages of customers who own a motorcycle. Construct a stem-and-leaf display for the data.

LESSON 1: CONSTRUCTING THE FREQUENCY DISTRIBUTION

A grouped frequency distribution is useful whenever the range of the data set is quite large. Hence, the data must be grouped into classes, whether the data are categorical, interval, or ratio.
The following shows the procedure for constructing the frequency distribution.

A. THE CATEGORICAL FREQUENCY DISTRIBUTION

The categorical frequency distribution is used to organize nominal or ordinal types of data. For instance, we can employ a categorical frequency distribution for variables such as gender, marital status, socio-economic status, political affiliation, and so on.

Example: Twenty business statistics students were given an academic performance evaluation by their instructor. The data set is shown as follows:

Solution (to the stem-and-leaf example): A stem-and-leaf diagram is constructed by writing all the stems in a column in ascending order and indicating the corresponding leaves to the right of the vertical line, as illustrated in the figure below.

Learning Check: Construct the frequency distribution for the data on job satisfaction of rank-and-file employees of a certain company.

(From the quartile example: thus, the boundary score for 75% of the observations is Q3 = 91.)

SUMMARIZING DATA USING A BOX-AND-WHISKER PLOT

The Box-and-Whisker plot is a graphical way of providing a visual summary of a set of data that involves the median, the quartiles, and the extreme values, which characterize the distribution. The following figure illustrates the components of a Box-and-Whisker plot.

Module 2: Organization and Presentation of Data

INTRODUCTION

In conducting business research or an assessment, one must gather data for the variable(s) under investigation. In order to describe situations, draw conclusions, or make inferences about the occurrence of events, one must organize the gathered data in a meaningful manner. Once the data are organized, the next step is to present the data so that those who will benefit directly or indirectly from reading the study or assessment can understand them. The most commonly used procedure for presenting data is through graphs and charts. Each of these graphs and charts has specific functions depending on the nature of the variables being investigated.
Module 2 discusses how to organize data by constructing frequency distributions and how data are presented by constructing graphs and charts.

After going through this module, the students will be able to:
1. Discuss and explain the methods of organizing and presenting data.
2. Organize data into a frequency distribution using Excel's Data Analysis tool.
3. Represent a frequency distribution graphically using histograms, frequency polygons, and cumulative frequency polygons (ogives).
4. Plot data using bar graphs (multiple bar graphs), pie charts, time series graphs, and scatter plots.
5. Analyze and interpret the graphs/charts in the context of the variable(s) under investigation.
6. Show volunteerism and innovativeness in organizing and presenting data concerning real-life business application problems.

B. THE FREQUENCY DISTRIBUTION FOR NUMERICAL DATA

Data in their original form and structure are called raw data.

Example: The following are raw data depicting the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university.

When these scores are arranged in either ascending or descending order of magnitude, the arrangement is called an array. It is usually helpful to put the raw data in an array because it makes it easy to identify the extreme values and the values where the scores cluster most. When the data are placed into a system in which they are organized, they partake of the nature of grouped data.

Definition: The procedure of organizing data into groups is called a Frequency Distribution Table (FDT).

Example: The following presents a frequency distribution table of the exam scores of fifteen business students.

Scores     Frequency
20 – 29        5
30 – 39        4
40 – 49        3
50 – 59        2
60 – 69        1
Total         15

Components of a Frequency Distribution Table

The following are the components of a Frequency Distribution Table:
I. Class Interval – the numbers defining the class. It consists of the end numbers, called the class limits, namely the lower limit and the upper limit.
II. Class Frequency (f) – the number of observations falling in the class.
III. Class Boundaries – the so-called "true class limits." They are classified as the Lower Class Boundary (LCB), defined as the middle value between the lower class limit of the class and the upper class limit of the preceding class, and the Upper Class Boundary (UCB), the middle value between the upper class limit of the class and the lower class limit of the next class.
IV. Class Size (C) – the difference between two consecutive upper limits or two consecutive lower limits.
V. Class Mark (CM) – the midpoint or middle value of a class interval.
VI. Cumulative Frequency (CF) – the accumulated frequencies of successive classes. There are two types of cumulative frequencies:
A. Greater-than CF (> CF) – the number of observations greater than the lower class boundary (LCB).
B. Less-than CF (< CF) – the number of observations less than the upper class boundary (UCB).

Example: Construct a Frequency Distribution Table for the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university.

Solution:
1. Using Sturges' approximation formula, K = 1 + 3.332 log n, where K is the approximate number of classes and n is the number of cases, the approximate number of class intervals for the data set is K = 1 + 3.332 log(60) = 1 + 3.332(1.77815125) = 6.92, or 7 classes.
2. The range R is given by R = maximum value − minimum value = 59 − 21 = 38.
3. The approximate class size C is C = R / K = 38 / 7 = 5.43, rounded up to 6.
4. The lowest class interval (or the first class) is 21 – 26.
Adding the class size C = 6 to the class limits, beginning with the lowest class interval, we obtain the other class intervals shown as follows:

Class Intervals
21 – 26
27 – 32
33 – 38
39 – 44
45 – 50
51 – 56
57 – 62

Tally of Scores and the Frequency Distribution Table

In constructing a Frequency Distribution Table, attention must be given to selecting the number of class intervals or groupings. There are no exact rules for determining this number; however, one suggestion in the literature is to use Sturges' rule, such as the one specified in Step 1. To summarize, the general procedure is:
1. Determine the number of classes. For a first approximation, it is suggested to use Sturges' approximation formula, K = 1 + 3.332 log n, where K is the approximate number of classes and n is the number of cases.
2. Determine the range R, where R = maximum value − minimum value.
3. Determine the approximate class size C using the formula C = R / K. It is usually convenient to round C off to a whole number.
4. Determine the lowest class interval (or first class). This class should include the minimum value in the data set. For uniformity, let us agree that for our purposes the lower limit of the first class interval starts at the minimum value.
5. Determine all class limits by adding the class size C to the limits of the previous class.
6. Tally the scores/observations falling in each class.

Thus, the complete Frequency Distribution Table is as follows:

Note: The Data Analysis tool in Excel can be used to generate the frequency distribution once the class intervals are already set. Just use the Histogram function under the Data Analysis window. For the input range, enter the range of occupied cells of the data defined by the variable (the number of students taking the IQ test); for the bin range, enter the range of upper class limits of the class intervals. Then check Labels, Cumulative Percentage, and Chart Output, and click OK.
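As a cross-check of Steps 1-5, the class intervals can be reproduced with a short Python sketch. Only the minimum (21), maximum (59), and n = 60 from the example are assumed; the constant 3.332 is used as printed in the module's formula.

```python
import math

# Known quantities from the IQ-test example
minimum, maximum = 21, 59
n = 60  # number of cases (classes surveyed)

# Step 1: approximate number of classes (Sturges' approximation as
# printed in the module), rounded up as in the worked solution
k = math.ceil(1 + 3.332 * math.log10(n))  # 6.92 -> 7

# Step 2: the range
r = maximum - minimum  # 59 - 21 = 38

# Step 3: class size, rounded up so every score is covered
c = math.ceil(r / k)  # 5.43 -> 6

# Steps 4-5: build the class intervals starting at the minimum value
intervals = []
lower = minimum
for _ in range(k):
    intervals.append((lower, lower + c - 1))
    lower += c

print(k, r, c, intervals)
```

The sketch yields 7 classes of size 6, starting with 21 – 26 and ending with 57 – 62, matching the table above.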
Excel Output

Bin     Frequency   Cumulative %
26          6          10.00%
32         11          28.33%
38          4          35.00%
44          6          45.00%
50          9          60.00%
56         17          88.33%
62          7         100.00%
More        0         100.00%

LEARNING CHECK 1: The following data show the ages of customers who own a motorcycle. Organize the data in a Frequency Distribution Table.

LEARNING CHECK 2: Using Excel, construct the following statistical graphs for the same data set: a histogram, a frequency polygon, and a cumulative frequency polygon (ogive).

LESSON 2: GRAPHS ASSOCIATED WITH THE FREQUENCY DISTRIBUTION

Whenever the data set of a business variable contains a large number of numerical values, making descriptions or drawing conclusions from an array or a stem-and-leaf plot of the values is relatively difficult. In such cases, we need graphs or charts. There are various graphs or charts that can be used to visualize and characterize the numerical data values defined by the variables under investigation. These include the histogram, the frequency polygon, and the cumulative frequency polygon (ogive).

A. HISTOGRAM

A histogram is a statistical graph in which the class intervals are plotted on the horizontal axis and the class frequencies on the vertical axis. The heights of the bars represent the class frequencies, and the bars are drawn adjacent to each other. It shows the frequency of each class interval and does not necessarily reveal the individual data values of the actual observations.

B. FREQUENCY POLYGON

A frequency polygon is a statistical graph that displays the data set values using points interconnected by line segments. The frequencies of the scores or data values are represented by the heights of the points at the midpoints (class marks) of the class intervals.
C. CUMULATIVE FREQUENCY POLYGON OR OGIVE

A cumulative frequency polygon, or ogive, is a statistical graph that shows the cumulative frequencies for the class intervals in a frequency distribution. The cumulative frequencies of the distribution are marked on the vertical axis, while the upper class boundaries (UCB) are marked on the horizontal axis.

Example: Consider below the complete frequency distribution for the number of students taking the IQ test during a year in 60 randomly selected classes in a certain university, together with the corresponding graphs:
A. Histogram
B. Frequency Polygon
C. Cumulative Frequency Polygon (Ogive)

Provide appropriate labels for each of the statistical graphs. Then also provide a brief analysis and interpretation for each of the resulting statistical graphs.
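To sketch how the three graphs are built from a completed frequency distribution, the plotting coordinates can be computed directly: class marks carry the frequencies for the histogram and frequency polygon, while upper class boundaries carry the less-than cumulative frequencies for the ogive. The class limits and frequencies below are those of the IQ-test example; the resulting points can then be charted in Excel, as the module suggests, or with any plotting library.

```python
# Frequency distribution of the IQ-test example (7 classes of size 6)
lower_limits = [21, 27, 33, 39, 45, 51, 57]
upper_limits = [26, 32, 38, 44, 50, 56, 62]
frequencies = [6, 11, 4, 6, 9, 17, 7]

# Histogram / frequency polygon: frequencies are plotted at the class marks
class_marks = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Ogive: less-than cumulative frequencies are plotted at the
# upper class boundaries (upper limit + 0.5)
upper_boundaries = [hi + 0.5 for hi in upper_limits]
cumulative = []
total = 0
for f in frequencies:
    total += f
    cumulative.append(total)

# Cumulative percentages, as in the Excel output above
cumulative_pct = [round(100 * cf / cumulative[-1], 2) for cf in cumulative]

print(class_marks)
print(list(zip(upper_boundaries, cumulative, cumulative_pct)))
```

The computed cumulative percentages (10.00, 28.33, ..., 100.00) reproduce the Cumulative % column of the Excel output, confirming that the ogive rises to 100% at the last upper class boundary.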