MODULE 4: Data Management
Catalina B. Gayas & Emmeline R. Garcia
Leyte Normal University | Mathematics Unit

Table of Contents

Lesson 1 Data Collection, Organization, and Interpretation
   Basic Terminology in Statistics
   Data Collection and Sampling Techniques
   Frequency Distribution and Graphs for Numerical Data
Lesson 2 Measures of Central Tendency
   Mean (Raw and Grouped Data)
   The Weighted Mean
   Median (Raw and Grouped Data)
   Mode (Raw and Grouped Data)
   Types of Distribution
Lesson 3 Measures of Variation
   Range (Raw and Grouped Data)
   Mean Absolute Deviation (Raw and Grouped Data)
   Variance and Standard Deviation (Raw and Grouped Data)
Lesson 4 Measures of Relative Position
   Standard Score
   Percentiles, Deciles, and Quartiles
Lesson 5 Normal Distribution
   The Standard Normal Distribution
   Applications of Normal Distribution
Lesson 6 Correlation Coefficients and Linear Regression
   Correlation Analysis
   Linear Regression

Overview

Statistics is used in all aspects of human endeavor. It is used to describe data; to determine significant relationships between and among variables; to determine significant differences in a variable of interest between or among groups; and to make forecasts and predictions. The concepts in Statistics were already discussed in your K to 12 curriculum. Hence, this module focuses on applying these concepts in real settings to which you can relate. The aim of this module is to help you appreciate the importance of Statistics and, at the same time, have fun doing the exercises and activities.

This module includes the topics: Data Collection, Organization, and Interpretation; Measures of Central Tendency; Measures of Dispersion; Measures of Relative Position; Normal Distribution; and Correlation Coefficients and Linear Regression.
Computer applications will be used in this module, especially Microsoft Excel and statistical analysis software such as SPSS, for data analysis.

Objectives

At the end of this module, you should be able to:
1. demonstrate knowledge of basic statistical terms;
2. use statistical methods to summarize and organize data;
3. solve problems applying normal distribution;
4. apply linear regression and correlation in analyzing data; and
5. interpret computer outputs in data analysis.

LESSON 1: Data Collection, Organization, and Interpretation

Introduction

Statistics is defined as the science of collecting, organizing, summarizing, presenting, and interpreting data. There are three main reasons why students study statistics:
(1) to read and understand the various statistical studies published in print or broadcast media;
(2) to conduct research in their own field, since statistical procedures are basic to research; and
(3) to become better consumers and citizens by using the knowledge gained from studying statistics.

Basic Terminology in Statistics

In studying statistics, it is important to understand the basic terms used in the subject. The following terms are defined for this purpose.

A variable is a characteristic or attribute that can assume different values. Examples of variables are sex, nationality, score, and height. Data are the measurements or observations that the variables can assume. A data set is a collection of data values, and each value in the set is called a datum.

There are two branches of statistics. The branch that involves the collection, organization, summarization, and presentation of data is called descriptive statistics.
The branch that makes generalizations from a sample (a representative part of a population) to a population (the totality of all observations or entities of any sort), performs estimation and hypothesis testing, determines relationships among variables, and makes predictions is called inferential statistics.

Variables can be classified as quantitative or qualitative. A quantitative variable takes numerical values that can be ordered or ranked. IQ, scores, weight, and temperature are examples of quantitative variables. Quantitative variables are further classified as discrete or continuous. A discrete variable assumes values that can be counted. A continuous variable, on the other hand, assumes an unlimited number of values between any two specific values; it is measured rather than counted. The number of deaths in a certain locality due to the CoViD-19 pandemic is an example of a discrete variable, while the height of a person is an example of a continuous variable. Why is height considered a continuous variable? What are other examples of continuous variables? How about discrete variables?

Variables are also classified according to four levels of measurement: nominal, ordinal, interval, and ratio.

Nominal scale is the simplest scale of measurement. It classifies data into mutually exclusive categories and uses numbers as labels only. Sex, occupation, religious affiliation, and marital status are examples of nominal data.

Ordinal scale uses numbers for labelling, and the numbers can be ranked. However, the differences between ranks are not equal. Socio-economic status, Latin honors, and academic rank are examples of ordinal data.

Interval scale possesses the characteristics of the ordinal scale (label and rank), and the differences between ranks are equal. Interval data, however, have no true zero value. Scores in an examination, temperature, and Intelligence Quotient (IQ) are examples of interval data.

Ratio scale is the highest level of measurement.
It possesses all the characteristics of an interval scale (label, rank, equal differences between ranks), and a true zero value exists. Distance travelled, height, weight, and age are examples of ratio data.

Variables are also classified according to their functions, especially in experimental studies: independent or explanatory variables, dependent or outcome variables, and confounding variables. The independent variable is the variable manipulated by the researcher, while the dependent variable is the variable affected or influenced by the manipulated variable. A confounding variable is a variable, other than the independent variable, that also influences the dependent variable. For example, a researcher is interested in finding out the effect of learning delivery modes (pure online, pure printed module, or a mixture of online and printed module) on the performance (test score) of the students in GE104. The delivery mode is the independent variable, and the performance is the dependent variable. The performance can also be affected by the learning ability of the students. Thus, learning ability is a confounding variable.

Data Collection and Sampling Techniques

Data can be collected in different ways. The method to use depends on the source of the data as well as the type of data to be collected. Data can be collected through surveys (telephone, questionnaire, or interview), tests, observation, and experimentation. How each method is carried out, and the advantages of one over another, are not part of this lesson because they are discussed exhaustively in your research course.

Data are collected from a representative part of a population called a sample. The process of selecting samples is called sampling. There are two types of sampling: non-probability and probability sampling.
In non-probability sampling, not every member of the population is given an equal chance to be chosen; hence the samples are not true representatives of the population. If the objective of the study is to make a generalization, using non-probability sampling is discouraged. Convenience or accidental sampling, purposive or judgmental sampling, and quota sampling are the most common techniques in non-probability sampling.

Probability sampling, on the other hand, gives each member of the population an equal chance to be selected as a representative. There are four techniques under this type of sampling: simple random sampling, systematic random sampling, stratified random sampling, and cluster random sampling.

Simple Random Sampling is a technique used when the population is homogeneous with respect to the characteristic of interest to the researcher and the population size is known (Petilos, 2012). The sample can be selected either by the lottery method or by using random numbers.

Systematic Random Sampling is a technique that obtains the desired sample size by selecting every kth subject. To select the sample, the researcher assigns a number to each member of the population (numbering consecutively) and then determines the value of k by dividing the total number of cases (population) by the desired number of samples. For example, suppose the total population (N) is 1,000 and the sample size (n) is 100. Then k = 1,000/100 = 10, so the researcher selects every 10th subject in the population, beginning from a starting number between 1 and 10 chosen by simple random sampling. If the starting number is 6, the researcher takes the subjects numbered 6, 16, 26, and so on, until the desired number of samples is completed.
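The selection rule in the systematic sampling example can be sketched in Python. This is only an illustration and not part of the original module: the function name `systematic_sample` and its parameters are assumptions made for the example, and the sketch assumes that N is a multiple of n so that k = N/n is a whole number.

```python
import random


def systematic_sample(N, n, start=None, seed=None):
    """Return the subject numbers (1-based) picked by systematic sampling.

    N     -- population size
    n     -- desired sample size
    start -- starting number between 1 and k; drawn at random if omitted
    seed  -- optional seed so that a random start can be reproduced
    """
    k = N // n  # sampling interval: every kth subject is taken
    if start is None:
        # Simple random choice of the starting number between 1 and k.
        start = random.Random(seed).randint(1, k)
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    return [start + i * k for i in range(n)]


# The example in the text: N = 1,000, n = 100, starting number 6.
picked = systematic_sample(1000, 100, start=6)
print(picked[:5], "...", picked[-1])
```

With a starting number of 6 the list begins 6, 16, 26 and contains 100 subject numbers. Omitting `start` lets the sketch draw the starting number at random, as the text describes.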
Stratified Random Sampling is a technique in which the population is grouped into subgroups called strata according to common characteristic/s determined by the researcher. The subjects are selected from each stratum in proportion to the size of that stratum. For example, suppose the population consists of all freshman students across the three colleges (A, B, and C) of University X. The total freshman population of 1,400 is divided as follows: NA = 350, NB = 500, and NC = 550, and the researcher wishes to take a total of 350 respondents. The number of samples to select from each stratum, using either simple random sampling or systematic random sampling, is computed as follows:

College    N        Computation                   n
A          350      (350/1,400)(350) = 87.5       88
B          500      (500/1,400)(350) = 125        125
C          550      (550/1,400)(350) = 137.5      138
Total      1,400                                  351

[Note: Because of rounding in A and C (87.5 becomes 88 and 137.5 becomes 138), the total is 351 rather than 350. The researcher has to decide in which subgroup to reduce the samples by 1.]

Cluster Random Sampling is a technique used when the population is very large or the respondents reside in a large geographic area, and it is impossible for the researcher to obtain a list of all members of the population. The members of each cluster are heterogeneous. Unlike stratified random sampling, where the subjects are selected individually, in this technique the cluster/s are selected randomly and all members of the selected cluster/s represent the population. For example, a researcher wishes to determine the type of fertilizer (pure synthetic, pure organic, or a combination of synthetic and organic) used by rice farmers in the municipality of Town Q. Assuming that there is no available list of rice farmers (categorized as small-scale, medium-scale, and large-scale rice producers), the researcher can get a copy of the map of Town Q and determine the number of barangays located outside downtown and along the seashore areas. Each of these barangays is considered a cluster.
Suppose there are 43 barangays that belong to this group. Therefore, there are 43 clusters to choose from. The researcher then decides how many of these barangays will be included and randomly selects the cluster/s. The rice farmers in the selected cluster/s represent the group from Town Q.

Frequency Distribution and Graphs for Numerical Data

Once the researcher has collected the data, the next thing to do is to organize them. There are three ways of presenting data: tabular, graphical, and textual. The following discussion focuses on how to organize raw data and subsequently represent them using graphs.

Example 1.1
Below are the scores of 50 students in a Statistics examination.

63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84

Looking at the array of scores, it would be difficult for the reader to tell the characteristics of the group. Thus, a frequency distribution needs to be prepared. A frequency distribution is the organization of raw data in table form, using classes (groups) and frequencies. The following are the steps in preparing a frequency distribution.

Step 1. Determine the number of classes.
   • Find the highest value (HV) and the lowest value (LV).
   • Find the range (R) by subtracting the lowest value from the highest value.
   • Estimate the number of classes by taking the square root of n; call this k. The actual number of classes could be greater than the estimated one.
Step 2. Determine the class size of the interval: $c = \frac{R}{k}$, rounded to a whole number (the illustration below rounds 7.43 up to 8).
Step 3. Determine the lower and upper limits of the lowest class interval. The lower limit should be divisible by the class size.
Step 4. Determine the upper (highest) class.
Step 5. Tally the scores in their respective classes.
Step 6. Summarize the tallies.
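The six steps above can be sketched in Python using the 50 scores of Example 1.1. This is a minimal sketch under stated assumptions rather than a prescribed procedure: the function name `frequency_distribution` is hypothetical, the class size is rounded up with `math.ceil` (an assumption that matches the choice of c = 8 in the illustration), and the lowest lower limit is taken as the largest multiple of c that does not exceed the lowest value.

```python
import math

scores = [
    63, 88, 79, 92, 86, 87, 83, 78, 47, 67,
    68, 76, 46, 81, 92, 77, 76, 84, 70, 66,
    77, 75, 98, 81, 82, 81, 87, 78, 70, 60,
    94, 79, 52, 82, 77, 81, 77, 70, 74, 61,
    56, 69, 83, 83, 71, 48, 90, 52, 75, 84,
]


def frequency_distribution(data):
    """Return (R, k, c, table); table rows are (lower, upper, frequency)."""
    hv, lv = max(data), min(data)
    r = hv - lv                          # Step 1: range
    k = round(math.sqrt(len(data)))      # Step 1: estimated number of classes
    c = math.ceil(r / k)                 # Step 2: class size, rounded up (assumption)
    lower = (lv // c) * c                # Step 3: lower limit divisible by c
    table = []
    while lower <= hv:                   # Step 4: stop once the class holding HV is built
        upper = lower + c - 1
        f = sum(lower <= x <= upper for x in data)   # Steps 5-6: tally and summarize
        table.append((lower, upper, f))
        lower += c
    return r, k, c, table


r, k, c, table = frequency_distribution(scores)
print(f"R = {r}, k = {k}, c = {c}")
for lower, upper, f in reversed(table):  # highest class first, as in the module's tables
    print(f"{lower:>3} - {upper:<3}  {f}")
```

For these scores the sketch gives R = 52, k = 7, and c = 8, with eight classes running from 40 - 47 to 96 - 103, one more class than the estimate k.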
Illustration: Using the array of raw scores given above, we have:

1. Determine the number of classes.
   R = HV - LV = 98 - 46
   R = 52 (this is the gap between the highest and lowest scores in the given data set)
   $k = \sqrt{50} = 7.07$, so k = 7

2. Determine the class size.
   $c = \frac{R}{k} = \frac{52}{7} = 7.43$, and we take c = 8 (rounded up to the next whole number).

3. Determine the lower and upper limits of the lowest class interval.
   The lowest value in the given data set is 46, which is not divisible by the class size of 8, so we find the closest smaller number that is divisible by 8. That number is 40. So the lower limit of the lowest class interval is 40 and its upper limit is 47, because the lower limit of the next class interval is 48 (the lower limit of the preceding class plus the class size c). It follows that the upper limit of the next class interval is 55. Thus, the second class interval is 48 - 55. Following the same procedure, you can find the remaining class intervals.

4. Determine the upper class.
   The highest class interval should contain the highest value of the given data set. Our highest value is 98, which is not divisible by the class size of 8, so the lower limit of the highest class interval should be the closest smaller number that is divisible by 8. That number is 96. Thus, the highest class interval is 96 - 103.

5. List down the class intervals and tally the scores in their respective classes.

Class Limits    Class Boundaries    Tallies              Frequency
96 - 103        95.5 - 103.5        /                    1
88 - 95         87.5 - 95.5         /////                5
80 - 87         79.5 - 87.5         /////-/////-////     14
72 - 79         71.5 - 79.5         /////-/////-///      13
64 - 71         63.5 - 71.5         /////-///            8
56 - 63         55.5 - 63.5         ////                 4
48 - 55         47.5 - 55.5         ///                  3
40 - 47         39.5 - 47.5         //                   2

REMARKS:
• In this illustration the actual number of classes, which is 8, is greater than the estimated value of k, which is 7.
• The second column shows the boundaries of each class interval, in which the actual lower and upper limits are indicated.
These are called true limits or class boundaries.
• The true upper limit of the preceding class is also the true lower limit of the succeeding class. This shows the continuity of the data.

Using the same data set as presented in the frequency distribution above, we can prepare graphs. In this module, we will discuss only the histogram, the frequency polygon, and the ogive. These are the most commonly used graphs in research.

A histogram displays the data using continuous bars (vertical or horizontal). It is a bar graph in which the bars are constructed without space in between, which implies that the data presented are continuous. The heights/lengths of the bars show the frequencies of the respective classes.

The frequency polygon, on the other hand, displays the data by using lines that connect the points plotted for the frequencies of each class. This graph is used when the data are continuous. Both graphs use the midpoints of the classes along the variable axis.

The ogive is a graph that shows the cumulative frequencies for the classes in the given distribution. It can be constructed either for the less than cumulative frequency (<cf) or for the greater than cumulative frequency (>cf).

The following are the steps in constructing the above-specified graphs manually. The same graphs can be constructed using either Excel or Minitab; the specific steps are illustrated in the book of Bluman.

Example 1.2
Before constructing the different graphs, we need to add more information to our frequency distribution, as shown below.

Class Interval    f     X       <cf    >cf    rf (%)
95.5 - 103.5      1     99.5    50     1      2.0
87.5 - 95.5       5     91.5    49     6      10.0
79.5 - 87.5       14    83.5    44     20     28.0
71.5 - 79.5       13    75.5    30     33     26.0
63.5 - 71.5       8     67.5    17     41     16.0
55.5 - 63.5       4     59.5    9      45     8.0
47.5 - 55.5       3     51.5    5      48     6.0
39.5 - 47.5       2     43.5    2      50     4.0
                  N = 50                      100.0

REMARKS:
• The midpoint of each class is obtained using the formula $X = \frac{LL + UL}{2}$.

Steps in Constructing a Histogram

Step  What to do?
1     Construct two perpendicular axes (vertical and horizontal).
2     Label the vertical axis as the frequency axis and the horizontal axis as the variable axis. (In our illustration below, our variable is a score.)
3     Lay off segments along the vertical axis (y-axis) to correspond to the frequencies. (The segments must be equal in length.)
4     Lay off segments along the horizontal axis (x-axis) to correspond to the different class intervals of the variable. The first line segment should be moved a little to the right if the lowest value of the variable is not zero.
5     Mark all midpoints of the intervals and label these using the class midpoints.
6     Draw rectangles or bars whose heights correspond to the frequency counts and whose widths correspond to the class size. (Shade or color your bars.)

Adapted from: Resource Materials in Basic Statistics (Petilos, p.9)

[Figure 1.1. Histogram (vertical axis: Frequency; horizontal axis: Score)]

Steps in Constructing a Frequency Polygon

Step  What to do?
1     Construct two perpendicular axes (vertical and horizontal).
2     Label the vertical axis as the frequency axis and the horizontal axis as the variable axis. (In our illustration above, our variable is a score.)
3     Lay off segments along the vertical axis (y-axis) to correspond to the frequencies. (The segments must be equal in length.)
4     Lay off segments along the horizontal axis (x-axis) to correspond to the different class intervals of the variable. The first line segment should be moved a little to the right if the lowest value of the variable is not zero.
5     For each class interval, treat the class midpoint and the corresponding frequency as an ordered pair and plot it in the plane determined by the coordinate axes.
6     Join the plotted points using line segments from left to right. To close the polygon, extend one class interval on both sides and connect the endpoints of the graph to the midpoints of the extended segments along the x-axis.
Adapted from: Resource Materials in Basic Statistics (Petilos, p.10)

[Figure 1.2. Frequency Polygon (vertical axis: Frequency; horizontal axis: Score)]

Steps in Constructing an Ogive

Step  What to do?
1     Construct two perpendicular axes (vertical and horizontal).
2     Label the vertical axis as the cumulative frequency axis and the horizontal axis as the variable axis. (In our example the variable is a score.)
3     Lay off equal segments along the vertical axis (y-axis) to correspond to the cumulative frequencies. Use an appropriate scale to represent the cumulative frequencies. (Depending on the numbers in the cumulative frequencies, the scale can be by 2's, 4's, 5's, etc.)
4     Lay off equal segments along the horizontal axis (x-axis) to correspond to the true upper limits for the less than ogive, or to the true lower limits for the greater than ogive.
5     Plot the cumulative frequencies against the corresponding class boundaries.
6     Join the plotted points using line segments from left to right.

Reference: Bluman, pp. 54-55

REMARKS:
• The ogive is used to determine the percentage or the number of cases found below or above a particular boundary.
• If the ogives (for >cf and <cf) are graphed on the same coordinate plane, a line can be drawn from the point of intersection of the two graphs onto the variable axis; the point it reaches represents the median of the data set.

[Figure 1.3. Less than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)]
[Figure 1.4. Greater than cumulative frequency (vertical axis: Frequency; horizontal axis: Class Boundaries)]

Stem and Leaf Plot

Another method of organizing data, which combines sorting and graphing, is called the stem and leaf plot. It is a data plot that uses the leading digit as the stem and the trailing digit as the leaf.

Steps in Constructing a Stem and Leaf Plot

Step  What to do?
1     List down the leading digits of the data set, called the stems.
      Arrange them in a column either from lowest to highest or vice versa.
2     Starting from the first entry of the data set and going to the last, carefully record each trailing digit (leaf) beside its corresponding stem.
3     Arrange the trailing digits in each row in order. If there are no data values in a class, the stem number is written and the leaf row is left blank.

Reference: Bluman, pp. 81-82

Example 1.2
Let us illustrate the above procedure using the data on the scores of 50 students in a Statistics examination. The data are reproduced as follows:

63 88 79 92 86 87 83 78 47 67
68 76 46 81 92 77 76 84 70 66
77 75 98 81 82 81 87 78 70 60
94 79 52 82 77 81 77 70 74 61
56 69 83 83 71 48 90 52 75 84

Steps:
1. Record the leaves beside their stems.

Stem (Leading Digit)    Leaf (Trailing Digit)
9                       48220
8                       831123627113744
7                       76599717678800045
6                       3897601
5                       622
4                       687

2. Rearranging the trailing digits (leaves), we have:

Stem    Leaf
9       02248
8       111122333446778
7       00014556677778899
6       0136789
5       226
4       678

REMARKS:
• The figure shows that the distribution peaks in the center and that there are no gaps in the data.
• The highest score is 98 and the lowest is 46.
• Most scores are 70 and above.

What other information can you draw from the figure above?

Exercises 1.1

A. Determine the area of statistics (descriptive or inferential) illustrated by the following statements.
1. A recent study showed that eating garlic could lower blood pressure.
2. The teacher-pupil ratio in public schools has increased from 1:40 in 2015 to 1:50 in 2019.
3. It is predicted that the average number of automobiles each household owns will increase next year.
4. A study revealed that Lagundi is more effective in curing cough than a similar product.
5. Consumers generally prefer Colgate to any other toothpaste.

B. In each statement below, identify the variable/s and classify it/them according to the level of measurement (nominal, ordinal, interval, ratio).
1. Marital status of faculty members in a university.
2. Time it takes a student to travel from home to school.
3. Scores in the College Admission Test of freshman students in University Q.
4. Socio-economic status of the residents in a barangay (poor, average, above-average).
5. Ages of freshman college students of Leyte Normal University.

C. Classify each variable as discrete or continuous.
1. Number of CoViD-19 cases in Eastern Visayas.
2. Weights of backpacks of college students inside a Science laboratory room.
3. Number of new monobloc chairs inside the university social hall.
4. Blood pressures of patients seeking admission in a hospital.
5. Number of boxes of disposable surgical masks sold in one pharmacy in three days.

D. A study is to be conducted to determine the level of language proficiency and numeracy skills among the 700 Education and 300 Management graduating students at University Q. The researcher wants a sample of 300, obtained by selecting representatives from the two programs.
1. What is the population of the study?
2. What is the sample in the study?
3. What are the variables of the study? What is the level of measurement of each variable?
4. What sampling technique is used in this study?

E. An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a grouped frequency distribution, frequency polygon, histogram, and ogives. (Show all necessary solutions.)

52 58 75 79 57 65 62 77 56 51
59 53 51 66 55 68 63 78 50 53
67 65 69 66 69 57 73 72 75 55

LESSON 2: Measures of Central Tendency

Introduction

Statistics is the science of collecting, organizing, summarizing, presenting, and interpreting data. There are two branches of statistics.
The branch that involves the collection, organization, summarization, and presentation of data is called descriptive statistics, while the branch that involves interpretation and drawing conclusions is called inferential statistics. Descriptive statistics include the measures of central tendency, measures of position, and measures of variability.

There are three measures of central tendency or measures of central location, namely the mean, the median, and the mode. A measure of central tendency is a single value that describes a whole set of data by identifying the central position within the given data set. It is sometimes called a measure of central location or a summary statistic.

Mean (Raw and Grouped Data)

Data in their original form are called raw or ungrouped data, while data that have been organized into a frequency distribution are called grouped data.

For raw data, the mean is defined as the arithmetic average of a data set. It is equal to the sum of the measurements divided by the number of cases (n). It is the measure used when the data set has no extreme values and the data are either interval or ratio. Among the three measures of central tendency, the mean is the most reliable and is amenable to further mathematical manipulation, which makes it useful for inferential statistics.

Formula: $\text{mean} = \frac{\sum x}{n}$

The Greek capital letter sigma ($\Sigma$) is used to denote a sum. Thus, the formula above means the summation of the values of x divided by the total number of cases. For data collected from a population, the symbol used for the mean is the Greek letter $\mu$ (read as "mu"), which is called a parameter. For data collected from a sample, the symbol used for the mean is $\bar{x}$ (read as "x bar"), which is called a statistic. The total number of cases is denoted by N for a parameter and by n for a statistic. Thus the working formula for the mean of a population is $\mu = \frac{\sum x}{N}$, and for the mean of a sample it is $\bar{x} = \frac{\sum x}{n}$.

Example 2.1. Compute the average of the scores of 15 students in a Math quiz.
23 25 34 32 22 24 26 24 34 30 26 26 37 25 24

Solution: Using a calculator, we have:

$\bar{x} = \frac{23 + 25 + 34 + \dots + 24}{15} = \frac{412}{15}$
$\bar{x} \approx 27.5$

This implies that the average score of the 15 students in the Math quiz is 27.5.

Note: Rounding Rule for the Mean. The mean should be rounded to one more decimal place than occurs in the raw data.

For grouped data, the mean is obtained by using the formula below:

$\bar{x} = \frac{\sum fX}{N}$

where: $\bar{x}$ = average or mean
       f = class frequency
       X = midpoint of each class
       N = total number of cases

Example 2.2. Using the data in Example 1.1 (scores of 50 students in a Statistics examination), we find the mean of the grouped data.

Class Interval    f     X       fX
96 - 103          1     99.5    99.5
88 - 95           5     91.5    457.5
80 - 87           14    83.5    1,169.0
72 - 79           13    75.5    981.5
64 - 71           8     67.5    540.0
56 - 63           4     59.5    238.0
48 - 55           3     51.5    154.5
40 - 47           2     43.5    87.0
                  N = 50        ΣfX = 3,727.0

By substitution, we have:

$\bar{x} = \frac{\sum fX}{N} = \frac{3727}{50} = 74.54$

Therefore, the average score of the 50 students in the Statistics examination is 74.54.

Note: We rounded off the computed mean to the nearest hundredth because each class interval actually extends 0.5 below and 0.5 above the given limits. Thus, the true lower limit of each class interval is 0.5 below the apparent lower limit, and the true upper limit is 0.5 above the apparent upper limit. For example, the class interval 40 - 47, with a lower limit of 40 and an upper limit of 47, has a true lower limit of 39.5 and a true upper limit of 47.5.

There is another method of finding the mean of grouped data, which uses the assumed deviation. However, that method is not discussed in this module.

The Weighted Mean

When the weights of the values or observations are not equal, the weighted mean is obtained.
The weighted mean is computed using the formula below:

$\bar{X} = \frac{\sum wX}{\sum w} = \frac{w_1X_1 + w_2X_2 + \dots + w_nX_n}{w_1 + w_2 + \dots + w_n}$

where: $w_1, w_2, \dots, w_n$ are the weights and $X_1, X_2, \dots, X_n$ are the values or observations
       $\sum wX$ = sum of the products of each value and its respective weight
       $\sum w$ = sum of the weights

Example 2.3
Find the grade weighted average of a student in his five subjects, as shown in the table below:

Subject           Grade (X)    No. of Units (w)    wX
Mathematics       1.5          3                   4.5
English           1.7          3                   5.1
PE                1.3          2                   2.6
Physics           1.6          5                   8.0
Social Science    1.5          3                   4.5
                               Σw = 16             ΣwX = 24.7

Substituting into the formula, we find the Grade Weighted Average (GWA) of the student:

$\bar{X} = \frac{\sum wX}{\sum w} = \frac{24.7}{16} = 1.54$

Thus, the grade weighted average of the student is 1.54.

Median (Raw and Grouped Data)

The median is the middlemost value of the measurements when they are arranged from lowest to highest. It is used when the data are at least ordinal. The median is not affected by extreme values or outliers. The median is reliable but less stable than the mean.

For raw or ungrouped data, the median is obtained by taking the middlemost value after the data set is arranged from lowest to highest. It is the value that divides the data set into two equal parts.

Example 2.4
Using the data set in Example 2.1, we have:

23 25 34 32 22 24 26 24 34 30 26 26 37 25 24

Solution:
a) Arrange the scores from lowest to highest. Using a stem and leaf plot, we have:

Stem    Leaf
3       42407
2       3524646654

Rearranging the leaves in our plot above, we have:

Stem    Leaf
3       02447
2       2344455666

22 23 24 24 24 25 25 26 26 26 30 32 34 34 37

Thus, the median of the given data set is 26, the 8th of the 15 ordered scores. This implies that there are seven cases below it and seven cases above it.

Example 2.4 is a data set with an odd number of cases (n = 15). How do we find the median when the number of cases is even? By definition, the median is still the middlemost value.
Example 2.5

22 23 24 24 24 25 25 26 26 26 30 32 34 34 35 37

To get the median when the number of cases is even, we take the average of the $\left(\frac{n}{2}\right)$th case and the $\left(\frac{n}{2}+1\right)$th case. Here n = 16, so these are the 8th and 9th cases. Thus we have:

$Md = \frac{\left(\frac{n}{2}\right)\text{th case} + \left(\frac{n}{2}+1\right)\text{th case}}{2} = \frac{26 + 26}{2}$
$Md = 26$

This implies that the value 26 divides the cases into two equal parts. This 26 is neither the 8th nor the 9th case; it is the value midway between the 8th and 9th cases.

For grouped data, the median is obtained using the formula below:

$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c)$

where: LL = true lower limit of the median class
       N = total number of cases
       cf = cumulative frequency below the median class
       f = frequency of the median class
       c = class size

Example 2.6
Using the data in Example 1.1 (scores of 50 students in a Statistics examination), we find the median of the grouped data.

Class Interval              f         <cf
96 - 103                    1         50
88 - 95                     5         49
80 - 87                     14        44
72 - 79 (median class)      13 (f)    30
64 - 71                     8         17 (cf)
56 - 63                     4         9
48 - 55                     3         5
40 - 47                     2         2
                            N = 50

Note that 50% of 50 cases is 25. This means that we look for a value such that 50% of the cases fall below it and 50% fall above it. The median class is the lowest class whose <cf reaches 25, which is 72 - 79. Using the formula above, we have:

$Md = LL + \left(\frac{\frac{N}{2} - cf}{f}\right)(c) = 71.5 + \left(\frac{\frac{50}{2} - 17}{13}\right)(8) = 71.5 + \left(\frac{25 - 17}{13}\right)(8) = 71.5 + (0.6154)(8) = 71.5 + 4.92$
$Md = 76.42$

This implies that 76.42 is the middlemost value of the given data set. This means that there are 25 cases below this value and 25 cases above it.

Mode (Raw and Grouped Data)

The mode is the most frequent value in a given data set. It is used when you want a quick estimate of the typical value in a given data set. The mode is the most unstable measure of central tendency, especially if there are only a few cases. A given data set can have more than one mode.
A data set with two modes is called bimodal.

Example 2.7
Using the data set in Example 2.1, we notice that there are two values (24 and 26) that have the same frequency of 3.

22 23 24 24 24 25 25 26 26 26 30 32 34 34 37

Therefore, the modes of the given distribution are 24 and 26. This is an example of a bimodal distribution.

Example 2.8
Find the mode of the following data: 12, 34, 12, 71, 48, 93, 71.

By inspection, 12 and 71 each occur twice, while every other number occurs only once. Therefore, the modes of the data as listed are 12 and 71, so this data set is also bimodal. A distribution with only one mode is called unimodal; for example, if the last value (71) were dropped, 12 would be the only mode.

Example 2.9
Find the mode of the following data set: 12, 5, 8, 9, 11, 11, 4, 7, 23, 7, 8, 12, 23, 9, 4, 5

By inspection, each number in the list occurs twice. There is no number that occurs more often than the others. Therefore, there is no mode.

For grouped data, the mode is obtained by using the formula below:

$$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)c$$

where:
LL = true lower limit or lower boundary of the modal class;
$d_1$ = absolute difference between the frequencies of the modal class and the lower class interval (interval just below it);
$d_2$ = absolute difference between the frequencies of the modal class and the higher class interval (interval just above it);
c = the class size

Example 2.10
Using the data in Example 1.1 we find the mode of grouped data.

(Scores of 50 students in Statistics examination)

Class Interval   f
96 – 103         1
88 – 95          5      (interval just above the modal class)
80 – 87          14     (modal class)
72 – 79          13     (interval just below the modal class)
64 – 71          8
56 – 63          4
48 – 55          3
40 – 47          2

Using the formula, we obtain the mode of the given data set:

$$Mo = LL + \left(\frac{d_1}{d_1 + d_2}\right)c = 79.5 + \left(\frac{14 - 13}{(14 - 13) + (14 - 5)}\right)(8) = 79.5 + \left(\frac{1}{1 + 9}\right)(8)$$

$$Mo = 79.5 + (0.10)(8) = 79.5 + 0.80 = 80.30$$

Therefore, the mode of the given data set is 80.3.
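The grouped-data formulas for the median (Example 2.6) and the mode (Example 2.10) can be sketched in Python as a quick check of the hand computations. This is only an illustrative sketch: the function names `grouped_median` and `grouped_mode` are hypothetical, the true lower limit is assumed to be half a unit below the apparent limit (scores are whole numbers), and the classes are listed from lowest to highest.

```python
# Frequency table from Example 1.1, listed from the lowest class upward:
# (apparent lower limit, apparent upper limit, frequency)
classes = [
    (40, 47, 2), (48, 55, 3), (56, 63, 4), (64, 71, 8),
    (72, 79, 13), (80, 87, 14), (88, 95, 5), (96, 103, 1),
]

def grouped_median(classes):
    """Md = LL + ((N/2 - cf) / f) * c, with classes in ascending order."""
    n = sum(f for _, _, f in classes)
    cf = 0  # cumulative frequency below the current class
    for lower, upper, f in classes:
        if cf + f >= n / 2:            # first class whose <cf reaches N/2
            ll = lower - 0.5           # true lower limit (assumes whole-number scores)
            c = upper - lower + 1      # class size
            return ll + (n / 2 - cf) / f * c
        cf += f

def grouped_mode(classes):
    """Mo = LL + (d1 / (d1 + d2)) * c; assumes d1 + d2 > 0."""
    freqs = [f for _, _, f in classes]
    i = freqs.index(max(freqs))        # modal class (the first one if tied)
    below = freqs[i - 1] if i > 0 else 0
    above = freqs[i + 1] if i + 1 < len(freqs) else 0
    d1 = abs(freqs[i] - below)         # difference with the class just below
    d2 = abs(freqs[i] - above)         # difference with the class just above
    lower, upper, _ = classes[i]
    return (lower - 0.5) + d1 / (d1 + d2) * (upper - lower + 1)

print(round(grouped_median(classes), 2))  # 76.42
print(round(grouped_mode(classes), 2))    # 80.3
```

Both values agree with the results obtained by hand in Examples 2.6 and 2.10.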
In summary, the given data set has the following values of the measures of central tendency:

Mean = 74.54
Median = 76.42
Mode = 80.30

What is the characteristic of our illustrative distribution? Why?

Types of Distribution

The characteristic of a distribution can be determined by the shape of its graph (histogram or frequency polygon). According to Bluman, the symmetric, positively skewed, and negatively skewed shapes are the most important shapes of graphs that describe a distribution. Skewness refers to the degree of departure of the distribution from the line of symmetry.

When the data values are evenly distributed on both sides of the mean and the distribution is unimodal, it is called a symmetric distribution. Further, the mean, median, and mode have equal values and are at the center of the distribution. In symbols, $\bar{x} = Md = Mo$.

A positively skewed or right-skewed distribution is unimodal, and the majority of the data values cluster at the lower end of the distribution and to the left of the mean. Moreover, in a positively skewed distribution, the mode is less than the median and the median is less than the mean. In symbols, $Mo < Md < \bar{x}$.

A negatively skewed or left-skewed distribution is observed when the majority of the data values cluster at the upper end of the distribution and to the right of the mean. Furthermore, in a negatively skewed distribution, the mode is greater than the median and the median is greater than the mean. In symbols, $\bar{x} < Md < Mo$.

The following graphs are illustrations of the three types of distribution according to skewness (MathBits.com).
[Figure: Symmetric Distribution; Positively Skewed Distribution; Negatively Skewed Distribution]

Summary of Measures of Central Tendency

Mean (common name: Arithmetic Average)
When to use:
• There are no extreme values
• The data are at least interval
Advantages:
• Most stable, i.e., less variable from sample to sample
• Amenable to further mathematical manipulation, which makes it useful in inferential statistics
Disadvantage:
• Affected by extreme scores or values

Median (common name: Middle Score/Value)
When to use:
• The distribution is skewed
• The data are at least ordinal or rank
Advantages:
• Easy to compute
• Not affected by extreme scores or values
Disadvantage:
• Less stable from sample to sample

Mode (common name: Typical Score/Value)
When to use:
• A quick estimate of the typical score or value is to be determined
Advantage:
• Easy to compute
Disadvantage:
• The most unstable measure, especially when the number of cases is small

Adapted from: Resource Materials in Basic Statistics (Petilos, p.14)

Exercises 2.1

A. Using Exercise 13.1 on page 811 of the book Mathematical Excursions by Aufmann, answer numbers 4 to 9 and 11.

B. Using the same exercise, find the mean, median, and mode of the data set of number 14 on page 812.

C. Problem Solving.
1. The mean age of eight college freshman students is 19.25, and six of the ages are 19, 18, 20, 19, 20, and 18. What are the ages of the two remaining students, who are twin siblings? What is the mode (age) of the eight students?
2. Find the mean of 20, 30, 40, 50, and 60.
a. Add 5 to each value and find the mean.
b. Subtract 5 from each value and find the mean.
c. Multiply each value by 5 and find the mean.
d. Divide each value by 5 and find the mean.
e. Make a general statement about each situation.

LESSON 3: Measures of Variation

Introduction

In the preceding lesson you learned the three measures of central tendency, namely, the mean, median, and mode.
However, to describe a data set adequately, one needs more than the measures studied in the previous lesson, because it is tempting to conclude that two or more data sets do not differ when their averages are equal. In this lesson, we will discuss the measures of variation (spread), also called measures of dispersion. Four measures of variability, for both ungrouped and grouped data, are discussed in this module: the range, mean absolute deviation, variance, and standard deviation.

Range (Raw and Grouped Data)

The range is simply the gap or difference between the highest and lowest values (observations) of the data set. In formula: R = HV – LV. If R = 0, all values in the data set are equal; thus, there is no variability in the data.

Example 3.1
Ages of female faculty members from three departments.

Statistical Measure   Data Set A   Data Set B   Data Set C   Implication/Impression
(ages)                37           40           39
                      38           41           40
                      42           42           42
                      45           43           43
                      48           44           46
Mean                  42           42           42           Equal means
Range                 11           4            7            Distribution A is more spread. Why?

According to Petilos in his Resource Material in Basic Statistics, the range of grouped data is equal to the difference between the true upper limit of the highest class interval and the true lower limit of the lowest class interval. If the apparent limits are used, the range is equal to the upper limit of the highest class interval minus the lower limit of the lowest class interval, plus 1.
In formula: $R = (UL)_H - (LL)_L$, where $(UL)_H$ is the true upper limit of the highest class interval and $(LL)_L$ is the true lower limit of the lowest class interval.

Example 3.2
Scores of 50 students in Statistics examination

Class Interval   f
96 – 103         1
88 – 95          5
80 – 87          14
72 – 79          13
64 – 71          8
56 – 63          4
48 – 55          3
40 – 47          2
                 N = 50

Using the data set as presented in the distribution above, the range is:

R = 103.5 – 39.5 = 64 (using true limits)
R = 103 – 40 + 1 = 63 + 1 = 64 (using apparent limits)

Mean Absolute Deviation (Raw and Grouped Data)

The mean absolute deviation (MAD) of a data set is defined as the average distance between each data value and the mean. It helps to describe how "spread out" the values in a data set are (https://www.khanacademy.org/math).

The MAD for raw data is computed using the following formula:

$$MAD = \frac{\sum |X - \bar{x}|}{N}$$

where:
X = score or value
$\bar{x}$ = mean score or mean value
N = total number of cases

Using the data set of Example 3.1 and computing the MAD of each distribution, we have:

Example 3.3
Ages of female faculty members from three departments

Statistical Measure   Data Set A   Data Set B   Data Set C   Implication/Impression
(ages)                37           40           39
                      38           41           40
                      42           42           42
                      45           43           43
                      48           44           46
Mean                  42           42           42           Equal means
Range                 11           4            7            Distribution A is more spread. Why?
MAD                   3.6          1.2          2.0          Distribution B is least variable compared to the other two data sets. Why?

Substituting into the formula, we find the MAD of Data Set A as follows:

$$MAD = \frac{\sum |X - \bar{x}|}{N} = \frac{|37 - 42| + |38 - 42| + |42 - 42| + |45 - 42| + |48 - 42|}{5} = \frac{5 + 4 + 0 + 3 + 6}{5} = \frac{18}{5} = 3.6$$

Following the same procedure, we find the MAD of the remaining two distributions, as reflected in the table above.

It can be deduced from the table of Example 3.3 that the scores of Data Set A deviate from the mean by an average of 3.6, compared to Data Set B, where the scores deviate from the mean by an average of 1.2. This implies that Data Set B is less spread than Data Set A.
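The range and MAD values in the table of Example 3.3 can be reproduced with a few lines of Python. This is a minimal sketch for checking the hand computation; the function name `mean_absolute_deviation` is a hypothetical helper, not part of any library.

```python
def mean_absolute_deviation(values):
    """MAD = sum(|X - mean|) / N for raw (ungrouped) data."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n

# Ages of female faculty members from Example 3.3
data_sets = {
    "A": [37, 38, 42, 45, 48],
    "B": [40, 41, 42, 43, 44],
    "C": [39, 40, 42, 43, 46],
}

for name, ages in data_sets.items():
    value_range = max(ages) - min(ages)    # R = HV - LV
    mad = mean_absolute_deviation(ages)
    print(name, value_range, round(mad, 1))
# A 11 3.6
# B 4 1.2
# C 7 2.0
```

All three data sets have the same mean (42), yet the range and the MAD both show that Data Set A is the most spread out and Data Set B the least.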
The smaller the value of the MAD, the less spread out the distribution is.

For grouped data the MAD is obtained using the formula below:

$$MAD = \frac{\sum f\,|X - \bar{x}|}{N}$$

where:
X = class mark or midpoint of each class
$\bar{x}$ = mean score or mean value
f = frequency of each class
N = total number of cases

Example 3.4
Using the data in Example 1.1 we find the mean absolute deviation of grouped data.

Scores of 50 students in Statistics examination

Class Interval   f     X      |X − x̄|   f|X − x̄|
96 – 103         1     99.5   24.96      24.96
88 – 95          5     91.5   16.96      84.80
80 – 87          14    83.5   8.96       125.44
72 – 79          13    75.5   0.96       12.48
64 – 71          8     67.5   7.04       56.32
56 – 63          4     59.5   15.04      60.16
48 – 55          3     51.5   23.04      69.12
40 – 47          2     43.5   31.04      62.08
                 N = 50                  Σf|X − x̄| = 495.36

Recall: $\bar{x} = 74.54$ (from Example 2.2)

Thus,

$$MAD = \frac{\sum f\,|X - \bar{x}|}{N} = \frac{495.36}{50} = 9.9072 \approx 9.91$$

This implies that the 50 scores deviate from the mean of 74.54 by an average of 9.91 units.

Variance and Standard Deviation (Raw and Grouped Data)

The last two measures of dispersion or variation included in this module are the variance and the standard deviation. Bluman, in his book Elementary Statistics, defines the variance as the average of the squares of the distance of each score or value from the mean. The standard deviation is the square root of the variance. It looks at how spread out a group of numbers is from the mean (https://www.investopedia.com).

The population variance and standard deviation are calculated using the following respective formulas:

Variance ($\sigma^2$, read as "sigma squared"):

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N}$$

Standard Deviation ($\sigma$ = square root of the variance):

$$\sigma = \sqrt{\frac{\sum (X - \mu)^2}{N}}$$

where:
$\sigma^2$ = population variance
$\sigma$ = population standard deviation
X = the item or observation
$\mu$ = population mean
N = total number of cases

Example 3.5
The following data are the ages of 10 teachers in one elementary school: 27, 34, 30, 29, 28, 30, 34, 35, 38, 29.
Find the variance and standard deviation of this population data.

Solution: To compute the variance, we present the data as shown in the table below, where μ = 31.4 is the population mean:

Age (X)   X − μ    (X − μ)²
27        −4.4     19.36
34        2.6      6.76
30        −1.4     1.96
29        −2.4     5.76
28        −3.4     11.56
30        −1.4     1.96
34        2.6      6.76
35        3.6      12.96
38        6.6      43.56
29        −2.4     5.76
μ = 31.4           Σ(X − μ)² = 116.4

Substituting into the formula, the variance is

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N} = \frac{116.4}{10} = 11.64$$

When the variance is zero (0), all of the data values are the same; thus, there is no variation. Since the variance is an average of squared deviations, it follows that every nonzero variance is positive. A small variance indicates that the data points tend to be very close to the mean, and to each other. A high variance indicates that the data points are very spread out from the mean, and from one another (MathBits.com).

What does a population variance of 11.64 mean? Since the value of 11.64 is far from zero, the observations are spread out from one another and from the mean.

From the above value of the population variance, it follows that the population standard deviation, which is the square root of the variance, is:

$$\sigma = \sqrt{11.64} = 3.41$$

We recall that the standard deviation measures how concentrated the data are around the mean; the more concentrated, the smaller the standard deviation (https://www.dummies.com/education/math/statistics). What is the implication of the above value in relation to the mean of the given data set?

Example 3.6
Using the data set of Example 3.3, determine the variance and standard deviation of each subset of data. Compare your results. The table is reproduced below.
Ages of female faculty members from three departments

Statistical Measure   Data Set A   Data Set B   Data Set C   Implication/Impression
(ages)                37           40           39
                      38           41           40
                      42           42           42
                      45           43           43
                      48           44           46
Mean                  42           42           42           Equal means
Range                 11           4            7            Distribution A is more spread. Why?
MAD                   3.6          1.2          2.0          Distribution B is least variable compared to the other two data sets. Why?
Variance              ____         ____         ____
Standard Deviation    ____         ____         ____

Computing Sample Variance and Standard Deviation

The table below shows the different notations used for the variance and standard deviation.

Notation   Statistical Measure
σ²         Variance of a population
σ          Standard deviation of a population
s²         Variance of a sample
s          Standard deviation of a sample

If the data set is taken from a sample, the variance and standard deviation are obtained using the following computational formulas (Bluman, p.137):

Sample Variance:

$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}$$

Sample Standard Deviation (square root of the variance):

$$s = \sqrt{\frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)}}$$

where:
$s^2$ = sample variance
$s$ = sample standard deviation
X = individual observation
n = sample size

Example 3.7
Find the sample variance and standard deviation for the daily production rate of fiberglass boats of a certain manufacturer. If the company production manager feels that a standard deviation of more than three boats a day is unacceptable, should the manager be concerned about the plant production rate? Why?

17 21 18 27 17 21 20 22 18 23

Solution:

X      X²
17     289
21     441
18     324
27     729
17     289
21     441
20     400
22     484
18     324
23     529
ΣX = 204      ΣX² = 4,250

$$s^2 = \frac{n\left(\sum X^2\right) - \left(\sum X\right)^2}{n(n - 1)} = \frac{(10)(4250) - (204)^2}{10(10 - 1)} = \frac{42500 - 41616}{90} = \frac{884}{90} = 9.82$$

From the above value of the sample variance, it follows that the sample standard deviation, which is the square root of the variance, is:

$$s = \sqrt{9.82} = 3.13$$
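The computational (shortcut) formula used in Example 3.7 can be checked with a short Python sketch. The function name `sample_variance` is a hypothetical helper written for this illustration; the cross-check uses `statistics.variance` from the Python standard library, which also computes the sample variance (divisor n − 1).

```python
import math
import statistics

# Daily production of fiberglass boats from Example 3.7
boats = [17, 21, 18, 27, 17, 21, 20, 22, 18, 23]

def sample_variance(values):
    """s^2 = (n * sum(X^2) - (sum X)^2) / (n * (n - 1))."""
    n = len(values)
    sum_x = sum(values)                    # ΣX
    sum_x2 = sum(x * x for x in values)    # ΣX²
    return (n * sum_x2 - sum_x ** 2) / (n * (n - 1))

s2 = sample_variance(boats)
s = math.sqrt(s2)
print(sum(boats), sum(x * x for x in boats))  # 204 4250
print(round(s2, 2), round(s, 2))              # 9.82 3.13

# The shortcut formula agrees with the definition-based library function.
assert math.isclose(s2, statistics.variance(boats))
```

The shortcut formula avoids computing each deviation from the mean, which is why it is convenient for hand calculation; both approaches give the same variance.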
REMARK: The obtained sample standard deviation of 3.13 is slightly more than three boats a day, which the production manager considers unacceptable. Thus, the manager has reason to be concerned about the variability of the plant's daily production, although the excess is small.

Computing Sample Variance and Standard Deviation from Grouped Data

For grouped data we find the variance and standard deviation using the following computational formulas (Bluman, p.139):

Variance:

$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)}$$

Standard Deviation:

$$s = \sqrt{\frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)}}$$

where:
f = class frequency
X = class mark
n = total number of observations

Example 3.8
Using the data in Example 1.1 we find the variance and standard deviation of grouped data. The table is reproduced below:

Scores of 50 students in Statistics examination

Class Interval   f     X      fX       fX²
96 – 103         1     99.5   99.5     9900.25
88 – 95          5     91.5   457.5    41861.25
80 – 87          14    83.5   1169.0   97611.50
72 – 79          13    75.5   981.5    74103.25
64 – 71          8     67.5   540.0    36450.00
56 – 63          4     59.5   238.0    14161.00
48 – 55          3     51.5   154.5    7956.75
40 – 47          2     43.5   87.0     3784.50
                 n = 50       ΣfX = 3727   ΣfX² = 285,828.50

Substituting into the above computational or shortcut formula, we obtain the sample variance as follows:

$$s^2 = \frac{n\left(\sum fX^2\right) - \left(\sum fX\right)^2}{n(n - 1)} = \frac{(50)(285828.50) - (3727)^2}{(50)(50 - 1)} = \frac{14291425 - 13890529}{(50)(49)} = \frac{400896}{2450} = 163.63$$

With the above sample variance of 163.63, it follows that the sample standard deviation (s), which is the square root of the variance, is 12.79. This implies that the scores of the 50 students deviate from the mean, on the average, by a distance of 12.79 units.

There is another method of computing the sample variance and sample standard deviation, which uses the Coded Deviation. However, its discussion is not included in this module.

Exercises 3.1

A. Using Exercise 13.2 on page 823 of the book Mathematical Excursions by Aufmann, answer numbers 4 to 8 and 12.

B. Using the same exercise, answer number 20 on page 824 on the ages of the female and male actors who received Academy Awards. Answer questions a, b, and c found at the end of the exercise.

C. Critical Thinking
Using exercise no. 26 on page 825, perform the suggested activity and answer the question found at the end.
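As a check on the grouped-data computations in this lesson (Examples 3.4 and 3.8), the following Python sketch recomputes the mean, MAD, variance, and standard deviation from the class marks and frequencies. It is only an illustration under the same assumptions as the examples; the function name `grouped_statistics` is hypothetical.

```python
import math

# (class mark X, frequency f) for the scores of 50 students (Example 3.8)
table = [(99.5, 1), (91.5, 5), (83.5, 14), (75.5, 13),
         (67.5, 8), (59.5, 4), (51.5, 3), (43.5, 2)]

def grouped_statistics(table):
    """Return (mean, MAD, sample variance, sample standard deviation)."""
    n = sum(f for _, f in table)
    sum_fx = sum(f * x for x, f in table)        # ΣfX
    sum_fx2 = sum(f * x * x for x, f in table)   # ΣfX²
    mean = sum_fx / n
    mad = sum(f * abs(x - mean) for x, f in table) / n
    variance = (n * sum_fx2 - sum_fx ** 2) / (n * (n - 1))
    return mean, mad, variance, math.sqrt(variance)

mean, mad, variance, std_dev = grouped_statistics(table)
print(round(mean, 2), round(mad, 2))          # 74.54 9.91
print(round(variance, 2), round(std_dev, 2))  # 163.63 12.79
```

The printed values match the results obtained by hand: a mean of 74.54, a MAD of 9.91, a sample variance of 163.63, and a sample standard deviation of 12.79.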