Descriptive Statistics Module

BULACAN STATE UNIVERSITY
COLLEGE OF ENGINEERING
CIVIL ENGINEERING DEPARTMENT

ENGINEERING DATA ANALYSIS
MODULE 3
COMPILED BY: MERRICRIS U. PANGILINAN & JENNIE C. ROQUE

MODULE 3
DESCRIPTIVE STATISTICS

1.1 DURATION FOR CHAPTER 3: 6 HRS

1.2 STUDENTS’ SKILLS ACQUISITION:
At the end of this lesson, you will be able to:
1. Calculate the mean, median, and mode for a set of data, and compare these measures of center.
2. Identify the symbols and know the formulas for sample and population means.
3. Describe the skewness and the peakedness of the graph.
4. Calculate the standard deviation for grouped and ungrouped data.
5. Calculate the weighted mean, percentiles, and quartiles for a data set.

1.3 WHAT YOU KNOW SO FAR?
The following questions assess what you already know about the subject. Answer as many as you can. A link to the Google Form will be provided.

1.4. INTRODUCTION
Once data are collected, it is useful to summarize the data set by identifying a value around which the data are centered. Three commonly used measures of center are the mode, the median, and the mean.

1.5 LESSON PROPER

1.5.1. MEASURES OF THE CENTER OF THE DATA

1.5.1.A. Measures of Center for Ungrouped Data
The "center" of a data set is also a way of describing location. The two most widely used measures of the "center" of the data are the mean (average) and the median. To calculate the mean weight of 50 people, add the 50 weights together and divide by 50. To find the median weight of the 50 people, order the data and find the number that splits the data into two equal parts. The median is generally a better measure of the center when there are extreme values or outliers because it is not affected by the precise numerical values of the outliers. The mean is the most common measure of the center.
When each value in the data set is not unique, the mean can be calculated by multiplying each distinct value by its frequency, summing the products, and dividing the sum by the total number of data values. The letter used to represent the sample mean is $\bar{x}$ (pronounced "x bar"). The Greek letter $\mu$ (pronounced "mew") represents the population mean. One of the requirements for the sample mean to be a good estimate of the population mean is for the sample taken to be truly random.

To see that both ways of calculating the mean are the same, consider the sample:
1; 1; 1; 2; 2; 3; 4; 4; 4; 4; 4

$$\bar{x} = \frac{1+1+1+2+2+3+4+4+4+4+4}{11} = 2.7$$

Alternatively, you may consider the number of occurrences of each value:

$$\bar{x} = \frac{1(3) + 2(2) + 3(1) + 4(5)}{11} = 2.7$$

In the second calculation, the frequencies are in parentheses and are used in computing the mean.

You can quickly find the location of the median by using the expression

$$\frac{n+1}{2}$$

The letter $n$ is the total number of data values in the sample. If $n$ is an odd number, the median is the middle value of the ordered data (ordered smallest to largest). If $n$ is an even number, the median is equal to the two middle values added together and divided by two after the data have been ordered. For example, if the total number of data values is 97, then

$$\frac{n+1}{2} = \frac{97+1}{2} = 49$$

The median is the 49th value in the ordered data. If the total number of data values is 100, then

$$\frac{n+1}{2} = \frac{100+1}{2} = 50.5$$

The median occurs midway between the 50th and 51st values. The location of the median and the value of the median are NOT the same.

Examples:
1. Score data for the first quiz in a 50-item Engineering Data Analysis quiz are as follows (smallest to largest):
3; 4; 8; 8; 10; 11; 12; 13; 14; 15; 15; 16; 16; 17; 17; 18; 21; 22; 22; 24; 24; 25; 26; 26; 27; 27; 29; 29; 31; 32; 33; 33; 34; 34; 35; 37; 40; 44; 44; 47
Calculate the mean and the median.

Solution
The calculation for the mean is

$$\bar{x} = \frac{3 + 4 + 8(2) + 10 + 11 + 12 + 13 + 14 + 15(2) + 16(2) + 17(2) + \cdots + 35 + 37 + 40 + 44(2) + 47}{40} = 23.6$$

To find the median $\tilde{x}$, first find its location:

$$\frac{n+1}{2} = \frac{40+1}{2} = 20.5$$

Starting from the smallest value, locate the values in the 20th and 21st positions (the two 24s):

$$\tilde{x} = \frac{24 + 24}{2} = 24$$

2. The following data show the number of months graduates typically wait before getting hired. The data are ordered from smallest to largest. Calculate the mean and median.
3; 4; 5; 7; 7; 7; 7; 8; 8; 9; 9; 10; 10; 10; 10; 10; 11; 12; 12; 13; 14; 14; 15; 15; 17; 17; 18; 19; 19; 19; 21; 21; 22; 22; 23; 24; 24; 24; 24

Solution
Mean

$$\bar{x} = \frac{3 + 4 + 5 + 7(4) + 8(2) + 9(2) + 10(5) + 11 + 12(2) + 13 + 14(2) + 15(2) + 17(2) + 18 + 19(3) + 21(2) + 22(2) + 23 + 24(4)}{39} = 13.95$$

Median $\tilde{x}$
Starting at the smallest value, locate the $\frac{39+1}{2}$ = 20th term. The 20th term is 13, so the median is 13.

Mean vs. Median
Both the mean and the median are important and widely used measures of center. Consider the following example: Suppose you got an 85 and a 93 on your first two statistics quizzes, but then you had a really bad day and got a 14 on your next quiz! The mean of your three grades would be 64. Which is a better measure of your performance? As you can see, the middle number in the set is an 85. That middle does not change if the lowest grade is an 84, or if the lowest grade is a 14. However, when you add the three numbers to find the mean, the sum will be much smaller if the lowest grade is a 14.

Outliers and Resistance
The mean and the median are so different in the previous example because there is one grade that is extremely different from the rest of the data. In statistics, we call such extreme values outliers. The mean is affected by the presence of an outlier; however, the median is not. A statistic that is not affected by outliers is called resistant. We say that the median is a resistant measure of center, and the mean is not resistant.
In a sense, the median can resist the pull of a faraway value, but the mean is drawn to such values. It cannot resist the influence of outlier values. As a result, when we have a data set that contains an outlier, it is often better to use the median to describe the center, rather than the mean.

Another measure of the center is the mode. The mode $\hat{x}$ is the most frequent value. There can be more than one mode in a data set if those values have the same frequency and that frequency is the highest. A data set with two modes is called bimodal. The mode can be calculated for qualitative data as well as for quantitative data. For example, if the data set is: red, red, red, green, green, yellow, purple, black, blue, the mode is red.

Examples:
1. Statistics exam scores for 20 students are as follows:
50; 53; 59; 59; 63; 63; 72; 72; 72; 72; 72; 76; 78; 81; 83; 84; 84; 84; 90; 93
Find the mode.

Solution
The most frequent score is 72, which occurs five times. $\hat{x} = 72$.

2. The numbers of books checked out from the library by 25 students are as follows:
0; 0; 0; 1; 2; 3; 3; 4; 4; 5; 5; 7; 7; 7; 7; 8; 8; 8; 9; 10; 10; 11; 11; 12; 12
Find the mode.

Solution
The most frequent number of books is 7, which occurs four times. $\hat{x} = 7$.

3. Five real estate exam scores are 430, 430, 480, 480, 495. The data set is bimodal because the scores 430 and 480 each occur twice.

When is the mode the best measure of the "center"? Consider a weight loss program that advertises a mean weight loss of six pounds the first week of the program. The mode might indicate that most people lose two pounds the first week, making the program less appealing.

Exercises:
The students in a statistics class were asked to report the number of children that live in their house. The data are recorded below:
1, 3, 4, 3, 1, 2, 2, 2, 1, 2, 2, 3, 4, 5, 1, 2, 3, 2, 1, 2, 3, 6
Find the mean, median and mode.
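The worked examples for ungrouped data in this subsection can be checked with a short Python sketch. It assumes Python 3.8 or later (for `statistics.multimode`); the helper `median_location` and the variable names are illustrative and not part of the module.

```python
from statistics import mean, median, multimode

# Quiz scores from Example 1 of the ungrouped-data examples (ordered, n = 40).
scores = [3, 4, 8, 8, 10, 11, 12, 13, 14, 15, 15, 16, 16, 17, 17, 18,
          21, 22, 22, 24, 24, 25, 26, 26, 27, 27, 29, 29, 31, 32,
          33, 33, 34, 34, 35, 37, 40, 44, 44, 47]


def median_location(n):
    """1-based position of the median in ordered data: (n + 1) / 2."""
    return (n + 1) / 2


quiz_mean = mean(scores)        # 943 / 40, about 23.6
quiz_median = median(scores)    # average of the 20th and 21st values

# Resistance: one very low grade pulls the mean down but not the median.
grades = [85, 93, 14]
grades_mean = mean(grades)      # 64
grades_median = median(grades)  # 85

# Mode examples: a unimodal and a bimodal data set.
exam_scores = [50, 53, 59, 59, 63, 63, 72, 72, 72, 72, 72, 76, 78, 81,
               83, 84, 84, 84, 90, 93]
real_estate = [430, 430, 480, 480, 495]

print(median_location(len(scores)), round(quiz_mean, 1), quiz_median)
print(grades_mean, grades_median)
print(multimode(exam_scores), multimode(real_estate))
```

`multimode` returns every value that shares the highest frequency, so a bimodal data set yields a list with two entries.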
The Law of Large Numbers and the Mean
The Law of Large Numbers says that if you take samples of larger and larger size from any population, then the mean $\bar{x}$ of the sample is very likely to get closer and closer to the population mean $\mu$.

1.5.1.B. Measures of Center for Grouped Data

Constructing Frequency Distribution:
STEPS IN CONSTRUCTING A FREQUENCY DISTRIBUTION:
1. Determine the largest and smallest value in the data.
2. Determine the number of class intervals (k) desired.

Table 3.1. Recommended k values from Juran and Gryna:
Number of observations, n    Recommended no. of classes (k)
20 - 50                      5 or 6
51 - 100                     7
101 - 200                    8
201 - 500                    9
501 - 1000                   10
over 1000                    11 - 20

Sturges offers a mathematical formula:
$$k = 1 + 3.322 \log n$$
Or, as mentioned before, you may use
$$k = \sqrt{n}$$
and then round to the nearest whole number, if necessary.

3. Determine the approximate class size (c) [class size is also known as bin size or class width]:
$$c = \frac{x_{max} - x_{min}}{k}$$
4. Determine the lower and upper limits of the class interval.
5. Write down the class intervals starting with the decided lower and upper class limit of the first class interval. Add the class size to the lower and upper class limits to obtain the next class interval and so on.
6. Determine the number of observations falling under each class interval, that is, find the class frequency.

Example:
A random sample of 30 capacitors was taken from the ECE laboratory and measured. The following data represent the values of the capacitances in $\mu F$.

65.6  83.4  73.3  76.2  97.6  72.1
35.6  63.6  33.2  28.6  41.0  65.4
52.5  56.4  10.3  36.0  64.5  78.5
74.7  52.5  74.7  64.7  83.4  80.1
73.0  49.2  52.7  45.8  45.9  50.2

Summarize the data above using a frequency distribution.

Solution:
A. For frequency distribution
Step 1. Determine the largest and smallest value in the data. Here $x_{max} = 97.6$ and $x_{min} = 10.3$.
Step 2. Determine the number of class intervals (k) desired.
Since there are 30 observations (n = 30), let us use k = 5 from Juran's recommendation.

Step 3. Determine the approximate class size (c) [class size is also known as bin size or class width].
$$c = \frac{x_{max} - x_{min}}{k} = \frac{97.6 - 10.3}{5} = 17.46 \approx 17.5$$
Since the data have one decimal place, we use c with one decimal place.

Steps 4 and 5. Determine the lower and upper limits of the class interval. Write down the class intervals starting with the decided lower and upper class limit of the first class interval. Add the class size of 17.5 to the lower and upper class limits to obtain the next class interval and so on.
Since the smallest value is 10.3, let us use 10.3 as the lower limit. Adding the class size of 17.5 to 10.3 gives 27.8, which is the lower class limit of the next interval (see Table 3.2). The upper limit of the first row must therefore be 27.7, so that every value falls in exactly one interval.

Table 3.2.
Lower limit   Upper limit
10.3          27.7
27.8          45.2
45.3          62.7
62.8          80.2
80.3          97.7

Step 6. Determine the number of observations falling under each class interval, that is, find the class frequency (see Table 3.3).

Table 3.3.
Lower limit   Upper limit   Frequency, f
10.3          27.7          1
27.8          45.2          5
45.3          62.7          8
62.8          80.2          13
80.3          97.7          3
                            n = 30

Class Boundaries
Class boundaries are the true class limits. They are values one half measurement unit more accurate than the observed values. This is necessary so that NO values can be observed exactly on a boundary.
$$UCB_i = \frac{UL_i + LL_{i+1}}{2}$$
where $i$ is the class or row.

For example, using the previous problem, let us determine the starting upper class boundary (see Table 3.4):
$$UCB_1 = \frac{27.7 + 27.8}{2} = 27.75$$

Table 3.4.

Alternatively, we may use what was discussed in the previous chapter. Since 10.3 is the smallest value and the data contain one decimal place, the lower class boundary is 10.25 (10.3 – 0.05). To get the remaining boundaries, simply add the class size, as seen in Table 3.5.

Table 3.5.
Lower CB   Upper CB
10.25      27.75
27.75      45.25
45.25      62.75
62.75      80.25
80.25      97.75

Note that the upper class boundary of a certain row is the same as the lower class boundary of the next row.
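The six construction steps and the class-boundary rule above can be sketched in Python for the capacitor data. This is a minimal illustration, not the module's own procedure: the function name `frequency_table`, its `decimals` parameter, and the choice to round the class size to the data's precision are assumptions made for the sketch.

```python
import math

# Capacitances in microfarads from the example (n = 30).
capacitances = [
    65.6, 83.4, 73.3, 76.2, 97.6, 72.1,
    35.6, 63.6, 33.2, 28.6, 41.0, 65.4,
    52.5, 56.4, 10.3, 36.0, 64.5, 78.5,
    74.7, 52.5, 74.7, 64.7, 83.4, 80.1,
    73.0, 49.2, 52.7, 45.8, 45.9, 50.2,
]


def frequency_table(data, k, decimals=1):
    """Return (class_size, rows); each row is
    (lower_limit, upper_limit, lower_cb, upper_cb, frequency)."""
    unit = 10 ** -decimals                  # measurement unit, 0.1 here
    x_min, x_max = min(data), max(data)     # Step 1
    # Step 3: class size rounded to the precision of the data (assumption).
    class_size = round((x_max - x_min) / k, decimals)
    rows = []
    for i in range(k):
        # Steps 4 and 5: limits of the i-th class interval.
        lower = round(x_min + i * class_size, decimals)
        upper = round(lower + class_size - unit, decimals)
        # Step 6: class frequency.
        freq = sum(1 for x in data if lower <= x <= upper)
        # Boundaries sit half a measurement unit outside the limits.
        rows.append((lower, upper,
                     round(lower - unit / 2, decimals + 1),
                     round(upper + unit / 2, decimals + 1),
                     freq))
    return class_size, rows


k_sqrt = round(math.sqrt(len(capacitances)))   # square-root rule for k
class_size, table = frequency_table(capacitances, k=5)
for row in table:
    print(row)
```

With k = 5 the sketch yields a class size of 17.5 and the frequencies 1, 5, 8, 13, 3. For other data sets the rounded class size may need to be increased by hand so that the last interval still contains the largest value.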
Mean, Median and Mode
When only grouped data are available, you do not know the individual data values (we only know intervals and interval frequencies); therefore, you cannot compute an exact mean, median and mode for the data set. What we must do is estimate the actual central tendencies using a frequency table as shown above. A frequency table is a data representation in which grouped data are displayed along with the corresponding frequencies. We simply need to modify the definitions to fit within the restrictions of a frequency table.

Since we do not know the individual data values, we can instead find the midpoint of each interval. The midpoint ($x_i$) is
$$x_i = \frac{\text{lower limit} + \text{upper limit}}{2} = \frac{\text{lower boundary} + \text{upper boundary}}{2}$$

We can now modify the mean definition to be
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where
$f_i$ = class frequency
$x_i$ = class mark or midpoint

the median definition to be
$$\tilde{x} = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{med}}\right]c$$
Where:
$L_1$ = LCB of the median class (the class to which the $\frac{n}{2}$th item belongs)
$n$ = total frequency
$f_{med}$ = median class frequency
$(\sum f)_1$ = sum of the frequencies of all classes lower than the median class
$c$ = median class size

and the mode definition to be
$$\hat{x} = L_1 + \left[\frac{\Delta_1}{\Delta_1 + \Delta_2}\right]c$$
Where:
$L_1$ = LCB of the modal class (class with the highest frequency)
$\Delta_1$ = excess of the modal class frequency over the frequency of the next lower class
$\Delta_2$ = excess of the modal class frequency over the frequency of the next higher class
$c$ = modal class size

Example. Use the same problem of a random sample of 30 capacitors that were taken from the ECE laboratory and measured. The following data represent the values of the capacitances in $\mu F$.

65.6  83.4  73.3  76.2  97.6  72.1
35.6  63.6  33.2  28.6  41.0  65.4
52.5  56.4  10.3  36.0  64.5  78.5
74.7  52.5  74.7  64.7  83.4  80.1
73.0  49.2  52.7  45.8  45.9  50.2

Determine the mean, median and mode.

Solution:
The frequency table, class boundaries and midpoints ($x_i$) are computed as discussed previously.
Class interval              Class boundary         Frequency, f   xi
Lower limit   Upper limit   Lower CB   Upper CB
10.3          27.7          10.25      27.75       1              19.0
27.8          45.2          27.75      45.25       5              36.5
45.3          62.7          45.25      62.75       8              54.0
62.8          80.2          62.75      80.25       13             71.5
80.3          97.7          80.25      97.75       3              89.0
                                                   n = 30

A. Mean
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{(1)19 + (5)36.5 + (8)54 + (13)71.5 + (3)89}{30} = 61$$

B. Median
$$\tilde{x} = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{med}}\right]c$$
First locate the $\frac{n}{2} = \frac{30}{2} = 15$th term. The cumulative frequency through the fourth class is 1 + 5 + 8 + 13 = 27, so the 15th term lies in the class 62.8 - 80.2.
$L_1$ = 62.75
$\frac{n}{2}$ = 15
$f_{med}$ = 13
$(\sum f)_1$ = 8 + 5 + 1 = 14
$c$ = 17.5 (as computed in the previous problem)
$$\tilde{x} = 62.75 + \left[\frac{15 - 14}{13}\right]17.5 = 64.10$$

C. Mode
$$\hat{x} = L_1 + \left[\frac{\Delta_1}{\Delta_1 + \Delta_2}\right]c$$
$L_1$ = 62.75
$\Delta_1$ = 13 – 8 = 5
$\Delta_2$ = 13 – 3 = 10
$c$ = 17.5
$$\hat{x} = 62.75 + \left[\frac{5}{5 + 10}\right]17.5 = 68.58$$

1.5.2. SKEWNESS AND THE MEAN, MEDIAN AND MODE
The data in Figure 3.1 can be presented using a histogram. The histogram displays a symmetrical distribution of data. A distribution is symmetrical if a vertical line can be drawn at some point in the histogram such that the shape to the left and the right of the vertical line are mirror images of each other. The mean, the median, and the mode are each 8 for these data. In a perfectly symmetrical distribution, the mean and the median are the same. This example has one mode (unimodal), and the mode is the same as the mean and median. In a symmetrical distribution that has two modes (bimodal), the two modes would be different from the mean and median.

x   5   6   7   8   9   10   11
f   1   1   2   3   2   1    1
Figure 3.1

The histogram shown in Figure 3.2 for the data is not symmetrical. The right-hand side seems "chopped off" compared to the left side.
A distribution of this type is called skewed to the left because it is pulled out to the left. The mean is 8.46, the median is 9, and the mode is 10. Notice that the mean is less than the median, and they are both less than the mode. The mean and the median both reflect the skewing, but the mean reflects it more so.

x   5   6   7   8   9   10   11
f   1   1   2   2   2   4    1
Figure 3.2

The histogram shown in Figure 3.3 for the data is also not symmetrical. It is skewed to the right. The mean is 7.53, the median is 7, and the mode is 6. Of the three statistics, the mean is the largest, while the mode is the smallest. Again, the mean reflects the skewing the most.

x   5   6   7   8   9   10   11
f   1   4   3   3   2   1    1
Figure 3.3

Formula for measurement of skewness:
The third central moment about the mean determines the symmetry of the distribution.
$$a_3 = \frac{\sum f(x - \bar{x})^3}{(n-1)s^3}$$

Exercises:
Discuss the mean, median, and mode for each of the following problems. Is there a pattern between the shape and measure of the center?

1.5.3. MEASURES OF KURTOSIS
Kurtosis is the degree of peakedness of a unimodal distribution. Peakedness is a comparative measure of the height of the peak of a frequency distribution, usually taken relative to a normal distribution (a symmetric distribution).
Mesokurtic (normal distribution): distribution is neither very peaked nor flat-topped.
Leptokurtic: distribution has a relatively high peak.
Platykurtic: distribution is flat-topped.

Moment coefficient of kurtosis:
$$a_4 = \frac{n_4}{s^4}$$
where $n_4$ is the fourth moment about the mean and is equal to
$$n_4 = \frac{\sum f(x - \bar{x})^4}{n-1}$$
If
$a_4 = 3$, the distribution is mesokurtic (normal peakedness)
$a_4 < 3$, the distribution is platykurtic (low peakedness)
$a_4 > 3$, the distribution is leptokurtic (high peakedness)

1.5.4 MEASURES OF THE SPREAD OF DATA
An important characteristic of any set of data is the variation in the data.
In some data sets, the data values are concentrated closely near the mean; in other data sets, the data values are more widely spread out from the mean. The most common measure of variation, or spread, is the standard deviation. The standard deviation is a number that measures how far data values are from their mean.

• The standard deviation provides a measure of the overall variation in a data set.
The standard deviation is always positive or zero. The standard deviation is small when the data are all concentrated close to the mean, exhibiting little variation or spread. The standard deviation is larger when the data values are more spread out from the mean, exhibiting more variation.

• The standard deviation can be used to determine whether a data value is close to or far from the mean.

• Calculating the Standard Deviation
If $x$ is a number, then the difference "$x$ − mean" is called its deviation. In a data set, there are as many deviations as there are items in the data set. The deviations are used to calculate the standard deviation. If the numbers belong to a population, in symbols a deviation is $x - \mu$. For sample data, in symbols a deviation is $x - \bar{x}$.

The procedure to calculate the standard deviation depends on whether the numbers are the entire population or are data from a sample. The calculations are similar, but not identical. Therefore, the symbol used to represent the standard deviation depends on whether it is calculated from a population or a sample. The lowercase letter $s$ represents the sample standard deviation and the Greek letter $\sigma$ (sigma, lowercase) represents the population standard deviation. If the sample has the same characteristics as the population, then $s$ should be a good estimate of $\sigma$.

To calculate the standard deviation, we need to calculate the variance first. The variance is the average of the squares of the deviations ($x - \bar{x}$ for a sample or $x - \mu$ for a population).
The symbol $\sigma^2$ represents the population variance; the population standard deviation $\sigma$ is the square root of the population variance. The symbol $s^2$ represents the sample variance; the sample standard deviation $s$ is the square root of the sample variance. You can think of the standard deviation as a special average of the deviations.

• Variance is defined as the square of the standard deviation.
$v = \sigma^2$   population variance
$v = s^2$   sample variance

• Standard deviation of the sample is defined as follows.
$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$$
For grouped data:
$$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}}$$
Where:
$s$ = standard deviation of the sample
$x_i$ = class mark or midpoint
$f_i$ = frequency
$\bar{x}$ = sample mean
$n$ = sample size

• Standard deviation of the population is defined as follows.
$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$
For grouped data:
$$\sigma = \sqrt{\frac{\sum f_i (x_i - \mu)^2}{N}}$$
Where:
$\sigma$ = standard deviation of the population
$x_i$ = class mark or midpoint
$f_i$ = frequency
$\mu$ = population mean
$N$ = population size

• NOTE: $n - 1$ is used as the denominator so as to obtain an $s^2$ that is an unbiased estimate of the population variance $\sigma^2$. Dividing by $n$ instead would give a biased estimate, although the bias becomes smaller as $n$ increases.

If the numbers come from a census of the entire population and not a sample, when we calculate the average of the squared deviations to find the variance, we divide by $N$, the number of items in the population. If the data are from a sample rather than a population, when we calculate the average of the squared deviations, we divide by $n - 1$, one less than the number of items in the sample.

Examples:
1. Consider the previous problem on capacitors in the ECE Lab with the computed tabular data as follows. Determine the standard deviation of the sample.

Solution:
Determine the $x_i$ (midpoint) of each row as shown in the table, and $\bar{x}$ (mean) using the formula $\frac{\sum f_i x_i}{\sum f_i}$. In this problem, $\bar{x} = 61$.

Class interval              Class boundary         Frequency, f   xi
Lower limit   Upper limit   Lower CB   Upper CB
10.3          27.7          10.25      27.75       1              19.0
27.8          45.2          27.75      45.25       5              36.5
45.3          62.7          45.25      62.75       8              54.0
62.8          80.2          62.75      80.25       13             71.5
80.3          97.7          80.25      97.75       3              89.0
                                                   ∑ = 30

Determine the values for the column of $(x_i - \bar{x})^2$ for each row, as well as the column for $f(x_i - \bar{x})^2$.

Class interval   f     xi     xi − x̄    (xi − x̄)²   f(xi − x̄)²
10.3 - 27.7      1     19.0   -42.0     1764.00     1764.00
27.8 - 45.2      5     36.5   -24.5     600.25      3001.25
45.3 - 62.7      8     54.0   -7.0      49.00       392.00
62.8 - 80.2      13    71.5   10.5      110.25      1433.25
80.3 - 97.7      3     89.0   28.0      784.00      2352.00
∑                30                                 8942.50

Using the formula for the standard deviation of a sample:
$$s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{8942.5}{30-1}} = 17.5602$$

Explanation of the standard deviation calculation shown in the table
The deviations show how spread out the data are about the mean. The midpoint 89 is farther from the mean than the midpoint 71.5, which is indicated by the deviations 28 and 10.5. A positive deviation occurs when the data value is greater than the mean, whereas a negative deviation occurs when the data value is less than the mean. The deviation is –42 for the midpoint 19.

The standard deviation measures the spread in the same units as the data.

The standard deviation, $s$ or $\sigma$, is either zero or larger than zero. When the standard deviation is zero, there is no spread; that is, all the data values are equal to each other. The standard deviation is small when the data are all concentrated close to the mean, and is larger when the data values show more variation from the mean. When the standard deviation is a lot larger than zero, the data values are very spread out about the mean; outliers can make the standard deviation very large.

The standard deviation, when first presented, can seem unclear. By graphing your data, you can get a better "feel" for the deviations and the standard deviation.
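The tabular calculation of the sample standard deviation above can be reproduced with a few lines of Python. This is a sketch under the module's grouped-data formulas; the function name `grouped_std` and its `sample` flag are illustrative choices, not part of the module.

```python
import math

# Class midpoints and frequencies from the capacitor frequency table.
midpoints = [19.0, 36.5, 54.0, 71.5, 89.0]
freqs = [1, 5, 8, 13, 3]


def grouped_std(midpoints, freqs, sample=True):
    """Return (mean, standard deviation) for grouped data.

    With sample=True the sum of squares is divided by n - 1,
    otherwise by n (the population formula).
    """
    n = sum(freqs)
    mean = sum(f * x for f, x in zip(freqs, midpoints)) / n
    # Sum of f * (x - mean)^2, the last column of the table.
    sum_sq = sum(f * (x - mean) ** 2 for f, x in zip(freqs, midpoints))
    divisor = n - 1 if sample else n
    return mean, math.sqrt(sum_sq / divisor)


x_bar, s = grouped_std(midpoints, freqs)
print(x_bar, round(s, 4))
```

Calling `grouped_std(midpoints, freqs, sample=False)` applies the population formula to the same table, which gives a slightly smaller value because the divisor is 30 rather than 29.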
You will find that in symmetrical distributions, the standard deviation can be very helpful, but in skewed distributions, the standard deviation may not be much help. The reason is that the two sides of a skewed distribution have different spreads. In a skewed distribution, it is better to look at the first quartile, the median, the third quartile, the smallest value, and the largest value. Because numbers can be confusing, always graph your data. Display your data in a histogram or a box plot.

Comparing Values from Different Data Sets
The standard deviation is useful when comparing data values ($x$) that come from different data sets. If the data sets have different means and standard deviations, comparing the data values directly can be misleading.

Sample:       $x = \bar{x} + zs$,      $z = \frac{x - \bar{x}}{s}$
Population:   $x = \mu + z\sigma$,     $z = \frac{x - \mu}{\sigma}$

The $z$ values are the z-scores; each one gives the number of standard deviations a value lies from the mean.

Example:
Two students, Henry and Josh, from different high schools, wanted to find out who had the highest GPA when compared to his school. Which student had the highest GPA when compared to his school?

Student   GWA    School Mean GWA   School Standard Deviation
Henry     2.85   3                 0.7
Josh      77     80                10

Solution
For each student, determine how many standard deviations his GWA is away from the average for his school (be careful with the sign).
$$z = \frac{x - \bar{x}}{s}$$
For Henry:
$$z = \frac{x - \bar{x}}{s} = \frac{2.85 - 3}{0.7} = -0.21$$
For Josh:
$$z = \frac{x - \bar{x}}{s} = \frac{77 - 80}{10} = -0.3$$
Henry has the better GPA when compared to his school because his GPA is 0.21 standard deviations below his school's mean while Josh's GPA is 0.3 standard deviations below his school's mean.
Henry's z-score of –0.21 is higher than Josh's z-score of –0.3. For GPA, higher values are better, so we conclude that Henry has the better GPA when compared to his school.

Two swimmers, Eileen and Janeth, from different teams, wanted to find out who had the fastest time for the 50-meter freestyle when compared to her team. Which swimmer had the fastest time when compared to her team?
Swimmer   Time (s)   Team Mean Time   Team Standard Deviation
Eileen    26.2       27.2             0.8
Janeth    27.3       30.1             1.4

For Eileen:
$$z = \frac{x - \bar{x}}{s} = \frac{26.2 - 27.2}{0.8} = -1.25$$
For Janeth:
$$z = \frac{x - \bar{x}}{s} = \frac{27.3 - 30.1}{1.4} = -2.0$$
For swim times, lower values are better. Janeth's time is 2.0 standard deviations below her team's mean, while Eileen's is 1.25 standard deviations below hers, so Janeth had the fastest time when compared to her team.

The following lists give a few facts that provide a little more insight into what the standard deviation tells us about the distribution of the data.

• For ANY data set, no matter what the distribution of the data is:
At least 75% of the data is within two standard deviations of the mean.
At least 89% of the data is within three standard deviations of the mean.
At least 95% of the data is within 4.5 standard deviations of the mean.
This is known as Chebyshev's Rule (see Figure 3.5).

• For data having a distribution that is BELL-SHAPED and SYMMETRIC:
Approximately 68% of the data is within one standard deviation of the mean.
Approximately 95% of the data is within two standard deviations of the mean.
More than 99% of the data is within three standard deviations of the mean.
This is known as the Empirical Rule (see Figure 3.4).
It is important to note that this rule only applies when the shape of the distribution of the data is bell-shaped and symmetric.

1.5.5. MEASURES OF THE LOCATION OF DATA
The common measures of location are quartiles and percentiles; however, we also have deciles.

Quartiles are special percentiles. The first quartile, $Q_1$, is the same as the 25th percentile, and the third quartile, $Q_3$, is the same as the 75th percentile. The median, $\tilde{x}$, is the second quartile, the 50th percentile, and the 5th decile.

To calculate quartiles, deciles and percentiles for ungrouped data, the data must be ordered from smallest to largest. Quartiles divide ordered data into quarters. Deciles divide ordered data into tenths. Percentiles divide ordered data into hundredths. To score in the 90th percentile of an exam does not mean, necessarily, that you received 90% on a test.
It means that 90% of test scores are the same or less than your score and 10% of the test scores are the same or greater than your test score.

Percentiles are mostly used with very large populations. Therefore, if you were to say that 90% of the test scores are less (and not the same or less) than your score, it would be acceptable because removing one particular data value is not significant.

To determine the percentiles, quartiles and deciles for grouped data, we use the formula for the median of grouped data with one change: the value $\frac{n}{2}$ is replaced. For quartiles it is replaced with $\frac{1}{4}n$ to $\frac{3}{4}n$, for deciles with $\frac{1}{10}n$ to $\frac{9}{10}n$, and for percentiles with $\frac{1}{100}n$ to $\frac{99}{100}n$.
$$x = L_1 + \left[\frac{\frac{n}{2} - (\sum f)_1}{f_{obs}}\right]c$$
Where:
$L_1$ = LCB of the observed class (the class to which the observed item belongs)
$n$ = total frequency
$f_{obs}$ = observed class frequency
$(\sum f)_1$ = sum of the frequencies of all classes lower than the observed class
$c$ = class size of the observed class

Example:
Consider the previous problem on capacitors in the ECE Lab with the computed tabular data as follows.

Class interval              Class boundary         Frequency, f   xi
Lower limit   Upper limit   Lower CB   Upper CB
10.3          27.7          10.25      27.75       1              19.0
27.8          45.2          27.75      45.25       5              36.5
45.3          62.7          45.25      62.75       8              54.0
62.8          80.2          62.75      80.25       13             71.5
80.3          97.7          80.25      97.75       3              89.0
                                                   30

Determine $Q_1$, $P_{40}$ and $D_8$.

Solution:
For $Q_1$: Locate the $\frac{1}{4}n$th term of the data (this replaces $\frac{n}{2}$ in the formula): $\frac{1}{4}n = \frac{1}{4}(30) = 7.5 \approx$ 8th term. The cumulative frequency through the third class is 1 + 5 + 8 = 14, so the 8th term is in the class 45.3 - 62.7.
$$x_{Q_1} = L_1 + \left[\frac{\frac{1}{4}n - (\sum f)_1}{f_{obs}}\right]c = 45.25 + \left[\frac{8 - (5 + 1)}{8}\right]17.5 = 49.625$$

For $P_{40}$: Locate the $\frac{40}{100}n$th term of the data (this replaces $\frac{n}{2}$ in the formula): $\frac{40}{100}n = \frac{40}{100}(30) =$ 12th term. The cumulative frequency through the third class is 1 + 5 + 8 = 14, so the 12th term is in the class 45.3 - 62.7.
$$x_{P_{40}} = L_1 + \left[\frac{\frac{40}{100}n - (\sum f)_1}{f_{obs}}\right]c = 45.25 + \left[\frac{12 - (5 + 1)}{8}\right]17.5 = 58.375$$

For $D_8$: Locate the $\frac{8}{10}n$th term of the data (this replaces $\frac{n}{2}$ in the formula): $\frac{8}{10}n = \frac{8}{10}(30) =$ 24th term. The cumulative frequency through the fourth class is 1 + 5 + 8 + 13 = 27, so the 24th term is in the class 62.8 - 80.2.
$$x_{D_8} = L_1 + \left[\frac{\frac{8}{10}n - (\sum f)_1}{f_{obs}}\right]c = 62.75 + \left[\frac{24 - (8 + 5 + 1)}{13}\right]17.5 = 76.211$$

1.6. CLASS ASSIGNMENT:
1. Suppose that in a small town of 50 people, one person earns $5,000,000 per year and the other 49 each earn $30,000. Which is the better measure of the "center": the mean or the median?
2. Enrique has a 91, 87, and 95 for his statistics grades for the first three quarters. His mean grade for the year must be a 93 for him to be exempted from taking the final exam. Assuming grades are rounded following valid mathematical procedures, what is the lowest whole number grade he can get for the 4th quarter and still be exempt from taking the exam?

1.7. SUMMARY
The mean and the median can be calculated to help you find the "center" of a data set. The mean is the best estimate for the actual data set, but the median is the best measurement when a data set contains several outliers or extreme values. The mode will tell you the most frequently occurring datum (or data) in your data set.
The mean, median, and mode are extremely helpful when you need to analyze your data, but if your data set consists of ranges which lack specific values, the mean may seem impossible to calculate. However, the mean can be approximated: add the lower boundary of each interval to its upper boundary and divide by two to find the midpoint of the interval, multiply each midpoint by the number of values found in the corresponding range, and divide the sum of these products by the total number of data values in the set.

Looking at the distribution of data can reveal a lot about the relationship between the mean, the median, and the mode. There are three types of distributions:
A left (or negatively) skewed distribution has a shape like Figure 3.4a.
A symmetrical or normal distribution looks like Figure 3.4b.
A right (or positively) skewed distribution has a shape like Figure 3.4c.

Figure 3.4

The standard deviation can help you calculate the spread of data. There are different equations to use depending on whether you are calculating the standard deviation of a sample or of a population.
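The grouped-data formulas for the mean, median, mode, and the quartile, decile and percentile positions covered in this module can be evaluated on the capacitor table with the following Python sketch. The function names are illustrative, and the sketch assumes a common class size for every class, as in the module's example.

```python
# Grouped capacitor data: lower class boundaries, class midpoints,
# frequencies, and the common class size.
lower_cb = [10.25, 27.75, 45.25, 62.75, 80.25]
midpoints = [19.0, 36.5, 54.0, 71.5, 89.0]
freqs = [1, 5, 8, 13, 3]
class_size = 17.5
n = sum(freqs)


def grouped_mean():
    """Sum of f * x divided by the total frequency."""
    return sum(f * x for f, x in zip(freqs, midpoints)) / n


def value_at_position(position):
    """Interpolate inside the class containing the given position.

    position is n/2 for the median, (k/4)n for a quartile,
    (k/10)n for a decile and (k/100)n for a percentile.
    """
    below = 0  # cumulative frequency of all lower classes
    for lcb, f in zip(lower_cb, freqs):
        if below + f >= position:
            return lcb + (position - below) / f * class_size
        below += f
    raise ValueError("position exceeds the total frequency")


def grouped_mode():
    """Mode from the excess of the modal frequency over its neighbours."""
    i = freqs.index(max(freqs))
    prev_f = freqs[i - 1] if i > 0 else 0
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0
    d1, d2 = freqs[i] - prev_f, freqs[i] - next_f
    return lower_cb[i] + d1 / (d1 + d2) * class_size


median = value_at_position(n / 2)      # position 15
p40 = value_at_position(40 / 100 * n)  # position 12
d8 = value_at_position(8 / 10 * n)     # position 24
q1 = value_at_position(8)              # module rounds n/4 = 7.5 up to 8

print(grouped_mean(), round(median, 2), round(grouped_mode(), 2))
print(q1, p40, round(d8, 3))
```

Passing the position as an argument keeps the rounding decision for the first quartile visible: using 7.5 instead of 8 would give a slightly smaller value for the same class.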
