CHAPTER 1
NUMERICAL DESCRIPTIVE STATISTICS

1. What Is Statistics?
   1.1. Descriptive statistics
   1.2. Inferential statistics
      1.2.1. Population
      1.2.2. Sample
2. Numerical Descriptive Statistics
   2.1. Measures of Central Tendency
      2.1.1. The Arithmetic Mean
         2.1.1.1. Population mean μ
         2.1.1.2. Sample mean x̄
      2.1.2. The Mean as the Center of Gravity of Data Set
      2.1.3. The Mean is Affected by the Outlying Values
      2.1.4. Weighted Mean
      2.1.5. The Mean of Binary Data
   2.2. Measures of Dispersion or Data Variability
      2.2.1. Variance
         2.2.1.1. Variance of Population Data σ²
         2.2.1.2. Sample Variance s²
         2.2.1.3. Computational Formula for Variance
         2.2.1.4. Variance of Binary Population Data
      2.2.2. Standard Deviation
   2.3. The z-score

1. What Is Statistics?

Statistics is a discipline that studies the collection, organization, presentation, analysis and interpretation of numerical data. Simply put, statistics is the science of using data to learn about the world around us. There are two branches of statistics: descriptive statistics and inferential statistics.

1.1. Descriptive statistics

Descriptive statistics is the easy part. It deals with the collection, organization, and presentation of data. Descriptive statistics involves tables, charts, and summary characteristics of the data, such as the mean, the median and the standard deviation. Descriptive statistics is encountered daily in the news media. For example, in the weather report you frequently hear about the average temperature, precipitation, pollen count, etc., in a given month of the year. Or you may read about the stock market trend, changes in the mortgage rate, the rise and fall in the crime rate, students' performance in statewide tests, and many similar reports.

1.2. Inferential statistics

Inferential statistics is the complicated part of statistics. It deals with inferring or drawing conclusions about the whole (population data) from analyzing a part of the group (sample data).
An opinion poll is an example of inferential statistics. For instance, to determine voters' preference for a given political candidate, a sample of registered voters is questioned, and from their answers inferences are made about the attitudes of the population of all potential voters. Inferential statistics is more complicated because it involves the theories of probability and sampling distributions, subjects unfamiliar to most students in an introductory statistics course.

CHAPTER 1—Numerical Descriptive Statistics Page 1 of 16

1.2.1. Population

In inferential statistics, the term population applies to every element, observation or data point in the phenomenon or group that is the subject of the analysis. Stated another way, a population consists of all the items or individuals about which you want to draw a conclusion.

1.2.2. Sample

The sample is a subset or a portion of the population selected in order to estimate, or infer about, specific characteristics of the population.

For example, suppose we are interested in the average age of residents of a retirement community in Florida. Table 1.1, listing the age of every resident, represents the population that is the subject of the study. The population has 608 observations. The shaded cells in the table represent the age data for a sample of size 40 randomly selected from the population. Table 1.2 contains the sample data.

The population data set here is said to be "finite": you can easily obtain and list all of its values, and you can easily compute the average age. Here, the average age of the population of residents in this community is 64.2. The population average (or population mean), denoted by μ (mu, the Greek lower case m), is an example of a summary characteristic of a data set. A summary characteristic that is obtained from the population is called a population parameter. Thus, μ = 64.2 is a population parameter.
Table 1.1 Population Age Data for the Residents of a Retirement Community

82 69 56 74 68 65 60 66 70 51 59 64 75 69 69 70 60 76 65 64 64 55 64 65 69 75 58 61 65 59 62 62 57 61 59 78 70 70 69 63 62 73 52 68 79 61 68 68 56 79 61 68 67 52 67 65 65 53 56 69 66 57 67 62
54 88 62 87 84 52 85 90 50 81 78 61 79 82 71 70 67 67 64 67 68 63 63 62 70 61 64 67 65 64 58 65 54 50 68 66 59 67 58 61 61 50 68 67 52 76 55 70 56 62 59 67 54 80 57 63 74 79 65 73 65 61 64 70 67 67 61 62 68 70 67 67 70 68 64 64 69 64 61 64 63 70 67 70 69 63 63 67 70 68 79 63 61 60 67 76 73 78 51 54 76 52 55 58 62 70 63 66 70 69 69 69 63 70 69 65 63 64 68 64 69 67 62 70 68 66 61 65 66 65 68 66 70 65 62 55 53 69 60 74 78 55 76 57 78 76 57 63 57 52 60 58 68 68 54 52 62 66 66 65 67 56 57 67 66 69 65 68 68 63 65 70 64 66 70 62 70 66 63 68 54 76 63 72 68 72 65 79 59 80 52 76 50 55 76 62 63 67 61 66 64 68 67 65 69 65 61 68 67 64 67 66 68 67 67 61 61 63 65 69 68 65 63 69 66 61 69 64 68 68 63 62 69 67 63 64 61 62 68 68 60 58 59 56 56 59 58 57 60 60 59 60 59 60 58 69 74 52 68 61 64 53 59 69 69 73 55 52 77 76 55 72 52 67 70 75 66 63 80 74 66 75 50 52 61 68 63 70 61 63 63 70 61 65 63 69 64 61 64 69 66 66 64 64 63 69 66 67 62 62 69 66 68 65 61 52 51 55 55 54 52 55 50 50 54 54 54 52 53 51 69 66 62 67 61 66 65 63 61 64 67 63 63 62 64 62 68 67 64 66 65 62 62 66 65 66 70 69 65 67 67 70 66 61 68 61 67 61 64 69 64 70 64 64 61 70 50 73 77 79 51 66 84 64 53 73 60 60 73 56 57 57 57 60 60 57 57 60 59 60 59 60 60 56 59 69 62 61 67 67 70 61 66 67 70 62 69 69 65 61 60 51 67 67 56 61 64 67 57 69 58 62 52 62 70 58 58 60 60 57 59 58 57 60 59 60 59 60 56 60 63 66 68 62 68 68 70 62 61 62 65 63 62 65 62 65 70 69 66 53 56 50 64 66 65 67 51 63 63 55 86 66 54 73 64 66 50 63 64 72 68 63 57 61 65 67 60 74 55 67 70 52 61 63 66 66 59 68 53 56 68 55 73 67 57 50 61 70 69 66 68 57 52 68 51 65 64 69 59 72 72 65 65 53 67 66 62 82 56 65 50 59 70 57

Sometimes it is preferable to determine the
summary characteristic of interest from a sample. In most cases, the population data set may not be finite, and hence not obtainable. In such cases a sample, as a subset of the population data, may serve us better than a study of the whole population. Even with finite population data sets, it is sometimes preferable to obtain a sample, either because it is more convenient or because sample data can be screened for errors much more thoroughly than population data.

If a summary characteristic is computed from the sample data, then this summary characteristic represents an estimate of the population parameter. The sample (estimated) summary characteristic is called a sample statistic. Table 1.2 shows the age data for a sample of 40 residents randomly selected from the population.

Table 1.2 The Age Data for a Sample of 40 Residents

54 69 66 64 61 61 62 70 51 53
57 69 60 54 60 70 66 69 69 59
70 76 75 66 60 67 52 50 61 60
69 52 52 66 62 69 65 63 64 70

The average or mean age computed from the random sample of size 40 shown in Table 1.2 is 62.8. This average is denoted by the symbol x̄ (x-bar). Thus, the sample statistic x̄ = 62.8 is an estimate of the population parameter μ = 64.2.¹

Table 1.3 shows another feature or characteristic of the population of residents in the retirement community. This time the table lists the residents according to their gender, where male = 0 and female = 1. Here we are interested in the proportion of females in the community. In the table there are 316 observations with a value of 1. Therefore, the proportion of females in the population is π = 316/608 = 0.52. Note the symbol π (pi, the Greek lower case p). This symbol is used to denote the population proportion. The population proportion π is another example of a population parameter.

¹ The method for computing μ and x̄ will be shown later in this chapter.
The mean is simply the sum of all values divided by the number of observations in a data set: μ = 39006/608 = 64.1546 and x̄ = 2513/40 = 62.825.

Table 1.3 Listing of Residents of the Retirement Community According to Gender: Male = 0 and Female = 1

1 1 0 0 0 1 0 1 1 0 0 1 1 1 0 0 1 0 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1 1 0 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1 0 0 1 0 0 1 0 1 0 0 0 1 0 0
1 1 0 0 0 0 1 0 0 0 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 1 0 0 1 1 0 1 0 1 1 1 0 0 0 1 1 1 1 0 0 0 1 1 0 1 0 0 1 0 1 1 0 1 0 1 1 1 0 1 0 1 1 0 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 0 0 1 1 0 1 1 1 0 1 1 0 0 1 1 1 0 0 1 0 1 0 0 0 0 0 1 0 1 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 1 0 0 0 1 1 1 0 0 0 1 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 1 0 0 0 1 0 0 0 1 1 1 0 0 1 1 1 0 1 1 0 0 1 0 0 1 1 1 1 0 0 1 1 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 1 1 0 0 1 0 0 0 0 0 1 1 1 1 0 0 0 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 0 0 1 1 0 0 0 0 0 1 1 0 1 1 1 1 1 1 1 1 0 0 1 1 0 1 0 1 1 0 0 0 0 1 1 0 1 0 1 0 0 0 0 1 0 0 1 1 0 0 1 1 0 1 0 0 1 0 0 1 1 0 1 0 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1 1 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 1 1 0 1 0 1 1 0 0 0 1 1 1 1 0 0 1 1 0 1 1 0 1 1 0 0 0 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0 1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1 0 1 1 1 0 1 1 0 1 1 1 0 1 0 1 1 1 1 0 1 0 0 0 1 1 0 0 1 1 0 1 0 1 1 0 1 1 1 0 0 1 0 0 0 0 0 0 1 1 0 0 1 1 1 1 1 0 1 0 0 1 0 0 1 1 0 1 1 1 1 1 0 1 1 1 0 1 0 0 0 0

The sample statistic used to estimate the population parameter π is the sample proportion p̄ (p-bar). In the sample shown in Table 1.4, there are 17 observations with value 1. The sample proportion is then p̄ = 17/40 = 0.425.²

² The sample proportion p̄ = 0.425 appears to be a rather inaccurate estimate of the population proportion π = 0.52. This is because the sample size is relatively small.
To obtain a more accurate estimate of a population proportion the sample size should be larger.

Table 1.4 Gender of a Sample of 40 Residents: Male = 0 and Female = 1

1 0 1 0 0 1 0 1 1 1
1 0 0 0 0 0 1 0 1 1
0 0 1 0 0 1 0 0 1 0
0 1 0 1 0 0 0 1 0 1

In short, we have introduced here two examples of a population parameter: the population mean μ and the population proportion π. The sample statistics used as estimators of these population parameters are, respectively, the sample mean x̄ and the sample proportion p̄.

Population Parameter     Symbol    Sample Statistic     Symbol
Population Mean          μ         Sample Mean          x̄
Population Proportion    π         Sample Proportion    p̄

2. Numerical Descriptive Statistics

Numerical descriptive statistics involves computing a compact measure, or summary characteristic, of a data set that represents an essential feature of that data set. A data set may therefore be summarized and its characteristics represented by a single number. Compact measures are particularly useful when comparing two data sets. For example, to compare the performance of students in two sections of a statistics course, the average score in each section on the common departmental final provides a single effective measure. The "average" or "mean" is a summary characteristic.

There are two groups of measures that provide a compact characteristic of a data set: 1) measures of central tendency or location; and 2) measures of dispersion or variability.

2.1. Measures of Central Tendency

There are three different measures of central tendency: 1) the arithmetic mean; 2) the median; and 3) the mode.

2.1.1. The Arithmetic Mean

The most widely known and used measure of central tendency is the arithmetic mean (or, simply, the mean or the average). The mean is the sum of the values of all the observations in a data set divided by the number of observations.
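As a quick check of this definition, the following minimal Python sketch recomputes the sample mean of Table 1.2 and the sample proportion of Table 1.4. The variable and function names are illustrative assumptions, not part of the chapter; the data are copied from the two tables.

```python
# Sample data copied from Table 1.2 (ages) and Table 1.4 (male = 0, female = 1).
sample_ages = [54, 69, 66, 64, 61, 61, 62, 70, 51, 53,
               57, 69, 60, 54, 60, 70, 66, 69, 69, 59,
               70, 76, 75, 66, 60, 67, 52, 50, 61, 60,
               69, 52, 52, 66, 62, 69, 65, 63, 64, 70]
sample_gender = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1,
                 1, 0, 0, 0, 0, 0, 1, 0, 1, 1,
                 0, 0, 1, 0, 0, 1, 0, 0, 1, 0,
                 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]


def mean(values):
    """Sum of the values divided by the number of observations."""
    return sum(values) / len(values)


x_bar = mean(sample_ages)    # sample mean of the 40 ages
p_bar = mean(sample_gender)  # sample proportion, computed as the mean of the 0/1 data

print(f"x-bar = {x_bar:.3f}, p-bar = {p_bar:.3f}")
```

The same `mean` function serves both purposes because a proportion of 0/1 data is just the sum of the 1's divided by the number of observations.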
If the data set represents a population, then the population mean is denoted by μ; if the data set represents a sample, then the sample mean is denoted by x̄.

2.1.1.1. Population Mean

The formula for the population mean is:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \cdots + x_N}{N}$$

N is the number of observations in the population data set.

Example 1³

A population consists of five data points as follows:

$x_1 = 2 \quad x_2 = 5 \quad x_3 = 7 \quad x_4 = 9 \quad x_5 = 17$

Find the population mean.

$$\mu = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{2 + 5 + 7 + 9 + 17}{5} = \frac{40}{5} = 8$$

2.1.1.2. Sample Mean

The formula for the sample mean is the same as the population mean formula, except for the symbols. The sample mean is denoted by x̄ and the sample size by n.

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \cdots + x_n}{n}$$

In Excel, the average is computed using the following function:

=AVERAGE(data range)

For practice, copy Table 1.1 (the age of residents in the retirement community) into Excel. Go to a blank cell where you want to enter the average value. In that cell type "=average(", select the cells that contain the population data by dragging the mouse, and then simply press "Enter" or click √.

2.1.2. The Mean as the Center of Gravity of Data Set

The mean represents the "center of gravity" of a set of numbers. To explain this, you must first understand one of the most important terms in statistics: deviation. A deviation is simply the difference between a data point and the mean: $x_i - \mu$. Table 1.5 below shows the deviation of each of the five data points from the mean µ = 8. Note that the sum of the deviations equals zero. This is where the notion of "center of gravity" comes in.

Table 1.5 Deviation of data from the mean (µ = 8)

  x_i     Deviation x_i − µ
   2           −6
   5           −3
   7           −1
   9            1
  17            9
          Σ(x_i − µ) = 0

As the following diagram shows, µ = 8 is the balancing point of the five numbers.
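The zero-sum property in Table 1.5 can also be checked numerically. The sketch below is a minimal Python illustration using the five data points of Example 1; the variable names are assumptions made for this example.

```python
# Population from Example 1
data = [2, 5, 7, 9, 17]

mu = sum(data) / len(data)            # population mean
deviations = [x - mu for x in data]   # x_i - mu for each data point

# Split the deviations into those above and those below the mean
above = sum(d for d in deviations if d > 0)
below = sum(d for d in deviations if d < 0)

print("mean:", mu)
print("deviations:", deviations)
print("sum above the mean:", above, "sum below the mean:", below)
print("sum of all deviations:", sum(deviations))
```

The positive and negative deviations cancel, which is exactly what makes the mean the balancing point of the data.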
This means that the sum of the deviations of the values exceeding the mean, 1 + 9 = 10, exactly balances the sum of the deviations of the values below the mean, −6 − 3 − 1 = −10. Thus the sum of all deviations from the mean is zero.

[Diagram: the data points 2, 5, 7, 9 and 17 on a number line, balanced at the mean 8]

³ See the Excel file to learn how to use Excel to do the examples.

2.1.3. The Mean is Affected by the Outlying Values

This is the drawback of the mean. When a data set contains values that are extremely large or small compared to the other observations, the average gives a distorted summary characteristic of the data set. For example, assume the data shown in Table 1.6 are the scores from the departmental common final for two different sections of a statistics course. Using the average score as a measure of the overall performance of each class, which section has the superior performance?

Table 1.6 Departmental Final Scores for Two Sections of a Statistics Course

Scores in Section A    Scores in Section B
60 64 60 76 96 72 88 84 60 48 64 80 64 72 56 88 60 72 72 80 72 76 72 12 64 40 52 56 64 52 48 68 60 68 96 80 60 64 68
µ_A = 66.3             µ_B = 66.4

Since µ_A and µ_B are nearly identical, student performances in both sections appear to be the same. However, note that one student in Section B scored a 12. Since the mean is the center of gravity of the data set, this extremely low score has pulled the mean for this section down, thus distorting the overall picture. If this score were taken out, the Section B average would rise to 69.6, which would indicate that students in Section B had a better performance.

The distortion created by the impact of outlying values is the disadvantage of using the mean as a summary characteristic measuring the central tendency of the data set.

2.1.4. Weighted Mean

Whenever the observations in a data set carry different weights, the weighted average should be used.
The general formula is the following:

$$\mu = \sum w_i x_i = w_1 x_1 + \cdots + w_n x_n$$

The formula simply shows that the weighted mean is the sum of the products of each data point and its relative weight. Since these weights are relative, the sum of the weights equals 1.

$$\sum w_i = w_1 + \cdots + w_n = 1$$

Table 1.7 shows a simple example of a weighted average. Note that the sum of the weights is $\sum w_i = 1$ and the weighted average is $\sum x_i w_i = 67$.

Table 1.7 Weighted average calculation

  x_i     w_i      x_i w_i
  10      0.05       0.5
  20      0.10       2.0
  50      0.25      12.5
  70      0.20      14.0
  95      0.40      38.0
       Σw_i = 1.00   Σx_i w_i = 67.0

Sometimes the weight of each observation in a data set represents the relative frequency of that observation in the data set. For example, suppose in the example above there are n = 200 observations. The weight assigned to the data point 95, 0.40, indicates that the value 95 occurs 80 times out of 200.

In some cases you must compute the weights, as shown in the following example.

Example 2

Four different sections of a statistics course took the departmental common final. You are given the mean score for the four sections and are told to find the overall departmental average (the grand average). The mean for each section is as follows:

$A_1 = 64 \quad A_2 = 52 \quad A_3 = 68 \quad A_4 = 48$

At first, you may be tempted to add the averages and divide the sum by 4. This would be a correct approach if all sections had the same number of students. But if the sections have unequal numbers of students, then the section averages must be weighted by the number of students. Table 1.8 shows the number of students for each section and the calculation of the weighted grand average, M.

Table 1.8 Calculation of Weighted Grand Average Score

  Section Average   Number of Students   Relative Frequency
       µ_i                 f_i              w_i = f_i/N        w_i µ_i
       64                  30                  0.21              13.7
       52                  45                  0.32              16.7
       68                  15                  0.11               7.3
       48                  50                  0.36              17.1
                    N = Σf_i = 140                        Σw_i µ_i = 54.9

In many cases the weights are assigned to the values, such as in the following examples.
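The weighted-mean calculations of Table 1.7 and Example 2 can be sketched in Python as follows. The function name `weighted_mean` and the choice to normalize raw weights (such as student counts) inside the function are assumptions made for this illustration.

```python
def weighted_mean(values, weights):
    """Weighted mean; raw weights are normalized so that they sum to 1."""
    total = sum(weights)
    return sum(x * w / total for x, w in zip(values, weights))


# Table 1.7: the weights are already relative (they sum to 1)
table_1_7 = weighted_mean([10, 20, 50, 70, 95], [0.05, 0.10, 0.25, 0.20, 0.40])

# Example 2: section averages weighted by the number of students
section_means = [64, 52, 68, 48]
students = [30, 45, 15, 50]
grand = weighted_mean(section_means, students)

# Unweighted average of the four section means, for comparison
simple = sum(section_means) / len(section_means)

print(f"Table 1.7 weighted average: {table_1_7:.1f}")
print(f"Example 2 grand average:    {grand:.1f}")
print(f"Simple average of means:    {simple:.1f}")
```

Comparing `grand` with `simple` shows how far off the unweighted average of the section means would be when the sections differ in size.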
Example 3

The instructor in a statistics class assigns course grades according to the following requirements: five homework assignments, three hourly tests, and a departmental final. The instructor assigns weights of 10% to homework assignments, 40% to hourly tests, and 50% to the departmental final. A student received the following scores.

  Homework    Tests    Final
     88         80       70
     86         75
     95         90
     90
    100

What is the student's average score for the semester?

If the simple average were used, then the student's average score would be 86 (adding all the scores and dividing the sum by 9). But this is not an accurate summary measure of the student's performance in the class, because homework assignments are only 10% of the grade and the final is 50%. The following shows the calculation of the average taking into account the weights assigned to each category.

  Task            Average x̄    Weight w    w · x̄
  Homework          91.8         0.10        9.2
  Hourly Tests      81.7         0.40       32.7
  Final             70.0         0.50       35.0
                                 Course average = 76.8

Example 4

Last year, John's investment portfolio had three securities: an aggressive mutual fund, which rose 30%; an electrical utility, which rose 5%; and a gold mining share, which fell 10%. If the aggressive mutual fund comprised 50% of the value of John's portfolio for that year, the utility 30%, and the gold mining share the remainder, what was the overall rate of growth for John's portfolio?

  Security       Rate of Return x (%)    Share in Portfolio w    x · w
  Mutual Fund            30                     0.50              15.0
  Utility                 5                     0.30               1.5
  Gold                  −10                     0.20              −2.0
                                                   µ = Σ x · w = 14.5

2.1.5. The Mean of Binary Data

In the discussion of the population and sample proportion above, qualitative characteristics of the elements of a population or sample were coded for statistical analysis by assigning the values 0 and 1 to the characteristics of interest. For example, if we want to determine what proportion of a group of individuals is female, we may assign the value 1 to "female" and 0 to "male".
Suppose a group of 10 people consists of 6 males and 4 females. To determine the proportion of females in this group, all we have to do is divide 4 by 10, which shows that 0.4 (or 40%) of the group is female.

Another way of viewing the concept of proportion is to think of it as the mean of a set of binary data. Since the mean is the sum of the values of the data points divided by the number of data points, adding binary data amounts to counting the 1's. If in a set with 10 elements 6 are 0's and 4 are 1's, then the sum of all data points is equal to 4. Divide that sum by 10 and you have 0.4.

$$\pi = \frac{\sum x}{N} = \frac{0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 1}{10} = \frac{4}{10} = 0.4$$

Thus, a proportion is simply the mean of a binary data set.

2.2. Measures of Dispersion or Data Variability

In addition to measures of central tendency, measures of dispersion or data variability provide important and useful information about the characteristics of a data set.

2.2.1. Variance

The variance of a data set is a summary measure of the dispersion or scatter of the observations around the mean. To compute the variance you must find the mean of the squared deviations of the data points from the mean of the data.

2.2.1.1. Variance of Population Data

The population variance is denoted by σ² (lower case Greek letter sigma, squared). Variance is the average value of the squared deviations of the observations from the mean of the data set. To compute the variance of a population data set, first find the mean µ, then determine the sum of squared deviations of the observations from the mean, as follows:

Deviation from the mean = $x_i - \mu$
Squared deviation = $(x_i - \mu)^2$
Sum of the squared deviations = $\sum (x_i - \mu)^2$

Variance is the average of the squared deviations.
Therefore, divide the sum of squared deviations by N:

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$

Example 5

Find the variance of the following data set:

34 55 46 38 42

The following worksheet shows the computations:

Table 1.9

   x_i     Deviation x_i − µ    Squared deviation (x_i − µ)²
   34            −9                      81
   55            12                     144
   46             3                       9
   38            −5                      25
   42            −1                       1
  µ = 43                         Σ(x_i − µ)² = 260

$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{260}{5} = 52$$

2.2.1.2. Sample Variance

The variance of a sample data set is not only denoted by a different symbol, s², but is also obtained using a different formula. To find the average squared deviation or "mean square", divide the sum of squared deviations, $\sum (x_i - \bar{x})^2$, by n − 1. The value n − 1 is called the degrees of freedom. This concept will be explained later within the context of inferential statistics.

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Example 6

The following is the commuting time to the IUPUI campus (in minutes) for a random sample of 5 E270 students. Find the sample variance.

   x_i     Deviation x_i − x̄    Squared deviation (x_i − x̄)²
   35            14                     196
   20            −1                       1
   25             4                      16
   10           −11                     121
   15            −6                      36
  x̄ = 21                         Σ(x_i − x̄)² = 370

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{370}{4} = 92.5$$

2.2.1.3. Computational (Simplified) Formula to Find the Population Variance

We can rearrange the variance formula to obtain a simpler process for computing the variance. Of course, if you have access to computer software like Excel there is no need to use this formula. Nevertheless, the computation of the sum of squared deviations in the numerator of the variance formula is simplified using the following reconfiguration of the sum of squared deviations formula.
$$\sum (x_i - \mu)^2 = \sum x_i^2 - N\mu^2$$

This identity follows by expanding the square and using $\sum x_i = N\mu$:

$$\sum (x_i - \mu)^2 = \sum (x_i^2 - 2\mu x_i + \mu^2) = \sum x_i^2 - 2\mu \sum x_i + N\mu^2 = \sum x_i^2 - 2N\mu^2 + N\mu^2 = \sum x_i^2 - N\mu^2$$

The computational formula for the population variance is then,

$$\sigma^2 = \frac{\sum x_i^2 - N\mu^2}{N}$$

Example 7

Compute the variance of the data in Table 1.9 using the computational formula.

    x        x²
   34      1156
   55      3025
   46      2116
   38      1444
   42      1764
  µ = 43   Σx² = 9505

$$\sigma^2 = \frac{\sum x_i^2 - N\mu^2}{N} = \frac{9505 - 5 \times 43^2}{5} = 52$$

In Excel, the variance of a population data set is obtained by the function:

=VAR.P(data range)

Similarly, the computational formula for the numerator of the sample variance is:

$$\sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2$$

The computational formula for the sample variance is then,

$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1}$$

Example 8

Use the sample data from Example 6 to compute the sample variance using the computational formula.

    x        x²
   35      1225
   20       400
   25       625
   10       100
   15       225
  x̄ = 21   Σx² = 2575

$$s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{2575 - 5 \times 21^2}{4} = 92.5$$

In Excel, the variance of a sample data set is found by the function:

=VAR.S(data range)

2.2.1.4. Variance of Binary Population Data

When the data set is binary (consisting only of 0's and 1's) the computation of the variance becomes very simple, according to the following formula:⁵

$$\sigma^2 = \pi(1 - \pi)$$

For example, if the population proportion is π = 0.4, then the variance is σ² = 0.4(0.6) = 0.24. The usefulness of this formula will become apparent later when we discuss the sampling distribution of the proportion and inferences about the population proportion.

2.2.2. Standard Deviation

The standard deviation is the (positive) square root of the variance. It measures dispersion in the same units as the data and can be read as a typical deviation of the data points from the mean of the data. The population standard deviation formula is:

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\frac{\sum x_i^2 - N\mu^2}{N}}$$

And the sample standard deviation is:

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$$

⁵ The proof of the binary variance formula σ² = π(1 − π) is as follows.
Using the computational formula for the population variance and replacing μ with π for the proportion, we have

$$\sigma^2 = \frac{\sum x_i^2 - N\pi^2}{N}$$

For binary data $\sum x_i^2 = \sum x_i$, so

$$\sigma^2 = \frac{\sum x_i - N\pi^2}{N} = \frac{\sum x_i}{N} - \pi^2 = \pi - \pi^2 = \pi(1 - \pi)$$

From Examples 5 and 6 above, where σ² = 52 and s² = 92.5, the standard deviations are, respectively:⁶

$$\sigma = \sqrt{52} = 7.211 \qquad s = \sqrt{92.5} = 9.618$$

In Excel, the standard deviation of the population data set (σ) is obtained by:

=STDEV.P(data range)

and the standard deviation of the sample data set (s) is obtained by:

=STDEV.S(data range)

2.3. The z-score

Using the mean and the standard deviation of a data set we can express the data values in terms of their distance from the mean measured in units of the standard deviation. The deviation of each data point from the mean is $x_i - \mu$. If you divide the deviation by σ, then the distance is measured relative to, or in units of, the standard deviation. Through this process we "standardize" the data points; we transform the variable x into the standardized variable z. The conversion formula is:

$$z = \frac{x_i - \mu}{\sigma}$$

Using the appropriate symbols, the same conversion formula applies to sample data:

$$z = \frac{x_i - \bar{x}}{s}$$

Example 9

For the following population data set, find the mean and the standard deviation and then find the z-score for each data point. That is, transform the x variable into a z variable.

50 26 37 15 22

$$\mu = \frac{\sum x}{N} = 30 \qquad \sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}} = 12.28$$

The standardized values are determined as follows:

   x_i     x_i − µ     z_i = (x_i − µ)/σ
   50        20             1.63
   26        −4            −0.33
   37         7             0.57
   15       −15            −1.22
   22        −8            −0.65
        Σ(x_i − µ) = 0    Σz_i = 0.00

Note that since the sum of all deviations from the mean is equal to zero, the mean of all z-scores must be zero. Also, the variance and the standard deviation of the z-scores are both equal to one.⁷

     z       (z − µ_z)² *
    1.63       2.6525
   −0.33       0.1061
    0.57       0.3249
   −1.22       1.4920
   −0.65       0.4244
  Σz = 0.00   Σ(z − µ_z)² = Σz² = 5.0000

$$\mu_z = \frac{\sum z}{N} = 0.00 \qquad \sigma_z^2 = \frac{\sum z^2}{N} = \frac{5}{5} = 1$$

* Note that the z values are shown rounded to two decimal places. The squared values in the second column are, therefore, not exactly the squares of the rounded values in the first column.

⁶ Note that σ² and σ are summary characteristics of the population data set. Therefore, each is a population parameter. Similarly, s² and s are summary characteristics computed from sample data. Therefore, each is a sample statistic.

⁷ The mean of the z-scores is zero:

$$\mu_z = \frac{\sum z}{N} = \frac{1}{N} \sum \frac{x - \mu}{\sigma} = \frac{\sum (x - \mu)}{N\sigma} = \frac{0}{N\sigma} = 0$$

and their variance is one:

$$\sigma_z^2 = \frac{\sum (z - \mu_z)^2}{N} = \frac{1}{N} \sum z^2 = \frac{1}{N} \sum \frac{(x - \mu)^2}{\sigma^2} = \frac{1}{N\sigma^2} \sum (x - \mu)^2 = \frac{N\sigma^2}{N\sigma^2} = 1$$
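To tie together the variance, standard deviation and z-score calculations of Sections 2.2 and 2.3, the following minimal Python sketch reproduces Examples 5, 6 and 9. The function names are assumptions made for this illustration; the chapter itself uses the Excel functions VAR.P, VAR.S, STDEV.P and STDEV.S.

```python
import math


def population_variance(data):
    """Average squared deviation from the mean (divide by N)."""
    mu = sum(data) / len(data)
    return sum((x - mu) ** 2 for x in data) / len(data)


def sample_variance(data):
    """Sum of squared deviations divided by the degrees of freedom, n - 1."""
    x_bar = sum(data) / len(data)
    return sum((x - x_bar) ** 2 for x in data) / (len(data) - 1)


def z_scores(data):
    """Standardize population data: (x - mu) / sigma for each data point."""
    mu = sum(data) / len(data)
    sigma = math.sqrt(population_variance(data))
    return [(x - mu) / sigma for x in data]


var_pop = population_variance([34, 55, 46, 38, 42])  # Example 5
var_samp = sample_variance([35, 20, 25, 10, 15])     # Example 6
z = z_scores([50, 26, 37, 15, 22])                   # Example 9

z_mean = sum(z) / len(z)        # mean of the z-scores
z_var = population_variance(z)  # variance of the z-scores

print("population variance:", var_pop, "std dev:", round(math.sqrt(var_pop), 3))
print("sample variance:", var_samp, "std dev:", round(math.sqrt(var_samp), 3))
print("z-scores:", [round(v, 2) for v in z])
print("mean of z:", round(z_mean, 10), "variance of z:", round(z_var, 10))
```

Python's standard `statistics` module provides `pvariance`, `variance`, `pstdev` and `stdev` for the same population and sample quantities; the explicit functions above are written out only to mirror the formulas in the text.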