1 Basic Concepts

CHAPTER 1
NUMERICAL DESCRIPTIVE STATISTICS
1. What Is Statistics?
1.1. Descriptive statistics
1.2. Inferential statistics
1.2.1. Population
1.2.2. Sample
2. Numerical Descriptive Statistics
2.1. Measures of Central Tendency
2.1.1. The Arithmetic Mean
2.1.1.1. Population mean μ
2.1.1.2. Sample mean x̄
2.1.2. The Mean as the Center of Gravity of Data Set
2.1.3. The Mean is Affected by the Outlying Values
2.1.4. Weighted Mean
2.1.5. The Mean of Binary Data
2.2. Measures of Dispersion or Data Variability
2.2.1. Variance
2.2.1.1. Variance of Population Data σ²
2.2.1.2. Sample Variance s²
2.2.1.3. Computational Formula for Variance
2.2.1.4. Variance of Binary Population Data
2.2.2. Standard Deviation
2.3. The z-score
1. What Is Statistics?
Statistics is a discipline that studies the collection, organization, presentation, analysis, and interpretation of
numerical data. Simply put, statistics is the science of using data to learn about the world around us. There
are two branches of statistics: descriptive statistics and inferential statistics.
1.1. Descriptive statistics
Descriptive statistics is the easy part. It deals with the collection, organization, and presentation of data.
Descriptive statistics involves tables, charts, and presentation of summary characteristics of the data, which
include concepts such as the mean, median or standard deviation. Descriptive statistics is encountered daily
in the news media. For example, in the weather report you frequently hear about the average temperature,
precipitation, pollen count, etc., in a given month of the year. Or you may read about the stock market trend,
changes in the mortgage rate, the rise and fall in the crime rate, students' performance in statewide tests, and
many similar reports.
1.2. Inferential statistics
Inferential statistics is the complicated part of statistics. It deals with inferring or drawing conclusions about
the whole (population data) from analyzing a part of a group (sample data). An opinion poll is an example of
inferential statistics. For example, to determine the voters' preference for a given political candidate, a sample
of registered voters is questioned, and from their answers inferences are made about the attitudes of the
population of all potential voters. Inferential statistics is more complicated because it involves the theories of
probability and sampling distributions, subjects unfamiliar to most students in an introductory statistics course.
CHAPTER 1—Numerical Descriptive Statistics
Page 1 of 16
1.2.1. Population
In inferential statistics, the term population applies to every element, observation or data point in the
phenomenon or group that is the subject of the analysis. Stated another way, a population consists of all the
items or individuals about which you want to draw a conclusion.
1.2.2. Sample
The sample is a subset or a portion of the population selected in order to estimate, or infer about, specific
characteristics of the population.
For example, suppose we are interested in the average age of residents of a retirement community in Florida.
Table 1.1, listing the age of every resident, represents the population that is the subject of the study. The
population has 608 observations. The shaded cells in the table represent the age data for a sample of size 40
randomly selected from the population. Table 1.2 contains the sample data.
The population data set here is said to be “finite”: you can easily obtain and list all of its observations, and
you can easily compute the average age. Here, the average age of the population of residents in this community
is 64.2. The population average (or population mean), denoted by μ (mu, the Greek lowercase m), is an example
of a summary characteristic of a data set. A summary characteristic that is obtained from the population is called
a population parameter. Thus, μ = 64.2 is a population parameter.
82
69
56
74
68
65
60
66
70
51
59
64
75
69
69
70
60
76
65
64
64
55
64
65
69
75
58
61
65
59
62
62
57
61
59
78
70
70
69
63
62
73
52
68
79
61
68
68
56
79
61
68
67
52
67
65
65
53
56
69
66
57
67
62
Table 1.1
Population Age Data for the Residents of a Retirement Community
54 88 62 87 84 52 85 90 50 81 78 61 79 82 71
70 67 67 64 67 68 63 63 62 70 61 64 67 65 64
58 65 54 50 68 66 59 67 58 61 61 50 68 67 52
76 55 70 56 62 59 67 54 80 57 63 74 79 65 73
65 61 64 70 67 67 61 62 68 70 67 67 70 68 64
64 69 64 61 64 63 70 67 70 69 63 63 67 70 68
79 63 61 60 67 76 73 78 51 54 76 52 55 58 62
70 63 66 70 69 69 69 63 70 69 65 63 64 68 64
69 67 62 70 68 66 61 65 66 65 68 66 70 65 62
55 53 69 60 74 78 55 76 57 78 76 57 63 57 52
60 58 68 68 54 52 62 66 66 65 67 56 57 67 66
69 65 68 68 63 65 70 64 66 70 62 70 66 63 68
54 76 63 72 68 72 65 79 59 80 52 76 50 55 76
62 63 67 61 66 64 68 67 65 69 65 61 68 67 64
67 66 68 67 67 61 61 63 65 69 68 65 63 69 66
61 69 64 68 68 63 62 69 67 63 64 61 62 68 68
60 58 59 56 56 59 58 57 60 60 59 60 59 60 58
69 74 52 68 61 64 53 59 69 69 73 55 52 77 76
55 72 52 67 70 75 66 63 80 74 66 75 50 52 61
68 63 70 61 63 63 70 61 65 63 69 64 61 64 69
66 66 64 64 63 69 66 67 62 62 69 66 68 65 61
52 51 55 55 54 52 55 50 50 54 54 54 52 53 51
69 66 62 67 61 66 65 63 61 64 67 63 63 62 64
62 68 67 64 66 65 62 62 66 65 66 70 69 65 67
67 70 66 61 68 61 67 61 64 69 64 70 64 64 61
70 50 73 77 79 51 66 84 64 53 73 60 60 73 56
57 57 57 60 60 57 57 60 59 60 59 60 60 56 59
69 62 61 67 67 70 61 66 67 70 62 69 69 65 61
60 51 67 67 56 61 64 67 57 69 58 62 52 62 70
58 58 60 60 57 59 58 57 60 59 60 59 60 56 60
63 66 68 62 68 68 70 62 61 62 65 63 62 65 62
65 70 69 66 53 56 50 64 66 65 67 51 63 63 55
86
66
54
73
64
66
50
63
64
72
68
63
57
61
65
67
60
74
55
67
70
52
61
63
66
66
59
68
53
56
68
55
73
67
57
50
61
70
69
66
68
57
52
68
51
65
64
69
59
72
72
65
65
53
67
66
62
82
56
65
50
59
70
57
Sometimes it is preferable to determine the summary characteristic of interest from a sample. In most cases,
the population data set may not be finite, and hence not obtainable. In such cases a sample, as a subset of the
population data, may serve us better than the study of the whole population. Even with finite population data
sets, it is sometimes preferable to obtain a sample because it may be more convenient, or because the sample
data can be screened for errors more thoroughly than the population data. If a summary characteristic is
computed from the sample data, then this summary characteristic represents an estimate of the population
parameter. The sample (estimated) summary characteristic is called a sample statistic.
Table 1.2 shows the age data for a sample of 40 residents randomly selected from the population.
Table 1.2
The Age Data for a Sample of 40 Residents
54 69 66 64 61 61 62 70
51 53 57 69 60 54 60 70
66 69 69 59 70 76 75 66
60 67 52 50 61 60 69 52
52 66 62 69 65 63 64 70
The average or mean age computed from the random sample of size 40 shown in Table 1.2 is 62.8. This
average is denoted by the symbol x̄ (x-bar). Thus, the sample statistic x̄ = 62.8 is an estimate of the
population parameter μ = 64.2.¹
Table 1.3 shows another feature or characteristic of the population of residents in the retirement community.
This time the table lists the residents according to their gender, where male = 0 and female = 1. Here we
are interested in the proportion of females in the community. In the table there are 316 observations with
a value of 1. Therefore, the proportion of females in the population is π = 316⁄608 = 0.52. Note the symbol π
(pi, the Greek lowercase p). This symbol is used to denote the population proportion. The population
proportion π is another example of a population parameter.
¹ The method for computing μ and x̄ will be shown later in this chapter. The mean is simply the sum of all values divided
by the number of observations in a data set: μ = 39006⁄608 = 64.1546 and x̄ = 2513⁄40 = 62.825.
1
1
0
0
0
1
0
1
1
0
0
1
1
1
0
0
1
0
1
1
0
1
0
0
1
0
0
1
1
0
0
0
1
1
0
0
1
1
0
1
1
0
0
0
0
1
1
0
1
1
0
0
1
0
0
1
0
1
0
0
0
1
0
0
Table 1.3
Listing of Residents of the Retirement Community According to Gender:
𝑴𝒂𝒍𝒆 = 𝟎, 𝒂𝒏𝒅 𝑭𝒆𝒎𝒂𝒍𝒆 = 𝟏
1
1
0
0
0
0
1
0
0
0
1
1
0
1
0
1
0
0
0
1
1
0
0
0
1
0
1
0
1
1
0
1
1
1
1
0
1
1
0
0
1
0
0
1
1
0
1
0
1
1
1
0
0
0
1
1
1
1
0
0
0
1
1
0
1
0
0
1
0
1
1
0
1
0
1
1
1
0
1
0
1
1
0
0
1
1
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
0
0
0
1
1
0
1
1
1
0
1
1
0
0
1
1
1
0
0
1
0
1
0
0
0
0
0
1
0
1
1
0
1
0
1
0
0
0
0
0
0
0
0
0
0
0
1
1
1
1
1
0
1
1
1
0
0
1
1
1
1
1
0
1
0
1
1
1
1
0
1
1
1
0
0
0
1
1
1
0
0
0
1
0
1
0
0
0
0
1
0
1
0
0
0
0
1
0
1
1
0
1
0
0
1
1
1
0
0
0
0
0
1
1
0
0
0
1
0
0
0
1
1
1
0
0
1
1
1
0
1
1
0
0
1
0
0
1
1
1
1
0
0
1
1
1
1
1
0
1
0
1
1
1
1
0
1
1
0
0
1
1
0
0
1
0
0
0
0
0
1
1
1
1
0
0
0
1
1
1
0
1
1
1
0
1
1
1
1
1
1
0
0
0
1
1
0
0
0
0
0
1
1
0
1
1
1
1
1
1
1
1
0
0
1
1
0
1
0
1
1
0
0
0
0
1
1
0
1
0
1
0
0
0
0
1
0
0
1
1
0
0
1
1
0
1
0
0
1
0
0
1
1
0
1
0
1
1
1
0
1
1
1
0
1
1
1
1
0
0
1
1
1
1
0
1
1
0
0
1
0
0
0
0
0
0
0
1
1
0
1
0
1
1
0
0
0
1
1
1
1
0
0
1
1
0
1
1
0
1
1
0
0
0
0
1
1
0
1
1
1
1
1
1
0
0
1
1
0
1
1
0
1
0
0
0
1
1
0
1
1
0
1
1
1
1
1
1
0
0
1
1
0
0
1
1
1
0
1
1
1
0
1
1
0
1
1
1
0
1
0
1
1
1
1
0
1
0
0
0
1
1
0
0
1
1
0
1
0
1
1
0
1
1
1
0
0
1
0
0
0
0
0
0
1
1
0
0
1
1
1
1
1
0
1
0
0
1
0
0
1
1
0
1
1
1
1
1
0
1
1
1
0
1
0
0
0
0
The sample statistic used to estimate the population parameter π is the sample proportion p̄ (p-bar). In the
sample shown in Table 1.4, there are 17 observations with value 1. The sample proportion is then:
p̄ = 17⁄40 = 0.425.²
² The sample proportion p̄ = 0.425 appears to be a very inaccurate estimate of the population proportion π = 0.52. This
is because the sample size is relatively small. To obtain a more accurate estimate of a population proportion, the sample
size should be larger.
Table 1.4
Gender of a Sample of 40 Residents:
Male = 0, and Female = 1
1 0 1 0 0 1 0 1
1 1 1 0 0 0 0 0
1 0 1 1 0 0 1 0
0 1 0 0 1 0 0 1
0 1 0 0 0 1 0 1
In short, we have introduced here two examples of a population parameter: the population mean μ and the
population proportion π. The sample statistics used as estimators of these population parameters are,
respectively, the sample mean x̄ and the sample proportion p̄.
Population Parameter     Symbol    Sample Statistic     Symbol
Population Mean          μ         Sample Mean          x̄
Population Proportion    π         Sample Proportion    p̄
2. Numerical Descriptive Statistics
Numerical descriptive statistics involves computing a compact measure, or summary characteristic, of a
data set that represents an essential feature of that data set. A data set may therefore be
summarized or compacted and its characteristics represented by a single number. Compact measures are
particularly useful when comparing two data sets. For example, to compare the performance of students in
two sections of a statistics course, the average score in each section on the common departmental final
provides a single effective measure. The "average" or "mean" is a summary characteristic.
There are two groups of measures that provide a compact characteristic of a data set: 1) measures of
central tendency or location; and 2) measures of dispersion or variability.
2.1. Measures of Central Tendency
There are three different measures of central tendency: 1) the arithmetic mean; 2) the median; and 3) the
mode.
2.1.1. The Arithmetic Mean
The most widely known and used measure of central tendency is the arithmetic mean (or, simply, the mean
or the average). The mean is the sum of the values of all the observations in a data set divided by the number
of observations. If the data set represents a population then the population mean is denoted by 𝜇 ; and if the
data set represents a sample, then the sample mean is denoted by 𝑥̅ .
2.1.1.1. Population Mean
The formula for the population mean is:
Population Mean:
\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + \cdots + x_N}{N} \]
N is the number of observations in the population data set.
Example 1³
A population consists of five data points as follows:
x₁ = 2, x₂ = 5, x₃ = 7, x₄ = 9, x₅ = 17
Find the population mean.
\[ \mu = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{2 + 5 + 7 + 9 + 17}{5} = \frac{40}{5} = 8 \]
2.1.1.2. Sample Mean
The formula for the sample mean is the same as the population mean formula, except for the symbols. The
sample mean is denoted by x̄ and the sample size by n.
Sample Mean:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + \cdots + x_n}{n} \]
In Excel, the average is computed using the following function:
=AVERAGE(data range)
For practice, copy Table 1.1 (the age of residents in the retirement community) into Excel. Go to a blank cell
where you want to enter the average value. In that cell type “=average(“ and then select the cells that
contain the population data by dragging the mouse, and then simply press “Enter”, or click √.
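For readers working outside Excel, the same calculation can be sketched in Python. This is only an illustration, not part of the chapter's Excel workflow; the function name `arithmetic_mean` is an assumption chosen for this sketch, and the data are the five values from Example 1.

```python
from statistics import mean

def arithmetic_mean(values):
    """Sum of the observations divided by the number of observations."""
    return sum(values) / len(values)

population = [2, 5, 7, 9, 17]        # data from Example 1
mu = arithmetic_mean(population)     # 40 / 5
print(mu)                            # 8.0

# The standard-library function gives the same value.
assert mu == mean(population)
```

The hand-written function mirrors the formula directly; `statistics.mean` plays the role that `=AVERAGE(data range)` plays in Excel.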
2.1.2. The Mean as the Center of Gravity of Data Set
The mean represents the "center of gravity" of a set of numbers. To see why, you must first understand
one of the most important terms in statistics: deviation. A deviation is simply the difference between a
data point and the mean: xᵢ − μ. Table 1.5 below shows the deviation of each of the
five data points from the mean μ = 8. Note that the sum of the deviations equals zero. This is where the notion of
“center of gravity” comes in.
Table 1.5
Deviation of data from the mean (μ = 8)
xᵢ     Deviation xᵢ − μ
2      −6
5      −3
7      −1
9      1
17     9
       ∑(xᵢ − μ) = 0
As the following diagram shows, μ = 8 is the balancing point of the five numbers. This means that the sum of the
deviations of the values exceeding the mean, 1 + 9 = 10, exactly balances the sum of the deviations of the
values below the mean, −6 − 3 − 1 = −10. Thus the sum of all deviations from the mean is zero.
³ See the Excel file to learn how to use Excel to do the examples.
[Diagram: the data points 2, 5, 7, 9, and 17 on a number line, balancing at the mean, 8]
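The balancing property can be checked numerically. The following Python sketch uses the five data points from Table 1.5; the variable names are illustrative assumptions.

```python
population = [2, 5, 7, 9, 17]               # data from Table 1.5
mu = sum(population) / len(population)      # 8.0

# Deviation of each data point from the mean.
deviations = [x - mu for x in population]

# Deviations above and below the mean balance each other.
above = sum(d for d in deviations if d > 0)   # 1 + 9
below = sum(d for d in deviations if d < 0)   # -6 - 3 - 1

print(deviations, above, below, sum(deviations))
```

Because the positive and negative deviations cancel, the total is zero for any data set, not only for this one.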
2.1.3. The Mean is Affected by the Outlying Values
This is the drawback of the mean. When there are extremely large or small values as compared to the other
observations in the data set, the average will give a distorted summary characteristic of the data set. For
example, assume the data shown in Table 1.6 are the scores from the departmental common final for two
different sections of a statistics course. Using the average score as a measure of the overall performance of
each class, which section has the superior performance?
Table 1.6
Departmental Final Scores for Two Sections of a Statistics Course
Scores in Section A
Scores in Section B
60
64
60
76
96
72
88
84
60
48
64
80
64
72
56
88
60
72
72
80
72
76
72
12
64
40
52
56
64
52
48
68
60
68
96
80
60
64
68
µ𝐴 = 66.3
µ𝐵 = 66.4
Since μA and μB are nearly identical, student performance in the two sections appears to be the same. However,
note that one student in Section B scored a 12. Since the mean is the center of gravity of the data set, this
extremely low score has pulled the center of gravity, the mean, for this section down, thus distorting the
overall picture. If this score were taken out, the Section B average would rise to 69.6, which would indicate
that students in Section B had a better performance. The distortion created by the impact of outlying
values on the mean is the disadvantage of using the mean as a summary characteristic measuring the central
tendency of the data set.
2.1.4. Weighted Mean
Whenever observations in a data set carry different weights, a weighted average should be used. The general
formula is the following:
μ = ∑wᵢxᵢ = w₁x₁ + ⋯ + wₙxₙ
The formula shows that the weighted mean is the sum of the products of each data point and its relative
weight. Since these weights are relative, the sum of the weights equals 1.
∑𝑤𝑖 = 𝑤1 + ⋯ + 𝑤𝑛 = 1
Table 1.7 shows a simple example of a weighted average. Note that the sum of the weights is ∑wᵢ = 1 and the
weighted average is ∑xᵢwᵢ = 67.
Table 1.7
Weighted average calculation
xᵢ     wᵢ      xᵢwᵢ
10     0.05    0.5
20     0.10    2.0
50     0.25    12.5
70     0.20    14.0
95     0.40    38.0
       ∑wᵢ = 1.00    ∑xᵢwᵢ = 67.0
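The calculation in Table 1.7 can be reproduced with a short Python sketch. The function name `weighted_mean` and the check that the weights sum to 1 are assumptions made for this illustration.

```python
import math

def weighted_mean(values, weights):
    """Sum of each value times its relative weight (weights must sum to 1)."""
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("relative weights must sum to 1")
    return sum(w * x for x, w in zip(values, weights))

x = [10, 20, 50, 70, 95]                 # values from Table 1.7
w = [0.05, 0.10, 0.25, 0.20, 0.40]       # relative weights from Table 1.7

result = weighted_mean(x, w)
print(round(result, 1))                  # 67.0
```

Rounding is applied only when printing, since floating-point sums of decimal weights are rarely exact.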
Sometimes the weight of each observation in a data set represents the relative frequency of that observation in
the data set. For example, suppose in the example above there are n = 200 observations. The weight assigned
to the data point 95, 0.40, indicates that the value 95 occurs 80 times out of 200. In some cases you must
compute the weights, as shown in the following example.
Example 2
Four different sections of a statistics course took the departmental common final. You are given the mean
score for the four sections and are told to find the overall departmental average (the grand average). The
mean for each section is as follows:
𝐴1 = 64
𝐴2 = 52
𝐴3 = 68
𝐴4 = 48
At first, you may be tempted to add the averages and divide the sum by 4. This would be a correct approach if
all sections had the same number of students. But if the sections have unequal numbers of students, then the
section averages must be weighted by the number of students. Table 1.8 shows the number of students in
each section and the calculation of the weighted grand average, Μ.
Table 1.8
Calculation of Weighted Grand Average Score
Section Average    Number of Students    Relative Frequency
μᵢ                 fᵢ                    wᵢ = fᵢ⁄N             wᵢμᵢ
64                 30                    0.21                  13.7
52                 45                    0.32                  16.7
68                 15                    0.11                  7.3
48                 50                    0.36                  17.1
                   N = ∑fᵢ = 140                               ∑wᵢμᵢ = 54.9
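A minimal Python sketch of Table 1.8 follows, computing the weights from the section sizes and comparing the weighted grand average with the unweighted average of the four section means. Variable names are illustrative assumptions.

```python
section_means = [64, 52, 68, 48]     # section averages from Example 2
section_sizes = [30, 45, 15, 50]     # number of students per section

N = sum(section_sizes)                               # 140
weights = [f / N for f in section_sizes]             # relative frequencies

# Weighted grand average: each section mean times its relative frequency.
grand_mean = sum(w * m for w, m in zip(weights, section_means))

# Unweighted average of the section means, correct only for equal sizes.
naive_mean = sum(section_means) / len(section_means)

print(round(grand_mean, 1), naive_mean)              # 54.9 58.0
```

The gap between the two results arises because the two largest sections also have the lowest averages.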
In many cases the weights are assigned to the values, such as in the following examples.
Example 3
The instructor in a Statistics class assigns course grades according to the following requirements: Five
homework assignments, three hourly tests, and a departmental final. The instructor assigns weights of 10%
to homework assignments, 40% to hourly tests, and 50% to the departmental final. A student received the
following scores.
Homework: 88 86 95 90 100
Tests: 80 75 90
Final: 70
What is the student’s average score for the semester? If the simple average were used, then the student’s
average score would be 86—adding all the scores and dividing the sum by 9. But this is not an accurate
summary measure of the student’s performance in the class because homework assignments are only 10% of
the grade, and the final 50%. The following shows the calculation of average taking into account the weights
assigned to each category.
Task            Average x̄    Weight w    w ∙ x̄
Homework        91.8         0.10        9.2
Hourly Tests    81.7         0.40        32.7
Final           70.0         0.50        35.0
Course average = 76.8
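The course average in Example 3 can be sketched in Python as follows. The dictionary layout and names are assumptions for illustration; the scores and weights are those given in the example.

```python
# Each category maps to (scores, weight in the course grade).
categories = {
    "homework": ([88, 86, 95, 90, 100], 0.10),
    "tests":    ([80, 75, 90],          0.40),
    "final":    ([70],                  0.50),
}

# Weighted average of the category averages.
course_average = sum(
    weight * sum(scores) / len(scores)
    for scores, weight in categories.values()
)

# Simple average of all nine scores, ignoring the weights.
all_scores = [s for scores, _ in categories.values() for s in scores]
simple_average = sum(all_scores) / len(all_scores)

print(round(course_average, 1), simple_average)   # 76.8 86.0
```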
Example 4
Last year, John’s investment portfolio had three securities: an aggressive mutual fund, which rose 30%; an
electrical utility, which rose 5%; and a gold mining share, which fell 10%. If the aggressive mutual fund
comprised 50% of the value of John’s portfolio for that year, the utility 30%, and the gold mining share the
remainder, what was the overall rate of growth for John’s portfolio?
Security       Rate of Return x (%)    Share in Portfolio w    x ∙ w
Mutual Fund    30                      0.50                    15.0
Utility        5                       0.30                    1.5
Gold           −10                     0.20                    −2.0
                                                               μ = ∑x ∙ w = 14.5
2.1.5. The Mean of Binary Data
In the discussion of the population and sample proportion above it was stated that when qualitative
characteristics of elements of a population or sample are considered, for statistical analysis we can assign the
values 0 and 1 to the characteristics of interest. For example, if we want to determine what proportion of a
group of individuals is female, we may assign the value 1 to “female” and 0 to “male”. Suppose a group of 10
people consists of 6 males and 4 females. To determine the proportion of females in this group all we have to
do is divide 4 by 10, which shows that 0.4 (or 40%) of the group is female.
Another way of viewing the concept of proportion is to think of it as the mean of a set of binary data. Since
the mean is determined as the sum of the values of data points divided by the number of data points, when
you add the binary data all you are doing is adding the 1’s. If in a set with 10 elements 6 are 0’s and 4 are 1’s,
then the sum of all data points is equal to 4. Divide that sum by 10, and you have 0.4.
\[ \pi = \frac{\sum x}{N} = \frac{0 + 0 + 1 + 0 + 1 + 0 + 0 + 0 + 1 + 1}{10} = \frac{4}{10} = 0.4 \]
Thus, proportion is simply the mean of a binary data set.
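A short Python sketch illustrates the point with the ten binary values used above (1 = female, 0 = male); the variable names are assumptions.

```python
group = [0, 0, 1, 0, 1, 0, 0, 0, 1, 1]   # six 0's and four 1's

count_of_ones = sum(group)               # adding binary data counts the 1's
proportion = count_of_ones / len(group)  # the mean of the binary data

print(count_of_ones, proportion)         # 4 0.4
```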
2.2. Measures of Dispersion or Data Variability
In addition to measures of central tendency, measures of dispersion or data variability as summary measures
provide important and useful information about the characteristics of a data set.
2.2.1. Variance
The variance of a data set is a summary measure of the dispersion or scatter of the observations from the
mean. To compute the variance you must find the mean of the squared deviations of the data points from the
mean of the data.
2.2.1.1. Variance of Population Data
The population variance is denoted by σ² (lowercase Greek letter sigma, squared). Variance is the average
value of the squared deviations of observations from the mean of the data set. To compute the variance of a
population data set, first you must find the mean μ, then determine the sum of squared deviations of the
observations from the mean, as follows:
Deviation from the mean = xᵢ − μ
Squared deviation = (xᵢ − μ)²
Sum of the squared deviations = ∑(xᵢ − μ)²
Variance is the average of the squared deviations. Therefore, divide the sum of squared deviations by N:
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
Example 5
Find the variance of the following data set:
34 55 46 38 42
The following worksheet shows the computations:
Table 1.9
xᵢ     Deviation xᵢ − μ    Squared deviation (xᵢ − μ)²
34     −9                  81
55     12                  144
46     3                   9
38     −5                  25
42     −1                  1
μ = 43                     ∑(xᵢ − μ)² = 260
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} = \frac{260}{5} = 52 \]
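The worksheet in Table 1.9 can be mirrored in Python. This is a sketch under the assumption that a hand-written function (here called `population_variance`) is wanted to show each step; the standard library's `statistics.pvariance` is used only as a cross-check.

```python
from statistics import pvariance

def population_variance(data):
    """Average of the squared deviations from the population mean."""
    mu = sum(data) / len(data)
    squared_deviations = [(x - mu) ** 2 for x in data]
    return sum(squared_deviations) / len(data)

data = [34, 55, 46, 38, 42]          # data from Example 5
sigma2 = population_variance(data)   # 260 / 5
print(sigma2)                        # 52.0

assert sigma2 == pvariance(data)     # standard library agrees
```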
2.2.1.2. Sample Variance
The variance of a sample data set is not only denoted by a different symbol, s², but is also obtained using a
different formula. To find the average squared deviation or “mean square”, divide the sum of squared
deviations, ∑(x − x̄)², by n − 1. The quantity n − 1 is called the degrees of freedom. This concept
will be explained later within the context of inferential statistics.
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Example 6
The following is the commuting time to the IUPUI campus (in minutes) for a random sample of 5 E270
students. Find the sample variance.
xᵢ     Deviation xᵢ − x̄    Squared deviation (xᵢ − x̄)²
35     14                  196
20     −1                  1
25     4                   16
10     −11                 121
15     −6                  36
x̄ = 21                     ∑(xᵢ − x̄)² = 370
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{370}{4} = 92.5 \]
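The same steps for sample data, dividing by n − 1 instead of N, can be sketched in Python. The function name `sample_variance` is an assumption; `statistics.variance` from the standard library serves as a cross-check.

```python
from statistics import variance

def sample_variance(sample):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(sample)
    x_bar = sum(sample) / n
    return sum((x - x_bar) ** 2 for x in sample) / (n - 1)

commute_minutes = [35, 20, 25, 10, 15]     # data from Example 6
s2 = sample_variance(commute_minutes)      # 370 / 4
print(s2)                                  # 92.5

assert s2 == variance(commute_minutes)     # standard library agrees
```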
2.2.1.3. Computational (Simplified) Formula to Find the Population Variance
We can rearrange the variance formula to obtain a simpler process to compute the variance. Of course, if you
have access to computer software such as Excel there is no need to use this formula. Nevertheless, the
computation of the sum of squared deviations in the numerator of the variance formula is simplified using the
following rearrangement of the sum of squared deviations formula.⁴
\[ \sum (x_i - \mu)^2 = \sum x_i^2 - N\mu^2 \]
The computational formula for the population variance is then,
\[ \sigma^2 = \frac{\sum x_i^2 - N\mu^2}{N} \]
Example 7
Compute the variance from the data from Table 1.9 using the computational formula:
x      x²
34     1156
55     3025
46     2116
38     1444
42     1764
μ = 43     ∑x² = 9505
\[ \sigma^2 = \frac{\sum x_i^2 - N\mu^2}{N} = \frac{9505 - 5 \times 43^2}{5} = 52 \]
In Excel, the variance of a population data set is obtained by the function: =VAR.P(data range)
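Similarly, the computational formula for the numerator of the sample variance is:
\[ \sum (x_i - \bar{x})^2 = \sum x_i^2 - n\bar{x}^2 \]
The computational formula for the sample variance is then,
\[ s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} \]
⁴ The derivation uses the fact that ∑xᵢ = Nμ:
\[ \sum (x_i - \mu)^2 = \sum (x_i^2 - 2\mu x_i + \mu^2) \]
\[ \sum (x_i - \mu)^2 = \sum x_i^2 - 2\mu \sum x_i + N\mu^2 \]
\[ \sum (x_i - \mu)^2 = \sum x_i^2 - 2N\mu^2 + N\mu^2 \]
\[ \sum (x_i - \mu)^2 = \sum x_i^2 - N\mu^2 \]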
Example 8
Use the sample data from Example 6 to compute the numerator of the sample variance using the
computational formula.
x      x²
35     1225
20     400
25     625
10     100
15     225
x̄ = 21     ∑x² = 2575
\[ s^2 = \frac{\sum x_i^2 - n\bar{x}^2}{n - 1} = \frac{2575 - 5 \times 21^2}{4} = 92.5 \]
In Excel, the variance of a sample data set is found by the function: =VAR.S(data range)
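As a check on the computational formula, the following Python sketch computes the sum of squared deviations both ways for the data of Examples 7 and 8. The function names are assumptions introduced for this illustration.

```python
def ss_definitional(data):
    """Sum of squared deviations from the mean, computed directly."""
    m = sum(data) / len(data)
    return sum((x - m) ** 2 for x in data)

def ss_computational(data):
    """Same quantity via: sum of squares minus n times the squared mean."""
    n = len(data)
    m = sum(data) / n
    return sum(x * x for x in data) - n * m ** 2

population = [34, 55, 46, 38, 42]    # Example 7
sample = [35, 20, 25, 10, 15]        # Example 8

pop_var = ss_computational(population) / len(population)     # divide by N
samp_var = ss_computational(sample) / (len(sample) - 1)      # divide by n - 1
print(pop_var, samp_var)                                     # 52.0 92.5
```

With data that are large in magnitude but close together, the computational form can lose precision in floating-point arithmetic, so the definitional form is usually the safer choice in software.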
2.2.1.4. Variance of Binary Population Data
When the data set is binary (consisting only of 0’s and 1’s) the computation of the variance becomes very
simple, according to the following formula:⁵
\[ \sigma^2 = \pi(1 - \pi) \]
For example, if the population proportion is π = 0.4, then the variance is σ² = 0.4(0.6) = 0.24.
The usefulness of this formula will become apparent later when we discuss the sampling distribution of
proportion and inferences about the population proportion.
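The shortcut can be compared with the general variance formula in Python. The ten binary values below are the group used earlier in the chapter (four 1's, six 0's); the variable names are assumptions.

```python
import math

binary = [0, 0, 1, 0, 1, 0, 0, 0, 1, 1]
N = len(binary)
pi = sum(binary) / N                         # population proportion, 0.4

# General definition: average squared deviation from the mean.
var_definition = sum((x - pi) ** 2 for x in binary) / N

# Shortcut for binary data.
var_shortcut = pi * (1 - pi)

print(round(var_definition, 2), round(var_shortcut, 2))   # 0.24 0.24
assert math.isclose(var_definition, var_shortcut)
```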
2.2.2. Standard Deviation
The standard deviation is the (positive) square root of the variance. It is a measure of dispersion showing,
roughly, the typical deviation of the data points from the mean of the data. The population standard deviation
formula is:
Population Standard Deviation:
\[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} = \sqrt{\frac{\sum x_i^2 - N\mu^2}{N}} \]
And the sample standard deviation is:
Sample Standard Deviation:
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}} \]
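⁵ The proof is as follows. Using the computational formula for the population variance and replacing μ with π for
proportion, we have,
\[ \sigma^2 = \frac{\sum x_i^2 - N\pi^2}{N} \]
\[ \sigma^2 = \frac{\sum x_i - N\pi^2}{N} \quad \text{(for binary data } \textstyle\sum x_i^2 = \sum x_i \text{)} \]
\[ \sigma^2 = \frac{\sum x_i}{N} - \pi^2 \]
\[ \sigma^2 = \pi - \pi^2 = \pi(1 - \pi) \]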
From the above two examples, where σ² = 52 and s² = 92.5, the standard deviations are, respectively:⁶
σ = √52 = 7.21
s = √92.5 = 9.618
In Excel,
the standard deviation of the population data set (σ) is obtained by: =STDEV.P(data range)
the standard deviation of the sample data set (s) is obtained by: =STDEV.S(data range)
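Python's standard library offers counterparts to these two Excel functions. The sketch below applies them to the data of Examples 5 and 6; the variable names are assumptions.

```python
from statistics import pstdev, stdev

population = [34, 55, 46, 38, 42]    # Example 5, variance 52
sample = [35, 20, 25, 10, 15]        # Example 6, variance 92.5

sigma = pstdev(population)   # population standard deviation, like STDEV.P
s = stdev(sample)            # sample standard deviation, like STDEV.S

print(round(sigma, 2), round(s, 3))  # 7.21 9.618
```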
2.3. The z-score
Using the mean and the standard deviation of a data set we can express the data values in terms of their
distance from the mean measured in units of the standard deviation. The deviation of each data point from
the mean is 𝑥𝑖 − 𝜇. If you divide the deviation by 𝜎, then the distance is measured relative to, or in units of,
the standard deviation. Through this process we "standardize" the data points; we transform the variable 𝑥
into the standardized variable z. The conversion formula is:
\[ z = \frac{x_i - \mu}{\sigma} \]
Using the appropriate symbols, the same conversion formula applies to sample data,
\[ z = \frac{x_i - \bar{x}}{s} \]
Example 9
For the following population data set, find the mean and the standard deviation and then find the z-score for
each data point. That is, transform the x variable into a z variable.
50 26 37 15 22
μ = ∑x⁄N = 30
σ = √(∑(x − μ)²⁄N) = 12.28
The standardized values are determined as follows:
xᵢ     xᵢ − μ    zᵢ = (xᵢ − μ)⁄σ
50     20        1.63
26     −4        −0.33
37     7         0.57
15     −15       −1.22
22     −8        −0.65
       ∑(xᵢ − μ) = 0    ∑zᵢ = 0.00
Note that since the sum of all deviations from the mean equals zero, the mean of all z-scores must be
zero. Also, the variance and the standard deviation of the z-scores are both equal to one.⁷
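The standardization in Example 9 can be sketched in Python, including a numerical check that the z-scores have mean 0 and variance 1. The variable names are assumptions; the functions come from the standard `statistics` module.

```python
from statistics import mean, pstdev, pvariance

data = [50, 26, 37, 15, 22]          # population data from Example 9

mu = mean(data)                      # 30
sigma = pstdev(data)                 # about 12.28

# Standardize: deviation from the mean in units of the standard deviation.
z = [(x - mu) / sigma for x in data]

print([round(v, 2) for v in z])      # [1.63, -0.33, 0.57, -1.22, -0.65]
print(round(abs(mean(z)), 6), round(pvariance(z), 6))   # 0.0 1.0
```

For sample data, the same sketch would use the sample mean and `statistics.stdev` in place of `pstdev`.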
⁶ Note that σ² and σ are summary characteristics of the population data set. Therefore, each is a population parameter.
Similarly, s² and s are summary characteristics computed from sample data. Therefore, each is a sample statistic.
⁷ For the z-scores in Example 9:
z        (z − μ_z)² *
1.63     2.6525
−0.33    0.1061
0.57     0.3249
−1.22    1.4920
−0.65    0.4244
∑z = 0.00    ∑(z − μ_z)² = ∑z² = 5.0000
\[ \mu_z = \frac{\sum z}{N} = 0.00 \]
\[ \sigma_z^2 = \frac{\sum z^2}{N} = \frac{5}{5} = 1 \]
* Note that the z values are formatted as rounded to two decimal places. The squared
values in the second column are, therefore, not exactly the squares of the rounded values
in the first column.
In general:
\[ \mu_z = \frac{\sum \frac{x - \mu}{\sigma}}{N} = \frac{\sum (x - \mu)}{N\sigma} = \frac{0}{N\sigma} = 0 \]
\[ \sigma_z^2 = \frac{\sum (z - \mu_z)^2}{N} \]
\[ \sigma_z^2 = \frac{1}{N} \sum z^2 \]
\[ \sigma_z^2 = \frac{1}{N} \sum \frac{(x - \mu)^2}{\sigma^2} \]
\[ \sigma_z^2 = \frac{1}{N\sigma^2} \sum (x - \mu)^2 = \frac{N\sigma^2}{N\sigma^2} = 1 \]