Numerical Measures

In this handout we will develop numerical measures which will help us describe a
data set. We begin with some definitions of great importance for understanding
inferential statistics.
A population is the entire body of data from which a sample may be
drawn.
A sample is a specific subset of a population.
A statistic is a numerical measure which is computed from a sample
of data.
A parameter is a numerical measure of a population. Parameters are
usually represented with Greek letters.
[Figure: a population, described by parameters, and a sample drawn from it, described by statistics.]
As a little foreshadowing, you should know that questions of interest are typically
questions about parameters. Inferential statistics, which we will talk about in great detail,
is the study of using statistics to answer questions about parameters.
Numerical Measures of Populations—Parameters
Measures of Central Tendency
(1) The mean or average: Let xi be the ith data point in a population, i = 1, …, N. Then

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
A mean is also the number which minimizes the sum of squared distances to each of the
population values.
Example:
Consider a population made up of the five values: 3, 4, 1, 7, 5. Then

$$\mu = \frac{3 + 4 + 1 + 7 + 5}{5} = 4$$
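As an optional check, the mean of this population and its minimizing property can be sketched in Python. This is a minimal illustration, not part of the handout's method: the helper name `sum_squared_distances` and the list of candidate points are assumptions chosen only for the demonstration.

```python
from statistics import mean

population = [3, 4, 1, 7, 5]


def sum_squared_distances(values, point):
    """Sum of squared distances from each value to `point`."""
    return sum((x - point) ** 2 for x in values)


mu = mean(population)  # (3 + 4 + 1 + 7 + 5) / 5

# Compare the mean against a few nearby candidate points (arbitrary choices).
candidates = [mu - 1, mu - 0.5, mu, mu + 0.5, mu + 1]
for c in candidates:
    print(c, sum_squared_distances(population, c))
```

Among the printed candidates, the sum of squared distances should be smallest at the mean itself.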
(2) A median is the middle value of a population where the values have been ordered
from smallest to largest.
What is the median for the population above?
What should we do if our ordered population looks like: 2, 3, 4, 5, 6, 7? That is, what if
there is no unique middle value?
A median minimizes the sum of the absolute distances to the data points. As a result, the
median is not as readily influenced by outliers as is the case with the mean.
Example: Two populations
{1, 2, 3, 4, 5}: μ = ____, M = ____
{1, 2, 3, 4, 5000}: μ = ____, M = ____
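After working the example by hand, the comparison can be checked with a short Python sketch. It uses the standard library's `statistics` module; the variable names `population_a` and `population_b` are labels assumed here for the two populations.

```python
from statistics import mean, median

population_a = [1, 2, 3, 4, 5]
population_b = [1, 2, 3, 4, 5000]  # same as A except for one extreme value

for name, values in (("A", population_a), ("B", population_b)):
    # Print the mean and the median side by side for each population.
    print(name, "mean =", mean(values), "median =", median(values))
```

Comparing the two printed lines shows how differently the two measures respond to the single outlier.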
Homework:
For the population {1, 2, 3, 4, 5, 6}, compute (1) the sum of squared distances and (2) the
sum of absolute distances from each population value to the number (a) 4.0 and (b) 5.0.
To illustrate, to compute the sum of squared distances of each population value to 3.5 (μ),
Sum-of-squared-distances
= (1 – 3.5)^2 + (2 – 3.5)^2 + (3 – 3.5)^2 + (4 – 3.5)^2
+ (5 – 3.5)^2 + (6 – 3.5)^2
= 17.5
Sum-of-absolute-distance
= |1 – 3.5| + |2 – 3.5| + |3 – 3.5| + |4 – 3.5|
+ |5 – 3.5| + |6 – 3.5|
= 9.0
Now do the computations for 4.0 and 5.0. You may use Excel if you wish.
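For readers who prefer a script to a spreadsheet, the illustration for 3.5 can be reproduced with a minimal Python sketch. The function names are assumptions introduced here; changing `point` gives the values asked for in the homework.

```python
population = [1, 2, 3, 4, 5, 6]


def sum_squared_distances(values, point):
    """Sum of squared distances from each value to `point`."""
    return sum((x - point) ** 2 for x in values)


def sum_absolute_distances(values, point):
    """Sum of absolute distances from each value to `point`."""
    return sum(abs(x - point) for x in values)


point = 3.5  # the population mean used in the illustration above
print(sum_squared_distances(population, point))   # 17.5
print(sum_absolute_distances(population, point))  # 9.0
```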
(3) A mode is the most frequently occurring value in a population.
Example: What is the mode for the population {1, 2, 3, 3, 5}?
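A population can have more than one mode. The sketch below uses a hypothetical population (not one from this handout) and the standard library function `statistics.multimode`, available in Python 3.8 and later, which returns every most frequent value.

```python
from statistics import multimode

# Hypothetical population in which two values tie for most frequent.
population = [1, 1, 2, 3, 3]

modes = multimode(population)
print(modes)  # [1, 3]
```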
Descriptions of populations
[Figure: three example population distributions labeled Unimodal, Bimodal, and Skewed.]
Measures of Variability or Dispersion
While a measure of central tendency is very useful, it does not distinguish
between populations which look considerably different. For example, the mean, median,
and mode of the two populations {1000, 10000, 10000, 19000} and {9000, 10000,
10000, 11000} are exactly the same, but look at their frequency distributions.
[Figure: frequency distributions of the two populations, one panel per population; horizontal axis from 1000 to 19000, vertical axis (frequency) from 0 to 2.]
If these represent the populations of incomes for a summer intern position, which
population has a distribution which is more “equitable”?
Yet both populations have μ = $10,000. The key distinguishing feature between these
populations is the amount of dispersion exhibited by their values. Which population do
you think has the greater dispersion?
(1) The range is the difference between the largest and smallest value in the population.
Range = largest value – smallest value
What is the range for population 1 above? For population 2?
Now consider the following populations:
{1, 1, 1, 1, 5}
{1, 2, 3, 4, 5}
What are their ranges?
It should be obvious, however, that these two populations are dispersed in significantly
different ways. In this sense, a range provides a naïve measure of dispersion because it
takes into account only two values in the population. Which ones?
Note that in some sense the measure does take into account all of the values but in a loose
way. How?
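The range computation for the two populations {1, 1, 1, 1, 5} and {1, 2, 3, 4, 5} can be sketched in Python as follows. The function name `population_range` is an assumption made for this illustration.

```python
def population_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


population_1 = [1, 1, 1, 1, 5]
population_2 = [1, 2, 3, 4, 5]

print(population_range(population_1))
print(population_range(population_2))
```

Because the function looks only at `max` and `min`, rearranging or changing the interior values leaves the result unchanged.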
Another way of developing a measure of dispersion is to measure how far each value is
from some fixed point. For example, we could choose our fixed point to be zero.
Unfortunately, the two populations {-1, 0, 1} and {4, 5, 6} would have different
measures of dispersion although they have (intuitively) the same dispersion. Why?
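[Figure: frequency plots of the two populations {-1, 0, 1} and {4, 5, 6} on a common horizontal axis running from -2 to 7, with frequencies from 0 to 2.]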
Perhaps a better “fixed” point would be in the middle of the values, like the population
mean. With the population {1, 3, 4, 5, 7}, we have μ = 4 and
xi      xi - μ
1       -3
3       -1
4        0
5        1
7        3
Sum      0

Thus

$$\sum_{i=1}^{5} (x_i - \mu) = 0.$$
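In fact:

$$
\begin{aligned}
\sum_{i=1}^{N} (x_i - \mu) &= \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} \mu \\
&= \sum_{i=1}^{N} x_i - N\mu \\
&= \sum_{i=1}^{N} x_i - N \left[ \frac{\sum_{i=1}^{N} x_i}{N} \right] \\
&= \sum_{i=1}^{N} x_i - \sum_{i=1}^{N} x_i \\
&= 0
\end{aligned}
$$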
This will happen for any population. Is this a very good measure of dispersion?
How can we correct for this?
(2) A variance is the average or mean squared distance to the population mean (μ).

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
The standard deviation, σ, is the square root of the variance. For the population
{1, 3, 4, 5, 7}:
xi      xi - μ    (xi - μ)^2
1       -3        9
3       -1        1
4        0        0
5        1        1
7        3        9
Sum      0        20
Then σ^2 = 20/5 = 4, and σ = √4 = 2.
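The same computation can be sketched in Python, step by step and then cross-checked against the standard library's population functions `statistics.pvariance` and `statistics.pstdev`. The intermediate variable names are assumptions chosen to mirror the columns of the table above.

```python
from math import sqrt
from statistics import pstdev, pvariance

population = [1, 3, 4, 5, 7]
N = len(population)
mu = sum(population) / N  # population mean

deviations = [x - mu for x in population]  # the (xi - mu) column
squared = [d ** 2 for d in deviations]     # the (xi - mu)^2 column

variance = sum(squared) / N  # mean squared distance to mu
std_dev = sqrt(variance)

print(sum(deviations))     # deviations from the mean sum to zero
print(variance, std_dev)   # 4.0 2.0

# Cross-check with the standard library's population formulas.
print(pvariance(population), pstdev(population))
```

Note that `pvariance` and `pstdev` divide by N, matching the population formulas used here.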