Central Tendency - Kenneth Benoit's Home Page

Central Tendency
MSc Module 6: Introduction to Quantitative Research Methods
Kenneth Benoit
February 3, 2010
Definition of Central Tendency
Central tendency refers to a single number that characterizes the typical unit in a set of data
Several measures exist:
  average or mean, with different versions
  median
  mode
Which measure of central tendency we choose is determined by the nature of the data and by the characteristic of the data that is of primary interest
The mode
The mode is the most frequently occurring value in a distribution
Example:
  Consider the set X = { 1, 2, 3, 1, 1, 5, 5, 4, 1, 4, 4, 3 }
  The most frequently occurring value is 1, which appears four times
  So the mode is 1
  Note: the mode is the value that occurs most often, not the number of times it occurs
The mode in R
> x <- c(1, 2, 3, 1, 1, 5, 5, 4, 1, 4, 4, 3)
> mode(x)     # this means something very different in R!
[1] "numeric"
> table(x)    # do a frequency distribution
x
1 2 3 4 5 
4 1 2 3 2 
> # see which values have the maximum frequency
> which(table(x) == max(table(x)))
1 
1 
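Because base R's mode() reports the storage type rather than the statistical mode, a small helper can wrap the table() approach shown above. The following is only a minimal sketch: the name stat_mode is hypothetical (it is not part of base R), and it returns every value tied for the highest frequency.

```r
# Hypothetical helper (not part of base R): return the most frequent value(s)
stat_mode <- function(x) {
  freq <- table(x)                           # frequency distribution
  counts <- as.vector(freq)                  # plain vector of counts
  top <- names(freq)[counts == max(counts)]  # labels of the most frequent values
  # table() stores the values as character labels; convert back for numeric input
  if (is.numeric(x)) as.numeric(top) else top
}

x <- c(1, 2, 3, 1, 1, 5, 5, 4, 1, 4, 4, 3)
stat_mode(x)   # 1
```

When several values share the highest frequency, the helper returns all of them rather than picking one.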
Multiple modes
Some distributions can have more than one mode
Example: X = { 6, 6, 7, 2, 6, 1, 2, 3, 2, 4 } is bimodal
We can say a distribution is bimodal even if one mode is smaller than the other
> x <- c(6, 6, 7, 2, 6, 1, 2, 3, 2, 4)
> which(table(x) == max(table(x)))   # top row: modal values; bottom row: their positions in the table
2 6 
2 5 
The Median
When ordinal or interval data are sorted, the middlemost point in the distribution is called the median
In other words, the median divides a distribution into two parts of equal size
OK for ordinal and stronger data
The position of the median can be computed as $(N + 1)/2$
  If N is odd, then the median is the center case
  If N is even, then the median is the point above which 50% of the cases fall and below which 50% of the cases fall, generally taken to be the midpoint between the pair of center cases
  Example: (1, 5, 6, 7, 7, 8)
  Middle pair is 6, 7 so median is 6.5
In R this is easy, using the median() function
> x <- c(1, 5, 6, 7, 7, 8)
> median(x)
[1] 6.5
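The position rule can also be written out by hand, which makes the odd and even cases explicit. This is a sketch for illustration only; median_by_position is a hypothetical name, and in practice median() is the function to use.

```r
# Hypothetical helper: median from the (N + 1)/2 position rule
median_by_position <- function(x) {
  s <- sort(x)                  # the rule applies to sorted data
  pos <- (length(s) + 1) / 2    # position of the median
  # For odd N, pos is a whole number, so floor and ceiling coincide;
  # for even N, pos ends in .5 and the two center cases are averaged.
  mean(s[c(floor(pos), ceiling(pos))])
}

x <- c(1, 5, 6, 7, 7, 8)
median_by_position(x)   # 6.5, the same as median(x)
```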
Medians in R
With an odd number of cases:
> x <- c(4, 8, 15, 16, 23, 42, 45, 2, 7, 33, 4, 46, 997)
> sort(x)
[1]   2   4   4   7   8  15  16  23  33  42  45  46 997
> length(x)             # length of vector is N
[1] 13
> length(x) %% 2        # modulo: is there a remainder?
[1] 1
> (length(x)+1)/2       # the position of the median by our formula
[1] 7
> length(x) %% 2 == 0   # FALSE: there is a remainder, so N is odd
[1] FALSE
> median(x)             # use R to compute median
[1] 16
Medians in R
With an even number of cases:
> x <- c(x, 998)        # add one more number to our test vector
> x                     # look at the vector
[1]   4   8  15  16  23  42  45   2   7  33   4  46 997 998
> sort(x)               # look at the vector sorted
[1]   2   4   4   7   8  15  16  23  33  42  45  46 997 998
> length(x) %% 2        # remainder is 0, so N is even
[1] 0
> (length(x)+1)/2       # position of new median value: halfway between cases 7 and 8
[1] 7.5
> length(x) %% 2 == 0   # should be TRUE if even
[1] TRUE
> median(x)             # midpoint of 16 and 23
[1] 19.5
The Mean
Also known as the average
By far the most common measure of central tendency
Most common is arithmetic mean: the sum of all values, divided by the number of cases
$\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$
  $\bar{X}$ is the mean, read as “X bar”
  $\sum_{i=1}^{N}$ is the sum over each value i (Greek capital sigma)
  $X_i$ is each value of X
  N is the total number of values in our set
Inappropriate for non-interval data
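The formula translates directly into R. The values below are chosen purely for illustration; the point of the sketch is that the sum divided by N matches the built-in mean().

```r
x <- c(9, 8, 6, 5, 2)      # illustrative values
N <- length(x)             # total number of values
xbar <- sum(x) / N         # sum of all values divided by the number of cases
xbar                       # 6
all.equal(xbar, mean(x))   # TRUE: the formula agrees with mean()
```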
Interpreting the Mean
Unlike the mode, the mean is not the score that occurs most often; indeed, the mean may not occur at all in a dataset
Unlike the median, the mean does not necessarily represent the middlemost point in a distribution
Levin and Fox analogy: the mean is a “center of gravity”, the point in a distribution around which scores above the mean balance scores below it
Interpreting the Mean as Deviations
A deviation is the distance and direction of any value $X_i$ from the mean $\bar{X}$
Computation is simply Deviation = $X_i - \bar{X}$
In R:
> x <- c(9,8,6,5,2)
> xbar <- mean(x)
> xdev <- x - xbar
> cbind(x,xdev)
     x xdev
[1,] 9    3
[2,] 8    2
[3,] 6    0
[4,] 5   -1
[5,] 2   -4
> xbar
[1] 6
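The "center of gravity" reading of the mean can be checked with the same deviations: the distances above the mean exactly offset the distances below it, so the deviations sum to zero. A brief sketch using the vector from the example:

```r
x <- c(9, 8, 6, 5, 2)
xdev <- x - mean(x)          # deviations from the mean
sum(xdev)                    # 0: positive and negative deviations cancel
sum(xdev[xdev > 0])          # 5: total distance above the mean
sum(abs(xdev[xdev < 0]))     # 5: total distance below the mean
```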
Comparing the mode, median, and mean
Levels of measurement
  mode applies to any level (although it may not be very useful at the interval level)
  median only applies to ordinal or higher data
  mean only applies to interval data
Shape of the distribution
  in a perfectly symmetrical, unimodal distribution, mode = median = mean
  not true when distribution is non-symmetrical
Mean is sensitive to extreme values
Example:
> a <- c(5,6,6,7,8,9,10,10)
> b <- c(5,6,6,7,8,9,10,95)
> median(a)
[1] 7.5
> median(b)
[1] 7.5
> mean(a)
[1] 7.625
> mean(b)
[1] 18.25
This is why the median is almost always used to characterize income distributions
Mean and median both problematic when distributions are
multi-modal
> load("dail2002.Rdata")
> attach(dail2002)
> summary(spend_total)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    5803   14610   14210   20740   51970 
> mean(spend_total)
[1] 14213.24
> plot(density(spend_total))
> abline(v=mean(spend_total), col="red")
> abline(v=median(spend_total), col="blue", lty="dashed")
[Figure: density.default(x = spend_total), a kernel density plot of spend_total with Density on the vertical axis; N = 464, Bandwidth = 2593; vertical lines mark the mean (red) and the median (blue, dashed)]
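The same problem can be seen in a tiny made-up example. The data below are hypothetical (they are not taken from dail2002) and are constructed with two well-separated clusters, so that both the mean and the median land in a region containing no observations.

```r
# Hypothetical bimodal data: one cluster near 2, another near 18
x <- c(1, 1, 2, 2, 2, 3, 3, 17, 17, 18, 18, 18, 19, 19)
mean(x)      # 10
median(x)    # 10
table(x)     # no observation lies anywhere near 10
```

Here the two modes (2 and 18) describe the typical units far better than either summary of the center.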
Different forms of means: Geometric
Arithmetic mean is the most common: $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$
Geometric mean may be better when data is from a process that changes multiplicatively
Definition: The geometric mean is the Nth root of the product of the data
Equation: $\hat{X} = \sqrt[N]{\prod_{i=1}^{N} X_i}$
Example: $\hat{X}$ for (10, 1, 1000, 1, 10) is
> prod(10, 1, 1000, 1, 10)
[1] 1e+05
> prod(10, 1, 1000, 1, 10)^(1/5)
[1] 10
> exp(1/5 * sum(log(c(10, 1, 1000, 1, 10))))
[1] 10
> exp(mean(log(c(10, 1, 1000, 1, 10))))
[1] 10
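To see why the geometric mean suits multiplicative change, consider growth factors compounded over several periods. The factors below are hypothetical and serve only as a sketch: applying the geometric mean in every period reproduces the total growth, whereas the arithmetic mean overstates it.

```r
# Hypothetical yearly growth factors: +10%, -10%, +20%
growth <- c(1.10, 0.90, 1.20)

geo_mean <- exp(mean(log(growth)))   # geometric mean via logs
arith_mean <- mean(growth)           # arithmetic mean, for comparison

# Applying the geometric mean every year reproduces the total growth
all.equal(geo_mean^length(growth), prod(growth))   # TRUE
# The arithmetic mean overstates the total growth
arith_mean^length(growth) > prod(growth)           # TRUE
```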
Different forms of means: Harmonic
Harmonic mean may be better when the data are rates, such as speeds
Definition: The reciprocal of the average of the reciprocals
Equation: $\tilde{X} = \frac{1}{\frac{1}{N}\sum_{i=1}^{N}\frac{1}{X_i}} = \frac{N}{\sum_{i=1}^{N}\frac{1}{X_i}}$
Example: $\tilde{X}$ for (1, 2, 4, 1) is
> x <- c(1,2,4,1)
> 1/mean(1/x)
[1] 1.454545
> length(x)/sum(1/x)
[1] 1.454545
Harmonic mean continued
When to use? Crawley’s elephant example:
  An elephant has a square territory that is 2 km on each side
  It walks the first side at 1 km/hr
  It walks the second side at 2 km/hr
  It walks the 3rd side at 4 km/hr
  It walks the 4th side at 1 km/hr
  Question: what is the elephant’s average speed?
  (1 + 2 + 4 + 1)/4 sides = 2 km/hr, but this is wrong
The better way to look at it is total distance / total time
  total distance is 2 km x 4 = 8 km
  time for 1st side, at 1 km/hr, was 2 hrs
  likewise, times for sides 2-4 are 1, 0.5, and 2 hours
  So total time is 5.5 hrs
  This means average speed is 8 km / 5.5 hrs = 1.4545 km/hr
> 1/mean(1/c(1,2,4,1))
[1] 1.454545
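The elephant arithmetic can be reproduced step by step. This sketch uses illustrative variable names (side_km, speed, time_hr) and shows that total distance divided by total time equals the harmonic mean of the speeds when every side has the same length.

```r
side_km <- 2                       # length of each side of the territory
speed <- c(1, 2, 4, 1)             # km/hr on sides 1 to 4

time_hr <- side_km / speed         # hours per side: 2, 1, 0.5, 2
total_distance <- side_km * length(speed)   # 8 km
total_time <- sum(time_hr)                  # 5.5 hours

total_distance / total_time        # average speed in km/hr
1 / mean(1 / speed)                # harmonic mean of the speeds, same value
```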
Weighted mean
A mean with different points contributing differently to the final result, depending on a weight
Usually applies to arithmetic mean, but could also provide a weight for geometric or harmonic means
Define $w_i$ as a normalized weight, where $\sum_{i=1}^{N} w_i = 1$
Then the (arithmetic) weighted mean is $\bar{X} = \sum_{i=1}^{N} w_i X_i$
Very frequently used in determining final grades for a class, for instance, when assignments carry different weights
Also often used in survey datasets to count different observations differently
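The grading case makes a convenient illustration of the formula. The grades and weights below are hypothetical; the sketch computes the weighted mean by hand from normalized weights and compares it with R's weighted.mean().

```r
# Hypothetical grades and weights for three assignments
grade  <- c(60, 70, 80)
weight <- c(0.2, 0.3, 0.5)          # normalized: the weights sum to 1

sum(weight)                          # check the normalization
wmean <- sum(weight * grade)         # weighted mean by the formula
wmean                                # 73
all.equal(wmean, weighted.mean(grade, weight))   # TRUE
mean(grade)                          # 70: the unweighted mean, for comparison
```

The weighted result is higher than the unweighted one because the best grade carries the largest weight.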
Weighted mean examples
From the state dataset built in to R:
> data(state)                    # load the built-in state dataset
> state.x77[1:5,1:2]
           Population Income
Alabama          3615   3624
Alaska            365   6315
Arizona          2212   4530
Arkansas         2110   3378
California      21198   5114
> attach(data.frame(state.x77)) # coerce as data frame and attach
> mean(Income)
[1] 4435.8
> weighted.mean(Income, Population) # automatically normalizes weight
[1] 4567.63
> median(Income)
[1] 4519
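To see what "automatically normalizes" means, the population weights can be normalized by hand and fed into the weighted mean formula. This is a sketch that indexes the state.x77 matrix directly instead of using attach(); the names income, population, w, and manual are illustrative.

```r
income <- state.x77[, "Income"]
population <- state.x77[, "Population"]

w <- population / sum(population)   # normalized weights, one per state
sum(w)                              # 1
manual <- sum(w * income)           # weighted mean by the formula
all.equal(manual, weighted.mean(income, population))   # TRUE
```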
Some helpful R commands for central tendency
mean()
weighted.mean()
median()
fivenum()
quantile()
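A brief sketch of the last two functions, reusing the 13-value vector from the median example: fivenum() returns Tukey's five-number summary, and quantile() returns the requested sample quantiles, with the 50% quantile equal to the median.

```r
x <- c(4, 8, 15, 16, 23, 42, 45, 2, 7, 33, 4, 46, 997)

fivenum(x)                                # minimum, lower hinge, median, upper hinge, maximum
quantile(x, probs = c(0.25, 0.5, 0.75))   # quartiles; the 50% quantile is the median
unname(quantile(x, 0.5)) == median(x)     # TRUE
```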
Homework 2 Problem 4
> country <- c("Canada","China","England","Germany","Greece","Other")
> f <- c(5,7,2,5,3,4)
> sum(f)
[1] 26
> barplot(f, names=country)
[Figure: bar plot of the frequencies f, one bar each for Canada, China, England, Germany, Greece, and Other; vertical axis from 0 to 7]