Central Tendency
MSc Module 6: Introduction to Quantitative Research Methods
Kenneth Benoit
February 3, 2010

Definition of Central Tendency

- Central tendency refers to a single number that characterizes the typical unit among a set of data
- Several measures exist:
  - average or mean, with different versions
  - median
  - mode
- Which measure of central tendency we choose is determined by the nature of the data and by the characteristic of the data that is of primary interest

The mode

- The mode is the most frequently occurring value in a distribution
- Example:
  - Consider the set X = { 1, 2, 3, 1, 1, 5, 5, 4, 1, 4, 4, 3 }
  - The most frequently occurring value is 1, which appears four times
  - So the mode is 1
  - Note: the mode is the value of the most frequently occurring number, not its frequency

The mode in R

    > x <- c(1, 2, 3, 1, 1, 5, 5, 4, 1, 4, 4, 3)
    > mode(x)   # this means something very different in R!
    [1] "numeric"
    > table(x)  # do a frequency distribution
    x
    1 2 3 4 5
    4 1 2 3 2
    > # see which values are the maximum values
    > which(table(x)==max(table(x)))
    1
    1

Multiple modes

- Some distributions can have more than one mode
- We can say a distribution is bimodal even if one mode is smaller than the other
- Example: X = { 6, 6, 7, 2, 6, 1, 2, 3, 2, 4 } is bimodal

    > x <- c(6, 6, 7, 2, 6, 1, 2, 3, 2, 4)
    > which(table(x)==max(table(x)))
    2 6
    2 5

The Median

- When ordinal or interval data are sorted, the middlemost point in the distribution is called the median
- In other words, the median divides a distribution into two distributions of equal size
- OK for ordinal and stronger data
- The position of the median can be computed as $\frac{N+1}{2}$
  - If N is odd, then the median is the center case
  - If N is even, then the median is the point above which 50% of the cases fall and below which 50% of the cases fall, generally taken to be the midpoint between the pair of center cases
  - Example: (1, 5, 6, 7, 7, 8). The middle pair is 6, 7, so the median is 6.5
- In R this is easy, using the median()
  function:

    > x <- c(1, 5, 6, 7, 7, 8)
    > median(x)
    [1] 6.5

Medians in R

With an odd number of cases:

    > x <- c(4, 8, 15, 16, 23, 42, 45, 2, 7, 33, 4, 46, 997)
    > sort(x)
     [1]   2   4   4   7   8  15  16  23  33  42  45  46 997
    > length(x)         # length of vector is N
    [1] 13
    > length(x)%%2      # modulus division: is there a remainder?
    [1] 1
    > (length(x)+1)/2   # the position of the median by our formula
    [1] 7
    > length(x)%%2==0   # FALSE: there is a remainder, so N is odd
    [1] FALSE
    > median(x)         # use R to compute median
    [1] 16

Medians in R

With an even number of cases:

    > x <- c(x,998)     # add one more number to our test vector
    > x                 # look at the vector
     [1]   4   8  15  16  23  42  45   2   7  33   4  46 997 998
    > sort(x)           # look at the vector sorted
     [1]   2   4   4   7   8  15  16  23  33  42  45  46 997 998
    > (length(x)+1)/2   # position of new median value, by our formula
    [1] 7.5
    > length(x)%%2
    [1] 0
    > length(x)%%2==0   # should be TRUE if even
    [1] TRUE
    > median(x)
    [1] 19.5

The Mean

- Also known as the average
- By far the most common measure of central tendency
- Most common is the arithmetic mean: the sum of all values, divided by the number of cases

  $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$

  - $\bar{X}$ is the mean, read as “X bar”
  - $\sum_{i=1}^{N}$ is the sum over each value i ($\Sigma$ is the Greek capital sigma)
  - $X_i$ is each value of X
  - N is the total number of values in our set
- Inappropriate for non-interval data

Interpreting the Mean

- Unlike the mode, the mean is not the score that occurs most; indeed, the mean may not occur at all in a dataset
- Unlike the median, a mean does not necessarily represent the middlemost point in a distribution
- Levin and Fox analogy: the mean is a “center of gravity”, the point in a distribution around which values above the mean balance values below it

Interpreting the Mean as Deviations

- A deviation is the distance and direction of any value $X_i$ from the mean $\bar{X}$
- Computation is simply: Deviation = $X_i - \bar{X}$
- In R:

    > x <- c(9,8,6,5,2)
    > xbar <- mean(x)
    > xdev <- x - xbar
    > cbind(x,xdev)
         x xdev
    [1,] 9    3
    [2,] 8    2
    [3,] 6    0
    [4,] 5   -1
    [5,] 2   -4
    > xbar
    [1] 6

Comparing the mode, median, and mean

- Levels of measurement
  - the mode applies to any level (although it may not be especially useful at the interval level)
  - the median applies only to ordinal or higher data
  - the mean applies only to interval data
- Shape of the distribution
  - in a perfectly symmetrical, unimodal distribution, mode = median = mean
  - not true when the distribution is non-symmetrical

Mean is sensitive to extreme values

- Example:

    > a <- c(5,6,6,7,8,9,10,10)
    > b <- c(5,6,6,7,8,9,10,95)
    > median(a)
    [1] 7.5
    > median(b)
    [1] 7.5
    > mean(a)
    [1] 7.625
    > mean(b)
    [1] 18.25

- This is why the median is almost always used to characterize income distributions

Mean and median both problematic when distributions are multi-modal

    > load("dail2002.Rdata")
    > attach(dail2002)
    > summary(spend_total)
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
          0    5803   14610   14210   20740   51970
    > mean(spend_total)
    [1] 14213.24
    > plot(density(spend_total))
    > abline(v=mean(spend_total), col="red")
    > abline(v=median(spend_total), col="blue", lty="dashed")

[Figure: kernel density plot titled density.default(x = spend_total); y-axis: Density; x-axis label: N = 464, Bandwidth = 2593; vertical lines mark the mean (red) and the median (blue, dashed)]

Different forms of means: Geometric

- Arithmetic mean is the most common: $\bar{X} = \frac{\sum_{i=1}^{N} X_i}{N}$
- Geometric mean may be better when the data come from a process that changes multiplicatively
- Definition: the geometric mean is the Nth root of the product of the data
- Equation: $\hat{X} = \sqrt[N]{\prod_{i=1}^{N} X_i}$
- Example: $\hat{X}$ for (10, 1, 1000, 1, 10) is

    > prod(10, 1, 1000, 1, 10)
    [1] 1e+05
    > prod(10, 1, 1000, 1, 10)^(1/5)
    [1] 10
    > exp(1/5 * sum(log(c(10, 1, 1000, 1, 10))))
    [1] 10
    > exp(mean(log(c(10, 1, 1000, 1, 10))))
    [1] 10

Different forms of means: Harmonic

- Harmonic mean may be better when the data are rates, such as speeds over equal distances
- Definition: the reciprocal of the average of the reciprocals
- Equation: $\tilde{X} = \frac{1}{\frac{1}{N}\sum_{i=1}^{N} \frac{1}{X_i}} = \frac{N}{\sum_{i=1}^{N} \frac{1}{X_i}}$
- Example: $\tilde{X}$ for (1, 2, 4, 1) is

    > x <- c(1,2,4,1)
    > 1/mean(1/x)
    [1] 1.454545
    > length(x)/sum(1/x)
    [1] 1.454545
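
The geometric and harmonic means defined above can be wrapped in small helper functions. The following is only a minimal sketch: the names geo_mean and harm_mean are hypothetical (they are not part of base R or of the original slides), and the input checks reflect the assumption that the geometric mean is applied to strictly positive values and the harmonic mean to non-zero values.

```r
# Hypothetical helpers for illustration; not base R functions.

geo_mean <- function(x) {
  # Geometric mean computed on the log scale:
  # exp of the arithmetic mean of log(x).
  # Assumption: all values are strictly positive.
  stopifnot(is.numeric(x), all(x > 0))
  exp(mean(log(x)))
}

harm_mean <- function(x) {
  # Harmonic mean: reciprocal of the arithmetic mean of the reciprocals.
  # Assumption: no value is zero.
  stopifnot(is.numeric(x), all(x != 0))
  1 / mean(1 / x)
}

x <- c(10, 1, 1000, 1, 10)
geo_mean(x)                # agrees with prod(x)^(1/length(x)) up to rounding

y <- c(1, 2, 4, 1)
harm_mean(y)               # same quantity as length(y) / sum(1 / y)
```

Working on the log scale, as in geo_mean, avoids forming the full product, which can overflow for long vectors of large values.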
Harmonic mean continued

- When to use? Crawley’s elephant example:
  - An elephant has a square territory that is 2 km on each side
  - It walks the first side at 1 km/hr
  - It walks the second side at 2 km/hr
  - It walks the 3rd side at 4 km/hr
  - It walks the 4th side at 1 km/hr
  - Question: what is the elephant’s average speed?
  - (1 + 2 + 4 + 1)/4 sides = 2 km/hr, but this is wrong
- The better way to look at it is (total distance / total time)
  - total distance is 2 km x 4 = 8 km
  - time for the 1st side, at 1 km/hr, was 2 hr
  - likewise, the times for sides 2-4 are 1, 0.5, and 2 hours
  - so total time is 5.5 hrs
  - this means the average speed is 8 km / 5.5 hrs = 1.4545 km/hr

    > 1/mean(1/c(1,2,4,1))
    [1] 1.454545

Weighted mean

- A mean with different points contributing differently to the final result, depending on a weight
  - Usually applies to the arithmetic mean, but one could also provide a weight for geometric or harmonic means
- Define $w_i$ as a normalized weight, where $\sum_{i=1}^{N} w_i = 1$
- Then the (arithmetic) weighted mean is $\bar{X} = \sum_{i=1}^{N} w_i X_i$
- Very frequently used in determining final grades for a class, for instance, when assignments carry different weights
- Also often used in survey datasets to count different observations differently

Weighted mean examples

From the state dataset built in to R:

    > data(state)                       # load the built-in state dataset
    > state.x77[1:5,1:2]
               Population Income
    Alabama          3615   3624
    Alaska            365   6315
    Arizona          2212   4530
    Arkansas         2110   3378
    California      21198   5114
    > attach(data.frame(state.x77))     # coerce as data frame and attach
    > mean(Income)
    [1] 4435.8
    > weighted.mean(Income, Population) # automatically normalizes weight
    [1] 4567.63
    > median(Income)
    [1] 4519

Some helpful R commands for central tendency

- mean()
- weighted.mean()
- median()
- fivenum()
- quantile()

Homework 2 Problem 4

    > country <- c("Canada","China","England","Germany","Greece","Other")
    > f <- c(5,7,2,5,3,4)
    > sum(f)
    [1] 26
    > barplot(f, names.arg=country)

[Figure: bar plot of f, one bar each for Canada, China, England, Germany, Greece, and Other; y-axis from 0 to 7]
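
The frequency counts above can also be tied back to the mode: for categorical data, the modal category is the one with the highest count. The following is a small sketch of that idea only, not a homework solution; the variable name modal is an assumption introduced for illustration.

```r
# Frequency data as given in the homework transcript above.
country <- c("Canada", "China", "England", "Germany", "Greece", "Other")
f <- c(5, 7, 2, 5, 3, 4)

# Attach the category labels to the counts.
names(f) <- country

# Modal category, or categories if several tie for the maximum count.
# 'modal' is a hypothetical name used only in this sketch.
modal <- names(f)[f == max(f)]
modal
```

Here the largest count is 7, so the modal category is China. Comparing against max(f) rather than using which.max() keeps all tied categories when a distribution has more than one mode.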