Handout 2 Measures of Central Tendency and Variation

Data Setting: A graduate student in Psychology was asked to grade 40 final exams, selected at random from several large sections of an introductory course. The resulting scores are found below.

77   84   91   50   75   68   92   74   81   92
86   96   61   37   80   84   83   52   60   75
95   62   83   85   78   98   83   73  100   71
87   81   85   79   64   71   85   78   81   65

The professor for the class asked her how the students performed.

Unfortunately, the graduate student found that looking at the list of scores was about as informative as looking at a scrambled set of letters.

To get information out of the data, she needed to summarize the data.

So, how should she go about summarizing the scores on the test?

Measures of Central Tendency

Measures of central tendency, or more simply measures of center, indicate where the center or most typical value of a data set lies.

Three common measures of central tendency are the arithmetic mean (mean), median, and mode.

The mean is the sum of the observations divided by the number of observations.

We use \(\bar{x}\) to denote the sample mean and \(\mu\) to denote the population mean.

Example The following are salaries for four randomly selected employees at Microsoft.

$30,000   $35,000   $32,500   $36,000

\[ \bar{x} = \frac{30{,}000 + 35{,}000 + 32{,}500 + 36{,}000}{4} = \$33{,}375 \]
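As a quick check of the arithmetic, the mean of the four salaries can be computed in a few lines of Python. This is only an illustrative sketch; the variable names are chosen here and are not part of the handout.

```python
# Salaries from the example above (in dollars).
salaries = [30_000, 35_000, 32_500, 36_000]

# The mean is the sum of the observations divided by the number of observations.
mean_salary = sum(salaries) / len(salaries)

print(mean_salary)  # 33375.0
```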

The median of a set of observations is defined to be the middle value when the observations are arranged from lowest to highest.

If the number of observations is odd, then the median is the observation exactly in the middle of the data set.

If the number of observations is even, then the median is the mean of the two middle observations in the ordered list.

Example Find the median for the sample of four salaries from Microsoft.

First put the salaries in order:

$30,000   $32,500   $35,000   $36,000

\[ \text{Median} = \frac{32{,}500 + 35{,}000}{2} = \$33{,}750 \]
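The odd and even cases of the definition can be sketched as a small Python function. The function name `median` is chosen here for illustration; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median(values):
    """Return the middle value of the sorted data.

    For an even number of observations, return the mean of the two middle values.
    Assumes `values` is not empty.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the observation exactly in the middle.
        return ordered[mid]
    # Even count: the mean of the two middle observations.
    return (ordered[mid - 1] + ordered[mid]) / 2


salaries = [30_000, 35_000, 32_500, 36_000]
print(median(salaries))  # 33750.0
```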

The mode is the value that occurs most frequently in a data set.

There can be more than one mode.

When no value is repeated, we say there is no mode.

Example Find the mode for the following test scores:

84   89   82   91   86   84

The mode is 84.
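The three statements about the mode (most frequent value, possibly more than one, none when nothing repeats) can be sketched in Python as follows. The function name `modes` is hypothetical, and the list of scores assumes the example consists of the six values 84, 89, 82, 91, 86, 84.

```python
from collections import Counter


def modes(values):
    """Return every value tied for the highest frequency.

    Returns an empty list when the data are empty or no value is repeated,
    matching the handout's convention that such data have no mode.
    """
    if not values:
        return []
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # no value is repeated, so there is no mode
    return [value for value, count in counts.items() if count == highest]


scores = [84, 89, 82, 91, 86, 84]
print(modes(scores))  # [84]
```

Returning a list rather than a single number keeps the multi-mode case visible: `modes([1, 1, 2, 2, 3])` gives both 1 and 2.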

Which measure of central tendency do we use for a given data set?

For qualitative data, the mode is the only one of the three that makes sense.

For quantitative data, mean or median is often preferred over the mode as a measure of center because the value that occurs most frequently may not necessarily be located near the center of the data set.

If there are outliers in the data, median is preferred over the mean as the mean can be misleading when outliers are present.

If there are no outliers, the mean is often the preferred choice because it has some nice properties, which will be important in making inferences later in this course.
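To see why the median is preferred when outliers are present, the following sketch adds one very large value to the four salaries from the earlier example. The $1,000,000 salary is a hypothetical value introduced only for illustration; it is not part of the handout's data.

```python
from statistics import mean, median

salaries = [30_000, 35_000, 32_500, 36_000]

# Hypothetical outlier added for illustration; not part of the handout's data.
with_outlier = salaries + [1_000_000]

print(mean(salaries), median(salaries))
print(mean(with_outlier), median(with_outlier))
```

With the outlier included, the mean rises from $33,375 to $226,700, a value that describes none of the five employees well, while the median moves only from $33,750 to $35,000.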

Measures of Variability

Measures of central tendency provide only a partial description of a quantitative data set. The description is incomplete without a measure of the variability, or spread, of the data set.

For example, consider the following three data sets, which contain test scores for students in three different classes.

Class   Test Scores                    Mean   Median
C1      50   60   70   80   90   100   75     75
C2      50   73   74   76   77   100   75     75
C3      70   73   74   76   77    80   75     75

The three data sets all have the same center; they differ in how much the scores vary.

Three common measures of variability are the range, variance, and standard deviation.

The range of a data set is equal to the largest measurement minus the smallest measurement.

Example Calculate the range for class 1 scores.

Range = 100 – 50 = 50
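The same calculation can be repeated for all three classes. The sketch below assumes the class scores are as read from the table of test scores above (C1: 50 to 100 in steps of 10; C2: 50, 73, 74, 76, 77, 100; C3: 70, 73, 74, 76, 77, 80); the names `classes` and `data_range` are chosen here for illustration.

```python
# Test scores for the three classes, as read from the table above.
classes = {
    "C1": [50, 60, 70, 80, 90, 100],
    "C2": [50, 73, 74, 76, 77, 100],
    "C3": [70, 73, 74, 76, 77, 80],
}


def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)


ranges = {name: data_range(scores) for name, scores in classes.items()}
print(ranges)  # {'C1': 50, 'C2': 50, 'C3': 10}
```

Classes 1 and 2 have the same range even though most scores in class 2 sit close to 75, because the range depends only on the two most extreme values.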

The sample variance for a sample of n measurements is equal to the sum of the squared distances from the mean divided by (n - 1). In symbols, using \(s^2\) to represent the sample variance,

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]

In calculating \(s^2\), we divide by (n - 1) instead of n to get a better estimate of the population variance \(\sigma^2\).

Using n in the formula for \(s^2\), we tend to underestimate \(\sigma^2\).
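One way to see the effect of the divisor is to list every possible sample from a very small population and average the two versions of the estimate. The sketch below is an illustration under stated assumptions: the population {1, 2, 3, 4} is hypothetical, samples of size 2 are drawn with replacement, and exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

# Hypothetical tiny population, chosen only for illustration.
population = [1, 2, 3, 4]
N = len(population)
mu = Fraction(sum(population), N)

# Population variance: average squared distance from the population mean.
sigma_sq = sum((x - mu) ** 2 for x in population) / N


def sum_sq_dev(sample):
    """Sum of squared distances from the sample mean."""
    xbar = Fraction(sum(sample), len(sample))
    return sum((x - xbar) ** 2 for x in sample)


n = 2
# Every equally likely ordered sample of size n, drawn with replacement.
samples = list(product(population, repeat=n))

# Average of the estimate over all samples, for each choice of divisor.
avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print(sigma_sq, avg_div_n_minus_1, avg_div_n)  # 5/4 5/4 5/8
```

In this small case, dividing by (n - 1) gives estimates that average exactly to the population variance, while dividing by n gives estimates that average to half of it.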

The sample standard deviation, s, is defined as the positive square root of the sample variance, \(s^2\).

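Example Calculate \(s^2\) and s for the class 1 scores.

\(x\)     \(x - \bar{x}\)     \((x - \bar{x})^2\)
50        -25                 625
60        -15                 225
70         -5                  25
80          5                  25
90         15                 225
100        25                 625

\[ \sum (x - \bar{x}) = 0 \qquad \sum (x - \bar{x})^2 = 1{,}750 \]

\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{1750}{5} = 350 \]

\[ s = \sqrt{350} \approx 18.708 \]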
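The steps of this worked example can be reproduced in Python. This is a minimal sketch with variable names chosen here; the standard library functions `statistics.variance` and `statistics.stdev` also divide by (n - 1) and could be used instead.

```python
import math

scores = [50, 60, 70, 80, 90, 100]  # class 1 scores
n = len(scores)
xbar = sum(scores) / n  # sample mean

# Distance of each score from the mean, then the squared distances.
deviations = [x - xbar for x in scores]
squared = [d ** 2 for d in deviations]

s2 = sum(squared) / (n - 1)  # sample variance
s = math.sqrt(s2)            # sample standard deviation

print(sum(deviations))  # 0.0
print(sum(squared))     # 1750.0
print(s2)               # 350.0
print(round(s, 3))      # 18.708
```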

Variance and standard deviation are useful for comparing variability in two data sets. The data set with the larger variance, or standard deviation, exhibits more variation in the data.

Standard deviation has the advantage over variance in that it provides a measure of variability in the same units as the original data.

Thus, the standard deviation for a set of incomes would be in dollars, whereas the variance would be in dollars squared (dollars\(^2\)).

Standard deviation can be used in conjunction with the mean to describe the variability in a single data set.

Empirical Rule: For a distribution that is approximately bell-shaped, roughly:

68% of the observations fall within one standard deviation of the mean.

95% of the observations fall within two standard deviations of the mean.

99.7% of the observations fall within three standard deviations of the mean.
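As an illustration, the rule can be compared against the 40 exam scores from the data setting at the start of this handout. The sketch below only counts how many scores fall within one, two, and three standard deviations of the mean; the helper name `share_within` is hypothetical. Whether the resulting proportions are close to 68%, 95%, and 99.7% depends on how bell-shaped the scores actually are, which the handout does not state.

```python
from statistics import mean, stdev

# The 40 exam scores from the data setting at the start of the handout.
scores = [
    77, 84, 91, 50, 75, 68, 92, 74, 81, 92,
    86, 96, 61, 37, 80, 84, 83, 52, 60, 75,
    95, 62, 83, 85, 78, 98, 83, 73, 100, 71,
    87, 81, 85, 79, 64, 71, 85, 78, 81, 65,
]

xbar = mean(scores)
s = stdev(scores)  # sample standard deviation (divides by n - 1)


def share_within(k):
    """Proportion of scores within k standard deviations of the mean."""
    low, high = xbar - k * s, xbar + k * s
    return sum(low <= x <= high for x in scores) / len(scores)


for k in (1, 2, 3):
    print(k, share_within(k))
```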
