Chap4.stat.doc


CHAPTER FOUR: Variability

In order to determine the degree to which members of a distribution vary amongst themselves, measures of spread or VARIABILITY had to be developed. These measures could, in turn, assess the appropriateness of the measures of central tendency. The mean could be especially misleading when a distribution included outliers or was severely skewed.

Consider this scenario. If everyone who took a class test either failed or got a perfect score, the mean would suggest that class performance was average. Yet no one in that hypothetical class did average work. A measure of variability would indicate extreme spread among scores, and the teacher would be forewarned not to rely upon the mean for a realistic assessment of class progress. This is not unusual when assessing HETEROGENEOUS (diversified) samples. When the sample or population is HOMOGENEOUS (similar in terms of the dependent measure), the measure of spread is small, and the measures of central tendency are more reliable.

Types of Measures of Spread:

The True Range-The simplest way of measuring spread is to determine the span of possible places on the scale of interest. This can be done by simply subtracting the lowest score from the highest. Be sure to add 1 to count the starting point of the scale.

Range =[(Highest Score - Lowest Score) + 1]

If this seems confusing, just subtract the lower limit (LL) of the lowest score (X) from the upper limit (UL) of the highest score, and that will give the correct answer as well.

Range =[UL of Highest Score - LL of Lowest Score]

Remember, you must account for all the spaces occupied on the scale, including all those in between. If the highest number is 9, and the lowest is 0, there are ten spaces included on that scale. Count them: (0, 1, 2, 3, 4, 5, 6, 7, 8, 9 )
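The two range formulas above can be checked against each other with a short sketch. The function names and the `unit` parameter (the measurement unit, assumed here to be 1 for whole-number scores) are illustrative and not part of the chapter:

```python
def true_range(scores):
    """Inclusive range for whole-number scores: (highest - lowest) + 1."""
    return (max(scores) - min(scores)) + 1


def true_range_from_limits(scores, unit=1):
    """Same quantity via real limits: UL of highest minus LL of lowest."""
    half = unit / 2
    upper_limit = max(scores) + half   # UL of the highest score
    lower_limit = min(scores) - half   # LL of the lowest score
    return upper_limit - lower_limit


digits = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
print(true_range(digits))              # 10
print(true_range_from_limits(digits))  # 10.0
```

Both versions count the ten places from 0 through 9, which is why either formula can be used.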

The Inter-quartile Range (IQR)-You recall that you could determine the median by locating the percentile (score) at the fiftieth percentile rank (50th%). The same methods can be used to locate the score (X) at the first quartile (25th%) and the third quartile (75th%). By definition, QUARTILES divide a distribution into quarters, or fourths. The IQR is the range between the 1st and 3rd quartiles (Q). Therefore, simply subtract the score at the 25th% (Q1) from the score at the 75th% (Q3):

IQR = X(75th%) - X(25th%)

The Semi-Interquartile Range (SIQR)-When the data is very skewed or incomplete, the SIQR replaces the IQR. To compute the SIQR, simply divide the IQR by 2:

SIQR = (IQR)/2
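The IQR and SIQR can be sketched in code, assuming the quartiles are located by interpolating within the real limits of each score, as is commonly done for percentiles in a frequency distribution. The function name `score_at_percentile`, the `unit` parameter, and the eight scores are illustrative assumptions:

```python
from collections import Counter


def score_at_percentile(scores, pct, unit=1):
    """Interpolate the score at a percentile rank using real limits."""
    if not scores or not 0 <= pct <= 100:
        raise ValueError("need at least one score and 0 <= pct <= 100")
    counts = Counter(scores)
    target = pct / 100 * len(scores)   # cumulative frequency to reach
    cf_below = 0                       # cumulative frequency below this score
    for x in sorted(counts):
        f = counts[x]
        if cf_below + f >= target:
            lower_limit = x - unit / 2
            # move into the interval in proportion to the cases still needed
            return lower_limit + (target - cf_below) / f * unit
        cf_below += f
    return max(scores) + unit / 2      # not reached for valid input


scores = [0, 1, 2, 2, 2, 2, 3, 4]      # illustrative scores
q1 = score_at_percentile(scores, 25)
q3 = score_at_percentile(scores, 75)
iqr = q3 - q1
siqr = iqr / 2
print(q1, q3, iqr, siqr)               # 1.5 2.5 1.0 0.5
```

Library percentile routines often interpolate between raw data points rather than real limits, so they can return different quartiles for the same scores.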

The Standard Deviation (S)-The best way to determine the appropriateness of the mean is to determine the average (mean) amount of dispersion from the mean. Unfortunately, because the mean ($\bar{X}$) is the sum of the scores (∑X) divided by the number of scores (N), the sum of the distance scores from the mean always equals zero, making that computation useless.

Recall:

Sum of the Distances = $\sum (X - \bar{X}) = 0$
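A short derivation, using only the definition of the mean, shows why the distances always cancel. Here $\bar{X}$ denotes the mean and $N$ the number of scores:

```latex
\sum (X - \bar{X}) = \sum X - N\bar{X} = \sum X - N \cdot \frac{\sum X}{N} = \sum X - \sum X = 0
```

The positive distances above the mean exactly offset the negative distances below it, whatever the scores are.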

To correct this problem, each distance score is squared. The sum of the squared distance scores is called the SUM OF SQUARES (SS). The formula, then, is:

SS = The Sum of (The score – The Sample Mean) Squared

This can serve as the sum of the distances that is divided by the number of distances to suggest a mean distance. But to return to the original scale, the square root of this squared mean distance should be computed. The standard deviation is the square root of the variance:

$S^2 = SS/N$, so $S = \sqrt{SS/N}$

The Variance ($S^2$)-The standard deviation (S) squared is called the variance. This is the very number one takes the square root of to determine the S: the squared mean distance from the mean. The purpose of the variance was primarily to calculate the S. It can be thought of as a squared measure of variability.

$S^2 = \sum (X - \bar{X})^2 / N$
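The chain from SS to variance to standard deviation can be traced in a few lines. This is a minimal sketch of the descriptive (divide by N) case; the function names and the scores are made up for illustration:

```python
import math


def sum_of_squares(scores):
    """SS: sum of squared distances from the mean."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)


def variance(scores):
    """Squared mean distance from the mean: SS / N."""
    return sum_of_squares(scores) / len(scores)


def standard_deviation(scores):
    """Square root of the variance, back on the original scale."""
    return math.sqrt(variance(scores))


scores = [2, 4, 4, 4, 5, 5, 7, 9]    # illustrative scores, mean = 5
print(sum_of_squares(scores))        # 32.0
print(variance(scores))              # 4.0
print(standard_deviation(scores))    # 2.0
```

The variance of 4 is in squared units; taking the square root returns a typical distance of 2 points from the mean.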

There is more than one formula for the SUM OF SQUARES (SS). Use the formula you are more comfortable with. They measure the same thing.
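The alternative formula is not shown in this section. A commonly taught candidate, offered here as an assumption about what was intended, is the computational form, which is algebraically equal to the definitional one:

```latex
SS = \sum (X - \bar{X})^2 = \sum X^2 - \frac{(\sum X)^2}{N}
```

For example, for the scores 1, 2, 3: $\sum X^2 = 14$ and $(\sum X)^2 / N = 36/3 = 12$, so SS = 2, which matches $(-1)^2 + 0^2 + 1^2 = 2$.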

DEGREES OF FREEDOM: Variability can be determined in both the inferential and descriptive cases. Descriptive statistics are based upon populations. That is what the above formulas apply to. The formulas for both the variance and the standard deviation should be adjusted for the case of samples, because there is a risk of bias in the estimate. The sample may not be the most representative of the true population. To assure unbiased estimates, subtract 1 from the denominator, N. This adjusted denominator is called DEGREES OF FREEDOM (df = n - 1).
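Python's standard `statistics` module offers both denominators, which makes the adjustment easy to see. In this sketch `pvariance` and `pstdev` divide by N, while `variance` and `stdev` divide by df = n - 1; the scores are made up for illustration:

```python
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]          # illustrative scores, SS = 32

n = len(scores)
df = n - 1                                  # degrees of freedom

pop_var = statistics.pvariance(scores)      # SS / N
samp_var = statistics.variance(scores)      # SS / df
pop_sd = statistics.pstdev(scores)          # square root of SS / N
samp_sd = statistics.stdev(scores)          # square root of SS / df

print(pop_var, pop_sd)
print(round(samp_var, 3), round(samp_sd, 3))  # 4.571 2.138
```

With SS = 32, dividing by N = 8 gives a variance of 4, while dividing by df = 7 gives about 4.571, so the sample-based figures come out larger.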

NOTATION: To further distinguish the descriptive from the inferential case, English letters will be reserved for inferential statistics. The descriptive case will have Greek letters for notation. This signals that the denominator is an N, and not the df, as that is not required in the descriptive case.

APPLICATION: While the $S^2$ and S require continuous data, the ranges only require ordinal scale. They, therefore, complement measures of central tendency of the same scale. When all the members of the distribution are multiplied by a constant, the mean is multiplied by that constant and the SS is multiplied by the square of that constant. If the members of the distribution all have a constant added to them, the mean shifts by that constant, but the SS does not change. This is true in both the descriptive and inferential cases.
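A quick numeric check shows how adding or multiplying by a constant affects the mean and the SS. The helper `mean_and_ss`, the scores, and the constants 10 and 3 are arbitrary choices for illustration:

```python
def mean_and_ss(scores):
    """Return the mean and the sum of squared distances from it."""
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return mean, ss


scores = [2, 4, 4, 4, 5, 5, 7, 9]      # illustrative scores
shifted = [x + 10 for x in scores]     # add a constant to every score
scaled = [x * 3 for x in scores]       # multiply every score by a constant

print(mean_and_ss(scores))    # (5.0, 32.0)
print(mean_and_ss(shifted))   # (15.0, 32.0)
print(mean_and_ss(scaled))    # (15.0, 288.0)
```

Adding 10 moves the mean from 5 to 15 and leaves the SS at 32. Multiplying by 3 moves the mean to 15 and multiplies the SS by 9, the square of the constant.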

Consider the data set X = 1, 4, 3, 2, 2, 0, 2, 2

X      f     Xf     cf     c%       X - Mean    (X - Mean) Squared
4      1      4      8    100.0        2               4
3      1      3      7     87.5        1               1
2      4      8      6     75.0        0               0
1      1      1      2     25.0       -1               1
0      1      0      1     12.5       -2               4
__    __     __                       __              __
Sum    8     16                        0              10

Mn = SumX/N = 16/8 = 2

Mo = 2

Md = 50th% = 2

Rg = 4.5 - (-.5) = 5

IQR = X75% - X25% = 2.5 - 1.5 = 1

SIQR = IQR/2 = 1/2 = .5

It sometimes eases interpolation to see the set in a line. In this case, there are 2 scores in each quartile.

X =   0  1  |  2  2  |  2  2  |  3  4
           25%      50%      75%     100%

Now that the ordinal measures are complete, consider the continuous solutions:

SS = $\sum (X - \bar{X})^2$ = 10

$\sigma^2$ = SS/N = 10/8 = 1.25          $S^2$ = SS/df = 10/7 = 1.429

$\sigma$ = $\sqrt{1.25}$ = 1.118          S = $\sqrt{1.429}$ = 1.195

Notice that the measures of variability based upon SS are always larger in the inferential case. That is the effect that the df has upon them. The estimate of spread based upon a sample has an element of risk, depending upon the degree of similarity of the sample to the population from which it is drawn. The df will inflate that 'best guess' of variability, to 'hedge your bet' and make it more likely to cover the true variability.
