P201 Lecture Notes07 Chapter 6

PSY 201 Lecture Notes Chapters 12,13,14,15
Measures of Variability and Shape
Variability
Variability refers to differences between score values
The larger the differences, the greater the variability
Example of low variability: Costs of 2010 Toyota Camrys in $1000s in a small town
25, 27, 25, 26, 26, 26, 27
Example of large variability: Costs of 2010 Toyota Camrys in $1000s in a large city
26, 24, 28, 22, 30, 32, 20
Dot plots
[Dot plots of the two sets of prices, each drawn on an axis running from 20 to 40.]
After examination of the numbers, we can see that there are bigger differences among the prices in
the 2nd set than among those in the 1st. How should those differences be summarized?
The possible measures:
1. Range: Difference between the largest score and the smallest score.
2. Interquartile Range
3. Variance
4. Standard Deviation
Biderman’s 201 Handouts Topic 4 (Numeric Measures II) -14
3/7/2016
The Range
Range = Largest value in the collection minus smallest value.
Problems with the Range
1. Quite variable from sample to sample, even if all samples are from the same population.
How ironic that a measure of variability would be too variable.
2. May be restricted by ceiling or floor of the scale.
Much psychological measurement comes from scales to which persons respond on a 1-5
scale, often labeled 1=Strongly disagree, 2=Disagree, 3=Neutral, 4=Agree, 5=Strongly Agree.
“I support Richard M. Nixon.” in 1950s
Value   Freq
5       1
4       3
3       100
2       3
1       1
Range: 5-1=4
“I support Richard M. Nixon.” in 1970s
Value   Freq
5       100
4       50
3       20
2       50
1       100
Range: 5-1=4
So the range is not generally useful, although it is often reported.
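As a rough illustration of the ceiling/floor problem, the range can be computed in a few lines of Python. This is only a sketch: the function names value_range and range_from_frequencies are made up for this example, and the two Nixon frequency tables are typed in as dictionaries of value: frequency.

```python
# Sketch: the range of a list of scores and of a frequency table.
# Function and variable names are illustrative, not from the handout.

def value_range(scores):
    """Largest value minus smallest value."""
    return max(scores) - min(scores)

def range_from_frequencies(freq):
    """Range of a {value: frequency} table, ignoring values nobody chose."""
    observed = [value for value, count in freq.items() if count > 0]
    return max(observed) - min(observed)

small_town = [25, 27, 25, 26, 26, 26, 27]
large_city = [26, 24, 28, 22, 30, 32, 20]

nixon_1950s = {5: 1, 4: 3, 3: 100, 2: 3, 1: 1}
nixon_1970s = {5: 100, 4: 50, 3: 20, 2: 50, 1: 100}

print(value_range(small_town))              # 2
print(value_range(large_city))              # 12
print(range_from_frequencies(nixon_1950s))  # 4
print(range_from_frequencies(nixon_1970s))  # 4
```

Both Nixon distributions have a range of 4, even though the 1970s responses are far more spread out. On a 1-5 scale the range cannot show that difference.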
The Interquartile Range
Quartiles:
Points identifying "quarters" of a distribution.
Conceptual Definitions
Q4  Fourth Quartile: The value below which 4/4ths of the distribution falls.
Q3  Third Quartile: The value below which 3/4ths of the distribution falls.
Q2  Second Quartile: The value below which 2/4ths of the distribution falls.
Q1  First Quartile: The value below which 1/4th of the distribution falls.
Q0  "Zeroth" Quartile: The value below which 0/4ths of the distribution falls.
Operational Definitions
Q4  The largest score value in the distribution.
Q3  The median of the scores in the upper half of the distribution.
    (If N is odd, include the overall median in the upper half.)
Q2  The overall median of the collection. Compute using the median formula.
Q1  The median of the scores in the lower half of the distribution.
    (If N is odd, include the overall median in the lower half.)
Q0  The smallest score value in the distribution.
Interquartile Range:
The distance (on the number line) between the Q1 and Q3 - between the first
quartile and the third quartile.
IQR = Q3 - Q1
Interpretation
The distance or interval size required to contain the middle 50% of the scores.
If the middle 50% is contained in a small area, the distribution is quite "crowded" - the
scores are close to each other; the distribution has little variability.
If the middle 50% is contained in a wide area, the distribution is sparse - the scores are far
from each other; the distribution has much variability.
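Example - A distribution with an even number of scores.
Scores: 75 65 50 45 40 40 35 35 30 30 30 25 25 10
Upper half of distribution: 75 65 50 45 40 40 35 (median = Q3 = 45)
Lower half of distribution: 35 30 30 30 25 25 10 (median = Q1 = 30)
So the interquartile range (IQR) for this distribution is 45 - 30 = 15.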
Example - A distribution with an odd number of scores.
Scores: 65 50 45 40 35 35 30 25 25 20 15
Note that 35, the overall median, is included in both the lower and upper halves.
Upper half of distribution: 65 50 45 40 35 35 (median = Q3 = 42.5)
Lower half of distribution: 35 30 25 25 20 15 (median = Q1 = 25)
So, the interquartile range (IQR) for this distribution is 42.5 - 25 = 17.5
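The operational definitions of the quartiles can be sketched in Python as follows. This is a minimal sketch, not the course's official procedure: the function names quartiles and iqr are invented here, and statistics.median (the plain middle-value median) is assumed to stand in for "the median formula".

```python
from statistics import median

def quartiles(scores):
    """Return (Q1, Q2, Q3) using the median-of-halves definitions.

    If N is odd, the overall median is included in both halves.
    """
    s = sorted(scores)
    n = len(s)
    half = n // 2
    if n % 2 == 0:
        lower, upper = s[:half], s[half:]
    else:
        lower, upper = s[:half + 1], s[half:]
    return median(lower), median(s), median(upper)

def iqr(scores):
    """Interquartile range: Q3 minus Q1."""
    q1, _, q3 = quartiles(scores)
    return q3 - q1

even_scores = [75, 65, 50, 45, 40, 40, 35, 35, 30, 30, 30, 25, 25, 10]
odd_scores = [65, 50, 45, 40, 35, 35, 30, 25, 25, 20, 15]

print(iqr(even_scores))  # 15
print(iqr(odd_scores))   # 17.5
```

Statistical packages often use other quartile conventions, so their IQR values may differ slightly from the ones produced by this definition.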
Comparing Variability of ACTComp scores of Females and Males
[Figure: ACTComp score distributions for Females and Males, with Q1, Q2, and Q3 marked.]
Interestingly, Q1, Q2, and Q3 are identical for Males and Females in this sample.
The interquartile range for each distribution is 25 – 19 = 6.
The Variance, also called the Mean Square.
The variance is the “average” of the squared differences of the scores from the mean.
Group        Symbol                Formula
Population   σ² (sigma squared)    Σ(X-µ)² / N
Sample       S² (ess squared)      Σ(X-X-bar)² / (N-1)
Note that the formula for the sample variance is different from the formula for the population
variance. The sample variance requires dividing the sum of squared differences by N-1, not N. For
this reason, the sample variance is almost the average of squared differences when computed from a
sample. It is exactly the average when computed from a population. Hence the quotes around
average in the definition above.
Computing the Variance
(Note: SPSS divides by N-1).
Consider the first set of Camry prices above
X     Mean   X-Mean   (X-Mean)²
25    26     -1       1
27    26      1       1
25    26     -1       1
26    26      0       0
26    26      0       0
26    26      0       0
27    26      1       1
Sum of squared differences (Σ(X-X-bar)²) = 4
Population Variance = Σ(X-µ)² / N = 4/7 = .57 (We will never compute a population variance.)
Sample Variance = Σ(X-X-bar)² / (N-1) = 4/6 = .67
Now the second set of Camry Prices
X     Mean   X-Mean   (X-Mean)²
26    26      0       0
24    26     -2       4
28    26      2       4
22    26     -4       16
30    26      4       16
32    26      6       36
20    26     -6       36
Sum of squared differences = 112
Population Variance = 112/7 = 16.
Sample Variance = 112/6 = 18.67
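The hand computations for the two sets of Camry prices can be checked with a short Python sketch. The helper names below are made up for this illustration. The standard library's statistics.variance divides by N-1 and statistics.pvariance divides by N, so they are used only as a cross-check.

```python
from statistics import pvariance, variance

def sum_of_squared_differences(scores):
    """Sum of (X - Mean) squared over all scores."""
    mean = sum(scores) / len(scores)
    return sum((x - mean) ** 2 for x in scores)

def population_variance(scores):
    """Divide the sum of squared differences by N."""
    return sum_of_squared_differences(scores) / len(scores)

def sample_variance(scores):
    """Divide the sum of squared differences by N - 1."""
    return sum_of_squared_differences(scores) / (len(scores) - 1)

small_town = [25, 27, 25, 26, 26, 26, 27]
large_city = [26, 24, 28, 22, 30, 32, 20]

for name, prices in [("small town", small_town), ("large city", large_city)]:
    print(name,
          sum_of_squared_differences(prices),
          round(population_variance(prices), 2),
          round(sample_variance(prices), 2))
# small town 4.0 0.57 0.67
# large city 112.0 16.0 18.67
```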
[Dot plot of Camry prices on an axis running from 20 to 32.]
Characterizing the Variance
Positive aspects
The variance is “connected” to every score in the collection. This is generally regarded as a plus –
changing the value of ANY score will change the variance.
The variance has good lineage – it’s part of the formula for the Normal Distribution.
The variance is a key quantity in many inferential statistics.
Negative aspects
The variance is in squared units. So its value is not easily related to the individual score values.
Often, the variance can be much larger than the differences between the individual score values.
So the variance is not a good DESCRIPTIVE measure of variability.
The Standard Deviation
The standard deviation is the square root of the variance.
Memorize this formula:
Group        Symbol   Formula
Population   σ        √( Σ(X-µ)² / N )
Sample       S        √( Σ(X-X-bar)² / (N-1) )
Note that as was the case for the variance the formula for the sample standard deviation is different
from the formula for the population standard deviation. The sample standard deviation requires
dividing the sum of squared differences by N-1, not N.
FYI – Most computer programs automatically compute the “dividing by N-1” standard deviation.
This is what SPSS does.
Computing the Standard Deviation
Computation of the standard deviation involves
1) computing the variance,
2) taking the square root of the variance.
Consider the first set of Camry prices above
X     X-bar   X-X-bar   (X-X-bar)²
25    26      -1        1
27    26       1        1
25    26      -1        1
26    26       0        0
26    26       0        0
26    26       0        0
27    26       1        1
(We did this above when computing the variance.)
Sum of squared differences = 4
Population variance (dividing by N) = 4/7 = .57
Sample variance (dividing by N-1) = 4/6 = .67
New stuff:
Population standard deviation = sqrt(.57) = .75
Sample standard deviation = sqrt(.67) = .82
Now the second set of Camry Prices - the big city prices
X     X-bar   X-X-bar   (X-X-bar)²
26    26       0        0
24    26      -2        4
28    26       2        4
22    26      -4        16
30    26       4        16
32    26       6        36
20    26      -6        36
(We did this above when computing the variance.)
Sum of squared differences = 112
Population variance (dividing by N) = 112/7 = 16
Sample variance (dividing by N-1) = 112/6 = 18.67
New stuff:
Population standard deviation = sqrt(16) = 4.00
Sample standard deviation = sqrt(18.67) = 4.32
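The two-step computation (variance, then square root) can be sketched in Python. The function names are illustrative only. The standard library's statistics.stdev (N-1) and statistics.pstdev (N) are shown as a cross-check.

```python
import math
from statistics import pstdev, stdev

def sample_sd(scores):
    """Square root of the sample variance (divide by N - 1)."""
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(ss / (len(scores) - 1))

def population_sd(scores):
    """Square root of the population variance (divide by N)."""
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(ss / len(scores))

small_town = [25, 27, 25, 26, 26, 26, 27]
large_city = [26, 24, 28, 22, 30, 32, 20]

print(round(sample_sd(small_town), 2))      # 0.82
print(round(sample_sd(large_city), 2))      # 4.32
print(round(population_sd(large_city), 2))  # 4.0
print(round(population_sd(small_town), 2))  # 0.76
```

The last value comes out as .76 rather than .75 because the code takes the square root of the unrounded variance (4/7), not of the rounded .57.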
Characterizing the standard deviation – Standard deviation up close and personal
What’s good:
1) Connected to every score;
2) Good lineage;
3) Fits the data - the values of the standard deviation make sense.
What’s bad:
1) Hard to understand what it means;
2) Inflated by skewness, outliers
What does the above standard deviation mean????
First note that the standard deviation is telling us that the second set of prices is more variable than
the first set. This is something we already knew, so that’s comforting.
Unfortunately, there is no simple, easy to digest, description of what the standard deviation
represents.
If anything, it might be thought of as the “average” of the differences of the scores from the
mean. If someone not familiar with statistics asks me what it is, that’s what I tell them.
But, lack of an interpretation of the number doesn’t prevent us from using that number.
Two key facts about the standard deviation
1. For large N (N >> 30) unimodal, symmetric (US) distributions, with no outliers, about 2/3 of the
scores will be within one standard deviation of the mean, that is, between Mean-1SD and
Mean+1SD.
Mean - SD |------ About 2/3 of scores ------| Mean + SD
2. For large N (N >> 30) unimodal, symmetric (US) distributions, with no outliers, about 95% of
the scores will be within two standard deviations of the mean, that is, between Mean-2SD and
Mean+2SD.
Mean - 2SD |------ About 95% of scores ------| Mean + 2SD
This means that if you know three things about the distribution:
1) that it’s unimodal and symmetric,
2) the mean, and
3) the standard deviation,
you can tell pretty much where an individual score falls in that distribution.
For example, Joe scores two standard deviations above the mean on a test.
What percent of the persons taking the test scored worse than Joe?
Fact 2 above says that about 95% of the scores are between Mean-2SD and Mean+2SD, and all of
those are below Joe’s.
Of the remaining 5%, ½ of that, or 2 ½%, would be in the left hand tail of the distribution, way
below Joe’s score, and the other 2 ½% would be in the upper tail, above Joe’s score.
[Figure: distribution with 2 ½% in the lower tail, 95% in the middle, and 2 ½% in the upper tail; Joe is at Mean + 2SD.]
So the answer is that approximately 95% + 2 ½% = 97.5% of the scores would be below Joe’s.
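The two key facts and the Joe example can be checked by simulation. The sketch below assumes a Normal distribution as a stand-in for "large, unimodal, symmetric, no outliers". The mean of 100, the SD of 15, and the seed are arbitrary choices made for this illustration.

```python
import random
from statistics import NormalDist, mean, stdev

random.seed(201)  # arbitrary seed so the run is repeatable

# Hypothetical test scores: 10,000 draws from a Normal distribution.
scores = [random.gauss(100, 15) for _ in range(10_000)]
m, sd = mean(scores), stdev(scores)

def fraction_within(scores, center, sd, k):
    """Fraction of scores between center - k*sd and center + k*sd."""
    inside = [x for x in scores if center - k * sd <= x <= center + k * sd]
    return len(inside) / len(scores)

within_1 = fraction_within(scores, m, sd, 1)  # roughly 2/3
within_2 = fraction_within(scores, m, sd, 2)  # roughly 95%

# Joe: two standard deviations above the mean of an exactly Normal distribution.
below_joe = NormalDist().cdf(2)

print(round(within_1, 2), round(within_2, 2), round(below_joe, 3))
```

For an exactly Normal distribution the proportion below Joe is about 97.7%, which agrees with the handout's approximate answer of 97.5%.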
Reasons for being interested in Variability
1. It may be important to identify issues for which variability of opinion is high.
For example, attitudes toward abortion.
Attitudes toward parking meters in the Fort Wood area – large variability from ++ to --.
2. There may be situations in which it is important to have low (or high) variability.
For example, most teachers prefer low variability of entering ability when teaching classes
such as this.
If variability is low, this means that the teacher’s presentation will likely be understood by
everyone, if it’s chosen appropriately.
If variability is high, some may not understand and some may be bored.
3. Variability is part of individual differences
There is variability in almost every human characteristic. The discovery of explanations for
that variability occupies much of the time of research psychologists.
Example
When people are asked to fake on personality tests, some are better able to fake than others.
Why?
4. Variability, as measured by the standard deviation, is used to assess the size of differences in means.
Conventions concerning the size of mean differences
.2 Standard deviations = Small difference
.5 Standard deviations = Medium difference
.8 Standard deviations = Large difference
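One way to apply these conventions is to express a difference between two means in standard deviation units and then label it. The sketch below is illustrative only: the function names, the example means and SD, and the decision to treat .2, .5, and .8 as lower cutoffs are all assumptions, not part of the handout.

```python
def standardized_difference(mean_a, mean_b, sd):
    """Difference between two means expressed in standard deviation units."""
    return (mean_a - mean_b) / sd

def describe_difference(d):
    """Label a standardized difference using the .2 / .5 / .8 conventions.

    Treating the conventions as lower cutoffs is an assumption of this sketch.
    """
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "smaller than small"

# Hypothetical example: group means of 105 and 100 with a standard deviation of 10.
d = standardized_difference(105, 100, 10)
print(d, describe_difference(d))  # 0.5 medium
```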
Measures of distribution shape
Measures of skewness
A popular measure of skewness is the following, given by
Kirk, R. (1999). Statistics: An introduction. 4th Ed. New York: Harcourt Brace.
Skewness = (Σ(X-Mean)³ / N ) / S³
In English: The sum of the cubed deviations of scores from the mean divided by N, then divided by
the cube of the standard deviation.
Interpretation of values
Value of skewness measure    Interpretation
Larger than 0                Positively skewed distribution
0                            Symmetric distribution
Less than 0                  Negatively skewed distribution
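The skewness formula can be sketched in Python as follows. Two assumptions are made here: S is taken to be the sample standard deviation (dividing by N-1), following the handout's use of S, and the skewed score lists are invented for illustration.

```python
import math

def sample_sd(scores):
    """Sample standard deviation (divide by N - 1)."""
    mean = sum(scores) / len(scores)
    ss = sum((x - mean) ** 2 for x in scores)
    return math.sqrt(ss / (len(scores) - 1))

def skewness(scores):
    """(Sum of cubed deviations / N) divided by the cube of S."""
    n = len(scores)
    mean = sum(scores) / n
    mean_cubed_deviation = sum((x - mean) ** 3 for x in scores) / n
    return mean_cubed_deviation / sample_sd(scores) ** 3

large_city = [26, 24, 28, 22, 30, 32, 20]   # symmetric about 26
right_tail = [20, 21, 22, 23, 60]           # hypothetical: one very high score
left_tail = [60, 59, 58, 57, 20]            # hypothetical: one very low score

print(skewness(large_city))       # 0.0
print(skewness(right_tail) > 0)   # True
print(skewness(left_tail) < 0)    # True
```

Statistical packages often apply small-sample adjustments to this formula, so their skewness values may differ slightly from the ones computed here.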
Example of the skewness statistic
1. Salaries from the Employee Data file.
Statistics: salary (Current Salary)
N Valid                   474
N Missing                 0
Skewness                  2.125
Std. Error of Skewness    .112
[Histogram of Current Salary, $0 to $140,000: Mean = $34,419.57, Std. Dev. = $17,075.661, N = 474]
2. Extroversion scores of 109 UTC students
Statistics: hext
N Valid                   109
N Missing                 1
Skewness                  -.220
Std. Error of Skewness    .231
[Histogram of hext, 0.00 to 8.00: Mean = 4.4582, Std. Dev. = 0.95104, N = 109]
Kurtosis
Kurtosis refers to the relationship of the shape of a distribution to the shape of the Normal
Distribution.
Kirk gives the following measure of kurtosis:
Kurtosis = ( (Σ(X-Mean)⁴ / N ) / S⁴ ) - 3
In English: The sum of the deviations of scores from the mean raised to the fourth power is divided by
N, then divided by the standard deviation raised to the fourth power; finally, 3 is subtracted.
Interpretation
Value of kurtosis measure    Interpretation
Larger than 0                More peaked than the Normal distribution.
0                            Same peakedness as the Normal distribution.
Less than 0                  Less peaked (flatter) than the Normal distribution.
Example
1. Extroversion scores of 1000 values from a uniform distribution.
As is pretty much apparent from the histogram, according to the Kurtosis measure the distribution is
less peaked – flatter - than the Normal Distribution.
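A small simulation can show the kurtosis measure coming out negative for uniform data. As in the formula above, S is assumed here to be the sample standard deviation (dividing by N-1). The seed and the 0-to-1 range of the uniform values are arbitrary choices for this sketch.

```python
import math
import random

def kurtosis(scores):
    """((Sum of 4th-power deviations / N) / S to the 4th power) minus 3."""
    n = len(scores)
    mean = sum(scores) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    mean_fourth_deviation = sum((x - mean) ** 4 for x in scores) / n
    return mean_fourth_deviation / s ** 4 - 3

random.seed(201)  # arbitrary seed so the run is repeatable
uniform_scores = [random.uniform(0, 1) for _ in range(1000)]
normal_scores = [random.gauss(0, 1) for _ in range(1000)]

# Uniform data: clearly below 0 (flatter than the Normal distribution).
print(round(kurtosis(uniform_scores), 2))
# Normal data: close to 0, up to sampling error.
print(round(kurtosis(normal_scores), 2))
```

In theory the measure is about -1.2 for a uniform distribution, so a sample of 1000 uniform values should give a value near that.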