Distribution shapes in histograms

A histogram displays numerical scores for a group or condition. Along the bottom is the score
scale (an interval scale), and up the side is plotted the frequency, i.e. the number of cases that
obtained each score. The adjoining bars that make up the histogram rise high or low, reflecting
how many people obtained each score or range of scores on the bottom scale. Often the score
scale on the bottom is grouped: e.g. instead of showing one bar for people who scored 5 and
another for people who scored 6, etc., the histogram shows one bar for those who scored
4.001-6, another for 6.001-8, etc. See examples below. You can tell SPSS the interval size you
want, and what end points of the scale you want displayed. These choices can often make the
graph look quite different!
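To see how much the choice of interval size matters, here is a minimal sketch in Python (not SPSS) that draws the same scores twice as a rough text histogram, once with one bar per score and once with scores grouped in pairs. The scores are invented purely for illustration, and the function names are my own. The intervals exclude their lower end point and include their upper one, in the spirit of the '6.001-8' grouping mentioned above.

```python
from collections import Counter

# Hypothetical scores on a 1-10 scale (invented for illustration only).
scores = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]

def bin_counts(values, width, start):
    """Count values in intervals (lower, upper]: lower end excluded, upper included.

    `start` should lie below the smallest value.
    """
    counts = Counter()
    for v in values:
        # Index of the interval containing v: ceil((v - start) / width) - 1
        k = -(-(v - start) // width) - 1
        lower = start + k * width
        counts[(lower, lower + width)] += 1
    return dict(sorted(counts.items()))

def print_histogram(values, width, start):
    """Print one row of '#' marks per interval."""
    for (lo, hi), n in bin_counts(values, width, start).items():
        print(f"{lo:>2} - {hi:<2} {'#' * n}")

print_histogram(scores, width=1, start=2)  # one bar per score
print()
print_histogram(scores, width=2, start=2)  # scores grouped in pairs
```

With the wider intervals the same data produce fewer, taller bars, so the heap looks more pronounced even though nothing about the scores has changed.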
Looking at a histogram one can see visually:
• what score or range of scores was most often obtained by the cases (= where the heap is high = the mode). This may or may not be close to the average (= mean) score, depending on how symmetrical the shape is
• how spread out the scores are along the score scale, reflected also in the SD
• details of the shape of the distribution, which we consider below
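The first two points can be checked numerically as well as by eye. The sketch below, using Python's standard `statistics` module, reports the mode, mean and SD for two invented sets of scores, one roughly symmetrical and one heaped at the low end. The data and the `summary` helper are assumptions made for the illustration, not results from any study.

```python
import statistics

def summary(scores):
    """Return (mode, mean, sample SD) for a list of scores."""
    return (
        statistics.mode(scores),
        statistics.fmean(scores),
        statistics.stdev(scores),  # sample SD, dividing by N - 1
    )

# Hypothetical data, invented for illustration only.
symmetrical = [3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 7, 8, 8, 9]
heaped_low = [1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 5, 7, 9]

for name, data in (("symmetrical", symmetrical), ("heaped low", heaped_low)):
    mode, mean, sd = summary(data)
    print(f"{name}: mode = {mode}, mean = {mean:.2f}, SD = {sd:.2f}")
```

In the symmetrical set the mode and mean coincide; in the lopsided set the mean is pulled well away from the mode by the few high scores.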
REAL DISTRIBUTION SHAPES
More or less symmetrical distribution
A single heap with more or less the same sized ‘tails’ on each side. This arises typically
where the cases measured are a more or less homogeneous group exhibiting some degree of
variation, as is to be expected, and most cases are scoring some way from the fixed ends of
the score scale, if it has any.
Here is an example, which is about as symmetrical as you often get in real data. 77 middle
class learners of English in Colleges in Pakistan were asked to show their degree of
agreement, on a scale 1-5, to each of 10 statements which collectively measure Interest in
Foreign Languages (e.g. ‘I would really like to learn a lot of foreign languages’). The totals
for this variable therefore potentially range between 10 and 50, but in fact most people scored
well in between, average 33. The middle of the scale is of course 30.
[Figure: histogram of INTEREST IN FOREIGN LANGUAGES. Score intervals 10-15 to 45-50 along the bottom, frequency up the side. Mean = 33, Std. Dev = 8.48, N = 77.]
The spread (reflected by the size of the SD) can be greater or smaller. Here it is quite high,
approaching half the maximum it could be: the maximum possible SD here is half the scale
length, 40/2 = 20 (see article on SD).
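The claim that the largest possible SD is half the scale length can be checked with a small sketch. It assumes the population form of the SD (dividing by N) and an invented, maximally polarised group in which half the cases score at each end of the 10-50 scale. The `max_sd` helper is my own naming.

```python
import statistics

def max_sd(low, high):
    """Largest possible SD (population form) on a scale bounded by low and high."""
    return (high - low) / 2

# Hypothetical extreme case: half the cases at each end of the scale.
polarised = [10] * 38 + [50] * 38

print(max_sd(10, 50))                 # half the scale length
print(statistics.pstdev(polarised))   # SD of the polarised scores
print(round(8.48 / max_sd(10, 50), 2))  # observed SD as a fraction of the maximum
```

The observed SD of 8.48 comes out at a little over two fifths of the maximum, which is what 'approaching half' refers to. With the sample form of the SD (dividing by N - 1) the ceiling is very slightly higher.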
Moderately skewed distribution
A distribution heaped on the left with a longer tail on the right is said to be positively skewed.
One heaped on the right with a longer tail on the left is said to be negatively skewed. Often
the heap is on the side nearest a fixed end of the scale.
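The direction of skew can also be put as a number. The sketch below uses one common definition of skewness (the mean cubed deviation divided by the cube of the SD), which is an assumption on my part since the text does not name a formula. It comes out positive when the longer tail is on the right and negative when it is on the left. The scores are invented.

```python
import statistics

def skewness(scores):
    """Population skewness: mean cubed deviation divided by SD cubed."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum((x - mean) ** 3 for x in scores) / (len(scores) * sd ** 3)

# Hypothetical totals on a 10-50 scale, heaped near the bottom end.
heaped_left = [10, 11, 11, 12, 12, 12, 13, 14, 16, 20, 27]
# Mirror image of the same scores, heaped near the top end.
heaped_right = [60 - x for x in heaped_left]

print(round(skewness(heaped_left), 2))   # positive: long tail on the right
print(round(skewness(heaped_right), 2))  # negative: long tail on the left
```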
Example. 72 lower class learners of English in Colleges in Pakistan were asked to show their
degree of agreement, on a scale 1-5, to each of 10 statements which collectively measure the
Parental Encouragement they receive (e.g. ‘My parents/guardians want me to learn English’).
The totals for this variable therefore potentially range between 10 and 50: in fact most people
scored fairly near the bottom of the scale, mean 21, thus creating a positive skew in the
results.
[Figure: histogram of PARENTAL ENCOURAGEMENT. Score intervals 10-15 to 45-50 along the bottom, frequency up the side. Mean = 21, Std. Dev = 5.08, N = 72.]
J shaped distribution
The J is an extreme case of negative skew. The reverse J or ‘ski-jump’ shape is an extreme
case of positive skew. In both of these typically the cases are scoring tight up against a fixed
end limit of a scale. The J is like a heap which has one half missing, because the scale comes
to an end so there is nowhere for the expected other half of the distribution to fall.
Here is an example of a J shaped distribution. 40 teachers of English composition in Saudi
Arabia were asked to say how often they wrote words of praise on student compositions, on a
scale 4=always to 0=never. The results graph as follows and we can see that most claimed to
be full of praise, thus creating this shape. If the results had been for student test scores, for
example, one would have said that the test was rather easy for the subjects and that a
‘ceiling effect’ is therefore manifested.
[Figure: histogram of PRAISE. Scores 0 to 4 along the bottom, frequency up the side. Mean = 3.4, Std. Dev = .93, N = 40.]
Below is a reverse J distribution, an extreme example of positive skew (compare the
milder version in the last section). Arabic learners of English were required to write a first
draft and a final draft of a composition. By comparing drafts, the numbers of revisions they
made were counted. One type, shown here, is the number of revisions of units of phrase size
(as against single words or whole sentences) made by each person. Since many writers made
no revisions of this sort at all, we get a ‘floor effect’, with a heap bunched against the bottom
end of the scale, which is 0 (the top end of this scale has no fixed limit, of course).
[Figure: histogram of PHRASE LEVEL REVISIONS. Number of revisions 0 to 12 along the bottom, frequency up the side. Mean = 1, Std. Dev = 1.81, N = 38.]
U shape, or any of the above with gaps/marked low points indicating more than one heap
Distributions like this are indications that the cases you have graphed may not be a
homogeneous group but really two or more groups from more than one population. Hence
you have, as it were, more than one symmetric or skewed heap displayed, one for each group
that scores rather differently. You can find an example of this sort of distribution shape where
I discuss ways of dividing subjects into distinct groups when initially they have all obtained
scores on a continuous scale. One way of doing that is by a simple form of ‘cluster analysis’:
visually using low points or spaces in the distribution to decide where to cut the scale and say
‘all above this score are group A and all below group B’.
Here is another example. 217 learners of English in Colleges in Pakistan were asked to show
their degree of agreement, on a scale 1-5, to each of 5 statements which collectively measure
their perception of how far their Pakistani identity is threatened by English (e.g. ‘When I use
English, I don’t feel that I am Pakistani any more’). The totals for this variable therefore
potentially range between 5 and 25. As we see, the subjects have polarised into two types of
respondent. The majority see quite a high threat, forming a negatively skewed heap on the
right, while a smaller group, forming a more symmetrical heap on the left, sees a relatively
low threat. If we were to cut the scale at 16 and treat the groups as separate, one has 68
members, mean 9.5, the other 149, mean 21.3. In short, we have probably got two quite
different kinds of people portrayed here: on closer investigation it emerged that in fact the
low group is nearly all upper class, the high group almost exclusively middle and working
class.
[Figure: histogram of THREAT OF ENGLISH TO CULTURAL IDENTITY. Score intervals 5-7 to 23-25 along the bottom, frequency up the side. Mean = 18, Std. Dev = 6.00, N = 217.]
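Cutting the scale at a low point, as done above at 16, is easy to sketch in code. The totals below are invented stand-ins, not the study's data, and the rule that a score exactly on the cut point goes to the high group is my assumption. The last lines simply check that the two group means reported above are consistent with the overall mean shown on the graph.

```python
import statistics

def split_at(scores, cut):
    """Divide scores into a low group (below cut) and a high group (cut and above)."""
    low = [s for s in scores if s < cut]
    high = [s for s in scores if s >= cut]
    return low, high

# Hypothetical totals on the 5-25 scale, invented for illustration only.
totals = [7, 8, 9, 9, 10, 11, 12, 17, 19, 20, 21, 21, 22, 22, 23, 23, 24, 25]
low, high = split_at(totals, cut=16)
print(len(low), round(statistics.fmean(low), 1))
print(len(high), round(statistics.fmean(high), 1))

# Consistency check on the figures reported in the text:
# 68 cases with mean 9.5 and 149 cases with mean 21.3.
combined_mean = (68 * 9.5 + 149 * 21.3) / (68 + 149)
print(round(combined_mean, 1))  # 17.6, which rounds to the Mean = 18 on the graph
```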
WHY LOOK AT DISTRIBUTION SHAPES?
OK, so we can look at the shapes we get, give them names, and say a bit about what
they might indicate about our results… but why else are they important?
The answer is that the shape may be relevant to the further statistical procedures you want to
use. Many popular significance tests such as t tests and ANOVAs are of the ‘parametric’ type,
meaning they require the data to have certain properties, one of which is ‘normality of
distribution of the population’. Usually we don’t know the distribution shapes of the
populations which we claim to have sampled for each group or condition involved, so we have
to assess them from the samples. I.e. if we want to do things properly (and not just skip over
thinking about the distributional prerequisites of significance tests altogether) we have to
EITHER do tests to see if the shape of the sample could readily be from a population with the
normal distribution (e.g. the Kolmogorov-Smirnov or KS one sample test under
nonparametric in SPSS)
OR assess the population shape from the sample distribution's shape just by eye. This is
what we pursue below.
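For readers curious about what the KS one sample test actually measures, here is a rough sketch of the statistic behind it, usually called D: the largest gap between the cumulative proportions in the sample and the cumulative normal curve with the same mean and SD. This is my own minimal version, not SPSS's routine, and it stops at D itself rather than turning it into a significance value. Both samples are invented.

```python
import statistics

def ks_statistic_normal(sample):
    """Largest gap between the sample's cumulative proportions and the
    cumulative normal curve fitted to the sample's own mean and SD."""
    xs = sorted(sample)
    n = len(xs)
    curve = statistics.NormalDist(statistics.fmean(xs), statistics.stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        cdf = curve.cdf(x)
        # The sample's cumulative proportion steps from i/n to (i+1)/n at x,
        # so the gap is checked on both sides of the step.
        d = max(d, cdf - i / n, (i + 1) / n - cdf)
    return d

# Hypothetical samples, invented for illustration only.
roughly_symmetrical = [28, 30, 31, 32, 33, 33, 34, 35, 36, 38]
bunched_at_zero = [0, 0, 0, 0, 0, 0, 1, 1, 2, 7]

print(round(ks_statistic_normal(roughly_symmetrical), 2))
print(round(ks_statistic_normal(bunched_at_zero), 2))
```

The sample bunched at zero gives a much larger D than the roughly symmetrical one. Whether a given D is large enough to reject normality depends on the sample size, which is where the significance value reported by a statistics package comes in.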
By and large, if the shape we see is skewed or discontinuous it is less likely to be from a
normal distribution, but
• with small samples the sample could have quite a non-normal shape and still possibly be from a population which has the normal distribution
• some non-normal shapes can be converted to normal shape.
In general we need to be familiar with some ideal shapes, especially the so-called normal
distribution shape.
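On the point above that some non-normal shapes can be converted, one commonly used conversion for positively skewed scores is to take logarithms. The text does not say which conversion is meant, so treat this as one possible illustration. The sketch uses invented scores and the same skewness measure (mean cubed deviation over SD cubed) to show the skew shrinking after the conversion.

```python
import math
import statistics

def skewness(scores):
    """Population skewness: mean cubed deviation divided by SD cubed."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    return sum((x - mean) ** 3 for x in scores) / (len(scores) * sd ** 3)

# Hypothetical positively skewed scores (all above zero), invented for illustration.
raw = [1, 1, 2, 2, 2, 3, 3, 4, 5, 8, 13, 40]
logged = [math.log(x) for x in raw]

print(round(skewness(raw), 2))     # strongly positive
print(round(skewness(logged), 2))  # still positive, but much closer to 0
```

Note that a logarithm is only defined for scores above zero, so a scale with many zero scores would need some adjustment before this could be applied.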
THREE FAMILIES OF IDEAL DISTRIBUTION SHAPES
Ideal distribution shapes are always smooth and perfect, unlike the ones you get from actual
data. That is because they are defined by mathematical formulae not the actual scores,
responses etc. of real people, which tend to be irregular.
The formulae are of the form y = some function of x. Here x is the score scale on the bottom
of the histogram, and y is the frequency scale up the side. A simple example of a histogram
defined by a formula would be y=3x, which defines the histogram below. I.e. it says the
number of people scoring any score is three times that score. It defines a skewed triangular
distribution shape. However, this is not a very useful ideal distribution shape, since real data
does not usually pattern anything like that.
[Figure: histogram defined by y = 3x. Interval score scale 0 to 12 along the bottom, number of cases up the side. Mean = 8, Std. Dev = 2.93, N = 234.]
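The figures printed on the graph above can be reproduced directly from the formula. The sketch below assumes the score scale runs from 0 to 12, as the graph's bottom axis suggests, and uses the sample form of the SD (dividing by N - 1). The population form gives the same value to two decimal places.

```python
import math

# Frequencies defined by the formula y = 3x on a score scale of 0 to 12.
freq = {x: 3 * x for x in range(0, 13)}

n = sum(freq.values())
mean = sum(x * f for x, f in freq.items()) / n
variance = sum(f * (x - mean) ** 2 for x, f in freq.items()) / (n - 1)
sd = math.sqrt(variance)

print(n, round(mean, 2), round(sd, 2))  # 234 8.33 2.93
```

So an ideal shape defined by a formula has a mean and SD just like any real set of scores, and here they match the N = 234, Mean = 8 and Std. Dev = 2.93 shown on the graph.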
Obviously there is an endless number of ideal distribution shapes possible, as the
mathematical formulae one could use are limitless. We are interested just in distribution
shapes that seem to be close to real data: if you like, the shapes we think the data would have,
were it not for the sort of random variation that real people are subject to. Three stand out as
especially useful, though their formulae are much more complicated than that of the example
above.
FROM HERE ON, THIS IS INCOMPLETE!
The Normal Distribution
This shape, also known as the Gaussian curve, has a fearsome formula which crucially
includes the mean and SD of the distribution on the right hand side. I.e. y is a function of the
mean and SD of the distribution, as well as of x, so you get different normal curves depending
on what the mean is and how spread out the scores are. They are always symmetrical, but
some can be quite flat, some very tall and thin.
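For the record, the fearsome formula is the standard normal density, written out in the sketch below as a small Python function. Strictly, y here is a density rather than a count of cases; to get expected frequencies one would multiply by the number of cases and the interval width. The three SD values are chosen only to show how the height of the curve at the mean changes with the spread.

```python
import math

def normal_curve(x, mean, sd):
    """Height of the normal curve at score x:
    y = exp(-(x - mean)^2 / (2 * sd^2)) / (sd * sqrt(2 * pi))"""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

# Same mean, different spreads: a small SD gives a tall thin curve, a large SD a flat one.
for sd in (4, 8.48, 16):
    peak = normal_curve(33, mean=33, sd=sd)
    print(f"SD = {sd}: height at the mean = {peak:.4f}")

# Symmetry: equal heights at equal distances either side of the mean.
print(normal_curve(30, 33, 8.48), normal_curve(36, 33, 8.48))
```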
One simple way to judge the normality of an actual sample is to get SPSS to superimpose a
normal curve on the histogram of the actual data one has gathered. SPSS provides the version
of the normal curve that fits the mean and SD of your data. Here we see it done for the
Interest in Foreign Languages histogram we looked at above. Intuitively the fit is not too bad.
The highest point of the actual data is in the right place, and although there is an outsize bar
on the left hand side (interval 20-25), it is compensated by a low one next door (25-30).
[Figure: histogram of INTEREST IN FOREIGN LANGUAGES with a normal curve superimposed. Score intervals 10-15 to 45-50 along the bottom, frequency up the side. Mean = 33, Std. Dev = 8.48, N = 77.]
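What the superimposed curve implies for each bar can be worked out as expected numbers of cases. The sketch below is not how SPSS draws the curve. It simply takes the mean (33), SD (8.48) and N (77) shown on the graph, and the eight intervals from 10-15 to 45-50, and asks how many cases a normal distribution would put in each interval. The helper name is my own.

```python
from statistics import NormalDist

def expected_counts(mean, sd, n, edges):
    """Expected number of cases in each interval under a normal curve."""
    curve = NormalDist(mean, sd)
    return {
        (lo, hi): n * (curve.cdf(hi) - curve.cdf(lo))
        for lo, hi in zip(edges, edges[1:])
    }

edges = list(range(10, 55, 5))  # interval boundaries 10, 15, ..., 50
expected = expected_counts(mean=33, sd=8.48, n=77, edges=edges)

for (lo, hi), count in expected.items():
    print(f"{lo} - {hi}: {count:5.1f}")
```

Comparing these expected counts with the heights of the actual bars is the numerical counterpart of judging the fit by eye. The expected counts add up to slightly fewer than 77 because a normal curve has no fixed ends, so a small part of it falls outside the 10-50 scale.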
The Log-Normal or Gamma Distribution
The Poisson Distribution