Graphical Summaries

AMS 5
GRAPHICAL DESCRIPTIVE
METHODS
Histograms
In the US, how are incomes distributed? In
March 1973, 50,000 American families
reported their income for the previous
year. Of course these data have to be
summarized: nobody wants to look at all
these numbers.
A graph that is often used to summarize
data is the histogram.
Read a Histogram
Blocks
Class intervals,
e.g. ($1,000-$2,000), ($2,000-$3,000), …, ($25,000-$30,000)
Read a Histogram
In a histogram, the areas of the blocks
represent percentages.
About what percentage of the families earned between $10,000
and $25,000?
Were there more families with incomes between $10,000 and $25,000 or between $15,000 and $25,000?
Read a Histogram
About 1% of the families in the previous figure
had incomes between $0 and $1,000. Estimate
the percentage who had incomes
a) $1,000-$2,000
b) $2,000-$3,000
c) $3,000-$4,000
d) $4,000-$5,000
e) $4,000-$7,000
f) $7,000-$10,000
Distribution Table
Income Level         Percent
$0-$1,000             1
$1,000-$2,000         2
$2,000-$3,000         3
$3,000-$4,000         4
$4,000-$5,000         5
$5,000-$6,000         5
$6,000-$7,000         5
$7,000-$10,000       15
$10,000-$15,000      26
$15,000-$25,000      26
$25,000-$50,000       8
$50,000 and over      1
Distribution Table
In distribution tables you need to be
careful with the endpoint conventions.
For the previous table the left endpoint is
included in the class interval, while the
right endpoint is excluded. For example, an
income of exactly $7,000 is counted in
$7,000-$10,000, not in $6,000-$7,000.
The percents in the previous table add to
101% rather than 100% because of rounding.
Finally, we will ignore the last class
($50,000 and over).
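The endpoint convention can be checked with a short Python sketch. The endpoints and percents are copied from the table; the function name `class_interval` is only an illustrative choice, not part of the course material.

```python
from bisect import bisect_right

# Left endpoints of the class intervals in the table, in dollars.
LEFT_ENDPOINTS = [0, 1000, 2000, 3000, 4000, 5000, 6000, 7000,
                  10000, 15000, 25000, 50000]
# Percent of families in each class, copied from the table.
PERCENTS = [1, 2, 3, 4, 5, 5, 5, 15, 26, 26, 8, 1]


def class_interval(income):
    """Return (left, right) for the class that contains income.

    The left endpoint is included and the right endpoint is excluded.
    right is None for the open-ended last class.
    """
    if income < 0:
        raise ValueError("income must be non-negative")
    i = bisect_right(LEFT_ENDPOINTS, income) - 1
    left = LEFT_ENDPOINTS[i]
    right = LEFT_ENDPOINTS[i + 1] if i + 1 < len(LEFT_ENDPOINTS) else None
    return left, right


print(class_interval(7000))     # (7000, 10000): left endpoint included
print(class_interval(6999.99))  # (6000, 7000): right endpoint excluded
print(sum(PERCENTS))            # 101, not 100, because of rounding
```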
Drawing a Histogram
Put down a horizontal axis. Mark the endpoints
of the class intervals to scale, so that the
distances between them are right.
The next step is to draw the blocks. DON’T
just PLOT THE PERCENTS, i.e. don’t make the
heights of the blocks equal to them.
Drawing a Histogram
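Plotting the percents as heights makes it look as if
there were many more families with incomes
over $25,000 than under $7,000. The table says the
opposite: 25% are under $7,000 and only 9% are
at $25,000 or over.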
Drawing a Histogram
The problem is that the class intervals have
different lengths. The 8% who
earn $25,000-$50,000 are spread over a
larger range of incomes than the 15%
who earn $7,000-$10,000. Plotting
percents directly ignores this, and makes
the blocks over longer class intervals too
big.
Drawing a Histogram
Income Level        Percent (P)   Length in $1,000s (L)   Height = P / L
$0-$1,000            1             1                      1
$1,000-$2,000        2             1                      2
$2,000-$3,000        3             1                      3
$3,000-$4,000        4             1                      4
$4,000-$5,000        5             1                      5
$5,000-$6,000        5             1                      5
$6,000-$7,000        5             1                      5
$7,000-$10,000      15             3                      5
$10,000-$15,000     26             5                      5.2
$15,000-$25,000     26            10                      2.6
$25,000-$50,000      8            25                      0.32
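The last column of the table can be reproduced with a few lines of Python. This is a minimal sketch: the class limits and percents are copied from the table, and the name `block_height` is only illustrative.

```python
# (left, right, percent) for each class, in dollars, copied from the table.
CLASSES = [
    (0, 1000, 1), (1000, 2000, 2), (2000, 3000, 3), (3000, 4000, 4),
    (4000, 5000, 5), (5000, 6000, 5), (6000, 7000, 5), (7000, 10000, 15),
    (10000, 15000, 26), (15000, 25000, 26), (25000, 50000, 8),
]


def block_height(left, right, percent):
    """Height in percent per $1,000: the percent divided by the length."""
    length = (right - left) / 1000  # length in thousands of dollars
    return percent / length


heights = [block_height(*c) for c in CLASSES]
for (left, right, percent), height in zip(CLASSES, heights):
    print(f"${left:,}-${right:,}: height = {height:g}% per $1,000")
```

The 8% in $25,000-$50,000 end up with a height of only 0.32, because they are spread over 25 thousand-dollar intervals.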
Drawing a Histogram
Units on the vertical scale: for example, the height of the block over the interval
$7,000 to $10,000 is 5% per $1,000, i.e. in each thousand-dollar interval
between $7,000 and $10,000 there are about 5% of the families.
Density Scale
In the previous example the histogram was
drawn using the density scale. Remember that
the areas of the blocks come out in percent. A
tall block means that a large chunk of area
accumulates over a small portion of the horizontal
scale, so the density of the data is
high in the intervals where the height is large. In
other words, the data are more crowded in
those intervals.
In a histogram the height of a block represents
crowding: percentage per horizontal unit.
Density Scale
Example: By looking only at the histogram, about what
percent of the families had incomes between
$15,000 and $25,000?
Answer: The height of the block is 2.6% per $1,000,
i.e. each thousand-dollar interval between
$15,000 and $25,000 contains about 2.6% of the
families. There are 10 of these intervals, and
therefore the answer is
10 × 2.6% = 26%.
The area under the histogram over an interval equals the
percentage of cases in that interval. The total area under
the histogram should therefore be 100%.
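The area rule can be sketched in Python as follows. The classes are copied from the income table (without the open-ended last class); the function name `percent_between` is illustrative. For a range that covers only part of a block, the sketch assumes families are spread evenly within the class, which is an approximation.

```python
# (left, right, percent) for each class, in dollars, copied from the
# income table; the open-ended last class is left out.
CLASSES = [
    (0, 1000, 1), (1000, 2000, 2), (2000, 3000, 3), (3000, 4000, 4),
    (4000, 5000, 5), (5000, 6000, 5), (6000, 7000, 5), (7000, 10000, 15),
    (10000, 15000, 26), (15000, 25000, 26), (25000, 50000, 8),
]


def percent_between(low, high):
    """Area under the histogram between low and high, in percent."""
    total = 0.0
    for left, right, percent in CLASSES:
        overlap = min(high, right) - max(low, left)
        if overlap > 0:
            # height (percent / length) times the width of the overlap
            total += percent * overlap / (right - left)
    return total


print(percent_between(15000, 25000))  # 26.0, i.e. 10 x 2.6%
print(percent_between(0, 50000))      # 100.0, the total area
```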
Other types of Histogram
Raw-Frequency Histograms.
Relative-Frequency Histograms.
Use them only when all class
intervals have
the same length.
Example: scores on a 1966 civil-service examination in Chicago.
Value  Raw freq.  Rel. freq. (%)    Value  Raw freq.  Rel. freq. (%)    Value  Raw freq.  Rel. freq. (%)
26     1          0.45              48     8          3.59              68     2          0.90
27     4          1.79              49     4          1.79              69     8          3.59
29     1          0.45              50     2          0.90              71     2          0.90
30     4          1.79              51     5          2.24              72     1          0.45
31     3          1.35              52     5          2.24              73     1          0.45
32     2          0.90              53     5          2.24              74     3          1.35
33     5          2.24              54     5          2.24              75     2          0.90
34     3          1.35              55     3          1.35              76     2          0.90
35     2          0.90              56     5          2.24              78     1          0.45
36     3          1.35              57     4          1.79              80     4          1.79
37     7          3.14              58     8          3.59              81     3          1.35
39     7          3.14              59     4          1.79              82     2          0.90
40     1          0.45              60     6          2.69              83     4          1.79
41     1          0.45              61     6          2.69              84     7          3.14
42     5          2.24              62     3          1.35              90     3          1.35
43     8          3.59              63     2          0.90              91     3          1.35
44     6          2.69              64     1          0.45              92     3          1.35
45     7          3.14              65     1          0.45              93     4          1.79
46     6          2.69              66     3          1.35              95     2          0.90
47     6          2.69              67     4          1.79              Total  223        100.0%
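Raw/Relative-Frequency
Histograms
[Figures: raw-frequency histogram (vertical axis: raw frequency, 0 to 8) and relative-frequency histogram (vertical axis: relative frequency, 0 to .04) of the exam scores. Horizontal axis in both: scores, 20 to 100.]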
The two graphs are identical. In the second, just re-label the vertical axis so that, for
example, 1 now corresponds to (1/223) × 100% = 0.45%.
Relative-frequency histograms are preferred when you want to compare two
histograms based on different numbers of observations.
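The re-labelling can be checked in Python. The raw frequencies below are copied from the civil-service table; the variable names are illustrative.

```python
# Raw frequency of each score, copied from the civil-service table.
RAW = {
    26: 1, 27: 4, 29: 1, 30: 4, 31: 3, 32: 2, 33: 5, 34: 3, 35: 2, 36: 3,
    37: 7, 39: 7, 40: 1, 41: 1, 42: 5, 43: 8, 44: 6, 45: 7, 46: 6, 47: 6,
    48: 8, 49: 4, 50: 2, 51: 5, 52: 5, 53: 5, 54: 5, 55: 3, 56: 5, 57: 4,
    58: 8, 59: 4, 60: 6, 61: 6, 62: 3, 63: 2, 64: 1, 65: 1, 66: 3, 67: 4,
    68: 2, 69: 8, 71: 2, 72: 1, 73: 1, 74: 3, 75: 2, 76: 2, 78: 1, 80: 4,
    81: 3, 82: 2, 83: 4, 84: 7, 90: 3, 91: 3, 92: 3, 93: 4, 95: 2,
}

total = sum(RAW.values())  # number of scores in the data set
# Relative frequency in percent: raw frequency divided by the total.
relative = {score: 100 * count / total for score, count in RAW.items()}

print(total)                   # 223
print(round(relative[26], 2))  # 0.45, a raw frequency of 1
print(round(relative[43], 2))  # 3.59, a raw frequency of 8
```

Dividing every count by the same total changes only the labels on the vertical axis, not the shape of the histogram.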
Topics on the number of Blocks
and the class intervals length
It is usually simpler to have all intervals of the
same length, although the choice of the length of
each interval depends on the variable of interest. For
example, suppose that you want to plot a
histogram of educational level (years of schooling
completed; kindergarten doesn’t count) for persons
age 25 and over in the US. It is quite reasonable to
use intervals of different widths that represent the
different categories of the educational system.
Back to the US income example. The first intervals are
quite “skinny”. Do you think it would look good to
divide, for example, the last “fat” interval into skinny
ones?
Topics on the number of Blocks
and the class intervals length
How many blocks? There are many different histograms
you can make from the same variable. For the exam
scores example we went to the extreme of using the largest
possible number of very skinny blocks. This is not a very
good idea: the pattern is lost in the detail, and
with a different sample the resulting histogram
would probably be completely different. On the other
hand, with too few blocks the pattern of the sample
is hidden inside the blocks.
There are mathematical formulas and empirical
rules that relate the sample size to the number
of blocks. Also, most computer programs produce
a reasonable number of blocks by default. The default
raw-frequency histogram from the computer program Stata for
the exam scores example is the following:
Topics on the number of Blocks
and the class intervals length
[Figure: default Stata raw-frequency histogram of the exam scores. Vertical axis: raw frequency, 0 to 30; horizontal axis: scores, 20 to 100.]
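As an illustration of formulas that relate the sample size to the number of blocks, here are two widely quoted rules of thumb, Sturges' rule and the square-root rule, sketched in Python. They are examples chosen for this sketch; the slides do not say which rule any particular program uses by default, and the function names are illustrative.

```python
import math


def sturges_blocks(n):
    """Sturges' rule: ceil(log2(n)) + 1 blocks for n observations."""
    return math.ceil(math.log2(n)) + 1


def square_root_blocks(n):
    """Square-root rule: ceil(sqrt(n)) blocks for n observations."""
    return math.ceil(math.sqrt(n))


n = 223  # number of exam scores in the civil-service example
print(sturges_blocks(n))      # 9
print(square_root_blocks(n))  # 15
```

Both rules give far fewer blocks than the one-block-per-score histogram shown earlier.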
Cross Tabulation
In many situations we need an
exploratory analysis of the data to look for possible
associations with a discrete variable. For
example, consider measuring the blood pressure
of women and dividing them into two groups: one
taking the contraceptive pill and the other not
taking it. We can produce a table with the
distribution of one group in one column and the
distribution of the other in another column. This
can be used to draw two histograms in order
to compare the two groups visually.
The variable that is used for the cross-tabulation
is usually referred to as a covariable.
Cross Tabulation
blood pressure (mm)   Non users (%)   Users (%)
under 100              8               6
100 - 110             20              12
110 - 120             31              26
120 - 130             19              22
130 - 140             13              17
140 - 150              6              11
150 - 160              2               4
over 160               1               2
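A rough way to compare the two columns without drawing anything is sketched below in Python. The percents are copied from the table; the text bars and the function name `percent_at_or_above` are only illustrative. Because the six middle classes all have the same length (10 mm), bars proportional to the percents are comparable for them; the two open-ended classes are shown the same way only for convenience.

```python
# Class labels and percents copied from the blood-pressure table.
CLASSES = ["under 100", "100-110", "110-120", "120-130",
           "130-140", "140-150", "150-160", "over 160"]
NON_USERS = [8, 20, 31, 19, 13, 6, 2, 1]
USERS = [6, 12, 26, 22, 17, 11, 4, 2]


def percent_at_or_above(percents, first_class):
    """Total percent in the class at index first_class and all higher ones."""
    return sum(percents[first_class:])


# Side-by-side text bars: one '#' per percentage point.
print(f"{'pressure (mm)':<14}{'non-users':<34}users")
for label, non_user, user in zip(CLASSES, NON_USERS, USERS):
    print(f"{label:<14}{'#' * non_user:<34}{'#' * user}")

i = CLASSES.index("130-140")
print(percent_at_or_above(NON_USERS, i))  # 22
print(percent_at_or_above(USERS, i))      # 34
```

In this table, 34% of the users but only 22% of the non-users have a blood pressure of 130 mm or more, so the users' distribution sits further to the right.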