AMS 5
GRAPHICAL DESCRIPTIVE METHODS

Histograms

In the US, how are incomes distributed? In March 1973, 50,000 American families reported their income for the previous year. Of course, these data have to be summarized: nobody wants to look at all these numbers. A graph that is often used to summarize data is the histogram.

Read a Histogram

A histogram consists of blocks drawn over class intervals, e.g. ($1,000-$2,000), ($2,000-$3,000), …, ($25,000-$30,000).

In a histogram, the areas of the blocks represent percentages.

About what percentage of the families earned between $10,000 and $25,000?
Were there more families with incomes between $10,000 and $25,000 or between $15,000 and $25,000?

About 1% of the families in the previous figure had incomes between $0 and $1,000. Estimate the percentage who had incomes
a) $1,000-$2,000
b) $2,000-$3,000
c) $3,000-$4,000
d) $4,000-$5,000
e) $4,000-$7,000
f) $7,000-$10,000

Distribution Table

Income level          Percent
$0-$1,000                1
$1,000-$2,000            2
$2,000-$3,000            3
$3,000-$4,000            4
$4,000-$5,000            5
$5,000-$6,000            5
$6,000-$7,000            5
$7,000-$10,000          15
$10,000-$15,000         26
$15,000-$25,000         26
$25,000-$50,000          8
$50,000 and over         1

In distribution tables you need to be careful with the endpoint conventions. In the previous table the left endpoint is included in the class interval, while the right endpoint is excluded. The percents in the table do not add up to 100% because of rounding. Finally, we will ignore the last class ($50,000 and over).

Drawing a Histogram

Put down a horizontal axis and mark the class-interval endpoints to scale. The next step is to draw the blocks. DON'T PLOT THE PERCENTS: do not make the heights of the blocks equal to the percents.

A histogram drawn that way wrongly suggests that there were many more families with incomes over $25,000 than under $7,000. According to the table, 25% of the families had incomes under $7,000, while only about 9% had incomes of $25,000 or more.

The problem is that the class intervals have different lengths. The 8% who earn $25,000-$50,000 are spread over a larger range of incomes than the 15% who earn $7,000-$10,000.
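The comparison above can be made concrete by dividing each percent by the length of its class interval. The following is a minimal sketch in plain Python, not part of the original slides; the function name `percent_per_thousand` is hypothetical, and the intervals are written in thousands of dollars as an assumption of convenience.

```python
# Class intervals in thousands of dollars and the percent of families in each,
# copied from the distribution table. The open-ended "$50,000 and over" class
# is left out, as in the text.
intervals = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 7),
             (7, 10), (10, 15), (15, 25), (25, 50)]
percents = [1, 2, 3, 4, 5, 5, 5, 15, 26, 26, 8]


def percent_per_thousand(intervals, percents):
    """Divide each percent by the length of its class interval (in $1,000)."""
    return [p / (right - left) for (left, right), p in zip(intervals, percents)]


heights = percent_per_thousand(intervals, percents)
for (left, right), p, h in zip(intervals, percents, heights):
    print(f"{left:>2}-{right:<2} thousand: {p:>2}% of families, {h:.2f}% per $1,000")
```

Under these assumptions the 15% in $7,000-$10,000 works out to 5% per $1,000, while the 8% in $25,000-$50,000 works out to only 0.32% per $1,000.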
Plotting the percents directly ignores the difference in interval lengths and makes the blocks over the longer class intervals too big.

Income level        Percent (P)   Length in $1,000 (L)   Height = P / L
$0-$1,000                1                 1                  1
$1,000-$2,000            2                 1                  2
$2,000-$3,000            3                 1                  3
$3,000-$4,000            4                 1                  4
$4,000-$5,000            5                 1                  5
$5,000-$6,000            5                 1                  5
$6,000-$7,000            5                 1                  5
$7,000-$10,000          15                 3                  5
$10,000-$15,000         26                 5                  5.2
$15,000-$25,000         26                10                  2.6
$25,000-$50,000          8                25                  0.32

Units on the vertical scale: the height of the block over the interval $7,000 to $10,000, for example, is 5% per $1,000, i.e. each thousand-dollar interval between $7,000 and $10,000 contains about 5% of the families.

Density Scale

In the previous example the histogram was drawn using the density scale. Remember that the areas of the blocks come out in percent. A tall block means that a large chunk of area accumulates over a small portion of the horizontal scale, so the density of the data is high in the intervals where the height is large. In other words, the data are more crowded in those intervals.

In a histogram, the height of a block represents crowding: percentage per horizontal unit.

Example: Looking only at the histogram, about what percent of the families had incomes between $15,000 and $25,000?

Answer: The height of the block is 2.6% per $1,000, i.e. each thousand-dollar interval between $15,000 and $25,000 contains about 2.6% of the families. There are 10 such intervals, so the answer is 10 × 2.6% = 26%.

The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram should therefore be 100%.

Other Types of Histogram

Raw-frequency histograms.
Relative-frequency histograms.
Use these only when the class intervals all have the same length.

Example: civil-service examination scores in Chicago, 1966.
Value  Raw freq.  Rel. freq. (%)     Value  Raw freq.  Rel. freq. (%)     Value  Raw freq.  Rel. freq. (%)
 26       1          0.45             48       8          3.59             68       2          0.90
 27       4          1.79             49       4          1.79             69       8          3.59
 29       1          0.45             50       2          0.90             71       2          0.90
 30       4          1.79             51       5          2.24             72       1          0.45
 31       3          1.35             52       5          2.24             73       1          0.45
 32       2          0.90             53       5          2.24             74       3          1.35
 33       5          2.24             54       5          2.24             75       2          0.90
 34       3          1.35             55       3          1.35             76       2          0.90
 35       2          0.90             56       5          2.24             78       1          0.45
 36       3          1.35             57       4          1.79             80       4          1.79
 37       7          3.14             58       8          3.59             81       3          1.35
 39       7          3.14             59       4          1.79             82       2          0.90
 40       1          0.45             60       6          2.69             83       4          1.79
 41       1          0.45             61       6          2.69             84       7          3.14
 42       5          2.24             62       3          1.35             90       3          1.35
 43       8          3.59             63       2          0.90             91       3          1.35
 44       6          2.69             64       1          0.45             92       3          1.35
 45       7          3.14             65       1          0.45             93       4          1.79
 46       6          2.69             66       3          1.35             95       2          0.90
 47       6          2.69             67       4          1.79             Total  223        100.0

Raw/Relative-Frequency Histograms

[Figures: raw-frequency histogram (vertical axis: raw frequency) and relative-frequency histogram (vertical axis: relative frequency) of the examination scores; horizontal axis: scores, 20 to 100.]

The two graphs are identical. In the second, just re-label the vertical axis so that, for example, a raw frequency of 1 now corresponds to (1/223) × 100% = 0.45%.

Relative-frequency histograms are preferred when you want to compare two histograms based on different numbers of observations.

Topics on the Number of Blocks and the Class Interval Lengths

It is usually simpler to have all intervals of the same length, although the choice of interval length depends on the variable of interest. For example, suppose that you want to plot a histogram of educational level (years of schooling completed; kindergarten doesn't count) for persons age 25 and over in the US. It is quite reasonable to use intervals of different widths that represent the different stages of the educational system.

Back to the US income example. The first intervals are quite "skinny". Do you think it would look good to divide, for example, the last "fat" interval into skinny ones?

How many blocks? There are many different histograms you can make with the same variable.
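To illustrate the point, here is a minimal sketch in plain Python that groups the same examination scores into blocks of different widths. It is not from the slides: the function name `bin_counts`, the starting point of 20, and the widths 1, 5, and 20 are assumptions chosen for illustration. The left endpoint of each block is included and the right endpoint excluded.

```python
# Raw frequencies copied from the score table (score: number of candidates).
freq = {
    26: 1, 27: 4, 29: 1, 30: 4, 31: 3, 32: 2, 33: 5, 34: 3, 35: 2, 36: 3,
    37: 7, 39: 7, 40: 1, 41: 1, 42: 5, 43: 8, 44: 6, 45: 7, 46: 6, 47: 6,
    48: 8, 49: 4, 50: 2, 51: 5, 52: 5, 53: 5, 54: 5, 55: 3, 56: 5, 57: 4,
    58: 8, 59: 4, 60: 6, 61: 6, 62: 3, 63: 2, 64: 1, 65: 1, 66: 3, 67: 4,
    68: 2, 69: 8, 71: 2, 72: 1, 73: 1, 74: 3, 75: 2, 76: 2, 78: 1, 80: 4,
    81: 3, 82: 2, 83: 4, 84: 7, 90: 3, 91: 3, 92: 3, 93: 4, 95: 2,
}


def bin_counts(freq, start, stop, width):
    """Count scores in equal-width blocks [left, left + width) from start to stop."""
    counts = [0] * ((stop - start) // width)
    for value, count in freq.items():
        counts[(value - start) // width] += count
    return counts


total = sum(freq.values())
for width in (1, 5, 20):
    counts = bin_counts(freq, 20, 100, width)
    print(f"width {width:>2}: {len(counts)} blocks, "
          f"tallest block holds {max(counts)} of {total} scores")
```

Because every block in a given run has the same width, raw counts are a fair basis for comparing block heights within that run.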
For the exam-score example we used the extreme of having the largest possible number of very skinny blocks. This is not a very good idea: the pattern is lost in the detail, and with a different sample the resulting histogram would probably look completely different. On the other hand, with too few blocks the pattern of the sample is lost within the blocks.

There are mathematical formulas and empirical rules that relate the sample size to the number of blocks. Most computer programs also produce a reasonable number of blocks by default. The default raw-frequency histogram that the program Stata produces for the exam-score example is the following:

[Figure: default Stata raw-frequency histogram of the examination scores; horizontal axis: scores, 20 to 100; vertical axis: raw frequency, 0 to 30.]

Cross Tabulation

In many situations we need an exploratory analysis of the data to look for possible associations with a discrete variable. For example, consider measuring the blood pressure of women and dividing them into two groups: one taking the contraceptive pill and the other not taking it. We can produce a table with the distribution of one group in one column and the distribution of the other group in another column. The table can then be used to draw two histograms for a visual comparison of the two groups. The variable used for the cross-tabulation is usually referred to as a covariable.

Blood pressure (mm)   Non-users (%)   Users (%)
under 100                   8              6
100-110                    20             12
110-120                    31             26
120-130                    19             22
130-140                    13             17
140-150                     6             11
150-160                     2              4
over 160                    1              2
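The comparison of the two groups can also be sketched numerically. The following plain-Python example is an illustration only, not part of the slides: the helper name `percent_at_or_above` and the 130 mm cutoff are assumptions, and the text bars use one `#` per percentage point, which is a fair comparison only for the classes of equal width (the two open-ended classes have no defined length).

```python
# Percent of women in each blood-pressure class, copied from the table.
classes = ["under 100", "100-110", "110-120", "120-130",
           "130-140", "140-150", "150-160", "over 160"]
non_users = [8, 20, 31, 19, 13, 6, 2, 1]
users = [6, 12, 26, 22, 17, 11, 4, 2]


def percent_at_or_above(percents, first_class):
    """Add up the percents from the class at index first_class upward."""
    return sum(percents[first_class:])


# Side-by-side text bars: one '#' per percentage point.
for label, n, u in zip(classes, non_users, users):
    print(f"{label:>9} | non-users {'#' * n:<31} | users {'#' * u}")

cutoff = classes.index("130-140")
print("130 mm or more, non-users:", percent_at_or_above(non_users, cutoff), "%")
print("130 mm or more, users:    ", percent_at_or_above(users, cutoff), "%")
```

In this table 22% of the non-users and 34% of the users fall in the classes from 130 mm upward, which is the kind of shift the two histograms would show visually.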