ECON 309 Lecture 11: Disaggregation

I. The Need for Disaggregation

Disaggregation means taking “overall” or “total” figures and breaking them down by subgroups. Disaggregation can be important because subgroups may differ substantially, in ways that are obscured by the overall or total figures. In some cases, it will turn out that the overall or total figures are driven mostly by just one subgroup.

Example. Suppose you’re interested in how people commit suicide, perhaps with the intent of creating suicide prevention programs. You might look at the following figures for different suicide methods in the year 2003:

Suicide Rates (and Percents of Total) by Method
(rates are per 100,000 population)

           Firearm    Suffoc.    Poisoning  Other      Total
           5.81       2.28       1.88       0.86       10.83
           (53.6%)    (21.1%)    (17.4%)    (7.9%)

So you might conclude that firearm suicides are the most common kind of suicide, with suffocation and poisoning in (distant) second and third places. But what happens if we break this down by gender?

Suicide Rates (and Percents of Total) by Method and Gender
(rates are per 100,000 population)

           Firearm    Suffoc.    Poisoning  Other      Total
Male       10.37      3.84       2.13       1.28       17.62
           (58.9%)    (21.8%)    (12.1%)    (7.3%)
Female     1.41       0.77       1.63       0.44       4.25
           (33.2%)    (18.1%)    (38.3%)    (10.4%)
Overall    5.81       2.28       1.88       0.86       10.83
           (53.6%)    (21.1%)    (17.4%)    (7.9%)

For men, the ranking still holds: firearms most common, followed by suffocation and then poisoning. But for women, the ordering is completely different: poisoning is most common, then firearms, then suffocation. Or to put it another way, men’s third-choice method (poisoning) is women’s first-choice method.

So why are the overall figures so much closer to the men’s figures? Simple: men commit suicide a lot more often than women do, as the totals-by-gender column on the far right clearly shows. Looking only at the overall figures could be misleading, because you might want to adopt different suicide-prevention programs for men and women if you knew their methods differed.

II.
Pivot Tables

The table above is useful because it breaks down data in two different, and cross-cutting, ways: by method and by gender. It will often be useful to disaggregate data by more than one kind of classification. Fortunately, Excel has a powerful tool for doing this: the pivot table.

To create a pivot table, click in one cell of your data set, then go to the Insert tab, click on Pivot Table, and select Pivot Table. This will bring up a dialogue box. In that box, click the icon next to “Select a table or range” and then highlight all the data you intend to use (including labels). Choose a location for the table (either a new worksheet or the same worksheet as the data) and click OK. This will create the frame for a pivot table, but with no data in it yet.

Then you can drag fields (which are the column headings for your data) into the table. Pull the fields from the field list on the right.

Pull a field into the row area if you want the different elements of that field to be the rows of the table. (E.g., if the data points are people, and the “Gender” column tells the gender for each data point, then you can drag Gender into the row area to get a row for men and a row for women.)

Pull a field into the column area if you want the different elements of that field to be columns. (E.g., if the “Major” column tells the college major for each person, you can pull this into the column area to get a column for each major.)

Finally, pull a field into the data area if that field includes the kind of figures you want analyzed. (E.g., if the “Income” column tells the income for each person, pull this into the data area to get analysis of people’s income. In this example, you’d have each cell giving income broken down by gender and major.)

You will need to practice with the Pivot Table tool to get used to it. In particular, Excel will make assumptions about what you want done with your data, and you may have to change them.
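
The row, column, and data areas of a pivot table can also be sketched in a few lines of Python, which may help make clear what Excel is doing when fields are dragged into place. This is only an illustration: the six records, the field values, and the helper names pivot and average are hypothetical and are not taken from majors.xls. Each cell collects the Income values for one Gender/Major combination, and the summarize argument decides whether the cell shows a sum or an average.

```python
from collections import defaultdict

# Hypothetical records (NOT from majors.xls): one dict per person.
people = [
    {"Gender": "Men", "Major": "Accounting", "Income": 50000},
    {"Gender": "Men", "Major": "Accounting", "Income": 60000},
    {"Gender": "Men", "Major": "Economics", "Income": 55000},
    {"Gender": "Women", "Major": "Accounting", "Income": 58000},
    {"Gender": "Women", "Major": "Economics", "Income": 52000},
    {"Gender": "Women", "Major": "Economics", "Income": 62000},
]


def pivot(records, row_field, col_field, data_field, summarize):
    """Group data_field values by (row, column) and summarize each cell."""
    cells = defaultdict(list)
    for rec in records:
        cells[(rec[row_field], rec[col_field])].append(rec[data_field])
    return {key: summarize(values) for key, values in cells.items()}


def average(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


# Same rows, columns, and data field; only the summary function differs.
sums = pivot(people, "Gender", "Major", "Income", sum)
averages = pivot(people, "Gender", "Major", "Income", average)

for key in sorted(sums):
    print(key, "sum:", sums[key], "average:", averages[key])
```

With these made-up records, the Men/Accounting cell holds two incomes, so its sum and its average differ. That is the kind of difference to watch for when checking which summary a pivot table is reporting.
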
By default, a pivot table gives you sums of the data that you group together. For instance, in the Men/Accounting cell, the table described above would give the sum of all male accounting majors’ incomes. If you want something else, such as the average, you will need to right-click the data area in the table, select “summarize data by,” and choose the option you want.

[Use majors.xls data set for simple demonstration. Then use popularity.xls data set for more complex demonstration. The latter will require four fields to be dragged into the data area; then right-click on data and choose Order > Move to Column.]

III. Simpson’s Paradox

Simpson’s Paradox is a strange (but actually not that uncommon) result that can be observed when data is disaggregated. Simpson’s Paradox says that between two categories A and B, a percentage or rate can be higher for A than for B overall, and yet when the data is disaggregated over subgroups, the percentage or rate can be lower for A than for B in every subgroup.

Here’s a famous example: UC Berkeley was sued for discrimination against women applying for graduate school. Overall, women were rejected at a much higher rate than men. However, if you looked at admissions department by department (English, economics, physics, etc.), it turned out that men were rejected more often than women in almost every department, and in no case was there a significant difference in favor of men. How is this possible? Women were applying in larger numbers to the departments with higher rejection rates, while men were applying in larger numbers to the departments with lower rejection rates.

[Use hospitals.xls data set to demonstrate. Overall, Memorial hospital seems to be doing better at saving patients, with a 54% survival rate compared to Regent’s 48%. However, if you break down the data by type of patient, Regent does better than Memorial for both critical and non-critical patients. So why does Regent appear worse overall?
Because Regent gets a disproportionate number of critical patients, who die more often regardless of the hospital. You can see this by changing the Field Settings on “Survival” to Count instead of Average.]

Another classic case of Simpson’s Paradox involved two different treatments for kidney stones. Treatment B seemed better overall than Treatment A. But A was better than B for small stones, and A was also better than B for large stones. How did this happen? Treatment A was being used more often for the large stones, which are more difficult to treat, so its overall success rate was pulled down by its harder mix of cases.
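
The kidney-stone reversal can be checked with a little arithmetic. The sketch below uses the success counts commonly quoted for this study. These counts are not given in these notes, so treat them as an assumption brought in for illustration; the names data, rate, and overall_rate are likewise just illustrative. The code compares the two treatments within each stone size, then pools the sizes, and finally shows how unevenly the large stones were split between the treatments.

```python
# (successes, cases) per (treatment, stone size). These are the counts
# commonly quoted for the kidney-stone study; they are an ASSUMPTION here,
# since the lecture notes do not list them.
data = {
    ("A", "small"): (81, 87),
    ("A", "large"): (192, 263),
    ("B", "small"): (234, 270),
    ("B", "large"): (55, 80),
}


def rate(successes, cases):
    """Success rate as a fraction between 0 and 1."""
    return successes / cases


def overall_rate(treatment):
    """Pool all stone sizes for one treatment, then compute its rate."""
    successes = sum(s for (t, _), (s, n) in data.items() if t == treatment)
    cases = sum(n for (t, _), (s, n) in data.items() if t == treatment)
    return rate(successes, cases)


# Disaggregated comparison: one line per stone size.
for size in ("small", "large"):
    a = rate(*data[("A", size)])
    b = rate(*data[("B", size)])
    print(f"{size} stones: A = {a:.1%}, B = {b:.1%}")

# Aggregated comparison: stone sizes pooled.
print(f"overall: A = {overall_rate('A'):.1%}, B = {overall_rate('B'):.1%}")

# Share of each treatment's cases that were large (harder) stones.
for t in ("A", "B"):
    total = sum(n for (tt, _), (_, n) in data.items() if tt == t)
    share = data[(t, "large")][1] / total
    print(f"Treatment {t}: {share:.0%} of cases were large stones")
```

Under these counts, A comes out ahead within each stone size but behind once the sizes are pooled, because most of A's cases were large stones while most of B's were small ones.
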