Disaggregation

ECON 309
Lecture 11: Disaggregation
I. The Need for Disaggregation
Disaggregation means taking “overall” or “total” figures and breaking them down by
subgroups. Disaggregation can be important because subgroups may differ substantially,
in ways that are obscured by the overall or total figures. In some cases, it will turn out
that the overall or total figures are driven mostly by just one subgroup.
Example. Suppose you’re interested in how people commit suicide, perhaps with the
intent of creating suicide prevention programs. You might look at the following figures
for different suicide methods in the year 2003:
Suicide Rates (and Percents of Total) by Method
(rates are per 100,000 population)
Firearm    Suffoc.    Poisoning    Other     Total
5.81       2.28       1.88         0.86      10.83
(53.6%)    (21.1%)    (17.4%)      (7.9%)
So you conclude that firearm suicides are the most common kind of suicide, with
suffocation and poisoning in (distant) second and third places. But what happens if we
break this down by gender?
Suicide Rates (and Percents of Total) by Method and Gender
(rates are per 100,000 population)
           Firearm    Suffoc.    Poisoning    Other     Total
Male       10.37      3.84       2.13         1.28      17.62
           (58.9%)    (21.8%)    (12.1%)      (7.3%)
Female     1.41       0.77       1.63         0.44      4.25
           (33.2%)    (18.1%)    (38.3%)      (9.8%)
Overall    5.81       2.28       1.88         0.86      10.83
           (53.6%)    (21.1%)    (17.4%)      (7.9%)
For men, the ranking still holds: firearms most common, followed by suffocation and
then poisoning. But for women, the ordering is completely different: poisoning is most
common, then firearms, then suffocation. Or to put it another way, men’s third-choice
method (poisoning) is women’s first-choice method.
So why are the overall figures so much closer to the men’s figures? Simple: men
commit suicide a lot more often than women do, as the totals-by-gender column on the
far right clearly shows.
Looking only at the overall figures could be misleading: if you knew that men and women
differ in this way, you might want to adopt different suicide-prevention programs for
each group.
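The pull of the men's figures on the overall figures can be checked with a short Python sketch. It uses only the rates in the table above. It assumes the rates are crude rates, so that each overall rate is a population-weighted average of the male and female rates; the function names are illustrative and not part of the lecture.

```python
# Rates per 100,000 population, copied from the table above.
rates = {
    "Male":    {"Firearm": 10.37, "Suffocation": 3.84, "Poisoning": 2.13, "Other": 1.28},
    "Female":  {"Firearm": 1.41,  "Suffocation": 0.77, "Poisoning": 1.63, "Other": 0.44},
    "Overall": {"Firearm": 5.81,  "Suffocation": 2.28, "Poisoning": 1.88, "Other": 0.86},
}

def ranking(by_method):
    """Order the named methods from highest rate to lowest ("Other" is left out)."""
    named = {method: rate for method, rate in by_method.items() if method != "Other"}
    return sorted(named, key=named.get, reverse=True)

def male_weight(male_total, female_total, overall_total):
    """Solve overall = w * male + (1 - w) * female for w.

    Assumption: the rates are crude rates, so w is the male share of the population.
    """
    return (overall_total - female_total) / (male_total - female_total)

totals = {group: sum(by_method.values()) for group, by_method in rates.items()}
rankings = {group: ranking(by_method) for group, by_method in rates.items()}

w = male_weight(totals["Male"], totals["Female"], totals["Overall"])
# Share of all suicides that are male: population share times the male rate,
# divided by the overall rate.
male_share_of_suicides = w * totals["Male"] / totals["Overall"]

for group in rates:
    print(group, round(totals[group], 2), rankings[group])
print("implied male share of population:", round(w, 3))
print("implied male share of suicides:  ", round(male_share_of_suicides, 3))
```

Under that assumption, men come out as roughly half of the population but roughly 80% of all suicides, which is why the overall ranking looks so much like the men's ranking.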
II. Pivot Tables
The table above is useful because it breaks down data in two different, and cross-cutting,
ways: by method and by gender. It will often be useful to disaggregate data by more
than one kind of classification. Fortunately, Excel has a powerful tool for doing this: the
pivot table.
To create a pivot table, click in one cell of your data set, then go to the Insert tab, click on
Pivot Table, and select Pivot Table. This will bring up a dialogue box. In that box, click
the icon next to “Select a table or range” and then highlight all the data you intend to use
(including labels). Choose a location for the table (either a new worksheet or the same
worksheet as the data) and click OK. This will create the frame for a pivot table, but with
no data in it yet.
Then you can drag fields (which are the column headings for your data) into the table.
Pull the fields from the field list on the right. Pull a field into the row area if you want
the different elements of that field to be the rows of the table. (E.g., if the data points are
people, and the “Gender” column tells the gender for each data point, then you can drag
Gender into the row area to get a row for men and a row for women.) Pull a field into the
column area if you want the different elements of that field to be columns. (E.g., if the
“Major” column tells the college major for each person, you can pull this into the column
area to get a column for each major.) Finally, pull a field into the data area if that field
includes the kind of figures you want analyzed. (E.g., if the “Income” figure tells the
income for each person, pull this into the data area to get analysis of people’s income. In
this example, you’d have each cell giving income broken down by gender and major.)
You will need to practice with the Pivot Table tool to get used to it. In particular, Excel
will make assumptions about what you want done with your data, and you may have to
change them. For instance, by default, a pivot table gives you sums of the data that you
group together: in the Men/Accounting cell, the table described above would give
the sum of all male accounting majors’ incomes. If you want something else, such as the
average, you will need to right-click the data area in the table, select “summarize data
by,” and choose the option you want.
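For readers who want to see what the pivot table is computing, here is a minimal Python sketch of the same grouping logic. It is not how Excel implements pivot tables; the records, the incomes, and the pivot function are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: one record per person, like a worksheet with
# Gender, Major, and Income columns. The incomes are invented.
people = [
    {"Gender": "Male",   "Major": "Accounting", "Income": 50000},
    {"Gender": "Male",   "Major": "Accounting", "Income": 60000},
    {"Gender": "Male",   "Major": "Economics",  "Income": 55000},
    {"Gender": "Female", "Major": "Accounting", "Income": 58000},
    {"Gender": "Female", "Major": "Economics",  "Income": 52000},
    {"Gender": "Female", "Major": "Economics",  "Income": 62000},
]

def pivot(records, row_field, column_field, data_field, summarize=sum):
    """Group data_field by (row value, column value) and summarize each group."""
    cells = defaultdict(list)
    for record in records:
        key = (record[row_field], record[column_field])
        cells[key].append(record[data_field])
    return {key: summarize(values) for key, values in cells.items()}

# Default behavior described above: each cell holds a sum.
sums = pivot(people, "Gender", "Major", "Income")
# Equivalent of switching the summary to Average.
averages = pivot(people, "Gender", "Major", "Income", summarize=mean)

for (gender, major), total in sorted(sums.items()):
    print(gender, major, "sum:", total, "average:", averages[(gender, major)])
```

With these invented incomes, the Male/Accounting cell holds 110,000 when summarized by sum and 55,000 when summarized by average, which is the difference the default setting hides.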
[Use majors.xls data set for simple demonstration. Then use popularity.xls data set for
more complex demonstration. The latter will require four fields to be dragged into the
data area; then right-click on data and choose Order → Move to column.]
III. Simpson’s Paradox
Simpson’s Paradox is a strange – but actually not that uncommon – result that can be
observed when data is disaggregated. Simpson’s Paradox says that between two
categories A and B, a percentage or rate can be higher for A than B overall, and yet when
the data is disaggregated over subgroups, the percentage or rate can be lower for A than
B in every subgroup.
Here’s a famous example: UC Berkeley was sued for discrimination against women
applying for graduate school. Women were rejected a good deal more often than men.
However, if you looked at admission department by department (English, economics,
physics, etc.), it turned out that men were rejected more often than women in almost
every department, and in no case was there a significant difference in favor of men. How
is this possible? Women were applying in larger numbers to the departments with higher
rejection rates, while men were applying in larger numbers to the departments with lower
rejection rates.
[Use hospitals.xls data set to demonstrate. Overall, Memorial hospital seems to be doing
better at saving patients, with 54% survival rate compared to Regent’s 48%. However, if
you break down the data by type of patient, Regent does better than Memorial for both
critical and non-critical patients. So why does Regent appear worse overall? Because
Regent gets a disproportionate number of critical patients, who die more often regardless
of the hospital. You can see this by changing the Field Settings on “Survival” to Count
instead of Average.]
Another classic case of Simpson’s Paradox involved two different treatments for kidney
stones. Treatment B seemed better overall than Treatment A. But A was better than B
for small stones, and A was also better than B for large stones. How did this happen?
Treatment A was being used more often for the large stones, which are more difficult to
treat.
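The kidney stone case can be worked through numerically. The sketch below uses the success counts commonly quoted for this study; those counts do not appear in these notes, so treat them as an outside assumption. The variable and function names are illustrative.

```python
# (successes, patients) by treatment and stone size.
# Assumption: these are the counts commonly quoted for the kidney stone study;
# they are not given in the lecture notes.
outcomes = {
    "A": {"small": (81, 87),   "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def success_rate(successes, patients):
    """Fraction of patients treated successfully."""
    return successes / patients

def overall_rate(by_size):
    """Pool both stone sizes, then compute a single success rate."""
    successes = sum(s for s, _ in by_size.values())
    patients = sum(n for _, n in by_size.values())
    return success_rate(successes, patients)

# Disaggregated rates: one per treatment and stone size.
subgroup = {
    treatment: {size: success_rate(*counts) for size, counts in by_size.items()}
    for treatment, by_size in outcomes.items()
}
# Aggregated rates: one per treatment.
overall = {treatment: overall_rate(by_size) for treatment, by_size in outcomes.items()}
# Share of each treatment's patients who had large stones.
large_share = {
    treatment: by_size["large"][1] / sum(n for _, n in by_size.values())
    for treatment, by_size in outcomes.items()
}

for treatment in outcomes:
    print(
        treatment,
        "small:", round(subgroup[treatment]["small"], 3),
        "large:", round(subgroup[treatment]["large"], 3),
        "overall:", round(overall[treatment], 3),
        "share large:", round(large_share[treatment], 3),
    )
```

With these counts, A has the higher success rate within each stone size, yet B has the higher pooled rate, because about three quarters of A's patients had large stones compared with under a quarter of B's.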