Displaying and Describing
Categorical Data
CHAPTER 3
Objectives
• Frequency Table
• Relative Frequency Table
• Distribution
• Area Principle
• Bar Chart
• Pie Chart
• Contingency Table
• Marginal Distribution
• Conditional Distribution
• Independence
• Segmented Bar Chart
• Simpson’s Paradox
The Three Rules of Data Analysis
• The three rules of data analysis won’t be
difficult to remember:
1. Make a picture—things may be revealed
that are not obvious in the raw data.
These will be things to think about.
2. Make a picture—important features of and
patterns in the data will show up. You may
also see things that you did not expect.
3. Make a picture—the best way to tell
others about your data is with a well-chosen picture.
Frequency Table
• What is a frequency table? A
frequency table is an organization of
raw data in tabular form, using
classes (or intervals) and
frequencies.
• What is a frequency count? The
frequency or the frequency count for
a data value is the number of times
the value occurs in the data set.
Categorical Frequency Tables
• NOTE: Later we will consider quantitative
frequency tables.
• What is a categorical frequency table? A
categorical frequency table represents data
that can be placed in specific categories,
such as gender, hair color, political affiliation,
etc.
Frequency Table
A frequency table is a tabular summary of
data showing the frequency (or number) of items
in each of several non-overlapping categories.
The objective is to provide insights about the
data that cannot be quickly obtained by looking
only at the original data.
Frequency Tables:
• We can “organize” the data by counting
the number of data values in each
category of interest.
• We can organize these counts into a
frequency table, which records the totals
and the category names.
Categorical Frequency Table
• Example: The blood types of 25 blood donors
are given below. Summarize the data using
a frequency distribution.
AB, O, B, A, A,
B, B, O, O, B,
A, O, B, AB, AB,
O, A, B, AB, O,
B, O, B, O, A
Categorical Frequency Table for
the Blood Types
Note: The classes for the distribution are the
blood types.
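The tally can be sketched in a few lines of Python. This is only an illustration: it assumes the 25 values are read in the order listed above, and the variable names are made up for the example.

```python
from collections import Counter

# Blood types of the 25 donors, in the order they are listed above.
blood_types = [
    "AB", "O", "B", "A", "A",
    "B", "B", "O", "O", "B",
    "A", "O", "B", "AB", "AB",
    "O", "A", "B", "AB", "O",
    "B", "O", "B", "O", "A",
]

# Counter tallies how many times each class (blood type) occurs.
counts = Counter(blood_types)

# Print the frequency table, one class per row, followed by the total.
print(f"{'Class':<6}{'Frequency':>10}")
for blood_type in ["A", "B", "O", "AB"]:
    print(f"{blood_type:<6}{counts[blood_type]:>10}")
print(f"{'Total':<6}{sum(counts.values()):>10}")
```

The total row is a quick check that no donor was dropped or counted twice.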
Your Turn
• Guests staying at Marada Inn were asked to rate the
quality of their accommodations as being excellent
(E), above average (AA), average (A), below average
(BA), or poor (P). The ratings provided by a sample of
20 guests:
• BA, AA, AA, A, AA, A, AA, A, AA, BA, P, E, AA, A, AA,
AA, BA, P, AA, A.
• Make a frequency table.
Categorical Frequency
Distribution
Rating                Counts
Poor (P)                   2
Below Average (BA)         3
Average (A)                5
Above Average (AA)         9
Excellent (E)              1
Total                     20
Relative Frequency Tables:
• A relative frequency table is similar, but gives the relative
frequency, a decimal or percentage (instead of counts) for
each category.
Relative Frequency Table
The relative frequency of a class is the fraction or
proportion of the total number of data items
belonging to the class.
A relative frequency table is a tabular summary of
a set of data showing the relative frequency for
each class.
Percent Frequency Table
The percent frequency of a class is the relative
frequency multiplied by 100.
A percent frequency table is a tabular
summary of a set of data showing the percent
frequency for each class.
Relative & Percent Categorical
Frequency Table
• Using the frequency table below, from the
Marada Inn problem, create a relative and
percent frequency table.
• Add two additional columns labeled
relative frequency and percent frequency.
Rating                Counts
Poor (P)                   2
Below Average (BA)         3
Average (A)                5
Above Average (AA)         9
Excellent (E)              1
Total                     20
Relative & Percent Categorical
Frequency Table
Rating        Frequency   Rel. Freq.     %Freq
Poor                  2   2/20 = .10       10%
Below avg.            3   3/20 = .15       15%
Avg.                  5   5/20 = .25       25%
Above avg.            9   9/20 = .45       45%
Excellent             1   1/20 = .05        5%
Total:               20   1               100%
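As a rough sketch, the two extra columns can be computed from the counts like this in Python (the dictionary layout and names are illustrative assumptions, not part of the original problem):

```python
# Counts from the Marada Inn frequency table.
counts = {"Poor": 2, "Below avg.": 3, "Avg.": 5, "Above avg.": 9, "Excellent": 1}

total = sum(counts.values())

# Relative frequency = count / total; percent frequency = relative frequency * 100.
rel_freq = {rating: n / total for rating, n in counts.items()}
pct_freq = {rating: 100 * n / total for rating, n in counts.items()}

print(f"{'Rating':<12}{'Freq':>6}{'Rel. Freq.':>12}{'%Freq':>8}")
for rating, n in counts.items():
    print(f"{rating:<12}{n:>6}{rel_freq[rating]:>12.2f}{pct_freq[rating]:>7.0f}%")
print(f"{'Total':<12}{total:>6}{sum(rel_freq.values()):>12.2f}{sum(pct_freq.values()):>7.0f}%")
```

The relative frequencies should add to 1 and the percent frequencies to 100%, up to rounding.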
Frequency Tables:
• All three types of tables show how
cases are distributed across the
categories.
• They describe the distribution of a
categorical variable because they
name the possible categories and
tell how frequently each occurs.
There are three kinds of lies:
lies, damned lies, and statistics.
Benjamin Disraeli (1804 - 1881)
Misleading Statistics
• Now that we have the frequency
table, we are ready to make a
picture or a graph of the data.
• Misleading graphs
• Scale
• Pictographs
Misleading Statistics - Scale
• Adjusting the scale of a graph is a
common way to mislead (or lie) with
statistics.
• Example:
Misleading Scale
Misleading Statistics
• The best data displays observe a
fundamental principle of graphing data
called the area principle.
• The area principle says that the area
occupied by a part of the graph should
correspond to the magnitude of the value it
represents.
• Violations of the area principle are another
common way of misleading with statistics.
What’s Wrong With This Picture?
• You might think that
a good way to show
the Titanic data is
with this display:
What’s Wrong With This Picture?
• The ship display makes it look like
most of the people on the Titanic were
crew members, with a few passengers
along for the ride.
• When we look at each ship, we see the
area taken up by the ship, instead of
the length of the ship.
• The ship display violates the area
principle:
• The area occupied by a part of the
graph should correspond to the
magnitude of the value it represents.
Area Principle - Pictographs
Double the length, width, and height
of a cube, and the volume increases
by a factor of eight.
Area Principle
Pictographs
GRAPHING CATEGORICAL OR
QUALITATIVE DATA
Ways to Graph Categorical Data
Because the variable is categorical,
the data in the graph can be
ordered any way we want
(alphabetical, by increasing value,
by year, by personal preference,
etc.).
1. Bar Charts – Each category is
represented by a bar.
2. Pie Charts - The slices must
represent the parts of one whole.
Bar Charts
• A bar chart displays the distribution of a categorical variable,
showing the counts for each category next to each other for easy
comparison.
• A bar chart stays true
to the area principle.
• Thus, a better display
for the ship data is:
Bar Charts (cont.)
• A relative frequency bar chart displays the relative proportion of
counts for each category.
• A relative frequency bar chart also stays true to the area
principle.
• Replacing counts
with percentages
in the ship data:
Bar Charts
• A bar chart is a graphical device for depicting
qualitative data.
• On one axis (usually the horizontal axis), we specify
the labels that are used for each of the categories.
• A frequency, relative frequency, or percent frequency
scale can be used for the other axis (usually the
vertical axis).
• Using a bar of fixed width drawn above each class
label, we extend the height appropriately.
• The bars are separated to emphasize the fact that each
class is a separate category.
Bar Charts
• Either counts (frequency bar chart) or proportions
(relative frequency bar chart) may be shown on the
y-axis. This will not change the shape or
relationships of the graph.
• Make sure all graphs have a descriptive title and
that the axes are labeled (this is true for all graphs in
AP Stats).
Pie Charts
• When you are interested in parts of the whole, a pie chart
might be your display of choice.
• Pie charts show the whole
group of cases as a circle.
• They slice the circle into
pieces whose size is
proportional to the
fraction of the whole
in each category.
Pie Chart
• The pie chart is a commonly used graphical device
for presenting relative frequency distributions for
qualitative data.
• First draw a circle; then subdivide the circle
into sectors that correspond in area to the
relative frequency for each category.
• Since there are 360 degrees in a circle,
a category with a relative frequency of .25 would
consume .25(360) = 90 degrees of the circle.
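To illustrate the 360-degree rule, here is a small Python sketch that reuses the Marada Inn counts from earlier in the chapter (choosing that data set, and the names used, are assumptions made for the example):

```python
# Counts from the Marada Inn frequency table.
counts = {"Poor": 2, "Below avg.": 3, "Avg.": 5, "Above avg.": 9, "Excellent": 1}

total = sum(counts.values())

# Sector angle = relative frequency * 360 degrees.
angles = {rating: 360 * n / total for rating, n in counts.items()}

for rating, angle in angles.items():
    print(f"{rating:<12}{angle:>6.0f} degrees")
```

The "Avg." category has relative frequency .25, so it gets the same 90-degree sector as in the rule above, and the angles add to 360.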
Relations between Two
Categorical Variables
• Examples:
• Is gender or race related to political
preference?
• What type of music can make people
relax?
• Will different packaging of the same
product attract people with different
socioeconomic backgrounds?
• A contingency table or two-way table is a
way to display the data from two categorical
variables. It works somewhat like a Venn diagram,
showing how a population splits according to
two factors.
Contingency Tables
• A contingency table allows us to look at two categorical
variables together.
• It shows how individuals are distributed along each variable,
contingent on the value of the other variable.
• Example: we can examine the class of ticket and whether a
person survived the Titanic:
Contingency Tables (cont.)
• The margins of the table, both on the right and on the
bottom, give totals and the frequency distributions for each of
the variables.
• Each frequency distribution is called a marginal distribution
of its respective variable.
• The marginal distribution of Survival is:
Contingency Tables (cont.)
• Each cell of the table gives the count for a combination
of values of the two variables.
• For example, the second cell in the crew column
tells us that 673 crew members died when the
Titanic sank.
Conditional Distributions
• A conditional distribution shows the distribution of one
variable for just the individuals who satisfy some
condition on another variable.
• The following is the conditional distribution of ticket
Class, conditional on having survived:
Conditional Distributions (cont.)
• The following is the conditional
distribution of ticket Class, conditional
on having perished:
Conditional Distributions (cont.)
• The conditional distributions tell us that there is a
difference in class for those who survived and those
who perished.
• This is better
shown with
pie charts of
the two
distributions:
Conditional Distributions (cont.)
• We see that the distribution of Class for the survivors is
different from that of the nonsurvivors.
• This leads us to believe that Class and Survival are
associated, that they are not independent.
• The variables would be considered independent when the
distribution of one variable in a contingency table is the
same for all categories of the other variable.
Segmented Bar Charts
• A segmented bar chart
displays the same
information as a pie chart,
but in the form of bars
instead of circles.
• Each bar is treated as the
“whole” and is divided
proportionally into
segments corresponding
to the percentage in each
group.
• Here is the segmented
bar chart for ticket Class
by Survival status:
Example: Income level vs. Job Satisfaction
                 Job Satisfaction
Income        1     2     3     4   Row Total
< 30K        20    24    80    82         206
30K-50K      22    38   104   125         289
50K-80K      13    28    81   113         235
> 80K         7    18    54    92         171
C. Total     62   108   319   412         901
Each row or column of cell counts gives a conditional distribution. The Row Total
column and the C. Total row give the marginal distributions. 901 is the table total.
• This is a contingency table with Income Level as the row variable and Job
Satisfaction as the column variable.
• The distribution of one variable within a single level of the other (job
satisfaction for one income level, or income for one satisfaction level) is
called a conditional distribution.
• The distributions of income alone and job satisfaction alone are called
marginal distributions.
• Relationships between categorical variables are described by calculating
appropriate percents from the counts given in each cell.
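The percents can be worked out along these lines. This is a minimal Python sketch under the assumption that the cell counts are stored row by row (income levels down, satisfaction levels 1 to 4 across); all names are illustrative.

```python
income_levels = ["< 30K", "30K-50K", "50K-80K", "> 80K"]

# Cell counts: one row per income level, one column per satisfaction level 1..4.
counts = [
    [20, 24, 80, 82],
    [22, 38, 104, 125],
    [13, 28, 81, 113],
    [7, 18, 54, 92],
]

row_totals = [sum(row) for row in counts]         # totals for each income level
col_totals = [sum(col) for col in zip(*counts)]   # totals for each satisfaction level
grand_total = sum(row_totals)                     # table total

# Marginal distributions: totals divided by the table total.
marginal_income = [t / grand_total for t in row_totals]
marginal_satisfaction = [t / grand_total for t in col_totals]

# Conditional distribution of satisfaction within each income level:
# each cell divided by its own row total.
satisfaction_given_income = [
    [cell / total for cell in row] for row, total in zip(counts, row_totals)
]

for level, dist in zip(income_levels, satisfaction_given_income):
    print(level, [f"{100 * p:.1f}%" for p in dist])
```

Dividing by the column totals instead would give the conditional distribution of income within each satisfaction level.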
Example:
• A Statistics class reports the following data on sex and eye
color for students in the class:
                     Eye Color
Sex        Blue   Brown   Green/Hazel/Other   Total
Males         6      20                   6      32
Females       4      16                  12      32
Total        10      36                  18      64
1. What percent of females are brown-eyed?
2. What percent of brown-eyed students are female?
3. What percent of students are brown-eyed females?
4. What’s the distribution of eye color?
5. What’s the conditional distribution of eye color for the males?
6. Compare the percent who are female among the blue-eyed
students to the percent of all students who are female.
7. Does it seem that eye color and sex are independent? Explain.
Solution:
                     Eye Color
Sex        Blue   Brown   Green/Hazel/Other   Total
Males         6      20                   6      32
Females       4      16                  12      32
Total        10      36                  18      64
1. What percent of females are brown-eyed? 16/32 = .5 or 50%
2. What percent of brown-eyed students are female?
16/36 = .444 or 44.4%
3. What percent of students are brown-eyed females?
16/64 = .25 or 25%
4. What’s the distribution of eye color?
10/64 = .156 or 15.6% Blue, 36/64 = .563 or 56.3% Brown,
18/64 = .281 or 28.1% Green/Hazel/Other
5. What’s the conditional distribution of eye color for the males?
6/32 = .188 or 18.8% Blue, 20/32 = .625 or 62.5% Brown,
6/32 = .188 or 18.8% Green/Hazel/Other
6. Compare the percent who are female among the blue-eyed students to the
percent of all students who are female.
4/10 = .4 or 40% of the blue-eyed students are female,
while 32/64 = .5 or 50% of all students are female.
7. Does it seem that eye color and sex are independent? Explain.
Since blue-eyed students appear less likely to be female, it seems that Sex and Eye
Color may not be independent. (But the numbers are small.)
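The same answers can be checked with a short Python sketch. The nested-dictionary layout and variable names below are assumptions made for illustration; only the counts come from the class data.

```python
# Counts from the class data: sex by eye color.
counts = {
    "Males": {"Blue": 6, "Brown": 20, "Green/Hazel/Other": 6},
    "Females": {"Blue": 4, "Brown": 16, "Green/Hazel/Other": 12},
}
colors = ["Blue", "Brown", "Green/Hazel/Other"]

sex_totals = {sex: sum(row.values()) for sex, row in counts.items()}
color_totals = {c: sum(counts[sex][c] for sex in counts) for c in colors}
n = sum(sex_totals.values())

# Questions 1-3: same cell (16), three different denominators.
brown_given_female = counts["Females"]["Brown"] / sex_totals["Females"]
female_given_brown = counts["Females"]["Brown"] / color_totals["Brown"]
brown_and_female = counts["Females"]["Brown"] / n

# Question 4: marginal distribution of eye color.
eye_color_marginal = {c: color_totals[c] / n for c in colors}

# Question 5: conditional distribution of eye color for the males.
eye_color_given_male = {c: counts["Males"][c] / sex_totals["Males"] for c in colors}

# Question 6: percent female among blue-eyed students vs. all students.
female_given_blue = counts["Females"]["Blue"] / color_totals["Blue"]
female_overall = sex_totals["Females"] / n

print(f"Q1 {brown_given_female:.1%}  Q2 {female_given_brown:.1%}  Q3 {brown_and_female:.1%}")
print(f"Q6 {female_given_blue:.1%} of blue-eyed vs. {female_overall:.1%} of all students")
```

Questions 1 to 3 all use the count 16; what changes is the total it is divided by, which is why the wording of a percentage question matters.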
SIMPSON’S PARADOX
Simpson’s Paradox
• Discovered by E. H. Simpson in 1951.
• Occurs when averaging different samples of different
sizes
• Two groups from one sample are compared to two similar
groups from another sample
• One sample’s success rate for both
groups is higher than the success rates
for the other sample
(Picture caption: Not E. H. Simpson)
Simpson’s Paradox
• However, when both groups’
respective success rates are
combined, the sample with the lower
success rates ends up with the better
overall proportion of successes. Thus,
the paradox.
• One group usually has considerably
fewer members than the others.
• Simpson’s Paradox does not occur when
the groups being combined are of similar
size in both samples.
What is Simpson’s Paradox?
• Simpson’s Paradox occurs
when an association
between two variables is
reversed upon observing a
third variable.
Simpson’s Paradox
• Simpson’s paradox: a lurking variable
creates a reversal in the direction of an
association (“confounding”)
• To uncover Simpson’s Paradox, divide
data into subgroups based on the lurking
variable
Recent Cleveland Indians season records
2003—68-94, 42.0% winning percentage
2004—80-82, 49.4% winning percentage
Two-season record: 148-176, 45.7% win
percentage
Recent Minnesota Twins season records
2003—90-72, 55.6% win percentage
2004—92-70, 56.8% win percentage
Two-season record: 182-142, 56.2% win
percentage
Notice that the Twins had a higher percentage in both 2003 and
2004, as well as in the two-year period. Not Simpson’s Paradox.
Simpson’s Paradox at work
Ronnie Belliard
2002—61/289, .211 of his at-bats were hits
2003—124/447, .277 of his at-bats were hits
Two-season average: 185/736, hits .2514 of the time
Casey Blake
2002—4/20, .200 of his at-bats were hits
2003—143/557, .257 of his at-bats were hits
Two-season average: 147/577, hits .2548 of the time
Belliard’s two-season batting average was lower than Blake’s, but in each
separate season Belliard had the higher batting average.
This is Simpson’s Paradox.
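A quick way to check for the reversal is to compare the per-season averages with the pooled averages. The Python sketch below assumes each record is stored as (hits, at-bats) per season; the helper names are made up for the example.

```python
# (hits, at-bats) for each season, taken from the figures above.
belliard = {2002: (61, 289), 2003: (124, 447)}
blake = {2002: (4, 20), 2003: (143, 557)}


def average(hits, at_bats):
    """Batting average for one season."""
    return hits / at_bats


def pooled_average(record):
    """Batting average over all seasons combined: total hits / total at-bats."""
    total_hits = sum(hits for hits, _ in record.values())
    total_at_bats = sum(at_bats for _, at_bats in record.values())
    return total_hits / total_at_bats


# Belliard is ahead in every individual season ...
belliard_ahead_each_season = all(
    average(*belliard[year]) > average(*blake[year]) for year in belliard
)
# ... but behind once the seasons are pooled.
belliard_ahead_pooled = pooled_average(belliard) > pooled_average(blake)

paradox = belliard_ahead_each_season and not belliard_ahead_pooled
print(f"Belliard pooled: {pooled_average(belliard):.4f}")
print(f"Blake pooled:    {pooled_average(blake):.4f}")
print("Simpson's Paradox:", paradox)
```

The pooled average is weighted by at-bats, and Blake had only 20 at-bats in his weaker season, so that season barely pulls his pooled average down.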
Discrimination? (Simpson’s Paradox)
Consider college acceptance rates by sex
          Accepted   Not accepted   Total
Men            198            162     360
Women           88            112     200
Total          286            274     560
198 of 360 men (55%) were accepted
88 of 200 women (44%) were accepted
Is there a sex bias?
Discrimination? (Simpson’s Paradox)
• Or is there a lurking variable that
explains the association?
• To evaluate this, split applications
according to the lurking variable
“major applied to”
• Business School (240 applicants)
• Art School (320 applicants)
Discrimination? (Simpson’s Paradox)
BUSINESS SCHOOL
          Accepted   Not accepted   Total
Men             18            102     120
Women           24             96     120
Total           42            198     240
18 of 120 men (15%) were accepted to B-school
24 of 120 women (20%) were accepted to B-school
A higher percentage of women were accepted.
Discrimination? (Simpson’s Paradox)
ART SCHOOL
          Accepted   Not accepted   Total
Men            180             60     240
Women           64             16      80
Total          244             76     320
180 of 240 men (75%) were accepted
64 of 80 women (80%) were accepted
A higher percentage of women were accepted.
Discrimination? (Simpson’s Paradox)
• Within each school, a higher percentage of women
were accepted than men.
• No discrimination against women
• Possible discrimination against men
• This is an example of Simpson’s Paradox.
• When the lurking variable (School applied to) was
ignored, the data suggest discrimination against women.
• When the School applied to was considered, the
association is reversed.
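The split by school can be reproduced with a short Python sketch. It assumes the counts are stored as (accepted, total applicants) for each school and sex; the names are illustrative.

```python
# (accepted, total applicants) by school and sex, from the tables above.
applications = {
    "Business": {"Men": (18, 120), "Women": (24, 120)},
    "Art": {"Men": (180, 240), "Women": (64, 80)},
}

# Acceptance rate within each school.
by_school = {
    school: {sex: accepted / total for sex, (accepted, total) in groups.items()}
    for school, groups in applications.items()
}

# Acceptance rate when the lurking variable (school) is ignored.
overall = {}
for sex in ("Men", "Women"):
    accepted = sum(applications[school][sex][0] for school in applications)
    total = sum(applications[school][sex][1] for school in applications)
    overall[sex] = accepted / total

for school, rates in by_school.items():
    print(f"{school}: men {rates['Men']:.0%}, women {rates['Women']:.0%}")
print(f"Overall: men {overall['Men']:.0%}, women {overall['Women']:.0%}")
```

Most women applied to the Business School, where acceptance rates are low for everyone, while most men applied to the Art School, where they are high. That difference in where people applied drives the overall gap.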
Colin R. Blyth’s example of Simpson’s Paradox
• A doctor was planning to try a new treatment on patients mostly local (C) and a
few in Chicago (C’). A statistician advised him to use a table of random numbers
and as each C patient became available, assign him to the new treatment with
probability .91, leave him to the standard treatment with probability .09; and the
same for C’ patients with probabilities .01 and .99 respectively. When the doctor
returned with the data the statistician told him that the new treatment was
obviously a very bad one, and criticized him for having continued trying it on so
many patients.
               Treatment
            Standard      New
Dead            5950     9005
Alive           5050     1095
% alive        (46%)    (11%)
• The doctor replied that he continued because the new
treatment was obviously a very good one, having nearly
doubled the recovery rate in both cities.
              C patients only       C’ patients only
Treatment    Standard      New     Standard      New
Dead              950     9000         5000        5
Alive              50     1000         5000       95
% alive            5%      10%          50%      95%
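One way to see where the reversal comes from is to write each pooled survival rate as a weighted average of the city rates. The Python sketch below makes an assumption about the city labels: they are assigned to match the stated assignment probabilities, so C is the group in which about 91% of patients received the new treatment. All names are illustrative.

```python
# (dead, alive) counts by treatment and city. The city labels are an assumption
# chosen to match the assignment probabilities in the text (.91 new for C).
data = {
    "Standard": {"C": (950, 50), "C'": (5000, 5000)},
    "New": {"C": (9000, 1000), "C'": (5, 95)},
}

pooled = {}
for treatment, cities in data.items():
    n_total = sum(dead + alive for dead, alive in cities.values())
    weighted_rate = 0.0
    for city, (dead, alive) in cities.items():
        n_city = dead + alive
        survival = alive / n_city   # survival rate within this city
        weight = n_city / n_total   # share of this treatment's patients from this city
        weighted_rate += weight * survival
        print(f"{treatment:<9}{city:<3} survival {survival:.0%}, weight {weight:.2f}")
    pooled[treatment] = weighted_rate

print({treatment: f"{rate:.0%}" for treatment, rate in pooled.items()})
```

The new treatment's pooled rate is dominated by the city with low survival, and the standard treatment's by the city with high survival, so the pooled comparison points the opposite way from both city-level comparisons.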
Smokers’ Example
• In England a study was conducted to examine the
survival rates of smokers and non-smokers. The results
suggested a positive association between smoking and
survival, because only 24% of smokers died compared
with 31% of non-smokers. When the data were broken
down by age group in a contingency table, it was found
that there were more older people in the non-smoker
group. Age thus played a very significant role in the
outcome, but because it was overlooked, the researchers
were left with misleading results (Appleton & French, 1996).
The Paradox
What’s true for the parts isn’t true for the whole.
CONCLUSION!!!!
Simpson’s paradox is a rare phenomenon!
It does not occur often! Thus statisticians
must be trained academically & ethically
well enough to make sure that if it has
occurred they will detect and correct it.
This is where practice, critical thinking
skills, and repetition come into play!
What Can Go Wrong?
• Don’t violate the area principle.
• While some people might like the pie chart on
the left better, it is harder to compare fractions
of the whole, which a well-done pie chart does.
What Can Go Wrong? (cont.)
• Keep it honest—make sure your display shows
what it says it shows.
• This plot of the percentage of high-school
students who engage in specified dangerous
behaviors has a problem. Can you see it?
What Can Go Wrong? (cont.)
• Don’t confuse similar-sounding
percentages—pay particular attention to
the wording of the context.
• Don’t forget to look at the variables
separately too—examine the marginal
distributions, since it is important to know
how many cases are in each category.
What Can Go Wrong? (cont.)
• Be sure to use enough individuals!
• Do not make a report like “We found that 66.67% of
the rats improved their performance with training.
The other rat died.”
What Can Go Wrong? (cont.)
• Don’t overstate your case—don’t claim
something you can’t.
• Don’t use unfair or silly averages—this
could lead to Simpson’s Paradox, so be
careful when you average one variable
across different levels of a second
variable.
What have we learned?
• We can summarize categorical data by
counting the number of cases in each
category (expressing these as counts or
percents).
• We can display the distribution in a bar
chart or pie chart.
• And, we can examine two-way tables
called contingency tables, examining
marginal and/or conditional distributions of
the variables.
Assignment
• Exercises pg. 37 – 43: #5, 7, 11, 16, 21,
23, 27–31 odd, 37
• Read Ch-4, pg. 44 - 71