Chapter 3 Displaying, Summarizing Qualitative Data

Chapter 3
Graphical and Numerical
Summaries of Categorical Data



UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
1) Construct graphs that appropriately describe
data
2) Calculate and interpret numerical summaries of a
data set.
3) Combine numerical methods with graphical
methods to analyze a data set.
Displaying Qualitative Data
“Sometimes you can see a lot just
by looking.”
Yogi Berra
Hall of Fame Catcher, NY Yankees
The three rules of data analysis
won’t be difficult to remember



1. Make a picture —reveals aspects not obvious in
the raw data; enables you to think clearly about the
patterns and relationships that may be hiding in your
data.
2. Make a picture —to show important features of
and patterns in the data. You may also see things
that you did not expect: the extraordinary (possibly
wrong) data values or unexpected patterns
3. Make a picture —the best way to tell others
about your data is with a well-chosen picture.
Bar Charts: show counts
or relative frequency for
each category

Example: Titanic passenger/crew distribution
[Figure: bar chart "Titanic Passengers by Class". Bar heights: Crew 885, First 325, Second 285, Third 706; vertical axis from 0 to 1000.]
Pie Charts: show
proportions of the
whole in each category

Example: Titanic passenger/crew
distribution
[Figure: pie chart "Titanic Passengers by Class". Slices: Crew 40%, Third 32%, First 15%, Second 13%.]
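The percentages in the pie chart can be recovered from the counts in the bar chart by dividing each count by the total. A minimal sketch, assuming the four counts shown above; the variable names are illustrative only:

```python
# Counts read from the Titanic bar chart (crew plus three passenger classes).
counts = {"Crew": 885, "First": 325, "Second": 285, "Third": 706}

total = sum(counts.values())  # everyone on board

# Relative frequency = category count / total count.
relative_freq = {group: n / total for group, n in counts.items()}

for group, share in relative_freq.items():
    # Print the count next to the share, rounded to a whole percent.
    print(f"{group:<7}{counts[group]:>5}{share:>7.0%}")
```

Rounded to whole percents, the shares come out as 40, 15, 13 and 32, the same values labeled on the pie slices.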
Example: Top 10 causes of death in the United
States 2001
Rank  Cause of death          Counts     % of top 10s   % of total deaths
1     Heart disease           700,142    37%            28%
2     Cancer                  553,768    29%            22%
3     Cerebrovascular         163,538    9%             6%
4     Chronic respiratory     123,013    6%             5%
5     Accidents               101,537    5%             4%
6     Diabetes mellitus       71,372     4%             3%
7     Flu and pneumonia       62,034     3%             2%
8     Alzheimer’s disease     53,852     3%             2%
9     Kidney disorders        39,480     2%             2%
10    Septicemia              32,238     2%             1%
      All other causes        629,967                   25%
For each individual who died in the United States in 2001, we record what was
the cause of death. The table above is a summary of that information.
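The two percentage columns differ only in the denominator: one divides by the deaths from the top 10 causes, the other by deaths from all causes. The sketch below recomputes both columns from the counts in the table; the dictionary and variable names are assumptions made for illustration.

```python
# Counts copied from the table above.
top10 = {
    "Heart disease": 700_142,
    "Cancer": 553_768,
    "Cerebrovascular": 163_538,
    "Chronic respiratory": 123_013,
    "Accidents": 101_537,
    "Diabetes mellitus": 71_372,
    "Flu and pneumonia": 62_034,
    "Alzheimer's disease": 53_852,
    "Kidney disorders": 39_480,
    "Septicemia": 32_238,
}
all_other_causes = 629_967

top10_total = sum(top10.values())            # denominator for "% of top 10s"
all_deaths = top10_total + all_other_causes  # denominator for "% of total deaths"

pct_of_top10 = {c: round(100 * n / top10_total) for c, n in top10.items()}
pct_of_all = {c: round(100 * n / all_deaths) for c, n in top10.items()}

for cause in top10:
    print(f"{cause:<22}{pct_of_top10[cause]:>4}%{pct_of_all[cause]:>5}%")
```

The same cause gets a smaller percentage in the second column because the denominator also includes the 629,967 deaths from all other causes.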
Top 10 causes of death: bar graph
Top 10 causes of deaths in the United States 2001
The number of individuals
who died of an accident in
2001 is approximately
100,000.
[Figure: bar graph with one bar per cause of death; vertical axis: Counts (x1000), from 0 to 800.]
Each category is represented by one bar. The bar’s height shows the count (or
sometimes the percentage) for that particular category.
[Figures: two further bar graphs of the same ten causes, differing only in the order of the bars; vertical axis: Counts (x1000), from 0 to 800.]
Top 10 causes of deaths in the United
States 2001
Bar graph sorted by rank: easy to analyze
Sorted alphabetically: much less useful
Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.
[Figure: two pie charts under the title "Percent of people dying from top 10 causes of death in the United States in 2001": one shows the percent of deaths from the top 10 causes, the other the percent of deaths from all causes.]
Make sure your labels match the data.
Make sure all percents add up to 100.
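A quick check of the "add up to 100" rule, using the rounded percentages from the cause-of-death table. This is only a sketch; the list names are illustrative, and rounded percentages will not always sum to exactly 100 the way they happen to here.

```python
# "% of top 10s" column: the whole is deaths from the top 10 causes.
pct_of_top10 = [37, 29, 9, 6, 5, 4, 3, 3, 2, 2]

# "% of total deaths" column: the whole is deaths from all causes.
pct_of_all_deaths = [28, 22, 6, 5, 4, 3, 2, 2, 2, 1]
pct_all_other_causes = 25

sum_top10_pie = sum(pct_of_top10)                        # a complete pie
sum_without_other = sum(pct_of_all_deaths)               # an incomplete pie
sum_with_other = sum_without_other + pct_all_other_causes

print(sum_top10_pie, sum_without_other, sum_with_other)
```

The second pie only represents a whole once the "All other causes" slice is included.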
Child poverty before and after government intervention (UNICEF, 1996)
What does this chart tell you?
• The United States has the highest rate of child poverty among developed nations (22% of those under 18).
• Its government does the least, through taxes and subsidies, to remedy the problem (size of orange bars and percent difference between orange/blue bars).
Could you transform this bar graph to fit in one pie chart? In two pie charts? Why?
The poverty line is defined as 50% of national median income.
Contingency Tables:
Categories for Two
Variables

Example: Survival and class on the Titanic
Marginal distributions

                        Crew       First      Second     Third      Total   Marg. dist. of survival
Alive                   212        202        118        178        710     710/2201 = 32.3%
Dead                    673        123        167        528        1491    1491/2201 = 67.7%
Total                   885        325        285        706        2201
Marg. dist. of class    885/2201   325/2201   285/2201   706/2201
                        40.2%      14.8%      12.9%      32.1%
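The marginal distributions come from the row and column totals of the table, each divided by the grand total. A minimal sketch using the counts above; the nested-dictionary layout is just one convenient assumption:

```python
# Counts of survivors and deaths for each class, from the contingency table.
table = {
    "Crew":   {"Alive": 212, "Dead": 673},
    "First":  {"Alive": 202, "Dead": 123},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
}

# Column totals (one per class) and the grand total.
class_totals = {cls: sum(cell.values()) for cls, cell in table.items()}
grand_total = sum(class_totals.values())

# Row totals (one per survival outcome).
survival_totals = {
    outcome: sum(cell[outcome] for cell in table.values())
    for outcome in ("Alive", "Dead")
}

# Marginal distribution = totals divided by the grand total.
marginal_class = {c: n / grand_total for c, n in class_totals.items()}
marginal_survival = {o: n / grand_total for o, n in survival_totals.items()}

print({c: f"{p:.1%}" for c, p in marginal_class.items()})
print({o: f"{p:.1%}" for o, p in marginal_survival.items()})
```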
Marginal distribution of class.
Bar chart.
Marginal distribution of class:
Pie chart
Contingency Tables: Categories
for Two Variables (cont.)

Conditional distributions.
Given the class of a passenger, what is the
chance the passenger survived?
                        Class
Survival                Crew     First    Second   Third    Total
Alive    Count          212      202      118      178      710
         % of col.      24.0%    62.2%    41.4%    25.2%    32.3%
Dead     Count          673      123      167      528      1491
         % of col.      76.0%    37.8%    58.6%    74.8%    67.7%
Total    Count          885      325      285      706      2201
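Each "% of col." entry is a conditional proportion: the count in a cell divided by the total for that class only. The sketch below recomputes the survival rate given class; the dictionary names are illustrative assumptions.

```python
# Counts by class, from the table above.
alive = {"Crew": 212, "First": 202, "Second": 118, "Third": 178}
dead = {"Crew": 673, "First": 123, "Second": 167, "Third": 528}

# Conditional distribution of survival given class:
# divide by the class total, not by the grand total.
survival_given_class = {
    cls: alive[cls] / (alive[cls] + dead[cls]) for cls in alive
}

for cls, p in survival_given_class.items():
    print(f"{cls:<7} survived: {p:.1%}   died: {1 - p:.1%}")
```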
Conditional distributions:
segmented bar chart
Contingency Tables:
Categories for Two
Variables (cont.)
Questions:

What fraction of survivors were in first class?

What fraction of passengers were in first class and
survivors ?

What fraction of the first class passengers
survived?
                        Class
Survival                Crew     First    Second   Third    Total
Alive    Count          212      202      118      178      710
         % of col.      24.0%    62.2%    41.4%    25.2%    32.3%
Dead     Count          673      123      167      528      1491
         % of col.      76.0%    37.8%    58.6%    74.8%    67.7%
Total    Count          885      325      285      706      2201

Answers, in the order of the questions: 202/710, 202/2201, 202/325
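All three answers share the numerator 202 (first-class survivors); only the denominator changes with the group being described. A short sketch, with variable names chosen for illustration:

```python
first_class_survivors = 202  # the cell shared by all three questions

survivors = 710      # everyone who survived
everyone = 2201      # everyone on board
first_class = 325    # all first-class passengers

share_of_survivors = first_class_survivors / survivors      # question 1
share_of_everyone = first_class_survivors / everyone        # question 2
share_of_first_class = first_class_survivors / first_class  # question 3

print(f"{share_of_survivors:.1%}  {share_of_everyone:.1%}  {share_of_first_class:.1%}")
```

The third value is the same 62.2% that appears in the "% of col." row for first class.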
3-Way Tables

Example: Georgia death-sentence data
                     Race of Defendant: Black        Race of Defendant: White
Death Sentence       Victim Black   Victim White     Victim Black   Victim White     Totals
Yes                  18             50               2              58               128
No                   1420           178              62             687              2347
Totals               1438           228              64             745              2475
% Death Sentence     1.2            21.9             3.1            7.8
UC Berkeley Lawsuit
                     MEN      WOMEN
No. of applicants    2691     1835
Admitted             1199     557
% admitted           44.6     30.4
LAWSUIT (cont.)
         MEN                                 WOMEN
MAJOR    No. of Applicants   No. Admitted    No. of Applicants   No. Admitted
A        825                 512 (62%)       108                 *89 (82%)
B        560                 353 (63%)       25                  *17 (68%)
C        325                 120 (37%)       593                 202 (34%)
D        417                 138 (33%)       375                 *131 (35%)
E        191                 53 (28%)        393                 94 (24%)
F        373                 23 (6%)         341                 *24 (7%)
TOTAL    2691                1199            1835                557

(* marks the majors where the women's admission rate is higher than the men's.)
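The pooled rates and the per-major rates can be recomputed side by side from the counts above. This is a minimal sketch; the dictionaries and function names are assumptions introduced for illustration.

```python
# (applicants, admitted) for each major, copied from the table above.
men = {"A": (825, 512), "B": (560, 353), "C": (325, 120),
       "D": (417, 138), "E": (191, 53), "F": (373, 23)}
women = {"A": (108, 89), "B": (25, 17), "C": (593, 202),
         "D": (375, 131), "E": (393, 94), "F": (341, 24)}

def overall_rate(group):
    """Admission rate with all majors pooled together."""
    applicants = sum(a for a, _ in group.values())
    admitted = sum(m for _, m in group.values())
    return admitted / applicants

def rate_by_major(group):
    """Admission rate within each major separately."""
    return {major: admitted / applicants
            for major, (applicants, admitted) in group.items()}

men_rates = rate_by_major(men)
women_rates = rate_by_major(women)

# Majors in which women were admitted at a higher rate than men.
women_higher = [m for m in men if women_rates[m] > men_rates[m]]

print(f"Pooled: men {overall_rate(men):.1%}, women {overall_rate(women):.1%}")
print("Women admitted at a higher rate in majors:", women_higher)
```

Pooled, men have the higher rate (44.6% against 30.4%), yet women have the higher rate in four of the six majors. Most women applied to majors C through F, where admission rates are low for everyone.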
Simpson’s Paradox

The reversal of the direction of a
comparison or association when
data from several groups are
combined to form a single group.
Fly Alaska Airlines, the on-time airline!
               Alaska Airlines                         American West
Destination    % Arrivals On Time   No. of Arrivals    % Arrivals On Time   No. of Arrivals
L. A.          88.9%                559                85.6%                811
Phoenix        94.8%                233                92.1%                5,255
San Diego      91.4%                232                85.5%                448
San Fran.      83.1%                605                71.3%                449
Seattle        85.8%                2,146              76.7%                262
Total                               3,775                                   7,225
American West Wins!
You’re a Hero!
               Alaska Airlines                         American West
Destination    % Arrivals On Time   No. of Arrivals    % Arrivals On Time   No. of Arrivals
L. A.          88.9%                559                85.6%                811
Phoenix        94.8%                233                92.1%                5,255
San Diego      91.4%                232                85.5%                448
San Fran.      83.1%                605                71.3%                449
Seattle        85.8%                2,146              76.7%                262
Total          86.7%                3,775              89.1%                7,225
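The overall on-time percentage is a weighted average of the city percentages, weighted by the number of arrivals in each city. The sketch below recomputes both totals from the rows of the table; the names are illustrative, and because the on-time counts are rebuilt from rounded percentages the result is approximate.

```python
# (% of arrivals on time, number of arrivals) per destination, from the table.
alaska = {"L. A.": (88.9, 559), "Phoenix": (94.8, 233), "San Diego": (91.4, 232),
          "San Fran.": (83.1, 605), "Seattle": (85.8, 2146)}
american_west = {"L. A.": (85.6, 811), "Phoenix": (92.1, 5255), "San Diego": (85.5, 448),
                 "San Fran.": (71.3, 449), "Seattle": (76.7, 262)}

def overall_on_time(airline):
    """Weighted average of the city percentages, weighted by arrivals."""
    total_arrivals = sum(n for _, n in airline.values())
    on_time_arrivals = sum(pct / 100 * n for pct, n in airline.values())
    return 100 * on_time_arrivals / total_arrivals

# City by city, compare the two on-time percentages.
alaska_better_everywhere = all(
    alaska[city][0] > american_west[city][0] for city in alaska
)

print(f"Alaska overall:        {overall_on_time(alaska):.1f}%")
print(f"American West overall: {overall_on_time(american_west):.1f}%")
print("Alaska better in every city:", alaska_better_everywhere)
```

Alaska has the higher percentage at every destination but the lower overall figure. Most of its arrivals are in Seattle, where on-time rates are low, while most American West arrivals are in Phoenix, where they are high.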
End of Chapter 3