GRAPHICAL METHODS FOR QUANTITATIVE DATA

Lecture Unit 2
Graphical and Numerical
Summaries of Data





UNIT OBJECTIVES
At the conclusion of this unit you should be able to:
1) Construct graphs that appropriately describe
data.
2) Calculate and interpret numerical summaries of a
data set.
3) Combine numerical methods with graphical
methods to analyze a data set.
4) Apply graphical methods of summarizing data to
choose appropriate numerical summaries.
5) Apply software and/or calculators to automate
graphical and numerical summary procedures.
Displaying Qualitative Data
Section 2.1
“Sometimes you can see a lot just
by looking.”
Yogi Berra
Hall of Fame Catcher, NY Yankees
The three rules of data analysis
won’t be difficult to remember



1. Make a picture —reveals aspects not obvious in
the raw data; enables you to think clearly about the
patterns and relationships that may be hiding in your
data.
2. Make a picture —to show important features of
and patterns in the data. You may also see things
that you did not expect: the extraordinary (possibly
wrong) data values or unexpected patterns.
3. Make a picture —the best way to tell others
about your data is with a well-chosen picture.
Bar Charts: show counts
or relative frequency for
each category

Example: Titanic passenger/crew distribution
[Figure: bar chart, "Titanic Passengers by Class"; vertical axis: count, 0 to 1000; bars: Crew 885, First 325, Second 285, Third 706]
Pie Charts: show
proportions of the
whole in each category

Example: Titanic passenger/crew
distribution
[Figure: pie chart, "Titanic Passengers by Class": Crew 40%, First 15%, Second 13%, Third 32%]
Example: Top 10 causes of death in the United
States
Rank  Cause of death          Count     % of top 10   % of total deaths
1     Heart disease           700,142   37%           28%
2     Cancer                  553,768   29%           22%
3     Cerebrovascular         163,538   9%            6%
4     Chronic respiratory     123,013   6%            5%
5     Accidents               101,537   5%            4%
6     Diabetes mellitus       71,372    4%            3%
7     Flu and pneumonia       62,034    3%            2%
8     Alzheimer’s disease     53,852    3%            2%
9     Kidney disorders        39,480    2%            2%
10    Septicemia              32,238    2%            1%
      All other causes        629,967                 25%
For each individual who died in the United States, we record the cause of
death. The table above summarizes that information.
Top 10 causes of death: bar graph
Top 10 causes of deaths in the United States
The number of individuals
who died of an accident is
approximately 100,000.
[Figure: bar graph of the top 10 causes of death, one bar per cause; vertical axis: Counts (x1000), 0 to 800]
Each category is represented by one bar. The bar’s height shows the count (or
sometimes the percentage) for that particular category.
[Figures: the same bar graph drawn with the bars in different orders; vertical axis: Counts (x1000), 0 to 800]
Top 10 causes of deaths in the United
States
Bar graph sorted by rank
* Easy to analyze
Sorted alphabetically
* Much less useful
Recent Annual Computer Hardware Sales
($billions)
1. United States $158
2. China $64.4
3. Japan $54
4. Germany $24.4
5. Britain $23.5
6. France $19.3
7. Brazil $14.2
8. Italy $13.1
9. Australia $12.8
10. India $11.9
NY Times
Recent Annual Software Sales ($billions)
1. United States $137.9
2. Japan $23.4
3. Germany $20
4. Britain $16.8
5. France $12.6
6. Canada $7.3
7. Italy $6.3
8. China $5.4
9. Netherlands $5.4
10. Australia $4.8
Top 10 causes of death: pie chart
Each slice represents a piece of one whole. The size of a slice depends on what
percent of the whole this category represents.
Percent of people dying from
top 10 causes of death in the United States
Make sure your
labels match
the data.
Make sure
all percents
add up to 100.
Percent of deaths from top 10 causes
Percent of
deaths from
all causes
Internships
Basic bar chart
Side-by-side bar chart
Average Student Debt by State, 2010 Class
[Figure: horizontal bar chart; one bar per state (plus DC), listed from New Hampshire to Utah; horizontal axis: average debt, $0 to $35,000]
Student Debt North Carolina Schools
[Figure: bar chart, "North Carolina Private Schools, 2010 Class"; series: average debt of graduates, tuition and fees (in-state); schools listed from Campbell University to High Point University, plus the North Carolina 4-year mean]
[Figure: bar chart, "North Carolina Public Schools, 2010 Class"; series: average debt of graduates, tuition and fees (in-state); schools listed from UNC Greensboro to Elizabeth City, plus the North Carolina mean for 4-year or above; axis: 0 to 25000]
Unnecessary dimension in a
pie chart
Contingency Tables:
Categories for Two
Variables

Example: Survival and class on the Titanic
Marginal distributions

                       Crew       First      Second     Third      Total   marg. dist. of survival
Alive                  212        202        118        178        710     710/2201 = 32.3%
Dead                   673        123        167        528        1491    1491/2201 = 67.7%
Total                  885        325        285        706        2201
marg. dist. of class   885/2201   325/2201   285/2201   706/2201
                       40.2%      14.8%      12.9%      32.1%
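A marginal distribution is a row or column total divided by the grand total. The short Python sketch below reproduces the percentages from the Titanic counts. The variable names are illustrative and do not come from the lecture.

```python
# Counts from the Titanic contingency table (rows: survival, columns: class).
classes = ["Crew", "First", "Second", "Third"]
counts = {
    "Alive": [212, 202, 118, 178],
    "Dead": [673, 123, 167, 528],
}

grand_total = sum(sum(row) for row in counts.values())

# Marginal distribution of class: column totals divided by the grand total.
class_totals = [sum(col) for col in zip(*counts.values())]
marginal_class = {c: t / grand_total for c, t in zip(classes, class_totals)}

# Marginal distribution of survival: row totals divided by the grand total.
marginal_survival = {s: sum(row) / grand_total for s, row in counts.items()}

for name, p in {**marginal_class, **marginal_survival}.items():
    print(f"{name}: {p:.1%}")
```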
Marginal distribution of class.
Bar chart.
Marginal distribution of class:
Pie chart
Contingency Tables: Categories
for Two Variables (cont.)

Conditional distributions.
Given the class of a passenger, what is the
chance the passenger survived?
                     Class
Survival             Crew    First   Second  Third   Total
Alive  Count         212     202     118     178     710
       % of col.     24.0%   62.2%   41.4%   25.2%   32.3%
Dead   Count         673     123     167     528     1491
       % of col.     76.0%   37.8%   58.6%   74.8%   67.7%
Total  Count         885     325     285     706     2201
Conditional distributions:
segmented bar chart
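A conditional distribution of survival given class divides each count by its own column total rather than by the grand total. The sketch below is one possible way to compute the "% of col." values; the function name `survival_given_class` is hypothetical.

```python
classes = ["Crew", "First", "Second", "Third"]
alive = [212, 202, 118, 178]
dead = [673, 123, 167, 528]

def survival_given_class(alive_count, dead_count):
    """Return (share alive, share dead) within one class column."""
    total = alive_count + dead_count
    return alive_count / total, dead_count / total

conditional = {
    c: survival_given_class(a, d) for c, a, d in zip(classes, alive, dead)
}

for c, (p_alive, p_dead) in conditional.items():
    print(f"{c}: alive {p_alive:.1%}, dead {p_dead:.1%}")
```

These column percentages are the bar segments a segmented bar chart would draw, one bar per class.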
Contingency Tables:
Categories for Two
Variables (cont.)
Questions:
* What fraction of survivors were in first class? 202/710
* What fraction of passengers were in first class and survivors? 202/2201
* What fraction of the first class passengers survived? 202/325

                     Class
Survival             Crew    First   Second  Third   Total
Alive  Count         212     202     118     178     710
       % of col.     24.0%   62.2%   41.4%   25.2%   32.3%
Dead   Count         673     123     167     528     1491
       % of col.     76.0%   37.8%   58.6%   74.8%   67.7%
Total  Count         885     325     285     706     2201
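The three questions share the same numerator (202 first-class survivors) and differ only in the denominator. A minimal sketch, with illustrative variable names, makes the distinction explicit:

```python
first_alive = 202   # first-class passengers who survived
survivors = 710     # everyone who survived (row total)
first_total = 325   # everyone in first class (column total)
everyone = 2201     # grand total

# Conditioning on survival: among survivors, the share in first class.
frac_of_survivors_in_first = first_alive / survivors
# No conditioning: share of all people who were first class AND survived.
frac_first_and_survived = first_alive / everyone
# Conditioning on class: among first-class passengers, the share who survived.
frac_of_first_who_survived = first_alive / first_total

print(f"{frac_of_survivors_in_first:.3f}")
print(f"{frac_first_and_survived:.3f}")
print(f"{frac_of_first_who_survived:.3f}")
```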
TV viewers during the Super Bowl in 2013.
What is the marginal distribution of those
who watched the commercials only?
1. 8.0%
2. 23.5%
3. 58.2%
4. 27.7%
TV viewers during the Super Bowl in 2013.
What percentage watched the game and
were female?
1. 41.8%
2. 38.8%
3. 51.2%
4. 19.8%
TV viewers during the Super Bowl in 2013.
Given that a viewer did not watch the Super
Bowl telecast, what percentage were male?
1. 45.2%
2. 48.8%
3. 26.8%
4. 27.7%
3-Way Tables

Example: Georgia death-sentence data
                     Race of Defendant: Black     Race of Defendant: White
                     Race of Victim               Race of Victim
Death Sentence       Black      White             Black      White        Totals
Yes                  18         50                2          58           128
No                   1420       178               62         687          2347
Totals               1438       228               64         745          2475
% Death Sentence     1.2        21.9              3.1        7.8
UC Berkeley Lawsuit
                     MEN      WOMEN
No. of applicants    2691     1835
Admitted             1199     557
% admitted           44.6     30.4
LAWSUIT (cont.)
         MEN                               WOMEN
MAJOR    No. of Applicants  No. Admitted   No. of Applicants  No. Admitted
A        825                512 (62%)      108                * 89 (82%)
B        560                353 (63%)      25                 * 17 (68%)
C        325                120 (37%)      593                202 (34%)
D        417                138 (33%)      375                * 131 (35%)
E        191                53 (28%)       393                94 (24%)
F        373                23 (6%)        341                * 24 (7%)
TOTAL    2691               1199           1835               557
Simpson’s Paradox

The reversal of the direction of a
comparison or association when
data from several groups are
combined to form a single group.
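The reversal can be checked directly from the UC Berkeley counts. The sketch below compares admission rates major by major and then for the combined totals. The helper names `rate` and `overall` are illustrative only.

```python
# (applicants, admitted) per major, taken from the UC Berkeley tables.
men = {"A": (825, 512), "B": (560, 353), "C": (325, 120),
       "D": (417, 138), "E": (191, 53), "F": (373, 23)}
women = {"A": (108, 89), "B": (25, 17), "C": (593, 202),
         "D": (375, 131), "E": (393, 94), "F": (341, 24)}

def rate(applicants, admitted):
    """Admission rate for one group."""
    return admitted / applicants

def overall(groups):
    """Admission rate after combining all majors into a single group."""
    total_applicants = sum(a for a, _ in groups.values())
    total_admitted = sum(x for _, x in groups.values())
    return total_admitted / total_applicants

# Majors in which women are admitted at a higher rate than men.
women_ahead = [m for m in men if rate(*women[m]) > rate(*men[m])]

print("Women ahead in majors:", women_ahead)
print(f"Overall: men {overall(men):.1%}, women {overall(women):.1%}")
```

Women have the higher rate in four of the six majors, yet the lower rate once the majors are combined. In these counts most women applied to majors C through F, where admission rates are low for everyone.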
Fly Alaska Airlines, the on-time airline!
               Alaska Airlines             American West
Destination    % Arrivals   No. of         % Arrivals   No. of
               On Time      Arrivals       On Time      Arrivals
L.A.           88.9%        559            85.6%        811
Phoenix        94.8%        233            92.1%        5,255
San Diego      91.4%        232            85.5%        448
San Fran.      83.1%        605            71.3%        449
Seattle        85.8%        2,146          76.7%        262
Total                       3,775                       7,225
American West Wins!
You’re a Hero!
               Alaska Airlines             American West
Destination    % Arrivals   No. of         % Arrivals   No. of
               On Time      Arrivals       On Time      Arrivals
L.A.           88.9%        559            85.6%        811
Phoenix        94.8%        233            92.1%        5,255
San Diego      91.4%        232            85.5%        448
San Fran.      83.1%        605            71.3%        449
Seattle        85.8%        2,146          76.7%        262
Total          86.7%        3,775          89.1%        7,225
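Each airline's overall on-time percentage is a weighted average of its destination percentages, with the numbers of arrivals as weights. The sketch below recomputes the Total row from the destination rows; the function name `overall_on_time` is hypothetical.

```python
# (percent on time, number of arrivals) per destination, from the table.
alaska = {"L.A.": (88.9, 559), "Phoenix": (94.8, 233), "San Diego": (91.4, 232),
          "San Fran.": (83.1, 605), "Seattle": (85.8, 2146)}
american_west = {"L.A.": (85.6, 811), "Phoenix": (92.1, 5255),
                 "San Diego": (85.5, 448), "San Fran.": (71.3, 449),
                 "Seattle": (76.7, 262)}

def overall_on_time(airline):
    """Weighted average of on-time percentages, weighted by arrivals."""
    total_arrivals = sum(n for _, n in airline.values())
    on_time_arrivals = sum(pct / 100 * n for pct, n in airline.values())
    return 100 * on_time_arrivals / total_arrivals

# Alaska has the higher percentage at every single destination...
alaska_wins_everywhere = all(
    alaska[d][0] > american_west[d][0] for d in alaska
)
print(alaska_wins_everywhere)
# ...but the lower percentage overall.
print(f"Alaska {overall_on_time(alaska):.1f}%, "
      f"American West {overall_on_time(american_west):.1f}%")
```

Alaska's arrivals are concentrated in Seattle, one of its weaker destinations, while American West's are concentrated in Phoenix, its best one. That difference in weights drives the reversal.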
Section 2.2
Displaying Quantitative Data
Histograms
Stem and Leaf Displays
[Figure: "Relative Frequency Histogram of Exam Grades"; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]
Frequency Histograms
[Figure: frequency histogram, "Baker City Hospital - Length of Stay Distribution"; classes 0<2, 2<4, ..., 16<18; vertical axis: frequency, 0 to 70]
Frequency Histograms
A histogram shows three general types of information:
* It provides a visual indication of where the approximate center of the data is.
* We can gain an understanding of the degree of spread, or variation, in the data.
* We can observe the shape of the distribution.
All 200 m Races 20.2 secs or less
[Figure: frequency histogram, "200 m Races 20.2 secs or less (approx. 700)"; horizontal axis: TIMES, 19.2 to 20.19; vertical axis: Frequency, 0 to 60; annotations: Usain Bolt 2008 19.30, Michael Johnson 1996 19.32]
Histograms Showing Different
Centers
[Figure: two frequency histograms with the same classes (0<2 through 16<18) and the same vertical scale (0 to 70), but different centers]
Histograms - Same Center,
Different Spread
[Figure: two frequency histograms with the same classes (0<2 through 16<18) and the same vertical scale (0 to 70); same center, different spread]
Excel Example: 2012-13 NFL
Salaries
[Figure: Excel histogram; horizontal axis: Bin, 369480 to 17547935.38, plus "More"; vertical axis: Frequency, 0 to 1000]
Statcrunch Example: 2012-13 NFL
Salaries
Frequency and Relative
Frequency Histograms
* identify smallest and largest values in data set
* divide interval between largest and smallest values into between 5 and 20 subintervals called classes
  * each data value in one and only one class
  * no data value is on a boundary

How Many Classes?
Can choose from two formulas:
$(2n)^{1/3}$
Sturges' Rule: $1 + \dfrac{\log(n)}{\log(2)}$
n is the sample size
Histogram Construction (cont.)
* compute frequency or relative
frequency of observations in each class
* x-axis: class boundaries;
y-axis: frequency or relative frequency
scale
* over each class draw a rectangle with
height corresponding to the frequency
or relative frequency in that class
Example. Number of daily
employee absences from work
106 obs; approx. no. of classes:
$\{2(106)\}^{1/3} = 212^{1/3} \approx 5.96$
$1 + \log(106)/\log(2) \approx 1 + 6.73 = 7.73$
* There is no single “correct” answer for the number of classes
* For example, you can choose 6, 7, 8, or 9 classes; don’t choose 15 classes
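Both rules of thumb are easy to evaluate for any sample size. The sketch below applies them to the n = 106 absences example. The function names are illustrative, and the results are starting points, not required answers.

```python
import math

def classes_cube_root(n):
    """Approximate number of classes from the (2n)^(1/3) rule."""
    return (2 * n) ** (1 / 3)

def classes_sturges(n):
    """Approximate number of classes from Sturges' rule: 1 + log(n)/log(2)."""
    return 1 + math.log(n) / math.log(2)

n = 106  # number of observations in the absences example
print(f"cube-root rule: {classes_cube_root(n):.2f}")
print(f"Sturges' rule:  {classes_sturges(n):.2f}")
```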

EXCEL Histogram
[Figure: "Histogram of Employee Absences"; horizontal axis: Absences from Work; vertical axis: Frequency, 0 to 45]
Absences from Work (cont.)
6 classes
* class width: (158-121)/6 = 37/6 = 6.17, rounded up to 7
* 6 classes, each of width 7; classes span 6(7) = 42 units
* data spans 158-121 = 37 units
* classes extend beyond the span of the actual data values by 42-37 = 5 units
* lower boundary of 1st class: (1/2)(5) units below 121 = 121-2.5 = 118.5
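The boundary calculation above can be sketched as a small function. This assumes, as in the example, that the class width is rounded up to the next whole number and that the excess span is split evenly on both sides of the data. The name `class_boundaries` is hypothetical.

```python
import math

def class_boundaries(smallest, largest, num_classes):
    """Class boundaries centred on the data, with a whole-number width."""
    data_span = largest - smallest
    width = math.ceil(data_span / num_classes)   # 37/6 = 6.17 -> 7
    excess = num_classes * width - data_span     # 42 - 37 = 5
    start = smallest - excess / 2                # 121 - 2.5 = 118.5
    return [start + i * width for i in range(num_classes + 1)]

print(class_boundaries(121, 158, 6))
```

With whole-number data, the half-unit boundaries guarantee that no value falls on a boundary. If the span happened to divide evenly by the number of classes, the excess would be zero and this sketch would need a wider class or a shifted start.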

EXCEL histogram
[Figure: "Histogram of Employee Absences"; horizontal axis: Absences from Work, class boundaries 118.5, 125.5, 132.5, 139.5, 146.5, 153.5, 160.5; vertical axis: Frequency, 0 to 70]
Grades on a statistics exam
Data:
75 66 77 66 64 73 91 65 59 86 61 86 61
58 70 77 80 58 94 78 62 79 83 54 52 45
82 48 67 55
Frequency Distribution of
Grades
Class Limits     Frequency
40 up to 50      2
50 up to 60      6
60 up to 70      8
70 up to 80      7
80 up to 90      5
90 up to 100     2
Total            30
Relative Frequency
Distribution of Grades
Class Limits     Relative Frequency
40 up to 50      2/30 = .067
50 up to 60      6/30 = .200
60 up to 70      8/30 = .267
70 up to 80      7/30 = .233
80 up to 90      5/30 = .167
90 up to 100     2/30 = .067
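The two tables can be rebuilt from the raw grades by counting how many values fall in each class. The sketch below treats "40 up to 50" as including 40 and excluding 50, which is an assumption about the class convention. Variable names are illustrative.

```python
grades = [75, 66, 77, 66, 64, 73, 91, 65, 59, 86, 61, 86, 61,
          58, 70, 77, 80, 58, 94, 78, 62, 79, 83, 54, 52, 45,
          82, 48, 67, 55]

edges = list(range(40, 101, 10))  # 40, 50, ..., 100

# Frequency: number of grades g with lower <= g < upper.
frequency = {
    (lower, upper): sum(lower <= g < upper for g in grades)
    for lower, upper in zip(edges, edges[1:])
}

# Relative frequency: each class count divided by the number of grades.
relative = {cls: count / len(grades) for cls, count in frequency.items()}

for (lower, upper), count in frequency.items():
    share = relative[(lower, upper)]
    print(f"{lower} up to {upper}: {count:2d}  {share:.3f}")
```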
[Figure: "Relative Frequency Histogram of Grades"; horizontal axis: Grade, 40 to 100; vertical axis: Relative frequency, 0 to .30]
Based on the histogram, about what
percent of the values
are between 47.5 and
52.5?
1. 50%
2. 5%
3. 17%
4. 30%
Stem and leaf displays

Have the following general appearance
stem | leaf
1    | 8 9
2    | 1 2 8 9 9
3    | 2 3 8 9
4    | 0 1
5    | 6 7
6    | 4
Stem and Leaf Displays
Partition each number in the data into a “stem” and a “leaf”
Constructing a stem and leaf display:
1) determine the stem and leaf partition (5-20 stems)
2) write the stems in a column with the smallest stem at the top; include all stems in the range of the data
3) only 1 digit in the leaves; drop digits or round off
4) record the leaf for each number in the corresponding stem row; ordering the leaves in each row helps

Example: employee ages at a small company
18 21 22 19 32 33 40 41 56 57 64 28 29 29 38 39
stem: 10’s digit; leaf: 1’s digit
* 18: stem=1; leaf=8; 18 = 1 | 8
stem | leaf
1    | 8 9
2    | 1 2 8 9 9
3    | 2 3 8 9
4    | 0 1
5    | 6 7
6    | 4
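The construction steps translate into a short routine for two-digit data. This is a minimal sketch assuming the stem is the tens digit and the leaf is the ones digit; the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display rows: stem = tens digit, leaf = ones digit."""
    rows = defaultdict(list)
    for v in sorted(values):            # sorting keeps leaves in order
        rows[v // 10].append(v % 10)
    lines = []
    # Include every stem in the range of the data, even empty ones.
    for stem in range(min(rows), max(rows) + 1):
        leaves = " ".join(str(d) for d in rows.get(stem, []))
        lines.append(f"{stem} | {leaves}".rstrip())
    return lines

ages = [18, 21, 22, 19, 32, 33, 40, 41, 56, 57, 64, 28, 29, 29, 38, 39]
for line in stem_and_leaf(ages):
    print(line)
```

Adding a value such as 95 to `ages` produces empty rows for stems 7 and 8, which is how the display exposes a gap in the data.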
Suppose a 95 yr. old is hired
stem | leaf
1    | 8 9
2    | 1 2 8 9 9
3    | 2 3 8 9
4    | 0 1
5    | 6 7
6    | 4
7    |
8    |
9    | 5
Number of TD passes by NFL teams:
2012-2013 season
(stems are 10’s digit)
stem | leaf
4    | 03
3    | 247
2    | 6677789
2    | 01222233444
1    | 13467889
0    | 8
Pulse Rates n = 138
#    Stem   Leaves
     4*
3    4.     588
9    5*     001233444
10   5.     5556788899
23   6*     00011111122233333344444
23   6.     55556666667777788888888
16   7*     00000112222334444
23   7.     55555666666777888888999
10   8*     0000112224
10   8.     5555667789
4    9*     0012
2    9.     58
4    10*    0223
     10.
1    11*    1
Advantages/Disadvantages of
Stem-and-Leaf Displays
Advantages
1) each measurement displayed
2) ascending order in each stem row
3) relatively simple (data set not too large)
Disadvantages
1) display becomes unwieldy for large data sets

Population of 185 US cities with
between 100,000 and 500,000

Multiply stems by 100,000
Back-to-back stem-and-leaf displays. TD
passes by NFL teams: 1999-2000, 2012-13
multiply stems by 10
  1999-2000 | stem | 2012-13
          2 |  4   | 03
          6 |  3   | 7
          2 |  3   | 24
       6655 |  2   | 6677789
43322221100 |  2   | 01222233444
 9998887666 |  1   | 67889
        421 |  1   | 134
            |  0   | 8
Below is a stem-and-leaf display for the
pulse rates of 24 women at a health clinic.
How many pulses are between 67 and 77?
Stems are
10’s digits
1. 4
2. 6
3. 8
4. 10
5. 12
Interpreting Graphical Displays: Shape
Symmetric distribution: A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other.

Skewed distribution: A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side.

Complex, multimodal distribution: Not all distributions have a simple overall shape, especially when there are few observations.
Shape (cont.): Female heart attack
patients in New York state
Age: left-skewed
Cost: right-skewed
Shape (cont.): Outliers
An important kind of deviation is an outlier. Outliers are observations
that lie outside the overall pattern of a distribution. Always look for
outliers and try to explain them.
The overall pattern is fairly
symmetrical except for 2
states that clearly do not belong
to the main trend. Alaska
and Florida have an unusual
representation of the
elderly in their populations.
A large gap in the
distribution is typically a
sign of an outlier.
Alaska
Florida
Center: typical value of frozen
personal pizza? ~$2.65
Spread: fuel efficiency 4, 8
cylinders
4 cylinders: more spread
8 cylinders: less spread
Other Graphical Methods for
Economic Data

Time plots: plot observations in time order, with time on the horizontal axis and the variable on the vertical axis
* Time series: measurements are taken at regular intervals (monthly unemployment, quarterly GDP, weather records, electricity demand, etc.)
Unemployment Rate, by Educational
Attainment
Water Use During Super Bowl
Winning Times 100 M Dash
Annual Mean Temperature
End of Section 2.2