02_graphics

Describing data
with graphics
and numbers
Types of Data
• Categorical Variables
– also known as class variables,
nominal variables
• Quantitative Variables
– also known as numerical variables
– either continuous or discrete.
Graphing categorical variables
Ten most common causes of death in Americans between
15 and 19 years old in 1999.
Bar graphs
Graphing numerical variables
Heights of BIOL 300 students (cm)
165 170 142 173 168 168 155
160 165 165 163 152 154 165
180 173 190 165 175 165 170
170 156 155 163 168 177 166
Stem-and-leaf plot
19 | 0
18 | 0
17 | 0003357
16 | 0335555556888
15 | 24556
14 | 2
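A stem-and-leaf plot can be rebuilt directly from the raw heights. The sketch below is illustrative and not part of the original slides; the function name `stem_and_leaf` is an assumption, and it treats everything above the units digit as the stem and the units digit as the leaf.

```python
from collections import defaultdict

# Heights (cm) of the 28 BIOL 300 students listed on the slide.
heights = [165, 170, 142, 173, 168, 168, 155, 160, 165, 165, 163, 152, 154, 165,
           180, 173, 190, 165, 175, 165, 170, 170, 156, 155, 163, 168, 177, 166]

def stem_and_leaf(values):
    """Map each stem (value // 10) to its leaves (units digits) in sorted order."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(str(v % 10))
    return {stem: "".join(digits) for stem, digits in leaves.items()}

plot = stem_and_leaf(heights)
# Print the largest stem first, as on the slide.
for stem in sorted(plot, reverse=True):
    print(f"{stem} | {plot[stem]}")
```

Each row keeps every individual value visible, which is what distinguishes this display from a histogram of the same data.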
Frequency table
Height Group   Frequency
141-150        1
151-160        6
161-170        15
171-180        5
181-190        1
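The frequency table can be reproduced by assigning each height to a 10 cm group. This is a minimal sketch, assuming groups that start at 141 cm as on the slide; `height_group` is a hypothetical helper name.

```python
from collections import Counter

# Heights (cm) of the 28 BIOL 300 students listed on the slide.
heights = [165, 170, 142, 173, 168, 168, 155, 160, 165, 165, 163, 152, 154, 165,
           180, 173, 190, 165, 175, 165, 170, 170, 156, 155, 163, 168, 177, 166]

def height_group(h, start=141, width=10):
    """Return the label of the group containing h, e.g. 165 -> '161-170'."""
    low = start + ((h - start) // width) * width
    return f"{low}-{low + width - 1}"

frequency = Counter(height_group(h) for h in heights)
for group in sorted(frequency):
    print(group, frequency[group])
```

The counts are the bar heights of the corresponding histogram; changing `width` shows how the choice of group size alters the shape of the display.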
Histogram
Frequency
distribution
Histogram with more data
Cumulative Frequency Distribution
[Figure: cumulative frequency (0 to 1) against height (in cm) of Bio300 students (150 to 210), with the 50th percentile (median) and the 90th percentile marked]
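A cumulative frequency distribution gives, for each value, the fraction of observations at or below it, and percentiles are read off by inverting it. The sketch below applies this to the 28 heights listed earlier, not to the larger data set plotted in the figure, so its percentiles need not match the figure. The function names are assumptions, and the percentile rule used here (smallest observed value whose cumulative frequency reaches p) is only one of several conventions.

```python
# Heights (cm) of the 28 BIOL 300 students listed on the slide.
heights = [165, 170, 142, 173, 168, 168, 155, 160, 165, 165, 163, 152, 154, 165,
           180, 173, 190, 165, 175, 165, 170, 170, 156, 155, 163, 168, 177, 166]

def cumulative_frequency(values):
    """Return (value, fraction of observations <= value) for each distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n)
            for i, v in enumerate(ordered)
            if i + 1 == n or ordered[i + 1] != v]  # keep the last copy of each value

def percentile(values, p):
    """Smallest observed value whose cumulative frequency is at least p (0 < p <= 1)."""
    for value, fraction in cumulative_frequency(values):
        if fraction >= p:
            return value

cdf = cumulative_frequency(heights)
print(percentile(heights, 0.5), percentile(heights, 0.9))
```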
Associations between two
categorical variables
Association between
reproductive effort and avian
malaria
Table 2.3A. Contingency table showing incidence of malaria in female great tits subjected to experimental egg removal.

               Control group   Egg removal group   Row total
Malaria              7               15               22
No malaria          28               15               43
Column total        35               30               65
Mosaic plot
[Figure: mosaic plot; y-axis: relative frequency (0.0 to 1.0); x-axis: treatment (Control, Egg removal)]
Figure 2.3B. Mosaic plot for reproductive effort and avian malaria in great tits (Table 2.3A). Blue fill indicates diseased birds, whereas white fill indicates birds free of malaria. n = 65 birds.
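The marginal totals of Table 2.3A and the relative frequencies drawn in the mosaic plot can be recomputed from the four cell counts. This is a rough sketch; the dictionary layout and variable names are assumptions chosen for readability.

```python
# Cell counts from Table 2.3A, keyed by treatment group and then by outcome.
counts = {
    "Control":     {"Malaria": 7,  "No malaria": 28},
    "Egg removal": {"Malaria": 15, "No malaria": 15},
}

# Column totals: number of birds in each treatment group.
column_totals = {group: sum(cells.values()) for group, cells in counts.items()}

# Row totals: number of birds with each outcome, summed over treatments.
row_totals = {}
for cells in counts.values():
    for outcome, k in cells.items():
        row_totals[outcome] = row_totals.get(outcome, 0) + k

grand_total = sum(column_totals.values())

# Relative frequencies within each treatment: the bar segments of the mosaic plot.
proportions = {
    group: {outcome: k / column_totals[group] for outcome, k in cells.items()}
    for group, cells in counts.items()
}

for group, cells in proportions.items():
    print(group, {outcome: round(p, 2) for outcome, p in cells.items()})
```

Comparing the malaria proportion between the two treatment columns is what the mosaic plot shows at a glance.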
Grouped Bar Graph
[Figure: grouped bar graph with bars for Malaria and No malaria within each treatment group (Control, Egg removal)]
Associations between
categorical and numerical
variables
Multiple histograms
[Figure: two histograms; panels: Non-conserved and Conserved; x-axis: Protein length (0 to 1000)]
Associations between two
numerical variables
Scatterplots
Evaluating Graphics
• Lie factor
• Chartjunk
• Efficiency
Don’t mislead with graphics
Better representation of truth
Lie Factor
• Lie factor = (size of effect shown in graphic) / (size of effect in data)
Lie Factor Example
Effect in graphic: 2.33/0.08 = 29.1
Effect in data: 6748/5844 = 1.15
Lie factor = 29.1 / 1.15 = 25.3
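The arithmetic of the example can be checked with a few lines of code. The function name `lie_factor` is an assumption; the four input numbers are taken from the slide. Note that dividing the unrounded ratios gives about 25.2, whereas the slide divides the rounded values 29.1 and 1.15 and gets 25.3.

```python
def lie_factor(effect_in_graphic, effect_in_data):
    """Ratio of the effect as drawn to the effect in the underlying data."""
    return effect_in_graphic / effect_in_data

graphic_effect = 2.33 / 0.08   # effect as shown in the graphic (slide values)
data_effect = 6748 / 5844      # effect in the data (slide values)

print(round(graphic_effect, 1), round(data_effect, 2))
print(round(lie_factor(graphic_effect, data_effect), 1))
```

A lie factor near 1 means the graphic represents the data faithfully; values far above 1 mean the graphic exaggerates the effect.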
Chartjunk
[Figure: example chart with series East, West, and North across 1st Qtr to 4th Qtr; axis 0 to 100]
Needless 3D Graphics
Summary: Graphical methods
for frequency distributions
Type of Data       Method
Categorical data   Bar graph
Numerical data     Histogram
                   Cumulative frequency distribution
Summary: Associations
between variables
                                   Explanatory variable
Response variable   Categorical                          Numerical
Categorical         Contingency table
                    Grouped bar graph
                    Mosaic plot
Numerical           Multiple histograms                  Scatter plot
                    Cumulative frequency distributions
Great book on graphics
Describing data
Two common descriptions of
data
• Location (or central tendency)
• Width (or spread)
Measures of location
Mean
Median
Mode
Mean
$$\bar{Y} = \frac{\sum_{i=1}^{n} Y_i}{n}$$
n is the size of the sample
Mean
Y1=56, Y2=72, Y3=18, Y4=42
$$\bar{Y} = (56+72+18+42) / 4 = 47$$
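The same calculation in code, as a minimal sketch: `sample_mean` is a hypothetical name that spells out the formula, and it is compared against `statistics.mean` from the Python standard library.

```python
from statistics import mean

def sample_mean(ys):
    """Sum of the observations divided by the sample size n."""
    return sum(ys) / len(ys)

ys = [56, 72, 18, 42]
print(sample_mean(ys))  # 47.0
print(sample_mean(ys) == mean(ys))
```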
Median
• The median is the middle measurement
in a set of ordered data.
The data: 18, 28, 24, 25, 36, 14, 34
can be put in order: 14, 18, 24, 25, 28, 34, 36
Median is 25.
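Finding the median amounts to sorting and picking the middle value. The sketch below follows that procedure; the handling of an even number of observations (averaging the two middle values) is a common convention that the slide does not cover, so treat it as an assumption.

```python
def median(values):
    """Middle measurement of the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    # Assumed convention for even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

data = [18, 28, 24, 25, 36, 14, 34]
print(sorted(data))
print(median(data))  # 25
```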
[Figure: frequency distribution of mouse weight at 50 days old, in a line selected for small size (x-axis 5 to 18; y-axis: frequency), with the median, mode, and mean marked]
Mean vs. median in politics
• 2004 U.S. Economy
• Republicans: times are good
– Mean income increasing ~ 4% per year
• Democrats: times are bad
– Median family income fell
• Why?
Mean 169.3 cm
Median 170 cm
Mode 165-170 cm
[Figure: cumulative frequency (0 to 1) against height (in cm) of Bio300 students (150 to 210)]
Measures of width
• Range
• Standard deviation
• Variance
• Coefficient of variation
Range
14 17 18 20 22 22 24 25 26 28 28 28 30 34 36
The range is 36 - 14 = 22
Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (Y_i - \mu)^2}{N}$$
Sample variance
$$s^2 = \frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n - 1}$$
n is the sample size
Shortcut for calculating
sample variance
$$s^2 = \frac{n}{n-1}\left(\frac{\sum_{i=1}^{n} Y_i^2}{n} - \bar{Y}^2\right)$$
Standard deviation (SD)
• Positive square root of the variance
σ is the true standard deviation
s is the sample standard deviation
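The sample variance, its shortcut form, and the sample standard deviation can be compared numerically. The sketch below uses the fifteen values from the range example; the function names are assumptions, and the shortcut is written as n/(n-1) times the difference between the mean of the squares and the squared mean.

```python
import math

def sample_variance(ys):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(ys)
    y_bar = sum(ys) / n
    return sum((y - y_bar) ** 2 for y in ys) / (n - 1)

def sample_variance_shortcut(ys):
    """Shortcut: n/(n-1) * (mean of the squares - square of the mean)."""
    n = len(ys)
    y_bar = sum(ys) / n
    mean_of_squares = sum(y * y for y in ys) / n
    return (n / (n - 1)) * (mean_of_squares - y_bar ** 2)

def sample_sd(ys):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(ys))

ys = [14, 17, 18, 20, 22, 22, 24, 25, 26, 28, 28, 28, 30, 34, 36]
print(sample_variance(ys), sample_variance_shortcut(ys), sample_sd(ys))
```

The two variance functions agree up to floating-point rounding. The shortcut avoids a second pass over the data, but it can lose precision when the mean is large relative to the spread.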
In class exercise
Calculate the variance and standard
deviation of a sample
with the following data:
6, 1, 2
Answer
Variance=7
Standard deviation = √7 ≈ 2.65
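For reference, the answer follows from the sample variance formula in three steps, worked here with the exercise data 6, 1, 2:

```latex
\begin{align*}
\bar{Y} &= \frac{6 + 1 + 2}{3} = 3 \\
s^2 &= \frac{(6-3)^2 + (1-3)^2 + (2-3)^2}{3 - 1} = \frac{9 + 4 + 1}{2} = 7 \\
s &= \sqrt{7} \approx 2.65
\end{align*}
```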
Coefficient of variation (CV)
$$CV = 100\% \times \frac{s}{\bar{Y}}$$
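As an illustration, the coefficient of variation of the in-class exercise data (6, 1, 2) can be computed with the standard library. The function name is an assumption; `statistics.stdev` returns the sample standard deviation, with n - 1 in the denominator.

```python
from statistics import mean, stdev

def coefficient_of_variation(ys):
    """Sample standard deviation as a percentage of the sample mean."""
    return 100 * stdev(ys) / mean(ys)

ys = [6, 1, 2]
print(round(coefficient_of_variation(ys), 1))
```

Because it is a ratio, the CV is unitless, which makes it useful for comparing the spread of variables measured on different scales.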
Equal means, different
variances
[Figure: frequency against value (x-axis -5 to 10) for three distributions with equal means and variances V=1, V=2, and V=10]
Manipulating means
• The mean of the sum of two variables:
E[X + Y] = E[X]+ E[Y]
• The mean of the sum of a variable and a constant:
E[X + c] = E[X]+ c
• The mean of a product of a variable and a constant:
E[c X] = c E[X]
• The mean of a product of two variables:
E[X Y] = E[X] E[Y]
if X and Y are independent.
Manipulating variance
• The variance of the sum of two variables:
Var[X + Y] = Var[X]+ Var[Y]
if X and Y are independent.
• The variance of the sum of a variable and a constant:
Var[X + c] = Var[X]
• The variance of a product of a variable and a constant:
Var[c X] = c^2 Var[X]
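These rules can be checked numerically on any small data set. The sketch below uses arbitrary illustrative values that are not from the slides, treats each list as a whole population (`statistics.pvariance`), and adds the general form of the sum rule, Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y], which reduces to the rule above when the covariance is zero. The `covariance` helper is an assumption written for this example.

```python
import math
from statistics import mean, pvariance

# Arbitrary illustrative values (not from the slides).
x = [6.0, 1.0, 2.0, 9.0, 4.0]
y = [3.0, 5.0, 1.0, 7.0, 2.0]
c = 3.0

def covariance(a, b):
    """Population covariance of two equal-length lists."""
    a_bar, b_bar = mean(a), mean(b)
    return mean([(ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)])

total = [xi + yi for xi, yi in zip(x, y)]
shifted = [xi + c for xi in x]
scaled = [c * xi for xi in x]

checks = {
    "E[X + Y] = E[X] + E[Y]": math.isclose(mean(total), mean(x) + mean(y)),
    "E[X + c] = E[X] + c": math.isclose(mean(shifted), mean(x) + c),
    "E[c X] = c E[X]": math.isclose(mean(scaled), c * mean(x)),
    "Var[X + c] = Var[X]": math.isclose(pvariance(shifted), pvariance(x)),
    "Var[c X] = c^2 Var[X]": math.isclose(pvariance(scaled), c ** 2 * pvariance(x)),
    "Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]": math.isclose(
        pvariance(total), pvariance(x) + pvariance(y) + 2 * covariance(x, y)),
}

for rule, holds in checks.items():
    print(rule, holds)
```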
Parents’ heights
                                Mean    Variance
Father Height                   174.3   71.7
Mother Height                   160.4   58.3
Father Height + Mother Height   334.7   184.9
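In this table the means add exactly (174.3 + 160.4 = 334.7), but the variance of the sum (184.9) is larger than the sum of the variances (71.7 + 58.3 = 130.0). One reading, offered here as an interpretation and not something the slide states, is that the two heights are not independent. Under the general identity Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y], the covariance implied by the table can be backed out as follows; the variable names are assumptions.

```python
import math

# Values from the parents' heights table.
var_father = 71.7
var_mother = 58.3
var_sum = 184.9

# What the variance of the sum would be if the heights were independent.
sum_if_independent = var_father + var_mother

# Covariance implied by Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y].
implied_covariance = (var_sum - sum_if_independent) / 2

# Corresponding correlation coefficient: Cov / (SD_father * SD_mother).
implied_correlation = implied_covariance / math.sqrt(var_father * var_mother)

print(round(sum_if_independent, 1), round(implied_covariance, 2),
      round(implied_correlation, 2))
```

A positive implied covariance would mean that taller fathers tend to be paired with taller mothers in this sample.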