Ethics of Data Representation Lecture

Ethics of data representation
v2.0
Data Visualisation Process
Collect Raw Data → Process and Filter Data → Clean Dataset → Exploratory Analysis → Generate Visualisation → Generate Conclusion
What is Ethics
when it comes to data visualisation?
• The figure/graph/image should show what is actually
happening and not what you want to happen.
• Different ways of being unethical:
– knowingly:
• deliberately showing the data in a misleading manner,
• choosing the ‘most representative’ image/experiment.
– unknowingly:
• not exploring/getting to know the data well enough,
• misusing your chosen graphical representation.
Cheating knowingly:
Choice of graph
• Hypothesis (what you want to see): Applying a
treatment will decrease the levels of a variable.
[Figure: two Before/After plots, y-axes from 0 to 1400, with a legend for Exp1 to Exp4.]
• You choose to plot your data like that.
• You know that this is what is going on.
Cheating knowingly:
Choice of axis/scale
• You want to show an increase in salary in the last term.
[Figure: two plots of Salary by month, June to Dec.]
• You choose to plot your data like that (y-axis from 19200 to 20200).
• You know that this is what is going on (y-axis from 0 to 25000).
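A minimal sketch of why the origin of the y-axis matters for a bar chart. The two salary values are hypothetical (they are not read off the slide), and `drawn_height` is an illustrative helper, not a plotting-library function.

```python
def drawn_height(value, axis_min):
    """Height of a bar as drawn, measured from the bottom of the y-axis."""
    return value - axis_min

# Hypothetical salaries, chosen only for illustration.
before, after = 19_500, 20_000

# Relative change in the data itself.
real_change = (after - before) / before

for axis_min in (0, 19_200):
    ratio = drawn_height(after, axis_min) / drawn_height(before, axis_min)
    print(f"axis starts at {axis_min}: second bar drawn {ratio:.2f}x as tall")

print(f"actual increase: {real_change:.1%}")
```

With the axis starting at 0, the bars look almost equal. With the axis starting at 19200, the same data look like a large jump.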
Cheating knowingly:
Choice of axis/scale
• Be careful with linear vs. logarithmic scales.
Cheating knowingly:
Choice of axis/scale
• If you want to cheat, a bar graph using a log axis is a great tool, as it
lets you either exaggerate differences between groups or minimize
them.
[Figure: the same bar graph on a Linear scale and on a Logarithmic scale.]
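A small sketch of the reason a log-axis bar graph can be bent either way: a log axis has no zero, so the bottom of the axis is arbitrary and the relative bar lengths depend on it. The group values and the helper `log_bar_length` are hypothetical, for illustration only.

```python
import math

def log_bar_length(value, axis_min):
    """Length of a bar on a log10 axis whose lowest point is axis_min."""
    return math.log10(value) - math.log10(axis_min)

# Hypothetical group means: B is twice A.
group_a, group_b = 100.0, 200.0

for axis_min in (1.0, 50.0, 99.0):
    ratio = log_bar_length(group_b, axis_min) / log_bar_length(group_a, axis_min)
    print(f"axis starts at {axis_min}: bar B drawn {ratio:.1f}x as long as bar A")
```

Starting the axis far below the data shrinks the visible difference. Starting it just below the smaller value inflates it.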
Cheating knowingly:
Choice of axis/scale
• A logarithmic axis should be used for:
– logarithmically spaced values,
– lognormal data.
Cheating knowingly:
Manipulating images: Western blot
• Presenting bands out of context
• ‘Playing’ too much with contrast
• ‘Rebuilding’ a Western blot from several cuts
[Figure: three versions of a blot: Original; Brightness and Contrast Adjusted; Brightness and Contrast Adjusted Too Much: Oversaturation.]
Cheating unknowingly:
Not exploring/getting to know the data well enough
• Hypothesis: increase from CondA to CondB. You run the experiment once and you choose to plot the data as a bar chart.
[Figure: three plots comparing CondA and CondB.]
Cheating unknowingly:
Not exploring/getting to know the data well enough
[Figure: plots of Value and of Standardised values for Control, Treatment 1, Treatment 2 and Treatment 3, with a legend for Exp1 to Exp5. Comparisons: Treatments vs. Control, annotated p=0.04, p=0.32 and p=0.001.]
Types of plot
Things you can illustrate
Plot types – Distribution/Exploration
Histograms
• Very good for exploring data. Best on large datasets.
• Rules of thumb: number of intervals ≈ √N and interval width ≈ range ÷ √N, where N is the number of values.
• Histograms are great, but be careful with the resolution (= number of bins), as it affects the shape of the distribution.
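The square-root rule above can be sketched in a few lines. The function name `sqrt_rule` and the sample are hypothetical. This is only a starting point for the number of bins, not a definitive choice.

```python
import math

def sqrt_rule(values):
    """Return (number_of_bins, bin_width) from the square-root rule of thumb."""
    n = len(values)
    n_bins = max(1, round(math.sqrt(n)))      # number of intervals ≈ √N
    data_range = max(values) - min(values)
    width = data_range / math.sqrt(n)         # interval width ≈ range ÷ √N
    return n_bins, width

# Hypothetical sample: 100 values spanning 0 to 99.
sample = list(range(100))
n_bins, width = sqrt_rule(sample)
print(n_bins, width)
```

It is worth trying a few values around the suggested number of bins, since the shape of the histogram can change with the resolution.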
Plot types – Distribution/Exploration
Histograms
• Be careful with the resolution … and the type of data you are dealing with.
[Figure: three histograms (y-axis: Number of values) with Bin width = 1, Bin width = 1.25 and Bin width = 1.5.]
• Histograms are great, but be careful with discrete data.
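A sketch of what can go wrong with discrete data: when the bin width is not a multiple of the step between possible values, some bins catch more distinct values than others. The scores below are hypothetical and perfectly flat, and `bin_counts` is an illustrative helper.

```python
def bin_counts(values, width, lo, hi):
    """Count values in consecutive bins of the given width between lo and hi."""
    n_bins = round((hi - lo) / width)
    counts = [0] * n_bins
    for v in values:
        # The last bin also takes values equal to hi.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    return counts

# Hypothetical discrete scores: every integer from 0 to 10 occurs 3 times.
scores = [s for s in range(11) for _ in range(3)]

flat = bin_counts(scores, 1.0, 0, 11)      # one integer per bin
comb = bin_counts(scores, 1.5, 0, 10.5)    # bins hold one or two integers
print(flat)
print(comb)
```

With a bin width of 1 the histogram is flat, as the data are. With a bin width of 1.5 the counts alternate between high and low, which is a pattern created by the binning and not by the data.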
Plot types – Distribution/Exploration
Boxplots and Bean plots
[Figure: boxplots of Length (cm) for Male and Female, annotated from top to bottom:]
• Maximum
• Upper Quartile (Q3): 75th percentile (3rd quartile)
• Median
• Interquartile Range (IQR): 50% of the data
• Lower Quartile (Q1): 25th percentile (1st quartile)
• Minimum
• Cutoff = Q1 – 1.5*IQR
• Outlier
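The quantities annotated on the boxplot can be computed as follows. This is a sketch under assumptions: the lengths are hypothetical, the quartiles use the "inclusive" method of Python's `statistics.quantiles` (plotting packages may use a slightly different convention), and the upper cutoff Q3 + 1.5*IQR is added by symmetry since only the lower cutoff appears on the slide.

```python
import statistics

def boxplot_summary(values):
    """Quartiles, IQR, cutoffs and outliers for a Tukey-style boxplot."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr     # cutoff shown on the slide
    upper_cutoff = q3 + 1.5 * iqr     # assumption: symmetric upper cutoff
    outliers = [v for v in values if v < lower_cutoff or v > upper_cutoff]
    return {
        "q1": q1, "median": median, "q3": q3, "iqr": iqr,
        "lower_cutoff": lower_cutoff, "upper_cutoff": upper_cutoff,
        "outliers": outliers,
    }

# Hypothetical lengths in cm.
lengths = [62, 84, 86, 88, 90, 92, 94, 96, 98]
summary = boxplot_summary(lengths)
print(summary)
```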
Plot types – Distribution/Exploration
Boxplots and Bean plots
[Figure: bean plots of three distributions: Bimodal, Uniform and Normal.]
• A bean = a ‘batch’ of data.
• The scatterplot shows individual data.
• Data density is mirrored by the shape of the polygon.
• Very good for exploring data. Best on medium-sized datasets.
• Boxplots are great, but be careful with the underlying distribution.
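A sketch of why the underlying distribution matters: two hypothetical samples with the same minimum, quartiles and maximum, and therefore the same box and whiskers, can have very different shapes. The quartiles use the "inclusive" method of `statistics.quantiles`, which is an assumption about the convention.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))

# Hypothetical samples of 13 values each.
unimodal = [2, 3, 4, 4, 5, 5, 5, 5, 5, 6, 6, 7, 8]   # one peak at 5
bimodal = [2, 4, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 8]    # peaks at 4 and 6

print(five_number_summary(unimodal), statistics.multimode(unimodal))
print(five_number_summary(bimodal), statistics.multimode(bimodal))
```

Both samples give the same summary, but only a plot that shows density or individual points reveals the two peaks.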
Plot types – Exploration/Comparison
Stripcharts/Scatterplots
[Figure: stripchart of Values for Control, CondA, CondB, CondC and CondD.]
• Very good for exploring data. Best on small/medium datasets.
• Very informative: exploration AND comparison.
• Very hard to cheat with these.
• Stripcharts are great, but they don’t work so well with big samples.
Plot types – Comparisons
Barcharts
[Figure: bar charts of Control, CondA, CondB, CondC and CondD. Panels: Standard error; Standard deviation; Confidence interval; Star wars (cool graph!).]
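The three kinds of error bar differ a lot in length for the same data, so a bar chart should say which one it shows. A minimal sketch, with hypothetical measurements and a normal approximation (multiplier 1.96) for the 95% confidence interval. For small samples a t-based multiplier would be larger.

```python
import math
import statistics

def error_bars(values, z=1.96):
    """Return (standard deviation, standard error, 95% CI half-width)."""
    n = len(values)
    sd = statistics.stdev(values)     # sample standard deviation
    sem = sd / math.sqrt(n)           # standard error of the mean
    ci_half_width = z * sem           # assumption: normal approximation
    return sd, sem, ci_half_width

# Hypothetical measurements for one condition (n = 9).
cond_a = [1.8, 2.1, 2.4, 2.0, 2.2, 1.9, 2.3, 2.1, 2.0]
sd, sem, ci = error_bars(cond_a)
print(f"SD = {sd:.3f}, SEM = {sem:.3f}, 95% CI half-width = {ci:.3f}")
```

The standard error shrinks as the sample grows, while the standard deviation does not, so SEM bars can make noisy data look tidy.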
Plot types – Comparisons
Barcharts
• Be careful with the scale when plotting ratios.
[Figure: two plots over an x-axis from 0 to 100: ratio (y-axis from 0 to 6) and log2(ratio) (y-axis from -3 to 3).]
• Very good for presenting results and emphasizing differences.
• Effectiveness: most important info with the most effective channel.
• Barcharts are great, but only after data exploration, and the y-axis needs to be chosen wisely.
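A sketch of why the scale matters for ratios: on a linear scale, decreases are squeezed between 0 and 1 while increases are unbounded, whereas on a log2 scale a halving and a doubling sit at the same distance from "no change". The ratios below are hypothetical.

```python
import math

def distance_from_no_change(ratio):
    """Return (linear distance from 1, log2 distance from 0)."""
    return ratio - 1.0, math.log2(ratio)

# Hypothetical fold changes: 4x down, 2x down, none, 2x up, 4x up.
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]

for r in ratios:
    linear, log2 = distance_from_no_change(r)
    print(f"ratio {r}: linear {linear:+.2f}, log2 {log2:+.2f}")
```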
Plot types – Relationship/Comparison
Line graphs
[Figure: line graphs: Percent survival over Time for CaP, OCaP, ACaP and OACaP; Arbitrary change over time; Value for Control, Treatment 1, Treatment 2 and Treatment 3 (5 experiments).]
Except for exploration …
• Very good for presenting results of matched/paired/repeated data.
• Linecharts are great, but be careful with the axes.
Plot types – Relationships
Scatterplot
• Very good for understanding the relationship between quantitative
variables.
Plot types – Relationships
Scatterplots
• Scatterplots are great, but big data can be tricky.
• Solution: a smoothed-density colour representation.
Plot types – Relationships
Heatmaps
• Great for big data sets; they allow plotting a third quantitative value, with a colour scheme for grouping.
[Figure: heatmaps labelled Euclidean distance, Correlation and Colour scheme.]
• Heatmaps are great, but plot data that are changing.
A heatmap is basically a table that has colors in place of numbers.
Simon’s data from simple numbers to correlation
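Euclidean distance and correlation can group the same rows very differently. A minimal sketch with three hypothetical profiles. The helper functions are written out by hand for illustration and are not taken from any clustering package.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two profiles of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson(a, b):
    """Pearson correlation between two profiles of equal length."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

# Hypothetical profiles across four samples.
profile_a = [1.0, 2.0, 3.0, 2.0]
profile_b = [10.0, 20.0, 30.0, 20.0]   # same shape as A, 10x the magnitude
profile_c = [3.0, 2.0, 1.0, 2.0]       # same magnitude as A, opposite shape

for name, other in (("B", profile_b), ("C", profile_c)):
    print(f"A vs {name}: distance {euclidean(profile_a, other):.2f}, "
          f"correlation {pearson(profile_a, other):+.2f}")
```

By Euclidean distance A is closest to C. By correlation A is closest to B. The choice of measure decides what "similar" means in the heatmap.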
Plot types – Composition
Stack charts/Pie charts
[Figure: stacked bar chart of Percentage for Group A and Group B (categories A to E), and two pie charts of the same categories, each labelled Total = 62.]
• Stack/pie charts are great, but keep an eye on the sample size.
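A sketch of why the sample size needs to stay visible: percentages alone cannot distinguish a handful of observations from hundreds. The counts below are hypothetical, and `percentages` is an illustrative helper.

```python
def percentages(counts):
    """Return ({category: percentage}, total) for a dict of counts."""
    total = sum(counts.values())
    return {k: 100 * v / total for k, v in counts.items()}, total

# Hypothetical counts for two groups.
small_group = {"A": 2, "B": 1, "C": 1, "D": 1}           # 5 observations
large_group = {"A": 200, "B": 100, "C": 100, "D": 100}   # 500 observations

small_pct, small_n = percentages(small_group)
large_pct, large_n = percentages(large_group)
print(small_pct, f"Total = {small_n}")
print(large_pct, f"Total = {large_n}")
```

Both groups produce identical stacks or pies, which is why the total belongs on the chart.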