Ethics of data representation v2.0

Data Visualisation Process
• Collect Raw Data
• Process and Filter Data
• Clean Dataset
• Exploratory Analysis
• Generate Visualisation
• Generate Conclusion

What is ethics when it comes to data visualisation?
• The figure/graph/image should show what is actually happening and not what you want to happen.
• Different ways of being unethical:
  – knowingly:
    • deliberately showing the data in a misleading manner,
    • choosing the ‘most representative’ image/experiment.
  – unknowingly:
    • not exploring/getting to know the data well enough,
    • misusing your chosen graphical representation.

Cheating knowingly: Choice of graph
• Hypothesis (what you want to see): applying a treatment will decrease the levels of a variable.
[Figure: two Before/After panels with y-axes from 0 to 1400, one with a legend for Exp1 to Exp4. Captions: "You choose to plot your data like that" and "You know that this is what is going on".]

Cheating knowingly: Choice of axis/scale
• You want to show an increase in salary in the last term.
[Figure: two plots of Salary by month (June to Dec), one with a y-axis from 19200 to 20200 and one with a y-axis from 0 to 25000. Captions: "You choose to plot your data like that" and "You know that this is what is going on".]

Cheating knowingly: Choice of axis/scale
• Be careful with linear vs. logarithmic scales.

Cheating knowingly: Choice of axis/scale
• If you want to cheat, a bar graph using a log axis is a great tool, as it lets you either exaggerate differences between groups or minimize them.
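The effect of the axis choice on a bar graph can be put into numbers. The sketch below is only an illustration: the function names and all values (the two salaries, the 100 vs. 200 comparison, the axis minimums) are hypothetical and not taken from the slides. It computes how much taller one bar is drawn than another, depending on where the axis starts.

```python
import math


def linear_bar_ratio(a, b, axis_min=0.0):
    """Drawn height of bar b divided by drawn height of bar a on a linear axis."""
    if axis_min >= min(a, b):
        raise ValueError("axis_min must be below both values")
    return (b - axis_min) / (a - axis_min)


def log_bar_ratio(a, b, axis_min):
    """Same ratio on a log axis. A log axis has no zero, so the bars must
    start at some positive axis_min, and that choice sets their height."""
    if axis_min <= 0 or axis_min >= min(a, b):
        raise ValueError("axis_min must be positive and below both values")
    return math.log10(b / axis_min) / math.log10(a / axis_min)


# Hypothetical salaries (not read from the slide's figure).
june, december = 19500.0, 20000.0
print(linear_bar_ratio(june, december, axis_min=0))      # about 1.03
print(linear_bar_ratio(june, december, axis_min=19200))  # about 2.67

# Hypothetical group means of 100 and 200 on a log axis.
print(log_bar_ratio(100, 200, axis_min=1))   # about 1.15: difference minimized
print(log_bar_ratio(100, 200, axis_min=90))  # about 7.6: difference exaggerated
```

The data are identical in every call; only the axis start changes, which is why a truncated or logarithmic axis on a bar graph needs to be justified by the data rather than by the message.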
[Figure: bar graphs drawn on a linear scale and on a logarithmic scale.]

Cheating knowingly: Choice of axis/scale
• A logarithmic axis should be used for:
  – logarithmically spaced values,
  – lognormal data.

Cheating knowingly: Manipulating images: Western blot
• Presenting bands out of context
• ‘Playing’ too much with contrast
• ‘Rebuilding’ a Western blot from several cuts
[Figure: Western blot panels labelled "Original", "Brightness and Contrast Adjusted" and "Brightness and Contrast Adjusted Too Much: Oversaturation".]

Cheating unknowingly: Not exploring/getting to know the data well enough
• Hypothesis: increase from CondA to CondB. You run the experiment once and you choose to plot the data as a bar chart.
[Figure: three panels comparing CondA and CondB.]

Cheating unknowingly: Not exploring/getting to know the data well enough
[Figure: "Comparisons: Treatments vs. Control". Panels plot Value and Standardised values for Control and Treatments 1 to 3 across Exp1 to Exp5; annotations: p=0.04, p=0.32, p=0.001.]

Types of plot
Things you can illustrate

Plot types – Distribution/Exploration
Histograms
• Very good for exploring data. Better on big datasets.
• Rules: number of intervals ≈ √N and interval width ≈ range ÷ √N.
• Histograms are great, but be careful with the resolution (= number of bins), as it affects the shape of the distribution.

Plot types – Distribution/Exploration
Histograms
• Be careful with the resolution … and the type of data you are dealing with.
[Figure: three histograms (y-axis: Number of values) with bin width = 1.5, bin width = 1 and bin width = 1.25.]
• Histograms are great, but be careful with discrete data.
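The √N rule and the warning about discrete data can be checked with a few lines of code. This is a minimal sketch under stated assumptions: the helper names `sqrt_rule` and `bin_counts` are hypothetical, the data are made up, and the bins are assumed to be left-closed and to start at `start`.

```python
import math


def sqrt_rule(values):
    """Number of intervals ~ sqrt(N) and interval width ~ range / sqrt(N)."""
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    k = max(1, round(math.sqrt(n)))
    width = (max(values) - min(values)) / k
    return k, width


def bin_counts(values, width, start=0.0):
    """Count values per bin [start + i*width, start + (i+1)*width).
    Assumes every value is >= start."""
    counts = {}
    for v in values:
        idx = int((v - start) // width)
        counts[idx] = counts.get(idx, 0) + 1
    return [counts.get(i, 0) for i in range(max(counts) + 1)]


# 100 made-up values spanning a range of 99: about 10 bins of width 9.9.
print(sqrt_rule(list(range(100))))

# Discrete data: each integer from 0 to 8 occurs exactly once.
discrete = list(range(9))
print(bin_counts(discrete, width=1.0))  # flat: one value per bin
print(bin_counts(discrete, width=1.5))  # alternating 2 and 1
```

With a bin width of 1.5 the bins alternately capture two integers and one integer, so a perfectly flat discrete distribution appears to have peaks. The pattern comes from the binning, not from the data.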
Plot types – Distribution/Exploration
Boxplots and Bean plots
[Figure: annotated boxplots of Length (cm) for Male and Female, with these labels:]
• Maximum
• Upper Quartile (Q3): 75th percentile (3rd quartile)
• Median
• Interquartile Range (IQR): 50% of the data
• Lower Quartile (Q1): 25th percentile (1st quartile)
• Minimum
• Cutoff = Q1 – 1.5*IQR
• Outlier

Plot types – Distribution/Exploration
Boxplots and Bean plots
[Figure: bean plots. A bean = a ‘batch’ of data; the scatterplot shows individual data; the data density is mirrored by the shape of the polygon. Example distributions: bimodal, uniform, normal.]
• Very good for exploring data. Better on medium-sized datasets.
• Boxplots are great, but be careful with the underlying distribution.

Plot types – Exploration/Comparison
Stripcharts/Scatterplots
[Figure: stripchart of Values (0.0 to 2.0) for Control, CondA, CondB, CondC and CondD.]
• Very good for exploring data. Better on small/medium datasets.
• Very informative: exploration AND comparison.
• Very hard to cheat with these.
• Stripcharts are great, but they don’t work so well with big samples.

Plot types – Comparisons
Barcharts
[Figure: four bar charts of Control and CondA to CondD, titled "Standard error", "Standard deviation", "Confidence interval" and "Star wars (cool graph!)".]

Plot types – Comparisons
Barcharts
• Be careful with the scale when plotting ratios.
[Figure: two panels with y-axes "ratio" and "log2(ratio)" and an x-axis from 0 to 100.]
• Very good for presenting results and emphasizing differences.
• Effectiveness: show the most important information with the most effective channel.
• Barcharts are great, but only after data exploration, and the y-axis needs to be chosen wisely.
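The warning about the scale when plotting ratios can be made concrete. The sketch below uses made-up ratios (not data from the slides) to show why the log2(ratio) axis is often the fairer choice: on a linear axis a two-fold increase and a two-fold decrease sit at different distances from "no change", whereas on a log2 axis they are symmetric.

```python
import math

# Hypothetical fold changes: 4x down, 2x down, no change, 2x up, 4x up.
ratios = [0.25, 0.5, 1.0, 2.0, 4.0]

for r in ratios:
    linear_distance = abs(r - 1.0)   # distance from "no change" (ratio = 1)
    log2_value = math.log2(r)        # "no change" maps to 0
    print(f"ratio={r:<5} |ratio - 1|={linear_distance:<5} log2(ratio)={log2_value:+.1f}")

# On the linear axis, halving (0.5) sits 0.5 below 1 while doubling (2.0)
# sits 1.0 above it; on the log2 axis they sit at -1 and +1.
```

All decreases are squeezed between 0 and 1 on the linear axis while increases are unbounded, so bars of raw ratios visually favour increases.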
Plot types – Relationship/Comparison
Line graphs
[Figure: line graphs of "Arbitrary change over time" and "Percent survival" against Time, with legends CaP, OCaP, ACaP and OACaP; and a panel of Value for Control and Treatments 1 to 3 labelled "5 experiments".]
Except for exploration …
• Very good for presenting results of matched/paired/repeated data.
• Line graphs are great, but be careful with the axes.

Plot types – Relationships
Scatterplot
• Very good for understanding the relationship between quantitative variables.

Plot types – Relationships
Scatterplots
• Scatterplots are great, but big data can be tricky.
• Solution: a smoothed-density colour representation.

Plot types – Relationships
Heatmaps
• Great for big datasets; they allow you to plot a third quantitative value: colour scheme for grouping.
[Figure: heatmap panels labelled "Euclidean distance", "Correlation" and "Colour scheme".]
• Heatmaps are great, but plot data that are changing.
• A heatmap is basically a table that has colours in place of numbers.
Simon’s data from simple numbers to correlation

Plot types – Composition
Stack charts/Pie charts
[Figure: stacked bar chart of Percentage (0 to 100) for Group A and Group B with categories A to E, and pie charts with categories A to E, each annotated Total=62.]
• Stack/pie charts are great, but keep an eye on the sample size.
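One way to see why the sample size matters for composition charts is to ask how far a single observation moves a slice. The sketch below is an illustration only: the function names and the category counts are hypothetical (chosen to sum to the Total=62 shown on the slide, and to a deliberately small total of 5 for contrast).

```python
def percentages(counts):
    """Convert category counts to percentages of the total."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one observation")
    return {name: 100.0 * n / total for name, n in counts.items()}


def points_per_observation(total):
    """Percentage points by which one extra observation changes a slice."""
    return 100.0 / total


# Hypothetical counts; only the total of 62 appears on the slide.
large = {"A": 20, "B": 15, "C": 12, "D": 10, "E": 5}
small = {"A": 2, "B": 1, "C": 1, "D": 1, "E": 0}

for label, counts in (("large", large), ("small", small)):
    total = sum(counts.values())
    shares = {k: round(v, 1) for k, v in percentages(counts).items()}
    print(label, f"Total={total}", shares,
          f"one observation = {points_per_observation(total):.1f} points")
```

With 5 observations a single value is worth 20 percentage points, with 62 it is worth under 2, yet both charts fill the same 100%. Printing the total next to the chart, as the slide does, lets the reader judge how much weight the slices can bear.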