The theory of data visualisation
v2.1
Simon Andrews, Phil Ewels
simon.andrews@babraham.ac.uk
phil.ewels@scilifelab.se

Data Visualisation
• A scientific discipline involving the creation and study of the visual representation of data whose goal is to communicate information clearly and efficiently to users.
• Data Visualisation is both an art and a science.

Sample A    Sample B
1           1
2           4
4           16
8           64
12          144

[Figures: the Sample A and Sample B values above drawn as several alternative charts]

ISBN-10: 1466508914
http://www.cs.ubc.ca/~tmm/talks.html

Data Viz Process
• Collect Raw Data
• Process and Filter Data
• Clean Dataset
• Exploratory Analysis
• Generate Visualisation
• Generate Conclusion

A data visualisation should…
• Show the data
• Not distort the data
• Summarise to make things clearer
• Serve a clear purpose
• Link to the accompanying text and statistics

[Figure slide: Things you can illustrate]

Graphical Representations
• Basic questions
  – How are you going to turn the data into a graphical form (weight becomes length etc.)
  – How are you going to arrange things in space
  – How are you going to use colours, shapes etc. to clarify the point you want to make

Marks and Channels
• Marks
  – Geometric primitives
    • Lines
    • Points
    • Areas
  – Used to represent data sets
• Channels
  – Graphical appearance of a mark
    • Colour
    • Length
    • Position
    • Angle
  – Used to encode data

Figures are a combination of marks and channels
• 1 Mark = Rectangle; 1 Channel = Length of longest side
• 1 Mark = Circle segment; 1 Channel = Angle
• 1 Mark = Diamond shape; 2 Channels = X position, Y position
• 1 Mark = Circle; 4 Channels = X position, Y position, Area, Colour

Golden Rules
• Effectiveness
  – Encode the most important information with the most effective channel
• Expressiveness
  – Match the properties of the data and channel

Types of channel
• Quantitative
  – Position on scale
  – Length
  – Angle
  – Area
  – Colour (saturation)
  – Colour (lightness)
• Qualitative
  – Spatial Grouping
  – Colour (hue)
  – Shape

Colour
• Technical representations of colour
  – Red + Green + Blue (RGB)
  – Cyan + Magenta + Yellow + Black (CMYK)
• Perceptual representation of colour
  – Hue + Saturation + Lightness (HSL)

HSL Representation
• Hue = Shade of colour = Qualitative
• Saturation = Amount of colour = Quantitative
• Lightness = Amount of white = Quantitative
• Humans have no innate quantitative perception of hue but we have learned some (cold – hot, rainbow etc.)
• Our perception of hue is not linear

Data Types
• Quantitative
  – Height, Length, Weight, Expression etc.
• Ordered
  – Small, Medium, Large
  – January, February, March
• Categorical
  – WT, Mutant1, Mutant2
  – GeneA, GeneB, GeneC

Effectiveness of quantitation
[Figure: pairs of values encoded with different channels; the ratios labelled on the panels include 1.8X, 2X, 3.4X, 4.5X, 7X and 16X]

[Figure slide: Quantitation Perception]

Most Quantitative Representations
(ordered from good quantitation to poor quantitation)
• Bar chart
• Stacked bar chart with common start
• Stacked bar chart with different starts
• Pie charts
• Bubble plots (circular area)
• Rectangular area
• Colour (luminance)
• Colour (saturation)

Discriminability
• If you encode categorical data, are the differences between categories easy for the user to perceive correctly?

Qualitative Discrimination
• How many colours can you discriminate?
• How many (fillable) shapes can you discriminate?
• Shape can be combined with colour, but the shapes need to maintain similar fillable areas

Separability
• The effectiveness of a channel does not always survive being combined with a second channel.
• There are large variations in how much two different channels interfere with each other
• Trying to put too much information on a figure can erode the impact of the main point you’re trying to make

[Figure panels, captioned:]
• There is no confusion between the two channels
• Larger points are easier to discriminate than smaller ones
• We tend to focus on the area of the shape rather than the height/width separately
• Humans are very bad at separating combined colours

Popout
• A distinct item immediately stands out from the others
• Triggered by our low-level visual system
• You don’t need to actively look at every point (slow!) to see it

[Figure: find the red circle]
• Speed of identification is independent of the number of distracting points

[Figure: find the circle]
• Colour pops out more than shape

[Figure: find the red circle]
• Mixing channels removes the effect

Use of space
• Where you want a viewer to focus on specific subsets of data, you can help their perception by using the layout or highlighting of data to draw their attention to the point you’re making

Grouping
[Figure: bar charts (y axis 0 to 80) of CpG, CHH and CHG values, grouped by Exon, CGI, Intron and Repeat]

Containment
[Figure: bar charts (y axis 0 to 80) of CpG, CHH and CHG values, with panels labelled Wild Type and Mutant]

Linking
[Figure: bar charts with categories 1, 2 and 3, on y axes of 0 to 80, 0 to 30 and 0 to 25]

Ordering
• Is a monkey heavier than a dog?
[Figure: bar charts of Weight (kg), y axis 0 to 140, for aardvark, cat, cow, dog, fish, horse and monkey, shown in two different orderings]

Validation
• Always try to validate plots you create
• You have seen your data too often to get an unbiased view
• Show the plot to someone not familiar with the data
  – What does this plot tell you?
  – Is this the message you wanted to convey?
  – If they pick multiple points, do they choose the most important one first?

General Rules
• No unnecessary figures
  – Does a graphical representation make things clearer?
  – Would a table be better?
• One point per figure
  – Design each figure to illustrate a single point
  – Adding complexity compromises the effectiveness of the main point
• No absolute reliance on colour
  – Figures should ideally still work in black and white
  – Colour should help perception
• No 3D
  – 3D is hardly ever justified and makes things less clear
• Figures should be self-contained
  – Must be understandable without additional information
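
The "No absolute reliance on colour" rule can be checked roughly before a figure is drawn. The sketch below is an illustration only, not part of the original slides. It uses Python's standard `colorsys` module to read the HSL lightness of each colour in a palette and asks whether the colours would still differ if hue were thrown away. The function names, the example palettes and the `min_gap` threshold of 0.15 are all assumptions chosen for the example, and HSL lightness is only a crude proxy for how a colour actually prints in greyscale.

```python
import colorsys


def lightness(rgb):
    """Return the HSL lightness (0..1) of an (r, g, b) tuple of 0..255 values."""
    r, g, b = (c / 255 for c in rgb)
    _hue, light, _sat = colorsys.rgb_to_hls(r, g, b)  # note the H, L, S order
    return light


def survives_greyscale(palette, min_gap=0.15):
    """True if every pair of colours differs in lightness by at least min_gap.

    min_gap is an arbitrary illustrative threshold, not a published standard.
    """
    values = sorted(lightness(rgb) for rgb in palette.values())
    # After sorting, the smallest pairwise gap is always between neighbours.
    return all(b - a >= min_gap for a, b in zip(values, values[1:]))


# Hypothetical palettes for the categorical example WT / Mutant1 / Mutant2.
hue_only = {
    "WT": (255, 0, 0),
    "Mutant1": (0, 255, 0),
    "Mutant2": (0, 0, 255),
}
hue_and_lightness = {
    "WT": (51, 0, 0),
    "Mutant1": (204, 0, 0),
    "Mutant2": (255, 153, 153),
}

if __name__ == "__main__":
    print(survives_greyscale(hue_only))           # False: all three have lightness 0.5
    print(survives_greyscale(hue_and_lightness))  # True: lightness 0.1, 0.4, 0.8
```

The first palette separates categories by hue alone, so the check fails; the second also varies lightness, so the categories remain distinguishable without hue. Showing a printed black and white copy to someone unfamiliar with the data, as the Validation slide suggests, remains the more reliable test.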