Biomedical Visualization
Kai Xu

• Introduction
• Brief background
• Visualisation in biomedical research
• An example
• Challenges: data quality, uncertainty, temporal / dynamic, multi-source.

Recent development in Genomics
• The sequencing of the human genome finished in 2003.
• Current research tries to understand how genes work:
  – What does this gene do?
  – Which group of genes is responsible for this?
  – How do these genes and external environmental factors cause this?
  – …
• Ultimate goal: personalised medicine based on individual DNA:
  – Every patient gets the most effective treatment based on his/her DNA.
(Figure source: Nature, 2001)

The emergence of new research fields
• Such research requires knowledge beyond traditional biology, and there are many new fields:
  – Bio-physics
  – Bio-chemistry
  – Bio-mathematics / statistics
  – Bio-informatics
  – …

Bioinformatics
• Also known as “computational biology”.
• Uses computer science knowledge to solve biological problems.
• Two types of research contribution:
  – Computer science;
  – Biology.
• Visualisation is an important part of bioinformatics.

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

Concepts
• DNA
  – A nucleic acid which carries genetic instructions.
  – Forms a double helix. Each strand of DNA is a linked chain of 4 types of “bases”: A, C, G, and T.
• Genes: segments of DNA
  – Only a small percentage of DNA is genes;
  – Spread through DNA (not continuous);
  – Control the encoding of proteins.
• Genome: both the genes and the noncoding sequences
• Genomics: genome research
• RNA
  – Single-stranded nucleic acid
  – Copies genetic information from DNA and then passes it on to proteins
• Proteins
  – Long chains containing 20 different kinds of amino acids
  – Primary constituents of living things

Central Dogma of Molecular Biology
• The flow of genetic information follows the path DNA → RNA → Protein
• “DNA makes RNA, RNA makes protein, and proteins make us.” (F. Crick)

Systems Biology
• The study of the mechanisms underlying complex biological processes as integrated systems of many, diverse, interacting components.
• Integrative approach to biology
  – Interprets data within the context of other information
  – Aims to merge the existing knowledge into one consistent picture of biology across all scales

Why should we care?
• Because life is interconnected through many different layers:
  – Molecules
  – Cells
  – Tissues / organs
  – Organisms
  – Biological communities / ecosystems
• Every change at one level leads to changes in other levels.

Model inter-connections with networks
Networks (nodes / edges):
• Molecular graphs: atoms / bonds
• Metabolic networks: metabolites / reactions
• Interaction networks: proteins / interactions
• Regulatory networks: genes / activation
• Ecological networks: species / interaction
• Evolutionary networks: species / evolution

Time and Size Scales
• Molecular graphs: 10^-14 s, 10^-9 m
• Metabolic networks: 10^-10 s, 10^-8 m
• Interaction networks: 10^-8 s, 10^-7 m
• Regulatory networks: 10^-2 s, 10^-6 m
• Ecological networks: days, 10^1 m
• Evolutionary networks: 10^6 years, 10^6 m

Challenges
• Missing data
  – Networks at organ and organism level;
  – Networks between the levels.
• Complexity
  – Multiple networks in one level: metabolic / interaction / regulation networks at cellular level.
• Dynamics
  – All networks are changing all the time, responding to various internal conditions and external stimuli;
  – Evolution.

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

• Visualization can contribute to bioinformatics and systems biology, because it is good at:
  – Utilising the super computing power of the human brain;
  – Presenting complex data;
  – Showing the overall structure;
  – Providing hints and hypotheses.
• Biomedical visualization is an application of graph visualization.
• In some cases, it just requires applying existing visualization methods.
• Sometimes, such solutions may not exist because they are still open problems for graph visualization:
  – Visualization of large and complex networks.
• Sometimes, it also presents new challenges:
  – Visualizing the dynamics of inter-level networks at dramatically different scales.
• What is sure is that it is pushing the boundaries of both areas.

• Introduction
• Brief background
• Visualisation in biomedical research
  – Non-network visualisation
  – Network visualisation
• An example

Sequence Alignment
• Used to study the evolution of species by analyzing protein or nucleic acid sequences.
• Multiple sequences can be aligned.
• Sequences are aligned such that corresponding positions line up in the same column.
• The small differences between sequences are attributed to insertions or deletions.
(Figure: two aligned DNA sequences; image source: wikipedia.org)
[1] Jalview

Sequence Alignment: Consensus Sequences
• Consensus sequences are a way to summarize the results of a sequence alignment.
• Conserved sequence motifs are probably important sequences that remained the same during evolution.
• A consensus sequence does not specify the probability of occurrence for a letter at a given position.
Consensus sequence: [AG]TA[TA][TA][TAGC]{G}

Sequence Alignment: Sequence Logos
• Every position summarizes the occurrences of the letters in that position
  – Color: pre-attentive, reinforces the text
  – Height proportional to the frequency
• Scalability is limited: at most dozens of positions.
• Validation: although the visualization is simple, it is used as a standard for representing consensus sequences in the scientific literature.
(Figure: sequence logo for the consensus sequence [AG]TA[TA][TA][TAGC]{G})

Sequence Alignment: Jalview
(Figure label: aligned sequences)
• Automatic multiple sequence alignment can be improved by manual editing.
• Reinforcement using multiple visual expressions of the same information
  – Color intensity on the sequences and consensus bar height
  – Height and color intensity for the quality and conservation metrics
• Interaction consists of sequence editing (insertion and deletion)
(Figure labels: hundreds of positions; coupled perspectives; summarized results)

Sequence Alignment: Vista
• Visualization of pre-computed multiple sequence alignments: Vista.
• Large amounts of linear data (~100k)
• Colored curve (conservation metric)
  – Curve only above a threshold
  – Colors code types of regions (exons, introns, …)
• Visualization information
  – Colored curve (conservation metric)
  – Annotations
  – Several windows on the data
• Interaction
  – Zoom
  – History
(Figure label: annotations (exons, UTR))

Pattern Discovery
• Pattern discovery: detect matching sequences of amino acids or nucleic acids that correspond to a given pattern.
• Task time increases drastically with the length of the pattern.
• Metapatterns can be detected visually.
Nucleic acid sequences:
  S1 = ATGGGACTCCTTCC
  S2 = GATGTGATATTCTCC
Matching patterns:
  P1 = TGxxAxT
  P2 = CTCC

Pattern Discovery: PattVision
• Sequences are aligned in the X-Y plane
• Patterns are emphasized with colors in the plane and with blending planes
  – Orthogonal to the sequence plane
  – Level of support mapped to height
• Interaction
  – 3D navigation
  – Selection of patterns to visualize
  – Sequence alignment

Sequence Walkers
• Exploratory analysis for detection of binding sites for a molecule relative to a nucleic sequence.
• Visualization
  – Similar to the sequence logos
  – Very simple concepts
  – Rich in information
  – Task-centered
  – Simple visualization because the task is clearly defined
• A single degree of interactive freedom
(Figure labels: binding site; non-binding site; favorable contacts; unfavorable contacts; DNA orientation relative to the protein; nucleotide sequence; position of the binding protein)

DNA Duplication - Dot Plots
• Classic technique [1]: align the sequence along the axes, and put a dot at the intersection of two positions if there is a match.
• Problem: plots are noisy due to the reduced alphabet.
• Solution: display a dot only if it is part of a duplication chain of length n.
• Other problems
  – The 1D characteristic of the data is not straightforward from the view.
  – The technique does not scale.
(Figures: dot plot; filtered dot plot)
[1] Gibbs and McIntyre, 1970

Arc Diagrams
• Visualization technique used to display duplication in string-like data.
• Construction
  – Sequence is displayed along the X axis
  – Arcs connect regions with duplicated sequence data
• More effective than dot plots because each duplication is shown only once
• Different representations for different user tasks
(Figure: DNA duplication inside a single sequence)

Arc Diagrams - Extended
• The Bard tool introduces adaptations specific to DNA analysis
  – Inexact matching
  – Reverse matching
  – Defines the way to represent inter-sequence duplication
• Visualization techniques
  – Aesthetics
  – Pre-attentive processing
• Figure elements
  – Four sequences of approx. 120k base pairs
  – Green arcs: possible new functional regions
  – Above the axis, arcs connect regions that are at least 80% identical
  – Below the axis, arcs connect known genes
• Interaction
  – Highlighting
  – Filtering

Medical Data Visualization
• Problem: support exploratory analysis on large amounts of clinical data.
• Data is nominal and represented as a database of property-value pairs.
• Approach: The Cube [1]
  – 3D parallel coordinates
  – A plane for each data dimension
  – Time dimension common to all planes
(Figure: a generic “cube”)
[1] Falkman, 2001

Medical Data Visualization
(Figure: cube example; axis labels: non, >10, <10, nonsymp, symp)

Browsing Image Repositories
• The image repository has more than 15 GB of data that cannot be downloaded easily.
• The user needs a way to retrieve only the images of interest.
• A visual browser which presents low-resolution images from two orthogonal perspectives is used to navigate the data set.

Visualizing Treatment Plans
• Emphasize the properties of treatment plans:
  – Hierarchical decomposition
  – Parallel execution
  – Intention-oriented
  – Have a time dimension
  – Preconditions
• Asbru and AsbruView are improvements over flowcharts
• Metaphors
• Interaction
  – Plans can be moved around
  – Plans can be hidden
  – Navigation along the time axis
[1] Kosara et al.

• Introduction
• Brief background
• Visualisation in biomedical research
  – Non-network visualisation
  – Network visualisation
• An example

Phylogenetic Trees
• Evolutionary relationships between species have a tree structure
• Usually based on DNA and/or protein sequences
• Not only sequences differ between organisms, but also biological networks

Drawing Phylogenetic Trees
• Trees vary greatly in size and need different visualization techniques
  – Small trees (hundreds of species)
  – Large trees (thousands of species)
  – Huge trees (hundreds of thousands)
• Different trees have different requirements
(Figure: small phylogenetic tree; source: wikipedia.org)

Drawing Small Phylogenetic Trees
• Requirement: the length of edges and paths should be proportional to the evolutionary distance between species
• Drawing a tree that conforms to this specification is NP-hard
• Application: Phylodraw [Choi2000]
  – Minimizes a squared distance, aiming to find a tree close to the ideal one
  – Tries to avoid overlapping (succeeds up to 100 nodes)
  – Interaction is limited to moving nodes in the tree

Drawing Large Phylogenetic Trees
• Requirement: accommodate trees with thousands of nodes
• Possible solution: use the properties of hyperbolic space to provide focus and context
  – Hypertree [Bingham 2000]
• Interaction
  – Navigation

Drawing Huge Phylogenetic Trees
• Drawing the tree of life is not trivial.
• It currently has more than 80,000 species.
• A possible approach: use 3D hyperbolic space (Walrus)
• Interaction
  – Navigation
  – Would large screen displays help?
• But what does this tree show?

Comparing Phylogenetic Trees in 2.5D

Networked Protein Analysis
• Analysis of protein interaction with network analysis
• The proteins in the set are compared one by one, and similar ones are joined by an edge.
• Resulting graph has
  – Hundreds of thousands of nodes
  – Millions of edges
• How to draw it?

LGL: Large Graph Layout
• Iterative spring layout
  – Compute the MST
  – Choose a root
  – Add the nodes in groups, based on their distance from the root, and lay out after each group
• Very large data set
  – 140,000 vertices representing proteins
• Looks better than others. Anything else?
(Figure: 30,000 vertices drawn with LGL and with a spring layout)

LGL - Using Visualization To Think
• Observation: proteins tend to organize based on their functions.
• Assumption: the function of uncharacterized proteins can be inferred from their position in the map.
• Validation by problem solving
  – 23 families are characterized in the article

Visualization: Metabolic Networks
(Figure: manually produced metabolic network; Michael, 1993)

Visualisation Requirements and Graph Drawing Solutions
• Important goals of metabolic network visualisation:
  – Understanding of the interconnections between reactions
  – Flow of substances through the network
  – Identification of main and alternative paths
• Visualisation should:
  – Be easy to understand: similar to established drawings
  – Support interactive features
• The two established methods:
  – Force-directed layout method
  – Hierarchical (Sugiyama) layout method

Many software tools are available
• VisANT: web-based, 2D
• Cytoscape: 2D
• PathBank: web-based, 3D

Network Comparison
• Biological networks differ slightly across organisms;
• Networks can be placed side by side;
• This does not scale: the more networks to be compared, the more difficult a visual comparison is.

Visual Network Comparison
• Idea: stack the networks into the 3rd dimension

Centrality in Biological Networks
Example: lethality in protein-protein interaction networks
• Degree centrality compared with the lethality of the removal of proteins (Jeong et al., 2001)
• Positive correlation between lethality and connectivity

Visualising Centralities
• The network is shown in 2D;
• Each centrality is shown as a 2.5D surface:
  – The height at a node corresponds to its centrality value
• Colors and surface textures are used to differentiate centralities.
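
The mapping from centrality to surface height can be sketched in a few lines. This is a minimal illustration and not the tool shown in the slides: the toy edge list, the protein names, and the `max_height` parameter are assumptions made here. It computes normalised degree centrality (degree divided by n - 1) and scales it so that the most central node gets the tallest peak.

```python
from collections import defaultdict

def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbours = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        neighbours[u].add(v)
        neighbours[v].add(u)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbours.items()}

def centrality_heights(centrality, max_height=10.0):
    """Scale centrality values so the most central node gets max_height."""
    top = max(centrality.values(), default=0.0)
    if top == 0.0:
        return {node: 0.0 for node in centrality}
    return {node: max_height * c / top for node, c in centrality.items()}

# Hypothetical toy interaction network (protein names are placeholders).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(edges)
heights = centrality_heights(centrality)
```

Any other centrality (betweenness, closeness, …) could be passed to `centrality_heights` in the same way, giving one surface per centrality.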
Comparison of Centralities

Better Visual Comparison

Network Motif Visualization
• Motifs
  – Frequent small sub-graphs
• Interesting motifs are found in various biological networks:
  – Gene regulatory networks
  – Metabolic networks
  – Protein-protein interaction networks
  – Neuronal networks
  – Food webs

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

Gene Ontology Network Visualization and Analysis
• Collaboration with
  – Molecular biologists from UNSW
  – Liver biologist from Usyd / Royal Prince Alfred Hospital
• Goal: understand the relationship between a group of genes and Hepatitis C Virus (HCV) infection in human liver

Workflow
• Microarray data
• Candidate genes: Gene 1, Gene 2, Gene 3, …, Gene n
• Functions of interest: Gene Ontology term 1, Gene Ontology term 2, Gene Ontology term 3, …, Gene Ontology term m
• Analysis using Gene Ontology

Microarray
• A chip with thousands of genes
  – Each dot is a gene
• Can measure the expression level of thousands of genes in one experiment
  – Gene expression level: how active the gene is
  – Previously only a limited number of genes per test
  – Speeds up experiments significantly

Identify candidate genes for a specific disease
• Usually two sets of microarray tests:
  – One with healthy tissue
  – The other with disease-infected tissue.
• Find genes that are differentially expressed:
  – Active in healthy tissue and inactive in infected tissue, or vice versa.
(Figure: candidate genes Gene 1, Gene 2, Gene 3, …, Gene n)

Gene Ontology (GO)
• Ontology: a hierarchy of concepts
  – Each parent concept is a generalization / abstraction of its children
  – Commonly used to provide a standard vocabulary
• Gene Ontology (GO): a hierarchy of concepts describing gene functions.
  – Also includes the annotations of which genes have such functions.
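
The parent/child idea behind GO can be sketched as a small data structure. This is a minimal sketch on a toy hierarchy: the term names, the `PARENTS` and `ANNOTATIONS` tables, and both helper functions are invented for illustration; they are not real GO identifiers or the API of any GO library. Because a parent generalises its children, a gene annotated with a term implicitly carries every ancestor of that term as well.

```python
# Hypothetical toy ontology: each term maps to its direct parents.
PARENTS = {
    "root": [],
    "metabolism": ["root"],
    "defence": ["root"],
    "lipid metabolism": ["metabolism"],
    "antiviral defence": ["defence"],
}

# Hypothetical direct annotations: gene -> terms it is annotated with.
ANNOTATIONS = {
    "gene1": ["lipid metabolism"],
    "gene2": ["antiviral defence", "metabolism"],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen

def all_functions(gene, annotations=ANNOTATIONS, parents=PARENTS):
    """Direct annotations of a gene plus all of their ancestors."""
    terms = set(annotations[gene])
    for term in annotations[gene]:
        terms |= ancestors(term, parents)
    return terms
```

For example, `all_functions("gene1")` contains "lipid metabolism" together with the more general "metabolism" and "root".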
Identify Gene Functions of Interest
• Functional analysis
  – Find gene functions (in GO) that may explain the relationships between the candidate genes and the disease
(Figure: functions of interest, Gene Ontology term 1, Gene Ontology term 2, Gene Ontology term 3, …, Gene Ontology term m)

Previous Approaches
• Mainly statistical analysis

Example
• A function is important if the chance of it appearing at random is very small
• Data: 20 candidate genes with 60 known functions in GO.
  – Compared against a large set of background genes
• Background genes: all the human genes (25,000).
• Test: randomly pick a group of 20 genes from the background set, and repeat a sufficiently large number of times to estimate the random chance of each of the 60 functions appearing.

Problems
• Focus on individual terms,
  – neglect the connections between them, and
  – cannot explain the functions of a group of genes.
• Provide either too specific or too general a level of biological information.
  – Depends on what is known about the specific genes

Proposed approach
• Analyzing the data in a custom-built gene-function network.
• Data used: hepatitis C virus (HCV) infection in humans.

Gene-term network
• Each gene is connected to its functions
• Also included are the connections among these terms in the GO hierarchy.
• Not showing the whole GO because it is very large:
  – Thousands of terms
(Figure: data with GO terms 1 to 4, Gene 1 and Gene 2; the gene-term network keeps GO term 3, GO term 4, Gene 1 and Gene 2)

k-level gene-term network
• Biologists also want high-level functions:
  – May be easier to discover the connections between genes
• k-level gene-term network
  – Parameter k specifies the level of abstraction:
  – Each gene is connected to the kth parent of its primary annotation.
  – Increasing the value of k results in the inclusion of higher-level terms from the GO hierarchy.
(Figure: the same data; the 2-level gene-term network keeps GO term 2, Gene 1 and Gene 2)

Gene-Term Network
• The examples are 1-level gene-term networks.
• Genes are blue, GO terms are green.
• The level of red indicates the number of connections a GO term has
• Most are connected
  – Genes have similar / related functions

Statistical analysis
• Similar to the previous approaches
• But uses the number of genes linked to a GO term as the test statistic
  – This utilizes the gene-term network structure rather than the probability of appearance.
  – This can be replaced by any other network measure, such as various centralities.

Clustering
• Hierarchical clustering is used to identify groups of genes that have related functional annotations.
• The similarity of the GO annotations is used as the distance metric between genes.
  – Gene-term matrix

        t1  t2  t3  …  tn
    g1   1   1   0  …   0
    g2   1   0   1  …   0
    …
    gm   0   1   0  …   0

Summary
• Model biological data as a network to understand its function as a group rather than as individuals:
  – From the systems biology perspective;
  – Produces results considerably different from previous methods.
• Interpret the results:
  – Working with liver biologists at Usyd and the hospital
• Visualization is mainly used as a way to present the model and results.

Another example
• Gene co-expression network from 3 different tissues
  – Each color is one tissue
  – Two genes are connected if their correlation is above a certain threshold
• The biologists were surprised by the ‘donuts’, but can’t explain them so far.
• Such structure is difficult to detect using algorithms, but straightforward with visualization.
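
The rule "two genes are connected if their correlation is above a certain threshold" can be sketched as follows. This is only an illustration under assumptions: the expression values, the gene names, the choice of Pearson correlation, and the 0.9 threshold are all made up here; the slides do not say which correlation measure or cut-off was used.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)

def coexpression_edges(expression, threshold=0.9):
    """Connect two genes when their correlation exceeds the threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        if pearson(expression[g1], expression[g2]) > threshold:
            edges.append((g1, g2))
    return edges

# Hypothetical expression levels across five samples of one tissue.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.8],  # rises together with geneA
    "geneC": [5.0, 1.0, 4.0, 2.0, 3.0],  # no clear relation to geneA
}
edges = coexpression_edges(expression)
```

Running the same function once per tissue and colouring each resulting edge set differently would give the kind of three-tissue network described on the slide.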