Biomedical Visualization Kai Xu

• Introduction
• Brief background
• Visualisation in biomedical research
• An example
• Challenges: data quality, uncertainty, temporal / dynamic, multi-source.

Recent developments in Genomics
• The sequencing of the human genome was completed in 2003.
• Current research tries to understand how genes work:
  – What does this gene do?
  – Which group of genes is responsible for this?
  – How do these genes and external environmental factors cause this?
  – …
• Ultimate goal: personalised medicine based on individual DNA:
  – Every patient gets the most effective treatment based on his/her DNA.
[Figure: Nature, 2001]

The emergence of new research fields
• Such research requires knowledge beyond traditional biology, and there are many new fields:
  – Bio-physics
  – Bio-chemistry
  – Bio-mathematics / statistics
  – Bio-informatics
  – …

Bioinformatics
• Also known as “computational biology”.
• Uses computer science knowledge to solve biological problems.
• Two types of research contribution:
  – Computer science;
  – Biology.
• Visualisation is an important part of bioinformatics.

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

Concepts
• DNA
  – A nucleic acid which carries genetic instructions.
  – Forms a double helix. Each strand of DNA is a linked chain of 4 types of “bases”: A, C, G, and T.
• Genes: segments of DNA
  – Only a small percentage of DNA consists of genes;
  – Spread through the DNA (not continuous);
  – Control the encoding of proteins.
• Genome: both the genes and the noncoding sequences
• Genomics: genome research
• RNA
  – Single-stranded nucleic acid
  – Copies genetic information from DNA and passes it on to proteins
• Proteins
  – Long chains containing 20 different kinds of amino acids
  – Primary constituents of living things

Central Dogma of Molecular Biology
• The flow of genetic information follows the path DNA → RNA → Protein
• “DNA makes RNA, RNA makes protein, and proteins make us.” (F. Crick)

Systems Biology
• The study of the mechanisms underlying complex biological processes as integrated systems of many, diverse, interacting components.
• Integrative approach to biology
  – Interprets data within the context of other information
  – Aims to merge the existing knowledge into one consistent picture of biology across all scales

Why should we care?
• Because life is interconnected through many different layers:
  – Molecules
  – Cells
  – Tissues / organs
  – Organisms
  – Biological communities / ecosystems
• Every change at one level leads to changes at other levels.

Model interconnections with networks
Networks                 Nodes / Edges
Molecular graphs         Atoms / Bonds
Metabolic networks       Metabolites / Reactions
Interaction networks     Proteins / Interactions
Regulatory networks      Genes / Activation
Ecological networks      Species / Interaction
Evolutionary networks    Species / Evolution

Time and Size Scales
• Molecular graphs: 10^-14 s, 10^-9 m
• Metabolic networks: 10^-10 s, 10^-8 m
• Interaction networks: 10^-8 s, 10^-7 m
• Regulatory networks: 10^-2 s, 10^-6 m
• Ecological networks: days, 10^1 m
• Evolutionary networks: 10^6 years, 10^6 m

Challenges
• Missing data
  – Networks at organ and organism level;
  – Networks between the levels.
• Complexity
  – Multiple networks in one level: metabolic / interaction / regulation networks at the cellular level.
• Dynamics
  – All networks are changing all the time, responding to various internal conditions and external stimuli;
  – Evolution.

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

• Visualization can contribute to bioinformatics and systems biology, because it is good at:
  – Utilising the super computing power of the human brain;
  – Presenting complex data;
  – Showing the overall structure;
  – Providing hints and hypotheses.
• Biomedical visualization is an application of graph visualization.
• In some cases, it only requires applying an existing visualization method.
• Sometimes, such solutions may not exist because they are still open problems for graph visualization:
  – Visualization of large and complex networks.
• Sometimes, it also presents new challenges:
  – Visualizing the dynamics of inter-level networks at dramatically different scales.
• What is certain is that it is pushing the boundaries of both areas.

• Introduction
• Brief background
• Visualisation in biomedical research
  – Non-network visualisation
  – Network visualisation
• An example

Sequence Alignment
• Used to study the evolution of species by analyzing protein or nucleic acid sequences.
• Multiple sequences can be aligned.
• Sequences are aligned such that corresponding positions have the same Y axis.
• The small differences between sequences are attributed to insertions or deletions.
[Figure: Two aligned DNA sequences (image source: wikipedia.org)]
[1] Jalview

Sequence Alignment: Consensus Sequences
• Consensus sequences are a way to summarize the results of a sequence alignment.
• Conserved sequence motifs are probably important sequences that remained the same during evolution.
• A consensus sequence does not specify the probability of occurrence of a letter at a given position.
[Figure: Consensus sequence [AG]TA[TA][TA][TAGC]{G}]

Sequence Alignment: Sequence Logos
• Every position summarizes the occurrences of the letters in that position
  – Color: pre-attentive, reinforces the text
  – Height proportional to the frequency
• Scalability is limited: at most dozens of positions.
• Validation: although the visualization is simple, it is used as a standard for representing consensus sequences in the scientific literature.
[Figure: Consensus sequence [AG]TA[TA][TA][TAGC]{G} and its sequence logo]

Sequence Alignment: Jalview
[Figure label: aligned sequences]
• Automatic multiple sequence alignment can be improved by manual editing.
• Reinforcement using multiple visual expressions of the same information
  – Color intensity on the sequences and consensus bar height
  – Height and color intensity for the quality and conservation metrics
• Interaction consists of sequence editing (insertion and deletion)
[Figure labels: hundreds of positions; coupled perspectives; summarized results]

Sequence Alignment: Vista
• Visualization of pre-computed multiple sequence alignments: Vista.
  – Large amounts of linear data (~100k)
• Colored curve (conservation metric)
  – Curve only above a threshold
  – Colors code types of regions (exons, introns, …)
• Visualization information
  – Colored curve (conservation metric)
  – Annotations
  – Several windows on the data
• Interaction
  – Zoom
  – History
[Figure label: Annotations (exons, UTR)]

Pattern Discovery
• Pattern discovery: detect matching sequences of amino acids or nucleic acids that correspond to a given pattern.
• Task time increases drastically with the length of the pattern.
• Metapatterns can be detected visually.
Nucleic acid sequences:
  S1 = ATGGGACTCCTTCC
  S2 = GATGTGATATTCTCC
Matching sequence:
  P1 = TGxxAxT
  P2 = CTCC

Pattern Discovery: PattVision
• Sequences are aligned in the X-Y plane
• Patterns are emphasized with colors in the plane and with blending planes
  – Orthogonal to the sequence plane
  – Level of support mapped to height
• Interaction
  – 3D navigation
  – Selection of patterns to visualize
  – Sequence alignment

Sequence Walkers
• Exploratory analysis for detection of binding sites for a molecule relative to a nucleic sequence.
• Visualization
  – Similar to the sequence logos
  – Very simple concepts
  – Rich in information
  – Task-centered
  – Simple visualization because the task is clearly defined
• A single degree of interactive freedom
[Figure labels: binding site; non-binding site; favorable contacts; unfavorable contacts; DNA orientation relative to the protein; nucleotide sequence; position of the binding protein]

DNA Duplication - Dot Plots
• Classic technique [1]: place the sequence along both axes, and put a dot at each intersection where there is a match.
• Problem: plots are noisy due to the reduced alphabet.
• Solution: display a dot only if it is part of a duplication chain of length n.
• Other problems
  – The 1D characteristic of the data is not obvious from the view.
  – The technique does not scale.
[Figures: Dot plot; Filtered dot plot]
[1] Gibbs and McIntyre, 1970

Arc Diagrams
• Visualization technique used to display duplication in string-like data.
• Construction
  – The sequence is displayed along the X axis
  – Arcs connect regions with duplicated sequence data
• More effective than dot plots because each duplication is shown only once
• Different representations for different user tasks
[Figure: DNA duplication inside a single sequence]

Arc Diagrams - Extended
• The Bard tool introduces adaptations specific to DNA analysis
  – Inexact matching
  – Reverse matching
  – Defines the way to represent inter-sequence duplication
• Visualization techniques
  – Aesthetics
  – Pre-attentive processing
• Figure elements
  – Four sequences of approx. 120k base pairs
  – Green arcs: possible new functional regions
• Interaction
  – Highlighting
  – Filtering
[Figure labels: above the axis, arcs connect regions that are at least 80% identical; below the axis, arcs connect known genes]

Medical Data Visualization
• Problem: support exploratory analysis on large amounts of clinical data.
• Data is nominal and represented as a database of property-value pairs.
• Approach: The Cube [1]
  – 3D parallel coordinates
  – A plane for each data dimension
  – Time dimension common to all planes
[Figure: A generic “cube”]
[1] Falkman 2001

Medical Data Visualization
[Figure: cube view; axis labels: non, >10, <10, nonsymp, symp]

Browsing Image Repositories
• The image repository has more than 15 GB of data that cannot be downloaded easily.
• The user needs a way to retrieve only the images of interest.
• A visual browser which presents low-resolution images from two orthogonal perspectives is used to navigate the data set.

Visualizing Treatment Plans
• Emphasize the properties of treatment plans:
  – Hierarchical decomposition
  – Parallel execution
  – Intention-oriented
  – Have a time dimension
  – Preconditions
• Asbru and AsbruView are improvements over flowcharts
• Metaphors
• Interaction
  – Plans can be moved around
  – Plans can be hidden
  – Navigation along the time axis
[1] Kosara et al.

• Introduction
• Brief background
• Visualisation in biomedical research
  – Non-network visualisation
  – Network visualisation
• An example

Phylogenetic Trees
• Evolutionary relationships between species have a tree structure
• Usually based on DNA and/or protein sequences
• Not only sequences differ between organisms, but also biological networks

Drawing Phylogenetic Trees
• Trees vary greatly in size and need different visualization techniques
  – Small trees (hundreds of species)
  – Large trees (thousands of species)
  – Huge trees (hundreds of thousands)
• Different trees have different requirements
[Figure: Small phylogenetic tree (source: wikipedia.org)]

Drawing Small Phylogenetic Trees
• Requirement: the length of edges and paths should be proportional to the evolutionary distance between species
• Drawing a tree that conforms to this specification is NP-hard
• Application: Phylodraw [Choi2000]
  – Minimizes a squared distance, aiming to find a tree close to the ideal one
  – Tries to avoid overlapping (succeeds up to 100 nodes)
  – Interaction is limited to moving nodes in the tree

Drawing Large Phylogenetic Trees
• Requirement: accommodate trees with thousands of nodes
• Possible solution: use the properties of hyperbolic space to provide focus and context
  – Hypertree [Bingham 2000]
• Interaction
  – Navigation

Drawing Huge Phylogenetic Trees
• Drawing the tree of life is not trivial.
• It currently has more than 80,000 species.
• A possible approach: use 3D hyperbolic space (Walrus)
• Interaction
  – Navigation
  – Would large screen displays help?
• But what does this tree show?

Comparing Phylogenetic Trees in 2.5D

Networked Protein Analysis
• Analysis of protein interaction with network analysis
• The proteins in the set are compared pairwise, and similar ones are joined by an edge.
• The resulting graph has
  – Hundreds of thousands of nodes
  – Millions of edges
• How to draw it?

LGL: Large Graph Layout
• Iterative spring layout
  – Compute the MST
  – Choose a root
  – Add the nodes in groups, based on their distance from the root, and lay out after each group
• Very large data set
  – 140,000 vertices representing proteins
• Looks better than others. Anything else?
[Figure labels: 30,000 vertices; LGL; Spring]

LGL - Using Visualization To Think
• Observation: proteins tend to organize based on their functions.
• Assumption: the function of uncharacterized proteins can be inferred from their position in the map.
• Validation by problem solving
  – 23 families are characterized in the article

Visualization: Metabolic Networks
[Figure: manually produced metabolic network drawing (Michael, 1993)]

Visualisation Requirements and Graph Drawing Solutions
• Important goals of metabolic network visualisation:
  – Understanding of the interconnections between reactions
  – Flow of substances through the network
  – Identification of main and alternative paths
• The visualisation should:
  – Be easy to understand
  – Be similar to established drawings
  – Support interactive features
• The two established methods:
  – Force-directed layout method
  – Hierarchical (Sugiyama) layout method

Many software tools are available
• VisANT: web-based, 2D
• Cytoscape: 2D
• PathBank: web-based, 3D

Network Comparison
• Biological networks differ slightly across organisms;
• Networks can be placed side by side;
• This does not scale: the more networks to be compared, the more difficult a visual comparison is.

Visual Network Comparison
• Idea: stack the networks into the 3rd dimension

Centrality in Biological Networks
Example: lethality in protein-protein interaction networks
• Degree centrality compared with the lethality of the removal of proteins (Jeong et al., 2001)
• Positive correlation between lethality and connectivity

Visualising Centralities
• The network is shown in 2D;
• Each centrality is shown as a 2.5D surface:
  – The height at a node corresponds to its centrality value
• Colors and surface textures are used to differentiate centralities.
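
The degree centrality mentioned above, and its mapping to a surface height, can be sketched roughly as follows. This is only an illustrative sketch: the protein names, the toy edge list, and the height scale factor are assumptions, not data from the slides, and the normalisation by n - 1 is one common convention rather than necessarily the one used in the cited work.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected, simple graph.

    Only nodes that appear in at least one edge are included.
    """
    neighbours = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours[a].add(b)
        neighbours[b].add(a)
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Hypothetical toy interaction network (protein names are made up).
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(edges)

# Map each centrality value to a height for a 2.5D surface (arbitrary scale).
HEIGHT_SCALE = 10.0
heights = {node: HEIGHT_SCALE * value for node, value in centrality.items()}

for node in sorted(centrality):
    print(node, round(centrality[node], 3), round(heights[node], 2))
```

Any other centrality (betweenness, closeness, and so on) could be substituted for `degree_centrality` and drawn as a second surface with a different color or texture.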
Comparison of Centralities

Better Visual Comparison

Network Motif Visualization
• Motifs
  – Frequent small subgraphs
• Interesting motifs are found in various biological networks:
  – Gene regulatory networks
  – Metabolic networks
  – Protein-protein interaction networks
  – Neuronal networks
  – Food webs

• Introduction
• Brief background
• Visualisation in biomedical research
• An example

Gene Ontology Network Visualization and Analysis
• Collaboration with
  – Molecular biologists from UNSW
  – Liver biologist from Usyd / Royal Prince Alfred hospital
• Goal: understand the relationship between a group of genes and Hepatitis C Virus (HCV) infection in human liver

Workflow
[Figure: Microarray data → Candidate genes (Gene 1, Gene 2, Gene 3, …, Gene n) → Analysis using Gene Ontology → Functions of interest (Gene Ontology term 1, Gene Ontology term 2, Gene Ontology term 3, …, Gene Ontology term m)]

Microarray
• A chip with thousands of genes
  – Each dot is a gene
• Can measure the expression level of thousands of genes in one experiment
  – Gene expression level: how active the gene is
  – Previously only a limited number of genes could be measured per test
  – Speeds up experiments significantly

Identify candidate genes for a specific disease
• Usually two sets of microarray tests:
  – One with healthy tissue
  – The other with disease-infected tissue.
• Find genes that are differentially expressed:
  – Active in healthy tissue and non-active in infected tissue, or vice versa.
[Figure: candidate genes Gene 1, Gene 2, Gene 3, …, Gene n]

Gene Ontology (GO)
• Ontology: a hierarchy of concepts
  – Each parent concept is a generalization / abstraction of its children
  – Commonly used to provide a standard vocabulary
• Gene Ontology (GO): a hierarchy of concepts describing gene functions.
  – Also includes the annotations of which genes have such functions.
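
The two parts of GO described above (a hierarchy of terms, plus gene annotations) can be sketched as two small mappings. The term and gene names below are hypothetical placeholders, not real GO identifiers, and the sketch assumes that a gene annotated with a term also carries every more general term above it.

```python
# Hypothetical toy ontology: each term maps to its parent terms
# (the more general concepts). Names are placeholders, not real GO IDs.
PARENTS = {
    "term_root": [],
    "term_a": ["term_root"],
    "term_b": ["term_a"],
    "term_c": ["term_a"],
}

# Hypothetical annotations: which genes are annotated with which terms.
ANNOTATIONS = {
    "gene_1": ["term_b"],
    "gene_2": ["term_c"],
}


def ancestors(term, parents=PARENTS):
    """Return every term that generalizes `term` (parents, grandparents, ...)."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def functions_of(gene, annotations=ANNOTATIONS, parents=PARENTS):
    """Direct annotations of a gene plus all more general terms above them."""
    terms = set(annotations.get(gene, []))
    for term in list(terms):
        terms |= ancestors(term, parents)
    return terms


print(sorted(functions_of("gene_1")))
print(sorted(functions_of("gene_1") & functions_of("gene_2")))
```

The last line shows why the hierarchy matters for the analysis: two genes with different direct annotations can still share more general functions.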
Identify Gene Functions of Interest
• Functional analysis
  – Find gene functions (in GO) that may explain the relationships between the candidate genes and the disease
[Figure: Gene Ontology term 1, Gene Ontology term 2, Gene Ontology term 3, …, Gene Ontology term m]

Previous Approaches
• Mainly statistical analysis

Example
• A function is important if the chance that it appears at random is very small
• Data: 20 candidate genes with 60 known functions in GO.
  – Compared against a large set of background genes
• Background genes: all the human genes (25,000).
• Test: randomly pick a group of 20 genes from the background set, and repeat a sufficiently large number of times to estimate the chance that each of the 60 functions appears at random.

Problems
• Focus on individual terms,
  – neglect the connections between them, and
  – cannot explain the functions of a group of genes.
• Provide either too specific or too general a level of biological information.
  – Depends on what is known about the specific genes

Proposed approach
• Analyze the data in a custom-built gene-function network.
• Data used: hepatitis C virus (HCV) infection in humans.

Gene-term network
• Each gene is connected to its functions
• Also included are the connections among these terms in the GO hierarchy.
• The whole GO is not shown because it is very large:
  – Thousands of terms
[Figure: Data (GO term 1, GO term 2, GO term 3, GO term 4, Gene 1, Gene 2) and the resulting gene-term network (GO term 3, Gene 1, GO term 4, Gene 2)]

k-level gene-term network
• Biologists also want high-level functions:
  – May be easier to discover the connections between genes
• k-level gene-term network
  – The parameter k specifies the level of abstraction:
  – Each gene is connected to the kth parent of its primary annotation.
  – Increasing the value of k results in the inclusion of higher-level terms from the GO hierarchy.
[Figure: Data (GO term 1, GO term 2, GO term 3, GO term 4, Gene 1, Gene 2) and the 2-level gene-term network (GO term 2, Gene 1, Gene 2)]

Gene-Term Network
• The examples are 1-level gene-term networks.
• Genes are blue, GO terms are green.
• The intensity of red indicates the number of connections a GO term has
• Most are connected
  – Genes have similar / related functions

Statistical analysis
• Similar to the previous approaches
• But uses the number of genes linked to a GO term as the test statistic
  – This utilizes the gene-term network structure rather than the probability of appearance.
  – This can be replaced by any other network measure, such as various centralities.

Clustering
• Hierarchical clustering is used to identify groups of genes that have related functional annotations.
• The similarity of the GO annotations is used as the distance metric between genes.
  – Gene-term matrix:

        t1   t2   t3   …   tn
  g1     1    1    0   …    0
  g2     1    0    1   …    0
  …
  gm     0    1    0   …

Summary
• Model biological data as a network to understand its function as a group rather than as individuals:
  – From the systems biology perspective;
  – Produces results considerably different from previous methods.
• Interpret the results:
  – Working with liver biologists at Usyd and the hospital
• Visualization is mainly used as a way to present the model and results.

Another example
• Gene co-expression network from 3 different tissues
  – Each color is one tissue
  – Two genes are connected if their correlation is above a certain threshold
• The biologists were surprised by the ‘donuts’, but cannot explain them so far.
• Such structure is difficult to detect using algorithms, but straightforward with visualization.
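
The construction rule for the co-expression network above (connect two genes when their correlation exceeds a threshold) can be sketched as follows. This is a minimal sketch under stated assumptions: the slides do not say which correlation measure or threshold was used, so Pearson correlation and the value 0.9 are placeholders, and the expression profiles are made-up numbers for a single tissue.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant profile: treat as uncorrelated
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return (gene, gene, r) for every pair whose correlation exceeds threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if r > threshold:
            edges.append((g1, g2, r))
    return edges


# Hypothetical expression profiles for one tissue (values are made up).
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [2.0, 4.1, 6.0, 8.2],
    "gene_c": [4.0, 1.0, 3.5, 0.5],
}
edges = coexpression_edges(expression, threshold=0.9)
print([(g1, g2) for g1, g2, _ in edges])
```

Running the same function once per tissue and tagging each edge list with a tissue label would give the per-tissue coloring described in the slide; the layout that reveals the ‘donuts’ is a separate step not covered here.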
