Biomedical Visualization Kai Xu

• Introduction
• Brief background
• Visualisation in biomedical research
• An example
• Challenges: data quality, uncertainty,
temporal / dynamic, multi-source.
Recent developments in Genomics
• The sequencing of the human genome
was completed in 2003
• Current research tries to understand
how genes work
– What does this gene do?
– Which group of genes are responsible
for this?
– How do these genes and external
environment factors cause this?
– …
• Ultimate goal: personalised medicine
based on individual DNA:
– Every patient gets the most effective
treatment based on his/her DNA.
Nature, 2001
The emergence of new research fields
• Such research requires knowledge beyond
traditional biology, and there are many new
fields:
– Bio-physics
– Bio-chemistry
– Bio-mathematics / statistics
– Bio-informatics
–…
Bioinformatics
• Also known as “computational biology”.
• Using computer science knowledge to solve
biological problems.
• Two types of research contribution:
– Computer science;
– Biology.
• Visualisation is an important part of
bioinformatics.
Concepts
• DNA
– a nucleic acid which carries genetic
instructions.
– forms a double helix. Each strand of
DNA is a linked chain of 4 types of
“bases”: A, C, G, and T.
• Genes: segments of DNA
– Only a small percentage of DNA consists of
genes;
– Spread through DNA (not continuous);
– Control the encoding of protein.
• Genome: both the genes and the noncoding sequences
• Genomics: genome research
• RNA
– Single-stranded nucleic acid
– copies genetic information from DNA
and then passes it on to proteins
• Proteins
– long chains containing 20 different
kinds of amino acids
– primary constituents of living things
Central Dogma of Molecular Biology
• The flow of genetic information
follows the path
DNA→RNA →Protein
• “DNA makes RNA, RNA
makes protein, and proteins
make us.” (F. Crick)
Systems Biology
• The study of the mechanisms underlying
complex biological processes as integrated
systems of many, diverse, interacting
components.
• Integrative approach to biology
– Interpret data within the context of other information
– Aims to merge the existing knowledge into one
consistent picture of biology across all scales
Why should we care?
• Because life is interconnected through many
different layers:
– molecules
– Cells
– Tissues / organs
– Organisms
– biological communities /
ecosystem
• Every change at one level
leads to changes in other
levels.
Modelling interconnections with networks
Networks: nodes / edges
• Molecular graphs: atoms / bonds
• Metabolic networks: metabolites / reactions
• Interaction networks: proteins / interactions
• Regulatory networks: genes / activation
• Ecological networks: species / interaction
• Evolutionary networks: species / evolution
Time and Size Scales
• Molecular graphs: 10^-14 s, 10^-9 m
• Metabolic networks: 10^-10 s, 10^-8 m
• Interaction networks: 10^-8 s, 10^-7 m
• Regulatory networks: 10^-2 s, 10^-6 m
• Ecological networks: days, 10^1 m
• Evolutionary networks: 10^6 years, 10^6 m
Challenges
• Missing data
– Networks at organ and organism level;
– Networks between the levels.
• Complexity
– Multiple networks in one level:
metabolic/interaction/regulation networks at
cellular level.
• Dynamics
– All networks are changing all the time,
responding to various internal conditions and
external stimulus;
– Evolution.
• Visualization can contribute to bioinformatics
and systems biology because it is good at:
– Utilising the supercomputing power of the human brain;
– Presenting complex data;
– Showing the overall structure;
– Providing hints and hypotheses.
• Biomedical visualization is an application of graph
visualization.
• In some cases, it only requires applying existing
visualization methods.
• Sometimes such solutions do not exist because they
are still open problems in graph visualization:
– Visualization of large and complex networks.
• Sometimes it also presents new challenges:
– Visualizing the dynamics of inter-level networks at dramatically
different scales.
• What is certain is that it is pushing the boundaries of both
areas.
• Introduction
• Brief background
• Visualisation in biomedical research
– Non-network visualisation
– Network visualisation
• An example
Sequence Alignment
• Used to study the evolution of species by analyzing protein or nucleic acid sequences.
• Multiple sequences can be aligned.
• Sequences are aligned such that corresponding positions have the same Y axis.
• The small differences between sequences are attributed to insertions or deletions.
[Figure: two aligned DNA sequences (image source: wikipedia.org)]
[1] Jalview
Sequence Alignment: Consensus
Sequences
• Consensus sequences are a
way to summarize the results
of a sequence alignment.
• Conserved sequence motifs
are probably important
sequences that remained the
same during evolution
• Does not specify the
probability of occurrence for a
letter at a given position.
[AG]TA[TA][TA][TAGC]{G}
Consensus Sequence
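The consensus notation above can be read as a pattern: [AG] means "A or G" and {G} means "any base except G". A minimal sketch, assuming that reading, translates it into a Python regular expression; the function name and the two test sequences are hypothetical and not taken from the slides.

```python
import re

def consensus_to_regex(consensus: str) -> str:
    """Translate consensus notation into a regular expression.

    [AG] ("A or G") is already valid regex syntax;
    {G} ("anything except G") becomes the negated class [^G].
    """
    return re.sub(r"\{([ACGT]+)\}", r"[^\1]", consensus)

pattern = consensus_to_regex("[AG]TA[TA][TA][TAGC]{G}")

# Hypothetical sequences, chosen only to illustrate a match and a non-match.
print(pattern)
print(bool(re.fullmatch(pattern, "ATATAGC")))  # last base is not G
print(bool(re.fullmatch(pattern, "ATATAGG")))  # last base is G
```

Note that [^G] also accepts characters that are not bases, so real input would need to be validated first.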
Sequence Alignment: Sequence Logos
• Every position summarizes the occurrences of the letters in that position
– Color: pre-attentive, reinforces the text
– Height proportional to the frequency
• Scalability is limited: dozens of positions at most.
• Validation: although the visualization is simple, it is used as a standard for representing consensus sequences in the scientific literature.
[Figure: consensus sequence [AG]TA[TA][TA][TAGC]{G} and the corresponding sequence logo]
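The letter heights of a logo can be derived from per-position letter frequencies. The sketch below computes those frequencies for a set of aligned sequences; the function name and the example alignment are hypothetical. Published logos often scale each stack by information content, whereas this sketch follows the simpler "height proportional to the frequency" description.

```python
from collections import Counter

def position_frequencies(aligned):
    """Return one dict per column mapping letter -> relative frequency."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    columns = []
    for i in range(length):
        counts = Counter(seq[i] for seq in aligned)
        columns.append({base: n / len(aligned) for base, n in counts.items()})
    return columns

# Hypothetical alignment, used only for illustration.
aligned = ["ATAT", "GTAA", "ATAT", "ATAA"]
freqs = position_frequencies(aligned)
for i, column in enumerate(freqs):
    print(i, column)
```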
Sequence Alignment: Jalview
• Automatic multiple sequence alignment can be improved by manual editing.
• Reinforcement using multiple visual expressions of the same information
– Color intensity on the sequences and consensus bar height
– Height and color intensity for the quality and conservation metrics
• Interaction consists of sequence editing (insertion and deletion)
[Figure: coupled perspectives showing the aligned sequences (hundreds of positions) and the summarized results]
Sequence Alignment: Vista
• Visualization of pre-computed multiple sequence alignments: Vista.
• Large amounts of linear data (~100k)
• Visualization information
– Colored curve (conservation metric)
– Curve only above a threshold
– Colors code types of regions (exons, introns, …)
– Annotations
– Several windows on the data
• Interaction
– Zoom
– History
[Figure: annotations (exons, UTR)]
Pattern Discovery
• Pattern discovery: detect matching sequences of amino acids or nucleic acids that correspond to a given pattern.
• Task time increases drastically with the length of the pattern. Meta-patterns can be detected visually.
Nucleic Acid Sequences
S1 = ATGGGACTCCTTCC
S2 = GATGTGATATTCTCC
Matching sequence
S1 = ATGGGACTCCTTCC
S2 = GATGTGATATTCTCC
P1 = TGxxAxT
S1 = ATGGGACTCCTTCC
S2 = GATGTGATATTCTCC
P2 = CTCC
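The example above can be reproduced with a small matcher. This is a sketch that assumes "x" in a pattern stands for any single letter; the function name is hypothetical. It reports 0-based start positions and allows overlapping matches.

```python
import re

def find_pattern(sequence, pattern, wildcard="x"):
    """Return the start indices of all (possibly overlapping) matches."""
    regex = "".join("." if ch == wildcard else re.escape(ch) for ch in pattern)
    # A zero-width lookahead lets consecutive matches overlap.
    return [m.start() for m in re.finditer(f"(?={regex})", sequence)]

S1 = "ATGGGACTCCTTCC"
S2 = "GATGTGATATTCTCC"

for name, p in [("P1", "TGxxAxT"), ("P2", "CTCC")]:
    print(name, find_pattern(S1, p), find_pattern(S2, p))
```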
Pattern Discovery: PattVision
• Sequences are aligned in the X-Y plane
• Patterns are emphasized with colors in the plane and with blending planes
– Orthogonal to the sequence plane
– Level of support mapped to height
• Interaction
– 3D navigation
– Selection of patterns to visualize
– Sequence alignment
Sequence Walkers
• Exploratory analysis for detection of binding sites for a molecule relative to a nucleic sequence.
• Visualization
– Similar to the sequence logos
– Very simple concepts
– Rich in information
– Task-centered
– Simple visualization because the task is clearly defined
• A single degree of interactive freedom
[Figure labels: nucleotide sequence; position of the binding protein; DNA orientation relative to the protein; binding site; non-binding site; favorable contacts; unfavorable contacts]
DNA Duplication - Dot Plots
• Classic technique [1]: place the sequence along both axes, and put a dot at each intersection where there is a match.
• Problem: plots are noisy due to the reduced alphabet.
• Solution: display a dot only if it is part of a duplication chain of length n.
• Other problems
– The 1D characteristic of the data is not straightforward from the view.
– The technique does not scale.
[Figure: dot plot and filtered dot plot]
[1] Gibbs and McIntyre, 1970
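The filtering idea can be sketched as follows. The sketch assumes that "a duplication chain of length n" means a diagonal run of at least n consecutive matches; the function name and the two short sequences are hypothetical.

```python
def dot_plot(a, b, n=1):
    """Return the set of (i, j) cells to draw for sequences a and b.

    A cell is kept only if it lies on a diagonal run of at least n
    consecutive matches; n=1 gives the classic unfiltered plot.
    """
    dots = set()
    for i in range(len(a)):
        for j in range(len(b)):
            # Skip mismatches and cells that are not the start of a run.
            if a[i] != b[j] or (i > 0 and j > 0 and a[i - 1] == b[j - 1]):
                continue
            length = 0
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            if length >= n:
                dots.update((i + k, j + k) for k in range(length))
    return dots

# Hypothetical sequences, used only for illustration.
a, b = "ACGA", "AAGA"
print(sorted(dot_plot(a, b, n=1)))  # every single-letter match
print(sorted(dot_plot(a, b, n=2)))  # only runs of two or more
```

With a four-letter alphabet most single-letter matches are coincidental, which is why raising n removes much of the noise.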
Arc Diagrams
• Visualization technique used to
display duplication in string-like
data.
• Construction
– Sequence is displayed along
X axis
– Arcs connect regions with
duplicated sequence data
• More effective than Dot-Plots
because duplication is shown
only once
• Different representation for
different user task
DNA Duplication inside a single sequence
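As a rough illustration of the construction, the sketch below finds exact repeats of a fixed length k and connects consecutive occurrences, so each pair of start positions corresponds to one arc. This is a simplification made for the example: the fixed length k, the function name and the sample string are assumptions, not the method from the slides.

```python
from collections import defaultdict

def arcs_for_repeats(sequence, k):
    """Return (start, next_start) pairs for repeated substrings of length k."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    arcs = []
    for starts in positions.values():
        # Connect each occurrence to the next one only, so every
        # duplication is drawn once.
        arcs.extend(zip(starts, starts[1:]))
    return sorted(arcs)

# Hypothetical sequence: "ACG" occurs at positions 0, 6 and 10.
print(arcs_for_repeats("ACGTTTACGAACG", 3))
```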
Arc Diagrams - Extended
• The Bard tool introduces adaptations specific to DNA analysis
– Inexact matching
– Reverse matching
– Defines the way to represent inter-sequence duplication
• Visualization techniques
– Esthetics
– Pre-attentive processing
• Figure elements
– Four sequences of approx. 120k base pairs
– Above the axis, arcs connect regions that are at least 80% identical
– Below the axis, arcs connect known genes
– Green arcs: possible new functional regions
• Interaction
– Highlighting
– Filtering
Medical Data Visualization
• Problem: Support exploratory
analysis on large amounts of
clinical data.
• Data is nominal and
represented as a database of
property-value pairs.
• Approach: The Cube[1]
– 3d parallel coordinates
– A plane for each data
dimension
– Time dimension common to all
planes
[1] Falkman2001
A generic “cube”
Medical Data Visualization
[Figure axis labels: non, >10, <10, nonsymp, symp]
Browsing Image Repositories
• The image repository has more
than 15GB of data that cannot
be downloaded easily.
• The user needs a way to
retrieve only the images of
interest.
• A visual browser which
presents low-resolution images
from two orthogonal
perspectives is used to
navigate the data set.
Visualizing Treatment Plans
• Emphasize the properties of treatment plans:
– Hierarchical decomposition
– Parallel execution
– Intention-oriented
– Have a time dimension
– Preconditions
• Asbru and AsbruView are improvements over flowcharts
• Metaphors
• Interaction
– Plans can be moved around
– Plans can be hidden
– Navigation along the time axis
[1] Kosara et al.
• Introduction
• Brief background
• Visualisation in biomedical research
– Non-network visualisation
– Network visualisation
• An example
Phylogenetic Trees
• Evolutionary relationships
between species have a
tree structure
• Usually based on DNA
and/or protein sequences
• Not only sequences are
different between
organisms, but also
biological networks
Drawing Phylogenetic Trees
• Trees vary much in their
dimensions and need different
visualization techniques
– Small trees (hundreds of
species)
– Large trees (thousands of
species)
– Huge trees (hundreds of
thousands)
Small phylogenetic tree
(Source: wikipedia.org)
• Different trees have
different requirements
Drawing Small Phylogenetic Trees
• Requirement: the length of edges and paths should be proportional to the evolutionary distance between species
• Drawing a tree that conforms to this specification is NP-hard
• Application: Phylodraw [Choi2000]
– Minimizes a squared distance, aiming to find a tree close to the ideal one
– Tries to avoid overlapping (succeeds up to 100 nodes)
– Interaction is limited to moving nodes in the tree
Drawing Large Phylogenetic Trees
• Requirement: Accommodate
trees with thousands of nodes
• Possible solution: Use the
properties of hyperbolic
space to provide focus and
context
– Hypertree [Bingham 2000]:
• Interaction
– Navigation
Drawing Huge Phylogenetic Trees
• Drawing the tree of life is not
trivial.
• It currently has more than
80,000 species.
• A possible approach: use 3D
hyperbolic space (Walrus)
• Interaction
– Navigation
– Would large screen displays
help?
• But what does this tree show?
Comparing Phylogenetic Trees in 2.5D
Networked Protein Analysis
• Analysis of protein interaction
with network analysis
• The proteins in the set are compared
pairwise, and similar proteins are
joined by an edge.
• Resulting graph has
– Hundreds of thousands of
nodes
– Millions of edges
• How to draw it?
LGL: Large Graph Layout
• Iterative spring layout
– Compute MST
– Choose a root
– Add the nodes in groups, based on their distance from the root, and lay out after each group
• Very large data set
– 140,000 vertices representing proteins
• Looks better than others. Anything else?
[Figure: 30,000 vertices; panels: LGL, Spring]
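The ordering steps listed for LGL (MST, root, groups by distance from the root) can be sketched as below. This is not the LGL implementation: Prim's algorithm, breadth-first grouping by number of tree edges, the function names and the toy graph are all assumptions made for illustration, and the spring layout that would run after each group is omitted.

```python
import heapq
from collections import defaultdict, deque

def minimum_spanning_tree(n, edges):
    """Prim's algorithm on a connected graph with nodes 0..n-1.

    edges is a list of (u, v, weight); returns tree adjacency lists.
    """
    adj = defaultdict(list)
    for u, v, w in edges:
        adj[u].append((w, v))
        adj[v].append((w, u))
    visited = {0}
    heap = [(w, 0, v) for w, v in adj[0]]
    heapq.heapify(heap)
    tree = defaultdict(list)
    while heap and len(visited) < n:
        w, u, v = heapq.heappop(heap)
        if v in visited:
            continue
        visited.add(v)
        tree[u].append(v)
        tree[v].append(u)
        for w2, x in adj[v]:
            if x not in visited:
                heapq.heappush(heap, (w2, v, x))
    return tree

def groups_by_tree_distance(tree, root):
    """Group nodes by their number of MST edges from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in tree[u]:
            if v not in depth:
                depth[v] = depth[u] + 1
                queue.append(v)
    groups = defaultdict(list)
    for node, d in depth.items():
        groups[d].append(node)
    return [sorted(groups[d]) for d in sorted(groups)]

# Hypothetical weighted graph with five nodes.
edges = [(0, 1, 1), (1, 2, 1), (0, 2, 5), (2, 3, 2), (3, 4, 1), (1, 4, 9)]
tree = minimum_spanning_tree(5, edges)
print(groups_by_tree_distance(tree, root=2))
```

In an incremental layout, each group in the returned list would be added to the drawing in turn and the layout refined before the next group is placed.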
LGL - Using Visualization To Think
• Observation: Proteins
tend to organize based
on their functions.
• Assumption: Function of
uncharacterized proteins
can be inferred from
their position in the map.
• Validation by problem
solving
– 23 families are
characterized in the article
Visualization: Metabolic Networks (manually produced)
Michael, 1993
Visualisation Requirements and Graph
Drawing Solutions
• Important goals of metabolic network visualisation:
– Understanding of the interconnections between reactions
– Flow of substances through the network
– Identification of main and alternative paths
• The visualisation should:
– Be easy to understand
– Be similar to established drawings
– Support interactive features
• The two established methods:
– Force-directed layout method
– Hierarchical (Sugiyama) layout method
Many software tools are available
VisANT – Web-based 2D
Cytoscape – 2D
PathBank – Web-based 3D
Network Comparison
• Biological networks differ
slightly across organisms;
• Networks can be placed
side by side;
• This does not scale: the more
networks to be compared,
the more difficult a visual
comparison is.
Visual Network Comparison
• Idea: stack the
networks into the
3rd dimension
Centrality in Biological Networks
Example: lethality in protein-protein interaction networks
• Degree centrality compared
with the lethality of the
removal of proteins (Jeong
et al., 2001)
• Positive correlation
between lethality and
connectivity
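Degree centrality is the simplest of these measures: the number of interaction partners of a protein, here normalised by the maximum possible number. A minimal sketch follows; the function name and the protein identifiers are hypothetical, and the edge list is not data from the cited study.

```python
from collections import Counter

def degree_centrality(edges):
    """Return node -> degree / (n - 1) for an undirected edge list.

    Nodes without any edge do not appear in the edge list and are
    therefore not included; at least two nodes are assumed.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    n = len(degree)
    return {node: d / (n - 1) for node, d in degree.items()}

# Hypothetical interaction network.
edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
centrality = degree_centrality(edges)
for protein in sorted(centrality, key=centrality.get, reverse=True):
    print(protein, round(centrality[protein], 3))
```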
Visualising Centralities
• The network is shown
in 2D;
• Each centrality is
shown as a 2.5D
surface:
– The height at a node
corresponds to its
centrality value
• Colors and surface
textures are used to
differentiate
centralities.
Comparison of Centralities
Better Visual Comparison
Network Motif Visualization
• Motifs
– Frequent small sub-graphs
• Interesting motifs are
found in various
biological networks:
– Gene regulatory networks
– Metabolic networks
– Protein-protein interaction
networks
– Neuronal networks
– food-webs
Gene Ontology Network Visualization and
Analysis
• Collaboration with
– Molecular biologists from UNSW
– Liver biologist from Usyd / Royal Prince Alfred hospital
• Goal: understand the relationship between a group of
genes and Hepatitis C Virus (HCV) infection in human
liver
Workflow
Microarray data → Candidate genes (Gene 1, Gene 2, Gene 3, …, Gene n) → Analysis using Gene Ontology → Functions of interest (Gene Ontology term 1, Gene Ontology term 2, Gene Ontology term 3, …, Gene Ontology term m)
Microarray
• A chip with thousands of
genes
– Each dot is a gene,
• Can measure the
expression level of
thousands of genes in
one experiment
– Gene expression level:
how active the gene is
– Previously only a limited
number of genes per test
– Speeds up experiments
significantly
Identify candidate genes for a specific disease
• Usually two sets of
microarray tests:
– One with healthy tissue
– The other with disease-infected tissue.
• Find genes that are
differentially expressed:
– Active in healthy tissue and
non-active in infected
tissue; or vice versa.
Gene 1
Gene 2
Gene 3
…
Gene n
Gene Ontology (GO)
• Ontology: a hierarchy of
concepts
– Each parent concept is a
generalization / abstraction
of its children
– Commonly used to provide
a standard vocabulary
• Gene Ontology (GO): a
hierarchy of concepts
describing gene
functions.
– Also the annotations of
which genes have such
functions.
Identify Gene Functions of Interest
• Functional analysis
– Find gene functions (in
GO) that may explain
the relationships
between the candidate
genes and the disease
Gene Ontology term 1
Gene Ontology term 2
Gene Ontology term 3
…
Gene Ontology term m
Previous Approaches
• Mainly statistical analysis
• A function is important if the random chance that it appears is very small
– Comparing against a large set of background genes
Example
• Data: 20 candidate genes with 60 known functions in GO.
• Background genes: all the human genes (25,000).
• Test: randomly pick a group of 20 genes from the background set and repeat a sufficiently large number of times to see the random chance of each of the 60 functions appearing.
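The random-sampling test can be sketched as a simple Monte Carlo estimate. The sketch assumes that "random chance" means the fraction of random gene sets containing at least as many genes with the function as the candidate set does; the function name, the gene identifiers and the smaller background of 1,000 genes are hypothetical choices to keep the example fast.

```python
import random

def empirical_p_value(candidates, background, annotated, trials=10000, seed=0):
    """Estimate P(a random set has >= as many annotated genes as the candidates)."""
    rng = random.Random(seed)
    observed = sum(1 for g in candidates if g in annotated)
    hits = 0
    for _ in range(trials):
        sample = rng.sample(background, len(candidates))
        if sum(1 for g in sample if g in annotated) >= observed:
            hits += 1
    return hits / trials

# Hypothetical data: 1,000 background genes, 50 of which carry the function.
background = list(range(1000))
annotated = set(range(50))
# 20 candidate genes, 10 of which carry the function.
candidates = list(range(10)) + list(range(500, 510))

p = empirical_p_value(candidates, background, annotated)
print(p)  # a small value suggests the function is over-represented
```

The same function would be called once per GO term, giving one estimate for each of the functions under test.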
Problems
• Focus on individual term,
– neglect the connections between them, and
– cannot explain the functions of a group of genes.
• Provide either too specific, or too general, level
of biological information.
– Depends on what is known about the specific genes
Proposed approach
• Analyzing the data in a custom-built gene-function network.
• Data used: hepatitis C virus (HCV) infection in
humans.
Gene-term network
• Each gene is connected
to its functions
• Also included are the
connections among these
terms in GO hierarchy.
• Not showing the whole
GO because it is very
large:
– Thousands of terms
[Figure: "Data" panel with GO terms 1 to 4, Gene 1 and Gene 2; "Gene-term network" panel with GO term 3, Gene 1, GO term 4 and Gene 2]
k-level gene-term network
• Biologists also want high-level functions:
– May be easier to discover the connections between genes
• k-level gene-term network
– Parameter k specifies the level of abstraction:
– Each gene is connected to the kth parent of its primary annotation.
– Increasing the value of k results in the inclusion of higher-level terms from the GO hierarchy.
[Figure: "Data" panel with GO terms 1 to 4, Gene 1 and Gene 2; "2-level gene-term network" panel with GO term 2, Gene 1 and Gene 2]
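A minimal sketch of building a k-level gene-term network is given below. It rests on several assumptions that are not stated in the slides: k = 1 keeps the direct annotation and each further level moves one parent up, every term has a single parent (the real GO is a graph in which a term can have several), and the hierarchy used here (GO term 1 above GO term 2, which is above GO terms 3 and 4) is only inferred from the figure labels.

```python
def k_level_network(annotations, parent, k):
    """Connect each gene to the term reached after k - 1 parent steps.

    annotations: gene -> primary GO term
    parent: GO term -> parent term (root terms have no entry)
    Walking stops early if the root is reached.
    """
    edges = []
    for gene, term in annotations.items():
        for _ in range(k - 1):
            if term not in parent:
                break
            term = parent[term]
        edges.append((gene, term))
    return edges

# Hypothetical hierarchy and annotations matching the figure labels.
parent = {
    "GO term 2": "GO term 1",
    "GO term 3": "GO term 2",
    "GO term 4": "GO term 2",
}
annotations = {"Gene 1": "GO term 3", "Gene 2": "GO term 4"}

print(k_level_network(annotations, parent, 1))
print(k_level_network(annotations, parent, 2))
```

At k = 2 both genes attach to the same term, which illustrates how a higher level can expose a connection between genes that is not visible at k = 1.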
Gene-Term Network
• The examples are 1-level
gene-term network.
• Genes are blue, GO
terms are green.
• The intensity of red indicates the
number of connections a
GO term has
• Most are connected
– Genes have similar /
related functions
Statistical analysis
• Similar to the previous approaches
• But use the number of genes linked to a GO
term as the test statistic
– This utilizes the gene-term network structure rather
than the probability of appearance.
– This can be replaced by any other network measure
such as various centralities.
Clustering
• Hierarchical clustering is
used to identify groups of
genes that had related
functional annotations.
• The similarity of the GO
annotations is used as
the distance metric
between genes.
– Gene-term matrix
      t1  t2  t3  …  tn
g1    1   1   0   …  0
g2    1   0   1   …  0
…
gm    0   1   0   …  0
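One way to cluster genes from such a matrix is sketched below. The slides do not say which distance or linkage was used, so the Jaccard distance, the single-linkage merging rule, the cut-off threshold, the function names and the small matrix are all assumptions made for illustration.

```python
def jaccard_distance(a, b):
    """1 - |shared terms| / |all terms| for two 0/1 annotation rows."""
    a_terms = {i for i, x in enumerate(a) if x}
    b_terms = {i for i, x in enumerate(b) if x}
    union = a_terms | b_terms
    if not union:
        return 0.0
    return 1.0 - len(a_terms & b_terms) / len(union)

def single_linkage(matrix, threshold):
    """Agglomerative clustering: merge the closest pair of clusters
    until the smallest distance exceeds the threshold."""
    clusters = [[gene] for gene in matrix]

    def dist(c1, c2):
        return min(jaccard_distance(matrix[x], matrix[y]) for x in c1 for y in c2)

    while len(clusters) > 1:
        d, i, j = min(
            (dist(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if d > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return [sorted(c) for c in clusters]

# Hypothetical gene-term matrix: rows are genes, columns are GO terms.
matrix = {
    "g1": [1, 1, 0, 0],
    "g2": [1, 1, 1, 0],
    "g3": [0, 0, 0, 1],
    "g4": [0, 0, 1, 1],
}
print(single_linkage(matrix, threshold=0.5))
```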
Summary
• Model biological data as a network to understand its
function as a group rather than as individuals:
– From the systems biology perspective;
– Produces results considerably different from previous methods.
• Interpret the results:
– Working with liver biologists at USyd and the hospital
• Visualization is mainly used as a way to present the
model and results.
Another example
• Gene co-expression network from 3 different tissues
– Each color is one tissue
– Two genes are connected if their correlation is above a certain threshold
• The biologists were surprised by the ‘donuts’, but can’t explain them so far.
• Such structure is difficult to detect using algorithms, but straightforward with visualization.
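The rule for connecting two genes can be sketched as a thresholded correlation. The sketch assumes Pearson correlation on signed values; whether the original network used absolute correlation, and which threshold, is not stated. The function names and the expression profiles are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length profiles (non-zero variance assumed)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def coexpression_edges(expression, threshold):
    """Return gene pairs whose correlation is above the threshold."""
    return [
        (g1, g2)
        for g1, g2 in combinations(sorted(expression), 2)
        if pearson(expression[g1], expression[g2]) > threshold
    ]

# Hypothetical expression profiles across four samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
}
print(coexpression_edges(expression, threshold=0.9))
```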