Mapper slides

http://www.nature.com/srep/2013/130207/srep01236/full/srep01236.html
A) Data Set
Example: Point cloud data
representing a hand.
B) Function f : Data Set → R
Ex 1: x-coordinate
f : (x, y, z) ↦ x
C) Put data into overlapping bins.
Example: f^{-1}(a_i, b_i)
( () () () () () )
D) Cluster each bin & create network.
Vertex = a cluster of a bin.
Edge = nonempty intersection between clusters
A) Data Set
Example: Point cloud data
representing a hand.
B) Function f : Data Set → R
Example: x-coordinate
f : (x, y, z) ↦ x
C) Put data into overlapping bins.
Example: f^{-1}(a_i, b_i)
D) Cluster each bin & create network.
Vertex = a cluster of a bin.
Edge = nonempty intersection between clusters
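Steps A) to D) above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the names single_linkage, mapper and eps are hypothetical, the distance is plain Euclidean, and "single linkage" is cut at a fixed height eps (points closer than eps end up in the same cluster). The data set is a small made-up example: 8 points on a circle, filtered by x-coordinate.

```python
from itertools import combinations
from math import cos, dist, pi, sin


def single_linkage(indices, points, eps):
    """Split indices into clusters: chains of points whose consecutive
    Euclidean distances are at most eps (single linkage cut at height eps)."""
    clusters = []
    unvisited = set(indices)
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = {seed}, [seed]
        while frontier:
            i = frontier.pop()
            near = {j for j in unvisited if dist(points[i], points[j]) <= eps}
            unvisited -= near
            cluster |= near
            frontier.extend(near)
        clusters.append(frozenset(cluster))
    return clusters


def mapper(points, filter_fn, intervals, eps):
    """Return (vertices, edges) of the Mapper network."""
    values = [filter_fn(p) for p in points]          # B) filter function
    vertices = []
    for a, b in intervals:
        # C) bin = preimage of the open interval (a, b)
        in_bin = [i for i, v in enumerate(values) if a < v < b]
        # D) vertex = a cluster of a bin
        vertices.extend(single_linkage(in_bin, points, eps))
    # Edge = nonempty intersection between clusters
    edges = [(u, v) for u, v in combinations(range(len(vertices)), 2)
             if vertices[u] & vertices[v]]
    return vertices, edges


# A) data set: 8 points on the unit circle (made-up example)
circle = [(cos(k * pi / 4), sin(k * pi / 4)) for k in range(8)]
intervals = [(-1.1, -0.5), (-0.8, 0.8), (0.5, 1.1)]
vertices, edges = mapper(circle, lambda p: p[0], intervals, eps=0.8)
print(len(vertices), len(edges))  # 4 4: the network is a 4-cycle
```

The middle bin splits into a top and a bottom cluster, so the network is a loop with 4 vertices and 4 edges, matching the shape of the circle.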
Note: we made many, many choices.
It helps to know what you are doing when you make choices, so collaborating with others is highly recommended.
We chose how to model the data set
A) Data Set
Example: Point cloud data
representing a hand.
Chose filter function
Function f : Data Set → R
Ex 1: x-coordinate
f : (x, y, z) ↦ x
Ex 2: y-coordinate
g : (x, y, z) ↦ y
Possible filter functions:
Principal component analysis
L-infinity centrality:
f(x) = max{d(x, p) : p in data set}
Norm: f(x) = ||x|| = length of x
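Two of the filter functions listed above are short enough to write out directly. The sketch below assumes Euclidean distance for d and uses hypothetical function names; principal component analysis is left out because it needs a linear algebra library.

```python
from math import dist, sqrt


def linf_centrality(x, data):
    """L-infinity centrality: f(x) = max{d(x, p) : p in data set}."""
    return max(dist(x, p) for p in data)


def norm(x):
    """Norm: f(x) = ||x|| = length of x."""
    return sqrt(sum(c * c for c in x))


# Made-up data: three collinear points.
data = [(0, 0), (3, 4), (6, 8)]
centrality = [linf_centrality(p, data) for p in data]
print(centrality)    # [10.0, 5.0, 10.0]
print(norm((3, 4)))  # 5.0
```

The middle point has the smallest centrality value and the two extreme points have the largest, so this filter separates central points from outlying ones.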
Chose bins
Put data into overlapping bins.
Example: f^{-1}(a_i, b_i)
If equal length intervals:
Choose length.
Choose % overlap.
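For equal-length intervals, the two choices above (length and % overlap) determine the whole cover. A minimal sketch, assuming the overlap is given as a fraction of the interval length shared with the next interval; the function name equal_cover is hypothetical.

```python
def equal_cover(lo, hi, length, overlap):
    """Equal-length intervals (a_i, b_i) that together span [lo, hi].

    overlap is the fraction (0 <= overlap < 1) of each interval that is
    shared with the next one.
    """
    if not 0 <= overlap < 1:
        raise ValueError("overlap must be in [0, 1)")
    step = length * (1 - overlap)  # distance between consecutive left endpoints
    intervals = []
    i = 0
    while True:
        a = lo + i * step
        intervals.append((a, a + length))
        if a + length >= hi:
            return intervals
        i += 1


cover = equal_cover(0.0, 10.0, length=4.0, overlap=0.5)
print(cover)  # [(0.0, 4.0), (2.0, 6.0), (4.0, 8.0), (6.0, 10.0)]
```

With 50% overlap, every interior filter value falls into two bins. If the intervals are treated as open, the extreme values lo and hi are not covered, so in practice the range would be padded slightly.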
Chose how to cluster.
Normally need a definition of distance between data points.
Cluster each bin & create network.
Vertex = a cluster of a bin.
Edge = nonempty intersection between clusters
Note: we made many, many choices
“It is useful to think of it as a camera, with lens
adjustments and other settings. A different filter
function may generate a network with a different
shape, thus allowing one to explore the data from
a different mathematical perspective.”
False positives???
False Positives will occur
False positives vs. persistence
Chose filter
Only need to cover the data points.
Application 3 (in paper): Basketball
Data: rates (per minute played) of rebounds, assists, turnovers, steals, blocked shots, personal fouls, and points scored for 452 players.
→ Input: 452 points in R^7
For each player, we have a vector
(rebounds/min, assists/min, turnovers/min, steals/min, blocked shots/min, personal fouls/min, points scored/min) = (r, a, t, s, b, f, p) in R^7
Distance: variance normalized Euclidean distance.
Clustering: Single linkage.
Filters: principal and secondary SVD values.
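The variance normalized Euclidean distance listed above can be sketched as follows. Assumptions: each coordinate's squared difference is divided by that coordinate's variance over the whole data set, and the population variance is used (the slides do not say which variance). The function name and the numbers are made up for illustration; the SVD filters are not shown.

```python
from math import sqrt
from statistics import pvariance


def variance_normalized_distance(data):
    """Return a distance d(u, v) in which every coordinate is rescaled by
    its variance across data, so no single statistic dominates."""
    variances = [pvariance(column) for column in zip(*data)]

    def d(u, v):
        return sqrt(sum((a - b) ** 2 / s
                        for a, b, s in zip(u, v, variances)))

    return d


# Made-up rows: the second coordinate lives on a 10x larger scale.
rows = [(0.0, 0.0), (2.0, 20.0), (4.0, 40.0)]
d = variance_normalized_distance(rows)
print(d(rows[0], rows[2]))  # about 3.4641 (the square root of 12)
```

After normalization both coordinates contribute equally (6 + 6 under the square root), even though their raw scales differ by a factor of 10. A coordinate with zero variance would cause a division by zero and needs to be dropped first.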
http://commons.wikimedia.org/wiki/File:SVD_Graphic_Example.png
A) Low resolution map at 20 intervals for each filter. B) High resolution map at 30 intervals for
each filter. The overlap is such that each interval overlaps with half of the adjacent
intervals, the graphs are colored by points per game, and a variance normalized Euclidean
distance metric is applied. Metric: Variance Normalized Euclidean; Lens: Principal SVD Value
(Resolution 20, Gain 2.0x, Equalized) and Secondary SVD Value (Resolution 20, Gain 2.0x,
Equalized). Color: red: high values, blue: low values.
LeBron James, Kobe Bryant, Brook Lopez
Topological Data Analysis (TDA): Three key ideas of
topology that make extracting patterns via shape
possible.
1.) coordinate free.
• No dependence on the coordinate system chosen.
• Can compare data derived from different platforms
• vital when one is studying data collected with
different technologies, or from different labs when
the methodologies cannot be standardized.
2.) invariant under “small” deformations.
• less sensitive to noise
Figure from http://comptop.stanford.edu/u/preprints/mapperPBG.pdf
Topological Methods for the Analysis of High Dimensional Data Sets and 3D Object Recognition, Singh, Mémoli, Carlsson
3.) compressed representations of shapes.
• Input: dataset with thousands of points
• Output: network with 13 vertices and 12 edges.
What graph do you get when you apply mapper to
the ideal trefoil knot?
Application 1: breast cancer gene expression
Data: microarray gene expression data from 2 data
sets, NKI and GSE2034
Distance: correlation distance
Filters: (1) L-infinity centrality:
f(x) = max{d(x, p) : p in data set}
captures the structure of the points far
removed from the center or norm.
(2) NKI: survival vs. death
GSE2034: no relapse vs. relapse
Clustering: Single linkage.
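The correlation distance listed above can be sketched as follows, assuming the common definition "1 minus the Pearson correlation" (the slides do not spell out the formula). The function name and the toy expression profiles are made up.

```python
from math import sqrt
from statistics import mean


def correlation_distance(u, v):
    """1 - Pearson correlation of u and v: 0 for perfectly correlated
    profiles, 2 for perfectly anti-correlated ones."""
    mu, mv = mean(u), mean(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = sqrt(sum((a - mu) ** 2 for a in u))
    sv = sqrt(sum((b - mv) ** 2 for b in v))
    return 1 - cov / (su * sv)


profile = [1, 2, 3]
scaled = [2, 4, 6]     # same shape, different scale
reversed_ = [3, 2, 1]  # opposite trend
print(round(correlation_distance(profile, scaled), 6))     # 0.0
print(round(correlation_distance(profile, reversed_), 6))  # 2.0
```

The distance ignores overall scale, so two patients whose expression profiles rise and fall together are close even if their absolute levels differ.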
www.nature.com/scitable/topicpage/microarray-based-comparative-genomic-hybridization-acgh-45432
Gene expression profiling predicts clinical outcome of breast cancer.
van 't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, Schreiber GJ, Kerkhoven RM, Roberts C, Linsley PS, Bernards R, Friend SH.
Nature. 2002 Jan 31;415(6871):530-6.
2 breast cancer data sets:
1.) NKI (2002):
gene expression levels of 24,000 genes from 272 tumors.
Includes node-negative and node-positive patients,
who had or had not received adjuvant systemic
therapy. Also includes survival information.
2.) GSE2034 (2005):
expression of 22,000 transcripts from total RNA of
frozen tumour samples from 286 lymph-node-negative patients who had not received adjuvant
systemic treatment. Also includes time to relapse
information.
http://bioinformatics.nki.nl/data.php
Comparison of our results with those of Van de Vijver and
colleagues is difficult because of differences in patients,
techniques, and materials used.
Their study included node-negative and node-positive patients, who had or had not received
adjuvant systemic therapy, and only women younger than 53 years.
The microarray platforms used in the studies differ: Affymetrix and Agilent.
Of the 70 genes in the study by van't Veer and co-workers, 48 are present on the Affymetrix
U133a array, whereas only 38 of our 76 genes are present on the Agilent array. There is a
three-gene overlap between the two signatures (cyclin E2, origin recognition complex, and
TNF superfamily protein).
Despite the apparent difference, both signatures included genes that identified several
common pathways that might be involved in tumour recurrence. This finding supports the idea
that although there might be redundancy in gene members, effective signatures could be
required to include representation of specific pathways.
From: Gene-expression profiles to predict distant metastasis of lymph-node-negative primary
breast cancer, Yixin Wang et al, The Lancet, Volume 365, Issue 9460, 19–25 February 2005,
Pages 671–679
Two filter functions, L-Infinity centrality and survival or relapse were used to generate the networks. The top half of panels
A and B are the networks of patients who didn't survive, the bottom half are the patients who survived. Panels C and D are
similar to panels A and B except that one of the filters is relapse instead of survival. Panels A and C are colored by the
average expression of the ESR1 gene. Panels B and D are colored by the average expression of the genes in the KEGG
chemokine pathway. Metric: Correlation; Lens: L-Infinity Centrality (Resolution 70, Gain 3.0x, Equalized) and Event Death (Resolution 30, Gain 3.0x). Color bar: red: high values, blue: low values.
Identifying subtypes of cancer in a consistent manner is a challenge in the
field, since sub-populations can be small and their relationships complex.
High expression level of the estrogen receptor gene (ESR1) is positively
correlated with improved prognosis, given that this set of patients is likely to
respond to standard therapies.
• But there are still sub-groups of high ESR1 that do not respond well to
therapy.
Low ESR1 levels are strongly correlated with poor prognosis.
• But there are patients with low ESR1 levels but high survival rates.
Many molecular sub-groups have been identified.
• But it is often difficult to identify the same sub-group in a broader setting,
where data sets are generated on different platforms, on different sets of
patients and at different times, because of the noise and complexity in
the data.
Highlighted in red are the lowERNS (top panel) and the lowERHS (bottom panel) patient subgroups.
http://www.pnas.org/content/early/2011/04/07/1102826108
DSGA decomposition of the original tumor vector into the Normal component (its linear-model
fit onto the Healthy State Model) and the Disease component (the vector of residuals).
Nicolau M et al. PNAS 2011;108:7265-7270
©2011 by National Academy of Sciences
PAD analysis of the NKI data. The
output has three progression arms,
because tumors (data points) are
ordered by the magnitude of deviation
from normal (the HSM). Each bin is
colored by the mean of the filter map
on the points. Blue bins contain
tumors whose total deviation from
HSM is small (normal and Normal-like
tumors). Red bins contain tumors
whose deviation from HSM is large.
The image of f was subdivided into 15
intervals with 80% overlap. All bins are
seen (outliers included). Regions of
sparse data show branching. Several
bins are disconnected from the main
graph. The ER− arm consists mostly
of Basal tumors. The c-MYB+ group
was chosen within the ER+ arm as the
tightest subset, between the two
sparse regions.
http://scitation.aip.org/content/aip/journal/jcp/130/14/10.1063/1.3103496
Data: Contact maps from 2,800 Serial Replica
Exchange Molecular Dynamics (SREMD) simulations
of the GCAA tetraloop on the Folding@home
distributed computing platform.
• 760 trajectories with a complete unfolding event
• 550 trajectories with a complete refolding event.
Goal: To determine secondary structure pathways between folded and unfolded states.
Problem: Many more folded and unfolded conformations than intermediate conformations.
How to distinguish intermediate conformations from noise?
Solution
Choose f: space of conformations → R
f(conformation) = density
550 trajectories with a complete refolding event: 2952 configurations
Distance = Hamming distance
760 trajectories with a complete unfolding event: 4330 configurations
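A density filter over contact maps with Hamming distance can be sketched as follows. The density estimate here is an assumption chosen for simplicity (the number of configurations within a fixed Hamming radius); the slides do not say which estimator the study used. Contact maps are written as made-up 0/1 strings, and the function names are hypothetical.

```python
def hamming(u, v):
    """Number of positions at which two equal-length contact maps differ."""
    return sum(a != b for a, b in zip(u, v))


def density(x, data, radius):
    """Count the configurations in data within Hamming distance radius of x
    (x itself is counted when it belongs to data)."""
    return sum(hamming(x, p) <= radius for p in data)


# Made-up contact maps: many folded and unfolded, one intermediate.
folded = ["111111", "111110", "111101"]
unfolded = ["000000", "000001", "000010"]
intermediate = ["111000"]
configs = folded + unfolded + intermediate

densities = [density(c, configs, radius=1) for c in configs]
print(densities)  # [3, 2, 2, 3, 2, 2, 1]
```

The intermediate configuration gets the lowest density value, so a filter of this kind places the rare intermediate states in their own bins instead of letting them be swamped by the folded and unfolded states.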
Mapper applied to SNP data:
An eQTL biological data visualization challenge and approaches from the visualization community, Bartlett et al., BMC Bioinformatics 2012, 13(Suppl 8):S8
http://www.biomedcentral.com/1471-2105/13/S8/S8
http://www.ayasdi.com/
From: http://www.ayasdi.com/company/media/seriesc/
Customers Discover Breakthrough Insights That Deliver
Significant ROI
Ayasdi currently does business with three of the five largest
financial institutions in the world. The company’s financial
services customers use Ayasdi to enhance their consumer
credit decisions, accelerate the development and accuracy of
risk and regulatory models, optimize services for private
banking clients, and improve fraud detection models. In each
case, financial institutions are finding critical insights and
patterns that return hundreds of millions of dollars in ROI.
From: http://www.ayasdi.com/company/media/seriesc/
Ayasdi expanded its healthcare portfolio with client wins in both the provider
and payer space. Healthcare service providers rely upon Ayasdi to analyze
extensive clinical records to identify best practices that will improve both
patient and financial outcomes. Ayasdi can analyze a broad set of data types
(e.g. lab tests, pharmaceuticals, patient records and billing information) to
find important patterns that reduce costs while delivering a higher quality of
care. Additionally, healthcare payers are leveraging their data to improve
their revenue cycle through better identification of fraud, waste and abuse
within their claims processing operations.
"Our mission is to detect and help diagnose disabling spinal cord and brain
injuries, that would not necessarily show up on a standard MRI," said Adam
Ferguson, Assistant Professor and Wings for Life and GE-NFL Head-Health
challenge Principal Investigator, Brain and Spinal Injury Center at the
University of California San Francisco Medical Center. "By analyzing brain
image scans and clinical information, Ayasdi’s intelligent analytics software
helps us quickly find critical patterns in our data that would be extremely
difficult to identify with conventional statistical tools."