
Dr. Mahout:
Analyzing clinical data using scalable
and distributed computing
Shannon Quinn
CPCB
squinn@cmu.edu | spq1@pitt.edu
November 10, 2011
1/29
Punchline
 Cloud computing for biological and
clinical data analysis
 Problem: high-dimensional, noisy!
tech2date.com
Heart tissue: biomedcentral
fMRI: wikipedia
segmentation: biodynamics UCSD
2/29
Disclaimer
 Biology jargon
 Academic jargon
3/29
My Background
 2nd year Ph.D. student in CPCB Program
 Research in bioimage informatics
4/29
My Background
 Other
http://collegefootballbelt.com/Logos/
http://s3.amazonaws.com/data.tumblr.com/
5/29
Computational biology and …the
cloud?
 Biological data
• is BIG
• requires repetitive analysis in chunks
• modeling involves linear algebra and
statistics
6/29
Use case 1: protein behavior
[Figure: timescale of relevant motions, axis ticks 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^0; labeled motions, fastest to slowest: bond vibration, side-chain rotation, domain shifts / max. catalysis, global conformational shifts, protein folding]
Sampling vs. detail: a common tradeoff…
7/29
Molecular dynamics
8/29
“The curse of [MD] dimensionality”





MD := F = ma
for every atom
for every t
…
http://icanhascheezburger.files.wordpress.com/
http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
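To make "F = ma for every atom, for every t" concrete, here is a minimal sketch in plain Python. It is not taken from any MD package: the velocity-Verlet integrator, the 1-D "atoms", and the spring force field (`spring_forces`) are all illustrative assumptions standing in for a real force field.

```python
def simulate(positions, velocities, masses, force_fn, dt, n_steps):
    """Velocity-Verlet integration: apply F = ma to every atom at every step."""
    x = list(positions)
    v = list(velocities)
    f = force_fn(x)
    trajectory = [list(x)]
    for _ in range(n_steps):                    # for every t
        for i in range(len(x)):                 # for every atom
            a = f[i] / masses[i]                # a = F / m
            x[i] += v[i] * dt + 0.5 * a * dt * dt
        f_new = force_fn(x)
        for i in range(len(x)):
            v[i] += 0.5 * (f[i] + f_new[i]) / masses[i] * dt
        f = f_new
        trajectory.append(list(x))
    return trajectory


def spring_forces(x, k=1.0):
    """Toy force field (assumption): each 1-D atom is tied to the origin by a spring."""
    return [-k * xi for xi in x]


# Two toy atoms, 1000 time steps: the output already holds 1001 x 2 numbers.
traj = simulate([1.0, -0.5], [0.0, 0.0], [1.0, 2.0], spring_forces,
                dt=0.01, n_steps=1000)
```

The trajectory grows as (number of steps) x (number of atoms) x (spatial dimensions), which is where the dimensionality problem comes from once a real protein with thousands of atoms is simulated.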
9/29
Pipeline for MD trajectory analysis
 Find a “surface” of
protein shapes
1. MD output
2. Define surface
(graph!)
3. Partition surface
http://www.dillgroup.ucsf.edu/
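The three pipeline steps can be sketched on a toy scale as follows. This is only an illustration in plain Python, not the Mahout implementation: the Gaussian similarity, the `sigma` parameter, the 2-D "conformations", and the two-way split by the sign of the second eigenvector are all assumptions chosen to keep the example short.

```python
import math


def affinity_graph(frames, sigma=1.0):
    """Step 2: link every pair of conformations with a Gaussian similarity weight."""
    n = len(frames)
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(frames[i], frames[j]))
            W[i][j] = math.exp(-d2 / (2.0 * sigma ** 2))
    return W


def bipartition(W, n_iter=500):
    """Step 3: split the graph by the sign of the 2nd eigenvector of D^-1/2 W D^-1/2."""
    n = len(W)
    deg = [sum(row) for row in W]
    N = [[W[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    # The leading eigenvector is known in closed form: sqrt(degree), normalised.
    total = math.sqrt(sum(deg))
    top = [math.sqrt(d) / total for d in deg]
    v = [float(i + 1) for i in range(n)]        # deterministic start vector
    for _ in range(n_iter):
        # Remove the leading eigenvector, then take one power-iteration step.
        proj = sum(a * b for a, b in zip(v, top))
        v = [a - proj * b for a, b in zip(v, top)]
        v = [sum(N[i][j] * v[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(a * a for a in v))
        v = [a / length for a in v]
    return [0 if a < 0 else 1 for a in v]


# Step 1 stand-in: six toy "conformations" forming two well-separated groups.
frames = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
          [3.0, 3.0], [3.1, 3.0], [3.0, 3.1]]
labels = bipartition(affinity_graph(frames))
```

The first three frames end up with one label and the last three with the other; which group gets 0 and which gets 1 is arbitrary.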
10/29
Mahout implementation
Defining surface/graph:
MatrixMultiplicationJob (matrixmult)
TransposeJob (transpose)
DistributedLanczosSolver (svd)
StochasticSVD (ssvd)
Partitioning surface/graph:
SpectralKMeans (spectralkmeans)
Eigencuts (eigencuts)
Kmeans (kmeans)
. . .
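As a rough illustration of why matrix multiplication and transpose are the building blocks here, the sketch below finds the leading singular value of a small matrix using only those two operations. It is plain Python, not Mahout code, and it uses simple power iteration on A^T A as a stand-in; the Lanczos and stochastic SVD solvers named above are more sophisticated methods.

```python
import math


def transpose(A):
    """Swap rows and columns of a list-of-lists matrix."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Plain matrix product of two list-of-lists matrices."""
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def top_singular_pair(A, n_iter=200):
    """Power iteration on A^T A, built only from matmul and transpose."""
    AtA = matmul(transpose(A), A)
    v = [[1.0] for _ in range(len(AtA))]        # column vector
    for _ in range(n_iter):
        v = matmul(AtA, v)
        norm = math.sqrt(sum(x[0] ** 2 for x in v))
        v = [[x[0] / norm] for x in v]
    Av = matmul(A, v)
    sigma = math.sqrt(sum(x[0] ** 2 for x in Av))
    return sigma, [x[0] for x in v]


sigma, v = top_singular_pair([[1.0, 2.0], [3.0, 4.0]])
```

Each iteration is one matrix product, which is the kind of step that can be distributed across a cluster when the matrix is too large for one machine.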
11/29
MD in Mahout conclusion
 MD simulations
(x@Home
projects)
http://folding.stanford.edu/
 Existing Mahout
functionality
 Additional
algorithms
12/29
Use case 2: diseases affecting cilia
 What are cilia?
• Hairlike structures
• Keep things
moving
• Diseased
cilia =
http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg
13/29
Importance of correct diagnoses
 Symptoms look
familiar
 Consequences do
not
14/29
Beat pattern of cilia tells a lot!
1. Clinicians look at cilia motion in making their diagnoses
• What is the motion called?
2. Can we create a database of motions?
15/29
Clinicians’ ultimate goal
[Figure: three panels, each marked "?", labeled Category 1, Category 2, Category 3]
16/29
Cilia as dynamic textures
 Computer vision
 Properties
Saisan et al 2001
17/29
The [proposed] pipeline
 Step 1
• Clinician captures video and uploads it
http://googolplex.dyndns.org/cilia/
18/29
The [proposed] pipeline
 Step 2
• Mahout job: autoregressive modeling
Appearance Model
Dynamic Model
http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg
$y_t \sim C x_t$
$x_t \sim A_1 x_{t-1} + \dots$
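A minimal sketch of the dynamic-model half of this step, in plain Python. It assumes a first-order model (only the A_1 term), a 2-D state, and that the states x_t have already been obtained from the appearance model; the function name `fit_ar1` and the test matrix `A_true` are illustrative, not part of any Mahout job.

```python
def fit_ar1(states):
    """Least-squares estimate of A in x_t ~ A x_{t-1}, for 2-D states."""
    # Accumulate S10 = sum_t x_t x_{t-1}^T and S00 = sum_t x_{t-1} x_{t-1}^T.
    S10 = [[0.0, 0.0], [0.0, 0.0]]
    S00 = [[0.0, 0.0], [0.0, 0.0]]
    for prev, cur in zip(states, states[1:]):
        for i in range(2):
            for j in range(2):
                S10[i][j] += cur[i] * prev[j]
                S00[i][j] += prev[i] * prev[j]
    # Closed-form inverse of the 2x2 matrix S00.
    det = S00[0][0] * S00[1][1] - S00[0][1] * S00[1][0]
    inv = [[S00[1][1] / det, -S00[0][1] / det],
           [-S00[1][0] / det, S00[0][0] / det]]
    # A = S10 * S00^-1
    return [[sum(S10[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


# Generate a noise-free state sequence from a known matrix, then recover it.
A_true = [[0.9, -0.2], [0.2, 0.9]]
states = [[1.0, 0.0]]
for _ in range(30):
    x = states[-1]
    states.append([A_true[0][0] * x[0] + A_true[0][1] * x[1],
                   A_true[1][0] * x[0] + A_true[1][1] * x[1]])
A_hat = fit_ar1(states)
```

With noise-free data the estimate matches `A_true` to floating-point precision. Real video would add noise, and a larger state would need a general linear solve in place of the 2x2 inverse.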
19/29
The [proposed] pipeline
 Step 3
• Add the transition matrices to cloud library
A=
20/29
The [proposed] pipeline
 Step 4
• Recompute network with added videos
[Figure: plot with axes labeled "Axis 1" and "Axis 2"; a "?" label]
21/29
One more thing…
 What’s really cool about AR models:
• Can you spot the fake?
Synthetic
Original
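Synthetic sequences are possible because an AR model is generative: once A and C are known, new frames can be rolled out from a starting state. The sketch below assumes a first-order model with optional Gaussian state noise; the function `synthesize`, its parameters, and the toy matrices are hypothetical and only illustrate the idea.

```python
import random


def synthesize(A, C, x0, n_frames, noise_std=0.0, seed=0):
    """Roll out x_t = A x_{t-1} + noise and emit frames y_t = C x_t."""
    rng = random.Random(seed)
    x = list(x0)
    frames = []
    for _ in range(n_frames):
        # Appearance model: map the hidden state to pixel values.
        frames.append([sum(c * xi for c, xi in zip(row, x)) for row in C])
        # Dynamic model: advance the hidden state by one step.
        x = [sum(a * xi for a, xi in zip(row, x)) + rng.gauss(0.0, noise_std)
             for row in A]
    return frames


# Toy example: a 90-degree rotation as dynamics, three "pixels" per frame.
A = [[0.0, -1.0], [1.0, 0.0]]
C = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
frames = synthesize(A, C, x0=[1.0, 0.0], n_frames=5)
```

With zero noise this toy sequence repeats every four frames; setting `noise_std` above zero gives a sequence that follows the same dynamics without being an exact copy.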
22/29
Mahout implementation
Learning autoregressive models:
MatrixMultiplicationJob (matrixmult)
TransposeJob (transpose)
DistributedLanczosSolver (svd)
StochasticSVD (ssvd)
Comparing autoregressive parameters:
SpectralKMeans (spectralkmeans)
Eigencuts (eigencuts)
Frobenius norm
Tensors
? ? ?
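The Frobenius-norm option for comparing autoregressive parameters can be sketched as below. This is plain Python under stated assumptions: the transition matrices are the same size, the library is a simple dictionary, and the names `pattern_a`, `pattern_b`, and `nearest_in_library` are hypothetical placeholders.

```python
import math


def frobenius_distance(A, B):
    """Frobenius norm of A - B for two equally sized list-of-lists matrices."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


def nearest_in_library(query, library):
    """Return the name of the stored transition matrix closest to `query`."""
    return min(library, key=lambda name: frobenius_distance(query, library[name]))


# Hypothetical library of previously learned transition matrices.
library = {
    "pattern_a": [[1.0, 0.0], [0.0, 1.0]],
    "pattern_b": [[0.0, -1.0], [1.0, 0.0]],
}
query = [[0.9, 0.1], [0.0, 1.0]]
match = nearest_in_library(query, library)
```

The same pairwise distances could also feed a similarity graph for the spectral clustering jobs listed above, rather than a nearest-match lookup.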
23/29
Cilia on Mahout conclusions
 Autoregressive modeling uses linear algebra
that is already implemented
 Maintaining AR library requires new
functionality
 Mahout framework gives us elbow room
24/29
Final Thoughts
 Biological / biomedical data is large,
high-dimensional, and noisy
 We extend Mahout’s current linear
algebra framework (spectral clustering,
autoregressive models)
 We provide a cloud framework!
25/29
Research Group
 University of Pittsburgh
• Dr. Chakra Chennubhotla Lab (advisor)
 CMU@Qatar
• Dr. Majd Sakr Lab (collaborator)
 University of Pittsburgh Medical Center
• Dr. Cecilia Lo Lab (collaborator)
26/29
Sources
 Resources
• Apache Mahout
• Spectrally Clustered
 Links
• Categorizing ciliary motion defects (BSEC 2011)
• Eigencuts spectral clustering algorithm
 Technical report (coming soon!)
27/29
Contact
 Shannon Quinn
• squinn@cmu.edu | spq1@pitt.edu
• http://www.magsolweb.net/
28/29
Thank you!
http://icanhascheezburger.files.wordpress.com/
29/29