Dr. Mahout: Analyzing clinical data using scalable and distributed computing
Shannon Quinn
CPCB
squinn@cmu.edu | spq1@pitt.edu
November 10, 2011
1/29

Punchline
Cloud computing for biological and clinical data analysis
Problem: high-dimensional, noisy!
Image sources: tech2date.com; heart tissue: biomedcentral; fMRI: wikipedia; segmentation: biodynamics UCSD
2/29

Disclaimer
Biology jargon
Academic jargon
3/29

My Background
2nd year Ph.D. student in CPCB Program
Research in bioimage informatics
4/29

My Background
Other
Image sources: http://collegefootballbelt.com/Logos/ http://s3.amazonaws.com/data.tumblr.com/
5/29

Computational biology and …the cloud?
Biological data
• is BIG
• requires repetitive analysis in chunks
• modeling involves linear algebra and statistics
6/29

Use case 1: protein behavior
[Figure: timescale of relevant motions. Axis ticks: 10^-15, 10^-12, 10^-9, 10^-6, 10^-3, 10^0. Labels, from fastest to slowest: bond vibration; side-chain rotation; domain shifts / max. catalysis; global conformational shifts; protein folding.]
Sampling vs. detail: a common tradeoff…
7/29

Molecular dynamics
8/29

“The curse of [MD] dimensionality”
MD := $F = ma$
for every atom
for every t
…
Image sources: http://icanhascheezburger.files.wordpress.com/ http://www.pdb.org/pdb/explore/explore.do?structureId=3fxi
9/29

Pipeline for MD trajectory analysis
Find a “surface” of protein shapes
1. MD output
2. Define surface (graph!)
3. Partition surface
Image source: http://www.dillgroup.ucsf.edu/
10/29

Mahout implementation
Defining surface/graph:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)
Partitioning surface/graph:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Kmeans (kmeans)
• ...
11/29

MD in Mahout conclusion
MD simulations (x@Home projects)
http://folding.stanford.edu/
Existing Mahout functionality
Additional algorithms
12/29

Use case 2: diseases affecting cilia
What are cilia?
• Hairlike structures
• Keep things moving
• Diseased cilia = [image: Sad Panda]
Image source: http://fc06.deviantart.net/fs71/f/2010/177/d/5/Sad_Panda_by_jinxii24.jpg
13/29

Importance of correct diagnoses
Symptoms look familiar
Consequences do not
14/29

Beat pattern of cilia tells a lot!
Clinicians look at cilia motion in making their diagnoses
1. What is the motion called?
2. Can we create a database of motions?
15/29

Clinicians’ ultimate goal
? Category 1
? Category 2
? Category 3
16/29

Cilia as dynamic textures
Computer vision
Properties
Saisan et al 2001
17/29

The [proposed] pipeline
Step 1
• Clinician captures video and uploads it
http://googolplex.dyndns.org/cilia/
18/29

The [proposed] pipeline
Step 2
• Mahout job: autoregressive modeling
Appearance Model: $y_t \sim C x_t$
Dynamic Model: $x_t \sim A_1 x_{t-1} + \dots$
Image source: http://web.media.mit.edu/~tristan/phd/dissertation/figures/manifold2.jpg
19/29

The [proposed] pipeline
Step 3
• Add the transition matrices to cloud library
A = [matrix shown on slide]
20/29

The [proposed] pipeline
Step 4
• Recompute network with added videos
[Figure: plot with axes “Axis 1” and “Axis 2”, with a “?” marker]
21/29

One more thing…
What’s really cool about AR models:
• Can you spot the fake?
[Videos: Synthetic, Original]
22/29

Mahout implementation
Learning autoregressive models:
• MatrixMultiplicationJob (matrixmult)
• TransposeJob (transpose)
• DistributedLanczosSolver (svd)
• StochasticSVD (ssvd)
Comparing autoregressive parameters:
• SpectralKMeans (spectralkmeans)
• Eigencuts (eigencuts)
• Frobenius norm
• Tensors
• ???
23/29

Cilia on Mahout conclusions
Autoregressive modeling uses linear algebra that is already implemented
Maintaining AR library requires new functionality
Mahout framework gives us elbow room
24/29

Final Thoughts
Biological / biomedical data is large, high-dimensional, and noisy
We extend Mahout’s current linear algebra framework (spectral clustering, autoregressive models)
We provide a cloud framework!
25/29

Research Group
University of Pittsburgh
• Dr. Chakra Chennubhotla Lab (advisor)
CMU@Qatar
• Dr. Majd Sakr Lab (collaborator)
University of Pittsburgh Medical Center
• Dr. Cecilia Lo Lab (collaborator)
26/29

Sources
Resources
• Apache Mahout
• Spectrally Clustered Links
• Categorizing ciliary motion defects (BSEC 2011)
• Eigencuts spectral clustering algorithm
Technical report (coming soon!)
27/29

Contact
Shannon Quinn
• squinn@cmu.edu | spq1@pitt.edu
• http://www.magsolweb.net/
28/29

Thank you!
Image source: http://icanhascheezburger.files.wordpress.com/
29/29
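
Backup sketch: the autoregressive pipeline in miniature
The appearance model (y_t ~ C x_t), the dynamic model (x_t ~ A_1 x_{t-1} ...), and the Frobenius-norm comparison of transition matrices can be illustrated on a toy scale. The sketch below is a simplification, not the Mahout jobs from the talk: it assumes a noise-free, first-order model (only A_1, written A), plain Python lists instead of distributed matrices, and the function names matvec, simulate, and frobenius_distance are hypothetical, chosen only for illustration.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def simulate(A, C, x0, steps):
    """Run the noise-free first-order model x_t = A x_{t-1}, y_t = C x_t.

    Assumption: no noise terms and no higher-order lags, unlike a full
    AR model. Returns the observations y_1 .. y_steps as a list of lists.
    """
    observations = []
    state = list(x0)
    for _ in range(steps):
        state = matvec(A, state)               # dynamic model
        observations.append(matvec(C, state))  # appearance model
    return observations


def frobenius_distance(A, B):
    """Frobenius norm of (A - B) for two matrices of identical shape."""
    return math.sqrt(sum((a - b) ** 2
                         for row_a, row_b in zip(A, B)
                         for a, b in zip(row_a, row_b)))


if __name__ == "__main__":
    # Toy values: a 90-degree rotation as the transition matrix and a
    # 3-pixel "appearance" map. These numbers are invented for the demo.
    A_rotation = [[0, -1], [1, 0]]
    A_identity = [[1, 0], [0, 1]]
    C = [[1, 0], [0, 1], [1, 1]]

    print(simulate(A_rotation, C, [1, 0], 4))
    print(frobenius_distance(A_rotation, A_identity))
```

Comparing two videos by the distance between their transition matrices is only one of the options listed on the "Comparing autoregressive parameters" slide; a real pipeline would first have to learn A and C from video, which this sketch does not attempt.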