Atomistic Protein Folding Simulations on the Submillisecond Timescale Using Worldwide Distributed Computing
Qing Lu
CMSC 838 Presentation

Overview
  - Overview of talk
    - Motivation
    - Challenge
    - Methods
      - Ensemble Dynamics
      - Folding@Home
    - Evaluation
    - Observations

CMSC 838T – Presentation

Motivation
  - Atomistic simulation of protein folding
    - understand the dynamics of folding
    - real-time folding in full atomic detail
    - large-scale parallelization methods
  - Benefits
    - protein folding and disease
      - proteins self-assemble in order to function
      - misfolded proteins are linked to diseases
    - nanotechnology
      - nanomachines self-assemble on the nanoscale

Challenge
  - Difficulties
    - limited by current computational techniques
      - the fastest proteins fold in microseconds
      - one CPU: 1 ns/day, about 30 years of computation
      - 10,000-fold computational gap
      - 1,000 CPUs: 1 microsecond/day
    - traditional parallelization schemes
      - hard to scale to a large number of processors
      - require extremely fast communication
      - complexity of coordination
    - expensive supercomputers
      - cost
      - time-sharing

Method
  - ensemble dynamics
    - a new simulation algorithm
    - parallel simulation
  - Folding@Home
    - heterogeneous network, Internet
    - large-scale distributed platform

Simulation of Dynamics
  - free energy barrier
    - progress from one state to another: transition
    - thermal fluctuations push the system over the free energy barrier
  - previous approaches: sampling
    - may get stuck in metastable free energy minima
    - expensive computational cost of sampling

Ensemble Dynamics
  - application scenario
    - waiting time for transitions dominates the total time
    - protein folding transition: free energy barrier crossing
  - Algorithm
    - coupled simulations: transition coupling
    - start M independent simulations from one initial condition
    - wait for the first simulation to cross the free energy barrier
    - this takes M times less time than the average single-simulation crossing time
    - restart all M simulations from the new location after the transition
  - Near-linear speed-up in the number of processors
    - exponential kinetics: f(t) = 1 - exp(-k t)
    - if k t is small, f(t) ≈ k t
    - M simulations: M f(t) ≈ M k t folding events

Limitations
  - barrier crossing probability
    - exponential assumptions
  - correct transition detection
    - transition: free energy barrier crossing
    - a large variance in energy: threshold
    - correct detection is not guaranteed
  - multiple possible transitions are not addressed
    - selection of the first transition

Distributed Computing
  - Distributed simulations
    - M processors for each run
    - simulate folding in atomic detail on each processor
    - restart once a barrier-crossing event occurs
  - Implementation: Folding@Home
    - worldwide distributed computing: Internet
    - started in October 2000
    - more than 200,000 participants
    - 10,000 CPU-years in the first 12 months

Folding@Home

Folding@Home
  - client-server architecture
    - server assigns jobs (work units) to clients
    - client sends back results after computation
    - ~100 KB of data transferred between client and server
  - why is ensemble dynamics good for Folding@Home?
    - CPU-intensive jobs: a few hours, often days
    - connection speed: a modem is good enough
    - suitable for Folding@Home

Other@Home Work
  - SETI@Home
    - search for intelligent life outside Earth
    - data analysis of signals
  - FightAIDS@Home
    - find drug therapies for HIV
    - how drugs interact with various HIV mutations
  - distributed projects
    - divide and conquer
    - CPU-intensive jobs
    - small pieces of data (kilobytes) to transfer
    - communication is not a major concern

Evaluation
  - Folding@Home
    - based on the Tinker molecular dynamics code
    - voluntary participants worldwide, over 400,000 CPUs
  - simulate folding and unfolding
    - folding rates
    - simulations on small proteins

Folding Rates

Folding & Unfolding

Observations
  - Sampling
    - too expensive to run for long timescales
    - wastes too much time lingering in local energy minima
  - Ensemble dynamics
    - speeds up simulations of dynamics
    - biological meaning of the simulation results?
    - results on large protein folding?
    - limitations: correct transition detection, transition probability
  - Folding@Home
    - a cheap way to achieve supercomputing power
    - huge distributed computing platform: over 400,000 CPUs
    - an efficient approach for CPU-intensive jobs
  - Complexity of problems and size of data increase rapidly
    - finding better algorithms is preferable to buying supercomputers
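
The near-linear speed-up claimed on the Ensemble Dynamics slide can be checked numerically. The following is a minimal sketch, not the Folding@Home implementation: it assumes each simulation's barrier-crossing time is exponentially distributed with rate k (the exponential-kinetics assumption from the slides) and estimates how much sooner the first of M independent simulations crosses. The function names, the rate k = 1.0, the trial count and the seeds are all hypothetical choices made for illustration.

```python
import math
import random


def fraction_crossed(k, t):
    """Exponential kinetics from the slides: f(t) = 1 - exp(-k t)."""
    return 1.0 - math.exp(-k * t)


def first_crossing_time(k, m, rng):
    """Waiting time until the first of m independent simulations crosses.

    Assumption: each crossing time is exponential with rate k.
    """
    return min(rng.expovariate(k) for _ in range(m))


def mean_first_crossing_time(k, m, trials=20000, seed=0):
    """Monte Carlo estimate of the mean first-crossing time for m simulations."""
    rng = random.Random(seed)
    total = sum(first_crossing_time(k, m, rng) for _ in range(trials))
    return total / trials


def estimated_speedup(k, m, trials=20000, seed=0):
    """Ratio of the single-simulation waiting time to the m-simulation one."""
    single = mean_first_crossing_time(k, 1, trials, seed)
    ensemble = mean_first_crossing_time(k, m, trials, seed + 1)
    return single / ensemble


if __name__ == "__main__":
    k = 1.0  # hypothetical rate in arbitrary time units
    for m in (1, 10, 100):
        print(m, round(estimated_speedup(k, m), 2))
    # Linear regime: for small k*t, f(t) is close to k*t.
    print(fraction_crossed(k, 0.01), k * 0.01)
```

Under the exponential assumption the estimated speed-up should come out close to M. The sketch deliberately ignores the limitations listed on the Limitations slide: it does not model transition detection or non-exponential kinetics, either of which could reduce the speed-up in practice.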