Solving the Protein Threading Problem in Parallel Nocola Yanev, Rumen Andonov Indrajit Bhattacharya CMSC 838T Presentation Motivation Problem paper is trying to solve 3D structure prediction using threading Is a given target sequence likely to fold to a 3D template core? Find the alignment that minimizes some score function NP-complete; optimal solution not possible MAX-SNP-hard; arbitrary approximation not possible Why do we care 3D structure determines biological function of protein Amino acid sequence (almost) uniquely determines 3D structure Threading is usually less accurate than comparative modeling but easier to solve CMSC 838T – Presentation Talk Overview Overview of talk Motivation Techniques Evaluation Related work Observations CMSC 838T – Presentation Techniques Approach Reduce the problem to some known theoretical problem of interest In this case, network flow Use existing tools for solving the theoretical problem efficiently CPLEX Explore possibilities for parallelizing the problem Investigate the intrinsic hardness for real biological examples CMSC 838T – Presentation Mathematical Formulation Two structurally similar proteins Spatial adjacencies (interactions) Possible threading with a sequence Objective function CMSC 838T – Presentation Reduction to Network Flow: An Example CMSC 838T – Presentation Reduction to Network Flow: Variables and Constraints Standard Network Flow Variable xi,t for each segment to position assignment Restricted to [0, 1] With standard flow conservation constraints Additional cost for non-local interactions Variable zi,t,i’,t’ for each non-local interaction Restricted to {0, 1} Constrained to sum to 1 for each non-local pair (i, i’) Upper bounded by flow entering (i, t) and leaving (i’, t’) CMSC 838T – Presentation Drawbacks of Approach Integer programming is hard to solve! Relax to linear programming with (0, 1) variables Approximate to integer solution using standard heuristics Existing tools like CPLEX Huge number of variables For 36 segments and 81 positions, IP problem has 741264 rows, 360945 columns and 54145231 non-zero variables! Need to reduce number of variables and constraints Calls for parallelization if possible CMSC 838T – Presentation Parallel Solution Utilize special flow constraints Split into sub-problems that may be solved parallely Split the k-th layer in the graph into r intervals Force path for a sub-problem to pass through a particular interval in the layer Pass best bound for objective function found so far as parameter to sub-problem Sub-task aborts when dual objective function exceeds the current best bound CMSC 838T – Presentation Improving Parallel Solution Drawback: Hardest Sub-Problem Dominates! Parallel strategy was found to be slower than the sequential! Sub-problems can potentially become harder to solve Many more difficult sub-problems than easy ones Solution: Break the atomicity of the tasks Each sub-task periodically checks the current best bound and updates its cut-off Extra overhead is still small compared to task granularity Now the easiest executing sub-task dominates! CMSC 838T – Presentation Evaluation Experimental environment Real protein sequences ILOG CPLEX Callable Library SUN Ultra-Sparc II, 450 Mhz Objective function coefficients generated from FROST Maximum of 7 processors and 29 sub-problems Evaluation results Sequential version much faster than previous branch-andbound results for the same problem formulation Time taken comparable to PROSPECT Splitting and parallelization significantly improve turnaround Really tiny gap between relaxed LP and ILP solutions Mostly integer solutions even for relaxed LP! CMSC 838T – Presentation Result Tables Comparison with branch and bound algorithm Comment: Self threading results in significantly lower scores (as should be) CMSC 838T – Presentation Result Tables Gap between relaxed LP and ILP Comment: Tiny relaxation gap. (significance?) CMSC 838T – Presentation Result Tables Size of the LP formulation Comment: LP problem size is still too large. CMSC 838T – Presentation Result Tables Performance with parallel sub-tasks Comment: Longer times with more sub-problems?? CMSC 838T – Presentation Related Work Similar / previous approaches Lathrop and Smith, 1998 Uses same cost function Branch and bound algorithm for searching the space of threadings Xu, Xu and Uberbacher, 1998 Divide and conquer algorithm Xu, Li, Lin, Kim and Xu, 2003 Linear programming formulation Solved using b&b algorithm None of the above suggest any parallelizing scheme CMSC 838T – Presentation Observations Points of Interest Mapping to a known problem of interest Nicely utilizes particular constraints to break into independent subtasks Threading of real amino acid sequences seems possible Raises interesting questions about real-life protein threading being in P Solver tailored for this particular problem may yield better results CMSC 838T – Presentation Observations Criticism Not enough experiments with large number of subtasks and processors to show scaling Prohibitively large number of variables and constraints How accurate are the objective function coefficients? What is the resolution of the objective function? Threading onto multiple sequences for prediction still looks daunting Not clear how to extend the idea for 3-way and more complex interactions Improvements Seems possible to break up the sub-tasks recursively CMSC 838T – Presentation Thank you! CMSC 838T – Presentation