Solving the Protein Threading Problem in Parallel
Nicola Yanev, Rumen Andonov
Indrajit Bhattacharya

CMSC 838T Presentation
Motivation


Problem the paper is trying to solve

3D structure prediction using threading

Is a given target sequence likely to fold to a 3D template core?

Find the alignment that minimizes some score function

NP-complete; no efficient exact algorithm is known for the general case

MAX-SNP-hard; no polynomial-time approximation scheme unless P = NP
Why do we care?

3D structure determines biological function of protein

Amino acid sequence (almost) uniquely determines 3D
structure

Threading is usually less accurate than comparative modeling
but easier to solve
Talk Overview

Overview of talk

Motivation

Techniques

Evaluation

Related work

Observations
Techniques

Approach

Reduce the problem to some known theoretical problem of
interest

In this case, network flow
Use existing tools for solving the theoretical problem efficiently

CPLEX
Explore possibilities for parallelizing the problem

Investigate the intrinsic hardness for real biological examples


Mathematical Formulation
[Figure: two structurally similar proteins; their spatial adjacencies (interactions); a possible threading with a sequence; the objective function]
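To make the objective concrete, here is a minimal brute-force sketch in Python. It is not the method from the paper: it assumes a threading is a non-decreasing assignment of segments to relative positions, and the names `local_cost`, `pair_cost`, `score` and `thread_brute_force`, as well as the toy numbers, are made up for illustration.

```python
from itertools import combinations_with_replacement


def score(placement, local_cost, pair_cost):
    """Local costs plus the costs of all non-local interactions."""
    total = sum(local_cost[i][t] for i, t in enumerate(placement))
    total += sum(table[placement[i]][placement[j]]
                 for (i, j), table in pair_cost.items())
    return total


def thread_brute_force(local_cost, pair_cost):
    """Exhaustively search all order-preserving placements."""
    m, n = len(local_cost), len(local_cost[0])
    # Non-decreasing tuples keep the segments in their sequence order.
    candidates = combinations_with_replacement(range(n), m)
    best = min(candidates, key=lambda p: score(p, local_cost, pair_cost))
    return score(best, local_cost, pair_cost), best


# Toy instance (made-up numbers): 3 segments, 3 relative positions,
# one non-local interaction between segments 0 and 2.
local_cost = [[3, 1, 2], [2, 2, 0], [1, 3, 1]]
pair_cost = {(0, 2): [[0, 2, 1], [1, 0, 4], [2, 1, 5]]}
best_score, best_placement = thread_brute_force(local_cost, pair_cost)
print(best_score, best_placement)  # 5 (0, 2, 2)
```

Exhaustive search like this is only feasible for toy sizes, which is why the talk turns to a network-flow formulation next.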
Reduction to Network Flow: An Example
Reduction to Network Flow:
Variables and Constraints


Standard Network Flow

Variable xi,t for each segment to position assignment

Restricted to [0, 1]

With standard flow conservation constraints
Additional cost for non-local interactions

Variable zi,t,i’,t’ for each non-local interaction

Restricted to {0, 1}

Constrained to sum to 1 for each non-local pair (i, i’)

Upper bounded by flow entering (i, t) and leaving (i’, t’)
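The variables and constraints listed above can be written out roughly as follows. This is a sketch, not the exact model from the paper: it assumes x_{i,t} is the flow through node (i, t) of the layered graph, writes E for the set of interacting segment pairs, and uses c_{i,t} and d_{i,t,i',t'} as hypothetical names for the local and non-local cost coefficients. The flow-conservation constraints that keep the path ordered are not written out.

```latex
\begin{align*}
\min\quad & \sum_{i}\sum_{t} c_{i,t}\, x_{i,t}
          + \sum_{(i,i') \in E}\sum_{t}\sum_{t'} d_{i,t,i',t'}\, z_{i,t,i',t'} \\
\text{s.t.}\quad
 & \sum_{t} x_{i,t} = 1 && \text{for every segment } i \\
 & \sum_{t}\sum_{t'} z_{i,t,i',t'} = 1 && \text{for every } (i,i') \in E \\
 & \sum_{t'} z_{i,t,i',t'} \le x_{i,t} && \text{for every } (i,i') \in E,\ t \\
 & \sum_{t} z_{i,t,i',t'} \le x_{i',t'} && \text{for every } (i,i') \in E,\ t' \\
 & 0 \le x_{i,t} \le 1, \qquad z_{i,t,i',t'} \in \{0, 1\}
\end{align*}
```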
Drawbacks of Approach


Integer programming is hard to solve!

Relax to a linear program with variables in [0, 1]

Recover an integer solution using standard heuristics

Existing tools like CPLEX
Huge number of variables

For 36 segments and 81 positions, the IP problem has 741264
rows, 360945 columns and 54145231 non-zero coefficients!

Need to reduce number of variables and constraints

Calls for parallelization if possible
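For a rough sense of scale, the number of candidate threadings can be counted directly. The sketch below assumes a threading is a non-decreasing assignment of m segments to n relative positions, so the count is the number of multisets of size m drawn from n positions; `count_threadings` is a hypothetical helper, not something from the paper.

```python
from math import comb


def count_threadings(m, n):
    """Number of non-decreasing assignments of m segments to n positions."""
    return comb(n + m - 1, m)


# Small case that can be checked by hand: 3 segments, 3 positions.
small = count_threadings(3, 3)

# The instance size quoted on the slide: 36 segments, 81 positions.
large = count_threadings(36, 81)
print(small, len(str(large)))  # count for the toy case, digit count for the large one
```

Under this assumption the search space grows combinatorially with the instance size, so exhaustive enumeration is not an option and the size of the IP itself becomes the bottleneck.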
Parallel Solution

Utilize special flow constraints

Split into sub-problems that can be solved in parallel

Split the k-th layer in the graph into r intervals

Force path for a sub-problem to pass through a particular
interval in the layer

Pass best bound for objective function found so far as
parameter to sub-problem

Sub-task aborts when dual objective function exceeds the
current best bound
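The splitting scheme can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: each sub-problem is solved by brute force instead of by CPLEX, a sum of per-term minima stands in for the LP dual bound, and the sub-problems run one after another where a real run would dispatch them to separate processors. All function names and the toy data are hypothetical.

```python
from itertools import combinations_with_replacement


def score(placement, local_cost, pair_cost):
    total = sum(local_cost[i][t] for i, t in enumerate(placement))
    return total + sum(table[placement[i]][placement[j]]
                       for (i, j), table in pair_cost.items())


def split_layer(n, r):
    """Split positions 0..n-1 into at most r contiguous, non-empty intervals."""
    cuts = [j * n // r for j in range(r + 1)]
    return [range(a, b) for a, b in zip(cuts, cuts[1:]) if a < b]


def lower_bound(local_cost, pair_cost, k, interval):
    """Cheap stand-in for the LP dual bound: sum of per-term minima."""
    bound = sum(min(row[t] for t in (interval if i == k else range(len(row))))
                for i, row in enumerate(local_cost))
    return bound + sum(min(map(min, table)) for table in pair_cost.values())


def solve_subproblem(local_cost, pair_cost, k, interval, cutoff):
    """Best placement with segment k inside `interval` that beats `cutoff`."""
    m, n = len(local_cost), len(local_cost[0])
    best_score, best_placement = cutoff, None
    for placement in combinations_with_replacement(range(n), m):
        if placement[k] not in interval:
            continue  # the path must pass through this interval of layer k
        value = score(placement, local_cost, pair_cost)
        if value < best_score:
            best_score, best_placement = value, placement
    return best_score, best_placement


def solve_by_splitting(local_cost, pair_cost, k, r):
    best_score, best_placement = float("inf"), None
    for interval in split_layer(len(local_cost[0]), r):
        # Abort the sub-problem if its bound cannot beat the best bound so far.
        if lower_bound(local_cost, pair_cost, k, interval) >= best_score:
            continue
        value, placement = solve_subproblem(
            local_cost, pair_cost, k, interval, best_score)
        if placement is not None:
            best_score, best_placement = value, placement
    return best_score, best_placement


# Toy instance (made-up numbers), split on layer k = 1 into r = 3 intervals.
local_cost = [[3, 1, 2], [2, 2, 0], [1, 3, 1]]
pair_cost = {(0, 2): [[0, 2, 1], [1, 0, 4], [2, 1, 5]]}
result = solve_by_splitting(local_cost, pair_cost, k=1, r=3)
```

Because every placement puts segment k in exactly one interval, the best answer over all sub-problems is the optimum of the original problem, whatever k and r are chosen.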
Improving Parallel Solution


Drawback: Hardest Sub-Problem Dominates!

The parallel strategy was found to be slower than the sequential one!

Sub-problems can potentially become harder to solve

Many more difficult sub-problems than easy ones
Solution:

Break the atomicity of the tasks

Each sub-task periodically checks the current best bound and
updates its cut-off

Extra overhead is still small compared to task granularity

Now the easiest executing sub-task dominates!
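A minimal sketch of the non-atomic variant, using Python threads: each sub-task re-reads a shared best bound every few candidates, tightens its own cut-off, and stops early once its lower bound can no longer beat it. As before, this is an assumption-laden illustration (brute-force sub-problems, a simple lower bound in place of the LP dual bound, hypothetical names such as `SharedBound` and `check_every`), not the paper's CPLEX-based code.

```python
import threading
from itertools import combinations_with_replacement


class SharedBound:
    """Best score found so far, shared by all sub-tasks."""

    def __init__(self):
        self._lock = threading.Lock()
        self.score, self.solution = float("inf"), None

    def read(self):
        with self._lock:
            return self.score

    def offer(self, score, solution):
        with self._lock:
            if score < self.score:
                self.score, self.solution = score, solution


def score(placement, local_cost, pair_cost):
    total = sum(local_cost[i][t] for i, t in enumerate(placement))
    return total + sum(table[placement[i]][placement[j]]
                       for (i, j), table in pair_cost.items())


def lower_bound(local_cost, pair_cost, k, interval):
    """Cheap stand-in for the LP dual bound: sum of per-term minima."""
    bound = sum(min(row[t] for t in (interval if i == k else range(len(row))))
                for i, row in enumerate(local_cost))
    return bound + sum(min(map(min, table)) for table in pair_cost.values())


def worker(local_cost, pair_cost, k, interval, shared, check_every=2):
    m, n = len(local_cost), len(local_cost[0])
    floor = lower_bound(local_cost, pair_cost, k, interval)
    cutoff = shared.read()
    for count, cand in enumerate(combinations_with_replacement(range(n), m)):
        if count % check_every == 0:
            cutoff = min(cutoff, shared.read())  # periodic cut-off update
            if floor >= cutoff:
                return  # this sub-task can no longer improve the bound
        if cand[k] not in interval:
            continue
        value = score(cand, local_cost, pair_cost)
        if value < cutoff:
            shared.offer(value, cand)
            cutoff = value


def solve_in_parallel(local_cost, pair_cost, k, r):
    n = len(local_cost[0])
    cuts = [j * n // r for j in range(r + 1)]
    shared = SharedBound()
    tasks = [threading.Thread(target=worker,
                              args=(local_cost, pair_cost, k, range(a, b), shared))
             for a, b in zip(cuts, cuts[1:]) if a < b]
    for task in tasks:
        task.start()
    for task in tasks:
        task.join()
    return shared.score, shared.solution


# Toy instance (made-up numbers).
local_cost = [[3, 1, 2], [2, 2, 0], [1, 3, 1]]
pair_cost = {(0, 2): [[0, 2, 1], [1, 0, 4], [2, 1, 5]]}
parallel_result = solve_in_parallel(local_cost, pair_cost, k=1, r=3)
```

The value of `check_every` trades synchronization overhead against how quickly a good bound found by one sub-task cuts off the others.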
Evaluation


Experimental environment

Real protein sequences

ILOG CPLEX Callable Library

Sun UltraSPARC II, 450 MHz

Objective function coefficients generated from FROST

Maximum of 7 processors and 29 sub-problems
Evaluation results

Sequential version much faster than previous branch-and-bound results for the same problem formulation

Time taken comparable to PROSPECT

Splitting and parallelization significantly improve turnaround

Really tiny gap between relaxed LP and ILP solutions

Mostly integer solutions even for relaxed LP!
Result Tables
Comparison with branch and bound algorithm
Comment: Self-threading results in significantly
lower scores (as expected)
Result Tables
Gap between relaxed LP and ILP
Comment: Tiny relaxation gap. (significance?)
Result Tables
Size of the LP formulation
Comment: LP problem size is still too large.
Result Tables
Performance with parallel sub-tasks
Comment: Longer times with more sub-problems??
Related Work


Similar / previous approaches

Lathrop and Smith, 1998
- Uses the same cost function
- Branch-and-bound algorithm for searching the space of
threadings

Xu, Xu and Uberbacher, 1998
- Divide-and-conquer algorithm

Xu, Li, Lin, Kim and Xu, 2003
- Linear programming formulation
- Solved using a branch-and-bound algorithm
None of the above suggests a parallelization scheme
Observations

Points of Interest

Mapping to a known problem of interest

Nicely utilizes particular constraints to break the problem into
independent sub-tasks

Threading of real amino acid sequences seems possible

Raises the interesting question of whether real-life protein
threading instances are polynomially solvable

Solver tailored for this particular problem may yield better
results
Observations


Criticism

Not enough experiments with large numbers of sub-tasks and
processors to demonstrate scaling

Prohibitively large number of variables and constraints

How accurate are the objective function coefficients?

What is the resolution of the objective function?

Threading onto multiple sequences for prediction still looks
daunting

Not clear how to extend the idea for 3-way and more complex
interactions
Improvements

Seems possible to break up the sub-tasks recursively
Thank you!