Blast2cap3 - Conferences


XSEDE '14, July 13 - 18 2014, Atlanta, GA, USA

Evaluating Distributed Platforms for Protein-Guided Scientific Workflow

Natasha Pavlovikj, Kevin Begcy, Sairam Behera, Malachy Campbell, Harkamal Walia, Jitender S. Deogun

University of Nebraska-Lincoln

1

Introduction

Gene expression and transcriptome analysis are among the main research focuses of many biologists and scientists

The analysis of this so-called “big data” relies on a complex set of many software tools

This increases the demand for powerful computational resources where the data can be stored and analyzed

2

Assembly Pipeline

3

Assembly of raw sequence data is a complex multi-stage process composed of preprocessing, assembling, and postprocessing

An assembly pipeline simplifies the entire assembly process by automating its steps

blast2cap3

The overlap-based assembly program CAP3 is used to merge transcripts based on overlapping regions with a specified identity

Multiple approaches used for assembling the filtered reads produce high redundancy in the resulting transcripts

However, because most of the produced transcripts code for a protein, protein similarity should also be considered during the merging

4

blast2cap3

5

Blast2cap3 is a Python script written by Vince Buffalo from the Plant Sciences Department, UCD

Blast2cap3 is a protein-guided assembly approach that first clusters the transcripts based on similarity to a common protein and then passes each cluster to CAP3

The recent use of blast2cap3 on the wheat transcriptome assembly shows that blast2cap3 generates fewer artificially fused sequences and reduces the total number of transcripts by 8-9%

blast2cap3

The assembled transcripts are aligned with protein datasets closely related to the organism for which the transcripts are generated, and afterwards, transcripts sharing a common protein hit are merged using CAP3

The current implementation of blast2cap3 supports only serial execution
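The clustering step described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the blast2cap3 source: it assumes the alignment file is in the standard 12-column tabular BLAST format (query ID in column 1, subject ID in column 2, bit score in column 12), and it assumes each transcript is assigned to its highest-scoring protein hit. The function names `best_hits` and `cluster_by_protein` are hypothetical.

```python
from collections import defaultdict


def best_hits(blast_lines):
    """Map each transcript ID to its highest-scoring protein hit.

    Assumes 12-column tabular BLAST output (an assumption, the slides
    do not state the format of the alignment file).
    """
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def cluster_by_protein(hits):
    """Group transcript IDs that share the same protein hit."""
    clusters = defaultdict(list)
    for transcript, protein in hits.items():
        clusters[protein].append(transcript)
    return dict(clusters)


# Made-up example records: t1 and t2 both hit P1 best, t3 hits P2.
lines = [
    "t1\tP1\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-50\t200",
    "t1\tP2\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-20\t90",
    "t2\tP1\t92.0\t100\t8\t0\t1\t100\t1\t100\t1e-45\t180",
    "t3\tP2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-40\t150",
]
clusters = cluster_by_protein(best_hits(lines))
```

Each resulting cluster would then be written to its own FASTA file and passed to CAP3; that step is omitted here.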

6

Pegasus Workflow Management System

The modularity of blast2cap3 allows us to decompose the existing approach into multiple tasks, some of which can be run in parallel

The protein-guided assembly can be structured into a scientific workflow
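One way to picture the decomposition is to distribute the protein clusters over a fixed number of independent tasks. The sketch below is an assumption about how such a split could be done, not the scheme used in the talk; `split_into_tasks` is a hypothetical helper and the round-robin assignment is chosen only for simplicity.

```python
def split_into_tasks(cluster_ids, n_tasks):
    """Distribute cluster IDs round-robin over at most n_tasks chunks.

    Each chunk could be processed by one independent workflow task.
    Empty chunks are dropped when there are fewer clusters than tasks.
    """
    if n_tasks < 1:
        raise ValueError("n_tasks must be at least 1")
    chunks = [[] for _ in range(n_tasks)]
    for index, cluster_id in enumerate(cluster_ids):
        chunks[index % n_tasks].append(cluster_id)
    return [chunk for chunk in chunks if chunk]


# Seven made-up cluster IDs spread over three tasks.
tasks = split_into_tasks(["a", "b", "c", "d", "e", "f", "g"], 3)
```

Round-robin keeps the chunk sizes within one cluster of each other, but it ignores how many transcripts each cluster holds, so the actual task run times may still be uneven.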

7

Pegasus Workflow Management System

Pegasus uses DAX (directed acyclic graph in XML) files to specify an abstract workflow

Pegasus WMS is a framework that automatically maps high-level scientific workflows organized as a directed acyclic graph (DAG) onto a wide range of execution platforms, including clusters, grids, and clouds

The abstract workflow contains information and description of all executable files and logical names of the input files used by the workflow
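To make the DAG idea concrete, the sketch below models a tiny workflow as a dependency graph and groups its tasks into waves that could run in parallel. It uses Python's standard `graphlib` module (Python 3.9 or later), not the Pegasus DAX format or API, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
# Task names are illustrative only.
workflow = {
    "split_clusters": set(),
    "run_cap3_1": {"split_clusters"},
    "run_cap3_2": {"split_clusters"},
    "merge_results": {"run_cap3_1", "run_cap3_2"},
}


def execution_waves(dag):
    """Return lists of tasks whose dependencies are all satisfied.

    Tasks within one wave do not depend on each other, so they
    could be executed in parallel.
    """
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(workflow)
```

Here the two `run_cap3_*` tasks land in the same wave because neither depends on the other.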

8

blast2cap3 with Pegasus WMS

Each node represents a workflow task, while each edge represents the dependency between the tasks

Archive of all required prebuilt libraries and tools (Python, Biopython, CAP3)

The step of downloading and extracting this archive is defined as a task in the workflow

The Pegasus WMS implementation of blast2cap3 reduces the running time of the current serial implementation of blast2cap3 by more than 95%

9

10

Execution Platforms

The resources that scientific workflows require can exceed the capabilities of the local computational resources

Scientific workflows are usually executed on distributed platforms, such as campus clusters, grids or clouds

Used execution platforms

11

Sandhills: University of Nebraska Campus Cluster

Sandhills is one of the High Performance Computing (HPC) clusters at the University of Nebraska-Lincoln Holland Computing Center (HCC)

Used by faculty and students

Sandhills was constructed in 2011 and has 1440 AMD cores housed in a total of 44 nodes

Every new HCC user account must be associated with a faculty or research group

12

OSG: Open Science Grid

OSG is a national consortium of geographically distributed academic institutions and laboratories that provide hundreds of computing and storage resources to OSG users

OSG is organized into Virtual Organizations (VOs)

OSG does not own any computing or storage resources, but allows users to use the resources contributed by the other members of the OSG and its VOs

Every new user applies for an OSG certificate

13

Amazon EC2: Amazon Elastic Compute Cloud

Amazon Elastic Compute Cloud (Amazon EC2) is a large commercial Web-based service provided by Amazon.com

Users have access to virtual machine (VM) instances where they deploy VM images with customized software and libraries

Amazon EC2 is a scalable, elastic and flexible platform

Amazon EC2 users are billed hourly for the number and type of resources they use

14

Experiments

Investigate the behavior of the modified Pegasus WMS implementation of blast2cap3 when the workflow is composed of 30, 110, 210, 610, 1,010, and 2,010 tasks

Run the workflow multiple times on the different execution platforms to observe differences in workflow performance and in resource availability over time

15

Experiments

Compare the total workflow running time between different execution platforms

Examine the number of running versus the number of idle jobs over time for each workflow
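The running-versus-idle comparison can be computed from per-job timestamps. The sketch below assumes each job is described by hypothetical `(submit, start, end)` times in seconds and treats a job as idle while it is submitted but not yet started; the record layout and the sample values are illustrative, not data from the experiments.

```python
def job_counts(jobs, t):
    """Count running and idle jobs at time t.

    jobs: iterable of (submit, start, end) times in seconds.
    A job is idle from submission until it starts, and running
    from its start until its end.
    """
    running = sum(1 for submit, start, end in jobs if start <= t < end)
    idle = sum(1 for submit, start, end in jobs if submit <= t < start)
    return running, idle


# Three made-up jobs submitted together but started at different times.
jobs = [(0, 0, 50), (0, 10, 60), (0, 40, 90)]
timeline = [job_counts(jobs, t) for t in (5, 45, 70)]
```

Sampling `job_counts` at regular intervals gives the two curves (running and idle) that can be plotted for each workflow size and platform.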

16

Experimental Data

Diploid wheat Triticum urartu dataset from NCBI

The assembled transcripts were generated using Velvet as a de novo assembler

These transcripts were aligned with protein datasets from organisms closely related to wheat (Barley, Brachypodium, Rice, Maize, Sorghum, Arabidopsis)

“transcripts.fasta”: 404 MB, 236,529 assembled transcripts

“alignments.out”: 155 MB, 1,717,454 protein hits

17

Comparing Running Time on Sandhills, OSG and Amazon EC2 for Workflows with Different Numbers of Tasks

18

Comparing the Number of Running Jobs versus the Number of Idle Jobs Over Time for Workflows with Different Task Number

19


Cost Comparison of Different Execution Platforms

Sandhills: generally free resources

OSG: completely free resources

The main and the most important difference between the commercial cloud and the academic distributed resources is the cost

Amazon EC2: complex pricing model

50 m1.large spot instances × $0.04 per hour = $122.84
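The cost line above follows the usual pattern of instances × hourly price × billed hours. The sketch below shows that arithmetic; the slide does not state how many hours were billed, so the `hours` argument is a placeholder and `ec2_cost` is a hypothetical helper, not an AWS API.

```python
def ec2_cost(n_instances, price_per_hour, hours):
    """Total cost in dollars for identical instances billed hourly.

    Assumes a fixed price per instance-hour; real spot prices
    fluctuate over time, so this is only an approximation.
    """
    return n_instances * price_per_hour * hours


# Cost of running the 50 instances from the slide for one hour.
# The one-hour duration is a placeholder, not a figure from the talk.
hourly_fleet_cost = ec2_cost(50, 0.04, 1)
```

Under these assumptions the fleet costs about $2 per hour, so the total scales directly with however long the workflows keep the instances allocated.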

25

Conclusion

Using more than 100 tasks in a workflow significantly reduces the running time for all execution platforms

Resource allocation on Sandhills and OSG is opportunistic, and resource availability changes over time

The results are almost constant when Amazon EC2 is used

Workflow failures were not encountered on Sandhills and Amazon EC2

26

Conclusion

The predictability of the Amazon EC2 resources leads to better workflow running time when the cloud is used as a platform

For our blast2cap3 workflow, better running time and better usage of the allocated resources were achieved when Amazon EC2 was used

Given the cost of Amazon EC2, academic distributed systems can be a good alternative

27

Acknowledgments

University of Nebraska Holland Computing Center

Open Science Grid

28
