Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases

Vinay D. Shet
CMSC 838 Presentation
Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman
Motivation

Biological databases are growing at a very high rate

The Protein Data Bank (PDB) grew from 5,811 to 12,110 entries in three years
Computational tools are required to access and analyze this data efficiently

Typical data analyses

Linear scans across the database looking for something
"All-versus-all" comparisons within the database

High-performance distributed computing resources can play an important role in these analyses

The authors use a distributed computing environment, LEGION, to enable large-scale analysis of the PDB
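To give a sense of why all-versus-all comparisons are so much more demanding than linear scans, the sketch below counts work units for each at the two PDB sizes quoted above. It assumes one unit of work per entry for a scan and one unit per unordered pair of entries for a comparison; the function names are illustrative only.

```python
def linear_scan_units(n_entries: int) -> int:
    """One unit of work per database entry."""
    return n_entries


def all_vs_all_units(n_entries: int) -> int:
    """One unit of work per unordered pair of distinct entries."""
    return n_entries * (n_entries - 1) // 2


# PDB sizes quoted on the slide: before and after three years of growth.
for n in (5811, 12110):
    print(n, linear_scan_units(n), all_vs_all_units(n))
```

Under these assumptions, roughly doubling the number of entries more than quadruples the number of pairwise comparisons.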
CMSC 838T – Presentation
Motivation

Similar to the evaluation of our threaded BLAST project

We run threaded BLAST on a Sun SMP with 24 processors
The authors run a program called FEATURE over the LEGION framework

Can access hundreds of CPUs worldwide

Can spawn sequential versions of FEATURE on all of them
Talk Overview

Motivation
Background
LEGION
FEATURE
Methods
Experiments
Results
Discussion
Related work
Observations


Background

LEGION (Worldwide Virtual Computer)

Metacomputing environment composed of geographically distributed, heterogeneous collections of workstations and supercomputers

Connects resources to make up a single, worldwide, virtual computer

Coordinates a large number of parallel jobs on a mixture of processors (SMPs, MPPs, PCs) on any network

LEGION provides the software infrastructure so that a system of heterogeneous, geographically distributed, high-performance machines can interact seamlessly

No manual installation of binaries over multiple platforms (LEGION does it automatically)
Background

LEGION

LAM: an MPI implementation for workstation clusters

LEGION supports transparent and efficient scheduling, data management, fault tolerance, site autonomy, a single file name space, comprehensive resource management, and a wide range of security options
Background

FEATURE

Site characterization and recognition system

A site is a microenvironment distinguished by some structural or functional role
Identifies functional or structural sites of interest in a query protein
Background

FEATURE

Measures spatial distributions of chemical and physical properties to create a statistical model of the microenvironment

Compares regions of the query protein with known sites and control non-sites, and assigns scores indicating the likelihood that a region is a site

Produces a list of potential site locations with corresponding scores

Has been used to recognize ion, ligand, and enzyme binding sites

FEATURE is a typical data-driven algorithm requiring large data storage and efficient data analysis

Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries
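A back-of-the-envelope extrapolation shows why a single processor is impractical for the whole PDB. The sketch below assumes that run time scales linearly with the number of entries and that work divides perfectly across processors. Neither assumption comes from the paper, and the 10,996-entry figure is the size of the comprehensive scan reported in the Methods slides.

```python
HOURS_FOR_SUBSET = 12.0  # from the slide: 12 hours on one processor
SUBSET_ENTRIES = 580     # from the slide: 580 non-redundant PDB entries


def estimated_hours(n_entries: int, n_processors: int = 1) -> float:
    """Estimate wall-clock hours, assuming ideal linear scaling."""
    hours_per_entry = HOURS_FOR_SUBSET / SUBSET_ENTRIES
    return hours_per_entry * n_entries / n_processors


print(round(estimated_hours(10996), 1))      # single processor
print(round(estimated_hours(10996, 50), 1))  # ideal split over 50 processors
```

Under these idealized assumptions the full scan would take well over 200 hours on one processor and a few hours on 50. Real runs include scheduling and data-transfer overhead, so this is only an optimistic bound.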
Methods

FEATURE was run on all protein entries in the May 2000 PDB

Searched for potential calcium binding sites

FEATURE has 90% sensitivity and 100% specificity for these sites

Three experiments were conducted

Sequential scan of a PDB subset using a single processor

Comprehensive scan of the PDB using the LEGION system with 50 processors

Set of LEGION runs using a constant PDB subset but a varying number of processors

Input parameters to FEATURE and the statistical model for Ca remained constant
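For readers unfamiliar with the two accuracy figures quoted above, the sketch below gives the standard definitions of sensitivity and specificity. The counts are hypothetical, chosen only to reproduce 90% and 100%; they are not the evaluation data behind FEATURE's reported numbers.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of real sites that are correctly recognized."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of non-sites that are correctly rejected."""
    return true_neg / (true_neg + false_pos)


# Hypothetical counts, chosen only to match the quoted percentages.
print(sensitivity(true_pos=9, false_neg=1))
print(specificity(true_neg=20, false_pos=0))
```

A specificity of 100% means no non-site was labeled as a site, while a sensitivity of 90% means one real site in ten was missed.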
Methods

Experiments

Sequentially scanned 726 arbitrarily chosen proteins from the PDB

Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU

Comprehensive scan of all proteins (10,996 total) in the PDB

Maximum number of processors: 50
FEATURE code compiled for various platforms so binaries can be run on different machines across LEGION

Scanned a subset of proteins with a varying number of processors

Arbitrarily selected 4997 proteins for each run
Varied the number of processors using values 20, 40, 60, and 80
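The slides do not say how the 4997 proteins were divided among processors. As one simple possibility, the sketch below splits a list of entries into near-equal round-robin chunks, one per processor, for each processor count used in the third experiment. The `partition` helper and the placeholder entry names are assumptions for illustration, not part of LEGION or FEATURE.

```python
def partition(items: list, n_workers: int) -> list[list]:
    """Split items into n_workers round-robin chunks of near-equal size."""
    return [items[i::n_workers] for i in range(n_workers)]


# Placeholder identifiers, not real PDB codes.
entries = [f"entry_{i}" for i in range(4997)]

for n in (20, 40, 60, 80):
    chunks = partition(entries, n)
    sizes = [len(c) for c in chunks]
    print(n, min(sizes), max(sizes))
```

Because each chunk can be scanned by an independent sequential FEATURE process, no communication between workers is needed. Chunk sizes differ by at most one entry, though actual run time per entry may vary with protein size.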
Results

FEATURE reported six run-time failures in the sequential run, caused by non-standard PDB file formats

FEATURE also encountered run-time assertion failures, illegal instructions, or segmentation faults during the second experiment
Discussion

FEATURE performance deteriorates once the number of processors exceeds 60

The optimal maximum number is constrained by:
the client's process table, which keeps track of each LEGION process spawned
the amount of memory available to support spawned processes

Thus, even if LEGION contains hundreds of nodes, users cannot use all of them

LEGION also provides minimal fault tolerance (if any instance fails, the user must wait until everything has finished to re-spawn it)

The authors maintained a local copy of the database but concede that this is not a realistic situation because:
updates to the PDB occur frequently
a local copy consumes a lot of disk space
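The re-spawn behavior described above (wait for the whole batch, then re-run only what failed) can be sketched in a few lines. This is a sequential toy model, not LEGION's API: `run_batch`, `run_with_respawn`, and the `flaky_scan` stand-in for a FEATURE instance are all hypothetical names.

```python
def run_batch(tasks, run_one):
    """Run every task; return (results, failed) once the whole batch is done."""
    results, failed = {}, []
    for task in tasks:
        try:
            results[task] = run_one(task)
        except Exception:
            failed.append(task)
    return results, failed


def run_with_respawn(tasks, run_one, max_rounds=3):
    """After each full batch, re-run only the tasks that failed."""
    results, pending = {}, list(tasks)
    for _ in range(max_rounds):
        done, pending = run_batch(pending, run_one)
        results.update(done)
        if not pending:
            break
    return results, pending


# Hypothetical stand-in for one FEATURE instance: fails once for entry "b".
attempts = {}


def flaky_scan(entry):
    attempts[entry] = attempts.get(entry, 0) + 1
    if entry == "b" and attempts[entry] == 1:
        raise RuntimeError("simulated transient failure")
    return f"scanned {entry}"


results, unresolved = run_with_respawn(["a", "b", "c"], flaky_scan)
print(results, unresolved)
```

Retrying helps only with transient failures. Deterministic ones, such as a non-standard input file, would fail on every round and remain in the unresolved list.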
Related Work

Threaded BLAST and MPI BLAST

The authors' work is similar to threaded BLAST

MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors

FEATURE is not truly parallelized
Observations

Running CPU-intensive tasks over many processors is definitely useful

However, LEGION does not scale well: performance degrades beyond 60 processors
The authors have not exploited true parallelism in FEATURE

It seems to me that there is a lot of potential to parallelize FEATURE, given that many potential sites can be examined simultaneously

What would the performance gain be in a parallelized version?
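To illustrate what examining many potential sites simultaneously could look like, the sketch below scores candidate sites of one protein independently through a worker pool. Everything here is an assumption for illustration: `toy_score` simply counts atoms near a point and is not FEATURE's statistical model, and the coordinates are made up.

```python
from concurrent.futures import ThreadPoolExecutor
import math


def toy_score(site, atoms, radius=2.0):
    """Placeholder score: number of atoms within `radius` of the site."""
    return sum(1 for atom in atoms if math.dist(site, atom) <= radius)


def score_sites(sites, atoms, max_workers=4):
    """Score each candidate site independently; results follow the order of `sites`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(lambda s: toy_score(s, atoms), sites))


# Made-up coordinates for a tiny example.
atoms = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
sites = [(0.5, 0.0, 0.0), (5.0, 5.0, 4.0), (10.0, 10.0, 10.0)]
print(score_sites(sites, atoms))
```

The point is the structure: each site's score depends only on read-only protein data, so the work splits cleanly. A thread pool in CPython will not speed up CPU-bound pure-Python scoring, so a real gain would need processes or compiled code.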
Questions