Using Metacomputing Tools to Facilitate Large Scale Analyses of Biological Databases
Vinay D. Shet
CMSC 838 Presentation
Authors: Allison Waugh, Glenn A. Williams, Liping Wei, and Russ B. Altman

Motivation
  Biological databases are growing at a very high rate
    Protein Data Bank (PDB) increased from 5811 entries to 12110 in three years
  Computational tools are required to access and analyze this data efficiently
  Typical data analyses
    Linear scans across the database searching for a specific feature
    "All-versus-all" comparisons within the database
  High performance distributed computing resources can play an important role in these analyses
  Authors use a distributed computing environment, LEGION, to enable large scale analysis of the PDB

CMSC 838T – Presentation

Motivation
  Similar to the evaluation of the threaded-BLAST project
    We run threaded BLAST on a Sun SMP with 24 processors
    Authors run a program called FEATURE over the LEGION framework
  Can access hundreds of CPUs worldwide
  Can spawn sequential versions of FEATURE on all of them

Talk Overview
  Motivation
  Background
    LEGION
    FEATURE
  Methods
    Experiments
  Results
  Discussions
  Related work
  Observations

Background
  LEGION (Worldwide Virtual Computer)
    Metacomputing environment comprising geographically distributed, heterogeneous collections of workstations and supercomputers
    Connects resources to make up a single, worldwide, virtual computer
    Coordinates a large number of parallel jobs on a mixture of processors (SMPs, MPPs, PCs) on any network
    Legion provides the software infrastructure so that a system of heterogeneous, geographically distributed, high performance machines can interact seamlessly
    No manual installation of binaries across multiple platforms (LEGION does it automatically)

Background
  LEGION
    LAM: MPI implementation for workstation clusters
    Legion supports transparent scheduling, data management, fault tolerance, site autonomy, a single file name space, efficient scheduling, comprehensive resource management, and a wide range of security options

Background
  FEATURE
    Site characterization and recognition system
    A site is a microenvironment distinguished by some structural or functional role
    Identifies functional or structural sites of interest in a query protein

Background
  FEATURE
    Measures spatial distributions of chemical and physical properties to create a statistical model of the microenvironment
    Compares regions of the query protein with known sites and control nonsites, and assigns scores indicating the likelihood that a region is a site
    Produces a list of potential site locations with corresponding scores
    Has been used to recognize ion, ligand, and enzyme binding sites
    FEATURE is a typical data-driven algorithm requiring large data storage and efficient data analysis
    Requires 12 hours on a single processor to evaluate 580 non-redundant PDB entries

Methods
  FEATURE run on all protein entries in the May 2000 PDB
    Searched for potential calcium binding sites
    FEATURE has 90% sensitivity and 100% specificity for this task
  Three experiments conducted
    Sequential scan of a PDB subset using a single processor
    Comprehensive scan of the PDB using the LEGION system with 50 processors
    Set of LEGION runs using a constant PDB subset but a varying number of processors
  Input parameters to FEATURE and the statistical model for Ca remained constant

Methods
  Experiments
    Sequentially scanned an arbitrary 726 proteins from the PDB
      Runs made on a single-processor Sun E450 machine with a 300 MHz UltraSPARC CPU
    Comprehensive scan of all proteins (10,996 total) in the PDB
      Maximum number of processors: 50
      FEATURE code compiled for various platforms so binaries can be run on different machines across LEGION
    Scanned a subset of proteins with a varying number of processors
      Arbitrarily selected 4997 proteins for each run
      Varied the number of processors using values 20, 40, 60, and 80

Results
  FEATURE reported six run-time failures during the sequential run, due to non-standard PDB file formats
  FEATURE also hit run-time assertion failures, illegal instructions, or segmentation faults during the second experiment

Discussion
  FEATURE performance deteriorates once the number of processors exceeds 60
  The optimal maximum number of processors is constrained by
    the client's process table, which keeps track of each LEGION process spawned
    the amount of memory available to support the spawned processes
  Thus even if LEGION contains hundreds of nodes, users cannot use all of them
  LEGION also provides minimal fault tolerance (if any instance fails, the user must wait until everything has finished to re-spawn it)
  Authors maintained a local copy of the database but concede that this is not a realistic situation, as updates to the PDB occur frequently
    A local copy also consumes a lot of disk space

Related Work
  Threaded BLAST and MPI BLAST
    The authors' work is similar to threaded BLAST
    MPI BLAST is a parallelized version of BLAST, so a single query can be split across multiple processors
    FEATURE is not truly parallelized

Observations
  Running CPU-intensive tasks over many processors is definitely useful
  However, LEGION does not scale well: performance degrades beyond 60 processors
  The authors have not exploited true parallelism in FEATURE
    It seems to me that there is a lot of potential to parallelize FEATURE, given that many potential sites can be examined simultaneously
    What would the performance gain of a parallelized version be?

Questions
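
Supplementary sketch
  The Methods and Observations slides describe the LEGION runs as a task farm: the PDB entry list is divided among processors, and each processor runs an unmodified sequential copy of FEATURE on its share. The Python sketch below illustrates that pattern and the ideal run time it would imply. It is not the authors' code and does not use LEGION: `scan_chunk` is a hypothetical stand-in for one FEATURE instance, local threads stand in for remote nodes, and the estimate assumes uniform per-entry cost with no scheduling overhead. The only figures taken from the slides are 12 hours per 580 entries, 10,996 entries, and 50 processors.

```python
import math
from concurrent.futures import ThreadPoolExecutor

# From the slides: 12 hours on one processor for 580 PDB entries.
HOURS_PER_ENTRY = 12 / 580


def partition(entries, workers):
    """Round-robin split into `workers` chunks whose sizes differ by at most one."""
    return [entries[i::workers] for i in range(workers)]


def scan_chunk(chunk):
    """Hypothetical stand-in for one sequential FEATURE instance.

    The real program scores candidate sites; here each entry just gets
    a placeholder score so the example runs on its own.
    """
    return [(pdb_id, len(pdb_id)) for pdb_id in chunk]


def run_farm(entries, workers):
    """Run one sequential scan per chunk concurrently and merge the results."""
    chunks = partition(entries, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = pool.map(scan_chunk, chunks)
    return [hit for hits in per_chunk for hit in hits]


def ideal_wall_hours(n_entries, workers, hours_per_entry=HOURS_PER_ENTRY):
    """Wall-clock hours if every entry costs the same and overhead is zero.

    The slowest worker handles ceil(n_entries / workers) entries.
    """
    return math.ceil(n_entries / workers) * hours_per_entry


if __name__ == "__main__":
    print(run_farm(["1abc", "2xyz", "3def"], workers=2))
    print(f"1 processor:   {ideal_wall_hours(10996, 1):.1f} h")
    print(f"50 processors: {ideal_wall_hours(10996, 50):.1f} h")
```

  Under these idealized assumptions the full scan drops from about 227.5 hours on one processor to about 4.6 hours on 50. The slowdown reported beyond 60 processors is exactly the overhead this estimate leaves out (client process table, memory for spawned processes). Because each instance is still sequential, a single query protein is never split across processors, which is the contrast drawn with MPI BLAST.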