Application-Driven Design for a Large-Scale, Multi-Purpose Grid Infrastructure
Mary Fran Yafchak, maryfran@sura.org
SURA IT Program Coordinator, SURAgrid project manager

About SURA
501(c)3 consortium of research universities
Major programs in:
– Nuclear Physics (“JLab”, www.jlab.org)
– Coastal Science (SCOOP, scoop.sura.org)
– Information Technology
  • Network infrastructure
  • SURAgrid
  • Education & Outreach
SURA Mission:
– Foster excellence in scientific research
– Strengthen the scientific and technical capabilities of the nation and of the Southeast
– Provide outstanding training opportunities for the next generation of scientists and engineers
http://www.sura.org

Scope of the SURA Region
62 diverse member institutions
Geographically, 16 states plus DC
Perspective extends beyond the membership:
– Broader education community
  • Non-SURA higher ed, Minority Serving Institutions, K-12
– Economic development
  • Regional network development
  • Technology transfer
  • Collaboration with the Southern Governors’ Association

About SURAgrid
An open initiative in support of regional strategy and infrastructure development
– Applications of regional impact are key drivers
Designed to support a wide variety of applications
– “Big science”, but “smaller science” is O.K. too!
– Applications beyond those typically expected on grids
  • Instructional use, student exposure
  • Open to what new user communities will bring
– On-ramp to national HPC and CI facilities (e.g., TeraGrid)
Not as easy as building a community- or project-specific grid, but it needs to be done…

About SURAgrid (continued)
Broad view of grid infrastructure
Facilitate seamless sharing of resources within a campus, across related campuses, and between different institutions
– Integrate with other enterprise-wide middleware
– Integrate heterogeneous platforms and resources
– Explore grid-to-grid integration
Support a range of user groups with varying application needs and levels of grid expertise
– Participants include IT developers and support staff, computer scientists, and domain scientists

SURAgrid Goals
To develop scalable infrastructure that leverages local institutional identity and authorization while managing access to shared resources
To promote the use of this infrastructure for the broad research and education community
To provide a forum for participants to share experience with grid technology and to participate in collaborative project development

SURAgrid Vision
SURAgrid institutional resources (e.g., current participants)
Gateways to national cyberinfrastructure (e.g., TeraGrid)
VO or project resources (e.g., SCOOP)
Industry partner coop resources (e.g., IBM partnership)
Other externally funded resources (e.g., group proposals)
Heterogeneous Environment to Meet Diverse User Needs
(Diagram: sample user portals provide project-specific and “MySURAgrid” views, with project-specific tools layered over shared SURAgrid resources and applications.)
SURA regional development:
• Develop and manage partnership relations
• Facilitate collaborative project development
• Orchestrate centralized services and support
• Foster and catalyze application development
• Develop training and education (user, admin)
• Other… (community-driven, over time…)

SURAgrid Participants (as of November 2006)
(Map legend distinguishes SURA members and resources on the grid.)
Bowie State, GMU, UMich, UMD, UKY, UVA, GPN, UArk, Vanderbilt, ODU, UAH, USC, NCState, MCSR, SC, TTU, Clemson, TACC, LATech, UFL, UAB, TAMU, Kennesaw State, LSU, ULL, Tulane, GSU, UNCC

Major Areas of Activity
Grid-Building (gridportal.sura.org)
– Themes: heterogeneity, flexibility, interoperability
Access Management
– Themes: local autonomy, scalability, leveraging enterprise infrastructure
Application Discovery & Deployment
– Themes: broadly useful, inclusive beyond typical users and uses, promoting collaborative work
Outreach & Community
– Themes: sharing experience, incubator for new ideas, fostering scientific and corporate partnerships

SURAgrid Application Strategy
Provide immediate benefit to applications while applications drive infrastructure development
Leverage the initial application set to illustrate benefits and refine deployment
Increase the quantity and diversity of both applications and users
Develop processes for scalable, efficient deployment; assist in “grid-enabling” applications
Efforts significantly bolstered through NSF award: “Creating a Catalyst Application Set for the Development of Large-Scale Multi-purpose Grid Infrastructure” (NSF-OCI-054555)

Creating a Catalyst Application Set
Discovery
– Ongoing methods: meetings, conferences, word of mouth
– Formal survey of SURA members to supplement these methods
Evaluation
– Develop criteria to help prioritize and direct deployment efforts
– Determine readiness to deploy and the tools/assistance required
Implementation
– Exercise and evolve existing deployment and support processes in response to lessons learned
– Document and disseminate lessons learned
– Explore means to assist in grid-enabling applications

Some Application Close-ups
In the SURAgrid demo area today:
– GSU: Multiple Genome Alignment on the Grid – demo’d by Victor Bolet, Art Vandenberg
– UAB: Dynamic BLAST – demo’d by Enis Afgan, John-Paul Robinson
– ODU: Bioelectric Simulator for Whole Body Tissues – demo’d by Mahantesh Halappanavar
– NCState: Simulation-Optimization for Threat Management in Urban Water Systems – demo’d by Sarat Sreepathi
– UNC: Storm Surge Modeling with ADCIRC – demo’d by Howard Lander

GSU Multiple Genome Alignment
Sequence Alignment Problem
Used to determine biologically meaningful relationships among organisms
– Evolutionary information
– Diseases, causes and cures
– Information about a new protein
Especially compute-intensive for long sequences
– Needleman and Wunsch (1970): optimal global alignment
– Smith and Waterman (1981): optimal local alignment
– Taylor (1987): multiple sequence alignment by pairwise alignment
– BLAST trades off optimal results for faster computation
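The pairwise dynamic-programming scoring behind these methods can be sketched in a few lines. The following is a minimal illustration of Needleman-Wunsch global alignment scoring, not the GSU code; the scoring values and test sequences are illustrative only.

```python
# Minimal sketch of Needleman-Wunsch global alignment scoring.
# Illustrative scores (match = +1, mismatch = -1, gap = -2); not the GSU code.
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score for sequences x and y."""
    rows, cols = len(x) + 1, len(y) + 1
    # Score matrix, with cumulative gap penalties along the borders.
    sm = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        sm[i][0] = i * gap
    for j in range(cols):
        sm[0][j] = j * gap
    # Each cell is the best of: diagonal (match/mismatch), up (gap), left (gap).
    for i in range(1, rows):
        for j in range(1, cols):
            diag = sm[i - 1][j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            sm[i][j] = max(diag, sm[i - 1][j] + gap, sm[i][j - 1] + gap)
    return sm[rows - 1][cols - 1]

print(needleman_wunsch_score("TGATGGAGGT", "GATAGG"))
```

A back-trace over the same matrix recovers the aligned sequences themselves, as in the similarity-matrix example that follows.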
Examples of Genome Alignment
Alignment 1
Sequence X: A T A – A G T
Sequence Y: A T G C A G T
Score:      1 1 -1 -2 1 1 1
Total score = 2
Alignment 2
Sequence X: A T A A G T
Sequence Y: A T G C A G T
Score:      1 1 -1 -1 -1 -1 -1
Total score = -3

Based on pairwise algorithm
– Similarity Matrix (SM) built to compare all sequence positions
– SM reduced by storing only non-zero elements
  • Observation that many “alignment scores” are zero-valued
  • Row and column information stored along with each value
  • Block of memory dynamically allocated as each non-zero element is found
  • Data structure used to access the allocated blocks
– Parallelism introduced to reduce computation
Ahmed, N., Pan, Y., Vandenberg, A. and Sun, Y., “Parallel Algorithm for Multiple Genome Alignment on the Grid Environment,” 6th Intl. Workshop on Parallel and Distributed Scientific and Engineering Computing (PDSEC-05), held in conjunction with IPDPS 2005, April 4-8, 2005.

Similarity Matrix Generation
Align Sequence X: TGATGGAGGT with Sequence Y: GATAGG
1 = matching; 0 = non-matching
ss = substitution score; gp = gap score
Generate SM: each entry is the maximum score with respect to its neighbors
Trace sequences: back-trace the matrix to find sequence matches

Parallel distribution of multiple sequences
(Diagram: sequence sets, e.g. sequences 1-6 and 7-12, are divided into pairs (Seq 1-2, Seq 3-4, Seq 5-6) and distributed across nodes for pairwise alignment.)

Convergence – collaboration
Algorithm implementation
– Nova Ahmed, Masters CS (now PhD student, GT)
– Dr. Yi Pan, Chair, Computer Science
NMI Integration Testbed program
– Georgia State: Art Vandenberg, Victor Bolet, Chao “Bill” Xie, Dharam Damani, et al.
– University of Alabama at Birmingham: John-Paul Robinson, Pravin Joshi, Jill Gemmill
SURAgrid
– Looking for applications to demonstrate value

Algorithm Validation: Shared Memory
SGI Origin 2000: 24 × 250 MHz R10000; 4 GB memory
Limitations
– Memory (max sequence is 2000 × 2000)
– Processors (policy limits a student to 12 processors)
– Not scalable
(Chart: computation time on shared memory vs. number of processors, 2-12.)
Performance validates the algorithm: computation time decreases as the number of processors increases

Shared Memory vs. Cluster, Grid
Cluster*: UAB cluster, 8-node Beowulf (550 MHz Pentium III; 512 MB RAM)
Clusters retain the algorithm’s improvement
(Charts: computation time in seconds vs. number of processors, 2-26, for genome lengths 3000 and 10000, comparing grid, cluster, and shared memory.)
* NB: Comparing clusters with shared memory is, of course, relative; the systems are distinctly different.
Grid (Globus, MPICH-G2) overhead is negligible
Advantages of a grid-enabled cluster:
– Scalable: can add new cluster nodes to the grid
– Easier job submission: don’t need an account on every node
– Scheduling is easier: can submit multiple jobs at one time
(Charts: computation time and speed-up, 1 CPU / N CPUs, vs. number of processors for single cluster, single clustered grid, and multi-clustered grid; 9 processors were available in the multi-clustered grid, 32 processors in the other configurations.)
Interesting: when multiple clusters were used (the application spanned three separate clusters), performance improved even further
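To make the parallel distribution of sequence pairs concrete, here is a minimal master-worker sketch in Python using mpi4py. The real runs used a C/MPI implementation over MPICH-G2 and Globus; the align_pair stub and the pair list below are placeholders, not the GSU code.

```python
# Minimal mpi4py sketch: scatter sequence pairs across processes and gather
# the alignment scores; a stand-in for the MPICH-G2-based C implementation.
from mpi4py import MPI

def align_pair(x, y):
    # Placeholder for the pairwise similarity-matrix alignment step.
    return 0

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

if rank == 0:
    # Hypothetical work list: one entry per sequence pair to be aligned.
    pairs = [("TGATGGAGGT", "GATAGG")] * 12
    chunks = [pairs[i::size] for i in range(size)]   # one chunk per rank
else:
    chunks = None

local_pairs = comm.scatter(chunks, root=0)            # distribute the pairs
local_scores = [align_pair(x, y) for x, y in local_pairs]
all_scores = comm.gather(local_scores, root=0)        # collect the results

if rank == 0:
    print("collected", sum(len(s) for s in all_scores), "alignment scores")
```

Launched with, for example, `mpiexec -n 4 python align_pairs.py`; under MPICH-G2 the same program spans processors on multiple grid-connected clusters.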
Grid tools used
Globus Toolkit, built on the Open Grid Services Architecture (OGSA)
Nexus: communication library that allows multi-method communication with a single API for a wide range of protocols
MPICH-G2, a Message Passing Interface implementation that uses Nexus, was used in the experiments
Resource Specification Language (RSL): job submission and execution (globus-job-submit, globus-job-run) and status (globus-job-status)

The Grid-enabling Story
Iterative, evolutionary, collaborative
– 1st: ssh to the resource and get the code working
– 2nd: submit from a local account to a remote Globus machine
– 3rd: run from the SURAgrid portal
SURAgrid infrastructure components provide an improved workflow
Integration with campus components enables more seamless access
The overall structure can be used as a model for campus research infrastructure:
– Integrated authentication/authorization
– Portals for applications
– Grid administration/configuration support

SURAgrid Portal (screenshot)

SURAgrid MyProxy service (screenshots)
Get Proxy via MyProxy… a secure grid credential
GSI proxy credentials are loaded into your account…

SURAgrid Portal file transfer (screenshot)

Job submission via Portal (screenshots)

Output retrieved (screenshots)

SURAgrid Account Management: list my users (screenshots)

SURAgrid Account Management: add user (screenshot)

Multiple Genome Alignment & SURAgrid
Collaborative cooperation
Convergence of opportunity
Application and infrastructure drivers interact
Emergent applications:
– Cosmic ray simulation (Dr. Xiaochun He)
– Classification/clustering (Dr. Vijay Vaishnavi, Art Vandenberg)
– Muon detector grid (Dr. Xiaochun He)
– Neuron (Dr. Paul Katz, Dr. Robert Calin-Jageman, Chao “Bill” Xie)
– AnimatLab (Dr. Don Edwards, Dr. Ying Zhu, David Cofer, James Reid)

IBM System p5 575 with Power5+ Processors

BioSim: Bio-electric Simulator for Whole Body Tissues
Numerical simulations for electrostimulation of tissues and whole-body biomodels
Predicts spatial and time-dependent currents and voltages in part- or whole-body biomodels
Numerous diagnostic and therapeutic applications, e.g., neurogenesis, cancer treatment
Fast, parallelized computational approach

Simulation Models
From an electrical standpoint, tissues are characterized by conductivities and permittivities
The whole body is discretized within a cubic simulation volume as a Cartesian grid of points along the three axes
Thus each node has at most a total of six nearest neighbors
(Figure: whole-body model; dimensions in millimeters.)
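As a concrete illustration of this discretization, the sketch below maps Cartesian grid points to linear matrix indices and enumerates each node's (at most six) nearest neighbors. The grid dimensions follow the 100 × 100 × 10 example used later in the memory estimate; this is an illustration, not the BioSim code.

```python
# Minimal sketch of the Cartesian discretization: each grid point gets a
# linear index, and interior points have exactly six nearest neighbors.
# Grid dimensions are the 100 x 100 x 10 example; not the BioSim code itself.
NX, NY, NZ = 100, 100, 10          # grid points along the three axes

def index(i, j, k):
    """Map grid point (i, j, k) to a single row/column index of the matrix."""
    return (k * NY + j) * NX + i

def neighbors(i, j, k):
    """Return linear indices of the (up to six) nearest neighbors of a node."""
    offsets = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))
    result = []
    for di, dj, dk in offsets:
        ni, nj, nk = i + di, j + dj, k + dk
        if 0 <= ni < NX and 0 <= nj < NY and 0 <= nk < NZ:
            result.append(index(ni, nj, nk))
    return result

# Interior nodes have six neighbors; faces, edges and corners have fewer,
# which is where boundary conditions are imposed locally.
print(len(neighbors(50, 50, 5)), len(neighbors(0, 0, 0)))   # 6 3
```

Each neighbor relation corresponds to an off-diagonal coupling in the sparse system matrix [M] described next, which is why [M] has only a handful of non-zeros per row.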
Numerical Models
Kirchhoff’s node analysis at each grid point (ε and σ are the local permittivity and conductivity, A the face area between neighboring nodes, L the node spacing):
  [ (εA/L)·d{V}/dt + (σA/L)·{V} ] = 0
Recast so the matrix is computed only once:
  [M][ V|t+Δt − V|t ] = [B(t)]
For large models, matrix inversion is intractable
LU decomposition of the matrix

Numerical Models (continued)
Voltage: user-specified, time-dependent waveform
Boundary conditions imposed locally
Actual data used for conductivity and permittivity
Results in an extremely sparse (asymmetric) matrix [M]
(Chart: total elements in the matrix vs. non-zero values.)

The Landscape of Sparse Ax=b Solvers
Direct (A = LU): pivoting LU for nonsymmetric systems; Cholesky for symmetric positive definite systems
Iterative (y′ = Ay): GMRES, QMR, … for nonsymmetric systems; conjugate gradient for symmetric positive definite systems
(Figure axes: “more robust”, “more general”, “more robust / less storage”.)
Source: John Gilbert, Sparse Matrix Days in MIT 18.337

LU Decomposition
(Illustrative figures; source: Florin Dobrian.)

Computational Complexity
100 × 100 × 10 nodes: ~75 GB of memory if the matrix were stored dense (8-byte floating precision)
Sparse data structure: ~6 MB (in our case)
Sparse direct solver: SuperLU-DIST
– Xiaoye S. Li and James W. Demmel, “SuperLU-DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems”, ACM Trans. Mathematical Software, June 2003, Volume 29, Number 2, Pages 110-140.
Fill-reducing orderings with METIS
– G. Karypis and V. Kumar, “A fast and high quality multilevel scheme for partitioning irregular graphs”, SIAM Journal on Scientific Computing, 1999, Volume 20, Number 1.

Performance on compute clusters
(Chart: time in seconds for the 144,000-node rat model; average iteration time vs. factorization time.)

Output: Visualization with MATLAB
Potential profile at a depth of 12 mm
Simulated potential evolution along the entire 51-mm width of the rat model

Deployment on Mileva
4-node cluster dedicated for SURAgrid purposes
Authentication
– ODU Root CA
– Cross-certification with the SURA Bridge
– Compatibility of accounts for ODU users
Authorization and accounting
Initial goals:
– Develop larger whole-body models with greater resolution
– Scalability tests

Grid Workflow
Establish user accounts for ODU users
– SURAgrid central user authentication and authorization system
– Off-line/customized (e.g., USC)
Manually launch jobs based on the remote resource
– SSH/GSISSH/SURAgrid Portal
– PBS/LSF/SGE
Transfer files
– SCP/GSISCP/SURAgrid Portal

Conclusions
Science:
– Electrostimulation has a variety of diagnostic and therapeutic applications
– While numerical simulations provide many advantages over real experiments, they can be very arduous
Grid enabling:
– New possibilities with grid computing
– Grid-enabling an application is complex and time consuming
– Security is nontrivial

Future Steps
Grid-enabling BioSim
– Explore alternatives for grid-enabling BioSim
– Establish new collaborations
– Scalability experiments with large compute clusters accessible via SURAgrid
Future applications:
– Molecular and cellular dynamics
– Computational nano-electronics
– Tools: Gromacs, DL-POLY, LAMMPS

References and Contacts
A. Mishra, R. Joshi, K. Schoenbach and C. Clark, “A Fast Parallelized Computational Approach Based on Sparse LU Factorization for Predictions of Spatial and Time-Dependent Currents and Voltages in Full-Body Biomodels”, IEEE Trans. Plasma Science, August 2006, Volume 34, Number 4.
http://www.lions.odu.edu/~rjoshi/
Ravindra Joshi, Ashutosh Mishra, Mike Sachon, Mahantesh Halappanavar – (rjoshi, amishra, msachon, mhalappa)@odu.edu
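As an illustration of the factor-once, solve-per-time-step pattern described under "Numerical Models" above, the following sketch uses SciPy's serial SuperLU interface as a stand-in for SuperLU-DIST; the matrix and source term are small, hypothetical placeholders rather than the BioSim whole-body model.

```python
# Minimal sketch of the BioSim solution pattern: factor the sparse system
# matrix [M] once, then back-substitute for each time step's source term B(t).
# scipy's splu (serial SuperLU) stands in for SuperLU-DIST; the matrix and
# source term below are placeholders, not the whole-body model.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import splu

n = 1000                                    # hypothetical number of grid nodes
# Placeholder sparse matrix standing in for [M] (CSC format, as splu expects).
M = sp.diags([-1.0, 4.0, -1.0], offsets=[-1, 0, 1], shape=(n, n), format="csc")

lu = splu(M)                                # LU factorization, computed once

V = np.zeros(n)                             # node voltages
for step in range(100):
    b = np.ones(n) * np.sin(0.01 * step)    # placeholder for the source term B(t)
    dV = lu.solve(b)                        # one cheap back-substitution per step
    V += dV                                 # advance V|t -> V|t+dt
```

The expensive factorization happens a single time, so the per-step cost is dominated by the back-substitution, which is what makes long time-dependent simulations of large sparse models feasible.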
UAB and SURAgrid
– Background
– UAB’s approach to grid deployment
– Specific testbed application

UAB and SURAgrid Background
The University of Alabama at Birmingham (UAB) is a leading medical research campus in the Southeast
UAB was a participant in the National Science Foundation Middleware Initiative Testbed (NMI Testbed) in 2002-2004
Evaluation of grid technologies in the NMI Testbed sparked an effort to build a campus grid: UABgrid
Collaboration with NMI Testbed team members leveraged synergistic interest in grid deployment

Managing Grid Middleware at UAB
Over 10 years of experience operating an authoritative campus identity management system
All members of the UAB community have a self-selected network identity that serves as an authoritative identity for human resource records, desktop systems, and many other systems on campus
Desire to leverage the same identity infrastructure for grid systems

UABgrid Identity Management
Leverage Pubcookie as the WebSSO system providing a web-based interface to the authoritative, LDAP-based identity management system
Create a UABgrid CA to transparently assign grid credentials using a web-based interface
Establish access to UABgrid resources based on UABgrid PKI identity
Establish access to SURAgrid resources based on cross-certification of the UABgrid CA with the SURAgrid Bridge CA

UABgrid Workflow
Establish grid PKI credentials for UAB users
– UAB campus identity management system
– UABgrid CA cross-certified with the SURAgrid Bridge CA
Launch jobs via an application-specific web interface
– User authenticates via campus WebSSO
– Transparent initialization of grid credentials
– Submits job request
Collect results via the application’s web interface

Engaging Campus Researchers
Empower campus research applications with grid technologies
Hide details of grid middleware operation
Select an application specific to the needs of campus researchers

BLAST
Basic Local Alignment Search Tool (BLAST) is an established sequence analysis tool that performs similarity searches by calculating the statistical significance of matches between a short query sequence and a large database of infrequently changing information, such as DNA and amino acid sequences
Used by numerous scientists in gene discovery, categorization, and simulation
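In its simplest form a BLAST search is a single command-line invocation; the sketch below wraps the classic NCBI blastall command from Python. The query file, database name, and output path are placeholders, and this is an illustration rather than the UAB setup.

```python
# Minimal sketch of invoking one BLAST similarity search from Python using the
# classic NCBI blastall command line; file names and the database name are
# placeholders, not the UAB configuration.
import subprocess

def run_blast(query_path, db_name, out_path, program="blastn"):
    """Run one BLAST search of query_path against db_name, writing out_path."""
    cmd = [
        "blastall",
        "-p", program,       # BLAST program (blastn, blastp, ...)
        "-d", db_name,       # formatted sequence database
        "-i", query_path,    # query sequences in FASTA format
        "-o", out_path,      # report file
    ]
    subprocess.run(cmd, check=True)

run_blast("queries.fasta", "yeast.nt", "results.txt")
```

Dynamic BLAST, described next, distributes many such searches across grid resources rather than running them one at a time on a single machine.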
BLAST Algorithms
With the rapid development of sequencing technology for the large genomes of several species, the sequence databases have been growing at exponential rates
Parallel techniques for speeding up BLAST searches:
– Database-splitting BLAST (mpiBLAST, TurboBLAST, psiBLAST)
– Query-splitting BLAST (SS-Wrapper)

Goals of Grid Applications
Use distributed, dynamic, heterogeneous, and unreliable resources without direct user intervention
Adapt to environment changes
Provide a high level of usability while at least maintaining performance levels

Dynamic BLAST Goals
Complement BLAST with the power of the grid
Master-worker application allowing for dynamic load balancing and variable resource availability, based on grid middleware
Focuses on using small, distributed, readily available resources (cycle harvesting)
Works with existing BLAST algorithm(s)

Dynamic BLAST Architecture
(Diagram: a web interface and the Dynamic BLAST master, built on GridDeploy and GridWay, dispatch work through GT4 to resources running SGE, PBS, or no local scheduler, each executing BLAST; supporting components include MySQL, Perl, and Java scripts.)

Dynamic BLAST Workflow
(Diagram: file parsing and fragment creation, submission of grid jobs, job monitoring, then data analysis and post-processing of the results.)

Dynamic BLAST Performance
• Comparison of execution time among query-splitting BLAST (QS-BLAST), mpiBLAST, and Dynamic BLAST for searching 1,000 queries against the yeast.nt database (chart: time in seconds)
• Comparison of execution time of Dynamic BLAST on a variable number of nodes (8, 16, 26) for 1,000 queries against the yeast.nt database (chart: time in seconds)

Deployment on SURAgrid
Incorporate the ODU resource Mileva: a 4-node cluster dedicated for SURAgrid purposes
Local UAB BLAST web interface
– Leverage existing UAB identity infrastructure
– Cross-certification with the SURA Bridge CA
Authorization on a per-user basis
Initial goals:
– Solidify the execution requirements of Dynamic BLAST
– Perform scalability tests
– Engage researchers further in the promise of grid computing

SURAgrid Experience
Combines local needs with regional needs
No need to build a campus grid in isolation
Synergistic collaborations
Leads to solid architectural foundations
Leads to solid documentation
Extends practical experience for UAB students

Next Steps
Improve availability of the web interface to non-UAB users
– Leverage Shibboleth and GridShib to federate the web interface
– Conversation begun on sharing the technology foundation with SURAgrid
Extend the number of resources running BLAST
– ODU integration helps document the steps and is a model for additional sites
Explore other applications relevant to the UAB research community

References and Contacts
Purushotham Bangalore, Computer and Information Sciences, UAB – puri@uab.edu
Enis Afgan, Computer and Information Sciences, UAB – afgane@uab.edu
John-Paul Robinson, HPC Services, UAB IT – jpr@uab.edu
http://uabgrid.uab.edu/dynamicblast

SURAgrid Application Demos
Stop by the demo floor for a closer look at these applications and more!
Application demo team: Mahantesh Halappanavar (ODU), Sarat Sreepathi (NCSU), Purushotham Bangalore (UAB), Enis Afgan (UAB), John-Paul Robinson (UAB), Howard Lander (RENCI), Victor Bolet (GSU), Art Vandenberg (GSU), Kate Barzee (SURA), Mary Fran Yafchak (SURA)

Questions or comments?
For more information, or to join SURAgrid:
http://www.sura.org/SURAgrid
maryfran@sura.org