Integration of the Biological Databases into Grid-Portal Environments
Michal Kosiedowski, Michal Malecki, Cezary Mazurek, Pawel Spychala, Marcin Wolski

Agenda
• Introduction
• PROGRESS Grid-Portal Environment
• Data Management System
• Enabling SRS resources within DMS
• Case study
• Conclusions

R&D Center
• PSNC was established in 1993 and is an R&D Center in:
  – New Generation Networks
    • POZMAN and PIONIER networks
    • 6-NET, ATRIUM, Muppet
  – HPC and Grids
    • GRIDLAB, CROSSGRID, VLAB, PROGRESS, Clusterix, HPCEuropa
  – Portals and Content Management Tools
    • Polish Educational Portal "Interkl@sa", Multimedia City Guide, Digital Library Framework, Interactive TV

PROGRESS (1)
• Project Partners
  – PSNC IBCh Poznan
  – SUN Microsystems Poland
  – Cyfronet AMM, Krakow
  – Technical University Lodz
• Co-funded by The State Committee for Scientific Research (KBN) and SUN Microsystems Poland

PROGRESS (2)
• The PROGRESS project produced a set of open source tools for use by:
  – Grid constructors
  – Computational applications developers
  – Computing portals operators

PROGRESS (3)
• Cluster of 80 processors
• Networked Storage of 1.3 TB
• Software: ORACLE, HPC Cluster Tools, Sun ONE, Sun Grid Engine, Globus
[Figure labels: Gdańsk, Wrocław]

Data Management Issues
• Hide the data management complexity from the end users
• Use new standards defined by grid organizations
• Co-operate with different kinds of client applications
• Provide seamless access to data and information for grid computing
• Enable intuitive and efficient methods for resource exploration
• Provide a friendly interface to data management for administrators and scientists

[Architecture diagram labels: PROGRESS Web Services and PROGRESS Portlets; Grid Service Provider; WS; Data Management; Grid Resource Broker]

Data Management System
• A distributed system enabling the management of grid data files
• Stores files in distributed storage modules of various types: generic filesystems, archivers, relational databases
• Uses metadata to describe files
• Allows access to data banks
such as a mirror of the Sequence Retrieval System (SRS)
• Exposes its functionality within the Data Broker Service

DMS Functionality
• Virtual file system keeping the data organized in a tree structure
  – Metadirectories: organize other objects into a hierarchy
  – Metafiles: represent a logical view of computational data regardless of their physical location
• DMS provides its services in the form of a Web Services API (Data Broker Service)

DMS Functionality
• Web Services interface: storing, accessing, describing and delivering data
  – directory mgmt.: e.g. add, remove and rename directories, retrieve root and current path, change path
  – file mgmt.: e.g. add, remove and rename files; add, remove and retrieve physical file locations
  – metadata mgmt.: e.g. retrieve the list of schemes and attributes, assign schemes to files and edit values
  – external datasource mgmt.: e.g. retrieving databank content, resolving entries, exploring databanks

DMS Architecture

Data Broker
• Serves as an interface (Web Services) for external clients, such as the HPC Portal and the grid resource broker
• Mediates in the flow of all requests directed to the DMS
• Authorizes the client that submitted the request

Metadata Repository
• Central and single point of metadata management
• Responsible for all metadata operations and their storage and maintenance
• It stores the following sorts of information:
  – metadata about resources: data files, their physical locations and the possible ways to access them
  – metadata about rights: all information related to the rights (users, their groups, access rights)
  – metadata describing the standards for file description, e.g.
Dublin Core (DC)
  – metadata about services: data brokers, data containers

Data Container
• Enables access to physical data
• Data can be stored on various media types
• Data can be organized as files on generic filesystems, BLOBs in databases or files on data tapes
• All Containers possess a uniform interface regardless of the media types they manage
• A Container does not perform file transfers itself; it uses external services such as ftp, https, gass and gridftp

Proxy (SRS Container)
• Enables access to external scientific databases
• Includes both Repository (listing entries, retrieving attached metadata, building queries) and Data Container (downloading files) functionality
• DMS treats the Proxy as a separate, independent module that manages read-only data
• In the PROGRESS grid-portal environment, the Proxy (named SRS Container) enables access to SRS resources

Administrative Portal
• A web application that lets users handle grid data management from a web browser
• An intuitive interface for executing a superset of DMS services
• An effective way to explore huge SRS resources
• On-line help

SRS Resources in PSNC
• Genbank
  – Release (about 32 million entries)
  – Updates (about 2 million entries)
• EMBL – European Molecular Biology Laboratory
  – Release (about 42 million entries)
  – Updates (about 2 million entries)
• PDB – Protein Data Bank
• Swissprot
  – Swissprot Release, Swissprot New, SPTREMBL, REMTREMBL

SRS Installation
• Installation uses multiple storage resources
• Data access interface delivered via a common portal (srs.man.poznan.pl)
• Administrative tasks (retrieval and data preparation) split across multiple machines
• Parallel data retrieval from remote resources
• Offline data indexing and packing on a computational machine (0.5 TB storage)
• Compressed online data (2 × 250 GB storage)

SRS Installation - Schema
[Diagram labels: bellis-e.man.poznan.pl, storage 02, offline, online, indexing, offindex, flatfiles, viola.man.poznan.pl, storage 01, index, SRS, srs.man.poznan.pl, flatfiles]

SRS Container
• Using
shell-based access to the SRS
  – SRS operations are sent via a shell command
• Access interface based on Web Services
  – Internal functionality delivered using SOAP communication
• Data access via the ftp, gsiftp and gass protocols
  – Data are accessed using external file servers integrated with the SRS module
• Advanced caching system
  – Databanks and entries are cached and reused in subsequent user requests

Portal Interface – databanks list

Portal Interface – databank content

Portal Interface – searching

Portal Interface – search results

Portal Interface – copying entries

Portal Interface – file properties

DMS Installation Requirements
• Java virtual machine: Java(TM) 2 Runtime Environment, Standard Edition 1.4.1 or higher is recommended
• Database server: DMS works with the Oracle and PostgreSQL engines:
  – Oracle: Oracle8i or higher recommended
  – PostgreSQL: version 7.3 or higher is required, with the additional extensions:
    • chkpass and tablefunc from the contrib package
    • plpgsql support

Usage scenario: PROGRESS HPC Portal
• SRS resources can be used as input for grid jobs created, configured and submitted for execution in the grid with the use of the PROGRESS HPC Portal
• An example application is AminSim (amino acid sequence similarity), developed by Prof. Jacek Blazewicz's group at the Institute of Computing Science, Poznan University of Technology

AminSim portlet (1)

AminSim portlet (2)

AminSim portlet (3)

AminSim portlet (4)

Conclusions
• SRS resources have been integrated with the distributed file structure of DMS and enabled for use within a grid-portal environment (PROGRESS HPC Portal)
• A web interface (DMS Portal) enhances the efficiency of SRS resource exploration:
  – fast copying of interesting entries directly to the user's home directory
  – merging files
  – saving files in various formats (e.g.
FASTA)
• The universal access layer to the scientific databases may be successfully used to connect other data sources to the Data Management System

Contact info
• Check http://dms.progress.psnc.pl for more information about DMS
• Check http://dms.progress.psnc.pl/docs/demo.htm for the DMS Portal demo
• Check http://progress.psnc.pl for more information about PROGRESS
• Download it now: http://progress.psnc.pl
• Mail DMS team: szd@man.poznan.pl
• Mail PROGRESS team: progress@man.poznan.pl
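
Sketch: DMS virtual file system
The DMS Functionality slides describe a virtual file system in which metadirectories form a tree and metafiles give a logical view of data regardless of where the physical copies are stored. The Python sketch below is only an illustration of that idea, not the DMS implementation or its Web Services API. All class names, the `resolve` helper, the `dc:format` attribute key and the `storage.example.org` URL are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Metafile:
    """Logical view of a data file; physical copies may live anywhere."""
    name: str
    # Physical locations of the file (hypothetical URL scheme and host).
    locations: List[str] = field(default_factory=list)
    # Descriptive attributes, e.g. following a scheme such as Dublin Core.
    metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class Metadirectory:
    """Groups metafiles and other metadirectories into a tree."""
    name: str
    children: Dict[str, Union["Metadirectory", Metafile]] = field(default_factory=dict)

    def add(self, node):
        """Attach a child node and return it; names must be unique."""
        if node.name in self.children:
            raise ValueError(f"{node.name!r} already exists in {self.name!r}")
        self.children[node.name] = node
        return node


def resolve(root, path):
    """Walk a '/'-separated logical path from the root and return the node."""
    node = root
    for part in (p for p in path.split("/") if p):
        if not isinstance(node, Metadirectory) or part not in node.children:
            raise KeyError(path)
        node = node.children[part]
    return node


# Build a tiny tree: /home/entry.fasta with one physical location.
root = Metadirectory("")
home = root.add(Metadirectory("home"))
entry = home.add(Metafile("entry.fasta"))
entry.locations.append("gridftp://storage.example.org/data/0001")  # hypothetical
entry.metadata["dc:format"] = "FASTA"  # hypothetical attribute key

print(resolve(root, "/home/entry.fasta").locations)
```

The logical path stays the same even if physical locations are added or removed, which is the property the slides attribute to metafiles.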