Integration of the Biological Databases into Grid-Portal Environments
Michal Kosiedowski, Michal Malecki, Cezary Mazurek, Pawel Spychala, Marcin Wolski
Agenda
• Introduction
• PROGRESS Grid-Portal Environment
• Data Management System
• Enabling SRS resources within DMS
• Case study
• Conclusions
R&D Center
• PSNC was established in 1993
and is an R&D Center in:
– New Generation Networks
• POZMAN and PIONIER networks
• 6-NET, ATRIUM, Muppet
– HPC and Grids
• GRIDLAB, CROSSGRID, VLAB,
PROGRESS, Clusterix, HPCEuropa
– Portals and Content Management Tools
• Polish Educational Portal "Interkl@sa",
Multimedia City Guide,
Digital Library Framework,
Interactive TV
PROGRESS (1)
• Project Partners
– PSNC IBCh Poznan
– SUN Microsystems Poland
– Cyfronet AMM, Krakow
– Technical University Lodz
• Co-funded by The State Committee for
Scientific Research (KBN) and SUN
Microsystems Poland
PROGRESS (2)
• The PROGRESS project produced
a set of open source tools for use
by:
– Grid constructors
– Computational applications
developers
– Computing portals operators
PROGRESS (3)
• Cluster of 80
processors
• Networked Storage
of 1.3 TB
• Software: ORACLE,
HPC Cluster Tools,
Sun ONE, Sun Grid
Engine, Globus
[Figure: map; visible labels: Gdańsk, Wrocław]
Data Management Issues
• Hide the data management complexity from the end
users
• Use new standards defined by grid organizations
• Co-operate with different kinds of client applications
• Provide seamless access to data and information for
grid computing
• Enable intuitive and efficient methods for resource
exploration
• Provide a friendly interface to data management for
administrators and scientists
PROGRESS
[Diagram; labels: PROGRESS Web Services and PROGRESS Portlets, Grid Service Provider, WS, Data Management, Grid Resource Broker]
Data Management System
• A distributed system enabling the
management of grid data files
• Stores files in distributed storage modules of
various types: generic filesystems, archivers,
relational databases
• Uses metadata to describe files
• Provides access to external data banks, such as a mirror of
the Sequence Retrieval System (SRS)
• Exposes its functionality through the Data
Broker Service
DMS Functionality
• Virtual file system keeping the data
organized in a tree structure
– Metadirectories – organize other objects into a hierarchy
– Metafiles – represent a logical view of
computational data regardless of their
physical location
• DMS provides its services in the form of
a Web Services API (Data Broker
Service)
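The tree of metadirectories and metafiles can be pictured with a minimal sketch. All class names, method names and URLs below are hypothetical illustrations and are not taken from the DMS code base; the sketch only shows how a logical path can stay stable while a file has several physical locations.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Metafile:
    """Logical view of a data file; physical copies may live anywhere."""
    name: str
    locations: List[str] = field(default_factory=list)  # physical URLs


@dataclass
class Metadirectory:
    """Node that organizes metafiles and other metadirectories."""
    name: str
    children: Dict[str, Union["Metadirectory", Metafile]] = field(default_factory=dict)

    def add(self, node):
        """Attach a child node and return it for convenient chaining."""
        self.children[node.name] = node
        return node

    def resolve(self, path: str):
        """Walk a '/'-separated logical path starting at this directory."""
        node = self
        for part in [p for p in path.split("/") if p]:
            if not isinstance(node, Metadirectory) or part not in node.children:
                raise KeyError(path)
            node = node.children[part]
        return node


# Hypothetical example tree: one logical file with two physical copies.
root = Metadirectory("")
home = root.add(Metadirectory("home"))
seq = home.add(Metafile("sequences.fasta"))
seq.locations.append("gsiftp://storage01.example.org/data/0001")
seq.locations.append("ftp://storage02.example.org/pool/a7")
```

A client only ever sees `/home/sequences.fasta`; which physical copy is used is decided when the file is delivered.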
DMS Functionality
• Web Services interface: storing, accessing, describing
and delivering data
– directory mgmt.: e.g. add, remove and rename directories,
retrieve root and current path, change path
– file mgmt.: e.g. add, remove and rename files; add, remove
and retrieve physical file locations
– metadata mgmt.: e.g. retrieve the list of schemes and attributes,
assign schemes to files and edit values
– external datasource mgmt.: e.g. retrieve databank content,
resolve entries, explore databanks
DMS Architecture
Data Broker
• Serves as an interface (Web Services)
for external clients, such as the HPC
Portal and the grid resource broker
• Mediates all requests
directed to the DMS
• Authorizes the client that submitted the
request
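A rough sketch of the broker role described above: one entry point that first authorizes the client and only then forwards the request. The permission model, the operation name `list_directory` and the client identifier `hpc-portal` are assumptions made for illustration; the real Data Broker is a Web Service and its interface is not reproduced here.

```python
class AuthorizationError(Exception):
    """Raised when a client is not allowed to call an operation."""


class DataBroker:
    """Single entry point: authorizes the caller, then forwards the request."""

    def __init__(self, permissions):
        # permissions: {client_id: set of allowed operation names}
        # (a hypothetical model, chosen only to keep the sketch small)
        self._permissions = permissions
        self._handlers = {}

    def register(self, operation, handler):
        """Bind an operation name to the internal component that serves it."""
        self._handlers[operation] = handler

    def handle(self, client_id, operation, **params):
        """Check the caller's rights first, then dispatch to the handler."""
        if operation not in self._permissions.get(client_id, set()):
            raise AuthorizationError(f"{client_id} may not call {operation}")
        return self._handlers[operation](**params)


# Hypothetical wiring: the portal may list directories, nothing else.
directories = {"/": ["home"]}
broker = DataBroker({"hpc-portal": {"list_directory"}})
broker.register("list_directory", lambda path: directories[path])
```

Because every request passes through `handle`, authorization is enforced in one place rather than in each internal module.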
Metadata Repository
• Central and single point of metadata management
• Responsible for all metadata operations and their
storage and maintenance
• Stores the following kinds of information:
– metadata about resources: data files, their physical locations
and the ways to access them
– metadata about rights: all information related to rights, i.e.
users, their groups and access rights
– metadata describing the standards for file description, e.g.
Dublin Core (DC)
– metadata about services: data brokers, data containers
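As an illustration of describing files with a standard scheme, the sketch below assigns the Dublin Core scheme to a file and rejects attributes the scheme does not define. The fifteen element names are those of the Dublin Core Metadata Element Set; the `MetadataRepository` class, its methods and the file identifier are hypothetical and do not reflect the DMS storage layout.

```python
# The fifteen elements of the Dublin Core Metadata Element Set.
DUBLIN_CORE = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


class MetadataRepository:
    """In-memory stand-in for a central metadata store (illustrative only)."""

    def __init__(self):
        self._schemes = {"DC": DUBLIN_CORE}
        self._records = {}  # (file_id, scheme) -> {attribute: value}

    def assign_scheme(self, file_id, scheme):
        """Attach a known description scheme to a file."""
        if scheme not in self._schemes:
            raise KeyError(scheme)
        self._records.setdefault((file_id, scheme), {})

    def set_value(self, file_id, scheme, attribute, value):
        """Edit one attribute value; unknown attributes are rejected."""
        if attribute not in self._schemes[scheme]:
            raise ValueError(f"{attribute!r} is not defined by scheme {scheme}")
        self._records[(file_id, scheme)][attribute] = value

    def describe(self, file_id, scheme):
        """Return a copy of the values stored for a file under a scheme."""
        return dict(self._records[(file_id, scheme)])


repo = MetadataRepository()
repo.assign_scheme("file-42", "DC")
repo.set_value("file-42", "DC", "title", "Amino acid sequences")
repo.set_value("file-42", "DC", "format", "text/plain")
```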
Data Container
• Enables access to physical data
• Data can be stored on various media types
• Data can be organized as files on generic
filesystems, BLOBs in databases or files on data
tapes
• All Containers expose a uniform interface regardless
of the media types they manage
• A Container does not perform file transfers itself; it relies on
external services such as FTP, HTTPS, GASS and GridFTP
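The uniform-interface idea can be sketched as an abstract base class with one implementation per media type. Each container only hands out a URL for an external transfer service and never moves bytes itself. Class names, host names, the URL layout and the HTTPS gateway for BLOBs are all assumptions made for this example.

```python
from abc import ABC, abstractmethod


class DataContainer(ABC):
    """Uniform interface: every container maps a file id to a transfer URL."""

    @abstractmethod
    def transfer_url(self, file_id: str) -> str:
        """Return a URL that an external transfer service can serve."""


class FilesystemContainer(DataContainer):
    """Files kept on a generic filesystem, exported by a file server."""

    def __init__(self, host, base_dir, protocol="gsiftp"):
        self.host = host
        self.base_dir = base_dir.rstrip("/")
        self.protocol = protocol

    def transfer_url(self, file_id):
        return f"{self.protocol}://{self.host}{self.base_dir}/{file_id}"


class DatabaseContainer(DataContainer):
    """BLOBs in a database, assumed to be exported by a hypothetical HTTPS gateway."""

    def __init__(self, gateway_host):
        self.gateway_host = gateway_host

    def transfer_url(self, file_id):
        return f"https://{self.gateway_host}/blob/{file_id}"


# Callers treat both containers identically.
containers = [
    FilesystemContainer("storage01.example.org", "/dms/files/"),
    DatabaseContainer("db-gateway.example.org"),
]
urls = [c.transfer_url("0001") for c in containers]
```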
Proxy (SRS Container)
• Enables access to external scientific databases
• Includes both Repository (listing entries, retrieving
attached metadata, building queries) and Data
Container (downloading files) functionality
• DMS treats the Proxy as a separate, independent
module that manages read-only data
• In the PROGRESS grid-portal environment, the Proxy
(named SRS Container) enables access to SRS
resources
Administrative Portal
• Web application allowing users to
handle grid data management with the
use of a web browser
• An intuitive interface for executing a
superset of DMS services
• An effective way to explore huge SRS
resources
• On-line help
SRS Resources in PSNC
• GenBank
– Release (about 32 million entries)
– Updates (about 2 million entries)
• EMBL (European Molecular Biology Laboratory)
– Release (about 42 million entries)
– Updates (about 2 million entries)
• PDB – Protein Data Bank
• Swissprot
– Swissprot Release, Swissprot New, SPTREMBL, REMTREMBL
SRS Installation
• Installation uses multiple storage resources
• Data access interface delivered via a
common portal (srs.man.poznan.pl)
• Administrative tasks (retrieval and data
preparation) split across multiple machines
• Parallel data retrieval from remote resources
• Offline data indexing and packing on a
computational machine (0.5 TB storage)
• Compressed online data (2 × 250 GB storage)
SRS Installation - Schema
[Diagram; hosts: bellis-e.man.poznan.pl (storage 02), viola.man.poznan.pl (storage 01), srs.man.poznan.pl; other labels: offline, online, indexing, offindex, index, flatfiles, SRS]
SRS Container
• Shell-based access to SRS
– SRS operations are sent via shell commands
• Access interface based on Web Services
– Internal functionality delivered using SOAP communication
• Data access via the ftp, gsiftp and gass protocols
– Data are accessed using external file servers integrated with
the SRS module
• Advanced caching system
– Databanks and entries are cached and reused in
subsequent user requests
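The combination of shell-based access and caching might look roughly like the sketch below. It assumes the SRS command-line tool `getz` with a query of the form `[databank-id:entry]` and the `-e` option to print a whole entry; verify both against the local SRS installation. The `SrsContainer` class here is an illustration, not the PROGRESS implementation, and the command runner is injectable so the logic can be exercised without SRS installed.

```python
import subprocess


def run_shell(argv):
    """Run a command and return its standard output as text."""
    return subprocess.run(argv, capture_output=True, text=True, check=True).stdout


class SrsContainer:
    """Read-only access to SRS entries with a simple in-memory cache."""

    def __init__(self, runner=run_shell, executable="getz"):
        self._runner = runner          # callable taking an argv list
        self._executable = executable  # assumed name of the SRS CLI tool
        self._cache = {}               # (databank, entry_id) -> entry text

    def fetch_entry(self, databank, entry_id):
        """Return the entry text, calling the shell only on a cache miss."""
        key = (databank, entry_id)
        if key not in self._cache:
            query = f"[{databank}-id:{entry_id}]"  # assumed getz query syntax
            self._cache[key] = self._runner([self._executable, query, "-e"])
        return self._cache[key]
```

Passing the command as an argument list rather than a single shell string avoids quoting problems with the square brackets in the query.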
Portal Interface – databanks list
Portal Interface – databank content
Portal Interface - searching
Portal Interface – search results
Portal Interface – copying entries
Portal Interface – file properties
DMS Installation Requirements
• Java virtual machine: Java(TM) 2 Runtime Environment,
Standard Edition 1.4.1 or higher is recommended.
• Database server: DMS works with the Oracle and
PostgreSQL engines:
– Oracle: Oracle8i or higher is recommended
– PostgreSQL: version 7.3 or higher is required, with the
following additional extensions:
• chkpass and tablefunc
from the contrib package
• plpgsql support
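A small pre-installation check of the numeric minimums listed above could be sketched as follows. The helper names are hypothetical, and the sketch assumes purely numeric dotted version strings (a Java build suffix such as `_02` would need to be stripped first).

```python
# Minimum versions taken from the requirements listed above.
MINIMUMS = {"java": "1.4.1", "postgresql": "7.3"}


def parse_version(text):
    """Turn '7.3.2' into (7, 3, 2) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def meets_minimum(found, minimum):
    """True when the found version is at least the required minimum."""
    return parse_version(found) >= parse_version(minimum)
```

Comparing tuples of integers avoids the classic string-comparison mistake where "7.10" sorts before "7.3".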
Usage scenario: PROGRESS HPC Portal
• SRS resources can be used as input for grid
jobs created, configured and submitted for
execution in the grid with the use of the
PROGRESS HPC Portal
• An example application is AminSim
(amino acid sequence similarity), developed
by Prof. Jacek Blazewicz's group at the
Institute of Computing Science, Poznan
University of Technology
AminSim portlet (1)
AminSim portlet (2)
AminSim portlet (3)
AminSim portlet (4)
Conclusions
• SRS resources have been integrated with the distributed file
structure of DMS and enabled for use within a grid-portal
environment (PROGRESS HPC Portal)
• A web interface (DMS Portal) makes exploring
SRS resources more efficient:
– fast copying of interesting entries directly to the user's home directory
– merging files
– saving files in various formats (e.g. Fasta)
• The universal access layer to the scientific databases
may be successfully used to connect other data sources to the
Data Management System
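To make the "merging files" and "saving as Fasta" features concrete, here is a minimal sketch of writing several sequence entries into one FASTA text. The function names and sample records are invented for illustration and say nothing about how the DMS Portal implements the feature; only the FASTA layout (a `>` header line followed by wrapped sequence lines) is standard.

```python
def to_fasta(entry_id, description, sequence, width=60):
    """Format one entry: '>' header line, then the sequence wrapped at `width`."""
    header = f">{entry_id} {description}".rstrip()
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines)


def merge_fasta(records, width=60):
    """Concatenate (id, description, sequence) records into one FASTA text."""
    return "\n".join(to_fasta(*rec, width=width) for rec in records) + "\n"


# Invented sample records, used only to show the output layout.
records = [
    ("P1", "demo protein", "ACDEFGHIKL"),
    ("P2", "", "MNPQRS"),
]
merged = merge_fasta(records, width=4)
```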
Contact info
• Check http://dms.progress.psnc.pl for more
information about DMS
• Check http://dms.progress.psnc.pl/docs/demo.htm for
the DMS Portal demo
• Check http://progress.psnc.pl for more information
about PROGRESS
• Download it now: http://progress.psnc.pl
• Mail DMS team: szd@man.poznan.pl
• Mail PROGRESS team: progress@man.poznan.pl