
Globus Data and Replica Management

Ann Chervenak

USC Information Sciences Institute

Talk Outline

Brief Introduction to Globus Toolkit

Globus Tools for Data Management

The Replica Location Service (RLS)

Examples of production use of RLS

Higher-level data management services

The Data Replication Service (DRS)

Summary

The Application-Infrastructure Gap

Dynamic and/or Distributed Applications

Shared Distributed Infrastructure


Bridging the Gap: Service-Oriented Infrastructure

Service-oriented applications

Wrap applications as services

Compose applications into workflows

Service-oriented infrastructure

Provision physical resources to support application workloads

[Figure: users invoke service-oriented applications composed into workflows from application services, running on provisioned service-oriented infrastructure]

Globus is Service-Oriented Infrastructure Technology

Software for service-oriented infrastructure

Service-enable new & existing resources

Uniform abstractions & mechanisms

Tools to build applications that exploit service-oriented infrastructure

Registries, security, data management, …

Open source & open standards

Each empowers the other

Enabler of a rich tool & service ecosystem

Globus Toolkit

Core Web services

Infrastructure for building new services

Security

Apply uniform policy across distinct systems

Execution management

Provision, deploy, & manage services

Data management

Discover, transfer, & access large data

Monitoring

Discover & monitor dynamic services

Globus Tools and Services for Data Management

GridFTP

A secure, robust, efficient data transfer protocol

The Reliable File Transfer Service (RFT)

Web services-based, stores state about transfers

The Data Access and Integration Service (DAIS)

Service providing access to data resources, particularly relational and XML databases

The Replica Location Service (RLS)

Distributed registry that records locations of data copies

The Data Replication Service

Web services-based, combines data replication and registration functionality

Replica Management in Grids

Data-intensive applications produce terabytes or petabytes of data

Hundreds of millions of data objects

Replicate data at multiple locations for reasons of:

Fault tolerance

Avoid single points of failure

Performance

Avoid wide area data transfer latencies

Achieve load balancing

A Replica Location Service

A Replica Location Service (RLS) is a distributed registry that records the locations of data copies and allows replica discovery

RLS maintains mappings between logical identifiers and target names

Must perform and scale well: support hundreds of millions of objects, hundreds of clients

E.g., LIGO (Laser Interferometer Gravitational Wave Observatory) Project

RLS servers at 10 sites

Maintain associations between 6 million logical file names & 40 million physical file locations
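To make the logical-to-target mapping concrete, here is a minimal Python sketch of the catalog semantics. It is illustrative only: the class name, methods, and URLs are invented for this example and are not the real RLS client API.

# Minimal sketch (not the real RLS API) of the mapping an LRC maintains:
# each logical name can map to any number of target (e.g., physical) names.

from collections import defaultdict

class LocalReplicaCatalog:
    def __init__(self):
        self._mappings = defaultdict(set)  # logical name -> set of target names

    def add(self, logical, target):
        """Register a new <logical name, target> mapping."""
        self._mappings[logical].add(target)

    def query(self, logical):
        """Return all registered targets of a logical name."""
        return set(self._mappings.get(logical, set()))

    def delete(self, logical, target):
        """Remove one mapping; drop the logical name when none remain."""
        targets = self._mappings.get(logical)
        if targets:
            targets.discard(target)
            if not targets:
                del self._mappings[logical]

# Example: one logical file with replicas registered at two sites
lrc = LocalReplicaCatalog()
lrc.add("lfn://ligo/frame-0001", "gsiftp://siteA.example.org/data/frame-0001")
lrc.add("lfn://ligo/frame-0001", "gsiftp://siteB.example.org/data/frame-0001")
print(lrc.query("lfn://ligo/frame-0001"))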

RLS Features

• Local Replica Catalogs (LRCs) contain consistent information about logical-to-target mappings

• Replica Location Index (RLI) nodes aggregate information about one or more LRCs

• LRCs use soft state update mechanisms to inform RLIs about their state: relaxed consistency of index

• Optional compression of state updates reduces communication, CPU and storage overheads

[Figure: two-level hierarchy in which each Replica Location Index (RLI) indexes several Local Replica Catalogs (LRCs)]

Components of RLS Implementation

Common server implementation for LRC and RLI

Front-End Server

Multi-threaded

Written in C

Supports GSI Authentication using X.509 certificates

Back-end Server

MySQL, PostgreSQL or Oracle relational database

Client APIs: C, Java, Python

Client command-line tool

[Figure: LRC/RLI server connecting to a MySQL back end through ODBC (libiodbc) and the MyODBC driver]

RLS Implementation Features

Two types of soft state updates from LRCs to RLIs (see the sketch after this list)

Complete list of logical names registered in LRC

Compressed updates: Bloom filter summaries of LRC

Immediate mode: incremental updates sent as mappings change

User-defined attributes

May be associated with logical or target names

Partitioning

Divide LRC soft state updates among RLI index nodes using pattern matching of logical names

Not used much in practice because compressed updates are efficient
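The soft state idea can be pictured in a few lines of Python. This is a toy model, not Globus code; the update period and expiry threshold are invented constants (real deployments make these configurable).

# Illustrative sketch of soft state: an LRC periodically pushes the set of
# logical names it knows about to an RLI; the RLI timestamps each update
# and discards state that is not refreshed (relaxed consistency).

import time

UPDATE_INTERVAL = 30 * 60            # assumed LRC update period, seconds
EXPIRE_AFTER = 2 * UPDATE_INTERVAL   # assumed RLI expiry threshold

class ReplicaLocationIndex:
    def __init__(self):
        self._state = {}  # lrc_url -> (timestamp, set of logical names)

    def soft_state_update(self, lrc_url, logical_names):
        """Replace this LRC's entry wholesale; no incremental bookkeeping."""
        self._state[lrc_url] = (time.time(), set(logical_names))

    def expire_stale(self):
        """Silently drop LRCs that stopped sending updates."""
        now = time.time()
        self._state = {url: (ts, names)
                       for url, (ts, names) in self._state.items()
                       if now - ts < EXPIRE_AFTER}

    def query(self, logical):
        """Return the LRCs that recently claimed to hold this name."""
        return [url for url, (_, names) in self._state.items()
                if logical in names]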

Performance Testing

Extensive performance testing reported in HPDC 2004 paper

Performance of individual LRC (catalog) or RLI (index) servers

Client program submits operation requests to server

Performance of soft state updates

LRC catalogs send updates to index servers

Software Versions:

Replica Location Service Version 2.0.9

Globus Packaging Toolkit Version 2.2.5

libiODBC library Version 3.0.5

MySQL database Version 4.0.14

MyODBC library (with MySQL) Version 3.51.06

Testing Environment

Local Area Network Tests

100 Megabit Ethernet

Clients (either client program or LRCs) on cluster: dual Pentium-III 547 MHz workstations with 1.5 Gigabytes of memory running Red Hat Linux 9

Server: dual Intel Xeon 2.2 GHz processor with 1 Gigabyte of memory running Red Hat Linux 7.3

Wide Area Network Tests (Soft state updates)

LRC clients (Los Angeles): cluster nodes

RLI server (Chicago): dual Intel Xeon 2.2 GHz machine with 2 gigabytes of memory running Red Hat Linux 7.3

LRC Operation Rates (MySQL Backend)

[Figure: operation rates for an LRC with 1 million entries in a MySQL back end, database flush disabled, plotted against number of clients (1-10); series: query, add, and delete rates, each with 10 threads per client; y-axis: 0-2500 operations per second]

• Up to 100 total requesting threads

• Clients and server on LAN

• Query: request the target of a logical name

• Add: register a new <logical name, target> mapping

• Delete: remove a mapping

Bulk Operation Performance

[Figure: bulk vs. non-bulk operation rates with 1000 operations per request and 10 request threads per client, plotted against number of clients (1-10); series: bulk query, bulk add/delete, non-bulk query, non-bulk add, non-bulk delete; y-axis: 0-3000 operations per second]

For user convenience, server supports bulk operations

E.g., 1000 operations per request

Combine adds/deletes to maintain approx. constant DB size

For small number of clients, bulk operations increase rates

E.g., 1 client (10 threads) performs 27% more queries, 7% more adds/deletes (a toy cost model of this effect follows below)
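The gain comes mostly from amortizing fixed per-request overhead across many operations. A toy cost model makes the shape of the effect visible; all constants below are invented and are not measurements.

# Back-of-the-envelope model of why bulk requests help (illustrative only):
# each request pays one fixed round-trip cost, so packing N operations into
# one request amortizes that cost across N operations.

def throughput(ops, ops_per_request, rtt_s, per_op_s):
    """Operations/second given a fixed per-request cost and per-op server cost."""
    requests = ops / ops_per_request
    total_time = requests * rtt_s + ops * per_op_s
    return ops / total_time

# Assumed numbers, chosen only to show the trend, not to match the charts:
print(throughput(10_000, 1,    rtt_s=0.0005, per_op_s=0.001))  # non-bulk
print(throughput(10_000, 1000, rtt_s=0.0005, per_op_s=0.001))  # bulk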

Bloom Filter Compression

Construct a summary of each LRC’s state by hashing logical names, creating a bitmap (see the sketch below)

RLI stores in memory one bitmap per LRC

Advantages:

Updates much smaller, faster

Supports higher query rate

Satisfied from memory rather than database

Disadvantages:

Lose the ability to do wildcard queries, since logical names are not sent to the RLI

Small probability of false positives (configurable)

Relaxed consistency model
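A minimal Bloom filter sketch of the compressed update, assuming SHA-256 as the hash source and roughly 10 bits per entry as in the measurements that follow; the real RLS implementation may choose its hash functions and parameters differently.

# Minimal Bloom filter sketch (illustrative, not the RLS implementation).

import hashlib

class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.m = num_bits
        self.k = num_hashes          # must be <= 8 with 4-byte digest chunks
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, name):
        # Derive k bit positions from one hash of the logical name.
        digest = hashlib.sha256(name.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, name):
        for pos in self._positions(name):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, name):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(name))

# An RLI keeps one such bitmap per LRC and answers membership queries from
# memory; false positives are possible, missed entries are not.
bf = BloomFilter(num_bits=10_000_000, num_hashes=7)  # ~10 bits/entry at 1M names
bf.add("lfn://ligo/frame-0001")
print("lfn://ligo/frame-0001" in bf)   # True
print("lfn://ligo/frame-9999" in bf)   # almost certainly False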

Bloom Filter Performance: Single Wide Area Soft State Update (Los Angeles to Chicago)

LRC Database Size  | Avg. time to send soft state update (sec) | Avg. time for initial bloom filter computation (sec) | Size of bloom filter (bits)
100,000 entries    | Less than 1                               | 2                                                    | 1 million
1 million entries  | 1.67                                      | 18.4                                                 | 10 million
5 million entries  | 6.8                                       | 91.6                                                 | 50 million
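The table corresponds to roughly 10 bits of filter per database entry. For reference, the standard false-positive estimate for a Bloom filter with m bits, n entries, and k hash functions is

p \approx \left(1 - e^{-kn/m}\right)^{k}, \qquad k_{\mathrm{opt}} = \frac{m}{n}\ln 2

so at m/n = 10 the optimal k is about 7 and p is about 0.008, consistent with the small, configurable false-positive probability noted on the previous slide.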

RLS in Production Use: LIGO

Laser Interferometer Gravitational Wave Observatory

Currently use RLS servers at 10 sites

Contain mappings from 6 million logical files to over 40 million physical replicas

Used in customized data management system: the LIGO Lightweight Data Replicator System (LDR)

Includes RLS, GridFTP, custom metadata catalog, tools for storage management and data validation

RLS in Production Use: ESG

Earth System Grid: Climate modeling data (CCSM, PCM, IPCC)

RLS at 4 sites

Data management coordinated by ESG portal

Datasets stored at NCAR

64.41 TB in 397,253 total files

1,230 portal users

IPCC Data at LLNL

26.50 TB in 59,300 files

400 registered users

Data downloaded: 56.80 TB in 263,800 files

Avg. 300 GB downloaded/day

200+ research papers being written

RLS in Production Use: Pegasus Workflow Manager

Pegasus: Planning for Execution in Grids

Used by scientific applications to manage complex executions

Pegasus system

Maps from a high-level, abstract definition of a workflow onto a Grid environment

Maps to a concrete or executable workflow in the form of a Directed Acyclic Graph (DAG)

Passes this concrete workflow to the Condor DAGMan execution system

Pegasus uses RLS to

Identify physical replicas of logical files specified in the abstract workflow

Register new files created during workflow execution

Scientific applications that use RLS via Pegasus include:

LIGO

ATLAS high energy physics application

Southern California Earthquake Center (SCEC)

Astronomy: Montage and Galaxy Morphology applications

Bioinformatics

Tomography

Other RLS Users

QCD Grid; US CMS experiment (integrated with POOL); ATLAS via Don Quijote

Motivation for Data Replication Services

Data-intensive applications need higher-level data management services that integrate lower-level Grid functionality

Efficient data transfer (GridFTP, RFT)

Replica registration and discovery (RLS)

Eventually validation of replicas, consistency management, etc.

Goal is to generalize the custom data management systems developed by several application communities

Eventually plan to provide a suite of general, configurable, higher-level data management services

Globus Data Replication Service (DRS) is the first of these services

The Data Replication Service

Included in the Tech Preview of GT4.0 release

Design is based on the publication component of the Lightweight Data Replicator (LDR) system

Developed by Scott Koranda from U. Wisconsin at Milwaukee

Functionality (see the sketch after this list)

Replicate a set of files that exist elsewhere in the Grid onto the local site

Users identify a set of desired files

DRS queries Replica Location Service to discover current locations of these files

Creates local replicas of desired files using the Reliable File Transfer Service

Registers new replicas in Replica Location Service for discovery
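A sketch of this discover-transfer-register control flow in Python; every function, URL, and data structure below is invented for illustration and does not correspond to the GT4 API.

# Sketch of what a DRS request does (illustrative pseudocode, not GT4 code).

def rls_query(rli, lfn):
    """Stand-in for an RLS lookup: return URLs of existing replicas."""
    return rli.get(lfn, [])

def rft_transfer(source, destination):
    """Stand-in for a Reliable File Transfer request (which drives GridFTP)."""
    print(f"transfer {source} -> {destination}")
    return destination

def replicate(lfns, rli, lrc, storage_prefix):
    for lfn in lfns:
        sources = rls_query(rli, lfn)                # 1. discover replicas
        if not sources:
            continue                                 # nothing to copy from
        dest = storage_prefix + lfn                  # choose a local target
        local_url = rft_transfer(sources[0], dest)   # 2. create local replica
        lrc.setdefault(lfn, []).append(local_url)    # 3. register for discovery

# Toy data: one remote replica known to the index
rli = {"frame-0001": ["gsiftp://remote.example.org/data/frame-0001"]}
lrc = {}
replicate(["frame-0001"], rli, lrc, "gsiftp://local.example.org/data/")
print(lrc)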

Relationship to Other Globus Services

At the requesting site, deploy:

WS-RF services in a Web service container: the Data Replication Service (with its Replicator resource), the Delegation Service (holding the delegated credential), and the Reliable File Transfer Service (with its RFT resource); each resource is addressed by an endpoint reference (EPR) and exposes resource properties (RPs)

Pre-WS-RF components: the Replica Location Service (Local Replica Catalog and Replica Location Index) and a GridFTP server

[Figure: local-site deployment diagram showing these services, their resources, and their EPRs]

WSRF in a Nutshell

State Management: Resource, Resource Property

State Identification: Endpoint Reference

State Interfaces: GetRP, GetMultipleRPs, SetRP, QueryRPs

Lifetime Interfaces: SetTerminationTime, ImmediateDestruction

Notification Interfaces: Subscribe, Notify

ServiceGroups
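A toy model of this pattern, with none of the real Web services plumbing; the method names and the EPR format below are invented for illustration.

# Toy WSRF pattern: a stateless service fronts stateful resources; each
# resource is named by an endpoint reference (EPR) and exposes resource
# properties (RPs) that clients read, modify, and eventually destroy.

import itertools

class Service:
    def __init__(self):
        self._counter = itertools.count()
        self._resources = {}  # EPR -> dict of resource properties

    def create_resource(self, properties):
        epr = f"epr-{next(self._counter)}"  # stand-in for a WS-Addressing EPR
        self._resources[epr] = dict(properties)
        return epr

    def get_rp(self, epr, name):           # GetRP
        return self._resources[epr][name]

    def set_rp(self, epr, name, value):    # SetRP
        self._resources[epr][name] = value

    def destroy(self, epr):                # ImmediateDestruction
        del self._resources[epr]

svc = Service()
epr = svc.create_resource({"status": "pending"})  # state created, EPR returned
svc.set_rp(epr, "status", "done")
print(svc.get_rp(epr, "status"))
svc.destroy(epr)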

Performance Measurements: Wide Area Testing

The destination for the pull-based transfers is located in Los Angeles

Dual-processor, 1.1 GHz Pentium III workstation with 1.5 GBytes of memory and 1 Gbit Ethernet

Runs a GT4 container and deploys services including RFT and DRS as well as GridFTP and RLS

The remote site where desired data files are stored is located at Argonne National Laboratory in Illinois

Dual-processor, 3 GHz Intel Xeon workstation with 2 gigabytes of memory and 1.1 terabytes of disk

Runs a GT4 container as well as GridFTP and RLS services

DRS Operations Measured

Create the DRS Replicator resource

Discover source files for replication using the local RLS Replica Location Index and remote RLS Local Replica Catalogs

Initiate a Reliable File Transfer operation by creating an RFT resource

Perform RFT data transfer(s)

Register the new replicas in the RLS Local Replica Catalog

Experiment 1: Replicate 10 Files of Size 1 Gigabyte

Component of Operation     | Time (milliseconds)
Create Replicator Resource | 317.0
Discover Files in RLS      | 449.0
Create RFT Resource        | 808.6
Transfer Using RFT         | 1,186,796.0
Register Replicas in RLS   | 3,720.8

Data transfer time dominates

Wide area data transfer rate of 67.4 Mbits/sec
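As a sanity check on that rate: 10 files × 1 GB = 80,000 Mbits, transferred in roughly 1,187 seconds, gives 80,000 / 1,187 ≈ 67.4 Mbits/sec.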

Experiment 2: Replicate 1000 Files of Size 10 Megabytes

Component of Operation     | Time (milliseconds)
Create Replicator Resource | 1,561.0
Discover Files in RLS      | 9.8
Create RFT Resource        | 1,286.6
Transfer Using RFT         | 963,456.0
Register Replicas in RLS   | 11,278.2

Time to create Replicator and RFT resources is larger

Need to store state for 1000 outstanding transfers

Data transfer time still dominates

Wide area data transfer rate of 85 Mbits/sec

Summary

Globus Tools for Data Management

GridFTP protocol

Reliable File Transfer Service

OGSA Data Access and Integration Service

Replica Location Service

Data Replication Service

RLS used in production at large scale by a variety of scientific applications

Moving toward configurable, general higher-level data services

DRS is first of these

For More Information

RLS

“Performance and Scalability of a Replica Location Service,” High Performance Distributed Computing Conference (HPDC), 2004: http://www.isi.edu/~annc/papers/chervenakhpdc13.pdf

Documentation: http://www.globus.org/toolkit/docs/4.0/data/rls

DRS

“Wide Area Data Replication for Scientific Collaborations,” Workshop on Grid Computing (Grid 2005): http://www.isi.edu/~annc/papers/grid2005final.pdf

Documentation: http://www.globus.org/toolkit/docs/4.0/techpreview/datarep

Download