Neptune: Scalable Replication Management and Programming Support for Cluster-based Network Services

Kai Shen, Tao Yang, Lingkun Chu, JoAnne L. Holliday, Douglas K. Kuschner, and Huican Zhu
Department of Computer Science
University of California, Santa Barbara
http://www.cs.ucsb.edu/research/Neptune
Motivations
- Availability, incremental scalability, and manageability: key requirements for building large-scale network services.
- Challenging for services with frequent persistent data updates.
- Existing solutions for managing persistent data:
  - Pure data partitioning: no availability guarantee; bad at dealing with runtime hot spots.
  - Disk sharing: inherently unscalable; single point of failure.
  - Replication provided by database vendors: tied to specific database systems; inflexible in consistency.
Neptune Project Goal
- Design a scalable clustering architecture for aggregating and replicating network services with persistent data.
- Provide a simple and flexible programming model to shield the complexity of data replication, service discovery, load balancing, and failover management.
- Provide flexible replica consistency support to address availability and performance tradeoffs for different services.
Related Work
- TACC, MultiSpace: infrastructure support for cluster-based network services.
- DDS: distributed persistent data structure for network services.
- Porcupine: cluster-based email service (with commutative updates).
- Bayou: weak consistency for wide-area applications.
- BEA Tuxedo: platform middleware supporting transactional RPC.
Outline
- Motivations & Related Work
- System Architecture and Assumptions
- Replica Consistency and Failure Recovery
- System Implementation and Service Deployments
- Experimental Studies
Partitionable Network Services
Characteristics of network services:
- Information independence: service data can be divided into independent categories (e.g., discussion groups).
- User independence: data accessed by different users tends to be independent (e.g., email service).
Neptune targets partitionable network services:
- Service data can be divided into independent partitions.
- Each service access can be delivered independently on a single partition; or
- Each service access can be aggregated from sub-services, each of which can be delivered independently on a single partition.
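To make the partitioning idea concrete, the sketch below maps a data category (here, a discussion-group name) to one of a fixed set of partitions so that every access can be served on a single partition. The hash function and partition count are illustrative assumptions, not Neptune's actual placement scheme.

    /* Illustrative only: map independent data categories (e.g., discussion
       groups) onto a fixed set of partitions; the hash and the partition
       count are assumptions, not Neptune's actual placement scheme. */
    #include <stdio.h>

    #define NUM_PARTITIONS 10

    /* Simple string hash (djb2); any stable hash works here. */
    static unsigned long hash_str(const char *s) {
        unsigned long h = 5381;
        while (*s)
            h = h * 33 + (unsigned char)*s++;
        return h;
    }

    /* Each access names one category, so it can be served on one partition. */
    static int partition_of(const char *group_name) {
        return (int)(hash_str(group_name) % NUM_PARTITIONS);
    }

    int main(void) {
        const char *groups[] = { "linux-kernel", "gardening", "jazz" };
        for (int i = 0; i < 3; i++)
            printf("%s -> partition %d\n", groups[i], partition_of(groups[i]));
        return 0;
    }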
Conceptual Architecture for a Neptune Service Cluster
[Figure: a Neptune service cluster. Clients arrive through Web servers on the wide-area network and a WAP gateway on the wireless network, each running a Neptune client module. Inside the cluster, a high-throughput, low-latency network connects replicated service nodes: Photo Album (partitions 0-19), Image Store (partitions 0-9 and 10-19), and Discussion Group (partitions 0-9, on two nodes).]
Neptune Components
Neptune components on the client and server side:
- Neptune server module: starts, regulates, and terminates registered service instances, and maintains replica data consistency.
- Neptune client module: provides location-transparent access to application service clients.
[Figure: interaction among service modules and Neptune modules. A service client on a client node issues data requests through the Neptune client module; the Neptune server module on a server node dispatches them to a service instance, which accesses its partitioned data and returns the data response.]
Programming Interfaces
Request/Response communications:
- Client-side API (called by service clients):
  NeptuneCall(CltHandle, Service, Partition, SvcMethod, Request, Response);
- Service interface (abstract interface that application services implement):
  SvcMethod(SvcHandle, Partition, Request, Response);
Stream-based communications:
- Neptune sets up a bi-directional stream between the service client and the service instance.
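A minimal sketch of how these two interfaces fit together, assuming hypothetical handle and buffer types and a stubbed-out NeptuneCall so the example is self-contained; only the two call signatures above come from the slides.

    /* Sketch only: hypothetical handle/buffer types and a stubbed NeptuneCall;
       the two signatures follow the slides, everything else is illustrative. */
    #include <stdio.h>
    #include <string.h>

    typedef struct { int id; } CltHandle;      /* hypothetical client handle    */
    typedef struct { int id; } SvcHandle;      /* hypothetical service handle   */
    typedef struct { char data[256]; } Buffer; /* hypothetical request/response */

    /* Service side: an application service implements the abstract SvcMethod
       interface; Neptune invokes it with the target data partition. */
    void AddMessage(SvcHandle *svc, int partition,
                    const Buffer *request, Buffer *response) {
        (void)svc; (void)request;
        /* ...append the message to the given discussion-group partition... */
        snprintf(response->data, sizeof(response->data),
                 "appended to partition %d", partition);
    }

    /* Client side: in the real system the Neptune client module locates a
       suitable replica of (Service, Partition) and forwards the request;
       stubbed here by calling the service method directly. */
    void NeptuneCall(CltHandle *clt, const char *service, int partition,
                     const char *svc_method, const Buffer *request,
                     Buffer *response) {
        (void)clt; (void)service; (void)svc_method;
        AddMessage(NULL, partition, request, response);
    }

    int main(void) {
        CltHandle clt = { 1 };
        Buffer req, resp;
        strcpy(req.data, "hello, discussion group");
        NeptuneCall(&clt, "DiscussionGroup", 7, "AddMessage", &req, &resp);
        printf("%s\n", resp.data);
        return 0;
    }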
Assumptions
- All system modules follow the fail-stop failure model.
- Network partitions do not occur inside the service cluster.
- Neptune does allow persistent data to survive all-node failures.
- Atomic execution is supported if each underlying service module ensures atomicity in stand-alone configuration.
Neptune Replica Consistency Model
A service access is called a write if it changes the state of persistent data; it is called a read otherwise.
- Level 1: write-anywhere replication for commutative writes. Writes are accepted at any replica and propagated to peers. E.g., a message board (append-only).
- Level 2: primary-secondary replication for ordered writes. Writes are accepted only at the primary node, then ordered and propagated to the secondaries.
- Level 3: primary-secondary replication with staleness control. Soft time-based staleness bound and progressive version delivery.
This is not strong consistency because writes complete independently at each replica.
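The routing consequence of these levels can be sketched as below: a level-1 write may go to any live replica, while level-2 and level-3 writes must go to the partition's primary. The replica bookkeeping is an illustrative assumption; only the level semantics come from the slides.

    /* Sketch of write routing per consistency level; replica bookkeeping is
       illustrative, the level semantics follow the slides. */
    #include <stdio.h>

    enum ConsistencyLevel {
        LEVEL1_WRITE_ANYWHERE = 1,     /* commutative writes, any replica */
        LEVEL2_PRIMARY_SECONDARY = 2,  /* ordered writes, primary only    */
        LEVEL3_STALENESS_CONTROL = 3   /* level 2 plus staleness control  */
    };

    /* Choose the node that accepts a write for one partition. */
    static int write_target(enum ConsistencyLevel level,
                            const int *live_replicas, int n_live,
                            int primary, int preferred) {
        if (level == LEVEL1_WRITE_ANYWHERE)
            return live_replicas[preferred % n_live];  /* then propagate to peers */
        return primary;  /* ordered at the primary, then sent to the secondaries  */
    }

    int main(void) {
        int live[] = { 3, 5, 8 };   /* node IDs holding this partition */
        printf("level 1 write -> node %d\n",
               write_target(LEVEL1_WRITE_ANYWHERE, live, 3, 5, 1));
        printf("level 2 write -> node %d\n",
               write_target(LEVEL2_PRIMARY_SECONDARY, live, 3, 5, 1));
        return 0;
    }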
Soft Time-based Staleness Bound
- Semantics: each read serviced at a replica is at most x seconds stale compared to the primary.
- Important for services such as on-line auctions.
- Implementation:
  - Each replica periodically announces its data version;
  - The Neptune client module directs requests only to replicas with a fresh enough version.
- The bound is soft, depending on network latency, announcement frequency, and intermittent packet losses.
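A minimal sketch of the freshness check this implies on the client side, assuming the client module remembers recent primary announcements; the bookkeeping structure below is an assumption, and only the "at most x seconds stale" rule comes from the slides.

    /* Sketch of the client-side staleness check: a replica may serve a read
       only if its announced version has caught up to the version the primary
       had announced BOUND seconds ago. Announcement bookkeeping is assumed. */
    #include <stdio.h>
    #include <time.h>

    #define STALENESS_BOUND_SEC 10   /* the "x seconds" in the bound */
    #define HISTORY 64               /* recent primary announcements */

    struct announcement { long version; time_t at; };

    /* Latest version the primary had announced on or before time t. */
    static long primary_version_at(const struct announcement *hist, int n, time_t t) {
        long v = 0;
        for (int i = 0; i < n; i++)
            if (hist[i].at <= t && hist[i].version > v)
                v = hist[i].version;
        return v;
    }

    /* A replica is fresh enough if it has caught up to that version. */
    static int fresh_enough(long replica_version,
                            const struct announcement *primary_hist, int n,
                            time_t now) {
        return replica_version >= primary_version_at(primary_hist, n,
                                                     now - STALENESS_BOUND_SEC);
    }

    int main(void) {
        time_t now = time(NULL);
        struct announcement hist[HISTORY] = {
            { 41, now - 30 }, { 42, now - 15 }, { 43, now - 2 }
        };
        printf("replica at version 42 eligible: %d\n",
               fresh_enough(42, hist, 3, now));   /* 1: within 10s of primary */
        printf("replica at version 41 eligible: %d\n",
               fresh_enough(41, hist, 3, now));   /* 0: more than 10s stale   */
        return 0;
    }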
Progressive Version Delivery

From each client’s point of view,
 Writes are always seen by subsequent reads.
 Versions delivered for reads are progressive.
Important for services like on-line auction.
 Implementation:

 Each replica periodically announces its data version;
 Each service invocation returns a version number for
a service client to keep as a session variable;
 Neptune client module directs a read to a replica with
an announced version >= all the previously-returned
version.
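The session-side bookkeeping can be sketched as below: the client keeps the highest version any invocation has returned and only accepts replicas that have announced at least that version. The structure and field names are illustrative assumptions.

    /* Sketch of progressive version delivery from the client side; only the
       "announced version >= all previously returned versions" rule follows
       the slides, the structure is illustrative. */
    #include <stdio.h>

    struct session { long last_seen_version; };   /* kept by the service client */

    /* Record the version number returned by a completed invocation. */
    static void session_update(struct session *s, long returned_version) {
        if (returned_version > s->last_seen_version)
            s->last_seen_version = returned_version;
    }

    /* A replica may serve this session's read only if it has announced a
       version at least as new as everything the session has already seen,
       which keeps the delivered versions progressive. */
    static int replica_eligible(const struct session *s, long announced_version) {
        return announced_version >= s->last_seen_version;
    }

    int main(void) {
        struct session s = { 0 };
        session_update(&s, 17);                                             /* a write returned v17 */
        printf("replica at v16 eligible: %d\n", replica_eligible(&s, 16));  /* 0 */
        printf("replica at v18 eligible: %d\n", replica_eligible(&s, 18));  /* 1 */
        return 0;
    }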
Failure Recovery
A REDO log is maintained for each data partition at each replica, with two portions:
- Committed portion: completed writes;
- Uncommitted portion: writes received but not yet completed.
Three-phase recovery for primary-secondary replication (level 2 & level 3):
1. Synchronize with the underlying service module;
2. Recover missed writes from the current primary;
3. Resume normal operations.
Only phase one is necessary for write-anywhere replication (level 1).
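A rough sketch of these phases over a per-partition REDO log; the log representation and the way missed writes are fetched are assumptions, and only the phase ordering follows the slides.

    /* Rough sketch of per-partition recovery; log layout and the fetch loop
       are assumptions, only the phase ordering follows the slides. */
    #include <stdio.h>

    struct redo_log {
        long committed_through;  /* committed portion: completed writes          */
        long received_through;   /* uncommitted portion: received, not completed */
    };

    /* Phase 1: synchronize with the underlying service module so the log and
       the service data agree (sketched here by dropping the uncommitted tail). */
    static void sync_with_service(struct redo_log *log) {
        log->received_through = log->committed_through;
    }

    /* Phase 2 (primary-secondary only): recover writes missed during the
       outage from the current primary and commit them locally. */
    static void recover_from_primary(struct redo_log *log, long primary_version) {
        while (log->committed_through < primary_version)
            log->committed_through++;   /* fetch and apply one missed write */
        log->received_through = log->committed_through;
    }

    int main(void) {
        struct redo_log log = { 40, 42 };   /* two writes received but not completed */
        sync_with_service(&log);            /* phase 1: all replication levels       */
        recover_from_primary(&log, 45);     /* phase 2: levels 2 and 3 only          */
        /* Phase 3: resume normal operations. */
        printf("recovered through write %ld\n", log.committed_through);
        return 0;
    }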
Outline
- Motivations & Related Work
- System Architecture and Assumptions
- Replica Consistency and Failure Recovery
- System Implementation and Service Deployments
- Experimental Studies
Prototype System Implementation on a Linux Cluster
- Service availability and node runtime workload are announced through IP multicast:
  - multicast once a second;
  - kept as soft state, expiring in five seconds.
- Service instances can run either as processes or as threads in the Neptune server runtime environment.
- Each Neptune server module maintains a process/thread pool and a waiting queue.
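A small sketch of the soft-state bookkeeping this implies on the receiving side; the table layout and field names are assumptions, while the one-second announcement interval and five-second expiration come from the slides.

    /* Sketch of a soft-state entry kept from multicast announcements: an
       entry not refreshed within five seconds is treated as expired.
       Layout and field names are illustrative assumptions. */
    #include <stdio.h>
    #include <time.h>

    #define EXPIRE_SEC 5   /* soft state expires in five seconds */

    struct node_state {
        int node;
        double load;         /* announced runtime workload        */
        time_t last_heard;   /* arrival time of last announcement */
    };

    /* Refresh (or install) an entry when an announcement arrives. */
    static void on_announcement(struct node_state *e, int node, double load,
                                time_t now) {
        e->node = node;
        e->load = load;
        e->last_heard = now;
    }

    /* A node is considered available only while its entry is fresh. */
    static int is_available(const struct node_state *e, time_t now) {
        return (now - e->last_heard) < EXPIRE_SEC;
    }

    int main(void) {
        struct node_state e;
        time_t now = time(NULL);
        on_announcement(&e, 3, 0.42, now - 2);   /* heard 2 seconds ago */
        printf("node %d available: %d\n", e.node, is_available(&e, now)); /* 1 */
        printf("after 6 quiet seconds: %d\n", is_available(&e, now + 4)); /* 0 */
        return 0;
    }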
Experience with Service Deployments
- On-line discussion group:
  - view message headers, view message, and add message;
  - all three consistency levels can be applied.
- Auction:
  - level 3 consistency with staleness control is used.
- Persistent cache:
  - stores key-value pairs (e.g., caching query results);
  - level 2 (primary-secondary) consistency is used.
- Fast prototyping and implementation without worrying about replication/clustering complexities.
Experimental Settings for Performance Evaluation
- Synthetic workloads:
  - 10% and 50% write percentages;
  - balanced workload to assess best-case scalability;
  - skewed workload to evaluate the impact of runtime hotspots.
- Metric: maximum throughput at which at least 98% of client requests complete within 2 seconds.
- Evaluation environment:
  - Linux cluster with dual 400 MHz Pentium IIs, 512 MB/1 GB memory, and dual 100 Mb/s Ethernet interfaces;
  - Lucent P550 Ethernet switch with 22 Gb/s backplane bandwidth.
Scalability under Balanced Workload
- NoRep is about twice as fast as Rep=4 under 50% writes.
- Insignificant performance difference across the three consistency levels under a balanced workload.
Skewed Workload
- Each skewed workload consists of requests chosen from a set of partitions according to a Zipf distribution.
- Define the workload imbalance factor as the proportion of requests directed to the most popular partition:
  - for a 16-partition service, an imbalance factor of 1/16 indicates a completely balanced workload;
  - an imbalance factor of 1 means all requests are directed to one partition.
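A short sketch of how such a workload and its imbalance factor can be produced: partition i is drawn with probability proportional to 1/(i+1)^a, and the imbalance factor is the fraction of requests hitting the most popular partition. The exponent and request count below are illustrative assumptions.

    /* Sketch of a Zipf-skewed request generator and the resulting imbalance
       factor; exponent and request count are illustrative assumptions. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <math.h>

    #define PARTITIONS 16
    #define REQUESTS   100000

    /* Draw one partition index from a precomputed Zipf CDF. */
    static int zipf_draw(const double *cdf, int n) {
        double u = (double)rand() / RAND_MAX;
        for (int i = 0; i < n; i++)
            if (u <= cdf[i])
                return i;
        return n - 1;
    }

    int main(void) {
        double a = 1.0, cdf[PARTITIONS], sum = 0.0, acc = 0.0;
        long count[PARTITIONS] = { 0 };

        /* Build the Zipf CDF: P(partition i) ~ 1/(i+1)^a. */
        for (int i = 0; i < PARTITIONS; i++)
            sum += 1.0 / pow(i + 1, a);
        for (int i = 0; i < PARTITIONS; i++) {
            acc += (1.0 / pow(i + 1, a)) / sum;
            cdf[i] = acc;
        }

        for (int r = 0; r < REQUESTS; r++)
            count[zipf_draw(cdf, PARTITIONS)]++;

        long max = 0;
        for (int i = 0; i < PARTITIONS; i++)
            if (count[i] > max)
                max = count[i];
        /* 1/16 means completely balanced; 1 means one hot partition. */
        printf("imbalance factor = %.3f\n", (double)max / REQUESTS);
        return 0;
    }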
Impact of Workload Imbalance on Replication Degrees
- 10% writes; level-2 consistency; 8 nodes.
- Replication provides dynamic load-sharing for runtime hotspots (Rep=4 could be up to 3 times as fast as NoRep).
Impact of Workload Imbalance on Consistency Levels
- 10% writes; replication degree 4; 8 nodes.
- Modest performance difference:
  - up to 12% between level 2 and level 3;
  - up to 9% between level 1 and level 2.
Failure Recovery for Primary-Secondary Replication
- Graceful performance degradation.
- Performance drop after the three-node failure.
- Errors and timeouts trailing each recovery (write recovery and synchronization overhead).
Conclusions
Contributions:
- Scalable replication for cluster-based network services; multi-level consistency with staleness control.
- A simple programming model that shields replication and clustering complexities from application service authors.
Evaluation results:
- Replication improves performance for runtime hotspots.
- Performance of level 3 consistency is competitive.
- Levels 2 and 3 carry extra overhead during failure recovery.

http://www.cs.ucsb.edu/research/Neptune