Hwi Cheong (Paul) hcheong@andrew.cmu.edu
Mohammad Ahmad mohman@cmu.edu
Joohoon Lee jool@ece.cmu.edu
Jun Han junhan@andrew.cmu.edu
Suk Chan Kang sckang@andrew.cmu.edu
Blackjack game application
User can create tables and play Blackjack.
User can create/retrieve profiles.
Configuration
Operating System: Linux
Middleware: Enterprise Java Beans (EJB)
Application Development Language: Java
Database: MySQL
Application Server: JBoss (J2EE 1.4)
Three-tier system
Server completely stateless
Server name hard-coded into clients
Every client talks to HostBean (session)
Passive replication
Completely stateless servers
No need to transfer states from primary to backup
All states stored in database
Only one instance of HostBean (a stateless session bean) is needed to handle multiple client invocations, which is efficient on the server side
Degree of replication depends on number of available machines
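The stateless-server property above can be sketched in plain Java. This is an illustrative sketch, not the project's EJB code: the `Database` interface stands in for the MySQL access layer, and `HostBeanSketch` keeps no per-client fields, so any replica can serve any request.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the stateless-server idea: the bean holds no session state of its
// own; all game state lives in the (shared) database, so primary and backup
// replicas are interchangeable with no state transfer between them.
public class HostBeanSketch {
    interface Database {                       // stands in for MySQL access
        String load(String key);
        void store(String key, String value);
    }

    private final Database db;                 // shared state lives here, not in the bean

    public HostBeanSketch(Database db) { this.db = db; }

    // Any replica handling this call produces the same result, because the
    // bean itself is stateless; "hit" is an illustrative Blackjack operation.
    public String hit(String playerId) {
        String hand = db.load(playerId);
        String newHand = hand + "+card";
        db.store(playerId, newHand);
        return newHand;
    }
}
```

Because neither replica caches per-client state, a failover between them needs no primary-to-backup state transfer, exactly as the bullets above claim.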
Sacred machines (not subject to fault injection)
Replication Manager (chess)
MySQL database (mahjongg)
Clients
Responsible for server availability notification and recovery
Server availability notification
Server notifies Replication Manager during boot.
Replication Manager pings each available server periodically.
Server recovery
Process fault: pinging fails; the Replication Manager reboots the server by sending a script to the machine.
Machine fault (crash fault): pinging fails and sending the script has no effect; the machine must be rebooted and the server relaunched manually.
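The detection-and-recovery loop above can be sketched as follows. This is a minimal sketch, not the project's code: `Pinger`, `RecoveryScript`, `register`, and `pingAll` are illustrative names, and in practice `pingAll` would run on a periodic timer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Replication Manager's failure detector: servers register at
// boot, a periodic pass pings each one, and a failed ping triggers the
// recovery script for that machine (which handles a process fault; a machine
// fault still needs manual intervention).
public class FailureDetector {
    interface Pinger { boolean ping(String server); }       // e.g. an RMI/TCP health check
    interface RecoveryScript { void run(String server); }   // e.g. remote restart script

    private final Map<String, Boolean> available = new ConcurrentHashMap<>();
    private final Pinger pinger;
    private final RecoveryScript recovery;

    public FailureDetector(Pinger pinger, RecoveryScript recovery) {
        this.pinger = pinger;
        this.recovery = recovery;
    }

    // Called by a server when it boots.
    public void register(String server) { available.put(server, true); }

    // One periodic detection pass.
    public void pingAll() {
        for (String server : available.keySet()) {
            boolean alive = pinger.ping(server);
            available.put(server, alive);
            if (!alive) {
                recovery.run(server);   // attempt reboot via script
            }
        }
    }

    public boolean isAvailable(String server) {
        return Boolean.TRUE.equals(available.get(server));
    }
}
```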
Client-RM communication
Client contacts Replication Manager each time it fails over
Client quits when Replication Manager returns no server or Replication Manager can’t be reached.
Evaluation of Performance (before improvements)
Server process is killed.
Client receives a RemoteException
Client contacts Replication Manager and asks for a new server.
Replication Manager gives the client a new server.
Client remakes invocation to new server
Replication Manager sends script to recover crashed server
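The client-side portion of the steps above can be sketched as follows. This is an illustrative sketch under assumed interfaces (`Host`, `ReplicationManager.getNewServer`), not the project's EJB client: the real client would obtain the new HostBean via JNDI.

```java
import java.rmi.RemoteException;

// Sketch of the failover path: a RemoteException from the current server
// triggers one request to the Replication Manager for a replacement server,
// and the invocation is remade there.
public class FailoverClient {
    interface Host { String invoke(String request) throws RemoteException; }
    interface ReplicationManager { Host getNewServer(); }   // null when no server is left

    private final ReplicationManager rm;
    private Host current;

    public FailoverClient(ReplicationManager rm, Host initial) {
        this.rm = rm;
        this.current = initial;
    }

    public String invokeWithFailover(String request) {
        try {
            return current.invoke(request);                 // normal case
        } catch (RemoteException serverDown) {
            current = rm.getNewServer();                    // ask the RM for a new server
            if (current == null) {
                throw new IllegalStateException("no servers available; client quits", serverDown);
            }
            try {
                return current.invoke(request);             // remake the invocation
            } catch (RemoteException stillDown) {
                throw new IllegalStateException("failover target also failed", stillDown);
            }
        }
    }
}
```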
3 servers initially available
Replication Manager on chess
30 fault injections
Client keeps making invocations until 30 failovers are complete.
4 probes on server, 3 probes on client to calculate latency
[Figure: End-to-End Roundtrip Latency vs. Invocation #]
Maximum jitter: ~700 ms
Minimum jitter: ~300 ms
Average failover time: ~404 ms
Most of the failover latency comes from receiving the exception from the failed server and connecting to the new one.
[Pie chart: breakdown of a failure-induced run — Query Naming Service: 1%; Detection of Failure + Reconnecting to new Server: 2%; Normal run time: 97%]
Real-time Fault-Tolerant Baseline
Architecture Improvements
Fail-over time Improvements
Saving list of servers in client
Reduces time communicating with replication manager
Pre-creating host beans
Client creates HostBeans on all servers as soon as it receives the server list from the Replication Manager
Runtime Improvements
Caching on the server side
Client-RM and Client-Server
Improvements
Client-RM and Client-Server communication
Client contacts Replication Manager each time it runs out of servers to receive a list of available servers.
Client connects to every server in the list and creates a HostBean in each, then starts the application with one server.
During each failover, client connects to the next server in the list.
No looping within the list: once the list is exhausted, the client asks the Replication Manager for a fresh one.
Client quits when Replication Manager returns an empty list of servers or Replication Manager can’t be reached.
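The improved list-based protocol above can be sketched as follows (an illustrative sketch; `ServerList` and `getAvailableServers` are assumed names, not from the project):

```java
import java.util.List;

// Sketch of the improved client logic: fetch a whole list of servers from the
// Replication Manager, walk it forward on each failover (no looping back),
// and contact the RM again only when the list is exhausted. An empty list
// means the client quits.
public class ServerList {
    interface ReplicationManager { List<String> getAvailableServers(); }

    private final ReplicationManager rm;
    private List<String> servers = List.of();
    private int next = 0;

    public ServerList(ReplicationManager rm) { this.rm = rm; }

    // Returns the next server to fail over to, or null when the RM has none
    // left (at which point the client quits).
    public String nextServer() {
        if (next >= servers.size()) {          // list exhausted: ask the RM again
            servers = rm.getAvailableServers();
            next = 0;
            if (servers.isEmpty()) return null;
        }
        return servers.get(next++);
    }
}
```

Skipping the round trip to the Replication Manager on most failovers is what cuts the failover latency reported in the evaluation.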
Caching in server
Saves commonly accessed database data in server
Uses a HashMap to map each query to previously retrieved data.
O(1) lookup for cached queries
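The server-side cache above can be sketched in a few lines (an illustrative sketch; the `database` function stands in for the real MySQL query path):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Sketch of the server-side query cache: results are kept in a hash map keyed
// by the query string, giving O(1) lookups and skipping the database on
// repeat queries.
public class QueryCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    private final Function<String, String> database;   // stands in for MySQL access
    private int databaseHits = 0;

    public QueryCache(Function<String, String> database) { this.database = database; }

    public String query(String sql) {
        return cache.computeIfAbsent(sql, q -> {
            databaseHits++;                            // only reached on a cache miss
            return database.apply(q);
        });
    }

    public int databaseHits() { return databaseHits; }
}
```

A repeated query never touches the database again, which is where the non-failover RTT improvement in the evaluation comes from.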
3 servers initially available
Replication Manager on chess
30 fault injections
Client keeps making invocations until 30 failovers are complete.
4 probes on server, 5 probes on client to calculate latency and naming service time
Client probes
Probes around getPlayerName() and getTableName()
Probes around getHost() – for failover
Server probes
Record source of invocation – name of method
Record invocation arrival and result return times
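A probe of the kind described above can be sketched as follows (an illustrative sketch; `Probe` and `time` are assumed names, not the project's instrumentation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Sketch of a latency probe: timestamps are taken around an invocation and
// the elapsed time is recorded together with the method name, matching the
// "arrival time / return time / method name" records described above.
public class Probe {
    public static class Sample {
        public final String method;
        public final long nanos;
        Sample(String method, long nanos) { this.method = method; this.nanos = nanos; }
    }

    private final List<Sample> samples = new ArrayList<>();

    // Wraps an invocation with start/end timestamps and records the latency.
    public <T> T time(String method, Supplier<T> invocation) {
        long start = System.nanoTime();
        try {
            return invocation.get();
        } finally {
            samples.add(new Sample(method, System.nanoTime() - start));
        }
    }

    public List<Sample> samples() { return samples; }
}
```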
Average failover time: 217 ms
About half of the failover latency before improvements (404 ms)
Non-failover RTT is visibly lower (shown on graphs below)
[Figure: End-to-End Roundtrip Latency vs. # Invocations, before vs. after the real-time implementation]
Blackjack game GUI
Load-balancing using Replication Manager
Multiple clients per table (via JMS)
Profiling on JBoss to help improve performance
Generating a more realistic workload
TimeoutException for bounded detection of server failure
What we have accomplished
Fault-tolerant system with automatic server detection and recovery
Our real-time implementations proved to be successful in improving failover time as well as general performance
What we have learned
Merging code can be a pain.
A stateless bean is accessed by multiple clients.
State can exist even in stateless beans and is useful when shared by all clients: a cache!
What we would do differently
Start evaluation earlier…
Put more effort and time into implementing timeouts to enable bounded detection of server failure.