Team SSDMSE06: SuDuelKu ENGINEERING & 18-749: Fault-Tolerant Distributed Systems

advertisement
Team SSDMSE06: SuDuelKu
18-749: Fault-Tolerant Distributed Systems
Saul Jaspan, Lucia de Lascurain, Luis Rios, Yudi Nagata, &
Christopher Nelson
Electrical &Computer
ENGINEERING
Team Members
Lucia de Lascurain
ldelascu@andrew.cmu.edu
Saul Jaspan
Saul.Jaspan@gmail.com
Christopher Nelson
crnelson@cs.cmu.edu
Yudi Nagata
ynagata@andrew.cmu.edu
Luis Rios
Jriostre@andrew.cmu.edu
http://www.ece.cmu.edu/~ece749/teams-06/team1/html/index.html
2
Baseline Application



A real-time, fault-tolerant, high performance game where two or more
SuDoKu players can pit their intelligence against each other.
Su - Duel - Ku
Configuration
 Tools
–
–
–
–

EJB
Linux
MySQL
Jboss
–
–
–
–
Eclipse
LOMBOZ
xDoclet
Ant
Architectural Elements
–
–
–
Client(s)
Game Server
–Session Bean
–Entity Beans
Database
3
Baseline Architecture
Game
s()
e
am
G
t
lis
listG
ame
s( )
…
Client 1
e()
joinGam
SuDuelKu
Server
om() Bean
inGameRo
jo
Client N
getPlay
erLocal(
)
list
Sq
uar
es ()
CM
P
Game
Room
Player
ca
ll
CM
P ca
ll
MySQL
CMP call
all
c
P
CM
Square
SuDuelKu Server
Legend
Server Boundary Client
a
b
Session Bean Entity Bean Database a calls b
4
Fault-Tolerance Goals



Stateless Replicated SuDuelKu Server (each on their own machine)
Stateless Replicated Global Naming Server (each on their own machine)
Sacred Machine with
–
–
–
–

Stateful Recovery Manager
Stateful Fault-Injector
Stateful Database ;-)
JMS Server
Fault-Tolerant Framework
–
Global Recovery Manager
• Fault Detection
• Replica Creation
• Replica Destruction
–
Global Fault Injector
5
Mechanisms for Fail-Over







The client asks the global naming server for the primary server
If the game server goes down, the client gets a SuduelkuException.
After getting an exception, the client waits for a configurable amount of
time.
Meanwhile, the recovery manager tries to execute isAlive on the crashed
server and gets a CommunicationException.
The recovery manager notifies the naming server that the server is down
and sets a new primary server.
The recovery manager attempts to start the crashed server.
When the client’s waiting time is over, it asks the global naming server for
the primary server again.
6
FT-Baseline Architecture 1
Client 1
…
joinG
ame()
Client N
()
getPrimary
getPrimary()
joinGame()
joi
nG
am
e( )
join
Gam
e()
Name Server
Fault
Injector
9
Kill -
SuDuelKu
Server
(primary)
-9
ll
i
K
CMP
isA
liv
e ()
SuDuelKu
Server
(backup)
MySQL
CMP
isAlive()
setPrimary()
Recovery
Manager
Legend
a
Client
Replicated Server
FT Framework
Global Naming
Database
b
a calls b
7
FT-Baseline Architecture 2
Name
Server
Recovery Manager
SuDuelKuClient
EJBClient
void startNewServer(…)
primary
backup
backup
primary
Waiting EJBClient
void setCommandRate(…)
SuDuelKu Server
primary
bool isAlive(…)
FaultInjector
void killServer(…)
backup
8
Fault-Tolerance Experimentation

Latency vs. Size of Reply

Latency is independent of size of reply – the network isn’t the bottleneck
9
Fault-Tolerance Experimentation

Latency vs. 3 Sigma Outliers per Experiment

Number of clients is the major cause of latency
10
Fail-Over Measurements

Pie chart break-down of failover time
< 1%
< 1%
99%

Fault-Detection is the culprit!
11
More-FT Architecture 1
CMP
Game
Database
ve
i
l
isA
Recovery
Manager
Fault
Injector
publish
JMS
ent
erA
nsw
er
on
M
es
sa
ge
Backup
Server
isAlive
isA
liv
e
Primary
Server
ary
m
i
r
getP
-9
kill
kill -9
Client 1
Name Server
CMP
Name
Database
Legend
a
Client
Server
FT Framework Global Naming
Database
JMS
b
a calls b
12
More-FT Architecture 2
Name Server
List getNamingServer()
List getGamingServers()
Server getPrimary()
bool isAlive()
SuDuelKuClient
EJBClient
Waiting EJBClient
void setCommandRate(…)
Recovery Manager
void
void
void
void
void
void
void
startGameServer(…)
startNameServer(…)
blackListServer(…)
gameServerDown(…)
gameServerBackUp(…)
nameServerDown(…)
nameServerBackUp(…)
List availableServers
List currentServers
List blackListedServers
FaultInjector
void killServer(…)
void killPrimary()
SuDuelKu Server
bool isAlive(…)
JMS
void publish(…)
13
Other Features

Fault-Tolerant Naming Server
–

Callback via JMS
–
–
–

Good recovery design supports smart recovery tactics
CLI & GUI !!!
–
–
–

JBoss implementation of JMS was good enough
Durable subscriptions allowed for clients to be stateless
JBoss JMX console helps debugging messaging code
#1 Recovery Manager - Blacklist problematic servers
–

Adds significant complexity to the Recovery Manager state machine
Middleware supports clean separation of concerns - Client/Server
CLI supports easy scripting of clients
GUI supports easy game play!
Eclipse & Lomboz
–
Tools can have a steep learning curve, but once learned can save time
14
Insights from Measurements

Scripting the clients can create unrealistic usage of the system
–

Stateless servers increase database bottleneck
–

For our application, we may consider sending more data per message to reduce
the number of sent messages - thus promoting scalability.
Running experiments was not as simple as we thought
–
–
–
–

There is a tradeoff between fast data access and failover-time (complexity)
Reply size did not dramatically impact latency, but request rate did
–

Implementing a usable system with probes will allow for data collection under
true system usage
Kerberos authentication expired
Ran out of disk space
Had trouble keeping multiple clients up
Coordination between multiple components: JBoss, database, clients
Getting representative data is near impossible
–
Expectation management
15
Open Issues

Strange behavior after recovering from multiple faults
–
–

Messaging Issues
–
–

JBoss sporadically reports communication errors when beans are
communicating via local interfaces
Possibly addressed by using separate JBoss configuration directories - one for
our FT Naming Servers and one for our FT Game servers
Detect finished games to purge unused topics and subscriptions
Identify the exceptions generated by JMS server failure
Still to come…
–
–
–
–
–
–
Fault-Tolerant callbacks
Improved server startup time to increase tolerable failure rate
Improve scalability to support expected growth in the user community ;-)
Calculating player scores
Purging complete games from the database
Algorithm for game generation
16
Conclusions (1)

Enlightenment
–
–
–

Middleware does not remove all complexity, it just moves the complexity into a
new layer
Middleware decisions place constraints on software architecture that must be
addressed during design
Fault-Tolerance, Real-Time, and High-Performance all involve tradeoffs that
must be understood and be made explicitly
Why we are Team #1 (read #1 Team)
–
–
–
–
–
Dueling SuDoKu game
Fault-Tolerant game server
Fault-Tolerant global naming server
Callbacks (soon to be FT) within a Fault-Tolerant application
An awesome GUI
17
Conclusions (2)

New approaches for next time
–
–
–
–
–
–
–
–
Naming conventions
Coding conventions
Start implementing probes early
Start gathering data early
Design for probes
Design for a scriptable client
Start thinking about transactions early
Be aware of concurrency issues
18
Demo
19
Questions
20
Additional thoughts




How would active replication help?
What would be the impact of the active replication? Can we keep state in
the database?
How complex would it be to keep server with state? Could there be a mixed
style?
Some more lessons
–
–
–
–
–
–
–
CMP is handy but very complex
CMP has limitations with database design (no referential integrity allowed)
Experimentation is a laborious process
Debugging the server is not as easy as the client
JNI works differently on solaris than linux
Output to indicate progress of experiments effects the results, but without it,
debugging experiments is non-trivial
Use a mechanism for disabling debug/probe output
21
Download