Dedicated Servers in Gears of War 3
Scaling to Millions of Players
Michael Weilbacher
Development Manager, Microsoft Studios
Introductions
Michael Weilbacher
● Technical Development Manager at Microsoft
● 1.5 years at Microsoft
● 16.5 years in the game industry
● Shipped games:
  ● Gears of War 3, Magic the Gathering: Tactical
  ● Mortal Kombat: Deception to Mortal Kombat vs. DC Universe
  ● John Woo presents Stranglehold, NBA Ballers, Blitz, MLB Slugfest, Psi-Ops: The Mindgate Conspiracy
  ● NASCAR 02-03, Madden NFL 97-03, NCAA Football 97-02, and some more…
Topics – From the beginning to the end
● What are/why dedicated servers
● The consumer experience
● The associated cost
● Game code decisions
● Administering the servers
● Implementation rollout
● Out in the wild
● Trends and the future
What are dedicated servers?
● A 32-bit headless client instance without a renderer or user input
● Multiple instances hosted on a single server
● Servers hosted in a datacenter
● Multiple datacenters worldwide support the community
● Software infrastructure that ties it all together
Why dedicated servers?
● Best game experience
  ● Datacenters provide high bandwidth and low latency
  ● Increased host performance
  ● Consistency between games
● Addresses Gears 2 problems
  ● Prevents host latency advantages
  ● Reduces host quitting and game interruption
  ● Cheaters and lag switches
  ● Community perception/expectations
● Decided against distributing the server to the public
  ● Reduces problem scope
  ● Security concerns
  ● Controls the experience with consistent performance/bandwidth
● The downside of hosting is the increase to the game cost
Overview of our datacenters
● Four large datacenters
● Four small datacenters
● Over 900 servers worldwide
● Average ~70 users per core
What is our latency tolerance?
● < 150 ms is playable; 50-90 ms is best
● Average case after launch at the datacenters was 75 ms
● Able to tweak by region
  ● Oceania/Asia requirements relaxed slightly after launch
● During development (see the sketch below)
  ● Playtest labs tested the worst case
  ● Artificial latency > 200 ms
  ● Packet loss at 5-10%
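The deck does not show how the labs injected these conditions. As a rough sketch of the general technique — delaying outbound packets in a queue and dropping a configurable fraction — it could look like this (names such as `SimulatedLink` are hypothetical, not the shipped tooling):

```cpp
// Hypothetical sketch: inject artificial latency and packet loss on a send
// path, as the playtest labs did (>200 ms latency, 5-10% loss).
#include <chrono>
#include <cstdlib>
#include <queue>
#include <vector>

using Clock = std::chrono::steady_clock;

struct DelayedPacket {
    Clock::time_point releaseAt;   // when the packet may actually be sent
    std::vector<char> payload;
};

class SimulatedLink {
public:
    SimulatedLink(int latencyMs, double lossRate)
        : latencyMs_(latencyMs), lossRate_(lossRate) {}

    void Send(std::vector<char> payload) {
        // Drop a configurable fraction of packets (e.g. 0.05 - 0.10).
        if (std::rand() / double(RAND_MAX) < lossRate_) return;
        queue_.push({Clock::now() + std::chrono::milliseconds(latencyMs_),
                     std::move(payload)});
    }

    // Called every network tick: flush packets whose delay has elapsed.
    template <typename SendFn>
    void Pump(SendFn actuallySend) {
        while (!queue_.empty() && queue_.front().releaseAt <= Clock::now()) {
            actuallySend(queue_.front().payload);
            queue_.pop();
        }
    }

private:
    int latencyMs_;
    double lossRate_;
    std::queue<DelayedPacket> queue_;
};
```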
Finding the server hosted game
● Each server hosted game is assigned an ID based on its datacenter
● Each client is assigned one of these IDs based on an IP-to-location lookup (sketched below)
● In the matchmaking query, the client looks for a server hosted game with a matching ID
● Hosted games balance experience and TrueSkill rating based on the players that join
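A minimal sketch of that assignment, assuming a hypothetical prefix table standing in for a real GeoIP database:

```cpp
// Hypothetical sketch: assign a client a datacenter ID from an
// IP-to-location lookup, then filter matchmaking results by that ID.
#include <cstdint>
#include <map>

// Assumed mapping: first IP octet -> datacenter ID. A real system would use
// a full GeoIP database; this table is purely illustrative.
static const std::map<uint8_t, int> kPrefixToDatacenter = {
    {24, 1},   // e.g. North America
    {62, 2},   // e.g. Europe
    {120, 3},  // e.g. Asia / Oceania
};

int DatacenterIdForClient(uint32_t clientIp) {
    uint8_t prefix = static_cast<uint8_t>(clientIp >> 24);
    auto it = kPrefixToDatacenter.find(prefix);
    return it != kPrefixToDatacenter.end() ? it->second : 0; // 0 = fallback
}

// During matchmaking, the client advertises its datacenter ID and the
// search only returns hosted games tagged with the same ID.
bool GameMatchesClient(int gameDatacenterId, int clientDatacenterId) {
    return gameDatacenterId == clientDatacenterId;
}
```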
Consumers finding the best match
● Matchmaking returns servers in the TrueSkill/XP range
● Four types of queries (sketched below):
  ● Best – looking for exactly my party size
  ● Any – looking for any match that fits the party
  ● Empty – configure a new host from the shared pool
  ● Default to peer to peer
● Lots of knobs to tweak allows much control over the matchmaking experience
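The slides list the query types but not the flow between them; a hedged sketch of the cascade, with `Search` as a stub standing in for the real LIVE matchmaking query:

```cpp
// Hypothetical sketch of the query cascade: Best, then Any, then an Empty
// server from the shared pool, and finally the peer-to-peer fallback.
#include <initializer_list>
#include <optional>

struct HostedGame { int id = 0; };

enum class QueryType { Best, Any, Empty };
enum class MatchResult { Server, PeerToPeer };

// Stub standing in for the real LIVE matchmaking query.
std::optional<HostedGame> Search(QueryType /*type*/, int /*partySize*/,
                                 int /*datacenterId*/) {
    return std::nullopt;
}

MatchResult FindMatch(int partySize, int datacenterId, HostedGame& out) {
    for (QueryType q : {QueryType::Best, QueryType::Any, QueryType::Empty}) {
        if (auto game = Search(q, partySize, datacenterId)) {
            out = *game;
            return MatchResult::Server;
        }
    }
    return MatchResult::PeerToPeer; // games are always available
}

int main() {
    HostedGame game;
    return FindMatch(5, 1, game) == MatchResult::Server ? 0 : 1;
}
```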
Games always available
● Always fall back to a player hosted match
  ● Necessary if we ever phase out servers over the life of the project
  ● Underestimating server need should never affect players
  ● Some people are just not close to datacenters
● Favor the server experience
  ● Die roll to balance between "host rich vs. client rich" (see the sketch below)
● Servers can host migrate if needed
  ● Server to peer-to-peer migration, but not server to server
  ● Tracking this metric shows host migration is rare: ~0.17% of matches
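The deck doesn't explain the die roll's tuning; one plausible reading — a weighted coin flip deciding whether a party hosts a peer-to-peer match, keeping the population balanced — might look like this (the probability model is an assumption, not the shipped logic):

```cpp
// Hypothetical sketch: a weighted die roll deciding whether a party that
// could host a peer-to-peer match does so, balancing "host rich" against
// "client rich" in the population.
#include <random>

bool ShouldBecomePlayerHost(double hostRichness /* 0..1, share of spare hosts */) {
    static std::mt19937 rng{std::random_device{}()};
    // Assumption: the fewer spare hosts the population has, the more likely
    // this party is asked to host instead of joining as clients.
    double hostProbability = 1.0 - hostRichness;
    return std::bernoulli_distribution(hostProbability)(rng);
}
```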
How much hardware do we need for day one?
● Historical data from the previous game
  ● Gears 2 multiplayer data
  ● Gears supports 10 players max
● Sales forecast per region
● Formula driven (see the worked example below)
  ● Assumed a 15% online attach rate and a 30% concurrency rate
● Can be costly if you are wrong
  ● If too little, the community is unhappy
  ● If too much, the accountants are unhappy
  ● Easier to ask for the accountants' forgiveness
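A worked example of that formula, using the attach/concurrency rates from the slide and the games-per-core density quoted later in the deck; the regional forecast figure is illustrative, not a real number:

```cpp
// Hypothetical worked example of the capacity formula: regional sales
// forecast x 15% online attach x 30% peak concurrency, 10 players per
// match, ~7.2 hosted games per core (the Gears 3 density quoted later).
#include <cstdio>

int main() {
    const double forecastUnits   = 1000000;  // illustrative regional forecast
    const double attachRate      = 0.15;     // share of buyers who play online
    const double concurrentRate  = 0.30;     // share of those online at peak
    const int    playersPerMatch = 10;       // Gears supports 10 players max
    const double gamesPerCore    = 7.2;      // Gears 3 measured density

    double peakPlayers = forecastUnits * attachRate * concurrentRate; // 45,000
    double matches     = peakPlayers / playersPerMatch;               // 4,500
    double cores       = matches / gamesPerCore;                      // ~625

    std::printf("peak players: %.0f, matches: %.0f, cores: %.0f\n",
                peakPlayers, matches, cores);
}
```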
How much hardware: should you buy versus rent?
● Purchased enough for long term needs (not peak)
● Rented over 45% in the US
● Rented in regions where it was hard to set up big deployments
  ● GameServers.com
Monthly cost
● Hardware is not the most expensive part
● About the graph:
  ● At our highest cost bandwidth facility
  ● Hardware amortized over 36 months
How much bandwidth do you need?
● Our average hosted game sends out ~7 kb/sec (worked totals below)
● Our average consumer sends in ~4 kb/sec
● VOIP traffic is peer to peer to reduce the host bandwidth requirement
● Cost savings:
  ● Pay for committed (long term) rather than burst (more costly) bandwidth
  ● More upfront, but cheaper over the lifecycle
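To make the numbers concrete, a hedged back-of-the-envelope using the per-game rates from the slide; the cores-per-box count is an assumption for illustration:

```cpp
// Hypothetical worked example: per-match and per-box bandwidth from the
// slide's figures (~7 kb/sec out per hosted game, ~4 kb/sec in per client,
// 10 clients per match, 7.2 games per core).
#include <cstdio>

int main() {
    const double outPerGame   = 7.0;  // kb/sec outbound per hosted game
    const double inPerClient  = 4.0;  // kb/sec inbound per client
    const int    clients      = 10;   // players per match
    const double gamesPerCore = 7.2;  // quoted density
    const int    coresPerBox  = 4;    // assumption, not from the deck

    double inPerGame = inPerClient * clients;
    double perBoxOut = outPerGame * gamesPerCore * coresPerBox;
    double perBoxIn  = inPerGame * gamesPerCore * coresPerBox;

    std::printf("per game: %.1f out / %.1f in (kb/sec)\n", outPerGame, inPerGame);
    std::printf("per box:  %.1f out / %.1f in (kb/sec)\n", perBoxOut, perBoxIn);
}
```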
Matchmaking: LSP or XLIVE/G4WL?
● Punch through an LSP?
  ● Extra level of indirection
  ● Extra latency
  ● Roll your own matchmaking (no advertising on LIVE)
  ● Non-starter
● Games for Windows Live?
  ● Acts as a headless client
  ● Codebase built around LIVE already (UE3 / Gears 1 / Gears PC / Gears 2)
  ● Only minor and focused additions/changes required
G4WL challenges
● Still beholden to client rules
  ● CD key / local admin account necessary per instance
  ● Need one local account for each game process on the servers
● One LIVE account for each hosted game
  ● 1 Gamertag for every 10 users
  ● Microsoft Platform created a custom tool to generate all the accounts
  ● Manually creating the initial 50 Gamertags was no fun
  ● Over 100k Gamertags created!
● Platform did not maintain the accounts for us
  ● Manually accepting the Terms of Service for every Gamertag
  ● Used a web testing solution to help upgrade accounts when account terms changed
  ● Very painful for all parties involved
● Talk to your Developer Account Manager before you go down this route
Modifications to the existing UE3 dedicated server platform
● Sitting idle
  ● Needed to restart every 10 minutes to pull down possibly new information
  ● Servers dynamically need to configure themselves with new updates
  ● Transition period where clients and hosts are synced up
● Detecting "empty" and resetting
  ● People start to go into the game and do not make it
  ● People stop playing and the server needs to become available again
● Server shutdown whitelist
  ● Need to be able to shut down gracefully for upgrades/maintenance
● Auto-configure when the first party joins and re-advertise
  ● Players request the game mode they want to play and the game needs to set itself up
  ● Empty server pool shared across all playlists and configurations
● General robustness
  ● Needs solid uptime, error handling, and shutdown
  ● Fortunately, not a single crash during the beta
  ● However, precision issues creep in after 48 hours, so we reboot as players roll off servers close to that mark
  ● Lots of memory leak testing
  ● Lots of logging, events, perf counters (more on that later)
● Most of these changes have been integrated back into UE3 (a sketch of the idle/empty handling follows)
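A minimal sketch of the idle/empty behavior described above; the "empty" timeout value is an assumption, and the real code lives inside UE3's server loop rather than a standalone struct:

```cpp
// Hypothetical sketch: an idle server restarts every 10 minutes to pick up
// new configuration, and a server whose players never arrived or all left
// resets back into the shared pool.
#include <chrono>

using Clock = std::chrono::steady_clock;

enum class ServerState { Idle, Configured, InMatch };

struct HostedServer {
    ServerState state = ServerState::Idle;
    int connectedPlayers = 0;
    Clock::time_point stateSince = Clock::now();

    void Tick() {
        auto inState = Clock::now() - stateSince;
        switch (state) {
        case ServerState::Idle:
            // Restart periodically so a fresh process pulls down any new
            // playlist/configuration data (10 minutes per the slide).
            if (inState > std::chrono::minutes(10)) RestartProcess();
            break;
        case ServerState::Configured:
            // A party asked for this game mode but nobody made it in;
            // the 2-minute timeout is an assumption, not the shipped value.
            if (connectedPlayers == 0 && inState > std::chrono::minutes(2))
                ResetToPool();
            break;
        case ServerState::InMatch:
            // Everyone left mid-match: make the server available again.
            if (connectedPlayers == 0) ResetToPool();
            break;
        }
    }

    void RestartProcess() { /* exit; the launcher service restarts us */ }
    void ResetToPool()    { state = ServerState::Idle; stateSince = Clock::now(); }
};
```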
Memory and Performance
● Memory was not as big a deal as performance
  ● Servers run under 150 MB/instance
  ● Memory was cheap on the server
● Set a goal of a solid 30 fps network tick rate (see the sketch below)
  ● Simulated load with automated bot matches
  ● Charted fps via performance counters
  ● 2.5 hosted games per core (2009, Gears 2)
  ● 7.2 hosted games per core (2011, Gears 3)
● Memory optimizations
● Major performance wins
  ● Stripped out the renderer
  ● Lots of time spent removing "visual effects" code paths
  ● Get the whole team thinking about dedicated servers
  ● Moving from Server 2008 to 2008 R2 was a 2x win (Vista -> Win7 kernel)
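As a hedged illustration of the 30 fps tick goal, a fixed-rate server loop with the kind of fps measurement that was charted during bot matches (the shipped code would publish a Windows performance counter rather than print):

```cpp
// Hypothetical sketch of a fixed 30 Hz server tick with an fps counter.
#include <chrono>
#include <cstdio>
#include <thread>

int main() {
    using namespace std::chrono;
    const auto tick = milliseconds(1000 / 30); // 30 fps network tick goal
    auto windowStart = steady_clock::now();
    int frames = 0;

    for (;;) {
        auto frameStart = steady_clock::now();

        // TickGameWorld();  // replicate actors, process client moves, etc.

        ++frames;
        if (steady_clock::now() - windowStart >= seconds(1)) {
            std::printf("fps: %d\n", frames);
            frames = 0;
            windowStart = steady_clock::now();
        }

        // Sleep off the remainder of the frame so instances share the core.
        std::this_thread::sleep_until(frameStart + tick);
    }
}
```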
Lessons learned
● Servers load much faster than clients
  ● The server told clients to load things before they had unloaded previous maps -> higher memory watermark and occasional OOM errors
  ● Introduced a configurable latency before loading the next map (sketched after this list)
● No intrinsic first-player assumption
  ● Slow-to-connect players were missing the game because of checks that assumed a player host existed
  ● More code to check that at least one player existed before running the existing checks
● Mixing client and server side optimizations
  ● Lots of animation optimizations: "last render time" code had to be double checked
  ● Invisible collision in a few instances where the animation never played, leaving collision in a bad state
● Make sure the "server" Gamertag is never exposed to the clients
  ● Made sure arbitrated sessions did not include the server in the TrueSkill calculation
  ● Never registered a session for the "server"
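A minimal sketch of that configurable map-load latency, assuming it is read from the key/value configuration the servers poll; `GetConfigInt` and the default value are hypothetical:

```cpp
// Hypothetical sketch: the server waits a tunable delay before telling
// clients to load the next map, so slower clients can finish unloading the
// previous one first.
#include <chrono>
#include <thread>

// Stub standing in for the server's key/value configuration lookup.
int GetConfigInt(const char* /*key*/, int defaultValue) { return defaultValue; }

void BeginMapCycle() {
    int delaySeconds = GetConfigInt("MapLoadDelaySeconds", 10); // illustrative
    std::this_thread::sleep_for(std::chrono::seconds(delaySeconds));
    // NotifyClientsLoadMap(...);  // only now ask clients to start loading
}
```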
Reporting systems
● Created by Games IT at MS
  ● SCMM – monitoring system
  ● Tells Tier 1 staff an issue is occurring
  ● Email reporting and graphing of health
● Monitoring DB for heartbeats in the game process and launcher (sketched below)
● Most common issue is XLive not logging in
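The heartbeat idea in miniature: each component periodically stamps a row in the monitoring DB, and anything that stops stamping gets flagged. `WriteHeartbeat` and the interval are assumptions standing in for a parameterized stored procedure call:

```cpp
// Hypothetical sketch: game processes and the launcher write periodic
// heartbeats to the monitoring DB so stalled components can be detected.
#include <chrono>
#include <string>
#include <thread>

// Stub standing in for an async DB call (a parameterized stored proc).
void WriteHeartbeat(const std::string& /*componentId*/) { /* UPDATE Heartbeat ... */ }

void HeartbeatLoop(const std::string& componentId, const bool& running) {
    while (running) {
        WriteHeartbeat(componentId);
        std::this_thread::sleep_for(std::chrono::seconds(30)); // interval assumed
    }
}
```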
Control center
● Aimed at Tier 1 support
● Silverlight app that interfaces with the Master services
  ● Locked down to the datacenter and not accessible to the team
● Silverlight app that shows high level metrics
  ● Available through login
  ● Webservice only has three read-only service calls
  ● Can fetch log files of a game
Major components of infrastructure
● Master DB
● Master Service
● Launcher Servers
● Game Process
Master DB
● All components handshake with the DB to accomplish work
● Size is fixed after all machines and accounts are added
● Parameterized stored procedures only
● Separate DB for metrics
● No performance issues with proper indices in place
Master service
● Writes to the Master DB
● Configures the machines
  ● Datacenter setup with ID association
  ● Assigns accounts to each machine and each process (account and 5x5 input)
● Installs and health-monitors the launcher service on each machine
● Tracks and moves builds to the datacenter-local cache
  ● Removed from the DB and moved to file caching
● Can inject custom settings into the ini
● Can fetch log files from any game process or launcher service
“Gears of War 3” process
● Runs many per server
● All communication with the database is asynchronous
● DB status messages
  ● Game status (datacenter/game mode/playlist version/map name)
  ● Server status (launching/map cycling/restarting/shutting down/etc.)
● DB configuration options
  ● Queried every time the server restarts or the idle threshold is reached
  ● Query returns various key/value pairs
  ● Very flexible (see the sketch below)
● Many performance counters exposed
  ● Frame rate, thread timings, number of players connected, client connection data (ping, incoming/outgoing traffic, packet loss)
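A hedged sketch of the key/value configuration poll; `FetchConfigFromDb` is a stub for the async stored-procedure call, and the keys shown are illustrative:

```cpp
// Hypothetical sketch: on restart or at the idle threshold, the process
// asks the DB for settings and applies whatever pairs come back, which is
// what keeps the scheme flexible.
#include <cstdio>
#include <map>
#include <string>

using ConfigMap = std::map<std::string, std::string>;

// Stub standing in for the async DB query.
ConfigMap FetchConfigFromDb() {
    return {{"PlaylistVersion", "12"}, {"IdleRestartMinutes", "10"}}; // illustrative
}

void ApplyConfig(const ConfigMap& config) {
    for (const auto& [key, value] : config) {
        // The shipped code would route each key to the relevant subsystem;
        // unknown keys can be ignored, so old servers tolerate new keys.
        std::printf("config: %s = %s\n", key.c_str(), value.c_str());
    }
}

int main() { ApplyConfig(FetchConfigFromDb()); }
```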
Launcher service
● Runs one per server
● Owns the game processes on the server
● DB commands to interact with the game
  ● Start (install if needed from the cache)
  ● Stop (bleeds off clients), Kill, Kill All
  ● Restart server, and clean machine
● Health monitor of the process
  ● Reasons to restart (sketched below):
    ● Every 48 hours
    ● In case the game crashes
    ● Datacenter ID or playlist version does not match
    ● Server status hangs in any state for too long
● Gathers and records the state of the game processes
  ● Game status (datacenter/game mode/playlist version/map name)
  ● Server status (launching/map cycling/restarting/shutting down/etc.)
● Hot swappable
  ● Allowed us to change health rules dynamically without stopping server hosted games
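A minimal sketch of those restart rules as a single predicate; the "hung too long" threshold is an assumption (the real rules were tunable at runtime thanks to the hot-swappable launcher):

```cpp
// Hypothetical sketch of the launcher's health rules: restart a game
// process after 48 hours, after a crash, on a datacenter/playlist mismatch,
// or when its status has hung in one state too long.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct ProcessStatus {
    bool alive;
    int datacenterId, playlistVersion;
    Clock::time_point started, lastStatusChange;
};

bool ShouldRestart(const ProcessStatus& p, int expectedDc, int expectedPlaylist) {
    auto now = Clock::now();
    if (!p.alive) return true;                                  // crashed
    if (now - p.started > std::chrono::hours(48)) return true;  // precision-drift reboot
    if (p.datacenterId != expectedDc) return true;              // misconfigured
    if (p.playlistVersion != expectedPlaylist) return true;     // stale playlist
    // "Too long" per state was a tunable rule; 30 minutes is an assumption.
    if (now - p.lastStatusChange > std::chrono::minutes(30)) return true;
    return false;
}
```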
Health monitoring – good day (chart)
Health monitoring – bad day (chart)
Lessons learned
● Restarting the process automatically is mandatory
  ● Many small things are outside your control; automatic restart lets you come back online quickly
    ● LIVE connectivity
    ● Server hiccups
    ● Configuration issues
● G4WL cannot handle loading all processes at once
  ● We found we needed 10-15 seconds between the load of each game process to prevent XLIVE DLL issues (see the sketch below)
● All administrative applications need the ability to be updated without taking down the server hosted games
  ● From the game to the monitoring services, you never know when you need to make an adjustment, and this allows a simple form of A/B testing
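The staggered startup in sketch form; `LaunchGameProcess` is a stub for the real process spawn, and the 12-second spacing simply sits inside the 10-15 second window from the slide:

```cpp
// Hypothetical sketch: the launcher spaces game process launches 10-15
// seconds apart so the XLIVE DLLs are never initializing in many processes
// at once.
#include <chrono>
#include <thread>

// Stub standing in for CreateProcess on the real launcher.
void LaunchGameProcess(int /*instanceIndex*/) { /* spawn the game exe ... */ }

void LaunchAllInstances(int instanceCount) {
    for (int i = 0; i < instanceCount; ++i) {
        LaunchGameProcess(i);
        if (i + 1 < instanceCount)
            std::this_thread::sleep_for(std::chrono::seconds(12));
    }
}
```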
Developer environment
● Client/server environment
  ● Could run multiple clients and servers on the same machine
    ● Multiple Gamertags / local accounts required (runas.exe)
  ● Maintained a GFWL PC client for rapid iteration
  ● Could run without the admin tool from the command line
  ● UnrealConsole could talk to the server through a socket
    ● All the debugging functionality of UE3
● Admin environment
  ● One datacenter simulation for testing
    ● 5 servers with 1 SQL/webservice server
  ● Could run locally using
    ● Visual Studio 2008 (for the game)
    ● Visual Studio 2010 (admin tools)
    ● SQL Server, and Internet Information Services (IIS)
Phase 1 – Gears 2 title update (April 2010)
● Retrofit the game to support planned Gears 3 features
  ● Good way to introduce features with no expectations
● First test of the new matchmaking flow
● First test of dedicated servers
  ● Limited run of dedicated servers
  ● Controlled environment, closely monitored
● Profiling servers in a real environment
  ● Tested CPU/bandwidth usage in the wild on various hardware
● Found 2 otherwise irreproducible crashes in the wild
  ● Able to get minidumps and figure out the problems
Phase 1.5 – Large test in the labs (January 2011)
● More than 100 people (mostly testers)
● Locked down the available machines and cores to create simulated overload
  ● Monitored CPU and bandwidth
  ● Stresses the servers, but not the infrastructure
● Worked with enterprise staff (outside of games devs) to look for flaws
  ● DB analysis
  ● Network sniffers
  ● Locking down cores on a PC
  ● Number of network cards, etc.
Phase 2 – Gears 3 Beta (April 2011)
● Real rollout of servers to datacenters
● First consumer trials of our server administration tools
  ● Phased rollout
● Huge success
  ● Solid uptime
  ● Gamers happy
Lessons learned (beta)
● Lack-of-communication issues (zombie games)
● Misconfigured servers
● Small number of game and balance issues
● Added more matchmaking tweaks to ease contention
● Good sampling of ping data from around the world
● Discovered data points we should capture during release
● An HTTPS webservice is better than direct DB access
● Better caching of static data in the DB to offset the DB load
Submission process
● MS Cert
  ● Needs to be able to run the server in their environment
  ● Needs to be able to see the client attached to a server
  ● Liked to see that the server is attached to the client
● Challenges of the MS Cert environment
  ● Closed environment
  ● Not accessible to our admin framework or network
  ● Reverse IP lookup cannot find their server
● Solutions
  ● Always keep the ability to run the server by itself without any DB connections
  ● Set the cert environment to use only one datacenter, so all IPs return one datacenter ID
Security reviews of datacenters (before you go out…)
● Always kill the process on security concerns; better to be alerted than be exposed
● The game is signed, but we have exposed connections that must be protected!!
● Use the SDL to examine how trustworthy those communications are and what happens if someone crashes your game process
● File and network fuzzing can be difficult, but worthwhile
● Look for exposure of personal information, especially in log files
● Get an enterprise developer to look at your SQL stored procedures
● Know your game's normal pattern to help spot irregularities
  ● Think of that pattern the way credit card companies look for fraud
DLC / title updates
● Ability to rev the dedicated servers faster than, and independent of, client updates
● Servers have to have all the content
  ● Matchmaking can impose certain requirements on clients before searching
  ● Balance between value to those who purchase content vs. fragmenting our client base
● Plan an update path for your servers
Releases of new G4WL client DLLs
● If the upgrade is mandatory, servers will not work until upgraded
● An update requires a server to shut down all its games
  ● Can be done in a rolling manner
● No matter how much communication, be prepared to be surprised when these happen
● Automated deployment solution to reduce impact
To the cloud…
● Trends suggest that we have a lot of unused time on our servers every day
● With a cloud solution, you could possibly get the following:
  ● Pay for what you use (but more likely at a higher hourly rate)
  ● More volume upfront for day one demands
  ● Tier 1 built into the purchase (hardware issues, network issues)
  ● Could freeze VMs on machines to debug later
The Future
Hopefully your launch looks like ours….
Questions?
Email: mweilb@microsoft.com
Special thanks to:
• Epic Games
• MS Core Pub Team
• MS Games IT
Individual call outs:
Josh Markiewicz
Sam Zamani
Wes Hunt
Ian Thomas
Joe Graf
Vijay Krishnan
Nur Sheikhassan
Chris Kimmell
Chris Wynn
Microsoft Studios Core Publishing is recruiting