Dedicated Servers in Gears of War 3: Scaling to Millions of Players
Michael Weilbacher, Development Manager, Microsoft Studios

Introductions
Michael Weilbacher
● Technical Development Manager at Microsoft
● 1.5 years at Microsoft
● 16.5 years in the game industry
● Shipped games:
  ● Gears of War 3, Magic the Gathering: Tactical
  ● Mortal Kombat: Deception to Mortal Kombat vs. DC Universe
  ● John Woo presents Stranglehold, NBA Ballers, Blitz, MLB Slugfest, Psi-Ops: The Mindgate Conspiracy
  ● NASCAR 02-03, Madden NFL 97-03, NCAA Football 97-02, and some more....

Topics – From the beginning to the end
● What are / why dedicated servers
● The consumer experience
● The associated cost
● Game code decisions
● Administering the servers
● Implementation rollout
● Out in the wild: Gears of War 3 dedicated servers
● Trends and the future

What are dedicated servers?
● 32-bit headless client instance without renderer or user input
● Multiple clients hosted on a single server
● Servers hosted in a datacenter
● Multiple datacenters worldwide support the community
● Software infrastructure that ties it all together

Why dedicated servers?
● Best game experience; addresses Gears 2 problems
● Datacenters provide high bandwidth, low latency
● Increased host performance
● Consistency between games
  ● Prevent host latency advantages
  ● Reduce host quitting and game interruption
  ● Cheaters and lag switches
● Community perception/expectations
● Decided against distributing the server to the public
  ● Reduces problem scope
  ● Security concerns
  ● Control the experience with consistent performance/bandwidth
● The downside of hosting is the increase to the game's cost

Overview of our datacenters
● Four large datacenters
● Four small datacenters
● Over 900 servers worldwide
● Average ~70 users per core

What is our latency tolerance?
● < 150 ms is playable; 50-90 ms is best
● Average after launch across datacenters was 75 ms
● Able to tweak by region
  ● Oceania/Asia requirements relaxed slightly after launch
● During development, playtest labs tested the worst case
  ● Artificial latency > 200 ms
  ● Packet loss at 5-10%

Finding the server hosted game
● Each server hosted game is assigned an ID based on its datacenter
● Each client is assigned one of these IDs based on an IP-to-location lookup
● In the matchmaking query, the client looks for a server hosted game with its ID
● The hosted game balances experience and TrueSkill rating based on the players that join

Consumers finding the best match
● Matchmaking returns servers in the party's TrueSkill/XP range
● 4 types of queries (sketched in code after this section):
  ● Best – looking for exactly my party size
  ● Any – looking for any match that fits the party
  ● Empty – configure a new host from the shared pool
  ● Default to peer to peer
● Lots of knobs to tweak allows much control over the matchmaking experience

Games always available
● Always fall back to a player hosted match
  ● Necessary if we ever phase out servers over the life of the project
  ● Underestimating server need should never affect players
  ● Some people are just not close to datacenters
● Favor the server experience
  ● Die roll to balance between "host rich" vs. "client rich"
● Servers can host migrate if needed
  ● Server to peer-to-peer migration, but not server to server
  ● Tracking this metric shows host migration is rare: ~0.17% of matches
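The deck does not show code for this flow, so here is a minimal standalone C++ sketch of the query cascade described above (Best, then Any, then Empty, then the peer-to-peer fallback). All names here (MatchQuery types, searchSessions, findMatch) are hypothetical illustrations, not the actual Gears 3 or LIVE matchmaking API.

```cpp
// Minimal sketch of the matchmaking fallback cascade described above.
// All names are hypothetical; the real flow goes through LIVE matchmaking.
#include <iostream>
#include <optional>
#include <string>

enum class QueryType { Best, Any, Empty };   // the three server-hosted query types
struct MatchResult { std::string sessionId; bool serverHosted; };

// Placeholder for a LIVE session search filtered by datacenter ID,
// TrueSkill/XP range, and query type. Returns nothing when no match is found.
std::optional<MatchResult> searchSessions(int datacenterId, int partySize, QueryType type)
{
    (void)datacenterId; (void)partySize; (void)type;
    return std::nullopt;   // stub: pretend nothing was found
}

MatchResult findMatch(int datacenterId, int partySize)
{
    // 1) Best: a server-hosted game with exactly this much room left.
    if (auto m = searchSessions(datacenterId, partySize, QueryType::Best)) return *m;
    // 2) Any: any server-hosted game the party fits into.
    if (auto m = searchSessions(datacenterId, partySize, QueryType::Any)) return *m;
    // 3) Empty: claim a fresh host from the shared server pool and configure it.
    if (auto m = searchSessions(datacenterId, partySize, QueryType::Empty)) return *m;
    // 4) Fall back to a player-hosted (peer-to-peer) match so games are always available.
    return MatchResult{ "player-hosted-session", /*serverHosted=*/false };
}

int main()
{
    MatchResult m = findMatch(/*datacenterId=*/3, /*partySize=*/2);
    std::cout << (m.serverHosted ? "server" : "peer-to-peer") << ": " << m.sessionId << "\n";
}
```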
How much hardware do we need for day one?
● Historical data from the previous game: Gears 2 multiplayer data
● Gears supports 10 players max
● Sales forecast per region
● Formula driven (see the worked sketch further below)
  ● Assumed a 15% online attach rate and a 30% concurrency rate
● Can be costly if you are wrong
  ● If too little, the community is unhappy
  ● If too much, the accountants are unhappy
  ● Easier to ask for the accountants' forgiveness

How much hardware: should you buy versus rent?
● Purchased enough for long term needs (not peak)
● Rented over 45% in the US
● Rented in regions where it was hard to set up big deployments
  ● GameServers.com

Monthly cost
● Hardware is not the most expensive part
● About the graph: at our highest-cost-bandwidth facility; hardware amortized over 36 months

How much bandwidth do you need?
● Our average hosted game sends out ~7 kb/sec
● Our average consumer sends in ~4 kb/sec
● VOIP traffic is peer to peer to reduce the host bandwidth requirement
● Cost savings: pay for burst (more costly) versus committed (long term)
  ● Committed costs more upfront, but is cheaper over the lifecycle

Matchmaking: LSP or XLIVE/G4WL?
● Punch through an LSP?
  ● Extra level of indirection, extra latency
● Roll your own matchmaking (no advertising on LIVE)?
  ● Non-starter
● Games for Windows Live?
  ● Acts as a headless client
  ● Codebase built around LIVE already (UE3 / Gears 1 / Gears PC / Gears 2)
  ● Only minor and focused additions/changes required

G4WL challenges
● Still beholden to client rules
  ● CD key / local admin account necessary per instance
  ● Need one local account for each game process on servers
  ● One LIVE account for each hosted game: 1 Gamertag for every 10 users
● Microsoft Platform created a custom tool to generate all the accounts
  ● Manually creating the initial 50 Gamertags was no fun
  ● Over 100k Gamertags created!
● Platform did not maintain the accounts for us
  ● Manually accepting Terms of Service for every Gamertag
  ● Used a web testing solution to help upgrade accounts when account terms changed
  ● Very painful for all parties involved
● Talk to your Developer Account Manager before you go down this route

Modifications to the existing UE3 dedicated server platform
● Sitting idle
  ● Needed to restart every 10 minutes to pull down possibly new information
  ● Dynamically need to configure themselves with new updates
  ● Transition period where clients and hosts are synced up
● Detecting "empty" and resetting (a simplified sketch appears after this section)
  ● Players start to join a game and do not make it in
  ● People stop playing and the server needs to become available again
● Server shutdown whitelist
  ● Need to be able to shut down gracefully for upgrades/maintenance
● Auto configure when the first party joins and re-advertise
  ● Players request the game mode they want to play and the game needs to set itself up
  ● Empty server pool shared across all playlists and configurations

General robustness
● Needs solid uptime and graceful handling of error conditions and shutdown
● Fortunately not a single crash during the beta
● However, precision issues creep in after 48 hours, so we reboot as players roll off servers close to that mark
● Lots of memory leak testing
● Lots of logging, events, perf counters (more on that later)
● Most of these have been integrated back into UE3

Memory and Performance
● Memory was not as big a deal as performance
  ● Servers run under 150 MB/instance
  ● Memory was cheap on the server
● Set a goal for a solid 30 fps network tick rate
● Simulated load with automated bot matches
● Charted fps via performance counters
● 2.5 hosted games per core (2009, Gears 2)
● 7.2 hosted games per core (2011, Gears 3)
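To make the "formula driven" day-one estimate above concrete, here is a small C++ sketch. The 15% attach rate, 30% concurrency rate, and ~70 users per core come from the talk; the regional sales figure and cores-per-server value are invented placeholders, not Gears 3 data.

```cpp
// Rough day-one capacity estimate in the spirit of the "formula driven" slide.
// Attach/concurrency/users-per-core numbers are from the talk; the sales
// forecast and cores-per-server below are invented placeholders.
#include <cmath>
#include <cstdio>

int main()
{
    const double unitsSoldInRegion = 1'000'000;  // placeholder regional sales forecast
    const double onlineAttachRate  = 0.15;       // 15% of buyers play online (from the talk)
    const double concurrencyRate   = 0.30;       // 30% of online players on at peak (from the talk)
    const double usersPerCore      = 70.0;       // average ~70 users per core (from the talk)
    const double coresPerServer    = 8.0;        // placeholder hardware spec

    const double peakConcurrentUsers = unitsSoldInRegion * onlineAttachRate * concurrencyRate;
    const double coresNeeded         = peakConcurrentUsers / usersPerCore;
    const double serversNeeded       = std::ceil(coresNeeded / coresPerServer);

    std::printf("peak concurrent users: %.0f\n", peakConcurrentUsers);
    std::printf("cores needed: %.1f, servers needed: %.0f\n", coresNeeded, serversNeeded);
}
```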
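The empty-detection and ~48 hour reboot behavior described under "Modifications to the existing UE3 dedicated server platform" and "General robustness" can be summarized in a small standalone sketch. The thresholds, type names, and tickHousekeeping function are illustrative only, not the shipped UE3 logic.

```cpp
// Illustrative sketch of the idle/empty detection and ~48 hour reboot policy
// described above. Thresholds and names are examples, not the shipped code.
#include <chrono>

using Clock = std::chrono::steady_clock;

struct ServerState {
    int connectedPlayers = 0;
    Clock::time_point lastPlayerSeen = Clock::now();
    Clock::time_point processStart   = Clock::now();
};

enum class ServerAction { KeepRunning, ResetToEmptyPool, RequestRestart };

ServerAction tickHousekeeping(const ServerState& s)
{
    using namespace std::chrono;
    const auto idleFor = Clock::now() - s.lastPlayerSeen;
    const auto upFor   = Clock::now() - s.processStart;

    // Players started to join but never made it in, or everyone left:
    // return the host to the shared empty pool so it can be re-advertised.
    if (s.connectedPlayers == 0 && idleFor > minutes(10))   // placeholder threshold
        return ServerAction::ResetToEmptyPool;

    // Precision issues creep in on long-running processes, so request a
    // restart as players roll off once uptime approaches ~48 hours.
    if (s.connectedPlayers == 0 && upFor > hours(47))
        return ServerAction::RequestRestart;

    return ServerAction::KeepRunning;
}

int main()
{
    ServerState s;   // freshly started, no players connected yet
    return tickHousekeeping(s) == ServerAction::KeepRunning ? 0 : 1;
}
```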
Memory optimizations / Major performance wins
● Stripped out the renderer
● Lots of time spent removing "visual effects" code paths
● Get the whole team thinking about dedicated servers
● Moving from Server 2008 to 2008 R2 was a 2x win (Vista -> Win7 kernel)

Lessons learned
● Servers load much faster than clients
  ● The server told clients to load things before they had unloaded previous maps -> higher memory watermark and occasional OOM errors
  ● Introduced a configurable latency before loading the next map
● No intrinsic first-player assumption
  ● Slow-to-connect players were missing the game because of checks that assumed a player host existed
  ● More code to check that at least one player existed before running existing checks
● Mixing client and server side optimizations
  ● Lots of animation optimizations; "last render time" code had to be double checked
  ● Invisible collision in a few instances where the animation never played, leaving collision in a bad state
● Make sure the "server" Gamertag was never exposed to the clients
  ● Made sure arbitrated sessions did not include the server in the TrueSkill calculation
  ● Never registered a session for the "server"

Reporting systems
● Created by Games IT at MS
● SCMM – monitoring system
  ● Tells Tier 1 staff an issue is occurring
  ● Email reporting and graphing of health
● Monitoring DB for heartbeats in the game process and launcher
● Most common issue is XLIVE not logging in

Control center
● Aimed at Tier 1 support
● Silverlight app that interfaces with the Master services
  ● Locked down to the datacenter and not accessible to the team
● Silverlight app that shows high level metrics
  ● Available through login
  ● Web service only has three read-only service calls
  ● Can fetch log files of a game

Major components of infrastructure
● Master DB
● Master Service
● Launcher
● Servers
● Game Process

Master DB
● All components handshake with the DB to accomplish work
● Size is fixed after all machines and accounts are added
● Parameterized stored procedures only
● Separate DB for metrics
● No performance issues with proper indices in place

Master service
● Writes to the master DB
● Configuration setting of the machines
  ● Datacenter setup with ID association
  ● Assigns accounts to each machine and each process (account and 5x5 input)
● Installs and health-monitors the launcher service on each machine
● Tracks and moves builds to the datacenter local cache
  ● Removed from the DB and moved to file caching
● Can inject into the ini for custom settings
● Can fetch log files from any game process or launcher service

"Gears of War 3" process
● Runs many per server
● All communication with the database is asynchronous
● DB status messages
  ● Game status (datacenter/game mode/playlist version/map name)
  ● Server status (launching/map cycling/restarting/shutting down/etc.)
● DB configuration options
  ● Queried every time the server restarts or the idle threshold is reached
  ● Query returned various key/value pairs (sketched in code after this section); very flexible
● Many performance counters exposed
  ● Frame rate, thread timings, number of players connected, client connection data (ping, incoming/outgoing traffic, packet loss)

Launcher service
● Runs one per server; owns the game processes on the server
● DB commands to interact with the game
  ● Start (installs if needed from the cache), Stop (bleeds off clients), Kill, Kill All, Restart server, Clean machine
● Health monitor of the process; reasons to restart:
  ● Every 48 hours
  ● In case the game crashes
  ● Datacenter ID or playlist version does not match
  ● Server status hangs in any state for too long
● Gathers and records the state of the game processes
  ● Game status (datacenter/game mode/playlist version/map name)
  ● Server status (launching/map cycling/restarting/shutting down/etc.)
● Hot swappable
  ● Allowed us to change health rules dynamically without stopping server hosted games
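A minimal sketch of the launcher's health-monitor decision logic described above (48-hour restarts, crash recovery, datacenter ID or playlist mismatch, stuck status). The rules mirror the bullets on the Launcher service slide; the types, names, and the 30-minute "stuck" threshold are illustrative, not the shipped service.

```cpp
// Illustrative launcher health-monitor rules (see the "Launcher service" slide).
// Structure and thresholds are examples only, not the shipped service.
#include <chrono>
#include <string>

using Clock = std::chrono::steady_clock;

struct GameProcessInfo {
    bool        running         = true;
    int         datacenterId    = 0;
    int         playlistVersion = 0;
    std::string serverStatus;                        // launching / map cycling / ...
    Clock::time_point statusSince = Clock::now();
    Clock::time_point startedAt   = Clock::now();
};

struct ExpectedConfig { int datacenterId; int playlistVersion; };

bool shouldRestart(const GameProcessInfo& p, const ExpectedConfig& cfg)
{
    using namespace std::chrono;
    if (!p.running)                                  return true;  // game crashed
    if (Clock::now() - p.startedAt > hours(48))      return true;  // scheduled reboot
    if (p.datacenterId != cfg.datacenterId)          return true;  // wrong datacenter ID
    if (p.playlistVersion != cfg.playlistVersion)    return true;  // stale playlist version
    if (Clock::now() - p.statusSince > minutes(30))  return true;  // status stuck too long (placeholder)
    return false;
}

int main()
{
    GameProcessInfo info;   // healthy, just-started process
    ExpectedConfig  cfg{ info.datacenterId, info.playlistVersion };
    return shouldRestart(info, cfg) ? 1 : 0;
}
```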
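The "Gears of War 3 process" slide above mentions that configuration comes back from the database as key/value pairs applied when the server restarts or hits its idle threshold. Here is a small standalone sketch of that pattern; the specific keys, defaults, and the applyConfig function are made up for illustration.

```cpp
// Sketch of applying key/value configuration pairs fetched from the master DB,
// as described on the process slide. Keys and defaults are invented examples.
#include <iostream>
#include <map>
#include <string>

using ConfigPairs = std::map<std::string, std::string>;

struct ServerConfig {
    int         maxPlayers  = 10;
    std::string playlist    = "default";
    int         idleMinutes = 10;
};

// Apply whatever pairs came back; unknown keys are ignored so the DB side can
// be extended without redeploying the game ("very flexible").
void applyConfig(const ConfigPairs& pairs, ServerConfig& cfg)
{
    for (const auto& [key, value] : pairs) {
        if      (key == "MaxPlayers")  cfg.maxPlayers  = std::stoi(value);
        else if (key == "Playlist")    cfg.playlist    = value;
        else if (key == "IdleMinutes") cfg.idleMinutes = std::stoi(value);
        // unrecognized keys are ignored on purpose
    }
}

int main()
{
    ServerConfig cfg;
    applyConfig({{"Playlist", "TeamDeathmatch"}, {"IdleMinutes", "15"}}, cfg);
    std::cout << cfg.playlist << " idle=" << cfg.idleMinutes << "\n";
}
```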
Health monitoring – good day

Health monitoring – bad day

Lessons learned
● Restarting the process automatically is mandatory
  ● Many small things are outside your control; automatic restarts let you come back online quickly
  ● LIVE connectivity, server hiccups, configuration issues
● G4WL cannot handle loading all processes at once
  ● We found we needed 10-15 seconds between loading each game process to prevent XLIVE DLL issues (sketched below)
● All administrative applications need the ability to be updated without taking down the server hosted games
  ● From the game to the monitoring services, you never know when you need to make adjustments, and this allows a simple form of A/B testing

Developer environment
● Client/server environment
  ● Could run multiple clients and servers on the same machine
  ● Multiple Gamertags / local accounts required (runas.exe)
  ● Maintained the G4WL PC client for rapid iteration
  ● Could run without the admin tool from the command line
  ● UnrealConsole could talk to the server through a socket
  ● All the debugging functionality of UE3
● Admin environment
  ● One simulated datacenter for testing: 5 servers with 1 SQL/web service server
  ● Could run locally using Visual Studio 2008 (for the game), Visual Studio 2010 (admin tools), SQL, and Internet Information Services (IIS)

Phase 1 - Gears 2 title update (April 2010)
● Retrofitted the game to support planned Gears 3 features
  ● Good way to introduce features with no expectations
  ● First test of the new matchmaking flow
  ● First test of dedicated servers
● Limited run of dedicated servers
  ● Profiling servers in a real environment
  ● Controlled environment, closely monitored
  ● Tested CPU/bandwidth usage in the wild on various hardware
● Found 2 otherwise irreproducible crashes in the wild
  ● Able to get minidumps and figure out the problems
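The lesson above about G4WL not handling all processes starting at once can be captured in a tiny sketch. launchGameProcess is a stand-in for however the launcher actually spawns the game, the instance count is a placeholder, and the 12-second gap is just one value within the 10-15 second range mentioned in the talk.

```cpp
// Sketch of staggering game-process startup so the G4WL/XLIVE DLLs are not all
// initializing at once. launchGameProcess() is a placeholder for real spawning.
#include <chrono>
#include <iostream>
#include <thread>

void launchGameProcess(int instanceIndex)
{
    // Stand-in for CreateProcess / service logic that starts one headless game.
    std::cout << "launching instance " << instanceIndex << "\n";
}

int main()
{
    const int  instancesPerServer = 8;                        // placeholder count
    const auto gapBetweenLaunches = std::chrono::seconds(12); // within the 10-15s range from the talk

    for (int i = 0; i < instancesPerServer; ++i) {
        launchGameProcess(i);
        if (i + 1 < instancesPerServer)
            std::this_thread::sleep_for(gapBetweenLaunches);  // let XLIVE login settle before the next one
    }
}
```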
Phase 1.5 – Large test in the labs (January 2011)
● More than 100 people (mostly testers)
● Locked available machines and cores to create simulated overload
● Monitored CPU and bandwidth
● Will stress the servers, but not the infrastructure
● Worked with enterprise staff (outside of game devs) to look for flaws
  ● DB analysis
  ● Network sniffers
  ● Locking down cores on a PC
  ● Number of network cards, etc.

Phase 2 - Gears 3 Beta (April 2011)
● Real rollout of servers to datacenters
● First consumer trials of our server administration tools
● Phased rollout
● Huge success
  ● Solid uptime
  ● Gamers happy

Lessons learned (beta)
● Communication-loss issues ("zombie" games)
● Misconfigured servers
● Small number of game and balance issues
● Added more matchmaking tweaks to ease contention
● Good sampling of ping data from around the world
● Discovered data points we should capture during release
● An HTTPS web service is better than direct DB access
● Better caching of static data in the DB to offset the DB load

Submission process
● MS Cert
  ● Needs to be able to run the server in their environment
  ● Needs to be able to see the client attached to a server
  ● Liked to see that the server is attached to the client
● Challenges of the MS Cert environment
  ● Closed environment
  ● Not accessible to our admin framework or network
  ● Reverse IP lookup cannot find their server
● Solutions
  ● Always keep the ability to run the server by itself without any DB connections
  ● Set the cert environment to use only one datacenter, so all IPs return one datacenter ID

Security reviews of datacenters (before you go out...)
● Always kill the process on security concerns; better to be alerted than to be exposed
● The game is signed, but we have exposed connections that must be protected!
● Use SDL to examine how trustworthy those communications are and what happens if someone crashes your game process
● File and network fuzzing can be difficult, but worthwhile
● Look for exposure of personal information, especially in log files
● Get an enterprise developer to look at your SQL stored procedures
● Know your game's traffic pattern to help spot irregularities
  ● Think of that pattern the way credit card companies look for fraud

DLC / Title updates
● Ability to rev the dedicated servers faster / independent of client updates
● Servers have to have all the content
● Matchmaking can impose certain requirements on clients before searching
● Balance between value to those who purchase content vs. fragmenting our client base
● Plan an update path for your servers

Releases of new G4WL client DLLs
● If it is a mandatory upgrade, servers will not work until upgraded
● The update requires a server to shut down all games
  ● Can be done in a rolling manner
● No matter how much communication, be prepared to be surprised when these happen
● An automated deployment solution reduces the impact

To the cloud...
● Trends suggest that we have a lot of unused time on our servers every day
● With a cloud solution, you could possibly get the following:
  ● Pay for what you use (but more likely at a higher hourly rate)
  ● More volume upfront for day one demands
  ● Tier 1 support built into the purchase (hardware issues, network issues)
  ● Could freeze VMs on machines to debug later

The Future
● Hopefully your launch looks like ours....

Questions?
Email: mweilb@microsoft.com

Special thanks to:
● Epic Games
● MS Core Pub Team
● MS Games IT

Individual call outs: Josh Markiewicz, Sam Zamani, Wes Hunt, Ian Thomas, Joe Graf, Vijay Krishnan, Nur Sheikhassan, Chris Kimmell, Chris Wynn

Microsoft Studios Core Publishing is recruiting