Engineering Scalable Social Games
Dr. Robert Zubek
Top social game developer

Growth Example: Roller Coaster Kingdom
1M DAUs over a weekend
source: developeranalytics.com

Growth Example: FishVille
6M DAUs in its first week
source: developeranalytics.com

Growth Example: FarmVille
25M DAUs over five months
source: developeranalytics.com

Talk Overview
Introduce game developers to best practices for large-scale web development
Part I. Game Architectures
Part II. Scaling Solutions

Part I. Game Architectures
• Server approaches
• Client approaches

Three Major Types
• Web server stack (client: HTML or Flash)
• Web + MMO stack (client: Flash)

Server Side
Web stack:
• Usually based on a LAMP stack
• Game logic in PHP
• HTTP communication
Mixed stack:
• Game logic in MMO server (e.g. Java)
• Web stack for everything else

Example 1: FishVille
[diagram: Client → Web Servers + CDN → Cache and Queue Servers → DB Servers]

Example 2: YoVille
[diagram: Client → Web Servers + CDN and MMO Servers → Cache and Queue Servers → DB Servers]

Why web stack?
• HTTP scales very well
• Stateless request/response
• Easy to load-balance across servers
• Easy to add more servers
Some limitations:
• Server-initiated actions
• Storing state between requests

Example: Web stack and state
[diagram: Client → Web #1 through Web #4 → caches → DB Servers]

MMO servers
• Minimally “MMO”:
  • Persistent socket connection per client
  • Live game support: chat, live events
  • Supports data push
  • Keeps game state in memory
• Very different from web!
  • Harder to scale out!
• But on the plus side:

Why MMO servers?
1. Supports live communication and events
2. Game logic can run on the server, independent of client’s actions
3. Game state can be entirely in RAM

Example: Web + MMO stack
[diagram: Client → Web Servers and MMO Servers → caches → DB Servers]

Client Side
Flash:
• High production quality games
• Game logic on client
• Can keep open socket
HTML + AJAX:
• The game is “just” a web page
• Minimal sys reqs
• Limited graphics and animation

Social Network Integration
• “Easy” because not related to scaling
• Calling the host network to get a list of friends, post something, etc.
• Networks provide REST APIs, and sometimes client libraries

Architectures (summary)
• Data: database, cache, etc. (both stacks)
• Server: Web server stack | Web + MMO stack
• Client: HTML or Flash | Flash

Talk Overview
Introduce game developers to best practices for large-scale web development
Part I. Game Architectures
Part II. Scaling Solutions

Scaling
Not blowing up as you grow

Scaling
Scale up vs. scale out
• Scale up: I need a bigger box
• Scale out: I need more boxes

Scaling
Scale up vs. scale out
Scale out has been a clear win
• At some point you can’t get a box big enough, quickly enough
• Much easier to add more boxes
• But: requires architectural support from the app!

Example: Scaling up vs. out
• 500,000 daily players in first week
• Started with just one DB, which quickly became a bottleneck
• Short term: scaled up, optimized DB accesses, fixed slow queries, added caching, etc.
• Long term: must scale out with the size of the player base!

Scaling
We’ll look at four elements:
• Databases
• Caching
• Game servers (web + MMO)
• Capacity planning

II A. Databases
DBs are the first to fall over
[diagram: Client → App Servers → caches, queues → DB]
[diagrams: app servers writing/reading a single master; master (M) with read slaves (S); sharded masters M0, M1]

How to partition data?
Two most common ways:
• Vertical – by table
  • Easy, but doesn’t scale with DAUs
• Horizontal – by row
  • Harder to do, but gives best results!
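A minimal sketch of the horizontal row-to-shard mapping the talk describes (primary key modulo the number of DBs, a “logical RAID 0”). The names `NUM_SHARDS` and `shard_for` are hypothetical, not from the talk:

```python
# Horizontal partitioning sketch: route each row to a shard by its
# primary key. Assumes two shard databases, M0 and M1.

NUM_SHARDS = 2  # e.g. databases M0 and M1

def shard_for(primary_key: int) -> int:
    """Row-to-shard mapping: primary key modulo # of DBs."""
    return primary_key % NUM_SHARDS

# Every query for a row must be sent to that row's shard:
assert shard_for(100) == 0  # row 100 lives on M0
assert shard_for(101) == 1  # row 101 lives on M1
```

Note that with a plain modulo scheme, changing the shard count remaps existing rows; more clever schemes exist, as the talk notes.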
Different rows live on different DBs, so you need a good mapping from row # to DB.

Row striping
• Row-to-shard mapping: primary key modulo # of DBs
• Like a “logical RAID 0”
• More clever schemes exist, of course

  id    Data     Shard
  100   foo      M0
  101   bar      M1
  102   baz      M0
  103   xyzzy    M1

Scaling out your shards
[diagram: masters M0 and M1 with slaves S0 and S1; the slaves are promoted to become new masters M2 and M3]

Sharding in a nutshell
• It’s just data striping across databases (“logical RAID 0”)
• There’s no automatic replication, etc.
• No magic about how it works
• That’s also what makes it robust, easy to set up and maintain!

Example: Sharding
• Biggest challenge: designing how to partition a huge existing dataset
• Partitioned both horizontally and vertically
• Lots of joins had to be broken
• Data patterns needed to be redesigned
  • With sharding, you need to know the shard ID for each query!
• Data replication had difficulty catching up with high-volume usage

Sharding surprises
Be careful with joins
• Can’t do joins across shards
• Instead, do multiple selects, or denormalize your data

Sharding surprises
Skip transactions and foreign key constraints
• CPU-expensive; easier to pay this cost in the application layer (just be careful)
• The more you keep in RAM, the less you’ll need these

II B. Caching
• DBs can be too slow
• Spend memory to buy speed!
[diagram: Client → App Servers → caches, queues → DB]

Caching
Two main uses:
• Speed up DB access to commonly used data
• Shared game state storage for stateless web game servers
Memcached: network-attached RAM cache

Memcache
• Stores simple key-value pairs
  • e.g. “uid_123” => {name:”Robert”, …}
• What to put there?
  • Structured game data
  • Mutexes for actions across servers
• Caveat: it’s an LRU cache, not a DB! No persistence!

Memcache scaling out
Shard it just like the DB!
[diagram: App → sharded MCs → sharded DBs]

II C. Game Servers
[diagram: Client → Web Servers and MMO Servers → caches, queues → DB]

Scaling web servers
[diagram: DNS → load balancers (LB) → web servers]

Scaling content delivery
• Web servers serve game data only
• Media files come from the CDN
[diagram: Client → Web Servers (game data only); Client → CDN (media files)]

Scaling MMO servers
• Scaling out is easiest when servers don’t need to talk to each other
  • Databases: sharding
  • Memcache: sharding
  • Web: load-balanced farms
  • MMO: ???
• Our MMO server approach: shard just like the other servers

Scaling MMO servers
• Keep all servers separate from each other
  • Minimize inter-server communication
  • All participants in a live event should be on the same server
• Scaling out means no direct sharing
  • Sharing through third parties is okay

Problem #1: Load balancing
• What if a server gets overloaded?
  • Can’t move a live player to a different server
• Poor man’s load-balancing:
  • Players don’t pick servers; the LB does
  • If a server gets hot, remove it from the LB pool
  • If enough servers get hot, add more server instances and send new connections there

Problem #2: Deployment
Downtime = lost revenues
• You can measure the cost of downtime in $/minute!

Problem #2: Deployment
Deployment comparison:
• Web: just copy over PHP files
• Socket servers – much harder
  • How to deploy with zero downtime?
  • Ideally: set up a shadow new server and slowly transition players over

II D. Capacity planning
Capacity
• We believe in scaling out
• Different logistics than scaling up
  • Demand could change rapidly
  • How to provision enough servers?

Capacity: colo vs. cloud
Colo:
• Higher fixed cost
• Lower variable cost
• Significant lead time on additional capacity
• More control over ops
• Any chips you want
• Any disks you want
• Scale up or out
Cloud:
• Low or zero fixed cost
• Higher variable cost
• Little or no lead time on additional capacity
• Less control over ops
• Limited choice of VMs
• Virtualized storage
• Can’t scale up! Only out.
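The “poor man’s load-balancing” for MMO socket servers can be sketched as follows. This is a hypothetical illustration, not the talk’s actual code: the class name `ServerPool`, the load scale, and the `HOT_LOAD` threshold are all assumptions.

```python
# Poor man's load balancing sketch: players don't pick servers; the
# LB assigns each NEW connection, and "hot" servers are removed from
# the pool. Live players are never moved off their current server.

HOT_LOAD = 0.8  # assumed utilization threshold for "hot"

class ServerPool:
    def __init__(self):
        self.load = {}  # server name -> current load, 0.0 .. 1.0

    def report(self, server, load):
        """Servers periodically report their load to the LB."""
        self.load[server] = load

    def assign(self):
        """Route a new connection to the least-loaded non-hot server."""
        cool = {s: l for s, l in self.load.items() if l < HOT_LOAD}
        if not cool:
            # all servers hot: time to add more instances
            raise RuntimeError("pool exhausted: add more server instances")
        return min(cool, key=cool.get)

pool = ServerPool()
pool.report("mmo-1", 0.9)  # hot: excluded from new assignments
pool.report("mmo-2", 0.3)
pool.report("mmo-3", 0.5)
assert pool.assign() == "mmo-2"
```

Keeping all live-event participants on the same server, as the talk recommends, means the assignment decision only has to be made once, at connection time.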
Operations
• Scaling out: a legion of servers
• You’ll need tools:
  • App health – custom dashboard
  • Server monitoring – Munin
  • Automatic alerts – Nagios

Ops: Dashboard

Ops: Munin
Server stats from the entire farm

Ops: Nagios
• SMS alerts for your server stats
  • Connections drop to zero
  • Or CPU load exceeds 4
  • Or test account can’t connect, etc.
• Put alerts on business stats, too!
  • DAUs dropping below daily avg, etc.

Server failures
• Common in the cloud
  • Dropping off the net
  • Sometimes even shutting down
• Be defensive
  • Ops should be prepared
  • Devs should program defensively

The End
Many thanks to those who taught me a lot:
Virtual Worlds: Scott Dale, Justin Waldron
Coaster: Biren Gandhi, Nathan Brown, Tatung Mei
http://robert.zubek.net/gdc2010