I Shot You First! Gameplay Networking in Halo: Reach

Who am I?
• David Aldridge, Lead Networking Engineer at Bungie
• Spent three years working on Halo: Reach networking
• I’ve been making games for a while

What is Halo: Reach?
• [video]

Talk Takeaways
• A proven architecture for scalable gameplay networking
• How to design solid networking for your game mechanics
• How to measure and optimize your networking

What is this talk NOT about?
• Halo’s Campaign or Firefight networking
• Sockets/low level networking
• High level networking
  – Matchmaking
  – Rating & ranking systems
  – Creating and curating an online ecosystem

BUNGIE’S GAMEPLAY NETWORKING ARCHITECTURE

What is gameplay networking?
• Communicating sufficient information to maintain a perceptually shared reality, while minimizing both bandwidth use and perceived violations of the integrity of the simulation (artifacts)
• OR: Technology to help multiple players sustain the belief that they are playing a fun game together

Common simplifying approaches
• 1. Lockstep (a.k.a. deterministic, input-passing)
  – Common for games with a strict split between input and simulation (e.g. RTS), so input latency issues can be bypassed
  – Also common for ports of classic games (avoids game alterations)
• 2. Reliable transport protocols (TCP or homegrown)
  – Requires high bandwidth or simple networked state
  – TCP requires high latency tolerance
• 3. Send all networked state as a single blob (atomically)
  – E.g. Quake 3 model
  – Works very well as long as the total networked state is not too large

Halo has to solve the hard problem
• Highly competitive multiplayer action game
• 16 players, vehicles, hundreds of replicated objects
• No dedicated servers
• Game is expected to work regardless of connection quality
• For N players, O(N²) data needs to be networked

We can’t network everything!
Bandwidth needed as a multiple of the 2-player case
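The quadratic growth can be made concrete with a small worked count. This is a sketch under one assumption that the slide does not state: every player's state has to reach every other player, so traffic scales with the number of ordered player pairs, written here as P(N).

```latex
\[
  P(N) = N(N-1), \qquad
  \frac{P(N)}{P(2)} = \frac{N(N-1)}{2}, \qquad
  \frac{P(16)}{P(2)} = \frac{16 \cdot 15}{2} = 120 .
\]
```

Under that assumption a 16-player game needs roughly 120 times the bandwidth of a 2-player game, which agrees with the top of the chart's vertical axis.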
[chart: vertical axis from 0 to 120; horizontal axis: number of players, 2 to 16]

TRIBES points the way
• “The TRIBES Engine Networking Model”, Frohnmayer and Gift, GDC 1999
• A host/client model, resilient to cheating
• Protocols for semi-reliable data delivery
• Supports persistent state and transient events
• Highly scalable to match available bandwidth

Three Key Terms

Term: Replication
• The communication of state or events to a remote peer
  – “Replicating an object” means causing it to be created and updated on a remote peer
  – A “replicated object” is one whose state is kept approximately in sync between peers
  – Our replication systems are the Application Layer of our network stack

Term: Authority
• Permission to update the persistent state of an object
  – E.g. in Reach, the game host peer is authoritative over dealing damage

Term: Prediction
• Extrapolating the current properties of an entity based on historical authoritative data and local guesses about the future
  – A predicted object is one which the local peer does not have full control over; this is the opposite of an authoritative object

Bungie’s Networking Stack
Layer              Purpose
Game               Runs the game
Game Interface     Extract and apply replicated data
Prioritization     Rate the priority of all possible replication options
Replication        Protocols with various reliability guarantees
Channel Manager    Flow and congestion control
Transport          Send & receive on sockets

Let’s talk about gameplay

Replication Protocol: State Data
• Guaranteed eventual delivery of most current state, host→client only
  – Object position
  – Object health
  – Territory capture timer
  – ~150 more properties

Replication Protocol: Events
• Unreliable notifications of transient occurrences, host→client and client→host
  – Please fire my weapon
  – This weapon was fired
  – Projectile detonated
  – ~50 more events

Replication Protocol: Control data
• High-frequency, best-effort transmission of rapidly-updated data extracted from player control inputs, host→client and client→host
  – Current analog stick values for all players (host→client)
  – Current position of client’s own biped (client→host)
  – ~15 more properties

Replication: The Big Picture
• Control Data
  – “My biped is now at position x”
• Events
  – “I just fired my primary weapon”
  – “I’d like to get into this warthog”

Replication: The Big Picture
• Control Data
  – “This biped is now trying to strafe left”
• State Data
  – “This object is now in position X”
  – “This warthog now has a broken windshield”
  – “All these broken warthog chunks now exist”
• Events
  – “This weapon just fired”
  – “This warthog just took damage at this point”

Replication is never fully reliable
• Unreliability enables aggressive prioritization, which lets us handle the richness of our simulation
• Flow control layer decides when to send a packet, and what size it should be
• Replication writes data into the packet until full
• There is always more data than will fit, so we write high-priority data first

Prioritization
• Priority is based on client view and simulation state
• Priority is calculated separately per-object per-client
• Distance/direction is the core metric
• Size & speed affect priority
• Shooting & damage apply appropriate boosts
• Lots of special cases (e.g. thrown grenades)

Prioritization example
[screenshots with per-object annotations]
Legend: Final priority / relevance / desired update period (ms)
• 0.22 / 0.97 / 127
• 0.50 / 1.00 / 0
• 0.19 / 0.73 / 339

DESIGNING FOR NETWORKING QUALITY

Throwing a grenade
• [video]

Single-box grenade throw
[sequence diagram]
• Player presses left trigger
• Grenade throw animation begins
• Throw animation delay
• Release frame is reached, grenade object is detached from hand, aimed, and launched

Client grenade throw: attempt #1
• Send grenade throw request to host
• Throw grenade locally when host confirms

Client grenade throw: attempt #1
[sequence diagram]
• Button press
• One-way latency, client to host
• Grenade throw animation begins
• Here’s the lag!
• Throw animation starts
• Throw animation delay
• Release frame is reached
• Throw animation delay
• Release frame is reached, throw grenade

Client grenade throw: attempt #2
• Throw a grenade locally.
• Ask host to also throw a grenade.

Client grenade throw: attempt #2
[sequence diagram]
• Button press, grenade throw animation begins
• Release frame is reached, throw grenade
• Where is the lag? There isn’t any!
• Throw animation delay
• Throw animation delay
• Grenade throw animation begins
• Release frame is reached, throw grenade

Client grenade throw: actual
• Predict throw animation
• But do not predict grenade release; wait for host
• Grenades in flight are always real, and the host is authoritative over them
• Where is the lag?

Client grenade throw: actual
[sequence diagram]
• Button press, grenade throw animation begins
• Release frame is reached, delete grenade, aim throw
• Here’s the lag!
• Grenade appears
• Throw animation delay
• Throw animation delay
• Grenade throw animation begins
• Release frame is reached, delete grenade
• Create grenade aimed at X, grenade appears

Results!
• [video]

TRICKIER GAMEPLAY EXAMPLES

Armor Lock
• [video]

Armor Lock as a sequence diagram
• Player presses equipment button
• Intro animation begins
• 3 frames
• Intro completes, invulnerability begins
• Player releases equipment button
• Invulnerability ends

Armor Lock networking, v1
• All animations & FX predicted by clients
• This feels very responsive, no visible lag
• But where is the lag?

V1 sequence diagram
• Button press, intro animation begins
• 3 frame delay
• Intro animation begins
• Intro animation completes, player appears invulnerable
• 3 frame delay
• Grenade explodes
• Intro animation completes, player is invulnerable
• WTF I was armor locked!
• Where is the lag?

Armor Lock, v2
• Animation controlled by client…
• …but wait for host to tell you to show yourself as invincible
• Where did we move the lag to?

V2 sequence diagram
• Button press, intro animation begins
• 3 frame delay
• Intro animation completes, no shield yet
• Here’s the lag!
• WTF, why does my armor lock not work properly?
• Intro animation begins
• 3 frame delay
• Grenade explodes
• Intro animation completes, player is invulnerable

Armor Lock, v3: one last tweak
[sequence diagram]
• Button press, intro animation begins
• 3 frame delay
• Intro animation completes, no shield yet
• Intro animation begins
• (3-RTT) frame delay
• Invulnerability begins
• Intro animation ends
• Grenade explodes :-)

What just happened?
• Did we just cheat lag? Where did it go?

Results!
• [video]

Example #3: Assassinations
• [video]

Assassinations
• 2 bipeds are happily running along
• Suddenly, we need to force them to perform a joint, synchronized animation

Assassinations, v1
• Local prediction of participant positions & orientations
• Worked great in in-house playtests & take-homes
• Failed in the wilds of the public beta

Assassinations, v1 - issues
• [videos]
• Animation didn’t always fit in the predicted positions on client machines
• On completion, must resolve discrepancies for survivors

Assassinations, v2 - shipping
• All peers (including participants) obey host strictly
• No discrepancies on exit!
• Visual-only object state is interpolated on the way in to the animation

Results!
• [video]

4 rules of gameplay networking
1. Which parts of your gameplay need to be adjudicated by a single authority?
2. Always ask: Where am I hiding the lag?
3. Don’t be afraid to change game mechanics to improve networking
4. Reserve time to iterate

MEASURING AND OPTIMIZING

Networking is a magnet for entropy
• Invisible system with ever-growing complexity
• Optimizations obscure original intent of systems
• May appear to work, but have lots of soft failures and inefficiencies
• Halo 3 games with 16 players were often laggy
• Let’s optimize!

Optimization is dangerous
• Easy to find an “obvious” architectural optimization, gain 1% efficiency, and introduce a week’s worth of bugs
• Just like CPU, don’t optimize without good data!

“The First Rule of Program Optimization: Don't do it. The Second Rule of Program Optimization (for experts only!): Don't do it yet.” - Michael A. Jackson

Inspection tools are the key!
• Deep inspection and analysis tools will help you identify the best optimizations
• Think about the kind of tools you use for CPU performance optimization

Tool: Profilers
• We built profilers to track bandwidth use and priority calculation results

Profiler demo
• [video]

Tool: Films
• Deterministic playback of gameplay sessions
• Extraordinarily useful for debugging gameplay…
• …but have never been very useful for network debugging
  – Network systems are idle during film playback

Leveraging Films
• Splice the network profiler data into the films
• For the first time, we could analyze network performance after the fact

Tool: Playtests
• Network perf playtests, once a month during production
• Simulate adverse network conditions with traffic shaping tools

Tool: Playtests
• How can we measure success in these playtests?
• Allow players to report lag with a controller button!
  – Afterwards, investigate perceived lag events
• Will also find confusing game mechanics!

Culmination!
• [video]

Inspection of Halo 3 revealed…
• 50% positions/velocities/orientations
• 20% player control data
• 20% weapon firing, bullets, damage
• 10% other
• Woohoo, let’s optimize the heavy hitters!

This was a false start
• Hard to further optimize the encoding of positions, velocities, and orientations
• Like seeing your math functions in your CPU profiles
• Need to optimize at a higher level

GOOD OPTIMIZATIONS IN REACH

Reducing always-on bandwidth use
• Host→client control replication accounted for 22% of all host upstream on Halo 3
  – Removed data that was duplicated in object state data
  – Removed data that clients didn’t need to know
  – Optimized some encoding (details in slide notes)
• Reduced bandwidth use by 60% (14% overall)

Fixing a prioritization bug
• Problem: Idle grenades rolling around on the ground had incredibly high network priority
• The cause was traced back… to a bugfix at the end of Halo 3!
• “Equipment” was given a huge priority boost
• Fix: only apply priority boost to active equipment

Changing game mechanics
• Halo 3 used a constant artificial friction on items
• Problem: Very slow descent on hills
• Optimization: Fake friction!

Ragdoll networking
• Ragdolls are difficult and costly to network well
• Hey, why do we have to network ragdolls?

Shock, skepticism, consideration

• 2 challenges
  – Ragdolls block bullets
  – Humping
• 2 fixes
  – Allow bullets and grenades to penetrate ragdolls freely
  – Sync initial state of ragdoll

Smoothing out bursts of bandwidth
• Problems with high ROF weapons: bullets were networked optimally, but not the damage they caused!
  – Fix: Allow client prediction of some damage effects
• Periodic update of game statistics data taking priority over gameplay traffic (on a protocol below replication)
  – Fix: Limit statistics data to <= 10% of each packet
• Low-priority objects getting updates in perfect sync
  – Fix: Limit objects that can take “panic” priority to N per packet

3 rules of network optimization
1. Measure twice, cut once: use tools to guide your optimizations
2. Don’t focus on encoding & compression; look at the big picture
3. Make friends with your game mechanics designers and coders

TIDBITS AND THE FUTURE

Numbers from Reach
• 250 kbit/s: Minimum total upstream for the host of a solid 16-player game
• 675 kbit/s: Maximum total upstream bandwidth use from a single peer
• 45 kbit/s: Maximum bandwidth sent to one client from a host
• 1 kbit/s: Host upstream required to replicate one biped to one client at combat quality
• 10 Hz: Minimum packet rate for solid gameplay
• 100 ms / 200 ms: Maximum latency for close-quarters gameplay (tournament / casual)
• 133 ms / 300 ms: Maximum latency for ranged gameplay (tournament / casual)

Related best practices
• Flow & congestion control
• Connection quality records & smart host selection
• Host migration - adding this late is hard
• A multiplayer beta or demo
• Regular internal playtests, with traffic shaping
• Full-time network testers, early and late

More Resources
• “Recreating The LAN Party Online”, Butcher & House, GDC 2005
• “The TRIBES Engine Networking Model”, Frohnmayer & Gift, GDC 1999
• Play Reach!

Acknowledgements
• Many people toiled to make Halo: Reach play as well as it does online, especially these guys

Kings Among Men
• Nick Gerrone, Lead Network Tester
• Paul Lewellen, Network Engineer

Additional Kings
• Jon Cable, Sandbox Engineer
• Luke Timmins, Lead of Networking and UI

What’s next for Bungie?
• Usability improvements to replication
  – Reducing boilerplate code
• Extension of replication protocols to support one-off, low-bandwidth, complex use cases
  – I just want to network a state machine, I don’t want to get a PhD in replication

What’s really next for Bungie?

Questions?
daldridge@bungie.com
www.bungie.net/careers
we’re hiring!
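The replication slides in the main talk say that each packet is filled with the highest-priority data until it is full, because there is always more data than will fit. A minimal Python sketch of that idea follows. The `Update` structure, the priorities and the sizes are made up for illustration, and the per-packet budget assumes that the 45 kbit/s per-client figure and the 10 Hz packet rate from "Numbers from Reach" can simply be divided; none of this is Bungie's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class Update:
    """One candidate piece of replicated data (hypothetical structure)."""
    name: str
    priority: float  # higher means more urgent for this client
    size_bits: int


def fill_packet(candidates, budget_bits):
    """Write the highest-priority updates that still fit in the budget.

    Returns (sent, deferred). Deferred updates are not lost; they would
    compete again when the next packet is built.
    """
    sent, deferred = [], []
    remaining = budget_bits
    for update in sorted(candidates, key=lambda u: u.priority, reverse=True):
        if update.size_bits <= remaining:
            sent.append(update)
            remaining -= update.size_bits
        else:
            deferred.append(update)
    return sent, deferred


# Assumed budget: 45 kbit/s to one client spread over 10 packets per second.
BUDGET_BITS = 45_000 // 10  # 4500 bits per packet

# Made-up priorities and sizes, chosen only to show the ordering.
candidates = [
    Update("idle weapon on the ground", 0.1, 1200),
    Update("enemy biped position", 0.9, 2000),
    Update("distant warthog damage", 0.3, 1800),
    Update("thrown grenade position", 0.8, 1500),
]

sent, deferred = fill_packet(candidates, BUDGET_BITS)
print("sent:", [u.name for u in sent])
print("deferred:", [u.name for u in deferred])
```

This sketch keeps scanning after the first update that does not fit, so a small low-priority update could still fill leftover space. Stopping at the first miss would be an equally plausible reading of the slide.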
The talk proper was already too long

BONUS SLIDES

Basics of encoding
• For rare things, and by default: write raw bits
• For common things: limit range as much as possible, write only necessary bits (bitstream)
• For floats: quantize to fixed point
• For positions and vectors: Do lots of work to compress these; limit domains, limit precision, think about temporal coherence, use Google

Packet rate vs. size
• Maximize packet rate to minimize latency
• Maximize packet size to maximize throughput
• Goals in direct tension…
• Ideally, maximize packet rate by default, but lower it as needed when simulation becomes too rich

Problem: Networking new mechanics is hard with our replication systems
• This is somewhat intentional!
• Ease of use is dangerous
• Lots of safeguards ensure careful thought (but add implementation time)
• We still get quick-and-dirty prototype networking that needs to be rewritten late, but we try to minimize the amount of it

Example of a bad optimization
• “Let’s classify all our networked object indices into contiguous buckets by object type so we can use fewer bits to refer to an object if the type is known on both ends, which is common”
• Saved 1% of bandwidth - awesome
• Cost over 30 hours of debugging/support over the course of the project

What is “Lag”?
• Perceived delay or inconsistency
• Caused by latency
• Caused by bandwidth limitation
• Caused by packet loss
• Sometimes caused by game mechanics

Glitches
• Glitch: Colloquially, a series of events that break or appear to break the rules or perceived rules of the game
• There are 4 important classes of glitches
  – Perceived as wrong / real break of real rule
  – Perceived as wrong / real rule, but not a real break
  – Not perceived as wrong / real break of a real rule
  – Perceived breakage of a perceived rule

Melee “Glitches”
• Conceptually melee is very simple
• In practice it’s not; we had to make post-ship fixes to it in Halo 2/3
• Example: In Reach public beta, client melee strikes were sometimes (rarely) ignored by the host

There isn’t any more
THAT’S ALL THERE IS
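The "Basics of encoding" bonus slide recommends limiting ranges, writing only the necessary bits, and quantizing floats to fixed point. A minimal Python sketch of those two ideas follows. The function names, value ranges and bit counts are illustrative assumptions, not the encoder used in Reach.

```python
def bits_for_range(count):
    """Bits needed to index `count` distinct values (at least 1 bit)."""
    return max(1, (count - 1).bit_length())


def quantize(value, lo, hi, bits):
    """Map a float in [lo, hi] to an unsigned integer code of `bits` bits."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)  # out-of-range input is clamped
    return round((clamped - lo) / (hi - lo) * levels)


def dequantize(code, lo, hi, bits):
    """Recover an approximation of the original float from its code."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


# A player index in a 16-player game fits in 4 bits instead of a raw 32.
print(bits_for_range(16))

# One coordinate in an assumed world range of [-100, 100] units, 16 bits.
LO, HI, BITS = -100.0, 100.0, 16
code = quantize(12.3456, LO, HI, BITS)
restored = dequantize(code, LO, HI, BITS)
step = (HI - LO) / ((1 << BITS) - 1)
print(code, restored, "max error is about", step / 2)
```

The trade-off is explicit: each extra bit halves the worst-case error, so the range and bit count can be tuned per property rather than sending full 32-bit floats.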