Gameplay Networking of Halo: Reach

I Shot You First!
Gameplay Networking in Halo: Reach
Who am I?
• David Aldridge, Lead Networking Engineer at Bungie
• Spent three years working on
Halo: Reach networking
• I’ve been making games for a
while
What is Halo: Reach?
• [video]
Talk Takeaways
• A proven architecture for scalable gameplay
networking
• How to design solid networking for your game
mechanics
• How to measure and optimize your networking
What is this talk NOT about?
• Halo’s Campaign or Firefight networking
• Sockets/low level networking
• High level networking
– Matchmaking
– Rating & ranking systems
– Creating and curating an online ecosystem
BUNGIE’S GAMEPLAY NETWORKING
ARCHITECTURE
What is gameplay networking?
• Communicating sufficient information to maintain a
perceptually shared reality, while minimizing both
bandwidth use and perceived violations of the
integrity of the simulation (artifacts)
• OR: Technology to help multiple players sustain the
belief that they are playing a fun game together
Common simplifying approaches
• 1. Lockstep (a.k.a. deterministic, input-passing)
– Common for games with a strict split between input and simulation
(e.g. RTS), so input latency issues can be bypassed
– Also common for ports of classic games (avoids game alterations)
• 2. Reliable transport protocols (TCP or homegrown)
– Requires high bandwidth or simple networked state
– TCP requires high latency tolerance
• 3. Send all networked state as a single blob (atomically)
– E.g. Quake 3 model
– Works very well as long as the total networked state is not too large
Halo has to solve the hard problem
• Highly competitive multiplayer action game
• 16 players, vehicles, hundreds of replicated objects
• No dedicated servers
• Game is expected to work regardless of connection
quality
• For N players, O(N²) data needs to be networked
We can’t network everything!
• [chart: bandwidth needed as a multiple of the 2-player case (y-axis, 0 to 120) vs. number of players (x-axis, 2 to 16)]
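A rough way to reproduce the shape of this curve. The model here is an assumption, not something the slide states: the host sends each of the N−1 clients an update about each of the N players, and the result is normalized to the 2-player case. The function name is hypothetical.

```python
def host_bandwidth_multiple(num_players: int) -> float:
    """Host upstream relative to the 2-player case.

    Assumed model (not from the talk): the host sends each of the
    (N - 1) clients one update stream per player, so the number of
    streams grows as N * (N - 1), i.e. O(N^2).
    """
    if num_players < 2:
        raise ValueError("need at least two players")
    streams = num_players * (num_players - 1)
    two_player_streams = 2 * 1
    return streams / two_player_streams


for n in (2, 4, 8, 16):
    print(n, host_bandwidth_multiple(n))
```

Under this model a 16-player game needs 120 times the 2-player bandwidth, which is consistent with the 0 to 120 range of the chart's axis.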
TRIBES points the way
• “The TRIBES Engine Networking Model”, Frohnmayer
and Gift, GDC 1999
• A host/client model, resilient to cheating
• Protocols for semi-reliable data delivery
• Supports persistent state and transient events
• Highly scalable to match available bandwidth
Three Key Terms
Term: Replication
• The communication of state or
events to a remote peer
– “Replicating an object” means causing it to
be created and updated on a remote peer
– A “replicated object” is one whose state is
kept approximately in sync between peers
– Our replication systems are the Application
Layer of our network stack
Term: Authority
• Permission to update the
persistent state of an object
– E.g. in Reach, the game host peer is
authoritative over dealing damage
Term: Prediction
• Extrapolating the current
properties of an entity based on
historical authoritative data and
local guesses about the future
– A predicted object is one which the
local peer does not have full control
over – this is the opposite of an
authoritative object
Bungie’s Networking Stack
Layer | Purpose
Game | Runs the game
Game Interface | Extract and apply replicated data
Prioritization | Rate the priority of all possible replication options
Replication | Protocols with various reliability guarantees
Channel Manager | Flow and congestion control
Transport | Send & receive on sockets
Let’s talk about gameplay
Replication Protocol: State Data
• Guaranteed eventual delivery of most current state,
host→client only
– Object position
– Object health
– Territory capture timer
– ~150 more properties
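A minimal sketch of what "guaranteed eventual delivery of most current state" can look like on the host, tracked per client. This is an illustration under assumptions, not Bungie's implementation: the class, its methods, and the version-counter scheme are all hypothetical. The key point is that a lost packet is never resent verbatim; the property just stays dirty and the next packet carries whatever its value is by then.

```python
class StateReplicator:
    """Host-side bookkeeping for one client (all names hypothetical)."""

    def __init__(self):
        self.values = {}     # property -> latest value on the host
        self.versions = {}   # property -> counter bumped on every change
        self.acked = {}      # property -> newest version the client acknowledged
        self.in_flight = {}  # packet id -> {property: version sent}

    def set(self, prop, value):
        self.values[prop] = value
        self.versions[prop] = self.versions.get(prop, 0) + 1

    def dirty(self):
        """Properties whose latest version the client has not acknowledged."""
        return [p for p, v in self.versions.items() if self.acked.get(p, 0) < v]

    def send(self, packet_id, props):
        """Record what went into a packet and return its payload."""
        self.in_flight[packet_id] = {p: self.versions[p] for p in props}
        return {p: self.values[p] for p in props}

    def on_ack(self, packet_id):
        for prop, version in self.in_flight.pop(packet_id, {}).items():
            self.acked[prop] = max(self.acked.get(prop, 0), version)

    def on_loss(self, packet_id):
        # Nothing to resend: the property is still dirty, so a later
        # packet will carry its current value rather than the stale one.
        self.in_flight.pop(packet_id, None)


rep = StateReplicator()
rep.set("health", 100)
rep.send(1, rep.dirty())         # payload carries health=100
rep.set("health", 40)            # state changes while packet 1 is in flight
rep.on_loss(1)                   # packet 1 never arrived
print(rep.send(2, rep.dirty()))  # payload now carries health=40
rep.on_ack(2)
print(rep.dirty())               # nothing left to send
```

In this simplified version a property keeps being offered until an acknowledgement arrives; a real system would let the prioritization layer decide how often to retry.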
Replication Protocol: Events
• Unreliable notifications of
transient occurrences,
host→client and client→host
– Please fire my weapon
– This weapon was fired
– Projectile detonated
– ~50 more events
Replication Protocol: Control data
• High-frequency, best-effort
transmission of rapidly-updated data
extracted from player control inputs,
host→client and client→host
– Current analog stick values for all players
(host→client)
– Current position of client’s own biped
(client→host)
– ~15 more properties
Replication: The Big Picture (client→host)
• Control Data
– “My biped is now at position X”
• Events
– “I just fired my primary weapon”
– “I’d like to get into this warthog”
Replication: The Big Picture (host→client)
• Control Data
– “This biped is now trying to strafe left”
• State Data
– “This object is now in position X”
– “This warthog now has a broken windshield”
– “All these broken warthog chunks now exist”
• Events
– “This weapon just fired”
– “This warthog just took damage at this point”
Replication is never fully reliable
• Unreliability enables aggressive prioritization, which
lets us handle the richness of our simulation
• Flow control layer decides when to send a packet,
and what size it should be
• Replication writes data into the packet until full
• There is always more data than will fit, so we write
high-priority data first
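The packet-filling step can be sketched as follows. The function name, the tuple layout, and the choice to skip an update that does not fit (rather than stop at the first one) are assumptions made for illustration.

```python
def fill_packet(candidates, packet_size_bytes):
    """Write the highest-priority updates first until the packet is full.

    candidates: list of (priority, size_bytes, name) tuples.
    Returns the names written, in write order.
    """
    written = []
    remaining = packet_size_bytes
    # Highest priority first; the flow control layer has already
    # decided how large this packet may be.
    for priority, size, name in sorted(candidates, key=lambda c: c[0], reverse=True):
        if size <= remaining:
            written.append(name)
            remaining -= size
    return written


candidates = [
    (0.9, 40, "enemy biped"),
    (0.5, 30, "warthog"),
    (0.2, 50, "distant grenade"),
    (0.7, 20, "rocket"),
]
print(fill_packet(candidates, 100))
```

With a 100-byte packet the distant grenade is left out; it simply competes again when the next packet is built.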
Prioritization
• Priority is based on client view and simulation state
• Priority is calculated separately per-object per-client
• Distance/direction is the core metric
• Size & speed affect priority
• Shooting & damage apply appropriate boosts
• Lots of special cases (e.g. thrown grenades)
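A toy version of a per-object, per-client priority calculation built on the distance/direction metric. The formula, the weights, the 100-unit cutoff, and the boost values are all invented for illustration; the talk does not give the real ones.

```python
import math


def relevance(viewer_pos, viewer_forward, obj_pos, max_distance=100.0):
    """Score in 0..1: closer objects, and objects nearer the view
    direction, score higher. viewer_forward is assumed to be a unit vector."""
    delta = [o - v for o, v in zip(obj_pos, viewer_pos)]
    distance = math.sqrt(sum(d * d for d in delta))
    if distance == 0.0:
        return 1.0
    distance_term = max(0.0, 1.0 - distance / max_distance)
    # Cosine of the angle between the view direction and the object.
    facing = sum(f * d for f, d in zip(viewer_forward, delta)) / distance
    direction_term = 0.5 + 0.5 * facing  # 1.0 dead ahead, 0.0 directly behind
    return distance_term * (0.5 + 0.5 * direction_term)


def priority(base_relevance, just_fired=False, recently_damaged=False):
    """Apply hypothetical boosts for shooting and damage, capped at 1.0."""
    boost = 1.0
    if just_fired:
        boost += 0.5
    if recently_damaged:
        boost += 0.5
    return min(1.0, base_relevance * boost)


ahead = relevance((0, 0, 0), (1, 0, 0), (10, 0, 0))
behind = relevance((0, 0, 0), (1, 0, 0), (-10, 0, 0))
print(ahead, behind, priority(behind, just_fired=True))
```

The same object gets a different score for every client, because each client has its own position and view direction.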
Prioritization example
• [figures: in-game views with objects annotated as final priority / relevance / desired update period (ms)]
– 0.22 / 0.97 / 127
– 0.50 / 1.00 / 0
– 0.19 / 0.73 / 339
DESIGNING FOR NETWORKING
QUALITY
Throwing a grenade
• [video]
Single-box grenade throw
• [sequence diagram]
– Player presses left trigger
– Grenade throw animation begins
– Throw animation delay
– Release frame is reached, grenade object is detached from hand, aimed, and launched
Client grenade throw – attempt #1
• Send grenade throw
request to host
• Throw grenade locally
when host confirms
Client grenade throw – attempt #1
• [sequence diagram]
– Button press
– One-way latency, client to host
– Grenade throw animation begins
– Here’s the lag!
– Throw animation starts
– Throw animation delay
– Release frame is reached
– Throw animation delay
– Release frame is reached, throw grenade
Client grenade throw – attempt #2
• Throw a grenade locally.
• Ask host to also throw a
grenade.
Client grenade throw – attempt #2
• [sequence diagram]
– Button press, grenade throw animation begins
– Release frame is reached, throw grenade
– Where is the lag? There isn’t any!
– Throw animation delay (on both timelines)
– Grenade throw animation begins
– Release frame is reached, throw grenade
Client grenade throw - actual
• Predict throw animation
• But do not predict grenade release – wait for host
• Grenades in flight are always real, and the host is
authoritative over them
• Where is the lag?
Client grenade throw - actual
• [sequence diagram]
– Button press, grenade throw animation begins
– Release frame is reached, delete grenade, aim throw
– Here’s the lag!
– Grenade appears
– Throw animation delay (on both timelines)
– Grenade throw animation begins
– Release frame is reached, delete grenade
– Create grenade aimed at X, grenade appears
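To make "where is the lag?" concrete, the two designs can be compared as timelines on the client's clock (button press at t = 0). This is a simplified model under stated assumptions: latency is symmetric, processing time is zero, and the function names are hypothetical.

```python
def attempt_1(one_way_latency_ms, throw_delay_ms):
    """Attempt #1: the client does nothing until the host confirms."""
    rtt = 2 * one_way_latency_ms
    return {
        "animation_starts": rtt,                # lag is felt at the button press
        "grenade_appears": rtt + throw_delay_ms,
    }


def shipped(one_way_latency_ms, throw_delay_ms):
    """Shipping design: predict the animation, wait for the host's grenade."""
    # The host starts its own animation when the request arrives.
    host_release = one_way_latency_ms + throw_delay_ms
    return {
        "animation_starts": 0,                  # instant response to the button
        "release_frame": throw_delay_ms,
        # The host-created grenade takes one more trip to reach the client.
        "grenade_appears": host_release + one_way_latency_ms,
    }


print(attempt_1(50, 300))
print(shipped(50, 300))
```

In this model the grenade shows up at the same moment in both designs. The shipping design does not remove the round trip; it moves it from the button press to the gap between the release frame and the grenade appearing.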
Results!
• [video]
TRICKIER GAMEPLAY EXAMPLES
Armor Lock
• [video]
Armor Lock as a sequence diagram
• [sequence diagram]
– Player presses equipment button
– Intro animation begins
– 3 frames
– Intro completes, invulnerability begins
– Player releases equipment button
– Invulnerability ends
Armor Lock networking, v1
• All animations & FX predicted by clients
• This feels very responsive, no visible lag
• But where is the lag?
V1 sequence diagram
• [sequence diagram]
– Button press, intro animation begins
– 3-frame delay
– Intro animation begins
– Intro animation completes, player appears invulnerable
– 3-frame delay
– Grenade explodes
– Intro animation completes, player is invulnerable
– WTF I was armor locked!
– Where is the lag?
Armor Lock, v2
• Animation controlled by client…
• …but wait for host to tell you to
show yourself as invincible
• Where did we move the lag to?
V2 sequence diagram
• [sequence diagram]
– Button press, intro animation begins
– 3-frame delay
– Intro animation completes, no shield yet
– Here’s the lag!
– WTF, why does my armor lock not work properly?
– Intro animation begins
– 3-frame delay
– Grenade explodes
– Intro animation completes, player is invulnerable
Armor Lock, v3 – one last tweak
• [sequence diagram]
– Button press, intro animation begins
– 3-frame delay
– Intro animation completes, no shield yet
– Intro animation begins
– (3 − RTT) frame delay
– Invulnerability begins
– Intro animation ends
– Grenade explodes
– :-)
What just happened?
• Did we just cheat lag? Where did it go?
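The v3 tweak, where the host waits (3 − RTT) frames instead of 3, can be checked with a little arithmetic. The sketch below assumes symmetric latency and measures everything in frames on the client's clock (button press at frame 0); the function names are hypothetical.

```python
def host_intro_delay_frames(intro_frames, rtt_frames):
    """Host shortens its own intro delay by the round-trip time."""
    return max(0, intro_frames - rtt_frames)


def client_sees_invulnerability_at(intro_frames, rtt_frames):
    """Frame at which the host's 'invulnerable' state reaches the client."""
    one_way = rtt_frames / 2
    # Request reaches the host after one trip; the host then waits its
    # shortened delay before making the player invulnerable.
    host_starts = one_way + host_intro_delay_frames(intro_frames, rtt_frames)
    # The state change needs one more trip to get back to the client.
    return host_starts + one_way


print(client_sees_invulnerability_at(intro_frames=3, rtt_frames=2))
print(client_sees_invulnerability_at(intro_frames=3, rtt_frames=6))
```

As long as the RTT fits inside the 3-frame intro, the shield arrives exactly when the client's intro animation ends, so the intro animation itself is where the lag is hidden. Once the RTT exceeds the intro length, the remainder shows up as visible delay again.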
Results!
• [video]
Example #3: Assassinations
• [video]
Assassinations
• 2 bipeds are happily
running along
• Suddenly, we need
to force them to
perform a joint,
synchronized
animation
Assassinations, v1
• Local prediction of participant positions &
orientations
• Worked great in in-house playtests & take-homes
• Failed in the wilds of the public beta
Assassinations, v1 - issues
• [videos]
Assassinations, v1 - issues
• Animation didn’t always
fit in the predicted
positions on client
machines
• On completion, must
resolve discrepancies for
survivors
Assassinations, v2 - shipping
• All peers (including participants)
obey host strictly
• No discrepancies on exit!
• Visual-only object state is
interpolated on the way in to the
animation
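One simple way to interpolate visual-only state on the way in to the animation is a linear blend from the locally predicted position to the host-dictated one over a fixed number of frames. The talk does not say which interpolation was used; the function name and the linear blend are assumptions.

```python
def interpolate_in(start, target, frames):
    """Yield one visual position per frame, landing exactly on the
    host-dictated target on the last frame."""
    for i in range(1, frames + 1):
        t = i / frames
        yield tuple(s + (g - s) * t for s, g in zip(start, target))


# Locally predicted position vs. where the host says the biped must be.
for position in interpolate_in((0.0, 0.0), (4.0, 2.0), 4):
    print(position)
```

Only the rendered position is blended here; the authoritative position is whatever the host says from the first frame.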
Results!
• [video]
4 rules of gameplay networking
1. Which parts of your gameplay need to be
adjudicated by a single authority?
2. Always ask: Where am I hiding the lag?
3. Don’t be afraid to change game mechanics to
improve networking
4. Reserve time to iterate
MEASURING AND OPTIMIZING
Networking is a magnet for entropy
• Invisible system with ever-growing
complexity
• Optimizations obscure original
intent of systems
• May appear to work, but have lots
of soft failures and inefficiencies
• Halo 3 games with 16 players were
often laggy
• Let’s optimize!
Optimization is dangerous
• Easy to find an “obvious” architectural optimization,
gain 1% efficiency, and introduce a week’s worth of
bugs
• Just like CPU, don’t optimize without good data!
“The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.”
- Michael A. Jackson
Inspection tools are the key!
• Deep inspection and
analysis tools will help you
identify the best
optimizations
• Think about the kind of
tools you use for CPU
performance optimization
Tool: Profilers
• We built profilers to track
bandwidth use and priority
calculation results
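The bandwidth side of such a profiler can start out very small: tally the bits written per category and report each category's share. This is a hypothetical sketch, not the tool shown in the talk; the class, method, and category names are made up.

```python
from collections import Counter


class BandwidthProfiler:
    """Tallies bits written into outgoing packets, per category."""

    def __init__(self):
        self.bits_by_category = Counter()

    def record(self, category, bits):
        self.bits_by_category[category] += bits

    def report(self):
        """Percentage of total traffic per category, largest first."""
        total = sum(self.bits_by_category.values())
        if total == 0:
            return {}
        return {
            category: round(100 * bits / total, 1)
            for category, bits in self.bits_by_category.most_common()
        }


profiler = BandwidthProfiler()
profiler.record("position", 500)
profiler.record("control", 200)
profiler.record("events", 200)
profiler.record("other", 100)
print(profiler.report())
```

Calling `record` from the code that writes each replicated field into a packet keeps the numbers tied to what was actually sent rather than what was requested.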
Profiler demo
• [video]
Tool: Films
• Deterministic playback of
gameplay sessions
• Extraordinarily useful for
debugging gameplay…
• …but have never been very
useful for network debugging
– Network systems are idle during
film playback
Leveraging Films
• Splice the network
profiler data into the
films
• For the first time, we
could analyze network
performance after the
fact
Tool: Playtests
• Network perf playtests,
once a month during
production
• Simulate adverse
network conditions with
traffic shaping tools
Tool: Playtests
• How can we measure success in
these playtests?
• Allow players to report lag with a
controller button!
– Afterwards, investigate perceived lag events
• Will also find confusing game
mechanics!
Culmination!
• [video]
Inspection of Halo 3 revealed…
• 50% positions/velocities/orientations
• 20% player control data
• 20% weapon firing, bullets, damage
• 10% other
• Woohoo, let’s optimize the heavy
hitters!
This was a false start
• Hard to further optimize the
encoding of positions, velocities,
and orientations
• Like seeing your math functions in
your CPU profiles
• Need to optimize at a higher level
GOOD OPTIMIZATIONS IN REACH
Reducing always-on bandwidth use
• Host->client control replication accounted for 22% of
all host upstream on Halo 3
– Removed data that was duplicated in object state data
– Removed data that clients didn’t need to know
– Optimized some encoding (details in slide notes)
• Reduced bandwidth use by 60% (14% overall)
Fixing a prioritization bug
• Problem: Idle grenades rolling around on the
ground had incredibly high network priority
• The cause was traced back… to a bugfix at the end
of Halo 3!
• “Equipment” was given a huge priority boost
• Fix: only apply priority boost to active equipment
Changing game mechanics
• Halo 3 used a constant
artificial friction on items
• Problem: Very slow
descent on hills
• Optimization: Fake
friction!
Ragdoll networking
• Ragdolls are difficult and costly to network well
• Hey, why do we have to network ragdolls?
– Shock, then skepticism, then consideration
• 2 challenges
– Ragdolls block bullets
– Humping
• 2 fixes
– Allow bullets and grenades to penetrate ragdolls freely
– Sync initial state of ragdoll
Smoothing out bursts of bandwidth
• Problems with high ROF weapons: bullets were
networked optimally, but not the damage they caused!
– Fix: Allow client prediction of some damage effects
• Periodic update of game statistics data taking priority
over gameplay traffic (on a protocol below replication)
– Fix: Limit statistics data to <= 10% of each packet
• Low-priority objects getting updates in perfect sync
– Fix: Limit objects that can take “panic” priority to N per packet
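The last two fixes are both per-packet caps, which can be sketched like this. The function names, the integer-percent parameter, and the idea of demoting excess "panic" updates to normal priority are assumptions for illustration.

```python
def split_packet_budget(packet_bytes, stats_pending_bytes, stats_cap_percent=10):
    """Give statistics data at most a fixed share of the packet.

    Returns (bytes for statistics, bytes left for gameplay replication).
    """
    stats_budget = packet_bytes * stats_cap_percent // 100
    stats_bytes = min(stats_pending_bytes, stats_budget)
    return stats_bytes, packet_bytes - stats_bytes


def cap_panic_updates(updates, max_panic_per_packet):
    """updates: list of (name, is_panic) in priority order.

    Only the first max_panic_per_packet panic updates keep panic
    priority; the rest compete at normal priority.
    """
    panic, normal = [], []
    for name, is_panic in updates:
        if is_panic and len(panic) < max_panic_per_packet:
            panic.append(name)
        else:
            normal.append(name)
    return panic, normal


print(split_packet_budget(1200, 500))
print(cap_panic_updates([("a", True), ("b", True), ("c", False), ("d", True)], 2))
```

Both caps trade a slightly later update for a smoother bandwidth profile, so a burst of low-value data cannot crowd out gameplay traffic in any single packet.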
3 rules of network optimization
1. Measure twice, cut once - use tools to guide your
optimizations
2. Don’t focus on encoding & compression – look at
the big picture
3. Make friends with your game mechanics designers
and coders
TIDBITS AND THE FUTURE
Numbers from Reach
• 250 kbit/s: Minimum total upstream for the host of a solid 16-player game
• 675 kbit/s: Maximum total upstream bandwidth use from a single peer
• 45 kbit/s: Maximum bandwidth sent to one client from a host
• 1 kbit/s: Host upstream required to replicate one biped to one client at combat quality
• 10 Hz: Minimum packet rate for solid gameplay
• 100 ms / 200 ms: Maximum latency for close-quarters gameplay (tournament / casual)
• 133 ms / 300 ms: Maximum latency for ranged gameplay (tournament / casual)
Related best practices
• Flow & congestion control
• Connection quality records & smart host
selection
• Host migration - adding this late is hard
• A multiplayer beta or demo
• Regular internal playtests, with traffic
shaping
• Full-time network testers, early and late
More Resources
• “Recreating The LAN Party
Online”, Butcher & House,
GDC 2005
• “The TRIBES Engine
Networking Model”,
Frohnmayer & Gift, GDC 1999
• Play Reach!
Acknowledgements
• Many people toiled to make Halo: Reach play as well
as it does online, especially these guys
Kings Among Men
• Nick Gerrone, Lead Network Tester
• Paul Lewellen, Network Engineer
Additional Kings
• Jon Cable, Sandbox Engineer
• Luke Timmins, Lead of Networking and UI
What’s next for Bungie?
• Usability improvements to replication
– Reducing boilerplate code
• Extension of replication protocols to support one-off,
low-bandwidth, complex use cases
– I just want to network a state machine, I don’t want to get
a PhD in replication
What’s really next for Bungie?
Questions?
daldridge@bungie.com
www.bungie.net/careers
we’re hiring!
The talk proper was already too long
BONUS SLIDES
Basics of encoding
• For rare things, and by default: write raw bits
• For common things: limit range as much as possible,
write only necessary bits (bitstream)
• For floats: quantize to fixed point
• For positions and vectors: Do lots of work to
compress these – limit domains, limit precision, think
about temporal coherence, use Google
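Two of these rules are easy to show in a few lines: counting the bits a limited-range integer needs, and quantizing a float to fixed point. This is a generic sketch of the techniques, not Reach's encoder; the function names and ranges are illustrative.

```python
def bits_needed(max_value):
    """Smallest bit width that can hold every integer in 0..max_value."""
    return max(1, max_value.bit_length())


def quantize(value, lo, hi, bits):
    """Map a float in [lo, hi] to an unsigned integer of the given width."""
    levels = (1 << bits) - 1
    clamped = min(max(value, lo), hi)
    return round((clamped - lo) / (hi - lo) * levels)


def dequantize(code, lo, hi, bits):
    """Inverse of quantize, up to half a quantization step of error."""
    levels = (1 << bits) - 1
    return lo + code / levels * (hi - lo)


print(bits_needed(15))  # a player index in a 16-player game
code = quantize(0.3, -1.0, 1.0, 8)
print(code, dequantize(code, -1.0, 1.0, 8))
```

The worst-case error is half a step, (hi − lo) / (2^bits − 1) / 2, so the bit width can be chosen from the precision the gameplay actually needs.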
Packet rate vs. size
• Maximize packet rate to minimize latency
• Maximize packet size to maximize throughput
• Goals in direct tension…
• Ideally, maximize packet rate by default, but lower it
as needed when simulation becomes too rich
Problem: Networking new mechanics
is hard with our replication systems
• This is somewhat intentional!
• Ease of use is dangerous
• Lots of safeguards ensure careful thought (but add
implementation time)
• We still get quick-and-dirty prototype networking
that needs to be rewritten late, but we try to
minimize the amount of it
Example of a bad optimization
• “Let’s classify all our networked object indices into
contiguous buckets by object type so we can use
fewer bits to refer to an object if the type is known
on both ends, which is common”
• Saved 1% of bandwidth - awesome
• Cost over 30 hours of debugging/support over the
course of the project
What is “Lag”?
• Perceived delay or inconsistency
• Caused by latency
• Caused by bandwidth limitation
• Caused by packet loss
• Sometimes caused by game mechanics
Glitches
• Glitch: Colloquially, a series of events that break or
appear to break the rules or perceived rules of the
game
• There are 4 important classes of glitches
– Perceived as wrong / real break of real rule
– Perceived as wrong / real rule, but not a real break
– Not perceived as wrong / real break of a real rule
– Perceived breakage of a perceived rule
Melee “Glitches”
• Conceptually melee is very simple
• In practice it’s not; we had to make post-ship fixes to
it in Halo 2 and Halo 3
• Example: In Reach public beta, client melee strikes
were sometimes (rarely) ignored by the host
There isn’t any more
THAT’S ALL THERE IS