Overwatch’s gameplay architecture is designed around network synchronization

Overwatch Gameplay Architecture and Netcode

Timothy Ford

Lead Gameplay Engineer

Blizzard Entertainment

Translation: kevinan

 

At GDC 2017, in the talk “Overwatch Gameplay Architecture and Netcode,” Tim Ford from Blizzard discussed the design of Overwatch’s gameplay architecture and network synchronization. Take a look.

Hello, everyone. This post is about Overwatch’s game architecture and networking. As usual: put your phone on silent, remember to fill out the survey when you leave, and switch off Hanzo and push the payload! (laughter)

My name is Tim Ford and I lead gameplay engineering on Overwatch at Blizzard. I’ve been with the team since the project started in the summer of 2013. Before that, I worked on Titan, but this talk has nothing to do with Titan. (laughter)

Some of the techniques shared here are about reducing the complexity of an ever-growing code base. To achieve this we follow a rigorous framework. Finally, I’ll show how we manage complexity by walking through the inherently complex problem of netcode.

 

Overwatch is an online team-based hero shooter set in a near-future world. Its main feature is the diversity of its heroes, each with their own unique abilities.

Overwatch uses an architecture called the Entity Component System, which I’ll refer to as ECS for short.

ECS is different from the component models that are popular in some off-the-shelf engines, and even more so from the classic Actor patterns of the late 1990s and early 2000s. Our team has years of experience with those architectures, so our choice of ECS was partly a case of “the grass is greener on the other side.” But we built a prototype beforehand, so it wasn’t a spur-of-the-moment decision.

After more than three years of development, we found that the ECS architecture could keep rapidly growing code complexity under control. While I’m happy to share the benefits of ECS, remember that everything I say today is really hindsight.

 

Overview of the ECS architecture

This is what the ECS architecture looks like. First, there is the World, which is a collection of systems and entities. An entity is an ID that corresponds to a collection of components. Components store game state and have no behavior. Systems have behavior but no state.
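The talk doesn’t show the actual class layout, but a minimal sketch of those four concepts might look something like this in C++. Everything here is illustrative — the names and containers are assumptions, not Overwatch’s real code:

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

using EntityId = std::uint32_t;

// Components store game state and nothing else: no behavior.
struct Component {
    virtual ~Component() = default;
};

struct TransformComponent : Component {
    float x = 0.f, y = 0.f, z = 0.f;
};

// An entity is just an ID that maps to a bag of components.
struct Entity {
    EntityId id = 0;
    std::vector<std::unique_ptr<Component>> components;
};

// Systems store behavior and nothing else: no state.
class World;
struct System {
    virtual ~System() = default;
    virtual void Update(World& world, float dt) = 0;
};

// The World is a collection of systems and entities.
class World {
public:
    std::vector<std::unique_ptr<System>> systems;
    std::unordered_map<EntityId, Entity> entities;

    void Update(float dt) {
        for (auto& system : systems)
            system->Update(*this, dt);   // systems run in a fixed polling order
    }
};
```

The later sketches in this post build on these hypothetical types.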

 

This may sound surprising: components have no functions, and systems have no fields.

 

 


 

The System and components used by the ECS engine

The left-hand side of the diagram is the list of systems in polling order, and the right side shows the components owned by different entities. As each system on the left is selected in turn, like keys on a piano, the components it operates on light up on the right. We call such a set of components a tuple.

The system traverses all of its tuples and performs some action on their state (that is, its behavior). Remember that a component contains no functions; its state is exposed as plain data.

Most of the important systems operate on more than one component, and as you can see, the Transform component here is used by many systems.


 

An example of a System tick from the prototype engine

This is the tick function of the physics system. It’s very straightforward: a periodic update of the internal physics engine, which could be Box2D or Domino (Blizzard’s own physics engine). After stepping the physics world, we iterate over the collection of tuples, use the proxy stored in the DynamicPhysicsComponent to fetch the underlying physics representation, and copy it into the Transform and Contact components.
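The slide’s code isn’t reproduced here, but the tick might look roughly like this. The Tuples<> iteration helper, RigidBodyProxy, and the component fields are all assumptions standing in for whatever the prototype engine actually used:

```cpp
// Illustrative sketch only, not the real engine API.
void PhysicsSystem::Tick(float deltaTime) {
    // Step the internal physics engine (Box2D, Domino, ...).
    m_physicsWorld->Step(deltaTime);

    // Then copy the simulation results back into plain-data components.
    for (auto& tuple : Tuples<DynamicPhysicsComponent, TransformComponent, ContactComponent>()) {
        DynamicPhysicsComponent& phys = tuple.Get<DynamicPhysicsComponent>();
        TransformComponent&      xf   = tuple.Get<TransformComponent>();
        ContactComponent&        ct   = tuple.Get<ContactComponent>();

        const RigidBodyProxy& body = phys.proxy;   // handle to the underlying physics body
        xf.position = body.GetPosition();          // mirror position/orientation into Transform
        xf.rotation = body.GetRotation();
        ct.contacts = body.GetContacts();          // and the contact points touched this step
    }
}
```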

A system doesn’t know what an entity is; it only cares about a small slice of a collection of components and performs a set of behaviors on that slice. Some entities have as many as 30 components, while others have as few as 2 or 3. The system does not care about the count; it only cares about the subset of components its behavior operates on.

Like the example in the prototype engine, here are the player character entities that can do a lot of cool things, and on the right are the bullet entities that the player can fire.

Each System, at runtime, does not know or care what these entities are, they simply perform operations on a subset of entity-related components.

This is the case with the implementation in Overwatch.

EntityAdmin is the World. It stores a collection of all systems and a hash map of all entities, keyed by entity ID. The ID is a 32-bit unsigned integer that uniquely identifies the entity. Each entity in turn stores its collection of components, its entity ID, and an optional resource handle pointing to the asset that defines the entity.

Component is a base class with hundreds of subclasses. Each subclass contains the member variables a system needs to execute its behavior. The only use of polymorphism is to override lifecycle functions such as Create and Destroy. The only other functions on component subclasses are helpers for conveniently reading internal state. These helpers are not behaviors; they are simply accessors.
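A hedged sketch of how those pieces might fit together, reusing the illustrative types from earlier; the ResourceHandle layout, the lifecycle hooks, and the HealthComponent example are assumptions:

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>
#include <vector>

using EntityId = std::uint32_t;
struct ResourceHandle { std::uint64_t asset = 0; };   // optional link to the defining asset

// Base class for hundreds of plain-data subclasses. Polymorphism is used only
// for lifecycle management; everything else is just accessors over raw state.
class Component {
public:
    virtual ~Component() = default;
    virtual void Create()  {}   // lifecycle hooks, overridden where needed
    virtual void Destroy() {}
};

class HealthComponent : public Component {
public:
    float hp = 200.f, maxHp = 200.f;
    bool  IsDead() const { return hp <= 0.f; }   // accessor, not behavior
};

class EntityAdmin;
class System {
public:
    virtual ~System() = default;
    virtual void Update(EntityAdmin& admin) = 0;
};

class Entity {
public:
    EntityId id = 0;
    ResourceHandle resource;                                // optional
    std::vector<std::unique_ptr<Component>> components;     // the entity's component set
};

class EntityAdmin {
public:
    std::vector<std::unique_ptr<System>> systems;           // updated in order each frame
    std::unordered_map<EntityId, Entity> entities;          // keyed by 32-bit entity ID
};
```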

EntityAdmin’s update calls every system’s update in turn, and each system does some work. Figure 9 shows how we actually use this in Overwatch: instead of operating on a formally declared tuple of components, we pick a base component type to iterate over and then look up its sibling components to run the behavior. So the operation is only performed for entities whose tuple contains both a Derp and a Herp component.
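Using the slide’s placeholder names, the pattern might look roughly like this. GetAll<>, Sibling<>, and DoStuff are hypothetical helpers, not the real interface:

```cpp
// Illustrative sketch: iterate every DerpComponent, then ask for its sibling
// HerpComponent on the same entity; skip entities that lack the full tuple.
void DerpSystem::Update(EntityAdmin& admin) {
    for (DerpComponent* derp : admin.GetAll<DerpComponent>()) {
        HerpComponent* herp = derp->Sibling<HerpComponent>();
        if (!herp)
            continue;              // incomplete tuple: this entity is not a subject of the behavior

        DoStuff(*derp, *herp);     // the behavior only ever touches this slice of components
    }
}
```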

List of systems and components of the Overwatch client

There are about 46 different systems and 103 components. The cool animation on this page is meant to attract you. (Laughter)

And then the server

You can see that some systems require many components, while others require only a few. Ideally, a system can depend on many components and still be fine, because it treats them like inputs to a pure function and never mutates their state. We do have a small number of systems that need to modify component state, and those have to manage that complexity themselves.

Here is the actual System code

This system manages player connections and is responsible for kicking AFK (away-from-keyboard) players on all of our game servers.

This system traverses all the Connection components, which manage a player’s network connection on the server and are attached to the entity representing that player. That entity could be a player in a match, a spectator, or another player-controlled character; the system does not know or care. Its job is to kick idle players offline.

Each Connection component’s tuple also contains an InputStream and a Stats component. We read your actions from the input stream component to verify you are actually doing something, like pressing a key, and we read your contribution to the match from the Stats component.

As long as you do either of those things, the AFK timer keeps resetting; otherwise we send a message to your client through the network connection handle stored on the Connection component and kick you off.

An entity must have the complete tuple for this behavior to apply to it. The bot entities in our game have no Connection or InputStream component, only a Stats component, so they are never affected by the idle-kick feature. A system’s behavior always operates on the slice of entities that have the complete tuple. And frankly, we really don’t need to waste resources kicking bots offline.
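The real code from the slide isn’t reproduced here, but the shape of the behavior might look like this. The component fields, helper calls, timeout constant, and kick message are all assumptions:

```cpp
constexpr float kAfkTimeoutSeconds = 60.f;   // assumed value, not the shipped one

// Illustrative sketch of the idle-kick behavior described above.
void ConnectionSystem::Update(EntityAdmin& admin, float dt) {
    for (ConnectionComponent* conn : admin.GetAll<ConnectionComponent>()) {
        // Only entities with the complete tuple are subjects of this behavior;
        // bots have Stats but no Connection/InputStream, so they are never kicked.
        InputStreamComponent* input = conn->Sibling<InputStreamComponent>();
        StatsComponent*       stats = conn->Sibling<StatsComponent>();
        if (!input || !stats)
            continue;

        const bool pressedSomething   = input->HasRecentInput();
        const bool contributedToMatch = stats->ContributedRecently();  // damage, healing, objective time...

        if (pressedSomething || contributedToMatch) {
            conn->afkTimer = 0.f;                      // any activity resets the AFK timer
        } else if ((conn->afkTimer += dt) > kAfkTimeoutSeconds) {
            conn->SendKickMessage();                   // tell the client it is being disconnected
        }
    }
}
```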

 

Why can’t you just use the traditional object-oriented programming model?

The system update behavior above raises the question: why not use the traditional object-oriented (OOP) component model? For example, override an Update function on the Connection component to track AFK?

The answer is that the Connection component is used by multiple behaviors simultaneously: AFK checks; maintaining the list of connected players who can receive network broadcasts; storing state such as the player’s name; storing the player’s status, like unlocked achievements. So, with traditional OOP, which behavior should go in the component’s Update? And where should the rest go?

In traditional OOP, a class is both behavior and data, but the Connection component has no behavior; it is just state. Connection doesn’t fit the OOP notion of an object at all: it means completely different things to different systems at different times.

 

So what are the theoretical advantages of separating behavior from state?

Think of the cherry tree blooming in your front yard. Subjectively, that tree is a very different thing to you, to the president of your homeowners’ association, to your gardener, to a bird, to your property tax assessor, to termites. Each observer describes the same state differently and cares about different behaviors. The tree is a subject treated differently by different observers.

By analogy, the player entity, or more precisely the Connection component, is a subject treated differently by different systems. The connection-management system we just discussed treats the Connection component as the subject of an AFK kick; the connection utilities treat it as the subject of broadcasting the player’s network messages; on the client, the user interface system treats it as the subject of UI elements such as the player’s name on the scoreboard.

Why organize behavior this way? It turns out it is easier to describe the full behavior of the tree by grouping behaviors by each observer’s perspective, and the same applies to game objects.

 

However, as we put this industrial-strength ECS architecture into practice, we ran into new problems.

First, we kept bumping into the rules that components can’t have functions and systems can’t have state. Surely a system should be able to have some state, right? Some legacy systems imported from non-ECS code had member variables. Is that a problem? Take InputSystem: you could store player input in the InputSystem, and any other system that needs to know whether a button was pressed just needs a pointer to the InputSystem.

Creating a whole new component type just to hold a one-off piece of global state seemed silly; why define a component you will only ever instantiate once? Components are normally accessed by iterating over many instances, as we saw earlier, and a component with exactly one instance in the whole game looks weird.

Anyway, we lived with it for a while: we stored one-off state in systems and provided global access to them. You can see that access pattern in Figure 16.

If one system can call another, compile times suffer, because the systems have to include each other’s headers. Say I’m refactoring InputSystem: I move some functions around and change the header (Client/System/Input/InputSystem.h), and now everything that depends on InputSystem’s state has to be recompiled. That is very annoying, and it creates a lot of coupling, because the systems end up exposing their internal behavior to each other.

At the bottom of Figure 16 you can see a PostBuildPlayerCommand function, which is the main value InputSystem provides here. Suppose I want to add some new functionality: CommandSystem needs to fill in some extra structure to send to the server based on the player’s input. Should the new code go in CommandSystem or in PostBuildPlayerCommand? Am I exposing one system’s internals to another?

As the code grows, choosing where to add new behavior becomes ambiguous. The CommandSystem behavior above fills in some structures; why mix the two together? And why put the code here rather than somewhere else?


 

Anyway, we made do with it for a while, until the Killcam requirement came along.

 

 

To implement the Killcam, we needed two different, parallel game environments: one for real-time, in-game rendering and one dedicated to the Killcam. I’ll show how they are implemented next.

 

First, quite straightforwardly, we added a second, brand-new ECS world. Now there are two worlds: one for liveGame (the normal game) and one, replayGame, for the replay.

         

The way the replay works is that the server sends about 8 to 12 seconds of network game data; the client then flips worlds and starts rendering the replayAdmin world onto the player’s screen, forwarding that game data to replayAdmin as if it had really just arrived from the network. All the systems, components, and behaviors have no idea they are in a replay; they assume the client is talking to the server in real time, just like normal play.

Sounds cool, right? If anyone wants to learn more about the replay technology, I encourage you to attend Phil Orwig’s talk tomorrow, also in this room, at 11 a.m. sharp.

 

Anyway, the first problem was that all the call sites that relied on globally accessing a system suddenly broke. Instead of one global EntityAdmin there are now two, so system A can no longer reach a global system B directly; it has to go through the EntityAdmin they share, which is confusing.

After the Killcam work, we spent a lot of time reviewing the flaws in our programming model: weird access patterns, long compile cycles, and, most dangerous of all, coupling between systems’ internals. It looked like we were in big trouble.

The eventual solution rests on the realization that there is nothing wrong with a component type that only ever has one instance. Based on that principle, we implemented singleton components.

These components live on a single anonymous entity that can be accessed directly through the EntityAdmin. We moved most of the state that had been living in systems into singletons.

It turns out that state which only one system ever needs to access is actually very rare. We kept up this habit as development went on: whenever a new system needed some state, we just made a singleton component to store it. Almost every time, some other system eventually needed that state too, so the coupling problem of the earlier architecture was solved in advance.

Here is an example of a singleton input.

All the key-press state lives in a singleton, and we removed it from the InputSystem. Any system that wants to know whether a key was pressed simply grabs that singleton and asks it. With that, some of the troublesome coupling disappears, and we are back in line with the ECS philosophy: systems have no state; components have no behavior.

A key press is not itself a behavior. There is a behavior in the movement system that uses this singleton to predict local player movement, and another behavior in MovementStateSystem that packages the key state and sends it to the server.
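A hedged sketch of that access pattern, reusing the illustrative types from above; the GetSingleton helper, the field names, and the two behavior functions are assumptions:

```cpp
// The singleton lives on a single anonymous entity owned by the EntityAdmin.
struct SingletonInput : Component {
    bool  jumpPressed = false;
    bool  firePressed = false;
    float moveX = 0.f, moveY = 0.f;
};

// Any system can read it; none of them need to know about InputSystem anymore.
void MovementSystem::UpdateFixed(EntityAdmin& admin) {
    const SingletonInput& input = admin.GetSingleton<SingletonInput>();
    if (input.jumpPressed)
        PredictLocalJump();        // prediction behavior lives here, not in InputSystem
}

void MovementStateSystem::UpdateFixed(EntityAdmin& admin) {
    const SingletonInput& input = admin.GetSingleton<SingletonInput>();
    PackInputForServer(input);     // a separate behavior packages the same keys for the server
}
```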

It turned out that the singleton pattern was so prevalent that 40% of the components in our entire game were singletons.

Once we moved system state into singletons, we broke the shared system functions out into utility functions that operate on those singletons. That reintroduces a bit of coupling, which we’ll discuss in more detail next.

After the transformation shown in Figure 22, InputSystem still exists. It reads input from the operating system and populates the SingletonInput, and other systems downstream read that same input to do whatever they need.

Things like key mapping can be implemented in a singleton, which is decoupled from the CommandSystem.

We also moved the PostBuildPlayerCommand function into CommandSystem, where it belongs, so that all modifications to player commands can happen only there. Player commands are important data structures: they are synchronized across the network and used to simulate gameplay.

What we didn’t realize when we introduced singletons was that we were building a decoupled, lower-complexity development pattern. In this example, CommandSystem is the only place with side effects on the player’s input commands.

Any programmer can easily see how the player’s commands change, because this is the only code that can change them at any point during the system update. If you want to add code that modifies player commands, it’s clear it can only go in this source file; all the ambiguity is gone.
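Roughly, the flow after the refactor might look like this. SingletonCommand, ReadInputFromOperatingSystem, BuildPlayerCommand, and QueueForSend are hypothetical names; only PostBuildPlayerCommand, InputSystem, and CommandSystem come from the talk:

```cpp
// InputSystem only talks to the OS and fills the singleton; it has no other clients.
void InputSystem::Update(EntityAdmin& admin) {
    SingletonInput& input = admin.GetSingleton<SingletonInput>();
    input = ReadInputFromOperatingSystem();     // key mapping could also be applied here
}

// CommandSystem is now the only place that mutates player commands,
// the structures that get sent to the server and drive the simulation.
void CommandSystem::UpdateFixed(EntityAdmin& admin) {
    const SingletonInput& input   = admin.GetSingleton<SingletonInput>();
    SingletonCommand&     command = admin.GetSingleton<SingletonCommand>();

    BuildPlayerCommand(command, input);         // formerly InputSystem::PostBuildPlayerCommand
    QueueForSend(command);                      // hand off to the network layer
}
```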

 

Now let’s talk about another issue, which has to do with sharing behavior.

Shared behavior typically occurs when the same behavior is used by multiple systems.

Sometimes, two observers of the same subject will be interested in the same behavior. Going back to the cherry tree example, both your community’s business board chairman and your gardener might want to know how many leaves the tree will shed when spring comes.

You can do different things with this output, at least the chairman might yell at you, the gardener might go back to work, but the behavior here is the same.

For example, a lot of code cares about hostility: are entity A and entity B hostile to each other? The hostility relationship is determined by three optional components: FilterBits, PetMaster, and Pet. FilterBits stores the team index. PetMaster stores the unique keys of all the pets it owns; a Pet is something like Torbjörn’s turret.

If neither entity has a team, they are not hostile. The two doors, for example, are not hostile to each other because their FilterBits components carry no team number.

If they are both on the same team, they are not hostile either; easy enough.

If they are on two perpetually hostile teams, we also check both entities’ PetMaster components to resolve hostility for pets. This also solved a real problem: if you were simply hostile to everyone, then the moment you built a turret, the turret would attack you. That actually happened; it was a bug, and we fixed it. (laughter)

If you want to check the hostility of a projectile in flight, it’s easy to trace back to the player who fired it.

The implementation is actually a single function, CombatUtilityIsHostile, which takes two entities as arguments and returns true or false for whether they are hostile. This function is called by countless systems.
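A hedged, simplified sketch of the rules just described (component names, accessors, and the pet-to-master delegation are assumptions; the real function is more involved):

```cpp
// Returns true if entity A and entity B should treat each other as enemies.
// Reads three optional, read-only components and has no side effects.
bool CombatUtilityIsHostile(const Entity& a, const Entity& b) {
    const FilterBitsComponent* fa = a.Get<FilterBitsComponent>();
    const FilterBitsComponent* fb = b.Get<FilterBitsComponent>();

    // No team at all (doors, platforms, ...): never hostile.
    if (!fa || !fb)
        return false;

    // Same team: never hostile.
    if (fa->teamIndex == fb->teamIndex)
        return false;

    // A pet (e.g. Torbjörn's turret) defers to its master's hostility,
    // so a freshly built turret does not attack its own builder.
    if (const PetComponent* pet = a.Get<PetComponent>())
        return CombatUtilityIsHostile(*pet->master, b);
    if (const PetComponent* pet = b.Get<PetComponent>())
        return CombatUtilityIsHostile(a, *pet->master);

    return true;   // different, mutually hostile teams
}
```

A projectile would similarly delegate to the entity that fired it before this check runs.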

Figure 25 highlights the systems that call this function; as you can see, only three components are involved, which is a very small number, and all three are read-only to it. More importantly, they are pure data: these systems never modify them, they only read them.

Let’s do another example where we use this function.

We follow different rules depending on how a shared utility function is used; here is another example.

If a utility function is called from many places, it should depend on few components and have few or no side effects. If it depends on many components, try to limit the number of call sites.

Our example here is CharacterMoveUtil, the function that moves the player each tick of the game simulation. It has two call sites: one where the server simulates the player’s input commands, and one where the client predicts the player’s input.

 

We kept replacing cross-system function calls with utility functions, and kept moving state out of systems into singleton components.

Replacing cross-system function calls with shared utility functions does not magically remove complexity; it almost always requires rethinking where the state lives.

Just as you can hide side effects behind a publicly accessible system function, you can hide them behind a utility function.

If you call such a utility function from several places, you scatter serious side effects throughout the game loop. It may not be obvious that a side effect happens behind a function call, and that is a pretty scary form of coupling.

If there’s one thing you take away from this talk, let it be this: if there is only one call site, the complexity of a behavior stays low, because all of its side effects are confined to the place where the call happens.

Here’s a look at the techniques we use to reduce this coupling.

When you discover that something with serious side effects must be done, ask yourself: does this code have to run right now?

Singletons solve inter-system coupling nicely through deferment: store the state required for the action, and postpone the side effects to a better time later in the same frame.

 

For example, there are many call points in the code that generate an impact effect.

These include hitscan bullets; explodable projectiles with a time of flight; Zarya’s particle beam, which has to keep producing contact effects on whatever it is hitting while it fires; and sprays.

Creating an impact effect has significant side effects, because you have to create a new entity in the world, which indirectly touches entity lifecycles, threading, scene management, and resource management.

Impact effects only need to exist by the time the frame is rendered; they do not need to appear in the middle of the game simulation at each individual call site.

Figure 30 shows a small portion of the code used to create an impact effect. Based on the Transform, the collision type, and the material data, it does the impact calculation, calls into LOD, scene management, and priority management, and finally spawns the required effect.

 

 

This code also makes sure persistent effects like bullet holes and scorch marks don’t stack up strangely. For example, you spray a wall with bullets, leaving a cluster of bullet-hole decals, and then Pharah fires a rocket that leaves a big scorch mark over them. You want to remove the bullet holes, or they look ugly, flickering like z-fighting. I don’t want that deletion logic scattered everywhere; it had better happen in one place.

I could go change the code, but there are a lot of call sites, and I’d have to retest everything. And as more heroes are added, each one needs new effects. So I copy and paste the function call here and there; it’s just a function call, no big deal, not a nightmare, right? (laughter)

In reality, this creates side effects at every call site. It takes far more brain power to remember how all that code works, and that is where complexity creeps in; it should absolutely be avoided.

So we have the Contact singleton.

It contains an array of pending contact records, each with enough information to create the effect later in the frame. If you want an effect, you just append a new record and fill in the data. Later in the frame, once the scene has been updated and is ready to render, the ResolveContactSystem iterates over the set, spawning effects according to the LOD and overlap rules. This way, even though the side effects are serious, they occur at only one call site per frame.
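A hedged sketch of the deferment pattern; the record fields, the Vec3/MaterialId/EffectId types, and the LOD/spawn helpers are placeholders, not the shipped code:

```cpp
#include <vector>

// Any call site that wants an impact effect just appends a record; no side effects yet.
struct PendingContact {
    Vec3       position;
    Vec3       normal;
    MaterialId material;
    EffectId   effect;
};

struct SingletonContacts : Component {
    std::vector<PendingContact> pending;   // cleared every frame
};

// Hitscan, projectiles, beams, and sprays all call this instead of spawning entities themselves.
void QueueContactEffect(EntityAdmin& admin, const PendingContact& contact) {
    admin.GetSingleton<SingletonContacts>().pending.push_back(contact);
}

// One call site per frame, late enough that the scene is ready to render.
void ResolveContactSystem::Update(EntityAdmin& admin) {
    SingletonContacts& contacts = admin.GetSingleton<SingletonContacts>();
    for (const PendingContact& c : contacts.pending) {
        if (PassesEffectLodBudget(c))        // budget/LOD rules, decal overlap cleanup, etc.
            SpawnImpactEffectEntity(admin, c);
    }
    contacts.pending.clear();
}
```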

Besides reducing complexity, the deferment scheme has many other advantages. Data and instructions stay cache-friendly, which improves performance. You can also budget effects: if 12 D.Vas shoot a wall at the same time, producing hundreds of impact effects, you don’t have to create them all at once; you can create your own D.Va’s immediately and spread the rest over later frames, smoothing out performance spikes. It also lets you implement more complex logic: even if ResolveContactSystem needs to coordinate across threads to orient individual particle effects, that is now easy to do. Deferment is really cool.

Utility functions, singletons, deferment: these are just a few of the patterns we have used while building on the ECS architecture over the past three years. Together with keeping state out of systems and behavior out of components, these techniques shape how we solve problems in Overwatch.

 

Sticking to these restrictions means you have to use a lot of clever tricks to solve problems. Ultimately, though, they produce a maintainable, decoupled, clean code base. The architecture constrains you; it pushes you into a pit, but it’s the pit of success.

 

With that in mind, let’s talk about one of the real challenges, and how ECS simplifies it.

As Gameplay engineers, the most important issue we have ever solved is Netcode.

The first goal was to build a responsive online team action game. To feel responsive, the game must predict (or anticipate) the player’s actions; it cannot be responsive if every action waits for a round trip to the server. At the same time, clients can’t be trusted, because some jerks cheat. That truth about FPS games hasn’t changed in 20 years.

The actions that need to respond instantly include movement, abilities (and, in our case, weapons are abilities too), and hit registration.

There is a common principle behind all of this: the player must see a response the instant they press a button, and that must hold even when network latency is high.

As I demonstrate on this slide, the ping is 250 ms, yet all my actions get immediate feedback; it “looks” perfect, with no delay at all.

         

However, client prediction, server validation, and network latency together have a side effect: misprediction, or prediction failure. The main symptom of misprediction is that what you thought you did fails to actually happen.

The server has to correct your actions, but not at the cost of responsiveness. We use determinism to reduce the probability of mispredictions. Here’s how.

The ping is still 250 milliseconds. I thought I jumped, but the server didn’t see it that way, so I was yanked back and frozen (freezing is one of Mei’s abilities). You can even see the whole thing at work here: the prediction starts moving us into the air, and Winston’s jump ability even goes onto cooldown. That’s the trade-off: we want to respond as quickly as possible, even though the prediction will occasionally be wrong.

If you happen to be playing from somewhere like Sri Lanka, with very high ping, and get frozen by Mei, mispredictions like this can happen.

I’ll start with a few guidelines and then discuss how this new technology leverages ECS to reduce complexity.

I won’t go into the details of generic data replication, remote entity interpolation, or backwards reconciliation.

We are standing on the shoulders of giants, using some of the techniques that have been mentioned in other literature. The next slides will assume that you are already familiar with those technologies.

 

Determinism

Deterministic simulation relies on clock synchronization, a fixed update period, and quantization. Both the server and the client run on this synchronized clock and quantized timestep. Time is quantized into fixed steps that we call command frames. Each command frame is a fixed 16 ms, though in esports it is 7 ms.

The simulation runs at a fixed frequency, so wall-clock time has to be converted into a whole number of command frames. We use an accumulator to handle the frame-number increments.

In our ECS framework, any system that performs prediction, or simulates based on player input, implements UpdateFixed instead of Update. UpdateFixed is called once per fixed command frame.
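The talk doesn’t show this loop, but the standard shape of such a fixed-step accumulator might look like this; the GameLoop type, member names, and the per-system UpdateFixed call are assumptions layered on the earlier sketches:

```cpp
// Wall-clock time accumulates; the simulation only advances in whole command frames.
void GameLoop::Advance(EntityAdmin& admin, double wallClockDeltaSeconds) {
    m_accumulator += wallClockDeltaSeconds;

    const double frameDuration = m_commandFrameDuration;  // 16 ms normally (7 ms in some esports setups)
    while (m_accumulator >= frameDuration) {
        m_accumulator -= frameDuration;
        ++m_commandFrame;                                  // monotonically increasing frame number

        for (auto& system : admin.systems)
            system->UpdateFixed(admin, m_commandFrame);    // prediction/simulation systems run here
    }

    for (auto& system : admin.systems)
        system->Update(admin);                             // render-rate systems still run every tick
}
```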

Assuming the stream is stable, the client always runs ahead of the server by about half an RTT plus one buffered frame. The RTT here is the ping plus logical processing time. In the example in Figure 39, our RTT is 160 ms; half of that is 80 ms, plus one 16 ms frame, which adds up to how far ahead of the server the client runs.

The vertical lines in the figure represent individual frames. The client starts simulating and reports the input for frame 19 to the server; after some time (basically half an RTT plus the buffer), the server simulates that frame. That is why I say the client is always ahead of the server.

Because the client wants to consume player input as quickly as possible, as close to the present moment as possible, waiting for the server’s response would feel too slow and make the game stutter. In Figure 39 you want that buffer to be as small as possible. By the way, the game is running at 60 Hz here, and I’m playing the animation at 1/100th of normal speed.

The client’s prediction system reads the current input and simulates Tracer’s movement. The gamepad icon here represents Tracer’s input being sampled and reported. Here (frame 14) is the simulation of my current movement; a full RTT plus the buffering time later, it comes back from the server as a server-validated snapshot of the movement state. The side effect of server authority is that validation takes an extra half RTT to reach the client.

So why does the client keep a ring buffer of its movement history? To compare against the results coming back from the server. If a result matches the server’s simulation, the client happily moves on to the next input. If the results don’t match, that is a prediction error, and reconciliation is required.

If you wanted to keep things simple you could just overwrite the client with the state the server sent, but that state is already “old” relative to the current input, because the server’s response is typically hundreds of milliseconds stale.

Besides the movement ring buffer, we keep another ring buffer that stores the player’s inputs. Because the movement code is deterministic, once we know the state the player was actually in, it is easy to reproduce everything that followed. So when a server packet arrives and we discover the prediction failed, we replay all of your inputs until we catch back up to the present. In frame 17 of Figure 41 below, the client thinks Tracer is running, but the server says she was stunned, perhaps by McCree’s flashbang.

What follows is that when the client receives the packet describing the character’s state, it rolls the movement state back in time to the last server-validated state and then recomputes every subsequent input until it catches up to the present (frame 25).
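A hedged sketch of that reconciliation step. RingBuffer, MovementState, PlayerInput, and the pure-function form of CharacterMoveUtil are assumptions; only the deterministic replay idea and the CharacterMoveUtil name come from the talk:

```cpp
// Called when a server snapshot for command frame `serverFrame` arrives on the client.
void ReconcileMovement(MovementState&             predicted,     // client's current state
                       RingBuffer<MovementState>&  history,       // one entry per command frame
                       RingBuffer<PlayerInput>&    inputs,        // inputs for the same frames
                       const MovementState&        serverState,   // authoritative state at serverFrame
                       uint32_t                    serverFrame,
                       uint32_t                    currentFrame) {
    // Prediction matched: nothing to do, keep the newer client state.
    if (history.Get(serverFrame).ApproximatelyEquals(serverState))
        return;

    // Misprediction: rewind to the server's state...
    predicted = serverState;
    history.Set(serverFrame, serverState);

    // ...and deterministically replay every stored input up to the present frame.
    for (uint32_t frame = serverFrame + 1; frame <= currentFrame; ++frame) {
        predicted = CharacterMoveUtil(predicted, inputs.Get(frame));   // same code the server runs
        history.Set(frame, predicted);
    }
}
```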

By the time the client reaches frame 27 (above), it has received the server’s packet for frame 17. Once resynchronized, client and server are effectively back in step.

We must also know exactly how long we were stunned.

After frame 33, the client knows it is no longer stunned, and the server simulates the same thing, so there are no more weird catch-up corrections. Once back in a movement state, the player’s current input can be replayed as usual.

Client networks, however, are not that stable, and packet loss happens from time to time. Input in our game is sent over a customized reliable UDP, so input packets from the client sometimes fail to reach the server: packet loss. The server keeps a small buffer of not-yet-simulated input, but keeps it as small as possible so the game stays responsive.

Once that buffer runs dry, the server can only guess based on your last input. When the real input finally arrives, it tries to smooth things over and make sure none of your actions are lost, but prediction errors can still result.

Here’s where it gets interesting.

As you can see in the figure above, some packets from the client have been lost. When the server realizes it is missing input, it duplicates the previous input as its guess, hopes the guess is right, and sends the client a message saying: “Hey buddy, I’m losing packets; something is wrong.” What happens next is the interesting part: the client dilates time and simulates faster than the agreed frame rate.

 

In this example the agreed step is 16 ms, so the client pretends each step is 15.2 ms, pulling itself further ahead. As a result, its inputs arrive at the server slightly faster than they are consumed, and the server’s input buffer grows, giving it enough cushion to survive packet loss without wasting more than necessary.
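A hedged sketch of that time-dilation handshake; the message name, the 5% factor, and the ClientClock type are illustrative, chosen only so the numbers match the 16 ms → 15.2 ms example above:

```cpp
// When the server reports that its input buffer ran dry, the client shortens its
// command frame slightly so inputs arrive a little faster than they are consumed.
void ClientClock::OnServerBufferStatus(bool serverIsStarved) {
    const double nominal = 0.016;                        // agreed command frame: 16 ms
    const double dilated = nominal * 0.95;               // ~15.2 ms while catching up

    m_commandFrameDuration = serverIsStarved ? dilated   // speed up: widen the server's buffer
                                             : nominal;  // healthy again: relax back to normal
}
```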

 

The technique works well, especially on the often-jittery public internet, where packet loss and ping are unstable. You could play this game from the International Space Station and it would still hold up. So I think this scheme is really awesome.

 

Now, everybody take notes: the client got that message, and if I zoom in on the time scale you can see we really are ticking faster — the slope on the right gets flatter, and input is reported faster than before. Meanwhile the server’s buffer grows, tolerating more packet loss; if loss does occur, the buffered inputs cover for it.

Once the server notices your network is healthy again, it sends you a message saying, “Hey buddy, it’s okay now.” The client then does the opposite: it stretches its timescale back out and sends packets at the normal rate, and the server shrinks its buffer again.

         

This process runs continuously; the goal is to stay within tolerances and to minimize prediction errors through input redundancy.

I mentioned earlier that once the server is starved it duplicates the last input, right? Once the client catches up, inputs are no longer duplicated, but they are still at risk of being dropped by packet loss. The workaround is for the client to maintain a sliding window of inputs. This technique has been around since Quake.

Instead of sending only the current input for frame 19, we send every input since the last movement state the server confirmed. In the example above, the server’s last confirmation was frame 4 and we have just reached frame 19, so we pack the input for every frame in between into the packet. A player’s input barely changes from one 1/60-second frame to the next — if you were holding “forward” last frame, you are probably still holding “forward” — so the extra data is tiny.

As a result, even when packets are lost, the next packet that does arrive carries all the inputs, filling every hole left by the loss before the server actually needs to simulate. Between the feedback loop, the growing buffer, and the sliding window, packet loss costs you nothing: there is no prediction error even when a packet is dropped.
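A hedged sketch of the redundant input packet; serialization is omitted and InputPacket, PlayerInput, and RingBuffer are assumed types from the earlier sketches:

```cpp
// Every outgoing packet carries all inputs since the last frame the server acknowledged,
// so one delivered packet fills any holes that earlier lost packets left behind.
struct InputPacket {
    uint32_t firstFrame = 0;                  // e.g. last acked frame + 1
    std::vector<PlayerInput> inputs;          // one entry per command frame up to "now"
};

InputPacket BuildInputPacket(const RingBuffer<PlayerInput>& inputs,
                             uint32_t lastServerAckedFrame,
                             uint32_t currentFrame) {
    InputPacket packet;
    packet.firstFrame = lastServerAckedFrame + 1;
    for (uint32_t frame = packet.firstFrame; frame <= currentFrame; ++frame)
        packet.inputs.push_back(inputs.Get(frame));   // inputs barely change frame-to-frame, so this stays tiny
    return packet;
}
```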

I’m going to show you the animation again, this time at double the previous speed: 1/50th of normal.

 

Everything unstable is in play at once: ping jitter, packet loss, client time dilation, the input sliding window filling in the holes, prediction failures, server corrections. We’ll let them all play together for you.

 

The next topic I don’t want to cover in too much detail, because Dan Reid covers it in his own talk, which I highly recommend — it’s really great. Also in this room, right after I’m done.

All abilities are built in Statescript, Blizzard’s own scripting language. One of the great things about the script system is that it can be rolled backward and forward. With client prediction and server validation, just like the movement example earlier, we can roll you back and replay all of your inputs. Abilities use the same roll-back-and-replay principle as movement: roll back to the state of the last validated snapshot, then replay inputs up to the current moment.

You’ll remember the earlier example of a server correction caused by Tracer being stunned; abilities go through the same process. Both client and server run the deterministic ability simulation, with the client ahead of the server, so typically the client simulates first and the server follows. The client handles prediction errors by rolling back to the server snapshot and then rolling forward, as the animation on this slide illustrates with Reaper’s Wraith Form. The squares in Figure 45 represent Wraith Form being active, and with those states set I can confidently play the cool effects and animations.

 

Those blocks close when Wraith Form ends; the little animations show the state shutting down within the same frame. Soon afterward a message arrives from the server saying, in effect, “Here is how the Wraith Form actually played out,” so we roll back, turn those states back on, re-simulate all the inputs, and turn the states off again. This roll back and roll forward happens every time a server update arrives.

 

Being able to predict movement is cool, and it means you can predict every ability too — and we do. The same goes for weapons and the other modules.

 

 

Now let’s talk about prediction and confirmation of the hit decision.

ECS handles this quite conveniently. Remember that an entity is a subject of a behavior if it has the tuple of components that behavior requires. If your entity is hostile (recall the hostility check from earlier) and has a ModifyHealthQueue component, it can be hit by another player; that is the “hittable” rule.

 

Those are the two components: one used for the hostility check and the other being ModifyHealthQueue, where the server records all the damage and healing applied to you. Like the Contact singleton, it is evaluated lazily, because it has many call sites and large side effects; we don’t want to spawn a pile of effects in the middle of the projectile simulation, so we defer.

 

Damage, by the way, is not predicted on the client at all, because clients can’t be trusted; they cheat.

Hit detection, however, is handled on the client. If an entity has a MovementState component and is a remote object not controlled by the local player, it gets interpolated by the movement system. Standard interpolation happens between the last two MovementStates received, a technique that has been around since the days of Quake.

 

The system doesn’t care whether you’re a moving platform, a turret, a door, or Pharah; you just need a MovementState component. The MovementState component also stores a ring buffer of recent positions — for example, where Tracer has recently been.

 

With the MovementState component, the server can roll you back to the frame the attacker reported before testing the hit; that is backwards reconciliation. All of this is orthogonal to the ModifyHealthQueue component, which decides whether damage is accepted. We also rewind the doors, the platforms, the payload — anything that might have blocked the bullet. In general, if you are hostile and have a MovementState component, you will be rewound and possibly damaged.
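A hedged, heavily simplified sketch of the rewind-and-test step on the server; it ignores ordering and occlusion along the ray, and every type and helper here (Ray, HitResult, MotionBounds, CollisionAt, the component accessors) is an assumption rather than the real interface:

```cpp
// The shooter reported firing on command frame `shooterFrame`. Rewind every entity that
// could matter (targets and blockers alike) to that frame, then test the shot against it.
HitResult ServerResolveShot(EntityAdmin& admin, const Entity& shooter,
                            const Ray& shot, float damage, uint32_t shooterFrame) {
    HitResult result;
    for (MovementStateComponent* move : admin.GetAll<MovementStateComponent>()) {
        Entity& target = move->Owner();

        // Cheap rejection first: only rewind entities whose recent-motion bounds the ray touches.
        if (!shot.Intersects(move->MotionBounds()))
            continue;

        const Transform rewound = move->history.Get(shooterFrame);   // position at the shooter's frame
        if (!shot.Intersects(CollisionAt(target, rewound)))
            continue;

        // Damage is orthogonal: it only applies if the target is hostile and hittable.
        if (CombatUtilityIsHostile(shooter, target)) {
            if (auto* health = target.Get<ModifyHealthQueueComponent>())
                health->QueueDamage(damage);                          // deferred, resolved later in the frame
        }
        result.AddHit(target.id);
    }
    return result;
}
```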

 

Rewinding is one behavior, performed by a set of utility functions operating on MovementState; applying damage is another behavior, processed deferred through the health queue. The two behaviors are independent, each operating on its own slice of components.

 

The shooting process is a little abstract, so let me break it down.

The boxes in Figure 47 are the logical bounds of each entity. A logical bound is basically the union of that Genji’s recent snapshots, so the bound around Genji represents everywhere the character has been over the last half second. If I shoot toward the crosshair now, the shot first has to intersect this bound before we bother rewinding the character, because, given my ping, he could be anywhere inside it.

In this case, if I shoot in this direction, I only have to rewind Ana, because the bullet only intersects her bound. There’s no need to also rewind Reinhardt and his barrier, or the payload and the door behind it.

 

Shooting, like moving, can have predictive failures.

The green figure here is Reaper as the client sees him, and the yellow one is the server’s view. The little green dots are where the client thinks its shots landed; the thin green line is the bullet’s path. When the server checks, the bluish-purple hemisphere shows where the actual hit was registered.

This is a completely artificial example; the deterministic simulation is very reliable. To reproduce a shooting misprediction I had to set my packet loss to 60% and then shoot that poor guy for 20 minutes before I could recreate it.

I should also mention that the simulation is this accurate thanks to our QA colleagues. They never took “no” for an answer. Since no other game on the market predicted hits with this level of accuracy, my QA buddies didn’t trust it and didn’t cut me any slack. They just kept filing bug after bug, and every time we investigated, there really was a bug. We owe them a huge debt of gratitude for helping make such a great product.

 


 

If your ping is particularly high, the normal hit prediction no longer applies.

Once the ping exceeds 220 ms, we delay some of the hit effects instead of predicting them and wait for the server to confirm. The reason is that at that point the client is already extrapolating remote targets, and we don’t want to rewind that far back; we don’t want the victim to feel like they made it behind cover only to get pulled back and damaged. So a layer of protection limits how far rewinding can reach beyond the interpolation period. The following video demonstrates this.

 

When the ping is 0, the projectile’s impact is predicted, but the hit marker and health bar are not shown until the server’s confirmation comes back.

By the time the ping reaches 300 ms, even the impact is no longer predicted, because the target is being extrapolated: it isn’t actually where you see it. We use dead reckoning, which gets close, but it’s still not where the target really is. When Reaper strafes side to side like this, that extrapolation simply can’t predict it. We don’t spare your feelings here: your network is terrible.

It’s especially obvious in the last video, when the ping hits one second. Reaper moves the same way and is still being smoothed the same way. By the way, even with a one-second ping, all of the client’s own actions still get immediate, predicted responses, but most of the predictions are wrong. I should have done better (it was noon); I would have killed him.

         

Here’s another example of a prediction failure. The ping is still not great: 150 milliseconds. Under these conditions, whenever a movement prediction fails, the hit prediction fails along with it. Let’s watch it in slow motion.

 

Look: the client predicts blood, but there’s no hit decal, no crater, so the predicted projectile impact was wrong; the server rejected it as not a legitimate hit. The impact prediction failed because an ice wall went up. You “think” you’re still standing on the ground when you fire, but by the time the server simulates it, the ice wall has already lifted you into the air, and that causes the misprediction.

 

As we fixed these small hit-prediction errors, we found that most of them could be eliminated by agreeing with the server on position, so we spent a lot of time getting positions to line up.

         

Here are some prediction failures that are related to movement but are also driven by gameplay.

The ping is still 150 milliseconds. You shoot at Reaper, but he is in Wraith Form. When the arrow reaches him, the client predicts blood spray, but there’s no hit decal and no health-bar change: we never actually hit him, because he was already in Wraith Form.

 

In cases like this we usually favor the shooter, unless the victim has done something to mitigate the attack. Here, Reaper’s Wraith Form gives him three seconds of invulnerability, so we didn’t actually hit him.

To put it philosophically: imagine you are Reaper and you go into Wraith Form, but the server may well tell you to play all the effects and then die anyway, because from its point of view you didn’t enter that state fast enough.


 

ECS simplifies the network synchronization problem. The systems used in the network synchronization code know when they are operating on a player; it’s pretty straightforward: if an entity is controlled by something with a Connection component, it’s a player.

The systems also know which targets need to be rewound to the attacker’s frame: any entity with a MovementState component gets rewound.

 

The key association between entities and components here is that anything with a MovementState can be rewound along the timeline.

 

Figure 52 is a panoramic view of the systems and components; only a few of them are involved in network synchronization behavior, even though that is the most complicated problem we have. NetworkEvent and NetworkMessage are the core of the network synchronization module, handling the typical networking work of receiving input and sending output.

 

The rest I can count on one hand: InterpolateMovement, Weapons, Statescript, and MovementState (which I’d particularly like to delete, because I don’t like it). So only about three gameplay systems are involved in network synchronization, and the components highlighted on the right are, for the most part, read-only to the netcode. The notable exception is something like ModifyHealthQueue, which really is written to, because the damage dealt to enemies is real.

Now let’s look back and see what we have learned after using ECS for so many years.

Looking back, I kind of wish that both systems and utilities had gone back to the prototype’s formal tuples instead of the “iterate one component, then grab its siblings” shortcut. With a really complex behavior you want to know exactly which components it touches. And if a behavior really needs a tuple of 40 components, that is probably a sign the design is too complex and responsibilities are tangled.

 

Another cool side effect of tuples is prior knowledge of which systems touch which state. Going back to the prototype engine that used tuples, you could see that two or three systems could run over different sets of components at the same time, because the tuple definition tells you exactly what a behavior needs. It’s designed to be very parallelizable: just like in the piano animation, multiple systems light up at the same time simply because the component sets they operate on don’t overlap.

Because read and write access to components is known up front, system ticking can run gameplay code across multiple threads. The Transform component is still popular, but only a few systems actually modify it; most only read it. So when you define a tuple, you can mark components as read-only, which means that even when multiple systems touch the same component, they can run in parallel as long as they only read it.

Entity lifecycle management takes some care, especially for entities created in the middle of a frame. Early on we deferred both creation and destruction: when you said “hey, I want to create an entity,” it actually happened at the end of the frame. Deferring destruction turned out not to be a problem at all, but deferring creation had a ton of side effects, especially if System A requests a new entity and System B wants to use it later in the same frame; with deferred creation it simply doesn’t exist until the next frame.

That was awkward. It also added a lot of internal complexity, so we changed this part of the code so that entities can be created in the middle of a frame and used immediately.

 

We made these changes after the game shipped, which was pretty scary. The change went out in patch 1.2 or 1.3, and I stayed up all night the night it went live.

It took us about a year and a half to arrive at our guidelines for using ECS, as in the earlier examples, and we needed to adapt a lot of existing code to the new architecture. The guidelines: components have no functions; systems have no state; shared code goes into utilities; complex side effects are deferred through queues, especially on singleton components; and systems cannot call other systems’ functions.

 

There is still a fair amount of code that does not follow these rules, and, not surprisingly, it is a major source of complexity and maintenance. You can see it in the number of code changes and bugs it attracts.

So if you have a legacy subsystem that doesn’t fit the ECS model, don’t force ECS into it. Keep such subsystems clean and self-contained rather than wrapping all of their internals in proxy components.

Different systems are designed to solve problems in different ways.

ECS is a tool for integrating a large number of systems; it shouldn’t force ill-fitting design principles onto systems that solve their problems differently.

 

ECS is designed to integrate and decouple a large number of modules. Many systems and their associated components are shaped like icebergs.

An iceberg component exposes only a small surface to the rest of ECS, but underneath it holds a large amount of state, proxies, or data structures that the ECS layer never touches.

 

The size of these icebergs is obvious in the threading model, where most of the ECS work, such as updating systems, happens on the main thread (top of Figure 58). We also use plenty of multithreading techniques such as fork/join. In this example a character fires a lot of projectiles, and the script system says “we need to spawn some projectiles” and fans the work out to a few worker threads. Here, ResolveContactSystem is creating impact effects and likewise fans out to several worker threads.

 

The behind-the-scenes work of the projectile simulation stays isolated and invisible to the ECS layer above it, which is exactly what we want.

Another nice example is AIPetDataSystem, which uses fork/join well. At the ECS level there is only a tiny bit of coupling, perhaps “hey, this is a breakable door, you may need to rebuild paths in these areas,” but there is a ton of work behind the scenes, like gathering all the triangles and doing the rebuilding and trimming. None of that has anything to do with ECS, and we shouldn’t drag ECS into those problem domains; they get solved on their own terms.

The video here shows the PathValidationSystem. The path is all of these blue blocks the AI can walk on. Paths aren’t only used by AI; many hero abilities use them too, so this path data has to be synchronized between the server and the client.

 

Zenyatta in the video destroys these objects here, and you can see the broken pieces fall below the surface; then the doors over there open, and we stitch those surfaces together. The PathValidationSystem just says, “hey, the triangles changed,” and all the data behind the iceberg is used to rebuild the path.

And now I’m ready to end today’s sharing.

 

ECS is the glue that holds Overwatch together, and it’s cool because it helps you integrate a lot of disparate systems with minimal coupling. If you adopt ECS to define your conventions — or really, whatever architecture you use to define them — only a few programmers ever need to touch the physics engine code, the script engine, or the audio library, but everyone can use the glue code and integrate systems together.

By imposing these limits, success can be achieved.

 

As it turns out, network synchronization really is complicated, so it has to be decoupled from the rest of the engine as much as possible, and ECS is a good way to do that.

Finally, before I take questions, I’d like to thank our team, and especially the gameplay engineers, for spending three years creating such a beautiful piece of work. Together we created the principles, the architecture evolved, and the results speak for themselves.