Original address: blog.discord.com/why-discord…
Author: Medium.com/@jesse_1122…
Published: February 5, 2020 · 10 minute read
Rust is becoming a first-class language in a variety of domains. At Discord, we've seen success with Rust on both the client and server sides. For example, we use it on the client side for Go Live's video encoding pipeline and on the server side for Elixir NIFs. Recently, we drastically improved the performance of a service by switching its implementation from Go to Rust. This post explains why it made sense for us to reimplement the service, how we did it, and the resulting performance improvements.
The Read States service
Discord is a product-focused company, so let's start with some product context. The service we switched from Go to Rust is the "Read States" service. Its sole purpose is to keep track of which channels and messages you have read. Read States is accessed every time you connect to Discord, every time a message is sent, and every time a message is read. In short, Read States is on the hot path. We want to make sure Discord feels super snappy at all times, so we need to make sure Read States is fast.
With the Go implementation, the Read States service was not meeting its product requirements. It was fast most of the time, but every few minutes we saw large latency spikes that were bad for the user experience. After investigating, we determined the spikes were caused by core Go features: its memory model and garbage collector (GC).
Why Go did not meet our performance goals
To explain why Go did not meet our performance goals, we first need to discuss the data structures, scale, access patterns, and architecture of the service. The data structure we use to store read state information is conveniently called "Read State". Discord has billions of Read States: there is one Read State per user per channel. Each Read State has several counters that need to be updated atomically and are often reset to 0. For example, one of the counters is how many @mentions you have in a channel.
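To make that shape concrete, here is a minimal sketch of a Read State with an atomically updated, resettable counter. It is written in Rust for consistency with the rest of this post; the field and method names are illustrative assumptions, not Discord's actual schema.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Minimal illustrative Read State; the real record holds more fields.
struct ReadState {
    /// ID of the last message the user acknowledged in the channel.
    last_acked_message_id: u64,
    /// Number of @mentions since the last acknowledgement. Atomic so
    /// concurrent handlers can update it without taking a lock.
    mention_count: AtomicU64,
}

impl ReadState {
    /// Record a new @mention for this user/channel pair.
    fn add_mention(&self) {
        self.mention_count.fetch_add(1, Ordering::Relaxed);
    }

    /// Reset the counter to 0, e.g. when the user reads the channel.
    fn ack(&self) {
        self.mention_count.store(0, Ordering::Relaxed);
    }
}
```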
To achieve quick atomic counter updates, each Read States server has a Least Recently Used (LRU) cache of Read States. There are millions of users in each cache, tens of millions of Read States in each cache, and hundreds of thousands of cache updates per second.
For persistence, we back the cache with a Cassandra database cluster. On LRU cache key eviction, we commit your Read States to the database. We also schedule a database commit for 30 seconds in the future whenever a Read State is updated. There are tens of thousands of database writes per second.
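A rough sketch of that write-behind pattern, assuming the Tokio runtime; `commit_to_database` is a hypothetical placeholder for the Cassandra write, and a real implementation would coalesce commits so each key is written at most once per window rather than once per update.

```rust
use std::time::Duration;
use tokio::time::sleep;

/// Hypothetical stand-in for the Cassandra write.
async fn commit_to_database(user_id: u64, channel_id: u64) {
    println!("committing read state for {user_id}/{channel_id}");
}

/// Called on every Read State update: mutate the in-memory LRU entry
/// now, and persist to the database within the next 30 seconds.
async fn on_read_state_update(user_id: u64, channel_id: u64) {
    // ...update the LRU cache entry here (the hot path)...

    // Defer the durable write so the hot path never waits on the DB.
    // A real implementation coalesces these so each key commits at
    // most once per 30-second window.
    tokio::spawn(async move {
        sleep(Duration::from_secs(30)).await;
        commit_to_database(user_id, channel_id).await;
    });
}
```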
In the image below, you can see the response time and system CPU for a sample peak period of the Go service. As you might notice, the latency and CPU spikes occur roughly every 2 minutes.¹
So why the 2-minute spikes?
In Go, memory is not freed immediately when a cache key is evicted. Instead, the garbage collector runs every so often to find memory that has no references and then frees it. In other words, instead of being freed as soon as it is no longer used, memory hangs around until the garbage collector can determine that it is truly no longer in use. During collection, Go has to do a lot of work to determine which memory is free, which can slow the program down.
These latency spikes definitely smelled like garbage collection performance impact, but we had written the Go code very efficiently and had very few allocations. We were not creating a lot of garbage.
After digging through the Go source code, we learned that Go forces a garbage collection run at least every 2 minutes. In other words, if garbage collection has not run for 2 minutes, Go will force one regardless of whether the heap has grown.
We figured we could tune the garbage collector to run more often in order to prevent large spikes, so we implemented an endpoint on the service to change the garbage collector's GC Percent on the fly. Unfortunately, no matter how we configured GC Percent, nothing changed. How could that be? It turns out it was because we were not allocating memory quickly enough for it to force garbage collection to happen more often.
As we kept digging, we learned the spikes were huge not because there was a massive amount of memory ready to be freed, but because the garbage collector needed to scan the entire LRU cache to determine whether the memory was truly free of references. We therefore figured a smaller LRU cache would be faster, because the garbage collector would have less to scan. So we added another setting to the service to change the size of the LRU cache and changed the architecture so that each server had many partitioned LRU caches. We were right: with a smaller LRU cache, garbage collection produced smaller spikes.
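The partitioning idea is simple: instead of one giant cache per server, hash each key to one of many small LRU shards, so any single scan stays small. A minimal sketch of the shard selection; the shard count here is illustrative, not Discord's actual setting.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Illustrative shard count; the real value was tuned under load.
const N_SHARDS: usize = 32;

/// Hash a (user, channel) key to one of N_SHARDS small LRU caches,
/// so that no single cache (and no single GC scan) grows too large.
fn shard_for(user_id: u64, channel_id: u64) -> usize {
    let mut hasher = DefaultHasher::new();
    (user_id, channel_id).hash(&mut hasher);
    (hasher.finish() as usize) % N_SHARDS
}
```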
Unfortunately, the trade-off of making the LRU cache smaller was higher 99th percentile latency. This is because with a smaller cache, a user's Read State is less likely to be in the cache, and if it is not in the cache we have to do a database load.
After a significant amount of load testing against different cache capacities, we found a setting that seemed okay. Not fully satisfied, but satisfied enough, and with bigger fish to fry, we left the service running like this for quite some time.
During that time, we were seeing more and more success with Rust in other parts of Discord, and we collectively decided to build the frameworks and libraries needed to write new services entirely in Rust. This service was a perfect candidate to port to Rust: it was small and self-contained, and we hoped Rust would fix these latency spikes. So we took on the task of porting Read States to Rust, hoping to prove Rust out as a service language and improve the user experience.
Memory management in Rust
Rust is blazingly fast and memory-efficient: with no runtime or garbage collector, it can power performance-critical services, run on embedded devices, and easily integrate with other languages.³
Rust doesn’t have a garbage collector, so we don’t expect it to have the latency spikes that Go does.
Rust uses a relatively unique memory management approach that incorporates the idea of memory "ownership". Basically, Rust keeps track of who can read and write to memory. It knows when the program is using memory and immediately frees it once it is no longer needed. It enforces memory rules at compile time, making runtime memory bugs virtually impossible.⁴ You do not need to keep track of memory yourself; the compiler takes care of it.
So in the Rust version of the Read States service, when a user's Read State is evicted from the LRU cache, it is immediately freed from memory. The Read State memory does not sit around waiting for the garbage collector to collect it. Rust knows it is no longer in use and frees it at once. There is no runtime process deciding whether it should be freed.
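Here is a small sketch of what that looks like in practice: when the evicted value's owner goes out of scope, Rust runs its destructor and frees the memory on the spot. The types are illustrative, not the service's actual code.

```rust
use std::collections::HashMap;

/// Illustrative Read State value; the real record holds more fields.
struct ReadState {
    mention_count: u64,
}

/// Evict one entry. `remove` hands ownership of the value to `state`,
/// and when `state` goes out of scope below, the memory is freed
/// immediately; no collector ever has to scan for it.
fn evict(cache: &mut HashMap<(u64, u64), ReadState>, key: (u64, u64)) {
    if let Some(state) = cache.remove(&key) {
        // ...commit `state` to the database here...
        println!("persisting mention_count = {}", state.mention_count);
    } // `state` is dropped here and its memory is released at once.
}
```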
Asynchronous Rust
But the Rust ecosystem had a problem. At the time this service was reimplemented, stable Rust did not have a good story for asynchronous Rust. For a networked service, asynchronous programming is a requirement. There were community libraries that enabled async Rust, but they required a significant amount of ceremony, and the error messages were extremely obtuse.
Fortunately, the Rust team was hard at work making asynchronous programming easy, and it was available on Rust's unstable nightly channel.
Discord has never been afraid to embrace new technologies that look promising. For example, we were early adopters of Elixir, React, React Native, and Scylla. If a piece of technology is promising and gives us an advantage, we do not mind dealing with the inherent difficulties and instability of the bleeding edge. This is one of the ways we quickly got to more than 250 million users with fewer than 50 engineers.
Embracing the new async functionality in Rust nightly is another example of our willingness to embrace new, promising technology. As an engineering team, we decided it was worth using nightly Rust, and we committed to running on nightly until async was fully supported on stable. Together we dealt with any problems that came up, and at this point stable Rust supports asynchronous Rust.⁵ The bet paid off.
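For a sense of what that buys you, here is a minimal async sketch in the style the service could use once async/await landed, assuming the Tokio runtime; the handler name and its body are illustrative, not the service's actual API.

```rust
/// Hypothetical handler: acknowledge a channel and return the new
/// mention count. In the real service this would touch the LRU cache
/// and schedule a database commit.
async fn handle_ack(user_id: u64, channel_id: u64) -> u64 {
    let _ = (user_id, channel_id);
    0
}

#[tokio::main]
async fn main() {
    // `.await` suspends the task without blocking the OS thread,
    // which is what makes async a requirement for a network service.
    let count = handle_ack(42, 7).await;
    println!("new mention count: {count}");
}
```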
Implementation, load testing, and release
The actual rewrite was fairly straightforward. It started as a crude translation, then we slimmed it down where it made sense. For instance, Rust has a great type system with extensive support for generics, so we could throw out Go code that existed simply due to the lack of generics. Also, Rust's memory model is able to reason about memory safety across threads, so we were able to discard some of the manual cross-thread memory protection that was required in Go.
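As an illustration of that compile-time, cross-thread safety (not code from the service): the compiler only allows a value to cross a thread boundary if its type is `Send`, so none of the manual discipline Go required is needed.

```rust
use std::sync::Arc;
use std::thread;

/// Generic fan-out: the `Send + Sync` bounds are checked by the
/// compiler, so sharing something non-thread-safe simply won't build.
fn spawn_readers<T: Send + Sync + 'static>(shared: Arc<T>, f: fn(&T)) {
    let handles: Vec<_> = (0..4)
        .map(|_| {
            let shared = Arc::clone(&shared);
            thread::spawn(move || f(&shared))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}

fn main() {
    // Immutable shared data can be read from many threads safely.
    spawn_readers(Arc::new(vec![1u64, 2, 3]), |v| {
        println!("sum = {}", v.iter().sum::<u64>());
    });
}
```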
When we began load testing, we were instantly pleased with the results. The latency of the Rust version was just as good as Go's, and it had no latency spikes! This made us very happy.
It's worth noting that we had only put very basic thought into optimization while writing the Rust version. Even with just basic optimization, Rust was able to outperform the hyper hand-tuned Go version. This is a huge testament to how easy it is to write efficient programs with Rust compared to the deep dive we had to do with Go.
But we weren't satisfied with simply matching Go's performance. After a bit of profiling and performance tuning, we were able to beat Go on every single performance metric. Latency, CPU, and memory were all better in the Rust version.
The performance optimizations in the Rust version included:
- Switching to a BTreeMap instead of a HashMap in the LRU cache to optimize memory usage (see the sketch after this list).
- Swapping out the initial metrics library for one that used modern Rust concurrency.
- Reducing the number of memory copies we were doing.
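A sketch of the first optimization, keying Read States by (user id, channel id) in a `BTreeMap`; the types are illustrative. A `BTreeMap` packs entries into tree nodes rather than a large, sparsely filled hash table, which can lower per-entry memory overhead for a workload like this one.

```rust
use std::collections::BTreeMap;

/// Illustrative Read State value.
struct ReadState {
    mention_count: u64,
}

/// Cache keyed by (user_id, channel_id). Swapping HashMap for
/// BTreeMap keeps the same interface at the call sites while storing
/// entries in packed tree nodes instead of a hash table.
struct ReadStateCache {
    states: BTreeMap<(u64, u64), ReadState>,
}

impl ReadStateCache {
    fn get_mut(&mut self, user_id: u64, channel_id: u64) -> Option<&mut ReadState> {
        self.states.get_mut(&(user_id, channel_id))
    }
}
```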
Satisfied, we decided to launch the service.
The rollout was fairly smooth thanks to the load testing. We put it on a single canary node, found a few missing edge cases, and fixed them. Shortly after that, we rolled it out to the rest of the fleet.
Here are the results.
Go is purple and Rust is blue.
Raising the cache capacity
After the service ran successfully for a few days, we decided it was time to re-raise the LRU cache capacity. As mentioned above, in the Go version, raising the cap of the LRU cache resulted in longer garbage collections. We no longer had to deal with garbage collection, so we figured we could raise the cap and get even better performance. We increased the memory capacity of the boxes, optimized the data structure to use even less memory (just for fun), and increased the cache capacity to 8 million Read States.
The results below speak for themselves. Notice that the average time is now measured in microseconds, while the max @mention time is measured in milliseconds.
A growing ecosystem
Finally, another beauty of Rust is that it has a quickly evolving ecosystem. Recently, Tokio (the async runtime we use) released version 0.2. We upgraded, and it gave us CPU benefits for free. Below you can see that CPU is consistently lower starting around the 16th.
Closing thoughts
At this point, Discord is using Rust in many places across its software stack: for the game SDK, video capture and encoding for Go Live, Elixir NIFs, several back-end services, and more.
When starting a new project or software component, we consider using Rust. Of course, we only use it where it makes sense.
In addition to performance, Rust has many advantages for an engineering team. For example, its type safety and borrow checker make it easy to refactor code as product requirements change or new learnings about the language are discovered. The ecosystem and tooling are excellent, and there is a lot of momentum behind them.
If you've made it this far, you're probably newly excited about Rust, or have been for quite some time. If you want to use Rust professionally to solve interesting problems, you should consider working at Discord.
Also, a fun fact: the Rust team uses Discord to coordinate. There's even a very helpful Rust community server where you can find us chatting from time to time. Check it out here.
1. Go version 1.9.2. Edit: the charts are from version 1.9.2. We tried versions 1.8, 1.9, and 1.10 without any improvement. The initial port from Go to Rust was completed in May 2019.
2. To be clear, we don't think you should rewrite everything in Rust just because.
3. Quote from www.rust-lang.org/
4. Unless, of course, you use unsafe.
5. areweasyncyet.rs/