A guide to multithreaded Ruby programs with Ractor

What is Ractor?

Ractor is a new feature introduced in Ruby 3. As its name suggests, it is a portmanteau of Ruby and Actor. The Actor model is a message-passing, lock-free concurrency model. Actor-based concurrency already has several implementations in Ruby, such as `Concurrent::Actor` in concurrent-ruby. But while concurrent-ruby introduces a rich set of abstractions for building highly concurrent applications, it cannot get rid of Ruby's Global Interpreter Lock (GIL), which allows only one thread to execute Ruby code at a time. For that reason concurrent-ruby is usually paired with JRuby, whose interpreter has no such lock. Removing the GIL outright, however, breaks the large number of libraries that implicitly rely on it, leading to unexpected thread-contention bugs in multithreaded code.

Last year at RubyConf China, I asked Matz why Ruby was not designed for multithreading, given that multicore minicomputers and supercomputers were already common in the 1990s. Matz said he was still using a Windows 95 PC back then, and that if he had known how widespread multicore would become, he would not have designed Ruby this way.

What data can be shared between ractors?

However, this historical baggage still needs to be addressed. Ruby 3 introduces the Fiber Scheduler to improve the extremely low single-thread utilization in I/O-intensive scenarios; we still need to improve multicore utilization in computation-intensive scenarios.

To solve this problem, Ruby 3 introduced the Ractor model. A Ractor is essentially a thread, but with a series of restrictions. First, locks are not shared between ractors: two ractors can never compete for the same lock. Instead, ractors communicate by passing messages. Each Ractor has a global lock of its own, so code running inside a single Ractor behaves just like an ordinary Thread. Messages passed between ractors must behave like value types: no pointer survives the crossing between ractors, which rules out data races. In short, Ractor treats each thread as an actor.
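The restrictions above boil down to a simple send/receive protocol. A minimal sketch (Ruby 3.0+; the "Ractor is experimental" warning on first use is expected):

```ruby
# Minimal Ractor message passing: the child receives a value,
# squares it, and yields the result back. The Integer crosses the
# ractor boundary as a value; no reference is shared.
r = Ractor.new do
  v = Ractor.receive  # block until the parent sends a message
  Ractor.yield v * v  # hand the result back to whoever takes it
end

r.send(7)
result = r.take
p result  # => 49
```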

Ruby, however, has no real value types. The essence of a value type is that copies replace references, so all we have to do is make sure Ruby objects are copied (or otherwise safe to share). The Ractor documentation gives a strict description of this:

Ractors don't share everything, unlike threads.
* Most objects are *unshareable objects*, so you don't need to care about the thread-safety problems caused by sharing.
* Some objects are *shareable objects*.
  * Immutable objects: frozen objects which don't refer to unshareable objects.
    * `i = 123`: `i` is an immutable object.
    * `s = "str".freeze`: `s` is an immutable object.
    * `a = [1, [2], 3].freeze`: `a` is not an immutable object because `a` refers to the unshareable object `[2]` (which is not frozen).
  * Class/Module objects
  * Special shareable objects
    * The Ractor object itself.
    * And more...

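These rules can be checked at runtime with `Ractor.shareable?`, and `Ractor.make_shareable` deep-freezes a structure to make it shareable. A short sketch:

```ruby
# Probing the shareability rules listed above (Ruby 3.0+).
p Ractor.shareable?(123)                 # => true  (immutable Integer)
p Ractor.shareable?("str".freeze)        # => true  (frozen, no inner refs)
p Ractor.shareable?([1, [2], 3].freeze)  # => false (inner [2] is not frozen)
p Ractor.shareable?(String)              # => true  (Class objects are shareable)

a = Ractor.make_shareable([1, [2], 3])   # deep-freezes the whole structure
p Ractor.shareable?(a)                   # => true
```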

Ractor performance enhancement test

To test Ractor, we need a computation-intensive scenario. The most computation-intensive scenario, of course, is doing math itself. For example, consider the following program:

```ruby
DAT = (0...72072000).to_a
p DAT.map { |a| a**2 }.reduce(:+)
```

This program computes the sum of squares of the integers from 0 up to (but not including) 72072000. Running it takes 8.17 s.

With traditional multithreading, we could write the program like this:

```ruby
THREADS = 8
LCM = 72072000
t = []
res = []

(0...THREADS).each do |i|
  r = Thread.new do
    dat = (((LCM / THREADS) * i)...((LCM / THREADS) * (i + 1))).to_a
    res << dat.map { |a| a**2 }.reduce(:+)
  end
  t << r
end

t.each { |t| t.join }
p res.reduce(:+)
```

After running it, we find that although eight system threads were created, the total elapsed time is 8.21 s: no significant improvement.

Rewriting the program with Ractor, the main change is that the child can no longer read the outer variable `i` directly; instead we pass it in as a message. The improved code looks like this:

```ruby
THREADS = 8
LCM = 72072000
t = []

(0...THREADS).each do |i|
  r = Ractor.new i do |j|
    dat = (((LCM / THREADS) * j)...((LCM / THREADS) * (j + 1))).to_a
    dat.map { |a| a**2 }.reduce(:+)
  end
  t << r
end

p t.map { |t| t.take }.reduce(:+)
```

The result? We tested it with different numbers of threads.

Ractor does mitigate the problem of the multithreaded global interpreter lock.
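The measurement loop can be sketched roughly as follows. This is a sketch, not the exact harness used above: the bound is shrunk to 720720 so it finishes quickly, and the timings it prints will vary by machine.

```ruby
require 'benchmark'

LCM = 720720  # smaller than the article's bound so the sketch runs quickly

# Split the range 0...LCM across `n` ractors and sum the squares.
# The constant LCM is readable inside a Ractor because Integers are shareable.
def ractor_sum(n)
  rs = (0...n).map do |i|
    Ractor.new(i, n) do |j, workers|
      (((LCM / workers) * j)...((LCM / workers) * (j + 1))).sum { |a| a**2 }
    end
  end
  rs.sum(&:take)
end

[1, 2, 4].each do |n|
  t = Benchmark.realtime { ractor_sum(n) }
  puts format('%2d ractors: %.3fs', n, t)
end
```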

Ractor under the microscope

I used AMD uProf (Intel VTune is the equivalent on Intel CPUs) to collect CPU statistics. To reduce the impact of boost clocks on single-threaded performance, I locked all cores of the AMD Ryzen 7 2700X to 4.2 GHz.

On the AMD Ryzen 7 2700X, 4 threads are more than 3 times faster than a single thread, approaching a fourfold speedup. The 2700X is an 8-core, 16-thread CPU in which every four cores form a CCX, and memory access across CCXes carries an extra cost. As a result, scaling is nearly linear up to 4 threads, but beyond 4 threads the improvement is limited by CCX crossings and SMT. As the number of threads increases, IPC (instructions per clock cycle) also declines: with a single thread the CPU retires 2.42 instructions per cycle, but with 16 threads only 1.40. More threads also mean more complex operating-system thread scheduling, so multicore utilization keeps dropping.

We reached a similar conclusion on an Intel i7-6820HQ. It is a 4-core, 8-thread CPU, and the gains became limited once the fifth thread started relying on Hyper-Threading.

How does Ractor improve the performance of existing Ruby programs?

In addition to improving computing efficiency in computation-intensive scenarios, the introduction of Ractor has a positive impact on the memory footprint of existing large Ruby web applications. Under the GIL, existing web servers such as Puma get very little out of I/O multiplexing, so they typically combine multithreading with multiple processes to improve performance. Since a web server can scale horizontally without limit and is managed as a set of processes, running one interpreter per process sidesteps the GIL entirely.

But fork is inefficient. Microsoft presented a paper at HotOS 2019, *A fork() in the road*, arguing that fork makes process startup very slow compared with spawn. To alleviate this, since GC.compact arrived in Ruby 2.7 it has become common to run compaction (often several times) before forking, which packs the heap and reduces the copy-on-write cost of forked workers. Going further, using Ractor instead of multiprocess management makes it easier to pass messages, reuse frozen constants, and reduce the memory footprint.
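The pre-fork compaction pattern mentioned above can be sketched like this (POSIX only, since `fork` is unavailable on Windows; the worker body is a stand-in for real application code):

```ruby
# Sketch of the pre-fork compaction pattern (Ruby >= 2.7, POSIX systems).
GC.start    # settle the heap first
GC.compact  # pack live objects together before forking, so copy-on-write
            # pages shared with the children are invalidated less often

pids = 2.times.map do
  fork do
    (0...1_000).sum { |a| a * a }  # placeholder for real worker code
    exit 0
  end
end

statuses = pids.map { |pid| Process.wait2(pid).last }
p statuses.all?(&:success?)  # => true
```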

Conclusion

Ruby 3 opens Pandora's box of multithreading: we can finally use multiple cores to improve performance. But as the CPU profiler shows, spreading work across threads makes IPC and cache hit rates drop, which puts higher demands on program tuning.

We’ll see as we go.