
With the emergence of new asynchronous frameworks and languages with coroutine support (such as Go), operating-system thread scheduling has become a performance bottleneck in many scenarios, and Java's suitability for the latest cloud scenarios has been called into question. Four years ago, the Alibaba JVM team started working on Wisp2, bringing Go-style coroutine capabilities to the Java world.

The Java platform has long been known for its rich ecosystem of libraries and frameworks that help developers build applications quickly. Most of these libraries and frameworks handle concurrency with thread pools and blocking calls, for the following reasons:

- The core class library provides powerful concurrency primitives, so multithreaded applications can achieve good performance.
- Some Java EE standards are defined in terms of thread-level blocking (such as JDBC).
- Applications can be developed quickly with the blocking pattern.

Today, however, with the emergence of new asynchronous frameworks and languages with coroutine support (such as Go), operating-system thread scheduling has become a performance bottleneck in many scenarios. Java has therefore been questioned about its ability to adapt to the latest cloud scenarios.
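As a hypothetical illustration (the class name, port, and pool size below are invented for this sketch, not taken from the article), this is the thread-pool-plus-blocking pattern that most Java server code follows: each request holds an OS thread for the whole time its I/O blocks.

import java.io.*;
import java.net.*;
import java.util.concurrent.*;

public class BlockingServer {
    public static void main(String[] args) throws IOException {
        // Thread pool + blocking I/O: every in-flight request pins an OS
        // thread for as long as that request's I/O blocks.
        ExecutorService pool = Executors.newFixedThreadPool(200);
        ServerSocket server = new ServerSocket(8080);
        while (true) {
            final Socket socket = server.accept();  // blocks for a connection
            pool.submit(() -> handle(socket));
        }
    }

    static void handle(Socket socket) {
        try (BufferedReader in = new BufferedReader(
                     new InputStreamReader(socket.getInputStream()));
             PrintWriter out = new PrintWriter(socket.getOutputStream(), true)) {
            String line = in.readLine();   // blocks until the client sends a line
            out.println("echo: " + line);  // blocks until the bytes are written
        } catch (IOException e) {
            // ignored in this sketch
        }
    }
}

Under Wisp2, the idea is that code like this stays unchanged: the JVM parameter turns the blocked threads into scheduled coroutines.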

Alibaba started working on Wisp2 four years ago. It mainly targets I/O-intensive server scenarios, which describes most companies' online services (offline applications are more computation-oriented and do not apply). Wisp provides Java coroutines that are functionally comparable to goroutines, and it has reached a satisfying state in product form, performance, and stability. So far, hundreds of applications and tens of thousands of containers are running Wisp1/2 online. Wisp coroutines are fully compatible with the multithreaded, blocking style of code, and enabling them only requires adding JVM parameters. Alibaba's core e-commerce applications have been through two Double Eleven (11.11) events on the coroutine model, enjoying the rich Java ecosystem while obtaining the performance of asynchronous programs.

Wisp2 is all about performance and compatibility with existing code. In short, an existing multithreaded, I/O-intensive Java application can get an asynchronous-level performance boost simply by adding Wisp2's JVM parameters.

As an example, here is a comparison for the message-middleware broker (MQ for short) and DRDS after adding the parameters, without any code changes:

You can see that context switches and sys CPU are significantly reduced, with RT reduced by 11.45% and QPS increased by 18.13%.

Because Wisp2 is fully compatible with existing Java code, it is very simple to use. How easy is it?

If your application is a "standard" online application (one that configures its parameters in /home/admin/$APP_NAME/setenv.sh), you can enable Wisp2 by running the following command as the admin user:

curl gosling.alibaba-inc.com/sh/enable-w… | sh

Otherwise, you need to update the JDK and the Java parameters manually:

AJDK 8.7.12_fp2 RPM:

sudo yum install ajdk -b current   # you can also install the latest AJDK this way
java -XX:+UseWisp2 ...             # start the Java application with the Wisp parameter

You can then use jstack to verify that coroutines are indeed enabled.

Carrier threads are the threads that schedule coroutines. In the jstack output, each "- Coroutine [...]" entry represents a coroutine: active is the number of times it has been scheduled, steal is the number of times it has been migrated by work stealing, and preempt is the number of time-slice preemptions.

The following figure shows the top -H output of DRDS on ECS. You can see that the application's hundreds of threads are hosted on 8 carrier threads, which run evenly across the CPU cores. The threads named java below them are GC threads.

Overhead myth #1: Entering the kernel causes context switching

Let’s look at a test program:

#include <unistd.h>
int main(void) {
    int a[2];
    long n = 0;
    pipe(a);                /* a[0] = read end, a[1] = write end */
    while (1) {
        write(a[1], a, 1);  /* 1-byte write: one system call */
        read(a[0], a, 1);   /* 1-byte read: another system call */
        n += 2;             /* two syscalls per iteration */
    }
}


Measured on a Dragon server, each pipe operation in the program above takes about 334 ns (roughly three million kernel round trips per second), which is very fast.

Overhead myth #2: Context switching is expensive

Context switching, whether in user mode or kernel mode, is inherently lightweight; there is even hardware support for it, such as the PUSHA instruction, which saves the general-purpose registers for us. Threads in the same process share the page table, so the overhead of a context switch is typically only:

- saving the registers;
- switching the SP (the call instruction pushes the PC automatically);

which can be done in dozens of instructions.

The real overhead

Since neither entering the kernel nor the context switch itself is slow, where does the overhead of multithreading come from?

Let’s look at the hot-spot distribution of a blocking futex system call:

You can see a lot of scheduling overhead in the hot spots above. Let’s look at the process:

1. A thread makes a system call (which may need to block).
2. The system call does block, and the kernel needs to decide which thread to run next (scheduling).
3. A context switch is performed.

So both of the myths above are indeed related to the overhead of multithreading, but the real overhead comes from the scheduling triggered by thread blocking and wake-up.
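To make that blocking/wake-up cost concrete, here is a minimal sketch (the class name and round count are my own, not from the article) in which two Java threads hand a value back and forth through a SynchronousQueue: every handoff blocks one thread and wakes the other, which on Linux ends up in futex calls plus a scheduling decision.

import java.util.concurrent.SynchronousQueue;

public class PingPong {
    public static void main(String[] args) throws InterruptedException {
        final SynchronousQueue<Integer> ping = new SynchronousQueue<>();
        final SynchronousQueue<Integer> pong = new SynchronousQueue<>();

        Thread peer = new Thread(() -> {
            try {
                while (true) {
                    pong.put(ping.take());  // block, then wake the other side
                }
            } catch (InterruptedException e) {
                // exit when interrupted
            }
        });
        peer.setDaemon(true);
        peer.start();

        int rounds = 1_000_000;
        long start = System.nanoTime();
        for (int i = 0; i < rounds; i++) {
            ping.put(i);   // blocks until the peer takes the value
            pong.take();   // blocks until the peer hands it back
        }
        long elapsed = System.nanoTime() - start;
        // Each round is several block/wake pairs; compare the per-round cost
        // with the ~334 ns bare pipe syscall measured earlier.
        System.out.printf("%.0f ns per round%n", (double) elapsed / rounds);
    }
}

Running it under strace -cf makes the futex share of the time visible.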

To sum up, if you want a threading model that improves web server performance, the principles are:

- keep the number of active threads approximately equal to the number of CPUs;
- keep each thread busy rather than frequently blocking and waking up.

The rest of the article will focus closely on these two topics.

An event loop plus asynchronous callbacks is a good way to satisfy both of these conditions:
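As a rough illustration (a minimal sketch built on plain java.nio, not code from the article), a single-threaded event loop multiplexes many connections on one always-busy thread, so the active-thread count stays at the CPU budget and no thread blocks per connection.

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.*;
import java.util.Iterator;

public class EchoEventLoop {
    public static void main(String[] args) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress(8080));
        server.configureBlocking(false);
        server.register(selector, SelectionKey.OP_ACCEPT);

        ByteBuffer buf = ByteBuffer.allocate(4096);
        while (true) {
            selector.select();  // one thread waits on all connections at once
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    SocketChannel client = server.accept();
                    client.configureBlocking(false);
                    client.register(selector, SelectionKey.OP_READ);
                } else if (key.isReadable()) {
                    SocketChannel client = (SocketChannel) key.channel();
                    buf.clear();
                    int n = client.read(buf);   // non-blocking read
                    if (n < 0) {
                        client.close();         // peer closed the connection
                    } else {
                        buf.flip();
                        client.write(buf);      // echo back (a partial write is
                                                // ignored in this sketch)
                    }
                }
            }
        }
    }
}

The price is that business logic must be rewritten as callbacks on the event loop, which is exactly the compatibility cost that Wisp2 is designed to avoid.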
