Server I/O performance: Node, PHP, Java, and Go

Understanding an application’s input/output (I/O) model can mean the difference between an application that handles the load it is subjected to and one that crumples in the face of real-world use. If the application is small and doesn’t serve a high load, the choice may matter little. But as the load on your application increases, using the wrong I/O model can leave you bloodied and bruised.

As with most scenarios that allow multiple approaches, it’s not a question of which approach is better, it’s a matter of understanding the trade-offs. Let’s take a tour of the I/O landscape and see what we can learn from it.



In this article, we’ll compare Node, Java, Go, and PHP with Apache, discuss how these different languages model their I/O, the strengths and weaknesses of each model, and draw some preliminary conclusions. If you’re concerned about the I/O performance of your next web application, this article is for you.

I/O Basics: A quick review

To understand the factors involved with I/O, we must first review the concepts down at the operating system level. While you don’t deal with most of these concepts directly, you deal with them indirectly through your application’s runtime environment all the time. And the devil is in the details.

System calls

First, we have system calls, which can be described like this:

  • Your program (in “user land,” as they say) must ask the operating system kernel to perform an I/O operation on its behalf.
  • A “syscall” is the means by which your program asks the kernel to do something. The details of how this is implemented vary between operating systems, but the basic concept is the same: some specific instruction transfers control from your program over to the kernel (like a function call, but with special sauce for dealing with this situation). Generally speaking, syscalls are blocking, meaning your program waits for the kernel to return back to your code.
  • The kernel performs the underlying I/O operation on the physical device in question (disk, network card, etc.) and replies to the syscall. In the real world, the kernel might have to do a number of things to fulfill your request, including waiting for the device to be ready and updating its internal state, but as an application developer, you don’t care about that; it’s the kernel’s job.


[Diagram: a system call transfers control from your program to the kernel and back]
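To make this concrete, here’s a minimal sketch in Go (the language we’ll come back to later in this article) that issues the open() and read() syscalls directly through the syscall package. The file path is just an illustrative placeholder, and this sketch assumes Linux or macOS:

package main

import (
    "fmt"
    "syscall"
)

func main() {
    // open(2): control transfers from our program to the kernel and back
    fd, err := syscall.Open("/etc/hostname", syscall.O_RDONLY, 0)
    if err != nil {
        panic(err)
    }
    defer syscall.Close(fd)

    // read(2): a blocking syscall; this thread waits here until the kernel
    // has the data ready and returns control to our code
    buf := make([]byte, 4096)
    n, err := syscall.Read(fd, buf)
    if err != nil {
        panic(err)
    }
    fmt.Printf("read %d bytes: %s", n, buf[:n])
}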

Blocking and non-blocking calls

Well, I just said above that system calls block, and generally that’s true. However, some calls are classified as “non-blocking,” meaning the kernel takes your request, puts it in a queue or buffer somewhere, and then immediately returns without waiting for the actual I/O to occur. So it “blocks” only for a very brief time, just long enough to enqueue your request.

Here are some examples to help clarify:

  • read() is a blocking call: you pass it a file handle and a buffer to hold the data it reads, and the call returns when the data is ready. This has the advantage of being elegant and simple.
  • epoll_create(), epoll_ctl(), and epoll_wait() are calls that, respectively, let you create a group of handles to listen on, add and remove handles from that group, and then block until there is any activity. This allows you to efficiently control a large number of I/O operations with a single thread (see the sketch after this list). It’s great functionality if you need it, but as you can see, it’s also quite complex to use.
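To give a feel for that second, more involved style, here’s a rough sketch of the epoll calls using Go’s golang.org/x/sys/unix bindings (Linux only, error handling omitted for brevity; assume the descriptors in fds come from elsewhere, e.g. accepted network connections):

package iodemo

import (
    "fmt"

    "golang.org/x/sys/unix"
)

// WatchForInput blocks until any of the given descriptors is readable.
func WatchForInput(fds []int) {
    // epoll_create1: create a group of handles to listen on
    epfd, _ := unix.EpollCreate1(0)
    defer unix.Close(epfd)

    // epoll_ctl: add each handle to the group
    for _, fd := range fds {
        ev := unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(fd)}
        unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, fd, &ev)
    }

    // epoll_wait: block until there is activity on any handle in the group;
    // a single thread can watch thousands of descriptors this way
    events := make([]unix.EpollEvent, 64)
    n, _ := unix.EpollWait(epfd, events, -1)
    for _, ev := range events[:n] {
        fmt.Printf("fd %d has data to read\n", ev.Fd)
    }
}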

It’s important to grasp the size of the differences in timing here. If a CPU core is running at 3 GHz, it’s performing 3 billion cycles per second (or 3 cycles per nanosecond). A non-blocking system call might take on the order of tens of cycles to complete, or “a relatively few nanoseconds.” A call that blocks on information being received over the network might take much longer, say 200 milliseconds (one fifth of a second). So if, for example, the non-blocking call took 20 nanoseconds and the blocking call took 200,000,000 nanoseconds, your process waited 10 million times longer for the blocking call.


[Diagram: blocking vs. non-blocking syscalls]

The kernel provides the means to do both blocking I/O (“read from this network connection and give me the data”) and non-blocking I/O (“tell me when any of these network connections have new data”). Depending on which mechanism is used, the amount of time the calling process blocks differs dramatically.

Scheduling

The third thing that’s critical to follow is what happens when a large number of threads or processes start blocking.

For our purposes, there’s not a huge difference between a thread and a process. In real life, the most noticeable performance-related difference is that threads share the same memory, while each process has its own memory space, which makes separate processes tend to take up a lot more memory. But when we talk about scheduling, it boils down to a list of things (threads and processes alike) that each need to get a slice of execution time on the available CPU cores. If you have 300 threads running on 8 cores, you divide the time up so that each one gets its share, with each core running for a short period and then switching to the next thread. This is done through a “context switch,” which makes the CPU move from running one thread/process to the next.

These context switches have a cost: they take time. In fast cases this can be less than 100 nanoseconds, but depending on the implementation details, processor speed and architecture, CPU cache, and so on, it’s not uncommon for a switch to take 1,000 nanoseconds or more.

And the more threads (or processes) there are, the more context switching there is. When we’re talking about thousands of threads, with each switch taking hundreds of nanoseconds, things can get really slow. As a rough example, 10,000 threads at 1,000 nanoseconds per switch means 10 milliseconds spent on context switching alone for a single pass through all the threads.

However, a non-blocking call essentially tells the kernel, “call me only when you have some new data or an event on any of these connections.” These non-blocking calls are designed to efficiently handle large I/O loads and reduce context switching.

Still with me so far? Because now comes the fun part: let’s look at what some popular languages do with these tools, and draw some conclusions about the trade-offs between ease of use and performance, along with other interesting tidbits.

Please note that while the examples shown in this article are trivial (and partial, showing only the relevant bits), database access, external caching systems (memcache and the like), and anything that requires I/O ends up performing some sort of I/O call under the hood, with the same effects as the simple examples shown. Also, in the cases where I/O is described as “blocking” (PHP, Java), the reads and writes of the HTTP request and response are themselves blocking calls: once again, more I/O hidden in the system, along with its attendant performance issues to consider.

There are a lot of factors that go into choosing a programming language for a project, and there are even more when you consider performance alone. But if your concern is that your application will be primarily I/O-bound, and if I/O performance is make-or-break for your project, these are things you need to know.

The “keep it simple” approach: PHP

Back in the ’90s, a lot of people were wearing Converse shoes and writing CGI scripts in Perl. Then PHP came along, and a lot of people liked using it because it made creating dynamic web pages much easier.

The model PHP uses is fairly simple. There are some variations, but basically a PHP server looks like this:

An HTTP request comes in from a user’s browser and hits your Apache web server. Apache creates a separate process for each request, with some optimizations to reuse them in order to minimize how many it has to create (process creation is, relatively speaking, slow). Apache calls PHP and tells it to run the appropriate .php file on disk. The PHP code executes and makes blocking I/O calls. If you call file_get_contents() in PHP, behind the scenes it triggers read() syscalls and waits for the results.

And of course the actual code is simply embedded right into your page, and the operations are blocking:

// blocking file I/O
$file_data = file_get_contents('/path/to/file.dat');

// blocking network I/O
$curl = curl_init('http://example.com/example-microservice');
$result = curl_exec($curl);

// some more blocking network I/O
$result = $db->query('SELECT id, data FROM examples ORDER BY id DESC LIMIT 100');

In terms of how this integrates with the rest of the system, it looks like this:


[Diagram: the PHP + Apache model, one process per request]

Pretty simple: one process per request. I/O calls just block. Advantage? It’s simple and it works. Disadvantage? Hit it with 20,000 clients concurrently and your server will burst into flames. This approach doesn’t scale well because the tools provided by the kernel for dealing with high-volume I/O (epoll, etc.) aren’t being used. And to add insult to injury, running a separate process for each request tends to use a lot of system resources, especially memory, which is often the first thing you run out of in a scenario like this.

Note: Ruby’s approach is very similar to PHP’s, and in a broad, general way they can be considered the same for our purposes.

Multithreaded approach: Java

Java came along right about the time you bought your first domain name, when it was cool to just randomly say “dot com” after a sentence. And Java has multithreading built into the language (especially for when it was created), which is pretty awesome.

Most Java web servers work by starting a new thread of execution for each request that comes in, and then in that thread eventually calling the function that you, as the application developer, wrote.

Performing I/O operations in Java servlets often looks like this:

public void doGet(HttpServletRequest request,
    HttpServletResponse response) throws ServletException, IOException
{
    // blocking file I/O
    InputStream fileIs = new FileInputStream("/path/to/file");

    // blocking network I/O
    URLConnection urlConnection = (new URL("http://example.com/example-microservice")).openConnection();
    InputStream netIs = urlConnection.getInputStream();

    // some more blocking network I/O
    out.println("...");
}

Since our doGet method above corresponds to one request and is run in its own thread, instead of a separate process for each request requiring its own dedicated memory, we have a separate thread. This has some nice perks, like being able to share state, cached data, and so on between threads, because they can access each other’s memory. But the impact on how it interacts with the scheduler is still almost identical to what’s done in the PHP example above. Each request gets a new thread, and the various I/O operations block inside that thread until the request is fully handled. Threads are pooled to minimize the cost of creating and destroying them, but still, thousands of connections means thousands of threads, which is bad for the scheduler.

An important milestone is that in version 1.4 Java (and again with a significant upgrade in 1.7) gained the ability to do non-blocking I/O calls. Most applications, web and otherwise, don’t use it, but at least it’s available. Some Java web servers try to take advantage of this in various ways; however, the vast majority of deployed Java applications still work as described above.


[Diagram: the Java model, one thread per request]

Java gets us a step closer, and it certainly has some good out-of-the-box functionality for I/O, but it still doesn’t really solve the problem of what to do when you have a heavily I/O-bound application that’s getting dragged to the ground by many thousands of blocking threads.

Non-blocking I/O: Node as first-class citizen

Node.js is the popular kid on the block when it comes to better I/O. Anyone who’s had even the briefest introduction to Node has been told that it’s “non-blocking” and that it handles I/O efficiently. And in a general sense, that’s true. But the devil is in the details, and the means by which this voodoo is achieved matter when it comes to performance.

Essentially, the paradigm shift that Node implements is that instead of essentially saying “write your code here to handle the request,” it says “write code here to start handling the request.” Each time you need to do something that involves I/O, you make the request and give a callback function that Node will call when it’s done.

Typical Node code that does an I/O operation in a request looks like this:


http.createServer(function(request, response) {  
    fs.readFile('/path/to/file', 'utf8', function(err, data) {
        response.end(data);
    });
});


As you can see, there are two callback functions. The first is called when the request starts, and the second is called when the file data is available.

Doing it this way essentially gives Node a chance to efficiently handle I/O in between those callbacks. A scenario where it would be even more relevant is one where you make a database call in Node, but I won’t bother with the example because it’s the exact same principle: you start the database call and give Node a callback function; it performs the I/O operation separately using non-blocking calls and then invokes your callback when the data you asked for is available. This mechanism of queuing up I/O calls, letting Node handle them, and then getting a callback is called the “event loop.” And it works quite well.


[Diagram: the Node.js event loop]

There is, however, a catch to this model. Under the hood, the reason for it has a lot more to do with how the V8 JavaScript engine (Chrome’s JS engine, which Node uses) is implemented than anything else. All the JS code you write runs in a single thread. Think about that for a moment. It means that while I/O is performed using efficient non-blocking techniques, your JS code that is doing CPU-bound operations runs in a single thread, with each chunk of code blocking the next. A common example of where this might come up is looping over database records to process them in some way before outputting them to the client. Here’s an example of how that works:

var handler = function(request, response) {
    connection.query('SELECT ...', function(err, rows) {
        if (err) { throw err };
        for (var i = 0; i < rows.length; i++) {
            // do processing on each row
        }
        response.end(...); // write out the results
    });
};

While Node does handle the I/O efficiently, that for loop in the example above is using CPU cycles inside your one and only main thread. This means that if you have 10,000 connections, that loop could bring your entire application to a crawl, depending on how long it takes. Each request must get a slice of time in the main thread, one at a time.

The premise this whole concept is based on is that the I/O operations are the slowest part, so it’s most important to handle those efficiently, even if it means doing other processing serially. This is true in some cases, but not in all.

The other point is that, while this is only an opinion, it can be quite tiresome to write a bunch of nested callbacks, and some argue that it makes code significantly harder to follow. It’s not uncommon to see callbacks nested four, five, or even more levels deep within Node code.

Again, we’re back to the trade-offs. The Node model works well if your main performance problem is I/O. However, its Achilles heel is that if you’re not careful, you might handle an HTTP request, place CPU-intensive code in a function, and bring every connection to a crawl.

True non-blocking: Go

Before getting into the Go section, it’s only fair that I disclose I’m a Go fan. I’ve used it for many projects, I’m openly a proponent of its productivity advantages, and I see them in my work when I use it.

That said, let’s look at how it deals with I/O. One key feature of the Go language is that it contains its own scheduler. Instead of each thread of execution corresponding to a single OS thread, Go works with the concept of “goroutines.” The Go runtime can assign a goroutine to an OS thread and have it execute, or suspend it so it’s not associated with an OS thread at all, depending on what that goroutine is doing. Each request that comes into Go’s HTTP server is handled in a separate goroutine.
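As a quick illustration of what that buys you (my own sketch, not code from any particular server), launching tens of thousands of goroutines is cheap, and a blocking operation parks only the goroutine, not the OS thread it was running on:

package main

import (
    "fmt"
    "sync"
    "time"
)

func main() {
    var wg sync.WaitGroup
    for i := 0; i < 10000; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            // "blocking" from this goroutine's point of view: the runtime
            // parks it and hands its OS thread to another runnable goroutine
            time.Sleep(10 * time.Millisecond)
        }()
    }
    wg.Wait()
    fmt.Println("10,000 goroutines came and went")
}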

A schematic of how this scheduler works is shown below:


[Diagram: the Go runtime scheduler mapping goroutines onto OS threads]

This is implemented at various points in the Go runtime: an I/O call requests a write/read/connect/etc., puts the current goroutine to sleep, and wakes the goroutine back up when further action can be taken.

In effect, the Go runtime is doing something not terribly different from what Node does, except that the callback mechanism is built into the implementation of the I/O call and interacts with the scheduler automatically. It also isn’t constrained by the need to have all your handler code run in the same thread; Go will automatically map your goroutines onto as many OS threads as it deems appropriate, based on the logic in its scheduler. The resulting code looks something like this:

func ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // the underlying network call here is non-blocking
    rows, err := db.Query("SELECT ...")
    for _, row := range rows {
        // do something with the rows,
        // each request in its own goroutine
    }
    w.Write(...) // write the response, also non-blocking
}

As you can see above, our basic code structure looks like the simpler approaches, and yet it achieves non-blocking I/O under the hood.

In most cases, this ends up being “the best of both worlds.” Non-blocking I/O is used for all the important things, but your code looks like it blocks, and so it tends to be simpler to understand and maintain. The interaction between the Go scheduler and the OS scheduler handles the rest. It’s not complete magic, and if you’re building a large system, it’s worth putting in the time to understand in more detail how it works; but at the same time, the environment you get “out of the box” works and scales quite well.

Go may have its downsides, but the way it handles I/O in general is not one of them.

Lies, damned lies, and benchmarks

It’s difficult to get exact timings on the context switching involved with these various models, and it’s also arguably less relevant to you. So instead, I’ll give some basic benchmarks that compare overall HTTP server performance in these server environments. Bear in mind that many factors are involved in the performance of the entire end-to-end HTTP request/response path, and the numbers presented here are just samples I put together to give a basic comparison.

For each of these environments, I wrote the appropriate code to read in a 64K file of random bytes, run a SHA-256 hash on it N times (N being specified in the URL’s query string, e.g. .../test.php?n=100), and print the resulting hash in hex. I chose this because it’s a very simple way to run the same benchmark with some consistent I/O and a controlled way to increase the CPU load.
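For reference, here’s roughly what the Go version of that handler looks like (an approximation for clarity, not necessarily the exact code behind the numbers below; the file path is a placeholder). Note that net/http automatically handles each request in its own goroutine:

package main

import (
    "crypto/sha256"
    "fmt"
    "net/http"
    "os"
    "strconv"
)

func main() {
    http.HandleFunc("/test", func(w http.ResponseWriter, r *http.Request) {
        n, _ := strconv.Atoi(r.URL.Query().Get("n"))

        // the consistent I/O part: read the same 64K file of random bytes
        data, err := os.ReadFile("/path/to/random-64k.dat")
        if err != nil {
            http.Error(w, err.Error(), http.StatusInternalServerError)
            return
        }

        // the controlled CPU part: hash the data n times
        var sum [32]byte
        for i := 0; i < n; i++ {
            sum = sha256.Sum256(data)
        }
        fmt.Fprintf(w, "%x\n", sum)
    })
    http.ListenAndServe(":8080", nil)
}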

See these benchmark notes for a bit more detail on the environments used.

First, let’s look at some low-concurrency examples. Running 2000 iterations with 300 concurrent requests and doing only one hash per request (N = 1) yields:


[Chart: average request time at 300 concurrent requests, N = 1]

Times are the average milliseconds to complete a request across all concurrent requests; lower is better. It’s hard to draw a conclusion from just this one chart, but to me it seems that, at this volume of connections and computation, we’re seeing times that have more to do with the general execution of the languages themselves than with the I/O. Note that the languages considered “scripting languages” (loose typing, dynamic interpretation) perform the slowest.

But what happens if we increase N to 1000, still with 300 concurrent requests? That’s the same load, but 1000x more hash iterations (significantly more CPU load):


[Chart: average request time at 300 concurrent requests, N = 1000]

Times are the average milliseconds to complete a request across all concurrent requests; lower is better.

All of a sudden, Node’s performance drops significantly, because the CPU-intensive operations in each request are blocking one another. Interestingly, PHP’s performance gets much better (relative to the others) in this test, and it beats Java. (It’s worth noting that in PHP, the SHA-256 implementation is written in C, and the execution path spends a lot more time in that loop, since we’re doing 1000 hash iterations now.)

Now let’s try 5,000 concurrent connections (with N = 1), or as close to that as I could get. Unfortunately, for most of these environments, the failure rate was not insignificant. For this chart, we’ll look at the total number of requests per second. The higher the better:


[Chart: total requests per second at 5,000 concurrent connections, N = 1]

Total requests per second. The higher the better.

And the picture looks quite different. It’s a guess, but it looks like at high connection volume, the per-connection overhead of spawning new processes and the additional memory tied to PHP + Apache become dominant factors and tank PHP’s performance. Clearly, Go is the winner here, followed by Java and Node, and finally PHP.

Conclusion

Given all this, it’s clear that as languages evolve, so do solutions for large applications that handle a lot of I/O.

To be fair, both PHP and Java do have non-blocking I/O implementations available for web applications, despite the descriptions in this article. But these approaches are not as common as the ones described above, and the attendant operational overhead of maintaining servers with such approaches would need to be considered. Not to mention that your code would have to be structured in a way that works with those environments; your “normal” PHP or Java web application usually won’t run without significant modifications in such an environment.

As a comparison, if we consider just a few important factors that affect performance as well as ease of use, we get this:


[Table: comparison of the models by performance and ease of use]

Threads are generally going to be more memory-efficient than processes, since they share the same memory space, while processes don’t. Combining that with the factors related to non-blocking I/O, we can see that, at least with the factors considered above, as we move down the list the general setup as it relates to I/O improves. So if I had to pick a winner in the contest above, it would certainly be Go.

Even so, in practice, the environment you choose to build your application in is closely tied to the familiarity your team has with that environment and the overall productivity you can achieve with it. So it may not make sense for every team to just dive in and start developing web applications and services in Node or Go. Indeed, finding developers or the familiarity of your in-house team is often cited as the main reason not to use a different language and/or environment. That said, times have changed over the past fifteen years or so, a lot.
