Server-side I/O Performance: Node vs. PHP vs. Java vs. Go
Abstract: This article first briefly introduces the basic concepts related to I/O, then compares the I/O performance of Node, PHP, Java, and Go, and offers some guidance on choosing among them.
Knowing your application’s input/output (I/O) model will help you better understand the difference between what is ideal and what is real when it comes to handling load. Perhaps your application is small and does not need to support a high load, so there is less to consider. However, as application traffic loads increase, using the wrong I/O model can have very serious consequences.
In this article, we’ll compare Node, Java, Go, and PHP with Apache packages, discuss how different languages model I/O, the pros and cons of each model, and some basic performance metrics. If you’re concerned about the I/O performance of your next Web application, this article will help.
I/O Basics: A quick review
To understand the factors related to I/O, we must first review these concepts at the operating-system level. While you are unlikely to deal with many of them directly at first, you will encounter them, directly or indirectly, as your application runs. Details matter.
System calls
First, let’s take a look at system calls:
- An application asks the operating system kernel to perform an I/O operation on its behalf.
- A “system call” is how a program asks the kernel to do something. Implementation details vary from operating system to operating system, but the basic concept is the same: when a system call is executed, control transfers from the program to the kernel. In general, system calls are blocking, which means the program waits until the kernel returns the result.
- The kernel performs the low-level I/O operation on the physical device (disk, network card, and so on) and replies to the system call. In the real world, the kernel might have to do a number of things to fulfill your request, including waiting for the device to be ready and updating its internal state, but as an application developer you don’t need to care about that; it’s the kernel’s business.
Blocking and non-blocking calls
As noted above, system calls are generally blocking. However, some calls are “non-blocking,” meaning the kernel queues or buffers the request and returns immediately, without waiting for the actual I/O to occur. So the call only “blocks” for a very short time: just long enough to enqueue the request.
To illustrate, here are a few examples of Linux system calls:
- read() is a blocking call. We pass it a file handle and a buffer to hold the data, and the call returns once the data has been copied into the buffer. It has the advantage of being elegant and simple.
- epoll_create(), epoll_ctl(), and epoll_wait() let you create a set of handles to listen on, add and remove handles from that set, and block until any of them has activity. These system calls allow you to efficiently control a large number of I/O operations from a single thread. These features are useful, but quite complex to use.
It’s important to understand the orders of magnitude involved here. A CPU core running at 3GHz performs 3 billion cycles per second (3 cycles per nanosecond). A non-blocking system call may take on the order of tens of cycles, or a few nanoseconds. A blocking call that waits for data from the network may take much longer, say 200 milliseconds (1/5 of a second). If the non-blocking call takes 20 nanoseconds and the blocking call takes 200,000,000 nanoseconds, the process ends up waiting ten million times longer for the blocking call.
The kernel provides both blocking I/O (“read data from this network connection”) and non-blocking I/O (“tell me when there’s new data on any of these network connections”), and the two mechanisms block the calling process for completely different lengths of time.
Scheduling
The third very critical thing is what happens when many threads or processes start to block.
For our purposes, there is not much difference between threads and processes. In reality, the most significant performance-related difference is that threads share the same memory while each process has its own memory space, so individual processes tend to consume more memory. But when we talk about scheduling, what it really comes down to is a list of things to do, each of which needs a slice of execution time on the available CPU cores. If you have 300 threads running on eight cores, you have to divide the time up so that each thread gets its slice: each core runs for a short time and then switches to the next thread. This is done via a “context switch,” which lets the CPU move from one thread/process to the next.
This context switch has a cost: it takes time. Fast switches can take under 100 nanoseconds, but depending on hardware and software implementation details, processor speed/architecture, CPU caches, and so on, it’s not unusual for one to take 1,000 nanoseconds or more.
The more threads (or processes) there are, the more context switches there are. When you have thousands of threads, each taking a few hundred nanoseconds to switch, the system becomes very slow.
A non-blocking call, in contrast, essentially tells the kernel, “call me only when new data or an event arrives on one of these connections.” These non-blocking calls are designed to handle heavy I/O loads efficiently and to reduce context switching.
It’s worth noting that, although the examples in this article are small, database access, external caching systems (memcached and the like), and anything else that requires I/O all end up performing some type of I/O call underneath, just like the examples.
There are many factors that affect the choice of programming language in a project, even if you only consider performance. However, if you are concerned that your application is primarily limited by I/O and performance is a major factor in determining the success or failure of your project, then the following suggestions are important to consider.
“Keep it simple”: PHP
Back in the 1990s, a lot of people wore Converse shoes and wrote CGI scripts in Perl. Then along came PHP, which a lot of people loved because it made building dynamic web pages much easier.
The model PHP uses is very simple. There are some variations, but a typical PHP server works like this:
- The user’s browser makes an HTTP request, which arrives at the Apache web server. Apache creates a separate process for each request, with some optimizations to reuse processes and minimize the work it would otherwise have to do (creating a process is relatively slow).
- Apache calls PHP and tells it to run some .php file on disk.
- The PHP code executes and makes blocking I/O calls. A file_get_contents() call in PHP actually issues a read() system call underneath and waits for the result.
```php
<?php

// blocking file I/O
$file_data = file_get_contents('/path/to/file.dat');

// blocking network I/O
$curl = curl_init('http://example.com/example-microservice');
$result = curl_exec($curl);

// some more blocking network I/O
$result = $db->query('SELECT id, data FROM examples ORDER BY id DESC limit 100');

?>
```
The integration diagram with the system looks like this:
It’s simple: one process per request, and the I/O calls block. The advantage? Simplicity, and it works. The drawback? Hit it with 20,000 concurrent clients and the server will fall over. This approach is hard to scale because the tools the kernel provides for handling high-volume I/O (epoll, etc.) go unused. To make matters worse, running a separate process per request tends to consume a lot of system resources, especially memory, which is usually the first thing to run out.
* Note: Ruby is very similar to PHP in this regard.
Multithreading: Java
So, Java was born. Java has multithreading built into the language, which makes creating new threads especially easy.
Most Java Web servers start a new thread of execution for each request, and then call functions written by the developer in this thread.
Performing I/O in Java servlets tends to look like this:
```java
public void doGet(HttpServletRequest request,
    HttpServletResponse response) throws ServletException, IOException
{
    // blocking file I/O
    InputStream fileIs = new FileInputStream("/path/to/file");

    // blocking network I/O
    URLConnection urlConnection = (new URL("http://example.com/example-microservice")).openConnection();
    InputStream netIs = urlConnection.getInputStream();

    // some more blocking network I/O
    out.println("...");
}
```
The doGet method above corresponds to one request and runs in its own thread, rather than in a separate process with its own memory. Each request gets a new thread, and the various I/O operations block inside that thread until the request is fully handled. The application creates a thread pool to minimize the cost of creating and destroying threads, but thousands of concurrent connections mean thousands of threads, which is bad news for the scheduler.
Notably, version 1.4 of Java (updated again in version 1.7) adds the ability to make non-blocking I/O calls. Although most applications do not use this feature, it is at least available. Some Java Web servers are trying to use this feature, but the vast majority of deployed Java applications still work according to the principles described above.
Java provides a lot of I/O functionality out of the box, but it has no great answer to the problem of creating a lot of blocked threads to perform a lot of I/O.
Make non-blocking I/O a priority: Node
Node.js handles I/O better and has become quite popular as a result. Anyone with even a passing familiarity with Node knows it is “non-blocking” and handles I/O efficiently. That is true in a general sense, but the details of how this is implemented are crucial.
When you need to do something that involves I/O, you need to make a request and give a callback function that Node will call when the request is processed.
Typical code to perform an I/O operation in a request handler looks like this:
```javascript
http.createServer(function(request, response) {
    fs.readFile('/path/to/file', 'utf8', function(err, data) {
        response.end(data);
    });
});
```
As shown above, there are two callback functions. The first function is called when the request starts, and the second function is called when the file data is available.
This way, Node handles I/O efficiently through these callbacks. A more illustrative example is a database call in Node: your program initiates the database operation and gives Node a callback function; Node performs the I/O with a non-blocking call and then invokes your callback when the requested data is available. This mechanism of queuing up I/O operations, letting Node handle them, and then receiving a callback is called the “event loop.” It works very well.
There is a catch in this model, however. At bottom, the reason has to do with the V8 JavaScript engine (Node uses Chrome’s JavaScript engine): all the JS code you write runs in a single thread. Think about that. It means that while I/O is performed with efficient non-blocking techniques, your CPU-bound JS code runs in a single thread, and each chunk of work blocks the next from running. A common example is looping over database records, processing them in some way, and then sending them to the client. The following code shows how this works:
```javascript
var handler = function(request, response) {
    connection.query('SELECT ...', function(err, rows) {
        if (err) { throw err };

        for (var i = 0; i < rows.length; i++) {
            // do processing on each row
        }

        response.end(...); // write out the results
    })
};
```
Although Node handles the I/O efficiently, the for loop in the example above burns CPU cycles on the one and only main thread. This means that with 10,000 connections, that loop could grind the entire application to a halt: every request has to share small slices of time on that single thread.
This whole concept is based on the premise that I/O operations are the slowest part, so it is important to handle them efficiently even if serial processing is unavoidable. This is true in some cases, but not always.
Another point is that writing a bunch of nested callbacks is cumbersome, and some people think it’s ugly. It’s not uncommon to have four, five, or more layers of callback embedded in Node code.
It’s time to weigh the pros and cons again. If your primary performance problem is I/O, then this Node model can help. The downside, however, is that if you put CPU-intensive code in a function that handles HTTP requests, you can accidentally jam every connection.
Native non-blocking: Go
Before I introduce Go, let me disclose that I’m a Go fan: I’ve used it on a number of projects.
Let’s look at how Go handles I/O. A key feature of the Go language is that it contains its own scheduler. Instead of each thread of execution mapping to a single operating system thread, Go uses “goroutines.” The Go runtime assigns goroutines to operating system threads and controls when each one runs or is suspended. Each request arriving at Go’s HTTP server is handled in its own goroutine.
The scheduler works as follows:
In fact, what the Go runtime does is similar to Node, except that the callback mechanism is built into the implementation of the I/O call and interacts with the scheduler automatically. It is also not constrained to run all your handler code on a single thread: Go maps your goroutines onto whichever operating system threads its scheduler sees fit. So the code looks like this:
```go
func ServeHTTP(w http.ResponseWriter, r *http.Request) {
    // the underlying network call here is non-blocking
    rows, err := db.Query("SELECT ...")
    for _, row := range rows {
        // do something with the rows,
        // each request in its own goroutine
    }
    w.Write(...) // write the response, also non-blocking
}
```
As shown above, this basic code structure is simpler and also implements non-blocking I/O.
In most cases, this truly is “the best of both worlds.” Non-blocking I/O underlies all the important operations, but the code looks blocking, so it’s usually easier to understand and maintain. All that remains is the interaction between the Go scheduler and the OS scheduler, and that isn’t magic: if you’re building a large system, it’s worth taking the time to understand how it works. At the same time, it works well and scales well “out of the box.”
Go may have its fair share of downsides, but overall there are no obvious downsides to the way it handles I/O.
The performance evaluation
Accurately timing context switches across these different models is difficult, and it also wouldn’t tell you much. Instead, I’ll do a basic performance comparison of HTTP services in each of these server environments. Keep in mind that many factors play into end-to-end HTTP request/response performance.
For each environment, I wrote a handler that reads 64K of random bytes from a file, runs a SHA-256 hash over it N times (N is specified in the URL’s query string, for example /test.php?n=100), and prints the result in hexadecimal. I chose this because it’s an easy way to run some consistent I/O while increasing CPU utilization in a controlled way.
First, let’s look at some examples of low concurrency. Running 2000 iterations with 300 concurrent requests, hashing once per request (N=1) results in the following:
It’s hard to draw conclusions from this graph alone, but I personally think that in this case, with many connections and little computation, what we’re seeing has more to do with the execution speed of the languages themselves. Note that the “scripting languages” are the slowest.
But what happens if we increase N to 1000 while keeping 300 concurrent requests, multiplying the number of hash iterations by 1000 on the same load (and significantly more CPU work):
All of a sudden, Node’s performance drops dramatically, because the CPU-intensive operations in each request block one another. Interestingly, in this test PHP performed better (relative to the others) and even beat Java. (It’s worth noting that PHP’s SHA-256 implementation is written in C, but with 1000 hash iterations per request, most of the execution time is now spent in that loop.)
Now, let’s try 5000 concurrent connections (with N=1). Unfortunately, for most of these environments, the failure rate was not insignificant. For this chart, we’ll look at the number of requests handled per second; higher is better:
This graph looks quite different. My guess is that at high connection counts, the per-request overhead of spawning new processes and the extra memory in PHP + Apache becomes the dominant factor in PHP’s performance. Go is the clear winner, followed by Java and Node, with PHP last.
Although there are many factors involved in overall throughput, and there are significant differences between applications, the more you understand the underlying principles and trade-offs involved, the better your applications will perform.
Conclusion
To sum up, as languages evolve, so do solutions for handling large applications with large volumes of I/O.
To be fair, PHP and Java both have non-blocking I/O implementations available for Web applications. However, these implementations are not as widely used as the approaches described above, and there are maintenance costs to consider. Not to mention that the application code must be built in a way that is appropriate for this environment.
Let’s compare a few important factors that affect performance and ease of use:
| Language | Threads or processes | Non-blocking I/O | Callbacks required |
| --- | --- | --- | --- |
| PHP | processes | no | – |
| Java | threads | available | yes |
| Node.js | threads | yes | yes |
| Go | threads (goroutines) | yes | no |
Because threads share the same memory space while processes do not, threads are generally much more memory-efficient than processes. In the table above, each I/O-related factor improves as you read down the rows. So, if I had to pick a winner from this comparison, it would definitely be Go.
Even so, in practice, the environment you choose to build your application in is closely related to your team’s familiarity with the environment and the overall productivity your team can achieve. So, using Node or Go to develop Web applications and services may not be the best choice for your team.
Hopefully, this has helped you get a clearer picture of what’s going on underneath and given you some advice on how to handle application scalability.