Classic C10K problems on the server

I was recently looking at Unix network programming and came across a very classic article. It is probably nothing new, but it really is a classic: the C10K problem. Below is a rough translation (the markdown conversion may have a few formatting problems); working through it gives a much deeper understanding of Linux I/O and Linux threads (and it contains a lot of links). The original is The C10K Problem.

Web servers now have to handle tens of thousands of clients simultaneously, don't they? After all, the web is a big place now, and computers are just as powerful. You can buy a 1000MHz machine with 2GB of RAM and a 1000Mbit/sec network card for about $1200. Let's see: at 20,000 clients, that works out to 50KHz, 100KB, and 50Kbits/sec per client. It shouldn't take more horsepower than that to pull four kilobytes from disk and send them to the network once a second for each of 20,000 clients. (That's about $0.08 per client, by the way; the per-client licensing fees some operating systems charge start to look a little expensive.) So hardware is no longer the bottleneck.

Back in 1999, one of the busiest FTP sites, cdrom.com, actually handled 10,000 clients simultaneously through a gigabit Ethernet pipe. The same speed is now being offered by ISPs, who expect it to become increasingly popular with large business customers.

And the thin-client model of computing seems to be coming back into style, this time with servers out on the Internet serving thousands of clients.

With that in mind, here are some notes on how to configure operating systems and write code to support thousands of clients. The discussion centers on Unix-like operating systems, since that's my personal area of interest, but Windows is covered briefly as well.

Contents

  • C10K problem
    • Contents
    • Related websites
    • Pre-reading books
    • I/O framework
    • The I/O strategy
      • 1. One thread serves multiple clients, using non-blocking I/O and level-triggered readiness notifications
      • 2. One thread serves multiple clients, using non-blocking I/O and readiness change notifications
      • 3. A thread serves multiple clients and uses asynchronous I/O
      • 4. One thread serves one client
      • Linux threads
        • NGPT: The next generation Posix thread for Linux
        • NPTL: Linux native Posix threading library
        • FreeBSD thread support
        • NetBSD thread support
        • Solaris Thread Support
        • Java thread support in JDK 1.3.x and earlier
        • Note: 1:1 threads versus M:N threads
      • 5. Build server-side code into the kernel
    • Bring the TCP stack into user space
    • Comments
    • Limits on open file handles
    • Thread limit
    • Java problems
    • Other Suggestions
    • Other restrictions
    • The kernel problem
    • Measuring server performance
    • Examples
    • Interesting select() based server
    • Interesting /dev/poll-based server
    • Interesting Epoll-based server
    • Interesting kqueue() based server
    • Interesting real-time signal-based server
    • Interesting thread-based server
    • Interesting kernel server
    • Other interesting links

Related websites

See Nick Black’s excellent Fast Unix Server page for an overview of the situation circa 2009.

In October 2003, Felix von Leitner put together an excellent web page and presentation on network scalability, comparing the performance of various networking system calls and operating systems. One of his findings is that the Linux 2.6 kernel really does beat the 2.4 kernel, and there are many good graphs that should give OS developers food for thought.

Pre-reading books

If you haven't read the late W. Richard Stevens' Unix Network Programming: Networking APIs: Sockets and XTI (Volume 1), please get a copy as soon as possible. It describes many of the pitfalls of I/O strategies and of writing high-performance servers. It even talks about the 'thundering herd' problem. While you're at it, read Jeff Darcy's notes on high-performance server design.

(Another book, Building Scalable Web Sites, might be more helpful for people who are using, rather than writing, a web server.)

I/O framework

Here are a few pre-packaged libraries that abstract some of the techniques described below to isolate code from the operating system and make it more portable.

  • ACE, a heavyweight C++ I/O framework, contains some I/O strategies implemented in an object-oriented fashion and many other useful things. In particular, its Reactor is an object-oriented way of doing non-blocking I/O, and its Proactor is an object-oriented way of doing asynchronous I/O.
  • ASIO is a C++ I/O framework that is becoming part of Boost. It’s like updating the ACE for the STL era.
  • Libevent is a lightweight C I/O framework written by Niels Provos. It supports kqueue and select, and will soon support poll and epoll. I think it is level-triggered only, which has both advantages and disadvantages. Niels has a nice graph of the time to handle one event as a function of the number of connections, and it shows kqueue and sys_epoll as the clear winners.
  • My own attempts at lightweight frameworks (unfortunately not kept up to date)
    • Poller is a lightweight C++ I/O framework that implements a level-triggered readiness API using whichever underlying readiness API you want (poll, select, /dev/poll, kqueue, or sigio). It is useful for benchmarks comparing the various APIs. The documentation links to the Poller subclasses, and the linked documentation explains how to use these readiness APIs.
    • rn is a lightweight C I/O framework, my second attempt after Poller. It is licensed under the LGPL (so it's easier to use in commercial applications) and written in C (so it's easier to use in non-C++ products). It is now used in some commercial products.
  • Matt Welsh wrote a paper in April 2000 describing his Sandstorm I/O framework on how to balance worker threads and event-driven usage when building scalable services.
  • Cory Nelson Scale! Library – a Windows asynchronous socket, file, and pipe I/O library.

The I/O strategy

Designers of web software have many options. Here are some:

  • Whether and how to make multiple I/O calls from a single thread
    • Don't; use blocking/synchronous calls throughout, and possibly use multiple threads or processes to achieve concurrency.
    • Use non-blocking calls (e.g., write() on a socket set to O_NONBLOCK) to start I/O, and readiness notifications (e.g., poll() or /dev/poll) to know when it's OK to start the next I/O on that channel. Generally usable only for network I/O, not disk I/O.
    • Use asynchronous calls (e.g., aio_write()) to start I/O, and completion notifications (e.g., signals or completion ports) to know when the I/O finishes. This works for both network and disk I/O.
  • How to control the code servicing each client
    • One process serves one client (the classic Unix approach, used since 1980 or so)
    • One OS-level thread serves many clients; each client is controlled by:
      • A user-level thread (e.g. GNU state threads, classic Java with green threads)
      • A state machine (a bit esoteric, but popular in some circles; my favorite)
      • A continuation (a bit esoteric, but popular in some circles; my favorite)
    • One OS-level thread serves one client (e.g., classic Java with native threads)
    • One OS-level thread serves each active client (e.g. Tomcat with an Apache front end; NT completion ports; thread pools)
  • Whether to use standard operating system services, or to build some of the service into the kernel (e.g., in a custom driver, kernel module, or VxD)

Here are five combinations that seem to be very popular:

  1. One thread serves many clients, using non-blocking I/O and level-triggered readiness notifications.
  2. One thread serves many clients, using non-blocking I/O and readiness change notifications.
  3. One thread serves many clients, using asynchronous I/O.
  4. One thread serves one client, using blocking I/O.
  5. Build the server code into the kernel.

1. One thread serves multiple clients, using non-blocking I/O and level-triggered readiness notifications

Set all network handles to non-blocking mode, and use select() or poll() to tell which network handle has data waiting. This is the traditional model. With it, the kernel tells you whether a file descriptor is ready, whether or not you've done anything with it since the last time the kernel told you about it. (The term 'level triggered' comes from computer hardware design; it is the opposite of 'edge triggered'. Jonathon Lemon introduced these terms in his BSDCON 2000 paper on kqueue().)

Note: it is important to remember that a readiness notification from the kernel is only a hint; the file descriptor might no longer be ready when you try to read from it. That is why you need to use non-blocking mode when relying on readiness notifications.
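To make the mechanics concrete, here is a minimal sketch of this strategy: one thread, non-blocking sockets, level-triggered readiness via poll(). It is only a sketch, assuming a listening socket created elsewhere; client handling is reduced to a single read(), and error handling is trimmed.

#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_FDS 1024

static void set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

void event_loop(int listen_fd)
{
    struct pollfd fds[MAX_FDS];
    int nfds = 1;

    set_nonblocking(listen_fd);
    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        if (poll(fds, nfds, -1) < 0)
            continue;                        /* e.g. EINTR: just retry */
        for (int i = 0; i < nfds; i++) {
            if (!(fds[i].revents & (POLLIN | POLLERR | POLLHUP)))
                continue;
            if (fds[i].fd == listen_fd) {
                int c = accept(listen_fd, NULL, NULL);
                if (c >= 0 && nfds < MAX_FDS) {
                    set_nonblocking(c);      /* readiness is only a hint */
                    fds[nfds].fd = c;
                    fds[nfds].events = POLLIN;
                    nfds++;
                } else if (c >= 0) {
                    close(c);                /* table full */
                }
            } else {
                char buf[4096];
                ssize_t n = read(fds[i].fd, buf, sizeof buf);
                if (n > 0) {
                    /* process n bytes of data here */
                } else if (n == 0 || (errno != EAGAIN && errno != EWOULDBLOCK)) {
                    close(fds[i].fd);        /* EOF or real error */
                    fds[i--] = fds[--nfds];  /* compact the array */
                }
            }
        }
    }
}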

An important bottleneck in this method is that read() or sendfile() from disk blocks if the page is not currently in memory; setting non-blocking mode on a disk file handle has no effect. The same is true of memory-mapped disk files. The first time a service needs disk I/O, its process blocks, all clients must wait, and that raw non-threaded performance goes to waste.

This is what asynchronous I/O is for, but on systems that lack AIO, using multiple threads or processes for the disk I/O can also get around this bottleneck. One approach is to use memory-mapped files: if mincore() indicates that I/O is needed, ask a worker thread to do the I/O and keep handling network traffic. Jef Poskanzer mentions that Pai, Druschel, and Zwaenepoel's 1999 Flash web server uses this trick; they gave a talk about it at Usenix '99. It seems mincore() is available on BSD-derived Unixes such as FreeBSD and Solaris, but it is not part of the Single Unix Specification. It has been part of Linux since kernel 2.3.51, thanks to Chuck Lever.
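As an illustration of the mincore() trick just mentioned, here is a hedged sketch: before serving a region of a memory-mapped file, ask the kernel whether its pages are resident. If they are not, a worker thread (not shown) would be asked to fault the pages in while the main thread keeps serving other clients. The function name is illustrative only.

#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int region_is_resident(const void *addr, size_t len)
{
    long page = sysconf(_SC_PAGESIZE);
    size_t npages = (len + page - 1) / page;
    unsigned char *vec = malloc(npages);
    if (!vec)
        return 0;
    /* addr must be page-aligned, e.g. as returned by mmap() */
    if (mincore((void *)addr, len, vec) < 0) {
        free(vec);
        return 0;
    }
    for (size_t i = 0; i < npages; i++) {
        if (!(vec[i] & 1)) {      /* low bit clear: page not in core */
            free(vec);
            return 0;             /* hand this request to a worker thread */
        }
    }
    free(vec);
    return 1;                     /* safe to read without blocking on disk */
}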

In November 2003 on the freebsd-hackers list, Vivek Pai et al. reported very good results from profiling their Flash web server and attacking its bottlenecks. One bottleneck they found was mincore() (not such a good idea after all, I guess); another was that sendfile() blocks on disk access. Their modified sendfile() returns something like EWOULDBLOCK when the disk page it is fetching is not yet in core, which improved performance. (What's really needed is aio_sendfile().) The end result of their optimizations was a SpecWeb99 score of about 800 on a 1GHz/1GB FreeBSD box, which is better than anything on file at spec.org.

In the collection of non-blocking sockets, there are several methods for how a single thread can tell which socket is ready:

  • select(). Unfortunately, select() is limited to FD_SETSIZE handles. This limit is compiled into the standard library and user programs. (Some versions of the C library let you raise this limit when your application is compiled.) See Poller_select (cc, h) for an example of how to use select() interchangeably with other readiness notification schemes.
  • poll(). There is no hard-coded limit on the number of file descriptors poll() can handle, but it does get slow with thousands of connections, since most of the file descriptors are idle at any given time, and scanning thousands of file descriptors takes time. Some operating systems (e.g. Solaris 8) speed up poll() and friends with techniques like poll hinting, which Niels Provos implemented and benchmarked for Linux in 1999. See Poller_poll (cc, h, benchmarks) for an example of how to use poll() interchangeably with other readiness notification schemes.
  • /dev/poll. This is the recommended poll replacement on Solaris. The idea behind /dev/poll is to take advantage of the fact that poll() is often called many times with the same arguments. With /dev/poll, you get a file descriptor for /dev/poll, then write the file descriptors you care about to that descriptor; from then on, you just read the set of currently ready file descriptors from it. It appeared quietly in Solaris 7 (see patchid 106541), but its first public appearance was in Solaris 8; according to Sun, with 750 clients it has 10% of the overhead of poll(). Various implementations of /dev/poll were tried on Linux, but none of them perform as well as epoll, and they were never really completed; using /dev/poll on Linux is not recommended. See Poller_devpoll (cc, h, benchmarks) for an example of how to use /dev/poll interchangeably with other readiness notification schemes. (Caution: the example works with Linux /dev/poll and may not work properly on Solaris.)
  • kqueue(). This is the recommended poll replacement on FreeBSD (and, soon, NetBSD) systems. kqueue() can specify either edge triggering or level triggering.

2. One thread serves multiple clients, using non-blocking I/O and readiness change notifications

Readiness change notification (or edge-triggered readiness notification) means you give the kernel a file descriptor, and later the kernel somehow notifies you whenever that descriptor transitions from not ready to ready. It then assumes you know the file descriptor is ready, and will not send any more readiness notifications of that kind for that descriptor until you do something that causes it to stop being ready (for example, until you receive an EWOULDBLOCK error on a send, recv, or accept call, or a send or recv transfers less than the requested number of bytes).

When you use readiness change notification, you must be prepared to handle spurious events, because one common implementation is to signal readiness whenever any packet is received, regardless of whether the file descriptor was already ready.

This is the opposite of 'level-triggered' readiness notification. It is a bit less forgiving of programming mistakes, since if you miss just one event, the connection that event was for gets stuck forever. Nevertheless, I have found that edge-triggered readiness notification made programming non-blocking clients with OpenSSL easier, so it is worth trying.

[Banga, Mogul, Druschel '99] described this kind of scheme in 1999.

There are several APIs that let an application retrieve 'file descriptor became ready' notifications:

  • kqueue(). This is the recommended edge-triggered method on FreeBSD (and, soon, NetBSD) systems. kqueue()/kevent() is the generic poll() alternative supported in FreeBSD 4.3 and later, and in NetBSD-current as of October 2002; it supports both edge triggering and level triggering. (See also Jonathan Lemon's web page and his BSDCon 2000 paper on kqueue().) As with /dev/poll, you allocate a listening object, but rather than opening the file /dev/poll, you call kqueue() to get one. To change the events you are listening for, or to get the list of current events, you call kevent() on the descriptor returned by kqueue(). It can listen not only for socket readiness, but also for plain-file readiness, signals, and even I/O completion. Note: as of October 2000, the threading library on FreeBSD did not interact well with kqueue(); when kqueue() blocks, the entire process blocks, not just the thread that called kqueue(). See Poller_kqueue (cc, h, benchmarks) for an example of how to use kqueue() interchangeably with other readiness notification schemes. Examples and libraries using kqueue():
    • PyKQueue – a Python binding for kqueue()
    • Ronald F. Guilmette's example echo server; see also his post of 28 September 2000 on freebsd.questions.
  • epoll. This is the recommended edge-triggered polling mechanism for the Linux 2.6 kernel. (A minimal epoll sketch appears at the end of this section.) On 11 July 2001, Davide Libenzi proposed an alternative to realtime signals; his patch, which he called /dev/epoll, is at www.xmailserver.org/linux-patches/nio-improve.html. This works just like realtime signal readiness notification, but it coalesces redundant events and has a more efficient method for retrieving events in bulk. epoll was merged into the 2.5.46 kernel development tree once its interface was changed from a special file in /dev to a system call, sys_epoll. A patch for the older 2.4 kernel is also available. Around Halloween 2002, there was a long debate on the Linux kernel mailing list about unifying epoll, AIO, and other event sources. It may yet happen, but Davide is concentrating on firming up epoll first.
  • Polyakov's kevent (Linux 2.6+). News flash: on 9 February 2006 and 9 July 2006, Evgeniy Polyakov released patches that appear to unify epoll and AIO; his goal is to support network AIO. See:
    • LWN article on KEvent

    • His Announcement in July

    • His kevent page
    • His NAIO page

    • Some recent discussions

  • At OLS 2006, Ulrich Drepper presented a new high-speed asynchronous networking API. See:
    • His paper, "The Need for Asynchronous, Zero-Copy Network I/O"
    • His slides

    • LWN’s July 22 article

  • Readiness notification via edge-triggered realtime signals. The Linux 2.4 kernel can deliver socket readiness events via a particular realtime signal. Here is how to turn this behavior on:
/* Mask off SIGIO and the signal you want to use. */
sigemptyset(&sigset);
sigaddset(&sigset, signum);
sigaddset(&sigset, SIGIO);
sigprocmask(SIG_BLOCK, &sigset, NULL);
/* For each file descriptor, invoke F_SETOWN, F_SETSIG, and set O_ASYNC. */
fcntl(fd, F_SETOWN, (int) getpid());
fcntl(fd, F_SETSIG, signum);
flags = fcntl(fd, F_GETFL);
flags |= O_NONBLOCK|O_ASYNC;
fcntl(fd, F_SETFL, flags);

  • A signal is delivered when a normal I/O function such as read() or write() completes. To use this, write a normal poll() outer loop, and inside it, after you've handled all the descriptors noticed by poll(), loop calling sigwaitinfo(). If sigwaitinfo() or sigtimedwait() returns your realtime signal, siginfo.si_fd and siginfo.si_band give almost the same information as pollfd.fd and pollfd.revents would after a call to poll(), so you handle the I/O and keep calling sigwaitinfo(). If sigwaitinfo() returns a traditional SIGIO, the signal queue overflowed, so you flush the signal queue by temporarily changing the signal handler to SIG_DFL, and break back to the outer poll() loop. See Poller_sigio (cc, h) for an example of how to use realtime signals interchangeably with other readiness notification schemes. See Zach Brown's phhttpd for example code that uses this feature directly (phhttpd is a little hard to figure out...). [Provos, Lever, and Tweedie 2000] describes a recent benchmark of phhttpd using a variant of sigtimedwait(), sigtimedwait4(), that lets you retrieve multiple signals with a single call. Interestingly, the main benefit of sigtimedwait4() for them seemed to be that it lets the application gauge system overload (so it can behave appropriately). (Note that poll() provides the same measure of system load.)
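To show how the pieces above fit together, here is a hedged sketch of the outer-poll / inner-sigwaitinfo() loop just described. It assumes descriptors were set up with F_SETSIG as in the earlier snippet; run_poll_pass() and handle_io() are hypothetical stand-ins for the rest of the server.

#define _GNU_SOURCE
#include <signal.h>

extern void run_poll_pass(void);           /* hypothetical: one outer poll() sweep */
extern void handle_io(int fd, long band);  /* hypothetical: do the actual I/O */

void signal_loop(int signum, const sigset_t *sigset)
{
    siginfo_t info;

    for (;;) {
        run_poll_pass();                   /* outer loop: catch anything missed */

        for (;;) {
            if (sigwaitinfo(sigset, &info) < 0)
                break;                     /* e.g. EINTR: back to poll() */
            if (info.si_signo == SIGIO) {
                /* The realtime queue overflowed.  Flush it as described
                   above (temporarily switch the handler to SIG_DFL) and
                   fall back to the outer poll() pass. */
                break;
            }
            if (info.si_signo == signum)
                handle_io(info.si_fd, info.si_band);  /* like pollfd.fd/.revents */
        }
    }
}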

Signal-per-fd is a proposal by Chandra and Mosberger to improve realtime signals; it reduces or eliminates realtime signal queue overflow by coalescing redundant events. It doesn't outperform epoll, though. Their paper (www.hpl.hp.com/techreports)… compares the performance of this scheme with select() and /dev/poll. Vitaly Luban announced a patch implementing the scheme on 18 May 2001; his patch is at www.luban.org/GPL/gpl.htm… (Note: as of September 2001, there may still be stability problems with this patch under heavy load; dkftpbench at around 4500 users may trigger an oops.) See Poller_sigfd (cc, h) for an example of how to use signal-per-fd interchangeably with other readiness notification schemes.
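Since epoll (described above) became the standard readiness-change mechanism on Linux 2.6, here is a minimal, hedged sketch of an edge-triggered epoll loop. set_nonblocking() is the helper from the earlier poll() sketch, the listening socket is assumed to be set up elsewhere, and note that with EPOLLET you must keep reading until EAGAIN, or you may never hear about that descriptor again.

#include <errno.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

void set_nonblocking(int fd);                /* as in the earlier poll() sketch */

void epoll_loop(int listen_fd)
{
    int epfd = epoll_create(64);             /* size hint, ignored by modern kernels */
    struct epoll_event ev = { 0 };
    ev.events = EPOLLIN;
    ev.data.fd = listen_fd;
    epoll_ctl(epfd, EPOLL_CTL_ADD, listen_fd, &ev);

    struct epoll_event events[64];
    char buf[4096];

    for (;;) {
        int n = epoll_wait(epfd, events, 64, -1);
        for (int i = 0; i < n; i++) {
            int fd = events[i].data.fd;
            if (fd == listen_fd) {
                int c;                       /* edge-triggered: drain the backlog */
                while ((c = accept(listen_fd, NULL, NULL)) >= 0) {
                    set_nonblocking(c);
                    struct epoll_event cev = { 0 };
                    cev.events = EPOLLIN | EPOLLET;
                    cev.data.fd = c;
                    epoll_ctl(epfd, EPOLL_CTL_ADD, c, &cev);
                }
            } else {
                ssize_t r;                   /* edge-triggered: read until EAGAIN */
                while ((r = read(fd, buf, sizeof buf)) > 0) {
                    /* process r bytes here */
                }
                if (r == 0 || (r < 0 && errno != EAGAIN))
                    close(fd);               /* closed fds are dropped from epoll */
            }
        }
    }
}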

3. A thread serves multiple clients and uses asynchronous I/O.

This has not caught on in Unix so far, probably because few operating systems support asynchronous I/O, and possibly because it (like non-blocking I/O) requires rethinking your application. Under standard Unix, asynchronous I/O is provided by the aio_ interface (scroll down from that link to "Asynchronous input and output"), which associates a signal and value with each I/O operation. Signals and their values are queued and delivered efficiently to the user process. These come from the POSIX 1003.1b realtime extensions, and are also in the Single Unix Specification, version 2.

AIO is normally used with edge-triggered completion notification, i.e. a signal is queued when the operation is complete. (It can also be used with level-triggered completion notification by calling aio_suspend(), though I suspect few people do this.)
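As a concrete illustration, here is a hedged sketch of the POSIX aio_ interface using the level-triggered style just mentioned (aio_suspend()). "somefile" is a placeholder; a real server would instead register a sigevent completion signal and keep serving other clients while the read is in flight.

#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    char buf[4096];
    int fd = open("somefile", O_RDONLY);     /* placeholder path */
    if (fd < 0) { perror("open"); return 1; }

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    if (aio_read(&cb) < 0) { perror("aio_read"); return 1; }

    /* Level-triggered completion: block until this one request finishes. */
    const struct aiocb *list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
        aio_suspend(list, 1, NULL);

    ssize_t n = aio_return(&cb);
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}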

glibc 2.1 and later provide a generic implementation written for standards compliance rather than performance.

Ben LaHaise's Linux AIO implementation was merged into the mainline Linux kernel as of 2.5.32. It does not use kernel threads and has a very efficient underlying API, but (as of 2.6.0-test2) does not yet support sockets. (There is also an AIO patch for the 2.4 kernel, but the 2.5/2.6 implementation is somewhat different.) More information:

  • The page "Kernel Asynchronous I/O (AIO) Support for Linux", which tries to tie together all information about the AIO implementation for the 2.6 kernel (posted 16 September 2003).
  • Round 3: aio vs /dev/epoll, by Benjamin C.R. LaHaise (presented at OLS 2002)
  • Asynchronous I/O Support in Linux 2.5, by Bhattacharya, Pratt, Pulavarty, and Morgan, IBM; presented at OLS 2003
  • Suparna Bhattacharya’s Asynchronous I/O(AIO) Design Note for Linux – Compare Ben’s AIO with SGI’s KAIO and some other AIO projects
  • Linux AIO home page – Ben’s preliminary patch, mailing list, etc.
  • Linux-aio mailing list archive

  • libaio-oracle – a library implementing standard POSIX AIO on top of libaio. First mentioned by Joel Becker on 18 April 2003.

Suparna also suggests looking at the DAFS API’s approach to AIO.

Red Hat AS and SUSE SLES both provide a high-performance AIO implementation on the 2.4 kernel; it is related to, but not completely identical to, the 2.6 kernel implementation.

In February 2006, the network AIO had a new try; See above for Evgeniy Polyakov’s kevent-based AIO

In 1999, SGI implemented high-speed AIO for Linux. As of version 1.1, it is said to work well with both disk I/O and sockets. It seems to use kernel threads. It is still useful for people who can't wait for Ben's AIO to support sockets.

O'Reilly's book POSIX.4: Programming for the Real World is said to include a good introduction to AIO.

The Sunsite tutorial on Solaris's early, nonstandard AIO implementation is probably worth a look, but keep in mind you'll need to mentally convert "aioread" to "aio_read", etc.

Note that AIO does not provide a way to open files without blocking on disk I/O; if you care about the sleep involved in opening a disk file, Linus suggests simply doing the open() in a different thread rather than wishing for an aio_open() system call.

Under Windows, asynchronous I/O is associated with the terms "overlapped I/O" and IOCP, or "I/O completion port". Microsoft's IOCP combines prior art such as asynchronous I/O (like aio_write) and queued completion notification (like using the aio_sigevent field with aio_write) with a new idea of holding back some requests to try to keep the number of running threads associated with a single IOCP constant. For more information, see Inside I/O Completion Ports by Mark Russinovich at sysinternals.com, Jeffrey Richter's book "Programming Server-Side Applications for Microsoft Windows 2000" (Amazon, MSPress), U.S. Patent #06223207, or MSDN.

4. One thread serves one client

... and let read() and write() block. The major drawback is that it uses a whole stack frame per client, which costs memory. Many operating systems also have trouble handling more than a few hundred threads. If each thread gets a 2MB stack (not an unusual default), you run out of virtual memory at (2^30 / 2^21) = 512 threads on a 32-bit machine with 1GB of user-accessible VM (as Linux normally ships on x86). You can work around this by giving each thread a smaller stack, but since most thread libraries don't allow a thread's stack to grow once it has been created, doing so means designing your program to minimize stack use. You can also work around it by moving to a 64-bit processor.
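For comparison with the earlier sketches, here is a minimal sketch of this strategy: one detached thread per accepted connection, doing blocking reads and writes. The echo body is illustrative only; memory use is dominated by the per-thread stacks discussed above.

#include <pthread.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

static void *handle_conn(void *arg)
{
    int fd = (int)(intptr_t)arg;
    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)   /* blocking read */
        write(fd, buf, n);                        /* blocking write */
    close(fd);
    return NULL;
}

void serve(int listen_fd)
{
    for (;;) {
        int c = accept(listen_fd, NULL, NULL);
        if (c < 0)
            continue;
        pthread_t tid;
        if (pthread_create(&tid, NULL, handle_conn, (void *)(intptr_t)c) == 0)
            pthread_detach(tid);                  /* no join: thread cleans itself up */
        else
            close(c);
    }
}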

Thread support on Linux, FreeBSD, Solaris is improving, and even for mainstream users, 64-bit processors are coming. Perhaps in the future, those who prefer one thread per client will be able to serve 10,000 clients. At this point, however, if you really want to support that many customers, you’re probably better off using some other methods.

For those of you with an unflinchingly pro-thread view, see Why Events Are a Bad Idea (for High-Concurrency Servers) by von Behren, Condit, and Brewer of UCB, presented at HotOS IX. Can anyone in the anti-thread camp point to a paper that refutes it? :-)

Linux threads

LinuxThreads is the name of the standard Linux thread library. It has been integrated into glibc since glibc 2.0, and is mostly POSIX-compliant, but with less than stellar performance and signal support.

NGPT: The next generation Posix thread for Linux

NGPT is a project started by IBM to bring better POSIX thread compatibility to Linux. It is at stable version 2.2 and works well... but the NGPT team has announced that they are putting the NGPT code base into support-only mode because they feel it is "the best way of supporting the community for the long term". The NGPT team will continue to work on improving Linux thread support, but now focused on improving NPTL. (Thanks to the NGPT team for their good work and the graceful way they moved over to NPTL.)

NPTL: Linux native Posix threading library

NPTL is a project by Ulrich Drepper (the glibc maintainer) and Ingo Molnar to bring world-class POSIX threading library support to Linux.

As of 5 October 2003, NPTL is merged into the glibc CVS tree as an add-on directory (just like LinuxThreads), so it will almost certainly be released along with the next version of glibc.

Red Hat 9 was the first distribution to include NPTL (this was a little inconvenient for some users, but someone had to break the ice…)

NPTL links:

  • NPTL discussion mailing list

  • NPTL source

  • NPTL's original announcement

  • The original white paper describing the goals of NPTL

  • The revised white paper describing the final design of NPTL

  • Ingo Molnar’s first benchmark shows that it can handle 10^6 threads
  • Ulrich's benchmark comparing the performance of LinuxThreads, NPTL, and IBM's NGPT. It seems to show that NPTL is faster than NGPT.

Here’s my attempt to write a history of NPTL (also check out Jerry Cooperstein’s post):

In March 2002, Bill Abt of the NGPT team, glibc maintainer Ulrich Drepper, and others met to figure out what to do about LinuxThreads. One idea that came out of the meeting was to improve mutex performance; Rusty Russell et al. subsequently implemented fast userspace mutexes (futexes), which are now used by both NGPT and NPTL. Most of the participants agreed that NGPT should be merged into glibc.

However, Ulrich Drepper disliked NGPT and figured he could do better. (For anyone who has ever tried to contribute a patch to glibc, this may not come as a big surprise :-) Over the next few months, Ulrich Drepper, Ingo Molnar, and others made a series of glibc and kernel changes that make up the Native Posix Threads Library (NPTL). NPTL uses all the kernel enhancements designed for NGPT and takes advantage of a few new ones:

> NPTL uses three kernel features introduced by NGPT: getpid() returning PID, CLONE_THREAD, and futexes; NPTL also uses (and relies on) a much wider set of new kernel features developed as part of this project.

> Some of the items NGPT introduced into the kernel around 2.5.8 got modified, cleaned up, and extended, such as thread group handling (CLONE_THREAD). [The CLONE_THREAD changes that affected NGPT's compatibility were synced with the NGPT folks to make sure NGPT does not break in any unacceptable way.]

> The kernel features developed for and used by NPTL are described in the design paper, people.redhat.com/drepper/npt…

> A short list: TLS support, various clone extensions (CLONE_SETTLS, CLONE_SETTID, CLONE_CLEARTID), POSIX thread-signal handling, sys_exit() extension (releasing the TID futex upon VM release), the sys_exit_group() system call, sys_execve() enhancements, and support for detached threads.

> There was also work put into extending the PID space, e.g. procfs crashed due to 64K PID assumptions, plus max_pid and PID allocation scalability work. In addition, a number of performance-only improvements were done.

In essence, the new features take an uncompromising 1:1 threading approach: the kernel now helps with everything thread-related, and the minimum required number of context switches and kernel calls is performed for each basic threading primitive.

FreeBSD thread support

FreeBSD supports both LinuxThreads and a userspace threading library. In addition, an M:N implementation called KSE was introduced in FreeBSD 5.0. For an overview, see www.unobvious.com/bsd/freebsd…

On 25 March 2003, Jeff Roberson posted on freebsd-arch:

... Thanks to Julian, David Xu, Mini, Dan Eischen, and everyone else who has participated in the development of KSE and libpthread, Mini and I have developed a 1:1 threading implementation. This code works in parallel with KSE and does not break it in any way. It actually helps bring M:N threading closer by testing out shared bits of code...


In July 2006, Robert Watson proposed that the 1:1 threading implementation become the default in FreeBSD 7.x:

I know this has been discussed in the past, but I think it's time to reconsider as 7.x moves forward. In benchmarks for many common applications and scenarios, libthr demonstrates significantly better performance than libpthread... libthr is also implemented across a larger number of our platforms, and is already libpthread on several of them. The first recommendation we make to MySQL and many other heavy-thread users is "switch to libthr", which is suggestive, too! ... So the strawman proposal is: make libthr the default threading library on 7.x.


NetBSD thread support

According to Noriyuki Soda:

A kernel-supported M:N thread library based on the Scheduler Activations model was merged into NetBSD-current on 18 January 2003.


For more details, see An Implementation of Scheduler Activations on the NetBSD Operating System, by Nathan J. Williams of Wasabi Systems, presented at FREENIX 2002.

Solaris Thread Support

The thread support in Solaris is evolving... from Solaris 2 to Solaris 8, the default thread library used an M:N model, but Solaris 9 defaults to 1:1 model thread support. See Sun's Multithreaded Programming Guide and Sun's notes on Java and Solaris threads.

Java thread support in JDK 1.3.x and earlier

As is well known, up through JDK 1.3.x, Java did not support any way of handling network connections other than one thread per client. Volanomark is a good microbenchmark that measures throughput in messages per second at various numbers of simultaneous connections. As of May 2003, JDK 1.3 implementations from various vendors were in fact able to handle 10,000 simultaneous connections, albeit with significant performance degradation. See Table 4 for which JVMs can handle 10,000 connections, and how performance suffers as the number of connections increases.

Note: 1:1 threads versus M:N threads

When implementing a threading library, there is a choice: you can either put all the thread support in the kernel (this is called the 1:1 threading model), or you can move a fair bit of it into user space (this is called the M:N threading model). At one point, M:N was thought to give higher performance, but it's so complex that it's hard to get right, and most people are moving away from it.

  • Why does Molnar prefer 1:1 to M:N

  • Sun is moving towards 1:1 threads

  • NGPT is a Linux M:N threading library.
  • Although Ulrich Drepper planned to use M:N threads in the new Glibc threading library, he has since switched to a 1:1 threading model
  • MacOSX appears to use 1:1 threads.

  • FreeBSD and NetBSD still seem to believe in M:N threading... the lone holdouts? It looks like FreeBSD 7.0 may switch to 1:1 threads (see above), so perhaps the M:N believers have been proven wrong after all.

5. Build the server code into the kernel

It is said that Novell and Microsoft have both done this at various times, at least one NFS implementation does it, khttpd does it for Linux and static web pages, and "TUX" (Threaded linUX webserver) is a blindingly fast and flexible kernel-space HTTP server for Linux by Ingo Molnar. Ingo's announcement of 1 September 2000 says an alpha version of TUX can be downloaded from ftp://ftp.redhat.com/pub/redhat/tux, and explains how to join the mailing list for more information.

The linux-kernel list has been discussing the pros and cons of this approach, and the consensus seems to be that instead of moving web servers into the kernel, the kernel should provide the smallest possible hooks needed to improve web server performance. That way, other kinds of servers can benefit too. See, for example, Zach Brown's remarks about userland versus kernel HTTP servers. It appears that the 2.4 Linux kernel provides sufficient power to user programs, since the X15 server runs about as fast as TUX without any kernel modifications.

Bring the TCP stack into userspace

For example, see the NetMap Packet I/O framework and the Sandstorm Proof-of-concept Web server based on this.

Comments

Richard Gooch has written a paper discussing I/O options.

In 2001, Tim Brecht and Michal Ostrowski measured various strategies for simple select-based servers. Their data is worth a look.

In 2003, Tim Brecht posted the source code for userver, a small web server put together from several servers written by Abhishek Chandra, David Mosberger, David Pariag, and Michal Ostrowski. It can use select(), poll(), or sigio.

Back in March 1999, Dean Gaudet wrote:

I keep getting asked “Why don’t you use a select/event based model like Zeus? It’s clearly the fastest.”…


His reasoning boiled down to "it's really hard, and the payoff isn't clear." Within a few months, though, it became clear that people were willing to keep working on it.

Mark Russinovich wrote an editorial and an article discussing I/O strategy issues in the Linux 2.2 kernel. They are worth reading, even though he seems misinformed on some points. In particular, he seems to think that Linux 2.2's asynchronous I/O (see F_SETSIG above) doesn't notify the user process when data is ready, only when new connections arrive. This seems like a strange misunderstanding. See also comments on an earlier draft, Ingo Molnar's rebuttal of 30 April 1999, Russinovich's comments of 2 May 1999, a rejoinder from Alan Cox, and various posts to linux-kernel. I suspect he was trying to say that Linux doesn't support asynchronous disk I/O, which used to be true, but now that SGI has implemented KAIO, it's not so true anymore.

See these pages at sysinternals.com and MSDN for information on "completion ports", which he said were unique to NT; in a nutshell, Win32's "overlapped I/O" turned out to be too low-level to be convenient, and a "completion port" is a wrapper that provides a queue of completion events, plus scheduling magic that tries to keep the number of running threads constant by allowing more threads to pick up completion events from the port if other threads that had picked up completion events are sleeping (perhaps doing blocking I/O).

See also OS/400 support for I/O completion ports

There was an interesting discussion on linux-kernel in September 1999 titled "> 15,000 Simultaneous Connections" (and the second week of the thread). Highlights:

  • Ed Hall posted a few notes about his experience; he achieved >1000 connects/second on a UP P2/333 running Solaris. His code used a small pool of threads (1 or 2 per CPU), each managing a large number of clients using "an event-based model".
  • Mike Jagdis posted an analysis of poll/select overhead, and said: "The current select/poll implementation can be improved significantly, especially in the blocking case, but the overhead will still increase with the number of descriptors because select/poll does not, and cannot, remember which descriptors are interesting. This would be easy to fix with a new API. Suggestions are welcome..."
  • Mike posts about his work on improving select() and poll().
  • Mike posted some possible APIs to replace poll()/select(): "How about a 'device-like' API where you write 'pollfd-like' structs, and the 'device' listens for events and delivers 'pollfd-like' structs representing them when you read it? ..."
  • Rogier Wolff suggested using "the API that the digital guys suggested", www.cs.rice.edu/~gaurav/pap…
  • Joerg Pommnitz pointed out that any new API along these lines should be able to wait not only for file descriptors but also for signals and SYSV IPC; our synchronization primitives should certainly be able to do what Win32's WaitForMultipleObjects can.
  • Stephen Tweedie asserted that the combination of F_SETSIG, queued realtime signals, and sigwaitinfo() is a superset of the API proposed in www.cs.rice.edu/~gaurav/pap… He also mentioned that you keep the signal blocked at all times if you're interested in performance; instead of the signal being delivered asynchronously, the process grabs the next one from the queue with sigwaitinfo().
  • Jayson Nordwick compared completion ports with the F_SETSIG synchronous event model, and concluded they are very similar.
  • Alan Cox pointed out that an older version of the SIGIO patch for SCT was included in 2.3.18AC.
  • Jordan Mendelson has posted some sample code showing how to use F_SETSIG.
  • Stephen C. Tweedie continued the comparison of completion ports and F_SETSIG, and noted: "With a signal dequeuing mechanism, your application is going to get signals destined for various library components if libraries are using the same mechanism," but the library can set up its own signal handler, so this shouldn't affect the program (much).
  • Doug Royer noted that he got 100,000 connections on Solaris 2.6 while he was working on the Sun calendar server. Others chimed in with estimates of how much RAM Linux would need for that, and what bottlenecks it would hit.

Interesting reading!

Limits on open file handles

  • Any Unix: the limits are set by ulimit or setrlimit (a setrlimit() sketch appears at the end of this list)
  • Solaris: See the Solaris FAQ, Question 3.46 (or so; They renumber regularly)
  • FreeBSD:

    Edit /boot/loader.conf to add lines

    set kern.maxfiles=XXXX

    where XXXX is the desired system-wide limit on file descriptors, then reboot. Thanks to an anonymous reader, who wrote in to say he'd achieved well over 10,000 connections on FreeBSD 4.3 and said:

    “FWIW: You can't actually tune the maximum number of connections in FreeBSD trivially via sysctl.... You have to do it in the /boot/loader.conf file.

    The reason for this is that the zalloci() calls that initialize the socket and tcpcb structure zones happen very early in system startup, so that the zones can be both type-stable and swappable.

    You will also want to set the number of mbufs much higher, since (on an unmodified kernel) you consume one mbuf per connection just for the tcptempl structure used to implement keepalive.”



    Other readers said:

    “As of FreeBSD 4.4, the tcptempl structure is no longer allocated; you no longer have to worry about one mbuf being chewed up per connection.”



    Also check out:

    • The FreeBSD handbook

    • SYSCTL TUNING, LOADER TUNABLES, and KERNEL CONFIG TUNING
    • The Effects of Tuning a FreeBSD 4.3 Box for High Performance, Daemon News, August 2001
    • Postfix.org adjustment notes, covering FreeBSD 4.2 and 4.4
    • The Measurement Factory notes, about FreeBSD 4.3


  • OpenBSD: an additional tweak is needed to increase the number of open file handles available per process: the openfiles-cur parameter in /etc/login.conf needs to be increased. You can change kern.maxfiles either with sysctl -w or in sysctl.conf, but it has no effect on its own. This matters because the login.conf limit defaults to a quite low 64 for non-privileged processes and 128 for privileged ones.
  • Linux: see Bodo Bauer's /proc documentation. On 2.4 kernels,

    echo 32768 > /proc/sys/fs/file-max

    increases the system-wide limit on open files, and

    ulimit -n 32768

    increases the current process's limit.

    On 2.2.x kernels,

    echo 32768 > /proc/sys/fs/file-max
    echo 65536 > /proc/sys/fs/inode-max

    increases the system-wide limit on open files, and

    ulimit -n 32768

    increases the current process's limit.

    I verified that a process on Red Hat 6.0 (2.2.5 or so, plus patches) can open at least 31,000 file descriptors this way. Another fellow has verified that a process on 2.2.12 can open at least 90,000 file descriptors this way (with appropriate limits). The upper bound seems to be available memory.

    Stephen C. Tweedie posted about how to set ulimit limits globally or per-user at boot time using initscript and pam_limits.

    In older 2.2 kernels, though, the number of open files per process was still limited to 1024, even with the above changes.

    See also Oskar's 1998 post, which discusses the per-process and system-wide limits on file descriptors in the 2.0.36 kernel.
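As a companion to the ulimit/setrlimit item at the top of this list, here is a hedged sketch of raising the per-process descriptor limit programmatically. The process still cannot exceed its hard limit or the system-wide caps (file-max on Linux, kern.maxfiles on the BSDs) discussed above.

#include <stdio.h>
#include <sys/resource.h>

int raise_fd_limit(rlim_t want)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) < 0)
        return -1;
    if (rl.rlim_max != RLIM_INFINITY && want > rl.rlim_max)
        want = rl.rlim_max;          /* can't go past the hard limit */
    rl.rlim_cur = want;
    if (setrlimit(RLIMIT_NOFILE, &rl) < 0) {
        perror("setrlimit");
        return -1;
    }
    return 0;
}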

Thread limit

On any architecture, you may need to reduce the amount of stack space allocated per thread to avoid running out of virtual memory. If you are using pthreads, you can generally set this at run time with pthread_attr_init() and friends.
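Here is a hedged sketch of doing that with pthread attributes; the 64KB figure is an arbitrary example and must be large enough for whatever the thread actually does (and at least PTHREAD_STACK_MIN).

#include <limits.h>
#include <pthread.h>

int spawn_small_stack_thread(void *(*fn)(void *), void *arg)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t stack = 64 * 1024;              /* arbitrary example size */

    if (stack < PTHREAD_STACK_MIN)
        stack = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stack);
    int rc = pthread_create(&tid, &attr, fn, arg);
    pthread_attr_destroy(&attr);
    return rc;                             /* 0 on success, errno-style code otherwise */
}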

  • Solaris: it supports as many threads as will fit in memory, I hear.
  • Kernels with NPTL: /proc/sys/vm/max_map_count may need to be increased to go above 32,000 or so threads (you will also need to use very small stacks to get anywhere near that number, unless you are on a 64-bit processor). See the NPTL mailing list, e.g. the thread "Cannot create more than 32K threads?", for more information.
  • Linux 2.4: /proc/sys/kernel/threads-max is the maximum number of threads; it defaults to 2047 on my Red Hat 8 system. As usual, you can increase it by echoing a new value into that file, e.g. "echo 4000 > /proc/sys/kernel/threads-max".
  • Linux 2.2: Even the 2.2.13 kernel limits the number of threads, at least on Intel; I don't know what the limits are on other architectures. Mingo posted a patch for 2.1.131 on Intel that removed this limit; it appears to be integrated into 2.3.20. See also Volano's detailed instructions for raising the file, thread, and FD_SET limits in the 2.2 kernel. Wow, this document steps you through a lot of stuff that would be hard to figure out yourself, but it is somewhat dated.
  • Java: see Volano's detailed benchmark info, plus their info on how to tune various systems to handle lots of threads.

Java problems

Up through JDK 1.3, Java's standard networking libraries mostly offered the one-thread-per-client model. There was a way to do non-blocking reads, but no way to do non-blocking writes.

In May 2001, JDK 1.4 introduced the java.nio package to provide full support for non-blocking I/O (and some other goodies). See the release notes for some caveats. Try it out and give Sun feedback!

HP's Java also includes a Thread Polling API.

In 2000, Matt Welsh implemented non-blocking sockets for Java; his performance benchmarks show that they have advantages over blocking sockets in servers handling many (up to 10,000) connections. His class library is called java-nbio; it is part of the Sandstorm project. Benchmarks showing performance with 10,000 connections are available.

See also Dean Gaudet's essay on the topic of Java, network I/O, and threads, and Matt Welsh's paper on events versus worker threads.

Prior to NIO, there were several suggestions for improving Java’s network API:

  • Matt Welsh's Jaguar system proposes preserialized objects, new Java bytecodes, and memory management changes to allow the use of asynchronous I/O with Java.
  • Connecting Java to the Virtual Interface Architecture by C-C. Chang and T. von Eicken proposed memory management changes to allow Java to use asynchronous I/O.
  • JSR-51 was the Sun project that came up with the java.nio package. Matt Welsh participated (who says Sun doesn't listen?).

Other Suggestions

  • Zero-copy. Normally, data gets copied many times on its way from one place to another. Any scheme that eliminates these copies down to the bare physical minimum is called "zero-copy".
    • Thomas Ogrisegg's zero-copy send patch for mmapped files under Linux 2.4.17-2.4.20. He claims it is faster than sendfile().
    • Io-lite is a proposal for a set of I/O primitives that eliminates the need for many copies.
    • In 1999, Alan Cox noted that zero-copy was sometimes not worth the trouble (though he did like sendfile()).
  • Ingo implemented a zero-copy TCP for TUX 1.0 in the 2.4 kernel in July 2000, and he says he will be making it available to user space soon.
    • Drew Gallatin and Robert Picco have added some zero-copy features to FreeBSD; the idea seems to be that if you call write() or read() on a socket, the pointer is page-aligned, the amount of data transferred is at least a page, and you don't immediately reuse the buffer, then memory management tricks will be used to avoid copies. But see the follow-ups to this message on linux-kernel for people's misgivings about the speed of those memory management tricks. According to Noriyuki Soda: sending-side zero copy has been supported since NetBSD-1.6 by specifying the "SOSEND_LOAN" kernel option. This option is now the default on NetBSD-current (you can disable it by specifying "SOSEND_NO_LOAN" in the kernel options on NetBSD-current). With this feature, zero copy is automatically enabled if data larger than 4096 bytes is specified as the data to be sent.
    • The sendfile() system call can implement zero-copy networking. The sendfile() function in Linux and FreeBSD lets you tell the kernel to send part or all of a file, which lets the OS do it as efficiently as possible. It can be used equally well in servers using threads or servers using non-blocking I/O. (In Linux it is poorly documented so far; use _syscall4 to call it. Andi Kleen is writing new man pages to cover it. See also Jeff Tranter's Exploring the sendfile System Call in Linux Gazette issue 91.) Rumor has it that ftp.cdrom.com benefited noticeably from the zero-copy implementation of sendfile(). A zero-copy implementation of sendfile() was on its way into the 2.4 kernel; see LWN, 25 January 2001. One developer using sendfile() on FreeBSD reports that using POLLWRBAND instead of POLLOUT makes a big difference. Solaris 8 (as of the July 2001 update) has a new system call, sendfilev; a copy of the man page is here, and the Solaris 8 7/01 release notes also mention it. I suspect this is most useful when sending to a socket in blocking mode; using it with a non-blocking socket could be a bit painful. (A corked-sendfile sketch appears after this list.)
  • A new socket option under Linux, TCP_CORK, tells the kernel to avoid sending partial frames, which helps a bit, e.g. when there are lots of small write() calls you can't bundle together for some reason. Unsetting the option flushes the buffer. Better to use writev(), though... See LWN, 25 January 2001 for a summary of some very interesting discussion on linux-kernel about TCP_CORK and a possible alternative, MSG_MORE.
  • [Provos, Lever, and Tweedie 2000] note that dropping incoming connections when the server is overloaded improved the shape of the performance curve and reduced the overall error rate. They used a smoothed version of "number of clients with I/O ready" as their measure of overload. This technique should apply easily to servers written with select, poll, or any system call that returns a count of readiness events per call (such as /dev/poll or sigtimedwait4()).
  • Some programs can benefit from using non-POSIX threads. Not all threads are created equal: the clone() function in Linux (and its friends in other operating systems) lets you create a thread that has its own current working directory, for instance, which can be very helpful when implementing an FTP server. See Hoser FTPd for an example of using native threads rather than pthreads.
  • Caching your own data can sometimes be a win. Vivek Sadananda Pai ([email protected]) reports in "Re: fix for hybrid server problems" on new-httpd, May 9: "I've compared the raw performance of a select-based server with a multiple-process server on both FreeBSD and Solaris/x86. On microbenchmarks, there's only a marginal difference in performance stemming from the software architecture. The big performance win for select-based servers stems from doing application-level caching. While multiple-process servers can do it at a higher cost, it's harder to get the same benefits on real workloads (vs microbenchmarks). I'll be presenting those measurements as part of a paper that will appear at the next Usenix conference. If you've got PostScript, the paper is available at www.cs.rice.edu/~vivek/flas…"
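To tie the sendfile() and TCP_CORK suggestions above together, here is a hedged, Linux-flavored sketch: cork the socket, write small headers, hand the file body to sendfile(), then uncork to flush. The function name, the header string, and the blocking-socket assumption are illustrative only.

#include <fcntl.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

int send_file_response(int sock, const char *path, const char *headers)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    struct stat st;
    fstat(fd, &st);

    int on = 1, off = 0;
    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);   /* hold partial frames */

    write(sock, headers, strlen(headers));                     /* small header write */

    off_t offset = 0;
    while (offset < st.st_size) {
        ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
        if (n <= 0)
            break;              /* a real server would handle EAGAIN, retries, etc. */
    }

    setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof off); /* uncork: flush */
    close(fd);
    return offset == st.st_size ? 0 : -1;
}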

Other restrictions

  • Old system libraries might use 16-bit variables to hold file handles, which causes trouble above handle 32767. glibc 2.1 should be OK.
  • Many systems use 16-bit variables to hold process or thread IDs. It would be interesting to port the Volano scalability benchmark to C and see what the upper limit on the number of threads is for the various operating systems.
  • Too much thread-local memory is preallocated by some operating systems; if each thread gets 1MB and total VM space is 2GB, that creates an upper limit of 2000 threads.
  • Look at the performance comparison at the bottom of www.acme.com/software/th… Notice how various servers have trouble above 128 connections, even on Solaris 2.6; anyone who figures out why, let me know.

Note: if the TCP stack has a bug that causes a short (200ms) delay at SYN or FIN time, as Linux 2.2.0-2.2.6 had, and the operating system or HTTP daemon has a hard limit on the number of open connections, you would expect exactly this behavior. There may be other causes.

The kernel problem

For Linux, it looks like kernel bottlenecks are being fixed constantly. See Linux Weekly News, Kernel Traffic, the linux-kernel mailing list, and my Mindcraft Redux page.

In March 1999, Microsoft sponsored a benchmark comparing NT with Linux at serving large numbers of HTTP and SMB clients, in which Linux did not fare well. See also my article on Mindcraft's April 1999 benchmarks for more information.

See also the Linux Scalability Project. They are doing interesting work, including Niels Provos' hinting poll patch and some work on the thundering herd problem.

See also Mike Jagdis' work on improving select() and poll(); here's Mike's post about it.

Mohit Aron ([email protected]) writes that rate-based clocking in TCP can improve HTTP response times over "slow" connections by 80%.

Measuring server performance

Two tests in particular are simple, fun, and difficult:

  1. Raw connections per second (how many 512-byte files can be served per second?)
  2. Total transfer rate on large files with many slow clients (how many 28.8K modem clients can simultaneously download from the server before performance goes to pot?)

Jef Poskanzer has published benchmarks comparing many web servers; see www.acme.com/software/th… for his results.

I also have a few old notes comparing thttpd to Apache that may be of interest to beginners.

Chuck Lever keeps reminding us about Banga and Druschel’s paper on Web server benchmarking. It’s worth reading.

IBM has an excellent paper called Java Server Benchmarking [Baylor et al.,2000]. It’s worth reading.

Examples

Nginx is a web server that uses whatever efficient network event mechanism is available on the target operating system. It has become very popular; there are even two books about it.

Interesting select() based server

  • thttpd. Very simple. Uses a single process. It has good performance, but doesn't scale with the number of CPUs. It can also use kqueue.
  • mathopd. Similar to thttpd.
  • fhttpd

  • boa

  • Roxen

  • Zeus, a commercial server that tries to be the absolute fastest. See their tuning guide.
  • Other non-Java services are listed at www.acme.com/software/th…
  • BetaFTPd

  • Flash-Lite – a web server using IO-Lite.
  • Flash: an efficient and portable web server – uses select(), mmap(), mincore().
  • The Flash web server as of 2003 – uses select(), a modified sendfile(), async open().
  • xitami – uses select() to implement its own thread abstraction, for portability to systems without threads.
  • Medusa – a server-writing toolkit in Python that tries to deliver very high performance.
  • userver – a small HTTP server that can use select, poll, epoll, or sigio

Interesting /dev/poll-based servers

  • N. Provos, C. Lever, "Scalable Network I/O in Linux," May 2000. [FREENIX track, Proc. USENIX 2000, San Diego, California (June 2000).] Describes a version of thttpd modified to support /dev/poll; performance is compared with phhttpd.

Interesting epoll-based servers

  • ribs2

  • cmogstored – uses epoll/kqueue for most networking, threads for disk and accept4

Interesting kqueue() based server

  • thttpd (as of version 2.21?)
  • Adrian Chadd says "I'm doing a lot of work to make squid actually LIKE a kqueue IO system"; it's an official Squid subproject; see squid.sourceforge.net/projects.ht… This is apparently newer than Benno's patch.

Interesting real-time signal based server.

  • Chromium's X15. Uses the 2.4 kernel's SIGIO feature together with sendfile() and TCP_CORK, and reportedly achieves higher speeds than even TUX. The source is available under a community source license. See Fabio Riccardi's original announcement.
  • Zach Brown's phhttpd – "a quick web server that was written to showcase the sigio/siginfo event model. Consider this code highly experimental, and take great care if you try to use it in a production environment." It uses the siginfo features of 2.3.21 or later, and includes the needed patches for earlier kernels. Rumored to be even faster than khttpd. See some of his notes from 31 May 1999.

Interesting thread-based server

  • Hoser FTPd. See their benchmark page.
  • Peter Eriksson's phttpd, and
  • pftpd

  • Java-based services are listed at www.acme.com/software/th…
  • Sun's Java Web Server (which has been reported to handle 500 simultaneous clients)

Interesting kernel-based server

  • khttpd

  • "TUX" (Threaded linUX webserver), by Ingo Molnar et al. For the 2.4 kernel.

Other interesting links

  • Design notes from Jeff Darcy on designing high performance servers

  • Ericsson’s ARIES project – Benchmark results for Apache 1 versus Apache 2 versus Tomcat on 1 to 12 processors
  • Web server performance page by Prof. Peter Ladkin
  • Novell's FastCache – claims 10,000 hits per second. Quite the pretty performance graph.
  • Rik van Riel’s Linux performance tuning site