1. Why do interviewers like to ask about performance optimization?

The point of a performance optimization question is usually not to have you design a high-performance system, but to reveal the full range of a candidate’s knowledge and practical ability.

Performance problems are different from application defects. Defects are code-quality problems that break functionality, whereas performance problems arise from a combination of factors: application architecture, code quality, Linux server configuration, hardware configuration, and so on.

Performance optimization requires both depth and breadth of knowledge:

  • In terms of depth, it tests whether a candidate has a solid grasp of operating system fundamentals, algorithms, and data structures.
  • In terms of breadth, it examines the candidate’s architectural ability: whether they thought about the project’s architecture, how they solved high availability and scalability problems, whether they dug down to the root cause of problems, and whether they mastered the technologies the project relies on.
  • Performance optimization is the work that best reflects a candidate’s all-round ability: it touches common technologies as well as engineering skill, reasoning, architecture, and more.

Mastering performance optimization therefore means mastering all of these areas.

2. Principle and optimization of the TCP three-way handshake

  • If the client does not receive the server’s SYN+ACK, it retransmits the SYN up to net.ipv4.tcp_syn_retries times (6 by default), waiting 2^n seconds between attempts (n runs from 0 to retries − 1). On a stable network, the retry count can be lowered.
  • net.ipv4.tcp_max_syn_backlog sets the size of the half-open (SYN) queue the server uses to track handshakes in progress; it can be increased.
  • net.ipv4.tcp_syncookies defends against SYN flood (DDoS) attacks; with syncookies enabled, new handshakes do not consume half-open queue slots.
  • net.ipv4.tcp_synack_retries sets how many times the server retransmits its SYN+ACK (5 by default); on a stable network this can likewise be lowered.
  • net.ipv4.tcp_abort_on_overflow decides whether to reset connections when the accept queue overflows; the default of 0 (silently drop the ACK rather than send a RST) is recommended for absorbing traffic bursts.
  • The server’s accept (full-connection) queue length is the smaller of the listen() backlog argument and net.core.somaxconn, so both must be raised together, as shown in the sketch below.
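To illustrate the last point, here is a minimal C sketch of a listening socket with an explicit backlog (port 8080 is an arbitrary example):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) { perror("socket"); return 1; }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }

    /* The effective accept-queue length is min(4096, net.core.somaxconn):
     * passing a large backlog here achieves nothing unless the sysctl
     * is raised as well. */
    if (listen(fd, 4096) < 0) { perror("listen"); return 1; }

    /* ... accept() loop would go here ... */
    return 0;
}
```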

3. How to understand load average

Load average is the average number of active processes over 1, 5, and 15 minutes, where “active” includes:

  • Processes in the runnable state (R)
  • Processes in the uninterruptible sleep state (D)

The kernel maintains each of the three values as an exponential moving average: every 5 seconds it samples the number of active processes n and updates

load(t) = load(t − 1) × e^(−5/(60T)) + n × (1 − e^(−5/(60T)))

where T is the window length in minutes (1, 5, or 15).

The load average is not as straightforward as it seems, which is why Pressure Stall Information (PSI) was introduced:

  • For kernels >= 4.20, PSI reports the percentage of time tasks stalled on a resource (CPU, memory, or I/O) over 10-second, 1-minute, and 5-minute windows; with cgroup2 you can also view the pressure per control group.
  • The some line covers intervals in which at least one task stalled on the resource.
  • The full line covers intervals in which all non-idle tasks stalled on the resource simultaneously.

The figure below compares the load average (an exponentially decaying average of the number of active processes) with PSI:
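To inspect both metrics directly, here is a minimal C sketch (assuming a kernel >= 4.20 with PSI enabled) that dumps /proc/loadavg and /proc/pressure/cpu:

```c
#include <stdio.h>

static void dump(const char *path) {
    char line[256];
    FILE *f = fopen(path, "r");
    if (!f) { perror(path); return; }
    printf("%s:\n", path);
    while (fgets(line, sizeof(line), f))
        printf("  %s", line);
    fclose(f);
}

int main(void) {
    dump("/proc/loadavg");      /* 1/5/15-minute load averages */
    dump("/proc/pressure/cpu"); /* "some" (and on newer kernels "full") stall % */
    return 0;
}
```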

4. Why can memory pools optimize memory performance

A process’s virtual memory space consists of five segments:

  • Read-only segment, including code, constants, and so on.
  • Data segment, including global variables and the like.
  • The heap, including dynamically allocated memory, which grows upward from a low address.
  • File mapping segment, including dynamic libraries, shared memory, and so on, which grows downward from a high address.
  • The stack, including local variables and function call context. Its size is fixed, typically 8 MB.

Among the five segments, memory for the heap and file mapping segments is allocated dynamically. For example, using the C standard library’s malloc() or mmap(), you can dynamically allocate memory in the heap and file mapping segments, respectively.

malloc() is the memory allocation function provided by the C standard library. Under the hood it is backed by two system calls, brk() and mmap(), used in two different ways:

  • For small allocations (less than 128 KB by default), the C library uses brk(), which allocates memory by moving the program break at the top of the heap. Freed memory is not returned to the system immediately but is cached for reuse.
  • Large allocations (greater than 128 KB) use mmap(), which grabs a chunk of free memory in the file mapping segment.
  • Physical memory is not allocated when malloc() is called, but only when the requested memory is first accessed, via a page fault.

Of course, both methods have their advantages and disadvantages.

  • The brk() cache can reduce page faults and improve memory access efficiency.
  • Memory allocated by mmap() is returned to the system as soon as it is freed, so every new mmap() allocation triggers page faults on first access. Under memory pressure, frequent mmap() allocation causes large numbers of page faults and increases the kernel’s management burden.
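In glibc, the 128 KB crossover is the M_MMAP_THRESHOLD parameter and can be tuned with mallopt(). A small, glibc-specific sketch:

```c
#include <malloc.h> /* mallopt() and M_MMAP_THRESHOLD, glibc-specific */
#include <stdlib.h>

int main(void) {
    /* Raise the brk/mmap crossover from the 128 KB default to 256 KB,
     * so allocations up to 256 KB stay on the (cached) heap. */
    mallopt(M_MMAP_THRESHOLD, 256 * 1024);

    void *p = malloc(200 * 1024); /* now served via brk(), not mmap() */
    free(p);                      /* cached by glibc, not unmapped    */
    return 0;
}
```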

What problems do memory pools solve:

  • Pre-allocation: memory a thread has finished with is not returned to the system but kept for the next request (see the sketch after this list)
  • The pool only asks the system for memory, in large blocks, when its own free memory runs out
  • Multithreaded protection, for example per-thread caches or locking around the free lists
  • Reduced memory fragmentation
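To make the idea concrete, here is a minimal, illustrative fixed-size-block pool in C (not thread-safe; real allocators such as tcmalloc and jemalloc add size classes, per-thread caches, and locking):

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct block { struct block *next; } block_t;

typedef struct {
    void    *chunk;     /* the one big up-front allocation        */
    block_t *free_list; /* singly linked list of available blocks */
} pool_t;

int pool_init(pool_t *p, size_t block_size, size_t nblocks) {
    if (block_size < sizeof(block_t)) block_size = sizeof(block_t);
    p->chunk = malloc(block_size * nblocks); /* one syscall-backed request */
    if (!p->chunk) return -1;
    p->free_list = NULL;
    /* Thread every block onto the free list. */
    for (size_t i = 0; i < nblocks; i++) {
        block_t *b = (block_t *)((char *)p->chunk + i * block_size);
        b->next = p->free_list;
        p->free_list = b;
    }
    return 0;
}

void *pool_alloc(pool_t *p) {          /* O(1), no system calls */
    block_t *b = p->free_list;
    if (b) p->free_list = b->next;
    return b;
}

void pool_free(pool_t *p, void *ptr) { /* returns the block to the pool,
                                          never to the kernel */
    block_t *b = ptr;
    b->next = p->free_list;
    p->free_list = b;
}
```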

5. How can I quickly locate disk I/O performance problems

Disk and file system I/O is among the most common sources of performance problems: database performance, Redis performance, network storage performance, and so on are all directly tied to I/O performance. Understanding I/O performance starts from the basics, such as the I/O stack shown below:

Once you are familiar with these principles, you can locate disk and file system I/O bottlenecks using the following figure:
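For quick ad-hoc checks, this minimal C sketch (an illustrative stand-in for tools such as iostat) dumps the per-device read/write counters from /proc/diskstats:

```c
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/diskstats", "r");
    if (!f) { perror("fopen"); return 1; }

    unsigned major, minor;
    char dev[32];
    unsigned long long rd, rd_merged, rd_sectors, rd_ms,
                       wr, wr_merged, wr_sectors, wr_ms;
    while (fscanf(f, "%u %u %31s %llu %llu %llu %llu %llu %llu %llu %llu",
                  &major, &minor, dev, &rd, &rd_merged, &rd_sectors, &rd_ms,
                  &wr, &wr_merged, &wr_sectors, &wr_ms) == 11) {
        /* Skip the remaining fields on the line (in-flight, io_ms, ...). */
        int c;
        while ((c = fgetc(f)) != '\n' && c != EOF)
            ;
        printf("%-10s reads=%llu (%llu ms)  writes=%llu (%llu ms)\n",
               dev, rd, rd_ms, wr, wr_ms);
    }
    fclose(f);
    return 0;
}
```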

6. How to optimize the network performance bottleneck?

The reasons network performance tuning is more difficult include:

  • It requires more foundational knowledge: network protocols, the kernel protocol stack, network infrastructure, network programming interfaces, and distributed systems.
  • It requires more hands-on experience: because so many factors are involved, a network performance problem can have many possible causes, and quickly identifying the bottleneck takes practice.

Network performance optimization touches many areas, but it is also one of the core skills needed to build distributed systems, microservices, and cloud native applications. In practice, a network performance problem can be decomposed and analyzed step by step along the kernel protocol stack’s packet processing path and the generic Linux IP network stack.

The following figure shows the processing path of Linux network package kernel protocol stack:

And here is a schematic diagram of the generic Linux IP network stack:

7. The maximum number of concurrent connections supported by Linux

To answer the question of how many concurrent connections Linux can support, it’s important to understand how Linux identifies a “connection.” Linux distinguishes a connection through a quintuple: protocol, source IP address, destination IP address, source port, destination port.

For a client connecting to one and the same service, the protocol, source IP address, destination IP address, and destination port are all fixed; only the source port varies. Since port numbers are 16 bits, the maximum number of concurrent connections to that service is 64K (2^16).

Are there any other restrictions? In fact, the port number is not the only limit; several knobs in the kernel protocol stack also cap the maximum number of concurrent connections:

  • net.ipv4.ip_local_port_range restricts the ephemeral (source) port range, which starts at 32768 by default (see the sketch below).
  • net.ipv4.tcp_fin_timeout means a closing connection is not released immediately but only after a timeout, so its port stays tied up for a while.
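To see the client-side bound on your own machine, this minimal sketch reads the configured ephemeral port range:

```c
#include <stdio.h>

int main(void) {
    FILE *f = fopen("/proc/sys/net/ipv4/ip_local_port_range", "r");
    int lo, hi;
    if (!f || fscanf(f, "%d %d", &lo, &hi) != 2) {
        perror("ip_local_port_range");
        return 1;
    }
    fclose(f);
    /* Every outbound connection to one destination needs its own
     * source port, so this range bounds client-side concurrency. */
    printf("ephemeral ports %d-%d: at most %d concurrent connections "
           "to a single ip:port\n", lo, hi, hi - lo + 1);
    return 0;
}
```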

On the server side, it’s a different story. The server mainly accepts client requests, so the protocol, destination IP address, and destination port are fixed, while the client’s source IP address and source port vary. The theoretical maximum number of concurrent connections is therefore the number of client IP addresses × the number of client ports: 2^32 × 2^16 = 2^48 for IPv4, and far more for IPv6.

As on the client side, the quintuple is not the only constraint: the kernel protocol stack imposes further limits, such as the maximum number of file descriptors, memory, socket buffers, connection tracking, and so on.

8. How to resolve annoying TIME_WAIT problems

If TIME_WAIT is so annoying, why does it exist at all? The answer lies in how TCP works. The figure below is the TCP state diagram. TCP has two basic mechanisms, the three-way handshake for setup and the four-step FIN exchange for teardown, and TIME_WAIT is a required state during teardown.

As the state diagram shows, the side that closes the connection actively enters TIME_WAIT (while the passive side enters CLOSE_WAIT). TIME_WAIT lasts 2 MSL, which Linux fixes at 60 seconds. The state exists to handle network anomalies such as:

  • Delays and retransmissions occur on the network, so a delayed packet from the old connection could be wrongly accepted by a new connection reusing the same addresses and ports.
  • The final ACK of the close may be lost; TIME_WAIT lets the closer answer the peer’s retransmitted FIN instead of letting that FIN disturb a new connection.

Too many TIME_WAIT connections cause three kinds of harm:

  • They consume memory (each socket keeps kernel state)
  • They occupy port numbers
  • They occupy entries in the connection tracking table

Solutions:

  • Increase net.ipv4.tcp_max_tw_buckets and net.netfilter.nf_conntrack_max.
  • Reduce net.ipv4.tcp_fin_timeout and net.netfilter.nf_conntrack_tcp_timeout_time_wait to release resources sooner.
  • Enable net.ipv4.tcp_tw_reuse (it requires TCP timestamps) so that ports held by TIME_WAIT sockets can be reused for new outbound connections; a related server-side option appears in the sketch below.
  • Widen the local port range net.ipv4.ip_local_port_range.
  • Increase the maximum number of file descriptors via fs.nr_open and fs.file-max.
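On the server side there is also a purely application-level measure: setting SO_REUSEADDR lets a restarted server bind its listening port even while old connections linger in TIME_WAIT. A minimal sketch (port 8080 is an arbitrary example):

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

int main(void) {
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int one = 1;

    /* Without this, bind() fails with EADDRINUSE while previous
     * connections on the port are still in TIME_WAIT. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    struct sockaddr_in addr = { .sin_family = AF_INET,
                                .sin_port   = htons(8080) };
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        perror("bind");
        return 1;
    }
    listen(fd, 511);
    return 0;
}
```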

9. How are container applications different from normal processes

Normal Linux processes have little isolation by default, but not none. Memory management, one of the most basic functions of the operating system, gives each process its own virtual address space, so process memory is isolated. Likewise, different users run processes with different permissions: the root user has administrator privileges, while ordinary users do not.

These isolation mechanisms for normal processes also apply to container applications, and conversely the container’s isolation mechanisms can be applied to ordinary processes; containers simply package these extra mechanisms behind more usable interfaces.

The applications inside the container are normal Linux processes, with some isolation:

  • Namespaces (pid, net, ipc, mnt, uts, user, etc.)
  • Cgroups (cpu, memory, blkio, cpuset, net_cls, net_prio, etc.)
  • Capabilities (the default set includes chown, dac_override, fowner, fsetid, kill, setgid, setuid, setpcap, net_bind_service, net_raw, sys_chroot, mknod, audit_write, setfcap)
    • capsh --print displays the current capabilities
    • setcap modifies a program’s file capabilities
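These are ordinary kernel primitives, usable without any container runtime. As a minimal sketch, the following C program moves itself into a new UTS namespace (this needs CAP_SYS_ADMIN, i.e. root):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Detach from the parent's UTS namespace. */
    if (unshare(CLONE_NEWUTS) < 0) { perror("unshare"); return 1; }

    /* Hostname changes are now visible only inside this namespace;
     * the rest of the system keeps its original hostname. */
    sethostname("sandbox", 7);

    char buf[64];
    gethostname(buf, sizeof(buf));
    printf("hostname inside new UTS namespace: %s\n", buf);
    return 0;
}
```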

10. What do you do when you have a problem

Everyone runs into problems that leave them at a loss; intractable, seemingly supernatural issues accompany every technical person as they grow. So don’t panic: sort the problem out first, then attack it step by step.

Here are some ideas you can use to solve these problems:

  • Grasp the whole picture: sort out the problem first; understanding the problem clearly is more than half the battle.
  • System monitoring: monitor system metrics with the USE method: utilization, saturation, and errors.
  • Application monitoring: monitor application metrics such as request rate, errors, and duration (the RED method). Distributed tracing helps locate problems even faster.
  • Dynamic tracing: instrument the kernel and processes to analyze problems tied to their live state; this is very effective for many hard problems.

Performance problems are not as hard as you might think. Once you understand the basic principles of your application and system, practice a lot, and build a holistic view of overall performance, most performance problems solve themselves naturally.


For more on performance tuning, see Geek Time’s Linux Performance Tuning in Action. You are also welcome to follow the “Chat Cloud Native” WeChat public account to learn more about performance optimization and Kubernetes cloud native topics.