Implementation of leaderboards

As a central service, the leaderboard provides interface calls to all game servers. Its API comes in a synchronous and an asynchronous version; the asynchronous version is implemented with WebFlux. The leaderboard data of the central service is stored in MongoDB. Considering the complexity of the cluster environment and the stability and throughput requirements of the leaderboard itself, we decided not to add a second-level cache in the central service: after authentication, requests from the game servers flow straight to the database (in the cluster version the gateway adds a layer of load balancing). A partial implementation of the WebFlux version is shown below:

Domain model:

@Data
@AllArgsConstructor
@Document
@CompoundIndexes(@CompoundIndex(name = "sortIndex", def = "{rid:1, value:-1, secondValue:-1, rankTime:1}"))
public class RankElement {
    @Id
    private String id;
    /** leaderboard id */
    private String rid;
    /** primary ranking value */
    private long value;
    /** secondary ranking value (first tie-breaker) */
    private long secondValue;
    /** time the score was recorded (second tie-breaker) */
    private long rankTime;
    // ...
}

Service layer:

@Component
public class ReactiveRankService {

    @Autowired
    private ReactiveMongoTemplate template;

    public Flux<RankElement> asyncGetRankList(String rankId, int limit) {
        List<AggregationOperation> operations = new ArrayList<>();
        operations.add(Aggregation.match(Criteria.where("rid").is(rankId)));
        operations.add(Aggregation.sort(Sort.by(Sort.Order.desc("value"),
                Sort.Order.desc("secondValue"), Sort.Order.asc("rankTime"))));
        if (limit > 0) {
            operations.add(Aggregation.limit(limit));
        }
        return template.aggregate(Aggregation.newAggregation(operations), "RankElement", RankElement.class);
    }
}
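A minimal reactive test sketch for this service is shown below — a sketch only, assuming reactor-test is on the classpath and that a test database already contains entries for the leaderboard id used here:

@SpringBootTest
class ReactiveRankServiceTest {

    @Autowired
    private ReactiveRankService service;

    @Test
    void top100IsSortedByValueDesc() {
        StepVerifier.create(service.asyncGetRankList("1_1_1", 100).collectList())
                .assertNext(list -> {
                    Assertions.assertTrue(list.size() <= 100);
                    // entries must come back in descending order of the primary value
                    for (int i = 1; i < list.size(); i++) {
                        Assertions.assertTrue(list.get(i - 1).getValue() >= list.get(i).getValue());
                    }
                })
                .verifyComplete();
    }
}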

External interface layer:

@RestController
@RequestMapping("rank")
public class RankController {
    @Autowired
    private ReactiveRankService service;
​
    @PostMapping("/getRank")
    public Flux<RankElement> asyncGetRank(ServerHttpRequest request,@RequestBody RankVo vo){
        return service.asyncGetRankList(vo.getRankId(),vo.getLimit());
    }
​
}
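The RankVo request object bound by the controller is not shown above; a minimal sketch (assuming Lombok, as in the domain model) could look like this:

@Data
public class RankVo {
    /** id of the leaderboard to query */
    private String rankId;
    /** maximum number of entries to return; values <= 0 mean no limit */
    private int limit;
}

The synchronous interface mentioned earlier is not shown in the original either. As a rough sketch only (the real implementation may differ), its service method could be the blocking MongoTemplate counterpart of the reactive query above, where blockingTemplate is assumed to be an injected MongoTemplate:

public List<RankElement> getRankList(String rankId, int limit) {
    List<AggregationOperation> operations = new ArrayList<>();
    operations.add(Aggregation.match(Criteria.where("rid").is(rankId)));
    operations.add(Aggregation.sort(Sort.by(Sort.Order.desc("value"),
            Sort.Order.desc("secondValue"), Sort.Order.asc("rankTime"))));
    if (limit > 0) {
        operations.add(Aggregation.limit(limit));
    }
    // blocking call; returns the whole result list in one go
    return blockingTemplate.aggregate(Aggregation.newAggregation(operations), "RankElement", RankElement.class)
            .getMappedResults();
}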

Single-machine test

Tool use

Pressure test client: WRK

WRK: Lightweight performance testing tool;

Easy to install, with almost no learning curve — a few minutes is enough to get started;

Built on the operating system's high-performance I/O mechanisms (epoll, kqueue) and an asynchronous, event-driven design, so a handful of threads can generate a large amount of concurrent load;

Linux performance analysis tool: SAR

The sar command reports on file reads and writes, system-call usage, disk I/O, CPU efficiency, memory usage, process activity and IPC-related activity. It is the standard statistics tool for checking system health under Linux: it prints the selected operating-system counters to standard output. sar samples the current state of the system and turns the samples into figures and ratios that describe how the system is running. Its strength is continuous sampling — it can collect a large amount of sample data, both the samples and the analysis results can be saved to file, and it consumes very little system resource.

Database monitoring tool: MongoStat

The test environment

Load generator: XX.XX.xx.49

Machine under test: XX.xx.xx.216 — 6 cores, 10-gigabit bandwidth, file-handle limit 65535

Database: MongoDB 4.2.3, deployed on 10.11.10.30; total data volume 1,000,000 documents, 100 documents returned per query

Pressure test command:

Synchronous: ./wrk -t16 -c300 -d120s --latency http://xx.xx.xx.216:10000/crank/getPl…

Asynchronous: ./wrk -t16 -c300 -d120s --latency http://xx.xx.xx.216:10000/crank/getPl…

The stress-test process

CPU load

At idle:

[root@cqs_xx.xx.xx.216 ~]# sar -q 1 100
Linux 3.10.0-693.el7.x86_64 (cqs_xx.xx.xx.216)   06/06/2021   _x86_64_   (6 CPU)

01:11:16 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
01:11:17 PM         0      2212      0.41      2.11      1.85         0
01:11:18 PM         0      2212      0.41      2.11      1.85         0
01:11:19 PM         0      2212      0.41      2.11      1.85         0
01:11:20 PM         2      2212      0.41      2.11      1.85         0
01:11:21 PM         0      2213      0.41      2.11      1.85         0

Under load:

[root@cqs_xx.xx.xx.216 ~]# sar -q 1 100
Linux 3.10.0-693.el7.x86_64 (cqs_xx.xx.xx.216)   06/06/2021   _x86_64_   (6 CPU)

04:56:34 PM   runq-sz  plist-sz   ldavg-1   ldavg-5  ldavg-15   blocked
04:56:35 PM         4      2402      6.30      3.90      2.73         0
04:56:36 PM         7      2402      6.30      3.90      2.73         0
04:56:37 PM         9      2401      6.30      3.90      2.73         0
04:56:38 PM         8      2401      6.30      3.90      2.73         0
04:56:39 PM         4      2399      6.30      3.90      2.73         0
04:56:40 PM        13      2399      6.20      3.91      2.74         0
04:56:41 PM        16      2399      6.20      3.91      2.74         0
04:56:42 PM         7      2399      6.20      3.91      2.74         0
04:56:43 PM         7      2399      6.20      3.91      2.74         0
04:56:44 PM        11      2399      6.20      3.91      2.74         0

runq-sz: length of the run queue (number of processes waiting to run)

plist-sz: number of processes and threads in the process list

ldavg-1: system load average over the last minute

ldavg-5: system load average over the last 5 minutes

ldavg-15: system load average over the last 15 minutes

The 216 machine has a 6-core CPU; at a load of 1 per core, a total load of 6 means the machine is full. How much load is ideal is debatable — searching around online turns up all sorts of opinions.

Some consider anything below 2x or 3x the core count acceptable, but it is generally agreed that exceeding the core count means the machine is overloaded.

From the statistics above, the CPU load sits around 6.3 and the number of waiting processes exceeds the core count, so the machine is essentially overloaded.

However, a high system load does not necessarily mean the CPU itself is the scarce resource — it only means the run queue is long, and the queued tasks may be waiting on CPU, I/O or other resources. So we also need to look at CPU utilization.

CPU utilization

At idle:

[root@cqs_xx.xx.xx.216 ~]# sar-u-o 1 100 Linux 3.10.0-693.el7.x86_64 (cqs_xx.xx.xx.216) 06/06/2021_x86_64_ (6 CPU) 01:12:33 PM CPU %user % Nice % System % IOwait % Steal % Idle 01:12:34 PM all 3.87 0.00 0.67 0.00 0.00 95.45 01:12:35 PM all 0.00 0.00 0.50 0.00 95.46 01:12:36 PM all 4.03 0.00 0.50 0.00 95.46 01:12:37 PM all 3.53 0.00 0.67 0.00 0.00 95.80 01:12:38 PM all 3.54 0.00 0.51 0.00 0.00 95.95 01:12:39 PM all 4.38 0.00 0.67 0.00 0.00 94.95 01:12:40 PM all 4.24 0.00 1.02 0.00 0.00 94.75Copy the code

Under load:

11:39:46 AM CPU %user %nice %system % IOwait % Steal %idle 11:39:48 AM all 71.57 0.00 9.70 0.00 0.00 18.73 11:39:49 AM all 0.00 0.00 18.53 11:39:50 AM all 24.66 11:39:51 AM all 24.66 11:39:51 AM all 24.66 11:39:51 AM all 24.66 11:39:51 AM all 24.66 11:39:51 AM all 24.66 11:39:51 AM all 24.66 11:39:50 0.00 0.00 22.91 11:39:52 AM ALL 71.21 0.00 10.44 0.00 18.35 11:39:53 AM All 66.44 0.00 9.73 0.00 0.00 23.83 11:39:54 AM All 66.11 0.00 9.06 0.00 24.83 11:39:55 AM ALL 70.72 0.00 9.82 0.00 0.00 19.47 11:39:56 AM ALL 64.66 0.00 8.38 0.00 0.00 26.97 11:39:57 AM ALL 78.93 0.00 10.70 0.00 10.37 11:39:58 AM all 76.76 0.00 9.36 0.00 0.00 13.88 11:39:59 AM 0.00 0.00 0.00 0.00 0.00 12.98Copy the code

%user: percentage of total CPU time spent running at the user (application) level.

%nice: percentage of total CPU time spent on user-level processes running with a modified nice priority.

%system: percentage of total CPU time spent running at the kernel level.

%iowait: percentage of total CPU time spent waiting for I/O operations.

%steal: percentage of time the virtual CPU waited while the hypervisor served another virtual processor.

%idle: percentage of total CPU time during which the CPU was idle.

From the data above, %idle stays at around 18, so CPU utilization is correspondingly high — the CPU hardware does indeed look like a bottleneck.

System running status

At idle:

[root@cqs_xx.xx.xx.216 ~]#vmstat 1 100 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 1 0 576820 2495800 0 2499564 0 0 3 7 0 1 8 2 91 0 0 1 0 576820 2496048 0 2499568 0 0 0 5935 7030 4 1 95 0 0 1 576820 2496172 0 2499572 0 0 359 5915 6936 4 1 95 0 0 0 576820 2495940 0 2499596 0 0 120 5956 6965 4 1 95 0 0 0 576820 2495692 0 2499604 0 0 0 5639 6891 5 1 95 0 0 0 576820 2495692 0 2499604 0 0 0 5437 6795 41 95 0 0 0 576820 2495692 0 2499604 0 0 0 6441 7460 41 95 0 0 0 576820 2495708 0 2499604 0 0 0 5673 7033 4 1 95 0 0 0 576820 2495708 0 2499604 0 0 0 40 5627 6821 5 1 94 0 0Copy the code

Under load:

[root@cqs_xx.xx.xx.216 ~]#vmstat 1 200 procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu----- r b swpd free buff cache si so bi bo in cs us sy id wa st 4 0 1032 4799576 2104 8018456 0 0 5 34 49 24 4 1 95 0 0 15 0 1032 4799840 2104 8018420 0 0 0 17660 5607 58 8 34 0 0 17 0 1032 4799172 2104 8018420 0 0 0 17872 5471 58 9 33 0 0 12 0 1032 4799208 2104 8018420 0 0 0 17336 5241 53 8 38 0 0 10 0 1032 4799284 2104 8018420 0 0 0 19477 4681 65 9 26 0 01 0 1032 4799396 2104 8018428 0 0 0 18248 5205 61 9 30 0 0 17 0 1032 4799116 2104 8018428 0 0 0 16143 3249 61 9 30 0 0 15 0 1032 4799528 2104 8018428 0 0 8 17940 3980 68 10 22 0 03 0 1032 4798860 2104 8018428 0 0 0 18214 4343 63 8 29 0 0 14 0 1032 4798884 2104 8018428 0 0 0 17161 4448 65 10 25 0 0Copy the code

Procs:
  r — number of processes running or waiting for a CPU time slice; if it stays above the number of CPUs for a long time, CPU is insufficient.
  b — number of processes waiting for resources such as I/O or memory swap.
Memory:
  swpd — virtual memory in use, i.e. the amount of memory swapped out to the swap area; a value greater than 0 suggests physical memory is short.
  free — free physical memory.
  buff — memory used as buffers.
  cache — memory used as cache.
Swap:
  si — amount of virtual memory read back from disk per second; a value greater than 0 means physical memory is insufficient or there is a memory leak — find the process that is eating memory.
  so — amount of virtual memory written to disk per second; same interpretation as si. Normally si and so are 0; if they stay non-zero for a long time, memory is insufficient.
IO:
  bi — blocks received from block devices per second (block devices are all the disks and other block devices in the system; the default block size is 1024 bytes).
  bo — blocks sent to block devices per second. If bi/bo stay high, consider improving disk read/write performance.
System:
  in — CPU interrupts per second, including timer interrupts.
  cs — context switches per second. The larger these two values, the more CPU the kernel consumes.
CPU:
  us — percentage of CPU time consumed by user processes. A high us means user processes consume a lot of CPU; if it stays above 50% for a long time, consider optimizing the program or algorithm.
  sy — percentage of CPU time consumed by the kernel. A high sy means the kernel consumes a lot of CPU, for example because of frequent I/O operations.
  id — percentage of time the CPU is idle.
  wa — percentage of time spent waiting for I/O. The higher wa is, the more serious the I/O wait; as a rule of thumb a wa above 20% is worth investigating. It can also be caused by bandwidth bottlenecks on the disk or disk controller (mostly bulk block operations).

All indicators are basically in the normal range. A supplementary note on interrupts: the CPU has an interrupt signal bit, and at the end of each clock cycle it checks whether an interrupt is pending on that bit; if so, it decides according to the interrupt priority whether to suspend the current instruction and execute the interrupt handler instead (essentially CPU-level polling).

The Linux operating system mainly uses two privilege levels, 0 and 3, corresponding to the kernel mode and user mode respectively.

Switching from user mode to kernel mode happens through system calls into kernel interfaces. The system-call switch is implemented with a software interrupt — a "programmed exception" raised deliberately by the programmer. Under Linux this is the assembly instruction int $0x80, which raises a programmed exception with vector 0x80. System calls are implemented through this interrupt/exception mechanism because the exception enters the kernel through the system gate.

On the other hand, we can also observe how specific threads are executing:

top -H -p <pid> — list the threads of the process

printf "%x\n" <tid> — convert a thread ID to hexadecimal

jstack <pid> | grep <hex tid> -C 10 — show 10 lines of context around that thread to see what it is doing

[root@cqs_xx.xx.xx.216 ~]#jstack 8071 |grep 2976 -C 10
        at io.netty.channel.unix.FileDescriptor.readAddress(FileDescriptor.java:141)
        at io.netty.channel.epoll.AbstractEpollChannel.doReadBytes(AbstractEpollChannel.java:349)
        at io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:781)
        at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
        at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.lang.Thread.run(Thread.java:748)

"reactor-http-epoll-2" #49 daemon prio=5 os_prio=0 tid=0x00007fd814013000 nid=0x2976 waiting on condition [0x00007fd8491bc000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000077e048db8> (a java.util.concurrent.CountDownLatch$Sync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
        at com.mongodb.connection.netty.NettyStream$FutureAsyncCompletionHandler.get(NettyStream.java:390)
        at com.mongodb.connection.netty.NettyStream.read(NettyStream.java:190)

Running this a few more times shows that these threads are basically parked in com.mongodb.connection.netty.NettyStream.read(), i.e. waiting for the latch.countDown() that fires once Mongo returns data.

Network adapter traffic

At idle:

[root@cqs_xx.xx.xx.216 ~]#sar -n DEV 1 100
Linux 3.10.0-693.el7.x86_64 (cqs_xx.xx.xx.216)   06/06/2021      _x86_64_        (6 CPU)
​
01:09:25 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:09:26 PM    ens192     30.00      5.00      2.94      0.63      0.00      0.00      0.00
01:09:26 PM        lo     78.00     78.00      9.86      9.86      0.00      0.00      0.00
​
01:09:26 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:09:27 PM    ens192     44.00     12.00      3.72      1.53      0.00      0.00      5.00
01:09:27 PM        lo     68.00     68.00      6.98      6.98      0.00      0.00      0.00
​
01:09:27 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:09:28 PM    ens192     37.00     28.00      4.79     67.47      0.00      0.00      0.00
01:09:28 PM        lo     91.00     91.00      9.06      9.06      0.00      0.00      0.00
​
01:09:28 PM     IFACE   rxpck/s   txpck/s    rxkB/s    txkB/s   rxcmp/s   txcmp/s  rxmcst/s
01:09:29 PM    ens192     25.00      7.00      3.36      0.83      0.00      0.00      0.00
01:09:29 PM        lo     58.00     58.00      7.54      7.54      0.00      0.00      0.00
​
​

Under load:

[root@cqs_xx.xx.xx.216 ~]# sar-n DEV 1 100 Linux 3.10.0-693.el7.x86_64 (cqs_xx.xx.xx.216) 06/06/202_x86_64_ (6 CPU) 01:05:59 PM IFACE RXPCK /s TXPCK /s rxkB/s txkB/s RXCMP /s TXCMP /s RXMCST /s 01:06:00 PM ENS192 10060.00 9928.00 34921.85 26883.82 0.00 0.00 01:06:00 PM lo 76.00 76.00 8.06 8.06 0.00 0.00 0.00 01:06:00 PM IFACE RXPCK /s rxkB/s TxkB /s RXCMP /s RXMCST /s 01:06:01 PM ENS192 10173.00 9983.00 35802.35 27512.99 0.00 0.00 01:06:01 PM lo 105.00 105.00 14.68 14.68 0.00 0.00 01:06:01 PM IFACE RXPCK /s TXPCK /s rxkB/s txkB/s RXCMP /s TXCMP /s RXMCST /s 01:06:02 PM ENS192 9681.19 9652.48 34967.50 26923.68 0.00 0.00 01:06:02 PM LO 67.33 67.33 7.78 7.78 0.00 0.00 01:06:02 PM IFACE RXPCK /s TXPCK /s rxkB/s txkB/s RXCMP /s TXCMP /s RXMCST /s 01:06:03 PM ENS192 9727.00 9730.00 35323.93 27125.50 0.00 0.00 01:06:03 PM lo 60.00 60.00 7.77 7.77 0.00 0.00 01:06:03 PM IFACE RXPCK /s TXPCK /s rxkB/s TxkB /s RXCMP /s RXMCST /s 01:06:04 PM ENS192 7716.00 7662.00 27382.98 21179.04 0.00 0.00 01:06:04 PM lo 48.00 48.00 4.03 4.03 0.00 0.00 01:06:04 PM IFACE RXPCK /s TXPCK /s rxkB/s txkB/s RXCMP /s TXCMP /s RXMCST /s 01:06:05 PM Ens192 9132.00 8872.00 30950.51 23759.11 0.00 0.00 01:06:05 PM lo 79.00 79.00 9.34 9.34 0.00 0.00Copy the code

IFACE: name of the local network interface

rxpck/s: packets received per second

txpck/s: packets sent per second

rxkB/s: kilobytes received per second

txkB/s: kilobytes sent per second

rxcmp/s: compressed packets received per second

txcmp/s: compressed packets sent per second

rxmcst/s: multicast packets received per second

The 216 machine has a 10-gigabit NIC (verifiable with ethtool <NIC name>), which in theory supports about 1,250,000 kB/s (10,000 Mb/s = 1,250 MB/s). From the measurements above, the receive/send throughput (mainly rxkB/s and txkB/s) is far below the NIC's limit.

We can also use sar -n EDEV to check whether the NIC is dropping traffic:

[root@cqs_xx.xx.xx.216 ~]# sar-n EDEV 1 100 Linux 3.10.0-1127.18.2.el7.x86_64 (cqs_xx.xx.xx.216) 06/01/2021-x86_64_ (6) CPU) 05:15:22 PM IFACE rxerr/s txerr/s coll/s rxdrop/s txdrop/s txcarr/s rxfram/s rxfifo/s txfifo/s 05:15:23 PM ens192 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:15:23 PM Lo 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:15:23 PM Veth93e25d0 0.00 0.00 0.00 0.00 0.00 0.00 05:15:23 PM VethC322396 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 05:15:23 PM Docker0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00Copy the code

The rxdrop/s and txdrop/s columns give the number of packets dropped because the receive/send buffers were full.

TCP Connection Status

There are 300 connections during the stress test. Using netstat -s | grep overflow we can read the cumulative counter before and after the test; the difference is the number of connections dropped during this run:

[root@cqs_xx.xx.xx.216 ~]#netstat -s |grep overflow
    57414 times the listen queue of a socket overflowed
[root@cqs_xx.xx.xx.216 ~]#netstat -s |grep overflow
    57434 times the listen queue of a socket overflowed
​

As you can see, 20 connections were dropped.

Here’s a sidebar about overflow:

During the TCP three-way handshake, the Linux kernel maintains two queues:

The half-connection queue, also called the SYN queue, and the full-connection queue, also called the accept queue. When the server receives a SYN from a client, the kernel puts the connection into the half-connection queue and responds with SYN+ACK. The client then returns an ACK; when the server receives this third-handshake ACK, the kernel removes the connection from the half-connection queue, creates a full connection, adds it to the accept queue and waits for the process to take it out via accept(). Connection information can be obtained with ss -lnt | grep <port>:

[root@cqs_xx.xx.xx.216 ~]#ss -lnt |grep 10000
State      Recv-Q Send-Q    Local Address:Port    Peer Address:Port
LISTEN     0      128                [::]:10000            [::]:*

The meanings of Recv-Q and Send-Q depend on the socket state:

In the LISTEN state:
  Recv-Q — current size of the full-connection queue, i.e. TCP connections that have completed the three-way handshake and are waiting to be accept()ed by the server
  Send-Q — maximum length of the full-connection queue

In other states:
  Recv-Q — bytes received but not yet read by the application
  Send-Q — bytes sent but not yet acknowledged

When the TCP full-connection queue is exceeded, the server drops subsequent TCP connections, and the number of dropped connections is counted — this is exactly the number we collected above with netstat -s | grep overflow.

From the test results we know that when the server handles many concurrent requests, a full-connection queue that is too small easily overflows; once it overflows, subsequent requests are dropped, so the request count on the server stops growing.

The maximum size of the TCP full-connection queue is the smaller of somaxconn and the backlog passed to listen(), i.e. min(somaxconn, backlog).
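The backlog half of that formula is whatever the application's server framework passes to listen(). As a hedged sketch only — assuming Spring Boot on Reactor Netty 1.x, not taken from the project's actual configuration — the application-side backlog could be raised like this:

@Configuration
public class NettyBacklogConfig {

    @Bean
    public WebServerFactoryCustomizer<NettyReactiveWebServerFactory> backlogCustomizer() {
        // Raise the backlog Reactor Netty passes to listen(); the effective accept
        // queue is still capped by net.core.somaxconn on the host.
        return factory -> factory.addServerCustomizers(
                httpServer -> httpServer.option(ChannelOption.SO_BACKLOG, 1000));
    }
}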

#cat tcp_max_syn_backlog
65535
[root@cqs_xx.xx.xx.216 ipv4]#cat /proc/sys/net/core/somaxconn
128

I don’t know why ops made the backlog so big… But it doesn’t matter, we just need to change somaxconn:

[root@cqs_xx.xx.xx.216 ~]#echo 1000|sudo dd of=/proc/sys/net/core/somaxconn
0+1 records in
0+1 records out
5 bytes (5 B) copied, 7.9866e-05 s, 62.6 kB/s
​

The change will take effect only after the service is restarted:

[root@cqs_xx.xx.xx.216 ~]#ss -lnt |grep 10000
State      Recv-Q Send-Q    Local Address:Port    Peer Address:Port
LISTEN     0      1000               [::]:10000            [::]:*

Run the stress test again and check whether connections are still being dropped:

root@localhost:~# netstat -s|grep overflow
    4046 times the listen queue of a socket overflowed
root@localhost:~# netstat -s|grep overflow
    4046 times the listen queue of a socket overflowed
​

No.

Memory and swap space

The 216 machine (where the service is deployed)

[root@cqs_xx.xx.xx.216 ~]#smem --sort swap
  PID User     Command                         Swap      USS      PSS      RSS
55929 root     java -jar match-plus-1.0-sn        0  2195172  2196396  2201960

30 Service (Database Deployment Service)

root@localhost:~# smem --sort swap
  PID User     Command                         Swap      USS      PSS      RSS
65002 polkitd  mongod --auth --bind_ip_all      172   589456   589764   590076

RSS (Resident Set Size): physical memory actually resident for the process, including memory occupied by shared libraries. Summing the RSS of all processes usually exceeds the memory used by the whole system, because shared memory is counted once per process.

PSS (Proportional Set Size): resident memory with the shared-library portion split proportionally among the processes that use it. The sum of all processes' PSS is the system's memory usage; it is more accurate because shared memory is averaged out and spread across processes.

USS (Unique Set Size): memory used exclusively by the process — the private part of PSS. It counts only the process's own memory and excludes anything shared.

Swap: swapping usually happens when the application needs more memory than is physically available. To handle this, the operating system configures a swap area, usually on a physical disk; when physical memory runs out, the OS temporarily swaps some memory pages out to disk — normally the least-frequently-accessed ones, so the busy regions are unaffected. When swapped-out memory is accessed again it has to be read back from the swap area page by page, which hurts application performance.

Garbage collection performs particularly badly when swapping occurs, because most of the areas the collector touches are cold — in other words, the garbage collector itself can trigger swapping. If part of the heap being collected has been swapped out to disk, it has to be paged back in so the collector can scan it, which dramatically lengthens collection time; and if the collector is in a stop-the-world phase at that moment, the pause is extended accordingly.

We paid special attention to Mongo’s memory usage, which is basically within the normal range.

Mongo performance analysis

The index optimization

For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:

  • First, add those fields against which Equality queries are run.
  • The next fields to be indexed should reflect the Sort order of the query.
  • The last fields represent the Range of data to be accessed.

This is MongoDB's ESR rule for compound indexes: build the index in the order Equality (exact-match) field – Sort fields – Range fields. The index in this test is built as ranking id – sort fields, which satisfies the rule.

Covered queries return results from an index directly without having to access the source documents, and are therefore very efficient.

As for covered queries: because we need the full content of each leaderboard entry, going back to the documents cannot be avoided, so a covered query is not possible here.
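For illustration only — this is not what the service does, since it needs the full documents — if a caller only needed the indexed fields, a projected variant like the following hypothetical method could be served entirely from the compound index:

public Flux<RankElement> asyncGetRankValuesOnly(String rankId, int limit) {
    Query query = Query.query(Criteria.where("rid").is(rankId))
            .with(Sort.by(Sort.Order.desc("value"), Sort.Order.desc("secondValue"), Sort.Order.asc("rankTime")));
    // return only fields contained in the compound index (and drop _id) so the query can be covered
    query.fields().include("rid").include("value").include("secondValue").include("rankTime").exclude("_id");
    if (limit > 0) {
        query.limit(limit);
    }
    return template.find(query, RankElement.class, "RankElement");
}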

Other official optimization suggestions for Mongo indexes are not listed here one by one; for details see www.mongodb.com/blog/post/p…

Query Statement Analysis

Running db.RankElement.find({rid:"1_1_1"}).sort({"value":1,"secondValue":1,"rankTime":-1}).limit(100).explain(true) and analyzing the report produced by explain shows that executionTimeMillis (the actual execution time) is under 1 ms and that nReturned = totalDocsExamined = 100 — the ideal case where the index is hit completely — so the query statement itself is basically fine.
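The same analysis can also be run from the application instead of the mongo shell. A hedged sketch using ReactiveMongoTemplate.executeCommand (this helper is illustrative, not part of the project code; field names mirror the shell query above):

public Mono<Document> explainTop100(String rankId) {
    // wraps the find in an explain command with executionStats verbosity
    String command = "{ explain: { find: \"RankElement\", "
            + "filter: { rid: \"" + rankId + "\" }, "
            + "sort: { value: -1, secondValue: -1, rankTime: 1 }, "
            + "limit: 100 }, verbosity: \"executionStats\" }";
    return template.executeCommand(command);
}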

Slow Query logs

35000 requests without 1 slow query (over 100ms)

We lower the slow-query threshold with db.setProfilingLevel(1,{slowms:10}) so that anything taking longer than 10 milliseconds is logged:

dbRes1:PRIMARY> db.setProfilingLevel(1,{slowms:10})
{
        "was" : 1,
        "slowms" : 5,
        "sampleRate" : 1,
        "ok" : 1,
        "$gleStats" : {
                "lastOpTime" : Timestamp(0, 0),
                "electionId" : ObjectId("7fffffff0000000000000001")
        },
        "lastCommittedOpTime" : Timestamp(1622540805, 1),
        "$configServerState" : {
                "opTime" : {
                        "ts" : Timestamp(1622540794, 1),
                        "t" : NumberLong(1)
                }
        },
        "$clusterTime" : {
                "clusterTime" : Timestamp(1622540805, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1622540805, 1)
}
dbRes1:PRIMARY> db.getProfilingStatus()
{
        "was" : 1,
        "slowms" : 10,
        "sampleRate" : 1,
​

Running the stress test again produces 10 slow queries (over 10 ms), which is acceptable.

If the mongodb log grows too large, you can rotate it: the current log is renamed to a date-stamped file and a new log file is created. Specifically:

  • ps -ef | grep mongod
  • kill -SIGUSR1 <mongod pid>

mongostat

insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time *0 *0 * * 0 0 5873 1950 694 | 0 0.1% 0 24.6% 9.63 G m 0 | | 0 0 1 2.64 m, 54.2 m 334 dbRes1 PRI May 30 22:48:11. * * * * 0 0 0 0 6231 355 2090 | 0 0.1% 24.6% 0, 9.63 G 694 m 2 | | 0 0 0, 2.81 m, 57.8 m 334 dbRes1 PRI May 30 22:48:12. * * * * 0 0 0 0 6209 355 2074 | 0 0.1% 0 24.6% 9.63 G of 694 m 2 | | 0 0 0, 2.80 m, 57.5 m 334 dbRes1 PRI May 30 22:48:13. * * * 0 0 0 1 6193 355 2071 694 | 0 0.1% 0 24.6% 9.63 G m 5 | | 0 0 0 2.79 m, 57.2 m 334 dbRes1 PRI May 30 22:48:14. * * * 0 0 0 1 6197 355 2071 694 | 0 0.1% 0 24.6% 9.63 G m 0 9 | | 0 0 2.79 m, 57.4 m 334 22:48:15 dbRes1 PRI May 30. 354 * * * * 0 0 0 0 6208 2071 694 | 0 0.1% 0 24.6% 9.63 G m 0 5 | | 0 0 2.79 m, 57.4 m 334 dbRes1 PRI May 30 22:48:16. 358 * * * * 0 0 0 0 6069 2015 694 | 0 0.1% 0 24.6% 9.63 G m 4 | | 0 0 0, 2.73 m, 56.2 m 334 dbRes1 PRI May 30 22:48:17. 354 * 0 * * * 0 0 0 5981 2004 | 0 0.1% 0 24.6% 9.63 G 694 m 3 3 | | 0 0, 2.70 m, 55.3 m 334 dbRes1 PRI May 30 22:48:18. 354 * * * * 0 0 0 0. 6096 2023 | 0 0.1% 0 24.6% 9.63 G 694 m 3 | | 0 0 0, 2.74 m, 56.4 m 334 dbRes1 PRI May 30 22:48:19. * * * * 0 0 0 0 6252 355 2077 | 0 0.1% 24.6% 0 9.63 G of 694 m 2 | | 0 0 0, 2.81 m, 57.8 m 334 dbRes1 PRI May 30 22:48:20. 353Copy the code

Generally, we focus on these indicators through Mongostat:

used: percentage of the WiredTiger storage engine's cache in use

dirty: percentage of dirty data in the WiredTiger cache

res: resident physical memory (MB)

qrw/arw: lengths of the queued and active client read/write operations; values that stay high indicate a bottleneck

net_out: MongoDB's outbound traffic (in the cluster test I also used it as another angle on the cluster's horizontal-scaling ability)

The eviction strategy of the WiredTiger storage engine is elegantly designed:

Parameter                 Default   Description
eviction_target           80       when cache used exceeds eviction_target, the background evict threads start evicting pages
eviction_trigger           95       when cache used exceeds eviction_trigger, user threads also start evicting pages
eviction_dirty_target       5       when cache dirty exceeds eviction_dirty_target, the background evict threads start evicting dirty pages
eviction_dirty_trigger     20       when cache dirty exceeds eviction_dirty_trigger, user threads also start evicting dirty pages

If mongostat shows used and dirty staying above eviction_trigger and eviction_dirty_trigger, user request threads will also have to perform eviction (which is expensive) and request latency will rise; at that point we can basically conclude that mongodb has a resource problem. From the data above, however, we are still a long way from Mongo's performance bottleneck.

Final throughput result

The synchronous interface

  • 8 threads, 100 connections

    root@localhost:wrk-master# ./wrk -t8 -c100 -d30s --latency http://xx.xx.xx.216:10000/crank/getPlayerRank/1_1_50
    Running 30s test @ http://xx.xx.xx.216:10000/crank/getPlayerRank/1_1_50
      8 threads and 100 connections
      Thread Stats   Avg      Stdev
        Latency    80.95ms   21.98ms
      Latency Distribution
         75%   84.85ms
         90%   91.94ms
         99%  119.76ms
      35550 requests in 30.03s, 686.81MB read
    Requests/sec:   1183.88
    Transfer/sec:     22.87MB
  • 8 threads, 200 connections

    root@localhost:wrk-master# ./wrk -t8 -c200 -d30s --latency http://xx.xx.xx.216:10000/crank/getPlayerRank/1_1_50
    Running 30s test @ http://xx.xx.xx.216:10000/crank/getPlayerRank/1_1_50
      8 threads and 200 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   154.29ms   14.25ms  344.59ms   87.58%
        Req/Sec   162.45     36.30    356.00     66.53%
      Latency Distribution
         50%  153.55ms
      38821 requests in 30.03s, 750.00MB read
    Requests/sec:   1292.76
    Transfer/sec:     24.98MB

The asynchronous interface

  • 8 threads, 100 connections

    root@localhost:wrk-master# ./wrk -t8 -c100 -d30s --latency http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
    Running 30s test @ http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
      8 threads and 100 connections
      requests in 30.08s, 1.06GB read
    Requests/sec:   1952.13
    Transfer/sec:     36.08MB
  • 8 threads, 200 connections

    root@localhost:wrk-master# ./wrk -t8 -c200 -d30s --latency http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
    Running 30s test @ http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
      8 threads and 200 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency    47.63ms   18.77ms  189.96ms   69.58%
        Req/Sec   254.37     35.42    353.00     71.27%
      Latency Distribution
         50%   44.67ms
         75%   57.79ms
         90%   72.09ms
         99%  105.99ms
      60856 requests in 30.07s, 1.10GB read
    Requests/sec:   2023.79
    Transfer/sec:     37.41MB

Cluster testing

Tool use

Nginx performance analysis tool: OpenResty-Systemtap-Toolkit; stapxx

SystemTap and the kernel debuginfo packages need to be installed.

OpenResty maintains two open-source projects, openresty-systemtap-toolkit and stapxx: toolsets packaged on top of SystemTap for real-time analysis and diagnosis of Nginx and OpenResty, covering common debugging scenarios such as on-CPU time, off-CPU time, shared dictionaries, garbage collection, request latency, memory pools, connection pools and file access.

FlameGraph generation tool: FlameGraph

The test environment

Load generator: XX.XX.xx.49 — 8 cores, 10-gigabit bandwidth

Nginx server: xx.xx.xx.49

Upstream server group:

xx.xx.xx.17 — 16 cores, 16 GB memory, 10-gigabit bandwidth

xx.xx.xx.216 — 6 cores, 20 GB memory, 10-gigabit bandwidth

xx.xx.xx.74 — 4 cores, 8 GB memory, 10-gigabit bandwidth

Database: MongoDB 4.2.3, total data volume 1,000,000 documents, 100 documents returned per query

Deployed on 10.11.10.30 with 12 cores and 12 GB memory

Pressure test command:

Synchronous: ./wrk -t16 -c300 -d120s --latency http://localhost:80/crank/getPlayerRank/1_1_50

Asynchronous: ./wrk -t16 -c300 -d120s --latency http://localhost:80/crank/getPlayerRankAsync/1_1_50

The stress-test process

Single-machine test results for each server

  • xx.xx.xx.74

    root@localhost:wrk-master# ./wrk -t16 -c300 -d30s --latency http://xx.xx.xx.74:10000/crank/getPlayerRankAsync/1_1_50
    Running 30s test @ http://xx.xx.xx.74:10000/crank/getPlayerRankAsync/1_1_50
      16 threads and 300 connections
      51120 requests in 30.03s, 0.92GB read
    Requests/sec:   1702.40
    Transfer/sec:     31.47MB
  • xx.xx.xx.216

    root@localhost:wrk-master# ./wrk -t16 -c300 -d30s --latency http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
    Running 30s test @ http://xx.xx.xx.216:10000/crank/getPlayerRankAsync/1_1_50
      16 threads and 300 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   144.67ms   27.78ms  456.42ms   75.23%
        Req/Sec   124.80     23.99    190.00     66.23%
      Latency Distribution
         50%  144.93ms
         75%  159.49ms
         90%  174.50ms
         99%  227.56ms
      59613 requests in 30.04s, 1.08GB read
    Requests/sec:   1984.68
    Transfer/sec:     36.69MB
  • xx.xx.xx.17

    root@localhost:wrk-master# ./wrk -t16 -c300 -d30s --latency http://xx.xx.xx.17:10000/crank/getPlayerRankAsync/1_1_50
    Running 30s test @ http://xx.xx.xx.17:10000/crank/getPlayerRankAsync/1_1_50
      16 threads and 300 connections
      Thread Stats   Avg      Stdev     Max   +/- Stdev
        Latency   121.32ms   45.76ms  485.47ms   74.13%
        Req/Sec   148.95     35.55    260.00     69.42%
      Latency Distribution
         50%  117.09ms
         75%  145.59ms
         90%  174.42ms
         99%  267.22ms
      71274 requests in 30.03s, 1.29GB read
    Requests/sec:   2673.33
    Transfer/sec:     43.87MB

The first test

root@localhost:wrk-master# ./wrk -t16 -c500 -d40s --latency http://localhost:80/crank/getPlayerRankAsync/1_1_50
Running 40s test @ http://localhost:80/crank/getPlayerRankAsync/1_1_50
  16 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   136.44ms  112.89ms  849.86ms   66.91%
    Req/Sec   248.98     75.84    520.00     70.89%
  Latency Distribution
     50%  125.99ms
     75%  211.94ms
     90%  285.44ms
     99%  997.16ms
  158584 requests in 40.04s, 2.88GB read
Requests/sec:   3960.99
Transfer/sec:     73.59MB

The first test results are not as good as expected. Let’s first analyze the performance of Nginx:

Nginx performance analysis

Use the top command to see an approximate CPU load and usage:

Tasks: 297 total, 2 running, 295 sleeping, 0 stopped, 0 zombie %Cpu0 : 5.0us, 8.4SY, 0.0ni, 81.3ID, 0.0wa, 0.0hi, 5.4si, 0.0st %Cpu1: 4.4us, 3.7sy, 0.0ni, 89.9 ID, 0.0wa, 0.0hi, 2.0si, 0.0st %Cpu2: 5.4US, 6.7SY, 0.0ni, 82.6ID, 0.0wa, 0.0hi, 5.4si, 0.0st %Cpu3: 5.8us, 8.2SY, 0.0Ni, 80.6id, 0.7wa, 0.0hi, 4.8si, 0.0st %Cpu4: 6.7us, 7.4sy, 0.0ni, 0.01id, 0.0wa, 0.0hi, 4.7si, 0.0st %Cpu5: 4.4us, 7.7sy, 0.0ni, 83.2ID, 0.0wa, 0.0hi, 4.7si, 0.0st %Cpu6: 6.4US, 5.7SY, 0.0ni, 83.4ID, 0.0wa, 0.0hi, 4.4si, 0.0st %Cpu7: 5.8US, 3.8SY, 0.0Ni, 83.2ID, 0.0wa, 0.0hi, 7.2Si, 0.0st KiB Mem: 32771584 total, 4231832 free, 10005316 used, 18534436 buff/cache KiB Swap: 8388604 total, 8240380 free, 148224 used. 21692928 avail MemCopy the code

Monitor a worker process with ngx-single-req-latency.sxx to observe the time distribution of a single request:

root@localhost:samples# ./ngx-single-req-latency.sxx  -x 51561
Start tracing process 51561 (/usr/local/nginx/sbin/nginx)...
​
[1622455837141478] pid:51561 GET /crank/getPlayerRankAsync/1_1_50
    total: 48570us, accept() ~ header-read: 44us, rewrite: 7us, pre-access: 13us, access: 11us, content: 48475us
    upstream: connect=8us, time-to-first-byte=48047us, read=0us
​

As you can see, most of the time is spent in time-to-first-byte waiting for an initial response from the upstream server, and nginx’s time for processing requests and establishing connections is negligible.

Let's look at Nginx performance from another angle — the flame graph.

Producing a flame graph involves two stages, data collection and image generation, and there are many tools for each. Here stapxx is used for collection and FlameGraph to render the graph.

The stapxx command used is ./samples/sample-bt.sxx, which samples on-CPU time to analyze CPU usage.

Here is a script that automatically generates the flame chart:

export PATH=/usr/local/nginx-tool/stapxx-master:/usr/local/FlameGraph/FlameGraph-master:/usr/local/nginx-tool/openresty-systemtap-toolkit-master:$PATH
pid=$(ps aux |grep '/usr/local/nginx/sbin/nginx'|grep master |awk '{print $2}')
./samples/sample-bt.sxx --arg time=20 --skip-badvars -D MAXSKIPPED=100000 -D MAXMAPENTRIES=100000 --master $pid > a.bt
stackcollapse-stap.pl a.bt > a.cbt
flamegraph.pl a.cbt > a.svg

The Y-axis represents the call stack, and each layer is a function. The deeper the call stack, the higher the flame, with the executing function at the top and its parent functions below.

The X-axis represents the number of samples. The wider a function occupies along the X-axis, the more times it is drawn, or the longer it takes to execute. Note that the X-axis does not represent time, but rather all call stacks are grouped in alphabetical order.

When reading a flame graph, look at which functions along the top occupy the most width: any wide "plateau" at the top indicates a function that may have a performance problem.

In the graph, writev and readv are the buffer write and read calls, and epoll_wait is the epoll mechanism waiting for events and picking up read/write events on all network interfaces. The top-level functions are essentially all network-layer work; internal request processing takes very little time.

Adjust the load balancing algorithm

Because the performance of the upstream servers differs significantly, the default round-robin load balancing is not appropriate; each machine needs a suitable adjustment. The guiding principle is to push each machine toward the maximum load seen in its single-machine test, while watching the number of new connections (sar -n SOCK), the 1-minute CPU load and CPU (user-level) usage.

In the first test, the CPU load on the 17 machine was only about half of what it reached in its single-machine limit test. After repeatedly adjusting the weights, the best ratio turned out to be:

  upstream match{
          least_conn;
          server xx.xx.xx.74:10000 weight=1 ;
          server xx.xx.xx.216:10000 weight=1 ;
          server xx.xx.xx.17:10000 weight=4 ;
          keepalive 1000;
    }
​

Under the weighted least-connections balancing policy, client requests are assigned, taking each server's weight in the upstream group into account, to the proxied server with the fewest active connections.

Calculation process:

1. Compare the ratio of active connections (conns) to weight for each backend, and send the client request to the one with the lowest ratio.

2. If several backends have the same ratio and server A handled the previous request, the current request is chosen between server B and server C (see the sketch below).
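A toy illustration of this selection rule — not nginx's actual code, just the conns/weight comparison described above, with names of my own choosing:

class Backend {
    final String addr;
    final int weight;
    int activeConns;

    Backend(String addr, int weight) {
        this.addr = addr;
        this.weight = weight;
    }
}

// Pick the backend with the smallest activeConns/weight ratio,
// preferring a different backend than the previous one when there is a tie.
static Backend pickLeastConn(List<Backend> backends, Backend previous) {
    Backend best = null;
    for (Backend b : backends) {
        if (best == null) {
            best = b;
            continue;
        }
        // b.activeConns / b.weight < best.activeConns / best.weight, without division
        long lhs = (long) b.activeConns * best.weight;
        long rhs = (long) best.activeConns * b.weight;
        if (lhs < rhs || (lhs == rhs && best == previous)) {
            best = b;
        }
    }
    return best;
}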

root@localhost:wrk-master# ./wrk -t16 -c500 -d30s --latency http://localhost:80/crank/getPlayerRankAsync/1_1_50
Running 30s test @ http://localhost:80/crank/getPlayerRankAsync/1_1_50
  16 threads and 500 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency   131.78ms  162.92ms   1.95s    91.92%
    Req/Sec   308.96     75.80    636.00    71.26%
  Latency Distribution
     50%   86.44ms
     75%  147.74ms
     90%  258.26ms
     99%  936.16ms
  147646 requests in 30.10s, 2.68GB read
  Socket errors: connect 0, read 0, write 0, timeout 10
Requests/sec:   4905.67
Transfer/sec:     91.14MB

Throughput has also improved.

Excessive number of time-waits

Across multiple runs, throughput gradually decreased from test to test. Following the single-machine analysis approach above, the problem was finally located in the network monitoring: too many sockets in TIME_WAIT (sar -n SOCK):

Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     05/29/2021      _x86_64_        (6 CPU)
​
11:54:00 PM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
11:54:01 PM      1871       765         0         0         0      6000
11:54:02 PM      1874       772         0         0         0      6000
11:54:03 PM      1875       769         0         0         0      6000
11:54:04 PM      1873       762         0         0         0      6000
11:54:05 PM      1875       762         0         0         0      6000
11:54:06 PM      1875       776         0         0         0      6000
11:54:07 PM      1875       765         0         0         0      6000

The output above was captured on the 216 upstream server. After each stress test the TIME_WAIT count jumped from just over a hundred to 6000 and then gradually fell back.

As background: TIME_WAIT is the state a TCP connection enters on the side that closes actively, after it has received the peer's FIN and sent the final ACK. TCP is a reliable protocol on top of an unreliable network — that last ACK may be lost, triggering a FIN retransmission from the passive side — so the active side has to linger in TIME_WAIT for 2MSL.

A large number of TIME_WAIT sockets means either that short-lived connections are being used, or that the server-side connection pool is too small: apart from the pooled connections, connections are constantly created and destroyed, and the side that actively closes them ends up in TIME_WAIT waiting to be reclaimed.

By default Nginx talks to the upstream servers over short-lived connections, so the upstream connection is closed after every proxied request. The following directives should be added to the Nginx proxy configuration (the location or server block) to switch the upstream connection to keep-alive:

proxy_set_header Connection "";
proxy_http_version 1.1;

Observe the connection TIME_WAIT again:

[root@cqs_xx.xx.xx.216 ~]#sar -n SOCK 1 100
Linux 3.10.0-1127.18.2.el7.x86_64 (cqs_xx.xx.xx.216)     05/31/2021      _x86_64_        (6 CPU)
​
10:34:28 PM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
10:34:29 PM      1874        76         1         0         0        75
10:34:30 PM      1874        76         1         0         0        75
10:34:31 PM      1874        76         1         0         0        75
10:34:31 PM      1874        76         1         0         0        75
Average:         1874        76         1         0         0        75
​

The TIME_WAIT number of upstream servers is basically stable at about 75 and no longer increases.

However, another problem arises when time-wait is moved to the Nginx server…

​
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     05/31/2021      _x86_64_        (8 CPU)
​
10:44:47 PM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
10:44:48 PM      2801      1668         0         0         0      6000
10:44:49 PM      2801      1668         0         0         0      6000
10:44:50 PM      2802      1669         0         0         0      6000
10:44:51 PM      2801      1672         0         0         0      6000
10:44:52 PM      2801      1670         0         0         0      5708
10:44:53 PM      2801      1668         0         0         0      6000
10:44:54 PM      2789      1665         0         0         0      6000
​
​

The reason is that Nginx itself is now the side that actively closes its connections to the upstream. The fix is the keepalive directive in the upstream block; the nginx documentation describes it as follows:

Activates the cache for connections to upstream servers.

The connections parameter sets the maximum number of idle keepalive connections to upstream servers that are preserved in the cache of each worker process. When this number is exceeded, the least recently used connections are closed.

It should be particularly noted that the keepalive directive does not limit the total number of connections to upstream servers that an nginx worker process can open. The connections parameter should be set to a number small enough to let upstream servers process new incoming connections as well.

keepalive is thus the maximum number of idle connections kept per worker. Since this connection cache is not used by default, it is no surprise that the client side (Nginx) destroys the connection after every request. We add keepalive to the upstream block:

upstream match{
          least_conn;
          server xx.xx.xx.74:10000 weight=1 ;
          server xx.xx.xx.216:10000 weight=1 ;
          server xx.xx.xx.17:10000 weight=4 ;
          keepalive 1000;
    }
​
Linux 3.10.0-957.el7.x86_64 (localhost.localdomain)     06/01/2021      _x86_64_        (8 CPU)

10:31:56 AM    totsck    tcpsck    udpsck    rawsck   ip-frag    tcp-tw
10:31:57 AM      1296       172         0         0         0       126
10:31:58 AM      1296       172         0         0         0       111
10:31:59 AM      1296       172         0         0         0       111
10:32:00 AM      1299       174         0         0         0       112
10:32:01 AM      1300       174         0         0         0       116

Time-wait no longer grows.

The thundering-herd effect and accept_mutex

When Nginx accepts requests, the master process first calls listen() to create the listening TCP socket, then forks multiple worker processes; each worker blocks in accept() waiting for connection requests. When a TCP connection request arrives, all of these workers are woken up and race for it; the worker that wins the race gets the connection request, establishes the TCP connection and handles it, while the workers that lose receive EAGAIN and go back to blocking in accept().

Multiple processes being woken up at the same time for a single connection request is the thundering-herd effect. Under high concurrency most of those wake-ups are wasted — the processes fail to grab the connection and go back to sleep — which costs the system real performance.

Nginx enables accept_mutex by default, which avoids the thundering-herd problem — but is the problem really that serious? Nginx's author Igor Sysoev once explained it like this:

OS may wake all processes waiting on accept() and select(), this is called thundering herd problem. This is a problem if you have a lot of workers as in Apache (hundreds and more), but this insensible if you have just several workers as nginx usually has. Therefore turning accept_mutex off is as scheduling incoming connection by OS via select/kqueue/epoll/etc (but not accept()).

To put it simply: Apache may run hundreds or even thousands of processes, so a thundering herd has a noticeable impact; Nginx's worker_processes is usually set to the number of CPUs — a few dozen at most — so even if the herd is woken up, the impact is relatively small.

Suppose you have a hundred chickens and now you have one grain of grain. There are two ways to feed them:

1. Throw the grain into the middle of the flock and let all the chickens fight over it. Only one chicken wins and the other ninety-nine lose; this corresponds to turning accept_mutex off.

2. Catch one chicken and put the grain straight into its mouth; the other ninety-nine never notice and keep sleeping. This corresponds to turning accept_mutex on.

Clearly, in this scenario enabling accept_mutex is the better choice. Now change the scenario: instead of a single grain there is a whole bowl of feed — what then?

Catching chickens one by one to stuff feed into them is now far too inefficient — who knows when the bowl would ever be finished, and you can picture dozens of chickens queueing up, eagerly waiting to be fed. Better to dump the bowl into the middle of the flock and let them scramble: it causes some chaos, but overall efficiency rises greatly.

Nginx enabling accept_mutex by default (it is disabled by default in recent versions) is a conservative choice. Turning it off may cause a certain amount of thundering herd — more context switches (sar -w) and a higher load — but when traffic is heavy it is worth turning off for the sake of overall system throughput.

– blog.huoding.com/2013/08/24

Using the latest version of nginx, run openresty-systemtap-toolkit's ./ngx-req-distr -m <nginx master pid> to check how requests are distributed among the worker processes:

New version (accept_mutex disabled):

root@localhost:openresty-systemtap-toolkit-master# ./ngx-req-distr -m 13032
Tracing 5056 5057 5058 5059 5060 5061 5062 5063...
Hit Ctrl-C to end.
^C
worker 5056:    1242 reqs
worker 5057:    5895 reqs
worker 5058:    4560 reqs
worker 5059:    4980 reqs
worker 5060:     791 reqs
worker 5061:    2516 reqs
worker 5062:    4848 reqs
worker 5063:    3210 reqs

Old version (accept_mutex enabled):

root@localhost:openresty-systemtap-toolkit-master# ./ngx-req-distr -m 13032
Tracing 5056 5057 5058 5059 5060 5061 5062 5063...
Hit Ctrl-C to end.
^C
worker 5056:       0 reqs
worker 5057:       0 reqs
worker 5058:       0 reqs
worker 5059:       0 reqs
worker 5060:       0 reqs
worker 5061:       0 reqs
worker 5062:   28487 reqs
worker 5063:       0 reqs

Clearly, with accept_mutex off the "thundering herd" spreads requests much more evenly across the worker processes. That said, the pressure on the Nginx machine is still very small at this point, so there is no significant throughput difference between one worker handling all requests and several workers sharing them.

Mongo performance analysis

General state of machine

[root@localhost bin]# top top-15:42:50 up 5 days, 4:37, 4 Users, Load Average: 30.48, 17.56, 10.51 Tasks: 280 total, 1 running, 279 sleeping, 0 stopped, 0 zombie %Cpu0 : 59.4us, 6.5SY, 0.0ni, 28.0id, 0.0wa, 0.0hi, 6.1si, 0.0st %Cpu1: 9.3US, 7.1SY, 0.0ni, 27.5ID, 0.0wa, 0.0hi, 6.1si, 0.0st %Cpu2: 58.6us, 6.4SY, 0.0Ni, 28.8ID, 0.0wa, 0.0hi, 6.1si, 0.0st %Cpu3: 9.3US, 6.8SY, 0.0ni, 28.5ID, 0.0wa, 0.0hi, 5.4si, 0.0st %Cpu4: 57.2US, 7.7sy, 0.0ni, 34.7ID, 0.0wa, 0.0hi, 0.3si, 0.0st %Cpu5: 63.6US, 7.1SY, 0.0ni, 23.1ID, 0.0wa, 0.0hi, 6.1si, 0.0st %Cpu6: 49.5US, 5.8SY, 0.0ni, 40.2ID, 0.0wa, 0.0hi, 4.5si, 0.0st %Cpu7: 50.0us, 6.7sy, 0.0ni, 0.03ID, 0.0wa, 0.0hi, 0.0si, 0.0st %Cpu8: 53.6US, 5.2SY, 0.0ni, 35.4ID, 0.0wa, 0.0hi, 5.8si, 0.0st %Cpu9: 47.4 us, 5.1SY, 0.0ni, 0.04ID, 0.0wa, 0.0hi, 0.0si, 0.0st %Cpu10: 55.9us, 5.2SY, 0.0ni, 31.7id, 0.0wa, 0.0hi, 5.2si, 0.0st %Cpu11: 59.8US, 8.sy, 0.0Ni, 32.1ID, 0.0wa, 0.0hi, 0.0Si, 0.0st KiB Mem: 12119052 total, 160944 free, 5752388 used, 6205720 buff/cache KiB Swap: 8388604 total, 6295696 free, 2092908 Used.5907880 Avail Mem PID USER PR NI VIRT RES SHR S %CPU % Mem TIME+ COMMAND 49550 root 20 0 9.9g 805428 9232s 779.7 6.6 2580:20 mongod 2506 1000 20 0 36.9g 2.8g 666368s 4.0 24.0 997:50.52 Java 2859 1000 20 0 9328788 1.7g 6249s 2.0 186 14.5:07.34 JavaCopy the code

On this 12-core machine the load peaks at just over 30, and peak CPU usage stays below 70%.

As a rule of thumb, CPU utilization below 85% suggests the CPU has not yet hit its bottleneck; for several thousand TPS, this level of CPU consumption is reasonable.

mongostat

insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn set repl time *0 *0 * * 0 0 18034 6013 | 0 0.1% 0 25.1% 9.75 G 719 m 2 | | 0 6 0 8.12 m 167 m 333 dbRes1 PRI Jun 1 11:57:49. * * * * 0 0 0 0 17208 485 5758 | 0 0.1% 25.1% 0, 9.75 G 719 m 8 | | 0 0 0 7.75 m, 159 m 333 dbRes1 PRI Jun 1 11:57:50. * * * * 0 0 0 0 16987 487 5631 | 0 0.1% 0 25.1% 9.75 G 719 m 0 | | 0 0 9 7.63 m 157 m 333 dbRes1 PRI Jun 1 11:57:51. 486 * 7 * 0 17472 764 5896 | 0 0.2% 0 25.1% 0, 9.75 G 719 m 0 | 0 10 | 0, 8.04 m, 162 m 333 dbRes1 PRI Jun 1 11:57:52. 486 * * * * 0 0 0 0 14945 4996 719 | 0 0.2% 0 25.1% 9.75 G m 4 | | 0 0 1 6.75 m to 138 m 333 dbRes1 PRI Jun 1 11:57:53. 495 * 5 * 0 0 * 0 14018 4656 719 | 0 0.2% 0 25.1% 9.75 G m 4 | | 0 0 0 6.28 m 130 m 333 dbRes1 PRI Jun 1 11:57:54. 486 * * * * 0 0 0 0 10574 3524 | 0 0.2% 0 25.1% 9.75 G 719 m 2 | 3 | 0 0, 4.76 m, 97.8 m 333 dbRes1 PRI Jun 1 11:57:55. 486 * 0 * * * 0 0 0 12845 4292 719 | 0 0.2% 0 25.1% 9.75 G m 7 | | 0 0 0, 5.78 m, 119 m 333 dbRes1 PRI Jun 1 11:57:56. 488 * * * * 0 0 0 0. 14862 4939 | 0 0.2% 0 25.1% 9.75 G 719 m 4 | | 0 0 1 6.68 m 137 m 333 dbRes1 PRI Jun 1 11:57:57. 485Copy the code

Final throughput result

root@localhost:wrk-master# ./wrk -t16 -c400 -d40s --latency http://localhost:80/crank/getPlayerRankAsync/1_1_50 Running 40s test @ http://localhost:80/crank/getPlayerRankAsync/1_1_50 16 threads and 400 connections Thread Stats Avg Stdev Max +/ -stdev Latency 73.38ms 281.28ms 281.28ms Latency Distribution 50% 281.28ms +/ -stdev Latency 73.38ms 281.28ms 259.74 MS 234396 requests in 40.03s, 4.25GB read requests/SEC: 108.69 MBCopy the code

conclusion

The purpose of this article is to share possible angles of observation during performance testing. The throughput figures at each stage are specific to these machines and are not of much reference value. Nor does the leaderboard have to be implemented on MongoDB — it could be switched to Redis, for example. In fact I also built a Redis version; in this architecture it did not perform better than Mongo, and it was more complex to implement (using a zset to achieve multi-criteria sorting, for example, gets quite convoluted). Another option is to add a second-level cache in the central server (the game servers, i.e. the clients, already implement a first-level cache), but once a cache is introduced the system becomes more complex — cache validity, updates and consistency all have to be handled, and this is especially hard to control in a cluster environment. My view is that if the need can be met by scaling up CPU and I/O capacity itself (for example by adding servers), upgrading hardware is often the better solution: it may cost a little more, but it usually carries less risk than caching. And the test results above do show that the bottleneck is indeed hardware — the CPU in particular.

My knowledge here is limited; if there are mistakes, corrections from fellow engineers are welcome.