Phenomenon:

In early May, the failure rate of the discovery page interface in one version rose significantly, driven by an unusually high number of 499 responses on iOS. The back end found that the source field on the affected requests always pointed to traffic from in-app message pushes.

Suspected causes:

  1. The aggregated domain name mechanism is faulty.
  2. HTTP/2 multiplexes requests over a single TCP connection. If that connection drops, every multiplexed request is disconnected from the back end, and Nginx records a 499 for each of them.
  3. The client cancels an unfinished datatask, which closes the TCP connection and results in 499.
  4. Nginx's own protection mechanism mistakes a large number of requests from the same device to the same interface within a short period for an attack and drops the connection, generating 499.
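
Hypotheses 2 and 3 both describe the client side of a connection going away before Nginx has sent its response, which is the situation Nginx logs as 499. A minimal Python sketch of that behavior follows. It assumes a plain HTTP/1.1 endpoint rather than the HTTP/2 setup described here, and the host, port, and path passed in are hypothetical.

```python
import socket


def build_request(host, path="/"):
    """Build a minimal HTTP/1.1 GET request as bytes."""
    return (
        f"GET {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


def send_and_abort(host, port, path="/", timeout=2.0):
    """Send one request, then close the socket without reading the response."""
    request = build_request(host, path)
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(request)
        # Leaving the with-block closes the connection immediately,
        # so the server never gets to deliver its response.
    return len(request)
```

Pointing send_and_abort at a test Nginx that proxies to a slow upstream would be one way to check whether an early client close shows up as 499 in the access log.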

Reproduction attempt:

Client: baseline iPhone; version: 12.5.0; scenario: simulate a universalLink jump to the discovery page tab on the terminal, at a rate of one call per second.

Reproduction results:

  1. During a half-hour stress test, the front-end network callback logs showed no errors; every request received a successful response.
  2. Back-end log #1 showed about 5 entries with status 499.
  3. No tcp.flags.fin or tcp.flags.reset was captured in the back-end Wireshark trace around the times of the 499 entries, indicating that the front-end TCP connection was not closed.
  4. During a cold start, jumping to the discovery page tab through universalLink triggers two interface calls in quick succession: one issued on entering the page and one triggered by registration.
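
Counting the 499 entries in result 2 can be scripted. The sketch below assumes the access log uses Nginx's default "combined" format, which the original does not confirm; the sample lines, addresses, and paths are invented for illustration.

```python
import re
from collections import Counter

# Assumption: Nginx "combined" log format, e.g.
# addr - user [time] "METHOD path PROTO" status bytes "referer" "agent"
LINE_RE = re.compile(
    r'(?P<addr>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3}) '
)


def count_status_by_path(lines, status="499"):
    """Count log entries with the given status, grouped by path without query string."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("status") == status:
            counts[match.group("path").split("?", 1)[0]] += 1
    return counts


# Hypothetical sample entries, not taken from the real logs.
SAMPLE_LINES = [
    '10.0.0.1 - - [01/Jan/2000:10:00:01 +0000] "GET /discovery?source=push HTTP/2.0" 499 0 "-" "demo"',
    '10.0.0.1 - - [01/Jan/2000:10:00:02 +0000] "GET /discovery?source=push HTTP/2.0" 200 512 "-" "demo"',
    '10.0.0.1 - - [01/Jan/2000:10:00:03 +0000] "GET /discovery HTTP/2.0" 499 0 "-" "demo"',
    '10.0.0.2 - - [01/Jan/2000:10:00:04 +0000] "GET /other HTTP/2.0" 499 0 "-" "demo"',
]

if __name__ == "__main__":
    for path, count in count_status_by_path(SAMPLE_LINES).most_common():
        print(path, count)
```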

Optimization attempt:

The registration-triggered request was delayed by 1.5 seconds and the stress test was run for one hour. The back end recorded only 2 occurrences of 499, a significant drop from the more than 10 per hour seen previously.
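
The deferral can be pictured with a small language-agnostic sketch, written here in Python with asyncio rather than in the iOS client's own code. The fetch callable and the reason strings are hypothetical stand-ins for the two interface calls.

```python
import asyncio

REGISTRATION_DELAY_S = 1.5  # the delay used in the test described above


async def open_discovery_tab(fetch, delay_s=REGISTRATION_DELAY_S):
    """Issue the page-entry request at once and the registration request after a delay.

    fetch is a hypothetical async callable that takes a reason string
    and returns the response for that call.
    """
    # Start the page-entry request immediately.
    page_task = asyncio.create_task(fetch("page_enter"))
    # Hold back the registration-triggered request so the two calls
    # do not hit the interface at the same moment.
    await asyncio.sleep(delay_s)
    registration_result = await fetch("registration")
    page_result = await page_task
    return page_result, registration_result
```

The page-entry request is not slowed down; only the second call is pushed back.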

Analysis:

  1. The affected business domain names were removed from the aggregated domain name cloud control. After one day of observation the 499 trend had not changed, so the aggregated domain name mechanism was ruled out.
  2. The front end uses HTTP/2 for the affected services by default, and the LB also supports HTTP/2 for the affected service domain names by default, so HTTP/2 is used even with an independent domain name. Because HTTP/2 multiplexes requests over one TCP connection, it cannot be ruled out as a contributor to the 499s.
  3. During a cold start, the tab calls the discovery interface twice in quick succession, and the later call cancels the unfinished datatask of the earlier one. However, around the times of the back-end 499 entries, the front-end Wireshark capture showed no tcp.flags.fin or tcp.flags.reset. The front-end TCP connection was not closed, so front-end datatask cancellation was ruled out.
  4. On iOS, a single device can reproduce a large number of requests in a short time. In addition, one Android interface also shows 499s in production for bursts of requests from a single user. It therefore cannot be ruled out that Nginx generates 499 because it considers frequent requests from a single device unsafe.

Conclusion:

For security reasons, Nginx treats repeated requests from a single device as an attack and drops the connection. Because HTTP/2 reuses one TCP connection, the requests multiplexed on that connection all end up as 499.

Follow-up:

  1. The upper-layer business needs to analyze whether repeated requests within a short period are justified.
  2. The back end needs to investigate the Nginx security mechanism: whether a large number of requests from a single device to a single interface in a short period is mistaken for an attack, after which the connection is dropped.
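
For the first follow-up item, a sliding-window check over request records is one way to find where a single device hits a single interface repeatedly. This is only a sketch: the record shape (timestamp, device ID, path), the one-second window, and the threshold are assumptions, not values from the investigation.

```python
from collections import defaultdict, deque


def find_bursts(events, window_s=1.0, threshold=2):
    """Flag repeated requests from one device to one path within a time window.

    events: iterable of (timestamp_seconds, device_id, path), sorted by time.
    Returns a list of (device_id, path, timestamp, count) for every request
    that brings the count inside the window up to the threshold or above.
    """
    recent = defaultdict(deque)
    bursts = []
    for ts, device, path in events:
        window = recent[(device, path)]
        window.append(ts)
        # Drop timestamps that have fallen out of the window.
        while window and ts - window[0] > window_s:
            window.popleft()
        if len(window) >= threshold:
            bursts.append((device, path, ts, len(window)))
    return bursts


# Hypothetical records: device "d1" calls /discovery twice within 0.2 s.
SAMPLE_EVENTS = [
    (0.0, "d1", "/discovery"),
    (0.2, "d1", "/discovery"),
    (5.0, "d1", "/discovery"),
    (5.1, "d2", "/discovery"),
]
```

With the sample records, only the second call from d1 is flagged; the later calls are either outside the window or from a different device.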

Follow-up conclusions:

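After proxy_ignore_client_abort was configured on Nginx, the success rate was 100%. This directive determines whether Nginx closes its connection to the proxied server when a client closes the connection without waiting for a response.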