One afternoon while I was writing code, I suddenly received a high-load alarm for a small cluster of machines. The ops team also promptly received DDoS attack alarms from the public cloud's network detection. The overall attack traffic was not large, less than 300 Mbps; having seen DDoS attacks of dozens of Gbps before, I was not particularly nervous, and since this was a small cluster serving a less critical business, the impact was limited. So at the time I was wondering what this person was really after: the attack traffic itself was uninteresting, so perhaps this wave was just a probe and a larger attack against other sites would follow. Stay alert, report any problem promptly, and coordinate containment across teams.

I then posted in several related business and monitoring groups that we were under a DDoS attack, asked everyone to keep watching, and noted that the public cloud had already started traffic scrubbing. It felt like a war had begun, with every department on alert. Just at that moment, a client-side colleague spoke up in one of the groups: a configuration had been pushed at exactly that time, and as a result the client was issuing far more requests... Roll it back quickly and let things recover slowly. And so ended a self-directed, fake DDoS attack.

Some takeaways from the post-incident review:

  1. From the public cloud console, the ops team could only see that the attacking IP addresses were scattered (of course, in a real large-scale DDoS attack the attacker also controls many machines, so the IPs would not be clustered either). From that view alone, they could not tell whether it was a real attack or a problem with the service itself.
  2. Client developers should be conscious of protecting the back end: every request the client initiates deserves careful thought, including retries after failure and requests issued in loops, especially on the app's core paths. From the client's point of view it is a single request, but on the back end it is multiplied by the number of live users (a retry/backoff sketch follows this list).
  3. Server-side developers should not blindly trust the front end: the service needs basic circuit breaking and rate limiting (a rate-limiting sketch follows this list). Day-to-day capacity headroom should be kept at a reasonable level; after all, cost is also a concern.
  4. The incident directly affected business A, but another, seemingly unrelated business B also reported a drop in requests and asked me to help locate the cause; the time window matched the incident exactly. After investigation, from the public entry point to the backend cluster to the specific backend services, the two businesses were completely independent, so on the server side there was no reason for them to affect each other. In this case, though, the two businesses' requests shared the same network channel on the client. Business A's traffic was treated as a DDoS and intercepted by the public cloud's scrubbing, and the overloaded back end returned errors, which inevitably led to timeouts and failures on the client. If the client sends requests through a channel that uses a queue to keep requests ordered, the failures of business A can block the requests queued behind them, including business B's (a small head-of-line-blocking sketch follows this list). Investigation by several client-side colleagues confirmed this.
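For item 2, here is a minimal sketch of the kind of client-side retry discipline I mean: a capped number of attempts with exponential backoff and jitter, so a backend incident does not turn every live client into an attacker. The `doRequest` function and the concrete numbers are placeholders, not our real client code.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"time"
)

// doRequest is a stand-in for a real client call (hypothetical).
func doRequest() error {
	return errors.New("backend overloaded")
}

// requestWithBackoff caps retries and spaces them out with
// exponential backoff plus full jitter.
func requestWithBackoff(maxAttempts int) error {
	backoff := 200 * time.Millisecond
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := doRequest()
		if err == nil {
			return nil
		}
		if attempt == maxAttempts {
			return fmt.Errorf("giving up after %d attempts: %w", attempt, err)
		}
		// Sleep a random duration in [0, backoff), then double the window.
		time.Sleep(time.Duration(rand.Int63n(int64(backoff))))
		backoff *= 2
	}
	return nil
}

func main() {
	fmt.Println(requestWithBackoff(3))
}
```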
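For item 3, a minimal server-side sketch using the token bucket from `golang.org/x/time/rate` to shed excess load before it crushes the backend; the limits and the `/api` handler are illustrative assumptions, not the actual service configuration.

```go
package main

import (
	"net/http"

	"golang.org/x/time/rate"
)

func main() {
	// Token bucket sized from normal traffic plus headroom;
	// the numbers here are placeholders.
	limiter := rate.NewLimiter(rate.Limit(1000), 2000) // 1000 req/s, burst 2000

	http.HandleFunc("/api", func(w http.ResponseWriter, r *http.Request) {
		if !limiter.Allow() {
			// Shed load early instead of letting the backend melt down.
			http.Error(w, "too many requests", http.StatusTooManyRequests)
			return
		}
		w.Write([]byte("ok"))
	})

	http.ListenAndServe(":8080", nil)
}
```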
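For item 4, a toy illustration of the head-of-line blocking described above: when requests leave the client through a single ordered queue, a timing-out request from business A delays a perfectly healthy request from business B queued behind it. The timings are made up for demonstration.

```go
package main

import (
	"fmt"
	"time"
)

type request struct {
	business string
	send     func() error // simulates issuing the request over the channel
}

func main() {
	queue := []request{
		// Business A times out during the incident.
		{"A", func() error { time.Sleep(3 * time.Second); return fmt.Errorf("timeout") }},
		// Business B is healthy, but stuck behind A in the same channel.
		{"B", func() error { return nil }},
	}

	// A single ordered channel: requests leave one at a time, in order.
	start := time.Now()
	for _, req := range queue {
		err := req.send()
		fmt.Printf("business %s completed %v after queueing, err=%v\n",
			req.business, time.Since(start).Round(time.Millisecond), err)
	}
	// B completes roughly 3s after queueing even though B itself is fine:
	// A's timeout delays everything queued behind it.
}
```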

That's all.

April 02, 2021