The background,

This post was published simultaneously on the Prodesire Public account and the Prodesire blog. Recently, in the process of cloud service API test project, I found that API would be called in large quantities sometimes, resulting in traffic limiting errors. When encountering such an error, the traditional retry policy is to try again at regular intervals. However, a large number of requests flood in at the same time during the retries, causing traffic limiting.

This reminds me of my experience two years ago when I looked up the Celery Task documentation and found that it was possible to set retry_backoff for tasks, which allowed tasks to retry in exponential retreat on failures. So what exactly does exponential retreat look like?

Two, index retreat

Exponential backoff, according to an Exponential backoff on Wiki, is an algorithm that, through feedback, slows down the rate of a process exponentially, gradually finding the right rate.

In Ethernet, this algorithm is usually used for scheduling retransmission after a conflict. Delay retransmission is determined based on the time slot and number of retransmission attempts.

After c collisions (such as a request failure), a random value between 0 and $2^ C-1 $is chosen as the number of time slots.

  • For the first collision, each sender will wait 0 or 1 time slot to send.
  • After the second collision, the sender will wait between 0 and 3 timesslots (calculated by $2^2-1$) to send.
  • After the third collision, the sender will wait between 0 and 7 timesslots (calculated by $2^3-1$) to send.
  • And so on…

As the number of retransmissions increases, the latency increases exponentially.

In layman’s terms, each retry takes twice as long as the last one.

Third, the expectation of exponential retreat

Considering the uniform distribution of retreat times, the mathematical expectation of retreat times is the average of all possibilities. In other words, after c conflicts, the number of retreat time slots is in [0,1… N], where $N=2^c-1$, then the mathematical expectation of retreat time (in time slots) is

? E(c)=frac{1}{N+1}sum_{i=0}^{N}{i}=frac{1}{N+1}frac{N(N+1)}{2}=frac{N}{2}=frac{2^c-1}{2}?

So for the previous example:

  • $E(1)= FRAc {2^1-1}{2}=0.5$after the first collision
  • The expected retreat time after the second collision is $E(2)= FRAc {2^2-1}{2}=1.5$
  • After the third collision, the expected retreat time is $E(3)= FRAc {2^3-1}{2}=3.5$

Four, the application of index retreat

4.1 Exponential regression algorithm in Celery

Celery /utils/time.py = celery/utils/time.py

def get_exponential_backoff_interval( factor, retries, maximum, full_jitter=False ): """Calculate the exponential backoff wait time.""" # Will be zero if factor equals 0 countdown = factor * (2 ** retries)  # Full jitter according to # https://www.awsarchitectureblog.com/2015/03/backoff.html if full_jitter: countdown = random.randrange(countdown + 1) # Adjust according to maximum wait time and account for negative values. return max(0, min(maximum, countdown))Copy the code

Here, factor is the withdrawal coefficient, acting on the overall withdrawal time. Retries corresponds to C above (the number of collisions). Countdown = Factor * (2 ** retries) is the same as the exponential retreat algorithm mentioned above. On this basis, full_jitter can be set to True, meaning to do a “jitter” on the retreat time, so as to have a certain randomness. Finally, you limit a given value to maximum to avoid an infinite wait time. However, once the maximum retreat time is reached, multiple tasks may be executed again at the same time. See task.retry_jitter for more.

4.2 Examples of connections in Advanced Programming in UNIX Environments

There is also an example of using exponential fallback to establish connections in section 16.4 of Advanced Programming in UNIX Environments (3rd edition) :

#include "apue.h" #include <sys/socket.h> #define MAXSLEEP 128 int connect_retry(int domain, int type, int protocol, const struct sockaddr *addr, socklen_t alen) { int numsec, fd; /* * for (numsec = 1; numsec < MAXSLEEP; numsec <<= 1) { if (fd = socket(domain, type, protocol) < 0) return (-1); If (connect(fd, addr, alen) == 0) {/* * connect to accept */ return (fd); } close(fd); /* if (numsec <= MAXSLEEP / 2) sleep(numsec); } return (-1); }Copy the code

If the connection fails, the process sleeps for a short period of time (NUMsec) and then enters the next loop to try again. Each cycle is twice as long as the last until the maximum delay is more than 1 minute, after which no retries are made.

conclusion

Going back to the problem at the beginning, we can avoid limiting traffic again to the greatest extent possible by retrying using an exponential backoff algorithm when we encounter a limiting error. Exponential backoff adds time magnification and randomness to make it “smarter” than fixed-time retries. At this point, we no longer need to worry about limiting the flow of the entire test program to run interrupted ~