In a previous article I wrote about the “Three Principles of Reliable Communication”. For a distributed database, if you want to achieve 100% high availability (that is, clients never see a failed request), you can likewise apply the retry and de-duplication principles from that article. In practice, however, there are trade-offs between success rate and latency (speed and performance). This article shares some practical experience and introduces choices that have proven broadly applicable, for your reference.
The client accesses the database server and issues a large number of requests; it is impossible for every one to succeed. A request may fail for network reasons, because of processing conflicts inside the server, or because of coordination conflicts between distributed nodes.
Fault tolerance means retrying when an error is encountered. Because errors are bound to occur, only retrying can eliminate their effects, just as packets are bound to be lost at the IP layer, yet TCP achieves a practical degree of reliable transmission through retransmission.
Some systems that implement the Basic Paxos + replicated-log state machine model suffer many conflicts because they are so-called “leaderless”. Even with Raft, unexpected elections can cause request conflicts in some cases.
When a conflict (failure) occurs, who should retry? This is a question of dividing responsibility between modules in engineering practice, which is often more important than the code itself. In general, the lower the layer at which a retry happens, the better the performance; the higher the layer, the more complete the information available for deciding whether to retry at all.
We can roughly divide the database system (and its ecosystem) into several large modules, from the bottom (left) to the top (right):
Replication -> Server -> Client SDK -> User

The most common approach is to let the user retry. With the stock Redis SDK, for example, if a server goes down and a request fails, the user is expected to switch to another IP, re-create the connection, and issue the request again.
Some systems encapsulate a proprietary Client SDK, for example a thin wrapper around the official Redis SDK that intercepts the result of each request and automatically retries inside the SDK when an error is found. This way the user needs no retry logic, and user code is simplified. When multiple modules cooperate, a responsibility taken on by one module is effort saved by all the modules above it.
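Such a wrapper can be sketched as follows. This is a minimal, hypothetical example (the `RetryingClient` class and its parameters are illustrative, not from any real SDK): it intercepts transient errors and retries internally with exponential backoff, so callers rarely see a failure.

```python
import time

class RetryingClient:
    """Hypothetical wrapper around a Redis-like client that retries
    failed requests internally so callers never see transient errors."""

    def __init__(self, client, max_attempts=3, backoff_s=0.05):
        self._client = client
        self._max_attempts = max_attempts
        self._backoff_s = backoff_s

    def get(self, key):
        return self._call(self._client.get, key)

    def _call(self, fn, *args):
        last_err = None
        for attempt in range(self._max_attempts):
            try:
                return fn(*args)
            except ConnectionError as e:      # transient network failure
                last_err = e
                time.sleep(self._backoff_s * (2 ** attempt))  # back off before retrying
        raise last_err  # give up after max_attempts; report the error upward
```

A real SDK would also re-resolve the server address and rebuild the connection between attempts; the point here is only where the retry responsibility lives.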
What if the user does not want to retry and the Client SDK does not want to retry either? Can all the blame be pushed onto the server? Absolutely not; see the summary of this article. And why does the user no longer need to retry once the SDK does? Because the SDK and the user run in the same process: they are a single entity, and there are no reliable-transport issues between the two.
So, given that the Client SDK must have retry logic, does the server not need any? Theoretically it does not, but in practice the server still needs to reduce its own failure rate, and the essential means of doing so is retrying. For example, suppose the server asks the Paxos module to replicate an operation log entry, but an unexpected multi-master situation arises and the server loses the race with other nodes for a log slot. If the server returns an error to the client at this point, the server's failure count increases by one. Instead, the server can retry: call the Paxos module again to contend for the next slot, and keep going until it succeeds. This way, the client will rarely see an error from the server.
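The server-side loop described above can be sketched like this. Everything here is an assumption for illustration: `paxos.next_slot()` and `paxos.propose()` stand in for whatever interface the consensus module actually exposes, where `propose` returns the value that ended up committed at the given slot.

```python
def replicate(paxos, op, max_slots=16):
    """Sketch of a server-side retry loop: if our proposal loses the race
    for a log slot (multi-master conflict), advance to the next slot and
    retry, instead of surfacing the conflict to the client."""
    slot = paxos.next_slot()
    for _ in range(max_slots):
        winner = paxos.propose(slot, op)   # value committed at `slot`
        if winner == op:
            return slot                    # our op was committed here
        slot += 1                          # another node won this slot; contend for the next
    raise TimeoutError("could not commit op within slot budget")
```

Note the bounded loop (`max_slots`): as the article argues next, even server-side retries must eventually give up and report upward.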
However, neither the server nor the client can retry indefinitely, because each retry costs time. In the most extreme case, retrying forever could take hours, which is obviously unacceptable. Therefore, a timeout mechanism must be introduced: after a certain number of attempts or a certain time budget, the error must be reported to the upper layer.
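A deadline-bounded retry loop of the kind just described might look like this (a sketch, assuming the callable raises `ConnectionError` on transient failure; the function name and parameters are illustrative):

```python
import time

def call_with_deadline(fn, timeout_s=1.0, backoff_s=0.01):
    """Keep retrying transient failures, but once the time budget is
    exhausted, re-raise the error so the upper layer can decide."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            return fn()
        except ConnectionError:
            if time.monotonic() + backoff_s >= deadline:
                raise          # budget exhausted: report the error upward
            time.sleep(backoff_s)
```

A time budget is usually preferable to a bare attempt count, because the upper layer's patience is measured in seconds, not in attempts.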
Retries increase total elapsed time, with the undesirable side effect that upper layers perceive lower layers as slow and poorly performing. We therefore need systematic thinking and a comprehensive trade-off. As a rule of thumb, both the server and the client SDK should retry as thoroughly as practical to improve the success rate. In most cases, developers err on the side of giving up: they retry too little, not too much. After all, one more retry path means one more piece of code, and the temptation to be lazy is always there.
To design a highly reliable system, the three principles of reliable transport are a very useful foundational theory, but not a silver bullet. In essence, software development is a great deal of analysis, elaboration, and manual work, plus the control of system complexity.
An additional problem introduced by retrying is duplication, which brings us to de-duplication, the second of the three principles of reliable transport. You have probably heard the term idempotent: an operation is idempotent if applying it multiple times has the same effect as applying it once. If an operation is not idempotent, it cannot be retried blindly.
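The distinction, and the standard way around it, can be shown in a few lines. This is a toy sketch (the `apply` function, the `seen` set, and the request-id scheme are all illustrative): `SET key val` is naturally idempotent, `INCR key` is not, but tagging each request with a unique id and recording which ids have been applied makes even `INCR` safe to retry.

```python
def apply(store, seen, req_id, op, key, val=None):
    """De-duplicate by request id so that even a non-idempotent
    operation (INCR) can be retried safely."""
    if req_id in seen:
        return store.get(key)      # duplicate retry: skip re-applying
    if op == "SET":                # naturally idempotent
        store[key] = val
    elif op == "INCR":             # non-idempotent without dedup
        store[key] = store.get(key, 0) + 1
    seen.add(req_id)
    return store.get(key)
```

In a real system `seen` would be replicated along with the log and garbage-collected, but the principle is the same: de-duplication converts a non-idempotent operation into a retriable one.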
In practice, however, we can push the responsibility for idempotence upward as far as possible. At least for users, a 100% success rate is a far higher priority than worries about idempotence. Users will tolerate lower layers retrying without regard for idempotence, but are sensitive to occasional failures from below. Simply put: don't fret about idempotence; within the timeout budget, retry boldly!