Preface

In general, it is best not to serialize read and write requests into an in-memory queue if the cache is allowed to be slightly inconsistent with the database, that is, if your system does not require strict “cache + database” consistency.

Serialization guarantees that no inconsistencies will occur, but it can also reduce the system’s throughput significantly, requiring several times more machines than normal to handle the same traffic online.

Cache Aside Pattern

The most classic cache + database read/write pattern is the Cache Aside Pattern.

  • When reading, read the cache first. If the data is not in the cache, read the database, put the result into the cache, and return the response.
  • When updating, update the database first and then delete the cache (a minimal sketch follows below).
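
A minimal sketch of the two paths, assuming hypothetical `Cache` and `Database` interfaces and a `product:<id>` key scheme (none of these names come from a specific client library):

```java
// Minimal Cache Aside sketch; Cache, Database, and the key scheme are
// illustrative assumptions, not a specific client's API.
interface Cache {
    Object get(String key);
    void set(String key, Object value);
    void delete(String key);
}

interface Database {
    Object queryById(long id);
    void update(long id, Object value);
}

class ProductService {
    private final Cache cache;
    private final Database db;

    ProductService(Cache cache, Database db) {
        this.cache = cache;
        this.db = db;
    }

    // Read: cache first; on a miss, load from the database and populate the cache.
    Object read(long id) {
        String key = "product:" + id;
        Object value = cache.get(key);
        if (value == null) {
            value = db.queryById(id);
            if (value != null) {
                cache.set(key, value);
            }
        }
        return value;
    }

    // Update: write the database first, then delete (not update) the cache.
    void update(long id, Object newValue) {
        db.update(id, newValue);
        cache.delete("product:" + id);
    }
}
```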

Why delete the cache instead of update it?

The reason is simple. In many of the more complex caching scenarios, the cached value is not simply a value pulled straight out of the database.

For example, a field of one table may be updated, and to refresh the corresponding cache you may need to query data from two other tables and combine them to compute the latest cached value.

Also, updating the cache can be costly. Does that mean every time you modify the database you must also update the corresponding cache? That may be true in some scenarios, but not in the more complex ones where the cached value is computed. If the multiple tables that a cached value depends on are modified frequently, then the cache is also updated frequently. But the question is: will this cache actually be accessed that frequently?

For example, if a field of a table is changed 20 times, or 100 times, in a minute, then the cache is also updated 20 times, or 100 times; but the cache is only read once in that minute, so it is mostly cold data. In fact, if you just delete the cache, the cache is recomputed at most once within that minute, and the overhead drops dramatically: the cached value is computed only when it is actually needed.

In fact, deleting the cache, rather than updating it, is the idea of lazy computation. Instead of redoing a complex calculation every time the data changes, whether the result is needed or not, recompute it only when it needs to be used. This is the same lazy-loading idea found in MyBatis and Hibernate. When querying a department, the department has a list of employees, but there is no need to fetch the data of 1,000 employees along with the department. 80% of the time, querying the department only needs the department’s own information. So query the department first, and only when you actually need to access the employees do you query the 1,000 employees from the database.
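
A minimal illustration of the same lazy idea (hypothetical `Department`/`Employee` classes with an injected loader; this is not MyBatis or Hibernate API):

```java
import java.util.List;
import java.util.function.Supplier;

// Lazy-loading sketch: the expensive employee query runs only when the employee
// list is actually accessed. All names here are illustrative assumptions.
class Department {
    private final String name;
    private final Supplier<List<Employee>> employeeLoader; // e.g. wraps a DAO query
    private List<Employee> employees;                      // loaded on demand

    Department(String name, Supplier<List<Employee>> employeeLoader) {
        this.name = name;
        this.employeeLoader = employeeLoader;
    }

    String getName() {
        return name; // cheap: no employee query needed for department info
    }

    List<Employee> getEmployees() {
        if (employees == null) {
            employees = employeeLoader.get(); // the 1000-employee query runs only here
        }
        return employees;
    }
}

class Employee {
    // fields omitted for brevity
}
```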

The most elementary cache inconsistency problem and solution

Problem: update the database first, then delete the cache. If deleting the cache fails, the database holds new data while the cache holds old data, and the data becomes inconsistent.

Solution: delete the cache first, then update the database. If the database update fails, the database still holds the old data and the cache is empty, so the data is not inconsistent: because the cache is empty at read time, the old data is read from the database and written back into the cache.
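
For this elementary case, the write path of the earlier `ProductService` sketch would simply swap the two steps (still an illustrative sketch, not a prescribed API):

```java
// Reordered write path: delete the cache first, then update the database.
// If db.update(...) throws, the cache is simply empty and the next read
// repopulates it with the old, still-consistent database value.
void update(long id, Object newValue) {
    cache.delete("product:" + id);
    db.update(id, newValue);
}
```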

More complex data inconsistency problem analysis

The data has changed: the cache is deleted first, and the database is about to be modified but the modification has not yet completed. A read request comes in, finds the cache empty, queries the database, gets the old pre-modification data, and puts it into the cache. The data change procedure then completes the database modification. Now the data in the database differs from the data in the cache…
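
A toy reproduction of this interleaving, using an in-memory map as the “cache”, an `AtomicReference` as the “database”, and sleeps to force the unlucky timing (purely illustrative):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

// Forces the race: the writer deletes the cache, pauses before committing the
// database update, and a concurrent reader caches the old database value,
// which then outlives the update.
public class CacheRaceDemo {
    static final AtomicReference<String> database = new AtomicReference<>("old");
    static final ConcurrentHashMap<String, String> cache = new ConcurrentHashMap<>();

    public static void main(String[] args) throws InterruptedException {
        cache.put("key", "old");

        Thread writer = new Thread(() -> {
            cache.remove("key");             // 1. delete the cache
            sleep(100);                      //    ... database update still in flight
            database.set("new");             // 4. database update finally completes
        });

        Thread reader = new Thread(() -> {
            sleep(20);
            String value = cache.get("key"); // 2. cache miss
            if (value == null) {
                value = database.get();      //    reads the old value
                cache.put("key", value);     // 3. old value goes back into the cache
            }
        });

        writer.start();
        reader.start();
        writer.join();
        reader.join();

        // Prints: database = new, cache = old  ->  inconsistent
        System.out.println("database = " + database.get() + ", cache = " + cache.get("key"));
    }

    static void sleep(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
    }
}
```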

Why does this problem occur under hundreds of millions of requests and high concurrency?

This problem can only occur when data is read and written concurrently. If concurrency is very low, especially read concurrency, say 10,000 visits per day, the kind of inconsistency just described will rarely, if ever, appear. However, if daily traffic is in the hundreds of millions, with tens of thousands of concurrent reads per second, then as long as there are also data update requests every second, the database + cache inconsistency above can occur.

Solutions are as follows:

When data is updated, the operation is routed to an internal JVM queue based on the unique identity of the data. When data is read, if it is not in the cache, a “read data + update cache” operation is performed again; after routing by the same unique identifier, it is also sent to the same JVM internal queue.

Each queue corresponds to one worker thread; each worker thread takes the operations from its queue and executes them one by one, in order. In this way, a data change operation first deletes the cache and then goes to update the database, but before the update completes, if a read request arrives and misses the cache, a cache-update request can first be sent to the same queue; it backlogs in the queue behind the change, and the read then waits synchronously for the cache update to complete.
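
A sketch of this queue-serialization idea: operations are routed by key hash into one of N in-memory queues, and each queue is drained by a single worker thread, so the “delete cache + update database” and the following “reload cache from database” for the same key cannot interleave. The class names, queue count, and capacity below are illustrative assumptions:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class SerializedCacheUpdater {

    // An operation is either "update DB + delete cache" or "reload cache from DB".
    interface Op { void run(); }

    private final List<BlockingQueue<Op>> queues = new ArrayList<>();

    SerializedCacheUpdater(int queueCount) {
        for (int i = 0; i < queueCount; i++) {
            BlockingQueue<Op> queue = new ArrayBlockingQueue<>(1000);
            queues.add(queue);
            Thread worker = new Thread(() -> {          // one worker thread per queue
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        queue.take().run();             // operations run strictly in order
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }

    // Route by the unique identity of the data (e.g. a product id), so all
    // operations on the same data land in the same queue.
    void submit(String dataId, Op op) throws InterruptedException {
        int index = Math.floorMod(dataId.hashCode(), queues.size());
        queues.get(index).put(op);
    }
}
```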

There is an optimization point here: within one queue, it is meaningless to string together multiple cache-update requests for the same data, so they can be filtered. If there is already a pending cache-update request for that data in the queue, there is no need to enqueue another one; just wait for the earlier request to complete.
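
One way to sketch that filter is a per-key “refresh already pending” flag: only the first read that misses enqueues a cache-update request, and later readers just wait and poll the cache (names are illustrative assumptions):

```java
import java.util.concurrent.ConcurrentHashMap;

class RefreshDeduplicator {
    private final ConcurrentHashMap<String, Boolean> pendingRefresh = new ConcurrentHashMap<>();

    // Returns true only for the first caller; everyone else skips enqueueing
    // and simply waits for the pending refresh to fill the cache.
    boolean tryMarkPending(String dataId) {
        return pendingRefresh.putIfAbsent(dataId, Boolean.TRUE) == null;
    }

    // Called by the worker thread after the fresh value has been written to the cache.
    void clearPending(String dataId) {
        pendingRefresh.remove(dataId);
    }
}
```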

After the worker thread of that queue has finished the database modification of the previous operation, it performs the next operation, the cache update: it reads the latest value from the database and writes it into the cache.

If a read request is still within its waiting window and polling finds that the value can now be fetched from the cache, it returns directly; if the request has waited longer than a certain amount of time, it reads the current old value directly from the database and returns that.
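
The read side could look roughly like this, continuing the illustrative `ProductService` sketch from above; the polling interval and timeout are assumptions to be tuned:

```java
// Poll the cache until the queued refresh completes or the deadline passes;
// past the deadline, fall back to reading the (possibly old) value from the DB.
Object readWithWait(long id, long timeoutMs) throws InterruptedException {
    String key = "product:" + id;
    long deadline = System.currentTimeMillis() + timeoutMs;
    while (System.currentTimeMillis() < deadline) {
        Object value = cache.get(key);
        if (value != null) {
            return value;           // the queued cache update has completed
        }
        Thread.sleep(20);           // small polling interval
    }
    return db.queryById(id);        // timed out: return the current value directly
}
```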

In high concurrency scenarios, the following issues should be addressed in this solution:

  • A read request is blocked for a long time

Because read requests are very lightly asynchronized, pay close attention to the read timeout problem: every read request must return within the timeout window.

In this solution, the biggest risk is that the data may be updated so frequently that a large number of update operations back up in the queue, a large number of read requests then time out, and a large number of requests end up going directly to the database. Be sure to run some realistic tests to see how frequently the data is actually updated.

On the other hand, because update operations for multiple data items may be backlogged in one queue, you need to test against your own business situation; you may need to deploy multiple services, each handling a share of the data update operations. If 100 inventory modification operations for different items are backlogged in one memory queue and each modification takes 10 ms to complete, the read request for the last item may wait 10 ms × 100 = 1000 ms = 1 s before it gets its data, which blocks read requests for a long time.

You must run load tests against the actual business behavior of the system, in an environment that simulates production, to see how many update operations a memory queue may accumulate at the busiest moments and how long the read request corresponding to the last of them may hang. If read requests must return within 200 ms and your calculation shows that even on the busiest days a queue backlogs at most 10 updates, waiting at most 200 ms, that is fine.

If a memory queue is likely to have a particularly large backlog of updates, add machines, so that the service instances deployed on each machine process less data and each memory queue backlogs fewer updates.

In fact, based on experience from previous projects, data write frequency is generally very low, so the backlog of updates in the queue should be very small. For projects like this one, with high read concurrency and a read-caching architecture, write requests are generally very few; a write QPS of a few hundred per second is already quite a lot.

Let’s actually do a rough calculation.

Suppose there are 500 write operations per second. If we divide a second into five time slices, 100 writes arrive every 200 ms and are spread across 20 memory queues, so each memory queue may backlog about 5 write operations. If performance testing shows that each write operation completes in about 20 ms, then the read request for each memory queue only hangs for a short while and will certainly return within 200 ms.
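
Laid out step by step, the same rough arithmetic:

```
500 writes/s            →  100 writes per 200 ms time slice
100 writes / 20 queues  ≈  5 writes backlogged per memory queue
5 writes × ~20 ms each  ≈  100 ms worst-case wait for a read request
100 ms < 200 ms timeout →  every read still returns in time
```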

From this simple calculation we know that a single machine handling a write QPS in the hundreds is not a problem. If the write QPS grows 10×, scale out to 10× the machines, each machine running 20 queues.

  • Read request concurrency is too high

There is also a risk of a sudden flood of read requests hanging on the service with a delay of tens of milliseconds; it depends on how well the service holds up and how many machines are needed to absorb the peak of this worst case.

However, not all data is updated at the same moment, and the cache does not expire all at once, so each time only the caches of a small number of data items may be invalid; the read requests corresponding to those items then come in, and the concurrency should not be very large.

  • Request routing for multi-service instance deployment

It is possible that multiple instances of the service are deployed, so it is important to ensure that requests to perform data update operations, as well as cache update operations, are routed through the Nginx server to the same service instance.

For example, read and write requests for the same item are routed to the same machine. You can implement your own hash routing between services based on some request parameter, or use Nginx’s hash routing feature, and so on.
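
A sketch of the application-side variant: pick the target instance from a fixed list by hashing the item id, so requests for the same item always land on the same instance (the instance list and method names are illustrative; Nginx’s upstream hashing can serve the same purpose at the proxy layer):

```java
import java.util.List;

class InstanceRouter {
    private final List<String> instances; // e.g. ["10.0.0.1:8080", "10.0.0.2:8080"]

    InstanceRouter(List<String> instances) {
        this.instances = instances;
    }

    // Same productId -> same index -> same service instance -> same JVM queue.
    String routeFor(long productId) {
        int index = Math.floorMod(Long.hashCode(productId), instances.size());
        return instances.get(index);
    }
}
```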

  • Hot item routing causes request skew

If the read and write requests for a certain item are so high that they all hit the same queue on the same machine, that machine may come under too much pressure. That said, since the cache is cleared only when the item’s data is actually updated, and only then do read and write requests contend, the problem is not necessarily significant; it depends on the business system. If the update frequency is not too high the impact is small, but the load on a few machines may indeed be higher.