I. Background

After last year's epidemic, in order to accelerate the recovery of the tourism market, Hubei launched a nationwide campaign called "Travel with Love to Hubei": all A-level tourist attractions in the province were opened free of charge to visitors from all over the country. This article introduces the core problems the online booking system ran into during this campaign, how the system was transformed, and some lessons from the implementation. It is a practical account of improving system stability under high-concurrency, high-availability requirements, and I hope it offers some useful ideas for readers facing similar problems.

Activities page

II. Risks and Challenges

In the early stage of the campaign, the system faced the following four types of risks:

  • Heavy traffic: entry traffic instantaneously increased 100-fold, far beyond the system's carrying capacity;

  • Reduced service stability under high concurrency;

  • Incorrect purchase-limit enforcement;

  • Inventory hotspots when deducting stock for hot tickets and hot travel dates.

System challenges under high concurrency

Let’s take a look at the impact and solutions of each problem.

2.1 A 100-fold increase in entry traffic

The problem

At the start of the campaign, incoming traffic increased 100-fold, and the existing system could not handle it through horizontal scaling alone.

Request volume monitoring

The target

Improve the throughput of the entry application and reduce calls to downstream services.

Strategy

Reduce dependencies

1) Remove dependencies that are unnecessary in the 0-yuan ticket scenario, for example coupons and instant discounts;

2) Merge duplicate IO calls (SOA/Redis/DB) so the same data is not fetched repeatedly within a single request, as sketched below.

Context-passing objects reduce repetitive IO
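As a minimal sketch of merging duplicate IO within one request (the class and method names here are hypothetical, not the production code), a request-scoped context can memoize downstream lookups so the same data is fetched only once per request:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical request-scoped context: caches downstream results for the
// lifetime of one request so repeated lookups hit memory, not SOA/Redis/DB.
public class RequestContext {
    private final Map<String, Object> localCache = new ConcurrentHashMap<>();

    @SuppressWarnings("unchecked")
    public <T> T getOrLoad(String key, Supplier<T> loader) {
        // computeIfAbsent loads the value once; later calls in the same
        // request reuse the in-memory copy.
        return (T) localCache.computeIfAbsent(key, k -> loader.get());
    }
}

// Usage inside one booking request (productService is an assumed dependency):
// ProductInfo info = ctx.getOrLoad("product:" + productId,
//         () -> productService.queryProduct(productId));
```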

Improve the cache hit ratio

Here we mean interface-level caching, where the data source is a downstream interface, as shown in the following figure:

Service layer – interface level cache – Fixed expiration

Interface-level caching usually uses fixed expiration plus lazy loading to cache the objects returned by downstream interfaces, or custom DO objects. When a request comes in, the data is first fetched from the cache; if it hits, the data is returned directly, and if it misses, the data is retrieved from the downstream and the cache is rebuilt.

Fixed expiration + lazy loading cache

This caching scheme carries the risk of cache breakdown and cache penetration, both of which are magnified in high-concurrency scenarios. The following describes how the system solves these common problems.

1) Cache breakdown

Description: cache breakdown happens for data that exists in the database but not in the cache. For example, a key with heavy, concentrated concurrent access expires; at that instant a large number of requests break through the cache and go straight to the downstream interface or database, putting heavy pressure on the downstream.

Solution: add a passive refresh mechanism to the cache. Store the last refresh time in the cached entity object. When a request comes in, fetch the data from the cache, then check whether the refresh condition is met; if it is, fetch fresh data asynchronously and rebuild the cache, otherwise leave the cache unchanged. Because asynchronous refreshes renew the entry, requests rarely run into the fixed expiration of the cache.

For example, for product descriptions the cache expiration used to be 5 minutes; now the expiration is 24 hours with a passive refresh interval of 1 minute. Every request returns the last cached value, while the cache is rebuilt asynchronously at most once per minute.
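A minimal sketch of the fixed-expiration plus passive-refresh idea (class names and the in-memory store here are illustrative; the real cache keeps entries in Redis with a hard TTL such as 24 hours):

```java
import java.time.Duration;
import java.time.Instant;
import java.util.concurrent.*;

// Sketch of "serve the last cached value, rebuild asynchronously when stale".
public class PassiveRefreshCache<T> {
    static class Entry<T> {
        final T value;
        final Instant lastRefresh;
        Entry(T value, Instant lastRefresh) { this.value = value; this.lastRefresh = lastRefresh; }
    }

    private final ConcurrentHashMap<String, Entry<T>> store = new ConcurrentHashMap<>();
    private final ExecutorService refresher = Executors.newFixedThreadPool(4);
    private final Duration refreshInterval = Duration.ofMinutes(1);   // passive refresh: 1 minute
    // The hard 24-hour TTL would be enforced by the real cache store (e.g. Redis EXPIRE).

    public T get(String key, Callable<T> loader) throws Exception {
        Entry<T> e = store.get(key);
        if (e == null) {                       // cold miss: load synchronously once
            T v = loader.call();
            store.put(key, new Entry<>(v, Instant.now()));
            return v;
        }
        if (Instant.now().isAfter(e.lastRefresh.plus(refreshInterval))) {
            // Stale: return the old value immediately, rebuild in the background.
            refresher.submit(() -> {
                try { store.put(key, new Entry<>(loader.call(), Instant.now())); }
                catch (Exception ignore) { /* keep the old entry on failure */ }
            });
        }
        return e.value;                        // always serve the last cached value
    }
}
```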

2) Cache penetration

Description: cache penetration refers to requests for data that exists neither in the cache nor in the database. When users keep sending such requests, for example for IDs that do not exist, the cache can never be hit and the pressure lands on the downstream.

Solution: when the cache misses and the downstream returns no data, we still write a cache entity whose content is an empty object, marked with a penetration flag. These entries have a shorter lifetime, by default expiring after 30 s and refreshing after 10 s, so repeated requests for non-existent IDs stop hitting the downstream. In most scenarios penetration traffic is small, but not in all of them. For example, only a small number of products have a certain rule configured; in that case we set the expiration and refresh times of penetration-type entries to the same values as normal entries, preventing frequent downstream requests for data that does not exist.
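A minimal sketch of the empty-object marker (the field names, TTL constants, and the cache.put call in the trailing comment are assumptions for illustration):

```java
import java.time.Duration;

// Cached wrapper that records whether the downstream actually had data.
public class CacheEntity<T> {
    public static final Duration EMPTY_TTL     = Duration.ofSeconds(30); // penetration entries expire sooner
    public static final Duration EMPTY_REFRESH = Duration.ofSeconds(10);

    private final T data;             // null/empty object when downstream had no data
    private final boolean penetrated; // marks "downstream returned nothing"

    public CacheEntity(T data, boolean penetrated) {
        this.data = data;
        this.penetrated = penetrated;
    }

    public boolean isPenetrated() { return penetrated; }
    public T getData() { return data; }
}

// On a miss where the downstream returns no data (hypothetical cache API):
//   cache.put(key, new CacheEntity<>(null, true), CacheEntity.EMPTY_TTL, CacheEntity.EMPTY_REFRESH);
// so repeated requests for a non-existent id are answered from the cache
// instead of hitting the downstream again.
```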

3) Abnormal degradation

When an exception occurs downstream, the cache update policy is as follows (see the sketch after this list):

  • Downstream is non-core: write a short-lived empty cache entry (for example, expire after 30 seconds and refresh after 10 seconds), so that downstream timeouts do not affect the stability of the upstream service.

  • Downstream is core: do not update the cache when an exception occurs; it will be rebuilt on the next request, so that an empty cache entry is never written and the core flow is not blocked.
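A minimal sketch of this rule, assuming a simple cache interface (the interface and method names are made up for illustration):

```java
import java.time.Duration;

// Degradation rule applied when a downstream call throws.
public class DegradationPolicy {
    interface Cache {
        void putEmpty(String key, Duration expire, Duration refresh);
    }

    private final Cache cache;
    public DegradationPolicy(Cache cache) { this.cache = cache; }

    public void onDownstreamException(String key, boolean coreDependency) {
        if (coreDependency) {
            // Core downstream: leave the existing cache untouched; the next
            // request retries the rebuild, so an empty entry never blocks the core flow.
            return;
        }
        // Non-core downstream: write a short-lived empty entry
        // (expire after 30 s, refresh after 10 s) so upstream threads
        // are not held up by downstream timeouts.
        cache.putEmpty(key, Duration.ofSeconds(30), Duration.ofSeconds(10));
    }
}
```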

4) Cache modular management

Cache keys are classified by data source, and each class of key corresponds to a cache module name. For each cache module, the version number, expiration time, and refresh time can be set dynamically, and instrumentation and monitoring are unified. With modular management the granularity of cache expiration becomes finer: by analyzing the hit-ratio monitoring of each module, we can judge whether its expiration and refresh times are reasonable, and finally tune them dynamically to reach the best hit ratio.

Visualized hit-ratio instrumentation per cache module
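For illustration only (the module names, versions, and timings below are made-up examples, not the production configuration), per-module settings might look like this:

```java
import java.time.Duration;
import java.util.Map;

// Per-module cache configuration: version, hard TTL, and passive refresh interval.
public class CacheModuleConfig {
    final String module;      // e.g. "productDesc", "price"
    final int version;        // bump to invalidate all keys of the module
    final Duration expire;    // hard TTL
    final Duration refresh;   // passive refresh interval

    CacheModuleConfig(String module, int version, Duration expire, Duration refresh) {
        this.module = module;
        this.version = version;
        this.expire = expire;
        this.refresh = refresh;
    }

    // Keys carry the module name and version, so either the version or the
    // timings can be changed per module at runtime.
    String buildKey(String id) {
        return module + ":v" + version + ":" + id;
    }

    static final Map<String, CacheModuleConfig> MODULES = Map.of(
            "productDesc", new CacheModuleConfig("productDesc", 1,
                    Duration.ofHours(24), Duration.ofMinutes(1)),
            "price", new CacheModuleConfig("price", 3,
                    Duration.ofMinutes(10), Duration.ofSeconds(30)));
}
```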

We encapsulated the above functions into a cache component; when using it, callers only need to implement the data access. This not only solves common problems of using a cache, but also reduces the coupling between business code and cache reads and writes.

The following figure shows the comparison of cache usage process before and after optimization:

Cache usage comparison

The effect

By solving cache breakdown and penetration, adding exception degradation, and managing the cache by module, the cache hit ratio rose above 98%, interface response time (RT) improved by more than 50%, the ratio of upstream to downstream interface calls dropped from 1:3.9 to 1:1.3, and downstream interface calls fell by 70%.

Improved processing performance by 50%

2.2 Low service stability under high concurrency

The problem

At 8:00 a.m. every day, when the ticket-grabbing activity started, the DB connection pool filled up, DB threads fluctuated heavily, and the product service timed out.

Database thread fluctuation

Thinking

  • Why is the DB connection pool full?

  • Why does the API time out?

  • Is DB instability affecting the API, or is heavy API traffic affecting the DB?

Problem analysis

1) Why is the DB connection pool full? We analyzed three types of SQL in the logs.

  • Too many INSERT statements. Scenario: submitting purchase-limit records. The purchase-limit tables live in a separate database, yet the product API still timed out, so this was ruled out.

  • UPDATE statements taking too long. Scenario: inventory deduction on hot rows (critical).

  • High-frequency SELECT queries. Scenario: querying product information.

2) Why does the API time out?

Checking the logs: after the activity started at 8:00, queries for a large number of popular products hit the DB directly, matching the high-frequency SELECT pattern above.

3) Is DB instability affecting the API, or is heavy API traffic affecting the DB?

Based on (2), a large amount of traffic penetrated to the DB because of cache breakdown.

Why is the cache broken down?

Going through the system architecture: the 8:00 on-sale time is controlled by an offline Job, so the product data changes at 8:00, and the change causes the cache to be refreshed (delete first, then re-add). At the moment the cache is invalidated, traffic breaks through to the DB and fills the database connection pool. This is exactly the cache breakdown phenomenon described earlier.

Data Access layer – table level cache – Active refresh

As shown in the figure below, the cache is actively expired after the product information is changed, and the cache is reloaded when the user accesses it:

Data Access Layer Cache refresh architecture (old) – Message changes remove cache keys

The target

To prevent cache breakdown caused by cache deletion during the activity, and the resulting traffic penetrating to the DB, we adopted the following two policies:

1) Avoid cache invalidation caused by data update during activity

We split the sellable status into a visible status and a sellable status.

  • Visible status: products go online at 7:00 a.m., ahead of the peak, and become externally visible;

  • Sellable status: the sale start is decided logically by comparing the current time with the configured sale time. This not only avoids the cache refresh triggered by the data change at the scheduled on-sale time, but also removes the delay in the sellable status caused by waiting for the Job to run (a minimal sketch follows the figure below).

Logical timed sale avoids cache breakdown at the peak
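A minimal sketch of the logical timed-sale check, assuming a product entity with a visible flag and a configured sale start time (the field names are illustrative):

```java
import java.time.LocalTime;

// The product is cached and visible from 7:00; whether it is sellable is
// decided at request time instead of by a data change at 8:00.
public class SaleStatusChecker {
    static class Product {
        boolean visible;          // turned on at 7:00, ahead of the peak
        LocalTime saleStartTime;  // e.g. 08:00, configured in advance
    }

    public boolean isSellable(Product p, LocalTime now) {
        // No row update happens at 8:00, so no cache invalidation occurs;
        // the sellable state flips purely by comparing against the current time.
        return p.visible && !now.isBefore(p.saleStartTime);
    }
}
```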

2) Adjust the cache refresh strategy

The original cache refresh scheme (delete first, then add) carries a risk of cache breakdown, so we changed the refresh policy to overwrite-in-place, which leaves no window in which the cache entry is missing. In the new cache refresh architecture, Canal listens to the MySQL binlog and publishes change messages to MQ; the consumer aggregates the changes and rebuilds the cache.

Data Access Layer Cache Refresh Architecture (New) – Message changes rebuild the cache
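A rough sketch of the consumer side of this path (the DAO, cache, and message types are assumptions; the Canal/MQ wiring itself is omitted): binlog change messages are aggregated by product id and each cache entry is overwritten in place rather than deleted.

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Rebuilds cache entries from change messages, overwriting instead of deleting.
public class ProductCacheRefresher {
    interface ProductDao { Object loadProduct(long productId); }
    interface Cache { void overwrite(String key, Object value); }

    private final ProductDao dao;
    private final Cache cache;

    public ProductCacheRefresher(ProductDao dao, Cache cache) {
        this.dao = dao;
        this.cache = cache;
    }

    // Called with a batch of binlog-change messages pulled from MQ.
    public void onMessages(List<Long> changedProductIds) {
        // Aggregate duplicates so each product is rebuilt once per batch.
        Set<Long> distinct = changedProductIds.stream().collect(Collectors.toSet());
        for (long id : distinct) {
            Object fresh = dao.loadProduct(id);      // read the latest row
            cache.overwrite("product:" + id, fresh); // overwrite, never delete-then-add
        }
    }
}
```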

The effect

Service RT stayed normal, and QPS rose to 210,000.

The two problems above are independent of the specific business. Next we introduce two business pain points:

  • How to prevent malicious buying (purchase limits)

  • How to prevent overselling and underselling (inventory deduction)

2.3 Purchase limits

What is the purchase limit?

A purchase limit is a cap on the quantity a customer can buy. It is usually applied to special-offer or discounted products, as a commercial measure to prevent malicious bulk buying.

There are dozens of combinations of purchase-limit rules, for example:

1) One ID card can book the same scenic spot only once for the same travel date;

2) Within 7 days (by booking date), at most 3 scenic spots in a given area can be booked, with a maximum of 20 tickets;

3) During the activity, users with more than 5 reservations that ended in no-shows are blocked from further booking;

The problem

Inventory deduction failed and the purchase-limit record was then cancelled "successfully", but the underlying data was inconsistent, so the user was wrongly blocked when booking again.

Why

Submitting a purchase-limit record is a double write to Redis and the DB: the Redis write is synchronous, while the DB write is asynchronous through a thread pool. When requests are too numerous, the thread queue backs up, so the Redis write succeeds while the DB write is delayed. If the purchase-limit record is submitted successfully but the inventory deduction then fails, the purchase-limit record must be cancelled.

As shown below:

Limit Check – Submit Limit – Cancel Limit

Under high concurrency, the thread-pool queue for submitting purchase-limit records backs up: Redis has been written successfully, but the DB write has not finished. At this point the cancel operation "succeeds" even though the DB delete finds no record to remove, and afterwards the delayed submission finally writes the purchase-limit record to the DB.

The diagram below:

Thread queue backlog: the earlier "submit limit" write lands after the "cancel limit" request

The target

Stable service, accurate purchase limit.

Strategy

Ensure that Redis and the DB are eventually consistent for the cancel-limit operation.

Because the submission of purchase-limit records may be backlogged, the submitted record may not yet be in the DB when the limit is cancelled, so the cancel operation cannot delete the corresponding record. We make the cancel-limit operation (Redis/DB) eventually consistent through delayed-message compensation and retries: when cancelling the limit affects 0 rows in the purchase-limit table, an MQ delay message is sent; the consumer consumes the message and cancels the limit again, and core metrics are tracked through instrumentation and monitoring.

As shown below:

Order – submit limit and cancel limit
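A minimal sketch of this compensation flow (the DAO, Redis, and delayed-MQ interfaces are assumptions for illustration):

```java
// Cancel-limit with delayed-message compensation when the DB record is not there yet.
public class CancelLimitService {
    interface LimitDao { int deleteLimitRecord(String orderId); }   // returns affected rows
    interface RedisLimit { void release(String orderId); }
    interface DelayMq { void sendDelayed(String topic, String payload, long delayMs); }

    private final LimitDao limitDao;
    private final RedisLimit redisLimit;
    private final DelayMq mq;

    public CancelLimitService(LimitDao dao, RedisLimit redis, DelayMq mq) {
        this.limitDao = dao; this.redisLimit = redis; this.mq = mq;
    }

    public void cancelLimit(String orderId) {
        redisLimit.release(orderId);                  // release the Redis-side quota
        int affected = limitDao.deleteLimitRecord(orderId);
        if (affected == 0) {
            // The async submit has not reached the DB yet: compensate later.
            mq.sendDelayed("cancel-limit-retry", orderId, 5_000);
        }
    }

    // Consumer side: retry the DB delete; if it still affects 0 rows,
    // re-delay the message until the submitted record has landed.
    public void onRetryMessage(String orderId) {
        if (limitDao.deleteLimitRecord(orderId) == 0) {
            mq.sendDelayed("cancel-limit-retry", orderId, 5_000);
        }
    }
}
```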

The effect

Purchase limits are enforced accurately, and there are no complaints about false interception.

2.4 Inventory deduction

The problem

  • The product back end showed that 10,000 tickets were sold out, but only 5,000 were actually sold, leaving inventory unsold.

  • Hot rows in MySQL caused row-level lock contention, hurting deduction performance.

Why

  • The inventory and inventory-detail SQL statements were not in one transaction; under heavy deduction load partial failures occurred, leaving the inventory record and its details inconsistent.

  • Hot products and hot travel dates were booked in a concentrated burst, creating inventory-deduction hotspots in MySQL.

The target

Deduct inventory accurately and improve processing capacity.

Strategy

1) Put the inventory deduction and its deduction details in one transaction to guarantee data consistency (a minimal sketch follows the figure below).

DB transaction deducts inventory
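A minimal JDBC sketch of this transaction (table and column names are illustrative, not the real schema):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;

// Stock update and deduction detail succeed or fail together in one transaction.
public class InventoryDeduction {
    public boolean deduct(Connection conn, long skuId, String date, int count, String orderId) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement upd = conn.prepareStatement(
                 "UPDATE inventory SET stock = stock - ? WHERE sku_id = ? AND travel_date = ? AND stock >= ?");
             PreparedStatement ins = conn.prepareStatement(
                 "INSERT INTO inventory_detail (order_id, sku_id, travel_date, count) VALUES (?, ?, ?, ?)")) {
            upd.setInt(1, count); upd.setLong(2, skuId); upd.setString(3, date); upd.setInt(4, count);
            if (upd.executeUpdate() == 0) {       // not enough stock: roll back
                conn.rollback();
                return false;
            }
            ins.setString(1, orderId); ins.setLong(2, skuId); ins.setString(3, date); ins.setInt(4, count);
            ins.executeUpdate();
            conn.commit();                        // record and detail are committed together
            return true;
        } catch (Exception e) {
            conn.rollback();
            throw e;
        } finally {
            conn.setAutoCommit(true);
        }
    }
}
```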

The effect

Advantages: Consistent data.

Disadvantages: for hot resources and hot dates, the row-level lock is held longer during deduction, interface RT increases, and processing capacity drops.

2) Use a distributed cache to pre-deduct inventory and reduce database access.

Flash-sale products are deducted asynchronously to flatten the DB peak; non-flash-sale products follow the normal flow.

When a product goes online, its inventory is written into Redis. During the activity, inventory is deducted in Redis with an atomic INCRBY; after a successful deduction, a deduction message is sent to MQ, and the consumer deducts the inventory in the DB. When returning inventory in the DB, if no deduction record is found (the MQ message may be delayed), the return is retried with a delay, and core metrics are tracked through instrumentation and monitoring.

Asynchronous deduction of inventory
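A minimal sketch of the cache-side deduction (the key format and the MQ interface are assumptions; DECRBY is used here as the atomic counterpart of the INCRBY mentioned above):

```java
import redis.clients.jedis.Jedis;

// Pre-deduct in Redis, then hand the DB deduction to an MQ consumer.
public class RedisStockService {
    interface Mq { void send(String topic, String payload); }

    private final Jedis jedis;
    private final Mq mq;

    public RedisStockService(Jedis jedis, Mq mq) { this.jedis = jedis; this.mq = mq; }

    // Inventory was written into Redis when the product went online, e.g.:
    //   jedis.set("stock:" + skuId + ":" + date, String.valueOf(total));
    public boolean deduct(long skuId, String date, int count, String orderId) {
        String key = "stock:" + skuId + ":" + date;
        // Atomic decrement: the whole deduction is a single Redis command.
        long remaining = jedis.decrBy(key, count);
        if (remaining < 0) {
            jedis.incrBy(key, count);   // roll back the over-deduction
            return false;               // sold out
        }
        // Success in the cache; the DB deduction happens asynchronously on the consumer.
        mq.send("stock-deduct", orderId + "," + skuId + "," + date + "," + count);
        return true;
    }
}
```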

The effect

  • Service RT stable, database IO stable

  • Redis inventory deduction showed signs of key hotspots

3) Deduct inventory with hotspot bucketing in the cache

When the traffic on a single key approaches the capacity of a single Redis instance, the key must be split to remove the single-instance hotspot. A hot-key problem did occur for hot dates of hot tickets, but monitoring showed it was not especially serious, so we temporarily split the Redis cluster to reduce per-instance traffic and relieve the hotspot. As a result, bucketed hotspot deduction was not actually implemented this time.

As shown below:

Cache hotspot bucket deduction

Splitting inventory into buckets:

Before the flash sale starts, lock inventory modification and apply the bucket-splitting policy: split the inventory into N buckets by inventory ID, with the cache keys of the buckets being Key[0..N-1]. Each bucket holds m units of stock and is initialized into Redis. When the flash sale starts, requests are routed to different buckets by Hash(Uid) % N, so that all the traffic no longer lands on a single key and stresses a single Redis instance.
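A minimal sketch of this bucketed deduction (the key format and hash routing are illustrative; as noted above, the scheme was designed but not rolled out this time):

```java
import redis.clients.jedis.Jedis;

// Spread one product's stock over N Redis keys and route users by hash.
public class BucketedStock {
    private final Jedis jedis;
    private final int bucketCount;   // N

    public BucketedStock(Jedis jedis, int bucketCount) {
        this.jedis = jedis;
        this.bucketCount = bucketCount;
    }

    // Initialization before the flash sale: m units of stock per bucket.
    public void init(long inventoryId, int perBucketStock) {
        for (int i = 0; i < bucketCount; i++) {
            jedis.set("stock:" + inventoryId + ":" + i, String.valueOf(perBucketStock));
        }
    }

    // Route each user to one bucket so no single key takes all the traffic.
    public boolean deduct(long inventoryId, long uid, int count) {
        int bucket = (int) (Math.abs(uid) % bucketCount);   // Hash(Uid) % N
        String key = "stock:" + inventoryId + ":" + bucket;
        long remaining = jedis.decrBy(key, count);
        if (remaining < 0) {
            jedis.incrBy(key, count);   // return the over-deducted amount
            return false;
        }
        return true;
    }
}
```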

Shrinking the number of buckets:

Normally, the inventory in each bucket of a hot activity is reduced to 0 after several rounds of deduction.

In special scenarios, each bucket may end up with only single-digit inventory, and a reservation for more units than a bucket holds will fail to deduct. For example, with 100 buckets each holding one or two units, an order for three units fails even though total inventory remains. When per-bucket inventory drops into single digits, we shrink the number of buckets, preventing the situation where users can still see available inventory but deductions keep failing.

Comparison before and after optimization

Comparison of inventory reduction schemes

III. Review and Summary

Reviewing the whole "Travel with Love to Hubei" campaign, our preparation was as follows:

  • Risk assessment: sort out risk points in the system architecture and core processes, and formulate coping strategies once they are identified;

  • Traffic estimation: estimate the activity's peak QPS based on ticket volume, historical PV, and holiday peaks;

  • Full-link load testing: load test the system against the estimated peak QPS, find the problem points, then optimize and improve;

  • Rate-limit configuration: configure safe throttling thresholds that still meet business requirements;

  • Contingency plans: collect possible risk points in each domain and prepare emergency-handling plans;

  • Monitoring: watch the monitoring metrics during the activity and handle any anomaly according to the plans;

  • Follow-up: after the activity, analyze the logs and monitoring metrics, review any faults, and keep improving.

This article has described four representative problems encountered during the ticket-grabbing activity. Throughout the optimization we kept working through the technical details and consolidating the core techniques, so that users could book and enter the attractions smoothly and have a good experience.