Background

Hello everyone. This article introduces a classic question that big-company interviewers love to ask: how to handle an instantaneous high-concurrency flash sale. Large systems run into similar scenarios all the time: e-commerce seckill promotions, high-concurrency ticket sales for attractions, panic buying of scarce goods (such as face masks), and high-concurrency ticket grabbing in systems like 12306.

Interviewers therefore often ask about high-concurrency purchasing. If you cannot walk through the problems such a system may hit under high concurrency, together with your architecture design and solutions, the interview will likely end badly.

So today let's analyze it together. Suppose a scarce item is in short supply: 100,000 masks go on sale and are snapped up within 1 minute, with hundreds of thousands of people flooding in at the same instant to buy. What problems might the system face, and how should we design the architecture to solve them?

Business Architecture Design

First of all, when analyzing this kind of problem, don't start by worrying about how high the instantaneous concurrency is. Instead, sketch the basic business architecture for purchasing this kind of scarce item and get the business flow clear.

Look at the picture below. To build a flash-sale system, you need to depend on the commodity system, because you must read and write product data; on the inventory system for stock deduction; on the price system to calculate the current purchase price; and on the marketing system to validate promotions applied to the purchase.

Finally, we also depend on basic systems for authentication and risk-control interception to decide whether a purchase request may proceed. A single flash purchase therefore involves many systems. The complete basic business architecture of a high-concurrency flash-sale system is shown in Figure 1 below:

Figure 1: Business architecture of the high-concurrency flash-sale system

Network topology architecture design

In addition, we should also draw how a purchase request reaches the flash-sale system step by step. In general, the mobile APP accesses the back end through a domain name; DNS resolves that domain to the IP address of the SLB load-balancing system.

The request then arrives at the SLB load balancer, which distributes it evenly across the back-end API gateway, which in turn routes the traffic to the flash-sale system, so it looks roughly like Figure 2:

Figure 2: Network topology of the high-concurrency flash-sale system

Well, when you can sketch the business architecture diagram and the production network topology in front of the interviewer, his face may stay blank, but what he is really thinking is: not bad at all; most candidates freeze at this question, yet this one actually knows to start from the business architecture and the network topology.

But don't celebrate too early. Like the Journey to the West, you have covered only the first eight thousand li of a hundred-and-eight-thousand-li road; the rest still lies ahead, and all kinds of demons and monsters wait along the way. Pull yourself together and keep going.

Flash-sale traffic peak

At this point, the next step is to analyze the difference between everyday traffic and flash-sale traffic. What does that mean? Everyday traffic is the normal load when no flash sale is running and people are simply buying various goods: roughly how many requests per second does the system see then?

It's hard to give a single number, because every company is different, but we can pick a middle-of-the-road value: say the whole system handles only 1,000 requests per second in daily operation. That is a representative figure, neither high nor low, as shown in Figure 3 below.

Figure 3: Everyday traffic of the flash-sale system

Generally speaking, as long as the flash-sale system and every system it depends on is deployed on at least 2 machines, the normal traffic of 1,000 requests per second poses no real problem when the systems work together. But suppose there is a promotion: a scarce item limited to 100,000 units that everyone badly needs, on sale at 10:00 a.m. every day, with hundreds of thousands of people staring red-eyed at their phone screens, determined to grab one. What does the traffic look like then?

Note the key point. By common flash-sale experience, the 100,000 units will be gone within 1 minute, and by the 80/20 rule, 80% of the goods sell within 20% of that time: 80,000 items may be snapped up within 10 seconds, with roughly 80% of the participants generating that burst. Assuming 500,000 people participate in total, about 400,000 of them send purchase requests within those 10 seconds while the 80,000 items sell out. The request rate is therefore roughly 400,000 / 10s = 40,000 QPS, as shown in Figure 4 below:

Figure 4: Peak traffic of the high-concurrency flash-sale system

What do you think of the picture above? Don't freeze; the interviewer is listening with relish, so keep talking quickly, because if you stop here you'll just be staring blankly at each other. So what happens if your system suddenly receives 40,000 requests per second?

Quite simply, the system will be crushed: network bandwidth saturates, CPU usage hits 90%, database load spikes, and calls to downstream dependencies time out frequently. Why? Because your system is routinely deployed to withstand 1,000 requests per second; it was never designed for 40,000.

Optimization of architecture design

So the question becomes: how do you make the flash-sale system withstand 40,000 requests per second? To answer it, let me quietly pass you a little martial-arts secret while the interviewer isn't looking. Under normal circumstances, a 4-core 8 GB machine running around 200 worker threads, where each request also calls other services or accesses the database, can handle on the order of 1,000 requests per second.

Flash-sale system performance bottleneck analysis

But mind you, and this is the key point: it's not that a 4-core 8 GB machine can only ever push 1,000 requests per second. The real issue is that each request calls other services and accesses the database, and these network round-trips to external systems are what cap the per-second throughput at such a low number, as shown in Figure 5 below:

Figure 5: Performance bottleneck of the flash-sale system
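Under the rule of thumb above (one 4-core 8 GB machine with ~200 worker threads tops out around 1,000 req/s when every request crosses the network), a quick capacity check shows why naive horizontal scaling alone is painful. The numbers are the article's; the helper function is just an illustration:

```python
import math

def machines_needed(peak_qps: int, per_machine_qps: int) -> int:
    """Naive capacity estimate: machines required to absorb peak_qps
    if each machine tops out at per_machine_qps."""
    return math.ceil(peak_qps / per_machine_qps)

# Everyday traffic: 1,000 req/s fits on a single such machine.
print(machines_needed(1_000, 1_000))   # 1
# Flash-sale peak: 40,000 req/s would need ~40 such machines --
# and every dependency (DB, commodity system, ...) must scale too.
print(machines_needed(40_000, 1_000))  # 40
```

This is exactly why the next section focuses on raising per-machine throughput by cutting network calls, rather than just adding machines.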

Keep in mind that middleware systems such as Redis and RocketMQ, after deep optimization, can comfortably handle tens of thousands of QPS or more on a single machine. What does deep optimization mean here? In short: serve each request by reading and writing data in your own memory and returning immediately, without crossing the network to external systems. Do that and your throughput rises by an order of magnitude or more, as shown in Figure 6 below:

Figure 6: Deep optimization of the flash-sale system architecture

Flash-sale system architecture optimization

So in general, there are three very powerful optimization methods for this scenario: drastically reduce dependency calls to external services; write data directly to the cache where possible and persist it to the DB asynchronously; and serve reads from data cached in the system's own JVM memory, returning locally.

A few examples. For a fixed-price flash sale of a scarce item, the calls to the price system and the marketing system can be dropped entirely; the price is fixed and there are no discounts. For generic operations such as risk control and authentication, can they be lifted up to the API gateway layer, removing that generic logic from the business system? Together this eliminates the calls to four systems at once.

Likewise, for inventory deduction: can the inventory system sync stock data into Redis so that we deduct directly in Redis, then send an MQ message to asynchronously deduct the stock in the inventory system's database? And for the heavy read traffic on product data: can we cache it in Redis and preload all hot product data into the local cache in the flash-sale system's JVM memory in advance? The optimized system looks roughly like Figure 7 below.

Figure 7: Cache optimization of the flash-sale system architecture
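The Redis-first deduction with asynchronous DB write-behind described above can be sketched as follows. This is a minimal in-memory simulation: the `cache` dict stands in for Redis (a real implementation must make the deduction atomic, e.g. with `DECR` or a Lua script), and the `mq` deque stands in for a RocketMQ producer; all names are illustrative:

```python
from collections import deque

class FlashSaleStock:
    """Sketch of Redis-first stock deduction with async DB write-behind."""

    def __init__(self, sku: str, units: int):
        self.cache = {sku: units}  # stands in for Redis
        self.mq = deque()          # stands in for an MQ topic

    def try_deduct(self, sku: str, order_id: str) -> bool:
        # In real Redis this check-and-decrement must be atomic
        # (DECR or a Lua script) or concurrent buyers could oversell.
        if self.cache.get(sku, 0) <= 0:
            return False           # sold out: reject immediately, no DB hit
        self.cache[sku] -= 1
        # Async path: the inventory system consumes this message later
        # and applies the deduction to its database.
        self.mq.append({"op": "deduct", "sku": sku, "order": order_id})
        return True

stock = FlashSaleStock("mask", 3)
results = [stock.try_deduct("mask", f"o{i}") for i in range(5)]
print(results)        # [True, True, True, False, False]
print(len(stock.mq))  # 3 messages queued for the DB
```

The point of the design is that the hot path touches only memory; the database sees a smooth, asynchronous stream of messages instead of the 40,000 QPS spike.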

As you can see in the figure above, after this round of optimization the flash-sale system no longer calls any service synchronously. When it reads product data, it first hits the preloaded cache in its own JVM, almost a pure in-memory operation; inventory deduction is written to Redis; and the stock deduction in the inventory system's database, and even order creation in the order system, are performed asynchronously through MQ.

With the system optimized to this level, deploying a few more flash-sale machines is enough to withstand tens of thousands of requests per second. But is that the end? Of course not. The system still has plenty of problems, so let's keep analyzing and optimizing step by step.

1) Cache breakdown in the high-concurrency flash-sale system: analysis and solution

First, let's analyze the breakdown problem of the product-data cache held locally in the flash-sale system's JVM. Data in the JVM local cache usually needs an expiration time, because product data changes, and if you cached it forever you would never see the updates. So suppose we set a 30-minute expiration. Each time the local cache expires, the flash-sale system has to look up the product cache in Redis; if it's not there, it must call the commodity system's interface to query the database, as shown in Figure 8 below.

Figure 8: High-concurrency flash-sale system – cache expiration problem

So picture the critical moment when the local cache in the flash-sale system has just expired, and quite possibly Redis holds no data either. A flood of requests arrives, misses the local cache, misses Redis, and then what? Then it's over, because those requests pour into the commodity system, force it to query the database, and knock it over outright, as shown in Figure 9 below.

Figure 9: High-concurrency flash-sale system – cache breakdown problem

Therefore, we usually need a special design for the local cache: instead of letting it expire automatically and falling back to the commodity system on a miss, have the flash-sale system refresh its local cache proactively.

That is, the system starts a background thread that, every 30 minutes, fetches the latest cached data from Redis, or from the commodity system, and then refreshes the local cache. This avoids the scenario where, right after automatic expiration, a sudden flood of cache-miss requests stampedes into the commodity system, as shown in Figure 10 below.

Figure 10: High-concurrency flash-sale system – automatic cache refresh mechanism
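A minimal sketch of the refresh thread described above, with the Redis/commodity-system lookup replaced by an injected `loader` function and the 30-minute interval shortened for demonstration; every name here is illustrative:

```python
import threading
import time

class RefreshingLocalCache:
    """Local cache that a background thread refreshes proactively,
    so readers never hit an expired, empty cache."""

    def __init__(self, loader, interval_s: float):
        self._loader = loader        # e.g. read-through to Redis
        self._interval = interval_s  # 30 minutes in the article
        self._data = loader()        # warm the cache up front
        self._stop = threading.Event()
        threading.Thread(target=self._refresh_loop, daemon=True).start()

    def _refresh_loop(self):
        # wait() returns False on timeout (keep refreshing), True when stopped.
        while not self._stop.wait(self._interval):
            self._data = self._loader()  # swap in fresh data

    def get(self, key):
        return self._data.get(key)       # pure in-memory read, never a miss

    def stop(self):
        self._stop.set()

version = {"n": 0}
def load_products():  # stands in for "query Redis / commodity system"
    version["n"] += 1
    return {"mask": {"price": 10, "v": version["n"]}}

cache = RefreshingLocalCache(load_products, interval_s=0.05)
print(cache.get("mask")["v"])      # 1 (preloaded at startup)
time.sleep(0.12)                   # let the background thread refresh
print(cache.get("mask")["v"] > 1)  # True
cache.stop()
```

Note that the refresh replaces the whole dict in one assignment, so readers always see either the old snapshot or the new one, never a half-empty cache.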


2) Data inconsistency in the high-concurrency flash-sale system: analysis and solution

On to the next common problem: the inventory cache becoming inconsistent with the DB. It can happen like this. You deduct inventory in Redis, then send an MQ message so the inventory system deducts the stock in its DB asynchronously. But before the inventory system has applied that deduction, your purchase hits an exception and rolls back: you restore the deducted stock in Redis and send another MQ message to undo the DB deduction, as shown in Figure 11 below.

Figure 11: High-concurrency flash-sale system – data inconsistency problem (I)

But now the stock is restored in Redis while the DB side is uncertain, because the inventory system may consume the MQ messages in a random order. If it receives the restore-inventory message first, it typically checks whether a deduction log exists for that purchase; finding none, it does not restore anything. Then it receives the deduct-inventory message and applies it, while the restore message is never processed, as shown in Figure 12 below.

Figure 12: High-concurrency flash-sale system – data inconsistency problem (II)

What does that lead to? In Redis the stock was deducted and then restored, but in the inventory system's DB the restore instruction did nothing and the deduct instruction went through. The cache and the inventory in the DB are now inconsistent.

The usual fix is MQ ordered (sequential) messages: route all inventory instructions for the same purchase order to the same MQ partition so they stay in order, and force the inventory system to consume and execute them sequentially. That way the deduct-inventory instruction always runs before the restore-inventory instruction, as shown in Figure 13 below:

Figure 13: High-concurrency flash-sale system – MQ ordered messages
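The ordered-message idea can be sketched as follows: the partition is chosen by hashing the order id, so all instructions for one order land in the same partition and are consumed in FIFO order. The `PARTITIONS` count and message shapes are illustrative; in RocketMQ this corresponds to sending with a MessageQueueSelector keyed by the order id:

```python
from collections import deque
import zlib

PARTITIONS = 4  # illustrative partition count

def pick_partition(order_id: str) -> int:
    # Stable hash: every message for one order maps to the same partition.
    return zlib.crc32(order_id.encode()) % PARTITIONS

topic = [deque() for _ in range(PARTITIONS)]  # stands in for an MQ topic

def send(order_id: str, op: str):
    topic[pick_partition(order_id)].append((order_id, op))

# Deduct then restore for the same order: same partition, FIFO order kept,
# so the consumer can never see "restore" before "deduct".
send("order-42", "deduct_stock")
send("order-42", "restore_stock")

p = pick_partition("order-42")
print(list(topic[p]))  # [('order-42', 'deduct_stock'), ('order-42', 'restore_stock')]
```

Ordering is only guaranteed within a partition, which is exactly why the partition key must be the order id: messages for different orders may still interleave freely, and that is fine.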

Conclusion

Well, that's it for today. We walked through the architecture design and step-by-step optimization of the high-concurrency flash-sale systems that big companies love to ask about, and analyzed and solved the cache breakdown and out-of-order data-inconsistency problems along the way. I hope that the next time you meet this question in an interview, you can reason through it step by step and show the interviewer composure as calm as water and adaptability as fine as silk.

END
