Background

Speaking of architecture, most people think of technical terms: technical frameworks, SOA, microservices, middleware, and so on. These belong to pure system architecture or infrastructure, which is largely unaffected by the business. Most of it can be designed and developed independently of any specific business, forming its own standalone system or even a standardized technical product.

In reality, though, technology mostly exists to serve the business. What we build most of the time are application systems or business systems, and the distinct characteristics of each business give the application (business) architecture distinct characteristics of its own.

These differing characteristics cannot be addressed by technology alone. An important principle of application architecture design is to stay technology-neutral, so thinking from the application perspective matters more than thinking from the technology perspective.

I have worked on the core transaction systems of an e-commerce platform. When e-commerce comes up, people naturally think of PV, UV, high performance, high concurrency, high stability, flash sales (seckill), ordering, inventory, distributed transactions, and so on.

Each of these sounds deep and mysterious at first. Let's take the much-discussed seckill scenario as an example and analyze it: 10 million people rushing to buy 100 bars of 100g gold.

Conventional seckill architecture

The general architecture is as follows

Conventional traffic distribution model

Display layer traffic > application layer traffic > service layer traffic > DB layer traffic

The traffic distribution model of a "super-awesome" (super-NB) system, by contrast, would be:

Display layer traffic = application layer traffic = service layer traffic = DB layer traffic

As we know, the DB is the lowest layer of the system and the biggest traffic bottleneck. As the models above show, the super-NB companies have solved the DB bottleneck: all traffic can flow straight through to the DB layer, every layer can be scaled out arbitrarily, and the pressure on the system is easily absorbed.

Of course, some inexperienced teams build systems the same way, but their DB layer, and often other layers as well, are not up to it, so the system frequently goes down. In reality, even the NB companies do not do this: even where the technology makes it possible, it is unnecessary, because the cost is far too high.

So we want to filter traffic layer by layer, in a trapezoid shape, before it reaches the DB layer; this is exactly the conventional traffic distribution model shown above. The ideal outcome is that the traffic reaching the DB layer equals the actual number of orders, 100 (100 gold bars).

Seckill traffic filtering – General idea

Returning to the conventional traffic distribution model, the following is a commonly used traffic filtering process for seckill systems:

[Figure: a commonly used seckill traffic-filtering flow]

If we considered only the ultimate in technology, the problem would be simple to state: make the DB scale out at any time just like the application and service layers. In practice the DB cannot do this; it remains the biggest bottleneck, which is why queuing systems and reservation systems exist.

Caching must be used, but a seckill is often instantaneous. Because of the freshness requirements on the cached data, the caching system offers little protection against the instantaneous impact that such a huge burst of traffic has on the DB.

For example, updating the inventory of a seckill item sends a large number of UPDATE statements directly to the DB, putting it under heavy pressure. We tried issuing a MySQL SELECT first and only running the UPDATE when stock remained; the improvement was noticeable, especially when the seckill involves a single SKU or only a few SKUs, since we all understand how differently a SELECT and an UPDATE load the DB. Interested readers can test this themselves.
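A minimal sketch of this idea, assuming a hypothetical stock table with sku_id and quantity columns and a plain JDBC connection: the cheap SELECT screens out requests once stock is gone, and the conditional UPDATE guards against overselling.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class StockDeducter {

    // Returns true if one unit of stock was successfully deducted.
    // Table and column names (stock, sku_id, quantity) are assumptions for illustration.
    public boolean tryDeduct(Connection conn, long skuId) throws SQLException {
        // 1. Cheap read: if stock is already 0, answer without issuing an UPDATE at all.
        try (PreparedStatement check = conn.prepareStatement(
                "SELECT quantity FROM stock WHERE sku_id = ?")) {
            check.setLong(1, skuId);
            try (ResultSet rs = check.executeQuery()) {
                if (!rs.next() || rs.getInt(1) <= 0) {
                    return false; // sold out, skip the expensive UPDATE
                }
            }
        }
        // 2. Conditional write: the WHERE clause prevents overselling even under concurrency.
        try (PreparedStatement update = conn.prepareStatement(
                "UPDATE stock SET quantity = quantity - 1 WHERE sku_id = ? AND quantity > 0")) {
            update.setLong(1, skuId);
            return update.executeUpdate() == 1;
        }
    }
}
```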

A queuing system is the most common tool in seckill scenarios. In essence, it slows traffic down asynchronously, flattens the instantaneous peak, and reduces the traffic's impact on the downstream service layer and DB layer.
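As a rough illustration of how a queue flattens the peak, here is a minimal in-process sketch using a bounded BlockingQueue. A real seckill would normally use an external message queue, and all names here (pendingOrders, placeOrder) are made up for the example.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Executors;

public class OrderQueueDemo {

    // Bounded queue: requests beyond its capacity are rejected immediately
    // instead of hitting the service/DB layer all at once.
    private final BlockingQueue<Long> pendingOrders = new ArrayBlockingQueue<>(10_000);

    public boolean submit(long userId) {
        // offer() never blocks; a full queue means "try again later"
        return pendingOrders.offer(userId);
    }

    public void startConsumer() {
        // A single consumer drains the queue at a pace the DB can sustain.
        Executors.newSingleThreadExecutor().submit(() -> {
            while (!Thread.currentThread().isInterrupted()) {
                try {
                    long userId = pendingOrders.take();
                    placeOrder(userId); // hypothetical downstream call that touches the DB
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
    }

    private void placeOrder(long userId) {
        // a real implementation would deduct stock and create the order here
    }
}
```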

The role of the reservation system is to learn the traffic in advance. Although the reservation volume itself is not controllable, knowing it before the seckill starts lets us prepare a plan for that traffic, keeping the system in a controllable state.

The combination of reservation + queuing already satisfies the needs of most flash-sale scenarios.

Seckill traffic filtering – a new idea

After all, reservation and queuing are add-on systems. Do the layers themselves also need their own plans and protections? Let's go through the figure above layer by layer and think of some approaches:

[Figure: layer-by-layer traffic filtering]

DB layer

The DB layer cannot be expanded arbitrarily or at any time. It is the biggest bottleneck and the last line of defense, and it absolutely must not be breached: first, DB concurrent connections cannot exceed the maximum connection count; second, DB pressure cannot be too high. Traffic must therefore be controlled further up the stack.

The service layer

The service layer, although distributed, is constrained by DB connections, so it cannot be scaled out indefinitely either. If a shared service layer does not distinguish the seckill service from ordinary services, traffic filtering cannot be done here without affecting the normal traffic of the ordinary services; in that case traffic can only be filtered at the application layer.

If the seckill flow has its own dedicated service, or the traffic source can tell seckill requests apart from other business, there are still options; there are always more methods than difficulties, for example:

Random number filtering

Randomly reject a certain percentage of user requests and return them directly to the application layer, which wraps the response and prompts the user in a friendly way.
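A minimal sketch of such a filter; the drop ratio is an assumed configuration value.

```java
import java.util.concurrent.ThreadLocalRandom;

public class RandomDropFilter {

    private final double dropRatio; // e.g. 0.9 drops roughly 90% of seckill requests

    public RandomDropFilter(double dropRatio) {
        this.dropRatio = dropRatio;
    }

    // Returns true if the request should be processed, false if it should be
    // bounced back to the application layer with a friendly "try again" message.
    public boolean allow() {
        return ThreadLocalRandom.current().nextDouble() >= dropRatio;
    }
}
```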

Rate limiting with a preset threshold

Set a maximum threshold for what a single machine will process per unit of time. For example, if the actual maximum TPS of a single machine is 10,000/s, set the threshold to 20.

Requests beyond the threshold skip business logic and DB operations entirely; the system simply checks the counter and responds. A single machine's capacity therefore becomes the 20 processed requests plus X cheap rejections, far more than 10,000/s.
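A minimal per-second counter sketch of this idea. The threshold value and the one-second window are assumptions, the counting is only approximate under heavy contention, and a production system would more likely use a token bucket or a gateway-level limiter.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicLong;

public class PresetThresholdLimiter {

    private final int threshold;                 // e.g. 20 requests per second per machine
    private final AtomicInteger counter = new AtomicInteger();
    private final AtomicLong windowStart = new AtomicLong(System.currentTimeMillis());

    public PresetThresholdLimiter(int threshold) {
        this.threshold = threshold;
    }

    // Returns true for roughly the first `threshold` requests in the current window;
    // everything else gets a cheap rejection without touching business logic or the DB.
    public boolean tryAcquire() {
        long now = System.currentTimeMillis();
        long start = windowStart.get();
        if (now - start >= 1000 && windowStart.compareAndSet(start, now)) {
            counter.set(0); // start a new one-second window
        }
        return counter.incrementAndGet() <= threshold;
    }
}
```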

Note: the two methods above are neither sequential nor dependent on each other; both are optional. Their essence is to filter traffic and raise single-machine throughput so the system is not washed away.

Maximize traffic filtering while preserving a certain level of user experience (single-machine processing capacity). In the second scheme, for example, only the 20 requests per machine within the threshold can reach the DB layer, which guarantees the DB layer has absolutely no problem. If the service layer is still under pressure, keep adding servers and lowering the preset threshold. The second scheme is really a more controllable version of the first.

The application layer

1. Random number filtering, same as above.

2. Rate limiting with a preset threshold: set the maximum single-machine processing threshold per unit of time, same as above.

3. Control the maximum number of Tomcat connections: requests beyond the limit are rejected outright. The user experience is poor, though, because the system feels frozen or crashed, so try to solve the problem by adding more servers instead. (The service layer can have its maximum connection count controlled via the application layer, but traffic at the display and application layers depends directly on the number of users and is hard to control; a reservation system can make that traffic controllable.)

Display layer

Random number filtering: reject a certain percentage of requests and prompt those users in a friendly manner.

Normally the display layer should not filter at all; rejecting a request before it even reaches the server is a bit of a dirty trick. From a business perspective, though, a seckill is itself a probabilistic event: success does not depend purely on who clicks first (a later request can sometimes win instead, because the outcome also depends on which distributed server handles it, the network, the queuing system's asynchronous processing, and so on).

It feels a bit shameful for an engineer to cheat the user, but playing this kind of trick occasionally is still acceptable, and certainly better than the whole system being overwhelmed.

The layer-by-layer treatment above is not necessarily a great scheme or architecture. The point is simply that a small idea or scheme can solve many problems that technology alone cannot, and that these schemes have little to do with advanced technology itself.

Thoughts on distributed transactions

Let me give another example. Many people are very concerned about how to guarantee distributed transactions, especially across heterogeneous systems and heterogeneous DBs. As far as I know, no technology yet solves this problem completely; the common approach today is rollback and compensation.

Of course, many people also skip one more step: data reconciliation. Any architecture that lacks subsequent rollback/compensation or data reconciliation will eventually have consistency problems in its distributed transactions; after all, these are heterogeneous systems or DBs (as shown in the figure).
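A minimal sketch of the rollback/compensation plus reconciliation idea across two heterogeneous services. All service names and methods here are hypothetical; a real system would also persist each step so that compensation and reconciliation can be retried.

```java
public class OrderSaga {

    private final InventoryService inventoryService; // hypothetical remote service A
    private final PaymentService paymentService;     // hypothetical remote service B

    public OrderSaga(InventoryService inv, PaymentService pay) {
        this.inventoryService = inv;
        this.paymentService = pay;
    }

    public boolean placeOrder(long orderId) {
        if (!inventoryService.reserve(orderId)) {
            return false; // nothing to roll back yet
        }
        try {
            if (paymentService.charge(orderId)) {
                return true; // both steps succeeded
            }
            inventoryService.release(orderId); // compensation: undo the reservation
            return false;
        } catch (Exception e) {
            // compensation on failure; reconciliation will re-check this order later
            inventoryService.release(orderId);
            return false;
        }
    }

    // Reconciliation: periodically compare both sides and repair mismatches.
    public void reconcile(long orderId) {
        boolean reserved = inventoryService.isReserved(orderId);
        boolean charged = paymentService.isCharged(orderId);
        if (reserved && !charged) {
            inventoryService.release(orderId);
        }
        // other mismatch cases handled analogously (e.g. charged but not reserved -> refund)
    }

    // Hypothetical interfaces standing in for the two heterogeneous systems.
    public interface InventoryService {
        boolean reserve(long orderId);
        void release(long orderId);
        boolean isReserved(long orderId);
    }

    public interface PaymentService {
        boolean charge(long orderId);
        boolean isCharged(long orderId);
    }
}
```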

If the seckill scenario at least places some demands on technology, the distributed-transaction scheme here involves even less "technology": as long as the details are done well (rollback, compensation, reconciliation), the data can still be 100% accurate.

However, many of us chase extreme distributed-transaction architectures or technical solutions while neglecting, or disdaining, these easy details, and the system ends up constantly in trouble.

In pursuit of perfection, a certain phone brand adopted a special back-cover process on a high-end model. For technical reasons the yield of the back cover was low, output suffered, and normal sales of the core product, the phone itself, were affected; the loss outweighed the gain. Much the same logic applies here.

Conclusion

Application architecture is both easy and difficult. The easy part is that it does not require too much attention to technical implementation. The difficult part is that the most appropriate architecture, which is definitely not the most perfect one, must be chosen according to the actual business scenario and business needs, time, cost, and resources.

I have seen the debate about whether architecture is designed or evolved. This one is actually easy: at a given stage or point in time, such as the early days of a system, it must be designed.

But looking at a system's architecture over its whole life, or if forced to choose between the two, it has to be evolved.

No large-scale Internet system was designed at the outset to be what it is today. They evolve along with the business, from small to large and from simple to complex, and the evolution of the architecture bears witness to the growth of the business.

Technology is endless, but an application architecture (business system) should not pursue technology endlessly. When the DB bottleneck cannot be solved, change the approach: add a queuing system and a reservation system and the technical difficulty drops dramatically. When distributed transactions cannot be fully solved, do the basics well: rollback, compensation, and reconciliation.

With the fundamentals solid, like a firmly planted horse stance, the system can still be robust, stable, and high-performing. Think more about solutions from the application-architecture perspective, and better days lie ahead!

About the author

Zhang Ligang is chief architect at Shanghai Xianbei Technology and former architecture director of the technology department at Yihaodian, where he was responsible for core e-commerce transaction systems such as order placement, shipping fees, inventory, and orders. He has deep understanding of, and hands-on practice with, e-commerce core business, high concurrency, and high performance.

Thanks to Yuta Tian guang for correcting this article.