Preface

Everyone is familiar with seckill (flash sales). Since the format first appeared in 2011, it has become ubiquitous, from Singles’ Day shopping to 12306 ticket grabs. Put simply, a seckill is a process in which a large number of simultaneous requests scramble to buy the same product and complete the transaction.

From an architectural perspective, a seckill system is a “three-high” system: high performance, high consistency, and high availability. This article discusses what to pay attention to when building and maintaining a high-traffic seckill system.

Overall approach

Start from a higher level and think about the problem as a whole. A seckill system has to solve two core problems: concurrent reads and concurrent writes. In architectural terms, these translate into requirements for high availability, consistency, and high performance.

This article discusses seckill system design along these three dimensions in turn, summarized as follows:

  • High performance. Seckill involves heavy reads and heavy writes. How do you support high concurrency and withstand high IOPS? The core optimization ideas are similar: for heavy reads, read as rarely as possible or read as little data as possible; for heavy writes, split the data. This article covers three aspects: dynamic/static separation, hotspot optimization, and server-side performance optimization

  • Consistency. The core concern of seckill is inventory. A limited number of items is deducted by many requests at the same time, so accuracy is an obvious challenge. How do you sell neither too many nor too few? This article examines the core logic of consistency design through several inventory deduction schemes common in the industry

  • High availability. Large distributed systems run under very complex conditions in practice. Sudden surges in business traffic, unstable dependencies, bottlenecks in the application itself, and damaged physical resources all affect the system to a greater or lesser degree. How do you keep applications running efficiently and stably under such conditions, how do you prevent and handle unexpected problems, and what should system design take into account? This article approaches these questions from a panoramic view of putting the architecture into practice

High performance

1 Dynamic/Static Separation

You may have noticed that during a seckill you do not need to refresh the whole page; only the countdown clock keeps ticking. This is because high-traffic seckill systems are made static, that is, dynamic and static data are separated. Dynamic/static separation takes three steps: 1. data separation; 2. static caching; 3. data integration.

1.1 Data Separation

The primary purpose of dynamic/static separation is to turn dynamic pages into static pages suitable for caching. The first step is therefore to separate out the dynamic data, mainly along two lines:

  • The user. User identity information includes login status, user profile, and so on. These elements can be split out and fetched through dynamic requests. Related recommendations, such as user preferences and regional preferences, can also be loaded asynchronously

  • Time. The seckill time is controlled centrally by the server and can be fetched through a dynamic request

You can open a seckill page on any e-commerce platform and see which data on the page is dynamic.

1.2 Static Cache

Once the static data has been separated out, the second step is to cache it sensibly, which raises two questions: 1. how to cache; 2. where to cache.

1.2.1 How to Cache

One feature of the static transformation is that the entire HTTP response is cached, not just the static data. The web proxy server can then look up the response body by request URL and return it directly, without reassembling the HTTP response or parsing the HTTP request headers.

Because the URL serves as the cache key, it must be unique. For a product system, the URL can be uniquely identified by the product ID, as in Taobao’s item.taobao.com/item.htm?id…
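
As a minimal sketch of keying the cache by product ID, the function below reduces a product URL to host, path, and `id`, so that extra query parameters do not split the cache. The function name `cache_key`, the host `item.example.com`, and the parameter names other than `id` are assumptions made for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def cache_key(url: str) -> str:
    """Build a cache key from host, path and the item ID only.

    Other query parameters (tracking codes, timestamps) are ignored so that
    every request for the same item maps to the same cached response.
    """
    # Accept URLs written without a scheme, e.g. "item.example.com/item.htm?id=1".
    parts = urlsplit(url if "://" in url else "//" + url)
    item_ids = parse_qs(parts.query).get("id")
    if not item_ids:
        raise ValueError("URL carries no item id: " + url)
    return f"{parts.netloc}{parts.path}?id={item_ids[0]}"
```

Normalizing the key this way is one option for protecting the hit ratio; a real proxy would more likely do the same thing in its own cache-key configuration.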

1.2.2 Where to Cache

Where should static data be cached? There are three options: 1. the browser; 2. the CDN; 3. the server.

The browser is naturally the first choice, but the user’s browser is beyond the system’s control: if the user does not actively refresh, the system can hardly push updates to the user. The client may then show stale information for a long time. For a seckill system, being able to invalidate caches within seconds is essential.

The server mainly performs dynamic logic computation and loading, and is not good at handling large numbers of connections. Each connection consumes a fair amount of memory, and HTTP parsing in the Servlet container is slow, which easily eats into the resources needed for business logic. In addition, pushing static data down to this layer lengthens the request path.

Static data is therefore usually cached in the CDN, which is better at handling large volumes of concurrent static file requests. It supports active invalidation, sits as close to users as possible, and avoids the weaknesses at the Java language level. Note that a CDN approach has the following problems to solve:

  • Invalidation. Any cache should be time-bound, especially in a seckill scenario. The system must ensure that CDN nodes across the country invalidate cached data within seconds, which places high demands on the CDN’s invalidation system

  • Hit ratio. A high hit ratio is the core performance requirement of a cache system; without it the cache is pointless. If data is spread over CDN nodes across the country, the chance that requests hit the same cache inevitably drops, and the hit ratio becomes a problem

It is therefore unrealistic to put data on every CDN node in the country, since both invalidation and hit ratio would be severely challenged. A more feasible approach is to select a few CDN nodes for the static transformation. The selected nodes usually need to meet the following conditions:

  • Close to regions with concentrated traffic

  • Relatively far from the main site

  • Good network quality between the node and the main site

Given these factors, the CDN’s second-level cache is a suitable choice: there are fewer second-level cache nodes, each has more capacity, and traffic is relatively concentrated. This better addresses both cache invalidation and hit ratio, and is currently an ideal CDN solution.

1.3 Data Integration

Once the static data has been separated out, how to assemble the page becomes a new problem, mainly in how dynamic data is loaded. There are usually two schemes: ESI (Edge Side Includes) and CSI (Client Side Include).

  • ESI: the web proxy server requests the dynamic data, inserts it into the static page, and the user receives a complete page. This places higher demands on server performance but gives a better user experience

  • CSI: the web proxy server returns only the static page, and the front end issues a separate asynchronous JS request for the dynamic data. This is friendlier to server performance, but the user experience is slightly worse
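
The difference between the two schemes can be illustrated with a toy page template. This is only a sketch: `STATIC_PAGE`, `render_esi`, `render_csi`, and the slot names `$countdown` and `$user` are hypothetical, and a real ESI deployment would use the proxy’s own include mechanism, not Python string templates.

```python
from string import Template

# Static shell that could be cached at the proxy or CDN.
# $countdown and $user mark the slots for dynamic data.
STATIC_PAGE = Template(
    "<h1>Item 42</h1><p>Starts in $countdown s</p><p>Hello, $user</p>"
)


def render_esi(fetch_dynamic) -> str:
    """ESI style: the proxy fetches dynamic data and returns a complete page."""
    return STATIC_PAGE.substitute(fetch_dynamic())


def render_csi() -> str:
    """CSI style: return the static shell as-is; the browser fills the slots
    later with a separate asynchronous request."""
    return STATIC_PAGE.safe_substitute()
```

With ESI the extra fetch happens on the server for every page view; with CSI the server only ever serves the cached shell, at the cost of a visible second request on the client.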

Boiled down, dynamic/static separation improves performance in only two ways: it minimizes data to cut unnecessary requests, and it shortens paths to make each request more efficient. The specific techniques all follow from this general direction.

2 Hot Spot Optimization

Hotspots are classified into hotspot operations and hotspot data.

2.1 Hotspot Operations

Refreshing the page, placing orders, and adding to the cart right at midnight are all hotspot operations. These are user behaviors and cannot be changed, but they can be restricted, for example by showing a blocking prompt when a user refreshes the page too frequently.

2.2 Hotspot Data

Handling hotspot data takes three steps: hotspot identification, hotspot isolation, and hotspot optimization.

2.2.1 Hotspot Identification

Hotspot data is classified into static hotspot data and dynamic hotspot data, as follows:

  • Static hotspot: hotspot data that can be predicted in advance. Before a big promotion, hot products can be identified from industry characteristics and dimensions such as which merchants are active, or screened in advance through seller sign-up. They can also be predicted by technical means, for example by using big data to compute the products buyers visit each day and treating the TOP N as hot products

  • Dynamic hotspot: hotspot data that cannot be predicted in advance. Hot and cold data often alternate as business scenarios change. Especially with the rise of live-stream selling, an impromptu ad by a streamer can trigger a large number of purchases of a product within a short time. Because such products are rarely accessed day to day, they may already have been evicted from the cache or expired, or may even be cold data in the DB. The sudden traffic then breaks through the cache, requests hit the DB directly, and DB pressure becomes too high

The seckill system therefore needs to be able to discover hotspot data dynamically. A common implementation approach is:

  • Asynchronously collect hotspot key information at each stage of the transaction link, for example by having Nginx collect access URLs or an agent collect hotspot logs (some middleware has built-in hotspot discovery), to identify potential hotspot data early

  • Aggregate and analyze the hotspot data. Data that matches certain rules is distributed to the systems on the link through subscription, and each system decides how to handle it according to its own needs, for example by rate limiting or caching, to achieve hotspot protection

Note that:

  • Hotspot data is best collected asynchronously. This keeps collection from affecting the core transaction link, and keeps the collection method general

  • Hotspot discovery should ideally be real-time at the level of seconds; only then is dynamic discovery meaningful. This in turn places higher demands on the data collection and analysis capabilities of the core nodes
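
The aggregation step described above can be sketched as a sliding-window counter that reports keys whose hit count in the last second crosses a threshold. This is a single-process illustration only; the class name `HotKeyDetector`, its parameters, and the injectable `clock` are assumptions, and a production system would aggregate asynchronously collected logs across nodes.

```python
import time
from collections import Counter, deque


class HotKeyDetector:
    """Counts key accesses in a sliding time window and reports the hottest keys."""

    def __init__(self, window_seconds=1.0, threshold=100, clock=time.monotonic):
        self.window = window_seconds
        self.threshold = threshold
        self.clock = clock
        self.events = deque()    # (timestamp, key), oldest first
        self.counts = Counter()  # key -> hits currently inside the window

    def _evict(self, now):
        # Drop events that have fallen out of the window.
        while self.events and now - self.events[0][0] > self.window:
            _, old_key = self.events.popleft()
            self.counts[old_key] -= 1
            if self.counts[old_key] == 0:
                del self.counts[old_key]

    def record(self, key):
        now = self.clock()
        self.events.append((now, key))
        self.counts[key] += 1
        self._evict(now)

    def hot_keys(self):
        """Keys at or above the threshold, hottest first."""
        self._evict(self.clock())
        return [k for k, n in self.counts.most_common() if n >= self.threshold]
```

The keys returned by `hot_keys()` are what would be published to subscribers, which then decide whether to cache or rate limit them.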

2.2.2 Hotspot Isolation

Once hotspot data has been identified, the first principle is to isolate it: do not let 1% of hotspot data affect the other 99%. Hotspot isolation can be implemented at the following levels:

  • Business isolation. A seckill is a marketing activity, and sellers need to sign up for it separately. Technically, this lets the system warm the cache for known hotspots in advance

  • System isolation. System isolation is runtime isolation: hotspot traffic is deployed in separate groups, apart from the other 99%. A separate domain name can also be requested so that the entry layer routes requests to different clusters

  • Data isolation. Seckill data, being hotspot data, can use a separate cache cluster or DB service group, which makes horizontal or vertical scaling easier

Of course, there are many other ways to implement isolation. For example, requests can be distinguished by user: assign different cookies to different users and have the entry layer route them to different service interfaces. Alternatively, keep the same domain name but have the backend call different service interfaces, or tag the data at the data layer to tell it apart. All of these measures aim to distinguish identified hot requests from ordinary ones.

2.2.3 Hotspot Optimization

Once hotspot data is isolated, optimizing this 1% of requests becomes much easier. There are two approaches:

  • Caching: caching is the most effective way to optimize hotspots. If the hotspot data has gone through dynamic/static separation, the static part can be cached for a long time

  • Rate limiting: rate limiting is more of a protection mechanism. Each service should keep watching whether its requests trigger rate limiting and review such cases promptly
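
A token bucket is one common way to implement the rate limiting mentioned above; the article does not prescribe a particular algorithm, so the sketch below is an assumption. The class `TokenBucket`, its parameters, and the `rejected` counter (kept so that triggered limits can be reviewed) are hypothetical names.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a steady `rate` of requests per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock
        self.tokens = float(capacity)  # start with a full bucket
        self.last = clock()
        self.rejected = 0              # how often the limit was triggered

    def allow(self):
        now = self.clock()
        # Refill according to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        self.rejected += 1
        return False
```

Exposing `rejected` to monitoring is what makes it possible to notice and review rate limiting after the fact.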

2.2.4 Summary

Hotspot optimization differs from dynamic/static separation. It splits data vertically according to the 80/20 principle so that it can be handled in a targeted way. Hotspot identification and isolation are useful not only in the seckill scenario but also in other high-performance distributed systems.

3 System Optimization

There are many ways to improve the performance of a software system, such as upgrading hardware and tuning the JVM. Here we focus on code-level performance optimization:

  • Reduce serialization: reducing serialization in Java can improve system performance considerably. Most serialization happens during RPC, so RPC calls should be minimized. One feasible approach is to co-deploy strongly related applications, reducing RPC calls between them (microservice design specification)

  • Output stream data directly: any string I/O operation, whether disk or network I/O, is CPU-intensive, because characters must be converted to bytes and the conversion requires table lookups for encoding. For commonly used data such as static strings, it is therefore recommended to encode them into bytes in advance and cache them. At the code level, this means writing through OutputStream methods to reduce encoding conversions. In addition, a hot toString() method should not call the ReflectionToString implementation directly. Hard-code it instead and print only the basic and core fields of the DO

  • Trim exception stack traces in logs: both external system exceptions and application exceptions produce stack traces. Under heavy traffic, frequently printing full stack traces only adds to the system’s load. The depth of the printed stack can be controlled through the logging configuration

  • Remove component frameworks: for extreme optimization, some component frameworks can be removed, such as a traditional MVC framework, and requests handled directly with Servlets. This bypasses a lot of complex and unneeded processing logic and saves milliseconds. Of course, you need to assess your dependence on the framework sensibly first

4 Summary

Performance optimization needs a baseline, so the system should also maintain application baselines: a performance baseline (when did performance suddenly drop?), a cost baseline (how many machines were used last year?), and a link baseline (what changed in the core flow?). Track system performance against these baselines over time, and keep improving coding quality at the code level, removing unreasonable calls at the business level, and refining the design at the architecture level.

Consistency

In a seckill system, inventory is key data: failing to sell out is a problem, and overselling is an even bigger one. The consistency problem in the seckill scenario is mainly about deducting inventory accurately.

1 Ways to Deduct Inventory

Purchasing in e-commerce is usually split into two steps: placing the order and paying. “Submit order” is order placement, and “pay for order” is payment. On this basis, inventory is generally deducted in one of the following ways:

  • Deduct on order. Inventory is deducted as soon as the buyer places the order. This is the simplest way to deduct inventory, and also the one with the most precise control

  • Deduct on payment. Inventory is not deducted when the buyer places the order, only once payment is made. Because deduction happens at payment time, under high concurrency a buyer may be unable to pay after placing an order because the goods have already been bought by others

  • Reserve (withhold) inventory. This approach is more complex. After a buyer places an order, the inventory is held for a certain period (for example 15 minutes). After that, the hold is released automatically and other buyers can purchase the goods
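
The reservation scheme in the last bullet can be sketched in memory as follows. This is an illustration under stated assumptions, not a production design: the class `ReservedInventory`, its method names, and the injectable `clock` are hypothetical, and a real system would keep holds in a database or cache and release them with a scheduled job.

```python
import time


class ReservedInventory:
    """Holds stock for an order until it is paid or the hold expires."""

    def __init__(self, stock, hold_seconds=900, clock=time.monotonic):
        self.available = stock
        self.hold_seconds = hold_seconds  # 900 s = the 15-minute example
        self.clock = clock
        self.holds = {}                   # order_id -> (quantity, expiry time)

    def release_expired(self):
        """Return stock from unpaid orders whose hold has run out."""
        now = self.clock()
        for order_id, (qty, expires) in list(self.holds.items()):
            if now >= expires:
                del self.holds[order_id]
                self.available += qty

    def place_order(self, order_id, qty=1):
        self.release_expired()
        if order_id in self.holds or qty > self.available:
            return False
        self.available -= qty
        self.holds[order_id] = (qty, self.clock() + self.hold_seconds)
        return True

    def pay(self, order_id):
        """Payment succeeds only while the hold is still active."""
        self.release_expired()
        return self.holds.pop(order_id, None) is not None
```

Note that an expired hold makes a later `pay` call fail, which mirrors the “valid payment time” behaviour buyers see on shopping sites.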

As you can see, these deduction methods are defined by the stages of a multi-step purchase flow, and problems arise whether deduction happens at the order stage or the payment stage. A detailed analysis follows.

2 Problems with Inventory Deduction

2.1 Deduct on order

Advantages: the best user experience. Deducting on order is the simplest way to deduct inventory and gives the most precise control. At order time, inventory can be controlled directly through the database transaction mechanism, so an order that was placed successfully can always be paid for.

Disadvantages: the goods may not actually sell. Normally, buyers are very likely to pay after placing an order, so this is not a big problem. There is an exception, though: when a seller runs a promotion, a competitor can place malicious orders for all of the goods and reduce the inventory to zero, so the goods can no longer be sold normally. Malicious buyers never actually pay, and this is the weakness of deducting on order.

2.2 Deduct on payment

Advantages: every deduction is a real sale. Deducting on order is vulnerable to malicious orders that hurt the seller’s sales, whereas deducting on payment requires real money and so avoids the problem effectively.

Disadvantages: poor user experience. A user who places an order may not actually pay. Suppose there are 100 items: because placing an order does not deduct inventory, 200 people may place orders successfully, and the number of successful orders can far exceed the real inventory. This is especially likely for popular promotional goods. Many buyers will then be unable to pay after ordering successfully, and the shopping experience naturally suffers.

2.3 Reserve inventory

Advantages: it mitigates the problems of both approaches above. Reserving inventory is in effect a combination of deducting on order and deducting on payment, linking the two operations: inventory is reserved when the order is placed, and the reservation is released when payment is made.

Disadvantages: it does not solve these problems completely. In the malicious-order case, for example, the valid payment window can be set to 10 minutes, but a malicious buyer can simply place the order again after 10 minutes.

2.4 Summary

The problems with inventory deduction show up mainly in user experience and business goals. The root cause is that the purchase flow has two or more steps, and deducting inventory at any one of them leaves room for malicious exploitation.

3 How Inventory Is Deducted in Practice

Reserving inventory is the most common approach in the industry. Whether for food delivery or e-commerce, there is usually a “valid payment time” after an order is placed, after which the order is released automatically. This is a typical reservation scheme. As noted above, though, reserving inventory still has to deal with malicious orders so that the goods actually sell out. Avoiding overselling is another pain point.

Selling out: malicious orders are countered mainly by combining security and anti-cheating measures. For example, identify and tag buyers who frequently place orders without paying, and do not deduct inventory when tagged buyers place orders. Other examples are capping the number of items a single person may buy at N, or limiting and blocking buyers who repeatedly order without paying.

Avoiding overselling: overselling actually comes in two kinds. For ordinary goods, seckill is just a promotional tool, and even if inventory is oversold the merchant can fix it by restocking. For other goods, seckill is a marketing method and negative inventory is not allowed at all. In terms of data consistency, this means the inventory field in the database must never go negative under heavy concurrent requests. There are several common schemes:

The first is to check within a transaction: ensure that the inventory after deduction is not negative, and roll back otherwise.

The second is to define the database field as an unsigned integer, so that the SQL statement fails as soon as the inventory would go negative.

The third is to use a CASE WHEN statement:

UPDATE item SET inventory = CASE WHEN inventory >= xxx THEN inventory-xxx ELSE inventory END
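
As a variation on the CASE WHEN statement, the sketch below moves the same condition into the WHERE clause, so the affected-row count tells the caller whether the deduction happened. The table `item` and column `inventory` come from the statement above; the `id` column, the `deduct` function, and the use of SQLite as a stand-in for MySQL are assumptions for illustration. The `CHECK` constraint plays the role of the unsigned-integer guard, since SQLite has no unsigned type.

```python
import sqlite3


def deduct(conn, item_id, qty):
    """Deduct qty only if enough stock remains; return True on success."""
    cur = conn.execute(
        "UPDATE item SET inventory = inventory - ? "
        "WHERE id = ? AND inventory >= ?",
        (qty, item_id, qty),
    )
    conn.commit()
    # No row matches when stock is insufficient, so rowcount is 0.
    return cur.rowcount == 1


# In-memory database used only for the demonstration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE item ("
    "id INTEGER PRIMARY KEY, "
    "inventory INTEGER NOT NULL CHECK (inventory >= 0))"
)
conn.execute("INSERT INTO item (id, inventory) VALUES (1, 3)")
conn.commit()
```

With three items in stock, a first `deduct(conn, 1, 2)` succeeds and a second one is refused, leaving one item; the inventory never goes negative.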

Business measures ensure the goods sell out, and technical measures ensure they are not oversold. Inventory is never a purely technical problem and has to be tackled from several angles.

4 Consistency Performance Optimization

Inventory is key data and also hotspot data. For the system, the real impact of hotspots is heavy reads and heavy writes, which is also the core technical problem in the seckill scenario.

4.1 High Concurrent Reads

The key to handling highly concurrent reads in the seckill scenario is layered verification. On the read path, only checks that do not hurt performance are performed, such as whether the user is eligible for the seckill, whether the product status is normal, whether the user answered the question correctly, whether the seckill has ended, and whether the request is illegal. Consistency checks and other operations that tend to become bottlenecks are not performed there. Inventory consistency is checked only on the write path, so that final accuracy is guaranteed at the data layer.

With layered verification in place, the system can therefore use a distributed cache or even a LocalCache to absorb highly concurrent reads. In other words, some dirty reads are tolerated in this scenario. The only consequence is that a small number of order requests that can no longer be fulfilled are mistakenly treated as in stock. Consistency is guaranteed when the data is actually written, which strikes a balance between high availability and consistency.
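
A local cache that tolerates slightly stale reads can be sketched as follows. The class name `LocalCache`, the `loader` callback, and the one-second TTL are assumptions chosen for illustration; the point is only that reads within the TTL never touch the backing store, even if the stored value has changed.

```python
import time


class LocalCache:
    """In-process cache with a short TTL; stale values are served until expiry."""

    def __init__(self, loader, ttl_seconds=1.0, clock=time.monotonic):
        self.loader = loader    # called on a miss, e.g. a DB or remote cache read
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}       # key -> (value, expiry time)

    def get(self, key):
        now = self.clock()
        hit = self.entries.get(key)
        if hit is not None and now < hit[1]:
            return hit[0]       # possibly stale, which is accepted on the read path
        value = self.loader(key)
        self.entries[key] = (value, now + self.ttl)
        return value
```

A read that reports stock which has just sold out is harmless here, because the write path still performs the authoritative inventory check.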

The core idea of layered verification is that each layer filters out as many invalid requests as possible, and only the end of the “funnel” does the real processing, which shortens the path on which the system bottleneck sits.

4.2 High Concurrent Writes

There are two ways to optimize highly concurrent writes: change the choice of DB, or optimize DB performance. Each is discussed below.

4.2.1 Choosing a Different DB

Inventory deduction for seckill goods differs from that for ordinary goods: the data volume is small and the transactions are short. So can seckill inventory deduction be done directly in a cache system, that is, in a cache with persistence such as Redis?

If the deduction logic is very simple, for example with no complex linkage between SKU inventory and total inventory, I think this is perfectly fine. If the deduction logic is more complex, or transactions are required, the deduction has to be done in the database.

4.2.2 Optimize DB Performance

When inventory data lands in the database, it is actually stored as a single row (in MySQL), so a large number of threads compete for the InnoDB row lock. The higher the concurrency, the more threads wait, the lower the TPS and the higher the RT, and throughput suffers severely. (To stay focused, this assumes the database has already been isolated at the data level as described in the performance section above.)

There are two ways to address contention on the row lock:

1. Application-layer queuing.

A distributed lock in the cache lets the cluster control how many operations run concurrently against the same database row, and how many database connections a single product occupies, so that hot products do not take up too many connections.

2. Data-layer queuing.

Application-layer queuing costs performance, so queuing at the data layer is the ideal choice. In the industry, Alibaba’s database team has developed a patch at the InnoDB layer that queues concurrent operations on a single row inside the DB, a customized optimization for the seckill scenario. Note that queuing is different from lock contention. As anyone familiar with MySQL knows, InnoDB’s internal deadlock detection and the switching between MySQL Server and InnoDB are both expensive.

Alibaba’s database team has also made many other optimizations, such as the COMMIT_ON_SUCCESS and ROLLBACK_ON_FAIL patches. With hints added to the SQL, the transaction no longer waits for a separate commit instruction: once the last SQL statement has executed, it is committed or rolled back directly according to the result of TARGET_AFFECT_ROW, which saves network wait time (on the order of milliseconds).

Alibaba has open-sourced a MySQL build that includes these patches:

Github.com/alibaba/Ali…
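
Application-layer queuing, as described above, can be approximated inside a single process with one semaphore per product. This is only a stand-in: the article describes a cluster-wide distributed lock kept in the cache, which the sketch does not implement. The class `ItemGate` and its parameters are hypothetical names.

```python
import threading
from collections import defaultdict


class ItemGate:
    """Caps how many threads may work on the same item at once."""

    def __init__(self, max_concurrent=2):
        self._guard = threading.Lock()
        self._gates = defaultdict(
            lambda: threading.BoundedSemaphore(max_concurrent)
        )

    def run(self, item_id, action):
        # Creating the per-item semaphore must itself be thread-safe.
        with self._guard:
            gate = self._gates[item_id]
        # Excess callers wait here instead of piling up on the DB row lock.
        with gate:
            return action()
```

Here `action` would be the database update for that product, so at most `max_concurrent` connections are ever busy with one hot item.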

4.3 Summary

Heavy reads and heavy writes differ greatly. Read requests leave more room for optimization, while the bottleneck for write requests is usually the storage layer. In essence, the optimization is a balance based on the CAP theorem.

5 Summary

Of course, inventory deduction involves many more details, such as how to return reserved inventory after a timeout, and how to keep inventory deduction consistent with payment status when third-party payment is used. These are major challenges as well.

High availability

If you look at the traffic monitoring of a seckill, you will see not a gently winding curve but a near-vertical line, because seckill requests are highly concentrated at one specific point in time. This produces an extremely high peak at midnight, and resources are consumed almost instantly. Availability protection is therefore indispensable for a seckill system.

1 Traffic Peak Shaving

In a seckill, the number of people who end up getting the goods is fixed. Whether 100 or 10,000 people take part, the result is the same, so the number of effective requests is limited, and the higher the concurrency, the more invalid requests there are. As a marketing tool, a seckill wants as many people as possible refreshing the page before the event starts, but once it actually starts, more requests are not better. The system can therefore apply rules that deliberately delay seckill requests, or even filter out some invalid ones.

1.1 Answering Questions

Early seckills only required clicking the seckill button; answering a question was added later. Why? Making the purchase slightly more complex serves two purposes:

  • Preventing cheating. Early on, seckill bots were rampant. Malicious buyers or competitors used them to sweep up goods, and merchants did not achieve their marketing goals, so questions were added as a restriction

  • Delaying requests. Midnight traffic arrives within milliseconds, and answering a question stretches the order peak from under 1 s to under 10 s. This spread matters a great deal, because it greatly reduces peak concurrency on the server. In addition, because requests arrive in sequence, by the time later requests arrive after the question is answered, the stock may already be gone and no order can be placed. Real writes at the data layer are therefore very limited at this stage

Note that besides checking whether the answer is correct, the submission time should be checked as well. For example, a submission within 1 s is very unlikely to be a manual operation, which further guards against machine answering.
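
The two checks can be combined in one small function. This is a sketch: the name `accept_answer`, its parameters, and the case-insensitive comparison are assumptions, and the 1 s threshold is the example value from the text.

```python
def accept_answer(expected, submitted, shown_at, submitted_at, min_seconds=1.0):
    """Accept only a correct answer that took a humanly plausible time.

    shown_at and submitted_at are timestamps in seconds, both taken on the
    server so that the client cannot forge them.
    """
    elapsed = submitted_at - shown_at
    if elapsed < min_seconds:
        return False  # answered faster than a person plausibly could
    return submitted.strip().lower() == expected.strip().lower()
```

Checking the elapsed time first means a scripted client is rejected even when it submits the right answer.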

Answering questions is now widely used. In essence, it shaves traffic at the entry layer so that the system can better withstand instantaneous peaks.

1.2 Queuing

The most common peak-shaving scheme is a message queue, which buffers instantaneous traffic by turning synchronous direct calls into asynchronous indirect pushes. Besides message queues, there are many similar queuing schemes, for example:

  • Waiting on a thread pool lock

  • Buffering in local memory and waiting, like a flood-storage reservoir

  • Writing to a local file sequentially and then reading it back sequentially

The drawbacks of queuing are also obvious, mainly two:

  • Request backlog. If the traffic peak lasts long enough to reach the queue’s high-water mark, the queue itself is overwhelmed. Downstream systems are protected, but the effect is not much different from dropping requests outright

  • User experience. Asynchronous push is naturally worse than synchronous calls in timeliness and ordering, so a request sent earlier may get its response later, which hurts the shopping experience of sensitive users

In essence, queuing buffers a one-step operation into a two-step one at the business layer. Given its drawbacks, the final choice is a compromise based on business scale and the seckill scenario.
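
The buffering idea and its backlog limit can be illustrated with a bounded in-memory queue. This corresponds to the local-memory variant listed above, not to a real message queue; the class `OrderBuffer` and its method names are hypothetical.

```python
import queue


class OrderBuffer:
    """Bounded buffer: accepts orders quickly, processes them later in batches."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)

    def submit(self, order):
        """Step one: enqueue and return at once. False means the buffer is full."""
        try:
            self._queue.put_nowait(order)
            return True
        except queue.Full:
            return False

    def drain(self, handler, batch=100):
        """Step two: hand up to `batch` buffered orders to the downstream handler."""
        handled = 0
        while handled < batch:
            try:
                order = self._queue.get_nowait()
            except queue.Empty:
                break
            handler(order)
            handled += 1
        return handled
```

When `submit` returns False, the queue has reached its high-water mark and the request is effectively dropped, which is the backlog drawback described above.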

1.3 Filtering

The core structure of filtering is layering: invalid requests are filtered out at different layers so that data reads and writes are triggered precisely. Common filtering layers are:

  • 1. Read rate limiting: limit read requests and filter out those that exceed system capacity
  • 2. Read cache: cache the data for read requests and filter out repeated requests
  • 3. Write rate limiting: limit write requests and filter out those that exceed system capacity
  • 4. Write verification: check write requests for consistency and keep only the final valid data

The core purpose of filtering is to protect the IO performance of valid requests by reducing the data IO of invalid ones.
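
The four layers can be sketched as a chain of cheap checks that ends in the one authoritative write. The function `handle_request` and its four callback parameters are hypothetical; each callback stands for whatever limiter, cache, or storage call a real system uses.

```python
def handle_request(req, read_limiter, cached_stock, write_limiter, deduct_stock):
    """Pass a request through the four filtering layers in order.

    read_limiter()     -> bool  layer 1: read rate limiting
    cached_stock(id)   -> int   layer 2: possibly stale stock from the read cache
    write_limiter()    -> bool  layer 3: write rate limiting
    deduct_stock(id)   -> bool  layer 4: consistent deduction at the data layer
    """
    if not read_limiter():
        return "rejected: read limit"
    if cached_stock(req["item_id"]) <= 0:
        return "rejected: sold out"
    if not write_limiter():
        return "rejected: write limit"
    if not deduct_stock(req["item_id"]):
        return "rejected: no stock"
    return "ok"
```

Only requests that survive the first three layers reach `deduct_stock`, so the data layer sees a small fraction of the incoming traffic.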

The system can shave traffic peaks by answering questions at the entry layer, queuing at the business layer, and filtering at the data layer. In essence, this is a balance between business goals and architectural performance.

New peak-shaving techniques also keep emerging, mostly on the business side. For example, issuing coupons or launching a lottery at the same time as the midnight promotion diverts part of the traffic to other systems, which also shaves the peak.

2 Plan B

When a system faces sustained peak traffic, it can hardly recover by adjusting itself. No one can foresee every situation in day-to-day operations, and accidents are unavoidable. Especially in the seckill scenario, a Plan B must be designed as a fallback to keep the system highly available.

Building high availability is really systems engineering that runs through the whole life cycle of the system. Specifically, it covers the architecture, coding, testing, release, and operation stages, as well as the handling of failures:

  • Architecture stage: consider scalability and fault tolerance, and avoid single points of failure. For example, the system should keep running even if an IDC or a whole city fails

  • Coding stage: ensure the code is robust. For RPC calls, for example, set sensible timeouts so the system is not dragged down by other systems, and handle unexpected return values with default behavior

  • Testing stage: ensure CI coverage and the fault-tolerance rate reported by Sonar, double-check basic quality, and regularly produce trend reports on overall quality

  • Release stage: deployment is the stage most likely to expose errors, so there should be a checklist template before release, a mechanism for notifying upstream and downstream teams during release, and a rollback mechanism after release

  • Operation stage: the system is in operation most of the time, so real-time monitoring matters most: detect problems promptly, alert accurately, and provide detailed data for troubleshooting

  • Failure handling: stop the loss promptly to keep the impact from spreading, then locate the cause, fix the problem, and restore service
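
The coding-stage advice above (a timeout on remote calls plus default handling for unexpected results) can be sketched with the standard library. The helper `call_with_timeout` and the module-level `_pool` are hypothetical names, and `fn` stands for any remote call; real RPC frameworks usually offer their own timeout settings, which would be preferable where available.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

# Shared worker pool for outgoing calls; the size is an arbitrary example.
_pool = ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout_seconds, default):
    """Run fn in a worker thread; return `default` on timeout or error."""
    future = _pool.submit(fn)
    try:
        return future.result(timeout=timeout_seconds)
    except FutureTimeout:
        # The caller stops waiting; the slow call still finishes in its thread.
        return default
    except Exception:
        # Unexpected failures are mapped to the same default.
        return default
```

Returning a default keeps the caller responsive, but the worker thread stays occupied until the slow call ends, so the pool size bounds how many slow calls can pile up.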

In day-to-day operations, high availability is mostly about the operation stage, which needs additional reinforcement, mainly through the following means:

  • Prevention: establish a routine stress-testing regime, run single-point and full-link stress tests on services regularly, and find out the system’s capacity limits

  • Control: put degradation, rate limiting, and circuit breaking in place for online operation. Note that rate limiting, degradation, and circuit breaking all hurt the business, so confirm with upstream and downstream services before applying them. Take rate limiting as an example: which services can be limited, by how much, for how long, and under what conditions the limit is lifted all need to be confirmed repeatedly with the business side

  • Monitoring: establish performance baselines and record performance trends. Set up an alerting system so that problems are detected and flagged promptly

  • Recovery: be able to stop losses promptly when a failure occurs, and provide tools for quick data correction. They need not be polished, but they must exist
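
Circuit breaking, listed under control above, can be sketched as a small state machine: after a number of consecutive failures the breaker opens and calls fail fast to a fallback, and after a cool-down one trial call is let through. The class `CircuitBreaker`, its thresholds, and the injectable `clock` are assumptions; the article does not specify an implementation.

```python
import time


class CircuitBreaker:
    """Fails fast to a fallback after repeated errors, then retries after a pause."""

    def __init__(self, max_failures=3, reset_seconds=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                return fallback()  # open: do not touch the failing service
            # Half-open: allow one trial call; a single failure reopens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

The `fallback` is where degradation happens, for example returning a cached value or a “try again later” response.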

Any stage in the system’s life cycle can go wrong, and mistakes in some stages cannot be remedied or are very costly to remedy. High availability is therefore systems engineering and has to be considered across the whole life cycle. As the business grows, it also needs long-term planning and systematic construction.

3 Summary

High availability is really about stability. Stability seems unimportant in normal times, but once something goes wrong the result can be fatal. Putting it into practice is another matter: when the business is growing well, stability work gets downgraded to make way for the business.

Solving this problem requires organizational backing. For example, make business owners accountable for stability performance indicators, and set up a stability team within the department whose members are drawn from the core staff of each business line, with their performance rated by the stability lead. In this way, the systematic construction tasks can be carried out in the specific business systems.

Personal summary

A seckill system can be designed with architectures ranging from simple to complex, depending on the level of traffic. In essence, it is a matter of trade-offs in every respect. You may have noticed that this article does not cover specific technology choices, because they are not what matters for the architecture. As an architect, you should always remind yourself what the main line is.

This article is also an abstraction and distillation, mainly my personal outline of seckill design, offered as a reference.

Author: a zhe segmentfault.com/a/1190000020970562