Stability is crucial for large web sites; at 100,000 visits per minute, the slightest mistake can cause a major failure. In this article we look at some stability practices for a site handling hundreds of millions of visits, which we hope you will find useful.

Basic strategy

Configuration change

Being configuration-driven means consolidating business-process-related data on a configuration platform, separated from the code, so that the code contains only the common business logic. Once configured this way, the code can handle every scenario, using the configuration data to decide what to operate on at runtime.


This design lets us make quick changes online, adding, modifying, and deleting configuration in real time, which is very helpful for fast problem resolution.
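To make this concrete, here is a minimal Java sketch of configuration-driven code; the ConfigCenter class, the fee.rate.bp key, and the FeeService example are assumptions made for illustration, standing in for a real configuration platform client.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Minimal stand-in for a configuration platform client (illustrative assumption).
    public class ConfigCenter {

        // Values can be changed online; the code only reads them at runtime.
        private static final Map<String, String> CONFIG = new ConcurrentHashMap<>();

        public static String get(String key, String defaultValue) {
            return CONFIG.getOrDefault(key, defaultValue);
        }

        public static void update(String key, String value) {
            CONFIG.put(key, value); // called when the platform pushes a change
        }
    }

    class FeeService {
        public long fee(long amount) {
            // The business logic stays generic; the fee rate (in basis points)
            // is configuration data, so it can be changed online without a release.
            int rateBp = Integer.parseInt(ConfigCenter.get("fee.rate.bp", "30"));
            return amount * rateBp / 10_000;
        }
    }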

Business switch

A business switch is a switch for a specific process; by turning it on or off we can control the processing logic in real time. There are many kinds of switches; the ones below are commonly used (a combined sketch follows the list).

Boolean type: specifies whether a process is enabled, for example turning a validation step on or off.

Number type: a numeric configuration value for the business.

String type: a text configuration value for the business.

Collection type: a collection-valued switch, for example specifying the business types for which a process is turned on or off.

Map type: a mapping switch, for example mapping a specific business type to how it should be handled.
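As a rough sketch of the switch types above, the holder class below keeps one field per type; the switch names and values are illustrative assumptions, and in practice they would be pushed from a switch or configuration platform at runtime.

    import java.util.Map;
    import java.util.Set;

    public class BusinessSwitches {

        // Boolean: turn a processing step on or off.
        public static volatile boolean validationEnabled = true;

        // Number: a numeric business parameter.
        public static volatile int maxRetryTimes = 3;

        // String: a text value used by the business.
        public static volatile String degradeNotice = "Service is busy, please try again later";

        // Collection: the business types for which a process is turned on.
        public static volatile Set<String> enabledBizTypes = Set.of("ORDER", "REFUND");

        // Map: per-business-type handling strategy.
        public static volatile Map<String, String> bizTypeStrategy =
                Map.of("ORDER", "SYNC", "REFUND", "ASYNC");
    }

A push from the switch platform simply replaces a field value, and all subsequent requests see the new behavior.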


Deployment strategy

Deploying a system that serves up to 100,000 visits per minute is a challenge. Deployment proceeds machine room by machine room and machine by machine: the new version and the old version coexist and serve traffic at the same time, and the whole rollout is a gradual replacement process.


Grayscale (canary) deployment gradually increases the proportion of traffic served by the new version, effectively testing it against live users. If machine logs show anomalies or users report problems, the new version is rolled back immediately to avoid a wider impact.

Error handling

The software architecture should define a unified process and standard for error handling, and error codes should be classified so that error types can be determined quickly from error-code statistics. To do this, agree on a consistent error-code format: for example, a CHK prefix for validation failures, THD for third-party service errors, and SYS for errors in the current system; a REQUIRED suffix means a required field was not filled in, an INVALID suffix means the data is wrong, and an EXCEPTION suffix means an exception occurred.
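A minimal sketch of such a classified error-code enum follows; the concrete codes and messages are illustrative assumptions, not an actual code set.

    public enum ErrorCode {

        // CHK_* : validation failures
        CHK_NAME_REQUIRED("CHK_NAME_REQUIRED", "Name is required"),
        CHK_AMOUNT_INVALID("CHK_AMOUNT_INVALID", "Amount is not a valid number"),

        // THD_* : third-party service errors
        THD_PAYMENT_EXCEPTION("THD_PAYMENT_EXCEPTION", "Third-party payment service threw an exception"),

        // SYS_* : errors in the current system
        SYS_DB_EXCEPTION("SYS_DB_EXCEPTION", "Database access failed");

        private final String code;
        private final String message;

        ErrorCode(String code, String message) {
            this.code = code;
            this.message = message;
        }

        public String getCode() { return code; }

        public String getMessage() { return message; }

        // Classify an error by its prefix, e.g. for error-code statistics.
        public String category() {
            return code.substring(0, code.indexOf('_'));
        }
    }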

Log collection

A running program must produce logs; they are the first source for troubleshooting online problems, and only accurate and complete recording of meaningful logs makes effective analysis possible. For example, from the logs we can count service call volume and compute the success rate, we can see error details clearly, including which users hit errors and why, and we can build an alert mechanism for online problems.

Logs must be printed in a unified format and collected into a unified log analysis service so that they can be recorded and searched in real time.
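As an illustration, the sketch below logs each service call in a fixed key-value format using the SLF4J API; the field names and the ORDER_CREATE identifier are assumptions for the example, not a prescribed standard.

    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class OrderService {

        private static final Logger log = LoggerFactory.getLogger(OrderService.class);

        public void logCall(String userId, String orderId, long costMillis, boolean success) {
            // One line per call with fixed fields, so the log analysis service can
            // count call volume, compute the success rate, and find per-user errors.
            log.info("service=ORDER_CREATE|userId={}|orderId={}|cost={}ms|success={}",
                    userId, orderId, costMillis, success);
        }
    }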

Online monitoring strategy

Link tracking

One of the challenges of distributed systems is tracking a request across the call chain. Typically, we generate a unique identifier, such as a UUID, for each request at the start of the chain, record that identifier at every subsequent processing node, and propagate it to the next node.

Based on this unique identifier, we can fully reconstruct from the mass of logs how a request was processed along the chain, together with its input and output data, and perform full-link analysis.
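A minimal sketch of propagating such an identifier with SLF4J's MDC is shown below; the X-Trace-Id header name and the Request abstraction are assumptions made so the example is self-contained.

    import java.util.UUID;
    import org.slf4j.MDC;

    public class TraceIdFilter {

        // Minimal request abstraction so the sketch is self-contained.
        public interface Request {
            String getHeader(String name);
            void setHeader(String name, String value);
        }

        public void handle(Request request) {
            // Reuse the caller's trace id if present, otherwise start a new one.
            String traceId = request.getHeader("X-Trace-Id");
            if (traceId == null || traceId.isEmpty()) {
                traceId = UUID.randomUUID().toString();
            }
            MDC.put("traceId", traceId); // the log pattern can include %X{traceId}
            try {
                request.setHeader("X-Trace-Id", traceId); // propagate to the next node
                // ... invoke business logic / downstream services here ...
            } finally {
                MDC.clear(); // avoid leaking the id to the next request on this thread
            }
        }
    }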

Exception monitoring

Exceptions thrown by Java code are handled and written to the logs, which is the basis for exception monitoring. In exception monitoring we focus on the number of exceptions, the stack traces, and the trend over time.

By monitoring exceptions we can quickly locate online problems from the stack traces; exceptions raised by third-party services, such as distributed call timeouts, can also be analyzed and located quickly.
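As a rough sketch, the helper below counts exceptions by type and writes the full stack trace to the log; the in-memory counter is an illustrative stand-in for a real metrics or monitoring system.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.LongAdder;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;

    public class ExceptionMonitor {

        private static final Logger log = LoggerFactory.getLogger(ExceptionMonitor.class);

        // Exception counts by class name; a monitoring system would chart the trend.
        private static final Map<String, LongAdder> COUNTS = new ConcurrentHashMap<>();

        public static void record(Throwable t) {
            COUNTS.computeIfAbsent(t.getClass().getSimpleName(), k -> new LongAdder()).increment();
            // The full stack trace goes to the log so the problem can be located.
            log.error("unhandled exception", t);
        }
    }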

Machine monitoring

Machine monitoring focuses on machine CPU, memory, and network usage, as well as JVM thread count, heap usage, Full GC count, and so on. When peak traffic arrives the servers come under pressure and may run out of CPU or memory, and the JVM may exhaust its heap and fall into repeated Full GCs, crashing the service. Server health is too important to ignore.
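As a small illustration of what can be collected, the sketch below reads JVM health figures with the standard java.lang.management API; in a real setup these numbers would be exported to the monitoring system rather than printed.

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    public class JvmStats {

        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("heap used: %d MB of %d MB%n",
                    heap.getUsed() >> 20, heap.getMax() >> 20);

            System.out.println("live threads: "
                    + ManagementFactory.getThreadMXBean().getThreadCount());

            // GC beans cover both young and old collectors; a rising count on the
            // old-generation collector is the Full GC warning sign mentioned above.
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                System.out.printf("%s: count=%d, time=%dms%n",
                        gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
            }
        }
    }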

Strategies for dealing with peak traffic

Service degradation

Service degradation means skipping a specific processing step when traffic exceeds the system's capacity in certain situations, such as during Singles’ Day (November 11) or Double 12 (December 12). For example, after a seller places an order we may need to run a series of steps such as risk assessment and data verification; when the service is degraded, the data-verification logic is skipped to keep the service stable.


Service degradation is an effective way to protect the user experience and prevent system crashes under peak traffic. For example, verifying image content is time-consuming; skipping that verification during a traffic peak avoids long waits for users, reduces the impact on downstream links, and keeps the service stable.
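A minimal sketch of this kind of degradation, guarded by a business switch, might look like the following; the switch and method names are illustrative assumptions.

    public class ImageService {

        // Flipped off from the switch platform when a traffic peak arrives.
        public static volatile boolean contentCheckEnabled = true;

        public void publish(byte[] image) {
            if (contentCheckEnabled) {
                verifyContent(image); // time-consuming step, skipped under degradation
            }
            save(image); // the core flow always runs
        }

        private void verifyContent(byte[] image) {
            // call the content verification service
        }

        private void save(byte[] image) {
            // persist the image
        }
    }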

Service traffic limiting

Service traffic limiting means estimating a threshold in advance based on the system's processing capacity; traffic beyond the threshold is dropped and an error is returned. Traffic limiting is an important measure for protecting the system against peak traffic. For example, the midnight ordering peak on Singles’ Day, the 9 a.m. purchase peak of Yu’e Bao, and the product-editing peak at the end of a sales event all need to be limited to protect the system.


With a distributed deployment it is easy to measure the system's normal traffic from Dubbo, Spring Cloud, or HSF call statistics, and full-link load testing shows clearly how the system behaves under peak traffic, so service traffic limiting is not difficult to implement.
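A minimal sketch of rejecting traffic above a pre-estimated threshold, here using Guava's RateLimiter, is shown below; the 1,000 permits-per-second figure and the SYS_BUSY error string are illustrative assumptions, and the real threshold would come from capacity planning and full-link load testing.

    import com.google.common.util.concurrent.RateLimiter;

    public class OrderController {

        // Permits per second, estimated in advance from the system's capacity.
        private final RateLimiter limiter = RateLimiter.create(1000.0);

        public String createOrder(String request) {
            if (!limiter.tryAcquire()) {
                // Over the threshold: give up processing and return an error.
                return "SYS_BUSY";
            }
            return doCreateOrder(request);
        }

        private String doCreateOrder(String request) {
            // ... normal order-creation flow ...
            return "OK";
        }
    }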

Disaster recovery

A single-node system has almost no disaster-recovery capability: once the service crashes it is immediately unavailable. A distributed system can provide uninterrupted service by running multiple active instances of each service, and load balancing with Nginx or Apache further improves availability.

In fact, even with load balancing and distributed deployment, the system still faces disaster-recovery problems. Large services such as Taobao, Tmall, WeChat, and JD.com are now deployed active-active across multiple sites; the main purpose is to reduce the impact of a failure in any single machine room and improve disaster-recovery capability by serving traffic from several machine rooms.


Conclusion

This article has given a rough overview of what a site with hundreds of millions of visits needs to pay attention to, and do well, for stability; there are many more stability problems worth discussing and thinking over.

What is your opinion? Feel free to leave a comment. This article was first published on the public account "Program heart"; you are welcome to follow it.
