Why capacity expansion

In plain language: no matter how much you optimize performance, the maximum value a single setup can reach is fixed. For an application with a large user base, you can apply all kinds of optimizations on the server, such as rate limiting and resource isolation, but the ceiling is still there; at that point the hardware has to change, for example to more powerful CPUs and more memory. Take the earlier example of students queuing for food in the school canteen: a token bucket algorithm can give senior students priority in getting tokens, but what if there are still too many senior students? The only way out is to add more serving windows or dining halls, which is exactly what hardware expansion means.
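As a quick refresher on the rate-limiting idea just mentioned, here is a minimal token bucket sketch in Java; the class and parameter names are illustrative, not from the original article. Tokens are refilled at a fixed rate up to a capacity, and a request is admitted only if it can take a token.

```java
// A minimal token bucket sketch: refills tokens at a fixed rate up to a capacity.
public class TokenBucket {
    private final long capacity;      // maximum number of tokens the bucket can hold
    private final double refillRate;  // tokens added per second
    private double tokens;            // current token count
    private long lastRefillNanos;     // timestamp of the last refill

    public TokenBucket(long capacity, double refillRate) {
        this.capacity = capacity;
        this.refillRate = refillRate;
        this.tokens = capacity;
        this.lastRefillNanos = System.nanoTime();
    }

    // Returns true if a token was available, i.e. the request is admitted.
    public synchronized boolean tryAcquire() {
        long now = System.nanoTime();
        double elapsedSeconds = (now - lastRefillNanos) / 1_000_000_000.0;
        tokens = Math.min(capacity, tokens + elapsedSeconds * refillRate);
        lastRefillNanos = now;
        if (tokens >= 1.0) {
            tokens -= 1.0;
            return true;
        }
        return false;
    }
}
```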

Expansion strategy

Capacity expansion strategies fall into two types. One is to expand at the whole-machine level, that is, to buy complete servers with their CPUs, memory, and storage devices; the other is to expand individual components, such as adding memory, disks, or CPUs.

Whole-machine hardware

The benefit of buying whole machines is reliability. There are many professional server hardware vendors, such as IBM, Inspur, DELL, and HP. As hardware specialists, they have much richer experience in assembling and matching components, and some of them also optimize individual components, which makes the servers more stable. By analogy with buying a computer: some people buy a desktop already assembled by a Taobao seller, while others buy all the parts and assemble the machine at home. For ordinary people, the former is the more reliable choice, because even if you understand some of the hardware parameters, it is hard to guarantee that a machine you put together yourself lets every component deliver its maximum performance.

Components

Companies with stronger technical skills tend to buy individual components and do the assembly themselves. The cost is lower, because assembly fees are saved and the machines can be customized to the business. For example, a compute-intensive company will mainly add CPUs, an IO-intensive one will mainly add memory, and a company that needs to store large amounts of data will expand storage devices such as hard disks.

Components include:

CPU

Intel, AMD; compare clock frequency, number of cores and threads, etc.

Network card

100 Mbps -> 1 Gbps (gigabit) -> 10 Gbps (ten-gigabit)

Memory

ECC (error checking and correction) support

Disk

SCSI HDD (mechanical), SSHD (hybrid), SATA SSD, PCI-E SSD, and NVMe SSD

AKF split principle

The AKF split principle was covered in detail in the Redis AKF article; here is a brief review:

For an application, if a single machine is not enough to support the service requests, you can set up a cluster in master-master, master-slave, or similar modes:

This is X-axis scaling in the AKF principle; the purpose is to distribute requests across multiple machines. However, the machines now have to keep their data in sync, and the more machines there are, the more likely the data is to fall out of sync. Therefore, the hot services on a server can be identified and split out, so that capacity is expanded only for the hot spots. This is Y-axis splitting in the AKF principle:

If a service is still too hot after the split, that is, if horizontal replication along the Y-axis still cannot support the request volume, the data of that service itself can be partitioned, placing the data of a single business in multiple locations; this is Z-axis splitting in the AKF principle. For example, data centers can be deployed in Hubei, Beijing, and Shanghai, and users across the country are served by the nearest one, as sketched below.
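To make Z-axis splitting concrete, here is a minimal routing sketch in Java; the region-to-data-center mapping and the hash fallback are illustrative assumptions, not the article's actual scheme.

```java
import java.util.List;
import java.util.Map;

// Minimal Z-axis routing sketch: pick a data center by the user's region,
// falling back to hashing the user id across all data centers.
public class RegionRouter {
    private final Map<String, String> regionToDataCenter; // e.g. "hubei" -> "dc-hubei"
    private final List<String> allDataCenters;

    public RegionRouter(Map<String, String> regionToDataCenter, List<String> allDataCenters) {
        this.regionToDataCenter = regionToDataCenter;
        this.allDataCenters = allDataCenters;
    }

    public String route(String userRegion, long userId) {
        String dc = regionToDataCenter.get(userRegion);
        if (dc != null) {
            return dc; // the nearest data center serves the request
        }
        // Unknown region: hash the user id onto one of the data centers.
        int idx = (int) Math.floorMod(userId, (long) allDataCenters.size());
        return allDataCenters.get(idx);
    }
}
```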

Problems after capacity expansion

As the business grows, the system becomes larger and larger and is split, according to its functions, into independent but interconnected projects, such as a trading system, a finance system, a production process system, a logistics system, a website system, and so on. However, a distributed architecture brings many problems of its own. Each of them deserves a deeper discussion; they are only briefly listed here and covered later.

  1. Data sharing problem: how to share and synchronize data among all the services has to be considered. In a microservice architecture, data cannot be kept as a single copy, otherwise data loss from a broken machine could not be avoided, so how is data synchronized among multiple replicas? The reference solutions at present are data centers and database clusters.
  2. Interface call problem: calls between different servers follow remote procedure call (RPC) protocols. Java RMI (Remote Method Invocation) is an API for remote procedure calls in the Java programming language; it lets a program running on a client invoke objects on a remote server. Dubbo provides high-performance RPC calls based on interface proxies.
  3. Persistent-data avalanche problem: database sharding and partitioned tables, see MySQL tuning: partitioned tables; resource isolation, see Ideas and methods of resource isolation in a billion-level traffic architecture; cache and data persistence strategies, see Redis persistence: RDB and AOF.
  4. High-concurrency problems: caching issues such as cache breakdown, penetration, and avalanche; see Redis breakdown, penetration and avalanche: causes and solutions. Data closed loop: for example, Taobao has a web version, an iOS version, an Android version, a lite version, and so on. The clients differ, but the product information they display is the same: one product, the same data on every end. A scheme is needed to guarantee that the same data is displayed consistently on different ends under concurrency; this is called a data closed loop.
  5. Data consistency: this is a hard point. The general problem is how to guarantee data consistency across multiple servers; for example, the price of the same product should be the same on different clients and servers. Distributed locks are commonly used here; a minimal sketch follows this list.
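As an illustration of the distributed-lock approach from item 5, here is a sketch using Redis via the Jedis client (assuming Jedis 3.x); the key name, TTL, and ownership check are illustrative, and a production lock would release the key with an atomic Lua script rather than a get-then-delete.

```java
import java.util.UUID;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

// Minimal Redis-based distributed lock sketch (not production-ready:
// no lease renewal, and unlock should really be an atomic Lua script).
public class RedisLock {
    private final Jedis jedis;
    private final String lockKey;
    private final String ownerId = UUID.randomUUID().toString();

    public RedisLock(Jedis jedis, String lockKey) {
        this.jedis = jedis;
        this.lockKey = lockKey;
    }

    // SET key value NX PX ttl -- succeeds only if the key does not exist yet.
    public boolean tryLock(long ttlMillis) {
        String result = jedis.set(lockKey, ownerId, SetParams.setParams().nx().px(ttlMillis));
        return "OK".equals(result);
    }

    // Only the owner may release the lock (check-then-delete is not atomic; see above).
    public void unlock() {
        if (ownerId.equals(jedis.get(lockKey))) {
            jedis.del(lockKey);
        }
    }
}
```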

Database expansion: cluster

First, a brief note on the difference between distributed and cluster. The two words often appear together, but they mean different things. Distributed shortens the execution time of a single task by splitting its steps across machines, while a cluster raises the number of operations executed per unit time by running the same kind of task on several machines at once. Put more simply: distributed spreads the steps of one task over multiple computers, whereas a cluster has several machines processing the same kind of task at the same time.

When a single database can no longer meet the storage needs of the business, the data is stored in a cluster on different servers, where a node can be a master or a slave. The master handles writes and the slaves handle reads, spreading the database load across multiple machines, as sketched below.
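A minimal read/write-splitting sketch in Java (the data-source wiring is illustrative): writes go to the master and reads go to a randomly chosen slave.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import javax.sql.DataSource;

// Minimal read/write splitting sketch: the master handles writes,
// a randomly chosen slave handles reads.
public class ReadWriteRouter {
    private final DataSource master;
    private final List<DataSource> slaves;

    public ReadWriteRouter(DataSource master, List<DataSource> slaves) {
        this.master = master;
        this.slaves = slaves;
    }

    public Connection connectionFor(boolean isWrite) throws SQLException {
        if (isWrite || slaves.isEmpty()) {
            return master.getConnection(); // all writes go to the master
        }
        int idx = ThreadLocalRandom.current().nextInt(slaves.size());
        return slaves.get(idx).getConnection(); // reads are spread over the slaves
    }
}
```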

Distributed IDs

In a complex distributed system, large amounts of data and messages often need unique identifiers. Database auto-increment is the first thing that comes to mind, but it has many problems: the IDs follow too obvious a pattern, so they can be harvested by malicious queries; and as data grows and is split across databases and tables, a globally unique ID is needed to identify a record or message, which an auto-increment column obviously cannot provide. Special entities such as products, orders, and users also need a unique ID to identify them. At that point a system that can generate globally unique IDs becomes necessary. To sum up, what does a business system require of its ID numbers?

Distributed ID Requirements

For distributed IDs, the following requirements must be met:

  1. Global uniqueness: no duplicate ID numbers; since the ID is an identifier, this is the most basic requirement.
  2. Increasing trend: MySQL's InnoDB engine uses a clustered index, and since most RDBMSs store index data in B-tree structures, ordered primary keys should be used as far as possible to guarantee write performance.
  3. Monotonically increasing: guarantee that the next ID is greater than the previous one, as required by scenarios such as transaction version numbers, incremental IM messages, and sorting.
  4. Information security: if IDs are consecutive, it is very easy for malicious users to scrape data, simply downloading the specified URLs in order; if they are order numbers, it is even more dangerous, because a competitor can read off the daily order volume directly. Therefore, some application scenarios require irregular, non-sequential IDs.

Requirements 1 to 3 above correspond to different scenarios, and requirements 3 and 4 are mutually exclusive: one scheme cannot satisfy both. Beyond the requirements on the ID itself, the business also demands high availability from the ID generation system: imagine what a disaster it would be if ID generation failed and every data-related action in the system ground to a halt. So an ID generation system should also do the following:

  1. The average latency and the TP999 latency should be as low as possible;
  2. Availability of five nines (Meituan's requirement; some companies such as Alibaba require six nines);
  3. High QPS.

Distributed ID generation strategy

There are many common ID generation strategies in the industry, such as UUID, the Snowflake algorithm, Redis, Zookeeper, and so on. Here we briefly cover UUID and Snowflake; they are discussed in detail later.

UUID generation algorithm

The standard form of a UUID consists of 32 hexadecimal digits, hyphenated into five groups in the form 8-4-4-4-12, 36 characters in total, for example: 550e8400-e29b-41d4-a716-446655440000. To date, the industry defines five ways (versions) of generating UUIDs; for details, see the UUID specification published by the IETF: A Universally Unique IDentifier (UUID) URN Namespace (RFC 4122).
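For reference, generating a version 4 (random) UUID in Java takes one line with the standard library:

```java
import java.util.UUID;

public class UuidDemo {
    public static void main(String[] args) {
        // java.util.UUID.randomUUID() produces a version 4 (random) UUID.
        UUID id = UUID.randomUUID();
        System.out.println(id);            // e.g. 550e8400-e29b-41d4-a716-446655440000 (36 chars)
        System.out.println(id.version());  // 4
    }
}
```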

Advantages:

  • Very high performance: local generation, no network consumption.

Disadvantages:

  • Not easy to store: a UUID is 16 bytes (128 bits), usually represented as a 36-character string, which is too long for many scenarios.
  • Information insecurity: the algorithm that generates UUIDs from MAC addresses can leak the MAC address; this vulnerability was famously used to locate the creator of the Melissa virus.
  • MySQL officially recommends that primary keys be as short as possible [4]; a 36-character UUID does not meet that requirement: "All indexes other than the clustered index are known as secondary indexes. In InnoDB, each record in a secondary index contains the primary key columns for the row, as well as the columns specified for the secondary index. InnoDB uses this primary key value to search for the row in the clustered index. If the primary key is long, the secondary indexes use more space, so it is advantageous to have a short primary key."

  • Bad for MySQL indexes: when used as a database primary key, the disorder of UUIDs under the InnoDB engine causes data locations to change frequently, which seriously hurts performance.

Snowflake generation algorithm

This family of schemes generates IDs by partitioning a namespace (UUID arguably counts as well, but it is common enough to analyze separately): the 64 bits of an ID are divided into segments that identify the machine, the time, and so on. In Snowflake, the 64 bits are laid out as follows: 1 unused sign bit, 41 bits for the timestamp, 10 bits for the machine, and 12 bits for the sequence number.

The 41 timestamp bits can cover (1L << 41) / (1000L * 3600 * 24 * 365) ≈ 69 years. The 10 machine bits can represent 1024 machines; if we need to distinguish IDCs, they can also be split into 5 bits for the IDC and 5 bits for the worker machine, giving 32 IDCs of 32 machines each, and the split can be adapted to your own requirements. The 12 auto-increment sequence bits can represent 2^12 = 4096 IDs per machine per millisecond, so in theory Snowflake can generate about 4.096 million IDs per second. This allocation guarantees that any machine in any IDC generates distinct IDs within any millisecond.
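Below is a minimal Snowflake sketch in Java following the layout above, with the 10 machine bits split 5/5 between IDC and worker as just described; the custom epoch constant is an illustrative assumption. Note the clock-rollback check, which corresponds to the disadvantage listed below.

```java
// Minimal Snowflake sketch: 1 sign bit | 41-bit timestamp | 5-bit IDC | 5-bit worker | 12-bit sequence.
public class Snowflake {
    private static final long EPOCH = 1577836800000L; // illustrative custom epoch: 2020-01-01 UTC
    private static final long IDC_BITS = 5L, WORKER_BITS = 5L, SEQ_BITS = 12L;
    private static final long MAX_SEQ = (1L << SEQ_BITS) - 1; // 4095

    private final long idcId;    // 0..31
    private final long workerId; // 0..31
    private long lastTimestamp = -1L;
    private long sequence = 0L;

    public Snowflake(long idcId, long workerId) {
        this.idcId = idcId;
        this.workerId = workerId;
    }

    public synchronized long nextId() {
        long now = System.currentTimeMillis();
        if (now < lastTimestamp) {
            // Clock moved backwards: refuse to issue IDs to avoid duplicates.
            throw new IllegalStateException("clock moved backwards");
        }
        if (now == lastTimestamp) {
            sequence = (sequence + 1) & MAX_SEQ;
            if (sequence == 0) {
                // Sequence exhausted within this millisecond: spin until the next one.
                while ((now = System.currentTimeMillis()) <= lastTimestamp) { /* busy wait */ }
            }
        } else {
            sequence = 0L;
        }
        lastTimestamp = now;
        return ((now - EPOCH) << (IDC_BITS + WORKER_BITS + SEQ_BITS))
                | (idcId << (WORKER_BITS + SEQ_BITS))
                | (workerId << SEQ_BITS)
                | sequence;
    }
}
```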

The advantages and disadvantages of this approach are:

Advantages:

  • The millisecond timestamp occupies the high bits and the auto-increment sequence the low bits, so the whole ID trends upward.
  • It does not depend on third-party systems such as databases; deployed as a service, it is more stable, and the performance of ID generation is very high.
  • You can allocate bits according to your service characteristics, which is very flexible.

Disadvantages:

  • If the clock on a machine is rolled back, duplicate IDs may be issued or the service may become unavailable.

Elastic expansion

In plain language, elastic expansion means a cluster automatically adds resources at scheduled times and releases them at scheduled times, which addresses the regular peaks and valleys in resource demand and makes full, reasonable use of resources. However, elastic expansion has some problems:

First, virtual machines scale poorly. To deploy a service on VMs, you need to apply for the VMs, create and deploy them, configure the service environment, and start the service instances. The first steps belong to the private cloud platform and the later ones to the service engineers, so a single expansion requires several departments to cooperate and takes hours, making the process hard to automate. Automated, one-click rapid expansion would greatly improve service elasticity, free up manpower, and eliminate the risk of accidents caused by manual operations.

Second, the IT cost is high. Because VMs scale poorly, business departments reserve large numbers of VMs and service instances to cope with traffic peaks and bursts; that is, a large number of VMs or dedicated servers stays deployed, and the resources needed at peak are usually twice those needed off-peak. This resource-reservation approach carries a very high IT cost, and during off-peak hours these machines sit idle, which is also a huge waste.
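To make the scheduled side of elastic expansion concrete, here is a minimal time-based scaling sketch in Java; the peak window, the instance counts, and the scaleTo hook are all illustrative assumptions, since a real system would call the cloud platform's scaling API.

```java
import java.time.LocalTime;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Minimal scheduled-scaling sketch: check the time every minute and
// adjust the desired instance count for peak vs. off-peak hours.
public class ScheduledScaler {
    private static final int PEAK_INSTANCES = 20;     // illustrative sizes
    private static final int OFF_PEAK_INSTANCES = 10;

    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(() -> {
            int hour = LocalTime.now().getHour();
            boolean peak = hour >= 9 && hour < 21; // assumed peak window
            scaleTo(peak ? PEAK_INSTANCES : OFF_PEAK_INSTANCES);
        }, 0, 1, TimeUnit.MINUTES);
    }

    // Placeholder: a real implementation would call the platform's scaling API here.
    private static void scaleTo(int desired) {
        System.out.println("desired instances: " + desired);
    }
}
```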

Author: The harmonica that can't wait

Link: www.cnblogs.com/Courage129/…