1. Some principles of system design

Heinrich's Law

  • The occurrence of an accident is the result of an accumulation of smaller problems.
  • No matter how good the technology and how complete the regulations, they cannot replace people's own competence and sense of responsibility in actual operation.

Murphy’s law

  • Nothing is as simple as it seems
  • Everything takes longer than you expected
  • Anything that can go wrong will go wrong
  • If you’re worried about something happening, it’s more likely to happen

2. High-availability design in software architecture

2.1. What is high availability

If a system can always serve requests, it is 100% available.

Most companies set a high-availability target of 99.99%, which allows about 53 minutes of downtime per year.

2.2 Availability measurement and assessment

The industry commonly uses N nines to quantify availability:

  • Basic availability (two nines): 99%, about 87.6 hours of downtime per year
  • High availability (three nines): 99.9%, about 8.8 hours of downtime per year
  • Availability with automatic failure recovery (four nines): 99.99%, about 53 minutes of downtime per year
  • Extremely high availability (five nines): 99.999%, about 5 minutes of downtime per year
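
These downtime figures follow directly from the availability percentage. A quick sketch of the arithmetic, in Java, purely for illustration:

public class DowntimeCalculator {
	public static void main(String[] args) {
		double minutesPerYear = 365 * 24 * 60;   // 525,600 minutes in a non-leap year
		double[] targets = {0.99, 0.999, 0.9999, 0.99999};
		for (double availability : targets) {
			// Allowed downtime = total time * (1 - availability)
			double downtimeMinutes = minutesPerYear * (1 - availability);
			System.out.printf("%.5f -> %.1f minutes (about %.1f hours) per year%n",
					availability, downtimeMinutes, downtimeMinutes / 60);
		}
	}
}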

Fault measurement and assessment:

  • High-risk S-level incident (weight 100): once it occurs, the entire service may be unavailable
  • Major A-level fault (weight 20): customers clearly perceive a service exception, such as wrong results being returned
  • Intermediate B-level fault (weight 5): customers can perceive a service exception, such as slow responses
  • General C-level fault (weight 1): the service jitters for a short time

2.3. How to ensure the high availability of the system

  • Avoid single points of failure in system design
  • The basic principle of high-availability assurance is clustering, i.e. redundancy
  • Achieve high availability through automatic failover

Solutions to the high availability problem:

  1. Load balancing
  2. Rate limiting
  3. Degradation
  4. Isolation
  5. Timeout and retry
  6. Rollback
  7. Load testing and contingency plans

3. Load balancing

3.1 DNS & Nginx load balancing

Ensure that the service cluster can fail over.

When a server goes down, requests are shifted to the remaining servers to keep the service highly available.

Other load balancing scenarios:

  1. Service-to-service RPC: the RPC framework provides load balancing (Dubbo, Spring Cloud)
  2. Database clusters also need load balancing (MyCat, HAProxy)

3.2 upstream configuration

Configure upstream in nginx:

upstream backend {
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
}

Use proxy_pass to forward user requests:

location / {
	proxy_pass http://backend;
}

3.3. Load balancing algorithm

3.3.1 round-robin

Round-robin is the default load balancing algorithm: requests are forwarded to the upstream servers in turn, and the weight parameter turns it into weighted round-robin. A minimal sketch of the idea follows.
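
For intuition, here is a minimal weighted round-robin selector in Java. It is only an illustration of the idea, not Nginx's actual implementation (Nginx uses a smooth weighted round-robin variant); the server addresses are the same example addresses used in the configs in this section.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class WeightedRoundRobin {
	private final List<String> expanded = new ArrayList<>();
	private final AtomicInteger position = new AtomicInteger(0);

	// Expand each server into the list "weight" times, e.g. 101 once and 102 twice
	public void addServer(String server, int weight) {
		for (int i = 0; i < weight; i++) {
			expanded.add(server);
		}
	}

	// Pick the next server by cycling over the expanded list
	public String next() {
		int index = Math.floorMod(position.getAndIncrement(), expanded.size());
		return expanded.get(index);
	}

	public static void main(String[] args) {
		WeightedRoundRobin lb = new WeightedRoundRobin();
		lb.addServer("192.168.1.101:8080", 1);
		lb.addServer("192.168.1.102:8080", 2);
		for (int i = 0; i < 6; i++) {
			System.out.println(lb.next());   // 102 is chosen twice as often as 101
		}
	}
}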

3.3.2 ip_hash

Load balancing is performed based on the client IP address: requests from the same IP are always routed to the same upstream server.

upstream backend{
	ip_hash;
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
}

3.3.3 hash key [consistent]

Hash a key, or use a consistent hash algorithm, for load balancing.

The problem with plain hashing is that when a server is added or removed, a large number of keys are remapped to different servers, which puts pressure on the backend (for example, a flood of cache misses).

With consistent hashing, when a server is added or removed, only a small number of keys are remapped to different servers (a sketch follows the configuration examples below).

Hash algorithm:

upstream backend{
	hash $uri;
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
}

Consistent hash algorithm: add the consistent parameter, and $consistent_key dynamically specifies the hash key.

upstream backend{
	hash $consistent_key consistent;
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
}
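
To see why consistent hashing remaps only a few keys, here is a minimal hash ring in Java (a TreeMap plus virtual nodes). It is a sketch of the general technique under simple assumptions, not the algorithm Nginx uses internally.

import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
	private final TreeMap<Integer, String> ring = new TreeMap<>();
	private final int virtualNodes = 100;   // virtual nodes smooth out the key distribution

	public void addServer(String server) {
		for (int i = 0; i < virtualNodes; i++) {
			ring.put(hash(server + "#" + i), server);
		}
	}

	public void removeServer(String server) {
		for (int i = 0; i < virtualNodes; i++) {
			ring.remove(hash(server + "#" + i));
		}
	}

	// Walk clockwise from the key's position to the first server on the ring
	public String serverFor(String key) {
		SortedMap<Integer, String> tail = ring.tailMap(hash(key));
		return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
	}

	// Any reasonably uniform hash is enough for a sketch
	private int hash(String s) {
		return s.hashCode() & 0x7fffffff;
	}

	public static void main(String[] args) {
		ConsistentHashRing ring = new ConsistentHashRing();
		ring.addServer("192.168.1.101:8080");
		ring.addServer("192.168.1.102:8080");
		System.out.println(ring.serverFor("/item/123"));
		// Removing one server only remaps the keys that pointed at it
		ring.removeServer("192.168.1.102:8080");
		System.out.println(ring.serverFor("/item/123"));
	}
}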

3.4. Retry after failure

upstream backend{
	server 192.168.1.101:8080 max_fails=2 fail_timeout=10s weight=1;
	server 192.168.1.102:8080 max_fails=2 fail_timeout=10s weight=2;
}

If a server fails max_fails times within fail_timeout seconds, it is considered unavailable (not live) and is removed from the pool. After fail_timeout seconds, it is added back to the live list and tried again.

3.5. Health check

Monitor the health of the services: if a server becomes unavailable, requests are forwarded to the remaining live servers, improving availability.

Nginx can integrate the nginx_upstream_check_module for active health checks; both TCP heartbeats and HTTP heartbeats are supported.

TCP heartbeat:

upstream backend{
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
	check interval=3000 rise=1 fall=3 timeout=2000 type=tcp;
}
  • interval: the health check interval, here 3000 ms (a check every 3 seconds).
  • fall: how many consecutive failed checks before the upstream server is marked as down.
  • rise: how many consecutive successful checks before the upstream server is marked as live again and starts receiving requests.

HTTP heartbeat:

upstream backend{
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2;
	check interval=3000 rise=1 fall=3 timeout=2000 type=http;
	check_http_send "HEAD /status HTTP/1.0\r\n\r\n";
	check_http_expect_alive http_2xx http_3xx;
}
  • check_http_send: the content of the HTTP request sent during the health check.
  • check_http_expect_alive: if the upstream server returns one of these status codes, it is considered alive.

Do not set the check interval too short; otherwise, the volume of heartbeat packets may itself put pressure on the upstream servers.

3.6 Other Configurations

  1. Backup server
upstream backend{
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2 backup;
}

Here server 102 is a backup server: requests are forwarded to it only when all primary servers are unavailable.

  2. Unavailable server
upstream backend{
	server 192.168.1.101:8080 weight=1;
	server 192.168.1.102:8080 weight=2 down;
}

Here server 102 is marked as permanently unavailable (down). This configuration can be used to take server 102 out of rotation during testing or when the machine has problems.

4. Isolation

4.1. Thread isolation

Thread isolation mainly means thread pool isolation: different types of requests are handled by different thread pools, so a problem with one type of request does not affect the thread pools of the others. A minimal sketch follows.
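
A minimal sketch of thread pool isolation in Java, with hypothetical pool names: if slow "order" requests exhaust their own pool, "search" requests still have threads available.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ThreadIsolation {
	// One fixed-size pool per business type; a hung pool only blocks its own traffic
	private static final ExecutorService ORDER_POOL = Executors.newFixedThreadPool(20);
	private static final ExecutorService SEARCH_POOL = Executors.newFixedThreadPool(20);

	public static void main(String[] args) {
		ORDER_POOL.submit(() -> System.out.println("handle order request"));
		SEARCH_POOL.submit(() -> System.out.println("handle search request"));
		ORDER_POOL.shutdown();
		SEARCH_POOL.shutdown();
	}
}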

4.2 Process isolation

Split the project into sub-projects that run as separate processes, physically isolated from each other, with no direct in-process calls between them.

4.3 Cluster Isolation

Isolate clusters so that they do not affect each other.

4.4 Isolation of the machine room

Deploy in multiple data centers, for example a Hangzhou data center, a Beijing data center, and a Shanghai data center.

4.5 Read and write isolation

Internet projects typically read far more than they write; separating reads from writes scales out read capacity, improves performance, and improves availability.

4.6 Dynamic/static isolation

Serve static resources from Nginx or a CDN to separate them from dynamic requests, so that pages do not pull large numbers of static resources from the application servers.

4.7 Hotspot isolation

Split hotspot business into separate systems or services for isolation, such as flash sales (seckill) and panic buying.

Read hotspots are generally handled with multi-level caching.

Write hotspots are generally handled with a cache plus a message queue, as sketched below.
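
A rough sketch of the write-hotspot pattern, with an in-process BlockingQueue standing in for a real message queue and a map standing in for the cache (the names and stock numbers are hypothetical): the write hits the cache first, and a consumer persists the change to the database asynchronously.

import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

public class HotWriteSketch {
	// Stand-ins: a cache of item -> remaining stock, and a queue playing the role of the MQ
	private static final ConcurrentHashMap<String, AtomicLong> cache = new ConcurrentHashMap<>();
	private static final BlockingQueue<String> queue = new LinkedBlockingQueue<>();

	// Hot write path: decrement stock in the cache, then enqueue the change for async persistence
	static boolean deductStock(String itemId) {
		AtomicLong stock = cache.computeIfAbsent(itemId, k -> new AtomicLong(100));
		if (stock.decrementAndGet() < 0) {
			stock.incrementAndGet();   // roll back the over-deduction
			return false;              // sold out
		}
		queue.offer(itemId);           // a real system would publish to Kafka/RocketMQ here
		return true;
	}

	public static void main(String[] args) throws InterruptedException {
		// Async consumer: flushes cache changes to the database for eventual consistency
		Thread consumer = new Thread(() -> {
			while (true) {
				try {
					String itemId = queue.take();
					System.out.println("persist stock change for " + itemId + " to DB");
				} catch (InterruptedException e) {
					Thread.currentThread().interrupt();
					return;
				}
			}
		});
		consumer.setDaemon(true);
		consumer.start();

		System.out.println(deductStock("item-1"));
		Thread.sleep(200);   // give the consumer a moment in this demo
	}
}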

5. Rate limiting

Without rate limiting, a sudden burst of heavy traffic may overwhelm the service.

5.1. Rate limiting algorithms

5.1.1 Leaky bucket algorithm

The idea of the leaky bucket algorithm is simple: requests (water) enter the leaky bucket, and the bucket lets water out at a fixed rate; when the inflow rate is too high, the excess water overflows and is discarded. A sketch follows.
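
A minimal leaky bucket limiter in Java, purely to illustrate the idea; the capacity and rate are arbitrary example values.

public class LeakyBucket {
	private final long capacity;        // maximum amount of "water" the bucket can hold
	private final double leakPerMillis; // outflow rate
	private double water = 0;
	private long lastLeakTime = System.currentTimeMillis();

	public LeakyBucket(long capacity, double leakPerSecond) {
		this.capacity = capacity;
		this.leakPerMillis = leakPerSecond / 1000.0;
	}

	public synchronized boolean tryAcquire() {
		long now = System.currentTimeMillis();
		// Drain the water that has leaked out since the last call
		water = Math.max(0, water - (now - lastLeakTime) * leakPerMillis);
		lastLeakTime = now;
		if (water + 1 <= capacity) {
			water += 1;        // the request fits into the bucket
			return true;
		}
		return false;          // the bucket is full: the request overflows and is rejected
	}

	public static void main(String[] args) {
		LeakyBucket bucket = new LeakyBucket(10, 5);   // capacity 10, drains 5 requests/second
		for (int i = 0; i < 15; i++) {
			System.out.println("request " + i + " accepted: " + bucket.tryAcquire());
		}
	}
}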

5.1.2 Token bucket algorithm

The token bucket algorithm puts tokens into a bucket at a constant rate. A request must first take a token from the bucket before it is processed; when no token is available, the request is rejected. A sketch follows.
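
A minimal token bucket limiter in Java for illustration (in practice a library such as Guava's RateLimiter is a reasonable choice); tokens are refilled lazily based on elapsed time, and the capacity and rate are arbitrary example values.

public class TokenBucket {
	private final long capacity;         // maximum number of tokens the bucket can hold
	private final double refillPerMillis;
	private double tokens;
	private long lastRefillTime = System.currentTimeMillis();

	public TokenBucket(long capacity, double refillPerSecond) {
		this.capacity = capacity;
		this.refillPerMillis = refillPerSecond / 1000.0;
		this.tokens = capacity;          // start full so an initial burst can be absorbed
	}

	public synchronized boolean tryAcquire() {
		long now = System.currentTimeMillis();
		// Add the tokens accumulated since the last call, capped at the capacity
		tokens = Math.min(capacity, tokens + (now - lastRefillTime) * refillPerMillis);
		lastRefillTime = now;
		if (tokens >= 1) {
			tokens -= 1;                 // consume one token for this request
			return true;
		}
		return false;                    // no token available: reject the request
	}

	public static void main(String[] args) {
		TokenBucket bucket = new TokenBucket(10, 5);   // capacity 10, refills 5 tokens/second
		for (int i = 0; i < 15; i++) {
			System.out.println("request " + i + " accepted: " + bucket.tryAcquire());
		}
	}
}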

5.2 Tomcat rate limiting

For any application system there is a limit to how many concurrent requests it can handle, that is, there is always a TPS/QPS threshold. Beyond that threshold the system stops responding to user requests or responds very slowly, so it is best to add overload protection to prevent a flood of requests from overwhelming the system.

<Connector port="8080" protocol="HTTP/1.1"
	connectionTimeout="20000"
	redirectPort="8443" maxThreads="800" maxConnections="2000" acceptCount="1000"/>
  • acceptCount: the wait queue. If all Tomcat threads are busy, new connections are queued; connections beyond the queue size are rejected. The default is 100.
  • maxConnections: the maximum number of connections Tomcat accepts and processes at the same time.
  • maxThreads: the maximum number of threads Tomcat starts to process requests, i.e. how many requests it can handle concurrently. The default is 200.

5.3. Rate limiting on interfaces

Limit the request rate of a single interface, for example with a per-second counter:

long limit = 1000;
// counter maps the current second to the number of requests seen in that second
// (for example a Guava LoadingCache, as sketched below)
while (true) {
	// Get the current second
	long currentSecond = System.currentTimeMillis() / 1000;
	if (counter.get(currentSecond).incrementAndGet() > limit) {
		// Over the limit for this second: reject or skip the request
		continue;
	}
	// Business processing
}
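
One way to define the counter above, assuming Guava is on the classpath: each second gets its own AtomicLong, and old entries expire so the cache does not grow without bound. (With a LoadingCache, counter.getUnchecked(currentSecond) avoids the checked exception thrown by get.)

import com.google.common.cache.CacheBuilder;
import com.google.common.cache.CacheLoader;
import com.google.common.cache.LoadingCache;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class PerSecondCounter {
	// One AtomicLong per second; entries for past seconds expire automatically
	static final LoadingCache<Long, AtomicLong> counter = CacheBuilder.newBuilder()
			.expireAfterWrite(2, TimeUnit.SECONDS)
			.build(new CacheLoader<Long, AtomicLong>() {
				@Override
				public AtomicLong load(Long second) {
					return new AtomicLong(0);
				}
			});
}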

5.4 Redis rate limiting

Rate limiting with Redis is usually implemented with a Lua script that Redis executes atomically, with the limit parameters passed to the script. A sketch follows.
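
A minimal fixed-window limiter as a sketch, assuming the Jedis client and a local Redis; the Lua script increments a per-second key atomically and sets its expiry on first use.

import java.util.Arrays;
import redis.clients.jedis.Jedis;

public class RedisRateLimiter {
	// Atomically increment the counter for the key and set a TTL the first time it is created
	private static final String SCRIPT =
			"local current = redis.call('incr', KEYS[1]) " +
			"if tonumber(current) == 1 then redis.call('expire', KEYS[1], ARGV[1]) end " +
			"return current";

	public static boolean allow(Jedis jedis, String resource, long limit) {
		String key = "limit:" + resource + ":" + (System.currentTimeMillis() / 1000);
		Object count = jedis.eval(SCRIPT, Arrays.asList(key), Arrays.asList("2"));
		return ((Long) count) <= limit;   // reject once this second's counter exceeds the limit
	}

	public static void main(String[] args) {
		try (Jedis jedis = new Jedis("127.0.0.1", 6379)) {
			System.out.println(allow(jedis, "order-api", 1000));
		}
	}
}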

5.5. Nginx rate limiting

Nginx rate limiting can use two modules that ship with Nginx:

  1. ngx_http_limit_conn_module: limits the number of connections for a key
  2. ngx_http_limit_req_module: limits the request rate for a key using the leaky bucket algorithm

ngx_http_limit_conn_module

Limits the total number of network connections associated with a key.

You can limit the total number of connections per IP address (keyed by IP) or per domain (keyed by the service's domain name).

http{
	limit_conn_zone $binary_remote_addr zone=addr:10m;
	limit_conn_log_level error;
	limit_conn_status 503;
	...
	server{
		location /limit{
			limit_conn addr 1;
		}
	}
	...
}
  • limit_conn: configures the shared memory zone where keys and counters are stored and the maximum number of connections for a single key. The maximum here is 1, meaning Nginx allows at most one concurrent connection per key (per client IP).
  • limit_conn_zone: configures the rate-limiting key and the size of the shared memory zone that stores the key information. The key here is $binary_remote_addr, which is the client IP address; $server_name can also be used as the key to limit the total number of connections at the domain-name level.
  • limit_conn_status: Status code returned after traffic limiting is configured. 503 is returned by default.
  • limit_conn_log_level: Sets the log level after traffic limiting. The default log level is Error.

ngx_http_limit_req_module

Limits the request rate for a given key using the leaky bucket algorithm, for example limiting the request rate per IP address. A configuration example:

limit_req_zone $binary_remote_addr zone=one:10m rate=10r/s;
limit_conn_log_level error;
limit_conn_status 503;
...
server{
	location /limit{
		limit_req zone=one burst=5 nodelay;
	}
}
  • limit_req: references the rate-limiting zone and configures the bucket capacity (burst, 0 by default) and whether excess requests are delayed (the default) or handled without delay (nodelay).
  • limit_req_zone: configures the rate-limiting key, the size of the shared memory zone that stores the key information, and the fixed request rate. The key here is $binary_remote_addr, i.e. the client IP address. The rate is set with the rate parameter and supports values such as 10r/s and 60r/m (10 requests per second or 60 requests per minute); internally both are converted to a fixed per-request interval (10r/s is one request every 100 milliseconds, 60r/m is one request every 1000 milliseconds).
  • limit_conn_status: Status code returned after traffic limiting is configured. 503 is returned by default.
  • limit_conn_log_level: Sets the log level after traffic limiting. The default level is Error.

6. Degradation

When traffic spikes, when a service has problems (such as long response times or no response at all), or when non-core services affect the performance of the core flow, the service still needs to stay available, even if it is lossy. The system can degrade itself automatically based on key metrics, or be degraded manually through configuration switches.

The ultimate goal of a downgrade is to ensure that the core service is available, even if it is lossy.

6.1. Degradation plan

Before degrading, the system needs to be reviewed to decide where it can "sacrifice the pawn to save the rook", that is, which functions can be degraded and which cannot.

  • General: for example, some services occasionally time out because of network jitter or an in-progress deployment; they can be degraded automatically.
  • Warning: the success rate of some services fluctuates over a period of time (for example between 95% and 100%); they can be degraded automatically or manually, and an alarm is raised.
  • Error: for example, availability drops below 90%, the database connection pool is exhausted, or traffic suddenly rises to the maximum the system can tolerate; the service can be degraded automatically or manually depending on the situation.
  • Severe error: for example, data becomes incorrect for some special reason; an emergency manual degradation is required.

By degree of automation, degradation can be divided into automatic switch degradation and manual switch degradation. By function, it can be divided into read-service degradation and write-service degradation. By system level, there can be multi-level degradation.

The function points to degrade are decided mainly from the server-side call chain: walk the service invocation links triggered by user requests and identify where degradation is needed.

6.2 page degradation

During a big promotion or flash sale, entire pages that consume scarce service resources can be degraded in an emergency.

6.3. Page fragment degradation

For example, the merchant section of the product details page needs to be degraded because of data errors.

6.4 page asynchronous request degradation

For example, recommendations/shipping requests loaded asynchronously on the product details page can be downgraded if they are slow to respond or there are problems with the back-end service.

6.5 Service functions are degraded

For example, when rendering a product detail page, you need to invoke some less important services (related categories, hot lists, etc.) that are not available in exceptional cases, i.e. degraded.

6.6. Read degradation

For example, the multi-level cache mode can be downgraded to read-only cache if there is a problem with the back-end service. This mode is suitable for scenarios where read consistency is not required.

6.7. Write degrade

For example, in a flash sale we can deduct inventory in the cache only and synchronize the deduction to the DB asynchronously, guaranteeing eventual consistency; in this scenario the DB write is degraded to a cache write.

6.8. Automatic downgrade

When a service's error count reaches a threshold (for example, when availability falls below the 99.99% target), degrade the service automatically and raise an alarm. A sketch of such a switch follows.
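
A minimal sketch of an automatic degradation switch in Java, assuming a simple error counter per time window (the threshold and names are hypothetical; production systems usually use a circuit breaker such as Hystrix, Sentinel, or Resilience4j):

import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

public class DegradeSwitch {
	private static final int ERROR_THRESHOLD = 100;           // errors per window before degrading
	private final AtomicInteger errorsInWindow = new AtomicInteger(0);
	private volatile boolean degraded = false;                 // can also be flipped manually

	public <T> T call(Supplier<T> primary, Supplier<T> fallback) {
		if (degraded) {
			return fallback.get();                             // degraded: serve the lossy fallback
		}
		try {
			return primary.get();
		} catch (Exception e) {
			if (errorsInWindow.incrementAndGet() >= ERROR_THRESHOLD) {
				degraded = true;                               // trip the switch and alert
				System.err.println("ALERT: service degraded after too many errors");
			}
			return fallback.get();
		}
	}

	// A scheduled job would periodically reset errorsInWindow and probe whether to recover.
}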

7. Timeouts and retries

When a call to a service times out because of network delay or other problems, a second request (a retry) can be issued to improve the user experience. Timeouts and retries need to be considered at each of the layers below; a client-side sketch follows the list.

  1. Proxy layer timeout and retry: Nginx
  2. Web container timeout and retry
  3. Timeout and retry between middleware and services
  4. Database connection timeout and retry
  5. NoSQL timeout and retry
  6. Service timeout and retry
  7. Front-end browser Ajax request timeout and retry
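
For illustration, a minimal timeout-and-retry sketch using the JDK's built-in java.net.http.HttpClient (Java 11+); the URL, timeouts, and retry count are arbitrary example values.

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class TimeoutRetryClient {
	private static final HttpClient CLIENT = HttpClient.newBuilder()
			.connectTimeout(Duration.ofSeconds(2))            // connection timeout
			.build();

	static String getWithRetry(String url, int maxAttempts) throws Exception {
		Exception last = null;
		for (int attempt = 1; attempt <= maxAttempts; attempt++) {
			try {
				HttpRequest request = HttpRequest.newBuilder(URI.create(url))
						.timeout(Duration.ofSeconds(3))       // per-request timeout
						.GET()
						.build();
				return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
			} catch (Exception e) {
				last = e;                                     // timed out or failed: retry
				Thread.sleep(100L * attempt);                 // small backoff between attempts
			}
		}
		throw last;
	}

	public static void main(String[] args) throws Exception {
		System.out.println(getWithRetry("http://127.0.0.1:8080/status", 2));
	}
}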

8. Load testing and contingency plans

8.1. System load testing

A performance load test evaluates the stability and performance of the system; based on the results, decide whether to scale the system out or in.

Offline load testing: test a single interface of the system (such as an inventory query interface) or a single component (such as the database connection pool) with tools such as JMeter or Apache ab, then tune it (for example JVM parameter tuning or code optimization) until that interface or component performs optimally. Because the offline environment (servers, network, data volume, and so on) is completely different from production, fidelity is low and full-link testing is difficult; it is suitable for component-level testing, and its results can only serve as a reference.

Online load testing: there are many approaches. By traffic direction there are read tests, write tests, and mixed tests; by data authenticity there are simulated-traffic tests and diverted-traffic tests; by whether the tested cluster serves real users there are isolated-cluster tests and live-cluster tests.

Read testing exercises the system's read traffic, for example testing product price services. Write testing exercises the write traffic, for example placing orders; the written test data should be kept separate from real data and deleted after the test. Testing reads or writes alone may not reveal the system's bottlenecks, because read and write traffic can interfere with each other, so mixed testing is used in those cases. Simulated-traffic testing drives the system with constructed requests, which can be generated by a program, prepared by hand (for example pre-created users and products), or replayed from Nginx access logs; if the data set is small, request hotspots will form. A better approach is traffic diversion, for example using TCPCopy to copy production traffic to the test cluster.

8.2 System optimization and disaster recovery

After receiving the pressure test report, we will analyze the report and then carry out some targeted optimization, such as hardware upgrade, system expansion, parameter tuning, code optimization (such as synchronous code to asynchronous), architecture optimization (such as adding cache, read and write separation, historical data archiving), etc.

Do not blindly copy someone else's plan; adjust your own plan based on your own load-test results.

During system optimization, code walkthroughs are needed to find unreasonable parameter configurations, such as timeout values, degradation strategies, and cache durations. During system load testing, slow queries (in both Redis and MySQL) should be identified and optimized away.

For application capacity planning, estimate whether to scale out and by how much based on last year's traffic, the promotion plans communicated with operations and business, and recent traffic. For example, if GMV is expected to grow by 100%, consider scaling capacity by 2 to 3 times.

8.3 Emergency Plan

Load testing will uncover system bottlenecks; system optimization then improves throughput and reduces response time, and disaster recovery work protects availability. Even so, risks remain, such as network jitter, an overloaded machine, a slow service, or a high database load. To prevent these problems from triggering an avalanche across the system, contingency plans must be prepared for each of them, so that when an incident happens there are ready measures to handle it.

The contingency plan can be produced in the following steps: first classify the systems, then analyze the full call chain, configure monitoring and alarms, and finally write the contingency plans.