Server-Side Rendering (SSR) is a technique for pre-generating the DOM structure of a page on the server rather than in the browser. Its advantages are better SEO and fast first-screen loading, which makes SSR well suited to pages that focus on content rather than interaction, such as news sites and large event pages, where it has greatly improved the user experience. However, there is no free lunch: SSR comes with some unavoidable disadvantages, including higher maintenance cost, higher server cost, lower QPS capacity per server instance, and increased technical complexity.

The following per-instance QPS capacities of several SSR services were obtained through preliminary load testing, under the condition that interface latency does not rise significantly:

| Project | Machine configuration | Approximate QPS capacity |
| --- | --- | --- |
| Benchmark project * | 1c2g | 60 |
| Next.js benchmark project * | 1c2g | 25 |
| Business A official website | 4c8g | 22 |
| Business A refactored official website | 4c8g | 44 |
| Business B song return page | 1c2g | 34 |

Note:

  1. The benchmark project is initialized with the company's internal React front-end scaffolding; the rendered page contains 5,000 div DOM nodes and is served by Node middleware.

  2. The Next.js benchmark project is initialized with the Next.js scaffolding; the rendered page contains 5,000 div DOM nodes and is served by the next start command.

As the numbers show, in real business scenarios the per-instance QPS capacity of an SSR service is indeed relatively low. When traffic suddenly increases, the SSR cluster comes under great pressure.

SSR serves the most critical interface of the service: once it goes down, the website becomes directly inaccessible. To improve the stability of SSR services, we need a general SSR-to-CSR degradation scheme that gives the SSR cluster enough flexibility to cope with sudden traffic spikes and guarantees the availability of core services, while also being quick and convenient to land in existing SSR services.

Selection of degradation scheme

Node Middleware Degradation

This is the easiest way to degrade: add middleware directly to the Node service to handle the downgrade. However, it is generally not recommended, because when the service is abnormal, the Node business process itself cannot respond to requests in a timely manner, so the degradation strategy becomes ineffective.

For example, if we use a timer to sample the current CPU usage and degrade to CSR when it crosses a threshold, the timer callback itself may not fire in time when the Node process is under high CPU load.

Nginx fallback configuration

Configure error_page on Nginx so that when the SSR route responds with an error, the CSR result is returned instead.

Pros: the implementation is relatively simple; Nginx, as the outermost fallback, can cover most abnormal scenarios.

Cons: an additional CSR cluster is required to respond to exceptions, and no further customization is possible.

The specific implementation

Note: the project needs to support CSR/SSR isomorphism.

  1. Create a new CSR cluster that is only responsible for serving the service's static HTML resources (or configure a fallback CDN domain name).
  2. Configure the error_page rule on Nginx for the corresponding server-render interface route.
  3. Create a fallback route @fallback whose backend is the newly created CSR cluster or the fallback CDN domain name.
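The steps above can be sketched in Nginx configuration; the upstream names and ports here are assumptions for illustration, not the actual deployment:

```nginx
# Hypothetical upstreams: SSR service and the fallback CSR cluster.
upstream ssr_upstream { server 127.0.0.1:3000; }
upstream csr_fallback { server 127.0.0.1:3001; }

server {
    listen 80;

    location / {
        proxy_pass http://ssr_upstream;
        # Let Nginx handle upstream error responses itself.
        proxy_intercept_errors on;
        # On 5xx from the SSR route, redirect internally to the fallback.
        error_page 500 502 503 504 = @fallback;
    }

    location @fallback {
        # Serve the CSR result from the fallback cluster (or a CDN domain).
        proxy_pass http://csr_fallback;
    }
}
```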

Standalone gateway degradation

An independent gateway service handles CSR/SSR traffic scheduling, switching to CSR when the SSR service is unavailable or times out.

The request flow is as follows:

Pros: maximum freedom; most service-governance policies (circuit breaking, rate limiting, etc.) can be added to the gateway service.

Cons: the request path gains an additional centralized gateway service, which greatly increases the complexity of the deployment model and introduces the extra operational cost of maintaining the gateway service itself.

Distributed gateway degradation

A decentralized Sidecar distributed gateway is deployed in the same instance as the service process. All client traffic reaches the business service through the gateway, and all non-business functions, such as general parameter parsing and protocol translation, are handled at the gateway layer. Currently, all our services are connected to the Sidecar distributed gateway.

Pros: because business traffic passes through the distributed gateway before entering the business service, degradation logic can be configured and deployed in the Sidecar at low cost, without the business being aware of it. It is transparent to business code and non-intrusive. Since existing services are already connected to the distributed gateway, access is simple, no separate service needs to be deployed, and the existing release process is unaffected.

Cons: the processing logic of the distributed gateway still occupies service CPU resources, affecting overall performance to some extent.

Distributed gateway degradation scheme

Considering scalability and upgrade cost, we finally chose to implement general SSR degradation through the distributed gateway. The distributed gateway supports loading custom logic in the form of a Loader, so we developed and published a custom Loader containing the degradation scheme. A service can adopt the degradation solution simply by loading this Loader.

Processing flow

LB is the front-end load-balancing layer. Traffic is forwarded through LB to the Sidecar distributed gateway, and then on to the SSR service.

Business transformation

The only business transformation required is to ensure that the service is isomorphic between SSR and CSR, so that it runs normally when degraded to CSR mode. Connecting to the distributed gateway is transparent to the service.

Adaptive rate limiting

In the custom Loader, we dynamically calculate CPU load through a sliding window to limit traffic. The core purpose is to prevent a single instance from failing due to CPU overload during traffic peaks.

For the SSR service, when adaptive rate limiting is triggered because CPU usage exceeds the threshold, we do not directly return an error status code but instead degrade to the CSR HTML page. This greatly reduces CPU load and significantly increases the QPS capacity of the service.
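A minimal sketch of the idea, assuming CPU usage is sampled periodically as a value in [0, 1]; the window size, threshold, and function names are illustrative, not the actual Loader implementation:

```javascript
// Sliding window over recent CPU-usage samples (each in [0, 1]).
class CpuSlidingWindow {
  constructor(size = 5) {
    this.size = size;
    this.samples = [];
  }
  push(sample) {
    this.samples.push(sample);
    if (this.samples.length > this.size) this.samples.shift(); // drop oldest
  }
  average() {
    if (this.samples.length === 0) return 0;
    return this.samples.reduce((sum, s) => sum + s, 0) / this.samples.length;
  }
}

const CPU_THRESHOLD = 0.85; // assumed threshold

// When the windowed CPU average exceeds the threshold, serve the degraded
// CSR page instead of returning an error status code.
function handleRequest(window, renderSSR, serveCSR) {
  return window.average() > CPU_THRESHOLD ? serveCSR() : renderSSR();
}
```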

Active degradation

We also support actively degrading SSR in the Sidecar via a global configuration switch or by adding a specific query parameter to the request, to cover special scenarios that call for active degradation.
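A sketch of that decision, assuming the switch lives in a config object; the query-parameter name is hypothetical, since the actual parameter is not named in this article:

```javascript
// Decide whether to actively degrade this request to CSR.
// `config.forceCSR` models the global configuration switch; the
// `render=csr` query-parameter name is an assumption for illustration.
function shouldActivelyDegrade(requestUrl, config) {
  if (config.forceCSR) return true; // global switch: degrade everything
  const url = new URL(requestUrl, "http://placeholder.local");
  return url.searchParams.get("render") === "csr"; // per-request trigger
}
```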

Support for MPA/SPA modes

Services come in various forms: they may be multi-page applications (MPA) or single-page applications (SPA). For a single-page application, we simply degrade directly to the HTML page path configured by the user. For a multi-page application, we added a routing profile that maps each route to its corresponding degraded HTML page.

An example route configuration:

```json
// ssrfallback.config.json
{
  "routes": [
    {
      "path": "^/track(/\\S)?",
      "file": "./track.html"
    },
    {
      "path": "^/album",
      "file": "./album.html"
    }
  ]
}
```
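Given that configuration, the route-to-file lookup could look like the following sketch; the matching rules of the real Loader may differ:

```javascript
// Example config mirroring ssrfallback.config.json above.
const fallbackConfig = {
  routes: [
    { path: "^/track(/\\S)?", file: "./track.html" },
    { path: "^/album", file: "./album.html" }
  ]
};

// Resolve a request pathname to its degraded HTML file, or null when no
// route matches. (An SPA would instead use its single configured path.)
function resolveFallbackFile(pathname, routes = fallbackConfig.routes) {
  const hit = routes.find(route => new RegExp(route.path).test(pathname));
  return hit ? hit.file : null;
}
```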

Caching

Because the Sidecar restarts together with the service instance, and the amount of degraded HTML that needs to be loaded is bounded, we can safely add an in-memory cache for reading the HTML files, reducing file I/O.

Performance test

| Project | Machine configuration | QPS cap without degradation | QPS cap with degradation | Effective SSR QPS |
| --- | --- | --- | --- | --- |
| Benchmark project * | 1c2g | 60 | 1500 | 36 |
| Next.js benchmark project | 1c2g | 25 | 1000+ | 14 |
| Business A refactored official website | 4c8g | 44 | 1000+ | 30 |
| Business B song return page | 1c2g | 34 | 1000+ | 24 |

Here, effective SSR QPS refers to the QPS that can still be served by SSR rendering after degradation is triggered, roughly measured at an applied load of about 500 QPS. As the applied load rises further, the effective SSR QPS gradually decreases until the service can no longer respond.

Performance analysis

Take the results of the benchmark project as an example.

With the adaptive rate-limiting component connected, the SSR service can adaptively switch to CSR mode when the CPU is overloaded, reducing CPU load and ensuring it can handle QPS surges. The maximum load reaches around 1500 QPS (only loads up to 500 QPS are shown here).

Observing the response latency, the overall latency stays low. Only during the stretch when adaptive rate limiting is first triggered is there a period of high latency due to CPU pressure, after which responses remain smooth and normal.

According to the statistics, the QPS of non-degraded SSR requests stays roughly stable at 30+. Even under high load, the instance's capacity is still fully utilized to serve a certain share of SSR rendering requests rather than degrading everything to CSR.

Effective SSR QPS decreases as the applied QPS gradually increases, and stabilizes once the applied QPS stabilizes.

Degraded QPS increases gradually as the applied QPS increases.

More than 99.6% of these requests are returned within 800ms.

Summary

To improve the stability of SSR services, we chose to implement the SSR degradation logic in the distributed gateway, providing a general SSR degradation capability on top of it. Practice and performance testing across multiple business scenarios show that, under sudden traffic increases, this scheme lets the SSR cluster dynamically degrade a proportion of requests to CSR based on current CPU resources. It greatly improves the carrying capacity of the SSR service cluster while keeping business transformation costs low and making adoption quick and convenient.