Background

The Jingdong (JD) event system lets operators edit, update and publish promotional activity pages online in real time, and serves those pages to users. Its timeliness and flexibility have made it extremely popular, and it has grown into one of JD's most important traffic entrances. In recent years the PV carried by the system has reached hundreds of millions. As JD's business develops rapidly, the pressure on the event system keeps growing, so a more efficient and stable architecture is urgently needed to support the business. This article focuses on the browsing performance of activity pages.

Difficulties in improving activity-page browsing performance:

1. Activity pages vary greatly from one to another; unlike product pages, they have no fixed pattern. The common parts that can be extracted across pages are limited, so reusability is poor.

2. The content of activity pages is diverse and the business varied, relying on a large number of external business interfaces, so the data cannot form a closed loop. The performance and stability of those external interfaces severely constrain the rendering speed and stability of activity pages.

After years of development practice on this system, we put forward the idea of making page rendering asynchronous with page browsing, and upgraded the architecture on that basis. Over recent months of operation, performance has improved significantly in every respect. Before introducing the new architecture, let's look at the architecture of the existing web system.

I. Evolution and current state of the Web architecture

1. Development phase

Take the evolution of JD's event system architecture as an example. No specific business logic is drawn here, only a simple description of the architecture:

 

2. Add caching where performance is consumed: here a Redis cache is added for some of the database query operations.

 

3. Full-page Redis cache: because there are many activity pages and rendering each one is expensive, the rendered activity content is cached as a whole page; on the next request, if a value exists in the cache, it is returned directly.
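The full-page cache step can be sketched as follows. This is a minimal illustration, not JD's actual code: a `ConcurrentHashMap` stands in for Redis, and `renderPage()` is a hypothetical placeholder for the real template rendering.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the full-page cache: the rendered HTML of an activity page is
// cached whole, keyed by its URL, so the expensive render runs only on a miss.
public class FullPageCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    public int renderCount = 0; // counts how often the expensive render ran

    // Expensive render, only invoked on a cache miss.
    String renderPage(String url) {
        renderCount++;
        return "<html><body>activity page for " + url + "</body></html>";
    }

    // Browse path: serve from cache if present, otherwise render and cache.
    public String browse(String url) {
        return cache.computeIfAbsent(url, this::renderPage);
    }

    public static void main(String[] args) {
        FullPageCache c = new FullPageCache();
        c.browse("/activity/618");
        c.browse("/activity/618"); // second hit is served from the cache
        System.out.println("renders: " + c.renderCount); // prints "renders: 1"
    }
}
```

Note that in this old design the render still happens on the request path whenever the cache is cold; removing that coupling is exactly what the new architecture does later.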

 

The above is a simple diagram of the evolution at the application-service level. To reduce pressure on the application servers, CDN and Nginx proxy_cache can be added in front of them to lower the back-to-origin rate.
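A proxy_cache layer of the kind described might look like the following sketch; the paths, zone sizes and TTLs are illustrative assumptions, not JD's real values.

```nginx
# Hypothetical proxy_cache layer in front of the application servers.
proxy_cache_path /var/cache/nginx levels=1:2 keys_zone=page_cache:100m
                 max_size=10g inactive=10m;

upstream browse_backend {
    server 127.0.0.1:8080;   # the browsing-service application server
}

server {
    listen 80;
    location / {
        proxy_cache page_cache;
        proxy_cache_key $scheme$host$request_uri;
        proxy_cache_valid 200 1m;                 # cache successful pages briefly
        proxy_cache_use_stale error timeout updating;
        proxy_pass http://browse_backend;
    }
}
```

With short TTLs like this, only the fraction of requests that miss both CDN and proxy_cache ever reaches the application server.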

 

 

 

 

 

4. Overall architecture (old)

Besides the "browsing service" covered in the previous three steps, the old architecture made two other big optimizations: an "interface service" and a "static service".

 

 

1. An access request first reaches the browsing service, which returns the whole page frame to the browser (passing through CDN, Nginx, Redis and other cache layers).

2. Real-time data (such as flash sales) and personalized data (such as login state and user location) are fetched by front-end calls to the interface service.

3. Static service: static resources are separated out, and all static JS and CSS are served by the static service.

Main point: the browsing service and the interface service are separated. The fixed parts of a page go through the browsing service; the real-time, changing and personalized parts use the front-end interface service.

Interface service: there are two types of interfaces, those that read the Redis cache directly and those that call external interfaces. Interfaces that read Redis directly can be optimized with Nginx + Lua (OpenResty), which is not detailed here. This post focuses on the architecture of the browsing service.

II. Performance comparison of the old and new architectures

Before discussing the new architecture, let's look at the performance comparison between the old and new architectures.

1. Performance of the old browsing service:

Thanks to the CDN and Nginx caches, only about 20%-40% of traffic reaches the application servers. The performance comparison here applies only to that back-to-origin traffic.

On November 11, 2015, the TP99 of the browse method was as follows (physical machines):

 

TP99 was about 1000ms with a large jitter range; memory usage was about 70% and CPU usage about 45%.

That 1000ms is the cost when no cache is hit, and it carries a risk of blocking or even the service hanging.

2. Performance of the browsing service under the new architecture

The 2016 618 sale was supported by the new architecture; browse TP99 was as follows (split into app-side and PC-side activities):

Mobile activity browsing TP99 was stable at 8ms, and PC activity browsing TP99 was stable at around 15ms: almost a straight line all day, with no performance jitter.

The CPU performance of the servers (Docker) supporting the new architecture was as follows:

CPU consumption stayed flat at 1%, with almost no jitter.

Comparison: under the new architecture, TP99 drops from 1000ms to 15ms and CPU consumption from 45% to 1%. The performance improvement is substantial.

Why?

Here we unveil the new architecture.

III. Exploring the new architecture

1. Making page browsing asynchronous with page rendering

Looking at the old browsing-service architecture, 20%-40% of page requests re-render the page, which requires recalculation, queries and object creation, increasing CPU and memory consumption and degrading TP99.

If every request were guaranteed to hit the full-page Redis cache, none of these performance problems would exist.

That is: make page browsing asynchronous with page rendering.
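The decoupling can be sketched like this. This is a minimal illustration under stated assumptions, not JD's code: a map stands in for Redis, and `render()` is a hypothetical placeholder. The key point is that the browse path only ever reads the cache; rendering runs on a separate path triggered by content changes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of "page browsing asynchronous with page rendering": the browse path
// never renders, so its per-request cost is a single cache lookup.
public class AsyncRendering {
    private final Map<String, String> pageCache = new ConcurrentHashMap<>();

    // Render path (Engine side): runs when content changes, not per request.
    public void onContentChange(String url, String newContent) {
        String html = render(url, newContent);
        pageCache.put(url, html); // push the whole rendered page into the cache
    }

    private String render(String url, String content) {
        return "<html>" + content + "</html>";
    }

    // Browse path (View side): a pure cache read; null means "show error page".
    public String browse(String url) {
        return pageCache.get(url);
    }
}
```

Because the two paths share nothing but the cache, a slow or failing external interface can delay re-rendering but can never slow down browsing.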

2. Problems after the direct transformation, and their solutions:

Ideally, rendering is triggered manually when page data changes (the page publishes new content), and automatically by listening on MQ for external data changes.

However, some external interfaces do not support MQ or cannot use it, for example when the name of an item shown on an activity page changes.

To solve this, the View project sends a re-render request to the Engine at a specified interval, and the latest content is put into Redis so it is available on the next request. Because there are many activities and we cannot know which ones are being accessed, a Timer is not suitable; instead an extra cache key is added, and the processing logic is as follows:

The advantage is that only activities that are actually being accessed are periodically re-rendered.
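The expiring-key trigger described above can be sketched as follows. This is an illustration, not the real implementation: the map with timestamps stands in for Redis key TTLs, and `requestRender()` is a hypothetical stand-in for the async call to the Engine.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the trigger: on each page view the View checks a second key (KEY2)
// with a TTL; if it has expired, it fires one re-render request to the Engine
// and resets KEY2. Only pages that are actually visited get re-rendered.
public class RefreshTrigger {
    private final Map<String, Long> key2Expiry = new ConcurrentHashMap<>();
    private final long ttlMillis;
    public int renderRequests = 0;

    public RefreshTrigger(long ttlMillis) { this.ttlMillis = ttlMillis; }

    // Hypothetical stand-in for the asynchronous call to the Engine project.
    private void requestRender(String url) { renderRequests++; }

    // Called on every page view; the clock is injected for clarity.
    public void onPageView(String url, long nowMillis) {
        Long expiry = key2Expiry.get(url);
        if (expiry == null || expiry <= nowMillis) {    // KEY2 missing or expired
            key2Expiry.put(url, nowMillis + ttlMillis); // reset the TTL
            requestRender(url);                         // ask Engine to re-render
        }
    }
}
```

With real Redis this would simply be a `SET key2 1 EX <interval>` on miss; an unvisited page's KEY2 expires and is never refreshed, so it stops being re-rendered.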

IV. Explanation of the new architecture

The overall architecture (business logic omitted):

View Project responsibilities:

A. Retrieve the static HTML from the cache or from disk; if neither has it, an error page is displayed. (Plain file-system access performs poorly, on the order of 100ms+, so it is not used here.)

B. Decide, based on whether cache KEY2 has expired, whether to send another render request to the Engine. (This is unnecessary if all of your project's external interfaces support MQ.)

 

Engine project responsibilities: render activity pages and put the results on the hard disk and in Redis.

 

Publish project and MQ responsibilities: when a page changes, notify the Engine to re-render. The specific page logic is not explained here.

 

 

The job of the Engine project is to re-render a page when its content changes and push the whole page to Redis and to the hard drive.
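The Engine's push step can be sketched like this. This is an illustration under stated assumptions: a map stands in for Redis, a temp directory for the View servers' disks, and the file naming is made up for the example.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the Engine's dual write: after rendering, the whole page goes both
// into Redis and onto disk, so either store can later serve the page.
public class EnginePush {
    private final Map<String, String> redis = new ConcurrentHashMap<>();
    private final Path diskRoot;

    public EnginePush(Path diskRoot) { this.diskRoot = diskRoot; }

    public void push(String pageId, String html) throws IOException {
        redis.put(pageId, html);                     // full page into Redis
        Path f = diskRoot.resolve(pageId + ".html"); // and onto the hard disk
        Files.writeString(f, html);
    }

    public String readRedis(String pageId) { return redis.get(pageId); }

    public String readDisk(String pageId) throws IOException {
        return Files.readString(diskRoot.resolve(pageId + ".html"));
    }
}
```

Writing both stores on every render is what makes the later disk-primary/Redis-backup read strategy possible.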

 

2. View project architecture (Redis version)

The View project’s job is to retrieve page content from Redis based on links.

3. View project architecture (hard-disk version)

 

Comparison of the two versions:

A. Redis version

Advantages: simple access and good performance, with no performance jitter even with a large number of pages. A single Docker reaches 700 TPS.

Disadvantages: heavy reliance on the JD Redis service; if the Redis service has a problem, no pages can be accessed.

B. Hard-disk version

Advantages: relies on no external service; as long as the application server is up and the network is normal, it can serve pages stably.

Excellent performance with a small number of pages: a single Docker reaches 2000 TPS.

Disadvantages: with a large amount of page data (all the system's activity pages total around xx gigabytes), disk I/O consumption rises (Java I/O is used here; with Nginx + Lua, I/O consumption should be controllable within 10%).

Solutions:

A. Use a URL hash for access and storage, so all pages are evenly distributed across the application servers.

B. Use Nginx + Lua, replacing Java I/O with Nginx's asynchronous I/O.
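The URL-hash routing in solution A can be sketched as follows. This is illustrative only: a plain modulo hash is shown, and JD's real rule is not published in this article. The same function must be used both by the front proxy when routing a request and by the Engine when choosing which View server's disk to push to.

```java
// Sketch of URL-hash routing: hashing the page URL picks exactly one View
// server, so each page lives on one server's disk and the data is spread evenly.
public class UrlHashRouter {
    private final int serverCount;

    public UrlHashRouter(int serverCount) { this.serverCount = serverCount; }

    public int serverFor(String url) {
        // Math.floorMod keeps the index non-negative for any hashCode value.
        return Math.floorMod(url.hashCode(), serverCount);
    }
}
```

Because the mapping is deterministic, request routing and Engine pushes agree without any coordination; the trade-off of simple modulo hashing is that changing `serverCount` remaps most pages, which consistent hashing would avoid.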

4. OpenResty + hard-disk version

Serving applications with Nginx + Lua (OpenResty) gives high concurrency, high performance and high stability, and has become increasingly popular. As explained above, the View project contains no business logic, so it can easily be implemented in Lua, fetching pages from Redis or from the hard disk as a more efficient web service.


Testing showed that the View project reads the local hard disk faster than it reads Redis (for the same page, Redis takes 15ms versus 8ms for the hard disk). So the hard disk is used as primary with Redis as backup: Redis is read only when the hard disk cannot be read.
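The disk-first read order can be sketched like this; it is a minimal illustration (map standing in for Redis, made-up file naming), not the real View code.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the chosen read order: try the local disk first (measured faster,
// ~8ms vs ~15ms), and fall back to Redis only when the disk copy is unreadable.
public class DiskFirstReader {
    private final Path diskRoot;
    private final Map<String, String> redis = new ConcurrentHashMap<>();

    public DiskFirstReader(Path diskRoot) { this.diskRoot = diskRoot; }

    public void putRedis(String pageId, String html) { redis.put(pageId, html); }

    public String read(String pageId) {
        try {
            return Files.readString(diskRoot.resolve(pageId + ".html"));
        } catch (IOException e) {
            return redis.get(pageId); // disk miss or failure: use the Redis copy
        }
    }
}
```

Because the Engine pushes every page to both stores, the fallback returns the same content; it only costs the extra ~7ms when the disk copy is missing.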

 

Here, the URL hash on the front-end machines is our own logic, and the Engine project uses the same rule when pushing to the View servers' disks. The specific logic is not detailed here; we may share it separately later.

Advantages: all the advantages of the hard-disk version, while removing Tomcat and using Nginx's high-concurrency I/O handling directly. Performance and stability reach their best.

Disadvantages: 1. A hard-disk failure affects access. 2. Method monitoring and log printing need to be rewritten in Lua scripts.

V. Summary

The Redis, hard-disk and OpenResty + hard-disk versions are all based on making page browsing asynchronous with page rendering.

 

Advantages:

1. All business logic is stripped out into the Engine project, so the new View project theoretically never needs a new release.

2. Disaster recovery is diversified (Redis, hard disk, file system) and simpler: if an external interface or service has problems, rendering in the Engine project is cut off, and Redis and the hard disk simply stop being updated.

3. The new View project is completely isolated from business logic and relies on no external interfaces or services. During a big promotion, even if external interfaces develop problems or external services go down, normal access through the View project is unaffected.

4. Performance is improved roughly a hundredfold, from 1000ms to around 10ms; see the performance screenshots above.

5. Stability: as long as the View servers' network is normal, pages can in theory still be served even if everything upstream hangs.

6. Server resources are greatly saved: with this architecture, 4 + 20 + 30 = 54 Dockers are enough to support 1 billion PV (4 Nginx proxy_cache, 20 Views, 30 Engines).
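A back-of-envelope check of that claim, using only numbers from this article (1 billion PV; ~2,000 TPS per View Docker for the hard-disk version; 20 View Dockers) and one labeled assumption: that the PV figure is per day, which the article does not state.

```java
// Capacity sanity check: average request rate for 1 billion PV/day versus the
// aggregate capacity of 20 View Dockers at ~2,000 TPS each (both figures from
// the article; the per-day assumption is ours).
public class CapacityCheck {
    public static void main(String[] args) {
        long pvPerDay = 1_000_000_000L;
        long secondsPerDay = 86_400L;
        long avgQps = pvPerDay / secondsPerDay; // ≈ 11,574 requests/s on average
        long viewCapacity = 20 * 2_000L;        // 20 Dockers × 2,000 TPS = 40,000
        System.out.println(avgQps + " avg QPS vs " + viewCapacity + " TPS capacity");
    }
}
```

Even before the CDN and proxy_cache absorb their 60%-80% share, average load is well under the View tier's capacity, which is consistent with the claim, though peak traffic during a sale would eat into that headroom.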

VI. Concluding remarks

I have been a developer for nearly 10 years, all along absorbing resources from the web. A while ago, at the request of Zhang Kaitao, I put together this simple write-up of the event system's new architecture to share with you, hoping it brings some help. As this is my first online share, some aspects are inevitably not fully considered; I will slowly share more of my own experience later, and we can grow together. Finally, a little chicken soup for the soul…