Meitu's long connection service has been running for three years, and in that time we have accumulated a good deal of practical experience in memory optimization. This article introduces the attempts our team has made along the road of memory optimization.

About the author: Wang Hongjia is a systems R&D engineer at Meitu, working mainly on communication and storage. He has participated in the development of the general long connection channel, Meitu Push, a distributed database (the open source Titan), a route distributor, and other projects, and has a strong interest in infrastructure technology and open source.

Introduction to the Meitu long connection service

As technology evolves at a rapid pace, the scenarios that call for long connections keep multiplying. Long connections are widely used not only in back-end services, such as database access and internal service state coordination, but are also the preferred solution in app scenarios such as message push, chat, and live-stream subtitles. Industry experts keep stressing the importance of long connection services on all kinds of occasions, the topic attracts ever more attention and discussion, and many companies have begun to build long connection services of their own.

 

At the beginning of 2016, Meitu began building its long connection service. Go was rising to prominence as a programming language at the same time; considering its rich libraries, complete toolchain, and simple, efficient concurrency model, we finally chose Go as the implementation language. For the communication protocol, we chose MQTT as the carrier of data exchange because it is lightweight, simple, and easy to implement. The overall architecture is described below.

 

The Meitu long connection service (internal project codename: Bifrost) has now run for three years. In those three years it has gone through business validation, service refactoring, storage upgrades, and more, and has grown from supporting 200,000+ connections per machine to supporting millions of connections per machine today. Most long connection services share a common problem: memory usage is too high. We often see a single node holding hundreds of thousands of long connections yet occupying more than ten gigabytes of memory. What can be done to bring the memory down?

 

This article explores memory optimization of the long connection service from several angles. First we introduce the service's architecture model and Go's memory management, to make clear the direction of the optimization and the key data we watch. We then focus on our specific optimization attempts and techniques, which we hope will serve as a useful reference.

Architectural model

A good architecture model not only gives a system good scalability but also reflects its service capability. Careful design of the data abstractions, module boundaries, and toolchain not only gives the software more flexible extensibility and higher service capability, but also improves the system's stability, robustness, and maintainability.

 

At the data abstraction level, we abstract pub/sub data sets for message distribution and processing. At the module level, the service is divided into three parts: internal communication (grpcsrv), external service (mqttsrv), and connection management (session). On the toolchain side, we built automated testing, system mocks, and load-testing tools. The architecture of the Meitu long connection service is as follows:

Figure 1. Architecture diagram

The architecture diagram shows that the service consists of seven modules: conf, grpcsrv, mqttsrv, session, pubsub, packet, and util. Their responsibilities are as follows:

 

  • conf: configuration manager, responsible for initializing the service configuration and validating basic fields.

  • grpcsrv: gRPC service for information exchange and coordination within the cluster.

  • mqttsrv: MQTT service that accepts client connections; a single process can serve MQTT on multiple ports.

  • session: session module that manages client state transitions and the sending and receiving of MQTT messages.

  • pubsub: publish/subscribe module that stores sessions by topic and publishes topic notifications to them.

  • packet: protocol parsing module, responsible for parsing MQTT packets.

  • util: utility package that bundles monitoring, logging, the gRPC client, and scheduling reports.

Go memory management

As everyone knows, Go is a garbage-collected language. Its memory management is modeled on TCMalloc: it uses a contiguous virtual address space, manages memory in pages (8 KB), and employs multi-level caches. Tiny objects (under 16 bytes) are allocated by the tiny allocator of the mcache attached to the current context P; objects up to 32 KB are allocated from the mcache span of the matching size class; objects larger than 32 KB are allocated directly from the mheap. If the span of the corresponding size class in the mcache has no free blocks, the mcache requests one from the mcentral; if the mcentral has none available either, it requests spans from the mheap and splits them; and if the mheap has no suitable span, it requests memory from the operating system.

 

Go also does a great job with memory statistics, providing fine-grained counters for memory allocation, GC, goroutine management, and more. During optimization, this data helps us find and analyze problems. Before diving into the optimizations, let's look at the parameters worth watching:

  • go_memstats_sys_bytes: total bytes of memory the process has obtained from the operating system, including the virtual address space reserved for the Go runtime's heap, stacks, and other internal data structures.

  • go_memstats_heap_inuse_bytes: bytes in in-use spans. This excludes bytes in idle spans, which may be returned to the operating system, reused for heap allocations, or reused as stack memory.

  • go_memstats_heap_idle_bytes: bytes in idle spans.

  • go_memstats_stack_sys_bytes: bytes of stack memory obtained from the OS, used for goroutine stacks.
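These metric names are the ones exported by the Prometheus Go collector; inside the process, the same counters are available through runtime.ReadMemStats. A minimal sketch of how the two map onto each other:

```go
package main

import (
	"fmt"
	"runtime"
)

func main() {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)

	// MemStats fields behind the go_memstats_* metrics listed above.
	fmt.Printf("go_memstats_sys_bytes:        %d\n", m.Sys)       // memory obtained from the OS
	fmt.Printf("go_memstats_heap_inuse_bytes: %d\n", m.HeapInuse) // bytes in in-use spans
	fmt.Printf("go_memstats_heap_idle_bytes:  %d\n", m.HeapIdle)  // bytes in idle spans
	fmt.Printf("go_memstats_stack_sys_bytes:  %d\n", m.StackSys)  // stack memory from the OS
}
```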

 

For memory monitoring purposes, Go divides the heap's virtual address space into spans: contiguous regions of memory of 8 KB or larger. A span can be in one of three states:

  1. idle: contains no objects or other data. The physical memory backing an idle span can be released back to the OS (the virtual address space is never released), or the span can be converted to in-use or stack state;

  2. inuse: contains at least one heap object and may have free space for more;

  3. stack: the span is used for goroutine stacks, which are not considered part of the heap. A span can switch between heap and stack use, but is never used for both at the same time.

 

In addition, there is a class of statistics for runtime-internal structures that are not allocated from heap memory (usually because they are part of the heap implementation). Unlike stack memory, any memory allocated to these structures is dedicated to them; these statistics are mainly useful for debugging runtime memory overhead.

 

Although Go offers a rich standard library, language-level concurrency, and a built-in runtime, the same logic consumes more memory in Go than it would in C/C++. Goroutine stacks grow automatically as they are used, but stack memory is reclaimed lazily, which increases memory consumption to some degree. The GC mechanism also brings additional memory overhead.

 

Go's garbage collection can be triggered in three ways: by timer (the runtime forces a collection if none has run for two minutes), by allocation (when the heap grows past a threshold controlled by GOGC), and manually (runtime.GC()). With a small amount of garbage Go works fine, but no matter which trigger fires, a service that creates large amounts of garbage under heavy user traffic will feel a noticeable lag during GC. These are exactly the problems our long connection service faced. The rest of this article describes, one by one, the concrete practices we used to reduce and solve them.
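For illustration, the three triggers look like this in code; this is a generic sketch of the standard runtime API, not code from Bifrost:

```go
package main

import (
	"runtime"
	"runtime/debug"
)

func main() {
	// Allocation trigger: collect when the heap grows by GOGC percent over
	// the live heap left by the previous cycle (default 100); equivalent to
	// setting the GOGC environment variable.
	debug.SetGCPercent(100)

	// Manual trigger: force a full collection right now.
	runtime.GC()

	// Timer trigger: the runtime itself forces a collection if none has run
	// for two minutes; there is no public knob for this one.
}
```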

 

Optimization path

Now that the architecture design, Go memory management, and basic monitoring are clear, you should have a general picture of the system, so let's show the results first. Table 1 below compares memory before and after optimization: with essentially the same number of online connections, process memory usage drops dramatically. Going by the table, stack memory usage fell by about 6.2 GB, heap memory usage by about 1.8 GB, and Other memory usage also dropped slightly. So how did we achieve this reduction? Next I will walk you through our team's exploration of memory optimization.

|                         | Before optimization | After optimization |
| ----------------------- | ------------------- | ------------------ |
| Online connections      | 225 K               | 225 K              |
| Process memory usage    | 13.4 G              | 4.7 G              |
| Heap in-use memory      | 5.2 G               | 3.4 G              |
| Stack requested memory  | 7.25 G              | 1.02 G             |
| Other requested memory  | 0.9 G               | 0.37 G             |

Table 1. Memory usage before and after optimization

Note: process memory usage ≈ virtual memory − memory not yet returned to the OS

 

Before optimizing, we picked a random online machine for memory analysis. Monitoring showed the node's process occupying 22.3 GB of virtual memory, with 5.2 GB of heap in use, 8.9 GB of heap memory not yet returned to the OS, 7.25 GB of stack memory, and about 0.9 GB of other memory, at 225 K connections (22.3 GB − 8.9 GB ≈ 13.4 GB, matching the process memory usage in Table 1).

 

A simple conversion shows the average memory per connection: heap ≈ 23 KB, stack ≈ 32 KB. Based on the monitoring data and the memory allocation principles above, the main consumers are goroutines, session state, and the pubsub module. We planned optimizations from three directions: the business logic, the program itself, and the network model.

Business optimization

As mentioned above, the session module handles the sending and receiving of messages. The implementation assumes the common scenario in which the business produces messages faster than the client consumes them; to relieve this, the design introduces a buffered message queue, which also helps with flow control toward the client.

 

The buffered message queue is implemented with a chan whose size was initialized to 128 by default, a rule-of-thumb value. In the current online push scenario, however, we found that messages are generally produced more slowly than they are consumed, so a buffer of 128 is clearly oversized; we shrank it to 16 to reduce memory allocation.
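Below is a simplified sketch of such a session send queue; the real Bifrost types are not public, so the names and the non-blocking flow-control policy here are illustrative:

```go
package session

// defaultQueueSize was 128 by rule of thumb; online push traffic showed
// consumers keeping up with producers, so 16 suffices and saves memory
// on every connection.
const defaultQueueSize = 16

// Message is a stand-in for the MQTT message type.
type Message struct {
	Topic   string
	Payload []byte
}

// Session owns a buffered channel sitting between the business producer
// and the per-connection writer goroutine.
type Session struct {
	sendQueue chan *Message
}

func New(queueSize int) *Session {
	if queueSize <= 0 {
		queueSize = defaultQueueSize
	}
	return &Session{sendQueue: make(chan *Message, queueSize)}
}

// Publish enqueues without blocking; a full queue signals a slow client,
// and the caller can drop, throttle, or disconnect as its policy dictates.
func (s *Session) Publish(m *Message) bool {
	select {
	case s.sendQueue <- m:
		return true
	default:
		return false
	}
}
```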

 

To group clients by topic, the design combines two data structures, a map and a list, trading space for time: O(1) deletion, O(1) insertion, and O(n) traversal over the client set. Deletion is done by marking: removed entries are recorded in an auxiliary slice, and the actual deletion happens only when a preset threshold is reached. While mark-based deletion improves traversal and insertion performance, it also introduces extra memory consumption.

 

You may wonder what scenarios call for this kind of complexity; in practice there are two:

  1. In real networks, clients may disconnect or reconnect at any time because of network instability, so insertion into and deletion from the set must stay within constant time.

  2. Message publishing traverses the set to notify clients one by one. As the number of users on a single topic grows, a hot topic's client set often takes too long to traverse, causing a message backlog, so traversal also has to be as fast as possible.

 

Benchmark analysis showed that performance is best when the mark-reclamation slice threshold is 1000, so that became the default configuration, used everywhere online. But in the push service's current usage, mark-based deletion with delayed reclamation brings little benefit, mainly because topics and clients are in a 1:1 relationship, i.e. there is no real client set; so we lowered the reclamation threshold to 2 to cut the wasted memory.
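The following sketch shows the idea of the map-plus-slice set with mark deletion and a configurable reclamation threshold; the names and details are hypothetical, not Bifrost's actual implementation:

```go
package pubsub

// Session stands in for the real connection session type.
type Session struct{ ID string }

type clientSet struct {
	index      map[string]int // client ID -> slot in items
	items      []*Session     // traversal order; nil slots are marked-deleted
	tombstones int            // marked-deleted entries awaiting reclamation
	threshold  int            // compact at this many marks (1000 default, 2 for 1:1 push)
}

func newClientSet(threshold int) *clientSet {
	return &clientSet{index: make(map[string]int), threshold: threshold}
}

// add is O(1): append and remember the slot.
func (s *clientSet) add(sess *Session) {
	s.index[sess.ID] = len(s.items)
	s.items = append(s.items, sess)
}

// remove is O(1): mark the slot instead of shifting the slice; actual
// reclamation is deferred until enough marks have accumulated.
func (s *clientSet) remove(id string) {
	slot, ok := s.index[id]
	if !ok {
		return
	}
	delete(s.index, id)
	s.items[slot] = nil
	s.tombstones++
	if s.tombstones >= s.threshold {
		s.compact()
	}
}

// compact rebuilds the slice without marked slots and reindexes the map.
func (s *clientSet) compact() {
	live := s.items[:0]
	for _, sess := range s.items {
		if sess != nil {
			s.index[sess.ID] = len(live)
			live = append(live, sess)
		}
	}
	for i := len(live); i < len(s.items); i++ {
		s.items[i] = nil // clear the tail so the GC can reclaim sessions
	}
	s.items = live
	s.tombstones = 0
}

// forEach is the O(n) traversal used for publishing, skipping marked slots.
func (s *clientSet) forEach(fn func(*Session)) {
	for _, sess := range s.items {
		if sess != nil {
			fn(sess)
		}
	}
}
```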

 

All of the above optimizations needed only simple configuration changes once the service was gradually rolled out; the conf module was designed for dynamic configuration precisely to lower development and maintenance costs. The monitoring comparison: even with slightly more online connections after the optimization than before it, heap memory usage dropped from 4.16 GB to 3.5 GB, a reduction of about 0.66 GB.

Golang code optimization

While implementing the architecture above, we found many variables shared between the session and mqttsrv modules, passed as pointers or value copies. Since the number of sessions is proportional to the number of clients, sharing this data consumed a great deal of memory and increased GC pressure as well. After weighing the problem, we borrowed the design of the standard library's context package and abstracted a context package of our own, responsible for passing information between modules and allocating that memory in one place. We also borrowed optimization techniques from others to reduce temporary allocations and improve runtime efficiency. The main angles were:

  • Use pool mode (e.g. sync.Pool) to manage memory wherever allocation is frequent

  • Combine small objects into a struct allocated in one shot, to reduce the number of allocations

  • Allocate cache space generously up front and reuse it where appropriate

  • When creating a slice or map with make, specify the capacity from the estimated size

  • Avoid requesting too many temporary objects on the call stack

  • Reduce conversions between []byte and string; prefer []byte for string manipulation
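Two of these patterns in isolation, as generic sketches rather than Bifrost's real code: a sync.Pool at a hot allocation site, and capacity hints for make:

```go
package main

import (
	"bytes"
	"sync"
)

// bufPool reuses scratch buffers at a frequently hit allocation site.
var bufPool = sync.Pool{
	New: func() interface{} { return new(bytes.Buffer) },
}

// encodePacket borrows a buffer from the pool instead of allocating one
// per packet; 0x30 is the fixed-header byte of an MQTT PUBLISH.
func encodePacket(payload []byte) []byte {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)

	buf.WriteByte(0x30)
	buf.Write(payload)

	out := make([]byte, buf.Len()) // copy out: the buffer returns to the pool
	copy(out, buf.Bytes())
	return out
}

// index sizes the map up front so it never rehashes while filling.
func index(keys []string) map[string]int {
	m := make(map[string]int, len(keys))
	for i, k := range keys {
		m[k] = i
	}
	return m
}

func main() {
	_ = encodePacket([]byte("hello"))
	_ = index([]string{"a", "b"})
}
```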

 

The system already had complete unit and integration tests, so after a week of rapid development and refactoring we compared the gray-release monitoring data: at the same number of connections, heap memory usage fell by 0.27 GB and stack memory by 3.81 GB. Why did the stack drop so much?

 

By recompiling the program with stackDebug set and tracing its execution, we found that most goroutine stacks had grown to 16 KB. Reducing temporary variable allocations and splitting up the processing logic of large functions effectively reduced the stack growth that triggers expansion (see the references for details). After the optimization, goroutine stacks shrank to 8 KB. Each connection needs two goroutines, one to read and one to write data, so this saves roughly 16 KB per connection, about 3.68 GB over 230 K connections.

Network model optimization

 

The classic Go network programming model is synchronous: start two goroutines per connection to handle reads and writes respectively. Goroutines are lightweight, unlike threads, but for a million connections this pattern starts at least two million goroutines, each with a stack between 2 KB and 8 KB, which is very resource-intensive. In most scenarios only a small fraction of connections have data to process, while most goroutines block on IO. So we borrowed the C approach and used the epoll model to dispatch events inside the program: only active connections get a goroutine to process business logic. With this idea we reworked the network processing flow.
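Below is a heavily simplified, Linux-only sketch of this event-driven model, built on golang.org/x/sys/unix; a real event loop also has to deal with edge triggering, write readiness, partial packets, and so on, none of which is shown here:

```go
package main

import (
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Plain-syscall listening socket on the MQTT port.
	lfd, err := unix.Socket(unix.AF_INET, unix.SOCK_STREAM, 0)
	if err != nil {
		log.Fatal(err)
	}
	if err := unix.Bind(lfd, &unix.SockaddrInet4{Port: 1883}); err != nil {
		log.Fatal(err)
	}
	if err := unix.Listen(lfd, 1024); err != nil {
		log.Fatal(err)
	}

	epfd, err := unix.EpollCreate1(0)
	if err != nil {
		log.Fatal(err)
	}
	// Watch the listener for incoming connections.
	unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, lfd,
		&unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(lfd)})

	events := make([]unix.EpollEvent, 128)
	for {
		n, err := unix.EpollWait(epfd, events, -1)
		if err != nil {
			continue // e.g. interrupted by a signal
		}
		for i := 0; i < n; i++ {
			fd := int(events[i].Fd)
			if fd == lfd {
				// New connection: register it, but start no goroutine yet.
				cfd, _, err := unix.Accept(lfd)
				if err != nil {
					continue
				}
				unix.EpollCtl(epfd, unix.EPOLL_CTL_ADD, cfd,
					&unix.EpollEvent{Events: unix.EPOLLIN, Fd: int32(cfd)})
				continue
			}
			// Data is ready: only now does this connection cost a goroutine.
			go handleReadable(fd)
		}
	}
}

// handleReadable consumes one readiness event; a real server would parse
// MQTT packets here and hand them to the session layer.
func handleReadable(fd int) {
	buf := make([]byte, 4096)
	n, err := unix.Read(fd, buf)
	if err != nil || n == 0 {
		unix.Close(fd) // error or peer closed the connection
		return
	}
	_ = buf[:n] // process the packet...
}
```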

 

After the modified network model passed testing, it went to gray release. The monitoring comparison: with 10 K more connections than before the optimization, heap memory usage fell by 0.33 GB and stack memory by 2.34 GB, a significant effect.

Conclusion

After the business optimizations, temporary-memory optimizations, and network model optimization, 210 K online long connections use about 5.1 GB of memory in practice. A simple load test that only establishes 1,000,000 connections, with no other operations, occupies about 10 GB. The memory optimization of the long connection service has achieved its first results, but this is only a small step for our team; there is more to do on network links, service capability, storage optimization, and other directions still to explore. If you have good ideas, you are welcome to share and discuss them with our team.

 

We currently plan to open source the Bifrost project, so stay tuned.

References

Introduction to the Go pprof tool: https://segmentfault.com/a/1190000016412013

Go memory statistics: https://golang.org/src/runtime/mstats.go

Profiling Go programs: https://blog.golang.org/profiling-go-programs

Allocation efficiency in high-performance Go services: https://segment.com/blog/allocation-efficiency-in-high-performance-go-services

Go stack optimization analysis: https://studygolang.com/articles/10597

