As a high-performance network application framework, Netty also implements a high-performance memory management mechanism.
Studying its implementation principles, algorithms, and concurrency design helps us write more elegant, higher-performance code, and lets us locate and troubleshoot memory problems in Netty more efficiently.
This article describes the memory management mechanism as of Netty 4.1.43.Final.
ByteBuf classification
Netty uses ByteBuf objects as the data containers for I/O operations, and its memory management revolves around the efficient allocation and release of ByteBuf objects.
When discussing ByteBuf object management, there are two main dimensions of classification:
- Pooled and Unpooled
With Unpooled memory, each allocation calls the system API directly to request from the operating system exactly the amount of memory the ByteBuf needs, and release likewise goes back through a system call. With Pooled memory, allocation carves a piece out of a large pre-allocated block of memory and wraps it as a ByteBuf; release returns the memory to the pool.
Netty 4 uses Pooled by default. This can be changed with the system property -Dio.netty.allocator.type=unpooled (or pooled). A short allocation sketch covering both dimensions follows the note below.
- Heap and Direct
Heap means the memory backing the ByteBuf is allocated inside the JVM heap and is managed by the GC. Direct means the memory is allocated outside the JVM heap and is not managed by the GC; the underlying implementation is based on Java NIO's DirectByteBuffer.
Note: The advantage of off-heap memory is that Java I/O operations must pass in the start address and length of the buffer holding the data. Because of GC, objects in the heap are often relocated, which would invalidate the address and break the system call. To avoid this, I/O system calls based on heap memory first copy the data out of the heap. Performing I/O directly on off-heap memory saves that copying cost.
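The following minimal sketch shows the four combinations of these two dimensions using Netty's public allocator API (the buffer sizes here are arbitrary):

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import io.netty.buffer.UnpooledByteBufAllocator;

public class ByteBufKinds {
    public static void main(String[] args) {
        // Pooled allocator: ByteBufs are carved out of pre-allocated chunks
        PooledByteBufAllocator pooled = PooledByteBufAllocator.DEFAULT;
        ByteBuf pooledHeap   = pooled.heapBuffer(256);   // Pooled + Heap
        ByteBuf pooledDirect = pooled.directBuffer(256); // Pooled + Direct

        // Unpooled allocator: every buffer is backed by a fresh allocation
        UnpooledByteBufAllocator unpooled = UnpooledByteBufAllocator.DEFAULT;
        ByteBuf unpooledHeap   = unpooled.heapBuffer(256);   // Unpooled + Heap
        ByteBuf unpooledDirect = unpooled.directBuffer(256); // Unpooled + Direct

        // Pooled buffers must be released so their memory returns to the pool
        pooledHeap.release();
        pooledDirect.release();
        unpooledHeap.release();
        unpooledDirect.release();
    }
}
```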
Pooled object management
For unpooled objects (Unpooled), allocation and release simply call through to the underlying interfaces. Pooled object management (Pooled) is far more complex and is worth examining through the following questions:
- How does the memory pool's management algorithm allocate memory efficiently and reduce fragmentation?
- How does the memory pool scale elastically as memory is continuously requested and released under high load?
- The memory pool is global data; how is lock contention reduced in a multi-threaded environment?
1 Algorithm Design
1.1 Overall Principles
Netty first requests a large block of contiguous memory from the system, called a chunk; the default chunkSize is 16MB. For fine-grained management, each chunk is further split into pages; by default a chunk contains 2048 pages (pageSize = 8KB).
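These defaults are read by PooledByteBufAllocator and can be tuned via system properties; a quick sketch of the relationship between them:

```java
public class PoolDefaults {
    public static void main(String[] args) {
        // Defaults read by PooledByteBufAllocator (overridable via the system
        // properties -Dio.netty.allocator.pageSize and -Dio.netty.allocator.maxOrder)
        int pageSize = 8192;                  // 8KB per page
        int maxOrder = 11;                    // depth of the management tree (see 1.2)
        int chunkSize = pageSize << maxOrder; // 8KB * 2^11 = 16MB per chunk
        System.out.println(chunkSize);        // 16777216
    }
}
```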
Different sizes of pooled memory objects use different allocation strategies. The following describes the allocation principle for pooled objects whose size falls in **(pageSize/2, chunkSize)**; the allocation of larger and smaller objects is covered later. Within a chunk, Netty manages the pages in groups at several levels of granularity:
- Layer 1: group size = 1 * pageSize, 2048 groups in total
- Layer 2: group size = 2 * pageSize, 1024 groups in total
- Layer 3: group size = 4 * pageSize, 512 groups in total; and so on, doubling the group size at each layer
When an allocation is requested, the requested size is rounded up to the nearest group size, and the corresponding layer is searched from left to right for a completely free group. For example, a request for 1.5 * pageSize is rounded up to 2 * pageSize, and a completely free group in that layer is then allocated, as shown below:
Once a 2 * pageSize group is allocated, it is marked fully used (red), and the coarser-grained groups above it are marked partially used (yellow) to guide the next allocation.
1.2 Algorithm Structure
Netty implements the multi-level, multi-granularity group management described above with a balanced binary tree.
To create a ByteBuf of a given size, the algorithm must find the first block in the PoolChunk large enough to hold the requested memory.
To find such a block quickly, the algorithm builds a complete balanced binary tree stored in a byte array (memoryMap). The layers of the tree correspond exactly to the multi-level grouping of the chunk by granularity introduced above:
Counting depth from 0, the number of nodes and the memory size per node at each layer are:

```
depth = 0:        1 node,           nodeSize = chunkSize
depth = 1:        2 nodes,          nodeSize = chunkSize/2
...
depth = d:        2^d nodes,        nodeSize = chunkSize/2^d
...
depth = maxOrder: 2^maxOrder nodes, nodeSize = chunkSize/2^maxOrder = pageSize
```
The maximum depth of the tree is maxOrder (default 11). With this tree, lookup within a chunk becomes:
To allocate a block of chunkSize/2^k, search from left to right at depth k of the balanced tree for the first free node.
The nodes are stored heap-style in the array: the first node at depth n is stored at memoryMap[2^n], the second at memoryMap[2^n + 1], and so on (the figure below shows chunkSize/2 allocated).
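This heap-style layout makes tree navigation pure index arithmetic; a small illustrative sketch (the helper class below is mine, not Netty's):

```java
public class MemoryMapIndex {
    // Navigating a heap-style memoryMap array by index arithmetic
    static int leftChild(int id)  { return id << 1; }
    static int rightChild(int id) { return (id << 1) ^ 1; }
    static int parent(int id)     { return id >>> 1; }
    // depth_of_id = floor(log2(id)): node 1 is at depth 0, nodes 2-3 at depth 1, ...
    static int depthOf(int id)    { return 31 - Integer.numberOfLeadingZeros(id); }
}
```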
memoryMap[id] records the usage of node id; the larger the value, the less memory is available under that node:
- memoryMap[id] = depth_of_id: the initial state of node id; depth_of_id is the depth of node id in the tree, and the node is completely free
- memoryMap[id] = maxOrder + 1: node id is fully used; its memory is completely allocated and no child node is free
- depth_of_id < memoryMap[id] < maxOrder + 1: node id is partially used; a value of x means that among id's descendants the first free node is at depth x, and no free node exists at any depth in [depth_of_id, x)
1.3 Applying for and Releasing Memory
The PoolArena#normalizeCapacity() method rounds the requested size up to the nearest power of 2; for example, 8000 bytes is normalized to 8192 bytes (chunkSize/2^11), and 8193 bytes to 16384 bytes (chunkSize/2^10).
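A sketch of this rounding, in the spirit of the bit-twiddling inside PoolArena#normalizeCapacity (the class and method names below are mine):

```java
public class PowerOfTwoNormalize {
    // Round a requested capacity up to the next power of two, e.g.
    // 8000 -> 8192, 8193 -> 16384. Assumes 0 < reqCapacity <= 2^30.
    static int normalizeToPowerOfTwo(int reqCapacity) {
        int n = reqCapacity - 1; // so exact powers of two map to themselves
        n |= n >>> 1;            // smear the highest set bit downward...
        n |= n >>> 2;
        n |= n >>> 4;
        n |= n >>> 8;
        n |= n >>> 16;
        return n + 1;            // ...then add 1 to reach the next power of two
    }
}
```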
In the PoolChunk#allocateRun() method, allocating chunkSize/2^d of memory means finding the first free node at depth d. The algorithm starts from the root node (id = 1) and proceeds as follows:
- Step 1: Check whether memoryMap[id] > d. If yes, the chunk cannot allocate the requested memory and the search ends.
- Step 2: Check whether memoryMap[id] == d and depth_of_id == d. If yes, the current node is a free block at depth d and the search ends: update memoryMap[id] = maxOrder + 1 to mark the node as used, then walk all ancestors of the node, updating each one's value to the minimum of its left and right children's values. Otherwise, go to Step 3.
- Step 3: At this point memoryMap[id] <= d and depth_of_id < d, so the free node lies among the current node's descendants. Check the left child: if memoryMap[2 * id] <= d (the left child can satisfy the allocation), descend into the left child; otherwise descend into the right child. Then repeat Steps 1 and 2.
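Put together, these steps amount to the following search, a simplified sketch modeled on PoolChunk#allocateNode (field handling and bookkeeping are abbreviated; this is not Netty's actual code):

```java
public class BuddySearch {
    static final int MAX_ORDER = 11;
    static final byte UNUSABLE = MAX_ORDER + 1;

    // Find and mark the first free node at depth d, returning its id (or -1).
    static int allocateNode(byte[] memoryMap, int d) {
        int id = 1;                           // start at the root
        if (memoryMap[id] > d) {
            return -1;                        // Step 1: chunk cannot satisfy the request
        }
        while (depthOf(id) < d) {             // Step 3: descend until depth d
            id <<= 1;                         // try the left child first
            if (memoryMap[id] > d) {
                id ^= 1;                      // left child cannot allocate; take the right
            }
        }
        // Step 2: node id is a free block at depth d
        memoryMap[id] = UNUSABLE;             // mark it fully used
        for (int p = id >>> 1; p >= 1; p >>>= 1) {
            // each ancestor holds the minimum of its children's values
            memoryMap[p] = (byte) Math.min(memoryMap[p << 1], memoryMap[(p << 1) ^ 1]);
        }
        return id;
    }

    static int depthOf(int id) {
        return 31 - Integer.numberOfLeadingZeros(id);
    }
}
```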
Refer to the following example of applying for a chunkSize/2 allocation:
Note: In the figure, the children of the node with index = 2 still have memoryMap[id] = depth_of_id even though their memory has in fact been allocated. Because the algorithm traverses top-down, after allocating a node it only updates the values of the node's ancestors, never its descendants.
When memory is released, using the id returned at allocation time, memoryMap[id] is reset to depth_of_id, and each ancestor of the node is updated to the minimum of its left and right children's values.
1.4 Memory Management for Huge Objects
For huge objects larger than chunkSize, Netty adopts a non-pooled strategy: each allocation request creates a special non-pooled PoolChunk object just for that object, with its internal memoryMap set to null. When the object's memory is freed, the whole chunk is freed. The allocation logic is in the PoolArena#allocateHuge() method and the release logic in the PoolArena#destroyChunk() method.
1.5 Small Object Memory Management
When the requested size reqCapacity <= 496, normalization rounds up to the nearest multiple of 16: for example, 15 becomes 16, 40 becomes 48, and 490 becomes 496. Small objects whose normalizedCapacity is smaller than pageSize fall into two classes:
- Tiny objects: multiples of 16 after normalization, i.e. 16, 32, 48, ..., 496, for 31 size classes in total
- Small objects: powers of 2 after normalization, i.e. 512, 1024, 2048, and 4096, for 4 size classes
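A sketch of the tiny rounding rule (the class and method names are mine, mirroring the tiny branch of PoolArena#normalizeCapacity):

```java
public class TinyNormalize {
    // Round a tiny request up to the next multiple of 16:
    // 15 -> 16, 40 -> 48, 490 -> 496 (assumes 0 < reqCapacity <= 496)
    static int normalizeTiny(int reqCapacity) {
        if ((reqCapacity & 15) == 0) {
            return reqCapacity;          // already a multiple of 16
        }
        return (reqCapacity & ~15) + 16; // clear the low 4 bits, then add 16
    }
}
```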
Allocating a whole page to such a small object would be wasteful, and marking it in the chunk's balanced tree would cost more space than the object itself. Netty's approach is to first request a free page from the PoolChunk and then split that page into small blocks of one equal size for storage.
Each such page is managed by a PoolSubpage, which records the block size (elemSize), the number of available blocks (numAvail), and the usage of each block: it holds a long[] bitmap whose bits (0 or 1) record whether each block is in use.
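A minimal sketch of this style of bitmap bookkeeping (a hypothetical helper, not Netty's actual PoolSubpage code):

```java
public class BitmapAlloc {
    // Find the first free element, mark it used, and return its index (-1 if full).
    static int allocate(long[] bitmap, int maxNumElems) {
        for (int i = 0; i < bitmap.length; i++) {
            long bits = bitmap[i];
            if (~bits != 0) {                         // this word has at least one 0 bit
                int j = Long.numberOfTrailingZeros(~bits);
                int bitIndex = (i << 6) + j;          // 64 bits per long
                if (bitIndex < maxNumElems) {
                    bitmap[i] |= 1L << j;             // mark the element as used
                    return bitIndex;
                }
            }
        }
        return -1;
    }

    // Release the element at bitIndex by clearing its bit.
    static void free(long[] bitmap, int bitIndex) {
        bitmap[bitIndex >>> 6] &= ~(1L << (bitIndex & 63));
    }
}
```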
Note: Netty normalizes pooled memory requests upward; for example, 1025 bytes becomes 2048 bytes and 8193 bytes becomes 16384 bytes. This can be understood as a trade-off: normalization standardizes the sizes of pooled allocations, which greatly simplifies requesting, reusing, and recycling memory and improves efficiency.
2 Elastic Scaling
The algorithm above describes how Netty allocates and releases memory within a chunk, but a single chunk has limited capacity. How are multiple chunks managed to build an elastic memory pool?
2.1 PoolChunk management
To overcome the limited capacity of a single PoolChunk, Netty links multiple PoolChunks into a linked list and holds the head of the list in a PoolChunkList object.
Putting all PoolChunks into one list would make traversal, search, and management inefficient, so Netty designed the PoolArena object to manage multiple PoolChunkLists as well as the PoolSubpages, providing thread-safe memory allocation and release services to the outside.
A PoolArena holds six PoolChunkLists, each covering a different range of PoolChunk usage:
```java
// Holds PoolChunks with usage in (0%, 25%)
private final PoolChunkList<T> qInit;
// (1%, 50%)
private final PoolChunkList<T> q000;
// (25%, 75%)
private final PoolChunkList<T> q025;
// (50%, 100%)
private final PoolChunkList<T> q050;
// (75%, 100%)
private final PoolChunkList<T> q075;
// 100%
private final PoolChunkList<T> q100;
```
The six PoolChunkList objects form a doubly linked list. When memory is allocated from or released back to a PoolChunk, its usage changes; if the usage moves outside the rated range of its current PoolChunkList, the PoolChunk is moved along the doubly linked list until it reaches the PoolChunkList whose range fits, and is inserted at its head. Similarly, when a new PoolChunk is created and first allocated from, it is added to the appropriate PoolChunkList by the same logic.
Allocation of a normCapacity (size range [pageSize, chunkSize]) proceeds as follows:
- Visit q050, q025, q000, qInit, and q075 in that order, checking through each PoolChunkList whether any of its PoolChunks can satisfy the allocation
- If one of these five PoolChunkLists allocates successfully, the PoolChunk's usage changes and it is moved to the appropriate PoolChunkList if needed
- Otherwise, create a new PoolChunk, allocate from it, and add it to the appropriate PoolChunkList (PoolChunkList expansion)
**Note**: the access order q050 -> q025 -> q000 -> qInit -> q075 is deliberate: preferring partially used PoolChunks raises overall PoolChunk memory utilization and reduces how many PoolChunks must be traversed within each PoolChunkList.
When a PoolChunk's usage drops to 0, it is removed from its PoolChunkList and its memory is released, so that memory cached at peak load does not linger in the pool (PoolChunkList shrinking).
The rated usage ranges of the PoolChunkLists deliberately overlap. This is designed so that a PoolChunk whose usage hovers near a threshold does not bounce back and forth between two PoolChunkLists.
2.2 PoolSubpage management
A PoolArena holds two arrays of PoolSubpages, one for tiny objects and one for small objects:
```java
// Array length 32; indices 1..31 are used, one per tiny size class
private final PoolSubpage<T>[] tinySubpagePools;
// Array length 4, one per small size class
private final PoolSubpage<T>[] smallSubpagePools;
```
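A sketch of how a normalized size maps to an index in these arrays (mirroring the idea of PoolArena's tinyIdx/smallIdx helpers; the standalone class below is mine):

```java
public class SubpageIndex {
    // Tiny sizes are multiples of 16, so dividing by 16 yields indices 1..31
    static int tinyIdx(int elemSize) {
        return elemSize >>> 4;            // 16 -> 1, 32 -> 2, ..., 496 -> 31
    }

    // Small sizes 512/1024/2048/4096 map to indices 0..3
    static int smallIdx(int elemSize) {
        int idx = 0;
        for (int i = elemSize >>> 10; i != 0; i >>>= 1) {
            idx++;                        // 512 -> 0, 1024 -> 1, 2048 -> 2, 4096 -> 3
        }
        return idx;
    }
}
```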
PoolSubpages of the same block size (elemSize) form a linked list, and the heads of these lists are stored in the tinySubpagePools or smallSubpagePools array, as shown below:
When a small object is allocated, the normalized size is mapped to an index in tinySubpagePools or smallSubpagePools, and a PoolSubpage is taken from the corresponding linked list. If the list has no nodes, a new page is requested from a PoolChunk, wrapped as a new PoolSubpage, and added to the list.
The PoolSubpages kept in these lists are the partially allocated ones. When a PoolSubpage becomes fully allocated, or is fully released, it is removed from the list to avoid useless list nodes; when a fully allocated PoolSubpage later has part of its memory released, it is added back to the list.
At this point, PoolArena forms an elastic memory pool that expands and shrinks as shown in the following figure:
3 Concurrent Design
Memory allocation inevitably happens under multi-threaded concurrency, and neither the PoolChunk balanced-tree marking nor the PoolSubpage bitmap marking is thread-safe. How is concurrency performance improved while preserving thread safety?
First, to reduce contention between threads, Netty creates multiple PoolArenas in advance (by default 2 * the number of CPU cores). When a thread first requests a pooled allocation, it picks the PoolArena held by the fewest threads and stores it in the thread-local variable PoolThreadCache, binding the thread to that PoolArena (see the PoolThreadLocalCache#initialValue() method).
Java implements thread-local variables via a ThreadLocalMap member variable on each Thread; the map's keys are ThreadLocal instances and its values are the thread-local values. When ThreadLocal#get() is called, Thread.currentThread() is used to read the value out of the current thread's ThreadLocalMap.
Netty designed a higher-performance alternative to ThreadLocal: FastThreadLocal, used together with the FastThreadLocalThread class. The basic idea is to store the thread-local values in an array inside the thread instead of in a hash-based ThreadLocalMap: each FastThreadLocal holds a globally, atomically incremented int index into that array, so lookups become direct array accesses.
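A minimal usage sketch; the speedup applies when the code runs on a FastThreadLocalThread (Netty's event-loop threads already are):

```java
import io.netty.util.concurrent.FastThreadLocal;
import io.netty.util.concurrent.FastThreadLocalThread;

public class FastThreadLocalDemo {
    // Each FastThreadLocal is assigned a fixed array slot when constructed
    private static final FastThreadLocal<StringBuilder> BUFFER =
            new FastThreadLocal<StringBuilder>() {
                @Override
                protected StringBuilder initialValue() {
                    return new StringBuilder();
                }
            };

    public static void main(String[] args) throws InterruptedException {
        // On a FastThreadLocalThread, get() is a plain array read, not a map lookup
        Thread t = new FastThreadLocalThread(() -> {
            BUFFER.get().append("hello");
            System.out.println(BUFFER.get());
        });
        t.start();
        t.join();
    }
}
```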
In addition, Netty designed a caching mechanism to further improve concurrency performance. When a pooled object is released, PoolArena does not free the memory immediately; instead, the PoolChunk and the offset within it (the handle) for that memory are stored in a fixed-size cache queue inside PoolThreadCache (if the queue is full, the memory is freed immediately). On allocation, PoolArena first checks whether the PoolThreadCache queue holds suitable cached memory and, if so, hands it out directly, improving allocation efficiency.
Conclusion
Netty's pooled memory management design draws on Facebook's jemalloc and also resembles the buddy and slab algorithms used by the Linux memory allocator. The design prototypes of many distributed systems and frameworks can be found in operating system design, so learning the underlying principles is well worth it.
The next article will cover troubleshooting Netty memory leaks.
References
Scalable memory allocation using jemalloc — Facebook Engineering: engineering.fb.com/core-data/s…
Netty Introduction and Practice: building a WeChat-like IM system — juejin.cn/book/684473…
For more content, follow the WeChat public account "Distributed System Architecture".