Netty Source Code: Memory Management (1) (4.1.44)
As a high-performance network application framework, Netty has its own memory allocator whose design derives from jemalloc (GitHub); it can fairly be called a Java port of jemalloc. The source code in this chapter is based on Netty 4.1.44, which uses the jemalloc 3.x algorithm; versions from 4.1.45 onward were rewritten around the jemalloc 4.x algorithm, and the two differ considerably.
High-performance Memory allocation
Jemalloc is a new-generation memory allocator introduced by Jason Evans in the FreeBSD project. It is a general-purpose malloc implementation that focuses on reducing memory fragmentation and improving allocation efficiency in high-concurrency scenarios, with the goal of replacing malloc. Jemalloc is widely used in Firefox, Redis, Rust, Netty and other well-known products or programming languages. For details, see Jason Evans' paper [A Scalable Concurrent Malloc Implementation for FreeBSD]. Besides jemalloc, there are other well-known high-performance memory allocators in the industry, such as ptmalloc and tcmalloc. A simple comparison follows:
- ptmalloc (per-thread malloc) is the standard memory allocator shipped with glibc. Its drawback is that memory cannot be shared between threads, so the memory overhead is large.
- tcmalloc (thread-caching malloc) was open-sourced by Google. Its signature feature is thread caching, and it is currently used in Chrome and Safari. Tcmalloc allocates a local cache for each thread: small objects are allocated from the thread-local cache, while large allocations use spin locks to reduce contention and improve efficiency.
jemalloc
Jemalloc borrows the excellent design ideas of tcmalloc, so the two share many similarities in architecture, including the thread-cache feature. But jemalloc is more complex than tcmalloc: it divides allocation granularity into **Small, Large, and Huge** and records a lot of metadata, so its metadata takes up more space than tcmalloc's.
From the above, their core goals are nothing more than two things:
- Allocate and reclaim memory efficiently, improving performance in both single-threaded and multi-threaded scenarios.
- Reduce memory fragmentation, both internal and external, and improve memory utilization.
Memory fragmentation
In the Linux world, physical memory is divided into 4KB pages, the smallest granularity at which memory is allocated; allocation and reclamation are done page by page. Fragments produced inside a page are called internal fragmentation, and fragments produced outside pages are called external fragmentation. Fragmentation occurs when memory is broken into small chunks that, although free and contiguous in address, are too small to be used. As memory is allocated and freed over and over, it becomes less and less contiguous; eventually the whole memory degenerates into fragments, and even if there are enough free page frames to satisfy a request, a large contiguous run of page frames cannot be allocated. So the core of reducing memory waste is to avoid fragmentation as much as possible.
Common memory allocator algorithms
Common memory allocator algorithms are:
- Dynamic memory allocation
- Buddy algorithm
- Slab algorithm
Dynamic memory allocation
Dynamic memory allocation, also called DMA for short, simply means the operating system gives the program however much memory it asks for at run time. In most cases the amount of memory needed is not known until the program runs, so allocating it in advance is hard to get right: too much is wasted, too little is not enough. DMA allocates on demand from a block of memory, records metadata for allocated memory, and maintains free partitions so that an available partition can be found quickly for the next allocation. The following three search strategies are common:
First Fit algorithm
- Free partitions are linked into a doubly linked list ordered by memory address from low to high.
- Allocation always starts searching from the low addresses, so low addresses see heavy use while high addresses see little, and many small fragments are produced at the low end.
Next Fit Algorithm (circular first fit)
- This algorithm is a variant of first fit; the main change is that the next allocation starts searching from the free partition after the one used last time.
- Compared with first fit, allocations are spread more evenly and search efficiency improves, but it leads to more serious memory fragmentation.
Best Fit Algorithm
- The free partition list is always kept sorted by size in increasing order. When allocating, the smallest suitable partition is found from the head of the list, and after the allocation completes the free partition list is re-sorted by partition size.
- This algorithm has higher space utilization, but it leaves behind tiny partitions that are hard to use. The reason is that free block sizes are whatever happens to be left over and are not classified by size; utilization reaches 100% only when a request exactly matches a free block's size.
- The list has to be re-sorted after every allocation, which costs CPU.
Buddy Memory Allocation
Buddy memory allocation is an allocation algorithm that divides memory into partitions and satisfies a request with the best-fitting block; it was invented by Harry Markowitz in 1963. The buddy algorithm groups all free page frames into 11 linked lists of blocks containing 1, 2, 4, 8, 16, 32, 64, 128, 256, 512 and 1024 contiguous page frames respectively, so the largest request is 4MB of contiguous memory.
- Blocks in the same buddy list are the same size and have contiguous addresses.
- Disadvantage: while the buddy algorithm effectively reduces external fragmentation, its minimum granularity is a page, which can cause very severe internal fragmentation, up to 50% of the memory.
Slab algorithm
- The buddy algorithm is not suitable for small-memory scenarios: at least one page is allocated at a time, which wastes memory. The slab algorithm builds on the buddy algorithm and is specially optimized for small memory allocation:
- It provides a cache mechanism that stores kernel objects; when the kernel needs to allocate memory again, the object can be fetched straight from the cache.
- Linux uses the slab algorithm to allocate memory.
Jemalloc algorithm
Jemalloc is based on the slab approach and is more complex than slab. Slab improves speed and efficiency for small allocations; jemalloc additionally achieves excellent allocation efficiency in multi-threaded scenarios through Arenas and Thread Caches. The Arena embodies divide and conquer: rather than one manager for all memory, the work is split among several managers, each working independently without interference (thread contention). The Thread Cache is the core idea of tcmalloc, which jemalloc also borrows: each thread has its own memory manager, so allocations within a thread do not have to compete with other threads. Related documents:
- Facebook Engineering post: written in 2011, corresponding to jemalloc 2.1.0.
- jemalloc(3) manual page: The manual page for the latest release fully describes the API and options supported by jemalloc, and includes a brief summary of its internals.
The underlying memory allocation of Netty is based on the idea of jemalloc algorithm.
Memory size
Netty adopts different memory allocation policies for different sizes, as shown in the figure above. The enum io.netty.buffer.PoolArena.SizeClass describes these size classes: Tiny, Small and Normal. Anything larger than 16MB is treated as the Huge type. Netty also defines finer-grained allocation units for each region: Chunk, Page and Subpage.
// io.netty.buffer.PoolArena.SizeClass
enum SizeClass {
Tiny,
Small,
Normal
}
Memory normalization
Netty normalizes the memory size requested by the user to make subsequent calculation and allocation easier. For example, when a user requests 31B, returning exactly 31B without normalization would amount to plain dynamic memory allocation; with normalization, 31B is rounded up to 32B and 15MB is rounded up to 16MB. Of course, the strategy differs per size class. The rules are as follows:
- For Huge-level sizes, Netty returns exactly the amount the user requested (with memory alignment if necessary).
- For the tiny, small and normal levels, 512B is the dividing line:
  - When >= 512B, the returned value is the nearest power of 2 that is greater than or equal to the requested size. For example, a request for 513B returns 1024B.
  - When < 512B, the returned value is the nearest multiple of 16 that is greater than or equal to the requested size. For example, a request for 17B returns 32B; a request for 46B returns 48B.
The core source code for memory normalization lives in io.netty.buffer.PoolArena, the most important class in Netty's memory management:
Getting the nearest power of 2
The requested memory size needs to be normalized so that it is easy to calculate and manage. The following walks through how 1025 is normalized. The series of shift operations looks dazzling, but the goal is simply to find the smallest power of 2 that is greater than or equal to the requested size. The idea is to turn the binary 0100 0000 0001 (1025) into 0111 1111 1111 (2047) and then add 1 to get 1000 0000 0000 (2048). Call the working value i, and denote the position of its highest 1 bit as j. The execution proceeds as follows:

- First execute i = i - 1. This ensures that a value that is already a power of 2 maps to itself instead of the next power of 2.
- Execute i |= i >>> 1. Bit j is already known to be 1, so after an unsigned right shift by one, bit j-1 is also 1; OR-ing with the original value sets bit j-1. Now bits j and j-1 are both 1, and the subsequent unsigned right shifts by 2, 4, 8 and 16 propagate the 1s to bits j-2, j-3 and so on. Because an int has 32 bits, five rounds (shifting by 1, 2, 4, 8 and 16) are enough to set every bit below bit j to 1.
- Finally add 1 to i to obtain the power of 2.
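To make the bit twiddling concrete, here is a minimal standalone sketch of the trick described above (the method name and structure are illustrative; Netty's actual PoolArena#normalizeCapacity also handles tiny sizes and overflow):

```java
// Round up to the next power of 2 (illustrative sketch of the bit trick described above)
static int nextPowerOfTwo(int reqCapacity) {
    int i = reqCapacity - 1;   // step 1: a value that is already a power of 2 maps to itself
    i |= i >>> 1;              // smear the highest 1 bit one position to the right
    i |= i >>> 2;              // the top 2 bits are now 1, smear them 2 positions
    i |= i >>> 4;
    i |= i >>> 8;
    i |= i >>> 16;             // after 5 rounds every bit below the highest bit is 1
    return i + 1;              // e.g. 1025 -> 2047 -> 2048
}
```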
Gets the next nearest multiple of 16
The idea is simple: first clear the lower four bits (set them to 0), then add 16 to reach the target value.
(reqCapacity & ~15) + 16;
// Example: reqCapacity = 33 (any value from 33 to 47 normalizes to 48)
// 0000 0000 0000 0000 0000 0000 0010 0001   (33, reqCapacity)
// 0000 0000 0000 0000 0000 0000 0000 1111   (15)
// 1111 1111 1111 1111 1111 1111 1111 0000   (~15, i.e. -16)
// 0000 0000 0000 0000 0000 0000 0010 0000   (32, reqCapacity & ~15)
// 0000 0000 0000 0000 0000 0000 0011 0000   (48, after + 16)
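As a minimal sketch of the full rule (assuming, as the earlier rules state, that a value already on a 16-byte boundary is returned unchanged):

```java
// Round a tiny request up to the next multiple of 16 (illustrative)
static int roundUpTo16(int reqCapacity) {
    if ((reqCapacity & 15) == 0) {
        return reqCapacity;            // already a multiple of 16
    }
    return (reqCapacity & ~15) + 16;   // clear the low 4 bits, then add 16
}
```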
summary
Netty uses a lot of bit operations to improve performance, but the resulting code is not very readable; if needed, look up an introduction to bit-manipulation tricks and work through the bits by hand.
- Netty's memory normalization shows off three bit-operation techniques:
  - finding the smallest power of 2 greater than or equal to the requested size;
  - finding the smallest multiple of 16 greater than or equal to the requested size;
  - using a mask to judge whether a value exceeds a certain threshold.
- Memory normalization works in bytes, not bits.
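The third technique (mask-based range checks) looks roughly like the following sketch. The constants follow from the tiny/small boundaries described above, but treat the helper names as illustrative rather than the verbatim Netty source:

```java
// Judge size classes with masks instead of comparisons (illustrative)
static boolean isTiny(int normCapacity) {
    // any bit at or above 512 set => not tiny
    return (normCapacity & 0xFFFFFE00) == 0;      // true when normCapacity < 512
}

static boolean isTinyOrSmall(int normCapacity) {
    // subpage allocations are smaller than the 8KB page size
    return (normCapacity & ~(8192 - 1)) == 0;     // true when normCapacity < 8192
}
```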
Netty memory pool allocation
- First, Netty requests a large contiguous block of memory from the operating system, called a chunk. Unless the request is Huge-level, a chunk is generally 16MB and is wrapped by an io.netty.buffer.PoolChunk object. It looks like this:
- Netty further splits a chunk into multiple pages, each 8KB by default, so every chunk contains 2048 pages. To manage small memory finely, reduce fragmentation and improve utilization, Netty splits a **page** further into several subpages; the subpage size varies dynamically, with a minimum of 16B.
- Calculation: when an allocation is requested, the required size is normalized, and the normalized value determines the exact depth in the tree.
- Search: at the depth corresponding to that size, search from left to right for a free node and allocate it.
- Tag: the node is marked as fully used, and its ancestors' markers are updated in a loop; a parent node takes the smaller value of its two children.
Of course, this is only the overall picture and may still feel up in the clouds; the sections below should bring it down to earth.
Huge Allocation logic overview
Huge allocation is a bit simpler than the other size classes. The unit it operates on is a PoolChunk sized to the user's request (subject to memory alignment). Netty uses a non-pooled strategy for Huge memory blocks: each time, a dedicated non-pooled PoolChunk object is created for the allocation, and when that object is released the whole PoolChunk's memory is released with it. The Huge allocation logic is implemented in io.netty.buffer.PoolArena#allocateHuge.
Normal allocation logic
Normal allocations range from 4097B to 16MB. The core idea is to split a PoolChunk into 2048 pages, the minimum unit of a Normal allocation. Every page is the same size (pageSize = 8KB), and these pages are managed logically through a complete binary tree. The core Normal allocation logic is implemented in PoolChunk#allocateRun(int).
Small allocation logic
Small-level allocations range over (496B, 4096B]. The core idea is to split a page into several subpages, and PoolSubpage is the incarnation of those subpages; it effectively solves the fragmentation problem of small-memory scenarios. A page is 8192B, and a Small subpage comes in only four sizes, 512B, 1024B, 2048B and 4096B, doubling each step. If the required size falls within this range, one of these four element sizes is chosen. When allocating, a free page is found at the bottom of the tree, split into several subpages, and a PoolSubpage is constructed to manage them. The first subpage is chosen for the current request and marked as used, and the PoolSubpage is placed into the linked list at the matching index of the PoolSubpage[] smallSubpagePools array, so that subsequent requests of the same size can be allocated straight from that linked list.
Tiny allocation logic
Tiny-level allocations range over (0B, 496B]. The allocation logic is similar to Small: a free page is found, split into several subpages, and a PoolSubpage is constructed to manage them; the first subpage is selected for the current request and the PoolSubpage object is placed in the linked list at the matching index of the PoolSubpage[] tinySubpagePools array, waiting for the next allocation. The difference is how the subpage size is decided: for Tiny, the element size is the smallest multiple of 16 that is greater than or equal to the normalized value. For example, for a 31B request the next multiple of 16 above 31 is 32, so the page is split into 8192/32 = 256 subpages; the count is determined by the element size and is therefore variable.
PoolArena
The above shows how Netty allocates memory at the different levels. Next, let's get to know a few classes to lay the groundwork for reading the source code. PoolArena is a core class: memory is allocated through a fixed number of Arenas, which by default depends on the number of CPU cores. Arenas are shared among threads, and each thread is bound to exactly one PoolArena: when a thread requests memory for the first time, it obtains an Arena in round-robin fashion and deals only with that Arena for its entire life cycle. As mentioned earlier, PoolArena embodies the idea of divide and conquer and performs well in multi-threaded scenarios. PoolArena has two subclasses, DirectArena and HeapArena, which are needed because the underlying container types differ, but the core logic is done in PoolArena itself. The PoolArena data structure (excluding monitoring metrics) can be roughly divided into two parts: the six PoolChunkLists that store PoolChunks, and the two arrays that store PoolSubpages. The PoolArena constructor also does important initialization work, including chaining the PoolChunkLists together and initializing the PoolSubpage[] arrays.
Initialize PoolChunkList
q000, q025, q050, q075 and q100 indicate the minimum memory usage of the PoolChunks they hold. As shown in the figure below, every PoolChunkList has a lower and an upper bound on usage: minUsage and maxUsage. If a PoolChunk's usage exceeds maxUsage, the PoolChunk is removed from its current PoolChunkList and moved to the next one; likewise, if its usage drops below minUsage, it is removed and moved to the previous PoolChunkList. The bounds of adjacent PoolChunkLists deliberately overlap: PoolChunks move between PoolChunkLists all the time, and if the thresholds lined up exactly, a PoolChunk sitting at the boundary would bounce back and forth between two lists, costing performance. PoolChunkList serves Chunk-level allocation. PoolArena initializes six PoolChunkLists and links them head to tail into a doubly linked list; only q000 has no predecessor, because when none of the other PoolChunkLists has a suitable PoolChunk, a new PoolChunk is created and placed in qInit, and memory is then allocated from it according to the requested size. If a PoolChunk in q000 drops to 0% usage because its memory has been returned, it is not moved back into qInit; instead its destroy method is executed and the whole memory block is freed. In this way the memory in the pool has a create/destroy life cycle, and Netty avoids holding on to memory that is no longer used.
Initialize PoolSubpage []
A PoolSubpage is the incarnation of a single page: a page can be split into several subpages according to elemSize. PoolArena uses PoolSubpage[] arrays to store the PoolSubpage objects, as shown below. Remember this picture: Small comes in four sizes, so the smallSubpagePools array has length 4; smallSubpagePools[0] is the head of the linked list of PoolSubpage objects with elemSize = 512B, smallSubpagePools[1] the list for elemSize = 1024B, and so on. tinySubpagePools works on the same principle, only with a finer granularity (step) that increases in multiples of 16; because of the Tiny size limit there are 32 classes in total, so tinySubpagePools has length 32. Each array index corresponds to a different elemSize, and each slot holds a doubly linked list. In short, the two arrays store PoolSubpage objects, the index is determined from PoolSubpage#elemSize, and the objects at the same index are chained into a doubly linked list.
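A sketch of how the array index can be derived from elemSize; the helper names are illustrative (Netty's PoolArena has equivalent tinyIdx/smallIdx helpers), and the small case assumes elemSize is one of the four power-of-2 sizes:

```java
// Map an element size to its slot in the subpage pool arrays (illustrative)
static int tinyIdx(int elemSize) {
    return elemSize >>> 4;   // tiny sizes step by 16B: 16B -> 1, 32B -> 2, ..., 496B -> 31
}

static int smallIdx(int elemSize) {
    // small sizes are 512B, 1024B, 2048B, 4096B -> indexes 0..3
    return Integer.numberOfTrailingZeros(elemSize) - 9;   // log2(elemSize) - log2(512)
}
```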
The source code
The subclass implementation
The inheritance system is shown in the figure below:
- PoolArenaMetric: defines the monitoring interfaces related to PoolArena.
- PoolArena: an abstract class that defines the main core variables and most of the allocation logic. Because the underlying data containers differ, so do the creation and destruction logic, which is why it has two subclasses: DirectArena and HeapArena.
The abstract class PoolArena declares several methods that subclasses must implement. These abstract methods are exactly where DirectArena and HeapArena differ; the details are not covered here.
PoolChunkList
PoolChunkList is a doubly linked list that stores PoolChunk objects; it holds a pointer to the head of its PoolChunk list. A PoolChunkList node itself is also linked to the other PoolChunkLists, forming a doubly linked list, as shown above. The internal definition of PoolChunkList is simple:
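A simplified sketch of its core fields (paraphrased from the 4.1.x source, not guaranteed verbatim):

```java
// Core state of a PoolChunkList (simplified sketch)
final class PoolChunkList<T> {
    private final PoolArena<T> arena;        // the arena that owns this list
    private final PoolChunkList<T> nextList; // the list with the next-higher usage range
    private PoolChunkList<T> prevList;       // the list with the next-lower usage range
    private final int minUsage;              // lower usage bound (percent)
    private final int maxUsage;              // upper usage bound (percent)
    private PoolChunk<T> head;               // head of the doubly linked list of PoolChunks
}
```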
PoolChunk
PoolChunk is Netty’s description of the idea of Jemalloc3. x algorithm, which is the core class of Netty’s memory allocation.
Translated documentation
Overview
A page is the smallest memory unit a chunk can allocate, and a chunk is a collection of pages. The chunk size is computed as chunkSize = 2^{maxOrder} * pageSize. First we allocate a byte array of size = chunkSize; whenever a ByteBuf of a given size needs to be created, we search the byte array for the first position that has enough free space to accommodate the requested size and return a handle value of type long that encodes the offset information (the memory segment is then marked reserved, so it is always used by exactly one ByteBuf and no more). For simplicity, all requested sizes are normalized by PoolArena#normalizeCapacity, which guarantees that when we request a segment of size >= pageSize, the normalized capacity equals the next nearest power of 2. To find the first offset that can hold the requested size, we build a complete balanced binary tree to speed up the search, and store the tree's information in the memoryMap array. The tree looks roughly like this (the parentheses indicate the size of each node):
- depth=0 1 node (chunkSize)
- depth=1 2 nodes (chunkSize/2)
- .
- depth=d 2^d nodes (chunkSize/2^d)
- .
- depth=maxOrder 2^maxOrder nodes (chunkSize/2^{maxOrder} = pageSize)
When depth = maxOrder, the leaf nodes are the pages.
Search algorithm
The complete binary tree is encoded in the memoryMap array. memoryMap is a byte[] that records the allocation state of the tree; the initial value of each entry is the depth of the corresponding node.
- memoryMap[id] = depth_of_id => the node is free / not allocated.
- memoryMap[id] > depth_of_id => at least one child has been allocated, but the other children can still be allocated.
- memoryMap[id] = maxOrder + 1 => the node is fully allocated, i.e. unavailable.
allocateNode(d)
The goal is to find the first free allocatable node at the corresponding depth from left to right. The d parameter indicates depth.
- Start from the root node (depth = 0, id = 1).
- If memoryMap[1] > d, this Chunk has no memory available for the request.
- If the value of the left child is <= d, we can allocate from the left subtree; repeat until a free node is found.
- Otherwise, descend into the right subtree and repeat until a free node is found.
allocateRun(size)
Allocates a run of pages. The size parameter is the normalized memory size.
- Compute the depth corresponding to size using the formula d = log_2(chunkSize / size).
- Return allocateNode(d).
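A quick worked example of the depth formula, assuming the default chunkSize = 16MB and pageSize = 8KB:

```java
// d = log2(chunkSize / size), computed on powers of two
int chunkSize = 16 * 1024 * 1024;   // 16MB
int pageSize  = 8 * 1024;           // 8KB

// allocating a single page: d = log2(16MB / 8KB) = log2(2048) = 11, the leaf level (maxOrder)
int d1 = Integer.numberOfTrailingZeros(chunkSize / pageSize);    // 11

// allocating 64KB (8 pages): d = log2(16MB / 64KB) = log2(256) = 8
int d2 = Integer.numberOfTrailingZeros(chunkSize / (64 * 1024)); // 8
```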
allocateSubpage(size)
Create/initialize a new PoolSubpage of normCapacity size. Any PoolSubpage created/initialized is added to the subpage memory pool of the PoolArena that owns the PoolChunk.
- Use allocateNode(maxOrder) to find a free page (leaf) node and return a handle.
- Use the handle to build a PoolSubpage object and add it to the subpagePool of the PoolArena.
The source code
PoolChunk
The source code is relatively complex; the member variables need to be understood clearly first, to lay the groundwork for the subsequent analysis of the memory allocation source code.
Overview of related methods:
This is just to give you an idea of what the variables and methods actually do when the source code is analyzed.
PoolSubpage
PoolSubpage is the object used for Small and Tiny level allocations. A PoolSubpage corresponds to one page, so a single PoolSubpage manages 8KB of memory. Its fields are explained as follows:
PoolSubpage is also very clever at managing small memory, more on that later.
Let’s talk about pooled memory allocation
In the ByteBuf chapter we covered the ByteBufAllocator allocator hierarchy, but only as an overview; for the pooled allocator PooledByteBufAllocator we merely sketched the initialization process. Now we pick up from there and work out how these classes allocate and manage memory. PooledByteBufAllocator is a thread-safe class; we can use PooledByteBufAllocator.DEFAULT to obtain an io.netty.buffer.PooledByteBufAllocator pooled allocator, which is Netty's recommended practice. PooledByteBufAllocator initializes two important arrays, heapArenas and directArenas; all memory allocation operations are delegated to heapArenas or directArenas. The array length is usually 2 * CPU_CORE. This reflects Netty's (or, more precisely, the jemalloc algorithm's) design philosophy: adding multiple Arenas reduces memory contention and improves allocation speed and efficiency in multi-threaded environments. The arenas arrays are made up of the PoolArena objects discussed above. PoolArena is the central hub of memory allocation, a big housekeeper: it manages PoolChunk objects, manages PoolSubpage objects, carries the core allocation logic, manages the thread-local object cache pool, destroys the memory pool, and so on; its focus is managing allocated memory objects. PoolChunk is the embodiment of the jemalloc algorithm: it knows how to allocate memory efficiently, and you only need to call the right method to get the chunk of memory you want; PoolArena takes care of the rest. Next, we enter the world of Netty memory allocation through the source code, using PooledByteBufAllocator's methods as the entry point.
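A minimal usage example of the pooled allocator using the standard Netty API:

```java
import io.netty.buffer.ByteBuf;
import io.netty.buffer.PooledByteBufAllocator;
import java.nio.charset.StandardCharsets;

public class PooledAllocExample {
    public static void main(String[] args) {
        // Netty's recommended entry point: the shared, thread-safe pooled allocator
        ByteBuf buf = PooledByteBufAllocator.DEFAULT.directBuffer(1024); // off-heap, pooled
        try {
            buf.writeBytes("hello netty".getBytes(StandardCharsets.UTF_8));
            System.out.println(buf.readableBytes());   // 11
        } finally {
            buf.release();   // drop the reference count; at 0 the memory returns to the pool
        }
    }
}
```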
Off-heap memory allocation source code implementation
The underlying data storage container for off-heap memory is the java.nio.ByteBuffer object. A pooled off-heap ByteBuf is usually obtained through io.netty.buffer.AbstractByteBufAllocator#directBuffer(int). Following that method, it delegates through the abstract method io.netty.buffer.AbstractByteBufAllocator#newDirectBuffer to a subclass implementation; here that is the pooled allocator PooledByteBufAllocator. The relevant source code is as follows:
// io.netty.buffer.PooledByteBufAllocator#newDirectBuffer
/** Get an off-heap (direct) pooled "ByteBuf" object */
@Override
protected ByteBuf newDirectBuffer(int initialCapacity, int maxCapacity) {
// #1 Fetch the PoolThreadCache object from the local thread cache
PoolThreadCache cache = threadCache.get();
// #2 Retrieve "directArena" from the cache object, select the corresponding "Arena" according to the storage type.
PoolArena<ByteBuffer> directArena = cache.directArena;
final ByteBuf buf;
if (directArena != null) {
// #3 Delegate the allocation to "directArena"
buf = directArena.allocate(cache, initialCapacity, maxCapacity);
} else {
// #3 Fallback: no arena available, create an unpooled direct "ByteBuf"
buf = PlatformDependent.hasUnsafe() ?
UnsafeByteBufUtil.newUnsafeDirectByteBuf(this, initialCapacity, maxCapacity) :
new UnpooledDirectByteBuf(this, initialCapacity, maxCapacity);
}
// #4 Wrap the generated "ByteBuf" object for memory leak detection
return toLeakAwareBuffer(buf);
}
The above is the core source code through which the allocator obtains a pooled ByteBuf object. It does not feel complicated: the memory allocation is delegated to directArena. As mentioned earlier, each thread is bound to exactly one PoolArena object for its lifetime, and that reference is stored in PoolThreadCache. When a thread wants to allocate memory, the call to threadCache.get() initializes the relevant variables. Netty enables the thread-local cache by default, so the directArena object obtained from the cache is normally not null. This PoolThreadCache is useful: it holds the PoolArena reference and caches some ByteBuffer or byte[] information via MemoryRegionCache. For now, all we need to know is that we get a directArena object from the PoolThreadCache thread-local cache; when it is first assigned, PooledByteBufAllocator compares PoolArena#numThreadCaches across arenas and returns the PoolArena with the smallest value, and each thread has its own PoolThreadCache. More on PoolThreadCache in a separate section. Continuing along the main line, PoolArena#allocate(PoolThreadCache, int, int) is invoked next; let's see what PoolArena does:
Phase 1: Initialize a ByteBuf instance object
Object pooling speeds up the allocation and release of ByteBuf objects; the downside is that developers who do not understand Netty's internals can easily cause memory leaks. If no instance is available in the object pool, a new one is created according to the corresponding rules.
// io.netty.buffer.PoolArena#allocate(io.netty.buffer.PoolThreadCache, int, int)
/** Get a pooled instance of "ByteBuf" */
PooledByteBuf<T> allocate(PoolThreadCache cache, int reqCapacity, int maxCapacity) {
// #1 Get a "ByteBuf" instance, either created directly or taken from the object pool.
//    "newByteBuf" is abstract in "PoolArena" and implemented by the subclasses.
PooledByteBuf<T> buf = newByteBuf(maxCapacity);
// #2 Fill in the physical memory information for "buf"
allocate(cache, buf, reqCapacity);
// #3 Return
return buf;
}

// io.netty.buffer.PoolArena.DirectArena#newByteBuf
/** Get a "ByteBuf" instance. */
@Override
protected PooledByteBuf<ByteBuffer> newByteBuf(int maxCapacity) {
if (HAS_UNSAFE) {
// #1 "ByteBuf" backed by "Unsafe", which is generally supported on servers,
//    so let's take a closer look at how this branch is implemented
return PooledUnsafeDirectByteBuf.newInstance(maxCapacity);
} else {
// #2 "ByteBuf" without "Unsafe"
return PooledDirectByteBuf.newInstance(maxCapacity);
}
}
// io.netty.buffer.PooledUnsafeDirectByteBuf
/ * * * "PooledUnsafeDirectByteBuf" is not "public" decorate, it is visible object, therefore, we can't get this type instances by distributor. * this "ByteBuf" has the "ObjectPool" ObjectPool to speed up the allocation of objects. * a and its type, called "io.net ty. Buffer. PooledDirectByteBuf", also use "ObjectPool object pool". * specific difference is "PooledUnsafeDirectByteBuf" maintenance "memoryAddress" internal variables, this is the necessary "Unsafe" operating variables. * /
final class PooledUnsafeDirectByteBuf extends PooledByteBuf<ByteBuffer> {
/ / object pool
private static final ObjectPool<PooledUnsafeDirectByteBuf> RECYCLER = ObjectPool.newPool(
new ObjectCreator<PooledUnsafeDirectByteBuf>() {
@Override
public PooledUnsafeDirectByteBuf newObject(Handle<PooledUnsafeDirectByteBuf> handle) {
return new PooledUnsafeDirectByteBuf(handle, 0); }});static PooledUnsafeDirectByteBuf newInstance(int maxCapacity) {
// #1 Get the instance of "ByteBuf" from the object pool
PooledUnsafeDirectByteBuf buf = RECYCLER.get();
/ / # 2 reset
buf.reuse(maxCapacity);
/ / return
return buf;
}
private long memoryAddress;
// Reset all pointer variables
final void reuse(int maxCapacity) {
maxCapacity(maxCapacity);
resetRefCnt();
setIndex0(0.0);
discardMarks();
}
// ...
}
Phase 2: Fill memory information for ByteBuf
The core method of this phase is io.netty.buffer.PoolArena#allocate(io.netty.buffer.PoolThreadCache, io.netty.buffer.PooledByteBuf<T>, int). PoolArena chooses a memory allocation policy according to the requested size and writes the memory information into the ByteBuf object. The PoolSubpage<T>[] tinySubpagePools and PoolSubpage<T>[] smallSubpagePools arrays mentioned earlier are used for tiny & small level allocations; the next time the same capacity is requested, the memory can be allocated directly from these PoolSubpage<T>[] arrays. Dig into the source:
Now to summarize the off-heap allocation logic:
- First, the requested capacity is normalized; the normalized value is the smallest power of 2 greater than or equal to the original value (or the next multiple of 16 for tiny sizes).
- An appropriate allocation strategy is chosen based on the normalized value. Broadly there are 3 strategies: tiny & small, normal, and huge.
- Huge allocations never try the local thread cache and are not pooled: a PoolChunk object is created directly and returned.
- For Normal allocations, the PoolChunkLists are tried in the order q050 -> q025 -> q000 -> qInit -> q075. Starting from q050 is a compromise: if allocation started from q000, most PoolChunks would face frequent creation and destruction and allocation performance would degrade; starting from q050 keeps PoolChunk usage in the middle range, reducing the probability of PoolChunks being recycled while still taking performance into account. After a successful allocation, the PoolChunk's usage is re-evaluated, and if it exceeds the PoolChunkList's threshold the chunk is moved to the next PoolChunkList. If allocation fails in all lists, a new PoolChunk is created for the request, and once it has served the allocation it is added to the qInit list.
- For Tiny & Small levels, allocation is first attempted through PoolSubpage; if that succeeds, the result is returned. If it fails, the allocation falls back to the Normal allocation logic.
In general, PoolArena#allocate is the core logic through which a PoolArena object allocates memory: it chooses a suitable allocation strategy based on the normalized value, speeds up allocation through the local thread cache, and reduces GC pressure by allocating ByteBuf shells from an object pool.
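Putting the strategy selection together, the control flow of PoolArena#allocate looks roughly like this simplified sketch; the tiny/small helper is abbreviated and labeled as illustrative, so do not read it as the verbatim source:

```java
// Simplified decision flow of PoolArena#allocate(PoolThreadCache, PooledByteBuf<T>, int)
private void allocate(PoolThreadCache cache, PooledByteBuf<T> buf, int reqCapacity) {
    int normCapacity = normalizeCapacity(reqCapacity);
    if (isTinyOrSmall(normCapacity)) {          // smaller than pageSize: tiny & small path
        // 1. try the local thread cache, 2. try the matching PoolSubpage list,
        // 3. fall back to the normal (chunk/page) allocation path
        allocateTinyOrSmall(cache, buf, reqCapacity, normCapacity); // illustrative helper
    } else if (normCapacity <= chunkSize) {     // normal path: q050 -> q025 -> q000 -> qInit -> q075
        allocateNormal(buf, reqCapacity, normCapacity);
    } else {                                    // huge path: unpooled, one dedicated PoolChunk
        allocateHuge(buf, reqCapacity);
    }
}
```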
Overview of in-heap memory allocation
The logic for allocating in-heap and off-heap memory is roughly the same, except that:
- The PoolArena subclass HeapArena completes the allocation.
- The underlying data container is byte[], whereas DirectArena's container is a java.nio.ByteBuffer object.
Memory recovery
Who is the subject of memory reclamation? We know that Netty caches part of the allocated memory through the Thread Cache, so how is that memory reclaimed? There the subject is the Thread Cache. And for the big housekeeper PoolArena, how does it manage reclamation? Most of the time we release ByteBuf objects through ByteBuf#release(). This API only decrements the reference count; physical memory is reclaimed only when the reference count reaches 0. The ByteBuf#release() flow can be summarized as follows: the reference count is updated through an updater object, and when it reaches zero the memory needs to be freed. If the PoolChunk it belongs to is not pooled, it is released directly. For a pooled chunk, Netty first checks whether the local thread can cache the memory information being reclaimed; if caching succeeds, the method returns, otherwise PoolArena handles the reclamation. PoolArena hands the work to the PoolChunkList, whose logic is relatively simple: find the PoolChunk, have it reclaim the memory, then check whether the PoolChunk still satisfies minUsage and, if not, move it toward the previous node. That, at a glance, is what memory reclamation looks like.
// io.netty.buffer.AbstractReferenceCountedByteBuf#release()
@Override
public boolean release() {
// #1 Update the reference count: internally the updater decrements refCnt.
//    updater.release(this) returns true when the reference count of this "ByteBuf" reaches 0,
//    which means it is time to release the memory
// #2 Free the memory
return handleRelease(updater.release(this));
}
// io.netty.buffer.AbstractReferenceCountedByteBuf#handleRelease
private boolean handleRelease(boolean result) {
if (result) {
// Free memory
deallocate();
}
return result;
}
// io.netty.buffer.PooledByteBuf#deallocate
@Override
protected final void deallocate() {
// Only proceed if the handle variable is >= 0
if (handle >= 0) {
final long handle = this.handle;
this.handle = -1;
memory = null;
// Return the memory through PoolArena#free
chunk.arena.free(chunk, tmpNioBuf, handle, maxLength, cache);
tmpNioBuf = null;
chunk = null;
// Recycle the "ByteBuf" object itself back to the object pool
recycle();
}
}

// io.netty.buffer.PoolArena#free
/**
 * PoolArena's "free" entry point.
 * @param chunk        the "PoolChunk" that the "ByteBuf" belongs to
 * @param nioBuffer    the temporary "ByteBuffer" object held inside the "ByteBuf"
 * @param handle       the handle value
 * @param normCapacity the normalized requested memory size
 * @param cache        the thread-local cache
 */
void free(PoolChunk<T> chunk,
ByteBuffer nioBuffer,
long handle, int normCapacity, PoolThreadCache cache) {
if (chunk.unpooled) {
// #1 The Chunk to which "ByteBuf" belongs is not pooled and directly destroyed
// Different destruction strategies are adopted according to the underlying implementation.
// If the container is a "ByteBuffer", take different destruction paths depending on whether a "Cleaner" is available
// If it is a "byte[]", nothing is done and the JVM GC reclaims the memory
int size = chunk.chunkSize();
destroyChunk(chunk);
activeBytesHuge.add(-size);
deallocationsHuge.increment();
} else {
// #2 For pooled "chunks"
SizeClass sizeClass = sizeClass(normCapacity);
if (cache != null &&
// Try adding it to the local cache; how it is added will be explained in another section.
// MemoryRegionCache caches memory information such as the handle value, capacity and chunk,
// so the next request for the same capacity from this thread can be served from the local cache.
// Borrowed but never returned? Not possible: PoolThreadCache maintains an add counter and triggers
// a reclaim action once a threshold is reached, so no memory leak occurs
cache.add(this, chunk, nioBuffer, handle, normCapacity, sizeClass)) {
return;
}
// Adding to the local cache failed, so free the memory back to the chunk
freeChunk(chunk, handle, sizeClass, nioBuffer, false);
}
}

// io.netty.buffer.PoolArena#freeChunk
/**
 * Release the memory occupied by the "ByteBuf" object.
 * @param chunk
* @param handle
* @param sizeClass
* @param nioBuffer
* @param finalizer
*/
void freeChunk(PoolChunk<T> chunk,
long handle,
SizeClass sizeClass,
ByteBuffer nioBuffer, boolean finalizer) {
final boolean destroyChunk;
synchronized (this) {
// We only call this if freeChunk is not called because of the PoolThreadCache finalizer as otherwise this
// may fail due lazy class-loading in for example tomcat.
// This is the judgment of lazy loading. For example, when Tomcat uninstalls an application, it uninstalls the corresponding ClassLoader.
// The thread recovery finalizer may require the class information of this class loader, so let's check
if (!finalizer) {
switch (sizeClass) {
case Normal:
++deallocationsNormal;
break;
case Small:
++deallocationsSmall;
break;
case Tiny:
++deallocationsTiny;
break;
default:
throw new Error();
}
}
// Call PoolChunkList#free to return the memory
destroyChunk = !chunk.parent.free(chunk, handle, nioBuffer);
}
if (destroyChunk) {
// destroyChunk not need to be called while holding the synchronized lock.
destroyChunk(chunk);
}
}

// io.netty.buffer.PoolChunkList#free
boolean free(PoolChunk<T> chunk, long handle, ByteBuffer nioBuffer) {
// #1 collect memory blocks with "PoolChunk#free"
// Handle records the location of the tree
// PoolChunk caches nioBuffer objects for use next time
chunk.free(handle, nioBuffer);
// #2 Determine whether the current usage of PoolChunk needs to be moved to the previous node list
if (chunk.usage() < minUsage) {
remove(chunk);
// Move the PoolChunk down the PoolChunkList linked-list.
return move0(chunk);
}
return true;
}
conclusion
This is a small step in reading Netty's memory code, but a big step toward becoming familiar with Netty's memory management. I hope the analysis of these specific classes and structures gives you a general picture of the whole memory flow; once the overall process is familiar, we will dig into the details.