Preface

This series of articles explains the evolution of NIO and paves the way for studying the reactor-netty library.

The video series on the Java programming methodology with Reactor and WebFlux has already covered RxJava and Reactor; the Bilibili links are as follows:

Rxjava source code reading and sharing: www.bilibili.com/video/av345…

Reactor source code reading and sharing: www.bilibili.com/video/av353…

This series of related video sharing: www.bilibili.com/video/av432…

The source code interpretation in this series is based on JDK 11; API details may differ in other versions, so please account for your own JDK version.

Previous installments in this series:

BIO to NIO source code notes: BIO

BIO to NIO source code notes: NIO (part 1)

BIO to NIO source code notes: NIO (part 2)

BIO to NIO source code notes: NIO (part 3), Selector

BIO to NIO source code notes: NIO Buffer (part 1)

Some knowledge of operating systems

Before we look at NIO's DirectByteBuffer, it is necessary to understand some operating system fundamentals, so that we can see why the code is designed the way it is.

User mode and kernel mode

Here, let’s take the Linux operating system as an example. First, let’s look at a picture:

Figure 1

As shown in the figure above, from a macro point of view, the Linux operating system architecture is divided into kernel mode and user mode. An operating system is essentially software that runs directly on hardware. Among all the instructions of a CPU, taking the Intel x86 architecture as an example, some are very dangerous: used incorrectly, they will crash the whole system, for example instructions that clear memory or set the clock. If every program could use these instructions, your system would crash constantly. Therefore, the CPU divides instructions into privileged and non-privileged instructions; the dangerous ones may only be used by the operating system and its related modules, while ordinary applications can only use the instructions that cannot cause disaster. Accordingly, Intel CPUs divide privilege into four levels: RING0, RING1, RING2, and RING3.

For the operating system, RING0 is kernel mode and has the highest privilege, while ordinary applications run in RING3, user mode. In terms of permission constraints, a higher privilege level can read the data of lower levels, such as process context, code, and data, but not vice versa: RING0 can read the contents of RING0-3, RING1 the contents of RING1-3, and RING2 the contents of RING2-3. In particular, the RING3 state cannot access the RING0 address space, including its code and data.

As we know, on a 32-bit machine the address space of a Linux process is 4GB, where 0-3GB is user space and 3GB-4GB is kernel space. What if our physical machine has only 2GB of RAM? That is why this 4GB address space is called the virtual address space (and also why a 32-bit operating system such as Windows recognizes only about 4GB of physical memory even when more than 8GB is installed).

So what is the virtual address memory space, and how does it correspond to the actual physical memory space?

The process uses an address in virtual address memory, and the operating system assists the hardware in “translating” it into a real physical address. Virtual addresses are mapped to physical memory through Page tables maintained by the operating system and referenced by the processor. Kernel space has the highest level of privilege in the page table, so user-mode programs attempting to access these pages cause a page fault. In Linux, kernel space is persistent and maps to the same physical memory in all processes. Kernel code and data are always addressable, ready to handle interrupts and system calls. In contrast, the mapping of the user-mode address space changes as the process switches.
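To make the translation concrete, here is a minimal toy sketch of how a virtual address splits into a page number and an offset. It assumes 4 KB pages and a single flat page table array, which real x86 hardware does not use (it has multi-level page tables); the names are illustrative only:

```java
public class PageTranslation {
    static final int PAGE_SIZE = 4096;   // assume 4 KB pages
    static final int OFFSET_BITS = 12;   // log2(4096)

    // Translate a virtual address using a toy single-level page table:
    // pageTable[virtualPageNumber] holds the physical frame number.
    static long translate(long virtualAddress, long[] pageTable) {
        long virtualPageNumber = virtualAddress >>> OFFSET_BITS; // high bits select the page
        long offset = virtualAddress & (PAGE_SIZE - 1);          // low 12 bits pass through unchanged
        long physicalFrame = pageTable[(int) virtualPageNumber]; // the page-table lookup
        return (physicalFrame << OFFSET_BITS) | offset;
    }
}
```

A page fault corresponds to the case where the page-table entry says the page is not present in physical memory, at which point the kernel loads it and retries the access.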

The standard memory segment layout for Linux processes in virtual memory is shown below:

Figure 2

Note that the 32-bit kernel address space partition is different from the 64-bit kernel address space partition.

In the x86 architecture, the kernel divides physical memory into three zones:

ZONE_DMA: the first 16MB of memory

ZONE_NORMAL: 16MB ~ 896MB

ZONE_HIGHMEM: 896MB ~ end

The origin of high memory

When kernel module code or a process accesses memory, the address used in the code is a logical address, which must map one-to-one to a real physical address. For example, the physical address of logical address 0xc0000003 is 0x3, that of 0xc0000004 is 0x4, and so on. The relationship between logical and physical addresses is:

Physical address = logical address - 0xC0000000

| Logical address | Physical memory address |
| --- | --- |
| 0xc0000000 | 0x0 |
| 0xc0000001 | 0x1 |
| 0xc0000002 | 0x2 |
| 0xc0000003 | 0x3 |
| 0xe0000000 | 0x20000000 |
| 0xffffffff | 0x40000000 |

Assuming this simple address mapping, it can be seen from Figure 2 that the kernel logical address space is 0xC0000000 ~ 0xFFFFFFFF, and the corresponding physical memory range is 0x0 ~ 0x40000000, meaning only 1GB of physical memory can be accessed. If 8GB of physical memory is installed on the machine, the kernel can access only the first 1GB, and the remaining 7GB is unreachable, because the kernel address space has already been mapped to the range 0x0 ~ 0x40000000. How would the kernel access memory at physical address 0x40000001? The address space 0xC0000000 ~ 0xFFFFFFFF has been used up, so memory beyond 0x40000000 cannot be reached.

Obviously, the kernel address space 0xC0000000 ~ 0xFFFFFFFF cannot be used entirely for simple one-to-one mapping. Therefore, the x86 architecture divides the kernel address space into three parts: ZONE_DMA, ZONE_NORMAL, and ZONE_HIGHMEM. ZONE_HIGHMEM stands for high memory, which is where the concept of high memory comes from.

So how does the kernel access all physical memory through the 128MB high memory address space?

If the kernel wants to access physical memory above 896MB, it finds a free piece of logical address space within the range 0xF8000000 ~ 0xFFFFFFFF and borrows it for a while: it maps that logical address space onto the physical memory it wants to access (that is, it fills in the kernel PTE page table), uses it temporarily, and releases it when finished. Others can then reuse the same address space to access other physical memory, so the limited address space can reach all of physical memory.

For example, if the kernel wants to access 1MB of physical memory starting at 2GB, i.e. the physical address range 0x80000000 ~ 0x800FFFFF, it first finds a free 1MB logical address range. Assume the free range found is 0xF8700000 ~ 0xF87FFFFF; the kernel maps this 1MB of logical address space onto the physical address range 0x80000000 ~ 0x800FFFFF. The mapping is as follows:

| Logical address | Physical memory address |
| --- | --- |
| 0xF8700000 | 0x80000000 |
| 0xF8700001 | 0x80000001 |
| 0xF8700002 | 0x80000002 |
| ... | ... |
| 0xF87FFFFF | 0x800FFFFF |

When the kernel finishes accessing the physical memory 0x80000000 ~ 0x800FFFFF, it frees the kernel linear space 0xF8700000 ~ 0xF87FFFFF. Other processes or code can then use 0xF8700000 ~ 0xF87FFFFF to access other physical memory.

From the above description, we can see the basic idea of high memory: borrow a segment of address space, create a temporary address mapping, and release it after use, so that the segment can be reused in a cycle to access all of physical memory.

What if a kernel process or module keeps occupying a certain logical address range? In that case the kernel's high-memory address space becomes increasingly strained; if ranges are occupied and never released, no high-memory address space is left to map physical memory, and the corresponding physical memory cannot be accessed.

The virtual space of the process

To put it simply, when a process uses memory, it does not access physical memory addresses directly; it accesses virtual memory addresses, which are then converted into physical addresses. The virtual space is made up of all the addresses the process can see: a process's remapping of all the physical addresses (allocated and yet to be allocated) assigned to it.

Virtual space can be thought of as being backed by disk space, with a page table recording where each mapped page lives. When an address is accessed, the page table tells us whether the data is in memory; if not, a page fault is raised and the corresponding data is paged into memory. If there is no free memory, a victim page is chosen and replaced (that is, the old page is overwritten).

For further details, see "Virtual memory for Linux processes".

Let's go back to the concepts of kernel mode and user mode. The kernel mode of the operating system controls the computer's hardware resources and provides the environment in which upper-level applications run. User mode is the space where those applications are active. An application's execution must rely on resources provided by the kernel, including CPU, storage, and I/O resources; for upper-layer applications to use these resources, the kernel must expose an interface to them: system calls.

System calls are the smallest functional units of an operating system, and they can be extended and tailored to different application scenarios. Today's various Unix implementations provide different numbers of system calls: different versions of Linux have roughly 240-260, and FreeBSD about 320. We can think of a system call as an irreducible operation (similar to, but not the same as, an atomic operation): compared to a Chinese character, a system call is a single "stroke", while an upper-level application is a whole "character".

User-space applications enter kernel space through system calls. At this point the user-space process passes many variables and parameter values to the kernel, and the kernel also needs to save some register values and variables while running in kernel mode. The so-called "process context" can be regarded as this set: the parameters the user process passes to the kernel, plus the variables, register values, and environment the kernel saves at that moment.

#### System I/O call

So let's look at an ordinary IO call. In traditional file IO, we call the low-level standard IO system calls read() and write() provided by the operating system. At this point the calling process (in Java's case, the JVM process) switches from user mode to kernel mode. The OS kernel then reads the corresponding file data into the kernel IO buffer, and copies the data from the kernel IO buffer into the process's private address space, completing one IO operation. As shown in the figure below.

Figure 3

Here, let’s go through the process with a Demo:

```java
byte[] b = new byte[1024];
int read;
long total = 0;
while ((read = inputStream.read(b)) >= 0) {
    total = total + read;
    // other code....
}
```

We create a buffer with new byte[1024]. Since the JVM runs in user mode, the buffer created here is a user buffer. Calling the read() method in the while loop then triggers the read system call to fetch data. Let's focus on the details of what happens when inputStream.read is called:

  1. The kernel tells the disk controller: I want to read data from a certain block on the disk.
  2. Read data from the hard disk into the kernel buffer under DMA control.
  3. The kernel copies data from the kernel buffer to the user buffer.

The user buffer here is the byte array new in our code. Please refer to Figure 3 to understand the whole process.

For the operating system, the JVM is in user-state space. Processes in user space cannot directly manipulate the underlying hardware. IO operations require manipulation of underlying hardware, such as hard disks. Therefore, IO operations must be completed with the help of the kernel (interrupt, trap), i.e. there will be a switch from user mode to kernel mode.

When we write new byte[] arrays, we generally create “arbitrary” arrays of “arbitrary size”. For example, new byte[128], new byte[1024], and new byte[4096]….

However, a hard disk is not read in arbitrary sizes; it is read one disk block (or several blocks) at a time, because disk access is very expensive. Therefore, an "intermediate buffer" is required: the kernel buffer. Data is read from the hard disk into the kernel buffer and then moved from the kernel buffer to the user buffer.

This is why we often feel that the first read is slow and subsequent reads are fast. For a subsequent read operation, the data it needs to read is probably already in the kernel buffer, so it simply copies the data from the kernel buffer to the user buffer, without involving the underlying read from the disk, which is faster.

When the data is not available, the process is suspended and waits for the kernel to fetch the data from the hard disk into the kernel buffer.

DMA: used to exchange data directly between a device and main memory (RAM), without CPU intervention. For devices that exchange a lot of data, making full use of DMA can greatly improve system performance. Refer to "DMA analysis in the Linux kernel".

Direct memory-mapped IO

The DMA read involves the underlying hardware, and hardware generally cannot access user space directly; that is, DMA cannot fill the user buffer directly. Ordinary IO therefore has to move data back and forth between the user buffer and the kernel buffer, which limits IO speed to a certain degree. Is there a solution?

This brings us to the core of what we want to talk about: Direct memory mapped IO.

During memory mapping, a segment of virtual address space is mapped to a part of a file object. No data is copied into memory at that point; only when the process first references a virtual address in that range is a page fault triggered, at which point the OS copies the related file data directly into the process's private user space according to the mapping, as shown in the following figure.

Figure 4

As you can see from Figure 4, both the kernel-space buffer and user-space buffer are mapped to the same physical memory area.

Its main features are as follows:

  1. Operations on the file no longer require read() or write() system calls.
  2. When a user process accesses the address range of a "memory-mapped file", a page fault occurs automatically, and the underlying OS loads the data from the hard disk into memory.
  3. An important reason memory-mapped files are more efficient than standard IO is that no data needs to be copied through an OS kernel buffer; a minimal Java sketch follows this list.
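The sketch below maps a file region with FileChannel.map and then accesses it without further read()/write() system calls. The file name "data.bin" is a placeholder for this sketch; MappedByteBuffer is itself a direct buffer:

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MmapDemo {
    public static void main(String[] args) throws Exception {
        // "data.bin" is a placeholder path for this sketch
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "rw");
             FileChannel channel = file.getChannel()) {
            // Map the first 1024 bytes of the file into memory; subsequent
            // access needs no read()/write() system calls.
            MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            buf.put(0, (byte) 42);   // first access may trigger a page fault
            byte b = buf.get(0);     // served directly from the mapped page
            System.out.println(b);
        }
    }
}
```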

Exploring DirectByteBuffer

With all the groundwork laid above, let's get back to ByteBuffer. ByteBuffer is an abstract class with two main implementations: HeapByteBuffer and DirectByteBuffer. HeapByteBuffer is the in-heap ByteBuffer, implemented in user space on top of a byte[] to store data, as we've already seen. DirectByteBuffer is the off-heap ByteBuffer: it stores data in memory outside the heap and uses direct memory-mapped IO, which is one of the cores of NIO's high performance. So let's analyze the implementation of DirectByteBuffer.

The creation of DirectByteBuffer

We can instantiate a DirectByteBuffer using the java.nio.ByteBuffer#allocateDirect method.

```java
//java.nio.ByteBuffer#allocateDirect
public static ByteBuffer allocateDirect(int capacity) {
    return new DirectByteBuffer(capacity);
}
```

 
```java
DirectByteBuffer(int cap) {   // package-private
    // Initialize the four core attributes of Buffer
    super(-1, 0, cap, cap);
    // Determine whether page alignment is required; controlled by the
    // -XX:+PageAlignDirectMemory flag, which is false by default
    boolean pa = VM.isDirectMemoryPageAligned();
    // Get the memory size per page
    int ps = Bits.pageSize();
    // The size of memory to allocate; if page alignment is needed,
    // add the capacity of one extra page
    long size = Math.max(1L, (long)cap + (pa ? ps : 0));
    // Use the Bits class to track the total reserved memory (rounded per page)
    // and the actual requested capacity
    Bits.reserveMemory(size, cap);

    long base = 0;
    try {
        // Call the Unsafe method to allocate memory
        base = UNSAFE.allocateMemory(size);
    } catch (OutOfMemoryError x) {
        // Failed to allocate memory
        Bits.unreserveMemory(size, cap);
        throw x;
    }

    // Initialize the allocated memory: zero out every byte in the region
    UNSAFE.setMemory(base, size, (byte) 0);
    // Set the memory start address. If page alignment is required and base
    // is not at the start of a page, round the address up to a page boundary
    if (pa && (base % ps != 0)) {
        // Round up to page boundary
        address = base + ps - (base & (ps - 1));
    } else {
        address = base;
    }
    // Create a cleaner that will eventually call Deallocator.run to free the memory
    cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
    att = null;
}
```
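As a quick usage sketch, allocating and using a direct buffer looks no different from a heap buffer from the caller's point of view:

```java
import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(1024); // off-heap allocation
        System.out.println(buf.isDirect());               // true
        buf.putInt(42);                                   // written via Unsafe into native memory
        buf.flip();
        System.out.println(buf.getInt());                 // 42
    }
}
```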
Page alignment

First, VM.isDirectMemoryPageAligned() determines whether page alignment is required. Before going on, let's touch on the underlying theory of alignment.

In modern computer architectures, data is loaded from main memory into the CPU in chunks of 2^N bytes, basically the size of a cache line; reads that fall entirely within one cache line are fastest. Cache lines are commonly 64 or 128 bytes on current machines. The same applies to the starting address: on a 32-bit machine, if a 4-byte chunk of memory straddles two cache lines, loading it into the CPU costs two memory accesses instead of one.

All right, back to the point. Small memory requests are not satisfied at their exact size; they are first rounded up according to alignment rules. These rules can be complex and are typically tailored to the system page size, machine word size, and other system characteristics, with different step sizes used in different size intervals. Here's an example:

| No. | Size interval | Byte alignment |
| --- | --- | --- |
| 0 | [0, 16] | 8 |
| 1 | (16, 128] | 16 |
| 2 | (128, 256] | 32 |
| 3 | (256, 512] | 64 |

Because each interval uses a different step size, an interval is effectively subdivided further. For example, a request in (256, 320] is actually allocated 320 bytes, not 512; a 1-byte request is always allocated 8 bytes. A minimal sketch of this rounding follows.
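Here is a small sketch of the interval-based rounding just described; the intervals and step sizes are the example values from the table above, not the rules of any real allocator:

```java
public class SizeClassAlign {
    // Round a request up per the example table: step 8 in [0,16],
    // 16 in (16,128], 32 in (128,256], 64 in (256,512].
    static int align(int size) {
        int step;
        if (size <= 16)       step = 8;
        else if (size <= 128) step = 16;
        else if (size <= 256) step = 32;
        else                  step = 64;
        return (size + step - 1) / step * step; // round up to a multiple of step
    }

    public static void main(String[] args) {
        System.out.println(align(1));   // 8
        System.out.println(align(300)); // 320, not 512
    }
}
```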

For example, on a 32-bit machine an int variable takes 4 bytes. If the variable's real physical address is 0x400005, the computer fetches four bytes starting at 0x400004 and then four bytes starting at 0x400008, and assembles the variable's value from the last three bytes of the first fetch and the first byte of the second. In other words, a variable starting at an unaligned address may require extra memory accesses, whereas one starting at an aligned address, especially a multiple of the machine word size (bits/8), is accessed efficiently.

When page alignment is required, the kernel always rounds the size passed to vmalloc up to a page boundary and then adds one extra page to the adjusted value, to guard against possible out-of-bounds access. A page is the smallest block of memory that can be transferred to an IO device, so aligning data to the page size and using the page as the allocation unit helps when writing to disk or network devices. By allocating one extra page, data slightly larger than one page can still be accommodated without the extra cost described in the previous paragraph.
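The rounding expression used in the DirectByteBuffer constructor, address = base + ps - (base & (ps - 1)), can be checked with a small worked example; the base address here is an arbitrary illustrative value and a 4096-byte page is assumed:

```java
public class PageAlign {
    public static void main(String[] args) {
        long ps = 4096;                  // assume a 4096-byte page
        long base = 0x10001234L;         // illustrative, not page aligned
        // Same expression as in the DirectByteBuffer constructor:
        long address = base + ps - (base & (ps - 1));
        System.out.println(Long.toHexString(address)); // 10002000, the next page boundary
    }
}
```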

```java
// -- Processor and memory-system properties --

private static int PAGE_SIZE = -1;

// java.nio.Bits#pageSize
static int pageSize() {
    if (PAGE_SIZE == -1)
        PAGE_SIZE = UNSAFE.pageSize();
    return PAGE_SIZE;
}

/**
 * Reports the size in bytes of a native memory page (whatever that is).
 * This value will always be a power of two.
 */
public native int pageSize();
```
Determine whether the available space meets the requirements

A call to java.nio.Bits#reserveMemory determines whether there is enough space to allocate the requested memory:

```java
//java.nio.Bits#tryReserveMemory
private static boolean tryReserveMemory(long size, int cap) {

    // -XX:MaxDirectMemorySize limits the total capacity rather than the
    // actual memory usage, which will differ when buffers are page
    // aligned.
    long totalCap;
    // As long as the used capacity plus the requested capacity does not
    // exceed the maximum available space, try to reserve it
    while (cap <= MAX_MEMORY - (totalCap = TOTAL_CAPACITY.get())) {
        if (TOTAL_CAPACITY.compareAndSet(totalCap, totalCap + cap)) {
            RESERVED_MEMORY.addAndGet(size);
            COUNT.incrementAndGet();
            return true;
        }
    }
    return false;
}

// java.nio.Bits#reserveMemory
// size: the actual memory size required, depending on page alignment
// cap: the user-specified memory size (<= size)
static void reserveMemory(long size, int cap) {
    // Get the maximum amount of direct memory that may be requested.
    // This limit can be set with -XX:MaxDirectMemorySize=<size>
    if (!MEMORY_LIMIT_SET && VM.initLevel() >= 1) {
        MAX_MEMORY = VM.maxDirectMemory();
        MEMORY_LIMIT_SET = true;
    }

    // optimist!
    // If there is enough space, simply return; otherwise continue with the
    // logic below and keep trying to reserve
    if (tryReserveMemory(size, cap)) {
        return;
    }

    final JavaLangRefAccess jlra = SharedSecrets.getJavaLangRefAccess();
    boolean interrupted = false;
    try {

        // Retry allocation until success or there are no more
        // references (including Cleaners that might free direct
        // buffer memory) to process and allocation still fails.
        boolean refprocActive;
        do {
            // In this do-while loop, keep processing pending references
            // (including Cleaners that might free direct buffer memory) and
            // retest whether the reservation now fits. If interrupted while
            // waiting, remember it and re-interrupt the current thread in
            // the finally block below.
            try {
                refprocActive = jlra.waitForReferenceProcessing();
            } catch (InterruptedException e) {
                // Defer interrupts and keep trying.
                interrupted = true;
                refprocActive = true;
            }
            if (tryReserveMemory(size, cap)) {
                return;
            }
        } while (refprocActive);

        // trigger VM's Reference processing
        System.gc();

        long sleepTime = 1;
        int sleeps = 0;
        while (true) {
            if (tryReserveMemory(size, cap)) {
                return;
            }
            if (sleeps >= MAX_SLEEPS) {
                break;
            }
            try {
                if (!jlra.waitForReferenceProcessing()) {
                    Thread.sleep(sleepTime);
                    sleepTime <<= 1;
                    sleeps++;
                }
            } catch (InterruptedException e) {
                interrupted = true;
            }
        }

        // no luck
        throw new OutOfMemoryError("Direct buffer memory");

    } finally {
        if (interrupted) {
            // don't swallow interrupts
            Thread.currentThread().interrupt();
        }
    }
}
```

This method determines whether the requested off-heap memory would exceed the user-specified maximum. If there is enough space, the accounting variables are updated; if no space can be freed up, an OutOfMemoryError is thrown.
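You can observe this limit directly. A small sketch that, when run with -XX:MaxDirectMemorySize=16m (class name and sizes are just illustrative), is expected to fail with "Direct buffer memory":

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Run with: java -XX:MaxDirectMemorySize=16m DirectOomDemo
public class DirectOomDemo {
    public static void main(String[] args) {
        List<ByteBuffer> hold = new ArrayList<>(); // keep references so Cleaners cannot free them
        while (true) {
            hold.add(ByteBuffer.allocateDirect(1024 * 1024)); // 1 MB each
            System.out.println("allocated " + hold.size() + " MB");
        }
        // Eventually throws java.lang.OutOfMemoryError: Direct buffer memory
    }
}
```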

The default maximum amount of off-heap memory that can be requested

As mentioned above, DirectByteBuffer checks whether there is enough space before allocating memory. The user can control the maximum amount of direct memory that DirectByteBuffers may allocate by setting -XX:MaxDirectMemorySize=<size>. But what is this limit by default?

DirectByteBuffer gets this value from sun.misc.VM#maxDirectMemory.

```java
// A user-settable upper limit on the maximum amount of allocatable direct
// buffer memory. This value may be changed during VM initialization if
// "java" is launched with "-XX:MaxDirectMemorySize=<size>".
//
// The initial value of this field is arbitrary; during JRE initialization
// it will be reset to the value specified on the command line, if any,
// otherwise to Runtime.getRuntime().maxMemory().
//
private static long directMemory = 64 * 1024 * 1024;

// Returns the maximum amount of allocatable direct buffer memory.
// The directMemory variable is initialized during system initialization
// in the saveAndRemoveProperties method.
//
public static long maxDirectMemory() {
    return directMemory;
}
```

directMemory is initialized to 64MB here, so is the default maximum for off-heap memory 64MB? No: during JRE startup this value is reset to the user-specified value, or to Runtime.getRuntime().maxMemory() if none is specified.

```java
/**
 * Returns the maximum amount of memory that the Java virtual machine
 * will attempt to use.  If there is no inherent limit then the value
 * {@link java.lang.Long#MAX_VALUE} will be returned.
 *
 * @return  the maximum amount of memory that the virtual machine will
 *          attempt to use, measured in bytes
 * @since   1.4
 */
public native long maxMemory();
```

```c
//src\java.base\share\native\libjava\Runtime.c
JNIEXPORT jlong JNICALL
Java_java_lang_Runtime_maxMemory(JNIEnv *env, jobject this)
{
    return JVM_MaxMemory();
}

//src\hotspot\share\include\jvm.h
JNIEXPORT jlong JNICALL
JVM_MaxMemory(void);

//src\hotspot\share\prims\jvm.cpp
JVM_ENTRY_NO_ENV(jlong, JVM_MaxMemory(void))
  JVMWrapper("JVM_MaxMemory");
  size_t n = Universe::heap()->max_capacity();
  return convert_size_t_to_jlong(n);
JVM_END
```

Let’s look at the JRE initialization source code:

```java
/**
 * java.lang.System#initPhase1
 * Initialize the system class. Called after thread initialization.
 */
private static void initPhase1() {

    // VM might invoke JNU_NewStringPlatform() to set those encoding
    // sensitive properties (user.home, user.name, boot.class.path, etc.)
    // during "props" initialization, in which it may need access, via
    // System.getProperty(), to the related system encoding property that
    // have been initialized (put into "props") at early stage of the
    // initialization. So make sure the "props" is available at the
    // very beginning of the initialization and all system properties to
    // be put into it directly.
    props = new Properties(84);
    initProperties(props);  // initialized by the VM

    // There are certain system configurations that may be controlled by
    // VM options such as the maximum amount of direct memory and
    // Integer cache size used to support the object identity semantics
    // of autoboxing. Typically, the library will obtain these values
    // from the properties set by the VM. If the properties are for
    // internal implementation use only, these properties should be
    // removed from the system properties.
    //
    // See java.lang.Integer.IntegerCache and the
    // VM.saveAndRemoveProperties method for example.
    //
    // Save a private copy of the system properties object that
    // can only be accessed by the internal implementation. Remove
    // certain system properties that are not intended for public access.
    // Let's focus here
    VM.saveAndRemoveProperties(props);

    lineSeparator = props.getProperty("line.separator");
    StaticProperty.javaHome();          // Load StaticProperty to cache the property values
    VersionProps.init();

    FileInputStream fdIn = new FileInputStream(FileDescriptor.in);
    FileOutputStream fdOut = new FileOutputStream(FileDescriptor.out);
    FileOutputStream fdErr = new FileOutputStream(FileDescriptor.err);
    setIn0(new BufferedInputStream(fdIn));
    setOut0(newPrintStream(fdOut, props.getProperty("sun.stdout.encoding")));
    setErr0(newPrintStream(fdErr, props.getProperty("sun.stderr.encoding")));

    // Setup Java signal handlers for HUP, TERM, and INT (where available).
    Terminator.setup();

    // Initialize any miscellaneous operating system settings that need to be
    // set for the class libraries. Currently this is no-op everywhere except
    // for Windows where the process-wide error mode is set before the java.io
    // classes are used.
    VM.initializeOSEnvironment();

    // The main thread is not added to its thread group in the same
    // way as other threads; we must do it ourselves here.
    Thread current = Thread.currentThread();
    current.getThreadGroup().add(current);

    // register shared secrets
    setJavaLangAccess();

    // Subsystems that are invoked during initialization can invoke
    // VM.isBooted() in order to avoid doing things that should
    // wait until the VM is fully initialized. The initialization level
    // is incremented from 0 to 1 here to indicate the first phase of
    // initialization has completed.
    // IMPORTANT: Ensure that this remains the last initialization action!
    VM.initLevel(1);
}
```

The assignment to directMemory takes place in the sun.misc.VM#saveAndRemoveProperties function:

```java
// Save a private copy of the system properties and remove
// the system properties that are not intended for public access.
//
// This method can only be invoked during system initialization.
public static void saveAndRemoveProperties(Properties props) {
    if (initLevel() != 0)
        throw new IllegalStateException("Wrong init level");

    @SuppressWarnings({"rawtypes", "unchecked"})
    Map<String, String> sp =
        Map.ofEntries(props.entrySet().toArray(new Map.Entry[0]));
    // only main thread is running at this time, so savedProps and
    // its content will be correctly published to threads started later
    savedProps = sp;

    // Set the maximum amount of direct memory. This value is controlled
    // by the vm option -XX:MaxDirectMemorySize=<size>.
    // The maximum amount of allocatable direct buffer memory (in bytes)
    // from the system property sun.nio.MaxDirectMemorySize set by the VM.
    // The system property will be removed.
    String s = (String)props.remove("sun.nio.MaxDirectMemorySize");
    if (s != null) {
        if (s.equals("-1")) {
            // -XX:MaxDirectMemorySize not given, take default
            directMemory = Runtime.getRuntime().maxMemory();
        } else {
            long l = Long.parseLong(s);
            if (l > -1)
                directMemory = l;
        }
    }

    // Check if direct buffers should be page aligned
    s = (String)props.remove("sun.nio.PageAlignDirectMemory");
    if ("true".equals(s))
        pageAlignDirectMemory = true;

    // Remove other private system properties
    // used by java.lang.Integer.IntegerCache
    props.remove("java.lang.Integer.IntegerCache.high");

    // used by sun.launcher.LauncherHelper
    props.remove("sun.java.launcher.diag");

    // used by jdk.internal.loader.ClassLoaders
    props.remove("jdk.boot.class.path.append");
}
```

So by default, the maximum direct memory for DirectByteBuffer is Runtime.getRuntime().maxMemory(), which equals the maximum available Java heap size, as specified by the -Xmx argument.
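A quick way to see this default ceiling; since VM.maxDirectMemory is an internal API, this sketch prints the public value it falls back to:

```java
public class MaxMemoryDemo {
    public static void main(String[] args) {
        // With no -XX:MaxDirectMemorySize given, direct memory defaults to this value:
        long max = Runtime.getRuntime().maxMemory();
        System.out.println("max heap (and default direct-memory limit): "
                + max / (1024 * 1024) + " MB");
    }
}
```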

Exploring System.gc

We also see an active call to System.gc() in the code above, intended to clean up unreferenced objects holding allocated direct memory. When a direct-memory allocation would exceed the threshold, the JDK triggers a Full GC through this built-in System.gc() call so that direct memory can be reclaimed. Calling System.gc() itself runs through a chain of logic, so let's explore the details.

```java
//java.lang.System#gc
public static void gc() {
    Runtime.getRuntime().gc();
}

//java.lang.Runtime#gc
public native void gc();
```
```c
JNIEXPORT void JNICALL
Java_java_lang_Runtime_gc(JNIEnv *env, jobject this)
{
    JVM_GC();
}
```

You can see that the JVM_GC() method is called directly; it is implemented in jvm.cpp:

```cpp
//src\hotspot\share\prims\jvm.cpp
JVM_ENTRY_NO_ENV(void, JVM_GC(void))
  JVMWrapper("JVM_GC");
  if (!DisableExplicitGC) {
    Universe::heap()->collect(GCCause::_java_lang_system_gc);
  }
JVM_END

//src\hotspot\share\runtime\interfaceSupport.inline.hpp
#define JVM_ENTRY_NO_ENV(result_type, header)                        \
extern "C" {                                                          \
  result_type JNICALL header {                                        \
    JavaThread* thread = JavaThread::current();                       \
    ThreadInVMfromNative __tiv(thread);                               \
    debug_only(VMNativeEntryWrapper __vew;)                           \
    VM_ENTRY_BASE(result_type, header, thread)
...
#define JVM_END } }

#define VM_ENTRY_BASE(result_type, header, thread)                   \
  TRACE_CALL(result_type, header)                                     \
  HandleMarkCleaner __hm(thread);                                     \
  Thread* THREAD = thread;                                            \
  os::verify_stack_alignment();                                       \
  /* begin of body */
```
A brief analysis of macro definitions

So #define JVM_ENTRY_NO_ENV is a macro definition; if you're not familiar with macros, let me say a few words about them.

Macro definitions come in two kinds:

  1. Macro definitions without parameters

    • Form: #define macro_name macro_body
    • Effect: the macro body is substituted wherever the macro name appears
    • Example: #define TRUE 1
    • Benefit: if the program uses TRUE in several places and its value needs to change, only one change is required
  2. Macros with parameters: #define macro_name(parameters) macro_body

Macros serve several purposes:

  1. Convenient program modification

    • The #define TRUE 1 above is an example
  2. Improved program efficiency

    • Macro expansion happens during preprocessing, requires no memory allocation at runtime, and can partially achieve what a function does, but without the function-call stack overhead, so it is efficient
  3. Enhanced readability

    • This goes without saying: when we see a macro named PI, we naturally assume it stands for the constant pi
  4. String splicing

Such as:

```c
#include <stdio.h>
#define CAT(a,b,c) a##b##c

int main(void)
{
    printf("%d\n", CAT(1, 2, 3));  // tokens pasted into the integer literal 123
    const char *abc = "abc";
    printf("%s\n", CAT(a, b, c));  // tokens pasted into the identifier abc
    return 0;
}
```

The output of the program is:

```
123
abc
```
  5. Parameters converted to strings

Example:

```c
#include <stdio.h>
#define CAT(n) "abc"#n

int main(void)
{
    printf("%s\n", CAT(15));  // #n stringifies 15; "abc" "15" then concatenate
    return 0;
}
```

The output is:

```
abc15
```
  6. Program debugging and tracing

    • Common predefined macros used for debugging are __LINE__, __FILE__, __DATE__, __TIME__, and __STDC__
  7. Variadic macros, for example:

```c
#define PR(...) printf(__VA_ARGS__)
```

It's a little like the interpreter pattern. In simple terms, the two sides agree in advance that when I say "1", you say "I'm born to be useful". We define it as follows:

```c
#define A abcdefg  /* the body can also be a long piece of code or a function */
```

Similarly, a macro is a contract between you and the compiler: you tell it that when you write A, this is what you mean. During preprocessing, every occurrence of A is replaced with that string.

To dig deeper, see [C++ macro definitions explained](www.cnblogs.com/fnlingnzb-l…).

Looking back at the JVM_GC()-related code from jvm.cpp listed earlier, we can see that interfaceSupport.inline.hpp defines the logic of the JVM_ENTRY_NO_ENV macro, while the code below defines the logic associated with JVM_GC; JVM_GC is then executed as sub-logic inside the JVM_ENTRY_NO_ENV macro expansion.

```cpp
JVM_ENTRY_NO_ENV(void, JVM_GC(void))
  JVMWrapper("JVM_GC");
  if (!DisableExplicitGC) {
    Universe::heap()->collect(GCCause::_java_lang_system_gc);
  }
JVM_END
```

Here we also run into AccessController.doPrivileged, a method we commonly use in the JDK; its corresponding implementation in jvm.cpp is as follows:

```cpp
JVM_ENTRY(jobject, JVM_DoPrivileged(JNIEnv *env, jclass cls, jobject action, jobject context, jboolean wrapException))
  JVMWrapper("JVM_DoPrivileged");
  // method body omitted
JVM_END
```

JVM_ENTRY is also a macro, defined in interfaceSupport.inline.hpp:

```cpp
#define JVM_ENTRY(result_type, header)                               \
extern "C" {                                                         \
  result_type JNICALL header {                                       \
    JavaThread* thread=JavaThread::thread_from_jni_environment(env); \
    ThreadInVMfromNative __tiv(thread);                              \
    debug_only(VMNativeEntryWrapper __vew;)                          \
    VM_ENTRY_BASE(result_type, header, thread)
```

After macro expansion, the result is as follows:

extern "C"{\jobject JNICALL JVM_DoPrivileged(JNIEnv *env, jclass cls, jobject action, jobject context, jboolean wrapException) { \ JavaThread* thread=JavaThread::thread_from_jni_environment(env); \ ThreadInVMfromNative __tiv(thread); \ debug_only(VMNativeEntryWrapper __vew;) \... }}Copy the code

In the JVM_ENTRY_NO_ENV macro defined in interfaceSupport.inline.hpp, extern "C" means the code that follows is compiled with C linkage; this is how C++ code can embed C code.

JNICALL, which appears throughout the source, is an empty macro definition; it just tells the reader that this is a JNI call. The macro is defined as follows:

```c
#define JNICALL
```

For more on JNI, refer to the documentation at https://docs.oracle.com/javase/8/docs/technotes/guides/jni/spec/jniTOC.html.

JVM_GC method interpretation

The heap's collect method is called with GCCause::_java_lang_system_gc, the cause recorded for collections triggered by System.gc(). We can look at the associated source to see the various causes that can trigger a GC:

```cpp
//
// This class exposes implementation details of the various
// collector(s), and we need to be very careful with it. If
// use of this class grows, we should split it into public
// and implementation-private "causes".
//
// The definitions in the SA code should be kept in sync
// with the definitions here.
//
// src\hotspot\share\gc\shared\gcCause.hpp
class GCCause : public AllStatic {
 public:
  enum Cause {
    /* public */
    _java_lang_system_gc,
    _full_gc_alot,
    _scavenge_alot,
    _allocation_profiler,
    _jvmti_force_gc,
    _gc_locker,
    _heap_inspection,
    _heap_dump,
    _wb_young_gc,
    _wb_conc_mark,
    _wb_full_gc,

    /* implementation independent, but reserved for GC use */
    _no_gc,
    _no_cause_specified,
    _allocation_failure,

    /* implementation specific */

    _tenured_generation_full,
    _metadata_GC_threshold,
    _metadata_GC_clear_soft_refs,

    _cms_generation_full,
    _cms_initial_mark,
    _cms_final_remark,
    _cms_concurrent_mark,

    _old_generation_expanded_on_last_scavenge,
    _old_generation_too_full_to_scavenge,
    _adaptive_size_policy,

    _g1_inc_collection_pause,
    _g1_humongous_allocation,

    _dcmd_gc_run,

    _z_timer,
    _z_warmup,
    _z_allocation_rate,
    _z_allocation_stall,
    _z_proactive,

    _last_gc_cause
  };
```

Let's go back to the JVM_GC definition. Note that if DisableExplicitGC is true, the collect method is not executed, which turns System.gc() into a no-op. DisableExplicitGC is controlled by the -XX:+DisableExplicitGC flag; its default is false, and you can set it to true.

If DisableExplicitGC keeps its default value, Universe::heap()->collect(GCCause::_java_lang_system_gc) is executed; collect is invoked on the heap returned by Universe::heap():

```cpp
// The particular choice of collected heap.
static CollectedHeap* heap() { return _collectedHeap; }

CollectedHeap* Universe::_collectedHeap = NULL;

CollectedHeap* Universe::create_heap() {
  assert(_collectedHeap == NULL, "Heap already created");
  return GCConfig::arguments()->create_heap();
}
```

There are several types of heap; which one is used depends on the GC algorithm we choose. Here we take the commonly used CMS GC as an example, whose heap is CMSHeap, so let's look at the corresponding methods of CMSHeap.

```cpp
//src\hotspot\share\gc\cms\cmsHeap.hpp
class CMSHeap : public GenCollectedHeap {
public:
  CMSHeap(GenCollectorPolicy *policy);
  ...
};

//src\hotspot\share\gc\cms\cmsHeap.cpp
void CMSHeap::collect(GCCause::Cause cause) {
  if (should_do_concurrent_full_gc(cause)) {
    // Mostly concurrent full collection.
    collect_mostly_concurrent(cause);
  } else {
    GenCollectedHeap::collect(cause);
  }
}

//src\hotspot\share\gc\shared\genCollectedHeap.cpp
void GenCollectedHeap::collect(GCCause::Cause cause) {
  if (cause == GCCause::_wb_young_gc) {
    // Young collection for the WhiteBox API.
    collect(cause, YoungGen);
  } else {
#ifdef ASSERT
    if (cause == GCCause::_scavenge_alot) {
      // Young collection only.
      collect(cause, YoungGen);
    } else {
      // Stop-the-world full collection.
      collect(cause, OldGen);
    }
#else
    // Stop-the-world full collection.
    collect(cause, OldGen);
#endif
  }
}
```

The should_do_concurrent_full_gc method determines whether a concurrent ("parallel") Full GC is required; if so, collect_mostly_concurrent is called to perform it. Otherwise, the logic generally falls through to collect(cause, OldGen), a stop-the-world full collection, commonly called an STW Full GC.

Here is should_do_concurrent_full_gc:

```cpp
bool CMSHeap::should_do_concurrent_full_gc(GCCause::Cause cause) {
  switch (cause) {
    case GCCause::_gc_locker:           return GCLockerInvokesConcurrent;
    case GCCause::_java_lang_system_gc:
    case GCCause::_dcmd_gc_run:         return ExplicitGCInvokesConcurrent;
    default:                            return false;
  }
}
```

So for _java_lang_system_gc, a parallel Full GC is performed when ExplicitGCInvokesConcurrent is true. This introduces another parameter: configure -XX:+ExplicitGCInvokesConcurrent to enable the parallel Full GC; it defaults to false.

Parallel Full GC

Let's take a look at how collect_mostly_concurrent performs a parallel Full GC.

```cpp
//src\hotspot\share\gc\cms\cmsHeap.cpp
void CMSHeap::collect_mostly_concurrent(GCCause::Cause cause) {
  assert(!Heap_lock->owned_by_self(), "Should not own Heap_lock");

  MutexLocker ml(Heap_lock);
  // Read the GC counts while holding the Heap_lock
  unsigned int full_gc_count_before = total_full_collections();
  unsigned int gc_count_before      = total_collections();
  {
    MutexUnlocker mu(Heap_lock);
    VM_GenCollectFullConcurrent op(gc_count_before, full_gc_count_before, cause);
    VMThread::execute(&op);
  }
}
```

Finally, the VMThread executes VM_GenCollectFullConcurrent::doit() to perform the collection (the English comments are clear enough that we won't explain them further):

```cpp
// VM operation to invoke a concurrent collection of a
// GenCollectedHeap heap.
void VM_GenCollectFullConcurrent::doit() {
  assert(Thread::current()->is_VM_thread(), "Should be VM thread");
  assert(GCLockerInvokesConcurrent || ExplicitGCInvokesConcurrent, "Unexpected");

  CMSHeap* heap = CMSHeap::heap();
  if (_gc_count_before == heap->total_collections()) {
    // The "full" of do_full_collection call below "forces"
    // a collection; the second arg, 0, below ensures that
    // only the young gen is collected. XXX In the future,
    // we'll probably need to have something in this interface
    // to say do this only if we are sure we will not bail
    // out to a full collection in this attempt, but that's
    // for the future.
    assert(SafepointSynchronize::is_at_safepoint(),
      "We can only be executing this arm of if at a safepoint");
    GCCauseSetter gccs(heap, _gc_cause);
    heap->do_full_collection(heap->must_clear_all_soft_refs(), GenCollectedHeap::YoungGen);
  } // Else no need for a foreground young gc
  assert((_gc_count_before < heap->total_collections()) ||
         (GCLocker::is_active() /* gc may have been skipped */
          && (_gc_count_before == heap->total_collections())),
         "total_collections() should be monotonically increasing");

  MutexLockerEx x(FullGCCount_lock, Mutex::_no_safepoint_check_flag);
  assert(_full_gc_count_before <= heap->total_full_collections(), "Error");
  if (heap->total_full_collections() == _full_gc_count_before) {
    // Nudge the CMS thread to start a concurrent collection.
    CMSCollector::request_full_gc(_full_gc_count_before, _gc_cause);
  } else {
    assert(_full_gc_count_before < heap->total_full_collections(), "Error");
    FullGCCount_lock->notify_all();  // Inform the Java thread its work is done
  }
}
```

VM_GenCollectFullConcurrent::doit() in turn calls the CMSCollector::request_full_gc method:

```cpp
//src\hotspot\share\gc\cms\concurrentMarkSweepGeneration.cpp
void CMSCollector::request_full_gc(unsigned int full_gc_count, GCCause::Cause cause) {
  CMSHeap* heap = CMSHeap::heap();
  unsigned int gc_count = heap->total_full_collections();
  if (gc_count == full_gc_count) {
    MutexLockerEx y(CGC_lock, Mutex::_no_safepoint_check_flag);
    _full_gc_requested = true;
    _full_gc_cause = cause;
    CGC_lock->notify();   // nudge CMS thread
  } else {
    assert(gc_count > full_gc_count, "Error: causal loop");
  }
}
```

The main point here is that when gc_count == full_gc_count, _full_gc_requested is set to true and the CMS collection thread is woken up. It should be mentioned that CMS GC has a background thread that keeps scanning to decide whether a CMS GC should run; by default it scans every 2 seconds, and one of its conditions is whether _full_gc_requested is true. If it is, a CMS GC is performed, collecting the Old and Perm areas once.

Normal Full GC

A normal Full GC executes the following logic:

```cpp
void GenCollectedHeap::collect(GCCause::Cause cause, GenerationType max_generation) {
  // The caller doesn't have the Heap_lock
  assert(!Heap_lock->owned_by_self(), "this thread should not own the Heap_lock");
  MutexLocker ml(Heap_lock);
  collect_locked(cause, max_generation);
}

// this is the private collection interface
// The Heap_lock is expected to be held on entry.
//src\hotspot\share\gc\shared\genCollectedHeap.cpp
void GenCollectedHeap::collect_locked(GCCause::Cause cause, GenerationType max_generation) {
  // Read the GC count while holding the Heap_lock
  unsigned int gc_count_before      = total_collections();
  unsigned int full_gc_count_before = total_full_collections();
  {
    MutexUnlocker mu(Heap_lock);  // give up heap lock, execute gets it back
    VM_GenCollectFull op(gc_count_before, full_gc_count_before, cause, max_generation);
    VMThread::execute(&op);
  }
}
```

The VMThread then executes VM_GenCollectFull::doit() to perform the collection:

```cpp
//src\hotspot\share\gc\shared\vmGCOperations.cpp
void VM_GenCollectFull::doit() {
  SvcGCMarker sgcm(SvcGCMarker::FULL);

  GenCollectedHeap* gch = GenCollectedHeap::heap();
  GCCauseSetter gccs(gch, _gc_cause);
  gch->do_full_collection(gch->must_clear_all_soft_refs(), _max_generation);
}

//src\hotspot\share\gc\shared\genCollectedHeap.cpp
void GenCollectedHeap::do_full_collection(bool clear_all_soft_refs,
                                          GenerationType last_generation) {
  GenerationType local_last_generation;
  if (!incremental_collection_will_fail(false /* don't consult_young */) &&
      gc_cause() == GCCause::_gc_locker) {
    local_last_generation = YoungGen;
  } else {
    local_last_generation = last_generation;
  }

  do_collection(true,                   // full
                clear_all_soft_refs,    // clear_all_soft_refs
                0,                      // size
                false,                  // is_tlab
                local_last_generation); // last_generation
  // Hack XXX FIX ME !!!
  // A scavenge may not have been attempted, or may have
  // been attempted and failed, because the old gen was too full
  if (local_last_generation == YoungGen && gc_cause() == GCCause::_gc_locker &&
      incremental_collection_will_fail(false /* don't consult_young */)) {
    log_debug(gc, jni)("GC locker: Trying a full collection because scavenge failed");
    // This time allow the old gen to be collected as well
    do_collection(true,                // full
                  clear_all_soft_refs, // clear_all_soft_refs
                  0,                   // size
                  false,               // is_tlab
                  OldGen);             // last_generation
  }
}
```

A Full GC is finally performed through GenCollectedHeap's do_full_collection method, which collects the Young, Old, and Perm areas. Even though the Old area uses CMS GC, it is compacted as well, i.e. MSC (mark-sweep-compact).

Parallel and normal Full GC comparison

Stop the world

The JVM has a dedicated VMThread that polls its own queue, which holds VM_operation requests such as those triggered by memory-allocation failures or explicit GC requests. When a GC operation runs, the other business threads are first brought to a safepoint and stop executing any bytecode until the GC finishes; the whole process is stopped, hence stop-the-world (STW).

CMS GC

CMS GC runs in two modes: background and foreground. As the name implies, a background collection is triggered in the background without interrupting normal business threads, for example when the occupancy of the old generation exceeds a threshold. Such a collection goes through all phases of CMS GC, pausing where it must and running concurrently where it can, and is relatively efficient, since some GC phases run in parallel with the business threads. The foreground mode is the opposite: for example, a business thread requests an allocation but memory is insufficient, which may trigger a CMS GC that must finish before the allocation can proceed, so the whole process is STW. To shorten the pause, a foreground CMS GC does not go through all phases; it skips mainly the concurrent ones, namely Precleaning, AbortablePreclean, and Resizing, and the sweep phase runs synchronously. Even so, with a foreground CMS GC the business threads are unavailable for the whole duration, which hurts efficiency badly.

A normal Full GC is a Full GC in the full sense of the term. There are also scenarios that call the Full GC interface but do not do all the work: sometimes only a Young GC is done, sometimes only a CMS GC. As the earlier code shows, the work is ultimately executed by the VMThread, so the total pause time is the sum of the Young GC and the CMS GC, where the CMS GC is the foreground mode described above. The whole process is therefore long and should be avoided.

A parallel Full GC also performs a YGC and a CMS GC, but it is more efficient because its CMS GC runs in background mode; the overall pause consists mainly of the YGC plus the CMS initial-mark and remark phases.

The comment on the GenCollectedHeap::collect method says "The caller doesn't have the Heap_lock", i.e. the caller does not hold Heap_lock; this can also be understood as the foreground mode.

Conclusion

System.gc() triggers a Full GC and can be disabled with the -XX:+DisableExplicitGC flag. You can also use -XX:+ExplicitGCInvokesConcurrent to run a parallel Full GC and improve performance. However, calling System.gc() is generally discouraged, because a Full GC takes a long time and has a significant impact on the application. Setting -XX:+DisableExplicitGC is also not recommended, especially when direct memory is in use, because the JDK relies on System.gc() to reclaim off-heap memory when allocation would otherwise fail.

Reference blogs:

lovestblog.cn/blog/2015/0…

www.jianshu.com/p/40412b008…

Allocating memory with Unsafe.allocateMemory

sun.misc.Unsafe provides a set of methods for allocating, reallocating, and freeing memory. They are similar to C's malloc/free functions:

  1. long Unsafe.allocateMemory(long size): allocates a block of memory. The memory may contain garbage data (it is not automatically zeroed). If allocation fails, it throws a java.lang.OutOfMemoryError. It returns a non-zero memory address.
  2. Unsafe.reallocateMemory(long address, long size): reallocates a block of memory, copying the data from the old buffer (at address) into the newly allocated block. If address is 0, this method has the same effect as allocateMemory. It returns the address of the new memory buffer.
  3. Unsafe.freeMemory(long address): frees a memory buffer produced by the previous two methods. If address is 0, it does nothing.
```java
//jdk.internal.misc.Unsafe#allocateMemory
public long allocateMemory(long bytes) {
    allocateMemoryChecks(bytes);

    if (bytes == 0) {
        return 0;
    }

    long p = allocateMemory0(bytes);
    if (p == 0) {
        throw new OutOfMemoryError();
    }

    return p;
}

//jdk.internal.misc.Unsafe#allocateMemory0
private native long allocateMemory0(long bytes);
```

The native method allocateMemory0 is defined as follows:

```cpp
//src\hotspot\share\prims\unsafe.cpp
UNSAFE_ENTRY(jlong, Unsafe_AllocateMemory0(JNIEnv *env, jobject unsafe, jlong size)) {
  size_t sz = (size_t)size;

  sz = align_up(sz, HeapWordSize);
  void* x = os::malloc(sz, mtOther);

  return addr_to_java(x);
} UNSAFE_END
```

As you can see, sun.misc.Unsafe#allocateMemory ultimately allocates memory with the C library function malloc. On Linux, that most likely means ptmalloc from the glibc that ships with the system.
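A hedged sketch of using these methods directly; in ordinary code Unsafe must be fished out via reflection, and misuse of raw addresses can crash the JVM:

```java
import java.lang.reflect.Field;
import sun.misc.Unsafe;

public class UnsafeMemoryDemo {
    public static void main(String[] args) throws Exception {
        // Obtain the Unsafe singleton via reflection
        Field f = Unsafe.class.getDeclaredField("theUnsafe");
        f.setAccessible(true);
        Unsafe unsafe = (Unsafe) f.get(null);

        long address = unsafe.allocateMemory(8);     // malloc-like: may contain garbage
        unsafe.putLong(address, 42L);                // write to the raw address
        System.out.println(unsafe.getLong(address)); // 42
        unsafe.freeMemory(address);                  // must be freed manually, like free()
    }
}
```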

DirectByteBuffer memory release principle

At the end of the constructor for DirectByteBuffer, we see this line of code:

```java
// Create a cleaner that will eventually call Deallocator.run to free the memory
cleaner = Cleaner.create(this, new Deallocator(base, size, cap));
```

The DirectByteBuffer itself is a Java object stored in the heap and can be automatically reclaimed by the JDK's GC. However, the direct memory it requests is outside GC scope and cannot be reclaimed automatically. So can we register a hook for the heap object DirectByteBuffer (via the Runnable interface's run method) such that when the DirectByteBuffer object is collected by GC, the run method is called and the direct memory it references is freed by calling freeMemory there? As the code above shows, this registration is done through the create method of the Cleaner class (jdk.internal.ref.Cleaner in JDK 9+, previously sun.misc.Cleaner).

```java
//jdk.internal.ref.Cleaner#create
/**
 * Creates a new cleaner.
 *
 * @param  ob the referent object to be cleaned
 * @param  thunk
 *         The cleanup code to be run when the cleaner is invoked.  The
 *         cleanup code is run directly from the reference-handler thread,
 *         so it should be as simple and straightforward as possible.
 *
 * @return  The new cleaner
 */
public static Cleaner create(Object ob, Runnable thunk) {
    if (thunk == null)
        return null;
    return add(new Cleaner(ob, thunk));
}

//jdk.internal.ref.Cleaner#clean
/**
 * Runs this cleaner, if it has not been run before.
 */
public void clean() {
    if (!remove(this))
        return;
    try {
        thunk.run();
    } catch (final Throwable x) {
        AccessController.doPrivileged(new PrivilegedAction<>() {
            public Void run() {
                if (System.err != null)
                    new Error("Cleaner terminated abnormally", x)
                        .printStackTrace();
                System.exit(1);
                return null;
            }
        });
    }
}
```

The first argument is the heap object to monitor (here, a DirectByteBuffer object), and the second is a Runnable task defining the callback to execute when that heap object is reclaimed. In the last line of the DirectByteBuffer constructor, the two arguments passed in are this and a Deallocator (which implements the Runnable interface), where this refers to the current DirectByteBuffer instance. When the current DirectByteBuffer is reclaimed, the Deallocator's run method is invoked to free the direct memory it references, as shown in the following code:

private static class Deallocator
    implements Runnable
{
    private long address;
    private long size;
    private int capacity;

    private Deallocator(long address, long size, int capacity) {
        assert (address != 0);
        this.address = address;
        this.size = size;
        this.capacity = capacity;
    }

    public void run() {
        if (address == 0) {
            // Paranoia
            return;
        }
        UNSAFE.freeMemory(address);
        address = 0;
        Bits.unreserveMemory(size, capacity);
    }
}

As you can see, the run method calls Unsafe.freeMemory to release the direct memory it references.

DirectByteBuffer Memory release process

Since DirectByteBuffer requests memory outside the heap, and the DirectByteBuffer object itself only stores the starting address of that memory, the memory footprint of a DirectByteBuffer consists of the DirectByteBuffer object in the heap plus the corresponding memory space outside the heap.

In the past we might have reached for Java's finalize method here, but the finalize mechanism is not recommended by the Java team because it has many pitfalls. The recommended approach is to use a phantom reference to perform follow-up processing when an object is reclaimed. To simplify this, the JDK provides the Cleaner class, a subclass of PhantomReference, which triggers the corresponding Runnable callback when the PhantomReference is enqueued in its ReferenceQueue.
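As a minimal sketch of this mechanism, the public java.lang.ref.Cleaner API introduced in JDK 9 wraps the same PhantomReference-plus-ReferenceQueue pattern; the resource class and the printed message below are illustrative only.

import java.lang.ref.Cleaner;

public class CleanerDemo {
    private static final Cleaner CLEANER = Cleaner.create();

    static class NativeResource implements AutoCloseable {
        // The cleanup action must not hold a reference to the owner object,
        // otherwise the owner could never become phantom-reachable
        private static class State implements Runnable {
            public void run() {
                System.out.println("releasing native memory");
            }
        }

        private final Cleaner.Cleanable cleanable;

        NativeResource() {
            this.cleanable = CLEANER.register(this, new State());
        }

        public void close() {
            // Idempotent: runs the action at most once; if close() is never
            // called, the Cleaner runs it after the object becomes unreachable
            cleanable.clean();
        }
    }
}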

DirectByteBuffer Read and write operations

DirectByteBuffer ultimately uses the methods sun.misc.Unsafe#getByte(long) and sun.misc.Unsafe#putByte(long, byte) to read and write bytes at specified positions in the off-heap memory space; the address field locates the corresponding memory position, as shown below.

//java.nio.Buffer#nextGetIndex()
final int nextGetIndex() {                          // package-private
    if (position >= limit)
        throw new BufferUnderflowException();
    return position++;
}

//java.nio.DirectByteBuffer
public long address() {
    return address;
}

private long ix(int i) {
    return address + ((long)i << 0);
}

public byte get() {
    try {
        return ((UNSAFE.getByte(ix(nextGetIndex()))));
    } finally {
        Reference.reachabilityFence(this);
    }
}

public byte get(int i) {
    try {
        return ((UNSAFE.getByte(ix(checkIndex(i)))));
    } finally {
        Reference.reachabilityFence(this);
    }
}

public ByteBuffer put(byte x) {
    try {
        UNSAFE.putByte(ix(nextPutIndex()), ((x)));
    } finally {
        Reference.reachabilityFence(this);
    }
    return this;
}

public ByteBuffer put(int i, byte x) {
    try {
        UNSAFE.putByte(ix(checkIndex(i)), ((x)));
    } finally {
        Reference.reachabilityFence(this);
    }
    return this;
}
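A short sketch exercising the relative and absolute get/put paths above through the public API; the buffer size and values are arbitrary.

import java.nio.ByteBuffer;

public class DirectBufferDemo {
    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocateDirect(16); // backed by off-heap memory
        buf.put((byte) 1);     // relative put -> UNSAFE.putByte(ix(nextPutIndex()), x)
        buf.put(5, (byte) 2);  // absolute put -> UNSAFE.putByte(ix(checkIndex(5)), x)
        System.out.println(buf.get(0)); // 1, absolute get
        System.out.println(buf.get(5)); // 2, absolute get
    }
}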

Two or three things about MappedByteBuffer

Logically, MappedByteBuffer ought to be a subclass of DirectByteBuffer, but to keep the class hierarchy clear and simple, and for optimization purposes, the reverse is more appropriate: DirectByteBuffer extends MappedByteBuffer. Also, DirectByteBuffer is a package-private class (there is no access modifier before the class keyword), while abstract classes are defined to be extensible, so you can see why the JDK is designed this way. Although a MappedByteBuffer's memory is reclaimed much like a DirectByteBuffer's (that is, outside the heap GC), the size allocated for a MappedByteBuffer is not constrained by the -XX:MaxDirectMemorySize parameter. Because system-level I/O operations are required, a FileDescriptor must be set to map the buffer onto the file; if the buffer is not mapped, the FileDescriptor is null.

MappedByteBuffer encapsulates memory-mapped file operations, which means it can only perform file IO. A MappedByteBuffer is a mapped buffer generated by mmap and mapped onto the corresponding file pages. Operations on a MappedByteBuffer act directly on the mapped buffer, which in turn is mapped to those file pages; the operating system reads and writes the file by paging the corresponding memory pages in and out.

Interpreting the map method in FileChannel

We can obtain a MappedByteBuffer through java.nio.channels.FileChannel#map(MapMode mode, long position, long size). Let's look at its implementation in sun.nio.ch.FileChannelImpl:

	
private static final int MAP_RO = 0;
private static final int MAP_RW = 1;
private static final int MAP_PV = 2;

//sun.nio.ch.FileChannelImpl#map
public MappedByteBuffer map(MapMode mode, long position, long size)
    throws IOException
{
    ensureOpen();
    if (mode == null)
        throw new NullPointerException("Mode is null");
    if (position < 0L)
        throw new IllegalArgumentException("Negative position");
    if (size < 0L)
        throw new IllegalArgumentException("Negative size");
    if (position + size < 0)
        throw new IllegalArgumentException("Position + size overflow");
    // maximum 2G
    if (size > Integer.MAX_VALUE)
        throw new IllegalArgumentException("Size exceeds Integer.MAX_VALUE");

    int imode = -1;
    if (mode == MapMode.READ_ONLY)
        imode = MAP_RO;
    else if (mode == MapMode.READ_WRITE)
        imode = MAP_RW;
    else if (mode == MapMode.PRIVATE)
        imode = MAP_PV;
    assert (imode >= 0);
    if ((mode != MapMode.READ_ONLY) && !writable)
        throw new NonWritableChannelException();
    if (!readable)
        throw new NonReadableChannelException();

    long addr = -1;
    int ti = -1;
    try {
        beginBlocking();
        ti = threads.add();
        if (!isOpen())
            return null;

        long mapSize;
        int pagePosition;
        synchronized (positionLock) {
            long filesize;
            do {
                // nd.size() returns the actual file size
                filesize = nd.size(fd);
            } while ((filesize == IOStatus.INTERRUPTED) && isOpen());
            if (!isOpen())
                return null;

            // If the actual file size is smaller than the requested mapping
            // range, grow the file; the extended portion is zero-filled by default.
            if (filesize < position + size) { // Extend file size
                if (!writable) {
                    throw new IOException("Channel not open for writing " +
                        "- cannot extend file to required size");
                }
                int rv;
                do {
                    // Increase the file size
                    rv = nd.truncate(fd, position + size);
                } while ((rv == IOStatus.INTERRUPTED) && isOpen());
                if (!isOpen())
                    return null;
            }

            // If the requested mapping size is 0, no mmap system call is made;
            // just generate a DirectByteBuffer with a size of 0 and return it
            if (size == 0) {
                addr = 0;
                // a valid file descriptor is not required
                FileDescriptor dummy = new FileDescriptor();
                if ((!writable) || (imode == MAP_RO))
                    return Util.newMappedByteBufferR(0, 0, dummy, null);
                else
                    return Util.newMappedByteBuffer(0, 0, dummy, null);
            }

            // allocationGranularity is the allocation granularity of the mapping;
            // pagePosition is the offset of position within that granularity
            pagePosition = (int)(position % allocationGranularity);
            // Get the mapping position, i.e. map from mapPosition
            long mapPosition = position - pagePosition;
            // Grow the mapping by pagePosition so it starts at a page boundary
            mapSize = size + pagePosition;
            try {
                // This will be explained later.
                // If map0 did not throw an exception, the address is valid
                addr = map0(imode, mapPosition, mapSize);
            } catch (OutOfMemoryError x) {
                // An OutOfMemoryError may indicate that we've exhausted
                // memory so force gc and re-attempt map
                System.gc();
                try {
                    Thread.sleep(100);
                } catch (InterruptedException y) {
                    Thread.currentThread().interrupt();
                }
                try {
                    addr = map0(imode, mapPosition, mapSize);
                } catch (OutOfMemoryError y) {
                    // After a second OOME, fail
                    throw new IOException("Map failed", y);
                }
            }
        } // synchronized

        // On Windows, and potentially other platforms, we need an open
        // file descriptor for some mapping operations.
        FileDescriptor mfd;
        try {
            mfd = nd.duplicateForMapping(fd);
        } catch (IOException ioe) {
            unmap0(addr, mapSize);
            throw ioe;
        }

        assert (IOStatus.checkAll(addr));
        assert (addr % allocationGranularity == 0);
        int isize = (int)size;
        Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
        if ((!writable) || (imode == MAP_RO)) {
            return Util.newMappedByteBufferR(isize,
                                             addr + pagePosition,
                                             mfd,
                                             um);
        } else {
            return Util.newMappedByteBuffer(isize,
                                            addr + pagePosition,
                                            mfd,
                                            um);
        }
    } finally {
        threads.remove(ti);
        endBlocking(IOStatus.checkAll(addr));
    }
}

Let's look at the sun.nio.ch.FileChannelImpl#map0 implementation:

//src\java.base\unix\native\libnio\ch\FileChannelImpl.c
JNIEXPORT jlong JNICALL
Java_sun_nio_ch_FileChannelImpl_map0(JNIEnv *env, jobject this,
                                     jint prot, jlong off, jlong len)
{
    void *mapAddress = 0;
    jobject fdo = (*env)->GetObjectField(env, this, chan_fd);
     // Get the read state of the file being operated on, i.e. the value of the corresponding file descriptor
    jint fd = fdval(env, fdo);
    int protections = 0;
    int flags = 0;

    if (prot == sun_nio_ch_FileChannelImpl_MAP_RO) {
        protections = PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_RW) {
        protections = PROT_WRITE | PROT_READ;
        flags = MAP_SHARED;
    } else if (prot == sun_nio_ch_FileChannelImpl_MAP_PV) {
        protections =  PROT_WRITE | PROT_READ;
        flags = MAP_PRIVATE;
    }
    // mmap64 is a macro definition; the actual call in the end is mmap
    mapAddress = mmap64(
        0,                    /* Let OS decide location */
        len,                  /* Number of bytes to map */
        protections,          /* File permissions */
        flags,                /* Changes are shared */
        fd,                   /* File descriptor of mapped file */
        off);                 /* Offset into file */

    if (mapAddress == MAP_FAILED) {
        if (errno == ENOMEM) {
            // If no mapping succeeds, OutOfMemoryError is thrown
            JNU_ThrowOutOfMemoryError(env, "Map failed");
            return IOS_THROWN;
        }
        return handle(env, -1, "Map failed");
    }

    return ((jlong) (unsigned long) mapAddress);
}

Note that although the size parameter of FileChannel.map() is a long, the maximum value of size is Integer.MAX_VALUE, so at most 2G of space can be mapped. The mmap provided by the operating system can actually map more, but Java limits it to 2G here. If you use Spark to process large data files, you may encounter logs like the following:

WARN scheduler.TaskSetManager: Lost task 19.0 in stage 6.0 (TID 120, 10.111.32.47): java.lang.IllegalArgumentException: Size exceeds Integer.MAX_VALUE
at sun.nio.ch.FileChannelImpl.map(FileChannelImpl.java:828)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:123)
at org.apache.spark.storage.DiskStore.getBytes(DiskStore.scala:132)
at org.apache.spark.storage.BlockManager.doGetLocal(BlockManager.scala:517)
at org.apache.spark.storage.BlockManager.getLocal(BlockManager.scala:432)
at org.apache.spark.storage.BlockManager.get(BlockManager.scala:618)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:146)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)


Combined with previous source code:

// maximum 2G
if (size > Integer.MAX_VALUE)
    throw new IllegalArgumentException("Size exceeds Integer.MAX_VALUE");

The log makes it easy to locate the error and see why it occurred, and now we understand its nature more deeply. Since we cannot change the 2G limit, we can increase the number of containers instead, that is, manually set the number of partitions in the RDD. Here the RDD had Spark's default of 18 partitions; manually raising the partition count to 500 resolved the problem above. You can use rdd.repartition(numPart: Int) to reset the number of partitions after the RDD is loaded.

val data_new = data.repartition(500)

A MappedByteBuffer is a buffer generated by mmap, created and managed directly by the operating system. In the end, the JVM calls munmap to have the operating system release the memory directly.

private static void unmap(MappedByteBuffer bb) {
    Cleaner cl = ((DirectBuffer)bb).cleaner();
    if (cl != null)
        cl.clean();
}

Here you can see that a MappedByteBuffer-typed parameter is passed in. Going back to the sun.nio.ch.FileChannelImpl#map implementation: to facilitate reclamation, the file descriptor being operated on is wrapped again via mfd = nd.duplicateForMapping(fd), and a memory-release action is also defined through an implementation of the Runnable interface (here, the Unmapper): Unmapper um = new Unmapper(addr, mapSize, isize, mfd). Finally, since we want to return a MappedByteBuffer object, we have the following code:

int isize = (int)size;
Unmapper um = new Unmapper(addr, mapSize, isize, mfd);
if ((!writable) || (imode == MAP_RO)) {
    return Util.newMappedByteBufferR(isize,
                                     addr + pagePosition,
                                     mfd,
                                     um);
} else {
    return Util.newMappedByteBuffer(isize,
                                    addr + pagePosition,
                                    mfd,
                                    um);
}

This creates a DirectByteBuffer object whose reclamation strategy differs from the java.nio.ByteBuffer#allocateDirect (that is, java.nio.DirectByteBuffer#DirectByteBuffer(int)) we encountered before: here, munmap must ultimately be called so that the operating system reclaims the memory.

protected DirectByteBuffer(int cap, long addr,
                           FileDescriptor fd,
                           Runnable unmapper)
{
    super(-1, 0, cap, cap, fd);
    address = addr;
    cleaner = Cleaner.create(this, unmapper);
    att = null;
}

// -- Memory-mapped buffers --
//sun.nio.ch.FileChannelImpl.Unmapper
private static class Unmapper
    implements Runnable
{
    // may be required to close file
    private static final NativeDispatcher nd = new FileDispatcherImpl();

    // keep track of mapped buffer usage
    static volatile int count;
    static volatile long totalSize;
    static volatile long totalCapacity;

    private volatile long address;
    private final long size;
    private final int cap;
    private final FileDescriptor fd;

    private Unmapper(long address, long size, int cap,
                     FileDescriptor fd)
    {
        assert (address != 0);
        this.address = address;
        this.size = size;
        this.cap = cap;
        this.fd = fd;

        synchronized (Unmapper.class) {
            count++;
            totalSize += size;
            totalCapacity += cap;
        }
    }

    public void run() {
        if (address == 0)
            return;
        unmap0(address, size);
        address = 0;

        // if this mapping has a valid file descriptor then we close it
        if (fd.valid()) {
            try {
                nd.close(fd);
            } catch (IOException ignore) {
                // nothing we can do
            }
        }

        synchronized (Unmapper.class) {
            count--;
            totalSize -= size;
            totalCapacity -= cap;
        }
    }
}

The native implementation of unmap0(address, size) is shown below; as you can see, it calls munmap.

JNIEXPORT jint JNICALL
Java_sun_nio_ch_FileChannelImpl_unmap0(JNIEnv *env, jobject this,
                                       jlong address, jlong len)
{
    void *a = (void *)jlong_to_ptr(address);
    return handle(env,
                  munmap(a, (size_t)len),
                  "Unmap failed");
}
Summary of FileChannel map methods

The FileChannel map method simply maps a file into a memory image. That is, MappedByteBuffer map(MapMode mode, long position, long size) maps the region of the file of length size starting at position into a memory image file. mode specifies how the memory image may be accessed: READ_ONLY, READ_WRITE, or PRIVATE.

  • READ_ONLY (MapMode.READ_ONLY, read-only): attempts to modify the resulting buffer cause a ReadOnlyBufferException to be thrown.
  • READ_WRITE (MapMode.READ_WRITE, read/write): changes to the resulting buffer are eventually propagated to the file; the changes may not be visible to other programs that have mapped the same file.
  • PRIVATE (MapMode.PRIVATE, private): changes to the resulting buffer are not propagated to the file and are not visible to other programs that map the same file; instead, a private copy of the modified portion of the buffer is created.

By calling FileChannel's map() method, you can map part or all of a file into memory. The mapped buffer is a direct buffer that inherits from ByteBuffer, but it has several advantages over a plain ByteBuffer (a usage sketch follows the list):

  • Fast reads
  • Fast writes
  • Writes at any time, at any position
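A minimal usage sketch of map(); the file path is an assumption. Note that mapping a READ_WRITE region beyond the current file size extends the file, as we saw in the map source above.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MapDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/a.txt", "rw");
             FileChannel ch = raf.getChannel()) {
            // Map the first 1024 bytes; the file is extended if it is shorter
            MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 1024);
            mbb.put(0, (byte) 'A'); // writes go straight to the mapped page
            mbb.force();            // flush the dirty pages back to the file
        }
    }
}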
A quick understanding of mmap

In short, mmap maps files directly to user-mode memory addresses, so that operations on files are no longer write/read, but direct operations on memory addresses. Three functions are provided in C to implement this:

  • mmap: creates the mapping.
  • munmap: removes the mapping.
  • msync: synchronizes modified mapped content back to the file. Changes a process makes to the shared content of the mapped region are not written straight back to the disk file; without this call there is no guarantee that the changes have been written back by the time munmap is called.

The mapping between virtual memory and a file on disk is established by the mmap system call. When a process accesses a page and triggers a page fault, the kernel reads the page into memory (that is, copies the file contents from disk into memory) and updates the page table to point to that page. Processes sharing the same mapping share the same physical memory, so only one copy of the data needs to be kept there; each process simply maps its own virtual memory onto it. This makes sharing a single copy convenient and saves memory. After memory mapping, the data in the file can be accessed with in-memory read/write instructions instead of I/O system functions such as read and write, which increases file access speed.

Here, msync, munmap, and close(fd) are illustrated with a small demo; just read the comments.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>

int main(int argc, char *argv[])
{
    int fd;
    char *addr;
    char *str = "Hello World";
    fd = open("./a", O_CREAT|O_RDWR|O_TRUNC, 0666);
    if (fd == -1)
    {
        perror("open file fail:");
        exit(1);
    }
    if (ftruncate(fd, 4096) == -1)
    {
        perror("ftruncate fail:");
        close(fd);
        exit(1);
    }
    addr = (char *) mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
    if (addr == (char *)MAP_FAILED)
    {
        perror("mmap fail:");
        exit(1);
    }
    memset(addr, ' ', 4096);

    memcpy(addr, str, strlen(str));               // Write a "Hello World"
    // Close the file. The contents of the mapped space can still be
    // written to the file through msync, keeping space and file in sync.
    close(fd);
    memcpy(addr + strlen(str), str, strlen(str)); // Write another "Hello World"
    // Synchronize to the file.
    // MS_ASYNC schedules the write-back regardless of whether the mapped
    // region was updated, and returns immediately.
    // MS_SYNC writes back if the mapped region was updated, and waits
    // until the write-back completes before returning.
    // MS_INVALIDATE discards parts of the mapped region that are
    // identical to the original file.
    if (msync(addr, 4096, MS_SYNC) == -1)
    {
        perror("msync fail:");
        exit(1);
    }
    munmap(addr, 4096);
    return 0;
}

For more information, see the article on the underlying principles of MappedByteBuffer and mmap.

Exploring force in FileChannel

To complement FileChannel’s map method, it is necessary to introduce its three companion methods:

  • force(): if the buffer is in READ_WRITE mode, changes made to the buffer are forcibly written to the file; that is, updated buffer contents are flushed to the hard disk.
  • load(): loads the contents of the buffer into physical memory and returns a reference to the buffer.
  • isLoaded(): returns true if the buffer's contents are resident in physical memory, false otherwise.

Here, we analyze the implementation of sun.nio.ch.FileChannelImpl#force; first, the relevant source code.
//sun.nio.ch.FileChannelImpl#force
public void force(boolean metaData) throws IOException {
    ensureOpen();
    int rv = -1;
    int ti = -1;
    try {
        beginBlocking();
        ti = threads.add();
        if (!isOpen())
            return;
        do {
            rv = nd.force(fd, metaData);
        } while ((rv == IOStatus.INTERRUPTED) && isOpen());
    } finally {
        threads.remove(ti);
        endBlocking(rv > -1);
        assert IOStatus.check(rv);
    }
}

//sun.nio.ch.FileDispatcherImpl#force
int force(FileDescriptor fd, boolean metaData) throws IOException {
    return force0(fd, metaData);
}

static native int force0(FileDescriptor fd, boolean metaData) throws IOException;

//src\java.base\unix\native\libnio\ch\FileDispatcherImpl.c
JNIEXPORT jint JNICALL
Java_sun_nio_ch_FileDispatcherImpl_force0(JNIEnv *env, jobject this,
                                          jobject fdo, jboolean md)
{
    jint fd = fdval(env, fdo);
    int result = 0;

#ifdef MACOSX
    result = fcntl(fd, F_FULLFSYNC);
    if (result == -1 && errno == ENOTSUP) {
        /* Try fsync() in case F_FULLFSYNC is not implemented on the file system. */
        result = fsync(fd);
    }
#else /* end MACOSX, begin not-MACOSX */
    if (md == JNI_FALSE) {
        result = fdatasync(fd);
    } else {
#ifdef _AIX
        /* On AIX, calling fsync on a file descriptor that is opened only for
         * reading results in an error ("EBADF: The FileDescriptor parameter is
         * not a valid file descriptor open for writing.").
         * However, at this point it is not possible anymore to read the
         * 'writable' attribute of the corresponding file channel so we have to
         * use 'fcntl'.
         */
        int getfl = fcntl(fd, F_GETFL);
        if (getfl >= 0 && (getfl & O_ACCMODE) == O_RDONLY) {
            return 0;
        }
#endif /* _AIX */
        result = fsync(fd);
    }
#endif /* not-MACOSX */
    return handle(env, result, "Force failed");
}

We'll skip the macOS implementation and focus only on the Linux platform. We find that when the parameter passed in is false, force calls fdatasync, and when it is true, fsync. Consulting the Linux manual pages (see fdatasync), we find:

fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed. It also flushes metadata information associated with the file (see stat(2)).

Calling fsync() does not necessarily ensure that the entry in the directory containing the file has also reached disk. For that an explicit fsync() on a file descriptor for the directory is also needed.

fdatasync() is similar to fsync(), but does not flush modified metadata unless that metadata is needed in order to allow a subsequent data retrieval to be correctly handled. For example, changes to st_atime or st_mtime (respectively, time of last access and time of last modification; see stat(2)) do not require flushing because they are not necessary for a subsequent data read to be handled correctly. On the other hand, a change to the file size (st_size, as made by say ftruncate(2)), would require a metadata flush.

The aim of fdatasync() is to reduce disk activity for applications that do not require all metadata to be synchronized with the disk.

In a nutshell, fdatasync flushes only file data to the hard disk, while fsync flushes both the data and the inode information, such as st_atime. Since the inode and the data are not stored contiguously on disk, fsync involves more disk writes, but it keeps the inode up to date. If you do not care about the inode information (such as the last access time), you can improve performance by using fdatasync. Where the inode information matters, fsync should be used.

Note that if write caching is enabled on the physical disk, fsync and fdatasync cannot guarantee that the written-back data actually reaches the storage medium (the data may still sit in the disk cache without being written to the medium). So it is still possible, even after an fsync system call, to lose data on a power failure or end up with file system inconsistencies.

To ensure consistency between the actual file system on disk and the contents of the buffer cache, UNIX systems provide the sync, fsync, and fdatasync functions. The sync function simply queues all modified block buffers for writing and returns; it does not wait for the actual writes to finish. A system daemon, usually called update, calls sync periodically (typically every 30 seconds), which ensures the kernel's block buffers are flushed regularly. The sync(1) command also calls sync. The fsync function works only on the single file specified by the file descriptor filedes and waits for the writes to finish before returning. fsync is useful for applications such as databases that need to make sure modified blocks are written to disk immediately. The fdatasync function is similar to fsync, but it affects only the data portion of the file; fsync additionally updates the file's attributes synchronously.

In other words, ordinary writes first land in the page cache and are later flushed to disk periodically by pdflush; mmap only allocates an address range in the process's virtual space, and the real memory behind it is still the page cache. So force comes down to calling fsync to flush the dirty pages to the hard disk, although the actual implementation, with mmap and sharing involved, is more complicated.

That is, on Linux, FileChannel's force calls fsync when passed true and fdatasync when passed false; fdatasync flushes only the data, not the metadata. Even if force is never called, the kernel periodically flushes dirty pages to the hard disk, every 30 seconds by default.

Finally, we give a working Demo:

FileOutputStream outputStream = new FileOutputStream("/Users/simviso/b.txt");

// Force both file data and metadata to be flushed to disk
outputStream.getChannel().force(true);

// Force file data to be flushed to disk, regardless of metadata
outputStream.getChannel().force(false);

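A complementary sketch for the load() and isLoaded() companions described above; the path is an assumption, and isLoaded() is only a hint from the operating system, so the printed values may vary.

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class LoadDemo {
    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("/tmp/b.txt", "rw");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer mbb = ch.map(FileChannel.MapMode.READ_WRITE, 0, 4096);
            System.out.println(mbb.isLoaded()); // often false right after mapping
            mbb.load();                         // fault the pages into physical memory
            System.out.println(mbb.isLoaded()); // often true now (only a hint)
        }
    }
}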

Zero copy

Manipulating files through memory-mapped buffers is much faster than ordinary IO operations, because no explicit system calls (read, write) are made, and under certain conditions the OS automatically caches some file pages. Zerocopy can improve the performance of IO-intensive Java applications: it reduces the number of times data is copied back and forth between kernel and user buffers during IO operations, as well as the number of context switches between user mode and kernel mode. A typical flow in most WEB applications is: accept a user request -> read data from the local disk into the kernel buffer -> copy the kernel buffer into the user buffer -> copy the user buffer back into a kernel (socket) buffer -> send it over the socket. Each copy of data between kernel and user buffers consumes CPU and memory bandwidth, and Zerocopy effectively reduces the number of copies. Here, we take file server data transmission as an example and analyze the whole process: read a file from the server's hard disk, then send it to the client over the network (socket). Written in code, the core is really these two calls:

File.read(fileDesc, buffer, len);
Socket.send(socket, buffer, len);

There are only two steps: step 1, read the file into a buffer; step 2, send the buffer over the socket. However, these two steps require four context switches (between user mode and kernel mode) and four copy operations to complete. The whole process is shown below:

  1. The first context switch occurs when the read() method is executed, indicating that the server is going to read the file on disk; this triggers a sys_read() system call. DMA reads the data from the hard disk into the kernel buffer (the first copy).

  2. The second context switch occurs when the read() method returns (read() is a blocking call), indicating that the data has been successfully read from the hard disk into the kernel buffer. Control returns from kernel mode to user mode, completing the action of copying the kernel buffer into the user buffer (the second copy).

  3. The third context switch occurs when the send() method is executed, indicating that the server is ready to send the data. Control enters kernel mode, and the user buffer is copied into the kernel buffer (the third copy).

  4. The fourth context switch occurs when send() returns; this return can be asynchronous: the thread returns from send() immediately, and the rest of the copying and sending is handed to the operating system's underlying implementation. Control returns from kernel mode to user mode, completing the action of sending the kernel buffer's data to the NIC buffer (the fourth copy).

Kernel buffer

Why do you need a kernel buffer? Because kernel buffers improve performance. As we learned earlier, it was the introduction of kernel buffers (intermediate buffers) that caused data to be copied back and forth, reducing efficiency. So why do kernel buffers improve performance?

For read operations, the kernel buffer acts as a prefetch cache. When a user program reads only a small amount of data at a time, the operating system first reads a large chunk of data from disk into the kernel buffer, and the user program takes only a small piece of it (for example, reading into a small array like new byte[128]). The next time the user program reads, it can fetch directly from the kernel buffer, and the operating system does not need to access the hard disk again, because the data the user wants is already there. This is also why subsequent reads (read() calls) are significantly faster than the first. From this point of view, kernel buffers do improve read performance.

For write operations, the kernel buffer enables asynchronous writes: the user program tells the operating system to write the contents of an array such as dest[] to file xxx, and the write method returns. The operating system silently copies the user buffer (dest[]) into the kernel buffer in the background, and then writes the kernel buffer's data to the hard disk. As long as the kernel buffer is not full, the user's write operation returns quickly. This is called an asynchronous flush strategy.

Handling file transfer with Zerocopy

JDK 7 introduced the java.nio.file.Files class, which makes many file operations convenient, but it is more suitable for transferring small files, not large ones. For large files you should use the transferTo and transferFrom methods of java.nio.channels.FileChannel. Here, we analyze the details of the transferTo method; the source is as follows:

public long transferTo(long position, long count,
                       WritableByteChannel target)
    throws IOException
{
    ensureOpen();
    if (!target.isOpen())
        throw new ClosedChannelException();
    if (!readable)
        throw new NonReadableChannelException();
    if (target instanceof FileChannelImpl &&
            !((FileChannelImpl)target).writable)
        throw new NonWritableChannelException();
    if ((position < 0) || (count < 0))
        throw new IllegalArgumentException();
    long sz = size();
    if (position > sz)
        return 0;
    int icount = (int)Math.min(count, Integer.MAX_VALUE);
    if ((sz - position) < icount)
        icount = (int)(sz - position);

    long n;

    // Attempt a direct transfer, if the kernel supports it
    if ((n = transferToDirectly(position, icount, target)) >= 0)
        return n;

    // Attempt a mapped transfer, but only to trusted channel types
    if ((n = transferToTrustedChannel(position, icount, target)) >= 0)
        return n;

    // Slow path for untrusted targets
    return transferToArbitraryChannel(position, icount, target);
}

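Before examining the three strategies, here is a minimal usage sketch of transferTo; the host, port, and file path are assumptions. Note the loop: transferTo may transfer fewer bytes than requested.

import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class TransferToDemo {
    public static void main(String[] args) throws Exception {
        try (FileChannel file = FileChannel.open(Path.of("/tmp/big.bin"),
                                                 StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(
                     new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long size = file.size();
            // Keep calling until the whole file has been handed to the channel
            while (position < size) {
                position += file.transferTo(position, size - position, socket);
            }
        }
    }
}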

There are three different strategies for copying the file. Let's look at transferToDirectly first:

//sun.nio.ch.FileChannelImpl#transferToDirectly
private long transferToDirectly(long position, int icount,
                                WritableByteChannel target)
    throws IOException
{
    if (!transferSupported)
        return IOStatus.UNSUPPORTED;

    FileDescriptor targetFD = null;
    if (target instanceof FileChannelImpl) {
        if (!fileSupported)
            return IOStatus.UNSUPPORTED_CASE;
        targetFD = ((FileChannelImpl)target).fd;
    } else if (target instanceof SelChImpl) {
        // Direct transfer to pipe causes EINVAL on some configurations
        if ((target instanceof SinkChannelImpl) && !pipeSupported)
            return IOStatus.UNSUPPORTED_CASE;

        // Platform-specific restrictions. Now there is only one:
        // Direct transfer to non-blocking channel could be forbidden
        SelectableChannel sc = (SelectableChannel)target;
        if (!nd.canTransferToDirectly(sc))
            return IOStatus.UNSUPPORTED_CASE;

        targetFD = ((SelChImpl)target).getFD();
    }

    if (targetFD == null)
        return IOStatus.UNSUPPORTED;
    int thisFDVal = IOUtil.fdVal(fd);
    int targetFDVal = IOUtil.fdVal(targetFD);
    if (thisFDVal == targetFDVal) // Not supported on some configurations
        return IOStatus.UNSUPPORTED;

    if (nd.transferToDirectlyNeedsPositionLock()) {
        synchronized (positionLock) {
            long pos = position();
            try {
                return transferToDirectlyInternal(position, icount,
                                                  target, targetFD);
            } finally {
                position(pos);
            }
        }
    } else {
        return transferToDirectlyInternal(position, icount, target, targetFD);
    }
}

This method touches many details we have encountered before, so you can use it to review the earlier material. Getting to the point, let's look at the details of transferToDirectlyInternal:

//sun.nio.ch.FileChannelImpl#transferToDirectlyInternal
private long transferToDirectlyInternal(long position, int icount,
                                        WritableByteChannel target,
                                        FileDescriptor targetFD)
    throws IOException
{
    assert !nd.transferToDirectlyNeedsPositionLock() ||
           Thread.holdsLock(positionLock);

    long n = -1;
    int ti = -1;
    try {
        beginBlocking();
        ti = threads.add();
        if (!isOpen())
            return -1;
        do {
            n = transferTo0(fd, position, icount, targetFD);
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        if (n == IOStatus.UNSUPPORTED_CASE) {
            if (target instanceof SinkChannelImpl)
                pipeSupported = false;
            if (target instanceof FileChannelImpl)
                fileSupported = false;
            return IOStatus.UNSUPPORTED_CASE;
        }
        if (n == IOStatus.UNSUPPORTED) {
            // Don't bother trying again
            transferSupported = false;
            return IOStatus.UNSUPPORTED;
        }
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);
        end(n > -1);
    }
}

As you can see, transferToDirectlyInternal ultimately calls transferTo0; we only look at its Linux implementation:

Java_sun_nio_ch_FileChannelImpl_transferTo0(JNIEnv *env, jobject this,
                                            jobject srcFDO, jlong position,
                                            jlong count, jobject dstFDO)
{
    jint srcFD = fdval(env, srcFDO);
    jint dstFD = fdval(env, dstFDO);

#if defined(__linux__)
    off64_t offset = (off64_t)position;
    jlong n = sendfile64(dstFD, srcFD, &offset, (size_t)count);
    if (n < 0) {
        if (errno == EAGAIN)
            return IOS_UNAVAILABLE;
        if ((errno == EINVAL) && ((ssize_t)count >= 0))
            return IOS_UNSUPPORTED_CASE;
        if (errno == EINTR) {
            return IOS_INTERRUPTED;
        }
        JNU_ThrowIOExceptionWithLastError(env, "Transfer failed");
        return IOS_THROWN;
    }
    return n;
    ...
}

Here we can see that sendfile is used; let's interpret the action with a diagram:

After sendfile is called, data is read from the hardware device (here, the hard disk) into kernel space via DMA; the kernel-space data is then copied into the socket buffer, and the socket buffer's data is copied to the protocol engine (for example, a network card; the NIC here sends the data out over the network). The copy between kernel and user space in traditional IO is eliminated, but the copy inside the kernel still exists. We have revised the earlier four-copy diagram of the file server data transmission example accordingly, as follows:

To summarize transferTo(): when this method is called, execution switches from user mode to kernel mode. DMA reads data from disk into the Read Buffer (the first data copy). Then, still in kernel space, data is copied from the Read Buffer into the Socket Buffer (the second data copy), and finally from the Socket Buffer to the NIC Buffer (the third data copy). Then execution returns from kernel mode to user mode. The whole process involves three data copies and two context switches. Intuitively that is only one data copy fewer, but now no buffer is needed in user space, and of the three copies, only the second requires CPU intervention, whereas the traditional approach required four copies, three of which needed the CPU.

Linux 2.4 and later improve on this:

Here the socket buffer is no longer a real buffer but a file descriptor that records where the data in the kernel buffer starts and how long it is; it stores essentially no data, mostly pointers. The protocol engine (the NIC in this case) then reads via DMA using that descriptor. In other words, when the user program executes the transferTo() method, a system call switches from user mode to kernel mode; data is copied from disk into the read buffer via DMA; then DMA transfers the data directly from the read buffer to the NIC buffer, guided by a file descriptor that identifies the address and length of the data to transfer. The copying is done without CPU intervention: only two copies and two context switches remain.

Efficient Data Transfer through Zero copy

Finally, back in sun.nio.ch.FileChannelImpl#transferTo, let's look at the other two copy strategies, transferToTrustedChannel and transferToArbitraryChannel. First, the relevant source code:

// Maximum size to map when using a mapped buffer
private static final long MAPPED_TRANSFER_SIZE = 8L*1024L*1024L;

//sun.nio.ch.FileChannelImpl#transferToTrustedChannel
private long transferToTrustedChannel(long position, long count,
                                      WritableByteChannel target)
    throws IOException
{
    boolean isSelChImpl = (target instanceof SelChImpl);
    if (!((target instanceof FileChannelImpl) || isSelChImpl))
        return IOStatus.UNSUPPORTED;

    // Trusted target: Use a mapped buffer
    long remaining = count;
    while (remaining > 0L) {
        long size = Math.min(remaining, MAPPED_TRANSFER_SIZE);
        try {
            MappedByteBuffer dbb = map(MapMode.READ_ONLY, position, size);
            try {
                // ## Bug: Closing this channel will not terminate the write
                int n = target.write(dbb);
                assert n >= 0;
                remaining -= n;
                if (isSelChImpl) {
                    // one attempt to write to selectable channel
                    break;
                }
                assert n > 0;
                position += n;
            } finally {
                unmap(dbb);
            }
        } catch (ClosedByInterruptException e) {
            ...
        } catch (IOException ioe) {
            ...
        }
    }
    return count - remaining;
}

You can see that transferToTrustedChannel copies data through mmap, transferring at most 8M at a time (the mapped buffer size), while transferToArbitraryChannel allocates a DirectBuffer of at most 8192 bytes:

private static final int TRANSFER_SIZE = 8192;

//sun.nio.ch.FileChannelImpl#transferToArbitraryChannel
private long transferToArbitraryChannel(long position, int icount,
                                        WritableByteChannel target)
    throws IOException
{
    // Untrusted target: Use a newly-erased buffer
    int c = Math.min(icount, TRANSFER_SIZE);
    // Util.getTemporaryDirectBuffer returns a DirectBuffer
    ByteBuffer bb = Util.getTemporaryDirectBuffer(c);
    long tw = 0;                    // Total bytes written
    long pos = position;
    try {
        Util.erase(bb);
        while (tw < icount) {
            bb.limit(Math.min((int)(icount - tw), TRANSFER_SIZE));
            int nr = read(bb, pos);
            if (nr <= 0)
                break;
            bb.flip();
            // ## Bug: Will block writing target if this channel
            // ## is asynchronously closed
            int nw = target.write(bb);
            tw += nw;
            if (nw != nr)
                break;
            pos += nw;
            bb.clear();
        }
        return tw;
    } catch (IOException x) {
        if (tw > 0)
            return tw;
        throw x;
    } finally {
        Util.releaseTemporaryDirectBuffer(bb);
    }
}

The most important logic in the code above is read(bb, pos) and target.write(bb). Here, we look only at the former:

//sun.nio.ch.FileChannelImpl#read(java.nio.ByteBuffer, long)
public int read(ByteBuffer dst, long position) throws IOException {
    if (dst == null)
        throw new NullPointerException();
    if (position < 0)
        throw new IllegalArgumentException("Negative position");
    if (!readable)
        throw new NonReadableChannelException();
    if (direct)
        Util.checkChannelPositionAligned(position, alignment);
    ensureOpen();
    if (nd.needsPositionLock()) {
        synchronized (positionLock) {
            return readInternal(dst, position);
        }
    } else {
        return readInternal(dst, position);
    }
}

//sun.nio.ch.FileChannelImpl#readInternal
private int readInternal(ByteBuffer dst, long position) throws IOException {
    assert !nd.needsPositionLock() || Thread.holdsLock(positionLock);
    int n = 0;
    int ti = -1;

    try {
        beginBlocking();
        ti = threads.add();
        if (!isOpen())
            return -1;
        do {
            n = IOUtil.read(fd, dst, position, direct, alignment, nd);
        } while ((n == IOStatus.INTERRUPTED) && isOpen());
        return IOStatus.normalize(n);
    } finally {
        threads.remove(ti);
        endBlocking(n > 0);
        assert IOStatus.check(n);
    }
}

The read here eventually goes through sun.nio.ch.IOUtil#readIntoNativeBuffer down to the read or pread system call. Likewise, target.write(bb) ends in a write or pwrite system call. Both consume CPU resources.

Finally, consider that when the amount of data to transfer is much larger than the kernel buffer, the kernel buffer itself becomes a bottleneck: it no longer really "buffers", because the transfer volume is too large. This is why zero-copy is preferred for large file transfers.

conclusion

This article explained the underlying implementation principles of IO starting from the operating system level, analyzed the advantages and disadvantages of some of those implementation details, and gave a detailed interpretation of Java NIO's DirectByteBuffer and MappedByteBuffer. Finally, building on the preceding source code, it described how Zerocopy technology is implemented.

Some illustrations in this article are from the Internet, and the copyright belongs to the original author. If there is anything wrong, please leave a message to inform, thank you!