binder_ioctl

We continue reviewing the user-space binder initialization; here is the second step:

    if ((ioctl(bs->fd, BINDER_VERSION, &vers) == -1) ||
        (vers.protocol_version != BINDER_CURRENT_PROTOCOL_VERSION)) {
        fprintf(stderr,
                "binder: kernel driver version (%d) differs from user space version (%d)\n",
                vers.protocol_version, BINDER_CURRENT_PROTOCOL_VERSION);
        goto fail_open;
    }

ioctl is a system call for device-specific I/O: it reads and writes data through a file descriptor according to a command protocol, and it involves a switch from user mode to kernel mode. I am not going to analyze the user-to-kernel transition here. According to the file_operations table registered by the driver:

static const struct file_operations binder_fops = {
...
    .unlocked_ioctl = binder_ioctl,
    .compat_ioctl = binder_ioctl,
...
};

Let’s look directly at the binder_ioctl processing.

static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
    int ret;
    struct binder_proc *proc = filp->private_data;
    struct binder_thread *thread;
    ...
    ret = wait_event_interruptible(binder_user_error_wait,
                                   binder_stop_on_user_error < 2);
    if (ret)
        goto err_unlocked;

    binder_lock(__func__);
    thread = binder_get_thread(proc);
    if (thread == NULL) {
        ret = -ENOMEM;
        goto err;
    }

    switch (cmd) {
    ...
    case BINDER_VERSION: {
        struct binder_version __user *ver = ubuf;

        if (size != sizeof(struct binder_version)) {
            ret = -EINVAL;
            goto err;
        }
        if (put_user(BINDER_CURRENT_PROTOCOL_VERSION,
                     &ver->protocol_version)) {
            ret = -EINVAL;
            goto err;
        }
        break;
    }
    default:
        ret = -EINVAL;
        goto err;
    }
    ret = 0;
err:
    if (thread)
        thread->looper &= ~BINDER_LOOPER_STATE_NEED_RETURN;
    binder_unlock(__func__);
    wait_event_interruptible(binder_user_error_wait,
                             binder_stop_on_user_error < 2);
    if (ret && ret != -ERESTARTSYS)
        pr_info("%d:%d ioctl %x %lx returned %d\n",
                proc->pid, current->pid, cmd, arg, ret);
err_unlocked:
    trace_binder_ioctl_done(ret);
    return ret;
}

Despite its length, this function does very little here. The wait_event_interruptible call puts the caller to sleep on the binder_user_error_wait queue while the global binder_stop_on_user_error is greater than or equal to 2; otherwise execution continues.

Then we meet the second important object in binder: binder_thread. Recall from the previous article that the file's private data is the binder_proc. The binder_thread is initialized at this point.

Finally, for BINDER_VERSION the driver copies the current protocol version back to user space with put_user, and the user-space check shown earlier decides whether initialization continues.

binder_get_thread

Let's take a closer look at how binder_thread is initialized.

static struct binder_thread *binder_get_thread(struct binder_proc *proc)
{
    struct binder_thread *thread = NULL;
    struct rb_node *parent = NULL;
    struct rb_node **p = &proc->threads.rb_node;

    while (*p) {
        parent = *p;
        thread = rb_entry(parent, struct binder_thread, rb_node);

        if (current->pid < thread->pid)
            p = &(*p)->rb_left;
        else if (current->pid > thread->pid)
            p = &(*p)->rb_right;
        else
            break;
    }
    if (*p == NULL) {
        thread = kzalloc(sizeof(*thread), GFP_KERNEL);
        if (thread == NULL)
            return NULL;
        binder_stats_created(BINDER_STAT_THREAD);
        thread->proc = proc;
        thread->pid = current->pid;
        init_waitqueue_head(&thread->wait);
        INIT_LIST_HEAD(&thread->todo);
        rb_link_node(&thread->rb_node, parent, p);
        rb_insert_color(&thread->rb_node, &proc->threads);
        thread->looper |= BINDER_LOOPER_STATE_NEED_RETURN;
        thread->return_error = BR_OK;
        thread->return_error2 = BR_OK;
    }
    return thread;
}

Here we meet the first important piece of red-black tree management in binder: the tree of threads. If you are familiar with my article on red-black trees, the loop will look familiar: if a thread with a matching pid is found, the loop breaks and that thread is returned; otherwise the search continues down the tree until it falls off a leaf. The pid here is the ID of the current task.

At this point the loop above has found the insertion position and its parent node. The driver then allocates memory, binds the binder_proc and pid, initializes the wait queue and todo list inside binder_thread, and inserts the new binder_thread into the red-black tree. The binder_thread initialization is then complete.

In other words, don't be fooled by the name: a binder_thread is not a real kernel thread spawned by kthreadd (a thread created from kthreadd is essentially a lightweight process forked in the kernel). Instead, the kernel achieves a thread-like effect through the wait queue inside it.

The kernel wait queue implementation is actually quite simple: a condition is checked to decide whether the waiter may proceed with the work below, and if the condition is not met, the CPU is yielded immediately through schedule() so another task can run.
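As a minimal sketch of the idea (not the real wait_event_interruptible macro expansion, and the helper name is made up):

#include <linux/wait.h>
#include <linux/sched.h>

/* Simplified sketch: check a condition, and if it does not hold, give up
 * the CPU until somebody wakes the queue. The real macro is more involved. */
static int wait_for_condition_sketch(wait_queue_head_t *wq, int *condition)
{
    DEFINE_WAIT(wait);
    int ret = 0;

    for (;;) {
        /* mark ourselves as waiting before re-checking the condition */
        prepare_to_wait(wq, &wait, TASK_INTERRUPTIBLE);
        if (*condition)
            break;                      /* condition met: stop waiting  */
        if (signal_pending(current)) {  /* interrupted by a signal      */
            ret = -ERESTARTSYS;
            break;
        }
        schedule();                     /* yield the CPU until woken up */
    }
    finish_wait(wq, &wait);
    return ret;
}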

binder_mmap

Next, let's look at step three of the user-space binder initialization:

  bs->mapsize = mapsize;
    bs->mapped = mmap(NULL, mapsize, PROT_READ, MAP_PRIVATE, bs->fd, 0);
    if (bs->mapped == MAP_FAILED) {
        fprintf(stderr,"binder: cannot map device (%s)\n",
                strerror(errno));
        goto fail_map;
    }

    return bs;

Tracing the kernel's mmap source code

The user-space mmap is implemented on top of the mmap2 system call, so we go straight to mmap2's implementation.

SYSCALL_DEFINE6(mmap2, unsigned long, addr, unsigned long, len,
        unsigned long, prot, unsigned long, flags,
        unsigned long, fd, unsigned long, off_4k)
{
#define PAGE_ADJUST (PAGE_SHIFT - 12)
    if (off_4k & ((1 << PAGE_ADJUST) - 1))
        return -EINVAL;
    return sys_mmap_pgoff(addr, len, prot, flags, fd,
                  off_4k >> PAGE_ADJUST);
}

In fact, all mmap variants funnel into this common core handler:

SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
        unsigned long, prot, unsigned long, flags,
        unsigned long, fd, unsigned long, pgoff)
{
    struct file *file = NULL;
    unsigned long retval = -EBADF;
    ...
    flags &= ~(MAP_EXECUTABLE | MAP_DENYWRITE);

    retval = vm_mmap_pgoff(file, addr, len, prot, flags, pgoff);
out_fput:
    if (file)
        fput(file);
out:
    return retval;
}

The core is vm_mmap_pgoff, which finally calls do_mmap_pgoff.

unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
            unsigned long len, unsigned long prot,
            unsigned long flags, unsigned long pgoff,
            unsigned long *populate)
{
    struct mm_struct *mm = current->mm;
    vm_flags_t vm_flags;

    *populate = 0;

    /*
     * Does the application expect PROT_READ to imply PROT_EXEC?
     *
     * (the exception is when the underlying filesystem is noexec
     *  mounted, in which case we dont add PROT_EXEC.)
     */
    if ((prot & PROT_READ) && (current->personality & READ_IMPLIES_EXEC))
        if (!(file && (file->f_path.mnt->mnt_flags & MNT_NOEXEC)))
            prot |= PROT_EXEC;

    if (!len)
        return -EINVAL;

    if (!(flags & MAP_FIXED))
        addr = round_hint_to_min(addr);

    /* Careful about overflows.. */
    len = PAGE_ALIGN(len);
    if (!len)
        return -ENOMEM;

    /* offset overflow? */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        return -EOVERFLOW;

    /* Too many mappings? */
    if (mm->map_count > sysctl_max_map_count)
        return -ENOMEM;

    /* Obtain the address to map to. we verify (or select) it and ensure
     * that it represents a valid section of the address space.
     */
    addr = get_unmapped_area(file, addr, len, pgoff, flags);
    if (addr & ~PAGE_MASK)
        return addr;

    /* Do simple checking here so the lower-level routines won't have
     * to. we assume access permissions have been handled by the open
     * of the memory object, so we don't do any here.
     */
    vm_flags = calc_vm_prot_bits(prot) | calc_vm_flag_bits(flags) |
            mm->def_flags | VM_MAYREAD | VM_MAYWRITE | VM_MAYEXEC;

    if (flags & MAP_LOCKED)
        if (!can_do_mlock())
            return -EPERM;

    if (mlock_future_check(mm, vm_flags, len))
        return -EAGAIN;

    if (file) {
        struct inode *inode = file_inode(file);

        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if ((prot & PROT_WRITE) && !(file->f_mode & FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure we don't allow writing to an append-only
             * file..
             */
            if (IS_APPEND(inode) && (file->f_mode & FMODE_WRITE))
                return -EACCES;

            /*
             * Make sure there are no mandatory locks on the file.
             */
            if (locks_verify_locked(file))
                return -EAGAIN;

            vm_flags |= VM_SHARED | VM_MAYSHARE;
            if (!(file->f_mode & FMODE_WRITE))
                vm_flags &= ~(VM_MAYWRITE | VM_SHARED);

            /* fall through */
        case MAP_PRIVATE:
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;
            if (file->f_path.mnt->mnt_flags & MNT_NOEXEC) {
                if (vm_flags & VM_EXEC)
                    return -EPERM;
                vm_flags &= ~VM_MAYEXEC;
            }

            if (!file->f_op->mmap)
                return -ENODEV;
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            break;

        default:
            return -EINVAL;
        }
    } else {
        switch (flags & MAP_TYPE) {
        case MAP_SHARED:
            if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
                return -EINVAL;
            /*
             * Ignore pgoff.
             */
            pgoff = 0;
            vm_flags |= VM_SHARED | VM_MAYSHARE;
            break;
        case MAP_PRIVATE:
            /*
             * Set pgoff according to addr for anon_vma.
             */
            pgoff = addr >> PAGE_SHIFT;
            break;
        default:
            return -EINVAL;
        }
    }

    /*
     * Set 'VM_NORESERVE' if we should not account for the
     * memory use of this mapping.
     */
    if (flags & MAP_NORESERVE) {
        /* We honor MAP_NORESERVE if allowed to overcommit */
        if (sysctl_overcommit_memory != OVERCOMMIT_NEVER)
            vm_flags |= VM_NORESERVE;

        /* hugetlb applies strict overcommit unless MAP_NORESERVE */
        if (file && is_file_hugepages(file))
            vm_flags |= VM_NORESERVE;
    }

    addr = mmap_region(file, addr, len, vm_flags, pgoff);
    if (!IS_ERR_VALUE(addr) &&
        ((vm_flags & VM_LOCKED) ||
         (flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
        *populate = len;
    return addr;
}

Review of Linux memory management and process address space basics

When talking about operating-system memory management, there are three kinds of addresses: logical addresses, linear addresses, and physical addresses. A physical address corresponds to an actual storage cell in main memory, i.e. a location on the memory chips. A linear (virtual) address is, on a 32-bit machine, a 32-bit value that can address up to 4GB. A logical address is what machine instructions use to reference operands; each logical address consists of a segment and an offset.

In practice, the upper layers generally cannot manipulate physical memory directly; it is too complex and unsafe. So every operating system translates addresses in the kernel: from a logical address to a linear address, and then to a physical address, which finally locates the storage cell.

There are several schemes for managing the mapping between logical and physical memory:

1. Segment management

2. Page management

3. Segment-page management

I’m only going to talk about paging on Linux.

First, let's distinguish paging from segmentation. Segmentation assigns a different linear address space to each program: it converts a logical address into a linear address. Paging maps each linear address, page by page, onto a physical address. Segment-page management first goes through a segment table and then a page table, i.e. segmentation plus paging.

The process implementation in Linux relies heavily on paged memory management; Linux makes little use of segmentation. On one hand, segmentation does not handle memory fragmentation well, while paging can make use of scattered physical memory. On the other hand, when all processes use the same segment register values, the linear address space can be managed uniformly. Moreover, many processors have limited support for segmentation, and as a highly portable open-source system Linux must be compatible with most mainstream processors.

Here is a brief introduction to paged memory management in Linux, taking a 32-bit machine as an example:

A basic paging scheme splits the linear address into the following fields:

Page directory: the top 10 bits

Page table: the middle 10 bits

Offset: the last 12 bits

Converting a linear address into a physical address takes two lookups: the directory entry selects a page table, and the page table entry selects a page frame. Finally, the offset locates the exact position within that page frame.

It is worth distinguishing pages from page frames. Linear addresses are divided into fixed-length intervals called pages, and consecutive addresses within a page map to consecutive physical addresses. The paging unit divides RAM into fixed-length page frames. In other words, a page is a block of data (or addresses), while a page frame is the physical location that can hold one page.
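As a plain user-space illustration (not kernel code), splitting a 32-bit linear address into these three fields looks like this:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint32_t linear = 0x0804a123;               /* arbitrary example address        */
    uint32_t dir    = (linear >> 22) & 0x3ff;   /* top 10 bits: page directory index */
    uint32_t table  = (linear >> 12) & 0x3ff;   /* middle 10 bits: page table index  */
    uint32_t offset =  linear        & 0xfff;   /* last 12 bits: offset in the frame */

    printf("dir=%u table=%u offset=0x%x\n", dir, table, offset);
    return 0;
}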

Linux's four-level page tables extend this basic scheme to be compatible with 64-bit processors.

The four levels are:

Page Global Directory (PGD)

Page Upper Directory (PUD)

Page Middle Directory (PMD)

Page Table (PTE)

The idea is straightforward: on a 32-bit machine without Physical Address Extension (PAE), the page upper directory and page middle directory each hold a single entry (their index bits are effectively 0), which collapses the scheme back into two-level paging. On 64-bit machines, the upper and middle directories gain real index bits, adding the extra levels.
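As a rough sketch of how such a four-level table is walked in kernel code (using the generic helpers, whose exact signatures vary across kernel versions; the wrapper name is made up and all validity checks such as pgd_none() are omitted):

static pte_t *lookup_pte_sketch(struct mm_struct *mm, unsigned long addr)
{
    pgd_t *pgd = pgd_offset(mm, addr);    /* page global directory entry */
    pud_t *pud = pud_offset(pgd, addr);   /* page upper directory entry  */
    pmd_t *pmd = pmd_offset(pud, addr);   /* page middle directory entry */

    return pte_offset_map(pmd, addr);     /* final page table entry      */
}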

Process address space basics

In Linux, the structure mm_struct is the memory descriptor of a process; in other words, mm_struct represents the process's memory management. All memory descriptors are linked together in a doubly linked list, mmlist. The first element of the list is init_mm, the memory descriptor used by process 0 (the first process set up by the kernel, together with static structures such as the GDT and TSS). For fast lookups, each mm_struct also maintains a red-black tree whose root is mm_rb.

Within mm_struct the key structure is vm_area_struct, the linear region descriptor. Each linear region represents an interval of linear addresses. All linear regions of a process are connected by a simple linked list, and each vm_area_struct is also inserted, via its vm_rb node, into the mm_rb red-black tree mentioned above.

Back to mmap

1. Check the mapping size passed down from user space. The first thing do_mmap_pgoff checks is whether the requested length is valid.

/* align addr on a size boundary - adjust address up/down if needed */
#define _ALIGN_UP(addr,size)    (((addr)+((size)-1))&(~((size)-1)))

#define _ALIGN(addr,size)       _ALIGN_UP(addr,size)

/* to align the pointer to the (next) page boundary */
#define PAGE_ALIGN(addr)        _ALIGN(addr, PAGE_SIZE)

What this does is round len up to a multiple of the page size (4KB, i.e. 2^12). Then the offset is validated: if the page offset plus the number of pages in len wraps around (becomes smaller than the original offset), the request is rejected as an overflow.

Finally, map_count in mm_struct is checked against the limit on how many linear regions one memory descriptor may manage. A small demo of the length and offset checks follows.
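Here is a small user-space demo of the checks described in step 1 (PAGE_SHIFT/PAGE_SIZE are hard-coded to the usual 4KB purely for illustration):

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_ALIGN(x) (((x) + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1))

int main(void)
{
    unsigned long len = 5000;        /* requested mapping length          */
    unsigned long pgoff = 3;         /* file offset in pages              */

    len = PAGE_ALIGN(len);           /* 5000 -> 8192, i.e. two full pages */
    printf("aligned len = %lu\n", len);

    /* the overflow check from do_mmap_pgoff: pgoff + page count must not wrap */
    if ((pgoff + (len >> PAGE_SHIFT)) < pgoff)
        printf("offset overflow\n");
    else
        printf("offset ok\n");
    return 0;
}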

2. get_unmapped_area finds a free linear region. Linear regions come in two kinds: one starts near the linear address 0x40000000 and grows toward higher addresses; the other starts from the user stack and grows toward lower addresses. Here we are in the upward-growing case. Note that this function also checks that the requested address lies within TASK_SIZE: the range 0 ~ 3GB (TASK_SIZE) is the user-space allocation area, accessible from both kernel and user mode, while roughly 3GB+896MB+8MB ~ 4GB is where kernel-space linear regions are allocated. To distinguish them from the user-space vm_area_struct, kernel-space linear regions are described by vm_struct.

3. Compute vm_flags from the information above. Since MAP_PRIVATE was passed down and a file is present, the code checks that the file is valid and that its file_operations provides an mmap method pointer.

4. mmap_region calls the slab allocator kmem_cache_zalloc to create the vm_area_struct, inserts it into the linked list and the red-black tree, and then calls the file's mmap operation.

unsigned long mmap_region(struct file *file, unsigned long addr,
        unsigned long len, vm_flags_t vm_flags, unsigned long pgoff)
{
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma, *prev;
    int error;
    struct rb_node **rb_link, *rb_parent;
    unsigned long charged = 0;
...
munmap_back:
...
    /*
     * Determine the object being mapped and call the appropriate
     * specific mapper. the address has already been validated, but
     * not unmapped, but the maps are removed from the list.
     */
    vma = kmem_cache_zalloc(vm_area_cachep, GFP_KERNEL);
    if (!vma) {
        error = -ENOMEM;
        goto unacct_error;
    }

    vma->vm_mm = mm;
    vma->vm_start = addr;
    vma->vm_end = addr + len;
    vma->vm_flags = vm_flags;
    vma->vm_page_prot = vm_get_page_prot(vm_flags);
    vma->vm_pgoff = pgoff;
    INIT_LIST_HEAD(&vma->anon_vma_chain);

    if (file) {
        if (vm_flags & VM_DENYWRITE) {
            error = deny_write_access(file);
            if (error)
                goto free_vma;
        }
        if (vm_flags & VM_SHARED) {
            error = mapping_map_writable(file->f_mapping);
            if (error)
                goto allow_write_and_free_vma;
        }

        /* ->mmap() can change vma->vm_file, but must guarantee that
         * vma_link() below can deny write-access if VM_DENYWRITE is set
         * and map writably if VM_SHARED is set. This usually means the
         * new file must not have been exposed to user-space, yet.
         */
        vma->vm_file = get_file(file);
        error = file->f_op->mmap(file, vma);
...
        addr = vma->vm_start;
        vm_flags = vma->vm_flags;
    } else if (vm_flags & VM_SHARED) {
        ...
    }
...
    return addr;
}

Note that Linux does not bind actual physical pages to the linear region during the mmap system call itself; at this point it has only found and registered a free region object. The mapping is then handed over to the driver's mmap implementation, and for an ordinary file the binding is deferred even further, until the data is actually read or written.

Back to binder's mmap

binder_mmap step 1: prepare the kernel-space linear region

static int binder_mmap(struct file *filp, struct vm_area_struct *vma)
{
    int ret;
    struct vm_struct *area;
    struct binder_proc *proc = filp->private_data;
    const char *failure_string;
    struct binder_buffer *buffer;

    if (proc->tsk != current)
        return -EINVAL;

    if ((vma->vm_end - vma->vm_start) > SZ_4M)
        vma->vm_end = vma->vm_start + SZ_4M;
...
    if (vma->vm_flags & FORBIDDEN_MMAP_FLAGS) {
        ret = -EPERM;
        failure_string = "bad vm_flags";
        goto err_bad_arg;
    }
    vma->vm_flags = (vma->vm_flags | VM_DONTCOPY) & ~VM_MAYWRITE;

    mutex_lock(&binder_mmap_lock);
    if (proc->buffer) {
        ret = -EBUSY;
        failure_string = "already mapped";
        goto err_already_mapped;
    }

    area = get_vm_area(vma->vm_end - vma->vm_start, VM_IOREMAP);
    if (area == NULL) {
        ret = -ENOMEM;
        failure_string = "get_vm_area";
        goto err_get_vm_area_failed;
    }
    proc->buffer = area->addr;
    proc->user_buffer_offset = vma->vm_start - (uintptr_t)proc->buffer;
    mutex_unlock(&binder_mmap_lock);
...
}

First, we can see that the binder driver limits each mapping to at most 4MB; any larger request is clamped down to 4MB.

The second check tests whether this linear region carries forbidden flags (FORBIDDEN_MMAP_FLAGS rules out user-space write access). Recall from the user-space code above that the region is mapped with PROT_READ only, so the check passes.

Finally, get_vm_area obtains a kernel linear region (vm_struct) of the same size as the user-space vm_area_struct. Its start address is stored in proc->buffer, where binder_proc is the process object registered with the binder driver, so binder_proc now holds the kernel linear region for this mapping. The user-space linear region is known as well, but instead of storing every user address the driver records only the difference between the two and converts addresses by simple addition and subtraction:

kernel buffer address + user_buffer_offset = user-space address of the same position in the mapping

With this single offset, the binder driver can translate addresses between kernel and user space quickly.
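For instance, the conversion in both directions could be sketched as follows; these helpers are illustrative only and are not functions in the driver:

/* Illustrative sketch: convert between the kernel buffer address and the
 * user-space address of the same mapped byte using the stored offset. */
static void __user *kernel_to_user_sketch(struct binder_proc *proc, void *kaddr)
{
    return (void __user *)((uintptr_t)kaddr + proc->user_buffer_offset);
}

static void *user_to_kernel_sketch(struct binder_proc *proc, void __user *uaddr)
{
    return (void *)((uintptr_t)uaddr - proc->user_buffer_offset);
}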

binder_mmap step 2: bind physical pages for binder_proc

proc->pages = kzalloc(sizeof(proc->pages[0]) * ((vma->vm_end - vma->vm_start) / PAGE_SIZE), GFP_KERNEL);
    if (proc->pages == NULL) {
        ret = -ENOMEM;
        failure_string = "alloc page array";
        goto err_alloc_pages_failed;
    }
    proc->buffer_size = vma->vm_end - vma->vm_start;

    vma->vm_ops = &binder_vm_ops;
    vma->vm_private_data = proc;

    if (binder_update_page_range(proc, 1, proc->buffer, proc->buffer + PAGE_SIZE, vma)) {
        ret = -ENOMEM;
        failure_string = "alloc small buf";
        goto err_alloc_small_buf_failed;
    }

1. kzalloc allocates, from the kernel cache, the pages array in binder_proc. Each struct page describes one page frame, and the array length is the number of 4KB page frames needed to cover the VMA size.

2. The obtained kernel buffer size is the same as that of the VMA.

3. Set the VMA's operations pointer and store binder_proc as the VMA's private data.

4. At this point only the kernel virtual-space descriptor is ready in memory; binder_update_page_range is still needed to actually bind physical pages.

binder_update_page_range

static int binder_update_page_range(struct binder_proc *proc, int allocate,
                    void *start, void *end,
                    struct vm_area_struct *vma)
{
    void *page_addr;
    unsigned long user_page_addr;
    struct vm_struct tmp_area;
    struct page **page;
    struct mm_struct *mm;
...
    if (end <= start)
        return 0;
...
    if (vma)
        mm = NULL;
    else
        mm = get_task_mm(proc->tsk);

    if (mm) {
        down_write(&mm->mmap_sem);
        vma = proc->vma;
        if (vma && mm != proc->vma_vm_mm) {
            ...
            vma = NULL;
        }
    }

    if (allocate == 0)
        goto free_range;

    if (vma == NULL) {
        ...
        goto err_no_vma;
    }

    for (page_addr = start; page_addr < end; page_addr += PAGE_SIZE) {
        int ret;

        page = &proc->pages[(page_addr - proc->buffer) / PAGE_SIZE];

        BUG_ON(*page);
        *page = alloc_page(GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO);
        if (*page == NULL) {
            ...
            goto err_alloc_page_failed;
        }
        tmp_area.addr = page_addr;
        tmp_area.size = PAGE_SIZE + PAGE_SIZE /* guard page? */;
        ret = map_vm_area(&tmp_area, PAGE_KERNEL, page);
        if (ret) {
            ...
            goto err_map_kernel_failed;
        }
        user_page_addr =
            (uintptr_t)page_addr + proc->user_buffer_offset;
        ret = vm_insert_page(vma, user_page_addr, page[0]);
        if (ret) {
            ...
            goto err_vm_insert_page_failed;
        }
        /* vm_insert_page does not seem to increment the refcount */
    }
    if (mm) {
        up_write(&mm->mmap_sem);
        mmput(mm);
    }
    return 0;
...
}

There are two logical steps here.

1. Verify that the VMA is usable. If no VMA was passed in, obtain the memory descriptor from the process descriptor and take proc->vma as the VMA, discarding it again if its mm does not match the one recorded in proc.

2. Bind the physical pages. The caller has already allocated the page-frame pointer array; here the loop walks from start to end in 4KB steps (the Linux kernel manages physical memory in 4KB page frames, which keeps the bookkeeping simple). Each iteration allocates a physical page with alloc_page (backed by the buddy allocator) and then uses map_vm_area to really bind the kernel linear region (vm_struct) to that physical page. Using the offset calculation summarized above, the corresponding user-space page address is computed as well and inserted into the process's page tables with vm_insert_page.

So the actual binding of physical memory happens in this second step: physical pages are allocated and attached to the page-frame descriptors, which is what gives the page frames meaning. At this point the kernel side of the mapping is backed by real physical memory.

Remember that the mmap call returns the VMA's start address to user space, and the ultimate goal is to map the kernel buffer held by binder_proc and the user process onto the same physical pages. Since this memory is not actually used yet, only a small initial allocation is made here rather than committing the whole requested size up front.

binder_mmap step 3

    buffer = proc->buffer;
    INIT_LIST_HEAD(&proc->buffers);
    list_add(&buffer->entry, &proc->buffers);
    buffer->free = 1;
    binder_insert_free_buffer(proc, buffer);
    proc->free_async_space = proc->buffer_size / 2;
    barrier();
    proc->files = get_files_struct(current);
    proc->vma = vma;
    proc->vma_vm_mm = vma->vm_mm;

    /*pr_info("binder_mmap: %d %lx-%lx maps %p\n",
         proc->pid, vma->vm_start, vma->vm_end, proc->buffer);*/
    return 0;

At this point the whole mapped kernel region is treated as a single binder_buffer object belonging to the binder_proc that represents the process. The necessary fields in proc are set, such as the VMA and the current files structure, the binder_buffer is added to the proc->buffers list, and since this newly created buffer is still waiting to be used it is also inserted into the red-black tree of free buffers (free_buffers).

Binder_buffer Memory management

static size_t binder_buffer_size(struct binder_proc *proc,
                 struct binder_buffer *buffer)
{
    if (list_is_last(&buffer->entry, &proc->buffers))
        return proc->buffer + proc->buffer_size - (void *)buffer->data;
    return (size_t)list_entry(buffer->entry.next,
              struct binder_buffer, entry) - (size_t)buffer->data;
}

Gets the binder_buffer size

Here we explore binder's kernel buffer memory management. binder_buffer is the kernel buffer that carries data inside the binder driver. If we were designing binder_buffer ourselves, what would it need? First, the target binder entity; second, an object representing the binder transaction being carried; and then the payload itself. Here is the actual binder_buffer data structure:

struct binder_buffer {
    struct list_head entry;            /* entry in the proc->buffers list           */
    struct rb_node rb_node;            /* node in the free/allocated red-black tree */

    unsigned free:1;
    unsigned allow_user_free:1;
    unsigned async_transaction:1;
    unsigned debug_id:29;

    struct binder_transaction *transaction; /* the transaction using this buffer    */
    struct binder_node *target_node;        /* the target binder entity             */
    size_t data_size;                       /* size of the data area                */
    size_t offsets_size;                    /* size of the offsets area             */
    uint8_t data[0];                        /* start of the data buffer             */
};

A binder_buffer is divided into two areas. The first is the metadata area, which runs from the start of the structure up to the data member and describes the buffer. The second, starting at data, is the actual data buffer inside the kernel buffer.

Therefore, the following formula can be obtained:

effective buffer size = data_size (size of the data area) + offsets_size (size of the offsets area)
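A small sketch of the layout this implies, assuming the offsets area follows the pointer-aligned data area inside data[]; these helpers are illustrative only, not part of the driver:

/* Illustrative helpers: the payload carried by a binder_buffer is the data
 * area followed by the offsets area, both living in the flexible array data[]. */
static size_t buffer_payload_size(struct binder_buffer *b)
{
    return b->data_size + b->offsets_size;   /* the "real" size from the formula */
}

static void *buffer_offsets_area(struct binder_buffer *b)
{
    /* the offsets area starts right after the (pointer-aligned) data area */
    return (void *)(b->data + ALIGN(b->data_size, sizeof(void *)));
}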

With this in mind, we can see that the driver measures buffer_size in two cases:

1. When the binder_buffer is in the middle of binder_proc's buffers list, its size is the address of the next buffer minus the address of the current buffer's data field.

2. When the binder_buffer is the last one in binder_proc's buffers list, there is no next buffer, so the size is computed from the end of the whole mapping: the start address of the mapping (proc->buffer) plus proc->buffer_size, minus the address of the current buffer's data field.

In both cases, the buffer_size returned by binder_buffer_size is the usable size of the binder_buffer, excluding its metadata.

binder_buffer management: the free and allocated red-black trees

static void binder_insert_free_buffer(struct binder_proc *proc,
                      struct binder_buffer *new_buffer)
{
    struct rb_node **p = &proc->free_buffers.rb_node;
    struct rb_node *parent = NULL;
    struct binder_buffer *buffer;
    size_t buffer_size;
    size_t new_buffer_size;

    new_buffer_size = binder_buffer_size(proc, new_buffer);

    while (*p) {
        parent = *p;
        buffer = rb_entry(parent, struct binder_buffer, rb_node);

        buffer_size = binder_buffer_size(proc, buffer);

        if (new_buffer_size < buffer_size)
            p = &parent->rb_left;
        else
            p = &parent->rb_right;
    }
    rb_link_node(&new_buffer->rb_node, parent, p);
    rb_insert_color(&new_buffer->rb_node, &proc->free_buffers);
}

This should be easy to follow after my earlier description of red-black trees. We take the free-buffer tree of the current binder_proc and walk down it in the loop, comparing sizes, to find where the newly created free buffer should be inserted.

Here the first important red-black tree of binder's memory management appears: free_buffers. Where there is a tree of free buffers there must also be a tree of buffers in use. Let's look at the binder_proc data structure.

struct binder_proc {
    /* red-black tree of binder_threads belonging to this process */
    struct rb_root threads;
    /* red-black tree of binder_node entities owned by this process */
    struct rb_root nodes;
    /* binder references, indexed by descriptor and by node */
    struct rb_root refs_by_desc;
    struct rb_root refs_by_node;
    /* pid of the mapped process */
    int pid;
    /* user-space linear region of the mapping */
    struct vm_area_struct *vma;
    struct mm_struct *vma_vm_mm;
    /* task and open-files structures of the process */
    struct task_struct *tsk;
    struct files_struct *files;
    struct hlist_node deferred_work_node;
    int deferred_work;
    /* start address of the whole kernel buffer for this process */
    void *buffer;
    /* offset between the user-space mapping and the kernel buffer */
    ptrdiff_t user_buffer_offset;
    /* list of all binder_buffers carved out of the mapping */
    struct list_head buffers;
    /* red-black tree of free kernel buffers */
    struct rb_root free_buffers;
    /* red-black tree of kernel buffers in use */
    struct rb_root allocated_buffers;
    size_t free_async_space;
    /* page-frame descriptors backing the mapping */
    struct page **pages;
    /* total size of the kernel buffer */
    size_t buffer_size;
    uint32_t buffer_free;
    /* work pending for this process */
    struct list_head todo;
    /* the wait queue at the core of binder's thread handling */
    wait_queue_head_t wait;
    struct binder_stats stats;
    struct list_head delivered_death;
    int max_threads;
    int requested_threads;
    int requested_threads_started;
    int ready_threads;
    long default_priority;
    struct dentry *debugfs_entry;
};

This series of articles will cover these fields one by one and show how the binder mechanism handles them.

Here we only need to focus on the free_buffers and allocated_buffers red-black trees. The first is what we just saw: a newly created kernel buffer that is not yet in use is inserted into the free tree. When a buffer is taken into use, it is removed from the free tree, processed, and inserted into the allocated tree, roughly as sketched below.
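As a sketch of the other half, the insertion into the in-use tree might look like this, assuming it is keyed by the buffer's start address (the driver's own binder_insert_allocated_buffer plays this role); simplified, not verbatim driver code:

static void insert_allocated_buffer_sketch(struct binder_proc *proc,
                                           struct binder_buffer *new_buffer)
{
    struct rb_node **p = &proc->allocated_buffers.rb_node;
    struct rb_node *parent = NULL;
    struct binder_buffer *buffer;

    while (*p) {
        parent = *p;
        buffer = rb_entry(parent, struct binder_buffer, rb_node);

        if (new_buffer < buffer)        /* compare by buffer address */
            p = &parent->rb_left;
        else
            p = &parent->rb_right;
    }
    rb_link_node(&new_buffer->rb_node, parent, p);
    rb_insert_color(&new_buffer->rb_node, &proc->allocated_buffers);
}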

binder_alloc_buf

The first part

static struct binder_buffer *binder_alloc_buf(struct binder_proc *proc,
                          size_t data_size,
                          size_t offsets_size, int is_async)
{
    struct rb_node *n = proc->free_buffers.rb_node;
    struct binder_buffer *buffer;
    size_t buffer_size;
    struct rb_node *best_fit = NULL;
    void *has_page_addr;
    void *end_page_addr;
    size_t size;

    if (proc->vma == NULL) {
...
        return NULL;
    }

    size = ALIGN(data_size, sizeof(void *)) +
        ALIGN(offsets_size, sizeof(void *));

    if (size < data_size || size < offsets_size) {
...
        return NULL;
    }

    if (is_async &&
        proc->free_async_space < size + sizeof(struct binder_buffer)) {
...
        return NULL;
    }


Look at the size calculation.

Let’s look at this macro first:

#define _ALIGN(addr,size)   (((addr)+(size)-1)&(~((size)-1))) 
 ALIGN(data_size, sizeof(void *)) 

This trick is worth learning: it rounds an address or value up to a multiple of size. Here we align to the pointer size, which on a 32-bit machine means 4-byte alignment.

Here is the binary operation:

(addr + (100 - 001)) & (~(100 - 001)) = (addr + 011) & 100   (in binary, with size = 4 = 100)

The AND forces the last two bits to 0, so in decimal the result is exactly divisible by 4.

So we know that

size = ALIGN(data_size, sizeof(void *)) + ALIGN(offsets_size, sizeof(void *))

That is, the requested buffer size is the pointer-aligned size of the data area plus the pointer-aligned size of the offsets area.
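As a quick worked example with hypothetical sizes, on a 32-bit machine where sizeof(void *) is 4:

#include <stdio.h>

#define ALIGN(x, a) (((x) + (a) - 1) & ~((a) - 1))

int main(void)
{
    size_t data_size = 10, offsets_size = 6;                    /* hypothetical */
    size_t size = ALIGN(data_size, 4) + ALIGN(offsets_size, 4); /* 12 + 8       */

    printf("requested buffer size = %zu\n", size);              /* prints 20    */
    return 0;
}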

The second part

    while (n) {
        buffer = rb_entry(n, struct binder_buffer, rb_node);
        BUG_ON(!buffer->free);
        buffer_size = binder_buffer_size(proc, buffer);

        if (size < buffer_size) {
            best_fit = n;
            n = n->rb_left;
        } else if (size > buffer_size)
            n = n->rb_right;
        else {
            best_fit = n;
            break;
        }
    }
    if (best_fit == NULL) {
        pr_err("%d: binder_alloc_buf size %zd failed, no address space\n",
               proc->pid, size);
        return NULL;
    }
    if (n == NULL) {
        buffer = rb_entry(best_fit, struct binder_buffer, rb_node);
        buffer_size = binder_buffer_size(proc, buffer);
    }

The principle is simple: the binder driver searches the current process's free_buffers tree for a suitable buffer, i.e. a free node whose size is greater than or equal to the requested size, keeping the smallest such node as best_fit.

If n is NULL at the end of the loop (no exact size match) but a best-fit node was found, that best_fit node is taken as the buffer and its buffer_size is measured again.

The third part

    has_page_addr =
        (void *)(((uintptr_t)buffer->data + buffer_size) & PAGE_MASK);
    if (n == NULL) {
        if (size + sizeof(struct binder_buffer) + 4 >= buffer_size)
            buffer_size = size; /* no room for other buffers */
        else
            buffer_size = size + sizeof(struct binder_buffer);
    }
    end_page_addr =
        (void *)PAGE_ALIGN((uintptr_t)buffer->data + buffer_size);
    if (end_page_addr > has_page_addr)
        end_page_addr = has_page_addr;
    if (binder_update_page_range(proc, 1,
        (void *)PAGE_ALIGN((uintptr_t)buffer->data), end_page_addr, NULL))
        return NULL;

    rb_erase(best_fit, &proc->free_buffers);
    buffer->free = 0;
    binder_insert_allocated_buffer(proc, buffer);
    if (buffer_size != size) {
        struct binder_buffer *new_buffer = (void *)buffer->data + size;

        list_add(&new_buffer->entry, &buffer->entry);
        new_buffer->free = 1;
        binder_insert_free_buffer(proc, new_buffer);
    }
...
    buffer->data_size = data_size;
    buffer->offsets_size = offsets_size;
    buffer->async_transaction = is_async;
    if (is_async) {
        proc->free_async_space -= size + sizeof(struct binder_buffer);
        ...
    }
    return buffer;

If you are familiar with the buddy algorithm the Linux kernel uses to allocate physical pages, you will recognize the idea at the core of how this section uses buffer memory.

The binder driver tries to carve the requested buffer out of the chosen free buffer. Taking the whole free node as-is could waste memory, since best-fit only guarantees that the requested size is less than or equal to the free node's size.

Remember that when mmap was first called, the whole mapping (up to 4MB) was inserted into the free tree as one big buffer node. If the algorithm simply handed that whole node out as the "best fit", memory would be wasted badly, and concurrent asynchronous binder operations would quickly run out of space.

Binder's solution therefore mimics the buddy algorithm: it cuts a small buffer out of the large one, and if the remainder can still hold another binder_buffer header plus some data, the remainder is put back into the free tree as a new free buffer.

With the core idea clear, let's see how the source code handles it.

1. has_page_addr is the end address of the current buffer rounded down to a page boundary with PAGE_MASK, i.e. the start of the page containing the buffer's end.

2. When no free buffer of exactly the requested size was found (n == NULL), the larger buffer is going to be cut.

    if (n == NULL) {
        if (size + sizeof(struct binder_buffer) + 4 >= buffer_size)
            buffer_size = size; /* no room for other buffers */
        else
            buffer_size = size + sizeof(struct binder_buffer);
    }

This part checks whether a new kernel buffer can be carved out of the original one. If there is room, buffer_size becomes the requested size plus one binder_buffer header, so the remainder can become the next buffer; otherwise the node cannot be cut this time and buffer_size stays at just the requested size. The extra 4 bytes guarantee that the would-be next buffer would have at least some room for a valid data area.
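To put hypothetical numbers on this decision (sizeof(struct binder_buffer) is assumed to be 40 bytes here purely for illustration):

#include <stdio.h>

int main(void)
{
    unsigned long header      = 40;                /* assumed metadata size          */
    unsigned long size        = 72;                /* aligned data + offsets request */
    unsigned long buffer_size = 4UL * 1024 * 1024; /* big free chunk from mmap       */

    if (size + header + 4 >= buffer_size)
        buffer_size = size;            /* too tight: take the chunk whole, no split */
    else
        buffer_size = size + header;   /* keep request + one header (112 bytes);    */
                                       /* the rest becomes a new free binder_buffer */

    printf("this allocation keeps %lu bytes of the chunk\n", buffer_size);
    return 0;
}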

3. Calculate end_page_addr as buffer->data + buffer_size, rounded up to a PAGE_SIZE boundary, i.e. a multiple of 4KB.

    end_page_addr =
        (void *)PAGE_ALIGN((uintptr_t)buffer->data + buffer_size);
    if (end_page_addr > has_page_addr)
        end_page_addr = has_page_addr;
    if (binder_update_page_range(proc, 1,
        (void *)PAGE_ALIGN((uintptr_t)buffer->data), end_page_addr, NULL))
        return NULL;

This keeps end_page_addr from going out of bounds: rounding up to 4KB can push end_page_addr beyond the original end of the buffer, while has_page_addr, rounded down with PAGE_MASK, never is. To make sure end_page_addr does not cross into the next buffer, the smaller has_page_addr is chosen as the end address. Then, with the page-aligned buffer->data as the start and this end_page_addr as the end, binder_update_page_range binds the physical pages for the region.

4. The chosen node is removed from the free tree and the binder_buffer is inserted into the allocated (in-use) tree. If the node was cut, the leftover small buffer is inserted back into the free tree. Finally, the binder_buffer's fields are set and the kernel buffer is ready.

conclusion

Let's return to where we are: the initialization of the Service Manager. At this point the Service Manager, as the first binder service, has completed its mapping with the binder driver.

A diagram visualizing the mmap process would go here.

This completes the analysis of the binder_open method in the Service Manager. Although the mapping has succeeded, we have only obtained a shared mapped address and have not used it yet. Next, the initialization of the Service Manager continues.

Think about it

Why is mmap used here, and how does it differ from traditional inter-process communication (IPC)? Take the classic pipe as an example. A pipe uses two file descriptors, one for reading and one for writing: after pipe() initializes them, data is simply written to one descriptor and read from the other. At its core, the kernel implementation allocates the files with alloc_file and attaches the pipefifo_fops operations to them.
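For comparison, a minimal user-space pipe round trip looks like this (error handling mostly omitted); the data is copied into the kernel pipe buffer on write and copied back out on read:

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fds[2];
    char buf[32];

    if (pipe(fds) == -1)                /* fds[0]: read end, fds[1]: write end        */
        return 1;

    write(fds[1], "hello", 6);          /* copy #1: user buffer -> kernel pipe buffer */
    read(fds[0], buf, sizeof(buf));     /* copy #2: kernel pipe buffer -> user buffer */
    printf("received: %s\n", buf);

    close(fds[0]);
    close(fds[1]);
    return 0;
}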

Compared with that, mmap lets the driver locate the peer's mapped memory directly and copy the data there, so the payload only needs to be copied once instead of twice, which greatly reduces overhead. Remember that every file read or write is a system call that traps into the kernel to move the data, and that path is comparatively slow and expensive.

Based on the mmap source code above, two more details of binder's use of mmap are worth noting.

1. With MAP_PRIVATE, modifications are not written back to disk. In Linux, binding to disk essentially means going through the file's inode, and the source above shows that the MAP_PRIVATE branch only binds the VMA's virtual memory and does not involve the inode. So such a mapping can modify memory but never the underlying file on disk.

2. The file being mapped must be readable for mmap to build the mapping region. From

        case MAP_PRIVATE:
            if (!(file->f_mode & FMODE_READ))
                return -EACCES;

As you can see, an error is returned if the file was not opened with read permission (FMODE_READ). Looking at the subsequent steps of binder_mmap, the driver must read the current VMA and bind physical pages into it, so a mapping without read access would be useless; binder is opened with O_RDWR, so this check passes.