LXCFS is a small FUSE filesystem written with the intention of making Linux containers feel more like a virtual machine. It started as a side project of LXC but is usable by any runtime.

In plainer terms:

LXCFS is an open-source FUSE (userspace filesystem) implementation that supports LXC containers and can also be used with Docker containers. It lets applications inside a container read memory and CPU information through files that LXCFS maps into the container, so that each container sees virtualized data derived from its own cgroup configuration.

What is resource view isolation?

Container technology provides an approach to environment isolation that differs from traditional virtual machine technology. Ordinary Linux containers speed up packaging and startup, but they also weaken isolation. The best-known gap is the lack of resource view isolation.

A container can use cgroups to limit its resource usage, including memory and CPU. However, if a process inside the container runs common monitoring commands such as free or top, it still sees the data of the physical machine rather than that of the container. This is because the container does not isolate the resource view provided by /proc, /sys and similar filesystems.

Why do containers need resource view isolation?

  1. From the container user's perspective, many service developers are used to running top and free on traditional physical machines and virtual machines to check the system's resource usage. Because the container does not isolate the resource view, these commands inside a container still show the physical machine's data.

  2. From an application perspective, running a process in a container is different from running it on a physical machine or a virtual machine, and applications that size themselves from the resources they can see run into real pitfalls in containers (a quick shell illustration follows this list):

    For many JVM-based Java programs, the JVM sizes its default heap and stack at startup from the total system resources it can see. Run inside a container, the JVM still reads the physical machine's memory figures, so it may choose a heap larger than the container's memory quota and the application fails to start.

    Programs that need the host's CPU information behave similarly. For example, Go servers often call runtime.GOMAXPROCS(runtime.NumCPU()), and other services decide how many worker processes to start from the detected CPU count (as nginx does with worker_processes auto). A process in the container, however, always reads the CPU core count from /proc/cpuinfo, and /proc inside the container is still the host's, which can hurt the health of the services running in the container.
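    A quick way to see both pitfalls from a shell (a sketch; the image tags below are illustrative assumptions, not taken from the original article, and an older JDK 8 build without container awareness is assumed for the memory case):

    # Memory: a non-container-aware JVM derives its default max heap from the
    # host's RAM rather than from the 256 MB limit (illustrative image tag).
    docker run --rm -m 256m openjdk:8u111-jdk \
        java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

    # CPU: the container still reports every host core, regardless of --cpus.
    docker run --rm --cpus 2 ubuntu:latest grep -c ^processor /proc/cpuinfo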

    How do I isolate resource views for containers?

    LXCFS was created to solve exactly this problem.

    LXCFS exposes cgroup-derived system information as regular files and, through Docker volumes, bind-mounts them over the corresponding /proc files inside the container. Applications in the container then read these files as if they were the real /proc provided by the host.

    Here is an architecture diagram of how LXCFS works:

    To explain the diagram: after the host file /var/lib/lxcfs/proc/meminfo is mounted to /proc/meminfo in the Docker container, a process in the container that reads this file goes through /dev/fuse into the LXCFS implementation, which reads the correct memory limit from the container's cgroup. The application therefore sees its real resource constraints. CPU limits work the same way.
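    Conceptually, for a Docker container whose memory cgroup is docker/<container-id> (a hypothetical placeholder), LXCFS derives the values it reports roughly like this (cgroup v1 paths shown; the real derivation in LXCFS handles more cases):

    # MemTotal shown inside the container comes from the cgroup memory limit:
    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.limit_in_bytes
    # MemFree/MemAvailable are derived from the limit minus the current usage:
    cat /sys/fs/cgroup/memory/docker/<container-id>/memory.usage_in_bytes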

    Resource view isolation is implemented through LXCFS

    Install LXCFS

    wget https://copr-be.cloud.fedoraproject.org/results/ganto/lxc3/epel-7-x86_64/01041891-lxcfs/lxcfs-3.1.2-0.2.el7.x86_64.rpm
    rpm -ivh lxcfs-3.1.2-0.2.el7.x86_64.rpm --force --nodeps

    Check that the installation is successful

    [root@ifdasdfe2344 system]# lxcfs -h
    Usage:
    
    lxcfs [-f|-d] [-p pidfile] mountpoint
      -f running foreground by default; -d enable debug output
      Default pidfile is /run/lxcfs.pid
    lxcfs -h

    Start LXCFS

    Start it directly in the background:

    lxcfs /var/lib/lxcfs &

    Start via systemd (recommended)

    touch /usr/lib/systemd/system/lxcfs.service
    
    cat > /usr/lib/systemd/system/lxcfs.service <<EOF
    [Unit]
    Description=lxcfs
    [Service]
    ExecStart=/usr/bin/lxcfs -f /var/lib/lxcfs
    Restart=on-failure
    #ExecReload=/bin/kill -s SIGHUP \$MAINPID

    [Install]
    WantedBy=multi-user.target
    EOF

    systemctl daemon-reload
    systemctl start lxcfs.service
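    Optionally, enable the unit so that LXCFS also comes up automatically after a reboot:

    systemctl enable lxcfs.service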

    Check whether the startup is successful

    [root@ifdasdfe2344 system]# ps aux | grep lxcfs
    root      3276  0.0  0.0 112708   980 pts/2    S+   15:45   0:00 grep --color=auto lxcfs
    root     18625  0.0  0.0 234628  1296 ?        Ssl  14:16   0:00 /usr/bin/lxcfs -f /var/lib/lxcfs

    The startup succeeded.

    Verify the LXCFS effect

    Without LXCFS enabled

    We first run a container on a machine without LXCFS enabled and observe the CPU and memory information inside the container. To make the difference obvious, we used a high-spec server (32 cores, 128 GB of RAM).

    # Do the following
    systemctl stop lxcfs
    
    docker run -it ubuntu /bin/bash   # enter the container
    
    free -h

    From the output of free we can see that, although we are checking memory inside the container, what is displayed is the host's meminfo.

    # Look at the number of CPU cores
    cat /proc/cpuinfo| grep "processor"| wc -l

    This shows that, without LXCFS enabled, the cpuinfo seen inside the container is also that of the host.

    With LXCFS enabled

    systemctl start lxcfs
    
    # Map the LXCFS-maintained /proc files into the container, with memory limited to 256 MB:
    docker run -it -m 256m \
        -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
        -v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
        -v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
        -v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
        -v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
        -v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
        ubuntu:latest /bin/bash

    free -h

    You can see that the container’s own memory was correctly retrieved and that resource view isolation of the memory was successful.
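    The other mapped files behave the same way; for example, /proc/uptime now reports the container's uptime rather than the host's (run this inside the container started above):

    cat /proc/uptime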

    # --cpus 2 limits the container to at most two logical CPUs

    docker run -it --rm -m 256m --cpus 2 \
        -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
        -v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
        -v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
        -v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
        -v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
        -v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
        ubuntu:latest /bin/sh

    With LXCFS, cpuinfo can also reflect the limit on the number of logical CPUs the container may use. Note that LXCFS derives cpuinfo from the cpuset cgroup, so pinning a container to specific CPUs generally does more good than harm here; it only requires a little extra work to allocate a cpuset when the container is created.
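    For example, allocating a cpuset explicitly (a sketch; --cpuset-cpus pins the container to CPUs 0 and 1) makes the LXCFS-backed cpuinfo list only those CPUs:

    docker run --rm --cpuset-cpus 0,1 \
        -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
        ubuntu:latest grep -c ^processor /proc/cpuinfo
    # Expected to print 2, since only CPUs 0 and 1 are in the container's cpuset.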

    Kubernetes practice for LXCFS

    Using LXCFS in Kubernetes requires solving two problems:

    The first problem is that LXCFS needs to be started on each node;

    The second problem is to mount the /proc files maintained by LXCFS into each container;

    DaemonSet to run the LXCFS FUSE filesystem

    For the first problem, we install LXCFS on every Kubernetes node with a DaemonSet.

    Use the following YAML file directly:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: lxcfs
      labels:
        app: lxcfs
    spec:
      selector:
        matchLabels:
          app: lxcfs
      template:
        metadata:
          labels:
            app: lxcfs
        spec:
          hostPID: true
          tolerations:
          - key: node-role.kubernetes.io/master
            effect: NoSchedule
          containers:
          - name: lxcfs
            image: registry.cn-hangzhou.aliyuncs.com/denverdino/lxcfs:3.0.4
            imagePullPolicy: Always
            securityContext:
              privileged: true
            volumeMounts:
            - name: cgroup
              mountPath: /sys/fs/cgroup
            - name: lxcfs
              mountPath: /var/lib/lxcfs
              mountPropagation: Bidirectional
            - name: usr-local
              mountPath: /usr/local
          volumes:
          - name: cgroup
            hostPath:
              path: /sys/fs/cgroup
          - name: usr-local
            hostPath:
              path: /usr/local
          - name: lxcfs
            hostPath:
              path: /var/lib/lxcfs
              type: DirectoryOrCreate
    kubectl apply -f lxcfs-daemonset.yaml

    Once applied, you can see that an LXCFS pod from the DaemonSet is running on every node.
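    A quick check (the label selector matches the manifest above):

    kubectl get daemonset lxcfs
    kubectl get pods -l app=lxcfs -o wide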

    Map LXCFS proc files to containers

    For the second problem, we can solve it in two ways.

    The first is simply to declare mounts of the host's /var/lib/lxcfs/proc/* files in the Kubernetes Deployment YAML, as sketched below.
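    A minimal sketch of such a Deployment (the name, image and resource values are placeholders, and only the meminfo and cpuinfo mounts are shown; the remaining /proc files are mounted the same way):

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: demo-app               # placeholder name
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: demo-app
      template:
        metadata:
          labels:
            app: demo-app
        spec:
          containers:
          - name: demo-app
            image: ubuntu:latest   # placeholder image
            command: ["sleep", "infinity"]
            resources:
              limits:
                memory: 256Mi
                cpu: "2"
            volumeMounts:
            - name: lxcfs-meminfo
              mountPath: /proc/meminfo
            - name: lxcfs-cpuinfo
              mountPath: /proc/cpuinfo
          volumes:
          - name: lxcfs-meminfo
            hostPath:
              path: /var/lib/lxcfs/proc/meminfo
              type: File
          - name: lxcfs-cpuinfo
            hostPath:
              path: /var/lib/lxcfs/proc/cpuinfo
              type: File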

    The second method uses a Kubernetes extension mechanism to mount the LXCFS files automatically. The original approach relied on the Initializer mechanism, but InitializerConfiguration is no longer supported after Kubernetes 1.14, so it is not described here. Instead, we can implement an admission webhook (admission control, https://kubernetes.feisky.xyz/extension/auth/admission), which can further validate requests after authorization or inject default parameters, to achieve the same goal.

    # Verify that your Kubernetes cluster supports admission webhooks
    $ kubectl api-versions | grep admissionregistration.k8s.io/v1beta1
    admissionregistration.k8s.io/v1beta1
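    On Kubernetes 1.16 and later the stable API group is available as well (v1beta1 was removed in 1.22), so you may need to check for it instead:

    kubectl api-versions | grep 'admissionregistration.k8s.io/v1$'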

    Writing an admission webhook is beyond the scope of this article; the official documentation covers it in detail.

    For a working LXCFS admission webhook implementation, you can refer to: https://github.com/hantmac/lxcfs-admission-webhook

    Conclusion

    This article describes a way to provide container resource view isolation through LXCFS, which can help some container applications better identify container runtime resource constraints.

    At the same time, we showed how to deploy the LXCFS FUSE filesystem as a container managed by a DaemonSet. This greatly simplifies deployment and also takes advantage of Kubernetes' container management capabilities: LXCFS is restarted automatically if its process fails, and node deployments stay consistent as the cluster scales. The same technique applies to other, similar monitoring or system extensions.

    In addition, we described how a Kubernetes admission webhook can automate the mounting of the LXCFS files. The whole process is transparent to the application deployer, which greatly simplifies operation and maintenance.