On the afternoon of April 10, the iQIYI technical product team held an offline technology salon in its "i Technology Fair" series, under the theme "Exploration and Practice of Cloud-Native Adoption". Technical experts from Kuaishou, Baidu and ByteDance were invited to share and discuss hands-on experience of cloud-native adoption with the iQIYI technical product team.
Among them, Zhao Wei, a technical expert at iQIYI, shared iQIYI's container practice, covering its container application scenarios and its hands-on experience with container networking and container runtimes.
PS: Follow our public account and reply with the keyword "cloud native" to get the slides and videos shared by the guests of this technical salon.
The following practical takeaways from "iQIYI Container Practice" are compiled from the on-site talk at the [i Technology Conference].
iQIYI Container Practice / Speaker: Zhao Wei, technical expert at iQIYI
The talk covers iQIYI's hands-on experience and the problems we have run into with containers in recent years, as well as some of the thinking behind our choices along the way.
1. Application scenarios of iQIYI containers
More than half of iQIYI's internal application instances run as containers, almost all of them on clusters of physical machines. iQIYI initially adopted the Mesos stack: Marathon, an open-source service-scheduling framework built on Mesos, plus Sisyphus, a self-developed batch-processing framework, and built QAE (iQIYI App Engine) on top of Marathon. At the time, quite a few companies were doing similar things on Mesos/Marathon. Recently, however, Mesos was voted into retirement at Apache, and Aurora, the well-known Mesos scheduling framework contributed by Twitter, had already joined Apache's retirement list in February of the previous year. We say goodbye to Mesos with some regret and are accelerating our migration to Kubernetes. iQIYI started relatively late with Kubernetes. In addition to providing native K8S services, we will further provide application engines at different levels of abstraction on top of K8S, including Serverless, FaaS and Workflow; we hope to have the opportunity to share this follow-up work in the future.
2. Container network practice
The evolution of iQIYI's container network is shown in the figure below:
What we used most on Mesos was Docker's native local bridge + NAT, which went into production in 2014 and still runs at scale. We tried Calico in between, but at that time Docker, K8S and other communities were still arguing over the container network standard, which kept us from committing fully to any single technology; combined with the management overhead, the Calico solution was never widely adopted inside iQIYI. Later, as we developed on K8S, the NAT approach proved unfriendly to applications, so we settled on a basic direction of connecting the container network directly to the intranet and evaluated several concrete implementations, such as VXLAN and Calico/CNI. Then Cilium came along, was adopted by some companies, and attracted a lot of interest; since we were late to the game anyway, we decided we might as well catch up by betting on the newer, more aggressive solution.
For those unfamiliar with the Container Network Model (CNM) and the Container Network Interface (CNI), here is a brief explanation:
Docker split its networking code into an independent project, libnetwork, and proposed CNM, which defines concepts such as networks and endpoints, as well as operations such as creating a network and joining a network, allowing third-party plug-ins to integrate with Docker according to the standard. CoreOS proposed the simpler CNI, which only defines interfaces for adding containers to and removing them from a network. At the time, the two sides disagreed about the interfaces. The K8S community felt that the CNM interface was too tightly coupled to Docker and that its operational model was too complicated; compared with CNM, CNI is simpler, more loosely coupled and easier to secure. Docker's official reply was that the opinions and suggestions from K8S and other communities did not fit Docker's overall design. In the end, K8S chose CNI as its network plug-in standard, and CNI now dominates industry adoption. For a plug-in author, however, the actual work is much the same whichever interface is used. For example, CNM requires network configuration to be stored through Docker's libkv, while CNI imposes no such requirement, although a network plug-in always needs some kind of store; for K8S users it is simply more convenient when that state is managed either by the plug-in itself or by K8S.
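For a concrete feel of the CNI side, here is a minimal network configuration for the standard bridge plug-in. This is only an illustrative sketch: the file name, bridge name and subnet are placeholders, not iQIYI's actual settings.

```bash
# Illustrative only: all values below are placeholders.
cat > /etc/cni/net.d/10-mynet.conf <<'EOF'
{
  "cniVersion": "0.3.1",
  "name": "mynet",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": false,
  "ipam": {
    "type": "host-local",
    "subnet": "10.200.3.0/26",
    "routes": [
      { "dst": "0.0.0.0/0" }
    ]
  }
}
EOF
```

The container runtime simply invokes the bridge plug-in binary with the CNI ADD/DEL commands against this configuration; how any extra state is stored is left to the plug-in.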
(1) Bridge + NAT
Back to the Mesos environment. At that time Mesos did not support CNI well, and iQIYI ran into many management and operations difficulties with CNM, so the simplest and most reliable Bridge + NAT solution became effectively the only choice. We were somewhat idealistic then, reasoning that whether or not traffic went through NAT, the details would be hidden behind service registration. In long-term practice, however, this extra layer of NAT still caused a great deal of operational trouble.
Common problems included RPC frameworks exposing the wrong service address, not being able to locate an application by IP and port during troubleshooting, and Nginx keepalive failures. There were also occasional issues such as network interfaces failing to be released and IP conflicts, but overall the solution was fairly reliable.
Nginx keepalive failures were among the trickier ones.
Problem: multiple real servers (RS) behind an Nginx proxy, very low QPS, occasional 502 responses that recurred with a certain regularity.
Troubleshooting:
1) Packet capture: — on the Nginx side, at the moment of the 502, an RST is received directly;
— inside the RS container, a FIN had been sent earlier, and no packets appear at all around the 502.
2) Analysis:
We considered a few possibilities:
— A bug in iptables: related reports indicate this only occurs with low probability and is not reproducible on demand, which does not match our symptoms; besides, short-connection requests to the same service closed with FIN normally, so this cause was ruled out.
— A bridge network problem: again low probability, and inconsistent with the symptoms;
— A missing iptables NAT rule: we then found that the host's NAT table contained no translation entry for the connection between Nginx and the RS. Combined with the low request rate, we concluded that the RS had actively closed the keepalive connection after its idle timeout; because the NAT entry had already expired, the close never reached Nginx, which unknowingly kept trying to reuse the old connection, and the request failed.
There is no perfect solution to this problem, but there are several ways to alleviate it:
1) Increase the kernel parameter net.netfilter.nf_conntrack_tcp_timeout_established so that the idle time tolerated by the NAT entry exceeds the idle timeout of the RS (see the sketch after this list).
2) When Nginx or another client uses keepalive, have it send TCP keepalive heartbeats to keep the connection active.
3) Use short connections for all requests.
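As a rough sketch of the first mitigation — the value below is only an example, not our production setting:

```bash
# Check the current conntrack timeout for established TCP connections
# (the kernel default is 432000 s, i.e. 5 days, but it is often tuned down
# on NAT-heavy hosts to keep the conntrack table small).
sysctl net.netfilter.nf_conntrack_tcp_timeout_established

# Whatever value is used, keep it above the idle timeout of the RS behind Nginx;
# 86400 s here is purely illustrative.
sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400

# Persist the setting across reboots.
echo 'net.netfilter.nf_conntrack_tcp_timeout_established = 86400' >> /etc/sysctl.d/90-conntrack.conf
```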
(2) Bridge/CNI + VXLAN
After adopting K8S, iQIYI initially tried Bridge/CNI + VXLAN. For this kind of layer-2 network the K8S project does not provide an official best practice; we only knew that public clouds such as GCE do something similar. The whole setup is close in spirit to Pipework, an early Docker tool, and is relatively simple and easy to use.
Some problems were encountered along the way as well:
Problem: in this mode, if a Pod's request to a Service IP is forwarded to an instance on the same node, no response is received.
If the kernel parameter net.bridge.bridge-nf-call-iptables is 1, the packet is processed by the host's iptables and dropped. We set the parameter to 0, yet on every node it kept coming back as 1. After a series of checks we finally found that when Docker starts it loads the br_netfilter module in addition to bridge, which resets the parameter to 1. The fix is to load br_netfilter at boot and set the kernel parameter to 0 at boot as well. Another option is to drop Docker and switch to Containerd.
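A minimal sketch of that fix; the file paths are just typical locations, not necessarily the ones we use:

```bash
# Load br_netfilter at boot so the sysctl below is applied after the module exists
# and is not silently reset to 1 when Docker loads the module later.
echo 'br_netfilter' > /etc/modules-load.d/br_netfilter.conf

# Keep bridged container traffic out of the host iptables chains.
echo 'net.bridge.bridge-nf-call-iptables = 0' > /etc/sysctl.d/90-bridge-nf.conf

# Apply immediately without a reboot.
modprobe br_netfilter
sysctl -p /etc/sysctl.d/90-bridge-nf.conf
```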
(3) Cilium/CNI + BGP
Adopting Cilium was a bit of a gamble. Put simply, it uses the eBPF mechanism to push part of the work into the kernel. The code itself does not run faster than an ordinary program, but it significantly shortens the execution path.
(Photo from Internet)
Compared with Bridge/CNI + VXLAN, Cilium/CNI + BGP involves changing the entire underlying network environment, and both IPAM and BGP are closely tied to network planning.
Common IPAM approaches include fully distributed CIDR-per-host, centralized CIDR-per-IDC, and a global CIDR. Each option has pros and cons. For example, CIDR-per-host keeps the routing table simple, but IP drift is heavily constrained and fragmentation wastes a lot of addresses. CIDR-per-IDC and global-CIDR schemes constrain IP drift far less and save addresses, but the routing table becomes huge and hard to maintain. After coordination we finally decided to plan per ToR (top-of-rack switch), balancing routing complexity, IP resources and flexibility as far as possible. The specific network plan involves confidential information and is not given here.
The BGP configuration involves major changes on both the switches and the hosts. The hosts keep the previous HA design with dual physical NICs: each host establishes BGP sessions with both ToR switches and installs equal-cost routes, which not only increases network bandwidth but also provides a degree of network high availability.
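The talk does not say which BGP speaker runs on the hosts; purely as an illustration, if FRR were used, a per-host configuration might look roughly like the sketch below, where every ASN, address and CIDR is made up:

```
! /etc/frr/frr.conf (illustrative sketch only)
router bgp 64701
 bgp router-id 10.0.3.21
 ! one session per ToR switch, over each of the two physical NICs
 neighbor 10.0.3.1 remote-as 64601
 neighbor 10.0.4.1 remote-as 64601
 address-family ipv4 unicast
  ! announce the Pod CIDR assigned to this host
  network 10.200.3.0/26
  ! install equal-cost routes learned from both ToRs
  maximum-paths 2
 exit-address-family
```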
(4) Bridge/CNI + Cilium/CNI hybrid deployment
Problem: — The Cilium + BGP transformation cycle is long, and delivery deadlines could not be met in time;
— Cilium itself carries potential technical risk, so a quick recovery plan has to be prepared;
— Existing Bridge clusters need to migrate smoothly to Cilium.
Solution: — Bridge and Cilium follow a unified intranet segment plan, and both kinds of nodes are deployed across the same switches;
– Nodes are labeled by network type, and the corresponding CNI agent is deployed to each group through a DaemonSet (a sketch follows).
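A minimal sketch of that label-based DaemonSet scheduling; the label key, names and image tag are hypothetical, and a real Cilium deployment sets many more fields:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: cilium-agent
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: cilium-agent
  template:
    metadata:
      labels:
        app: cilium-agent
    spec:
      nodeSelector:
        network/cni: cilium           # only nodes labeled for Cilium run this agent
      hostNetwork: true
      containers:
      - name: cilium-agent
        image: cilium/cilium:v1.9.6   # version is illustrative
```

A second DaemonSet with the opposite node selector installs the bridge CNI configuration on the remaining nodes.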
As shown in the figure, the green lines are the layer-2 network and the red lines the layer-3 data path.
3. Container runtime
iQIYI has also experimented a good deal with container runtimes. Docker came first and was the most widely used; the Mesos Unified Container was tried for a while; finally we settled on Containerd + RunC/Kata in the K8S environment.
The early working model and stability of the Docker daemon drew criticism from many users. We tried Mesos's Unified Container for better consistency between cluster state and container state, but ran into no fewer problems than with Docker. Because the storage reliability and efficiency of images and containers were unsatisfactory at the time, and troubleshooting tools were lacking, that attempt was eventually wound down, although some special application scenarios still use this environment.
The main solution shared here is Containerd + RunC/Kata.
When using containers, you often encounter the following problems:
First, container isolation is not sufficient, and containers interfere with one another to some degree; in many cases the failure of one container drags down the whole host. For example, we have seen the number of processes in one container soar, driving up the load of the entire host along with very frequent thread switching; in such cases simple cgroup limits cannot keep the other containers running normally. We have also hit low-level mistakes, such as forgetting to close connections after requests, which quickly exhausted file descriptors and then took the machine down.
Second, what is detected inside a container is usually the host's resources. For example, some Java programs adapt their runtime behaviour, such as thread counts, by detecting the available CPUs and memory. Java and other runtimes are gradually becoming container-aware, but the results are still not ideal.
Finally, there are security requirements, such as controlling the visibility of and access to various resources.
A common solution to these problems is to run the container inside a virtual machine. To make this practical, two things have to be achieved: the virtual machine must be as light as possible and start quickly, and it must be convenient to use and easy to integrate with Kubernetes.
Container runtime practices
(1) Kata
(Photo from Internet)
iQIYI looked at Intel Clear Containers fairly early on but never applied it formally, only running some simple tests; the same goes for Hyper. The two projects eventually merged into Kata Containers, which slims the virtual machine down as far as possible without compromising usability. Kata also complies with the OCI standard and can be driven by an external manager such as Containerd or by its own CLI.
Figure note: benchmark results of Kata vs. RunC
We ran a series of benchmarks against Kata. The results show that startup times are of the same order: Kata takes a little over a second and RunC about half a second; in practice the need for millisecond-level startup is rare, so this is an entirely acceptable gap. The CPU test used the classic prime-number benchmark, counting primes up to 2,000,000 with 8 threads for 60 seconds and measuring how many rounds complete; the memory test transferred 100 GB. Overall, Kata was slightly slower than RunC.
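The talk does not name the benchmark tool, but the description matches sysbench's CPU and memory tests, so the commands were presumably something close to:

```bash
# CPU: verify primes up to 2,000,000 with 8 threads for 60 s;
# the score is the number of completed rounds (events).
sysbench cpu --cpu-max-prime=2000000 --threads=8 --time=60 run

# Memory: transfer a total of 100 GB and report throughput.
sysbench memory --memory-total-size=100G run
```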
Note that performance in real application scenarios often differs somewhat from the benchmarks. For example, when we tested a deep-learning image-inference scenario, Kata lost 5% to 10% compared with RunC, which is just about acceptable but needs further optimization before large-scale use.
However, because Kata itself adds a virtual-machine layer, it has some limitations in practice, listed below:
· VMX virtualization support must be enabled on the CPU
· Host network mode is not supported
· Joining another container's network namespace is not supported
· The docker checkpoint and restore functions are not supported
· Events are not fully supported; for example, OOM notification is not supported
· The update command does not support block IO weight
· The docker run parameter --shm-size for setting shared memory is not supported
· docker run parameters such as --sysctl are not supported
In practice, apart from the occasional impact of the missing host network mode, the other limitations have had little effect.
(2) gVisor
(Photo from Internet)
gVisor, another option for lightweight virtualization, was launched by Google, which implemented a "highly imitative" user-space kernel in Golang. Besides the extreme performance overhead, there are also some compatibility issues.
Its official documentation and compatibility notes are linked here:
Official documentation: gvisor.dev/docs/
Compatibility: gvisor.dev/docs/user_…
iQIYI chose not to use gVisor because it does not yet fully support many tools required in production environments, such as the ip, sshd and netstat commands.
(3) Containerd + RunC/Kata
Containerd’s relationship to RunC:
(Photo from Internet)
To explain briefly, OCI is an open container standard led by Docker and others, defining specifications for images, the container runtime and image repositories. Since both Kata and RunC implement the OCI interface, one can be swapped for the other almost directly. The replacement process is somewhat more tedious with Docker, but very simple with Containerd.
(Photo from Internet)
Kubernetes' announcement that Docker support is deprecated starting from v1.20 caused quite a stir. For now, however, replacing Docker + shim with Containerd is fairly straightforward.
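For example, registering Kata as an extra runtime handler in containerd's CRI plug-in takes only a few lines of configuration. This is a minimal sketch for the containerd 1.4/1.5 era; paths and handler names may differ between distributions:

```toml
# Fragment of /etc/containerd/config.toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"

  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
    runtime_type = "io.containerd.runc.v2"

  # Additional handler for Kata Containers, referenced by the RuntimeClass below.
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
    runtime_type = "io.containerd.kata.v2"
```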
With this foundation in place, using Kata in Kubernetes is simple: first define a K8S RuntimeClass whose handler points to the Kata runtime configured in Containerd.
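A minimal sketch of such a RuntimeClass; the name "kata" is illustrative, but the handler must match the runtime key configured in containerd above:

```yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata     # the name Pods will reference
handler: kata    # must match the containerd runtime handler
```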
Finally, specify the Kata runtime class in the Pod spec:
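The original slide with the Pod spec is not reproduced here; a minimal equivalent, with a placeholder image, looks like this:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-kata
spec:
  runtimeClassName: kata      # run this Pod's containers inside a Kata micro-VM
  containers:
  - name: app
    image: nginx:1.21         # image is illustrative
```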
Application scenarios
A problem common across the Internet industry is low server resource utilization; even during the afternoon and evening peaks, overall utilization is unsatisfactory. The figure shows the CPU usage of one cluster over 24 hours. The bulge in the red box is where we have already done a little work on this, such as running batch tasks at night; since this is a randomly chosen day, it looks slightly worse than the average optimization effect.
As shown in the figure, the current solution is still mainly based on Mesos, which manages both KVM and Docker hosts. The virtual-machine part is simple and crude: idle resources on each host (based on actual usage) are used to create VMs of suitable sizes, which are started at night and stopped at dawn. The Docker part uses Mesos's flexible oversubscription capability, assigning a separate role to the oversubscribed resources. Both kinds of resources are managed by Mesos and consumed by the task-scheduling system, which likewise only operates between 1:00 a.m. and 6:00 a.m. to avoid interfering with regular workloads.
In a K8S + Kata/RunC environment, things get a little easier, thanks to K8S's more reliable vertical and horizontal scaling and Kata's strong performance isolation. Virtual machines still run only at night, because resource scheduling at VM-creation time is not precise and ordinary VMs could hurt daytime workloads. The RunC containers are scheduled normally, while Kata is used to run offline tasks such as transcoding. Overall resources are controlled by K8S, avoiding conflicts between multiple resource and task management frameworks contending for the same servers.
iQIYI's work on online/offline co-location on K8S has only just started; we hope to build on our earlier Mesos co-location experience, operate at a finer granularity, and further improve server utilization.
You might also like to read:
Building user security ratings: intelligent UGC review in practice
Application practice and evolution of OCR technology at iQIYI