This article is produced by the ByteDance Systems STE team. Non-disruptive upgrade (hereinafter "hot upgrade") of system-level services matters a great deal for rapid, iterative service development. As the entry point to the virtual network, the virtual switch (vSwitch) faces constantly changing requirements, but the disconnections caused by frequent upgrades can affect services running on virtual machines. In addition, there is usually only one virtual switch per host, so there is no redundancy at the architecture level. Hot upgrade technology is therefore critical to fast iteration of the vSwitch. This article introduces our practice of hot upgrade on DPDK-based Open vSwitch (hereinafter "OVS-DPDK"), and we hope to discuss it with peers in the industry.
The status quo
- The existing "hot upgrade" scheme of Open vSwitch (hereinafter "OVS") is essentially designed for OVS-kernel. In OVS-kernel, the ovs-vswitchd process is the slow path and the kernel module is the fast path. When ovs-vswitchd is upgraded, the kernel module is not replaced, so most traffic is still forwarded through the kernel flow cache and network disturbance is reduced. Upgrading the kernel module itself risks a long disconnection. For details, see: http://docs.openvswitch.org/en/latest/intro/install/general/?highlight=hot%20upgrade#hot-upgrading
- In OVS-DPDK, both the fast path and the slow path live in the ovs-vswitchd process. Simply restarting ovs-vswitchd takes a long time because of DPDK initialization (mainly hugepage initialization; a 1 GB hugepage takes about 600 ms) and network adapter initialization (initializing the Mellanox CX5 driver takes about 1 s in practice). The network may therefore be disconnected for several seconds.
- To keep development iterating quickly while reducing the service disturbance caused by upgrade disconnections, OVS-DPDK needs a hot upgrade feature.
Approaches and trade-offs
Hot upgrade can be implemented in several ways:
- Plug-in upgrade (hot patch): the main packet processing logic is built into a dynamic link library, and the plug-in is hot loaded so that the time-consuming DPDK initialization and NIC initialization are avoided. (The DPDK primary/secondary multi-process mode can also be considered a variation of this approach.)
  - Advantages: the disconnection time of a normal upgrade is very small and can reach the nanosecond level.
  - Disadvantages: framework upgrades and plug-in upgrades become two separate kinds of upgrade. The framework and the plug-in call each other's APIs and share data structures, so developers must keep the ABI (Application Binary Interface) of the two consistent. For example, the flow table structure may be shared between the framework and the plug-in; if an upgrade changes that structure, the framework must be updated as well, otherwise the mismatch causes memory errors.
- Dual-process upgrade (hot replacement): by provisioning redundant hardware resources, the new and old OVS coexist during the upgrade. The old OVS keeps forwarding so that traffic is not interrupted, and the new OVS takes over the traffic once it has finished initializing.
  - Advantages: the old and new OVS are two separate processes and need no mutual compatibility beyond the communication protocol used to synchronize information between them.
  - Disadvantages: resource redundancy. NIC and memory resources must be provisioned for both the old and the new OVS during the upgrade (2x memory plus 2x VFs), and the disconnection time is longer than with the plug-in approach.
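The ABI hazard described for the plug-in approach can be made concrete with a small sketch. It uses Python's ctypes to model two versions of a flow entry layout; `FlowEntryV1`, `FlowEntryV2`, and their fields are hypothetical names chosen for illustration and are not OVS data structures.

```python
import ctypes

# Hypothetical layout compiled into the old framework.
class FlowEntryV1(ctypes.Structure):
    _fields_ = [("match_port", ctypes.c_uint32),
                ("out_port", ctypes.c_uint32)]

# Hypothetical layout compiled into the upgraded plug-in:
# a new field was inserted in the middle of the structure.
class FlowEntryV2(ctypes.Structure):
    _fields_ = [("match_port", ctypes.c_uint32),
                ("priority", ctypes.c_uint32),
                ("out_port", ctypes.c_uint32)]

# The new plug-in writes an entry into shared memory ...
new_entry = FlowEntryV2(match_port=1, priority=100, out_port=7)
raw = bytes(new_entry)

# ... and the old framework reads the same bytes with its old layout.
old_view = FlowEntryV1.from_buffer_copy(raw)

print(ctypes.sizeof(FlowEntryV1), ctypes.sizeof(FlowEntryV2))
print(old_view.match_port, old_view.out_port)
```

In this sketch the old framework reads the plug-in's `priority` value as its `out_port`, which is the kind of silent corruption that forces the framework and the plug-in to be upgraded together whenever a shared structure changes.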
The industry generally adopts the dual-process hot upgrade, which covers more upgrade scenarios and is relatively easier to engineer. The following figure describes the general process of a dual-process hot upgrade.
The working principle of OVS hot upgrade is not complicated, but the implementation involves many small engineering details. The disconnection time is kept short mainly through fine-grained control in the code.
Two-stage design
We divide the hot upgrade into two stages, a design that borrows from virtual machine live migration. In the first stage, the forwarding service does not stop, and the whole system can roll back automatically if the upgrade fails. The network is disconnected only once the second stage is reached. If a problem occurs in the second stage, the only option is to disconnect the network and restart the service, because the system can no longer roll back. In the implementation, we added a JSON-RPC service to OVS for hot upgrades. When an OVS process starts, it automatically checks whether an old process exists. If one does, the new process sends it a hot upgrade request over JSON-RPC. The OVS process that receives the request then starts the two-stage upgrade:
- Stage 1:
  - The old OVS process releases its exclusive resources, such as the OVSDB database lock, the pid file name, and the unixctl server path. From this point the old OVS can no longer obtain the latest configuration from OVSDB, but its PMD threads keep running and forwarding does not stop.
  - After the resources are released, the old and new processes synchronize state, mainly OVS megaflows and the file descriptors (FDs) of network tap devices.
  - At the same time, the old OVS backs up all OpenFlow rules.
  - When this is done, the old process returns a response to the JSON-RPC request. If no problem is found, the upgrade continues. Otherwise the upgrade fails: the new process exits, and the old process rolls back by re-acquiring the resources it released and returns to its normal working state.
  - The new process receives the state information and starts normal initialization. DPDK initializes memory, and OVS obtains the bridge, network adapter, and other configuration from OVSDB and initializes them. If an exception occurs during this step and the new process crashes, the JSON-RPC Unix socket connection is broken. Once the old process detects the broken connection, it treats the new process as having failed to initialize and rolls back automatically.
  - The new process loads the OpenFlow rules backed up earlier. The local controller records and delivers any OpenFlow rule changes.
  - The new process initiates the second-stage JSON-RPC request.
- Stage 2:
  - The old process exits.
  - The new process starts its PMD threads and begins forwarding.
  - The upgrade is complete.
Below is a simple illustration of the upgrade sequence:
As the upgrade sequence diagram shows, in the first stage of the hot upgrade the new OVS process completes the most time-consuming initialization work while the old OVS process keeps forwarding, so the network is not interrupted. The disconnection occurs only in the second stage, between the old OVS exiting and the new OVS starting its PMD threads.
During the upgrade, we take advantage of a feature of Mellanox (MLNX) NICs: the same NIC device can be opened by two independent DPDK processes at the same time, and traffic is mirrored to both processes. With other NICs, the same effect can be achieved by switching between multiple VFs with flow table rules. We do not go into detail here.
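The two-stage sequence above can be sketched as a small in-memory simulation. This is only an illustration of the ordering and the rollback rule, not the code in ndu.c; the `OvsProcess` class, its fields, and `hot_upgrade` are hypothetical names, and the JSON-RPC exchange is reduced to a boolean that says whether the new process initialized successfully.

```python
from dataclasses import dataclass

@dataclass
class OvsProcess:
    name: str
    holds_exclusive: bool  # OVSDB lock, pid file, unixctl server path
    forwarding: bool       # PMD threads running
    alive: bool = True

def hot_upgrade(new_init_ok: bool):
    """Simulate the two-stage upgrade; returns (old, new, log)."""
    old = OvsProcess("old", holds_exclusive=True, forwarding=True)
    new = OvsProcess("new", holds_exclusive=False, forwarding=False)
    log = []

    # Stage 1: the old process keeps forwarding the whole time.
    old.holds_exclusive = False
    log.append("stage1: old released exclusive resources")
    log.append("stage1: state synced, OpenFlow rules backed up")

    if not new_init_ok:
        # New process crashed: the connection breaks and the old
        # process re-acquires what it released (automatic rollback).
        new.alive = False
        old.holds_exclusive = True
        log.append("stage1: new process failed, old rolled back")
        return old, new, log

    new.holds_exclusive = True
    log.append("stage1: new initialized and loaded OpenFlow rules")

    # Stage 2: no rollback from here; disconnection window starts.
    old.forwarding = False
    old.alive = False
    log.append("stage2: old exited")
    new.forwarding = True  # disconnection window ends
    log.append("stage2: new started PMD threads")
    return old, new, log

for ok in (True, False):
    old, new, log = hot_upgrade(ok)
    forwarder = new.name if new.forwarding else old.name
    print(f"new_init_ok={ok}: forwarding by {forwarder}")
```

The point of the sketch is that exactly one process forwards at every step of stage 1, whether the upgrade succeeds or rolls back, and that the only gap is between the two stage 2 steps.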
Evaluation of disconnection time
We measured the disconnection time for two kinds of virtual NIC back end:
- Representer + VF
- vhost-user + virtio
Test method
Two virtual machines on two different hosts ping each other. Run ping -i 0.01 so that consecutive pings are 10 ms apart; the number of ping packets that receive no reply then gives the upgrade disconnection time in units of 10 ms. Run iperf -i 1 to check whether TCP throughput is affected.
Representer + VF mode
Ping test
The results of the 10 tests are as follows:
1524 packets transmitted, 1524 received, +36 duplicates, 0% packet loss, time 16747ms
623 packets transmitted, 622 received, +29 duplicates, 0% packet loss, time 6830ms
662 packets transmitted, 662 received, +30 duplicates, 0% packet loss, time 7263ms
725 packets transmitted, 724 received, +28 duplicates, 0% packet loss, time 7955ms
636 packets transmitted, 635 received, +28 duplicates, 0% packet loss, time 6973ms
752 packets transmitted, 750 received, +27 duplicates, 0% packet loss, time 8251ms
961 packets transmitted, 961 received, +31 duplicates, 0% packet loss, time 10551ms
737 packets transmitted, 737 received, +29 duplicates, 0% packet loss, time 8084ms
869 packets transmitted, 869 received, +27 duplicates, 0% packet loss, time 9543ms
841 packets transmitted, 840 received, +28 duplicates, 0% packet loss, time 9228ms
Most runs lose at most one packet, and the worst run (752 transmitted, 750 received) loses two, so the disconnection time is about 10 ms to 20 ms. However, many duplicate packets are observed. This is caused by a bug in the MLNX NIC in switchdev mode: when two processes open the device, traffic should be mirrored to both of them, but instead one process receives two copies of a packet while the other receives none. MLNX has confirmed the issue.
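The conversion from ping summaries to disconnection time can be sketched as follows. This is a minimal helper that assumes the summary format shown above and the 10 ms interval of ping -i 0.01; the function name `disconnection_ms` is introduced here for illustration. The result is an estimate, since it assumes every unanswered ping falls inside the upgrade window.

```python
import re

# Matches the head of a ping summary line as printed above.
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) received")

def disconnection_ms(summary: str, interval_ms: float = 10.0) -> float:
    """Estimate disconnection time from one ping summary line."""
    match = SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"not a ping summary: {summary!r}")
    sent, received = int(match.group(1)), int(match.group(2))
    # Duplicates are reported separately and are not counted in
    # 'received', so sent - received is the number of lost pings.
    return (sent - received) * interval_ms

runs = [
    "1524 packets transmitted, 1524 received, +36 duplicates, 0% packet loss, time 16747ms",
    "623 packets transmitted, 622 received, +29 duplicates, 0% packet loss, time 6830ms",
    "752 packets transmitted, 750 received, +27 duplicates, 0% packet loss, time 8251ms",
]
for line in runs:
    print(disconnection_ms(line))
```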
Iperf test
During the upgrade, throughput dropped for about 1 s from the full rate of 22 Gbps to 13 Gbps, a reduction of about 40%, presumably because of packet retransmission caused by the MLNX bug together with the upgrade disconnection. Throughput returned to 22 Gbps immediately after that second.
Vhost-user + virtio mode
Ping test
The disconnection time is about 70 ms to 80 ms. Log analysis showed that the vhost connection was re-established repeatedly while the old process was exiting; further optimization of this step could reduce the disconnection time. We do not go into detail here.
Iperf test
The result is similar to VF mode: throughput drops somewhat during the upgrade and recovers immediately after the upgrade completes.
The relevant code
The OVS-DPDK we use is open source. If you are interested in how the hot upgrade is implemented, see the ndu.c file in the vswitchd directory. In the utilities directory we provide the ovs-nductl tool, which talks to the hot upgrade JSON-RPC service and can simulate a hot upgrade for testing. For the upgrade script, see ovs-hotupgrade.sh in the utilities directory. Our open source repository is: https://github.com/bytedance/ovs-dpdk
STE team, ByteDance Systems Department
The ByteDance Systems STE team has long been committed to operating system kernels and virtualization, the construction and performance optimization of system software and libraries, the stability and reliability of large-scale data centers, and the co-design of new hardware and software. It provides comprehensive foundational software engineering capability in support of ByteDance's upper-layer businesses. The team also actively follows technology trends in the community and embraces open source and standards. We welcome more like-minded people to join; if you are interested, send your resume to [email protected].