An overview of MetalLB

In a bare-metal Kubernetes cluster there is no built-in load-balancer implementation; that functionality is normally provided by cloud vendors. If the cluster is not running on a cloud vendor's platform, a Service of type LoadBalancer stays in the pending state forever, and external traffic can only be brought into the cluster through NodePort or externalIPs. This is inconvenient for production deployments and leaves bare-metal clusters feeling like second-class citizens in the Kubernetes ecosystem.

As shown in the figure above, MetalLB watches for Service changes and, through its Speaker component, advertises a reachable path from external traffic to the Kubernetes cluster nodes using the configured mode. Once traffic reaches a node, kube-proxy forwards it to the Pods according to its forwarding mode (iptables or IPVS). In other words, MetalLB load-balances at the host level, while load balancing across Pod replicas is handled by kube-proxy. MetalLB is responsible for IP address assignment, advertising addresses in the configured mode, node election, and node failover. The traffic steering itself is implemented with ARP, NDP, or BGP.

Address assignment

MetalLB does not conjure IP addresses out of thin air after it starts in the Kubernetes cluster: you must configure one or more IP address pools and the advertisement mode each pool uses. MetalLB then allocates and reclaims IP addresses in response to changes in the state of a Service, that is, when a Service is added, deleted, or modified.

The IP addresses can come from two sources: (1) public or LAN IP addresses and (2) virtual IP addresses. If the LAN has plenty of spare addresses, or an address segment can be purchased from a cloud vendor, source 1 can be used; these addresses are announced with ARP and NDP in Layer2 mode. If an address segment that does not conflict with the existing network is chosen, source 2 can be adopted; it requires a network that supports BGP, i.e. routers that speak the BGP routing protocol.
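The allocation itself is conceptually simple. The sketch below is only a hypothetical illustration, not MetalLB's actual allocator: it hands out the first free address of a configured range to a Service and reclaims it when the Service goes away (the pool range and Service names are made up).

```go
package main

import (
	"fmt"
	"net/netip"
)

// pool hands out addresses from a configured range, mirroring (in a very
// simplified way) what MetalLB's controller does for LoadBalancer Services.
type pool struct {
	prefix netip.Prefix          // e.g. 192.168.1.240/28
	inUse  map[netip.Addr]string // addr -> service that holds it
}

// allocate returns the first free address in the pool for the given service.
func (p *pool) allocate(svc string) (netip.Addr, error) {
	for a := p.prefix.Addr(); p.prefix.Contains(a); a = a.Next() {
		if _, taken := p.inUse[a]; !taken {
			p.inUse[a] = svc
			return a, nil
		}
	}
	return netip.Addr{}, fmt.Errorf("pool %s exhausted", p.prefix)
}

// release returns a service's address to the pool.
func (p *pool) release(svc string) {
	for a, owner := range p.inUse {
		if owner == svc {
			delete(p.inUse, a)
		}
	}
}

func main() {
	p := &pool{
		prefix: netip.MustParsePrefix("192.168.1.240/28"),
		inUse:  map[netip.Addr]string{},
	}
	ip, _ := p.allocate("default/nginx")
	fmt.Println("assigned", ip) // assigned 192.168.1.240
	p.release("default/nginx")
}
```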

Broadcasting mode

After an IP address from the pool has been assigned, the network outside the cluster must be told that the address is reachable. MetalLB does this with standard protocols: ARP, NDP, or BGP. ARP announces IPv4 addresses and NDP announces IPv6 addresses. Accordingly, MetalLB offers two modes: Layer2 mode and BGP mode.

Layer2 mode

In Layer2 mode, the cluster maintains the mapping between hosts and Services and makes the Service IP reachable through the standard ARP or NDP protocol. All hosts in the cluster elect a leader to act as the host that exposes the Service; from the network's point of view, that host simply owns multiple IPs.

While the system is running, if a node is lost or fails, a new leader election is triggered automatically and the Service IP is routed to the new node. In this respect Layer2 mode is similar to keepalived, except that keepalived is implemented with VRRP, whereas here the node election is provided by the memberlist project.

As described above, Layer2 mode elects one leader node among all nodes to expose the Service, so the bandwidth available to the whole cluster's exposed Services is limited by the bandwidth of that leader node. In addition, some operating systems cache ARP entries, so when the leader node fails the switch to the new leader may not take effect immediately, causing service requests to fail for a short time.
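To make the election more concrete, here is a rough sketch, not MetalLB's actual algorithm, of how a memberlist-based election can work: every node joins the same memberlist cluster and deterministically picks one live member per Service by hashing the member name together with the Service name (the peer addresses below are placeholders).

```go
package main

import (
	"fmt"
	"hash/fnv"
	"os"

	"github.com/hashicorp/memberlist"
)

// owner deterministically picks one live member to announce a Service IP:
// every node runs the same computation and arrives at the same answer.
func owner(members []*memberlist.Node, svc string) string {
	best, bestHash := "", uint64(0)
	for _, m := range members {
		h := fnv.New64a()
		h.Write([]byte(m.Name + "/" + svc))
		if v := h.Sum64(); best == "" || v > bestHash {
			best, bestHash = m.Name, v
		}
	}
	return best
}

func main() {
	cfg := memberlist.DefaultLANConfig()
	cfg.Name, _ = os.Hostname()

	ml, err := memberlist.Create(cfg)
	if err != nil {
		panic(err)
	}
	// Placeholder peer addresses: in a real cluster these would be other nodes.
	if _, err := ml.Join([]string{"10.0.0.1", "10.0.0.2"}); err != nil {
		fmt.Println("join failed (expected with placeholder peers):", err)
	}

	leader := owner(ml.Members(), "default/nginx")
	if leader == cfg.Name {
		fmt.Println("this node would answer ARP/NDP for the Service IP")
	}
}
```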

BGP mode

In BGP mode, the hosts in the cluster establish sessions with their BGP peers and share routing information with them, so that clients outside the cluster can reach Services inside it.

How BGP traffic is load-balanced across hosts depends on the router. Common routers balance per connection based on a packet hash: as long as the TCP or UDP packets of a given connection all reach the same node, there is no reordering caused by spraying one connection's data across several nodes, and traffic is spread over different connections rather than within a single connection.

Some high-performance routers extract selected fields from each packet as the key for the hash and use the result to choose which host to forward to. The classic keys are the 3-tuple and the 5-tuple. The 3-tuple uses (protocol, source IP, destination IP), which guarantees that the packets of a connection reach the same backend host. The 5-tuple adds the source and destination ports to the key; compared with the 3-tuple, it allows different connections from the same client to be spread across different backend hosts.
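The sketch below illustrates the idea (the node names and addresses are made up): hashing the 5-tuple maps every packet of a connection onto the same next hop, while a second connection from the same client may land on a different node.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fiveTuple is the classic ECMP hash key: protocol, source/destination IP,
// source/destination port.
type fiveTuple struct {
	proto            string
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

// pickNextHop hashes the 5-tuple and maps it onto one of the next hops,
// so every packet of the same connection takes the same path.
func pickNextHop(t fiveTuple, hops []string) string {
	h := fnv.New32a()
	fmt.Fprintf(h, "%s|%s|%s|%d|%d", t.proto, t.srcIP, t.dstIP, t.srcPort, t.dstPort)
	return hops[h.Sum32()%uint32(len(hops))]
}

func main() {
	hops := []string{"node-a", "node-b", "node-c"} // cluster nodes advertising the Service IP
	c1 := fiveTuple{"tcp", "203.0.113.7", "198.51.100.10", 51000, 80}
	c2 := fiveTuple{"tcp", "203.0.113.7", "198.51.100.10", 51001, 80}
	fmt.Println(pickNextHop(c1, hops)) // all packets of connection 1 go here
	fmt.Println(pickNextHop(c2, hops)) // a second connection from the same client may land elsewhere
}
```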

The advantage of BGP mode is that ordinary external routers can be used instead of purpose-built load balancers. Of course, this is also the biggest disadvantage of BGP mode: if a host in the cluster fails and you want its traffic cut over quickly, the router cannot respond gracefully.

BGP-based load balancing of stateless services hashes fixed fields of each packet and forwards packets to specific next hops. This is also a weakness of BGP: if a BGP session is interrupted or the set of backend hosts changes, packets are rehashed and forwarded to a different backend host, and the host that suddenly receives mid-connection packets cannot handle them, so the connection errors out. Packet loss also occurs if the Service IP changes and the mapping between the Service IP and the nodes changes. To work around these problems, you can choose whichever of the following measures fit your scenario:

  • Have the router use a more stable ECMP hashing algorithm (sometimes called "resilient ECMP" or "resilient LAG"); this greatly reduces the number of connections affected when the set of backend hosts changes.
  • Deploy services to a fixed pool of trusted hosts that is as small as possible.
  • Deploy services during traffic troughs whenever possible.
  • Split each logical service into two Kubernetes Services with different IP addresses, and use DNS to gracefully migrate user traffic to the new Service over time.
  • Add retries on the client side; this is especially useful for mobile or rich clients (see the sketch after this list).
  • Put the services behind an Ingress controller and let the Ingress controller accept traffic through MetalLB. This places a stateful layer between BGP and your services, so changes to the services themselves are no longer a concern; only changes to the Ingress controller's deployment need care.
  • For lightly used internal services, the occasional connection reset may simply be acceptable.
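As a small illustration of the client-side retry suggestion above (the Service address is a placeholder), a Go client can absorb the occasional reset caused by connection rehashing like this:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// get retries a request a few times with a short backoff, which is often
// enough to ride out a connection reset caused by BGP rehashing.
func get(url string, attempts int) (*http.Response, error) {
	var lastErr error
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil {
			return resp, nil
		}
		lastErr = err
		time.Sleep(time.Duration(i+1) * 200 * time.Millisecond)
	}
	return nil, fmt.Errorf("all %d attempts failed: %w", attempts, lastErr)
}

func main() {
	// Placeholder address; replace with the LoadBalancer IP in a real cluster.
	resp, err := get("http://198.51.100.10/", 3)
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```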

The working process of MetalLB

MetalLB is divided into two parts, Controller and Speaker. The division of labor between the two is as follows:

  • Controller monitors service changes and allocates IP addresses based on the corresponding IP address pool.

  • The Speaker is responsible for monitoring Service changes, initiating corresponding broadcasts or responses according to specific protocols, and electing the Leader of nodes.

Resource \ Component    Controller    Speaker
ConfigMap               ✓             ✓
Node                    ✗             ✓
Service                 ✓             ✓
Endpoints               ✗             ✓

The table above lists the resources watched by the Controller and the Speaker. As shown in the figure above, there is no direct communication between the Controller and the Speaker; they coordinate mainly through changes to the Service itself. Between nodes, the open-source memberlist project is used for election, achieving an effect similar to keepalived. Together, the Controller and the Speaker implement load balancing as Services and Nodes change.

Add and update Service

The Controller process

As shown in the figure above, the Controller handles an event for a Service of type LoadBalancer as follows:

  1. The apiserver sends the add or update event to the Controller;
  2. After receiving it, the Controller makes a deep copy of the object;
  3. The Controller clears any IP address previously assigned to the Service from its current assignment list;
  4. The Controller computes an IP address for the LoadBalancer from the configured address pool;
  5. The Controller writes the IP into the Service and updates the Service through the apiserver;
  6. After receiving the updated information, kube-proxy adjusts the corresponding rules on the host.

From the steps above you can see that the Controller interacts only with the apiserver; the whole exchange is driven by watching Service changes through the apiserver.
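To make that watch relationship concrete, here is a minimal client-go sketch, not MetalLB's actual code, that watches LoadBalancer Services the way the Controller conceptually does; it assumes in-cluster credentials and leaves out the real allocation and update logic.

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	// Assumes the process runs inside the cluster, as MetalLB's components do.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch all Services; MetalLB uses informers, but a plain watch keeps the sketch short.
	w, err := client.CoreV1().Services(metav1.NamespaceAll).Watch(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for ev := range w.ResultChan() {
		svc, ok := ev.Object.(*corev1.Service)
		if !ok || svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
			continue // only LoadBalancer Services are interesting here
		}
		// Work on a deep copy, never on the object from the watch stream.
		svc = svc.DeepCopy()
		fmt.Printf("%s event for %s/%s, current ingress: %v\n",
			ev.Type, svc.Namespace, svc.Name, svc.Status.LoadBalancer.Ingress)
		// A real controller would now allocate an address from the pool and
		// write it back through the apiserver, e.g. with
		// client.CoreV1().Services(svc.Namespace).UpdateStatus(...).
	}
}
```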

The Speaker process

The Speaker handles Service changes with the following steps:

  1. When kube-apiserver receives a Service addition or update, it sends the event to the Speaker;
  2. After receiving it, the Speaker makes a copy of the object;
  3. The Speaker determines the protocol to use from the IP address in the Service's Ingress field;
  4. Depending on the mode, the Speaker either answers requests directly (Layer2) or synchronizes routing rules to its peers (BGP);
  5. The Speaker updates its Prometheus metrics.

For an add or update, BGP mode appends the routing rule to its list and synchronizes the rules to all peers, while Layer2 mode simply adds the address to the list of addresses the node answers for.
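To show the shape of this per-mode dispatch, here is a hypothetical sketch; the type and method names are made up rather than MetalLB's real ones, and it covers both the announce and the withdraw path.

```go
package main

import "fmt"

// Announcer abstracts the two broadcast modes; the names here are
// hypothetical, not MetalLB's real types.
type Announcer interface {
	Announce(svc, ip string)
	Withdraw(svc string)
}

// layer2Announcer adds the IP to the set of addresses this node answers
// ARP/NDP requests for (only if this node won the election for the Service).
type layer2Announcer struct{ owned map[string]string }

func (a *layer2Announcer) Announce(svc, ip string) { a.owned[svc] = ip }
func (a *layer2Announcer) Withdraw(svc string)     { delete(a.owned, svc) }

// bgpAnnouncer keeps a list of advertised routes and re-synchronizes it to
// every peer whenever the list changes.
type bgpAnnouncer struct {
	routes map[string]string
	peers  []string
}

func (a *bgpAnnouncer) Announce(svc, ip string) { a.routes[svc] = ip; a.sync() }
func (a *bgpAnnouncer) Withdraw(svc string)     { delete(a.routes, svc); a.sync() }
func (a *bgpAnnouncer) sync() {
	for _, p := range a.peers {
		fmt.Printf("syncing %d route(s) to peer %s\n", len(a.routes), p)
	}
}

func main() {
	var ann Announcer = &bgpAnnouncer{routes: map[string]string{}, peers: []string{"10.0.0.1"}}
	ann.Announce("default/nginx", "198.51.100.10") // add/update path
	ann.Withdraw("default/nginx")                  // delete path
}
```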

Delete SVC operations

When a Service is deleted, the Controller and the Speaker likewise take action.

The Controller process

As shown in the figure above, when a Service is deleted, both the Controller and kube-proxy do their part, as shown in the following steps:

  1. kube-apiserver receives the Service deletion and sends an event to the listeners;
  2. After receiving it, kube-proxy updates the forwarding rules on the host;
  3. The Controller receives the event and makes a copy of it;
  4. The Controller removes the Service from its list of assigned IP addresses;
  5. The Controller reclaims the IP address back into the pool.

The Speaker process

As shown in the figure, when the Speaker receives a Service deletion, it determines the protocol in use and either stops answering for the address or synchronizes updated routing rules. The specific steps are as follows:

  1. The apiserver sends the Service deletion event to the Speaker;
  2. The Speaker receives the event and makes a copy of it;
  3. The Speaker looks up the announcement object for the Service;
  4. The Speaker calls that object to withdraw the announcement in the configured broadcast mode.

In BGP mode, the deleted Service is removed from the advertisement list and the updated routing rules are synchronized to the peer AS. In Layer2 mode, the address is simply removed from the list the node answers for.

Node state change

Node state changes matter most to the Speaker. If a node can no longer provide service, the Layer2 or BGP machinery must switch the LB IP address to another host promptly to limit the impact of the node failure. MetalLB nodes also run an election among themselves, implemented with memberlist. Next we look at what happens when nodes change.

When a node is added, deleted, or updated, the Speaker iterates over the configured protocols after receiving the event and triggers the relevant operation for each one: in BGP mode it forcibly re-synchronizes the routing rules to the peer AS, and in Layer2 mode it forcibly re-joins the memberlist cluster.

Conclusion

MetalLB is responsible for steering external traffic to hosts in the Kubernetes cluster and provides two modes for doing so, Layer2 mode and BGP mode. Layer2 mode suits environments where the LAN has plenty of spare IP addresses, but it can steer all external traffic to only one host in the cluster, so the cluster's exposed bandwidth is limited by that single host. BGP mode is the more ideal implementation: it requires software or hardware routers that support BGP rather than purpose-built load balancers, and the router uses hashing to ensure that the packets of a single connection are forwarded to the same host. That is also the source of BGP mode's biggest disadvantage: when a host fails, traffic cannot be switched to a new host gracefully, and packets of an existing connection end up on a different host, disrupting the connection. There is no ideal solution, only mitigations chosen per scenario: a more stable hashing algorithm, updating services during traffic troughs, retrying on failure, placing services behind an Ingress controller, and so on.

Performance in high-traffic scenarios depends on network conditions and on how fast the cluster hosts can forward traffic (IPVS or iptables); on MetalLB's side, only BGP mode can spread different connections across different hosts to add bandwidth and improve performance.

MetalLB version 0.9.5 has several drawbacks, both in code and in practice:

  • If a Service of type LoadBalancer already exists, its address is not re-announced when MetalLB first starts up.
  • BGP route synchronization is only initiated when a LoadBalancer Service or a node changes; it does not happen at any other time. If MetalLB or the router fails for a long time, the routes for already-created Services cannot be synchronized.