This article attempts to explain how the multiple network layers in a Kubernetes cluster operate.

Kubernetes is a powerful platform with many clever design choices, but discussing how its networks interact can be confusing: pod networks, service networks, cluster IPs, container ports, host ports, node ports… the sheer number of concepts makes some people's eyes glaze over. To make matters worse, most of us talk about these things at work while crossing all of the layers at once, which makes them hard to grasp in one sitting. If instead you take one piece at a time and build up a clear picture of how each layer works, you can understand the whole thing step by step.

Getting back to networking, I'm going to split this topic into three parts:

  • Part 1: containers and pods
  • Part 2: Kubernetes Services, the abstraction layer over pods
  • Part 3: ingress, and how traffic flows from outside the cluster to a pod

This article is not a basic introduction to containers, Kubernetes, or pods. To learn more about how containers work, see the Docker documentation; the Kubernetes documentation provides an overview of Kubernetes usage and a detailed overview of pods. Finally, a basic understanding of networking and IP address space will also help.

If not, perhaps there will be time for a follow-up article covering those topics.

Pod

What is a pod? (This assumes you have read the pod documentation on the Kubernetes website.)

A pod consists of one or more containers that are scheduled onto the same host and configured to share a network namespace and other resources, such as volumes.

Pods are the basic unit of a Kubernetes application. So what does the "shared network namespace" in the description above actually mean?

In practice, it means that all containers in a pod can reach each other on localhost. If I have an nginx container listening on port 80 and another container running scrapyd, the second container can reach the first at http://localhost:80. But how does this work under the hood? Let's start with the classic case: starting a single Docker container on a local machine.
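For concreteness, here is a minimal sketch of that classic case using plain Docker. The container name is purely illustrative, and a default install puts the bridge on 172.17.0.0/16 rather than the /24 used in this article's example, but the structure is the same:

```bash
# Start a single container on Docker's default bridge network.
docker run -d --name web nginx

# The docker0 bridge on the host, and the container's address on it.
ip addr show docker0
docker inspect -f '{{ .NetworkSettings.IPAddress }}' web
```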

Looking at it from the top down: the host has a network interface, eth0; below it is a bridge, docker0; and connected to the bridge is a virtual network interface, veth0.

Note that docker0 and veth0 are on the same network, 172.17.0.0/24 in this example. docker0 is assigned 172.17.0.1 and acts as the default gateway for veth0, which is assigned 172.17.0.2.

Because of the way the network namespace is set up, processes inside the container see only veth0 and communicate with the outside world through docker0 and eth0. Now let's start a second container:

As shown in the figure above, the second container gets a new virtual network interface, veth1, connected to the same docker0 bridge. It is assigned 172.17.0.3, so it sits on the same network as the bridge and the first container. The two containers can then communicate across the bridge, as long as each can somehow discover the other's IP address.
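Continuing the sketch above, a second container on the same bridge can reach the first one by IP; the busybox image and container names are just for illustration:

```bash
# Start a second container on the same default bridge.
docker run -d --name scraper busybox sleep 3600

# Discover the first container's bridge address...
WEB_IP=$(docker inspect -f '{{ .NetworkSettings.IPAddress }}' web)

# ...and reach it from the second container across docker0.
docker exec scraper wget -qO- "http://$WEB_IP" | head -n 3
```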

That explains the plumbing underneath ordinary Docker containers, but it still doesn't show us the "shared network namespace" of a Kubernetes pod. Fortunately, namespaces are flexible: **Docker can start a container and, instead of creating a new virtual network interface for it, have it share an existing one (that is, join it to an existing network namespace).** In that case, the diagram above looks a little different:

The second container now sees veth0 instead of having its own veth1 as in the previous example.

This has several implications. First, both containers are addressable from the outside at 172.17.0.2, and inside, each can reach the other container's open ports on localhost. It also means the two containers cannot open the same port, which is a limitation, but no different from running multiple processes on a single host. In this way, a group of processes gets the decoupling and isolation of containers while collaborating in the simplest possible network environment.
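With plain Docker you can reproduce this pod-style sharing yourself via the --network container:<name> option, which joins a new container to an existing container's network namespace; again, the names here are illustrative:

```bash
# Start a container in the *first* container's network namespace,
# instead of giving it its own veth interface.
docker run -d --name scraper2 --network container:web busybox sleep 3600

# From inside scraper2, the nginx served by "web" is reachable on localhost.
docker exec scraper2 wget -qO- http://localhost:80 | head -n 3
```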

Kubernetes implements this network model by creating a special infrastructure ("infra") container for each pod, which provides the network interface that the other containers share. If you SSH into a cluster node that is running pods (and using Docker as its runtime) and run docker ps, you will see at least one container started with the pause command.
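On a node that still uses Docker as its container runtime, those infra containers are easy to spot; the exact image name varies by cluster version, so treat the command below simply as a way to look for them:

```bash
# List the infra ("pause") containers running on a Docker-based node.
docker ps | grep pause
```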

The pause command suspends the process until it receives a signal, so these containers do nothing but sleep until Kubernetes sends them SIGTERM. Despite being inactive, the pause container is the heart of a pod: it holds the virtual network interface that every other container uses to communicate with the others and with the outside world. In our hypothetical pod-like setup, the network model looks something like this:

Pod Network

Nice, but even a pod whose containers can all talk to each other won't get us to a full system (the real world never has just one pod). In the next article on services, this becomes even clearer: the core of the Kubernetes design requires pods to be able to communicate with other pods, whether they are running on the same host or on different hosts.

To see how that communication happens, we need to step up a level and look at the nodes in the cluster.

This section unavoidably involves some IP routing, but I'll keep the explanation as brief and straightforward as possible. It's surprisingly hard to find a short, accessible tutorial on IP routing, but if you want something reliable, the Wikipedia article on the subject is not too heavy going.

A Kubernetes cluster consists of one or more nodes. A node is a host, physical or virtual, running a container runtime and its dependencies (mostly Docker, for now) plus several Kubernetes system components, and it is connected to a network that lets it reach the other nodes in the cluster. Keeping the diagram simple, a two-node cluster looks like this:

If you run your cluster on a cloud platform such as AWS, the diagram above is very close to the default network architecture of a single-project environment. For illustration, this example uses the private network 10.100.0.0/24, so the router is 10.100.0.1 and the two instances are 10.100.0.2 and 10.100.0.3.

With this setup, each instance can reach the others via eth0. That works fine for the hosts themselves, but remember that the pod we looked at above is not on this private network: it hangs off a bridge on a completely different network, one that is virtual and exists only on a specific node. To make that clearer, let's put the pod-level details back into the picture:

The host on the left has an eth0 interface at 10.100.0.2, with its default gateway pointing at the router at 10.100.0.1. Below it is the docker0 bridge at 172.17.0.1, and attached to the bridge is the veth0 interface at 172.17.0.2.

The veth0 interface is created along with the pause container and is visible inside all three containers thanks to the shared network namespace. Local routing rules are established when the bridge is created, so any packet arriving at eth0 with destination address 172.17.0.2 is forwarded to the bridge, which then sends it on to veth0. So far, so good.

If we know this host has a pod at 172.17.0.2, we can add a rule on the router setting the next hop for that address to 10.100.0.2, from where the packets will be forwarded on to veth0. Great! Now let's look at the other host.
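Before moving on to the second host, here is what that router rule could look like as a plain ip route entry on a Linux router; the addresses are the ones used in this example, so treat it as a sketch rather than something to copy verbatim:

```bash
# On the router (10.100.0.1): send traffic for the pod's address
# to the node that hosts it, which forwards it on to veth0 via the bridge.
ip route add 172.17.0.2/32 via 10.100.0.2
```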

The host on the right also has an eth0 interface, at 10.100.0.3, using the same default gateway 10.100.0.1, and its eth0 is likewise connected to a docker0 bridge at 172.17.0.1.

Hmm, is that a problem? This address really ought to be different from the bridge on host 1. I've made it the same here deliberately, because it is the worst case, and if you install Docker and let it do its default thing, this is most likely what you will get. But even if the chosen network were different, it highlights a more fundamental problem:

namely, that one node normally has no idea what private address space was assigned to the bridge on another node.

But we need to know that if we are going to send packets to it and have them arrive at the right place. Clearly, some additional structure is needed.


K8s provides this structure in two ways.

  1. First, it assigns an overall address space for the bridges on each node, and then assigns each node's bridge an address within that space, depending on which node the bridge is built on.
  2. Second, it adds routing rules to the gateway at 10.100.0.1 telling it how packets destined for each bridge should be routed, that is, which node's eth0 the bridge can be reached through (a sketch of what those routes could look like follows this list).
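Using made-up per-node ranges carved out of the cluster-wide pod range from this article's example (10.0.0.0/14), the gateway's extra routes might look roughly like this; the exact ranges depend entirely on your cluster:

```bash
# On the gateway (10.100.0.1): one route per node's bridge network.
# The /24 ranges below are hypothetical slices of the cluster's pod range.
ip route add 10.0.1.0/24 via 10.100.0.2   # pods on the left-hand node
ip route add 10.0.2.0/24 via 10.100.0.3   # pods on the right-hand node
```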

This combination of virtual network interfaces, bridges, and routing rules is usually called an overlay network. In the context of Kubernetes, I usually just call it the "pod network," mainly because it is an overlay network that allows pods to communicate with each other on any node.

The following diagram shows the whole setup, routing tables included:

One thing to note: the bridge has been renamed from "docker0" to "cbr0". Kubernetes does not use the standard Docker bridge device; "cbr" in fact stands for "custom bridge". I'm not sure about everything that is customized, but it is an important difference between Docker running under Kubernetes and a default Docker installation.

Another thing to note is that the address space allocated to the bridges in this example is 10.0.0.0/14. It comes from a cluster I run in the cloud, so it is a real example, but your cluster may be assigned a completely different range, and unfortunately there is no straightforward way to expose it with kubectl.
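That said, depending on how your cluster allocates per-node ranges, the node object may record its slice of that space in spec.podCIDR. Whether the field is populated varies by network plugin, so treat this purely as something worth trying rather than a guaranteed answer:

```bash
# May print one pod CIDR per node, or nothing, depending on the cluster's network plugin.
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
```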

Conclusion

Then again, generally speaking you don't have to think about how the pod network works at all 🥱.

When one pod communicates with another, it usually does so through the service abstraction, a kind of software-defined proxy that will be the subject of the next article in this series. However, pod network addresses do show up in logs and during debugging, and in some scenarios you may still need to route to this network explicitly.

For example, traffic leaving a Kubernetes pod for any address in the 10.0.0.0/8 range is not NATed by default. So if you are communicating with another service on a private network in that range, you may need to set up rules that route the reply packets back to the pod.
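As one illustration of such a rule, a host on that private network might need a static route that points the pod range back at something that can reach the pods. The addresses below reuse this article's examples and are purely hypothetical; in practice the next hop is usually the cluster's router or the specific node hosting the pod:

```bash
# On a non-cluster host in the 10.0.0.0/8 network: route replies destined
# for the pod range (10.0.0.0/14 here) back toward the cluster.
ip route add 10.0.0.0/14 via 10.100.0.2
```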

Finally, I hope this article can help you.

