In most cases, Envoy is deployed as a Sidecar in the same network environment as the application: each application only needs to talk to its local Envoy (localhost) and does not need to know the addresses of other services. But the Sidecar is not Envoy's only use. Envoy is itself a Layer 7 proxy whose modular architecture provides core features such as traffic management and observability. Its traffic-management capabilities include automatic retries, circuit breaking, global rate limiting, traffic mirroring, outlier detection, and other advanced features, so Envoy is also often used as an edge proxy, for example in Istio's Ingress Gateway and in Envoy-based Ingress Controllers (Contour, Ambassador, Gloo, etc.).

My blog also runs on a lightweight Kubernetes cluster (K3s, actually). At first I used Contour as the Ingress Controller to expose the blog, comment, and other services in the cluster. It didn't last long: I deploy all kinds of strange things in the cluster, and Contour couldn't cover some of my more idiosyncratic needs. After all, as everyone knows, details are lost with each layer of abstraction, and another controller would eventually run into the same problem. So I decided to use bare Envoy as the edge proxy and hand-write the YAML myself.

It's not all hand-rolled, though. There is no control plane, but there is still some sense of ritual: configurations can be updated dynamically from files. See Envoy Primer: Dynamically Update Configurations Based on a File System.

1. Introducing UDS

Enough rambling; let's get down to business. To improve the blog's performance, I deployed the blog and Envoy on the same node, both in HostNetwork mode, so that Envoy communicates with the blog's Pod (Nginx) via localhost. To squeeze out even more performance, I turned to Unix Domain Sockets (UDS), a form of IPC (inter-process communication). To understand UDS, let's start with a simple model.

In the real world, the whole process of two people exchanging information is called a communication, and the two parties are called endpoints. Depending on their environment and the tools available, endpoints can communicate in different ways: if they are close, they can talk face to face; if they are far apart, they can phone each other or chat on WeChat. These tools are the sockets.

Computers have analogous concepts:

  • In Unix, a communication has two endpoints, for example an HTTP server and an HTTP client.
  • For two endpoints to communicate, they need a tool; in Unix that tool is the Socket.

Sockets were originally designed for network communication, but an IPC mechanism, UDS, was later built on the Socket framework. The benefits of UDS are obvious: no traversal of the network protocol stack, no packing and unpacking of packets, no checksums, sequence numbers, or acknowledgements; application-layer data is simply copied from one process to another. This is because IPC mechanisms assume reliable communication, while network protocols are designed for unreliable transports.

The most obvious difference between a UDS and a network socket is the address: a network socket address is an IP address plus a port number, whereas a UDS address is the path of a socket file in the file system, conventionally ending in .sock. This socket file can be referenced by processes on the same system: two processes open the same UDS and communicate through it. The traffic stays inside the kernel and is never transmitted over the network. Let's look at how Envoy can talk to the upstream cluster Nginx over a UDS; their communication model looks something like this:
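
To make the model concrete, here is a minimal sketch of two endpoints talking over a UDS in Python; the socket path and messages are made up for illustration:

```python
import os
import socket
import tempfile
import threading

# Hypothetical socket path -- any writable directory works (tmpfs is fastest).
sock_path = os.path.join(tempfile.mkdtemp(), "demo.sock")
ready = threading.Event()

def server():
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as srv:
        srv.bind(sock_path)   # creates the socket file in the file system
        srv.listen(1)
        ready.set()
        conn, _ = srv.accept()
        with conn:
            data = conn.recv(1024)      # bytes never touch the network stack
            conn.sendall(b"echo: " + data)

t = threading.Thread(target=server)
t.start()
ready.wait()

with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as cli:
    cli.connect(sock_path)              # the "address" is just the file path
    cli.sendall(b"hello")
    reply = cli.recv(1024)

t.join()
print(reply.decode())  # echo: hello
```

Note that both sides address each other purely by file path; no IP address or port is involved anywhere.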

2. Nginx listens on a UDS

The first step is to change the Nginx configuration to listen on a UDS. The location of the socket file is up to you; the listen directive needs to be changed to the following form:

listen      unix:/sock/hugo.sock;

Of course, if you want even faster communication, you can put the socket under the /dev/shm directory. That directory is a tmpfs mount, an area backed directly by RAM, so reads and writes are very fast; more on this below.

3. Envoy–>UDS–>Nginx

By default, Envoy uses an IP address and port number to communicate with upstream clusters. To communicate over a UDS instead, first change the service discovery type to static:

type: static

You also need to define the endpoint as a UDS:

- endpoint:
    address:
      pipe:
        path: "/sock/hugo.sock"

The final Cluster configuration is as follows:

- "@type": type.googleapis.com/envoy.api.v2.Cluster
  name: hugo
  connect_timeout: 15s
  type: static
  load_assignment:
    cluster_name: hugo
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            pipe:
              path: "/sock/hugo.sock"

Finally, for Envoy to access Nginx's socket file, Kubernetes can mount the same emptyDir into both containers so they share it. The big precondition, of course, is that the containers in the Pod share IPC. The configuration is as follows:

spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: envoy
        ...
        volumeMounts:
        - mountPath: /sock
          name: hugo-socket
          ...
      - name: hugo
        ...
        volumeMounts:
        - mountPath: /sock
          name: hugo-socket
          ...
      volumes:
      ...
      - name: hugo-socket
        emptyDir: {}

Now you can enjoy visiting my blog again, and check Envoy's log to confirm that requests are being forwarded to the upstream cluster via the socket:

[2020-04-27T02:49:47.943Z] "GET /posts/prometheus-histograms/ HTTP/1.1" 200 - 169949 1 0 0 "66.249.64.209,45.145.38.4" "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.96 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" "9d490b2d-7c18-4dc7-b815-97f11bfc04d5" "fuckcloudnative.io" "/dev/shm/hugo.sock"

Heh, Google's crawler has joined in too.

Notice that the socket is under the /dev/shm/ directory. Don't worry, we're not done yet; first, a little background.

4. Linux shared memory mechanism

Shared memory is one of the mechanisms for inter-process communication (IPC) on Linux.

Processes can communicate through pipes, sockets, signals, semaphores, message queues, and so on, but these mechanisms usually involve copying data between user space and kernel space; it is commonly said that four copies take place. Shared memory, by contrast, maps the same memory directly into the address space of multiple processes, which all access the same block of memory, so in theory it performs better. Heh, that means the scheme above can be improved yet again.

Shared memory has two mechanisms:

  • POSIX shared memory (shm_open(), shm_unlink())
  • System V shared memory (shmget(), shmat(), shmdt())

System V shared memory has a long history and is available on every common UNIX system, but the POSIX shared memory interface is simpler and easier to use, and is usually combined with memory mapping (mmap).
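
For a feel of the POSIX interface, here is a minimal sketch using Python's multiprocessing.shared_memory module (Python 3.8+), which wraps shm_open()/shm_unlink() on Linux; the segment size and contents are arbitrary:

```python
from multiprocessing import shared_memory

# "Writer" side: create a named segment (shm_open + mmap under the hood).
writer = shared_memory.SharedMemory(create=True, size=16)
writer.buf[:5] = b"hello"

# "Reader" side: attach to the same segment by name. In a real program this
# would happen in another process -- both map the same physical pages.
reader = shared_memory.SharedMemory(name=writer.name)
msg = bytes(reader.buf[:5])

reader.close()
writer.close()
writer.unlink()  # the shm_unlink() counterpart: remove the segment

print(msg)  # b'hello'
```

The reader sees the writer's bytes without any send/receive call at all; the data is never copied through the kernel.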

The main differences between mmap and System V shared memory are:

  • System V shared memory is persistent: unless explicitly deleted by a process, it stays in memory until the system shuts down.
  • Memory mapped with mmap is not persistent: when the processes exit, the mapping is gone, unless it was mapped to a file in advance.
  • /dev/shm is the default mount point for POSIX shared memory on Linux.

POSIX shared memory is implemented on top of tmpfs. In fact, not only PSM (POSIX shared memory) but also SSM (System V shared memory) is implemented on tmpfs inside the kernel.
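
You can observe the tmpfs backing directly: on Linux, a POSIX segment appears as an ordinary file under /dev/shm (a Linux-only sketch; the segment name is chosen by Python, not by us):

```python
import os
from multiprocessing import shared_memory

shm = shared_memory.SharedMemory(create=True, size=1024)

# On Linux, the POSIX segment is just a plain tmpfs file under /dev/shm.
backing = os.path.join("/dev/shm", shm.name)
visible = os.path.exists(backing)

shm.close()
shm.unlink()  # removes the backing file too

print(visible)  # True on Linux
```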

From this you can see that tmpfs serves two purposes:

  • It backs System V shared memory and anonymous memory mappings; this instance is managed by the kernel and is invisible to users.
  • It backs POSIX shared memory; users are responsible for mounting it, normally at /dev/shm, and it depends on CONFIG_TMPFS.

Although both System V and POSIX shared memory are implemented through tmpfs, they are subject to different limits: /proc/sys/kernel/shmmax limits only System V shared memory, and the size of /dev/shm limits only POSIX shared memory. In fact, System V and POSIX shared memory are backed by two different tmpfs instances.

System V shared memory is limited only by /proc/sys/kernel/shmmax, while /dev/shm is mounted by the user and defaults to half of physical memory.

To recap:

  • POSIX and System V shared memory are both implemented through tmpfs in the kernel, but they correspond to two different tmpfs instances, independent of each other.
  • /proc/sys/kernel/shmmax limits the maximum size of System V shared memory, and the size of /dev/shm limits the maximum size of POSIX shared memory.
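
These two limits can be inspected programmatically; a Linux-only sketch (the actual values vary by kernel and distribution):

```python
import shutil

# System V cap: a single kernel tunable, in bytes.
with open("/proc/sys/kernel/shmmax") as f:
    sysv_max = int(f.read())

# POSIX cap: simply the size of the tmpfs mounted at /dev/shm.
posix_max = shutil.disk_usage("/dev/shm").total

print(sysv_max > 0, posix_max > 0)
```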

5. Kubernetes shared memory

A Pod created by Kubernetes gets 64MB of shared memory by default, and this value cannot be changed directly.

Why that value? Kubernetes itself does not set the shared memory size; 64MB is simply Docker's default.

When running Docker directly, you can set the shared memory size with --shm-size:

🐳 → docker run --rm centos:7 df -h | grep shm
shm              64M     0   64M   0% /dev/shm

🐳 → docker run --rm --shm-size 128M centos:7 df -h | grep shm
shm             128M     0  128M   0% /dev/shm

However, Kubernetes provides no way to set the shm size. In this issue the community has debated for a long time whether to add a parameter for shm, but it has never been resolved. There is, however, a workaround: mount an emptyDir with medium Memory to /dev/shm.

Kubernetes provides a special kind of emptyDir: set the emptyDir.medium field to "Memory", and Kubernetes will use tmpfs (a RAM-backed file system) as the medium. You can mount such an emptyDir to any directory and use it as a high-performance file system, including at /dev/shm, which solves the shared memory problem.

Using an emptyDir this way solves the problem, but it has drawbacks:

  • Memory usage cannot be capped in time. Kubelet will evict the Pod after a minute or two, but during that window the Node is actually at risk.
  • It affects Kubernetes scheduling, because an emptyDir does not count toward the Node's resources, so the Pod quietly consumes Node memory without the scheduler knowing.
  • The user cannot detect in time that the memory is no longer available.

Since shared memory is also subject to cgroup limits, all we need to do is set memory limits on the Pod. But if you set the Pod's memory limits equal to the shared memory size, you run into a problem: when shared memory is exhausted, no command can be executed; you can only wait for Kubelet to evict the Pod after a timeout.

This is easy to solve: set the shared memory size to 50% of the memory limits. Based on the analysis above, the final design is as follows:

  1. Mount a Memory-medium emptyDir at /dev/shm/.
  2. Set memory limits on the Pod.
  3. Set the emptyDir's sizeLimit to 50% of the memory limits.

6. Final configuration

Based on the design above, the final configuration is as follows.

The Nginx configuration becomes:

listen      unix:/dev/shm/hugo.sock;

The Envoy configuration becomes:

- "@type": type.googleapis.com/envoy.api.v2.Cluster
  name: hugo
  connect_timeout: 15s
  type: static
  load_assignment:
    cluster_name: hugo
    endpoints:
    - lb_endpoints:
      - endpoint:
          address:
            pipe:
              path: "/dev/shm/hugo.sock"

The Kubernetes manifest becomes:

spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: envoy
        resources:
          limits:
            memory: 256Mi
        ...
        volumeMounts:
        - mountPath: /dev/shm
          name: hugo-socket
          ...
      - name: hugo
        resources:
          limits:
            memory: 256Mi
        ...
        volumeMounts:
        - mountPath: /dev/shm
          name: hugo-socket
          ...
      volumes:
      ...
      - name: hugo-socket
        emptyDir:
          medium: Memory
          sizeLimit: 128Mi

