In the first part of our service mesh tour, we discussed what a service mesh is and why we chose Linkerd2. In this second part, we discuss the problems we faced and how we solved them.
This post is part of a series. Part 1: In Intenseye, why we chose Linkerd2 as the Service Mesh tool.
Problem 1: Apache ZooKeeper leader election
At Intenseye, we use Apache Pulsar instead of the traditional Apache Kafka queuing system.
Apache Pulsar is a cloud-native, multi-tenant, high-performance distributed messaging and streaming platform originally created by Yahoo!
Apache Pulsar uses Apache ZooKeeper for metadata storage, cluster configuration, and coordination. After we meshed ZooKeeper with Linkerd2, K8S restarted the pods one by one, but they got stuck in “CrashLoopBackOff”.
We checked the logs and found that ZooKeeper was unable to communicate with the other cluster members. We dug deeper and found that the ZooKeeper nodes could not elect a leader because of the mesh. A ZooKeeper server listens on three ports: 2181 for client connections, 2888 for follower connections (if it is the leader), and 3888 for inter-server connections during leader election.
We then looked through the documentation and found the “skip-inbound-ports” and “skip-outbound-ports” parameters for the Linkerd2 sidecar. Once we added ports 2888 and 3888 to the skip-inbound/skip-outbound lists, the cluster recovered. Because these ports are only used for ZooKeeper’s internal pod-to-pod communication, they can safely bypass the mesh.
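For illustration, here is roughly what those skip annotations look like on a ZooKeeper StatefulSet pod template. This is a minimal sketch, not our production manifest; the image, replica count, and labels are placeholders.

```yaml
# Illustrative sketch: ZooKeeper StatefulSet pod template with the
# Linkerd2 skip annotations for the leader-election ports.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: zookeeper
spec:
  serviceName: zookeeper
  replicas: 3
  selector:
    matchLabels:
      app: zookeeper
  template:
    metadata:
      labels:
        app: zookeeper
      annotations:
        linkerd.io/inject: enabled
        # Leader-election traffic (2888/3888) bypasses the mesh;
        # client traffic on 2181 stays meshed.
        config.linkerd.io/skip-inbound-ports: "2888,3888"
        config.linkerd.io/skip-outbound-ports: "2888,3888"
    spec:
      containers:
        - name: zookeeper
          image: zookeeper:3.6   # placeholder version
          ports:
            - containerPort: 2181
            - containerPort: 2888
            - containerPort: 3888
```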
This is a common problem with all service meshes if you are running an application that performs leader election (for example Pulsar, Kafka, etc.), and this is the solution.
Problem 2: Fail-fast logs
We started to mesh our applications one by one. All went well until we meshed one of our AI services, which we will call Application-A. We have another application running as more than 500 lightweight pods, which we will call Application-B, that uses gRPC to make requests to Application-A.
One or two minutes after meshing, we saw hundreds of “Failed to proxy request: HTTP Logical service in fail-fast” errors in Application-B. We checked the source code of the linkerd2-proxy repository and found where the log was printed, but couldn’t make sense of the error message. I mean, what is an HTTP Logical service?
So we opened an issue on the Linkerd2 GitHub repository. The maintainers were very interested in the problem and tried to help us fix it, even releasing a special debug build to track it down.
After all the discussion, it turned out that the MAX_CONCURRENT_STREAMS value of 10 set on Application-A was not sufficient to handle the request load; Linkerd2 simply made that visible. We increased the value from 10 to 100, and the fail-fast errors no longer occur.
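As a sketch only: assuming Application-A reads its HTTP/2 stream limit from an environment variable (the variable name, image, and deployment name here are illustrative, not our actual configuration), the change boils down to something like this:

```yaml
# Hypothetical deployment fragment: Application-A reads its HTTP/2
# max-concurrent-streams limit from an environment variable.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-a
spec:
  template:
    spec:
      containers:
        - name: application-a
          image: registry.example.com/application-a:latest  # placeholder
          env:
            # Raised from "10", which could not keep up with the
            # request volume coming from 500+ Application-B pods.
            - name: MAX_CONCURRENT_STREAMS
              value: "100"
```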
Problem 3: Outbound connections before Sidecar initialization
We have a few applications that make HTTP calls during startup; they need to fetch some information before they can serve requests. Such an application tries to establish an outbound connection before the Linkerd2 sidecar is initialized, so the call fails. K8S then restarts the application container (not the sidecar container), by which time the sidecar is ready, so everything works fine after the application container restart.
Again, this is another common problem with all service meshes, and there is no elegant solution. A very simple workaround is to “sleep” during startup, as sketched below. There is an open issue on GitHub, and the Linkerd2 folks have suggested a solution that I think requires more work than “sleep”.
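If you did want to apply the “sleep” workaround, a minimal sketch would look like the following; the image, command, and delay are placeholders.

```yaml
# Sketch of the "sleep" workaround: delay the application so the
# linkerd-proxy sidecar has time to initialize before any outbound
# calls are made. Image, command, and delay are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: startup-caller
spec:
  template:
    spec:
      containers:
        - name: app
          image: registry.example.com/startup-caller:latest
          command: ["/bin/sh", "-c"]
          args:
            # Wait for the sidecar, then start the real process.
            - "sleep 10 && exec /app/start"
```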
We left it as it was, though; a single automatic application container restart resolves the problem.
Problem 4: Prometheus
Prometheus is an open-source, cloud-native application for monitoring and alerting. It records real-time metrics in a time-series database and supports flexible queries and real-time alerts.
Linkerd2 has a nicely documented tutorial for bringing your own Prometheus instance. We followed it and everything worked fine until we meshed an application that uses Prometheus’s PushGateway to push our own internal metrics, separate from those generated by Linkerd2.
PushGateway is an intermediary service that allows you to push metrics from jobs that cannot be scraped.
After meshing, more than 500 lightweight pods began pushing metrics through the sidecar proxy. We started having memory problems on the PushGateway side, so we skipped port 9091 (the PushGateway port) from the mesh on those 500+ pods.
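Again, the skip is just a Linkerd2 annotation on the pushing pods’ template; here is a sketch (the deployment name and image are placeholders):

```yaml
# Illustrative fragment: Application-B's pod template, with outbound
# PushGateway traffic (port 9091) bypassing the Linkerd2 proxy.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: application-b
spec:
  template:
    metadata:
      annotations:
        linkerd.io/inject: enabled
        config.linkerd.io/skip-outbound-ports: "9091"
    spec:
      containers:
        - name: application-b
          image: registry.example.com/application-b:latest  # placeholder
```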
Conclusion
Not everything was easy when Arya killed the Night King. Making changes in a running system has always been difficult. We knew there would be some bumps in the road when integrating with Linkerd2, but we worked through our problems.
References:
- linkerd.io/
- prometheus.io/
- pulsar.apache.org/
- youtu.be/3wGMV60wBm4