Background

Istio, currently the most active service mesh project, provides a wide range of capabilities, including traffic management, security, and observability, all of which are needed for service governance and operations. These rich capabilities also make the system considerably more complex to operate and maintain. Even so, given what Istio can do today and how far it can be extended, it opens up a great deal of room for service governance, bringing both opportunities and challenges.

At present, the front-end services of Tuya Smart run Istio control plane version 1.5.0. More than 700 services and 1,100 Pod instances are connected to this control plane, which handles traffic control and capacity support for the largest business cluster in Tuya Smart's front-end services.

Problems in the development process at Tuya Smart

The front-end infrastructure team first started working with Kubernetes in 2018 and built a publishing platform on top of it to serve the front-end business teams. As the business teams grew, however, problems began to appear in the development and release process:

  1. Verification is difficult when multiple branches are developed in parallel
  2. Configuration complexity in a multi-region environment causes online problems

At first, most of these problems were left to the business teams to handle by adjusting their own internal processes. As more and more teams ran into them, we began to consider solving the problems in the development and release process with a grayscale (canary) release capability. In the daily and pre-release environments, multiple branches are released as multiple grayscale versions, and traffic is routed to different versions according to request headers, so that each feature branch has its own instance for verification and branches do not affect one another. In the online environment, grayscale release is provided for regression testing, as the last line of defense for project quality.

After investigating multiple solutions and taking into account the grayscale solutions of other internal teams, we weighed implementation complexity against richness of capabilities and started deploying Istio in production in early 2020. Although adopting Istio increases system complexity to some extent, it has also brought us pleasant surprises.

Problems solved by introducing Istio

Grayscale release

The grayscale release capability of the publishing platform is built on Istio's native VirtualService and DestinationRule resources.

  1. Each application published by the publishing platform carries two labels, canarytag and stage. stage identifies whether an instance belongs to a normal or a grayscale release, and canarytag identifies the specific grayscale version
  2. Each grayscale release creates a corresponding grayscale ReplicaSet
  3. A corresponding DestinationRule is published that uses the canarytag and stage labels to define the normal release and each grayscale version as separate instance sets (subsets)
  4. A VirtualService routes traffic to the different instance sets based on request headers

The following figure shows the configuration information.
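As a rough illustration of steps 3 and 4, a DestinationRule and VirtualService pair might look like the sketch below. The application name demo-app, the namespace fe, the label values stable and canary, the grayscale version feature-a, and the request header x-canary-tag are all hypothetical placeholders, not values taken from the publishing platform.

```yaml
# Hypothetical sketch: names, label values, and the header are assumptions.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: demo-app
  namespace: fe
spec:
  host: demo-app
  subsets:
  # Normal release instances, selected by the stage label
  - name: stable
    labels:
      stage: stable
  # One subset per grayscale version, selected by stage and canarytag
  - name: canary-feature-a
    labels:
      stage: canary
      canarytag: feature-a
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: demo-app
  namespace: fe
spec:
  hosts:
  - demo-app
  http:
  # Requests carrying the grayscale header go to the matching subset
  - match:
    - headers:
        x-canary-tag:
          exact: feature-a
    route:
    - destination:
        host: demo-app
        subset: canary-feature-a
  # Everything else falls through to the normal release
  - route:
    - destination:
        host: demo-app
        subset: stable
```

In a setup like this, each additional grayscale version would add one subset to the DestinationRule and one header-match rule ahead of the default route in the VirtualService.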

Traffic observation and anomaly detection

We built our monitoring platform on the community's native Prometheus-Operator, with a separate Prometheus-Operator deployed in each cluster. Monitoring is divided by collection target into business, Kubernetes cluster infrastructure, and Istio data plane traffic, and a corresponding Prometheus instance is deployed for each: a business Prometheus instance, a Kubernetes cluster infrastructure Prometheus instance, and an Istio traffic Prometheus instance.

In addition, Grafana is used to build an overall data plane traffic dashboard for traffic observation.

Based on the current monitoring data, alarm rules are configured to detect and handle traffic anomalies and fluctuations in time.

  # Active upstream connections drop below 1
  sum(envoy_cluster_upstream_cx_active{namespace!="fe-pre"}) by (namespace, service) < 1

  # 503 alarm
  sum by (namespace, service) (rate(envoy_cluster_upstream_rq_503[1m])) > 0

  # Service exception alarm
  sum by (namespace, service) (rate(envoy_cluster_upstream_rq{response_code_class!="2xx"}[1m])) != 0

Current achievements and open problems

All Pod instances in the largest online service cluster are now connected to the Istio control plane. Istio manages the overall traffic and provides traffic observability.

Grayscale releases account for more than 60% of all releases. More and more projects are using the grayscale release capability, which now covers the company's two largest business lines, and its functionality and stability have been recognized and praised by the business teams.

However, as traffic volume and the number of Pod instances have grown, we have run into some problems:

  1. The team uses Istio 1.5.0. In this version, when Pilot pushes a large number of xDS updates, Envoy readiness probes fail, which in turn triggers a large number of EDS updates and causes the cluster to fluctuate. The community has since fixed this problem, and the team plans to upgrade to version 1.7.7.
  2. After Pilot restarts abnormally, an Envoy that was connected to that Pilot instance cannot detect the server-side failure. It reconnects to a healthy Istiod only after TCP keepalive times out and the check fails, and cluster updates are not synchronized in the meantime. With the default configuration this wait is 975 seconds. The problem can be resolved in the Envoy bootstrap configuration by modifying the tcp_keepalive settings under upstream_connection_options of the xds-grpc cluster so that Envoy reconnects within 1 minute. With the values below, a dead connection is detected after roughly 30 + 3 × 5 = 45 seconds:
"upstream_connection_options": {
  "tcp_keepalive": {
    "keepalive_time": 30,
    "keepalive_probes": 3,
    "keepalive_interval": 5
  }
}

Future

The Tuya Smart front-end infrastructure team started using Istio in early 2020 and is still exploring its rich capabilities and powerful extensibility. Going forward, we will focus on two directions:

  1. Use Istio's existing capabilities to implement fine-grained traffic management, service degradation, and circuit breaking. Front-end microservices currently lack such management capabilities, and there is no effective means of ensuring service stability in abnormal scenarios
  2. Use Istio-based fault injection to provide fault cases and inject them into services, improving the fault tolerance and stability of the whole service system

To /blog/tuya-i…

Author: Tuya Smart front-end basic technology team

The Tuya Smart front-end basic technology team serves Tuya Smart's front-end business teams and is responsible for building a one-stop platform covering development, build, release, launch, traffic management, and operations.