Preface

Didi's self-built Kafka cloud management and control platform was slated for open source in April 2019 and was fully open-sourced on January 14, 2021. Over those 22 months it went through three major internal versions. Since its release it has been widely recognized by community users: the project has so far reached 1.9K GitHub stars and passed 688 users.

If you are interested, give the project a Star ✨ to bookmark it.

01. Development history of Didi Logi-KafkaManager

Didi ran into many problems while using Kafka. For example, the large number of clusters and topics, combined with complex business scenarios, made them difficult to manage; extremely high message volume put great pressure on the clusters; and user operations and O&M were frequent, complex, and costly.

At the same time, similar products in the community could not meet our needs in terms of O&M control capability, metric monitoring capability, platform role perspectives, user experience, and other aspects.

So Didi finally chose to build its own Kafka operation and maintenance management platform.

After nearly five years of accumulated experience and several version iterations, Didi Logi-KafkaManager today hosts dozens of Kafka clusters within Didi, comprising nearly a thousand machines, over 20K Kafka topics, and over 2 billion messages per day. It also performs well in cluster service stability, completeness of O&M capabilities, and user-friendliness of the interface.

02. Didi Logi-KafkaManager design concept

The product design of Didi Logi-KafkaManager follows the concept of "one point, three -izations":

One point: data security is the core point. A multi-tenant isolation model is established, with authentication for production and consumption;

Platformization: the high-frequency operations of users and O&M personnel are distilled and moved onto the platform;

Visualization: core Topic and cluster metrics are monitored and presented visually;

Expertise: day-to-day cluster O&M experience is accumulated into expert-level resource governance.

Didi Logi-KafkaManager adopts a layered business architecture, as follows:

Resource layer: depends only on ZooKeeper and MySQL, so dependencies are light and deployment is convenient;

Engine layer: uses Didi's Kafka 2.5 version, which is fully compatible with the open-source community Kafka engine and adds some engine features of its own, such as disk overload protection;

Gateway layer: above the engine layer we designed a gateway layer, which mainly provides security control, Topic rate limiting, service discovery, degradation capabilities, and so on;

Service layer: built on the Kafka Gateway, Kafka Manager provides a rich set of functions, mainly Topic management, monitoring management, cluster management, and so on;

Platform layer: a web platform is provided externally, with different functional pages for ordinary users and for O&M users. Some high-frequency operations in daily use are carried out on the platform to reduce user costs.

Didi Logi-KafkaManager is designed around Didi's enhanced engine and has the following features:

Stable cluster service: disk overload protection, production/consumption degradation, and traffic limiting keep the service durable, efficient, and stable. They prevent users from consuming cluster traffic without limit; otherwise, heavy-traffic users could exhaust system resources, affect other users, and even cause node failures in the cluster;

Efficient problem location: Topic-level real-time and historical latency statistics make it easy to pinpoint current problems and quickly trace back historical ones;

Agile service discovery: metadata is managed in a unified manner, which reduces each client's dependence on the bootstrap address. A cluster address change has no impact on clients.
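To make the traffic-limiting idea concrete, here is a minimal sketch of a token bucket keyed by AppID and Topic. It is an illustration only and not how Didi's gateway or engine implements limiting; the class name, the `check_quota` helper, the AppID, the Topic name, and the limits are all hypothetical.

```python
class ByteRateLimiter:
    """Token bucket: allows `rate` bytes per second, bursting up to `burst` bytes."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the last refill, in seconds

    def allow(self, nbytes, now):
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # over quota: the caller would throttle or reject


# One limiter per (AppID, Topic) pair; the names and limits are made up.
limiters = {("app-order", "order-events"): ByteRateLimiter(rate=1000, burst=2000)}


def check_quota(app_id, topic, nbytes, now):
    """Return True if this (AppID, Topic) pair may send `nbytes` at time `now`."""
    limiter = limiters.get((app_id, topic))
    return limiter is not None and limiter.allow(nbytes, now)
```

Passing `now` explicitly keeps the sketch deterministic; a real caller would pass something like `time.monotonic()`. Because each pair has its own bucket, one heavy user runs out of tokens without eating into anyone else's budget.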

Design highlights:

  • Tenants are defined through AppID + password to achieve tenant isolation;

  • Production and consumption are authenticated through Topic + AppID;

  • The unit of resource division is the Region. A group of brokers is assigned to a Region, so if some topics behave abnormally, the impact is limited to the Region they belong to;

  • Logical clusters are divided based on service and security capabilities to improve management and control efficiency.
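The first two highlights can be sketched as follows: a tenant is identified by AppID + password, and each produce or consume request is checked against a grant for that Topic + AppID pair. This is a simplified illustration under assumptions, not the project's actual code; `TenantRegistry`, its methods, and the sample IDs are hypothetical, and storing a salted PBKDF2 hash is a choice made here for the example.

```python
import hashlib
import hmac


def _hash(password, salt):
    # Store a salted PBKDF2 hash rather than the plain password.
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


class TenantRegistry:
    def __init__(self):
        self._apps = {}    # app_id -> (salt, password_hash)
        self._grants = {}  # (app_id, topic) -> set of "produce" / "consume"

    def register(self, app_id, password, salt):
        self._apps[app_id] = (salt, _hash(password, salt))

    def grant(self, app_id, topic, *operations):
        self._grants.setdefault((app_id, topic), set()).update(operations)

    def authenticate(self, app_id, password):
        entry = self._apps.get(app_id)
        if entry is None:
            return False
        salt, expected = entry
        # Constant-time comparison avoids leaking how many bytes matched.
        return hmac.compare_digest(expected, _hash(password, salt))

    def authorize(self, app_id, password, topic, operation):
        # A request passes only if the tenant is genuine AND holds the grant.
        return (self.authenticate(app_id, password)
                and operation in self._grants.get((app_id, topic), set()))
```

For example, after `reg.register("app-a", "s3cret", b"salt-a")` and `reg.grant("app-a", "orders", "produce")`, a call to `reg.authorize("app-a", "s3cret", "orders", "produce")` succeeds, while the same tenant asking to consume from `orders` is refused.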

Didi Logi-KafkaManager establishes experience maps based on separate roles and multiple usage scenarios.

There is a user experience map for ordinary users, covering Kafka production, consumption, monitoring, and alarm operations; an O&M experience map for O&M personnel, covering cluster management, cluster monitoring, and other O&M operations; and an operations experience map that helps enterprises establish a Kafka operations system and a resource governance system.

In the user experience map:

  1. Users first apply for a platform tenant. The AppID serves as the user name in Kafka, and AppID + password is used for authentication.

  2. Then they apply for cluster resources, which are requested and used on demand. Users can use the shared cluster provided by the platform or apply for a dedicated cluster for their application.

  3. Then they create a Topic. Users can create topics under their AppID or apply for read and write permissions on other topics.

  4. After creating a Topic, users can perform Topic O&M operations, such as sampling Topic data, adjusting quotas, and applying for partitions.

  5. During production and consumption, latency statistics are collected per Topic for each stage of production and consumption, and performance metrics at different quantiles are monitored, which helps users locate problems more accurately.
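The quantile monitoring in step 5 can be illustrated with a small sketch that turns per-stage latency samples into p50/p95/p99 figures. It uses the nearest-rank percentile definition, which is an assumption made for this example; the article does not say how the platform computes quantiles, and the function names and stage names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(stage_samples, quantiles=(50, 95, 99)):
    """Map each stage name to its latency at the requested quantiles."""
    return {
        stage: {f"p{q}": percentile(values, q) for q in quantiles}
        for stage, values in stage_samples.items()
    }


# Latencies in milliseconds for one Topic; the stage names are illustrative only.
report = latency_report({
    "request_queue_ms": [1, 2, 2, 3, 40],
    "local_ms": [5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
})
```

Here the median queue time is 2 ms while its p95 is 40 ms. An average would hide that gap between typical and tail latency, which is why tail quantiles are worth tracking.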

03. Didi Logi-KafkaManager experience map

In the O&M experience map, the platform integrates O&M modules such as cluster deployment, cluster monitoring, cluster O&M, version management, and resource governance. It distills the high-frequency operations of O&M personnel, such as cluster deployment, Topic migration, Topic resource governance, and Broker leader rebalancing, and moves them onto the platform, which simplifies user operations and greatly reduces O&M costs.

In the operations experience map, the platform codifies resource governance methods drawn from Didi's many years of internal operating experience. For frequent, common problems such as hot Topic partitions and insufficient partitions, these methods are built into the platform to achieve expert-level resource governance.

An original work order system lets professional O&M personnel approve operations such as Topic creation, quota adjustment, and partition applications, which standardizes resource usage and keeps the platform running smoothly.

A billing system is established, so that Topic resources and cluster resources are applied for and used on demand. Costs are calculated according to traffic, which helps enterprises control costs and build a big data cost accounting system.
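As a rough illustration of traffic-based cost accounting, the sketch below sums each AppID's Topic traffic and multiplies it by a unit price. The pricing model (a flat price per GiB), the record layout, and all names are assumptions made for this example; the article does not describe the platform's actual billing rules.

```python
from decimal import Decimal

GIB = 1024 ** 3


def topic_cost(bytes_transferred, price_per_gib):
    """Cost of one Topic's traffic, rounded to two decimal places."""
    cost = Decimal(bytes_transferred) / GIB * Decimal(price_per_gib)
    return cost.quantize(Decimal("0.01"))


def bill_by_app(usage, price_per_gib):
    """usage: iterable of (app_id, topic, bytes_transferred) records."""
    bills = {}
    for app_id, _topic, nbytes in usage:
        bills[app_id] = bills.get(app_id, Decimal("0")) + topic_cost(nbytes, price_per_gib)
    return bills


# Hypothetical monthly usage records and unit price.
usage = [
    ("app-a", "orders", 10 * GIB),
    ("app-a", "logs", 5 * GIB),
    ("app-b", "clicks", GIB // 2),
]
bills = bill_by_app(usage, "0.20")
```

`Decimal` is used instead of floats so that the per-Topic amounts add up exactly, which matters once the figures feed a cost accounting system.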

04. Didi Logi-KafkaManager compared with similar products

Compared with similar community products, Didi Logi-KafkaManager has obvious advantages in cluster monitoring, operation and maintenance experience, user experience and data security:

01. In cluster monitoring:

  • Similar products lack historical metric data and some key metrics, and do not support alarms.

  • Didi Logi-KafkaManager displays historical data for key metrics to support troubleshooting. It shows key Broker metrics, such as the Broker's log flush time and real-time latency at different quantiles. In addition, you can define channels for reporting monitoring metrics and create alarm rules.

02. In O&M experience:

  • Similar products generally do not support preferred replica election at the Broker dimension, do not support resetting consumption offsets, lack the ability to analyze and detect traffic spikes, and have no Topic isolation capability.

  • Didi Logi-KafkaManager supports migration at Topic partition granularity. It also supports preferred replica election at the cluster and Broker dimensions and resetting consumption offsets. You can also view the share of a Broker's traffic taken by its largest topics. The concepts of Regions and logical clusters are introduced into O&M management and control to isolate the topics of different services.
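The "share of a Broker's traffic taken by its largest topics" view mentioned above comes down to a simple ratio. The sketch below shows one way to compute it from per-Topic byte counts on a single Broker; the function name and sample numbers are hypothetical, and how the platform actually collects these counters is not covered in this article.

```python
def topic_traffic_share(bytes_in_by_topic):
    """Return {topic: fraction of the Broker's inbound bytes}, largest first."""
    total = sum(bytes_in_by_topic.values())
    if total == 0:
        return {topic: 0.0 for topic in bytes_in_by_topic}
    ranked = sorted(bytes_in_by_topic.items(), key=lambda kv: kv[1], reverse=True)
    return {topic: nbytes / total for topic, nbytes in ranked}


# Inbound bytes per Topic on one Broker over some window (made-up numbers).
shares = topic_traffic_share({"logs": 300, "orders": 600, "clicks": 100})
```

Sorting by share puts the Topic most likely to be behind a traffic spike first, which is the piece of information an operator needs when deciding what to throttle or migrate.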

03. In user experience:

  • Similar products offer a rather narrow perspective, generally the administrator's, and lack a user perspective. Moreover, the cost of using them is high, and their pages lack an intuitive graphical interface.

  • Didi Logi-KafkaManager takes both the user and administrator perspectives into account, distilling high-frequency operations and moving them onto the platform.

04. In data security:

  • Similar products often have no security control at all: anyone who logs in can perform O&M operations, which creates security risks such as data leakage and data tampering.

  • In Didi Logi-KafkaManager, AppID + password defines tenants, and Topic + AppID is used for identity authentication during production and consumption, which makes data more secure. Combined with role permissions, management and control are tiered: cluster O&M is performed by the administrator role, so platform operation is more standardized.

05. Future planning of Didi Logi-KafkaManager

In the future, we will consider and improve the following areas:

1. Build the ability to migrate Topics smoothly across clusters, based on MirrorMaker + KafkaGateWay.

2. Further support management and control of the latest Kafka 2.5+ versions, and provide richer monitoring of key metrics.

3. Add more enterprise-level features on top of the open-source version, and build a Kafka ecosystem together with enterprises across industries.

We will focus on polishing our products and continue to give back to the open source community!

If you have any questions when using Didi Logi-KafkaManager, or need to communicate with the developers, you can scan the QR code below to join Didi Logi's open source user group and ask questions there.

The Didi Logi-KafkaManager project leader, Didi senior expert engineer Zhang Liang, and other technical experts are in the group to answer your questions online. You are welcome to scan the code with DingTalk to join the group.

Didi Logi-KafkaManager

http://117.51.150.133:8080

Account/Password:

Admin/admin.

GitHub Project address:

Github.com/didi/Logi-K…