In the booklet “Low-Code Platform Practice Based on Kafka Connect”, Gan Luo, technical director of the commercial advertising center at Shell House Hunting (Beike), takes you from 0 to 1 through building an industrial-grade, production-ready bidirectional streaming processing platform for heterogeneous data.
About the author
Gan Luo is the technical director of the commercial advertising center at Shell House Hunting, mainly responsible for research, development, and management at the center. He led the unification and retrieval reconstruction of the advertising material storage engine, and built from 0 to 1 a bidirectional streaming processing platform that supports a variety of heterogeneous data sources and processes 1 billion+ records per day. The core C-end advertising traffic distribution services currently maintain five-nines (99.999%) overall availability, and Beike's C-end advertising traffic distribution peaks at 1 billion+ requests per day.
He previously worked at Mogujie, Tencent, and Huobi Group, and specializes in core R&D and infrastructure work across multiple domains, including e-commerce trading and marketing, social content, and high-frequency digital currency trading.
🚀 The challenge of governing billions of records
In the era of big data, we often need to filter exactly the data we need out of massive datasets. Early on, the data to be processed is on the order of millions of records or fewer, and the mainstream offline and real-time computing solutions handle it with very stable performance.
However, as businesses evolve, it is increasingly common to deal with billions of records, while most companies' data synchronization and cleansing technology remains fairly traditional and suffers from a series of problems such as high latency, low throughput, and poor performance. As a result, the overall technical architecture of their services faces availability and stability challenges.
For example, suppose the business asks you to synchronize billions of offline records stored in MySQL and MongoDB into Kafka for real-time consumption. You may need to develop a service that listens to the MySQL binlog / MongoDB oplog and migrates large amounts of MySQL/MongoDB data to the Kafka cluster while ensuring data consistency.
What if the business instead needs offline data synchronized from Hive to Kafka for real-time consumption? In that case, you may need MapReduce or Spark for offline batch processing, and at that scale data consistency and fault tolerance are hard to guarantee.
We can summarize the possible challenges into four scenarios:
- You have massive data synchronization and cleansing requirements, but don't know MapReduce/Spark/Flink, or don't want to depend on heavyweight middleware;
- You need to synchronize and cleanse data across a variety of heterogeneous data sources, but don't want to write new code for every task, and your current approach lacks scalability and large-scale reusability;
- You need to synchronize and cleanse data across a variety of heterogeneous data sources, but lack fault-tolerance management and a system for monitoring task execution state;
- You have massive data synchronization and cleansing requirements, but don't want to invest heavily in machine and compute resources, or don't want to take on complex middleware cluster operation and maintenance.
Kafka Connect is the perfect choice for you.
🔥 Kafka Connect benefits
Kafka Connect is part of Apache Kafka. It provides a data channel for streaming integration between Kafka and external data storage systems.
Kafka Connect turns offline (batch) data from heterogeneous data sources (MySQL, MongoDB, Elasticsearch, Kafka) into real-time data streams in Kafka. It also provides processing capabilities on the data pipeline (the data synchronization pipeline), allowing developers to perform structured cleansing of real-time data within the pipeline with a high degree of flexibility.
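To make that concrete, here is a minimal sketch of what using Kafka Connect looks like in practice: a source connector is created declaratively by POSTing a JSON configuration to a Connect worker's REST API instead of hand-writing a binlog listener. The worker address, connector name, credentials, and database/topic names below are placeholders, and the property names assume the Debezium MySQL source connector (2.x); they are not taken from the booklet and vary between connectors and versions.

```python
import requests

# Hedged sketch (not the booklet's code): register a MySQL CDC source connector
# with a Kafka Connect worker via its REST API. Host, credentials, database and
# topic names are placeholders; property names assume Debezium MySQL 2.x.
connector = {
    "name": "mysql-ad-material-source",  # hypothetical connector name
    "config": {
        "connector.class": "io.debezium.connector.mysql.MySqlConnector",
        "tasks.max": "1",
        "database.hostname": "mysql.example.internal",
        "database.port": "3306",
        "database.user": "connect_user",
        "database.password": "******",
        "database.server.id": "184054",
        "database.include.list": "ad_platform",
        "topic.prefix": "ads",  # Debezium 2.x; 1.x used database.server.name
        "schema.history.internal.kafka.bootstrap.servers": "kafka:9092",
        "schema.history.internal.kafka.topic": "schema-changes.ads",
        # A built-in single message transform (SMT): lightweight in-pipeline
        # cleansing, here rerouting change events to normalized topic names.
        "transforms": "route",
        "transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
        "transforms.route.regex": "ads\\.ad_platform\\.(.*)",
        "transforms.route.replacement": "ads.material.$1",
    },
}

# POST /connectors is the standard Connect REST endpoint for creating a connector.
resp = requests.post("http://localhost:8083/connectors", json=connector)
resp.raise_for_status()
print(resp.json())
```

Updating the configuration (PUT /connectors/&lt;name&gt;/config) or removing the connector (DELETE /connectors/&lt;name&gt;) is just as declarative, which is what makes a low-code, self-service approach like the one described below possible.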
In 2020, Gan Luo's team built its own heterogeneous data bidirectional streaming synchronization service based on Kafka Connect, running a cluster of 100+ Source and Sink Connectors. It covers a variety of heterogeneous storage engines, including MySQL, MongoDB, Hive, Elasticsearch, and Kafka, and processes more than 1 billion offline and real-time records per day.
In addition, they built a custom Kafka Connect cluster console. Beyond day-to-day management of the Connectors cluster, it turns the full data synchronization workflow into self-service: connect a heterogeneous data source, select the data cleansing rules, then select the data source to write to. Connectors clusters for streaming synchronization of heterogeneous data can truly be created with zero development.
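For a sense of what such a console typically wraps (an assumption about the general approach, not a description of the team's internal implementation), the stock Kafka Connect REST API already exposes the lifecycle and monitoring operations a self-service UI needs. The endpoints below are standard; the worker URL and connector name are placeholders.

```python
import requests

# Hedged sketch of the standard Kafka Connect REST operations that a
# cluster-management console can build on. The worker URL and connector
# name are placeholders; the endpoints are part of the stock Connect REST API.
BASE = "http://localhost:8083"
NAME = "mysql-ad-material-source"  # hypothetical connector from the earlier sketch

# List every connector registered on the cluster.
print(requests.get(f"{BASE}/connectors").json())

# Inspect connector and per-task state (RUNNING / PAUSED / FAILED ...),
# including the stack trace of a failed task -- the raw material for
# fault-tolerance management and task-execution monitoring.
status = requests.get(f"{BASE}/connectors/{NAME}/status").json()
for task in status.get("tasks", []):
    print(task["id"], task["state"])

# Pause, resume, or restart a connector without redeploying anything.
requests.put(f"{BASE}/connectors/{NAME}/pause")
requests.put(f"{BASE}/connectors/{NAME}/resume")
requests.post(f"{BASE}/connectors/{NAME}/restart")
```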
Along the way, they accumulated a great deal of best practice, which is exactly what Gan Luo wants to share in this booklet.
🏆 What will you gain from studying this booklet?
The booklet is divided into 7 modules, moving from selecting among today's mainstream data synchronization frameworks, to building a new bidirectional data streaming synchronization architecture on the open-source Kafka Connect ecosystem, to custom development of Connector components for bidirectional synchronization of heterogeneous data.
By the end, you will not only have an industrial-grade, scalable, easy-to-adopt and easy-to-maintain bidirectional streaming processing platform capable of handling billions of heterogeneous records per day, but you will also be able to synchronize and cleanse massive data with ease.
In more detail, you will gain:
- A general streaming technology solution and architecture design for massive heterogeneous data
- An industrial-grade bidirectional streaming processing platform for heterogeneous data that is available, extensible, and easy to maintain
- The fundamentals and production practice of Kafka Connect, CDC mechanisms, and the Data Routing & Pipeline stack
- The ability to analyze the architecture of Source and Sink Connectors and extend them with custom development
- An understanding of the architectural design ideas behind Transforms and the ability to build custom lightweight ETL components
- The ability to build a one-stop metrics collection and monitoring system based on JMX, Prometheus Exporter, and Grafana
Finally, if you want to master or improve your offline and real-time data synchronization and processing capabilities, and to learn Kafka's core features, the underlying storage mechanisms of MySQL and MongoDB, CDC architectural concepts and application scenarios, Elasticsearch sharding/routing/pipeline operations, and common ETL components and frameworks, this booklet is for you.
New release, 50% off for a limited time at 14.95 yuan. Click the link to buy: sourl.cn/cHk2xT