Sherlock.IO is eBay's monitoring platform, handling tens of billions of logs, events, and metrics every day. Flink Streaming jobs are used to process the logs and events.
This article describes the practice and application of Flink in the monitoring system, starting from the system's current state, in the hope of offering some reference and inspiration to practitioners in the field.
Current state of the Flink monitoring system
eBay's monitoring platform, Sherlock.IO, handles tens of billions of logs, events, and metrics every day. By building a real-time processing system out of Flink Streaming jobs, the monitoring team can deliver the results of log and event processing to users promptly.
Currently, the monitoring team maintains eight Flink clusters, with the largest cluster size reaching thousands of TaskManagers, running hundreds of jobs, some of which have been running steadily for more than six months.
Metadata-driven
To make it easier for users and administrators to create Flink jobs and adjust their parameters, the monitoring team built a metadata service on top of Flink.
The service describes a job's DAG in JSON, and jobs with the same DAG share a single job instance, so new jobs can be created without calling the Flink API directly.
Currently, jobs created through this metadata microservice support only Kafka as a data source. Once the data reaches Kafka, users can define a Capability that specifies the logic for processing it with Flink Streaming.
Metadata microservice
The metadata describing a job consists of three parts (a sketch of a possible model follows the list):
- Capability
- Policy
- Resource
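A minimal sketch of how these three parts might be modeled in Java; the article does not show the service's actual schema, so all class and field names here are assumptions:

```java
import java.util.List;

// Illustrative sketch only: the metadata service's real schema is not shown
// in the article, so every name and field here is an assumption.
public class JobMetadata {
    /** Capability: the processing logic, i.e. the operators and DAG wiring. */
    public record Capability(String name, int parallelism, List<String> operators) {}

    /** Policy: per-namespace processing rules, updatable while the job runs. */
    public record Policy(String namespace, String rule) {}

    /** Resource: the endpoints the job reads from and writes to. */
    public record Resource(String kafkaTopic, String kafkaBrokers, String esIndex) {}

    /** A full job description as stored by the metadata microservice. */
    public record JobSpec(Capability capability, List<Policy> policies, Resource resource) {}
}
```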
A Flink Adaptor connects the metadata microservice API to the Flink Streaming API: it creates the jobs described by the metadata microservice by calling the Flink Streaming API on the user's behalf, shielding users from the Flink Streaming API itself.
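A rough sketch of what such an adaptor could look like, reusing the hypothetical JobSpec model above; the real adaptor is internal to eBay, so its API here is an assumption:

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical adaptor: turns a metadata JobSpec into a submitted Flink job,
// so users never touch the Flink Streaming API themselves.
public class FlinkAdaptor {

    public void submit(JobMetadata.JobSpec spec) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(spec.capability().parallelism());

        // Wire Source -> operators -> Sink according to the DAG in the spec.
        // It is an assumption that operators are resolved by the names
        // listed in the Capability; the registry lookup is omitted here.
        buildDag(env, spec);

        env.execute(spec.capability().name());
    }

    private void buildDag(StreamExecutionEnvironment env, JobMetadata.JobSpec spec) {
        // Resolve each operator named in spec.capability().operators()
        // and chain them into a DataStream pipeline.
    }
}
```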
1) Capability
For example, a Capability can read data from Kafka and write it to Elasticsearch.
This Capability defines a job named "eventProcess" with a parallelism of 5, whose operator is "EventEsIndexSinkCapability" and whose data flows from Source to Sink.
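Written by hand, the pipeline this Capability describes would look roughly like the following sketch; the Kafka brokers, topic, Elasticsearch host, and index name are all illustrative assumptions:

```java
import java.util.Map;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.elasticsearch.sink.Elasticsearch7SinkBuilder;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.http.HttpHost;
import org.elasticsearch.client.Requests;

// Hand-written equivalent of the "eventProcess" Capability: read events from
// Kafka and index them into Elasticsearch, with parallelism 5 as declared.
public class EventProcessJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(5); // parallelism from the Capability

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("kafka:9092")   // assumption
                .setTopics("events")                 // assumption
                .setGroupId("eventProcess")
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .build();

        DataStream<String> events = env.fromSource(
                source, WatermarkStrategy.noWatermarks(), "Source");

        // Stands in for EventEsIndexSinkCapability: index each event.
        events.sinkTo(new Elasticsearch7SinkBuilder<String>()
                .setHosts(new HttpHost("es-host", 9200, "http")) // assumption
                .setEmitter((element, context, indexer) ->
                        indexer.add(Requests.indexRequest()
                                .index("events-index")           // assumption
                                .source(Map.of("event", element))))
                .build());

        env.execute("eventProcess");
    }
}
```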
In addition, we implemented a periodic update mechanism based on Zookeeper, so that a job no longer needs to be restarted after its Policy changes: as long as a Policy change lands within the update interval, the namespace's new Policy is automatically applied to the job.
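A minimal sketch of such a refresh loop using Apache Curator; the znode path, the interval, and the Policy parsing are assumptions for illustration:

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

// Sketch of a periodic Policy refresh: operators read the current policy from
// this holder, so a changed znode takes effect without restarting the job.
// The path, interval, and raw-string policy format are assumptions.
public class PolicyRefresher {
    private final AtomicReference<String> currentPolicy = new AtomicReference<>("");
    private final CuratorFramework zk;
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public PolicyRefresher(String zkConnect) {
        this.zk = CuratorFrameworkFactory.newClient(
                zkConnect, new ExponentialBackoffRetry(1000, 3));
        this.zk.start();
    }

    public void start(String namespace, long intervalSeconds) {
        scheduler.scheduleAtFixedRate(() -> {
            try {
                byte[] data = zk.getData().forPath("/policies/" + namespace);
                currentPolicy.set(new String(data, StandardCharsets.UTF_8));
            } catch (Exception e) {
                // Keep the last known policy if Zookeeper is unreachable.
            }
        }, 0, intervalSeconds, TimeUnit.SECONDS);
    }

    public String current() {
        return currentPolicy.get();
    }
}
```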
Job sharing
Optimization and monitoring of Flink jobs
Heartbeat
Availability
(1) Flink restarts
(2) Flink job aborted
(4) Flink job stops processing data while running
Flink job isolation
Back pressure
Other monitoring methods
(1) History Server
(2) Job and cluster monitoring
Examples
Event Alerting
Eventzon
Netmon
Summary and outlook