When selecting the technology for monitoring Flink, I chose Prometheus + InfluxDB as the storage: because Prometheus does not support storing the checkpoint indicator, InfluxDB was added alongside it. However, out of optimistic thinking ("having metrics is better than not having them"), the InfluxDB reporter was not customized at all, so the measurement data below reflects the out-of-the-box behavior, with every metric collected.
InfluxDB memory usage during long-term operation
| Basic setup | Description |
|---|---|
| Flink native InfluxdbReporter | All metrics collected |
| retention_policy set to one_hour | Default from the Flink official documentation |
| InfluxDB | Version 1.8 |
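For context, the reporter side of this setup lives in flink-conf.yaml. A minimal sketch, modeled on the example in the Flink documentation (exact keys depend on the Flink version: newer releases use `factory.class`, older ones use `metrics.reporter.influxdb.class: org.apache.flink.metrics.influxdb.InfluxdbReporter`; host and database values here are illustrative placeholders):

```yaml
# flink-conf.yaml -- InfluxDB metrics reporter (values are illustrative)
metrics.reporter.influxdb.factory.class: org.apache.flink.metrics.influxdb.InfluxdbReporterFactory
metrics.reporter.influxdb.scheme: http
metrics.reporter.influxdb.host: localhost
metrics.reporter.influxdb.port: 8086
metrics.reporter.influxdb.db: flink
metrics.reporter.influxdb.retentionPolicy: one_hour
```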
| Number of jobs | Memory usage, min (InfluxDB just started) | Memory usage, max (after running for a long time) |
|---|---|---|
| About 100 | 1 GB | 40 GB |
The optimization process
At first the memory usage was relatively small, only a few GB. The first adjustment was to switch the index scheme to TSI (`index-version = "tsi1"`). Reference: www.influxdata.com/blog/how-to…
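For reference, on InfluxDB 1.8 this is an influxdb.conf change; a minimal sketch, assuming a typical package install (the paths below are common defaults and may differ in your environment; shards already written with the in-memory index additionally need to be converted offline with the `influx_inspect buildtsi` tool):

```toml
# /etc/influxdb/influxdb.conf (typical location; adjust paths to your install)
[data]
  dir = "/var/lib/influxdb/data"
  wal-dir = "/var/lib/influxdb/wal"
  # switch from the default in-memory index to the disk-based TSI index
  index-version = "tsi1"
```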
After a period of observation, the memory usage at first fluctuated regularly around 10 GB, but after some time it kept climbing and exceeded 20 GB. So the memory allocation was doubled to 40 GB.
Later, checking with the influx client, the retention policy `monitor` on the internal `_internal` database turned out to be quite long, with a duration of 168h. So the retention policies were shortened:
```sql
alter retention policy monitor on _internal duration 1d replication 1 shard duration 30m
alter retention policy autogen on flink duration 1h replication 1
```
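To confirm the change took effect, the policies can be listed from the influx CLI (InfluxQL 1.x; `flink` is the metrics database used above):

```sql
show retention policies on "_internal"
show retention policies on flink
```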
The memory stabilized for a while, then the 90% memory-usage alarm fired again… The next step was to disable the internal monitor database:
```toml
[monitor]
store-enabled = false
```
After applying this configuration, disabling the internal monitoring store on the InfluxDB node, and restarting it, entering the influx client and typing `show shards` confirmed that the monitor shards were gone. But the problem remained. This time I was out of ideas and getting anxious.
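For reference, the checks used at this step can be run from the influx CLI; a quick sketch (the output of course depends on the databases and shards present on your node):

```sql
show databases
show shards
```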
Light at the end of the tunnel
Logging in to the influx client and executing `show measurements` revealed an extremely large number of measurements, such as the following for one Kafka consumer group:
```
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-0
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-1
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-2
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-3
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-4
taskmanager_job_task_operator_KafkaConsumer_committed_offsets_video_bc-5
```
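To get a sense of how many such per-partition measurements exist without scrolling through the full list, InfluxQL's regex filter and cardinality commands can help; a quick sketch (the pattern is taken from the names above, and `flink` is the database used earlier):

```sql
show measurements on flink with measurement =~ /KafkaConsumer_committed_offsets/
show measurement cardinality on flink
```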
This suggested a new idea: the amount of data was simply too large. But at that point it was not an option to ask users to stop their jobs one by one, so I went back to the documentation:
Docs.influxdata.com/influxdb/v1…
> In general, having more RAM helps queries return faster. There is no known downside to adding more RAM.
> The major component that affects your RAM needs is series cardinality. A series cardinality around or above 10 million can cause OOM failures even with large amounts of RAM. If this is the case, you can usually address the problem by redesigning your schema.
According to the official documentation, once the series cardinality reaches around 10 million, OOM failures can occur no matter how much memory is added; the only real fix is to limit the complexity of the schema.
Running the `show series cardinality` check showed the cardinality was already at 1 million. Suddenly the cause of the problem was clear: far too much useless data. Recalling the starting point, only a few specific indicators were needed; everything should never have been collected wholesale.
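For reference, the cardinality check can be run per database from the influx CLI (InfluxQL 1.x, using the `flink` database from earlier):

```sql
show series cardinality on flink
show series cardinality on "_internal"
```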
Road to simplicity, less is more
At this point, the InfluxDB service was stopped, all data was deleted directly on the InfluxDB server with `rm -rf` on the data directories (path omitted here), and InfluxDB was restarted. The memory usage dropped back to the 1 GB to 6 GB range.
It took a long time to get the problem solved, but the whole process was quite enjoyable.
Don’t try to solve complex problems in more complex ways
Following the requirement above, you can modify `org.apache.flink.metrics.influxdb.AbstractReporter` to drop every metric except `lastCheckpointExternalPath`, recompile, and deploy it to production:
```java
@Override
public void notifyOfAddedMetric(Metric metric, String metricName, MetricGroup group) {
    // Only register the checkpoint path metric; ignore everything else.
    if (!metricName.equals("lastCheckpointExternalPath")) {
        return;
    }
    // ... original registration logic ...
}
```
Follow-up
After about a week of observation, the following combined measures were found to be effective:
- Delete all existing monitoring data.
- The environment variable `INFLUXDB_MONITOR_STORE_ENABLED=false` did not take effect; it may not have been passed into the container.
- Modify the configuration file instead: `[monitor] store-enabled = false`
After the restart, the memory usage is no longer high. In other words, all metrics are still being reported, but the internal monitor database is disabled, and memory usage now floats between 2 GB and 3 GB (memory-usage chart omitted here).
Conclusion
There is usually more than one way to solve a problem, and several of them may be effective. Setting the monitor's store-enabled to false and deleting the physical data temporarily resolves the high CPU and memory usage, but to fix the problem at its root, the Flink InfluxDB reporter itself needs to be modified so that only the required metrics are reported.