Talking about monitoring systems
As a system accumulates running time, the risk of failure grows, and incidents always arrive unexpectedly. With purely manual operations, locating and troubleshooting faults is a major challenge, and the longer a fault lasts, the greater the loss. Once a team reaches a certain stage of development, it therefore needs a comprehensive monitoring system.
Monitor screen
The most important property of a monitoring system is that it must not fail itself: even when the platform it watches goes down, the monitoring system must still be able to raise alerts, so high availability of the monitoring system itself is what I have been pursuing. First, let's look at what a complete monitoring system needs to consider.
What problems does a monitoring system need to solve?
A monitoring system watches many kinds of targets. I divide them into the following categories: servers, containers, services or applications, network, storage, and middleware. Following common industry practice, different types of collectors are used to gather data from each.
What are the functional considerations?
- Different metric sources can be labeled so that the originating service is clear
- Support for aggregation: transforming metric meanings, combining metrics for calculation, summarizing, and analyzing them
- Alerting, reports, and graphs that can be shown on a large dashboard
- Historical data kept for later tracing
Ease of use also needs to be considered
- Monitored items can be added or removed, and custom monitoring is supported
- Expressions can be configured for derived calculations
- Ideally there is automatic discovery, so newly added servers or resources such as Pods are monitored automatically
- Alert policies can be configured, with definable alert scopes and thresholds, and custom alerts
Scheme selection
Which open source solution should you choose? Elasticsearch, Nagios, Zabbix, and Prometheus are the common options in the industry; other solutions are not discussed here.
- Elasticsearch: a real-time distributed search and analytics engine that supports sharding and fast search; it is usually combined with Logstash and Kibana
- Nagios: its strengths are automatic restart of failed servers, applications, and devices, automatic log rotation, flexible configuration, custom shell scripts, a distributed monitoring mode, redundant host monitoring, multiple alert settings, and reloading the configuration file via a command without disturbing Nagios. Its weaknesses are a very weak event console and awkward plug-ins; poor support for performance, traffic, and similar metrics; no historical data, only alert events, which makes tracing the cause of a fault difficult; and complex configuration, so beginners have to invest considerable time, energy, and cost
- Zabbix: easy to get started with, powerful, and simple to configure and manage, but deeper requirements demand great familiarity with Zabbix plus a lot of secondary custom development, and that much secondary development is unacceptable
- Prometheus: supports almost all of the requirements above, plugs into Grafana for visualization, and can do aggregate queries in PromQL without custom development; every metric can be classified by labels; a strong community provides collection solutions and non-intrusive high-availability solutions for applications, networks, servers, and other roles. It is the focus of today's discussion
For all these reasons, Prometheus is a good fit
Prometheus and its shortcomings
Prometheus architecture diagram
- As the architecture diagram shows, Prometheus deploys a collector (exporter) on each client, and the Prometheus server actively scrapes (pulls) data from it
- Clients can also push data to a PushGateway, from which Prometheus then pulls it
- Prometheus has automatic discovery and can be configured to pull a platform's API to determine the monitoring scope: on Azure, Consul, and OpenStack it can discover roles and attach labels to them. If the labels are strongly tied to your business, you can write custom code that pulls your own platform's API to identify monitoring targets and label them dynamically
- Prometheus also handles alerting: with the officially provided Alertmanager component plugged in, Prometheus evaluates alert rules and Alertmanager notifies an email or SMS platform via webhook
- The problem is that alert policies cannot be configured dynamically and alert records are not stored; you can add a component behind Alertmanager to provide alert convergence, silencing, grouping, and storage
- For dynamic alert policies, you can write a program that generates rule configuration from the policies, writes it into the directory Prometheus expects, and calls Prometheus's hot-reload interface (a configuration sketch follows this list)
- What remains to be dealt with are performance under high load and high availability
Problems with standalone Prometheus deployment
Prometheus's architectural decisions make it better suited to single-node deployment, which can relieve pressure by scaling the server up, but this still leaves some common problems.
Single point Prometheus
- Collection runs into CPU and network limits, so scraping slows down; if a target is not scraped within its interval, those samples are lost. You can lengthen the scrape interval, at the cost of coarser granularity, or reduce the collection of useless metrics
- For the same reason, query speed is limited, and when data is retained for a long time the disk comes under heavy pressure
- With a single point of failure there is no way out: the service is simply unavailable
What are the options for high load on a single point?
Recalling the earlier article on automatic horizontal scaling and load balancing under high load, the first idea that comes to mind is horizontal scaling through the sharding (grouping) capability Prometheus provides.
Sharded collection
This corresponds to splitting Prometheus into several instances, each configured (for example via hashmod relabeling, sketched after the list below) to collect only a subset of targets. This approach has its own problems:
- Data is scattered, which makes operations and maintenance difficult
- You have to switch back and forth between data sources, so there is no global view
To solve this, consider adding a storage layer where all shards aggregate their data (via remote write).
Aggregation after sharding
In this case a TSDB is used for aggregation; it must support capacity expansion and guarantee high availability as a cluster.
However, putting an extra query component on top of a TSDB loses the native PromQL query capability, so the TSDB can instead be replaced with a Prometheus node and the data aggregated through federation (a federation sketch follows).
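For reference, a federation setup boils down to the top-level Prometheus scraping each shard's /federate endpoint; a minimal sketch, where the shard addresses and the match[] selector are assumptions.

```yaml
# Top-level Prometheus pulling series from the shards (sketch)
scrape_configs:
  - job_name: federate
    honor_labels: true
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~".+"}'              # which series to pull up; narrow this in practice
    static_configs:
      - targets:
          - prometheus-shard-0:9090
          - prometheus-shard-1:9090
```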
Federation
This meets basic usage requirements. But rather than watching Prometheus's self-monitoring and manually expanding and regrouping, is there a more automatic way?
Elastic expansion (automatic horizontal expansion)
Elastic scaling has three prerequisites
- You must be able to monitor the current load of each node and predict when to scale
- You must be able to start and stop the service, create instances automatically, and place them on the appropriate nodes
- At the same time, the collection range of each Prometheus node must be adjusted
The most direct solution for creating and destroying services is container orchestration on K8s, where horizontal scaling can be driven by CPU usage or custom metrics. What that alone cannot solve, however, is modifying each Prometheus node's configuration to dynamically assign its collection range. Consider the following scheme.
The scheduler
- Prometheus is configured with node anti-affinity (podAntiAffinity in K8s)
- A scheduler is written that uses the K8s API to detect the state of the Prometheus nodes
- K8s detects node failures and load, pressure is spread with a hash, and Prometheus's service-discovery mechanism is extended so that each instance passes its own hostname to fetch the target range the scheduler has assigned to it
This way there is no need to modify configuration files, because Prometheus periodically calls the scheduler's interface to refresh its monitoring scope.
- Prometheus scales according to runtime conditions without touching the ConfigMap (a sketch of the anti-affinity and discovery pieces follows)
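As promised above, a minimal sketch of the two pieces this scheme needs: pod anti-affinity so Prometheus replicas land on different nodes, and HTTP-based service discovery so each replica asks the scheduler for its own target range. The scheduler address, labels, and port are assumptions, not part of the original setup.

```yaml
# 1) Pod anti-affinity: keep Prometheus replicas on different nodes (pod spec fragment, sketch)
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: prometheus
        topologyKey: kubernetes.io/hostname

# 2) Each replica fetches its assigned targets from the scheduler (prometheus.yml fragment, sketch)
scrape_configs:
  - job_name: scheduler-assigned
    http_sd_configs:
      # the pod's own hostname has to be templated into the URL (e.g. by an init container),
      # since Prometheus does not expand environment variables in its config file
      - url: http://prom-scheduler.monitoring.svc:8080/targets?host=prometheus-0
        refresh_interval: 1m
```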
You may wonder: if servers are monitored this way across many shards, where should something like a Redis cluster's monitoring be placed? The answer is that a Prometheus instance can be created outside this auto-scaling scheme to monitor such small amounts of data, or the target can simply be placed on all nodes and deduplicated at a higher layer, which we will get to shortly.
So far, sharding has spread the load, but two problems remain unsolved: the scattered data cannot be aggregated for queries, and a single point of failure loses data.
Aggregated queries may remind you of federated deployment, but that just concentrates the pressure on one point again and does not solve the problem; to survive a single point of failure you can deploy two or more redundant nodes for each monitoring range, but then every client is scraped twice as often.
How to avoid data loss from a single point of failure
To avoid both problems, failure to aggregate queries and data loss on a single point of failure, the plan is to plug in Thanos as a high-availability layer, treat Prometheus as a stateless application, and enable remote write to push data to Thanos.
Pushing to Thanos
Prometheus then keeps no long-term data itself; if some nodes are lost, new ones are scaled out automatically as long as capacity allows, the load rises briefly and then falls back, and the whole process settles within a couple of cycles.
PS: before writing to remote storage, Prometheus buffers collected samples in an in-memory queue and sends them in batches to reduce the number of connections; to raise the write rate, tune queue_config (see the sketch below).
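A minimal sketch of the remote-write fragment mentioned above; the Receive endpoint and the queue numbers are illustrative, not tuned values.

```yaml
# prometheus.yml fragment (sketch): push everything to Thanos Receive
remote_write:
  - url: http://thanos-receive.thanos.svc:19291/api/v1/receive
    queue_config:
      capacity: 10000            # samples buffered per shard before blocking
      max_shards: 30             # more shards -> more parallel connections
      max_samples_per_send: 5000
```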
A brief introduction to Thanos: it is a non-intrusive high-availability solution that aggregates, computes, deduplicates, compresses, stores, queries, and reports on the data Prometheus produces. Thanos implements the same query API as Prometheus, so to the outside world querying Prometheus or Thanos looks exactly the same; callers cannot tell the difference.
Implementing a distributed, highly available monitoring system
How would we go about implementing such a component ourselves?
Aggregate the storage, and let the upper layer handle the other functions.
Shard data is written to a store, and all the other components talk to that store; this is essentially what Thanos does.
Thanos architecture diagram
As shown above, every component communicates with the object storage to write or read data.
- Object storage is used as the storage engine
- A Sidecar is deployed next to each Prometheus node and periodically uploads its data to the object storage
- Ruler evaluates alerts and aggregates metrics according to rules
- Compact handles downsampling and compression: data is written back to storage at coarser resolutions (five minutes and one hour), and the larger the query time range, the coarser the resolution used, which keeps the frontend from being flooded with data points
- Query talks to the other components over gRPC to read data; it does not talk to the object storage directly but goes through a gateway layer in between (the Store Gateway)
- The Sidecar scheme in the diagram above is not the architecture I chose this time, but everything else is the same. Sidecar keeps collected data locally (data within the last 2 hours is treated as hot by default) and only uploads cold data; because recent data stays local, aggregation at query time adds some pressure, and the single point of failure is still not solved
Sidecar is fine for small clusters where network pressure is not an issue.
Don't store on the collection side
Sidecar is deployed alongside Prometheus, which violates the single-concern principle for containers and also adds storage pressure on the node. Can the two be separated?
Aggregate and transfer
My idea is: collect the data and push it, store it centrally, and let other components talk to the store.
The Receive scheme
As shown above, the Receive component implements the remote-write interface, and Prometheus pushes data to it in real time. Receive is essentially Prometheus without the collection part, so Prometheus itself no longer needs to store data and the earlier scheme can be realized.
- Data in object storage is immutable: once written, it becomes read-only
- Prometheus's local storage works by writing incoming data to a local WAL file list and then, after a period of time, producing blocks; Receive works the same way, and the blocks it produces are uploaded to the object storage
- The Query component queries Receive for recent data (within 2 hours by default) and the object storage for anything older
- Receive uses K8s DNS SRV records for service discovery, so downstream components can reach each node directly instead of going through a Service ClusterIP load balancer (a headless-Service sketch follows this list)
- Receive provides a hashing algorithm that spreads the upstream remote-write traffic evenly across its nodes; plain K8s round-robin can be used in front, and Receive routes each request on to the node that owns it
To guard against data loss when a Prometheus instance drops out, run an extra Prometheus replica and deduplicate at query time; this is done mainly with Query's --query.replica-label flag and Prometheus's prometheus_replica external label, as shown in the figure below.
Overview
Likewise, other components such as Ruler can be deployed redundantly with a rule_replica label; this will not be covered further.
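For concreteness, a sketch of the replica labeling described above; the label value is illustrative.

```yaml
# prometheus.yml on replica A (replica B sets a different value), sketch
global:
  external_labels:
    prometheus_replica: prometheus-a
```

On the Query side, starting it with --query.replica-label=prometheus_replica (and --query.replica-label=rule_replica for Ruler) tells Thanos which labels mark replicas, so duplicate series are collapsed at query time.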
Fortunately, Receive ships with its hashring distribution built in, otherwise you would have to implement one yourself. So at this point:
- The receiving side can balance the pressure of massive data
- The high query latency of Prometheus deployed across different clusters is solved
- Cross-node composite calculations are solved (Ruler)
- Data compression and downsampling are solved
Is the hashring really consistent hashing?
We know that consistent hashing can address the following:
- When pressure rises it can scale out automatically, and when pressure falls it can scale back in
- Data is not lost during scaling or on a single point of failure
- Data must map to the same points during expansion or contraction, otherwise series get broken up
In practice, however, it is not hard to notice that data loss still occurs, which piqued my interest.
Receive should be deployed as a K8s StatefulSet so that each pod gets a stable name, and the number of replicas must match the hashring configuration; three nodes can already handle a large volume of data.
```yaml
thanos-receive-hashrings.json: |
  [
    {
      "hashring": "soft-tenants",
      "endpoints": [
        "thanos-receive-0.thanos-receive.thanos.svc.cluster.local:10901",
        "thanos-receive-1.thanos-receive.thanos.svc.cluster.local:10901",
        "thanos-receive-2.thanos-receive.thanos.svc.cluster.local:10901"
      ]
    }
  ]
```
Looking at the source code, it turns out that no consistent hashing algorithm is used at all! As you can see in hashring.go, it is a simple hash-mod, so the name hashring is misleading.
```go
func (s simpleHashring) GetN(tenant string, ts *prompb.TimeSeries, n uint64) (string, error) {
	if n >= uint64(len(s)) {
		return "", &insufficientNodesError{have: uint64(len(s)), want: n + 1}
	}
	return s[(hash(tenant, ts)+n)%uint64(len(s))], nil
}
```
The result is a hash of the following form:
```
hash(string(tenant_id) + sort(timeseries.labelset).join())
```
- tenant_id means the data source carries a tenant, and different tenants can be given their own hashrings
- For details of the hash algorithm itself, see reference 5 at the end of this article
The receive.replication-factor option can be configured to replicate data so it is not lost; it relies on service discovery and communicates with all the Receive nodes.
PS: replication has a caveat too. The algorithm picks the node after the hash-mod, say node N, then also picks N+1 and N+2 according to the factor, but the request is still sent to N first; if N happens to be down at that moment, the write actually fails.
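For reference, a sketch of the flags a Receive replica might run with in this scheme; the paths are assumptions, and $(NAME) is assumed to be an environment variable populated from the pod name via the downward API.

```yaml
# thanos receive container args (sketch)
args:
  - receive
  - --tsdb.path=/var/thanos/receive
  - --grpc-address=0.0.0.0:10901
  - --remote-write.address=0.0.0.0:19291
  - --receive.hashrings-file=/etc/thanos/thanos-receive-hashrings.json
  - --receive.replication-factor=3          # tolerate one node being down
  - --receive.local-endpoint=$(NAME).thanos-receive.thanos.svc.cluster.local:10901
  - --label=receive_replica="$(NAME)"       # label value per replica, used for dedup
  - --objstore.config-file=/etc/thanos/objstore.yml
```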
What happens when Receive scales or a node fails
When scaling occurs and the hashring changes, all nodes need to flush their write-ahead-log data into TSDB blocks and upload them to object storage (if configured), because those nodes will then receive a new allocation. Time series already on the existing nodes do not need to move; only incoming requests are routed to receivers according to the new distribution.
This process does not require restarting Receive; the code watches for changes to the hashring.
Note that flushing at this point may produce smaller TSDB blocks, but the Compactor can merge and optimize them, so this is not a problem.
When a Receive node fails, Prometheus's remote write retries whenever the backend target is unresponsive or returns 503, so a brief outage on the Receive side is tolerable. If that hang time is unacceptable, set the replication factor to 3 or more, so that even if one receiver goes down, others can accept write requests.
Calculating business metrics
If there are very complex business metrics that need to be collected or pushed from elsewhere, the best approach is to write them as an exporter and do the compound calculations in Ruler. Of course, you may run into the awkward problem that the expression simply cannot be written in PromQL.
In that case, consider writing a K8s CronJob that computes the result and pushes it to PushGateway, from which Prometheus then pulls it (a sketch follows below).
PS1: note that, per our exporter development standards, duplicate metrics are not allowed.
PS2: to delete stale junk data, call PushGateway's http://%s/metrics/job/%s/instance/%s/host/ interface.
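A minimal sketch of that CronJob idea; the image, schedule, metric name, and PushGateway address are purely illustrative.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: business-metric-pusher
spec:
  schedule: "*/5 * * * *"            # every 5 minutes
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: pusher
              image: curlimages/curl:latest   # illustrative image
              command: ["/bin/sh", "-c"]
              args:
                - |
                  # compute the business value elsewhere; a placeholder metric here
                  echo "orders_settled_total 42" | \
                    curl --data-binary @- \
                    http://pushgateway.monitoring.svc:9091/metrics/job/business/instance/cron
```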
Dynamically updating alert policies and storing alert records
To generate alert policies dynamically, you can write a service that receives requests, calls the K8s API to generate a ConfigMap, and notifies Ruler to hot-reload.
- Update the rule ConfigMap (the update reaches the pod with some delay; mounting with subPath does not hot-update, and note that configMapAndSecretChangeDetectionStrategy must stay at its default value, Watch); a mount sketch follows this list
- Mount the ConfigMap into the corresponding Ruler
A panoramic view
Finally
Of course, a mature monitoring system needs more than fault detection and timely alerting; those extra functions are beyond the scope of this discussion, and I may write about them later if time allows.
- Incident reports and daily, weekly, and monthly resource reports for trend analysis
- Low-load reports to analyze server utilization and prevent wasted resources
- With fault trends and the important metrics covered, AI can be brought in to predict faults before they happen
An easier way to monitor K8s itself is Prometheus Operator: you create K8s resources such as a Prometheus collector, the ServiceMonitor abstraction for scrape targets, Alertmanager, and so on, and deciding what to monitor becomes a matter of operating resource objects directly on the K8s cluster.
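For example, with the Operator installed, adding a scrape target is just creating a resource object; a minimal ServiceMonitor sketch, where the names, labels, and port are assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: my-app            # any Service carrying this label is scraped
  endpoints:
    - port: http-metrics     # named port on the Service
      interval: 30s
      path: /metrics
```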
Monitoring can also drive horizontal scaling of other applications, using the Prometheus Adapter to expose custom metrics for autoscaling.
Monitoring can also serve the O&M platform by providing automatic fault repair
In short, as long as the monitoring platform is good enough, the operation staff will be out of work
References and further reading
1. 7 Open Source Cloud Monitoring Tools You Have to Know
2. Thanos in TKEStack practice
3. Prometheus Remote Write configuration – TSDB time series database – Alibaba Cloud
4. Thanos – Highly available Prometheus setup with long-term storage capabilities
5. xxHash – Extremely fast non-cryptographic hash algorithm