Mr. Mao's suggestion: a morning checklist. Benefits of daily inspection:
- Spot problems and abnormal data early;
- Quickly diagnose root causes from logs and metrics.
Key metrics: MTTR (mean time to recovery) and MTBF (mean time between failures).
I recommend taking a look at opentracing-go.
When communicating upward or across departments, give a concrete point in time and a clear conclusion; do not stay vague.
Logs
Log levels
glog (no longer maintained)
github.com/golang/glog: glog is an unmaintained logging library from Google. It has counterparts in several other languages, which greatly influenced how I used log libraries at the time. It provides the following log levels:
- Info
- Warning
- Error
- Fatal (interrupts program execution)
Frequent calls such as log.Debug() are expensive, and glog provides no debug level to gate them.
There are other third-party log libraries such as log4go, loggo, and zap. They also let you configure the log level and generally provide the following levels (a configuration sketch follows the list):
- Trace
- Debug
- Info
- Warning
- Error
- Critical
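As a minimal sketch (not from the original notes, assuming the zap library, go.uber.org/zap), this is roughly what a configurable, runtime-adjustable log level looks like:

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

func main() {
	// Build a zap logger whose level can be set and changed at runtime
	// through an AtomicLevel.
	level := zap.NewAtomicLevelAt(zapcore.InfoLevel)

	cfg := zap.NewProductionConfig()
	cfg.Level = level

	logger, err := cfg.Build()
	if err != nil {
		panic(err)
	}
	defer logger.Sync() // flush buffered log entries on exit

	logger.Debug("not emitted: below the configured level")
	logger.Info("service started", zap.String("addr", ":8080"))

	// Raise verbosity at runtime, e.g. from an admin endpoint.
	level.SetLevel(zapcore.DebugLevel)
	logger.Debug("now emitted")
}
```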
Warning (not recommended)
No one reads warnings, because by definition nothing has gone wrong. Maybe something will go wrong in the future, but that sounds like someone else's problem. We eliminate the warning level as far as possible: a message is either informational or an error. Following the Go design philosophy, all warnings are errors; warnings in other languages can be ignored unless the IDE or the CI/CD pipeline forces them to be treated as errors and programmers are then made to eliminate them. And if you want a warning to be eliminated eventually, write it as an error so that the code's authors pay attention.
- A warning that no one pays attention to is effectively not an error at all;
- It is recommended to convert warnings into errors to force programmers to eliminate them.
Fatal (not recommended)
Logging at the Fatal level writes the message and then calls os.Exit(1) directly, which means:
- defer statements in other goroutines are not executed;
- buffers of all kinds, including log buffers, are not flushed;
- temporary files and directories are not removed.
Do not log at the Fatal level; return the error to the caller instead. If the error propagates all the way up to main.main, that is the right place to do any cleanup before exiting.
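A minimal sketch of this pattern (the run function and its cleanup are hypothetical): errors flow up the call stack, and main.main is the only place that prints and exits.

```go
package main

import (
	"fmt"
	"os"
)

// run does the real work and returns an error instead of calling log.Fatal.
func run() error {
	f, err := os.CreateTemp("", "work-*")
	if err != nil {
		return fmt.Errorf("create temp file: %w", err)
	}
	defer os.Remove(f.Name()) // deferred cleanup still runs on error
	defer f.Close()

	// ... do the actual work, returning errors to the caller ...
	return nil
}

func main() {
	if err := run(); err != nil {
		// By the time we get here, all defers inside run() have executed.
		fmt.Fprintln(os.Stderr, "fatal:", err)
		os.Exit(1)
	}
}
```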
Error
Many people log immediately at the point where the error occurs, especially at the error level. There are two options:
- Handle the error where it occurs, for example by degrading (handleError: fall back to plan B);
- Return the error to the caller and log it at the top level.
If you choose to handle an error by logging it, then by definition it is no longer an error: you have handled it. The act of logging an error handles it, so it is no longer appropriate to log it as an error.
Either log the error or return it; do not do both or anything extra. A degradation like the one above is inherently damaging to the service, so the Warning level is more appropriate there.
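A hedged sketch of "handle or return, never both"; the cache/DB helpers are hypothetical, not from the original notes.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

var errCacheMiss = errors.New("cache miss")

// readFromCache returns the error to its caller; it does NOT log it.
func readFromCache(key string) (string, error) {
	return "", fmt.Errorf("read %q: %w", key, errCacheMiss)
}

// getValue handles the cache error by degrading to the database.
// Because it handled the error, it logs at Warning level and does not
// also propagate the cache error upward.
func getValue(key string) (string, error) {
	v, err := readFromCache(key)
	if err == nil {
		return v, nil
	}
	log.Printf("WARN: cache degraded, falling back to DB: %v", err)
	return readFromDB(key)
}

func readFromDB(key string) (string, error) { return "value-of-" + key, nil }

func main() {
	v, err := getValue("user:1")
	if err != nil {
		log.Printf("ERROR: getValue: %v", err) // logged once, at the top
		return
	}
	fmt.Println(v)
}
```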
Debug
There are only two things you should record:
- Things that developers care about when developing or debugging software.
- What users care about when using software.
Obviously, these correspond to the debug and info levels, respectively.
log.Info simply writes the line to the log output. There should be no option to turn it off, because users should only be told things that are useful to them. If an error that cannot be handled occurs, it propagates up to main.main, where the program terminates: insert the FATAL prefix before the final log message there, or write it directly to os.Stderr.
log.Debug is a completely different thing. It is controlled by the developer or the support engineer. Debug statements should be plentiful during development, without having to resort to trace or debug2 levels (you know who you are). The log package should support fine-grained control for enabling and disabling debug output, at package scope or finer.
Debug output should support verbosity levels (e.g. 1-5) to tune granularity and should be switchable per module.
For how we design and think about this, see github.com/go-kratos/k…
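glog, mentioned above, already works this way: verbosity-guarded statements with a global level plus per-file overrides. A minimal sketch (the module name in -vmodule is illustrative):

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

func main() {
	// glog verbosity is controlled by flags, e.g.:
	//   ./app -logtostderr -v=2 -vmodule=fetcher=4
	// -v sets the global verbosity; -vmodule overrides it per file.
	flag.Parse()
	defer glog.Flush()

	glog.Info("always emitted at Info level")

	if glog.V(2) {
		glog.Info("emitted only when verbosity >= 2")
	}
	glog.V(4).Info("emitted only when verbosity >= 4 (or fetcher.go is raised to 4 via -vmodule)")
}
```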
Log system selection
A complete centralized logging system must provide the following capabilities:
- Collection – gather log data from multiple sources;
- Transport – ship log data to the central system reliably;
- Storage – store the log data;
- Analysis – support analysis through a UI;
- Alerting – provide error reporting and monitoring mechanisms.
The ELK stack (Elasticsearch, Logstash, Kibana) is open-source software. A newer addition is Filebeat, a lightweight log collection and processing agent. Filebeat consumes few resources and is well suited to collecting logs on each server and shipping them to Logstash; it is also the officially recommended tool.
Centralized architecture
In this architecture, Logstash is deployed on each node to collect logs and data. After analysis and filtering, the logs are sent to Elasticsearch on remote servers for storage. Elasticsearch stores the data compressed and sharded, and provides a variety of APIs for queries and operations. Users can also configure Kibana to query the logs and generate reports from the data. Because Logstash acts as the server side, traffic concentrates on it and hot spots are unavoidable, so we do not recommend this deployment mode. In addition, the heavy use of match operations (to parse and format logs) consumes a lot of CPU, which is bad for scale-out.
Architecture with a message queue
The Logstash agent on each node first ships data/logs to Kafka; the Logstash server then consumes messages from the queue, filters and analyzes the data, and writes it to Elasticsearch for storage. Finally, Kibana presents the logs and data to the user. With Kafka in the middle, data is buffered even if the remote Logstash server fails, avoiding data loss.
Going further:
Replace Logstash on the collection side with Beats, which is more flexible, consumes fewer resources, and scales better (Beats is written in Go and easy to extend).
Log system: Design goals
- Converged access methods;
- A standardized log format;
- Log parsing is transparent to the log system;
- High throughput and low latency;
- High availability, scalability, and operability.
Log system: Format specification
Logs are output as JSON with the following fields (an example follows the list):
- time: when the log was generated, in ISO 8601 format;
- level: the log level: ERROR, WARN, INFO, or DEBUG;
- app_id: the application ID, identifying the log source;
- instance_id: the instance ID, i.e. the hostname, distinguishing different instances of the same application.
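An illustrative sketch of one entry in this format; the Go struct, the message field, and the sample values are assumptions for demonstration, not part of the specification above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"time"
)

// LogEntry mirrors the JSON log format described above.
type LogEntry struct {
	Time       string `json:"time"`        // ISO 8601 timestamp
	Level      string `json:"level"`       // ERROR / WARN / INFO / DEBUG
	AppID      string `json:"app_id"`      // identifies the log source
	InstanceID string `json:"instance_id"` // hostname, distinguishes instances
	Message    string `json:"message"`     // illustrative extra field
}

func main() {
	host, _ := os.Hostname()
	e := LogEntry{
		Time:       time.Now().Format(time.RFC3339),
		Level:      "INFO",
		AppID:      "demo.service",
		InstanceID: host,
		Message:    "request handled",
	}
	b, _ := json.Marshal(e)
	fmt.Println(string(b))
}
```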
Log system – Design and implementation
From generation to retrieval, logs go through several stages: production, collection, transport, sharded storage, and retrieval.
Log system: Collection
Logstash:
Listens for logs on TCP/UDP; suitable for applications that report logs over the network.
Filebeat:
Collects locally generated log files directly; suitable for applications whose log output cannot be customized.
Logagent:
Deployed on physical machines and listening on a Unix socket; the log system also provides SDKs for various languages, and local log files can be read directly.
Log system – Transfer
A unified transport platform based on Flume + Kafka (when the log volume is large, logs from different services can be routed to different Kafka clusters).
Log streams are split by log ID:
- general level;
- low level;
- high level (ERROR).
This has since been replaced with Flink + Kafka.
Log system – sharding
Consume logs from Kafka, parse them, and write them to Elasticsearch.
bili-index: built in-house in Go; simple logic, high performance, easy to customize.
- Handles logs produced according to the log specification (collected by the log agent).
Logstash: the official Elastic component, built on JRuby; powerful, but resource-hungry and slower.
- Requires configuring various parsing rules, and handles logs that were not produced according to the log specification (collected via Filebeat and Logstash).
Log system – storage and retrieval
Elasticsearch multi-cluster architecture: logs are tiered and highly available.
A single data cluster contains master nodes, data nodes (hot/stale), and client nodes.
- Hot-to-cold migration at fixed times of the day;
- Indices are created one day in advance; mappings are managed through templates;
- Search is done through Kibana.
Elasticsearch is worth taking the time to research and learn.
Standardized, structured logging.
Log system – files
Custom protocols place high demands on SDK quality and version upgrades, so we will stick with the "local file" approach for a long time:
- Collect local log files: no restriction on location; containers and physical machines both work;
- Self-describing configuration: the configuration is provided by the APP/PaaS itself; the agent reads it and applies it;
- Logs are neither lost nor duplicated: multi-level queues handle every anomaly during log collection stably;
- Monitoring: the running state is monitored in real time.
Log system – Collects container logs
Collecting application logs inside containers: based on overlay2, the corresponding log file is located directly on the physical machine.
Link tracing
Link tracing: Dapper
Following the implementation in Google's Dapper paper, a globally unique trace ID is generated for each request and transparently propagated end to end to all upstream and downstream nodes. A span ID is generated at each layer; the trace ID ties together the otherwise isolated call logs and exception information of different systems, while the span ID and level express the parent-child relationships between nodes.
Key Concepts:
- Trace tree
- Span: the basic unit of a call
- Annotation: attaches information and tags to a span
Link tracing: Call chain
In the trace tree structure, the tree nodes are the basic unit of the whole architecture, and each node is a reference to a span. Although a span in the log files simply records its start and end times, spans are relatively independent within the overall tree structure.
Key Concepts:
- TraceID
- SpanID: the ID of this call unit
- ParentID: the ID of the upstream (parent) span
- Family & Title: the APPID and the gRPC method name
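A rough sketch of how these IDs relate in code, using the opentracing-go API recommended earlier. The operation names and tags are illustrative, and a concrete tracer implementation must be registered for IDs to actually be generated (the default GlobalTracer is a no-op).

```go
package main

import (
	"context"

	opentracing "github.com/opentracing/opentracing-go"
)

func handleRequest(ctx context.Context) {
	// Root span: with a real tracer, the trace ID is assigned here.
	span, ctx := opentracing.StartSpanFromContext(ctx, "HTTP GET /user")
	defer span.Finish()

	span.SetTag("family", "user-service") // Family: the APPID
	span.SetTag("title", "/user")         // Title: the method name

	callDownstream(ctx)
}

func callDownstream(ctx context.Context) {
	// Child span: same trace ID, a new span ID, ParentID = the caller's span ID.
	span, _ := opentracing.StartSpanFromContext(ctx, "gRPC user.Get")
	defer span.Finish()
	// ... make the RPC, propagating the span context in request headers ...
}

func main() {
	handleRequest(context.Background())
}
```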
Link tracing: Tracing information
- Trace information includes the timestamp, the event, the method name (Family + Title), and comments (tags/comments).
- The timestamps on the client and the server come from different hosts, so clock skew has to be taken into account. The RPC client sends a request before the server receives it, and likewise for the response (the server sends the response before the client receives it). This gives the server-side RPC timestamps an upper and a lower bound.
A single call unit has many events:
- 1. client send
- 2. server receive
- intermediate processing
- 3. server send
- 4. client receive
Link tracing: Instrumentation points
Dapper can trace the distributed control path at almost zero cost to application developers, relying almost entirely on modifications to a small number of common component libraries. When a thread works on a traced control path, Dapper stores the trace context in thread-local storage. In Go, the convention is that the first parameter of every method is a context; this covers the common middleware and communication frameworks, including but not limited to: Redis, memcache, RPC, HTTP, database, and queue clients.
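A hedged sketch of one such instrumentation point: a shared HTTP middleware (hypothetical, built on opentracing-go) extracts the upstream trace context from the request headers and stores the span in the request context, so business handlers only need to pass ctx along.

```go
package main

import (
	"net/http"

	opentracing "github.com/opentracing/opentracing-go"
	"github.com/opentracing/opentracing-go/ext"
)

// traceMiddleware carries the trace context for every request,
// so application handlers themselves stay untouched.
func traceMiddleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		tracer := opentracing.GlobalTracer()

		// Pull the upstream span context (trace ID, span ID) out of the headers.
		wireCtx, _ := tracer.Extract(
			opentracing.HTTPHeaders,
			opentracing.HTTPHeadersCarrier(r.Header),
		)

		// Start a server-side span as a child of the upstream span, if any.
		span := tracer.StartSpan(r.URL.Path, ext.RPCServerOption(wireCtx))
		defer span.Finish()

		// Store the span in the request context: in Go the context parameter
		// plays the role that ThreadLocal plays in Dapper.
		ctx := opentracing.ContextWithSpan(r.Context(), span)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

func main() {
	mux := http.NewServeMux()
	mux.HandleFunc("/ping", func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("pong"))
	})
	http.ListenAndServe(":8080", traceMiddleware(mux))
}
```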
Link Tracing: Architecture diagram
Link tracing: Tracing overhead
Handling tracing overhead:
- Generating and collecting trace data in the monitored system consumes resources and degrades its performance;
- A portion of resources is also needed to store and analyze the trace data. The generation overhead is the most critical part of Dapper's performance impact, because collection and analysis can more easily be turned off in an emergency, while ID generation, span creation, and so on always take time;
- Lower the agent's NICE value to prevent CPU contention on a heavily loaded server.
Sampling
If a significant event occurs once in the system, it will occur thousands of times, so we do not need to collect all of the data to observe it.
Interesting paper
Uncertainty in Aggregate Estimates from Sampled Distributed Traces
Link tracing: Trace sampling
Fixed sampling, 1/1024
This simple scheme works very well for our high-throughput online services, because the events of interest are (given the high throughput) still likely to occur often enough to be captured. However, at lower traffic volumes and lower transmission loads, important events can be missed at such a low sampling rate, while those workloads can tolerate a higher sampling rate at an acceptable performance cost. The solution for such systems is to override the default sampling rate, which requires manual intervention, a situation we tried to avoid in Dapper.
Active sampling:
We understand this as a target number of samples collected per unit of time. Under high QPS the effective sampling rate naturally drops, and under low QPS it naturally rises; for example, collect at most one trace per interface per second.
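A minimal sketch of this idea (the type name and the one-per-second policy are illustrative): cap the number of sampled traces per second, so the effective rate adapts to QPS.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// rateSampler keeps roughly `limit` sampled traces per second: under high QPS
// the effective sampling rate drops, under low QPS it rises.
type rateSampler struct {
	mu     sync.Mutex
	limit  int
	count  int
	window time.Time
}

func newRateSampler(perSecond int) *rateSampler {
	return &rateSampler{limit: perSecond, window: time.Now().Truncate(time.Second)}
}

func (s *rateSampler) Sample() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now().Truncate(time.Second)
	if now.After(s.window) { // a new one-second window: reset the counter
		s.window = now
		s.count = 0
	}
	if s.count < s.limit {
		s.count++
		return true
	}
	return false
}

func main() {
	s := newRateSampler(1)              // e.g. at most one trace per interface per second
	fmt.Println(s.Sample(), s.Sample()) // true false: the second call exceeds the budget
}
```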
Secondary sampling:
Problem: with a large number of container nodes, even active sampling can still produce a very large volume of samples, so the total amount of data written to the central repository must be controlled. We take advantage of the fact that all spans of a given trace share the same trace ID, even though those spans may be spread across thousands of hosts.
Solution: for each span in the collection system, we hash the trace ID into a scalar Z, where 0 <= Z <= 1. We choose a runtime sampling rate, so that excess data we cannot write to the warehouse is dropped gracefully. The sampling rate can also be adjusted while running by tuning the secondary-sampling coefficient in the collection system. Finally, the policy is pushed to the agents and the collection system based on back-end storage pressure, achieving precise secondary sampling.
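A small sketch of the hashing step (the specific hash function is an assumption, not prescribed by the notes): map the trace ID to a scalar Z in [0, 1) and keep the span only if Z is below the runtime rate, so all spans of one trace are kept or dropped together.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
)

// traceIDToUnit hashes a trace ID to a scalar Z with 0 <= Z < 1.
func traceIDToUnit(traceID string) float64 {
	h := fnv.New64a()
	h.Write([]byte(traceID))
	return float64(h.Sum64()) / float64(math.MaxUint64)
}

// keepSpan decides whether a span is written to the central store.
// All spans of one trace share the trace ID, so they are kept or dropped
// together, even if they come from thousands of hosts.
func keepSpan(traceID string, rate float64) bool {
	return traceIDToUnit(traceID) < rate
}

func main() {
	// The collection tier can lower `rate` at runtime when the storage
	// backend is under pressure, without touching the applications.
	fmt.Println(keepSpan("trace-abc-123", 0.1))
	fmt.Println(keepSpan("trace-abc-123", 0.9))
}
```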
Downstream sampling:
Even after the gateway layer applies active sampling, services that many others depend on still receive a high volume of samples downstream, so downstream sampling is applied as well.
Link tracing: API
Search:
Search by Family (service name), Title (interface), time, caller, and so on.
Details:
For a single trace ID, view the overall link information, including span and level statistics, span details, and the services and components depended on.
Global dependency graph:
Because the dependencies between services change dynamically, it is not possible to infer all of the dependencies between services, between tasks, and between tasks and other software components from configuration information alone.
Dependency search:
Searching the dependencies of a single service helps us plan resource deployment globally when building multi-datacenter active-active setups and decide whether a service falls into that scope. It also helps us regularly review dependent services and their levels to improve the availability of the overall architecture.
Inferring circular dependencies:
In a complex business architecture it is hard to keep all calls strictly layered, but we want the call graph to always point downward, i.e. to have no circular dependencies.
Link tracing: Experience & Optimization
Performance optimization:
- 1. Unnecessary serial calls;
- 2. Cache read amplification;
- 3. Database write amplification;
- 4. Aggregate service interface calls (pipeline).
Integration with the exception-log system:
If an exception occurs within the context of a sampled Dapper trace, the corresponding trace ID and span ID are also recorded as metadata in the exception log. The front end of the exception-monitoring service then provides a link from a specific exception report directly to its distributed trace.
User log integration:
The trace ID is returned in the response header of the request. When a user hits a failure or reports one to customer service, we can use the trace ID as the key for the entire request link, search the ES indices of the services involved in parallel (based on the interface-level service dependencies), and aggregate and sort the data to diagnose the problem intuitively.
Capacity estimation:
Starting from the gateway services, the call fan-out of the entire downstream can be derived, accurately estimating traffic and the proportion handled by each system.
Network hot spots & weak points:
Our internal RPC frameworks are not unified enough, and the base-library components have not yet converged at the application-protocol level. Once we collect this data through tracing, we can easily spot traffic hot spots, data-center hot spots, abnormal traffic, and similar situations.
- Likewise, failure-prone spans can be computed easily, helping us identify the weak points of our services;
- A lot of useful insight can be mined from this data through analysis.
Hot spots can be split out into dedicated clusters.
OpenTracing:
Standardization push: the features above rely on span tags for their computations, so we will gradually move to the standard protocol. That also makes it easier for us to open source the system, rather than keeping an internal "special system".
Monitoring
- The four golden signals:
- Latency: interface response time
- Traffic: QPS
- Errors: the proportion of error codes returned by interfaces and the number of failed requests
- Saturation: how full the service's own capacity is (its water level)
For base libraries that wrap resource types such as net, cache, DB, and RPC, monitor the four golden signals first (a sketch follows the list):
- Latency (time taken, distinguishing normal from abnormal requests)
- Traffic (must cover the source, i.e. the caller)
- Errors (cover the error code or HTTP status code)
- Saturation (how full the service capacity is)
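A sketch of exposing these four signals for one client dependency with Prometheus client_golang; the metric and label names are illustrative, not an internal standard.

```go
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
	// Latency: a histogram, labeled so normal and error calls can be separated.
	rpcLatency = promauto.NewHistogramVec(prometheus.HistogramOpts{
		Name:    "rpc_client_duration_seconds",
		Help:    "RPC latency",
		Buckets: prometheus.DefBuckets,
	}, []string{"method", "code"})

	// Traffic: a request counter, labeled by caller so the source is visible.
	rpcRequests = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "rpc_client_requests_total",
		Help: "RPC requests",
	}, []string{"method", "caller"})

	// Errors: a counter labeled by error code / HTTP status code.
	rpcErrors = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "rpc_client_errors_total",
		Help: "RPC errors",
	}, []string{"method", "code"})

	// Saturation: a gauge for how full capacity is, e.g. in-flight requests.
	rpcInflight = promauto.NewGaugeVec(prometheus.GaugeOpts{
		Name: "rpc_client_inflight",
		Help: "In-flight RPC requests",
	}, []string{"method"})
)

func observeCall(method, caller, code string, d time.Duration) {
	rpcRequests.WithLabelValues(method, caller).Inc()
	rpcLatency.WithLabelValues(method, code).Observe(d.Seconds())
	if code != "OK" {
		rpcErrors.WithLabelValues(method, code).Inc()
	}
}

func main() {
	method := "user.Get"
	rpcInflight.WithLabelValues(method).Inc() // call in flight
	start := time.Now()
	// ... perform the RPC ...
	observeCall(method, "web-gateway", "OK", time.Since(start))
	rpcInflight.WithLabelValues(method).Dec()

	http.Handle("/metrics", promhttp.Handler()) // scraped by Prometheus (pull mode)
	http.ListenAndServe(":9090", nil)
}
```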
Dashboard design
- Row 1: the service's status and running time
- Row 2: the service's own metrics
- Row 3: the physical machine, container, or VM, plus runtime (language-level) metrics: GC, goroutines, locks, etc.
- Row 4: kernel state
- CPU, IO, CLOSE_WAIT, FDs, context switches (kernel level)
- Row 5: client-side metrics, monitoring latency and errors from the client's perspective
- e.g. calls to MySQL, RPC
- For example, memcache itself is stable but the client cannot connect (possibly a network problem); in that case the server-side QPS jitters downward
- For example, the QPS of a client's RPC requests is very high (hot spots can also be seen through link tracing), and the fan-out ratio may be very high (one incoming request fans out into n downstream requests)
Specification:
- Metrics are unified and the learning curve is low, so everyone shares the same understanding. One dashboard template is compatible with the metrics of every language, so it can be reused by everyone (common metrics plus service-specific metrics).
- Clicking on memcache jumps directly to the corresponding memcache monitoring page (cross-dashboard references).
- Try to fit everything into one dashboard; if it gets sluggish, fold fine-grained metrics away and lazy-load them.
Quick Problem Location
- To locate problems quickly, the app ID in the microservice system runs through every referenced component; when building the asset system, the whole system is tied together by this app ID.
- Benefit: from the development side, I can jump straight to the dashboard of the service I depend on (memcache/MySQL) and check whether it has a problem, trying to resolve it in one pass instead of hunting around on my own.
Operations is often organized by component; here assets are organized and monitored from the application's perspective.
Prometheus + Grafana
Pull mode: metrics are scraped in bulk in one pass. Cumulative metrics (counters) are read as differences over a time window; external agents (exporters) collect metrics for MySQL/Redis; InfluxDB is a time-series database; Grafana does the rendering; federation aggregates multiple Prometheus servers.
Most scenarios are covered by monitoring, unless a machine has a blind spot to fill.
System level: CPU, memory, IO, network, TCP/IP state, FDs, etc.; kernel: context switches; runtime: GC, internal memory state, etc.
- The long-tail problem
- Dependent resources (the client's and the server's view)
Objective:
- Quickly locate problems, for example an increase in service latency;
- Logs show that a problem occurred but not why; trends need to be analyzed from the metrics.
Profiling
- If a profiling port is open online, clicking "go to profile" automatically opens the profiling page and generates a flame graph after collecting for 30 seconds (see the sketch after this list);
- Use service discovery to find node information and provide a quick way to view a process's profiling information (flame graphs, etc.) on the web.
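The underlying mechanism here is the standard library's net/http/pprof; a minimal sketch (the port is an assumption, and the "go to profile" web tooling above is internal and not shown):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on DefaultServeMux
)

func main() {
	// Expose the profiling port; in production this is typically bound to an
	// internal/admin address that is found through service discovery.
	go func() {
		log.Println(http.ListenAndServe("127.0.0.1:6060", nil))
	}()

	// ... run the actual service ...
	select {}
}
```

With the port open, `go tool pprof -http=:8081 http://127.0.0.1:6060/debug/pprof/profile?seconds=30` collects a 30-second CPU profile and renders it (flame graph included) in the browser.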
Watchdog
Problem: what if the scene cannot be captured at the time and the issue does not reproduce? Writing a script that captures data once a minute is too crude to reconstruct what happened after the fact.
- Build a watchdog into the code: use memory, CPU, and similar signals to trigger collection automatically (for example, a sliding-mean calculation; when the level stays high continuously, profile collection is triggered automatically).
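A hedged sketch of such a watchdog: a sliding mean over heap usage triggers a one-off heap and CPU profile capture. The window size, threshold, and output files are illustrative choices, not from the original notes.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/pprof"
	"time"
)

// watchdog samples heap usage every interval, keeps a sliding mean, and
// triggers a one-off profile capture when the mean exceeds the threshold.
func watchdog(thresholdBytes uint64, interval time.Duration) {
	const window = 5
	samples := make([]uint64, 0, window)
	triggered := false

	for range time.Tick(interval) {
		var m runtime.MemStats
		runtime.ReadMemStats(&m)

		samples = append(samples, m.HeapAlloc)
		if len(samples) > window {
			samples = samples[1:]
		}
		var sum uint64
		for _, s := range samples {
			sum += s
		}
		mean := sum / uint64(len(samples))

		if mean > thresholdBytes && !triggered {
			triggered = true // capture at most once; the reset policy is up to you
			go captureProfiles()
		}
	}
}

// captureProfiles grabs a heap profile and a 30s CPU profile into local files;
// a real watchdog would upload them to a central store for later analysis.
func captureProfiles() {
	if f, err := os.Create("heap.pprof"); err == nil {
		pprof.WriteHeapProfile(f)
		f.Close()
	}
	if f, err := os.Create("cpu.pprof"); err == nil {
		pprof.StartCPUProfile(f)
		time.Sleep(30 * time.Second)
		pprof.StopCPUProfile()
		f.Close()
	}
	fmt.Println("profiles captured")
}

func main() {
	go watchdog(512<<20, time.Second) // illustrative: trigger above ~512 MiB of heap
	select {}
}
```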
OpenTracing (Google Dapper)
- Correlate by trace ID
Keep responsibilities unified and clear
A common phenomenon: developers are used to assigning bugs to testers and production problems to operations.
- Internet companies now fold these functions into development: developers are responsible for code quality, test quality, unit tests, and delivery to production;
- A developer should understand an error log at a glance; expecting operations to analyze it is unrealistic (where would you even find operations engineers that capable?);
- Responsibilities of operations (SRE) colleagues:
- Provide a good platform so developers can locate problems quickly and efficiently;
- Testing colleagues provide an automation framework for efficient automated testing.
Each role owns its own responsibilities:
- Operations focuses on the IaaS and system layers;
- Development focuses on the business and code layers.
References
https://dave.cheney.net/2015/11/05/lets-talk-about-logging
https://www.ardanlabs.com/blog/2013/11/using-log-package-in-go.html
https://www.ardanlabs.com/blog/2017/05/design-philosophy-on-logging.html
The package level logger anti pattern
https://help.aliyun.com/document_detail/28979.html?spm=a2c4g.11186623.2.10.3b0a729amtsBZe
https://developer.aliyun.com/article/703229
https://developer.aliyun.com/article/204554
https://developer.aliyun.com/article/251629
https://www.elastic.co/cn/what-is/elk-stack
https://my.oschina.net/itblog/blog/547250
https://www.cnblogs.com/aresxin/p/8035137.html
https://www.elastic.co/cn/products/beats/filebeat
https://www.elastic.co/guide/en/beats/filebeat/5.6/index.html
https://www.elastic.co/cn/products/logstash
https://www.elastic.co/guide/en/logstash/5.6/index.html
https://www.elastic.co/cn/products/kibana
https://www.elastic.co/guide/en/kibana/5.5/index.html
https://www.elastic.co/guide/en/elasticsearch/reference/5.6/index.html
https://elasticsearch.cn/
https://blog.aliasmee.com/post/graylog-log-system-architecture/