How long would it take you to find the root cause of a system failure? Five minutes? Five days? If your answer is closer to five minutes, chances are your production system and tests are very well logged. Too often, seemingly non-core work such as logging, exception handling, and even testing is treated as an afterthought, added only to patch problems once they occur. As with exception handling and testing, you need a logging strategy for both the system and its tests. Never underestimate the power of logging. With good logging, you may not even need a debugger. Here are some guidelines that have served me well over the years.

Log in moderation

Don’t log too much. Huge logs that eat disk space are a sign that little thought went into what to record. If you log too much, you need to design complex ways to reduce disk access, retain log history, archive large amounts of data, and query large log data sets. What’s more, it is hard to pick out the valuable information in verbose data.

The only thing worse than logging too much is logging too little. Logging generally has two main goals: helping to investigate errors and confirming that events happened. If your logs cannot explain the cause of an error or show whether a transaction occurred, you are logging too little.

Good things to log:

  • Important startup configuration
  • Errors
  • Warnings
  • Changes to persistent data
  • Requests and responses between major system components
  • Important state changes
  • User interactions
  • Calls with a known risk of failure
  • Waits on conditions that may take a noticeable time to be satisfied
  • Periodic progress of long-running tasks
  • Important branch points in the logic and the conditions that led to the branch taken
  • Summaries of processing steps or events in high-level functions – avoid logging every step of a complex process in low-level functions.

Bad things to log:

  • Function entry – Do not log function entry unless it is significant or it is logged at the debug level.
  • Data in a loop – Avoid logging from many iterations of a loop. It is fine to log from every iteration of a small loop, or periodically from a large loop.
  • Contents of large messages or files – Truncate or summarize the data in a way that is useful for debugging.
  • Benign errors – Errors that are not really errors can confuse the reader of the log. Handled exceptions are sometimes part of a successful execution flow.
  • Repeated errors – Do not log the same or similar errors over and over. This fills the log quickly and hides the real problem. The frequency of each error type is best handled by monitoring. The log only needs to capture the details of some of those errors.
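The loop and large-message guidelines above can be sketched as follows. This is a minimal illustration using Python's standard `logging` module; the logger name `batch`, the helper names `summarize` and `process_items`, and the default limits are assumptions made for the example, not part of the original article.

```python
import logging

logger = logging.getLogger("batch")  # hypothetical logger name


def summarize(payload: str, limit: int = 80) -> str:
    """Truncate a large payload and note how much was left out."""
    if len(payload) <= limit:
        return payload
    return f"{payload[:limit]}... ({len(payload) - limit} more chars)"


def process_items(items, log_every: int = 1000) -> int:
    """Process items, logging progress periodically instead of per iteration."""
    processed = 0
    for processed, item in enumerate(items, start=1):
        # ... real work on `item` would happen here ...
        if processed % log_every == 0:
            logger.info("processed %d items so far", processed)
    # One summary line at the end confirms the event took place.
    logger.info("finished: %d items processed", processed)
    return processed
```

With `log_every=1000`, a run over 2,500 items produces three log lines rather than 2,500.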

Use multiple log levels

Do not log everything at the same level. Most logging libraries provide several log levels, and you can choose which levels to enable when the system starts. This gives you easy control over how much detail is logged.

Typical levels are:

  • Debug: Detailed, useful only for development or debugging.
  • Info: The most common logging level.
  • Warning: A strange or unexpected but acceptable state.
  • Error: An error occurred, but the process can recover.
  • Critical: The process cannot recover, so the system shuts down or restarts.

In practice, only two logging configurations are needed:

  • Production: All levels are enabled except debug. If something goes wrong in production, the log should reveal why.
  • Development and debugging: Turn on all levels when writing new code or trying to reproduce a problem.
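As a rough sketch of these two configurations, the snippet below selects the level once at startup with Python's standard `logging` module. The environment variable `APP_MODE`, the function `configure_logging`, and the logger name `payments` are hypothetical names chosen for illustration.

```python
import logging
import os


def configure_logging(mode: str) -> int:
    """Set the root logger level for the given mode and return that level."""
    # "development" enables every level; any other mode is treated as production.
    level = logging.DEBUG if mode == "development" else logging.INFO
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
        force=True,  # replace handlers from an earlier configuration (Python 3.8+)
    )
    return level


# APP_MODE is a hypothetical environment variable read once at startup.
configure_logging(os.environ.get("APP_MODE", "production"))

log = logging.getLogger("payments")  # hypothetical component name
log.debug("cache lookup key=%s", "user:42")  # dropped in production mode
log.info("payment accepted id=%s", "tx-1001")
log.warning("retrying gateway call, attempt %d", 2)
log.error("gateway call failed, falling back to queue")
log.critical("cannot open ledger database, shutting down")
```

Because the level is chosen in one place, switching from the production setup to the debugging setup needs no code changes.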

Test logs are important too

Logging quality matters just as much in test code as in production code. When a test fails, the log should clearly show whether the failure came from the test or from the production system. If it does not, the test logging needs work.

The test log should include:

  • Test execution environment
  • The initial state
  • Setup steps
  • Test case steps
  • Interaction with the system
  • Expected results
  • Actual results
  • Cleanup steps
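One possible way to cover these items, sketched with Python's standard `unittest` and `logging` modules. The `Cart` class is a stand-in for a system under test, and the logger name `test.cart` is an assumption for the example.

```python
import logging
import platform
import unittest

log = logging.getLogger("test.cart")  # hypothetical test logger name


class Cart:
    """Tiny stand-in for the system under test."""

    def __init__(self):
        self.items = {}

    def add(self, sku: str, quantity: int) -> int:
        self.items[sku] = self.items.get(sku, 0) + quantity
        return self.items[sku]


class CartTest(unittest.TestCase):
    def setUp(self):
        # Test execution environment, setup steps, and initial state.
        log.info("environment: Python %s on %s",
                 platform.python_version(), platform.system())
        self.cart = Cart()
        log.info("setup: created cart, initial state=%s", self.cart.items)

    def tearDown(self):
        # Cleanup steps.
        self.cart.items.clear()
        log.info("cleanup: cart emptied")

    def test_add_accumulates_quantity(self):
        # Test case steps and interaction with the system.
        log.info("step: add 2 x apple, then 3 x apple")
        self.cart.add("apple", 2)
        actual = self.cart.add("apple", 3)
        expected = 5
        # Expected and actual results, logged side by side.
        log.info("expected=%d actual=%d", expected, actual)
        self.assertEqual(actual, expected)
```

Logging the expected and actual values before the assertion means the log still tells the story when the assertion message alone is not enough.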

Conditional level of detail control using temporary log queues

When an error occurs, the log should contain plenty of detail. Unfortunately, by the time an error is encountered, the details that led up to it may no longer be available. Also, if you follow the advice “Don’t log too much,” the entries logged before the error may not provide enough detail. A good way to solve this problem is to create a temporary log queue in memory. Throughout the transaction, append the details of each step to the queue. If the transaction completes successfully, discard the queue and log only a summary. If an error is encountered, log the entire contents of the queue together with the error. This technique is particularly useful for test logging of system interactions.
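A minimal sketch of such a temporary queue, written as a Python context manager. The class `TransactionLog`, the function `place_order`, and the logger name `orders` are hypothetical; the standard library's `logging.handlers.MemoryHandler` offers a related buffering mechanism that may suit some cases better.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical logger name


class TransactionLog:
    """Buffer detail messages in memory; write them out only on failure."""

    def __init__(self, name: str):
        self.name = name
        self.details = []

    def detail(self, message: str) -> None:
        self.details.append(message)

    def __enter__(self) -> "TransactionLog":
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is None:
            # Success: drop the details and keep a one-line summary.
            logger.info("%s succeeded (%d steps)", self.name, len(self.details))
        else:
            # Failure: flush every buffered detail, then the error itself.
            for line in self.details:
                logger.error("%s detail: %s", self.name, line)
            logger.error("%s failed: %r", self.name, exc)
        self.details.clear()
        return False  # never swallow the exception


def place_order(order_id: str, quantity: int) -> None:
    with TransactionLog(f"order {order_id}") as tx:
        tx.detail(f"validating quantity={quantity}")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        tx.detail("reserving stock")
        tx.detail("charging customer")
```

A successful `place_order("A1", 2)` leaves a single summary line in the log, while `place_order("A2", 0)` logs the buffered validation detail followed by the error.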

Failures and flakiness are opportunities

When a production problem occurs, you naturally focus on finding and fixing it, but you should also think about the logs. If you struggled to find the cause of the error, this is a great opportunity to improve your logging. Before fixing the problem, fix the logging so that the log clearly shows the cause. If the problem occurs again, it will be much easier to identify.

If you cannot reproduce a problem, or you have a flaky test, enhance the logging so that the problem can be tracked down the next time it occurs.

Failures should be used to improve logging throughout the development process. When writing new code, try to avoid using the debugger and use only the log. Does the log show what happened? If not, logging is inadequate.

Log performance data as well

Recorded timing data can help debug performance issues. For example, determining the cause of timeouts in large systems can be difficult unless you can track the time spent on each important processing step. This can be easily done by recording the start and end times of the calls. Calls that may be time-consuming include:

  • Important system calls
  • Network requests
  • CPU-intensive computations
  • Interactions with connected devices
  • Transactions
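Recording start and end times can be wrapped in a small helper. The sketch below uses a Python context manager around `time.perf_counter`; the names `timed` and `fetch_report` and the logger name `timing` are assumptions for illustration, and the `time.sleep` call merely stands in for a slow operation.

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger("timing")  # hypothetical logger name


@contextmanager
def timed(step: str):
    """Log the start, the end, and the elapsed wall-clock time of a step."""
    logger.info("%s started", step)
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on success and on failure, so the end is always recorded.
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("%s finished in %.1f ms", step, elapsed_ms)


def fetch_report() -> str:
    with timed("fetch_report"):
        time.sleep(0.01)  # stand-in for a network request
        return "report"
```

Wrapping each important step this way makes it possible to see from the log which step consumed the time when a timeout occurs.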

Trace transactions across multiple threads and processes

Create a unique identifier for each transaction that involves processing across multiple threads and/or processes. The originator of the transaction should create the ID and pass it to every component that performs work for the transaction. Each component should log this ID whenever it logs information about the transaction. This makes it much easier to follow a particular transaction when many transactions are being processed concurrently.
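One way to attach such an ID to every log line in Python is a `contextvars.ContextVar` combined with a `logging.Filter`. The names `transaction_id`, `TransactionIdFilter`, `start_transaction`, `continue_transaction`, and the logger name `checkout` are hypothetical; how the ID travels between processes (for example in a request header) is left out of this sketch.

```python
import contextvars
import logging
import uuid

# Holds the ID of the transaction the current thread or task is working on.
transaction_id = contextvars.ContextVar("transaction_id", default="-")


class TransactionIdFilter(logging.Filter):
    """Copy the current transaction ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.transaction_id = transaction_id.get()
        return True


def start_transaction() -> str:
    """Called by the originator: create an ID and make it current."""
    new_id = uuid.uuid4().hex
    transaction_id.set(new_id)
    return new_id


def continue_transaction(received_id: str) -> None:
    """Called by a downstream component with the ID that was passed to it."""
    transaction_id.set(received_id)


handler = logging.StreamHandler()
handler.addFilter(TransactionIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s [%(transaction_id)s] %(message)s")
)

logger = logging.getLogger("checkout")  # hypothetical component name
logger.addHandler(handler)
logger.setLevel(logging.INFO)

tx = start_transaction()
logger.info("order received")  # the line carries the generated ID
```

A newly started thread does not necessarily see the value set by its parent, so the ID should be handed over explicitly and installed with `continue_transaction`, which mirrors the advice to pass the ID to each component.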

Monitoring and logging complement each other

Production services should have both logging and monitoring. Monitoring provides a real-time statistical summary of system status. It can alert you when a percentage of certain request types fail, abnormal traffic patterns are encountered, performance degrades, or other anomalies occur. In some cases, this information alone can reveal the cause of the problem. In most cases, however, a monitoring alert is just a trigger to start an investigation. Monitoring shows the symptoms of a problem; logs provide the details and state of individual transactions, from which you can fully understand the cause.

Original: Optimal Logging

Author: Cyningsun. Link: www.cyningsun.com/12-27-2020/… Copyright notice: All articles on this blog are licensed under CC BY-NC-ND 3.0 CN unless otherwise stated. Please credit the source when reprinting.