The main goal of error logging is to provide the clues and guidance needed to troubleshoot and solve problems. In practice, however, the content and format of error logs vary widely, and error messages are often incomplete, missing relevant context, or unclear, which makes troubleshooting inconvenient and time-consuming. A little extra effort while programming can eliminate a lot of troubleshooting work later.
Before explaining how to write an effective error log, it is important to understand how errors occur.
How errors occur
For the current system, errors come from three sources:
1. Invalid parameters passed in by the upper-layer system. Errors introduced by illegal parameters can be intercepted by parameter checks and precondition checks.
2. Errors caused by interaction with the lower-layer system. There are two types:
a. The lower-layer system processes the request successfully, but the communication fails, which leads to data inconsistency between subsystems. In this case, a timeout compensation mechanism can be used: tasks are recorded in advance, and the data is corrected later by scheduled tasks.
b. The communication succeeds, but the lower-layer processing fails. In this case, you need to coordinate the interaction between subsystems with the lower-layer developers, and handle the error or provide a proper prompt based on the error code and error description returned by the lower layer.
In either case, do not assume that the lower-layer system is fully reliable, and design for its errors.
3. Errors produced by the processing in this layer itself.
Causes of errors in this layer:
Cause one: Negligence. Negligence means the programmer failed to avoid an error that could have been avoided: for example, typing & instead of &&, or = instead of ==; boundary errors; errors in compound logical conditions. It comes either from a lack of attention (being tired, working all night, writing code during meetings) or from rushing to implement features without caring about the robustness of the application. Improvement: Static code analysis tools and unit tests with good line coverage can effectively avoid these problems.
Cause two: Errors and exceptions are not handled thoroughly. Take input problems as an example. When adding two numbers, we should consider not only overflow but also illegal input. The former can be avoided through knowledge, past mistakes, or experience; the latter must be confined to a range we can control, for example by using regular expressions to filter out illegal input. The regular expressions themselves must be tested. For invalid input, provide a detailed, easy-to-understand, and friendly prompt with the cause and a suggestion. Improvement: Consider error situations and exception handling as thoroughly as possible. After implementing the main flow, add one more step: carefully review the possible errors and exceptions, and return reasonable error codes and error descriptions. Each interface or module should handle its own errors and exceptions effectively, to avoid bugs caused by complex scenario interactions. For example, a business use case is completed by the interaction of scenarios A, B, and C. A and B succeed but C fails. In this case, B needs to roll back based on the code and message returned by C and return its own code and message to A; A needs to roll back based on the code and message returned by B and return a code and message to the client. This is a segmented rollback mechanism, and it requires every scenario to account for rollback in exceptional cases.
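The addition example can be sketched roughly as follows. This is only an illustration under assumptions: the class name SafeAdder, the choice of int operands, and the message wording are all made up here. It filters illegal input with a regular expression, detects overflow with Math.addExact, and reports the cause and a suggestion in each message.

```java
import java.util.regex.Pattern;

// Hypothetical helper (name and message wording are assumptions for illustration).
final class SafeAdder {
    // Optional sign followed by 1 to 10 digits; anything else is rejected up front.
    private static final Pattern INT_PATTERN = Pattern.compile("[+-]?\\d{1,10}");

    static int add(String left, String right) {
        int a = parse("left", left);
        int b = parse("right", right);
        try {
            return Math.addExact(a, b); // throws ArithmeticException on int overflow
        } catch (ArithmeticException e) {
            throw new IllegalArgumentException(String.format(
                    "[add] sum of %d and %d overflows int range [%d, %d]; use smaller operands.",
                    a, b, Integer.MIN_VALUE, Integer.MAX_VALUE), e);
        }
    }

    private static int parse(String name, String text) {
        if (text == null || !INT_PATTERN.matcher(text).matches()) {
            throw new IllegalArgumentException(String.format(
                    "[add] operand '%s' is not a valid integer: '%s'; expected an optional sign followed by digits, e.g. -42.",
                    name, text));
        }
        try {
            return Integer.parseInt(text);
        } catch (NumberFormatException e) {
            // The text matches the pattern but does not fit in an int, e.g. 9999999999.
            throw new IllegalArgumentException(String.format(
                    "[add] operand '%s' is out of int range: '%s'.", name, text), e);
        }
    }

    public static void main(String[] args) {
        System.out.println(add("40", "2")); // prints 42
    }
}
```

The regular expression here is deliberately small so that it can be tested exhaustively at its boundaries (empty string, sign only, 11 digits).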
Cause three: Tight logical coupling. When business logic is tightly coupled, the logical relationships become more and more intricate as the software product evolves, and it becomes hard to see the whole picture. A local modification can then affect the whole system and cause unpredictable problems. Improvement: Write short functions and methods, preferably no more than 50 lines each. Write stateless functions and methods that only read global state: the same preconditions always produce the same result, and behavior does not change with external state. Define reasonable structures, interfaces, and logical segments so that interface interactions are as orthogonal and loosely coupled as possible. For the service layer, provide simple, orthogonal interfaces wherever possible. Keep refactoring to keep the application modular and loosely coupled, and to make logical dependencies clear. When a large number of service interfaces interact, sort out the logical flow and interdependencies of each interface and optimize them as a whole. For entities with a large number of states, sort out the related business interfaces and the transition relationships between states.
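One possible way to sort out the transition relationships between states is to keep them in a single table, so that every interface checks against the same rules. The sketch below assumes a made-up VM lifecycle; the states, the allowed transitions, and the names VmStateMachine and VmState are illustrative only.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical state table; the transitions listed here are assumptions, not real rules.
final class VmStateMachine {
    private static final Map<VmState, Set<VmState>> ALLOWED = new EnumMap<>(VmState.class);
    static {
        ALLOWED.put(VmState.CREATING, EnumSet.of(VmState.RUNNING, VmState.DESTROYED));
        ALLOWED.put(VmState.RUNNING, EnumSet.of(VmState.STOPPED));
        ALLOWED.put(VmState.STOPPED, EnumSet.of(VmState.RUNNING, VmState.DESTROYED));
        ALLOWED.put(VmState.DESTROYED, EnumSet.noneOf(VmState.class));
    }

    // Stateless check: the same arguments always give the same answer.
    static boolean canTransit(VmState from, VmState to) {
        return ALLOWED.get(from).contains(to);
    }

    // Fails with a message that names the entity, the current state, and the allowed targets.
    static void checkTransit(String vmName, VmState from, VmState to) {
        if (!canTransit(from, to)) {
            throw new IllegalStateException(String.format(
                    "[change vm state] vm '%s' cannot go from %s to %s; allowed targets from %s are %s.",
                    vmName, from, to, from, ALLOWED.get(from)));
        }
    }

    public static void main(String[] args) {
        System.out.println(canTransit(VmState.RUNNING, VmState.STOPPED)); // prints true
    }
}

enum VmState { CREATING, RUNNING, STOPPED, DESTROYED }
```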
Cause four: Incorrect algorithms. Improvement: First, separate the algorithm from the application. If the algorithm has multiple implementations, errors can be found by unit tests that cross-check the implementations against each other, as with sorting. If the algorithm is reversible, errors can be found by unit tests that verify the round trip, as with encryption and decryption.
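Both kinds of check can be sketched in a few lines. The example below is an assumption-laden illustration: a hand-written insertion sort stands in for the implementation under test and is cross-checked against Arrays.sort, and Base64 encoding stands in for an encrypt/decrypt pair in the round-trip check. The class name AlgorithmChecks is made up.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;
import java.util.Random;

final class AlgorithmChecks {
    // Implementation under test: a hand-written insertion sort (returns a sorted copy).
    static int[] insertionSort(int[] input) {
        int[] a = input.clone();
        for (int i = 1; i < a.length; i++) {
            int key = a[i];
            int j = i - 1;
            while (j >= 0 && a[j] > key) {
                a[j + 1] = a[j];
                j--;
            }
            a[j + 1] = key;
        }
        return a;
    }

    // Cross-check: compare with an independent implementation on random inputs.
    static boolean crossCheckSort(int rounds, long seed) {
        Random random = new Random(seed);
        for (int r = 0; r < rounds; r++) {
            int[] data = random.ints(random.nextInt(50), -1000, 1000).toArray();
            int[] expected = data.clone();
            Arrays.sort(expected);
            if (!Arrays.equals(expected, insertionSort(data))) {
                return false;
            }
        }
        return true;
    }

    // Round-trip check: decode(encode(x)) must equal x.
    static boolean roundTrip(String text) {
        byte[] original = text.getBytes(StandardCharsets.UTF_8);
        byte[] encoded = Base64.getEncoder().encode(original);
        byte[] decoded = Base64.getDecoder().decode(encoded);
        return Arrays.equals(original, decoded);
    }

    public static void main(String[] args) {
        System.out.println(crossCheckSort(100, 42L));
        System.out.println(roundTrip("error log"));
    }
}
```

A fixed seed keeps a failing run reproducible, which matters when the point of the test is to locate an error.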
Cause five: Parameters of the same type are passed in the wrong order. For example, the method is declared as modifyFlow(int rx, int tx) but called as modifyFlow(tx, rx). Improvement: Stagger parameters of the same type (avoid placing them side by side) as much as possible; if that is not possible, cover the call with interface tests that use a different value for each parameter.
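A related technique, not mentioned above and shown here only as one possible option, is to give each parameter its own small type so that the compiler rejects a swapped call. The types Rx and Tx and the class FlowService are hypothetical; only the method name modifyFlow comes from the example.

```java
// Hypothetical service; modifyFlow here just renders its arguments for illustration.
final class FlowService {
    static String modifyFlow(Rx rx, Tx tx) {
        return String.format("[modify flow] rx=%d, tx=%d", rx.value, tx.value);
    }

    public static void main(String[] args) {
        // Distinct values (100 and 200) make a swap visible in tests and logs.
        System.out.println(modifyFlow(new Rx(100), new Tx(200)));
        // modifyFlow(new Tx(200), new Rx(100)); // does not compile: argument types mismatch
    }
}

// Hypothetical wrapper for the rx value.
final class Rx {
    final int value;
    Rx(int value) { this.value = value; }
}

// Hypothetical wrapper for the tx value.
final class Tx {
    final int value;
    Tx(int value) { this.value = value; }
}
```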
Cause six: Null pointer exceptions. Null pointer exceptions are usually caused by objects that were not properly initialized, or by using an object without first checking that it is non-null. Improvement: For configuration objects, check that they were initialized successfully. For ordinary entity objects, check that the object is non-null before using it.
Cause seven: Network communication errors. These are usually caused by network delay, congestion, or failure. They are usually low-probability events, but they can lead to large-scale failures and to bugs that are difficult to reproduce. Improvement: Write an INFO log at the exit point of the upstream subsystem and at the entry point of the downstream subsystem. The time difference between the two gives a clue.
Cause eight: Transaction and concurrency errors. The combination of transactions and concurrency easily produces errors that are very difficult to locate. Improvement: For concurrent operations that involve shared variables or important state changes, add INFO logs. Is there a more effective way?
Cause nine: Incorrect configuration. Improvement: When the application starts or when a configuration is loaded, check all configuration items and print the corresponding INFO logs to ensure that every configuration is loaded successfully.
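A startup check of this kind might look like the sketch below. It assumes the configuration is held in a java.util.Properties object and uses java.util.logging for the INFO output; the class name ConfigChecker and the key names in the usage line are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import java.util.logging.Logger;

final class ConfigChecker {
    private static final Logger LOG = Logger.getLogger(ConfigChecker.class.getName());

    // Returns the required keys that are missing or blank; logs each loaded item at INFO.
    static List<String> findMissing(Properties config, String... requiredKeys) {
        List<String> missing = new ArrayList<>();
        for (String key : requiredKeys) {
            String value = config.getProperty(key);
            if (value == null || value.trim().isEmpty()) {
                missing.add(key);
            } else {
                LOG.info(String.format("[load config] %s = %s", key, value));
            }
        }
        return missing;
    }

    // Fails fast at startup and names every missing item in one message.
    static void requireAll(Properties config, String... requiredKeys) {
        List<String> missing = findMissing(config, requiredKeys);
        if (!missing.isEmpty()) {
            throw new IllegalStateException(String.format(
                    "[load config] missing or empty configuration items %s; add them to the configuration before starting.",
                    missing));
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        config.setProperty("db.url", "jdbc:example");
        System.out.println(findMissing(config, "db.url", "db.user")); // prints [db.user]
    }
}
```

In a real system, sensitive values such as passwords would need to be masked before being logged.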
Cause ten: Errors caused by unfamiliarity with the business. In medium-sized and large systems, parts of the business logic and business interactions are complicated. The whole business logic may exist only in the heads of several developers, none of whom understands it completely. This easily leads to business coding errors. Improvement: Design correct business use cases through discussion among several people, and write the business logic according to those use cases. Fully document the final business logic and business use cases. In the comments of each business interface, state its preconditions, processing logic, post-verification, and precautions, and update these comments when the business changes. Conduct code reviews. Business comments are important documentation of the business interface and serve as an important cache of business understanding.
Cause eleven: Errors caused by design problems. For example, a synchronous serial design suffers from poor performance and slow responses; a concurrent asynchronous design can solve those problems, but brings hidden risks to safety and correctness. The asynchronous approach also changes the programming model and raises new issues such as pushing and receiving asynchronous messages. Caching can improve performance, but introduces the problem of cache updates. Improvement: Write and review design documents carefully. A design document must describe the background, the requirements, the business objectives to be met, the performance indicators to be achieved, the overall design idea, the detailed plan, and the foreseeable advantages, disadvantages, and possible impact of the plan. Use testing and acceptance to ensure that the design meets the business objectives and performance indicators.
Cause twelve: Errors caused by unknown details, such as buffer overflows and SQL injection attacks. The function works correctly, but from the point of view of malicious use there are loopholes. Another example: if you choose the Jackson library to parse JSON strings, then by default a parsing error occurs when a new field is added to the JSON for an object. You must annotate the class with @JsonIgnoreProperties(ignoreUnknown = true) to handle such changes properly. Other JSON libraries do not necessarily behave this way. Improvement: Rely on experience on the one hand; on the other hand, with security issues and exceptions in mind, choose mature and rigorously tested libraries.
Cause thirteen: Bugs that emerge as things change over time. Solutions that seemed excellent in the past often become clumsy or ineffective in current or future scenarios. For example, an encryption algorithm once considered perfect must be used with caution after it has been cracked. Improvement: Keep track of changes and bug fixes, and fix outdated code, libraries, and behavior.
Cause fourteen: Hardware-related errors, such as memory leaks, insufficient storage space, and OutOfMemoryError. Improvement: Add performance monitoring for important indicators of the application system, such as CPU, memory, and network.
Common system errors:
1. The entity does not exist in the database. The log must specify which entity, or the entity identifier.
2. The entity configuration is incorrect. The log must specify which configuration is wrong and what the correct configuration should be.
3. The entity's resources do not meet the requirements. The log must specify the current resources and the required resources.
4. The preconditions of an entity operation are not met. The log must specify which preconditions need to be met and what the current state is.
5. The post-verification of an entity operation fails. The log must specify what the post-verification requires and what the current state is.
6. A performance problem causes a timeout. The log must specify the cause and how it can be optimized later.
7. Inconsistent status or data among multiple subsystems caused by communication errors?
Errors that are difficult to locate tend to occur at the lower layers. Because a lower layer cannot foresee the specific business scenario, its error messages are relatively generic. The upper business layer therefore has to provide as many clues as possible. An error occurs when a precondition is not met in some layer of the stack during the interaction of multiple systems or layers. When programming, make sure that all necessary preconditions are met in each layer as far as possible, avoid passing wrong parameters down to the lower layers, and intercept errors in the business layer whenever possible.
Most errors are caused by a combination of causes, but every error has a cause. After resolving an error, analyze thoroughly how it occurred and how to prevent it from happening again. Success comes with effort, but progress comes with reflection.
How to write error logs that make troubleshooting easier
The basic rules for writing error logs are as follows:
1. Be as complete as possible. Each error log describes completely what went wrong in which scenario, what caused it (or may have caused it), and how to resolve it (or how it might be resolved).
2. Be as specific as possible. For example, when NC resources are insufficient, state exactly which resources are insufficient, if the program can determine this directly. For generic errors such as VM NOT EXIST, state the scenario in which they occurred, which may also help later statistical work.
3. Be as direct as possible. With an ideal error log, the reader knows at first glance what caused the error and how to fix it, without going through several steps to find the actual cause.
4. Integrate existing experience directly into the system. All problems already solved and experience already gained should be built into the system in as friendly a way as possible, so that they give new employees better clues instead of being buried elsewhere.
5. Keep the layout neat and orderly and the format unified and standardized. Dense, haphazard logs are painful to read, unfriendly, and inconvenient for troubleshooting.
6. Use multiple keywords to uniquely identify the request, and make the keywords stand out: time, entity identifier (for example, VMName), and operation name.
The basic steps of troubleshooting are: log in to the application server, open the log file, locate the error log, and analyze the problem according to the error log.
1. From logging in to opening the log file: Logging in to multiple application servers one by one is inconvenient. A tool deployed on the AG could display the logs of all servers directly on the AG, or even filter out the error logs you need.
2. Locating the error log: Currently the log layout is too dense to locate error logs easily. You can first use the time to get near the error log, and then use a combination of entity keyword and operation name to pin it down. Traditionally error logs are located by requestId, but the requestId has to be found first and is not descriptive. It is best to locate the error log directly by time and content keywords.
3. Analyzing the error log: The content of the error log should be straightforward, clearly match the characteristics of the problem being investigated, and give important clues.
Often, the problem with error logs is that they are written in incomplete, half-English phrases that are concise and understandable only within the current code context. Once you leave the code context, it is hard to know what they are talking about, and you have to think hard or read the code to understand them. Isn't that making trouble for ourselves? For example:
if ((storageType == StorageType.dfs1 || storageType == StorageType.dfs2)
        && (zone.hasStorageType(StorageType.io3) || zone.hasStorageType(StorageType.io4))) {
    // dfs1 and dfs2 are stored in io3 and io4.
} else {
    log.info("zone storage type not support, zone: " + zone.getZoneId() + ", storageType: " + storageType.name());
    throw new BizException(DeviceErrorCode.ZONE_STORAGE_TYPE_NOT_SUPPORT);
}
Which storage types does the zone actually support? Don't make me think! An error log should describe clearly what happened even outside the code context. In addition, if the error log states the cause directly, it saves the effort of digging through the logs. In a sense, error logs can also serve as very useful documentation of all kinds of illegal use cases.
Error logs in current programs may have the following problems:
1. The error log does not specify the error parameters and content:
catch (Exception ex) {
    log.error("control ip insert failed", ex);
    return new ResultSet<AddControlIpResponse>(
            ControlIpErrorCode.ERROR_CONTROL_IP_INSERT_FAILURE);
}
The log does not specify which control IP failed to be inserted. Including the control IP as a keyword would make the error easier to search for and pin down. Similarly:
log.error("Get some errors when insert subnet and its IPs into database. Add subnet or IP failure.", e);
The log does not specify which subnet and which of its IPs failed. Note that specifying these requires some extra work and may affect performance slightly; this is a tradeoff between performance and debuggability. Solution: Use String.format("Some msg to ErrorObj: %s", errobj) to specify the error parameters and content. This usually requires writing readable toString methods on DO objects.
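As a rough illustration of this solution, the sketch below gives a data object a readable toString and passes it to String.format. The class SubnetDO, its fields, and the message wording are assumptions made for the example, not the real data model.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical message builder for the subnet insert failure.
final class SubnetLogMessages {
    static String insertFailed(SubnetDO subnet) {
        return String.format(
                "[add subnet] failed to insert subnet and its IPs into database: %s", subnet);
    }

    public static void main(String[] args) {
        SubnetDO subnet = new SubnetDO("subnet-1", "10.0.0.0/24", Arrays.asList("10.0.0.1", "10.0.0.2"));
        System.out.println(insertFailed(subnet));
    }
}

// Hypothetical data object (DO); the field names are assumptions for illustration.
final class SubnetDO {
    private final String subnetId;
    private final String cidr;
    private final List<String> ips;

    SubnetDO(String subnetId, String cidr, List<String> ips) {
        this.subnetId = subnetId;
        this.cidr = cidr;
        this.ips = ips;
    }

    // A readable toString is what makes the object useful inside a log message.
    @Override
    public String toString() {
        return "SubnetDO{subnetId=" + subnetId + ", cidr=" + cidr + ", ips=" + ips + "}";
    }
}
```

With a real logger the call would be along the lines of log.error(SubnetLogMessages.insertFailed(subnet), e), so that the subnet and its IPs appear as searchable keywords.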
2. The error scenario is not clear:
log.error("nc has exist, nc ip" + request.getIp());
The NC is found to already exist while creating an NC. However, the log does not indicate the error scenario, so the reader has to guess why the NC is reported as existing. It can be changed to either of the following:
log.error("nc already exists when creating nc, please check nc parameters. Given nc ip: " + request.getIp());
log.error("[create nc] nc already exists, please check nc parameters. Given nc ip: " + request.getIp());
Similarly:
log.error("not all vm destroyed, nc id " + request.getNcId());
It can be changed to:
log.error(String.format("[delete nc] some vms [%s] in the nc are not destroyed. nc id: %s", vmNames, request.getNcId()));
Solution: Add a "when ..." clause to the error message, or prefix the message with [interface name], so that the error scenario can be read directly from the error log. Executor: interface name: service: when
3. Unclear content or unclear meaning:
if (aliMonitorReporter == null) {
    log.error("aliMonitorReporter is null!");
} else {
    aliMonitorReporter.attach(new ThreadPoolMonitor(namePrefix, asynTaskThreadPool.getThreadPoolExecutor()));
}
It can be changed to:
log.error("aliMonitorReporter is null, probably not initialized properly, please check configuration in file xxx.");
Similarly:
if (diskWbps == null && diskRbps == null && diskWiops == null && diskRiops == null) {
    log.error("none of attribute is specified for modifying");
    throw new BizException(DeviceErrorCode.NO_ATTRIBUTE_FOR_MODIFY);
}
It can be changed to:
log.error("[modify disk attribute] None of [diskWbps,diskRbps,diskWiops,diskRiops] is specified for disk id:" + diskId);
Solution: Describe the error more clearly and precisely.
4. The troubleshooting guidance is unclear:
log.error("get gw group ip segment failed. zkPath: " + LockResource.getGwGroupIpSegmnetLockPath(request.getGwGroupId()));
What is zkPath? How should this problem be investigated? Whom should I ask? Where can I find more specific clues? Solution: Add the relevant background knowledge and troubleshooting guidance to the log.
5. The error content is not detailed enough:
if (!ncResourceService.isNcResourceEnough(ncResourceDO, vmResourceCondition)) {
    log.error("disk space is not enough at vm's nc, nc id:" + vmDO.getNcId());
    throw new BizException(ResourceErrorCode.ERROR_RESOURCE_NOT_ENOUGH);
}
Which resource is insufficient? How much is left? How much is needed now? Note that specifying these requires a little extra work and may affect performance slightly; this is a tradeoff between performance and debuggability. Solution: Improve the program or technique so that it reveals the specific difference as far as possible and reduces manual comparison.
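One way to reveal the specific difference is to have the check itself return the numbers instead of a bare boolean. The sketch below is hypothetical: the class ResourceCheck, the GB unit, and the [create vm] prefix are assumptions, and the real isNcResourceEnough is not shown in this article.

```java
import java.util.Optional;

// Hypothetical check that reports what is required, what is available, and the gap.
final class ResourceCheck {
    // Empty when the disk is sufficient; otherwise a complete error message.
    static Optional<String> diskShortage(String ncId, long requiredGb, long availableGb) {
        if (availableGb >= requiredGb) {
            return Optional.empty();
        }
        return Optional.of(String.format(
                "[create vm] disk space is not enough at nc '%s': required %d GB, available %d GB, short by %d GB.",
                ncId, requiredGb, availableGb, requiredGb - availableGb));
    }

    public static void main(String[] args) {
        diskShortage("nc-7", 500, 120).ifPresent(System.out::println);
    }
}
```

The caller can then log the message and throw its business exception, and whoever reads the log no longer has to compare the figures by hand.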
6. Half-English phrases are hard to read, and the reader has to think to piece together the complete meaning:
log.warn("cache status conflict, device id " + deviceDO.getId() + " db status " + deviceDO.getStatus() + ", nc status " + status);
It can be changed to:
log.warn(String.format("[query cache status] device cache status conflicts between regiondb and nc, status of device '%s' in regiondb is %s, but is %s in nc.", deviceDO.getId(), deviceDO.getStatus(), status));
Solution: Rewrite the message as natural, readable English sentences.
To sum up, the error log format can be:
log.error("[interface name or operation name] [Some Error Msg] happens. [params]. [Probably Because]. [Probably need to do].");
log.error(String.format("[interface name or operation name] [Some Error Msg] happens. [%s]. [Probably Because]. [Probably need to do].", params));
or
log.error("[Some Error Msg] happens to [error parameters] when [in some condition]. [Probably Because]. [Probably need to do].");
log.error(String.format("[Some Error Msg] happens to %s when [in some condition]. [Probably Because]. [Probably need to do].", parameters));
[Probably Because] and [Probably need to do] can be omitted in some cases, but are best spelled out for important interfaces and scenarios. Each error log stands on its own. It should explain as completely, specifically, and directly as possible what error occurred in which scenario, the cause, and the measures or steps to take.
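To keep messages in this shape without retyping the pattern each time, a small helper could assemble the parts. The following is only a sketch: the class ErrorLogFormat, its parameter names, and the exact punctuation are assumptions, and the optional parts are skipped when null.

```java
// Hypothetical helper that assembles an error log message in the format described above.
final class ErrorLogFormat {
    static String format(String operation, String errorMsg, String params,
                         String probableCause, String suggestion) {
        StringBuilder sb = new StringBuilder();
        sb.append('[').append(operation).append("] ").append(errorMsg).append('.');
        if (params != null) {
            sb.append(' ').append(params).append('.');
        }
        if (probableCause != null) {
            sb.append(" Probably because ").append(probableCause).append('.');
        }
        if (suggestion != null) {
            sb.append(" Probably need to ").append(suggestion).append('.');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("create nc", "nc already exists",
                "nc ip: 10.0.0.8", null, "check nc parameters"));
    }
}
```

A shared helper of this kind is also one answer to the worry about wording under time pressure: the developer only fills in the parts.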
Questions:
1. Does String.format affect logging performance? In general, error logs should be rare, so String.format is not called often enough to affect the application or logging.
2. When development time is tight, is there time to weigh every word? Create a standardized format and fill the content into it, which saves the time spent choosing words.
3. When should info, warn, and error be used? Info prints the normal status information that the program is expected to produce, for easy tracing and locating. Warn indicates that the system has a minor abnormality that does not affect operation or use. Error indicates that a system error or exception has occurred and the operation cannot be completed normally. stackoverflow.com/ques…
Error logs are an important means of troubleshooting. When we implement a feature, we usually consider the various errors that can occur and their causes; to track down a cause, we need some key description that locates it. This forms a triad: error symptom -> key error description -> ultimate cause of the error. Each type of error should provide the corresponding key description as far as possible, so that the corresponding cause can be located. In other words, when programming, think carefully about which descriptions are most helpful in locating the cause of an error, and add those descriptions to the error log whenever possible.
If there are problems or difficulties not covered in this article, suggestions are welcome.