Zhu Ye's Internet architecture practice notes S2E2: the 10 pitfalls easiest to fall into when writing business code | Juejin annual essay

I admit that the title of this article is a bit clickbait. These are the pits that are easiest to fall into when writing business code, mostly because people do not pay enough attention to the details (the focus is on Java, of course, but many of these points are not language specific).

1. Use of the client

When using middleware or storage such as Redis, Elasticsearch, RabbitMQ, or MongoDB, we always use client packages to communicate with these systems, and we use HTTP clients to send HTTP requests. One of the biggest mistakes you can make with these client packages lies in how the client object is created and used. Take a class called RedisClient, the entry point for Redis operations: should you call new RedisClient().get(KEY) each time, or inject a singleton RedisClient?

As we know, the clients of these components usually communicate with the server remotely over TCP connections. For performance, a client typically maintains a connection pool of long-lived connections. If a client such as RedisClient, MongoClient, or HttpClient maintains a connection pool inside the class, it is usually thread-safe and can be used in multi-threaded environments, and creating a new object on every call must be strictly avoided (a well-designed framework usually exposes it as a singleton and does not let you instantiate it).

For example, suppose a Client opens a pool of 5 TCP connections when it is created, and that pool is meant to serve the entire application. If misuse creates a new Client on every call, 5 TCP connections are created every time Redis is called, and the QPS could drop from 10000 to 10. To make matters worse, some clients maintain not only a connection pool for TCP connections but also a thread pool for task processing, and that thread pool may have a large default core thread count.
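
The cost of that misuse can be made concrete with a small simulation. The sketch below is only an illustration: PooledClient is a hypothetical stand-in for a pooled client (it is not the API of any real Redis or HTTP library), and a counter takes the place of real TCP connections.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a pooled client such as a Redis or HTTP client.
final class PooledClient {
    static final int POOL_SIZE = 5;
    // Counts every simulated TCP connection opened in this JVM.
    static final AtomicInteger OPENED = new AtomicInteger();

    PooledClient() {
        // A real client would open its pooled connections here (or lazily).
        OPENED.addAndGet(POOL_SIZE);
    }

    String get(String key) {
        return "value-of-" + key; // placeholder for a remote call
    }
}

final class ClientUsageDemo {
    // Misuse: a new client, and therefore a new pool, for every call.
    static int connectionsWhenCreatedPerCall(int calls) {
        PooledClient.OPENED.set(0);
        for (int i = 0; i < calls; i++) {
            new PooledClient().get("k" + i);
        }
        return PooledClient.OPENED.get();
    }

    // Intended use: one shared client for the whole application.
    static int connectionsWhenShared(int calls) {
        PooledClient.OPENED.set(0);
        PooledClient shared = new PooledClient();
        for (int i = 0; i < calls; i++) {
            shared.get("k" + i);
        }
        return PooledClient.OPENED.get();
    }
}
```

For 100 calls, the per-call variant opens 500 simulated connections while the shared variant opens 5. With a real client the per-call variant would also pay for every TCP handshake.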

When a framework such as Netty is used, EventLoop threads do the IO processing. For a client, the worker group may need only 2 to 4 threads; assume 4 are enough. If the Client's connection pool creates a new Bootstrap for each connection and the pool size is 5, that amounts to 20 EventLoopGroup threads per Client. If, on top of that, the user of the Client misuses the framework and creates a new Client for every request, each request costs 20 threads. That is a triple whammy.

So you might say: is it not enough to use a singleton for every client? Not necessarily; it depends on how the Client is implemented. The Client may be only an entry point, with the connection pool and thread pool maintained in another class. The entry point itself is lightweight and stateful (it holds some configuration, for example) and must not be used as a singleton; the framework developers intend people to use the API through this convenient entry point. If you make it a singleton, configuration may leak from one use into another. So there is no universal answer to the question of best practice for clients.

I have not mentioned databases here because people generally access the database through MyBatis or JPA and do not deal with the data source directly, so mistakes are less likely. But there is now a great deal of middleware, and even more clients, both official and from the community. When using a client, find out exactly how it should be used from the documentation (or search Google for XXX threadsafe or XXX singleton to confirm). If you are still not sure, read the source code and look at how the client manages its connection pool and thread pool; otherwise you could cause huge performance problems. And it is not just performance: I have seen many major failures in which a service crashed from memory exhaustion, too many TCP connections, and so on, all because of improper client use.

2. Service invocation parameter configuration

Everyone is practicing microservice architecture now. Whatever microservice framework you use, whether HTTP REST based or TCP RPC, you will have to set some parameters, and they can easily trip you up if you do not think about them carefully.

Timeout configuration

The two parameters the client usually needs to pay the most attention to are the connection timeout (ConnectionTimeout) and the read timeout (ReadTimeout): the timeout for establishing a TCP connection and the timeout for reading the needed data from the socket. The latter covers not only network time but also the time the server takes to process the task. Consider a few points when setting them:

The connection timeout is relatively simple. Establishing a TCP connection generally does not take long, so a large value makes little sense. I have seen settings of 60 seconds or even longer. If the connection cannot be established within 2 seconds, it is better to give up right away; failing fast at least allows a retry, so why wait?
The read timeout involves not only the network but also the execution time of the remote service. Think about it: if the client read timeout is 5 seconds and the remote service takes 10 seconds to execute, the client receives a read timeout error after 5 seconds while the remote service keeps executing and completes after 10 seconds. If the client retries at that point, the server executes the operation again. In general, evaluate the server execution time first (for example, P95 at 3 seconds) and set the client read timeout slightly longer (for example, 5 seconds); otherwise you may run into repeated execution.
Consider a case: a Job invokes a service to run a scheduled task that generates a statement. The task takes 30 minutes to run (when it completes, the data status is updated to generated), but the read timeout set on the Job client was 60 seconds. The call timed out and was retried every minute, so a task that should have run once a day was on its way to running 30 times; because the task was so resource-intensive, the server went down before it even reached 30. Most RPC frameworks execute the server-side business logic in a thread pool and do not set a timeout on it themselves. Again, for long-running operations, consider whether a synchronous remote call is really what you need. If it is, control the state through locking, or control concurrency through rate limiting.
You might wonder why most frameworks pay no attention to WriteTimeout. A write only puts data into the socket buffer, and sending the data to the remote side happens asynchronously. The write itself is usually fast unless the buffer is full. We cannot tell from the write whether the data reached the remote side; we only find out when the response data arrives, and by then we are already in the read stage. So a timeout on the write itself is of little significance.
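
As a rough illustration of these settings, here is a minimal sketch using the JDK's own java.net.http client (Java 11 or later). The 2-second and 5-second values come from the examples above, the URL passed in by the caller is a placeholder, and the per-request timeout of this client bounds the wait for the response, which is the closest equivalent to a read timeout here. Other client libraries name and scope these parameters differently.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.time.Duration;

final class TimeoutConfig {
    // Short connect timeout: give up quickly so that a retry is possible.
    static final Duration CONNECT_TIMEOUT = Duration.ofSeconds(2);
    // Response timeout slightly above an assumed server-side P95 of 3 seconds.
    static final Duration RESPONSE_TIMEOUT = Duration.ofSeconds(5);

    // Build the client once and share it; it manages its own connections.
    static HttpClient newClient() {
        return HttpClient.newBuilder()
                .connectTimeout(CONNECT_TIMEOUT)
                .build();
    }

    // The timeout set here applies to this single request.
    static HttpRequest newRequest(String url) {
        return HttpRequest.newBuilder(URI.create(url))
                .timeout(RESPONSE_TIMEOUT)
                .GET()
                .build();
    }
}
```

A long-running job such as the 30-minute statement task would not fit this shape at all; it is better triggered asynchronously and polled for its status.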

Automatic retry

Spring Cloud Ribbon and other RPC clients often have automatic retry functions (MaxAutoRetries and MaxAutoRetriesNextServer). For failover, some frameworks will by default retry on node B if node A goes down. We need to consider whether we want this function, whether our server is idempotent, and whether the framework retries only GET requests or all requests; get this wrong and automatic retries will land you in a pit. Not every server handles idempotency well enough. In particular, as in the timeout problem described earlier, not every server can handle idempotency correctly while the original request is still executing: most server-side idempotency handling relies on the data table state that the transaction updates after the operation has completed. For remote service invocations, it is important that the client and the server agree on an idempotency policy and on what to do when their timeouts are inconsistent.
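
One way to cover the "request still executing" case is to register the request before the work starts rather than after it finishes. The sketch below assumes the client sends a unique request ID with every call and resends the same ID on retry; IdempotentExecutor and its in-memory map are hypothetical, and a real service would keep this state in shared storage with an expiry.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Supplier;

final class IdempotentExecutor<T> {
    // One entry per request ID, created before the work starts.
    private final ConcurrentMap<String, CompletableFuture<T>> results = new ConcurrentHashMap<>();

    T execute(String requestId, Supplier<T> work) {
        CompletableFuture<T> mine = new CompletableFuture<>();
        CompletableFuture<T> earlier = results.putIfAbsent(requestId, mine);
        if (earlier != null) {
            // A retry of a request already seen: wait for, or reuse, its result
            // instead of running the work a second time.
            return earlier.join();
        }
        try {
            T value = work.get();
            mine.complete(value);
            return value;
        } catch (RuntimeException e) {
            // Forget the failed attempt so that a later retry can run again.
            results.remove(requestId, mine);
            mine.completeExceptionally(e);
            throw e;
        }
    }
}
```

Calling execute twice with the same request ID runs the supplier once, and the second call receives the first call's result even if it arrives while the first is still running.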

3. Use of thread pools

Thread pool configuration

The Alibaba Java Development Manual says:

Do not use Executors to create a thread pool. Create it with ThreadPoolExecutor instead, so that the running rules of the thread pool are explicit and the risk of resource exhaustion is avoided. 1) FixedThreadPool and SingleThreadPool: the allowed request queue length is Integer.MAX_VALUE, which may accumulate a large number of requests and result in OOM. 2) CachedThreadPool and ScheduledThreadPool: the number of threads allowed to be created is Integer.MAX_VALUE, which may create a large number of threads and result in OOM.

It is recommended that you familiarize yourself with the basic principles of thread pools and manually configure parameters such as the number of threads, the queue type and length, and the rejection policy based on actual service requirements.

We tend to use queues for task buffering (whether in thread pools or MQ), and the rejection policy for a full queue is also worth discussing. We often use a thread pool for asynchronous processing precisely because the tasks will be compensated later or because losing a task does not matter. If we pick CallerRunsPolicy too casually, we may run into big problems, because when the queue is full the task is executed by the calling thread, which is often the last thing the caller wants. Even worse, when the thread pool is used with an NIO framework such as Netty and the caller is an IO EventLoopGroup thread, the IO thread will be blocked once the business thread pool is full. What to do when there are too many tasks, whether to record them for later compensation, discard them, or let the caller execute them, needs to be considered carefully.
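
A minimal sketch of an explicitly configured pool follows. The sizes are deliberately tiny so that the rejection path is easy to see, and AbortPolicy is only one possible choice; the catch block is where a real system would record the task for compensation or drop it on purpose.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPoolDemo {
    // Every limit is spelled out: thread counts, queue capacity, rejection policy.
    static ThreadPoolExecutor newPool(int core, int max, int queueCapacity) {
        return new ThreadPoolExecutor(
                core, max,
                60, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity),
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Submits blocking tasks to a pool with 1 thread and 1 queue slot
    // and returns how many submissions were rejected.
    static int countRejected(int tasks) {
        ThreadPoolExecutor pool = newPool(1, 1, 1);
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        for (int i = 0; i < tasks; i++) {
            try {
                pool.execute(() -> {
                    try {
                        release.await(); // keep the worker busy
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            } catch (RejectedExecutionException e) {
                rejected++; // record for compensation, or discard deliberately
            }
        }
        release.countDown();
        pool.shutdown();
        return rejected;
    }
}
```

With 5 submissions, one task runs, one waits in the queue, and 3 are rejected. With CallerRunsPolicy those 3 would instead have blocked the submitting thread.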

Thread pool sharing

I have seen business code that shares one thread pool, exposed through a Utils-style class, for all kinds of operations across the whole project. I have also seen a lot of Java 8 business code that uses parallel streams for time-consuming operations without a custom thread pool or a larger thread count, unaware that parallel streams share the common ForkJoinPool. The problem with sharing is interference. If some asynchronous operations take one second on average and others take 100 seconds, putting them in the same thread pool means they are likely to affect each other, and some may even starve. It is advisable to set up isolated thread pools according to the type of asynchronous work.
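
The isolation idea can be sketched with two small pools. The pool names "report" and "notify" and their sizes are made up for the illustration; the point is only that saturating one pool leaves the other responsive.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

final class IsolatedPools {
    // A bounded pool whose threads carry the pool name, which helps in thread dumps.
    static ExecutorService newNamedPool(String name, int threads, int queueCapacity) {
        AtomicInteger seq = new AtomicInteger();
        ThreadFactory factory = r -> {
            Thread t = new Thread(r, name + "-" + seq.incrementAndGet());
            t.setDaemon(true);
            return t;
        };
        return new ThreadPoolExecutor(threads, threads, 0, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(queueCapacity), factory,
                new ThreadPoolExecutor.AbortPolicy());
    }

    // Saturates the slow pool, then checks that the fast pool still answers.
    static String fastTaskWhileSlowPoolIsBusy() {
        ExecutorService slow = newNamedPool("report", 2, 10);
        ExecutorService fast = newNamedPool("notify", 2, 10);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < 2; i++) {
                slow.execute(() -> {
                    try {
                        release.await(); // simulates a 100-second operation
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Callable<String> whoAmI = () -> Thread.currentThread().getName();
            return fast.submit(whoAmI).get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            release.countDown();
            slow.shutdown();
            fast.shutdown();
        }
    }
}
```

Had both kinds of task shared one two-thread pool, the fast task would have waited behind the two slow ones.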

4. Thread safety

Whether the object is singleton

When using the Spring container, beans are singletons by default, so it is particularly easy to turn something into a singleton that should not be one. A class is stateful, for example, when it has data fields in it. The mistake is even more likely when Spring is combined with another framework that does not use Spring internally and manages the caching, pooling, or lifecycle of its objects through its own mechanism. If we inject such types directly into the container and let the container manage objects that the framework expects to create itself, we may encounter a lot of bugs. For the internal data fields of singleton types, consider wrapping them in ThreadLocal, so that thread isolation keeps the internal data from being corrupted in multithreaded use.

Whether singletons are thread-safe

The previous point was about whether an object should be a singleton at all; this one is about whether it is thread-safe when it is a singleton. When using the classes that various frameworks provide, it is sometimes natural (for performance) to make them static or inject them as Spring singletons, but before doing so make sure the type is thread-safe (the common SimpleDateFormat, for example, is not). I think the keyword I googled most during development was XXX threadsafe. Conversely, if you are developing a framework, you have an obligation to tell users in the comments whether a type is thread-safe. Thread safety problems are hard to find in testing, since tests are rarely concurrent, but strange problems may appear in production. If such a bug at least throws a concurrency exception such as ConcurrentModificationException, you are lucky; without an exception the problem is really hard to locate. Many web programmers are unaware that their projects run in a multithreaded environment.
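
For the SimpleDateFormat case, two common ways out are sketched here: switch to the immutable java.time formatter, or keep the legacy class but give each thread its own instance through ThreadLocal. The pattern and the UTC zone are arbitrary choices for the example.

```java
import java.text.SimpleDateFormat;
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Date;
import java.util.TimeZone;

final class SafeDateFormats {
    // DateTimeFormatter is immutable and thread-safe, so one shared instance is fine.
    static final DateTimeFormatter SHARED =
            DateTimeFormatter.ofPattern("yyyy-MM-dd").withZone(ZoneOffset.UTC);

    // SimpleDateFormat keeps mutable state: each thread gets its own instance.
    private static final ThreadLocal<SimpleDateFormat> PER_THREAD =
            ThreadLocal.withInitial(() -> {
                SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
                format.setTimeZone(TimeZone.getTimeZone("UTC"));
                return format;
            });

    static String formatModern(Instant instant) {
        return SHARED.format(instant);
    }

    static String formatLegacy(Date date) {
        return PER_THREAD.get().format(date);
    }
}
```

Both methods can be called from any number of threads. A single static SimpleDateFormat shared the same way could silently produce wrong dates under load.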

Lock scope and granularity

With synchronized (object), it is worth thinking carefully about what the object is: a class instance, a type, or a Redis key (a cross-process lock). We need to make sure that the lock actually covers the operation it is meant to protect; I have seen code in which the lock was ineffective because it was not taken at the correct level.

It is also important to minimize the granularity of locks; if all operations are method-level distributed locks, the method is always globally single-threaded. At this point there is no point in adding machines, the system will not scale.

Finally, consider lock timeouts, especially for distributed locks. If no timeout is set, the lock may never be released when the code is interrupted. It is not recommended to reinvent the wheel for Redis locks; use the officially recommended Redlock scheme instead (for example, Redisson’s implementation).
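
Granularity and timeouts can be illustrated inside a single process. The sketch below is not a distributed lock: KeyedLocks is a hypothetical helper that takes one ReentrantLock per business key and gives up after a bounded wait. Its lock map is never cleaned up, which a long-lived service would have to address.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class KeyedLocks {
    private final ConcurrentMap<String, ReentrantLock> locks = new ConcurrentHashMap<>();

    // Runs the action under the lock for this key; returns false if the lock
    // could not be acquired within waitMillis.
    boolean runExclusive(String key, long waitMillis, Runnable action) {
        ReentrantLock lock = locks.computeIfAbsent(key, k -> new ReentrantLock());
        boolean acquired;
        try {
            acquired = lock.tryLock(waitMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        if (!acquired) {
            return false;
        }
        try {
            action.run();
            return true;
        } finally {
            lock.unlock(); // always released, even if the action throws
        }
    }
}

final class KeyedLocksDemo {
    // Returns {acquired on the busy key, acquired on another key}.
    static boolean[] contendedAndFree() {
        KeyedLocks locks = new KeyedLocks();
        CountDownLatch held = new CountDownLatch(1);
        CountDownLatch release = new CountDownLatch(1);
        Thread holder = new Thread(() -> locks.runExclusive("order-1", 1000, () -> {
            held.countDown();
            awaitQuietly(release);
        }));
        holder.start();
        awaitQuietly(held); // wait until the other thread holds the lock for order-1
        boolean sameKey = locks.runExclusive("order-1", 50, () -> { });
        boolean otherKey = locks.runExclusive("order-2", 50, () -> { });
        release.countDown();
        return new boolean[] {sameKey, otherKey};
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

While one thread works on order-1, a second attempt on order-1 times out, but work on order-2 proceeds. A method-level lock would have serialized both.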

5. Asynchronous

Data flow sequence

If data flows are processed asynchronously, their order becomes a problem. For example, we first send a request to another service to perform an asynchronous operation (such as a payment), then perform local database operations (such as creating a payment order), and commit the transaction when they complete. The external service may process the request very quickly and send us the callback (the payment success notice) first, when our local transaction has not yet been committed and the payment order has not yet been written to the database. The original data cannot be found when the external callback arrives, and problems follow. Even worse, we may return SUCCESS to the external callback, so the external service will never send a compensating callback.
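
A defensive callback handler for this situation might look like the sketch below. It assumes the external payment service resends the notice when it gets anything other than a success answer; the class, the in-memory map standing in for the order table, and the return values SUCCESS and RETRY_LATER are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class PaymentCallbackHandler {
    // Stands in for the payment order table: only committed orders are visible here.
    private final Map<String, String> committedOrders = new ConcurrentHashMap<>();

    // Called once the local transaction that creates the order has committed.
    void commitOrder(String orderId) {
        committedOrders.put(orderId, "CREATED");
    }

    // Returns the answer sent back to the external payment service.
    String onPaymentNotice(String orderId) {
        String updated = committedOrders.computeIfPresent(orderId, (id, status) -> "PAID");
        // Order not visible yet: do not acknowledge, so that the notice is sent again.
        return updated == null ? "RETRY_LATER" : "SUCCESS";
    }

    String statusOf(String orderId) {
        return committedOrders.get(orderId);
    }
}
```

The more fundamental fix is ordering: commit the local payment order first and only then call the external service, so that the callback can never overtake the commit.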

When using MQ, we also run into compensation data being re-enqueued and messages being resent, so a later message may arrive before an earlier one. Can our message consumer handle this? If you do not get this right, you will end up with logic errors.
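
One common way to make a consumer tolerate reordering is to carry a version (or sequence number) in each message and ignore anything that is not newer than the state already applied. The sketch assumes the producer assigns such a monotonically increasing version per key; the class is hypothetical and keeps its state in memory.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

final class VersionedConsumer {
    private static final class Versioned {
        final long version;
        final String value;

        Versioned(long version, String value) {
            this.version = version;
            this.value = value;
        }
    }

    private final Map<String, Versioned> state = new ConcurrentHashMap<>();

    // Applies the message only if it is newer than what is already stored.
    // Returns true if the message was applied, false if it was stale or a duplicate.
    boolean consume(String key, long version, String value) {
        Versioned incoming = new Versioned(version, value);
        Versioned kept = state.merge(key, incoming,
                (current, candidate) -> candidate.version > current.version ? candidate : current);
        return kept == incoming; // identity check: did our object win?
    }

    String valueOf(String key) {
        Versioned current = state.get(key);
        return current == null ? null : current.value;
    }
}
```

If the PAID message (version 2) arrives before the CREATED message (version 1), the late CREATED message is ignored and the state stays PAID.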

Asynchronous nonblocking

When using non-blocking frameworks such as Spring WebFlux or Netty (especially the former, where developers tend to overlook this issue), we need to be aware that business processing should not occupy the event loop’s IO threads for too long, or the few IO threads will end up blocked. Whether a task should run in the IO thread is not absolute: handing a small task to a business thread pool incurs thread switching, and the gain may not be worth the loss. Everything still needs to be verified with stress test data and must not be taken for granted. If you get this wrong, performance can degrade enormously. Sometimes, when the Reactor model of an NIO framework is used improperly, it is not even as efficient as the thread-per-request model.
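
The hand-off from an IO thread to a business pool can be sketched with plain JDK classes. This is not WebFlux or Netty code: a single-thread executor named "io-loop" merely plays the role of an event loop, and blockingLookup is a made-up stand-in for a slow call.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class OffloadDemo {
    static ThreadPoolExecutor pool(String name, int threads) {
        return new ThreadPoolExecutor(threads, threads, 0L, TimeUnit.SECONDS,
                new ArrayBlockingQueue<>(100), r -> new Thread(r, name));
    }

    // Stand-in for a slow, blocking call such as a JDBC query.
    static String[] blockingLookup(String ioThread) {
        try {
            Thread.sleep(50);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new String[] {ioThread, Thread.currentThread().getName()};
    }

    // Returns {thread that accepted the request, thread that did the blocking work}.
    static String[] handleOnce() {
        ThreadPoolExecutor ioLoop = pool("io-loop", 1);
        ThreadPoolExecutor business = pool("biz-worker", 4);
        try {
            // Cheap step on the IO thread: it only reads the request.
            CompletableFuture<String> accepted =
                    CompletableFuture.supplyAsync(() -> Thread.currentThread().getName(), ioLoop);
            // Blocking step is handed to the business pool, freeing the IO thread.
            CompletableFuture<String[]> done = accepted.thenCompose(ioThread ->
                    CompletableFuture.supplyAsync(() -> blockingLookup(ioThread), business));
            return done.get(5, TimeUnit.SECONDS);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            ioLoop.shutdown();
            business.shutdown();
        }
    }
}
```

For a task that takes microseconds the hand-off itself would cost more than it saves, which is why the decision should rest on stress test data.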

6. Transactions

Local transactions

Most projects today use the @Transactional annotation directly to start transactions, but don’t give much thought to how this annotation works. Common pitfalls include:

The annotation does not work at all because of configuration issues (especially when Spring Boot is not used)
The entry method is not annotated with @Transactional and calls an annotated method through this.method(); the annotation has no effect because the call does not go through the proxy
The transaction is not rolled back on an exception, for example because rollbackFor is not configured, or because the method swallows every exception internally so that none propagates to trigger the rollback

Bugs caused by transactions that do not take effect are serious and often difficult to spot. Many projects merely appear to use @Transactional, without considering that the annotation may not be working at all.
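
The this.method() pitfall follows from how proxy-based interception works in general, and a JDK dynamic proxy is enough to reproduce it. The sketch below is a simplified model, not Spring's implementation: @Tx is a hypothetical marker playing the role of @Transactional, and a counter stands in for beginning a transaction.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical marker playing the role of @Transactional in this sketch.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface Tx {
}

interface OrderService {
    void importOrders(); // entry point, not transactional

    @Tx
    void saveOrder();    // meant to run inside a transaction
}

final class OrderServiceImpl implements OrderService {
    @Override
    public void importOrders() {
        this.saveOrder(); // plain call on the target object: the proxy never sees it
    }

    @Override
    public void saveOrder() {
        // database writes would go here
    }
}

final class TxProxyDemo {
    static final AtomicInteger TRANSACTIONS_STARTED = new AtomicInteger();

    // Wraps the target in a proxy that "starts a transaction" for @Tx methods.
    static OrderService proxied(OrderService target) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (method.isAnnotationPresent(Tx.class)) {
                TRANSACTIONS_STARTED.incrementAndGet(); // begin/commit would wrap the call
            }
            return method.invoke(target, args);
        };
        return (OrderService) Proxy.newProxyInstance(
                OrderService.class.getClassLoader(),
                new Class<?>[] {OrderService.class},
                handler);
    }
}
```

Calling importOrders() on the proxy starts no transaction even though it ends up in saveOrder(), while calling saveOrder() on the proxy starts one. Only calls that pass through the proxy are intercepted.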

Distributed transaction

Whether you implement eventual consistency or a two-phase approach (this refers to the idea, not necessarily to middleware), transactionality across processes needs to be considered as a whole. The hardest part is working out how operations on remote resources and operations on local resources can be treated as one transaction.

7. Reference root

This section is about memory leaks. A Java program that does not allocate off-heap memory directly does not leak memory in the narrow sense. What usually happens is that data stays reachable from a reference root and can therefore never be collected; anything reachable from a static field is the typical case. More subtly, Spring beans are singletons by default: if a Service declares a structure such as a List to store data, the field is not declared static, yet it behaves like a static one (while giving the impression that the object will be reclaimed on its own). We need to be clear about the following:

Whether the class that owns our data is a singleton or static (its lifecycle)
The lifecycle of the class that owns our data (trace the reference root)
Whether our data grows without bound or is a finite collection
Whether new data replaces old data when our data is put into a Map or Set

In plain English, be careful whenever you see Lists, Maps, and other data structures in your code that are declared outside a method body (as class member fields).
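
When a member-field collection really is needed in a long-lived object, giving it an upper bound answers the "infinite or finite" question explicitly. The sketch below uses LinkedHashMap's removeEldestEntry hook to evict the least recently used entry; RecentOrders and its limit are made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Imagine this object living as long as the application, like a singleton bean.
final class RecentOrders {
    private final Map<String, String> recent;

    RecentOrders(final int maxEntries) {
        // Access-ordered map that drops its eldest entry once the limit is exceeded.
        this.recent = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // LinkedHashMap is not thread-safe, so access is synchronized.
    synchronized void remember(String orderId, String payload) {
        recent.put(orderId, payload);
    }

    synchronized boolean contains(String orderId) {
        return recent.containsKey(orderId);
    }

    synchronized int size() {
        return recent.size();
    }
}
```

After 1000 insertions into a RecentOrders with a limit of 100, the map holds the 100 most recent orders. A plain ArrayList field in the same place would hold all 1000 and keep growing.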

8. Equality checks

Equality checking is just one of the most error-prone implementation details. Once again I recommend the Alibaba Java Development Manual and installing its IDE inspection plugin. It contains many prohibited and mandatory items, and each one marks a pit; I recommend going through these code details one by one.

==

Misusing == is one of the most common mistakes Java programmers make, and it leads to serious bugs that static code inspection could have found. Such bugs are hard to notice otherwise, which makes them all the more regrettable. Think about how many times business code really needs to compare the references of two objects, apart from null checks.

For database entities we often use wrapper types because of null values, and for the input parameters of external HTTP requests we also use wrapper types for the same reason. Using == on such values is particularly error-prone and needs special attention. Moreover, equal/not-equal handling is often branch logic that tests easily fail to cover, so when the problem does show up, it is a big one.
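
The wrapper-type trap is easy to reproduce. In the sketch below, sameByReference compares object identity and sameByValue compares content in a null-safe way. The Java language guarantees that boxed values from -128 to 127 are cached, which is exactly why the bug tends to hide in tests that use small numbers.

```java
import java.util.Objects;

final class WrapperEquality {
    // Compares object identity, not the numeric value.
    static boolean sameByReference(Integer a, Integer b) {
        return a == b;
    }

    // Null-safe value comparison.
    static boolean sameByValue(Integer a, Integer b) {
        return Objects.equals(a, b);
    }
}
```

sameByReference(127, 127) is true because both arguments box to the same cached object, whereas sameByReference(1000, 1000) is typically false on a default JVM because two distinct objects are created. sameByValue returns true in both cases and also copes with null.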

Map and hashCode()

The Alibaba Java Development Manual also mentions that if a custom object is used as a Map key, hashCode() and equals() must be overridden, which is very easy to overlook in business development. I have run into this problem myself. The cause was not that I did not know the rule, but that I did not realize my class would be used as a Map key for caching by a framework (a third-party framework, not code I wrote). Because of this, multiple instances of my class were treated as a single instance by the framework, producing unexpected bugs.
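
The rule itself can be demonstrated in a few lines. RawKey and ValueKey below are made-up classes that differ only in whether they override equals() and hashCode(); a HashMap stands in for whatever cache a framework might build around such keys.

```java
import java.util.HashMap;
import java.util.Map;

// No equals()/hashCode(): two instances with the same id are different keys.
final class RawKey {
    final long userId;

    RawKey(long userId) {
        this.userId = userId;
    }
}

// Value-based equals()/hashCode(): instances with the same id are the same key.
final class ValueKey {
    final long userId;

    ValueKey(long userId) {
        this.userId = userId;
    }

    @Override
    public boolean equals(Object other) {
        return other instanceof ValueKey && ((ValueKey) other).userId == userId;
    }

    @Override
    public int hashCode() {
        return Long.hashCode(userId);
    }
}

final class MapKeyDemo {
    static boolean foundWithRawKey() {
        Map<RawKey, String> cache = new HashMap<>();
        cache.put(new RawKey(42), "profile");
        return cache.containsKey(new RawKey(42));
    }

    static boolean foundWithValueKey() {
        Map<ValueKey, String> cache = new HashMap<>();
        cache.put(new ValueKey(42), "profile");
        return cache.containsKey(new ValueKey(42));
    }
}
```

The lookup with RawKey misses and the lookup with ValueKey hits. Whichever way the two methods are defined, the map simply trusts them, so a class that may end up as a key needs them to match its intended notion of identity.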

9. Use of middleware

When using middleware, it is best to stress test the middleware or storage against your usage scenario and study the configuration parameters to understand the underlying principles; otherwise it is easy to fall into the pit of using or configuring it contrary to best practices. Adopting systems such as MongoDB, Elasticsearch, or InfluxDB on a large scale and only then running into scalability problems is no small matter.

I have encountered developers who use Redis as a database rather than a key-value cache and use the KEYS command to search for the keys they need for batch operations. This usage goes completely against Redis best practices, and frequent use of such operations on a huge Redis cluster may cause Redis to stall. With Redis I have also encountered IO performance problems caused by unreasonable RDB configuration, and OOM problems caused by excessive memory usage during snapshots.

Take InfluxDB as another example. Its tags are a good feature: we can flexibly tag metrics and group by those tags. But a tag must not be used to store data with too large a range of values, such as URLs or IDs; otherwise the large index may slow InfluxDB down or even cause OOM (the high cardinality problem).

Another example: one business, under load pressure, ended up running MongoDB with neither the write-ahead log nor replication enabled. After a power failure the database could not start because the storage files were damaged. It took several days of studying recovery tools and the data storage structure to repair the data files, and all historical data was inaccessible for the entire period.

For projects that pursue stability above all, it is recommended to keep things simple: rely on MySQL alone without introducing anything else, and consider other middleware only when performance problems appear. This is the approach least likely to cause problems.

10. Environment and configuration

There are too many pits caused by environment problems, and sometimes people are not even aware of the environment differences. I will name just a few; I believe everyone who does both development and operations has hit plenty of pits, and even production incidents, caused by environment and configuration problems. An application that runs fine locally may show all kinds of strange problems once deployed to production, because the local setup lacks the container environment, the K8S environment, and the complex network environment.

Network environment

I have seen cases where the stress test went very well but the system still collapsed once online. In the stress test every service was deployed on one machine, whereas in production the services call each other over network (or even dedicated-line) links. The environments are not the same, and network overhead inevitably brings request latency, blocked threads, and higher resource consumption. Another case is a domain name that is configured (or resolved) incorrectly, so that requests that should stay on the intranet go over the public network. In the test or local environment, IP addresses are often configured directly, so the problem does not show up there.

It also happens that a supposedly local stress test sends some requests over the public network to services on a server, so it is not a purely local test. If you do not realize this, performance optimization will leave you at a loss. It is therefore best to use a tool such as iftop to check whether the network traffic of your process (and the addresses of the remote services it connects to) is what you expect.

Container environment

Everyone uses K8S and Docker now. In this environment, our business projects not only pass through multiple network layers from outside to inside, but CPU, memory, and file handle settings are also configured at multiple layers (the Pod layer, the Docker layer, and the OS layer). A configuration mismatch between the layers can leave the application short of resources.

The K8S network is, after all, quite complex, and different CNI schemes may have different problems. When access is slow, you have to locate whether it is slow inside Docker, slow through the Service, or slow through the Ingress.

Many frameworks derive thread counts from the number of CPU cores, ParallelGCThreads being one example, and the result can be too large on some hosts (for example, a host with 2 CPUs, 48 cores, and 96 threads). Improper configuration can cause performance problems.

Sometimes Supervisor is used to launch the Java process inside Docker. Be aware that Supervisor itself has limits (minfds and minprocs).

Environmental isolation

Internet companies usually have a grayscale environment or Staging environment to do the final test before launching. However, problems often occur because this environment shares some resources with the production environment.

One problem I met involved Qiniu, which we used as the CDN. The grayscale environment and the production environment shared the same CDN, so during the grayscale test the new static resource files were cached on the CDN nodes, and external users got errors because they were served the new static resources. Rolling back immediately does not easily fix this, because the CDN has already been polluted. The long-term solution is simple: isolate the environments, or use different static resource file names for each release.

Conclusion

To sum up, the fundamentals are where mistakes are most likely: threads, thread synchronization, pooling, network connections, network links, object instantiation, memory, and so on. Understanding how a framework uses these basic resources internally and configuring them sensibly according to best practices is what business development needs to pay special attention to. Sometimes, when using third-party frameworks and middleware, code ends up configured according to the worst practice rather than the best one simply because the details were not understood. That causes great problems and is a pity.

There are many kinds of pits, and this article only scratches the surface. I hope readers will add the worst pits they have met in the comments section.
