For every programmer, failure is a sword of Damocles hanging overhead: everyone wants to avoid it, and how to avoid it is a question every programmer struggles to answer. You can approach the question from many angles, such as requirement analysis, architecture design, coding, testing, code review, release, and online operations and maintenance.
Most of our services share the same shape: they serve users, and they also depend on third-party services provided by others, with business logic, algorithms, and data processing interspersed in between. Any of these pieces can be the source of a failure. How do we avoid failures? I can sum it up in one sentence: be suspicious of third parties, guard against users, and be yourself.
1. Be suspicious of third parties
Hold firmly to the belief that “all third-party services are unreliable,” no matter what the third party promises. Based on this belief, we need to take the following actions.
1.1 Have a fallback: make a business degradation plan
What if a third-party service dies? Does our business go down with it? Obviously that is not what we want, and if we work out a degradation plan in advance, it greatly improves the reliability of the service. A couple of examples will make this concrete.

When we built a personalized recommendation service, we needed to fetch each user's personalization data from the user center and feed it into the model for scoring and ranking. But if the user center fails and we cannot get the data, should we simply recommend nothing? Obviously not: we can cache a list of popular items as a fallback.

Another example is a data synchronization service that needs to fetch the latest data from a third party and write it to MySQL. The third party provided exactly two channels: 1) a message notification service that pushes only the changed data; 2) an HTTP service that we must call actively to pull the data. At first we chose message-based synchronization because it was more real-time, but later messages were delivered with long delays and nothing looked abnormal on our side; by the time we noticed, a whole day had passed and the problem had escalated into a failure. A more reasonable approach is to use both channels: the message channel for real-time updates, and the HTTP pull triggered periodically (say, every hour) as the safety net. Even if messages fail, the active sync still guarantees hourly updates. A minimal sketch of this dual-channel setup follows.
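Below is a minimal sketch of the dual-channel idea. The `MessageListener` and `HttpSyncClient` interfaces are hypothetical stand-ins for whatever the third party actually provides; the names and signatures are illustrative, not from the original system.

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical collaborators; names and signatures are illustrative only.
interface MessageListener { void subscribe(java.util.function.Consumer<String> handler); }
interface HttpSyncClient { List<String> fetchAll() throws Exception; }

/** Dual-channel sync: messages for real-time updates, hourly HTTP pull as the backstop. */
public class DualChannelSync {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    public void start(MessageListener listener, HttpSyncClient client) {
        // Channel 1: incremental changes pushed by the third party.
        listener.subscribe(this::applyChange);

        // Channel 2: periodic full pull; even if messages are delayed or lost,
        // the data is never more than one hour stale.
        scheduler.scheduleAtFixedRate(() -> {
            try {
                client.fetchAll().forEach(this::applyChange);
            } catch (Exception e) {
                // Log and wait for the next round; the message channel keeps working.
            }
        }, 0, 1, TimeUnit.HOURS);
    }

    private void applyChange(String record) {
        // Write the record to MySQL (omitted).
    }
}
```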
Sometimes a third-party service looks healthy, but the data it returns is contaminated. Is there any way to defend against this? Some people say that beyond notifying the third party so they can restore the data quickly, all you can do is wait. For example, our mobile retrieval service calls a third-party interface to fetch data and build an inverted index. If the third-party data is wrong, our index is wrong too, and the retrieval service starts surfacing the wrong content. The third party needs at least half an hour to restore the data, and rebuilding our index takes another half hour, so the retrieval service could be effectively unavailable for more than an hour, which is unacceptable. What is the safety net? Our approach is to save a snapshot of the full index file at regular intervals. Once the third-party data source is polluted, we first hit the stop switch on index building, then quickly roll back to the most recent healthy snapshot. The data is not perfectly fresh (up to an hour old), but at least the retrieval results are sane and the impact on transactions is small. A small sketch of this snapshot-and-rollback idea follows.
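Here is a minimal sketch of the snapshot-and-rollback mechanism, assuming the index lives in a single directory. The paths, the `AtomicBoolean` stop switch, and the timestamped snapshot naming are all illustrative assumptions, not the original implementation.

```java
import java.io.IOException;
import java.nio.file.*;
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicBoolean;

/** Periodically snapshot the index; on data pollution, stop building and roll back. */
public class IndexSnapshotter {
    private final Path indexDir = Paths.get("/data/index");         // live index (assumed layout)
    private final Path snapshotRoot = Paths.get("/data/snapshots"); // snapshot store (assumed)
    private final AtomicBoolean stopSwitch = new AtomicBoolean(false);

    /** Called on a schedule, e.g. hourly, while the source data is healthy. */
    public void takeSnapshot() throws IOException {
        Path target = snapshotRoot.resolve("index-" + System.currentTimeMillis());
        copyRecursively(indexDir, target);
    }

    /** Called when the third-party data source is found to be polluted. */
    public void rollbackToLatest() throws IOException {
        stopSwitch.set(true); // index builders check this flag and pause
        Path latest;
        try (var snapshots = Files.list(snapshotRoot)) {
            latest = snapshots
                    .max(Comparator.comparing((Path p) -> p.getFileName().toString()))
                    .orElseThrow(() -> new IllegalStateException("no snapshot available"));
        }
        // A real implementation would swap directories atomically; copying back
        // is enough to show the idea.
        copyRecursively(latest, indexDir);
    }

    private void copyRecursively(Path from, Path to) throws IOException {
        try (var paths = Files.walk(from)) {
            for (Path src : (Iterable<Path>) paths::iterator) {
                Path dest = to.resolve(from.relativize(src).toString());
                if (Files.isDirectory(src)) {
                    Files.createDirectories(dest);
                } else {
                    Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING);
                }
            }
        }
    }
}
```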
1.2 Follow the fail-fast principle and always set timeouts
One of our services called a third-party interface whose normal response time was 50 ms. One day the third-party interface had problems and roughly 15% of requests took more than 2 s. Soon our service's load soared above 10 and its responses became very slow; the third-party service had dragged ours down. Why? No timeout was set. We used synchronous invocation with a thread pool whose maximum size was 50, with extra requests placed in the queue. When the third-party interface responds in about 50 ms, each thread finishes quickly and moves on to the next request. But once a certain percentage of calls takes 2 s, those 50 threads end up held by slow calls, the queue piles up with requests, and the service's overall capacity collapses.
The correct approach is to negotiate a short timeout with the third party, for example 200 ms, so that even when their service has problems, the impact on ours is limited.
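As an illustration, here is a sketch using the JDK's built-in `java.net.http.HttpClient` (Java 11+); the URL is whatever the third party exposes, and 200 ms is just the example value from above. Setting both a connect timeout and a per-request timeout makes a slow third party fail fast instead of holding a worker thread.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class FastFailClient {
    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofMillis(200)) // fail fast on connection problems
            .build();

    public static String fetch(String url) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMillis(200))    // fail fast on slow responses
                .build();
        // Throws java.net.http.HttpTimeoutException after 200 ms instead of
        // holding one of our 50 worker threads for 2 s.
        return CLIENT.send(request, HttpResponse.BodyHandlers.ofString()).body();
    }
}
```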
1.3 Protect the third party too: choose your retry mechanism carefully
Retry mechanisms need to be chosen carefully, in light of your own business and the exceptions involved. For example, some engineers retry any call to a third-party service that throws an exception, which is wrong regardless of what the error was. Some exceptions indicate a business-logic error, and retrying just produces the same exception again. Others are interface timeouts, and here you must judge by the business: in some cases, retrying only puts more pressure on the downstream service and makes things worse. The sketch below illustrates one way to separate the cases.
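This is a minimal sketch of the idea; the `BusinessException` type, the retry count, and the backoff values are illustrative assumptions. Only errors that are plausibly transient (such as timeouts) are retried, a bounded number of times and with a backoff, so the struggling service is not hammered.

```java
import java.net.http.HttpTimeoutException;
import java.util.concurrent.Callable;

public class CarefulRetry {
    /** Hypothetical marker for "the request itself is wrong"; never retry these. */
    static class BusinessException extends RuntimeException {}

    public static <T> T callWithRetry(Callable<T> call) throws Exception {
        int maxAttempts = 3;      // bound the retries
        long backoffMillis = 100; // give the downstream service breathing room
        for (int attempt = 1; ; attempt++) {
            try {
                return call.call();
            } catch (BusinessException e) {
                throw e;          // retrying a logic error yields the same error
            } catch (HttpTimeoutException e) {
                if (attempt >= maxAttempts) throw e;
                Thread.sleep(backoffMillis * attempt); // linear backoff
            }
        }
    }
}
```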
2. Guard against users
Here is another belief to hold on to: “all users are unreliable,” no matter what they promise. Based on this belief, we need to take the following actions.
2.1 Design good APIs (RPC, RESTful) to avoid misuse
We have seen many failures over the past two years caused, directly or indirectly, by badly designed interfaces. If your interface is misused by many people, you should rethink its design. Interface design looks simple, but it is a deep subject. I recommend Joshua Bloch's talk “How to Design a Good API & Why It Matters” and the “Java API Design Checklist.”
Here’s my experience:
- Follow the principle of least exposure. Provide only the interfaces users actually need, because the more interfaces you expose, the easier they are to misuse, and the higher the maintenance cost.
- Don't make users do what the interface can do for them. If a user needs to call our interface multiple times to complete one logical operation, the design is probably flawed. For example, if a data-fetching service only offers getData(int id), then a user who wants 20 records at a time has to loop and call us 20 times, which both hurts the user's performance and increases the pressure on our service. Providing a getDataList(List<Integer> idList) interface is clearly necessary here.
- Avoid long-running interfaces. Take getDataList(List<Integer> idList) again: if a user sends 10,000 ids at once, our service may grind for seconds without producing results, and the usual outcome is that every call ends in a timeout exception. What to do? Limit the length: allow at most, say, 100 ids per call, and throw an exception if the list is longer. If you add such a restriction, you must make it obvious to users. We once saw the opposite: a user placed an order with more than 100 items, the order service called the product-center interface to fetch all the item details, every call failed, and the exception carried no useful information; only after a long investigation did they learn that the product-center interface had a length limit. How do you add a limit without inviting misuse? Two approaches: 1) split the call inside the interface, e.g. when a user passes 10,000 ids, partition them into lists of 100 each and call the backend repeatedly, so the mechanism is hidden from the user and completely transparent; 2) make users split the calls themselves, in which case they must know about the limit. Ways to tell them: put it in the method name, such as getDataListWithLimitLength(List<Integer> idList); document it in comments; and if the list is longer than 100, throw an explicit exception that says so. A sketch of the split-call approach appears after this list.
- Keep parameters easy to use. Avoid long parameter lists; more than three parameters is generally hard to use correctly. What if you genuinely need that many? Write a parameter class. Also avoid consecutive parameters of the same type, which are easily swapped by mistake. And prefer more specific types: if an int works, don't use a String. This, too, prevents misuse.
- Let exceptions through. An interface should honestly reflect problems during execution rather than cleverly papering over them. It is all too common to see interface code wrapped in a try-catch that returns an empty collection no matter what exception was thrown internally. This leaves users helpless: they often cannot tell whether they passed a bad parameter or the service itself failed, and once they cannot tell, they will misuse the interface. The snippet below shows the anti-pattern.
```java
public List<Integer> test() {
    try {
        // ...
    } catch (Exception e) {
        // Anti-pattern: the caller cannot distinguish "no data" from "we failed".
        return Collections.emptyList();
    }
}
```
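As promised above, here is a minimal sketch of approach 1), the server-side split call. `MAX_BATCH`, the `Backend` interface, and the `Data` type are illustrative assumptions, not the original code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical types, for illustration only.
record Data(int id) {}
interface Backend { List<Data> getDataList(List<Integer> ids); }

public class DataService {
    private static final int MAX_BATCH = 100; // the per-call limit discussed above
    private final Backend backend;

    public DataService(Backend backend) { this.backend = backend; }

    /** Accepts any number of ids; splits them into batches of at most MAX_BATCH,
     *  so the limit stays transparent to the caller. */
    public List<Data> getDataList(List<Integer> idList) {
        List<Data> result = new ArrayList<>(idList.size());
        for (int from = 0; from < idList.size(); from += MAX_BATCH) {
            int to = Math.min(from + MAX_BATCH, idList.size());
            result.addAll(backend.getDataList(idList.subList(from, to)));
        }
        return result;
    }
}
```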
2.2 Traffic control: allocate traffic per caller to avoid abuse
I am sure many engineers working on high-concurrency services have seen something like this: one day, engineer A suddenly finds that the request volume on his interface has jumped 10x; soon the interface is nearly unavailable, and a chain reaction brings down the whole system. Why would traffic increase 10x? An outside attack? In my experience it is far more likely to be an inside job: I have seen colleagues hit an online service from a MapReduce job and kill it within minutes. How to handle this? Everyday life already gives us the answer: old-style power panels have fuses that blow when someone plugs in a very high-power device, protecting the rest of the circuit from being burned out. Similarly, you should install a fuse on your interface to prevent unexpected traffic from crushing the system: when traffic is too heavy, reject or divert it. For rate-limiting algorithms, see "Interface Traffic Limiting Practice"; a small sketch follows.
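As one concrete form of the fuse, here is a minimal sketch using Guava's RateLimiter (a token-bucket style limiter). The 1000-permits-per-second figure and the per-caller map are illustrative assumptions, not from the article.

```java
import com.google.common.util.concurrent.RateLimiter;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ApiFuse {
    // One limiter per caller, so a single abusive client cannot exhaust everyone's quota.
    private final Map<String, RateLimiter> limiters = new ConcurrentHashMap<>();

    public String handle(String callerId, String request) {
        RateLimiter limiter = limiters.computeIfAbsent(
                callerId, id -> RateLimiter.create(1000.0)); // 1000 requests/second per caller
        if (!limiter.tryAcquire()) {
            // The fuse blows: reject immediately instead of letting the system melt down.
            throw new IllegalStateException("rate limit exceeded for caller " + callerId);
        }
        return process(request);
    }

    private String process(String request) { return "ok"; } // placeholder
}
```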
3. Be yourself
Doing your own part well is a very big topic that spans requirement analysis, architecture design, coding, testing, code review, release, and online operations and maintenance. This time I will simply share some principles and experience from architecture design and coding.
3.1 The single responsibility principle
Engineers with more than two years of experience should take a serious look at design patterns, though I think the specific patterns matter less than the principles behind them. The single responsibility principle, for example, is extremely useful in guiding requirement analysis, architecture design, and coding.

In requirement analysis, single responsibility helps define the boundary of a service. If the boundary is not clearly drawn, all kinds of reasonable and unreasonable demands get accepted, and the sad end state is a service that is unmaintainable, unextensible, and failing continuously.

In architecture, single responsibility matters just as much. When read and write modules live together, the read path jitters badly; separating them greatly improves the stability of the read service (read/write separation). If one service hosts the order, search, and recommendation interfaces at the same time, a problem in recommendation can take down ordering; splitting the interfaces into independently deployed services keeps one failure from spreading (resource isolation). Likewise, our image service uses its own domain and sits on a CDN, independent of the other services (static/dynamic separation).

At the code level, a class should do one thing; if yours does more, consider splitting it. The result is clearer and easier to change later, with little impact on the rest of the code. At a finer granularity, a method should also do only one thing, that is, serve one function; if it does two, separate them, because modifying one function might otherwise break the other.
3.2 Control resource usage
When writing code, you must constantly remember that the machine's resources are limited. What resources? CPU, memory, network, and disk. If any one of them is not protected and controlled and becomes fully loaded, online problems follow quickly.
3.2.1 How to limit CPU usage
- Optimize computational algorithms. If your service is computation-heavy, such as a recommendation ranking service, you must optimize its core algorithms. For example, I once optimized our heavily used geospatial distance calculation with good results; see "Geospatial Distance Calculation Optimization".
- Watch the locks. Many services have no particularly expensive algorithms yet still show high CPU usage; in that case, look at how locks are used. My advice is to avoid explicit locking unless you truly need it.
- Mind your habits. For example, whenever you write a loop, check that it always exits; a small slip can turn it into an infinite loop under certain conditions, a famous case being the multi-threaded HashMap infinite loop problem. Likewise, when concatenating many Strings, use StringBuffer.append rather than repeated concatenation.
- Use thread pools wherever possible. A thread pool bounds the number of threads and avoids the context-switching overhead caused by too many threads; see the sketch after this list.
- Tune JVM parameters. JVM settings also affect CPU usage; see "Jitter Solution when Publishing or Restarting Online Services".
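A minimal sketch of a bounded thread pool with a bounded queue (the sizes and the rejection policy here are illustrative; pick them from your own load tests). Note that the bounded queue also addresses the memory point in the next section about always capping queue length.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class BoundedPool {
    public static ThreadPoolExecutor create() {
        return new ThreadPoolExecutor(
                20,                                  // core threads
                50,                                  // max threads: bounds context switching
                60, TimeUnit.SECONDS,                // idle threads above core are reclaimed
                new ArrayBlockingQueue<>(1000),      // bounded queue: backlog cannot eat all memory
                new ThreadPoolExecutor.AbortPolicy() // fail fast when saturated, instead of dying slowly
        );
    }
}
```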
3.2.2 How to limit memory usage
- Set JVM parameters. Limit memory usage through JVM settings; tuning them is largely a matter of experience, and a friend's good article, "Linux and JVM memory relationship analysis," is worth a read.
- Initialize Java collections with a size. When using Java collection classes, specify the initial capacity whenever you can; this matters a lot in memory-hungry services such as long-connection services.
- Use memory pools / object pools.
- Always set a maximum queue length when using thread pools. We have seen plenty of failures where an unbounded queue ate all the memory.
- Avoid local caches for large data sets. If the data volume is large, put it in a distributed cache such as Redis or Tair instead; otherwise GC pauses may freeze your own service.
- Compress cached data. When I worked on recommendation services, we needed to keep user preference data that would have taken about 12 GB stored raw; applying a compression algorithm brought it down to 6 GB. But you must balance the compression ratio against the CPU cost of compressing and decompressing: some algorithms compress very well but perform too poorly for online real-time calls. Sometimes simply serializing with protobuf also saves a lot of memory. A small sketch appears after this list.
- Understand the implementation details of third-party software so you can tune it precisely. Knowing only the surface is not enough; in my experience, digging into the details pays off greatly in practice. For example, only after reading the Lucene source did we discover that our index files could be compressed further, something the documentation never mentioned; see "Lucene index file size optimization summary".
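As an illustration of the compress-before-cache idea, here is a minimal sketch using the JDK's built-in GZIP. The original article does not name its algorithm; in practice you would benchmark candidates for the ratio-versus-CPU trade-off mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class CacheCompression {
    /** Compress a value before putting it into the in-memory cache. */
    public static byte[] compress(byte[] raw) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(bos)) {
            gzip.write(raw);
        }
        return bos.toByteArray();
    }

    /** Decompress on read; this CPU cost is the price paid for halving the footprint. */
    public static byte[] decompress(byte[] compressed) throws IOException {
        try (GZIPInputStream gzip =
                     new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return gzip.readAllBytes();
        }
    }
}
```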
3.2.3 How to limit network usage
- Reduce the number of calls. Wherever there is network overhead, batch the work. A common case in recommendation services is fetching data from many places, usually with multiple threads in parallel, which burns both CPU and network resources. A practical trick we often use is to aggregate the scattered data offline into one store, so the online service needs only a single request to get everything.
- Reduce the amount of data transferred. One way is compressed transmission; another is on-demand transmission. Take the familiar getData(int id): if it always returns every field of the record, much of the payload is unneeded and the transfer is too large. A getData(int id, List<String> fields) variant lets the server return only the fields the user asked for; see the sketch below.
3.2.4 How to limit disk usage
Control the volume of logs and clean them up regularly. 1) Print only key exception logs. 2) Monitor and alert on log size. I once had a third-party service hang, and my service kept printing an exception log for every failed call to it. I had a degradation plan in place, so calls automatically fell back to another service, but then an alert suddenly told me my own service had died; only after logging into the machine did I find that the disk had filled up and brought the service down. 3) Clean logs periodically, for example with a crontab job every few days. 4) Ship important logs to remote storage, such as printing them directly to a remote HDFS file system.
3.3 Avoid single points
Don't put all your eggs in one basket! At the macro level, deploy the service across multiple data centers, active in multiple locations. At the design level, make the service horizontally scalable: for stateless services, nginx and ZooKeeper make horizontal scaling easy; for job-style services, which by nature run on a single node, see "Quartz application and Clustering principle analysis" for how to avoid a single point. How do you avoid a single point for data services? In short, through sharding, layering, and similar techniques, as summarized in a separate blog post.
Summary
How to avoid failure? My experience condenses into one sentence: "be suspicious of third parties, guard against users, and be yourself." You are welcome to reflect on, summarize, and share your own experience as well.