How would you feel if I told you that this very problem was the question I was asked when I came here to interview? I don't think most people would walk into that pit. But I did, because I find this kind of technical problem interesting. That's all there is to it; otherwise the job would be boring.
Preface
- Strictly speaking, you don't need to fix this bug, because many domestic companies release weekly, so you would never even notice it's there.
- In fact, you don't have to fix this bug: just write a scheduled auto-restart script and run it in the dead of night.
- In fact, you don't have to fix this bug, because Baidu is starting to support SEO for SPA systems, so why are you still grinding away at crappy SSR?
If you're as bored as I am, read on. I can't promise anything, but in my local tests, at least, most of the problems have been knocked out.
It all started with a job change
During the interview, the interviewer described a strange phenomenon in their system: a Vue-based SSR service whose CPU usage crept up slowly, forcing a restart every once in a while. He asked me what the solution was.
Huh? The CPU creeps up? Is there a memory leak? Does every request get a response? Are there blocking IO operations? If it's Express, does every handler return? It rises slowly, but on what scale? What's the QPS? Is the server load reasonable?
Then I got the offer, and the second task handed to me was to solve this very problem. Do you think I was being set up? Hahaha, but I just like the challenge; it's fun, and otherwise work is boring. Still, this kind of technical investigation is hard to deliver results on in a short time, which was also quite dangerous for me. And, um… there was also the chance of it all coming undone. Who knows. It was a fun thing to do anyway. Who cares?
Is the problem really as described?
When solving a technical problem, we usually start from the behavior described by the person who hit it, but the actual behavior of the problem does not always match the description.
When we encounter a performance problem, we need to fully understand its real nature. Is it just a slow CPU climb? Modern SPA frameworks consume serious CPU when rendered on the server, so is the server cluster simply under-capacity? Is there an accompanying memory leak? Are there pending requests that never return? These questions went round and round in my mind.
Then I saw the system, read the source code, got onto the server, and went through all the monitoring data. Well, well. Interesting. It only made me more fired up.
The problems:
- CPU usage rises periodically and occasionally dips, but the overall trend is upward, climbing past 80% over a cycle of about two weeks.
- Memory leaks a little every day, very little, and some of it does get released, but the general trend is a loss of around 500 MB per day.
- Daily traffic fluctuates widely when there are promotions, but the overall level is fairly stable. However, the log system only keeps the last 7 days of logs, which makes it hard to trace the cause: the data from the days in question is gone.
- At the code level, the backend system has no major logic problems, so the performance gains available from code optimization are limited.
The first wrong turn
Based on the access logs and the descriptions from those who reported the problem, the peaks in CPU usage fell on peak-traffic days, and CPU usage dropped noticeably as traffic decreased. Judging by the user traffic during those peak hours, the servers appeared to be under-provisioned and simply couldn't withstand the peaks. So I took this conclusion to my Leader, and the Leader accepted it: after all, it made sense at the data level. The O&M colleagues I consulted thought so too, and there really had been a big traffic spike that lasted a couple of hours.
However, over the following days I observed that even when traffic was not that heavy, CPU usage still trended slowly upward, just a little more slowly than during peak traffic. So the first investigation was declared wrong.
The second wrong turn
From experience, a CPU that rises slowly and never meaningfully falls is usually caused by code paths that get suspended and never released. For Node.js, the usual suspects are: 1) setTimeout, 2) blocking IO, 3) Express handlers that never call res.end() to finish the request.
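To make case 3 concrete, here is a small, hypothetical Express handler (the route and the failing backend call are invented for illustration) that leaks requests exactly this way: the error path never sends a response, so the connection stays open indefinitely.

```js
const express = require('express');
const app = express();

// Stand-in for a flaky backend call, purely for illustration.
function fetchData() {
  return Math.random() < 0.5
    ? Promise.resolve('ok')
    : Promise.reject(new Error('backend error'));
}

app.get('/leak', (req, res) => {
  fetchData()
    .then(data => res.send(data)) // happy path responds
    .catch(err => {
      console.error(err); // bug: no res.status(500).end() here,
                          // so the request hangs forever
    });
});

app.listen(3000);
```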
I started the code review and found that the whole project was built on the official Vue-HackerNews 2.0 example. Nothing looked wrong in the code itself. So maybe blocking IO?
So I asked the O&M colleagues how to inspect active network connections, and set up a local environment for stress testing. After the test I stopped for half an hour and then checked the connection states (the operating system does not release a connection immediately after you're done with it, as an IO optimization, so you have to wait a while).
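The actual pressure test used dedicated tooling, but the idea can be sketched in a few lines of Node.js (host, port, and request count here are illustrative):

```js
const http = require('http');

// Fire a burst of concurrent requests at the SSR app, then let the process
// idle so lingering sockets can be inspected from the OS side afterwards.
const N = 500;
for (let i = 0; i < N; i++) {
  const req = http.get({ host: 'localhost', port: 8080, path: '/' }, res => {
    res.resume(); // drain and discard the body
  });
  req.on('error', () => { /* a real test would count these */ });
}
```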
The result of the stress test was quite shocking. Because the backend interfaces in the test environment performed poorly, a large number of requests hung. The number of blocked socket connections was huge, memory soared, and the CPU never came back down. Hahaha, problem found! (I rejoiced too soon.)
I asked whether O&M had a way to make the server drop connections that had been unresponsive for a long time. O&M could only shrug…
All right, do it yourself then. Why were so many connections hanging? When the backend is overloaded and unable to respond, client sockets hang and stay in the connected state.
I went to the project lead, and he told me they had set request timeouts, so this couldn't happen…
But in the logs I saw plenty of requests that only returned after 200s… which means the timeout set in our code was not working. So I needed to find enough evidence to convince him.
Sometimes, when the other party doesn't trust your point of view in a discussion, it is really because your evidence is insufficient; at that point you need to gather evidence convincing enough to prove it.
Digging into the Node.js documentation and the project code, I found the problem in the axios implementation:
```js
if (config.timeout) {
  timer = setTimeout(function handleRequestTimeout() {
    req.abort();
    reject(createError('timeout of ' + config.timeout + 'ms exceeded',
      config, 'ECONNABORTED', req));
  }, config.timeout);
}
```
There seems to be nothing wrong with this code; it is the textbook way to handle timeouts on the front end.
But in Node.js, blocked IO on the connection delays timer processing, so the setTimeout does not fire on time; it can be held up for more than 10 seconds.
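A minimal, runnable demonstration of that failure mode (the busy-wait stands in for a flood of blocking IO callbacks):

```js
const start = Date.now();

setTimeout(() => {
  // Scheduled for 100 ms, but prints roughly 3000 ms.
  console.log(`timer fired after ${Date.now() - start} ms`);
}, 100);

// Tie up the event loop for 3 seconds.
while (Date.now() - start < 3000) { /* busy-wait */ }
```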
Now the problem looked located: heavy traffic plus blocked connections caused requests to pile up until the server couldn't cope and the CPU couldn't come back down.
From the official Node.js documentation:
> If req.abort() is called before the connection succeeds, the following events will be emitted in the following order:
>
> - 'socket'
> - (req.abort() called here)
> - 'abort'
> - 'close'
> - 'error' with an error with message 'Error: socket hang up' and code 'ECONNRESET'
So I submitted a PR to axios. The fix: instead of setTimeout, which can be starved in Node.js, use the socket-level timeout to cover the connection. The same problem exists in node-request. After a large amount of local testing, CPU and memory stayed within the normal range under high load. I thought everything was fine.
```js
if (config.timeout) {
  // Sometimes the response is very slow or never arrives, and the 'connect'
  // event gets held up by the event loop. The timer callback still fires, so
  // abort() is invoked before the connection exists, producing "socket hang
  // up" with code ECONNRESET.
  // Under a large number of requests, Node.js then keeps hung-up sockets
  // around in the background, and their number climbs and climbs.
  // These hung-up sockets devour CPU little by little.
  // ClientRequest#setTimeout fires after the specified milliseconds of socket
  // inactivity, which guarantees that abort() runs after the connection.
  req.setTimeout(config.timeout, function handleRequestTimeout() {
    req.abort();
    reject(createError('timeout of ' + config.timeout + 'ms exceeded',
      config, 'ECONNABORTED', req));
  });
}
```
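One detail worth spelling out: as far as I can tell, ClientRequest#setTimeout is a socket-inactivity timeout. It is only armed once a socket has actually been assigned to the request, and it fires after the given milliseconds of silence on that socket, which is exactly why abort() can no longer run before the connection exists.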
However… I was wrong again.
One day I forgot to shut down my computer and the local stress-test environment kept running. To my surprise, by the next day all the suspended socket resources had been released, yet memory and CPU still hadn't been reclaimed. I checked with my O&M colleagues, and it's true: the operating system automatically cleans up connections that have been inactive for a long time. So although modifying the axios source alleviated the problem, it seemed the root cause still hadn't been found.
Accidentally discovering the hacky trick in vue-router
I was genuinely out of clues. After several days of effort I still hadn't caught the root cause, even though I had mitigated the problem by accident, and there was real doubt about whether that approach would eradicate it.
I used the inspector to analyze the system's memory over and over. Online traffic is huge, but the per-request memory and CPU leakage is tiny, and traffic on that scale is hard to reproduce locally, so local reproduction was very difficult. JS's garbage collection makes the investigation harder still. All I could do was take a heap snapshot after each request, again and again, hunting for the faintest clue.
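For reference, a hedged sketch of that snapshot workflow (v8.writeHeapSnapshot requires Node.js 11.13+; older versions can use the heapdump package instead): signal the process after each request, then diff the snapshots in Chrome DevTools.

```js
const v8 = require('v8');

// SIGUSR1 is reserved by the Node.js debugger, so use SIGUSR2:
// `kill -USR2 <pid>` writes a snapshot without stopping the process.
process.on('SIGUSR2', () => {
  const file = v8.writeHeapSnapshot(); // returns the generated file name
  console.log('heap snapshot written to', file);
});
```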
For the CPU it's even harder to track: online, CPU usage grows by only about 0.02 per hour. That means an average request has almost no measurable impact on the CPU leak, and once you test with a large batch of requests at once, the memory tracking is no longer accurate either.
Maybe this is the advantage of an older programmer: being able to calm down and hunt for the problem patiently. Sometimes solving a technical problem doesn't take great skill; the approach, and the patience, are what matter.
Then it happened: a timer kept showing up in the heap snapshot after a request finished, and by the time the next snapshot was captured, that timer had been released and another one had appeared. What the FXXK? What the hell.
This timer carried no obvious information about where it had been created. Another breakdown.
Could this be the root cause of the memory leak? A timer is tiny and asynchronous and does not block the system, so unlike an infinite loop it would not drive the CPU high for long stretches. This timer really did look like the root of the problem.
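One hedged trick for unmasking an anonymous timer, alongside the debugger approach described next, is to wrap setTimeout in a debug build so every new timer logs the stack that created it (illustrative only, not something for production):

```js
const originalSetTimeout = global.setTimeout;

global.setTimeout = function tracedSetTimeout(fn, ms, ...args) {
  // Capture where this timer is being created from.
  console.log(`setTimeout(${ms} ms) created at:\n${new Error().stack}`);
  return originalSetTimeout(fn, ms, ...args);
};
```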
The good news is that all of Node.js's timer APIs are implemented in JS, so I could set breakpoints directly inside setTimeout and trace the code that created each timer… And it worked like a charm: the hacky operation inside vue-router surfaced.
```js
function poll (
  cb, // somehow flow cannot infer this is a function
  instances,
  key,
  isValid
) {
  if (instances[key]) {
    cb(instances[key]);
  } else if (isValid()) {
    setTimeout(function () {
      console.log('vue-router poll');
      poll(cb, instances, key, isValid);
    }, 16);
  }
}
```
Yes, that's right: a timer that can loop forever. What is instances? From the calling code it should be the array of corresponding async component instances, and key is the component's key in that array. There are only two exit conditions: 1) the async component finishes loading, or 2) isValid() returns false, i.e. the route is no longer current.
In the SSR scenario, however, route changes do not happen within a single request, so the only reachable exit condition is that the async component finishes loading. For some reason it never loaded, and the timer was stuck looping forever. Note that this only happens when the component implements the beforeRouteEnter guard.
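For illustration, a hypothetical component shape that walks into this trap (the name and method are invented): beforeRouteEnter hands a callback to next(), and vue-router can only invoke that callback once the component instance exists, so it polls every 16 ms until then.

```js
export default {
  name: 'ArticlePage', // illustrative
  beforeRouteEnter (to, from, next) {
    // vue-router must wait for the instance `vm` before it can run this
    // callback; on the server that instance may never appear, so the
    // 16 ms poll above never exits.
    next(vm => {
      vm.loadArticle(); // assumed data-fetching method
    });
  }
};
```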
Because the vue-router implementation here is just too hacky, I turned to GitHub for help, and discovered the relevant issue.
It matched my situation perfectly. However, the maintainer's reply left a bit of a chill. The problem could be reproduced perfectly from the reporter's simple setup, yet the team just closed it with "A boiled down repro instead of a whole app would help to identify the problem, thanks"…
And even more exasperating:
> A boiled down repro instead of a whole app would help to identify the problem, thanks. If you have an infinite loop, it's probably next not being called without arguments.

Do they think we're all fools?
All right. Looks like you're on your own when you fall into a pit. After talking with the issue reporter, I started trying to fix it myself. After a few days of hard work he gave up, and I… chose to give up too (don't judge me; honestly, after several days in the vue-router source I really couldn't find a good solution, mainly because a fix would touch too many things).
The solution
So far I have identified three causes of memory and CPU leaks in vue-SSR:
- Suspended sockets cause temporary congestion
- In some scenarios, the timer in vue-router falls into an infinite loop
- Heavy template compilation leaves large numbers of strings in memory
So what’s the solution?
- Remove the beforeRouteEnter handling from components. Moving that logic elsewhere, based on analysis at the vue-router code level, avoids ever entering the timer's infinite loop.
- On the Node.js side, don't use setTimeout to handle server-side request timeouts; use http.request's socket timeout (the 'timeout' event via req.setTimeout) instead, so blocked IO cannot starve the timer. A sketch follows this list.
- If the SEO requirements are not strict, use skeleton-page rendering: send a skeleton page to the client and let the front end fire ajax requests directly to pull the data. This avoids executing backend requests on the Node.js side, where an unresponsive backend can leave connections hanging. Node.js's event loop differs from the browser's, even though both are V8-based. (This is also how most domestic Internet companies actually apply vue-SSR.)
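Here is the sketch promised in the second point, using Node's plain http module (host and path are illustrative); the timeout is tied to socket inactivity rather than a bare wall-clock setTimeout:

```js
const http = require('http');

const req = http.request({ host: 'backend.internal', path: '/api/data' }, res => {
  res.resume(); // consume the response
});

// Socket-inactivity timeout instead of a bare setTimeout.
req.setTimeout(5000, () => {
  req.abort(); // on modern Node, req.destroy() is preferred
});

req.on('error', err => {
  console.error('request failed:', err.code); // e.g. ECONNRESET after abort
});

req.end();
```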
There may be mistakes
I have only studied vue-SSR for two weeks. If anything above is wrong, please remind me promptly so I can correct it.