Wander the rivers and lakes long enough and sooner or later you take a knife.
I ran into a nasty bug. Strictly speaking, it is probably the first real setback of my career, because with my knowledge and experience, basically any reproducible bug is solvable. This one, however, bothered me for three months and had the following characteristics:
- The backend log statistics showed the anomaly occurring sporadically and at low frequency
- The affected users’ devices showed no pattern; all kinds of phones turned up
- We could not reproduce it ourselves, on any device or in any environment
- Follow-up calls to the users whose data was flagged confirmed that the problem was real
- Customer service had never received a complaint about it from users on their own initiative
This bug is not a JS error but a business logic error. The symptom is that data submitted by the user is inexplicably missing. The scenario is the following interface:
When the user has filled in all the blanks, the submit button becomes available and the data is submitted as an array. The error log shows that some of the submitted arrays are empty and some are missing items.
The problem is that there is validation before submission, so it should be impossible for a user to submit data that fails it. And the bug only happens occasionally: if the logic were simply wrong, every submission would be reported as an error, and we would certainly have caught it in testing.
The tricky part is that we never reproduced it. All my colleagues, all the phones, all the fiddling we could think of, and not once. That makes debugging very hard: all you can do is guess what might be wrong and then verify the guess. But there is no way to test a fix either. If the bug can’t be reproduced, how do you judge that a fix worked?
There seemed to be only one way to verify anything: the production logs. Guess the cause, ship a release, read the logs.
It was a painful process. The interface looks simple, but it is a massive project. Because of the large number of question types, many components were extracted for reuse and flexible extension, and component nesting goes as deep as five levels. Its architectural complexity ranks in the top three of my career.
For example, rendering involves many steps: converting formula images to LaTeX, rendering formulas with MathJax, rendering blanks on top of formulas, numbering the blanks, simulating the cursor, auto-focusing a blank, computing font sizes dynamically, and so on. The keyboard at the bottom is our own H5 simulation, not the system keyboard. Not to mention the validation and scoring logic.
The first N attempts
The plan: look at what changed since the last version, revert anything suspicious, and see whether the logs return to normal. Awkwardly, the last version was a major refactoring with a huge number of changes. So began a remote debugging exercise in the style of the blind men and the elephant.
Ship, watch the logs, roll back, over and over. Ruling out the related functions one by one never pinned the problem down. I even tried the places I was sure were fine and found nothing. After 20 attempts I was starting to question life itself. My managers were furious when they saw the release records and said all this going up and down was making a complete mess. I was devastated too.
Clearly the blind-men-and-elephant approach was not going to work. I realized how serious the situation was and had a sinking feeling that this would not be easy to solve. I, Lv, would have to bring everything I had ever learned to bear, for the good of the people.
N plus 1 attempts
With so many user logs, why couldn’t we reproduce it ourselves? That was what kept nagging at me. So I went at it like mad again.
Hard work pays off: I actually reproduced it! Here’s how: fill in the blanks, then press the submit and delete buttons with two fingers at the same time. Validation has already passed, and the data is deleted just before it is submitted.
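The ordering behind this reproduction can be sketched as follows. This is a minimal illustration, not the project’s code: `submitEnabled`, `onDelete`, and `onSubmit` are hypothetical names, and the sketch assumes the submit handler trusts a validation result that was computed before the delete touch was handled.

```typescript
// Hypothetical state for a two-blank question; all names are illustrative.
const answers: string[] = ["3", "7"];

// Validation runs when the last blank is filled and enables the submit button.
const submitEnabled: boolean = answers.every((a) => a !== "");

// Delete handler: clears the last filled blank.
function onDelete(): void {
  for (let i = answers.length - 1; i >= 0; i--) {
    if (answers[i] !== "") {
      answers[i] = "";
      return;
    }
  }
}

// Submit handler: relies on the earlier validation result instead of re-checking.
function onSubmit(): string[] | null {
  return submitEnabled ? [...answers] : null;
}

// Two fingers land almost together: delete is processed first and submit right
// after, before anything has recomputed `submitEnabled`.
onDelete();
const submitted = onSubmit();
console.log(submitted); // the last item is now empty although validation had passed
```

The payload passes the stale check and still arrives with a missing item, which matches the shape of the logged errors.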
I was very excited to discover this, but would that many users really do it? Obviously not. Then it occurred to me that the submit button and the delete button sit next to each other. Maybe users accidentally touched delete while pressing submit. That seemed plausible: our users are primary school pupils, and their taps are not necessarily that precise.
I was eager to verify it. An “Empty” button was added (via configuration) between the delete button and the submit button, so that users could not touch delete by mistake.
Shipped it. The error logs were still coming in. I collapsed. Apparently it was not accidental touches.
N plus 2 attempts
As the bug dragged on longer and longer, I grew restless. But my attention stayed on the delete button; after all, that reproduction had been hard-won.
How else could submit and delete both get triggered? That’s when I thought of click-through (the keyboard responds to the `touchstart` event for speed). When you tap submit, the simulated keyboard collapses, and the delete button passes over the spot where the submit button was. By the principle of click-through, if the delayed click event lands on the delete button, delete gets clicked.
I was starting to admire my own imagination. I was at my wit’s end, so why not try. There are two ways to avoid click-through: blocking the event’s default action, or delaying the collapse of the element. I chose the latter.
Shipped it. The error logs were still coming in. I was coughing up blood. It turned out the delete button doesn’t even listen for the click event. A dead end.
N plus 3 attempts
Scanning the code, I found a prime suspect. The answer is an array, a reference type. Because of the complex component relationships, this reference-typed data can be accessed by multiple components.
The catch with mutable data is that it can be modified in places you don’t know about. The code is written with Vue, and some components contain watchers. Maybe a watcher somewhere gets triggered by accident and changes the data around the time of submission.
This seemed a reasonable guess, and during development I had been vaguely uneasy about not using immutable data. All right, quick verification: when the submit button is clicked, I clone the answer data and run grading and submission on the copy. That way, later tampering with the shared data can no longer affect what is submitted.
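The defensive copy described here might look like the following. It is a minimal sketch: `snapshotAnswers` and `sharedAnswers` are hypothetical names, and it assumes the answers are a flat array of strings, so a shallow copy is enough. Nested answer data would need a deep copy instead.

```typescript
// Take a snapshot the moment submit is clicked, so grading and submission
// no longer read the array that other components can still mutate.
function snapshotAnswers(answers: readonly string[]): string[] {
  return [...answers]; // a shallow copy is sufficient for an array of primitives
}

// Array shared between components (illustrative).
const sharedAnswers: string[] = ["12", "x+1"];

const toSubmit = snapshotAnswers(sharedAnswers);

// Simulate a watcher elsewhere emptying the shared array after the click.
sharedAnswers.length = 0;

console.log(toSubmit.length); // 2: the snapshot is unaffected
console.log(sharedAnswers.length); // 0
```

A copy only protects against changes made after the click. If the data is already incomplete at that moment, the snapshot faithfully preserves the incomplete data.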
Shipped it. The error logs were still coming in. More blood.
But this did narrow the suspicion down: the data was not being tampered with at the moment of submission, it was already wrong before submission. So how do users bypass validation to submit data? Is something wrong with my validation function, or does it change the data somewhere? I went through the code. Nothing.
N plus 4 attempts
This time I focused on what happens while the user is filling in the answer. Like a detective, I went through the code again and again with a magnifying glass, but days of tracking turned up nothing useful.
Until one day, when the sun was shining and there were white clouds in the sky, and it felt like something good was about to happen. QA posted a screenshot in the feedback group saying that the page was still parsing and the blanks could not be typed into. See the screenshot below:
My sensitive nerves immediately smelled a clue. The page uses MathJax to parse formulas, and MathJax loads font files on demand, scans page nodes, and generates a large number of DOM nodes. That’s a lot of pressure for a browser, let alone a mobile one.
I went through the formula-processing code at once. Because some blanks appear on top of formulas, the code waits until formula rendering has finished, then numbers all the blanks in one pass, then auto-focuses, and auto-focus starts by assigning a value to the answer. Oh my God, isn’t this exactly where the problem is! Formula rendering may be delayed, and the user may do something in the meantime!
First of all, this matches the sporadic nature of the bug, because network jitter and delays during formula parsing are sporadic too. And while the company’s network is fast, users’ networks may be slow, which matches the fact that we never reproduced it. This one felt right! Plenty of detective shows go exactly like this: the protagonist picks up a clue from someone’s offhand remark, the case is cracked, and the truth comes out! Yes, yes, that’s exactly the feeling!
I optimized the code right away: blanks that do not sit on a formula are processed immediately, and the user can type into a blank only after formula processing has finished.
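One way to express such a guard is sketched below. It is only an illustration under assumptions: the class `BlankSheet` and its methods are hypothetical, and `markRendered` stands for whatever callback the real code runs once MathJax has finished and the blanks have been numbered.

```typescript
class BlankSheet {
  private ready = false;
  private answers: string[] = [];

  // Called once formula rendering has finished and the blanks are numbered.
  markRendered(blankCount: number): void {
    this.answers = new Array<string>(blankCount).fill("");
    this.ready = true;
  }

  // Returns false and ignores the key press while rendering is still pending,
  // so nothing typed early can be overwritten by the later initialization.
  input(index: number, value: string): boolean {
    if (!this.ready || index < 0 || index >= this.answers.length) {
      return false;
    }
    this.answers[index] = value;
    return true;
  }

  snapshot(): string[] {
    return [...this.answers];
  }
}

const sheet = new BlankSheet();
const early = sheet.input(0, "5"); // rendering not finished: rejected
sheet.markRendered(2);
const late = sheet.input(0, "5"); // accepted after initialization
console.log(early, late, sheet.snapshot());
```

Rejecting early input is the simplest choice for a sketch. Queuing the key presses and replaying them after `markRendered` would be an alternative that loses nothing the user typed.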
Optimization done, regression tested, everything ready. Now just waiting for the release to deliver the final verdict!
The result… the logs were still there! Pfft! TV is full of lies!
Wait a minute! There were still logs, but there seemed to be fewer of them! Did the optimization work? In theory it could explain part of the problem, but what about the logs that remain? Is there more than one cause of the missing answers?
N plus 5 attempts
As the days passed, I still had no solid leads. I made a few more guesses and added a few log points, all fruitless. I grew more and more anxious at the frowns of my QA colleagues and the concerned inquiries of my managers. Since this is a shared component library, other projects were waiting to use it. If my bug stayed unsolved, it would hold up their schedules.
It was another sunny day with white clouds in the sky. I happened to bring the subject up with a backend colleague, who casually said it was probably the automatic submission on timeout.
What? What! Automatic submission?! It hit me like a bolt of lightning. Because I wrote this as a shared component, I also exposed some APIs, such as one for submitting answers. What I provide is the answer-interface component, but some other projects add a countdown. When it times out, they call my submission API and the user’s answer is submitted.
If the user has not filled anything in by the time the countdown expires, isn’t what gets submitted exactly an empty answer!! The validation function is never involved at all; the submission comes from someone else calling the API.
I cried. No wonder there was no feedback from users: they have no reason to complain when their time runs out. No wonder we never reproduced it ourselves; we were fixated on how to trigger it by tapping. Even if QA colleagues saw a timeout submission, they would not realize the answer was empty.
Yes, that was the truth. After I modified the relevant logic and shipped it, sure enough, the error logs disappeared. The bug that had been bothering me for three months was finally solved! I closed my eyes and silently set off a string of firecrackers.
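The failure mode, and one possible guard against it, can be sketched like this. The post does not say how the logic was actually changed, so everything here is an assumption: `submitAnswers`, `SubmitSource`, and the result shape are hypothetical. The idea is only that a public submit API checks completeness itself and records why it was called, instead of relying on the button’s validation having run.

```typescript
type SubmitSource = "button" | "timeout";

interface SubmitResult {
  accepted: boolean;
  answers: string[];
  incomplete: boolean; // true when at least one blank is empty or missing
  source: SubmitSource;
}

function submitAnswers(
  answers: readonly string[],
  blankCount: number,
  source: SubmitSource,
): SubmitResult {
  // Normalize to exactly `blankCount` items so the payload never has missing entries.
  const normalized = Array.from({ length: blankCount }, (_, i) => answers[i] ?? "");
  const incomplete = normalized.some((a) => a === "");

  // A manual submit with empty blanks is rejected; a timeout submit is accepted
  // but explicitly flagged, so the logs can tell it apart from data loss.
  const accepted = source === "timeout" || !incomplete;
  return { accepted, answers: normalized, incomplete, source };
}

const onTimeout = submitAnswers([], 2, "timeout");
const onButton = submitAnswers(["4"], 2, "button");
console.log(onTimeout.accepted, onTimeout.incomplete); // true true
console.log(onButton.accepted, onButton.incomplete); // false true
```

With the `source` field in the payload, an empty answer from an expired countdown is an expected event rather than an anomaly in the statistics.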
Conclusion
After three months, I finally found the problem. In fact, the N plus 4 attempt solved part of it and the final attempt solved the rest, which proves that there can be more than one truth. This bug hunt also taught me a lot.
- Be careful about logic changes when refactoring, and make sure to run through all the cases afterwards.
- Troubleshooting methods: along the way I used a variety of controlled experiments and a variety of source-level investigations.
- When using Vue for complex projects, pay special attention to the depth of component nesting, and write fewer watchers to avoid a chaotic execution order.
- When designing external APIs, consider robustness, not only against unreliable incoming parameters but also against an unreliable current context.