Everyone makes mistakes, but some mistakes are so ridiculous that you wonder how anyone, let alone yourself, could have made them. Of course, such things are only obvious in hindsight. Here I'm going to describe a series of recent mistakes we made with one of our apps. The most interesting part is that the initial symptoms pointed to a completely different kind of problem than the one that actually existed.

In the middle of a dreary night

Just after midnight, I was awakened by an alert from our monitoring system. Adventory, an app that indexes ads for our PPC advertising system, had apparently restarted several times in a row. In a cloud environment, instance restarts are normal and don't trigger alerts on their own, but this time the number of restarts exceeded the threshold within a short period. I opened my laptop and dove into the application logs.

It must be the network

I saw in the logs that the service had timed out several times while connecting to ZooKeeper. We use ZooKeeper (ZK) to coordinate indexing across multiple instances and rely on it for robustness. Clearly, a ZooKeeper failure would prevent an indexing run from completing, but it shouldn't crash the whole app. Besides, such a failure is very rare (this was the first time I had seen ZK die in production), and I figured it might not be easy to fix. So I woke up the person on call for ZooKeeper to see what was going on.

In the meantime, I checked our configuration and saw that ZooKeeper connection timeouts were in the range of seconds. Obviously ZooKeeper was completely hung, and since other services were using it too, that would mean the problem was very serious. I sent messages to a couple of other teams; apparently they hadn't noticed anything.

My colleague from the ZooKeeper team got back to me. In his view, the cluster was running normally. Since other users didn't seem to be affected, I slowly realized that ZooKeeper was not to blame. The log still clearly showed network timeouts, though, so I woke up the colleague responsible for the network.

The network team checked their monitoring and found nothing unusual. Since a single network segment, or even a single node, can get cut off from the rest, they also checked the machines our instances were running on and found nothing wrong there either. Meanwhile I tried several other ideas, none of which worked, and I was reaching the limits of my wits. It was getting late (or rather early) and, independently of my attempts, the restarts were becoming less frequent. Since this service only refreshes the data and does not affect its availability, we decided to postpone the issue until morning.

It must be the GC

Sometimes it's a good idea to sleep on a difficult problem and come back to it with a clear mind. Nobody knew what was going on, and the service was behaving very strangely. Suddenly, it hit me. What is the main source of weird behaviour in a Java service? Garbage collection, of course.

Precisely for situations like this, we keep GC logging switched on. I immediately downloaded the GC logs and opened them in Censum for analysis. Before I even looked closely, I noticed a horrifying pattern: a full GC every 15 minutes, each one causing a 20-second stop-the-world pause. No wonder the connections to ZooKeeper timed out, even though both ZooKeeper and the network were fine.
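
For reference, on the Java 8 JVMs we were running at the time, GC logging can be enabled with flags along these lines (the log path is just an example, not our actual configuration):

```
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintGCApplicationStoppedTime
-Xloggc:/path/to/gc.log
```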

These pauses also explained why the whole service kept dying instead of just logging a timeout error. Our instances run on Marathon, which periodically polls a health-check endpoint on each instance, and if the endpoint doesn't respond within a reasonable time, Marathon restarts that instance.

Know the cause and you've solved half the problem, or so I thought, so I was sure the issue would be fixed soon. To explain the rest of the reasoning, I need to describe how Adventory works. It's not your standard run-of-the-mill microservice.

Adventory's job is to index our ads into ElasticSearch (ES). This takes two steps. The first is to acquire the necessary data: the service receives events sent via Hermes from several other systems and saves them to a MongoDB cluster. At a few hundred requests per second at most, each operation is so lightweight that any garbage collection it triggers costs next to nothing. The second step is the indexing itself. It runs periodically (roughly every two minutes), streams all the data stored in MongoDB through RxJava, combines it into denormalized records, and sends those records to ElasticSearch. This part of the system resembles an offline batch job more than a service.
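
To make the shape of that second step concrete, here is a heavily simplified sketch of such an indexing pass written with RxJava 1 (which is what our code used at the time). All the names (AdRepository, EsClient, DenormalizedAd, BULK_SIZE) are invented for illustration, not taken from the real code:

```java
import rx.Observable;

import java.util.List;

// A heavily simplified sketch of the periodic indexing step; names are invented.
class IndexingSketch {

    private static final int BULK_SIZE = 500;          // hypothetical bulk request size

    interface Ad {}
    interface DenormalizedAd {}
    interface AdRepository { Observable<Ad> streamAll(); }             // reads ads from MongoDB
    interface EsClient { void bulkIndex(List<DenormalizedAd> batch); } // writes to ElasticSearch

    void rebuildIndex(AdRepository repository, EsClient es) {
        repository.streamAll()                  // stream every ad stored in MongoDB
                .map(this::denormalize)         // combine data into a denormalized record
                .buffer(BULK_SIZE)              // group records into bulk requests
                .toBlocking()
                .forEach(es::bulkIndex);        // send each batch to ElasticSearch
    }

    private DenormalizedAd denormalize(Ad ad) {
        return new DenormalizedAd() {};         // placeholder for the real transformation
    }
}
```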

Because the data frequently changes in large batches, incrementally maintaining the index wasn't worth the effort, so the whole index is rebuilt on every scheduled run. This means a large chunk of data passes through the system each time, generating a lot of garbage. Despite the streaming approach, we had been forced to grow the heap to 12 GB. Since the heap was that large, and since G1 was by then fully supported, we went with the G1 garbage collector.

I had previously worked with services that churn through lots of short-lived objects. With that experience in mind, I increased both -XX:G1NewSizePercent and -XX:G1MaxNewSizePercent above their defaults, so that the young generation would be larger and young GCs could process more data before objects got promoted to the old generation. Censum was also showing a lot of premature tenuring, which was consistent with the full GCs appearing some time later. Unfortunately, these settings didn't help.

Then I thought: perhaps the producer generates data too fast for the consumer to keep up, so records pile up in memory before they can be processed. I tried reducing the number of threads producing data, to slow down generation while keeping the size of the batches the consumer sends to ES unchanged. That was a crude backpressure mechanism of sorts, but it didn't help either.

It must be a memory leak

At this point, a colleague with a cooler head suggested we do what we should have done in the first place: look at what was actually in the heap. We set up a development instance with the same amount of data and roughly the same heap size as a production instance. Connecting JVisualVM to it and running the memory sampler showed us the approximate counts and sizes of objects on the heap. At first glance we could see that the number of Ad objects from our domain was abnormally high and kept growing throughout the indexing run, all the way up to the number of ads we were processing. But… that shouldn't happen. After all, the whole reason we stream this data through RxJava is to avoid loading it all into memory.

With growing suspicion, I examined that part of the code. It had been written two years earlier and never seriously scrutinized since. Sure enough, we were in fact loading all the data into memory. It certainly wasn't intentional. Not understanding RxJava well enough at the time, we had wanted part of the work to run in parallel in a particular way. To offload that work from the main RxJava flow, we decided to run it in CompletableFutures backed by a separate executor. But then we had to wait for all of those CompletableFutures to finish… by storing their references and calling join() on them. This kept every future, and all the data it referred to, alive until the end of the indexing run, and prevented the garbage collector from cleaning them up earlier.
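
A rough sketch of what that anti-pattern looked like (with invented names; this is not the actual code). Every stored future keeps its result reachable until the final join() returns:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.stream.Collectors;

// Sketch of the mistake: offloading work to CompletableFutures and keeping
// every future (and hence every result) referenced until the run is over.
class LeakyIndexingSketch {

    interface Ad {}
    interface IndexedAd {}
    interface Indexer { IndexedAd index(Ad ad); }

    void indexAll(List<Ad> ads, Executor executor, Indexer indexer) {
        List<CompletableFuture<IndexedAd>> futures = ads.stream()
                .map(ad -> CompletableFuture.supplyAsync(() -> indexer.index(ad), executor))
                .collect(Collectors.toList());

        // Waiting like this pins every future, and all the data it refers to,
        // in memory until the whole indexing run finishes, which defeats
        // the point of streaming the data in the first place.
        futures.forEach(CompletableFuture::join);
    }
}
```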

Is it really that bad?

Of course it was a silly mistake, and we were annoyed that it was discovered so late. I even remembered a brief discussion, much earlier, about whether the application really needed a 12 GB heap, since 12 GB did seem a bit much. On the other hand, the code had been running for almost two years without causing any problems. We were able to fix it relatively easily now, but two years ago it would probably have taken much more work, and at the time we had far more important things to do than saving a few gigabytes of memory.

So while from a purely technical point of view it's a shame the problem took so long to solve, from a strategic point of view leaving the memory waste alone may have been the pragmatic choice. Another consideration, of course, is the impact of the problem once it does strike. We got away with almost no impact on users, but it could have been worse. Software engineering is all about weighing trade-offs and prioritizing among many tasks.

Still not working

With a bit more RxJava experience under our belts, removing the CompletableFutures was quite easy: rewrite the code to use RxJava only, upgrade to RxJava 2 along the way, and truly stream the data instead of collecting it in memory. After a code review, the changes were deployed to the development environment for testing. To our surprise, the amount of memory the application needed did not decrease at all. Memory sampling did show fewer Ad objects in memory than before, and their number no longer grew all the time and sometimes even dropped, so they were not all being held in memory at once. Still, the same old problem remained: it seemed the data was not really being streamed either.
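
Conceptually, the rewrite moved the parallel work back into the reactive pipeline, along these lines (again a simplified sketch with invented names, not the real code):

```java
import io.reactivex.Flowable;
import io.reactivex.schedulers.Schedulers;

import java.util.List;

// Simplified sketch of the RxJava-2-only version: the per-ad work runs on a
// scheduler inside the pipeline instead of hand-managed CompletableFutures.
class Rx2IndexingSketch {

    private static final int BULK_SIZE = 500;          // hypothetical bulk request size

    interface Ad {}
    interface DenormalizedAd {}
    interface AdRepository { Flowable<Ad> streamAll(); }
    interface EsClient { void bulkIndex(List<DenormalizedAd> batch); }

    void rebuildIndex(AdRepository repository, EsClient es) {
        repository.streamAll()
                .flatMap(ad -> Flowable.just(ad)
                        .subscribeOn(Schedulers.computation())   // parallelize inside the flow
                        .map(this::denormalize))
                .buffer(BULK_SIZE)
                .blockingSubscribe(es::bulkIndex);               // batches go to ElasticSearch
    }

    private DenormalizedAd denormalize(Ad ad) {
        return new DenormalizedAd() {};
    }
}
```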

So what’s going on now?

The relevant keyword has already come up: backpressure. When data is streamed, it's normal for producers and consumers to work at different speeds. If the producer is faster than the consumer and nothing slows it down, it keeps pushing out more and more data that the consumer can't process at the same rate. The visible symptom is a growing buffer of unprocessed data, and that is exactly what was happening in our application. Backpressure is the mechanism that lets a slower consumer tell a faster producer to slow down.
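
As a toy illustration (not our production code), this is roughly what backpressure looks like in RxJava 2: the subscriber explicitly requests items in small amounts, so the producer can never run arbitrarily far ahead of it:

```java
import io.reactivex.Flowable;
import io.reactivex.subscribers.DisposableSubscriber;

// Toy backpressure demo: the subscriber requests a few items at a time,
// so the producer emits only as fast as the consumer can keep up.
public class BackpressureDemo {

    public static void main(String[] args) {
        Flowable.range(1, 1_000_000)                     // a "fast" producer
                .subscribe(new DisposableSubscriber<Integer>() {
                    @Override
                    protected void onStart() {
                        request(10);                     // ask for the first 10 items only
                    }

                    @Override
                    public void onNext(Integer item) {
                        process(item);                   // pretend this is slow consumer work
                        request(1);                      // ask for one more item when ready
                    }

                    @Override
                    public void onError(Throwable t) {
                        t.printStackTrace();
                    }

                    @Override
                    public void onComplete() {
                        System.out.println("done");
                    }
                });
    }

    private static void process(Integer item) {
        // a real consumer would do slow work here (e.g. send data downstream)
    }
}
```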

Our indexing flow had no notion of backpressure, which hadn't mattered before, since we were keeping the whole index in memory anyway. Once we had fixed the earlier problem and started truly streaming the data, the lack of backpressure became apparent.

This is a pattern I've seen many times when dealing with performance problems: solving one problem surfaces another one you hadn't even heard of, because the first problem was hiding it. If your house floods often, you won't notice that it's also a fire hazard.

Fix the problem caused by the fix

In RxJava 2, the original Observable class was split into an Observable that does not support backpressure and a Flowable that does. Fortunately, there are clean ways to turn non-backpressured sources into Flowables that support backpressure out of the box, including creating Flowables from non-reactive sources such as Iterables. Such Flowables can be combined into new Flowables that also support backpressure, so once one spot is fixed, the whole pipeline has backpressure support.
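
Minimal examples of those conversions (using toy sources rather than our MongoDB data):

```java
import io.reactivex.BackpressureStrategy;
import io.reactivex.Flowable;
import io.reactivex.Observable;

import java.util.Arrays;

// Toy examples of getting backpressure-aware Flowables in RxJava 2.
class FlowableConversions {

    void examples() {
        // A Flowable built from an Iterable supports backpressure out of the box:
        Flowable<Integer> fromIterable = Flowable.fromIterable(Arrays.asList(1, 2, 3));

        // An existing Observable (no backpressure) can be converted by picking a
        // strategy for what should happen when the consumer falls behind:
        Observable<Integer> observable = Observable.range(1, 1_000_000);
        Flowable<Integer> converted = observable.toFlowable(BackpressureStrategy.BUFFER);

        // Combining Flowables yields Flowables that still support backpressure:
        Flowable.zip(fromIterable, converted, (a, b) -> a + b)
                .subscribe(System.out::println);
    }
}
```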

With this change, we were able to reduce the heap from 12 GB to 3 GB while keeping the same indexing speed. We still had a full GC pause of up to two seconds every few hours, but that was already much better than the 20-second pauses (and crashes) we had seen before.

Optimize GC again

But the story does not end there. Examining the GC logs again, we noticed a high rate of premature tenuring, at times up to 70%. Even though performance was acceptable, we tried to address it, hoping it might also make the full GCs go away.

Premature tenuring (or premature promotion) happens when an object is short-lived but gets promoted to the old generation anyway. The old generation is usually large and is collected with different algorithms than the young generation, so prematurely promoted objects take up space there and can hurt GC performance. That's why we try to avoid premature tenuring.

Our application produces a huge number of short-lived objects during indexing, so some premature tenuring is normal, but it shouldn't be this severe. When an app creates lots of short-lived objects, the first thing that comes to mind is simply to enlarge the young generation. By default, G1 sizes the young generation automatically, letting it use between 5% and 60% of the heap. I noticed that the proportions of the young and old generations in the live app kept swinging over a very wide range, but I went ahead and set two parameters, -XX:G1NewSizePercent=40 and -XX:G1MaxNewSizePercent=90, to see what would happen. It didn't help, and it even made things worse, triggering full GCs right after the application started. I tried other proportions too, but the best option turned out to be raising only G1MaxNewSizePercent without touching the minimum. That worked about as well as the defaults, but no better.
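
For completeness: on the HotSpot JVMs we were running, these two G1 sizing flags are experimental, so using them requires unlocking experimental options, roughly like this:

```
-XX:+UseG1GC
-XX:+UnlockExperimentalVMOptions
-XX:G1NewSizePercent=40
-XX:G1MaxNewSizePercent=90
```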

After trying many things with little success, I gave up and emailed Kirk Pepperdine, a well-known Java performance expert whom I had met at Devoxx and during a training session at Allegro. After looking at the GC logs and exchanging a few emails, Kirk recommended setting -XX:G1MixedGCLiveThresholdPercent=100. This setting should make G1's mixed collections clean up old-generation regions regardless of how much live data they contain, and thus also remove the prematurely promoted objects, which in turn should keep the old generation from filling up and causing a full GC. However, after running for a while, we were once again surprised by a full GC. Kirk reasoned that he had seen this in other applications too, and that it was a bug in G1: the mixed collections apparently failed to clean up all the garbage, letting it pile up until a full GC was triggered. He said he had reported the issue to Oracle, but they insisted that the behaviour we were observing was not a bug but normal.

Conclusion

What we ended up doing was simply increasing the application's heap a little (from 3 GB to 4 GB), and the full GCs disappeared. We still see a lot of premature tenuring, but since performance is fine, we don't mind. One option we could try is switching to the CMS (Concurrent Mark Sweep) collector, but since it has been deprecated, we'd rather avoid it.

So what is the moral of the story? First, performance issues can easily lead you astray. At first it looked like a ZooKeeper or network problem, but it turned out to be our own code. Even once I realized that, my first steps were not well thought out: I started tuning the GC to avoid full GCs before checking what was really going on. This is a common trap, so remember: even if you have a gut feeling about what to do, first check what is actually happening, and then check again, so you don't waste time chasing the wrong problem.

Second, performance is hard to get right. Our code had good test coverage and worked well feature-wise, yet it failed to meet performance requirements, which hadn't been clearly defined at the start anyway. Performance issues didn't surface until long after deployment. Because it is often hard to realistically reproduce your production environment, you are frequently forced to test performance in production, however bad that sounds.

Third, solving one problem can cause another to surface, forcing you to dig deeper than you expected. The lack of backpressure was enough to break the system, but it didn't surface until we had fixed the memory leak.