Case study: Why does Young GC occur so frequently in a BI system handling 100,000 requests per second?


This article comes from a column of the public account Tanuki Technology Nest:

JVM From Scratch

Author: Captain of the Fire Brigade, a senior technologist at Alibaba

= = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = =

1. Review of the previous article

One of this week’s key points is to re-emphasize how frequent JVM GC can hurt system performance.

So, having analyzed when and why GC occurs in the JVM and sorted out the GC terminology and triggers, we will use two real examples from our production systems to show the performance problems that frequent GC brings.

2. What is the BI system serving millions of merchants?

Let me start with a real production system of ours: a BI system serving millions of merchants.

Many developers who work on business systems may never have come across a BI system, so here is a brief introduction to its background.

Put simply, if you run a platform, hundreds of thousands or even millions of merchants do business on it and use its systems.

This generates a large amount of data, and based on that data we need to provide reports to the merchants.

For example: how many visitors does each merchant get per day? How many transactions? What is the conversion rate?

Real reports are of course much more complicated, but that is the basic idea.

That is what a BI system is for. BI stands for “Business Intelligence”, which may sound rather lofty.

In fact it is nothing grand. Put bluntly, it collects the data produced by merchants’ daily operations, analyzes it, and presents a variety of reports to the merchants.

The so-called “business intelligence” simply means showing you reports so that you understand how your business is doing, which lets the boss adjust the business strategy “intelligently” to improve results.

The general workflow of such a BI system is as follows. First, a large amount of daily operational data is collected from the platform that merchants use every day.

As shown below:

Next, that data is processed on a big data computing platform, using technologies such as Hadoop, Spark, or Flink, to produce the various reports.

As shown below:

The computed reports then need to be written to storage; MySQL, Elasticsearch, and HBase can all hold this kind of data.

As shown below:

The last step is to build a Java-based BI system on top of the reports stored in MySQL, HBase, and Elasticsearch.

This system exposes the stored data to the front end, which can then filter and analyze it by various conditions.

As shown below:

3. Deployment architecture at the beginning of the system launch

The BI system in the scenario above is the case study we focus on here. The other stages involve big data technologies, which we will not cover for now; future courses may explain them.

At the beginning, few merchants used the BI system.

Even at a huge Internet company that has accumulated a large number of merchants, if you launch a paid product for them, not everyone will buy it right away.

So when the system went live, it was used by a small number of merchants, perhaps a few thousand.

The initial deployment was therefore very simple: the BI system was deployed on a few machines, each an ordinary 4-core, 8 GB configuration.

With this configuration, the young generation typically gets about 1.5 GB of the heap, and the Eden area is about 1 GB.
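If you want to check how the heap is actually divided on one of your own machines, a minimal sketch like the following prints the heap memory pools of the JVM it runs in. The class name `HeapPoolReport` is made up for illustration, and the pool names you see (for example an Eden pool and a Survivor pool) depend on which collector is in use.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryType;
import java.lang.management.MemoryUsage;

// Hypothetical helper class, not part of the BI system described in the article.
class HeapPoolReport {
    // Convert bytes to whole megabytes (1 MB = 1024 * 1024 bytes here).
    static long toMB(long bytes) {
        return bytes / (1024L * 1024L);
    }

    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if (pool.getType() != MemoryType.HEAP) {
                continue; // skip non-heap pools such as metaspace and the code cache
            }
            MemoryUsage usage = pool.getUsage();
            if (usage == null) {
                continue; // the pool is no longer valid
            }
            // getMax() returns -1 when no maximum is defined for the pool
            String max = usage.getMax() < 0 ? "undefined" : toMB(usage.getMax()) + " MB";
            System.out.println(pool.getName()
                    + ": committed=" + toMB(usage.getCommitted()) + " MB, max=" + max);
        }
    }
}
```

The young generation size itself is usually controlled with flags such as `-Xmn` and `-XX:SurvivorRatio`; the values behind the 1.5 GB and 1 GB figures above are not given in this article, so none are assumed here.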

As shown below:

4. Technical pain points: real-time automatic refresh report + large data report

With a small number of merchants the system ran very well. The problems started when the number of merchants using it began to skyrocket.

Suppose, for example, that the number of merchants reaches the tens of thousands.

One characteristic of this BI system needs explaining here: it has a report whose front-end page runs a JS script that automatically sends a request to the back end every few seconds to refresh the data.

This report is called a “real-time data report” and is shown in the following figure:

So even with only tens of thousands of merchants as users, it is quite likely that thousands of them have the real-time report open at the same time.

Once a merchant opens the real-time report, the page sends a request to the back end every few seconds to load the latest data.

As a result, each machine in the BI system receives hundreds of requests per second; let’s say 500 requests per second.

Each request loads the large amount of data a report needs, because the BI system may have to compute fields in memory before returning the result to the front end for display.

According to our earlier estimates, each request loads about 100 KB of data for computation, so 500 requests per second means about 50 MB of data loaded into memory every second.
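The estimate is a single multiplication, written out here as a small sketch. The class and method names are illustrative, and 1 MB is taken as 1000 KB so that the result matches the round numbers used in the text.

```java
// Hypothetical helper for the back-of-the-envelope estimate in the text.
class AllocationRateEstimate {
    // Data loaded per second in MB, using 1 MB = 1000 KB to match the article's round numbers.
    static double mbPerSecond(int requestsPerSecond, int kbPerRequest) {
        return requestsPerSecond * (double) kbPerRequest / 1000.0;
    }

    public static void main(String[] args) {
        // 500 requests per second, about 100 KB loaded per request
        System.out.println(mbPerSecond(500, 100) + " MB/s");
    }
}
```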

As shown below:

5. Frequent Young GC with no major impact

This already exposes the problem. Under this workload, 50 MB of data is loaded into the Eden area every second.

With an Eden area of about 1 GB, that means Eden fills up in roughly 20 seconds, at which point a Young GC is triggered to collect garbage in the young generation.

Of course, a Young GC over an Eden of around 1 GB is relatively fast and may take only a few tens of milliseconds.

So, as we analyzed before, this does not have a big impact on system performance. Moreover, in the BI scenario described above, only a few MB to a few tens of MB of objects are likely to survive each Young GC.

If that is all that happens, what you see is this: every twenty seconds or so the BI system stalls for a few tens of milliseconds, with almost no impact on end users or system performance.
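The interval between collections follows from dividing the Eden size by the allocation rate: about 1 GB at 50 MB/s is roughly 20 seconds. The sketch below also estimates what share of wall-clock time is spent paused. The 50 ms pause is an assumed value standing in for “a few tens of milliseconds”, and the names are illustrative.

```java
// Hypothetical helper; the numbers in main come from the text except where marked as assumed.
class EdenFillEstimate {
    // Seconds until Eden is full at a steady allocation rate.
    static double secondsToFill(double edenMB, double allocMBPerSecond) {
        return edenMB / allocMBPerSecond;
    }

    // Fraction of wall-clock time spent paused, given one pause per fill interval.
    static double pauseFraction(double pauseMillis, double intervalSeconds) {
        return pauseMillis / (intervalSeconds * 1000.0 + pauseMillis);
    }

    public static void main(String[] args) {
        double interval = secondsToFill(1000.0, 50.0); // ~1 GB Eden, 50 MB/s
        double assumedPauseMillis = 50.0;              // assumption, not a measured value
        System.out.println("Young GC roughly every " + interval + " s");
        System.out.println("Share of time paused: "
                + pauseFraction(assumedPauseMillis, interval) * 100.0 + " %");
    }
}
```

Under these assumptions the pauses account for well under one percent of running time, which is why frequent Young GC is not a problem at this scale.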

The diagram below:

6. Upgrading the machine configuration: using large-memory machines

As more and more merchants use the system, the concurrency pressure keeps growing, and peaks of 100,000 requests per second can occur.

If you still use 4-core, 8 GB machines, you might need to deploy hundreds of them to withstand 100,000 requests per second.

So in this situation we usually upgrade the machine configuration.

A BI system is very memory-hungry by nature, so we upgraded to 16-core, 32 GB machines. Each machine can handle thousands of requests per second, so only 20 or 30 machines need to be deployed.
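The machine counts can be sanity-checked with a ceiling division. In this sketch, 500 requests per second per small machine comes from the earlier estimate, while 4,000 per large machine is an assumed stand-in for “thousands”; the names are illustrative.

```java
// Hypothetical capacity check; not a sizing tool.
class MachineCountEstimate {
    // Smallest number of machines whose combined capacity covers the peak load.
    static int machinesNeeded(int peakRequestsPerSecond, int perMachineRequestsPerSecond) {
        return (peakRequestsPerSecond + perMachineRequestsPerSecond - 1)
                / perMachineRequestsPerSecond; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println("4-core 8 GB:   " + machinesNeeded(100_000, 500));
        System.out.println("16-core 32 GB: " + machinesNeeded(100_000, 4_000)); // 4,000 is assumed
    }
}
```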

The problem, however, is that on such a large-memory machine the young generation may be allocated 20 GB or more, and Eden takes up more than 16 GB of it.

As shown below:

At this point, thousands of requests per second load several hundred MB of data into memory every second, so it takes tens of seconds, or even a minute, to fill the Eden area and trigger a Young GC.

Now the Young GC has so much memory to process that it is much slower, and may stall the system for a few hundred milliseconds or even a second.

A stall that long inevitably causes a large backlog of queued requests in an instant. In serious cases, front-end requests to the production system time out from time to time: a request has not returned after one or two seconds, and a timeout error is reported.
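To get a feel for the numbers on the large machines, the sketch below estimates how long a 16 GB Eden takes to fill and how many requests arrive while the JVM is paused. The 4,000 requests per second and the 500 ms pause are assumed values chosen from the ranges given in the text, and the names are illustrative.

```java
// Hypothetical estimate; inputs marked "assumed" are not measurements.
class LargeHeapPauseEstimate {
    // Seconds until Eden is full at a steady allocation rate.
    static double secondsToFill(double edenMB, double allocMBPerSecond) {
        return edenMB / allocMBPerSecond;
    }

    // Requests that arrive, and therefore queue up, while the JVM is paused.
    static long requestsQueuedDuringPause(int requestsPerSecond, long pauseMillis) {
        return requestsPerSecond * pauseMillis / 1000L;
    }

    public static void main(String[] args) {
        int assumedRequestsPerSecond = 4_000;                      // assumed: "thousands"
        double allocMBPerSecond = assumedRequestsPerSecond * 0.1;  // about 100 KB per request
        long assumedPauseMillis = 500L;                            // assumed: "a few hundred ms"
        System.out.println("Eden fills in about "
                + secondsToFill(16_000.0, allocMBPerSecond) + " s");
        System.out.println("Requests queued during one pause: "
                + requestsQueuedDuringPause(assumedRequestsPerSecond, assumedPauseMillis));
    }
}
```

A backlog of a couple of thousand requests per pause on every machine is what turns an occasional stall into visible front-end timeouts.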

7. Use G1 to optimize Young GC performance on large memory machines

So one optimization we made at the time was to switch to the G1 garbage collector to deal with slow Young GC on large heaps.

(P.S. We have already explained how the G1 collector works in detail with step-by-step diagrams; go back and review them if you have forgotten.)

You can give G1 an expected GC pause time of, say, 100 ms, so that G1 tries to keep each Young GC pause within about 100 ms and end users are not affected.

The effect was very significant: G1 limits how many regions it collects in each Young GC so that the pause stays around the 100 ms target.

Young GC may then happen more often, but each pause is so short that it has little impact on the system.
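On HotSpot, the pause goal is set with `-XX:MaxGCPauseMillis` alongside `-XX:+UseG1GC`. As a minimal sketch, the class below (the name `G1Check` is made up) reports which collectors the running JVM is using, which is a quick way to confirm that the flags took effect. It relies on HotSpot naming its G1 collectors with a "G1" prefix, so treat the name check as an assumption on other JVMs.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.ArrayList;
import java.util.List;

// After compiling, launch with, for example:
//   java -XX:+UseG1GC -XX:MaxGCPauseMillis=100 G1Check
// -XX:MaxGCPauseMillis is a goal that G1 works toward, not a hard upper bound.
class G1Check {
    // True if any collector name starts with "G1" (e.g. "G1 Young Generation" on HotSpot).
    static boolean usesG1(List<String> collectorNames) {
        for (String name : collectorNames) {
            if (name.startsWith("G1")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            names.add(gc.getName());
        }
        System.out.println("Collectors: " + names);
        System.out.println("G1 active: " + usesG1(names));
    }
}
```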

8. Summary

This article uses one case to make one point: even when Young GC occurs quite frequently, it generally does not have much impact on the system.

Only on machines with a very large amount of memory do you need to watch out, because there Young GC can also cause fairly long pauses. For such large-memory machines, the G1 garbage collector is usually recommended.

9. Something to think about

Here is a small exercise: take a look at your own production system and consider:

How often does Young GC occur?
How long does each Young GC take?
Does it have a noticeable impact on your system?
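One possible way to start answering the first two questions is to read the JVM’s own GC counters. The sketch below prints, for each collector in the JVM it runs in, the number of collections, the average time per collection, and the average interval between collections since startup. The class name `GcStats` is illustrative; to inspect a production system you would run similar code inside that process or read the same beans over JMX. The reported time is the accumulated collection time the JVM exposes, which for young-generation collectors is usually close to the pause time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical reporting helper; it only reads counters of the JVM it runs in.
class GcStats {
    // Average of total over count; 0 when nothing has been counted (or the value is undefined).
    static double average(long total, long count) {
        return count > 0 ? (double) total / count : 0.0;
    }

    public static void main(String[] args) {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount();      // -1 if undefined for this collector
            long timeMillis = gc.getCollectionTime();  // accumulated time in ms, -1 if undefined
            System.out.printf("%s: count=%d, avg duration=%.1f ms, avg interval=%.1f s%n",
                    gc.getName(), count,
                    average(timeMillis, count),
                    average(uptimeMillis, count) / 1000.0);
        }
    }
}
```

The `jstat -gc` tool that ships with the JDK reports similar counters for a running process, including the young GC count (YGC) and accumulated young GC time (YGCT).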
