
Building on the JVM basics covered earlier, this article uses real production examples to show how to choose reasonable initial (not yet tuned) JVM settings for the expected concurrency load when a system first goes live.

Along the way we will look at what to consider when setting each parameter. How much Java heap memory is needed? How much should go to the young generation and how much to the old generation? How large should the permanent generation (replaced by Metaspace since Java 8) and the thread stacks be? Each question is worked through step by step with a case study.

Note: JVM parameters must be set according to the specific scenario of each business system. There is no universal configuration or template, and assuming one exists is a mistake. The settings always have to come from analysing the actual case and business scenario.

1. How do you withstand billions of orders on Double 11? How should JVM memory be set?

Let’s start with a number: during the 2020 Tmall Double 11 Global Shopping Festival, real-time logistics orders closed at 2.321 billion. Think about that: 2.321 billion orders in one day! What’s more, Tmall orders peaked at 583,000 per second in 2020!

The secret behind 544,000 orders per second

In the two months before Double 11, Alibaba finished migrating hundreds of thousands of physical servers from its offline data centers to the cloud. It was a huge undertaking, yet consumers on the front end noticed nothing.

Over the past three years, Alibaba Cloud (Aliyun) invested heavily in developing the Shenlong server, which is what allowed the peak of 544,000 orders per second to pass smoothly. How big is 544,000 orders per second? Zhang Xiantao, a researcher in Alibaba Cloud’s intelligent basic products division, said that other companies may still be struggling with 1,000 orders per second.

Computing power that cannot be ignored

On Double 11, Alibaba processed 970PB of data. For comparison, CCTV has saved about 80PB of data over decades of filming programs.

The stream computing system and the Feitian big data platform supply the large-scale computing power behind Double 11. Stream computing plays an important role in system and merchant scheduling. For example, merchants stock up in advance for Double 11. When the forecast shows that a merchant's featured products will sell too fast, the Feitian big data platform prompts the merchant to change strategy so that the goods do not sell out right at the start. When the forecast shows that sales of the featured products will fall short of expectations, the platform reminds the merchant to consider issuing coupons to boost sales.

It’s a different Double 11

At this year’s Double 11 media briefing, Alibaba Group CTO Zhang Jianfeng said that Alibaba Cloud had completed four core technical breakthroughs:

First, at the core virtualization layer, the self-developed Shenlong architecture runs virtualization on self-developed servers. Shenlong server output also stays very linear under high load.

Second, a self-developed cloud-based database ran through this year’s Double 11 without problems.

Third, compute and storage have been separated: data is accessed remotely, which makes storage very easy to expand.

Fourth, with an RDMA network, reads and writes to remote storage can be faster than reads and writes to a local disk.

During Double 11, nearly 2 million containers supported the core e-commerce systems. On the merchant side, Alibaba’s technical team rapidly added 54,000 cores of capacity for merchants, helping them process 870,000 orders per second at peak and serving 41 billion calls to merchants.

These are the technical forces behind Double 11 (Singles’ Day).

2. How should JVM memory be set for a payment system handling a million transactions a day?

Nowadays many systems are inseparable from payment, so building a payment system is almost a skill we must master. Here we take the development of an e-commerce system as the background and analyze, as a practical case, a payment system whose daily transaction volume is in the millions.

The order volume of a day like Double 11 obviously cannot be handled just by adjusting JVM memory and the number of servers, so here I use a system averaging about a million payment orders per day as the practical case. Note: companies that reach millions of payment orders a day are basically the largest Internet companies in China, or general-purpose third-party payment platforms that handle payment transactions for many apps.

Now to the analysis itself:

First of all, the normal online shopping process is: add goods to the shopping cart → place the order → check out and pay → show the payment result

What does the whole process look like from the point of view of system development? Let’s take a look at a complete WeChat Pay flow chart that I made before:

Does it look a little complicated? Three systems interact here. If we simplify the picture a little, the whole process looks like this:

Much clearer once simplified, isn’t it?

You can think of the order system as our e-commerce back-end system. The payment system can also be regarded as part of the e-commerce system, but because this part is so important, we usually extract it as an independent system for development and maintenance.

The payment system is a bridge connecting consumers, merchants (or platforms) and financial institutions. It manages payment data, calls the third-party payment platform interface, records payment information (corresponding order number, payment amount, etc.), and verifies the amount, etc.

Let’s first sort out the whole flow from the user to the order system, to the payment system and on to the third-party payment system, so that we have the overall payment logic in mind:

  1. The user submits an order for payment to our order system
  2. The order system submits the payment request to the payment system
  3. The payment system generates a payment order, whose status at this point is “to be paid”. The user is returned to the payment page to choose a payment method
  4. The user chooses WeChat Pay or Alipay as the payment method
  5. The payment system submits the actual payment request to the third-party payment channel, which processes the request and transfers the funds
  6. The channel returns the processing result, and the payment system changes the order status to “completed”
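
The six steps above can be sketched in code. This is only a minimal illustration, not the implementation of any real payment system: the class name `PaymentOrderSketch`, its fields and the `onChannelResult` method are all hypothetical, and only the two statuses “to be paid” and “completed” come from the flow above. The flow does not say what happens when a payment fails, so the sketch assumes the order simply stays unpaid.

```java
// Minimal sketch of the payment order lifecycle from the steps above.
// All names are hypothetical; only the two statuses come from the text.
final class PaymentOrderSketch {
    enum Status { TO_BE_PAID, COMPLETED }

    private final String orderNo;      // order number passed in by the order system
    private final long amountInCents;  // payment amount (unit chosen for this sketch)
    private Status status = Status.TO_BE_PAID; // step 3: a new payment order is "to be paid"

    PaymentOrderSketch(String orderNo, long amountInCents) {
        this.orderNo = orderNo;
        this.amountInCents = amountInCents;
    }

    Status status() {
        return status;
    }

    long amountInCents() {
        return amountInCents;
    }

    // Step 6: the third-party channel has returned its processing result.
    void onChannelResult(boolean success) {
        if (status != Status.TO_BE_PAID) {
            throw new IllegalStateException("order " + orderNo + " is already " + status);
        }
        if (success) {
            status = Status.COMPLETED;
        }
        // Assumption: on failure the order stays TO_BE_PAID so the user can retry.
    }

    public static void main(String[] args) {
        PaymentOrderSketch order = new PaymentOrderSketch("demo-order-1", 9_900L);
        System.out.println(order.status()); // TO_BE_PAID
        order.onChannelResult(true);
        System.out.println(order.status()); // COMPLETED
    }
}
```

Every payment request creates at least one such object, which is why the number of payment orders per day matters for JVM memory.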

The above is only a simplified payment process. A complete payment system includes many more things (such as account management, reconciliation management, settlement management, etc.). Here we focus on the core payment process.

Where is the pressure on a payment system with millions of transactions a day?

From the flow chart we know that once a user submits a payment request, the order system passes the actual payment order to the payment system, and that is when the payment system really starts working. One million payment requests a day means the payment system receives, stores and processes one million payment orders a day. Put more bluntly, our JVM has to create a million payment order objects every day (each one bundles user information, product information, channel information, payment time, price and so on). Looking at this from the JVM's point of view, with millions of payment order objects being created and destroyed in memory every day, there are several core questions to consider:

  1. How much memory does the JVM need in order to create that many order objects? The key figure is the heap size.
  2. How much memory is required per machine, and how many machines should be deployed?

With these two questions in mind, to analyze and decide on memory allocation we first need to know where the system's peak is. Users generally shop in bursts, for example around noon and in the evening, which adds up to a few hours per day. In other words, about one million orders are produced within a few hours. Spread over four hours, that is roughly 60 to 70 orders per second (1,000,000 orders / 14,400 seconds ≈ 69). To leave headroom we round up and do all further calculations at 100 orders per second.
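
The peak-rate arithmetic can be written out as a quick check. This is a back-of-the-envelope sketch: the class and method names are made up for illustration, and the only inputs are the figures already used in the text (one million orders, a four-hour peak window, and a planning rate rounded up to 100 orders per second).

```java
// Back-of-the-envelope peak rate, using only the figures from the text.
final class PeakRateEstimate {
    // Average orders per second if `orders` arrive evenly over `hours`.
    static double averagePerSecond(long orders, int hours) {
        return orders / (hours * 3600.0);
    }

    public static void main(String[] args) {
        double average = averagePerSecond(1_000_000L, 4); // one million orders in four peak hours
        long planningRate = 100;                           // rounded up to leave headroom
        System.out.println("average orders/s: " + Math.round(average)); // average orders/s: 69
        System.out.println("planning orders/s: " + planningRate);
    }
}
```

Rounding 69 up to 100 builds roughly 45% of headroom into every later estimate.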

How long a payment order takes to process and how much space it occupies

Next we need the processing time of one order. When the user clicks submit, the request carries the relevant parameters to the e-commerce back-end system, which creates the order, removes the items from the shopping cart, writes the order to the database, and then sends the order on to the payment system. The whole path, from the initial request through order creation to arrival at the payment system, takes roughly one second.

How large is one order object? An order object typically has around 20 core instance variables. Adding up the byte sizes of the corresponding basic data types, one order object comes to roughly 500 bytes. At 100 orders per second that is 100 × 500 = 50,000 bytes per second, or about 50KB, which is pretty small.

Combining the two estimates: 100 payment orders arrive every second, and each takes about 1 second to process. So after 1 second those 100 objects are no longer referenced and become about 50KB of garbage objects in the young generation.

The next second produces another 100 order objects and therefore another 50KB of garbage. Garbage keeps piling up in the young generation this way until it is full, which triggers a Minor GC to reclaim it.
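
To make the numbers concrete, here is a small sketch of how garbage from order objects alone accumulates. The 500 bytes per order and 100 orders per second come from the estimate above. The class name is made up for illustration, and decimal units (1KB = 1,000 bytes) are used to match the rounding in the text. It assumes, as the text does, that every order object becomes garbage about one second after it is created.

```java
// Garbage produced by order objects alone, with the figures from the text.
final class OrderGarbageRate {
    static final long BYTES_PER_ORDER = 500;   // rough size of one order object
    static final long ORDERS_PER_SECOND = 100; // planning rate

    // Bytes of order objects that have become garbage after `seconds` of steady load.
    static long garbageBytesAfter(long seconds) {
        return BYTES_PER_ORDER * ORDERS_PER_SECOND * seconds;
    }

    public static void main(String[] args) {
        System.out.println(garbageBytesAfter(1) / 1_000 + "KB after 1 second");          // 50KB after 1 second
        System.out.println(garbageBytesAfter(100) / 1_000_000 + "MB after 100 seconds"); // 5MB after 100 seconds
    }
}
```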

Payment system memory usage estimate

By the analysis above, 50KB of garbage objects are produced per second, so about 5MB in 100 seconds. That may seem too small to worry about, but so far we have only counted the payment order objects themselves. In real operation many other objects are created every second (by the system itself, plus the various objects associated with each order). So for a realistic estimate of the memory footprint, we have to multiply the previous figure by 10 to 20!

So by this estimate, we create between 500KB and 1MB of objects per second. If one second generates 1MB of garbage, 100 seconds generate 100MB. If the Eden space is 800MB, a Minor GC is triggered after about 800 seconds. Frequent Minor GCs are not a good thing: they affect the performance and stability of the production system.
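
The same estimate as a short sketch. The 50KB per second, the 10x to 20x overhead factor and the 800MB Eden size all come from the text. The names are illustrative, and decimal units are used so that the results match the rounded figures above.

```java
// Time until Eden fills up at a steady allocation rate.
final class EdenFillTime {
    static final long KB = 1_000;      // decimal units, matching the article's rounding
    static final long MB = 1_000 * KB;

    // Seconds until an Eden of `edenBytes` is full when `bytesPerSecond` is allocated.
    static long secondsToFill(long edenBytes, long bytesPerSecond) {
        return edenBytes / bytesPerSecond;
    }

    public static void main(String[] args) {
        long orderBytesPerSecond = 50 * KB;     // order objects only
        long low = orderBytesPerSecond * 10;    // 500KB/s with 10x overhead
        long high = orderBytesPerSecond * 20;   // 1MB/s with 20x overhead
        long eden = 800 * MB;

        System.out.println(secondsToFill(eden, high)); // 800
        System.out.println(secondsToFill(eden, low));  // 1600
    }
}
```

So with an 800MB Eden, a Minor GC would be expected roughly every 13 to 27 minutes under this load.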

How should the payment system's JVM memory be set?

So how do we deploy and allocate JVM memory when the system actually goes live? Suppose the budget is tight and we simply deploy on a 2-core, 4GB machine. Of the 4GB, the JVM can be given at most about 2GB, and not all of that 2GB can go to the heap, because the method area, thread stacks and other regions need memory too. The heap therefore gets about 1GB at most, and the heap is further divided into the young generation and the old generation, so the young generation ends up being only a few hundred MB. By our earlier estimate the system consumes about 1MB of memory per second, so the young generation fills up in a few hundred seconds, triggering garbage collection and affecting the performance and stability of the system. (Garbage collection causes a stop-the-world (STW) pause in which the application threads are suspended; we’ll cover this later.)

So how can we fix and optimize this?

  1. Cost permitting, deploy on a 4-core, 8GB machine. The JVM can then be given at least 4GB of memory, with 2GB for the young generation, which stretches the interval between Minor GCs from a few hundred seconds to between half an hour and an hour and reduces the GC frequency.
  2. Scale out horizontally by deploying three to five machines. The more machines there are, the fewer requests each machine handles per day, which puts less pressure on each JVM's memory.
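
As a rough sketch of the first option: `-Xms`, `-Xmx` and `-Xmn` are the standard HotSpot options for initial heap size, maximum heap size and young generation size, and the values below are the ones from the text. Everything else is an assumption for illustration: the jar name is a placeholder, the class and method names are made up, and Eden is taken as 8/10 of the young generation (the HotSpot default `-XX:SurvivorRatio=8`, which not every collector or configuration keeps fixed).

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical launch command for the 4-core / 8GB machine described above:
//   java -Xms4g -Xmx4g -Xmn2g -jar payment-service.jar   (jar name is a placeholder)
final class HeapSizingCheck {
    // Eden size assuming Eden : Survivor : Survivor = 8 : 1 : 1.
    static long edenBytes(long youngBytes) {
        return youngBytes * 8 / 10;
    }

    // Whole minutes until Eden fills at a steady allocation rate.
    static long minutesBetweenMinorGcs(long youngBytes, long bytesPerSecond) {
        return edenBytes(youngBytes) / bytesPerSecond / 60;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long young = 2048 * mb; // 2GB young generation

        System.out.println(minutesBetweenMinorGcs(young, mb));      // 27 (at 1MB/s)
        System.out.println(minutesBetweenMinorGcs(young, mb / 2));  // 54 (at 512KB/s)

        // What the running JVM was actually given; useful for checking the flags took effect.
        System.out.println("max heap (MB): " + Runtime.getRuntime().maxMemory() / mb);
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            MemoryUsage usage = pool.getUsage(); // null if the pool is no longer valid
            if (usage != null) {
                System.out.println(pool.getName() + " max bytes: " + usage.getMax());
            }
        }
    }
}
```

At the estimated 500KB to 1MB per second, a 2GB young generation gives roughly 27 to 54 minutes between Minor GCs, which is where the half-hour-to-an-hour figure comes from. The pool names printed at the end depend on the collector in use.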

Of course, the actual configuration has to be reasonable for your own business and system performance. Before going live, estimate JVM memory for each system by simulation (how to use tools to observe the JVM's actual memory behaviour will be explained later): estimate the volume of user requests, simulate it, and test repeatedly until the figures are reasonable. Then set sensible values in advance so that GC is not triggered too often and the system runs stably.

3. Double 11 promotion: traffic instantly increases 10 times

Besides estimating for the daily average payment volume, we also need to think about how to keep the servers stable during a big promotion. For example, when Double 11 arrives, the pressure on the servers is likely to increase 10 times in an instant, because everyone buys at that moment or on that day. Then the problem is no longer 100 payment orders per second but 1,000 orders per second or even more! The pressure is then not only on memory: thread and CPU resources in particular are nearly exhausted, and memory is at risk as well!

Earlier we estimated that about 1MB of objects is created per second. Under promotion load that may reach 10MB or even tens of MB per second (always estimate on the high side rather than assuming it will just fit). There is a second problem: previously we could process 100 orders in about one second, but 1,000 orders certainly cannot all be processed within one second. As just noted, CPU, threads and memory are all under strain, so system performance becomes unstable and those 1,000 orders take at least a few seconds, possibly tens of seconds, to finish.

So when the young generation is nearly full and objects keep arriving, problems appear: the previous batch of objects has not finished processing when the next wave comes in, and together with the garbage objects they fill the young generation and trigger a Minor GC. Assume the young generation and the old generation are each allocated 1GB. The situation is then as follows:

A new request arrives, there is not enough space to allocate, and a Minor GC is triggered. It reclaims some objects, but a small number are still referenced because the system is processing slowly, while a large number of new objects are created every second. If we estimate that 100MB of new objects is created per second and Eden takes up 800MB of the 1GB young generation, a Minor GC fires in under 8 seconds. With Minor GCs this frequent and a small number of objects being processed slowly, some objects survive multiple Minor GCs and are eventually promoted to the old generation:
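
A rough sketch of why slow requests end up in the old generation under promotion load. The 100MB per second and the 800MB Eden come from the estimate above; the request processing times are illustrative values in the range of a few seconds to tens of seconds mentioned earlier. The model is deliberately simplified and the names are made up: it only counts how many Minor GCs happen while a request is still in flight. In a real JVM, promotion also depends on the collector, on the tenuring threshold (`-XX:MaxTenuringThreshold` in HotSpot) and on whether the surviving objects fit into the survivor space.

```java
// How many Minor GCs an in-flight request's objects live through under promotion load.
final class PromotionLoadEstimate {
    // Seconds between Minor GCs for a given Eden size and allocation rate.
    static long secondsBetweenMinorGcs(long edenMb, long allocMbPerSecond) {
        return edenMb / allocMbPerSecond;
    }

    // Number of Minor GCs that occur while one request is still being processed.
    static long minorGcsSurvived(long processingSeconds, long gcIntervalSeconds) {
        return processingSeconds / gcIntervalSeconds;
    }

    public static void main(String[] args) {
        long interval = secondsBetweenMinorGcs(800, 100);
        System.out.println(interval); // 8

        long[] processingSeconds = {1, 10, 30, 60}; // illustrative request durations
        for (long seconds : processingSeconds) {
            System.out.println(seconds + "s request survives "
                    + minorGcsSurvived(seconds, interval) + " Minor GC(s)");
        }
    }
}
```

A request that finishes within a second is almost always garbage before the next collection, while one that takes 30 or 60 seconds lives through 3 or 7 collections, which is how its objects become candidates for promotion.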

Once those objects have been processed, they lose their references and become garbage, but by then they already sit in the old generation.

At this rate the old generation also fills up very quickly! Once the old generation is full, a Full GC is triggered, which is worse than a Minor GC and pauses the system for longer! Can you imagine the system freezing for garbage collection in the middle of a flash sale, with users unable to open the payment page? By the time the system recovers, the flash-sale window has passed or the goods have sold out, which seriously hurts the user experience.

The garbage collection rules for the young and old generations, and how to optimize them, will be explained later.

Therefore, when a company starts developing a project, especially one with a large number of users, we must think about and estimate JVM memory up front. If the load on the business system is estimated unreasonably, the system may crash at any moment once the real load arrives. This is also why so many interviews at big companies focus on the JVM: they want to see whether you understand your own system well enough to estimate its production memory needs accurately.