Let’s start with a question that is often asked in interviews.

Suppose your company’s current product supports 100,000 users. Your boss suddenly tells you that he has raised funding and will launch a large advertising campaign, and the user base is expected to reach 10 million within a month. If this task is assigned to you, what should you do?

Problem decomposition for 10 million users

“How to support 10 million users” is actually a very abstract problem. For engineering purposes, what we need are clear, quantifiable performance targets for the key business flows: transaction response time at peak, number of concurrent users, QPS, success rate, and so on. Only with these in hand can we guide the transformation and optimization of the overall architecture. So when you receive a question like this, the first step is to locate its essence, that is, to turn it into quantifiable metrics.

  • If you have transaction history from a similar business, consult and process the raw data you have collected (logs) to analyze peak periods, transaction activity, transaction volume, and so on, and derive the detailed requirements you need.

  • If there is no such data to use as a reference, you have to fall back on experience. For example, you can borrow a mature transaction model from a similar industry (such as the daily transaction profile of banking, or ticket sales and inspection in transportation), or simply apply the “2/8” rule and the “2/5/8” rule directly.

    • When users get a response within 2 seconds, they feel the system is fast.
    • When users get a response within 2–5 seconds, they feel the system’s speed is acceptable.
    • When users get a response within 5–8 seconds, they feel the system is slow but still tolerable.
    • When users still get no response after 8 seconds, they perceive the system as broken or unresponsive, and they leave the site or retry the request.
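As a quick illustration, the 2/5/8 thresholds can be expressed as a tiny helper; the class and method names here (`ResponseExperience`, `experience`) are made up for this sketch:

```java
// Classifies a response time (in seconds) into the 2/5/8 experience buckets.
public class ResponseExperience {
    public static String experience(double rtSeconds) {
        if (rtSeconds <= 2) return "fast";          // user perceives the system as responsive
        if (rtSeconds <= 5) return "acceptable";    // tolerable, but noticeable
        if (rtSeconds <= 8) return "slow";          // slow, yet still acceptable
        return "unacceptable";                      // user assumes the system is broken and leaves
    }

    public static void main(String[] args) {
        System.out.println(experience(1.5)); // fast
        System.out.println(experience(6.0)); // slow
    }
}
```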

When estimating response time, concurrent user count, TPS, and success rate, you also need to look at each business function individually, because every function has its own characteristics. Some scenarios do not need to return results synchronously; some can acceptably return “the system is busy, please wait!”. So you have to learn to balance these indicators. In most cases, it is best to prioritize them and focus on only a few high-priority ones (this is what a service SLA captures).

SLA stands for Service Level Agreement. A service SLA is the service provider’s formal commitment to the service consumer and a key yardstick of service capability. Every item defined in an SLA must be measurable, with clear metrics.

An explanation of concepts related to concurrency

Before we look at these issues, let me give you a general overview of some of the key metrics that are relevant to the system.

TPS

TPS (Transactions Per Second): the number of transactions processed per second.

From a macro perspective, a transaction is the entire process of a client sending a request to a server and waiting for the response. Timing starts when the client initiates the request and ends when it receives the server’s response; the number of transactions completed per second in this way is the TPS.

From a micro perspective, the transaction operation of a database, from the beginning of the transaction to the completion of the transaction, represents a complete transaction, which is TPS at the database level.

QPS

QPS (Queries Per Second): the number of queries the server can respond to per second. A query here means a request the user successfully makes to the server, so QPS can be loosely thought of as requests per second.

For a single interface, TPS and QPS are equal. At the macro level, if one TPS covers a user opening a page until rendering finishes, the page may call the server multiple times along the way (loading static resources, fetching server-rendered data, and so on), producing multiple queries. So one TPS may contain several QPS.

QPS = number of concurrent requests / average response time

RT

Response Time (RT): the interval between the client initiating a request and receiving the server’s response. It usually refers to the average response time.

Concurrency

Concurrency is the number of requests that the system can process at the same time.

QPS is the number of requests per second, while concurrency is the number of requests the system is handling at the same moment. Concurrency is usually larger than QPS, because each connection takes some time to process on the server, and the connection stays occupied until its request completes.

For example, if QPS = 1000, the client sends 1000 requests to the server per second. If each request takes 3s to process, then concurrency = 1000 × 3 = 3000: the server is holding 3000 requests at once.

Calculation method

How do you calculate these indicators? Let me give you an example.

Suppose 2 million users access our system in the hour from 10:00 to 11:00, and each request takes 3 seconds on average. Then the calculations are as follows:

  • QPS = 2,000,000 / (60 × 60) ≈ 556 (about 556 requests hit the server per second)
  • RT=3s (average response time per request is 3 seconds)
  • Concurrency = 556 * 3 = 1668
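The two formulas can be checked with a few lines of Java; `CapacityMath` is a throwaway name for this sketch (the concurrency formula is essentially Little’s law applied to a server):

```java
// Worked example of QPS = requests / seconds and concurrency = QPS * average RT.
public class CapacityMath {
    public static long qps(long requests, long seconds) {
        return Math.round((double) requests / seconds);
    }

    public static long concurrency(long qps, long rtSeconds) {
        return qps * rtSeconds;
    }

    public static void main(String[] args) {
        long qps = qps(2_000_000, 3_600);        // 2 million requests in one hour
        System.out.println(qps);                  // 556
        System.out.println(concurrency(qps, 3));  // 1668 concurrent connections
    }
}
```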

From this calculation you can see that the larger the RT, the higher the concurrency, and concurrency is the number of connections the server is handling at the same time. More open connections means more memory and CPU consumed. So a larger RT value costs more system resources and means the server takes longer to process each request.

In practice, the smaller the RT the better. In a game, for example, a response within 100ms is needed for the best experience, while for an e-commerce system 3s is acceptable.

Use the 2/8 rule to estimate the traffic of 10 million users

Going back to the original question, assuming we don’t have any historical data, we can use the 2/8 rule to make an estimate.

  • 10 million users, of whom 20% visit the site each day: 2 million daily visitors.

  • Assuming each visitor clicks 50 times on average, the total daily PV is 100 million.

  • A day has 24 hours. By the 2/8 rule, most users’ activity concentrates in 24 × 0.2 ≈ 5 hours, and most of the traffic means 100 million clicks × 80% ≈ 80 million PV. In other words, about 80 million clicks arrive within 5 hours, which works out to roughly 4,500 requests per second (80,000,000 / 18,000s).

  • 4,500 is only an average. Requests cannot possibly arrive evenly over those 5 hours; there may be bursts of users visiting at the same time (for sites like Taobao, daily traffic peaks around 14:00 and 21:00, with 21:00 the highest of the day). As a rule of thumb, the peak is about 3 to 4 times the average; taking 4 times, we get roughly 18,000 requests per second during peak moments. In other words, “supporting 10 million users” really means the servers must sustain 18,000 requests per second (QPS = 18,000).
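The whole estimate above can be reproduced in a few lines; `TrafficEstimate` is a hypothetical helper, and the 20%-active, 50-click, 5-hour, and 4x-peak figures are exactly the assumptions already made in the text:

```java
// Back-of-envelope traffic estimate using the 2/8 rule, mirroring the steps above.
public class TrafficEstimate {
    // Returns estimated peak QPS for a given total user count.
    public static long peakQps(long users) {
        long dailyActive = users * 20 / 100;   // 20% of users visit each day
        long dailyPv = dailyActive * 50;       // 50 clicks per active user
        long busyPv = dailyPv * 80 / 100;      // 80% of clicks land in the busy window
        long busySeconds = 5 * 3600;           // ~20% of 24h, rounded to 5 hours
        long avgQps = busyPv / busySeconds;    // ≈ 4444 for 10 million users
        return avgQps * 4;                     // peak ≈ 4x average (rule of thumb)
    }

    public static void main(String[] args) {
        System.out.println(peakQps(10_000_000L)); // 17776, i.e. roughly 18,000
    }
}
```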

Estimated server pressure

Once you have roughly estimated the peak load the back end must support, you can estimate the pressure on each layer of the architecture and then configure an appropriate number of servers. To do that, we first need to know how much concurrency a single server can handle. Our application is deployed on Tomcat, so we start with Tomcat’s own performance.

The following diagram shows how Tomcat works; its components are described below.

  • LimitLatch is the connection controller; it caps the number of connections Tomcat can handle at once. In NIO/NIO2 mode the default is 10,000; in APR/native mode it is 8,192.

  • The Acceptor is a separate thread that receives client connection requests by calling socket.accept in a while loop. When a new connection arrives, accept returns a Channel object, which is handed to the Poller.

  • The Poller is essentially a Selector and also runs in its own thread. It maintains an array of Channels internally and checks their ready state in an infinite loop; as soon as a Channel becomes readable, it creates a SocketProcessor task and hands it to the Executor.

  • SocketProcessor implements the Runnable interface. When the thread pool runs a SocketProcessor task, Http11Processor handles the request: it reads the Channel’s data and builds the ServletRequest object.

  • The Executor is the thread pool that runs SocketProcessor tasks. SocketProcessor’s run method calls Http11Processor to read and parse the request data. Http11Processor is the encapsulation of the application-layer protocol; it calls into the container to produce the response and then writes the response back through the Channel.

From this figure, we can identify four factors that limit the number of requests Tomcat can handle.

Current server system resources

“Socket/File: Can’t open so many files”

In Linux, each TCP connection consumes one file descriptor (FD), and this error occurs when the number of descriptors exceeds the current system limit.

To see how many files a process can open, run one of the following commands:

ulimit -a
ulimit -n

open files (-n) 1024 is the per-process limit in Linux on the number of open file handles (including open sockets).

Besides this user-level limit, there is also a system-wide limit, which you can check with:

cat /proc/sys/fs/file-max

file-max sets the total number of files that all processes in the system can open. A program can also call setrlimit to set its own per-process limit. If you see many errors about running out of file handles, you should increase this value.

When the above error occurs, you can raise the per-process open-file limit as follows:

vi /etc/security/limits.conf
  root soft nofile 65535
  root hard nofile 65535
  * soft nofile 65535
  * hard nofile 65535
  • * applies to all users; root applies to the root user.
  • noproc is the maximum number of processes.
  • nofile is the maximum number of open files.
  • soft/hard: crossing the soft limit produces a warning; crossing the hard limit produces an error.

Also note that the per-process limit must be less than or equal to the system-wide total; otherwise you need to raise the system-wide limit as well.

# temporary, lost on reboot
echo 1000000 > /proc/sys/fs/file-max
# permanent: add "fs.file-max = 1000000" to /etc/sysctl.conf, then reload
sysctl -p

The largest overhead of TCP connections on system resources is memory.

Because a TCP connection ultimately needs both parties to send and receive data, it requires a read buffer and a write buffer, which under Linux default to a minimum of 4096 bytes each. You can check with cat /proc/sys/net/ipv4/tcp_rmem and cat /proc/sys/net/ipv4/tcp_wmem.

So with a minimum of 4096 + 4096 = 8KB per TCP connection, a machine with 8GB of memory could in theory support 8 × 1024 × 1024KB / 8KB ≈ 1,000,000 concurrent connections, ignoring all other limits. This is a theoretical ceiling: in practice, the kernel’s limits on other resources and the program’s own business processing make 1 million connections on 8GB very hard to reach. Of course, adding memory raises the ceiling.
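A quick sanity check of that arithmetic (the 4KB buffer minimums are the tcp_rmem/tcp_wmem defaults mentioned above; `ConnectionCeiling` is a name invented for this sketch):

```java
// Theoretical TCP connection ceiling from buffer memory alone:
// each connection needs at least a 4KB read buffer + 4KB write buffer.
public class ConnectionCeiling {
    public static long maxConnections(long memoryBytes) {
        long perConnection = 4096 + 4096;             // minimum tcp_rmem + tcp_wmem
        return memoryBytes / perConnection;
    }

    public static void main(String[] args) {
        long eightGb = 8L * 1024 * 1024 * 1024;
        System.out.println(maxConnections(eightGb));  // 1048576, i.e. ~1 million
    }
}
```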

Configuration of the JVM on which Tomcat depends

We know that Tomcat is a Java program running on the JVM, so we need to optimize the JVM to better improve the performance of Tomcat. Let’s briefly introduce the JVM, as shown in the figure below.

In the JVM, memory is divided into the heap, program counters, native method stacks, the method area (metaspace), and virtual machine stacks.

Heap space specification

Heap memory is the largest region of JVM memory; almost all objects and arrays are allocated there, and it is shared by all threads. The heap is divided into the young generation and the old generation, and the young generation is further divided into Eden and two Survivor regions, as shown in the figure below.

The young-to-old ratio is 1:2, meaning the young generation takes 1/3 of the heap and the old generation 2/3. Within the young generation, the split is Eden:Survivor0:Survivor1 = 8:1:1. For example, if Eden is 40M, each Survivor region is 5M, the whole young generation is 50M, the old generation is 100M, and the total heap is 150M.
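The ratios can be turned into a small calculator; `HeapLayout` is a made-up name, and the 1:2 and 8:1:1 splits are the defaults just described:

```java
// Splits a heap according to the default ratios described above:
// NewRatio=2 (old:young = 2:1) and SurvivorRatio=8 (Eden:S0:S1 = 8:1:1).
public class HeapLayout {
    public static long[] split(long heapMb) {
        long young = heapMb / (2 + 1);       // young gen gets 1/3 of the heap
        long old = heapMb - young;           // old gen gets the remaining 2/3
        long survivor = young / (8 + 1 + 1); // each survivor is 1/10 of young
        long eden = young - 2 * survivor;    // Eden takes the remaining 8/10
        return new long[]{eden, survivor, survivor, old};
    }

    public static void main(String[] args) {
        // Eden=40M, Survivor0=5M, Survivor1=5M, Old=100M for a 150M heap
        System.out.println(java.util.Arrays.toString(split(150)));
    }
}
```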

You can view the default parameters with java -XX:+PrintFlagsFinal -version

uintx InitialSurvivorRatio                      = 8
uintx NewRatio                                  = 2

InitialSurvivorRatio: initial ratio of new generation Eden/Survivor Spaces

NewRatio: ratio of Old memory to Young memory

Here’s how heap memory works:

  • When Eden is full, a YGC (Young GC) is triggered. Most objects are reclaimed; any survivors are copied to Survivor0, and Eden is emptied.
  • When the next YGC is triggered, survivors from Eden + Survivor0 are copied to Survivor1, and both Eden and Survivor0 are emptied.
  • On the following YGC, objects in Eden + Survivor1 are copied to Survivor0, and the cycle continues until an object’s age reaches the promotion threshold. (This design works because most objects in Eden are reclaimed young.)
  • Objects too large to fit in a Survivor region go straight to the old generation.
  • When the old generation is full, a Full GC is triggered.

(An aside: does the GC mark-sweep algorithm pause other threads while it executes? Yes — this is the well-known “stop-the-world” pause.)

Program counter

The program counter records the bytecode address each thread is executing. When a thread is context-switched out, the counter remembers where execution stopped, so the thread can resume from the same place the next time it runs.

Method area

The method area is a logical concept; in the HotSpot VM of JDK 1.8 it is implemented as the metaspace.

The method area stores information about classes the VM has loaded: class metadata, the runtime constant pool, the string constant pool, and class details such as version, fields, methods, interfaces, and parent class.

Like the heap, the method area is a shared memory region: it is shared across threads.

Native method stacks and virtual machine stacks

The Java virtual machine stack is thread-private memory. When a thread is created, it requests a stack from the VM to store local variables, operand stacks, and dynamic linking information. Each method call pushes a stack frame, and when the method returns, its frame is popped.

The native method stack is similar to the virtual machine stack, except that it manages calls to native methods.

How should JVM memory be set up

With that basic information in mind, how should memory be set up in the JVM? What are the parameters to set?

In the JVM, the core parameters to configure are:

  • -Xms: the initial Java heap size

  • -Xmx: the maximum Java heap size

  • -Xmn: the size of the young generation; the heap minus the young generation is the old generation

    If the young generation is set too small, Minor GC will trigger frequently, which hurts system stability.

  • -XX:MetaspaceSize: the initial metaspace size, for example 128M

  • -XX:MaxMetaspaceSize: the maximum metaspace size, for example 256M. (If these two parameters are not specified, the metaspace is resized dynamically at runtime.)

    For a new system there is basically no way to calculate the metaspace precisely; a few hundred megabytes is generally enough, since it mainly stores class information.

  • -Xss: the stack size per thread. This rarely needs estimating; 512KB to 1M is typical, and the smaller the value, the more threads fit in a given memory budget.

How much memory the JVM gets depends on the machine. For example, a 2-core 4GB server can give the JVM process only about 2GB, because the OS itself needs memory and other processes run on the machine too. That 2GB must then cover stack memory, heap memory, and metaspace, so the heap may get about 1GB, which is in turn split between the young and old generations.
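To make the sizing concrete, here is a rough sketch that derives flags for the 2-core/4GB example; the halving percentages are assumptions for illustration, not hard rules, and `JvmSizing` is our own name:

```java
// Rough JVM sizing for a small server: the JVM process gets about half the
// machine's memory, the heap about half of that, and the young generation a
// third of the heap (matching the default NewRatio=2).
public class JvmSizing {
    public static String flags(long machineMb) {
        long jvmMb = machineMb / 2;   // leave room for the OS and other processes
        long heapMb = jvmMb / 2;      // the rest covers metaspace, thread stacks, etc.
        long youngMb = heapMb / 3;    // young generation = 1/3 of the heap
        return String.format("-Xms%dm -Xmx%dm -Xmn%dm -Xss512k", heapMb, heapMb, youngMb);
    }

    public static void main(String[] args) {
        System.out.println(flags(4096)); // -Xms1024m -Xmx1024m -Xmn341m -Xss512k
    }
}
```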

Tomcat configuration

Tomcat.apache.org/tomcat-8.0-…

The maximum number of request processing threads to be created by this Connector, which therefore determines the maximum number of simultaneous requests that can be handled. If not specified, this attribute is set to 200. If an executor is associated with this connector, this attribute is ignored as the connector will execute tasks using the executor rather than an internal thread pool. Note that if an executor is configured any value set for this attribute will be recorded correctly but it will be reported (e.g. via JMX) as -1 to make clear that it is not used.

server:
  tomcat:
    uri-encoding: UTF-8
    # Maximum number of worker threads; default 200 (rule of thumb: ~800 for 4-core/8G).
    # The OS pays a scheduling cost when switching between threads, so more is not always better.
    max-threads: 1000
    # Wait queue length; default 100
    accept-count: 1000
    # Maximum number of connections; default 10000 in NIO mode
    max-connections: 20000
    # Minimum number of idle worker threads (default: 10)
    min-spare-threads: 100
  • acceptCount: the wait-queue length. When all of Tomcat’s worker threads are busy, new requests are placed in the wait queue; acceptCount is the size of that queue. If the queue is also full, Tomcat refuses new requests.

  • maxThreads: the maximum number of worker threads Tomcat creates to handle HTTP requests; it determines how many requests the container can process simultaneously. The default is 200. Raising it has a cost: more threads mean more context-switching overhead and more memory, since by default each new JVM thread is given a 1M stack. Rules of thumb: about 200 threads for 1 core/2G, about 800 for 4 cores/8G.

  • maxConnections: the maximum number of connections Tomcat accepts at one time. In Java’s blocking BIO mode, the default equals maxThreads (or the Executor’s maxThreads if a custom Executor is used). In NIO mode, the default is 10,000. In APR/native mode, the default is 8,192.

    If the value is set to -1, the maxConnections check is disabled and Tomcat does not limit connections. The relationship between maxConnections and acceptCount: once maxConnections is reached, the system continues to accept connections into the backlog, up to acceptCount more; beyond that, connections are refused.
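That interaction can be modeled in a few lines; this is a simplified sketch of the behavior described above, not Tomcat’s actual implementation:

```java
// Simple model of maxConnections vs. acceptCount: connections up to
// maxConnections are accepted, the next acceptCount wait in the backlog,
// and anything beyond that is refused.
public class ConnectionModel {
    public static long[] dispatch(long incoming, long maxConnections, long acceptCount) {
        long accepted = Math.min(incoming, maxConnections);
        long queued = Math.min(incoming - accepted, acceptCount);
        long rejected = incoming - accepted - queued;
        return new long[]{accepted, queued, rejected};
    }

    public static void main(String[] args) {
        // 22,000 connections against maxConnections=20000, acceptCount=1000:
        // 20000 accepted, 1000 queued, 1000 refused
        System.out.println(java.util.Arrays.toString(dispatch(22_000, 20_000, 1_000)));
    }
}
```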

Pressure from the application itself

As analyzed earlier, after NioEndPoint receives a client connection it generates a SocketProcessor task for the thread pool. SocketProcessor’s run method calls the Http11Processor component to parse the application-layer protocol and build a Request object, and finally calls the Adapter’s service method to hand the request to the container.

The container then takes over: the connector has turned the bytes from the Socket into a Servlet request, and the container is responsible for handling that Servlet request.

Tomcat uses the Mapper component to route the requested URL to a specific Servlet; Spring’s DispatcherServlet then intercepts the request, and Spring’s own mapping locates the concrete Controller method.

Once the Controller is reached, the request truly begins for our business: Controller calls Service, Service calls the DAO, and after the database work completes, the response travels back to the client, completing one round trip. In other words, the business-logic time spent in the Controller also affects how much concurrency the whole container can sustain.

Estimating the number of servers

From the analysis above, assume a single Tomcat node can handle a QPS of 500. To carry the peak QPS of 18,000 we need about 40 servers, and those 40 servers need requests distributed to them via software load balancing with Nginx. Nginx performs well: it can serve static files at up to about 50,000 requests per second. And because Nginx itself must not be a single point of failure, we can put LVS in front of it to balance load across Nginx instances. LVS (Linux Virtual Server) achieves load balancing with IP-level load-balancing techniques.

With this architecture, our servers can now sustain QPS = 18,000, but that is not the whole story. Recall the two formulas from earlier.

  • QPS = concurrency / average response time

  • Concurrency = QPS × average response time

Assuming RT is 3s, the number of concurrent server-side connections = 18,000 × 3 = 54,000: there are 54,000 connections open to the servers at the same time, and that is what the servers must be configured to support, as discussed earlier. The larger the RT, the more connections accumulate, and those connections consume memory and CPU and can crash the system. Once the connection count exceeds the threshold, later requests cannot get in and users see timeouts, which is obviously not what we want. So we must shorten RT.
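Putting the numbers together (the per-node QPS of 500 is the assumption made above, not a measured figure; `ClusterSizing` is our own name for this sketch):

```java
// Ties the capacity numbers together: servers needed for the peak QPS, and
// the concurrent connections implied by concurrency = QPS * RT.
public class ClusterSizing {
    public static long serversNeeded(long peakQps, long qpsPerNode) {
        return (peakQps + qpsPerNode - 1) / qpsPerNode;   // round up
    }

    public static long concurrentConnections(long qps, long rtSeconds) {
        return qps * rtSeconds;
    }

    public static void main(String[] args) {
        System.out.println(serversNeeded(18_000, 500));        // 36 nodes (padded to 40)
        System.out.println(concurrentConnections(18_000, 3));  // 54000 connections
    }
}
```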