After a routine promotional campaign, customer service began reporting that some users could not open the web page or the APP during a bid rush, and that by the time the page did open the listing was already fully subscribed. At first we paid little attention: we assumed this was simply what a bid rush is like, no different from trying to grab a Xiaomi phone in a flash sale. As the campaign went on, more and more users protested strongly that after claiming interest-rate bonus coupons or cash coupons they could never win a bid. They believed the platform was cheating and deliberately preventing the coupons from being used in order to save money.

The analysis process

In fact, feedback like this had come in on and off before, and customer service had brushed it aside by pointing to Xiaomi phone flash sales as a comparison. This time the feedback was too strong to ignore. Our front end consists of three products: the APP, the website, and H5. The APP gets the most use and the website comes second. H5 sees very little use in normal times, but its traffic surges during campaigns (campaigns are mostly H5 mini-games, and H5 is also convenient for marketing). Each of the three front-end products is load-balanced through LVS onto two back-end web servers (pictured). This round of feedback concerned mainly the website and the APP, so we focused on those four servers.



Tracing the business logs on the web servers, we found errors at the database update layer saying that no new database connection could be obtained, or that the database connections had been used up. We assumed the database's maximum connection count was too small, so we raised the MySQL maximum connection count to three times its previous value. During the next bid rush we watched the business logs again. No more database connection errors were reported, but many users still reported that pages would not open during the rush.

We continued tracing the web servers. During a bid rush, the command (ps -ef | grep httpd | wc -l) showed about 1,000 httpd processes, and a quick check of the Apache configuration file showed the maximum number of connections set to 1024 (Apache's default maximum is 256). So the connection count was reaching its limit during the rush: many users could not obtain an HTTP connection, which left pages unresponsive and the APP waiting. We raised the maximum number of connections in the Apache configuration file to 1024*3.
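The check described above (a process count compared against the configured limit) can be turned into a small headroom calculation. The following is a minimal sketch, not the tooling used in the incident: the function names and the 5% alert threshold are assumptions made for illustration, and the two input numbers would come from the process count and the Apache configuration.

```python
def worker_headroom(active_workers: int, max_clients: int) -> float:
    """Return the fraction of the configured connection limit that is still free."""
    if max_clients <= 0:
        raise ValueError("max_clients must be positive")
    # Clamp at 0.0 so an over-limit reading does not produce a negative value.
    return max(0.0, 1 - active_workers / max_clients)


def is_saturated(active_workers: int, max_clients: int, threshold: float = 0.05) -> bool:
    """Treat the server as saturated when free capacity is at or below the threshold.

    The 5% default is an illustrative assumption, not a value from the article.
    """
    return worker_headroom(active_workers, max_clients) <= threshold


# Figures quoted in the text: about 1,000 processes against a limit of 1024,
# then the same load against the raised limit of 1024*3.
print(is_saturated(1000, 1024))
print(is_saturated(1000, 1024 * 3))
```

With the figures from the text, the first call reports saturation and the second does not, which matches the observation that raising the limit helped but did not remove the underlying load.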

According to customer service, many users were still reporting problems during bid rushes, though the situation was a little better than before. A few users, however, reported that they had won a bid and then had the money refunded. We then turned to the database servers. Checking the load on the MySQL master and slave with the top command and MySQL Workbench gave us a surprise (as shown in the figure below): every metric on the master had hit its peak, while the slave was under almost no pressure.



Tracing the code, we found that the business code of all three front ends connected to the master, while the slave was used only by back-office queries. We started refactoring immediately: apart from the queries made within the bidding process itself, all queries from other pages and business flows were switched to the slave. After the change, the pressure on the master dropped significantly and the pressure on the slave began to rise (see figure).
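The routing rule described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the platform's actual code: the function name `choose_database`, the `in_bidding_flow` flag, and the use of a `SELECT` prefix to detect reads are all hypothetical simplifications.

```python
def choose_database(sql: str, in_bidding_flow: bool) -> str:
    """Pick the database role for a statement.

    Reads outside the bidding flow go to the slave. Everything else
    (all writes, plus reads made during bidding) stays on the master,
    so a bid never acts on data that replication has not yet caught up on.
    """
    is_read = sql.lstrip().lower().startswith("select")
    if is_read and not in_bidding_flow:
        return "slave"
    return "master"


# Hypothetical statements and table names, for illustration only.
print(choose_database("SELECT * FROM listing WHERE status = 'open'", in_bidding_flow=False))
print(choose_database("SELECT remaining FROM listing WHERE id = 1", in_bidding_flow=True))
print(choose_database("UPDATE listing SET remaining = 0 WHERE id = 1", in_bidding_flow=True))
```

Keeping bidding-time reads on the master trades some master load for consistency, which is the same exception the text makes.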



According to customer service, after the change the problem of winning a bid and then being refunded almost disappeared. The problem of pages failing to open or opening slowly during a bid rush eased somewhat, but some users still reported it. From the analysis above we concluded:

  • 1. The two load-balanced servers have reached their processing limit. More servers are needed to share the load.
  • 2. The pressure on the MySQL master has dropped significantly, but the pressure on the slave has increased. The current one-master, one-slave setup should be changed to one master with multiple slaves.
  • 3. To solve these problems thoroughly, we need to consider optimizing the platform as a whole, for example: business optimization (removing hot spots in the business logic), adding a cache, and making some pages static (the front-end optimization rules from Yahoo and Google can be applied, and many online testing sites can evaluate the result).

On this basis we prepared an optimization report, as follows:

Optimization report

1 Background

As the company's business keeps growing, business volume and user numbers have surged. PV on the official website has risen from XXX-XXX to XXX-XXXX, and the number of active APP users has increased greatly. This poses a greater challenge to the platform's current technical architecture. In particular, listings on the platform have recently been in short supply, so the time it takes a listing to become fully subscribed keeps getting shorter and the servers are under increasing pressure. The current system architecture therefore needs to be upgraded to support larger user and business volumes.

2 User access diagram



At present, the platform has three user-facing products: the official website, the APP, and the mini web page (H5). Of these, the official website and the APP are under the greatest pressure.

3 Existing problems

User complaints during bid rushes focus on the following:

1. The web page or APP will not open.
2. The website or APP opens slowly.
3. After a bid succeeds, the update fails during the transfer because the server is under pressure, and the bid is refunded.
4. Database connections are used up, so adding the investment record fails after the listing is fully subscribed and the listing's progress is rolled back.

4 Analysis

Through in-depth analysis of recent server metrics, concurrency, and system logs, we concluded:

1. During bid rushes, the servers behind the official website and the APP are under great pressure, with the APP's problems more prominent. The number of Apache connections on a single APP server came close to 2,600 during a bid rush, which is near Apache's maximum processing capacity.

2. The database servers are under heavy pressure. Database pressure peaks in two situations: 1) During campaigns, visits to the official website, the mini web page, and the APP increase dramatically, which hugely increases the query volume. When the database reaches its processing limit, the website opens slowly and shows similar problems. 2) During bid rushes, where the pressure comes in two stages: before the rush and during it. Before the rush, because listings sell out so quickly, users open the bidding page in advance and keep refreshing it, so the query pressure on the database keeps rising. If very many users are taking part, the database connections can be used up before the rush even starts. During the rush, a single purchase involves changes and queries on about 15 tables. Each listing is worth 10 million, and about 100-200 people buy it out each time. Taking the median of 150 people, the data has to be updated 2,000-3,000 times within a few seconds (updates only, not counting queries), which produces heavy concurrency. Updates may fail or connections may time out, affecting users' bids and the system's normal full-subscription processing.
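The update estimate above can be checked with a short calculation. This is a rough sketch that assumes every one of the roughly 15 tables touched by a purchase costs exactly one update statement, which is a simplification: the text says the 15 tables are involved in changes and queries, so the real number of updates per purchase may differ.

```python
TABLES_PER_PURCHASE = 15  # "about 15 tables" per purchase, as quoted in the text


def estimated_updates(buyers: int, tables_per_purchase: int = TABLES_PER_PURCHASE) -> int:
    """Estimate update statements for one listing, assuming one update per table per buyer."""
    return buyers * tables_per_purchase


# Low end, median, and high end of the 100-200 buyers quoted in the text.
for buyers in (100, 150, 200):
    print(buyers, estimated_updates(buyers))
```

For the median of 150 buyers this gives 2,250 updates, which sits inside the 2,000-3,000 range stated in the report, all landing within the few seconds it takes a listing to fill.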

5 Solutions

1. Web server solution

Diagram of a single user accessing the web service:



At present, two servers balance the load for the website and for the APP. Apache is installed on each server to accept requests, and each Apache instance can handle about 2,000 connections at most. In theory, therefore, the website or the APP can currently handle about 4,000 concurrent user requests. Supporting 10,000 simultaneous requests would take five Apache servers, so three web servers are currently missing.

Access diagram after the server upgrade:



2. Database solution

Database deployment diagram:






1) Master-slave separation removes about 80% of the query pressure from the master. At present, both the official website and the APP connect to the MySQL master, which multiplies the pressure on it. Migrating all queries in these services to the slave greatly reduces the pressure on the master.

2) Add cache servers. When queries against the slave reach their peak, master-slave synchronization is also affected, which in turn affects transactions. Frequently used queries should therefore be cached to reduce the request pressure on the database. Three additional cache servers are needed to set up a Redis cluster.
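The idea of caching frequently used queries can be sketched with a cache-aside lookup that has an expiry time. The sketch below is an assumption-laden illustration: it uses an in-process dictionary as a stand-in for the Redis cluster mentioned above, and the class name `QueryCache`, the key format, and the 5-second TTL are hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class QueryCache:
    """Cache-aside helper: return a cached result, or run the query and store it."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be exercised without sleeping
        self._entries: Dict[str, Tuple[float, Any]] = {}  # key -> (expires_at, value)

    def get_or_load(self, key: str, loader: Callable[[], Any]) -> Any:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # still fresh: the database is not touched
        value = loader()  # miss or expired: query the database once
        self._entries[key] = (now + self._ttl, value)
        return value


# Usage: many users refreshing the listing page share one database query per TTL window.
db_calls = []


def load_open_listings():
    db_calls.append(1)  # stands in for a real query against the slave
    return ["listing-1", "listing-2"]


cache = QueryCache(ttl_seconds=5)
for _ in range(1000):
    cache.get_or_load("open-listings", load_open_listings)
print(len(db_calls))
```

A short TTL suits pages that users refresh repeatedly before a rush, since slightly stale listing data is acceptable there while the bid itself still reads from the database.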



1) Make the official website's home page static. According to CNZZ statistics, the home page accounts for about 15% of the site's total page views. Data that does not change often should be served as static content so that the official website opens more smoothly.

2) Tune the Apache servers: enable gzip compression, configure a reasonable number of connections, and so on.

3) Remove the update hot spot in the investment process: the listing progress table. Every time a bid succeeds or fails, the listing progress table has to be updated, and concurrent updates from multiple threads run into problems such as optimistic-lock conflicts. Remove this update from the bidding flow and save the progress information to the table only once the listing is fully subscribed, which reduces the pressure on the database during investment.
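The hot-spot removal described above can be illustrated with a toy model that writes the progress record once, when the listing fills, instead of on every bid. This is a single-threaded sketch under stated assumptions: the `Listing` class, its fields, and the write counter are hypothetical, and a real system would still need a concurrency-safe way to track the remaining amount, which this sketch does not address.

```python
class Listing:
    """Toy listing that defers the progress-table write until fully subscribed."""

    def __init__(self, total_amount: int):
        self.total_amount = total_amount
        self.bids = []            # append-only investment records
        self.progress_writes = 0  # counts writes to the (hypothetical) progress table
        self.saved_progress = 0   # last value written to the progress table

    def remaining(self) -> int:
        # Progress is derived from the bid records, not from a shared hot row.
        return self.total_amount - sum(self.bids)

    def place_bid(self, amount: int) -> bool:
        if amount <= 0 or amount > self.remaining():
            return False  # rejected bids no longer touch the progress table
        self.bids.append(amount)
        if self.remaining() == 0:
            self._save_progress()  # single write, only when the listing is full
        return True

    def _save_progress(self) -> None:
        self.saved_progress = sum(self.bids)
        self.progress_writes += 1


listing = Listing(total_amount=100)
listing.place_bid(60)
listing.place_bid(50)   # rejected: exceeds the remaining 40
listing.place_bid(40)   # fills the listing and triggers the one progress write
print(listing.progress_writes, listing.saved_progress)
```

In this model three bid attempts cause a single write to the progress record, whereas updating on every success or failure would have caused three.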

6 Server Upgrade Scheme

1. The platform's greatest pressure comes from the database. The current one-master, one-slave setup needs to become one master with four slaves. The large volume of queries generated by the official website, the APP, and the mini web page is distributed across three slaves through a virtual IP, while back-office management queries go to the fourth slave. Three new database servers are needed.
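The split of query traffic described above can be sketched as a small router. Note that the report distributes front-end queries through a virtual IP at the network level; the application-level round-robin below is a stand-in chosen only to make the distribution visible, and the class name, the source labels, and the host names are all hypothetical.

```python
import itertools
from typing import List


class ReplicaRouter:
    """Send back-office queries to a dedicated slave and rotate the rest across a pool."""

    def __init__(self, frontend_replicas: List[str], admin_replica: str):
        if not frontend_replicas:
            raise ValueError("at least one front-end replica is required")
        self._cycle = itertools.cycle(frontend_replicas)  # simple round-robin
        self._admin = admin_replica

    def pick(self, source: str) -> str:
        if source == "admin":
            return self._admin  # back-office reports never compete with user traffic
        return next(self._cycle)


router = ReplicaRouter(["slave-1", "slave-2", "slave-3"], admin_replica="slave-4")
for source in ("website", "app", "h5", "app", "admin"):
    print(source, router.pick(source))
```

Isolating back-office queries on their own slave means a heavy management report cannot slow down the pages users are refreshing during a bid rush.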



2. Add caching to reduce the pressure on the database. Two new large-memory cache servers are needed.



3. Three new web servers are needed to spread user access requests.

The APP needs two more servers, because the APP servers are under the greatest pressure during bid rushes. Schematic diagram after the change:



The official website needs one more server, because it is also under some pressure during bid rushes. Schematic diagram after the change:



In total, 8 servers need to be purchased, two of which need large memory (more than 64 GB).
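The purchase total can be re-derived from the figures quoted in the report. The sketch below is a cross-check only: the 10,000-request target and the 2,000-connection capacity per Apache server come from the web server solution in section 5, the two existing web servers come from the current setup, and the function and variable names are illustrative.

```python
import math


def servers_needed(target_requests: int, per_server_capacity: int) -> int:
    """Smallest number of servers whose combined capacity covers the target."""
    return math.ceil(target_requests / per_server_capacity)


existing_web_servers = 2
web_total = servers_needed(10_000, 2_000)
web_missing = web_total - existing_web_servers

# Per-category additions as listed in the upgrade plan.
purchase = {"database slaves": 3, "cache servers": 2, "web servers": web_missing}
total = sum(purchase.values())
print(web_total, web_missing, total)
```

The capacity calculation gives five web servers in total and three missing, which agrees with the two APP servers plus one website server listed in the plan and brings the overall purchase to eight.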


Note: once all of the optimizations had gone into production, the problem was solved and bid rushes ran without trouble!


Pure smile: a production accident optimization experience