Recently, after we added some risk control measures, the QPS and TPS of the new-user group-order interface dropped by about 5%~10%. That would not do!

First, a quick introduction to the activity:

Business introduction: as the name implies, a new-user group is a group buy initiated by a new user. If the group succeeds, the system automatically rewards the new user with a "15 yuan off 15.1 yuan" platform coupon.

This is effectively a no-threshold discount. Each user gets only one chance. The main purpose of the group activity is to attract new users.

New-user criterion: does the user have any successfully paid order? If yes, not a new user; if no, a new user.

Current problem: preferential activities like this are easy targets for coupon abusers and fraud rings. We therefore improved the order risk control system so that no fraudster could slip through!

However, because the risk control system must be called synchronously, the QPS and TPS of the whole order interface declined. From a performance standpoint, the new-user group-order interface no longer met its performance targets. So the CTO named me to lead the charge. Onward!

Problem analysis

Risk control judgments generally come in two kinds: online synchronous analysis and offline asynchronous analysis. In real business, both are necessary.

Online synchronous analysis can intercept risk at the point of entry, while offline asynchronous analysis can provide more comprehensive risk judgment base data and risk monitoring capabilities.

Recently we strengthened and optimized the online synchronous risk control rules, which lengthened the execution path of the whole new-user group-order interface and caused the two key indicators, TPS and QPS, to decline.

Solution

The simplest, crudest way to improve performance is to add servers! But mindlessly adding servers does not demonstrate the skill of a good programmer. The CTO said that if I wanted to add servers, their cost would be deducted from my salary…

In the test environment, we did a quick analysis with Spring's StopWatch; the pseudocode is as follows:

```java
@Transactional(rollbackFor = Exception.class)
public CollageOrderResponseVO collageOrder(CollageOrderRequestVO request) {
    StopWatch stopWatch = new StopWatch();

    stopWatch.start("Call risk control system interface");
    // Call the risk control system (HTTP call)
    stopWatch.stop();

    stopWatch.start("Get group activity information");
    // Obtain basic information about the group activity
    stopWatch.stop();

    stopWatch.start("Get basic user information");
    // Obtain basic user information
    stopWatch.stop();

    stopWatch.start("Check whether the user is new");
    // Check whether this is a new user
    stopWatch.stop();

    stopWatch.start("Generate order and save to database");
    stopWatch.stop();

    // Print the task report
    System.out.println(stopWatch.prettyPrint());
    return new CollageOrderResponseVO();
}
```

The result is as follows:

```
StopWatch 'New-user group order': running time = 1195896800 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
014385000  021%  Call risk control system interface
010481800  010%  Get group activity information
013989200  015%  Get basic user information
028314600  030%  Check whether the user is new
028726200  024%  Generate order and save to database
```

In the test environment, the whole interface took about 1.2s, and the most time-consuming step was the new-user check.

This is where we focused our optimization (in fact, it was the only step we could optimize, since there was little room for improvement in the others).

Determine the solution

In this interface, the new-user criterion is whether the user has a successfully paid order, so presumably the developer queried the order database by user ID.

Our order master database has a rather luxurious configuration. However, as the business accumulated, its data long ago passed the ten-million-row mark. Although data is migrated out regularly, the interval between each new ten million orders keeps shrinking… The user ID is indexed, but it is not a unique index, so this query is slower than the other steps.

A simple analysis shows that we only need to know whether the user has ever paid successfully; we do not care how many orders were paid.
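That boolean-per-user shape is exactly what makes a bitmap so cheap: one bit per user ID. A quick back-of-the-envelope check (illustrative numbers, not production measurements):

```java
public class BitmapFootprint {
    public static void main(String[] args) {
        // One bit per user ID: even 100 million IDs need only ~12 MB.
        long users = 100_000_000L;
        long bytes = users / 8;                 // 8 bits per byte
        double mb  = bytes / (1024.0 * 1024.0); // bytes -> megabytes
        System.out.printf("%d users -> %.1f MB%n", users, mb);
    }
}
```

Compare that with a ten-million-row order table scan per request, and the appeal is obvious.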

So this scenario is an obvious fit for Redis's BitMap data structure. In the payment-success logic, we simply add one line to set the bit:

```java
// userId is of type long
String key = "order:f:paysucc";
redisTemplate.opsForValue().setBit(key, userId, true);
```

With this modification, the order-placing code no longer needs to query the order database; instead it does:

```java
Boolean paySuccFlag = redisTemplate.opsForValue().getBit(key, userId);
if (paySuccFlag != null && paySuccFlag) {
    // Not a new user: throw a business exception
}
```
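As a mental model (not the production code), the same set/get semantics can be reproduced locally with `java.util.BitSet`, which is essentially what a Redis bitmap is over the wire:

```java
import java.util.BitSet;

public class NewUserCheckSketch {
    // BitSet stands in for the "order:f:paysucc" key.
    private static final BitSet paySucc = new BitSet();

    static void onPaymentSuccess(int userId) {
        paySucc.set(userId);          // analogue of SETBIT key userId 1
    }

    static boolean isNewUser(int userId) {
        return !paySucc.get(userId);  // analogue of GETBIT key userId == 0
    }

    public static void main(String[] args) {
        onPaymentSuccess(12345678);
        System.out.println(isNewUser(12345678)); // false: has paid before
        System.out.println(isNewUser(87654321)); // true: never paid
    }
}
```

The check is O(1) regardless of how many orders exist, which is why the database round trip disappears from the hot path.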

After modification, the test results in the test environment are as follows:

```
StopWatch 'New-user group order': running time = 82207200 ns
---------------------------------------------
ns         %     Task name
---------------------------------------------
014113100  017%  Call risk control system interface
010193800  012%  Get group activity information
013965900  017%  Get basic user information
014532800  018%  Check whether the user is new
029401600  036%  Generate order and save to database
```

In the test environment, order placement dropped to 0.82s, and the main cost was now the order-persistence step, which involves a transaction and a database insert, so that is reasonable. Interface response time fell by 31%! The performance gain should be even more pronounced in the production environment… Cue the dancing!

A bolt from the blue

The optimization effect was obvious this time; I figured the CTO ought to give me some performance points rather than dock my pay.

Thinking this, I prepared the grayscale release to production. After the release, I was ready to slump into my chair for a good rest while waiting for the tester to verify before end of day.

But I had been lying back for less than a minute when the tester came over and said nervously: "The interface is throwing errors. Come take a look!" What?

When I opened the log, I was dumbfounded. The error log is as follows:

```
io.lettuce.core.RedisCommandExecutionException: ERR bit offset is not an integer or out of range
	at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:135) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	at io.lettuce.core.ExceptionFactory.createExecutionException(ExceptionFactory.java:108) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	at io.lettuce.core.protocol.AsyncCommand.completeResult(AsyncCommand.java:120) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	at io.lettuce.core.protocol.AsyncCommand.complete(AsyncCommand.java:111) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	at io.lettuce.core.protocol.CommandHandler.complete(CommandHandler.java:654) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	at io.lettuce.core.protocol.CommandHandler.decode(CommandHandler.java:614) ~[lettuce-core-5.2.1.RELEASE.jar:5.2.1.RELEASE]
	...
```

"Bit offset is not an integer or out of range." The error is obvious: our offset argument is out of range.

Why? Then it hit me: the underlying data structure of a Redis BitMap is actually a String, and Redis caps a String at 512 MB, which is 2^32 bits, so the bit offset must be less than 2^32… Holy shit!!
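The arithmetic behind that limit, spelled out: 512 MB is 2^29 bytes, i.e. 2^32 bits, so SETBIT/GETBIT offsets can only range from 0 to 2^32 - 1:

```java
public class RedisBitLimit {
    public static void main(String[] args) {
        // Redis strings max out at 512 MB...
        long maxBytes  = 512L * 1024 * 1024;  // 536870912 = 2^29 bytes
        // ...so a single bitmap holds 2^32 bits...
        long maxBits   = maxBytes * 8;        // 4294967296 = 2^32 bits
        // ...and the largest legal bit offset is 2^32 - 1.
        long maxOffset = maxBits - 1;         // 4294967295
        System.out.println(maxOffset);        // prints 4294967295
    }
}
```

Any offset above 4294967295 triggers exactly the "bit offset is not an integer or out of range" error above.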

An Epiphany

For historical reasons, user IDs in the test environment are always 8 digits long, with a maximum value of 99999999. Assume the offset takes that maximum.

An offset of 99,999,999 is well below 2^27 bits (a bitmap of roughly 12 MB), so SETBIT never complained.

In the production environment, however, the user center had changed its ID generation rule: old users' IDs are 8 digits long, while newly registered users' IDs are 18 digits long.

Take the tester's account ID as an example: 652024209997893632 is roughly 2^59, far beyond the maximum bit offset Redis allows (2^32 - 1). It would be strange if it didn't error out!
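For completeness, here is one hypothetical remedy (a sketch only, not what we shipped): shard the bitmap so the high bits of the long ID pick the Redis key and the low 32 bits become the offset, which then always fits the limit. The key naming is an assumption, and the actual Redis call is left commented out:

```java
public class BitmapShardSketch {
    public static void main(String[] args) {
        long userId = 652024209997893632L;   // an 18-digit production ID
        long shard  = userId >>> 32;         // high bits select the bitmap key
        long offset = userId & 0xFFFFFFFFL;  // low 32 bits: always < 2^32
        String key  = "order:f:paysucc:" + shard;  // hypothetical key scheme
        // redisTemplate.opsForValue().setBit(key, offset, true);
        System.out.println(key + " @ " + offset);
    }
}
```

The trade-off is many sparse keys instead of one dense one; whether that is acceptable depends on how the ID space is distributed, which is exactly the kind of business knowledge this incident shows must be checked first.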

We rolled back urgently; fortunately it was only a failed grayscale release. The CTO, considering that I had no way of knowing about these earlier business rule changes, spared me. Performance points? Not having my performance score docked was lucky enough!

This incident exposed several issues that deserve reflection:

① Understand not only the technical system but also the business system

We are all familiar with how to use a BitMap, and most senior developers have solid technical skills. But when a different business system changes, failing to assess the precise scope of impact creates invisible risks.

This problem occurred because we did not understand the user center's ID rule, nor why it had been changed.

② The necessity and importance of a pre-production environment

Another reason this problem slipped through is that we had no pre-production environment, so we could not truly simulate production scenarios. With a pre-production environment, we would at least have had production-like base data: user data, activity data, and so on.

That exposes and resolves problems in advance to a large extent, improving the efficiency and quality of releases to the formal environment.

③ Keep a sense of awe

In a large project, every line of code has a reason behind it; it exists for a purpose.

People don’t write that for no reason. If it doesn’t make sense to you, do a lot of research and understanding to determine the meaning behind each parameter, design changes, etc. To minimize the chance of making a mistake.

Afterword

Before this incident, I had thought the optimization would improve the interface enough to spare us from adding a server. Instead, not only does the production environment now need 1 extra server to temporarily meet the performance targets, but 7 more servers will be added to build a pre-production environment!

All because of a BitMap, 8 servers got dragged in. Hardly worth the pain. Then again: keep the music playing, keep dancing.

Source: r6a.cn/dNTk