Here is what happened. A new trading system, B, was developed to replace the original system, A. B will not replace A immediately, so the two will coexist for a period of time. To avoid confusion with A, the design of B defined its own message, B1, but it overlooked the fact that A's order-submission message, A1, is monitored by other teams. I therefore needed to adapt B1 to A1 so that the other teams' listeners would not be affected.

Since B1 was not designed to be compatible with A1, the message bodies differ, and a B1 cannot always be converted into a complete A1. So we did a round of confirmation with the other teams: the fields that their consumers actually read must be guaranteed to have values, and the remaining fields can be left unprocessed.
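The agreement above can be sketched as a small converter that fills only the fields the consumers rely on and refuses to emit an A1 when one of them is missing. This is a minimal sketch, not the real code: messages are modeled as plain maps, and every field name (`orderId`, `symbol`, `quantity`, `id`, `instrument`, `qty`) is an assumption made up for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: messages are plain maps and all field names are invented.
final class B1ToA1Converter {

    // A1 fields that the consuming teams said they read (assumed names).
    static final List<String> REQUIRED_A1_FIELDS = List.of("orderId", "symbol", "quantity");

    // Assumed mapping from an A1 field name to the B1 field that carries its value.
    static final Map<String, String> A1_TO_B1_FIELD = Map.of(
            "orderId", "id",
            "symbol", "instrument",
            "quantity", "qty");

    static Map<String, Object> convert(Map<String, Object> b1) {
        Map<String, Object> a1 = new HashMap<>();
        for (String a1Field : REQUIRED_A1_FIELDS) {
            Object value = b1.get(A1_TO_B1_FIELD.get(a1Field));
            if (value == null) {
                // Fail loudly rather than publish an A1 that a consumer cannot use.
                throw new IllegalArgumentException(
                        "B1 has no value for required A1 field: " + a1Field);
            }
            a1.put(a1Field, value);
        }
        // Fields that no consumer reads are deliberately left out.
        return a1;
    }

    public static void main(String[] args) {
        Map<String, Object> b1 = new HashMap<>();
        b1.put("id", "42");
        b1.put("instrument", "ABC");
        b1.put("qty", 100);
        b1.put("venue", "X"); // not needed by any A1 consumer, so it is dropped
        System.out.println(convert(b1));
    }
}
```

Keeping the required-field list in one place makes the agreement with the other teams explicit, so a new consumer requirement becomes a one-line change.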

After a burst of typing, I confidently deployed the change to the test environment.

Then something turned out to be wrong in the test environment.

1. Reconstruct the crime scene

The logs showed that S1, the service that consumes A1, was consuming continuously, even though the test environment should never have that many messages. Log entries were appearing at a rate of several per millisecond. The system's operation interface began to respond slowly and then to time out.

Why was this happening? I started debugging the code locally. It turned out that the message was not converted once and then done with; the conversion kept being triggered over and over. The logs confirmed this: a large number of the messages were identical. Finally, after a hint from a colleague, I found that the problem was in the message's handler method.

The usual way to write it:

void handler(Event a) {
    // do something
}

What I wrote:

void handler(Event a) {
    // do something
    messageConversion(a);
}

void messageConversion(Event a) {
    // Convert the message
    // Broadcast message b1
}

You can skim this code. The company's messaging does not call Kafka directly but goes through a wrapper layer, so the code compiles and runs without errors, yet execution does not follow the expected order. Instead, it falls into a loop that keeps sending messages.
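The wrapper's internals are not shown in this post, so the following is only a guess at one way such a loop can arise, and the real cause may differ. It assumes a hypothetical wrapper that treats every public method taking a single `Event` as a handler, and a helper that publishes back onto the same bus the listener reads from. The names `LoopDemo`, `Listener`, `findHandlers`, and `run` are all invented for this sketch.

```java
import java.lang.reflect.Method;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Hypothetical stand-in for the in-house wrapper. None of this is its real API.
final class LoopDemo {

    static final class Event {
        final String body;
        Event(String body) { this.body = body; }
    }

    // A listener shaped like the one in this post.
    static final class Listener {
        private final Queue<Event> bus;
        Listener(Queue<Event> bus) { this.bus = bus; }

        public void handler(Event a) {
            // do something
            messageConversion(a);
        }

        // Intended as a helper, but it has the same shape as a handler.
        public void messageConversion(Event a) {
            bus.add(new Event(a.body)); // publish the converted message
        }
    }

    // Assumed discovery rule: any method with exactly one Event parameter is a handler.
    static List<Method> findHandlers(Class<?> listenerType) {
        List<Method> handlers = new ArrayList<>();
        for (Method m : listenerType.getDeclaredMethods()) {
            if (!m.isSynthetic() && m.getParameterCount() == 1
                    && m.getParameterTypes()[0] == Event.class) {
                handlers.add(m);
            }
        }
        return handlers;
    }

    // Delivers events until the bus drains or the cap is hit; returns deliveries made.
    static int run(int maxDeliveries) {
        Queue<Event> bus = new ArrayDeque<>();
        Listener listener = new Listener(bus);
        List<Method> handlers = findHandlers(Listener.class);
        bus.add(new Event("order-1")); // a single real message
        int deliveries = 0;
        try {
            while (!bus.isEmpty() && deliveries < maxDeliveries) {
                Event next = bus.poll();
                for (Method m : handlers) {
                    m.invoke(listener, next); // both methods fire for every event
                }
                deliveries++;
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return deliveries;
    }

    public static void main(String[] args) {
        System.out.println("handlers found: " + findHandlers(Listener.class).size());
        System.out.println("deliveries before the cap stopped the loop: " + run(1000));
    }
}
```

In this model a single message never drains: every delivery puts more messages on the bus than it removes, and only the artificial cap ends the loop. Without a cap, the backlog grows for as long as the service runs.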

In the short time it took me to find the problem and fix it, a backlog of about 200 million messages had built up.

Because of the backlog, service performance degraded badly and the service interface began to time out.

On the principle that whoever digs the pit fills it in, clearing this backlog was my job.

2. Quick emergency treatment

With the cause found, all that remained was to reset the consumer group's offset to the current latest position so that the backlog is skipped. Because the test environment is not hosted on a cloud service, there is no console for this, only the command line:

bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group test-group --reset-offsets --topic t1 --to-latest --execute

Immediately after the reset, the runaway consumption in the logs stopped and the service interface's response times returned to normal.
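To see why the reset clears the backlog at once, here is a toy model of what resetting to the latest position means for a consumer group: each partition's committed offset is moved to the log end, so the lag drops to zero and the skipped messages are never read. This is not Kafka client code; the class, the partition numbers, and the offsets are invented for illustration. One practical note: the tool only resets offsets while the group has no active members, so the consuming service has to be stopped first.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of one consumer group's position on a topic. Not Kafka client code.
final class OffsetResetModel {

    // partition -> offset at which the next produced message will be written
    final Map<Integer, Long> logEndOffsets = new HashMap<>();

    // partition -> offset of the next message the group will read
    final Map<Integer, Long> committedOffsets = new HashMap<>();

    // Lag is the number of messages written but not yet consumed.
    long totalLag() {
        long lag = 0;
        for (Map.Entry<Integer, Long> e : logEndOffsets.entrySet()) {
            lag += e.getValue() - committedOffsets.getOrDefault(e.getKey(), 0L);
        }
        return lag;
    }

    // Resetting to latest: move every committed offset to the log end.
    void resetToLatest() {
        committedOffsets.putAll(logEndOffsets);
    }

    public static void main(String[] args) {
        OffsetResetModel group = new OffsetResetModel();
        // Made-up numbers, chosen only to resemble a large backlog.
        group.logEndOffsets.put(0, 120_000_000L);
        group.logEndOffsets.put(1, 80_000_000L);
        group.committedOffsets.put(0, 1_000L);
        group.committedOffsets.put(1, 2_000L);

        System.out.println("lag before reset: " + group.totalLag());
        group.resetToLatest();
        System.out.println("lag after reset: " + group.totalLag());
    }
}
```

The skipped messages stay in the topic until retention removes them; the reset only changes where the group resumes reading. That is acceptable here because the backlog consists of duplicates in a test environment.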

3. Lessons learned

Usage issues: This pitfall was new to me, but it also exposes a problem in our in-house component. For example, the component should validate handlers when it looks them up: when more than one handler is found for the same message, it should either require the caller to specify which one to use or throw an exception.
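The validation suggested above could look roughly like the sketch below. It assumes, purely for illustration, that handlers are discovered by reflection on the parameter type; the real component may work differently, and `HandlerLookup`, `findUniqueHandler`, and the two listener classes are hypothetical names.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a stricter handler lookup; not the real component.
final class HandlerLookup {

    static final class Event {}

    // Exactly one method accepts an Event, so the lookup is unambiguous.
    static final class GoodListener {
        public void handler(Event a) {}
        void convert(String payload) {}
    }

    // Two methods accept an Event, so the lookup cannot know which one is meant.
    static final class AmbiguousListener {
        public void handler(Event a) {}
        public void messageConversion(Event a) {}
    }

    // Returns the single method that accepts eventType, or throws if there is none or several.
    static Method findUniqueHandler(Class<?> listenerType, Class<?> eventType) {
        List<Method> matches = new ArrayList<>();
        for (Method m : listenerType.getDeclaredMethods()) {
            if (!m.isSynthetic() && m.getParameterCount() == 1
                    && m.getParameterTypes()[0] == eventType) {
                matches.add(m);
            }
        }
        if (matches.size() != 1) {
            throw new IllegalStateException("expected exactly one handler for "
                    + eventType.getSimpleName() + " in " + listenerType.getSimpleName()
                    + " but found " + matches.size());
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        System.out.println(findUniqueHandler(GoodListener.class, Event.class).getName());
        try {
            findUniqueHandler(AmbiguousListener.class, Event.class);
        } catch (IllegalStateException e) {
            System.out.println("rejected at startup: " + e.getMessage());
        }
    }
}
```

Failing at lookup time turns a runtime message storm into a startup error, which is much cheaper to notice and fix.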

Release process: Once again, test locally first, and only then deploy to the test environment.