Anomaly detection can be defined as “making decisions based on the behavior of actors (human or machine)”. The technique applies to many industries: transaction and loan screening in finance, production-line early warning in manufacturing, intrusion detection in security, and so on.

Depending on the business requirements, stream computing plays different roles: online fraud detection and, after a decision has been made, near-real-time result analysis, global early warning, and rule adjustment.

This article introduces a quasi-real-time anomaly detection system.

“Quasi-real-time” here means that latency must stay within 100 ms. A bank, for example, runs real-time detection on every transaction to decide whether it is normal: if a user’s username and password are stolen, the system can detect the risk at the moment the thief initiates a transaction and decide whether to freeze it.

Such scenarios place very high demands on latency, because slow detection would get in the way of users’ normal transactions. This is why the system is called quasi-real-time.

Because actors may adjust their behavior in response to the system’s decisions, the rules must be updated as well. Stream computing and offline processing are used to investigate whether, and how, the rules need to be updated.

To solve this problem, we designed the following system architecture:

Online system: performs online detection and can take the form of a Web service. It checks each single event on its own; checks it against the global context, such as a global blacklist; and checks it against the user’s portrait or recent activity, such as the time and place of the last 20 transactions.

Kafka: sends events, detection results, and the reasons for them downstream.

Flink near-real-time processing: updates user attributes in near real time, such as the time and place of the most recent transaction. It also summarizes and compares the global detection status, for example to notice that a rule’s interception rate has changed sharply or that the global pass rate has suddenly risen or fallen.

MaxCompute/Hadoop storage and offline analysis: retains historical records and lets business staff explore whether new patterns have appeared.

HBase: stores user portraits.

3. Key modules

3.1 Online detection system

Transaction anomaly detection is implemented in this system, which can be a Web server or be embedded in the client system. In this article we assume a Web server whose main job is to review each incoming event and respond with a yes or a no.

For each incoming event, three levels of detection can be performed:

Event-level detection

Uses only the event itself, for example format validation or basic rule checks (attribute A must be greater than 10 and less than 30, attribute B must not be null, and so on).

Global context detection

Checks the event against global information, for example whether the user is on a global blacklist, or whether some attribute is greater than or less than the global mean.

Portrait content detection

Analyzes the event against multiple records of the actor’s own history. For example, if a user’s previous 100 transactions all took place in Hangzhou and this one takes place in Beijing only 10 minutes after the last, that is a reason to raise an anomaly signal.

The system should therefore keep at least three things: the overall detection workflow, the rules used for judgment, and the global data those rules need. Whether to also cache user portraits locally can be decided as needed.
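
The three levels of detection described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the system's actual implementation: the names Event, Portrait, Decision, and detect, the positive-amount rule, and the 30-minute threshold are all hypothetical and chosen only for the example.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Event:
    user_id: str
    amount: float
    city: str
    timestamp: datetime


@dataclass
class Portrait:
    # Recent (city, timestamp) pairs for one user, oldest first.
    recent: list = field(default_factory=list)


@dataclass
class Decision:
    passed: bool
    reason: str


def check_event_level(event: Event) -> Optional[str]:
    # Level 1: uses only the event itself (hypothetical basic rules).
    if not event.user_id:
        return "user_id is empty"
    if event.amount <= 0:
        return "amount must be positive"
    return None


def check_global(event: Event, blacklist: set) -> Optional[str]:
    # Level 2: uses global context, here a global blacklist.
    if event.user_id in blacklist:
        return "user is on the global blacklist"
    return None


def check_portrait(event: Event, portrait: Portrait,
                   min_gap: timedelta = timedelta(minutes=30)) -> Optional[str]:
    # Level 3: uses the user's own history. A city never seen in the
    # recent history, too soon after the last transaction, is suspicious.
    if not portrait.recent:
        return None
    last_city, last_time = portrait.recent[-1]
    known_cities = {city for city, _ in portrait.recent}
    if event.city not in known_cities and event.timestamp - last_time < min_gap:
        return f"transaction in {event.city} too soon after one in {last_city}"
    return None


def detect(event: Event, blacklist: set, portraits: dict) -> Decision:
    # Run the three levels in order and stop at the first failure.
    portrait = portraits.get(event.user_id, Portrait())
    checks = (
        lambda: check_event_level(event),
        lambda: check_global(event, blacklist),
        lambda: check_portrait(event, portrait),
    )
    for check in checks:
        reason = check()
        if reason is not None:
            return Decision(False, reason)
    return Decision(True, "all checks passed")
```

Running the cheap event-level checks first keeps most rejections away from the more expensive lookups, which matters when the whole decision has to fit in the latency budget.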

3.2 Kafka

Kafka is mainly used to send the detected events, the detection results, and the reasons for rejecting or passing them downstream for stream and offline processing.
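
As an illustration of what such a message might carry, the sketch below builds a key and a JSON value for one decision. The field names (event_id, rule_id, decided_at) and the function build_detection_message are assumptions for the example, not the format used by the original system, and the actual producer call is omitted because it depends on the Kafka client library in use.

```python
import json
from datetime import datetime, timezone


def build_detection_message(event_id, user_id, passed, reason, rule_id=None):
    """Return a (key, value) pair of bytes for one detection result."""
    record = {
        "event_id": event_id,
        "user_id": user_id,
        "passed": passed,          # True = allowed, False = rejected
        "reason": reason,          # why the event passed or was rejected
        "rule_id": rule_id,        # hypothetical id of the rule that fired
        "decided_at": datetime.now(timezone.utc).isoformat(),
    }
    # Keying by user id is one possible choice: with Kafka's default
    # partitioning, records sharing a key go to the same partition.
    key = user_id.encode("utf-8")
    value = json.dumps(record, sort_keys=True).encode("utf-8")
    return key, value
```

Carrying the reason and the rule id in every message is what lets the downstream jobs compute per-rule statistics without calling back into the online system.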

3.3 Flink near real-time processing

Now that anomaly detection has been done in the online system and the decisions have been sent to Kafka, we need to use this data to run a further round of defensive checks on the current policy itself.

Even though known fraud patterns have been built into the model and flagged in the rule base, there are always “smart people” who try to cheat. They study the existing system, guess the rules, and adjust, and their new behaviors may well be beyond our current understanding. So we need a system that detects anomalies across the whole system and helps find new rules.

In other words, the goal here is not to detect problems with individual events but to detect problems with the logic used to detect them. That requires looking at a level above the single event: if something changes at that higher level, there is reason to consider adjusting the rules or logic.

Specifically, the system should focus on macro indicators such as aggregates, averages, the behavior of a group, and so on. Changes in these indicators often indicate that certain rules have failed.

A few examples:

A rule’s interception rate used to be 20% and suddenly drops to 5%.

On the day a rule goes live, a large number of normal users are blocked.

One person’s spending on electronics suddenly increases 100-fold, but many other people show similar behavior at the same time, so there may be a plausible explanation (such as a new iPhone coming out).

An action that is normal when repeated a few times becomes suspicious when repeated too often, such as buying the same product 100 times in one day.

A combination of individually normal behaviors can be abnormal as a whole. For example, buying a kitchen knife, buying a ticket, buying a rope, and filling up at a gas station are each normal, but doing all of them within a short time is not. Such patterns can be found through global analysis.

Based on the near-real-time results of stream computing, the business can discover problems with the rules in time and adjust them.
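
The first example above, a rule whose interception rate suddenly changes, can be sketched as a window-over-window comparison. The code is plain Python rather than Flink API code, and the class name InterceptionRateMonitor, the 10-percentage-point threshold, and the minimum event count are assumptions for illustration. In a real job, close_window would correspond to a window firing.

```python
from collections import defaultdict


class InterceptionRateMonitor:
    """Compare each rule's interception rate with the previous window."""

    def __init__(self, max_change=0.10, min_events=100):
        self.max_change = max_change    # alert if the rate moves more than this
        self.min_events = min_events    # ignore windows with too little data
        self.previous_rates = {}        # rule_id -> rate in the last full window
        self._counts = defaultdict(lambda: [0, 0])  # rule_id -> [intercepted, total]

    def record(self, rule_id, intercepted):
        counts = self._counts[rule_id]
        counts[0] += int(intercepted)
        counts[1] += 1

    def close_window(self):
        """End the current window and return (rule_id, old_rate, new_rate) alerts."""
        alerts = []
        for rule_id, (hits, total) in self._counts.items():
            if total < self.min_events:
                continue
            rate = hits / total
            previous = self.previous_rates.get(rule_id)
            if previous is not None and abs(rate - previous) > self.max_change:
                alerts.append((rule_id, previous, rate))
            self.previous_rates[rule_id] = rate
        self._counts.clear()
        return alerts
```

An alert here does not say that any single event was misjudged; it says that the rule itself deserves a second look, which is the point of this layer.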

Stream computing can also update user portraits in real time, for example statistics on the user’s behavior over the past 10 minutes or the locations of the last 10 logins.
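
A portrait of this kind can be kept with two bounded structures, as in the sketch below. The class UserPortrait and its method names are hypothetical, the code assumes events for a user arrive in timestamp order, and in the architecture described here the state would live in the stream job and HBase rather than in a Python object.

```python
from collections import deque
from datetime import datetime, timedelta


class UserPortrait:
    """Rolling view of one user's recent activity."""

    def __init__(self, max_locations=10, window=timedelta(minutes=10)):
        # deque(maxlen=...) drops the oldest location automatically.
        self.recent_locations = deque(maxlen=max_locations)
        self.window = window
        self._event_times = deque()

    def update(self, location, timestamp):
        self.recent_locations.append(location)
        self._event_times.append(timestamp)
        # Evict timestamps that have fallen out of the time window.
        while self._event_times and timestamp - self._event_times[0] > self.window:
            self._event_times.popleft()

    def events_in_window(self):
        """Number of events within the window ending at the latest event."""
        return len(self._event_times)
```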

3.4 MaxCompute/Hadoop offline storage for exploratory analysis

This is where exploratory analysis is done with scripts, SQL, or machine learning algorithms to discover new models, for example clustering users with a clustering method, labeling behavior and then training a model, or periodically recalculating user portraits. The details depend heavily on the business.
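
As one small example of the periodic recalculation mentioned above, the sketch below rebuilds a simple portrait per user from historical rows. It is plain Python for illustration only; on MaxCompute or Hadoop this would more likely be a SQL or batch job, and the row layout (user_id, city, amount) and the portrait fields are assumptions.

```python
from collections import Counter, defaultdict


def recompute_portraits(history):
    """Build {user_id: portrait} from (user_id, city, amount) rows."""
    cities = defaultdict(Counter)            # user_id -> city frequencies
    totals = defaultdict(lambda: [0.0, 0])   # user_id -> [amount sum, row count]
    for user_id, city, amount in history:
        cities[user_id][city] += 1
        totals[user_id][0] += amount
        totals[user_id][1] += 1
    return {
        user_id: {
            "usual_city": cities[user_id].most_common(1)[0][0],
            "mean_amount": totals[user_id][0] / totals[user_id][1],
            "transactions": totals[user_id][1],
        }
        for user_id in cities
    }
```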

3.5 HBase user portraits

HBase stores the user portraits generated by stream computing and offline computing for use by the detection system. HBase is chosen because it meets the real-time query requirements.

4. Summary

The above gives the conceptual design of a quasi-real-time anomaly detection system. Although the business logic is simple, the system as a whole is complete and scales well, so it can be further improved on this basis.
