Background

Regular expressions are widely used in daily work to define rules and match data. The following describes the regex requirements in two security scenarios.

Scenario 1: Data is stolen after an FTP account is brute-forced

• Data source: FTP server logs

• Association logic: a specific account is brute-forced, a login with that account then succeeds, and a large number of files are downloaded with that account

• Alarm: FTP account ${user_name} was brute-forced and data was stolen

• Alarm severity: Critical

In scenario 1, regular expressions are used to match the repeated account logins in the logs.

Scenario 2: Deep packet inspection (DPI), for example filtering network threats and traffic that violates security policies

• Data source: network data packets

• Detection rule conditions: a set of data matching rules

In scenario 2, regular expressions are used for security detection across multiple packets in a time series.

In fact, scenario 1 lists only one method of FTP attack; there are many others. A further characteristic of the regex matching scenario for detecting FTP attacks is therefore that the whole rule set may be large. Scenario 2 builds a pattern set from known intrusion behaviors and inspects network packets for behavior that violates security policies or shows signs of attack. In this case, the packet payload must be inspected at high speed, or user experience will suffer.

On the other hand, the regex usage here is not quite the traditional one. Traditionally, given a text, one or a few regex rules are matched against it to find the matching data. Our problem differs in two ways. First, the number of rules: with rule sets of hundreds or thousands of expressions, the old approaches of joining rules with | or looping over them one by one lead to very long processing times and very heavy resource consumption, which is basically unacceptable. Second, the data to be matched is not a complete whole: network packets, for example, arrive one at a time as a stream. Traditional regex engines cannot handle streaming data; they require a batch of data to be cached before matching, which is not timely enough. Regex processing also has a well-known pitfall: a badly written expression can make matching very slow. So a solution is needed that addresses the following challenges:

• Large number of rules

• Fast matching

• Support for streaming data

• Moderate resource consumption
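To make the first challenge concrete, here is a minimal stdlib-Python sketch (the pattern strings are made up) of the outer-loop strategy described above; every record is scanned once per rule, which is exactly what becomes unacceptable at thousands of rules:

```python
import re

# Naive multi-rule matching: loop every rule over every record.
# With thousands of rules this is O(rules x records) full scans.
rules = {i: re.compile(p) for i, p in enumerate(["ftp", "admin", r"\d{4}"])}

def match_record(record):
    hits = []
    for rule_id, cre in rules.items():  # outer loop over the whole rule set
        if cre.search(record):
            hits.append(rule_id)
    return hits
```

Joining the rules with | fares no better: the combined expression grows huge, and recovering *which* rule matched requires extra bookkeeping.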

Introduction to the Hyperscan operator

Given the challenges above, we investigated and compared the mainstream regex matching engines on the market and finally chose Hyperscan.

Hyperscan is Intel’s open-source high-performance regular expression matching library. It provides a C API and has been used in many commercial and open-source projects.

Hyperscan has the following features:

• Supports most PCRE regex syntax (all of it if the Chimera library is used)

• Supports streaming matching

• Supports multi-pattern matching

• Uses specific instruction sets to speed up matching

• Easy to extend

• Combines multiple internal engines

Hyperscan was designed to handle streaming and multi-pattern matching. Streaming mode greatly helps regex users: they no longer need to maintain and cache received data themselves. Multi-pattern matching allows multiple regular expressions to be passed in and matched at the same time.
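For intuition, here is a toy stdlib-Python sketch of what streaming mode spares the user: without engine support, the caller must retain a tail buffer so that matches spanning chunk boundaries are still found (a real Hyperscan stream keeps compact automaton state instead of raw data; class and parameter names here are made up):

```python
import re

class StreamScanner:
    """Toy streaming matcher: hides chunk boundaries from the caller,
    the way Hyperscan's stream mode does. Illustration only."""

    def __init__(self, patterns, on_match, max_match_len=64):
        # patterns: {id: regex string}; on_match(id, start, end)
        self.compiled = {pid: re.compile(p) for pid, p in patterns.items()}
        self.on_match = on_match
        self.buf = ""        # retained tail so cross-chunk matches are found
        self.base = 0        # absolute offset of buf[0] in the whole stream
        self.seen = set()    # avoid double-reporting on re-scan
        self.keep = max_match_len

    def feed(self, chunk):
        self.buf += chunk
        for pid, cre in self.compiled.items():
            for m in cre.finditer(self.buf):
                key = (pid, self.base + m.start(), self.base + m.end())
                if key not in self.seen:
                    self.seen.add(key)
                    self.on_match(*key)
        # Drop all but a tail window; a real engine keeps O(1) NFA/DFA
        # state instead of re-scanning buffered bytes.
        if len(self.buf) > self.keep:
            drop = len(self.buf) - self.keep
            self.base += drop
            self.buf = self.buf[drop:]
```

Note the cost of doing this by hand: re-scanning, dedup bookkeeping, and a bound on match length. Hyperscan's stream API removes all of it from user code.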

Since specific instruction sets are required, Hyperscan has CPU requirements as shown below:

The CPU must support at least the SSSE3 instruction set; newer instruction sets (such as AVX2 and AVX512) further accelerate matching.

Like most regex engines, Hyperscan has compilation and matching phases. Compilation parses the regular expressions and builds an internal database that can be reused across many matches. For multi-pattern matching, each regular expression must be given a unique id at compile time; the id is reported during matching. The compilation process is as follows:

By default Hyperscan returns all hits when matching, unlike some regex engines that return the greedy match when greedy quantifiers are used and the lazy match when lazy quantifiers are used. When a match occurs, the user is notified through a callback function of which regular expression id was hit and at which offset. The matching process is shown in the figure below:
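The compile-then-scan flow with per-expression ids and a match callback can be sketched like this (stdlib Python standing in for the Hyperscan C API; `re` keeps leftmost/greedy semantics, so this only approximates Hyperscan's report-every-hit behavior):

```python
import re

def compile_db(patterns):
    # patterns: list of (unique_id, regex_string), mirroring how
    # Hyperscan's hs_compile_multi takes expressions plus an ids array.
    return [(pid, re.compile(p)) for pid, p in patterns]

def scan(db, data, on_match):
    # On every hit, notify the caller via a callback carrying the
    # expression id and the match end offset, as Hyperscan does.
    for pid, cre in db:
        for m in cre.finditer(data):
            on_match(pid, m.end())
```

A single scan over "foo" with patterns (1, "foo") and (2, "o") reports hits for both ids, each with its own end offset, rather than a single "winning" match.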

The disadvantage of Hyperscan is that it runs only on a single machine and has no distributed capability. It can solve the latency problem, but not the throughput problem. To solve throughput, we can rely on the mainstream real-time computing framework Flink. Flink is a framework and distributed processing engine for stateful computation over unbounded and bounded data streams. Unbounded data has a start but no end; computing over unbounded streams is stream processing. Bounded data has both a start and an end; computing over bounded streams is batch processing.

Flink fits a variety of computing scenarios, three of which are listed here. Flink can power event-driven applications; beyond simple events, it provides a CEP library for complex event processing. Flink can serve as a data pipeline, cleaning, filtering, and transforming data while moving it from one storage system to another. Flink can also run stream or batch analytics, computing metrics for dashboards and similar uses. Flink has become the industry’s recognized first choice for stream processing.

Integrating the regex matching engine into Flink lets the combination draw on Flink’s powerful distributed capability, to much greater effect. The resulting solution is shown in the figure below:

The solution implements a custom UDF operator. The operator supports matching only specified fields of the input data. Its output contains, for each matched field, the field text and a final match state, one of four: hit, miss, error, and timeout; on a hit it also returns the ids of the matched regular expressions. The output additionally includes the original input data, so subsequent processing is unaffected. To make the operator easier to use, a new DataStream is extended, called HyperscanStream, which encapsulates the operator: users simply convert a DataStream to a HyperscanStream and then invoke a method to apply the regex operator. The whole solution is delivered as a separate JAR package, which preserves the usual way of writing Flink jobs and stays decoupled from Flink’s core framework.

The data flows as follows. The data source reads a record and passes it to the downstream Hyperscan operator. The Hyperscan operator hands the data to a Hyperscan child process; after matching, the child process returns the result to the operator, which then passes the original record and the match result on to subsequent operators.
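The operator's output described above can be modeled roughly as follows (a Python sketch for illustration only; the class and state names come from the text, while the exact field shapes are assumptions):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List

class MatchState(Enum):
    # The four final states the operator can report.
    HIT = "hit"
    MISS = "miss"
    ERROR = "error"
    TIMEOUT = "timeout"

@dataclass
class HyperScanRecord:
    field_name: str    # which input field was matched, e.g. "Host"
    state: MatchState  # final state of the match
    rule_ids: List[int] = field(default_factory=list)  # hit expression ids

# The operator emits (original_record, [HyperScanRecord, ...]) so that
# downstream operators still see the untouched input alongside results.
```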

Operator usage

Private deployment

In private deployment scenarios, the user edits a regular expression file, compiles it into a database, and serializes the database to a local file. If HDFS is available in the deployment environment, the serialized file can be uploaded to HDFS; if not, this upload is skipped. The Flink job is then developed to reference the serialized file and match the data.

Why is there a separate compile-and-serialize step? Why not edit the regular expressions and use them directly in the Flink job? As mentioned earlier, Hyperscan execution has compilation and matching phases. If the job referenced only the regular expressions, then with a parallelism of, say, 5, each task would compile them once, wasting resources. Compilation is also relatively slow in Hyperscan, so separating it out speeds up Flink job execution. Compiling ahead of time additionally reveals syntax errors or unsupported expressions before the job starts, rather than after.

For private deployment, the Hyperscan dependencies are provided to the user. They are statically compiled, so no extra dependencies need to be added as long as the machine supports the required instruction set.

Internal use

Internal use is relatively simple. Users can edit regular expressions on the Qilin platform or upload regular expression files to it. The Qilin platform compiles the regular expressions into a database and uploads it to HDFS; the user then develops jobs for matching.

Usage example

Suppose the Host and Referer fields of an HTTP packet are to be matched, as shown in the figure below:

A code example is shown below:

The logic has four steps. First, build the input stream from the data source. Second, convert the input stream to a HyperscanStream. Third, call the hyperscan method to apply the Hyperscan operator; its first parameter, a HyperscanFunction, specifies that the Host and Referer fields are to be matched. Fourth, use the result returned by the match: a Tuple2 object whose first field, Event, is the original record (here the entire HTTP message), and whose second field is a List of HyperScanRecord. The HyperScanRecord class contains the field that was matched (Host or Referer in this case), the ids of the regular expressions that were hit (if any), and the final state of the match.

After testing with a 10,000-rule set and samples of different sizes, the solution achieves the expected performance. The test results are shown in the figure below.

Some suggestions for using the Hyperscan operator are shown below:

As mentioned earlier, when the Chimera library is not used, Hyperscan does not support some PCRE syntax. Be aware of the unsupported syntax shown in the following figure (using the Chimera library affects matching performance).

Future

On the one hand, the Hyperscan operator has so far been used in security and threat-sensing scenarios, but we hope to exercise it in more. In theory it applies to any regex matching scenario, such as text auditing and content extraction.

On the other hand, the Hyperscan operator still needs usability improvements. For example, when rules change, the job currently must be restarted for the change to take effect; in the future we hope to support dynamic hot loading of rules.