Provide inspiration for your intrusion detection system.


If you were given a bunch of user inputs, there would be lots of Chinese place names, like “Beijing,” “Chengdu,” “Dongguan,” and unfortunately some Roman ones mixed in, like “Singapore,” “New York,” and “Tokyo.” Your task is to separate them. How would you do that?

Of course, there are many ways to do this easily.

If it’s a pile of “good”, “fine”, “not bad”, “amazing”, “nice” mixed in with “Fallout 4 is the epitemy of everything wrong with modern” gaming, it has a total of 2 compelling quests, its gameplay is worse then the rest, And to top it off they added microtransactions to it. It is the worst of the fallout series.”

Easier, you might think, to check for string length or even punctuation.

And (select count(*) from data)>0 and ‘a’=’a ‘, ‘> ‘ ‘> ‘

Perhaps you are already impatient: this safety literacy still has! Detect keywords and special symbols!

You’re done asking me to “what if” : There’s nothing a regular expression can’t fix, and if there is, it’s time to learn it again.

Ok, but for now, what we need is to implement all of these scenarios with the same model — when strings are short and long, it separates strings of unusual length. When there are regular strings and strings that contain special symbols, it can pick out the special ones. It can be “purified” when strings are mixed with different languages. There are even unexpected scenarios.

This code is like a heresy trial in a religious war. Whether it’s a traitor or a split, it always picks out the minority based on what the string looks like.

If you happen to have done something like exploring deep learning for network security, you can look at the data set and quickly see the utility of this heresy inquisitor.

Let’s give it a wink. Later, we will implement such a toy.

The bishop’s self-cultivation: look at the face

The distance between Beijing and Chengdu can be easily measured by the Euclidean distance. But what about the distance between “Beijing” and “Chengdu”?

We need to look at faces to define and quantify such differences in terms of the character of the string “appearance.”

It is not hard to see that the distance between strings should contain at least the following components:

  • String length differences (e.gcatmiaomiaomiaomiaomiao)
  • Character set differences (e.g123abc)
  • Character sequence differences (e.gShanghai tap waterWater comes from the sea)

The length of the differences

What is there to say… A string of length 5 obviously has one more 2 than a string of length 3…

I’ll skip it here.

def strLengthDiffer(str1, str2):
	return abs(len(str1) - len(str2))
Copy the code

Character set difference

Character set differences are intended to depict differences in character selection between different strings, and we should penalize large differences in character sets, especially when different classes of characters are present.

To do this, you first define the distance between the characters. Here, we define the distance between the same characters as 0, the distance between the same characters (such as a and b) as 1, and the distance between different characters as 10.

Characters can be classified as lowercase letters, uppercase letters, numbers and other characters. Of course, readers can also classify characters according to their practical use, and divide the differences that the system needs to be sensitive to into two different types.

With the inter-character distance, we define the distance between character A(1) and character set B as the minimum distance between that character A(1) and each character in B.

Based on the above, we further define the distance between character set A and character set B as: the arithmetic sum of the distance between each character in A and B.

Clearly:

  • Character distance (a, b) = character distance (b, a)
  • Distance from character to character set (a, B) = Distance from character to character (B, a)
  • Distance between character sets (A, B) = Distance between character sets (B, A)

Thus, we have completed a cognitively consistent definition of distance between character sets.

def charSetDiffer(s1, s2):
    Since the version of the code used by the author has more complex logic here, the code details are not provided
    # I've made it so clear, write it down
    return s
Copy the code

Character sequence difference

Whether a user input is alert(“test”) or aeelrstt””() obviously has a completely different meaning for developers. The latter string doesn’t get a second glance, while the former, if successfully executed by the user, is likely to be followed up with some other destruction, very evil.

The lesson of this story is that differences in character sequences cannot be ignored.

Here, we use the N-Gram language model to measure the difference between the two sequences by the Gram number at N=2.

If you don’t know what I’m talking about, the specific calculation is something like this:

  1. Suppose we have the strings S1 and S2.
  2. Take the string S1 as an element with two consecutive characters to form the set G1, and likewise G2.
  3. The sequence difference between the strings S1 and S2 is the number of different elements in G1 and G2. Obviously, you can get that value by subtracting their union from their intersection.
def n_grams(a):
    z = (islice(a, i, None) for i in range(2))
    return list(zip(*z))
Copy the code
def groupDiffer(s1, s2):
    len1 = len(list(set(s1).intersection(set(s2))))
    len2 = len(list(set(s1).union(set(s2))))
    return abs((len2 - len1))
Copy the code

I finally have the distance between strings

So far, we’ve quantified the difference between the three formal dimensions of two strings, and all we need to do is figure out the amazing distance between strings by a fancy weighted sum.

Here, the author uses the method —

def samplesDistance(str1, str2):
    a = strLengthDiffer(str1, str2)
    b = charSetDiffer(str1, str2)
    s1 = n_grams(str1)
    s2 = n_grams(str2)
    c = groupDiffer(s1, s2)
    d = a+b+c
    return d
Copy the code

Yes! Add it up…

Mountain is not high, temple is someone to send a brocade flag, algorithm is not complex, useful on the line.

Of course, you can adjust the sensitivity of the system to the three dimensions according to your own needs, but I think the values of character set differences are naturally larger than the values of the other two, so I don’t need to adjust them anymore.

You don’t seem like one of them

Given the distance between strings, further, there is the distance from one string to another. We define it as follows:

The distance between the string sample and the string set = the arithmetic average of the distance between the string sample and each string sample in the string set

That is:

def sampleClassDistance(sample, class1):
    list_0 = []
    length = len(class1)
    for item in class1:
        list_0.append(samplesDistance(sample, item))
    return sum(list_0)/length
Copy the code

You are two kinds

Starting from the distance from one string to a bunch of strings in the previous section, we can get the distance from a bunch of strings to another bunch of strings. The definition is similar:

The distance between sets of strings = the arithmetic mean of the distance between each book in the set of strings and each book in the set of strings = the arithmetic mean of the distance between each book in the set of strings and each book in another string

That is:

def classesDistance(class1, class2):
    list_0 = []
    class1 = flatten(class1)
    class2 = flatten(class2)
    m = len(class1)
    n = len(class2)
    for item1 in class1:
        for item2 in class2:
            list_0.append(sampleDistance(item1, item2))
    return sum(list_0)/(m*n)
Copy the code

There is no party in the class, all kinds of strange

Similarly, you can define “in-class distance” as an attribute inside a bunch of strings. It might be somewhat close to variance in a practical sense. We stipulate that:

Intraclass distance = the distance between the collection of strings and itself

def innerClassesDistanse(class1):
    return classesDistance(class1, class1)
Copy the code

Let’s stop and gather our thoughts

You might be confused by this point, but what’s the point of defining all this distance?

As we said, the key to separating two different types of indeterminate strings is to define the difference, that is, to quantify how different “obviously look” is.

Therefore, we invented some “distance” as a quantitative attribute. If two strings have different lengths, different characters, and different character sequences, then these two strings have quantifiable distance.

If two strings have a distance, then there is a distance between one string and another type of string, between one string and another type of string, and within the same type of string.

When you mingle with people, the most important thing is to find out who is friend and who is enemy. And when you need to divide people into two groups, the most important thing is to know how different the two groups are, and how consistent each group is internally.

In the case of a classification string, you want to be able to quantify the distance between classes and the distance within classes.

Hey, we already have classesDistance() and innerClassesDistanse().

I’ll see you next time 🙂


Editor’s note:

This article remains to be continued, so stay tuned for more updates. References and sample code are presented in the full article.

The author thinks that a clear description will get people who can’t write code to write code, so the code in this article comes from a friend who can’t write Python, and the code style may be a bit strange.

The text/YvesX

You can’t guess what I do for a living, can you

Sound/fluorspar

This article has been authorized by the author, the copyright belongs to chuangyu front. Welcome to indicate the source of this article. Link to this article: knownsec-fed.com/2018-09-25-…

To subscribe for more sharing from the front line of KnownsecFED development, please search our wechat official account KnownsecFED. Welcome to leave a comment to discuss, we will reply as far as possible.

Thank you for reading.