String processing almost always involves regular expressions, which make matching, extraction, replacement, and similar tasks very convenient.

However, regular expressions are still hard to learn. Concepts such as greedy matching, non-greedy matching, capturing groups, and non-capturing groups are difficult not only for beginners; many people with years of work experience do not fully understand them either.

So how can you learn regular expressions, and learn them quickly?

One of the best ways to learn regular expressions is through the AST.

Conceptually, a regex engine first parses the pattern string into an AST and then matches the target string by walking that AST.

After parsing, the information in the pattern string is stored in the AST. AST stands for abstract syntax tree: as the name implies, it is a tree organized according to syntactic structure. The structure of the AST therefore makes it easy to see which syntax regular expressions support.

How do you view the AST of a regular expression?

You can visualize it with astexplorer.net:

Switch the parser language to RegExp to visualize the AST of a regular expression.

As mentioned earlier, an AST is a tree organized by syntax, so its structure makes it easy to tell different pieces of syntax apart.

Let’s look at the syntax from the perspective of the AST:

/abc/

We’ll start with a simple one. /abc/ is a regex that matches the string ‘abc’. Its AST looks like this:

There are three Char nodes with the values a, b, and c, and their type is simple. Matching then means traversing the AST and matching each of these three characters in turn.
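To make the idea of traversing the AST concrete, here is a minimal sketch. The node shape is a hand-written simplification based on the description above (it is not the exact output of any parser), and `matchAt` and `search` are hypothetical helpers, not part of any real regex engine.

```javascript
// Hand-written, simplified AST for /abc/ (an assumed shape, for illustration only).
const ast = {
  type: 'Alternative',
  expressions: [
    { type: 'Char', value: 'a', kind: 'simple' },
    { type: 'Char', value: 'b', kind: 'simple' },
    { type: 'Char', value: 'c', kind: 'simple' },
  ],
};

// Hypothetical helper: check every Char node in order, starting at `start`.
function matchAt(node, input, start) {
  let pos = start;
  for (const char of node.expressions) {
    if (input[pos] !== char.value) return false;
    pos += 1;
  }
  return true;
}

// Hypothetical helper: first index where the whole sequence matches, or -1.
function search(node, input) {
  for (let i = 0; i <= input.length - node.expressions.length; i++) {
    if (matchAt(node, input, i)) return i;
  }
  return -1;
}

console.log(search(ast, 'xxabcxx')); // 2
console.log(search(ast, 'xxabxcx')); // -1
```

A real engine handles far more node types than Char, but the principle of walking the tree is the same.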

We can test it with the exec API:

The 0th element is the matched string, index is the position where the match starts, and input is the input string.
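The result described above can be reproduced with a short sketch. The input string is an arbitrary example chosen for illustration.

```javascript
// exec returns an array-like match object, or null when nothing matches.
const result = /abc/.exec('xxabcxx');

console.log(result[0]);    // 'abc'     (the matched string)
console.log(result.index); // 2         (where the match starts)
console.log(result.input); // 'xxabcxx' (the original input)

const noMatch = /abc/.exec('xyz');
console.log(noMatch);      // null
```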

Now try special characters:

/\d\d\d/

/\d\d\d/ matches three digits; \d is a metacharacter, a character with a special meaning in regex.

The AST shows that these are also Char nodes, but of type meta:

The metacharacter \d matches any digit:
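A quick check of \d, as a sketch with made-up input strings:

```javascript
const threeDigits = /\d\d\d/;

// The first run of three consecutive digits is matched.
const found = threeDigits.exec('tel: 12345');
console.log(found[0]);    // '123'
console.log(found.index); // 5

// Digits that are not consecutive do not satisfy the pattern.
const missing = threeDigits.exec('a1b2c3');
console.log(missing);     // null
```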

The AST makes it clear which characters are meta and which are simple.

/[abc]/

Regex supports specifying a set of characters with [], which matches any one character in the set.

The AST shows the characters wrapped in a CharacterClass node, which means it can match any one of the characters it contains.

The test confirms this:
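A minimal sketch of such a test, using example inputs chosen for illustration:

```javascript
const anyOfAbc = /[abc]/;

// The first character that belongs to the class is matched.
const first = anyOfAbc.exec('xyzc');
console.log(first[0], first.index);   // 'c' 3

const second = anyOfAbc.exec('bad');
console.log(second[0], second.index); // 'b' 0

// No character from the class means no match.
const none = anyOfAbc.exec('xyz');
console.log(none);                    // null
```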

/a{1,3}/

Regular expressions can specify how many times a character is repeated, in the form {from,to}.

For example, /b{1,3}/ means the character b repeated 1 to 3 times, and /[abc]{1,3}/ means the character class a/b/c repeated 1 to 3 times.

This syntax is called Repetition, as the AST shows:

It has a quantifier attribute, in this case of type range, from 1 to 3.
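The two patterns mentioned above can be tried directly. This is a sketch with example inputs chosen for illustration:

```javascript
// At most three b characters are consumed, even though five are available.
const upToThree = /b{1,3}/.exec('abbbbbc');
console.log(upToThree[0]); // 'bbb'

// A single b satisfies the lower bound of 1.
const justOne = /b{1,3}/.exec('ab');
console.log(justOne[0]);   // 'b'

// The quantifier applies to the whole character class.
const classRun = /[abc]{1,3}/.exec('xcabay');
console.log(classRun[0]);  // 'cab'
```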

Regex also supports shorthand quantifiers: + means 1 to infinity, * means 0 to infinity, and ? means 0 or 1.

These are different types of quantifiers:
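As a rough illustration of the three shorthand quantifiers, the sketch below applies each of them to the character b. The patterns and inputs are examples made up for this purpose.

```javascript
// + : one or more
const plusMany = /ab+/.exec('abbb');
const plusNone = /ab+/.exec('ac');
console.log(plusMany[0]); // 'abbb'
console.log(plusNone);    // null (at least one b is required)

// * : zero or more
const starMany = /ab*/.exec('abbb');
const starNone = /ab*/.exec('ac');
console.log(starMany[0]); // 'abbb'
console.log(starNone[0]); // 'a' (zero b characters is acceptable)

// ? : zero or one
const optMany = /ab?/.exec('abbb');
const optNone = /ab?/.exec('ac');
console.log(optMany[0]);  // 'ab' (at most one b is consumed)
console.log(optNone[0]);  // 'a'
```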

What, some of you might ask, is the greedy property?

Greedy means just what it says: this attribute indicates whether the Repetition matches greedily or non-greedily.

What if you put a ? after the quantifier? You’ll notice that greedy becomes false, which means it has switched to non-greedy matching:

What is the difference between greedy and non-greedy?

Let’s look at an example.

By default a Repetition matches greedily and keeps matching as long as the condition holds, so the whole of acbac is matched here.

Add ? after the quantifier and it switches to non-greedy, matching only the first character:

That is greedy versus non-greedy matching. The AST makes it clear that greediness is a property of repetition syntax: the default is greedy, and following the quantifier with ? switches to non-greedy.
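The difference can be checked with a short sketch. The pattern /[abc]+/ and the input string are assumptions, chosen so that the greedy match is acbac as described above.

```javascript
const text = 'acbacd';

// Greedy (default): consume as many class characters as possible.
const greedy = /[abc]+/.exec(text);
console.log(greedy[0]);      // 'acbac'

// Non-greedy: stop at the fewest repetitions that still match.
const lazy = /[abc]+?/.exec(text);
console.log(lazy[0]);        // 'a'

// The same switch applies to range quantifiers.
const greedyRange = /[abc]{1,3}/.exec(text);
const lazyRange = /[abc]{1,3}?/.exec(text);
console.log(greedyRange[0]); // 'acb'
console.log(lazyRange[0]);   // 'a'
```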

/(aaa)bbb(ccc)/

Regular expressions support extracting part of the matched string into subgroups with ().

Take a look through the AST:

The corresponding AST node is called a Group.

You will also find that it has a capturing attribute, which defaults to true:

What does that mean?

This is the syntax for subgroup capture.

If you don’t want to capture a subgroup, you can write (?:aaa):

The capturing attribute has changed to false.

What’s the difference between capture and non-capture?

Let’s try:

The capturing attribute of a Group determines whether its content is extracted.

We can see from the AST that capturing applies to subgroups. The default is to capture, that is, to extract the content of the subgroup; ?: switches to non-capturing, and the content of the subgroup is not extracted.
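A small sketch comparing the two forms. The input string is an example chosen for illustration, and `Array.from` is used only to print the match entries without the extra index and input properties.

```javascript
const sample = 'aaabbbccc';

// Both groups capture, so both appear after the full match.
const capturing = Array.from(/(aaa)bbb(ccc)/.exec(sample));
console.log(capturing);    // [ 'aaabbbccc', 'aaa', 'ccc' ]

// (?:aaa) still has to match, but it is not extracted.
const nonCapturing = Array.from(/(?:aaa)bbb(ccc)/.exec(sample));
console.log(nonCapturing); // [ 'aaabbbccc', 'ccc' ]
```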

By now we’re fairly comfortable using the AST to learn regex syntax. Let’s try something harder:

/bbb(?=ccc)/

Regular expressions support lookahead assertions, written (?=xxx), which check whether a string is followed by a certain string.

You can see in the AST that this syntax is called an Assertion and is of type lookahead:

What does that mean, and why write it that way? How does it differ from /bbb(ccc)/ and /bbb(?:ccc)/?

Let’s try:

As can be seen from the results:

/bbb(ccc)/ matches the ccc subgroup and extracts it, because subgroups are captured by default.

/bbb(?:ccc)/ matches the ccc subgroup but does not extract it, because ?: marks the subgroup as non-capturing.

/bbb(?=ccc)/ also matches ccc without extracting a subgroup, so it is non-capturing as well. The difference from ?: is that ccc does not appear in the match result.

That is the nature of a lookahead assertion: it states that the match must be followed by a certain string, it does not capture a subgroup, and the asserted string does not appear in the match result.

If the match is not followed by that string, there is no match:
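The three variants can be compared side by side. This is a sketch with example inputs chosen for illustration:

```javascript
const target = 'bbbccc';

// Capturing group: ccc is part of the match and is extracted.
const withCapture = Array.from(/bbb(ccc)/.exec(target));
console.log(withCapture);    // [ 'bbbccc', 'ccc' ]

// Non-capturing group: ccc is part of the match but is not extracted.
const withoutCapture = Array.from(/bbb(?:ccc)/.exec(target));
console.log(withoutCapture); // [ 'bbbccc' ]

// Lookahead: ccc must follow, but it is neither matched nor extracted.
const lookahead = Array.from(/bbb(?=ccc)/.exec(target));
console.log(lookahead);      // [ 'bbb' ]

// Without the required ccc after bbb, the lookahead fails.
const lookaheadFail = /bbb(?=ccc)/.exec('bbbddd');
console.log(lookaheadFail);  // null
```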

/bbb(?!ccc)/

Change ?= to ?! and the meaning changes. Look at the AST:

It is still an Assertion of type lookahead, but with negative set to true.

The negation of ‘followed by a certain string’ is, naturally, ‘not followed by that string’.

The matching result is reversed:

Now it matches only if it is not followed by that string; this is a negative lookahead assertion.
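A short sketch of the reversed behaviour, reusing the same kind of example inputs as for the positive lookahead:

```javascript
const negative = /bbb(?!ccc)/;

// bbb is followed by ccc, so the negative lookahead rejects it.
const rejected = negative.exec('bbbccc');
console.log(rejected);    // null

// bbb is followed by something else, so it matches (ddd is not included).
const accepted = negative.exec('bbbddd');
console.log(accepted[0]); // 'bbb'
```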

/(?<=aaa)bbb/

Besides lookahead assertions there are lookbehind assertions, which match only if the text is preceded by a certain string.

In the same way, a lookbehind assertion can also be negated.

(?<=aaa) is a lookbehind assertion:

For (?<!aaa), the AST adds a negative attribute:
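Both forms can be tried with the sketch below. It assumes a JavaScript engine that supports lookbehind (ES2018 or later, for example a current Node.js), and the inputs are examples chosen for illustration.

```javascript
// Positive lookbehind: bbb must be preceded by aaa.
const behindOk = /(?<=aaa)bbb/.exec('aaabbb');
const behindFail = /(?<=aaa)bbb/.exec('cccbbb');
console.log(behindOk[0], behindOk.index);       // 'bbb' 3
console.log(behindFail);                        // null

// Negative lookbehind: bbb must NOT be preceded by aaa.
const notBehindFail = /(?<!aaa)bbb/.exec('aaabbb');
const notBehindOk = /(?<!aaa)bbb/.exec('cccbbb');
console.log(notBehindFail);                     // null
console.log(notBehindOk[0], notBehindOk.index); // 'bbb' 3
```

As with lookahead, the asserted string aaa is checked but does not appear in the match result.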

Lookahead and lookbehind assertions are among the hardest parts of regular expression syntax to understand.

Conclusion

Regular expressions are a very convenient tool for handling strings, but they are hard to learn. Many people are unclear about syntax such as greedy matching, non-greedy matching, capturing groups, non-capturing groups, lookahead assertions, and lookbehind assertions.

I recommend using the AST to learn regex. The AST is a tree of objects organized by syntactic structure, so the names and attributes of its nodes make each piece of syntax easy to understand.

For example, the AST made the following clear:

Repetition syntax has the form character + quantifier. It is greedy by default (greedy is true), which means it keeps matching for as long as it can. Adding ? after the quantifier switches to non-greedy matching, which stops at the fewest repetitions that still match.

Subgroup syntax (Group) is used to extract substrings. It captures by default (capturing is true). Writing (?:xxx) switches to non-capturing, which matches but does not extract.

Assertion syntax states that a string is followed or preceded by a certain string: a lookahead assertion is written (?=xxx) and a lookbehind assertion is written (?<=xxx). Replacing = with ! negates the assertion, which reverses its meaning.

Which understands the syntax better, the documentation or the compiler?

The compiler, of course!

So it is better to learn syntax from the syntax tree that the parser produces than from documentation.

This is true of regular expressions, and it is also true of learning other syntax. With the AST you can learn the syntax without having to look at the documentation.