Tech /2021/01/24/…

This article assumes that you already know the basic concepts of regular expressions and can read and use them in general, including but not limited to: single-character matching with ., character sets with [], metacharacters such as \s, and repetition with +, *, and ?.

Matching whitespace characters

Metacharacter   Description
[\b]            Backspace
\f              Form feed
\n              Line feed (newline)
\r              Carriage return
\t              Tab
\v              Vertical tab
\s              Any whitespace character; equivalent to [\f\n\r\t\v] (most engines also include the space character)
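
As a quick check of the table, here is a minimal sketch using Python's standard re module. The sample string is made up for illustration; it mixes a tab, a newline, a CRLF pair, a form feed, a vertical tab, and a plain space.

```python
import re

# Hypothetical sample containing a tab, newline, CRLF, form feed, vertical tab, and space.
sample = "a\tb\nc\r\nd\fe\vf g"

# \s+ treats any run of whitespace characters as a single separator.
fields = re.split(r"\s+", sample)
print(fields)  # ['a', 'b', 'c', 'd', 'e', 'f', 'g']

# Inside a character class, \b means backspace (\x08), not a word boundary.
backspaces = re.findall(r"[\b]", "x\x08y")
print(backspaces)  # ['\x08']
```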

Prevent overmatching

For the following text

This offer is not available to customers living in <B>AK</B> and <B>HI</B>.

If you want to match the text inside each pair of tags, you might try <[Bb]>.*</[Bb]>, which matches:

This offer is not available to customers living in ==<B>AK</B> and <B>HI</B>==.

This pattern finds only one match instead of two, because * and + are both "greedy" metacharacters: they match as much text as possible, so the match runs from the first opening tag to the last closing tag rather than stopping at the nearest one. The correct approach is to use the lazy versions of these metacharacters, which match as little as possible.

Greedy   Lazy
*        *?
+        +?
{n,}     {n,}?

For the example above, <[Bb]>.*?</[Bb]> gives the result we want:

This offer is not available to customers living in ==<B>AK</B>== and ==<B>HI</B>==.
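
The difference between greedy and lazy matching can be checked with a short Python sketch. It uses the standard re module and writes out the closing-tag part </[Bb]> of each pattern in full.

```python
import re

text = "This offer is not available to customers living in <B>AK</B> and <B>HI</B>."

# Greedy: .* runs to the last </B>, so both tags end up in a single match.
greedy = re.findall(r"<[Bb]>.*</[Bb]>", text)

# Lazy: .*? stops at the first </B> it can, giving one match per tag pair.
lazy = re.findall(r"<[Bb]>.*?</[Bb]>", text)

print(greedy)  # ['<B>AK</B> and <B>HI</B>']
print(lazy)    # ['<B>AK</B>', '<B>HI</B>']
```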

Backreference

To match headings in HTML, we might use <[Hh][1-6]>.*?</[Hh][1-6]>, but this runs into a problem when the opening and closing tags do not agree:

<H1>This is not valid HTML</H3>

The pattern above matches this text as well, and the problem cannot be solved without backreferences. A backreference lets the second half of a pattern refer to a subexpression defined in the first half. In this example, we use the pattern

<[Hh]([1-6])>.*?</[Hh]\1>

You can think of a backreference as a variable, with \1 standing for the text matched by the first subexpression in the pattern. In the corrected pattern above, ([1-6]) is a subexpression that matches a single digit from 1 to 6, and \1 matches only that same digit, so the problem is solved. You have probably seen backreferences before: $1 is a backreference used in the replacement text of a substitution.
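
Here is a minimal sketch of the heading example in Python, assuming the loose pattern ends in </[Hh][1-6]> and the strict one in </[Hh]\1>. The "Welcome" heading is invented for illustration. Note that Python writes a group reference in replacement text as \1 rather than $1.

```python
import re

valid = "<H1>Welcome</H1>"                       # hypothetical well-formed heading
invalid = "<H1>This is not valid HTML</H3>"      # mismatched tags from the text

loose = re.compile(r"<[Hh][1-6]>.*?</[Hh][1-6]>")
strict = re.compile(r"<[Hh]([1-6])>.*?</[Hh]\1>")  # \1 must repeat the captured digit

loose_hits = [bool(loose.search(s)) for s in (valid, invalid)]
strict_hits = [bool(strict.search(s)) for s in (valid, invalid)]
print(loose_hits)   # [True, True]
print(strict_hits)  # [True, False]

# The same group can be reused in replacement text (\1 here plays the role of $1).
rewrap = re.sub(r"<[Hh]([1-6])>(.*?)</[Hh]\1>", r"<h\1>\2</h\1>", valid)
print(rewrap)  # <h1>Welcome</h1>
```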

Case conversion

Another use case for backreferences is case conversion in replacement text, in flavors that support the following metacharacters.

Metacharacter   Description
\E              Ends a \L or \U conversion
\l              Converts the next character (or subexpression) to lowercase
\L              Converts all characters up to \E to lowercase
\u              Converts the next character (or subexpression) to uppercase
\U              Converts all characters up to \E to uppercase

For example, if you want to convert the heading text of a level 1 heading to uppercase:

Pattern: (<[Hh]1>)(.*?)(</[Hh]1>)

Replacement: $1\U$2\E$3
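
Python's re module does not accept \U and \E in replacement strings, so the sketch below gets the same effect with a replacement function. It assumes a three-group pattern (opening tag, heading text, closing tag); the sample heading and the helper name upper_heading are invented for illustration.

```python
import re

html = "<H1>Welcome to my homepage</H1>"  # hypothetical input

# Group 1: opening tag, group 2: heading text, group 3: closing tag.
pattern = re.compile(r"(<[Hh]1>)(.*?)(</[Hh]1>)")

def upper_heading(m):
    # Same idea as the replacement $1\U$2\E$3: only group 2 is uppercased.
    return m.group(1) + m.group(2).upper() + m.group(3)

result = pattern.sub(upper_heading, html)
print(result)  # <H1>WELCOME TO MY HOMEPAGE</H1>
```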

Lookaround

Lookaround is used when a regular expression needs to mark where the wanted text is, without the marker itself becoming part of the match.

Lookahead

Suppose we want to extract the protocol names from a list of URLs:

http://www.test.com

https://www.example.com

ftp://ftp.aaa.com

We might use .+: to do this, but that pattern matches http:, https:, and ftp:, colon included, so the string needs a second pass to extract the protocol name. With lookahead, .+(?=:) leaves the colon out: the subexpression (?=:) means that a : must be found at that position but is not included in the final match.
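
A small Python sketch of the two patterns side by side, using the three URLs from the text:

```python
import re

urls = ["http://www.test.com", "https://www.example.com", "ftp://ftp.aaa.com"]

# Without lookahead the colon is consumed and ends up in the match.
with_colon = [re.match(r".+:", u).group() for u in urls]

# (?=:) only asserts that a colon follows; it is not part of the match.
without_colon = [re.match(r".+(?=:)", u).group() for u in urls]

print(with_colon)     # ['http:', 'https:', 'ftp:']
print(without_colon)  # ['http', 'https', 'ftp']
```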

Lookbehind

In addition to ?= for lookahead, many regular expression flavors also support lookbehind, with the operator ?<= (JavaScript engines only gained it with ES2018, so older ones lack it). Again, let's look at an example, using the following text:

ABC01: $23.45

HGG43: $5.31

If we want to match the price (excluding the $), [0-9.]+ alone will not work, because it also matches 01 and 43. (?<=\$)[0-9.]+ solves the problem: it matches only digits that come directly after a $.
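
The price example can be tried in Python, whose re module supports lookbehind. This is a sketch, with the two sample lines joined into one string:

```python
import re

text = "ABC01: $23.45\nHGG43: $5.31"

# Without lookbehind, the digits inside the product codes match too.
naive = re.findall(r"[0-9.]+", text)

# (?<=\$) requires a $ immediately before the match without consuming it.
prices = re.findall(r"(?<=\$)[0-9.]+", text)

print(naive)   # ['01', '23.45', '43', '5.31']
print(prices)  # ['23.45', '5.31']
```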

Negative lookaround

A less common variant is negative lookaround: negative lookahead requires that the text ahead does not match a given pattern, and negative lookbehind requires the same of the text behind.

Operator   Description
(?=)       Positive lookahead
(?!)       Negative lookahead
(?<=)      Positive lookbehind
(?<!)      Negative lookbehind

For example, in the following text we only want to match the quantity but not the amount:

I paid $30 for 100 apples,

50 orange, and 60 pears,

I saved $5 on this order.

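\b(?<!\$)\d+\b solves the problem: the negative lookbehind (?<!\$) rejects any number that comes directly after a $.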
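
A hedged Python sketch of the quantity example. The pattern \b(?<!\$)\d+\b used here is an assumption about the intended expression: a whole number that is not directly preceded by a $.

```python
import re

text = "I paid $30 for 100 apples,\n50 orange, and 60 pears,\nI saved $5 on this order."

# \b anchors the start of a number; (?<!\$) rejects it if the previous character is $.
quantities = re.findall(r"\b(?<!\$)\d+\b", text)

print(quantities)  # ['100', '50', '60']
```

The leading \b matters: without it, the engine could skip the 3 in $30 and match the trailing 0, which is not preceded by a $.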

Embedded conditions

North American phone numbers are written as (123)456-7890 or 123-456-7890. The obvious pattern is \(?\d{3}\)?-?\d{3}-\d{4}, but it also matches malformed input such as (123-456-7890. What we need is a condition: if the phone number starts with (, the fifth character must be ); otherwise the character after the area code must be -.

The syntax of an embedded condition is:

(?(backreference)true-regex)

(?(backreference)true-regex|false-regex)

This can be read as:

if (backreference) { true-regex } else { false-regex }

Going back to the phone number problem, the pattern becomes

(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}

To break the pattern down: (\()? matches an optional opening parenthesis and captures it as group 1, and (?(1)\)|-) is a backreference condition. If group 1 matched, a closing parenthesis is required; otherwise a hyphen is required.
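
Python's re module supports this conditional syntax, so the phone number pattern can be tried directly. The sketch below compares the loose pattern with the conditional one on four inputs; the fourth sample, 123)456-7890, is added for illustration. It uses fullmatch so that the whole string has to fit the pattern.

```python
import re

loose = re.compile(r"\(?\d{3}\)?-?\d{3}-\d{4}")
conditional = re.compile(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}")

samples = ["(123)456-7890", "123-456-7890", "(123-456-7890", "123)456-7890"]

# fullmatch requires the entire string to match, not just a substring.
loose_ok = [bool(loose.fullmatch(s)) for s in samples]
conditional_ok = [bool(conditional.fullmatch(s)) for s in samples]

print(loose_ok)        # [True, True, True, True]
print(conditional_ok)  # [True, True, False, False]
```

With search instead of fullmatch, the conditional pattern would still find 123-456-7890 inside (123-456-7890 by starting after the stray parenthesis, so anchoring matters when validating whole strings.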