Tech /2021/01/24/…
This article assumes that you know the basic concepts of regular expressions and can read and use them in general, including but not limited to: single-character matching with ., character sets with [], metacharacters such as \s, and repetition with +, * and ?.
Matching whitespace characters
Metacharacter | Description |
---|---|
[\b] | Backspace (moves back one character) |
\f | Form feed |
\n | Newline |
\r | Carriage return |
\t | Tab |
\v | Vertical tab |
\s | Any whitespace character; equivalent to [\f\n\r\t\v] (most engines also include the space character) |
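As a quick check of the table above, here is a minimal Python sketch using the standard re module. The sample string is made up for illustration. Note that Python's \s also matches the plain space character.

```python
import re

# Hypothetical sample mixing a tab, a newline, and a carriage return + newline.
sample = "id\tname\nrow1\r\nrow2"

# \s+ matches one or more consecutive whitespace characters of any kind,
# so "\r\n" is treated as a single separator.
fields = re.split(r"\s+", sample)
print(fields)
```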
Prevent overmatching
For the following text
This offer is not available to customers living in <B>AK</B> and <B>HI</B>.
If you want to match each pair of tags together with the text between them, you might try <[Bb]>.*</[Bb]>, which matches:
This offer is not available to customers living in ==<B>AK</B> and <B>HI</B>==.
This pattern finds only one match instead of two, because * and + are both "greedy" metacharacters: they match as much text as possible, so the match runs from the first opening tag all the way to the last closing tag. The correct approach is to use the lazy versions of these metacharacters, which match as little as possible.
Greedy metacharacters | Lazy metacharacters |
---|---|
* | *? |
+ | +? |
{n,} | {n,}? |
For the example above, using <[Bb]>.*?</[Bb]> gives the result we want:
This offer is not available to customers living in ==<B>AK</B>== and ==<B>HI</B>==.
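The difference between the greedy and lazy quantifiers can be checked with a short Python sketch. It assumes the full patterns are <[Bb]>.*</[Bb]> and <[Bb]>.*?</[Bb]>, and uses the sample sentence from this section.

```python
import re

text = "This offer is not available to customers living in <B>AK</B> and <B>HI</B>."

# Greedy: .* grabs as much as possible, so the match runs to the last </B>.
greedy = re.findall(r"<[Bb]>.*</[Bb]>", text)

# Lazy: .*? grabs as little as possible, so each tag pair is matched separately.
lazy = re.findall(r"<[Bb]>.*?</[Bb]>", text)

print(greedy)
print(lazy)
```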
Backreference
If we want to match headings in HTML, we might use <[Hh][1-6]>.*?</[Hh][1-6]>, but there is a problem if the text contains an invalid heading:
<H1>This is not valid HTML</H3>
The pattern above also matches this invalid heading, and that cannot be prevented without a backreference. A backreference lets the second half of a pattern refer to a subexpression defined in the first half. In this example, we use the pattern
<[Hh]([1-6])>.*?</[Hh]\1>
You can think of a backreference as a variable, with \1 representing the first subexpression in the pattern. In the corrected pattern above, ([1-6]) is a subexpression that matches a single digit from 1 to 6, and \1 matches only that same digit, so the problem is solved. In fact, you have probably seen a backreference before: $1 is how many tools write a backreference in the replacement text of a substitution.
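A minimal Python sketch of the backreference idea, assuming the corrected pattern is <[Hh]([1-6])>.*?</[Hh]\1>. The valid heading string is a made-up sample.

```python
import re

# ([1-6]) captures the heading level; \1 requires the same digit in the closing tag.
heading = re.compile(r"<[Hh]([1-6])>.*?</[Hh]\1>")

valid = "<H1>Welcome</H1>"                    # hypothetical valid heading
invalid = "<H1>This is not valid HTML</H3>"   # sample from the text

valid_matches = bool(heading.search(valid))
invalid_matches = bool(heading.search(invalid))

print(valid_matches, invalid_matches)
```

Python's re module writes backreferences in replacement text as \1 or \g<1> rather than $1.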
Case conversion
Another use case for backreferences is converting the case of text during a replacement.
Metacharacter | Description |
---|---|
\E | Ends a \L or \U conversion |
\l | Converts the next character (or subexpression) to lowercase |
\L | Converts all characters between \L and \E to lowercase |
\u | Converts the next character (or subexpression) to uppercase |
\U | Converts all characters between \U and \E to uppercase |
For example, if you want to convert the heading text of a level 1 heading to uppercase:
Pattern: (<[Hh]1>)(.*?)(</[Hh]1>)
Replacement: $1\U$2\E$3
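Not every engine supports \U and \E in replacement text; Python's re module does not. The sketch below therefore swaps in a different technique, a replacement function, to get the same effect as $1\U$2\E$3. It assumes the pattern is (<[Hh]1>)(.*?)(</[Hh]1>); the input string and the helper name upper_heading are made up for illustration.

```python
import re

html = "<h1>Welcome to my homepage</h1>"  # hypothetical input

pattern = re.compile(r"(<[Hh]1>)(.*?)(</[Hh]1>)")

def upper_heading(m):
    # Keep the opening and closing tags (groups 1 and 3) unchanged,
    # and uppercase only the heading text (group 2).
    return m.group(1) + m.group(2).upper() + m.group(3)

result = pattern.sub(upper_heading, html)
print(result)
```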
Lookaround
Lookaround is used when a regular expression needs to mark where the text to match is located, without the marker itself becoming part of the match.
Lookahead
Suppose we want to extract the protocol names from a list of URLs:
http://www.test.com
https://www.example.com
ftp://ftp.aaa.com
We might use .+: to do this, but this pattern matches http:, https: and ftp: including the colon, so we would have to process the string a second time to extract the protocol name. Fortunately, the lookahead pattern .+(?=:) leaves the colon out: the subexpression (?=:) means "find the :, but do not include it in the final match".
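The two patterns can be compared side by side in Python. This is a small sketch using the three URLs from this section.

```python
import re

urls = ["http://www.test.com", "https://www.example.com", "ftp://ftp.aaa.com"]

# Without lookahead, the colon is consumed and ends up in the match.
with_colon = [re.match(r".+:", u).group() for u in urls]

# With (?=:), the colon must follow, but it is not part of the match.
protocols = [re.match(r".+(?=:)", u).group() for u in urls]

print(with_colon)
print(protocols)
```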
Lookbehind
Besides ?= for lookahead, many regular expression engines also support lookbehind, written with the operator ?<=. (JavaScript lacked it for a long time; support was added in ES2018.) Again, let's look at an example. For the following text:
ABC01: $23.45
HGG43: $5.31
If we want to match the prices (excluding the $), [0-9.]+ alone will not work, because it also matches 01 and 43. The pattern (?<=\$)[0-9.]+ solves the problem.
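A short Python sketch of the price example, assuming the lookbehind pattern is (?<=\$)[0-9.]+ and using the two sample lines from this section.

```python
import re

text = "ABC01: $23.45\nHGG43: $5.31"

# Without lookbehind, the digits inside the product codes are matched too.
naive = re.findall(r"[0-9.]+", text)

# (?<=\$) requires a $ immediately before the match without including it.
prices = re.findall(r"(?<=\$)[0-9.]+", text)

print(naive)
print(prices)
```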
Negative lookaround
A less common use is negative lookaround: a negative lookahead looks forward for text that does not match a given pattern, and a negative lookbehind does the same looking backward.
Operator | Description |
---|---|
(?=) | Positive lookahead |
(?!) | Negative lookahead |
(?<=) | Positive lookbehind |
(?<!) | Negative lookbehind |
For example, in the following text we only want to match the quantity but not the amount:
I paid $30 for 100 apples,
50 orange, and 60 pears,
I saved $5 on this order.
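The pattern \b(?<!\$)\d+\b does this: the negative lookbehind (?<!\$) rejects any number that directly follows a $.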
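The quantity example can be tried in Python. This sketch assumes the intended pattern is \b(?<!\$)\d+\b, that is, a whole number not directly preceded by $.

```python
import re

text = (
    "I paid $30 for 100 apples,\n"
    "50 orange, and 60 pears,\n"
    "I saved $5 on this order."
)

# (?<!\$) rejects a number directly preceded by $; the \b anchors keep the
# engine from matching only the tail of an amount (such as the 0 in $30).
quantities = re.findall(r"\b(?<!\$)\d+\b", text)

# For comparison, a positive lookbehind picks out the amounts instead.
amounts = re.findall(r"(?<=\$)\d+", text)

print(quantities)
print(amounts)
```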
Embedded conditions
North American phone numbers are written as (123)456-7890 or 123-456-7890. To match both, it is tempting to use \(?\d{3}\)?-?\d{3}-\d{4}, but this expression also matches invalid formats such as (123-456-7890. In this case we need a condition: if the phone number starts with (, the character after the area code must be ); otherwise it must be -.
The syntax of an embedded condition is:
(?(backreference)true-regex)
(?(backreference)true-regex|false-regex)
You can read this as:
if (backreference) { true-regex } else { false-regex }
Going back to the phone number problem, the pattern becomes:
(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}
Analyzing the pattern: (\()? matches an optional opening parenthesis, and (?(1)\)|-) is a condition on backreference 1. It matches ) only if the opening parenthesis was matched, and matches - otherwise.
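Python's re module supports this conditional syntax, so the phone number pattern can be tested directly. The sketch below assumes the pattern is (\()?\d{3}(?(1)\)|-)\d{3}-\d{4}; the fourth candidate string is an extra made-up case.

```python
import re

phone = re.compile(r"(\()?\d{3}(?(1)\)|-)\d{3}-\d{4}")

candidates = [
    "(123)456-7890",   # valid: parenthesized area code
    "123-456-7890",    # valid: dash-separated
    "(123-456-7890",   # invalid: opening parenthesis never closed
    "123)456-7890",    # invalid: closing parenthesis without an opening one
]

# fullmatch requires the whole string to fit the pattern.
results = {c: bool(phone.fullmatch(c)) for c in candidates}
print(results)
```

fullmatch is used here on purpose. With search, the invalid string (123-456-7890 would still produce a match on the substring 123-456-7890, starting after the stray parenthesis, so when scanning free text the pattern usually needs anchors or boundaries around it.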