A Guide to Regular Expressions

Preface

Regular expressions are one of the few great inventions in software, alongside packet-switched networking, the Web, Lisp, hashing, UNIX, compiler technology, the relational model, and object orientation. Regular expressions themselves are simple, elegant, powerful, and endlessly useful.

The syntax of regular expressions is not that difficult to learn, and you can pick up a lot from examples and from other people's patterns. But a couple of fast-food articles rarely lead to real understanding: the next time you need a regex you have to search again, and the effort is wasted, like drawing water with a bamboo basket. This holds not only for regular expressions but for other technical topics too; they require systematic study. Read the classic books and stand on the shoulders of giants.

There is so much to cover that I will focus on what is likely to be used in everyday development. If you want to understand more, I recommend reading the book "Mastering Regular Expressions."

(Put simply, learning regular expressions means high investment and low return.)

This article is fairly long; feel free to jump to the parts you are interested in.

1. Introduction to regular expressions

A regular expression is a formal way to describe the structural pattern of a string. Regular expressions originated in mathematics and became popular through Perl's regex engine. JavaScript has supported regular expressions since ES3, and ES6 extended that support.

How regular expressions work

For fixed strings, simple string-matching algorithms (such as KMP) are faster. For complex, variable text processing, however, regular expressions do better. How exactly does regular expression matching work? That takes us into compiler theory (which really was the most troublesome course of my junior year).

A regular expression engine is implemented on a particular theoretical model: finite automata, also known as finite-state machines. For details, see the reference documents at the bottom of this article.

Character groups

Character group    Meaning
[ab]               Matches a or b
[0-9]              Matches 0, 1, 2, ... or 9
[^ab]              Matches any character except a and b

Shorthand    Meaning
\d           Equivalent to [0-9]: a digit
\D           Equivalent to [^0-9]: a non-digit character
\w           Equivalent to [_0-9a-zA-Z]: a word character (note that the underscore is included)
\W           Equivalent to [^_0-9a-zA-Z]: a non-word character
\s           Whitespace: space, \t, \v, \n, \r, \f (and other Unicode whitespace)
\S           Non-whitespace: the complement of \s
.            Equivalent to [^\n\r\u2028\u2029]. Wildcard: matches any character except line feed, carriage return, line separator, and paragraph separator
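The character groups in the tables above can be tried out with a few one-liners. This is a minimal sketch; the sample strings and variable names are made up for illustration.

```javascript
// Explicit character groups
const hasAOrB = /[ab]/.test('xbz');                       // true
const withoutAB = 'abcab'.replace(/[^ab]/g, '');          // 'abab'

// Shorthand classes
const digits = 'Order 66, aisle 7'.match(/\d+/g);         // ['66', '7']
const words = 'snake_case and kebab-case'.match(/\w+/g);  // ['snake_case', 'and', 'kebab', 'case']
const collapsed = 'a \t\n b'.replace(/\s+/g, ' ');        // 'a b'

// The dot does not match line terminators
const dotMatchesNewline = /a.b/.test('a\nb');             // false

console.log(hasAOrB, withoutAB, digits, words, collapsed, dotMatchesNewline);
```

Note that \w treats the underscore as a word character but not the hyphen, which is why "kebab-case" is split in two.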

Quantifiers

Greedy (match-first) quantifier    Lazy (ignore-first) quantifier    Meaning
{m,n}                              {m,n}?                            At least m and at most n occurrences
{m,}                               {m,}?                             At least m occurrences
{m}                                {m}?                              Exactly m occurrences; equivalent to {m,m}
?                                  ??                                Equivalent to {0,1}
+                                  +?                                Equivalent to {1,}
*                                  *?                                Equivalent to {0,}
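To make the table concrete, here is a small sketch comparing a greedy quantifier with its lazy counterpart and checking the shorthand equivalences. The strings are arbitrary examples.

```javascript
const text = 'aaaa';

// Greedy takes as many repetitions as allowed; lazy takes as few as allowed
const greedy = text.match(/a{2,3}/)[0];   // 'aaa'
const lazy = text.match(/a{2,3}?/)[0];    // 'aa'

// + means {1,}, * means {0,}, ? means {0,1}
const plusNeedsOne = /^a+$/.test('');     // false
const starAllowsZero = /^a*$/.test('');   // true
const optionalU = /^colou?r$/.test('color') && /^colou?r$/.test('colour'); // true

console.log(greedy, lazy, plusNeedsOne, starAllowsZero, optionalU);
```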

Anchors and assertions

Anchors are constructs that do not match any text themselves; they only check whether the text to the left and right of a given position meets some requirement. There are three common kinds of anchor: line start/end, word boundary, and lookaround. ES5 has six anchors.

Anchor    Meaning
^         Matches the start of the input; in multi-line mode, also matches the start of each line
$         Matches the end of the input; in multi-line mode, also matches the end of each line
\b        Word boundary: a position between \w and \W
\B        Non-word boundary
(?=p)     A position followed by text that matches p
(?!p)     A position not followed by text that matches p

Note that \b also covers the positions between \w and ^ and between \w and $, that is, the start or end of the input when a word character sits there. The figure in the original post illustrates this.
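Because anchors match positions rather than characters, a quick way to see them is to insert a marker at each matched position. The sketch below does that for \b and then shows ^ with and without multi-line mode, plus the two lookaheads; all sample strings are illustrative.

```javascript
// \b is zero-width: replacing it inserts a marker without consuming text
const boundaries = 'hello world'.replace(/\b/g, '|');   // '|hello| |world|'

// ^ matches only at the start of the input unless the m modifier is set
const firstLineOnly = 'one\ntwo'.match(/^\w+/g);        // ['one']
const everyLine = 'one\ntwo'.match(/^\w+/gm);           // ['one', 'two']

// Lookahead checks what follows a position without including it in the match
const beforeBar = 'foobar foobaz'.replace(/foo(?=bar)/g, 'FOO');    // 'FOObar foobaz'
const notBeforeBar = 'foobar foobaz'.replace(/foo(?!bar)/g, 'FOO'); // 'foobar FOObaz'

console.log(boundaries, firstLineOnly, everyLine, beforeBar, notBeforeBar);
```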

Modifiers

Modifiers (flags) set the rules a pattern follows when matching. ES5 has three matching modes: case-insensitive, multi-line, and global. ES6 added the u and y modifiers, and ES2018 added s. The modifiers are as follows.

Modifier    Meaning
i           Case-insensitive matching
m           Multi-line matching: ^ and $ also match at the start and end of each line
g           Global matching
u           Unicode mode: correctly handles Unicode characters above \uFFFF, that is, characters encoded as four bytes in UTF-16
y           Sticky mode: like g, each match starts at the position after the previous successful match, but the match must begin exactly at that first remaining position, which is what "sticky" means
s           dotAll mode: lets . match line terminators as well

2. Regular expression methods

There are four methods for string objects that can use regular expressions: match(), replace(), search(), and split().

In ES6, these four methods call the corresponding RegExp instance methods internally, so that all regex-related methods are defined on the RegExp object.

String.prototype.match calls RegExp.prototype[Symbol.match]

String.prototype.replace calls RegExp.prototype[Symbol.replace]

String.prototype.search calls RegExp.prototype[Symbol.search]

String.prototype.split calls RegExp.prototype[Symbol.split]
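One way to observe this delegation is to pass String.prototype.match an object that is not a RegExp but defines Symbol.match. The evenDigits object below is a hypothetical matcher invented for this sketch, not a built-in.

```javascript
// Hypothetical matcher: not a RegExp, but it implements Symbol.match
const evenDigits = {
  [Symbol.match](str) {
    return str.split('').filter((ch) => '02468'.includes(ch));
  },
};

// String.prototype.match hands the string over to evenDigits[Symbol.match]
const viaString = '123456'.match(evenDigits);                        // ['2', '4', '6']

// For a real regex, calling the symbol method directly gives the same result as str.match
const viaSymbol = RegExp.prototype[Symbol.match].call(/\d/g, 'a1b2'); // ['1', '2']
const viaMatch = 'a1b2'.match(/\d/g);                                 // ['1', '2']

console.log(viaString, viaSymbol, viaMatch);
```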

String.prototype.match
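A minimal sketch of how match behaves with and without the g modifier; the date string is just sample data.

```javascript
const date = '2019-01-24';

// Without g: the first match, its capture groups, and the match index
const single = date.match(/(\d{4})-(\d{2})/);
// single[0] === '2019-01', single[1] === '2019', single[2] === '01', single.index === 0

// With g: every full match, but no capture groups
const all = date.match(/\d+/g);   // ['2019', '01', '24']

// No match at all: null rather than an empty array
const none = date.match(/x/);     // null

console.log(single[0], single[1], single[2], single.index, all, none);
```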

String.prototype.replace

String.prototype.replace is probably one of the most commonly used methods, so here is a detailed guide to its various uses.

The first argument to replace can be a regex or a string (a string has no global mode and matches only once); it identifies the text you want to replace.

The second argument can be a string, or a function that returns a string. Note that if you pass a string, the JS engine recognizes some special replacement patterns in it:

Pattern    Inserted value
$$         Inserts a "$".
$&         Inserts the matched substring.
$`         Inserts the content to the left of the matched substring.
$'         Inserts the content to the right of the matched substring.
$n         If the first argument is a RegExp object and n is a positive integer less than 100, inserts the string matched by the nth parenthesized group. Tip: indexes start at 1. Mind the capture group numbering rule below.

If you are unsure how capture groups are numbered, here is a simple rule: count the opening parentheses "(" from left to right (ignoring non-capturing groups such as (?:…)); the nth one starts capture group n.

(This is especially useful when capture groups are nested, and in function mode when destructuring the arguments.)

$` : the text to the left of what the regex matched

$' : the text to the right of what the regex matched

$& : the text that the regex matched

$n : the text matched by capture group n
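The replacement patterns above can be checked one at a time. This is a small sketch with made-up input strings; the last line also shows the left-to-right numbering of nested groups.

```javascript
const swapped = 'John Smith'.replace(/(\w+) (\w+)/, '$2, $1'); // 'Smith, John'
const wrapped = 'abc'.replace(/b/, '[$&]');                    // 'a[b]c'
const leftPart = 'abc'.replace(/b/, '$`');                     // 'aac'
const rightPart = 'abc'.replace(/b/, "$'");                    // 'acc'
const dollar = 'cost: 5'.replace(/\d+/, '$$$&');               // 'cost: $5'

// Nested groups: group 1 is the outer "(", groups 2 and 3 are the inner ones
const nested = '2019-01'.replace(/((\d{4})-(\d{2}))/, '$1|$2|$3'); // '2019-01|2019|01'

console.log(swapped, wrapped, leftPart, rightPart, dollar, nested);
```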

If the second argument is a function, you can filter or supplement the matched content.

The function receives the following arguments:

Parameter      Meaning
match          The matched substring. (Corresponds to $& above.)
p1, p2, ...    If the first argument to replace() is a RegExp object, the string matched by the nth parenthesized group. (Corresponds to $1, $2, etc. above.) For example, with /(\a+)(\b+)/, p1 is the match for \a+ and p2 is the match for \b+.
offset         The offset of the matched substring within the original string. (For example, if the original string is "abcd" and the matched substring is "bc", this argument is 1.)
string         The original string being matched.

An example: matching the addresses of the image tags inside a piece of rich text.

If you use a function to replace text, you can basically do whatever you want
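As a sketch of the function form, the snippet below collects every img src from a piece of rich text and rewrites it at the same time. The HTML string and the cdn.example.com prefix are assumptions made up for this example; the pattern is the image-tag regex listed later in this article.

```javascript
const html = '<p>Hi</p><img src="a.png" alt="A"><img class="x" src="b.jpg">';
const sources = [];

// match is the whole <img ...> tag, p1 is the first capture group (the src value)
const rewritten = html.replace(/<img [^>]*?src="(.*?)"[^>]*?>/g, (match, p1) => {
  sources.push(p1);
  return match.replace(p1, 'https://cdn.example.com/' + p1);
});

// With no capture groups, the second argument is the offset of the match
const offsetDemo = 'abcd'.replace(/bc/, (match, offset) => String(offset)); // 'a1d'

console.log(sources);    // ['a.png', 'b.jpg']
console.log(rewritten);
console.log(offsetDemo);
```

A regex is fine for well-formed snippets like this one; for arbitrary HTML a real parser is usually the safer choice.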

String.prototype.search

String.prototype.split
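A minimal sketch of split with a regex separator; the inputs are arbitrary. When the separator contains a capture group, the captured text is kept in the result.

```javascript
const parts = '2019-01-24'.split(/-/);            // ['2019', '01', '24']
const trimmedList = 'a , b,c'.split(/\s*,\s*/);   // ['a', 'b', 'c']
const withSeparators = 'a1b2c'.split(/(\d)/);     // ['a', '1', 'b', '2', 'c']
const limited = '2019-01-24'.split(/-/, 2);       // ['2019', '01']

console.log(parts, trimmedList, withSeparators, limited);
```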

RegExp.prototype.test

test is similar to String.prototype.search, except that test returns a Boolean while search returns an index (or -1 if nothing matches).
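The difference is easiest to see side by side. A small sketch with sample strings:

```javascript
const re = /\d+/;

const found = re.test('abc123');          // true
const foundAt = 'abc123'.search(re);      // 3
const notFound = re.test('abc');          // false
const notFoundAt = 'abc'.search(re);      // -1

console.log(found, foundAt, notFound, notFoundAt);
```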

RegExp.prototype.exec
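A minimal sketch of the usual exec loop: with the g modifier, each call continues from lastIndex and returns one match with its capture groups, or null when there are no more. The input string and property names are illustrative.

```javascript
const re = /(\w)(\d)/g;
const str = 'a1 b2 c3';
const found = [];

let m;
while ((m = re.exec(str)) !== null) {
  found.push({ text: m[0], letter: m[1], digit: m[2], index: m.index });
}

console.log(found.map((f) => f.text));  // ['a1', 'b2', 'c3']
console.log(found.map((f) => f.index)); // [0, 3, 6]
console.log(re.lastIndex);              // 0 (reset after the final failed match)
```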

3. Commonly used regex features

This part covers the modifiers added in ES6 and later (u, y, s; g, m, and i need no introduction), greedy and non-greedy modes, and lookahead/lookbehind assertions.

‘u’ modifier

ES6 adds the u modifier to regular expressions. It stands for "Unicode mode" and is used to correctly handle Unicode characters above \uFFFF, that is, characters encoded as four bytes in UTF-16. Enough talk; the picture in the original post shows the difference.

Unfortunately, browser compatibility as listed on MDN (as of 2019.01.24) is limited, so it will be some time before this can be used in production.
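The effect of the u modifier can be sketched with a single character above \uFFFF. The code point U+20BB7 is just a convenient sample.

```javascript
// U+20BB7 is stored as two UTF-16 code units (four bytes)
const char = String.fromCodePoint(0x20BB7);
const codeUnits = char.length;                     // 2

// Without u, the dot sees two separate code units; with u, one character
const oneCharWithoutU = /^.$/.test(char);          // false
const twoUnitsWithoutU = /^.{2}$/.test(char);      // true
const oneCharWithU = /^.$/u.test(char);            // true

// The \u{...} code point escape is only recognized with the u modifier
const codePointEscape = /\u{20BB7}/u.test(char);   // true

console.log(codeUnits, oneCharWithoutU, twoUnitsWithoutU, oneCharWithU, codePointEscape);
```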

‘y’ modifier

In addition to the u modifier, ES6 also adds a y modifier to regular expressions, called the "sticky" modifier.

The y modifier is similar to the g modifier in that it is also a global match: each subsequent match starts at the position following the last successful match. The difference is that the g modifier succeeds as long as there is a match somewhere in the remaining text, while the y modifier requires the match to start at the first remaining position, which is what "sticky" means.

var s = 'aaa_aa_a';
var r1 = /a+/g;
var r2 = /a+/y;

r1.exec(s) // ["aaa"]
r2.exec(s) // ["aaa"]

r1.exec(s) // ["aa"]
r2.exec(s) // null

The code above has two regular expressions, one using the g modifier and the other using the y modifier. Each is executed twice. On the first execution they behave the same, leaving _aa_a as the remaining string. Since the g modifier has no position requirement, the second execution returns a result, whereas the y modifier requires the match to begin at the head of the remaining string, so it returns null.

If you change the regular expression so that it matches at the head every time, the y modifier returns a result.

var s = 'aaa_aa_a';
var r = /a+_/y;

r.exec(s) // ["aaa_"]
r.exec(s) // ["aa_"]

Each time the above code matches, it starts at the head of the remaining string.

The y modifier can be better illustrated with the lastIndex property.

const REGEX = /a/g;

// Start matching at position 2 (the "y")
REGEX.lastIndex = 2;

// The match succeeds
const match = REGEX.exec('xaya');

// The match was found at position 3
match.index // 3

// The next match starts at position 4
REGEX.lastIndex // 4

// Matching from position 4 fails
REGEX.exec('xaya') // null

In the code above, the lastIndex property specifies where each search starts; with the g modifier, the search moves forward from that position until a match is found.

The y modifier also follows the lastIndex property, but requires that a match be found at the position specified by lastIndex.

const REGEX = /a/y;

// Start matching at position 2
REGEX.lastIndex = 2;

// Position 2 is not an "a", so the sticky match fails
REGEX.exec('xaya') // null

// Start matching at position 3
REGEX.lastIndex = 3;

// Position 3 sticks, so the match succeeds
const match = REGEX.exec('xaya');
match.index // 3
REGEX.lastIndex // 4

In fact, the y modifier implies the start anchor ^.

/b/y.exec('aba')
// null

The code above cannot match at the head of the string, so it returns null. The y modifier was designed so that the start anchor ^ is effective in global matching.

Here is an example of the replace method on a string object.

const REGEX = /a/gy;
'aaxa'.replace(REGEX, '-') // '--xa'

In the code above, the last a is not replaced because it does not appear at the head of the next match.

With only the y modifier, the match method returns just the first match; it must be combined with the g modifier to return all matches.

'a1a2a3'.match(/a\d/y) // ["a1"]
'a1a2a3'.match(/a\d/gy) // ["a1", "a2", "a3"]

One application of the y modifier is extracting tokens from a string. The y modifier ensures that no characters are skipped between matches.

const TOKEN_Y = /\s*(\+|[0-9]+)\s*/y;
const TOKEN_G  = /\s*(\+|[0-9]+)\s*/g;

tokenize(TOKEN_Y, '3 + 4')
// ['3', '+', '4']
tokenize(TOKEN_G, '3 + 4')
// ['3', '+', '4']

function tokenize(TOKEN_REGEX, str) {
  let result = [];
  let match;
  while (match = TOKEN_REGEX.exec(str)) {
    result.push(match[1]);
  }
  return result;
}

In the above code, if there are no illegal characters in the string, the y modifier and g modifier extract the same result. However, once an illegal character is present, the two behave differently.

tokenize(TOKEN_Y, '3x + 4')
// ['3']
tokenize(TOKEN_G, '3x + 4')
// ['3', '+', '4']

In the code above, the g modifier skips over illegal characters and the y modifier does not, which makes errors easy to spot.

Unfortunately, browser compatibility isn't great here either.

However, if your project already uses Babel, you can use these two modifiers via the following plugins, respectively:

@babel-plugin-transform-es2015-sticky-regex

@babel-plugin-transform-es2015-unicode-regex

‘s’ modifier

In a regular expression, the dot (.) is a special character that stands for any single character, with two exceptions. One is four-byte UTF-16 characters, which can be handled with the u modifier; the other is line terminators.

A line terminator is a character that indicates the end of a line. The following four characters are line terminators.

  • U+000A newline (\n)
  • U+000D Carriage return (\r)
  • U+2028 Line separator
  • U+2029 Paragraph separator

Browser support for this modifier is also poor, but there are ways to emulate its effect, even if they are less readable:

/foo.bar/.test('foo\nbar') // false
/foo[^]bar/.test('foo\nbar') // true
/foo[\s\S]bar/.test('foo\nbar') // true (the form I prefer)

Greedy vs. non-greedy (lazy)

Greedy mode: the regular expression matches as many characters as it can, continuing until the match would fail. Greedy mode is the default.

Non-greedy mode: the regular expression matches only as much as is needed to satisfy the pattern; once a match succeeds, it does not go any further. To make a quantifier non-greedy, append ? to it, for example *?.

In some cases we need a non-greedy regex, for example when capturing a pair of tags or a self-closing tag.

In greedy mode a strange run of tags gets captured, which is not ideal if the goal is to capture only img tags; non-greedy mode works here.

Just append ? to the quantifier to switch to non-greedy mode, which is particularly effective in certain situations.
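The img case described above can be sketched as follows. The HTML string is a made-up sample; the point is only the difference between .* and .*?.

```javascript
const html = '<img src="a.png"><p>text</p><img src="b.png">';

// Greedy: .* runs to the last ">" in the string, swallowing everything in between
const greedy = html.match(/<img.*>/)[0];

// Lazy: .*? stops at the first ">" that lets the match succeed
const lazy = html.match(/<img.*?>/)[0];     // '<img src="a.png">'
const allImgs = html.match(/<img.*?>/g);    // ['<img src="a.png">', '<img src="b.png">']

console.log(greedy === html); // true
console.log(lazy, allImgs);
```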

Lookahead/lookbehind (and negative) assertions

Sometimes we need to match the xxx that comes before or after some other xxx. Embarrassingly, for a long time JavaScript supported only lookahead and negative lookahead, not lookbehind or negative lookbehind; lookbehind assertions were introduced in ES2018.

Name                   Regex                Meaning
Lookahead              /want(?=asset)/      Matches want only if it is directly before asset
Negative lookahead     /want(?!asset)/      Matches want only if it is not directly before asset
Lookbehind             /(?<=asset)want/     Matches want only if it is directly after asset
Negative lookbehind    /(?<!asset)want/     Matches want only if it is not directly after asset

To be honest, in my experience lookbehind assertions have more use cases: JS stores a lot of data as name-value pairs, so much of the time we want to use "name=" to fetch the value that follows, and that is exactly what a lookbehind assertion is for.

Lookahead assertion: match only the numbers that are followed / not followed by a percent sign.
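A sketch of the percent-sign case. The sentence is invented sample data; \b is added in the negative case so that the engine cannot succeed by backtracking to a shorter run of digits.

```javascript
const text = '100% of 11 tests, 80% coverage, 7 files';

// Numbers directly followed by a percent sign (the sign itself is not part of the match)
const percents = text.match(/\d+(?=%)/g);          // ['100', '80']

// Whole numbers not followed by a percent sign
const plainNumbers = text.match(/\b\d+\b(?!%)/g);  // ['11', '7']

console.log(percents, plainNumbers);
```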

Lookbehind assertion:
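A sketch of the "name=" use case mentioned earlier, plus a negative lookbehind. The query string and price text are made up, and the code needs an engine with ES2018 lookbehind support.

```javascript
const query = 'name=regex&page=2&lang=en';

// Fetch the value that directly follows "page="
const page = query.match(/(?<=page=)\d+/)[0];      // '2'

const prices = 'cost $90, weight 40kg';
// Numbers directly preceded by "$"
const dollars = prices.match(/(?<=\$)\d+/g);       // ['90']
// Whole numbers not preceded by "$"
const others = prices.match(/(?<!\$)\b\d+/g);      // ['40']

console.log(page, dollars, others);
```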

Here is an example based on a blog post by @Yu Bo (also known as Shediao).

A lookbehind assertion can be used here:

(?<=^|(first.+[set])).*?(?=$|(first.+[set]))

To evaluate a lookbehind assertion such as /(?<=y)x/, the engine first matches x and then goes back to the left to match the y part. This "right first, then left" order of execution is the opposite of all other regex operations and leads to some unexpected behavior.

First, group matches inside a lookbehind assertion differ from the normal results.

/(?<=(\d+)(\d+))$/.exec('1053') // ["", "1", "053"]
/^(\d+)(\d+)$/.exec('1053') // ["1053", "105", "3"]

The code above captures two groups. Without a lookbehind assertion, the first group is greedy and the second can capture only one character, so the results are 105 and 3. With a lookbehind assertion, execution runs from right to left, so the second group is greedy and the first can capture only one character, and the results are 1 and 053.

Second, backreferences inside a lookbehind assertion also run in reverse order: the backreference must be placed before the parenthesized group it refers to.

/(?<=(o)d\1)r/.exec('hodor') // null
/(?<=\1d(o))r/.exec('hodor') // ["r", "o"]

In the code above, if the backreference (\1) in the lookbehind assertion is placed after the parentheses, it does not get a match; it must be placed before them. This is because a lookbehind assertion first scans from left to right and, once a match is found, goes back and completes the backreference from right to left.

Also, it is important to note that the assertion part is not counted in the return result.

Named capture groups

ES2018 introduced Named Capture Groups, which allows you to assign a name to each group match, making it easy to read the code and reference it.

A named capture group is written inside the parentheses by starting the pattern with "question mark + angle brackets + group name", that is, (?<name>…). The group can then be referenced by name on the groups property of the result returned by the exec method. Meanwhile, the numeric index (matchObj[1]) remains valid.

A named capture group effectively gives each group an ID that describes what it is for. If the order of the groups changes, the code that processes the match does not need to change.

If a named group does not take part in the match, the corresponding property of the groups object is undefined.
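The behavior described in this section can be sketched with a date pattern. RE_DATE, RE_OPT, and the group names are illustrative choices, and the code needs an engine with ES2018 named-group support.

```javascript
const RE_DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/;
const matchObj = RE_DATE.exec('1999-12-31');

const year = matchObj.groups.year;    // '1999'
const month = matchObj.groups.month;  // '12'
const byIndex = matchObj[1];          // '1999' (numeric indexes still work)

// A named group that does not take part in the match is present but undefined
const RE_OPT = /^(?<as>a+)?$/;
const optObj = RE_OPT.exec('');
const unmatched = optObj.groups.as;      // undefined
const keyExists = 'as' in optObj.groups; // true

console.log(year, month, byIndex, unmatched, keyExists);
```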

Named capture groups × destructuring assignment
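A minimal sketch of combining named groups with destructuring, and of referring to them as $<name> in a replacement string. The patterns and inputs are examples only.

```javascript
// Destructure the named groups straight out of the exec result
const { groups: { one, two } } = /^(?<one>.*):(?<two>.*)$/u.exec('foo:bar');
// one === 'foo', two === 'bar'

// In a replacement string, $<name> inserts the named group
const re = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/u;
const reordered = '2015-01-02'.replace(re, '$<day>/$<month>/$<year>'); // '02/01/2015'

console.log(one, two, reordered);
```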

Named group references

To reference a named capture group inside the regular expression itself, write \k<group name>.
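For instance, a pattern that requires the same word on both sides of an exclamation mark might look like this. RE_TWICE and the group name word are illustrative.

```javascript
const RE_TWICE = /^(?<word>[a-z]+)!\k<word>$/;

const same = RE_TWICE.test('abc!abc');      // true
const different = RE_TWICE.test('abc!ab');  // false

console.log(same, different);
```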

4. Common regular expressions

One of the best regex visualization sites I can recommend is regexper.com/. Paste in your regex and it draws a diagram of the matching rules, which lets you roughly judge whether the regex meets expectations.

There are two ways to create a regex object from a string: the literal form and the constructor.

Constructor: new RegExp('pattern', 'flags')
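A sketch of the constructor route for text such as '/123/g'. The parseRegExp helper is hypothetical (it is not a built-in), and it wraps the constructor in try-catch because an invalid pattern or flag throws a SyntaxError.

```javascript
// Hypothetical helper: turns text such as '/123/g' into a RegExp, or returns null
function parseRegExp(input) {
  const parts = /^\/(.+)\/([a-z]*)$/.exec(input);
  if (parts === null) return null;
  try {
    return new RegExp(parts[1], parts[2]);
  } catch (err) {
    return null; // invalid pattern or flags
  }
}

const re = parseRegExp('/123/g');
console.log(re.source, re.flags);        // '123' 'g'
console.log(parseRegExp('/(/'));         // null (unbalanced parenthesis)
console.log(parseRegExp('no slashes'));  // null

// In a string passed to the constructor, backslashes must be doubled
const digits = new RegExp('\\d+', 'g');
console.log('a1b22'.match(digits));      // ['1', '22']
```

Unlike eval, this approach never executes the input as code, which matters when the string comes from users.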

Literal form (wrap it in try-catch):

const input = '/123/g'
const regexp = eval(input)

Verify password strength

The password must contain uppercase letters, lowercase letters, and digits, must not contain special characters, and must be 8 to 10 characters long.

^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{8,10}$
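To see how the three lookaheads and the final character group work together, the rule can be run against a few made-up passwords. PASSWORD_RE is just a name chosen for this sketch.

```javascript
// Each lookahead checks one requirement from the start of the string;
// the final group enforces the allowed characters and the length
const PASSWORD_RE = /^(?=.*\d)(?=.*[a-z])(?=.*[A-Z])[a-zA-Z0-9]{8,10}$/;

const ok = PASSWORD_RE.test('Passw0rd1');         // true
const noUppercase = PASSWORD_RE.test('password1'); // false
const special = PASSWORD_RE.test('Passw0rd!');     // false
const tooShort = PASSWORD_RE.test('Pw0rd');        // false
const tooLong = PASSWORD_RE.test('Passw0rd123');   // false (11 characters)

console.log(ok, noUppercase, special, tooShort, tooLong);
```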

Not all digits and not all letters, 6 to 15 characters long, using negative lookahead:

/^(?![0-9]+$)(?![a-zA-Z]+$)[0-9a-zA-Z]{6,15}$/

Validate Chinese characters

The string may contain only Chinese characters.

^[\u4e00-\u9fa5]{0,}$

Validate ID card numbers

15 digits:

^[1-9]\d{7}((0\d)|(1[0-2]))(([012]\d)|3[0-1])\d{3}$


18 digits:

^[1-9]\d{5}[1-9]\d{3}((0\d)|(1[0-2]))(([012]\d)|3[0-1])\d{3}([0-9]|X)$

Validate dates

Validates dates in "YYYY-MM-DD" format, taking both common years and leap years into account.

^(?:(?!0000)[0-9]{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-8])|(?:0[13-9]|1[0-2])-(?:29|30)|(?:0[13578]|1[02])-31)|(?:[0-9]{2}(?:0[48]|[2468][048]|[13579][26])|(?:0[48]|[2468][048]|[13579][26])00)-02-29)$

Extracting URL links

The following expression extracts a URL from a piece of text.

^(f|ht){1}(tp|tps):\/\/([\w-]+\.)+[\w-]+(\/[\w- ./?%&=]*)?

Extract the addresses of image tags

If you want to extract all the image information in a web page, you can use the following expression.

/<img [^>]*?src="(.*?)"[^>]*?>/g

5. Precautions

Use non-capture parentheses

If you do not need to reference the text inside a pair of parentheses, use non-capturing parentheses (?:…). This not only saves the time spent capturing, but also reduces the number of states kept for backtracking.

Eliminate unnecessary parentheses

Unnecessary parentheses sometimes prevent engine optimizations. For example, don't use (.)* unless you need to know the last character matched by .*.

Do not abuse character groups

Avoid character groups that contain a single character. For example, [.] and [*] can be written as the escapes \. and \*.

Use the starting anchor point

Except in special cases, a regular expression that starts with .* should have ^ in front of it. If such an expression cannot match at the beginning of the string, it obviously cannot match anywhere else either.

Extract essential elements from quantifiers

Replacing x+ with xx* exposes the "x" that a match requires. In the same way, -{5,7} can be written as ------{0,2}, that is, five literal hyphens followed by -{0,2}. (Readability may suffer a little.)

Extract the required elements at the beginning of a multi-select structure

Replacing (?:this|that) with th(?:is|at) exposes the required "th".

Lazy (ignore-first) or greedy (match-first)?

In general, whether to use lazy or greedy quantifiers depends on what the regular expression needs to do. For example, /^.*:/ is different from /^.*?:/, because the former matches up to the last colon and the latter up to the first colon. As a rule of thumb, if the target string is long and the colon is expected near the beginning of the string, use the lazy quantifier; if it is expected near the end, use the greedy quantifier.
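The difference between the two forms shows up as soon as the line contains more than one colon. A small sketch with a made-up line:

```javascript
const line = 'key: value: more';

const upToLastColon = line.match(/^.*:/)[0];    // 'key: value:'
const upToFirstColon = line.match(/^.*?:/)[0];  // 'key:'

console.log(upToLastColon, upToFirstColon);
```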

Split regular expressions

Sometimes applying several small regular expressions is much faster than applying one big one. An "all-in-one" regular expression has to test all of its alternatives at every position in the target text, which is inefficient. A typical example is removing whitespace from the beginning and end of a string.
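The whitespace-trimming case can be sketched both ways. The function names are made up, and the snippet only shows that the two forms produce the same result; how they compare in speed depends on the engine and the input, so measure before relying on it.

```javascript
// Two small, anchored regexes applied one after the other
function trimTwoSteps(str) {
  return str.replace(/^\s+/, '').replace(/\s+$/, '');
}

// One combined regex whose alternation is tried at every position
function trimOneRegex(str) {
  return str.replace(/^\s+|\s+$/g, '');
}

const input = '  hello  world \n';
console.log(JSON.stringify(trimTwoSteps(input))); // "hello  world"
console.log(JSON.stringify(trimOneRegex(input))); // "hello  world"
```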

Put the multiple-choice branch that is most likely to match first

The order of the branches in an alternation matters a great deal. In general, putting the most common branches first makes it possible to find matches faster.

Avoid exponential matching

To avoid exponential matching, minimize nested + and * quantifiers, such as ([^\\"]+)*. This reduces the number of ways a match can be attempted and speeds up matching.
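As a sketch, the nested form and a flat rewrite accept exactly the same quoted strings here, but the nested one gives a backtracking engine many ways to split the same text when the closing quote is missing. The samples are kept short on purpose; on a long unterminated string the nested pattern can become very slow.

```javascript
// Nested quantifiers: the inner + and the outer * can divide the same run of text in many ways
const nested = /^"([^\\"]+)*"$/;

// Flat rewrite for this character group: a single * leaves only one way to divide the text
const flat = /^"[^\\"]*"$/;

const samples = ['"hello world"', '""', '"unterminated', 'no quotes'];
const nestedResults = samples.map((s) => nested.test(s)); // [true, true, false, false]
const flatResults = samples.map((s) => flat.test(s));     // [true, true, false, false]

console.log(nestedResults, flatResults);
```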

6. Summary

Using regular expressions well takes some experience. In my own experience, it works best to first write down what you want to match, then write small patterns as building blocks, and finally combine them into the behavior you need.

If you run into an obscure regex, you can also paste it into one of the visualization sites mentioned earlier to see how it works.

On the front end, regexes are used mainly for correcting user input, filtering rich text content, filtering URLs or src attributes, replacing tags, and so on. Mastering them is very helpful; after all, Sizzle, the selector engine of the once-dominant jQuery, made heavy use of regexes.

Finally, if you think I have written something wrong or badly, or you have other suggestions (or praise), you are very welcome to point it out. Feel free to discuss and share with me!

Closing notes

Thanks to the authors of the reference documents below for sharing.

Mastering Regular Expressions (3rd edition)

Two or three things about front-end regex, by @code jun confessions

Introduction to ES6: regular expression extensions, by @Ruan Yifeng