Preface

For a thorough treatment of JavaScript regular expressions, read the JavaScript Regular Expressions Mini-Book. Most of this article is a set of notes compiled after reading that book, so you may prefer to read the book directly. The main reason for writing this article is to condense those notes into a summary that can be consulted quickly during everyday development.

Basics

What is a regular expression

A regular expression is a syntax specification for a small language. It is a powerful, convenient, and efficient text-processing tool, used by a number of methods to search, replace, and extract information in strings.

// Definition
const reg = /at/g;
// Method call
reg.test('ata'); // whether the string 'ata' matches the regex reg

What regular expressions do

Regular expressions are matching patterns that match either characters or positions.

Defining a regular expression

Literal

const reg = /at/g;

RegExp

const reg = new RegExp('at', 'g');

Metacharacters

Most characters in a regular expression have their literal meaning: /a/ matches "a" and /b/ matches "b". Some characters, however, have a special meaning beyond the literal one. These characters are metacharacters.

Metacharacter  Name  Matches
.          dot                      any single character (except carriage return \r, line feed \n, line separator \u2028, and paragraph separator \u2029)
[]         character group          any single character that is listed
[^]        negated character group  any single character that is not listed
?          question mark            zero or one time
*          asterisk                 zero or more times
+          plus sign                one or more times
{min,max}  interval quantifier      at least min times, at most max times
^          caret                    the start position of a line
$          dollar sign              the end position of a line
|          vertical bar             either of the expressions on its two sides
()         parentheses              limits the scope of an alternation, marks the element a quantifier applies to, and captures text for backreferences
\1, \2...  backreference            the text previously matched by the expression in the first, second... pair of parentheses

Escape characters

An escape is written as a backslash (\) followed by a character. There are three cases:

  1. Because metacharacters have special meanings, they cannot be matched directly. To match a metacharacter itself, precede it with a backslash, for example /\*/.test('*');

  2. A backslash followed by certain non-metacharacters denotes special characters, many of which cannot be printed;

Pattern   Description
\0        Matches the NUL character
\t        Matches a horizontal tab
\v        Matches a vertical tab
\n        Matches a line feed
\r        Matches a carriage return
\f        Matches a form feed
\xnn      Matches the character with hexadecimal code nn. For example, \x0A is equivalent to \n
\uxxxx    Matches the Unicode character with hexadecimal code xxxx. For example, \u2028 matches the line separator
\cX       Matches Ctrl + X. For example, \cI matches Ctrl + I, which is equivalent to \t
[\b]      Matches a backspace

  3. A backslash followed by any other character matches that character itself by default, that is, the backslash (\) is ignored, for example /\x/.test('x').

Double escape

Because the RegExp constructor takes a string as its argument, backslashes need to be double-escaped in some cases:

var p1 = /\.at/;
// equivalent to
var p2 = new RegExp('\\.at');

var p1 = /name\/age/;
// equivalent to
var p2 = new RegExp('name\\/age');

var p1 = /\w\\hello\\123/;
// equivalent to
var p2 = new RegExp('\\w\\\\hello\\\\123');

Defining regular expressions with literals is generally recommended.

Character groups

A character group is a group of characters represented by square brackets that matches one of several characters.

[0123456789]  matches one of the 10 digits 0-9
[0-9]         matches one of the 10 digits 0-9
[a-z]         matches one of the 26 lowercase letters
[0-9a-zA-Z]   matches a digit or an uppercase or lowercase letter

Negated character groups

Another kind of character group is the negated character group, written with a caret (^) right after the left square bracket. It matches, at the current position, one character that is not listed.

[^0-9]  matches any character other than 0-9

Pattern     Description
[abc]       Matches any one of the characters "a", "b", or "c"
[a-c1-3]    Matches any one of the characters "a", "b", "c", "1", "2", or "3"
[^abc]      Matches any character except "a", "b", and "c"
[^a-c1-3]   Matches any character except "a", "b", "c", "1", "2", and "3"
.           Wildcard; matches any character except line terminators such as \n
\d          Matches a digit; equivalent to [0-9]
\D          Matches a non-digit; equivalent to [^0-9]
\w          Matches a word character; equivalent to [a-zA-Z0-9_]
\W          Matches a non-word character; equivalent to [^a-zA-Z0-9_]
\s          Matches a whitespace character; equivalent to [ \t\v\n\r\f]
\S          Matches a non-whitespace character; equivalent to [^ \t\v\n\r\f]

Quantifiers

The character group [0-9] or \d matches a single digit. Writing longer patterns this way quickly becomes inconvenient:

// A six-digit postal code
/[0-9][0-9][0-9][0-9][0-9][0-9]/;
// or
/\d\d\d\d\d\d/;

Regular expressions provide quantifiers, which set the number of times a pattern occurs:

// A six-digit postal code
/\d{6}/;
Pattern   Description
{n,m}     Occurs n to m times in a row. Greedy
{n,}      Occurs at least n times in a row. Greedy
{n}       Occurs exactly n times in a row. Greedy
?         Equivalent to {0,1}. Greedy
+         Equivalent to {1,}. Greedy
*         Equivalent to {0,}. Greedy
{n,m}?    Occurs n to m times in a row. Lazy
{n,}?     Occurs at least n times in a row. Lazy
{n}?      Occurs exactly n times in a row. Lazy
??        Equivalent to {0,1}?. Lazy
+?        Equivalent to {1,}?. Lazy
*?        Equivalent to {0,}?. Lazy

Greedy mode: quantifiers are greedy by default. They keep matching until the next character no longer satisfies the rule.

/a+/.exec('aaa'); // ['aaa']

Lazy mode: the lazy counterpart of a greedy quantifier is written by adding a question mark (?) after the quantifier. It matches as little as possible: once the condition is satisfied, it does not match any further.
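
As a small illustration of the difference, the same input can be run through a greedy and a lazy quantifier. This is only a sketch; the variable names are chosen for this example.

```javascript
// Greedy: + takes as many "a" characters as it can.
const greedy = /a+/.exec('aaa');
// Lazy: +? stops as soon as the minimum (one "a") is satisfied.
const lazy = /a+?/.exec('aaa');

console.log(greedy[0]); // "aaa"
console.log(lazy[0]);   // "a"
```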

What parentheses do

Pattern         Description
(ab)            Capturing group. Treats "ab" as a unit; for example, (ab)+ means "ab" occurs at least once in a row.
(?:ab)          Non-capturing group. Unlike (ab), it does not capture the matched data.
(good|nice)     Capturing alternation. Matches "good" or "nice".
(?:good|nice)   Non-capturing alternation. Unlike (good|nice), it does not capture the matched data.
\num            Backreference. For example, \2 refers to the data captured by the second pair of parentheses.

Flags

A flag (matching mode) sets the rules used during matching. Setting a flag can change how a regular expression is interpreted.

Flag  Description
g     Global matching: finds all matching substrings
i     Ignores letter case during matching
m     Multi-line matching: ^ and $ match the beginning and end of each line

If the material above still feels hazy, don't worry: the applications part of this article will help you get started.

Applications

Regular expressions match either characters or positions.

Two regex visualization tools are recommended:

  • aoxiaoqiang.github.io/reg/
  • jex.im/regulex

Matching characters

Exact matching

var regex = /hello/;
console.log( regex.test("hello") ); // true

A regex that only matches exactly is not very useful; /hello/, for example, matches only the substring "hello".

Fuzzy matching

Regular expressions are powerful because they enable fuzzy matching.

Fuzziness comes in two directions: horizontal and vertical.

Horizontal fuzzy matching

Horizontal fuzziness means that the length of the string a regex matches is not fixed; several lengths are possible.

const regex = /ab{2,5}c/g;

The first character is "a", followed by "b" 2 to 5 times, and the last character is "c".
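
To see which lengths are accepted, the pattern can be tried against a string with one to six "b" characters. The sample text below is made up for this illustration.

```javascript
const regex = /ab{2,5}c/g;
// Hypothetical sample: "a", then 1 to 6 "b" characters, then "c".
const text = 'abc abbc abbbc abbbbc abbbbbc abbbbbbc';
const found = text.match(regex);

console.log(found); // ["abbc", "abbbc", "abbbbc", "abbbbbc"]
```

Only the candidates with 2 to 5 "b" characters are returned; "abc" and "abbbbbbc" are skipped.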

Vertical fuzzy matching

Vertical fuzziness means that, at a given position in the matched string, the character is not fixed; several characters are possible.

const regex = /a[123]b/

The first character is "a", the second character can be any one of "1", "2", or "3", and the third character is "b".
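
A quick check of the character group, using a made-up sample string (the g flag is added here so that all matches are collected):

```javascript
const regex = /a[123]b/g;
// Hypothetical sample: the middle character runs from 0 to 4.
const text = 'a0b a1b a2b a3b a4b';
const found = text.match(regex);

console.log(found); // ["a1b", "a2b", "a3b"]
```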

Greedy matching

const regex = /\d{2,5}/g;
const string = "123, 1234, 12345, 123456";
console.log( string.match(regex) ); // ["123", "1234", "12345", "12345"]

The quantifier is greedy: it matches as many characters as it can. Whatever is within its reach, the more the better.

Lazy matching

Lazy matching matches as few characters as possible.

var regex = /\d{2,5}?/g;
var string = "123, 1234, 12345, 123456";
console.log( string.match(regex) );
// => ["12", "12", "34", "12", "34", "12", "34", "56"]

Here /\d{2,5}?/ means that although 2 to 5 repetitions are allowed, 2 is enough and it does not try for more.

A way to remember lazy matching: put a question mark after the quantifier, as if asking "Are you satisfied yet? Are you being greedy?"

Matching multiple subpatterns

The general form is (p1|p2|p3), where p1, p2, and p3 are subpatterns separated by | (the pipe), meaning any one of them.

var regex = /good|nice/g;

Alternation is also lazy: once an earlier branch matches, the later ones are not tried.

var string = "goodbye";

var regex1 = /good|goodbye/g; // matches "good"
var regex2 = /goodbye|good/g; // matches "goodbye"
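
The effect of branch order can be checked directly. The snippet below is a small sketch using the same string and the same two patterns:

```javascript
const text = 'goodbye';
// The branch listed first wins as soon as it matches.
const first = text.match(/good|goodbye/g);
const second = text.match(/goodbye|good/g);

console.log(first);  // ["good"]
console.log(second); // ["goodbye"]
```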

A practical case

Matching a mobile phone number

Analysis:

13x xxxx xxxx
15x xxxx xxxx
18x xxxx xxxx

Regex:

/^1(3|4|5|6|7|8|9)\d{9}$/g

Explanation:

The ^ and $ at the two ends (shown as Begin and End in a visualization) match the start and end positions, which are explained later.

  • The first digit is 1;
  • The second digit is one of 3, 4, 5, 6, 7, 8, 9;
  • These are followed by any digit, repeated nine times.
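
A minimal sketch of using this pattern for validation follows. The sample inputs are invented, and the g flag is left out on purpose: a global regex remembers lastIndex between test() calls, which can give surprising results when the same regex object is reused on several strings.

```javascript
// Same pattern as above, without the g flag (see the note on lastIndex).
const phone = /^1(3|4|5|6|7|8|9)\d{9}$/;

// Hypothetical sample inputs, not real phone numbers.
const samples = ['13812345678', '12812345678', '1381234567', '138123456789'];
const results = samples.map((s) => phone.test(s));

console.log(results); // [true, false, false, false]
```

The second sample fails on its second digit, and the last two fail because they have 10 and 12 digits instead of 11.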

Matching positions

What is a position

A position is the place between adjacent characters. For example, in "hello" there is a position between "h" and "e".

How to match positions

In ES5, there are six anchors:

^ $ \b \B (?=p) (?!p)

^ and $

^ (caret) matches the start of the string; in multi-line mode it matches the start of each line. $ (dollar sign) matches the end of the string; in multi-line mode it matches the end of each line.

For example, we can replace the start and end of a string with "#":

var result = "hello".replace(/^|$/g, '#'); // "#hello#"

In multi-line mode, both anchors work per line, which is worth noting:

var result = "I\nlove\njavascript".replace(/^|$/gm, '#');
// Output is as follows:
// #I#
// #love#
// #javascript#

\b and \B

\b is a word boundary, specifically:

  • the position between \w and \W;
  • the position between \w and ^;
  • the position between \w and $.

var result = "[JS] Lesson_01.mp4".replace(/\b/g, '#');
// => "[#JS#] #Lesson_01#.#mp4#"

  • The first "#" has "[" and "J" on its two sides, a position between \W and \w;
  • The second "#" has "S" and "]" on its two sides, a position between \w and \W;
  • The third "#" has a space and "L" on its two sides, a position between \W and \w;
  • The fourth "#" has "1" and "." on its two sides, a position between \w and \W;
  • The fifth "#" has "." and "m" on its two sides, a position between \W and \w;
  • The sixth "#" is at the end of the string, but the character before it, "4", is \w, so it is a position between \w and $.

Once \b is understood, \B is relatively easy.

\B is the opposite of \b: a non-word boundary. If you remove every \b position from all the positions in a string, what is left is the \B positions.

Specifically:

  • the position between \w and \w;
  • the position between \W and \W;
  • the position between ^ and \W;
  • the position between \W and $.

var result = "[JS] Lesson_01.mp4".replace(/\B/g, '#');
// => "#[J#S]# L#e#s#s#o#n#_#0#1.m#p#4"

(?=p) and (?!p)

(?=p), where p is a subpattern, matches the position before p.

For example, (?=l) denotes the position before an "l" character:

var result = "hello".replace(/(?=l)/g, '#');
// => "he#l#lo"

(?!p) is the opposite of (?=p): it matches a position that is not followed by p.

var result = "hello".replace(/(?!l)/g, '#');
// => "#h#ell#o#"

Their formal names are: (?=p), positive lookahead assertion; (?!p), negative lookahead assertion.

ES2018 also added positive lookbehind and negative lookbehind: (?<=p) and (?<!p).
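
For comparison with the lookahead examples, here is a short sketch of the two lookbehind forms, (?<=p) and (?<!p), on the same string. It assumes a JavaScript engine that supports lookbehind assertions.

```javascript
// (?<=l): positions that are preceded by "l".
const afterL = 'hello'.replace(/(?<=l)/g, '#');
// (?<!l): positions that are NOT preceded by "l".
const notAfterL = 'hello'.replace(/(?<!l)/g, '#');

console.log(afterL);    // "hel#l#o"
console.log(notAfterL); // "#h#e#llo#"
```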

Properties of positions

A position can be thought of as the empty string "". For example, the string "hello" is equivalent to the following:

"hello" == "" + "h" + "" + "e" + "" + "l" + "" + "l" + "o" + "";
// which is also equivalent to:
"hello" == "" + "" + "hello"

So writing /^hello$/ as /^^hello$$$/ causes no problem:

var result = /^^hello$$$/.test("hello"); // => true

A very effective way to understand positions is to think of them as empty strings.

A practical case

Thousands-separator formatting of numbers

Analysis:

12345678 => 12,345,678
123456789 => 123,456,789

First version:

var result = "12345678".replace(/(?=\d{3}$)/g, ',')

Explanation:

  • (?=\d{3}) matches every position that is followed by three digits, giving ",1,2,3,4,5,678";
  • (?=\d{3}$) additionally requires the end of the string right after those three digits, giving "12345,678".

The first version matches only one position.

Second version:

var result = "12345678".replace(/(?=(\d{3})+$)/g, ',')

Explanation:

  • Counting from the end, a comma is added before every group of three digits, giving "12,345,678";
  • But when the number has nine digits, the regex breaks again: a comma is also added at the start, giving ",123,456,789".

The fix is to require that the matched position is not the start of the string.

Third version:

var result = "12345678".replace(/(?!^)(?=(\d{3})+$)/g, ',')

This time a nine-digit input such as "123456789" also gives the correct result, "123,456,789".
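
The third version can be wrapped in a small helper. The function name formatThousands is made up for this sketch, and it assumes the input consists of digits only (no sign, no decimal part).

```javascript
// Hypothetical helper: insert a comma at every position that is not the start
// of the string and is followed by a multiple of three digits up to the end.
function formatThousands(digits) {
  return String(digits).replace(/(?!^)(?=(\d{3})+$)/g, ',');
}

console.log(formatThousands('12345678'));  // "12,345,678"
console.log(formatThousands('123456789')); // "123,456,789"
console.log(formatThousands('123'));       // "123"
```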

What parentheses do

The role of parentheses can be summed up in a sentence: they provide groups, so that we can refer to them.

A group can be referenced in two places: in JavaScript code, or inside the regular expression itself.

Grouping

var regex = /(ab)+/g;

Here the parentheses provide grouping, so that the quantifier + applies to "ab" as a whole.

Alternation

In an alternation (p1|p2), the role of the parentheses is self-evident: they delimit all the possible subexpressions.

var regex = /^I love (JavaScript|Regular Expression)$/;
var regex2 = /^I love JavaScript|Regular Expression$/;

Consider the difference between these two regexes: in the first, the alternation is limited to the part inside the parentheses; in the second, | splits the whole pattern into ^I love JavaScript and Regular Expression$.
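
The difference shows up as soon as the two patterns are tested on strings that only partly fit. The sample sentences below are invented for this sketch.

```javascript
const grouped = /^I love (JavaScript|Regular Expression)$/;
const ungrouped = /^I love JavaScript|Regular Expression$/;

// With the group, the whole string must be one of the two sentences.
console.log(grouped.test('I love JavaScript'));           // true
console.log(grouped.test('I love JavaScript a lot'));     // false

// Without it, either "^I love JavaScript" or "Regular Expression$" is enough.
console.log(ungrouped.test('I love JavaScript a lot'));   // true
console.log(ungrouped.test('I hate Regular Expression')); // true
```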

Referencing groups

This is an important function of parentheses: it lets us extract data and perform more powerful replacements. To make use of it, you need the API of the host environment.

For example, to extract the year, month, and day, you can do this:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( string.match(regex) );
// => ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12"]

As another example, how would you replace the yyyy-mm-dd format with mm/dd/yyyy?

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, "$2/$3/$1");
console.log(result);
// => "06/12/2017"

// Or write (RegExp.$1 to RegExp.$9 are legacy static properties):
var result = string.replace(regex, function() {
	return RegExp.$2 + "/" + RegExp.$3 + "/" + RegExp.$1;
});

// Or:
var result = string.replace(regex, function(match, year, month, day) {
	return month + "/" + day + "/" + year;
});

Backreferences

Besides referencing groups through the API, you can also reference groups inside the regex itself. Only groups that appear earlier can be referenced, which is why this is called a backreference.

var regex = /\d{4}(-|\/|\.)\d{2}\1\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // false

Note \1, which refers to the earlier group (-|\/|\.). Whatever that group matched (for example -), \1 matches that same specific character.

Once \1 is understood, \2 and \3 follow naturally: they refer to the second and third groups respectively.

What happens when you reference a group that does not exist?

When a backreference refers to a group that does not exist in the regex, the regex does not throw an error; it simply matches the character denoted by the escape itself. For example, \2 matches "\2". Note that "\2" here is an escaped "2", not the digit "2".

Non-capturing groups

The groups in the preceding sections capture the data they match so that it can be referenced later, so they are also called capturing groups.

If you only want the basic grouping function of parentheses and will not reference the group, either through the API or with a backreference in the regex, you can use a non-capturing group (?:p). For example, the first example of this section can be changed to:

var regex = /(?:ab)+/g;
var string = "ababa abbb ababab";
console.log( string.match(regex) );
// => ["abab", "ab", "ababab"]

A practical case

Matching paired tags

Analysis:

Should match:
<title>regular expression</title>
<p>laoyao bye bye</p>
Should not match:
<title>wrong!</p>

Regex:

var regex = /<([^>]+)>[\d\D]*<\/\1>/;
var string1 = "<title>regular expression</title>";
var string2 = "<p>laoyao bye bye</p>";
var string3 = "<title>wrong!</p>";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // false

Breaking down regular expressions

There are two ways to measure your mastery of a language: reading and writing.

You should be able not only to solve your own problems, but also to read other people's solutions. This holds for code, and it holds for regular expressions as well.

Structure and operators

Programming languages generally have operators. Whenever there is an operator, there is a problem. When a bunch of operations go together, who comes first and who comes next? In order to avoid ambiguity, the language itself needs to define the order of operations, which is called priority.

What are the structures in JS regular expressions?

Literals

Match one specific character, whether unescaped or escaped. For example, a matches the character "a", \n matches a line feed, and \. matches a period.

Character groups

Match one character out of several possibilities. For example, [0-9] matches a digit; \d is the shorthand. There are also negated character groups, which match any character other than the listed ones, such as [^0-9] for a non-digit character, with the shorthand \D.

Quantifiers

Express that a character occurs repeatedly. For example, a{1,3} means the character "a" occurs 1 to 3 times in a row. There are also common shorthands, such as a+ for "a" occurring at least once in a row.

Anchors

Match a position rather than a character. For example, ^ matches the start of the string, \b matches a word boundary, and (?=\d) matches the position before a digit.

Groups

Parentheses mark a unit, such as (ab)+ for "ab" occurring one or more times in a row; the non-capturing form (?:ab)+ can also be used.

Alternation

Chooses one of several subexpressions. For example, abc|bcd matches the substring "abc" or "bcd".

Backreferences

For example, \2 refers to the second group.

The operators involved are:

1. Escape: \
2. Parentheses and square brackets: (...), (?:...), (?=...), (?!...), [...]
3. Quantifiers: {m}, {m,n}, {m,}, ?, *, +
4. Positions and sequences: ^, $, \metacharacter, ordinary characters
5. Pipe (vertical bar): |

The operators above are listed from highest to lowest precedence.

As an example, let us analyze the regex /ab?(c|de*)+|fg/:

  1. Because of the parentheses, (c|de*) is one unit.
  2. Inside (c|de*), note the quantifier *, so e* is one unit.
  3. Because the alternation operator | has the lowest precedence, c is one unit and de* is another.
  4. In the same way, the whole regex is divided into a, b?, (...)+, f, and g. Because of the alternation, it splits into two parts: ab?(c|de*)+ and fg.
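
The breakdown can be checked against a few inputs. The sample strings are invented for this sketch: one exercises the first branch, one the second, and one fits neither.

```javascript
const regex = /ab?(c|de*)+|fg/;

// First branch: "a", optional "b", then one or more of "c" or "d" followed by "e"s.
const m1 = 'abcde'.match(regex);
// Second branch: the literal "fg".
const m2 = 'xfg'.match(regex);
// Neither branch: "ab" has no "c" or "d" after it.
const m3 = 'ab'.match(regex);

console.log(m1[0]); // "abcde"
console.log(m2[0]); // "fg"
console.log(m3);    // null
```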

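Chaining quantifiers

Suppose we want to match strings like this:

  1. Each character is one of A, B, C
  2. The length of the string is a multiple of 3

It is tempting to write:

/^[ABC]{3}+$/

but this throws an error saying that there is nothing for + to repeat. The correct form groups the quantified part first:

/^([ABC]{3})+$/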
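
Of the two patterns above, the one without parentheses is rejected by the engine, because + has nothing to repeat directly after {3}. The sketch below builds that pattern from a string so the error can be caught at run time, then checks the grouped version; the variable names are made up for the example.

```javascript
let errorName = null;
try {
  // Built from a string so the syntax error is raised at run time.
  new RegExp('^[ABC]{3}+$');
} catch (err) {
  errorName = err.name;
}
console.log(errorName); // "SyntaxError"

// Grouping first lets + repeat the whole three-character unit.
const multipleOfThree = /^([ABC]{3})+$/;
console.log(multipleOfThree.test('ABCCBA')); // true
console.log(multipleOfThree.test('ABCA'));   // false
```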

A practical case

IPv4 addresses

Regex:

/^((0{0,2}\d|0?\d{2}|1\d{2}|2[0-4]\d|25[0-5])\.){3}(0{0,2}\d|0?\d{2}|1\d{2}|2[0-4]\d|25[0-5])$/

Breakdown:

  1. Looking at the big picture first gives the structure ((...)\.){3}(...);
  2. The two (...) parts have the same content, so next break down the regex inside them;
  3. 0{0,2}\d matches a one-digit number, including zero-padded forms such as 9, 09, 009;
  4. 0?\d{2} matches a two-digit number, including the zero-padded form; it also covers one-digit numbers written as two digits;
  5. 1\d{2} matches 100-199;
  6. 2[0-4]\d matches 200-249;
  7. 25[0-5] matches 250-255.

If breaking a regex down by hand is difficult, a visualization tool can help, but it is still worth understanding the principles behind the breakdown.
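
The same structure can be assembled from its parts, which makes the breakdown above explicit. This is a sketch: the names octet and ipv4 are made up, and the RegExp constructor is used only so the repeated part is written once (note the doubled backslashes inside string literals).

```javascript
// One octet: 0-255, optionally zero-padded to three digits.
const octet = '(0{0,2}\\d|0?\\d{2}|1\\d{2}|2[0-4]\\d|25[0-5])';
// Three "octet." parts followed by a final octet.
const ipv4 = new RegExp('^(' + octet + '\\.){3}' + octet + '$');

console.log(ipv4.test('192.168.1.1'));     // true
console.log(ipv4.test('001.002.003.004')); // true
console.log(ipv4.test('256.1.1.1'));       // false
console.log(ipv4.test('1.2.3'));           // false
```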

Principles

To learn regular expressions well, you need to know a little about how matching works. And when matching is discussed, one word comes up a lot: "backtracking".

Backtracking

The way a regular expression matches a string is known as backtracking.

The basic idea of backtracking (also called trial and error) is this: starting from some state of the problem (the initial state), search all the states reachable from it. When a path reaches a dead end (it can go no further), step back one or more steps and continue the search from another possible state, until every path (state) has been tried. This method of finding a solution by repeatedly moving forward and stepping back is called backtracking.

It is essentially a depth-first search. The act of going back to an earlier step is what we call backtracking. As the description shows, backtracking happens when a path is blocked: when an attempt to match fails, the next step is usually to backtrack.

Matching without backtracking

Suppose the regex is /ab{1,3}c/ and the target string is "abbbc". In this case there is no backtracking. The subexpression b{1,3} means the character "b" occurs 1 to 3 times in a row.

Matching with backtracking

If the target string is "abbc", backtracking occurs. After b{1,3} has matched two "b" characters, it tries for a third but finds that the next character is "c", so that attempt fails (step 5 in the original diagram). b{1,3} is therefore considered finished, the state returns to the one before the failed attempt (step 6, the same as step 4), and the subexpression c then matches the character "c". The whole expression matches.

Step 6 is the backtrack.

Here is another clear case of backtracking. The regex is /".*"/ and the target string is "acd"ef. The .* first consumes everything up to the end of the string and then has to give characters back one at a time until the closing double quote can match. The failed attempts to match the double quote are omitted here. As you can see, .* is very inefficient.

To reduce this unnecessary backtracking, change the regex to /"[^"]*"/.

Regex optimization

A regular expression runs through the following stages:

  1. Compile
  2. Set the starting position
  3. Attempt a match
  4. If the match fails, go back to step 3 from the next position
  5. Final result: the match succeeds or fails

Use specific character groups instead of wildcards to eliminate backtracking

In stage 3, the biggest issue is backtracking. For example, take matching the characters between double quotes: matching "abc" in the string 123"abc"456. If the regex is /".*"/, stage 3 produces four backtracks.

If the regex is /".*?"/, it produces two backtracks.

Because of backtracking, the engine has to store the possible states that have not been tried yet so that it can return to them later, which inevitably takes up some memory.

To eliminate this unnecessary backtracking, use the regex /"[^"]*"/.
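
JavaScript does not expose how many backtracks a match needed, so the sketch below only compares what the three patterns return. On the example string they agree; on a string with two quoted parts (an invented sample) the greedy .* version returns something different, which is worth keeping in mind when swapping one pattern for another.

```javascript
const text = '123"abc"456';
const greedy = text.match(/".*"/)[0];
const lazy = text.match(/".*?"/)[0];
const negated = text.match(/"[^"]*"/)[0];
// All three return the quoted part, including the double quotes.
console.log(greedy === lazy && lazy === negated); // true

// Hypothetical input with two quoted parts.
const two = 'say "a" and "b"';
const greedyTwo = two.match(/".*"/)[0];     // spans from the first quote to the last
const negatedTwo = two.match(/"[^"]*"/)[0]; // stops at the first closing quote
console.log(greedyTwo);  // "a" and "b"  (with the quotes)
console.log(negatedTwo); // "a"          (with the quotes)
```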

Use non-capturing groups

One of the things parentheses do is capture the data matched by groups and alternations, and that data needs memory to hold it.

When we do not need group references or backreferences, we can use non-capturing groups. For example:

/^[+-]?(\d+\.\d+|\d+|\.\d+)$/

can be changed to:

/^[+-]?(?:\d+\.\d+|\d+|\.\d+)$/

Separate out the characters that are certain

For example, /a+/ can be changed to /aa*/, because the latter states one more definite character "a" than the former. This speeds up the detection of a failed match in step 4 and therefore the move to the next position.

Extract the common part of the branches

For example, /^abc|^def/ can be changed to /^(?:abc|def)/.

Likewise, /this|that/ can be changed to /th(?:is|at)/.

Doing so reduces the repeated work that can be eliminated during matching.

Reduce the number of branches and narrow their scope

/red|read/ can be changed to /rea?d/. Branches and quantifiers have different backtracking costs. After this optimization, however, readability is lower.
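
When rewriting a pattern for speed, it helps to confirm that the rewritten form still matches the same text. The sketch below compares each original pattern from this section with its rewritten form on a handful of invented sample strings. Agreement on these samples is a sanity check, not a proof of equivalence.

```javascript
// Pairs of [original, rewritten] patterns from the tips above.
const pairs = [
  [/a+/, /aa*/],
  [/^abc|^def/, /^(?:abc|def)/],
  [/this|that/, /th(?:is|at)/],
  [/red|read/, /rea?d/],
];
// Hypothetical sample strings.
const samples = ['', 'a', 'baaa', 'abc', 'xdef', 'def', 'this', 'that', 'thus', 'red', 'read', 'reed'];

// First matched substring, or null when there is no match.
const firstMatch = (re, s) => {
  const m = s.match(re);
  return m ? m[0] : null;
};

// Both patterns of every pair must find the same first match on every sample.
const allAgree = pairs.every(([before, after]) =>
  samples.every((s) => firstMatch(before, s) === firstMatch(after, s)));

console.log(allAgree); // true
```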

Summary

Finally, this article is only the author's reading notes. Please read the original JavaScript Regular Expressions Mini-Book.