What is a regular expression

What is a regular expression? When I first started out, I had only a superficial idea of them: a string of odd-looking characters used for validation. Whenever a "complicated" check came up, I would search for a ready-made regex and copy and paste it in one go. But regular expressions are useful for much more than simple validation, so let's take a proper look at them.

Concept

A regular expression is a pattern, built from ordinary characters and metacharacters, that describes a matching operation to carry out on a target string.

Core: Matching

From the definition above, you can see that the core of regular expressions is matching.

  • Match what? A regular expression matches characters and positions in the target string. Keep this in mind: it will make everything that follows much easier to understand.

  • What can you do with a match?

    • Check: the most common use. Test whether a match exists.
    • Extract: once the text is matched, it can be extracted for other purposes.
    • Replace: once the text is matched, it can be replaced with something else (for example, using the replace method).

Conclusion

A regular expression is an expression composed of characters and metacharacters that matches characters and positions in a target string, so that they can be checked, extracted, or replaced.

Introduction to regular matching

Most programming languages support regular expressions. Since I work on the front end, this article mainly covers regular expressions in JavaScript.

Tip: as you start writing regular expressions, it is recommended that you use the website https://jex.im/regulex to analyze and visualize them. To consolidate what you learn, it is strongly recommended to practice on commonly used regular expressions.

Creating regular expressions

In JavaScript, there are two ways to create a regular expression:

  • Use a regular expression literal, which consists of a pattern enclosed between slashes:
    const regex = /shotCat/;
  • Call the constructor of the RegExp object:
    const regex = new RegExp('shotCat');

The two are equivalent, and both match only shotCat. The main difference is that a literal is compiled when the script is loaded, while the constructor builds the regular expression at run time.

Note: the constructor form is not recommended for fixed patterns. The pattern is passed as a string, so every backslash has to be written twice (\\), which hurts readability. It is mainly useful when the pattern has to be built dynamically.
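To make the escaping difference concrete, here is a minimal sketch. The helper wordMatcher is hypothetical (it is not part of the original article) and assumes its argument contains only letters, so no metacharacters need escaping.

```javascript
// Literal and constructor forms of the same pattern: one or more digits.
const literal = /\d+/;
const constructed = new RegExp('\\d+'); // the backslash is doubled inside the string

// Hypothetical helper: build a whole-word, case-insensitive matcher at run time.
// Assumes `word` contains only letters, so nothing in it needs escaping.
function wordMatcher(word) {
  return new RegExp('\\b' + word + '\\b', 'i');
}

console.log(literal.source === constructed.source); // => true
console.log(wordMatcher('shotcat').test('My name is ShotCat.')); // => true
console.log(wordMatcher('cat').test('shotcat')); // => false
```

The constructor earns its extra backslashes only in the second case, where the pattern is not known until run time.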

Characters and metacharacters

As described in the previous section, regular expressions are composed of characters and metacharacters.

  • Characters: ordinary characters such as digits and English letters.
  • Metacharacters: the main topic of this article. Also called special characters, they carry a special meaning. For example, \d represents a digit from 0 to 9.

There are many metacharacters and they can be confusing, which makes them hard to memorize and understand. I'll break them down by common use below. For the full list of metacharacters, consult a regular expression reference.

Matching patterns

As mentioned earlier, the core of regular expressions is matching.

Matching patterns can be roughly divided into:

  • Exact matching: sometimes called simple matching. The pattern consists only of plain digits and letters, with no metacharacters, and corresponds one to one with the text it matches. For example, /shotcat/ matches only shotcat.
  • Fuzzy matching: the pattern contains metacharacters and can match many different strings. For example, /^[0-9]*$/ matches any string made up entirely of digits (including the empty string). Fuzzy matching comes in two kinds: the character at a position may vary, or the number of repetitions may vary.
    • Vertical fuzzy matching: the character at a given position is not fixed. It may be a, or b, or any one of several possibilities. Why "vertical"? Think of setting the time on a phone: each position has a vertical scroll wheel from which you pick one value. Vertical means that a single position has many possible values.
    • Horizontal fuzzy matching: the number of times a character occurs is not fixed. It may occur once or many times, with a minimum and a maximum. Why "horizontal"? Because the more often a character repeats, the longer the match grows horizontally.

Once you are proficient there is no need to remember these categories. They are spelled out here only to make learning easier, especially when memorizing the corresponding metacharacters.

Regular expression methods

Before learning the metacharacters, get familiar with the methods that work with regular expressions; they will let you try the metacharacters out.

Regular expressions are used with the exec and test methods of RegExp and with the match, matchAll, replace, search, and split methods of String.

The full list in table form:

Method     Description
exec       A RegExp method that searches a string for a match and returns an array (or null if no match is found).
test       A RegExp method that tests for a match in a string and returns true or false.
match      A String method that searches a string for a match and returns an array (or null if no match is found).
matchAll   A String method that finds all matches in a string and returns an iterator.
search     A String method that tests for a match in a string and returns the index of the match, or -1 if there is none.
replace    A String method that searches a string for matches and replaces the matched substrings with a replacement string.
split      A String method that splits a string on a regular expression or a fixed string and returns the substrings in an array.

As mentioned earlier, regular expressions help us check, extract, and replace text. The corresponding methods are grouped by these three functions below:

  • Check:
    • test: a RegExp method that returns true if a match is found and false otherwise. It is the most commonly used validation method.
      var regex = /shotcat/
      var result = regex.test('my name is shotcat')
      console.log(result)
      // => true
    • search: a String method. It returns the index of the first match, or -1 if no match is found.
      var regex = /shotcat/
      var string = "my name is shotcat";
      var result = string.search(regex)
      console.log( result );
      // => 11 (-1 if there is no match)
  • Extract:
    • exec: a RegExp method that returns an array containing the match result, or null if no match is found.
      var regex = /shotcat/;
      var string = "my name is shotcat";
      var result = regex.exec(string);
      console.log(result)
      // => ["shotcat", index: 11, input: "my name is shotcat", groups: undefined]
    • match: a String method that returns an array containing the match result, or null if no match is found.
      var regex = /shotcat/
      var string = "my name is shotcat";
      var result = string.match(regex)
      console.log( result );
      // => ["shotcat", index: 11, input: "my name is shotcat", groups: undefined]
  • Replace:
    • replace: a String method that replaces the matched substring with the supplied string.
      var regex = /shotcat/;
      var string = "my name is shotcat";
      var result = string.replace(regex, 'Eddie Peng');
      console.log(result)
      // => "my name is Eddie Peng"

Note: this is only a brief introduction to the use of these methods. The tips and pitfalls of these methods will be detailed in the section “API Usage Considerations” at the end.
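The table lists matchAll, but none of the examples above uses it. The following sketch shows one way it could be used to collect every match together with its index. The helper name findAll is an assumption made for illustration.

```javascript
// Hypothetical helper: collect every match and its index.
// matchAll requires a regex with the g flag and returns an iterator of match arrays.
function findAll(regex, text) {
  return Array.from(text.matchAll(regex), (m) => ({ text: m[0], index: m.index }));
}

const found = findAll(/cat/g, 'shotcat likes cat food');
console.log(found);
// => [ { text: 'cat', index: 4 }, { text: 'cat', index: 14 } ]
```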

Metacharacters – character sets [abc]

The regular expression /a[bcd]e/ matches three strings: abe, ace, and ade. [bcd] is called a character set and is written with square brackets [].

A character set matches a single character, which can be any one of the characters inside the square brackets. /a[bcd]e/ means that the single character between a and e must be b, c, or d.

Ranges [a-z]

If a character set contains consecutive characters, a hyphen (-) can be used to specify a range. For example, /[a-z]/ matches any lowercase letter from a to z, and [123456abcdefGHIJKLM] can be written as [1-6a-fG-M]. The hyphen works as an abbreviation for the omitted characters.

Negated character sets [^abc]

Putting ^ (caret) first in a character set negates it: the set then matches any character that is not listed in the square brackets. For example, [^abc] matches any character other than a, b, or c. Note: [^abc] and [^a-c] mean the same thing.
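The three forms of character set can be checked with a short sketch like the one below. The test strings are chosen only for illustration.

```javascript
const set = /a[bcd]e/;         // b, c, or d between a and e
const range = /^[1-6a-fG-M]$/; // exactly one character from 1-6, a-f, or G-M
const negated = /[^abc]/;      // any character other than a, b, or c

console.log(['abe', 'ace', 'ade', 'afe'].map((s) => set.test(s)));
// => [ true, true, true, false ]
console.log(range.test('f'), range.test('g'), range.test('G'));
// => true false true
console.log('abcd'.match(negated)[0]);
// => "d"
```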

Metacharacters – matching a single character

In general, a single character is matched by writing it directly. Whitespace and control characters such as tabs, carriage returns, and line feeds cannot be typed that way, so they are matched with escape sequences, as shown in the following table:

Special character      Regular expression   Mnemonic
Newline                \n                   new line
Form feed              \f                   form feed
Carriage return        \r                   return
Whitespace character   \s                   space
Tab                    \t                   tab
Vertical tab           \v                   vertical tab
Backspace              [\b]                 backspace; the [] distinguishes it from \b (word boundary)

Metacharacters – matching one of many characters

A character set such as [abc] or [0-9] matches one of several characters, but it is still not concise enough. The table below lists shorter and more efficient ways to do the same thing.

Matches                                                          Regular expression   Mnemonic
Any character except a line terminator                           .                    a period stands for anything except the end of the line
A single digit, [0-9]                                            \d                   digit
Any character that is not a digit, [^0-9]                        \D                   not digit
A single word character, including underscore, [A-Za-z0-9_]      \w                   word
Any character that is not a word character, [^A-Za-z0-9_]        \W                   not word
A whitespace character: space, tab, form feed, line feed, etc.   \s                   space
Any character that is not whitespace                             \S                   not space
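As a quick check of the shorthand classes in the table, the sketch below runs several of them against one sample string (the string itself is made up for illustration).

```javascript
const sample = 'id_42 = 3.14\n';

console.log(sample.match(/\d/g));  // every digit
// => [ '4', '2', '3', '1', '4' ]
console.log(sample.match(/\w+/g)); // runs of word characters
// => [ 'id_42', '3', '14' ]
console.log(sample.match(/\S+/g)); // runs of non-whitespace characters
// => [ 'id_42', '=', '3.14' ]
console.log(/^.+$/.test(sample));  // . does not match the trailing newline
// => false
```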

Metacharacters – quantifiers {m,n}

When matching, the same character often repeats, so quantifiers are needed to limit the number of repetitions.

The {m,n} form

{m,n} is the most common and basic form of quantifier. m and n are integers. It matches the preceding character at least m times and at most n times. The braces must not contain spaces.

/a{1,3}/ means that a occurs at least once and at most 3 times. So it matches nothing in "shotct", matches the a in "shotcat", matches both a's in "shotcaat", and matches the first three a's in "shotcaaaaaaat". Note: for "shotcaaaaaaat" the matched value is "aaa", even though the original string contains more a's.

Shorthand

For convenience, some shorthand forms are defined:

Match rule                    Metacharacter   Mnemonic
Exactly x times               {x}             only one number in the braces, so the count is fixed
At least min times            {min,}          min on the left is the lower bound; nothing on the right means no upper bound
At most max times             {0,max}         0 on the left means at least 0 times; max on the right is the upper bound
0 or 1 time                   ?               like a question: is it there or not?
0 or more times               *               like stars in the universe: none at the beginning, countless in the end
1 or more times               +               a plus sign: one, plus more
Between min and max times     {min,max}       think of a number line: min and max are the left and right ends of a closed interval
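The sketch below tries each quantifier form against strings of zero to three a's. The patterns are anchored with ^ and $ (an addition for this illustration) so that the whole string must consist of the repeated character.

```javascript
const samples = ['', 'a', 'aa', 'aaa'];

// Return the sample strings that the given anchored pattern accepts.
function accepted(regex) {
  return samples.filter((s) => regex.test(s));
}

console.log(accepted(/^a{2}$/));   // => [ 'aa' ]
console.log(accepted(/^a{2,}$/));  // => [ 'aa', 'aaa' ]
console.log(accepted(/^a{0,2}$/)); // => [ '', 'a', 'aa' ]
console.log(accepted(/^a?$/));     // => [ '', 'a' ]
console.log(accepted(/^a*$/));     // => [ '', 'a', 'aa', 'aaa' ]
console.log(accepted(/^a+$/));     // => [ 'a', 'aa', 'aaa' ]
```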

Greedy matching versus lazy matching

  • Greedy matching

    By default, quantifiers (including the shorthand forms) are greedy: they match as many characters as they can. Take /a{1,3}/ again: when it is matched against "shotcaaaat", one, two, or three a's would all satisfy it, but it greedily takes as many as possible. So it ends up matching three a's, not one.

  • Lazy matching (also called non-greedy)

    Sometimes we do not want quantifiers to be greedy; we want them to match as few times as possible. To do that, put a question mark after the quantifier. For example, when /a{2,3}?/ is matched against "shotcaaaat", it matches only 2 a's instead of greedily taking 3, because it is lazy.

Greedy quantifier   Lazy quantifier
{m,n}               {m,n}?
{m,}                {m,}?
?                   ??
+                   +?
*                   *?
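A small sketch of the difference, using the string from the text plus a made-up HTML fragment to show why the distinction matters in practice:

```javascript
const text = 'shotcaaaat';
console.log(text.match(/a{1,3}/)[0]);  // greedy => "aaa"
console.log(text.match(/a{1,3}?/)[0]); // lazy   => "a"
console.log(text.match(/a{2,3}?/)[0]); // lazy   => "aa"

const html = '<b>bold</b> and <i>italic</i>';
console.log(html.match(/<.+>/)[0]);  // greedy => the whole string
console.log(html.match(/<.+?>/)[0]); // lazy   => "<b>"
```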

Metacharacters – alternation x|y

Alternation lets us match one of several different cases. For example, to match either the string "shot" or the string "cat", use /shot|cat/, where the pipe character | separates the alternatives. The alternatives can match characters or positions.

Note: alternation is lazy

Alternation is lazy! If an earlier alternative matches, the later ones are not tried. For example, when /shot|shotcat/ is matched against "shotcat", the result is only "shot". Reverse the order to /shotcat|shot/ and the result is "shotcat".
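The effect of branch order can be confirmed directly. The third pattern is an extra illustration, not from the text above: anchoring with ^ and $ forces the engine to move on to the second branch once the first one cannot reach the end of the string.

```javascript
const firstBranchWins = 'shotcat'.match(/shot|shotcat/)[0];
const longerBranchFirst = 'shotcat'.match(/shotcat|shot/)[0];
const anchored = 'shotcat'.match(/^(?:shot|shotcat)$/)[0];

console.log(firstBranchWins);   // => "shot"
console.log(longerBranchFirst); // => "shotcat"
console.log(anchored);          // => "shotcat"
```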

Metacharacters – position matching

Regular expressions are matching patterns that match either characters or positions.

The metacharacters covered so far match characters. The following metacharacters match positions.

A position in a string is the gap between two adjacent characters, which can be thought of as the empty string "" between them. For example, the string "cat" has four positions: "1c2a3t4". Note that the positions before the first character and after the last character are included.

Word boundaries \b and non-word boundaries \B

  • \b: word boundary

    A word boundary is a position between a word character and a non-word character: between \w and \W, between ^ (the beginning) and \w, or between \w and $ (the end). The b stands for boundary. Example 1:

    var result = "[JS] Lesson_01.mp4".replace(/\b/g, '#');
    console.log(result);
    // "[#JS#] #Lesson_01#.#mp4#"

    Example 2: to match shotcat as a whole word in the string "my name is shotcat.", use \bshotcat\b. This ensures that both ends of shotcat sit between a word character and a non-word character.

  • \B: non-word boundary

    This is simply the opposite of a word boundary: every position that is not a \b. Specifically, it is a position between two word characters, between two non-word characters, or between a non-word character and the beginning or end of the string, namely between \w and \w, \W and \W, ^ (the beginning) and \W, or \W and $ (the end).

    Example 1:

    var result = "[JS] Lesson_01.mp4".replace(/\B/g, '#');
    console.log(result);
    // "#[J#S]# L#e#s#s#o#n#_#0#1.m#p#4"

String boundaries ^ and $

After word boundaries, let's move on to the boundaries of the whole string.

^ (caret) matches the beginning of the string; with the m flag it also matches the beginning of each line. $ (dollar sign) matches the end of the string; with the m flag it also matches the end of each line.

Example 1:

var result = "hello".replace(/^|$/g, '#');
console.log(result);
// "#hello#"

Example 2:

var result = "I\nlove\njavascript".replace(/^|$/gm, '#');
console.log(result);
/*
#I#
#love#
#javascript#
*/

Positions before and after a particular string

What if the position to match is defined by what comes before or after it? The following metacharacters are used:

  • Lookahead and lookbehind assertions

    • Lookahead assertion: x(?=y) matches x only if x is followed by y.
    • Example:
    var result = "orangecat".replace(/orange(?=cat)/, 'shot');
    console.log(result);
    // => "shotcat"
    • Lookbehind assertion: (?<=y)x matches x only if x is preceded by y.
    • Example:
    var result = "shotdog".replace(/(?<=shot)dog/, 'cat');
    console.log(result);
    // => "shotcat"
  • Negative lookahead and negative lookbehind

    • Negative lookahead: the opposite of the lookahead assertion. x(?!y) matches x only if x is not followed by y.
    • Example:
    var result = "orangecat".replace(/orange(?!dog)/, 'shot');
    console.log(result);
    // => "shotcat"
    • Negative lookbehind: the opposite of the lookbehind assertion. (?<!y)x matches x only if x is not preceded by y.
    • Example:
    // (?<!cat) is an illustrative choice: dog is replaced because it is not preceded by "cat"
    var result = "shotdog".replace(/(?<!cat)dog/, 'cat');
    console.log(result);
    // => "shotcat"

Summary of position matching

Finally, to sum up:

Boundary or assertion   Regular expression   Mnemonic
Word boundary           \b                   boundary
Non-word boundary       \B                   not boundary
Beginning of string     ^                    a small pointed tip just poking out: the beginning
End of string           $                    the terminator
Lookahead               x(?=y)               like the ternary operator: ? asks "is y next?"; if so, match the x in front
Lookbehind              (?<=y)x              < points backwards: look behind; if y is there, match the x after it
Negative lookahead      x(?!y)               ! means not: if y is not next, match the x in front
Negative lookbehind     (?<!y)x              < points backwards and ! means not: if y is not behind, match the x
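Position matching is easiest to appreciate when inserting text rather than replacing it. The following sketch combines \B with a lookahead to insert thousands separators. The helper formatThousands is hypothetical and assumes its input consists only of the digits 0 to 9.

```javascript
// Hypothetical helper; assumes `digits` contains only the characters 0-9.
// (?=(\d{3})+$) selects positions followed by a multiple of three digits up to the end;
// \B excludes the position at the very start of the string.
function formatThousands(digits) {
  return digits.replace(/\B(?=(\d{3})+$)/g, ',');
}

console.log(formatThousands('12345678')); // => "12,345,678"
console.log(formatThousands('123456'));   // => "123,456"
console.log(formatThousands('123'));      // => "123"
```

Nothing is consumed by the pattern: every match is an empty position, so replace only inserts commas.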

Flags

Flags are not metacharacters; they apply to the regular expression as a whole. The commonly available flags are:

Flag   Description
g      Global search. Matching does not stop after the first result; the whole string is searched and all results are returned.
i      Case-insensitive search.
m      Multi-line search. ^ and $ also match at the beginning and end of each line.
s      Allows . to match newline characters.
u      Treats the pattern as a sequence of Unicode code points.
y      Performs a "sticky" search that matches only at the current position in the target string (the lastIndex property).

The first three, g, i, and m, are the most commonly used. Flags are not part of the pattern: they are written after the closing slash of a literal, or passed as the second argument to the constructor:

var re = /\w+\s/g;

var re = new RegExp("\\w+\\s", "g");
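The effect of each flag can be seen on a small multi-line string. This is only a sketch; the sample text is made up for illustration.

```javascript
const text = 'Cat\ncat\nCAT';

console.log(text.match(/cat/g));            // g: all matches, case-sensitive
// => [ 'cat' ]
console.log(text.match(/cat/gi));           // i: ignore case
// => [ 'Cat', 'cat', 'CAT' ]
console.log(text.match(/^cat$/gim).length); // m: ^ and $ match on every line
// => 3
console.log(/Cat.cat/.test(text), /Cat.cat/s.test(text)); // s: . matches \n
// => false true

const sticky = /cat/y; // y: match only at lastIndex
sticky.lastIndex = 4;  // index of the second line's "c"
console.log(sticky.test(text));
// => true
```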

Grouping – the role of parentheses ( )

Parentheses wrap part of a regular expression so that it is treated as a whole, also known as a subexpression. This gives expressions the ability to group.

Grouping and branching structures

  • Grouping: the part of a regular expression wrapped in parentheses is a group
    • Example:
      // (ab)+ matches "ab" repeated one or more times
      var regex = /(ab)+/g;
      var string = "ababa abbb ababab";
      console.log( string.match(regex) );
      // => ["abab", "ab", "ababab"]
  • Branch structure: alternation was covered earlier; here the alternatives are placed inside parentheses, again separated by the pipe character |
    • Example:
      // Matches exactly "I love JavaScript" or "I love Regular Expression"
      var regex = /^I love (JavaScript|Regular Expression)$/;
      console.log( regex.test("I love JavaScript") );
      // => true
      console.log( regex.test("I love Regular Expression") );
      // => true

Group references

Another important use of parentheses is referencing groups: the text matched inside each pair of parentheses can be extracted or used in a replacement.

For example, to match a date in the format yyyy-mm-dd, we can write it with groups as /(\d{4})-(\d{2})-(\d{2})/. The three pairs of parentheses correspond to group 1, group 2, and group 3.

Extracting data

As described in the section on regular expression methods, two methods are used to extract data: the String match method and the RegExp exec method.

  • match:
var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( string.match(regex) );
// => ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12"]

match returns an array: the first element is the overall match, followed by the text matched by each group (each pair of parentheses). The array also carries the index of the match and the input text as properties.

  • exec:
var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
console.log( regex.exec(string) );
// => ["2017-06-12", "2017", "06", "12", index: 0, input: "2017-06-12"]
  • The groups can also be read from the static properties $1 to $9 of the RegExp constructor (a legacy feature that is deprecated, so avoid it in new code):
var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";

regex.test(string); // any regex operation will do, for example:
//regex.exec(string);
//string.match(regex);

console.log(RegExp.$1); // "2017"
console.log(RegExp.$2); // "06"
console.log(RegExp.$3); // "12"

Replacing data

Data is replaced with the String replace method.

Example: convert yyyy-mm-dd to mm/dd/yyyy

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, "$2/$3/$1");
console.log(result);
// => "06/12/2017"

// In the second argument of replace, $1 to $9 refer to the corresponding groups

This is equivalent to:

var regex = /(\d{4})-(\d{2})-(\d{2})/;
var string = "2017-06-12";
var result = string.replace(regex, function(match, year, month, day) {
	return month + "/" + day + "/" + year;
});
console.log(result);
// => "06/12/2017"

Backreferences

The group references described above are applied to the result of a match. A backreference also refers to a group, but it is used inside the regular expression itself and refers to the text a group has already captured during matching. An example makes this easier to understand:

Suppose we need a regular expression that matches any of the following three formats:

2016-06-12

2016/06/12

2016.06.12

We might think of something like this:

// [-/.] matches any one of the three separators
var regex = /\d{4}[-/.]\d{2}[-/.]\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // true


But "2016-06/12" also passes, which is obviously not what we want: the second separator should be the same as the first. This is where backreferences come in. Wrap the first [-/.] in parentheses to make it a group, ([-/.]), and then refer to that group where the second separator goes. \1 refers to the first group, and \2 and \3 refer to the second and third groups. The regular expression becomes /\d{4}([-/.])\d{2}\1\d{2}/. Then verify:

var regex = /\d{4}([-/.])\d{2}\1\d{2}/;
var string1 = "2017-06-12";
var string2 = "2017/06/12";
var string3 = "2017.06.12";
var string4 = "2016-06/12";
console.log( regex.test(string1) ); // true
console.log( regex.test(string2) ); // true
console.log( regex.test(string3) ); // true
console.log( regex.test(string4) ); // false
// The results are exactly as expected!

These are all ideal cases, but what if it’s more complicated?

Nested groups

What if groups are nested (parentheses inside parentheses)? As you know by now, regular expressions are matched from left to right, and groups are numbered the same way: simply number the groups in the order of their opening parentheses.

Example:

var regex = /^((\d)(\d(\d)))\1\2\3\4$/;
var string = "1231231233";
console.log( regex.test(string) ); // true
console.log( RegExp.$1 ); // 123, the first group
console.log( RegExp.$2 ); // 1, the second group
console.log( RegExp.$3 ); // 23, the third group
console.log( RegExp.$4 ); // 3, the fourth group

Analyzing the groups from left to right:

  • The first group, ((\d)(\d(\d))), matches three consecutive digits and contains three nested groups. It captures 123, so \1 is 123.
  • The second group, (\d), matches one digit. \2 is 1.
  • The third group, (\d(\d)), matches two digits and contains one nested group. \3 is 23.
  • The fourth group, (\d), matches one digit. \4 is 3.

What does \10 mean?

Does \10 refer to the 10th group, or to \1 followed by 0?

The answer is the 10th group (provided the regular expression has at least 10 groups), although \10 is rarely seen in practice.

var regex = /(1)(2)(3)(4)(5)(6)(7)(8)(9)(#)\10+/;
var string = "123456789#######"
console.log( regex.test(string) );
// => true

Referencing a nonexistent group

When a regular expression refers to a group that does not exist, it does not report an error (unless the u flag is used). Instead, the backreference is treated as an ordinary escaped character. For example, when there is no second group, \2 is treated as an escape of the character 2.

Note: the escaped "\2" does not mean the digit 2. It is interpreted as an octal escape, that is, the character with code 2, so it does not match "2". So be careful not to reference groups that do not exist!
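The behaviour can be observed with a short sketch. It relies on the legacy (non-Unicode) escape rules that JavaScript engines implement for web compatibility; with the u flag the same pattern is rejected.

```javascript
const legacy = /(a)\2/; // only one group exists, so \2 is not a backreference here

console.log(legacy.test('a2'));      // => false: the digit 2 is not matched
console.log(legacy.test('a\u0002')); // => true: \2 is read as the character with code 2

// In Unicode mode, a reference to a missing group is a syntax error.
let unicodeModeError = null;
try {
  new RegExp('(a)\\2', 'u');
} catch (err) {
  unicodeModeError = err;
}
console.log(unicodeModeError instanceof SyntaxError); // => true
```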

Non-capturing groups

All the groups described so far can be referenced. If a group does not need to be referenced, use a non-capturing group, (?:p). Because captured groups have to be stored in memory, non-capturing groups also avoid wasting memory.

var str = 'shotcat'
str.replace(/(shotca)(?:t)/, '$1, $2')
// => "shotca, $2"
// The non-capturing group creates no second group, so $2 is left in the result as literal text
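A sketch of the same idea seen from the match result: a non-capturing group still takes part in matching but adds no entry to the result array. The patterns below are made up for illustration.

```javascript
const capturing = /(\d{4})-(\d{2})/;
const nonCapturing = /(?:\d{4})-(\d{2})/;

// slice(1) drops the overall match and keeps only the captured groups.
console.log('2017-06'.match(capturing).slice(1));    // => [ '2017', '06' ]
console.log('2017-06'.match(nonCapturing).slice(1)); // => [ '06' ]
```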

Regex matching steps and backtracking

Matching steps

We know that a regular expression matches from left to right, but what happens at each step? Let's walk through an example:

The regular expression is /ab{1,3}bbc/ and the target string is "abbbc".

1: the re matches a against the first character of the string.

2~5: the re reaches b{1,3}, which greedily consumes one b after another until the string is at the third b. At this point b{1,3} has taken its maximum of three b's.

6: the re moves on to the first b after b{1,3}. The next character in the string is c.

7: c does not match b. The re does not report an error; it backtracks, that is, it steps back. The re returns to b{1,3} and the string goes back from the third b to the second b. Two b's still satisfy b{1,3}.

8: the re again moves on to the first b after b{1,3}, which matches the third b in the string.

9: the re moves on to the second b after b{1,3}, but the next character in the string is c, which does not match.

10: the re backtracks to b{1,3} once more, and the string goes back to the first b.

11: b{1,3} now takes only one b, which still satisfies it.

12~13: the remaining bbc in the re is matched character by character against the rest of the string, and the whole match succeeds.
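The steps above can be imitated by hand for this one pattern. The function below is not how a real engine is implemented; it is a simplified, hypothetical tracer that hard-codes the pattern a, b{1,3}, bbc, only tries the match at index 0, and records each attempt.

```javascript
// Hypothetical tracer for the fixed pattern a b{1,3} bbc, tried at index 0 only.
function traceMatch(input) {
  const trace = [];
  if (input[0] !== 'a') {
    return { matched: false, trace };
  }
  // Greedy phase: let b{1,3} take as many b's as it can, up to 3.
  let maxCount = 0;
  while (maxCount < 3 && input[1 + maxCount] === 'b') {
    maxCount++;
  }
  // Backtracking phase: give back one b at a time until the rest starts with "bbc".
  for (let count = maxCount; count >= 1; count--) {
    const rest = input.slice(1 + count);
    const ok = rest.startsWith('bbc');
    trace.push(`b{1,3} takes ${count}, rest "${rest}": ${ok ? 'match' : 'backtrack'}`);
    if (ok) {
      return { matched: true, trace };
    }
  }
  return { matched: false, trace };
}

const result = traceMatch('abbbc');
console.log(result.matched); // => true
console.log(result.trace);
// => [ 'b{1,3} takes 3, rest "c": backtrack',
//      'b{1,3} takes 2, rest "bc": backtrack',
//      'b{1,3} takes 1, rest "bbc": match' ]
```

The two "backtrack" entries correspond to steps 7 and 10 of the walkthrough.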

Backtracking

From the previous example, you already have a feel for what backtracking is. When a match attempt fails, the re goes back to an earlier point, tries another possibility, and continues matching from there. If that attempt also fails, it backtracks again, until every possibility has been tried. This way of matching is known as the backtracking method.

It is essentially a depth-first search. The step of going back to an earlier point is called backtracking, and it happens whenever the path ahead fails. Backtracking costs time and resources, so we should try to minimize it when writing regular expressions.

Common causes of backtracking

When writing a regular expression, watch out for the following constructs, which cause backtracking:

Greedy quantifiers

As the previous example shows, quantifiers cause backtracking because they are greedy by default. A greedy quantifier takes as much as it can, which may leave the rest of the re with nothing to match and force it to backtrack. In other words: it is so greedy that nothing is left for what follows.

Note: if there are several greedy quantifiers, which one gets the most? Answer: first come, first served!

Example:

var string = "12345";
var regex = /(\d{1,3})(\d{1,3})/;
console.log( string.match(regex) );
// => ["12345", "123", "45", index: 0, input: "12345"]

The first \d{1,3} matches "123" and the second \d{1,3} matches "45".

Lazy quantifiers

You might think that since greedy quantifiers lead to backtracking, lazy quantifiers should be used wherever possible.

Wrong! Lazy quantifiers can also lead to backtracking. A greedy quantifier eats too much and leaves nothing for what follows. A lazy quantifier eats too little and leaves more than what follows can handle.

To understand, look at this example:

The regular expression /^(\d{1,3}?)(\d{1,3})$/ is matched against "12345". The lazy first group initially takes only "1"; the second group can then take at most "234", so $ fails in front of "5". The re has to backtrack until the first group takes "12" and the second takes "345".
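The outcome of this example can be checked by comparing the greedy and lazy versions of the first group. This is a sketch of the result only; it does not show the intermediate backtracking steps.

```javascript
const greedyFirst = '12345'.match(/^(\d{1,3})(\d{1,3})$/);
const lazyFirst = '12345'.match(/^(\d{1,3}?)(\d{1,3})$/);

// slice(1) keeps only the two captured groups.
console.log(greedyFirst.slice(1)); // => [ '123', '45' ]
console.log(lazyFirst.slice(1));   // => [ '12', '345' ]
```

The lazy group ends up with "12" rather than "1": it was forced to take one more digit so that the rest of the pattern could reach the end of the string.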

Branching structure

As mentioned in the section on alternation, branches are also lazy, and they too can lead to backtracking. For example, in /^(?:shot|shotcat)$/ the first branch, shot, is tried first. When the re is matched against "shotcat", the shot branch matches, but then $ meets c and fails, so the re backtracks and tries the second branch, shotcat.

So how do you avoid backtracking?

We have looked at several constructs that cause backtracking. The cause is always the same: a later part of the re fails to match, so the re has to go back to an earlier step. So combine the parts of a re sensibly for the situation at hand: when a quantifier would match too much, restrict it appropriately, for example with a lazy quantifier; and when you know what the data looks like, make the pattern as specific to that data as possible. All of these effectively reduce backtracking.
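One common way of making a pattern more specific, shown here as an illustration rather than as the article's own example, is to replace a broad . with a negated character set so that the quantifier cannot run past the delimiter in the first place.

```javascript
const text = '"abc" and "def"';

// .* runs to the end of the string and then backtracks to the last quote.
console.log(text.match(/".*"/)[0]);    // => '"abc" and "def"'

// [^"]* stops at the first closing quote, so there is nothing to give back.
console.log(text.match(/"[^"]*"/)[0]); // => '"abc"'
```

Note that the two patterns are not interchangeable: they return different results on this input, so the more specific one is only a drop-in improvement when a single quoted item is what you want.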

Reading regular expressions

A regular expression is a dense string of characters and is not as easy to read as code in other languages. So when we have to read someone else's regular expressions, we need a way to work out what they mean.

PS: if a regular expression is too hard to understand, or you are unsure about it, there are tools that can help analyze it, for example https://jex.im/regulex mentioned earlier.

Structure and operator precedence

As mentioned earlier, a regular expression is composed of ordinary characters and metacharacters.

What is a structure? It is a combination of characters and metacharacters that the re treats as a single unit when matching. For example, [abc] is a structure made of the metacharacter [] and the ordinary characters abc. The re matches it as a whole, and the matched character may be any one of a, b, or c.

JavaScript regular expressions contain the following structures: literals, character groups, quantifiers, anchors, groups, branches, and backreferences.

Structure         Description
Literal           Matches one specific character, either unescaped or escaped. For example, a matches the character "a", \n matches a newline, and \. matches a period.
Character group   Matches one character out of several possibilities. For example, [0-9] matches a digit, and \d is the short form. There are also negated character groups, which match any character other than the listed ones: [^0-9] matches a non-digit character, and \D is the short form.
Quantifier        Expresses that something occurs repeatedly. For example, a{1,3} means that the character "a" occurs one to three times in a row. There are also common shorthand forms such as a+, which means that "a" occurs at least once in a row.
Anchor            Matches a position, not a character. For example, ^ matches the beginning of the string, \b matches a word boundary, and (?=\d) matches the position before a digit.
Group             Parentheses mark a whole. For example, (ab)+ means that "ab" occurs one or more times in a row. A non-capturing group, (?:ab)+, can also be used.
Branch            Chooses one of several subexpressions. For example, abc|bcd matches either "abc" or "bcd".
Backreference     Refers to the text captured by a group. For example, \2 refers to the second group.

Metacharacters in these structures are also called operators. Often these operators are combined and nested. So who’s going to be executed first? Operators have precedence. The following table:

Operator description             Operators                                  Priority
Escape character                 \                                          1
Parentheses and square brackets  (...), (?:...), (?=...), (?!...), [...]    2
Quantifiers                      {m}, {m,n}, {m,}, ?, *, +                  3
Positions and sequences          ^, $, \metacharacter, ordinary characters  4
Pipe (vertical bar)              |                                          5

In the table above, precedence decreases from top to bottom.

With that in mind, let's walk through an example step by step.

Example: /ab?(c|de*)+|fg/

1: Match the ordinary character a.

2: b? means the character b occurs 0 or 1 times.

3: The parentheses make (c|de*) a single unit.

4: Inside the parentheses, match c.

5: The pipe inside the parentheses makes c and de* two branches.

6: In the second branch, match d, then e*. The * applies only to e, so e may repeat any number of times, including zero.

7: After the closing parenthesis comes +, so (c|de*) must match at least once.

8: Finally there is another pipe. It has the lowest precedence, so ab?(c|de*)+ and fg are the two top-level branches.

Now let's look at the diagram that a visualization tool produces for this regex:

Conclusion: Read a regex from left to right and divide it into structures. Where the structure is complex or ambiguous, let operator precedence decide; structures of the same precedence are taken in the order they appear. If that still isn't enough, there is an ace in the hole: a visualization tool.
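
A quick way to confirm a reading like the one above is to run the regex against a few strings. This is only a sketch: the sample strings are invented, and the ^(?:...)$ wrapper is an addition for the demo so that the whole sample has to match rather than a substring.

```javascript
// The walkthrough regex, anchored so the entire sample must match.
const walkthrough = /^(?:ab?(c|de*)+|fg)$/;

const samples = ["ac", "abc", "abdeee", "acdc", "fg", "ab", "abfg"];

// Pair every sample with the result of the test.
const results = samples.map((s) => [s, walkthrough.test(s)]);
console.log(results);
```

"ab" fails because the group (c|de*) must match at least once, and "abfg" fails because fg is a separate top-level branch, not a continuation of the first one.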

Regular expression construction

Now that we have covered reading a regex, let's talk about building one.

When do you need to build your own regex

By now you can see that regexes are very powerful. But before building one, ask yourself a few questions:

  • Is there an API that can do that?

    In many cases, simple and common tasks are already covered by an existing API. For example, to check whether a string contains "!", you can use the indexOf method directly. To extract characters by index, the substring or substr methods are enough. Some frameworks also provide helpers for common tasks, such as the .trim modifier in Vue, which removes leading and trailing whitespace from form input.

  • Are there existing re’s available online?

    For very common validations, ready-made regexes are available online. They have been exercised by many other users, so they are usually reliable, although it is still worth testing them against your own data.

If neither of these works, it's time to start thinking about building your own regex.
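
The built-in alternatives mentioned above can be sketched as follows. The sample text is made up, and String.prototype.trim stands in here for the framework-level trim modifier, since it does the same job in plain JavaScript.

```javascript
const text = "  Hello, world!  ";

// Existence check without a regex.
const hasBang = text.indexOf("!") !== -1;

// Remove leading and trailing whitespace without a regex.
const trimmed = text.trim();

// Extract by index without a regex.
const firstWord = trimmed.substring(0, 5);

console.log(hasBang, trimmed, firstWord);
```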

Criteria for construction

When writing a regex, try to follow these principles so that it is accurate, efficient, and reliable.

  • Exact matching: it matches exactly what you want
  • Reliability
  • Readability and maintainability
  • Efficiency

Exact matching and build steps

Before you start writing a regex, you need to know exactly what you want and which characters you want to match. This sounds simple: of course I know what I want. Yet when the real data arrives, it often contains things I don't need and hadn't taken into account.

Thinking through exactly what data you want is crucial to writing a regex!

General build steps:

  • Step 1: figure out what you want. Which characters do you want to match?

  • Step 2: write down the sample string that you think is most representative.

  • Step 3: start building the regex from left to right. First, do you need to match positions? If so, where is the position of the characters to match: at a word boundary, or before or after specific characters? Or is it a plain left-to-right match? Do you need ^ and $ to anchor the beginning and end?

  • Step 4: once the position is settled, qualify the characters. There are many ways to do this, which fall into two categories: forward and reverse qualification. Qualify sensibly, so that every character you want is included and no character you don't want is matched.

  • Forward qualification

    What is forward qualification? It applies when we know exactly what we want to match: a specific string such as "abc", a definite position such as the beginning or end of the string, an explicit backreference such as \1, or an explicit set such as [0-9]. In all of these you know precisely which characters you want and write the regex for them directly.

  • Reverse qualification

    What is reverse qualification? When the range of characters to match is large, or forward qualification would need too many cases, we can use exclusion instead: anything is accepted as long as it is not a character we don't want. Regex offers more forward qualifiers than reverse ones, and it has no "greater than" or "less than" construct. The reverse qualifiers are [^abc], \D, \W, \S, \B, negative lookahead x(?!y), and negative lookbehind (?<!y)x.

  • Step 5: after qualifying the characters, qualify how many times they repeat. Note that a quantifier normally applies only to the single character or structure before it; to repeat several characters, wrap them in parentheses as a unit. Pay special attention to the backtracking that quantifiers can cause. Repeat steps 3, 4, and 5 for each further part to match.

  • Step 6: finally, choose the flags that qualify the entire regex, such as global or multi-line matching.

  • Step 7: check! Verify that the regex you wrote covers all cases, boundary conditions, and special cases. Test it with unusual input, and use a visualization tool to help, which also makes it easier to modify.
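
The steps above can be sketched on a small, hypothetical requirement: find every date written like "2017-06-27" in a piece of text. The requirement, the dateRegex name, and the sample sentence are all assumptions for illustration, and the regex checks only the shape of a date, not whether the month or day is in range.

```javascript
// Steps 1-2: the target is a date such as "2017-06-27".
// Step 3: position -> \b, so a date is not picked out of a longer digit run.
// Step 4: characters -> digits (\d) and a literal hyphen (forward qualification).
// Step 5: counts -> {4} for the year, {2} for the month and the day.
// Step 6: flag -> g, to collect every date in the text.
const dateRegex = /\b\d{4}-\d{2}-\d{2}\b/g;

// Step 7: check typical input together with input that must not match.
const sample = "From 2017-06-27 to 2017-07-01, ref 12017-06-271.";
const found = sample.match(dateRegex);
console.log(found);
```

The trailing "12017-06-271" is the kind of unwanted data that step 1 warns about: without the word boundaries, a date would be extracted from the middle of it.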

Reliability

Reliability here means that the regex runs stably without catastrophic backtracking, that is, without backtracking so much that the CPU hits 100% and normal service is blocked. A regex that backtracks too much is inefficient, and on a very long string it can degrade into catastrophic backtracking.

For a real-world case, see the article about a regular expression that caused a "bloodbath" by driving production CPU usage to 100%.

So after writing a regex, it is important to check and optimize it to make sure it is reliable and cannot backtrack catastrophically.
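
As a rough illustration of catastrophic backtracking, here is a sketch with a classic risky pattern. The two regexes, the timeIt helper, and the input length are chosen for the demo and are not from the article mentioned above. On a backtracking engine, which is what Node.js and browsers use by default, the time for the risky version grows roughly exponentially with the number of a's, so the run is kept short here on purpose.

```javascript
// Nested quantifiers: (a+)+ can split a run of a's in exponentially many ways.
const risky = /^(a+)+$/;
// Accepts the same strings without nesting, so a failure is found quickly.
const safe = /^a+$/;

// A run of a's followed by one character that forces the match to fail.
const input = "a".repeat(20) + "b";

// Hypothetical helper: run one test and report the result and elapsed time.
function timeIt(regex, str) {
  const start = Date.now();
  const matched = regex.test(str);
  return { matched, ms: Date.now() - start };
}

const riskyRun = timeIt(risky, input);
const safeRun = timeIt(safe, input);
console.log("risky:", riskyRun, "safe:", safeRun);
```

Both calls report that the string does not match; the difference is only in how much work it takes to find that out.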

Readability and maintainability

Although a regex is written for the machine, people still have to read it, so keep it as simple as possible. For example, extract the common part of several branches.

Efficiency

Sometimes a regex does the job but becomes slow on complex or long input, or under heavy use. In that case it needs to be revised and optimized.

Generally, there are three aspects to consider: narrow the scope of what is matched, reduce the memory footprint, and reduce backtracking.

  • Narrow the scope: don't use a wildcard if you know which specific character group is needed.
  • Reduce the memory footprint: as mentioned earlier, captured groups have to be stored in memory so they can be referenced. If a group is only needed to treat something as a unit, use a non-capturing group, (?:...).
  • Reduce backtracking: extract the common part of the branches to reduce their number, and use quantifiers with care. When parts of the matched data are related, a backreference can restrict the later part to the text already matched. Most importantly, get familiar with how your regex matches, and know where it will backtrack when a match fails.
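
Here is one small sketch per tip. The patterns and sample strings are invented for illustration, and real gains depend on the engine and the data.

```javascript
// 1. Narrow the scope: use [^"] instead of the wildcard.
const quoted = 'say "a" and "b"';
const wide = quoted.match(/".*"/)[0];      // greedy . runs on to the last quote
const narrow = quoted.match(/"[^"]*"/)[0]; // stops at the first closing quote

// 2. Reduce memory: (?:...) groups without storing a capture.
const capturing = "ababab".match(/(ab)+/);      // full match plus one capture
const nonCapturing = "ababab".match(/(?:ab)+/); // full match only

// 3. Reduce backtracking: factor the common prefix out of the branches.
const branched = /^(?:abc|abd)$/;
const factored = /^ab[cd]$/; // accepts the same two strings

console.log(wide, narrow, capturing.length, nonCapturing.length);
console.log(branched.test("abd"), factored.test("abd"));
```

In the first pair the narrower character group also changes the result, which shows that a wildcard can be a correctness problem as well as a performance one.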

Things to watch out for

Notes on regex-related methods

I introduced the regex-related methods at the beginning of the article, but since regex itself had not yet been formally explained, I only covered basic usage. Here are some points to watch when using them:

Argument conversion in search and match

The search and match methods convert a string argument to a regex.

var string = "2017.06.27";
console.log( string.search(".") );
// => 0
// Why 0? We meant to find the literal string ".", but search converts its argument to a regex, in which "." matches any single character except a line terminator. It therefore matches the leading "2", whose index is 0.

// Use one of the following forms instead
console.log( string.search("\\.") );  // escape the dot inside the string
console.log( string.search(/\./) );   // recommended: pass a regex literal to search
// => 4
// => 4


console.log( string.match(".") );  // match converts the string "." to a regex in the same way
// => ["2", index: 0, input: "2017.06.27"]
// Use one of the following forms instead
console.log( string.match("\\.") );
console.log( string.match(/\./) );
// => [".", index: 4, input: "2017.06.27"]
// => [".", index: 4, input: "2017.06.27"]

To avoid this pitfall, pass a regex directly instead of a string, which also saves you the extra escaping.

The format of the result returned by match

Note: The format of the result returned by match depends on whether the regex has the g flag. Here's an example:

var string = "2017.06.27";
var regex1 = /\b(\d+)\b/;   // matches a run of digits delimited by word boundaries
var regex2 = /\b(\d+)\b/g;  // the g flag requests a global search: matching continues after each result until the whole string has been scanned
console.log( string.match(regex1) );
console.log( string.match(regex2) );
// => ["2017", "2017", index: 0, input: "2017.06.27"]
// Without g, matching stops at the first match, "2017". The array holds the full match, then the text captured by the parenthesized group (also "2017"), plus the index and input properties.
// => ["2017", "06", "27"]
// With g, matching continues to the end of the string. The array holds only the full matches, with no groups, index, or input.

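So choose the flag deliberately when using match, especially if the regex contains groups: with g you get every match but lose the captured groups, index, and input, while without g you get those details for the first match only.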

exec is more powerful than match

As mentioned above, when the regex has the g flag, the array returned by match carries no index or input information. exec provides it even with g. How? By returning the results in batches, one match per call.

var string = "2017.06.27";
var regex2 = /\b(\d+)\b/g;
console.log( regex2.exec(string) );
console.log( regex2.lastIndex);
console.log( regex2.exec(string) );
console.log( regex2.lastIndex);
console.log( regex2.exec(string) );
console.log( regex2.lastIndex);
console.log( regex2.exec(string) );
console.log( regex2.lastIndex);
// => ["2017", "2017", index: 0, input: "2017.06.27"]
// => 4
// => ["06", "06", index: 5, input: "2017.06.27"]
// => 7
// => ["27", "27", index: 8, input: "2017.06.27"]
// => 10
// => null
// => 0

The example shows that each exec call resumes where the previous match ended: lastIndex holds the index at which the next search starts. When no further match is found, exec returns null and lastIndex is reset to 0.

Use exec when you need the full details of every match.
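
In practice the repeated calls are usually written as a loop. Here is a minimal sketch using the same string and regex as above; the variable names and the shape of the collected objects are my own choice.

```javascript
const dateString = "2017.06.27";
// The g flag is required: without it lastIndex never advances
// and the loop below would never end.
const partRegex = /\b(\d+)\b/g;

const parts = [];
let m;
// exec returns null after the last match, which ends the loop.
while ((m = partRegex.exec(dateString)) !== null) {
  parts.push({ text: m[1], index: m.index, next: partRegex.lastIndex });
}

console.log(parts);
```

Each entry keeps the captured group, the index of the match, and the position where the next search starts, which is exactly the information that match with g leaves out.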

The even more powerful replace

replace can be used in two ways, depending on its second argument, which can be either a string or a function.

  • When the second argument is a string, it may contain the following special patterns:
Pattern           Inserts
$1, $2, ..., $99  The text captured by groups 1 to 99
$&                The matched substring
$`                The text to the left of the matched substring
$'                The text to the right of the matched substring
$$                A literal dollar sign

Example: change "2,3,5" to "5=2+3":

var result = "2,3,5".replace(/(\d+),(\d+),(\d+)/, "$3=$1+$2");
console.log(result);
// => "5=2+3"
  • When the second argument is a function, its parameters have the following meanings:
Parameter    Value
match        The matched substring (corresponds to $& above).
$1, $2, ...  If the first argument to replace() is a RegExp object, the string matched by the nth parenthesized group. For example, if the regex is /(\a+)(\b+)/, $1 is the match for \a+ and $2 is the match for \b+.
index        The index of the matched substring within the original string. For example, if the original string is "abcd" and the matched substring is "bc", this parameter is 1.
input        The original string being matched.

"1234 2345 3456".replace(/(\d)\d{2}(\d)/g, function(match, $1, $2, index, input) {
	console.log([match, $1, $2, index, input]);
});
// => ["1234", "1", "4", 0, "1234 2345 3456"]
// => ["2345", "2", "5", 5, "1234 2345 3456"]
// => ["3456", "3", "6", 10, "1234 2345 3456"]
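
The callback in the example above only logs its parameters. Whatever the function returns is used as the replacement text, so a callback that returns nothing replaces each match with the string "undefined". Here is a small sketch that returns a value; the task of swapping the first and last digit is made up for illustration.

```javascript
// Swap the first and last digit of every four-digit group.
const swapped = "1234 2345 3456".replace(
  /(\d)(\d{2})(\d)/g,
  // The return value becomes the replacement for each match.
  (match, first, middle, last) => last + middle + first
);

console.log(swapped);
```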

Considerations when writing a regex

Matching the whole string

To match the entire string, we often put the anchors ^ and $ at the start and end of the regex. But sometimes you need to pay attention to precedence.

For example, suppose we want to match exactly "abc" or "bcd". If the regex is written as /^abc|bcd$/, the anchors bind more tightly than the pipe, so it matches any string that starts with "abc" or ends with "bcd". This is obviously not what we want, so the branches need to be protected with parentheses to form a real unit: /^(abc|bcd)$/.
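
The difference is easy to see by testing both forms. This is a sketch; the variable names and sample strings are made up.

```javascript
const unprotected = /^abc|bcd$/;       // read as (^abc) or (bcd$)
const protectedRegex = /^(abc|bcd)$/;  // the whole string is "abc" or "bcd"

const cases = ["abc", "bcd", "abcXYZ", "XYZbcd"];
const comparison = cases.map((s) => ({
  input: s,
  unprotected: unprotected.test(s),
  protected: protectedRegex.test(s),
}));

console.log(comparison);
```

Only the last two inputs differ: the unprotected version accepts them because one of its branches is anchored at just one end.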

Chaining quantifiers

Sometimes we want to use several quantifiers in a row, for example to say that the string is made of the characters a, b, and c and its length is a multiple of 3:

Writing /^[abc]{3}+$/ throws an error saying there is nothing to repeat, because what precedes the + is itself a quantifier, not a character. Parentheses are needed, so change it to /^([abc]{3})+$/.
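
Here is a sketch of both versions. The broken pattern is built from a string so that the error can be caught at run time; written as a literal, it would stop the whole script from parsing.

```javascript
let errorName = null;
try {
  // A quantifier directly after another quantifier is a syntax error.
  new RegExp("^[abc]{3}+$");
} catch (e) {
  errorName = e.name;
}

// Fixed: the group is what gets repeated.
const multipleOfThree = /^([abc]{3})+$/;

console.log(errorName);
console.log(multipleOfThree.test("abcabc"), multipleOfThree.test("abca"));
```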

Escaping metacharacters

We know that metacharacters are characters with a special meaning in a regex. But if the text we want to match itself contains these characters, we need to escape them.

In that case, most metacharacters need to be escaped one by one. For a pair of square brackets, escaping only the opening one is enough. Parentheses are different: both of them must be escaped.

var string = "[abc]";
var regex = /\[abc]/g;  // Escaping only the opening [ is enough
console.log( string.match(regex)[0] );
// => "[abc]"

var string = "(123)";
var regex = /\(123\)/g;  // Both parentheses need to be escaped
console.log( string.match(regex)[0] );
// => "(123)"

Symbols that do not need to be escaped: characters such as =, !, :, -, and the comma have no special meaning on their own in a regex. They become special only in combination with other metacharacters, so they don't need to be escaped.
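
When the text to match comes from a variable, escaping by hand is not practical. A common idiom is a small helper that puts a backslash in front of every metacharacter. The escapeRegExp function below is a hypothetical helper written for this sketch, not a built-in method.

```javascript
// Hypothetical helper: backslash-escape every character that has a
// special meaning in a regex, so the text is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The dot is now matched literally instead of as a wildcard.
const literalDot = new RegExp(escapeRegExp("."));
const dotIndex = "2017.06.27".search(literalDot);

// Brackets are matched literally instead of forming a character group.
const bracketed = new RegExp(escapeRegExp("[abc]"));

console.log(dotIndex, bracketed.test("[abc]"), bracketed.test("a"));
```

The helper escapes more than is strictly necessary, for example the closing bracket, which is harmless and keeps the rule simple.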

References

First, Lao Yao's regex tutorial really is the best and most detailed one I have found so far, and many sections of this article are based on it.

  • JS regular expressions tutorial (slightly longer)
  • Don’t memorize regular expressions