This is a complete introduction to regular expressions, divided into two parts. The first part covers the regular expression features that are used most often, and the second part covers some advanced features.

Many applications provide a text search function, usually through the Ctrl+F/Command+F shortcut. However, that search is very basic and can only match exactly what you type: if you enter “Hello”, you can only match “Hello”, and anything more flexible is out of reach. A regular expression is a tool for searching and matching specific text. It is often used in software development, and it can search and match text according to specific rules.

A regular expression is usually written between slashes, such as /^Hello World$/. This is a common convention, although not a universal one. In other words, the first / and the last / are not part of the regular expression; only the middle part is. This article also uses / to delimit regular expressions.

Contents

  • 1. Full-word matching
  • 2. Beginning and end
  • 3. Or
  • 4. Any-character matching
  • 5. Character classes
  • 6. Custom character classes
  • 7. Repetition
  • 8. Grouping
  • 9. Group references

1. Full-word matching

As an “advanced” text search tool, regular expressions naturally cover the basic full-word matching that ordinary software offers. To do an ordinary full-word match, simply write the string you want to match, just as in an ordinary search. However, regular expression syntax gives some symbols a special use, including \, ^, $, |, ., +, *, {, ?, (, ), and [; their meanings are explained later. To match one of these special symbols literally, escape it by putting \ in front of it.

Example:

/Hello World/ Matches Hello World
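As a minimal sketch, full-word matching and escaping can be tried with Python's standard re module. Python and the sample strings are assumptions made for illustration; Python writes the pattern as a string, without the surrounding slashes.

```python
import re

# An ordinary pattern matches the same text anywhere in the target string.
found = re.search(r"Hello World", "She said: Hello World!")
print(found.group())  # Hello World

# + is a special symbol (repetition), so 1+1 means "one or more 1s, then a 1".
unescaped = re.search(r"1+1", "1+1=2")
print(unescaped)  # None

# Escaped with \, the + is matched literally.
escaped = re.search(r"1\+1", "1+1=2")
print(escaped.group())  # 1+1

# re.escape adds the backslashes for us.
print(re.escape("1+1"))  # 1\+1
```

re.escape is handy when the text to search for comes from user input and may contain special symbols.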

2. Beginning and end

When we use full-word matching, we sometimes need to restrict where the match occurs. For example, suppose you are looking for numbers with the 130 prefix in a list of mobile phone numbers. Full-word matching would also find something like 18712341301, because it contains the substring 130, but we only want numbers that start with 130, so full-word matching is not enough. In this case, just add the special symbol ^ in front of the full-word pattern. This symbol is purely a position marker that represents the beginning of the target string; it does not match any character itself.

Example:

/^130/ matches 130 only at the beginning of the string, for example in 13012345678. 11302345678 contains 130, but not at the beginning, so the match fails.

The counterpart of ^ is $, which marks the end of the target string. It too is only a position marker.

Example:

/ed$/ matches ed only at the end of the string, as in opened or closed. edge and bedroom contain ed, but not at the end, so they fail to match.

/^Hello World$/ requires Hello World to span the string from beginning to end, so there is only one case that matches: the target string is exactly Hello World.
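The position markers can be checked with a short sketch in Python's re module (Python and the sample lists are assumptions for illustration):

```python
import re

numbers = ["13012345678", "11302345678", "18712341301"]

# ^130 matches only when 130 is at the very start of the string.
starts_with_130 = [n for n in numbers if re.search(r"^130", n)]
print(starts_with_130)  # ['13012345678']

words = ["opened", "closed", "edge", "bedroom"]

# ed$ matches only when ed is at the very end of the string.
ends_with_ed = [w for w in words if re.search(r"ed$", w)]
print(ends_with_ed)  # ['opened', 'closed']

# ^ and $ together force the pattern to cover the whole string.
exact = bool(re.search(r"^Hello World$", "Hello World"))
longer = bool(re.search(r"^Hello World$", "Hello World!"))
print(exact, longer)  # True False
```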

3. Or

Sometimes what we need to match is not a single fixed string but any member of a set. For example, to match a gender, which can be male or female, full-word matching alone is not enough. Here we can use the special symbol |, which means “or”: list all the elements of the set, separated by |. The match then has to be one of the listed alternatives.

Example:

/male|female/ matches any string that contains male or female, such as gender: male. If the target string contains neither male nor female, the match fails.

/^(male|female)$/ matches only when the target string is exactly the single word male or female. The parentheses form a group (covered in the grouping section); they are needed because | otherwise splits the entire expression, so /^male|female$/ would mean “male at the beginning” or “female at the end”. For a string such as malefemale, although it contains both male and female, with male at the beginning and female at the end, neither word spans the string from beginning to end, so the match fails.

/one|two|three|four|five/ matches the English words for the numbers 1 to 5. You can list as many alternatives as you want, and the match will be any one of them.
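A small sketch of “or” matching in Python's re module follows (Python and the sample strings are assumptions for illustration):

```python
import re

# male|female matches if either alternative occurs anywhere in the string.
has_gender = bool(re.search(r"male|female", "gender: male"))
no_gender = bool(re.search(r"male|female", "gender: unknown"))
print(has_gender, no_gender)  # True False

# Any number of alternatives can be listed.
number_word = re.compile(r"one|two|three|four|five")
found_words = number_word.findall("two plus three is five")
print(found_words)                # ['two', 'three', 'five']
print(number_word.search("six"))  # None
```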

4. Any-character matching

With |, we can list every substring we want to match, but sometimes that turns out to be a huge job. Suppose we want to match mobile phone numbers: in China, a mobile phone number is usually 11 digits long, where the first three digits are a prefix such as 130 or 131 (and the list keeps growing), followed by 8 digits. To match strings like this with the method above, you would need to list every possible phone number.

However, regular expressions have a wildcard, written with the special symbol . (the period), which stands for any single character (in most engines, any character except a newline).

Example:

/^130........$/ matches a string starting with 130 followed by any eight characters, such as 13012345678 and 130abcdefgh.
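The wildcard can be sketched in Python's re module like this (Python and the third sample string are assumptions for illustration). The pattern uses eight dots, one for each character that must follow 130.

```python
import re

# Eight dots: any eight characters may follow 130.
pattern = re.compile(r"^130........$")

digits_ok = bool(pattern.search("13012345678"))
letters_ok = bool(pattern.search("130abcdefgh"))
too_short = bool(pattern.search("1301234"))

print(digits_ok)   # True
print(letters_ok)  # True, letters are accepted too
print(too_short)   # False, only four characters follow 130
```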

5. Character classes

In fact, the example above is a little rough. It matches any phone number starting with 130, but it also matches strings in which 130 is not followed by digits, such as 130abcdefgh. If we want to match phone numbers precisely, the last eight characters must be digits. In that case . cannot be used, because it stands for any character, and we only want the ten characters 0 to 9.

Regular expressions come with some predefined “character classes”. For example, to match only the characters 0 to 9, we can use \d. Like ., it stands for a single character, but only for one of the characters 0 to 9, somewhat like /0|1|2|3|4|5|6|7|8|9/.

Example:

/^130\d\d\d\d\d\d\d\d$/ matches a string starting with 130 followed by any eight digits, such as 13012345678 or 13087654321, whereas 130abcdefgh from the previous example does not match, because its last eight characters are not digits.

In addition to \d, there are two more commonly used character classes: \w and \s. In most regex engines, \w stands for any of 63 characters: the 26 uppercase letters, the 26 lowercase letters, the 10 digits 0 to 9, and the underscore _. \s generally stands for “whitespace characters”, such as the space, the tab (\t), the newline (\n), and the form feed (\f).

It is worth noting, however, that the exact characters covered by \d, \w, and \s depend on the implementation of the regex engine! In general, though, \d means a digit, \w a word character, and \s a whitespace character.

A particular regex engine may provide other character classes as well; consult the documentation of that engine.

Along with the \d, \w, and \s character classes, regular expressions also provide their complements, written with the corresponding uppercase letter: \D for all characters that are not digits, \W for all characters that are not word characters, and \S for all characters that are not whitespace.
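The predefined classes can be explored with a sketch in Python's re module (Python and the sample string are assumptions for illustration). The last two lines show one concrete case of engines differing: Python's \w is Unicode-aware by default, and the re.ASCII flag narrows it to the 63 characters described above.

```python
import re

# \d accepts only digits, so letters after 130 are rejected.
phone = re.compile(r"^130\d\d\d\d\d\d\d\d$")
print(bool(phone.search("13012345678")))  # True
print(bool(phone.search("130abcdefgh")))  # False

sample = "user_1 at\t09:30\n"
digits = re.findall(r"\d", sample)
spaces = re.findall(r"\s", sample)
non_word = re.findall(r"\W", sample)
print(digits)    # ['1', '0', '9', '3', '0']
print(spaces)    # [' ', '\t', '\n']
print(non_word)  # [' ', '\t', ':', '\n']

# \u00e9 is the letter e with an acute accent.
unicode_word = bool(re.search(r"^\w$", "\u00e9"))
ascii_word = bool(re.search(r"^\w$", "\u00e9", re.ASCII))
print(unicode_word, ascii_word)  # True False
```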

6. Custom character classes

Although regular expressions provide three predefined character classes and their complements, sometimes that is not enough. For example, suppose we want to match a hexadecimal digit, which can be not only one of the ten digit characters 0 to 9 but also one of the six letters a to f (twelve if both cases are counted). A plain \d is not enough, and \w includes extra characters, so the match would be inaccurate.

Regular expressions provide a way to define a custom character class: enclose the desired characters in square brackets [].

Example:

/^[0123456789abcdefABCDEF]$/ matches a single hexadecimal digit.

A custom character class can be formed by listing the required characters as shown above, but when there are many characters, listing them all gets cumbersome. Inside a custom character class, a range can be defined with the - symbol, running from a start character to an end character. So the example above can be rewritten like this:

Example:

/^[0-9a-fA-F]$/ matches a single hexadecimal digit.

Isn’t that better? It is possible to specify more elaborate intervals, such as /[0-37-9]/ for the character class of 0, 1, 2, 3, 7, 8, and 9.

As with the predefined character classes, custom character classes support complements: put a ^ symbol in the first position of the class (note that here ^ no longer means the beginning of the string), and the class then stands for every character that is not listed after it.

Example:

/^[^a-z]$/ Matches a single character that is not a lowercase letter, such as digit 0, uppercase letter A, special symbol @, and so on.
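Custom classes, ranges, and complements can be sketched together in Python's re module (Python and the sample characters are assumptions for illustration):

```python
import re

# Ranges inside [] replace a long list of single characters.
hex_digit = re.compile(r"^[0-9a-fA-F]$")
hex_chars = [c for c in "09afAFgZ_" if hex_digit.search(c)]
print(hex_chars)  # ['0', '9', 'a', 'f', 'A', 'F']

# Several ranges can sit side by side in one class.
picked = re.findall(r"[0-37-9]", "0123456789")
print(picked)  # ['0', '1', '2', '3', '7', '8', '9']

# A leading ^ inside the class turns it into its complement.
not_lowercase = re.compile(r"^[^a-z]$")
others = [c for c in "a0A@z" if not_lowercase.search(c)]
print(others)  # ['0', 'A', '@']
```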

Inside a custom character class, the set of special symbols differs from the one listed in the first section: only ], -, ^, and \ have a special purpose and need to be escaped, while other symbols such as $ and | no longer need escaping (you can still escape them if you want to, but that reduces the readability of the regular expression).

Of course, these special symbols can also be left unescaped in positions where they have no special effect. For example, - forms a range, as in /[a-c]/, the character class consisting of a, b, and c, but at the beginning or end of a character class it does not form a range: /[-a]/ or /[a-]/ is a character class consisting of a and -. Likewise, ^ at the beginning of a custom character class indicates a complement, but anywhere else it has no special meaning, so there is no need to escape it. And \, as the escape character, always needs to be escaped itself.

Then there is ], which acts as the delimiter that ends the definition of a custom character class. It does not need escaping if it is placed first in the class, as in /[]a]/, which stands for ] and a, or second when the first position is ^, as in /[^]a]/, a character class that contains neither ] nor a. Note, however, that this does not work in JavaScript! In JavaScript, /[]/ always means an empty character class that can never match, and /[^]/ means a character class that matches any single character, so ] is always special in JavaScript and should always be escaped with \!

Most escape sequences that work outside a custom character class also work inside one, such as non-printable characters (newline \n and the like), octal escapes, hexadecimal escapes, and Unicode escapes.

Example:

/^[\^\]\-\\]$/ matches ^, ], -, or \.
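The escaping rules inside a class can be checked with a sketch in Python's re module (Python and the sample strings are assumptions for illustration; other engines, JavaScript in particular, treat a leading ] differently):

```python
import re

# Inside a class, ^ ] - and \ are escaped here.
special = re.compile(r"^[\^\]\-\\]$")
matched = [c for c in "^]-\\$a" if special.search(c)]
print(matched)  # ['^', ']', '-', '\\']

# $ and | need no escape inside a class.
plain = re.findall(r"[$|]", "a$b|c")
print(plain)  # ['$', '|']

# A - at the start or end of a class is a literal hyphen, not a range.
hyphen = re.findall(r"[a-]", "a-b")
print(hyphen)  # ['a', '-']

# In Python, a ] placed first in the class is taken literally.
bracket = re.findall(r"[]a]", "]ab")
print(bracket)  # [']', 'a']
```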

In some regex engines (such as .NET, XPath, and so on), custom character classes also support subtraction. For example, /[a-z-[aeiou]]/ matches all lowercase consonants, that is, the 26 letters a through z with the vowels a, e, i, o, and u removed.

Some other regex engines (Java, Ruby, etc.) support intersection in custom character classes. For example, /[a-z&&[^aeiou]]/ also matches all lowercase consonants: the letters a through z that are not a, e, i, o, or u.

7. Repetition

So far, all our regular expressions have been “static”: they match exactly what is written, at best using a character class in place of several possible characters. But look at the phone number example above, with its eight \d in a row, which is very clumsy. And what if even more digits followed?

Regular expressions provide a “repetition” feature that lets the preceding unit match multiple times. The special symbols + and * make the preceding matching unit repeat: + requires at least one occurrence, while * also allows none.

Example:

/^130\d+$/ matches a string that starts with 130 and is followed by at least one digit, for example 1301 or 130123456789123456789. Any number of digits may follow, but 130 on its own does not match.

/^130\d*$/ matches a string that starts with 130 and is followed by any number of digits, including none, for example 130, 1301, or 130123456789123456789.
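The difference between + and * can be seen in a short sketch in Python's re module (Python and the sample strings are assumptions for illustration):

```python
import re

one_or_more = re.compile(r"^130\d+$")
zero_or_more = re.compile(r"^130\d*$")

results = {}
for text in ["130", "1301", "130123456789123456789", "130abc"]:
    results[text] = (bool(one_or_more.search(text)), bool(zero_or_more.search(text)))
    print(text, *results[text])
# 130 False True
# 1301 True True
# 130123456789123456789 True True
# 130abc False False
```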

⚠ Note: repetition applies only to the smallest matching unit immediately before it. In the examples above, \d is the smallest matching unit, so it is \d that is repeated. A custom character class also counts as one smallest matching unit.

The + and * repetitions have no upper limit, so they are not suitable for matching mobile phone numbers, where the prefix is followed by exactly eight digits. For this there is a more flexible form of repetition control: {m,n} (with no spaces inside), which means at least m and at most n repetitions (both bounds are inclusive).

In particular, if m and n are equal, the form can be shortened to {m}, meaning exactly m repetitions. If there is no upper limit, n can be omitted, giving {m,}, which means at least m repetitions.

Example:

/^130\d{2,5}$/ matches a string starting with 130 followed by at least two and at most five digits, such as 13012, 130123, 1301234, 13012345.

/^130\d{2,}$/ Matches a string starting with 130 followed by at least two digits, such as 13012, 13012345, 130123456789123456789.

/^130\d{8}$/ matches a string starting with 130 followed by eight digits, for example: 13012345678.

It follows that + is actually a shorthand for {1,} and * is actually a shorthand for {0,}.
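The counted forms can be verified with a sketch in Python's re module. Python, the helper name matches, and the failing sample strings are assumptions for illustration.

```python
import re

def matches(pattern, text):
    """Return True if the pattern is found in text (hypothetical helper)."""
    return re.search(pattern, text) is not None

print(matches(r"^130\d{2,5}$", "13012"))      # True, two digits follow 130
print(matches(r"^130\d{2,5}$", "1301"))       # False, only one digit
print(matches(r"^130\d{2,5}$", "130123456"))  # False, six digits is too many
print(matches(r"^130\d{2,}$", "130123456789123456789"))  # True
print(matches(r"^130\d{8}$", "13012345678"))  # True
print(matches(r"^130\d{8}$", "1301234567"))   # False, only seven digits

# + behaves like {1,} and * behaves like {0,}
print(matches(r"^130\d{1,}$", "130"), matches(r"^130\d{0,}$", "130"))  # False True
```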

There is also a special kind of “repetition”: ?. It is not really a repetition, but it derives from one. It means that the unit occurs either zero times or once, which is shorthand for {0,1}.

Example:

/^colou?r$/ matches colour or color.

/^https?:\/\/$/ matches http:// or https://. Although / is not a special symbol, this article uses / to delimit regular expressions, so it is escaped with \ to keep it from being read as the end of the regular expression. If your regex engine does not use / as a delimiter, the escape is unnecessary, for example: |^https?://$|.
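The optional marker can be tried in Python's re module, where patterns are ordinary strings and / therefore needs no escape (Python and the sample strings are assumptions for illustration):

```python
import re

# u? makes the u optional.
spellings = re.findall(r"colou?r", "color and colour")
print(spellings)  # ['color', 'colour']

# s? makes the s optional; no \ is needed before / in a Python pattern.
scheme = re.compile(r"^https?://$")
accepted = [s for s in ["http://", "https://", "httpss://", "ftp://"] if scheme.search(s)]
print(accepted)  # ['http://', 'https://']
```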

8. Grouping

In the section on repetition, we saw that repetition applies only to the preceding smallest matching unit. What if we want more flexible repetition? Say Santa Claus is coming, HoHoHo: how do we match this HoHoHo?

Regular expressions use parentheses () for grouping; a group can be regarded as a self-contained sub-expression. Everything inside the parentheses is treated as a single matching unit, so we simply group one Ho and repeat the group.

Example:

/^(Ho){3}~$/ can match HoHoHo~.
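The effect of the group on repetition can be sketched in Python's re module (Python and the Hooo~ sample are assumptions for illustration):

```python
import re

# Without a group, {3} applies only to the o: H followed by three o's.
ungrouped_hohoho = bool(re.search(r"^Ho{3}~$", "HoHoHo~"))
ungrouped_hooo = bool(re.search(r"^Ho{3}~$", "Hooo~"))
print(ungrouped_hohoho, ungrouped_hooo)  # False True

# With a group, the whole unit Ho is repeated three times.
grouped_hohoho = bool(re.search(r"^(Ho){3}~$", "HoHoHo~"))
print(grouped_hohoho)  # True
```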

Recall the “or” operator | from earlier: it applies to the whole regular expression on either side of it. Suppose I have data in the format gender: %s, role: %s, where the gender is either male or female and the role is either administrator or visitor. To match such strings with a regular expression, a bare | is not enough; the alternatives around each | have to be wrapped in a group, one group for the gender and one for the role.

Example:

/^gender: (male|female), role: (administrator|visitor)$/ matches gender: male, role: administrator; gender: male, role: visitor; gender: female, role: administrator; or gender: female, role: visitor.
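A sketch in Python's re module shows why the groups are needed (Python, the name loose, and the role owner are assumptions for illustration):

```python
import re

record = re.compile(r"^gender: (male|female), role: (administrator|visitor)$")

valid_a = bool(record.search("gender: female, role: visitor"))
valid_b = bool(record.search("gender: male, role: administrator"))
invalid = bool(record.search("gender: male, role: owner"))
print(valid_a, valid_b, invalid)  # True True False

# Without groups, | splits the whole pattern into three alternatives, so a
# string that merely starts with "gender: male" already matches.
loose = re.compile(r"^gender: male|female, role: administrator|visitor$")
loose_hit = bool(loose.search("gender: male, role: owner"))
print(loose_hit)  # True
```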

9. Group references

By now, we can handle most cases. With groups in place, we can unlock a new skill: referencing a group.

Referencing groups is a very commonly used feature of regular expressions. In the example above, /^gender: (male|female), role: (administrator|visitor)$/, there are two groups: the first is (male|female) and the second is (administrator|visitor). When a match succeeds, the text matched by each of the two groups is available as well. This is most visible in the match results of programming languages, which are usually returned as an array (or a similar structure): element zero is the entire matched substring, the first element is the text matched by the first group, and the second element is the text matched by the second group.

So matching the string “gender: male, role: administrator” against /^gender: (male|female), role: (administrator|visitor)$/ gives the result:

[
    "gender: male, role: administrator",
    "male",
    "administrator"
]
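In Python's re module the result is a match object rather than an array, but the numbering is the same: group 0 is the whole match and groups 1 and 2 are the two parenthesized groups. This is a sketch that assumes Python; other languages expose the groups differently.

```python
import re

record = re.compile(r"^gender: (male|female), role: (administrator|visitor)$")
m = record.search("gender: male, role: administrator")

print(m.group(0))  # gender: male, role: administrator
print(m.group(1))  # male
print(m.group(2))  # administrator
print(m.groups())  # ('male', 'administrator')
```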

Groups can be used not only in the match result but also inside the regular expression itself. For example, I want to match a word wrapped in a pair of quotation marks, either single or double. If we simply use /^['"]\w+['"]$/, 'hello' and "world" match successfully, but 'hello" and "world' match too, which is obviously not what we want. Referencing a group solves the problem:

Example:

/^(['"])\w+\1$/ matches a word enclosed in single or double quotation marks and ensures that the opening and closing quotation marks are the same.

Here we put the quotation mark before \w into a group, which is numbered 1, so that \1 can later refer back to what that group matched and guarantee that the two quotation marks agree.
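The two patterns can be compared in a sketch using Python's re module (Python and the names loose and strict are assumptions for illustration):

```python
import re

loose = re.compile(r"""^['"]\w+['"]$""")    # any quote at either end
strict = re.compile(r"""^(['"])\w+\1$""")   # closing quote must equal group 1

results = {}
for text in ["'hello'", '"world"', "'hello\"", "\"world'"]:
    results[text] = (bool(loose.search(text)), bool(strict.search(text)))
    print(text, *results[text])
# 'hello' True True
# "world" True True
# 'hello" True False
# "world' True False
```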

Pay attention to octal escapes here. Octal escape notation is not very uniform across engines, but it usually consists of \ followed directly by a number, which conflicts with group references, which are also written as \ followed directly by a number. So the general advice is not to use octal escapes. Octal converts easily to hexadecimal, so when you refer to a character by its number, use hexadecimal instead of octal!

For example, the letter ‘a’ is numbered 97 in decimal, 141 in octal, and 61 in hexadecimal. So the octal escape is \141, the hexadecimal escape is \x61, and the Unicode escape is \u0061.
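The three escapes can be compared in a sketch using Python's re module. Python is an assumption here, and how a \ followed by digits is interpreted varies between engines, which is exactly the ambiguity described above.

```python
import re

# Three ways to write the letter a by its character number.
by_octal = re.findall(r"\141", "banana")
by_hex = re.findall(r"\x61", "banana")
by_unicode = re.findall(r"\u0061", "banana")
print(by_octal, by_hex, by_unicode)  # ['a', 'a', 'a'] three times

# A single digit after \ is a group reference, not an octal escape.
doubled = bool(re.search(r"^(a)\1$", "aa"))
print(doubled)  # True
```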