I usually just look regular expressions up in a search engine whenever I need one, because they never seemed to follow any pattern I could memorize. Recently, though, I reviewed the regular expression section of ES6 and went through them again. This time, a lot of it turned out to be something I could understand and remember as I read. Let's take a look at how to understand and remember these things:
Start with characters
When we learn a body of knowledge systematically, we should start from its basic structure. The basic elements of a regular expression can be divided into characters and metacharacters. Characters are easy to understand: they are ordinary character codes, usually the digits and English letters used in a pattern. Metacharacters, also known as special characters, carry special semantics. For example, ^ marks the start of a string (or, inside a set, negation) and | means "or". These metacharacters are what make it possible to build powerful patterns. Let's start with these basic units and learn how to build regular expressions.
A single character
The simplest regular expressions consist of plain digits and letters. They have no special semantics and correspond one-to-one with the text. If you want to find the character 'a' in the word 'apple', just use the regular expression /a/.
But if we want to match special characters, we need our first metacharacter, the escape character \, which, as the name implies, makes the character that follows it lose its original meaning. Here's an example:
I want to match the symbol *, which is itself a special character, so I put the escape metacharacter in front of it to make it lose its special meaning:
/\*/
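As a minimal sketch of why the backslash matters, the snippet below compares the escaped pattern with an unescaped one. The variable names (`star`, `price`, `unescapedIsValid`) are made up for illustration.

```javascript
// Escaped: \* matches a literal asterisk.
const star = /\*/;
const price = "3 * 4 = 12";

console.log(star.test(price));    // true
console.log(price.search(star));  // 2 (index of the asterisk)
console.log(star.test("3 + 4"));  // false

// Unescaped: * is a quantifier with nothing to repeat, so the pattern is invalid.
let unescapedIsValid = true;
try {
  new RegExp("*");
} catch (err) {
  unescapedIsValid = false; // a SyntaxError was thrown
}
console.log(unescapedIsValid);    // false
```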
Conversely, if the character is not a special character, putting an escape symbol in front of it can give it a special meaning. We often need to match characters such as spaces, tabs, carriage returns, and newlines, and these are matched with escape sequences. To make them easier to remember, I compiled the following table with a mnemonic for each:
Special character | Regular expression | Mnemonic |
---|---|---|
Newline | \n | new line |
Form feed | \f | form feed |
Carriage return | \r | return |
Any whitespace character | \s | space |
Tab | \t | tab |
Vertical tab | \v | vertical tab |
Backspace | [\b] | backspace; written inside [] to avoid a clash with \b |
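A few of the escapes from the table can be tried out on a small tab-separated string. This is only an illustrative sketch; the sample text and variable names are assumptions.

```javascript
// Two lines of tab-separated text.
const text = "name\tage\nAnn\t30";

const hasTab = /\t/.test(text);        // true
const lines = text.split(/\n/);        // ["name\tage", "Ann\t30"]
const cells = lines[1].split(/\t/);    // ["Ann", "30"]

// [\b] matches a backspace character (U+0008); a bare \b is a word boundary instead.
const hasBackspace = /[\b]/.test("abc\b"); // true
const noBackspace = /[\b]/.test("abc");    // false

console.log(hasTab, lines, cells, hasBackspace, noBackspace);
```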
Multiple characters
Matching a single character is one-to-one: each character in the regular expression filters exactly one character in the text. That is obviously not enough. By introducing sets, ranges, and wildcards we can achieve one-to-many matching.
In regular expressions, sets are defined with the square brackets [ and ]. For example, /[123]/ matches any one of the characters 1, 2, or 3. So what if I want to match every digit? Writing out 0 through 9 is clearly inefficient, so the metacharacter - can be used to indicate a range: /[0-9]/ matches any digit, and /[a-z]/ matches any lowercase letter.
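The set and range patterns can be checked with a short sketch like the one below. It uses the g flag (described later in the article) only so that `match` returns every hit; the sample strings are invented.

```javascript
// A set matches any ONE of the listed characters.
const oneTwoThree = /[123]/;
const inRoom2 = oneTwoThree.test("room 2"); // true
const inRoom7 = oneTwoThree.test("room 7"); // false

// Ranges are a shorthand for listing every character in between.
const digits = "a1b22c".match(/[0-9]/g);    // ["1", "2", "2"]
const lowercase = "Hello".match(/[a-z]/g);  // ["e", "l", "l", "o"]

console.log(inRoom2, inRoom7, digits, lowercase);
```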
Even with sets and ranges, enumerating every kind of character is still tedious. So regular expressions also provide a handful of shorthand patterns that each match a whole class of characters:
Matches | Regular expression | Mnemonic |
---|---|---|
Any character except a newline | . | A period stands for everything except the end of the line |
A single digit, [0-9] | \d | digit |
Any character other than a digit, [^0-9] | \D | not digit |
A single word character, including underscore, [A-Za-z0-9_] | \w | word |
Any non-word character | \W | not word |
A whitespace character, including spaces, tabs, form feeds, and line feeds | \s | space |
A non-whitespace character | \S | not space |
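To see the shorthand classes from the table in action, here is a small sketch run against an invented sample string (`sample` and the other names are assumptions for illustration).

```javascript
const sample = "Order_42 shipped\tok";

const digitChars = sample.match(/\d/g);   // ["4", "2"]
const nonWord = sample.match(/\W/g);      // [" ", "\t"] (underscore counts as a word character)
const whitespace = sample.match(/\s/g);   // [" ", "\t"]

// The dot matches any character except a line terminator.
const dotMatchesNewline = /./.test("\n"); // false

// Replace every word character, leaving the hyphen untouched.
const masked = "a_b-c".replace(/\w/g, "x"); // "xxx-x"

console.log(digitChars, nonWord, whitespace, dotMatchesNewline, masked);
```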
Loops and repetition
We're done with one-to-one and one-to-many character matching. Next, let's see how to match several characters in a row. To do that, we simply repeat the rules we already have. Depending on the number of repetitions, we can divide this into zero or one, zero or more, one or more, and a specific number of times.
0 or 1
The metacharacter ? matches the preceding character zero times or once. Suppose you want to match both color and colour: the pattern has to match whether or not the u is there. So your regular expression should look like this: /colou?r/.
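The color/colour case can be verified with a quick sketch (the `colour` variable name is just for illustration):

```javascript
// u? means "the u may appear once or not at all".
const colour = /colou?r/;

const american = colour.test("color");     // true
const british = colour.test("colour");     // true
const doubled = colour.test("colouur");    // false: at most one u is allowed

console.log(american, british, doubled);
```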
0 or more
The metacharacter * matches the preceding character zero or more times. It is usually used for parts of a string that are optional and may repeat.
1 or more
The metacharacter + is used when you want to match the same character one or more times.
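The difference between * and + shows up when the repeated character is missing entirely. A minimal sketch, with an invented input string:

```javascript
const input = "ac abc abbbc";

// b* accepts zero b's, so "ac" is matched too.
const zeroOrMore = input.match(/ab*c/g); // ["ac", "abc", "abbbc"]

// b+ requires at least one b, so "ac" is skipped.
const oneOrMore = input.match(/ab+c/g);  // ["abc", "abbbc"]

// A common use of +: grab whole runs of digits.
const numbers = "2024-05-17".match(/\d+/g); // ["2024", "05", "17"]

console.log(zeroOrMore, oneOrMore, numbers);
```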
A specific number of times
In some cases, we need to match a specific number of repetitions, and the { and } metacharacters are used to set the exact range. If I want to match 'a' exactly 3 times, I use /a{3}/; if I want to match 'a' at least twice, I use /a{2,}/.
Here's the full syntax:
- {x}: exactly x times
- {min,max}: between min and max times
- {min,}: at least min times
- {0,max}: at most max times
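Each form of the brace syntax can be tried on a few runs of the letter a. This is a sketch with invented sample strings; the last check illustrates that, in JavaScript without the u flag, a space inside the braces turns the whole thing into literal text.

```javascript
const runs = "a aa aaa aaaa";

const exactlyThree = "aaaa".match(/a{3}/)[0];   // "aaa"
const atLeastTwo = runs.match(/a{2,}/g);        // ["aa", "aaa", "aaaa"]
const twoToThree = runs.match(/a{2,3}/g);       // ["aa", "aaa", "aaa"]
const atMostTwo = "b ab aab".match(/a{0,2}b/g); // ["b", "ab", "aab"]

// Do not put a space after the comma: {2, 3} is no longer a quantifier.
const spacedIsQuantifier = /a{2, 3}/.test("aa");    // false
const spacedIsLiteral = /a{2, 3}/.test("a{2, 3}");  // true

console.log(exactlyThree, atLeastTwo, twoToThree, atMostTwo);
console.log(spacedIsQuantifier, spacedIsLiteral);
```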
Since these metacharacters are abstract and easy to mix up, I use associations to make sure I can recall them when I need them.
Match rule | Metacharacter | Mnemonic |
---|---|---|
Zero or one | ? | Like the ? in a ternary expression: it is either there or not, 0 or 1 |
Zero or more | * | The wildcard: any amount matches, including none at all |
One or more | + | Keep adding as many as you like, but there is always at least 1 |
A specific number of times | {x}, {min,max} | An interval: the minimum on the left, the maximum on the right |
Position boundaries
Now that we've covered character matching, we need to look at matching positions. When searching through long text, we often need to restrict where a match may occur. For example, I may only want to match at the beginning or end of a word.
Word boundaries
Words are the basic units that form sentences and articles. A common scenario is finding a specific word in an article or sentence. For example:
The cat scattered his food all over the room.
I want to find cat, but /cat/ alone would match both cat and the cat inside scattered. In this case, we need the boundary expression \b, where b is the first letter of boundary. In the regex engine, \b matches the position between a character that can form a word (\w) and one that cannot (\W), as well as the start or end of the string next to a word character.
Rewrite the above example as /\bcat\b/ and it matches only the word cat.
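The cat example can be checked directly. The sketch below also tries \B, the non-word-boundary counterpart listed in the summary table; variable names are made up for illustration.

```javascript
const sentence = "The cat scattered his food all over the room.";

// Plain /cat/ also hits the "cat" inside "scattered".
const looseCount = sentence.match(/cat/g).length; // 2

// \b on both sides keeps only the standalone word.
const strict = sentence.match(/\bcat\b/g);        // ["cat"]
const strictIndex = sentence.search(/\bcat\b/);   // 4

// \B requires word characters on both sides, so it finds the one in "scattered".
const innerIndex = sentence.search(/\Bcat\B/);    // 9

console.log(looseCount, strict, strictIndex, innerIndex);
```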
String boundaries
Once words are matched, let's look at how to match the boundaries of an entire string. The metacharacter ^ matches the beginning of the string, and the metacharacter $ matches the end. Note that for text spanning several lines, we use multiline mode so that ^ and $ also match at the start and end of each line. Try matching the sentence I am xiaodai.:
I am xiaodai.
I am xiaodai.
I am xiaodai.
We can use a regular expression like /^I am xiaodai\.$/m, where m is the first letter of multiline. In addition to m, the i and g flags are commonly used. The former ignores case, and the latter finds all matches instead of stopping at the first one.
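Here is a short sketch of how the m, g, and i flags change the result on the three-line text above. The `text` variable is an assumption that the three sentences are separated by newline characters.

```javascript
const text = "I am xiaodai.\nI am xiaodai.\nI am xiaodai.";

// Without m, ^ and $ only match at the very start and end of the whole string.
const wholeString = /^I am xiaodai\.$/.test(text);          // false

// With m, each line gets its own ^ and $; g collects every match.
const firstOnly = text.match(/^I am xiaodai\.$/m).length;   // 1
const everyLine = text.match(/^I am xiaodai\.$/gm).length;  // 3

// i ignores letter case.
const caseInsensitive = /^i am XIAODAI\.$/im.test(text);    // true

console.log(wholeString, firstOnly, everyLine, caseInsensitive);
```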
Finally, to sum up:
Boundaries and flags | Regular expression | Mnemonic |
---|---|---|
Word boundary | \b | boundary |
Non-word boundary | \B | not boundary |
Beginning of string | ^ | |
End of string | $ | |
Multiline mode | m flag | multiline |
Ignore case | i flag | ignore |
Global mode | g flag | global |
This article is reprinted from the blog Portal