This is the 21st day of my participation in the August More Text Challenge.
- Regular expressions are matching patterns that match either characters or positions. Please remember this sentence.
1. Two kinds of fuzzy matching
A regular expression that can only match exactly is of little use. For example, /hello/ matches only the substring “hello” in a string.
var regex = /hello/;
console.log( regex.test("hello") );
// => true
- Regular expressions are powerful because they enable fuzzy matching.
- Fuzzy matching is “fuzzy” in two directions: horizontal and vertical.
1.1 Horizontal fuzzy matching
Horizontal fuzziness means that the length of the string a regular expression matches is not fixed; it can take several values.
This is done with quantifiers. For example, {m,n} means the preceding item occurs at least m times and at most n times.
For example, /ab{2,5}c/ matches a string whose first character is “a”, followed by two to five “b” characters, and finally the character “c”. The test is as follows:
var regex = /ab{2,5}c/g;
var string = "abc abbc abbbc abbbbc abbbbbc abbbbbbc";
console.log( string.match(regex) );
// => ["abbc", "abbbc", "abbbbc", "abbbbbc"]
1.2 Vertical fuzzy matching
Vertical fuzziness means that, at a given position in the matched string, the character is not fixed; it can be one of several possibilities.
This is done with character groups. For example, [abc] means the character can be any one of “a”, “b”, or “c”.
For example, /a[123]b/ can match the following three strings: “a1b”, “a2b”, and “a3b”. The test is as follows:
var regex = /a[123]b/g;
var string = "a0b a1b a2b a3b a4b";
console.log( string.match(regex) );
// => ["a1b", "a2b", "a3b"]
That is the main content of this chapter. Once you master horizontal and vertical fuzzy matching, you can solve a large share of regular expression matching problems.
The rest of the chapter expands on these ideas. If you are already familiar with them, you can skip to the case section of this chapter.
2. Character groups
It is important to note that a character group (character class) stands for only one character. For example, [abc] matches a single character, which can be one of “a”, “b”, or “c”.
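As a quick check that a character group consumes exactly one character, here is a minimal sketch. The variable names are only illustrative, and the anchors ^ and $ are used here just to force the whole string to be tested.

```javascript
// [abc] stands for exactly one character: "a", "b", or "c".
var single = /^[abc]$/;

console.log( single.test("a") );
// => true
console.log( single.test("ab") );
// => false, two characters do not fit into one group

// With the g flag, every match is a single character.
var pieces = "cab".match(/[abc]/g);
console.log( pieces );
// => ["c", "a", "b"]
```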
2.1 Range representation
What if there are too many characters in a character group? Range notation can be used.
For example, [123456abcdefGHIJKLM] can be written as [1-6a-fG-M]. The hyphen abbreviates a run of consecutive characters.
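One way to convince yourself that the long form and the range form are equivalent is to compare them on a handful of characters. This is only a sketch: the sample list is an arbitrary choice, not an exhaustive test.

```javascript
// Both character groups should accept exactly the same characters.
var longForm = /^[123456abcdefGHIJKLM]$/;
var rangeForm = /^[1-6a-fG-M]$/;

// A few characters inside and just outside each range.
var samples = ["1", "6", "7", "a", "f", "g", "G", "M", "N"];
var agree = samples.every(function (ch) {
  return longForm.test(ch) === rangeForm.test(ch);
});

console.log( agree );
// => true
console.log( rangeForm.test("g") );
// => false, ranges are case-sensitive and lowercase "g" is outside a-f
```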
Because the hyphen has a special meaning, how do you match any one of the characters “a”, “-”, or “z”?
It cannot be written as [a-z], because that represents any lowercase letter.
It can be written as [-az], [az-], or [a\-z]: put the hyphen at the beginning, put it at the end, or escape it. In each case the engine no longer reads it as range notation.
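The three placements of a literal hyphen can be compared side by side. The test string below is an arbitrary example chosen so that a real range behaves visibly differently.

```javascript
// Three ways to include a literal hyphen in a character group.
var atStart = /[-az]/g;
var atEnd = /[az-]/g;
var escaped = /[a\-z]/g;

var text = "a-z m";
console.log( text.match(atStart) );
// => ["a", "-", "z"]
console.log( text.match(atEnd) );
// => ["a", "-", "z"]
console.log( text.match(escaped) );
// => ["a", "-", "z"]

// For contrast, [a-z] is a range: it matches "m" but not "-".
var asRange = text.match(/[a-z]/g);
console.log( asRange );
// => ["a", "z", "m"]
```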
2.2 Negated character groups
Another case of vertical fuzzy matching is a character that can be anything except “a”, “b”, or “c”.
This is where negated character groups come in. For example, [^abc] matches any one character other than “a”, “b”, and “c”. A caret (^) in the first position of the character group expresses negation.
Negated groups support range notation as well, for example [^a-z].
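A short sketch of negation, first with listed characters and then with a range. The input strings are made up for illustration.

```javascript
// [^abc] matches any single character other than "a", "b", or "c".
var notAbc = "abcde".match(/[^abc]/g);
console.log( notAbc );
// => ["d", "e"]

// Negation combined with a range: anything that is not a lowercase letter.
var notLower = "a1B-c".match(/[^a-z]/g);
console.log( notLower );
// => ["1", "B", "-"]
```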
2.3 Common abbreviations
With the concept of character groups in hand, some common symbols become easy to understand, because they are all built-in shorthand forms.
\d is [0-9]. Represents a digit. How to remember: Digit.
\D is [^0-9]. Represents any character except a number.
\w is [0-9a-zA-Z_]. Represents digits, uppercase and lowercase letters, and the underscore. How to remember: w is short for word, so these are also called word characters.
\W is [^0-9a-zA-Z_]. Non-word characters.
\s is [ \t\v\n\r\f]. Represents whitespace, including spaces, horizontal tabs, vertical tabs, line feeds, carriage returns, and form feeds. How to remember: s is the first letter of space.
\S is [^ \t\v\n\r\f]. Non-whitespace character.
. is [^\n\r\u2028\u2029]. Wildcard, representing almost any character. Line feed, carriage return, line separator, and paragraph separator are excluded. How to remember it: think of an ellipsis (…), where each dot is a placeholder for something.
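The shorthand forms can be tried on a small string. The sample below is an arbitrary mix of a letter, a digit, an underscore, a space, a tab, and a hyphen, chosen only to show which shorthand picks up which character.

```javascript
// Six characters: "a", "1", "_", space, tab, "-".
var sample = "a1_ \t-";

var digits = sample.match(/\d/g);
console.log( digits );
// => ["1"]

var wordChars = sample.match(/\w/g);
console.log( wordChars );
// => ["a", "1", "_"]

var nonWordChars = sample.match(/\W/g);
console.log( nonWordChars );
// => [" ", "\t", "-"]

var spaces = sample.match(/\s/g);
console.log( spaces );
// => [" ", "\t"]

var nonSpaces = sample.match(/\S/g);
console.log( nonSpaces );
// => ["a", "1", "_", "-"]
```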
What if you want to match any character at all? Use any of [\d\D], [\w\W], [\s\S], or [^].
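The difference from the dot shows up as soon as the input contains a line feed. A minimal sketch, using a made-up two-line string:

```javascript
// Three characters: "a", a line feed, "b".
var twoLines = "a\nb";

// The dot skips the line feed.
var dotMatches = twoLines.match(/./g);
console.log( dotMatches );
// => ["a", "b"]

// Each of these groups matches every character, including the line feed.
var counts = [/[\d\D]/g, /[\w\W]/g, /[\s\S]/g, /[^]/g].map(function (re) {
  return twoLines.match(re).length;
});
console.log( counts );
// => [3, 3, 3, 3]
```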