From Medium, by Jonny Fox. Compiled by Heart of the Machine.

In natural language processing, we often need to extract specific information from text or strings before doing further semantic understanding or other processing. In this article, the author introduces regular expressions, a pattern-matching notation common to many programming languages, from the basics to advanced topics.

Regular expressions (regex or regexp) are extremely useful for extracting information from text by searching for one or more matches of a specific search pattern, that is, a specific sequence of ASCII or Unicode characters. From parsing and replacing strings to preprocessing data and scraping web pages, they have a wide range of applications.

One of the most interesting things about regular expressions is that once you have learned them, you can use them in almost any programming language, including JavaScript, Python, Ruby, and Java. There are only subtle differences in the advanced features and syntax supported by each language.

Let's look at some specific examples and explanations.

Basic topics

Anchors: ^ and $

^The         matches any string that starts with "The" -> Try it! (https://regex101.com/r/cO8lqs/2)
end$         matches any string that ends with "end"
^The end$    exact match: the string starts and ends with "The end"
roar         matches any string that contains the text "roar"
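The four anchor patterns above can be tried out with a minimal Python sketch using the standard re module. The sample strings and the helper name matching are made up for illustration; re.search is used because it looks for a match anywhere in the string, so the anchors alone decide where a match may sit.

```python
import re

# Hypothetical sample strings, chosen only to exercise each pattern.
samples = ["The end", "The beginning", "This is the end", "uproar in the hall"]

def matching(pattern):
    """Return the samples in which the pattern is found somewhere."""
    return [s for s in samples if re.search(pattern, s)]

starts_with_the = matching(r"^The")       # anchored at the start
ends_with_end = matching(r"end$")         # anchored at the end
exactly_the_end = matching(r"^The end$")  # anchored at both ends
contains_roar = matching(r"roar")         # no anchors: anywhere in the string

print(starts_with_the)
print(ends_with_end)
print(exactly_the_end)
print(contains_roar)
```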

Quantifiers: *, +, ? and {}

abc*          matches a string that has "ab" followed by zero or more "c" -> Try it! (https://regex101.com/r/cO8lqs/1)
abc+          matches a string that has "ab" followed by one or more "c"
abc?          matches a string that has "ab" followed by zero or one "c"
abc{2}        matches a string that has "ab" followed by exactly two "c"
abc{2,}       matches a string that has "ab" followed by two or more "c"
abc{2,5}      matches a string that has "ab" followed by two to five "c"
a(bc)*        matches a string that has "a" followed by zero or more copies of the sequence "bc"
a(bc){2,5}    matches a string that has "a" followed by two to five copies of the sequence "bc"
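To see where each quantifier draws the line, here is a small Python sketch. It uses re.fullmatch, which requires the whole string to match; that is a choice made for this example so that "too many" or "too few" repetitions show up as a failed match. The helper whole_match and the test strings are hypothetical.

```python
import re

def whole_match(pattern, text):
    """True if the pattern matches the entire text, not just a part of it."""
    return re.fullmatch(pattern, text) is not None

cases = [
    (r"abc*", "ab"),            # zero "c" is allowed
    (r"abc+", "ab"),            # at least one "c" is required
    (r"abc?", "abcc"),          # at most one "c" is allowed
    (r"abc{2}", "abcc"),        # exactly two "c"
    (r"abc{2,}", "abccccc"),    # two or more "c"
    (r"abc{2,5}", "abcccccc"),  # six "c" is one too many
    (r"a(bc)*", "abcbc"),       # the sequence "bc" repeated twice
    (r"a(bc){2,5}", "abc"),     # only one "bc": too few
]

results = [whole_match(pattern, text) for pattern, text in cases]

for (pattern, text), ok in zip(cases, results):
    print(pattern, repr(text), ok)
```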

OR operator: | and []

a(b|c)    matches a string that has "a" followed by "b" or "c" (and captures the "b" or "c") -> Try it! (https://regex101.com/r/cO8lqs/3)
a[bc]     same as above, but without capturing the "b" or "c"
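The two forms match the same text; the difference is that parentheses also record which alternative was used. A short Python sketch (the input string is arbitrary):

```python
import re

text = "xacx"

with_group = re.search(r"a(b|c)", text)
with_class = re.search(r"a[bc]", text)

# Both patterns match the same substring...
print(with_group.group(0), with_class.group(0))

# ...but only the parenthesised form captures the alternative that matched.
print(with_group.groups())
print(with_class.groups())
```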

Character classes: \d, \w, \s and .

\d    matches a single digit character -> Try it! (https://regex101.com/r/cO8lqs/4)
\w    matches a single word character (a letter, digit, or underscore) -> Try it! (https://regex101.com/r/cO8lqs/4)
\s    matches a single whitespace character (including tabs and line breaks)
.     matches any character -> Try it! (https://regex101.com/r/cO8lqs/5)

Use the “.” operator with care, because a character class or negated character class is often faster and more precise. \d, \w, and \s also have negated counterparts, namely \D, \W, and \S. For example, \D performs the exact opposite of \d’s match:

\D    matches a single non-digit character -> Try it! (https://regex101.com/r/cO8lqs/6)
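As a rough illustration in Python, re.findall returns every single-character match of a class, which makes it easy to see what each class picks up. The sample text is made up. One detail specific to Python's defaults, not stated in the article: "." does not match a newline unless the re.DOTALL flag is set.

```python
import re

text = "id_7 costs\t42 USD"  # hypothetical sample containing a tab

digits = re.findall(r"\d", text)                # digit characters
non_digits = re.findall(r"\D", text)            # everything \d rejects
whitespace = re.findall(r"\s", text)            # spaces and the tab
word_chars = "".join(re.findall(r"\w", text))   # letters, digits, underscore

# In Python, "." skips the newline by default.
dot_matches = re.findall(r".", "a\nb")

print(digits, whitespace, word_chars, dot_matches)
```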

To be matched literally, the characters ^ . [ $ ( ) | * + ? { \ must be escaped with a backslash “\”, because they otherwise have a special meaning.

\$\d    matches a string that has a "$" followed by a single digit -> Try it! (https://regex101.com/r/cO8lqs/9)

Note that we can also match non-printable characters, such as tab “\t”, newline “\n”, and carriage return “\r”.
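A brief Python sketch of escaping, with made-up input strings. It also uses re.escape, a standard-library helper that escapes a whole literal string; the article does not mention it, so treat it as one convenient option.

```python
import re

sentence = "Total: $5 per item"

# Escaped "$" is a literal dollar sign.
price = re.search(r"\$\d", sentence)

# Unescaped "$" is the end-of-string anchor, so a digit can never follow it.
unescaped = re.search(r"$\d", sentence)

# re.escape builds a pattern that matches a literal string character by character.
literal = "1+1=2?"
raw_attempt = re.fullmatch(literal, literal)            # "+" and "?" act as quantifiers
escaped_attempt = re.fullmatch(re.escape(literal), literal)

# Non-printable characters can be matched too, here the tab.
fields = re.split(r"\t", "a\tb\tc")

print(price.group(0), unescaped, raw_attempt, escaped_attempt is not None, fields)
```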

Flags

We’ve seen how to build regular expressions, but one very basic concept is still missing: Flags.

Regular expressions usually come in the form /abc/, where the search pattern is delimited by two slashes “/”. At the end of the pattern, we can specify the following flags, alone or in combination:

  • g (global) does not return after the first match; it continues searching the rest of the text.

  • m (multi-line) makes ^ and $ match the beginning and end of each line, rather than of the whole string.

  • i (insensitive) makes the entire expression case-insensitive (for example, /aBc/i matches AbC).
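Python has no /pattern/flags literal, so the sketch below shows one way to get the same three behaviours with the standard re module: re.findall plays the role of the global flag, while re.MULTILINE and re.IGNORECASE are passed as arguments. The sample text is hypothetical.

```python
import re

text = "cat\nCat\ncAT"  # three lines, made up for this example

# Without "global" behaviour, re.search stops at the first match.
first_only = re.search(r"cat", text, flags=re.IGNORECASE).group(0)

# re.findall keeps going through the rest of the text (like the g flag).
all_matches = re.findall(r"cat", text, flags=re.IGNORECASE)

# With MULTILINE, ^ and $ anchor to each line (like the m flag).
per_line = re.findall(r"^cat$", text, flags=re.IGNORECASE | re.MULTILINE)

# Without it, ^ and $ refer to the whole string, so nothing matches here.
whole_string = re.findall(r"^cat$", text, flags=re.IGNORECASE)

# IGNORECASE corresponds to the i flag.
insensitive = re.search(r"aBc", "AbC", flags=re.IGNORECASE) is not None

print(first_only, all_matches, per_line, whole_string, insensitive)
```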

Intermediate topics

Grouping and capturing: ()

a(bc)          parentheses create a capturing group with the value "bc" -> Try it! (https://regex101.com/r/cO8lqs/11)
a(?:bc)*       using "?:" disables capturing for the group -> Try it! (https://regex101.com/r/cO8lqs/12)
a(?<foo>bc)    using "?<foo>" gives the group a name -> Try it! (https://regex101.com/r/cO8lqs/17)

Capturing parentheses () and non-capturing parentheses (?:) are very useful for extracting information from strings or data in a programming language such as Python. Matches captured by multiple groups are exposed as a classic array: we can access their values by index on the match result.

If we give the groups names (using (?<foo>…)), we can retrieve their values from the match result like a dictionary, where the keys are the group names.
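Here is how index-based and name-based access might look in Python. Note that Python's re module spells named groups (?P<name>…) rather than (?<name>…); the date string and the group names year and month are invented for the example.

```python
import re

sentence = "Released on 2019-05-17."

# Numbered groups: accessed by index, with group(0) being the whole match.
m = re.search(r"(\d{4})-(\d{2})-(\d{2})", sentence)
whole = m.group(0)
by_index = (m.group(1), m.group(2), m.group(3))

# Named groups: accessed like dictionary keys (Python syntax is ?P<name>).
named = re.search(r"(?P<year>\d{4})-(?P<month>\d{2})", sentence)
year = named.group("year")
as_dict = named.groupdict()

# Non-capturing group: it still groups for the quantifier, but stores nothing.
non_capturing = re.search(r"a(?:bc)*", "abcbcd")

print(whole, by_index, year, as_dict)
print(non_capturing.group(0), non_capturing.groups())
```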

Bracket expressions: []

[abc]          matches a string that has an "a", a "b", or a "c"; same as a|b|c -> Try it! (https://regex101.com/r/cO8lqs/7)
[a-c]          same as above
[a-fA-F0-9]    matches a single hexadecimal digit, case-insensitively -> Try it! (https://regex101.com/r/cO8lqs/22)
[0-9]%         matches a character from 0 to 9 followed by a "%" sign
[^a-zA-Z]      matches a character that is not a letter from a to z or from A to Z; here ^ negates the expression -> Try it! (https://regex101.com/r/cO8lqs/10)

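Remember that most special characters lose their special meaning inside square brackets. Whether this includes the backslash \ depends on the regex flavor: it does in POSIX bracket expressions, whereas flavors such as Python and JavaScript still treat the backslash as an escape character there.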
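The bracket expressions above, sketched in Python with arbitrary sample strings. The last line illustrates behaviour specific to Python's re module: inside brackets the dot is literal, while a backslash sequence such as \d keeps working.

```python
import re

letters_abc = re.findall(r"[abc]", "a big cab")
letters_range = re.findall(r"[a-c]", "a big cab")   # same set, written as a range

is_hex = re.fullmatch(r"[a-fA-F0-9]+", "1aF9") is not None
not_hex = re.fullmatch(r"[a-fA-F0-9]+", "1g") is not None

percentages = re.findall(r"[0-9]%", "up 5% then 12%")  # one digit plus "%"
non_letters = re.findall(r"[^a-zA-Z]", "a1 b!")         # ^ negates the set

# In Python: "." is a plain dot inside brackets, "\d" is still "any digit".
digits_and_dots = re.findall(r"[\d.]", "v1.2a")

print(letters_abc, percentages, non_letters, digits_and_dots)
```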

Greedy and Lazy match

Quantifiers (* + {}) are greedy operators: they expand the match as far as they can through the given text. For example, <.+> matches “<div>simple div</div>” in the text “This is a <div>simple div</div> test”. To capture only the div tags, we can add a “?” to make the quantifier lazy:

<.+?>    matches any character one or more times between "<" and ">", expanding only as needed -> Try it! (https://regex101.com/r/cO8lqs/24)

Note that a better solution avoids “.” altogether in favor of a stricter regular expression:

<[^<>]+>    matches any character except "<" or ">" one or more times between "<" and ">" -> Try it! (https://regex101.com/r/cO8lqs/23)
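The three variants can be compared side by side on the article's own example sentence. This is a minimal Python sketch using re.findall:

```python
import re

text = "This is a <div>simple div</div> test"

greedy = re.findall(r"<.+>", text)       # runs on to the last ">"
lazy = re.findall(r"<.+?>", text)        # stops at the first ">" it can
strict = re.findall(r"<[^<>]+>", text)   # cannot cross a "<" or ">" at all

print(greedy)
print(lazy)
print(strict)
```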

Advanced topics

Boundaries: \b and \B

\babc\b    performs a "whole words only" search -> Try it! (https://regex101.com/r/cO8lqs/25)

\b is an anchor, like the caret ^ and the dollar sign $: it matches a position where one side is a word character (as matched by \w) and the other side is not (for example, the start of the string or a space).

Its negation, the non-word boundary \B, matches every position where \b does not. It is useful when we want to find a search pattern that is completely surrounded by word characters.

\Babc\B    matches only if the pattern is completely surrounded by word characters -> Try it! (https://regex101.com/r/cO8lqs/26)
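A small Python sketch of both boundaries, reporting the position of each match so it is clear which occurrence of "abc" was picked. The sample text is made up.

```python
import re

# Offsets: "abc" at 0, "abcd" at 4, "xabcx" at 9, "abc" at 15.
text = "abc abcd xabcx abc."

# \b on both sides: "abc" must stand alone as a word.
whole_words = [m.span() for m in re.finditer(r"\babc\b", text)]

# \B on both sides: "abc" must have word characters directly around it.
inside_words = [m.span() for m in re.finditer(r"\Babc\B", text)]

print(whole_words)
print(inside_words)
```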

Look-ahead and look-behind: (?=) and (?<=)

d(?=r)     matches a "d" only if it is followed by "r", but the "r" is not part of the overall match -> Try it! (https://regex101.com/r/cO8lqs/18)
(?<=r)d    matches a "d" only if it is preceded by "r", but the "r" is not part of the overall match -> Try it! (https://regex101.com/r/cO8lqs/19)

We can also use negative operators:

d(?!r)     matches a "d" only if it is not followed by "r", but the "r" is not part of the overall match -> Try it! (https://regex101.com/r/cO8lqs/20)
(?<!r)d    matches a "d" only if it is not preceded by "r", but the "r" is not part of the overall match -> Try it! (https://regex101.com/r/cO8lqs/21)
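The four look-around forms can be checked against one short string in Python. The sketch records the start offset of every matched "d"; the test string and the helper name starts are hypothetical.

```python
import re

# Offsets: d=0 r=1 ' '=2 r=3 d=4 ' '=5 d=6 d=7
text = "dr rd dd"

def starts(pattern):
    """Start offsets of every match of the pattern in the sample text."""
    return [m.start() for m in re.finditer(pattern, text)]

followed_by_r = starts(r"d(?=r)")        # positive look-ahead
preceded_by_r = starts(r"(?<=r)d")       # positive look-behind
not_followed_by_r = starts(r"d(?!r)")    # negative look-ahead
not_preceded_by_r = starts(r"(?<!r)d")   # negative look-behind

# The "r" is only checked, never consumed: the match itself is just "d".
matched_text = re.search(r"d(?=r)", text).group(0)

print(followed_by_r, preceded_by_r, not_followed_by_r, not_preceded_by_r)
```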

Conclusion

As mentioned above, regular expressions can be used in a wide variety of fields, and most likely you have encountered them in your development process. Here are some common areas:

  • Data validation, such as checking that a time string is well-formed;

  • Data scraping, such as finding web pages that contain specific text, possibly in a specific order;

  • Data wrangling, converting data from a raw format to another format;

  • String parsing, such as capturing all the GET parameters of a URL, or capturing the text inside a set of parentheses;

  • String replacement, such as replacing one character in a string with another.
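Three of the use cases listed above, sketched in Python. Everything here is illustrative: the 24-hour "HH:MM" time format, the names TIME_RE, is_valid_time, and get_params, and the example URL are assumptions made for this sketch, not something the article prescribes. For real URL handling, a dedicated parser such as the standard urllib.parse module is usually the safer choice.

```python
import re

# Data validation: a 24-hour "HH:MM" time string (format assumed here).
TIME_RE = re.compile(r"(?:[01]\d|2[0-3]):[0-5]\d")

def is_valid_time(value):
    """True if the whole string is a well-formed HH:MM time."""
    return TIME_RE.fullmatch(value) is not None

# String parsing: collect key=value pairs from the query part of a URL.
def get_params(url):
    """Return the GET parameters of the URL as a dictionary."""
    query = url.partition("?")[2]
    return dict(re.findall(r"([^&=#]+)=([^&#]*)", query))

# String replacement: swap every ";" for ",".
replaced = re.sub(r";", ",", "a;b;c")

print(is_valid_time("23:59"), is_valid_time("24:00"))
print(get_params("https://example.com/search?q=regex&page=2"))
print(replaced)
```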

Original link: medium.com/factory-min…